Tuesday, June 30, 2015

Tsay Ch8 - Multivariate Time Series Analysis and Its Applications

How markets are interrelated is important to understand the lead-lag relationship and under what circumstances they reverse or do not work. Various previous methods can be applied directly to the vector case, but some need attention.

Weak stationarity and cross-correlation matrices

k-dimensional time series $\pmb{r_t}=[r_{1t},...,r_{kt}]^T$ is weakly stationary if its first and second moments are time-invariant, $\pmb{\mu}=E(\pmb{r}_t)$ and $\pmb{\Gamma}_0=E[(\pmb{r}_t-\pmb{\mu})(\pmb{r}_t-\pmb{\mu})^T]$. Let $\pmb{D}$ be a $k\times k$ diagonal matrix consisting of the standard deviations of $r_{it}$. The lag-zero, cross-correlation matrix is defined as $\pmb{\rho}_0=\pmb{D}^{-1}\pmb{\Gamma}_0\pmb{D}^{-1}$, which is the correlation matrix. The lag-$l$ cross-covariance matrix $\pmb{r}_t$ is defined as $\pmb{\Gamma}_l=E[(\pmb{r}_t-\pmb{\mu})(\pmb{r}_{t-l}-\pmb{\mu})^T]$. For a weakly stationary series, the cross-covariance matrix $\pmb{\Gamma}_l$ is a function of $l$, not the time index $t$. The lag-$l$ cross-correlation matrix (CCM) is defined as $\pmb{\rho}_l=\pmb{D}^{-1}\pmb{\Gamma}_l\pmb{D}^{-1}$. To understand this better, notice
which is the correlation between $r_{it}$ and $r_{j,t-l}$. If $\rho_{ij}(l)\ne 0$ and $l>0$, we say that the series $r_{jt}$ leads the series $r_{it}$ at lag $l$. Similarly, if $\rho_{ji}(l)\ne 0$ and $l>0$ we say that the series $r_{it}$ leads the series $r_{jt}$ at lag $l$. The diagonal element $\rho_{ii}(l)$ is simply the lag-$l$ autocorrelation coefficient of $r_{it}$.

Some remarks are ($l>0$):
1) In general $\rho_{ij}(l) \ne \rho_{ji}(l)$ for $i\ne j$, because they are measuring two different lag relationships, implying $\pmb{\Gamma}_l$ and $\pmb{\rho}_l$ are in general not symmetric.
2) It is easy to see that $\pmb{\Gamma}_l=\pmb{\Gamma}_{-l}^T$ and $\pmb{\rho}_l=\pmb{\rho}_{-l}^T$. Hence it suffices in practice to consider the cross-correlation matrices $\pmb{\rho}_l$ for $l \ge 0$.

For the cross-correlation matrices {$\pmb{\rho}_l|l=0,1,...$}, the diagonal elements {$\rho_{ii}(l)|l=0,1,...$} are the autocorrelation function of $r_{it}$, the off-diagonal element $\rho_{ij}(0)$ measures the concurrent linear relationship between $r_{it}$ and $r_{jt}$, and for $l>0$ the off-diagonal element $\rho_{ij}(l)$ measures the linear dependence of $r_{it}$ on past value of $r_{j,t-l}$. Depending on the value in these matrices one and identify -
1) no linear relationship ($\rho_{ij}(l)=\rho_{ji}(l)=0, \forall l\ge0$),
2) concurrent correlation ($\rho_{ij}(0)\ne0$),
3) no lead-lag relationship ($\rho_{ij}(l)=\rho_{ji}(l)=0, \forall l>0$),
4) unidirectional relationship ($\rho_{ij}(l)=0,\forall l>0,\rho_{ji}(v)$ for some $v\ge0$), or
5) feedback relationship ($\rho_{ij}(l)\ne0$, for some $l>0,\rho_{ji}(v)\ne0$ for some $v\ge0$) .

Sample Cross-correlation matrices can be estimate using $\hat{\pmb{\rho}}_l=\hat{\pmb{D}}^{-1}\hat{\pmb{\Gamma}}_l\hat{\pmb{D}}^{-1}$, where
$$\hat{\Gamma}_l=\frac{1}{T}\sum_{t=l+1}^{T}(\pmb{r}_t-\bar{\pmb{r}})(\pmb{r}_{t-l}-\bar{\pmb{r}})^T, \qquad l\ge 0.$$ Bootstrapping can be used to get confidence intervals on finite samples.

Multivariate  Portmanteau tests: or Multivariate Ljung-Box test with statistic $Q(m)$ have the null hypothesis $H_0:\pmb{\rho}_1=...=\pmb{\rho}_m=\pmb{0}$, and $H_a: \pmb{\rho}_i\ne0$ for some $i \epsilon {1,...,m}$. The test statistic assumes the form
$$Q_k(m)=T^2\sum_{l=1}^m\frac{1}{T-l} tr(\hat{\Gamma}^T_l\hat{\pmb{\Gamma}}^{-1}_0\hat{\pmb{\Gamma}}_l
\hat{\pmb{\Gamma}}^{-1}_0)$$ and under some regularity conditions follows a chi-squared distribution with $k^2m$ degrees of freedom, asymptotically.

Vector autoregressive models (VAR)

A multivariate time series ${\pmb{r}_t}$ is a VAR process of order 1, or VAR(1) for short, if it follows the model $$\pmb{r}_t=\pmb{\phi}_0+\pmb{\Phi}\pmb{r}_{t-1}+\pmb{a}_t,$$ where $\pmb{\phi}_0$ is a k-dimensional vector, $\pmb{\Phi}$ is a $k\times k$ matrix, and ${\pmb{a}_t}$ is a sequence of uncorrelated random vectors with mean zero and covariance matrix $\pmb{\Sigma}$, which is positive definite (generally assumed to be multivariate normal).

Positive definite matrix is a symmetric matrix with all eigenvalues positive. Also for any vector $\pmb{b}$, we have $\pmb{b}^T\pmb{Ab}>0$. These types of matrices can be decomposed as $\pmb{A}=\pmb{P\Lambda P}^T$, where $\pmb{\Lambda}$ is a diagonal matrix consisting of eigenvalues of $\pmb{A}$ and $\pmb{P}$ is a square matrix consisting of eigenvectors of $\pmb{A}$. These eigenvectors are orthogonal to each other. Matrix $\pmb{P}$ is orthogonal and this decomposition is referred to as spectral decomposition. For a symmetric matrix $\pmb{A}$, there exists a lower triangular matrix $\pmb{L}$ with diagonal elements being 1 and a diagonal matrix $\pmb{G}$ such that $\pmb{A}=\pmb{LGL}^T$. If $\pmb{A}$ is positive definite, then the diagonal elements of $\pmb{G}$ are positive. In this case, we have $\pmb{A=L\sqrt{G}\sqrt{G}L}^T=\pmb{MM}^T$, where $\pmb{M=L\sqrt{G}}$ is a lower triangular matrix. This is called Cholesky decomposition. Notice that this implies $\pmb{L}^{-1}\pmb{A}(\pmb{L}^{-1})^{T}=\pmb{G}$.

Reduced and Structural form: In general the off diagonal elements of matrix $\pmb{\Sigma}$ show the concurrent relationship between $r_{1t}$ and $r_{2t}$, while the matrix $\pmb{\Phi}$ measures the dynamic dependence or $\pmb{r}_t$. This is called reduced-form model because it does not show explicitly the concurrent dependence between the component series. The explicit-form expression of concurrent relationship (for the last series and hence any series by rearrangement) can be deduced by a simple linear transformation. Using Cholesky decomposition (possible because $\pmb{\Sigma}$ is positive definite symmetric matrix) we can find a lower triangle matrix $\pmb{L}$ with unit diagonal elements such that $\pmb{\Sigma=LGL}^T$, where $\pmb{G}$ is a diagonal matrix. If we define $\pmb{b}_t = \pmb{L}^{-1}\pmb{a}_t$, then $E(\pmb{b}_t)=\pmb{L}^{-1}E(\pmb{a_t})=\pmb{0}$ and $Cov(\pmb{b}_t)=E(\pmb{b}_t\pmb{b}_t^T)=\pmb{L}^{-1}\pmb{\Sigma}(\pmb{L}^T)^{-1}=\pmb{G}$. Since $\pmb{G}$ is a diagonal matrix the components of $\pmb{b}_t$ are uncorrelated.

Pre-multiplying the reduced-form with $\pmb{L}^{-1}$, to uncouple the equations, we get
The last row of $\pmb{L}^{-1}$ has 1 as the last element, let it be $(w_{k1},w_{k2},..,w_{k,k-1},1)$, and hence the structural equation for the last ($k^{th}$) time series becomes:
$$r_{kt}+\sum_{i=1}^{k-1}w_{ki}r_{it}=\phi_{k,0}^*+\sum_{i=1}^k\Phi_{ki}^* r_{i,t-1}+b_{kt}.$$
This is possible because $\pmb{b}_t$ is a diagonal matrix and uncoupled. Reduced-form is commonly used for two reasons - ease in estimation, and concurrent correlations cannot be used in forecasting.

Stationarity condition and moments of a VAR(1) model: All eigenvalues of $\pmb{\Phi}$ should be less than 1 in modulus for weak stationarity for $\pmb{r}_t$, provided the covariance matrix of $\pmb{a}_t$ exists. Further we have $\pmb{\Gamma}_l=\pmb{\Phi \Gamma}_{l-1}$, for $l>0$, where $\pmb{\Gamma}_j$ is the lag-j cross-covariance matrix of $\pmb{r}_t$. By repeated substitutions we get $\pmb{\Gamma}_l=\pmb{\Phi}^l \pmb{\Gamma}_0$. Further for $\pmb{\Upsilon}=\pmb{D}^{-1/2}\pmb{\Phi}\pmb{D}^{1/2}$, we get $\pmb{\rho}_l=\pmb{\Upsilon}^l\pmb{\rho}_0$. A VAR(p) model is generally converted to a VAR(1) model using companion matrix and then analyzed like a VAR(1) model.

To find the order of a VAR model one can generally use the multi-variant equivalent of PACF with hypothesis tests on the successive residuals. The $i^{th}$ equation in the PACF is given by $\pmb{r}_t=\pmb{\phi}_0+\pmb{\Phi}_1\pmb{r}_{t-1}+...+\pmb{\Phi}_i\pmb{r}_{t-i}+\pmb{a}_t$. Parameters of these equations can be estimated by OLS method. For the $i^{th}$ equation let the OLS estimates of coefficients be $\hat{\pmb{\Phi}}_j^{(i)}$ and $\hat{\pmb{\phi}}_0^{(i)}$, where the superscript $(i)$ is used to denote the VAR(i) model. Then the residual is $\hat{\pmb{a}}_t^{(i)}=\pmb{r}_t-\hat{\pmb{\phi}}_0^{(i)}-\hat{\pmb{\Phi}}_1^{(i)}\pmb{r}_{t-1}-...-\hat{\pmb{\Phi}}_i^{(i)}\pmb{r}_{t-i}$. We then test the hypothesis sequentially to identify the order of VAR model. For $i^{th}$ and $(i-1)^{th}$ equations we test $H_0:\pmb{\Phi}_i=0$ versus $H_a:\pmb{\Phi}_i\ne0$, the test statistic is
$$M(i)=-\Bigg(T-k-i-\frac{3}{2}\Bigg)ln\Bigg(\frac{|\hat{\Sigma}_i|}{|\hat{\Sigma}_{i-1}|}\Bigg).$$ Asymptotically, $M(i)$ is distributed as a chi-squared distribution with $k^2$ degrees of freedom, where $k$ is the dimensionality of the asset universe.

Equivalent AIC and BIC methods can also be employed.  OLS or ML is generally used to estimate the parameters, the two methods being asymptotically equivalent. Once a model is fit, residuals should be tested for inadequacy using $Q_k(m)$ statistic (with $k^2m-g$ degrees of freedom). Forecasting is similar to the uni-variate case. Impulse response function is the MA version and can be derived to look at the decay rate. The MA equation is premultiplied by $\pmb{L}^{-1}$ to get the impulse response function of $\pmb{r}_t$ with the orthogonal innovations $\pmb{b}_t$. But different ordering may lead to different response functions, which is a drawback.

Vector moving-average models (VMA) :A VMA(1) model is given by $$\pmb{r}_t=\pmb{\theta}_0+\pmb{a}_t-\pmb{\Theta}\pmb{a}_{t-1},$$ where $\pmb{\theta}_0$ is k-dimensional vector, $\pmb{\Theta}$ is $k\times k$ matrix. Like the uni-variate case the cross-correlations cuts at order 1 and can be used to identify the order. Estimation of VMA model is lot more involved. The conditional or exact MLE approaches can be used.
Vector ARMA models (VARMA): In generalizing from uni-variate to vectors ARMA encounters the issue of identifiability - it may not be uniquely defined. Some constraints need to be imposed - structural specification. These are hardly used.
Marginal models of components: Given the vector models the component models are called the marginal models. For a k-dimensional ARMA(p,q)  model, the marginal models are ARMA(kp,(k-1)p+q) models.

Unit root nonstationarity and cointegration: When modeling several unit-root nonstationary time series jointly, one may encounter the case of cointegration. This may be due to the common trend or unit root of one of the components. In other words one can find a linear combination which is stationary. Let $h$ be the number of unit roots (or common trends) in the $k$-dimensional series $\pmb{x}_t$. Cointegration exists if $0<h<k$, and the quantity $k-h$ is called the number of cointegrating factors, these are the different linear combinations that are unit-root stationary. The linear combinations resulting in these unit-root stationary processes are called the cointegrating vectors. Two price series, if cointegrated, will have common underlying trend and we will lose this information if we take the first difference of the price series. This is because one difference per unit-root preserves useful information. Under cointegration we have more non-stationary series than uni-roots hence losing information. Overdifferencing can lead to the problem of unit roots in the MA matrix polynomial, invertability and estimation. Also, cointegration can also exist after adjusting for transaction costs and exchange-rate risk, which is artificial.

Error correction form: To overcome the difficulty of noninvertible VARMA models one can use this form. A VARMA(p,q) model is $$\pmb{x}_t=\sum_{i=1}^{p}\pmb{\Phi}_i\pmb{x}_{t-i}+\pmb{a}_t-\sum_{j=1}^q\pmb{\Theta}_j\pmb{a}_{t-j}.$$
For $\Delta \pmb{x}_t=\pmb{x}_t-\pmb{x}_{t-1}$, we can subtract $\pmb{x}_{t-1}$ from both side of VARMA equation to get the error correction form. \[\Delta\pmb{x}_t=\pmb{\alpha\beta}^T\pmb{x}_{t-1}+\sum_{i=1}^{p-1}\pmb{\Phi}_i^* \Delta\pmb{x}_{t-i}+\pmb{a}_t-\sum_{j=1}^q\pmb{\Theta}_j\pmb{a}_{t-j}\]
where $\pmb{\alpha}$ and $\pmb{\beta}$ are $k\times m$ full-rank matrix, k is the total asset dimension, m is the cointegrating factors ($m<k$).  The term $\pmb{\alpha\beta}^T\pmb{x}_{t-1}$ is called the error-correction term, as it compensates for the over-differentiating. Further, $\pmb{\beta}^T\pmb{x}_{t-1}$ is stationary. Also, $$\pmb{\Phi}_j^*=-\sum_{i=j+1}^{p}\pmb{\Phi}_i, \qquad j=1,...,p-1$$ $$\pmb{\alpha\beta}^T=\pmb{\Phi}_p+...+\pmb{\Phi}_1-\pmb{I}$$
The time series $\pmb{\beta}^T\pmb{x_t}$ is unit-root stationary, and the columns of $\pmb{\beta}$ are the cointegrating vectors of $\pmb{x}_t$.

Co-integrated VAR models: We focus on VAR models for their simplicity in estimation, to better understand cointegration here. For a k-dimensional VAR(p) model
where $\pmb{\mu}_0+\pmb{\mu}_1t$. Or equivalently, using backshift operator $B$
The characteristic polynomial in the above is represented as $\pmb{\Phi}(B)$. For a unit-root nonstationary process, 1 is a root making $|\pmb{\Phi}(1)|=0$ An error-correction form for this can be obtained by subtracting $\pmb{x}_{t-1}$ from both sides of the equation giving
\[\Delta \pmb{x}_t=\pmb{\mu}_t+\pmb{\Pi}\pmb{x}_{t-1}+\pmb{\Phi}_1^*\Delta\pmb{x}_{t-1}+...+\pmb{\Phi}_{p-1}^*\Delta \pmb{x}_{t-p+1}+\pmb{a}_t \]
Where $\pmb{\Pi}=\pmb{\Phi}_1+...+\pmb{\Phi}_p-\pmb{I}=-\pmb{\Phi}(1)$ and $\pmb{\Phi}_j^*=-\sum_{i=j+1}^p\pmb{\Phi}_i$ for $j=1,...,p-1$. Further if Rank($\pmb{\Pi}$)=0 implies that $\pmb{x}_t$ is not cointegrated, Rank($\pmb{x}_t$)=$k$ implies that ECM is not informative and one studies $\pmb{x}_t$ directly. Finally, if $0<$Rank($\pmb{\Pi}$)$=m<k$ then one can write $\pmb{\Pi}=\pmb{\alpha\beta}^T$, where $\pmb{\alpha}$ and $\pmb{\beta}$ are both of rank $m$. We have the case of cointegration with $m$ linearly independent cointegrated vectors $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$,a nd has $k-m$ unit roots or common trends.

To obtain the $k-m$ common trends vector of size $(k-m)\times 1$, $\pmb{y}_t=\pmb{\alpha}^T_{\bot}\pmb{x}_t$, we calculate the orthogonal matrix of size $k \times (k-m)$ as $\pmb{\alpha}^T_{\bot}\pmb{\alpha}=\pmb{0}$.To uniquely identify $\pmb{\alpha}$ and $\pmb{\beta}$ we require that $\pmb{\beta}^T=[\pmb{I}_m,\pmb{\beta}^T_1]$, where $\pmb{T}_m$ is a $m\times m$ identity matrix and $\pmb{\beta}_1$ is a $(k-m)\times m$ matrix. There are a few more constraints for the process $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$ to be unit-root stationary.

The rank of $\pmb{\Pi}$ in the ECM is the number of cointegrating vectors. Thus, to test for cointegration, once can examine the rank of $\pmb{\Pi}$, the approach taken in Johansen test.

Deterministic function: Limiting distributions of cointegration tests depend on the deterministic function $\pmb{\mu}_t$.
1) $\pmb{\mu}_t=\pmb{0}$: All components series of $\pmb{x}_t$ are $I(1)$ without drift and the stationary series $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$ has mean zero.
2) $\pmb{\mu}_t=\pmb{\alpha}\pmb{c}_0$: Components of $\pmb{x}_t$ are $I(1)$ without drift, but $\pmb{w}_t$ have a nonzero mean $-\pmb{c}_0$, called restricted constant.
3) $\pmb{\mu}_t=\pmb{\mu}_0$: Component series are $I(1)$ with drift $\pmb{\mu}_0$ and $\pmb{w}_t$ may have a nonzero mean.
4)  $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\alpha c}_1 t$: Components of $\pmb{x}_t$ are $I(1)$ with drift $\pmb{\mu}_0$ and $\pmb{w}_t$ has a linear time trend, called restricted trend.
5) $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\mu}_1t$: Both the constant and trend are unrestricted. The components of $\pmb{x_t}$ are $I(1)$ and have a quadratic time trend and $\pmb{w}_t$ have a linear trend.

Maximum likelihood estimation: Estimation of VAR(p) is quite involved. The deterministic term ($\pmb{x}_t$) and stationary terms ($\Delta\pmb{x}_t$) are first bifurcated and estimated using linear regression and have error terms are $\pmb{u}_t$ and $\pmb{v}_t$ respectively. A relevant eigenvalue problem when solved leads to a likelihood which when maximized gives the estimates of the coefficients.

Johansen test for cointegration: This is esentially testing the rank of the matrix $\pmb{\Pi}$, for a specified deterministic term $\pmb{\mu}_t$. The number of non-zero eigenvalues of $\pmb{\Pi}$ can be obtained if a consistent estimate of $\pmb{\Pi}$ is available. Looking at the ECM equation it is clear that $\pmb{\Pi}$ is related to the covariance matrix between $\pmb{x}_{t-1}$ and $\Delta\pmb{x}_t$ after adjusting for the effects of deterministic trend term and $\Delta\pmb{x}_{t-i}$ for $i=1,...,p-1$. Using canonical correlation analysis between the two adjusted equation, the squared correlation between them are calculated to be $\hat{\lambda}_i$. There are two versions of Johansen test:

1) Trace cointegration test:- $H_0$: Rank($\pmb{\Pi}$) = $m$ versus $H_a$: Rank($\pmb{\Pi}$)>$m$. The Likelihood ratio (LR) statistic is
$$LK_{tr}(m) = -(T-p)\sum_{i=m+1}^{k}ln(1-\hat{\lambda}_i)$$
Due to the presence of unit-roots, the asymptotic distribution of statistic is not chi-squared, but a function of standard Brownian motions. Thus, the critical values must be obtained via simulation.

2) Sequential test:- $H_0$: Rank($\pmb{\Pi}$) = $m$ versus $H_a$: Rank($\pmb{\Pi}$)=$m+1$. The $LK$ ratio test statistic, called the maximum eigenvalue statistic, is
Again, the critical values of the test statistics are nonstandard and must be evaluated via simulation.

Left out sections: 8.7, 8.8

Tsay Ch5 - High Frequency Data Analysis and Market Microstructure

Non-synchronous trading

For daily stock returns, non-synchronous trading can introduce
1) cross correlations between stock returns
2) serial correlation in a portfolio return
3) sometimes negative serial correlations in the return series of a stock.

Bid-ask spread

Introduces lag-1 serial correlation in an asset return, called bid-ask bounce.

Empirical characteristics

Aggregation don't show some of the characteristics of transactions data
1) unequally spaced time intervals - duration between trades might contain useful information about market micro-structure e.g. trading intensity
2) discrete-valued prices - along with limits
3) Existence of a daily periodic pattern - thinner during lunch hour.
4) Multiple transactions within a single second

Overnight stocks returns differ substantially from intraday returns (Stoll and Whaley 1990). Intraday trading has exploded with multiple transactions within second.

Models for price change

The discreteness and concentration on 'no change' make it difficult to model the intraday price changes. There are two models - ordered probit model (Hauseman, Lo and MacKinlay 1992) and a decomposition model (McCulloch and Tsay 2000). These models find prediction challenging, but are more used for understanding purposes.
Ordered Probit model - For $P_t^*$ being the fundamental value of the asset in a friction-less market and $P_t$ being the observed price, we define $y_i^*=P^*_{t_i}-P^*_{t_{i-1}}$ and model $y^*_i$ as a continuous random variable given by $y^*_i=\bf{x_i}\bf{\beta}+\epsilon_i$. The observed value $y_i$ can be categorized in ordered set ${s_1,...,s_k}$. Generally a normal distribution is assumed. The model can be estimated by maximum likelihood or MCMC methods. Explanatory variables $\bf{x_i}$ can be time duration, lagged prices, lagged SP500 price, bid-ask spread and direction, lagged volume. Volatility can be explained using duration and bid-ask spread as well.
A decomposition model (ADS) - indicator for price change, direction of price change, and the size of price change, $y_i = P_{t_i}-P_{t_{i-1}}=A_iD_iS_i$, where ordering is important. Each of these terms are modeled as logistic regression using explanatory variables and estimated using log likelihood.

Duration models - ACD

Concerned with time intervals between trades. Longer durations indicate lack of trading activities, which means no new information. Before the duration can be modeled the diurnal pattern has to be removed from the time series. This is done by calculating adjusted time duration $\Delta t^*_i=\Delta t_i/f(t_i)$. $f(t_i)=exp(\beta_0+\sum_1^7\beta_j f_j(t_i))$, where the $f_i$ are functions defined to take care of first 5 minutes, last 30 minutes, and mid period, depending on the asset and profile. We can then fit the autoregressive conditional duration model. $f(t_i)$ is commonly estimated using smoothing splines. One way is to use combination of quadratic functions and indicator variables to take care of deterministic components of daily trading activities.
$$f(t_i)=e^{d(t_i)} \qquad d(t_i)=\beta_0+\sum_1^7\beta_j f_j(t_i))$$
where, $f_1, f_2, f_3, f_4$ are quadratic functions fitted for specific data (Tsay pg 225).  $f_5$ and $f_6$ are indicator variables for the first and second 5 minutes of market open, and $f_7$ is the indicator for the last 30 minutes of daily trading. The coefficients can be determined by least sqaures method
$$ln(\delta t_i)=\beta_0+\sum_1^7\beta_j f_j(t_i)) +\epsilon_i$$
The autoregressive conditional duration (ACD) model uses the idea of GARCH models to study the dynamic structure of the adjusted duration $\Delta t^*_i$. For $x_i=\Delta t^*_t$ and $\psi_i=E(x_i|F_{i-1})$, the model is defined as $x_i=\psi_i\epsilon_i$, where $\epsilon_i$ follows a standard exponential (EACD) or a standard Weibull (WACD) distribution. Further, similar to GARCH, we have ACD(r,s) model
$$\psi_i=\omega+\sum^r_{j=1}\gamma_j x_{i-j}+\sum^s_{j=1}\omega_j \psi_{i-j}$$
with $\gamma_j=0$ for $j>r$ and $\omega_j=0$ for $j>s$. For stationarity $\omega>0$ and $1>\sum_j(\gamma_j+\omega_j)$.

EACD(1,1) model: $\epsilon_i$ as exponential distribution $x_i=\psi_i\epsilon_i$ and $\psi_i=\omega+\gamma_1x_{i-1}+\omega_1\psi_{i-1}$. We have $E(\epsilon_i)=1$ and $Var(\epsilon_i)=1$, implying $E(\epsilon_i^2)=2$. Assuming weak stationarity,
Hence, $1>2\gamma_1^2+\omega_1^2+2\gamma_1\omega_1$ for stationarity.

Bivariate models for price change and duration - PCD

Jointly modeling the price change and associated duration process. Focus on transactions that result in a price change $P_{t_i}=P_{t_{i-1}}+D_i S_i$, where $D_i$ is the direction change dummy and $S_i$ is the size change variable. It reduces the number of data point and there is no diurnal pattern in time durations between price changes. The PCD model decomposes the joint distribution of $(\Delta t_i, N_i, D_i, S_i)$ given $F_{i-1}$ as
$$f(\Delta t_i, N_i, D_i, S_i | F_{i-1})=f(S_i|D_i, N_i, \Delta t_i, F_{i-1}) f(D_i|N_i, \Delta t_i, F_{i-1}) f(N_i|\Delta t_i, F_{i-1}) f(\Delta t_i|F_{i-1})$$
where the $i^{th}$ transaction data consists of $\Delta t_i$ duration, $N_i$ number of trades in the period, $D_i$ direction of price change, $S_i$ size of price change in ticks. There are many ways to specify the conditional distributions depending on the asset under study. Using McCulloch and Tsay (2000) generalized linear models for discrete valued variables and time series model for continuous variable $ln(\Delta t_i)$ we get
$$ln(\Delta t_i) = \beta_0 + \beta_1 ln(\Delta t_{i-1}) + \beta_2 S_{i-1} + \sigma \epsilon_i.$$
Log transformation is added to ensure positiveness. Due to concentration of $N_i$ at 0, we partition the model for $N_i$ in tow parts.
$$p(N_i=0|\Delta t_i, F_{i-1}) = logit[\alpha_0+\alpha_1 ln(\Delta t_i)]$$
where $logit(x)=e^x/(1+e^x)$, whereas the second part of the model is
$$N_i|(N_i>0, \Delta t_i, F_{i-1}) \sim 1+g(\lambda_i) \qquad \lambda_i=\frac{e^{\gamma_0+\gamma_1 ln(\Delta t_i)}}{1+e^{\gamma_0 + \gamma_1 ln(\Delta t_i)}}$$
where $g(\lambda)$ denotes a geometric distribution with parameter $\lambda$, in interval (0,1). The model for direction $D_i$ is
$$D_i|(N_i, \Delta t_i, F_{i-1})=sign(\mu_i+\sigma_i\epsilon)$$
where $\epsilon$ is a $N(0,1)$ random variable, and
$$\mu_i=\omega_0+\omega_1 D_{i-1}+\omega_2 ln(\Delta t_i),$$
To allow for different dynamics between positive and negative price movements, we use different models for the size of a price change.
$$S_i | (D_i=-1, N_i, \Delta t_i, F_{i-1}) \sim p(\lambda_{d,i})+1 \qquad ln(\lambda_{d,i})=\eta_{d,0}+\eta_{d,1}N_i+\eta_{d,2}ln(\delta t_i)+\eta_{d,3}S_{i-1}$$
$$S_i | (D_i=1, N_i, \Delta t_i, F_{i-1}) \sim p(\lambda_{u,i})+1 \qquad ln(\lambda_{u,i})=\eta_{u,0}+\eta_{u,1}N_i+\eta_{u,2}ln(\delta t_i)+\eta_{u,3}S_{i-1}$$
where $p(\lambda)$ denotes a Poisson distribution with parameter $\lambda$, and 1 is added to the size because the minimum size is 1 tick when there is a price change. Estimation can be done either by maximum likelihood or MCMC methods.

Left out sections: 5.6

Tsay Ch3 - Conditional Heteroscedastic Models

Ch3 : Conditional Heteroscedastic Models

Concerned with modeling of volatility of an asset return. Volatility modeling, apart from being useful in options pricing, provides simple approach to calculate VaR in risk management. It also plays an important role in asset allocation under mean-variance framework. It can also improve the efficiency in parameter estimation and the accuracy in interval forecast. Implied volatility tend to be larger than that obtained by using a GARCH type of volatility model. Some characteristics are:
1) there exist volatility clusters
2) volatility jumps are rare
3) does not diverge to infinity
These implies that volatility is generally stationary.
4) leverage effect - volatility reacts differently to big price increases vs big price drops

Returns are generally serially independent but the square of them are not. The general mean equation is $r_t=\mu_t+a_t$ where,
$$\mu_t=\phi_0+\sum_{i=1}^k \beta_i x_{it}+\sum_{i=1}^p \phi_i r_{t-i}-\sum_{i=1}^q \theta_i a_{t-i}$$
where $k$, $p$ and $q$ are non-negative integers, and $x_{it}$ are explanatory variables (e.g. exogenous variables or dummy variables). The conditional variance (or volatility) is given by
The model of volatility (called volatility equation) can be of two types - one with exact function and the other with stochastic equation.

The first step is always to model the mean equation (e.g. ARMA) and check residuals for ARCH effects. We then fit volatility models and then jointly estimate the mean and volatility equations. Let the mean equation be $r_t=\mu_t+a_t$, with $\mu$ being the simple average of the series. The squared series $a_t^2$ is then used to check for conditional heteroscedasticity or ARCH effects. Two tests:
1) Ljung-Box test on $a_t^2$ series: $H_0:$ first m lags of ACF of $a_t^2$ series are zero.
2) Lagrange multiplier test of Engle: Equivalent to F-statistic for testing $\alpha_i=0$ in the linear regression $a_t^2=\alpha_0+\alpha_1 a^2_{t-1}+...+\alpha_m a^2_{t-m}+e_t$, for $t=m+1,...,T$

ARCH model: Shock $a_t$ is serially uncorrelated but dependent at the quadratic level as $a_t=\sigma_t\epsilon_t$, where
$$\sigma_t^2=\alpha_0+\alpha_1 a^2_{t-1}+...+\alpha_m a^2_{t-m}$$
where ${\epsilon_t}$ is a sequence of iid with mean zero and variance 1, $\alpha_0>0$ and $\alpha_i\ge 0$ for $i>0$. $\epsilon_t$ can be modeled using standard normal, student-t or generalized error distribution. Using heavy-tailed distribution will reduce the ARCH effect. This models volatility clustering.

Properties of ARCH(1): $E[a_t]=E[E[a_t|F_{t-1}]]=E[\sigma_t E[\epsilon_t]]=0$. Also, $Var(a_t)=\alpha_0/(1-\alpha_1)$. Making higher orders finite can impose further conditions, like the unconditional kurtosis is
Thus, ARCH(1) model admits heavier tails than that of a normal distribution. For kurtosis to be finite $\alpha^2_1\le 1/3$. The model assumes that the positive and negative shocks have same effects. They are likely to over-predict the volatility because they respond slowly to large isolated shocks.

Order determination: using PACF of $a^2_t$ one can select the ARCH order, unless the sample size is small. Estimation is done using log likelihood estimation. After fitting standardized residuals $a_t/\sigma_t$  should be checked for iid using Ljung-Box statistics, skewness, kurtosis and qqplot. Forecasting is similar to AR model.

GARCH model: To reduce the number of parameters we use GARCH model. A GARCH(m,s) is given by $a_t=\sigma_t \epsilon_t$ and
$$\sigma^2_t = \alpha_0 + \sum_{i=1}^m \alpha_i a^2_{t-i} + \sum_{j=1}^s \beta_j \sigma^2_{t-j}$$
 where again ${\epsilon_t}$ is an iid with mean 0 and variance 1, $\alpha_0>0$, $\alpha_i \ge 0$, $\beta_j \ge 0$, and $\sum_{i=1}^{max(m,x)}(\alpha_i+\beta_i)<1$. $\alpha_i$ and $\beta_j$ are referred to as ARCH and GARCH parameters, respectively. This shows similar properties as ARCH with ARMA like characteristics.

Difficult to specify the model. Only lower order models used. Log likelihood is used and the fitted model is checked using standardized residual $a_t/\sigma_t$ and its squared process. $a^2_{h+1}$ is a consistent estimate of $\sigma^2_{h+1}$ but is not an accurate estimate of the prediction, because a single observation of a random variable with a known mean and value cannot provide an accurate estimate of its variance.

IGARCH model: unit-root GARCH model. $\alpha_i+\beta_i=1$. The unconditional variance of $a_t$, hence that of $r_t$, is not defined under the IGARCH model. Theoretically, this might be caused by occasional level shifts in volatility. For the case of $\alpha_0=0$, IGARCH(1,1) model become exponentially smoothing of volatility model.

Stochastic volatility model: To ensure positiveness of the conditional variance, SV models use $ln(\sigma^2_t)$ instead of $\sigma^2_t$. It is defined as $a_t=\sigma_t \epsilon_t$
$$ln(\sigma^2_t)=\alpha_0+\alpha_1 ln(\sigma^t_{t-1})+\nu_t$$
where $\epsilon_t$ are iid $N(0,1)$, the $\nu_t$ are iid $N(0,\sigma^2_{\nu})$, ${\epsilon_t}$ and ${\nu_t}$ are independent, $\alpha_0$ is a constant and all zeros of the characteristic polynomial are greater than 1 in modulus.

Quasi-likelihood methods via Kalman filtering or MCMC methods are used for estimation. Limited experience shows that SV models often provided improvements in model fitting, but their contributions to out-of-sample volatility forecasts received mixed results.

Alternative approaches: Two approaches
1) High-Frequency Data - Use of high-frequency data to calculate volatility of low-frequency returns. The model for daily returns is unknown which can complicate the monthly estimation. Assuming $r_{i}$ are log daily returns for the monthly returns $r^m$, we have $r^m=\sum^n r_{t}$. For white noise we have $Var(r^m)=nVar(r_i)$. For MA(1) process it is $Var(r^m)=nVar(r_i)+2(n-1)Cov(r_i+r_{i+1})$. Further the sample size is only 21 days making the accuracy questionable. If the kurtosis and serial correlations are high the sample estimate is not even consistent.
This concept can be used to calculate daily volatility from intraday log returns. At this level the realized volatility ($\sum r^2$) approximately follows Gaussian ARIMA(0,1,q) model, which can be used to produce forecasts. Intuitively one would use the smallest possible interval, but anything less than 15 minutes bring in noise. Overnight returns also need to be taken into account for correct daily volatility estimation.
2) Daily OHLC Prices - Shown to improve volatility estimate. Let $C_t$, $O_t$, $H_t$, and $L_t$ be the Close, Open, High and Low for day t. Also let $f$ be the fraction of day that trading is closed. The conventional volatility estimate is $\sigma^2_t=E[(C_t-C_{t-1})^2|F_{t-1}]$. Based on price following simple diffusion model without drift various estimates of volatility has been proposed, one being $0.5(H_t-L_t)^2-[2ln(2)-1](C_t-O_t)^2$. Yang and Zhang (2000) also proposed an estimate on the lines of $$\hat{\sigma}^2_{yz}=\hat{\sigma}^2_o+k\hat{\sigma}^2_c+(1-k)\hat{\sigma}^2_{rs},$$
$$\hat{\sigma}^2_o = \frac{1}{n-1}\sum^n_{t=1}(o_t-\bar{o})^2 \qquad o_t=ln(O_t)-ln(C_{t-1})$$
$$\hat{\sigma}^2_c = \frac{1}{n-1}\sum^n_{t=1}(c_t-\bar{c})^2 \qquad c_t=ln(C_t)-ln(O_t)$$
$$\hat{\sigma}^2_{rs} = \frac{1}{n}\sum^n_{t=1}[u_t(u_t-c_t)+d_t(d_t-c_t)] \qquad u_t=ln(H_t/)_t) \qquad d_t=ln(L_t/O_t)$$
quantity k is chosen to minimize the variance of the estimator $\hat{\sigma}^2_{yz}$.

Left out sections: 3.7-3.11, 3.13-3.14, 3.16

Tsay Ch2 - Linear Time Series Analysis and Its Applications

Ch2: Linear Time Series Analysis and its Applications

We work with log returns $r_t$ as a collection of random variables over time, as a time series ${r_t}$.

Stationarity: The first step in time series analysis should always be checking stationarity.
1) Strictly stationary time series ${r_t}$ if the joint distribution of $(r_{t_1}, ...,r_{t_k})$ is identical to that of $(r_{t_1+t},...,t_{t_k+t})$ for all t and k, i.e. the joint distribution is invariant under time shift.
2) Weak stationarity: if both mean and covariance between $r_t$ and $r_{t-l}$ are time-invariant. Weak stationarity implies partial predictability. In other words, the first two moments of $r_t$ are finite and constant. A strictly stationary series is weakly stationary but not vice-versa unless $r_t$ is normally distributed.

Autocovariance: $\gamma_l=Cov(r_t,r_{t-l})$ with properties $\gamma_0=Var(r_t)$ and $\gamma_{-l}=\gamma_l$

Autocorrelation function (ACF): uncorrelated means $\rho_{x,y}=0$. It implies independence under normality. $\rho_l=\gamma_l/ \gamma_0$ giving $\rho_0=1$ and $\rho_{-l}=\rho_l$. Sample autocorrelation is given by
is asymptotically normal with mean zero and variance $1/T$, if $r_t$ is an iid sequence with finite second moment (Brockwell and Davis 1991).

Bartlett's formula: For a weakly stationary time series of the form $r_t=\mu+\sum^q_1\psi_i a_{t-i}$, where $\psi_0=1$, q is a non-negative integer, and ${a_j}$ are Gaussian white noise series, then $\hat{\rho}_l$ is asymptotically normal with mean $\mu$ and variance $(1+2\sum^q_1\rho^2_i)/T$ for $l \ge q$ (Box, Jenkins, Reinsel 1994). Hence, to test the autocorrelation for a given integer $l$, we can construct the statistic.

Ljung and Box (1978) test increases the power for finite samples by constructing a chi-squared distribution statistic with m degrees of freedom:
In practice, the selection of m may affect the performance of the $Q(m)$ statistic. $m\approx ln(T)$ provides better power performance.

Monthly returns of the value-weighted index seems to have stronger serial dependence than individual stock returns. CAPM suggests there is no autocorrelation in financial series. One has to be careful of the autocorrelation induced by the way index returns are determined or stock prices are determined - these are pseudo relationships.

White noise: time series ${r_t}$ is white noise if the sequence is iid with finite mean and variance. if the mean is 0 and variance $\sigma^2$, it is called Gaussian white noise. For a white noise series all the ACFs are zero. e.g. monthly individual stocks returns seem to be white noise but not value-weighted index return series.

Wold decomposition: A time series ${r_t}$ is said to be linear if it can be written as Wold decomposition
$$r_t=\mu+\sum_{i=0}^{\infty}\psi_i a_{t-i},$$
where $\mu$ is the mean of $r_t$, $\phi_0=1$, and ${a_t}$ is white noise series. If $r_t$ is weakly stationary, we can obtain the mean, variance and covariance easily. For weakly stationary series $\rho_l$ converges to zero as $l$ increases.

AR(p) model: Current conditional expectation of returns depend on the last p steps
$$r_t=\phi_o+\phi_1 r_{t-1}+...+\phi_pr_{r-p}+a_t$$
The necessary and sufficient condition for weakly stationarity is $|\phi_1|<1$ for AR(1) process.  The plot of ACF decays exponentially (extends beyond lag 1!). For higher order AR processes the inverse of the solutions to characteristic equation are called characteristic roots. Complex roots correspond to damped oscillations related to business cycles. The stationarity condition is that the characteristic roots are less than one in modulus (or all solutions to characteristic equation greater than one in modulus).

Identification of AR model: order determination of AR models. Two approaches:
1) Partial Autocorrelation Function (PACF): nested AR models with increasing order. For a true order of $p$, the terms $\hat{\phi}_{l,l}$ converge to zero for all $l>p$ with variance $1/T$. These can be plotted to determine the cutoff as a PACF plot.
2) Information criteria:  likelihood based criteria. e.g. the Akaike information criterion (AIC) 1973 is:
$$AIC = -\frac{2}{T}ln(likelihood)+\frac{2}{T}P$$
For a Gaussian AR($l$) model, AIC reduces to $ln(\hat{\sigma}_l^2)+2l/T$, where $\hat{\sigma^2_l}$ is the maximum likelihood estimate of $\sigma^2_a$, T is the sample size and P is the number of parameters. Another common criteria is Bayesian information criterion (BIC), which for Gaussian AR($l$) reduces to $ln(\hat{\sigma}_l^2)+ln(T)l/T$. BIC tends to select a lower AR model when the sample size is moderate to large.

Parameter estimation: OLS method, or likelihood method. Likelihood estimation is very similar to OLS but gives slightly lower $\sigma_a^2$ for finite samples.

Model checking: residual series should be white noise - check ACF of residual and Ljung-Box statistics with $m-g$ degrees of chi-squared distribution, where m is approximately $ln(T)$ and g is number of AR coefficients used in the model.

Goodness of fit: For a stationary series $R^2$ can be used as a goodness of fit. For unit-root non-stationary series $R^2$ converges to one regardless of the underlying model. Adjusted $R^2$ should be used to penalize the increased number of parameters.

Forecasting: Longer terms forecasts are more accurate and approach unconditional mean. The speed of mean reversion is measured by half life denoted by $k=ln(0.5/|\phi_1|)$.

MA(q) model: Moving average models. These are simple extension of white noise series, an infinite-order AR model with some parameter constraints. Bid-ask bounce in stock trading may introduce an MA(1) structure in a return series. The general form is
$$r_t=c_0+a_t-\theta_1 a_{t-1}-...-\theta_q a_{t-q}$$
It is a finite memory model with ACF zero beyond the first q. MA(q) is invertible if $|\theta| <1$, otherwise noninvertible. An MA model is always stationary. The coefficient of MA models are also called impulse response function.

Identification of AR model: Identified using ACF, provides information on the nonzero MA lags of the model.

Estimation: Likelihood methods:
1) conditional likelihood method: assumes initial shock to be zero.
2) exact likelihood method: initial shocks are estimated jointly with other parameters.

Forecasting: q step ahead forecast for MA(q) model is the mean and stays there.

ARMA(p,q) model: to reduce higher order parameters in AR or MA model, one can use ARMA model. Generally not used for returns but volatility models. ARMA(1,1) is
$$r_t-\phi_1 r_{t-1}=\phi_0+a_t-\theta_1 a_{t-1}$$
For the model to be meaningful we need $\phi_1 \ne \theta_1$. The ACF of an ARMA(1,1) behaves very much like that of an AR(1) model except that the exponential decay starts with lag 2 instead of lag 1. Consequently, the ACF of an ARMA(1,1,) does not cut off at any finite lag. The PACF of an ARMA(1,1) does not cut off at any finite lag either, like of MA(1), except that the exponential decay starts with lag 2. The stationarity condition for an ARMA(1,1) is same as that of an AR(1) model. A general ARMA(p,q) model is
$$r_t = \phi_0+\sum_{i=1}^{p}\phi_i r_{t-i}+a_t-\sum_{j=1}^{q}\theta_j a_{t-j}$$
There should be no common factors between the AR and MA polynomial, otherwise the order of the model can be reduced. For stationarity all solutions to AR polynomial should be less than 1 in absolute value.

Identification: EACF (Tsay, Tiao 1984) can be used. The first zero corner in the EACF table identifies the order. Estimation can then be done using conditional or exact likelihood. Ljung-Box statistics of the residuals can also be checked for the adequacy of the model.

Unit root nonstationarity: Price series, interest rate, FX rates are generally non stationary. They are called unit root non-stationary  series in time series literature. The best example is random-walk.

Random walk: A time series ${p_t}$ is a random walk if it satisfies
Looking it as and AR(1) process, it is a non-stationary as the coefficient is equal to 1. We call this a unit root nonstationary time series. This is a non-mean reverting model with prediction of price equal to the value at forecast origin, with variance of estimate $l\sigma_a^2$, where $l$ is the number of look ahead steps, which diverge to infinity as $l \to \infty$. The series has a strong memory, as the sample ACFs are all approaching 1 as the sample size increases.

Random walk with drift: The market index tends to have a small and positive mean. The model for log price becomes
Both the expected value $t\mu$ and variance $t\sigma_a^2$ increase with time t.

Trend-Stationary time series: $p_t = \beta_0+\beta_1 t +t_t$
The mean is $\beta_0+\beta_1 t$ and variance is finite and equal to $Var(r_t)$, unlike random walk non-stationary model with trend. This can be transformed into a stationary one by removing the time trend via a simple linear regression analysis.

ARIMA models: A conventional way to handle unit-root nonstationarity is to use differencing. A time series $y_t$ is said to be ARIMA(p,1,q) process if the change series $c_T=y_t-y_{t-1}$ is stationary and invertible ARMA(p,q) process. For example, log price series is an ARIMA process. Series with multiple roots may need multiple differencing.

Dickey-Fuller unit root test: $H_0:\phi_1=1$ versus $H_a:\phi_1<1$. The test statistic is the t-ratio of the least squares estimate of $\phi_1$ from the model $p_t=\phi_1 p_{t-1}+e_t$ giving
$$\hat{\phi}_1 = \frac{\sum_1^T p_{t-1}p_t}{\sum_1^T p^2_{t-1}}$$
$$\hat{\sigma}_e^2=\frac{\sum_1^T (p_t-\hat{\phi}_1 p_{t-1})^2}{T-1}$$
where $p_0=0$ and $T$ is the sample size. The t-ratio is
$$\frac{\hat{\phi}_1-1}{std(\hat{\phi}_1)} = \frac{\sum_1^T p_{t-1}e_t}{\hat{\sigma}_e\sqrt{\sum_1^T p^2_{t-1}}}$$

Left out sections: 2.8 Seasonal models, 2.9 Regression models with time series errors, 2.10 consistent covariance matrix estimation, 2.11 long-memory models

Tsay Ch1 - Financial time series and their characteristics

Here are my notes from the book 'Analysis of Financial Time Series' by Tsay. It's a good introductory book, mostly empirical with decent maths. Someday I will read Hamilton or Green and write that up too. 

Ch1: Financial Time series and their characteristics

Concerned with asset valuation. The uncertainty in the definition and statistical inference of returns and volatility plays important role.

Asset returns - various definitions. Let $P_t$ be the price of an asset at time index t. Assume no dividends.
1) One-period simple return: $R_t=\frac{P_t}{P_{t-1}}-1$
2) Multi-period simple return: $1+R_t[k]=\Pi_{j=0}^{k-1}(1+R_{t-j})$
3) Continuously compounded or log return: $r_t=ln(1+R_t)=ln(P_t/P_{t-1})=p_t-p_{t-1}$, and $r_t[k] = \sum_{i=0}^{k-1}r_{t-i}$
4) Portfolio return: $R_{p,t}=\sum_{i=1}^{N}w_i R_{it}$ and $r_{p,t}\approx\sum_{i=1}^{N}w_i r_{it}$
5) Dividend payment: $R_t=\frac{P_t+D_t}{P_{t-1}}-1$ and $r_t=ln(P_t+D_t)-ln(P_{t-1})$
6) Excess return: $Z_t=R_t-R_{0t}$ and $z_t=r_t-r_{0t}$

Distributional properties - N assets and T time indices.
Joint distribution:
$$F_{X,Y}(x,y;\theta)=P(X\le x, Y\le y; \theta)=\int_{-\infty}^x \int_{-\infty}^y f_{x,y}(w,z;\theta)dzdw$$
Marginal distributions: marginal distribution of $X$ is obtained by integrating out $Y$.
Conditional distributions:
$$F_{X|Y\le y}(x;\theta) = \frac{P(X\le x, Y \le y ; \theta)}{P(Y \le y; \theta)}$$
$$f_{x,y}(x,y;\theta)=f_{x|y}(x;\theta)\times f_y(y;\theta)$$
Independent variables imply $f_{x,y}(x,y;\theta)=f_x(x;\theta)f_y(y;\theta)$
Moments of a random variable:
The $l$th central moment of $X$ is defined as
Sample central moments:
sample mean: $\hat{\mu_x}=\frac{1}{T}\sum_{t=1}^{T}x_t$
sample variance: $\hat{\sigma_x^2}=\frac{1}{T-1}\sum_{t=1}^{T}(x_t-\hat{mu_x})^2$
sample skewness: $\hat{S_x}=\frac{1}{(T-1)\hat{\sigma}^3_x}\sum_{t=1}^{T}(x_t-\hat{mu_x})^3$, distributed normally with mean 0 and variance $6/T$ (Snedecor and Cochran 1980).
sample kurtosis: $\hat{K_x}=\frac{1}{(T-1)\hat{\sigma}^4_x}\sum_{t=1}^{T}(x_t-\hat{mu_x})^4$, distributed normally with mean 3 and variance $24/T$.

JB normality test: Both $\hat{S}(r)$ and $\hat{K}(r)-3$ are normally distributed and can have individual hypothesis tests for normality. Jarque and Bera (1987) combine the two tests and use the test statistic
which is asymptotically distributed as a chi-squared random variable with 2 degrees of freedom.

Distribution of returns - The most general model for the log returns ${r_{it}; i=1, ...,N; t=1,...,T}$ is its joint distribution funtion:
where $Y$ is a state vector summarizing the environment in which asset returns are determined and $\theta$ are parameters that uniquely determine the distribution function $F_r(.)$. Many times $Y$ is assumed given and the main concern is the conditional distribution of ${r_{it}}$ given $Y$. Some applications focus on the joint distribution of N returns at a single time index t, i.e. ${r_{1t},...,r_{Nt}}$. Other theories emphasize the dynamic structure of individual asset returns - the uni-variate case - ${r_{i1},...,r_{iT}}$ for a given asset $i$. For the uni-variate case the joint distribution is given as
$$F(r_1, ...,r_T;\theta)=F(r_1;\theta)\Pi_{t=2}^{T}F(r_t|r_{t-1},...,r_{1};\theta).$$
The main concern then is to specify the conditional distribution $F(r_t|r_{t-1},..,r{1};\theta)$. Different distributional specifications lead to different theories. For instance, if the conditional distribution is equal to the marginal distribution it is a random-walk hypothesis. It is customary to treat asset returns as continuous random variables, and use their probability density functions.

Different distributions on marginals give different models:
Normal distribution: $R_t$ iid and normal with fixed mean and variance - lower bound is wrong, multiperiod returns are not normal due to multiplication, empirical excess kurtosis.
Lognormal distribution: $r_t$ iid and normal with fixed mean $\mu$ and variance$\sigma^2$. The simple returns $R_t$ are then lognormal with:
$$E[R_t]=e^{(\mu+\frac{\sigma^2}{2})}-1 \qquad Var[R_t]=e^{2\mu+\sigma^2}[e^{\sigma^2}-1]$$
summation is also normally distributed, there is no lower bound, but is not consistent with excess kurtosis.
Stable distributions: distribution where a linear combination has the same distribution. Summation and lower bound satisfied but has infinite variance, e.g. Cauchy distribution $f(x)=1/(\pi(1+x^2))$.
Scale mixture of normal distributions: finite mixtures of normal distributions $r_t \sim (1-X)N(\mu,\sigma^2_1)+XN(\mu,\sigma^2_2)$, where $X$ is a Bernoulli random variable and $\sigma^2_2$ is relatively large. Maintain tractability of normal under addition, capture excess kurtosis and finite higher moments, but hard to estimate the mixture parameters. For multivariate returns covariance becomes critical. Everything becomes vectors.

log likelihood function of returns with normal distribution:
$$ln f(r_1,...,r_T;\theta)=ln f(r_1;\theta)-\frac{1}{2}\sum_{t=2}^T(ln(2\pi)+ln(\sigma_t^2)+\frac{(r_t-\mu_t)^2}{\sigma^2_t})$$
This assumes iid of each observation.

Empirical properties of returns:
a) Daily returns have higher kurtosis than monthly returns. For monthly returns market indexes have higher kurtosis than individual stocks.
b) For daily returns market indexes have smaller standard deviation than individual stocks.
c) Empirical returns density is taller, but with fatter tails.

Processes considered: return series, conditional volatility (e.g. clustering), extreme behaviors (frequency, size, impact).