Friday, September 4, 2015

Tsay Ch10 - Multivariate Volatility Models and their applicaitons

Generalize the univariate volatility models of chapter 3 to the multivariate case as well as simplifying the dynamic relationship between volatility process of multiple asset returns, to address the curse of dimensionality and time-varying correlations. Consider a multivariate return series $\pmb{r}_t$ given by $$\pmb{r}_t=\pmb{\mu}_t+\pmb{a}_t,$$ where $\pmb{\mu}_t=E(\pmb{r}_t|F_{t-1})$ is the conditional expectation of $\pmb{r}_t$ given the past information and $\pmb{a}_t$ is the shock at time t. The mean equation of $\pmb{r}_t$ can be modeled as a multivariate time series process (ch 8) , e.g. a simple VARMA process $$\pmb{\mu}_t=\pmb{\Gamma}\pmb{x}_t+\sum_{i=1}^{p}\pmb{\Phi}_i\pmb{r}_{t-i}-\sum_{i=1}^{q}\pmb{\Theta}_i\pmb{a}_{t-i},$$ where $\pmb{x}_t$ denotes the $m$-dimensional vector of exogenous (explanatory) variables with $x_{1t}=1$, $\pmb{\Gamma}$ is a $k\times m$ matrix, and $p$ and $q$ are non-negative integers.

The conditional covariance matrix of $\pmb{a}_t$ given $F_{t-1}$ is a $k\times k$ positive definite matrix $\pmb{\Sigma}_t$ defined by $Cov(\pmb{a}_t|F_{t-1})$. Multivariate volatility modeling is concerned with the time evolution of $\pmb{\Sigma}_t$. This is referred to as the volatility model equation of $\pmb{r}_t$.

Exponentially weighted estimate

An equally weighted estimate of unconditional covariance matrix of the innovations can be estimated by $$\hat{\Sigma}=\frac{1}{t-1}\sum_{j=1}^{t-1}a_j a_j^T.$$ To allow for a time-varying covariance matrix with emphasis on recent information one can use exponential smoothing as $$\hat{\Sigma}_t=\frac{1-\lambda}{1-\lambda^{t-1}}\sum_{j=1}^{t-1}\lambda^{j-1}a_{t-j}a_{t-j}^T,$$ where $0<\lambda<1.$ For a sufficiently large t such that $\lambda^{t-1}\approx 0,$ the equation becomes $$\hat{\Sigma}_t=(1-\lambda)a_{t-1}a_{t-1}^T+\lambda \hat{\Sigma}_{t-1}.$$ This is called the EWMA estimate of covariance matrix. The parameters along with $\lambda$ can be jointly estimated using log-likelihood, which can be evaluated recursively. $\lambda$ of 0.94 (30 days) comes out commonly as optimal. 

Some multivariate GARCH models

  1. Diagonal Vectorization model (VEC): generalization of exponentially weighted moving-average approach. Each element is a GARCH(1,1) type mode. May not produce a positive definite covariance matrix and does not model the dynamic dependence between volatility series. 
  2. BEKK model: Baba-Engle-Kraft-Kroner model (1995) to guarantee the positive-definite constraint. Too many parameters but models dynamic dependence between the volatility series. 


$\Sigma_{t}$ is reparameterized by making used of the symmetric property.
  1. Use of correlations - Covariance matrix can be represented as variances and lower triangle correlations and can be jointly modeled. Specifically, we write $\Sigma_t$ as $D_t\rho_tD_t$, where $\rho_t$ is the conditional correlation matrix of $a_t$, and $D_t$ is a $k \times k$ diagonal matrix consisting of conditional standard deviations of elements of $a_t$. To model the volatility of $a_t$, it suffices to consider the conditional variances and correlation coefficient of $a_{it}$. The $k(k+1)/2$ dimensional vector $\Xi_t= (\sigma_{11,t},...,\sigma_{kk,t}, \varrho_t^T)^T$, where $\varrho_t$ is a $k(k-1)/2$ dimensional vector obtained by stacking columns of the correlation matrix $\rho_t$, but using only the elements below the main diagonal, i.e. $\varrho_t=(\rho_{21,t},...,\rho_{k1,t}|\rho_{32,t},...,\rho_{k2,t}|...|\rho_{k,k-1,t})^T$. To illustrate, for $k=2$, we have $\varrho_t=\rho_{21,t}$ and $\Xi_t=(\sigma_{11,t},\sigma_{22,t},\rho_{21,t})^T$, which is a 3-dimensional vector. The approach has weaknesses because the likelihood function becomes complicated when the dimension is greater than 2. And the approach requires a constrained maximization to ensure positive definiteness. 
  2. Cholesky decomposition - This requires no constrained maximization. This is orthogonal transformation so the resulting likelihood is extremely simple. Because $\Sigma_t$ is positive definite, there exist a lower triangular matrix $L_t$ with unit diagonal elements and a diagonal matrix $G_t$ with positive diagonal elements such that $\Sigma_t=L_tG_tL_t^T.$ A feature of the decomposition is that the lower off-diagonal elements of $L_t$ and the diagonal elements of $G_t$ have close connections with linear regression. Using Cholesky decomposition amounts to doing an orthogonal transformation from $a_t$ to $b_t$, where $b_{1t}=a_{1t}$, and $b_{it}$, for $1<i \le k$, is defined recursively by the least-square regression $a_{it}=q_{i1,t}b_{1t}+q_{i2,t}b_{2t}+...+q_{i(i-1),t}b_{(i-1)t}+b_{it}$, where $q_{ij,t}$ is the $(i,j)$th element of the lower triangular matrix $L_t$ for $1\le j <i$. We can write this transformation as $a_t=L_tb_t$, where $L_t$ is the lower triangular matrix with unit diagonal elements. The covariance matrix of $b_t$ is $G_t$. The parameter vector relevant to volatility modeling under such a transformation becomes $\Xi_t=(g_{11,t},...,g_{kk,t},q_{21,t},q_{31,t},q_{32,t},...,q_{k1,t},...,q_{k(k-1),t})^T$, which is also a $k(k+1)/2$ dimensional vector. The likelihood function also simplifies drastically. There are several advantages of this transformation. First, $\Sigma_t$ can be kept positive definite simply by modeling $ln(g_{ii,t})$. Second, element of $\Xi_t$ are simply the coefficients and residual variances of multiple linear regressions that orthogonalize the shocks to the returns. Third, the correlation coefficient between $a_{1t}$ and $a_{2t}$, which is simply $q_{21,t}\sqrt{\sigma_{11,t}}/\sqrt{\sigma_{22,y}}$, is time-varying. Finally, we get $\sigma_{ij,t}=\sum_{c=1}^{j}q_{iv,t}q_{jv,t}g_{vv,t}.$

GARCH models for bivariate returns

Thursday, September 3, 2015

Multivariate Normal distribution

This was forthcoming, especially, if you want to understand Kalman filter.

A $k$-dimensional random vector $\pmb{x}=(x_1,...,x_k)^T$ follows a multivariate normal distribution with mean $\pmb{\mu}=(\mu_1,...,\mu_k)^T$ and positive-definite covariance matrix $\pmb{\Sigma}=[\sigma_{ij}]$ if its probability density function is $$f(x|\mu, \Sigma)=\frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}.$$ This is denoted by $x\sim N_k(\mu,\Sigma).$ A square matrix $A (m\times m)$ is a positive-definite matrix if $A$ is symmetric, and all eigenvalues of $A$ are positive. Alternatively, $A$ is a positive-definite matrix if for any nonzero $m$-dimensional vector $b$, we have $b^TAb>0.$ For a positive-definite matrix $A$ all eigenvalues are positive and matrix can be decomposed as $A=P\Lambda P^T,$ where $\lambda$ is a diagonal matrix consisting of all eigenvalues of $A$ and $P$ is an $m\times m$ matrix consisting of the $m$ right eigenvectors of $A$, making $P$ an orthogonal matrix, if eigenvalues are distinct.

For a symmetric matrix $A$, there exists a lower triangular matrix $L$ with diagonal elements being 1 and a diagonal matrix $G$ such that $A=LGL^T$. If $A$ is positive definite, then the diagonal elements of G are positive. In this case we can write $A=(L\sqrt{G})(L\sqrt{G})^T$, where $L\sqrt{G}$ again is a lower triangle matrix. Such a decomposition is called Cholesky decomposition of $A$. This shows that a positive-definite matrix $A$ can be diagonalized as $L^{-1}A(L^T)^{-1}=L^{-1}A(L^{-1})^T=G.$

Let $c=[c_1,...,c_k]^T$ be a nonzero vector partitioned as $x=[x_1^T,x_2^T]^T$, with the first of size $p$ and the second of size $k-p$ such that, $$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix} \right).$$ Some properties of $x$ are:

  1. $c^Tx \sim N\left( c^T\mu, c^T\Sigma c \right)$, any nonzero linear combination of $x$ is univariate normal and vice-versa.
  2. The marginal distribution of $x_i$ is normal, $x_i \sim N_k \left( \mu_i, \Sigma_{ii}\right)$.
  3. $\Sigma_{12}=0$ if an only if $x_1$ and $x_2$ are independent.
  4. The variable $(x-\mu)^T\Sigma^{-1}(x-\mu)$ follows a chi-squared distribution with $m$ degrees of freedom.
  5. The conditional distribution of $x_1$ given $x_2=b$ is also normally distributed as $$(x_1|x_2=b)\sim N \left( \mu_1+\Sigma_{12}\Sigma_{22}^{-1}(b-\mu_2), \Sigma_{11}-\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).$$
Suppose that $x$, $y$, and $z$ are three random vectors such that their joint distribution is multivariate normal. In addition, assume that the diagonal block covariance matrix $\Sigma_{ww}$ is nonsingular for $w=x,y,z$, and $\Sigma_{yz}=0$. Then,

  1. $(x|y) \sim N \left( \mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y), \Sigma_{xx}-\Sigma_{xx}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$
  2. $(x|y,z) \sim N\left( E(x|y)+\Sigma_{xz}\Sigma_{zz}^{-1}(z-\mu_z), Var(x|y)-\Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}\right)$

Sunday, August 16, 2015

Tsay Ch 11 - State-space models and kalman filter

Local trend model

For a univariate time series $y_t=\mu_t+\epsilon_t$ and $\mu_{t+1}=\mu_t+\eta_t$, both the error terms are assumed to be normally distributed to distinct variance $\sigma_e^2$ and $\sigma_{\eta}^2$ respectively. Notice the first equation is the observed version of the second trend model with added noise. This model can be used to analyze realized volatility of an asset price if $\mu_t$ is assumed to be the log volatility (which is not directly observable) and $y_t$ is the logarithm or realized volatility (which is observable constructed from high-frequency transaction data with microstructure noise).

If there is no measurement error term in the first equation ($\sigma_e=0$) this becomes a ARIMA(0,1,0) model. With the error term it is a ARIMA(0,1,1) model, which is also the simple exponential smoothing model. The form is $(1-B)y_t=(1-\theta B)a_t$, $\theta$ and $\sigma_{a}^2$ are related to $\sigma^2_e$ and $\sigma^2_{\eta}$ as follows: $(1+\theta^2)\sigma^2_a=2\sigma^2_e+\sigma^2_{\eta}$ and $\theta \sigma^2_a=\sigma^2_e.$ The quadratic equation for $\theta$ will give two solutions with $|\theta|<1$ chosen. The reverse is also possible for positive $\theta$. Both representations have pros and cons and the objective of data analysis, substantive issues and experience decide which to use.

Statistical Inference

Three types (using reading handwritten note example)
  1. Filtering - recover state variable $\mu_t$ given $F_t$ to remove the measurement errors from the data. (figuring out the word you are reading based on knowledge accumulated from the beginning to the note).
  2. Prediction - forecast $\mu_{t+h}$ or $y_{t+h}$ for $h>0$ given $F_t$, where $t$ is the forecast origin. (guess the next word).
  3. Smoothing - estimate $\mu_t$ given $F_T$, where $T>t$. (deciphering a particular word once you have read through the note).

The Kalman Filter

Let $\mu_{t|j}=E(\mu_t|F_j)$ and $\Sigma_{t|j}=Var(\mu_t|F_j)$ be, respectively, the conditional mean and variance of $\mu_t$ given information $F_j$. Similarly $y_{t|j}$ denotes the conditional mean of $y_t$ given $F_j$. Furthermore let $v_t=y_t-y_{t|j}$ and $V_t=Var(v_t|F_{t-1})$ be 1-step ahead forecast error and its variance of $y_t$ given $F_{t-1}$. Note that $Var(v_t|F_{t-1})=Var(v_t)$, since the forecast error $v_t$ is independent of $F_{t-1}$. Further, $y_{t|t-1}=\mu_{t|t-1}$ giving $v_t=y_t-\mu_{t|t-1}$ and $V_t=\Sigma_{t|t-1}+\sigma^2_e$. Also, $E(v_t)=0$ and $Cov(v_t,y_t)=0$ for $j<t$. The information $F_t \equiv \{F_{t-1},y_t\} \equiv \{F_{t-1},v_t\}$, hence $\mu_{t|t}=E(\mu_t|F_{t-1},v_t)$ and $\Sigma_{t|t}=Var(\mu_t|F_{t-1},v_t)$.

One can show that $Cov(\mu_t,v_t|F_{t-1})=\Sigma_{t|t-1}$ giving, $$\begin{bmatrix} \mu_t \\ v_t \end{bmatrix}_{F_{t-1}} \sim N\left( \begin{bmatrix} \mu_{t|t-1} \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma_{t|t-1} & \Sigma_{t|t-1} \\ \Sigma_{t|t-1} & V_t\end{bmatrix} \right).$$ Applying the multivariate normal theorem we get $$\mu_t|t = \mu_{t|t-1}+(V_t^{-1}\Sigma_{t|t-1})v_t=\mu_{t|t-1}+K_tv_t,$$ $$\Sigma_{t|t}=\Sigma_{t|t-1}-\Sigma_{t|t-1}V_t^{-1}\Sigma_{t|t-1} = \Sigma_{t|t-1}(1-K_t),$$ where $K_t=V_t^{-1}\Sigma_{t|t-1}$ is referred to as the Kalman gain, which is the regression coefficient of $\mu_t$ on $v_t$, governing the contribution of th enew shock $v_t$ to the state variable $\mu_t$. To predict $\mu_{t+1}$ given $F_t$ we have $$\mu_{t+1|t} \sim N(\mu_{t|t}, \Sigma_{t|t}+\sigma^2_{\eta}).$$ once the new data $y_{t+1}$ is observed, the above procedure can be repeated (obviously once $\sigma_e$ and $\sigma_{\eta}$ are estimated, generally using maximum likelihood method). This is the famous Kalman filter algorithm (1960). The choice of priors $\mu_{1|0}$ and $\Sigma_{1|0}$ requires some attention.

Properties of forecast error -
State error recursion -
State smoothing -
Missing Values -
Effect of Initialization -
Estimation - 

Regression assumptions

Everybody in finance knows that the 90% of quant work is 'REGRESSION' and mostly LINEAR. The results of a linear regression are as good as we understand their assumptions. For a univariate case we write $y_t = \alpha + \beta x_t + \epsilon_t$, where the estimation is straightforward. The interesting case is multivariate regression, where we write $$Y_t = \pmb{\beta}X_t +\pmb{\epsilon}_t.$$ To estimate the parameters we use the normal equation to get $$\pmb{\beta} = (X^TX)^{-1}X^TY$$ Now, how good is this an estimate? We want these estimates to be:
unbiased - The expected value of the estimate is the true value.
consistent - With more observations the distribution of the estimate becomes more concentrated near true value.
efficient - lessor observations are required to establish true value for given confidence.
asymptotically normal - With a lot of observations the distribution of the estimate is a normal distribution.

OLS is consistent when the regressors are exogenous and there is no perfect multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, OLS provides min-variance and mean-unbiased estimates, when the errors have finite variances. Aussuming errors have normal distribution, OLS is same as MLE. The expanded version of OLS is multi-fractional order estimator (like Kalman filter).

The 'random design' paradigm treats the regressors $x_i$ as random and sampled together with $y_i$ from some population. The 'fixed design' paradigm treats $X$ as known constants and $y$ is sampled conditionally on the values of $X$ as in an experiment. Practically, the distinction is unimportant and results in the same formula for estimation. 


  1. OLS minimizes error in dependent variable $y$ only and hence assumes there is no error in $x$.
  2. The functional dependence being modeled is valid.
  3. Strict exogeneity - The errors in regression have conditional mean zero: $E[\epsilon|X]=0$, which implies that error have mean zero: $E[\epsilon]=0$, and that the regressors are uncorrelated with the errors: $E[X^T\epsilon]=0$. If not true the OLS estimates are invalid. In that case use method of instrumental variables.
  4.  No linear dependence - The regressors in X must be linearly independent, i.e. X must be full rank almost surely. Sometimes we also assume that the regressors has finite moments up to second order, in such a case the matrix $X^TX$ will be finite and positive semi-definite. If violated the regressors are called perfectly multicollinear, $\beta$ can't be estimated, though prediction of $y$ is still possible.
  5. Spherical errors - It is assumed that $Var[\epsilon|X]=\sigma^2\pmb{I}_n$. IF violated OLS estimates are still valid, but no longer efficient. If error terms are don't have same variance, i.e. they are not homoscedastic Weighted least square is used. If there autocorrealation between error terms Generalized least squares is used.
  6. Normality - It is sometimes additionally assumed that errors have normal distribution. This is not required. Under this assumption OLS is equivalent to MLE and is asymptotically efficient in the class of all regular estimators.
Certain degree of correlation between the observations is very common, under which OLS and WLS are inefficient. GLS is the right thing to do: $$Y = X\beta + \epsilon \qquad E[\epsilon|X]=0, Var[\epsilon|X]=\Omega.$$ GLS estimates $\beta$ by minimizing the squared Mahalanobis length of the residual vector to give $$\hat{\beta}=(X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}Y.$$ The GLS estimator is unbiased, consistent, efficient and asymptotically normal. It is equivalent to applying OLS to linearly transformed version of data, which standardize and de-correlates the regressors. WLS is a special case of GLS.

To estimate GLS we use Feasible Generalized Least squares (FGLS) in two steps:
1) Model is estimated using OLS (consistent but inefficient) estimator, and the residuals are used to build a consistent estimator of the error covariance matrix;
2) Using these we estimate GLS.

FGLS is preferred only for large sample size. For small sample size it is better to stick to OLS. FGLS is not always consistent for small sample.

Saturday, August 15, 2015

Tsay Ch9 - Principal Component Analysis and Factor Models

Dimension reduction is essential to search for the underlying structure of the assets - called factors.

Three types of factor models -
1) Macroeconomic factor models - GDP growth, interest rates, inflation, unemployment - observable factors using regression
2) Fundamental factor models - firm size, book and market values, industrial classification.
3) Statistical factor models - non-observable  or latent variables e.g. PCA

General Factor Model

For $m$ factors, $k$ assets, and $T$ time periods let $r_{it}$ be the return of asset i at time period t. The factor model is
$$\pmb{r}_{t}=\pmb{\alpha}+\pmb{\beta}\pmb{f}_t+\pmb{\epsilon}_{t}, \qquad t = 1,...,T$$
where $\pmb{\beta}$ is a $k\times m$ factor loading matrix and $\pmb{\epsilon}_t$ is the error vector with $Cov{\pmb{\epsilon}_t}=\pmb{D}=diag[\sigma^2_1,...,\sigma^2_k]$, a $k\times k$ diagonal matrix. The covariance matrix of the returns $\pmb{r}_t$ is then given by:

Macroeconomic factor models

Macroeconomic factors are observable. We can convert the general factor model into Multiple Linear regression setup and estimate the factors. This estimation does not impose the constraint of $\epsilon_{it}$ being uncorrelated, so may not be efficient in general. The best known single factor model is the market model (Sharpe 1970). The $R^2$ can reach up to 50%, showing the significance of common market factor. One simple trick to compare factor based covariance matrix with sample covariance matrix is to use the global minimum variance portfolio (GMVP). For a given covariance matrix $\Sigma$, the GMVP $\omega$ solves $min\sigma^2_{p,\omega}=\omega^T\Sigma\omega$, such that $\omega^T\pmb{1}=1$ given by
It is also important to verify that the residual covariance matrices do not have large off-diagonal elements, to fit the factor model criteria.

Ross (1986) considers multi-factor model consisting of unexpected changes or surprises (e.g. residuals after fitting VAR(3) model to seasonally adjusted CPI and unemployment growth numbers). The explanatory power is low.

Fundamental factor models

BARRA factor method treats the observed asset specific fundamentals as the factor betas $\beta_i$, and estimates the factor $f_t$ at each time index $t$ via regression. Fama and French construct their factors based on hedge portfolio which depend on the fundamentals. For BARRA factor model $$\widetilde{\pmb{r}}_t = \pmb{\beta} \pmb{f}_t+\pmb{\epsilon}_t,$$ where $\widetilde{\pmb{r}_t}$ is the mean-corrected returns. We need WLS setup since the regression is not homogeneous, the estimate would be $$\hat{\pmb{f}_t}=(\beta D^{-1}\beta^T)^{-1}(\beta D^{-1}\beta^T\widetilde{r_t}).$$ We estimate the diagonal covariance matrix of errors from OLS first and then use it to estimate the factors using WLS equation. Cross-correlations in errors are ignored. The diagonal covariance matrix of final errors $\hat{D_g}$ and the covariance matrix of estimated factor realizations $\hat{\Sigma}$ can be used to derive the covariance matrix of the original returns as $$Cov(r_t)=\beta\hat{\Sigma}_f\beta^T+\hat{D_g}.$$ In practice, the sample mean or returns are not different from zero, so one may not need to remove the sample mean before fitting the BARRA factor model.

Fama-French approach used a two-step procedure. First, they sorted the assets based on the value of three fundamentals (market excess returns, small vs big cap, value vs growth stocks). They formed the hedge portfolio which is long top quintile and short the bottom quintile. The observed return on this hedge portfolio is the factor realization for the given asset. Finally, given the factor realizations calculate betas using regression.

Principal component analysis

We look to find linear combinations which explain the most variance and are orthogonal to each other, with weights summing to one. This is done on covariance or correlation matrix which are non-negative definite and hence have spectral decomposition. For covariance matrix the variance explained is $\lambda_i/\sum \lambda$, which becomes $\lambda_i/k$ for a correlation matrix, since $Tr(\rho_r)=k$.

Statistical factor analysis

The aim is to identify a few factors that can account for most of the variations in the covariance or correlation matrix of the data. The assumption of no serial correlations is all right for low frequency data but not accurate for higher frequencies. Serial correlations should first be removed parametrically. We then construct orthogonal factor model. Since both the loadings and the factors are unobservable it is different from other factor models. For the Statistical factor model $r_t - \mu = \beta f_t + \epsilon_t$, we have the assumptions $E[f_t]=0$, $Cov[f_t]=\pmb{I}_m$, $E[\epsilon_t]=0$, $Cov[\epsilon_t]=D=diag(\sigma^2_1,...,\sigma^2_k)$ and $E[f_t\epsilon^T]=0$. These are not uniquely determined. This can be estimated either using Principal Component Method or Maximum Likelihood estimation, with specified $k$. Factor rotation can be used for interpretation using varimax criteria.

Left out sections: 9.6

Tuesday, July 7, 2015

Tsay Ch4 - Nonlinear Models and their applications

I was going to leave out this chapter but may be not. I need to see what regime switching is all about.

Remember a purely stochastic time series $x_t$ is liner if it can be decomposed into a Wold form. Anything else is nonlinear. A general nonlinear functions is inapplicable due to too many parameters. We can restrict the model to the form $$x_t=g(F_{t-1})+\sqrt{h(F_{t-1})}\epsilon_t,$$ where $\epsilon_t=a_t/\sigma_t$ is the standardized innovation. ARMA is a linear model for mean while ARCH/GARCH models are nonlinear models of variance (because they model the square of volatility and not volatility). For stationary volatility series, shocks are uncorrelated but dependent. The models in this chapter extend to nonlinear model of mean.

The following four nonlinear models let the conditional mean $\mu_t$ evolve over time according to some simple parametric nonlinear function - Bilinear, Threshold autoregressive, State dependent, Markov switching. More recent nonlinear models are more data driven - nonlinear state-space, functional coefficient autoregressive, nonlinear additive autoregressive, multivariate adaptive regression spline. Finally, nonparametric methods like kernel regression and artificial neural networks have also been used. Test statistics - parametric (based on Lagrange multiplier or likelihood) and non-parametric (based on higher order spectra or dimension correlation) are also discussed.

Nonlinear models: Some basic ones for financial time series.

1) Bilinear Model: The linear model is simply the firs-order Taylor series expansion of $f(.)$ function. As such, a natural extension to non-linearity is to employ the second order terms in the expansion to improve the approximation. Hence the model is
$$x_t=c+\sum_{i=1}^p\phi_ix_{t-i}-\sum_{j=1}^q\theta_j a_{t-j}+\sum_{i=1}^m\sum_{j=1}^s\beta_{ij}x_{t-i}a_{t-j}+a_t.$$
This is generally analyzed by putting the model in state-space form. ARCH models generally seem to fit the data better, when quadratic terms of innovations are involved.

2) Threshold Autoregressive Model (TAR): Motivated by asymmetry in declining and rising patterns and hence is piece-wise linear (not in time space but threshold space). This can be used to model regimes. e.g. here is two regime model
\phi_1^{(1)}x_{t-1}+a_t, & \text{if } x_{t-l}<\delta\\
\phi_1^{(2)}x_{t-1}+a_t, & \text{if } x_{t-l}\ge \delta
Here $l$ is the delay parameter and $\delta$ is the threshold, which bifurcates the two regimes. There are several interesting characteristics of TAR models:

a) The process $x_t$ is geometrically ergodic and stationary,under the conditions $\phi_1^{(1)}<1$, $\phi_1^{(2)}<1$, and $\phi_1^{(1)\phi_1^{(2)}}<1$. A process is ergodic if its statistical properties (such as mean and variance) can be deduced from a single, sufficiently long sample (realization) of the process. Ergodic theorem: sample mean $\bar{x}=(\sum_t^Tx_t)/T$ of $x_t$ converges to the mean of $x_t$. This is regarded as the counterpart of the central limit theory for the iid case. 

b) The series exhibits an asymmetric increasing and decreasing pattern due to different coefficients in the two regimes, hence different regimes will have different number of observations. The series is hence not time-reversible. 

c) The model, in the example, has no constant terms, but $E(x_t)$ is not zero. $E(x_t)$ is a weighted average of the conditional means of the two regimes, which are nonzero. the weight of each regime is simply the probability that $x_t$ is in that regime under its stationary distribution. Hence for TAR model to have zero mean, nonzero constant terms are needed in some regimes, unlike in stationary linear model where nonzero constant implies nonzero mean for $x_t$.

Properties of TAR models are hard to obtain, but estimation is not difficult. US unemployment can be modeled using TAR models where the regime changes after a sudden increase in unemployment causing monetary intervention. It is also used to model asymmetric responses in volatility between positive and negative returns, study arbitrage tradings in index futures and cash prices. 

3) Smooth transition AR model (STAR): The mean is discontinuous for TAR, so using logistic, exponential or cumulative functions it can be made continuous and differentiable. These are hard to estimate with large standard errors. 

4) Markov switching autoregressive Model (MSA): Using probability of switching, emphasizing aperiodic transition between states, Hamilton (1989). 
c_1+\sum_{i=1}^p\phi_{1,i}x_{t-i}+a_{1t}, & \text{if } s_{t}=1\\
c_2+\sum_{i=1}^p\phi_{2,i}x_{t-i}+a_{2t}  & \text{if } s_{t}=2
where $s_t$ are the two states and is a first order Markov-chain with transition probability matrix
1-w_1 & w_1 \\
w_2 & 1-w_2
The expected duration of the process to stay in state $i$ is $1/w_i$. TAR model uses a deterministic scheme of transition while MSA uses a stochastic scheme. Hence, in MSA one is never certain about which state $x_t$ belongs to. This has important implication. MSA forecast are always some linear combination of the two states, while TAR model picks a particular state. These models are estimated using EM or MCMC algorithms. The model can be further generalized by modeling transition probabilities to be logistic, probit or some other function of explanatory variables available at time $t-1$.  Markov switching can hence be used to choose among many models. Taking the example of US quarterly GDP and fitting a MSA model we see the following observations:
a) The states correspond to growth and contraction.
b) Large std of state 2 implies that that there are relatively fewer observations in the contraction state.
c) It is more likely for US GNP to get out of contraction period than to jump into one.
d) Expected duration of expansion is 11 quarters while contraction is 4 quarters.

5) Non-parametric methods: The essence is smoothing, as they are highly data dependent and may otherwise overfit. If given many realizations of $y_t$ for a given $x_t$, an asymptotically accurate value of the functional dependence, $m(x)$, of $Y$ on $X$ at $X=x$ is given by average of $y_t$. In a time series only one observation of $y_t$ is available for a given $x_t$. But if the functional form is sufficiently smooth, then the values of $Y$ for which $X_t\approx x$ continues to provide accurate approximation of $m(x)$. We can use weighted average to estimate $m(x)$. Hence,

Kernel Regression: Kernel $K(x)$ is the weighting function, satisfying $K(x)\ge0$ and $\int K(x)dz=1$. For rescaling of distance measure, one can use bandwidth $h>0$ making the kernel $K_h(x)=K(x/h)/h$ and $\int K_h(z)dz=1$. The weight function simply becomes:
$$w_t(x)=\frac{K_h(x-x_t)}{\sum_{t=1}^T K_h(x-x_t)}$$, which gives the Nadarya-Watson kernel estimator (1964). Kernels chosen are generally, Gaussian, Epanechnikov. If $h$ is very small, then the weights focus on a few observations that are int he neighborhood around each $x_t$. If $h$ is very large, then the weights will spread over a larger neighborhood of x_t. Bandwidth selection is done by either minimizing the mean integrated square error (using some preliminary smoothing estimations) or leave-one-out cross validation. Fan and Yao (2003) give the bandwidth for Gaussian as $1.06\sigma_xT^{-0.2}$.

Local Linear Regression Method:- Takes the weighted kernel function but minimizes the least square error to estimate the parameters of the equation.
Functional coefficient AR model:- Treat each AR equation coefficient as function and estimate, using kernels for functions. Shown to have better 1-step ahead forecasting.
Nonlinear additive AR model:- To avoid curse of dimensionality use GAM style AR model.
Nonlinear State-Space model:- instead of linear matrix evolution, use nonlinear evolution of states, using MCMC.

Left of sections: 4.1.9,4.2-4.5

Tuesday, June 30, 2015

Tsay Ch8 - Multivariate Time Series Analysis and Its Applications

How markets are interrelated is important to understand the lead-lag relationship and under what circumstances they reverse or do not work. Various previous methods can be applied directly to the vector case, but some need attention.

Weak stationarity and cross-correlation matrices

k-dimensional time series $\pmb{r_t}=[r_{1t},...,r_{kt}]^T$ is weakly stationary if its first and second moments are time-invariant, $\pmb{\mu}=E(\pmb{r}_t)$ and $\pmb{\Gamma}_0=E[(\pmb{r}_t-\pmb{\mu})(\pmb{r}_t-\pmb{\mu})^T]$. Let $\pmb{D}$ be a $k\times k$ diagonal matrix consisting of the standard deviations of $r_{it}$. The lag-zero, cross-correlation matrix is defined as $\pmb{\rho}_0=\pmb{D}^{-1}\pmb{\Gamma}_0\pmb{D}^{-1}$, which is the correlation matrix. The lag-$l$ cross-covariance matrix $\pmb{r}_t$ is defined as $\pmb{\Gamma}_l=E[(\pmb{r}_t-\pmb{\mu})(\pmb{r}_{t-l}-\pmb{\mu})^T]$. For a weakly stationary series, the cross-covariance matrix $\pmb{\Gamma}_l$ is a function of $l$, not the time index $t$. The lag-$l$ cross-correlation matrix (CCM) is defined as $\pmb{\rho}_l=\pmb{D}^{-1}\pmb{\Gamma}_l\pmb{D}^{-1}$. To understand this better, notice
which is the correlation between $r_{it}$ and $r_{j,t-l}$. If $\rho_{ij}(l)\ne 0$ and $l>0$, we say that the series $r_{jt}$ leads the series $r_{it}$ at lag $l$. Similarly, if $\rho_{ji}(l)\ne 0$ and $l>0$ we say that the series $r_{it}$ leads the series $r_{jt}$ at lag $l$. The diagonal element $\rho_{ii}(l)$ is simply the lag-$l$ autocorrelation coefficient of $r_{it}$.

Some remarks are ($l>0$):
1) In general $\rho_{ij}(l) \ne \rho_{ji}(l)$ for $i\ne j$, because they are measuring two different lag relationships, implying $\pmb{\Gamma}_l$ and $\pmb{\rho}_l$ are in general not symmetric.
2) It is easy to see that $\pmb{\Gamma}_l=\pmb{\Gamma}_{-l}^T$ and $\pmb{\rho}_l=\pmb{\rho}_{-l}^T$. Hence it suffices in practice to consider the cross-correlation matrices $\pmb{\rho}_l$ for $l \ge 0$.

For the cross-correlation matrices {$\pmb{\rho}_l|l=0,1,...$}, the diagonal elements {$\rho_{ii}(l)|l=0,1,...$} are the autocorrelation function of $r_{it}$, the off-diagonal element $\rho_{ij}(0)$ measures the concurrent linear relationship between $r_{it}$ and $r_{jt}$, and for $l>0$ the off-diagonal element $\rho_{ij}(l)$ measures the linear dependence of $r_{it}$ on past value of $r_{j,t-l}$. Depending on the value in these matrices one and identify -
1) no linear relationship ($\rho_{ij}(l)=\rho_{ji}(l)=0, \forall l\ge0$),
2) concurrent correlation ($\rho_{ij}(0)\ne0$),
3) no lead-lag relationship ($\rho_{ij}(l)=\rho_{ji}(l)=0, \forall l>0$),
4) unidirectional relationship ($\rho_{ij}(l)=0,\forall l>0,\rho_{ji}(v)$ for some $v\ge0$), or
5) feedback relationship ($\rho_{ij}(l)\ne0$, for some $l>0,\rho_{ji}(v)\ne0$ for some $v\ge0$) .

Sample Cross-correlation matrices can be estimate using $\hat{\pmb{\rho}}_l=\hat{\pmb{D}}^{-1}\hat{\pmb{\Gamma}}_l\hat{\pmb{D}}^{-1}$, where
$$\hat{\Gamma}_l=\frac{1}{T}\sum_{t=l+1}^{T}(\pmb{r}_t-\bar{\pmb{r}})(\pmb{r}_{t-l}-\bar{\pmb{r}})^T, \qquad l\ge 0.$$ Bootstrapping can be used to get confidence intervals on finite samples.

Multivariate  Portmanteau tests: or Multivariate Ljung-Box test with statistic $Q(m)$ have the null hypothesis $H_0:\pmb{\rho}_1=...=\pmb{\rho}_m=\pmb{0}$, and $H_a: \pmb{\rho}_i\ne0$ for some $i \epsilon {1,...,m}$. The test statistic assumes the form
$$Q_k(m)=T^2\sum_{l=1}^m\frac{1}{T-l} tr(\hat{\Gamma}^T_l\hat{\pmb{\Gamma}}^{-1}_0\hat{\pmb{\Gamma}}_l
\hat{\pmb{\Gamma}}^{-1}_0)$$ and under some regularity conditions follows a chi-squared distribution with $k^2m$ degrees of freedom, asymptotically.

Vector autoregressive models (VAR)

A multivariate time series ${\pmb{r}_t}$ is a VAR process of order 1, or VAR(1) for short, if it follows the model $$\pmb{r}_t=\pmb{\phi}_0+\pmb{\Phi}\pmb{r}_{t-1}+\pmb{a}_t,$$ where $\pmb{\phi}_0$ is a k-dimensional vector, $\pmb{\Phi}$ is a $k\times k$ matrix, and ${\pmb{a}_t}$ is a sequence of uncorrelated random vectors with mean zero and covariance matrix $\pmb{\Sigma}$, which is positive definite (generally assumed to be multivariate normal).

Positive definite matrix is a symmetric matrix with all eigenvalues positive. Also for any vector $\pmb{b}$, we have $\pmb{b}^T\pmb{Ab}>0$. These types of matrices can be decomposed as $\pmb{A}=\pmb{P\Lambda P}^T$, where $\pmb{\Lambda}$ is a diagonal matrix consisting of eigenvalues of $\pmb{A}$ and $\pmb{P}$ is a square matrix consisting of eigenvectors of $\pmb{A}$. These eigenvectors are orthogonal to each other. Matrix $\pmb{P}$ is orthogonal and this decomposition is referred to as spectral decomposition. For a symmetric matrix $\pmb{A}$, there exists a lower triangular matrix $\pmb{L}$ with diagonal elements being 1 and a diagonal matrix $\pmb{G}$ such that $\pmb{A}=\pmb{LGL}^T$. If $\pmb{A}$ is positive definite, then the diagonal elements of $\pmb{G}$ are positive. In this case, we have $\pmb{A=L\sqrt{G}\sqrt{G}L}^T=\pmb{MM}^T$, where $\pmb{M=L\sqrt{G}}$ is a lower triangular matrix. This is called Cholesky decomposition. Notice that this implies $\pmb{L}^{-1}\pmb{A}(\pmb{L}^{-1})^{T}=\pmb{G}$.

Reduced and Structural form: In general the off diagonal elements of matrix $\pmb{\Sigma}$ show the concurrent relationship between $r_{1t}$ and $r_{2t}$, while the matrix $\pmb{\Phi}$ measures the dynamic dependence or $\pmb{r}_t$. This is called reduced-form model because it does not show explicitly the concurrent dependence between the component series. The explicit-form expression of concurrent relationship (for the last series and hence any series by rearrangement) can be deduced by a simple linear transformation. Using Cholesky decomposition (possible because $\pmb{\Sigma}$ is positive definite symmetric matrix) we can find a lower triangle matrix $\pmb{L}$ with unit diagonal elements such that $\pmb{\Sigma=LGL}^T$, where $\pmb{G}$ is a diagonal matrix. If we define $\pmb{b}_t = \pmb{L}^{-1}\pmb{a}_t$, then $E(\pmb{b}_t)=\pmb{L}^{-1}E(\pmb{a_t})=\pmb{0}$ and $Cov(\pmb{b}_t)=E(\pmb{b}_t\pmb{b}_t^T)=\pmb{L}^{-1}\pmb{\Sigma}(\pmb{L}^T)^{-1}=\pmb{G}$. Since $\pmb{G}$ is a diagonal matrix the components of $\pmb{b}_t$ are uncorrelated.

Pre-multiplying the reduced-form with $\pmb{L}^{-1}$, to uncouple the equations, we get
The last row of $\pmb{L}^{-1}$ has 1 as the last element, let it be $(w_{k1},w_{k2},..,w_{k,k-1},1)$, and hence the structural equation for the last ($k^{th}$) time series becomes:
$$r_{kt}+\sum_{i=1}^{k-1}w_{ki}r_{it}=\phi_{k,0}^*+\sum_{i=1}^k\Phi_{ki}^* r_{i,t-1}+b_{kt}.$$
This is possible because $\pmb{b}_t$ is a diagonal matrix and uncoupled. Reduced-form is commonly used for two reasons - ease in estimation, and concurrent correlations cannot be used in forecasting.

Stationarity condition and moments of a VAR(1) model: All eigenvalues of $\pmb{\Phi}$ should be less than 1 in modulus for weak stationarity for $\pmb{r}_t$, provided the covariance matrix of $\pmb{a}_t$ exists. Further we have $\pmb{\Gamma}_l=\pmb{\Phi \Gamma}_{l-1}$, for $l>0$, where $\pmb{\Gamma}_j$ is the lag-j cross-covariance matrix of $\pmb{r}_t$. By repeated substitutions we get $\pmb{\Gamma}_l=\pmb{\Phi}^l \pmb{\Gamma}_0$. Further for $\pmb{\Upsilon}=\pmb{D}^{-1/2}\pmb{\Phi}\pmb{D}^{1/2}$, we get $\pmb{\rho}_l=\pmb{\Upsilon}^l\pmb{\rho}_0$. A VAR(p) model is generally converted to a VAR(1) model using companion matrix and then analyzed like a VAR(1) model.

To find the order of a VAR model one can generally use the multi-variant equivalent of PACF with hypothesis tests on the successive residuals. The $i^{th}$ equation in the PACF is given by $\pmb{r}_t=\pmb{\phi}_0+\pmb{\Phi}_1\pmb{r}_{t-1}+...+\pmb{\Phi}_i\pmb{r}_{t-i}+\pmb{a}_t$. Parameters of these equations can be estimated by OLS method. For the $i^{th}$ equation let the OLS estimates of coefficients be $\hat{\pmb{\Phi}}_j^{(i)}$ and $\hat{\pmb{\phi}}_0^{(i)}$, where the superscript $(i)$ is used to denote the VAR(i) model. Then the residual is $\hat{\pmb{a}}_t^{(i)}=\pmb{r}_t-\hat{\pmb{\phi}}_0^{(i)}-\hat{\pmb{\Phi}}_1^{(i)}\pmb{r}_{t-1}-...-\hat{\pmb{\Phi}}_i^{(i)}\pmb{r}_{t-i}$. We then test the hypothesis sequentially to identify the order of VAR model. For $i^{th}$ and $(i-1)^{th}$ equations we test $H_0:\pmb{\Phi}_i=0$ versus $H_a:\pmb{\Phi}_i\ne0$, the test statistic is
$$M(i)=-\Bigg(T-k-i-\frac{3}{2}\Bigg)ln\Bigg(\frac{|\hat{\Sigma}_i|}{|\hat{\Sigma}_{i-1}|}\Bigg).$$ Asymptotically, $M(i)$ is distributed as a chi-squared distribution with $k^2$ degrees of freedom, where $k$ is the dimensionality of the asset universe.

Equivalent AIC and BIC methods can also be employed.  OLS or ML is generally used to estimate the parameters, the two methods being asymptotically equivalent. Once a model is fit, residuals should be tested for inadequacy using $Q_k(m)$ statistic (with $k^2m-g$ degrees of freedom). Forecasting is similar to the uni-variate case. Impulse response function is the MA version and can be derived to look at the decay rate. The MA equation is premultiplied by $\pmb{L}^{-1}$ to get the impulse response function of $\pmb{r}_t$ with the orthogonal innovations $\pmb{b}_t$. But different ordering may lead to different response functions, which is a drawback.

Vector moving-average models (VMA) :A VMA(1) model is given by $$\pmb{r}_t=\pmb{\theta}_0+\pmb{a}_t-\pmb{\Theta}\pmb{a}_{t-1},$$ where $\pmb{\theta}_0$ is k-dimensional vector, $\pmb{\Theta}$ is $k\times k$ matrix. Like the uni-variate case the cross-correlations cuts at order 1 and can be used to identify the order. Estimation of VMA model is lot more involved. The conditional or exact MLE approaches can be used.
Vector ARMA models (VARMA): In generalizing from uni-variate to vectors ARMA encounters the issue of identifiability - it may not be uniquely defined. Some constraints need to be imposed - structural specification. These are hardly used.
Marginal models of components: Given the vector models the component models are called the marginal models. For a k-dimensional ARMA(p,q)  model, the marginal models are ARMA(kp,(k-1)p+q) models.

Unit root nonstationarity and cointegration: When modeling several unit-root nonstationary time series jointly, one may encounter the case of cointegration. This may be due to the common trend or unit root of one of the components. In other words one can find a linear combination which is stationary. Let $h$ be the number of unit roots (or common trends) in the $k$-dimensional series $\pmb{x}_t$. Cointegration exists if $0<h<k$, and the quantity $k-h$ is called the number of cointegrating factors, these are the different linear combinations that are unit-root stationary. The linear combinations resulting in these unit-root stationary processes are called the cointegrating vectors. Two price series, if cointegrated, will have common underlying trend and we will lose this information if we take the first difference of the price series. This is because one difference per unit-root preserves useful information. Under cointegration we have more non-stationary series than uni-roots hence losing information. Overdifferencing can lead to the problem of unit roots in the MA matrix polynomial, invertability and estimation. Also, cointegration can also exist after adjusting for transaction costs and exchange-rate risk, which is artificial.

Error correction form: To overcome the difficulty of noninvertible VARMA models one can use this form. A VARMA(p,q) model is $$\pmb{x}_t=\sum_{i=1}^{p}\pmb{\Phi}_i\pmb{x}_{t-i}+\pmb{a}_t-\sum_{j=1}^q\pmb{\Theta}_j\pmb{a}_{t-j}.$$
For $\Delta \pmb{x}_t=\pmb{x}_t-\pmb{x}_{t-1}$, we can subtract $\pmb{x}_{t-1}$ from both side of VARMA equation to get the error correction form. \[\Delta\pmb{x}_t=\pmb{\alpha\beta}^T\pmb{x}_{t-1}+\sum_{i=1}^{p-1}\pmb{\Phi}_i^* \Delta\pmb{x}_{t-i}+\pmb{a}_t-\sum_{j=1}^q\pmb{\Theta}_j\pmb{a}_{t-j}\]
where $\pmb{\alpha}$ and $\pmb{\beta}$ are $k\times m$ full-rank matrix, k is the total asset dimension, m is the cointegrating factors ($m<k$).  The term $\pmb{\alpha\beta}^T\pmb{x}_{t-1}$ is called the error-correction term, as it compensates for the over-differentiating. Further, $\pmb{\beta}^T\pmb{x}_{t-1}$ is stationary. Also, $$\pmb{\Phi}_j^*=-\sum_{i=j+1}^{p}\pmb{\Phi}_i, \qquad j=1,...,p-1$$ $$\pmb{\alpha\beta}^T=\pmb{\Phi}_p+...+\pmb{\Phi}_1-\pmb{I}$$
The time series $\pmb{\beta}^T\pmb{x_t}$ is unit-root stationary, and the columns of $\pmb{\beta}$ are the cointegrating vectors of $\pmb{x}_t$.

Co-integrated VAR models: We focus on VAR models for their simplicity in estimation, to better understand cointegration here. For a k-dimensional VAR(p) model
where $\pmb{\mu}_0+\pmb{\mu}_1t$. Or equivalently, using backshift operator $B$
The characteristic polynomial in the above is represented as $\pmb{\Phi}(B)$. For a unit-root nonstationary process, 1 is a root making $|\pmb{\Phi}(1)|=0$ An error-correction form for this can be obtained by subtracting $\pmb{x}_{t-1}$ from both sides of the equation giving
\[\Delta \pmb{x}_t=\pmb{\mu}_t+\pmb{\Pi}\pmb{x}_{t-1}+\pmb{\Phi}_1^*\Delta\pmb{x}_{t-1}+...+\pmb{\Phi}_{p-1}^*\Delta \pmb{x}_{t-p+1}+\pmb{a}_t \]
Where $\pmb{\Pi}=\pmb{\Phi}_1+...+\pmb{\Phi}_p-\pmb{I}=-\pmb{\Phi}(1)$ and $\pmb{\Phi}_j^*=-\sum_{i=j+1}^p\pmb{\Phi}_i$ for $j=1,...,p-1$. Further if Rank($\pmb{\Pi}$)=0 implies that $\pmb{x}_t$ is not cointegrated, Rank($\pmb{x}_t$)=$k$ implies that ECM is not informative and one studies $\pmb{x}_t$ directly. Finally, if $0<$Rank($\pmb{\Pi}$)$=m<k$ then one can write $\pmb{\Pi}=\pmb{\alpha\beta}^T$, where $\pmb{\alpha}$ and $\pmb{\beta}$ are both of rank $m$. We have the case of cointegration with $m$ linearly independent cointegrated vectors $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$,a nd has $k-m$ unit roots or common trends.

To obtain the $k-m$ common trends vector of size $(k-m)\times 1$, $\pmb{y}_t=\pmb{\alpha}^T_{\bot}\pmb{x}_t$, we calculate the orthogonal matrix of size $k \times (k-m)$ as $\pmb{\alpha}^T_{\bot}\pmb{\alpha}=\pmb{0}$.To uniquely identify $\pmb{\alpha}$ and $\pmb{\beta}$ we require that $\pmb{\beta}^T=[\pmb{I}_m,\pmb{\beta}^T_1]$, where $\pmb{T}_m$ is a $m\times m$ identity matrix and $\pmb{\beta}_1$ is a $(k-m)\times m$ matrix. There are a few more constraints for the process $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$ to be unit-root stationary.

The rank of $\pmb{\Pi}$ in the ECM is the number of cointegrating vectors. Thus, to test for cointegration, once can examine the rank of $\pmb{\Pi}$, the approach taken in Johansen test.

Deterministic function: Limiting distributions of cointegration tests depend on the deterministic function $\pmb{\mu}_t$.
1) $\pmb{\mu}_t=\pmb{0}$: All components series of $\pmb{x}_t$ are $I(1)$ without drift and the stationary series $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$ has mean zero.
2) $\pmb{\mu}_t=\pmb{\alpha}\pmb{c}_0$: Components of $\pmb{x}_t$ are $I(1)$ without drift, but $\pmb{w}_t$ have a nonzero mean $-\pmb{c}_0$, called restricted constant.
3) $\pmb{\mu}_t=\pmb{\mu}_0$: Component series are $I(1)$ with drift $\pmb{\mu}_0$ and $\pmb{w}_t$ may have a nonzero mean.
4)  $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\alpha c}_1 t$: Components of $\pmb{x}_t$ are $I(1)$ with drift $\pmb{\mu}_0$ and $\pmb{w}_t$ has a linear time trend, called restricted trend.
5) $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\mu}_1t$: Both the constant and trend are unrestricted. The components of $\pmb{x_t}$ are $I(1)$ and have a quadratic time trend and $\pmb{w}_t$ have a linear trend.

Maximum likelihood estimation: Estimation of VAR(p) is quite involved. The deterministic term ($\pmb{x}_t$) and stationary terms ($\Delta\pmb{x}_t$) are first bifurcated and estimated using linear regression and have error terms are $\pmb{u}_t$ and $\pmb{v}_t$ respectively. A relevant eigenvalue problem when solved leads to a likelihood which when maximized gives the estimates of the coefficients.

Johansen test for cointegration: This is esentially testing the rank of the matrix $\pmb{\Pi}$, for a specified deterministic term $\pmb{\mu}_t$. The number of non-zero eigenvalues of $\pmb{\Pi}$ can be obtained if a consistent estimate of $\pmb{\Pi}$ is available. Looking at the ECM equation it is clear that $\pmb{\Pi}$ is related to the covariance matrix between $\pmb{x}_{t-1}$ and $\Delta\pmb{x}_t$ after adjusting for the effects of deterministic trend term and $\Delta\pmb{x}_{t-i}$ for $i=1,...,p-1$. Using canonical correlation analysis between the two adjusted equation, the squared correlation between them are calculated to be $\hat{\lambda}_i$. There are two versions of Johansen test:

1) Trace cointegration test:- $H_0$: Rank($\pmb{\Pi}$) = $m$ versus $H_a$: Rank($\pmb{\Pi}$)>$m$. The Likelihood ratio (LR) statistic is
$$LK_{tr}(m) = -(T-p)\sum_{i=m+1}^{k}ln(1-\hat{\lambda}_i)$$
Due to the presence of unit-roots, the asymptotic distribution of statistic is not chi-squared, but a function of standard Brownian motions. Thus, the critical values must be obtained via simulation.

2) Sequential test:- $H_0$: Rank($\pmb{\Pi}$) = $m$ versus $H_a$: Rank($\pmb{\Pi}$)=$m+1$. The $LK$ ratio test statistic, called the maximum eigenvalue statistic, is
Again, the critical values of the test statistics are nonstandard and must be evaluated via simulation.

Left out sections: 8.7, 8.8

Tsay Ch5 - High Frequency Data Analysis and Market Microstructure

Non-synchronous trading

For daily stock returns, non-synchronous trading can introduce
1) cross correlations between stock returns
2) serial correlation in a portfolio return
3) sometimes negative serial correlations in the return series of a stock.

Bid-ask spread

Introduces lag-1 serial correlation in an asset return, called bid-ask bounce.

Empirical characteristics

Aggregation don't show some of the characteristics of transactions data
1) unequally spaced time intervals - duration between trades might contain useful information about market micro-structure e.g. trading intensity
2) discrete-valued prices - along with limits
3) Existence of a daily periodic pattern - thinner during lunch hour.
4) Multiple transactions within a single second

Overnight stocks returns differ substantially from intraday returns (Stoll and Whaley 1990). Intraday trading has exploded with multiple transactions within second.

Models for price change

The discreteness and concentration on 'no change' make it difficult to model the intraday price changes. There are two models - ordered probit model (Hauseman, Lo and MacKinlay 1992) and a decomposition model (McCulloch and Tsay 2000). These models find prediction challenging, but are more used for understanding purposes.
Ordered Probit model - For $P_t^*$ being the fundamental value of the asset in a friction-less market and $P_t$ being the observed price, we define $y_i^*=P^*_{t_i}-P^*_{t_{i-1}}$ and model $y^*_i$ as a continuous random variable given by $y^*_i=\bf{x_i}\bf{\beta}+\epsilon_i$. The observed value $y_i$ can be categorized in ordered set ${s_1,...,s_k}$. Generally a normal distribution is assumed. The model can be estimated by maximum likelihood or MCMC methods. Explanatory variables $\bf{x_i}$ can be time duration, lagged prices, lagged SP500 price, bid-ask spread and direction, lagged volume. Volatility can be explained using duration and bid-ask spread as well.
A decomposition model (ADS) - indicator for price change, direction of price change, and the size of price change, $y_i = P_{t_i}-P_{t_{i-1}}=A_iD_iS_i$, where ordering is important. Each of these terms are modeled as logistic regression using explanatory variables and estimated using log likelihood.

Duration models - ACD

Concerned with time intervals between trades. Longer durations indicate lack of trading activities, which means no new information. Before the duration can be modeled the diurnal pattern has to be removed from the time series. This is done by calculating adjusted time duration $\Delta t^*_i=\Delta t_i/f(t_i)$. $f(t_i)=exp(\beta_0+\sum_1^7\beta_j f_j(t_i))$, where the $f_i$ are functions defined to take care of first 5 minutes, last 30 minutes, and mid period, depending on the asset and profile. We can then fit the autoregressive conditional duration model. $f(t_i)$ is commonly estimated using smoothing splines. One way is to use combination of quadratic functions and indicator variables to take care of deterministic components of daily trading activities.
$$f(t_i)=e^{d(t_i)} \qquad d(t_i)=\beta_0+\sum_1^7\beta_j f_j(t_i))$$
where, $f_1, f_2, f_3, f_4$ are quadratic functions fitted for specific data (Tsay pg 225).  $f_5$ and $f_6$ are indicator variables for the first and second 5 minutes of market open, and $f_7$ is the indicator for the last 30 minutes of daily trading. The coefficients can be determined by least sqaures method
$$ln(\delta t_i)=\beta_0+\sum_1^7\beta_j f_j(t_i)) +\epsilon_i$$
The autoregressive conditional duration (ACD) model uses the idea of GARCH models to study the dynamic structure of the adjusted duration $\Delta t^*_i$. For $x_i=\Delta t^*_t$ and $\psi_i=E(x_i|F_{i-1})$, the model is defined as $x_i=\psi_i\epsilon_i$, where $\epsilon_i$ follows a standard exponential (EACD) or a standard Weibull (WACD) distribution. Further, similar to GARCH, we have ACD(r,s) model
$$\psi_i=\omega+\sum^r_{j=1}\gamma_j x_{i-j}+\sum^s_{j=1}\omega_j \psi_{i-j}$$
with $\gamma_j=0$ for $j>r$ and $\omega_j=0$ for $j>s$. For stationarity $\omega>0$ and $1>\sum_j(\gamma_j+\omega_j)$.

EACD(1,1) model: $\epsilon_i$ as exponential distribution $x_i=\psi_i\epsilon_i$ and $\psi_i=\omega+\gamma_1x_{i-1}+\omega_1\psi_{i-1}$. We have $E(\epsilon_i)=1$ and $Var(\epsilon_i)=1$, implying $E(\epsilon_i^2)=2$. Assuming weak stationarity,
Hence, $1>2\gamma_1^2+\omega_1^2+2\gamma_1\omega_1$ for stationarity.

Bivariate models for price change and duration - PCD

Jointly modeling the price change and associated duration process. Focus on transactions that result in a price change $P_{t_i}=P_{t_{i-1}}+D_i S_i$, where $D_i$ is the direction change dummy and $S_i$ is the size change variable. It reduces the number of data point and there is no diurnal pattern in time durations between price changes. The PCD model decomposes the joint distribution of $(\Delta t_i, N_i, D_i, S_i)$ given $F_{i-1}$ as
$$f(\Delta t_i, N_i, D_i, S_i | F_{i-1})=f(S_i|D_i, N_i, \Delta t_i, F_{i-1}) f(D_i|N_i, \Delta t_i, F_{i-1}) f(N_i|\Delta t_i, F_{i-1}) f(\Delta t_i|F_{i-1})$$
where the $i^{th}$ transaction data consists of $\Delta t_i$ duration, $N_i$ number of trades in the period, $D_i$ direction of price change, $S_i$ size of price change in ticks. There are many ways to specify the conditional distributions depending on the asset under study. Using McCulloch and Tsay (2000) generalized linear models for discrete valued variables and time series model for continuous variable $ln(\Delta t_i)$ we get
$$ln(\Delta t_i) = \beta_0 + \beta_1 ln(\Delta t_{i-1}) + \beta_2 S_{i-1} + \sigma \epsilon_i.$$
Log transformation is added to ensure positiveness. Due to concentration of $N_i$ at 0, we partition the model for $N_i$ in tow parts.
$$p(N_i=0|\Delta t_i, F_{i-1}) = logit[\alpha_0+\alpha_1 ln(\Delta t_i)]$$
where $logit(x)=e^x/(1+e^x)$, whereas the second part of the model is
$$N_i|(N_i>0, \Delta t_i, F_{i-1}) \sim 1+g(\lambda_i) \qquad \lambda_i=\frac{e^{\gamma_0+\gamma_1 ln(\Delta t_i)}}{1+e^{\gamma_0 + \gamma_1 ln(\Delta t_i)}}$$
where $g(\lambda)$ denotes a geometric distribution with parameter $\lambda$, in interval (0,1). The model for direction $D_i$ is
$$D_i|(N_i, \Delta t_i, F_{i-1})=sign(\mu_i+\sigma_i\epsilon)$$
where $\epsilon$ is a $N(0,1)$ random variable, and
$$\mu_i=\omega_0+\omega_1 D_{i-1}+\omega_2 ln(\Delta t_i),$$
To allow for different dynamics between positive and negative price movements, we use different models for the size of a price change.
$$S_i | (D_i=-1, N_i, \Delta t_i, F_{i-1}) \sim p(\lambda_{d,i})+1 \qquad ln(\lambda_{d,i})=\eta_{d,0}+\eta_{d,1}N_i+\eta_{d,2}ln(\delta t_i)+\eta_{d,3}S_{i-1}$$
$$S_i | (D_i=1, N_i, \Delta t_i, F_{i-1}) \sim p(\lambda_{u,i})+1 \qquad ln(\lambda_{u,i})=\eta_{u,0}+\eta_{u,1}N_i+\eta_{u,2}ln(\delta t_i)+\eta_{u,3}S_{i-1}$$
where $p(\lambda)$ denotes a Poisson distribution with parameter $\lambda$, and 1 is added to the size because the minimum size is 1 tick when there is a price change. Estimation can be done either by maximum likelihood or MCMC methods.

Left out sections: 5.6

Tsay Ch3 - Conditional Heteroscedastic Models

Ch3 : Conditional Heteroscedastic Models

Concerned with modeling of volatility of an asset return. Volatility modeling, apart from being useful in options pricing, provides simple approach to calculate VaR in risk management. It also plays an important role in asset allocation under mean-variance framework. It can also improve the efficiency in parameter estimation and the accuracy in interval forecast. Implied volatility tend to be larger than that obtained by using a GARCH type of volatility model. Some characteristics are:
1) there exist volatility clusters
2) volatility jumps are rare
3) does not diverge to infinity
These implies that volatility is generally stationary.
4) leverage effect - volatility reacts differently to big price increases vs big price drops

Returns are generally serially independent but the square of them are not. The general mean equation is $r_t=\mu_t+a_t$ where,
$$\mu_t=\phi_0+\sum_{i=1}^k \beta_i x_{it}+\sum_{i=1}^p \phi_i r_{t-i}-\sum_{i=1}^q \theta_i a_{t-i}$$
where $k$, $p$ and $q$ are non-negative integers, and $x_{it}$ are explanatory variables (e.g. exogenous variables or dummy variables). The conditional variance (or volatility) is given by
The model of volatility (called volatility equation) can be of two types - one with exact function and the other with stochastic equation.

The first step is always to model the mean equation (e.g. ARMA) and check residuals for ARCH effects. We then fit volatility models and then jointly estimate the mean and volatility equations. Let the mean equation be $r_t=\mu_t+a_t$, with $\mu$ being the simple average of the series. The squared series $a_t^2$ is then used to check for conditional heteroscedasticity or ARCH effects. Two tests:
1) Ljung-Box test on $a_t^2$ series: $H_0:$ first m lags of ACF of $a_t^2$ series are zero.
2) Lagrange multiplier test of Engle: Equivalent to F-statistic for testing $\alpha_i=0$ in the linear regression $a_t^2=\alpha_0+\alpha_1 a^2_{t-1}+...+\alpha_m a^2_{t-m}+e_t$, for $t=m+1,...,T$

ARCH model: Shock $a_t$ is serially uncorrelated but dependent at the quadratic level as $a_t=\sigma_t\epsilon_t$, where
$$\sigma_t^2=\alpha_0+\alpha_1 a^2_{t-1}+...+\alpha_m a^2_{t-m}$$
where ${\epsilon_t}$ is a sequence of iid with mean zero and variance 1, $\alpha_0>0$ and $\alpha_i\ge 0$ for $i>0$. $\epsilon_t$ can be modeled using standard normal, student-t or generalized error distribution. Using heavy-tailed distribution will reduce the ARCH effect. This models volatility clustering.

Properties of ARCH(1): $E[a_t]=E[E[a_t|F_{t-1}]]=E[\sigma_t E[\epsilon_t]]=0$. Also, $Var(a_t)=\alpha_0/(1-\alpha_1)$. Making higher orders finite can impose further conditions, like the unconditional kurtosis is
Thus, ARCH(1) model admits heavier tails than that of a normal distribution. For kurtosis to be finite $\alpha^2_1\le 1/3$. The model assumes that the positive and negative shocks have same effects. They are likely to over-predict the volatility because they respond slowly to large isolated shocks.

Order determination: using PACF of $a^2_t$ one can select the ARCH order, unless the sample size is small. Estimation is done using log likelihood estimation. After fitting standardized residuals $a_t/\sigma_t$  should be checked for iid using Ljung-Box statistics, skewness, kurtosis and qqplot. Forecasting is similar to AR model.

GARCH model: To reduce the number of parameters we use GARCH model. A GARCH(m,s) is given by $a_t=\sigma_t \epsilon_t$ and
$$\sigma^2_t = \alpha_0 + \sum_{i=1}^m \alpha_i a^2_{t-i} + \sum_{j=1}^s \beta_j \sigma^2_{t-j}$$
 where again ${\epsilon_t}$ is an iid with mean 0 and variance 1, $\alpha_0>0$, $\alpha_i \ge 0$, $\beta_j \ge 0$, and $\sum_{i=1}^{max(m,x)}(\alpha_i+\beta_i)<1$. $\alpha_i$ and $\beta_j$ are referred to as ARCH and GARCH parameters, respectively. This shows similar properties as ARCH with ARMA like characteristics.

Difficult to specify the model. Only lower order models used. Log likelihood is used and the fitted model is checked using standardized residual $a_t/\sigma_t$ and its squared process. $a^2_{h+1}$ is a consistent estimate of $\sigma^2_{h+1}$ but is not an accurate estimate of the prediction, because a single observation of a random variable with a known mean and value cannot provide an accurate estimate of its variance.

IGARCH model: unit-root GARCH model. $\alpha_i+\beta_i=1$. The unconditional variance of $a_t$, hence that of $r_t$, is not defined under the IGARCH model. Theoretically, this might be caused by occasional level shifts in volatility. For the case of $\alpha_0=0$, IGARCH(1,1) model become exponentially smoothing of volatility model.

Stochastic volatility model: To ensure positiveness of the conditional variance, SV models use $ln(\sigma^2_t)$ instead of $\sigma^2_t$. It is defined as $a_t=\sigma_t \epsilon_t$
$$ln(\sigma^2_t)=\alpha_0+\alpha_1 ln(\sigma^t_{t-1})+\nu_t$$
where $\epsilon_t$ are iid $N(0,1)$, the $\nu_t$ are iid $N(0,\sigma^2_{\nu})$, ${\epsilon_t}$ and ${\nu_t}$ are independent, $\alpha_0$ is a constant and all zeros of the characteristic polynomial are greater than 1 in modulus.

Quasi-likelihood methods via Kalman filtering or MCMC methods are used for estimation. Limited experience shows that SV models often provided improvements in model fitting, but their contributions to out-of-sample volatility forecasts received mixed results.

Alternative approaches: Two approaches
1) High-Frequency Data - Use of high-frequency data to calculate volatility of low-frequency returns. The model for daily returns is unknown which can complicate the monthly estimation. Assuming $r_{i}$ are log daily returns for the monthly returns $r^m$, we have $r^m=\sum^n r_{t}$. For white noise we have $Var(r^m)=nVar(r_i)$. For MA(1) process it is $Var(r^m)=nVar(r_i)+2(n-1)Cov(r_i+r_{i+1})$. Further the sample size is only 21 days making the accuracy questionable. If the kurtosis and serial correlations are high the sample estimate is not even consistent.
This concept can be used to calculate daily volatility from intraday log returns. At this level the realized volatility ($\sum r^2$) approximately follows Gaussian ARIMA(0,1,q) model, which can be used to produce forecasts. Intuitively one would use the smallest possible interval, but anything less than 15 minutes bring in noise. Overnight returns also need to be taken into account for correct daily volatility estimation.
2) Daily OHLC Prices - Shown to improve volatility estimate. Let $C_t$, $O_t$, $H_t$, and $L_t$ be the Close, Open, High and Low for day t. Also let $f$ be the fraction of day that trading is closed. The conventional volatility estimate is $\sigma^2_t=E[(C_t-C_{t-1})^2|F_{t-1}]$. Based on price following simple diffusion model without drift various estimates of volatility has been proposed, one being $0.5(H_t-L_t)^2-[2ln(2)-1](C_t-O_t)^2$. Yang and Zhang (2000) also proposed an estimate on the lines of $$\hat{\sigma}^2_{yz}=\hat{\sigma}^2_o+k\hat{\sigma}^2_c+(1-k)\hat{\sigma}^2_{rs},$$
$$\hat{\sigma}^2_o = \frac{1}{n-1}\sum^n_{t=1}(o_t-\bar{o})^2 \qquad o_t=ln(O_t)-ln(C_{t-1})$$
$$\hat{\sigma}^2_c = \frac{1}{n-1}\sum^n_{t=1}(c_t-\bar{c})^2 \qquad c_t=ln(C_t)-ln(O_t)$$
$$\hat{\sigma}^2_{rs} = \frac{1}{n}\sum^n_{t=1}[u_t(u_t-c_t)+d_t(d_t-c_t)] \qquad u_t=ln(H_t/)_t) \qquad d_t=ln(L_t/O_t)$$
quantity k is chosen to minimize the variance of the estimator $\hat{\sigma}^2_{yz}$.

Left out sections: 3.7-3.11, 3.13-3.14, 3.16

Tsay Ch2 - Linear Time Series Analysis and Its Applications

Ch2: Linear Time Series Analysis and its Applications

We work with log returns $r_t$ as a collection of random variables over time, as a time series ${r_t}$.

Stationarity: The first step in time series analysis should always be checking stationarity.
1) Strictly stationary time series ${r_t}$ if the joint distribution of $(r_{t_1}, ...,r_{t_k})$ is identical to that of $(r_{t_1+t},...,t_{t_k+t})$ for all t and k, i.e. the joint distribution is invariant under time shift.
2) Weak stationarity: if both mean and covariance between $r_t$ and $r_{t-l}$ are time-invariant. Weak stationarity implies partial predictability. In other words, the first two moments of $r_t$ are finite and constant. A strictly stationary series is weakly stationary but not vice-versa unless $r_t$ is normally distributed.

Autocovariance: $\gamma_l=Cov(r_t,r_{t-l})$ with properties $\gamma_0=Var(r_t)$ and $\gamma_{-l}=\gamma_l$

Autocorrelation function (ACF): uncorrelated means $\rho_{x,y}=0$. It implies independence under normality. $\rho_l=\gamma_l/ \gamma_0$ giving $\rho_0=1$ and $\rho_{-l}=\rho_l$. Sample autocorrelation is given by
is asymptotically normal with mean zero and variance $1/T$, if $r_t$ is an iid sequence with finite second moment (Brockwell and Davis 1991).

Bartlett's formula: For a weakly stationary time series of the form $r_t=\mu+\sum^q_1\psi_i a_{t-i}$, where $\psi_0=1$, q is a non-negative integer, and ${a_j}$ are Gaussian white noise series, then $\hat{\rho}_l$ is asymptotically normal with mean $\mu$ and variance $(1+2\sum^q_1\rho^2_i)/T$ for $l \ge q$ (Box, Jenkins, Reinsel 1994). Hence, to test the autocorrelation for a given integer $l$, we can construct the statistic.

Ljung and Box (1978) test increases the power for finite samples by constructing a chi-squared distribution statistic with m degrees of freedom:
In practice, the selection of m may affect the performance of the $Q(m)$ statistic. $m\approx ln(T)$ provides better power performance.

Monthly returns of the value-weighted index seems to have stronger serial dependence than individual stock returns. CAPM suggests there is no autocorrelation in financial series. One has to be careful of the autocorrelation induced by the way index returns are determined or stock prices are determined - these are pseudo relationships.

White noise: time series ${r_t}$ is white noise if the sequence is iid with finite mean and variance. if the mean is 0 and variance $\sigma^2$, it is called Gaussian white noise. For a white noise series all the ACFs are zero. e.g. monthly individual stocks returns seem to be white noise but not value-weighted index return series.

Wold decomposition: A time series ${r_t}$ is said to be linear if it can be written as Wold decomposition
$$r_t=\mu+\sum_{i=0}^{\infty}\psi_i a_{t-i},$$
where $\mu$ is the mean of $r_t$, $\phi_0=1$, and ${a_t}$ is white noise series. If $r_t$ is weakly stationary, we can obtain the mean, variance and covariance easily. For weakly stationary series $\rho_l$ converges to zero as $l$ increases.

AR(p) model: Current conditional expectation of returns depend on the last p steps
$$r_t=\phi_o+\phi_1 r_{t-1}+...+\phi_pr_{r-p}+a_t$$
The necessary and sufficient condition for weakly stationarity is $|\phi_1|<1$ for AR(1) process.  The plot of ACF decays exponentially (extends beyond lag 1!). For higher order AR processes the inverse of the solutions to characteristic equation are called characteristic roots. Complex roots correspond to damped oscillations related to business cycles. The stationarity condition is that the characteristic roots are less than one in modulus (or all solutions to characteristic equation greater than one in modulus).

Identification of AR model: order determination of AR models. Two approaches:
1) Partial Autocorrelation Function (PACF): nested AR models with increasing order. For a true order of $p$, the terms $\hat{\phi}_{l,l}$ converge to zero for all $l>p$ with variance $1/T$. These can be plotted to determine the cutoff as a PACF plot.
2) Information criteria:  likelihood based criteria. e.g. the Akaike information criterion (AIC) 1973 is:
$$AIC = -\frac{2}{T}ln(likelihood)+\frac{2}{T}P$$
For a Gaussian AR($l$) model, AIC reduces to $ln(\hat{\sigma}_l^2)+2l/T$, where $\hat{\sigma^2_l}$ is the maximum likelihood estimate of $\sigma^2_a$, T is the sample size and P is the number of parameters. Another common criteria is Bayesian information criterion (BIC), which for Gaussian AR($l$) reduces to $ln(\hat{\sigma}_l^2)+ln(T)l/T$. BIC tends to select a lower AR model when the sample size is moderate to large.

Parameter estimation: OLS method, or likelihood method. Likelihood estimation is very similar to OLS but gives slightly lower $\sigma_a^2$ for finite samples.

Model checking: residual series should be white noise - check ACF of residual and Ljung-Box statistics with $m-g$ degrees of chi-squared distribution, where m is approximately $ln(T)$ and g is number of AR coefficients used in the model.

Goodness of fit: For a stationary series $R^2$ can be used as a goodness of fit. For unit-root non-stationary series $R^2$ converges to one regardless of the underlying model. Adjusted $R^2$ should be used to penalize the increased number of parameters.

Forecasting: Longer terms forecasts are more accurate and approach unconditional mean. The speed of mean reversion is measured by half life denoted by $k=ln(0.5/|\phi_1|)$.

MA(q) model: Moving average models. These are simple extension of white noise series, an infinite-order AR model with some parameter constraints. Bid-ask bounce in stock trading may introduce an MA(1) structure in a return series. The general form is
$$r_t=c_0+a_t-\theta_1 a_{t-1}-...-\theta_q a_{t-q}$$
It is a finite memory model with ACF zero beyond the first q. MA(q) is invertible if $|\theta| <1$, otherwise noninvertible. An MA model is always stationary. The coefficient of MA models are also called impulse response function.

Identification of AR model: Identified using ACF, provides information on the nonzero MA lags of the model.

Estimation: Likelihood methods:
1) conditional likelihood method: assumes initial shock to be zero.
2) exact likelihood method: initial shocks are estimated jointly with other parameters.

Forecasting: q step ahead forecast for MA(q) model is the mean and stays there.

ARMA(p,q) model: to reduce higher order parameters in AR or MA model, one can use ARMA model. Generally not used for returns but volatility models. ARMA(1,1) is
$$r_t-\phi_1 r_{t-1}=\phi_0+a_t-\theta_1 a_{t-1}$$
For the model to be meaningful we need $\phi_1 \ne \theta_1$. The ACF of an ARMA(1,1) behaves very much like that of an AR(1) model except that the exponential decay starts with lag 2 instead of lag 1. Consequently, the ACF of an ARMA(1,1,) does not cut off at any finite lag. The PACF of an ARMA(1,1) does not cut off at any finite lag either, like of MA(1), except that the exponential decay starts with lag 2. The stationarity condition for an ARMA(1,1) is same as that of an AR(1) model. A general ARMA(p,q) model is
$$r_t = \phi_0+\sum_{i=1}^{p}\phi_i r_{t-i}+a_t-\sum_{j=1}^{q}\theta_j a_{t-j}$$
There should be no common factors between the AR and MA polynomial, otherwise the order of the model can be reduced. For stationarity all solutions to AR polynomial should be less than 1 in absolute value.

Identification: EACF (Tsay, Tiao 1984) can be used. The first zero corner in the EACF table identifies the order. Estimation can then be done using conditional or exact likelihood. Ljung-Box statistics of the residuals can also be checked for the adequacy of the model.

Unit root nonstationarity: Price series, interest rate, FX rates are generally non stationary. They are called unit root non-stationary  series in time series literature. The best example is random-walk.

Random walk: A time series ${p_t}$ is a random walk if it satisfies
Looking it as and AR(1) process, it is a non-stationary as the coefficient is equal to 1. We call this a unit root nonstationary time series. This is a non-mean reverting model with prediction of price equal to the value at forecast origin, with variance of estimate $l\sigma_a^2$, where $l$ is the number of look ahead steps, which diverge to infinity as $l \to \infty$. The series has a strong memory, as the sample ACFs are all approaching 1 as the sample size increases.

Random walk with drift: The market index tends to have a small and positive mean. The model for log price becomes
Both the expected value $t\mu$ and variance $t\sigma_a^2$ increase with time t.

Trend-Stationary time series: $p_t = \beta_0+\beta_1 t +t_t$
The mean is $\beta_0+\beta_1 t$ and variance is finite and equal to $Var(r_t)$, unlike random walk non-stationary model with trend. This can be transformed into a stationary one by removing the time trend via a simple linear regression analysis.

ARIMA models: A conventional way to handle unit-root nonstationarity is to use differencing. A time series $y_t$ is said to be ARIMA(p,1,q) process if the change series $c_T=y_t-y_{t-1}$ is stationary and invertible ARMA(p,q) process. For example, log price series is an ARIMA process. Series with multiple roots may need multiple differencing.

Dickey-Fuller unit root test: $H_0:\phi_1=1$ versus $H_a:\phi_1<1$. The test statistic is the t-ratio of the least squares estimate of $\phi_1$ from the model $p_t=\phi_1 p_{t-1}+e_t$ giving
$$\hat{\phi}_1 = \frac{\sum_1^T p_{t-1}p_t}{\sum_1^T p^2_{t-1}}$$
$$\hat{\sigma}_e^2=\frac{\sum_1^T (p_t-\hat{\phi}_1 p_{t-1})^2}{T-1}$$
where $p_0=0$ and $T$ is the sample size. The t-ratio is
$$\frac{\hat{\phi}_1-1}{std(\hat{\phi}_1)} = \frac{\sum_1^T p_{t-1}e_t}{\hat{\sigma}_e\sqrt{\sum_1^T p^2_{t-1}}}$$

Left out sections: 2.8 Seasonal models, 2.9 Regression models with time series errors, 2.10 consistent covariance matrix estimation, 2.11 long-memory models

Tsay Ch1 - Financial time series and their characteristics

Here are my notes from the book 'Analysis of Financial Time Series' by Tsay. It's a good introductory book, mostly empirical with decent maths. Someday I will read Hamilton or Green and write that up too. 

Ch1: Financial Time series and their characteristics

Concerned with asset valuation. The uncertainty in the definition and statistical inference of returns and volatility plays important role.

Asset returns - various definitions. Let $P_t$ be the price of an asset at time index t. Assume no dividends.
1) One-period simple return: $R_t=\frac{P_t}{P_{t-1}}-1$
2) Multi-period simple return: $1+R_t[k]=\Pi_{j=0}^{k-1}(1+R_{t-j})$
3) Continuously compounded or log return: $r_t=ln(1+R_t)=ln(P_t/P_{t-1})=p_t-p_{t-1}$, and $r_t[k] = \sum_{i=0}^{k-1}r_{t-i}$
4) Portfolio return: $R_{p,t}=\sum_{i=1}^{N}w_i R_{it}$ and $r_{p,t}\approx\sum_{i=1}^{N}w_i r_{it}$
5) Dividend payment: $R_t=\frac{P_t+D_t}{P_{t-1}}-1$ and $r_t=ln(P_t+D_t)-ln(P_{t-1})$
6) Excess return: $Z_t=R_t-R_{0t}$ and $z_t=r_t-r_{0t}$

Distributional properties - N assets and T time indices.
Joint distribution:
$$F_{X,Y}(x,y;\theta)=P(X\le x, Y\le y; \theta)=\int_{-\infty}^x \int_{-\infty}^y f_{x,y}(w,z;\theta)dzdw$$
Marginal distributions: marginal distribution of $X$ is obtained by integrating out $Y$.
Conditional distributions:
$$F_{X|Y\le y}(x;\theta) = \frac{P(X\le x, Y \le y ; \theta)}{P(Y \le y; \theta)}$$
$$f_{x,y}(x,y;\theta)=f_{x|y}(x;\theta)\times f_y(y;\theta)$$
Independent variables imply $f_{x,y}(x,y;\theta)=f_x(x;\theta)f_y(y;\theta)$
Moments of a random variable:
The $l$th central moment of $X$ is defined as
Sample central moments:
sample mean: $\hat{\mu_x}=\frac{1}{T}\sum_{t=1}^{T}x_t$
sample variance: $\hat{\sigma_x^2}=\frac{1}{T-1}\sum_{t=1}^{T}(x_t-\hat{mu_x})^2$
sample skewness: $\hat{S_x}=\frac{1}{(T-1)\hat{\sigma}^3_x}\sum_{t=1}^{T}(x_t-\hat{mu_x})^3$, distributed normally with mean 0 and variance $6/T$ (Snedecor and Cochran 1980).
sample kurtosis: $\hat{K_x}=\frac{1}{(T-1)\hat{\sigma}^4_x}\sum_{t=1}^{T}(x_t-\hat{mu_x})^4$, distributed normally with mean 3 and variance $24/T$.

JB normality test: Both $\hat{S}(r)$ and $\hat{K}(r)-3$ are normally distributed and can have individual hypothesis tests for normality. Jarque and Bera (1987) combine the two tests and use the test statistic
which is asymptotically distributed as a chi-squared random variable with 2 degrees of freedom.

Distribution of returns - The most general model for the log returns ${r_{it}; i=1, ...,N; t=1,...,T}$ is its joint distribution funtion:
where $Y$ is a state vector summarizing the environment in which asset returns are determined and $\theta$ are parameters that uniquely determine the distribution function $F_r(.)$. Many times $Y$ is assumed given and the main concern is the conditional distribution of ${r_{it}}$ given $Y$. Some applications focus on the joint distribution of N returns at a single time index t, i.e. ${r_{1t},...,r_{Nt}}$. Other theories emphasize the dynamic structure of individual asset returns - the uni-variate case - ${r_{i1},...,r_{iT}}$ for a given asset $i$. For the uni-variate case the joint distribution is given as
$$F(r_1, ...,r_T;\theta)=F(r_1;\theta)\Pi_{t=2}^{T}F(r_t|r_{t-1},...,r_{1};\theta).$$
The main concern then is to specify the conditional distribution $F(r_t|r_{t-1},..,r{1};\theta)$. Different distributional specifications lead to different theories. For instance, if the conditional distribution is equal to the marginal distribution it is a random-walk hypothesis. It is customary to treat asset returns as continuous random variables, and use their probability density functions.

Different distributions on marginals give different models:
Normal distribution: $R_t$ iid and normal with fixed mean and variance - lower bound is wrong, multiperiod returns are not normal due to multiplication, empirical excess kurtosis.
Lognormal distribution: $r_t$ iid and normal with fixed mean $\mu$ and variance$\sigma^2$. The simple returns $R_t$ are then lognormal with:
$$E[R_t]=e^{(\mu+\frac{\sigma^2}{2})}-1 \qquad Var[R_t]=e^{2\mu+\sigma^2}[e^{\sigma^2}-1]$$
summation is also normally distributed, there is no lower bound, but is not consistent with excess kurtosis.
Stable distributions: distribution where a linear combination has the same distribution. Summation and lower bound satisfied but has infinite variance, e.g. Cauchy distribution $f(x)=1/(\pi(1+x^2))$.
Scale mixture of normal distributions: finite mixtures of normal distributions $r_t \sim (1-X)N(\mu,\sigma^2_1)+XN(\mu,\sigma^2_2)$, where $X$ is a Bernoulli random variable and $\sigma^2_2$ is relatively large. Maintain tractability of normal under addition, capture excess kurtosis and finite higher moments, but hard to estimate the mixture parameters. For multivariate returns covariance becomes critical. Everything becomes vectors.

log likelihood function of returns with normal distribution:
$$ln f(r_1,...,r_T;\theta)=ln f(r_1;\theta)-\frac{1}{2}\sum_{t=2}^T(ln(2\pi)+ln(\sigma_t^2)+\frac{(r_t-\mu_t)^2}{\sigma^2_t})$$
This assumes iid of each observation.

Empirical properties of returns:
a) Daily returns have higher kurtosis than monthly returns. For monthly returns market indexes have higher kurtosis than individual stocks.
b) For daily returns market indexes have smaller standard deviation than individual stocks.
c) Empirical returns density is taller, but with fatter tails.

Processes considered: return series, conditional volatility (e.g. clustering), extreme behaviors (frequency, size, impact).