## Sunday, August 16, 2015

### Regression assumptions

Everybody in finance knows that about 90% of quant work is 'REGRESSION', and mostly LINEAR. The results of a linear regression are only as good as our understanding of its assumptions. For the univariate case we write $y_t = \alpha + \beta x_t + \epsilon_t$, where the estimation is straightforward. The interesting case is multivariate regression, where we write
$$Y = X\pmb{\beta} + \pmb{\epsilon}.$$
To estimate the parameters we use the normal equations to get
$$\hat{\pmb{\beta}} = (X^TX)^{-1}X^TY.$$
Now, how good is this estimate? We want the estimates to be:

1. unbiased - The expected value of the estimate is the true value.
2. consistent - With more observations the distribution of the estimate becomes more concentrated around the true value.
3. efficient - Fewer observations are required to pin down the true value at a given confidence level.
4. asymptotically normal - With a large number of observations the distribution of the estimate approaches a normal distribution.
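As a minimal sketch of the normal-equation estimate, assuming NumPy and a made-up synthetic data set (the sample size, coefficients and noise level are illustrative only), one can solve $(X^TX)\hat{\beta} = X^TY$ directly and cross-check against a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: an intercept column plus two regressors.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's SVD-based least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Solving the linear system rather than inverting $X^TX$ explicitly is usually the more numerically stable choice; both routes should agree to rounding error when $X$ is well conditioned.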

OLS is consistent when the regressors are exogenous and there is no perfect multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, and provided the errors have finite variances, OLS gives minimum-variance, mean-unbiased estimates. Assuming the errors are normally distributed, OLS coincides with the MLE. An expanded version of OLS is the multi-fractional order estimator (like the Kalman filter).

The 'random design' paradigm treats the regressors $x_i$ as random and sampled together with $y_i$ from some population. The 'fixed design' paradigm treats $X$ as known constants, with $y$ sampled conditionally on the values of $X$, as in an experiment. In practice the distinction is unimportant: both lead to the same estimation formula.

#### Assumptions

1. OLS minimizes the error in the dependent variable $y$ only, and hence assumes there is no error in $x$.
2. The functional dependence being modeled is valid.
3. Strict exogeneity - The errors in the regression have conditional mean zero: $E[\epsilon|X]=0$, which implies that the errors have mean zero, $E[\epsilon]=0$, and that the regressors are uncorrelated with the errors, $E[X^T\epsilon]=0$. If this does not hold, the OLS estimates are invalid. In that case use the method of instrumental variables.
4. No linear dependence - The regressors in $X$ must be linearly independent, i.e. $X$ must have full column rank almost surely. Sometimes we also assume that the regressors have finite moments up to second order, in which case the matrix $X^TX$ is finite and positive semi-definite. If the rank condition is violated, the regressors are called perfectly multicollinear and $\beta$ cannot be estimated, though prediction of $y$ is still possible.
5. Spherical errors - It is assumed that $Var[\epsilon|X]=\sigma^2\pmb{I}_n$. If this is violated, the OLS estimates are still valid but no longer efficient. If the error terms do not share the same variance, i.e. they are heteroscedastic, weighted least squares (WLS) is used. If there is autocorrelation between the error terms, generalized least squares (GLS) is used.
6. Normality - It is sometimes additionally assumed that the errors are normally distributed. This is not required. Under this assumption OLS is equivalent to MLE and is asymptotically efficient in the class of all regular estimators.
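The remark in assumption 4, that under perfect multicollinearity $\beta$ is not identified while prediction of $y$ still works, can be illustrated with a small sketch. The data below are hypothetical, and the second regressor is deliberately constructed as an exact multiple of the first:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data in which x2 is an exact linear copy of x1.
n = 200
x1 = rng.normal(size=n)
x2 = 2.0 * x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# X has three columns but only two are linearly independent.
rank = np.linalg.matrix_rank(X)

# lstsq still returns a least-squares solution: the minimum-norm one.
b_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Shifting along the null-space direction (0, 2, -1) gives another
# coefficient vector with exactly the same fitted values.
b_other = b_min_norm + 5.0 * np.array([0.0, 2.0, -1.0])

fitted_a = X @ b_min_norm
fitted_b = X @ b_other

print(rank)
print(np.max(np.abs(fitted_a - fitted_b)))
```

Two different coefficient vectors give the same fit, which is why the individual coefficients cannot be interpreted even though the fitted values are well defined.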

A certain degree of correlation between the observations is very common, in which case OLS and WLS are inefficient. GLS is the right thing to do:
$$Y = X\beta + \epsilon, \qquad E[\epsilon|X]=0, \quad Var[\epsilon|X]=\Omega.$$
GLS estimates $\beta$ by minimizing the squared Mahalanobis length of the residual vector, which gives
$$\hat{\beta}=(X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}Y.$$
The GLS estimator is unbiased, consistent, efficient and asymptotically normal. It is equivalent to applying OLS to a linearly transformed version of the data, where the transformation standardizes the scale of the errors and de-correlates them. WLS is the special case of GLS in which $\Omega$ is diagonal.
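The equivalence between the closed-form GLS estimate and OLS on transformed data can be checked numerically. The sketch below assumes a known $\Omega$ with an AR(1)-style structure, $\Omega_{ij} = \rho^{|i-j|}$; that choice, and all the numbers, are illustrative assumptions rather than anything prescribed above. The transformation used is $L^{-1}$, where $\Omega = LL^T$ is the Cholesky factorization.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])

# Assumed known error covariance: Omega[i, j] = rho^|i - j|.
rho = 0.7
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# If Omega = L L^T and z ~ N(0, I), then L z has covariance Omega.
L = np.linalg.cholesky(Omega)
y = X @ beta_true + L @ rng.normal(size=n)

# Closed form: (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y.
Oinv_X = np.linalg.solve(Omega, X)
beta_gls = np.linalg.solve(X.T @ Oinv_X, Oinv_X.T @ y)

# Whitening: premultiply by L^{-1}, then run plain OLS.
X_w = np.linalg.solve(L, X)
y_w = np.linalg.solve(L, y)
beta_whitened, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)

# Plain OLS on the untransformed data, for comparison.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def mahalanobis_sq(b):
    """Squared Mahalanobis length of the residual vector for coefficients b."""
    r = y - X @ b
    return r @ np.linalg.solve(Omega, r)

print(beta_gls, beta_whitened, beta_ols)
```

Here `beta_gls` and `beta_whitened` should agree to rounding error, and the Mahalanobis objective at `beta_gls` cannot exceed its value at `beta_ols`, since GLS is by construction the minimizer of that objective.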

Since $\Omega$ is usually unknown, in practice we use Feasible Generalized Least Squares (FGLS), in two steps:
1) The model is estimated by OLS (consistent but inefficient), and the residuals are used to build a consistent estimator of the error covariance matrix;
2) Using this estimated covariance matrix, we compute the GLS estimate.

FGLS is preferred only for large sample sizes. In small samples it can be less efficient than OLS, so it is better to stick to OLS. Note also that the FGLS estimator is not always consistent.
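The two FGLS steps can be sketched as follows, under the assumption that the errors follow an AR(1) process so that the covariance matrix is determined, up to scale, by a single parameter $\rho$. The AR(1) structure, the simulated data and the simple lag-one estimator of $\rho$ are illustrative choices, not the only way to build the covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# Simulate AR(1) errors: e_t = rho * e_{t-1} + u_t (assumed structure).
rho_true = 0.6
u = rng.normal(size=n)
e = np.empty(n)
e[0] = u[0] / np.sqrt(1.0 - rho_true**2)  # start from the stationary distribution
for t in range(1, n):
    e[t] = rho_true * e[t - 1] + u[t]
y = X @ beta_true + e

# Step 1: OLS, then estimate rho by regressing residuals on their own lag.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
rho_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])

# Step 2: plug the estimated covariance (up to scale) into the GLS formula.
idx = np.arange(n)
Omega_hat = rho_hat ** np.abs(idx[:, None] - idx[None, :])
Oinv_X = np.linalg.solve(Omega_hat, X)
beta_fgls = np.linalg.solve(X.T @ Oinv_X, Oinv_X.T @ y)

print(rho_hat)
print(beta_ols, beta_fgls)
```

The overall scale of `Omega_hat` does not matter for the coefficient estimate because it cancels in the GLS formula, which is why only $\rho$ is estimated here.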