#### Background

I was fixing some bugs in determining the number of lags in statistical tests for time series, and remembered that we still haven't optimized our sequential regressions, so I opened an issue. I learned a lot of linear algebra through the numpy and scipy mailing lists. The relevant idea and thread is here; quoting myself:

> But from your description of QR, I thought specifically of the case where we have a "natural" ordering of the regressors, similar to the polynomial case of you and Anne. In the time series case, it would be by increasing lags: y_t on y_{t-1}; y_t on y_{t-1}, y_{t-2}; ...; y_t on y_{t-1}, ..., y_{t-k} for k = 1, ..., K; or y_t on x_t and the lags of x_t. This is really sequential LS with a predefined sequence, not PLS or PCA/PCR or similar orthogonalization by "importance". The usual procedure for deciding on the appropriate number of lags loops over OLS with an increasing number of regressors. From the discussion, I thought there might be a way to "cheat" on this using QR and Gram-Schmidt.

I never got around to trying this out, but I thought I'd give it a try today. Some hours of trial and error and working on some algebra later, I have what looks like a reasonable solution.
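To make the time-series setting concrete, here is a minimal sketch of building the matrix of lagged regressors. The `lagmat` helper below is written out for illustration only (statsmodels ships a more general version, `statsmodels.tsa.tsatools.lagmat`); the sequential regressions then use the first k columns of this matrix for k = 1, ..., K.

```python
import numpy as np

def lagmat(y, maxlag):
    """Stack the columns y_{t-1}, ..., y_{t-maxlag}, dropping the first
    maxlag observations so that all columns line up."""
    nobs = len(y)
    return np.column_stack([y[maxlag - k:nobs - k] for k in range(1, maxlag + 1)])

y = np.arange(6.0)
xlags = lagmat(y, 2)
# rows correspond to t = 2, ..., 5; columns are y_{t-1} and y_{t-2}:
# [[1. 0.]
#  [2. 1.]
#  [3. 2.]
#  [4. 3.]]
```

Regressing `y[2:]` on `xlags[:, :k]` with growing k is exactly the sequential least squares problem described above.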

#### Using QR Decomposition for Sequential Regression

The following is a commented walk through my current script; the core is just a few lines. First, some imports:

```python
import numpy as np
from statsmodels.regression.linear_model import OLS
from numpy.testing import assert_almost_equal
```

Then, we define a toy example to try out whether it works. The last two variables have no effect. I used that to get the minimum of the information criteria, aic and bic, to be in the interior.

```python
nobs, k_vars = 50, 4
np.random.seed(85325783)
x = np.random.randn(nobs, k_vars)
y = x[:, :k_vars-2].sum(1) + np.random.randn(nobs)
```

We start with the boring way of doing the sequential regression: use OLS from statsmodels and loop with an expanding number of explanatory variables in exog.

Note, this uses the generalized inverse, `pinv`, for each new matrix of explanatory variables. I'm storing parameters, residual sum of squares and information criteria to have a benchmark to compare against later.

```python
ssr_ols = np.zeros(k_vars)
params_ols = np.zeros((k_vars, k_vars))
ic_ols = np.zeros((2, k_vars))
for i in range(k_vars):
    res = OLS(y, x[:, :i+1]).fit()
    ssr_ols[i] = res.ssr
    params_ols[i, :i+1] = res.params
    ic_ols[0, i] = res.aic
    ic_ols[1, i] = res.bic
```

In my example, the estimated coefficients are

```
>>> params_ols
array([[ 0.7585129 ,  0.        ,  0.        ,  0.        ],
       [ 1.00564191,  0.96414302,  0.        ,  0.        ],
       [ 1.00776594,  0.93035613, -0.10759121,  0.        ],
       [ 1.03379054,  0.91302697, -0.12998046, -0.08787965]])
```

The first two coefficients are close to one, the last two coefficients are close to zero, all in the neighborhood of the true values of my simulated model.

Instead of using `pinv`, statsmodels also has the option to use QR for estimating a linear model, with the basic code as the following. This uses the QR decomposition of the matrix of explanatory variables.

```python
q, r = np.linalg.qr(x)
qy = np.dot(q.T, y)
print('\nparams full model')
print(params_ols[-1])
print(np.linalg.solve(r, qy))
```

It gives the same result as the `pinv` version

```
params full model
[ 1.03379054  0.91302697 -0.12998046 -0.08787965]
[ 1.03379054  0.91302697 -0.12998046 -0.08787965]
```

We already have the QR decomposition for the full model. QR calculates the orthogonalization using the sequence as it is given by the matrix of explanatory variables.
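As a side note (not part of the original script), the defining properties of the reduced QR decomposition used here can be checked numerically in a few lines, with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 4))
q, r = np.linalg.qr(x)  # reduced QR: q is (50, 4), r is (4, 4)

# q has orthonormal columns and, together with r, reproduces x
assert np.allclose(q.T @ q, np.eye(4))
assert np.allclose(q @ r, x)
# r is upper triangular: nothing below the diagonal
assert np.allclose(np.tril(r, -1), 0)
```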

My first attempt was to find the OLS parameters sequentially, using the appropriate subsets or slices of the QR matrices:

```python
params1 = np.zeros((k_vars, k_vars))
for i in range(k_vars):
    params1[i, :i+1] = np.linalg.solve(r[:i+1, :i+1], qy[:i+1])
```

This gives us the same parameter estimates as the OLS loop. The computations are on much smaller arrays than the original problem: `r` is `(k_vars, k_vars)` and `qy` is just one dimensional with length `k_vars`.

There is still one loop, although a much smaller one, and I want to get rid of it. The following is trivial once we know what's going on, but it took me a while to figure this out. Some observations about QR:

- `q` is the orthogonal matrix that spans the same space as `x`.
- `theta = r beta = qy` is the parameter vector in the orthogonalized problem.
- `beta = inv(r) theta` is the parameter vector for the original regressors.

All we need to do is select the sequentially increasing subsets of `r` and `theta`. For calculating predicted values, we can work directly in the orthogonalized system.
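These identities are easy to confirm numerically. A standalone sketch, with freshly simulated data rather than the seed used in the script above:

```python
import numpy as np

rng = np.random.default_rng(12345)
nobs, k_vars = 50, 4
x = rng.standard_normal((nobs, k_vars))
y = x[:, :2].sum(1) + rng.standard_normal(nobs)

q, r = np.linalg.qr(x)
theta = q.T @ y  # parameter vector of the orthogonalized problem
beta = np.linalg.lstsq(x, y, rcond=None)[0]  # OLS parameters of the original problem

# theta = r beta = q'y  and  beta = inv(r) theta
assert np.allclose(r @ beta, theta)
assert np.allclose(np.linalg.solve(r, theta), beta)
```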

Using an upper triangular matrix of ones

```python
upp_tri = (np.arange(4)[:, None] <= np.arange(4)).astype(float)
```

we can get the expanding `theta` or `qy` in a matrix

```
>>> upp_tri * qy[:, None]
array([[ 4.87045847,  4.87045847,  4.87045847,  4.87045847],
       [-0.        , -6.94085588, -6.94085588, -6.94085588],
       [ 0.        ,  0.        ,  0.70410517,  0.70410517],
       [ 0.        ,  0.        ,  0.        ,  0.60930516]])
```

Then, we can just solve the system for all steps at once. First, I tried the matrix inverse, since that's easier to understand theoretically:

```python
params2 = np.dot(np.linalg.inv(r), upp_tri * qy[:, None]).T
```

But we can avoid the inverse and use `linalg.solve` directly:

```python
params2 = np.linalg.solve(r, upp_tri * qy[:, None]).T
```
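Since `r` is upper triangular, the general-purpose `np.linalg.solve` does more work than strictly necessary; a possible further refinement (not in the original script) is `scipy.linalg.solve_triangular`, which uses back-substitution only. A self-contained sketch with its own simulated data:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
nobs, k_vars = 50, 4
x = rng.standard_normal((nobs, k_vars))
y = x[:, :2].sum(1) + rng.standard_normal(nobs)

q, r = np.linalg.qr(x)
qy = q.T @ y
upp_tri = (np.arange(k_vars)[:, None] <= np.arange(k_vars)).astype(float)

# back-substitution only, exploiting that r is upper triangular
params_tri = solve_triangular(r, upp_tri * qy[:, None]).T
params_gen = np.linalg.solve(r, upp_tri * qy[:, None]).T
assert np.allclose(params_tri, params_gen)
```

For `k_vars = 4` the difference is negligible, but it avoids the LU factorization step for larger systems.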

**That's it**: one line, and I have the parameters for the four sequential regression problems. Comparing the parameter estimates of the three methods, we see they are the same (at floating point precision):

```
>>> np.max(np.abs(params1 - params2))
0.0
>>> np.max(np.abs(params1 - params_ols))
5.5511151231257827e-16
```

The rest is calculating fitted values and residuals:

```python
contrib = q * qy
fitted = contrib.cumsum(1)
resids = y[:, None] - fitted
ssrs = (resids**2).sum(0)
print('\nresidual sum of squares ssr')
print(ssrs)
print(ssr_ols)
```

which gives us identical residual sums of squares, `ssr`:

```
[ 80.31708213  32.14160173  31.64583764  31.27458486]
[ 80.31708213  32.14160173  31.64583764  31.27458486]
```

I added some asserts in the script, so I don't break anything by accident:

```python
assert_almost_equal(ssrs, ssr_ols, decimal=13)
assert_almost_equal(params1, params_ols, decimal=13)
assert_almost_equal(params2, params_ols, decimal=13)
```

The applications that I have in mind for this are selection of the number of regressors or lags in the time series case, so I want to calculate the information criteria, `aic` and `bic`. In the previous result, I had calculated the residual sum of squares, which I can feed to helper functions that I had written to calculate information criteria.

The values cannot be directly compared to the results from OLS, since `aic` and `bic` reported by the OLS Results instance are based on the likelihood value, while here I use the `ssr`-based definition, which uses a different normalization and drops a constant term. However, the minimum is achieved at two variables in all cases, which was expected since I selected the simulated model for this.
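Written out, the `ssr`-based criteria amount to the following sketch, which I reconstructed from the printed values below; the helper functions in `eval_measures` may use a slightly different but equivalent normalization:

```python
import numpy as np

# residual sums of squares from the sequential regressions above
ssrs = np.array([80.31708213, 32.14160173, 31.64583764, 31.27458486])
nobs = 50
k = np.arange(1, 5)  # number of regressors in each step

# ssr-based criteria; using log(ssr) instead of log(ssr / nobs) shifts
# both criteria by the same constant, so the location of the minimum is unchanged
aics = np.log(ssrs) + 2. * k / nobs
bics = np.log(ssrs) + np.log(nobs) * k / nobs

# both criteria pick two variables (index 1)
assert np.argmin(aics) == 1 and np.argmin(bics) == 1
```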

```python
print('\naic, bic')
print(ic_ols)
import statsmodels.tools.eval_measures as evm
dfmodelwc = range(1, k_vars+1)  # number of variables in regression
aics = [evm.aic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
bics = [evm.bic_sigma(ssrs[i], nobs, dfmodelwc[i]) for i in range(k_vars)]
print(aics)
print(bics)
print('aic, bic minimum number of variables')
print(np.argmin(ic_ols, 1) + 1)
print(np.argmin(aics) + 1, np.argmin(bics) + 1)
```

prints

```
aic, bic
[[ 167.59181941  123.80026281  125.02303444  126.43299217]
 [ 169.50384241  127.62430882  130.75910345  134.08108419]]
[4.4259823271765022, 3.5501511951835187, 3.5746066277468964, 3.6028057824070068]
[4.4642227872850651, 3.6266321154006445, 3.689328008072585, 3.7557676228412582]
aic, bic minimum number of variables
[2 2]
2 2
```

PS: Since I never had any numerical linear algebra training, I can still get excited about figuring out a two-liner. I could have tried to find a book for it, but that would have been boring compared to working it out on my own.

PS: Following PEP 8, I'm used to not using capital letters very often.