Sunday, January 29, 2012

Influence and Outlier measures in regression

Suppose we run a simple regression, and want to know whether there are possible outliers, or observations that have a large influence on the estimated parameters. In the past, I added several specification test to statsmodels. These are tests that check whether your model assumptions are "correct" or whether there are violations.
(This is sloppy phrasing, they are statistical tests after all.)

What is missing so far in statsmodels are the usual outlier and influence measures. (Although there are robust linear models.)

I found a nice example for influence and outlier measures using SAS in

Trying to do something similar as an extension to statsmodels, I ended up with the table below. This is supposed to correspond to the bottom tables on the above linked page.
All numbers look the same, but I'm still missing `Cov Ratio` and haven't looked for the prediction intervals yet. I also have dfbetas, but ran out of time to add them nicely to the table. Some of the measures have associated pvalues or recommended thresholds that can be used to interpret the results and find which observations might "mess up" the estimation.

Additionally, I also coded Variance Inflation Factor (VIF), a measure of multicollinearity for the explanatory variables, and the Ramsey RESET Test which is unrelated to the rest here and is a test for linear specification. I think I got now all from this page

The code is roughly, after imports and estimating the model by OLS

infl = Influence(res_ols)
print infl.summary()

obs endog fittedCook's d studentized hat diagdffits ext.stud. dffits
valueresidualsinternal residuals
0.0 64.059.71430.03170.75130.10080.25160.73380.2457
2.0 53.052.42860.00250.11240.28570.07110.10670.0675
3.0 67.070.64290.058-0.67780.2017-0.3407-0.6583-0.3309
4.0 55.059.71430.0383-0.82650.1008-0.2768-0.8123-0.272
5.0 58.056.07140.01250.35150.16810.1580.33550.1508
7.0 57.063.35710.0559-1.10420.084-0.3345-1.1179-0.3386
9.0 51.052.42860.0158-0.2810.2857-0.1777-0.2676-0.1693
10.0 76.074.28570.0310.34980.33610.24890.33390.2376
11.0 68.063.35710.02980.80640.0840.24430.79120.2397

The html for this table was created with scikits.statsmodels.iolib.table.SimpleTable, but I never tried to fine tune the html options for a nice result. The text table that is getting printed in the python shell is

>>> print infl.summary_obs()
       obs      endog     fitted   Cook's d studentized   hat diag    dffits   ext.stud.     dffits
                           value              residuals              internal  residuals          
         0    64.0000    59.7143     0.0317      0.7513     0.1008     0.2516     0.7338     0.2457
         1    71.0000    67.0000     0.0334      0.7079     0.1176     0.2585     0.6891     0.2516
         2    53.0000    52.4286     0.0025      0.1124     0.2857     0.0711     0.1067     0.0675
         3    67.0000    70.6429     0.0580     -0.6778     0.2017    -0.3407    -0.6583    -0.3309
         4    55.0000    59.7143     0.0383     -0.8265     0.1008    -0.2768    -0.8123    -0.2720
         5    58.0000    56.0714     0.0125      0.3515     0.1681     0.1580     0.3355     0.1508
         6    77.0000    67.0000     0.2088      1.7697     0.1176     0.6462     2.0259     0.7398
         7    57.0000    63.3571     0.0559     -1.1042     0.0840    -0.3345    -1.1179    -0.3386
         8    56.0000    67.0000     0.2526     -1.9467     0.1176    -0.7108    -2.3435    -0.8557
         9    51.0000    52.4286     0.0158     -0.2810     0.2857    -0.1777    -0.2676    -0.1693
        10    76.0000    74.2857     0.0310      0.3498     0.3361     0.2489     0.3339     0.2376
        11    68.0000    63.3571     0.0298      0.8064     0.0840     0.2443     0.7912     0.2397

Observations 6 and 8 look like outliers from these measures.

This was the fun part, and I managed to get it done in a few (or more) hours this weekend. The rest sounds more like work.

Some comments on the implementation.

The first measures can be calculated efficiently from only the original regression, the second part of this list of measures requires a loop that estimates the model dropping each observation one at a time (Leave One Observation Out). The latter can get expensive if the number of observations is large.

Initially I wanted to cache all Leave One Observation Out result instances, but this might get expensive in terms of memory, since at least some copies of the data will be made and stored together with the result instances. Instead, I finally decided to loop once but store all the attributes that will be required for the various measures. (Right now, I actually needed only the estimate of the error variance and the parameter estimates).

If the number of observations is large, then calculating the Leave One Observation Out results for dropping every observation will be a waste, because many observations will not be likely candidates for the "full treatment". The, not yet implemented, idea is to calculate the cheap measures first, and use them to select those observations that are likely candidates for the more precise LOOO loop.

Also still missing: plots.
I did some cleanup for the regression plots, but the outlier plots still look plain vanilla.

Completely unrelated:
We went this weekend to the Fete des neiges and looked at the snow village (giant igloos).

Quote: "Parc Jean-Drapeau is proud to welcome a North-American premiere – a Snow Village which has a hotel, restaurant, replicas of familiar MontrĂ©al buildings and lighted snow tunnels to explore all winter long. This Finnish concept attraction allows visitors and clients to enjoy a unique MontrĂ©al-winter experience."

I'm definitely NOT going to book a room there. But if anyone is interested, here's the link.


  1. Hello! I've just stumbled across your blog looking for variance inflation factor implementations in Python. Do I understand correctly that your implementation is now in statsmodels? The documentation seems to think it is still missing:


  2. Hi,
    Yes, it's in current master of statsmodels
    but I haven't updated the documentation yet. I added the tests but have not finished cleaning up the code and writing all the docstrings. It will be in the next release of statsmodels. (Note statsmodels just changed it's name in master, droping the scikits part)

    The documentation for the development version is
    Your link to the documentation points to the released version which will not be updated until the next release.