(This is sloppy phrasing; they are statistical tests, after all.)

What is missing so far in statsmodels are the usual outlier and influence measures. (Although there are robust linear models.)

I found a nice example of influence and outlier measures using SAS at http://www.bios.unc.edu/~truong/b545/reg/simple/wa.html

Trying to do something similar as an extension to statsmodels, I ended up with the table below. This is supposed to correspond to the bottom tables on the above linked page.

All numbers match, but I'm still missing `Cov Ratio` and haven't looked at the prediction intervals yet. I also have dfbetas, but ran out of time to add them nicely to the table. Some of the measures have associated pvalues or recommended thresholds that can be used to interpret the results and find observations that might "mess up" the estimation.

Additionally, I also coded the Variance Inflation Factor (VIF), a measure of multicollinearity among the explanatory variables, and the Ramsey RESET test, which is unrelated to the rest here: it is a test of the linear specification. I think I now have everything from this page: http://en.wikipedia.org/wiki/Category:Regression_diagnostics.
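As a rough illustration of what VIF computes (a minimal numpy sketch of the definition, not the statsmodels implementation), here is the auxiliary-regression formulation: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j of the design matrix on all the other columns. The function name `vif` and the use of plain `numpy.linalg.lstsq` are my own choices for this sketch.

```python
import numpy as np

def vif(exog, j):
    """Variance Inflation Factor for column j of a 2d design matrix exog."""
    y = exog[:, j]
    X = np.delete(exog, j, axis=1)          # all other columns
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # R^2 of the auxiliary regression of column j on the rest
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = 1.0 - (resid @ resid) / tss
    return 1.0 / (1.0 - r2)
```

A nearly collinear column yields a very large VIF, while an independent regressor gives a VIF close to 1.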

The code is roughly as follows, after the imports and estimating the model by OLS:

```
infl = Influence(res_ols)
print infl.summary()
```

| obs | endog | fitted value | Cook's d | studentized residuals | hat diag | dffits internal | ext.stud. residuals | dffits |
|---|---|---|---|---|---|---|---|---|
| 0 | 64.0 | 59.7143 | 0.0317 | 0.7513 | 0.1008 | 0.2516 | 0.7338 | 0.2457 |
| 1 | 71.0 | 67.0 | 0.0334 | 0.7079 | 0.1176 | 0.2585 | 0.6891 | 0.2516 |
| 2 | 53.0 | 52.4286 | 0.0025 | 0.1124 | 0.2857 | 0.0711 | 0.1067 | 0.0675 |
| 3 | 67.0 | 70.6429 | 0.058 | -0.6778 | 0.2017 | -0.3407 | -0.6583 | -0.3309 |
| 4 | 55.0 | 59.7143 | 0.0383 | -0.8265 | 0.1008 | -0.2768 | -0.8123 | -0.272 |
| 5 | 58.0 | 56.0714 | 0.0125 | 0.3515 | 0.1681 | 0.158 | 0.3355 | 0.1508 |
| 6 | 77.0 | 67.0 | 0.2088 | 1.7697 | 0.1176 | 0.6462 | 2.0259 | 0.7398 |
| 7 | 57.0 | 63.3571 | 0.0559 | -1.1042 | 0.084 | -0.3345 | -1.1179 | -0.3386 |
| 8 | 56.0 | 67.0 | 0.2526 | -1.9467 | 0.1176 | -0.7108 | -2.3435 | -0.8557 |
| 9 | 51.0 | 52.4286 | 0.0158 | -0.281 | 0.2857 | -0.1777 | -0.2676 | -0.1693 |
| 10 | 76.0 | 74.2857 | 0.031 | 0.3498 | 0.3361 | 0.2489 | 0.3339 | 0.2376 |
| 11 | 68.0 | 63.3571 | 0.0298 | 0.8064 | 0.084 | 0.2443 | 0.7912 | 0.2397 |

The HTML for this table was created with scikits.statsmodels.iolib.table.SimpleTable, but I never tried to fine-tune the HTML options for a nicer result. The text table printed in the Python shell is:

```
>>> print infl.summary_obs()
===================================================================================================
   obs    endog   fitted  Cook's d  studentized  hat diag    dffits  ext.stud.    dffits
                   value             residuals             internal  residuals
---------------------------------------------------------------------------------------------------
     0  64.0000  59.7143    0.0317       0.7513    0.1008    0.2516     0.7338    0.2457
     1  71.0000  67.0000    0.0334       0.7079    0.1176    0.2585     0.6891    0.2516
     2  53.0000  52.4286    0.0025       0.1124    0.2857    0.0711     0.1067    0.0675
     3  67.0000  70.6429    0.0580      -0.6778    0.2017   -0.3407    -0.6583   -0.3309
     4  55.0000  59.7143    0.0383      -0.8265    0.1008   -0.2768    -0.8123   -0.2720
     5  58.0000  56.0714    0.0125       0.3515    0.1681    0.1580     0.3355    0.1508
     6  77.0000  67.0000    0.2088       1.7697    0.1176    0.6462     2.0259    0.7398
     7  57.0000  63.3571    0.0559      -1.1042    0.0840   -0.3345    -1.1179   -0.3386
     8  56.0000  67.0000    0.2526      -1.9467    0.1176   -0.7108    -2.3435   -0.8557
     9  51.0000  52.4286    0.0158      -0.2810    0.2857   -0.1777    -0.2676   -0.1693
    10  76.0000  74.2857    0.0310       0.3498    0.3361    0.2489     0.3339    0.2376
    11  68.0000  63.3571    0.0298       0.8064    0.0840    0.2443     0.7912    0.2397
===================================================================================================
```

Observations 6 and 8 look like outliers from these measures.
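One way to arrive at that conclusion mechanically is to apply common rule-of-thumb cutoffs, e.g. |externally studentized residual| > 2 and |DFFITS| > 2·sqrt(k/n). This is a sketch using the values from the table above (n = 12 observations, k = 2 parameters in the simple regression); it is not code from the post.

```python
import math

# externally studentized residuals and dffits, copied from the table above
ext_studentized = [0.7338, 0.6891, 0.1067, -0.6583, -0.8123, 0.3355,
                   2.0259, -1.1179, -2.3435, -0.2676, 0.3339, 0.7912]
dffits = [0.2457, 0.2516, 0.0675, -0.3309, -0.2720, 0.1508,
          0.7398, -0.3386, -0.8557, -0.1693, 0.2376, 0.2397]

n, k = 12, 2
dffits_cut = 2 * math.sqrt(k / n)   # approx 0.816

flagged = [i for i in range(n)
           if abs(ext_studentized[i]) > 2 or abs(dffits[i]) > dffits_cut]
print(flagged)  # [6, 8]
```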

This was the fun part, and I managed to get it done in a few (or more) hours this weekend. The rest sounds more like work.

Some comments on the implementation.

The first measures can be calculated efficiently from only the original regression. The second part of this list of measures requires a loop that re-estimates the model dropping one observation at a time (Leave One Observation Out). The latter can get expensive if the number of observations is large.
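To show how the cheap measures follow from the original regression alone, here is a numpy sketch using only the design matrix and the residuals of the full fit (the helper name `cheap_influence` is mine, not from the statsmodels code):

```python
import numpy as np

def cheap_influence(X, resid, df_resid):
    """Hat diagonal, internally studentized residuals, and Cook's distance,
    all computed from the original regression only."""
    # hat matrix diagonal: h_i = x_i (X'X)^{-1} x_i'
    XtX_inv = np.linalg.inv(X.T @ X)
    hat_diag = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    # error variance estimate from the full fit
    sigma2 = resid @ resid / df_resid
    # internally studentized residuals
    studentized = resid / np.sqrt(sigma2 * (1 - hat_diag))
    # Cook's distance, with k = number of parameters
    k = X.shape[1]
    cooks_d = studentized**2 * hat_diag / ((1 - hat_diag) * k)
    return hat_diag, studentized, cooks_d
```

A useful sanity check: the hat diagonal sums to k, the number of parameters, since it is the trace of a projection matrix.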

Initially I wanted to cache all Leave One Observation Out result instances, but this might get expensive in terms of memory, since at least some copies of the data will be made and stored together with the result instances. Instead, I finally decided to loop once but store all the attributes that will be required for the various measures. (Right now, I actually needed only the estimate of the error variance and the parameter estimates).
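The loop described above could look roughly like this (a sketch under my own naming, not the actual statsmodels code): one pass over the observations, keeping only the per-drop parameter estimates and error variance rather than full result instances.

```python
import numpy as np

def loo_loop(X, y):
    """Leave One Observation Out: re-fit OLS dropping each observation once,
    storing only the attributes the influence measures need."""
    n, k = X.shape
    params_loo = np.empty((n, k))
    sigma2_loo = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        beta, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
        resid = yi - Xi @ beta
        params_loo[i] = beta                       # parameter estimates
        sigma2_loo[i] = resid @ resid / (n - 1 - k)  # error variance estimate
    return params_loo, sigma2_loo
```

Storing just these two arrays keeps memory at O(n·k) instead of holding n copies of the data inside cached result instances.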

If the number of observations is large, then calculating the Leave One Observation Out results for every dropped observation is wasteful, because many observations are unlikely candidates for the "full treatment". The idea, not yet implemented, is to calculate the cheap measures first and use them to select the observations that are likely candidates for the more precise LOOO loop.
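The screening step of that idea could be as simple as this sketch (my own formulation of the proposal, not implemented in the post), using a common Cook's distance rule of thumb to pick the candidates before running the expensive loop:

```python
import numpy as np

def candidate_indices(cooks_d, n_obs):
    """Indices worth the full LOOO treatment, using the common
    rule of thumb that Cook's d > 4 / n flags a candidate."""
    return np.flatnonzero(np.asarray(cooks_d) > 4.0 / n_obs)
```

The LOOO loop would then run only over `candidate_indices(...)` instead of `range(n)`.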

Also still missing: plots.

I did some cleanup for the regression plots, but the outlier plots still look plain vanilla.

Completely unrelated:

We went this weekend to the Fete des neiges and looked at the snow village (giant igloos).

Quote: "Parc Jean-Drapeau is proud to welcome a North-American premiere – a Snow Village which has a hotel, restaurant, replicas of familiar Montréal buildings and lighted snow tunnels to explore all winter long. This Finnish concept attraction allows visitors and clients to enjoy a unique Montréal-winter experience."

I'm definitely NOT going to book a room there. But if anyone is interested, here's the link.

Hello! I've just stumbled across your blog looking for variance inflation factor implementations in Python. Do I understand correctly that your implementation is now in statsmodels? The documentation seems to think it is still missing:

http://statsmodels.sourceforge.net/diagnostic.html?highlight=variance%20inflation#mutlicollinearity-tests

Cheers!

Hi,

Yes, it's in the current master of statsmodels: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/outliers_influence.py#L56

but I haven't updated the documentation yet. I added the tests but haven't finished cleaning up the code and writing all the docstrings. It will be in the next release of statsmodels. (Note: statsmodels just changed its name in master, dropping the scikits part.)

The documentation for the development version is http://statsmodels.sourceforge.net/devel/

Your link to the documentation points to the released version which will not be updated until the next release.

Cheers