## Monday, January 30, 2012

### Anscombe and Diagnostic Statistics

Now that I got these outlier and influence measures (see last post), we can look at the Anscombe example, Anscombe Quartet

I took the example from matplotlib here. With a few changes, mainly adding the observation number. I get this graph

The first example is what we would usually expect and shouldn't have any problems for estimating the regression line. The second example is nonlinear, but we are fitting only a linear model. The third example has an outlier in y, which pulls the regression line towards it. The last example has an outlier in x.

Interpreting the last case is a bit tricky. The slope of the regression line is dominated by observation 7, if we think it is a reliable (correctly measured) observation, then it is very useful since it contains a lot of information. If we doubt the reliability of that measurement, then we are left without any information about the slope and the effect of the explanatory variable.
(I don't remember where I read this, someone argued that we shouldn't through out x outliers.)

Now we can run a loop over the four problems and check the influence diagnostic table, and also Ramsey's RESET test for linearity.

example 1: ok
-------------

Parameter estimates (constant, slope): [ 3.00009091  0.50009091]
=============================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
0      8.040      8.001      0.000      0.033      0.100      0.011      0.031      0.010      0.003
1      6.950      7.001      0.000     -0.043      0.100     -0.014     -0.041     -0.014      0.004
2      7.580      9.501      0.489     -1.778      0.236     -0.989     -2.081     -1.158     -0.908
3      8.810      7.501      0.062      1.110      0.091      0.351      1.127      0.356      0.000
4      8.330      8.501      0.002     -0.148      0.127     -0.057     -0.140     -0.053     -0.029
5      9.960     10.001      0.000     -0.041      0.318     -0.028     -0.038     -0.026     -0.022
6      7.240      6.001      0.127      1.102      0.173      0.503      1.117      0.510     -0.351
7      4.260      5.000      0.123     -0.725      0.318     -0.495     -0.705     -0.481      0.407
8     10.840      9.001      0.279      1.635      0.173      0.747      1.838      0.840      0.578
9      4.820      6.501      0.154     -1.455      0.127     -0.556     -1.569     -0.599      0.320
10      5.680      5.500      0.004      0.166      0.236      0.092      0.157      0.087     -0.068
=============================================================================================================
RESET

F test: F=array([[ 0.44964358]]), p=[[ 0.77063577]], df_denom=5, df_num=4

Everything looks mostly ok, the RESET test doesn't reject the linear specification. Observation 2 might be an outlier, the studentized residual is at around 2, which is about the threshold for the t-distribution.

example 2: nonlinear
--------------------

Parameter estimates (constant, slope): [ 3.00090909  0.5       ]
=============================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
0      9.140      8.001      0.052      0.971      0.100      0.324      0.967      0.322      0.097
1      8.140      7.001      0.052      0.971      0.100      0.324      0.967      0.322     -0.097
2      8.740      9.501      0.077     -0.704      0.236     -0.392     -0.683     -0.380     -0.298
3      8.770      7.501      0.058      1.076      0.091      0.340      1.087      0.344      0.000
4      9.260      8.501      0.032      0.657      0.127      0.251      0.635      0.242      0.130
5      8.100     10.001      0.808     -1.861      0.318     -1.271     -2.236     -1.528     -1.291
6      6.130      6.001      0.001      0.115      0.173      0.052      0.108      0.050     -0.034
7      3.100      5.001      0.808     -1.861      0.318     -1.271     -2.236     -1.528      1.291
8      9.130      9.001      0.001      0.115      0.173      0.052      0.108      0.050      0.034
9      7.260      6.501      0.032      0.657      0.127      0.251      0.635      0.242     -0.130
10      4.740      5.501      0.077     -0.704      0.236     -0.392     -0.683     -0.380      0.298
=============================================================================================================
RESET

F test: F=array([[ 820836.08290181]]), p=[[ 0.]], df_denom=5, df_num=4

Observation 2 and 7 look like outliers. However, the RESET test strongly rejects the hypothesis that the linear specification is correct. So we better look for another model, for example including polynomial terms for the explanatory variable.

example 3: outlier at 2
----------------------

Parameter estimates (constant, slope): [ 3.00245455  0.49972727]
=============================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
0      7.460      8.000      0.012     -0.460      0.100     -0.153     -0.439     -0.146     -0.044
1      6.770      7.000      0.002     -0.196      0.100     -0.065     -0.186     -0.062      0.019
2     12.740      9.499      1.393      3.000      0.236      1.669   1203.540    669.587    525.268
3      7.110      7.500      0.005     -0.331      0.091     -0.105     -0.314     -0.099      0.000
4      7.810      8.499      0.026     -0.597      0.127     -0.228     -0.574     -0.219     -0.117
5      8.840      9.999      0.301     -1.135      0.318     -0.775     -1.156     -0.790     -0.667
6      6.080      6.001      0.001      0.070      0.173      0.032      0.066      0.030     -0.021
7      5.390      5.001      0.034      0.381      0.318      0.260      0.362      0.247     -0.209
8      8.150      8.999      0.059     -0.755      0.173     -0.345     -0.736     -0.336     -0.231
9      6.420      6.500      0.000     -0.070      0.127     -0.027     -0.066     -0.025      0.013
10      5.730      5.501      0.007      0.212      0.236      0.118      0.200      0.111     -0.087
=============================================================================================================
RESET
F test: F=array([[ 1.42791012]]), p=[[ 0.34730273]], df_denom=5, df_num=4

The results from the table are pretty clear, observation 2 has to be an outlier, all residual measures are huge. The RESET test doesn't indicate any problem with the linear specification.

example 4: exog_outlier at 7
----------------------------

Parameter estimates (constant, slope): [ 3.00172727  0.49990909]
=============================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
0      6.580      7.001      0.007     -0.359      0.100     -0.120     -0.341     -0.114      0.034
1      5.760      7.001      0.062     -1.059      0.100     -0.353     -1.067     -0.356      0.107
2      7.710      7.001      0.020      0.605      0.100      0.202      0.582      0.194     -0.059
3      8.840      7.001      0.137      1.569      0.100      0.523      1.735      0.578     -0.174
4      8.470      7.001      0.087      1.253      0.100      0.418      1.300      0.433     -0.131
5      7.040      7.001      0.000      0.033      0.100      0.011      0.031      0.011     -0.003
6      5.250      7.001      0.124     -1.494      0.100     -0.498     -1.624     -0.541      0.163
7     12.500     12.500     -1.#IO     -1.#IO      1.000     -1.#IO     -1.#IO     -1.#IO     -3.070
8      5.560      7.001      0.084     -1.229      0.100     -0.410     -1.271     -0.423      0.128
9      7.910      7.001      0.033      0.775      0.100      0.259      0.757      0.252     -0.076
10      6.890      7.001      0.001     -0.095      0.100     -0.032     -0.089     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 126.84716907]]), p=[[ 0.00000007]], df_denom=9, df_num=4

What's going on with observation 7? The "-1.#IO" stands for numpy.NaN. All the measures that rely on leaving out observation 7 are indeterminate, since without observation 7 we cannot identify a slope.
The diagonal value of the hat matrix and the dfbeta for the slope indicate that this observation is very influential.

example 4a: exog_outlier at 7 (v2)
----------------------------------

Parameter estimates (constant, slope): [ 3.00172741  0.49990908]
=============================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
0      7.006      7.006      0.000      0.000      0.091      0.000      0.000      0.000     -0.000
1      6.580      7.001      0.007     -0.377      0.091     -0.119     -0.360     -0.114      0.033
2      5.760      7.001      0.062     -1.110      0.091     -0.351     -1.125     -0.356      0.103
3      7.710      7.001      0.020      0.634      0.091      0.201      0.614      0.194     -0.056
4      8.840      7.001      0.135      1.645      0.091      0.520      1.828      0.578     -0.167
5      8.470      7.001      0.086      1.314      0.091      0.416      1.371      0.434     -0.125
6      7.040      7.001      0.000      0.035      0.091      0.011      0.033      0.011     -0.003
7      5.250      7.001      0.123     -1.567      0.091     -0.495     -1.711     -0.541      0.156
8     12.500     12.500      0.000     -0.000      1.000     -0.001     -0.000     -0.001     -0.001
9      5.560      7.001      0.083     -1.289      0.091     -0.408     -1.339     -0.424      0.122
10      7.910      7.001      0.033      0.813      0.091      0.257      0.798      0.253     -0.073
11      6.890      7.001      0.001     -0.099      0.091     -0.031     -0.094     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 113.86707171]]), p=[[ 0.00000011]], df_denom=9, df_num=4

Since I didn't like the NaN results in the previous example, I made one change. I added an additional observation that has an x value slightly larger the the large group, 8.01 insted of 8.00, the y value was chosen so it lies on the predicted line. The new observation is the first observation, the x outlier is now at index 8.

What happened to our influence measures? Both observations 0 and 8 have zero influence, since we only need one of them to identify the slope. Leave one observation out doesn't help if we have two "similar" (in terms of parameter estimation) outliers.

Since this was interesting, let's try another variation. This time I change the value of the new observation to lie below the regression line (6.999 instead of 7.006). One more table and we are done.

example 4b: exog_outlier at 7 (v3)
----------------------------------

Parameter estimates (constant, slope): [ 3.00063327  0.49996637]
=============================================================================================================
obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits     dfbeta
value          d   residual              internal   residual                 slope
-------------------------------------------------------------------------------------------------------------
0      6.999      7.005      0.000     -0.006      0.091     -0.002     -0.005     -0.002      0.001
1      6.580      7.000      0.007     -0.376      0.091     -0.119     -0.359     -0.114      0.033
2      5.760      7.000      0.062     -1.110      0.091     -0.351     -1.124     -0.356      0.103
3      7.710      7.000      0.020      0.635      0.091      0.201      0.615      0.194     -0.056
4      8.840      7.000      0.136      1.646      0.091      0.520      1.829      0.578     -0.167
5      8.470      7.000      0.086      1.315      0.091      0.416      1.372      0.434     -0.125
6      7.040      7.000      0.000      0.035      0.091      0.011      0.034      0.011     -0.003
7      5.250      7.000      0.123     -1.566      0.091     -0.495     -1.710     -0.541      0.156
8     12.500     12.500     21.566      0.006      1.000      6.567      0.005      6.231      5.965
9      5.560      7.000      0.083     -1.289      0.091     -0.407     -1.339     -0.423      0.122
10      7.910      7.000      0.033      0.814      0.091      0.257      0.799      0.253     -0.073
11      6.890      7.000      0.001     -0.099      0.091     -0.031     -0.094     -0.030      0.009
=============================================================================================================
RESET
F test: F=array([[ 113.85091319]]), p=[[ 0.00000011]], df_denom=9, df_num=4

Now we see that the outlier measures like the studentized residuals are very small for observation 8, however the influence measures, diagonal value of hat matrix, Cook's distance, dffits and dfbeta, are all large and clearly signal that this is a very influential observation.

Extra comment:

The RESET test in the versions of example four has different degrees of freedom than the first three examples, even though it uses the same code. After looking at this surprising result a bit, it turns out that because of the structure of the explanatory variable (only 2 or 3 distinct x values) the auxilliary regression for the RESET test has reduced rank.

A bit of history: