Friday, March 2, 2012

10 lines of code and it took you 3 weeks?!

A question on the statsmodels mailing list reminded me about some features of software development that seem to occur to me quite frequently.

The question was about estimating the parameters of a statistical distribution based on the characteristic function with the general method of moments (GMM) with a continuum of moment conditions. I'm not going into details, just say the theory behind it is "a bit" technical. The suggested paper is a survey paper that summarizes the basic ideas and several applications. Of course, there are not enough details to base an implementation on a paper like this.

But this is not about GMM, in this case I would be reasonably familiar with it. The case I have in mind is p-value correction for multiple testing.

Close to two years ago there was the question on the statsmodels mailing list about some multiple testing corrections in post-hoc tests, specifically Tukey's_range_test

Never heard, what's that? A quick look at Wikipedia but that's all I know.

But the search also shows that this topic seems to be pretty important. However, Tukey and the other post-hoc tests are a bit old, and looking for "multiple testing" and "multiple comparisons" shows a large number of papers published in recent years, and there is FDR, false discovery rate. Now, what's that?

Fast forward:

Reading and skimming articles, especially survey articles and articles with Monte Carlo Studies, and some documentation of SAS, to figure out what is worth the trouble, and what is not, and then papers that look like they have enough details to base an implementation on it.

Starting to code and trying to translate the verbal descriptions of various step-up and step-down procedures into code, with lots of false starts and convoluted loops or code. (summer break in the coffeshop next to the swimming pool)

Finally, I figure out the pattern, and some R packages to compare my results with. (Christmas break)

The final result is something like

elif method.lower() in ['hs', 'holm-sidak']:
    pvals_corrected_raw = 1 - np.power((1. - pvals), np.arange(ntests, 0, -1))
    pvals_corrected = np.maximum.accumulate(pvals_corrected_raw)

elif method.lower() in ['sh', 'simes-hochberg']:
    pvals_corrected_raw = np.arange(ntests, 0, -1) * pvals
    pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]

available in statsmodels as function multipletests. I didn't know we can do things like np.minimum.accumulate with numpy before.

The code is actually still located in the sandbox although imported through the main parts of statsmodels, since I ran out of steam after getting it to work and verifying it's correctness against R. Actually, I ran out of steam (and Christmas break was over and I had to take care of some tickets for scipy.stats) after getting the cdf for the multivariate t distribution and multiple comparison based on the full covariance matrix, kind of, to work. But here I found a helpful book just published for this for R.

So, since then statsmodels has a function where part of the docstring is

method : string
    Method used for testing and adjustment of pvalues. Can be either the
    full name or initial letters. Available methods are ::

    `bonferroni` : one-step correction
    `sidak` : on-step correction
    `holm-sidak` :
    `holm` :
    `simes-hochberg` :
    `hommel` :
    `fdr_bh` : Benjamini/Hochberg
    `fdr_by` : Benjamini/Yekutieli

and some other functions.

PS: The title is exagerated, it's a few hundred lines of code, the sandbox module has close to 2000 lines that is partly not cleaned up and verified, but might be useful when I work on this or similar again. The story is correct in the sense that I spend three weeks or more with maybe 4 or 5 hours a day to get the few lines of code that is at the core of these functions, and to understand what I was doing.

PS: This was about code developement, multiple testing deserves its own story, and more emphasis and more advertising which I only did on the statsmodels mailing list. I recently read Joseph P. Romano, Michael Wolf : Stepwise Multiple Testing as Formalized Data Snooping which is interesting also in terms of application stories in economics and finance, but I haven't implemented their approach yet.

PS: This would be a lot easier if I were writing GPL instead of BSD licensed code, since then I could just check the R source.

PS: I don't have to work so long just to get started with the right implementation, if the problem is more traditional econometrics. There, it's usually just checking which variation people use and some details.

PS: There are too many PS.

No comments:

Post a Comment