Wednesday, April 11, 2012

Statsmodels and Google Summer of Code 2012

I didn't have much time or motivation to work on my blog these last weeks, mainly because I was busy discussing Google Summer of Code and preparing a new release for statsmodels.

So here is just an update on our Google Summer of Code candidates and their projects. This year was a successful year in attracting student proposals. We have six proposals, five of them we discussed quite extensively on our mailing list before the application.

Of the five projects, the first two are must-haves for econometrics or statistical packages, one on System of Equations, the other on Nonlinear Least-Squares and Nonlinear Robust Models. The next two are nonparametric or semi-parametric methods, one more traditional kernel estimation, the other using Empirical Likelihood which is a relatively new approach that has become popular in recent research both in econometrics and in statistics. The fifth is on Dynamic Linear Models mainly using Kalman filter and a Bayesian approach, which would extend the depth of statsmodels in time series analysis.

All topics would be valuable extensions to statsmodels and significantly increase our coverage of statistics and econometrics. From the discussion on the mailing list I think that all candidates are qualified to bring the projects to a successful finish.

Estimating System of Equations

This is a standard econometrics topic, but I only recently found that graphical models and causal models discussed in other fields have a large overlap with this. In the case of a system of simultaneous equations, we have several variables that depend on each other. The simplest case in economics is a market equilibrium, where the demanded and supplied quantities depend on the price, and the price depends on the supply and demand. The estimation methods commonly used in this area are two-stage and three-stage least-squares and limited and full information maximum likelihood estimation. The first part of the project starts with the simpler case when we have several response variables, but they don't depend on each other simultaneously, although they can depend on past values of other response variables. I'm very glad that someone is picking this one up.

Extension of Linear to Non Linear Models

This project has two parts, the first is extending the linear least-squares model to the non-linear case, the second part is to implement non-linear models for robust estimation. Non-linear least squares is available in scipy for example with scipy.optimize.curve_fit. However, in the statsmodels version we want to provide all the usual results statistics and statistical tests. The second part will implement two robust estimators for non-linear model, that have been shown to be most successful in recent Monte Carlo studies comparing different robust estimators for non-linear equations. Robust estimation here refers to the case when there are possibly many outliers in the data. My guess is that these will become the most used models of all proposals.

Nonparametric Estimation

This project extends the kernel based method in statsmodels from the univariate to the multivariate case, will provide better bandwidth selection, and then implement nonparametric function estimation. Multivariate kernel density estimation should complement scipy.stats.gaussian_kde which only works well with distributions that are approximately normal shaped or have only a single peak. Another extension is to provide kernels and estimation methods for discrete variables. These methods have been on our wishlist for a while, but only the univariate case has been included in statsmodels so far.

Empirical Likelihood

This is a relatively new approach in statistics and econometrics that avoids the distributional assumptions in estimation and in statistical tests. Instead of relying on a known distribution in small samples, where we often assume normal distribution, or instead of relying on the asymptotic normal distribution in large samples, this approach estimates the distribution in a nonparametric way. This is similar, to some extend, to the difference between, for example, a t-test and a rank-based Mann–Whitney U or Wilcoxon test, which are available in scipy.stats. The advantages are that in smaller samples the estimates and tests are more accurate when the distribution is not known, and in many cases, for example in finance, most tests show that the assumption of normal distribution does not hold. For this project, I still have to catch up with some readings because I'm only familiar with a small part of this literature, mainly on empirical likelihood in relation to Generalized Method of Moments (GMM).

Dynamic Linear Models

This covers statespace models implemented by Kalman Filter for multivariate time series models, both from a likelihood and a Bayesian perspective. The project expands the coverage of statsmodels in linear time series analysis, the first area where we get a good coverage of models. Currently, we have univariate AR and ARIMA, vector autoregressive models VAR, and structural VAR. Part of this project would be to get a good cython based implementation of Kalman filters. Wes has started a libray, statlib, for this last year, however, it is still incomplete and needs to be integrated with statsmodels. Another advantage of this project is that it increases our infrastructure and models for macro-econometrics, estimation of macroeconomic models and dynamic stochastic general equilibrium DSGE models, which is currently still Matlab dominated, as far as I can tell.

Now we still have to see how many GSoC slots we will get, but we have the chance this year to get a large increase in the speed of development of statsmodels, and we can reduce the number of cases where someone needs to run to R, or Stata, or Matlab because there is no implementation for a statistical analysis available in Python.