Sunday, November 18, 2012

Density Estimation with Orthogonal Series - circular data


Orthogonal Series are very useful. If we have a basis (gi)iN for some function space (usually with smoothness and integrability conditions), then we can represent function as linear combinations of the basis functions:
To get an approximation to the function f, we can truncate after some finite number of terms. (N is all positive integers.)
Orthonormal polynomials are convenient for density estimation, because we can directly estimate the coefficients ci from the data without having to run a non-linear optimization. In the basic case, we just need to calculate the moments of the data.
The orthogonality and normalization of the basis function is defined with respect to a weighting function w:
gi(x)gj(x)w(x)=0 if ij=1 if i=j
In the case of estimating or approximating a density we can use a reference density as weighting function. Then, the first term corresponds to the reference density, higher order terms are deviations from the reference density. This forms the basis for smooth goodness-of-fit tests. It is also very similar to series expansion of distributions, for example the Gram-Charlier expansion for the normal distribution. The reference density is the normal distribution. Higher order terms are based on Hermite polynomials.
In the basic form, we can just add the weighting function to the expansion above
However, these kinds of series expansion do not necessarily have densities that are non-negative over the full range of the density function. As a consequence, several non-linear transformation have been introduced in the literature, for example squaring or taking the exponential. The transformed expansion always results in non-negative densities. However, they loose the simple estimation property and have to be estimated with non-linear optimization. (I haven't actually coded any of those yet.)
These series approximation to densities can be extended to the multivariate case, but I haven't coded those yet either.

The Quest

I got started with this after a "random" reading, "Orthogonal series density estimation" and later "Smooth tests of goodness of fit" Both papers give well motivated introductions.
In the mean time I have read dozens more papers in this direction. The latest is a lot more theoretical and goes into continuous time stochastic processes, where I'm not yet ready to go back to, and along a similar line, orthonormal series variance estimator
scipy.special has a nice collection of orthogonal polynomials. Now also numpy.polynomial has a good implementation of orthogonal polynomials, but they were not available when I started with this. The scipy.special documentation is a bit "thin". It is good enough when we know what we are looking for, but not very helpful when we only have a vague idea what kind of "animals" those functions are.
The first problem was to find the normalization, since the polynomials in scipy are orthogonal but not orthonormal.
Also, on the way I had to figure out how to construct orthonormal polynomials for an arbitrary given weight function (reference density), and learn about recurrence equations and how we can construct and evaluate orthogonal polynomials. Neither of those are standard training where I come from.
Plots of some of my previous results can be seen in my gallery. Two examples:

Fourier polynomials

 and Hermite polynomials (black line, green line is a normal distribution)

The latest Installment

Recently, I started to figure out the basics of circular or directional statistics, see for example .
Trying to understand the usual smooth goodness-of-fit tests, I read However, orthonormal polynomials on the unit circle are "different". To get orthogonal polynomials with the Von Mises distribution as the weight functions, we need Verblunsky coefficients and Szego recurrence. Now what are those? Searching with Google, I didn't find any basic explanations. I don't really want to get a math book on the topic (by Barry Simon) and read it.
To get started with something easier, I went back to orthogonal polynomials with a uniform weight function, that is no weights. In this case, the polynomials are just trigonometric functions or Fourier series.
An explanation and application that imposes additionally non-negativity of the density function (which I do not impose in my code) is
The coefficients of the series approximation are just the circular moments of the underlying distribution. We can calculate those for a given distribution, or we can calculate the empirical moments from the data.
Detour: scipy.integrate.quad
An under used feature of scipy.integrate.quad is that we are able to use a weight function. For example, calculating the cosine and sine parts of the circular moments can be done with
integrate.quad(pdf_func, low, upp, weight='cos', wvar=k)
integrate.quad(pdf_func, low, upp, weight='sin', wvar=k)
which calculates the k-th circular moment of a circular distribution given by pdf_func. The integration limits are either (0,2π) or (-π,π). We cannot integrate with the complex definition:
integrate.quad(lambda x: np.exp(1j*k*x)*pdf_func(x, *args), low, upp)
because quad throws away the imaginary part and issues a warning about the casting to float.

The Plots

And now, the plots. I draw random numbers from a two component mixture of Von Mises distributions [1]. The plots contain the histogram of the data and the estimated density based on the trigonometric series. For reference it also contains the density of the data generating process, the mixture distribution, and the density given by the 15 component series based on the circular moments of the data generating distribution (calculated by integration as above). With 15 components the series distribution based on the true moments is essentially indistinguishable from the true density.

First plot: 10000 observations, which makes the histogram and estimated moments close to the true density and moments.

Second plot: 200 observations, given the randomness in the data, the histogram is pretty uneven (given the number of bins). I fitted 3 components in the series density estimate.

Third and fourth plots: 1000 observations, in one plot I used 5 components, in the other plot I used 15 components in the series density. The density with 15 components is fitting random "bumps" in the histogram.

Some Comments

Orthogonal series expansion could be or is very useful. The advantage compared to kernel density estimation is that it is much faster and we do not need to keep the original data for evaluating the density. All we need are the coefficients for the series. It also works better on bounded support than kernel density estimation. One of the disadvantages is that it is a global method and will not be able to adjust locally if there are regions with different features, unless we sufficiently increase the number of terms in the series. Increasing the number of terms will make the density estimate more sensitive to random effects.
My impression is that orthogonal series expansion for densities are limited in their usefulness when the distribution contains a shape parameter and not just location and scale. A while ago, I wrote the recursion for polynomial series with Poisson as the weight function. It can be used for testing whether a distribution is Poisson, as in the paper I used as reference. However, I finally came to the conclusion that this is not really so useful, since in many cases we want to have count regression, with the shape parameter as a function of some explanatory variables. The series expansion of the Poisson distribution is specific to a given shape parameter, which means that we cannot create the orthonormal base independently of the regressors. I also have not seen any articles that uses orthogonal expansion in regression outside the normal distribution case, as far as I remember.
One of the main missing pieces in my code is automatic selection of the bandwidth or of the optimal penalization. For the former, we need to select the number of components in the series expansion. For the later, we use a larger number of terms but need to find an increasingly strong penalization for higher order terms. I only know of one journal article that derives the penalization for Fourier series on the real line.
Related to the last point: One of the main work that George and Ralph did during GSOC last summer is to get automatic bandwidth selection for kernel density estimation and kernel regression for the new nonparametric extension in statsmodels. There are many other new features besides this. statsmodels will get a good coverage of kernel methods when the branch is merged, which will happen very soon.
(My code is mostly only part of my private collection of "dirty" scripts.)
[1]I just concatenated the data and didn't randomize on the number of observations in each component.
Editorial Note: I'm still using rst2blogger with the default settings. I am able to edit Latex math in restructured text for the use with sphinx which I used for the draft. With rst2blogger the default is conversion to mathml, which doesn't recognize all Latex math that I was using, and some fine-tunig got lost. Additionally, the math doesn't display in Internet Explorer on my computer.
PS: Where does your black box start?
Just a thought after reading this again.
Sometimes I'm happy to use somebody else's code or recipes without worrying about why it works. Sometimes I have to dig in myself because there are no recipes available. But often I have to dig in because I feel a need to understand whatever I'm using and I cannot rest until my ignorance is sufficiently reduced (or I realize that the mountain is too big.)
And after five or more years of Statistics in Python, I can safely say that I learned a lot about topics that I never heard of before.

No comments:

Post a Comment