Sunday, March 15, 2009

Using the Gaussian Kernel Density Estimation

In scipy.stats we can find a class to estimate and use a gaussian kernel density estimator, scipy.stats.stats.gaussian_kde.

Until recently, I didn’t know how this part of scipy works, and the following describes roughly how I figured out what it does.

First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.

```>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:

class gaussian_kde(__builtin__.object)
|  Representation of a kernel-density estimate using Gaussian kernels.
|
|   Parameters
|   ----------
|   dataset : (# of dims, # of data)-array
|       datapoints to estimate from
|
|   Members
|   -------
|   d : int
|       number of dimensions
|   n : int
|       number of datapoints
|
|   Methods
|   -------
|   kde.evaluate(points) : array
|       evaluate the estimated pdf on a provided set of points
|   kde(points) : array
|       same as kde.evaluate(points)
|   kde.integrate_gaussian(mean, cov) : float
|       multiply pdf with a specified Gaussian and integrate over the whole domain
|   kde.integrate_box_1d(low, high) : float
|       integrate pdf (1D only) between two bounds
|   kde.integrate_box(low_bounds, high_bounds) : float
|       integrate pdf over a rectangular space between low_bounds and high_bounds
|   kde.integrate_kde(other_kde) : float
|       integrate two kernel density estimates multiplied together
|
|  Internal Methods
|  ----------------
|   kde.covariance_factor() : float
|       computes the coefficient that multiplies the data covariance matrix to
|       obtain the kernel covariance matrix. Set this method to
|       kde.scotts_factor or kde.silverman_factor (or subclass to provide your
|       own). The default is scotts_factor.
|
|  Methods defined here:
|
|  __call__ = evaluate(self, points)
|
|  __init__(self, dataset)
|
|  covariance_factor = scotts_factor(self)
|
|  evaluate(self, points)
|      Evaluate the estimated pdf on a set of points.
|
...
```

The first two basic methods that I was interested in is the initialization, __init__ just takes a data set as a argument, and then evaluate, or __call__ which takes as function argument the list of points at which we want to evaluated the estimated pdf.

Let’s just try this out:

First, I get the standard imports:

```import numpy as np
from scipy import stats
import matplotlib.pylab as plt
```

then I generate a sample that I can feed to the kde, as usual for continuous variables, I start with a normal distribution:

```n_basesample = 1000
np.random.seed(8765678)
xn = np.random.randn(n_basesample)
```

Now, we create an instance of the gaussian_kde class and feed our sample to it:

```gkde=stats.gaussian_kde(xn)
```

We need some points at which we evaluate the density funtion for the estimated density function:

```ind = np.linspace(-7,7,101)
kdepdf = gkde.evaluate(ind)
```

and finally we create the plot of the histogram of our data, together with the density that created our sample, the data generating process DGP, and finally the estimated density.

```plt.figure()
# plot histgram of sample
plt.hist(xn, bins=20, normed=1)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
#plt.show()
```

and this is the graph that we get

This graph looks pretty good, when the underlying distribution is the normal distribution, then the gaussian kernel density estimate follows very closely the true distribution, at least for a large sample as we used.

Now, to make it a bit more difficult we can look at a bimodal distribution, and see if it is still able to fit so well. As an example, I pick a mixture of two normal distributions, 60% of the sample comes from a normal distribution with mean -3 and standard deviation equal to one, 40% of the observation are from the normal distribution with mean 3 and the same standard deviation.

```alpha = 0.6 #weight for (prob of) lower distribution
mlow, mhigh = (-3,3)  #mean locations for gaussian mixture
xn =  np.concatenate([mlow + np.random.randn(alpha * n_basesample),
mhigh + np.random.randn((1-alpha) * n_basesample)])
```

With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:

```plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
(1-alpha) * stats.norm.pdf(ind, loc=mhigh),
color="r", label='normal mix')
```

The next graph shows what we get in this case.

The estimated density is oversmoothing, the peaks of the estimated pdf are too small. So the automatic selection of the smoothing parameter doesn’t work in this case. Now, it’s time to find out how gaussian_kde actually selects the smoothing parameter or bandwith of the kernel.

But I’m out of time for today. We dig into this next time.

6 comments:

1. Hello,

I work daily with the scipy.stats.kde package and I find it quite useful. If you make some improvements, I would really appreciate to know it!

Regards
Stefano

2. Hi,

I found this tutorial very helpful, thanks!

I think that any kde will always broaden distributions somewhat, more so for small sample sizes. I tried to illustrate this point by giving just one sample, but this results in a ValueError. (Btw, giving an array of ints results in a gaussian_kde value of 0, which probably is a bug?) One very clear example of broadening is sampling a uniform distribution within limits 0 to 1 and computing the gaussian_kde, which will always have Gaussian wings below 0 and above 1, where the pdf really was 0.

If you find out how the automatic kernel width selection works, whether it is adaptive with position (smaller width in regions of higher sample density) and how it can be influenced by the user, or if it is possible to introduce limits xmin, xmax beyond which the gaussian_kde pdf is 0, please share!

Cheers, --Christoph

3. thank you for this tutorial!

4. This fail with 2D data set. How to estimate KDE if you have 2D point (X,Y) ?

5. Thanks for the example code

6. In case you want to fiddle with the bandwidth, you can use the "set_bandwidth" function - it takes a constant or a function or a name of it's inbuilt functions. In some cases you may want to reduce to smoothing - I find it estimates far too high