Tuesday, April 28, 2009

Having fun with expectations

Using the distribution classes in scipy.stats it is easy to calculate expectations of a function with respect to any distributions using numerical integration.

I’m going to write a function that calculates the expectation, then I attach it to the class of continuous distributions in scipy.stats. Finally we can use our new method with any existing distribution.

Warning: Monkey Patching a class can have unintended effects if the new or changed methods interfere with other uses. In this case we just add a new method, which does not effect any of the original use of the distributions.

import numpy as np
from scipy import stats, integrate

def expectedfunc(self, fn=None, args=(), lb=None, ub=None, conditional=False):
    '''calculate expected value of a function with respect to the distribution

    only for standard version of distribution,
    location and scale not tested

        all parameters are keyword parameters
        fn : function (default: identity mapping)
           Function for which integral is calculated. Takes only one argument.
        args : tuple
           argument (parameters) of the distribution
        lb, ub : numbers
           lower and upper bound for integration, default is set to the support
           of the distribution
        conditional : boolean (False)
           If true then the integral is corrected by the conditional probability
           of the integration interval. The return value is the expectation
           of the function, conditional on being in the given interval.

        expected value : float
    if fn is None:
        def fun(x, *args):
            return x*self.pdf(x, *args)
        def fun(x, *args):
            return fn(x)*self.pdf(x, *args)
    if lb is None:
        lb = self.a
    if ub is None:
        ub = self.b
    if conditional:
        invfac = self.sf(lb,*args) - self.sf(ub,*args)
        invfac = 1.0
    return integrate.quad(fun, lb, ub,

For now this is just a function where the first argument is a distribution instance, as they are available in scipy.stats. We can call this function to calculate the forth moment of the standard normal distribution and compare it with the moment of stats.norm

>>> print stats.norm.moment(4)
>>> print expectedfunc(stats.norm, lambda(x): (x)**4)

We obtain the same result, which means in this case our function works correctly.

Now we can attach it to stats.distributions.rv_continuous, which is the superclass of all continuous distributions. We could have alse used new.instancemethod which is, however, depreciated and will be removen in py3k.

>>> import types
>>> stats.distributions.rv_continuous.expectedfunc =
...       types.MethodType(expectedfunc,None,stats.distributions.rv_continuous)
>>> #print dir(stats.norm)
>>> print stats.norm.expectedfunc
<bound method norm_gen.expectedfunc of <scipy.stats.distributions.norm_gen object at 0x02122830>>


Here is the forth moment for both the normal and the t distribution. The t distribution requires one parameter, the degrees of freedom, which I set to 10 to get fatter tails:

>>> print stats.norm.expectedfunc(lambda(x): (x)**4)
>>> print stats.norm.moment(4)
>>> print stats.t.expectedfunc(lambda(x): (x)**4, 10)
>>> print stats.t.moment(4, 10)

Expectation of some additional functions:

>>> print stats.norm.expectedfunc(lambda(x): np.sqrt(np.abs(x)))
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-np.abs(x)))
>>> print stats.norm.expectedfunc(lambda(x): np.exp(-x), lb=0)

If our function is identical to one, and we use integration bounds on our integral, then we get the probability of the interval:

>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-1, ub=0.5)
>>> print stats.norm.cdf(0.5) - stats.norm.cdf(-1)

Can we calculate the expectation of exp(x)?

>>> print stats.norm.expectedfunc(lambda(x): np.exp(x))
Warning: The ocurrence of roundoff error is detected, which prevents
  the requested tolerance from being achieved.  The error may be

The expectation of exp(x) with respect to the standard normal distribution is unbound, and our numerical integration function returns nan, not a number.

If we integrate with respect to a distribution with finite support, for example the rdistribution, rdist, then the expectation is finite:

>>> print stats.rdist.expectedfunc(lambda(x): np.exp(x),0.1)

We can also try complex values:

>>> print stats.norm.expectedfunc(lambda(x): np.exp(1j*x))

I have no idea if this is correct, but this is the basic calculation for the characteristic function of a distribution.

Next we can try out conditional expectation. As an example, we calculate the expectation of a standard normal random variable conditional on values being in the top decile, i.e. the expectation of all values in the top 10 %.

>>> lbdec = stats.norm.isf(0.1)
>>> print stats.norm.expectedfunc(lb=lbdec, conditional=True)
>>> print expectedfunc(stats.norm, lb=lbdec, conditional=True)
>>> print stats.norm.expectedfunc(lambda(x): 1, lb=-lbdec, ub=lbdec)
>>> #should be 0.8

What’s the variance if we truncate the normal distribution at the 0.1 and 0.9 deciles?

>>> print stats.norm.expectedfunc(lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)
>>> print expectedfunc(stats.norm, lambda(x): x**2, lb=-lbdec, ub=lbdec,conditional=True)

and verify the result with truncated normal

>>> print stats.truncnorm.moment(2,-lbdec,lbdec)
>>> lbdect = stats.t.isf(0.1, 10)
>>> print stats.t.expectedfunc(args=(10,), lb=lbdect, conditional=True)
>>> print expectedfunc(stats.t, args=(10,), lb=lbdect, conditional=True)

The t distribution has fatter tails than the normal distribution, so the conditional expectation of the top decile is larger for the t distribution than for the normal distribution, 1.989 versus 1.755.

Sunday, March 15, 2009

Using the Gaussian Kernel Density Estimation

In scipy.stats we can find a class to estimate and use a gaussian kernel density estimator, scipy.stats.stats.gaussian_kde.

Until recently, I didn’t know how this part of scipy works, and the following describes roughly how I figured out what it does.

First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.

>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:

class gaussian_kde(__builtin__.object)
 |  Representation of a kernel-density estimate using Gaussian kernels.
 |   Parameters
 |   ----------
 |   dataset : (# of dims, # of data)-array
 |       datapoints to estimate from
 |   Members
 |   -------
 |   d : int
 |       number of dimensions
 |   n : int
 |       number of datapoints
 |   Methods
 |   -------
 |   kde.evaluate(points) : array
 |       evaluate the estimated pdf on a provided set of points
 |   kde(points) : array
 |       same as kde.evaluate(points)
 |   kde.integrate_gaussian(mean, cov) : float
 |       multiply pdf with a specified Gaussian and integrate over the whole domain
 |   kde.integrate_box_1d(low, high) : float
 |       integrate pdf (1D only) between two bounds
 |   kde.integrate_box(low_bounds, high_bounds) : float
 |       integrate pdf over a rectangular space between low_bounds and high_bounds
 |   kde.integrate_kde(other_kde) : float
 |       integrate two kernel density estimates multiplied together
 |  Internal Methods
 |  ----------------
 |   kde.covariance_factor() : float
 |       computes the coefficient that multiplies the data covariance matrix to
 |       obtain the kernel covariance matrix. Set this method to
 |       kde.scotts_factor or kde.silverman_factor (or subclass to provide your
 |       own). The default is scotts_factor.
 |  Methods defined here:
 |  __call__ = evaluate(self, points)
 |  __init__(self, dataset)
 |  covariance_factor = scotts_factor(self)
 |  evaluate(self, points)
 |      Evaluate the estimated pdf on a set of points.

The first two basic methods that I was interested in is the initialization, __init__ just takes a data set as a argument, and then evaluate, or __call__ which takes as function argument the list of points at which we want to evaluated the estimated pdf.

Let’s just try this out:

First, I get the standard imports:

import numpy as np
from scipy import stats
import matplotlib.pylab as plt

then I generate a sample that I can feed to the kde, as usual for continuous variables, I start with a normal distribution:

n_basesample = 1000
xn = np.random.randn(n_basesample)

Now, we create an instance of the gaussian_kde class and feed our sample to it:


We need some points at which we evaluate the density funtion for the estimated density function:

ind = np.linspace(-7,7,101)
kdepdf = gkde.evaluate(ind)

and finally we create the plot of the histogram of our data, together with the density that created our sample, the data generating process DGP, and finally the estimated density.

# plot histgram of sample
plt.hist(xn, bins=20, normed=1)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')

and this is the graph that we get


This graph looks pretty good, when the underlying distribution is the normal distribution, then the gaussian kernel density estimate follows very closely the true distribution, at least for a large sample as we used.

Now, to make it a bit more difficult we can look at a bimodal distribution, and see if it is still able to fit so well. As an example, I pick a mixture of two normal distributions, 60% of the sample comes from a normal distribution with mean -3 and standard deviation equal to one, 40% of the observation are from the normal distribution with mean 3 and the same standard deviation.

alpha = 0.6 #weight for (prob of) lower distribution
mlow, mhigh = (-3,3)  #mean locations for gaussian mixture
xn =  np.concatenate([mlow + np.random.randn(alpha * n_basesample),
               mhigh + np.random.randn((1-alpha) * n_basesample)])

With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:

plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
              (1-alpha) * stats.norm.pdf(ind, loc=mhigh),
              color="r", label='normal mix')

The next graph shows what we get in this case.


The estimated density is oversmoothing, the peaks of the estimated pdf are too small. So the automatic selection of the smoothing parameter doesn’t work in this case. Now, it’s time to find out how gaussian_kde actually selects the smoothing parameter or bandwith of the kernel.

But I’m out of time for today. We dig into this next time.

Saturday, March 14, 2009

Warmup: Fitting Distributions

As a warmup exercise, I generate some random samples, fit two distributions and plot the results. I also calculate the Kolmogorvo-Smirnov test using scipy.stats.kstest.

This is a shortened, simplified version of a script that I wrote to see how the dostrinutions in scipy.stats can be used to automatically fit some data and select the best fitting distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plothist(x, distfn, args, loc, scale, right=1):
    '''plot histogram and pdf, based on matplotlib doc example'''
    # the histogram of the data
    n, bins, patches = plt.hist(x, 25, normed=1, facecolor='green', alpha=0.75)
    maxheight = max([p.get_height() for p in patches])
    axlim = list(plt.axis())
    axlim[-1] = maxheight*1.05

    # add more points for density plot
    pdfpoints = bins + np.diff(bins)[0]*np.linspace(-0.5,0.5,11)[:,np.newaxis]
    pdfpoints = np.sort(pdfpoints.ravel())

    # calculate and plot the estimated density function
    yt = distfn.pdf(pdfpoints, loc=loc, scale=scale, *args)
    yt[yt>maxheight] = maxheight
    lt = plt.plot(pdfpoints, yt, 'r--', linewidth=1, label='estimated')
    # calculate and plot the density function that generated the data
    ys = stats.t.pdf(pdfpoints, 10, scale=10,)*right
    ls = plt.plot(pdfpoints, ys, 'b-', linewidth=1, label='true')

    plt.title(r'$\mathrm{Fitting\ Distribution\ %s :}\ \mu=%f,\ \sigma=%f$'%(distfn.name,loc,scale))

n = 500
dgp_arg = 10
dgp_scale = 10
rvs = stats.t.rvs(dgp_arg, scale=dgp_scale, size=n)
sm = rvs.mean()
sstd = np.sqrt(rvs.var())
ssupp = (rvs.min(), rvs.max())

for distr in [stats.norm, stats.t]:
    distname = distr.name
    # estimate parameters
    par_est = distr.fit(rvs,loc=sm, scale=sstd)
    print '\nFitting distribution %s' % distname
    print 'estimated distribution parameters\n', par_est
    arg_est = par_est[:-2]  # get scale parameters if any
    loc_est = par_est[-2]
    scale_est = par_est[-1]
    rvs_normed = (rvs-loc_est)/scale_est
    ks_stat, ks_pval = stats.kstest(rvs_normed,distname, arg_est)
    print 'ks-stat = %f, ks-pval = %f)' % (ks_stat, ks_pval)
    plothist(rvs, distr, arg_est, loc_est, scale_est, right = 1)
    plt.savefig('ex_dist1_%s.png'% distname)

#plt.show() # if we want to see it on screen
(Source code)



The script produces the following output for the parameter estimate and the Kolmogorov-Smirnov test

Fitting distribution norm
estimated distribution parameters
[ -0.70287027  11.22248481]
ks-stat = 0.037073, ks-pval = 0.493706)

Fitting distribution t
estimated distribution parameters
[ 7.8518085  -0.69695469  9.71315677]
ks-stat = 0.019562, ks-pval = 0.990926)

The p-values of Kolmogorov-Smirnov test are derived under the assumption that we test against a known distribution and not against a distribution with estimated parameters. But in this example, the numbers look pretty informative, the p-values are large for both distributions, from which I conclude that both distributions fit overall the sample relatively well. The pvalue of the t-distribution is about twice the p-value of the normal distribution, which strongly suggest that the distribution was generated by a t-distribution and not by a normal distribution. Although, if we look only at the pvalue of the normal distribution, we wouldn't reject the hypothesis that the sample was generated by the normal distribution.

However, I want to emphasis that this is an informal interpretation, in which we can be quite confident and which we know to be correct since we generated the data, but that it is not the result of a formal statistical test for it.

script file ex_dist1.py

Editorial comments:

Sphinx and restructured text work well, and the highlighting with pygments also works without additional intervention, but adding the graphs to blogger requires some manual upload and editing of the formatting, which still takes too much time. I haven't figured out yet how to upload source files, the upload file button seems to be missing from my blog editor.

update: some google searches later, it seems that it is not possible to attach regular text files to blogger, so it needs to be hosted somewhere else


I am trying to improve the statistics coverage in python using numpy and scipy. Here, I want to post some code examples and comments on the way in doing this.

The text and code are written in restructured text and then processed with sphinx, which I hope gets correct formatting and highlighting of code snippets. There will still be some stumbling to get this to work correctly.

A note on the copyright: Since numpy and scipy are BSD, I will be looking mostly at BSD or MIT licensed code. My code snippets are public domain, longer code parts are under the same BSD license as numpy and scipy, unless a different license is specified.