Fitting Models to Data

In the first chapter, we learned that while Bayes' formula is simple and its derivation is straightforward, using it to estimate model parameters is hampered by the fact that its calculation typically involves multidimensional integration that is rarely computable in closed form. Most useful Bayesian models, therefore, require computational methods in order to obtain reasonable estimates. Now that we have some Python at our disposal, we can put it to work, stepping through a selection of numerical methods for fitting Bayesian models.

As a motivating example, we will use some real data to build and estimate a simple parametric model. Specifically, we will take measurements recorded from subjects in a medical research study and fit a normal distribution to the patients' weights.

>>> import numpy as np
>>> import pandas as pd

>>> patient_data = pd.read_csv('data/patients.csv')
>>> patient_data.head()

The weights are slightly skewed to the right, so we will model the logarithm of the values instead.

>>> import seaborn as sb
>>> import scipy.stats as stats

>>> log_weight = np.log(patient_data.weight.values)
>>> sb.distplot(log_weight, kde=False, fit=stats.norm)

To specify our model, we need to select a likelihood that describes the data as a function of the model parameters, and a set of priors for the parameters. We have already selected a normal distribution to describe the data; that will serve as the likelihood.

$$ f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{\tau}{2} (x-\mu)^2 \right) $$
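Note that this density is parameterized by the precision $\tau = 1/\sigma^2$ rather than the more familiar standard deviation. As a quick sanity check of the parameterization, we can code the density directly and compare it against `scipy.stats.norm`, whose `scale` argument is the standard deviation $1/\sqrt{\tau}$ (the simulated values below are a stand-in for the data, used only for this check):

```python
import numpy as np
from scipy import stats

def normal_like(x, mu, tau):
    # Normal density parameterized by precision tau = 1/sigma**2
    return np.sqrt(0.5 * tau / np.pi) * np.exp(-0.5 * tau * (x - mu) ** 2)

# Simulated stand-in values (e.g. plausible log-weights)
x = np.random.normal(4.3, 0.2, size=100)

# tau = 25 corresponds to scale = 1/sqrt(25) = 0.2
print(np.allclose(normal_like(x, mu=4.3, tau=25.0),
                  stats.norm.pdf(x, loc=4.3, scale=0.2)))
```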

This leaves us with the task of choosing priors for the mean $\mu$ and precision $\tau$. Since the mean is a continuous variable defined on the real line, it is natural to use another normal distribution for the prior. Though we have quite a lot of prior information regarding the mean weight of adult humans, for the time being we are going to ignore this information and specify a normal distribution with a zero mean (on the log scale) and a standard deviation of 10, which translates to a precision of 0.01. This results in a prior that is extremely vague, relative to our actual state of knowledge.

>>> import matplotlib.pyplot as plt

>>> mu_prior = lambda x, mu=0, tau=0.01: np.sqrt(0.5*tau/np.pi) * np.exp(-0.5*tau*(x-mu)**2)

>>> mu_x = np.linspace(-50, 50)
>>> plt.plot(mu_x, mu_prior(mu_x))
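Since a precision of 0.01 corresponds to a standard deviation of $1/\sqrt{0.01} = 10$, we can confirm that the hand-coded prior density agrees with the equivalent location-scale normal from `scipy.stats`:

```python
import numpy as np
from scipy import stats

mu_prior = lambda x, mu=0, tau=0.01: np.sqrt(0.5*tau/np.pi) * np.exp(-0.5*tau*(x-mu)**2)

mu_x = np.linspace(-50, 50)
# tau = 0.01 implies a standard deviation of 1/sqrt(0.01) = 10
print(np.allclose(mu_prior(mu_x), stats.norm.pdf(mu_x, loc=0, scale=10)))
```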