The Poisson distribution for count data

This week I focus on analyzing one distribution available to model count data (the Poisson distribution), as well as its properties. My aim is to understand how well the Poisson distribution does at fitting my data. I go beyond checking measures of fit, by visualizing the distribution in a plot.

Count data is a special type of data. Not only is it always positive (i.e. negative counts do not make sense), but it also only includes integers (e.g. 1, 2, 3, etc). When handling count data, ordinary least squares might not yield good results (Wooldridge, 2010). OLS might result in predicted counts being negative, especially if there are a large number of zero counts. The ultimate problem of negative counts is poor model fit, which may yield biased estimates of any welfare measures we are interested in.

Instead we can assume that the count data (e.g. number of trips) has a Poisson distribution. It solves the problems of negative and non-integers count predictions. The density of the dependent variable given the control variables under the Poisson assumption is:

f(y|x) = exp[-\mu (x)][\mu (x)]^y / y!

where \mu (x) is the conditional mean of the dependent variable. So, if we know the conditional mean of our count data, we know the assumed Poisson distribution.

Let us simulate some data that follows a Poisson distribution. Let us say that the conditional mean of the count data we are interested in is 4 counts. How does the underlying Poisson distribution look like?

I can generate some densities for different counts by using the dpois function in R. Let us say that counts vary between zero and 15. If the conditional mean is 4, then the probability function looks like this:

x <- 0:15
 miu <- 4
 plot(dpois(x, miu), type = "l", lwd = 2,
      main = "Poisson probability function",
      ylab = "P(X = x)", xlab = "Number of events")

Of course, there is a spike of counts at the conditional mean of 4, with no negative counts being predicted. How does this curve looks like for different conditional means?

I will use conditional means ranging from 0 to 7 and plot them in the same graph.

miu <- 0
 plot(dpois(x, miu), type = "l", lwd = 2,
      main = "Poisson probability function",
      ylab = "P(X = x)", xlab = "Number of events")
 lambda <- 1
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey10")
 lambda <- 2
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey20")
 lambda <- 3
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey30")
 lambda <- 4
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey40")
 lambda <- 5
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey50")
 lambda <- 6
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey60")
 lambda <- 7
 lines(dpois(x, lambda), type = "l", lwd = 2, col = "grey70")

The higher the conditional mean, the more the density function will shift to the right. Moreover, since the assumption of the Poisson distribution is that the conditional mean equals the conditional variance, the higher the conditional mean, the more spread out the density distribution is (i.e. higher variance).

For a parametric model, the conditional mean is the predicted dependent variable given explanatory variables X. For example, the conditional mean can be represented by:

\mu (x) = exp(x \beta)

where \beta are the parameters to be estimated.

My goal is to run a simple Poisson regression and check how well these assumptions fit my data. I will use my own data for this exercise, which is data on beach visitation in Norwegian beaches. Our dependent variable is the number of beach trips per person over the course of a year.

   summary(data_2019$V_Total)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.00    0.00    4.00   10.09   12.00  266.00  

The unconditional mean is 10.09, with a minimum or 0 and a maximum of 266 trips.

Let us run a simple Poisson regression to get the conditional mean. I use fairly conventional regressors to explain the number of beach trips: the travel cost, a dummy for gender, age, years of education, net income and a dummy for being a member of a tourist organization.

model_2019 <- glm(V_Total ~ TC_Total + A27_DKVINNE + A28_ALDER + A29_SKOLE + INCOME + A37_DTURIST , 
                      family="poisson", 
                      data=data_2019)
 summary(model_2019)

I will not paste the regression table here, since it is not of interest to me. What I am interested in is the conditional mean. I can look it up by checking the fitted.values vector from the object that has the regression results:

   summary(model_2019$fitted.values)
       Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.01096  7.00628 10.23836 10.09519 12.79558 46.23019  

Unsurprisingly, the conditional mean (10.09519) is the same as the unconditional mean (10.09). However, the minimum and maximum values for beach trips are a bit different from the original data. The minimum is no longer zero (0.01096) and the maximum (46) is significantly lower than the maximum number of trips (266).

At this point, I am still unsure whether the Poisson regression results in good fit for my data. One could look at the AIC (17638) and compare it with an alternative model. To conclude my analysis, I will plot the number of beach trips and the fitted values from using this Poisson distribution assumption.

I use ggplot to plot the densities of beach visits in the same plot:

ggplot() + 
   labs(x="Number of Visits", y="Density") +
   geom_density(aes(Offsite_counts), alpha=.2, fill="cadetblue1", colour = I("cadetblue4")) +
   geom_density(aes(model_2019$fitted.values), alpha=.2, fill="gold1", colour = I("gold4"))
   xlim(NA, 10)

The blue density curve represents my raw data and the yellow one represents the fitted data with the regression estimated above.

We can see that the fitted data is not satisfactory. It predicts a peak of counts at the conditional mean (10.09), while the original data had a peak at zero.

The biggest limitation of the Poisson distribution is that the conditional variance and conditional mean are equal (Wooldridge, 2010). In our data, the conditional variance and conditional means are likely not the same: the variance is likely to be much higher than the conditional mean.

Instead, it is worth investigating whether other distributions provide better fitted values. My favorite alternative is the negative binomial distribution, which provides more flexibility, rather than imposing that the conditional variance has to equal the conditional mean. My data has a large amount of zero counts, as can be seen in the graph. Hence, alternative distributions such as the zero-inflated Poisson distributions, may also be worth investigating.

References:

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT press.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s