Handling excess zeros in TCM (3): The Hurdle Model

I have been discussing the problem of modelling excess zeros in visitation data. First I showed that the participation decision (i.e. whether or not to visit at all) and the visitation frequency decision (i.e. how many visits to take) might be different data generation processes. For this end, I used my own visitation data, but you can easily adapt the code to your own data.

There are different approaches to handle excess zeros. In my previous blog post, I used zero-inflated count data models. In this blog post, I will use another approach: the hurdle model.

The Hurdle Model

The Hurdle model is somewhat similar to the zero-inflated model. It takes its name from the idea that the number of visits is only higher than one if a certain threshold has been passed (Martinez-Cruz, 2016). If you pass a hurdle or a threshold, then you have a non-zero count.

The Hurdle model consists of two parts: one that establishes a data generation process to explain the zero counts and another data generation process for number of visits.[1] The second part explains the number of trips conditional on them being positive.

Just like the zero-inflated model, the first part of the hurdle model is a binary model, generally a logit or probit (1 denoting to participate and 0 not to visit at all). The second part however is based on a zero-truncated count model (Martinez-Cruz, 2016), either a Poisson or a Negative Binomial model. By truncated it means that only positive counts are modelled (zeros are excluded from analysis).

However, a hurdle model is actually equivalent to estimating a binary model to explain participation, and then explain the number of positive visits by omitting zero counts. In other words, the hurdle model is not one single model, but rather two sequentially estimated models.

The pscl package

In R, we can fit a hurdle model by using the pscl package. I could not find another package to run a hurdle model. I will not start by installing it because I already have it in my computer. Again, the data I am using is about visitation patterns in a Norwegian fjord.

To re-iterate from a previous post, my data suffers from excess zeros. Only one out of ten respondents has actually been in this Norwegian fjord in the last 12 months, that is 1243 respondents report zero visits to the Norwegian fjord.

> table(data$visits)   
0     1     2     3     4     5     6     7     10     11  
1243  70    30    15    5     8     2     2     2      9

I will run a hurdle Poisson model. The syntax is almost identical to how I estimated a Poisson model. I use the hurdle function in the pscl package.

library(pscl)
model_hp <- hurdle(visits ~ TC_final + Gender + age + Income + 
                            household_children_u18, 
                   data = data_NA, 
                   dist = "poisson", 
                   zero.dist = "binomial")
summary(model_hp)
Count model coefficients (truncated poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             2.107568   0.281474   7.488 7.01e-14 ***
TC_final               -0.315769   0.043557  -7.250 4.18e-13 ***
Gender                  0.301448   0.140928   2.139 0.032434 *  
age                    -0.017393   0.004904  -3.547 0.000390 ***
Income                  0.082924   0.023809   3.483 0.000496 ***
household_children_u18 -0.361634   0.125530  -2.881 0.003966 ** 
Zero hurdle model coefficients (binomial with logit link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -2.780688   0.463385  -6.001 1.96e-09 ***
TC_final               -0.391766   0.063497  -6.170 6.84e-10 ***
Gender                  0.422945   0.217815   1.942  0.05217 .  
age                     0.020553   0.007445   2.761  0.00577 ** 
Income                  0.137677   0.030986   4.443 8.87e-06 ***
household_children_u18  0.377938   0.153981   2.454  0.01411 *

The output of the hurdle model includes the first part explaining the count data and the second part explaining participation. Again, estimating this hurdle model would have been the same as first estimating a logit model, then omitting the zeros, and then running a count data on positive count of visits.

Just like the Poisson model, the number of visits decreases as the travel cost increases (coefficient is -0.31). The number of visits increases as income increases (coefficient is 0.138), hence visits to this Norwegian fjord are a normal good.

The second part explains the decision of whether to visit at all. The Travel cost variable also has a negative effect on the probability of visiting the Norwegian fjord, as expected. This is different from the output from the zero-inflated poisson model. The coefficients associated with Age and number of children of the respondent are the only two that change signs across the two models.

To complement the analysis, I will run a hurdle Negative Binomial model.

model_hnb <- hurdle(visits ~ TC_final + Gender + age + Income + 
                             household_children_u18, 
                    data = data_NA, 
                    dist = "negbin", 
                    zero.dist = "binomial")
Count model coefficients (truncated negbin with log link):
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)             1.64465    0.77846   2.113   0.0346 *  
TC_final               -0.36608    0.07877  -4.647 3.36e-06 ***
Gender                  0.44347    0.32638   1.359   0.1742    
age                    -0.02399    0.01139  -2.105   0.0353 *  
Income                  0.10805    0.05417   1.995   0.0461 *  
household_children_u18 -0.37974    0.26687  -1.423   0.1548    
Log(theta)             -0.75801    0.75249  -1.007   0.3138    
Zero hurdle model coefficients (binomial with logit link):
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -2.780688   0.463385  -6.001 1.96e-09 ***
TC_final               -0.391766   0.063497  -6.170 6.84e-10 ***
Gender                  0.422945   0.217815   1.942  0.05217 .  
age                     0.020553   0.007445   2.761  0.00577 ** 
Income                  0.137677   0.030986   4.443 8.87e-06 ***
household_children_u18  0.377938   0.153981   2.454  0.01411 *

Of course, the second part of the output should be exactly the same, since the logit model is still being used to explain participation. As expected, when comparing the Poisson and Negative Binomial models, the coefficients themselves do not change, but their statistical significance does. Only Travel cost, age and income remain statistical significance at the 5% level.

Just like in my previous blog post, I will compare the models we have obtained so far: 1) Poisson (model_p), 2) Negative Binomial (model_nb), 3) Zero-Inflated Poisson (model_zip), 4) Zero-Inflated Negative Binomial (model_zinb), 5) Hurdle Poisson (model_zp), and 6) Hurdle Negative Binomial (model_hnb).

First, I analyze the fit of each model:

 > AIC(model_p)
[1] 1497.835
> AIC(model_nb)
[1] 1015.715
> AIC(model_zip)
[1] 1071.637
> AIC(model_zinb)
[1] 1003.822
> AIC(model_hp)
[1] 1065.867
> AIC(model_hnb)
[1] 1001.923

According to the AIC, the model with the best fit is the hurdle negative binomial model (lowest AIC).

To complete the analysis, I will plot the predicted values side by side.

par(mfrow=c(3,2))

hist(data_NA$predictions_p   , col=c("#C0C0C0") , xlim=c(0,10),
     main="Poisson model" )
hist(data_NA$predictions_nb  , col=c("gold")    , xlim=c(0,10),
     main="Negative Binomial model" )
hist(data_NA$predictions_zip , col=c("orange")  , xlim=c(0,10),
     main="ZI Poisson model" )
hist(data_NA$predictions_zinb, col=c("#CD7F32") , xlim=c(0,10),
     main="ZI Negative Binomial model" )
hist(data_NA$predictions_hp  , col=c("purple")  , xlim=c(0,10),
     main="Hurdle Poisson model" )
hist(data_NA$predictions_hnb , col=c("blue")    , xlim=c(0,10),
     main="Hurdle Negative Binomial model")

Visit predictions

After cleaning the missing data, my sample includes 882 non-visitors. I interpret a zero to be a predicted number of trips lower than 0.5. If so, the Poisson model predicts 841 “zero” visitors, 844 when using the zero-inflated Poisson and 851 when using the Hurdle Poisson model. The Negative Binomial models predict slightly more zeros, which is expected since the AIC indicates they fit the data better. The Negative Binomial model predicts 869 non-visitors, the Zero-Inflated 849, and the Hurdle model 854 non-visitors.

In my context, it seems that the Hurdle Negative Binomial performs best. The problem is that, just like I mentioned previously, the hurdle model is estimated sequentially, and it is thus two models instead of a single one. The remaining question we have to answer is: how to estimate one single model that can handle excess zeros?

On a final note, the main purposed of estimating a recreational demand model like we are doing is to estimate measures of welfare, such as consumer surplus. A follow-up question left to answer given our analysis is the following: how does model selection affect any welfare measures, namely consumer surplus?

 

References:

Martinez-Cruz, A. L. (2016). Handling excess zeros in count models for recreation demand analysis without apology. CER-ETH–Center of Economic Research at ETH Zurich, Working Paper16, 253.

[1] https://data.library.virginia.edu/getting-started-with-hurdle-models/