# Handling Excess Zeros in TCM (2): The Zero-Inflated Models

This post follows up on my last one, which introduced the problem of excess zeros in counts of recreational visits. To summarize, I concluded that when visitation data contain many zeros, i.e. many non-visitors, the choice of whether to participate and the decision of how many times to visit may come from two distinct data-generating processes. In other words, these are two distinct choices, and a given explanatory variable may affect each one differently.

Imagine you are considering how many times to visit a well-known beach. Some people will choose to go several times, and some will choose not to go. Some people, however, will not even consider that particular beach when choosing a recreational destination. Their zeros are not “true” zeros, because going to that particular beach was never part of their choice.

If this is the case, there are both true and fake zeros in the data, and the data-generating process differs between the two.
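To make the two kinds of zeros concrete, here is a minimal simulation sketch in base R. The share of people who never consider the site (`p_never`) and the mean visit rate (`lambda`) are made-up values for illustration, not estimates from my data.

```r
# Simulate counts with two sources of zeros (illustrative values only)
set.seed(42)
n       <- 10000
p_never <- 0.6   # assumed share who never consider the site ("fake" zeros)
lambda  <- 1.5   # assumed mean visits among those who do consider it

never  <- rbinom(n, size = 1, prob = p_never)        # 1 = never considers the site
visits <- ifelse(never == 1, 0L, rpois(n, lambda))   # considerers can still draw 0

# Zeros from non-considerers vs. zeros from people who chose not to go
zeros_never  <- sum(never == 1)
zeros_choice <- sum(never == 0 & visits == 0)

# Observed share of zeros vs. what a plain Poisson with the same mean implies
share_zero_obs     <- mean(visits == 0)
share_zero_poisson <- dpois(0, mean(visits))
print(c(observed = share_zero_obs, poisson = share_zero_poisson))
```

In this sketch the observed share of zeros is well above what a Poisson with the same mean would produce, which is the excess-zeros symptom.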

The remaining question is the following: if the zeros and the positive counts are from distinct data processes, how can we model both decisions in a single framework? For this purpose, I will use zero-inflated models. Some studies applying the travel cost method use zero-inflated models, such as Anderson (2010), Jeuland et al. (2010) and Munaretto & Ando (2019).

## The Zero-Inflated Models

The Zero-Inflated model is a slight modification of a count data model that introduces a zero-inflation parameter.[1] The Zero-Inflated model can be combined with either the Poisson or Negative Binomial model.

I am using the same data as in my previous blog post. The survey data pertain to visitation behavior at a Norwegian fjord. The problem with this data is that almost 9 out of 10 respondents reported not having visited the fjord in the last year, hence there are excess zeros. Some of these zero visits belong to individuals who do not even consider visiting the fjord.

The Zero-Inflated model is composed of two parts: a logit model, which explains whether an observation is an excess zero (a respondent who never considers visiting) or not, and a count data model, which explains how many times a respondent visits the site. Note that the count part can also produce zeros. If the count data model is a Poisson, the corresponding ZI model is the Zero-Inflated Poisson (ZIP) model. If the Negative Binomial model is used instead, it is the Zero-Inflated Negative Binomial (ZINB) model.
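For reference, this is a sketch of how the ZIP model is usually written. The notation is my own shorthand and is not taken from the references: $\pi_i$ is the probability that respondent $i$ is an excess zero, $\lambda_i$ is the Poisson mean, and $z_i$ and $x_i$ are the covariates of the logit and count parts.

$$
P(Y_i = 0) = \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, \qquad
P(Y_i = y) = (1 - \pi_i)\, \frac{e^{-\lambda_i} \lambda_i^{y}}{y!}, \quad y = 1, 2, \dots
$$

$$
\pi_i = \frac{\exp(z_i'\gamma)}{1 + \exp(z_i'\gamma)}, \qquad
\lambda_i = \exp(x_i'\beta), \qquad
E[Y_i] = (1 - \pi_i)\, \lambda_i
$$

Written this way, a zero can come from either part, while positive counts can only come from the count part.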

Interestingly, I found that the zero-inflated model does not converge unless variables with very large values are rescaled to smaller units. For example, I am using Income to explain the number of recreational trips. Once Income is divided by 100000, so that it ranges between 0.5 and 20, the model converges properly.

Income before scaling looks like this:

```
> summary(data_NA$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  50000  450000  650000  721768  950000 2000000
```

After scaling, income looks like this:

```
> summary(data_NA$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.500   4.500   6.500   7.218   9.500  20.000
```
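The rescaling itself is a one-liner. As a minimal sketch, I apply it here to a small stand-alone vector (`income_raw`, a hypothetical name) that holds the quantiles from the summary above instead of the full survey column:

```r
# Rescale a large-valued covariate before fitting (values from the summary above)
income_raw    <- c(50000, 450000, 650000, 950000, 2000000)
income_scaled <- income_raw / 100000   # income is now in units of 100,000

print(range(income_scaled))
```

After rescaling, the Income coefficient is read as the effect of a 100,000 increase in income instead of a one-unit increase.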

## The zeroinfl function

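Zero-inflated models can be estimated using five different packages in R.

I am only using the zeroinfl function (available in the pscl package) to estimate a Zero-Inflated Poisson model. In the formula, the part before the `|` specifies the count model and the part after it specifies the zero-inflation model. The first block of output explains the count data, and the second block explains the probability of being an excess zero, that is, of not visiting.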

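```
library(pscl)  # zeroinfl() is provided by the pscl package

# Count part before "|", zero-inflation (logit) part after it
model_zip <- zeroinfl(visits ~ TC_final + Gender + age + Income +
                        household_children_u18 |
                        TC_final + Gender + age + Income +
                        household_children_u18,
                      dist = 'poisson',
                      data = data_NA)
summary(model_zip)
```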
```
Count model coefficients (poisson with log link):
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)             2.040115   0.281035   7.259 3.89e-13 ***
TC_final               -0.284533   0.041274  -6.894 5.43e-12 ***
Gender                  0.323733   0.141146   2.294 0.021814 *
age                    -0.018962   0.005022  -3.776 0.000159 ***
Income                  0.090633   0.024593   3.685 0.000228 ***
household_children_u18 -0.408355   0.135009  -3.025 0.002489 **

Zero-inflation model coefficients (binomial with logit link):
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)             3.082541   0.489984   6.291 3.15e-10 ***
TC_final                0.283438   0.071091   3.987 6.69e-05 ***
Gender                 -0.315446   0.236685  -1.333 0.182609
age                    -0.028039   0.008082  -3.469 0.000521 ***
Income                 -0.101142   0.034658  -2.918 0.003520 **
household_children_u18 -0.689539   0.224407  -3.073 0.002121 **
```

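The count part of the model suggests that visits to the Norwegian fjord decrease as the travel cost to get there increases. This outcome conforms with the reasoning behind the travel cost method. The second regression has to be read with care: in zeroinfl, the logit part models the probability of being an excess zero, not the probability of visiting, so the signs are reversed relative to a "visit or not" model. The positive coefficient on travel cost (0.28) therefore means that a higher travel cost makes a respondent more likely to be a non-visitor, and the negative coefficient on income (-0.10) means that higher income lowers the probability of being an excess zero. Both results are in line with what I would expect.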
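Coefficients on the log and logit scales are easier to read once exponentiated. The sketch below copies two estimates from the ZIP output above into plain vectors (the names `count_coef` and `zero_coef` are mine) and assumes the zeroinfl convention that the logit part models the probability of an excess zero.

```r
# Estimates copied by hand from the ZIP output above
count_coef <- c(TC_final = -0.284533, Income =  0.090633)
zero_coef  <- c(TC_final =  0.283438, Income = -0.101142)

# Count part: multiplicative change in expected visits per one-unit increase
irr <- exp(count_coef)

# Zero-inflation part: multiplicative change in the odds of being an excess zero
odds_ratio <- exp(zero_coef)

print(round(irr, 3))
print(round(odds_ratio, 3))
```

Under these estimates, a one-unit increase in travel cost multiplies expected visits in the count part by roughly 0.75, and it raises the odds of being an excess zero.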

I can easily estimate a ZI Negative Binomial model by changing the R command slightly and specifying `dist = "negbin"`. As a reminder, the advantage of the negative binomial is that it accounts for overdispersion in count data. For a refresher on why overdispersion might be a problem, I recommend reading my previous blog post.

```
model_zinb <- zeroinfl(visits ~ TC_final + Gender + age + Income +
                         household_children_u18 |
                         TC_final + Gender + age + Income +
                         household_children_u18,
                       dist = 'negbin',
                       data = data_NA)
summary(model_zinb)
```
```
Count model coefficients (negbin with log link):
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)             1.42788    0.78958   1.808   0.0705 .
TC_final               -0.36211    0.08012  -4.520 6.19e-06 ***
Gender                  0.50330    0.30214   1.666   0.0958 .
age                    -0.03015    0.01195  -2.524   0.0116 *
Income                  0.16491    0.06548   2.519   0.0118 *
household_children_u18 -0.54186    0.22599  -2.398   0.0165 *
Log(theta)             -0.90755    0.79813  -1.137   0.2555

Zero-inflation model coefficients (binomial with logit link):
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)             3.02527    0.78545   3.852 0.000117 ***
TC_final                0.21908    0.10024   2.185 0.028853 *
Gender                 -0.22734    0.41511  -0.548 0.583925
age                    -0.05529    0.02935  -1.884 0.059621 .
Income                 -0.03232    0.08999  -0.359 0.719511
household_children_u18 -1.24110    0.57020  -2.177 0.029510 *
```

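Comparing the ZI Negative Binomial with the ZIP model estimated previously, I see that the estimated coefficients keep the same signs but differ in magnitude (for example, the travel cost coefficient in the count part is -0.36 instead of -0.28). The standard errors are larger throughout, which is what I would expect once the model allows for overdispersion.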

Are the estimated ZIP or ZI Negative Binomial models any better than just running a Poisson (or Negative Binomial) and ignoring the excess zeros? How can we compare all the models to see which one we should use?

Some references argue that the Vuong test can be used: if the Vuong test statistic is statistically significant, the ZIP model is taken to perform better than the Poisson model. However, [1] argues that the Zero-Inflated model is not a nested model, hence the Vuong test cannot be used.

Instead, I report the AIC of each of the four relevant models: a Poisson, a Negative Binomial, a Zero-Inflated Poisson and a Zero-Inflated Negative Binomial model. Beware that the AIC values are only comparable if all models are estimated on the same sample.

```
> AIC(model_p)
[1] 1497.835
> AIC(model_nb)
[1] 1015.715
> AIC(model_zip)
[1] 1071.637
> AIC(model_zinb)
[1] 1003.822
```

According to the AIC, the most appropriate model is the ZI Negative Binomial model, because it has the lowest AIC (1003.8). Hence, I conclude that opting for a Zero-Inflated model improves the fit, given the excess zeros in my data. Nonetheless, allowing for overdispersion, that is, going from the Poisson to the Negative Binomial model, already improves the fit substantially: the AIC drops from 1497.8 to 1015.7, which is even lower than the AIC of the ZIP model (1071.6).
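A quick way to line the four models up is to look at AIC differences relative to the best one. This sketch simply re-enters the AIC values printed above into a named vector (`aic` and the labels are my own choice), so it runs without the fitted model objects.

```r
# AIC values copied by hand from the output above
aic <- c(Poisson = 1497.835, NegBin = 1015.715, ZIP = 1071.637, ZINB = 1003.822)

# Difference to the lowest AIC, sorted from best to worst
delta_aic  <- sort(aic - min(aic))
best_model <- names(delta_aic)[1]

print(round(delta_aic, 3))
print(best_model)
```

The ranking shows that the two negative binomial variants sit close together, while both Poisson variants are far behind.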

To finalize this analysis, I will plot the histograms of the predicted number of trips given each model.

First, I need to predict the number of trips given the zero-inflated models:

```
data_NA$predictions_zip <- predict(model_zip, data.frame(data_NA),
                                   type = "response")
data_NA$predictions_zinb <- predict(model_zinb, data.frame(data_NA),
                                    type = "response")
```

I also predicted the outcome variable (number of visits) for the Poisson and Negative Binomial models. I want to plot the four histograms of all visit predictions.

```
par(mfrow=c(2,2))

hist(data$predictions_p      , col=c("#C0C0C0"), xlim=c(0,10),
     main="Poisson model")
hist(data$predictions_nb     , col=c("gold")   , xlim=c(0,10),
     main="Negative Binomial model")
hist(data_NA$predictions_zip , col=c("orange") , xlim=c(0,10),
     main="ZI Poisson model")
hist(data_NA$predictions_zinb, col=c("#CD7F32"), xlim=c(0,10),
     main="ZI Negative Binomial model")
```

The model that predicts the lowest number of zeros is the ZI Poisson Model. Interestingly, the Poisson model is the one that predicts the highest counts (7 visits).
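The histograms show predicted mean visits, which combine both parts of the model and are never exactly zero. To see how a ZIP model turns coefficients into a predicted mean and a predicted probability of zero visits, here is a hand computation for one hypothetical respondent. The covariate values in `x` are invented for illustration (I am assuming a 0/1 coding for Gender), and the coefficients are copied from the ZIP output above.

```r
# ZIP estimates copied by hand from the output above
# Order: intercept, TC_final, Gender, age, Income, household_children_u18
beta  <- c(2.040115, -0.284533,  0.323733, -0.018962,  0.090633, -0.408355)  # count part
gamma <- c(3.082541,  0.283438, -0.315446, -0.028039, -0.101142, -0.689539)  # zero part

# Hypothetical respondent (made-up values, same order as the coefficients)
x <- c(1, 2, 1, 40, 6.5, 0)

lambda  <- exp(sum(x * beta))       # mean of the Poisson part
pi_zero <- plogis(sum(x * gamma))   # probability of being an excess zero

expected_visits <- (1 - pi_zero) * lambda                     # overall predicted mean
prob_zero       <- pi_zero + (1 - pi_zero) * exp(-lambda)     # P(visits = 0)

print(c(expected_visits = expected_visits, prob_zero = prob_zero))
```

The predicted probability of zero is always at least as large as the excess-zero probability, because the count part adds its own zeros on top.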

Besides the zero-inflated model, there are other alternatives to analyze the excess zeros in visitation data. I will explore other alternatives in future blog posts.

References:

Anderson, D. M. (2010). Estimating the economic value of ice climbing in Hyalite Canyon: An application of travel cost count data models that account for excess zeros. Journal of Environmental Management, 91(4), 1012-1020.

Jeuland, M., Lucas, M., Clemens, J., & Whittington, D. (2010). Estimating the private benefits of vaccination against cholera in Beira, Mozambique: A travel cost approach. Journal of Development Economics, 91(2), 310-322.

Munaretto, C. M., & Ando, A. W. (2019). Valuing Urban Beaches: Distribution of Benefits across Race and Income. http://www.bioecon-network.org/pages/19th_2017/Ando.pdf