# Handling Excess Zeros in TCM (1): The data generation process

Revealed Preference surveys ask respondents to report their visitation patterns during a given time period. Respondents report the number of trips they made to one or more recreational sites.

However, in the presence of a large choice set of recreational sites, the respondent might report many recreational visits to some sites (interior solutions) and no visits to other sites (corner solutions). When corner solutions arise, the demand for recreation at those sites is zero (Phaneuf, 1999; Gurmu & Trivedi, 1996).

When analyzing count data, the existence of corner solutions is not necessarily problematic. However, if the number of zero visits is “excessive”, that is, if there is a mass of zeros in the distribution of recreational visits, then a modelling problem arises. What constitutes excess zeros in visitation data is context-specific, but as a rule of thumb I consider zeros to be “excessive” when they represent more than 20% to 30% of the observations.

## The Problem of Excess Zeros

The problem with predicting the number of visits when there are excess zeros is that the number of visits cannot be lower than zero. Econometrically, a model would need to predict negative visits for these observations in order to better fit the data.

The Poisson model assumes that all counts come from the same data-generating process (Gurmu & Trivedi, 1996). Both low and high visit counts can be explained at the margin by the same explanatory variables. While it is a restrictive assumption, it works well to explain recreation visits in some applications.

However, the choice of whether to recreate or not, i.e. zero visits versus one or more visits, might come from a data-generation process that is very different from the one that explains the number of visits chosen. The choice of whether to recreate or not can be expressed as a “1” if the respondent recreates, or “0” if (s)he chooses not to. Instead of Poisson, this can be modelled using a logit or probit model.
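In notation, the two candidate processes can be sketched as follows. The symbols are my own shorthand rather than the notation of the cited papers: $y_i$ is the number of visits by respondent $i$, $d_i$ indicates whether any visit was made, $x_i$ is the vector of explanatory variables, and $\beta$ and $\gamma$ are the coefficient vectors of the count and the participation equation.

$$
\Pr(y_i = k \mid x_i) = \frac{e^{-\lambda_i}\,\lambda_i^{k}}{k!}, \qquad \lambda_i = \exp(x_i'\beta) \qquad \text{(Poisson: how many visits)}
$$

$$
\Pr(d_i = 1 \mid x_i) = \frac{\exp(x_i'\gamma)}{1 + \exp(x_i'\gamma)}, \qquad d_i = \mathbf{1}[y_i > 0] \qquad \text{(logit: visit or not)}
$$

Under a single Poisson process, $\Pr(y_i > 0 \mid x_i) = 1 - e^{-\lambda_i}$ rises with $\lambda_i$, so a variable that increases the expected number of visits should also increase the probability of visiting at all. Coefficients that point in opposite directions in the two equations are therefore a sign that the zeros come from a different process.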

In this blog post, I will estimate 1) a Poisson model on the sample without the zeros and 2) a logit model for the decision to visit. The aim is to infer whether the data generation process is statistically the same in these two models. This is the first of a series of blog posts dedicated to handling excess zeros in visitation data.
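The idea can be sketched on simulated data where the answer is known. Everything in this example is made up for illustration (the seed, the sample size, the single covariate `age` and all parameter values are assumptions, not features of the survey): participation rises with age while the number of trips among visitors falls with age.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
n   <- 5000
age <- runif(n, 20, 70)           # hypothetical covariate

# Process 1 (participation): older respondents are MORE likely to visit
visited <- rbinom(n, size = 1, prob = plogis(-2 + 0.03 * age))

# Process 2 (intensity): among visitors, older respondents make FEWER trips
trips  <- 1 + rpois(n, lambda = exp(1 - 0.02 * age))
visits <- visited * trips         # non-visitors record zero trips

sim <- data.frame(visits, age, any_visit = as.integer(visits > 0))

# Poisson on visitors only, logit on the visit/no-visit indicator
sim_p <- glm(visits ~ age, family = "poisson", data = subset(sim, visits > 0))
sim_l <- glm(any_visit ~ age, family = "binomial", data = sim)

# The age coefficient should be negative in the Poisson and positive in the logit
round(c(poisson_age = unname(coef(sim_p)["age"]),
        logit_age   = unname(coef(sim_l)["age"])), 3)
```

Because the two processes were built with opposite age effects, the two models recover coefficients of opposite sign. This is the pattern to look for in the survey data.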

## Different Data Generation Processes

I am using my own data for illustration purposes. The data come from a recreational survey in northern Norway.

In my data, respondents were asked how many times they had visited the Norwegian fjord in the last 12 months. However, the visitation data recorded an unusually large number of zero counts, that is, people who had not visited the fjord during that period.

```
> summary(data$visits)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.0000  0.0000  0.0000  0.2742  0.0000 11.0000      38

> table(data$visits)

   0    1    2    3    4    5    6    7   10   11
1243   70   30   15    5    8    2    2    2    9
```

As seen above, 1243 respondents (89.7%) reported never having visited the Norwegian fjord.
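These figures can be checked without the survey file by rebuilding the visits vector from the frequency table and the NA count printed above. This is only a sketch: the vector holds the counts, not the other survey variables, and the object names are my own.

```r
# Rebuild the visits vector from the frequency table above, plus the 38 NAs
values <- c(0, 1, 2, 3, 4, 5, 6, 7, 10, 11)
freq   <- c(1243, 70, 30, 15, 5, 8, 2, 2, 2, 9)
visits <- c(rep(values, times = freq), rep(NA, 38))

share_zero <- mean(visits == 0, na.rm = TRUE)  # share of zeros among non-missing answers
n_visitors <- sum(visits > 0, na.rm = TRUE)    # respondents with at least one visit
n_indexed  <- length(visits[visits > 0])       # logical indexing keeps the NA positions

round(100 * share_zero, 1)  # 89.7
n_visitors                  # 143
n_indexed                   # 181
```

The last two numbers differ because `visits > 0` is NA for missing answers, and indexing with NA keeps those positions. This is worth keeping in mind whenever the data are subset on `visits > 0`.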

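To estimate the Poisson model, I first create a new dataframe object that only includes respondents who visited the Norwegian fjord one or more times. The comparison `visits > 0` evaluates to NA for the 38 respondents with missing visits, and indexing on it alone keeps those rows (181 rows in total), so I exclude them explicitly. This leaves the 143 respondents with a non-zero count in the table above.

`data_p <- data[!is.na(data$visits) & data$visits > 0, ]`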

I estimate the Poisson model using the glm function, specifying `family = "poisson"`.

```
model_p <- glm(visits ~ TC_final + Gender + age +
                 Income + household_children_u18,
               family = "poisson", data = data_p)
summary(model_p)
```

```
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)             1.891e+00  2.523e-01   7.497 6.54e-14 ***
TC_final               -1.929e-04  3.013e-05  -6.403 1.52e-10 ***
Gender                  2.395e-01  1.219e-01   1.964  0.04951 *
age                    -1.353e-02  4.372e-03  -3.095  0.00197 **
Income                  5.436e-07  1.947e-07   2.792  0.00525 **
household_children_u18 -1.769e-01  9.757e-02  -1.813  0.06982 .
```

Most coefficients have the expected sign. The travel cost (TC_final) coefficient is negative and statistically significant, meaning the number of trips to the Norwegian fjord decreases as the travel cost increases. Gender and income increase the number of predicted recreational trips, while age and the number of children in the household decrease the number of visits.
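Because the Poisson model uses a log link, a coefficient is easier to read once exponentiated: `exp(b * delta)` is the factor by which expected trips change when a variable rises by `delta`. The sketch below uses two estimates from the table above. The step sizes (1,000 units of travel cost, whose currency is not stated here, and 10 years of age) are arbitrary choices for illustration.

```r
b_tc  <- -1.929e-04   # TC_final coefficient from the Poisson output above
b_age <- -1.353e-02   # age coefficient from the Poisson output above

# Multiplicative change in expected trips for a 1,000-unit rise in travel cost
ratio_tc  <- exp(b_tc * 1000)
# Multiplicative change in expected trips for 10 additional years of age
ratio_age <- exp(b_age * 10)

# Percentage change in expected trips: roughly -17.5 and -12.7
round(100 * (c(travel_cost = ratio_tc, age = ratio_age) - 1), 1)
```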

Before estimating the logit model, I need to create a new binary dependent variable, Visits_logit, which equals 1 if the respondent visited the fjord at least once and 0 otherwise.
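The construction of this variable is not shown here. One way to do it, assuming it is simply an indicator for a positive number of visits, is sketched below on a toy data frame that stands in for the survey data.

```r
# Toy stand-in for the survey data; only the visits column matters here
data <- data.frame(visits = c(0, 0, 3, NA, 1, 0, 11))

# 1 if the respondent made at least one visit, 0 if none, NA if visits is missing
data$Visits_logit <- ifelse(data$visits > 0, 1, 0)

table(data$Visits_logit, useNA = "ifany")
```

Respondents with missing visits keep an NA in the new variable, so glm drops them from the logit rather than treating them as non-visitors.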

I also use the glm function to estimate the logit model, this time specifying `family = "binomial"`.

```
model_l <- glm(Visits_logit ~ TC_final + Gender + age +
                 Income + household_children_u18,
               family = "binomial", data = data)
summary(model_l)
```

```
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)            -2.781e+00  4.634e-01  -6.001 1.96e-09 ***
TC_final               -3.918e-04  6.349e-05  -6.170 6.82e-10 ***
Gender                  4.229e-01  2.178e-01   1.942  0.05216 .
age                     2.055e-02  7.445e-03   2.761  0.00577 **
Income                  1.377e-06  3.099e-07   4.443 8.86e-06 ***
household_children_u18  3.779e-01  1.540e-01   2.455  0.01410 *
```

The estimated coefficients are quite different with the logit model. Increases in the travel cost (TC_final) decrease the probability of visiting the Norwegian fjord. All other explanatory variables increase the probability of visitation of this fjord.

The two models differ markedly: the age and children coefficients switch sign. To illustrate the differences, I plot the regression coefficients with their confidence intervals (a dot-and-whisker plot) using the dotwhisker package:

```
install.packages("dotwhisker")
library(dotwhisker)
```

```
dwplot(list(model_p, model_l),
       vline = geom_vline(xintercept = 0, colour = "grey60", linetype = 2)) +
  theme_bw() + xlab("Coefficient Estimate") +
  theme(plot.title = element_text(face = "bold"),
        legend.background = element_rect(colour = "grey80"))
```

The code above yields a dot-and-whisker plot in which Model 1 (in yellow) is the Poisson model and Model 2 (in green) is the logit model. Gender and Income increase both the probability of visiting and the expected number of visits to the Norwegian fjord. The Travel Cost variable has a negative effect in both models. However, age and the number of children in the household work in opposite directions: both raise the probability of visiting but lower the expected number of trips.

Hence, I conclude that the data generation process differs between the probability of visiting and the expected number of trips, since age and the number of children push the two outcomes in opposite directions. If that is so, the zeros should not be modelled as coming from the same single process as the non-zero number of trips.
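A rough way to put a number on this is a Wald-type z statistic for the difference between the two sets of estimates, using only the values printed in the two output tables. This is an informal check under strong assumptions: it treats the Poisson and logit estimates as independent although the samples overlap, and it compares coefficients on different scales (log expected count versus log-odds). The sign reversal is more informative than the size of the statistic.

```r
# Estimates and standard errors copied from the two summary() tables above
est_p <- c(age = -1.353e-02, household_children_u18 = -1.769e-01)
se_p  <- c(age =  4.372e-03, household_children_u18 =  9.757e-02)
est_l <- c(age =  2.055e-02, household_children_u18 =  3.779e-01)
se_l  <- c(age =  7.445e-03, household_children_u18 =  1.540e-01)

# Wald-type z statistic for the difference, treating the estimates as independent
z <- (est_p - est_l) / sqrt(se_p^2 + se_l^2)
p <- 2 * pnorm(-abs(z))

round(cbind(z = z, p_value = p), 4)
```

Under these assumptions, both differences exceed the usual 1.96 threshold in absolute value.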

Future blog posts will review how to model excess zeros and the number of visits in a joint modelling framework.

References:

Gurmu, S., & Trivedi, P. K. (1996). Excess zeros in count models for recreational trips. Journal of Business & Economic Statistics, 14(4), 469-477.

Phaneuf, D. J. (1999). A dual approach to modeling corner solutions in recreation demand. Journal of Environmental Economics and Management, 37(1), 85-105.