Handling Excess Zeros in TCM (4): The Latent Class Model

This is part 4 of a series of posts about handling excess zeros in visitation data. In previous posts, I have illustrated how to model visitation data when there are too many zero visits by estimating the most common econometric models. That is, a regular Poisson and Negative Binomial models, the Zero-Inflated models and the Hurdle models.

I defined excess zeros as a large portion of the visitor sample reporting never having been to the site. In my case, this proportion is almost 90%.

Once the researcher recognizes the presence of excess zeros, zero-inflated and hurdle model are the most frequent ways to model the visitation data. These models assume that there are “pure” non-visitors and that, like I have argued in my previous post, the participating decision and the visit frequency decision belong to two different data generation processes.

The problem with the zero-inflated and hurdle models is that they separate participants from non-participants. For example, the hurdle model is literally two models estimated separately. This results in two complications: the travel cost coefficient is estimated only using data on the visitation frequency, and hence the welfare measures that are obtained from these studies exclude non-participants (Martinez-Cruz, 2016).

Martinez-Cruz (2016) proposes the use of the latent class count data model to explain the excess zeros in visitation data. He argues that the latent class model can handle excess zeros within one theoretical framework. The idea of the latent class is that respondents belong to different classes. The estimated effect of one variable differs for each class, but class membership for each individual is not observed. Instead, class membership is a latent variable.

Martinez-Cruz (2016) estimates a lantent class Poisson model with 4 classes. Whereas he interprets that the first class, which is comprised of 56% of the sample, includes mostly infrequent visitors with low counts of visits, the 4th class (11%) includes the very frequent visitors. Groups 2 and 3 include relatively frequent visitors. The travel cost coefficient is negative and statistically significant across all four groups. He also allowed some classes to follow a Negative Binomial distribution instead, but those yielded a worse fit.

Finally, he compared two outcomes across the zero-inflated negative binomial, the hurdle negative binomial and the latent class poisson models: 1) the predicted frequency of counts and 2) the consumer surplus estimates. While the latent class model seems to replicate the relative frequencies with the lowest margin of error, it also produces the lowest consumer surplus per trip (0.89 US dollars).  

The Latent Class Count Data Models

I have previously encountered latent class models when accounting for preference heterogeneity in discrete data models. The same idea can be applied to count data, that is the Poisson and Negative Binomial models. I assume that there are 2, 3 or more classes and each individual has a certain probability of belonging to a given class. The effect of the travel cost on visitation depends on class membership.

I found that R already has a package to estimate a latent class Poisson model, which was a relief. The alternative to estimating this model without a package would be to write down the log-likelihood function from scratch. The R package is called flexmix.

install.packages("flexmix")
library(flexmix)

The function we use to implement a latent class model is stepFlexmix, althougth flexmix can also be used. To specify a Poisson latent class model, I need to add the option model=FLXMRglm(family=”poisson”), as seen below. The k= option defines how many classes to estimate. Let’s only assume 2 classes for now (k=2). The nrep= option defines how many times the flexmix function should run to find the maximum log-likelihood value. I use 20 repetitions for now.

model_p_latent <- stepFlexmix(visits ~ TC_final + Gender + age + Income + 
                                       household_children_u18, 
                              data = data_NA,
                              model=FLXMRglm(family="poisson"),
                              k=2, nrep=20 )
summary(refit(model_p_latent))
> summary(model_p_latent)

Call:
stepFlexmix(visits ~ TC_final + Gender + age + Income + household_children_u18, data = data_NA, model = FLXMRglm(family = "poisson"), 
    k = 2, nrep = 20, verbose = TRUE)

        prior size post>0  ratio
Comp.1 0.9318  933    972 0.9599
Comp.2 0.0682   57    936 0.0609

'log Lik.' -504.061 (df=13)
AIC: 1034.122   BIC: 1097.792

The output above indicates that the first class is comprised of 933 respondents, while the second class includes 57 of the sample. I hypothesize that the first class is mainly comprised of non-participants or rather respondents who visit very infrequently. The second class hence includes mostly frequent visitors. The coefficients for each of the samples is estimated as follows:

$`Comp.1`
                         Estimate Std. Error z value  Pr(>|z|)    
(Intercept)            -2.9534372  0.8038068 -3.6743 0.0002385 ***
TC_final               -1.0114666  0.1372227 -7.3710 1.694e-13 ***
Gender                  1.1378848  0.5435606  2.0934 0.0363143 *  
age                     0.0309641  0.0088523  3.4979 0.0004690 ***
Income                  0.2198476  0.0417596  5.2646 1.405e-07 ***
household_children_u18  0.3357126  0.1979914  1.6956 0.0899632 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$Comp.2
                         Estimate Std. Error z value  Pr(>|z|)    
(Intercept)             2.0243300  0.3626169  5.5826 2.370e-08 ***
TC_final               -0.3010176  0.0486378 -6.1890 6.056e-10 ***
Gender                  0.6173922  0.1946969  3.1710  0.001519 ** 
age                    -0.0082096  0.0062446 -1.3147  0.188621    
Income                  0.0762639  0.0292129  2.6106  0.009038 ** 
household_children_u18 -0.2800737  0.1414025 -1.9807  0.047627 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

A note: I found that, similarly to the zero-inflated model, the variables need to be rescaled (i.e., all variables close to unity) to allow the model to converge. Hence, I divided income and travel cost by 100000 and 1000, respectively.

The output above indicates that in the first class respondents are three times more responsive to travel cost changes than respondents in the second class. Moreover, the age and number of children coeffients have different signs in the two classes.

To run a Negative Binomial Latent Class model I still use the flexmix package, but I also need the countreg package.

install.packages("countreg", repos="http://R-Forge.R-project.org")
library(countreg)

The command to run this model is almost identical as before, but with the model=FLXMRnegbin() option. This option is only available because I installed the countreg package.

model_nb_latent <- stepFlexmix(visits ~ TC_final + Gender + age + Income + 
                                        household_children_u18, 
                               data = data_NA,
                               model=FLXMRnegbin(),
                               k=2, nrep=20)
summary(refit(model_nb_latent, gradient = NULL))
> summary(model_nb_latent)

Call:
stepFlexmix(visits ~ TC_final + Gender + age + Income + household_children_u18, data = data_NA, model = FLXMRnegbin(), 
    k = 2, nrep = 20)

       prior size post>0  ratio
Comp.1 0.678  905    954 0.9486
Comp.2 0.322   85    990 0.0859

'log Lik.' -485.1166 (df=15)
AIC: 1000.233   BIC: 1073.699

Similarly to the Latent Class Poisson model, 905 respondents are in the first class and 85 are in the second class. Again, I hypothesize that the first class includes non-participants and the second includes visitors to the Norwegian fjord. The AIC is lower than for the Latent Class Poisson model, hence this model is preferred.

> summary(refit(model_nb_latent, gradient = NULL))
$`Comp.1`
                        Estimate Std. Error z value  Pr(>|z|)    
(Intercept)            -4.239703   1.784200 -2.3762 0.0174897 *  
TC_final               -2.419845   0.679308 -3.5622 0.0003677 ***
Gender                  0.630956   0.651699  0.9682 0.3329590    
age                     0.075829   0.027134  2.7946 0.0051963 ** 
Income                  0.406456   0.111278  3.6526 0.0002596 ***
household_children_u18  0.598853   0.657906  0.9102 0.3626948    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

$Comp.2
                         Estimate Std. Error z value  Pr(>|z|)    
(Intercept)             0.2418453  0.9104565  0.2656  0.790524    
TC_final               -0.3378123  0.0733912 -4.6029 4.167e-06 ***
Gender                  0.5505055  0.2857338  1.9266  0.054025 .  
age                    -0.0059965  0.0115648 -0.5185  0.604096    
Income                  0.1462306  0.0465846  3.1390  0.001695 ** 
household_children_u18 -0.0956728  0.2211490 -0.4326  0.665293    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The output of the Poisson and Negative Binomial Latent class models is similar, but not identical. For a more detailed explanation on how to interpret the model output, this link can help.

So far I only estimated a latent class model with 2 classes. The obvious extension is to estimate the same model using 3, 4 or 5 classes. The command is the same, but I change the k= option to k=3, k=4 or k=5. The table summarizes the AIC and BIC for each model.

Model Name Number of Classes AIC BIC
Poisson 2 1034.122 1097.792
3 1005.345 1103.299
4 970.3581 1102.596
5 971.5756 1138.098
Negative Binomial 2 1000.233 1073.699
3 1014.649 1127.297
4 NC NC
5 NC NC

The latent class Negative Binomial does not converge if the number of classes is higher than 3, that is why I wrote in the table NC (Not Converge).

When using the AIC, I conclude that the best model (i.e., lowest AIC) is the Latent Class Poisson model with 4 classes (AIC=970). If I use the BIC instead, I conclude that the best model is the Latent Class Negative Binomial with 2 classes (BIC=1073).

Finally, I want to predict how many trips each individual would take. I will exemplify with the Poisson model with 2 classes. Class membership can be observed in the clusters vector:

data_NA$clusters <- clusters(model_p_latent, data=data_NA)

And the predict function indicates how many trips each individual takes irrespective for class membership. So it generates two columns with the predicted number of trips if the respondent belonged to one class or the other.

result <- data.frame(predict(model_p_latent, data=data_NA))

What I want to know is the predicted number of trips given that class membership is predicted in the data_NA$clusters vector. To do that, I create a new variable predictions_p_latent. I create a loop to find the correct value within the result dataframe for each observation in this vector:

for(i in 1:nrow(data_NA)){
         data_NA$predictions_p_latent[i] = result[i,data_NA$clusters[i]]
}

Hence, I can now check the predictions from the model:

> summary(data_NA$predictions_p_latent)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
 0.000002  0.004666  0.022385  0.288434  0.095341 10.146081

The results are quite similar to the actual sample. Just like with other models that handle excess zeros, the latent class model predicts a large proportion of people making zero visits, since the median and mean are lower than 0.5 visits. The maximum number of visits is 10, which is similar to the 11 visits that is the maximum observed number of visits in our sample. I will plot these predictions in a histogram:

par(mfrow=c(1,2))
hist(data_NA$predictions_p_latent , col=c("green") , xlim=c(0,10),
     main="LC Poisson" )
hist(data_NA$predictions_nb_latent , col=c("darkolivegreen4") , 
     main="LC Negative Binomial")

Latent Class

In my series of posts focusing on excess zeros, I have applied a total of eight models: the Poisson and Negative Binomial, the Hurdle Poisson and Negative Binomial, the Zero-Inflated Poisson and Negative Binomial, and the Latent Class Poisson and Negative Binomial. There are some differences when it comes to their predicted number of trips, but are there differences across models regarding their estimated welfare measures? How does the estimated consumer surplus differ across models? This will be the focus of future blog posts.

 

References: 

Martinez-Cruz, A. L. (2016). Handling excess zeros in count models for recreation demand analysis without apology. CER-ETH–Center of Economic Research at ETH Zurich, Working Paper16, 253.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s