# An R routine to account for endogenous stratification and overdispersion

In many instances, environmental economists study the use of public spaces; much of my own research, for example, focuses on recreational sites and their value. A lot of data about the use of public spaces is collected on-site, mainly because of cost: intercepting and interviewing visitors on-site is cheap and relatively fast, so researchers collect data on-site whenever possible.

The problem is that on-site data are not the most reliable for analysis. On-site samples suffer from two problems: truncation and endogenous stratification. Truncation means that, when visitors are asked how many times they have been to the site, the answer is at least one. That is, there are no zeros in the data, so the analysis ignores potential visitors who have not visited the site but might do so in the future. Endogenous stratification means that frequent visitors are more likely to be intercepted on-site, and therefore to be represented in the data, than infrequent visitors. The sample is thus biased towards frequent visitors, which can mislead the researcher into believing the public space is more popular than it actually is.

The consequence of these two problems is that the people intercepted on-site (i.e. the sample) do not truly represent the population of potential visitors. Conclusions drawn from the sample, such as the value of the recreational site or the average number of visits, therefore do not carry over to that population. For example, the average number of visits estimated from an on-site sample will be exaggerated compared with one estimated from a random sample of the population.

To avoid erroneous conclusions, it is important to account for both problems when handling data collected on-site. Truncation is relatively easy to deal with: most econometric software already implements truncated models, so I can, for example, run a truncated Poisson regression or truncated OLS in R to explain the number of visits to a site. Endogenous stratification is not as easy to account for.

If the researcher opts for a count data model to explain the number of visits, then there are two options: the Poisson or the Negative Binomial model. Accounting for truncation and endogenous stratification in a Poisson model is surprisingly easy: as shown by Shaw (1988), one must simply subtract one from the number of visits (dependent variable) to obtain unbiased estimates of visitation measures. But it is not as straightforward if a Negative Binomial model is the most appropriate (i.e. if overdispersion is present in the count data).
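To see Shaw's correction in action, here is a short sketch (the variable names and the simulated data are purely illustrative). Under truncation and endogenous stratification, an on-site Poisson sample is distributed as one plus a Poisson draw, so regressing visits minus one with a standard Poisson model recovers the underlying parameters:

```r
set.seed(42)

# Illustrative on-site-style data: every intercepted visitor reports >= 1 trip
n <- 500
x_1 <- rnorm(n)
x_2 <- rnorm(n)
trips <- rpois(n, lambda = exp(0.4 + 0.3 * x_1 - 0.2 * x_2)) + 1

# Shaw (1988): fit a standard Poisson model to (visits - 1)
fit <- glm(I(trips - 1) ~ x_1 + x_2, family = poisson)
coef(fit)
```

The estimated coefficients should be close to the values used in the simulation, despite the sample containing no zeros.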

Only Stata has a ready-made routine to estimate a Negative Binomial model that accounts for truncation and endogenous stratification: the NBSTRAT command, developed by Joseph Hilbe and Roberto Martinez-Espineira.

There is no equivalent for R users. This absence of alternatives to Stata might explain why some analyses still ignore the presence of endogenous stratification (e.g. Blackwell, 2007).

Therefore I have made it this post’s mission to share R code that estimates a Negative Binomial model accounting for truncation and endogenous stratification.

This post builds nicely on previous blog posts: I have already written the log-likelihood function for a Negative Binomial regression, so all I need to do is adjust it to account for truncation and endogenous stratification.

Englin and Shonkwiler (1995) derive the density function, but what I need is the log-likelihood function. Fortunately, instead of having to write it myself, Meisner et al. (2008) derive the log-likelihood function in their paper.
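In the notation used below, with $\alpha$ denoting the overdispersion parameter and $n$ the number of observations, the log-likelihood can be written as:

$$\ln L = \sum_{i=1}^{n} \left[ \ln y_i + \ln\Gamma\!\left(y_i + \tfrac{1}{\alpha}\right) - \ln\Gamma(y_i + 1) - \ln\Gamma\!\left(\tfrac{1}{\alpha}\right) + y_i \ln\alpha + (y_i - 1)\ln\mu_i - \left(y_i + \tfrac{1}{\alpha}\right)\ln(1 + \alpha\mu_i) \right]$$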

This log-likelihood function is almost identical to that of the standard Negative Binomial model, with two minor adjustments: an additional $\ln y_i$ term and an exponent of $y_i - 1$, rather than $y_i$, on the mean $\mu_i$.

Let $\mu_i$ be the expected number of visits for individual $i$. With the usual log link and two explanatory variables, it is defined as:

$\mu_i = \exp(\beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i})$

The corresponding R code (with two explanatory variables, assuming a data frame df with columns y, x_1 and x_2) is:

```r
nb.lik = function(par, df) {
  # optim passes the parameter vector as the first argument;
  # par[1:3] are the regression coefficients and par[4] is the
  # overdispersion parameter, which must remain positive
  if (par[4] <= 0) return(-Inf)
  y = df$y
  miu = exp(par[1] + par[2] * df$x_1 + par[3] * df$x_2)
  log_likelihood = sum(
    log(y) +
      lgamma(y + 1 / par[4]) -
      lgamma(y + 1) -
      lgamma(1 / par[4]) +
      y * log(par[4]) +
      (y - 1) * log(miu) -
      (y + 1 / par[4]) * log(1 + par[4] * miu)
  )
  return(log_likelihood)
}
```


This function computes the log-likelihood of a Negative Binomial model that accounts for truncation and endogenous stratification.

All you need to do is define $\mu_i$ and adjust the number of parameters to estimate. In particular, with more explanatory variables the dispersion parameter moves from par[4] to par[5], par[6], and so on. You then use optim to find the parameter values that maximize the log-likelihood.

```r
optim(par = c(0.001, -0.05, 0.005, 1), fn = nb.lik, df = data,
      control = list(fnscale = -1))
```

The result is a list containing the parameter estimates (par), the value of the log-likelihood at those estimates (value), and details regarding the convergence of the optimization (convergence and counts). Note that control = list(fnscale = -1) is what tells optim to maximize rather than minimize.
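Putting the pieces together, here is a self-contained sketch. The simulated data, the column names (y, x_1, x_2) and the starting values are all assumptions for illustration; with hessian = TRUE, approximate standard errors can be read off the inverse of the negative Hessian, assuming the optimizer converged to the maximum:

```r
set.seed(42)

# Placeholder data: zero-truncated Negative Binomial counts (illustrative only,
# not an exact draw from the on-site sampling distribution)
n <- 500
x_1 <- rnorm(n)
x_2 <- rnorm(n)
y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * x_1 - 0.2 * x_2)) + 1
data <- data.frame(y = y, x_1 = x_1, x_2 = x_2)

# Log-likelihood accounting for truncation and endogenous stratification
nb.lik <- function(par, df) {
  if (par[4] <= 0) return(-Inf)  # the dispersion parameter must stay positive
  y <- df$y
  miu <- exp(par[1] + par[2] * df$x_1 + par[3] * df$x_2)
  sum(log(y) +
        lgamma(y + 1 / par[4]) - lgamma(y + 1) - lgamma(1 / par[4]) +
        y * log(par[4]) + (y - 1) * log(miu) -
        (y + 1 / par[4]) * log(1 + par[4] * miu))
}

fit <- optim(par = c(0.001, -0.05, 0.005, 1), fn = nb.lik, df = data,
             control = list(fnscale = -1), hessian = TRUE)

fit$par                           # parameter estimates
fit$value                         # maximized log-likelihood
sqrt(diag(solve(-fit$hessian)))   # approximate standard errors
```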

I am curious whether the code I suggest is easily adaptable to different datasets, so I would appreciate any feedback in the comments section.

References:

Blackwell, B. (2007). The value of a recreational beach visit: An application to Mooloolaba beach and comparisons with other outdoor recreation sites. Economic Analysis and Policy, 37(1), 77-98.

Englin, J., & Shonkwiler, J. S. (1995). Estimating social welfare using count data models: An application to long-run recreation demand under conditions of endogenous stratification and truncation. The Review of Economics and Statistics, 104-112.

Meisner, C., Wang, H., & Laplante, B. (2008). Welfare measurement convergence through bias adjustments in general population and on-site surveys: An application to water-based recreation at Lake Sevan, Armenia. Journal of Leisure Research, 40(3), 457-478.

Shaw, D. (1988). On-site samples’ regression: Problems of non-negative integers, truncation, and endogenous stratification. Journal of Econometrics, 37(2), 211-223.