In my previous blog post, I explained on-site sampling bias. I also shared an R code to implement Englin and Shonkwiler (1995)’s negative Binomial model that accounts for these biases.

I have yet to illustrate how this code works using travel cost data. It is useful to double check whether this code yields the same results as the counterpart pre-programmed command in Stata: NBSTRAT. If I can obtain the same results, that gives me confidence that the code I have prepared is correct and it is an alternative for R users. That is my aim for today.

In detail, this is the outline of this post:

1- estimate an uncorrected Negative Binomial model and obtain a (biased) consumer surplus estimate;

2- estimate a corrected Negative Binomial model in Stata and obtain a corrected consumer surplus estimate;

3- estimate a corrected Negative Binomial model in R and double-check whether the estimates are the same as in Stata.

I use data available at the Environmental Economics blog to apply a travel cost model of beach recreation at Wrightsville Beach. The dataset is fairly old (from 2003), but it serves our purpose: the data was collected on-site, it includes trip data in the form of counts and it includes travel cost data information. The report describing the sample and results is available here.

Because the data was collected on-site, the minimum number of trips is one. The number of trips to Wrightsville beach is my dependent variable, while travel cost and income are my independent variables. This results in a standard single-site travel cost model.

I estimate a negative binomial model in R with this data and calculate the corresponding consumer surplus.

```
library(MASS)
model_nb <- glm.nb(trips ~ TCWB + income2, data=data_WBeach, control=glm.control(trace=10,maxit=100), init.theta=0.937)
summary(model_nb)
```

```
> summary(model_nb)
Call:
glm.nb(formula = trips ~ TCWB + income2, data = data_WBeach,
control = glm.control(trace = 10, maxit = 100), init.theta = 0.9833531339,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7934 -1.0631 -0.5434 0.2919 2.4390
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2267872 0.1860114 17.347 < 2e-16 ***
TCWB -0.0025443 0.0003504 -7.261 3.85e-13 ***
income2 -0.0059460 0.0022109 -2.689 0.00716 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.9834) family taken to be 1)
Null deviance: 265.15 on 150 degrees of freedom
Residual deviance: 158.82 on 148 degrees of freedom
AIC: 950.07
Number of Fisher Scoring iterations: 1
Theta: 0.983
Std. Err.: 0.117
2 x log-likelihood: -942.069
```

```
> -1/model_nb$coefficients[2]
TCWB
393.0314
```

Most of the results are as expected. The travel cost variable is negative and statistically significant. Income’s coefficient is also negative and statistically significant (which implies beach visits are an inferior good). The consumer surplus is 393 per visit and per visitor. Of course, I am not yet accounting for truncation and overdispersion, so this estimate might be biased.

I now want to account for these on-site sampling biases, and I start by doing so in Stata. The reason I use Stata is that, to the best of my knowledge, it is the only software that has a command available to calculate the model proposed by Englin and Shonkwiler (1995). This is the NBSTRAT command in Stata. I first import data into Stata. The code to estimate this model in Stata, as well as to calculate consumer surplus, is shown below.

```
. nbstrat trips tcwb income2
Negative Binomial with Endogenous Stratification Number of obs = 151
Wald chi2(2) = 138.25
Log likelihood = -426.28507 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
trips | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
tcwb | -.0043928 .0004459 -9.85 0.000 -.0052667 -.0035188
income2 | -.004568 .0022712 -2.01 0.044 -.0090195 -.0001165
_cons | -25.46132 112.6215 -0.23 0.821 -246.1955 195.2729
-------------+----------------------------------------------------------------
/lnalpha | 28.67241 112.6215 0.25 0.799 -192.0616 249.4065
-------------+----------------------------------------------------------------
alpha | 2.83e+12 3.19e+14 3.88e-84 2.1e+108
------------------------------------------------------------------------------
AIC Statistic = 5.686 BIC Statistic = -742.557
Deviance = 0.000 Dispersion = 0.000
.
. nlcom -1/ _b[tcwb]
_nl_1: -1/ _b[tcwb]
------------------------------------------------------------------------------
trips | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_nl_1 | 227.6458 23.10752 9.85 0.000 182.3559 272.9357
------------------------------------------------------------------------------
```

Again, most of the results are as expected. The outcomes regarding the travel cost and income variables remain unchanged (both negative and statistically significant). However, the consumer surplus estimate is lower: 228 per visit. The confidence interval of this estimate does not include the consumer surplus that was estimated before (in the model that does not account for truncation and endogenous stratification). This goes in line with most findings in the literature of on-site sampling bias. That is, accounting for truncation and endogenous stratification leads to lower consumer surplus estimates.

The exciting part is to confirm whether I get the same results when using my own R code. I have to define the *miu* parameter (i.e. expected number of trips) given the new variable names, but also change the log-likelihood function according to the dependent variable’s name *(trips)*.

```
library(Rfast)
library(VGAM)
### Writing log-lik by hand
nb.lik = function(df,par){
miu = exp(par[1] + par[2]*df$TCWB + par[3]*df$income2)
log_likelihood = sum( log(df$trips) +
Lgamma(df$trips+1/par[4]) -
Lgamma(df$trips+1) -
Lgamma(1/par[4]) +
df$trips*log(par[4]) +
(df$trips-1)*log(miu) -
(df$trips+1/par[4])*log(1+par[4]*miu)
)
return(log_likelihood)
}
starting.values <- data.frame(rnorm(1000,0,1),rnorm(1000,-0.5,1),rnorm(1000,-0.5,1),runif(1000,0.000001,1000))
```

```
> values <-0
> for(i in 1:nrow(starting.values)){
+
+ assign(paste("result", i, sep="_") , tryCatch(optim(par=starting.values[i,],nb.lik,df=data_WBeach,control=list(fnscale=-1)), error=function(err) NA) )
+ values[i] <- tryCatch(get(paste("result" , i, sep = "_"))[["value"]], error=function(err) NA)
+
+ }
Warning messages:
1: In log(par[4]) : NaNs produced
2: In log(1 + par[4] * miu) : NaNs produced
>
> which.max(values)
[1] 53
> get(paste("result", which.max(values), sep="_"))
$par
rnorm.1000..0..1. rnorm.1000...0.5..1. rnorm.1000...0.5..1..1 runif.1000..1e.06..1000.
-3.881567e+00 -4.400641e-03 -4.552885e-03 1.205406e+03
$value
[1] -426.3138
$counts
function gradient
461 NA
$convergence
[1] 0
$message
NULL
```

This code gives me the following parameter values: -3.88, -0.0044, -0.00455 and 1205.406 is the dispersion parameter. The travel cost and income coefficients are nearly identical to those estimated in Stata: -0.0043928 and -0.004568, respectively. The log-likelihood value is also very close: -426.3138 in R and -426.28507 in Stata. It is motivating to know that I get basically the same results with my code as I did with Stata’s NBSTRAT command. I also check if this is also the case for consumer surplus:

```
> -1/ -4.400641e-03
[1] 227.2396
```

The consumer surplus estimate is also virtually identical to the one estimated after using the NBSTRAT command. The R code I prepared works (or at least yields the same results as Stata’s).

On a general note, I show that accounting for truncation and endogenous stratification has an important impact on estimated welfare measures (i.e. consumer surplus). Without accounting for these biases, consumer surplus was estimated at 393 and when accounting for these it decreased to 227.

**References:**

Englin, J., & Shonkwiler, J. S. (1995). Estimating social welfare using count data models: an application to long-run recreation demand under conditions of endogenous stratification and truncation. *The Review of Economics and statistics*, 104-112.

## 2 Comments