Routines to account for truncation and endogenous stratification: R vs. Stata

In my previous blog post, I explained on-site sampling bias. I also shared an R code to implement Englin and Shonkwiler (1995)’s negative Binomial model that accounts for these biases.

I have yet to illustrate how this code works using travel cost data. It is useful to double check whether this code yields the same results as the counterpart pre-programmed command in Stata: NBSTRAT. If I can obtain the same results, that gives me confidence that the code I have prepared is correct and it is an alternative for R users. That is my aim for today.

In detail, this is the outline of this post:

1- estimate an uncorrected Negative Binomial model and obtain a (biased) consumer surplus estimate;

2- estimate a corrected Negative Binomial model in Stata and obtain a corrected consumer surplus estimate;

3- estimate a corrected Negative Binomial model in R and double-check whether the estimates are the same as in Stata.

I use data available at the Environmental Economics blog to apply a travel cost model of beach recreation at Wrightsville Beach. The dataset is fairly old (from 2003), but it serves our purpose: the data was collected on-site, it includes trip data in the form of counts and it includes travel cost data information. The report describing the sample and results is available here.

Because the data was collected on-site, the minimum number of trips is one. The number of trips to Wrightsville beach is my dependent variable, while travel cost and income are my independent variables. This results in a standard single-site travel cost model.

I estimate a negative binomial model in R with this data and calculate the corresponding consumer surplus.

library(MASS)
model_nb <- glm.nb(trips ~ TCWB  + income2, data=data_WBeach, control=glm.control(trace=10,maxit=100), init.theta=0.937)
summary(model_nb)
> summary(model_nb)
Call:
glm.nb(formula = trips ~ TCWB + income2, data = data_WBeach, 
    control = glm.control(trace = 10, maxit = 100), init.theta = 0.9833531339, 
    link = log)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7934  -1.0631  -0.5434   0.2919   2.4390  
Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.2267872  0.1860114  17.347  < 2e-16 ***
TCWB        -0.0025443  0.0003504  -7.261 3.85e-13 ***
income2     -0.0059460  0.0022109  -2.689  0.00716 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.9834) family taken to be 1)
    Null deviance: 265.15  on 150  degrees of freedom
Residual deviance: 158.82  on 148  degrees of freedom
AIC: 950.07
Number of Fisher Scoring iterations: 1
              Theta:  0.983 
          Std. Err.:  0.117 
 2 x log-likelihood:  -942.069 
> -1/model_nb$coefficients[2]
    TCWB 
393.0314 

Most of the results are as expected. The travel cost variable is negative and statistically significant. Income’s coefficient is also negative and statistically significant (which implies beach visits are an inferior good). The consumer surplus is 393 per visit and per visitor. Of course, I am not yet accounting for truncation and overdispersion, so this estimate might be biased.

I now want to account for these on-site sampling biases, and I start by doing so in Stata. The reason I use Stata is that, to the best of my knowledge, it is the only software that has a command available to calculate the model proposed by Englin and Shonkwiler (1995). This is the NBSTRAT command in Stata. I first import data into Stata. The code to estimate this model in Stata, as well as to calculate consumer surplus, is shown below.

. nbstrat trips tcwb income2
Negative Binomial with Endogenous Stratification  Number of obs   =        151
                                                  Wald chi2(2)    =     138.25
Log likelihood = -426.28507                       Prob > chi2     =     0.0000
------------------------------------------------------------------------------
       trips |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        tcwb |  -.0043928   .0004459    -9.85   0.000    -.0052667   -.0035188
     income2 |   -.004568   .0022712    -2.01   0.044    -.0090195   -.0001165
       _cons |  -25.46132   112.6215    -0.23   0.821    -246.1955    195.2729
-------------+----------------------------------------------------------------
    /lnalpha |   28.67241   112.6215     0.25   0.799    -192.0616    249.4065
-------------+----------------------------------------------------------------
       alpha |   2.83e+12   3.19e+14                      3.88e-84    2.1e+108
------------------------------------------------------------------------------
AIC Statistic   =        5.686                      BIC Statistic =   -742.557
Deviance        =        0.000                      Dispersion    =      0.000
. 
. nlcom -1/ _b[tcwb]
       _nl_1:  -1/ _b[tcwb]
------------------------------------------------------------------------------
       trips |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _nl_1 |   227.6458   23.10752     9.85   0.000     182.3559    272.9357
------------------------------------------------------------------------------

Again, most of the results are as expected. The outcomes regarding the travel cost and income variables remain unchanged (both negative and statistically significant). However, the consumer surplus estimate is lower: 228 per visit. The confidence interval of this estimate does not include the consumer surplus that was estimated before (in the model that does not account for truncation and endogenous stratification). This goes in line with most findings in the literature of on-site sampling bias. That is, accounting for truncation and endogenous stratification leads to lower consumer surplus estimates.

The exciting part is to confirm whether I get the same results when using my own R code. I have to define the miu parameter (i.e. expected number of trips) given the new variable names, but also change the log-likelihood function according to the dependent variable’s name (trips).

library(Rfast)
library(VGAM)
### Writing log-lik by hand
nb.lik = function(df,par){
  
  miu = exp(par[1] + par[2]*df$TCWB + par[3]*df$income2)
  log_likelihood = sum( log(df$trips) +
                          Lgamma(df$trips+1/par[4]) -
                          Lgamma(df$trips+1) - 
                          Lgamma(1/par[4]) +
                          df$trips*log(par[4]) +
                          (df$trips-1)*log(miu) -
                          (df$trips+1/par[4])*log(1+par[4]*miu)
  )
  return(log_likelihood)
}
starting.values <- data.frame(rnorm(1000,0,1),rnorm(1000,-0.5,1),rnorm(1000,-0.5,1),runif(1000,0.000001,1000))
> values <-0
> for(i in 1:nrow(starting.values)){
+   
+   assign(paste("result", i, sep="_") , tryCatch(optim(par=starting.values[i,],nb.lik,df=data_WBeach,control=list(fnscale=-1)), error=function(err) NA) )
+   values[i]  <-   tryCatch(get(paste("result" , i, sep = "_"))[["value"]], error=function(err) NA)
+   
+ }
Warning messages:
1: In log(par[4]) : NaNs produced
2: In log(1 + par[4] * miu) : NaNs produced
> 
> which.max(values)
[1] 53
> get(paste("result", which.max(values), sep="_"))
$par
       rnorm.1000..0..1.     rnorm.1000...0.5..1.   rnorm.1000...0.5..1..1 runif.1000..1e.06..1000. 
           -3.881567e+00            -4.400641e-03            -4.552885e-03             1.205406e+03 
$value
[1] -426.3138
$counts
function gradient 
     461       NA 
$convergence
[1] 0
$message
NULL

This code gives me the following parameter values: -3.88, -0.0044, -0.00455 and 1205.406 is the dispersion parameter. The travel cost and income coefficients are nearly identical to those estimated in Stata: -0.0043928 and -0.004568, respectively. The log-likelihood value is also very close: -426.3138 in R and -426.28507 in Stata. It is motivating to know that I get basically the same results with my code as I did with Stata’s NBSTRAT command. I also check if this is also the case for consumer surplus:

> -1/ -4.400641e-03
[1] 227.2396

The consumer surplus estimate is also virtually identical to the one estimated after using the NBSTRAT command. The R code I prepared works (or at least yields the same results as Stata’s).

On a general note, I show that accounting for truncation and endogenous stratification has an important impact on estimated welfare measures (i.e. consumer surplus). Without accounting for these biases, consumer surplus was estimated at 393 and when accounting for these it decreased to 227.

References:

Englin, J., & Shonkwiler, J. S. (1995). Estimating social welfare using count data models: an application to long-run recreation demand under conditions of endogenous stratification and truncation. The Review of Economics and statistics, 104-112.

Advertisement

2 Comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s