In a previous post, I addressed the problem of on-site sampling bias. Let me refresh that discussion.
Suppose you have data on a certain behavior in the form of counts. For example, if you work in marketing, you may count how many times a certain item, such as butter, was purchased per customer. This outcome will depend on many variables, perhaps most importantly on the price of a pack of butter. Economic theory also suggests that people with higher incomes will purchase more items.
The way you sample these data is of great importance. If you stand at the cashier and count how many packs of butter each customer purchases, you are likely to sample a lot of zero counts (i.e., many customers who do not purchase butter). On the other hand, if you stand at the dairy section of the supermarket and sample anyone who takes a pack of butter, you will incur on-site sampling bias: such a sample contains no non-purchases, and you are more likely to encounter intensive butter consumers. Because this on-site sample is biased, you cannot extrapolate from it to the sample you would get if you stood at the cashier instead.
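A quick base-R simulation makes this concrete. The Negative Binomial parameters below are made up for illustration only; the point is that the on-site (dairy-section) sample contains no zeros and that a customer buying y packs is y times as likely to be intercepted there.

```r
set.seed(42)
# Made-up population: butter purchases per customer, Negative Binomial counts
y <- rnbinom(10000, size = 1, mu = 2)

# Cashier sample: simple random sample of all customers, zeros included
cashier <- sample(y, 500)
mean(cashier == 0)            # a large share of zero counts

# Dairy-section sample: only purchasers, and heavier buyers are
# intercepted in proportion to how much they buy
buyers <- y[y > 0]
onsite <- sample(buyers, 500, replace = TRUE, prob = buyers)
mean(onsite == 0)             # no zeros at all
mean(onsite) > mean(cashier)  # intensive consumers are overrepresented
```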
Because on-site sampling is so effective at reaching actual customers, we usually prefer to sample on-site and then correct for these biases ex post. The most well-known correction is the one proposed by Englin and Shonkwiler (1995), and some papers have shown that it does indeed correct on-site sampling bias (e.g., Martinez-Espineira et al., 2008).
However, by standing at the dairy section, we have no way of knowing with certainty what the count distribution would be if we stood at the cashier instead. Englin and Shonkwiler (1995)’s correction works if that distribution can be approximated by a Negative Binomial. If it instead follows another distribution (e.g., a Truncated Normal), the correction may fail to remove the on-site sampling bias. That is the problem addressed by Shi and Huang (2018).
Shi and Huang (2018) are confronted with the problem that if the distribution of counts is misspecified, then Englin and Shonkwiler (1995)’s correction might not work. They conduct a Monte Carlo simulation to show exactly this, and then propose an alternative approach to address this problem.
What they propose is to weight each observation by 1/yi, where yi is individual i’s observed count. In the authors’ own words: “We weigh each observation in the on-site data by the corresponding probability weight 1/yi to systematically correct the overrepresentation of frequent visitors in the on-site sample” (Shi and Huang, 2018). They then apply a variety of truncated models to these weighted data, such as the zero-truncated Poisson or the zero-truncated Negative Binomial model.
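To see why the 1/yi weighting works, here is a small base-R sketch (again with made-up Negative Binomial parameters): since on-site sampling draws a count y with probability proportional to y, reweighting by 1/y recovers the purchasers-only (zero-truncated) distribution, as the weighted mean below illustrates.

```r
set.seed(1)
# Made-up population of purchase counts (illustration only)
y   <- rnbinom(50000, size = 1, mu = 2)
pos <- y[y > 0]                      # purchasers

# On-site sampling: a customer buying y packs is y times as likely to be sampled
onsite <- sample(pos, 20000, replace = TRUE, prob = pos)

mean(onsite)                 # overstates the purchasers' mean
w <- 1 / onsite              # Shi and Huang's probability weights
sum(w * onsite) / sum(w)     # weighted mean: close to mean(pos)
mean(pos)
```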
My goal for today is to apply Shi and Huang (2018)’s approach to on-site trip data that I have used previously. It pertains to the number of trips taken to a beach. I will estimate several truncated models with weights, starting with the truncated Negative Binomial model.
To estimate a truncated count data model, I use the VGAM package.
The model is the same as in the previous blog post: trips is the dependent variable, and travel cost and income are the explanatory variables. To fit a truncated Negative Binomial model, I need to specify the family within the vglm function, as well as the weights, which are the inverse of the number of trips, as specified in Shi and Huang (2018).
model_tnb <- vglm(trips ~ TCWB + income2, family = posnegbinomial(),
                  data = data_WBeach, weights = 1/data_WBeach$trips)
The regression results are as follows:
> summary(model_tnb)

Call:
vglm(formula = trips ~ TCWB + income2, family = posnegbinomial(),
    data = data_WBeach, weights = 1/data_WBeach$trips)

Coefficients:
                Estimate Std. Error  z value Pr(>|z|)
(Intercept):1 -19.569705  16.235106       NA       NA
(Intercept):2 -21.488287   0.120197 -178.776   <2e-16 ***
TCWB           -0.003590   0.031975       NA       NA
income2         0.002629   0.173069    0.015    0.988
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Names of linear predictors: loglink(munb), loglink(size)

Log-likelihood: -82.416 on 298 degrees of freedom

Number of Fisher scoring iterations: 15

Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):1', 'TCWB'
And the implied consumer surplus is:
> -1/-0.003590
[1] 278.5515
I obtain a consumer surplus estimate of about 278, slightly higher than the 227.24 implied by Englin and Shonkwiler (1995)’s correction, but lower than the 393 from the model that does not correct for on-site sampling bias. From this alone, the reweighting approach does seem to mitigate on-site sampling bias.
Note, however, how imprecise the travel cost estimate is: its standard error is extremely large compared with the estimate itself. Consequently, the consumer surplus estimate is also imprecise (i.e., it has a very wide confidence interval). This already leads me to suspect that something is wrong with this model. Shi and Huang (2018) do point out that estimating truncated models with weights leads to a loss of efficiency, but not to this extent…
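To put a number on that imprecision, a standard first-order delta-method approximation can be applied to CS = -1/beta, using the coefficient and standard error reported above. This is my own back-of-the-envelope check, not part of Shi and Huang (2018)’s procedure.

```r
# Delta method for CS = -1/beta: |d(-1/b)/db| = 1/b^2, so se(CS) = se(b) / b^2
beta_tc <- -0.003590          # travel cost coefficient from the weighted TNB model
se_tc   <- 0.031975           # its reported standard error
cs      <- -1 / beta_tc       # point estimate, ~278.55
se_cs   <- se_tc / beta_tc^2  # enormous relative to the point estimate
c(cs = cs, se = se_cs)
cs + c(-1.96, 1.96) * se_cs   # 95% CI spans thousands either side of zero
```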
Following Shi and Huang (2018)’s suggestion, I also estimate a Truncated Poisson and a Truncated Normal model, in the following way:
model_tp <- vglm(trips ~ TCWB + income2, family = pospoisson(),
                 data = data_WBeach, weights = 1/data_WBeach$trips)
model_tn <- truncreg::truncreg(trips ~ TCWB + income2,
                               data = data_WBeach, weights = 1/data_WBeach$trips)
The regression results are as follows:
> summary(model_tp)

Call:
vglm(formula = trips ~ TCWB + income2, family = pospoisson(),
    data = data_WBeach, weights = 1/data_WBeach$trips)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.6822668  0.2218221   7.584 3.35e-14 ***
TCWB        -0.0055523  0.0009426  -5.890 3.85e-09 ***
income2      0.0035155  0.0026294   1.337    0.181
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Name of linear predictor: loglink(lambda)

Log-likelihood: -126.7602 on 148 degrees of freedom

Number of Fisher scoring iterations: 7

Warning: Hauck-Donner effect detected in the following estimate(s):
'TCWB'

> summary(model_tn)

Call:
truncreg::truncreg(formula = trips ~ TCWB + income2, data = data_WBeach,
    weights = 1/data_WBeach$trips)

BFGS maximization method
65 iterations, 0h:0m:0s
g'(-H)^-1g = 0.283

Coefficients :
             Estimate Std. Error t-value Pr(>|t|)
(Intercept) 26.32171   32.26709   0.8157 0.414646
TCWB        -2.66137    1.73340  -1.5353 0.124699
income2     -0.92722    0.80537  -1.1513 0.249610
sigma       55.47545   18.73827   2.9605 0.003071 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood: -427.08 on 4 Df
The log-likelihood is higher (less negative) for the truncated Negative Binomial model (-82.4) than for these two models (-126.8 and -427.1), suggesting the truncated NB model is the preferred one to explain trip counts.
I am a bit unsure whether the Truncated Negative Binomial model is working as it should, given these strange results. This could be a problem with the software or the package, or with the nature of the model itself. What I am doing now is writing down the log-likelihood of this re-weighted truncated Negative Binomial model (as specified in Shi and Huang, 2018) to try to confirm these results. I will post my findings if they differ.
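As a starting point, here is a base-R sketch of that weighted zero-truncated NB log-likelihood, to be maximized directly with optim() as a cross-check on vglm. The function name wtnb_ll is my own; the log link and the 1/y weights follow the setup above, but I have not yet verified it reproduces the VGAM estimates.

```r
# Weighted zero-truncated Negative Binomial log-likelihood (sketch).
# par = (regression coefficients, log of the NB size parameter)
wtnb_ll <- function(par, y, X) {
  beta <- par[1:ncol(X)]
  size <- exp(par[ncol(X) + 1])   # overdispersion, kept positive via log scale
  mu   <- exp(X %*% beta)         # log link, as in vglm
  # log-density of the zero-truncated NB: log f(y) - log(1 - f(0))
  ll   <- dnbinom(y, size = size, mu = mu, log = TRUE) -
          log1p(-dnbinom(0, size = size, mu = mu))
  sum((1 / y) * ll)               # Shi and Huang's 1/y probability weights
}

# Usage (assuming data_WBeach is loaded):
# X   <- cbind(1, data_WBeach$TCWB, data_WBeach$income2)
# fit <- optim(c(0, 0, 0, 0), wtnb_ll, y = data_WBeach$trips, X = X,
#              control = list(fnscale = -1), hessian = TRUE)
```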
Englin, J., & Shonkwiler, J. S. (1995). Estimating social welfare using count data models: An application to long-run recreation demand under conditions of endogenous stratification and truncation. The Review of Economics and Statistics, 104-112.
Shi, W., & Huang, J. C. (2018). Correcting on-site sampling bias: a new method with application to recreation demand analysis. Land Economics, 94(3), 459-474.