Power Calculations for Count Data

I have been deep into count data modelling in the past few months and writing some papers on analysis of count data. I have been using various datasets, and interpreting the results. Many times the results are as expected, while sometimes I suspect that the results have problems. In many cases, understanding the statistical power of our models is very helpful to understand the regression results we obtain. Today I wanted to illustrate how to look into power calculations for count data. Does the model have enough statistical power to find the effect of interest? How big an effect can we find with our data?

I have addressed power calculations for both continuous and binary outcomes. I have explained that the power of an experiment depends on the sample size, effect size and the significance level to be reached. For example, ceteris paribus, the higher the sample size, the higher the power of the experiment. I have shown the pwr package before (Champely et al., 2018), but today I want to show an alternative: how simulation can help shed light on power calculations. This is because standard power calculations cannot handle integer outcomes.

My goal is to simulate a population with parameter values that I know, and see if when I run the regression I can find those parameter values. If I cannot, it means the data does not have enough power, for example because there is not enough variation in the explanatory variables.

In my application, I am interested in fishing presence and its effect on the number of visits. Let me suppose that a visitor is more likely to make more trips to places with fishing presence (which is a dummy). So I expect the coefficient associated with fishing presence to be positive and statistically significant.

I am interested in running a count data model that depends on travel cost (TC), substitute travel cost (Subs_TC) and income. The higher the travel cost and lower the substitute travel cost, the less trips a individual is expected to make. I also assume the effect of income to be positive. Therefore, I set the initial parameter values as such:

intercept      <- 1
beta_TC        <- -0.05
beta_subs_TC   <- 0.0006
beta_income    <- 0.00001
beta_fishing   <-  0.2

I now have to simulate the expected number of trips, denoted by mu. I also have to define the overdispersion parameter. Because I am interested in a sample without zeros, I simulate count data that follows a truncated Negative Binomial distribution by using the rztnbinom function in the countreg package:

#install.packages("countreg", repos="http://R-Forge.R-project.org")
library(countreg)

# Conditional number of trips for each individual
mu       <- exp(intercept + 
                beta_TC      *Dataset$TC + 
                beta_subs_TC *Dataset$subs_TC +
                beta_income  *Dataset$income +
                beta_fishing *Dataset$fishing_presence  )  
theta    <- 1/1.4

Dataset$sim_visits <- rztnbinom(n=length(mu), size=theta, mu = mu)

With these parameter values, the simulated number of visits (sim_visits) and realized number of visits are similar in both the mean value as well as the maximum number of visits:

> summary(Dataset$sim_visits)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   1.000   3.000   4.085   5.000  75.000    1584 
> summary(Dataset$visits[!is.na(Dataset$sim_visits)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    2.00    3.98    4.00   56.00 

The next step is to check if the value of beta_fishing of 0.2 can be retrieved when I run a truncated negative binomial model. I use the vglm function in the VGAM package to run this model:

library(VGAM)
m1 <- vglm(sim_visits ~ TC + 
                  substitute_TC +
                  income +
                  fishing_presence 
             , family = posnegbinomial(), data = Dataset)

item1 <- summary(m1)
item1
Coefficients: 
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept):1        8.388e-01  7.187e-02  11.671  < 2e-16 ***
(Intercept):2       -5.753e-01  9.702e-02  -5.930 3.03e-09 ***
TC                  -3.062e-02  2.944e-03 -10.400  < 2e-16 ***
substitute_TC        7.947e-04  1.428e-04   5.567 2.59e-08 ***
income               5.849e-06  2.325e-06   2.516   0.0119 *  
fishing_presence     2.862e-01  7.044e-02   4.063 4.84e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

All of the estimated parameters are statistically significant, including the fishing presence. But is the true value of 0.2 included in the confidence interval implied by the estimate and standard error?

> m1@coefficients[6]-1.96*sqrt(abs(item1@cov.unscaled[6,6]))
active_fishing_port 
          0.1481543 
> m1@coefficients[6]+1.96*sqrt(abs(item1@cov.unscaled[6,6]))
active_fishing_port 
          0.4242673 

So the answer is yes, 0.2 is included in this confidence interval. Therefore, I conclude that, if the effect size is 0.2 or higher, we should be able to retrieve this effect size when we use actual data.

What if the effect size is smaller? For example 0.1? I run the same exercise above by changing the value of beta_fishing to 0.1. These are the resulting confidence intervals:

> m1@coefficients[6]-1.96*sqrt(abs(item1@cov.unscaled[6,6]))
active_fishing_port 
         -0.1445287 
> m1@coefficients[6]+1.96*sqrt(abs(item1@cov.unscaled[6,6]))
active_fishing_port 
          0.1320521 

So, even though we will also obtain the value 0.1 within the confidence interval, the effect is too small and the estimate will not be statistically different from zero. This means that, if the effect size is o.1 or lower, even if there is indeed a positive effect, our data does not have enough variation to retrieve this effect. The size of the effect of fishing presence matters, as the data will not retrieve small effect sizes. If possible, gathering larger sample sizes may solve the problem.

This simulation approach works for different count distribution, different variables (e.g. continuous) and different combinations of explanatory variables. One can avoid claiming that the effect is zero when data simply does not have enough variation to retrieve any effect.

References:

Champely, S., Ekstrom, C., Dalgaard, P., Gill, J., Weibelzahl, S., Anandkumar, A., Ford, C., Volcic, R., De Rosario, H., De Rosario, M.H., 2018. Package ‘pwr.’ R Package Version 1–2.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s