The Chi-squared test in R after running a regression

I illustrate today how to perform a chi-squared test after running regressions in R. 

A chi-squared test is a useful way to compare across models. Imagine you are investigating and/or explaining a categorical variable. You have run two alternative models and would like to conclude which one is preferred. AIC and BIC are researchers' go-to measures of fit: the lower the AIC/BIC, the better the model.

AIC and BIC are useful, but they might not be the best way to investigate the fit of a model. When handling categorical variables, we are often interested in the frequencies: e.g. how many Yes's and No's does my model predict, and how well does that match the raw data? The chi-squared test helps determine whether the observed frequencies differ significantly from the predicted ones: the smaller the discrepancy, the better the model is at predicting. The chi-squared test looks like this:

\chi^2 = \sum_{i=1}^{C} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}

where expected is the vector of predicted counts (conditional on the independent variables), observed is the vector of counts in the raw data, and the sum runs over the C categories.

The degrees of freedom (df) of this test equal the number of categories minus one (C − 1). If we are analyzing a binary variable (e.g. Yes/No), there are two categories. If we are analyzing a five-point scale (e.g. excellent, good, mediocre, bad, terrible), there are five categories. In both cases the degrees of freedom are low (1 and 4, respectively). With a very large number of degrees of freedom (50+) the test becomes unreliable. That happens with count variables that take many distinct values (e.g. number of doctor visits ranging above 50, with each count treated as its own category).
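To make the formula concrete, here is a minimal sketch in R that computes the statistic and its p-value by hand for two made-up vectors of counts (the numbers are purely illustrative):

# Hypothetical observed and expected counts for two categories
observed <- c(120, 380)
expected <- c(100, 400)

# Chi-squared statistic and p-value with C - 1 = 1 degree of freedom
chi_sq <- sum((observed - expected)^2 / expected)
1 - pchisq(chi_sq, df = length(observed) - 1)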

That was exactly the situation in my data. Instead of comparing predicted and raw frequencies over far too many counts, I grouped the count variable into a smaller set of meaningful categories before writing the code for my chi-squared test.
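One simple way to group counts in R is with cut(); the breakpoints below are arbitrary and only meant to illustrate the idea on a hypothetical vector of counts:

# Collapse a count variable into a few broader categories
visits <- c(0, 1, 2, 5, 12, 30, 77)   # hypothetical counts
groups <- cut(visits, breaks = c(-Inf, 0, 2, 10, Inf),
              labels = c("0", "1-2", "3-10", "11+"))
table(groups)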

To illustrate this test, I am using data on doctor visits from Cameron and Trivedi (2005). The data can be found at the book’s webpage. I downloaded and imported the data into R. The variable I am interested in is the number of doctor visits (mdvis). This variable ranges from 0 to 77, as seen below:

> table(data_doctor$mdvis)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24 
6308 3817 2797 1884 1345  968  689  531  408  287  206  190  118  109   82   59   56   33   37   35   26   22   19   19   13 
  25   26   27   28   29   30   31   32   33   34   35   37   38   39   40   41   44   45   46   48   51   52   55   56   57 
   8   10    6   12    6    8    8    4    5    9    5    5    9    1    3    5    6    2    2    2    1    3    1    1    1 
  58   62   63   65   69   72   74   76   77 
   1    1    1    1    1    1    1    1    1 

Suppose I want to explain the number of doctor visits given the number of diseases, age, gender, income and time. I run an ordinary least squares regression.

> model <- lm(mdvis ~  disea + xage + female + linc + time, data=data_doctor)
> summary(model)

Call:
lm(formula = mdvis ~ disea + xage + female + linc + time, data = data_doctor)

Residuals:
   Min     1Q Median     3Q    Max 
-8.225 -2.313 -1.183  0.842 71.356 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.832448   1.209156   0.688   0.4912    
disea        0.126080   0.004776  26.396  < 2e-16 ***
xage         0.012888   0.001885   6.838 8.26e-12 ***
female       0.484084   0.063096   7.672 1.77e-14 ***
linc         0.237916   0.025171   9.452  < 2e-16 ***
time        -2.045501   1.187877  -1.722   0.0851 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.381 on 20184 degrees of freedom
Multiple R-squared:  0.05427,	Adjusted R-squared:  0.05403 
F-statistic: 231.6 on 5 and 20184 DF,  p-value: < 2.2e-16

The regression results conform to my expectations. Patients who are older, female, with more diseases and higher income visit the doctor more than patients who are younger, male, with fewer diseases and with lower income.

Given the regression above, I create a vector with the predicted number of doctor visits. I also restrict the predicted counts to be non-negative.

# Predicted number of doctor visits from the OLS model
data_doctor$predictions <- predict(model, newdata = data_doctor)
# Truncate negative predictions at zero
data_doctor$predictions[data_doctor$predictions < 0] <- 0

The predictions look quite different from the raw data. The raw data contain counts as high as 77 doctor visits, while the model only predicts up to about 8 visits.
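A quick way to see this is to compare the spread of the raw counts with the spread of the (rounded) predictions:

# Compare the range of the raw counts and of the model's predictions
range(data_doctor$mdvis)
range(round(data_doctor$predictions))
table(round(data_doctor$predictions))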

I want to compare the predicted/expected counts with the observed/raw counts. R's built-in chisq.test cannot be used directly (yet), because we first need to group the counts into categories. This is what happens if we feed it the ungrouped data:

> library("MASS")
> data_mdvis <- data.frame(data_doctor$mdvis, data_doctor$predictions) 
> print(chisq.test(data_mdvis))

	Pearson's Chi-squared test

data:  data_mdvis
X-squared = 35250, df = 20189, p-value < 2.2e-16

Warning message:
In chisq.test(data_mdvis) : Chi-squared approximation may be incorrect

You can see two important things: one is that the chi-squared test reports a very large statistic (35250) and a low p-value, meaning it concludes that the observed and expected counts are different. The second alarming thing is the very high number of degrees of freedom (20189); the test is not reliable with so many degrees of freedom.

Instead, counts should be grouped into categories. The exact number of categories depends on what is of interest to the researcher. For illustration purposes, I use two categories: number of doctor visits equal to zero, or higher than zero.

The code below implements the chi-squared test:

# Build a small table of observed vs expected counts in two categories
categories <- data.frame(c("0", "1 or more"),
                         c(sum(data_doctor$mdvis == 0),
                           sum(data_doctor$mdvis > 0)),
                         c(sum(as.integer(data_doctor$predictions) == 0),
                           sum(as.integer(data_doctor$predictions) > 0)))
colnames(categories) <- c("Counts", "Observed", "Expected")

# Chi-squared contribution of each category: (observed - expected)^2 / expected
categories$res <- ((categories$Observed - categories$Expected)^2) / categories$Expected

For our OLS model, this yields a chi-squared statistic of:

> sum(categories$res)
[1] 112701.5

I then use this value to calculate the p-value of the chi-squared test, with one degree of freedom (two categories minus one):

> 1-pchisq(112701.5, 1)   
[1] 0

The p-value is zero, meaning that the observed and expected counts are statistically different from each other. As I already expected from the low adjusted R-squared, this OLS model with five explanatory variables does not predict well the split between zero doctor visits and one or more doctor visits. If I wanted a model that better explains this phenomenon, I should, perhaps, add more regressors to the model.
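As a cross-check, the same two-category comparison can be run through R's built-in chisq.test as a goodness-of-fit test, passing the observed counts and the expected shares as the null proportions; this should reproduce the statistic computed by hand above:

# Goodness-of-fit version of the same comparison, using the categories
# data frame built above: observed counts against expected proportions
chisq.test(x = categories$Observed,
           p = categories$Expected / sum(categories$Expected))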

References:

https://mgimond.github.io/Stats-in-R/ChiSquare_test.html

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html

Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.
