# The Chi-squared test in R after running a regression

Today I illustrate how to perform a chi-squared test after running a regression in R.

A chi-squared test is a useful way to compare across models. Imagine you are investigating or explaining a categorical variable. You have run two alternative models and would like to determine which one is preferred. AIC and BIC are researchers' go-to measures of model fit: the lower the AIC/BIC, the better the model.
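As a quick illustration of how an AIC/BIC comparison works in R (a sketch on simulated data, not the doctor-visit data used below):

```r
# Sketch: comparing two candidate OLS models by AIC/BIC (simulated data)
set.seed(42)
x1 <- rnorm(200)
x2 <- rnorm(200)                 # irrelevant regressor by construction
y  <- 1 + 2 * x1 + rnorm(200)
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
AIC(m1); AIC(m2)                 # lower AIC/BIC indicates the preferred model
BIC(m1); BIC(m2)
```

Note that BIC penalizes the extra parameter by log(n) rather than AIC's 2, so it punishes the irrelevant regressor more heavily.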

AIC and BIC are useful but might not be the best tools to investigate the fit of a model. When handling categorical variables, we might be interested in the frequencies: e.g. how many Yes's or No's does my model predict, and how well does that match the raw data? The chi-squared test helps determine whether there is a significant association between two variables: here, the predicted and the raw data. The closer the predicted frequencies are to the observed ones, the better the model is at predicting. The chi-squared test looks like this:

$\chi^2 = \sum_{i=1}^{C} \frac{(observed_i - expected_i)^2}{expected_i}$

where $expected_i$ is the predicted count in category $i$ (conditional on the independent variables) and $observed_i$ is the corresponding count in the raw data.

The degrees of freedom (df) of this test are equal to the number of categories (C) minus one. If we are analyzing binary variables (e.g. Yes/No), then there are two categories. If we are analyzing a scale of five responses (e.g. excellent, good, mediocre, bad, terrible), then there are five categories. In both cases the number of degrees of freedom is low (1 and 4, respectively). With a large number of degrees of freedom (50+), this test is not reliable. That happens, for instance, with count variables: if the number of doctor visits ranges above 50 and each count is treated as its own category, the degrees of freedom explode.
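As a minimal sketch of the computation with two categories (toy counts, not yet the doctor-visit data):

```r
# Chi-squared goodness-of-fit with C = 2 categories, so df = C - 1 = 1
observed <- c(120, 80)    # toy observed counts per category
expected <- c(100, 100)   # toy model-predicted counts
chi2 <- sum((observed - expected)^2 / expected)   # (20^2)/100 + (20^2)/100 = 8
df   <- length(observed) - 1
pval <- pchisq(chi2, df, lower.tail = FALSE)      # about 0.005: reject equality
```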

That was exactly the situation in my data. Instead of comparing the predicted and raw data across too many counts, I grouped the variable into categories that make sense when writing the code for my chi-squared test.

To illustrate this test, I am using data on doctor visits from Cameron and Trivedi (2005). The data can be found at the book’s webpage. I downloaded and imported the data into R. The variable I am interested in is the number of doctor visits (mdvis). This variable ranges from 0 to 77, as seen below:

> table(data_doctor$mdvis)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24
6308 3817 2797 1884 1345  968  689  531  408  287  206  190  118  109   82   59   56   33   37   35   26   22   19   19   13
  25   26   27   28   29   30   31   32   33   34   35   37   38   39   40   41   44   45   46   48   51   52   55   56   57
   8   10    6   12    6    8    8    4    5    9    5    5    9    1    3    5    6    2    2    2    1    3    1    1    1
  58   62   63   65   69   72   74   76   77
   1    1    1    1    1    1    1    1    1

Suppose I want to explain the number of doctor visits given the number of diseases, age, gender, income and time. I run an ordinary least squares regression.

> model <- lm(mdvis ~ disea + xage + female + linc + time, data=data_doctor)
> summary(model)

Call:
lm(formula = mdvis ~ disea + xage + female + linc + time, data = data_doctor)

Residuals:
   Min     1Q Median     3Q    Max
-8.225 -2.313 -1.183  0.842 71.356

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.832448   1.209156   0.688   0.4912
disea        0.126080   0.004776  26.396  < 2e-16 ***
xage         0.012888   0.001885   6.838 8.26e-12 ***
female       0.484084   0.063096   7.672 1.77e-14 ***
linc         0.237916   0.025171   9.452  < 2e-16 ***
time        -2.045501   1.187877  -1.722   0.0851 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.381 on 20184 degrees of freedom
Multiple R-squared:  0.05427,	Adjusted R-squared:  0.05403
F-statistic: 231.6 on 5 and 20184 DF,  p-value: < 2.2e-16

The regression results conform to my expectations. Patients who are older, female, with more diseases and higher income visit the doctor more than patients who are younger, male, with fewer diseases and lower income.

Given the regression above, I create a vector with predicted numbers of doctor visits. I also restrict the predictions to be non-negative.

> data_doctor$predictions <- predict.lm(model, data.frame(data_doctor), type = "response")
> data_doctor$predictions[data_doctor$predictions < 0] <- 0


The predictions look quite different from the raw data. The raw data contain counts as high as 77 doctor visits, while the model only predicts up to 8.
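This compression of OLS predictions is typical for right-skewed count outcomes. A sketch on simulated data (since the doctor-visit file is not reproduced here) shows the same pattern:

```r
# Sketch: OLS predictions of a skewed count outcome are compressed toward the mean
set.seed(1)
x <- rnorm(5000)
y <- rpois(5000, lambda = exp(0.5 + 0.5 * x))   # right-skewed counts
m <- lm(y ~ x)
pred <- predict(m)
pred[pred < 0] <- 0    # truncate negative predictions at zero, as above
max(y)                 # the raw counts reach far higher...
max(pred)              # ...than the linear predictions do
```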

I want to compare the predicted/expected counts with the observed/raw counts. R's built-in test cannot be used directly (yet), because we first need to group the counts into categories. This is what happens when we try to apply chisq.test to the ungrouped data:

> library("MASS")
> data_mdvis <- data.frame(data_doctor$mdvis, data_doctor$predictions)
> print(chisq.test(data_mdvis))

Pearson's Chi-squared test

data:  data_mdvis
X-squared = 35250, df = 20189, p-value < 2.2e-16

Warning message:
In chisq.test(data_mdvis) : Chi-squared approximation may be incorrect

Two important things stand out. First, the chi-squared test reports a high statistic (35250) and a low p-value, meaning it concludes that the observed and expected counts are different. Second, and alarming, is the very high number of degrees of freedom (20189): as noted above, the test does not do well with so many degrees of freedom.

Instead, counts should be grouped into categories. The exact number of categories depends on what is of interest to the researcher. For illustration purposes, I use two categories: number of doctor visits equal to zero, or higher than zero.

The code below implements the chi-squared test:

categories <- data.frame(c("0", "1 or more"),
c(sum(data_doctor$mdvis==0), sum(data_doctor$mdvis>0)),
c(sum(as.integer(data_doctor$predictions)==0), sum(as.integer(data_doctor$predictions)>0)))
colnames(categories) <- c("Counts", "Observed", "Expected")

categories$res <- ((categories$Observed-categories$Expected)^2)/categories$Expected


For our OLS model, this yields a value for this test of:

> sum(categories$res)
[1] 112701.5

I then use this value to calculate the p-value of the chi-squared test:

> 1-pchisq(112701.5, 1)
[1] 0

The p-value is zero, meaning that the observed and expected counts are statistically different from each other. As I already expected from the low adjusted R-squared, this OLS model with five explanatory variables does not predict well how many patients have zero doctor visits versus one or more. To explain this phenomenon better, I should perhaps add regressors to the model.
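For comparison, once the counts are grouped, R's built-in chisq.test can compute the same kind of statistic via its p argument, which takes the expected category shares as probabilities. A sketch: the observed zero/non-zero split below is taken from the frequency table above (6308 zeros out of 20190 observations), while the expected counts are hypothetical placeholders, since they depend on the fitted model.

```r
observed <- c(6308, 13882)       # zero vs. one-or-more visits (from the raw data)
expected <- c(2000, 18190)       # hypothetical model-predicted counts
res <- chisq.test(x = observed, p = expected / sum(expected))
res$statistic                    # the goodness-of-fit statistic, df = 1
res$p.value
```

Because the expected counts here sum to the sample size, dividing by the total turns them into the probabilities chisq.test expects.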

References:

https://mgimond.github.io/Stats-in-R/ChiSquare_test.html

https://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html

Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.