Running many specifications (Best Subset Regression)

In a previous post, I illustrated how to run multiple specifications, especially when there are several explanatory variables.

It turns out that, for least squares only, you can use the leaps package in R, which is especially handy if you are interested in comparing the fit of different models. This blog post will first define some measures of fit, and then illustrate how to use the package with meta-analysis data. If, however, your estimation is done with maximum likelihood, I am afraid this package is not useful.

Best subsets regression

Given a set of explanatory variables, best subsets regression compares all possible combinations of them. With k candidate variables there are 2^k possible models (over 260,000 combinations with the 18 variables used below), and the procedure identifies the specification that yields the best fit.[1]

The R-squared (R^2) is a measure of fit, which means that it gives an indication of how close the fitted Y (i.e. the predicted values of the dependent variable) is to the actually observed values of Y. Generally, the higher the R^2, the better the model fits the data.

A variation of the R^2 is the adjusted R^2. Basically, it is a measure of fit that “punishes” models with an excessive number of explanatory variables. In other words, it indicates whether the contribution of an additional explanatory variable is enough to justify its inclusion in the model.
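To make the adjustment concrete, here is a minimal sketch of how the adjusted R^2 relates to the plain R^2 for an ordinary lm fit in R; the model and data names (y, x1, x2, my_data) are just placeholders:

fit <- lm(y ~ x1 + x2, data = my_data)   # hypothetical model and data
r2  <- summary(fit)$r.squared            # plain R-squared
n   <- nobs(fit)                         # number of observations
k   <- length(coef(fit)) - 1             # number of explanatory variables (excluding the intercept)
1 - (1 - r2) * (n - 1) / (n - k - 1)     # adjusted R-squared, the same value as summary(fit)$adj.r.squared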

There are other measures of fit, but these are perhaps the most well-known ones. Please note that they are useful to compare models estimated by least squares, but not so much if maximum likelihood estimation is used.

Best subsets regression uses measures of fit like the R^2 or adjusted R^2 to identify the “best” model, that is, the subset of explanatory variables that explains Y the best. An alternative to best subsets regression is stepwise regression. If you want to understand the differences, check this [2].

Best subsets regression and stepwise regression, however, only work to compare models estimated by least squares. For maximum likelihood, I am not aware of a similar method. One possible solution is to do it “manually”, as I proposed in the post mentioned above.

Application to a meta-analysis

Best subsets regression may prove useful in many situations in the field of environmental economics. I will illustrate an application to a meta-analysis.

A meta-analysis is a study of several studies, carried out to identify trends and generalize results across them. In my opinion, meta-analyses are very useful because they summarize the state of the art and can be a good starting point when writing a new paper.

I will use data from Mattmann et al. (2016).[3] The authors do a meta-analysis of hydropower valuation studies. They collected 29 papers that value the (positive and negative) external effects of hydropower. These 29 papers use different valuation methods: the travel cost method, the contingent valuation method and choice experiments. With several observations per paper, they end up with 81 “welfare estimates for hydropower externalities”.[3] This is their dependent variable.

To explain the welfare estimates, they collected a list of 18 independent variables. These include, for example, dummies for the valuation method used, dummies for the type of externality, and median income. The authors estimate a model with all of these explanatory variables. Hence, I think this dataset is a very interesting example for trying out the best subsets regression idea, and for seeing whether we arrive at the same result.

In case you’d like to try it out yourself, the data from this study is accessible on the paper’s webpage, under the “Supplementary material” link. I downloaded the .dta file (the extension for Stata datasets), opened it in Stata and saved it as a CSV file, so that I can use R to analyze the data.
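If you follow the same route, importing the CSV file into R only takes one line; the file name below is just a placeholder for whatever you called your export:

meta_data <- read.csv("mattmann_et_al_2016.csv")   # hypothetical file name; creates the dataframe used below
str(meta_data)                                     # quick look at the imported variables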

The leaps package

The basis for my explanation is this blog post.[4] We will be using the leaps package as well. As usual, the first step is to install the leaps package, and then use the library command:

install.packages("leaps")
library(leaps)

Again, please note that the package only works with ordinary least squares. Nonetheless, it might be a useful first step to select explanatory variables.

The dataset actually contains information on 42 explanatory variables. I have tried to use only the same variables as in the paper (see their Table 4). To run the subset regressions, I use the regsubsets function:

all.models <- regsubsets(value ~ choice_experiment + face_to_face_survey + 
                                 year_survey + users_only + 
                                 high_income + america + 
                                 asia + hydropower_share + 
                                 limited + vegetation_landscape_pos + 
                                 vegetation_landscape_neg + animal_life_pos + 
                                 aesthetics_pos + aesthetics_neg + 
                                 GHG_hydroshare_pos + recreation_fishing + 
                                 small_change + taxes,
                         data = meta_data, nvmax = 18)

It’s very similar to running an OLS regression with the lm function. After writing regsubsets instead of lm, the formula has the dependent variable (i.e. value) on the left-hand side, and the candidate explanatory variables on the right-hand side. The data = option identifies the dataframe, and the nvmax option sets the maximum number of explanatory variables a specification can have. I included all eighteen.

You should get something like this after typing summary(all.models) in R. This is just an excerpt of the result though.

[Excerpt of the summary(all.models) output: a matrix indicating, for each model size, which variables are selected]

Basically, the first line, starting with 1, says which variable should be included if the model were to have only one explanatory variable. In that case, it would be users_only. Line 10, for example, tells me that if I want 10 explanatory variables to explain the value variable, then I should include users_only, year_survey, high_income, america, asia, and a few others that do not show up in this excerpt.
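If you would rather not read the variable names off the printed matrix, the same information is stored in the summary object, so you can extract, say, the 10-variable model directly. A small sketch, using the which component that summary() returns for regsubsets objects:

sel <- summary(all.models)$which   # logical matrix: one row per model size, one column per term
colnames(sel)[sel[10, ]][-1]       # variables in the 10-variable model (dropping the intercept)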

But a final question remains: how do we choose the best model? How do the models with one, ten, or eighteen explanatory variables compare?

We can find out by comparing the measures of fit. I use the BIC and the adjusted R-squared here, since these are the measures of fit I know best (for the BIC, lower values indicate a better model). In the commands below, I ask R to identify the models that yield the highest adjusted R-squared and the lowest BIC (i.e. the “best” models).

> res.sum <- summary(all.models)
> which.max(res.sum$adjr2)
[1] 10
> which.min(res.sum$bic)
[1] 4

The model with 10 explanatory variables is the one that yields the highest adjusted R-squared. This is the model identified previously that included users_only, year_survey, high_income, america, asia, vegetation_landscape_pos, vegetation_landscape_neg, recreation_fishing, small_change and taxes. 

The model with four variables is the one that yields the lowest BIC. This one only includes users_only, year_survey, vegetation_landscape_pos and recreation_fishing.
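Best subsets regression only tells us which variables to include. If you then want to look at the coefficients of, for example, the four-variable model favored by the BIC, one simple option is to re-fit that specification with lm (a sketch, using the variable names from above):

best_bic <- lm(value ~ users_only + year_survey +
                       vegetation_landscape_pos + recreation_fishing,
               data = meta_data)
summary(best_bic)   # coefficients and fit statistics of the four-variable model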

Even if an OLS model is not what you ultimately want to estimate, I think this is a neat way to identify subsets of explanatory variables when there are too many of them. I used the example of a meta-analysis, but other applications can benefit from this technique, as long as the estimation is done by least squares.

 

References: 

[1] https://www.statisticssolutions.com/best-subsets-regression/

[2] http://blog.minitab.com/blog/adventures-in-statistics-2/regression-smackdown-stepwise-versus-best-subsets

[3] Mattmann, M., Logar, I., & Brouwer, R. (2016). Hydropower externalities: A meta-analysis. Energy Economics, 57, 66-77.

[4] http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/
