Running many specifications

Imagine you have a dataset and many possible explanatory variables. Maybe you have ambiguous information that all these explanatory variables might matter or not to explain your Y variable. Maybe you know some of them are important for some individuals/firms/countries, while not relevant to others.

Sometimes you would like to run hundreds or even more specifications to see how the significance or the estimated coefficient changes when you add or remove one (or more) explanatory variable. That would be something akin to an internal meta-analysis [3].

Bottom-line is that you could run thousands of models each with a different subset of explanatory variables. However, some of these explanatory variables are bound to be correlated to each other, i.e. colinearity arises. If you run a specification in which two variables are highly correlated, you may not be able to isolate the effect of each of the variables. You might end up with a lot of regression outputs in which the coefficients are not significant. But you intuitively know that the sign of that particular variable should be the opposite from what the estimated value is. You might not know the magnitude of the sign, but your whole study might not be valid if that particular variable is not significant.

For example, in my X matrix, I have length and width of the site. The correlation coefficient between the two is 55%, which might mean trouble when I estimate the coefficients of width and length.

This is a common problem called multicolinearity. It is not the biggest problem when it comes to econometrics. It does not affect the fit of the model, but it affects the estimated parameter of the variables that are correlated with each other. Generally, you have non-statistical significant in most (if not all!) predictors.

When it comes to coding, I find myself very frustrated if I have to run regressions with different specifications one by one, and I end up forgetting the first result if I don’t save it. To avoid copy pasting the regression output as many times as the number of specifications you want to estimate, I have decided to learn how to do this automatically on R.

In my case there is a total of 33 explanatory variables. What I would like to do is to run my model for all possible combinations of these 33 variables without repetitions.

Combinations of explanatory variables

First, to know the magnitude of my problem, I want to know how many specifications I will have to estimate. This is a simple combination without repetitions:


where n represents the total number of my explanatory variables and r represents the number of variables I choose in each draw. If I want to know how many combinations of 33 variables of a specification of 10 of these variables, I can write:


If I calculate all possible number of combinations (i.e. replacing 10 by other possible numbers), I get a total of 8589934591 possible specifications!

Because this is such a huge number, I will just write down an example with 5 variables. Using the formula above 5 times, I find that there is a total of 31 possible specifications with these variables.

These are the names of my five variables:

VARS <- c('VAR1' , 'VAR2', 'VAR3', 'VAR4', 'VAR5')

I want all the possible combinations of these variables. I am using the combn function to do this. The combn function generates all the combinations of my explanatory variables within the VARS vector, but taking only some at a time. In the example below, I am extracting 3 out of the five variables available. For my vector of variables (VARS) and extracting 3 of the variables, I can write:

combn(VARS, 3, paste0, collapse = "+")

You can change the number of variables by replace the second argument “3” with the number of variables you want to combine. To get all the possible combinations with five variables, I write:

combn(VARS, 1, paste0, collapse = "+")

combn(VARS, 2, paste0, collapse = "+")

combn(VARS, 3, paste0, collapse = "+")

combn(VARS, 4, paste0, collapse = "+")

combn(VARS, 5, paste0, collapse = "+")

But because I am lazy and I want to do this automatically, I just wrote a loop:

for(i in 1:5){
              assign( paste("SPEC", i, sep = "") , 
                      combn(VARS, i, paste0, collapse = "+"))

This generates five vectors:

[1] "VAR1" "VAR2" "VAR3" "VAR4" "VAR5"
[1] "VAR1+VAR2" "VAR1+VAR3" "VAR1+VAR4" "VAR1+VAR5" "VAR2+VAR3" "VAR2+VAR4" "VAR2+VAR5" "VAR3+VAR4" "VAR3+VAR5" "VAR4+VAR5"
[10] "VAR3+VAR4+VAR5"
[1] "VAR1+VAR2+VAR3+VAR4+VAR5"

I will join these five vectors together in a single vector:


[1] 31

As expected the number of combinations stored in the SPECS vector are 31. This is the number of all possible combinations I have calculated using the formula n!/(n-r)!r!.

Now I have the vector containing the expression of the combinations of explanatory variables I want to estimate. The next step is write a code to estimate all the models with these combinations.

Estimating models with multiple specifications

I will be estimating OLS regressions to illustrate how to use the code. It can “easily” be re-adjusted to estimate other types of models.

First, I would like to estimate the OLS regression with the full set of covariates. I can retrieve the expression for the full set of covariates by typing:

> SPECS[31]
[1] "VAR1+VAR2+VAR3+VAR4+VAR5"

I can run a regression for this specification by typing:

lm(as.formula(paste0("YVAR", " ~ ", SPECS[i])))

First, I am defining i as 31. Then, the paste0 function converts the “YVAR”, ” ~ “, SPECS[i] part into character strings. The as.formula function informs the code that it should treat the argument as a formula.

However, I would like to create a loop to run a regression for each expression of variables defined in SPECS.  I can do that with this expression:

for(i in 1:length(SPECS)){
         regression <- paste0("YVAR", " ~ ", SPECS[i])
         assign(paste("model", i, sep= "") , lm(as.formula(regression)) )

(Note: That expression took me 3 hours to write. I used two R tutorials. [1],[2] )

The only new function is the assign function that basically creates a new object with the name being the first argument of the function, and the value assigned to this new object in the second part of the argument. The expression above reproduces 31 regression outputs, all named as model1, model2, etc … , until model31.

And voilà, now I have estimated all 31 regressions I was interested in. The coefficients are stored in the model1, model2, … objects. You can retrieve the coefficients by writing either of these expressions:



The final step is retrieving and storing the coefficients (and any other output you might need) in a separate matrix. However, I have not been able to figure this out yet. Hopefully I will write a new blog post when I do.





[3] Kuminoff, N. V., Zhang, C., & Rudi, J. (2010). Are travelers willing to pay a premium to stay at a “green” hotel? Evidence from an internal meta-analysis of hedonic price premia. Agricultural and Resource Economics Review39(3), 468-484.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s