Running many specifications (3): Storing the results

In a previous blog post, I illustrated how to run several specifications based on the same data. In a subsequent post, I showed a shortcut for doing so when we are mostly interested in finding the model with the best fit (using best subset regression). When we are uncertain about the underlying data generating process, however, it can be useful to understand how the estimated parameters change as we omit or include different variables. In that case, it makes sense to run the model under many different specifications.

For whatever reason, you might be interested in saving and comparing the estimated coefficients. Ideally, each coefficient would have its own column, with each row corresponding to a different specification. For example, one could be interested in measuring the omitted variable bias from excluding a crucial variable. While I have previously shown how to run many specifications, I did not show how to save the estimated coefficients for later analysis. That is what I will do in this blog post.

Let us assume I have run many specifications using the data from Mattmann et al. (2016). The authors conduct a meta-analysis of hydropower valuation studies. They collected 29 papers which value the (positive and negative) external effects of hydropower. These 29 papers use different valuation methods: the travel cost method, the contingent valuation method and choice experiments. Given several observations per paper, they have 81 “welfare estimates for hydropower externalities”. This is their dependent variable.

To explain the welfare estimates, they collected a list of 18 independent variables. These include dummies for the valuation method used, dummies for the type of externality, and median income, for example. The authors estimate a model with all these explanatory variables. It could be interesting to check how estimated coefficients differ across specifications.

In case you’d like to try it out yourself, the data from this study is accessible from the paper’s webpage, by clicking on the “Supplementary material” link. I downloaded the .dta file (the extension for Stata datasets), opened it in Stata and saved it as a CSV file, so I could use R to analyze the data.

I replicate here the code to run a series of models with all possible combinations of all 18 explanatory variables. If you would like to understand this excerpt of code, I highly advise reading this previous blog post:

VARS <- c("choice_experiment",
          "face_to_face_survey",
          "year_survey",
          "users_only",
          "high_income",
          "america",
          "asia",
          "hydropower_share",
          "limited",
          "vegetation_landscape_pos",
          "vegetation_landscape_neg",
          "animal_life_pos",
          "aesthetics_pos",
          "aesthetics_neg",
          "GHG_hydroshare_pos",
          "recreation_fishing",
          "small_change",
          "taxes")

# Build all combinations of i variables as "var1+var2+..." formula strings
for(i in 1:18){
  assign(paste0("SPEC", i),
         combn(VARS, i, paste0, collapse = "+"))
}

SPECS <- c(SPEC1, SPEC2, SPEC3, SPEC4, SPEC5, SPEC6, SPEC7, SPEC8, SPEC9,
           SPEC10, SPEC11, SPEC12, SPEC13, SPEC14, SPEC15, SPEC16, SPEC17, SPEC18)

# Attach the data so the variable names in the formulas resolve
attach(meta_data)

# Fit one OLS regression per specification: model1, model2, ...
for(i in 1:length(SPECS)){
  regression <- paste0("value ~ ", SPECS[i])
  assign(paste0("model", i), lm(as.formula(regression)))
}

In the VARS vector, I wrote the names of all 18 explanatory variables. I then created 18 vectors called SPEC1, SPEC2, etc., where SPECk holds all possible combinations of k of the 18 variables, written as formula strings. SPECS merges all these combination vectors, which I then use in the last loop to run the OLS regressions model1, model2, model3, etc.
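To see what combn is producing, here is a toy version of the same call with three variables instead of 18:

```r
# All formula strings combining exactly 2 of 3 toy variables
vars <- c("a", "b", "c")
combn(vars, 2, paste0, collapse = "+")
# "a+b" "a+c" "b+c"
```

The extra arguments after the function name are passed on to paste0, which glues each combination into a single string ready for a formula.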

The total number of specifications is 262,143, the sum of all possible combinations of the 18 variables (that is, 2^18 − 1, excluding the empty model). As you can imagine, it takes quite some time to estimate all of them. Instead of running all 262 thousand specifications, I run just the first 10,000 for illustration purposes.
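As a sanity check, this total follows directly from the binomial coefficients:

```r
# Number of ways to choose k variables out of 18, summed over k = 1..18
sum(choose(18, 1:18))  # 262143, i.e. 2^18 - 1
```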

Once the code has finished running, you need to store the estimated parameters. You might also be interested in saving other results, such as the AIC or the standard errors, which requires only small changes to the code below.
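For instance, the AIC has its own extractor function, and a coefficient's standard error lives in the coefficient table returned by summary(). A minimal sketch on the built-in mtcars data (not the meta-analysis data):

```r
# Toy model on built-in data, only to show where these extra results live
toy <- lm(mpg ~ wt, data = mtcars)

AIC(toy)                                # Akaike information criterion
coef(summary(toy))["wt", "Std. Error"]  # standard error of the wt coefficient
```

Swapping these extractors into the storage loop below would fill the matrix with AICs or standard errors instead of coefficients.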

I start by creating a matrix called RESULTS, where I will store the coefficient results. It contains 10000 rows (one for each specification) and 18 columns (one for each explanatory variable).

RESULTS <- matrix(0, nrow = 10000, ncol = 18)

I then label the RESULTS columns with the variables’ names to know which variable is in each column.

colnames(RESULTS) <- VARS

Let us look at the regression output from model1. This model includes only the choice experiment variable to explain the estimated value. Its coefficient is not statistically significant, meaning that choice experiments on average yield neither higher nor lower welfare estimates than other non-market valuation methods.

> summary(model1)

Call:
lm(formula = as.formula(regression))

Residuals:
    Min      1Q  Median      3Q     Max 
-178.43 -135.08  -37.05   38.44 1658.86 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         182.30      36.18   5.038 2.91e-06 ***
choice_experiment   -25.74      54.27  -0.474    0.637    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 242.7 on 79 degrees of freedom
Multiple R-squared:  0.002839,	Adjusted R-squared:  -0.009783 
F-statistic: 0.2249 on 1 and 79 DF,  p-value: 0.6366

I want to save this coefficient estimate of −25.74 to compare with other specifications. In the RESULTS matrix, I want the first row to hold model1’s coefficient on the choice_experiment variable. Since no other variables are included in this specification, all other entries in this row should equal zero or NA.

The tryCatch function in R is perfect for this task. Extracting a coefficient that is absent from a model yields an error. For example, searching for the coefficient associated with choice_experiment:

> model1[["coefficients"]][["choice_experiment"]]
[1] -25.73934

I can find the right coefficient in model1. But if I were to look for the coefficient associated with face_to_face_survey:

> model1[["coefficients"]][["face_to_face_survey"]]
Error in model1[["coefficients"]][["face_to_face_survey"]] : 
  subscript out of bounds

It yields an error message. In order to bypass this error, I use the tryCatch function. Basically, instead of the extraction failing because the coefficient cannot be found, if the function encounters an error, it replaces the result with NA. Therefore, a column will contain NA wherever the variable is absent from the regression.
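The pattern can be seen on a small named vector that mimics a model's coefficient vector:

```r
# Named vector standing in for model1[["coefficients"]]
coefs <- c("(Intercept)" = 182.30, choice_experiment = -25.74)

tryCatch(coefs[["choice_experiment"]], error = function(err) NA)    # -25.74
tryCatch(coefs[["face_to_face_survey"]], error = function(err) NA)  # NA
```

The first lookup succeeds and returns the estimate; the second raises the "subscript out of bounds" error seen above, which the handler converts into NA.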

Instead of writing one loop per column, a single nested loop over the 10,000 specifications and the 18 variables fills the whole matrix with the coefficient values:

for(i in 1:10000){
  for(j in 1:18){
    RESULTS[i, j] <- tryCatch(
      get(paste0("model", i))[["coefficients"]][[VARS[j]]],
      error = function(err) NA)
  }
}

The first six observations in the RESULTS matrix look like this:

For each row, it successfully extracts the estimated coefficients. This matrix includes 10,000 rows. I can now analyze the distribution of estimated coefficients if I wish to. For example, I might wish to plot the histogram of the choice_experiment variable:

This coefficient seems to be centered around zero. Similar analyses can be undertaken for the remaining coefficients, and extracting other outputs such as the standard errors is also possible.
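The histogram and a quick summary can be produced with base R. A sketch assuming RESULTS has been filled as above (a small random stand-in matrix is created here only so that the snippet runs on its own):

```r
# Hypothetical stand-in for the real RESULTS matrix
RESULTS <- matrix(rnorm(200), nrow = 100, ncol = 2,
                  dimnames = list(NULL, c("choice_experiment", "taxes")))

hist(RESULTS[, "choice_experiment"],
     main = "choice_experiment across specifications",
     xlab = "Estimated coefficient")

# Average estimate across specifications, ignoring models without the variable
mean(RESULTS[, "choice_experiment"], na.rm = TRUE)
```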

References:

Mattmann, M., Logar, I., & Brouwer, R. (2016). Hydropower externalities: A meta-analysis. Energy Economics, 57, 66-77.
