Combining RP and SP data: the distributional assumptions

Lately I have been delving into the techniques and motivations for combining revealed preference (RP) and stated preference (SP) data.

It makes intuitive sense that the answers to revealed and stated preference questions can sometimes be combined. For example, the authors of [1] ask two very similar questions: 1) the total number of trips between November 1999 and October 2000 to a particular lake; 2) the anticipated number of trips in 2001 to the same lake, under exactly the same conditions. The first of these is actual behavior, whereas the second elicits hypothetical/future behavior. While the nature of the behavior is different, it makes intuitive sense that the answers can be combined to some extent. Both answers measure visitation over the same length of time (one year), both are integers, and both refer to the same site (a lake). Many (if not all) of the variables that affect one behavior will also affect the other. The authors ask other hypothetical questions, but these two are perfect to illustrate why it makes sense to combine the data.

A naive approach would be to stack the two datasets and not account for any difference. In other words, we would assume that the true effect of an explanatory variable is, on average, the same in both datasets. This holds if the two samples are drawn from the same underlying preferences. The problem, it seems, is that the hypothetical nature of the second question about future visitation raises questions about more fundamental assumptions we make about the underlying data generating process. Even if the effect of the explanatory variable X is the same in both datasets, the scale usually is not. In fact, many studies find that the variance of the error terms in the hypothetical (SP) dataset is larger than in the actual (RP) dataset. The authors of [2] argue that accounting for scale enables the datasets to be stacked, as long as the effects of the Xs are the same in both datasets.

The focus of this blog post will be on scale and its impact on one of the underlying assumptions: the distribution of the error terms.

Many papers that combine revealed and stated preference data report the relative scale parameter. The scale (\mu) of a single dataset cannot be identified, but the ratio \frac{\mu_{rp}}{\mu_{sp}} can [2]. Because the spread of the error term is inversely proportional to the scale, the ratio of the error spreads is:

\frac{\varepsilon_{rp}}{\varepsilon_{sp}} = \frac{\frac{1}{\mu_{rp}}}{\frac{1}{\mu_{sp}}}

If the variances of the RP errors \varepsilon_{rp} and the SP errors \varepsilon_{sp} are different, then the relative scale parameter will differ from one.
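To make the link between scale and variance explicit, here is a short worked relation. It assumes the usual convention in which the spread parameter of the Gumbel distribution equals 1/\mu; the \operatorname{sd} notation is mine, not taken from [2].

```latex
\[
\operatorname{Var}(\varepsilon) = \frac{\pi^2}{6\,\mu^2}
\quad\Longrightarrow\quad
\frac{\operatorname{sd}(\varepsilon_{rp})}{\operatorname{sd}(\varepsilon_{sp})}
= \frac{1/\mu_{rp}}{1/\mu_{sp}}
= \frac{\mu_{sp}}{\mu_{rp}}
\]
```

Under that convention, a noisier SP dataset (larger error variance) corresponds to a smaller \mu_{sp}.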

In this post, I don’t want to go as far as estimating the relative scale parameter, but rather to illustrate what happens when the only difference between two datasets is the scale. In order to do so, I will be simulating distributions.

Much of RP and SP research relies on discrete choice analysis. We frequently use random utility models (RUM), and the multinomial or conditional logit model (MNL) is a popular way to analyze choices given its simplicity. The MNL requires a strong assumption though: the error terms are independent and identically distributed draws from the type I extreme value distribution, also known as the Gumbel distribution.

The Gumbel distribution is characterized by two parameters: a location and a scale. I will be generating some Gumbel errors in R. To do so, I need the evd package, which provides the rgumbel function.

library(evd)
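For readers who want to see what the location and scale parameters do without installing anything, here is a minimal sketch in base R. The helper rgumbel_base is hypothetical (it is not part of evd); it draws Gumbel variates with the inverse-CDF method and compares the sample mean and standard deviation with their theoretical values.

```r
# Hypothetical helper: draw Gumbel (type I extreme value) variates with
# base R only, using the inverse CDF  x = loc - scale * log(-log(u)).
rgumbel_base <- function(n, loc = 0, scale = 1) {
  u <- runif(n)
  loc - scale * log(-log(u))
}

set.seed(123)  # arbitrary seed, only for reproducibility
draws <- rgumbel_base(100000, loc = 0, scale = 1)

# Theoretical moments of a Gumbel(loc, scale) distribution:
#   mean = loc + scale * (Euler-Mascheroni constant)
#   sd   = scale * pi / sqrt(6)
euler_gamma      <- -digamma(1)
theoretical_mean <- 0 + 1 * euler_gamma
theoretical_sd   <- 1 * pi / sqrt(6)

print(c(sample_mean = mean(draws), theoretical_mean = theoretical_mean))
print(c(sample_sd = sd(draws), theoretical_sd = theoretical_sd))
```

The location shifts the whole distribution, while the scale stretches it; the standard deviation grows in proportion to the scale.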



I will replicate the exercise in [3] by simulating two datasets. Assume a respondent is presented with three alternatives (A, B and C) and chooses one of them. Each alternative is characterized by an X and a Y variable. There is a “true” effect of X and Y: gamma3 is the marginal effect of X on utility, and gamma4 is the marginal effect of Y on utility. gamma1 and gamma2 are the constants for alternatives A and B, with C as the baseline. In [3]’s example, utility is drawn from a population with the following parameters:

gamma1 <- 1.0
gamma2 <- 0.3
gamma3 <- -0.03
gamma4 <- 0.005

Both X and Y are randomly drawn from a uniform distribution.

XA <- runif(500, min = 0, max = 100)
XB <- runif(500, min = 0, max = 100)
XC <- runif(500, min = 0, max = 100)

YA <- runif(500, min = 0, max = 1000)
YB <- runif(500, min = 0, max = 1000)
YC <- runif(500, min = 0, max = 1000)

Lastly, to be able to simulate utility, we need to simulate error terms. Let’s assume the error terms of the first dataset have a scale of 1.

error1.A <- rgumbel(500 ,loc=0, scale=1)
error1.B <- rgumbel(500 ,loc=0, scale=1)
error1.C <- rgumbel(500 ,loc=0, scale=1)

The second dataset, however, has error terms with a scale of 2.5.

error2.A <- rgumbel(500 ,loc=0, scale=2.5)
error2.B <- rgumbel(500 ,loc=0, scale=2.5)
error2.C <- rgumbel(500 ,loc=0, scale=2.5)

In this case, the ratio of the error scale in the first dataset to the error scale in the second will be:

\frac{\varepsilon_{rp}}{\varepsilon_{sp}} = \frac{1}{2.5} = 0.4
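As a quick sanity check on that ratio, the sketch below draws two large samples of Gumbel errors and compares their standard deviations. It uses a small base R helper (gumbel_draw, a hypothetical name) instead of evd so that it runs on its own, and a sample size far larger than the 500 used in this post so the ratio is stable.

```r
# Hypothetical helper: Gumbel draws with location 0 via the inverse CDF.
gumbel_draw <- function(n, scale) -scale * log(-log(runif(n)))

set.seed(42)   # arbitrary seed, only for reproducibility
n <- 100000    # much larger than the 500 draws used in the post

err_rp <- gumbel_draw(n, scale = 1)    # first (RP-like) dataset
err_sp <- gumbel_draw(n, scale = 2.5)  # second (SP-like) dataset

# The standard deviation of a Gumbel error is proportional to its scale,
# so this ratio should be close to 1 / 2.5.
sd_ratio <- sd(err_rp) / sd(err_sp)
print(sd_ratio)
```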

Now that I have simulated error terms, I can simulate the utilities as well. I am closely following the example from [3]. gamma1 through gamma4 are the parameters defined previously. XA is the vector of the independent variable X for alternative A, and error1.A is the vector of Gumbel errors for Utility_1.A. So in my case, Utility_1.A through Utility_1.C are the utilities that alternatives A through C give the respondent in the first dataset. This first dataset will serve as an example of a revealed preference dataset.

Utility_1.A <- gamma1+gamma3*XA+gamma4*YA+error1.A
Utility_1.B <- gamma2+gamma3*XB+gamma4*YB+error1.B
Utility_1.C <-        gamma3*XC+gamma4*YC+error1.C

I also simulate the utilities for the second dataset. These are almost identical to the previous ones, the difference being that the error terms are generated from a Gumbel distribution with a much larger scale (error2.A to error2.C).

Utility_2.A <- gamma1+gamma3*XA+gamma4*YA+error2.A
Utility_2.B <- gamma2+gamma3*XB+gamma4*YB+error2.B
Utility_2.C <-        gamma3*XC+gamma4*YC+error2.C
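Since the respondent is assumed to pick one of the three alternatives, a natural next step is to turn utilities into choices by taking the alternative with the highest utility. The sketch below is my own illustration, not part of [3]'s exercise: it rebuilds the simulation in base R with a hypothetical gumbel_draw helper, uses a larger sample than the post, and checks how often the chosen alternative is also the one with the highest systematic (non-random) utility.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility
n <- 5000       # larger than the 500 respondents used in the post

gamma1 <- 1.0; gamma2 <- 0.3; gamma3 <- -0.03; gamma4 <- 0.005

# Hypothetical helper: Gumbel draws with location 0 via the inverse CDF.
gumbel_draw <- function(n, scale) -scale * log(-log(runif(n)))

# Attributes for alternatives A, B, C (one column per alternative).
X <- matrix(runif(3 * n, min = 0, max = 100),  ncol = 3)
Y <- matrix(runif(3 * n, min = 0, max = 1000), ncol = 3)

# Systematic utility; the constants are gamma1 (A), gamma2 (B), 0 (C).
V <- sweep(gamma3 * X + gamma4 * Y, 2, c(gamma1, gamma2, 0), "+")

# Add Gumbel errors with the given scale and return the index (1 = A,
# 2 = B, 3 = C) of the alternative with the highest total utility.
simulate_choice <- function(V, scale) {
  U <- V + matrix(gumbel_draw(length(V), scale), ncol = ncol(V))
  max.col(U, ties.method = "first")
}

choice_1 <- simulate_choice(V, scale = 1)
choice_2 <- simulate_choice(V, scale = 2.5)

# Share of respondents whose choice matches the best systematic utility.
best_V     <- max.col(V, ties.method = "first")
hit_rate_1 <- mean(choice_1 == best_V)
hit_rate_2 <- mean(choice_2 == best_V)
print(c(scale_1 = hit_rate_1, scale_2.5 = hit_rate_2))
```

With the larger error scale, choices are driven more by noise and less by X and Y, so the second share should come out lower than the first.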

Now I have six vectors of utilities. To actually understand the impact of the scale, I will summarize and plot Utility_1.A versus Utility_2.A.


One thing to note immediately is that the range between the minimum and maximum of Utility_2.A (which had a larger scale) is much wider than that of Utility_1.A.
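The comparison described above can be sketched as follows. This is a self-contained approximation of the post's setup for alternative A only: it uses a hypothetical base R helper (gumbel_draw) in place of evd, and shorter variable names (U1_A, U2_A) than the original Utility_1.A and Utility_2.A. The exact numbers depend on the random seed, so none are quoted here.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 500

gamma1 <- 1.0; gamma3 <- -0.03; gamma4 <- 0.005

# Hypothetical helper: Gumbel draws with location 0 via the inverse CDF.
gumbel_draw <- function(n, scale) -scale * log(-log(runif(n)))

XA <- runif(n, min = 0, max = 100)
YA <- runif(n, min = 0, max = 1000)

# Same systematic utility in both datasets; only the error scale differs.
V_A  <- gamma1 + gamma3 * XA + gamma4 * YA
U1_A <- V_A + gumbel_draw(n, scale = 1)
U2_A <- V_A + gumbel_draw(n, scale = 2.5)

print(summary(U1_A))
print(summary(U2_A))

# Width of the observed range (max minus min) in each dataset.
spread <- c(dataset_1 = diff(range(U1_A)), dataset_2 = diff(range(U2_A)))
print(spread)
```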

Finally, I want to plot the histograms to compare the two utilities.

# Draw the wider distribution first so that its x-axis range fits both histograms
hist(Utility_2.A, col=rgb(1,0,0,1/4))
hist(Utility_1.A, col=rgb(0,0,1,1/4), add=TRUE)


Again, we see the same pattern. If the unobserved part of utility (the error term) in one dataset is drawn from a distribution with a larger scale (Utility_2.A), then the resulting utility has a higher variance than in the other dataset (Utility_1.A). This finding has a major implication: the two datasets should not be naively stacked without accounting for the difference in scale.
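The size of the gap can also be worked out by hand. The derivation below is my own back-of-the-envelope check, assuming X, Y and the error are independent, X is uniform on [0, 100], Y is uniform on [0, 1000], and \beta denotes the Gumbel scale passed to rgumbel (1 or 2.5).

```latex
\[
\operatorname{Var}(U_A)
= \gamma_3^2 \operatorname{Var}(X) + \gamma_4^2 \operatorname{Var}(Y)
  + \frac{\pi^2}{6}\,\beta^2
\]
\[
\gamma_3^2 \operatorname{Var}(X) = 0.03^2 \cdot \frac{100^2}{12} = 0.75,
\qquad
\gamma_4^2 \operatorname{Var}(Y) = 0.005^2 \cdot \frac{1000^2}{12} \approx 2.08
\]
\[
\beta = 1:\ \operatorname{Var}(U_A) \approx 2.83 + 1.64 = 4.48,
\qquad
\beta = 2.5:\ \operatorname{Var}(U_A) \approx 2.83 + 10.28 = 13.11
\]
```

So the systematic part contributes the same variance in both datasets, and the entire difference comes from the error term: the standard deviation of utility is roughly 1.7 times larger in the second dataset.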

Hopefully I will follow up next week with further implications of this finding and how to account for different scale in two datasets.



[1] Egan, K., & Herriges, J. (2006). Multivariate count data regression models with individual panel data from an on-site sample. Journal of Environmental Economics and Management, 52(2), 567-581.

[2] Hensher, D. A., & Bradley, M. (1993). Using stated response choice data to enrich revealed preference discrete choice models. Marketing Letters, 4(2), 139-151.

[3] Swait, J., & Louviere, J. (1993). The role of the scale parameter in the estimation and comparison of multinomial logit models. Journal of Marketing Research, 30(3), 305-314.