# Combining RP&SP data: the importance of accounting for scale

In this previous post, I have introduced the paradigm of combining two datasets, elicited through revealed and stated preferences respectively. In summary, if the two datasets are separately collected but the way people choose come from the same underlying preferences, then the data can be combined. If however the scale in the two datasets differs for example because of the hypothetical nature of some questions, then data cannot be combined unless the scale difference is acknowledged and estimated.

What I will do in this blog post is to do precisely what I argue should not be done: I will combine the two datasets without accounting for scale and show the problems of doing so by means of simulation. To follow up on the last post, I will be replicating the results from .

Like the previous post, I will simulate two distributions that have the same parameters (gamma1 through 4) but differ in scale (one has scale equal to one, and the other scale of 2). X and Y are the explanatory variables that are randomly drawn from an uniform distribution.

```library(evd)

# First Dataset

error1.A <- rgumbel(500 ,loc=0, scale=1)
error1.B <- rgumbel(500 ,loc=0, scale=1)
error1.C <- rgumbel(500 ,loc=0, scale=1)

XA <- runif(500, min = 0, max = 100)
XB <- runif(500, min = 0, max = 100)
XC <- runif(500, min = 0, max = 100)

YA <- runif(500, min = 0, max = 1000)
YB <- runif(500, min = 0, max = 1000)
YC <- runif(500, min = 0, max = 1000)

gamma1.1 <- 1.0
gamma1.2 <- 0.3
gamma1.3 <- -0.03
gamma1.4 <- 0.005

Utility_1.A <- gamma1.1+gamma1.3*XA+gamma1.4*YA+error1.A
Utility_1.B <- gamma1.2+gamma1.3*XB+gamma1.4*YB+error1.B
Utility_1.C <- gamma1.3*XC+gamma1.4*YC+error1.C

```
```# Second Dataset

error2.B <- rgumbel(500 ,loc=0, scale=2)
error2.A <- rgumbel(500 ,loc=0, scale=2)
error2.C <- rgumbel(500 ,loc=0, scale=2)

gamma2.1 <- 1.0
gamma2.2 <- 0.3
gamma2.3 <- -0.03
gamma2.4 <- 0.005

Utility_2.D <- gamma2.1+gamma2.3*XA+gamma2.4*YA+error2.A
Utility_2.E <- gamma2.2+gamma2.3*XB+gamma2.4*YB+error2.B
Utility_2.F <- gamma2.3*XC+gamma2.4*YC+error2.C```

Now I have the six vectors of Utilities that I also had in the previous post. I need these six vectors to simulate what people would do. We generally assume that people will choose the alternative that yields the highest utility. Therefore, in the first dataset, the alternative chosen among A, B and C is the one that has the highest utility. choice1 is a vector that identifies the choice with the highest utility:

```choice1 <- 0
for(i in 1:500){
choice1[i] <- apply(cbind(Utility_1.A[i],Utility_1.B[i],Utility_1.C[i]),
1,which.max)
}```

choice1 is a vector with 500 choices. If it takes the value 1, then alternative A is chosen. The values 2 and 3 correspond to alternatives B and C, respectively.

I will do the same with the second dataset.

```choice2 <- 0
for(i in 1:500){
choice2[i] <- apply(cbind(Utility_2.D[i],Utility_2.E[i],Utility_2.F[i]),
1,which.max)
}```

So now I have two complete datasets: choice1 is the dependent variable, i.e. the alternative chosen, and the explanatory variables XA, XB and XC and YA, YB, and YC; and choice2 is the dependent variable of the second dataset with the same explanatory variables.

Before I combine the two datasets, I just want to make sure I can retrieve the gamma parameters if I estimate a multinomial logit model. To recap, the gamma parameters I have set up are:

```gamma2.1 <-  1.0
gamma2.2 <-  0.3
gamma2.3 <- -0.03
gamma2.4 <-  0.005```

I will write the log likelihood function by hand to estimate a MNL model. I am estimating a multinomial logit model, which is pretty simple and straightforward. The log likelihood function is basically a function that depends on four parameters: as many as the gamma parameters that are to be estimated.

```X_vectorA <- XA
X_vectorB <- XB
X_vectorC <- XC
Y_vectorA <- YA
Y_vectorB <- YB
Y_vectorC <- YC

CHOICE <- choice1

MNL_loglik1 <- function(parameters){

Utility_A <- parameters+parameters*X_vectorA+parameters*Y_vectorA
Utility_B <- parameters+parameters*X_vectorB+parameters*Y_vectorB
Utility_C <-               parameters*X_vectorC+parameters*Y_vectorC

choice <- Utility_A* ifelse(CHOICE == 1,1,0) +
Utility_B*ifelse(CHOICE == 2,1,0) +
Utility_C*ifelse(CHOICE == 3,1,0)
Probabilities <- exp(choice)/(exp(Utility_A)+exp(Utility_B)+exp(Utility_C))

loglik <- sum(log(Probabilities))

return(loglik)
}```

Basically the script above created a function named MNL_loglik1. If I use an optimizer and call this function, then I should get the gamma parameters. To do so, I will use the maxLik package.

```library(maxLik)
MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))
summary(MNL)```

The output looks like this:

```MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))
> summary(MNL)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 3 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -319.2769
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 0.9055630 0.1530454 5.917 3.28e-09 ***
[2,] 0.2124244 0.1549798 1.371 0.17
[3,] -0.0311409 0.0031223 -9.974 < 2e-16 ***
[4,] 0.0049884 0.0003676 13.569 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------```

The estimates (0.9055630, 0.2124244, -0.0311409, 0.0049884) are almost identical to the true values of the parameters (1.0, 0.3, -0.03, 0.005). This is good news.

However, when I estimate the multinomial logit model with the second dataset (CHOICE <- choice2), the results are a lot less precise.

```> MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))
> summary(MNL)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 4 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -454.2358
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 0.2916059 0.1182139 2.467 0.0136 *
[2,] -0.1698585 0.1275292 -1.332 0.1829
[3,] -0.0154303 0.0022249 -6.935 4.06e-12 ***
[4,] 0.0024002 0.0002331 10.295 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------```

The estimated parameters (0.2916059, -0.1698585, -0.0154303, 0.0024002) are quite far from their true values. This is the effect of the scale. Since the variance of the second dataset is larger, the unobserved part of utility is greater than the observed part. However, this is not necessarily problematic. The parameters are estimated in utility or preference space, meaning that they represent utility which does not really have a value. It is the relative utility of the parameters that matters rather than their absolute value.

Now, let’s combine the two datasets. First, I will adjust the dependent variable of the second dataset to range from 4 to 6:

`choice2 <- choice2+3`

Now I will run the same function as before, with some changes. The dependent variable now is a vector with a 1000 choices ranging from 1 to 6. The first three choices (1 to 3) are from the first dataset and the last three (4 to 6) are from the second dataset. The independent variables X and Y are basically the same as before but twice as long. The MNL_loglik1 function is essentially the same:

```choice1.2 <- c(choice1, choice2)

XAs <- c(XA,XA)
XBs <- c(XB,XB)
XCs <- c(XC,XC)

YAs <- c(YA,YA)
YBs <- c(YB,YB)
YCs <- c(YC,YC)

X_vectorA <- XAs
X_vectorB <- XBs
X_vectorC <- XCs
Y_vectorA <- YAs
Y_vectorB <- YBs
Y_vectorC <- YCs

CHOICE <- choice1.2

MNL_loglik1 <- function(parameters){

Utility_A <- parameters+parameters*X_vectorA+parameters*Y_vectorA
Utility_B <- parameters+parameters*X_vectorB+parameters*Y_vectorB
Utility_C <- parameters*X_vectorC+parameters*Y_vectorC

choice <- Utility_A* ifelse(CHOICE == 1,1,0) +
Utility_B*ifelse(CHOICE == 2,1,0) +
Utility_C*ifelse(CHOICE == 3,1,0)

Probabilities <- exp(choice)/(exp(Utility_A)+exp(Utility_B)+exp(Utility_C))

loglik <- sum(log(Probabilities))

return(loglik)
}

library(maxLik)
MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))```

Now I can just look at the output with the summary function again.

```> summary(MNL)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 5 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -799.1825
4 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 0.5221686 0.0916955 5.695 1.24e-08 ***
[2,] -0.0156766 0.0969604 -0.162 0.872
[3,] -0.0210961 0.0017683 -11.930 < 2e-16 ***
[4,] 0.0033605 0.0001934 17.380 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------```

Like before, the fact that we don’t account for the scale of the second dataset being different results in the estimated parameters being quite far from their true values. The estimated parameters are 0.5221686, -0.0156766, -0.0210961 and 0.0033605, while the true values are 1.0, 0.3, -0.03 and 0.005, respectively.

In a future blog post I will show the difference when accounting for scale.

References:

 Swait, J., & Louviere, J. (1993). The role of the scale parameter in the estimation and comparison of multinomial logit models. Journal of marketing research30(3), 305-314.