In my previous post, I introduced the paradigm of combining two datasets, elicited through revealed and stated preferences respectively. In summary, if the two datasets are collected separately but the choices in both stem from the same underlying preferences, then the data can be combined. If, however, the scale of the two datasets differs, for example because of the hypothetical nature of some questions, then the data cannot be combined unless the scale difference is acknowledged and estimated.

In this blog post I will do precisely what I argue should not be done: I will combine the two datasets without accounting for scale and show the problems of doing so by means of simulation. Following up on the last post, I will be replicating the results from [1].

As in the previous post, I will simulate two datasets that share the same parameters (gamma1 through gamma4) but differ in scale (one has scale 1, the other scale 2). X and Y are explanatory variables drawn randomly from uniform distributions.

```r
library(evd)

# First dataset: Gumbel errors with scale 1
error1.A <- rgumbel(500, loc = 0, scale = 1)
error1.B <- rgumbel(500, loc = 0, scale = 1)
error1.C <- rgumbel(500, loc = 0, scale = 1)

XA <- runif(500, min = 0, max = 100)
XB <- runif(500, min = 0, max = 100)
XC <- runif(500, min = 0, max = 100)
YA <- runif(500, min = 0, max = 1000)
YB <- runif(500, min = 0, max = 1000)
YC <- runif(500, min = 0, max = 1000)

gamma1.1 <- 1.0
gamma1.2 <- 0.3
gamma1.3 <- -0.03
gamma1.4 <- 0.005

Utility_1.A <- gamma1.1 + gamma1.3*XA + gamma1.4*YA + error1.A
Utility_1.B <- gamma1.2 + gamma1.3*XB + gamma1.4*YB + error1.B
Utility_1.C <- gamma1.3*XC + gamma1.4*YC + error1.C
```

```r
# Second dataset: same parameters, Gumbel errors with scale 2
error2.A <- rgumbel(500, loc = 0, scale = 2)
error2.B <- rgumbel(500, loc = 0, scale = 2)
error2.C <- rgumbel(500, loc = 0, scale = 2)

gamma2.1 <- 1.0
gamma2.2 <- 0.3
gamma2.3 <- -0.03
gamma2.4 <- 0.005

Utility_2.D <- gamma2.1 + gamma2.3*XA + gamma2.4*YA + error2.A
Utility_2.E <- gamma2.2 + gamma2.3*XB + gamma2.4*YB + error2.B
Utility_2.F <- gamma2.3*XC + gamma2.4*YC + error2.C
```

Now I have the six utility vectors that I also had in the previous post. I need them to simulate what people would do. We generally assume that people choose the alternative that yields the highest utility. Therefore, in the first dataset, the alternative chosen among A, B and C is the one with the highest utility. *choice1* is a vector that identifies, for each observation, the alternative with the highest utility:

```r
choice1 <- numeric(500)
for (i in 1:500) {
  choice1[i] <- which.max(c(Utility_1.A[i], Utility_1.B[i], Utility_1.C[i]))
}
```

*choice1* is a vector with 500 choices. If it takes the value 1, then alternative A is chosen. The values 2 and 3 correspond to alternatives B and C, respectively.
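As an aside, the per-observation loop is not strictly needed: R's built-in `max.col()` returns, for each row of a matrix, the column index of the maximum, so the whole loop collapses to a single vectorised call. A minimal sketch with toy utility values (made up purely for illustration):

```r
# Toy utility matrix: rows are observations, columns are alternatives A, B, C
U <- cbind(A = c(1.2, 0.4), B = c(0.7, 0.9), C = c(0.1, 0.3))

# For each row, which alternative has the highest utility?
max.col(U)  # 1 for the first row (A wins), 2 for the second (B wins)
```

Applied to the simulated data, `max.col(cbind(Utility_1.A, Utility_1.B, Utility_1.C))` produces the same choice vector as the loop above.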

I will do the same with the second dataset.

```r
choice2 <- numeric(500)
for (i in 1:500) {
  choice2[i] <- which.max(c(Utility_2.D[i], Utility_2.E[i], Utility_2.F[i]))
}
```

So now I have two complete datasets. In the first, *choice1* is the dependent variable (the alternative chosen) and XA, XB, XC, YA, YB and YC are the explanatory variables; in the second, *choice2* is the dependent variable, with the same explanatory variables.

Before I combine the two datasets, I just want to make sure I can retrieve the gamma parameters if I estimate a multinomial logit model. To recap, the gamma parameters I have set up are:

```r
gamma2.1 <- 1.0
gamma2.2 <- 0.3
gamma2.3 <- -0.03
gamma2.4 <- 0.005
```

I will write the log-likelihood function of the multinomial logit (MNL) model by hand. The MNL model is simple and straightforward: its log-likelihood is a function of four parameters, one for each of the gamma parameters to be estimated.

```r
X_vectorA <- XA
X_vectorB <- XB
X_vectorC <- XC
Y_vectorA <- YA
Y_vectorB <- YB
Y_vectorC <- YC
CHOICE <- choice1

MNL_loglik1 <- function(parameters){
  # Deterministic part of utility for each alternative
  Utility_A <- parameters[1] + parameters[3]*X_vectorA + parameters[4]*Y_vectorA
  Utility_B <- parameters[2] + parameters[3]*X_vectorB + parameters[4]*Y_vectorB
  Utility_C <- parameters[3]*X_vectorC + parameters[4]*Y_vectorC
  # Utility of the chosen alternative for each observation
  chosen <- Utility_A*ifelse(CHOICE == 1, 1, 0) +
            Utility_B*ifelse(CHOICE == 2, 1, 0) +
            Utility_C*ifelse(CHOICE == 3, 1, 0)
  # Logit probability of the chosen alternative
  Probabilities <- exp(chosen)/(exp(Utility_A) + exp(Utility_B) + exp(Utility_C))
  loglik <- sum(log(Probabilities))
  return(loglik)
}
```

The script above creates a function named *MNL_loglik1*. If I maximise this function with an optimizer, I should recover the gamma parameters. To do so, I will use the *maxLik* package.

```r
library(maxLik)
MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))
summary(MNL)
```

The output looks like this:

```
> summary(MNL)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 3 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -319.2769
4 free parameters
Estimates:
       Estimate Std. error t value  Pr(> t)
[1,]  0.9055630  0.1530454   5.917 3.28e-09 ***
[2,]  0.2124244  0.1549798   1.371     0.17
[3,] -0.0311409  0.0031223  -9.974  < 2e-16 ***
[4,]  0.0049884  0.0003676  13.569  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
```

The estimates (0.9055630, 0.2124244, -0.0311409, 0.0049884) are close to the true parameter values (1.0, 0.3, -0.03, 0.005). This is good news.

However, when I estimate the multinomial logit model with the second dataset (CHOICE <- choice2), the results are a lot less precise.

```
> MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))
> summary(MNL)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 4 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -454.2358
4 free parameters
Estimates:
       Estimate Std. error t value  Pr(> t)
[1,]  0.2916059  0.1182139   2.467   0.0136 *
[2,] -0.1698585  0.1275292  -1.332   0.1829
[3,] -0.0154303  0.0022249  -6.935 4.06e-12 ***
[4,]  0.0024002  0.0002331  10.295  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
```

The estimated parameters (0.2916059, -0.1698585, -0.0154303, 0.0024002) are quite far from their true values. This is the effect of scale: since the error variance of the second dataset is larger, the unobserved part of utility dominates the observed part, and the well-identified estimates come out at roughly the true values divided by the scale of 2. This is not necessarily problematic, however. The parameters are estimated in utility or preference space, where utility has no absolute meaning. It is the relative magnitude of the parameters that matters rather than their absolute values.
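A quick check makes the point about relative utility concrete. Using the estimates reported in the output above, the ratio between the X and Y coefficients, which is what a trade-off measure such as a willingness-to-pay would be based on, is close to the true ratio even though the coefficients themselves are roughly halved:

```r
# gamma3 and gamma4 estimates from the scale-2 dataset (copied from the output above)
est3 <- -0.0154303
est4 <-  0.0024002

est3/est4    # about -6.43
-0.03/0.005  # the true ratio: -6
```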

Now, let’s combine the two datasets. First, I will adjust the dependent variable of the second dataset to range from 4 to 6:

```r
choice2 <- choice2 + 3
```

Now I will run the same function as before, with some changes. The dependent variable is now a vector of 1,000 choices ranging from 1 to 6: values 1 to 3 come from the first dataset and values 4 to 6 from the second. The explanatory variables X and Y are the same as before, stacked to twice the length. The *MNL_loglik1* function is essentially the same:

```r
choice1.2 <- c(choice1, choice2)

# Both datasets use the same X and Y draws, stacked to length 1000
XAs <- c(XA, XA)
XBs <- c(XB, XB)
XCs <- c(XC, XC)
YAs <- c(YA, YA)
YBs <- c(YB, YB)
YCs <- c(YC, YC)

X_vectorA <- XAs
X_vectorB <- XBs
X_vectorC <- XCs
Y_vectorA <- YAs
Y_vectorB <- YBs
Y_vectorC <- YCs
CHOICE <- choice1.2

MNL_loglik1 <- function(parameters){
  Utility_A <- parameters[1] + parameters[3]*X_vectorA + parameters[4]*Y_vectorA
  Utility_B <- parameters[2] + parameters[3]*X_vectorB + parameters[4]*Y_vectorB
  Utility_C <- parameters[3]*X_vectorC + parameters[4]*Y_vectorC
  # Choices 1 and 4 correspond to alternative A, 2 and 5 to B, 3 and 6 to C
  chosen <- Utility_A*ifelse(CHOICE %in% c(1, 4), 1, 0) +
            Utility_B*ifelse(CHOICE %in% c(2, 5), 1, 0) +
            Utility_C*ifelse(CHOICE %in% c(3, 6), 1, 0)
  Probabilities <- exp(chosen)/(exp(Utility_A) + exp(Utility_B) + exp(Utility_C))
  loglik <- sum(log(Probabilities))
  return(loglik)
}

library(maxLik)
MNL <- maxLik(MNL_loglik1, start = c(1.0, 0.3, -0.03, 0.005))
```

Now I can just look at the output with the summary function again.

```
> summary(MNL)
--------------------------------------------
Maximum Likelihood estimation
Newton-Raphson maximisation, 5 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -799.1825
4 free parameters
Estimates:
       Estimate Std. error t value  Pr(> t)
[1,]  0.5221686  0.0916955   5.695 1.24e-08 ***
[2,] -0.0156766  0.0969604  -0.162    0.872
[3,] -0.0210961  0.0017683 -11.930  < 2e-16 ***
[4,]  0.0033605  0.0001934  17.380  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
--------------------------------------------
```

As before, failing to account for the different scale of the second dataset leaves the estimated parameters quite far from their true values: the estimates are 0.5221686, -0.0156766, -0.0210961 and 0.0033605, while the true values are 1.0, 0.3, -0.03 and 0.005, respectively.
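Note also where the pooled estimates end up. Using the values reported in the three outputs above, each slope coefficient from the combined model lies between its two dataset-specific estimates, which is exactly what you would expect when a model effectively averages over two datasets with different scales:

```r
# Slope estimates (gamma3, gamma4) copied from the three model outputs above
sep1   <- c(-0.0311409, 0.0049884)  # first dataset (scale 1)
sep2   <- c(-0.0154303, 0.0024002)  # second dataset (scale 2)
pooled <- c(-0.0210961, 0.0033605)  # combined data, scale ignored

# Each pooled estimate falls between the two dataset-specific estimates
pooled > pmin(sep1, sep2) & pooled < pmax(sep1, sep2)  # TRUE TRUE
```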

In a future blog post I will show the difference when accounting for scale.

**References:**

[1] Swait, J., & Louviere, J. (1993). The role of the scale parameter in the estimation and comparison of multinomial logit models. *Journal of Marketing Research*, 30(3), 305–314.
