Choice set and data format

In this blog post I focus on choice sets. A choice set is the set of alternatives that an individual compares when choosing one of them.

Most of my research focuses on how individuals make decisions. When I go to the supermarket, I have a variety of products to choose from. They all differ in their characteristics, which in nonmarket valuation we tend to call “attributes”. I look at the alternatives I have, compare them based on prices, quality, packaging, and other attributes, and then make my choice. As researchers, we observe these choices and model them, thereby identifying the relative importance of attributes.

Yet, sometimes one or more alternatives might not be available to me. For example, I might fail to see/take into consideration some of the alternatives available in that aisle. This can be because I am not aware of these, or maybe because I chose some kind of heuristic to reduce the choice set. See [2] for a good review on heuristics in decision making. In that case, the choice set I am actually considering is a subset of the one that was available.

When modeling decisions, it is crucial to only include the appropriate “substitutes” in the choice set, along with the actual choice. It is not correct to model my choice as having a wide choice set including all of the alternatives theoretically available to me. As [1] puts it, “including the appropriate substitutes in the demand estimation is important in both a theoretical and empirical sense”.

Yet we, the researchers, do not always know which alternatives the respondent actually considered. Each possible choice set, wide or reduced, then has some probability of being the one the respondent really used. So it might be useful to include this uncertainty when estimating our models.
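One common way to write this uncertainty down is a two-stage formulation. The notation is introduced only for this sketch and is not taken from the references: M is the full set of available alternatives, C is a candidate consideration set, Q_n(C) is the probability that individual n uses C, and P_n(i | C) is the probability of choosing alternative i given C.

```latex
P_n(i) = \sum_{C \subseteq M,\; i \in C} P_n(i \mid C)\, Q_n(C)
```

Read this way, the observed choice probability is a mixture over the possible consideration sets. When the consideration set is observed, as in the example that follows, Q_n(C) equals 1 for the observed set and 0 for all others.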

Imagine, however, that you actually have data on which alternatives were considered. This could be the case if the respondent indicated which alternatives (s)he considered when making the choice, or if you have external data revealing this “consideration choice set” (for example, eye movement data showing which alternatives the individual was comparing).

In my example, I coded a variable named CS, which is a dummy variable, taking the value 1 if the alternative was considered, and zero otherwise. I will be using the same data example as in this previous post.

To recap the previous example, there are eleven observations in which the individual has three alternatives and chose alternative one six times, alternative two four times and alternative three once. In the previous post, I treated these as choices made by the same individual, hence calling them “repeated” choices. In this example, I am using the exact same data but assuming these are the choices of eleven different individuals (hence eleven observations, one per individual).

Compared to the previous post, there are three additional columns: CS.1, CS.2 and CS.3. These are dummies indicating whether the respondent “considered” this alternative or not. If CS.1=1, then the individual actually considered alternative 1 when making his/her choice.

The data is in wide format. Here’s what it looks like:

Choice CS.1 CS.2 CS.3 Alt1.Att1 Alt2.Att1 Alt3.Att1 Alt1.Att2 Alt2.Att2 Alt3.Att2 Alt1.Att3 Alt2.Att3 Alt3.Att3 Alt1.Att4 Alt2.Att4 Alt3.Att4
1 1 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
1 1 0 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
1 1 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
1 1 1 0 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
1 1 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
1 1 1 0 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
2 1 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
2 0 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
2 1 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
2 1 1 0 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%
3 0 1 1 0 500 1000 500 350 250 -5 -2 0 50% 25% 0%

How do I know it is wide format? Each row represents one choice and as you can see, each attribute has several columns. Alt1.Att1, Alt2.Att1, and Alt3.Att1 are the attribute levels for Attribute 1 and alternatives 1, 2 and 3, respectively.
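If you want to follow along without the original file, the table above could be rebuilt by hand roughly as follows. This is only a sketch: the values are copied from the table, and storing Att4 as text (for example "50%") is an assumption about how that column was recorded.

```r
# Sketch: rebuild the wide-format data from the table above.
# Values are copied from the table; storing Att4 as text is an assumption.
DATA.wide <- data.frame(
  Choice = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3),
  CS.1   = c(1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0),
  CS.2   = c(1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  CS.3   = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1),
  # Attribute levels are identical in every row, so scalars are recycled.
  Alt1.Att1 = 0,     Alt2.Att1 = 500,   Alt3.Att1 = 1000,
  Alt1.Att2 = 500,   Alt2.Att2 = 350,   Alt3.Att2 = 250,
  Alt1.Att3 = -5,    Alt2.Att3 = -2,    Alt3.Att3 = 0,
  Alt1.Att4 = "50%", Alt2.Att4 = "25%", Alt3.Att4 = "0%"
)

dim(DATA.wide)           # 11 rows (one per choice), 16 columns
table(DATA.wide$Choice)  # how often each alternative was chosen
```

Each row is one choice, and every attribute occupies one column per alternative, which is exactly what makes this layout "wide".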

We need the data in long format. I can easily transform wide data into long data with the mlogit.data function. It comes from the mlogit package, whose data format the gmnl package builds on.

install.packages("gmnl")

library(gmnl)

Before changing the data format, I need to rename the columns. The function I will be using needs to be able to tell the attribute name apart from the alternative it belongs to. My attribute names are Att1, Att2, Att3 and Att4. Here I change the names of the columns of my data frame:

# Column names are a character vector, so use c() rather than list()
colnames(DATA.wide) <- c("Choice",
                         "CS.1","CS.2","CS.3",
                         "Att1.1","Att1.2","Att1.3",
                         "Att2.1","Att2.2","Att2.3",
                         "Att3.1","Att3.2","Att3.3",
                         "Att4.1","Att4.2","Att4.3")

As an example, Att4.1 means that this is the level of attribute 4 (Att4) for Alternative 1. Now, I use the mlogit.data function to convert my data into long format:

DATA.long<-mlogit::mlogit.data(DATA.wide,
                               choice="Choice",
                               varying = 2:16,
                               shape="wide",
                               sep=".")

The choice argument tells the function which column holds the individual’s choice, here the “Choice” column. The varying argument gives the positions of the alternative-specific columns, that is, the attribute levels and the choice set dummies. In this case, they are in columns 2 to 16. The shape argument states which format the data is currently in, which is wide in my case. The sep argument tells the function which character separates the attribute name from the alternative index. For example, Att4.1 has a full stop between the attribute name “Att4” and the alternative index, which is 1.
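To see what the conversion does without relying on a particular mlogit version, the same reshaping can be sketched in base R. This is not what mlogit.data does internally. It is only an illustration using stats::reshape on a two-row toy data frame, and the object names toy.wide and toy.long are made up for this example.

```r
# Toy wide data: two choice occasions, two alternatives, one attribute.
toy.wide <- data.frame(
  Choice    = c(1, 2),
  CS.1      = c(1, 0),
  CS.2      = c(1, 1),
  Alt1.Att1 = c(0, 0),
  Alt2.Att1 = c(500, 500)
)

# Turn "Alt1.Att1" into "Att1.1" so that the alternative index comes last.
names(toy.wide) <- sub("^Alt([0-9]+)\\.(Att[0-9]+)$", "\\2.\\1", names(toy.wide))

# Stack the alternative-specific columns: one row per occasion and alternative.
toy.long <- reshape(toy.wide,
                    direction = "long",
                    varying   = names(toy.wide)[-1],  # everything except Choice
                    sep       = ".",
                    timevar   = "alt",                # alternative index
                    idvar     = "chid")               # choice occasion id (created)

# Order rows by choice occasion, then flag the chosen alternative.
toy.long <- toy.long[order(toy.long$chid, toy.long$alt), ]
toy.long$Choice <- toy.long$Choice == toy.long$alt
rownames(toy.long) <- NULL
toy.long[, c("Choice", "alt", "CS", "Att1", "chid")]
```

The regular expression also shows a way to do the renaming step programmatically instead of typing all the column names by hand.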

This is what DATA.long looks like:

Choice alt CS Att1 Att2 Att3 Att4 chid
TRUE 1 1 0 500 -5 50 % 1
FALSE 2 1 500 350 -2 25 % 1
FALSE 3 1 1000 250 0 0 % 1
TRUE 1 1 0 500 -5 50 % 2
FALSE 2 0 500 350 -2 25 % 2
FALSE 3 1 1000 250 0 0 % 2
TRUE 1 1 0 500 -5 50 % 3
FALSE 2 1 500 350 -2 25 % 3
FALSE 3 1 1000 250 0 0 % 3

I just pasted the first nine rows to illustrate. In fact, this data frame now has 33 rows.

Now, I can finally analyze the data by running a simple linear regression.

lm.model <- lm(Choice ~ Att1 + Att2,
               data=DATA.long)

Finally, I want to keep only the alternatives that the individual actually considered. To do so, I simply restrict the dataset used to estimate the parameters to the rows where CS equals 1.

lm.model.CS <- lm(Choice ~ Att1 +Att2 , 
                  data=DATA.long[DATA.long$CS==1,])

I can look at the estimated parameters and see the differences by typing:

summary(lm.model)

summary(lm.model.CS)
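Because the attribute levels never vary across individuals in this example, there are only three distinct attribute combinations. A linear model with an intercept and two slopes therefore fits them exactly, and its fitted values are simply the share of rows in which each alternative was chosen. The sketch below rebuilds the long data in base R (values copied from the tables above, so it runs without mlogit) and compares those shares with and without the non-considered alternatives.

```r
# Sketch: rebuild the long data in base R (values copied from the tables above).
choice <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3)   # chosen alternative per individual
DATA.long <- data.frame(
  chid = rep(1:11, each = 3),
  alt  = rep(1:3, times = 11),
  Att1 = rep(c(0, 500, 1000), times = 11),
  Att2 = rep(c(500, 350, 250), times = 11),
  # CS.1, CS.2, CS.3 for each individual, one triple per row of the wide table
  CS   = c(1,1,1, 1,0,1, 1,1,1, 1,1,0, 1,1,1, 1,1,0,
           1,1,1, 0,1,1, 1,1,1, 1,1,0, 0,1,1)
)
DATA.long$Choice <- DATA.long$alt == rep(choice, each = 3)

# as.numeric() makes the 0/1 coding of the logical response explicit.
lm.model    <- lm(as.numeric(Choice) ~ Att1 + Att2, data = DATA.long)
lm.model.CS <- lm(as.numeric(Choice) ~ Att1 + Att2,
                  data = DATA.long[DATA.long$CS == 1, ])

# Fitted values at the attribute levels of alternatives 1, 2 and 3.
alts <- data.frame(Att1 = c(0, 500, 1000), Att2 = c(500, 350, 250))
shares.all <- predict(lm.model, newdata = alts)     # 6/11, 4/11, 1/11
shares.CS  <- predict(lm.model.CS, newdata = alts)  # 6/9, 4/10, 1/8
round(cbind(all = shares.all, considered = shares.CS), 3)
```

The denominators change from 11 to the number of individuals who actually considered each alternative (9, 10 and 8), which is where the difference between the two sets of estimates comes from.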

Unfortunately, this data cannot be analyzed using the gmnl conditional logit function, because there aren’t always 3 alternatives being considered in each choice occasion. To estimate a conditional logit model, I need to write my own log likelihood function. I will do this in a future post.
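To give a rough idea of what such a log likelihood function could look like, here is a minimal sketch. It is not the implementation from the later post: cl_loglik is a hypothetical helper, not part of gmnl or mlogit, and rescaling the attributes by 1000 is an assumption made only to keep the optimizer well behaved. The idea is that the logit denominator sums over the considered alternatives only.

```r
# Hypothetical helper (not from gmnl or mlogit): conditional logit
# log likelihood where each choice occasion has its own consideration set.
cl_loglik <- function(beta, X, chosen, considered, chid) {
  v <- as.vector(X %*% beta)   # utility of every row (alternative)
  v[!considered] <- -Inf       # non-considered alternatives get probability zero
  # log of the logit denominator per choice occasion, computed stably
  log_denom <- tapply(v, chid, function(u) {
    m <- max(u)
    m + log(sum(exp(u - m)))
  })
  sum(v[chosen]) - sum(log_denom)
}

# Long data rebuilt from the tables above (base R only).
choice     <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3)
chid       <- rep(1:11, each = 3)
alt        <- rep(1:3, times = 11)
chosen     <- alt == rep(choice, each = 3)
considered <- c(1,1,1, 1,0,1, 1,1,1, 1,1,0, 1,1,1, 1,1,0,
                1,1,1, 0,1,1, 1,1,1, 1,1,0, 0,1,1) == 1
# Attributes divided by 1000 (an assumption, for numerical convenience).
X <- cbind(Att1 = rep(c(0, 500, 1000), times = 11),
           Att2 = rep(c(500, 350, 250), times = 11)) / 1000

# With beta = 0 every considered alternative is equally likely, so the
# log likelihood equals -(5 * log(3) + 6 * log(2)).
ll0 <- cl_loglik(c(0, 0), X, chosen, considered, chid)

# Maximize by minimizing the negative log likelihood.
fit <- optim(c(0, 0), function(b) -cl_loglik(b, X, chosen, considered, chid))
fit$par
```

The sketch assumes that the chosen alternative is always among the considered ones, which holds in this data. Setting the utility of non-considered alternatives to -Inf is just one convenient way to drop them from the denominator.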

 

References:

[1] Hicks, R. L., & Strand, I. E. (2000). The extent of information: its relevance for random utility models. Land Economics, 374-385.

[2] Leong, W., & Hensher, D. A. (2012). Embedding multiple heuristics into choice models: An exploratory analysis. Journal of Choice Modelling, 5(3), 131-144.
