From Sample to Population (part 1)

In this blog post I will be focusing on poststratification. This is the first of two posts on the topic; I will continue with Part 2 next week.

Poststratification entails assigning a weight to each observation in the sample so that, once weighted, the respondents are representative of the population of interest, and thus results can be generalized from the sample to the population.

By population of interest I am referring to the population from which the respondents are (randomly) drawn. For example, if the research question is how university students are affected by class attendance, the population of interest would be university students, and not faculty. If, however, you would like to see the effect of pensions on tourism choices, then your population of interest would probably be retired people. In my case, the population of interest is the general population in my county.

Once you have the weights, you can generalize a variable of interest to the population (e.g. the mean or median of Y), or use the weights in regressions to get an estimate of the effect you are studying that is not distorted by the composition of the sample.
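To make this concrete, here is a minimal sketch in base R of what "using the weights" can look like. The data frame `toy` and its columns `y`, `x` and `w` are made up for illustration; they are not my survey data.

```r
# Hypothetical data: y is the outcome, x a covariate,
# w a poststratification weight (all values invented).
toy <- data.frame(
  y = c(2, 5, 6, 9),
  x = c(1, 2, 3, 4),
  w = c(0.5, 0.5, 1.5, 1.5)
)

# Unweighted mean: every respondent counts equally.
unweighted_mean <- mean(toy$y)

# Weighted mean: respondents count in proportion to their weight.
weighted_mean <- weighted.mean(toy$y, toy$w)

# lm() accepts a weights argument; the coefficients are the
# weighted least squares estimates.
fit <- lm(y ~ x, data = toy, weights = w)
coef(fit)
```

One caveat on this sketch: `lm()` treats the weights as precision weights, so the coefficients are weighted as intended, but the reported standard errors are not design-based. Survey-aware tools handle that part.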

When using surveys, one can generalize from the sample to the population of interest, as long as the assumption of random sampling holds. Random sampling implies that “each individual has the same probability of being chosen at any stage during the sampling process”.[3] However, even if the individuals in the population are randomly drawn, the resulting sample might not always be representative of the population.

How can we see that? If you have access to the descriptive statistics of your population of interest, you can compare these with the sample’s descriptive statistics. This includes mean, median, and quartiles, for example.

In my case, I calculated the average age for the sample. The mean age of my sample is 47.28 years. According to official statistics, the mean age of my population of interest is 37.62 years. In other words, my sample is on average roughly 10 years older than the population of interest. This is okay as long as age does not affect my Y variable, i.e. the variable of interest.
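A comparison like this can be sketched in a few lines of R. The two means are the ones quoted above; the vector `ages` is invented purely to show the mechanics and is not my data.

```r
# Means quoted in this post.
sample_mean_age     <- 47.28
population_mean_age <- 37.62
age_gap <- sample_mean_age - population_mean_age
age_gap

# With the raw ages at hand, summary() gives the mean, median and quartiles,
# and a one-sample t-test checks whether the gap to the population mean
# is larger than what sampling noise alone would produce.
ages <- c(23, 35, 41, 47, 52, 58, 63, 66)  # hypothetical respondents
summary(ages)
age_test <- t.test(ages, mu = population_mean_age)
age_test$p.value
```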

I ran a regression adding several characteristics to explain my Y variable.

> summary(lm(Y ~ age + gender + education + household_size + income, data = DATA))

Coefficients:
                     Estimate   Std. Error  t value  Pr(>|t|) 
(Intercept)          8.334719   3.065339    2.719    0.00667 **
age                  0.054304   0.037076    1.465    0.14334 
gender              -0.323864   1.299788   -0.249    0.80329 
education            0.001417   0.037609    0.038    0.96995 
household_size       0.059084   0.044114    1.339    0.18077 
income              -0.022062   0.017726   -1.245    0.21358

Good news! None of the sample characteristics is a statistically significant predictor of my Y variable (every p-value is above 0.1). This suggests that, even though the sample is not perfectly representative, I can still analyze my Y variable and generalize it to the population without weights.

However, if any over- or under-represented characteristics had an impact on Y, then using weights, i.e. poststratification, is highly advisable.

Why are samples not representative?

Samples can be biased for several reasons. One example is the survey administration mode, which I will be illustrating here.

Surveys can be administered face-to-face, over the phone, by mail or through the internet. Some surveys use a single mode, others use a mixture (e.g. both mail and phone interviews). Each has advantages and disadvantages. The choice of survey administration mode may affect the way respondents answer and how they interpret questions.[4] Face-to-face interviews are the “gold standard” in some sense, since the interviewer may clarify questions, and the respondent may be motivated to answer truthfully if there is a trusting relationship with the interviewer.[4] But face-to-face interviews can be quite expensive, whereas mail or internet-based surveys are relatively cheap but cognitively more demanding for the respondent. This is summarized in the graph below.

[Figure: survey administration modes compared]

The internet has become extremely popular for administering surveys.[4] These kinds of surveys can be quite representative of the population of interest when they use web-panels that recruit respondents in such a way that the panel itself is representative.

Unfortunately, it is quite common for the sample in internet surveys not to be representative. On average, respondents in these web-panels are more educated and younger than the population they are drawn from. Moreover, respondents with internet access and those who are more interested in the subject of the survey are more likely to respond.[1] Regarding the former, limited internet access is not a concern in some countries, where the vast majority (95% and up) of the population uses the internet.[2] In other countries, however, the general population does not use the internet as heavily, and internet-based panels may yield non-representative samples.

How to do poststratification

Basically, poststratification creates sampling weights to adjust for the fact that a stratum of my sample is under- or over-represented.
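For a single categorical variable, the idea can be sketched as follows: each stratum gets the weight "population share divided by sample share". The sample composition and the population shares in `pop_share` are assumptions made up for this illustration.

```r
# Hypothetical sample: 6 women and 4 men.
gender <- factor(c(rep("female", 6), rep("male", 4)))

# Assumed population shares (invented for this example).
pop_share <- c(female = 0.5, male = 0.5)

# Share of each stratum in the sample.
sample_share <- prop.table(table(gender))

# Weight per stratum: population share / sample share.
# Over-represented strata get a weight below 1, under-represented ones above 1.
stratum_weight <- pop_share[names(sample_share)] / as.numeric(sample_share)

# Attach the stratum weight to every respondent.
w <- as.numeric(stratum_weight[as.character(gender)])

# Check: the weighted shares now match the population shares.
weighted_share <- tapply(w, gender, sum) / sum(w)
weighted_share
```

In this toy sample women are over-represented, so each woman gets a weight below 1 and each man a weight above 1, while the weights still sum to the sample size.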

According to [5], weighting has three stages:

  1. Each respondent (or nonrespondent) is given a base weight;
  2. Nonresponse is compensated by identifying respondents who can represent nonrespondents;
  3. The base weights are adjusted so that the sample characteristics fit the population characteristics.

There are different methods to do this, such as raking, generalised regression estimation, logistic regression modelling and combinations of weighting cell methods.[5]

The survey package I will be illustrating uses the raking method. This method is “an iterative proportional fitting procedure”.[5, 6] These methods require poststratification by using categorical variables, e.g. a variable taking only the values 1, 2 and 3. Gender works quite well: it is either “female” or “male”. Age can also be used to adjust weights, as long as categories are created, instead of using age as a continuous variable.
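Turning a continuous age into categories is a one-liner with base R's `cut()`. The ages and the break points below are arbitrary choices for illustration; in practice the categories should match those available in the population statistics.

```r
# Hypothetical ages of seven respondents.
ages <- c(19, 34, 35, 50, 64, 65, 80)

# right = FALSE makes each interval closed on the left, e.g. [35, 50),
# so a 35-year-old falls into "35-49" and not into "18-34".
age_group <- cut(
  ages,
  breaks = c(18, 35, 50, 65, Inf),
  labels = c("18-34", "35-49", "50-64", "65+"),
  right  = FALSE
)

table(age_group)
```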

If I have two variables to adjust my weights (e.g. variables A and B), the raking method first adjusts the base weights to correct for variable A, then re-adjusts these weights to correct for variable B, and goes back to variable A. The raking method does this until the weights converge.[5]
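The back-and-forth described above can be sketched in base R. This is a bare-bones illustration of iterative proportional fitting, not the implementation used by the survey package; the function name `rake_weights`, the sample `smp` and the target shares are all hypothetical.

```r
# Hypothetical sample with two categorical variables, A and B.
smp <- data.frame(
  A = c("a1", "a1", "a1", "a2", "a2", "a1", "a2", "a1"),
  B = c("b1", "b2", "b1", "b1", "b2", "b1", "b2", "b2")
)

# Assumed population shares for each variable (invented).
targets <- list(
  A = c(a1 = 0.5, a2 = 0.5),
  B = c(b1 = 0.4, b2 = 0.6)
)

rake_weights <- function(data, targets, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))  # base weights: everyone starts at 1
  for (iter in seq_len(max_iter)) {
    w_old <- w
    for (v in names(targets)) {
      groups  <- as.character(data[[v]])
      # Current weighted share of each category of variable v.
      current <- tapply(w, groups, sum) / sum(w)
      # Scale each respondent's weight by target share / current share.
      w <- as.numeric(w * targets[[v]][groups] / current[groups])
    }
    # Stop once a full pass over all variables barely changes the weights.
    if (max(abs(w - w_old)) < tol) break
  }
  w
}

w <- rake_weights(smp, targets)

# Weighted margins after raking; these should be close to the targets.
share_A <- tapply(w, smp$A, sum) / sum(w)
share_B <- tapply(w, smp$B, sum) / sum(w)
```

Each adjustment for one variable slightly disturbs the margin of the other, which is why the loop has to cycle through the variables several times before the weights settle.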

In the next blog post I will illustrate all these concepts using R scripts.

 

References: 

[1] Supan, A., Elsner, D., Faßbender, H., Kiefer, R., McFadden, D., & Winter, J. (2004). How to make internet surveys representative: A case study of a two-step weighting procedure. Manuscript, Mannheim, Germany: Mannheim University.

[2] https://www.fn.no/Statistikk/Internettbrukere

[3] https://en.wikipedia.org/wiki/Simple_random_sample

[4] Lindhjem, H., & Navrud, S. (2011). Are Internet surveys an alternative to face-to-face interviews in contingent valuation? Ecological Economics, 70(9), 1628-1637.

[5] Kalton, G., & Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics, 19(2), 81.

[6] https://en.wikipedia.org/wiki/Iterative_proportional_fitting
