From Sample to Population (part 2)

On part 1 (“From Sample to Population”) I talked about poststratification. In summary, when using surveys, sometimes it is useful to generalize results from the sample to a broader population. However, it is not always possible to generalize. In some cases, the sample we have obtained is biased towards some characteristics. For example, the sample I have is older and more educated than the population.

In such circunstances it is useful to use weights to make the sample representative, and hence be able to generalize sample results to the population. To do this, I use the survey package in R.

Before actually writing the code, I had to do some homework. If you remember from my previous post, I summarized five characteristics from my respondents: age, gender, education, household size and income. I found official statistics and found that my sample is ten years older than my population. In your case, you should try to find whatever statistics you can from an official source, such as the national statistics office or international organizations, such as the OECD.

The variable age in the DATA dataframe is the column where the age of each respondent is stored. But age is a continuous variable, and I cannot create weights if I would use age as it is. Instead, I found the age distribution by categories using the official statistics [1]:

Age Frequency
0-24 32.88%
25-34 14.13%
35-44 14.10%
45-69 29.11%
70-79 6.27%
80+ 3.49%

With the frequencies from the table above, first I am going to create a variable called AGE_CLASS, which takes the values 1 to 6 depending on the age variable.

DATA$AGE_CLASS[DATA$age<=24]                <- 1  
DATA$AGE_CLASS[DATA$age>=25 & DATA$age<=34] <- 2 
DATA$AGE_CLASS[DATA$age>=35 & DATA$age<=44] <- 3  
DATA$AGE_CLASS[DATA$age>=45 & DATA$age<=69] <- 4  
DATA$AGE_CLASS[DATA$age>=70 & DATA$age<=79] <- 5  
DATA$AGE_CLASS[DATA$age>=80]                <- 6

Note: The straight brackets [ ] are used to subset the age variable.

So, now I have the AGE_CLASS variable, and its frequencies are in the third column of the table below.

Age Frequency (Official Statistics) Frequency (AGE_CLASS)
0-24 32.88% 11.19%
25-34 14.13% 17.72%
35-44 14.10% 17.31%
45-69 29.11% 41.14%
70-79 6.27% 11.71%
80+ 3.49% 0.93%

As you can see, there is a disparity between the official statistics (in the second column) and my sample (third sample).

Second, we now use the survey package in R. Let’s start by installing it, as usual, and then using the library command.



Most of what I am referring to in the remainder of this post is very well summarized in [2]. The function to use is svydesign. I create a survey object using my DATA but without any weights. I will call this object DATA.w:

DATA.w <- svydesign(ids = ~1, data = DATA)

I just set ids = ~1. This option specifies the cluster. Type help(svydesign) in R if you want further explanation about these options.

R gives you a warning message:

Warning message:
In svydesign.default(ids = ~1, data = DATA) :
No weights or probabilities supplied, assuming equal probability

This is fine, it just warns you that every observation has a weight of 1, which is what we want at this stage.

To supply the weighting function with the proper arguments, I need to tell it how to properly weight my AGE_CLASS variable. Now, AGE_CLASS has the frequencies of the third column of the table above, but I want these to be the frequencies of the second column of the table when weighted. So I just create a data frame “age.dist” with this information:

age.dist <- data.frame(AGE_CLASS = c("1", "2","3","4","5","6"), 
                       Freq = nrow(DATA) * 
                              c(0.3289, 0.1413, 0.141,0.2912,0.0627,0.0349))

As you can see, the frequencies in the last vector I typed are the same as the frequencies in the second column of the table above.

Now we have the two things we need: an unweighted survey object (DATA.w) and the dataframe age.dist which dictates how to create weights. I can now use the rake function in the survey package to create my weights. It needs at least three arguments: the unweighted survey object (that I called DATA.w), the variable from my sample that is biased (i.e. the AGE_CLASS variable in my case), and the dataframe object that says how to weight the AGE_CLASS variable (that I called age.dist previously).

DATA.rake <- rake(design = DATA.w, 
                  sample.margins = list(~AGE_CLASS), 
                  population.margins = list(age.dist))

Again, if you want further details about the rake function, just type help(rake) in R (provided you have installed it previously).

Now I have created a survey object called DATA.rake that has the weights. I can access the weights by using the function summary:

> summary(weights(DATA.rake))
Min.      1st Qu.     Median     Mean       3rd Qu.    Max. 
0.5354    0.7078      0.7078     1.00       0.8148     3.7421

So the minimum weight given to an observation is 0.5354, and the maximum is 3.74. The histogram looks like this:

weights hist

In fact, because I only use the variable AGE_CLASS to create weights, which has six categories, the resulting weights take only six values: 0.53, 0.71, 0.80, 0.81, 2.94, or 3.74. These values are in the horizontal axis of the histogram. If I were to use more variable to create weights, then the weights would take more distinct values.

And to access all the DATA including the weights, I can type:


Note: I can use more than one characteristic for creating weights, as long as it’s categorical and not continuous. For example, if I wanted to use gender to create weights:

DATA.rake <- rake(design = DATA.w, 
                  sample.margins = list(~gender,~AGE_CLASS), 
                  population.margins = list(gender.dist, age.dist))

Now, to wrap up nicely, I will show how to use the weights to calculate means and other descriptive statistics that might prove useful.

Remember that my age variable was biased:

[1] 47.2829

I can use the weights to calculate the weighted mean of age in two ways. The first uses the weighted.mean function…

weighted.mean(DATA$age, weights(DATA.rake))
[1] 40.71522

Or I can use a function in the survey package called svymean

> svymean(~age, DATA.rake)
           mean         SE
age     40.715       0.1436

These give exactly the same result, as it should.

I can also calculate descriptive statistics for other variables, such as gender, education level, or income. For example, I want to check that gender is still representative after using weights based solely on age. Then I can write:

> mean(MNA63240$gender)
[1] 0.543005
> svymean(~gender, MNA.svy.rake)
         mean     SE
gender   0.6063   0.0181

So the mean of the gender variable is now slightly more biased towards women than before. In such a case, it might be good to go back and add gender as well to calculate the weights. Whether this is necessary or not, it all depends on the objectives of your research.

This concludes the poststratification series. In case you have any questions or comments, feel free to leave a comment below.







Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s