Strategies for Choosing the Reference Category in Dummy Coding

Every statistical software procedure that dummy codes predictor variables uses a default for choosing the reference category.

This default is usually the category that comes first or last alphabetically.

That may or may not be the best category to use, but fortunately you’re not stuck with the defaults.

So if you do choose, which one should you choose?

The first thing to remember is that ultimately, it doesn’t really matter, as long as you are aware of which category is the reference. The model fit, the predicted values, and the overall test for the variable are the same no matter what you choose. It’s just that the specific comparisons that the software reports (and gives you p-values for) will differ.

So it’s best to choose a category that makes interpretation of results easier. Here are a few common options for choosing a category.

Remember, the regression coefficients will give you the difference in means (and/or slopes if you’ve included an interaction term) between each other category and the reference category.
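Here is a minimal sketch of that idea in Python. It assumes a model with a single dummy-coded predictor and no other terms, in which case least squares simply reproduces the group means, so the coefficients can be computed directly from those means. The data values and the function names `dummy_coefficients` and `fitted_means` are made up for illustration.

```python
from statistics import mean

def dummy_coefficients(groups, values, reference):
    """Return (intercept, {category: coefficient}) for one dummy-coded predictor.

    With a single categorical predictor, least squares reproduces the group
    means, so the intercept is the reference mean and each coefficient is
    that category's mean minus the reference mean.
    """
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    means = {g: mean(vs) for g, vs in by_group.items()}
    intercept = means[reference]
    coefs = {g: m - intercept for g, m in means.items() if g != reference}
    return intercept, coefs

def fitted_means(intercept, coefs, reference):
    """Predicted mean for every category implied by the coefficients."""
    fitted = {reference: intercept}
    for g, b in coefs.items():
        fitted[g] = intercept + b
    return fitted

# Hypothetical data: poverty status and an outcome Y.
status = ["Not In Poverty"] * 4 + ["In Poverty"] * 2
y = [12.0, 14.0, 13.0, 13.0, 9.0, 11.0]

b0_a, coefs_a = dummy_coefficients(status, y, reference="Not In Poverty")
b0_b, coefs_b = dummy_coefficients(status, y, reference="In Poverty")

print(b0_a, coefs_a)  # intercept is the Not In Poverty mean
print(b0_b, coefs_b)  # intercept is the In Poverty mean, sign of the difference flips
print(fitted_means(b0_a, coefs_a, "Not In Poverty"))
print(fitted_means(b0_b, coefs_b, "In Poverty"))
```

Switching the reference changes the intercept and flips the sign of the coefficient, but the predicted mean for each group stays the same.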

Strategy 1: Use the normative category

In many cases, the most logical or important comparisons are to the most normative group. For example, in one data set I analyzed, an important dummy-coded predictor was Poverty Status: In Poverty or Not In Poverty.

Not In Poverty is the norm: most people aren’t in poverty (at least in this data set; it may not be true in the population you’re studying). The interesting comparison is to see how people in poverty differ from this normative group. So making Not In Poverty the reference group just makes sense.

Likewise, another example is Marital Status: Never Married, Currently Married, Divorced, Separated, or Widowed.

If your software defaults to the last category alphabetically, Widowed becomes the reference group. But comparing Separated people to Widowed people isn’t very interesting, because both are small groups in the data set. The most interesting comparisons are with the normative categories of Never Married or Currently Married.

In experiments or randomized controlled trials, the control group is a natural normative category. The only exception I can think of is a study with multiple controls, but only one intervention or treatment group. In that case, it may be more important to measure any differences between the treatment and each control.

Strategy 2: Use the largest category

The other problem with using the Widowed group as the reference is that it’s very small. When sample sizes are very unequal across groups, which is common for naturally occurring groups, using a small group as the reference can become problematic: every coefficient is a comparison to that group, so every comparison inherits the imprecision of its mean.

Sometimes, if there isn’t a normative group in a logical sense, it makes sense to just use the largest category as the reference.
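To see why group size matters, here is a rough sketch. The group counts and the residual standard deviation of 5 are invented, and the standard error formula assumes a model with only this one predictor and equal error variance across groups. The helper names `largest_category` and `se_of_difference` are hypothetical.

```python
from collections import Counter
from math import sqrt

def largest_category(labels):
    """Return the most frequent category, a candidate reference group."""
    return Counter(labels).most_common(1)[0][0]

def se_of_difference(n_ref, n_other, residual_sd):
    """Standard error of a dummy coefficient in a one-predictor model,
    assuming equal error variance across groups."""
    return residual_sd * sqrt(1 / n_ref + 1 / n_other)

# Hypothetical group sizes; the residual SD of 5 used below is also made up.
counts = {"Never Married": 300, "Currently Married": 500, "Divorced": 120,
          "Separated": 40, "Widowed": 25}
labels = [g for g, n in counts.items() for _ in range(n)]

reference = largest_category(labels)
print(reference)

# Compare the precision of the Divorced coefficient under two references.
for ref in ("Currently Married", "Widowed"):
    se = se_of_difference(counts[ref], counts["Divorced"], residual_sd=5.0)
    print(f"Divorced vs {ref}: SE = {se:.2f}")
```

The small reference group contributes a large 1/n term to every comparison, so all of the reported coefficients come out less precise.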

Strategy 3: Use the category whose mean is in the middle, or conversely, at one of the ends

Sometimes both of these options fail. There is no obvious norm and sample sizes are similar.

In those cases, sometimes the best thing to do is to pick the category with the lowest, the highest, or the middle mean. Let me give you an example.

Let’s say those 5 marital categories have means on Y of

10 Never Married

11 Currently Married

9 Divorced

15 Separated

19 Widowed

If the overall F test in the ANOVA table is significant for this variable, you can usually expect the highest and lowest means to be significantly different. You just don’t know which of the middle three are significantly different from each of those.

For example, the middle value here is 11, the mean for Currently Married folks. If you use that as the reference group and discover that it is significantly lower than 15, the mean for Separated folks, and 19, the mean for Widowed, you know that both 9 for Divorced and 10 for Never Married should be significantly lower than those two means as well. (Note: this doesn’t always hold if some groups have much smaller sample sizes, but as long as they’re reasonably equal, it should.)

You won’t know, for example, if there is a significant difference between the means for the Separated and Widowed groups, but if that’s not a theoretically important comparison, you’re done.
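If you want to pick the middle category programmatically, something like the following sketch would do it. It uses the five example means from this section; the function name `middle_category` is just for illustration, and in practice you would compute the means from your own data first.

```python
def middle_category(means):
    """Return the category whose mean is the median of the group means.

    With an even number of categories this picks the upper of the two
    middle ones, which is an arbitrary choice.
    """
    ordered = sorted(means, key=means.get)
    return ordered[len(ordered) // 2]

# Group means on Y from the marital status example.
means = {"Divorced": 9, "Never Married": 10, "Currently Married": 11,
         "Separated": 15, "Widowed": 19}

reference = middle_category(means)

# Each dummy coefficient is that category's mean minus the reference mean.
differences = {g: m - means[reference] for g, m in means.items() if g != reference}

print(reference)
print(differences)

# The ends are just as easy to find if one of those suits you better.
print(min(means, key=means.get), max(means, key=means.get))
```

With the middle category as the reference, the coefficients split into negative ones for the lower groups and positive ones for the higher groups, which makes the output easy to scan.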

This particular strategy doesn’t always work, but you can use it to your advantage when it does.
