Dummy Variables


A dummy variable is a numerical variable used in regression analysis to represent subgroups of the respondents in your sample. In research design, a dummy variable is often used to distinguish groups based on a given characteristic. In the simplest case, we would code a dummy variable 0 or 1: a person is given a value of 1 if they are in the group we wish to examine and a 0 if they are in the reference group. For example, say we want to examine the effect of being female on earned income. We would code all females 1 and all males (the reference group) 0.

 
Dummy variables are useful because they enable us to use a single regression equation to represent multiple groups rather than writing a separate equation for each subgroup. Dummy variables also allow us to use nominal and ordinal variables in regression techniques. (Normally, using nominal or ordinal variables in regression violates a major assumption of regression analysis, because regression treats a variable's numeric codes as interval-level measurements.)

To use a nominal variable, we would construct a new set of dummy variables, one for each response value except one (the omitted category). For instance, let's say we collected data on race, with response values of (1) White, (2) Black, (3) Hispanic, and (4) Other. We would compute a new variable representing White (coded 1), with all others coded 0. A second variable would be computed with Black coded 1 and all others coded 0. Hispanics would be coded 1 on the third variable, and all others coded 0. Other would be the omitted category; it is represented by a value of 0 on all three newly computed variables (White, Black, and Hispanic). Sample SPSS statements (which follow the data list command) for this process are given below:

COMPUTE WHITE = 0.
IF (RACE EQ 1) WHITE = 1.
COMPUTE BLACK = 0.
IF (RACE EQ 2) BLACK = 1.
COMPUTE HISPANIC = 0.
IF (RACE EQ 3) HISPANIC = 1.
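{Because every respondent starts at 0 on each variable, this method also codes respondents who did not answer or answered "don't know" as 0 on all three variables, which groups them with Other. Alternative methods are available to avoid this.}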
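
One such alternative can be sketched as follows. This is an illustration in Python rather than SPSS, and it is not necessarily the method the note has in mind. It assumes that a missing or "don't know" answer is stored as None and that only codes 1 through 4 are valid. The names make_dummies, CATEGORIES, and VALID_CODES are hypothetical. Respondents without a valid answer are left missing on all three dummies instead of being counted as Other.

```python
# Codes from the race example: 1 White, 2 Black, 3 Hispanic, 4 Other (omitted category).
# Assumption of this sketch: None stands for "did not answer" or "don't know".
CATEGORIES = {"WHITE": 1, "BLACK": 2, "HISPANIC": 3}
VALID_CODES = {1, 2, 3, 4}


def make_dummies(race):
    """Return 0/1 dummies for a valid race code, or missing (None) dummies otherwise."""
    if race not in VALID_CODES:
        # Keep the case missing rather than letting it fall into the omitted category.
        return {name: None for name in CATEGORIES}
    return {name: int(race == code) for name, code in CATEGORIES.items()}


responses = [1, 2, 3, 4, None]
rows = [make_dummies(r) for r in responses]
for race, dummies in zip(responses, rows):
    print(race, dummies)
```

With this coding, a respondent coded Other is 0 on all three variables, while a respondent with no valid answer is missing on all three and can be excluded from the regression.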

Another advantage of a dummy variable coded 0,1 is that, even though it is a nominal-level variable, it can be treated like an interval-level variable in statistical analysis. (If this makes no sense, you should refresh your memory on levels of measurement.) For instance, if you calculate the mean of a 0,1 variable, the result is the proportion of cases coded 1.
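
The mean-equals-proportion property is easy to check by hand. The short Python sketch below uses a made-up sample of eight respondents; the data and the variable names are invented purely for illustration.

```python
# Made-up sample: 1 = female, 0 = male (the reference group).
female = [1, 0, 0, 1, 1, 0, 1, 1]

# Mean of the 0/1 variable: the sum of the codes divided by the number of cases.
mean_female = sum(female) / len(female)

# Proportion of cases coded 1, computed directly.
proportion_female = female.count(1) / len(female)

print(mean_female, proportion_female)  # 0.625 0.625
```

Summing a 0,1 variable simply counts the cases coded 1, which is why the two numbers match.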

Whenever you calculate a regression model with dummy variables, you can see how the variables represent multiple subgroup equations by following two steps:

  • create separate equations for each subgroup by substituting the dummy values
  • find the difference between groups by finding the difference between their equations
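
The two steps can be sketched in Python for the income example. The model form (income predicted from years of education plus a female dummy) and the coefficient values B0, B1, and B2 are assumptions invented for this illustration; they are not estimates from any data.

```python
# Hypothetical fitted model: income = B0 + B1*education + B2*female
# The coefficients are invented for illustration, not real estimates.
B0, B1, B2 = 20000, 1500, -4000


def predict(education, female):
    """Predicted income from the single regression equation."""
    return B0 + B1 * education + B2 * female


# Step 1: substitute the dummy values to get one equation per subgroup.
def male_equation(education):
    # female = 0, so the equation reduces to: income = B0 + B1*education
    return predict(education, 0)


def female_equation(education):
    # female = 1, so the equation becomes: income = (B0 + B2) + B1*education
    return predict(education, 1)


# Step 2: take the difference between the two subgroup equations.
gaps = [female_equation(e) - male_equation(e) for e in (8, 12, 16)]
print(gaps)  # [-4000, -4000, -4000]
```

The difference equals B2 at every level of education: in a model of this form, the dummy coefficient shifts the intercept for the group coded 1 and leaves the slope unchanged.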