The recodes() function makes it easy to recode one or more variables in your data frame. The format is
newdata <- recodes(data=olddata, vars=variables, from=from_values, to=to_values)
Consider the following data set. Let's make the changes described below, one variable at a time.
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
1 | b | better | 20 | 15 | 12 | 1 |
2 | w | worse | 30 | 23 | 20 | 2 |
1 | a | same | 44 | 18 | 33 | 5 |
2 | b | same | 15 | 86 | 55 | 3 |
2 | w | better | 50 | 99 | 30 | 4 |
2 | h | worse | 99 | 35 | 100 | 5 |
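To follow along in R, the table above could be entered as a data frame. This is only a sketch: the name df matches the calls used in this document, but the column types (numeric sex, character race and outcome) are assumptions, since the source does not state them.

```r
# Assumed reconstruction of the example data; column types are a guess
df <- data.frame(
  sex     = c(1, 2, 1, 2, 2, 2),
  race    = c("b", "w", "a", "b", "w", "h"),
  outcome = c("better", "worse", "same", "same", "better", "worse"),
  Q1      = c(20, 30, 44, 15, 50, 99),
  Q2      = c(15, 23, 18, 86, 99, 35),
  age     = c(12, 20, 33, 55, 30, 100),
  rating  = c(1, 2, 5, 3, 4, 5),
  stringsAsFactors = FALSE   # keep race and outcome as character
)

# Inspect the structure before recoding
str(df)
```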
For sex, set 1 to “Male” and 2 to “Female”.
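Following the format given at the start, this change might be written as shown here. It is a sketch that assumes recodes() comes from the qacBase package and that the package is installed. The small data frame df_sex is a hypothetical stand-in holding only the sex column, so the snippet runs on its own.

```r
library(qacBase)  # assumption: recodes() is provided by the qacBase package

# Hypothetical stand-in containing only the sex column of the example data
df_sex <- data.frame(sex = c(1, 2, 1, 2, 2, 2))

# Map 1 to "Male" and 2 to "Female"
df_sex <- recodes(data = df_sex, vars = "sex",
                  from = c(1, 2),
                  to   = c("Male", "Female"))

print(df_sex$sex)
```

In the running example, the same call would be applied to df rather than df_sex.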
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
Male | b | better | 20 | 15 | 12 | 1 |
Female | w | worse | 30 | 23 | 20 | 2 |
Male | a | same | 44 | 18 | 33 | 5 |
Female | b | same | 15 | 86 | 55 | 3 |
Female | w | better | 50 | 99 | 30 | 4 |
Female | h | worse | 99 | 35 | 100 | 5 |
Recode race to “White” vs. “Other”.
df <- recodes(data=df, vars="race",
from=c("w", "b", "a", "h"),
to=c("White", "Other", "Other", "Other"))
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
Male | Other | better | 20 | 15 | 12 | 1 |
Female | White | worse | 30 | 23 | 20 | 2 |
Male | Other | same | 44 | 18 | 33 | 5 |
Female | Other | same | 15 | 86 | 55 | 3 |
Female | White | better | 50 | 99 | 30 | 4 |
Female | Other | worse | 99 | 35 | 100 | 5 |
Recode outcome to 1 (better) vs. 0 (not better).
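One way this recode might be expressed is sketched below. As before, it assumes recodes() comes from the qacBase package, and df_outcome is a hypothetical stand-in holding only the outcome column so that the snippet is self-contained.

```r
library(qacBase)  # assumption: recodes() is provided by the qacBase package

# Hypothetical stand-in containing only the outcome column of the example data
df_outcome <- data.frame(
  outcome = c("better", "worse", "same", "same", "better", "worse"),
  stringsAsFactors = FALSE
)

# "better" becomes 1; "worse" and "same" both become 0
df_outcome <- recodes(data = df_outcome, vars = "outcome",
                      from = c("better", "worse", "same"),
                      to   = c(1, 0, 0))

print(df_outcome$outcome)
```

In the running example, the same call would be applied to df rather than df_outcome.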
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
Male | Other | 1 | 20 | 15 | 12 | 1 |
Female | White | 0 | 30 | 23 | 20 | 2 |
Male | Other | 0 | 44 | 18 | 33 | 5 |
Female | Other | 0 | 15 | 86 | 55 | 3 |
Female | White | 1 | 50 | 99 | 30 | 4 |
Female | Other | 0 | 99 | 35 | 100 | 5 |
For Q1 and Q2, set values of 86 and 99 to missing.
df <- recodes(data=df, vars=c("Q1", "Q2"),
from=c(86, 99), to=NA)
#> Note: 'from' is longer than 'to', so 'to' was recycled.
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
Male | Other | 1 | 20 | 15 | 12 | 1 |
Female | White | 0 | 30 | 23 | 20 | 2 |
Male | Other | 0 | 44 | 18 | 33 | 5 |
Female | Other | 0 | 15 | NA | 55 | 3 |
Female | White | 1 | 50 | NA | 30 | 4 |
Female | Other | 0 | NA | 35 | 100 | 5 |
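For age, set values below 20 or above 90 to missing, values from 20 to 30 to “Younger”, values above 30 up to 50 to “Middle Aged”, and values above 50 up to 90 to “Older”.
You can use expressions in your from fields. When an expression is TRUE, the corresponding to value is applied. We will use the dollar sign ($) to represent the variable (age in this case). The symbols | and & mean OR and AND, respectively.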
df <- recodes(data=df, vars="age",
from=c("$ < 20 | $ > 90",
"$ >= 20 & $ <= 30",
"$ > 30 & $ <= 50",
"$ > 50 & $ <= 90"),
to=c(NA, "Younger", "Middle Aged", "Older"))
We can also write this as
df <- recodes(data=df, vars="age",
from=c("$ < 20", "$ <= 30", "$ <= 50", "$ <= 90", "$ > 90"),
to=c(NA, "Younger", "Middle Aged", "Older", NA))
This works because the criteria are checked from left to right. Once the age value for an observation meets a criterion that is TRUE, it is recoded, and it isn’t changed again by later criteria in the same recodes statement.
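The first-match rule described above can be imitated in base R. The sketch below is not the recodes source code; first_match_recode is a hypothetical helper written only to illustrate why the shorter set of conditions produces the same categories as the fully specified ranges.

```r
# Hypothetical helper, not part of any package. Conditions are applied left to
# right, and a value recoded by an earlier condition is never changed again.
# Values that match no condition are left as NA in this sketch.
first_match_recode <- function(x, conditions, to) {
  out  <- rep(NA_character_, length(x))
  done <- rep(FALSE, length(x))          # TRUE once a value has been recoded
  for (i in seq_along(conditions)) {
    hit <- conditions[[i]](x) & !done    # only values not yet recoded
    hit[is.na(hit)] <- FALSE             # treat missing comparisons as no match
    out[hit]  <- to[i]
    done[hit] <- TRUE
  }
  out
}

age <- c(12, 20, 33, 55, 30, 100)

result <- first_match_recode(
  age,
  conditions = list(
    function(a) a < 20,
    function(a) a <= 30,
    function(a) a <= 50,
    function(a) a <= 90,
    function(a) a > 90
  ),
  to = c(NA, "Younger", "Middle Aged", "Older", NA)
)

print(result)
```

Because 12 is claimed by the first condition, the later condition a <= 30 never touches it, even though it is also TRUE for that value.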
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
Male | Other | 1 | 20 | 15 | NA | 1 |
Female | White | 0 | 30 | 23 | Younger | 2 |
Male | Other | 0 | 44 | 18 | Middle Aged | 5 |
Female | Other | 0 | 15 | NA | Older | 3 |
Female | White | 1 | 50 | NA | Younger | 4 |
Female | Other | 0 | NA | 35 | NA | 5 |
Finally, for the rating variable, reverse the scoring so that 1 to 5 becomes 5 to 1.
df <- recodes(data=df, vars="rating", from=1:5, to=5:1)
sex | race | outcome | Q1 | Q2 | age | rating |
---|---|---|---|---|---|---|
Male | Other | 1 | 20 | 15 | NA | 5 |
Female | White | 0 | 30 | 23 | Younger | 4 |
Male | Other | 0 | 44 | 18 | Middle Aged | 1 |
Female | Other | 0 | 15 | NA | Older | 3 |
Female | White | 1 | 50 | NA | Younger | 2 |
Female | Other | 0 | NA | 35 | NA | 1 |
Remember that recodes returns a data frame, not a variable.
df <- recodes(data=df, vars="rating", from=1:5, to=5:1)
is correct.
df$rating <- recodes(data=df, vars="rating", from=1:5, to=5:1)
is not.
This allows you to apply the same recoding scheme to more than one variable at a time (e.g., Q1 and Q2 above).