I want to create a new variable that is equal to the value of one of two other variables, conditional on the values of still other variables. Here's a toy example with fake data.
Each row of the data frame represents a student. Each student can be studying up to two subjects (subj1 and subj2), and can be pursuing a degree ("BA") or a minor ("MN") in each subject. My real data includes thousands of students, several types of degree, and about 50 subjects, and students can have up to five majors/minors.
  ID subj1 degree1 subj2 degree2
1  1   BUS      BA  <NA>    <NA>
2  2   SCI      BA   ENG      BA
3  3   BUS      MN   ENG      BA
4  4   SCI      MN   BUS      BA
5  5   ENG      BA   BUS      MN
6  6   SCI      MN  <NA>    <NA>
7  7   ENG      MN   SCI      BA
8  8   BUS      BA   ENG      MN
...
Now I want to create a sixth variable, df$major, that equals the value of subj1 if subj1 is the student's primary major, or the value of subj2 if subj2 is the primary major. The primary major is the first subject with degree equal to "BA". I tried the following code:
df$major[df$degree1 == "BA"] = df$subj1
df$major[df$degree1 != "BA" & df$degree2 == "BA"] = df$subj2
Unfortunately, I got an error message:
> df$major[df$degree1 == "BA"] = df$subj1
Error in df$major[df$degree1 == "BA"] = df$subj1 :
  NAs are not allowed in subscripted assignments
I assume this means that a vectorized assignment can't be used if the assignment evaluates to NA for at least one row.
I feel like I must be missing something basic here, but the code above seemed like the obvious thing to do and I haven't been able to come up with an alternative.
In case it would be helpful in writing an answer, here's sample data, created using dput(), in the same format as the fake data listed above:
structure(list(ID = 1:20, subj1 = structure(c(3L, NA, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L ), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree1 = structure(c(2L, NA, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("BA", "MN"), class = "factor"), subj2 = structure(c(1L, 2L, NA, NA, 1L, NA, 3L, 2L, NA, 2L, 2L, 1L, 3L, NA, 2L, 1L, 1L, NA, 2L, 2L), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree2 = structure(c(2L, 2L, NA, NA, 2L, NA, 1L, 2L, NA, 2L, 1L, 1L, 2L, NA, 1L, 2L, 2L, NA, 1L, 2L), .Label = c("BA", "MN"), class = "factor")), .Names = c("ID", "subj1", "degree1", "subj2", "degree2"), row.names = c(NA, -20L), class = "data.frame")
Your original method of assignment is failing for at least two reasons.
1) A problem with the subscripted assignment df$major[df$degree1 == "BA"] <- . Using == can produce NA, which is what prompted the error. From ?"[<-": "When replacing (that is using indexing on the lhs of an assignment) NA does not select any element to be replaced. As there is ambiguity as to whether an element of the rhs should be used or not, this is only allowed if the rhs value is of length one (so the two interpretations would have the same outcome)." There are many ways to get around this, but I prefer using which:
df$major[which(df$degree1 == "BA")] <-
The difference is that == returns TRUE, FALSE and NA, while which returns only the indices of the elements that are TRUE:
> df$degree1 == "BA"
 [1] FALSE    NA  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> which(df$degree1 == "BA")
 [1]  3  4  5  8  9 10 11 12 13 14 15 16 17 18 19 20
2) When you perform a subscripted assignment, the right hand side needs to fit into the left hand side sensibly (this is the way I think of it). This can mean left and right hand sides of equal length, which is what your example seems to imply. Therefore, you would need to subset the right hand side of the assignment as well:
df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
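Both points can be reproduced on a couple of short vectors. This is only a minimal sketch: degree, subj and major below are made-up toy vectors, not the columns from the question.

```r
# Toy vectors (hypothetical, not the question's data)
degree <- c("MN", NA, "BA", "BA")
subj   <- c("SCI", "ENG", "BUS", "ENG")
major  <- rep(NA_character_, length(degree))

# A logical index that contains NA is rejected when the replacement
# has more than one element; capture the error message instead of stopping
res <- tryCatch({
  major[degree == "BA"] <- subj
  "no error"
}, error = function(e) conditionMessage(e))
print(res)

# which() keeps only the positions that are TRUE, and subsetting the
# right-hand side with the same index makes both sides the same length
idx <- which(degree == "BA")
major[idx] <- subj[idx]
print(major)
```

Storing the index in idx also avoids evaluating the same condition twice.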
I hope that clarifies why your original attempt produced an error.
Using ifelse, as suggested by @DavidRobinson, is a good way of doing this type of assignment. My take on it:
df$major2 <- ifelse(df$degree1 == "BA", df$subj1, ifelse(df$degree2 == "BA", df$subj2,NA))
This is equivalent to
df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <- df$subj2[which(df$degree1 != "BA" & df$degree2 == "BA")]
Depending on the depth of the nested ifelse statements, another approach might be better for your real data.
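One possible alternative for many subject/degree pairs is a loop that fills in the first "BA" subject it finds. This is only a sketch under assumptions: first_ba and its n_pairs argument are hypothetical names, the columns are assumed to follow the subj1/degree1, subj2/degree2, ... pattern from the question, and the toy data frame is made up.

```r
# Hypothetical helper: return the first subject whose degree is "BA",
# checking the pairs subj1/degree1, subj2/degree2, ... in order
first_ba <- function(df, n_pairs) {
  major <- rep(NA_character_, nrow(df))
  for (i in seq_len(n_pairs)) {
    subj   <- as.character(df[[paste0("subj", i)]])
    degree <- as.character(df[[paste0("degree", i)]])
    # Only fill rows that have no major yet; which() drops NA comparisons
    fill <- which(is.na(major) & degree == "BA")
    major[fill] <- subj[fill]
  }
  major
}

# Toy data (not the question's data); row 4 has a BA in both subjects
toy <- data.frame(
  subj1   = c("SCI", "ENG", "BUS", "BUS"),
  degree1 = c("MN",  "BA",  "MN",  "BA"),
  subj2   = c("BUS", "ENG", NA,    "SCI"),
  degree2 = c("BA",  "MN",  NA,    "BA")
)
result <- first_ba(toy, n_pairs = 2)
print(result)
```

Because earlier pairs are filled first and never overwritten, a student with "BA" in several subjects keeps the first one, which matches the definition of primary major in the question.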
EDIT:
I was going to write a third reason for the original code failing (namely that df$major wasn't yet assigned), but it works for me without having to do that. This was a problem I remember having in the past, though. What version of R are you running? (2.15.0 for me.) This step is not necessary if you use the ifelse() approach. Your solution is fine when using [, although I would have chosen
df$major <- NA
To get the character values of the subjects, instead of the factor level index, use as.character() (which for factors is equivalent to and calls levels(x)[x]):
df$major[which(df$degree1 == "BA")] <- as.character(df$subj1)[which(df$degree1 == "BA")]
df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <- as.character(df$subj2)[which(df$degree1 != "BA" & df$degree2 == "BA")]
Same for the ifelse() way:
df$major2 <- ifelse(df$degree1 == "BA", as.character(df$subj1), ifelse(df$degree2 == "BA", as.character(df$subj2), NA))
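To see why as.character() matters, the two ifelse() variants can be compared on a small data frame with factor columns. This is a sketch on made-up data (toy, codes and labels are hypothetical names), not the sample from the question.

```r
# Toy data with factor columns, as produced by older read.csv() defaults
toy <- data.frame(
  subj1   = factor(c("SCI", "ENG", "BUS")),
  degree1 = factor(c("MN",  "BA",  "MN")),
  subj2   = factor(c("BUS", "ENG", NA)),
  degree2 = factor(c("BA",  "MN",  NA))
)

# Without as.character(): ifelse() drops the factor attributes,
# so the result holds the underlying integer level codes
codes <- ifelse(toy$degree1 == "BA", toy$subj1,
                ifelse(toy$degree2 == "BA", toy$subj2, NA))

# With as.character(): the result holds the subject labels
labels <- ifelse(toy$degree1 == "BA", as.character(toy$subj1),
                 ifelse(toy$degree2 == "BA", as.character(toy$subj2), NA))

print(codes)
print(labels)
```

The integer codes are also ambiguous here, because subj1 and subj2 can have different level sets, so the same code may stand for different subjects in each column.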
In general, the ifelse function is the right choice for these situations, something like:
df$major = ifelse((!is.na(df$degree1) & df$degree1 == "BA") & (is.na(df$degree2) | df$degree2 != "BA"), df$subj1, df$subj2)
However, its precise use depends on what you do if both df$degree1 and df$degree2 are "BA".