Conditional assignment of one variable to the value of one of two other variables

Tags:

r

I want to create a new variable that is equal to the value of one of two other variables, conditional on the values of still other variables. Here's a toy example with fake data.

Each row of the data frame represents a student. Each student can be studying up to two subjects (subj1 and subj2), and can be pursuing a degree ("BA") or a minor ("MN") in each subject. My real data includes thousands of students, several types of degree, about 50 subjects, and students can have up to five majors/minors.

      ID subj1 degree1 subj2 degree2
    1  1   BUS      BA  <NA>    <NA>
    2  2   SCI      BA   ENG      BA
    3  3   BUS      MN   ENG      BA
    4  4   SCI      MN   BUS      BA
    5  5   ENG      BA   BUS      MN
    6  6   SCI      MN  <NA>    <NA>
    7  7   ENG      MN   SCI      BA
    8  8   BUS      BA   ENG      MN
    ...

Now I want to create a sixth variable, df$major, that equals the value of subj1 if subj1 is the student's primary major, or the value of subj2 if subj2 is the primary major. The primary major is the first subject with degree equal to "BA". I tried the following code:

    df$major[df$degree1 == "BA"] = df$subj1
    df$major[df$degree1 != "BA" & df$degree2 == "BA"] = df$subj2

Unfortunately, I got an error message:

    > df$major[df$degree1 == "BA"] = df$subj1
    Error in df$major[df$degree1 == "BA"] = df$subj1 :
      NAs are not allowed in subscripted assignments

I assume this means that a vectorized assignment can't be used if the assignment evaluates to NA for at least one row.

I feel like I must be missing something basic here, but the code above seemed like the obvious thing to do and I haven't been able to come up with an alternative.

In case it would be helpful in writing an answer, here's sample data, created using dput(), in the same format as the fake data listed above:

    structure(list(ID = 1:20, subj1 = structure(c(3L, NA, 1L, 2L,
    2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L
    ), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree1 = structure(c(2L,
    NA, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L), .Label = c("BA", "MN"), class = "factor"), subj2 = structure(c(1L,
    2L, NA, NA, 1L, NA, 3L, 2L, NA, 2L, 2L, 1L, 3L, NA, 2L, 1L, 1L,
    NA, 2L, 2L), .Label = c("BUS", "ENG", "SCI"), class = "factor"),
        degree2 = structure(c(2L, 2L, NA, NA, 2L, NA, 1L, 2L, NA,
        2L, 1L, 1L, 2L, NA, 1L, 2L, 2L, NA, 1L, 2L), .Label = c("BA",
        "MN"), class = "factor")), .Names = c("ID", "subj1", "degree1",
    "subj2", "degree2"), row.names = c(NA, -20L), class = "data.frame")


asked May 07 '12 21:05

eipi10

2 Answers

Your original method of assignment is failing for at least two reasons.

1) A problem with the subscripted assignment df$major[df$degree1 == "BA"] <-. Using == can produce NA, which is what prompted the error. From ?"[<-": "When replacing (that is using indexing on the lhs of an assignment) NA does not select any element to be replaced. As there is ambiguity as to whether an element of the rhs should be used or not, this is only allowed if the rhs value is of length one (so the two interpretations would have the same outcome)." There are many ways to get around this, but I prefer using which:

df$major[which(df$degree1 == "BA")] <-

The difference is that == returns TRUE, FALSE and NA, while which returns only the indices of the elements that are TRUE, dropping both FALSE and NA:

    > df$degree1 == "BA"
     [1] FALSE    NA  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    > which(df$degree1 == "BA")
     [1]  3  4  5  8  9 10 11 12 13 14 15 16 17 18 19 20
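To see the rule from the help page in isolation, here is a small sketch on a made-up vector (x and idx are hypothetical stand-ins for df$major and the logical test, not part of the question's data). It suggests that a length-one replacement is accepted with an NA in the index, a longer replacement is refused, and wrapping the index in which() lets the longer replacement through:

```r
# Toy vector and index (hypothetical values, standing in for df$major and the test).
x   <- c("a", "b", "c", "d")
idx <- c(TRUE, NA, FALSE, TRUE)

# Length-one replacement: allowed; the NA position is simply not replaced.
y <- x
y[idx] <- "z"

# Longer replacement with an NA in the index: R raises an error.
res <- tryCatch({
  y2 <- x
  y2[idx] <- c("p", "q")
  "no error"
}, error = function(e) conditionMessage(e))

# which() drops the NA, so the same two-value replacement goes through.
y3 <- x
y3[which(idx)] <- c("p", "q")

print(y)    # positions 1 and 4 replaced, position 2 untouched
print(res)  # the error message text
print(y3)
```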

2) When you perform a subscripted assignment, the right-hand side needs to fit sensibly into the positions selected on the left-hand side (this is the way I think of it). In your example that means both sides should have the same length, so you would need to subset the right-hand side with the same index as well:

df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
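A quick illustration of why the right-hand side needs the same subset, using made-up vectors (subj, deg and sel are hypothetical stand-ins for df$subj1, df$degree1 and the index). Without the subset, R fills the selected positions with the first values of the right-hand side and only issues a warning, so the values end up in the wrong rows:

```r
subj <- c("BUS", "SCI", "ENG", "BUS")  # stand-in for df$subj1
deg  <- c("MN",  "BA",  "MN",  "BA")   # stand-in for df$degree1
sel  <- which(deg == "BA")             # rows 2 and 4

# Unsubsetted right-hand side: rows 2 and 4 receive subj[1] and subj[2].
# The length-mismatch warning is suppressed here only to keep the output quiet.
wrong <- rep(NA_character_, 4)
suppressWarnings(wrong[sel] <- subj)

# Subsetted right-hand side: rows 2 and 4 receive their own values.
right <- rep(NA_character_, 4)
right[sel] <- subj[sel]

print(wrong)
print(right)
```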

I hope that clarifies why your original attempt produced an error.

Using ifelse, as suggested by @DavidRobinson, is a good way of doing this type of assignment. My take on it:

    df$major2 <- ifelse(df$degree1 == "BA", df$subj1,
      ifelse(df$degree2 == "BA", df$subj2, NA))

This is equivalent to

    df$major[which(df$degree1 == "BA")] <- df$subj1[which(df$degree1 == "BA")]
    df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <-
      df$subj2[which(df$degree1 != "BA" & df$degree2 == "BA")]

Depending on the depth of the nested ifelse statements, another approach might be better for your real data.
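As one possible alternative to deep nesting, here is a sketch that loops over the subject/degree pairs instead. The data frame df3 and its three slots are made up for illustration, and the column naming (subj1..subjN, degree1..degreeN) is an assumption carried over from the question. The loop fills from the last slot to the first, so an earlier "BA" overwrites a later one and the first "BA" wins:

```r
# Hypothetical three-slot data; the real data would have up to five slots.
df3 <- data.frame(
  ID      = 1:4,
  subj1   = c("BUS", "SCI", "ENG", NA),
  degree1 = c("MN",  "BA",  "MN",  NA),
  subj2   = c("ENG", "BUS", "SCI", "ENG"),
  degree2 = c("BA",  "BA",  NA,    "MN"),
  subj3   = c("SCI", NA,    "BUS", "BUS"),
  degree3 = c("BA",  NA,    "BA",  "MN"),
  stringsAsFactors = FALSE
)

subj_cols <- paste0("subj", 1:3)
deg_cols  <- paste0("degree", 1:3)

# Start from NA, then fill from the last slot back to the first.
df3$major <- NA_character_
for (k in rev(seq_along(subj_cols))) {
  hit <- which(df3[[deg_cols[k]]] == "BA")  # which() drops NA comparisons
  df3$major[hit] <- as.character(df3[[subj_cols[k]]][hit])
}

print(df3$major)
```

Adding a slot then only means extending the two column-name vectors, rather than adding another level of ifelse.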

EDIT:

I was going to write a third reason for the original code failing (namely that df$major wasn't yet assigned), but it works for me without having to do that. This was a problem I remember having in the past, though. What version of R are you running? (2.15.0 for me.) Creating df$major beforehand is not necessary if you use the ifelse() approach. Your solution is fine when using [, although I would have chosen

df$major <- NA

To get the character values of the subjects, instead of the factor level index, use as.character() (which for factors is equivalent to and calls levels(x)[x]):

    df$major[which(df$degree1 == "BA")] <- as.character(df$subj1)[which(df$degree1 == "BA")]
    df$major[which(df$degree1 != "BA" & df$degree2 == "BA")] <-
      as.character(df$subj2)[which(df$degree1 != "BA" & df$degree2 == "BA")]

Same for the ifelse() way:

    df$major2 <- ifelse(df$degree1 == "BA", as.character(df$subj1),
      ifelse(df$degree2 == "BA", as.character(df$subj2), NA))
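To show what "factor level index" means here, a minimal sketch with a made-up factor f (not the question's data): passing a factor straight to ifelse() yields the underlying integer codes, while converting with as.character() first yields the labels.

```r
f    <- factor(c("BUS", "SCI", "ENG"))  # levels sort as BUS, ENG, SCI
test <- c(TRUE, TRUE, FALSE)

codes  <- ifelse(test, f, NA)                # integer level codes
labels <- ifelse(test, as.character(f), NA)  # subject labels

print(codes)
print(labels)
```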


answered Oct 14 '22 14:10

BenBarnes

In general, the ifelse function is the right choice for these situations, something like:

    # subj1 if degree1 is "BA" and degree2 is missing or not "BA"; otherwise subj2
    df$major = ifelse((!is.na(df$degree1) & df$degree1 == "BA") &
                      (is.na(df$degree2) | df$degree2 != "BA"),
                      df$subj1, df$subj2)

However, its precise use depends on what you do if both df$degree1 and df$degree2 are "BA".
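If the rule from the question applies (the first subject with a "BA" is the primary major), one way to make the both-"BA" case explicit is to build NA-safe flags first and test them in order. This is only a sketch on made-up character vectors; first_is_ba and second_is_ba are hypothetical helper names, and with factor columns the subjects would need as.character() first.

```r
# Made-up rows covering: both BA, BA/MN, MN/BA, BA/missing, missing/BA.
degree1 <- c("BA",  "BA",  "MN",  "BA",  NA)
degree2 <- c("BA",  "MN",  "BA",  NA,    "BA")
subj1   <- c("SCI", "ENG", "BUS", "BUS", NA)
subj2   <- c("ENG", "BUS", "ENG", NA,    "ENG")

# NA-safe flags: a missing degree counts as "not BA".
first_is_ba  <- !is.na(degree1) & degree1 == "BA"
second_is_ba <- !is.na(degree2) & degree2 == "BA"

# subj1 wins whenever degree1 is "BA", including when both are "BA".
major <- ifelse(first_is_ba, subj1, ifelse(second_is_ba, subj2, NA))

print(major)
```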

answered Oct 14 '22 13:10

David Robinson
