Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use a lookup table in R without creating duplicates?

Tags:

lookup

r

I was wondering if someone had an nice way to achieve this. I have a data frame where each observation (=item), belonging to a particular group (= condition) has a given value:

# Create sample data.
item       = rep(1:3,2)                               #6 items
condition  = c(rep("control",3), rep("related",3))    #2 conditions
value      = c(10,11,12,20,21,22)                     #6 values          
df         = data.frame(item, condition, value)

  item condition value
1    1   control    10
2    2   control    11
3    3   control    12
4    1   related    20
5    2   related    21
6    3   related    22

I also have a look-up table, which contains the mean of each group:

# Create lookup table.
condition  = c("control", "related")
mean       = c(11,21)
table      = data.frame(condition, mean)

  condition mean
1   control   11
2   related   21

I want to modify my original data frame, such that it contains a new column, label, which says "low" if the value of the item is lower than the group mean, and "high" otherwise. It should look like this:

# How the output should look like.
# If the item value is less than the group mean, write "low". Write "high" otherwise.
item       = rep(1:3,2)                               
condition  = c(rep("control",3), rep("related",3))    
value      = c(10,11,12,20,21,22)                      
label      = c(rep(c("low", "high", "high"),2))
output     = data.frame(item, condition, value, label)

  item condition value label
1    1   control    10   low
2    2   control    11  high
3    3   control    12  high
4    1   related    20   low
5    2   related    21  high
6    3   related    22  high

If this were a matter of just copying the group mean to my original data frame, I would use merge. But what I need is to take the group mean into account to write a new label for each item that says "low" or "high" depending on the group mean.

One thing that I tried was to first merge my data frame with the table, and then use ifelse to compare the value column with the mean column. This works, but I also end up with a mean column in my data frame, which I don't need (I only need the label column). Of course, I could delete the mean column by hand, but it seems clunky. So I was wondering: does someone know a better/more elegant solution?

Thanks a bunch!

like image 769
Sol Avatar asked Dec 24 '15 01:12

Sol


1 Answers

Here are some alternatives. (1) and (2) only use base R and (2), (3) and (5) do not create a mean column only to be explicitly deleted. In (1), (3) and (4) we used left joins although inner joins would have given the same result with this data and in the case of (1a) allow us to write (1) as a single line.

1) merge

m <- merge(df, table, all.x = TRUE)
transform(m, label = ifelse(value < mean, "low", "high"), mean = NULL)

giving:

  item condition value label
1    1   control    10   low
2    2   control    11  high
3    3   control    12  high
4    1   related    20   low
5    2   related    21  high
6    3   related    22  high

1a) With an inner join it could be shortened to:

transform(merge(df, table), label = ifelse(value < mean, "low", "high"), mean = NULL)

2) match

transform(df, 
  label = ifelse(value < table$mean[match(condition, table$condition)], "low", "high")
)

giving the same.

3) sqldf

library(sqldf)
sqldf("select 
         df.*, 
         case when value < mean 
              then 'low' 
              else 'high' 
              end label
       from df 
       left join 'table' using (condition)")

4) dplyr

library(dplyr)
df %>%
   left_join(table) %>%
   mutate(label = ifelse(value < mean, "low", "high")) %>%
   select(- mean)

5) data.table

library(data.table)
dt <- as.data.table(df)
setkey(dt, "condition")
dt[table, label := ifelse(value < mean, "low", "high")]
like image 152
G. Grothendieck Avatar answered Nov 15 '22 00:11

G. Grothendieck