How to specify names of columns for x and y when joining in dplyr?

I have two data frames that I want to join using dplyr. One is a data frame containing first names.

test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
                        stringsAsFactors = FALSE)

The other data frame contains a cleaned up version of the Kantrowitz names corpus, identifying gender. Here is a minimal example:

kantrowitz <- structure(list(name = c("john", "bill", "madison", "abby", "thomas"),
                             gender = c("M", "either", "M", "either", "M")),
                        .Names = c("name", "gender"),
                        row.names = c(NA, 5L),
                        class = c("tbl_df", "tbl", "data.frame"))

I essentially want to look up the gender of the name from the test_data table using the kantrowitz table. Because I'm going to abstract this into a function encode_gender, I won't know the name of the column in the data set that's going to be used, and so I can't guarantee that it will be name, as in kantrowitz$name.

In base R I would perform the merge this way:

merge(test_data, kantrowitz, by.x = "first_name", by.y = "name", all.x = TRUE)

That returns the correct output:

  first_name gender
1       abby either
2       bill either
3       john      M
4    madison      M
5        zzz   <NA>

But I want to do this in dplyr because I'm using that package for all my other data manipulation. The dplyr by option to the various *_join functions only lets me specify one column name, but I need to specify two. I'm looking for something like this:

library(dplyr)
# either
left_join(test_data, kantrowitz, by.x = "first_name", by.y = "name")
# or
left_join(test_data, kantrowitz, by = c("first_name", "name"))

What is the way to perform this kind of join using dplyr?

(Never mind that the Kantrowitz corpus is a bad way to identify gender. I'm working on a better implementation, but I want to get this working first.)

asked Feb 19 '14 18:02 by Lincoln Mullen



2 Answers

This feature has been added in dplyr v0.3. You can now pass a named character vector to the by argument in left_join (and other joining functions) to specify which columns to join on in each data frame. With the example given in the original question, the code would be:

left_join(test_data, kantrowitz, by = c("first_name" = "name")) 
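Since the question mentions wrapping the lookup in a function where the column name is not known in advance, here is a minimal sketch of how the named by vector might be built programmatically. The signature of encode_gender and its col and lookup arguments are assumptions made for illustration; setNames() is base R and simply produces the same named character vector as c("first_name" = "name").

```r
library(dplyr)

# Lookup table from the question (a small stand-in for the full corpus).
kantrowitz <- data.frame(
  name   = c("john", "bill", "madison", "abby", "thomas"),
  gender = c("M", "either", "M", "either", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `col` is the name (as a string) of the column in
# `data` that holds the first names.
encode_gender <- function(data, col, lookup = kantrowitz) {
  # setNames("name", col) builds the named vector c(<col> = "name"),
  # i.e. "join data[[col]] against lookup$name".
  left_join(data, lookup, by = setNames("name", col))
}

test_data <- data.frame(
  first_name = c("john", "bill", "madison", "abby", "zzz"),
  stringsAsFactors = FALSE
)

result <- encode_gender(test_data, "first_name")
print(result)
```

The joined column keeps the name it has in the left-hand data frame (first_name here), and names with no match in the lookup table get NA for gender.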
answered Oct 09 '22 10:10 by Lincoln Mullen


This is more of a workaround than a real solution. You can pass a copy of test_data whose column has been renamed to match the column in kantrowitz:

left_join("names<-"(test_data, "name"), kantrowitz, by = "name")

     name gender
1    john      M
2    bill either
3 madison      M
4    abby either
5     zzz   <NA>
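The same renaming idea can also be written with dplyr's rename(), which may read more clearly. Note that "names<-" replaces every column name, so the form above suits a single-column data frame, whereas rename() touches only the column you name. This is a sketch using the data from the question:

```r
library(dplyr)

test_data <- data.frame(
  first_name = c("john", "bill", "madison", "abby", "zzz"),
  stringsAsFactors = FALSE
)

kantrowitz <- data.frame(
  name   = c("john", "bill", "madison", "abby", "thomas"),
  gender = c("M", "either", "M", "either", "M"),
  stringsAsFactors = FALSE
)

# rename(new = old): give the join column the same name in both tables,
# then join on that shared name.
renamed <- left_join(rename(test_data, name = first_name), kantrowitz, by = "name")
print(renamed)
```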
answered Oct 09 '22 09:10 by Sven Hohenstein