
Merge rows within a dataframe by a key

Tags: dataframe, r

I have a data frame with a key column and data columns, like this:

df <- cbind(key=c("Jane", "Jane", "Sam", "Sam", "Mary"), var1=c("a", NA, "a", "a", "c"), var2=c(NA, "b", NA, "b", "d"))

key    var1 var2
"Jane" "a"  NA
"Jane" NA   "b"
"Sam"  "a"  NA
"Sam"  "a"  "b"
"Mary" "c"  "d"

I want a data frame that merges the rows by key, overwriting NAs wherever possible, like so:

key    var1 var2
"Jane" "a"  "b"
"Sam"  "a"  "b"
"Mary" "c"  "d"

How can I do this?

sus asked Jan 30 '14 01:01



2 Answers

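library(data.table)

# as.data.table() also accepts the matrix that cbind() produced
dtt <- as.data.table(df)

# Within each key, keep the distinct non-NA values of each column
dtt[, list(var1=unique(var1[!is.na(var1)])
         , var2=unique(var2[!is.na(var2)]))
    , by=key]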

    key var1 var2
1: Jane    a    b
2: Mary    c    d
3:  Sam    a    b
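
A small caveat that may be worth checking on your own data: unique() keeps every distinct non-NA value, so a key whose rows disagree would contribute more than one value instead of a single merged one. Here is a minimal base R sketch of that selection rule on plain vectors. The two vectors and the helper name non_na_unique are made up for illustration and are not from the question.

```r
# Hypothetical per-key column values, invented for this illustration
consistent  <- c("a", NA, "a")   # all non-NA values agree
conflicting <- c("a", NA, "b")   # non-NA values disagree

# Same rule as used inside the grouped call: distinct non-NA values
non_na_unique <- function(x) unique(x[!is.na(x)])

agree    <- non_na_unique(consistent)    # a single value
disagree <- non_na_unique(conflicting)   # one value per distinct entry

length(agree)
length(disagree)
```

If a key can hold conflicting values, you would need to decide which one wins, for example the first non-NA value.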
Ricardo Saporta answered Oct 03 '22 13:10

Here's a solution using dplyr. Note that cbind() creates matrices, not data frames, so I've modified the code to do what I think you meant. I also pulled out the selection algorithm into a separate function. I think this is good practice because it allows you to change your algorithm in one place if you discover you need something different.

df <- data.frame(
  key = c("Jane", "Jane", "Sam", "Sam", "Mary"), 
  var1 = c("a", NA, "a", "a", "c"), 
  var2 = c(NA, "b", NA, "b", "d"),
  stringsAsFactors = FALSE
)

library(dplyr)

collapse <- function(x) x[!is.na(x)][1]

# %>% replaces the older %.% operator, which dplyr no longer provides
df %>% 
  group_by(key) %>%
  summarise(var1 = collapse(var1), var2 = collapse(var2))
# # A tibble: 3 × 3
#   key   var1  var2 
#   <chr> <chr> <chr>
# 1 Jane  a     b    
# 2 Mary  c     d    
# 3 Sam   a     b    
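
If you would rather not load a package, the same "first non-NA value per key" idea can be sketched in base R. This is only one possible variant: using aggregate() and the helper variable value_cols are my assumptions, not part of the answer above. It applies collapse() to every non-key column, so the selection rule still lives in one place.

```r
# Same example data as in the answer
df <- data.frame(
  key  = c("Jane", "Jane", "Sam", "Sam", "Mary"),
  var1 = c("a", NA, "a", "a", "c"),
  var2 = c(NA, "b", NA, "b", "d"),
  stringsAsFactors = FALSE
)

# Same selection rule as in the answer: first non-NA value (NA if there is none)
collapse <- function(x) x[!is.na(x)][1]

# All columns except the key (value_cols is a name chosen for this sketch)
value_cols <- setdiff(names(df), "key")

# Apply collapse() to each value column within each key
merged <- aggregate(df[value_cols], by = df["key"], FUN = collapse)
merged
```

The by = interface is used here rather than the formula interface because the formula method drops rows containing NA by default, which would discard exactly the rows that need merging.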
hadley answered Oct 03 '22 14:10