Merge rows within a dataframe by a key

Tags: dataframe, r

I have a dataframe with a key column and data columns, like this:

df <- cbind(key=c("Jane", "Jane", "Sam", "Sam", "Mary"), var1=c("a", NA, "a", "a", "c"), var2=c(NA, "b", NA, "b", "d"))

key    var1 var2
"Jane" "a"  NA
"Jane" NA   "b"
"Sam"  "a"  NA
"Sam"  "a"  "b"
"Mary" "c"  "d"

I want a dataframe that merges the rows by key, filling in each NA whenever another row for the same key has a value, like so:

key    var1 var2
"Jane" "a"  "b"
"Sam"  "a"  "b"
"Mary" "c"  "d"

How can I do this?


asked Jan 30 '14 01:01

sus

2 Answers

library(data.table)

# Convert df to a data.table
dtt <- as.data.table(df)

# For each key, keep the distinct non-NA values of each column
dtt[, list(var1 = unique(var1[!is.na(var1)]),
           var2 = unique(var2[!is.na(var2)])),
    by = key]

    key var1 var2
1: Jane    a    b
2: Mary    c    d
3:  Sam    a    b
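One thing worth checking with the `unique()` rule: it returns every distinct non-NA value, so it only gives exactly one value per group when the rows for a key agree. The base-R sketch below uses made-up vectors (not the question's data) standing in for a single group's column, and compares that rule with taking the first non-NA value. The helper names `non_na_unique` and `first_non_na` are hypothetical.

```r
# Made-up vectors standing in for one group's values of a single column.
agree    <- c("a", NA, "a")                    # one distinct non-NA value
conflict <- c("a", NA, "b")                    # two distinct non-NA values
all_na   <- c(NA_character_, NA_character_)    # no non-NA value at all

# Hypothetical helper names, for illustration only.
non_na_unique <- function(x) unique(x[!is.na(x)])  # rule used in the answer above
first_non_na  <- function(x) x[!is.na(x)][1]       # always returns length 1

length(non_na_unique(agree))     # 1
length(non_na_unique(conflict))  # 2
length(non_na_unique(all_na))    # 0

first_non_na(conflict)           # "a"
first_non_na(all_na)             # NA
```

With the data in the question every key has at most one distinct value per column, so both rules pick the same values there. If your real data can contain conflicts or all-NA groups, the grouped result may not be one row per key with the `unique()` rule.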


answered Oct 03 '22 13:10

Ricardo Saporta

Here's a solution using dplyr. Note that cbind() creates a matrix, not a data frame, so I've modified the code to do what I think you meant. I also pulled the selection algorithm out into a separate function. I think this is good practice because it lets you change the algorithm in one place if you discover you need something different.

df <- data.frame(
  key = c("Jane", "Jane", "Sam", "Sam", "Mary"), 
  var1 = c("a", NA, "a", "a", "c"), 
  var2 = c(NA, "b", NA, "b", "d"),
  stringsAsFactors = FALSE
)

library(dplyr)

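# First non-NA value of x (NA if there is none)
collapse <- function(x) x[!is.na(x)][1]

df %>% 
  group_by(key) %>%
  summarise(var1 = collapse(var1), var2 = collapse(var2))
# Printed output (format and row order depend on the dplyr version):
# Source: local data frame [3 x 3]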
# 
#  key var1 var2
# 1 Mary    c    d
# 2  Sam    a    b
# 3 Jane    a    b
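To illustrate the point about keeping the selection rule in one function, here is a minimal sketch that swaps in a different rule without touching the grouping code. It uses base R's `aggregate()` rather than dplyr so it runs without extra packages. `collapse_strict` and `merge_by_key` are hypothetical names introduced for this example, not part of any library.

```r
df <- data.frame(
  key  = c("Jane", "Jane", "Sam", "Sam", "Mary"),
  var1 = c("a", NA, "a", "a", "c"),
  var2 = c(NA, "b", NA, "b", "d"),
  stringsAsFactors = FALSE
)

# Rule from the answer: first non-NA value (NA when the group has none).
collapse <- function(x) x[!is.na(x)][1]

# Hypothetical stricter rule: stop if a group holds conflicting values.
collapse_strict <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) > 1) {
    stop("conflicting values: ", paste(vals, collapse = ", "))
  }
  if (length(vals) == 0) NA else vals
}

# Hypothetical wrapper (base R only): apply one rule to every non-key column.
merge_by_key <- function(data, rule) {
  value_cols <- setdiff(names(data), "key")
  aggregate(data[value_cols], by = data["key"], FUN = rule)
}

merged        <- merge_by_key(df, collapse)
merged_strict <- merge_by_key(df, collapse_strict)
```

For this data both rules give one row each for Jane, Mary and Sam, because no key has conflicting values. The strict rule only differs when two rows for the same key disagree, in which case it raises an error instead of silently picking the first value.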

answered Oct 03 '22 14:10

hadley
