Merging rows with shared information

Tags: merge, r

I have a data.frame in which several rows, produced by an earlier merge, were not completely merged:

b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   <NA>          <NA>       <NA>       5             NA
69 HA-09   16   <33% no/occasional       <NA>      NA             1")

How can I merge them by a column?

Expected output:

      ID  Age     Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
69 HA-09   16  <33% no/occasional       <NA>       5             1

Note that some columns (other than ID) have the same value on both rows. These columns aren't part of the "primary key" of the database (AFAIK), so if a column holds several different values, the rows shouldn't be merged. Things I tried:

 merge(b[1, ], b[2, ], all = T) # Doesn't merge the rows, just the data.frames
 cast(b, ID ~ .) # I can count them but not merging them into a single row
 aggregate(b, by = list("ID", "Age"), c) # Error 
llrs asked Feb 05 '23 23:02


1 Answer

A dplyr approach using summarise_all:

## using `na.strings` to identify NA entries in posted data
b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   <NA>          <NA>       <NA>       5             NA
69 HA-09   16   <33% no/occasional       <NA>      NA             1", na.strings = c("NA", "<NA>"))

library(dplyr)
f <- function(x) {
  x <- na.omit(x)
  if (length(x) > 0) first(x) else NA
}
res <- b %>% group_by(ID,Age) %>% summarise_all(funs(f))
##Source: local data frame [1 x 7]
##Groups: ID [?]
##
##      ID   Age Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
##  <fctr> <int>    <fctr>        <fctr>      <lgl>   <int>         <int>
##1  HA-09    16      <33% no/occasional         NA       5             1

The function f is defined this way to handle the case where all values in a column are NA: it returns the first non-NA value if there is one, and NA otherwise.
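In more recent dplyr releases, funs() is deprecated and summarise_all() is superseded by across(). The following is a minimal sketch of the same idea in that style, assuming dplyr 1.0.0 or later is installed. The helper name first_non_na is made up for this example, and it uses base subsetting instead of na.omit() and first():

```r
library(dplyr)

b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   <NA>          <NA>       <NA>       5             NA
69 HA-09   16   <33% no/occasional       <NA>      NA             1", na.strings = c("NA", "<NA>"))

# Hypothetical helper: first non-NA value of x, or NA if every value is missing.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# Group by the key columns, then collapse every remaining column with the helper.
# everything() inside across() skips the grouping columns.
res <- b %>%
  group_by(ID, Age) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")
```

The .groups = "drop" argument only removes the grouping from the result; leaving it out should give the same values with a message about the remaining grouping.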


As @jdobres suggests, if a column has more than one non-NA value per group and you want to keep all of them, you can flatten them into a single string:

library(dplyr)
f <- function(x) {
  x <- na.omit(x)
  if (length(x) > 0) paste(x,collapse='-') else NA
}
res <- b %>% group_by(ID,Age) %>% summarise_all(funs(f))

In your posted data, the values would be the same as above because every summarized column has at most one non-NA value per group. Note, however, that paste returns strings, so the numeric columns become character.
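Since the question says rows with several different values should not be merged, it may help to check for such conflicts before collapsing. The sketch below uses only base R and counts the distinct non-NA values per column within each ID/Age group. The names n_distinct_non_na, value_cols, conflicts and b2 are made up for this example, and b2 is a deliberately altered copy of the posted data, not real data. Here by is given the grouping columns themselves, not their names as strings:

```r
b <- read.table(text = "
      ID   Age    Steatosis       Mallory Lille_dico Lille_3 Bili.AHHS2cat
68 HA-09   16   <NA>          <NA>       <NA>       5             NA
69 HA-09   16   <33% no/occasional       <NA>      NA             1", na.strings = c("NA", "<NA>"))

# Hypothetical helper: number of distinct non-NA values in x.
n_distinct_non_na <- function(x) length(unique(x[!is.na(x)]))

# Every column except the grouping keys.
value_cols <- setdiff(names(b), c("ID", "Age"))

# One row per ID/Age group; each cell is the count of distinct non-NA values.
conflicts <- aggregate(b[value_cols], by = b[c("ID", "Age")],
                       FUN = n_distinct_non_na)

# A group is safe to merge only if no column has more than one distinct value.
has_conflict <- rowSums(conflicts[value_cols] > 1) > 0

# Illustration with altered data: give the second row a different Lille_3.
b2 <- b
b2$Lille_3[2] <- 7L
conflicts2 <- aggregate(b2[value_cols], by = b2[c("ID", "Age")],
                        FUN = n_distinct_non_na)
has_conflict2 <- rowSums(conflicts2[value_cols] > 1) > 0
```

For the posted data no group is flagged, so taking the first non-NA value loses nothing. In the altered copy the group is flagged because Lille_3 holds both 5 and 7.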

aichao answered Feb 08 '23 12:02