I have a data frame that looks like this:
Chr start stop ref alt Hom/het ID
chr1 5179574 5183384 ref Del Het 719
chr1 5179574 5184738 ref Del Het 915
chr1 5179574 5184738 ref Del Het 951
chr1 5336806 5358384 ref Del Het 376
chr1 5347979 5358384 ref Del Het 228
I would like to merge any duplicate rows, combining the last ID column so that all IDs are in one row/column, like this:
Chr start stop ref alt Hom/het ID
chr1 5179574 5183384 ref Del Het 719
chr1 5179574 5184738 ref Del Het 915, 951
chr1 5336806 5358384 ref Del Het 376
chr1 5347979 5358384 ref Del Het 228
I have found examples of people removing duplicates and summing a column, but I just want to combine all IDs with duplicate regions in a list in a single column.
Some call to aggregate()
should do the trick.
Here's an option that collects the ID's in a list object:
(df1 <- aggregate(df[7], df[-7], unique))
# Chr start stop ref alt Hom.het ID
# 1 chr1 5179574 5183384 ref Del Het 719
# 2 chr1 5179574 5184738 ref Del Het 915, 951
# 3 chr1 5336806 5358384 ref Del Het 376
# 4 chr1 5347979 5358384 ref Del Het 228
And here's one that collects them in a character vector:
df2 <- aggregate(df[7], df[-7],
FUN = function(X) paste(unique(X), collapse=", "))
Comparing the results of the two options:
str(df1$ID)
# List of 4
# $ 0: int 719
# $ 3: int [1:2] 915 951
# $ 7: int 376
# $ 8: int 228
str(df2$ID)
# chr [1:4] "719" "915, 951" "376" "228"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With