I have a large data frame (1616610 rows, 255 columns) and I need to paste together the unique values of each column based on a key.
For example:
> data = data.frame(a=c(1,1,1,2,2,3),
b=c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
c=c(12, 22, 22, 45, 67, 28),
d=c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday"))
> data
a b c d
1 1 apples 12 Monday
2 1 oranges 22 Monday
3 1 apples 22 Monday
4 2 apples 45 Tuesday
5 2 apples 67 Wednesday
6 3 grapefruit 28 Tuesday
What I need is to collapse the unique values of each of the 255 columns per key, returning a new data frame with the unique values comma-separated, like this:
a b c d
1 1 apples, oranges 12, 22 Monday
2 2 apples 45, 67 Tuesday, Wednesday
3 3 grapefruit 28 Tuesday
I have tried using aggregate, like so:
output <- aggregate(data, by=list(data$a), paste, collapse=", ")
but for a data frame this size it is too time-intensive (hours), and oftentimes I have to kill the process altogether. On top of that, this aggregates all values, not only the unique ones. Does anyone have any tips on:
1) how to improve the time of this aggregation for large data sets
2) how to keep only the unique values of each field
BTW, this is my first post on SO, so thanks for your patience.
Moved from comments:
library(data.table)
dt <- as.data.table(data)
# collapse the unique values of every non-grouping column, per value of a
dt[, lapply(.SD, function(x) toString(unique(x))), by = a]
giving:
a b c d
1: 1 apples, oranges 12, 22 Monday
2: 2 apples 45, 67 Tuesday, Wednesday
3: 3 grapefruit 28 Tuesday
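For a data frame with 1616610 rows, as.data.table() makes a full copy before grouping. A minimal sketch of the same aggregation using setDT(), which converts the data frame to a data.table by reference instead (shown here on the small sample data, not the real 255-column table):

```r
library(data.table)

data <- data.frame(a = c(1, 1, 1, 2, 2, 3),
                   b = c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
                   c = c(12, 22, 22, 45, 67, 28),
                   d = c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday"))

setDT(data)  # convert in place; no copy of the 1.6M rows

# same grouped lapply over .SD as above
out <- data[, lapply(.SD, function(x) toString(unique(x))), by = a]
out
```

On data of this shape the grouping itself is fast; the bulk of the cost is the per-column paste, which data.table parallelizes poorly but still handles far better than aggregate().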
You could do the following with dplyr:
func_paste <- function(x) paste(unique(x), collapse = ', ')
data %>%
group_by(a) %>%
summarise_each(funs(func_paste))
## a b c d
## (dbl) (chr) (chr) (chr)
##1 1 apples, oranges 12, 22 Monday
##2 2 apples 45, 67 Tuesday, Wednesday
##3 3 grapefruit 28 Tuesday
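Note that summarise_each() and funs() have since been deprecated in dplyr; in current versions (1.0 and later) the same result is written with across(). A sketch of the equivalent call, assuming the sample data from the question:

```r
library(dplyr)

data <- data.frame(a = c(1, 1, 1, 2, 2, 3),
                   b = c("apples", "oranges", "apples", "apples", "apples", "grapefruit"),
                   c = c(12, 22, 22, 45, 67, 28),
                   d = c("Monday", "Monday", "Monday", "Tuesday", "Wednesday", "Tuesday"))

res <- data %>%
  group_by(a) %>%
  # across(everything()) covers all non-grouping columns
  summarise(across(everything(), ~ paste(unique(.x), collapse = ", ")))
res
```

Inside summarise(), across(everything()) automatically excludes the grouping column a, so all 254 remaining columns are collapsed without listing them by name.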