I have rows with grouping variables and text. I want to use data.table to paste the text when grouping variables are all the same for n times in a row.
Here's what I have:
x y z
1: Pre A Lorem ipsum dolor sit amet, ac
2: Pre C Maecenas netus in, lacus fames
3: Pre C Conubia sed nulla dolor dui ap
4: Pre A Eget laoreet eros in aliquet e
5: Pre A Vestibulum quisque himenaeos p
6: Post A Sed venenatis. Ante turpis ege
7: Post A Ultricies ultricies eleifend m
8: Post B Conubia in in sed etiam nec mo
9: Post A Quam nec ex sodales dictum orc
10: Post A Sed ante id pulvinar maximus s
Desired output:
x y z
1 Pre A Lorem ipsum dolor sit amet, ac
2 Pre C Maecenas netus in, lacus fames Conubia sed nulla dolor dui ap
3 Pre A Eget laoreet eros in aliquet e Vestibulum quisque himenaeos p
4 Post A Sed venenatis. Ante turpis ege Ultricies ultricies eleifend m
5 Post B Conubia in in sed etiam nec mo
6 Post A Quam nec ex sodales dictum orc Sed ante id pulvinar maximus s
MWE:
library(data.table)
library(stringi)
set.seed(12)
dat <- data.frame(
x = rep(c("Pre", "Post"), each=5),
y = sample(LETTERS[1:3], 10, TRUE),
z = substring(stri_rand_lipsum(10), 1, 30), stringsAsFactors = FALSE
)
setDT(dat)
We could use rleid
from the devel version of data.table
to create the grouping variable ('gr') based on the 'x' and 'y' variables. Grouped by 'gr', we get the first element of 'x', 'y', and paste
the 'z' elements together.
library(data.table)#v1.9.5+
setDT(dat)[,list(x=x[1L], y= y[1L], z=paste(z, collapse = ' ')) ,
by = .(gr=rleid(x,y))][, gr:= NULL][]
# x y z
#1: Pre A Lorem ipsum dolor sit amet, ac
#2: Pre C Maecenas netus in, lacus fames Conubia sed nulla dolor dui ap
#3: Pre A Eget laoreet eros in aliquet e Vestibulum quisque himenaeos p
#4: Post A Sed venenatis. Ante turpis ege Ultricies ultricies eleifend m
#5: Post B Conubia in in sed etiam nec mo
#6: Post A Quam nec ex sodales dictum orc Sed ante id pulvinar maximus s
NOTE: Instructions to install the devel version are here
Another option if we are not using the devel version would be to create the 'gr' variable by comparing the adjacent elements after paste
ing 'x' and 'y' columns and then proceed as before.
setDT(dat)[, gr:={gr= paste(x,y); cumsum(c(TRUE, gr[-1]!= gr[-.N]))}]
res <- dat[, list(x=x[1L], y=y[1L], z=paste(z, collapse=' ')) ,
by = gr][,gr:= NULL][]
res
# x y z
#1: Pre A Lorem ipsum dolor sit amet, ac
#2: Pre C Maecenas netus in, lacus fames Conubia sed nulla dolor dui ap
#3: Pre A Eget laoreet eros in aliquet e Vestibulum quisque himenaeos p
#4: Post A Sed venenatis. Ante turpis ege Ultricies ultricies eleifend m
#5: Post B Conubia in in sed etiam nec mo
#6: Post A Quam nec ex sodales dictum orc Sed ante id pulvinar maximus s
This also assigns the 'gr' column in the original dataset. We can assign the 'gr' to NULL
to keep the original structure the same.
dat[, gr:= NULL]
Instead of doing x=x[1L], y= y[1L]
, we can use 'x', 'y' and 'gr' as the grouping variables and return only the 'z' column.
res <- dat[,list(z=paste(z, collapse=' ')) ,.(x, y, gr)][, gr:= NULL][]
dat[, gr:= NULL]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With