Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

collapse and paste text by multiple grouping sequential variables

Tags:

r

data.table

I have rows with grouping variables and text. I want to use data.table to paste the text when grouping variables are all the same for n times in a row.

Here's what I have:

       x y                              z
 1:  Pre A Lorem ipsum dolor sit amet, ac
 2:  Pre C Maecenas netus in, lacus fames
 3:  Pre C Conubia sed nulla dolor dui ap
 4:  Pre A Eget laoreet eros in aliquet e
 5:  Pre A Vestibulum quisque himenaeos p
 6: Post A Sed venenatis. Ante turpis ege
 7: Post A Ultricies ultricies eleifend m
 8: Post B Conubia in in sed etiam nec mo
 9: Post A Quam nec ex sodales dictum orc
10: Post A Sed ante id pulvinar maximus s

Desired output:

     x y                                                             z
1  Pre A                                Lorem ipsum dolor sit amet, ac
2  Pre C Maecenas netus in, lacus fames Conubia sed nulla dolor dui ap
3  Pre A Eget laoreet eros in aliquet e Vestibulum quisque himenaeos p
4 Post A Sed venenatis. Ante turpis ege Ultricies ultricies eleifend m
5 Post B                                Conubia in in sed etiam nec mo
6 Post A Quam nec ex sodales dictum orc Sed ante id pulvinar maximus s

MWE:

library(data.table)
library(stringi)

set.seed(12)
dat <- data.frame(
    x = rep(c("Pre", "Post"), each=5), 
    y = sample(LETTERS[1:3], 10, TRUE), 
    z = substring(stri_rand_lipsum(10), 1, 30), stringsAsFactors = FALSE
)

setDT(dat)
like image 548
Tyler Rinker Avatar asked Sep 05 '15 03:09

Tyler Rinker


Video Answer


1 Answers

We could use rleid from the devel version of data.table to create the grouping variable ('gr') based on the 'x' and 'y' variables. Grouped by 'gr', we get the first element of 'x', 'y', and paste the 'z' elements together.

library(data.table)#v1.9.5+
setDT(dat)[,list(x=x[1L], y= y[1L], z=paste(z, collapse = ' ')) ,
                      by = .(gr=rleid(x,y))][, gr:= NULL][]
#      x y                                                             z
#1:  Pre A                                Lorem ipsum dolor sit amet, ac
#2:  Pre C Maecenas netus in, lacus fames Conubia sed nulla dolor dui ap
#3:  Pre A Eget laoreet eros in aliquet e Vestibulum quisque himenaeos p
#4: Post A Sed venenatis. Ante turpis ege Ultricies ultricies eleifend m
#5: Post B                                Conubia in in sed etiam nec mo
#6: Post A Quam nec ex sodales dictum orc Sed ante id pulvinar maximus s

NOTE: Instructions to install the devel version are here


Another option if we are not using the devel version would be to create the 'gr' variable by comparing the adjacent elements after pasteing 'x' and 'y' columns and then proceed as before.

setDT(dat)[, gr:={gr= paste(x,y); cumsum(c(TRUE, gr[-1]!= gr[-.N]))}]
res <- dat[, list(x=x[1L], y=y[1L], z=paste(z, collapse=' ')) ,
                     by = gr][,gr:= NULL][]
res
#      x y                                                             z
#1:  Pre A                                Lorem ipsum dolor sit amet, ac
#2:  Pre C Maecenas netus in, lacus fames Conubia sed nulla dolor dui ap
#3:  Pre A Eget laoreet eros in aliquet e Vestibulum quisque himenaeos p
#4: Post A Sed venenatis. Ante turpis ege Ultricies ultricies eleifend m
#5: Post B                                Conubia in in sed etiam nec mo
#6: Post A Quam nec ex sodales dictum orc Sed ante id pulvinar maximus s

This also assigns the 'gr' column in the original dataset. We can assign the 'gr' to NULL to keep the original structure the same.

dat[, gr:= NULL]

Instead of doing x=x[1L], y= y[1L], we can use 'x', 'y' and 'gr' as the grouping variables and return only the 'z' column.

res <- dat[,list(z=paste(z, collapse=' ')) ,.(x, y, gr)][, gr:= NULL][]
dat[, gr:= NULL]
like image 132
akrun Avatar answered Sep 23 '22 07:09

akrun