I have what seems like a very inefficient solution to a peculiar problem. I have text data which, for various reasons, is broken across rows of a dataframe at random intervals. However, certain subsets of rows are known to belong together based on unique combinations of other variables in the dataframe. Below is an MWE demonstrating the structure and my initial solution:
# Data
df <- read.table(text="page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy",header=T,stringsAsFactors=F)
master<-data.frame()
for (i in 123:max(df$passage)) {
print(paste0('passage ',i))
tempset <- df[df$passage==i,]
concat<-''
for (j in 1:nrow(tempset)) {
print(paste0('index ',j))
concat<-paste(concat, tempset$text[j])
}
tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
master<-rbind(master, tempdf)
rm(concat, tempset, tempdf)
}
master
> master
tempset.page.1. tempset.passage.1. tempset.person.1. concat
1 1 123 A hello my name is guy
2 1 124 B well hello guy
In this example as in my real case, "passage" is the unique grouping variable, so it is not entirely necessary to take the other pieces along with it, although I'd like them available in my dataset.
My current estimate is that the procedure I have devised will take several hours on a dataset that R otherwise handles easily on my computer. Perhaps there are efficiencies to be gained from other functions or packages, or from not creating and removing so many objects?
Thanks for any help here!
data.table
Here's one way:
require(data.table)
DT <- data.table(df)
DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
# page passage person concat
# 1: 1 123 A hello my name is guy
# 2: 1 124 B well hello guy
Putting the extra variables (besides passage) in the by doesn't cost much, I think.
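If you would rather group on passage alone, a minimal sketch is below. It assumes, as the question states, that passage uniquely identifies each group, so taking the first page and person within a group loses nothing. The name res is only illustrative.

```r
library(data.table)

# Same toy data as in the question
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

DT <- as.data.table(df)

# Group by passage only; carry page and person along by taking
# the first value in each group (assumes they are constant per passage)
res <- DT[, .(page = page[1],
              person = person[1],
              concat = paste(text, collapse = " ")),
          by = passage]
res
```

Whether this is noticeably faster than listing all three columns in by probably depends on the data, so it is worth timing both on a sample.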
dplyr
The analogue is:
library(dplyr)
df %>%
group_by(page,passage,person) %>%
summarise(concat=paste0(text,collapse=" "))
# Source: local data frame [2 x 4]
# Groups: page, passage, person
#
# page passage person concat
# 1 1 123 A hello my name is guy
# 2 1 124 B well hello guy
base R
One way is:
df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,which(names(df)%in%c("page","passage","person","concat"))])
# page passage person concat
# 1 1 123 A hello my name is guy
# 6 1 124 B well hello guy
Here are two ways:
base R
aggregate(
text ~ page + passage + person,
data=df,
FUN=paste, collapse=' '
)
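One caveat: the aggregate call above pastes the pieces in whatever order the rows appear. The sketch below sorts by passage and index first, in case the real data is not already ordered. The shuffled and ordered objects are hypothetical, created only to illustrate the point.

```r
# Same toy data as in the question
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical: mimic input whose rows arrive out of order
shuffled <- df[c(3, 1, 5, 2, 4, 8, 6, 7), ]

# Restore the intended word order within each passage
ordered <- shuffled[order(shuffled$passage, shuffled$index), ]

# Collapse the text pieces per group, as in the answer above
res <- aggregate(
  text ~ page + passage + person,
  data = ordered,
  FUN = paste, collapse = " "
)
res
```

If the rows are already in index order, the sort is unnecessary but harmless.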
dplyr
library(dplyr)
df %>%
# group_by_() and summarize_() are deprecated in current dplyr; the plain verbs do the same job here
group_by(page, passage, person) %>%
summarize(text = paste(text, collapse = ' '))