
R Dataframe: aggregating strings within column, across rows, by group

I have what seems like a very inefficient solution to a peculiar problem. I have text data which, for various reasons, is broken across rows of a dataframe at random intervals. However, certain subsets of rows are known to belong together based on unique combinations of other variables in the dataframe. See, for example, an MWE demonstrating the structure and my initial solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy",header=TRUE,stringsAsFactors=FALSE)

master<-data.frame()
for (i in unique(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
> master
  tempset.page.1. tempset.passage.1. tempset.person.1.                concat
1               1                123                 A  hello my name is guy
2               1                124                 B        well hello guy

In this example as in my real case, "passage" is the unique grouping variable, so it is not entirely necessary to take the other pieces along with it, although I'd like them available in my dataset.

My current estimate is that the procedure I have devised will take several hours for a dataset that is otherwise easily handled by R on my computer. Perhaps there are some efficiencies to be gained, either from other functions or packages, or from not creating and removing so many objects?

Thanks for any help here!

SOConnell asked Dec 15 '22 14:12



2 Answers

data.table

Here's one way:

require(data.table)
DT <- data.table(df)

DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
#    page passage person               concat
# 1:    1     123      A hello my name is guy
# 2:    1     124      B       well hello guy

Putting the extra variables (besides passage) in the by doesn't cost much, I think.


dplyr

The analogue is:

library(dplyr)
df %>% 
  group_by(page,passage,person) %>% 
  summarise(concat=paste0(text,collapse=" "))

# Source: local data frame [2 x 4]
# Groups: page, passage, person
# 
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy

base R

One way is:

df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,c("page","passage","person","concat")])
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 6    1     124      B       well hello guy
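
All of the approaches above paste the rows of each group in the order they appear in the data frame. The example data is already sorted by index, but if the real data is not (an assumption worth checking), sorting first keeps the words in sequence. A minimal base R sketch, assuming index gives the within-passage order; the reversed copy and the names shuffled, sorted_df and concat are only for illustration:

```r
# Example data from the question
df <- read.table(text="page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header=TRUE, stringsAsFactors=FALSE)

# Reverse the rows to simulate data that arrives out of order (illustration only)
shuffled <- df[rev(seq_len(nrow(df))), ]

# Restore the within-passage order before concatenating
sorted_df <- shuffled[order(shuffled$passage, shuffled$index), ]

# One string per passage; tapply passes collapse through to paste
# and returns a vector named by passage
concat <- tapply(sorted_df$text, sorted_df$passage, paste, collapse=" ")
concat[["123"]]
# [1] "hello my name is guy"
```

The same order() call can be applied to df before any of the grouped versions above.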
Frank answered Jan 14 '23 04:01



Here are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)
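
For reference, here is a self-contained run of that call on the question's example data. The collapse argument is handed to paste through aggregate's "..." argument. The name res is introduced here only for illustration:

```r
# Example data from the question
df <- read.table(text="page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header=TRUE, stringsAsFactors=FALSE)

# Collapse the text within each page/passage/person combination
res <- aggregate(
    text ~ page + passage + person,
    data=df,
    FUN=paste, collapse=' '
)
res
#   page passage person                 text
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy
```

The index column is dropped because it does not appear in the formula.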

dplyr

library(dplyr)
df %>% 
    group_by(page, passage, person) %>%
    summarize(text=paste(text, collapse=' '))
Matthew Plourde answered Jan 14 '23 02:01
