
Combining runs of nominal variables

I have a dataset that contains dialog between two people that was created during a chat session. For example,

  1. "A: Hi"
  2. "A: How are you today"
  3. "B: Fine. How are you?"
  4. "A: I'm good"
  5. "B: Cool"

I want to create a simple function in R that will combine the lines of A before B speaks into one line so that I have a dataset that looks like:

  1. "A: Hi A: How are you today"
  2. "B: Fine. How are you?"
  3. "A: I'm good"
  4. "B: Cool"

I know how to merge/combine cells, but I'm not sure how to write a logical statement that creates an indicator for the lines A speaks before B (and vice versa).

User7598 asked Feb 15 '15 13:02




2 Answers

The rle() function may be used for this purpose. It determines all runs of equal values in a given vector.

v1 <- c("A: Hi" , "A: How are you today", "B: Fine. How are you?", 
     "A: I'm good" ,"B: Cool") # input data

speakers <- rle(substring(v1, 1, 1))
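
To make the run-length encoding concrete, here is a small sketch that inspects it for the sample data. The name runs is only illustrative, and the code assumes one-letter speaker labels, as in the question.

```r
v1 <- c("A: Hi", "A: How are you today", "B: Fine. How are you?",
        "A: I'm good", "B: Cool")

# The first character of each line identifies the speaker
# (assumption: speaker labels are a single letter)
runs <- rle(substring(v1, 1, 1))

runs$values   # speaker of each run:       "A" "B" "A" "B"
runs$lengths  # number of lines per run:    2   1   1   1
```

Each entry of runs$lengths says how many consecutive lines belong to the speaker in the matching entry of runs$values.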

The output of the rle() function may now be used to split the dialogue parts accordingly and then combine them in order to get the desired result.

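# integer run ids keep the groups in their original order even with 10 or more
# runs (character ids would be sorted as "1", "10", "2", ... by split())
ids <- rep(seq_along(speakers$lengths), speakers$lengths)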
unname(sapply(split(v1, ids), function(monologue) {
   # concatenate all statements in a "monologue"
   monologue[-1] <- substring(monologue[-1], 4)
   paste(monologue, collapse=" ")
}))

Result:

## [1] "A: Hi How are you today"
## [2] "B: Fine. How are you?"
## [3] "A: I'm good"             
## [4] "B: Cool"   
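
Since the question asks for a simple function, the steps above could be wrapped roughly as follows. This is a sketch, not part of the original answer: combine_runs is a hypothetical name, and it assumes every line has the form "<label>: <text>" (labels may be longer than one character).

```r
# Hypothetical helper: merge consecutive lines spoken by the same person.
# Assumes each line looks like "<label>: <text>".
combine_runs <- function(lines) {
  speaker <- sub(":.*", "", lines)        # everything before the first colon
  text    <- sub("^[^:]*: *", "", lines)  # everything after the first colon
  runs    <- rle(speaker)
  # Integer run id per line; integer ids keep the runs in their original order
  ids    <- rep(seq_along(runs$lengths), runs$lengths)
  merged <- tapply(text, ids, paste, collapse = " ")
  paste(runs$values, as.vector(merged), sep = ": ")
}

combine_runs(c("A: Hi", "A: How are you today", "B: Fine. How are you?",
               "A: I'm good", "B: Cool"))
## [1] "A: Hi How are you today" "B: Fine. How are you?"
## [3] "A: I'm good"             "B: Cool"
```

Splitting on the first colon instead of a fixed character position means the helper does not depend on one-letter labels.
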
gagolews answered Sep 28 '22 17:09


An option using data.table: convert the vector ("v1") to a data.table with setDT, create a new variable ("indx") holding the speaker prefix ("A", "B"), and use rleid(indx) as the grouping variable. Within each group, paste the contents of "V1" (without the prefix) together and prepend the group's "indx" to create the expected output.

library(data.table)#data.table_1.9.5
setDT(list(v1))[, indx:=sub(':.*', '', V1)][, paste(unique(indx), 
   paste(sub('.:', '', V1), collapse=" "), sep=":") , rleid(indx)]$V1
# [1] "A: Hi  How are you today" "B: Fine. How are you?"   
# [3] "A: I'm good"              "B: Cool"                 

Or a variant would be using tstrsplit to split the column "V1" into two ("V1", and "V2"), group by rleid of "V1", and paste the contents of "V1" and "V2".

setDT(list(v1))[,tstrsplit(V1, ": ")][, sprintf('%s: %s', unique(V1),
           paste(V2, collapse=" ")), rleid(V1)]$V1
#[1] "A: Hi How are you today" "B: Fine. How are you?"  
#[3] "A: I'm good"             "B: Cool"   

Or an option using base R

 str1 <- sub(':.*', '', v1)
 # a new run starts wherever the speaker differs from the previous line
 indx1 <- cumsum(c(TRUE, str1[-1] != str1[-length(str1)]))
 str2 <- sub('.*: +', '', v1)
 paste(tapply(str1, indx1, FUN=unique),
    tapply(str2, indx1, FUN=paste, collapse=" "), sep=": ")
 #[1] "A: Hi How are you today" "B: Fine. How are you?"  
 #[3] "A: I'm good"             "B: Cool"   
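
The question's desired output keeps the speaker prefix on every merged line ("A: Hi A: How are you today"). A minimal base R sketch of that variant, using the same cumsum run index, might look like this. The names speaker and run_id are illustrative only.

```r
v1 <- c("A: Hi", "A: How are you today", "B: Fine. How are you?",
        "A: I'm good", "B: Cool")

speaker <- sub(":.*", "", v1)
# A new run starts wherever the speaker differs from the previous line
run_id <- cumsum(c(TRUE, speaker[-1] != speaker[-length(speaker)]))
# Paste the untouched lines of each run together, prefixes included
combined <- as.vector(tapply(v1, run_id, paste, collapse = " "))
combined
#[1] "A: Hi A: How are you today" "B: Fine. How are you?"
#[3] "A: I'm good"                "B: Cool"
```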

data

v1 <- c("A: Hi" , "A: How are you today", "B: Fine. How are you?", 
     "A: I'm good" ,"B: Cool")
akrun answered Sep 28 '22 18:09