I have transcripts of storytellings with many instances of overlapped speech indicated by square brackets wrapped around the speech in overlap. I want to extract these instances of overlap. In the following mock example,
ovl <- c("well [yes right]", "let's go", "oh [ we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
this code works fine:
pattern <- "\\[(.*\\w.+])*"
grep(pattern, ovl, value=TRUE)
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap); overlap_clean
[1] "[yes right]" "[ we::ll]" "[°well right° ]"
But in a larger file, a dataframe, it doesn't. Is this due to a mistake in the pattern or could it be due to how the dataframe is structured? The first six lines of the df look like this:
> head(df)
Story
1 "Kar:\tMind you our Colin's getting more like your dad every day
2 June:\tI know he is.
3 Kar:\tblack welding glasses on,
4 \tand he turned round and he made me jump
5 \t“O:h, Colin”,
6 \tand then ( )
Though it might be working in certain cases, your pattern looks off to me. I think it should be this:
pattern <- "(\\[.*?\\])"
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean
[1] "[yes right]" "[ we::ll]" "[°well right° ]"
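This matches each bracketed term. The lazy quantifier .*? (Perl-style syntax, which R's default regex engine also accepts) stops at the first closing bracket, so two overlaps on the same line come back as separate matches instead of being merged into one. The outer parentheses only create a capture group and are not needed for regmatches().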
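As for the data frame: the question does not show the call that failed, so the following is only a sketch under assumptions. It assumes the text sits in a character column named Story, as the head(df) output suggests, and that the regex functions should be given that column rather than the whole data frame, since gregexpr() and regmatches() expect a character vector. The rows below are made up for illustration; the bracketed overlaps in them are not from the real transcript.

```r
# Hypothetical data frame mimicking the layout shown in the question.
# The bracketed overlaps are invented for illustration only.
df <- data.frame(
  Story = c("Kar:\tMind you [our Colin's] getting [more] like your dad",
            "June:\t[I know] he is.",
            "\tand then ( )"),
  stringsAsFactors = FALSE
)

# Lazy pattern: one match per bracketed span
pattern <- "\\[.*?\\]"

# Work on the column as a character vector, not on df itself;
# as.character() also covers the case where Story is a factor.
story <- as.character(df$Story)

matches <- gregexpr(pattern, story)
overlap_clean <- unlist(regmatches(story, matches))
overlap_clean
# [1] "[our Colin's]" "[more]"        "[I know]"
```

Rows without brackets contribute nothing to the result, because regmatches() returns an empty character vector for them and unlist() drops it.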