
Regex in R to match strings in square brackets

Tags: regex, r

I have transcripts of storytellings with many instances of overlapped speech indicated by square brackets wrapped around the speech in overlap. I want to extract these instances of overlap. In the following mock example,

ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")

this code works fine:

pattern <- "\\[(.*\\w.+])*"
grep(pattern, ovl, value = TRUE)
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean
[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

But on a larger file, loaded as a data frame, it doesn't work. Is this due to a mistake in the pattern, or could it be due to how the data frame is structured? The first six rows of the df look like this:

> head(df)
                                                             Story
1 "Kar:\tMind you our Colin's getting more like your dad every day
2                                             June:\tI know he is.
3                                 Kar:\tblack welding glasses on, 
4                        \tand he turned round and he made me jump
5                                                 \t“O:h, Colin”, 
6                                  \tand then (                  )
Chris Ruehlemann asked Aug 31 '25 22:08


1 Answer

Though it might be working in certain cases, your pattern looks off to me. I think it should be this:

pattern <- "(\\[.*?\\])"
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean

[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

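This matches and captures each bracketed span. The lazy quantifier .*? makes the match stop at the first closing bracket, so two bracketed spans on the same line come back as separate matches rather than as one long match.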
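
To make the difference concrete, here is a minimal sketch that runs the lazy pattern and a greedy variant side by side. The data frame below is invented for illustration (it only borrows the column name Story from the question), and it assumes the text is stored as a character column. One thing worth checking in the real data, though it is only a guess about what went wrong, is that the column (df$Story) is passed to gregexpr() and regmatches() rather than the data frame itself.

```r
# Hypothetical stand-in for the asker's data frame; these rows are invented.
df <- data.frame(
  Story = c("Kar:\twell [yes right]",
            "June:\tI know he is.",
            "Kar:\toh [ we::ll] and [uh huh]"),
  stringsAsFactors = FALSE
)

# Lazy pattern from the answer (capture group dropped; the whole match is used).
pattern <- "\\[.*?\\]"

# Pass the character column, not the data frame.
hits <- regmatches(df$Story, gregexpr(pattern, df$Story))
overlap_clean <- unlist(hits)
overlap_clean

# Greedy variant for comparison: .* runs on to the last "]" in each string.
greedy <- unlist(regmatches(df$Story, gregexpr("\\[.*\\]", df$Story)))
greedy
```

On the third row the lazy pattern returns "[ we::ll]" and "[uh huh]" as two matches, while the greedy variant returns the single span "[ we::ll] and [uh huh]". Rows without brackets contribute nothing to the unlisted result.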

Tim Biegeleisen answered Sep 03 '25 20:09