R: Regex in strsplit (finding ", " followed by capital letter)

Tags: regex, r, strsplit

Say I have a character vector whose elements I want to split based on a regular expression.

To be more precise, I want to split each string at a comma that is followed by a space and then a capital letter. To my understanding, the regex looks like this: /(, [A-Z])/g (which works fine when I try it here).

When I try to do this in R, the regex doesn't seem to work. For example:

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

It finds no split. What am I doing wrong here?

Any help is greatly appreciated!

David asked Jan 08 '23 04:01


1 Answer

Here is a solution:

strsplit(x, ", (?=[A-Z])", perl = TRUE)

See IDEONE demo

Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

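Your original call finds nothing because strsplit() takes the pattern as a plain string: the /.../g wrapper is JavaScript-style regex literal syntax, so R looks for literal slashes and a literal g, which never occur in x. No g flag is needed anyway, since strsplit() already splits at every match.

The regex ", (?=[A-Z])" contains a look-ahead, (?=[A-Z]), that checks for the uppercase letter but does not consume it, so the letter stays in the output. A plain ", [A-Z]" would remove the capital letter together with the comma and the space. R's default regex engine does not support lookarounds, so you need perl=TRUE for patterns that contain them.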
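
To make the difference concrete, here is a small sketch that runs the three patterns on one short string. The string s and the res_* names are made up for illustration and are not part of the question's data.

```r
# Illustrative input (not from the question)
s <- "Loans, Long-term maturity, other items, Assets"

# JavaScript-style literal: the slashes and the trailing "g" are matched
# literally, so there is no match and the string comes back unsplit
res_literal <- strsplit(s, "/(, [A-Z])/g")[[1]]

# No look-ahead: the capital letter is part of the match and is removed
res_consume <- strsplit(s, ", [A-Z]")[[1]]

# Look-ahead: only ", " is removed, the capital letter is kept
res_lookahead <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]

print(res_literal)    # "Loans, Long-term maturity, other items, Assets"
print(res_consume)    # "Loans" "ong-term maturity, other items" "ssets"
print(res_lookahead)  # "Loans" "Long-term maturity, other items" "Assets"
```

Note that ", other" is left alone in every case because the letter after the space is lowercase.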

If the space is optional, or there can be more than one space between the comma and the uppercase letter, use \\s* (zero or more whitespace characters) instead of the single space:

strsplit(x, ",\\s*(?=[A-Z])", perl = TRUE)
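
A quick check of this variant on a made-up string (y and res_ws are illustrative names) with no space, two spaces, and a lowercase continuation:

```r
# Illustrative input: no space after the first comma, two after the second
y <- "Loans,Assets,  Funds, other"

# ",\\s*" consumes the comma plus any whitespace; the look-ahead keeps the capital
res_ws <- strsplit(y, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(res_ws)  # "Loans" "Assets" "Funds, other"
```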

And one more variation that supports any Unicode uppercase letter (with \\p{Lu}) rather than only A-Z:

strsplit(x, ", (?=\\p{Lu})", perl = TRUE)
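
As a rough illustration of when \\p{Lu} matters, the sketch below uses a made-up string containing an accented capital. The names z, res_ascii and res_unicode are hypothetical, and the example assumes the string is UTF-8 encoded (the \u escape marks it as such).

```r
# Illustrative input: "\u00c9" is an accented capital E
z <- "Loans, \u00c9tats financiers, other, Assets"

# [A-Z] only knows ASCII capitals, so it does not split before the accented letter
res_ascii <- strsplit(z, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter
res_unicode <- strsplit(z, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(res_ascii))    # 2
print(length(res_unicode))  # 3
```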

Wiktor Stribiżew answered Jan 14 '23 22:01