Say I have a character vector whose strings I want to split based on a regular expression.
To be more precise, I want to split the strings at a comma followed by a space and then a capital letter (to my understanding, the regex looks like this: /(, [A-Z])/g, which works fine when I try it here).
When I try to achieve this in R, the regex doesn't seem to work, for example:
x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
"Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")
strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"
[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"
It finds no split. What am I doing wrong here?
Any help is greatly appreciated!
Here is a solution:
strsplit(x, ", (?=[A-Z])", perl=TRUE)
See IDEONE demo
Output:
[[1]]
[1] "Non MMF investment funds"
[2] "Insurance corporations"
[3] "Assets (Net Acquisition of)"
[4] "Loans"
[5] "Long-term original maturity (over 1 year or no stated maturity)"
[[2]]
[1] "Non financial corporations"
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"
[4] "Loans"
[5] "Short-term original maturity (up to 1 year)"
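The regex ", (?=[A-Z])" contains a look-ahead, (?=[A-Z]), that checks for but does not consume the uppercase letter, so the letter stays in the result. In R, you need to use perl=TRUE with regexps that contain lookarounds.
Your original pattern fails because the /.../g wrapper is JavaScript-style regex literal syntax. R takes the pattern as a plain string, so the slashes and the g are matched literally and never occur in the data. No global flag is needed either: strsplit already splits at every match.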
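To make the difference concrete, here is a minimal sketch on a shortened, made-up string (the variable names x_demo, consumed and kept are only for illustration). It shows that the slash-delimited pattern never matches, that a plain ", [A-Z]" split eats the capital letter, and that the look-ahead version keeps it.

```r
# Shortened sample string, invented for illustration only
x_demo <- "Loans, Long-term maturity, insurance corporations, Assets"

# The slashes and the trailing g are treated as literal characters,
# so the pattern from the question matches nothing in the string.
has_match <- grepl("/(, [A-Z])/g", x_demo)

# Consuming pattern: the capital letter is part of the match and is dropped.
consumed <- strsplit(x_demo, ", [A-Z]")[[1]]

# Look-ahead pattern: the capital letter is only checked, so it is kept.
kept <- strsplit(x_demo, ", (?=[A-Z])", perl = TRUE)[[1]]

print(has_match)
print(consumed)
print(kept)
```

Note that ", insurance" is not a split point in either case, because the letter after the space is lowercase.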
If the space is optional, or there can be a double space between the comma and the uppercase letter, use
strsplit(x, ",\\s*(?=[A-Z])", perl=TRUE)
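As a quick check of the \\s* variant, the following sketch uses an invented string (messy is a hypothetical name) with no space, two spaces, and a lowercase continuation after the commas.

```r
# Invented input: no space, a double space, and a lowercase word after commas
messy <- "Loans,Assets,  Other, misc"

# \\s* allows zero or more whitespace characters before the capital letter
parts <- strsplit(messy, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]

print(parts)
```

The comma before "misc" is left alone because no uppercase letter follows it.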
And one more variation that will support Unicode letters (with \\p{Lu}):
strsplit(x, ", (?=\\p{Lu})", perl=TRUE)
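A small sketch of where \\p{Lu} matters, assuming the input is UTF-8 encoded (the \u escapes below produce UTF-8 strings in R). The sample words and the names words, ascii_parts and unicode_parts are made up for illustration.

```r
# Invented sample with non-ASCII capitals: "Berlin, Ärzte, Österreich"
words <- "Berlin, \u00c4rzte, \u00d6sterreich"

# [A-Z] only covers ASCII capitals, so the accented capitals are not split points
ascii_parts <- strsplit(words, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter
unicode_parts <- strsplit(words, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_parts))
print(length(unicode_parts))
```

If your data only contains ASCII capitals, the two patterns behave the same.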