
Split string when a capital letter follows a lower cap letter in the middle of a word in R

Tags: string, regex, r

I have several different strings that were concatenated together, and I would like to split them again. I am dealing with things such as

name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

which in this case should be split into "o-n-Butylhydroxylamine", "1-Methylpropylhydroxylamine" and "Amino-2-butanol".

Any thoughts on how I could use strsplit and/or a gsub regular expression to achieve this? The rule I would like to apply is to split a word wherever a number, an opening bracket ("("), or a capital letter follows a lowercase letter.

Tom Wenseleers asked Jan 16 '14 21:01



2 Answers

You could use positive look-around assertions to find (and then split at) the inter-character positions that are preceded by a lowercase letter and followed by an uppercase letter, a digit, or an opening parenthesis "(".

name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
strsplit(name, pat, perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine"      "1-Methylpropylhydroxylamine"
# [3] "Amino-2-butanol"
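
To see all three triggers of the rule at once, here is a minimal sketch that reuses the same pattern on a made-up input. The string x is hypothetical (not from the question) and was chosen only so that a "(", a digit, and a capital letter each follow a lowercase letter somewhere in it.

```r
# Same pattern as above: a lowercase letter behind the split point, and an
# uppercase letter, a digit, or "(" ahead of it.
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"

# Hypothetical input, invented for illustration only.
x <- "ethanol(R)-2-butanol3-pentanolPropane"

# strsplit() returns a list; [[1]] extracts the pieces for the single input.
pieces <- strsplit(x, pat, perl = TRUE)[[1]]
print(pieces)
```

This should give "ethanol", "(R)-2-butanol", "3-pentanol" and "Propane". The "2" in "(R)-2-butanol" does not start a new piece, because it follows a hyphen rather than a lowercase letter.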
Josh O'Brien answered Nov 14 '22 21:11


strsplit(name, "(?<=([a-z]))(?=[A-Z]|[0-9]|\\()", perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine"      "1-Methylpropylhydroxylamine" "Amino-2-butanol"

Remember that strsplit returns a list with one element per input string, so use [[1]] to extract the character vector if appropriate.
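
As a small sketch of what the list return value means in practice, the snippet below splits two strings at once. The second input, "ethanolMethanol", is a hypothetical example added only for illustration, and the variable names are assumptions rather than anything from the question.

```r
pat <- "(?<=[a-z])(?=[A-Z]|[0-9]|\\()"

# First string is from the question; the second is a made-up extra input.
names_in <- c(
  "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol",
  "ethanolMethanol"
)

# One list element per input string.
parts <- strsplit(names_in, pat, perl = TRUE)

first_parts <- parts[[1]]   # character vector for the first input only
all_parts   <- unlist(parts) # every piece from every input, flattened

print(length(parts))
print(first_parts)
print(all_parts)
```

Use [[1]] when there is a single input string, and unlist() when all pieces from several inputs are wanted in one character vector.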

Ricardo Saporta answered Nov 14 '22 23:11