I have a data frame in which one column is a Series of strings. Each string contains one or more phrases. A phrase is either a single word or several words separated by spaces, and every word starts with an upper-case letter (e.g. "Strawberry" or "Strawberry Jam", respectively). Words that belong to different phrases are run together with no space between them (e.g. "JamApple").
import pandas as pd

df = pd.DataFrame({
    'foo': ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
    'bar': ['A', 'B', 'C'],
    'baz': [1, 2, 3],
    'zoo': ['x', 'y', 'z'],
})
                        foo bar  baz zoo
0       Strawberry JamApple   A    1   x
1  BananaPear CrumblePotato   B    2   y
2               Almond Cake   C    3   z
How could I use regex to separate the phrases in each string based on the rule above (into "Strawberry Jam", "Apple", "Banana", "Pear Crumble", "Potato", "Almond Cake") and extract them? I.e., get the following data frame:
foo
0 Strawberry Jam
0 Apple
1 Banana
1 Pear Crumble
1 Potato
2 Almond Cake
I started with the following code:
df.loc[:, 'foo'].str.extractall('([A-Z]{1}[a-z]+)').copy()
However, this separates all words and doesn't use space to "connect" them. How would I include the latter?
Thanks.
Series.str.split + explode
df['foo'].str.split(r'(?<=[a-z])(?=[A-Z])').explode()
0 Strawberry Jam
0 Apple
1 Banana
1 Pear Crumble
1 Potato
2 Almond Cake
Name: foo, dtype: object
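For completeness, here is a minimal end-to-end sketch of the same approach, assuming the sample data from the question. The names `phrases` and `result` are only illustrative, and the `to_frame()` step is one possible way to get a one-column data frame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
    'bar': ['A', 'B', 'C'],
    'baz': [1, 2, 3],
    'zoo': ['x', 'y', 'z'],
})

# Split each string at every lowercase-to-uppercase boundary,
# then give every resulting phrase its own row.
phrases = df['foo'].str.split(r'(?<=[a-z])(?=[A-Z])').explode()

# Convert the Series to a one-column data frame (illustrative step).
# The original row index is repeated for phrases from the same string.
result = phrases.to_frame()
print(result)
```

Because `explode` keeps the original index, each phrase can still be traced back to its source row, for example with `df.drop(columns='foo').join(result)`.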
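Regex details:
(?<=[a-z]): positive lookbehind; asserts that the character immediately before the current position is a lowercase letter in the range a to z
(?=[A-Z]): positive lookahead; asserts that the character immediately after the current position is an uppercase letter in the range A to Z
Together they match the zero-width position between a lowercase letter and an uppercase letter, so the split consumes no characters.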
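To see the lookarounds in isolation, the sketch below applies the same pattern with Python's standard `re` module. It also shows an alternative that matches the phrases directly instead of splitting. That alternative pattern, `PHRASE`, is not part of the answer above; it is an assumption that every word is one upper-case letter followed by lower-case letters, as in the question's examples.

```python
import re

# Zero-width boundary: a lowercase letter behind, an uppercase letter ahead.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

# Alternative (assumed, not from the answer): match a capitalized word
# followed by any number of space-separated capitalized words.
PHRASE = re.compile(r'[A-Z][a-z]+(?: [A-Z][a-z]+)*')

samples = ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake']

# Splitting on an empty match requires Python 3.7 or later.
split_result = [BOUNDARY.split(s) for s in samples]
match_result = [PHRASE.findall(s) for s in samples]

for s, parts in zip(samples, split_result):
    print(s, '->', parts)
```

For these samples both approaches give the same phrases. The matching pattern, wrapped in a capture group, should also work with the `str.extractall` call from the question, which answers the "connect by space" part directly.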