
Regex to group words separated by space

I have a data frame in which one column is a Series of strings. Each string contains one or more phrases. A phrase is either a single word or several words separated by spaces, and every word starts with an upper-case letter (e.g. "Strawberry" or "Strawberry Jam"). Words that belong to different phrases are not separated by a space (e.g. "JamApple").

import pandas as pd

df = pd.DataFrame({
    'foo': ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
    'bar': ['A', 'B', 'C'],
    'baz': [1, 2, 3],
    'zoo': ['x', 'y', 'z'],
})


                        foo bar  baz zoo
0       Strawberry JamApple   A    1   x
1  BananaPear CrumblePotato   B    2   y
2               Almond Cake   C    3   z

How could I use regex to split each string into phrases according to the rule above ("Strawberry Jam", "Apple", "Banana", "Pear Crumble", "Potato", "Almond Cake") and extract them? I.e., get the following data frame:

   foo
0  Strawberry Jam
0  Apple
1  Banana
1  Pear Crumble
1  Potato
2  Almond Cake

I started with the following code:

df.loc[:, 'foo'].str.extractall('([A-Z]{1}[a-z]+)').copy()

However, this separates all words and doesn't use space to "connect" them. How would I include the latter?

Thanks.

JuM24 asked Mar 17 '21 15:03



1 Answer

Series.str.split + explode

df['foo'].str.split(r'(?<=[a-z])(?=[A-Z])').explode()

0    Strawberry Jam
0             Apple
1            Banana
1      Pear Crumble
1            Potato
2       Almond Cake
Name: foo, dtype: object
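
For comparison, the extractall attempt from the question can also be made to work if the pattern itself allows a space between capitalized words. This is only a minimal sketch, assuming every word is one capital letter followed by lowercase letters and that words within a phrase are joined by exactly one space; the names `pattern` and `phrases` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
})

# One capitalized word, optionally followed by more capitalized words,
# each preceded by a single space. The outer group is what extractall returns.
pattern = r'([A-Z][a-z]+(?: [A-Z][a-z]+)*)'

phrases = (
    df['foo'].str.extractall(pattern)  # one row per match, MultiIndex (row, match)
    .droplevel('match')                # keep only the original row label
    .rename(columns={0: 'foo'})        # name the single capture-group column
)
print(phrases)
```

Unlike the split-based answer, this returns a one-column data frame rather than a Series, which is closer to the output requested in the question. It silently drops anything that does not fit the assumed word shape (digits, all-caps words, hyphens), so the split approach is likely the safer choice for messier data.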

Regex details:

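  • (?<=[a-z]) : Positive lookbehind; asserts that the character immediately before the current position is in the range a to z, without consuming it

  • (?=[A-Z]) : Positive lookahead; asserts that the character immediately after the current position is in the range A to Z, without consuming it

Together they match the empty position between a lowercase and an uppercase letter, so the split removes no characters.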
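
The behaviour of the two lookarounds can be checked outside pandas with the standard `re` module. The sketch below assumes Python 3.7 or later, where `re.split` splits on empty matches; the names `boundary`, `parts` and `unsplit` are illustrative.

```python
import re

# Zero-width boundary: previous character is lowercase, next is uppercase.
boundary = re.compile(r'(?<=[a-z])(?=[A-Z])')

# "aP" and "eP" are boundaries; the space in "Pear Crumble" is not,
# because the character before "C" is a space, not a lowercase letter.
parts = boundary.split('BananaPear CrumblePotato')
print(parts)

# A string with no lowercase-to-uppercase transition is returned whole.
unsplit = boundary.split('Almond Cake')
print(unsplit)
```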


Shubham Sharma answered Oct 18 '22 09:10