Regex to group words separated by space

Tags: python, regex, pandas

I have a data frame in which one column is a Series of strings. Each string contains one or more phrases. A phrase is either a single word or several words separated by spaces, and every word starts with an upper-case letter (e.g. "Strawberry" or "Strawberry Jam"). Words that belong to different phrases are not separated by a space (e.g. "JamApple", where "Jam" ends one phrase and "Apple" starts the next).

import pandas as pd

df = pd.DataFrame({
    'foo': ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
    'bar': ['A', 'B', 'C'],
    'baz': [1, 2, 3],
    'zoo': ['x', 'y', 'z'],
})


                        foo bar  baz zoo
0       Strawberry JamApple   A    1   x
1  BananaPear CrumblePotato   B    2   y
2               Almond Cake   C    3   z

How could I use regex to split each string into phrases according to the rule above ("Strawberry Jam", "Apple", "Banana", "Pear Crumble", "Potato", "Almond Cake") and extract them? That is, how do I get the following data frame:

              foo
0  Strawberry Jam
0           Apple
1          Banana
1    Pear Crumble
1          Potato
2     Almond Cake

I started with the following code:

df.loc[:, 'foo'].str.extractall('([A-Z]{1}[a-z]+)').copy()

However, this extracts every word separately and does not keep words that are joined by a space together. How can I take the spaces into account?

Thanks.

asked Mar 17 '21 15:03

JuM24

1 Answer

`Series.str.split` + `explode`

df['foo'].str.split(r'(?<=[a-z])(?=[A-Z])').explode()

0    Strawberry Jam
0             Apple
1            Banana
1      Pear Crumble
1            Potato
2       Almond Cake
Name: foo, dtype: object
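
The call above returns a Series. If a data frame is wanted, as in the question, one possible sketch follows. It assumes pandas 0.25 or later (for `explode`) and a version in which `str.split` treats a multi-character pattern as a regular expression; the names `pattern`, `phrases`, and `expanded` are illustrative only and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
    'bar': ['A', 'B', 'C'],
    'baz': [1, 2, 3],
    'zoo': ['x', 'y', 'z'],
})

# Zero-width boundary: a lowercase letter directly followed by an uppercase one.
pattern = r'(?<=[a-z])(?=[A-Z])'

# One-column data frame of phrases; the original row labels repeat per phrase.
phrases = df['foo'].str.split(pattern).explode().to_frame()

# Variant that keeps the other columns next to each phrase.
expanded = df.assign(foo=df['foo'].str.split(pattern)).explode('foo')

print(phrases)
print(expanded)
```

`phrases` has only the `foo` column, while `expanded` repeats the `bar`, `baz`, and `zoo` values for every phrase that came from the same row.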

Regex details:

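`(?<=[a-z])` : positive lookbehind; asserts that the character immediately before the current position is a lowercase letter in the range a to z, without consuming it
`(?=[A-Z])` : positive lookahead; asserts that the character immediately after the current position is an uppercase letter in the range A to Z, without consuming it

Together the two assertions match the empty position between a lowercase letter and the uppercase letter that follows it, so the split removes no characters and leaves spaces inside a phrase untouched.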
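
To see the zero-width split in isolation, here is a small sketch using only the standard `re` module. It assumes Python 3.7 or later, where `re.split` splits on empty matches; `BOUNDARY` and `split_phrases` are hypothetical names chosen for this illustration.

```python
import re

# Matches the empty position between a lowercase letter and the
# uppercase letter that directly follows it.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')


def split_phrases(text):
    """Split text wherever a lowercase letter is directly followed by an uppercase one."""
    return BOUNDARY.split(text)


for text in ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake']:
    print(split_phrases(text))
```

A space followed by an uppercase letter is not a split point, because the lookbehind requires a lowercase letter immediately before the boundary.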

See the regex demo
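
The `extractall` attempt from the question can also be extended so that a phrase may continue across a single space. This is an alternative sketch, not part of the original answer; it assumes every word is one uppercase letter followed by lowercase letters, and the group name `foo` and the variable names are chosen here for illustration.

```python
import pandas as pd

s = pd.Series(
    ['Strawberry JamApple', 'BananaPear CrumblePotato', 'Almond Cake'],
    name='foo',
)

# A capitalised word, optionally followed by more capitalised words,
# each joined to the previous one by a single space.
phrase = r'(?P<foo>[A-Z][a-z]+(?: [A-Z][a-z]+)*)'

# extractall adds a 'match' index level; dropping it leaves the
# original row labels, repeated once per extracted phrase.
result = s.str.extractall(phrase).droplevel('match')

print(result)
```

Naming the group gives the resulting column the label `foo` instead of the default `0`.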

answered Oct 18 '22 09:10

Shubham Sharma
