Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I split Pandas dataframe column with strings according to multiple conditions

I have a pandas dataframe look like this:

    ID       Col.A

28654      This is a dark chocolate which is sweet 
39876      Sky is blue 1234 Sky is cloudy 3423
88776      Stars can be seen in the dark sky
35491      Schools are closed 4568 but shops are open

I tried to split Col.A before the word dark or the digits. My desired result is as given below.

     ID             Col.A                             Col.B
    
    28654      This is a                  dark chocolate which is sweet 
    39876      Sky is blue                1234 Sky is cloudy 3423
    88776      Stars can be seen in the   dark sky
    35491      Schools are closed         4568 but shops are open

I tried to group the rows which contains the word dark to a dataframe and group the rows with digits to another dataframe and then split them accordingly. After that I can concatenate the resulting dataframes to obtain expected result. The code is as given below:

df = pd.DataFrame({'ID':[28654,39876,88776,35491], 'Col.A':['This is a dark chocolate which is sweet', 
                                                            'Sky is blue 1234 Sky is cloudy 3423', 
                                                            'Stars can be seen in the dark sky',
                                                            'Schools are closed 4568 but shops are open']})

df1 = df[df['Col.A'].str.contains(' dark ')==True]
df2 = df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
df1 = df1["Col.A"].str.split(' dark ', expand = True)
df2 = df2["Col.A"].str.split('\d+', expand = True)
pd.concat([[df1, df2], axis =0)

The obtained result is different from the one expected. that is,

      0                              1
0   This is a                   chocolate which is sweet
2   Stars can be seen in the     sky    
1   Sky is blue                  Sky is cloudy  
3   Schools are closed           but shops are open

I missed the digits in the string and the word dark in the result.

So how can I solve this issue and get result without missing the splitting word and digits?

Is there any way to "slice before expected word or digits" without removing them?

like image 746
Athul R T Avatar asked Dec 08 '22 09:12

Athul R T


2 Answers

Series.str.split

s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], 1))

      ID                     Col.A                          Col.B
0  28654                 This is a  dark chocolate which is sweet
1  39876               Sky is blue        1234 Sky is cloudy 3423
2  88776  Stars can be seen in the                       dark sky
3  35491        Schools are closed        4568 but shops are open

Regex details:

  • \s+ : Matches any whitespace character one or more time
  • (?=\b(?:dark|\d+)\b) : Positive Lookahead
    • \b : Word boundary to prevent partial matches
    • (?:dark|\d+): Non capturing group
      • dark : First Alternative matches the characters dark literally
      • \d+ : Second alternative which matches any digit one or more times
    • \b : Word boundary to prevent partial matches

See the online regex demo

like image 67
Shubham Sharma Avatar answered Apr 12 '23 23:04

Shubham Sharma


With your shown samples, please try following. Using str.extract function of Pandas. Simple explanation would be using extract function and mentioning regex to create 1st capturing group with non-greedy match and 2nd group has digits OR dark string till last of line and saving it into Col.A and Col.B columns.

df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df

With shown samples output will be as follows:

    ID      Col.A                       Col.B
0   28654   This is a                   dark chocolate which is sweet
1   39876   Sky is blue                 1234 Sky is cloudy 3423
2   88776   Stars can be seen in the    dark sky
3   35491   Schools are closed          4568 but shops are open
like image 32
RavinderSingh13 Avatar answered Apr 12 '23 23:04

RavinderSingh13