I have a pandas dataframe look like this:
ID Col.A
28654 This is a dark chocolate which is sweet
39876 Sky is blue 1234 Sky is cloudy 3423
88776 Stars can be seen in the dark sky
35491 Schools are closed 4568 but shops are open
I tried to split Col.A
before the word dark
or the digits
. My desired result is as given below.
ID Col.A Col.B
28654 This is a dark chocolate which is sweet
39876 Sky is blue 1234 Sky is cloudy 3423
88776 Stars can be seen in the dark sky
35491 Schools are closed 4568 but shops are open
I tried to group the rows which contains the word dark
to a dataframe and group the rows with digits to another dataframe and then split them accordingly. After that I can concatenate the resulting dataframes to obtain expected result. The code is as given below:
df = pd.DataFrame({'ID':[28654,39876,88776,35491], 'Col.A':['This is a dark chocolate which is sweet',
'Sky is blue 1234 Sky is cloudy 3423',
'Stars can be seen in the dark sky',
'Schools are closed 4568 but shops are open']})
df1 = df[df['Col.A'].str.contains(' dark ')==True]
df2 = df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
df1 = df1["Col.A"].str.split(' dark ', expand = True)
df2 = df2["Col.A"].str.split('\d+', expand = True)
pd.concat([[df1, df2], axis =0)
The obtained result is different from the one expected. that is,
0 1
0 This is a chocolate which is sweet
2 Stars can be seen in the sky
1 Sky is blue Sky is cloudy
3 Schools are closed but shops are open
I missed the digits in the string and the word dark
in the result.
So how can I solve this issue and get result without missing the splitting word and digits?
Is there any way to "slice before expected word or digits" without removing them?
Series.str.split
s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], 1))
ID Col.A Col.B
0 28654 This is a dark chocolate which is sweet
1 39876 Sky is blue 1234 Sky is cloudy 3423
2 88776 Stars can be seen in the dark sky
3 35491 Schools are closed 4568 but shops are open
Regex details:
\s+
: Matches any whitespace character one or more time(?=\b(?:dark|\d+)\b)
: Positive Lookahead
\b
: Word boundary to prevent partial matches(?:dark|\d+)
: Non capturing group
dark
: First Alternative matches the characters dark literally\d+
: Second alternative which matches any digit one or more times\b
: Word boundary to prevent partial matchesSee the online regex demo
With your shown samples, please try following. Using str.extract
function of Pandas. Simple explanation would be using extract function and mentioning regex to create 1st capturing group with non-greedy match and 2nd group has digits OR dark string till last of line and saving it into Col.A and Col.B columns.
df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df
With shown samples output will be as follows:
ID Col.A Col.B
0 28654 This is a dark chocolate which is sweet
1 39876 Sky is blue 1234 Sky is cloudy 3423
2 88776 Stars can be seen in the dark sky
3 35491 Schools are closed 4568 but shops are open
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With