I'm looking to split a string Series at different points depending on the length of certain substrings:
In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])
In [48]: split_locations = df.group_class.str.rfind('class')
In [49]: split_locations
Out[49]:
0 6
1 7
2 7
dtype: int64
In [50]: df
Out[50]:
group_class
0 group9class1
1 group10class2
2 group11class20
My output should look like:
group_class group class
0 group9class1 group9 class1
1 group10class2 group10 class2
2 group11class20 group11 class20
I half-thought this might work:
In [56]: df.group_class.str[:split_locations]
Out[56]:
0 NaN
1 NaN
2 NaN
How can I slice my strings by the variable locations in split_locations?
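This works: selecting with double brackets ([[]]) returns a DataFrame, so apply with axis=1 passes each row as a Series whose name is the index value of the current element, and you can use that to index into the split_locations series: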
In [119]:
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:][0], x.str[:split_locations[x.name]][0]]), axis=1)
Out[119]:
0 1
0 class1 group9
1 class2 group10
2 class20 group11
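For comparison, here is a minimal plain-Python sketch of the same per-row slicing, pairing each string with its own split position via zip. The helper name split_at is hypothetical (it is not a pandas function), and the sketch assumes every string contains 'class'; rfind returns -1 otherwise, which would give a wrong split.

```python
def split_at(values, locations):
    """Split each string at its own position; returns (left, right) pairs.

    split_at is a hypothetical helper for illustration, not a pandas API.
    """
    return [(s[:i], s[i:]) for s, i in zip(values, locations)]


values = ['group9class1', 'group10class2', 'group11class20']
# Assumes 'class' occurs in every string (rfind returns -1 if it does not).
locations = [s.rfind('class') for s in values]
pairs = split_at(values, locations)
print(pairs)

# With the DataFrame from the question, the equivalent call would be
#   pairs = split_at(df.group_class, split_locations)
#   df['group'], df['class'] = zip(*pairs)
```

Because zip walks both sequences in order, this relies on the strings and the locations being in the same row order, which holds when split_locations is computed from the same column.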
Or, as @ajcr has suggested, you can use str.extract:
In [106]:
df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
group class
0 group9 class1
1 group10 class2
2 group11 class20
EDIT
Regex explanation:
The regex came from @ajcr (thanks!). It uses str.extract to extract groups, and the groups become new columns.
?P<group> gives the capture group a name, which is used as the column name; if it is missing, an int is used for the column name instead.
The rest should be self-explanatory: group[0-9]+ looks for the string group followed by one or more digits in the range 0-9, which is what the [] and the + indicate. For this data that is equivalent to group\d+, where \d means digit.
So it could be re-written as:
df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
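To see what the named groups capture without involving pandas, the same pattern can be tried with the standard re module. This is only an illustrative sketch; the sample strings are the ones from the question, and the variable names are arbitrary.

```python
import re

# Same pattern as above; (?P<name>...) gives each capture group a name.
pattern = re.compile(r'(?P<group>group\d+)(?P<class>class\d+)')

for value in ['group9class1', 'group10class2', 'group11class20']:
    match = pattern.fullmatch(value)
    # groupdict() maps each group name to the text it captured.
    print(value, match.groupdict())

# Without names, the groups are only available by position.
unnamed = re.compile(r'(group\d+)(class\d+)')
print(unnamed.fullmatch('group10class2').groups())
```

The group names here play the same role as the column names produced by str.extract, and the positional groups correspond to the integer column names mentioned above.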
Use a regular expression to split the string:
import re
regex = re.compile("(class)")
s = "group1class23"
# Insert a space in front of "class", then split on that space to separate the group part from the class part.
split_string = re.sub(regex, " \\1", s).split(" ")
This will return the list:
['group1', 'class23']
So to append two new columns to your DataFrame you can do:
new_cols = [re.sub(regex, " \\1", x).split(" ") for x in df.group_class]
df['group'], df['class'] = zip(*new_cols)
Which results in:
group_class group class
0 group9class1 group9 class1
1 group10class2 group10 class2
2 group11class20 group11 class20