
Slice/split string Series at various positions

Tags: python, pandas

I'm looking to split a string Series at different points depending on the length of certain substrings:

In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])
In [48]: split_locations = df.group_class.str.rfind('class')
In [49]: split_locations
Out[49]: 
0    6
1    7
2    7
dtype: int64
In [50]: df
Out[50]: 
      group_class
0    group9class1
1   group10class2
2  group11class20

My output should look like:

      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20

I half-thought this might work:

In [56]: df.group_class.str[:split_locations]
Out[56]: 
0   NaN
1   NaN
2   NaN

How can I slice my strings by the variable locations in split_locations?

LondonRob asked Aug 07 '15 15:08



2 Answers

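This works: selecting with double brackets (df[['group_class']]) gives a one-column DataFrame, so apply with axis=1 passes each row as a Series whose name is that row's index label. You can use that label to look up the matching position in the split_locations Series (note that the class part comes out first here):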

In [119]:    
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:].iloc[0], x.str[:split_locations[x.name]].iloc[0]]), axis=1)
Out[119]:
         0        1
0   class1   group9
1   class2  group10
2  class20  group11
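
For comparison, here is a minimal sketch of the same positional slicing done without apply, by pairing each string with its own split position. It assumes the df and split_locations from the question; the name pairs is only illustrative. It also assumes every string contains 'class', since rfind returns -1 when the substring is missing and the slice would then be wrong.

```python
import pandas as pd

df = pd.DataFrame(
    ['group9class1', 'group10class2', 'group11class20'],
    columns=['group_class'],
)
split_locations = df['group_class'].str.rfind('class')

# Pair each string with its own split position and slice in plain Python.
# Assumption: 'class' occurs in every string (rfind gives -1 otherwise).
pairs = [(s[:pos], s[pos:]) for s, pos in zip(df['group_class'], split_locations)]
df['group'] = [g for g, _ in pairs]
df['class'] = [c for _, c in pairs]
print(df)
```

This keeps the group part first, matching the column order asked for in the question.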

Or, as @ajcr suggested, you can use str.extract:

In [106]:

df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
     group    class
0   group9   class1
1  group10   class2
2  group11  class20
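
To get the three-column layout from the question, the extracted columns can be joined back onto the original frame. This is a sketch assuming the df from the question; expand=True is passed explicitly so that a DataFrame is returned, and the names extracted and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    ['group9class1', 'group10class2', 'group11class20'],
    columns=['group_class'],
)

# str.extract returns one column per named group; join aligns the
# extracted columns with df on the shared row index.
extracted = df['group_class'].str.extract(
    r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)', expand=True
)
result = df.join(extracted)
print(result)
```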

EDIT

Regex explanation:

The regex came from @ajcr (thanks!). str.extract pulls out each capture group in the pattern, and the groups become the new columns.

?P<group> gives the capture group a name, which becomes the column name. If the name is missing, the column is labelled with an int (the group's position) instead.
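
A small sketch of that difference, assuming a Series holding the same three strings as the question (the names named and unnamed are illustrative):

```python
import pandas as pd

s = pd.Series(['group9class1', 'group10class2', 'group11class20'])

# Named groups: the group names are used as column labels.
named = s.str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)', expand=True)

# Unnamed groups: the columns fall back to integer positions.
unnamed = s.str.extract(r'(group[0-9]+)(class[0-9]+)', expand=True)

print(list(named.columns))
print(list(unnamed.columns))
```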

The rest should be self-explanatory: group[0-9]+ matches the literal string group followed by one or more digits. The [] define a character class covering 0 to 9, and the + means one or more. For these strings, [0-9] can be replaced by \d, where \d means digit.

So it could be re-written as:

df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
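
To see what the pattern captures outside pandas, here is a sketch using the standard re module on a single string. The variable names are illustrative; the pattern is the one from this answer.

```python
import re

pattern = re.compile(r'(?P<group>group\d+)(?P<class>class\d+)')

# match() anchors at the start of the string; groupdict() maps each
# named group to the text it captured.
match = pattern.match('group10class2')
parts = match.groupdict()
print(parts)
```

The keys of that dictionary are the same names that str.extract uses as column labels.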
EdChum answered Oct 04 '22 16:10


Use a regular expression to split the string

 import re

 regex = re.compile("(class)")
 s = "group1class23"
 # Insert a space before "class", then split on that space.
 # This assumes the original string contains no spaces of its own.
 split_string = re.sub(regex, r" \1", s).split(" ")

This returns the list:

 ['group1', 'class23']

So to append two new columns to your DataFrame you can do:

new_cols = [re.sub(regex, " \\1", x).split(" ") for x in df.group_class]
df['group'], df['class'] = zip(*new_cols)

Which results in:

      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20
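
A variant that avoids inserting a separator character is to split on a zero-width lookahead. This is a sketch rather than part of the answer above, and it assumes Python 3.7 or later, where re.split can split on an empty match. The names strings and split_rows are illustrative.

```python
import re

strings = ['group9class1', 'group10class2', 'group11class20']

# (?=class) is a zero-width lookahead: it matches the position just before
# "class" without consuming characters, so nothing is lost in the split.
# maxsplit=1 stops after the first such position.
split_rows = [re.split(r'(?=class)', s, maxsplit=1) for s in strings]
print(split_rows)
```

The resulting list of pairs can be unpacked into two columns with zip(*split_rows), in the same way as new_cols above.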
Rob answered Oct 04 '22 16:10