
`re.sub()` in pandas

Say I have:

s = 'white male, 2 white females'

And want to "expand" this to:

'white male, white female, white female'

A more complete list of cases would be:

  • 'two hispanic males, two hispanic females'
    • --> 'hispanic male, hispanic male, hispanic female, hispanic female'
  • '2 black males, white male'
    • --> 'black male, black male, white male'

It seems like I am close with:

import re

# Do I need boundaries here?
mult = re.compile('two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

# This works:
s = 'white male, 2 white females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'white male, white female, white female'

# This fails:
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# ' ,   hispanic males, hispanic female, hispanic female'

What is creating the trip-up in the second case?

Bonus question: Is there a method of pandas' Series that implements this functionality directly instead of using Series.apply()? Sorry to revise my question and waste anyone's time here.

For instance, on:

s = pd.Series(
    ['white male',
     'white male, white female',
     'hispanic male, 2 hispanic females',
     'black male, 2 white females'])

Is there a faster route than:

s.apply(lambda x: mult.sub(..., x))

Brad Solomon asked Jan 19 '18 19:01


2 Answers

With regard to your "bonus" question, you can use pandas.Series.str.replace, one of the pandas.Series.str methods, which accepts a regex pattern (on pandas 2.0 and later, pass regex=True explicitly, since regex matching is no longer the default):

In [10]: import re

In [11]: import pandas as pd

In [12]: s = pd.Series(
    ...:     ['white male',
    ...:      'white male, white female',
    ...:      'hispanic male, 2 hispanic females',
    ...:      'black male, 2 white females'])

In [13]: mult = re.compile('(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

In [14]: s.str.replace(mult, r'\g<race> \g<gender>, \g<race> \g<gender>', regex=True)
Out[14]:
0                                         white male
1                           white male, white female
2    hispanic male, hispanic female, hispanic female
3             black male, white female, white female
dtype: object

I don't know whether these methods are significantly faster than .apply. I suspect that nothing will be very fast when working with object dtypes.

Note: I found this issue regarding these methods being on the slow side. I suppose that until someone decides it is worth writing a Cythonized implementation, you probably can't hope for much.
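
Since a Series of strings holds ordinary Python str objects, one baseline worth timing against .apply and .str.replace is a plain list comprehension over the values. The sketch below uses a plain list as a stand-in for the Series values (what s.tolist() would return); expand_all and template are hypothetical names, and the timing is only a rough way to measure, not a claim about which route wins.

```python
import re
import timeit

# Grouped alternation so that both 'two' and '2' require the rest of the phrase.
mult = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
template = r'\g<race> \g<gender>, \g<race> \g<gender>'

# Plain list standing in for the Series values (assumption for illustration).
values = [
    'white male',
    'white male, white female',
    'hispanic male, 2 hispanic females',
    'black male, 2 white females',
]


def expand_all(items):
    """Apply the compiled substitution to every string in items."""
    return [mult.sub(template, item) for item in items]


expanded = expand_all(values)
print(expanded)

# Rough timing of the baseline; the number depends entirely on the machine.
elapsed = timeit.timeit(lambda: expand_all(values), number=1000)
print(f'{elapsed:.4f} s for 1000 runs')
```

With pandas, the result could be wrapped back up as pd.Series(expand_all(s.tolist()), index=s.index) and timed the same way against the other two approaches.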

juanpa.arrivillaga answered Oct 08 '22 13:10


IIUC, you need to put parentheses around two|2, as in (two|2), if you want either alternative to be followed by the rest of the pattern.
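
To see why the ungrouped pattern trips on 'two', it may help to look at what the first match actually captures. The sketch below assumes Python 3.5 or later, where a named group that did not take part in the match expands to an empty string in the replacement; the names loose, grouped, and template are only illustrative.

```python
import re

template = r'\g<race> \g<gender>, \g<race> \g<gender>'
s = 'two hispanic males, 2 hispanic females'

# Ungrouped: parsed as 'two'  OR  '2 <race> <gender>s'.
loose = re.compile(r'two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
# Grouped: ('two' OR '2') followed by ' <race> <gender>s'.
grouped = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

# The first match of the ungrouped pattern is the bare word 'two',
# and neither named group took part in it.
first = loose.search(s)
print(repr(first.group(0)))  # 'two'
print(first.groupdict())     # {'race': None, 'gender': None}

# Substituting therefore replaces 'two' with a template of empty groups.
loose_result = loose.sub(template, s)
grouped_result = grouped.sub(template, s)
print(repr(loose_result))
print(repr(grouped_result))
```

Because | has the lowest precedence of the regex operators, everything to its right belongs to the second alternative only, which is why the word 'two' on its own gets replaced by stray spaces and a comma.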

import re

mult = re.compile('(two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'hispanic male, hispanic male, hispanic female, hispanic female'
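
On the question's side note about boundaries: without them, the grouped pattern can also match inside a longer token, for example the trailing 2 of 12. A possible tightening, sketched here as an assumption about the intended inputs rather than a required fix, wraps the pattern in \b anchors; unbounded and bounded are hypothetical names.

```python
import re

template = r'\g<race> \g<gender>, \g<race> \g<gender>'

# Same grouped pattern, without and with word boundaries at both ends.
unbounded = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
bounded = re.compile(r'\b(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s\b')

# '12' is not a count of two, but the unbounded pattern matches its last digit.
tricky = '12 white males'
unbounded_result = unbounded.sub(template, tricky)
bounded_result = bounded.sub(template, tricky)
print(repr(unbounded_result))  # '1white male, white male'
print(repr(bounded_result))    # '12 white males'

# Ordinary inputs behave the same with the boundaries in place.
normal_result = bounded.sub(template, '2 black males, white male')
print(repr(normal_result))     # 'black male, black male, white male'
```

Whether the boundaries are worth adding depends on whether inputs such as 12 can occur in the data at all.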

Tai answered Oct 08 '22 15:10