Say I have:
s = 'white male, 2 white females'
And want to "expand" this to:
'white male, white female, white female'
A more complete list of cases would be:
It seems like I am close with:
import re
# Do I need boundaries here?
mult = re.compile('two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
# This works:
s = 'white male, 2 white females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'white male, white female, white female'
# This fails:
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# ' , , hispanic males, hispanic female, hispanic female,'
What is creating the trip-up in the second case?
Bonus question: is there a method of pandas' Series that implements this functionality directly, instead of using Series.apply()? Sorry to revise my question and waste anyone's time here.
For instance, on:
s = pd.Series(
['white male',
'white male, white female',
'hispanic male, 2 hispanic females',
'black male, 2 white females'])
Is there a faster route than:
s.apply(lambda x: mult.sub(..., x))
Regarding your "bonus" question, you can use pandas.Series.str.replace, one of the pandas.Series.str methods, which accepts regular expressions:
In [10]: import re
In [11]: import pandas as pd
In [12]: s = pd.Series(
...: ['white male',
...: 'white male, white female',
...: 'hispanic male, 2 hispanic females',
...: 'black male, 2 white females'])
In [13]: mult = re.compile('two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
...:
In [14]: s.str.replace(mult, r'\g<race> \g<gender>, \g<race> \g<gender>', regex=True)
Out[14]:
0 white male
1 white male, white female
2 hispanic male, hispanic female, hispanic female
3 black male, white female, white female
dtype: object
Whether these methods are significantly faster than .apply, I don't know. I suspect that you'll never be very fast working with object dtypes.
Note: I found an issue about these methods being on the slow side. I suppose that until the pandas developers decide it is worth writing a Cythonized implementation, you probably can't hope for much.
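If you want to check the speed question on your own data, here is a rough sketch of how I would compare the two routes. The series size (the four sample rows repeated 1000 times) and the repeat count are arbitrary choices of mine, and I have used a grouped count, (?:two|2), so that "two" is handled as well. Recent pandas releases default to regex=False in str.replace, so regex=True is passed explicitly.

```python
import re
import timeit

import pandas as pd

# Grouped alternation so that both "two" and "2" are treated as the count.
mult = re.compile(r'(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
repl = r'\g<race> \g<gender>, \g<race> \g<gender>'

# Assumption: an arbitrary test size, the sample rows repeated 1000 times.
rows = ['white male',
        'white male, white female',
        'hispanic male, 2 hispanic females',
        'black male, 2 white females']
s = pd.Series(rows * 1000)

# Both routes should give identical results.
via_str = s.str.replace(mult, repl, regex=True)
via_apply = s.apply(lambda x: mult.sub(repl, x))
same = via_str.tolist() == via_apply.tolist()

# Time each route; the numbers depend on the machine and pandas version.
t_str = timeit.timeit(lambda: s.str.replace(mult, repl, regex=True), number=20)
t_apply = timeit.timeit(lambda: s.apply(lambda x: mult.sub(repl, x)), number=20)
print(f'same result: {same}')
print(f'str.replace: {t_str:.4f}s  apply: {t_apply:.4f}s')
```

I would not read too much into small differences here; both routes end up calling the same compiled pattern once per row.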
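IIUC, you need to put parentheses around two|2, like (two|2), if you want to match either. Without them, the | splits the whole pattern in two, so it matches either the bare word two or the full 2 <race> <gender>s phrase. A bare two leaves both named groups empty, which is where the stray commas in your output come from.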
import re
mult = re.compile('(two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')
s = 'two hispanic males, 2 hispanic females'
mult.sub(r'\g<race> \g<gender>, \g<race> \g<gender>', s)
# 'hispanic male, hispanic male, hispanic female, hispanic female'
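To make the trip-up visible, and to touch on the "do I need boundaries" comment in the question, here is a small sketch. The names loose and tight are mine, not from the question. The tight variant uses a non-capturing group, (?:...), so no extra numbered group is added, plus \b on both ends as one possible way to keep a 2 inside a longer number such as 12 from matching.

```python
import re

# Original pattern: "|" has the lowest precedence, so this reads as
# "two"  OR  "2 <race> <gender>s".
loose = re.compile(r'two|2 (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s')

# Grouped variant: the alternation is limited to the count, and \b stops
# the count from matching in the middle of a longer word or number.
tight = re.compile(r'\b(?:two|2) (?P<race>[a-z]+) (?P<gender>(?:fe)?male)s\b')

s = 'two hispanic males, 2 hispanic females'
repl = r'\g<race> \g<gender>, \g<race> \g<gender>'

# The first match of the loose pattern is just the word "two",
# with neither named group taking part in the match.
first = loose.search(s)
print(repr(first.group(0)))
print(first.groupdict())

# The grouped pattern expands both phrases.
expanded = tight.sub(repl, s)
print(expanded)

# With the boundaries, "12 white females" is left untouched.
untouched = tight.sub(repl, '12 white females')
print(untouched)
```

Whether you want the boundaries depends on your data; if counts above 9 never occur, the plain (two|2) version above is enough.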