I have a data frame with a column that has information on the number of years / months the person has an account with the organization.
The field is in a custom text format,
e.g. '0yrs 11mon', '15yrs 4mon', etc.
Is there a way to extract the years and months into separate columns, like below:
Member Since | Year | Mon
-------------------------
'0yrs 11mon' | 0    | 11
'15yrs 4mon' | 15   | 4
You can use Series.str.extract:
df['Year'] = df['Member Since'].str.extract(r'(\d+)(?:yrs)', expand=False)
df['Mon'] = df['Member Since'].str.extract(r'(?:\s)(\d+)(?:mon)', expand=False)
The ?: is non-capturing group notation: the group has to match as part of the regular expression, but its text is not captured in the output.
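As a minimal sketch of the same idea, both numbers can be pulled out in one call by putting two capturing groups in a single pattern. This assumes every value has the form '<n>yrs <n>mon' as in the question; the sample frame here is rebuilt only for illustration, and the cast to int would fail on rows that do not match.

```python
import pandas as pd

# Sample data copied from the question (illustrative only)
df = pd.DataFrame({"Member Since": ["0yrs 11mon", "15yrs 4mon"]})

# Two capturing groups give a two-column result: years, then months
parts = df["Member Since"].str.extract(r"(\d+)yrs\s+(\d+)mon")

# extract returns strings; cast to int (assumes every row matched)
df[["Year", "Mon"]] = parts.astype(int)

print(df)
```

The extracted values are strings until cast, so the astype step is only needed if you want to do arithmetic on them.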
You can also try Series.str.extractall:
df[['year','mon']] = df['Member Since'].str.extractall(r'(\d+)').unstack()
df
Member Since year mon
0 0yrs 11mon 0 11
1 15yrs 4mon 15 4
The extractall function returns a DataFrame with one row for each match and one column for each group, so you need to unstack it to get one row per original entry.
Use the extractall method only if each value contains exactly two numbers, as shown in your question. If the unstacked result does not have exactly two columns, the assignment will raise an error.
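One way to guard against that, sketched here under the assumption that the data looks like the sample in the question, is to count the numbers in each value before calling extractall. The counts variable and the error message are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (illustrative only)
df = pd.DataFrame({"Member Since": ["0yrs 11mon", "15yrs 4mon"]})

# Count how many runs of digits each value contains
counts = df["Member Since"].str.count(r"\d+")

if (counts == 2).all():
    # Exactly two numbers per row, so unstack yields exactly two columns
    nums = df["Member Since"].str.extractall(r"(\d+)").unstack()
    df[["year", "mon"]] = nums.astype(int)
else:
    raise ValueError("expected exactly two numbers in every value")

print(df)
```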