I'm having trouble applying a regex function to a column in a Python dataframe. Here is the head of my dataframe:
     Name            Season   School          G   MP    FGA  3P  3PA  3P%
74   Joe Dumars      1982-83  McNeese State   29  NaN   487  5   8    0.625
84   Sam Vincent     1982-83  Michigan State  30  1066  401  5   11   0.455
176  Gerald Wilkins  1982-83  Chattanooga     30  820   350  0   2    0.000
177  Gerald Wilkins  1983-84  Chattanooga     23  737   297  3   10   0.300
243  Delaney Rudd    1982-83  Wake Forest     32  1004  324  13  29   0.448
I thought I had a pretty good grasp of applying functions to Dataframes, so maybe my Regex skills are lacking.
Here is what I put together:
import re

def split_it(year):
    return re.findall('(\d\d\d\d)', year)

df['Season2'] = df['Season'].apply(split_it(x))

TypeError: expected string or buffer
The output should be a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.
Thanks for any help in advance.
As we have seen, regular expressions can be used effectively with several pandas functions to extract or match patterns in a Series or a DataFrame. Regex is especially powerful for extraction, cleaning, and validation when you are working with text data.
The pandas replace() function is versatile: the values to replace can be given as a string, regex, dictionary, list, or Series, so values in a DataFrame can be swapped for others dynamically. It also works with Python regular expressions (regex). It differs from updating with .
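As a small illustration of replace() with a regular expression, here is a minimal sketch. The tiny DataFrame and the Season2 column are made up for the example, and stripping the "-83" suffix is just one way the pattern could be written.

```python
import pandas as pd

# Hypothetical two-row frame, shaped like the Season column in the question.
df = pd.DataFrame({"Season": ["1982-83", "1983-84"]})

# regex=True makes replace() treat the first argument as a pattern.
# Here the trailing "-NN" is replaced with an empty string.
df["Season2"] = df["Season"].replace(r"-\d{2}$", "", regex=True)

print(df)
```

Note that the result is still a column of strings; it would need a separate conversion to become numeric.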
When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't. Writing split_it(x) calls the function immediately instead of handing the function itself to apply.
You could use either
df['Season2'] = df['Season'].apply(split_it)
or
df['Season2'] = df['Season'].apply(lambda x: split_it(x))
but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.) Your function will return a list, though:
>>> df["Season"].apply(split_it)
74     [1982]
84     [1982]
176    [1982]
177    [1983]
243    [1982]
Name: Season, dtype: object
although you could easily change that. FWIW, I'd use vectorized string operations and do something like
>>> df["Season"].str[:4].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64
or
>>> df["Season"].str.split("-").str[0].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64
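If you do want to keep the regex-plus-apply approach, one possible way to get a plain year instead of a list is sketched below. The sample frame is rebuilt from the Season values and index in the question; using re.search and returning None when nothing matches are my own choices, not something from the original code.

```python
import re
import pandas as pd

# Season values and index copied from the question's dataframe head.
df = pd.DataFrame(
    {"Season": ["1982-83", "1982-83", "1982-83", "1983-84", "1982-83"]},
    index=[74, 84, 176, 177, 243],
)

def split_it(season):
    # re.search returns the first match (or None), so we get a single
    # value rather than the list that re.findall produces.
    match = re.search(r"\d{4}", season)
    return int(match.group()) if match else None

# Pass the function itself; apply calls it once per value.
df["Season2"] = df["Season"].apply(split_it)

print(df)
```

This still runs a Python function per row, so the vectorized string operations above are likely to be the better fit for large frames.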
You can simply use str.extract
df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}')
Here the pattern matches \d{4}-\d{2} (for example 1982-83), but only the captured group in parentheses, \d{4}, is extracted (for example 1982).
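For completeness, here is a minimal sketch of the str.extract approach on a made-up frame. The "n/a" row is an assumption added to show what happens when the pattern does not match; expand=False and pd.to_numeric are optional extras I added to get a numeric Series back.

```python
import pandas as pd

# Hypothetical sample; the last row deliberately does not match the pattern.
df = pd.DataFrame({"Season": ["1982-83", "1983-84", "n/a"]})

# With a single capture group, expand=False returns a Series
# instead of a one-column DataFrame.
years = df["Season"].str.extract(r"(\d{4})-\d{2}", expand=False)

# Extracted values are strings; rows without a match are missing (NaN),
# so the numeric conversion keeps them as NaN.
df["Season2"] = pd.to_numeric(years)

print(df)
```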