applying regex to a pandas dataframe

I'm having trouble applying a regex function to a column in a pandas DataFrame. Here is the head of my DataFrame:

               Name   Season          School   G    MP  FGA  3P  3PA    3P%
74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625
84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455
176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000
177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300
243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448

I thought I had a pretty good grasp of applying functions to DataFrames, so maybe my regex skills are lacking.

Here is what I put together:

import re

def split_it(year):
    return re.findall('(\d\d\d\d)', year)

df['Season2'] = df['Season'].apply(split_it(x))

TypeError: expected string or buffer

The output would be a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.

Thanks for any help in advance.

itjcms18 asked Aug 13 '14 17:08



2 Answers

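When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't. Writing split_it(x) calls the function right away, before apply ever runs; apply wants the function itself.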

You could use either

df['Season2'] = df['Season'].apply(split_it) 

or

df['Season2'] = df['Season'].apply(lambda x: split_it(x)) 

but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.) Your function will return a list, though:

>>> df["Season"].apply(split_it)
74     [1982]
84     [1982]
176    [1982]
177    [1983]
243    [1982]
Name: Season, dtype: object

although you could easily change that. FWIW, I'd use vectorized string operations and do something like

>>> df["Season"].str[:4].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64

or

>>> df["Season"].str.split("-").str[0].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64
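
Putting the points above together, here is a minimal sketch. It assumes a small frame rebuilt from the Season values in the question, and first_year is a hypothetical helper name, not something from the original code. It passes the function object to apply and returns one string per row instead of a list.

```python
import re

import pandas as pd

# Sample rebuilt from the question's head(); only the Season column matters here.
df = pd.DataFrame(
    {"Season": ["1982-83", "1982-83", "1982-83", "1983-84", "1982-83"]},
    index=[74, 84, 176, 177, 243],
)


def first_year(season):
    """Hypothetical helper: return the first four-digit run, or None if absent."""
    match = re.search(r"\d{4}", season)
    return match.group(0) if match else None


# Pass the function itself (no parentheses); apply calls it once per value.
df["Season2"] = df["Season"].apply(first_year)

# Vectorized alternative: slice the first four characters and convert to int.
df["Season3"] = df["Season"].str[:4].astype(int)
```

Season2 holds strings here, while Season3 holds integers; which one is preferable depends on what you do with the column afterwards.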
DSM answered Sep 17 '22 18:09



You can simply use str.extract

df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}') 

Here the pattern \d{4}-\d{2} matches the whole season (for example 1982-83), but only the captured group in parentheses, \d{4}, is extracted (for example 1982).
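
A hedged sketch of the same idea: in recent pandas versions str.extract returns a DataFrame by default, so passing expand=False is one way to get a Series back when there is a single capture group. The frame below is a cut-down sample, and the "n/a" row is a made-up value added only to show what happens when the pattern does not match.

```python
import pandas as pd

# Two Season values from the question plus one hypothetical non-matching value.
df = pd.DataFrame({"Season": ["1982-83", "1983-84", "n/a"]})

# One capture group with expand=False yields a Series;
# rows where the pattern does not match come back as missing values.
df["Season2"] = df["Season"].str.extract(r"(\d{4})-\d{2}", expand=False)
```

The extracted values are strings, so a numeric conversion is still needed afterwards if you want integer years.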

Gabriel answered Sep 20 '22 18:09
