
Applying Regex across entire column of a Dataframe

I have a Dataframe with 3 columns:

id,name,team 
101,kevin, marketing
102,scott,admin\n
103,peter,finance\n

I am trying to apply a regex so that I remove the unnecessary spaces. I have the code that removes these spaces; however, I am unable to loop it through the entire Dataframe.

This is what I have tried thus far:

df['team'] = re.sub(r'[\n\r]*','',df['team'])

But this throws an error AttributeError: 'Series' object has no attribute 're'

Could anyone advise how I could loop this regex through the entire df['team'] column?

hello kee asked Dec 28 '18 18:12



2 Answers

You are almost there. There are two simple ways of doing this:

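import re

# Note: str(x) also turns missing values (NaN) into the string 'nan'.

# option 1 - faster way: list comprehension over the column
df['team'] = [re.sub(r'[\n\r]*', '', str(x)) for x in df['team']]

# option 2 - apply the same substitution to each value
df['team'] = df['team'].apply(lambda x: re.sub(r'[\n\r]*', '', str(x)))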
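
A vectorized alternative, sketched here rather than taken from the answer above, is the pandas `Series.str.replace` method with `regex=True`, followed by `str.strip()` to drop the stray leading space in " marketing". The sample frame is an assumption: it mirrors the question's data and treats the trailing `\n` as real newline characters.

```python
import pandas as pd

# Assumed sample data modelled on the question; the "\n" are real newlines.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["kevin", "scott", "peter"],
    "team": [" marketing", "admin\n", "finance\n"],
})

# Remove every run of newline / carriage-return characters, then trim
# leading and trailing whitespace from each value.
df["team"] = df["team"].str.replace(r"[\n\r]+", "", regex=True).str.strip()

print(df["team"].tolist())
```

Unlike the `str(x)` conversion, the `.str` accessor leaves missing values as missing instead of turning them into the text 'nan'.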

YOLO answered Sep 22 '22 07:09


Since it's a DataFrame, have a look at replace: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

df['team'] = df['team'].replace({r"[\n\r]+": ''}, regex=True)

Regarding the regex, '*' means 0 or more, so it also matches the empty string; what you need is '+', which means 1 or more.
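
The difference between the two quantifiers can be seen with the standard `re` module alone. This is a small illustration with an assumed sample string, not part of the original answer:

```python
import re

text = "admin\n"  # assumed sample value with a real trailing newline

# '*' also matches the empty string, so the engine reports
# zero-length matches in addition to the real newline.
star_matches = re.findall(r"[\n\r]*", text)

# '+' requires at least one character, so only the newline matches.
plus_matches = re.findall(r"[\n\r]+", text)

# With an empty replacement, both patterns yield the same cleaned string.
cleaned_star = re.sub(r"[\n\r]*", "", text)
cleaned_plus = re.sub(r"[\n\r]+", "", text)

print(len(star_matches), len(plus_matches))
print(repr(cleaned_star), repr(cleaned_plus))
```

So the '*' version still cleans the value when the replacement is empty, but it does needless work on empty matches and would give surprising results with a non-empty replacement.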

josem8f answered Sep 26 '22 07:09