
remove punctuation for each row in a pandas data frame [duplicate]

I am new to Python, so this may be a very basic question. I am trying to use a lambda to remove punctuation from each row in a pandas DataFrame. I used the following, but received an error. I am trying to avoid having to convert the df into a list, append the cleaned results to a new list, and then convert it back to a df.

Any suggestions would be appreciated!

import string

df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)
RJL asked Oct 09 '15 22:10





2 Answers

You need to iterate over the string in the dataframe, not over string.punctuation. You also need to build the string back up using .join().

df['cleaned'] = df['old'].apply(lambda x:''.join([i for i in x 
                                                  if i not in string.punctuation]))

When lambda expressions get long like that it can be more readable to write out the function definition separately, e.g. (thanks to @AndyHayden for the optimization tips):

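PUNCTUATION = frozenset(string.punctuation)  # build the lookup set once, not per character

def remove_punctuation(s):
    # keep only the characters that are not punctuation
    return ''.join([i for i in s if i not in PUNCTUATION])

df['cleaned'] = df['old'].apply(remove_punctuation)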
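
To try the per-row filter on something concrete, here is a minimal sketch on made-up data. The sample strings are assumptions for illustration; `old` and `cleaned` are the column names from the question. The `str.translate` variant is not part of the answer above; it is another standard-library way to delete the same characters, shown only for comparison.

```python
import string

import pandas as pd

# Build the lookup set once so each character test is a cheap membership check.
PUNCTUATION = frozenset(string.punctuation)


def remove_punctuation(s):
    """Return s with every ASCII punctuation character dropped."""
    return ''.join(ch for ch in s if ch not in PUNCTUATION)


# Alternative (an assumption, not from the answer): a translation table
# that maps every punctuation character to None, i.e. deletes it.
DELETE_PUNCTUATION = str.maketrans('', '', string.punctuation)

# Hypothetical sample data.
df = pd.DataFrame({'old': ['Hello, world!', 'a..b', "it's (fine)"]})

df['cleaned'] = df['old'].apply(remove_punctuation)
df['cleaned_translate'] = df['old'].apply(lambda s: s.translate(DELETE_PUNCTUATION))

print(df['cleaned'].tolist())  # ['Hello world', 'ab', 'its fine']
```

Both columns should hold the same values. Note that either version assumes every cell is a string; missing values would need handling first.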
mechanical_meat answered Sep 28 '22 08:09



Using a regex will most likely be faster here:

In [10]: import re, string  # pandas is assumed to be imported as pd

In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])  # perhaps this is available in the re/regex library?

In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])

In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)  # regex=True is required on pandas 2.0+, where the pattern is literal by default
Out[13]:
0    ab
1    cd
2    ef
dtype: object
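
A script version of the same idea, as a rough sketch: instead of joining every character with `|`, the pattern can be written as a single character class. This is a variation on the answer rather than the answer's own code. `regex=True` is passed explicitly because newer pandas releases treat the pattern as a literal string by default.

```python
import re
import string

import pandas as pd

# Character class matching any single ASCII punctuation character.
# re.escape keeps characters such as ], ^, - and \ safe inside the brackets.
RE_PUNCTUATION = '[' + re.escape(string.punctuation) + ']'

s = pd.Series(["a..b", "c<=d", "e|}f"])

# regex=True makes pandas interpret the pattern as a regular expression.
cleaned = s.str.replace(RE_PUNCTUATION, "", regex=True)

print(cleaned.tolist())  # ['ab', 'cd', 'ef']
```

Applied to the question's frame, this would read `df['cleaned'] = df['old'].str.replace(RE_PUNCTUATION, "", regex=True)`.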
Andy Hayden answered Sep 28 '22 09:09
