I am new to Python, so this may be a very basic question. I am trying to use a lambda to remove punctuation from each row of a pandas DataFrame column. I used the following, but received an error. I am trying to avoid having to convert the df into a list, append the cleaned results to a new list, and then convert it back to a df.
Any suggestions would be appreciated!
import string
df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)
So we use the negation (i.e. ^) inside a character class, such as [^\w\s], to denote that we want to replace any character that is neither a word character nor whitespace (i.e. punctuation) with an empty string.
To remove punctuation with pandas, we can use the Series' str.replace method. We call str.replace with a regex that matches all punctuation characters and replace them with empty strings; in recent pandas versions, pass regex=True so the pattern is treated as a regex rather than a literal string. str.replace returns a new Series, which we assign back to df['text'].
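A minimal sketch of that approach. The sample data is made up, and the pattern `[^\w\s]` is one common choice rather than the only one; note that `\w` includes the underscore, so underscores survive this pattern.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'text' comes from the answer above.
df = pd.DataFrame({"text": ["Hello, world!", "it's 5 o'clock...", "no punctuation"]})

# [^\w\s] matches any character that is neither a word character nor whitespace.
# regex=True is passed explicitly because newer pandas versions default to literal matching.
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)

print(df["text"].tolist())
```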
Add df = df.astype(float) after the replace and you've got it. I'd skip inplace and just do df = df.replace('\*', '', regex=True).
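As a rough illustration of that suggestion, here is a sketch that strips asterisks from every cell and then converts the frame to floats. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical data: numeric strings, some with a trailing asterisk.
df = pd.DataFrame({"a": ["1.5*", "2*"], "b": ["3", "4.25*"]})

# Remove every literal '*' (escaped, since '*' is a regex metacharacter),
# then convert the cleaned strings to floats.
df = df.replace(r"\*", "", regex=True).astype(float)

print(df)
```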
You need to iterate over the string in the dataframe, not over string.punctuation. You also need to build the string back up using .join().
df['cleaned'] = df['old'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
When lambda expressions get long like that it can be more readable to write out the function definition separately, e.g. (thanks to @AndyHayden for the optimization tips):
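PUNCTUATION = frozenset(string.punctuation)  # build the lookup set once, not per character

def remove_punctuation(s):
    # Keep only the characters that are not punctuation.
    return ''.join(c for c in s if c not in PUNCTUATION)

df['cleaned'] = df['old'].apply(remove_punctuation)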
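For a quick check, the function-based approach can be run end to end on a small frame. The sample strings are made up; the column names `old` and `cleaned` follow the question. This sketch assumes every value in the column is a string, so missing values would need separate handling.

```python
import string

import pandas as pd

# Build the set once so each membership test is a constant-time lookup.
PUNCTUATION = frozenset(string.punctuation)


def remove_punctuation(s):
    """Return s with all ASCII punctuation characters removed."""
    return "".join(ch for ch in s if ch not in PUNCTUATION)


# Hypothetical sample data.
df = pd.DataFrame({"old": ["Hello, world!", "a..b", "c<=d"]})
df["cleaned"] = df["old"].apply(remove_punctuation)

print(df["cleaned"].tolist())
```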
Using a regex will most likely be faster here:
In [10]: import re
In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation]) # perhaps this is available in the re/regex library?
In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])
In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)
Out[13]:
0    ab
1    cd
2    ef
dtype: object
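A script-style variant of the same idea, shown only as a sketch: instead of joining every character with `|`, it wraps the escaped punctuation in a single character class, which is a slightly different pattern from the one in the session. The name `PUNCT_CLASS` is made up for this example, and no timing claim is implied.

```python
import re
import string

import pandas as pd

# One character class containing every escaped punctuation character.
# PUNCT_CLASS is a hypothetical name used only in this sketch.
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

s = pd.Series(["a..b", "c<=d", "e|}f"])

# regex=True makes pandas treat the pattern as a regular expression.
cleaned = s.str.replace(PUNCT_CLASS, "", regex=True)

print(cleaned.tolist())
```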