Replacing punctuation in a data frame based on punctuation list [duplicate]

Question

Using Canopy and Pandas, I have data frame a which is defined by:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

test.txt is a single column file that contains a list of string that contains text, numerical and punctuation.

Assuming df looks like:

test

%hgh&12

abc123!!!

porkyfries

I want my results to be:

test

hgh12

abc123

porkyfries

Effort so far:

from string import punctuation /-- import punctuation list from python itself

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"] /-- define the dataframe


for p in list(punctuation):

     ...:     df2=df.med.str.replace(p,'')

     ...:     df2=pd.DataFrame(df2);

     ...:     df2

The command above basically just returns me with the same data set. Appreciate any leads.

Edit: Reason why I am using Pandas is because data is huge, spanning to bout 1M rows, and future usage of the coding will be applied to list that go up to 30M rows. Long story short, I need to clean data in a very efficient manner for big data sets.

EdChum · Accepted Answer

Use replace with correct regex would be easier:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

use regex with the pattern which means not alphanumeric/whitespace

In [49]:

df['text'] = df['text'].str.replace('[^\w\s]','')
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]

Aakash Saxena · Answer

For removing punctuation from a text column in your dataframme:

In:

import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)

pattern

Out:

'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]'

In:

df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Out:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick %

In:

df['text'] = df['text'].str.replace(pattern, '')
df

You can replace the pattern with your desired character. Ex - replace(pattern, '$')

Out:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick

Replacing punctuation in a data frame based on punctuation list [duplicate]

Tags:

python

pandas

dataframe

large-data

BernardL

2 Answers

EdChum

Aakash Saxena

Recent Activity

Donate For Us

Replacing punctuation in a data frame based on punctuation list [duplicate]

Tags:

python

pandas

dataframe

large-data

BernardL

2 Answers

EdChum

Aakash Saxena

Related questions

Recent Activity

Donate For Us