
Replacing punctuation in a data frame based on punctuation list [duplicate]

Using Canopy and Pandas, I have a data frame, a, which is defined by:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

text.txt is a single-column file containing a list of strings made up of text, numbers and punctuation.

Assuming df looks like:


test

%hgh&12

abc123!!!

porkyfries


I want my results to be:


test

hgh12

abc123

porkyfries


Effort so far:

import pandas as pd

from string import punctuation  # import the punctuation list from Python itself

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]  # define the dataframe


for p in list(punctuation):

    df2=df.test.str.replace(p,'')

    df2=pd.DataFrame(df2)

df2

The loop above basically just returns the same data set. I would appreciate any leads.

Edit: The reason I am using Pandas is that the data is huge, about 1M rows, and the code will later be applied to lists of up to 30M rows. Long story short, I need to clean the data very efficiently for big data sets.

BernardL asked Feb 10 '14 08:02


2 Answers

Using str.replace with the right regex would be easier:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

Use a regex whose pattern matches any character that is not alphanumeric or whitespace:

In [49]:

df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
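
Note that on pandas 2.0 and later, str.replace only interprets the pattern as a regular expression when regex=True is passed. The following is a minimal, self-contained sketch of the same idea. It also shows one caveat: \w treats the underscore as a word character, so underscores are kept. The 'snake_case!' value is made up for illustration and is not part of the question's data.

```python
import pandas as pd

# Hypothetical sample data; 'snake_case!' is added to show the underscore caveat.
df = pd.DataFrame({'text': ['%hgh&12', 'abc123!!!', 'porkyfries', 'snake_case!']})

# [^\w\s] matches any character that is neither a word character nor whitespace.
# regex=True is explicit because pandas 2.0+ treats the pattern as a literal by default.
cleaned = df['text'].str.replace(r'[^\w\s]', '', regex=True)

# \w includes the underscore, so remove it separately if it counts as punctuation.
cleaned_no_underscore = cleaned.str.replace('_', '', regex=False)

print(cleaned.tolist())
print(cleaned_no_underscore.tolist())
```

If underscores should stay (for example in identifiers), the first result is probably the one to keep.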
EdChum answered Sep 17 '22 01:09


To remove punctuation from a text column in your dataframe:

In:

import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)

pattern

Out:

'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'
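
One caveat with the pattern printed above: inside the character class, the backslash from string.punctuation escapes the ] that follows it, so a literal backslash is not itself matched. A hedged sketch of a variant that passes the punctuation through re.escape before building the class is shown here. The name safe_pattern and the sample string are made up for this illustration.

```python
import re
import string

# Pattern built as in the answer: the backslash ends up escaping the following ']'.
pattern = r"[{}]".format(string.punctuation)

# Illustrative variant: escape every special character before building the class.
safe_pattern = "[{}]".format(re.escape(string.punctuation))

# Sample containing a backslash, brackets, a hyphen and a caret.
sample = r'a\b[c]-d^e'

print(re.sub(pattern, '', sample))       # the backslash survives
print(re.sub(safe_pattern, '', sample))  # all punctuation is removed
```

For data that never contains backslashes, the two patterns should behave the same.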

In:

df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Out:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick % 

In:

df['text'] = df['text'].str.replace(pattern, '', regex=True)
df

To substitute a character instead of deleting the punctuation, pass it as the second argument, e.g. replace(pattern, '$').

Out:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick  
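
Since the question mentions millions of rows, a non-regex alternative may be worth timing: Series.str.translate with a table built by str.maketrans deletes every character in string.punctuation in one pass per string. This is only a sketch on a shortened version of the sample data. Whether it is actually faster depends on the data and the pandas version, so measure before relying on it.

```python
import string

import pandas as pd

# Shortened sample taken from the data frame above.
df = pd.DataFrame({'text': ['book...regh', 'boo,', '"rope"', 'rick % ']})

# Translation table that maps every punctuation character to None (i.e. deletes it).
table = str.maketrans('', '', string.punctuation)

# str.translate applies the table to each string; no regex is involved.
df['text'] = df['text'].str.translate(table)

print(df['text'].tolist())
```

Unlike the [^\w\s] pattern, this removes exactly the ASCII characters in string.punctuation (including the underscore and the backslash) and leaves any non-ASCII punctuation untouched.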
Aakash Saxena answered Sep 19 '22 01:09