Remove non-ASCII characters from pandas column

I have been working on this issue for a while. I am trying to remove non-ASCII characters from the DB_user column and replace them with spaces, but I keep getting errors. This is how my data frame looks:

    +-----------------------------------------------------------
    |      DB_user                            source   count  |
    +-----------------------------------------------------------
    | ???/"Ò|Z?)?]??C %??J                      A        10   |
    | ?D$ZGU   ;@D??_???T(?)                    B         3   |
    | ?Q`H??M'?Y??KTK$?Ù‹???ЩJL4??*?_??        C         2   |
    +-----------------------------------------------------------

I was using this function, which I had come across while researching the problem on SO.

    def filter_func(string):
        for i in range(0, len(string)):
            if ord(string[i]) < 32 or ord(string[i]) > 126:
                break
        return ''

And then using the apply function:

    df['DB_user'] = df.apply(filter_func, axis=1)

I keep getting the error:

  'ord() expected a character, but string of length 66 found', u'occurred at index 2' 

However, I thought that by looping inside filter_func I was handling this, since each call passes a single character to ord. So the moment it hits a non-ASCII character, that character should be replaced by a space.

Could somebody help me out?

Thanks!

red_devil asked Mar 31 '16 18:03



1 Answer

You may try this:

    df.DB_user.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
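Note that the pattern above deletes the matched characters, while the question asks for spaces. A minimal sketch of that variant follows, using the same character class with `Series.str.replace` and assigning the result back to the column. The sample frame is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical sample data standing in for the frame in the question.
df = pd.DataFrame({
    "DB_user": ["abc\u00d2def", "plain", "\u0429JL4"],
    "source": ["A", "B", "C"],
    "count": [10, 3, 2],
})

# Replace each run of non-ASCII characters with a single space and
# assign the result back instead of relying on inplace=True.
df["DB_user"] = df["DB_user"].str.replace(r"[^\x00-\x7F]+", " ", regex=True)

print(df["DB_user"].tolist())
```

Dropping the `+` from the pattern would give one space per removed character rather than one space per run, which may matter if column widths need to be preserved.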
MaxU - stop WAR against UA answered Sep 18 '22 15:09
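As for the error in the question, a likely cause is that `df.apply(filter_func, axis=1)` hands each whole row to the function as a Series, so `string[i]` is an entire cell rather than one character, and `ord` rejects it. The function also returns `''` in every case. A rough sketch of the per-character idea, applied to the column itself, could look like this. The name `replace_non_printable` and the sample values are assumptions for illustration only.

```python
import pandas as pd

def replace_non_printable(text, filler=" "):
    """Replace every character outside ASCII codes 32..126 with filler."""
    return "".join(ch if 32 <= ord(ch) <= 126 else filler for ch in text)

# Hypothetical sample data, not the frame from the question.
df = pd.DataFrame({
    "DB_user": ["?D$\u00d2Z", "ok\tuser"],
    "source": ["A", "B"],
})

# Series.apply calls the function once per cell, so it receives a string.
df["DB_user"] = df["DB_user"].apply(replace_non_printable)

print(df["DB_user"].tolist())
```

This keeps the 32..126 range from the question, so control characters such as tabs are replaced as well. On large columns the regex approach is usually the faster choice.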