I have a pandas DataFrame with multiple columns whose values are mixed with unwanted characters.
columnA     columnB  columnC     ColumnD
\x00A\x00B  NULL     \x00C\x00D  123
\x00E\x00F  NULL     NULL        456
What I'd like to do is transform the DataFrame into the one below:
columnA  columnB  columnC  ColumnD
AB       NULL     CD       123
EF       NULL     NULL     456
With my code below I can remove '\x00' from columnA, but columnC is trickier because it is mixed with NULL in certain rows.
import numpy as np

col_names = cols_to_clean
fixer = dict.fromkeys([0x00], u'')
for i in col_names:
    if not df[i].isnull().any():
        if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))
Is there any efficient way to remove unwanted characters from columnC?
In general, to remove non-ASCII characters, use str.encode with errors='ignore', then decode back to str:
df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
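For instance, on a small Series with stray accented characters (a minimal sketch; the data and the column name 'col' are made up for illustration):

```python
import pandas as pd

# Hypothetical data: 'café' and 'naïve' contain non-ASCII characters
df = pd.DataFrame({'col': ['café', 'naïve', 'abc']})

# Encode to ASCII, silently dropping unencodable characters,
# then decode the bytes back into a str column
df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')

print(df['col'].tolist())  # ['caf', 'nave', 'abc']
```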
To perform this on multiple string columns, use
u = df.select_dtypes(object)
df[u.columns] = u.apply(
lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
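A quick sketch of the multi-column version (toy data; note that select_dtypes(object) leaves numeric columns untouched, so they don't hit the str accessor):

```python
import pandas as pd

# Hypothetical frame mixing string and numeric columns
df = pd.DataFrame({
    'a': ['fooé', 'bar'],
    'b': ['bäz', 'qux'],
    'n': [1, 2],  # numeric column, deliberately skipped
})

# Select only object (string) columns and clean them in one pass
u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

print(df['a'].tolist())  # ['foo', 'bar']
print(df['b'].tolist())  # ['bz', 'qux']
```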
Although that still won't handle the null characters in your columns, because '\x00' is itself a valid ASCII character and survives the encode/decode round trip. For those, you can replace them using regex:
df2 = df.replace(r'\W+', '', regex=True)