I have a Python 3.x pandas DataFrame in which certain columns hold strings that are stored as bytes (as in Python 2.x):
import pandas as pd
df = pd.DataFrame(...)
df
         COLUMN1  ....
0       b'abcde'  ....
1         b'dog'  ....
2        b'cat1'  ....
3       b'bird1'  ....
4   b'elephant1'  ....
When I access by column with df.COLUMN1, I see Name: COLUMN1, dtype: object
However, if I access a single element, it is a "bytes" object:
df.COLUMN1.ix[0].dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'
How do I convert these into "regular" strings? That is, how can I get rid of the b'' prefix?
The decode() method: given a bytes object, you can call its built-in decode() method to convert the bytes to a string. You can also pass the encoding to this method as an argument.
The string encode() and bytes decode() methods provide symmetry, whereas the bytes() constructor is a more object-oriented and readable approach. You can choose either based on your preference.
The Python bytes decode() method converts bytes to a string object. Both encode() and decode() let you specify the error handling scheme to use for encoding/decoding errors. The default is 'strict', meaning that encoding errors raise a UnicodeEncodeError and decoding errors raise a UnicodeDecodeError.
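To make the error handling schemes concrete, here is a minimal sketch using only the standard library. The sample byte strings are made up for illustration and do not come from the question's data.

```python
# UTF-8 bytes for "café" decode cleanly.
raw = b"caf\xc3\xa9"
text = raw.decode("utf-8")

# The same word encoded as Latin-1 is not valid UTF-8.
bad = b"caf\xe9"

# Default scheme is 'strict': invalid input raises UnicodeDecodeError.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD for the undecodable byte.
replaced = bad.decode("utf-8", errors="replace")

# 'ignore' silently drops the undecodable byte.
ignored = bad.decode("utf-8", errors="ignore")

# Decoding with the encoding the bytes were actually written in works.
latin = bad.decode("latin-1")
```

If the bytes in your DataFrame did not come from a UTF-8 source, picking the right encoding is usually preferable to 'replace' or 'ignore', since both of those lose information.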
You can use vectorised str.decode to decode byte strings into ordinary strings:
df['COLUMN1'].str.decode("utf-8")
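Note that str.decode returns a new Series rather than modifying the column in place, so the result has to be assigned back. A small sketch, assuming a toy frame shaped like the one in the question (the values here are illustrative):

```python
import pandas as pd

# Toy frame with a bytes column, mimicking the question's COLUMN1.
df = pd.DataFrame({"COLUMN1": [b"abcde", b"dog", b"cat1"]})

# str.decode builds a new Series of str values; the original is untouched
# until we assign the result back to the column.
df["COLUMN1"] = df["COLUMN1"].str.decode("utf-8")

first = df["COLUMN1"].iloc[0]  # now a plain str, no b'' prefix
```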
To do this for multiple columns, you can select just the object-dtype columns (np.object was removed in NumPy 1.24; the builtin object works the same way):
str_df = df.select_dtypes([object])
Then convert all of them:
str_df = str_df.stack().str.decode('utf-8').unstack()
You can then swap the converted columns back into the original df:
for col in str_df:
    df[col] = str_df[col]
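One caveat with the approach above: object-dtype columns can also hold values that are already str (or a mix of both), and str.decode does not pass those through unchanged. As an alternative, here is a hedged sketch of a helper that decodes only the values that are actually bytes. The name decode_bytes_columns is hypothetical, not a pandas function, and it assumes column labels are unique.

```python
import pandas as pd


def decode_bytes_columns(frame, encoding="utf-8"):
    """Return a copy of `frame` with bytes values decoded to str.

    Hypothetical helper for illustration; only object-dtype columns
    that contain at least one bytes value are touched.
    """
    out = frame.copy()
    for col in out.select_dtypes(include=[object]).columns:
        is_bytes = out[col].map(lambda v: isinstance(v, bytes))
        if is_bytes.any():
            # Decode bytes; leave every other value as it is.
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Illustrative data: one pure-bytes column, one mixed, one numeric.
sample = pd.DataFrame({"a": [b"x", b"y"], "b": ["p", b"q"], "n": [1, 2]})
result = decode_bytes_columns(sample)
```

This runs a Python-level function per value, so it will be slower than the vectorised str.decode on large frames; it trades speed for not disturbing non-bytes values.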