Following the advice in this article on reducing Pandas DataFrame memory usage, I'm using .astype('|S') on object columns like so:
data_frame['COLUMN1'] = data_frame['COLUMN1'].astype('|S')
data_frame['COLUMN2'] = data_frame['COLUMN2'].astype('|S')
Performing this on the DataFrame cuts memory usage by 20-40% without negative impacts on processing the columns. However, when outputting the file using .to_csv():
data_frame.to_csv(filename, sep='\t', encoding='utf-8')
The columns converted with .astype('|S') are written with a b prefix and single quotes:
b'00001234' b'Source'
Removing the .astype('|S') call and outputting to csv gives the expected behavior:
00001234 Source
Some googling turns up GitHub issues, but I don't think they are related (and they appear to have been fixed already): "to_csv and bytes on Python 3" and "BUG: Fix default encoding for CSVFormatter.save".
I'm on Python 3.6.4 and Pandas 0.22.0, and I confirmed the behavior is the same on both macOS and Windows. Any advice on how to output the columns without the b prefix and single quotes?
The 'b' prefix is how Python 3 displays a bytes object, as opposed to a Unicode string (str). After .astype('|S') the column holds bytes values, and .to_csv() writes each one in its string form, b'...'. To remove the prefix, decode each bytes object with its decode method before saving to a CSV file:
data_frame['COLUMN1'] = data_frame['COLUMN1'].apply(lambda s: s.decode('utf-8'))
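For several columns, the same idea can be written with the vectorized .str.decode accessor. The following is a minimal sketch rather than a definitive fix: the sample data and the helper name decode_bytes_columns are made up for illustration, and the output goes to an in-memory buffer instead of the filename from the question.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the DataFrame in the question;
# the columns already hold bytes values, as they would after .astype('|S').
data_frame = pd.DataFrame({
    'COLUMN1': [b'00001234', b'00005678'],
    'COLUMN2': [b'Source', b'Target'],
})


def decode_bytes_columns(df, columns, encoding='utf-8'):
    """Return a copy of df with the given bytes columns decoded to str.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    for col in columns:
        # Series.str.decode calls bytes.decode on every element.
        out[col] = out[col].str.decode(encoding)
    return out


# Decode on a copy right before export; the original keeps its bytes columns.
decoded = decode_bytes_columns(data_frame, ['COLUMN1', 'COLUMN2'])

# Write to an in-memory text buffer so the result can be inspected.
buffer = io.StringIO()
decoded.to_csv(buffer, sep='\t', index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the helper works on a copy, data_frame itself still holds bytes, so decoding only happens at export time.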