
"seperator ="§" should be a single byte character, but is 2 bytes long" error when reading a CSV into a Polars dataframe in Python

I'm using the following code to read a CSV file into a Polars data frame. My CSV uses § (section sign) as the separator.

import polars as po

po_df = po.read_csv(<*full file path*>, separator='§', has_header=True, quote_char='"', encoding='utf8')


print(po_df.columns)

read_csv throws the following error:

"seperator ="§" should be a single byte character, but is 2 bytes long"

I tried running it with different encodings, but that didn't work.

Any help to resolve this error would be much appreciated.

Kaustav Nandy asked Aug 31 '25 23:08



2 Answers

When you specify a non-UTF-8 encoding, polars decodes the file with whatever encoding you give it, writes the result to a BytesIO as UTF-8, and then parses that with its native Rust reader. The separator isn't converted along the way, so the call still fails. To get around that, you can do what polars was going to do anyway, but also replace your separator with a single-byte character. I'm using a control character here (\x1f, the ASCII unit separator), so a collision with the data is unlikely.

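from io import BytesIO

import polars as pl

# Decode the file, swap '§' for the single-byte '\x1f', then give polars UTF-8 bytes.
with open(YOUR_FILE, 'r', encoding='latin-1') as s, BytesIO() as b:
    b.write(s.read().replace('§', '\x1f').encode('utf-8'))
    b.seek(0)
    df = pl.read_csv(b, separator='\x1f', has_header=True, quote_char='"')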
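One caveat with a plain str.replace is that it also rewrites any '§' that sits inside a quoted field. If that matters for your data, a quote-aware variant can be sketched with the standard library's csv module. The helper name swap_separator and its parameters are made up for illustration (they are not part of polars), and the sketch assumes the whole file fits in memory.

```python
import csv
import io


def swap_separator(text_stream, old_sep="§", new_sep="\x1f", quote_char='"'):
    """Rewrite CSV text with a new delimiter; return UTF-8 bytes in a BytesIO.

    Hypothetical helper for illustration only. `text_stream` is any
    already-decoded text file object (ideally opened with newline="").
    """
    # The replacement must be a single byte in UTF-8, which is the
    # constraint the error message in the question complains about.
    if len(new_sep.encode("utf-8")) != 1:
        raise ValueError("new_sep must encode to exactly one byte in UTF-8")

    out = io.StringIO(newline="")
    # csv works on decoded text, so '§' counts as one character here.
    reader = csv.reader(text_stream, delimiter=old_sep, quotechar=quote_char)
    writer = csv.writer(out, delimiter=new_sep, quotechar=quote_char,
                        lineterminator="\n")
    for row in reader:
        # Fields are re-quoted only where the new delimiter requires it.
        writer.writerow(row)
    return io.BytesIO(out.getvalue().encode("utf-8"))


# In-memory sample: the '§' inside the quoted field must survive.
sample = io.StringIO('name§note\nalice§"see §4"\n', newline="")
buf = swap_separator(sample)
print(buf.getvalue())
```

With a real file, open it with its actual encoding and newline="" and pass the handle in. The returned buffer can then go to pl.read_csv with separator='\x1f', as in the snippet above.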
Dean MacGregor answered Sep 03 '25 11:09



The character you're using as a separator, '§', can't be encoded as a single byte in UTF-8. If you run the following, you'll see that it prints two bytes:

print('§'.encode('utf-8'))  # or alternatively: print(bytes('§', 'utf-8'))
# prints: b'\xc2\xa7'

Whereas a typical separator like a comma ',' only occupies one byte.

If you use the encoding latin-1 instead, the '§' character only occupies one byte \xa7, but I suspect your CSV file will need to use the same encoding.
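To make the encoding point concrete, here is a small standard-library sketch. The helper name byte_length is made up for illustration. It compares how many bytes '§' takes in each encoding and shows what goes wrong when the encoding used to read the data doesn't match the one it was written with.

```python
def byte_length(char, encoding):
    """Number of bytes `char` occupies in `encoding` (illustrative helper)."""
    return len(char.encode(encoding))


# '§' needs two bytes in UTF-8 but only one in latin-1.
for encoding in ("utf-8", "latin-1"):
    print(encoding, byte_length("§", encoding), "§".encode(encoding))

# UTF-8 bytes decoded as latin-1: no error, but '§' becomes two characters.
mismatched = "§".encode("utf-8").decode("latin-1")
print(repr(mismatched))

# latin-1 bytes decoded as UTF-8: a lone 0xA7 byte is not valid UTF-8.
try:
    "§".encode("latin-1").decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

So switching the encoding argument only helps if the file on disk really was saved in that encoding.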

JRiggles answered Sep 03 '25 11:09
