seperator ="§" should be a single byte character, but is 2 bytes long error in Python to read csv in Polars dataframe

Question

I'm using the following code to read a CSV file into a Polars DataFrame. The file uses § (the section sign) as its separator.

import polars as po

po_df = po.read_csv(<*full file path*>, separator= '§', has_header=True, quote_char='"', encoding='utf8')


print(po_df.columns)

read_csv throws the following error:

"seperator ="§" should be a single byte character, but is 2 bytes long"

I tried different encodings, but it still fails.

Any help to resolve this error would be much appreciated.

Dean MacGregor · Accepted Answer

When you specify a non-UTF-8 encoding, Polars decodes the file with that encoding and re-encodes it as UTF-8 into a BytesIO buffer before handing it to its native Rust reader. The separator is not converted along the way and must still be a single byte, so the call still fails. To get around that, you can do what Polars was going to do anyway and also replace your separator with something else. I'm using a control character here (\x1f, the ASCII unit separator), so a collision with the data is unlikely.

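from io import BytesIO
import polars as pl

# Decode as latin-1, swap '§' for the one-byte control character '\x1f', re-encode as UTF-8.
with open(YOUR_FILE, 'r', encoding='latin-1') as s, BytesIO() as b:
    b.write(s.read().replace('§', '\x1f').encode('utf-8'))
    b.seek(0)  # rewind so read_csv starts at the beginning of the buffer
    df = pl.read_csv(b, separator='\x1f', has_header=True, quote_char='"')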
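If you want to reuse the same idea, here is a minimal sketch of the decode, replace, re-encode step as a standalone helper. The function name recode_separator and its parameters are made up for illustration (they are not part of Polars), and the sample data is invented. It also refuses to run if the replacement character already appears in the data, which is the collision mentioned above.

```python
from io import BytesIO


def recode_separator(raw: bytes, src_encoding: str = "latin-1",
                     old_sep: str = "§", new_sep: str = "\x1f") -> BytesIO:
    """Decode raw CSV bytes, swap the separator, and return a UTF-8 buffer.

    Hypothetical helper for illustration only; not a Polars function.
    """
    text = raw.decode(src_encoding)
    # If the replacement already occurs in the data it would be
    # indistinguishable from a real separator, so stop here.
    if new_sep in text:
        raise ValueError(f"replacement separator {new_sep!r} already present in data")
    # The replacement has to be one byte in UTF-8 for the reader to accept it.
    if len(new_sep.encode("utf-8")) != 1:
        raise ValueError("replacement separator must encode to a single UTF-8 byte")
    # BytesIO(...) starts positioned at offset 0, so no seek is needed.
    return BytesIO(text.replace(old_sep, new_sep).encode("utf-8"))


# Invented sample: a tiny latin-1 encoded file held in memory.
sample = "name§city\nZoë§Köln\n".encode("latin-1")
buf = recode_separator(sample)
print(buf.getvalue())
```

The returned buffer can then be passed to pl.read_csv(buf, separator='\x1f') as in the answer. One thing to keep in mind: str.replace knows nothing about quoting, so a § inside a quoted field gets replaced as well.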

JRiggles · Answer

The character you're using as a separator, '§', can't be encoded as a single byte in UTF-8. If you run the following, you'll see that it prints two bytes:

print('§'.encode('utf-8'))  # or alternatively: print(bytes('§', 'utf-8'))
# prints: b'\xc2\xa7'

A typical separator such as a comma ',', by contrast, occupies only one byte.

If you use the latin-1 encoding instead, the '§' character occupies a single byte (\xa7), but I suspect your CSV file will need to use the same encoding.
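To make the comparison concrete, here is a small sketch that counts the bytes for a few candidate separators in both encodings, and shows why the file's encoding has to match. The helper name encoded_sizes is just something I made up for this example.

```python
def encoded_sizes(sep: str, encodings=("utf-8", "latin-1")) -> dict:
    """Return the number of bytes `sep` occupies in each encoding."""
    return {enc: len(sep.encode(enc)) for enc in encodings}


for sep in (",", "\x1f", "§"):
    print(repr(sep), encoded_sizes(sep))

# A latin-1 '§' is the lone byte 0xA7, which is not valid UTF-8 on its own,
# so bytes written as latin-1 cannot simply be decoded as UTF-8.
try:
    "§".encode("latin-1").decode("utf-8")
    mismatch_ok = True
except UnicodeDecodeError:
    mismatch_ok = False
print("latin-1 bytes readable as UTF-8:", mismatch_ok)
```

Only '§' changes size between the two encodings; the comma and the \x1f control character are plain ASCII and take one byte either way.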

Tags:

python

python-polars

Kaustav Nandy
