I'm using the following code to read a CSV file into a Polars DataFrame. The CSV uses § (section sign) as its separator.
import polars as po
po_df = po.read_csv(<*full file path*>, separator= '§', has_header=True, quote_char='"', encoding='utf8')
print(po_df.columns)
read_csv throws the following error:
"seperator ="§" should be a single byte character, but is 2 bytes long"
I tried running it with different encodings, but the error persists.
Any help to resolve this error would be much appreciated.
When you use a non-UTF-8 encoding, Polars decodes the file with whatever encoding you specify, writes the result to a BytesIO buffer re-encoded as UTF-8, and only then parses it with its native Rust reader. Your separator is not converted along the way, so it is still two bytes long and the call still fails. To get around that, you can do what Polars was going to do anyway and, in the same step, replace your separator with something else. I'm using a control character here (\x1f, the ASCII unit separator), so a collision with real data is unlikely.
import polars as pl
from io import BytesIO

# Decode as Latin-1, swap '§' for a one-byte control character, re-encode as UTF-8
with open(YOUR_FILE, 'r', encoding='latin-1') as s, BytesIO() as b:
    b.write(s.read().replace('§', '\x1f').encode('utf-8'))
    b.seek(0)
    df = pl.read_csv(b, separator='\x1f', has_header=True, quote_char='"')
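To see the recoding step on its own, here is a minimal sketch that uses only the standard library. The helper name `recode_separator` and the sample data are hypothetical, made up for illustration. It assumes the source file really is Latin-1, and it adds a check for the collision case mentioned above. The standard `csv` module stands in for Polars here purely to show that the recoded bytes parse with a one-byte separator.

```python
import csv
from io import BytesIO, StringIO


def recode_separator(raw: bytes, src_encoding: str, old_sep: str,
                     new_sep: str = "\x1f") -> bytes:
    """Decode raw bytes, swap the separator, and return UTF-8 bytes.

    Hypothetical helper: raises if the replacement separator already
    occurs in the data, since that would corrupt the column layout.
    """
    text = raw.decode(src_encoding)
    if new_sep in text:
        raise ValueError(f"replacement separator {new_sep!r} already occurs in the data")
    return text.replace(old_sep, new_sep).encode("utf-8")


# Made-up sample: a Latin-1 encoded CSV that uses '§' as its separator.
sample = 'name§city\n"Ana"§"São Paulo"\n'.encode("latin-1")

converted = recode_separator(sample, "latin-1", "§")
buffer = BytesIO(converted)  # file-like object that could be passed to a CSV reader

# Parse the recoded text with the standard csv module to confirm the layout.
rows = list(csv.reader(StringIO(converted.decode("utf-8")),
                       delimiter="\x1f", quotechar='"'))
print(rows)
```

One caveat with this approach: a plain string replace also rewrites any § that appears inside a quoted field, so it only suits data where § is used purely as the separator.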
The character you're using as a separator '§' can't be encoded into a single byte in UTF-8. If you run the following, you'll see that it prints two bytes:
print('§'.encode('utf-8')) # or alternatively: print(bytes('§', 'utf-8'))
# prints: b'\xc2\xa7'
Whereas a typical separator like a comma ',' only occupies one byte.
If you use the encoding latin-1 instead, the '§' character only occupies one byte, \xa7, but I suspect your CSV file will need to use the same encoding.
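As a rough illustration of why the file's encoding has to match, the sketch below compares the encoded length of the separator under both encodings and shows what happens when the bytes are decoded with the wrong one. The helper `encoded_length` and the sample string are hypothetical, not part of any library.

```python
def encoded_length(char: str, encoding: str) -> int:
    """Return how many bytes a character occupies in the given encoding."""
    return len(char.encode(encoding))


utf8_len = encoded_length("§", "utf-8")      # section sign in UTF-8
latin1_len = encoded_length("§", "latin-1")  # section sign in Latin-1
comma_len = encoded_length(",", "utf-8")     # a typical separator

# Latin-1 bytes read as UTF-8: the lone 0xA7 byte is not valid UTF-8.
latin1_bytes = "a§b".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# UTF-8 bytes read as Latin-1: no error, but an extra character appears.
mojibake = "a§b".encode("utf-8").decode("latin-1")

print(utf8_len, latin1_len, comma_len, decodes_as_utf8, repr(mojibake))
```

So a file saved as UTF-8 cannot simply be declared Latin-1 to obtain a one-byte separator: each § would then decode as two characters.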