Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UnicodeError: UTF-16 stream does not start with BOM

I have trouble reading the csv file by python. My csv file has Korean and numbers.

Below is my python code.

import csv
import codecs
csvreader = csv.reader(codecs.open('1.csv', 'rU', 'utf-16'))
for row in csvreader:
    print(row)

First, there was a UnicodeDecodeError when I enter "for row in csvreader" line in the above code.

So I used the code below then the problem seemed to be solved

csvreader = csv.reader(codecs.open('1.csv', 'rU', 'utf-16'))

Then I ran into NULL byte error. Then I can't figure out what's wrong with the csv file.

[update] I don't think I changed anything from the previous code but my program shows "UnicodeError: UTF-16 stream does not start with BOM"

When I open the csv by excel I can see the table in proper format (image attached at the botton) but when I open it in sublime Text, below is a snippet of what I get.

504b 0304 1400 0600 0800 0000 2100 6322
f979 7701 0000 d405 0000 1300 0802 5b43
6f6e 7465 6e74 5f54 7970 6573 5d2e 786d
6c20 a204 0228 a000 0200 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

If you need more information about my file, let me know!

I appreciate your help. Thanks in advance :)

csv file shown in excel

enter image description here

csv file shown in sublime text enter image description here

like image 963
Py11 Avatar asked Mar 19 '18 20:03

Py11


People also ask

Does UTF 16 require bom?

In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted.

Is UTF 8 Unicode?

UTF-8 is a Unicode character encoding method. This means that UTF-8 takes the code point for a given Unicode character and translates it into a string of binary. It also does the reverse, reading in binary digits and converting them back to characters.


1 Answers

The problem is that your input file apparently doesn’t start with a BOM (a special character that gets recognizably encoded differently for little-endian vs. big-endian utf-16), so you can’t just use “utf-16” as the encoding, you have to explicitly use “utf-16-le” or “utf-16-be”.

If you don’t do that, codecs will guess, and if it guesses wrong, it’ll try to read each code point backward and get illegal values.

If your posted sample starts at an even offset and contains a bunch of ASCII, it’s little-ending, so use the -le version. (But of course it’s better to look at what it actually is than to guess.)

like image 98
abarnert Avatar answered Sep 19 '22 08:09

abarnert