Python - read text file with weird utf-16 format

I'm trying to read a text file into Python, but it seems to use a very strange encoding. I tried the usual approach:

file = open('data.txt','r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()

Output:

0.0200197   1.97691e-005

['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']

Printing the line works fine, but when I split the line so that I can convert the values to floats, the resulting strings are full of '\x00' characters. Of course, converting those strings to floats raises an error. Any idea how I can convert them back into numbers?

I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt

I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.

asked Oct 11 '13 23:10

DanHickstein

1 Answer

I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.

In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
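A quick way to see the pattern described above is to encode a short ASCII string yourself. This is only an illustrative sketch, written in Python 3 syntax (an assumption; the question uses Python 2), and the sample string is made up for the demonstration:

```python
# Encode a short ASCII string as UTF-16-LE and inspect the raw bytes.
text = "0.02"
encoded = text.encode("utf-16-le")

# Each ASCII character becomes its ASCII byte followed by a zero byte.
print(repr(encoded))
print(len(text), len(encoded))
```

The byte string has the same shape as the tokens in the question's output: every visible character is followed by '\x00'.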

To fix this, just decode the data:

print line.decode('utf-16-le').split()
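As a rough sketch of how decoding leads to usable numbers, the hypothetical helper below (parse_utf16_le_table is not from the original answer) decodes a whole chunk of bytes first and only then splits it into lines and fields. It is written in Python 3 syntax, which is an assumption. Decoding before splitting matters because a newline in UTF-16-LE is the two bytes '\n\x00', so splitting the raw bytes at '\n' can leave a stray '\x00' at the start of the next line.

```python
def parse_utf16_le_table(raw):
    """Decode UTF-16-LE bytes and return each non-empty line as a list of floats."""
    # Decode everything at once, then drop a byte-order mark if one is present.
    text = raw.decode("utf-16-le").lstrip("\ufeff")
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            rows.append([float(field) for field in fields])
    return rows


# Sample bytes built from the first line shown in the question.
sample = "0.0200197\t1.97691e-005\r\n".encode("utf-16-le")
print(parse_utf16_le_table(sample))
```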

Or do the same thing at the file level with the io or codecs module:

import io

file = io.open('data.txt', 'r', encoding='utf-16-le')
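Putting the file-level approach together, a minimal sketch might look like the following. The helper name read_float_rows and the temporary demo file are assumptions made for illustration, and the code uses Python 3 syntax. Note that the 'utf-16-le' codec does not remove a byte-order mark, so the sketch strips one if it appears; the plain 'utf-16' codec is an alternative that consumes the mark and uses it to detect byte order.

```python
import io
import os
import tempfile


def read_float_rows(path, encoding="utf-16-le"):
    """Return each non-empty line of a text file as a list of floats."""
    rows = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            # 'utf-16-le' leaves a byte-order mark in the text, so drop it if present.
            fields = line.lstrip("\ufeff").split()
            if fields:
                rows.append([float(field) for field in fields])
    return rows


# Hypothetical demo: write a small UTF-16-LE file, then read it back.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "data.txt")
with io.open(demo_path, "w", encoding="utf-16-le") as handle:
    handle.write("0.0200197\t1.97691e-005\n0.5\t2.0\n")

rows = read_float_rows(demo_path)
print(rows)
```

Because the file object decodes as it reads, iterating over it yields properly decoded lines, and split() and float() then behave as they would for an ordinary ASCII file.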

* This is a bit of an oversimplification: each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably don't care about these details.
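For the curious, the footnote can be checked with a short sketch. The two sample characters are chosen only for illustration, and the code assumes Python 3:

```python
# A BMP character (here "A") takes one 16-bit code unit: two bytes.
bmp_encoded = "A".encode("utf-16-le")

# A non-BMP character (here U+1F600) becomes a surrogate pair: four bytes.
non_bmp_encoded = "\U0001F600".encode("utf-16-le")

print(len(bmp_encoded), len(non_bmp_encoded))
```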

answered Sep 30 '22 18:09

abarnert

Tags: python, encoding, numpy, utf-16le
