I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual:
file = open('data.txt', 'r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()
Output:
0.0200197 1.97691e-005
['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']
Printing the line works fine, but after I try to split the line so that I can convert it into a float, it looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea about how I can convert these back into numbers?
I put the sample datafile here if you would like to try to load it: https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt
I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.
I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.
In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
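To see why the split output looks the way it does, here is a small sketch that encodes the sample line as UTF-16-LE and splits the raw bytes. The variable names are only illustrative, and the sample text is copied from the output in the question rather than read from the actual file.

```python
# The sample line from the question, as text.
text = u"0.0200197 1.97691e-005"

# Encode as UTF-16-LE: every ASCII character becomes itself plus a NUL byte.
raw = text.encode("utf-16-le")

# Splitting the raw bytes on whitespace leaves the NUL bytes in place,
# which is what the question's output shows.
parts_raw = raw.split()

# Decoding first gives clean text that float() can parse.
decoded = raw.decode("utf-16-le")
values = [float(token) for token in decoded.split()]
```

The second raw token starts with a NUL byte because the space character is itself encoded as two bytes, and the split only consumes the first of them.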
To fix this, just decode the data:
print line.decode('utf-16-le').split()
Or do the same thing at the file level with the io or codecs module:
import io
file = io.open('data.txt', 'r', encoding='utf-16-le')
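As a rough sketch of the file-level approach, the helper below opens a file with an explicit encoding and converts each non-empty line to floats. The function name read_floats and the temporary demo file are made up for illustration; the demo data is written without a byte order mark, which is an assumption about the real file.

```python
import io
import os
import tempfile


def read_floats(path, encoding="utf-16-le"):
    """Return a list of rows, each a list of floats, from a whitespace-separated file."""
    rows = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(field) for field in fields])
    return rows


# Build a small stand-in for data.txt (hypothetical contents, no BOM).
sample = u"0.0200197 1.97691e-005\r\n0.0400394 2.5e-005\r\n"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "wb") as out:
    out.write(sample.encode("utf-16-le"))

rows = read_floats(demo_path)
os.remove(demo_path)
```

Decoding at open time also avoids a pitfall of decoding line by line: when raw bytes are split on the 0x0A byte, the newline's trailing 0x00 ends up at the start of the next line. If the file begins with a byte order mark, passing encoding='utf-16' instead should consume it; with 'utf-16-le' the mark stays in the first field. Since numpy.loadtxt is documented to accept a file object or a list of lines, the decoded file object can probably be passed to it directly.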
* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
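For anyone who does care, a short sketch of the difference; the two characters are arbitrary examples, not taken from the data file.

```python
import struct

bmp_char = u"\u00e9"          # U+00E9, inside the BMP
non_bmp_char = u"\U0001F600"  # U+1F600, outside the BMP

bmp_bytes = bmp_char.encode("utf-16-le")
non_bmp_bytes = non_bmp_char.encode("utf-16-le")

# A BMP character is one 16-bit unit; a non-BMP character is two (a surrogate pair).
bmp_len = len(bmp_bytes)
non_bmp_len = len(non_bmp_bytes)

# Unpack the four bytes as two little-endian 16-bit units to see the surrogates.
high, low = struct.unpack("<2H", non_bmp_bytes)
```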