I'm trying to load the columns of a file with a strange encoding. Windows appears to have no trouble opening it, but Linux complains, and I have only been able to open it with the Atom text editor (other editors show either a blank file or the data in encoded form).
The command:
file -i data_file.tit
returns:
application/octet-stream; charset=binary
Opening the file in binary mode and reading the first 400 bytes gives:
'0905077U1- a\r\nIntegration time: 19,00 ms\r\nAverage: 25 scans\r\nNr of pixels used for smoothing: 2\r\nData measured with spectrometer name: 0905077U1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\r\nWave ;Dark ;Ref ;Sample ;Absolute Irradiance ;Photon Counts\r\n[nm] ;[counts] ;[counts] ;[counts] ;[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]\r\n247,40;-1,0378;18,713;10,738;21,132;0,4369\r\n247,'
The rest of the file consists only of ASCII numbers separated by semicolons.
I tried the following ways to load the file:
with open('data_file.tit') as f:
    bytes = f.read()  # (1)
    # bytes = f.read().decode('???')  # (2)
    # bytes = np.genfromtxt(f)  # (3)
    print bytes
(1) Sort of works but skips the first several hundred lines.
(2) Failed with every encoding I tried, with the error:
codec can't decode byte 0xb5 in position 315: unexpected special character
(3) Complains about ValueError: Some errors were detected ! and shows for each line something similar to Line #3 (got 3 columns instead of 2).
How can I load this data file?
You have a codepage 1252 encoded text file, with one line containing NULL bytes. The file command determined you have binary data on the basis of those NULLs, while I made an educated guess on the basis of the \xb2 and \xb5 codepoints, which stand for the ² and µ characters.
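As a quick check of that guess, the units fragment from the header can be decoded on its own. This is only a sketch: the variable names are made up for illustration, and the byte string is copied from the dump in the question.

```python
# Fragment of the units line, copied from the 400-byte dump in the question.
header = b'[\xb5Watt/cm\xb2/nm]'

# cp1252 maps 0xb5 to the micro sign and 0xb2 to superscript two.
decoded = header.decode('cp1252')

# The same bytes are not valid UTF-8: 0xb5 cannot start a UTF-8 sequence.
try:
    header.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded)
print(utf8_ok)
```

Latin-1 would decode these two particular bytes the same way, so the fragment alone cannot distinguish it from cp1252; cp1252 is the more likely candidate for a file written on Windows.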
To open, just decode from that encoding:
import io
with io.open(filename, 'r', encoding='cp1252') as f:
    for line in f:
        print(line.rstrip('\n\x00'))
The first 10 lines are then:
0905077U1- a
Integration time: 19,00 ms
Average: 25 scans
Nr of pixels used for smoothing: 2
Data measured with spectrometer name: 0905077U1
Wave ;Dark ;Ref ;Sample ;Absolute Irradiance ;Photon Counts
[nm] ;[counts] ;[counts] ;[counts] ;[µWatt/cm²/nm] ;[µMol/s/m²/nm]
247,40;-1,0378;18,713;10,738;21,132;0,4369
247,57;3,0793;19,702;9,5951;11,105;0,2298
247,74;-0,9414;19,929;8,8908;16,567;0,3430
The NULLs were stripped from the Data measured with spectrometer name: 0905077U1 line. The spectrometer name is now 9 bytes long; together with the 55 NULLs, it looks like the name field could be up to 64 characters long and the file writer didn't bother to strip the padding.
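Since the question asks for the columns, here is a minimal sketch of turning the decoded text into numbers. It assumes that every file has exactly 7 header lines, as in the sample above, and that all data fields use a decimal comma; parse_columns is a hypothetical helper name, not part of any library.

```python
def parse_columns(raw, encoding='cp1252', header_lines=7):
    """Decode raw bytes and return the data rows as lists of floats.

    Assumes 7 header lines and decimal commas, as in the sample file.
    """
    text = raw.decode(encoding)
    rows = []
    for line in text.splitlines()[header_lines:]:
        line = line.strip('\x00 \t')  # drop NULL padding and stray whitespace
        if not line:
            continue
        # Fields are separated by ';' and use ',' as the decimal mark.
        rows.append([float(field.replace(',', '.')) for field in line.split(';')])
    return rows


# Sample built from the lines shown in the question and answer.
sample = b''.join([
    b'0905077U1- a\r\n',
    b'Integration time: 19,00 ms\r\n',
    b'Average: 25 scans\r\n',
    b'Nr of pixels used for smoothing: 2\r\n',
    b'Data measured with spectrometer name: 0905077U1' + b'\x00' * 55 + b'\r\n',
    b'Wave ;Dark ;Ref ;Sample ;Absolute Irradiance ;Photon Counts\r\n',
    b'[nm] ;[counts] ;[counts] ;[counts] ;[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]\r\n',
    b'247,40;-1,0378;18,713;10,738;21,132;0,4369\r\n',
    b'247,57;3,0793;19,702;9,5951;11,105;0,2298\r\n',
])

rows = parse_columns(sample)
wavelengths = [row[0] for row in rows]  # first column: Wave [nm]
```

For the real file, the bytes would come from opening it in binary mode ('rb') and passing the result of read() to the helper. If the header length varies between files, the fixed header_lines count would need to be replaced by something that looks for the first numeric line.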
Guessing an encoding can be really hard. Luckily, there's a library that tries to help with that: https://pypi.python.org/pypi/chardet