Read file with unknown encoding

Question

I'm trying to load the columns of a file with a strange encoding. Windows appears to have no issues opening it, but Linux complains, and I have only been able to open it with the Atom text editor (other editors show either a blank file or the data in encoded form).

The command:

file -i data_file.tit

returns:

application/octet-stream; charset=binary

Opening the file in binary mode and reading the first 400 bytes gives:

'0905077U1- a Integration time: 19,00 ms Average: 25 scans Nr of pixels used for smoothing: 2 Data measured with spectrometer name: 0905077U1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 Wave ;Dark ;Ref ;Sample ;Absolute Irradiance ;Photon Counts [nm] ;[counts] ;[counts] ;[counts] ;[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm] 247,40;-1,0378;18,713;10,738;21,132;0,4369 247,'

The rest of the file consists only of ASCII numbers separated by semicolons.

I tried the following ways to load the file:

with open('data_file.tit') as f:
    bytes = f.read() # (1)
    # bytes = f.read().decode('???')  # (2)
    # bytes = np.genfromtxt(f)  # (3)
    print bytes

(1) Sort of works but skips the first several hundred lines.

(2) Failed with every encoding I tried with the error:

codec can't decode byte 0xb5 in position 315: unexpected special character

(3) Complains about ValueError: Some errors were detected ! and shows for each line something similar to Line #3 (got 3 columns instead of 2).

How can I load this data file?

Martijn Pieters · Accepted Answer

You have a codepage 1252 encoded text file, with one line containing NULL bytes. The file command determined you have binary data on the basis of those NULLs, while I made an educated guess on the basis of the \xb2 and \xb5 codepoints, which stand for the ² and µ characters.
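The guess can be checked against the bytes quoted in the question. This is only a small sketch: the byte string is copied from the header dump above, and the variable names are illustrative.

```python
# Bytes copied from the units header in the question's 400-byte dump.
raw = b'[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]'

# cp1252 maps byte 0xb5 to the micro sign and 0xb2 to superscript two.
units = raw.decode('cp1252')
print(units)

# The same bytes are not valid UTF-8, so that codec rejects them.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)
```

Note that Latin-1 (ISO-8859-1) maps these two bytes to the same characters, so they alone cannot tell the two encodings apart; cp1252 remains an educated guess.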

To open, just decode from that encoding:

import io

with io.open(filename, 'r', encoding='cp1252') as f:
    for line in f:
        print(line.rstrip('\n\x00'))

The first 10 lines are then:

0905077U1- a
Integration time: 19,00 ms
Average: 25 scans
Nr of pixels used for smoothing: 2
Data measured with spectrometer name: 0905077U1
Wave   ;Dark     ;Ref      ;Sample   ;Absolute Irradiance  ;Photon Counts
[nm]   ;[counts] ;[counts] ;[counts] ;[µWatt/cm²/nm]       ;[µMol/s/m²/nm]
247,40;-1,0378;18,713;10,738;21,132;0,4369
247,57;3,0793;19,702;9,5951;11,105;0,2298
247,74;-0,9414;19,929;8,8908;16,567;0,3430

The NULLs were stripped from the Data measured with spectrometer name: 0905077U1 line. The spectrometer name is 9 bytes long; together with the 55 NULLs, that suggests the name field can hold up to 64 characters and the file writer didn't bother to strip the padding.
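Since the question asks for the columns, here is a hedged sketch of one way to turn the decoded lines into numbers. The function names and the `n_header=7` default are assumptions based on the ten lines shown above (seven header lines, then data); the real file may differ.

```python
import io


def parse_spectrum(lines, n_header=7):
    """Split decoded text lines into header lines and rows of floats.

    n_header=7 is an assumption taken from the sample output above.
    """
    header, rows = [], []
    for i, line in enumerate(lines):
        # Drop line endings and the NUL padding after the spectrometer name.
        line = line.rstrip('\r\n\x00')
        if i < n_header:
            header.append(line)
        elif line.strip():
            # Fields are separated by semicolons and use decimal commas.
            rows.append([float(x.replace(',', '.')) for x in line.split(';')])
    return header, rows


def load_spectrum(path, encoding='cp1252'):
    """Open the file as cp1252 text and parse it (hypothetical helper)."""
    with io.open(path, 'r', encoding=encoding) as f:
        return parse_spectrum(f)
```

Calling `header, rows = load_spectrum('data_file.tit')` would then give the header text and one list of six floats per data line, which can be passed to `numpy.array(rows)` if an array is needed.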

ojii · Answer

Guessing an encoding can be really hard. Luckily, there's a library that tries to help with that: https://pypi.python.org/pypi/chardet
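For comparison, a much cruder, dependency-free way to narrow things down is trial decoding. The sketch below does not use chardet; the function name and the candidate list (and its order) are assumptions, and a successful decode only means the bytes are valid in that encoding, not that the guess is right.

```python
def guess_encoding(data, candidates=('ascii', 'utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate that decodes `data` without error.

    Strict encodings go first; latin-1 accepts any byte sequence,
    so it only serves as a last resort.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes taken from the question's header dump.
print(guess_encoding(b'[\xb5Watt/cm\xb2/nm]'))
```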

Tags:

python

file

file-io

character-encoding

encoding

Gabriel

2 Answers

Martijn Pieters

ojii
