 

Read file with unknown encoding

I'm trying to load the columns of a file with a strange encoding. Windows appears to have no issues opening it, but Linux complains, and I have only been able to open it with the Atom text editor (other editors show either a blank file or garbled data).

The command:

file -i data_file.tit

returns:

application/octet-stream; charset=binary

Opening the file in binary mode and reading the first 400 bytes gives:

'0905077U1- a\r\nIntegration time: 19,00 ms\r\nAverage: 25 scans\r\nNr of pixels used for smoothing: 2\r\nData measured with spectrometer name: 0905077U1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\r\nWave ;Dark ;Ref ;Sample ;Absolute Irradiance ;Photon Counts\r\n[nm] ;[counts] ;[counts] ;[counts] ;[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]\r\n247,40;-1,0378;18,713;10,738;21,132;0,4369\r\n247,'

The rest of the file consists only of ASCII numbers separated by semicolons.

I tried the following ways to load the file:

with open('data_file.tit') as f:
    bytes = f.read() # (1)
    # bytes = f.read().decode('???')  # (2)
    # bytes = np.genfromtxt(f)  # (3)
    print bytes
  • (1) Sort of works but skips the first several hundred lines.

  • (2) Failed with every encoding I tried with the error:

    codec can't decode byte 0xb5 in position 315: unexpected special character
    
  • (3) Complains about ValueError: Some errors were detected ! and shows for each line something similar to Line #3 (got 3 columns instead of 2).

How can I load this data file?

Gabriel asked Dec 26 '22 03:12



2 Answers

You have a codepage 1252 encoded text file, with one line containing NULL bytes. The file command determined you have binary data on the basis of those NULLs, while I made an educated guess on the basis of the \xb2 and \xb5 codepoints, which stand for the ² and µ characters.
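As a quick check of that guess, the two suspicious bytes can be decoded in isolation. This is only a minimal sketch: the variable names are illustrative, and the byte string is copied from the units header line in the question. Note that Latin-1 maps these two bytes to the same characters, so this sample alone does not distinguish cp1252 from Latin-1.

```python
# Bytes copied from the units header line shown in the question.
raw = b'[\xb5Watt/cm\xb2/nm]'

# cp1252 maps 0xb5 to the micro sign and 0xb2 to superscript two.
decoded = raw.decode('cp1252')
print(decoded)

# Latin-1 agrees with cp1252 for these two byte values.
same_in_latin1 = raw.decode('latin-1') == decoded

# UTF-8 rejects a bare 0xb5, consistent with the decode error in the question.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```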

To open, just decode from that encoding:

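import io

filename = 'data_file.tit'

# cp1252 decodes \xb5 as µ and \xb2 as ²
with io.open(filename, 'r', encoding='cp1252') as f:
    for line in f:
        # Drop the newline and the NULL padding after the spectrometer name
        print(line.rstrip('\n\x00'))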

The first 10 lines are then:

0905077U1- a
Integration time: 19,00 ms
Average: 25 scans
Nr of pixels used for smoothing: 2
Data measured with spectrometer name: 0905077U1
Wave   ;Dark     ;Ref      ;Sample   ;Absolute Irradiance  ;Photon Counts
[nm]   ;[counts] ;[counts] ;[counts] ;[µWatt/cm²/nm]       ;[µMol/s/m²/nm]
247,40;-1,0378;18,713;10,738;21,132;0,4369
247,57;3,0793;19,702;9,5951;11,105;0,2298
247,74;-0,9414;19,929;8,8908;16,567;0,3430

The NULLs were stripped from the Data measured with spectrometer name: 0905077U1 line. The spectrometer name is 9 bytes long; together with the 55 NULLs that makes 64 bytes, so it looks like the name is stored in a fixed field of up to 64 characters and the file writer didn't bother to strip the padding.
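Since the question asks for the columns, here is one way the numeric part could be parsed once the file is decoded. It is a sketch under assumptions taken from the sample above: the first 7 lines are metadata and column headers, fields are separated by semicolons, and numbers use a decimal comma. The function names parse_rows and load_columns and the header_lines parameter are hypothetical, not part of any library.

```python
import io


def parse_rows(lines, header_lines=7):
    """Turn decoded text lines into rows of floats, skipping the header."""
    rows = []
    for lineno, line in enumerate(lines):
        if lineno < header_lines:
            continue  # 5 metadata lines plus 2 column-header lines (assumed)
        line = line.strip('\r\n\x00 ')
        if not line:
            continue  # ignore blank lines
        # Fields are ';'-separated and use a decimal comma.
        rows.append([float(value.replace(',', '.')) for value in line.split(';')])
    return rows


def load_columns(path, encoding='cp1252'):
    """Read the file and return its data as a list of columns (tuples)."""
    with io.open(path, 'r', encoding=encoding) as f:
        rows = parse_rows(f)
    return list(zip(*rows))
```

Calling load_columns('data_file.tit') would then return one tuple per column, with the wavelength values first, provided the whole file follows the layout of the sample.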

Martijn Pieters answered Dec 27 '22 18:12



Guessing an encoding can be really hard. Luckily, there's a library that tries to help with that: https://pypi.python.org/pypi/chardet
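A rough sketch of how chardet might be used on this file follows. The helper name guess_encoding is hypothetical, and removing the NULL bytes before detection is an assumption made here so that the padding does not skew the guess. chardet.detect returns a dictionary that includes 'encoding' and 'confidence' keys; the result is a statistical guess and may not be cp1252 for a mostly-ASCII file like this one.

```python
def guess_encoding(raw):
    """Return chardet's guess for raw bytes, or None if chardet is missing."""
    try:
        import chardet  # third-party package: pip install chardet
    except ImportError:
        return None
    # Assumption: drop the NULL padding so it does not influence detection.
    return chardet.detect(raw.replace(b'\x00', b''))


# Example with bytes taken from the question's header line.
guess = guess_encoding(b'[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]\r\n')
print(guess)
```

For the real file, the bytes would come from open('data_file.tit', 'rb').read(), and the reported encoding could then be passed to io.open.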

ojii answered Dec 27 '22 19:12
