The same code behaves differently depending on file size

Tags:

python

I am running simple code to select text from lines in the input file and write that text to an output file.

with open('inputpath', 'r') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        # drop the first 24 and the last 39 characters of each line
        NMEA = str(line)[24:-39]
        iz_datoteka.write(NMEA + '\n')

The data I need to process looks something like this (two lines):

2012-05-01 23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;2470028;1;NULL;2012-05-01 21:59:59.007
2012-05-01 23:59:59.007;!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;2470032;1;NULL;2012-05-01 21:59:59.007
...

Since I have large files to process (approximately 2 GB each), I first tested the code on a small part of one of them: I simply copied the first 1000 or so lines and saved them to a test file.

The code worked perfectly and I got the results I was looking for:

!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;
!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;

After that I ran the code on the full file and got very different output:

2 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 , 0 , , 3 3 c m > k 1 0
0 0 1 3 v g l D P k W 1 Q S i n 0 0 0 0 , 0 * 6 E ; 2 4 7 0 0 2 8 ; 1
; N U L L ; 2 0 1 2 -   3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 ,
0 , , 1 9 N S B n 0 0 1 n Q 8 < 7 v D h I q 4 3 C < 2 2 8 0 F , 0 * 0
7 ; 2 4 7 0 0 3 2 ; 1 ; N U L L ; 2 0 1 2 - ...

What is the reason for such behaviour?

Gašper Zupančič asked May 24 '26 03:05



1 Answer

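Apparently, the large data files were encoded in UTF-16LE, which was the problem. In that encoding every ASCII character takes two bytes (the character followed by a zero byte), which would explain both the shifted slices and the gaps between characters in the output. I corrected the Python code to decode the input as UTF-16 and write the output as UTF-8, and that did the trick.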
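To illustrate the effect, here is a small sketch (not part of my original script) that encodes one of the sample lines as UTF-16LE and then misreads the bytes with a single-byte encoding. The variable names and the choice of Latin-1 as the "wrong" encoding are assumptions made for the demonstration.

```python
# One sample line from the question, as text.
sample = ("2012-05-01 23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,"
          "0*6E;2470028;1;NULL;2012-05-01 21:59:59.007\n")

# UTF-16LE stores each of these ASCII characters as two bytes:
# the character itself followed by a zero byte.
raw = sample.encode("utf-16-le")

# Misreading those bytes with a single-byte encoding (Latin-1 is assumed
# here) keeps every zero byte as a separate NUL character.
misread = raw.decode("latin-1")

# The same slice now counts bytes instead of characters, so it cuts the
# line in the wrong places and keeps the NUL characters.
wrong = misread[24:-39]
right = sample[24:-39]

print(len(sample), len(raw))   # the encoded form is twice as long
print(repr(wrong[:12]))        # characters interleaved with '\x00'
print(right)                   # the intended NMEA field
```

When such output is printed or opened in an editor, the NUL characters typically show up as blanks, which matches the spaced-out text in the question. With the encoding known, the fix is to decode the input explicitly while reading, as in the script below.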

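import codecs

# Decode the input as UTF-16LE while reading it line by line.
with codecs.open('inputpath', 'r', encoding='utf-16-le') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        NMEA = str(line)[24:-39]
        iz_line = NMEA + '\n'
        # Writing encoded bytes to a file opened with 'w' works on Python 2 only.
        iz_datoteka.write(iz_line.encode('utf-8'))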
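On Python 3 the `write` call above raises a `TypeError`, because a file opened with `'w'` is a text file and does not accept `bytes`. A minimal Python 3 sketch of the same idea passes both encodings to the built-in `open` instead. The function name `extract_nmea` and its parameters are hypothetical, the slice bounds are the ones from the question, and the byte order mark handling is an extra precaution rather than something the original script did.

```python
def extract_nmea(input_path, output_path, start=24, stop=-39):
    """Write the [start:stop] slice of every input line to the output file.

    Hypothetical helper: the input is assumed to be UTF-16LE text and the
    output is written as UTF-8, one slice per line.
    """
    with open(input_path, "r", encoding="utf-16-le") as vh_datoteka, \
         open(output_path, "w", encoding="utf-8", newline="\n") as iz_datoteka:
        for line in vh_datoteka:
            # The utf-16-le codec does not remove a byte order mark, so strip
            # a leading U+FEFF; it shows up on the first line when the file
            # starts with one.
            line = line.lstrip("\ufeff")
            iz_datoteka.write(line[start:stop] + "\n")
```

Called as `extract_nmea('inputpath', 'outputpath')`, this still reads the file line by line, so memory use should stay small even for inputs of a few gigabytes.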
Gašper Zupančič answered May 25 '26 17:05
