I am running a simple script that selects text from each line of an input file and writes that text to an output file:
    with open('inputpath', 'r') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
        for line in vh_datoteka:
            NMEA = str(line)[24:-39]
            iz_datoteka.write(NMEA + '\n')
The data I need to process looks something like this (two lines):
2012-05-01 23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;2470028;1;NULL;2012-05-01 21:59:59.007
2012-05-01 23:59:59.007;!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;2470032;1;NULL;2012-05-01 21:59:59.007
...
Since the files I have to process are large (approximately 2 GB each), I first tested the code on a small part of one of them: I copied the first 1000 or so lines and saved them to a test file.
The code worked perfectly and I got the results I was looking for:
!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;
!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;
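As a sanity check on the slice offsets, here is a minimal sketch using the first sample line above. It assumes each line ends with a single '\n' and that the leading timestamp and the trailing fields have the fixed widths seen in the sample. The names `sample` and `nmea` are introduced here only for illustration.

```python
# First sample record, with the trailing newline that file iteration keeps.
sample = ("2012-05-01 23:59:59.007;"
          "!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;"
          "2470028;1;NULL;2012-05-01 21:59:59.007\n")

# The leading timestamp plus ';' is 24 characters long; the trailing
# ID, flag, NULL, second timestamp and newline add up to 39 characters.
nmea = sample[24:-39]
print(nmea)
```

The slice only works while every character of the line is counted as exactly one item, which is what the test file happened to satisfy.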
After that I ran the code on the full file and got very different output:
2 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 , 0 , , 3 3 c m > k 1 0
0 0 1 3 v g l D P k W 1 Q S i n 0 0 0 0 , 0 * 6 E ; 2 4 7 0 0 2 8 ; 1
; N U L L ; 2 0 1 2 - 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 ,
0 , , 1 9 N S B n 0 0 1 n Q 8 < 7 v D h I q 4 3 C < 2 2 8 0 F , 0 * 0
7 ; 2 4 7 0 0 3 2 ; 1 ; N U L L ; 2 0 1 2 - ...
What is the reason for such behaviour?
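Apparently, the large data files were encoded in UTF-16 LE, which was the problem. In that encoding every ASCII character is stored as two bytes, the second of which is a null byte. Read without decoding, each real character is therefore followed by a null character (the gaps in the output above), and the slice offsets cut in the wrong places. The small test file presumably worked because it was saved in a single-byte encoding when I copied the lines. I corrected the Python code to read the input as UTF-16 and write the output as UTF-8, and that did the trick.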
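    import codecs

    with codecs.open('inputpath', 'r', encoding='utf-16-le') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
        for line in vh_datoteka:
            # line is already decoded text, so the offsets count characters, not bytes
            NMEA = line[24:-39]
            iz_line = NMEA + '\n'
            # Encode to UTF-8 bytes before writing (this form is for Python 2)
            iz_datoteka.write(iz_line.encode('utf-8'))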
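To reproduce the symptom and show one way the same fix might look in Python 3, here is a small self-contained sketch. It writes one sample record to a temporary UTF-16 LE file without a byte order mark, inspects the raw bytes, and then reads the file with the built-in open() and an explicit encoding. The helper name `extract_nmea` and the temporary file names are hypothetical; the slice offsets are the ones from the question.

```python
import os
import tempfile

SAMPLE = ("2012-05-01 23:59:59.007;"
          "!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;"
          "2470028;1;NULL;2012-05-01 21:59:59.007\n")


def extract_nmea(in_path, out_path, encoding="utf-16-le"):
    """Copy the NMEA field of every input line to out_path as UTF-8.

    Hypothetical helper, not part of the original post.
    """
    with open(in_path, "r", encoding=encoding) as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # line is decoded text here, so offsets count characters.
            dst.write(line[24:-39] + "\n")


tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "sample_utf16.txt")
out_path = os.path.join(tmp_dir, "sample_out.txt")

# Store the sample the way the large files were stored: UTF-16 LE, no BOM.
with open(in_path, "wb") as f:
    f.write(SAMPLE.encode("utf-16-le"))

# Raw view: every ASCII character is followed by a NUL byte, so the
# file is twice as long as the text and byte offsets no longer match.
with open(in_path, "rb") as f:
    raw = f.read()
print(raw[:8])  # b'2\x000\x001\x002\x00'
print(len(raw), len(SAMPLE))

# Decoded view: the original slice works again.
extract_nmea(in_path, out_path)
with open(out_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

If the real files begin with a byte order mark, passing encoding="utf-16" instead lets the decoder consume it; with "utf-16-le" the mark stays in the text as U+FEFF at the start of the first line and shifts that line's slice by one character.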