I am using a perl tokenizer for German. The tokenizer works fine for some files but now I am facing the following error:
perl tokenizer.perl -l de < ~/Desktop/me.txt > ~/Desktop/me.txt.tok
Tokenizer v3
Language: de
utf8 "\xFF" does not map to Unicode at tokenizer.perl line 44, <STDIN> line 1.
Malformed UTF-8 character (byte 0xff) in pattern match (m//) at tokenizer.perl line 45, <STDIN> line 1.
Malformed UTF-8 character (byte 0xff) in pattern match (m//) at tokenizer.perl line 45, <STDIN> line 1.
Malformed UTF-8 character (fatal) at tokenizer.perl line 64, <STDIN> line 1.
Any thoughts?
Thanks in advance.
Neg.
The error message is misleading, but the intended information is correct and useful: the byte FF (hexadecimal) was encountered in the data, but it cannot appear in UTF-8 data. So “utf8 "\xFF"” is nonsense as such, but read it as “byte FF encountered as data purported to be UTF-8 encoded”. Similarly, read “Malformed UTF-8 character (byte 0xff)” as “Invalid data (byte FF) encountered in purported UTF8 data”.
To find out why your data contains the byte FF, you need to reveal more of it. My guess is that it is actually part of a byte order mark in UTF-16 encoding, but this is just a guess.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With