utf8 "\xFF" does not map to Unicode at tokenizer.perl line 44, <STDIN> line 1.

Question

I am using a Perl tokenizer for German. The tokenizer works fine for some files, but now I am getting the following error:

perl tokenizer.perl -l de < ~/Desktop/me.txt > ~/Desktop/me.txt.tok 
Tokenizer v3
Language: de
utf8 "\xFF" does not map to Unicode at tokenizer.perl line 44, <STDIN> line 1.
Malformed UTF-8 character (byte 0xff) in pattern match (m//) at tokenizer.perl line 45, <STDIN> line 1.
Malformed UTF-8 character (byte 0xff) in pattern match (m//) at tokenizer.perl line 45, <STDIN> line 1.
Malformed UTF-8 character (fatal) at tokenizer.perl line 64, <STDIN> line 1.

Any thoughts?

Thanks in advance.

Neg.

Jukka K. Korpela · Accepted Answer

The error message is misleading, but the information it is trying to convey is correct and useful: the byte FF (hexadecimal) was encountered in the data, and that byte can never appear in UTF-8 data. So “utf8 "\xFF"” is nonsense as such; read it as “byte FF encountered in data purported to be UTF-8 encoded”. Similarly, read “Malformed UTF-8 character (byte 0xff)” as “Invalid data (byte FF) encountered in purported UTF-8 data”.
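As a rough illustration of the point above, the following sketch feeds a lone FF byte to a UTF-8 decoder and checks whether it is accepted. It assumes an `iconv` command is installed, which the question does not mention; the variable name `valid` is only for illustration.

```shell
# Feed the single byte FF (octal 377) to iconv as if it were UTF-8.
# iconv is expected to exit with a non-zero status on invalid input.
if printf '\377' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    valid=yes
else
    valid=no
fi
echo "FF accepted as UTF-8: $valid"
```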

To find out why your data contains the byte FF, you need to show more of it. My guess is that it is part of a byte order mark in UTF-16 encoding (the bytes FF FE or FE FF at the start of the file), but this is just a guess.
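One way to check that guess is to look at the first two bytes of the file and, if they turn out to be a UTF-16 byte order mark, convert the file to UTF-8 before tokenizing. The sketch below is only an illustration: it builds a small hypothetical sample file instead of touching the real `~/Desktop/me.txt`, and it assumes `od` and `iconv` are available.

```shell
# Hypothetical sample standing in for the real input file:
# a UTF-16LE byte order mark (FF FE) followed by "Hi" in UTF-16LE.
sample=$(mktemp)
printf '\377\376H\000i\000' > "$sample"

# Dump the first two bytes in hex. "fffe" or "feff" suggests a UTF-16 BOM.
first_bytes=$(head -c 2 "$sample" | od -An -tx1 | tr -d ' \n')
echo "first bytes: $first_bytes"

# If the file is UTF-16, convert it to UTF-8. With the real file, the
# converted output could be piped into the tokenizer instead, e.g.
#   iconv -f UTF-16 -t UTF-8 ~/Desktop/me.txt | perl tokenizer.perl -l de
converted=$(iconv -f UTF-16 -t UTF-8 "$sample")
echo "converted text: $converted"

rm -f "$sample"
```

If the first bytes are something other than a byte order mark, the file is probably in a different encoding (for example ISO-8859-1), and the `-f` argument would need to name that encoding instead.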

Tags:

unicode

utf-8

perl

tokenize

user89423
