My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.
At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).
I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0; 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (illegal reverse-BOM) from, but at least having a complaint is consistent with Sphinx.
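For reference, a minimal reproduction of the version-dependent failure described above (bad.txt is just an illustrative filename):

    use v5.12;
    use warnings qw( FATAL utf8 );

    # Write the raw octets EF BF BE (the UTF-8 encoding of U+FFFE).
    open my $out, '>:raw', 'bad.txt' or die "open: $!";
    print {$out} "\xEF\xBF\xBE";
    close $out;

    # Reading through the strict layer dies under 5.14.0 with
    # 'utf8 "\xFFFE" does not map to Unicode', but succeeds under 5.12.3.
    open my $in, '<:encoding(UTF-8)', 'bad.txt' or die "open: $!";
    my $text = <$in>;
    close $in;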
Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that didn't require an encoded I/O layer, as that will just leave me with an error message and no way to sanitize the raw octets.
I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
"\xEF\xBF\xBE"
, when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.
Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
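A sketch of that strict decode, using the bytes from the question (though, as the update below notes, some Perl versions fail to reject U+FFFE even here):

    use Encode qw(decode);

    my $bytes = "\xEF\xBF\xBE";

    # 'utf-8' selects utf-8-strict; FB_CROAK makes malformed or
    # non-interchange input fatal, so trap it with eval.
    my $chars = eval { decode('utf-8', $bytes, Encode::FB_CROAK) };
    warn "rejected: $@" if $@;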
Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr, as sketched below).
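A minimal sketch of that manual filter; the codepoint list is illustrative and would grow to cover whatever Sphinx actually rejects:

    # $chars is an already-decoded character string; delete the
    # noncharacters U+FFFE and U+FFFF before re-encoding for output.
    $chars =~ tr/\x{FFFE}\x{FFFF}//d;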
You have a utf8 string containing some invalid utf8. This replaces each invalid sequence with the default 'bad char', U+FFFD (the Unicode replacement character).
    use Encode qw(decode encode);

    # decode returns characters, replacing malformed sequences with U+FFFD;
    # re-encoding with FB_CROAK then guarantees valid UTF-8 octets.
    my $chars     = decode('UTF-8', $malformed_utf8, Encode::FB_DEFAULT);
    my $good_utf8 = encode('UTF-8', $chars, Encode::FB_CROAK);
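For example, continuing from the snippet above with a made-up input:

    # Illustrative input: valid text with two junk octets inserted.
    my $malformed_utf8 = "good text \xFF\xFE more text";

    my $chars     = decode('UTF-8', $malformed_utf8, Encode::FB_DEFAULT);
    my $good_utf8 = encode('UTF-8', $chars, Encode::FB_CROAK);
    # The junk octets come out as U+FFFD replacement characters, so
    # $good_utf8 is strictly valid UTF-8 and safe to hand to Sphinx.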