Unicode Encoding and decoding issues in QRCode

I am trying to generate a UTF-8 QR code so that I can encode accents and other Unicode characters.

To test it, I am using several decoders:

  1. http://zxing.org/w/decode.jspx - The zxing project, also used on Android
  2. http://www.drhu.org/QRCode/QRDecoder.php - a PHP decoder
  3. http://zbar.sf.net - The ZBar bar code reader, an open-source C project for embedded use

All of them always give me the same result.

You can try this image, which works well with Unicode characters.

But if I use zxing or the Google Chart API to generate the QR code, I cannot decode it correctly.

I have tried these:

  1. http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=SHIFT_JIS&chl=R%C3%A9my+Hubscher
  2. http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=ISO-8859-1&chl=R%C3%A9my+Hubscher
  3. http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=R%C3%A9my+Hubscher

None of them decodes correctly.

Do you know how I can do this? Do you know which encoding is used in the working image?
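For reference, the three URLs above can be rebuilt with a short Python sketch. The helper name `chart_url` is hypothetical, and the code only builds the URL strings; it does not call the service. It shows that the `chl` text is percent-encoded as UTF-8 (`%C3%A9` for é) in all three requests, and that only the `choe` parameter changes.

```python
from urllib.parse import urlencode

BASE = "http://chart.apis.google.com/chart"


def chart_url(text: str, choe: str) -> str:
    """Build a chart URL like the ones listed above (hypothetical helper).

    urlencode() percent-encodes non-ASCII text as UTF-8 by default and
    turns spaces into '+'.
    """
    params = {"cht": "qr", "chs": "200x200", "choe": choe, "chl": text}
    return BASE + "?" + urlencode(params)


for choe in ("SHIFT_JIS", "ISO-8859-1", "UTF-8"):
    print(chart_url("Rémy Hubscher", choe))
```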

Natim asked Oct 23 '09 08:10



2 Answers

Heuristics used by QR decoders often fail; a BOM does not help

Most QR decoders use heuristics to detect the character encoding automatically, even when it is specified explicitly inside the QR code via the ECI extension.

It turned out that the BOM helped your decoder, but for most decoders it does not. As an example of a decoder that cannot display a proper UTF-8 string, take a Xiaomi phone running MIUI Global v11.0.3 with its native scanner application. This phone cannot correctly show the UTF-8 QR code produced by the link in your original question; it displayed: R閙y Hubscher. With the BOM (using the link from your subsequent message) it displayed: ?R閙y Hubscher (it just showed the BOM character as ?). But if you add a Chinese character such as 日 before the string instead of the BOM, the Xiaomi shows the string correctly. Here is the link: chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%E6%97%A5R%C3%A9my%20Hubscher The Xiaomi correctly displays the string 日Rémy Hubscher from the QR code generated by this link.
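The kind of garbling described above can be reproduced with a small Python sketch that decodes the same bytes under different character-set guesses. The GBK guess is an assumption made purely for illustration: nothing here establishes which bytes the QR code actually carried or which character set the Xiaomi scanner chose. The last two lines only show the byte prefixes that a BOM and the character 日 add to the UTF-8 data.

```python
text = "Rémy"

utf8_bytes = text.encode("utf-8")         # b'R\xc3\xa9my'
latin1_bytes = text.encode("iso-8859-1")  # b'R\xe9my'

# A decoder that guesses ISO-8859-1 for UTF-8 data shows two characters for é.
print(utf8_bytes.decode("iso-8859-1"))    # RÃ©my

# A decoder that guesses a double-byte CJK encoding (GBK here, an assumption)
# may merge the byte for é with the following letter into one character.
print(latin1_bytes.decode("gbk", errors="replace"))

# Prefixes a heuristic might key on: the UTF-8 BOM and a CJK character.
bom_prefixed = ("\ufeff" + text).encode("utf-8")  # starts with EF BB BF
cjk_prefixed = ("日" + text).encode("utf-8")       # starts with E6 97 A5
```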

Another example is the “QR code reader & QR code Scanner” Android app by TWMobile. It properly decoded the QR codes from all the links that you provided, so you do not need a BOM for the TWMobile scanner to display the strings properly.

Why do QR decoders always use heuristics to detect the character set, even though these heuristics frequently fail, as shown in your case? As you know, there are 4 modes of storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji. So the QR code standard does not inherently support UTF-8. To use UTF-8 (instead of the default “ISO-8859-1” or “JIS8”) in the 8-bit string, the implementation has to insert an ECI (Extended Channel Interpretation) before that string.

ECI is an optional, additional feature of a QR code. The good news is that it has been defined in the QR code standard since at least 2000. ECI enables data encoding using character sets other than the default. It also enables other data interpretations (e.g. compacted data using defined compression schemes) or other industry-specific requirements to be encoded. The ECI protocol is defined in a specification developed by AIM, Inc., which is not freely available but can be purchased.

Unfortunately, not all QR decoders can handle the ECI protocol, even for something as basic as changing the default encoding to UTF-8. And even for a default encoding like “ISO-8859-1” (for the 8-bit string mode) or “Shift_JIS” (for Kanji mode), decoders still use heuristics to determine the character set, because some applications that encode QR codes may not support ECI or may specify an incorrect character set.
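To make the ECI mechanism more concrete, here is a minimal Python sketch of the bit sequence an encoder would emit for a UTF-8 string: an ECI header followed by a byte-mode segment. It reflects my understanding of the ISO/IEC 18004 layout (ECI mode indicator 0111, ECI assignment number 26 for UTF-8, byte mode indicator 0100) and assumes a version 1 to 9 symbol, where the character count is 8 bits. The function name is hypothetical, and this is not a full encoder: it leaves out the terminator, padding, and error correction.

```python
def eci_byte_segment_bits(text: str, eci: int = 26, encoding: str = "utf-8") -> str:
    """Return an ECI header plus byte-mode segment as a string of '0'/'1'.

    Sketch only: assumes a version 1-9 symbol (8-bit character count) and
    an ECI assignment number below 128 (single-byte designator).
    """
    data = text.encode(encoding)
    if not 0 <= eci < 128:
        raise ValueError("this sketch only handles one-byte ECI designators")
    if len(data) > 255:
        raise ValueError("too long for an 8-bit character count")
    bits = "0111"                      # ECI mode indicator
    bits += format(eci, "08b")         # ECI designator; 26 means UTF-8
    bits += "0100"                     # byte (8-bit) mode indicator
    bits += format(len(data), "08b")   # character count = number of bytes
    bits += "".join(format(b, "08b") for b in data)  # the encoded bytes
    return bits


bits = eci_byte_segment_bits("Rémy")
# Header fields: ECI mode, ECI designator, byte mode, character count.
print(bits[:4], bits[4:12], bits[12:16], bits[16:24])
# prints: 0111 00011010 0100 00000101
```

A decoder that honours ECI would read the first twelve bits and decode the following bytes as UTF-8 without guessing; one that ignores them falls back on heuristics.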

Conclusion

Because they use heuristics to detect the character set automatically, QR decoders often fail to display the string properly, even when the correct encoding is explicitly specified via ECI, as it was in your case. The BOM character did not help either, as the Xiaomi example shows. You found a solution in your reply, but it does not work on the Xiaomi: some QR decoders use heuristics so crude that even a BOM does not help.

Although the BOM did help with your QR decoder, a better solution would be to stop using error-prone QR decoders that rely on heuristics even when the character encoding is explicitly specified via ECI.

If a decoder cannot properly decode the text without a BOM, find a better decoder. The encoder that you used (via the links) is fine.

Maxim Masiutin answered Oct 07 '22 07:10


The solution I came up with is to encode the text in UTF-8 and add a BOM to signal that the string is actually UTF-8.

Here it works:

  • http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%EF%BB%BFR%C3%A9my+Hubscher
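As a rough sketch, the same URL can be produced in Python by prepending U+FEFF to the text before percent-encoding it. The helper name `chart_url_with_bom` is hypothetical, and the code only builds the URL string.

```python
from urllib.parse import urlencode

BOM = "\ufeff"  # U+FEFF; its UTF-8 encoding is the bytes EF BB BF


def chart_url_with_bom(text: str) -> str:
    """Build the chart URL with a BOM prepended to the text (hypothetical helper)."""
    params = {"cht": "qr", "chs": "200x200", "choe": "UTF-8", "chl": BOM + text}
    return "http://chart.apis.google.com/chart?" + urlencode(params)


print(chart_url_with_bom("Rémy Hubscher"))
```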
Natim answered Oct 07 '22 06:10