
How to test an application for correct encoding (e.g. UTF-8)

Encoding issues are among the topics that have bitten me most often during development. Every platform insists on its own encoding, and most likely some non-UTF-8 defaults are in play. (I usually work on Linux, which defaults to UTF-8; my colleagues mostly work on German Windows, which defaults to ISO-8859-1 or a similar Windows code page.)

I believe that UTF-8 is a suitable standard for developing an application that can be internationalized. However, in my experience encoding bugs are usually discovered late (even though I'm located in Germany, where we have some special characters that, along with ISO-8859-1, provide some detectable differences).

I believe that developers with a completely non-ASCII character set (or those who know a language that uses such a character set) have a head start in providing test data. But there must be a way to ease this for the rest of us as well.

What [technique|tool|incentive] are people here using? How do you get your co-developers to care about these issues? How do you test for compliance? Are those tests conducted manually or automatically?

Adding one possible answer upfront:

I've recently discovered fliptitle.com (it provides an easy way to get weird characters written "uʍop ǝpısdn" *), and I'm planning to use it to produce easily verifiable UTF-8 character strings, as most of the characters used there sit at some unusual position in the encoding. But there surely must be more systematic tests, patterns, or techniques for ensuring UTF-8 compatibility/usage.

Note: Even though there's an accepted answer, I'd like to know of more techniques and patterns if there are some. Please add more answers if you have more ideas. And it has not been easy choosing only one answer for acceptance. I've chosen the regexp answer for the least expected angle to tackle the problem although there would be reasons to choose other answers as well. Too bad only one answer can be accepted.

Thank you for your input.

*) that's "upside down" written "upside down" for those that cannot see those characters due to font problems

Olaf Kock asked Jan 25 '09 20:01



1 Answer

Thank you for fliptitle!

I, too, am trying to lay out a proper test plan to make sure that an application supports Unicode strings throughout the system.

I am bilingual, but in two languages that only use ISO-8859-1. Therefore, I have been struggling to determine what is a "real-life," "meaningful" way to test the full range of Unicode possibilities.

I just came across this:

  • International Testing Basics - Testing non-English and non-ASCII support

Follow-Up Post:

After devising some tests for my application, I realized that I had put together a small list of encoded values that might be helpful to others.

I am using the following international strings in my test:

(NOTE: here comes some UTF-8 encoded text... hopefully you can see this in your browser)

ユーザー別サイト
简体中文
크로스 플랫폼으로
מדורים מבוקשים
أفضل البحوث
Σὲ γνωρίζω ἀπὸ
Десятую Международную
แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช
∮ E⋅da = Q,  n → ∞, ∑ f(i) = ∏ g(i)
français langue étrangère
mañana olé

(End of UTF-8 foreign/non-English text)

However, at various points during testing, I realized that it was insufficient to only have information about how the strings were supposed to look when rendered in their respective foreign alphabets. I also needed to know the correct Unicode codepoint numbers, and also the correct hexadecimal values for these strings in at least two encodings (UCS-2 and UTF-8).

Here is the equivalent code-point numbering and hex values:

str = L"\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8"; // JAPAN 
// Little endian UTF-16/UCS-2: e6 30 fc 30 b6 30 fc 30 25 52 b5 30 a4 30 c8 30 00 00
// Hex of UTF-8: e3 83 a6 e3 83 bc e3 82 b6 e3 83 bc e5 88 a5 e3 82 b5 e3 82 a4 e3 83 88 00 

str = L"\u7B80\u4F53\u4E2D\u6587"; // CHINA 
// Little endian UTF-16/UCS-2: 80 7b 53 4f 2d 4e 87 65 00 00 
// Hex of UTF-8: e7 ae 80 e4 bd 93 e4 b8 ad e6 96 87 00

str = L"\uD06C\uB85C\uC2A4 \uD50C\uB7AB\uD3FC\uC73C\uB85C"; // KOREA 
// Little endian UTF-16/UCS-2: 6c d0 5c b8 a4 c2 20 00 0c d5 ab b7 fc d3 3c c7 5c b8 00 00
// Hex of UTF-8: ed 81 ac eb a1 9c ec 8a a4 20 ed 94 8c eb 9e ab ed 8f bc ec 9c bc eb a1 9c 00 

str = L"\u05DE\u05D3\u05D5\u05E8\u05D9\u05DD \u05DE\u05D1\u05D5\u05E7\u05E9\u05D9\u05DD"; // ISRAEL 
// Little endian UTF-16/UCS-2: de 05 d3 05 d5 05 e8 05 d9 05 dd 05 20 00 de 05 d1 05 d5 05 e7 05 e9 05 d9 05 dd 05 00 00
// Hex of UTF-8: d7 9e d7 93 d7 95 d7 a8 d7 99 d7 9d 20 d7 9e d7 91 d7 95 d7 a7 d7 a9 d7 99 d7 9d 00

str = L"\u0623\u0641\u0636\u0644 \u0627\u0644\u0628\u062D\u0648\u062B"; // EGYPT 
// Little endian UTF-16/UCS-2: 23 06 41 06 36 06 44 06 20 00 27 06 44 06 28 06 2d 06 48 06 2b 06 00 00
// Hex of UTF-8: d8 a3 d9 81 d8 b6 d9 84 20 d8 a7 d9 84 d8 a8 d8 ad d9 88 d8 ab 00 

str = L"\u03A3\u1F72 \u03B3\u03BD\u03C9\u03C1\u03AF\u03B6\u03C9 \u1F00\u03C0\u1F78"; // GREECE 
// Little endian UTF-16/UCS-2: a3 03 72 1f 20 00 b3 03 bd 03 c9 03 c1 03 af 03 b6 03 c9 03 20 00 00 1f c0 03 78 1f 00 00
// Hex of UTF-8: ce a3 e1 bd b2 20 ce b3 ce bd cf 89 cf 81 ce af ce b6 cf 89 20 e1 bc 80 cf 80 e1 bd b8 00 

str = L"\u0414\u0435\u0441\u044F\u0442\u0443\u044E \u041C\u0435\u0436\u0434\u0443\u043D\u0430\u0440\u043E\u0434\u043D\u0443\u044E"; // RUSSIA 
// Little endian UTF-16/UCS-2: 14 04 35 04 41 04 4f 04 42 04 43 04 4e 04 20 00 1c 04 35 04 36 04 34 04 43 04 3d 04 30 04 40 04 3e 04 34 04 3d 04 43 04 4e 04 00 00
// Hex of UTF-8: d0 94 d0 b5 d1 81 d1 8f d1 82 d1 83 d1 8e 20 d0 9c d0 b5 d0 b6 d0 b4 d1 83 d0 bd d0 b0 d1 80 d0 be d0 b4 d0 bd d1 83 d1 8e 00

str = L"\u0E41\u0E1C\u0E48\u0E19\u0E14\u0E34\u0E19\u0E2E\u0E31\u0E48\u0E19\u0E40\u0E2A\u0E37\u0E48\u0E2D\u0E21\u0E42\u0E17\u0E23\u0E21\u0E41\u0E2A\u0E19\u0E2A\u0E31\u0E07\u0E40\u0E27\u0E0A"; // THAILAND
// Little endian UTF-16/UCS-2: 41 0e 1c 0e 48 0e 19 0e 14 0e 34 0e 19 0e 2e 0e 31 0e 48 0e 19 0e 40 0e 2a 0e 37 0e 48 0e 2d 0e 21 0e 42 0e 17 0e 23 0e 21 0e 41 0e 2a 0e 19 0e 2a 0e 31 0e 07 0e 40 0e 27 0e 0a 0e 00 00
// Hex of UTF-8: e0 b9 81 e0 b8 9c e0 b9 88 e0 b8 99 e0 b8 94 e0 b8 b4 e0 b8 99 e0 b8 ae e0 b8 b1 e0 b9 88 e0 b8 99 e0 b9 80 e0 b8 aa e0 b8 b7 e0 b9 88 e0 b8 ad e0 b8 a1 e0 b9 82 e0 b8 97 e0 b8 a3 e0 b8 a1 e0 b9 81 e0 b8 aa e0 b8 99 e0 b8 aa e0 b8 b1 e0 b8 87 e0 b9 80 e0 b8 a7 e0 b8 8a 00

str = L"\u222E E\u22C5da = Q,  n \u2192 \u221E, \u2211 f(i) = \u220F g(i)"; // MATHEMATICS 
// Little endian UTF-16/UCS-2: 2e 22 20 00 45 00 c5 22 64 00 61 00 20 00 3d 00 20 00 51 00 2c 00 20 00 20 00 6e 00 20 00 92 21 20 00 1e 22 2c 00 20 00 11 22 20 00 66 00 28 00 69 00 29 00 20 00 3d 00 20 00 0f 22 20 00 67 00 28 00 69 00 29 00 00 00
// Hex of UTF-8: e2 88 ae 20 45 e2 8b 85 64 61 20 3d 20 51 2c 20 20 6e 20 e2 86 92 20 e2 88 9e 2c 20 e2 88 91 20 66 28 69 29 20 3d 20 e2 88 8f 20 67 28 69 29 00 

str = L"fran\u00E7ais langue \u00E9trang\u00E8re"; // FRANCE
// Little endian UTF-16/UCS-2: 66 00 72 00 61 00 6e 00 e7 00 61 00 69 00 73 00 20 00 6c 00 61 00 6e 00 67 00 75 00 65 00 20 00 e9 00 74 00 72 00 61 00 6e 00 67 00 e8 00 72 00 65 00 00 00
// Hex of UTF-8: 66 72 61 6e c3 a7 61 69 73 20 6c 61 6e 67 75 65 20 c3 a9 74 72 61 6e 67 c3 a8 72 65 00

str = L"ma\u00F1ana ol\u00E9"; // SPAIN
// Little endian UTF-16/UCS-2: 6d 00 61 00 f1 00 61 00 6e 00 61 00 20 00 6f 00 6c 00 e9 00 00 00
// Hex of UTF-8: 6d 61 c3 b1 61 6e 61 20 6f 6c c3 a9 00
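
The byte sequences listed above do not have to be typed by hand. As a minimal sketch, the following C++ derives both encodings from a list of code points, so the expected values in a test can be regenerated or cross-checked. The helper names (`to_utf8`, `to_utf16le`, `hex`, `check_china_sample`) are made up for this illustration and are not part of the original answer. The encoders assume valid input and do no validation, such as rejecting lone surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: encode code points as UTF-8 bytes (no validation).
std::string to_utf8(const std::vector<std::uint32_t>& cps) {
    std::string out;
    for (std::uint32_t cp : cps) {
        if (cp < 0x80) {                       // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: encode code points as little-endian UTF-16 bytes,
// using a surrogate pair for anything above U+FFFF.
std::string to_utf16le(const std::vector<std::uint32_t>& cps) {
    std::string out;
    auto put = [&out](std::uint32_t unit) {
        out += static_cast<char>(unit & 0xFF);         // low byte first
        out += static_cast<char>((unit >> 8) & 0xFF);
    };
    for (std::uint32_t cp : cps) {
        if (cp < 0x10000) {
            put(cp);
        } else {
            cp -= 0x10000;
            put(0xD800 | (cp >> 10));
            put(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}

// Hypothetical helper: render bytes as space-separated lowercase hex.
std::string hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Cross-check against the CHINA entry listed above (without the 00 terminator).
void check_china_sample() {
    const std::vector<std::uint32_t> cps = {0x7B80, 0x4F53, 0x4E2D, 0x6587};
    assert(hex(to_utf8(cps)) == "e7 ae 80 e4 bd 93 e4 b8 ad e6 96 87");
    assert(hex(to_utf16le(cps)) == "80 7b 53 4f 2d 4e 87 65");
}
```

Note that the listings above include the trailing zero terminator (`00` for UTF-8, `00 00` for UTF-16), which this sketch does not emit.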

Also, here are a couple of images that show some common "mis-renderings" that can happen in various editors, even though the underlying bytes are well-formed UTF-8. If you see any of these renderings, it probably means that you correctly produced a UTF-8 string, but that your editor/viewer is interpreting the bytes under some encoding other than UTF-8.

Sample Renderings Num. 1

Sample Renderings Num. 2
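
A mis-rendering of this kind can also be reproduced deliberately, which makes it recognizable in an automated test. The sketch below is an illustration under stated assumptions, not part of the original answer: the function names `misread_as_latin1` and `check_spain_sample` are hypothetical. It treats each byte of a UTF-8 string as one ISO-8859-1 character and re-encodes the result as UTF-8, which is roughly what a viewer assuming ISO-8859-1 would display. It does not model Windows-1252, which assigns printable characters to most bytes in the range 0x80 to 0x9F.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: interpret each byte of `utf8` as one ISO-8859-1
// character (byte value == code point) and return that text as UTF-8.
std::string misread_as_latin1(const std::string& utf8) {
    std::string out;
    for (unsigned char b : utf8) {
        if (b < 0x80) {
            out += static_cast<char>(b);                  // ASCII is unchanged
        } else {
            // U+0080..U+00FF take two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}

// The SPAIN sample: each two-byte UTF-8 sequence turns into two characters.
void check_spain_sample() {
    const std::string good = "ma\xC3\xB1" "ana ol\xC3\xA9";  // UTF-8 bytes
    const std::string bad  = "ma\xC3\x83\xC2\xB1" "ana ol\xC3\x83\xC2\xA9";
    assert(misread_as_latin1(good) == bad);
}
```

With this, "mañana olé" comes out as "maÃ±ana olÃ©". If a test finds the second form in stored or displayed data, the bytes were most likely valid UTF-8 that was decoded as a single-byte encoding somewhere along the way.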

pestophagous answered Oct 11 '22 04:10