
Unicode test strings for unit tests

I need some UTF-32 test strings to exercise some cross-platform string manipulation code. I'd like a suite of test strings that exercise the UTF-32 <-> UTF-16 <-> UTF-8 conversions, to validate that characters outside the BMP can be transformed from UTF-32, through UTF-16 surrogates, through UTF-8, and back again correctly.

And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.

Chris Becke asked May 26 '11 10:05



2 Answers

Although this isn't quite what you asked for, I've always found this test document useful.

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

The same site also offers this:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt

... which contains equivalents of the English "quick brown fox" sentence for a variety of languages, each exercising all the characters that language uses. This page refers to a larger list of pangrams which used to be on Wikipedia but was apparently deleted there. The list is still available here:

http://clagnut.com/blog/2380/
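
For the non-BMP round trip described in the question, a minimal sketch might look like the following. It is written in Python purely for illustration and uses Python's built-in codecs as the reference converter; the sample strings and the names `SAMPLES`, `round_trip`, and `utf16_units` are my own picks, not taken from any of the linked files.

```python
# Illustrative samples (hypothetical picks, not taken from the linked files).
# Every entry contains at least one code point above U+FFFF.
SAMPLES = {
    "Gothic letter hwair": "\U00010348",
    "musical symbol G clef": "\U0001D11E",
    "CJK Extension B ideograph": "\U00020BB7",
    "emoji between ASCII letters": "a\U0001F600b",
}


def round_trip(text):
    """Encode as UTF-32, then convert UTF-32 -> UTF-16 -> UTF-8 -> UTF-32."""
    utf32 = text.encode("utf-32-le")
    utf16 = utf32.decode("utf-32-le").encode("utf-16-le")
    utf8 = utf16.decode("utf-16-le").encode("utf-8")
    back = utf8.decode("utf-8").encode("utf-32-le")
    return utf32, utf16, utf8, back


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


for name, text in SAMPLES.items():
    utf32, utf16, utf8, back = round_trip(text)
    # The bytes that come back must match the UTF-32 bytes we started with.
    assert back == utf32, name
    units = " ".join(f"{u:04X}" for u in utf16_units(text))
    print(f"{name}: UTF-16 [{units}] UTF-8 [{utf8.hex()}]")
```

The printed code units and bytes could then serve as expected values when testing your own converters against the same inputs.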

tialaramex answered Oct 23 '22 15:10



https://github.com/noct/cutf/tree/master/bin

It includes the following files:

UTF-8-demo.txt
big.txt
quickbrown.txt
utf8_invalid.txt
TarmoPikaro answered Oct 23 '22 16:10
