 

Is there a set of "Lorem ipsums" files for testing character encoding issues?

For layout testing we have the famous "Lorem ipsum" text to see how things look.

What I am looking for is a set of files containing text in several different encodings, which I can use in my JUnit tests to exercise methods that deal with character encoding when reading text files.

Example:

Suppose we have an ISO 8859-1 encoded test file and a Windows-1252 encoded test file. The Windows-1252 file has to trigger the differences in the range 0x80–0x9F. In other words, it must contain at least one character from that range to distinguish it from ISO 8859-1.

Maybe the best set of test files is one where the file for each encoding contains every one of its characters exactly once. But maybe I am not aware of something - we all love this encoding stuff, right? :-)

Is there such a set of test-files for character-encoding issues out there?
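To illustrate why a byte in that 0x80–0x9F range is the distinguishing factor, here is a minimal sketch (plain JDK, class name is my own) that decodes the same byte with both charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingProbe {
    public static void main(String[] args) {
        // Byte 0x93 lies in the 0x80-0x9F range where the two encodings differ.
        byte[] bytes = { (byte) 0x93 };

        // ISO 8859-1 maps 0x80-0x9F to the C1 control characters U+0080-U+009F.
        String iso = new String(bytes, StandardCharsets.ISO_8859_1);
        // Windows-1252 maps 0x93 to U+201C (left double quotation mark).
        String win = new String(bytes, Charset.forName("windows-1252"));

        System.out.println(Integer.toHexString(iso.charAt(0))); // 93
        System.out.println(Integer.toHexString(win.charAt(0))); // 201c
    }
}
```

So a test file containing, say, curly quotes or the euro sign would decode differently under the two charsets, which is exactly what such a test fixture needs.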

Asked Feb 08 '12 by Fabian Barney



3 Answers

The Wikipedia article on diacritics is pretty comprehensive; unfortunately, you have to extract the characters manually. There may also be mnemonic sentences for each language. For instance, in Polish we use:

Zażółć gęślą jaźń

which contains all 9 Polish diacritics in one correct sentence. Another useful search hint is pangrams: sentences using every letter of the alphabet at least once:

  • in Spanish, "El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja." (all 27 letters and diacritics).

  • in Russian, "Съешь же ещё этих мягких французских булок, да выпей чаю" (all 33 Russian Cyrillic alphabet letters).

List of pangrams contains an exhaustive summary. Anyone care to wrap this in a simple:

public interface NationalCharacters {
  String spanish();
  String russian();
  //...
}

library?
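A minimal sketch of what such a library could look like (class and method names here are my own invention), pairing the pangrams above with NIO so they can be written out as encoding test fixtures:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class PangramFixtures {
    // Pangrams/mnemonics quoted in this answer.
    public static final String POLISH  = "Zażółć gęślą jaźń";
    public static final String RUSSIAN =
        "Съешь же ещё этих мягких французских булок, да выпей чаю";

    // Write a pangram to a test file in the given encoding,
    // e.g. as a JUnit fixture generated in a @BeforeEach method.
    public static Path writeFixture(Path dir, String text, Charset cs)
            throws IOException {
        Path file = dir.resolve("pangram-" + cs.name() + ".txt");
        Files.write(file, text.getBytes(cs));
        return file;
    }
}
```

Generating the fixtures from string constants at test time sidesteps the classic pitfall of checked-in test files being silently re-encoded by editors or version control.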

Answered Oct 07 '22 by Tomasz Nurkiewicz


How about trying the ICU test suite files? I don't know whether they are exactly what you need for your tests, but they seem to include fairly complete from/to UTF mapping files at least: Link to the repo for ICU test files

Answered Oct 07 '22 by Daniel Teply


I don't know of any complete text documents, but if you can start with a simple overview of all character sets, there are some files available on the ftp.unicode.org server.

Here is the mapping table for Windows code page 1250, for example. The first column is the hexadecimal byte value, and the second the corresponding Unicode code point.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
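Those mapping files share a simple layout: whitespace-separated hexadecimal columns with `#` starting comments. That makes them easy to load in a test; here is a sketch (hypothetical class name, assuming that layout):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MappingFileParser {
    // Parse lines of a unicode.org mapping file, e.g.
    //   "0x80\t0x20AC\t#EURO SIGN"
    // into a byte-value -> code-point map.
    public static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue; // comments
            String[] cols = line.split("\\s+");
            // Undefined slots have no second hex column; skip them.
            if (cols.length < 2 || !cols[1].startsWith("0x")) continue;
            map.put(Integer.decode(cols[0]), Integer.decode(cols[1]));
        }
        return map;
    }
}
```

A JUnit test could then decode a fixture file byte by byte and assert that each character matches the code point the mapping table promises.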

Answered Oct 07 '22 by Optimist