What is the minimum test to verify that a component can save/retrieve UTF-8 encoded strings

I am integration testing a component. The component allows you to save and fetch strings.

I want to verify that the component is handling UTF-8 characters properly. What is the minimum test that is required to verify this?

I think that doing something like this is a good start:

// This is the ☺ character
String toSave = "\u263A";
int id = 123;

// Saves to Database
myComponent.save( id, toSave );

// Retrieve from Database
String fromComponent = myComponent.retrieve( id );

// Verify they are same 
org.junit.Assert.assertEquals( toSave, fromComponent );

One mistake I have made in the past is I have set String toSave = "è". My test passed because the string was saved and retrieved properly to/from the DB. Unfortunately the application was not actually working correctly because the app was using ISO 8859-1 encoding. This meant that è worked but other characters like ☺ did not.

Question restated: What is the minimum test (or tests) to verify that I can persist UTF-8 encoded strings?

asked Apr 21 '17 by sixtyfootersdude

2 Answers

A code and/or documentation review is probably your best option here, but you can probe if you want. It seems that a sufficient test is the goal and minimizing it is less important. It is hard to say what a sufficient test is based only on speculation about the threat, but here is my suggestion: all codepoints, including U+0000, plus proper handling of "combining characters".

The method you want to test has a Java string as a parameter. Java doesn't have "UTF-8 encoded strings": Java's native text datatypes use the UTF-16 encoding of the Unicode character set. This is common for in-memory representations of text; it's used by Java, .NET, JavaScript, VB6, VBA, and others. UTF-8 is commonly used for streams and storage, so it makes sense to ask about it in the context of "saving and fetching". Databases typically offer one or more of UTF-8, 3-byte-limited UTF-8, or UTF-16 (NVARCHAR) datatypes and collations.
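To make that distinction concrete, here is a minimal sketch (the class name and the specific characters are illustrative choices, not from the question) showing how the same codepoint counts differently in UTF-16 and UTF-8:

import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    public static void main(String[] args) {
        String smiley = "\u263A";      // U+263A from the question, inside the BMP
        String emoji = "\uD83D\uDE00"; // U+1F600, a surrogate pair in UTF-16

        System.out.println(smiley.length());                                // 1 UTF-16 code unit
        System.out.println(smiley.getBytes(StandardCharsets.UTF_8).length); // 3 UTF-8 bytes

        System.out.println(emoji.length());                                 // 2 UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));        // 1 codepoint
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);  // 4 UTF-8 bytes
    }
}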

The encoding is an implementation detail. If the component accepts a Java string, it should either throw an exception for data it is unwilling to handle or handle it properly.

"Characters" is a rather ill-defined term. Unicode codepoints range from 0x0 to 0x10FFFF (21 bits). Some codepoints are unassigned (not "defined"), depending on the revision of the Unicode Standard. Java's datatypes can handle any codepoint, but information about them is limited by version: for Java 8, "Character information is based on the Unicode Standard, version 6.2.0." You can limit the test to "defined" codepoints or test all possible codepoints.

A codepoint is either a base "character" or a "combining character", and each codepoint falls in exactly one Unicode category; certain categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. It might be difficult to lay out graphemes graphically (see Zalgo text), but for text storage all that is needed is to not mangle the sequence of codepoints (and the byte order, if applicable).
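As a small illustration of the base-plus-combining distinction (the strings are illustrative): "é" can be written either as the precomposed codepoint U+00E9 or as a base "e" followed by the combining acute accent U+0301, and a faithful store must round-trip both without normalizing one into the other:

String precomposed = "\u00E9";  // é as a single precomposed codepoint (U+00E9)
String decomposed = "e\u0301";  // base 'e' followed by combining acute accent (U+0301)

// They render identically but are different codepoint sequences;
// storage must not silently convert one into the other.
System.out.println(precomposed.equals(decomposed)); // false
System.out.println(precomposed.length());           // 1 UTF-16 code unit
System.out.println(decomposed.length());            // 2 UTF-16 code units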

So, here is a non-minimal, somewhat comprehensive test:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

import static org.junit.Assert.assertEquals;

final Stream<Integer> codepoints = IntStream
    .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
    .filter(Character::isDefined) // optional filtering
    .boxed();
// Arrays.binarySearch requires a sorted array: ENCLOSING_MARK is 7,
// COMBINING_SPACING_MARK is 8. (NON_SPACING_MARK is not included here;
// adding it would change the expected counts below.)
final int[] combiningCategories = {
    Character.ENCLOSING_MARK,
    Character.COMBINING_SPACING_MARK
};
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
    .collect(Collectors.partitioningBy(cp ->
        Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true) // not combining
    .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
    .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;
final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
// String.getBytes returns exactly the encoded bytes; Charset.encode(...).array()
// could return a backing array larger than the encoded content.
final byte[] testUTF8 = test.getBytes(StandardCharsets.UTF_8);

// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length());    // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
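
To turn this into the round-trip check from the question, the generated string can be pushed through the component under test; this is a sketch assuming the myComponent interface from the question:

// Round-trip the generated test string through the component under test.
int id = 456; // arbitrary id for the test row
myComponent.save(id, test);
assertEquals(test, myComponent.retrieve(id));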
answered Sep 23 '22 by Tom Blodget

If your component is only capable of storing and retrieving strings, then all you need to do is make sure that nothing gets lost in the conversion between Java's Unicode strings and the UTF-8 strings that the component stores.

That would involve checking with at least one character from each UTF-8 encoded length. So, I would suggest checking with:

  • One character from the US-ASCII set (1-byte code point), then

  • One character from Greek (2-byte code point), and

  • One character from Chinese (3-byte code point).

  • Ideally also an emoji (4-byte code point). In Java these are represented as a surrogate pair of two chars rather than a single char, so they exercise the UTF-16 side of the conversion as well.

A useful extra test is a string combining at least one character from each of the cases above, to make sure that characters of different encoded lengths can coexist within the same string.
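Here is a minimal sketch of such a test, assuming the save/retrieve interface from the question (the specific characters are illustrative choices):

// One character per UTF-8 encoded length, plus a string mixing them all.
String ascii = "A";              // U+0041: 1 byte in UTF-8
String greek = "\u03B1";         // U+03B1 (α): 2 bytes
String chinese = "\u4E2D";       // U+4E2D (中): 3 bytes
String emoji = "\uD83D\uDE00";   // U+1F600 (😀): 4 bytes, a surrogate pair in UTF-16
String mixed = ascii + greek + chinese + emoji;

int id = 1;
for (String s : new String[] { ascii, greek, chinese, emoji, mixed }) {
    myComponent.save(id, s);
    org.junit.Assert.assertEquals(s, myComponent.retrieve(id));
    id++;
}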

(If your component does anything more than storing and retrieving strings, like searching for strings, then things can get a bit more complicated, but it seems to me that you specifically avoided asking about that.)

I do believe that black box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that would expose knowledge of its internals. However, there are two things that you can do to increase the testability of the component without ruining its interface:

  1. Introduce additional functions to the interface that help with testing without disclosing anything about the internal implementation, and without requiring the test code to have any knowledge of the component's internals.

  2. Introduce functionality useful for testing in the constructor of your component. The code that constructs the component knows precisely what it is constructing, and is therefore intimately familiar with its nature, so it is okay to pass something implementation-specific there.

An example of what you could do with either of the above techniques would be to artificially limit the number of bytes that the internal representation is allowed to occupy, so that you can make sure a certain string you are planning to store will fit. For example, you could limit the internal size to no more than 9 bytes, and then make sure that a Java Unicode string containing three Chinese characters gets properly stored and retrieved, as in the sketch below.
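A sketch of how that could look with the constructor technique; MyComponent and its maxInternalBytes parameter are hypothetical names, purely for illustration:

// Hypothetical: the constructor accepts a cap on the internal byte representation.
MyComponent component = new MyComponent(/* maxInternalBytes */ 9);

// Three Chinese characters: 3 UTF-8 bytes each, 9 bytes total, exactly at the cap.
String threeChinese = "\u4E2D\u6587\u5B57"; // 中文字
component.save(1, threeChinese);
org.junit.Assert.assertEquals(threeChinese, component.retrieve(1));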

answered Sep 21 '22 by Mike Nakis