One of my requirement says "Text Box Name should accept only UTF-8 Character set". I want to perform a negative test by entering a non UTF-8 character set. How can I do this?
Non-UTF-8 characters are characters that are not supported by UTF-8 encoding and, they may include symbols or characters from foreign unsupported languages.
Click Tools, then select Web options. Go to the Encoding tab. In the dropdown for Save this document as: choose Unicode (UTF-8). Click Ok.
If you are asking how to construct a non-UTF-8 character, that should be easy from this definition from Wikipedia:
For code points U+0000 through U+007F, each codepoint is one byte long and looks like this:
0xxxxxxx // a
For code points U+0080 through U+07FF, each codepoint is two bytes long and look like this:
110xxxxx 10xxxxxx // b
And so on.
So, to construct an illegal UTF-8 character that is one byte long, the highest bit must be 1 (to be different from pattern a) and the second highest bit must be 0 (to be different from pattern b):
10xxxxxx
or
111xxxxx
Which also differs from both patterns.
With the same logic, you can construct illegal codeunit sequences which are more than two bytes long.
You did not tag a language, but I had to test it, so I used Java:
for (int i=0;i<255;i++) {
System.out.println(
i + " " +
(byte)i + " " +
Integer.toHexString(i) + " " +
String.format("%8s", Integer.toBinaryString(i)).replace(' ', '0') + " " +
new String(new byte[]{(byte)i},"UTF-8")
);
}
0 to 31 are non-printable characters, then 32 is space, followed by printable characters:
...
31 31 1f 00011111
32 32 20 00100000
33 33 21 00100001 !
...
126 126 7e 01111110 ~
127 127 7f 01111111
128 -128 80 10000000 �
delete
is 0x7f
and after it, from 128 inclusively up to 254 no valid characters are printed. You can see from the UTF-8 chartable also:
Codepoint U+007F
is represented with one byte 0x7F
(bits 01111111
), while codepoint U+0080
is represented with two bytes 0xC2 0x80
(bits 11000010 10000000
).
If you are not familiar with UTF-8 I strongly recommend reading this excellent article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With