One of my requirements says "Text Box Name should accept only UTF-8 Character set". I want to perform a negative test by entering input that is not valid UTF-8. How can I do this?


How can I generate a non-UTF-8 Character Set


If you are asking how to construct a byte sequence that is not valid UTF-8, that follows easily from the definition on Wikipedia:

[image: UTF-8 definition]

For code points U+0000 through U+007F, each code point is encoded as one byte that looks like this:

0xxxxxxx   // a

For code points U+0080 through U+07FF, each code point is encoded as two bytes that look like this:

110xxxxx 10xxxxxx  // b

And so on.

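So, to construct an illegal UTF-8 sequence that is one byte long, the highest bit must be 1 (to differ from pattern a). If the second-highest bit is 0, the byte also cannot be the first byte of pattern b:

10xxxxxx

or

111xxxxx

which also differs from both patterns. In fact, any single byte with the highest bit set (0x80 through 0xFF) is invalid on its own: 10xxxxxx is a continuation byte with nothing to continue, and 110xxxxx or 111xxxxx can at best start a multi-byte sequence whose remaining bytes are missing.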
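As a rough illustration of the bit patterns above, the following sketch classifies a single byte by its leading bits. The class name `Utf8ByteKind`, the method `kind`, and the label strings are made up for this example. Note that it only looks at the pattern: some bytes that match a lead-byte pattern (for example 0xC0 and 0xC1) still never occur in valid UTF-8.

```java
public class Utf8ByteKind {
    // Classify one byte by its leading bits (labels are hypothetical names).
    static String kind(int b) {
        b &= 0xFF;                                       // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return "ascii";          // 0xxxxxxx, pattern a
        if ((b & 0xC0) == 0x80) return "continuation";   // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead-2";         // 110xxxxx, first byte of pattern b
        if ((b & 0xF0) == 0xE0) return "lead-3";         // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead-4";         // 11110xxx
        return "never-valid";                            // 11111xxx
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x7F, 0x80, 0xC2, 0xE2, 0xF0, 0xFF};
        for (int b : samples) {
            System.out.println(Integer.toHexString(b) + " " + kind(b));
        }
    }
}
```

Only a byte classified as `ascii` is a complete, valid sequence by itself; every other kind is illegal when it stands alone.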

With the same logic, you can construct illegal code unit sequences that are two bytes long or longer.
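For a negative test it may help to have a handful of such sequences ready, together with a strict check that they really are malformed. The sketch below is one possible way to do that with the JDK's `CharsetDecoder` configured to report errors instead of replacing them. The class name `Utf8NegativeSamples`, the helper `isValidUtf8`, and the choice of sample bytes are assumptions made for this example, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8NegativeSamples {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical sample inputs; all but the last are malformed.
    static final byte[][] SAMPLES = {
        {(byte) 0x80},                            // lone continuation byte
        {(byte) 0xC2},                            // 2-byte lead with nothing after it
        {(byte) 0xC2, (byte) 0x41},               // lead followed by a non-continuation byte
        {(byte) 0xC0, (byte) 0x80},               // overlong encoding of U+0000
        {(byte) 0xE2, (byte) 0x82},               // truncated 3-byte sequence
        {(byte) 0xED, (byte) 0xA0, (byte) 0x80},  // encoded surrogate U+D800
        {(byte) 0xFF},                            // byte that never appears in UTF-8
        {(byte) 0xC2, (byte) 0x80},               // control case: valid encoding of U+0080
    };

    public static void main(String[] args) {
        for (byte[] sample : SAMPLES) {
            StringBuilder hex = new StringBuilder();
            for (byte b : sample) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(hex + "valid=" + isValidUtf8(sample));
        }
    }
}
```

Keep in mind that a Java `String` cannot carry these bytes unchanged, so for a real test they would probably have to be sent as raw bytes (for example in a request body) rather than typed into the text box.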

You did not tag a language, but I had to test it, so I used Java:

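import java.nio.charset.StandardCharsets;

public class SingleByteUtf8 {
    public static void main(String[] args) {
        // Decode each possible single byte (0..255) as UTF-8 and print it.
        for (int i = 0; i < 256; i++) {
            System.out.println(
                i + " " +
                (byte) i + " " +
                Integer.toHexString(i) + " " +
                String.format("%8s", Integer.toBinaryString(i)).replace(' ', '0') + " " +
                // Malformed input is replaced with U+FFFD rather than throwing.
                new String(new byte[]{(byte) i}, StandardCharsets.UTF_8)
            );
        }
    }
}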

0 to 31 are non-printable characters, then 32 is space, followed by printable characters:

...
31 31 1f 00011111 
32 32 20 00100000  
33 33 21 00100001 !
...
126 126 7e 01111110 ~
127 127 7f 01111111 
128 -128 80 10000000 �

Delete (DEL) is 0x7F, and after it, from 128 through 255, no single byte decodes to a valid character; Java substitutes the replacement character U+FFFD (�), as in the last line above. The UTF-8 encoding table shows the same boundary: code point U+007F is represented with one byte 0x7F (bits 01111111), while code point U+0080 is represented with two bytes 0xC2 0x80 (bits 11000010 10000000).
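The boundary between the one-byte and two-byte forms can be checked directly by encoding the two code points. This is a small sketch; the class name `Utf8Boundary` and the helper `utf8Hex` are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    // Hex dump of the UTF-8 encoding of a single code point.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
            .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("U+007F -> " + utf8Hex(0x7F)); // prints "U+007F -> 7F"
        System.out.println("U+0080 -> " + utf8Hex(0x80)); // prints "U+0080 -> C2 80"
    }
}
```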

If you are not familiar with UTF-8 I strongly recommend reading this excellent article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

185

answered Dec 02 '22 04:12

linski


Tags:

utf-8

Nitin Tripathi
