Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Example invalid utf8 string?

I'm testing how some of my code handles bad data, and I need a few series of bytes that are invalid UTF-8.

Can you post some, and ideally, an explanation of why they are bad/where you got them?

like image 578
twk Avatar asked Aug 19 '09 17:08

twk


People also ask

What is an invalid UTF-8 string?

This error is created when the uploaded file is not in a UTF-8 format. UTF-8 is the dominant character encoding format on the World Wide Web. This error occurs because the software you are using saves the file in a different type of encoding, such as ISO-8859, instead of UTF-8.

What characters are not included in UTF-8?

0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.

What is UTF-8 an example of?

UTF-8 is a Unicode character encoding method. This means that UTF-8 takes the code point for a given Unicode character and translates it into a string of binary.

What is valid UTF-8?

UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.


1 Answers

Take a look at Markus Kuhn's UTF-8 decoder capability and stress test file

You'll find examples of many UTF-8 irregularities, including lonely start bytes, continuation bytes missing, overlong sequences, etc.

like image 126
Nemanja Trifunovic Avatar answered Oct 22 '22 09:10

Nemanja Trifunovic