I'm trying to write a script that generates random unicode by creating random utf-8 encoded strings and then decoding those to unicode. It works fine for a single byte, but with two bytes it fails.
For instance, if I run the following in a python shell:
>>> a = str()
>>> a += chr(0xc0) + chr(0xaf)
>>> print a.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte
According to the utf-8 scheme (https://en.wikipedia.org/wiki/UTF-8#Description), the byte sequence 0xc0 0xaf should be valid, as 0xc0 starts with 110 and 0xaf starts with 10.
Here's my python script:
import random

class GenFuzz(object):
    def unicode(self):
        '''returns a random (astral) utf encoded byte string'''
        num_bytes = random.randint(1,4)
        if num_bytes == 1:
            return self.gen_utf8(num_bytes, 0x00, 0x7F)
        elif num_bytes == 2:
            return self.gen_utf8(num_bytes, 0xC0, 0xDF)
        elif num_bytes == 3:
            return self.gen_utf8(num_bytes, 0xE0, 0xEF)
        elif num_bytes == 4:
            return self.gen_utf8(num_bytes, 0xF0, 0xF7)

    def gen_utf8(self, num_bytes, start_val, end_val):
        byte_str = list()
        byte_str.append(random.randrange(start_val, end_val)) # start byte
        for i in range(0,num_bytes-1):
            byte_str.append(random.randrange(0x80,0xBF)) # trailing bytes
        a = str()
        sum = int()
        for b in byte_str:
            a += chr(b)
        ret = a.decode('utf-8')
        return ret

if __name__ == "__main__":
    g = GenFuzz()
    print g.gen_utf8(2,0xC0,0xDF)
This is, indeed, invalid UTF-8. In UTF-8, only code points in the range U+0080 to U+07FF, inclusive, can be encoded using two bytes. Read the Wikipedia article more closely, and you will see the same thing. A two-byte sequence starting with 0xc0 or 0xc1 could only encode a code point below U+0080, and those must be encoded as a single byte instead. As a result, the byte 0xc0 may not appear in UTF-8, ever. The same is true of 0xc1.
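To make the overlong-encoding point concrete, here is a minimal sketch, assuming Python 3 (the session in the question is Python 2). The helpers naive_two_byte_decode and is_strict_utf8 are hypothetical names introduced only for illustration. The first applies the 110xxxxx 10xxxxxx bit layout with no validity checks, and the second asks Python's own strict decoder.

```python
def naive_two_byte_decode(lead, trail):
    """Apply the 110xxxxx 10xxxxxx layout with no validity checks.

    Hypothetical helper for illustration only; a real decoder must
    reject overlong forms instead of returning a code point.
    """
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_strict_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 0xAF carries the payload bits of U+002F ('/'), which already
# fits in a single byte, so the two-byte form is overlong.
print(hex(naive_two_byte_decode(0xC0, 0xAF)))   # 0x2f
print(is_strict_utf8(bytes([0xC0, 0xAF])))      # False
# U+0080 is the smallest code point that needs two bytes: 0xC2 0x80.
print(is_strict_utf8(bytes([0xC2, 0x80])))      # True
```

So the smallest lead byte that can start a valid two-byte sequence is 0xc2, not 0xc0.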
Some UTF-8 decoders have erroneously decoded sequences like C0 AF as valid UTF-8, which has led to security vulnerabilities in the past.
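For the original goal of generating random Unicode, one way to avoid invalid lead bytes entirely is to pick a random code point first and let the encoder produce the bytes. This is a different technique from the byte-level approach in the question, sketched here under the assumption of Python 3. The names RANGES_BY_LENGTH and random_utf8 are hypothetical.

```python
import random

# Code point ranges grouped by UTF-8 encoded length. The surrogate block
# U+D800..U+DFFF is left out because it cannot be encoded in UTF-8.
RANGES_BY_LENGTH = {
    1: [(0x0000, 0x007F)],
    2: [(0x0080, 0x07FF)],
    3: [(0x0800, 0xD7FF), (0xE000, 0xFFFF)],
    4: [(0x10000, 0x10FFFF)],
}


def random_utf8(num_bytes, rng=random):
    """Return random, valid UTF-8 bytes with the requested encoded length.

    Hypothetical helper; the choice between the two 3-byte ranges is not
    weighted by size, so the distribution is not uniform over code points.
    """
    low, high = rng.choice(RANGES_BY_LENGTH[num_bytes])
    code_point = rng.randint(low, high)  # randint includes both endpoints
    return chr(code_point).encode("utf-8")


sample = random_utf8(2)
print(repr(sample), repr(sample.decode("utf-8")))
```

Because the bytes come from the encoder, they always round-trip through decode('utf-8'). The trade-off is that this never produces malformed input, which a fuzzer might want to generate on purpose.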
Found one standard that actually accepts 0xc0: encoding="ISO-8859-1" (from https://stackoverflow.com/a/27456542/4355695).
But this entails making sure the rest of the file doesn't have Unicode characters, so this is not an exact answer to the question. It may still be helpful for folks like me who didn't have any Unicode characters in their file anyway and just wanted Python to load the damn thing, when both the utf-8 and ascii encodings were erroring out.
More on ISO-8859-1: What is the difference between UTF-8 and ISO-8859-1?
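A small sketch of why ISO-8859-1 never errors out, assuming Python 3: each byte value maps to the code point with the same number, so any byte string decodes. The flip side is that real UTF-8 data decoded this way turns silently into mojibake rather than raising.

```python
raw = bytes([0xC0, 0xAF])
text = raw.decode("iso-8859-1")
# Each byte becomes the code point with the same value.
print([hex(ord(ch)) for ch in text])   # ['0xc0', '0xaf']

# All 256 byte values decode and round-trip unchanged.
all_bytes = bytes(range(256))
print(all_bytes.decode("iso-8859-1").encode("iso-8859-1") == all_bytes)  # True

# UTF-8 text read as ISO-8859-1 does not fail, it is just wrong:
# U+00E9 is encoded as C3 A9, which reads back as two characters.
mojibake = "\u00e9".encode("utf-8").decode("iso-8859-1")
print([hex(ord(ch)) for ch in mojibake])   # ['0xc3', '0xa9']
```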