I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.
Edit: It looks like this is more complex than expected, so I'll rephrase the question: is the following code sufficient to generate all valid non-control characters in Unicode?
import unicodedata

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112)  # 0x10ffff + 1
    # keep Letters, Marks, Numbers, Punctuation, Symbols and Separators
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
)
UTF-8 is a byte-oriented encoding: each Unicode code point is represented by a specific sequence of one to four bytes.
In Python 3, strings (str) are sequences of Unicode code points, not UTF-8 bytes. UTF-8 is the default source-file encoding and the default for str.encode() and bytes.decode(). Each character in a string corresponds to exactly one code point.
UTF-8 is backward compatible with ASCII: code points U+0000 to U+007F are encoded as a single byte with the same value, and every other code point uses a sequence of two to four bytes. This lets UTF-8 represent every Unicode code point except the surrogates (1,112,064 in total), not just 256 characters.
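To make the variable-width point concrete, here is a minimal Python 3 sketch that encodes one code point from each length class and counts the resulting bytes. The `samples` list and the `utf8_length` helper are illustrative names chosen for this example, not part of any library.

```python
# Illustrative sample: one code point from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_length(ch):
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    # ord() gives the code point; the byte count grows with its value.
    print("U+%04X -> %d byte(s)" % (ord(ch), utf8_length(ch)))
```

The ASCII letter takes one byte, and the other three take two, three and four bytes respectively.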
People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, extend the include_ranges list in the example with the code point ranges that you want.
import random

def get_random_unicode(length):

    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        ( 0x0021, 0x0021 ),
        ( 0x0023, 0x0026 ),
        ( 0x0028, 0x007E ),
        ( 0x00A1, 0x00AC ),
        ( 0x00AE, 0x00FF ),
        ( 0x0100, 0x017F ),
        ( 0x0180, 0x024F ),
        ( 0x2C60, 0x2C7F ),
        ( 0x16A0, 0x16F0 ),
        ( 0x0370, 0x0377 ),
        ( 0x037A, 0x037E ),
        ( 0x0384, 0x038A ),
        ( 0x038C, 0x038C ),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))
There is a UTF-8 stress test from Markus Kuhn you could use.
See also Really Good, Bad UTF-8 example test data.
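As a rough idea of how such test data can be used, the sketch below feeds a few malformed byte sequences to Python 3's strict UTF-8 decoder. The sequences in `MALFORMED` were hand-picked for this example (they are not copied from either resource above), and `is_well_formed_utf8` is a hypothetical helper name.

```python
# Hand-picked malformed sequences (assumption: illustrative only).
MALFORMED = [
    b"\x80",              # lone continuation byte
    b"\xc0\xaf",          # overlong encoding of "/"
    b"\xed\xa0\x80",      # encoded surrogate U+D800
    b"\xf4\x90\x80\x80",  # value beyond U+10FFFF
]

def is_well_formed_utf8(data):
    """Return True if data decodes as UTF-8 under the strict error handler."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

for seq in MALFORMED:
    # errors="replace" substitutes U+FFFD instead of raising.
    replaced = seq.decode("utf-8", errors="replace")
    print(seq, is_well_formed_utf8(seq), ascii(replaced))
```

Code under test should either reject such input or replace it, but never pass it through silently.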
Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:
#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))
Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed (but each byte in the sequence is).
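If an even distribution over characters matters, one alternative is to pick a code point uniformly and let Python do the encoding. The sketch below is a different technique from the byte-level function above, not a fix to it: it samples uniformly from all Unicode scalar values (every code point except the surrogates U+D800 to U+DFFF). The names `random_scalar_value` and `random_unicode_string` are made up for this example, and the result will still include control characters, unassigned code points and noncharacters.

```python
import random

SURROGATES_START, SURROGATES_END = 0xD800, 0xDFFF
NUM_SURROGATES = SURROGATES_END - SURROGATES_START + 1  # 2048
NUM_SCALARS = 0x110000 - NUM_SURROGATES                 # 1,112,064

def random_scalar_value(rng=random):
    """Return one character chosen uniformly from all Unicode scalar values."""
    n = rng.randrange(NUM_SCALARS)
    if n >= SURROGATES_START:
        n += NUM_SURROGATES  # skip over the surrogate block
    return chr(n)

def random_unicode_string(length, rng=random):
    """Return a string of `length` uniformly chosen scalar values."""
    return "".join(random_scalar_value(rng) for _ in range(length))

s = random_unicode_string(10)
encoded = s.encode("utf-8")  # would raise if a surrogate slipped through
assert encoded.decode("utf-8") == s
print(ascii(s))  # ascii() avoids terminal encoding problems
```

Because most of the code space lies above U+FFFF, the majority of characters produced this way encode to four bytes.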