Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate random UTF-8 string in Python

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in ('LMNPSZ')
    )
like image 842
l0b0 Avatar asked Sep 25 '09 13:09

l0b0


People also ask

What does encoding =' UTF-8 do in Python?

UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.

Are Python strings UTF-8?

In Python, Strings are by default in utf-8 format which means each alphabet corresponds to a unique code point.

Why do we use UTF-8 in Python?

UTF-8 extends the ASCII character set to use 8-bit code points, which allows for up to 256 different characters. This means that UTF-8 can represent all of the printable ASCII characters, as well as the non-printable characters.


3 Answers

People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, just extend that part of the example with the code point ranges that you want.

import random  def get_random_unicode(length):      try:         get_char = unichr     except NameError:         get_char = chr      # Update this to include code point ranges to be sampled     include_ranges = [         ( 0x0021, 0x0021 ),         ( 0x0023, 0x0026 ),         ( 0x0028, 0x007E ),         ( 0x00A1, 0x00AC ),         ( 0x00AE, 0x00FF ),         ( 0x0100, 0x017F ),         ( 0x0180, 0x024F ),         ( 0x2C60, 0x2C7F ),         ( 0x16A0, 0x16F0 ),         ( 0x0370, 0x0377 ),         ( 0x037A, 0x037E ),         ( 0x0384, 0x038A ),         ( 0x038C, 0x038C ),     ]      alphabet = [         get_char(code_point) for current_range in include_ranges             for code_point in range(current_range[0], current_range[1] + 1)     ]     return ''.join(random.choice(alphabet) for i in range(length))  if __name__ == '__main__':     print('A random string: ' + get_random_unicode(10)) 
like image 168
Jacob Wan Avatar answered Sep 28 '22 09:09

Jacob Wan


There is a UTF-8 stress test from Markus Kuhn you could use.

See also Really Good, Bad UTF-8 example test data.

like image 41
Gumbo Avatar answered Sep 28 '22 09:09

Gumbo


Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed (but each byte in the sequence is).

like image 42
Philipp Avatar answered Sep 28 '22 08:09

Philipp