How can I iterate over every character in a given encoding using Python?

Is there a way to iterate over every character in a given encoding and print its code? Say, UTF-8?

Geo asked Dec 17 '22 03:12



2 Answers

All Unicode characters can be represented in UTF-n for all defined n. What are you trying to achieve?
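To illustrate the point about UTF-n, here is a minimal Python 3 sketch that encodes a few characters in UTF-8, UTF-16 and UTF-32 and reports the byte lengths. The helper name utf_lengths is made up for this example, and the little-endian codec variants are chosen only so that no byte order mark is counted.

```python
import sys

def utf_lengths(ch):
    """Return the encoded byte length of ch in each UTF encoding."""
    # The -le variants are used so that no byte order mark is prepended
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

if __name__ == "__main__":
    # An ASCII letter, the euro sign, and the highest code point
    for ch in ("a", "\u20ac", chr(sys.maxunicode)):
        print("U+%06X" % ord(ch), utf_lengths(ch))
```

Every character encodes in all three; only the number of bytes differs. Note that Python's strict codecs reject lone surrogates (U+D800 to U+DFFF), which are code points but not characters.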

If you really want to do something like print all the valid characters in a particular encoding, without needing to know whether the encoding is "single byte" or "multi byte" or whether its size is fixed or not:

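import sys
import unicodedata as ucd

def dump_encoding(enc):
    # Try every code point, including sys.maxunicode itself
    for i in range(sys.maxunicode + 1):
        u = chr(i)
        try:
            s = u.encode(enc)
        except UnicodeEncodeError:
            # Not representable in this encoding (or a lone surrogate)
            continue
        # name() returns the default instead of raising for unnamed code points
        name = ucd.name(u, '?')
        print("U+%06X %r %s" % (i, s, name))

if __name__ == "__main__":
    # Usage: python dump_encoding.py cp1252
    dump_encoding(sys.argv[1])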

Suggestions: Try it out on something small, like cp1252. Redirect stdout to a file.
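For a single-byte encoding such as cp1252, a different approach is to go the other way and decode each of the 256 possible byte values. This is only a sketch and is not part of the code above: the function name single_byte_table is made up, and it assumes the encoding maps one byte to one character, so it is not useful for multi-byte encodings like UTF-8.

```python
def single_byte_table(enc):
    """Map each byte value to the character it decodes to in enc."""
    table = {}
    for b in range(256):
        try:
            table[b] = bytes([b]).decode(enc)
        except UnicodeDecodeError:
            # This byte value is unassigned (or incomplete) in the encoding
            continue
    return table

if __name__ == "__main__":
    for b, ch in single_byte_table("cp1252").items():
        print("0x%02X U+%04X %r" % (b, ord(ch), ch))
```

This loops 256 times instead of over a million, and byte values the codec leaves undefined (0x81 in cp1252, for example) simply do not appear in the table.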

John Machin answered Dec 29 '22 00:12



dude, do you have any idea how many code points there are in Unicode...

btw, from the Python docs:

chr(i)

Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().

The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.

So

import sys

for i in range(sys.maxunicode + 1):
    char = chr(i)
    print(repr(char))  # repr() is used because print('\ud800') raises UnicodeEncodeError
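
To put a number on "how many", here is a rough sketch that counts the code points a given encoding accepts instead of printing them. The function name count_encodable is made up for this example, and it assumes Python 3, where the strict codecs refuse lone surrogates.

```python
import sys

def count_encodable(enc):
    """Count the code points that enc can represent."""
    total = 0
    for i in range(sys.maxunicode + 1):
        try:
            chr(i).encode(enc)
        except UnicodeEncodeError:
            # Lone surrogates (U+D800..U+DFFF) and unmapped characters end up here
            continue
        total += 1
    return total

if __name__ == "__main__":
    for enc in ("ascii", "utf-8"):
        print(enc, count_encodable(enc))
```

For UTF-8 this should report 1,112,064: all 1,114,112 code points minus the 2,048 surrogates. For ASCII it reports 128.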
wich answered Dec 29 '22 00:12
