Is there a way to iterate over every character in a given encoding, and print its code? Say, UTF-8?
All Unicode characters can be represented in UTF-n for all defined n. What are you trying to achieve?
If you really want to do something like print all the valid characters in a particular encoding, without needing to know whether the encoding is "single byte" or "multi byte" or whether its size is fixed or not:
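    # Python 2 (uses xrange, unichr and the print statement)
    import unicodedata as ucd
    import sys

    def dump_encoding(enc):
        # +1 so that the last code point, sys.maxunicode, is included
        for i in xrange(sys.maxunicode + 1):
            u = unichr(i)
            try:
                s = u.encode(enc)
            except UnicodeEncodeError:
                # this code point cannot be represented in enc
                continue
            try:
                name = ucd.name(u)
            except ValueError:
                # unnamed code point (e.g. a control character)
                name = '?'
            print "U+%06X %r %s" % (i, s, name)

    if __name__ == "__main__":
        dump_encoding(sys.argv[1])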
Suggestions: Try it out on something small, like cp1252. Redirect stdout to a file.
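For anyone on Python 3, the same idea might look roughly like the sketch below. The generator name `iter_encoding` is made up for illustration; it assumes `enc` names a text encoding known to Python and that unencodable characters raise `UnicodeEncodeError`, as in the script above.

```python
import sys
import unicodedata


def iter_encoding(enc):
    """Yield (code_point, encoded_bytes, name) for each character enc can encode."""
    for i in range(sys.maxunicode + 1):
        ch = chr(i)
        try:
            encoded = ch.encode(enc)
        except UnicodeEncodeError:
            # enc has no representation for this code point
            continue
        # name() accepts a default, used here for unnamed code points
        # such as control characters
        yield i, encoded, unicodedata.name(ch, "?")


if __name__ == "__main__":
    # cp1252 is small, so the output stays manageable
    for cp, encoded, name in iter_encoding("cp1252"):
        print("U+%06X %r %s" % (cp, encoded, name))
```

Yielding tuples instead of printing inside the loop makes it easy to count, filter, or write the results to a file.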
Dude, do you have any idea how many code points there are in Unicode?
btw, from the Python docs:
chr(i)
Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().
The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
So
    import sys

    for i in range(sys.maxunicode + 1):
        char = chr(i)
        # repr() is used because printing a lone surrogate directly,
        # e.g. print('\ud800'), causes a UnicodeEncodeError
        print(repr(char))
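To put a number on "how many", and to show which code points UTF-8 refuses, here is a small Python 3 sketch. The helper name `utf8_survey` is hypothetical; it relies on Python's strict UTF-8 encoder rejecting the surrogate range U+D800..U+DFFF.

```python
import sys


def utf8_survey():
    """Return (count, skipped): how many code points UTF-8 can encode,
    and the list of code points it rejects."""
    count = 0
    skipped = []
    for i in range(sys.maxunicode + 1):
        try:
            chr(i).encode("utf-8")
        except UnicodeEncodeError:
            # lone surrogates are not valid in UTF-8
            skipped.append(i)
            continue
        count += 1
    return count, skipped


if __name__ == "__main__":
    count, skipped = utf8_survey()
    print("encodable:", count)
    print("rejected: U+%04X..U+%04X" % (skipped[0], skipped[-1]))
```

That is 1,114,112 code points in total, minus the 2,048 surrogates, so expect the loop to take a moment.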