Get a list of all the encodings Python can encode to

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?

The reason I'm trying to do this is because a user has some text that is not encoded correctly. There are funny characters. I know the unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.

i.e. something like this:

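    import itertools

    # encodinglist() is the hypothetical function being asked for.
    for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
        try:
            unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
        except (UnicodeError, LookupError):
            pass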
Amandasaurus asked Nov 13 '09 10:11


2 Answers

Other answers here seem to indicate that constructing this list programmatically is difficult and fraught with traps. However, doing so is probably unnecessary since the documentation contains a complete list of the standard encodings Python supports, and has done since Python 2.3.

You can find these lists (for each stable version of the language so far released) at:

  • https://docs.python.org/2.3/lib/node130.html
  • https://docs.python.org/2.4/lib/standard-encodings.html
  • https://docs.python.org/2.5/lib/standard-encodings.html
  • https://docs.python.org/2.6/library/codecs.html#standard-encodings
  • https://docs.python.org/2.7/library/codecs.html#standard-encodings
  • https://docs.python.org/3.0/library/codecs.html#standard-encodings
  • https://docs.python.org/3.1/library/codecs.html#standard-encodings
  • https://docs.python.org/3.2/library/codecs.html#standard-encodings
  • https://docs.python.org/3.3/library/codecs.html#standard-encodings
  • https://docs.python.org/3.4/library/codecs.html#standard-encodings
  • https://docs.python.org/3.5/library/codecs.html#standard-encodings
  • https://docs.python.org/3.6/library/codecs.html#standard-encodings
  • https://docs.python.org/3.7/library/codecs.html#standard-encodings

Below are the lists for each documented version of Python. Note that if you want backwards-compatibility rather than just supporting a particular version of Python, you can just copy the list from the latest Python version and check whether each encoding exists in the Python running your program before trying to use it.
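As a rough sketch of that check (written for Python 3; the helper name `available_encodings` and the made-up name `no_such_codec` are only for illustration), `codecs.lookup()` raises `LookupError` for a name the running interpreter doesn't know, so you can filter a copied list like this:

```python
import codecs

def available_encodings(candidates):
    """Return the candidate names that this interpreter can look up."""
    available = []
    for name in candidates:
        try:
            codecs.lookup(name)
        except LookupError:
            # Unknown to this Python version or build: skip it.
            continue
        available.append(name)
    return available

# A few names from the tables below, plus one deliberately bogus name.
candidates = ['ascii', 'cp1252', 'latin_1', 'utf_8', 'no_such_codec']
print(available_encodings(candidates))
```

In practice you would pass in the whole list copied from the newest version's table instead of this short sample.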

Python 2.3 (59 encodings)

['ascii',  'cp037',  'cp424',  'cp437',  'cp500',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp869',  'cp874',  'cp875',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8']

Python 2.4 (85 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8']

Python 2.5 (86 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 2.6 (90 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 2.7 (93 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_11',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.0 (89 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.1 (90 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.2 (92 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.3 (93 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'cp65001',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.4 (96 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp273',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1125',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'cp65001',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_11',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_u',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.5 (98 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp273',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1125',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'cp65001',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_11',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_t',  'koi8_u',  'kz1048',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.6 (98 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp273',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1125',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'cp65001',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_11',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_t',  'koi8_u',  'kz1048',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

Python 3.7 (98 encodings)

['ascii',  'big5',  'big5hkscs',  'cp037',  'cp273',  'cp424',  'cp437',  'cp500',  'cp720',  'cp737',  'cp775',  'cp850',  'cp852',  'cp855',  'cp856',  'cp857',  'cp858',  'cp860',  'cp861',  'cp862',  'cp863',  'cp864',  'cp865',  'cp866',  'cp869',  'cp874',  'cp875',  'cp932',  'cp949',  'cp950',  'cp1006',  'cp1026',  'cp1125',  'cp1140',  'cp1250',  'cp1251',  'cp1252',  'cp1253',  'cp1254',  'cp1255',  'cp1256',  'cp1257',  'cp1258',  'cp65001',  'euc_jp',  'euc_jis_2004',  'euc_jisx0213',  'euc_kr',  'gb2312',  'gbk',  'gb18030',  'hz',  'iso2022_jp',  'iso2022_jp_1',  'iso2022_jp_2',  'iso2022_jp_2004',  'iso2022_jp_3',  'iso2022_jp_ext',  'iso2022_kr',  'latin_1',  'iso8859_2',  'iso8859_3',  'iso8859_4',  'iso8859_5',  'iso8859_6',  'iso8859_7',  'iso8859_8',  'iso8859_9',  'iso8859_10',  'iso8859_11',  'iso8859_13',  'iso8859_14',  'iso8859_15',  'iso8859_16',  'johab',  'koi8_r',  'koi8_t',  'koi8_u',  'kz1048',  'mac_cyrillic',  'mac_greek',  'mac_iceland',  'mac_latin2',  'mac_roman',  'mac_turkish',  'ptcp154',  'shift_jis',  'shift_jis_2004',  'shift_jisx0213',  'utf_32',  'utf_32_be',  'utf_32_le',  'utf_16',  'utf_16_be',  'utf_16_le',  'utf_7',  'utf_8',  'utf_8_sig']

In case they're relevant to anyone's use case, note that the docs also list some Python-specific encodings. Many of these seem to be primarily for use by Python's internals or are otherwise weird in some way, like the 'undefined' encoding, which always raises an exception if you try to use it. You probably want to ignore these completely if, like the question-asker here, you're trying to figure out what encoding was used for some text you've come across in the real world. As of Python 3.7, the list is as follows:

["idna",  "mbcs",  "oem",  "palmos",  "punycode",  "raw_unicode_escape",  "rot_13",  "undefined",  "unicode_escape",  "unicode_internal",  "base64_codec",  "bz2_codec",  "hex_codec",  "quopri_codec",  "uu_codec",  "zlib_codec"] 

Some older Python versions had a string_escape special encoding that I've not included in the above list because it's been removed from the language.
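If you'd rather screen names at runtime than maintain an exclusion list by hand, here is a minimal sketch, assuming Python 3.4 or later (the helper name `can_encode_text` is my own). It simply tries to encode a sample string and treats any `LookupError` or `UnicodeError` as "not usable". Note that this only catches codecs that refuse to encode text at all, such as 'undefined' and the bytes-to-bytes codecs like 'base64_codec'; something like 'idna' will still pass, so it's a complement to the list above and not a replacement for it.

```python
def can_encode_text(name, sample="a"):
    """Return True if str.encode() accepts this codec name for the sample."""
    try:
        sample.encode(name)
    except (LookupError, UnicodeError):
        # LookupError: unknown codec, or not a text encoding.
        # UnicodeError: the codec refused the sample (e.g. 'undefined').
        return False
    return True

for name in ["utf_8", "undefined", "base64_codec"]:
    print(name, can_encode_text(name))
```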

Finally, in case you'd like to update my tables above for a newer version of Python, here's the (crude, not very robust) script I used to generate them:

    import requests
    import lxml.html
    import pprint

    for version, url in [
        ('2.3', 'https://docs.python.org/2.3/lib/node130.html'),
        ('2.4', 'https://docs.python.org/2.4/lib/standard-encodings.html'),
        ('2.5', 'https://docs.python.org/2.5/lib/standard-encodings.html'),
        ('2.6', 'https://docs.python.org/2.6/library/codecs.html#standard-encodings'),
        ('2.7', 'https://docs.python.org/2.7/library/codecs.html#standard-encodings'),
        ('3.0', 'https://docs.python.org/3.0/library/codecs.html#standard-encodings'),
        ('3.1', 'https://docs.python.org/3.1/library/codecs.html#standard-encodings'),
        ('3.2', 'https://docs.python.org/3.2/library/codecs.html#standard-encodings'),
        ('3.3', 'https://docs.python.org/3.3/library/codecs.html#standard-encodings'),
        ('3.4', 'https://docs.python.org/3.4/library/codecs.html#standard-encodings'),
        ('3.5', 'https://docs.python.org/3.5/library/codecs.html#standard-encodings'),
        ('3.6', 'https://docs.python.org/3.6/library/codecs.html#standard-encodings'),
        ('3.7', 'https://docs.python.org/3.7/library/codecs.html#standard-encodings'),
    ]:
        html = requests.get(url).text
        doc = lxml.html.fromstring(html)
        standard_encodings_table = doc.xpath(
            '//table[preceding::h2[.//text()[contains(., "Standard Encodings")]]][//th/text()="Codec"]'
        )[0]
        codecs = standard_encodings_table.xpath('.//td[1]/text()')
        print("## Python %s (%i encodings)" % (version, len(codecs)))
        print('<pre><code>' + pprint.pformat(codecs) + '</code></pre>')
Mark Amery answered Oct 22 '22 13:10


Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer.

aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time by using set(aliases.values()) instead of aliases.keys().

BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).

    >>> from encodings.aliases import aliases
    >>> def find(q):
    ...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
    ...
    >>> find('1252') # multiple aliases
    [('1252', 'cp1252'), ('windows_1252', 'cp1252')]
    >>> find('856') # no codepage 856 in aliases
    []
    >>> find('koi8') # no koi8_u in aliases
    [('cskoi8r', 'koi8_r')]
    >>> 'x'.decode('cp856') # but cp856 is a valid codec
    u'x'
    >>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
    u'x'
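The session above is Python 2 (`str.decode` and the `u'x'` repr). A rough Python 3 sketch of the same check follows; the helper name `valid_but_unaliased` is my own, and which names it reports may differ between Python versions, since the contents of the aliases table are not something I'd assume to be stable:

```python
import codecs
from encodings.aliases import aliases

def valid_but_unaliased(candidates):
    """Return names that codecs.lookup() accepts but that never appear
    as a value in encodings.aliases.aliases."""
    aliased = set(aliases.values())
    result = []
    for name in candidates:
        try:
            codecs.lookup(name)
        except LookupError:
            # Not a codec in this interpreter at all.
            continue
        if name not in aliased:
            result.append(name)
    return result

# cp1252 has aliases; the others are the examples named in this answer.
print(valid_but_unaliased(['cp1252', 'cp856', 'cp875', 'koi8_u']))
```

Any name this prints is a codec you would miss by relying on set(aliases.values()) alone.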

It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets, but do some other transformation e.g. zlib, quopri, and base64.

Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve?

  • Are you trying to determine which codec to use to decode some incoming bytes, and plan to attempt this with all possible codecs? [note: latin1 will decode anything]
  • Are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything]

John Machin answered Oct 22 '22 14:10