I know that I can do the following:
>>> import encodings, pprint
>>> pprint.pprint(sorted(encodings.aliases.aliases.values()))
['ascii',
'base64_codec',
'big5',
'big5hkscs',
'bz2_codec',
'cp037',
'cp1026',
'cp1140',
'cp1250',
'cp1251',
'cp1252',
'cp1253',
'cp1254',
'cp1255',
'cp1256',
'cp1257',
'cp1258',
'cp424',
'cp437',
'cp500',
'cp775',
'cp850',
'cp852',
'cp855',
'cp857',
'cp860',
'cp861',
'cp862',
'cp863',
'cp864',
'cp865',
'cp866',
'cp869',
'cp932',
'cp949',
'cp950',
'euc_jis_2004',
'euc_jisx0213',
'euc_jp',
'euc_kr',
'gb18030',
'gb2312',
'gbk',
'hex_codec',
'hp_roman8',
'hz',
'iso2022_jp',
'iso2022_jp_1',
'iso2022_jp_2',
'iso2022_jp_2004',
'iso2022_jp_3',
'iso2022_jp_ext',
'iso2022_kr',
'iso8859_10',
'iso8859_11',
'iso8859_13',
'iso8859_14',
'iso8859_15',
'iso8859_16',
'iso8859_2',
'iso8859_3',
'iso8859_4',
'iso8859_5',
'iso8859_6',
'iso8859_7',
'iso8859_8',
'iso8859_9',
'johab',
'koi8_r',
'latin_1',
'mac_cyrillic',
'mac_greek',
'mac_iceland',
'mac_latin2',
'mac_roman',
'mac_turkish',
'mbcs',
'ptcp154',
'quopri_codec',
'rot_13',
'shift_jis',
'shift_jis_2004',
'shift_jisx0213',
'tactis',
'tis_620',
'utf_16',
'utf_16_be',
'utf_16_le',
'utf_32',
'utf_32_be',
'utf_32_le',
'utf_7',
'utf_8',
'uu_codec',
'zlib_codec']
I also know for sure that this is not a complete list, since it includes only encodings for which an alias exists (e.g. "cp737" is missing), and at least some pseudo-encodings are missing as well (e.g. "string_escape").
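For example, this quick check (assuming a standard CPython build, where the cp737 codec ships with the interpreter) shows a codec that works even though it never appears among the alias targets:

import codecs
import encodings.aliases

# cp737 resolves to a codec, yet it is not a value in the alias table, so the
# alias table alone undercounts the available encodings.
print('cp737' in set(encodings.aliases.aliases.values()))   # False
print(codecs.lookup('cp737').name)                          # 'cp737'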
As the title of the question says: how can I programmatically get a list of all codecs/encodings known to Python?
If not programmatically: is there a complete list available online?
I don't think the complete list is stored anywhere in the Python standard library. Instead, encodings are loaded on demand through calls to encodings.search_function(encoding). If you study the code there, it looks like the encoding string is first normalized and then the encodings package is searched for a submodule whose name matches it.
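As a quick illustration of that on-demand machinery (just a sketch; exact canonical names can differ between CPython versions), codecs.lookup() runs the registered search functions and imports the matching encodings submodule the first time a codec is requested:

import codecs

# Looking up a codec triggers the search described above; the returned
# CodecInfo object carries the codec's canonical name.
info = codecs.lookup('Latin-1')
print(info.name)        # 'iso8859-1' on CPython

# A name that no search function recognizes raises LookupError.
try:
    codecs.lookup('no-such-codec')
except LookupError as exc:
    print(exc)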
The following uses pkgutil to list all the submodules of the encodings package, and then adds them to the names listed in encodings.aliases.aliases.
Unfortunately, encodings.aliases.aliases contains one encoding, tactis, that does not show up as a submodule, so I generate the complete list by taking the union of the two sets.
import encodings
import os
import pkgutil

# Names of all submodules of the encodings package (one module per codec,
# plus a few helpers such as 'aliases').
modnames = set([modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix='')])
# Canonical names that the alias table maps to.
aliases = set(encodings.aliases.aliases.values())

# Submodules that no alias points to.
print(modnames - aliases)
# set(['charmap', 'unicode_escape', 'cp1006', 'unicode_internal', 'punycode', 'string_escape', 'aliases', 'palmos', 'mac_centeuro', 'mac_farsi', 'mac_romanian', 'cp856', 'raw_unicode_escape', 'mac_croatian', 'utf_8_sig', 'mac_arabic', 'undefined', 'cp737', 'idna', 'koi8_u', 'cp875', 'cp874', 'iso8859_1'])

# Alias targets that have no matching submodule.
print(aliases - modnames)
# set(['tactis'])

# The union of the two sets approximates "every codec name known to Python".
codec_names = modnames.union(aliases)
print(codec_names)
# set(['bz2_codec', 'cp1140', 'euc_jp', 'cp932', 'punycode', 'euc_jisx0213', 'aliases', 'hex_codec', 'cp500', 'uu_codec', 'big5hkscs', 'mac_romanian', 'mbcs', 'euc_jis_2004', 'iso2022_jp_3', 'iso2022_jp_2', 'iso2022_jp_1', 'gbk', 'iso2022_jp_2004', 'unicode_internal', 'utf_16_be', 'quopri_codec', 'cp424', 'iso2022_jp', 'mac_iceland', 'raw_unicode_escape', 'hp_roman8', 'iso2022_kr', 'cp875', 'iso8859_6', 'cp1254', 'utf_32_be', 'gb2312', 'cp850', 'shift_jis', 'cp852', 'cp855', 'iso8859_3', 'cp857', 'cp856', 'cp775', 'unicode_escape', 'cp1026', 'mac_latin2', 'utf_32', 'mac_cyrillic', 'base64_codec', 'ptcp154', 'palmos', 'mac_centeuro', 'euc_kr', 'hz', 'utf_8', 'utf_32_le', 'mac_greek', 'utf_7', 'mac_turkish', 'utf_8_sig', 'mac_arabic', 'tactis', 'cp949', 'zlib_codec', 'big5', 'iso8859_9', 'iso8859_8', 'iso8859_5', 'iso8859_4', 'iso8859_7', 'cp874', 'iso8859_1', 'utf_16_le', 'iso8859_2', 'charmap', 'gb18030', 'cp1006', 'shift_jis_2004', 'mac_roman', 'ascii', 'string_escape', 'iso8859_15', 'iso8859_14', 'tis_620', 'iso8859_16', 'iso8859_11', 'iso8859_10', 'iso8859_13', 'cp950', 'utf_16', 'cp869', 'mac_farsi', 'rot_13', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'shift_jisx0213', 'johab', 'mac_croatian', 'cp1255', 'latin_1', 'cp1257', 'cp1256', 'cp1251', 'cp1250', 'cp1253', 'cp1252', 'cp437', 'cp1258', 'undefined', 'cp737', 'koi8_r', 'cp037', 'koi8_u', 'iso2022_jp_ext', 'idna'])
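If you want to narrow that down to names that actually resolve to a codec on the interpreter you are running, one option (only a sketch, and the result is platform-dependent, e.g. mbcs exists only on Windows) is to pass every candidate through codecs.lookup() and drop the ones that raise LookupError:

import codecs
import encodings
import os
import pkgutil

# Rebuild the candidate set as above, then keep only the names that
# codecs.lookup() accepts.  This filters out helper modules such as
# 'aliases', alias targets without an implementation such as 'tactis',
# and codecs that are unavailable on the current platform.
modnames = set(modname for importer, modname, ispkg in pkgutil.walk_packages(
    path=[os.path.dirname(encodings.__file__)], prefix=''))
candidates = modnames.union(set(encodings.aliases.aliases.values()))

working = set()
for name in candidates:
    try:
        codecs.lookup(name)
    except LookupError:
        pass
    else:
        working.add(name)

print(sorted(working))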