How to determine the encoding of text?

People also ask

How can I determine the encoding of a text file?

Open the file with Notepad++ and will see on the right down corner the encoding table name. And in the menu encoding you can change the encoding table and save the file.

What is the encoding of a text file?

An encoding standard is a numbering scheme that assigns each text character in a character set to a numeric value. A character set can include alphabetical characters, numbers, and other symbols.

How can I tell if a file is UTF-8?

To verify if a file passes an encoding such as ascii, iso-8859-1, utf-8 or whatever then a good solution is to use the 'iconv' command.

EDIT: chardet seems to be unmantained but most of the answer applies. Check https://pypi.org/project/charset-normalizer/ for an alternative

Correctly detecting the encoding all times is impossible.

(From chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

There is the chardet library that uses that study to try to detect encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252

Another option for working out the encoding is to use libmagic (which is the code behind the file command). There are a profusion of python bindings available.

The python bindings that live in the file source tree are available as the python-magic (or python3-magic) debian package. It can determine the encoding of a file by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on pypi that also uses libmagic. It can also get the encoding, by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

Some encoding strategies, please uncomment to taste :

#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i $tmpfile
enca -g $tmpfile
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > $tmpfile
#enca -x utf-8 $tmpfile
#enca -g $tmpfile
recode CP1250..UTF-8 $tmpfile

You might like to check the encoding by opening and reading the file in a form of a loop... but you might need to check the filesize first :

# PYTHON
encodings = ['utf-8', 'windows-1250', 'windows-1252'] # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break

Here is an example of reading and taking at face value a chardet encoding prediction, reading n_lines from the file in the event it is large.

chardet also gives you a probability (i.e. confidence) of it's encoding prediction (haven't looked how they come up with that), which is returned with its prediction from chardet.predict(), so you could work that in somehow if you like.

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']

Related questions
                            
                                How do I read a large csv file with pandas?
                            
                                What is the result of % in Python?
                            
                                Why use Abstract Base Classes in Python?
                            
                                Pandas DataFrame to List of Dictionaries
                            
                                How to read specific lines from a file (by line number)?
                            
                                Jupyter Notebook not saving: '_xsrf' argument missing from post
                            
                                Python try...except comma vs 'as' in except
                            
                                How to normalize a NumPy array to a unit vector?
                            
                                Moving matplotlib legend outside of the axis makes it cutoff by the figure box
                            
                                What is the difference between json.load() and json.loads() functions
                            
                                How to get http headers in flask?
                            
                                How to import a Python class that is in a directory above?
                            
                                How do I calculate percentiles with python/numpy?
                            
                                Check if a string contains a number
                            
                                How to use MySQLdb with Python and Django in OSX 10.6?
                            
                                How do you create a daemon in Python?
                            
                                multiple prints on the same line in Python
                            
                                How to check if one of the following items is in a list?
                            
                                Fastest way to get the first object from a queryset in django?
                            
                                What blocks Ruby, Python to get Javascript V8 speed? [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to determine the encoding of text?

Tags:

python

text-files

encoding

People also ask

Recent Activity

Donate For Us