Is there a way to tell whether a text file is encoded as UTF-8 in Python? I only need to know whether the file is UTF-8 or not; I don't need to detect other encodings.


How do I detect if a file is encoded using UTF-8?

2 Answers

You mentioned in a comment that you only need to detect UTF-8. If you know the only alternatives are single-byte encodings, there is a solution that often works.

If you know the file is either UTF-8 or a single-byte encoding such as latin-1, try opening it as UTF-8 first and fall back to the other encoding if decoding fails. If the file contains only ASCII characters, it will be opened as UTF-8 even if it was intended as the other encoding; for ASCII-compatible encodings the decoded text is the same either way. If it contains any non-ASCII characters, this will almost always pick the right one of the two, because text in a single-byte encoding rarely happens to form valid UTF-8 multi-byte sequences.

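try:
    # or codecs.open on Python <= 2.5
    # or io.open on Python > 2.5 and <= 2.7
    with open(filename, encoding='UTF-8') as f:
        filedata = f.read()
except UnicodeDecodeError:
    # Not valid UTF-8: fall back to the known single-byte encoding
    with open(filename, encoding='other-single-byte-encoding') as f:
        filedata = f.read()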
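If all you want is a yes/no answer, the same idea can be reduced to a strict decode of the raw bytes. This is a minimal sketch, not a definitive implementation: the function name `is_utf8` is hypothetical, and it reads the whole file into memory, which assumes the file is reasonably small.

```python
def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8.

    Note: a pure-ASCII file also returns True, because ASCII is a
    subset of UTF-8.
    """
    with open(path, "rb") as f:  # binary mode: no implicit decoding
        data = f.read()
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

A True result means the bytes are valid UTF-8, not that the author intended UTF-8; for text with non-ASCII characters the two almost always coincide.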

In the general case, where the possible encodings are not known in advance, your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:

chardet 1.0.1

Universal encoding detector

Detects:

ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)

Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)

EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)

EUC-KR, ISO-2022-KR (Korean)

KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)

ISO-8859-2, windows-1250 (Hungarian)

ISO-8859-5, windows-1251 (Bulgarian)

windows-1252 (English)

ISO-8859-7, windows-1253 (Greek)

ISO-8859-8, windows-1255 (Visual and Logical Hebrew)

TIS-620 (Thai)

Requires Python 2.1 or later

However, some files will be valid in multiple encodings, so chardet is not a panacea.
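For reference, calling chardet directly might look like the sketch below. It assumes chardet is installed (`pip install chardet`) and uses its `chardet.detect` function, which takes a bytes object and returns a dictionary with `encoding` and `confidence` keys. The wrapper name `guess_encoding` is hypothetical, and the guarded import is only there so the snippet still runs where chardet is missing.

```python
try:
    import chardet  # third-party package, not in the standard library
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return (encoding, confidence) for a bytes object.

    Returns (None, 0.0) if chardet is not installed. The encoding can
    also be None when chardet cannot make a guess.
    """
    if chardet is None:
        return None, 0.0
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]
```

Treat the confidence value as a hint rather than a guarantee: the result is a statistical guess, and short inputs in particular can be misidentified.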


answered Sep 25 '22 22:09

agf

Reliably? No.

In general, a byte sequence has no meaning unless you know how to interpret it -- this goes for text files, but also integers, floating point numbers, etc.
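A small illustration of that point, using only the standard library: the same bytes give different text depending on the codec, and the same four bytes give a different number depending on the type you unpack them as. The variable names are just for illustration.

```python
import struct

# Two bytes, two valid interpretations as text.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")      # one character: e with acute accent
as_latin1 = raw.decode("latin-1")  # two characters: A with tilde, copyright sign

# Four bytes, read as a little-endian unsigned int or as a float.
four = b"\x00\x00\x80\x3f"
as_int = struct.unpack("<I", four)[0]    # 1065353216
as_float = struct.unpack("<f", four)[0]  # 1.0
```

Neither decoding raises an error, which is why the bytes alone cannot tell you which interpretation was intended.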

That said, there are ways to guess a file's encoding: look at the byte order mark (if there is one) and at the first chunk of the file to see which encoding yields the most sensible characters. The chardet library is pretty good at this, but be aware that it is only a heuristic, albeit a rather powerful one.
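The byte order mark check can be sketched with the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the returned codec names are choices made for this example. Many UTF-8 files have no BOM at all, so a None result says nothing about the encoding.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return a codec name if `data` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

One known ambiguity remains: UTF-16-LE text whose first character is U+0000 starts with the same four bytes as the UTF-32-LE BOM, so even this check is a heuristic.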

answered Sep 22 '22 22:09

Cameron



Tags:

python

character-encoding

unicode

utf-8

Riki137
