
How do I detect if a file is encoded using UTF-8?

Is there a way to recognize whether a text file is UTF-8 in Python?

I only need to know whether the file is UTF-8 or not. I don't need to detect other encodings.

Riki137 asked Apr 14 '12 18:04




2 Answers

You mentioned in a comment that you only need to detect UTF-8. If you know the alternative consists only of single-byte encodings, then there is a solution that often works.

If you know it's either UTF-8 or a single-byte encoding like latin-1, then try opening it first as UTF-8 and then in the other encoding. If the file contains only ASCII characters, it will end up opened as UTF-8 even if it was intended as the other encoding. If it contains any non-ASCII characters, this will almost always pick the right character set of the two, because text in a single-byte encoding rarely happens to form valid UTF-8 multi-byte sequences.

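try:
    # Use codecs.open on Python <= 2.5, or io.open on Python > 2.5 and <= 2.7.
    with open(filename, encoding='UTF-8') as f:
        filedata = f.read()
except UnicodeDecodeError:
    # Not valid UTF-8: fall back to the single-byte encoding you expect
    # (replace the placeholder with a real codec name such as 'latin-1').
    with open(filename, encoding='other-single-byte-encoding') as f:
        filedata = f.read()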
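
If all you need is a yes/no answer, the same idea can be applied to the raw bytes. This is only a minimal sketch for Python 3, assuming the file is small enough to read into memory; is_utf8 is a hypothetical helper name, not a standard function.

```python
def is_utf8(path):
    """Return True if the file's bytes decode as strict UTF-8."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

Note that a pure-ASCII file also returns True, since ASCII is a subset of UTF-8.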

Your best bet is to use the chardet package from PyPI, either directly or through UnicodeDammit from BeautifulSoup:

chardet 1.0.1

Universal encoding detector

Detects:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-2, windows-1250 (Hungarian)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • windows-1252 (English)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)

Requires Python 2.1 or later

However, some files will be valid in multiple encodings, so chardet is not a panacea.
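
As a small illustration of that ambiguity (a sketch for Python 3, with variable names chosen just for this example), the same two bytes decode without error as both UTF-8 and latin-1, but to different text:

```python
data = '\u00e9'.encode('utf-8')      # the two bytes 0xC3 0xA9 ('é' in UTF-8)

as_utf8 = data.decode('utf-8')       # one character: 'é'
as_latin1 = data.decode('latin-1')   # two characters: 'Ã©'
```

Neither decode raises, so no detector can tell from the bytes alone which reading was intended.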

agf answered Sep 25 '22 22:09


Reliably? No.

In general, a byte sequence has no meaning unless you know how to interpret it -- this goes for text files, but also integers, floating point numbers, etc.

But, there are ways of guessing the encoding of a file, by looking at the byte order mark (if there is one) and the first chunk of the file (to see which encoding yields the most sensible characters). The chardet library is pretty good at this, but be aware it's only a heuristic, albeit a rather powerful one.
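
The byte order mark check could look roughly like this, using the BOM constants from the standard codecs module. It is only a sketch: sniff_bom and _BOMS are hypothetical names, and a missing BOM tells you nothing, since most UTF-8 files do not have one.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Pass it the first few bytes of the file, read in binary mode. Treat the result as a hint rather than proof, in the same way as chardet's guess.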

Cameron answered Sep 22 '22 22:09