There is no reason to check whether a BOM exists or not; utf-8-sig handles that for you, and it behaves exactly like utf-8 if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
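For completeness, the codec is symmetric when encoding: utf-8-sig prepends the BOM:

# Encoding with utf-8-sig adds the BOM
>>> 'hello'.encode('utf-8-sig')
b'\xef\xbb\xbfhello'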
In the examples above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and don't worry about it.
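For example, a minimal sketch of reading a text file that may or may not start with a BOM ('data.txt' is a hypothetical path):

with open('data.txt', encoding='utf-8-sig') as f:
    text = f.read()  # any leading BOM has already been stripped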
BOM characters are automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:
import io
import os
import chardet
import codecs

num_bytes = min(32, os.path.getsize(filename))
with open(filename, 'rb') as f:
    raw = f.read(num_bytes)

if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    # no BOM: fall back to chardet's guess
    result = chardet.detect(raw)
    encoding = result['encoding']

with io.open(filename, 'r', encoding=encoding) as infile:
    data = infile.read()
print(data)
I've composed a nifty BOM-based detector based on Chewie's answer. It's sufficient in the common use case where data can be either in a known local encoding or Unicode with a BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:
import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so UTF-32 must be tried first
    for enc, boms in [
            ('utf-8-sig', (codecs.BOM_UTF8,)),
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))]:
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
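Hypothetical usage, assuming the files are either in a known local encoding (cp1252 here, as a placeholder) or Unicode with a BOM ('data.txt' is likewise a placeholder):

encoding = detect_by_bom('data.txt', default='cp1252')
with open('data.txt', encoding=encoding) as f:
    text = f.read()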
chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:
#!/usr/bin/env python
import chardet  # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32)  # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()

print(text)
Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings that leave the BOM in the text. The 'LE'/'BE' suffix should be stripped to avoid that -- though it is easier to detect the BOM yourself at this point, e.g., as in @ivan_pozdeev's answer.
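A minimal sketch of that stripping (assuming chardet returned one of the suffixed names; Python's plain 'utf-16'/'utf-32' codecs consume the BOM during decoding, while the LE/BE variants leave it in):

if encoding in ('UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE'):
    encoding = encoding[:-2]  # e.g. 'UTF-16LE' -> 'UTF-16', which consumes the BOM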
To avoid UnicodeEncodeError while printing Unicode text to the Windows console, see Python, Unicode, and the Windows console.