Detect Byte Order Mark (BOM) in Python

I've found lots of posts describing how to parse/ignore BOMs but can't find anything on how to simply output a true/false as to whether a file contains a BOM. Can anyone point me in the right direction to do this in Python?

d3wannabe asked May 19 '26 18:05

1 Answer

The simple answer is: read the first 4 bytes and compare them against the known BOM signatures.

with open("utf32le.file", "rb") as file:
    beginning = file.read(4)
    # The order of these checks matters: the UTF-32 LE BOM (ff fe 00 00)
    # begins with the UTF-16 LE BOM (ff fe), so UTF-32 must be tested first
    if beginning == b'\x00\x00\xfe\xff':
        print("UTF-32 BE")
    elif beginning == b'\xff\xfe\x00\x00':
        print("UTF-32 LE")
    elif beginning[0:3] == b'\xef\xbb\xbf':
        print("UTF-8")
    elif beginning[0:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif beginning[0:2] == b'\xfe\xff':
        print("UTF-16 BE")
    else:
        print("Unknown or no BOM")
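Since the question asks for a plain true/false, the same check can be wrapped in a function. The following is a minimal sketch that uses the BOM constants from the standard `codecs` module instead of hand-written byte strings. The names `BOMS`, `detect_bom` and `has_bom` are my own for illustration, not part of any library.

```python
import codecs

# Longer signatures come first, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
)


def detect_bom(path):
    """Return the name of the BOM at the start of the file, or None."""
    with open(path, "rb") as f:
        beginning = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in BOMS:
        if beginning.startswith(bom):
            return name
    return None


def has_bom(path):
    """Return True if the file starts with one of the known BOMs."""
    return detect_bom(path) is not None
```

With that in place, `has_bom("utf32le.file")` gives the boolean, and `detect_bom` tells you which BOM was found if you need it.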

The not-so-simple answer is:

A binary file may begin with bytes that happen to match a BOM, so a match does not prove that the file is text in that encoding.

Apart from that, text files without a BOM can typically be treated as UTF-8.
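If you would rather verify the UTF-8 assumption than rely on it, one option is to try a strict decode. This is only a heuristic sketch: the function name `looks_like_utf8` and the `chunk_size` parameter are my own, and a binary file made up purely of ASCII bytes would pass as well.

```python
import codecs


def looks_like_utf8(path, chunk_size=65536):
    """Return True if the whole file decodes as strict UTF-8."""
    # An incremental decoder copes with multi-byte characters
    # that are split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                decoder.decode(chunk)
        # final=True raises if the file ends in the middle of a character
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Reading in chunks keeps memory use flat for large files; for small files, `open(path, "rb").read().decode("utf-8")` inside a `try` does the same job.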

Thomas Weller answered May 24 '26 12:05
