Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like

    codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors)

would handle this, but I don't really see any good examples on usage. Would this be the best way to handle this?
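For reference, here's roughly what I imagine the StreamRecoder usage would look like (an untested sketch; the codec choices are my guesses):

    import codecs

    with open('brh-m-157.json', 'rb') as f:
        recoder = codecs.StreamRecoder(
            f,
            codecs.getencoder('utf-8'),     # re-encodes data read from the stream
            codecs.getdecoder('utf-8'),     # decodes data written to the stream
            codecs.getreader('utf-8-sig'),  # Reader: decodes the file, dropping the BOM
            codecs.getwriter('utf-8'),      # Writer: only used on writes; unused here
        )
        data = recoder.read()  # UTF-8 bytes with no BOM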
Source files:

    $ file brh-m-157.json
    brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without BOM?
Edit 1: proposed solution from below (thanks!):
    fp = open('brh-m-157.json', 'rw')
    s = fp.read()
    u = s.decode('utf-8-sig')
    s = u.encode('utf-8')
    print fp.encoding
    fp.write(s)
This gives me the following error:
    IOError: [Errno 9] Bad file descriptor
I'm being told in the comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
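With that fix applied (and seeking back to the start before writing, since read() leaves the file position at the end), the snippet would look something like this (my untested sketch):

    fp = open('brh-m-157.json', 'r+b')
    s = fp.read()
    u = s.decode('utf-8-sig')  # decode, dropping the BOM if present
    s = u.encode('utf-8')      # re-encode without a BOM
    fp.seek(0)                 # rewind before overwriting
    fp.write(s)
    fp.truncate()              # the BOM-less version is 3 bytes shorter
    fp.close()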
Set the encoding to utf-8-sig to remove the BOM character when reading from a file, e.g. with open('example.txt', 'r', encoding='utf-8-sig') as f: . The utf-8-sig encoding skips the BOM if it appears at the start of the file.
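A minimal round trip built on that (assuming Python 3; the filename is just an example):

    # read, dropping the BOM if present, then rewrite without it
    with open('example.txt', 'r', encoding='utf-8-sig') as f:
        content = f.read()
    with open('example.txt', 'w', encoding='utf-8') as f:
        f.write(content)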
The UTF-8 encoding without a BOM has the property that a document which contains only characters from the US-ASCII range is encoded byte-for-byte the same way as the same document encoded using the US-ASCII encoding. Such a document can be processed and understood when encoded either as UTF-8 or as US-ASCII.
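You can verify this property directly (a quick Python 3 check):

    text = "plain ASCII text"
    # ASCII-only text encodes byte-for-byte the same in both codecs
    assert text.encode("utf-8") == text.encode("ascii")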
Simply use the "utf-8-sig" codec:
fp = open("file.txt") s = fp.read() u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use

    s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s.

If your files are big, you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
    import os, sys, codecs

    BUFSIZE = 4096
    BOMLEN = len(codecs.BOM_UTF8)

    path = sys.argv[1]
    with open(path, "r+b") as fp:
        chunk = fp.read(BUFSIZE)
        if chunk.startswith(codecs.BOM_UTF8):
            i = 0
            chunk = chunk[BOMLEN:]
            while chunk:
                fp.seek(i)                     # jump back to the write position
                fp.write(chunk)                # write the chunk BOMLEN bytes earlier
                i += len(chunk)
                fp.seek(BOMLEN, os.SEEK_CUR)   # skip ahead to the next read position
                chunk = fp.read(BUFSIZE)
            fp.seek(-BOMLEN, os.SEEK_CUR)
            fp.truncate()                      # drop the now-stale trailing bytes
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it, so the file is rewritten in place. An easier solution is to write the shorter file to a new file, like newtover's answer does. That would be simpler, but it uses twice the disk space for a short period.
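For completeness, that new-file approach might look like this (an untested sketch; strip_bom_copy is a name I made up, and os.replace assumes Python 3.3+, so use os.rename on older versions):

    import codecs, os

    def strip_bom_copy(path):
        tmp = path + ".tmp"
        with open(path, "rb") as src, open(tmp, "wb") as dst:
            chunk = src.read(4096)
            if chunk.startswith(codecs.BOM_UTF8):
                chunk = chunk[len(codecs.BOM_UTF8):]  # drop the BOM
            while chunk:
                dst.write(chunk)
                chunk = src.read(4096)
        os.replace(tmp, path)  # swap the BOM-less copy into place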
As for guessing the encoding, you can simply loop through encodings from most to least specific:
    def decode(s):
        for encoding in "utf-8-sig", "utf-16":
            try:
                return s.decode(encoding)
            except UnicodeDecodeError:
                continue
        return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work, since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
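To tie that back to the original goal, you could use it like this (untested sketch):

    with open(path, "rb") as f:
        text = decode(f.read())        # unicode text, whatever the input encoding was
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))  # rewritten as UTF-8 without a BOM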
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:

    s = open(bom_file, mode='r', encoding='utf-8-sig').read()
    open(bom_file, mode='w', encoding='utf-8').write(s)
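Since the question mentions a set of files, you could wrap that in a loop (the *.json pattern here is just an example):

    import glob

    for bom_file in glob.glob('*.json'):
        s = open(bom_file, mode='r', encoding='utf-8-sig').read()
        open(bom_file, mode='w', encoding='utf-8').write(s)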