I've been trying to mass-convert a bunch of text files to UTF-8 in Python and this error keeps popping up. Is there a way to replace the offending characters with a Python script or bash commands? I used the code:
writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd,'*.txt')):
    print infile
    for line in open(infile):
        writer.write(line.encode('utf-8'))
and got errors like this:
Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte
OK, first point: your output file is set to automatically encode text written to it as utf-8, so don't include an explicit encode('utf-8') method call when passing arguments to the write() method.
So the first thing to try is to simply use the following in your inner loop:
writer.write(line)
If that doesn't work, then the problem is almost certainly the fact that, as others have noted, you aren't decoding your input file properly.
Taking a wild guess and assuming that your input files are encoded in cp1252, you could try the following in the inner loop as a quick test:
for line in codecs.open(infile, 'r', 'cp1252'):
    writer.write(line)
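Putting the two suggestions together, a minimal sketch might look like the following. It assumes the inputs really are cp1252 (still just a guess), and it uses io.open instead of codecs.open so that it runs on both Python 2.6+ and Python 3. The function name merge_to_utf8 and its parameters are made up for illustration.

```python
import glob
import io
import os


def merge_to_utf8(src_dir, out_name="dict.en", src_encoding="cp1252"):
    """Concatenate every *.txt file in src_dir into one UTF-8 file.

    src_encoding is an assumption about the inputs; change it if the
    files turn out to use a different encoding.
    """
    out_path = os.path.join(src_dir, out_name)
    # io.open decodes on read and encodes on write, so the loop needs
    # no explicit .decode() or .encode() calls.
    with io.open(out_path, "w", encoding="utf-8") as writer:
        for infile in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
            with io.open(infile, "r", encoding=src_encoding) as reader:
                for line in reader:
                    writer.write(line)
    return out_path
```

Calling merge_to_utf8(wrd) would then write dict.en into the same directory as the inputs, as in the original code.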
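Minor point: 'wtr' is a nonsensical mode string (you only ever write to this file, so there is no reason to request read access). Simplify it to just 'w'; codecs.open always opens the underlying file in binary mode when an encoding is given, so the 't' has no effect anyway.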
Did you omit some code there? You're reading into line but trying to re-encode line2.
In any case, you're going to have to tell Python what encoding the input file uses; if you don't know, then you'll have to open it raw (in binary mode) and perform substitutions without the help of a codec.
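As a rough sketch of the "open it raw" route: read the file as bytes, substitute the offending byte values, and write the bytes back out. The helper name replace_bytes and the choice of replacing 0xa0 with a plain space are only illustrative assumptions. Note that 0xa0 can also occur as the second byte of a valid UTF-8 sequence, so a blind replacement like this is only safe if the file is not already partly UTF-8.

```python
import io


def replace_bytes(src_path, dst_path, replacements):
    """Copy src_path to dst_path, applying byte-level substitutions.

    replacements is a list of (old_bytes, new_bytes) pairs, applied in
    order. No codec is involved, so undecodable bytes cannot raise.
    """
    with io.open(src_path, "rb") as src:
        data = src.read()
    for old, new in replacements:
        data = data.replace(old, new)
    with io.open(dst_path, "wb") as dst:
        dst.write(data)
    return data
```

For example, replace_bytes(infile, infile + '.clean', [(b'\xa0', b' ')]) would swap each 0xa0 byte for an ASCII space and leave every other byte untouched.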