How to replace all '0xa0' chars with a ' ' in a bunch of text files?

Question

i've been trying to mass-edit a bunch of text files to utf-8 in python and this error keeps popping out. is there a way to replace them in some python scrips or bash commands? i used the code:

writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd,'*.txt')):
        print infile
        for line in open(infile):
                writer.write(line.encode('utf-8'))

and got these sorts of errors:

Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte

ncoghlan · Accepted Answer

OK, first point: your output file is set to automatically encode text written to it as utf-8, so don't include an explicit encode('utf-8') method call when passing arguments to the write() method.

So the first thing to try is to simply use the following in your inner loop:

writer.write(line)

If that doesn't work, then the problem is almost certainly the fact that, as others have noted, you aren't decoding your input file properly.

Taking a wild guess and assuming that your input files are encoded in cp1252, you could try as a quick test the following in the inner loop:

for line in codecs.open(infile, 'r', 'cp1252'):
    writer.write(line)

Minor point: 'wtr' is a nonsensical mode string (as write access implies read access). Simplify it to either 'wt' or even just 'w'.

geekosaur · Answer

Did you omit some code there? You're reading into line but trying to re-encode line2.

In any case, you're going to have to tell Python what encoding the input file is; if you don't know, then you'll have to open it raw and perform substitutions without help of a codec.

How to replace all '0xa0' chars with a ' ' in a bunch of text files?

Tags:

python

text

bash

shell

utf-8

alvas

2 Answers

ncoghlan

geekosaur

Recent Activity

Donate For Us

How to replace all '0xa0' chars with a ' ' in a bunch of text files?

Tags:

python

text

bash

shell

utf-8

alvas

2 Answers

ncoghlan

geekosaur

Related questions

Recent Activity

Donate For Us