
How to replace all '0xa0' chars with a ' ' in a bunch of text files?

I've been trying to mass-convert a bunch of text files to UTF-8 in Python, and this error keeps popping up. Is there a way to replace these characters using a Python script or bash commands? I used this code:

writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd,'*.txt')):
        print infile
        for line in open(infile):
                writer.write(line.encode('utf-8'))

and got these sorts of errors:

Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte
alvas asked Dec 21 '22 16:12


2 Answers

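OK, first point: your output file is already set up to encode text written to it as UTF-8, so don't call encode('utf-8') explicitly on what you pass to write(). In Python 2, calling encode() on a byte string forces an implicit decode first, and that implicit decode is what raises UnicodeDecodeError on a byte like 0xa0.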

So the first thing to try is to simply use the following in your inner loop:

writer.write(line)

If that doesn't work, then the problem is almost certainly the fact that, as others have noted, you aren't decoding your input file properly.

Taking a wild guess that your input files are encoded in cp1252 (where byte 0xa0 is a no-break space), you could try the following as a quick test in place of the inner loop:

for line in codecs.open(infile, 'r', 'cp1252'):
    writer.write(line)
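
Putting the two suggestions together, a minimal sketch of the whole conversion might look like the following. It assumes the inputs really are cp1252 (still a guess), uses io.open so the same code runs on Python 2.6+ and Python 3, and swaps each no-break space for a plain space as the question title asks. The function name merge_as_utf8 and its parameters are made up for illustration.

```python
import glob
import io
import os


def merge_as_utf8(src_dir, out_name="dict.en", src_encoding="cp1252"):
    """Concatenate every *.txt file in src_dir into one UTF-8 file.

    src_encoding is an assumption; change it if the inputs are not cp1252.
    """
    out_path = os.path.join(src_dir, out_name)
    with io.open(out_path, "w", encoding="utf-8") as writer:
        for infile in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
            # Decode on the way in, so each line is text rather than raw bytes.
            with io.open(infile, "r", encoding=src_encoding) as reader:
                for line in reader:
                    # Byte 0xa0 in cp1252 decodes to U+00A0 (no-break space).
                    writer.write(line.replace(u"\u00a0", u" "))
    return out_path
```

Calling merge_as_utf8(wrd) would then stand in for the original loop. The writer encodes to UTF-8 by itself, so no encode() call is needed.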

Minor point: 'wtr' is a nonsensical mode string ('w' and 'r' are alternatives, and opening a file for writing does not also grant read access). Simplify it to either 'wt' or even just 'w'.

ncoghlan answered Dec 29 '22 00:12


Did you omit some code there? You're reading into line but trying to re-encode line2.

In any case, you're going to have to tell Python what encoding the input file uses; if you don't know, then you'll have to open it raw and perform substitutions without the help of a codec.
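
As a rough sketch of the "open it raw" route, the snippet below reads each file as bytes and replaces byte 0xa0 with an ASCII space, with no codec involved. It assumes the files use a single-byte encoding such as cp1252 or latin-1. On files that are already UTF-8 it would corrupt text, because 0xa0 also occurs inside multi-byte sequences (for example, "à" is the bytes 0xc3 0xa0). The function name replace_nbsp_bytes is hypothetical, and the files are rewritten in place, so try it on copies first.

```python
import glob
import os


def replace_nbsp_bytes(src_dir, pattern="*.txt"):
    """Rewrite matching files in place, turning byte 0xa0 into a space.

    Returns the list of paths that were actually modified.
    """
    changed = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        # Binary mode: no decoding happens, so no UnicodeDecodeError.
        with open(path, "rb") as f:
            data = f.read()
        fixed = data.replace(b"\xa0", b" ")
        if fixed != data:
            with open(path, "wb") as f:
                f.write(fixed)
            changed.append(path)
    return changed
```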

geekosaur answered Dec 29 '22 01:12