Remove all characters which cannot be decoded in Python

I am trying to parse an HTML file with a Python script using the xml.etree.ElementTree module. According to the header, the charset should be UTF-8, but there is a strange character in the file, so the parser can't parse it. When I open the file in Notepad++, I see the character FS. I tried opening it with several encodings, but I can't find the correct one.

As I have many files to parse, I would like to know how to remove all bytes which can't be decoded. Is there a solution?

clemtoy asked Jun 18 '15 18:06


1 Answer

I would like to know how to remove all bytes which can't be decoded. Is there a solution?

This is simple:

with open('filename', 'r', encoding='utf8', errors='ignore') as f:
    ...

The errors='ignore' argument tells Python to drop any bytes that can't be decoded. It can also be passed to bytes.decode() and most other places that take an encoding argument.
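For instance, with bytes.decode() (0xFE and 0xFF can never appear in valid UTF-8, so they are silently dropped):

```python
# errors='ignore' drops the bytes that can't be decoded as UTF-8
raw = b'hello \xfe\xff world'
print(raw.decode('utf8', errors='ignore'))  # hello  world
```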

Since this decodes the bytes into unicode, it may not be suitable for an XML parser that wants to consume bytes. In that case, you should write the data back to disk (e.g. using shutil.copyfileobj()) and then re-open in 'rb' mode.
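A minimal sketch of that round trip, using hypothetical filenames (the sample input is created here just to make the example self-contained):

```python
import shutil

# Hypothetical sample input containing an invalid byte
# (0xFF can never occur in valid UTF-8).
with open('input.html', 'wb') as f:
    f.write(b'<p>hello\xffworld</p>')

# Decode with errors='ignore' and re-encode, producing clean UTF-8 bytes.
with open('input.html', 'r', encoding='utf8', errors='ignore') as fin, \
     open('cleaned.html', 'w', encoding='utf8') as fout:
    shutil.copyfileobj(fin, fout)

# The XML parser can now consume the cleaned copy in binary mode.
with open('cleaned.html', 'rb') as f:
    print(f.read())  # b'<p>helloworld</p>'
```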

In Python 2, these arguments to the built-in open() don't exist, but you can use io.open() instead. Alternatively, you can decode your 8-bit strings into unicode strings after reading them, but this is more error-prone in my opinion.


But it turns out OP doesn't have invalid UTF-8. OP has valid UTF-8 which happens to include control characters. Control characters are mildly annoying to filter out since you have to run them through a function like this, meaning you can't just use copyfileobj():

import unicodedata

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')
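For instance, the FS character from the question is U+001C ("information separator four"), which falls in category Cc and is therefore removed (the function is repeated here so the demo is self-contained):

```python
import unicodedata

def strip_control_chars(data: str) -> str:
    # Keep only characters outside the Cc (control) category.
    return ''.join(c for c in data if unicodedata.category(c) != 'Cc')

print(strip_control_chars('hello\x1cworld'))  # helloworld
```

Note that tabs and newlines are also in Cc, so they get stripped as well.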

Cc is the Unicode category for "Other, control", as described on the Unicode website. To catch a slightly broader set of "bad characters", we could strip the entire "Other" category (which mostly contains useless stuff anyway):

def strip_control_chars(data: str) -> str:
    return ''.join(c for c in data if not unicodedata.category(c).startswith('C'))

This will filter out line breaks, so it's probably a good idea to process the file a line at a time and add the line breaks back in at the end.
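A sketch of that line-by-line approach, with hypothetical filenames (the sample input is created here just so the example runs on its own):

```python
import unicodedata

def strip_other_chars(line: str) -> str:
    # Drop everything in the Unicode "Other" categories (Cc, Cf, Cs, Co, Cn).
    return ''.join(c for c in line if not unicodedata.category(c).startswith('C'))

# Hypothetical sample file containing an FS control character (U+001C).
with open('input.html', 'w', encoding='utf8') as f:
    f.write('line one\x1c\nline two\n')

# '\n' is itself in category Cc, so strip each line separately and
# add the line break back at the end.
with open('input.html', 'r', encoding='utf8', errors='ignore') as fin, \
     open('cleaned.html', 'w', encoding='utf8') as fout:
    for line in fin:
        fout.write(strip_other_chars(line.rstrip('\n')) + '\n')
```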

In principle, we could write an incremental codec that does this filtering, and then we could use copyfileobj(), but that's like using a sledgehammer to swat a fly.

Kevin answered Oct 29 '22 09:10