
How to strip the leading Unicode characters from a file?

I am processing a few thousand XML files and have a few problem files.

In each case, they contain leading Unicode characters, such as C3 AF C2 BB C2 BF and EF BB BF, etc.

In all cases, the file contains only ASCII characters (after the header bytes), so that there would be no risk of data loss converting them to ASCII.

I am not allowed to change the contents of the files on disk, only use them as input to my script.

At its simplest, I would be happy to convert such files to ASCII (all input files are parsed, some changes are made, and the results are written to an output directory, where a second script will process them).

How would I code that? When I try:

with open(filePath, "rb") as file:
    contentOfFile = file.read()

unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")

with open(filePath, 'wt')  as file:
    file.write(asciiData)

I get the error "must be str, not bytes".

I also tried

    asciiData = unicodedata.normalize('NFKD', unicodeData).encode('ASCII', 'ignore')

with the same result. How do I correct that?

Or is there any other way to convert the file?

Asked Dec 02 '25 by Mawg says reinstate Monica

1 Answer

...
asciiData = unicodeData.encode("ascii", "ignore")

asciiData is a bytes object because it has been encoded. You need to open the file in binary mode instead of text mode:

with open(filePath, 'wb') as file:  # <---
    file.write(asciiData)
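
For completeness, here is a minimal end-to-end sketch assuming the cleaned file is written to a separate output directory (outputPath below is a hypothetical name; writing back to filePath would modify the input files, which the question rules out):

with open(filePath, "rb") as file:
    contentOfFile = file.read()

# Decode the raw bytes as UTF-8, then drop everything outside ASCII;
# this removes the leading BOM bytes along with any other non-ASCII data.
unicodeData = contentOfFile.decode("utf-8")
asciiData = unicodeData.encode("ascii", "ignore")  # bytes, not str

# asciiData is bytes, so the output file must be opened in binary mode.
with open(outputPath, "wb") as file:
    file.write(asciiData)

If only the UTF-8 byte order mark (the EF BB BF sequence) needs stripping, opening the input in text mode with encoding="utf-8-sig" also removes it during decoding; the doubly encoded C3 AF C2 BB C2 BF variant, however, still needs the ASCII round trip above.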
Answered Dec 05 '25 by falsetru


