I am processing a large number of CSV files in Python. The files are received from external organizations and are encoded with a range of encodings. I would like to find an automated method to remove the following:
I have a product called 'Find and Replace It!' that uses regular expressions, so a way to solve the above with a regular expression would be very helpful.
Thank you
An alternative you might be interested in would be:
import string
clean = lambda dirty: ''.join(filter(string.printable.__contains__, dirty))
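It keeps only the characters found in string.printable (ASCII letters, digits, punctuation, and whitespace) and drops everything else from the dirty string it receives, including non-ASCII letters such as é.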
>>> len(clean(map(chr, range(0x110000))))
100
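When many files are processed, the same filter can be written as a named function with the allowed characters precomputed as a set. This is only a sketch assuming Python 3 and `str` input; the name `clean_printable` is made up for illustration.

```python
import string

# Build the lookup once; membership tests on a frozenset are O(1).
_ALLOWED = frozenset(string.printable)

def clean_printable(dirty: str) -> str:
    """Return dirty with every character outside string.printable removed."""
    return ''.join(ch for ch in dirty if ch in _ALLOWED)

# The accented letter and the NUL byte are dropped; the newline is kept.
print(repr(clean_printable('caf\xe9\x00 au lait\n')))  # 'caf au lait\n'
```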
Try this:
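import re
clean = re.sub('[\0\200-\377]', '', dirty)
The idea is to match each NUL or "high ASCII" character (i.e. \0 and those that do not fit in 7 bits) and remove them. You could add more characters as you find them, such as ASCII ESC or BEL. Note that on a Python 3 str the range \200-\377 only covers U+0080 through U+00FF, so characters above U+00FF are left in place.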
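A possible extension of this blacklist for Python 3 strings, assuming you also want BEL and ESC gone and want the "high" range to reach the end of Unicode rather than stop at U+00FF. The names `_UNWANTED` and `strip_unwanted` are hypothetical.

```python
import re

# NUL, BEL, ESC, plus every code point from U+0080 to U+10FFFF.
# The raw string lets the re module interpret the \x and \U escapes itself.
_UNWANTED = re.compile(r'[\x00\x07\x1b\x80-\U0010FFFF]')

def strip_unwanted(dirty: str) -> str:
    """Delete blacklisted characters; everything else passes through."""
    return _UNWANTED.sub('', dirty)

# Newlines and other ASCII control characters not listed are untouched.
print(repr(strip_unwanted('a\x00b\x07c\x1bd\xe9e\u20acf\n')))  # 'abcdef\n'
```

Because this is a blacklist, anything you did not think of (DEL, form feed, and so on) survives until you add it to the brackets.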
Or this:
clean = re.sub('[^\040-\176]', '', dirty)
The idea being to only permit the limited range of "printable ASCII," but note that this also removes newlines. If you want to keep newlines or tabs or the like, just add them into the brackets.
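For CSV data you will usually want to keep line endings and tabs, so here is a sketch of the same whitelist with tab, newline, and carriage return added to the brackets. The hex escapes \x20-\x7e are the same range as the octal \040-\176; the function name is an assumption for illustration.

```python
import re

# Anything that is not tab, LF, CR, or printable ASCII (space through ~).
_NOT_ALLOWED = re.compile(r'[^\t\n\r\x20-\x7e]')

def keep_printable_ascii(dirty: str) -> str:
    """Keep printable ASCII plus tab and line endings; delete the rest."""
    return _NOT_ALLOWED.sub('', dirty)

# The accented letter, NUL, and DEL (0x7f) are removed; the row structure stays.
print(repr(keep_printable_ascii('id\tname\r\nJos\xe9\x00\x7f~\n')))  # 'id\tname\r\nJos~\n'
```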
Replace anything that isn't a desirable character with the empty string (that is, delete it):
clean = re.sub(r'[^\s!-~]', '', dirty)
This allows all whitespace (spaces, newlines, tabs, etc.) and all "normal" characters: ! is the first visible ASCII character and ~ is the last printable ASCII character below decimal 128.
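One caveat worth checking in Python 3: on a `str`, `\s` matches Unicode whitespace too, so characters such as the no-break space (U+00A0) pass through this pattern. Passing `re.ASCII` restricts `\s` to ASCII whitespace. A small comparison, with a made-up sample string:

```python
import re

# Sample contains a no-break space, an em space, a tab, and an accented letter.
dirty = 'a\xa0b\u2003c\td\xe9\n'

# Default (Unicode) matching: the non-ASCII spaces count as \s and are kept.
unicode_ws = re.sub(r'[^\s!-~]', '', dirty)

# re.ASCII: \s means only space, \t, \n, \r, \f, \v, so they are removed.
ascii_ws = re.sub(r'[^\s!-~]', '', dirty, flags=re.ASCII)

print(repr(unicode_ws))
print(repr(ascii_ws))
```

Which behaviour you want depends on whether the downstream consumer of the CSV files can cope with non-ASCII whitespace.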