
Removing non-printable "gremlin" chars from text files

I am processing a large number of CSV files in Python. The files are received from external organizations and use a variety of encodings. I would like to find an automated method to remove the following:

  • Non-ASCII Characters
  • Control characters
  • Null (ASCII 0) Characters

I also have a product called 'Find and Replace It!' that uses regular expressions, so a solution based on a regular expression would be very helpful.

Thank you

John Steedman asked Sep 25 '13 11:09


3 Answers

An alternative you might be interested in would be:

import string
clean = lambda dirty: ''.join(filter(string.printable.__contains__, dirty))

It keeps only the characters found in string.printable (the ASCII digits, letters, punctuation and whitespace) and drops everything else from the dirty string it receives.

>>> len(clean(map(chr, range(0x110000))))
100
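
One thing to be aware of is that string.printable also contains vertical tab and form feed, which are control characters. A minimal sketch of a stricter variant, assuming you only want to keep tab, newline and carriage return from the whitespace set (the names ALLOWED and clean_strict are illustrative, not part of the answer above):

```python
import string

# Printable ASCII without the vertical tab (\x0b) and form feed (\x0c)
# that string.printable also includes.
ALLOWED = frozenset(string.printable) - {"\x0b", "\x0c"}


def clean_strict(dirty):
    """Return dirty with every character outside ALLOWED removed."""
    return "".join(ch for ch in dirty if ch in ALLOWED)


# Made-up sample containing NUL, vertical tab and a non-ASCII letter.
sample = "na\x00me,\x0bcaf\u00e9\r\n"
print(repr(clean_strict(sample)))
```

A set lookup is used here instead of string.printable.__contains__ only because it makes the allowed characters easy to adjust.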
Noctis Skytower answered Nov 15 '22 14:11


Try this:

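import re
clean = re.sub('[\0\200-\377]', '', dirty)

The idea is to match each NUL or "high ASCII" character (i.e. \0 and any value that does not fit in 7 bits) and remove it. Note that \200-\377 only reaches code point 255, so this covers single-byte data; in a decoded Unicode string, characters above U+00FF are left alone. You could add more characters as you find them, such as ASCII ESC or BEL.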
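
As a sketch of how the character class might be extended, the version below adds BEL and ESC and, on the assumption that dirty is a Python 3 str rather than raw bytes, widens the upper bound to U+10FFFF so that every non-ASCII code point is caught. The name GREMLINS and the sample string are made up for illustration.

```python
import re

# NUL, BEL (\x07), ESC (\x1b), and every code point above 7-bit ASCII.
GREMLINS = re.compile("[\x00\x07\x1b\x80-\U0010ffff]")

dirty = "id\x00,na\x07me,\x1b[0mcaf\u00e9 \u2603\n"
clean = GREMLINS.sub("", dirty)
print(repr(clean))
```

Only the ESC character itself is removed here; the rest of an escape sequence such as "[0m" is ordinary printable text and stays in place.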

Or this:

clean = re.sub('[^\040-\176]', '', dirty)

The idea being to only permit the limited range of "printable ASCII," but note that this also removes newlines. If you want to keep newlines or tabs or the like, just add them into the brackets.
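
A sketch of that adjustment, assuming tab, line feed and carriage return are the characters worth keeping (that choice, and the name NOT_ALLOWED, are assumptions for illustration):

```python
import re

# Keep tab, LF, CR, and printable ASCII (space through tilde); drop the rest.
NOT_ALLOWED = re.compile(r"[^\t\n\r\x20-\x7e]")

# Made-up sample with NUL, DEL and a non-ASCII letter.
dirty = "a\tb\x00c\x7f\u00e9\r\n"
clean = NOT_ALLOWED.sub("", dirty)
print(repr(clean))
```

Hex escapes are used instead of the octal \040-\176, but they describe the same range.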

John Zwinck answered Nov 15 '22 15:11


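Replace anything that isn't a desirable character with an empty string (i.e. delete it):

clean = re.sub(r'[^\s!-~]', '', dirty, flags=re.ASCII)

This keeps all whitespace (spaces, newlines, tabs etc.) and all "normal" characters (! is the first printable ASCII character after the space, and ~ is the last printable ASCII character below decimal 128). The pattern is a raw string so that \s reaches the regex engine unchanged, and re.ASCII restricts \s to ASCII whitespace.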
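
One caveat if dirty is a Python 3 str: there, \s also matches non-ASCII whitespace such as the no-break space unless the re.ASCII flag is given. A small sketch of the difference, using a made-up sample string:

```python
import re

dirty = "a\u00a0b\u2003c d\te\u00e9\n"

# Default str matching: \s is Unicode-aware, so U+00A0 and U+2003 survive.
unicode_ws = re.sub(r"[^\s!-~]", "", dirty)

# With re.ASCII, \s means only [ \t\n\r\f\v].
ascii_ws = re.sub(r"[^\s!-~]", "", dirty, flags=re.ASCII)

print(repr(unicode_ws))
print(repr(ascii_ws))
```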

Bohemian answered Nov 15 '22 14:11