Stripping non-printable characters from a string in Python

I used to run

$s =~ s/[^[:print:]]//g; 

in Perl to get rid of non-printable characters.

Python has no POSIX regex character classes, so I can't write [:print:] and have it mean what I want. I don't know of any way in Python to detect whether a character is printable.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them from the output, and curses.ascii.isprint returns false for any Unicode character.
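The limitation described in the edit can be reproduced with a short sketch. The helper name strip_with_string_printable and the sample string are made up for illustration; this shows the approach that does not work for Unicode input, not a suggested solution.

```python
import string

# string.printable only lists ASCII characters.
_ASCII_PRINTABLE = set(string.printable)

def strip_with_string_printable(s):
    """Hypothetical helper: keep only characters found in string.printable."""
    return ''.join(c for c in s if c in _ASCII_PRINTABLE)

# Sample contains a BEL control character (\x07) and two accented letters.
sample = 'na\u00efve\x07 caf\u00e9'

# BEL is removed as intended, but the accented letters are lost as well.
print(repr(strip_with_string_printable(sample)))
```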

Vinko Vrsalovic asked Sep 18 '08 13:09



1 Answer

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.
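As a quick orientation, this is roughly what unicodedata.category() reports for a handful of characters. The sample characters are chosen here for illustration only; the two-letter codes are the Unicode general categories mentioned above.

```python
import unicodedata

# A few sample characters, each paired with a human-readable label.
samples = {
    '\x00': 'NUL',
    '\x7f': 'DEL',
    '\u200b': 'ZERO WIDTH SPACE',
    ' ': 'SPACE',
    'a': 'LATIN SMALL LETTER A',
    '\u00e9': 'LATIN SMALL LETTER E WITH ACUTE',
}

# Print the code point, the label, and the two-letter general category.
for char, label in samples.items():
    print('U+%04X %s: %s' % (ord(char), label, unicodedata.category(char)))
```

Control characters come back as Cc and invisible format characters such as the zero-width space as Cf, while ordinary letters and spaces fall into the L* and Z* categories.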

import itertools
import re
import sys
import unicodedata

# Collect every character whose category is in `categories`.
# range() excludes its end, so add 1 to include sys.maxunicode itself.
all_chars = (chr(i) for i in range(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0))))

# Escape the characters so none of them is treated as regex syntax.
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)
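The speed claim above can be checked locally with a rough timing sketch. The function names remove_with_regex and remove_with_loop and the test string are made up for this comparison, and the actual ratio will depend on the Python version and the input, so no numbers are quoted here.

```python
import re
import timeit
import unicodedata

# Character class covering the two Cc ranges: U+0000-U+001F and U+007F-U+009F.
control_char_re = re.compile(r'[\x00-\x1f\x7f-\x9f]')

def remove_with_regex(s):
    return control_char_re.sub('', s)

def remove_with_loop(s):
    # Per-character check in pure Python, for comparison only.
    return ''.join(c for c in s if unicodedata.category(c) != 'Cc')

text = 'some text\x00 with\x1f control\x7f characters\x9f ' * 1000

# Both approaches should agree before their speed is compared.
assert remove_with_regex(text) == remove_with_loop(text)

for func in (remove_with_regex, remove_with_loop):
    seconds = timeit.timeit(lambda: func(text), number=20)
    print('%s: %.4f s' % (func.__name__, seconds))
```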

For Python 2:

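import re
import sys
import unicodedata

# Collect every character whose category is in `categories`.
# xrange() excludes its end, so add 1 to include sys.maxunicode itself.
all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00, 0x20) + range(0x7f, 0xa0)))

# Escape the characters so none of them is treated as regex syntax.
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)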

For some use cases, additional categories (e.g. all of the C* categories listed below) might be preferable, although this can slow down processing and increase memory usage significantly. Number of characters per category (exact counts vary with the Unicode version):

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601
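For the larger categories, one way to keep the pattern small is to collapse consecutive code points into ranges instead of listing every character. The following is only a sketch under that assumption: build_category_class and remove_invisible_chars are hypothetical names, and the choice of Cc plus Cf is just an example.

```python
import re
import sys
import unicodedata

def build_category_class(categories):
    """Return a regex character class matching all code points in `categories`.

    Consecutive code points are collapsed into ranges, so the pattern stays
    short even for categories with many members.
    """
    ranges = []
    start = None
    # Go one past sys.maxunicode so that a run reaching the end is flushed.
    for codepoint in range(sys.maxunicode + 2):
        in_class = (codepoint <= sys.maxunicode
                    and unicodedata.category(chr(codepoint)) in categories)
        if in_class and start is None:
            start = codepoint
        elif not in_class and start is not None:
            ranges.append((start, codepoint - 1))
            start = None
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append('\\U%08x' % first)
        else:
            parts.append('\\U%08x-\\U%08x' % (first, last))
    return '[' + ''.join(parts) + ']'

# Example choice: strip control (Cc) and format (Cf) characters.
invisible_re = re.compile(build_category_class({'Cc', 'Cf'}))

def remove_invisible_chars(s):
    return invisible_re.sub('', s)

print(repr(remove_invisible_chars('zero\u200bwidth\x00 caf\u00e9')))
```

The class is written with \U escapes, which the re module interprets itself, so no separate escaping step is needed. Scanning all code points still takes a moment, so the compiled pattern is best built once at import time.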

Edit: Added suggestions from the comments.

Ants Aasma answered Sep 23 '22 15:09
