
Removing control characters from a string in python

I currently have the following code:

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

This does not work if there is more than one character to be deleted.

David asked Dec 01 '10 13:12



1 Answer

There are hundreds of control characters in Unicode. If you are sanitizing data from the web or some other source that might contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode general category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C". Note that the "C" group is broader than the strict control category Cc: it also covers format (Cf), surrogate (Cs), private-use (Co), and unassigned (Cn) code points.

This snippet removes all control characters from a string.

import unicodedata

def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
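As a quick usage sketch: the function above also strips newlines and tabs, since those are category Cc as well. If that is not wanted, something like the variant below may help. The name `remove_control_characters_keep` and its `keep` parameter are made up for illustration and are not part of the original answer.

```python
import unicodedata


def remove_control_characters(s):
    # Same logic as the answer's snippet: drop every character whose
    # Unicode general category starts with "C".
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


def remove_control_characters_keep(s, keep="\n\t"):
    # Hypothetical variant: characters listed in `keep` survive even
    # though they belong to category Cc.
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )


sample = "Hello\x00, \u200bworld\t!\r\n"
print(repr(remove_control_characters(sample)))       # 'Hello, world!'
print(repr(remove_control_characters_keep(sample)))  # 'Hello, world\t!\n'
```

Which characters to keep depends on the data; the default of newline and tab here is only an assumption.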

Examples of Unicode categories:

>>> from unicodedata import category
>>> category('\r')      # carriage return --> Cc : control character
'Cc'
>>> category('\0')      # null character ---> Cc : control character
'Cc'
>>> category('\t')      # tab --------------> Cc : control character
'Cc'
>>> category(' ')       # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A')       # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',')       # comma  -----------> Po : punctuation
'Po'
>>>
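If only the ASCII range matters, as in the question's `c < chr(32)` test, a sketch along these lines avoids slicing the string while iterating over it, which is what makes the original loop lose track of its index. The function name `remove_ascii_control_characters` is hypothetical, and treating DEL (0x7F) as a control character is an assumption that goes beyond the question's check.

```python
def remove_ascii_control_characters(line):
    # Build a new string instead of modifying `line` inside the loop.
    # Keep characters at or above space (32); also drop DEL (127),
    # which is an added assumption, not part of the original test.
    return "".join(c for c in line if ord(c) >= 32 and ord(c) != 127)


print(repr(remove_ascii_control_characters("a\x00b\x01\x02c\x7fd")))  # 'abcd'
```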
Alex Quinn answered Sep 20 '22 20:09