
How to strip unicode "punctuation" from Python string

Here's the problem: I have a unicode string as input to a Python sqlite query. The query ('like') failed. It turns out the string 'FRANCE' doesn't have six characters, it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

Dave Fultz asked Mar 24 '11 04:03



2 Answers

You can use the general categories from the Unicode character database, which Python exposes through the unicodedata module:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter them out character by character (for example, with a list comprehension).
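
As a minimal sketch of that filtering step, assuming Python 3 (where every str is Unicode), the helper below drops every character whose category starts with 'P'. The name strip_punctuation is made up for illustration.

```python
import unicodedata

def strip_punctuation(text):
    # Hypothetical helper: keep only characters whose general category
    # does not start with "P" (Po, Ps, Pe, Pd, ...).
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Hello, world... (really)!"))  # Hello world really
```

Note that this only removes punctuation. A character in another category, such as U+FEFF, passes through unchanged.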

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

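U+FEFF is a format character ('Cf'), not punctuation, so a filter on 'P' alone would not catch it. Instead, you can whitelist by category: keep only the categories you expect and drop everything else.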
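
A category whitelist might look like the sketch below, assuming Python 3. The helper name keep_allowed and the default set of allowed categories (letters, numbers, space separators) are assumptions for illustration; adjust them to what your data legitimately contains.

```python
import unicodedata

def keep_allowed(text, allowed=("L", "N", "Zs")):
    # Hypothetical helper: keep a character only if its general category
    # starts with one of the allowed prefixes. "L" covers all letters,
    # "N" all numbers, "Zs" space separators. Anything else, including
    # format characters such as U+FEFF (Cf), is dropped.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith(allowed)
    )

country = "FRANCE\ufeff"
print(len(country))                # 7
print(len(keep_allowed(country)))  # 6
```

A whitelist this strict also removes apostrophes and hyphens, so a name like "Côte d'Ivoire" would lose its apostrophe unless 'P' categories are allowed as well.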

Andreas Jung answered Sep 20 '22 11:09


In general, input validation should be done by using a whitelist of allowable characters if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only contain upper-case English letters and spaces. You could strip out everything else, including your nasty Unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
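
The other option mentioned above, rejecting the input altogether, could be sketched as follows, assuming Python 3.4+ (for re.fullmatch) and the same upper-case-letters-and-spaces rule. The function name validate_country is hypothetical.

```python
import re

# Same whitelist as above, but matched against the whole string.
ALLOWED_COUNTRY = re.compile(r"[A-Z ]+")

def validate_country(value):
    # Hypothetical helper: reject the input instead of silently cleaning it.
    if ALLOWED_COUNTRY.fullmatch(value) is None:
        raise ValueError("unexpected characters in %r" % value)
    return value

validate_country("FRANCE")            # returns 'FRANCE'
try:
    validate_country("FRANCE\ufeff")
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting makes the bad data visible at the point of entry, while stripping keeps the query working. Which one fits depends on where the string comes from.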

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate every one of the tens of thousands of unexpected Unicode characters that could be thrown at you--and more are added to the standard as languages evolve over the years.

Nathan Stocks answered Sep 21 '22 11:09