Here's the problem, I have a unicode string as input to a python sqlite query. The query failed ('like'). It turns out the string, 'FRANCE' doesn't have 6 characters, it has seven. And the seventh is . . . unicode U+FEFF, a zero-width no-break space.
How on earth do I trap a class of such things before the query?
We can use replace() method to remove punctuation from python string by replacing each punctuation mark by empty string. We will iterate over the entire punctuation marks one by one replace it by an empty string in our text string.
In python, to remove Unicode ” u “ character from string then, we can use the replace() method to remove the Unicode ” u ” from the string. After writing the above code (python remove Unicode ” u ” from a string), Ones you will print “ string_unicode ” then the output will appear as a “ Python is easy. ”.
One of the easiest and fastest methods through which punctuation marks and special characters can be removed from a string is by using the translate () method. The built-in translate () function is available in the string library of Python.
You may use the unicodedata categories as part of the unicode data table in Python:
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'
The categories for punctation characters start with 'P' as you can see. So you need to filter you out char by char (using a list comprehension).
See also:
in your case :
>>> unicodedata.category(u'\ufeff')
'Cf'
So you may perform some whitelisting based on the categories for characters.
In general, input validation should be done by using a whitelist of allowable characters if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).
If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.
For example, lets say you know "country" will only have upper-case English letters and spaces you could strip out everything else, including your nasty unicode letter like this:
>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all tens of thousands of possible unexpected unicode characters that could be thrown at you--and more and more are added to the specs as languages evolve over the years.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With