How to strip unicode "punctuation" from Python string

Tags: python, unicode, punctuation

Here's the problem: I have a Unicode string as input to a Python sqlite query, and the query (a 'like' comparison) failed. It turns out the string 'FRANCE' doesn't have six characters, it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?


asked Mar 24 '11 04:03

Dave Fultz

2 Answers

You can use the Unicode general categories, which Python exposes through the unicodedata module:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter them out character by character (using a list comprehension, for example).
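A minimal sketch of such a filter, assuming Python 3 (where every str is Unicode). One caveat: U+FEFF from the question is reported as category 'Cf' (format), not as a 'P' category, so this sketch drops both groups. The helper name strip_categories and its default prefixes are illustrative choices, not something from the answer above.

```python
import unicodedata

def strip_categories(text, prefixes=("P", "Cf")):
    """Return text without characters whose Unicode general category
    starts with any of the given prefixes (hypothetical helper)."""
    return "".join(
        ch for ch in text
        # str.startswith accepts a tuple, so one call checks every prefix
        if not unicodedata.category(ch).startswith(prefixes)
    )

print(repr(strip_categories("FRANCE\ufeff")))   # 'FRANCE'
print(repr(strip_categories("Hello, world!")))  # 'Hello world'
```

Passing prefixes=("P",) would remove punctuation only and leave the zero-width character in place, which is why the default here also lists 'Cf'.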

Andreas Jung

In general, input validation should be done by using a whitelist of allowable characters if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only have upper-case English letters and spaces. You could strip out everything else, including your nasty Unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
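The same whitelist idea can be wrapped in a small function that either strips or rejects the input, covering both options mentioned above. This is only a sketch for Python 3.4 or later (it relies on re.fullmatch); the function name clean_country and the strict flag are made up for illustration.

```python
import re

# Whitelist: upper-case ASCII letters and spaces only.
ALLOWED = re.compile(r"[A-Z ]+")
NOT_ALLOWED = re.compile(r"[^A-Z ]+")

def clean_country(value, strict=False):
    """Strip everything outside the whitelist, or raise ValueError
    when strict is True and the input contains anything else."""
    if strict and not ALLOWED.fullmatch(value):
        raise ValueError("unexpected characters in %r" % value)
    return NOT_ALLOWED.sub("", value)

print(repr(clean_country("FRANCE\ufeff")))  # 'FRANCE'

try:
    clean_country("FRANCE\ufeff", strict=True)
except ValueError as exc:
    print("rejected:", exc)
```

Stripping silently is convenient for a lookup like the one in the question; rejecting is the safer choice when bad input should be reported back to whoever supplied it.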

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all of the tens of thousands of possible unexpected Unicode characters that could be thrown at you--and more and more are added to the specs as languages evolve over the years.

answered Sep 21 '22 11:09

Nathan Stocks

