Fastest way to strip punctuation from a unicode string in Python

Tags:

I am trying to efficiently strip punctuation from a unicode string. With a regular string, using mystring.translate(None, string.punctuation) is clearly the fastest approach. However, this code breaks on a unicode string in Python 2.7. As the comments to this answer explain, the translate method can still be implemented, but it must be implement with a dictionary. When I use this implementation though, I find that translate's performance is dramatically reduced. Here is my timing code (copied primarily from this answer):

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."


exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(None, string.punctuation)

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode)      :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode)     :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode)   :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs horribly:

sets      : 38.323941946
regex     : 6.7729549408
translate : 1.27428412437
replace   : 5.54967689514

sets (unicode)      : 43.6268708706
regex (unicode)     : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode)   : 17.4450061321

My question is whether there is a faster way to implement translate for unicode (or any other method) that would outperform regex.

857

asked Dec 11 '13 20:12

Michael

1 Answers

The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ascii, or all unicode).

When that is done, the regex and replace methods fare much worse with the full set of unicode punctuation characters.

For full unicode, it looks like the "set" method is the best. However, if you only want remove the ascii punctuation characters from unicode strings, it may be best to encode, translate, and decode (depending on the length of the input string).

The "replace" method can also be substantially improved by doing a containment test before attempting replacements (depending on the precise make-up of the string).

Here's some sample results from a re-hash of the test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

And here's the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s  as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')

157

answered Oct 23 '22 18:10

ekhumoro

Related questions
                            
                                What is the proper way to create a numpy array of transformation matrices
                            
                                Python: Writing Counter to a csv file
                            
                                How to read parameter values from a file in Python
                            
                                Getting error from beanstalk when trying to deploy flask app: "no module named flask"
                            
                                Change language on Firefox with Selenium Python
                            
                                Difficulty getting OpenMP to work with f2py
                            
                                Why still has "commands out of sync; you can't run this command now" error
                            
                                How to add multiple pictures in Python ebay sdk
                            
                                Capture URL with query string as parameter in Flask route
                            
                                Parallelize loop over numpy rows
                            
                                python : install logging module [closed]
                            
                                Unable to run certain Python files from Terminal. Permission denied?
                            
                                Python: why must Tkinter class instantiation use a Frame?
                            
                                Python script hanging when running in the background
                            
                                Python 2.7 Tkinter how to change text color of a button's text
                            
                                Disable SSL certificate verification in Scrapy
                            
                                Is there a way to define the docstring for a python object that defines __call__?
                            
                                Beautiful Soup: Parsing only one element
                            
                                Comparing two dataframes of different length row by row and adding columns for each row with equal value
                            
                                Filtering list of tuples based on condition

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fastest way to strip punctuation from a unicode string in Python

Tags:

python

regex

unicode

python-2.7

Michael

People also ask

1 Answers

ekhumoro

Recent Activity

Donate For Us