
Fast punctuation removal with pandas

Tags: python, string, regex, pandas, numpy

This is a self-answered post. Below I outline a common problem in the NLP domain and propose a few performant methods to solve it.

Oftentimes the need arises to remove punctuation during text cleaning and pre-processing. Punctuation is defined as any character in string.punctuation:

    >>> import string
    >>> string.punctuation
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

This is a common enough problem and has been asked before ad nauseam. The most idiomatic solution uses pandas str.replace. However, for situations which involve a lot of text, a more performant solution may need to be considered.

What are some good, performant alternatives to str.replace when dealing with hundreds of thousands of records?

436

asked May 21 '18 07:05

cs95

2 Answers

Setup

For the purpose of demonstration, let's consider this DataFrame.

    import pandas as pd

    df = pd.DataFrame({'text':['a..b?!??', '%hgh&12','abc123!!!', '$$$1234']})
    df
            text
    0   a..b?!??
    1    %hgh&12
    2  abc123!!!
    3    $$$1234

Below, I list the alternatives one by one, in increasing order of performance.

`str.replace`

This option is included to establish the default method as a benchmark for comparing other, more performant solutions.

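This uses pandas' built-in str.replace method, which performs regex-based replacement. Pass regex=True explicitly: since pandas 2.0 the default is a literal, non-regex replacement.

    df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)

    df
         text
    0      ab
    1   hgh12
    2  abc123
    3    1234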

This is very easy to code, and is quite readable, but slow.

`regex.sub`

This involves using the sub function from the re library. Pre-compile a regex pattern for performance, and call regex.sub inside a list comprehension. If you can spare some memory, convert df['text'] to a list beforehand; you'll get a nice little performance boost out of this.

    import re

    p = re.compile(r'[^\w\s]+')
    df['text'] = [p.sub('', x) for x in df['text'].tolist()]

    df
         text
    0      ab
    1   hgh12
    2  abc123
    3    1234

Note: If your data has NaN values, this (as well as the next method below) will not work as is. See the section on "Other Considerations".

`str.translate`

Python's str.translate function is implemented in C, and is therefore very fast.

How this works is:

1. First, join all your strings together to form one huge string using a single (or more) character separator that you choose. You must use a character/substring that you can guarantee will not belong inside your data.
2. Perform str.translate on the large string, removing punctuation (the separator from step 1 excluded).
3. Split the string on the separator that was used to join in step 1. The resultant list must have the same length as your initial column.

Here, in this example, we consider the pipe separator |. If your data contains the pipe, then you must choose another separator.

    import string

    punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'   # `|` is not present here
    transtab = str.maketrans(dict.fromkeys(punct, ''))

    df['text'] = '|'.join(df['text'].tolist()).translate(transtab).split('|')

    df
         text
    0      ab
    1   hgh12
    2  abc123
    3    1234
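The three steps above can be wrapped in a small helper that refuses to run when the separator assumption is violated. This is a minimal sketch rather than part of the original answer; the name strip_punct_joined and its sep parameter are hypothetical.

```python
import string


def strip_punct_joined(texts, sep='|'):
    """Remove ASCII punctuation from a list of strings in one join/translate/split pass."""
    if not texts:
        return []
    # Step 1 precondition: the separator must not occur anywhere in the data.
    if any(sep in t for t in texts):
        raise ValueError(f'separator {sep!r} occurs in the data')
    # Delete every punctuation character except those used by the separator.
    punct = ''.join(ch for ch in string.punctuation if ch not in sep)
    table = str.maketrans('', '', punct)
    # Steps 2 and 3: translate the joined string, then split it back apart.
    out = sep.join(texts).translate(table).split(sep)
    # The result must line up with the input, one string per record.
    assert len(out) == len(texts)
    return out


cleaned = strip_punct_joined(['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234'])
print(cleaned)  # ['ab', 'hgh12', 'abc123', '1234']
```

With a DataFrame, the call would be df['text'] = strip_punct_joined(df['text'].tolist()).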

Performance

str.translate performs the best, by far. Note that the graph below includes another variant Series.str.translate from MaxU's answer.

(Interestingly, I reran this a second time, and the results are slightly different from before. During the second run, it seems re.sub was winning out over str.translate for really small amounts of data.)

[Figure: benchmark plot, relative time against N for each method]

There is an inherent risk involved with using translate (particularly, the problem of automating the process of deciding which separator to use is non-trivial), but the trade-offs are worth the risk.
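One simple way to automate the separator choice, assuming a short list of candidate characters is acceptable, is to scan the data once and take the first candidate that never occurs. The function name pick_separator and the candidate list are hypothetical; this sketch makes no attempt to search the whole Unicode range.

```python
def pick_separator(texts, candidates=('|', '\x1f', '\x00')):
    """Return the first candidate character absent from the data, or None."""
    present = set(''.join(texts))  # every distinct character in the data
    for cand in candidates:
        if cand not in present:
            return cand
    return None  # caller must fall back to a separator-free method


print(repr(pick_separator(['a..b?!??', 'x|y'])))  # '\x1f'
```

If the chosen separator is itself a punctuation character, it must also be left out of the translation table, as in the example with the pipe.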

Other Considerations

Handling NaNs with list comprehension methods

Note that the list comprehension method (and the str.translate method) will only work as long as your data does not have NaNs. When handling NaNs, you will have to determine the indices of non-null values and replace those only. Try something like this:

    import numpy as np

    df = pd.DataFrame({'text': [
        'a..b?!??', np.nan, '%hgh&12','abc123!!!', '$$$1234', np.nan]})

    idx = np.flatnonzero(df['text'].notna())
    col_idx = df.columns.get_loc('text')
    df.iloc[idx,col_idx] = [
        p.sub('', x) for x in df.iloc[idx,col_idx].tolist()]

    df
         text
    0      ab
    1     NaN
    2   hgh12
    3  abc123
    4    1234
    5     NaN
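The same idea carries over to the join/translate/split method: clean only the non-null rows and write them back through a boolean mask. This is a sketch under the assumption that the pipe does not occur in the data; the variable names are illustrative.

```python
import string

import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ['a..b?!??', np.nan, '%hgh&12', np.nan]})

sep = '|'  # assumed absent from the data
table = str.maketrans('', '', string.punctuation.replace(sep, ''))

mask = df['text'].notna()  # True where there is a string to clean
if mask.any():
    # Join, translate and split only the non-null values.
    cleaned = sep.join(df.loc[mask, 'text'].tolist()).translate(table).split(sep)
    df.loc[mask, 'text'] = cleaned

print(df['text'].tolist())
```

The NaN rows are never passed to str.join, so they stay untouched.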

Dealing with DataFrames

If you are dealing with DataFrames, where every column requires replacement, the procedure is simple:

    v = pd.Series(df.values.ravel())
    df[:] = translate(v).values.reshape(df.shape)

Or,

    v = df.stack()
    v[:] = translate(v)
    df = v.unstack()

Note that the translate function is defined below, in the Appendix, alongside the benchmarking code. That version takes a DataFrame with a text column, so it needs a small adaptation to accept and return a Series as used here.
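A Series-in, Series-out variant might look like the following sketch. The name translate_series is hypothetical, and the pipe is again assumed to be absent from every cell; the flattening uses the ravel/reshape route from the first snippet.

```python
import string

import pandas as pd

SEP = '|'  # assumed absent from every cell
TABLE = str.maketrans('', '', string.punctuation.replace(SEP, ''))


def translate_series(s):
    """Clean a Series of strings in one join/translate/split pass, keeping its index."""
    cleaned = SEP.join(s.tolist()).translate(TABLE).split(SEP)
    return pd.Series(cleaned, index=s.index)


df = pd.DataFrame({'a': ['x!!', 'y..z'], 'b': ['$1', '(2)']})

# Flatten every cell into one Series, clean it, and restore the original shape.
flat = pd.Series(df.to_numpy().ravel())
df_clean = pd.DataFrame(
    translate_series(flat).to_numpy().reshape(df.shape),
    index=df.index,
    columns=df.columns,
)
print(df_clean)
```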

Every solution has tradeoffs, so deciding what solution best fits your needs will depend on what you're willing to sacrifice. Two very common considerations are performance (which we've already seen), and memory usage. str.translate is a memory-hungry solution, so use with caution.

Another consideration is the complexity of your regex. Sometimes, you may want to remove anything that is not alphanumeric or whitespace. Other times, you will need to retain certain characters, such as hyphens, colons, and sentence terminators [.!?]. Specifying these explicitly adds complexity to your regex, which may in turn impact the performance of these solutions. Make sure you test these solutions on your data before deciding what to use.
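As an illustration of retaining selected characters, the negated character class can simply list them. This is a sketch; the pattern name keep_some and the sample sentence are made up for the example.

```python
import re

# Keep word characters, whitespace, hyphens, colons and sentence terminators.
# The hyphen goes last in the class so that it is read literally.
keep_some = re.compile(r'[^\w\s.!?:-]+')

sample = 'Wait: is this well-known (really)?!'
print(keep_some.sub('', sample))  # Wait: is this well-known really?!
```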

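Lastly, the regex-based solutions also remove non-ASCII (Unicode) punctuation and symbols, because [^\w\s] matches anything that is not a word character or whitespace. If that is not what you want, tweak your regex, or go with str.translate, which removes only the characters listed in its table.

For even more performance (for larger N), take a look at the answer by Paul Panzer below.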
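The difference is easiest to see on a string that mixes ASCII and non-ASCII characters. The sketch below also builds a wider translation table from Unicode categories; that table is an addition for illustration, not something the answer uses, and it covers only categories starting with 'P' (so symbols such as $ or + are not in it).

```python
import re
import string
import sys
import unicodedata

text = 'café “quoted” _x_!'

# Regex: keeps Unicode letters and the underscore, drops the curly quotes and '!'.
regex_out = re.sub(r'[^\w\s]+', '', text)        # 'café quoted _x_'

# ASCII table: drops only string.punctuation, which includes '_' but not curly quotes.
ascii_table = str.maketrans('', '', string.punctuation)
ascii_out = text.translate(ascii_table)          # 'café “quoted” x'

# Wider table: every code point whose Unicode category starts with 'P'.
unicode_table = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
}
unicode_out = text.translate(unicode_table)      # 'café quoted x'
```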

Appendix

Functions

    import re
    import string

    def pd_replace(df):
        # regex=True is needed on pandas >= 2.0, where the default is literal matching
        return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))


    def re_sub(df):
        p = re.compile(r'[^\w\s]+')
        return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])

    def translate(df):
        punct = string.punctuation.replace('|', '')
        transtab = str.maketrans(dict.fromkeys(punct, ''))

        return df.assign(
            text='|'.join(df['text'].tolist()).translate(transtab).split('|')
        )

    # MaxU's version (https://stackoverflow.com/a/50444659/4909087)
    def pd_translate(df):
        punct = string.punctuation.replace('|', '')
        transtab = str.maketrans(dict.fromkeys(punct, ''))

        return df.assign(text=df['text'].str.translate(transtab))

Performance Benchmarking Code

    from timeit import timeit

    import pandas as pd
    import matplotlib.pyplot as plt

    res = pd.DataFrame(
           index=['pd_replace', 're_sub', 'translate', 'pd_translate'],
           columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
           dtype=float
    )

    for f in res.index:
        for c in res.columns:
            l = ['a..b?!??', '%hgh&12','abc123!!!', '$$$1234'] * c
            df = pd.DataFrame({'text' : l})
            stmt = '{}(df)'.format(f)
            setp = 'from __main__ import df, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=30)

    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N");
    ax.set_ylabel("time (relative)");

    plt.show()

163

answered Sep 25 '22 10:09

cs95

Using numpy we can gain a healthy speedup over the best methods posted so far. The basic strategy is similar: make one big super string. But the processing seems much faster in numpy, presumably because we fully exploit the simplicity of the nothing-for-something replacement op.

For smaller problems (fewer than 0x110000 characters in total) we automatically find a separator; for larger problems we use a slower method that does not rely on str.split.

Note that I have moved all precomputables out of the functions. Also note that translate and pd_translate are given the only possible separator for the three largest problems for free, whereas np_multi_strat has to compute it or fall back to the separator-less strategy. Finally, note that for the last three data points I switch to a more "interesting" problem; pd_replace and re_sub had to be excluded there because they are not equivalent to the other methods.

[Figure: benchmark plot, relative time against N for each method, including np_multi_strat]

On the algorithm:

The basic strategy is actually quite simple. There are only 0x110000 different Unicode code points. As the OP frames the challenge in terms of huge data sets, it is perfectly worthwhile to build a lookup table that has True at the code points we want to keep and False at the ones that have to go (the punctuation in our example).

Such a lookup table can be used for bulk lookup using numpy's advanced indexing. As the lookup is fully vectorized and essentially amounts to dereferencing an array of pointers, it is much faster than, for example, dictionary lookup. Here we make use of numpy view casting, which allows us to reinterpret unicode characters as integers essentially for free.

Using the data array, which contains just one monster string reinterpreted as a sequence of numbers, to index into the lookup table results in a boolean mask. This mask can then be used to filter out the unwanted characters. With boolean indexing, this too is a single line of code.
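Stripped of the splitting logic, the lookup-and-mask step can be sketched on a single string as follows. The variable names are illustrative; the full implementation further down uses an inverted table named invlookup for the same purpose.

```python
import string

import numpy as np

# Boolean table over all Unicode code points: True = keep, False = drop.
keep = np.ones(0x110000, dtype=bool)
keep[[ord(ch) for ch in string.punctuation]] = False

text = 'a..b?!??|%hgh&12'
codes = np.array([text]).view(np.int32)  # reinterpret the characters as integers
mask = keep[codes]                       # advanced indexing: one lookup per character
kept = codes[mask]                       # boolean indexing drops the punctuation
result = kept.view(f'U{kept.size}').item(0)  # back to a Python string

print(result)  # abhgh12
```

Note that the pipe is removed here as well, which is why the separator has to be a character outside the punctuation list.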

So far so simple. The tricky bit is chopping up the monster string back into its parts. If we have a separator, i.e. one character that does not occur in the data or the punctuation list, then it still is easy. Use this character to join and resplit. However, automatically finding a separator is challenging and indeed accounts for half the lines of code in the implementation below.
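The mask-based way of finding a separator can be sketched in a few lines: mark every code point that is taken, then pick the lowest one left. This is a simplified illustration with a hypothetical function name, not the tuned version used in the implementation below; excluding NUL and the surrogate range is an extra precaution added here.

```python
import string

import numpy as np


def find_free_code_point(text):
    """Return the lowest code point usable as a separator, or None if none is free."""
    free = np.ones(0x110000, dtype=bool)
    free[0] = False              # NumPy strips trailing NULs from strings
    free[0xD800:0xE000] = False  # surrogates are not valid standalone characters
    free[[ord(ch) for ch in string.punctuation]] = False
    free[np.array([text]).view(np.int32)] = False  # everything used by the data
    idx = int(free.argmax())     # first True entry, or 0 if there is none
    return idx if free[idx] else None


sep = find_free_code_point('a..b?!??%hgh&12')
print(sep)  # 1
```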

Alternatively, we can keep the split points in a separate data structure, track how they move as a consequence of deleting unwanted characters and then use them to slice the processed monster string. As chopping up into parts of uneven length is not numpy's strongest suit, this method is slower than str.split and only used as a fallback when a separator would be too expensive to calculate if it existed in the first place.
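The separator-free idea can be sketched by counting how many kept characters precede each original split point. Note that this sketch uses a cumulative sum for that bookkeeping, whereas the implementation below uses np.add.reduceat; the variable names are illustrative.

```python
import string

import numpy as np

keep = np.ones(0x110000, dtype=bool)
keep[[ord(ch) for ch in string.punctuation]] = False

parts = ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234']
codes = np.array([''.join(parts)]).view(np.int32)
mask = keep[codes]

# Split points in the original monster string: [0, 8, 15, 24, 31].
bounds = np.concatenate([[0], np.cumsum([len(p) for p in parts])])
# kept_before[i] = number of kept characters among the first i characters.
kept_before = np.concatenate([[0], np.cumsum(mask)])
# Where each split point lands after the unwanted characters are deleted.
new_bounds = kept_before[bounds].tolist()

kept = codes[mask]
flat = kept.view(f'U{kept.size}').item(0)
result = [flat[a:b] for a, b in zip(new_bounds[:-1], new_bounds[1:])]

print(result)  # ['ab', 'hgh12', 'abc123', '1234']
```

The final slicing is an ordinary Python loop over the records, which is the part that makes this route slower than str.split.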

Code (timing/plotting heavily based on @COLDSPEED's post):

    import numpy as np
    import pandas as pd
    import string
    import re


    spct = np.array([string.punctuation]).view(np.int32)
    lookup = np.zeros((0x110000,), dtype=bool)
    lookup[spct] = True
    invlookup = ~lookup
    OSEP = spct[0]
    SEP = chr(OSEP)
    while SEP in string.punctuation:
        OSEP = np.random.randint(0, 0x110000)
        SEP = chr(OSEP)


    def find_sep_2(letters):
        letters = np.array([letters]).view(np.int32)
        msk = invlookup.copy()
        msk[letters] = False
        sep = msk.argmax()
        if not msk[sep]:
            return None
        return sep

    def find_sep(letters, sep=0x88000):
        letters = np.array([letters]).view(np.int32)
        cmp = np.sign(sep-letters)
        cmpf = np.sign(sep-spct)
        if cmp.sum() + cmpf.sum() >= 1:
            left, right, gs = sep+1, 0x110000, -1
        else:
            left, right, gs = 0, sep, 1
        idx, = np.where(cmp == gs)
        idxf, = np.where(cmpf == gs)
        sep = (left + right) // 2
        while True:
            cmp = np.sign(sep-letters[idx])
            cmpf = np.sign(sep-spct[idxf])
            if cmp.all() and cmpf.all():
                return sep
            if cmp.sum() + cmpf.sum() >= (left & 1 == right & 1):
                left, sep, gs = sep+1, (right + sep) // 2, -1
            else:
                right, sep, gs = sep, (left + sep) // 2, 1
            idx = idx[cmp == gs]
            idxf = idxf[cmpf == gs]

    def np_multi_strat(df):
        L = df['text'].tolist()
        all_ = ''.join(L)
        sep = 0x088000
        if chr(sep) in all_: # very unlikely ...
            if len(all_) >= 0x110000: # fall back to separator-less method
                                      # (finding separator too expensive)
                LL = np.array((0, *map(len, L)))
                LLL = LL.cumsum()
                all_ = np.array([all_]).view(np.int32)
                pnct = invlookup[all_]
                NL = np.add.reduceat(pnct, LLL[:-1])
                NLL = np.concatenate([[0], NL.cumsum()]).tolist()
                all_ = all_[pnct]
                all_ = all_.view(f'U{all_.size}').item(0)
                return df.assign(text=[all_[NLL[i]:NLL[i+1]]
                                       for i in range(len(NLL)-1)])
            elif len(all_) >= 0x22000: # use mask
                sep = find_sep_2(all_)
            else: # use bisection
                sep = find_sep(all_)
        all_ = np.array([chr(sep).join(L)]).view(np.int32)
        pnct = invlookup[all_]
        all_ = all_[pnct]
        all_ = all_.view(f'U{all_.size}').item(0)
        return df.assign(text=all_.split(chr(sep)))

    def pd_replace(df):
        # regex=True is needed on pandas >= 2.0, where the default is literal matching
        return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))


    p = re.compile(r'[^\w\s]+')

    def re_sub(df):
        return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])

    punct = string.punctuation.replace(SEP, '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))

    def translate(df):
        return df.assign(
            text=SEP.join(df['text'].tolist()).translate(transtab).split(SEP)
        )

    # MaxU's version (https://stackoverflow.com/a/50444659/4909087)
    def pd_translate(df):
        return df.assign(text=df['text'].str.translate(transtab))

    from timeit import timeit

    import pandas as pd
    import matplotlib.pyplot as plt

    res = pd.DataFrame(
           index=['translate', 'pd_replace', 're_sub', 'pd_translate', 'np_multi_strat'],
           columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000,
                    1000000],
           dtype=float
    )

    for c in res.columns:
        if c >= 100000: # stress test the separator finder
            all_ = np.r_[:OSEP, OSEP+1:0x110000].repeat(c//10000)
            np.random.shuffle(all_)
            split = np.arange(c-1) + \
                    np.sort(np.random.randint(0, len(all_) - c + 2, (c-1,)))

            l = [x.view(f'U{x.size}').item(0) for x in np.split(all_, split)]
        else:
            l = ['a..b?!??', '%hgh&12','abc123!!!', '$$$1234'] * c
        df = pd.DataFrame({'text' : l})
        for f in res.index:
            if f == res.index[0]:
                ref = globals()[f](df).text
            elif not (ref == globals()[f](df).text).all():
                res.at[f, c] = np.nan
                print(f, 'disagrees at', c)
                continue
            stmt = '{}(df)'.format(f)
            setp = 'from __main__ import df, {}'.format(f)
            res.at[f, c] = timeit(stmt, setp, number=16)

    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N");
    ax.set_ylabel("time (relative)");

    plt.show()

answered Sep 21 '22 10:09

Paul Panzer
