Why is there so much speed difference between these two variants?

Version 1:

import string, pandas as pd
def correct_contraction1(x, dic):
    for word in dic.keys():
        if word in x:
            x = x.replace(word, " " + dic[word]+ " ")
    return x

Version 2:

import string, pandas as pd
def correct_contraction2(x, dic):
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word]+ " ")
    return x

How I am using them:

train['comment_text'] = train['comment_text'].apply(correct_contraction1,args=(contraction_mapping,))
#3 mins 40 sec without that space thing (version1)

train['comment_text'] = train['comment_text'].apply(correct_contraction2,args=(contraction_mapping,))
#5 mins 56 sec with that space thing (version2)

Why is there such a large speed difference, which I would not expect, and is there a better or lesser-known pandas trick to optimize this further? (The code has been tested multiple times on Kaggle Kernels.)

  • train is a DataFrame with 2 million rows, identical in both cases
  • contraction_mapping is a dictionary mapping... (also the same in both cases)
  • Latest pandas hopefully.

Edit

  • The data comes from the Kaggle competition. Version 1 is much faster!
Aditya asked Apr 24 '19 07:04



2 Answers

Sorry for not explaining the difference, but the current approach can easily be improved in any case. It is slow because every sentence has to be scanned multiple times (once per word). You're even checking each word twice: first to see whether it is there, and then to replace it. You could just replace.
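As a minimal sketch of the "just replace" suggestion, assuming the same kind of flat word-to-expansion mapping as in the question (the function name `correct_contraction_replace_only` and the sample mapping are hypothetical): `str.replace` leaves the string as it is when the word is absent, so the separate membership test can be dropped. Each sentence is still scanned once per dictionary key, so this only removes the duplicate scan on hits.

```python
def correct_contraction_replace_only(x, dic):
    # str.replace is a no-op when `word` does not occur in x,
    # so no separate `word in x` check is needed.
    for word, replacement in dic.items():
        x = x.replace(word, " " + replacement + " ")
    return x


# Illustrative data only, not the mapping from the question.
mapping = {"can't": "can not", "won't": "will not"}
print(correct_contraction_replace_only("I can't go", mapping))
```

Like Version 1, this pads every replacement with spaces, so the result can contain doubled spaces.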

This is the crucial lesson when doing text replacement, whether with regex, simple string replacement, or your own algorithm: try to go over the text only once, regardless of how many words you want to replace. A regex goes a long way, but depending on the implementation it needs to go back a few characters when it does not find a hit. For the interested: look up the trie data structure.
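To make the "go over the text only once" idea concrete, here is a small trie-based sketch. It is not Aho-Corasick (there are no failure links) and it is not how flashtext is implemented. It also ignores word boundaries and case, and the names `build_trie` and `replace_with_trie` are made up for illustration. The point is only that all keys are matched in a single left-to-right pass, instead of one pass per key.

```python
_END = object()  # sentinel key marking the end of a stored word


def build_trie(mapping):
    """Build a nested-dict trie; each complete key stores its replacement under _END."""
    root = {}
    for word, replacement in mapping.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = replacement
    return root


def replace_with_trie(text, root):
    """Scan text left to right, replacing the longest key that starts at each position."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = root, i
        match_end, match_repl = None, None
        # Walk the trie as far as the text allows, remembering the longest hit.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if _END in node:
                match_end, match_repl = j, node[_END]
        if match_end is None:
            out.append(text[i])  # no key starts here: copy one character
            i += 1
        else:
            out.append(match_repl)  # emit the replacement and skip the key
            i = match_end
    return "".join(out)


trie = build_trie({"can't": "can not", "won't": "will not"})
print(replace_with_trie("I can't and won't", trie))
```

The work per position is bounded by the length of the longest key, not by the number of keys, which is why this style of approach scales to large dictionaries.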

Try, for example, an implementation of a fast text search algorithm (Aho-Corasick). I'm developing a library for this, but until then, you can use flashtext (which does things a little differently):

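import flashtext

# flashtext already considers word boundaries, so no need for " " + word + " "
fl = flashtext.KeywordProcessor()
# add_keywords_from_dict expects {replacement: [keywords, ...]}; for a flat
# {word: replacement} mapping such as dic, add the pairs one by one instead
for word, replacement in dic.items():
    fl.add_keyword(word, replacement)

train['comment_text'] = train['comment_text'].apply(fl.replace_keywords)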

If you have a lot of words to replace, this will be orders of magnitude faster.

For a comparison on the first data I could find:

Words to replace: 8520
Sentences to replace in: 11230
Replacements made using flashtext: 1706
Replacements made using correct_contraction1: 25 

flashtext: (considers word boundaries and ignores case)
39 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

correct_contraction1: (does not consider case nor words at end of line)
11.9 s ± 194 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

<unannounced>
30 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So we are talking a 300x speedup. That doesn't happen every day ;-)

For reference, I added the regex approach by Jon Clements:

pandas.str.replace + regex (1733 replacements)
3.02 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My new library shaves off another 30% in my tests. I've seen a 2-3x improvement over flashtext too, but more importantly it gives you, as the user, more control. It's fully functional; I just need to clean it up and add more documentation.

I'll update the answer when it arrives!

PascalVKooten answered Sep 23 '22 17:09


You're better off using pandas' Series.str.replace here, passing it a compiled regular expression built from the contents of a lookup table. This lets the string replacement work on the Series faster than applying a function, and it also means you're not scanning each string many more times than necessary. Hopefully it reduces your time to seconds instead of minutes.

import re
import pandas as pd

corrections = {
    "it's": "it is",
    "can't": "can not",
    "won't": "will not",
    "haven't": "have not"
}

sample = pd.Series([
    "Stays the same",
    "it's horrible!",
    "I hope I haven't got this wrong as that won't do",
    "Cabbage"
])

Then build your regex so that it looks for any possible matches that are keys in your dictionary, case insensitively and honouring word boundaries:

rx = re.compile(r'(?i)\b({})\b'.format('|'.join(re.escape(c) for c in corrections)))
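One caveat, offered as a hedged sketch and not as part of the original answer: Python's `re` tries alternatives from left to right, and `\b` also matches just before an apostrophe. If the lookup table contained a key that is a prefix of a longer key, the shorter one could win. The `"can"` entry below is hypothetical and only there to show the effect; sorting the keys longest first avoids it.

```python
import re

# Hypothetical table in which one key ("can") is a prefix of another ("can't").
table = {"can": "may", "can't": "can not"}


def build_pattern(keys):
    # Longest keys first, so the regex engine tries "can't" before "can".
    ordered_keys = sorted(keys, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in ordered_keys)
    return re.compile(r'(?i)\b({})\b'.format(alternation))


def lookup(m):
    # Map the matched text (lower-cased) back to its replacement.
    return table[m.group().lower()]


# Same construction as above, with keys in dictionary order.
naive = re.compile(r'(?i)\b({})\b'.format('|'.join(re.escape(k) for k in table)))
ordered = build_pattern(table)

print(naive.sub(lookup, "I can't"))    # "can" matches first, leaving "'t" behind
print(ordered.sub(lookup, "I can't"))  # the full contraction is matched
```

With a table that contains only contractions, such as the one above, this situation does not arise, so the sort is a precaution for larger tables.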

Then apply a str.replace to your column (change sample to train['comment_text'], for instance), passing the regex and a function that takes the match and returns the value for the key found:

# regex=True is needed on recent pandas versions, where regex defaults to False
corrected = sample.str.replace(rx, lambda m: corrections.get(m.group().lower()), regex=True)

Then you'll have corrected as a Series containing:

['Stays the same',
 'it is horrible!',
 'I hope I have not got this wrong as that will not do',
 'Cabbage']

Note the casing: a capitalised It's would also be picked up, since the match is case insensitive, and would be turned into a lower-case it is. There are various ways to preserve case, but it's probably not massively important and is a different question altogether.
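As one possible way to preserve case (a sketch under simple assumptions, not the only approach): the replacement function can restore a leading capital when the matched text started with one. The helper name `expand_keep_capital` is hypothetical, and all-caps input is deliberately not handled. The example uses plain `re` so that it runs on its own.

```python
import re

corrections = {"it's": "it is", "won't": "will not"}
rx = re.compile(r'(?i)\b({})\b'.format('|'.join(re.escape(c) for c in corrections)))


def expand_keep_capital(m):
    # Look up the lower-cased match, then restore a leading capital
    # if the original text had one.
    replacement = corrections[m.group().lower()]
    if m.group()[0].isupper():
        replacement = replacement[0].upper() + replacement[1:]
    return replacement


print(rx.sub(expand_keep_capital, "It's horrible, and it's late"))
```

The same function can be passed to pandas in place of the lambda, for example `sample.str.replace(rx, expand_keep_capital, regex=True)`.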

Jon Clements answered Sep 22 '22 17:09
