Performance is slow when replacing a string in a pandas dataframe using a dict

The following code works but needs to run faster. The dict has ~25K keys and the dataframe is ~3M rows. Is there a way to produce the same results with Python code that runs faster? (Without multiprocessing, processing would be about 8x slower.)

miscdict={" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}

df=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})

def parse_text(data):
    for key, replacement in miscdict.items():
        data['q1'] = data['q1'].str.replace( key, replacement )
    return data

if __name__ == '__main__':
    t1_1 = datetime.datetime.now()
    p = multiprocessing.Pool(processes=8)
    split_dfs = np.array_split(df,8)
    pool_results = p.map(parse_text, split_dfs)
    p.close()
    p.join()
    parts = pd.concat(pool_results, axis=0)
    df = pd.concat([parts], axis=1)
    t2_1 = datetime.datetime.now()
    print("done"+ str(t2_1-t1_1)) 

asked Apr 09 '17 by Robert Treichler


2 Answers

I tested out a few of these approaches. @A-Za-z's suggestion is a major improvement, but it's possible to go even faster.

Edit: I've rerun the tests with the replacement dict, the dataframe, and the compiled regexes created outside the timed code. The new timings are:

  • Original: 11.71 seconds
  • @A-Za-z: 4.72 seconds, a 60% improvement.
  • @piRSquared: 4.95 seconds, a 58% improvement.
  • Precompiled: 2.81 seconds, a 76% improvement.

The original results, where data generation and regex compilation were included in the timings, were as follows:

"Testing your code I got 15 seconds, @A-Za-z's code gave 8-9 seconds, and my own solution moved it down to 6 seconds. It uses precompiled regex. See the end of this answer."


Imports:

import pandas as pd
import re
import timeit

Your original code:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def org(printout=False):
    def parse_text(data):
        for key, replacement in miscdict.items():
            data['q1'] = data['q1'].str.replace(key, replacement)
        return data
    data2 = parse_text(data)
    if printout:
        print(data2)
org(printout=True)
print(timeit.timeit(org, number=10000))

This used 11.7 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
11.71043858179268

User @A-Za-z's code:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def alt1(printout=False):
    data['q1'].replace(miscdict, regex = True, inplace = True)
    if printout:
        print(data)
alt1(printout=True)
print(timeit.timeit(alt1, number=10000))

This used 4.7 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
4.721581550644499

User @piRSquared's code:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def alt2(printout=False):
    # regex = True was added because it doesn't work without it: without regex = True
    # only exact full-cell matches are replaced, not substrings.
    data = data.replace(miscdict, regex = True)
    if printout:
        print(data)
alt2(printout=True)
print(timeit.timeit(alt2, number=10000))

This used 5.0 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
4.951810616074919

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def alt3(printout=False):
    def parse_text(text):
        for pattern, replacement in miscdict_comp.items():
            text = pattern.sub(replacement, text)
        return text
    data["q1"] = data["q1"].apply(parse_text)
    if printout:
        print(data)
alt3(printout=True)
print(timeit.timeit(alt3, number=10000))

This used 2.8 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
2.810334940701157

The idea is to precompile the patterns you're looking to replace.

I got the idea from here: https://jerel.co/blog/2011/12/using-python-for-super-fast-regex-search-and-replace
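
With a dict as large as 25K keys, a further variation worth trying is to merge all the keys into a single compiled alternation and make one pass per row with a lookup callback. This is only a sketch, not benchmarked here; the names combined and parse_text_single_pass are illustrative, and it assumes a replacement never produces text that another key should then match (the sequential loop would catch that case, a single pass would not):

import re

# re.escape makes each key match literally, even if it contains regex metacharacters.
combined = re.compile("|".join(re.escape(k) for k in miscdict))

def parse_text_single_pass(text):
    # One scan of the text; the callback looks up the replacement for whichever key matched.
    return combined.sub(lambda m: miscdict[m.group(0)], text)

data["q1"] = data["q1"].apply(parse_text_single_pass)

The point of the single pass is that each row is scanned once instead of once per key, which is where most of the time goes as the dict grows.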

answered by André C. Andersen


You don't need the loop here; df.replace with regex = True does the job, and it cuts the time by more than half.

df['q1'].replace(miscdict, regex = True, inplace = True)
1000 loops, best of 3: 1.08 ms per loop

gets you

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good

Comparing this with the current solution

for key, replacement in miscdict.items():
    df['q1'] = df['q1'].str.replace(key, replacement)
100 loops, best of 3: 2.35 ms per loop
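
Note that with regex = True every key in the dict is treated as a regular expression. The sample keys here are plain text, so it doesn't matter, but if the real 25K-key dict contains metacharacters such as ., ?, or (, one way to keep the matches literal is to escape the keys first (a sketch; safe_miscdict is an illustrative name):

import re

# Escape each key so it matches literally rather than as a regex pattern.
safe_miscdict = {re.escape(k): v for k, v in miscdict.items()}
df['q1'] = df['q1'].replace(safe_miscdict, regex=True)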

answered by Vaishali