
How to speed up pandas apply for string matching

I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this:

import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'],
                   'B': ['B', 'C', 'D', 'A'],
                   'C': ['A', 'B', 'D', 'D'],
                   'D': ['A', 'C', 'C', 'B']})

    A   B   C   D
0   A   B   A   A
1   B   C   B   C
2   A   D   D   C
3   B   A   D   B

I have to create new columns containing the number of occurrences of certain strings in each row. Currently I do it like this:

for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)

   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1

However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?

asked Aug 12 '20 by Wouter

People also ask

How do you make apply faster in pandas?

You can speed up execution even further by making your pandas dataframes lighter, i.e. by using more efficient data types. If a dataframe only contains integers from 1 to 10, for instance, you can reduce the data type from 64 bits to 16 bits; in the quoted example this shrank the dataframe from 38 MB to 9.5 MB.
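For illustration, a minimal sketch of that downcasting idea (the frame below is a stand-in, not the article's data, and the sizes are approximate):

import numpy as np
import pandas as pd

# Stand-in frame of small integers (1 to 10)
df = pd.DataFrame(np.random.randint(1, 11, size=(1_000_000, 5)))
print(df.memory_usage(deep=True).sum())        # int64: roughly 40 MB

# The values fit easily in 16 bits, so downcast
df_small = df.astype('int16')
print(df_small.memory_usage(deep=True).sum())  # int16: roughly 10 MB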

Which is faster, apply or map?

Series.map is actually somewhat faster than Series.apply, but both are still relatively slow compared to vectorized operations.
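A rough illustration of the three options on stand-in data (exact timings will vary, but the vectorized comparison should win comfortably):

import pandas as pd

s = pd.Series(['A', 'B', 'C', 'D'] * 250_000)

via_apply  = s.apply(lambda x: x == 'A')   # slowest
via_map    = s.map(lambda x: x == 'A')     # somewhat faster
vectorized = s == 'A'                      # much faster than either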

How do you make an operation on a pandas DataFrame faster?

NumPy supports many operations directly on arrays, so to add 1 to every element of an array we can simply write col = col + 1. This once again turns out to be faster than the same operation on a pandas dataframe column.
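A small sketch of the same idea:

import numpy as np
import pandas as pd

col = np.arange(1_000_000)
col = col + 1                  # vectorized NumPy add, no Python-level loop

df = pd.DataFrame({'x': np.arange(1_000_000)})
df['x'] = df['x'] + 1          # same idea on a dataframe column, with more overhead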

Is there a way to speed up pandas map?

Passing raw=True to apply can achieve around a 4x speed-up with a very simple parameter tweak. It tells the apply method to bypass the overhead associated with the pandas Series object and hand the function plain NumPy arrays instead. It is interesting to note that this can still be slower than iterating with itertuples.
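Applied to string counting like the question's, a sketch of the raw=True tweak might look like this:

import pandas as pd

df = pd.DataFrame({'A': ['A', 'B'], 'B': ['B', 'C'], 'C': ['A', 'D']})

# raw=True hands each row to the function as a plain NumPy array
# rather than a pandas Series, skipping per-row Series construction
n_a = df.apply(lambda row: (row == 'A').sum(), axis=1, raw=True)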

When should you use pandas apply vs. Dask parallel processing?

When vectorization is not possible, the swifter library switches to dask parallel processing or a simple pandas apply: it uses pandas apply when that leads to faster computation times for smaller data sets, and shifts to dask parallel processing when that is the faster option for large data sets.
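A minimal sketch, assuming the third-party swifter package is installed (the dataframe is just a stand-in):

import pandas as pd
import swifter  # pip install swifter; importing it registers the .swifter accessor

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A']})

# swifter picks between vectorization, plain apply, and dask,
# based on what it estimates to be fastest for the data size
result = df.swifter.apply(lambda row: (row == 'A').sum(), axis=1)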

How do you handle large data sets in pandas?

Here are a few rules of thumb that you can apply next time you're working with large data sets in pandas: try to use vectorized operations where possible rather than approaching problems with a for x in df... mentality, as in the sketch below.
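A small sketch contrasting the two styles on stand-in data:

import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A']})

# "for x in df..." style: one Python-level iteration per row (slow)
n_a_loop = [(row == 'A').sum() for _, row in df.iterrows()]

# vectorized: a single comparison over the whole frame
n_a_vec = (df == 'A').sum(axis=1)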

Can pandas vectorization reduce the run time of the apply function?

We showed that by using pandas vectorization together with efficient data types, we could reduce the running time of the apply function by a factor of 600, without using anything other than pandas.
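For string data like the question's, one way to combine both ideas is pandas' category dtype; this is my own illustration, not the quoted article's code:

import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A']})

# category dtype stores each distinct string once plus small integer codes,
# cutting memory and often speeding up comparisons on large frames
df = df.astype('category')
n_a = (df == 'A').sum(axis=1)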




4 Answers

Use stack + str.get_dummies, then sum on level=0 and join the result with df:

df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))

Result:

print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
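Note that sum(level=...) was deprecated in pandas 1.3 and removed in 2.0; on recent versions, an equivalent (my adaptation, not separately timed) uses an explicit groupby:

df1 = df.join(df.stack().str.get_dummies().groupby(level=0).sum().add_prefix('n_'))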
answered Nov 15 '22 by Shubham Sharma

Rather than use apply to loop over each row, I looped over each column to compute the sum for each letter:

for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)

This seems to be an improvement in this example, but (from quick testing not shown here) it can be roughly equal or worse depending on the shape and size of the data, and probably on how many strings you are looking for.

Some time comparisons:

%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)    
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And for the other answers here:

%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df1=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
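If even more speed is needed, a plain-NumPy variant that computes all the counts in one broadcast comparison may help on large frames. This is my own sketch and is not included in the timings above:

import numpy as np

letters = ['A', 'B', 'C', 'D']

# One broadcast comparison of shape (n_rows, n_cols, n_letters),
# then sum over the column axis to count each letter per row.
arr = df[letters].to_numpy()
counts = (arr[:, :, None] == np.array(letters)).sum(axis=1)

for i, letter in enumerate(letters):
    df['n_' + letter] = counts[:, i]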
answered Nov 15 '22 by Tom


Try get_dummies and sum with level; here we do not need stack :-)

df = df.join(pd.get_dummies(df, prefix='n', prefix_sep='_').sum(1, level=0))
Out[57]: 
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
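As with the stack-based answer above, sum(1, level=0) no longer works on pandas >= 2.0; an equivalent on recent versions (my adaptation) transposes, groups the duplicated dummy columns, and transposes back:

dummies = pd.get_dummies(df, prefix='n', prefix_sep='_')
df1 = df.join(dummies.T.groupby(level=0).sum().T)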
answered Nov 15 '22 by BENY


You could do:

counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
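One small caveat: value_counts leaves NaN for letters missing from a row, so after fillna(0) the count columns are floats; casting back to int is a one-liner (my addition):

counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0).astype(int)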
answered Nov 15 '22 by Ayoub ZAROU