I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this:
df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'], 'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B'],})
   A  B  C  D
0  A  B  A  A
1  B  C  B  C
2  A  D  D  C
3  B  A  D  B
I have to create new columns containing the number of occurrences of certain strings in each row. I currently do it like this:
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?
Use stack + str.get_dummies, then sum on level=0 and join it with df:
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
Result:
print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
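Note that sum(level=0) has been deprecated in newer pandas (1.3+) and later removed; assuming a recent pandas version, a sketch of the equivalent is to group on the index level explicitly:
df1 = df.join(df.stack().str.get_dummies().groupby(level=0).sum().add_prefix('n_'))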
Rather than use apply to loop over each row, I looped over each column to compute the sum for each letter:
for l in ['A', 'B', 'C', 'D']:
    df['n_' + l] = (df == l).sum(axis=1)
This seems to be an improvement in this example, but (from quick testing not shown) it can be roughly equal or worse depending on the shape and size of the data, and probably on how many strings you are looking for; see the sketch after the timings below for one way to check on a larger frame.
Some time comparisons:
%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
for l in ['A', 'B', 'C', 'D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And for the other answers here:
%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df1=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
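To check the caveat above about data shape and size, the same %%timeit cells can be rerun against a larger random frame; a minimal sketch (the 100,000-row size and the four candidate letters are arbitrary choices, not from the original answers):
import numpy as np
import pandas as pd

# A larger frame with the same structure: four string columns of random letters.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice(list('ABCD'), size=(100_000, 4)), columns=list('ABCD'))
Then the %%timeit cells above can be rerun unchanged against this df.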
Try get_dummies and sum with level; here we do not need stack :-)
df=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
Out[57]:
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
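As with the stack answer above, the level argument of sum is deprecated in newer pandas; a sketch of an equivalent (assuming a recent version) is to transpose, group the duplicated dummy column names, and transpose back:
dummies = pd.get_dummies(df, prefix='n', prefix_sep='_')
df = df.join(dummies.T.groupby(level=0).sum().T)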
You could do:
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
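Note that value_counts leaves NaN for letters that do not appear in a row, so after fillna(0) the counts are floats; if integer columns are needed, cast them (a small addition to the snippet above):
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0).astype(int)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)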