I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.
df1 = pd.DataFrame(np.ones((1000,5000)),dtype=int).stack()
df1 = pd.DataFrame(df1, columns = ['total'])
df2 = pd.concat([df1.iloc[50:55],df1.iloc[2000:2005]]) # df2 is tiny subset of df1
Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).
starttime = time.time()
df1.total.add(df2.total,fill_value=0).sum()
print "Method 1 took %f seconds" % (time.time() - starttime)
Manually iterating over rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent tries.
starttime = time.time()
result = df1.total.copy()
for row_index, row in df2.iterrows():
result[row_index] += row
print "Method 2 took %f seconds" % (time.time() - starttime)
The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.
Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?
Also, how do I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run is appreciably faster. How do I get this performance on the first time? Sorting using sort_index helps only marginally.
You can speed up the execution even faster by using another trick: making your pandas' dataframes lighter by using more efficent data types. As we know that df only contains integers from 1 to 10, we can then reduce the data type from 64 bits to 16 bits.
This solution also uses looping to get the job done, but apply has been optimized better than iterrows , which results in faster runtimes.
Try NumPy Instead. NumPy has all of the computation capabilities of Pandas, but uses pre-compiled, optimized methods. This mean NumPy can be significantly faster than Pandas. Converting a DataFrame from Pandas to NumPy is relatively straightforward.
NumPy provides n dimensional arrays, Data Type (dtype), etc. as objects. In the Series of Pandas, indexing is relatively slower compared to the Arrays in NumPy. The indexing of NumPy arrays is faster than that of the Pandas Series.
You don't need for loop:
df1.total[df2.index] += df2.total
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With