Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster alternative to Series.add function in pandas

Tags:

python

pandas

I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.

    df1 = pd.DataFrame(np.ones((1000,5000)),dtype=int).stack()
    df1 = pd.DataFrame(df1, columns = ['total'])
    df2 = pd.concat([df1.iloc[50:55],df1.iloc[2000:2005]])  # df2 is tiny subset of df1

Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).

    starttime = time.time()
    df1.total.add(df2.total,fill_value=0).sum()
    print "Method 1 took %f seconds" % (time.time() - starttime)

Manually iterating over rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent tries.

    starttime = time.time()
    result = df1.total.copy()
    for row_index, row in df2.iterrows():
        result[row_index] += row
    print "Method 2 took %f seconds" % (time.time() - starttime)

The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.

Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?

Also, how do I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run is appreciably faster. How do I get this performance on the first time? Sorting using sort_index helps only marginally.

like image 708
amball Avatar asked Nov 07 '13 22:11

amball


People also ask

How do I speed up pandas apply function?

You can speed up the execution even faster by using another trick: making your pandas' dataframes lighter by using more efficent data types. As we know that df only contains integers from 1 to 10, we can then reduce the data type from 64 bits to 16 bits.

Is Iterrows faster than apply?

This solution also uses looping to get the job done, but apply has been optimized better than iterrows , which results in faster runtimes.

What is faster than pandas DataFrame?

Try NumPy Instead. NumPy has all of the computation capabilities of Pandas, but uses pre-compiled, optimized methods. This mean NumPy can be significantly faster than Pandas. Converting a DataFrame from Pandas to NumPy is relatively straightforward.

Is NumPy array faster than pandas series for the same size?

NumPy provides n dimensional arrays, Data Type (dtype), etc. as objects. In the Series of Pandas, indexing is relatively slower compared to the Arrays in NumPy. The indexing of NumPy arrays is faster than that of the Pandas Series.


1 Answers

You don't need for loop:

df1.total[df2.index] += df2.total
like image 90
HYRY Avatar answered Sep 24 '22 00:09

HYRY