Faster way to sum all combinations of rows in dataframe

I have a dataframe of 10,000 rows, and I am trying to compute the sums for all possible combinations of those rows (every contiguous range of rows, as in the example below). According to my math, that's about 50 million combinations. I'll give a small example to simplify what my data looks like:

df = Ratio     Count     Score
     1         6         11
     2         7         12
     3         8         13
     4         9         14
     5         10        15

And here's the desired result:

results = Min Ratio     Max Ratio     Total Count     Total Score
          1             2             13              23
          1             3             21              36
          1             4             30              50
          1             5             40              65
          2             3             15              25
          2             4             24              39
          2             5             34              54
          3             4             17              27
          3             5             27              42
          4             5             19              29

This is the code that I came up with to complete the calculation:

for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(),'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(),'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1

This code works, but according to my estimates after running it for a few minutes, it would take 200 hours to complete. I understand that using numpy would be a lot faster, but I can't wrap my head around how to build multiple arrays to add together. (I think it would be easy if I was doing just 1+2, 2+3, 3+4, etc., but it's a lot harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to complete this calculation so it can run in a reasonable amount of time? Thank you!

P.S.: If you're wondering what I want to do with a 50 million-row dataframe, I don't actually need that in my final results. I'm ultimately looking to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, and then display the 1,000 highest Total Scores Per Total Count, along with each associated Min Ratio, Max Ratio, Total Count, and Total Score.

asked Apr 19 '21 11:04

mr7

3 Answers

After these improvements it takes ~2 minutes to run for 10k rows.

For the sum computation, you can pre-compute the cumulative sum (cumsum) and save it. sum(i to j) is equal to sum(0 to j) - sum(0 to i-1). Since sum(0 to j) is cumsum[j] and sum(0 to i-1) is cumsum[i-1], sum(i to j) = cumsum[j] - cumsum[i-1] (for i = 0 it is simply cumsum[j]). This is a significant improvement over recomputing the sum for each combination.
Operations on numpy arrays are faster than operations on pandas Series, so convert every column to a numpy array and do the computation on those.
(From other answers): Instead of appending to a list, initialise an empty numpy array of shape ((n*(n+1)//2) - n, 4) and use it to store the results.
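
As a quick check of the cumulative-sum identity in the first point, here is a minimal sketch using the Count column from the question. The names `padded` and `range_sum` are made up for illustration, and the leading zero is an assumption of this sketch (it removes the special case for i = 0); it is not part of the code below.

```python
import numpy as np

counts = np.array([6, 7, 8, 9, 10])  # the Count column from the question

# Prepend 0 so that padded[k] is the sum of the first k values
padded = np.concatenate(([0], np.cumsum(counts)))

def range_sum(i, j):
    """Sum of counts[i..j] (inclusive) using a single subtraction."""
    return padded[j + 1] - padded[i]

# Compare against direct summation for every pair i <= j
for i in range(len(counts)):
    for j in range(i, len(counts)):
        assert range_sum(i, j) == counts[i:j + 1].sum()

print(range_sum(0, 1))  # Ratio 1 to 2 -> 13
print(range_sum(1, 4))  # Ratio 2 to 5 -> 34
```

Each lookup costs one subtraction regardless of how long the range is, which is where the speed-up over calling sum() on every slice comes from.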

Use:

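import numpy as np
import pandas as pd

# Cumulative sums give any range sum with a single subtraction
count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values
n = len(df)
# Number of pairs (i, j) with i < j, i.e. n*(n-1)/2
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape=(rowInCombination, 4), dtype=int)
k = 0
for i in range(n):
    for j in range(i + 1, n):
        # Sum of rows i..j; for i == 0 the cumulative sum is already the answer
        arr[k, :] = [
            ratios[i],
            ratios[j],
            count_cumsum[j] - count_cumsum[i - 1] if i > 0 else count_cumsum[j],
            score_cumsum[j] - score_cumsum[i - 1] if i > 0 else score_cumsum[j],
        ]
        k = k + 1
# Columns are in the same order as the values stored in each row of arr
out = pd.DataFrame(arr, columns=['Min_Ratio', 'Max_Ratio',
                                 'Total_Count', 'Total_Score'])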

Input:

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5], 
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})

Output:

>>> out

   Min_Ratio  Max_Ratio  Total_Count  Total_Score
0          1          2           13           23
1          1          3           21           36
2          1          4           30           50
3          1          5           40           65
4          2          3           15           25
5          2          4           24           39
6          2          5           34           54
7          3          4           17           27
8          3          5           27           42
9          4          5           19           29

answered Oct 24 '22 07:10

Amit Vikram Singh

First of all, you can improve the algorithm. Then, you can speed up the computation using Numpy vectorization/broadcasting.

Here are the key points for improving the performance of the algorithm:

Pandas' append is slow because it creates a new dataframe on every call (and it was removed entirely in pandas 2.0). You should never use it in a costly loop. Instead, you can append the rows to a Python list or even write the items directly into a pre-allocated Numpy array.
Computing each partial sum takes O(n) time, whereas you can pre-compute the cumulative sums and then get any partial sum in constant time.
CPython loops are very slow, but the inner loop can be vectorized using Numpy thanks to broadcasting.

Here is the resulting code:

import numpy as np
import pandas as pd

def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0

    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    ratios = df['Ratio'].astype(int).to_numpy()  # plain array, so ratios[i] is positional
    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)

    for i in range(n):
        kStart, kEnd = k, k+(n-i)
        jStart, jEnd = i+1, n+1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd
    assert k == resRowCount

    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })

Note that this code gives the same results as the code in your question, but the original code does not give the expected results stated in the question: it also emits the single-row ranges (Min Ratio equal to Max Ratio), so it produces n*(n+1)/2 rows rather than the n*(n-1)/2 shown in the desired result. Note also that since the inputs are integers, I forced Numpy to use integers for the sake of performance (although the algorithm should work with floats too).

This code is hundreds of thousands of times faster than the original code on big dataframes, and it processes a dataframe of 10,000 rows in 0.7 seconds.
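
If only the pairs from the question's desired result are wanted (Min Ratio strictly less than Max Ratio), the remaining Python loop can also be dropped. The following is a sketch of a different technique, index generation with np.triu_indices, not the method used above; the function name `all_pairs` is made up for illustration.

```python
import numpy as np
import pandas as pd

def all_pairs(df):
    """Hypothetical helper: one output row per pair i < j, no Python-level loop."""
    n = len(df)
    ratios = df['Ratio'].to_numpy()
    # Zero-padded cumulative sums: cum[k] is the sum of the first k rows
    cum_counts = np.concatenate(([0], df['Count'].to_numpy().cumsum()))
    cum_scores = np.concatenate(([0], df['Score'].to_numpy().cumsum()))
    # All index pairs above the diagonal (i < j), in row-major order
    i, j = np.triu_indices(n, k=1)
    return pd.DataFrame({
        'Min Ratio': ratios[i],
        'Max Ratio': ratios[j],
        'Total Count': cum_counts[j + 1] - cum_counts[i],
        'Total Score': cum_scores[j + 1] - cum_scores[i],
    })

df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})
out = all_pairs(df)
print(out)  # 10 rows, in the same order as the desired result in the question
```

The trade-off is memory: for 10,000 input rows the two index arrays alone hold about 50 million integers each, so this version should only be used when that fits comfortably in RAM.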

answered Oct 24 '22 07:10

Jérôme Richard

Others have explained why your algorithm was so slow, so I won't dive into that.

Let's take a different approach to your problem. In particular, look at how the Total Count and Total Score columns are calculated:

Calculate the cumulative sum for every row from 1 to n
Calculate the cumulative sum for every row from 2 to n
...
Calculate the cumulative sum for every row from n to n

Since cumulative sums are accumulative, we only need to calculate them once, for row 1 to row n:

The cumsum of (2 to n) is the cumsum of (1 to n) - (row 1)
The cumsum of (3 to n) is the cumsum of (2 to n) - (row 2)
And so on...

In other words, the current cumsum is the previous cumsum minus its first row, then dropping the first row.
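
The recurrence can be checked on the Count column from the question. This is a small illustrative sketch; the variable names are made up for the example.

```python
import numpy as np

counts = np.array([6, 7, 8, 9, 10])  # the Count column from the question

prev = np.cumsum(counts)  # cumsum over rows 1..n: [6 13 21 30 40]
for start in range(1, len(counts)):
    # Subtract the previous block's first entry, then drop that entry.
    # The subtraction builds a new array, so prev itself is not modified.
    cur = prev[1:] - prev[0]
    # Same values as recomputing the cumsum from scratch on the shorter range
    assert np.array_equal(cur, np.cumsum(counts[start:]))
    prev = cur

print(prev)  # cumsum over the last row only -> [10]
```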

As you have theorized, pandas is a lot slower than numpy, so we will convert everything into numpy for speed:

import numpy as np
import pandas as pd

arr = df[['Ratio', 'Count', 'Score']].to_numpy() # Convert to numpy array

tmp = np.cumsum(arr[:, 1:3], axis=0)       # calculate cumsum for row 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1) # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1) # create the Max Ratio column

results2 = [tmp]
for i in range(1, len(arr)):
    prev = results2[-1]
    # Current cumsum is the previous cumsum without its first row, minus that
    # first row. The subtraction allocates a new array; an in-place `-=` on the
    # slice would also change the previous block, because slices are views.
    tmp = prev[1:] - prev[0]
    tmp[:, 0] = arr[i, 0]  # new Min Ratio
    tmp[:, 1] = arr[i:, 0] # new Max Ratio
    results2.append(tmp)

# Assemble the result
results2 = np.concatenate(results2)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])

In my test, this produced the results for a 10k-row dataframe in about 2 seconds.
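
For the final step described in the question's P.S. (the rows with the highest Total Score per Total Count), a partial sort avoids ordering all 50 million rows. This is a sketch under assumptions: `top_k_by_ratio` and the `Score Per Count` column are names invented here, every Total Count is assumed to be nonzero, and k is assumed to be at least 1.

```python
import numpy as np
import pandas as pd

def top_k_by_ratio(results, k):
    """Hypothetical helper: the k rows with the highest Total Score / Total Count."""
    per_count = results['Total Score'].to_numpy() / results['Total Count'].to_numpy()
    k = min(k, len(results))
    # argpartition leaves the indices of the k largest values in the last k slots (unordered)
    top = np.argpartition(per_count, len(per_count) - k)[-k:]
    # Sort only those k candidates, best first
    top = top[np.argsort(per_count[top])[::-1]]
    out = results.iloc[top].copy()
    out['Score Per Count'] = per_count[top]
    return out.reset_index(drop=True)

# The desired result table from the question
results = pd.DataFrame({
    'Min Ratio':   [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    'Max Ratio':   [2, 3, 4, 5, 3, 4, 5, 4, 5, 5],
    'Total Count': [13, 21, 30, 40, 15, 24, 34, 17, 27, 19],
    'Total Score': [23, 36, 50, 65, 25, 39, 54, 27, 42, 29],
})
top = top_k_by_ratio(results, 2)
print(top)  # the ranges 1..2 and 1..3, in that order
```

For the real data the call would be top_k_by_ratio(results, 1000). Rows with equal ratios may come back in either order.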

answered Oct 24 '22 07:10

Code Different

Tags: performance, python, pandas, numpy, combinations
