I have a dataframe that looks like this: <pre class="prettyprint"><code>from random import randint import pandas as pd df = pd.DataFrame({"ID": ["a", "b", "c", "d", "e", "f", "g"], "Size": [randint(0,9) for i in range(0,7)]}) df ID Size 0 a 4 1 b 3 2 c 0 3 d 2 4 e 9 5 f 5 6 g 3 </code></pre> And what I would like to obtain is this (could be a matrix as well): <pre class="prettyprint"><code>sums_df a b c d e f g a 8.0 7.0 4.0 6.0 13.0 9.0 7.0 b 7.0 6.0 3.0 5.0 12.0 8.0 6.0 c 4.0 3.0 0.0 2.0 9.0 5.0 3.0 d 6.0 5.0 2.0 4.0 11.0 7.0 5.0 e 13.0 12.0 9.0 11.0 18.0 14.0 12.0 f 9.0 8.0 5.0 7.0 14.0 10.0 8.0 g 7.0 6.0 3.0 5.0 12.0 8.0 6.0 </code></pre> That is, the sum of <code>Size</code> values for all possible pairs in <code>ID</code>. For now I have this simple but unefficient code: <pre class="prettyprint"><code>sums_df = pd.DataFrame() for i in range(len(df)): for j in range(len(df)): sums_df.loc[i,j] = df.Size[i] + df.Size[j] sums_df.index = list(df.ID) sums_df.columns = list(df.ID) </code></pre> It works fine for small examples like this, but for my actual data it gets too long and I am sure it is possible to avoid the nested <code>for</code> loops. Can you think of a better way to do this ? Thanks for any help !

use np.add.outer(): <pre class="prettyprint"><code>In [65]: pd.DataFrame(np.add.outer(df['Size'], df['Size']), columns=df['ID'].values, index=df['ID'].values) Out[65]: a b c d e f g a 8 7 4 6 13 9 7 b 7 6 3 5 12 8 6 c 4 3 0 2 9 5 3 d 6 5 2 4 11 7 5 e 13 12 9 11 18 14 12 f 9 8 5 7 14 10 8 g 7 6 3 5 12 8 6 </code></pre> UPDATE: memory-saving (Pandas Multi-Index) approach (NOTE: this approach is much slower, compared to the previous one): <pre class="prettyprint"><code>In [33]: r = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1), ...: index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))), ...: columns=['TotalSize'] ...: ) In [34]: r Out[34]: TotalSize a b 7 c 4 d 6 e 13 f 9 g 7 b c 3 d 5 e 12 f 8 g 6 c d 2 e 9 f 5 g 3 d e 11 f 7 g 5 e f 14 g 12 f g 8 </code></pre> It can be accessed as follows: <pre class="prettyprint"><code>In [41]: r.loc[('a','b')] Out[41]: TotalSize 7 Name: (a, b), dtype: int32 In [42]: r.loc[('a','b'), 'TotalSize'] Out[42]: 7 In [44]: r.loc[[('a','b'), ('c','d')], 'TotalSize'] Out[44]: a b 7 c d 2 Name: TotalSize, dtype: int32 In [43]: r.at[('a','b'), 'TotalSize'] Out[43]: 7 </code></pre> Memory usage comparison (DF shape: <code>7000x3</code>): <pre class="prettyprint"><code>In [65]: df = pd.concat([df] * 1000, ignore_index=True) In [66]: df.shape Out[66]: (7000, 2) In [67]: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']), ...: columns=df['ID'].values, ...: index=df['ID'].values) ...: In [68]: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1), ...: index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))), ...: columns=['TotalSize']) ...: In [69]: r1.memory_usage().sum()/r2.memory_usage().sum() Out[69]: 2.6685407829018244 </code></pre> Speed comparison (DF shape: <code>7000x3</code>): <pre class="prettyprint"><code>In [70]: %%timeit ...: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']), ...: columns=df['ID'].values, ...: index=df['ID'].values) ...: 180 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) In [71]: %%timeit ...: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1), ...: index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))), ...: columns=['TotalSize']) ...: 17 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) </code></pre>

Efficient way to sum all possible pairs

Tags:

python

pandas

numpy

I have a dataframe that looks like this:

from random import randint
import pandas as pd

df = pd.DataFrame({"ID": ["a", "b", "c", "d", "e", "f", "g"], 
                   "Size": [randint(0,9) for i in range(0,7)]})

df

  ID  Size
0  a     4
1  b     3
2  c     0
3  d     2
4  e     9
5  f     5
6  g     3

And what I would like to obtain is this (could be a matrix as well):

sums_df

      a     b    c     d     e     f     g
a   8.0   7.0  4.0   6.0  13.0   9.0   7.0
b   7.0   6.0  3.0   5.0  12.0   8.0   6.0
c   4.0   3.0  0.0   2.0   9.0   5.0   3.0
d   6.0   5.0  2.0   4.0  11.0   7.0   5.0
e  13.0  12.0  9.0  11.0  18.0  14.0  12.0
f   9.0   8.0  5.0   7.0  14.0  10.0   8.0
g   7.0   6.0  3.0   5.0  12.0   8.0   6.0

That is, the sum of Size values for all possible pairs in ID.

For now I have this simple but unefficient code:

sums_df = pd.DataFrame()

for i in range(len(df)):
    for j in range(len(df)):
        sums_df.loc[i,j] = df.Size[i] + df.Size[j]

sums_df.index = list(df.ID)
sums_df.columns = list(df.ID)

It works fine for small examples like this, but for my actual data it gets too long and I am sure it is possible to avoid the nested for loops. Can you think of a better way to do this ?

Thanks for any help !

888

asked Nov 26 '17 17:11

atonnerre

2 Answers

use np.add.outer():

In [65]: pd.DataFrame(np.add.outer(df['Size'], df['Size']),
                      columns=df['ID'].values,
                      index=df['ID'].values)
Out[65]:
    a   b  c   d   e   f   g
a   8   7  4   6  13   9   7
b   7   6  3   5  12   8   6
c   4   3  0   2   9   5   3
d   6   5  2   4  11   7   5
e  13  12  9  11  18  14  12
f   9   8  5   7  14  10   8
g   7   6  3   5  12   8   6

UPDATE: memory-saving (Pandas Multi-Index) approach (NOTE: this approach is much slower, compared to the previous one):

In [33]: r = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
    ...:                  index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
    ...:                  columns=['TotalSize']
    ...: )

In [34]: r
Out[34]:
     TotalSize
a b          7
  c          4
  d          6
  e         13
  f          9
  g          7
b c          3
  d          5
  e         12
  f          8
  g          6
c d          2
  e          9
  f          5
  g          3
d e         11
  f          7
  g          5
e f         14
  g         12
f g          8

It can be accessed as follows:

In [41]: r.loc[('a','b')]
Out[41]:
TotalSize    7
Name: (a, b), dtype: int32

In [42]: r.loc[('a','b'), 'TotalSize']
Out[42]: 7

In [44]: r.loc[[('a','b'), ('c','d')], 'TotalSize']
Out[44]:
a  b    7
c  d    2
Name: TotalSize, dtype: int32

In [43]: r.at[('a','b'), 'TotalSize']
Out[43]: 7

Memory usage comparison (DF shape: 7000x3):

In [65]: df = pd.concat([df] * 1000, ignore_index=True)

In [66]: df.shape
Out[66]: (7000, 2)

In [67]: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']),
    ...:                       columns=df['ID'].values,
    ...:                       index=df['ID'].values)
    ...:

In [68]: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
    ...:                  index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
    ...:                  columns=['TotalSize'])
    ...:

In [69]: r1.memory_usage().sum()/r2.memory_usage().sum()
Out[69]: 2.6685407829018244

Speed comparison (DF shape: 7000x3):

In [70]: %%timeit
    ...: r1 = pd.DataFrame(np.add.outer(df['Size'], df['Size']),
    ...:                       columns=df['ID'].values,
    ...:                       index=df['ID'].values)
    ...:
180 ms ± 2.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [71]: %%timeit
    ...: r2 = pd.DataFrame(np.array(list(combinations(df['Size'], 2))).sum(axis=1),
    ...:                  index=pd.MultiIndex.from_tuples(list(combinations(df['ID'], 2))),
    ...:                  columns=['TotalSize'])
    ...:
17 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

120

answered Oct 05 '22 01:10

MaxU - stop WAR against UA

Use Numpy's broadcasting

size = df.Size.values
ids = df.ID.values

pd.DataFrame(
    size[:, None] + size,
    ids, ids
)

    a   b  c   d   e   f   g
a   8   7  4   6  13   9   7
b   7   6  3   5  12   8   6
c   4   3  0   2   9   5   3
d   6   5  2   4  11   7   5
e  13  12  9  11  18  14  12
f   9   8  5   7  14  10   8
g   7   6  3   5  12   8   6

answered Oct 04 '22 23:10

piRSquared

Related questions
                            
                                Google Appengine sending emails: [Error] unauthorized sender
                            
                                Heroku/Django: No module named dj_database_url
                            
                                How to fetch process,thread name,levelname in customized python logger
                            
                                Polygon intersection error in Shapely: "shapely.geos.TopologicalError: The operation 'GEOSIntersection_r' produced a null geometry"
                            
                                convert String to MD5
                            
                                python: get all youtube video urls of a channel
                            
                                Vertical line not respecting min,max limits (matplotlib)
                            
                                How can I generate more colors on pie chart matplotlib
                            
                                Access denied using Py2exe
                            
                                how to send an array over a socket in python
                            
                                HoverTool for multiple data series in bokeh scatter plot
                            
                                Changing fig size with statsmodel
                            
                                cycle through multiple list using itertools.cycle()
                            
                                Plot equation showing a circle
                            
                                How to add common arguments to argparse subcommands? [duplicate]
                            
                                Ubuntu, how do you remove all Python 3 but not 2
                            
                                Create random numpy matrix of same size as another.
                            
                                Refresh Excel External Data with Python
                            
                                Vertical lines to points in scatter plot
                            
                                I set True=False and I can't undo it

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With