Let's say I have the following pandas DataFrame in Python3.x
import pandas as pd
dict1 = {'name':['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'], 'number':[42, 42, 42, 42, 42, 42, 42, 42, 42], 'count':[1, 2, 4, 5, 7, 1, 2, 5, 8]}
df = pd.DataFrame(dict1)
print(df)
## name number count
## 0 dog 42 1
## 1 dog 42 2
## 2 cat 42 4
## 3 cat 42 5
## 4 cat 42 7
## 5 bird 42 1
## 6 bird 42 2
## 7 bird 42 5
## 8 bird 42 8
Column counts
contains integers from 1 to 8. My goal is to populate an 8 by 8 zero matrix with the count of each combination "pair" given the unique category in column name
.
So, the combination pairs for dog
, cat
, and bird
are:
dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)
For each pair, I add +1
to the corresponding entry in the zero matrix.
This matrix will be symmetric, i.e. (n, m) = (m, n)
. The matrix given df
would be:
1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0
Note that (1,2)=(2,1)
has a count 2, from the dog
combination and the bird
combination.
(1) In order to do this, I think it would be best to create a list of "combinations tuples" given the pandas DataFrame.
That is, something like
list_combos = [(1, 2), (2, 1), (4, 5), (4, 7), (5, 7), (5, 4), (7, 4), (7, 5),
(1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8), (2, 1), (5, 1),
(8, 1), (5, 2), (8, 2), (8, 5)]
Given that the matrix is symmetric, perhaps it would be better to use:
list_combos2 = [(1, 2), (4, 5), (4, 7), (5, 7), (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)]
How could one calculate the permutations of entires in a pandas DataFrame, given the categorical value in 'names'?
(2) What would be the most algorithmically efficient (i.e. RAM) to populate this matrix, given the list of tuples?
I should be able to feed a list of tuples into a numpy array, but how does one fill in the zeros?
You can use groupby, iterate over combinations, and build your matrix like so:
import numpy as np
from itertools import combinations
mat = np.zeros((df['count'].max(), ) * 2)
idx = []
for _, g in df.groupby('name'):
idx.extend(combinations(g['count'] - 1, r=2))
np.add.at(mat, list(zip(*idx)), 1)
mat += mat.T
array([[0., 2., 0., 0., 1., 0., 0., 1.],
[2., 0., 0., 0., 1., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 1., 0.],
[1., 1., 0., 1., 0., 0., 1., 1.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 0., 0., 0.],
[1., 1., 0., 0., 1., 0., 0., 0.]])
There may be a faster solution, but this is the cleanest one I can think of.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With