I have a pandas dataframe, and I need to count, for each pair of unique entries, the number of rows in which both entries occur.
import pandas as pd
import numpy as np
The dataframe:
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])
ie:
+----+-----+-----+-----+-----+
| | a | b | c | d |
|----+-----+-----+-----+-----|
| 0 | A | B | C | B |
| 1 | A | C | A | D |
| 2 | B | B | C | B |
| 3 | B | B | A | A |
+----+-----+-----+-----+-----+
(Printed using this.)
I tried using the code from an answer, substituting these variables:
document = [list(each) for each in df.values]
names = list(np.unique(df.values))
It gave the wrong results:
A B C D
A 4 6 3 2
B 6 10 5 0
C 3 5 0 1
D 2 0 1 0
It is also based on iteration, so I would hope for a better solution. The desired output is:
+----+-----+-----+-----+-----+
| | A | B | C | D |
|----+-----+-----+-----+-----|
| A | nan | 2 | 2 | 1 |
| B | 2 | nan | 2 | 0 |
| C | 2 | 2 | nan | 1 |
| D | 1 | 0 | 1 | nan |
+----+-----+-----+-----+-----+
There are 2 rows where A & B both appear, so the value in the cell at row A, column B is 2.
There are 2 rows where A & C both appear, so the value in the cell at row A, column C is 2.
How can I get this row-wise co-occurrence matrix easily in pandas? It would be great if I didn't have to loop through the values.
(pandas.Categorical might be of some use, but I haven't managed to make it work yet.)
We can use stack, then get_dummies and dot, and then set the diagonal through .values:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s=df.stack().str.get_dummies().groupby(level=0).sum().ne(0).astype(int)
s=s.T.dot(s).astype(float)
np.fill_diagonal(s.values, np.nan)
s
Out[33]:
A B C D
A NaN 2.0 2.0 1.0
B 2.0 NaN 2.0 0.0
C 2.0 2.0 NaN 1.0
D 1.0 0.0 1.0 NaN
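For reference, here is a minimal sketch of the same stack / get_dummies / dot idea wrapped in a function. The name row_cooccurrence is made up for illustration and is not part of pandas. The sketch assumes the frame has no missing values, and it fills the diagonal on a NumPy copy, so it does not depend on s.values being writable.

```python
import numpy as np
import pandas as pd


def row_cooccurrence(df: pd.DataFrame) -> pd.DataFrame:
    """Count, for every pair of distinct values, the rows that contain both.

    Hypothetical helper for illustration; assumes df has no missing values.
    """
    # Long format: one entry per cell, indexed by (row label, column label).
    long = df.stack()
    # Row-by-value indicator table: 1 if the value occurs anywhere in the row.
    indicator = pd.get_dummies(long).groupby(level=0).max().astype(int)
    # Entry (x, y) of indicator.T @ indicator is the number of rows
    # in which both x and y occur.
    counts = indicator.T.dot(indicator).astype(float)
    # Blank the diagonal on a copy: a value paired with itself is not wanted.
    values = counts.to_numpy(copy=True)
    np.fill_diagonal(values, np.nan)
    return pd.DataFrame(values, index=counts.index, columns=counts.columns)


df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

result = row_cooccurrence(df)
print(result)
```

Taking max per row instead of sum followed by ne(0) gives the 0/1 indicator in one step, so a value repeated within a row (like B in row 0) is still counted once.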