Calculate intersection over union (Jaccard's index) in pandas dataframe

I have a dataframe like:

animal    ids
cat       1,3,4
dog       1,2,4
hamster   5        
dolphin   3,5

The dataframe is quite big, with over 80 thousand rows, and the ids column can easily contain thousands, even tens of thousands, of comma-separated ids. The ids within a given row are unique.

I would like to construct a dataframe that holds the Jaccard index for every pair of animals, i.e. the size of the intersection of their ids divided by the size of the union.

So if we look at cat and dog, the intersection is 2 (ids 1 and 4) and the union is 4 (ids 1, 2, 3, 4), hence the Jaccard index is 2/4 = 0.5. It would be great to have the result in this format:

            cat        dog        hamster    dolphin
cat         1          0.5        0          0.25
dog         0.5        1          0          0
hamster     0          0          1          0.5
dolphin     0.25       0          0.5        1

i.e. with the animal names as both the row index and the column labels, so that I can look up a Jaccard index quickly, like:

cat_dog_ji = df_new['cat']['dog']

asked Aug 22 '20 11:08

Ahmet Cetin

2 Answers

You can use str.get_dummies and some scipy tools here.

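import pandas as pd
from scipy.spatial import distance

# One 0/1 indicator column per distinct id
u = df["ids"].str.get_dummies(",")
# Condensed vector of pairwise Jaccard distances between rows
j = distance.pdist(u, "jaccard")
k = df["animal"].to_numpy()
# Similarity = 1 - distance, expanded to a square matrix
pd.DataFrame(1 - distance.squareform(j), index=k, columns=k)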

          cat  dog  hamster  dolphin
cat      1.00  0.5      0.0     0.25
dog      0.50  1.0      0.0     0.00
hamster  0.00  0.0      1.0     0.50
dolphin  0.25  0.0      0.5     1.00
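
To see what pdist is computing here, the same matrix can be sketched with plain NumPy: on the 0/1 indicator matrix, the dot product of two rows counts their shared ids, and the union size follows from the row sums. This is only an illustrative sketch, not part of the original answer; the sample frame is rebuilt from the question, and the names inter, sizes, union and jaccard_df are assumptions.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# One 0/1 column per distinct id, one row per animal
u = df["ids"].str.get_dummies(",").to_numpy()

inter = u @ u.T                                  # intersection size for every pair of rows
sizes = u.sum(axis=1)                            # number of ids in each row
union = sizes[:, None] + sizes[None, :] - inter  # union size = |A| + |B| - intersection

k = df["animal"].to_numpy()
jaccard_df = pd.DataFrame(inter / union, index=k, columns=k)
print(jaccard_df)
```

The sketch assumes every row has at least one id; a row with no ids would give a union of 0 and a division by zero.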

answered Oct 19 '22 17:10

user3483203

Use:

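import numpy as np
import pandas as pd

# Split ids into lists, then cross join the frame with itself on a constant key
d = df.assign(key=1, ids=df['ids'].str.split(','))
d = d.merge(d, on='key', suffixes=['', '_r'])

# Jaccard index per pair, then pivot into an animal-by-animal table
i = [np.intersect1d(*x).size / np.union1d(*x).size for x in zip(d['ids'], d['ids_r'])]
d = pd.crosstab(d['animal'], d['animal_r'], i, aggfunc='first').rename_axis(index=None, columns=None)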

Details:

Use DataFrame.assign to create a temporary column key and use Series.str.split on column ids. Then use DataFrame.merge to merge the dataframe d with itself on column key (essentially a cross join).

print(d)

     animal        ids  key animal_r      ids_r
0       cat  [1, 3, 4]    1      cat  [1, 3, 4]
1       cat  [1, 3, 4]    1      dog  [1, 2, 4]
2       cat  [1, 3, 4]    1  hamster        [5]
3       cat  [1, 3, 4]    1  dolphin     [3, 5]
4       dog  [1, 2, 4]    1      cat  [1, 3, 4]
5       dog  [1, 2, 4]    1      dog  [1, 2, 4]
6       dog  [1, 2, 4]    1  hamster        [5]
7       dog  [1, 2, 4]    1  dolphin     [3, 5]
8   hamster        [5]    1      cat  [1, 3, 4]
9   hamster        [5]    1      dog  [1, 2, 4]
10  hamster        [5]    1  hamster        [5]
11  hamster        [5]    1  dolphin     [3, 5]
12  dolphin     [3, 5]    1      cat  [1, 3, 4]
13  dolphin     [3, 5]    1      dog  [1, 2, 4]
14  dolphin     [3, 5]    1  hamster        [5]
15  dolphin     [3, 5]    1  dolphin     [3, 5]
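
On pandas 1.2 or later, the temporary key column may not be needed, since DataFrame.merge accepts how="cross". A minimal sketch, assuming the sample frame from the question (the name pairs is only for illustration):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# Split the comma-separated ids into lists
d = df.assign(ids=df["ids"].str.split(","))

# Cross join: every row paired with every row, no helper key column required
pairs = d.merge(d, how="cross", suffixes=("", "_r"))
print(pairs.shape)
```

Either way the result has one row per ordered pair, so with 80 thousand rows the joined frame would have 6.4 billion rows, which is worth keeping in mind before applying this to the full dataset.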

Use np.intersect1d along with np.union1d inside a list comprehension to calculate the Jaccard index for each pair.

print(i)
[1.0, 0.5, 0.0, 0.25, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.25, 0.0, 0.5, 1.0]
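
For a single pair, the computation inside the list comprehension can be sketched as a small helper. The function name jaccard and the choice to return 0.0 for two empty collections are assumptions made for illustration, not part of the answer above.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index of two collections of ids (hypothetical helper)."""
    union = np.union1d(a, b).size
    # Two empty collections have no defined index; 0.0 is returned here by convention
    return np.intersect1d(a, b).size / union if union else 0.0

# cat vs dog from the question: 2 shared ids out of 4 distinct ids
print(jaccard(["1", "3", "4"], ["1", "2", "4"]))
```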

Finally we use pd.crosstab to create a simple cross tabulation to get the result in desired format:

print(d)
          cat  dog  dolphin  hamster
cat      1.00  0.5     0.25      0.0
dog      0.50  1.0     0.00      0.0
dolphin  0.25  0.0     1.00      0.5
hamster  0.00  0.0     0.50      1.0
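
Note that pd.crosstab sorts the labels, so dolphin comes before hamster in this output. If the row order from the question matters, one option is to reindex the result. A small sketch, where result stands in for the crosstab output shown above and order is an assumed name:

```python
import pandas as pd

# Stand-in for the crosstab result shown above (labels sorted alphabetically)
labels = ["cat", "dog", "dolphin", "hamster"]
result = pd.DataFrame(
    [[1.00, 0.5, 0.25, 0.0],
     [0.50, 1.0, 0.00, 0.0],
     [0.25, 0.0, 1.00, 0.5],
     [0.00, 0.0, 0.50, 1.0]],
    index=labels, columns=labels,
)

# Restore the order in which the animals appear in the original dataframe
order = ["cat", "dog", "hamster", "dolphin"]
result = result.reindex(index=order, columns=order)
print(result)
```

With the real data, order could be taken from the original animal column, assuming the animal names are unique.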

answered Oct 19 '22 15:10

Shubham Sharma

Tags: python, pandas, dataframe, numpy, scikit-learn