Calculate intersection over union (Jaccard's index) in pandas dataframe

I have a dataframe like:

animal    ids
cat       1,3,4
dog       1,2,4
hamster   5        
dolphin   3,5

The dataframe is quite big, with over 80 thousand rows, and the ids column can easily contain thousands, even tens of thousands, of comma-separated ids per row. The ids within a given row's comma-separated string are unique.

I would like to construct a dataframe that holds the Jaccard index for every pair of animals, i.e. the size of the intersection of their ids divided by the size of the union of their ids.

So if we look at cat and dog, the intersection has 2 ids (1 and 4) and the union has 4 ids (1, 2, 3, 4), hence the Jaccard index is 2/4 = 0.5. It would be great to have the dataset in this format:

            cat        dog        hamster    dolphin
cat         1          0.5        0          0.25
dog         0.5        1          0          0
hamster     0          0          1          0.5
dolphin     0.25       0          0.5        1

which means using the animal name as the row index, so that I can quickly look up the Jaccard index for a pair like:

cat_dog_ji = df_new['cat']['dog']
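
As a quick sanity check of the numbers in the table above, the index for a single pair can be computed with plain Python sets. This is only a minimal sketch; the helper name jaccard is made up for illustration and is not part of pandas.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of two comma-separated id strings (illustrative helper)."""
    sa, sb = set(a.split(",")), set(b.split(","))
    # |intersection| / |union|
    return len(sa & sb) / len(sa | sb)


cat_dog = jaccard("1,3,4", "1,2,4")      # 2 shared ids out of 4 distinct ids -> 0.5
cat_dolphin = jaccard("1,3,4", "3,5")    # 1 shared id out of 4 distinct ids -> 0.25
```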
asked Aug 22 '20 11:08 by Ahmet Cetin


2 Answers

You can use str.get_dummies and some scipy tools here.


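import pandas as pd
from scipy.spatial import distance

# One indicator column per id: 1 if the animal has that id, else 0
u = df["ids"].str.get_dummies(",")
# Condensed vector of pairwise Jaccard distances between the boolean rows
j = distance.pdist(u.to_numpy(dtype=bool), "jaccard")
k = df["animal"].to_numpy()
# Jaccard index (similarity) = 1 - Jaccard distance
pd.DataFrame(1 - distance.squareform(j), index=k, columns=k)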

          cat  dog  hamster  dolphin
cat      1.00  0.5      0.0     0.25
dog      0.50  1.0      0.0     0.00
hamster  0.00  0.0      1.0     0.50
dolphin  0.25  0.0      0.5     1.00
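
For the sizes mentioned in the question, the dense indicator frame from get_dummies may not fit in memory. One possible alternative, sketched here under the assumption that ids are unique within each row (as the question states), is to build a sparse 0/1 matrix and get all intersection sizes from a single matrix product. The names X, inter, sizes and result are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# One entry per (row, id) pair; the index after explode() is the row position
ids = df["ids"].str.split(",").reset_index(drop=True).explode()
rows = ids.index.to_numpy()
cols, uniques = pd.factorize(ids)  # integer code per distinct id

# Sparse indicator matrix: X[r, c] == 1 if row r contains id c
data = np.ones(len(ids), dtype=np.int32)
X = sparse.csr_matrix((data, (rows, cols)), shape=(len(df), len(uniques)))

# Intersection sizes for all pairs, then |A| + |B| - |A ∩ B| for the unions
inter = (X @ X.T).toarray()
sizes = np.asarray(X.sum(axis=1)).ravel()
union = sizes[:, None] + sizes[None, :] - inter

names = df["animal"].to_numpy()
result = pd.DataFrame(inter / union, index=names, columns=names)
```

Note that the final n × n result is still dense, so for 80 thousand rows it is large on its own; computing it in blocks of rows would be one way around that.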
answered Oct 19 '22 17:10 by user3483203


Use:

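import numpy as np
import pandas as pd

# Cross join: pair every animal with every animal via a constant key
d = df.assign(key=1, ids=df['ids'].str.split(','))
d = d.merge(d, on='key', suffixes=['', '_r'])

# Jaccard index for each pair: |intersection| / |union|
i = [np.intersect1d(*x).size / np.union1d(*x).size for x in zip(d['ids'], d['ids_r'])]
d = pd.crosstab(d['animal'], d['animal_r'], i, aggfunc='first').rename_axis(index=None, columns=None)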

Details:

Use DataFrame.assign to create a temporary column key and use Series.str.split on column ids to turn each string into a list. Then use DataFrame.merge to merge the dataframe d with itself on column key (essentially a cross join).

print(d)

     animal        ids  key animal_r      ids_r
0       cat  [1, 3, 4]    1      cat  [1, 3, 4]
1       cat  [1, 3, 4]    1      dog  [1, 2, 4]
2       cat  [1, 3, 4]    1  hamster        [5]
3       cat  [1, 3, 4]    1  dolphin     [3, 5]
4       dog  [1, 2, 4]    1      cat  [1, 3, 4]
5       dog  [1, 2, 4]    1      dog  [1, 2, 4]
6       dog  [1, 2, 4]    1  hamster        [5]
7       dog  [1, 2, 4]    1  dolphin     [3, 5]
8   hamster        [5]    1      cat  [1, 3, 4]
9   hamster        [5]    1      dog  [1, 2, 4]
10  hamster        [5]    1  hamster        [5]
11  hamster        [5]    1  dolphin     [3, 5]
12  dolphin     [3, 5]    1      cat  [1, 3, 4]
13  dolphin     [3, 5]    1      dog  [1, 2, 4]
14  dolphin     [3, 5]    1  hamster        [5]
15  dolphin     [3, 5]    1  dolphin     [3, 5]
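
On pandas 1.2 or newer, the same cross join can be written without the temporary key column by passing how="cross" to merge. A minimal sketch, with the sample data from the question rebuilt inline and pairs as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# Split the id strings into lists, then pair every row with every row
d = df.assign(ids=df["ids"].str.split(","))
pairs = d.merge(d, how="cross", suffixes=("", "_r"))
```

The result has the same 16 rows as the table above, just without the key column.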

Use np.intersect1d along with np.union1d inside a list comprehension to calculate the Jaccard index for each pair.

print(i)
[1.0, 0.5, 0.0, 0.25, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.25, 0.0, 0.5, 1.0]
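
Since the matrix is symmetric and its diagonal is always 1, it may be enough to compute each unordered pair once. The following is a rough sketch of that idea using Python sets and itertools.combinations instead of the cross join; the names sets, m and df_new are illustrative, and it still does a quadratic number of set operations.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "hamster", "dolphin"],
    "ids": ["1,3,4", "1,2,4", "5", "3,5"],
})

# Parse each comma-separated string into a set once, up front
sets = [set(s.split(",")) for s in df["ids"]]
n = len(sets)

# Start from the identity: every animal has index 1 with itself
m = np.eye(n)
for a, b in combinations(range(n), 2):
    inter = len(sets[a] & sets[b])
    # |A ∪ B| = |A| + |B| - |A ∩ B|; fill both halves of the matrix
    m[a, b] = m[b, a] = inter / (len(sets[a]) + len(sets[b]) - inter)

names = df["animal"].to_numpy()
df_new = pd.DataFrame(m, index=names, columns=names)
cat_dog_ji = df_new["cat"]["dog"]
```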

Finally we use pd.crosstab to create a simple cross tabulation to get the result in desired format:

print(d)
          cat  dog  dolphin  hamster
cat      1.00  0.5     0.25      0.0
dog      0.50  1.0     0.00      0.0
dolphin  0.25  0.0     1.00      0.5
hamster  0.00  0.0     0.50      1.0
answered Oct 19 '22 15:10 by Shubham Sharma