I have a dataframe like:
animal   ids
cat      1,3,4
dog      1,2,4
hamster  5
dolphin  3,5
The dataframe is quite big, with over 80 thousand rows, and the ids column can easily contain thousands, even tens of thousands, of comma-separated ids. The ids within a given row are unique.
I would like to construct a dataframe that calculates the Jaccard index for every pair of animals, i.e. the size of the intersection of their ids divided by the size of the union.
So if we look at cat and dog, the intersection is 2 (ids 1 and 4) and the union is 4 (ids 1, 2, 3, 4), hence the Jaccard index is 2/4 = 0.5. It would be great to have the result in this format:
         cat   dog  hamster  dolphin
cat      1     0.5  0        0.25
dog      0.5   1    0        0
hamster  0     0    1        0.5
dolphin  0.25  0    0.5      1
which means using the animal names as the row and column index, so that I can look up the related Jaccard index quickly, like:
cat_dog_ji = df_new['cat']['dog']
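For clarity, the value for a single pair is just set arithmetic over the ids; a quick sketch for the cat/dog pair using plain Python sets:

cat = {1, 3, 4}
dog = {1, 2, 4}
ji = len(cat & dog) / len(cat | dog)  # intersection 2, union 4 -> 0.5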
Developed by Paul Jaccard, the index ranges from 0 to 1: the closer to 1, the more similar the two sets. If two sets share exactly the same members, their Jaccard similarity index is 1; if they have no members in common, it is 0.
The Jaccard distance (one minus the Jaccard index) is commonly used to calculate an n × n matrix for clustering and multidimensional scaling of n sample sets; it is a metric on the collection of all finite sets.
You can use str.get_dummies and some scipy tools here.
import pandas as pd
from scipy.spatial import distance

# one-hot encode the comma-separated ids (one column per distinct id)
u = df["ids"].str.get_dummies(",")
# condensed pairwise Jaccard *distances*
j = distance.pdist(u, "jaccard")
k = df["animal"].to_numpy()
# square matrix of similarities = 1 - distance
pd.DataFrame(1 - distance.squareform(j), index=k, columns=k)
         cat   dog  hamster  dolphin
cat      1.00  0.5  0.0      0.25
dog      0.50  1.0  0.0      0.00
hamster  0.00  0.0  1.0      0.50
dolphin  0.25  0.0  0.5      1.00
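Here str.get_dummies(",") turns each comma-separated string into a one-hot row over all distinct ids, and pdist(..., "jaccard") returns Jaccard distances, which is why the result is wrapped in 1 - distance.squareform(j). A rough sketch of the intermediate frame u for the sample data (assuming the columns come out as the sorted id strings):

print(u)
   1  2  3  4  5
0  1  0  1  1  0
1  1  1  0  1  0
2  0  0  0  0  1
3  0  0  1  0  1

One caveat at the question's scale: with ~80 thousand rows and potentially tens of thousands of distinct ids, this dense 0/1 matrix can get large, so memory is worth checking before running it on the full data.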
Use:
import numpy as np
import pandas as pd

# split ids into lists and add a constant key for a self cross join
d = df.assign(key=1, ids=df['ids'].str.split(','))
d = d.merge(d, on='key', suffixes=['', '_r'])
# Jaccard index for every row pair, then pivot into an animal x animal matrix
i = [np.intersect1d(*x).size / np.union1d(*x).size for x in zip(d['ids'], d['ids_r'])]
d = pd.crosstab(d['animal'], d['animal_r'], i, aggfunc='first').rename_axis(index=None, columns=None)
Details:
Use DataFrame.assign to create a temporary column key and use Series.str.split on column ids. Then use DataFrame.merge to merge the dataframe d with itself based on column key (essentially a cross join).
print(d)
animal ids key animal_r ids_r
0 cat [1, 3, 4] 1 cat [1, 3, 4]
1 cat [1, 3, 4] 1 dog [1, 2, 4]
2 cat [1, 3, 4] 1 hamster [5]
3 cat [1, 3, 4] 1 dolphin [3, 5]
4 dog [1, 2, 4] 1 cat [1, 3, 4]
5 dog [1, 2, 4] 1 dog [1, 2, 4]
6 dog [1, 2, 4] 1 hamster [5]
7 dog [1, 2, 4] 1 dolphin [3, 5]
8 hamster [5] 1 cat [1, 3, 4]
9 hamster [5] 1 dog [1, 2, 4]
10 hamster [5] 1 hamster [5]
11 hamster [5] 1 dolphin [3, 5]
12 dolphin [3, 5] 1 cat [1, 3, 4]
13 dolphin [3, 5] 1 dog [1, 2, 4]
14 dolphin [3, 5] 1 hamster [5]
15 dolphin [3, 5] 1 dolphin [3, 5]
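If you are on pandas 1.2 or newer, the temporary key column is not strictly needed, since DataFrame.merge supports a cross join directly; a small sketch of that variant:

d = df.assign(ids=df['ids'].str.split(','))
d = d.merge(d, how='cross', suffixes=['', '_r'])

The rest of the computation stays the same.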
Use np.intersect1d along with np.union1d inside a list comprehension to calculate the Jaccard index:
print(i)
[1.0, 0.5, 0.0, 0.25, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.25, 0.0, 0.5, 1.0]
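Because the ids in each row are already unique, the same numbers can also be computed with plain Python sets, which avoids the sorting that np.intersect1d/np.union1d perform on every pair; a hedged sketch of that variant:

i = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in zip(d['ids'], d['ids_r'])]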
Finally we use pd.crosstab to create a simple cross tabulation and get the result in the desired format:
print(d)
         cat   dog  dolphin  hamster
cat      1.00  0.5  0.25     0.0
dog      0.50  1.0  0.00     0.0
dolphin  0.25  0.0  1.00     0.5
hamster  0.00  0.0  0.50     1.0
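With the animal names on both axes, the lookup from the question works directly; .loc is the more explicit spelling:

cat_dog_ji = d['cat']['dog']       # 0.5 (column 'cat', row 'dog')
cat_dog_ji = d.loc['dog', 'cat']   # same value via .loc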