Python Pandas Distance matrix using jaccard similarity

Tags: python, pandas, matrix, scipy

I have implemented a function to construct a distance matrix using the Jaccard similarity:

import pandas as pd
entries = [
    {'id':'1', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'2', 'category1':'100', 'category2': '0', 'category3':'100'},
    {'id':'3', 'category1':'0', 'category2': '100', 'category3':'100'},
    {'id':'4', 'category1':'100', 'category2': '100', 'category3':'100'},
    {'id':'5', 'category1':'100', 'category2': '0', 'category3':'100'}
]
df = pd.DataFrame(entries)

and computed the distance matrix with SciPy:

from scipy.spatial.distance import pdist, squareform

# pdist returns the condensed pairwise distances;
# squareform expands them into a square matrix
res = pdist(df[['category1','category2','category3']], 'jaccard')
distance = pd.DataFrame(squareform(res), index=df.index, columns=df.index)

The problem is that my result looks like this, which seems wrong:

[Image: the resulting distance matrix, https://i.stack.imgur.com/wPsJW.png]

What am I missing? For example, the similarity between rows 0 and 1 should be maximal, and the other values seem wrong too.

asked Feb 25 '16 22:02

J-H

1 Answer

According to the docs, jaccard in scipy.spatial.distance computes the Jaccard dissimilarity, not the similarity. This is the usual convention when Jaccard is used as a distance: to be a metric, the distance between identical points must be zero.

In your data, rows 0 and 1 are identical, so their dissimilarity should be minimal (zero), which it is. The other values are also correct when read as dissimilarities.
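To see where the numbers come from, here is a minimal sketch that computes the Jaccard dissimilarity by hand for a few rows and compares it with SciPy. It assumes the category values are read as booleans (nonzero means "present"); the helper name jaccard_dissimilarity and the variables row0, row2, row3 are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import jaccard


def jaccard_dissimilarity(u, v):
    """Hypothetical helper: among positions where at least one vector
    is True, return the fraction where the two vectors disagree."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    either = np.logical_or(u, v).sum()   # positions set in u or v
    differ = np.logical_xor(u, v).sum()  # positions set in exactly one
    return differ / either if either else 0.0


# Rows from the question, with '100' read as True and '0' as False
row0 = np.array([True, False, True])   # id 1
row2 = np.array([False, True, True])   # id 3
row3 = np.array([True, True, True])    # id 4

d00 = jaccard_dissimilarity(row0, row0)  # identical rows
d02 = jaccard_dissimilarity(row0, row2)  # differ in two of three positions
d03 = jaccard_dissimilarity(row0, row3)  # differ in one of three positions

print(d00, d02, d03)
print(jaccard(row0, row2), jaccard(row0, row3))  # SciPy's values for comparison
```

Identical rows give 0, and the more positions two rows disagree on, the closer the value gets to 1, which is the opposite direction from a similarity score.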

If you want similarity instead of dissimilarity, just subtract the dissimilarity from 1.

res = 1 - pdist(df[['category1','category2','category3']], 'jaccard')
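One thing to watch if you need the full square matrix: squareform fills the diagonal with zeros, so expanding a condensed similarity vector would give each row a self-similarity of 0. A minimal sketch that avoids this by subtracting after the expansion is shown here. It assumes the category strings can be cast to integers and that any nonzero value means "present"; the names cols, ids, and similarity are chosen for illustration.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

entries = [
    {'id': '1', 'category1': '100', 'category2': '0', 'category3': '100'},
    {'id': '2', 'category1': '100', 'category2': '0', 'category3': '100'},
    {'id': '3', 'category1': '0', 'category2': '100', 'category3': '100'},
    {'id': '4', 'category1': '100', 'category2': '100', 'category3': '100'},
    {'id': '5', 'category1': '100', 'category2': '0', 'category3': '100'},
]
df = pd.DataFrame(entries)

cols = ['category1', 'category2', 'category3']
ids = df['id'].tolist()

# Assumption: values are numeric strings and nonzero means "present"
X = df[cols].astype(int).astype(bool).to_numpy()

# Square dissimilarity matrix (zeros on the diagonal)
dissimilarity = squareform(pdist(X, metric='jaccard'))

# Subtract after squareform so the diagonal becomes 1, not 0
similarity = pd.DataFrame(1 - dissimilarity, index=ids, columns=ids)
print(similarity)
```

Labelling the rows and columns with the id column instead of df.index is only a readability choice; the values are the same either way.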

answered Sep 19 '22 15:09

root
