I am using TF/IDF to calculate similarity. For example, suppose I have the following two documents:
Doc A => cat dog
Doc B => dog sparrow
I would expect their similarity to be 50%, but when I calculate TF/IDF I get the following:
Tf values for Doc A
dog tf = 0.5
cat tf = 0.5
Tf values for Doc B
dog tf = 0.5
sparrow tf = 0.5
IDF values for Doc A
dog idf = -0.4055
cat idf = 0
IDF values for Doc B
dog idf = -0.4055 (without the +1 formula: 0.6931)
sparrow idf = 0
TF/IDF value for Doc A
0.5 x (-0.4055) + 0.5 x 0 = -0.20275
TF/IDF value for Doc B
0.5 x (-0.4055) + 0.5 x 0 = -0.20275
Now it looks like the similarity is -0.20275. Is that right? Am I missing something, or is there a further step? Please tell me so I can calculate that too.
I used the TF/IDF formula given on Wikipedia.
Let's see if I get your question: You want to calculate the TF/IDF similarity between the two documents:
Doc A: cat dog
and
Doc B: dog sparrow
I take it that this is your whole corpus. Therefore |D| = 2
Tfs are indeed 0.5 for all words.
To calculate the IDF of 'dog', take log(|D| / |{d : dog in d}|) = log(2/2) = 0.
Similarly, the IDFs of 'cat' and 'sparrow' are log(2/1) = log(2) = 1
(I use 2 as the log base to make this easier).
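The IDF step above can be sketched in Python as follows. This is only an illustration: the `corpus` dictionary and the `idf` helper are names assumed here, the documents are tokenized on whitespace, and the unsmoothed base-2 formula from this answer is used.

```python
import math

# The two-document corpus from the question, tokenized on whitespace.
corpus = {"A": "cat dog".split(), "B": "dog sparrow".split()}

def idf(term, docs):
    """Unsmoothed IDF: log2(|D| / number of documents containing the term)."""
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log2(len(docs) / containing)

# 'dog' occurs in both documents, 'cat' and 'sparrow' in one each.
idf_values = {term: idf(term, corpus) for term in ("cat", "dog", "sparrow")}
print(idf_values)
```

With a +1 in the denominator and the natural log, 'dog' would instead give ln(2/3), roughly -0.4055, which appears to be where the negative value in the question comes from.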
Therefore, the TF/IDF value for 'dog' is 0.5 * 0 = 0, and the TF/IDF value for both 'cat' and 'sparrow' is 0.5 * 1 = 0.5.
To measure the similarity between the two documents, calculate the cosine between their vectors in the (cat, sparrow, dog) space: Doc A is (0.5, 0, 0) and Doc B is (0, 0.5, 0). Their dot product is 0, so the cosine similarity is 0.
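A minimal sketch of the full calculation in Python, under the same assumptions as this answer (the two documents are the whole corpus, TF is the term count divided by document length, IDF is unsmoothed with log base 2). The helper names `tf`, `idf`, `tfidf_vector` and `cosine` are chosen here for illustration.

```python
import math

docs = {"A": ["cat", "dog"], "B": ["dog", "sparrow"]}
vocab = ["cat", "sparrow", "dog"]  # same axis order as in the answer

def tf(term, tokens):
    # Term frequency: occurrences of the term divided by document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Unsmoothed IDF with log base 2, computed over the whole corpus.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log2(len(docs) / df)

def tfidf_vector(tokens):
    # One TF-IDF weight per vocabulary term, in vocab order.
    return [tf(term, tokens) * idf(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

vec_a = tfidf_vector(docs["A"])
vec_b = tfidf_vector(docs["B"])
similarity = cosine(vec_a, vec_b)
print(vec_a, vec_b, similarity)
```

The only shared word, 'dog', appears in every document, so its IDF is 0 and it contributes nothing to either vector. That is why the similarity comes out as 0 rather than 50% on a corpus this small.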
To sum it up: