Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Computation of Cosine Similarity

Suppose you have a table in a database constructed as follows:

create table data (v int, base int, w_td float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);    
insert into data values (1234,4,3);

To be clear select * from data should output:

v   |base|w_td
--------------
99  |1   |4.0
99  |2   |3.0
99  |3   |4.0
1234|2   |5.0
1234|3   |2.0
1234|4   |3.0

Note that since the vectors are stored in a database, we need only store the non-zero entries. In this example, we only have two vectors $v_{99} = (4,3,4,0)$ and $v_{1234} = (0,5,2,3)$ both in $\mathbb{R}^4$.

The cosine similarity of those vectors should be $\displaystyle \frac{23}{\sqrt{41 \cdot 38}} = 0.5826987807288609$.

How do you compute the cosine similarity using nearly only SQL?

I say nearly because you will need the sqrt function which is not always provided in basic SQL implementations, for example it is not in sqlite3!

like image 924
tipanverella Avatar asked Feb 18 '17 03:02

tipanverella


People also ask

How cosine similarity is calculated?

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.

How do you use cosine similarity for classification?

Cosine similarity is mostly used with vectors produced by word embeddings. If you are using something like Doc2Vec, then you get a vector for the whole document. These vectors could be categorized by using cosine similarity. In your case, you should try a LSTM text classifier using Embedding layers.


1 Answers

with norms as (
    select v,
        sum(w_td * w_td) as w2
    from data
    group by v
)
select 
    x.v as ego,y.v as v,nx.w2 as x2, ny.w2 as y2,
    sum(x.w_td * y.w_td) as innerproduct,
    sum(x.w_td * y.w_td) / sqrt(nx.w2 * ny.w2) as cosinesimilarity
from data as x
join data as y
    on (x.base=y.base)
join norms as nx
    on (nx.v=x.v)
join norms as ny
    on (ny.v=y.v)
where x.v < y.v
group by 1,2,3,4
order by 6 desc

yields

ego|v   |x2  |y2  |innerproduct|cosinesimilarity
--------------------------------------------------
99 |1234|41.0|38.0|23.0        |0.5826987807288609
like image 124
tipanverella Avatar answered Jan 11 '23 23:01

tipanverella