Suppose you have a table in a database constructed as follows:
create table data (v int, base int, w_td float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);
insert into data values (1234,4,3);
To be clear select * from data
should output:
v |base|w_td
--------------
99 |1 |4.0
99 |2 |3.0
99 |3 |4.0
1234|2 |5.0
1234|3 |2.0
1234|4 |3.0
Note that since the vectors are stored in a database, we need only store the non-zero entries. In this example, we only have two vectors $v_{99} = (4,3,4,0)$ and $v_{1234} = (0,5,2,3)$ both in $\mathbb{R}^4$.
The cosine similarity of those vectors should be $\displaystyle \frac{23}{\sqrt{41 \cdot 38}} = 0.5826987807288609$.
How do you compute the cosine similarity using nearly only SQL
?
I say nearly because you will need the sqrt
function which is not always provided in basic SQL
implementations, for example it is not in sqlite3
!
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
Cosine similarity is mostly used with vectors produced by word embeddings. If you are using something like Doc2Vec, then you get a vector for the whole document. These vectors could be categorized by using cosine similarity. In your case, you should try a LSTM text classifier using Embedding layers.
with norms as (
select v,
sum(w_td * w_td) as w2
from data
group by v
)
select
x.v as ego,y.v as v,nx.w2 as x2, ny.w2 as y2,
sum(x.w_td * y.w_td) as innerproduct,
sum(x.w_td * y.w_td) / sqrt(nx.w2 * ny.w2) as cosinesimilarity
from data as x
join data as y
on (x.base=y.base)
join norms as nx
on (nx.v=x.v)
join norms as ny
on (ny.v=y.v)
where x.v < y.v
group by 1,2,3,4
order by 6 desc
yields
ego|v |x2 |y2 |innerproduct|cosinesimilarity
--------------------------------------------------
99 |1234|41.0|38.0|23.0 |0.5826987807288609
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With