I am wondering if there is a way to get cosine distance of two vectors in postgres. For storing vectors I am using CUBE data type.
Below is my table definition:
test=# \d vectors
Table "public.vectors"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+-------------------------------------
id | integer | | not null | nextval('vectors_id_seq'::regclass)
vector | cube | | |
Also, sample data is given below:
test=# select * from vectors order by id desc limit 2;
id | vector
---------+------------------------------------------
2000000 | (109, 568, 787, 938, 948, 126, 271, 499)
1999999 | (139, 365, 222, 653, 313, 103, 215, 796)
I actually can write my own PLPGSql function for this, but wanted to avoid this as it might not be efficient.
The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together. The smaller the angle, higher the cosine similarity.
In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies cannot be negative. This remains true when using tf–idf weights. The angle between two term frequency vectors cannot be greater than 90°.
However, the Euclidean distance measure will be more effective and it indicates that A' is more closer (similar) to B' than C'. As can be seen from the above output, the Cosine similarity measure was same but the Euclidean distance suggests points A and B are closer to each other and hence similar to each other.
About your table
First of all, I believe you should change your data type to plain array.
CREATE TABLE public.vector (
id serial NOT NULL,
vctor double precision [3] --for three dimensional vectors; of course you can change the dimension or leave it unbounded if you need it.
);
INSERT INTO public.vector (vctor) VALUES (ARRAY[2,3,4]);
INSERT INTO public.vector (vctor) VALUES (ARRAY[3,4,5]);
So
SELECT * FROM public.vector;
Will result in the following data
id | vctor
------|---------
1 | {2,3,4}
2 | {3,4,5}
Maybe not the answer you expected but consider this
As you may know already, calculating the cosine between the vectors involves calculating the magnitudes. I don't think the problem is the algorithm but the implementation; it requires calculating squares and square roots that is expensive for a RDBMS.
Now, talking about efficiency; the server process does not take the load when calling mathematical functions. In PostgreSQL, the mathematical functions (look here) run from the C library so they are pretty efficient. However, in the end, the host has to assign some resources to make these calculations.
I would indeed think carefully before implementing these rather costly operations inside the server. But there is not a right answer; it depends on how you are using the database. For example if it is a production database with thousands of concurrent users, I would move this kind of calculation elsewhere (a middle layer or a user application.) But if there are few users and your database is for a small research operation, then it is fine to implement it as a stored procedure or a process running inside your server but keep in mind this will affect scalability or portability. Of course, there are more considerations like how many rows will be processed, or whether or not you intend to fire triggers, etc.
Consider other alternatives
Make a client app
You can do a fast and decent program in VB or the language of your choice. And let the client app make the heavy calculation and use the database for what it does best that is storing and retrieving data.
Store the data differently
For this particular example, you could store the unit vectors plus the magnitude. In this way, finding the cosine between any two vectors reduces simply to the dot product of the unit vectors (only multiplication and division and no squares nor square roots.)
CREATE TABLE public.vector (
id serial NOT NULL,
uvctor double precision [3], --for three dimensional vectors; of course you can change the dimension or make it decimal if you need it
magnitude double precision
);
INSERT INTO public.vector (vctor) VALUES (ARRAY[0.3714, 0.5571, 0.7428], 5.385); -- {Ux, Uy, Uz}, ||V|| where V = [2, 3, 4];
INSERT INTO public.vector (vctor) VALUES (ARRAY[0.4243, 0.5657, 0.7071], 7.071); -- {Ux, Uy, Uz}, ||V|| where V = [3, 4, 5];
SELECT a.vctor as a, b.vctor as b, 1-(a.uvctor[1] * b.uvctor[1] + a.uvctor[2] * b.uvctor[2] + a.uvctor[3] * b.uvctor[3]) as cosine_distance FROM public.vector a
JOIN public.vector b ON a.id != b.id;
Resulting in
a | b | cosine_distance
-----------------------------|------------------------------|------------------
{0.3714,0.5571,0.7428,5.385} | {0.4243,0.5657,0.7071,7.071} | 0.00202963
{0.4243,0.5657,0.7071,7.071} | {0.3714,0.5571,0.7428,5.385} | 0.00202963
Even if you have to calculate the magnitude of the vector inside the server, you will make it once per vector and not every time you need to get the distance between two of them. This becomes more important as the number of rows is increasing. For 1000 vectors for example, you would have to calculate the magnitude 999000 times if you wanted to obtain the cosine difference between any two vectors using the original vector components.
Any combination of the above
Conclusion
When we pursue efficiency, most of the times there is not a canonical answer. Instead we have trade-offs that we have to consider and evaluate. It always depends on the ultimate goal we need to achieve. Databases are excellent for storing and retrieving data; they can definitely make other things but that comes with an added cost. If we can live with the added overhead then it's fine; otherwise we have to consider alternatives.
you can take reference to my code.
--for calculation of norm vector --
CREATE or REPLACE FUNCTION public.vector_norm(IN vector double precision[])
RETURNS double precision AS
$BODY$
BEGIN
RETURN(SELECT SQRT(SUM(pow)) FROM (SELECT POWER(e,2) as pow from unnest(vector) as e) as norm);
END;
$BODY$ LANGUAGE 'plpgsql';
ALTER FUNCTION public.vector_norm(double precision[]) OWNER TO postgres;
COMMENT ON FUNCTION public.vector_norm(double precision[]) IS 'This function is used to find a norm of vectors.';
--call function--
select public.vector_norm('{ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794}')
--for caculation of dot_product--
CREATE OR REPLACE FUNCTION public.dot_product(IN vector1 double precision[], IN vector2 double precision[])
RETURNS double precision
AS $BODY$
BEGIN
RETURN(SELECT sum(mul) FROM (SELECT v1e*v2e as mul FROM unnest(vector1, vector2) AS t(v1e,v2e)) AS denominator);
END;
$BODY$ LANGUAGE 'plpgsql';
ALTER FUNCTION public.dot_product(double precision[], double precision[]) OWNER TO postgres;
COMMENT ON FUNCTION public.dot_product(double precision[], double precision[])
IS 'This function is used to find a cosine similarity between two multi-dimensional vectors.';
--call fuction--
SELECT public.dot_product(ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794],ARRAY[ 0.039968978613615,0.357211461290717,0.753132887650281,0.760665621142834,0.20826127845794])
--for calculatuion of cosine similarity--
CREATE OR REPLACE FUNCTION public.cosine_similarity(IN vector1 double precision[], IN vector2 double precision[])
RETURNS double precision
LANGUAGE 'plpgsql'
AS $BODY$
BEGIN
RETURN(select ((select public.dot_product(ARRAY[ 0.63434,0.23487,0.324323], ARRAY[ 0.63434,0.23487,0.324323]) as dot_pod)/((select public.vector_norm(ARRAY[ 0.63434,0.23487,0.324323]) as norm1) * (select public.vector_norm(ARRAY[ 0.63434,0.23487,0.324323]) as norm2))) AS similarity_value)
END;
$BODY$;
ALTER FUNCTION public.cosine_similarity(double precision[], double precision[])
OWNER TO postgres;
COMMENT ON FUNCTION public.cosine_similarity(double precision[], double precision[])
IS 'this function is used to find a cosine similarity between two vector';
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With