
How to use average function in neo4j with collection

Tags: neo4j, cypher

I want to calculate the covariance of two vectors stored as collections, A = [1, 2, 3, 4] and B = [5, 6, 7, 8]:

Cov(A, B) = Σ[(a_i - avg(A)) * (b_i - avg(B))] / (n - 1)

My problems with computing the covariance are:

1) I cannot nest aggregate functions, as in:

SUM((ai-avg(a)) * (bi-avg(b)))

2) Put another way, how can I iterate over two collections in a single REDUCE, such as:

REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))

3) If it is not possible to iterate over two collections in one REDUCE, how can I relate their values to calculate the covariance when they are processed separately?

REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))

In other words, can I write a nested REDUCE?

4) Is there any way to do this with UNWIND or EXTRACT?

Thank you in advance for any help.

Mahsa Hassankashi asked Dec 22 '15 19:12


1 Answer

cybersam's answer is totally fine but if you want to avoid the n^2 Cartesian product that results from the double UNWIND you can do this instead:

WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
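As a cross-check outside Neo4j, the same two-pass computation can be sketched in Python: first the two means, then a single index-based pass over both lists. The function name `sample_covariance` and the variable names are illustrative only and not part of the original answer.

```python
def sample_covariance(a, b):
    """Sample covariance of two equal-length lists, mirroring the Cypher query:
    compute both means first, then sum the products of deviations by index."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length with at least 2 elements")
    n = len(a)
    e_a = sum(a) / n  # same role as e_a in the query
    e_b = sum(b) / n  # same role as e_b in the query
    # One pass over the indices 0..n-1, like REDUCE over RANGE(0, n - 1)
    total = sum((a[i] - e_a) * (b[i] - e_b) for i in range(n))
    return total / (n - 1)


cov_ab = sample_covariance([1, 2, 3, 4], [5, 6, 7, 8])
print(cov_ab)
```

For the two collections from the question the deviations are -1.5, -0.5, 0.5, 1.5 in both lists, so the sum of products is 5 and the result is 5/3.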

Edit:

Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.

> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
   | aa | bb
---+----+----
 1 |  1 |  4
 2 |  1 |  5
 3 |  1 |  6
 4 |  2 |  4
 5 |  2 |  5
 6 |  2 |  6
 7 |  3 |  4
 8 |  3 |  5
 9 |  3 |  6

Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
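The row blow-up can be illustrated outside Cypher with a small Python sketch, assuming that `itertools.product` is a fair stand-in for two consecutive UNWINDs and `zip` for index-based pairing:

```python
from itertools import product

a = [1, 2, 3]
b = [4, 5, 6]

# Two consecutive UNWINDs behave like a Cartesian product: n^2 rows
rows = list(product(a, b))

# Index-based pairing, which is what covariance actually needs: n rows
paired = list(zip(a, b))

print(len(rows), len(paired))
```

Here `rows` holds the same nine (aa, bb) combinations listed in the table above, while `paired` holds only the three pairs that share an index.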

> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 |     2.0 |     5.0

Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
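A quick Python sketch of why the averages survive the Cartesian product: every value of each list is repeated the same number of times, so the mean is unchanged while the amount of data being averaged grows. The variable names are illustrative.

```python
from itertools import product
from statistics import mean

a = [1, 2, 3]
b = [4, 5, 6]
rows = list(product(a, b))  # 9 rows, each value of a and b repeated 3 times

avg_aa = mean(aa for aa, _ in rows)  # average over 9 values
avg_bb = mean(bb for _, bb in rows)  # average over 9 values

# Same results as averaging the original 3-element lists
print(avg_aa == mean(a), avg_bb == mean(b))
```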

Let's say you have two vectors of about 1000 elements each (RANGE is inclusive at both ends, so RANGE(0, 1000) actually yields 1001 elements). Calculating the average of each with a double UNWIND:

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
   | AVG(aa) | AVG(bb)
---+---------+---------
 1 |   500.0 |  1500.0

714 ms

Is significantly slower than using REDUCE:

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
       REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
   | e_a   | e_b   
---+-------+--------
 1 | 500.0 | 1500.0

4 ms

To bring it all together, I'll compare the two queries in full on the same vectors (1001 elements each):

> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0, n - 1) | s + ((aa[i] - avgA) * (bb[i] - avgB))) / (n - 1) AS covariance;
   | covariance
---+------------
 1 |    83583.5

9105 ms

> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
     REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
     SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
   | cov    
---+---------
 1 | 83583.5

33 ms
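The reported value can be reproduced independently with a few lines of Python. This is only a numerical cross-check under the assumption that Cypher's RANGE(0, 1000) includes both endpoints (1001 elements); it says nothing about Neo4j's timings.

```python
# Cypher RANGE(0, 1000) is inclusive, so the Python equivalent is range(0, 1001)
a = list(range(0, 1001))
b = list(range(1000, 2001))

n = len(a)
e_a = sum(a) / n
e_b = sum(b) / n

# Single index-based pass, as in the REDUCE query
cov = sum((a[i] - e_a) * (b[i] - e_b) for i in range(n)) / (n - 1)

print(n, e_a, e_b, cov)
```

Because b is just a shifted by 1000, the covariance equals the sample variance of a, which works out to 1001 * 1002 / 12 = 83583.5 and matches both query results.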

Nicole White answered Oct 12 '22 06:10