Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a function calculating correlation/correlation matrix using J?

Tags:

j

I have already wrote the following code by my own, which calculates the correlation/correlation matrix step by step:

a=: 1 2 3
b=: 2 3 4
getmean=: +/%#
getmdevn=: -getmean
getvariance1=: (getmean@:*:)@getmdevn

getvariance1 a
getvariance1 b
corr_a_b=: getmean (a*b) - getmean a * getmean b

What is the best way to calculate correlation/corellation matrix in J by using a single function? Or is there any way to combine all my code into one function?

P.S. I just found some libraries like 'numeric' in J. However, it seems that there is no documentation of those libraries online. Does anyone know where to find details of those libraries?

like image 426
Siyao Gu Avatar asked Jun 28 '17 14:06

Siyao Gu


2 Answers

For practical purposes, @EelVex is right, you should use the libraries shipped with J, as they encapsulate "best practice" as JSoftware perceives it.

However, for pedagogical, intellectual, and aesthetic reasons, I am a big fan of Oleg Kobchenko's triumph:

 corr =: (+/@:* % *&(+/)&.:*:)&(- +/%#) 

As I mentioned in 2013, it's got specimens of all the major compositions in J:

  • fork f g h
  • hook f g
  • atop f@:g
  • compose f&g
  • under f&.:g

And with the exception of &, precisely one of each. For a very real, very common calculation. Neat!

And, if you can read J, it's a clear improvement over the standard mathematical notation, because you can literally see some of the symmetries that underlie that formula: the left tine is the sum of the product (+/@:*) and right is the product of the sums (*&(+/)).

The whole middle fork is an elegant butterfly with beautiful symmetric wings (complete with antennae on the head %!).

   c=:+/@:* % *&(+/)

   5!:4<'c'
         +- / --- +
  +- @: -+- *      
  +- %             
--+      +- *      
  +- & --+- / --- +
 

Plus the whole thing is algebraically reduced. What that means is, in contrast to the standard mathematical notation where you have xs and ys and s and ȳs scattered all over the place, in corr it's clear that the very first thing we do is standardize the variables, and thereafter all we ever deal with is the delta, and so we are immune to changes in scale, units of measurement, etc.

As a further example of J notation shedding light on the underlying mathematical structures, Oleg took the reduction a step further, and cut this jem:

Cr=: +/@:*&(% +/&.:*:)&(- +/ % #)

As I discussed later in the 2013 thread, I still prefer the original because of its structural symmetries, but this version also has its merits: it makes it clear that correlation is simply the linear combination of the series, after some normalization/standardization.

sum             =:  +/         
of              =:  @          
the_product     =:  *          
after           =:  &          
scaling         =:  % +/&.:*:
after           =:  &          
standardizing   =:  - +/%#


corr =: sum of the_product after scaling after standardizing

We're getting insight into the math simply by trying different ways of expressing ourselves!

But again, for practical purposes, I suggest you follow the advice in @EelVex's answer. As Henry Rich observed in the thread where Oleg discovered these beautiful forms:

I don't think either of these forms is useful for real work, since they involve subtracting big near-equal numbers, but I learned something about the computation by writing them.

Which Oleg then demonstrated concretely:

That is easily seen running the Wikipedia stability test:

   CR=: (+/@:* % *&(+/)&.:*:)&(- +/ % #)

  0j20":CR/900000000(+,:-)1+i.1000000
_1.00000000000000000000

  COR f.
(+/ % #)@:*&(- (+/ % #)) % *&(%:@(+/ % #)@:*:@:- (+/ % #))

  0j20":COR/900000000(+,:-)1+i.1000000
_1.00000000000000000000

  0j20":c/900000000(+,:-)1+i.1000000
1.00000229430253350000

  load'stats'
  0j20":corr/900000000(+,:-)1+i.1000000
_0.99999999999619615000

But, then, that's not the point. Ultimately, I think June Kim summed it up best in 2007:

While I try to translate a mathematical expression into a J expression, I often discover hidden patterns in the expression with great joy.

It sometimes leads me to a path to new insights.

like image 170
Dan Bron Avatar answered Oct 01 '22 04:10

Dan Bron


Your code is fine and in fact very similar to what is defined in 'stats' library.

load'stats'
varp a               NB. variance of population a
  0.666667
a covp b             NB. covariance of a/b
  0.666667

Take a look at the stats library

I would write your corr_a_b as dyadic function:

 corr_a_b =: ((getmean @: *) - (* &: getmean))
like image 32
Eelvex Avatar answered Oct 01 '22 05:10

Eelvex