
Finding non-linear correlations in R

I have about 90 variables stored in data[2:90]. I suspect about 4 of them have a parabola-like relationship with data[1], and I want to identify which ones. Is there a quick and easy way to do this?

I have tried building a model like this (which I could do in a loop for each variable i = 2:90):

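# Response and one candidate predictor
y <- data$AvgRating
x <- data$Hamming.distance

# Quadratic fit; I(x^2) adds the squared term inside the formula
quadratic.model <- lm(y ~ x + I(x^2))
summary(quadratic.model)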

And then look at the R^2/coefficient to get an idea of the correlation. Is there a better way of doing this?
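The loop mentioned above could be sketched roughly as follows. This is only a minimal sketch on simulated data: the data frame, the column names v1 to v3 and the effect sizes are made up for illustration, and the only assumption carried over from my real data is that the response sits in the first column.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for the real data: column 1 is the response,
# and only v2 has a parabola-like relationship with it
data <- data.frame(AvgRating = numeric(n),
                   v1 = rnorm(n), v2 = rnorm(n), v3 = rnorm(n))
data$AvgRating <- 2 - 1.5 * data$v2^2 + rnorm(n, sd = 0.5)

y <- data[[1]]

# Fit y ~ x + x^2 for every candidate column and keep two summaries per fit
results <- do.call(rbind, lapply(names(data)[-1], function(v) {
  x <- data[[v]]
  fit <- summary(lm(y ~ x + I(x^2)))
  data.frame(variable = v,
             adj.r.squared = fit$adj.r.squared,
             quad.p.value = fit$coefficients["I(x^2)", "Pr(>|t|)"])
}))

# Many tests are run, so adjust the p-values of the quadratic term
results$quad.p.adjusted <- p.adjust(results$quad.p.value, method = "holm")

# Strongest quadratic evidence first
results[order(results$quad.p.value), ]
```

Sorting by the p-value of the squared term, rather than by R^2 alone, separates variables with real curvature from variables that are only linearly related to the response.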

Maybe R could build a regression model with all 90 variables and choose the significant ones itself? Would that be possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression in R for all the variables at once. That is why I was trying to find out in advance, by hand, which ones are correlated. It would be helpful if there were a function for that.

dorien asked Aug 01 '16 09:08





2 Answers

You can use the nlcor package in R. It estimates the nonlinear correlation between two data vectors. There are other approaches to estimating a nonlinear correlation, such as the infotheo package, but a nonlinear relationship between two variables can take any shape.

nlcor is robust to most nonlinear shapes. It works pretty well in different scenarios.

At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are then aggregated to yield the nonlinear correlation. The output is a number between 0 and 1, with values close to 1 meaning high correlation. Unlike a Pearson correlation, negative values are not returned, because the sign has no meaning for a nonlinear relationship.
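To make the segment idea concrete, here is a rough illustration in base R. It is not nlcor's implementation: the helper segment_cor is hypothetical, it uses fixed equal-width segments instead of adaptive ones, and it aggregates with a plain mean, which may differ from what nlcor does.

```r
# Illustration only: fixed equal-width segments, not nlcor's adaptive segmentation
segment_cor <- function(x, y, n_segments = 6) {
  # Assign each point to one of n_segments equal-width bins along x
  seg <- cut(x, breaks = n_segments, labels = FALSE)
  # Absolute Pearson correlation inside each segment
  seg_cors <- sapply(split(seq_along(x), seg), function(idx) {
    abs(cor(x[idx], y[idx]))
  })
  # Aggregate with a simple mean (an assumption made for this sketch)
  mean(seg_cors)
}

x <- seq(0, 3 * pi, length.out = 100)
y <- sin(x)

abs(cor(x, y))      # overall linear correlation
segment_cor(x, y)   # mean of the within-segment correlations
```

With six segments each piece of the sine wave is monotone, so the within-segment correlations are high even though the overall linear correlation is essentially zero. A poor choice of segments would hide the relationship again, which is why an adaptive segmentation is useful.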

More details about this package here

To install nlcor, follow these steps:

install.packages("devtools") 
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)

After installing it, you can use it like this:

# Implementation 
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")

[Figure: sin(x) plot]

# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = TRUE)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot

[Figure: using nlcor for sin(x)]

As the example shows, the linear correlation is close to zero even though there is a clear relationship between the variables, which nlcor detects.

Note: The order of x and y in the nlcor call matters. nlcor(x,y) is different from nlcor(y,x). Here x and y represent the 'independent' and 'dependent' variables, respectively.

vahab najari answered Sep 29 '22 16:09



Fitting a generalized additive model will help you identify curvature in the relationships between the response and the explanatory variables. Read the example on page 22 here.
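As a minimal sketch of that idea, assuming the mgcv package (a recommended package that ships with standard R installations) and simulated data with hypothetical column names curved and linear:

```r
library(mgcv)  # assumed to be available; provides gam() and s()

set.seed(1)
n <- 200

# Hypothetical data: y depends quadratically on 'curved' and linearly on 'linear'
sim <- data.frame(curved = runif(n, -2, 2), linear = runif(n, -2, 2))
sim$y <- 1 + sim$curved^2 + 0.5 * sim$linear + rnorm(n, sd = 0.3)

# One smooth term per predictor; REML chooses how wiggly each smooth is
fit <- gam(y ~ s(curved) + s(linear), data = sim, method = "REML")

# Effective degrees of freedom per smooth: about 1 means close to a straight
# line, clearly above 1 suggests curvature
edf <- summary(fit)$s.table[, "edf"]
edf

# plot(fit, pages = 1) draws the estimated smooths for a visual check
```

With around 90 predictors a single model with a smooth for each one may need more data than is available, so fitting one predictor at a time, or passing select = TRUE to gam() so that terms can be shrunk out of the model, may be more practical.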

George Dontas answered Sep 29 '22 15:09
