I have about 90 variables stored in data[2:90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have that correlation. Is there a quick and easy way to do this?
I have tried building a model like this (which I could do in a loop for each variable i = 2:90):
y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2
quadratic.model = lm(y ~ x + x2)
And then look at the R^2/coefficient to get an idea of the correlation. Is there a better way of doing this?
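The per-variable loop described above could be sketched roughly as follows. This is only an illustration on simulated data: the data frame and its column names (y, v2, v3, v4) are made up, and the response is assumed to sit in the first column.

```r
# Simulated stand-in for the real data: column 1 is the response,
# the remaining columns are candidate predictors (hypothetical names).
set.seed(1)
n <- 200
data <- data.frame(y = numeric(n), v2 = rnorm(n), v3 = rnorm(n), v4 = rnorm(n))
data$y <- 1 + 2 * data$v3^2 + rnorm(n, sd = 0.5)  # only v3 is parabola-like

# Fit y ~ x + x^2 for each candidate column and keep the R^2
# and the p-value of the squared term.
screen <- t(sapply(names(data)[-1], function(v) {
  x <- data[[v]]
  fit <- summary(lm(data[[1]] ~ x + I(x^2)))
  c(r.squared = fit$r.squared,
    p.quadratic = fit$coefficients["I(x^2)", "Pr(>|t|)"])
}))

# Sort so the strongest quadratic fits come first.
screen <- screen[order(-screen[, "r.squared"]), ]
print(screen)
```

With around 90 separate fits, some small p-values will appear by chance, so it may be worth adjusting them (for example with p.adjust) before deciding which variables to keep.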
Maybe R could build a regression model with the 90 variables and choose the significant ones itself? Would that be possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression in R for all the variables at once. That is why I was trying to work out manually, in advance, which ones are correlated. It would be helpful if there was a function for that.
You can use the nlcor package in R. This package finds the nonlinear correlation between two data vectors.
There are other approaches to estimating a nonlinear correlation, such as the infotheo package. However, a nonlinear relationship between two variables can take any shape. nlcor is robust to most nonlinear shapes and works well in a range of scenarios.
At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are then aggregated to yield the nonlinear correlation. The output is a number between 0 and 1, with values close to 1 meaning high correlation. Unlike a Pearson correlation, negative values are not returned, because the sign has no meaning for nonlinear relationships.
More details about this package here
To install nlcor, follow these steps:
install.packages("devtools")
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)
After installing it, you can try it on a simple example:
# Implementation
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")
# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot
As the example shows, the linear correlation was close to zero even though there was a clear relationship between the variables, which nlcor could detect.
Note: The order of x and y inside nlcor is important. nlcor(x,y) is different from nlcor(y,x). The x and y here represent the 'independent' and 'dependent' variables, respectively.
Fitting a generalized additive model will help you identify curvature in the relationships between the response and the explanatory variables. Read the example on page 22 here.
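As a rough illustration of the GAM suggestion, here is a minimal sketch using the mgcv package on simulated data. The choice of mgcv and the variables x and y are assumptions for the example, not something taken from the linked document.

```r
library(mgcv)  # a recommended package included with standard R installations

# Simulated parabola-like relationship (made-up data for illustration).
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.3)

# s(x) fits a penalized smooth of x instead of a straight line.
fit <- gam(y ~ s(x))
print(summary(fit))

# Effective degrees of freedom of the smooth: a value near 1 means the
# fitted relationship is close to linear, larger values indicate curvature.
edf <- summary(fit)$s.table[, "edf"]
print(edf)
```

Calling plot(fit) afterwards draws the estimated smooth, which makes a parabola-like shape easy to spot by eye.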