
Calculate correlation with cor(), only for numerical columns

Tags:

r

correlation

I have a data frame and would like to calculate correlations (Spearman, since the data is categorical and ranked), but only for a subset of columns. I tried with all columns, but R's cor() function only accepts numerical data (the error message says 'x' must be numeric), even when Spearman is used.

One brute-force approach is to delete the non-numerical columns from the data frame. This is not very elegant, and for speed I still don't want to calculate correlations between all of the remaining columns.

I hope there is a way to simply say "calculate correlations for columns x, y, z". Column references could be by number or by name. I suppose the most flexible way to provide them would be through a vector.

Any suggestions are appreciated.

wishihadabettername asked Aug 26 '10 03:08




1 Answer

If you have a data frame where some columns are numeric and others are not (character or factor), and you only want to compute correlations for the numeric columns, you could do the following:

    set.seed(10)

    x = as.data.frame(matrix(rnorm(100), ncol = 10))
    x$L1 = letters[1:10]
    x$L2 = letters[11:20]

    cor(x)

    Error in cor(x) : 'x' must be numeric

but selecting only the numeric columns works:

    cor(x[sapply(x, is.numeric)])

                 V1         V2          V3          V4          V5          V6          V7
    V1   1.00000000  0.3025766 -0.22473884 -0.72468776  0.18890578  0.14466161  0.05325308
    V2   0.30257657  1.0000000 -0.27871430 -0.29075170  0.16095258  0.10538468 -0.15008158
    V3  -0.22473884 -0.2787143  1.00000000 -0.22644156  0.07276013 -0.35725182 -0.05859479
    V4  -0.72468776 -0.2907517 -0.22644156  1.00000000 -0.19305921  0.16948333 -0.01025698
    V5   0.18890578  0.1609526  0.07276013 -0.19305921  1.00000000  0.07339531 -0.31837954
    V6   0.14466161  0.1053847 -0.35725182  0.16948333  0.07339531  1.00000000  0.02514081
    V7   0.05325308 -0.1500816 -0.05859479 -0.01025698 -0.31837954  0.02514081  1.00000000
    V8   0.44705527  0.1698571  0.39970105 -0.42461411  0.63951574  0.23065830 -0.28967977
    V9   0.21006372 -0.4418132 -0.18623823 -0.25272860  0.15921890  0.36182579 -0.18437981
    V10  0.02326108  0.4618036 -0.25205899 -0.05117037  0.02408278  0.47630138 -0.38592733
                  V8           V9         V10
    V1   0.447055266  0.210063724  0.02326108
    V2   0.169857120 -0.441813231  0.46180357
    V3   0.399701054 -0.186238233 -0.25205899
    V4  -0.424614107 -0.252728595 -0.05117037
    V5   0.639515737  0.159218895  0.02408278
    V6   0.230658298  0.361825786  0.47630138
    V7  -0.289679766 -0.184379813 -0.38592733
    V8   1.000000000  0.001023392  0.11436143
    V9   0.001023392  1.000000000  0.15301699
    V10  0.114361431  0.153016985  1.00000000
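To restrict the calculation to specific columns, as the question asks, one option is to index the data frame with a vector of column names or positions and pass method = "spearman" to cor(). The following is only a minimal sketch: the data frame df and its columns a, b, c and label are hypothetical and not taken from the original post.

```r
# Hypothetical example data: three numeric columns and one character column.
set.seed(10)
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  label = letters[1:10],
  stringsAsFactors = FALSE
)

# Select the columns of interest by name with a character vector.
cols_by_name <- c("a", "c")
res_name <- cor(df[, cols_by_name], method = "spearman")

# Select the same columns by position with an integer vector.
cols_by_index <- c(1, 3)
res_index <- cor(df[, cols_by_index], method = "spearman")

# Both calls return a 2 x 2 Spearman correlation matrix for columns a and c.
print(res_name)
```

Because only the listed columns are passed to cor(), correlations for the remaining columns are never computed, which should also address the speed concern.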
Greg answered Oct 13 '22 11:10
