 

PySpark computing correlation

I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an RDD of Vectors objects. How do I translate a column such as df['some_name'] into an RDD of Vectors.dense objects?

VJune asked Jun 03 '16


People also ask

How do you do a correlation in PySpark?

The DataFrame.stat.corr() function is used to calculate the correlation. The columns between which the correlation is to be calculated are passed as arguments to this method.

How does Numpy calculate correlation?

The Pearson correlation coefficient can be computed in Python using the corrcoef() function from NumPy. The input for this function is typically a matrix, say of size m x n, where: each column represents the values of a random variable, and each row represents a single sample of n random variables.
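
For example, a minimal NumPy sketch (the data values here are made up purely for illustration):

import numpy as np

# two made-up variables of equal length
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

# np.corrcoef returns the full 2x2 correlation matrix; entry [0, 1] is corr(x, y)
print(np.corrcoef(x, y)[0, 1])  # close to 1.0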

Is correlation affected by scaling?

A correlation value close to 0 indicates no association between the variables. Since the formula for calculating the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value.


2 Answers

There should be no need for that. For numerical columns you can compute the correlation directly using DataFrameStatFunctions.corr:

df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
df1.stat.corr("x", "y")
# -1.0

otherwise you can use VectorAssembler:

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
# on Spark 2.x+ a DataFrame no longer exposes flatMap directly, so go through .rdd
assembler.transform(df).select("features").rdd.flatMap(lambda x: x)
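
If you really do want to go through pyspark.mllib.stat.Statistics.corr, as in the original question, a minimal sketch could look like the following (it assumes df contains only numeric columns; the explicit conversion to mllib Vectors is there because VectorAssembler emits ml vectors, which the mllib API may not accept directly):

from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")

# Statistics.corr expects an RDD of mllib Vectors, so convert each assembled row
vectors = (assembler.transform(df)
           .select("features")
           .rdd
           .map(lambda row: Vectors.dense(row.features.toArray())))

# returns the full Pearson correlation matrix as a numpy array
corr_matrix = Statistics.corr(vectors, method="pearson")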
zero323 answered Sep 24 '22


df.stat.corr("column1", "column2")

MUK answered Sep 24 '22