I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. corr expects an rdd of Vectors objects. How do I translate a column df['some_name'] to an rdd of Vectors.dense objects?
The DataFrame.stat.corr() function is used to calculate the correlation between two columns; the names of the columns are passed as arguments to this method.
The Pearson correlation coefficient can also be computed in plain Python using the corrcoef() function from NumPy. The input is typically a matrix, say of size m×n, where (with rowvar=False) each column represents the values of one random variable and each row represents a single sample of the n random variables.
A correlation value close to 0 indicates no linear association between the variables. Since the formula for the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value.
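For reference, here is a minimal NumPy sketch of both points above (the data values are made up purely for illustration):

import numpy as np

# Three samples of three variables; with rowvar=False each column is a variable.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 1.0],
                 [3.0, 6.0, 2.0]])

# Full 3x3 Pearson correlation matrix.
print(np.corrcoef(data, rowvar=False))

# Rescaling a column (e.g. changing its units) leaves the coefficients unchanged.
data[:, 0] *= 1000.0
print(np.corrcoef(data, rowvar=False))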
There should be no need for that. For numerical columns you can compute the correlation directly using DataFrameStatFunctions.corr:
df1 = sc.parallelize([(0.0, 1.0), (1.0, 0.0)]).toDF(["x", "y"])
df1.stat.corr("x", "y")
# -1.0
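Note that df.stat.corr currently supports only the Pearson correlation coefficient; passing any other method raises an error.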
Otherwise you can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
# On Spark 2.x a DataFrame has no flatMap, so go through .rdd first;
# flattening each single-field Row yields an RDD of vectors.
vectors = assembler.transform(df).select("features").rdd.flatMap(lambda x: x)
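To answer the original question end to end, the vectors RDD from the snippet above can then be fed to Statistics.corr. A minimal sketch, assuming df contains only numerical columns (on Spark 2.x+ the new ml-style vectors have to be converted to mllib vectors first):

from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.stat import Statistics

# Convert ml.linalg vectors to mllib.linalg vectors (needed on Spark 2.x+).
mllib_vectors = vectors.map(lambda v: MLLibVectors.fromML(v))

# Statistics.corr over an RDD of Vectors returns the full correlation matrix.
print(Statistics.corr(mllib_vectors, method="pearson"))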
As shown above, for just two named columns the direct one-liner is:

df.stat.corr("column1", "column2")