How to get correlation matrix values pyspark

I have a correlation matrix calculated as follows on PySpark 2.2:

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

datos = sql("""select * from proceso_riesgos.jdgc_bd_train_mn_ingresos""")

Variables_corr = ['ingreso_final_mix', 'ingreso_final_promedio',
                  'ingreso_final_mediana', 'ingreso_final_trimedia', 'ingresos_serv_q1',
                  'ingresos_serv_q2', 'ingresos_serv_q3', 'prom_ingresos_serv', 'y_correc']

assembler = VectorAssembler(
    inputCols=Variables_corr,
    outputCol="features")

# Keep only the selected columns and drop rows where y_correc is null
datos1 = datos.select(Variables_corr).filter("y_correc is not null")
# Assemble the filtered DataFrame (datos1), not the unfiltered datos
output = assembler.transform(datos1)
r1 = Correlation.corr(output, "features")

The result is a DataFrame with a single column called "pearson(features)", of type matrix:

[Row(pearson(features)=DenseMatrix(20, 20, [1.0, 0.9428, 0.8908, 0.913,
0.567, 0.5832, 0.6148, 0.6488, ..., -0.589, -0.6145, -0.5906, -0.5534,
-0.5346, -0.0797, -0.617, 1.0], False))]

I need to extract those values so that I can export them to Excel or otherwise manipulate the result. A plain list would be ideal.

Thanks for your help!

511

asked Aug 13 '18 23:08

Juan David

1 Answer

You are almost there! There is no need to use the old RDD-based MLlib API.

This is the function I use to build a pandas DataFrame from the result; from there you can export it to Excel, CSV, or other formats.

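import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

def correlation_matrix(df, corr_columns, method='pearson'):
    # Pack the selected columns into a single vector column
    vector_col = "corr_features"
    assembler = VectorAssembler(inputCols=corr_columns, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    # The result is a one-row DataFrame whose only column holds the matrix
    matrix = Correlation.corr(df_vector, vector_col, method)

    # Read the column by position, since its name depends on the method used
    result = matrix.collect()[0][0].toArray()
    return pd.DataFrame(result, columns=corr_columns, index=corr_columns)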
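If a plain list is all that is needed, the flat values can also be rearranged without pandas. A DenseMatrix stores its values in column-major order (the trailing False in the printed matrix is the isTransposed flag), so element (i, j) sits at index i + j * numRows. The sketch below uses only the standard library; matrix_to_rows and rows_to_csv are hypothetical helper names, and the small input is made up for illustration rather than taken from the question's data.

```python
import csv
import io

def matrix_to_rows(values, num_rows, num_cols):
    """Turn a flat, column-major sequence of values into a list of row lists."""
    # Column-major layout: element (i, j) is stored at index i + j * num_rows.
    return [[values[i + j * num_rows] for j in range(num_cols)]
            for i in range(num_rows)]

def rows_to_csv(rows, names):
    """Render a square matrix as CSV text, labelling both rows and columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + list(names))        # header: blank corner cell, then names
    for name, row in zip(names, rows):
        writer.writerow([name] + list(row))    # each row starts with its label
    return buffer.getvalue()

# Made-up 2x2 correlation values (assumption, for illustration only).
flat = [1.0, 0.5, 0.5, 1.0]
rows = matrix_to_rows(flat, 2, 2)
csv_text = rows_to_csv(rows, ["a", "b"])
print(rows)
print(csv_text)
```

With a real result, the inputs would presumably come from the collected matrix, for example its values.tolist(), numRows and numCols attributes. If you keep the pandas DataFrame returned by correlation_matrix instead, its to_csv and to_excel methods cover the export (to_excel needs an Excel writer package such as openpyxl installed), and .values.tolist() gives the nested list.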

117

answered Oct 02 '22 20:10

Mithril

Tags: python, apache-spark, pyspark
