Relating column names to model parameters in pySpark ML

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.

Say my columns are named 'continuous' and 'categorical', where the first is a column of floats and the second is a column of strings denoting (in this case) 8 different categories:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')

encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')

assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')

pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)

Everything works fine to this point, and I run the model:

from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients

Which outputs:

DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])

Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)

The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way to recover it:

df[['categorical', 'categorical_index']].distinct()

Which gives me a small dataframe relating the string names to the numerical indices, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow though, when you consider the scale of the data.
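For concreteness, the kind of thing I have in mind looks roughly like this (my sketch: collect the distinct pairs into a driver-side lookup and then line it up with the vector positions by hand):

# Build a {index: category} lookup on the driver from the distinct pairs.
mapping = {int(row['categorical_index']): row['categorical']
           for row in df[['categorical', 'categorical_index']].distinct().collect()}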

Is there a better way to do this?

Asked by Jeff on Aug 18 '16.


People also ask

How do you use StringIndexer PySpark?

StringIndexer is a label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0.
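A minimal illustration, reusing the column names from the question (df is assumed to already exist):

from pyspark.ml.feature import StringIndexer

# fit() learns the label -> index mapping; transform() adds the index column.
indexer_model = StringIndexer(inputCol='categorical',
                              outputCol='categorical_index').fit(df)
indexed_df = indexer_model.transform(df)

# The fitted model exposes the labels in index order:
# indexer_model.labels[0] is the label mapped to index 0.0, and so on.
print(indexer_model.labels)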

Why do we use VectorAssembler in PySpark?

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
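For example, with the columns from the question (assuming 'categorical_vector' has already been produced by OneHotEncoder):

from pyspark.ml.feature import VectorAssembler

# Concatenate a numeric column and an encoded categorical vector into one features column.
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')
assembled_df = assembler.transform(df)
assembled_df.select('indep_vars').show(5, truncate=False)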

What is the difference between spark ML and spark MLlib?

At first glance, the most obvious difference between MLlib and ML is the data types they work on, with MLlib supporting RDDs and ML supporting DataFrames and Datasets.

How do you use OneHotEncoder in PySpark?

To perform one-hot encoding in PySpark, we must first convert the categorical column into a numeric index column (0, 1, ...) using StringIndexer, and then convert the numeric column into one-hot encoded columns using OneHotEncoder.
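A minimal sketch of those two steps, using the column names from the question (in Spark 2.x, OneHotEncoder is a plain transformer, so no fit() is needed for it):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Step 1: strings -> numeric indices
indexed = StringIndexer(inputCol='categorical',
                        outputCol='categorical_index').fit(df).transform(df)

# Step 2: numeric indices -> one-hot (sparse) vectors
encoded = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector').transform(indexed)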


1 Answer

For PySpark, here is a way to map feature indices back to feature names:

First, train your model:

pipeline = Pipeline().setStages([label_stringIdx, assembler, classifier])
model = pipeline.fit(x)

Transform your data:

df_output = model.transform(x)

Extract the mapping between feature index and feature name from the metadata that VectorAssembler attaches to the features column, then merge the numeric attributes and binary attributes into a single list.

numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')

merge_list = numeric_metadata + binary_metadata 

OUTPUT:

[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
  ....
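From there, pairing each coefficient with its feature name is just a matter of sorting by idx (a sketch; with the pipeline above you would take the coefficients from the last fitted stage, e.g. model.stages[-1].coefficients, while with the GLM from the question it is simply model.coefficients):

# Sort the attribute dicts by their position in the feature vector,
# then zip the names with the corresponding coefficients.
sorted_attrs = sorted(merge_list, key=lambda attr: attr['idx'])
coefficients = model.stages[-1].coefficients  # or model.coefficients for the GLM above

for attr, coef in zip(sorted_attrs, coefficients):
    print(attr['name'], coef)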
Answered by pierre_comalada on Oct 16 '22.