I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer
and OneHotEncoder
, then using VectorAssembler
to combine it with a continuous independent variable into a column of sparse vectors.
If my column names are continuous
and categorical
where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:
string_indexer = StringIndexer(inputCol='categorical',
outputCol='categorical_index')
encoder = OneHotEncoder(inputCol ='categorical_index',
outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
outputCol='indep_vars')
pipeline = Pipeline(stages=string_indexer+encoder+assembler)
model = pipeline.fit(df)
df = model.transform(df)
Everything works fine to this point, and I run the model:
glm = GeneralizedLinearRegression(family='gaussian',
link='identity',
labelCol='dep_var',
featuresCol='indep_vars')
model = glm.fit(df)
model.params
Which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)
The relationship between column names and coefficients is broken by StringIndexer
and OneHotEncoder
. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
Which gives me a small dataframe relating the the string names to the numerical names, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow though, when you consider the scale of the data.
Is there a better way to do this?
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0.
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
Choosing Between Spark MLlib and Spark ML At first glance, the most obvious difference between MLlib and ML is the data types they work on, with MLlib supporting RDDs and ML supporting DataFrame s and Dataset s.
To perform one-hot encoding in PySpark, we must: convert the categorical column into a numeric column ( 0 , 1 , ...) using StringIndexer. convert the numeric column into one-hot encoded columns using OneHotEncoder.
For PySpark, here is the solution to map feature index to feature name:
First, train your model:
pipeline = Pipeline().setStages([label_stringIdx,assembler,classifier])
model = pipeline.fit(x)
Transform your data:
df_output = model.transform(x)
Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.
numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')
merge_list = numeric_metadata + binary_metadata
OUTPUT:
[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
....
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With