Relating column names to model parameters in pySpark ML

Tags:

python

pyspark

apache-spark-ml

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.

If my column names are continuous and categorical where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:

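from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Map each category string to a numeric index.
string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')

# Expand the index into a sparse vector of dummy variables.
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')

# Combine the continuous column and the dummy vector into one feature vector.
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')

# stages must be a list of pipeline stages.
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)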

Everything works fine to this point, and I run the model:

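from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
# The fitted coefficients are exposed as a vector on the model.
model.coefficients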

Which outputs:

DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])

Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)

The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:

df[['categorical', 'categorical_index']].distinct()

This gives me a small DataFrame relating the string names to the numerical indices, which I think I could then relate back to the keys in the sparse vector. It is very clunky and slow, though, when you consider the scale of the data.

Is there a better way to do this?

308

asked Aug 18 '16 15:08

Jeff

1 Answer

For PySpark, here is the solution to map feature index to feature name:

First, train your model:

pipeline = Pipeline().setStages([label_stringIdx,assembler,classifier])
model = pipeline.fit(x)

Transform your data:

df_output = model.transform(x)

Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.

attrs = df_output.schema["features"].metadata["ml_attr"]["attrs"]
# Either group can be absent, so fall back to an empty list.
numeric_metadata = attrs.get("numeric", [])
binary_metadata = attrs.get("binary", [])

# Sort by index so the list order matches the feature vector.
merge_list = sorted(numeric_metadata + binary_metadata, key=lambda a: a["idx"])

OUTPUT:

[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
  ....
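
To get from this list to named coefficients, the entries can be ordered by idx and zipped with the coefficient vector. Below is a minimal sketch of that step in plain Python. The helper name name_coefficients and the example values are hypothetical, and it assumes the attrs dict has the shape shown above (lists of {'idx', 'name'} entries grouped by attribute type):

```python
def name_coefficients(attrs, coefficients):
    """Pair each coefficient with the feature name stored at the same index.

    `attrs` is assumed to look like metadata['ml_attr']['attrs']: a dict whose
    values are lists of {'idx': int, 'name': str} entries.
    """
    # Flatten every attribute group (numeric, binary, ...) into one list.
    merged = [entry for group in attrs.values() for entry in group]
    # Order by vector position so names line up with the coefficients.
    merged.sort(key=lambda entry: entry["idx"])
    if len(merged) != len(coefficients):
        raise ValueError("expected one coefficient per feature attribute")
    return [(entry["name"], float(coef))
            for entry, coef in zip(merged, coefficients)]


# Hypothetical metadata and coefficients, for illustration only.
example_attrs = {
    "numeric": [{"idx": 0, "name": "continuous"}],
    "binary": [
        {"idx": 2, "name": "categorical_vector_b"},
        {"idx": 1, "name": "categorical_vector_a"},
    ],
}
example_coefficients = [1.5, -0.25, 3.0]

named = name_coefficients(example_attrs, example_coefficients)
for name, coef in named:
    print(name, coef)
```

With a fitted model, the second argument would presumably be model.coefficients.toArray() (or the vector itself). The intercept is reported separately as model.intercept, so it has no entry in this list.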

148

answered Oct 16 '22 07:10

pierre_comalada
