
How to map the coefficients obtained from a logistic regression model to the feature names in PySpark

I built a logistic regression model using a pipeline flow similar to the one described by Databricks: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

The features (numeric and string) were encoded using OneHotEncoderEstimator and then transformed using StandardScaler.

I would like to know how to map the weights (coefficients) obtained from the logistic regression to the feature names in the original dataframe.

In other words, how do I find which feature corresponds to each weight or coefficient obtained from the model?

Thank you

I tried to extract the features from the schema of the model's output (LRschema in the code below), which gave a list of StructField objects showing the features.

I then tried to map those features to the weights, but without success.

from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)

# Train model with Training Data

lrModel = lr.fit(trainingData)

predictions = lrModel.transform(trainingData)

LRschema = predictions.schema

The expected outcome of the extraction is a list of tuples (feature weight, feature name).

Fady Nabil asked Oct 19 '25 03:10


2 Answers

This is not a direct output of LogisticRegression, but it can be obtained with the following function, which I use:

from pyspark.sql import SparkSession

def ExtractFeatureCoeficient(model, dataset, excludedCols=None):
    # Transform once so the output schema (input columns plus model outputs) is available
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
    if excludedCols is None:
        # Default: drop the label, weight, assembled-vector and model output columns
        excludedCols = ['y', 'classWeights', 'features', 'label', 'rawPrediction', 'probability', 'prediction']
    feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) != len(feature_col):
        # Fail loudly here; the original fell through to an undefined weightsDF
        raise ValueError('Coefficients do not match the remaining features in the model, please check the field list with model.transform(dataset).schema.names')
    # Works outside notebooks and shells, where the sqlContext global is not predefined
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(list(zip(weights, feature_col)), schema=["Coeficients", "FeatureName"])

# lr_model is the fitted LogisticRegressionModel
results = ExtractFeatureCoeficient(lr_model, trainingData)

results.show()

This generates a Spark DataFrame with the following fields:

+--------------------+--------------------+
|         Coeficients|         FeatureName|
+--------------------+--------------------+
|[0.15834847825223...|    name            |
|               [0.0]|  lat               |
+--------------------+--------------------+

Or you can fit a GLM as follows:

from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family="binomial", link="logit", featuresCol="features", labelCol="label", maxIter=1000, regParam=0.8, weightCol="classWeights")

# Train the model on the training data
model = glm.fit(trainingData)

# Then get the summary of the fitted model:

summary = model.summary
print(summary)

This generates the output:

Coefficients:
        Feature       Estimate Std Error  T Value P Value
    (Intercept)       -1.3079    0.0705 -18.5549  0.0000
    name               0.1248    0.0158   7.9129  0.0000
    lat                0.0239    0.0209   1.1455  0.2520
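
A note on the length check in ExtractFeatureCoeficient above: it does not pass when columns were one-hot encoded, because each encoded column occupies several slots of the feature vector (one per category, minus one when dropLast is left at its default of True), so there are more coefficients than dataframe columns. The sketch below is plain Python and only illustrates how the slot names could be expanded. The function name, the "column=label" naming, and the example labels are hypothetical, and it assumes you already know the assembler input order and each column's category labels in index order (for instance from the fitted StringIndexerModel).

```python
def expand_feature_names(assembler_inputs, category_labels, drop_last=True):
    """Return one name per feature-vector slot, in assembler order.

    assembler_inputs: column names in the order passed to the assembler (assumed known)
    category_labels:  maps a one-hot encoded column to its labels in index order (assumed known)
    drop_last:        mirrors the encoder's dropLast setting
    """
    names = []
    for col in assembler_inputs:
        labels = category_labels.get(col)
        if labels is None:
            # Plain numeric column: exactly one slot
            names.append(col)
        else:
            # One-hot encoded column: one slot per kept category
            kept = labels[:-1] if drop_last else labels
            names.extend(f"{col}={label}" for label in kept)
    return names


# Hypothetical example: "lat" is numeric, "name" has three categories
slot_names = expand_feature_names(["lat", "name"], {"name": ["a", "b", "c"]})
print(slot_names)
```

The resulting list has one entry per coefficient, so it can be zipped with the model weights where the raw column names cannot.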
n1tk answered Oct 22 '25 03:10


None of the above solutions seemed to work in my case. My model has a mix of numeric and binary variables. Also, all of the data transformations and the model validation are connected in one long pipeline, so the only place I could see the schema was in the predictions data. I was able to hack together some code that iterates through the schema and builds a dictionary of all the variable names, then connects it to the coefficients.

# Extract the coefficients on each of the variables
coeff = mymodel.coefficients.toArray().tolist()

# Loop through the features to extract the original column names. Store in the var_index dictionary
var_index = dict()
for variable_type in ['numeric', 'binary']:
    for variable in predictions.schema["features"].metadata["ml_attr"]["attrs"][variable_type]:
        print("Found variable:", variable)
        idx = variable['idx']
        name = variable['name']
        var_index[idx] = name      # Add the name to the dictionary

# Loop through all of the variables found and print out the associated coefficients
for i in range(len(var_index)):
    print(i, var_index[i], coeff[i])
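
The loop above hard-codes the 'numeric' and 'binary' groups. A slightly more general sketch, in plain Python so it can be tried without a cluster, walks every group present in the metadata and returns the (weight, name) tuples the question asks for. The function name and the example metadata are hypothetical; the dictionary layout ("ml_attr" -> "attrs" -> group -> entries with "idx" and "name") is the one the code above reads. It assumes the metadata comes from the assembled vector column; whether a scaled column still carries it is worth checking on your own pipeline.

```python
def pair_coefficients(metadata, coefficients):
    """Pair each coefficient with the feature name stored at the same index."""
    attrs = metadata["ml_attr"]["attrs"]
    # Flatten all attribute groups (numeric, binary, nominal, ...) into one list
    entries = [entry for group in attrs.values() for entry in group]
    # Order by position in the feature vector
    entries.sort(key=lambda entry: entry["idx"])
    names = [entry["name"] for entry in entries]
    if len(names) != len(coefficients):
        raise ValueError(
            f"{len(coefficients)} coefficients but {len(names)} named features"
        )
    return list(zip(coefficients, names))


# Hypothetical metadata shaped like predictions.schema["features"].metadata
example_metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 1, "name": "city_NY"},
                {"idx": 2, "name": "city_LA"},
            ],
            "numeric": [{"idx": 0, "name": "age"}],
        },
        "num_attrs": 3,
    }
}

pairs = pair_coefficients(example_metadata, [0.5, -1.2, 0.0])
print(pairs)
```

With a real model, the two arguments would be predictions.schema["features"].metadata and mymodel.coefficients.toArray().tolist(), as in the code above.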
user3276159 answered Oct 22 '25 04:10


