
Replicate logistic regression model from pyspark in scikit-learn

Problem: The default implementations (no custom parameters set) of the logistic regression model in pyspark and scikit-learn seem to yield different results given their default parameter values.

I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression) with the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

It appears to me that the two model implementations (in pyspark and scikit-learn) do not have the same parameters, so I can't simply set the parameters in scikit-learn to match those in pyspark. Is there a way to make both models match on their default configuration?

Parameters of the scikit-learn model (defaults):

`LogisticRegression(
C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
max_iter=100,
multi_class='ovr',
n_jobs=1,
penalty='l2',
random_state=None,
solver='liblinear',
tol=0.0001,
verbose=0,
warm_start=False)`

Parameters of the pyspark model (defaults):

LogisticRegression(self, 
featuresCol="features", 
labelCol="label", 
predictionCol="prediction", 
maxIter=100,
regParam=0.0, 
elasticNetParam=0.0, 
tol=1e-6, 
fitIntercept=True, 
threshold=0.5, 
thresholds=None, 
probabilityCol="probability", 
rawPredictionCol="rawPrediction", 
standardization=True, 
weightCol=None, 
aggregationDepth=2, 
family="auto")

Thank you very much!

AaronDT asked Feb 05 '23 05:02



1 Answer

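pyspark's LR uses elastic-net regularization, which is a weighted sum of L1 and L2 terms; the mixing weight is elasticNetParam. So with elasticNetParam=0 you get pure L2 regularization, and regParam is the L2 regularization coefficient; with elasticNetParam=1 you get pure L1 regularization, and regParam is the L1 regularization coefficient. C in sklearn LogisticRegression is the inverse regularization strength, so regParam corresponds to 1/C up to a scaling factor: sklearn sums the loss over the samples while Spark averages it, so the closer match is regParam = 1/(n_samples * C). Note that the pyspark default regParam=0.0 means no regularization at all, whereas the sklearn default C=1.0 applies an L2 penalty; that alone is enough to make the default results differ.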
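To make the "weighted sum of L1 and L2 terms" concrete, here is a minimal sketch of the penalty as the Spark docs describe it. The function name `elastic_net_penalty` is hypothetical, and the factor 1/2 on the L2 term is taken from the documented formula rather than from reading the Spark source.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: regParam * (a * ||w||_1 + (1 - a) / 2 * ||w||_2^2).

    `elastic_net_param` plays the role of a: 0 gives pure L2, 1 gives pure L1.
    """
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    mix = elastic_net_param
    return reg_param * (mix * l1_term + (1.0 - mix) * 0.5 * l2_term)


weights = [1.0, -2.0]
pure_l2 = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
pure_l1 = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
```

With elasticNetParam=0 only the squared term contributes, and with elasticNetParam=1 only the absolute-value term does, which is the behaviour described above.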

Also, the default training methods are different; you may need to set solver='lbfgs' in sklearn LogisticRegression to make the training methods more similar. It only works with L2, though.
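A possible way to collect these settings is a small helper that turns pyspark values into keyword arguments for sklearn's LogisticRegression. This is only a sketch: the helper name `spark_lr_to_sklearn_kwargs` is hypothetical, it covers the L2 case only (elasticNetParam=0), it assumes Spark averages the loss over samples while sklearn sums it (hence the `n_samples` factor), and passing Spark's `tol` through unchanged is a guess, since the two libraries may not define the tolerance identically.

```python
def spark_lr_to_sklearn_kwargs(reg_param, n_samples, max_iter=100, tol=1e-6):
    """Map pyspark LogisticRegression settings (L2 only) to sklearn kwargs.

    Assumption: Spark minimises mean(loss) + regParam * penalty, while
    sklearn minimises C * sum(loss) + penalty, so C = 1 / (n_samples * regParam).
    """
    kwargs = {
        "penalty": "l2",
        "solver": "lbfgs",  # closer to Spark's quasi-Newton training than liblinear
        "fit_intercept": True,
        "max_iter": max_iter,
        "tol": tol,
    }
    if reg_param == 0.0:
        # pyspark's default regParam=0.0 means no penalty; a very large C
        # approximates an unregularised fit in sklearn.
        kwargs["C"] = 1e12
    else:
        kwargs["C"] = 1.0 / (n_samples * reg_param)
    return kwargs


example_kwargs = spark_lr_to_sklearn_kwargs(reg_param=0.01, n_samples=1000)
```

The result would be used as `LogisticRegression(**example_kwargs)`. If regParam is non-zero, pyspark's standardization=True also changes how the penalty acts on the features, so an exact match may additionally need standardization=False on the Spark side or standardized inputs on the sklearn side.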

If you need elastic-net regularization (i.e. 0 < elasticNetParam < 1), then sklearn implements it in SGDClassifier: set loss='log_loss' (loss='log' in older sklearn releases) and penalty='elasticnet'; alpha would be similar to regParam (and you don't have to invert it, like C), and l1_ratio would be elasticNetParam.
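The elastic-net mapping could be sketched the same way. The helper name `spark_elastic_net_to_sgd_kwargs` is hypothetical; the keys are SGDClassifier parameters, with the logistic loss spelled `"log_loss"` as in recent sklearn releases (older ones used `"log"`).

```python
def spark_elastic_net_to_sgd_kwargs(reg_param, elastic_net_param, max_iter=100):
    """Map pyspark regParam / elasticNetParam to SGDClassifier kwargs."""
    if not 0.0 <= elastic_net_param <= 1.0:
        raise ValueError("elastic_net_param must lie in [0, 1]")
    return {
        "loss": "log_loss",             # logistic regression; "log" in older sklearn
        "penalty": "elasticnet",
        "alpha": reg_param,             # used directly, not inverted like C
        "l1_ratio": elastic_net_param,  # 0 = pure L2, 1 = pure L1
        "max_iter": max_iter,
    }


sgd_kwargs = spark_elastic_net_to_sgd_kwargs(reg_param=0.01, elastic_net_param=0.5)
```

These would be passed as `SGDClassifier(**sgd_kwargs)`. Since SGD is a stochastic solver, the fitted coefficients should be expected to approximate, not reproduce, what pyspark returns.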

sklearn doesn't provide a threshold parameter directly, but you can use predict_proba instead of predict, and then apply the threshold yourself.
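Applying the threshold by hand can be sketched as below. The function name `apply_threshold` is hypothetical, and treating a probability exactly equal to the threshold as class 0 (strict comparison) is an assumption about how pyspark breaks ties.

```python
def apply_threshold(positive_probs, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels.

    A sample is labelled 1 only if its probability is strictly above
    the threshold (assumed tie-breaking rule).
    """
    return [1 if p > threshold else 0 for p in positive_probs]


labels = apply_threshold([0.2, 0.5, 0.7], threshold=0.5)
```

With a fitted binary sklearn model, the input would be the second column of the probability matrix, e.g. `apply_threshold(model.predict_proba(X)[:, 1], threshold=0.3)`.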

Disclaimer: I have zero spark experience, the answer is based on sklearn and spark docs.

Mikhail Korobov answered Feb 06 '23 18:02
