Problem: The default implementations (no custom parameters set) of the logistic regression models in pyspark and scikit-learn seem to yield different results.

I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression) using the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

It appears to me that the two implementations (in pyspark and scikit-learn) do not expose the same parameters, so I can't simply match the scikit-learn parameters to those in pyspark. Is there any way to match the two models in their default configurations?
Parameters Scikit model (default parameters):
```python
LogisticRegression(
    C=1.0,
    class_weight=None,
    dual=False,
    fit_intercept=True,
    intercept_scaling=1,
    max_iter=100,
    multi_class='ovr',
    n_jobs=1,
    penalty='l2',
    random_state=None,
    solver='liblinear',
    tol=0.0001,
    verbose=0,
    warm_start=False)
```
Parameters Pyspark model (default parameters):
```python
LogisticRegression(self,
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
    maxIter=100,
    regParam=0.0,
    elasticNetParam=0.0,
    tol=1e-6,
    fitIntercept=True,
    threshold=0.5,
    thresholds=None,
    probabilityCol="probability",
    rawPredictionCol="rawPrediction",
    standardization=True,
    weightCol=None,
    aggregationDepth=2,
    family="auto")
```
Thank you very much!
pyspark's LogisticRegression uses ElasticNet regularization, which is a weighted sum of L1 and L2 penalties; the weight is `elasticNetParam`. So with `elasticNetParam=0` you get pure L2 regularization, and `regParam` is the L2 regularization coefficient; with `elasticNetParam=1` you get pure L1 regularization, and `regParam` is the L1 regularization coefficient. `C` in sklearn's LogisticRegression is the inverse of `regParam`, i.e. `regParam = 1/C`.
Also, the default training methods differ; you may need to set `solver='lbfgs'` in sklearn's LogisticRegression to make the training methods more similar. It only supports L2 regularization, though.
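Putting the mapping above together, here is a sketch (not an exact equivalence) of approximating Spark's defaults in scikit-learn. Note that `regParam=0.0` means no regularization, which corresponds to `C` approaching infinity, so a very large `C` is used here; the toy data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: approximating Spark's defaults (regParam=0.0, tol=1e-6,
# maxIter=100, fitIntercept=True) in scikit-learn.
# regParam = 1/C, so regParam=0.0 means C -> infinity; a very large C
# effectively disables L2 regularization.
clf = LogisticRegression(
    C=1e10,           # ~ no regularization, like regParam=0.0
    tol=1e-6,         # Spark's default tol
    max_iter=100,     # Spark's default maxIter
    fit_intercept=True,
    solver='lbfgs',   # closer to Spark's L-BFGS optimizer
)

# Toy, separable data just to show the fit.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf.fit(X, y)
```

Caveat: Spark standardizes features internally by default (`standardization=True`), so to match it exactly you may also need to scale your features (e.g. with `StandardScaler`) on the sklearn side.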
If you need ElasticNet regularization (i.e. `0 < elasticNetParam < 1`), sklearn implements it in SGDClassifier: set `loss='log_loss'` (for logistic regression; this was `loss='log'` in older releases) and `penalty='elasticnet'`. There, `alpha` is similar to `regParam` (and you don't have to invert it, unlike `C`), and `l1_ratio` corresponds to `elasticNetParam`.
sklearn doesn't expose a decision threshold directly, but you can use `predict_proba` instead of `predict` and then apply the threshold yourself.
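For example, a manual threshold (mirroring Spark's `threshold=0.5` default) could look like this; the data is a made-up toy example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data; the point is the manual thresholding below.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

threshold = 0.5                            # Spark's default `threshold`
proba = clf.predict_proba(X)[:, 1]         # P(class == 1) per sample
preds = (proba >= threshold).astype(int)   # apply the threshold yourself
```

Raising or lowering `threshold` then trades precision against recall, just as changing `threshold` does in Spark.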
Disclaimer: I have zero spark experience, the answer is based on sklearn and spark docs.