I would like to perform a multinomial logistic regression but I can't set the threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector
test_train_df = (
    sqlc
    .createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                      (0, DenseVector([3.1, -2.0, -2.9])),
                      (1, DenseVector([1.0, 0.8, 0.3])),
                      (1, DenseVector([4.2, 1.4, -1.7])),
                      (0, DenseVector([-1.9, 2.5, -2.3])),
                      (2, DenseVector([2.6, -0.2, 0.2])),
                      (1, DenseVector([0.3, -3.4, 1.8])),
                      (2, DenseVector([-1.0, -3.5, 4.7]))],
                     ['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). Then I write:
from pyspark.ml import classification as cl
test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)
Then I would like to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
but when executing this last command I get an error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.
During handling of the above exception, another exception occurred:
IllegalArgumentException Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
62 return self.copy(params)._fit(dataset)
63 else:
---> 64 return self._fit(dataset)
65 else:
66 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
263
264 def _fit(self, dataset):
--> 265 java_model = self._fit_java(dataset)
266 return self._create_model(java_model)
267
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
260 """
261 self._transfer_params_to_java()
--> 262 return self._java_obj.fit(dataset._jdf)
263
264 def _fit(self, dataset):
~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
77 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
78 if s.startswith('java.lang.IllegalArgumentException: '):
---> 79 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
80 raise
81 return deco
IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so that the value 0.5 should be deleted.
So, how do I clear threshold, since no clearThreshold() exists?
In order to achieve this I tried to clear threshold this way:
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)
This time the fit command works, I even obtain the model intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])
test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get thresholds (plural) from test_logit_abst I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()
~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
363 if not self.isSet(self.thresholds) and self.isSet(self.threshold):
364 t = self.getOrDefault(self.threshold)
--> 365 return [1.0-t, t]
366 else:
367 return self.getOrDefault(self.thresholds)
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, curiously (and incomprehensibly to me), inverting the order of the parameter settings produces the first error I posted above:
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions change the output as well?
It is a messy situation indeed...
The short answer is:
- setThresholds (plural) not clearing the threshold (singular) seems to be a bug
- setThresholds does not do what you expect (and arguably you don't need it)
- [...] setThresholds statement
- [...] probability column in the transformed dataframe (it works OK though with setThreshold(s) for binary classification)
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()
blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.
blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction |probability |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0 |[-1.138455151184087,1.138455151184087] |[0.242604109995602,0.757395890004398] |1.0 |
|[1.0,2.0]|0.0 |[-0.6056346859838877,0.6056346859838877] |[0.35305562698104337,0.6469443730189567]|0.0 |
|[2.0,1.0]|1.0 |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0 |
|[3.0,3.0]|0.0 |[1.6453673835702176,-1.6453673835702176] |[0.8382639556951765,0.16173604430482344]|0.0 |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0, despite the fact that the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
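The decision rule just described can be sketched in plain Python, without Spark. This is only an illustration and not Spark's implementation: the helper name predict_binary is made up, the probabilities are copied from the table above, and the rule "predict 1.0 only if its probability exceeds the threshold" is the assumption being illustrated.

```python
# Probabilities [P(label=0), P(label=1)] copied from the table above.
probabilities = [
    [0.242604109995602, 0.757395890004398],
    [0.35305562698104337, 0.6469443730189567],
    [0.5660763559614698, 0.4339236440385302],
    [0.8382639556951765, 0.16173604430482344],
]


def predict_binary(prob, threshold=0.5):
    """Hypothetical helper: predict 1.0 only if P(label=1) exceeds the threshold."""
    return 1.0 if prob[1] > threshold else 0.0


# Default cutoff of 0.5 versus the stricter 0.7 used above.
default_predictions = [predict_binary(p) for p in probabilities]
strict_predictions = [predict_binary(p, threshold=0.7) for p in probabilities]

print(default_predictions)  # [1.0, 1.0, 0.0, 0.0]
print(strict_predictions)   # [1.0, 0.0, 0.0, 0.0]
```

Only the 2nd row changes between the two runs, which matches the prediction column shown above.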
Let's now try the seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
         .setThreshold(0.7)
         .setThresholds([0.3, 0.7]))  # works OK
blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh?
setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).
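The phrase "thresholds (equivalent to 0.7)" in the error message can be made concrete. Assuming the rule given in the Spark documentation for thresholds (the class with the largest p/t is predicted, p being the class probability and t its threshold), a two-element [t0, t1] corresponds to a single cutoff on P(label=1): p1/t1 > p0/t0 with p0 = 1 - p1 gives p1 > 1 / (1 + t0/t1). A sketch of that arithmetic, with a hypothetical helper name:

```python
import math


def equivalent_threshold(thresholds):
    """Hypothetical helper: single cutoff on P(label=1) implied by [t0, t1]."""
    t0, t1 = thresholds
    return 1.0 / (1.0 + t0 / t1)


equiv = equivalent_threshold([0.3, 0.7])

# The pair [0.3, 0.7] agrees with threshold=0.7 but not with the default 0.5.
print(math.isclose(equiv, 0.7))  # True
print(math.isclose(equiv, 0.5))  # False
```

On that reading, the error reports a clash between the default threshold of 0.5 and the pair [0.3, 0.7].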
Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
         .setThresholds([0.3, 0.7])
         .setThreshold(0.7))
blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
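The 0.30000000000000004 above looks like ordinary floating-point arithmetic. The getThresholds source visible in the traceback in the question returns [1.0 - t, t] when only threshold is set, so the output is consistent with thresholds having been derived from threshold=0.7 rather than read back from the list we passed. The function below is a stand-alone imitation of that one branch, not the PySpark method itself:

```python
def thresholds_from_threshold(t):
    """Stand-alone imitation of the [1.0 - t, t] branch shown in the traceback."""
    return [1.0 - t, t]


print(thresholds_from_threshold(0.7))  # [0.30000000000000004, 0.7]

# With t=None the subtraction fails, as in the TypeError from the question.
try:
    thresholds_from_threshold(None)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```

This would also explain the TypeError in the question: with threshold set to None, the same branch ends up computing 1.0 - None.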
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
Similarly to the binary case above, where the elements of our thresholds (plural) sum up to 1, let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
        .setFamily("multinomial")
        .setThresholds([0, 0.2, 0.8])
        .setThreshold(0.8))
mlorModel = mlor.fit(mdf) # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks fine, but let's ask for a prediction in the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have singled out only one row - it should be the 2nd from the end of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features |rawPrediction |probability |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0 |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that despite the fact that the probability for class 2.0 here (0.795) is below the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give identical results with the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
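If different cutoffs per class are really wanted in the multinomial case, one possibility is to post-process the probability vectors yourself. The sketch below is plain Python and makes its own assumption about the semantics: a class is eligible only if its probability reaches its cutoff, the most probable eligible class wins, and the plain argmax is the fallback when no class is eligible. The helper name is hypothetical, and the probabilities are those of the row singled out above; in Spark the same logic would have to be applied to the probability column of the transformed dataframe, e.g. through a UDF.

```python
def predict_with_cutoffs(prob, cutoffs):
    """Hypothetical helper: most probable class among those meeting their cutoff."""
    eligible = [i for i, (p, c) in enumerate(zip(prob, cutoffs)) if p >= c]
    # Fall back to the plain argmax if no class reaches its cutoff.
    candidates = eligible if eligible else range(len(prob))
    return float(max(candidates, key=lambda i: prob[i]))


# Probability vector of the row singled out above.
row_prob = [0.20486526556822454, 8.619113376801409e-50, 0.7951347344317755]

print(predict_with_cutoffs(row_prob, [0, 0.2, 0.8]))  # 0.0: class 2 misses its 0.8 cutoff
print(predict_with_cutoffs(row_prob, [0, 0, 0]))      # 2.0: plain argmax
```

Other semantics are possible (for instance ranking classes by p/t, as the Spark documentation describes for thresholds), so treat the rule above as one choice among several.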
To summarize:
- [...] probability class as the prediction, but applying a user-defined threshold instead; this setting might be useful e.g. in cases with imbalanced data.
- threshold(s) setting has actually no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one, too...