Set thresholds in PySpark multinomial logistic regression

I would like to perform a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DF:

from pyspark.ml.linalg import DenseVector

test_train_df = (
    sqlc
    .createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                      (0, DenseVector([3.1, -2.0, -2.9])),
                      (1, DenseVector([1.0, 0.8, 0.3])),
                      (1, DenseVector([4.2, 1.4, -1.7])),
                      (0, DenseVector([-1.9, 2.5, -2.3])),
                      (2, DenseVector([2.6, -0.2, 0.2])),
                      (1, DenseVector([0.3, -3.4, 1.8])),
                      (2, DenseVector([-1.0, -3.5, 4.7]))],
                     ['label', 'features'])
)

My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). Then I write:

from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)

Then I would like to fit the model on my DF:

test_logit = test_logit_abst.fit(test_train_df)

but when executing this last command I get an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.

During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
---> 64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
    263
    264     def _fit(self, dataset):
--> 265         java_model = self._fit_java(dataset)
    266         return self._create_model(java_model)
    267

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    260         """
    261         self._transfer_params_to_java()
--> 262         return self._java_obj.fit(dataset._jdf)
    263
    264     def _fit(self, dataset):

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'

The error says threshold is set. This looks strange, since the documentation says that setting thresholds (plural) clears threshold (singular), so the value 0.5 should have been cleared. So, how can threshold be cleared, given that no clearThreshold() method exists?

In order to achieve this I tried to clear threshold this way:

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)

This time the fit command works, and I even obtain the model's intercept vector and coefficient matrix:

test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])

test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)

But if I try to get thresholds (plural) from test_logit_abst I get an error:

test_logit_abst.getThresholds()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
    363         if not self.isSet(self.thresholds) and self.isSet(self.threshold):
    364             t = self.getOrDefault(self.threshold)
--> 365             return [1.0-t, t]
    366         else:
    367             return self.getOrDefault(self.thresholds)

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

What does this mean?


As a further detail, curiously (and incomprehensibly to me), inverting the order of the parameter settings produces the first error I posted above:

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)

Why does changing the order of the "set" instructions change the output as well?

asked Nov 16 '17 by Vanni Rovera


1 Answer

It is a messy situation indeed...

The short answer is:

  1. setThresholds (plural) not clearing the threshold (singular) seems to be a bug
  2. For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you expect (and arguably you don't need it)
  3. If all you need is the "default" threshold value of 0.5 for every class, you don't have a problem - simply don't use any relevant argument or setThresholds statement
  4. If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe - a minimal sketch is given further down (it works OK though with setThreshold(s) for binary classification)

And now for the long answer...

Let's start with binary classification, adapting the toy data from the docs:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
     Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
     Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
     Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
     Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])

We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.

blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data

Here is the result:

+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction                             |probability                             |prediction| 
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0  |[-1.138455151184087,1.138455151184087]    |[0.242604109995602,0.757395890004398]   |1.0       |
|[1.0,2.0]|0.0  |[-0.6056346859838877,0.6056346859838877]  |[0.35305562698104337,0.6469443730189567]|0.0       | 
|[2.0,1.0]|1.0  |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0       | 
|[3.0,3.0]|0.0  |[1.6453673835702176,-1.6453673835702176]  |[0.8382639556951765,0.16173604430482344]|0.0       | 
+---------+-----+------------------------------------------+----------------------------------------+----------+

What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0, despite the fact that the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
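In other words, the decision rule Spark documents for thresholds is to predict the class i with the largest value of p[i]/t[i], where p is the class probability and t is that class's threshold. Redoing that arithmetic for the 2nd row in plain Python (the numbers are copied from the output above):

# Spark's documented rule for thresholds: predict argmax_i p[i] / t[i]
probs = [0.35305562698104337, 0.6469443730189567]  # probability, 2nd row
thresholds = [0.3, 0.7]

scaled = [p / t for p, t in zip(probs, thresholds)]
# scaled ~ [1.177, 0.924] -> class 0 wins, matching prediction = 0.0
print(scaled.index(max(scaled)))  # 0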

Let's now try the seemingly identical operation, but with setThreshold(s) instead:

blor2 = (LogisticRegression()
  .setThreshold(0.7)
  .setThresholds([0.3, 0.7]) ) # works OK

blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'

Nice, eh?

setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...

Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).

Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):

blor2 = (LogisticRegression()
  .setThresholds([0.3, 0.7])
  .setThreshold(0.7) )

blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
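
As an aside, that 0.30000000000000004 is not the 0.3 we passed to setThresholds: setThreshold (singular), called last, has cleared thresholds (plural), so getThresholds reconstructs [1.0 - t, t] from threshold=0.7 (exactly as the getThresholds source shown in the question's traceback suggests), and the subtraction exhibits ordinary floating-point rounding:

# getThresholds falls back to [1.0 - t, t] when only threshold is set;
# the trailing ...04 is plain binary floating-point rounding:
1.0 - 0.7
# 0.30000000000000004

The same reconstruction explains the [0.19999999999999996, 0.8] returned by getThresholds in the multinomial example below.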

Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.

data_path ="/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)

As in the binary case above, where the elements of our thresholds (plural) summed up to 1, let's ask for a threshold of 0.8 for class 2:

mlor = (LogisticRegression()
       .setFamily("multinomial")
       .setThresholds([0, 0.2, 0.8])
       .setThreshold(0.8) )
mlorModel= mlor.fit(mdf)  # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]

Looks fine, but let's ask for a prediction in the (training) dataset:

mlorModel.transform(mdf).show(truncate=False)

I have singled out only one row - it should be the 2nd from the end of the full output:

+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+ 
|label|features                                            |rawPrediction                                            |probability                                                    |prediction| 
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0  |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0       | 
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+

Scrolling to the right, you'll see that despite the fact that the probability for class 2.0 here (roughly 0.795) is below the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...

So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give results identical to the above:

mlor = LogisticRegression() # works OK - no family, no threshold(s)
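
If, on the other hand, you really do need different per-class decision thresholds in the multinomial case (point 4 of the short answer above), you will have to post-process the probability column yourself. Here is a minimal sketch of one way to do it - the my_thresholds values and the reuse of the p/t rule are my own choices for illustration, not anything Spark applies for you:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_thresholds = [0.5, 0.2, 0.8]  # hypothetical per-class thresholds

def apply_thresholds(probability):
    # predict the class maximizing p/t, mirroring the rule Spark
    # itself applies for thresholds in the binary case
    scaled = [p / t for p, t in zip(probability, my_thresholds)]
    return float(scaled.index(max(scaled)))

apply_thresholds_udf = udf(apply_thresholds, DoubleType())

manual = (mlorModel.transform(mdf)
          .withColumn('my_prediction', apply_thresholds_udf('probability')))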

To summarize:

  1. In both the binary & multinomial cases, what is actually returned by the algorithm is a vector of probabilities of length equal to the number of classes, with elements summing up to 1.
  2. In the binary case only, Spark allows you to go one step further and, rather than naively selecting the highest-probability class as the prediction, apply a user-defined threshold instead; this setting might be useful e.g. in cases with imbalanced data.
  3. This threshold(s) setting has actually no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.

Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Although the above argument was made for the binary case, it fully holds for the multinomial one, too...

answered by desertnaut