Set thresholds in PySpark multinomial logistic regression

Tags: machine-learning, apache-spark, logistic-regression, pyspark, apache-spark-ml

I would like to perform a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DataFrame:

from pyspark.ml.linalg import DenseVector

test_train_df = (
sqlc
.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                  (0, DenseVector([3.1, -2.0, -2.9])),
                  (1, DenseVector([1.0, 0.8, 0.3])),
                  (1, DenseVector([4.2, 1.4, -1.7])),
                  (0, DenseVector([-1.9, 2.5, -2.3])),
                  (2, DenseVector([2.6, -0.2, 0.2])),
                  (1, DenseVector([0.3, -3.4, 1.8])),
                  (2, DenseVector([-1.0, -3.5, 4.7]))],
                 ['label', 'features'])
)

My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). So I write:

from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)

Then I would like to fit the model on my DF:

test_logit = test_logit_abst.fit(test_train_df)

but when executing this last command I get an error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.

During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
---> 64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
    263
    264     def _fit(self, dataset):
--> 265         java_model = self._fit_java(dataset)
    266         return self._create_model(java_model)
    267

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    260         """
    261         self._transfer_params_to_java()
--> 262         return self._java_obj.fit(dataset._jdf)
    263
    264     def _fit(self, dataset):

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds.  Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'

The error says threshold is set. This is strange, because the documentation says that setting thresholds (plural) clears threshold (singular), so the value 0.5 should have been removed. So how can I clear threshold, given that no clearThreshold() exists?

In order to achieve this I tried to clear threshold this way:

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)

This time the fit command works, and I even obtain the model intercept and coefficients:

test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])

test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)

But if I try to get thresholds (plural) from test_logit_abst I get an error:

test_logit_abst.getThresholds()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
    363         if not self.isSet(self.thresholds) and self.isSet(self.threshold):
    364             t = self.getOrDefault(self.threshold)
--> 365             return [1.0-t, t]
    366         else:
    367             return self.getOrDefault(self.thresholds)

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

What does this mean?

As a further detail, curiously (and incomprehensibly to me), inverting the order of the parameter settings produces the first error I posted above:

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)

Why does changing the order of the "set" instructions change the output as well?


asked Nov 16 '17 09:11

Vanni Rovera

1 Answer

It is a messy situation indeed...

The short answer is:

- setThresholds (plural) not clearing the threshold (singular) seems to be a bug.
- For multinomial classification (i.e. number of classes > 2), setThresholds does not do what you expect (and arguably you don't need it).
- If all you need is having some "thresholds" in the "default" value of 0.5, you don't have a problem - simply don't use any relevant argument or setThresholds statement.
- If you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (it works OK though with setThreshold(s) for binary classification).

And now for the long answer...

Let's start with binary classification, adapting the toy data from the docs:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
bdf = sc.parallelize([
     Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
     Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
     Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
     Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])

We don't need to set thresholds (plural) here - threshold=0.7 is enough, but it will be useful when illustrating the differences with setThreshold below.

blorModel = blor.fit(bdf) # works OK
blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]
blorModel.transform(bdf).show(truncate=False) # transform the training data

Here is the result:

+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction                             |probability                             |prediction| 
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0  |[-1.138455151184087,1.138455151184087]    |[0.242604109995602,0.757395890004398]   |1.0       |
|[1.0,2.0]|0.0  |[-0.6056346859838877,0.6056346859838877]  |[0.35305562698104337,0.6469443730189567]|0.0       | 
|[2.0,1.0]|1.0  |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0       | 
|[3.0,3.0]|0.0  |[1.6453673835702176,-1.6453673835702176]  |[0.8382639556951765,0.16173604430482344]|0.0       | 
+---------+-----+------------------------------------------+----------------------------------------+----------+

What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0, despite the fact that the probability is higher for 1.0 (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence the row is not classified as such.
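To make the arithmetic concrete, here is a small plain-Python sketch (no Spark needed). Spark's documentation for the thresholds parameter describes the rule as predicting the class with the largest ratio p/t, where p is the class probability and t is that class's threshold. The helper name predict_scaled is my own and hypothetical; this only re-applies the documented rule to the probabilities printed in the table above and is not Spark's internal code.

```python
def predict_scaled(probabilities, thresholds):
    """Return the index of the class with the largest p / t ratio.

    Hypothetical helper: applies the p/t rule described in the
    `thresholds` parameter documentation, outside of Spark.
    """
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)


# Probability vectors copied from the `probability` column above.
rows = [
    [0.242604109995602, 0.757395890004398],
    [0.35305562698104337, 0.6469443730189567],
    [0.5660763559614698, 0.4339236440385302],
    [0.8382639556951765, 0.16173604430482344],
]
thresholds = [0.3, 0.7]

predictions = [predict_scaled(p, thresholds) for p in rows]
print(predictions)  # [1, 0, 0, 0], matching the prediction column
```

For the 2nd row the ratios are roughly 0.353/0.3 = 1.18 for class 0 and 0.647/0.7 = 0.92 for class 1, so class 0 wins even though its raw probability is lower.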

Let's now try the seemingly identical operation, but with setThreshold(s) instead:

blor2 = (LogisticRegression()
  .setThreshold(0.7)
  .setThresholds([0.3, 0.7]) ) # works OK

blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'

Nice, eh?

setThresholds (plural) seems indeed to have cleared our value of threshold (0.7) set in the previous line, as claimed in the docs, but it seemingly did so only to restore it to its default value of 0.5...

Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).

Inverting the order of the parameter settings resolves the issue (!!!) and, moreover, renders both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):

blor2 = (LogisticRegression()
  .setThresholds([0.3, 0.7])
  .setThreshold(0.7) )

blorModel2 = blor2.fit(bdf) # works OK
blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
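The odd-looking 0.30000000000000004 is plain floating-point arithmetic. The traceback in the question shows that, when only threshold is set, getThresholds returns [1.0-t, t]. The sketch below mimics just that one expression in plain Python; the function name thresholds_from_threshold is hypothetical and this is not the PySpark source itself.

```python
def thresholds_from_threshold(t):
    # Same expression as the line shown in the question's traceback:
    #     return [1.0-t, t]
    return [1.0 - t, t]


print(thresholds_from_threshold(0.7))  # [0.30000000000000004, 0.7]

# With threshold explicitly set to None, the subtraction itself fails,
# which is the TypeError reported in the question.
try:
    thresholds_from_threshold(None)
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for -: 'float' and 'NoneType'
```

So a value of None stored in threshold still counts as "set", and the derived list cannot be computed from it.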

Let's now move to the multinomial case. We'll stick again to the example in the docs, with data from the Spark GitHub repo (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook). It is a 3-class case, with labels in {0.0, 1.0, 2.0}.

data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)

Similarly to the binary case above, where the elements of our thresholds (plural) sum up to 1, let's ask for a threshold of 0.8 for class 2:

mlor = (LogisticRegression()
       .setFamily("multinomial")
       .setThresholds([0, 0.2, 0.8])
       .setThreshold(0.8) )
mlorModel = mlor.fit(mdf)  # works OK
mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]

Looks fine, but let's ask for a prediction in the (training) dataset:

mlorModel.transform(mdf).show(truncate=False)

I have singled out only one row - it should be the 2nd from the end of the full output:

+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+ 
|label|features                                            |rawPrediction                                            |probability                                                    |prediction| 
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0  |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0       | 
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+

Scrolling to the right, you'll see that although the probability for class 2.0 here (about 0.795) is below the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...

So, what to do? Simply remove all the threshold-related statements; you don't need them. Even setFamily is unnecessary, as the algorithm will detect by itself that you have more than 2 classes. This will give results identical to the above:

mlor = LogisticRegression() # works OK - no family, no threshold(s)

To summarize:

1. In both the binary and multinomial cases, what is actually returned by the algorithm is a vector of probabilities of length equal to the number of classes, with elements summing up to 1.
2. In the binary case only, Spark allows you to go one step further: instead of naively selecting the highest-probability class as the prediction, you can apply a user-defined threshold. This setting might be useful, e.g., in cases with imbalanced data.
3. This threshold(s) setting actually has no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability.

Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as it has been nicely argued elsewhere (emphasis in the original):

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Although the above argument was made for the binary case, it fully holds for the multinomial one, too...
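For the manual post-processing route mentioned in the short answer, here is one possible sketch in plain Python. The decision rule is my own assumption, not something Spark provides: a class is eligible only if its probability reaches a per-class minimum, the most probable eligible class wins, and if no class is eligible we fall back to the plain highest-probability class. The name predict_with_minimums and the minimum values are hypothetical; the probability vector is the row singled out above.

```python
def predict_with_minimums(probabilities, minimums):
    """Pick a class from a probability vector using per-class minimums.

    Hypothetical rule (an assumption, not Spark behaviour):
    - a class is eligible if its probability >= its minimum;
    - the most probable eligible class is returned;
    - if no class is eligible, fall back to the plain argmax.
    """
    eligible = [
        i for i, (p, m) in enumerate(zip(probabilities, minimums)) if p >= m
    ]
    candidates = eligible or range(len(probabilities))
    return max(candidates, key=lambda i: probabilities[i])


# Probability vector of the row singled out above (true label 0.0).
probs = [0.20486526556822454, 8.619113376801409e-50, 0.7951347344317755]

print(predict_with_minimums(probs, [0.2, 0.2, 0.8]))  # 0: class 2 misses its 0.8 minimum
print(predict_with_minimums(probs, [0.5, 0.5, 0.5]))  # 2: only class 2 is eligible
print(predict_with_minimums(probs, [0.9, 0.9, 0.9]))  # 2: nothing eligible, plain argmax
```

To use something like this on the transformed dataframe, one option would be to wrap the function in a Python UDF applied to the probability column; that part is not shown here, and whatever rule you choose is a decision-level choice rather than part of the model.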


answered Sep 23 '22 16:09

desertnaut
