Is there a built-in log loss function in pyspark?
I have a pyspark dataframe with columns: probability, rawPrediction, label
and I want to use mean log loss to evaluate these predictions.
Log loss (also known as logistic loss or cross-entropy loss) measures how close a predicted probability is to the corresponding actual/true value (0 or 1 in binary classification): the further the predicted probability diverges from the actual value, the higher the log loss. Formally, it is the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true, and it is the cost function used in (multinomial) logistic regression and in extensions of it such as neural networks. Logistic regression itself is similar to linear regression, with two significant differences: it applies a sigmoid activation to the output neuron to squash the output into the range 0–1 (so it can be read as a probability), and it uses log loss to calculate the error.
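Concretely, for a true label $y \in \{0, 1\}$ and a predicted probability $p$ of the positive class, the per-example loss and its mean over $N$ examples are:

$$
\ell(y, p) = -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr],
\qquad
\text{mean log loss} = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, p_i)
$$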
No such function exists directly, as far as I can tell (although Spark 3.0 and later do ship a logLoss metric in MulticlassClassificationEvaluator; see the note at the end). But given a PySpark dataframe df with the columns named as in the question, one can explicitly calculate the average log loss:
import pyspark.sql.functions as f

# Per-row binary log loss: -[y*log(p) + (1 - y)*log(1 - p)]
df = df.withColumn(
    'logloss',
    -f.col('label') * f.log(f.col('probability'))
    - (1. - f.col('label')) * f.log(1. - f.col('probability'))
)

# Mean log loss over the whole dataframe
logloss = df.agg(f.mean('logloss').alias('ll')).collect()[0]['ll']
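One caveat the snippet above does not handle: if probability is ever exactly 0 or 1, f.log yields null/negative infinity and poisons the mean. A common guard is to clip the probability to a small epsilon first; a minimal sketch, where the eps value is my own arbitrary choice (sklearn's log_loss clips similarly):

import pyspark.sql.functions as f

eps = 1e-15  # arbitrary clipping constant, not part of the original answer
p = f.least(f.greatest(f.col('probability'), f.lit(eps)), f.lit(1. - eps))
df = df.withColumn(
    'logloss',
    -f.col('label') * f.log(p) - (1. - f.col('label')) * f.log(1. - p)
)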
I'm assuming here that label is numerical (i.e. 0 or 1) and that probability is a scalar column holding the model's predicted probability of the positive class. (rawPrediction in Spark ML is usually the vector of raw margin scores before the probability transform, so it isn't needed for this calculation.)
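If probability is instead the usual Spark ML Vector of per-class probabilities (which is what classifiers such as LogisticRegression emit under that column name), you would first extract the positive-class entry; a minimal sketch, assuming Spark 3.0+ where pyspark.ml.functions.vector_to_array is available:

import pyspark.sql.functions as f
from pyspark.ml.functions import vector_to_array

# Element 1 of the probability vector is P(label = 1) in a binary model
df = df.withColumn('p1', vector_to_array(f.col('probability'))[1])

Then use 'p1' in place of 'probability' in the log-loss expression above.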
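Finally, as promised above: on Spark 3.0 or later, MulticlassClassificationEvaluator supports metricName='logLoss', which computes the mean log loss straight from the probability vector column. A sketch; note that the evaluator expects the standard columns an ML classifier produces, so it is safest to run it on the full output of model.transform():

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Spark 3.0+: built-in mean log loss from the probability vector column
evaluator = MulticlassClassificationEvaluator(
    labelCol='label',
    probabilityCol='probability',
    metricName='logLoss',
)
logloss = evaluator.evaluate(df)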