Dealing with noisy training labels in text classification using deep learning

I have a dataset that consists of sentences and corresponding multi-labels (i.e. a sentence can belong to multiple labels). Using a combination of Convolutional Neural Networks and Recurrent Neural Networks on language models (Word2Vec), I'm able to achieve good accuracy. However, the model is /too/ good at modelling the output, in the sense that a lot of the labels are arguably wrong, and thus so is the output. This means that the evaluation (even with regularization and dropout) gives a wrong impression, since I have no ground truth. Cleaning up the labels would be prohibitively expensive, so I'm left to explore "denoising" the labels somehow. I've looked at things like "Learning from Massive Noisy Labeled Data for Image Classification", but it assumes learning some sort of noise covariance matrix on the outputs, which I'm not sure how to do in Keras.

Has anyone dealt with noisy labels in a multi-label text classification setting before (ideally using Keras or similar), and do you have good ideas on how to learn a robust model from noisy labels?

JoelKuiper asked Mar 10 '23 22:03


1 Answer

The cleanlab Python package (pip install cleanlab), of which I am an author, was designed to solve this task: https://github.com/cleanlab/cleanlab/. It's a professional package created for finding label errors in datasets and learning with noisy labels. It works with any scikit-learn model out-of-the-box and can be used with PyTorch, FastText, TensorFlow, etc.

(UPDATED Sep 2022) I've added resources for exactly this task (text classification with noisy labels, i.e. labels that are sometimes flipped to other classes):

  • Blog: https://cleanlab.ai/blog/label-errors-text-datasets/
  • Runnable Colab Notebook: https://docs.cleanlab.ai/stable/tutorials/text.html

Example -- Find label errors in your dataset.

from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from cleanlab.count import estimate_cv_predicted_probabilities

# OPTION 1 - 1 line of code for sklearn compatible models
issues = CleanLearning(sklearnModel, seed=SEED).find_label_issues(data, labels)

# OPTION 2 - 2 lines of code to use ANY model
#   just pass in out-of-sample predicted probabilities
pred_probs = estimate_cv_predicted_probabilities(data, labels)
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',
)
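
To make the 'self_confidence' ranking above more concrete: the self-confidence of an example is the model's predicted probability for the label the example was given, and low values point at likely label errors. The following is a minimal pure-Python sketch of that ordering only. The helper name rank_by_self_confidence and the toy numbers are made up for illustration; cleanlab itself first decides which examples are likely issues and then orders just those, which this sketch does not reproduce.

```python
def rank_by_self_confidence(labels, pred_probs):
    """Return example indices ordered from least to most self-confident.

    Hypothetical helper for illustration; not part of cleanlab's API.
    """
    # Self-confidence: predicted probability of the *given* label.
    self_confidence = [probs[label] for label, probs in zip(labels, pred_probs)]
    return sorted(range(len(labels)), key=lambda i: self_confidence[i])


# Toy example: 4 examples, 2 classes (numbers are invented).
labels = [0, 1, 1, 0]
pred_probs = [
    [0.9, 0.1],  # given label 0, model agrees strongly
    [0.2, 0.8],  # given label 1, model agrees
    [0.7, 0.3],  # given label 1, but the model favors class 0
    [0.6, 0.4],  # given label 0, model agrees weakly
]

ranking = rank_by_self_confidence(labels, pred_probs)
print(ranking)  # [2, 3, 1, 0]: the most suspicious example comes first
```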

Details on how to compute out-of-sample predicted probabilities with any model are covered in the cleanlab documentation (docs.cleanlab.ai).
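
As a rough illustration of what "out-of-sample" means, here is one common way to get such probabilities with plain scikit-learn: K-fold cross-validation via cross_val_predict, so that every row is scored by a model that did not see it during training. The TF-IDF plus logistic regression pipeline and the toy sentences are assumptions made for this sketch, not something the answer prescribes; for a Keras model you would run the same kind of K-fold loop yourself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only): 1 = positive, 0 = negative.
texts = [
    "great movie", "loved it", "wonderful acting", "really enjoyable",
    "fantastic plot", "terrible movie", "hated it", "awful acting",
    "really boring", "dreadful plot",
]
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Any classifier exposing predict_proba would do; this one is a placeholder.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 5-fold cross-validation: each row's probabilities come from a model
# fitted on the other folds, so they are out-of-sample.
pred_probs = cross_val_predict(
    model, texts, labels, cv=5, method="predict_proba"
)
print(pred_probs.shape)  # (10, 2): one row per example, one column per class
```

An array like pred_probs is what the pred_probs argument of find_label_issues in the example above expects.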

Example -- Learning with Noisy Labels

Train an ML model on noisy labels as if it had been trained on perfect labels.

# Code taken from https://github.com/cleanlab/cleanlab
from cleanlab.classification import CleanLearning
from sklearn.linear_model import LogisticRegression

# Learning with noisy labels in 3 lines of code.
cl = CleanLearning(clf=LogisticRegression())  # any sklearn-compatible classifier
cl.fit(X=train_data, labels=labels)
# Estimate the predictions you would have gotten training with error-free labels.
predictions = cl.predict(test_data)
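
For intuition about what "learning with noisy labels" involves, the general recipe is: get out-of-sample predicted probabilities, flag examples whose given label looks wrong, and retrain without them. The sketch below does this by hand with scikit-learn on synthetic data. The flagging rule is a deliberately simplified heuristic assumed for illustration (per-class thresholds set to the average self-confidence of each class); it is not cleanlab's exact algorithm, and the data, noise rate, and variable names are all made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic binary data with 10% of labels flipped (illustrative only).
X, y_true = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    class_sep=2.0, random_state=0,
)
y_noisy = y_true.copy()
flipped = rng.choice(len(y_noisy), size=40, replace=False)
y_noisy[flipped] = 1 - y_noisy[flipped]

# Step 1: out-of-sample predicted probabilities on the noisy labels.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy,
    cv=5, method="predict_proba",
)

# Step 2 (simplified heuristic, NOT cleanlab's exact rule): flag an example
# when the model's top class differs from the given label and the top-class
# probability reaches that class's average self-confidence.
n_classes = pred_probs.shape[1]
thresholds = np.array(
    [pred_probs[y_noisy == k, k].mean() for k in range(n_classes)]
)
predicted = pred_probs.argmax(axis=1)
top_prob = pred_probs[np.arange(len(y_noisy)), predicted]
suspect = (predicted != y_noisy) & (top_prob >= thresholds[predicted])

# Step 3: retrain on the examples that were not flagged.
clean_model = LogisticRegression(max_iter=1000).fit(
    X[~suspect], y_noisy[~suspect]
)
print(f"flagged {suspect.sum()} of {len(y_noisy)} examples")
```

CleanLearning wraps this kind of find-prune-retrain loop behind fit and predict, with a more careful estimate of which labels are wrong.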

Given that you may also be working with image or audio classification, the cleanlab documentation has working examples for image classification with PyTorch and audio classification with SpeechBrain.

Additional documentation is available here: docs.cleanlab.ai

cgnorthcutt answered Apr 28 '23 08:04