What is the difference between labeled and unlabeled data?

Tags:

machine-learning

In this video from Sebastian Thrun, he says that supervised learning works with "labeled" data and unsupervised learning works with "unlabeled" data. What does he mean by this? Googling "labeled vs unlabeled data" returns a bunch of scholarly papers on this topic. I just want to know the basic difference.

901

asked Oct 03 '13 23:10

bernie2436

1 Answer

Typically, unlabeled data consists of samples of natural or human-created artifacts that you can obtain relatively easily from the world. Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets, x-rays (if you were working on a medical application), etc. There is no "explanation" for each piece of unlabeled data -- it just contains the data, and nothing else.

Labeled data typically takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful "tag," "label," or "class" that is somehow informative or desirable to know. For example, labels for the above types of unlabeled data might be whether this photo contains a horse or a cow, which words were uttered in this audio recording, what type of action is being performed in this video, what the topic of this news article is, what the overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.
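To make the distinction concrete, here is a minimal Python sketch. The tweets and sentiment labels are made up for illustration; the only point is that labeled data is the same raw data with one extra piece of information attached to each sample.

```python
# Unlabeled data: just the raw samples, nothing else.
unlabeled_tweets = [
    "I love this phone",
    "Worst service ever",
    "The weather is nice today",
]

# Labeled data: the same samples, each paired with a label.
# These sentiment labels are invented for illustration.
labeled_tweets = [
    ("I love this phone", "positive"),
    ("Worst service ever", "negative"),
    ("The weather is nice today", "positive"),
]

# Stripping the labels gives back the unlabeled data.
texts = [text for text, _label in labeled_tweets]
labels = [label for _text, label in labeled_tweets]
```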

Labels for data are often obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?") and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, a machine learning model can be trained on it, so that when new unlabeled data is presented to the model, it can guess or predict a likely label for that piece of data.
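As a rough sketch of that train-then-predict workflow, the Python below fits a very simple nearest-centroid classifier on a tiny labeled dataset and then guesses a label for a new unlabeled sample. The features (height and weight), the numbers, and the helper names `fit_centroids` and `predict` are all assumptions chosen for illustration; real models are usually far more sophisticated.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each label."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    return {
        y: tuple(sum(column) / len(xs) for column in zip(*xs))
        for y, xs in grouped.items()
    }


def predict(centroids, x):
    """Guess the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: dist(centroids[y], x))


# Hypothetical labeled data: (height_cm, weight_kg) -> animal
labeled_x = [(160, 500), (165, 550), (140, 700), (145, 750)]
labeled_y = ["horse", "horse", "cow", "cow"]

# "Training": summarize the labeled data as one centroid per label.
centroids = fit_centroids(labeled_x, labeled_y)

# A new, unlabeled sample: the model guesses its label.
new_sample = (162, 520)
predicted = predict(centroids, new_sample)
print(predicted)
```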

There are many active areas of research in machine learning that are aimed at integrating unlabeled and labeled data to build better and more accurate models of the world. Semi-supervised learning attempts to combine unlabeled and labeled data (or, more generally, sets of unlabeled data where only some data points have labels) into integrated models. Deep neural networks and feature learning are areas of research that attempt to build models of the unlabeled data alone, and then apply information from the labels to the interesting parts of the models.
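One simple way to combine the two kinds of data is self-training: fit a model on the few labeled points, let it label the unlabeled points it is confident about, and refit. The Python sketch below does this with one-dimensional data and per-class means. It is only one of many semi-supervised approaches, and the data, the `margin` threshold, and the function names are assumptions made for illustration.

```python
def class_means(points, labels):
    """Mean value of the points belonging to each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def self_train(labeled_x, labeled_y, unlabeled_x, margin=2.0, max_rounds=10):
    """Grow the labeled set with confident guesses (needs >= 2 classes)."""
    x, y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        means = class_means(x, y)
        still_unlabeled = []
        for p in pool:
            ranked = sorted(means, key=lambda c: abs(means[c] - p))
            best, runner_up = ranked[0], ranked[1]
            # Confident only if the nearest class mean is clearly closer
            # than the second nearest.
            if abs(means[runner_up] - p) - abs(means[best] - p) >= margin:
                x.append(p)
                y.append(best)
            else:
                still_unlabeled.append(p)
        if len(still_unlabeled) == len(pool):
            break  # nothing new was labeled this round
        pool = still_unlabeled
    return class_means(x, y), pool


# Two hypothetical labeled points and five unlabeled ones.
final_means, leftover = self_train(
    labeled_x=[1.0, 9.0],
    labeled_y=["low", "high"],
    unlabeled_x=[2.0, 3.0, 8.0, 7.0, 5.0],
)
```

The ambiguous point halfway between the two classes is left unlabeled rather than forced into one of them, which is the usual reason for using a confidence threshold in this kind of scheme.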

146

answered Sep 19 '22 02:09

lmjohns3
