I am not very experienced with unsupervised learning, but my general understanding is that in unsupervised learning, the model learns without labeled outputs. However, during pre-training in models such as BERT or GPT-3, it seems to me that there is an output. For example, in BERT, some of the tokens in the input sequence are masked, and the model then tries to predict those tokens. Since we already know what the masked words originally were, we can compare them with the predictions to compute a loss. Isn't this basically supervised learning?
Pre-trained language models typically rely on learning objectives that can be derived directly from the structure of the training data. As pointed out by @solitone, techniques like masked and causal language modelling let the model derive supervision signals from the pre-training data itself, without any human annotation, which contributes greatly to the success of PLMs and LLMs (see the sketch below).
A more accurate term used in the literature is self-supervised learning, since unsupervised learning traditionally refers to techniques like clustering.
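To make the "labels come from the data itself" point concrete, here is a minimal sketch of how a masked-language-modelling training pair can be constructed from raw text. The whitespace tokenizer, toy vocabulary, 15% masking rate, and the `-100` ignore index are illustrative simplifications in the style of BERT-like pre-training, not BERT's actual pipeline.

```python
import random

# Build a BERT-style masked-LM training pair from raw text.
# Note that the "labels" are just the original tokens: no human annotation is needed.

random.seed(0)

text = "the model learns to predict missing words"
tokens = text.split()  # stand-in for a real subword tokenizer

MASK = "[MASK]"
IGNORE = -100  # positions that do not contribute to the loss
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Choose ~15% of positions to mask (at least one, for this toy example).
num_to_mask = max(1, int(0.15 * len(tokens)))
mask_positions = set(random.sample(range(len(tokens)), num_to_mask))

inputs, labels = [], []
for i, tok in enumerate(tokens):
    if i in mask_positions:
        inputs.append(MASK)
        labels.append(vocab[tok])   # the target comes from the data itself
    else:
        inputs.append(tok)
        labels.append(IGNORE)       # unmasked positions are ignored by the loss

print("input :", inputs)
print("labels:", labels)
```

During training, a cross-entropy loss would be computed only at the masked positions. So mechanically it looks like supervised learning, but because the targets are generated automatically from the unlabeled corpus, it is usually called self-supervised rather than supervised.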