FastText using pre-trained word vector for text classification

I am working on a text classification problem; that is, given some text, I need to assign certain labels to it.

I have tried using the fastText library by Facebook, which has two utilities of interest to me:

A) Word Vectors with pre-trained models

B) Text Classification utilities

However, these seem to be completely independent tools, as I have been unable to find any tutorial that combines the two.

What I want is to classify some text by taking advantage of the pre-trained word-vector models. Is there any way to do this?

asked Dec 07 '17 by JarvisIA

People also ask

Is FastText a Pretrained model?

FastText's advantage over word2vec or GloVe, for example, is that it uses subword information to return vectors for OOV (out-of-vocabulary) words. It offers two types of pretrained models: .vec and .bin.
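A minimal sketch of the OOV behavior, assuming the current fasttext pip package (the file name wiki.en.bin stands in for whichever pretrained .bin you download):

    import fasttext  # pip install fasttext

    # .bin files contain the full model, including subword (n-gram) vectors,
    # so they can synthesize a vector even for a word never seen in training.
    model = fasttext.load_model("wiki.en.bin")

    # A likely out-of-vocabulary token still gets a 300-dimensional vector
    # assembled from its character n-grams.
    vec = model.get_word_vector("gastroenteritisish")
    print(vec.shape)  # (300,)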

Is FastText better than Word2Vec?

Although it takes longer to train a FastText model (the number of n-grams is greater than the number of words), it performs better than Word2Vec and allows rare words to be represented appropriately.

Can Word2Vec be used for text classification?

After feeding the Word2Vec algorithm with our corpus, it will learn a vector representation for each word. This by itself, however, is still not enough to be used as features for text classification, as each record in our data is a document, not a word.
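One common bridge, shown here as an illustrative sketch (gensim and scikit-learn are assumptions, not part of the fastText toolchain), is to average each document's word vectors into a single feature vector and feed that to an ordinary classifier:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    docs = [["cheap", "flights", "to", "london"],
            ["install", "mysql", "on", "ubuntu"]]
    labels = ["travel", "tech"]

    # Learn word vectors from the (toy) corpus.
    w2v = Word2Vec(sentences=docs, vector_size=100, min_count=1)

    def doc_vector(tokens):
        # Average the vectors of in-vocabulary tokens into one document vector.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    X = np.stack([doc_vector(d) for d in docs])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))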

What algorithm does FastText use?

FastText uses the Huffman algorithm to build these trees to make full use of the fact that classes can be imbalanced. The depth of frequently occurring labels is smaller than that of infrequent ones.
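To make the depth claim concrete, here is a small self-contained sketch (plain Python, not fastText's actual C++ implementation) that builds a Huffman tree over label frequencies and reports each label's depth:

    import heapq

    def huffman_depths(freqs):
        # Each heap entry: (frequency, tie-breaker, {label: depth so far}).
        heap = [(f, i, {label: 0}) for i, (label, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            # Merging two subtrees pushes every label in them one level deeper.
            merged = {k: v + 1 for k, v in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, counter, merged))
            counter += 1
        return heap[0][2]

    # With imbalanced counts, the frequent label sits closest to the root.
    print(huffman_depths({"common": 1000, "rare1": 10, "rare2": 5, "rare3": 1}))
    # {'common': 1, 'rare1': 2, 'rare2': 3, 'rare3': 3}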


1 Answer

FastText supervised training has a -pretrainedVectors argument, which can be used like this:

$ ./fasttext supervised -input train.txt -output model -epoch 25 \
       -wordNgrams 2 -dim 300 -loss hs -thread 7 -minCount 1 \
       -lr 1.0 -verbose 2 -pretrainedVectors wiki.ru.vec
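
For reference, supervised mode expects each line of train.txt to carry its label(s) inline, prefixed with __label__ by default (the prefix is configurable with the -label argument). The labels and texts below are made-up illustrations:

    __label__tech how to install mysql on ubuntu 16.04
    __label__travel cheap flights from london to paris
    __label__tech __label__howto set up git for a new laravel project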

A few things to consider:

  • The chosen embedding dimension must match the one used in the pretrained vectors; e.g. for the Wiki word vectors it must be 300. It is set by the -dim 300 argument.
  • As of mid-February 2018, the Python API (v0.8.22) doesn't support training with pretrained vectors (the corresponding parameter is ignored), so you must use the CLI (command-line interface) version for training. However, a model trained by the CLI with pretrained vectors can be loaded by the Python API and used for predictions (see the sketch after this list).
  • For a large number of classes (in my case there were 340 of them), even the CLI may break with an exception, so you will need to use the hierarchical softmax loss function (-loss hs).
  • Hierarchical softmax performs worse than normal softmax, so it can give up all the gain you've got from pretrained embeddings.
  • The model trained with pretrained vectors can be several times larger than one trained without them.
  • In my observation, the model trained with pretrained vectors overfits faster than one trained without.
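
As noted in the second bullet, a CLI-trained model can still be used from Python for inference. A minimal sketch, assuming the current fasttext pip package and the model.bin file produced by the -output model run above:

    import fasttext  # pip install fasttext

    # Load the binary model produced by the CLI training run (-output model).
    model = fasttext.load_model("model.bin")

    # Return the top 3 labels and their probabilities for one piece of text.
    labels, probs = model.predict("how to install mysql on ubuntu", k=3)
    print(labels, probs)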
answered Sep 27 '22 by Dmitry Kashtanov