I am working on a text classification problem: given some text, I need to assign it labels from a given set.
I have tried using the fastText library by Facebook, which has two utilities of interest to me:
A) Word Vectors with pre-trained models
B) Text Classification utilities
However, these seem to be completely independent tools, as I have been unable to find any tutorials that combine the two.
What I want is to classify some text while taking advantage of the pre-trained word-vector models. Is there any way to do this?
FastText's advantage over word2vec or GloVe, for example, is that it uses subword information to return vectors for OOV (out-of-vocabulary) words. The pretrained models come in two formats: .vec (a text file containing only the word vectors) and .bin (the full binary model, including the subword information).
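To make the subword idea concrete, here is a minimal sketch of character n-gram extraction. It is a simplification, not fastText's actual code: the function name char_ngrams is hypothetical, and the 3 to 6 n-gram range is an illustrative assumption. The word is wrapped in '<' and '>' so that prefixes and suffixes are distinguishable from n-grams in the middle of a word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>'.

    Hypothetical helper for illustration; the n-gram range is an assumption.
    """
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


# A word missing from the vocabulary still shares many n-grams with a
# related known word, so a vector can be composed from those shared pieces.
known = set(char_ngrams("classification"))
unseen = set(char_ngrams("classifications"))
shared = known & unseen
print(len(shared), "shared n-grams, e.g.", sorted(shared)[:5])
```

Because an OOV word's vector is composed from n-gram vectors like these, it can only be computed from a model that stores them, which is why the .bin format matters for OOV lookups.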
Although a FastText model takes longer to train (there are more n-grams than words), it often performs better than Word2Vec and allows rare words to be represented appropriately.
Once the Word2Vec algorithm has been fed our corpus, it learns a vector representation for each word. By itself, however, this is still not enough to serve as features for text classification, because each record in our data is a document, not a word.
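One common way to bridge the gap between word vectors and document features is to average the vectors of the words in a document. The sketch below assumes a plain dict of toy 3-dimensional vectors; document_vector and toy_vectors are hypothetical names used only for illustration, and averaging is one option among several, not necessarily what any particular tutorial uses.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # skip words that have no vector
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        return total  # no known words: fall back to the all-zero vector
    return [value / count for value in total]


# Toy embeddings (assumed values, not taken from any real model).
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}
doc = "a good movie".split()
features = document_vector(doc, toy_vectors, dim=3)
print(features)  # prints [2.0, 1.0, 1.0]
```

The resulting fixed-length vector can then be passed to any standard classifier as the feature representation of the document.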
For hierarchical softmax, FastText uses the Huffman algorithm to build the tree of labels, taking advantage of the fact that classes can be imbalanced: frequently occurring labels sit at a smaller depth than infrequent ones.
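The effect of Huffman coding on label depth can be sketched in a few lines. This is a generic Huffman construction over label counts, not fastText's own implementation; huffman_depths and the label counts are made up for illustration.

```python
import heapq
import itertools


def huffman_depths(label_counts):
    """Return {label: depth} for a Huffman tree built from label counts."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on ties
    heap = [(count, next(tie_breaker), {label: 0})
            for label, count in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every label inside them
        # moves one level further from the root.
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        merged = {label: depth + 1
                  for label, depth in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next(tie_breaker), merged))
    return heap[0][2]


# Hypothetical, imbalanced label counts.
counts = {"sports": 500, "politics": 300, "science": 150, "poetry": 50}
depths = huffman_depths(counts)
print(depths)
```

With these counts the most frequent label ends up one step from the root and the two rarest labels three steps away, so the most common predictions need the fewest binary decisions.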
FastText's supervised training has a -pretrainedVectors argument, which can be used like this:
$ ./fasttext supervised -input train.txt -output model -epoch 25 \
-wordNgrams 2 -dim 300 -loss hs -thread 7 -minCount 1 \
-lr 1.0 -verbose 2 -pretrainedVectors wiki.ru.vec
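A few things to consider:
The embedding dimension must match that of the pretrained vectors, which is what the -dim 300 argument sets here.
The command also selects hierarchical softmax as the loss (-loss hs).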
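Since the -dim value has to agree with the pretrained vectors, it can help to check the file before training. The sketch below relies on the .vec text layout, in which the first line holds the vocabulary size and the vector dimension; read_vec_header and the toy.vec file are hypothetical stand-ins so the example runs without downloading real vectors.

```python
import os
import tempfile


def read_vec_header(path):
    """Return (vocab_size, dim) from the first line of a .vec text file."""
    with open(path, encoding="utf-8") as handle:
        vocab_size, dim = handle.readline().split()
    return int(vocab_size), int(dim)


# Write a tiny stand-in .vec file (assumed contents, not real vectors).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.vec")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
    vocab_size, dim = read_vec_header(toy_path)

print(vocab_size, dim)  # prints 2 3
```

Pointing read_vec_header at the real vectors file would give the value to pass as -dim.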