FastText using pre-trained word vector for text classification

I am working on a text classification problem; that is, given some text, I need to assign certain labels to it.

I have tried using the fastText library by Facebook, which has two utilities of interest to me:

A) Word Vectors with pre-trained models

B) Text Classification utilities

However, these seem to be completely independent tools, as I have been unable to find any tutorial that combines the two.

What I want is to classify some text by taking advantage of the pre-trained word-vector models. Is there any way to do this?

asked Dec 07 '17 by JarvisIA

People also ask

Is FastText a Pretrained model?

FastText's advantage over word2vec or GloVe, for example, is that it uses subword information to return vectors for OOV (out-of-vocabulary) words. It offers two types of pretrained models: .vec and .bin.
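A minimal sketch of the OOV behavior, assuming the current fasttext pip package (the file name wiki.en.bin stands in for whichever pretrained .bin you download):

    import fasttext  # pip install fasttext

    # .bin files contain the full model, including subword (n-gram) vectors,
    # so they can synthesize a vector even for a word never seen in training.
    model = fasttext.load_model("wiki.en.bin")

    # A likely out-of-vocabulary token still gets a 300-dimensional vector
    # assembled from its character n-grams.
    vec = model.get_word_vector("gastroenteritisish")
    print(vec.shape)  # (300,)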

Is FastText better than Word2Vec?

Although it takes longer to train a FastText model (the number of n-grams is greater than the number of words), it performs better than Word2Vec and allows rare words to be represented appropriately.

Can Word2Vec be used for text classification?

After feeding the Word2Vec algorithm with our corpus, it will learn a vector representation for each word. This by itself, however, is still not enough to be used as features for text classification, as each record in our data is a document, not a word.
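One common bridge, shown here as an illustrative sketch (gensim and scikit-learn are assumptions, not part of the fastText toolchain), is to average each document's word vectors into a single feature vector and feed that to an ordinary classifier:

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    docs = [["cheap", "flights", "to", "london"],
            ["install", "mysql", "on", "ubuntu"]]
    labels = ["travel", "tech"]

    # Learn word vectors from the (toy) corpus.
    w2v = Word2Vec(sentences=docs, vector_size=100, min_count=1)

    def doc_vector(tokens):
        # Average the vectors of in-vocabulary tokens into one document vector.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    X = np.stack([doc_vector(d) for d in docs])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))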

What algorithm does FastText use?

FastText uses the Huffman algorithm to build these trees to make full use of the fact that classes can be imbalanced. The depth of frequently occurring labels is smaller than that of infrequent ones.
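To make the depth claim concrete, here is a small self-contained sketch (plain Python, not fastText's actual C++ implementation) that builds a Huffman tree over label frequencies and reports each label's depth:

    import heapq

    def huffman_depths(freqs):
        # Each heap entry: (frequency, tie-breaker, {label: depth so far}).
        heap = [(f, i, {label: 0}) for i, (label, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            # Merging two subtrees pushes every label in them one level deeper.
            merged = {k: v + 1 for k, v in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, counter, merged))
            counter += 1
        return heap[0][2]

    # With imbalanced counts, the frequent label sits closest to the root.
    print(huffman_depths({"common": 1000, "rare1": 10, "rare2": 5, "rare3": 1}))
    # {'common': 1, 'rare1': 2, 'rare2': 3, 'rare3': 3}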


1 Answer

FastText supervised training has a -pretrainedVectors argument, which can be used like this:

$ ./fasttext supervised -input train.txt -output model -epoch 25 \
       -wordNgrams 2 -dim 300 -loss hs -thread 7 -minCount 1 \
       -lr 1.0 -verbose 2 -pretrainedVectors wiki.ru.vec
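
For reference, supervised mode expects each line of train.txt to carry its label(s) inline, prefixed with __label__ by default (the prefix is configurable with the -label argument). The labels and texts below are made-up illustrations:

    __label__tech how to install mysql on ubuntu 16.04
    __label__travel cheap flights from london to paris
    __label__tech __label__howto set up git for a new laravel project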

A few things to consider:

  • The chosen embedding dimension must match the one used in the pretrained vectors; e.g. for the Wiki word vectors it must be 300. It is set by the -dim 300 argument.
  • As of mid-February 2018, the Python API (v0.8.22) doesn't support training with pretrained vectors (the corresponding parameter is ignored), so you must use the CLI (command-line interface) version for training. However, a model trained by the CLI with pretrained vectors can be loaded by the Python API and used for predictions (see the sketch after this list).
  • For a large number of classes (in my case there were 340 of them), even the CLI may break with an exception, so you will need to use the hierarchical softmax loss function (-loss hs).
  • Hierarchical softmax performs worse than normal softmax, so it can give up all the gain you've got from pretrained embeddings.
  • The model trained with pretrained vectors can be several times larger than one trained without them.
  • In my observation, the model trained with pretrained vectors overfits faster than one trained without.
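
As noted in the second bullet, a CLI-trained model can still be used from Python for inference. A minimal sketch, assuming the current fasttext pip package and the model.bin file produced by the -output model run above:

    import fasttext  # pip install fasttext

    # Load the binary model produced by the CLI training run (-output model).
    model = fasttext.load_model("model.bin")

    # Return the top 3 labels and their probabilities for one piece of text.
    labels, probs = model.predict("how to install mysql on ubuntu", k=3)
    print(labels, probs)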
answered Sep 27 '22 by Dmitry Kashtanov