I tried to follow this.
But some how I wasted a lot of time ending up with nothing useful.
I just want to train a GloVe
model on my own corpus (~900Mb corpus.txt file).
I downloaded the files provided in the link above and compiled it using cygwin
(after editing the demo.sh file and changed it to VOCAB_FILE=corpus.txt
. should I leave CORPUS=text8
unchanged?)
the output was:
How can I used those files to load it as a GloVe
model on python?
The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics.
GloVe, coined from Global Vectors, is a model for distributed word representation. The model is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.
You can do it using GloVe library:
Install it: pip install glove_python
Then:
from glove import Corpus, Glove
#Creating a corpus object
corpus = Corpus()
#Training the corpus to generate the co-occurrence matrix which is used in GloVe
corpus.fit(lines, window=10)
glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
Reference: word vectorization using glove
This is how you run the model
$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make
To train it on your own corpus, you just have to make changes to one file, that is demo.sh.
Remove the script from if to fi after 'make'. Replace the CORPUS name with your file name 'corpus.txt' There is another if loop at the end of file 'demo.sh'
if [ "$CORPUS" = 'text8' ]; then
Replace text8 with your file name.
Run the demo.sh once the changes are made.
$ ./demo.sh
Make sure your corpus file is in the correct format.You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by new line characters.
your corpus should go to variable CORPUS. The vectors.txt is the output, which suppose to be useful. You can train Glove in python, but it takes more time and you need to have C compiling environment. I tried it before and won't recommend it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With