How to Train GloVe algorithm on my own corpus

Tags:

nlp

stanford-nlp

gensim

word2vec

glove

I tried to follow this.
But somehow I wasted a lot of time and ended up with nothing useful.
I just want to train a GloVe model on my own corpus (a ~900 MB corpus.txt file). I downloaded the files provided in the link above and compiled them using cygwin, after editing demo.sh and changing it to VOCAB_FILE=corpus.txt. (Should I leave CORPUS=text8 unchanged?) The output was:

cooccurrence.bin
cooccurrence.shuf.bin
text8
corpus.txt
vectors.txt

How can I use those files to load a GloVe model in Python?

600

asked Feb 24 '18 11:02

Codir

3 Answers

You can do it using the glove_python library:

Install it: pip install glove_python

Then:

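from glove import Corpus, Glove

# lines must be an iterable of tokenized sentences (lists of words).
# Assumption: corpus.txt holds one document per line, words separated by whitespace.
with open('corpus.txt', encoding='utf-8') as f:
    lines = [line.split() for line in f]

# Create a corpus object
corpus = Corpus()

# Fit the corpus to build the co-occurrence matrix that GloVe is trained on
corpus.fit(lines, window=10)

# Train the GloVe model on the co-occurrence matrix
glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
# Attach the word-to-id dictionary so vectors can be looked up by word
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')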
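
For a corpus of several hundred megabytes, holding every tokenized line in a list may use a lot of memory. The sketch below shows one possible way to stream the file instead. `SentenceStream` is a hypothetical helper name, not part of glove_python, and it assumes one document per line with whitespace-separated words. It also assumes that `corpus.fit` accepts any iterable of token lists, which is worth checking against the installed version.

```python
import os
import tempfile


class SentenceStream:
    """Re-iterable stream of tokenized lines from a text file (hypothetical helper)."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every iteration so the stream can be consumed more than once.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the quick brown fox\n\njumps over the lazy dog\n")
    demo_path = tmp.name

demo_sentences = list(SentenceStream(demo_path))
os.remove(demo_path)
print(demo_sentences)
```

Under those assumptions, `SentenceStream('corpus.txt')` could be passed where the snippet above uses `lines`.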

Reference: word vectorization using glove

102

answered Oct 24 '22 08:10

Minions

This is how you run the model

$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make

To train it on your own corpus, you just have to make changes to one file, that is demo.sh.

Remove the if ... fi block that follows 'make' (it downloads the text8 demo corpus). Set CORPUS to your file name, 'corpus.txt'. There is another if block at the end of 'demo.sh':

if [ "$CORPUS" = 'text8' ]; then

Replace text8 with your file name.

Run demo.sh once the changes are made.

$ ./demo.sh

Make sure your corpus file is in the correct format. You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by newline characters.
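
As a rough illustration of that format, here is a minimal Python sketch that flattens each document onto a single line and joins the documents with newlines. The function names and the sample documents are made up for the example. It does no tokenization or lowercasing, so punctuation stays attached to words unless you clean it separately.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def flatten_document(text):
    """Collapse all whitespace runs (spaces, tabs, newlines) into single spaces."""
    return _WHITESPACE.sub(" ", text).strip()


def build_corpus_text(documents):
    """Return corpus text with exactly one non-empty document per line."""
    lines = [flatten_document(doc) for doc in documents]
    return "\n".join(line for line in lines if line) + "\n"


# Hypothetical sample input: two real documents and one that is only whitespace.
sample_documents = [
    "First document,\nspread over\ttwo lines.",
    "   ",
    "Second   document.",
]
corpus_text = build_corpus_text(sample_documents)
print(corpus_text, end="")
```

Writing `corpus_text` to corpus.txt (for example with `open('corpus.txt', 'w', encoding='utf-8')`) would give a file in the layout described above. For a very large collection you would probably write line by line instead of building one string.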

answered Oct 24 '22 08:10

Palak Bansal

Your corpus should go in the CORPUS variable. Leave VOCAB_FILE as it was: it is a file the script writes, not an input. vectors.txt is the output, and it is the file that is supposed to be useful. You can train GloVe in Python, but it takes more time and you need a C compiling environment. I tried it before and wouldn't recommend it.
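
To use vectors.txt from Python, one option is to parse it directly. The sketch below assumes the text layout where each line is a word followed by its space-separated vector components. The function names and the sample lines are illustrative only, not part of any library.

```python
import math


def parse_glove_lines(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def load_glove_vectors(path, encoding="utf-8"):
    """Read a GloVe text file such as vectors.txt from disk."""
    with open(path, encoding=encoding) as handle:
        return parse_glove_lines(handle)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up sample lines standing in for the contents of vectors.txt.
sample_lines = ["king 0.5 0.1 -0.3", "queen 0.4 0.2 -0.3", ""]
embeddings = parse_glove_lines(sample_lines)
print(len(embeddings), cosine_similarity(embeddings["king"], embeddings["queen"]))
```

With a real file, `load_glove_vectors('vectors.txt')` would return the same kind of dictionary. If gensim 4.x is installed, `KeyedVectors.load_word2vec_format('vectors.txt', binary=False, no_header=True)` should be able to read the same file, though that is worth verifying against the gensim version in use.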

answered Oct 24 '22 08:10

MLam
