
How to implement Word2Vec in Java?

I installed word2vec using this tutorial on my Ubuntu laptop. Is it strictly necessary to install DL4J in order to use word2vec vectors in Java? I'm comfortable working in Eclipse, and I'm not sure that I want all the other prerequisites that DL4J wants me to install.

Ideally there would be a really easy way for me to just use the Java code I've already written (in Eclipse) and change a few lines -- so that word look-ups that I am doing would retrieve a word2Vec vector instead of the current retrieval process I'm using.


I've also looked into using GloVe, but I do not have MATLAB. Is it possible to use GloVe without MATLAB? (I got an error while installing it because of this.) If so, the same question as above applies: I have no idea how to implement it in Java.

Nate Cook3 asked Jul 15 '15 18:07



2 Answers

What is preventing you from saving the word2vec (the C program) output in text format, then reading the file with a piece of Java code and loading the vectors into a HashMap keyed by the word string?

Some code snippets:

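// Class to store a HashMap of word vectors, keyed by word
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

public class WordVecs {

    HashMap<String, WordVec> wordvecmap;
    // .... (elided; "prop", used below, is declared here)
    void loadFromTextFile() {
        // Path of the text-format vector file, read from configuration
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;

            // Each line holds one word followed by its vector components
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    // ....
}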

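// Class for a single word vector
public class WordVec implements Comparable<WordVec> {
    String word;   // the word this vector belongs to
    float[] vec;   // the vector components
    // "norm" and getNorm() are declared in the elided code below

    public WordVec(String line) {
        // A line is the word followed by whitespace-separated components
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length-1];
        for (int i = 1; i < tokens.length; i++)
            vec[i-1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    // .... (elided: getNorm(), compareTo(), etc.)
}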

If you want to get the nearest neighbours for a given word, you can keep a list of N nearest pre-computed neighbours associated with each WordVec object.
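
As a rough illustration of the nearest-neighbour idea, here is a minimal sketch that ranks words by cosine similarity. It is not part of the snippets above: the class name NearestNeighbours, the methods cosine and nearest, and the use of a plain Map<String, float[]> instead of WordVec objects are all assumptions made for this example. It scans the whole vocabulary on every query, so for a large vocabulary you would probably run it once per word and cache the result, as suggested above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not taken from the answer's code.
class NearestNeighbours {

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0.0;   // avoid dividing by zero
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns up to n words most similar to the query word,
    // most similar first, excluding the query word itself.
    static List<String> nearest(Map<String, float[]> vectors, String query, int n) {
        float[] q = vectors.get(query);
        if (q == null) return new ArrayList<>();   // unknown word

        // Score every other word once, then sort by descending score.
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (!e.getKey().equals(query)) {
                scores.put(e.getKey(), cosine(q, e.getValue()));
            }
        }
        List<String> words = new ArrayList<>(scores.keySet());
        words.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }
}
```

Calling NearestNeighbours.nearest(vectors, "king", 10) would return the ten highest-scoring words for "king", or an empty list if the word is not in the map.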

Debasis answered Oct 11 '22 03:10


DL4J author here. Our word2vec implementation is targeted at people who need custom pipelines. I don't blame you for going the simple route here.

Our word2vec implementation is meant for when you want to do something with the vectors, not for messing around. The C word2vec format is pretty straightforward.
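
To make the text format concrete, here is a minimal sketch of a loader with no DL4J dependency. It assumes the vectors were saved in text mode, where the C tool writes a first line containing the vocabulary size and the vector dimension, followed by one line per word. The class name TextVectorLoader, the methods load and isInteger, and the header-detection heuristic are assumptions made for this example, not DL4J's API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader class; the names here are illustrative only.
class TextVectorLoader {

    // Reads "word c1 c2 ... cn" lines into a map from word to vector.
    static Map<String, float[]> load(Reader in) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                boolean wasFirst = first;
                first = false;
                // Heuristic (an assumption): a first line made of exactly
                // two integers is the "<vocab size> <dimension>" header.
                if (wasFirst && tokens.length == 2
                        && isInteger(tokens[0]) && isInteger(tokens[1])) {
                    continue;
                }
                if (tokens.length < 2) continue;   // blank or malformed line
                float[] vec = new float[tokens.length - 1];
                for (int i = 1; i < tokens.length; i++) {
                    vec[i - 1] = Float.parseFloat(tokens[i]);
                }
                vectors.put(tokens[0], vec);
            }
        }
        return vectors;
    }

    // True if the string consists only of decimal digits.
    static boolean isInteger(String s) {
        return s.matches("\\d+");
    }
}
```

With a java.io.FileReader opened on the saved vector file passed to load, a word look-up in existing code then reduces to vectors.get(word), which returns null for words outside the vocabulary.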

Here is the parsing logic in Java if you'd like: https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113

Hope that helps a bit

Adam Gibson answered Oct 11 '22 03:10