Relation between Word2Vec vector size and total number of words scanned?

What is the optimum number of vector size to be set in word2vec algorithm if the total number of unique words is greater than 1 billion?

I am using Apache Spark Mllib 1.6.0 for word2vec.

Sample code:

import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Main {

    public static void main(String[] args) throws IOException {

        SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // Input data: each row is a bag of words from a sentence or document.
        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
            RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("Hi I heard about Java".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        ));
        StructType schema = new StructType(new StructField[]{
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);

        // Learn a mapping from words to vectors.
        Word2Vec word2Vec = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(3) // What is the optimum value to set here?
            .setMinCount(0);
        Word2VecModel model = word2Vec.fit(documentDF);
        DataFrame result = model.transform(documentDF);
        result.show(false);
        for (Row r : result.select("result").take(3)) {
            System.out.println(r);
        }

        jsc.stop();
    }
}
Rahul asked Oct 04 '17


People also ask

What size should vector be Word2Vec?

The standard Word2Vec pre-trained vectors, as mentioned above, have 300 dimensions. We have tended to use 200 or fewer, under the rationale that our corpus and vocabulary are much smaller than those of Google News, and so we need fewer dimensions to represent them.

How many words are there in Word2Vec?

Researchers have found that Word2vec has a steep learning curve, outperforming another word-embedding technique, latent semantic analysis (LSA), when it is trained on a medium to large corpus (more than 10 million words). With a small training corpus, however, LSA showed better performance.

How is Word2Vec measured?

To assess which word2vec model is best, calculate the distance between the vectors for each pair of test words (for example, 200 pairs of words known to be related), sum up those distances, and the model with the smallest total distance will be your best model.
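
For illustration, a minimal sketch of that comparison in Java; it assumes you have already pulled each test word's vector out of a trained model as a double[], and the pair array is a hypothetical stand-in for your own test set:

// Hypothetical sketch: score a model by summing cosine distances over known-related word pairs.
public final class ModelScorer {

    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // pairs: an array of [vectorOfWordA, vectorOfWordB] for each test pair (e.g. 200 of them).
    static double totalDistance(double[][][] pairs) {
        double sum = 0.0;
        for (double[][] pair : pairs) {
            sum += cosineDistance(pair[0], pair[1]);
        }
        return sum; // the model with the smallest total wins
    }
}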

Is Word2Vec bag of words?

Word2vec is not a single algorithm but a combination of two techniques: CBOW (continuous bag of words) and the skip-gram model. Both are shallow neural networks that map one or more input words to a target word, and both learn weights that act as the word vector representations.


2 Answers

There's no one answer: it will depend on your dataset and goals.

Common values for the dimensionality-size of word-vectors are 300-400, based on values preferred in some of the original papers.

But the best approach is to create some sort of project-specific quantitative quality score (are the word-vectors performing well in your intended application?) and then optimize the size like any other meta-parameter.
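
For example, a minimal sketch of that kind of sweep using the Spark MLlib API from the question; documentDF is the DataFrame from the question's code, and the scoring function is a placeholder for whatever project-specific metric you choose:

// Hypothetical sweep over candidate vector sizes; "scoreModel" stands in for your own
// quality metric (analogy accuracy, downstream classifier accuracy, etc.).
static int pickVectorSize(DataFrame documentDF,
                          java.util.function.ToDoubleFunction<Word2VecModel> scoreModel) {
    int[] candidateSizes = {50, 100, 200, 300};
    int bestSize = candidateSizes[0];
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int size : candidateSizes) {
        Word2VecModel m = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(size)
            .setMinCount(5)        // see the note on minimum count below
            .fit(documentDF);
        double score = scoreModel.applyAsDouble(m);
        if (score > bestScore) {
            bestScore = score;
            bestSize = size;
        }
    }
    return bestSize;
}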

Separately, if you truly have 1 billion unique words (a 1-billion-word vocabulary), it will be hard to train vectors for all of them in typical system environments. (A 1-billion-word vocabulary is about 333 times larger than Google's released 3-million-vector dataset.)

1 billion 300-dimensional word-vectors would require (1 billion * 300 float dimensions * 4 bytes/float =) 1.2TB of addressable memory (essentially, RAM) just to store the raw vectors during training. (The neural network will need another 1.2TB for output-weights during training, plus other supporting structures.)
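
The same arithmetic as a quick back-of-the-envelope calculation (decimal terabytes, using the counts above):

long vocabSize = 1_000_000_000L;   // 1 billion unique words
int vectorSize = 300;              // dimensions per word vector
int bytesPerFloat = 4;

long inputVectorBytes = vocabSize * vectorSize * bytesPerFloat;   // raw vectors
long outputWeightBytes = inputVectorBytes;                        // output weights, same shape
System.out.printf("raw vectors:         %.1f TB%n", inputVectorBytes / 1e12);                       // 1.2 TB
System.out.printf("plus output weights: %.1f TB%n", (inputVectorBytes + outputWeightBytes) / 1e12); // 2.4 TB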

Relatedly, words with very few occurrences can't get quality word-vectors from those few contexts, but still tend to interfere with the training of nearby words, so a minimum count of 0 is never a good idea. Throwing away more of the lower-frequency words tends to speed training, lower memory requirements, and improve the quality of the remaining words' vectors.
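
Applied to the question's code, that just means raising minCount on the Spark Word2Vec estimator; 5 is the default in most Word2Vec implementations (including Spark MLlib's) and is only a starting point to tune, not a firm recommendation:

// Instead of .setMinCount(0), drop rare words entirely; tune this upward for a very large vocabulary.
Word2Vec word2Vec = new Word2Vec()
    .setInputCol("text")
    .setOutputCol("result")
    .setVectorSize(300)  // tune per the discussion above
    .setMinCount(5);     // ignore words that appear fewer than 5 times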

gojomo answered Sep 27 '22


According to research, the quality of vector representations improves as you increase the vector size, up to about 300 dimensions; beyond 300 dimensions, the quality starts to decrease. You can find an analysis of different vector and vocabulary sizes here (see Table 2, where SG refers to the skip-gram model, one of the two models behind Word2Vec).

Your choice of vector size also depends on your computational power: even though 300 probably gives you the most reliable vectors, you may need to lower the size if your machine is too slow at computing them.

TrnKh answered Sep 27 '22