What is the optimal vector size to set in the word2vec algorithm if the total number of unique words is greater than 1 billion?
I am using Apache Spark MLlib 1.6.0 for word2vec.
Sample code:
import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.*;

public class Main {
    public static void main(String[] args) throws IOException {
        // Run locally, using all available cores.
        SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // Input data: Each row is a bag of words from a sentence or document.
        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
            RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("Hi I heard about Java".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        ));
        // Single column "text" holding an array of strings.
        StructType schema = new StructType(new StructField[]{
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);

        // Learn a mapping from words to Vectors.
        Word2Vec word2Vec = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(3) // What is the optimum value to set here?
            .setMinCount(0);
        Word2VecModel model = word2Vec.fit(documentDF);

        // Each document is mapped to a vector in the "result" column.
        DataFrame result = model.transform(documentDF);
        result.show(false);
        for (Row r : result.select("result").take(3)) {
            System.out.println(r);
        }
    }
}
The standard pre-trained Word2Vec vectors have 300 dimensions. We have tended to use 200 or fewer, on the rationale that our corpus and vocabulary are much smaller than those of Google News, so we need fewer dimensions to represent them.
They found that Word2vec has a steep learning curve, outperforming another word-embedding technique, latent semantic analysis (LSA), when it is trained on a medium to large corpus (more than 10 million words). However, with a small training corpus, LSA showed better performance.
To assess which word2vec model is best, calculate the distance for each pair, repeat this 200 times, and sum the distances. The model with the smallest total distance is your best model.
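The comparison described above might be sketched as follows. This is only an illustration under several assumptions that the text does not state: each candidate model is represented as a plain map from word to vector, "distance" is taken to be cosine distance, and the test pairs are words you expect to be close to each other. The class and method names (`ModelComparison`, `totalDistance`, `bestModel`) and the toy vectors are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ModelComparison {

    // Cosine distance: 1 minus the cosine of the angle between a and b.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Sum of the distances over all test pairs for one model.
    static double totalDistance(Map<String, double[]> vectors, List<String[]> pairs) {
        double total = 0.0;
        for (String[] pair : pairs) {
            total += cosineDistance(vectors.get(pair[0]), vectors.get(pair[1]));
        }
        return total;
    }

    // Name of the model whose total distance over the pairs is smallest.
    static String bestModel(Map<String, Map<String, double[]>> models, List<String[]> pairs) {
        String best = null;
        double bestTotal = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Map<String, double[]>> entry : models.entrySet()) {
            double total = totalDistance(entry.getValue(), pairs);
            if (total < bestTotal) {
                bestTotal = total;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy vectors, invented for illustration only (not trained embeddings).
        Map<String, double[]> modelA = new HashMap<>();
        modelA.put("spark", new double[]{1.0, 0.1});
        modelA.put("java", new double[]{0.9, 0.2});
        Map<String, double[]> modelB = new HashMap<>();
        modelB.put("spark", new double[]{1.0, 0.0, 0.0});
        modelB.put("java", new double[]{0.0, 1.0, 0.0});

        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[]{"spark", "java"});

        Map<String, Map<String, double[]>> models = new LinkedHashMap<>();
        models.put("A", modelA);
        models.put("B", modelB);
        System.out.println("Best model: " + bestModel(models, pairs));
    }
}
```

Because each pair is only compared within a single model, the candidate models may use different vector sizes.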
Word2vec is not a single algorithm but a combination of two techniques: CBOW (continuous bag of words) and the skip-gram model. Both are shallow neural networks that map one or more input words to a target that is also one or more words. Both techniques learn weights that act as word-vector representations.
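To make the difference between the two techniques concrete, here is a minimal sketch of how training examples could be formed from one sentence. It only shows the input/target pairing and is a simplification: real implementations add details such as subsampling and negative sampling or hierarchical softmax. The class name `TrainingExamples`, its methods, and the fixed window of 1 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrainingExamples {

    // Words within `window` positions of index i, excluding the word at i itself.
    static List<String> context(List<String> sentence, int i, int window) {
        List<String> ctx = new ArrayList<>();
        int from = Math.max(0, i - window);
        int to = Math.min(sentence.size() - 1, i + window);
        for (int j = from; j <= to; j++) {
            if (j != i) {
                ctx.add(sentence.get(j));
            }
        }
        return ctx;
    }

    // Skip-gram: the center word is the input; each context word is a separate target.
    static List<String> skipGram(List<String> sentence, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            for (String target : context(sentence, i, window)) {
                examples.add(sentence.get(i) + " -> " + target);
            }
        }
        return examples;
    }

    // CBOW: the context words together are the input; the center word is the target.
    static List<String> cbow(List<String> sentence, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            examples.add(context(sentence, i, window) + " -> " + sentence.get(i));
        }
        return examples;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("Hi I heard about Spark".split(" "));
        System.out.println("skip-gram: " + skipGram(sentence, 1));
        System.out.println("CBOW:      " + cbow(sentence, 1));
    }
}
```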
There's no one answer: it will depend on your dataset and goals.
Common values for the dimensionality-size of word-vectors are 300-400, based on values preferred in some of the original papers.
But the best approach is to create some sort of project-specific quantitative quality score (are the word-vectors performing well in your intended application?) and then optimize the size like any other meta-parameter.
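Treating the size as a meta-parameter could look roughly like the sketch below: try a handful of candidate sizes, score each one, and keep the best. The scoring function is the project-specific part and is only a placeholder here. In practice it would train a model with `setVectorSize(size)` and evaluate it on your own task. `VectorSizeSearch`, `bestVectorSize`, and the candidate list are hypothetical names and values.

```java
import java.util.function.IntToDoubleFunction;

class VectorSizeSearch {

    // Returns the candidate size with the highest score (higher is assumed to be better).
    static int bestVectorSize(int[] candidates, IntToDoubleFunction scoreForSize) {
        if (candidates.length == 0) {
            throw new IllegalArgumentException("no candidate sizes given");
        }
        int best = candidates[0];
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int size : candidates) {
            double score = scoreForSize.applyAsDouble(size);
            if (score > bestScore) {
                bestScore = score;
                best = size;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // PLACEHOLDER scorer, not real measurements: replace with code that trains
        // a model at the given size and returns your application's quality score.
        IntToDoubleFunction placeholderScore = size -> -Math.abs(size - 200);

        int[] candidates = {50, 100, 200, 300, 400};
        System.out.println("Chosen size: " + bestVectorSize(candidates, placeholderScore));
    }
}
```

Since each score requires a full training run, it is usually worth starting with a coarse grid of sizes and refining only around the best ones.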
Separately, if you truly have 1 billion unique word tokens (a 1-billion-word vocabulary), it will be hard to train those vectors in typical system environments. (1 billion word-tokens is roughly 333 times larger than Google's released 3-million-vectors dataset.)
1 billion 300-dimensional word-vectors would require (1 billion * 300 float dimensions * 4 bytes/float =) 1.2TB of addressable memory (essentially, RAM) just to store the raw vectors during training. (The neural network will need another 1.2TB for output-weights during training, plus other supporting structures.)
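The arithmetic above can be reproduced for other vector sizes with a few lines of code. This is a back-of-the-envelope sketch only: it counts the raw vectors and an equally sized output-weight matrix, uses decimal terabytes (10^12 bytes), and ignores the other supporting structures. The class name `EmbeddingMemory` and the sizes tried are illustrative.

```java
class EmbeddingMemory {

    static final int BYTES_PER_FLOAT = 4;

    // Bytes needed for one vocabSize x vectorSize matrix of 4-byte floats.
    static long matrixBytes(long vocabSize, int vectorSize) {
        return vocabSize * vectorSize * BYTES_PER_FLOAT;
    }

    public static void main(String[] args) {
        long vocabSize = 1_000_000_000L; // 1 billion unique words, as in the question
        for (int vectorSize : new int[]{100, 200, 300}) {
            long vectors = matrixBytes(vocabSize, vectorSize);
            long withOutputWeights = 2 * vectors; // word vectors plus output weights
            System.out.printf("size=%d: %.1f TB for vectors, %.1f TB with output weights%n",
                    vectorSize, vectors / 1e12, withOutputWeights / 1e12);
        }
    }
}
```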
Relatedly, words with very few occurrences can't get quality word-vectors from those few contexts, but still tend to interfere with the training of nearby words. So a minimum-count of 0 is never a good idea, and throwing away more lower-frequency words tends to speed training, lower memory requirements, and improve the quality of the remaining words.
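To see how much a minimum count can shrink the vocabulary before training, you could count word frequencies and check how many words survive each threshold. The sketch below does this in plain Java on the toy sentences from the question. It is not Spark's implementation, only an illustration of the idea behind the `setMinCount` setting, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MinCountVocabulary {

    // Number of occurrences of each word across all sentences.
    static Map<String, Integer> countWords(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Words that occur at least minCount times.
    static Set<String> vocabulary(Map<String, Integer> counts, int minCount) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("Hi I heard about Spark".split(" ")),
            Arrays.asList("Hi I heard about Java".split(" ")),
            Arrays.asList("I wish Java could use case classes".split(" ")),
            Arrays.asList("Logistic regression models are neat".split(" ")));

        Map<String, Integer> counts = countWords(sentences);
        for (int minCount : new int[]{1, 2, 3}) {
            Set<String> kept = vocabulary(counts, minCount);
            System.out.println("minCount=" + minCount + ": " + kept.size() + " words " + kept);
        }
    }
}
```

On a real corpus, the same count table gives a quick estimate of the vocabulary size, and therefore the memory needed, for each candidate threshold.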
According to research, the quality of vector representations improves as you increase the vector size until you reach 300 dimensions. After 300 dimensions, the quality of the vectors starts to decrease. You can find an analysis of the different vector and vocabulary sizes here (see Table 2, where SG refers to the skip-gram model, which is the model behind Word2Vec).
Your choice of vector size also depends on your computational power. Even though 300 probably gives you the most reliable vectors, you may need to lower the size if your machine is too slow at computing the vectors.