Is there any relation between numFeatures in HashingTF in Spark MLlib and the actual number of terms in a document (sentence)?
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<Row> data = Arrays.asList(
    RowFactory.create(0.0, "Hi I heard about Spark"),
    RowFactory.create(0.0, "I wish Java could use case classes"),
    RowFactory.create(1.0, "Logistic regression models are neat")
);
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
// 'spark' is an existing SparkSession.
Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

// Split each sentence into lowercase tokens.
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> wordsData = tokenizer.transform(sentenceData);

// Hash each token into one of numFeatures buckets and count occurrences per bucket.
int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
Dataset<Row> featurizedData = hashingTF.transform(wordsData);
As mentioned in the Spark MLlib documentation, HashingTF converts each sentence into a feature vector of length numFeatures. What happens if a document (here, a sentence) contains thousands of terms? What should the value of numFeatures be, and how do you calculate it?
HashingTF uses the hashing trick, so it does not maintain a map between a word/token and its vector position. The transformer takes each word/token, applies a hash function (MurmurHash3_x86_32) to generate a hash value, and then performs a simple modulo operation (% numFeatures) to produce an index between 0 and numFeatures - 1. The resulting value is the index that gets incremented in the feature vector.
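To make the bucketing concrete, here is a minimal standalone sketch of the hash-then-modulo step. It is an illustration, not Spark's internal code: String.hashCode() stands in for MurmurHash3_x86_32, and only the shape of the computation is the point.

import java.util.Arrays;

public class HashingTrickSketch {
    // Illustration only: Spark actually uses MurmurHash3_x86_32, not String.hashCode().
    static int bucket(String token, int numFeatures) {
        // floorMod keeps the index non-negative, i.e. in [0, numFeatures - 1].
        return Math.floorMod(token.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        int numFeatures = 20;
        double[] vector = new double[numFeatures];
        for (String token : Arrays.asList("hi", "i", "heard", "about", "spark")) {
            vector[bucket(token, numFeatures)] += 1.0;  // count per hashed index
        }
        System.out.println(Arrays.toString(vector));
    }
}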
Given the nature of the algorithm, if numFeatures is less than the actual number of distinct words/tokens in the DataFrame, you are guaranteed to have an 'incorrect' frequency for at least one token (i.e. different tokens will hash to the same bucket). Note that even with numFeatures >= vocabularySize, collisions might still happen.
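You can provoke this deliberately by shrinking numFeatures well below the vocabulary size. With the sample data from the question (16 distinct tokens after tokenization) and an assumed numFeatures of 4, the pigeonhole principle forces several tokens into the same bucket:

// Far fewer buckets than the 16 distinct tokens in the sample data,
// so collisions are guaranteed.
HashingTF tinyTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("tinyFeatures")
    .setNumFeatures(4);
// A bucket count > 1 for a sentence whose words are all distinct means
// different tokens hashed to the same index.
tinyTF.transform(wordsData).select("tinyFeatures").show(false);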
What's the best value for numFeatures? I would take a number greater than the size of your 'vocabulary' (do not worry too much about space, as the features are stored in an ml.linalg.SparseVector). Note that (see the docs):
Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
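Putting that together, a simple heuristic is to estimate your vocabulary size and round it up to the next power of two. The vocabSize below is an assumed estimate you would supply from your own corpus:

int vocabSize = 50_000;  // assumed estimate of distinct tokens in your corpus
// Round up to the next power of two so the modulo spreads hashes evenly.
int numFeatures = vocabSize <= 1 ? 1 : Integer.highestOneBit(vocabSize - 1) << 1;
// e.g. vocabSize = 50_000 gives numFeatures = 65_536 (2^16)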
If you prefer an exact frequency count, take a look at CountVectorizer instead.
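For completeness, here is a minimal CountVectorizer sketch against the wordsData DataFrame built in the question. Unlike HashingTF, it fits an explicit term-to-index vocabulary first, so the counts are exact (no collisions) at the cost of an extra pass over the data:

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;

// fit() learns an explicit term -> index vocabulary, so there are no collisions.
CountVectorizerModel cvModel = new CountVectorizer()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setVocabSize(1 << 18)  // upper bound on the vocabulary size to keep
    .fit(wordsData);
Dataset<Row> exactCounts = cvModel.transform(wordsData);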