What is the relation between numFeatures in HashingTF in Spark MLlib and actual number of terms in a document?

Question

Is there any relation between numFeatures in HashingTF in Spark MLlib and the actual number of terms in a document (sentence)?

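import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// `spark` is assumed to be an existing SparkSession.
// Three labeled sentences; each sentence is treated as one document.
List<Row> data = Arrays.asList(
  RowFactory.create(0.0, "Hi I heard about Spark"),
  RowFactory.create(0.0, "I wish Java could use case classes"),
  RowFactory.create(1.0, "Logistic regression models are neat")
);
StructType schema = new StructType(new StructField[]{
  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
  new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

// Split each sentence into lowercase words.
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> wordsData = tokenizer.transform(sentenceData);

// Hash each word into one of numFeatures buckets and count occurrences per bucket.
int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(numFeatures);

Dataset<Row> featurizedData = hashingTF.transform(wordsData);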

As mentioned in the Spark MLlib documentation, HashingTF converts each sentence into a feature vector of length numFeatures. What happens if each document (here, each sentence) contains thousands of terms? What should the value of numFeatures be, and how should it be calculated?

Marsellus Wallace · Accepted Answer

HashingTF uses the hashing trick, which does not maintain a map between a word/token and its vector position. The transformer takes each word/token, applies a hash function (MurmurHash3_x86_32) to generate a 32-bit integer, and then performs a simple modulo operation (% numFeatures) to map it to an index in the range 0 to numFeatures - 1. That index is the position in the feature vector whose count is incremented.
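The hash-then-modulo step can be sketched in plain Java without Spark. This is only an illustration: it assumes String.hashCode() as a stand-in for MurmurHash3_x86_32, so the indices will not match what Spark produces, and the class and method names (HashingTrickSketch, bucketIndex, termFrequencies) are made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class HashingTrickSketch {
    // Maps a term to a bucket. String.hashCode() is an assumption standing in
    // for Spark's MurmurHash3; any deterministic int hash shows the same idea.
    static int bucketIndex(String term, int numFeatures) {
        // floorMod keeps the result in [0, numFeatures) even for negative hashes.
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Builds a sparse term-frequency vector as (bucket index -> count).
    // No word-to-index dictionary is stored anywhere.
    static Map<Integer, Double> termFrequencies(String[] terms, int numFeatures) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String term : terms) {
            vector.merge(bucketIndex(term, numFeatures), 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] words = "hi i heard about spark".split(" ");
        // Prints the non-zero entries of a length-20 vector.
        System.out.println(termFrequencies(words, 20));
    }
}
```

Whatever the document length, the vector length stays numFeatures; a longer document only changes how many buckets are non-zero and how large the counts are.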

Given the nature of the algorithm, if numFeatures is less than the number of distinct words/tokens in the DataFrame, you are guaranteed to have an 'incorrect' frequency for at least one token, because at least two different tokens must hash to the same bucket and their counts get added together. NOTE: even with numFeatures >= vocabularySize, collisions might still happen.
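Both cases can be seen in a small sketch. As an assumption for illustration, it uses String.hashCode() instead of Spark's MurmurHash3, so the specific colliding words here are not the ones that would collide in Spark; CollisionSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CollisionSketch {
    // Stand-in hash (assumption): String.hashCode() rather than MurmurHash3.
    static int bucketIndex(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Counts how many of the given distinct terms land in an already occupied bucket.
    static int collisions(List<String> distinctTerms, int numFeatures) {
        Set<Integer> occupied = new HashSet<>();
        int collisions = 0;
        for (String term : distinctTerms) {
            if (!occupied.add(bucketIndex(term, numFeatures))) {
                collisions++;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        // Three distinct terms, two buckets: a collision is unavoidable.
        System.out.println(collisions(Arrays.asList("a", "b", "c"), 2));  // prints 1
        // Two distinct terms, sixteen buckets: they still collide (97 % 16 == 113 % 16).
        System.out.println(collisions(Arrays.asList("a", "q"), 16));      // prints 1
    }
}
```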

What's the best value for numFeatures? I would take a number greater than the size of your vocabulary (do not worry too much about space, as the features are stored in an ml.linalg.SparseVector). Note that (see docs):

Since a simple modulo is used to transform the hash function to a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
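One possible way to turn this advice into a number, assuming you can afford a pass over the tokenized data to count distinct tokens, is to round the vocabulary size up to the next power of two. The helper names below (nextPowerOfTwo, suggestNumFeatures) are hypothetical and not part of Spark; treat the result as a lower bound rather than a tuned value, since collisions remain possible at any size.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NumFeaturesSketch {
    // Smallest power of two that is >= n, for 1 <= n <= 2^30.
    static int nextPowerOfTwo(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n must be in [1, 2^30]: " + n);
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? highest : highest << 1;
    }

    // Hypothetical helper: counts distinct tokens across all documents,
    // then rounds that count up to a power of two.
    static int suggestNumFeatures(List<String[]> tokenizedDocs) {
        Set<String> vocabulary = new HashSet<>();
        for (String[] doc : tokenizedDocs) {
            vocabulary.addAll(Arrays.asList(doc));
        }
        return nextPowerOfTwo(Math.max(1, vocabulary.size()));
    }

    public static void main(String[] args) {
        // The three example sentences, lowercased and split on spaces.
        List<String[]> docs = Arrays.asList(
            "hi i heard about spark".split(" "),
            "i wish java could use case classes".split(" "),
            "logistic regression models are neat".split(" "));
        // 16 distinct tokens, which is already a power of two.
        System.out.println(suggestNumFeatures(docs));  // prints 16
    }
}
```

In practice you may want to go one or more powers of two above this bound to lower the collision rate; the sparse vector representation means the extra length costs little.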

If you prefer an exact frequency count, take a look at CountVectorizer.

Tags: machine-learning, apache-spark, tf-idf, apache-spark-mllib

Rahul
