I am following this example from the Spark documentation for calculating the TF-IDF of a bunch of documents. Spark uses the hashing trick for these calculations, so at the end you get a Vector containing the hashed words and the corresponding weights, but... how can I get the words back from the hashes?
Do I really have to hash all the words and save them in a map, then iterate through it later looking for the keywords? Is there no more efficient way built into Spark?
Thanks in advance
HashingTF maps each String to a non-negative integer in the range [0, numFeatures)
(default 2^20) using org.apache.spark.util.Utils.nonNegativeMod(int, int).
The original string is lost; there is no way to convert from the resulting integer back to the input string. Note also that because the hash space is smaller than the vocabulary, distinct words can collide into the same bucket.
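If you do need the reverse lookup, building a map over your own vocabulary is essentially the only option, as you suspected. Here is a minimal sketch using the RDD-based MLlib HashingTF, whose indexOf method computes the same bucket that transform() uses. The object and method names (ReverseHashLookup, buildReverseMap) and the documents parameter are illustrative, not part of any Spark API:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.rdd.RDD

object ReverseHashLookup {
  // Build a lookup from hash bucket back to the words that landed in it.
  // Several words can collide into one bucket, hence Seq[String] values.
  def buildReverseMap(documents: RDD[Seq[String]],
                      hashingTF: HashingTF): Map[Int, Seq[String]] =
    documents
      .flatMap(identity)                             // every term in the corpus
      .distinct()
      .map(term => (hashingTF.indexOf(term), term))  // same bucket transform() uses
      .groupByKey()
      .mapValues(_.toSeq)
      .collect()
      .toMap
}
```

Given a TF-IDF SparseVector, you can then map each active index i back to its candidate words with something like indexToTerms.getOrElse(i, Seq.empty). Keep in mind the map only covers words that appeared in your corpus, and a collision means a bucket may resolve to more than one candidate.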