How Spark HashingTF works

I am new to Spark 2. I tried the Spark TF-IDF example:

from pyspark.ml.feature import Tokenizer, HashingTF

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)


hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)

It outputs

Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))

I expected that in rawFeatures I would get term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:

tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)

In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document. If the output rawFeatures dictionary contains the word index as key and the number of times the word appears in the document as value, why is key 1 equal to 3.0? No word appears in the document 3 times. This is confusing for me. What am I missing?

asked Feb 16 '17 by Yerzhan Torgayev

People also ask

How does HashingTF work?

HashingTF takes an RDD of lists as input. Each record can be an iterable of strings or other types. Applying HashingTF needs only a single pass over the data, while applying IDF needs two passes: one to compute the IDF vector and one to scale the term frequencies by IDF.
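
For example, a minimal sketch with the RDD-based pyspark.mllib API (assuming an existing SparkContext sc; the two toy documents are made up):

from pyspark.mllib.feature import HashingTF, IDF

docs = sc.parallelize([
    ["hi", "i", "heard", "about", "spark"],
    ["spark", "is", "fast"],
])

tf = HashingTF(numFeatures=32).transform(docs)  # single pass over the data
tf.cache()
idfModel = IDF().fit(tf)        # first pass: compute the IDF vector
tfidf = idfModel.transform(tf)  # second pass: scale term frequencies by IDF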

What is HashingTF in Spark?

HashingTF converts documents to vectors of fixed size. The default feature dimension is 262,144. Terms are mapped to indices using a hash function, MurmurHash 3, and the term frequencies are computed with respect to the mapped indices.
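
A quick way to see this default with the DataFrame API (a small sketch; the column names are arbitrary):

from pyspark.ml.feature import HashingTF

ht = HashingTF(inputCol="words", outputCol="rawFeatures")
print(ht.getNumFeatures())  # 262144, i.e. 2**18 buckets by default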

What is CountVectorizer in pyspark?

CountVectorizer in pyspark.ml.feature transforms a given text into a vector on the basis of the frequency (count) of each word, using an explicit vocabulary instead of hashing. (scikit-learn provides a tool of the same name for plain Python.)
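
For comparison, a minimal sketch of Spark's CountVectorizer, reusing the wordsData DataFrame built in the question:

from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="words", outputCol="features")
cvModel = cv.fit(wordsData)                # learns an explicit vocabulary
print(cvModel.vocabulary)                  # term at position i gets index i
cvModel.transform(wordsData).show(truncate=False)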


1 Answer

TL;DR: It is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets like the one used here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.

In detail: HashingTF by default uses MurmurHash 3. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source you'll see that the implementation is equivalent to:

import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 =>
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))
// Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)

When you take the non-negative modulo of these values (modulo numFeatures = 32) you'll get [24, 1, 13, 1, 1]:

// nonNegativeMod(x, mod) computes ((x % mod) + mod) % mod, as in Spark's internal Utils
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))
// List[Int] = List(24, 1, 13, 1, 1)

Three words from the list (i, about and spark) hash to the same bucket (1) and each occurs once, hence the 3.0 at index 1 in the result you get.
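
As a sketch of the fix suggested above (an illustration, not part of the original answer), raising numFeatures in the question's code makes collisions very unlikely for this toy example:

from pyspark.ml.feature import HashingTF

# Same pipeline as in the question, but with 2**18 buckets instead of 32
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 18)
hashingTF.transform(wordsData).select("rawFeatures").show(truncate=False)
# Each of the five words should now land in its own bucket (with high probability)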

Related:

  • What hashing function does Spark use for HashingTF and how do I duplicate it?
  • How to get word details from TF Vector RDD in Spark ML Lib?
answered Oct 06 '22 by zero323