I am new to Spark 2. I tried the Spark TF-IDF example:
from pyspark.ml.feature import Tokenizer, HashingTF

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)
It outputs
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
I expected that in rawFeatures I would get term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:

tf(w) = (number of times the word appears in a document) / (total number of words in the document)

In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document.
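To show what I mean, here is a tiny plain-Python sketch of the values I expected (term_frequencies is just my own illustrative helper, not a Spark API):

from collections import Counter

def term_frequencies(words):
    # relative frequency: count of each word divided by the total number of words
    counts = Counter(words)
    total = float(len(words))
    return {w: c / total for w, c in counts.items()}

print(term_frequencies([u'hi', u'i', u'heard', u'about', u'spark']))
# {'hi': 0.2, 'i': 0.2, 'heard': 0.2, 'about': 0.2, 'spark': 0.2}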
If we assume that the rawFeatures dictionary in the output uses the word index as the key and the number of times that word appears in the document as the value, why is key 1 equal to 3.0? There is no word that appears 3 times in the document. This is confusing to me. What am I missing?
TL;DR: It is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets like here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.
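For example, a minimal sketch reusing wordsData from the question (vocabSize is just an illustrative value):

from pyspark.ml.feature import CountVectorizer, HashingTF

# CountVectorizer builds an explicit vocabulary instead of hashing,
# so every distinct term gets its own index and collisions cannot happen.
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures", vocabSize=1000)
cvModel = cv.fit(wordsData)
cvModel.transform(wordsData).show(truncate=False)

# Or keep HashingTF and raise the number of buckets; 2**18 = 262144 is the
# default, which makes collisions on a small vocabulary very unlikely.
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=2**18)
hashingTF.transform(wordsData).show(truncate=False)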
In detail: HashingTF by default uses MurmurHash 3. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source you'll see that the implementation is equivalent to:
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

// 42 is the seed used by HashingTF
Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 =>
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))

Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
When you take the non-negative modulo of these values you'll get [24, 1, 13, 1, 1]:

// nonNegativeMod is defined in org.apache.spark.util.Utils
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))

List[Int] = List(24, 1, 13, 1, 1)
Three words from the list (i, about and spark) hash to the same bucket (index 1), and each occurs once, hence the 3.0 you see at that key.
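You can double-check this from Python as well: % with a positive divisor already returns a non-negative result, so a quick sketch using the hash values above gives the same buckets:

words = [u'hi', u'i', u'heard', u'about', u'spark']
hashes = [-537608040, -1265344671, 266149357, 146891777, 2101843105]

# Python's % with a positive modulus is non-negative, matching nonNegativeMod
buckets = {w: h % 32 for w, h in zip(words, hashes)}
print(buckets)
# {'hi': 24, 'i': 1, 'heard': 13, 'about': 1, 'spark': 1}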