I am new to Spark 2. I tried the Spark TF-IDF example:
from pyspark.ml.feature import Tokenizer, HashingTF

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)

for each in featurizedData.collect():
    print(each)
It outputs
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
I expected that in rawFeatures I would get term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:

tf(w) = (number of times the word appears in a document) / (total number of words in the document)

In our case that is tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document.
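To show what I mean, here is a tiny plain-Python sketch of the values I expected (term_frequencies is just my own illustrative helper, not a Spark API):

from collections import Counter

def term_frequencies(words):
    # relative frequency: count of each word divided by the total number of words
    counts = Counter(words)
    total = float(len(words))
    return {w: c / total for w, c in counts.items()}

print(term_frequencies([u'hi', u'i', u'heard', u'about', u'spark']))
# {'hi': 0.2, 'i': 0.2, 'heard': 0.2, 'about': 0.2, 'spark': 0.2}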
If we assume that the rawFeatures dictionary in the output uses the word index as the key and the number of times that word appears in the document as the value, why is key 1 equal to 3.0? There is no word that appears 3 times in the document. This is confusing to me. What am I missing?
TL;DR: It is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets like here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.
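For example, a minimal sketch reusing wordsData from the question (vocabSize is just an illustrative value):

from pyspark.ml.feature import CountVectorizer, HashingTF

# CountVectorizer builds an explicit vocabulary instead of hashing,
# so every distinct term gets its own index and collisions cannot happen.
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures", vocabSize=1000)
cvModel = cv.fit(wordsData)
cvModel.transform(wordsData).show(truncate=False)

# Or keep HashingTF and raise the number of buckets; 2**18 = 262144 is the
# default, which makes collisions on a small vocabulary very unlikely.
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=2**18)
hashingTF.transform(wordsData).show(truncate=False)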
In detail: HashingTF by default uses MurmurHash 3. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source you'll see that the implementation is equivalent to:
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes

// 42 is the seed used by HashingTF
Seq("hi", "i", "heard", "about", "spark")
  .map(UTF8String.fromString(_))
  .map(utf8 =>
    hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))

Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
When you take the non-negative modulo of these values you'll get [24, 1, 13, 1, 1]:

// nonNegativeMod is defined in org.apache.spark.util.Utils
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
  .map(nonNegativeMod(_, 32))

List[Int] = List(24, 1, 13, 1, 1)
Three words from the list (i, about and spark) hash to the same bucket (index 1), and each occurs once, hence the 3.0 you see at that key.
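You can double-check this from Python as well: % with a positive divisor already returns a non-negative result, so a quick sketch using the hash values above gives the same buckets:

words = [u'hi', u'i', u'heard', u'about', u'spark']
hashes = [-537608040, -1265344671, 266149357, 146891777, 2101843105]

# Python's % with a positive modulus is non-negative, matching nonNegativeMod
buckets = {w: h % 32 for w, h in zip(words, hashes)}
print(buckets)
# {'hi': 24, 'i': 1, 'heard': 13, 'about': 1, 'spark': 1}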