I'm trying to add a column to a DataFrame which will contain the hash of another column.
I've found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:
```scala
import org.apache.spark.sql.functions._

val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
```
But what hash function is used by that hash()? Is it murmur, sha, md5, or something else?
The value I get in this column is an integer, so the range of values here is presumably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?
pyspark.sql.functions.hash(*cols) — Calculates the hash code of given columns, and returns the result as an int column.
pyspark.sql.functions.sha2(col, numBits) — Returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256).
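So to the question "Can I get a string hash instead? How can I specify a concrete hashing algorithm?": the built-in sha2 function returns a hex string for a chosen SHA-2 variant. A minimal sketch, assuming a local SparkSession; the column name my_column and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sha2}

object Sha2Demo extends App {
  // Assumes a local Spark session for illustration.
  val spark = SparkSession.builder().master("local[*]").appName("sha2-demo").getOrCreate()
  import spark.implicits._

  val df = Seq("a", "b").toDF("my_column")
  // sha2 returns the digest as a hex string; 256 selects SHA-256.
  df.withColumn("hashed", sha2(col("my_column"), 256)).show(false)

  spark.stop()
}
```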
It is Murmur3, based on the source code:

```scala
/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}
```
If you want a Long hash, Spark 3 provides the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
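A quick sketch of xxhash64, which yields a 64-bit LongType column instead of the 32-bit integer produced by hash(); the column name and data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, xxhash64}
import org.apache.spark.sql.types.LongType

object XxHash64Demo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("xxhash64-demo").getOrCreate()
  import spark.implicits._

  val df = Seq("a", "b").toDF("my_column")
  val hashed = df.withColumn("hashed", xxhash64(col("my_column")))

  // The new column is LongType, i.e. a 64-bit hash.
  assert(hashed.schema("hashed").dataType == LongType)
  hashed.show(false)

  spark.stop()
}
```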
You may want only non-negative numbers. In that case you can cast the hash to a Long and add an offset. Note that adding Int.MaxValue alone is off by one: the minimum hash value Int.MinValue would still map to -1, so add Int.MaxValue.toLong + 1 (i.e. 2^31):

```scala
import org.apache.spark.sql.types.LongType

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue.toLong + 1).show()
```
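As for the last question, a custom hash function: there is no option to swap the algorithm inside hash() itself, but you can wrap any JVM hash in a UDF. A hedged sketch using MurmurHash3 from the Scala standard library (note this is Scala's own implementation, not necessarily bit-identical to Spark's Murmur3Hash; the column name and data are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.util.hashing.MurmurHash3

object CustomHashDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("custom-hash").getOrCreate()
  import spark.implicits._

  // Any deterministic JVM function can serve as the hash.
  val myHash = udf((s: String) => MurmurHash3.stringHash(s))

  val df = Seq("a", "b").toDF("my_column")
  df.withColumn("hashed", myHash(col("my_column"))).show(false)

  spark.stop()
}
```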