 

Hash function in Spark

I'm trying to add a column to a dataframe which will contain the hash of another column.

I've found this piece of documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:

import org.apache.spark.sql.functions._
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))

But which hash function does that hash() use? Is it Murmur, SHA, MD5, or something else?

The value I get in this column is an integer, so the range of values is presumably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm?
Can I use a custom hash function?

Viacheslav Shalamov asked Dec 05 '18 14:12

People also ask

What is hash function in Spark?

pyspark.sql.functions.hash(*cols) calculates the hash code of the given columns and returns the result as an int column.

What is hash partition in Spark?

Hash partitioning in Spark attempts to spread the data evenly across partitions based on the key. The Object.hashCode method is used to determine the partition, as partition = key.hashCode() % numPartitions.
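That formula can be sketched in plain Scala. One detail the formula above glosses over: a raw % in Java/Scala can return a negative value for a negative hash code, so Spark's HashPartitioner keeps the modulus non-negative. A minimal sketch (the function name partitionFor is just for illustration):

```scala
// Sketch of hash partitioning: map a key to a partition index in [0, numPartitions).
// Spark's HashPartitioner uses a non-negative modulus of key.hashCode.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  // % can be negative for negative hash codes; shift back into range
  rawMod + (if (rawMod < 0) numPartitions else 0)
}
```

For example, an Int key of -3 with 4 partitions gives rawMod = -3, which is shifted to partition 1.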

What is sha2 in Pyspark?

pyspark.sql.functions.sha2(col, numBits) returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). numBits indicates the desired bit length of the result, which must be 224, 256, 384, 512, or 0 (equivalent to 256).
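As a sketch of what sha2(col, 256) produces per row, here is the same computation in plain Scala (no Spark needed): the lowercase hex encoding of the SHA-256 digest of the value's UTF-8 bytes.

```scala
import java.security.MessageDigest

// Lowercase-hex SHA-256 of a string's UTF-8 bytes, mirroring sha2(col, 256).
def sha256Hex(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
```

For instance, sha256Hex("abc") yields the well-known test vector starting with "ba7816bf...".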

What is the function of Spark?

Spark represents datasets as data frames and provides functions to add, write, modify, and remove their columns. These built-in functions are available from multiple languages, including R, Python, Java, and Scala, and the set of Spark functions keeps evolving with new features.


2 Answers

It is Murmur3, based on the source code of functions.hash:

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }
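For illustration, the Scala standard library ships a Murmur3 implementation. Note the caveat: Spark's Murmur3Hash hashes its own internal binary encoding of the row values (with a fixed seed of 42), so the plain-Scala values below will not match Spark's hash() output; this only demonstrates that Murmur3 yields a 32-bit Int.

```scala
import scala.util.hashing.MurmurHash3

// Plain-Scala Murmur3 over a string, seeded with 42 for illustration only.
// Spark's hash() uses its own byte encoding, so values differ from these.
val h1: Int = MurmurHash3.stringHash("my_value", 42)
val h2: Int = MurmurHash3.stringHash("my_value", 42)
// Deterministic: the same input and seed always give the same Int.
```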
Fermat's Little Student answered Sep 26 '22 05:09


If you want a Long hash, Spark 3 has the xxhash64 function, e.g. xxhash64($"my_column"): https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.

You may want only non-negative numbers. In that case you can take hash and add Int.MaxValue:

import org.apache.spark.sql.types.LongType

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()

Note the worst case: when hash returns Int.MinValue, this shift yields -1, so add Int.MaxValue.toLong + 1 instead if the result must never be negative.
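The shift arithmetic itself needs no Spark and can be checked in plain Scala (the function name toNonNegative is just for illustration):

```scala
// hash() returns an Int in [Int.MinValue, Int.MaxValue]. Adding Int.MaxValue
// alone leaves a worst case of -1 (for Int.MinValue), so add one more to
// guarantee a non-negative Long.
def toNonNegative(h: Int): Long = h.toLong + Int.MaxValue.toLong + 1L
```

With this shift, Int.MinValue maps to 0 and Int.MaxValue maps to 4294967295.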
Galuoises answered Sep 24 '22 05:09