I'm trying to add a column to a DataFrame which will contain the hash of another column.
I've found this piece of documentation:
https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:
```scala
import org.apache.spark.sql.functions._

val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
```
But what hash function is used by that hash()? Is it murmur, sha, md5, or something else?
The value I get in this column is an integer, so the range of values here is presumably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?
pyspark.sql.functions.hash(*cols) — Calculates the hash code of given columns, and returns the result as an int column.
pyspark.sql.functions.sha2(col, numBits) — Returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256).
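So to the question "Can I get a string hash instead? How can I specify a concrete hashing algorithm?": the built-in sha2 function returns a hex string for a chosen SHA-2 variant. A minimal sketch, assuming a local SparkSession; the column name my_column and the sample data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sha2}

object Sha2Demo extends App {
  // Assumes a local Spark session for illustration.
  val spark = SparkSession.builder().master("local[*]").appName("sha2-demo").getOrCreate()
  import spark.implicits._

  val df = Seq("a", "b").toDF("my_column")
  // sha2 returns the digest as a hex string; 256 selects SHA-256.
  df.withColumn("hashed", sha2(col("my_column"), 256)).show(false)

  spark.stop()
}
```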
It is Murmur3, based on the source code:

```scala
/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}
```
If you want a Long hash, Spark 3 provides the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
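A quick sketch of xxhash64, which yields a 64-bit LongType column instead of the 32-bit integer produced by hash(); the column name and data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, xxhash64}
import org.apache.spark.sql.types.LongType

object XxHash64Demo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("xxhash64-demo").getOrCreate()
  import spark.implicits._

  val df = Seq("a", "b").toDF("my_column")
  val hashed = df.withColumn("hashed", xxhash64(col("my_column")))

  // The new column is LongType, i.e. a 64-bit hash.
  assert(hashed.schema("hashed").dataType == LongType)
  hashed.show(false)

  spark.stop()
}
```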
You may want only non-negative numbers. In that case you can cast the hash to a Long and add an offset. Note that adding Int.MaxValue alone is off by one: the minimum hash value Int.MinValue would still map to -1, so add Int.MaxValue.toLong + 1 (i.e. 2^31):

```scala
import org.apache.spark.sql.types.LongType

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue.toLong + 1).show()
```
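As for the last question, a custom hash function: there is no option to swap the algorithm inside hash() itself, but you can wrap any JVM hash in a UDF. A hedged sketch using MurmurHash3 from the Scala standard library (note this is Scala's own implementation, not necessarily bit-identical to Spark's Murmur3Hash; the column name and data are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.util.hashing.MurmurHash3

object CustomHashDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("custom-hash").getOrCreate()
  import spark.implicits._

  // Any deterministic JVM function can serve as the hash.
  val myHash = udf((s: String) => MurmurHash3.stringHash(s))

  val df = Seq("a", "b").toDF("my_column")
  df.withColumn("hashed", myHash(col("my_column"))).show(false)

  spark.stop()
}
```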