Given the following DataSet, inputData:
column0  column1  column2  column3
A        88       text     99
Z        12       test     200
T        120      foo      12
In Spark, what is an efficient way to compute a new hash column and append it to a new DataSet, hashedData, where hash is defined as the application of MurmurHash3 over each row of inputData?
Specifically, hashedData would look like:
column0  column1  column2  column3  hash
A        88       text     99       MurmurHash3.arrayHash(Array("A", 88, "text", 99))
Z        12       test     200      MurmurHash3.arrayHash(Array("Z", 12, "test", 200))
T        120      foo      12       MurmurHash3.arrayHash(Array("T", 120, "foo", 12))
Please let me know if any more specifics are necessary.
Any help is appreciated. Thanks!
One way is to use the withColumn function:
import org.apache.spark.sql.functions.{col, hash}
dataset.withColumn("hash", hash(dataset.columns.map(col):_*))
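For context, here is a minimal, self-contained sketch of that approach. The SparkSession setup and sample rows just recreate the data from the question (the column names column0..column3 are assumed):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash}

val spark = SparkSession.builder().appName("hash-example").master("local[*]").getOrCreate()
import spark.implicits._

// Recreate the sample data from the question
val inputData = Seq(
  ("A", 88, "text", 99),
  ("Z", 12, "test", 200),
  ("T", 120, "foo", 12)
).toDF("column0", "column1", "column2", "column3")

// Append a "hash" column computed over every existing column of the row
val hashedData = inputData.withColumn("hash", hash(inputData.columns.map(col): _*))

hashedData.show(false)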
Turns out that Spark already has this implemented as the hash function in the org.apache.spark.sql.functions package:
/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}
And in my case, applied as:
import org.apache.spark.sql.functions.{col, hash}
val newDs = typedRows.withColumn("hash", hash(typedRows.columns.map(col): _*))
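Note that withColumn returns an untyped DataFrame, so if you want a typed Dataset back, one option is to re-encode the result with a case class. A rough sketch, assuming hypothetical case classes that match the schema above:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, hash}
import spark.implicits._

// Hypothetical case classes; the real field names and types depend on your schema
case class InputRow(column0: String, column1: Int, column2: String, column3: Int)
case class HashedRow(column0: String, column1: Int, column2: String, column3: Int, hash: Int)

val typedRows: Dataset[InputRow] = inputData.as[InputRow]

// withColumn yields a DataFrame; .as[HashedRow] brings it back to a typed Dataset
val hashedDs: Dataset[HashedRow] =
  typedRows
    .withColumn("hash", hash(typedRows.columns.map(col): _*))
    .as[HashedRow]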
I truly have a lot to learn about Spark SQL :(.
Leaving this here in case someone else needs it. Thanks!