As stated in the question, I'm trying to generate a hash for each row of an RDD. For my purposes I cannot use the zipWithUniqueId() method; I need one hash of all the columns, for each Row of the RDD.
import hashlib

def hash_rows(df):
    for row in df.collect():
        return hashlib.sha1(str(row).encode('utf-8'))
I know this is the worst way, iterating over the collected rows on the driver, but I'm a beginner with PySpark. The problem is that I obtain the same hash for every row. I tried to use a strongly collision-resistant hash function, but it is too slow. Is there some way to solve the problem? Thanks in advance :)
Check out pyspark.sql.functions.sha2(col, numBits), which returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). It is available since Spark 1.5.
import pyspark.sql.functions as F

# Hash a single column with SHA-256
df2 = df.withColumn('my_col_hashed', F.sha2(F.col('my_col'), 256))
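Since the question asks for one hash over all of the columns, a minimal sketch of one way to extend this is to concatenate every column into a single string with concat_ws and hash the result. The column name 'row_hash' and the separator are just assumptions for illustration:

import pyspark.sql.functions as F

# Sketch: combine all columns into one string, then apply SHA-256 to the result.
# The '||' separator is an arbitrary choice to reduce accidental collisions
# such as ('ab', 'c') vs. ('a', 'bc'); pick one unlikely to appear in your data.
df_hashed = df.withColumn(
    'row_hash',
    F.sha2(F.concat_ws('||', *df.columns), 256)
)

This runs entirely inside Spark, so it avoids collecting the rows to the driver as in the original loop.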