
How to generate a hash for each row of an RDD? (PySpark)

As specified in the title, I'm trying to generate a hash for each row of an RDD. For my purposes I cannot use zipWithUniqueId(); I need one hash over all the columns, for each Row of the RDD.

for row in DataFrame.collect():
    return hashlib.sha1(str(row))

I know that iterating over the RDD like this is the worst way, but I'm a beginner with PySpark. The problem is that I obtain the same hash for each row. I tried using a strongly collision-resistant hash function, but it is too slow. Is there some way to solve the problem? Thanks in advance :)
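For reference, the snippet above stops after the first row because of the `return`, and `hashlib.sha1` needs bytes rather than a `str`. A minimal plain-Python sketch of a corrected per-row loop (the `rows` list stands in for `DataFrame.collect()` and is an assumption):

```python
import hashlib

# Hypothetical stand-in for DataFrame.collect(); each tuple is one row.
rows = [(1, 'a'), (2, 'b'), (1, 'a')]

hashes = []
for row in rows:
    # hashlib needs bytes, so encode str(row); hexdigest() gives the
    # printable hash. Returning inside the loop would stop after one row.
    hashes.append(hashlib.sha1(str(row).encode('utf-8')).hexdigest())
print(hashes)
```

Identical rows produce identical digests, and distinct rows produce distinct ones, so a "same hash for every row" symptom points at a bug in how the hash is built or printed rather than at SHA-1 itself.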

Mr do asked Jan 05 '23 03:01
1 Answer

Check out pyspark.sql.functions.sha2(col, numBits), which returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).

Available since Spark v1.5

import pyspark.sql.functions as F
df2 = df.withColumn('my_col_hashed', F.sha2(F.col('my_col'), 256))
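Since the question asks for one hash over all the columns, you can concatenate them with pyspark.sql.functions.concat_ws before hashing. A minimal sketch of what that computes per row, shown in plain Python (the '||' separator, column names, and sample data are assumptions):

```python
import hashlib

# PySpark version (assumed column names; '||' is an arbitrary separator):
#   df2 = df.withColumn('row_hash',
#                       F.sha2(F.concat_ws('||', *df.columns), 256))

# What sha2(concat_ws('||', ...), 256) computes per row, in plain Python:
rows = [('alice', 30), ('bob', 25)]  # hypothetical data

row_hashes = []
for row in rows:
    joined = '||'.join(str(v) for v in row)  # concat_ws step
    row_hashes.append(                        # sha2(..., 256) step
        hashlib.sha256(joined.encode('utf-8')).hexdigest())
print(row_hashes)
```

Pick a separator that cannot occur inside your column values, otherwise two different rows can concatenate to the same string and collide.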
Dylan Hogg answered Jan 13 '23 13:01