
How to generate a hash for each row of an RDD? (PySpark)

As specified in the title, I'm trying to generate a hash for each row of an RDD. For my purposes I cannot use zipWithUniqueId(); I need one hash over all the columns, for each Row of the RDD.

for row in DataFrame.collect():
    return hashlib.sha1(str(row))

I know that iterating over the RDD like this is the worst way, but I'm a beginner with PySpark. The problem is that I obtain the same hash for each row. I tried using a strongly collision-resistant hash function, but it is too slow. Is there some way to solve the problem? Thanks in advance :)
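For reference, the snippet above stops after the first row because of the `return`, and `hashlib.sha1` needs bytes rather than a `str`. A minimal plain-Python sketch of a corrected per-row loop (the `rows` list stands in for `DataFrame.collect()` and is an assumption):

```python
import hashlib

# Hypothetical stand-in for DataFrame.collect(); each tuple is one row.
rows = [(1, 'a'), (2, 'b'), (1, 'a')]

hashes = []
for row in rows:
    # hashlib needs bytes, so encode str(row); hexdigest() gives the
    # printable hash. Returning inside the loop would stop after one row.
    hashes.append(hashlib.sha1(str(row).encode('utf-8')).hexdigest())
print(hashes)
```

Identical rows produce identical digests, and distinct rows produce distinct ones, so a "same hash for every row" symptom points at a bug in how the hash is built or printed rather than at SHA-1 itself.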

Mr do asked Jan 05 '23 03:01
1 Answer

Check out pyspark.sql.functions.sha2(col, numBits), which returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).

Available since Spark v1.5

import pyspark.sql.functions as F
df2 = df.withColumn('my_col_hashed', F.sha2(F.col('my_col'), 256))
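Since the question asks for one hash over all the columns, you can concatenate them with pyspark.sql.functions.concat_ws before hashing. A minimal sketch of what that computes per row, shown in plain Python (the '||' separator, column names, and sample data are assumptions):

```python
import hashlib

# PySpark version (assumed column names; '||' is an arbitrary separator):
#   df2 = df.withColumn('row_hash',
#                       F.sha2(F.concat_ws('||', *df.columns), 256))

# What sha2(concat_ws('||', ...), 256) computes per row, in plain Python:
rows = [('alice', 30), ('bob', 25)]  # hypothetical data

row_hashes = []
for row in rows:
    joined = '||'.join(str(v) for v in row)  # concat_ws step
    row_hashes.append(                        # sha2(..., 256) step
        hashlib.sha256(joined.encode('utf-8')).hexdigest())
print(row_hashes)
```

Pick a separator that cannot occur inside your column values, otherwise two different rows can concatenate to the same string and collide.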
Dylan Hogg answered Jan 13 '23 13:01