 

pyspark generate row hash of specific columns and add it as a new column

I am working with spark 2.2.0 and pyspark2.

I have created a DataFrame df and am now trying to add a new column "rowhash" that is the sha2 hash of specific columns in the DataFrame.

For example, say that df has the columns: (column1, column2, ..., column10)

I require sha2((column2||column3||column4||...... column8), 256) in a new column "rowhash".

So far, I have tried the following methods:

1) Used the hash() function, but since it gives an integer output it is not of much use.

2) Tried using the sha2() function, but it is failing.

Say columnarray holds the list of column names I need.

def concat(columnarray):
    # join the column names into a single '||'-separated string
    concat_str = ''
    for val in columnarray:
        concat_str = concat_str + '||' + str(val)
    concat_str = concat_str[2:]   # drop the leading '||'
    return concat_str

and then

df1 = df1.withColumn("row_sha2", sha2(concat(columnarray),256))

This fails with a "cannot resolve" error, I suspect because concat() returns a plain Python string of the joined column names, which Spark then tries to resolve as a single column name.
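
I think the fix would be to build a single Column expression instead of a Python string. A rough, untested sketch using pyspark.sql.functions.concat and lit (columnarray is my list of column names):

from pyspark.sql import functions as F

# untested sketch: interleave the actual columns with '||' literals,
# so sha2() receives a Column expression rather than a string of joined names
def concat_cols(columnarray):
    parts = []
    for name in columnarray:
        if parts:
            parts.append(F.lit('||'))
        parts.append(F.col(name))
    return F.concat(*parts)

df1 = df1.withColumn("row_sha2", F.sha2(concat_cols(columnarray), 256))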

Thanks gaw for your answer. Since I have to hash only specific columns, I created a list of those column names (in hash_col) and changed your function as follows:

def sha_concat(row, columnarray):
    row_dict = row.asDict()      # transform row to a dict
    concat_str = ''
    for v in columnarray:
        concat_str = concat_str + '||' + str(row_dict.get(v))
    concat_str = concat_str[2:]
    # preserve concatenated value for testing (this can be removed later)
    row_dict["sha_values"] = concat_str
    row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest()
    return Row(**row_dict)

Then I called it as:

    df1.rdd.map(lambda row: sha_concat(row,hash_col)).toDF().show(truncate=False)

However, it is now failing with the error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 8: ordinal not in range(128)

I can see the value \ufffd in one of the columns, so I am unsure if there is a way to handle this?
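
One workaround I am considering (untested, assuming Python 2 as the traceback suggests) is to keep the concatenated value as unicode and encode it to UTF-8 only when hashing, so hashlib does not fall back to the ASCII codec:

import hashlib
from pyspark.sql import Row

def sha_concat_utf8(row, columnarray):
    # untested sketch: build a unicode string, encode explicitly before hashing
    row_dict = row.asDict()
    concat_str = u'||'.join(unicode(row_dict.get(v)) for v in columnarray)
    row_dict["sha_values"] = concat_str
    row_dict["sha_hash"] = hashlib.sha256(concat_str.encode('utf-8')).hexdigest()
    return Row(**row_dict)

df1.rdd.map(lambda row: sha_concat_utf8(row, hash_col)).toDF().show(truncate=False)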

asked Sep 12 '18 by msashish



3 Answers

You can use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to get the SHA256 hash.

Using the data from @gaw:

from pyspark.sql.functions import sha2, concat_ws
df = spark.createDataFrame(
    [(1,"2",5,1),(3,"4",7,8)],
    ("col1","col2","col3","col4")
)
df.withColumn("row_sha2", sha2(concat_ws("||", *df.columns), 256)).show(truncate=False)
#+----+----+----+----+----------------------------------------------------------------+
#|col1|col2|col3|col4|row_sha2                                                        |
#+----+----+----+----+----------------------------------------------------------------+
#|1   |2   |5   |1   |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|
#|3   |4   |7   |8   |57f057bdc4178b69b1b6ab9d78eabee47133790cba8cf503ac1658fa7a496db1|
#+----+----+----+----+----------------------------------------------------------------+

You can pass in either 0 or 256 as the second argument to sha2(), as per the docs:

Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256).

The function concat_ws takes a separator and a list of columns to join. I am passing in || as the separator and df.columns as the list of columns.

I am using all of the columns here, but you can specify whatever subset of columns you'd like; in your case that would be columnarray. (You need to use the * to unpack the list.) See the small sketch below for the subset case.
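
For example, a small sketch that hashes only a subset (here col2 and col3) of the example DataFrame above; in the question's case the list would hold the real column names:

columnarray = ["col2", "col3"]   # assumed subset; replace with your own column names
df.withColumn("row_sha2", sha2(concat_ws("||", *columnarray), 256)).show(truncate=False)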

answered Oct 07 '22 by pault


If you want a hash built from the values of all columns in each row of your dataset, you can apply a self-designed function via map to the RDD of your DataFrame.

import hashlib
from pyspark.sql import Row    # needed to rebuild the row from the dict

test_df = spark.createDataFrame([
    (1,"2",5,1),(3,"4",7,8),
    ], ("col1","col2","col3","col4"))

def sha_concat(row):
    row_dict = row.asDict()                             # transform row to a dict
    columnarray = row_dict.keys()                       # get the column names
    concat_str = ''
    for v in row_dict.values():
        concat_str = concat_str + '||' + str(v)         # concatenate values
    concat_str = concat_str[2:]
    row_dict["sha_values"] = concat_str                 # preserve concatenated value for testing (this can be removed later)
    row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest()   # calculate sha256
    return Row(**row_dict)

test_df.rdd.map(sha_concat).toDF().show(truncate=False)

The result would look like:

+----+----+----+----+----------------------------------------------------------------+----------+
|col1|col2|col3|col4|sha_hash                                                        |sha_values|
+----+----+----+----+----------------------------------------------------------------+----------+
|1   |2   |5   |1   |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|1||2||5||1|
|3   |4   |7   |8   |cb8f8c5d9fd7165cf3c0f019e0fb10fa0e8f147960c715b7f6a60e149d3923a5|8||4||7||3|
+----+----+----+----+----------------------------------------------------------------+----------+
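
Note that sha_values for the second row comes out as 8||4||7||3 rather than 3||4||7||8, presumably because the dict returned by asDict() does not preserve column order here. A hedged variation that iterates over an explicit column list keeps the concatenation (and therefore the hash) deterministic:

def sha_concat_ordered(row, columns):
    # concatenate values in a fixed, explicit column order
    row_dict = row.asDict()
    concat_str = '||'.join(str(row_dict[c]) for c in columns)
    row_dict["sha_values"] = concat_str
    row_dict["sha_hash"] = hashlib.sha256(concat_str.encode('utf-8')).hexdigest()
    return Row(**row_dict)

test_df.rdd.map(lambda r: sha_concat_ordered(r, test_df.columns)).toDF().show(truncate=False)
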
answered Oct 07 '22 by gaw


New in version 2.0 is the hash function.

from pyspark.sql.functions import hash

(
    spark
    .createDataFrame([(1,'Abe'),(2,'Ben'),(3,'Cas')], ('id','name'))
    .withColumn('hashed_name', hash('name'))
).show()

which results in:

+---+----+-----------+
| id|name|hashed_name|
+---+----+-----------+
|  1| Abe| 1567000248|
|  2| Ben| 1604243918|
|  3| Cas| -586163893|
+---+----+-----------+

https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#hash
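
Note that hash() returns a 32-bit integer (which is why the question found it of limited use for a rowhash), but it does accept several columns at once, so a quick row-level hash over a subset of columns could look like this sketch:

from pyspark.sql.functions import hash

# sketch: hash() accepts multiple columns and returns an int column
df = spark.createDataFrame([(1, 'Abe'), (2, 'Ben'), (3, 'Cas')], ('id', 'name'))
df.withColumn('rowhash', hash('id', 'name')).show()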

answered Oct 07 '22 by Michael H.