 

PySpark - Add map function as column

I have a pyspark DataFrame

a = [
    ('Bob', 562),
    ('Bob', 880),
    ('Bob', 380),
    ('Sue', 85),
    ('Sue', 963)
]
df = spark.createDataFrame(a, ["Person", "Amount"])

I need to create a column that hashes the Amount and returns the hash. The problem is I can't use a UDF, so I have used a mapping function:

df.rdd.map(lambda x: hash(x["Amount"]))
asked Apr 17 '18 by Bryce Ramgovind

People also ask

How do I map a column in PySpark?

Solution: The PySpark SQL function create_map() is used to convert selected DataFrame columns to a MapType column. create_map() takes the list of columns you want to convert as its arguments and returns a MapType column.
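
For instance, a minimal sketch against the question's DataFrame (the output column name "amount_map" is made up for illustration):

import pyspark.sql.functions as f

# create_map() takes alternating key and value columns; here the key is
# the literal string "Amount" and the value is the Amount column itself.
df.withColumn(
    "amount_map",
    f.create_map(f.lit("Amount"), f.col("Amount"))
).show(truncate=False)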

How do you apply a function to a column in PySpark?

Import the udf function from pyspark.sql.functions, wrap the Python function you want to apply as a UDF (optionally with an explicit return type), and pass the column name to it inside withColumn or select.
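
For illustration only (the question above explicitly rules out UDFs), a minimal sketch of applying a function to a column via a UDF; hash_amount is a made-up name:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Wrap a plain Python function as a UDF with an explicit return type,
# then apply it to the "Amount" column.
hash_amount = udf(lambda amount: hash(str(amount)), LongType())
df.withColumn("Hash", hash_amount("Amount")).show()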

How do I create a column map in spark?

We can create a map column using the createMapType() function on the DataTypes class. This method takes two arguments, keyType and valueType, as mentioned above, and both arguments should be of a type that extends DataType. This snippet creates a "mapCol" object of type MapType with keys and values of String type.
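
That createMapType() call belongs to the Java/Scala DataTypes API; in PySpark the equivalent is the MapType class from pyspark.sql.types. A minimal sketch, with made-up column names:

from pyspark.sql.types import StructType, StructField, StringType, MapType

# "mapCol" is a MapType column with String keys and String values.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("mapCol", MapType(StringType(), StringType()), True),
])

df2 = spark.createDataFrame([("Bob", {"city": "NYC"})], schema)
df2.printSchema()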


1 Answer

If you can't use a udf, you can use the map function, but as you've currently written it there will only be one column. To keep all the columns, do the following:

df = df.rdd\
    .map(lambda x: (x["Person"], x["Amount"], hash(str(x["Amount"]))))\
    .toDF(["Person", "Amount", "Hash"])

df.show()
#+------+------+--------------------+
#|Person|Amount|                Hash|
#+------+------+--------------------+
#|   Bob|   562|-4340709941618811062|
#|   Bob|   880|-7718876479167384701|
#|   Bob|   380|-2088598916611095344|
#|   Sue|    85|    7168043064064671|
#|   Sue|   963|-8844931991662242457|
#+------+------+--------------------+

Note: In this case, hash(x["Amount"]) is not very interesting so I changed it to hash Amount converted to a string.

Essentially you have to map the row to a tuple containing all of the existing columns and add in the new column(s).

If your columns are too many to enumerate, you could also just add a tuple to the existing row.

df = df.rdd\
    .map(lambda x: x + (hash(str(x["Amount"])),))\
    .toDF(df.columns + ["Hash"])

I should also point out that if hashing the values is your end goal, there is also a pyspark function pyspark.sql.functions.hash that can be used to avoid the serialization to rdd:

import pyspark.sql.functions as f
df.withColumn("Hash", f.hash("Amount")).show()
#+------+------+----------+
#|Person|Amount|      Hash|
#+------+------+----------+
#|   Bob|   562|  51343841|
#|   Bob|   880|1241753636|
#|   Bob|   380| 514174926|
#|   Sue|    85|1944150283|
#|   Sue|   963|1665082423|
#+------+------+----------+

This uses a different hashing algorithm than the Python builtin: Spark's hash is based on MurmurHash3 and returns a 32-bit integer, whereas Python's hash() produces a 64-bit, process-dependent value for strings.
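
A quick way to see the difference (an illustration only; the Python value will vary between runs, since hash() of strings is salted per interpreter process unless PYTHONHASHSEED is set):

import pyspark.sql.functions as f

# Spark's hash() is Murmur3-based and stable across runs;
# Python's built-in hash() for strings is not.
df.select("Amount", f.hash("Amount").alias("spark_hash")).show()
print(hash("562"))  # will generally not match the Spark value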

answered Sep 27 '22 by pault