Say I have two PySpark DataFrames df1 and df2.
df1 =
'a'
 1
 2
 5

df2 =
'b'
 3
 6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find the y in df2['b'] that achieves min(abs(x - y)) (you can assume that there is only one y achieving the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# absolute distance between two values
def dist(x, y):
    return abs(x - y)

udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without producing any error or output.
My questions are:
1. How can I efficiently find the closest value (e.g., do I need to compute the distances between all a and b values first, and then find the min one)?
2. How do I correctly apply the udf to add the distance column to a DataFrame?
Starting with your second question: you can apply a udf only to the columns of an existing DataFrame. I think you were looking for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
| a| b|distance|
+---+---+--------+
| 1| 3| 2|
| 1| 6| 5|
| 2| 3| 1|
| 2| 6| 4|
| 5| 3| 2|
| 5| 6| 1|
+---+---+--------+
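Note that df1.join(df2) with no join condition is a Cartesian product, and depending on your Spark version you may need to enable cross joins or request one explicitly. A minimal sketch (assuming Spark 2.1+, reusing udf_dict from the question):
>>> pairs = df1.crossJoin(df2)  # explicit Cartesian product of all (a, b) pairs
>>> pairs.withColumn('distance', udf_dict(df1.a, df2.b)).show()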
But there is a more efficient way to compute this distance, using the built-in abs function (a native column expression avoids the serialization overhead of a Python udf):
>>> from pyspark.sql.functions import abs
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find the matching values by computing the minimum distance for each a and joining back:
>>> from pyspark.sql.functions import min
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
| a| b|
+---+---+
| 5| 6|
| 1| 3|
| 2| 3|
+---+---+
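For completeness, here is a minimal end-to-end sketch of the whole approach, assuming a Spark 2.x+ session available as spark (the aliases sql_abs/sql_min are just my choice to avoid shadowing the Python builtins):

from pyspark.sql.functions import abs as sql_abs, min as sql_min

df1 = spark.createDataFrame([(1,), (2,), (5,)], ['a'])
df2 = spark.createDataFrame([(3,), (6,)], ['b'])

# all (a, b) pairs with their absolute distance
distances = df1.crossJoin(df2).withColumn('distance', sql_abs(df1.a - df2.b))

# smallest distance for each value of a
min_distances = distances.groupBy('a').agg(sql_min('distance').alias('distance'))

# keep only the pairs that achieve that minimum
distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()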