Table A has many columns, including a date column. Table B has a datetime and a value. The data in both tables is generated sporadically, with no regular interval. Table A is small, table B is massive.
I need to join B to A under the condition that a given element a of A.datetime corresponds to
B[B['datetime'] <= a]['datetime'].max()
There are a couple of ways to do this, but I would like the most efficient way.
1. Broadcast the small dataset as a pandas DataFrame. Set up a Spark UDF that creates a pandas DataFrame for each chunk of the large dataset and merges it with the broadcast table using merge_asof (a rough sketch is below).
2. Use the broadcast join functionality of Spark SQL: set up a theta join on the condition
B['datetime'] <= A['datetime']
and then eliminate all the superfluous rows (also sketched below).
The second option seems pretty terrible... but please let me know if the first way is efficient or if there is another way.
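Here is roughly what I mean by the first option. It is an untested sketch, not working code: df1 and df2 stand for the sample tables A and B from the edit below, spark is the SparkSession, and it relies on mapInPandas, which needs Spark 3.0+ and PyArrow. Because merge_asof only sees one chunk of B at a time, a final reduce keeps the latest candidate per A row.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Small table A: collect it once, convert the key to pandas timestamps, broadcast it.
a_pd = df1.toPandas()
a_pd['Datetime'] = pd.to_datetime(a_pd['Datetime'])
a_bc = spark.sparkContext.broadcast(a_pd.sort_values('Datetime'))
def asof_candidates(batches):
    # batches is an iterator of pandas DataFrames, each holding one chunk of B
    for b_pd in batches:
        b_pd['Datetime'] = pd.to_datetime(b_pd['Datetime'])
        b_pd = b_pd.sort_values('Datetime')
        b_pd['B_Datetime'] = b_pd['Datetime']  # keep B's datetime for the final reduce
        # For each A row: the last B row in this chunk with Datetime <= A.Datetime
        cand = pd.merge_asof(a_bc.value, b_pd, on='Datetime', direction='backward')
        cand = cand.dropna(subset=['Key']).astype({'Key': 'int64'})
        yield cand[['Column1', 'Key', 'B_Datetime']]
candidates = df2.mapInPandas(asof_candidates,
                             schema='Column1 string, Key long, B_Datetime timestamp')
# Each chunk only saw part of B, so keep the latest candidate per A row.
w = Window.partitionBy('Column1').orderBy(F.col('B_Datetime').desc())
candidates.withColumn('rn', F.row_number().over(w)).filter('rn = 1').select('Column1', 'Key').show()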
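And a rough sketch of the second option (again df1/df2 are the sample tables from the edit below, and I assume Column1 identifies a row of A): broadcast the small table, pair every A row with all earlier-or-equal B rows through the non-equi condition, then prune everything but the latest match.
from pyspark.sql import functions as F
b = df2.withColumnRenamed('Datetime', 'B_Datetime')  # avoid an ambiguous column name
# Non-equi condition with the small table broadcast; A rows with no earlier B row are dropped.
joined = b.join(F.broadcast(df1), b['B_Datetime'] <= df1['Datetime'])
# Eliminate the superfluous rows: per A row, keep only the B row with the largest B_Datetime.
result = (joined
          .groupBy('Column1')
          .agg(F.max(F.struct('B_Datetime', 'Key')).alias('m'))
          .select('Column1', F.col('m.Key').alias('Key')))
result.show()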
EDIT: Here is the sample input and expected output:
A =
+---------+----------+
| Column1 | Datetime |
+---------+----------+
| A |2019-02-03|
| B |2019-03-14|
+---------+----------+
B =
+---------+----------+
| Key | Datetime |
+---------+----------+
| 0 |2019-01-01|
| 1 |2019-01-15|
| 2 |2019-02-01|
| 3 |2019-02-15|
| 4 |2019-03-01|
| 5 |2019-03-15|
+---------+----------+
custom_join(A,B) =
+---------+----------+
| Column1 | Key |
+---------+----------+
| A | 2 |
| B | 4 |
+---------+----------+
To reproduce the sample tables, first create a list of rows and a list of column names, then pass both to the spark.createDataFrame() method.
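For example (I am naming them df1 and df2 so they line up with the answer below; the column types are just inferred from the sample data):
import datetime
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Table A
df1 = spark.createDataFrame(
    [('A', datetime.date(2019, 2, 3)), ('B', datetime.date(2019, 3, 14))],
    ['Column1', 'Datetime'])
# Table B
df2 = spark.createDataFrame(
    [(0, datetime.date(2019, 1, 1)), (1, datetime.date(2019, 1, 15)),
     (2, datetime.date(2019, 2, 1)), (3, datetime.date(2019, 2, 15)),
     (4, datetime.date(2019, 3, 1)), (5, datetime.date(2019, 3, 15))],
    ['Key', 'Datetime'])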
If you go down the pandas UDF route, note that PyArrow is required. If you install PySpark using pip, PyArrow can be brought in as an extra dependency of the SQL module with pip install pyspark[sql]; otherwise you must ensure that PyArrow is installed and available on all cluster nodes, either with pip or with conda from the conda-forge channel.
I doubt that it is faster, but you could solve it with Spark by using union
and last
together with a window
function.
from pyspark.sql import functions as f
from pyspark.sql.window import Window
# Give both frames the same columns so they can be unioned:
# A rows get a null Key, B rows get a null Column1.
df1 = df1.withColumn('Key', f.lit(None))
df2 = df2.withColumn('Column1', f.lit(None))
df3 = df1.unionByName(df2)
# Window over all rows before the current one, ordered by Datetime
# (on a tie the B row sorts first, since its Column1 is null).
w = Window.orderBy('Datetime', 'Column1').rowsBetween(Window.unboundedPreceding, -1)
# For each A row, take the last non-null Key seen so far, i.e. the Key of the
# latest B row with Datetime <= the A row's, then keep only the A rows.
df3.withColumn('Key', f.last('Key', True).over(w)).filter(~f.isnull('Column1')).show()
Which gives
+-------+----------+---+
|Column1| Datetime|Key|
+-------+----------+---+
| A|2019-02-03| 2|
| B|2019-03-14| 4|
+-------+----------+---+
It's an old question but maybe still useful for somebody.