
Check if values are within intervals in PySpark

I have a large DataFrame A with values like this:

df_a = spark.createDataFrame([
    (0, 23), (1, 6), (2, 55), (3, 1), (4, 12), (5, 51),
], ("id", "x"))
# +---+---+
# | id|  x|
# +---+---+
# |  0| 23|
# |  1|  6|
# |  2| 55|
# |  3|  1|
# |  4| 12|
# |  5| 51|
# +---+---+

and I have a DataFrame B with sorted, non-overlapping, closed intervals like this:

df_b = spark.createDataFrame([
    (0, 1, 5), (1, 8, 10), (2, 15, 16), (3, 20, 30), (4, 50, 52),
], ("id", "start", "end"))
# +---+-----+---+
# | id|start|end|
# +---+-----+---+
# |  0|    1|  5|
# |  1|    8| 10|
# |  2|   15| 16|
# |  3|   20| 30|
# |  4|   50| 52|
# +---+-----+---+

I want to check whether each value x in DataFrame A falls within one of the intervals in DataFrame B and, if so, save that interval's id in a new column (interval_id). Values that match no interval should get null. My output DataFrame should look like this:

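# +---+---+-----------+
# | id|  x|interval_id|
# +---+---+-----------+
# |  0| 23|          3|
# |  1|  6|       null|
# |  2| 55|       null|
# |  3|  1|          0|
# |  4| 12|       null|
# |  5| 51|          4|
# +---+---+-----------+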

Is there a way to do this efficiently without UDFs?

Sip asked Nov 16 '25 21:11


1 Answer

A simple left join with a range condition should do the job:

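from pyspark.sql import functions as F

# Rename df_b's id first so it does not clash with df_a's id column.
# between() is inclusive on both ends, which matches the closed intervals.
# The left join keeps every row of df_a; unmatched rows get a null interval_id.
result = df_a.join(
    df_b.withColumnRenamed("id", "interval_id"),
    F.col("x").between(F.col("start"), F.col("end")),
    "left"
).drop("start", "end")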

result.show()

#+---+---+-----------+
#| id|  x|interval_id|
#+---+---+-----------+
#|  0| 23|          3|
#|  1|  6|       null|
#|  2| 55|       null|
#|  3|  1|          0|
#|  4| 12|       null|
#|  5| 51|          4|
#+---+---+-----------+
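
If you want to sanity-check the join result on a small sample collected to the driver, the same matching rule can be sketched in plain Python. This is only a reference check, not a replacement for the join: it assumes, as stated in the question, that the intervals are sorted by start and do not overlap, and the helper name find_interval_id is made up for this illustration.

```python
from bisect import bisect_right

# Same sample data as in the question, as plain Python tuples.
values = [(0, 23), (1, 6), (2, 55), (3, 1), (4, 12), (5, 51)]  # (id, x)
intervals = [(0, 1, 5), (1, 8, 10), (2, 15, 16), (3, 20, 30), (4, 50, 52)]  # (id, start, end)

# Assumption from the question: intervals are sorted by start and non-overlapping,
# so at most one interval can contain a given value.
starts = [start for _, start, _ in intervals]


def find_interval_id(x):
    """Return the id of the closed interval containing x, or None (hypothetical helper)."""
    pos = bisect_right(starts, x) - 1  # index of the last interval whose start <= x
    if pos < 0:
        return None  # x lies before the first interval
    interval_id, _, end = intervals[pos]
    return interval_id if x <= end else None  # closed interval: end is inclusive


expected = [(row_id, x, find_interval_id(x)) for row_id, x in values]
print(expected)
```

Comparing this list with the rows from result.collect() on the same sample should show the same interval_id per row, with None in place of null.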
blackbishop answered Nov 18 '25 12:11


