
Spatial Join between pyspark dataframe and polygons (geopandas)

Problem:

I would like to make a spatial join between:

  • A big Spark DataFrame (500M rows) with points (e.g. points on a road)
  • A small GeoJSON file (20,000 shapes) with polygons (e.g. region boundaries)

Here is what I have so far, which I find to be slow (lots of scheduler delay, maybe due to the fact that communes is not broadcast):

@pandas_udf(schema_out, PandasUDFType.GROUPED_MAP)
def join_communes(traces):
    # communes, schema_out and columns are defined on the driver
    geometry = gpd.points_from_xy(traces['longitude'], traces['latitude'])
    gdf_traces = gpd.GeoDataFrame(traces, geometry=geometry, crs=communes.crs)
    joined_df = gpd.sjoin(gdf_traces, communes, how='left', op='within')
    return joined_df[columns]

The pandas_udf takes a chunk of the points DataFrame (traces) as a pandas DataFrame, turns it into a GeoDataFrame with geopandas, and performs the spatial join with the polygons GeoDataFrame (thereby benefiting from geopandas' R-tree spatial index).
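For intuition, the `op='within'` predicate that `sjoin` evaluates boils down to a point-in-polygon test per row (geopandas delegates it to shapely, with the R-tree pruning candidates first). As a hedged, stdlib-only illustration, here is the classic ray-casting version of that test; `point_within` is a made-up helper name, not a geopandas API:

```python
# Toy ray-casting point-in-polygon test, the predicate behind sjoin's 'within'.
# Illustration only: real code should use shapely/geopandas, which also handle
# edge cases (points exactly on a boundary, self-intersecting rings, etc.).

def point_within(x, y, polygon):
    """Return True if (x, y) lies inside `polygon`, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count how many edges a horizontal ray going right from (x, y) crosses;
        # an odd number of crossings means the point is inside.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(point_within(1, 1, square))   # → True
print(point_within(3, 1, square))   # → False
```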

Questions:

Is there a way to make it faster? I understand that my communes GeoDataFrame is in the Spark driver's memory and that each worker has to download it for each call to the UDF. Is this correct?

However, I do not know how I could make this GeoDataFrame available directly to the workers (as in a broadcast join).

Any ideas ?

asked Dec 02 '19 by Luis Blanche


1 Answer

A year later, here is what I ended up doing, as @ndricca suggested. The trick is to broadcast the communes, but you can't broadcast a GeoDataFrame directly, so you have to load it as a Spark DataFrame and convert it to JSON before broadcasting it. You then rebuild the GeoDataFrame inside the UDF using shapely.wkt (Well-Known Text: a way to encode geometric objects as text).
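To see why WKT is a convenient wire format here: it serialises a geometry as a plain string, which survives the DataFrame → JSON → broadcast round trip that a shapely object cannot. As a hedged, stdlib-only sketch (a toy parser handling only POINT; in real code `shapely.wkt.loads` and the geometry's `.wkt` attribute cover all geometry types):

```python
import re

# Toy WKT round trip for POINT geometries only, to illustrate why a 'wkt'
# string column can be broadcast while a shapely/GeoDataFrame object cannot.

def point_to_wkt(x, y):
    return f"POINT ({x} {y})"

def wkt_to_point(text):
    match = re.fullmatch(r"POINT \(([-\d.]+) ([-\d.]+)\)", text)
    return float(match.group(1)), float(match.group(2))

wkt_string = point_to_wkt(2.35, 48.85)
print(wkt_string)                 # → POINT (2.35 48.85)
print(wkt_to_point(wkt_string))   # → (2.35, 48.85)
```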

Another trick is to use a salt in the groupby to ensure an even distribution of the data across the cluster.
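The salting idea itself is independent of Spark: derive a small pseudo-random bucket per row so the groupby fans out into roughly equal groups instead of a few huge ones. A stdlib-only sketch of the idea (`salt_for` and the bucket count of 8 are arbitrary choices for illustration; in Spark the same effect can come from e.g. `(rand() * NUM_SALTS).cast('int')`):

```python
import zlib
from collections import Counter

NUM_SALTS = 8  # arbitrary; roughly a small multiple of the executor count

def salt_for(row_id):
    # Deterministic pseudo-random bucket derived from a row identifier.
    return zlib.crc32(str(row_id).encode()) % NUM_SALTS

# 8000 rows spread over the salt buckets end up roughly balanced.
counts = Counter(salt_for(i) for i in range(8000))
print(len(counts))   # → 8  (every bucket receives rows)
```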

import json

import geopandas as gpd
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from shapely import wkt

communes = gpd.read_file('...communes.geojson')
# Serialise the geometries to WKT so they survive the Spark/JSON round trip
communes['wkt'] = communes.geometry.apply(lambda geom: geom.wkt)
crs = communes.crs

# Use a previously created spark session
traces = spark.read.csv('trajectoires.csv', header=True)
communes_spark = spark.createDataFrame(communes[['insee_comm', 'wkt']])
communes_json = communes_spark.toJSON().collect()
communes_bc = spark.sparkContext.broadcast(communes_json)

@pandas_udf(schema_out, PandasUDFType.GROUPED_MAP)
def join_communes_bc(traces):
    # Rebuild the polygons GeoDataFrame from the broadcast JSON inside the UDF
    communes = pd.DataFrame.from_records([json.loads(c) for c in communes_bc.value])
    polygons = [wkt.loads(w) for w in communes['wkt']]
    gdf_communes = gpd.GeoDataFrame(communes, geometry=polygons, crs=crs)
    geometry = gpd.points_from_xy(traces['longitude'], traces['latitude'])
    gdf_traces = gpd.GeoDataFrame(traces, geometry=geometry, crs=crs)
    joined_df = gpd.sjoin(gdf_traces, gdf_communes, how='left', op='within')
    return joined_df[columns]

# salt is a column of small random integers added beforehand to balance groups
traces = traces.groupby(salt).apply(join_communes_bc)
answered Oct 23 '22 by Luis Blanche