Let's say I have two disjoint groups / "islands" of polygons (think census tracts in two non-adjacent counties). My data could look something like this:
>>> p1=Polygon([(0,0),(10,0),(10,10),(0,10)])
>>> p2=Polygon([(10,10),(20,10),(20,20),(10,20)])
>>> p3=Polygon([(10,10),(10,20),(0,10)])
>>>
>>> p4=Polygon([(40,40),(50,40),(50,30),(40,30)])
>>> p5=Polygon([(40,40),(50,40),(50,50),(40,50)])
>>> p6=Polygon([(40,40),(40,50),(30,50)])
>>>
>>> df=gpd.GeoDataFrame(geometry=[p1,p2,p3,p4,p5,p6])
>>> df
geometry
0 POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))
1 POLYGON ((10 10, 20 10, 20 20, 10 20, 10 10))
2 POLYGON ((10 10, 10 20, 0 10, 10 10))
3 POLYGON ((40 40, 50 40, 50 30, 40 30, 40 40))
4 POLYGON ((40 40, 50 40, 50 50, 40 50, 40 40))
5 POLYGON ((40 40, 40 50, 30 50, 40 40))
>>>
>>> df.plot()
I want the polygons within each island to take on an ID (can be arbitrary) representing it's group. For example the 3 polygons on the bottom left can have IslandID = 1 and the 3 polygons on the top right can have an IslandID=2.
I have developed a way to do this, but I'm wondering if it's the best / most efficient way. I do the following:
1) Create a GeoDataFrame with a geometry equal to polygons within the multipolygon unary union. This gives me two polygons, one for each "island".
>>> SepIslands=gpd.GeoDataFrame(geometry=list(df.unary_union))
>>> SepIslands.plot()
2) Create an ID for each group.
>>> SepIslands['IslandID']=SepIslands.index+1
3) Spatial join the islands to the original polygons, so each polygon has the appropriate island id.
>>> Final=gpd.tools.sjoin(df, SepIslands, how='left').drop('index_right',1)
>>> Final
geometry IslandID
0 POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0)) 1
1 POLYGON ((10 10, 20 10, 20 20, 10 20, 10 10)) 1
2 POLYGON ((10 10, 10 20, 0 10, 10 10)) 1
3 POLYGON ((40 40, 50 40, 50 30, 40 30, 40 40)) 2
4 POLYGON ((40 40, 50 40, 50 50, 40 50, 40 40)) 2
5 POLYGON ((40 40, 40 50, 30 50, 40 40)) 2
Is this indeed the best / most efficient way to do this?
If the gap between each groups are reasonably big, another option is sklearn.cluster.DBSCAN to cluster the centroid of polygons and label them as clusters.
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and it can group points that are closely packed together. In our case, polygons in one island will be clustered in same cluster.
This also works with more than two islands as well.
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon
from sklearn.cluster import DBSCAN
# Note, EPS_DISTANCE = 20 is a magic number and it needs to be
# * smaller than the gap between any two islands
# * large enough to cluster polygons in one island in same cluster
EPS_DISTANCE = 20
MIN_SAMPLE_POLYGONS = 1
p1=Polygon([(0,0),(10,0),(10,10),(0,10)])
p2=Polygon([(10,10),(20,10),(20,20),(10,20)])
p3=Polygon([(10,10),(10,20),(0,10)])
p4=Polygon([(40,40),(50,40),(50,30),(40,30)])
p5=Polygon([(40,40),(50,40),(50,50),(40,50)])
p6=Polygon([(40,40),(40,50),(30,50)])
df = gpd.GeoDataFrame(geometry=[p1, p2, p3, p4, p5, p6])
# preparation for dbscan
df['x'] = df['geometry'].centroid.x
df['y'] = df['geometry'].centroid.y
coords = df.as_matrix(columns=['x', 'y'])
# dbscan
dbscan = DBSCAN(eps=EPS_DISTANCE, min_samples=MIN_SAMPLE_POLYGONS)
clusters = dbscan.fit(coords)
# add labels back to dataframe
labels = pd.Series(clusters.labels_).rename('IslandID')
df = pd.concat([df, labels], axis=1)
> df
geometry ... IslandID
0 POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0)) ... 0
1 POLYGON ((10 10, 20 10, 20 20, 10 20, 10 10)) ... 0
2 POLYGON ((10 10, 10 20, 0 10, 10 10)) ... 0
3 POLYGON ((40 40, 50 40, 50 30, 40 30, 40 40)) ... 1
4 POLYGON ((40 40, 50 40, 50 50, 40 50, 40 40)) ... 1
5 POLYGON ((40 40, 40 50, 30 50, 40 40)) ... 1
[6 rows x 4 columns]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With