I'm working with a large dataset of spatial parcels, with each row containing geographic coordinates (UTM), parcel area and value:
[x, y, area, value]:
[272564.9434265977, 6134243.108910706, 980.63, 550.6664083293393],
[272553.9611341293, 6134209.499155387, 1026.55, 477.32696897374706],
[271292.4197118982, 6132982.047648986, 634.438, 851.1469993915875],
...
Plotting these visually identifies several distinct zones where dollar value varies based on geography (the high value strip on the left is coastal, for example):
I would like to identify clusters of value (i.e. the coastal strip) and have looked at several approaches:
K-means seems the easiest clustering method to implement, but appears unsuitable due to only considering distance between points and no further attributes.
ClusterPy looks ideal for this application, but its documentation only seems to cover working with GIS files.
DBSCAN seems more relevant but I'm not sure how I can include my additional attribute ($ value) - perhaps as a third dimension?
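The "third dimension" idea for DBSCAN can be sketched roughly as follows. This is only a minimal illustration, assuming scikit-learn is available and that rows follow the [x, y, area, value] layout above. The function name `dbscan_with_value`, the `value_weight` knob and the default `eps`/`min_samples` are hypothetical choices, not tuned recommendations. The columns are standardized first because UTM coordinates and dollar values sit on very different scales.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def dbscan_with_value(parcels, eps=0.5, min_samples=5, value_weight=1.0):
    """Cluster parcels with DBSCAN on (x, y, value) as a 3-D feature space.

    parcels: array-like of rows [x, y, area, value].
    value_weight (hypothetical knob): >1 makes value differences count
    more than geographic distance, <1 makes them count less.
    Returns one label per row; -1 marks points DBSCAN treats as noise.
    """
    parcels = np.asarray(parcels, dtype=float)
    features = parcels[:, [0, 1, 3]]  # x, y, value (area is ignored here)
    # Put all three columns on a comparable scale before measuring distance.
    scaled = StandardScaler().fit_transform(features)
    scaled[:, 2] *= value_weight
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
```

With this setup `eps` is measured in standard deviations rather than metres, so it would need tuning against the real data.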
Can anybody suggest any other toolkits/approaches to consider?
Look at generalized DBSCAN (GDBSCAN), which easily allows you to require neighbor points to both
With hierarchical clustering, at least, you can define connectivity constraints so that only "connected" samples can belong to the same cluster. In your case, x and y would be passed to sklearn.neighbors.kneighbors_graph() to build the neighbour graph, and the value variable would be used for the clustering itself.
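A minimal sketch of this approach, assuming scikit-learn's AgglomerativeClustering with Ward linkage is an acceptable choice of hierarchical method. The function name `cluster_by_value` and the defaults for `n_clusters` and `n_neighbors` are hypothetical and would need adjusting for the real data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph


def cluster_by_value(parcels, n_clusters=3, n_neighbors=8):
    """Cluster on value, allowing merges only between spatial neighbours.

    parcels: array-like of rows [x, y, area, value].
    Returns one cluster label per row.
    """
    parcels = np.asarray(parcels, dtype=float)
    coords = parcels[:, :2]                # x, y: used only for connectivity
    values = parcels[:, 3].reshape(-1, 1)  # value: the feature being clustered

    # Sparse graph linking each parcel to its nearest neighbours in space.
    connectivity = kneighbors_graph(
        coords, n_neighbors=n_neighbors, include_self=False
    )

    model = AgglomerativeClustering(
        n_clusters=n_clusters, connectivity=connectivity, linkage="ward"
    )
    return model.fit_predict(values)
```

Called as `labels = cluster_by_value(rows)`, this returns an array that can be used to colour the parcels in the plot. If the neighbour graph splits into several disconnected pieces, scikit-learn emits a warning and links them up itself, so raising `n_neighbors` may be worth trying in that case.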