Efficiently grouping a list of coordinates points by location in Python

Given a list of X,Y coordinate points on a 2D grid, what is the most efficient algorithm to create a list of groups of adjacent coordinate points?

For example, given a list of points making up two non-adjacent squares (3x3) on a grid (15x15), the result of this algorithm would be two groups of points corresponding to the two squares.

I suppose you could use a flood fill algorithm, but this seems like overkill and not very efficient for a large 2D grid of, say, 1024x1024.

asked Jul 27 '14 by Kaelan Cooter

3 Answers

Fundamentally, this is an image processing operation. If you use an image processing library like scikit-image (a.k.a. skimage), it will be easy. Dealing with really huge data will eventually get slow, but 1024x1024 is nothing.

In [1]: import numpy as np
In [2]: import skimage.morphology
In [3]: x = [0,1,2,0,1,2,0,1,2,-3,-2,-1,-3,-2,-1,-3,-2,-1]
In [4]: y = [0,0,0,1,1,1,2,2,2,-3,-3,-3,-2,-2,-2,-1,-1,-1]
In [5]: dense = np.zeros((9,9), dtype=bool)
In [6]: dense[y,x] = True  # negative coordinates index from the end, so -3 lands at row/column 6

In [7]: print(dense)
[[ True  True  True False False False False False False]
 [ True  True  True False False False False False False]
 [ True  True  True False False False False False False]
 [False False False False False False False False False]
 [False False False False False False False False False]
 [False False False False False False False False False]
 [False False False False False False  True  True  True]
 [False False False False False False  True  True  True]
 [False False False False False False  True  True  True]]

In [8]: labeled = skimage.morphology.label(dense)
In [9]: print(labeled)
[[1 1 1 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 2 2 2]
 [0 0 0 0 0 0 2 2 2]
 [0 0 0 0 0 0 2 2 2]]

In [10]: coords_yx = { i: (labeled == i).nonzero() for i in range(1,labeled.max()+1) }
In [11]: coords_yx
Out[11]:
{1: (array([0, 0, 0, 1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2, 0, 1, 2])),
 2: (array([6, 6, 6, 7, 7, 7, 8, 8, 8]), array([6, 7, 8, 6, 7, 8, 6, 7, 8]))}
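
One caveat: in current scikit-image releases, label lives in skimage.measure (skimage.morphology.label has been removed). Below is a minimal, self-contained sketch of the same approach, assuming the input comes as (x, y) tuples; the function name group_points and the example data are illustrative, not part of the original answer.

import numpy as np
from skimage import measure  # label() moved here from skimage.morphology

def group_points(points, shape):
    # Rasterize the (x, y) points onto a dense boolean grid.
    dense = np.zeros(shape, dtype=bool)
    xs, ys = zip(*points)
    dense[np.array(ys), np.array(xs)] = True   # row index = y, column index = x
    # connectivity=1 means 4-neighbour adjacency; use 2 to also join diagonals.
    labeled = measure.label(dense, connectivity=1)
    groups = []
    for lab in range(1, labeled.max() + 1):
        ys_, xs_ = np.nonzero(labeled == lab)
        groups.append(list(zip(xs_.tolist(), ys_.tolist())))
    return groups

# Example: the two 3x3 squares from above
pts = [(x, y) for x in range(3) for y in range(3)] + \
      [(x, y) for x in range(6, 9) for y in range(6, 9)]
print(len(group_points(pts, (9, 9))))  # -> 2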
answered Nov 06 '22 by Stuart Berg


You can hash all coordinate points (e.g., using a dictionary in Python) and then, for each point, look up its adjacent neighbors in the hash to find pairs of adjacent points and "merge" them. For each point, maintain a pointer to the connected component it belongs to, and for each connected component, maintain a list of its member points.

Then, whenever a neighbor lookup finds a match, merge the two connected-component sets the points belong to and update the group pointers for every point in the smaller set. Each point's neighbors only need to be hashed once to find all connected components. Moreover, if you always update the pointers of the smaller of the two sets being merged, each point's pointer changes at most O(log n) times, so the total running time is O(n log n), near-linear in the number of points; see the sketch below.
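
A minimal sketch of this scheme, assuming points are given as (x, y) tuples and adjacency means sharing an edge (4-neighbours); the names group_adjacent, group_of, and members are illustrative, not from the answer.

def group_adjacent(points):
    # Hash every point to its own singleton group.
    group_of = {}            # point -> group id
    members = {}             # group id -> list of points in that group
    for i, p in enumerate(points):
        group_of[p] = i
        members[i] = [p]
    # Each point's neighbors are hashed exactly once.
    for (x, y) in points:
        for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if neighbor in group_of:
                a, b = group_of[(x, y)], group_of[neighbor]
                if a == b:
                    continue
                if len(members[a]) < len(members[b]):
                    a, b = b, a        # always fold the smaller group into the larger
                for q in members[b]:
                    group_of[q] = a
                members[a].extend(members.pop(b))
    return list(members.values())

# Example: two separated 3x3 squares
pts = [(x, y) for x in range(3) for y in range(3)] + \
      [(x, y) for x in range(6, 9) for y in range(6, 9)]
print(len(group_adjacent(pts)))  # -> 2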

answered Nov 06 '22 by user2566092


It's unclear what you mean by "groups of adjacent" coordinate points. Your example of two non-adjacent 3x3 squares suggests you are looking for what is called connected-component labeling.

There are many implementations to extract connected components. Below are a few for guidance.

  1. cclabel
  2. OpenCV
  3. bwconncomp

That said, I've implemented this kind of blob detector before, and these algorithms are not that hard to write if you are looking for a learning experience. If not, I would go with the most mature library, such as OpenCV, and use its Python API.

Also, you mentioned "efficiency". Note that there are single-pass and two-pass versions of these algorithms. The single-pass version, as the name suggests, is generally more efficient since it requires only one pass through the data, which may matter if your grids are very large. See the sketch below for what the OpenCV route looks like.
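
For reference, a minimal sketch using OpenCV's connectedComponents, assuming the points have already been rasterized into an 8-bit binary image; the example array is illustrative.

import numpy as np
import cv2

# Binary 8-bit image: two separated 3x3 squares on a 9x9 grid
img = np.zeros((9, 9), dtype=np.uint8)
img[0:3, 0:3] = 255
img[6:9, 6:9] = 255

# Background is labeled 0; blobs are labeled 1..num_labels-1.
# connectivity=4 joins edge-sharing pixels only; use 8 to include diagonals.
num_labels, labels = cv2.connectedComponents(img, connectivity=4)

for lab in range(1, num_labels):
    ys, xs = np.nonzero(labels == lab)       # row index = y, column index = x
    print(lab, list(zip(xs.tolist(), ys.tolist())))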

answered Nov 06 '22 by dpmcmlxxvi