Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to select point with the most surrounding points

N.B: there's a major edit at the bottom of the question - check it out

Question

Say I have a set of points:

Points

I want to find the point with the most points surrounding it, within radius R (ie a circle) or within +-R (ie a square) of the point for 2 dimensions. I'll refer to it as the densest point function.

For the diagrams in this question, I'll represent the surrounding region as circles. In the image above, the middle point's surrounding region is shown in green. This middle point has the most surrounding points of all the points within radius R and would be returned by the densest point function.

What I've tried

A viable way to solve this problem would be to use a range searching solution; this answer explains further and that it has "O(log(n) + k) worst-case time". Using this, I could get the number of points surrounding each point and choose the point with largest surrounding point count.

However, if the points were extremely densely packed (in the order of a million), as such:

enter image description here

then each of these million points (1e6) would need to have a range search performed. The worst-case time O(log(n) + k), where k is the number of points returned in the range, is true for the following point tree types:

  • kd-trees of two dimensions (which are actually slightly worse, at O(sqrt(n)+k)),
  • 2d-range trees,
  • Quadtrees, which have a worst-case time of O(n)

So, for a group of 1e6 points within radius R of all points within the group, it gives complexity of O(log(1e6) + 1e6) for each point. This yields over a trillion operations!

Any ideas on a more efficient, precise way of achieving this, so that I could find the point with the most surrounding points for a group of points, and in a reasonable time (preferably O(n log(n)) or less)?

EDIT

Turns out that the method above is correct! I just need help implementing it.

(Semi-)Solution

If I use a 2d-range tree:

  • A range reporting query costs O(log^d(n) + k), for k returned points,
  • For a range tree with fractional cascading (also known as layered range trees) the complexity is O(log^(d-1)(n) + k),
  • For 2 dimensions, that is O(log(n) + k),
  • Furthermore, if I perform a range counting query (i.e., I do not report each point), then it costs O(log(n)).

I'd perform this on every point - yielding the O(n log(n)) complexity I desired!

Problem

However, I cannot figure out how to write the code for a counting query for a 2d layered range tree.

I've found a great resource (from page 113 onwards) about range trees, including 2d-range tree psuedocode. But I can't figure out how to introduce fractional cascading, nor how to correctly implement the counting query so that it is of O(log n) complexity.

I've also found two range tree implementations here and here in Java, and one in C++ here, although I'm not sure this uses fractional cascading as it states above the countInRange method that

It returns the number of such points in worst case * O(log(n)^d) time. It can also return the points that are in the rectangle in worst case * O(log(n)^d + k) time where k is the number of points that lie in the rectangle.

which suggests to me it does not apply fractional cascading.

Refined question

To answer the question above therefore, all I need to know is if there are any libraries with 2d-range trees with fractional cascading that have a range counting query of complexity O(log(n)) so I don't go reinventing any wheels, or can you help me to write/modify the resources above to perform a query of that complexity?

Also not complaining if you can provide me with any other methods to achieve a range counting query of 2d points in O(log(n)) in any other way!

like image 282
Nick Bull Avatar asked Jun 04 '17 02:06

Nick Bull


2 Answers

I suggest using plane sweep algorithm. This allows one-dimensional range queries instead of 2-d queries. (Which is more efficient, simpler, and in case of square neighborhood does not require fractional cascading):

  1. Sort points by Y-coordinate to array S.
  2. Advance 3 pointers to array S: one (C) for currently inspected (center) point; other one, A (a little bit ahead) for nearest point at distance > R below C; and the last one, B (a little bit behind) for farthest point at distance < R above it.
  3. Insert points pointed by A to Order statistic tree (ordered by coordinate X) and remove points pointed by B from this tree. Use this tree to find points at distance R to the left/right from C and use difference of these points' positions in the tree to get number of points in square area around C.
  4. Use results of previous step to select "most surrounded" point.

This algorithm could be optimized if you rotate points (or just exchange X-Y coordinates) so that width of the occupied area is not larger than its height. Also you could cut points into vertical slices (with R-sized overlap) and process slices separately - if there are too many elements in the tree so that it does not fit in CPU cache (which is unlikely for only 1 million points). This algorithm (optimized or not) has time complexity O(n log n).

For circular neighborhood (if R is not too large and points are evenly distributed) you could approximate circle with several rectangles:

approximated circle

In this case step 2 of the algorithm should use more pointers to allow insertion/removal to/from several trees. And on step 3 you should do a linear search near points at proper distance (<=R) to distinguish points inside the circle from the points outside it.

Other way to deal with circular neighborhood is to approximate circle with rectangles of equal height (but here circle should be split into more pieces). This results in much simpler algorithm (where sorted arrays are used instead of order statistic trees):

  1. Cut area occupied by points into horizontal slices, sort slices by Y, then sort points inside slices by X.
  2. For each point in each slice, assume it to be a "center" point and do step 3.
  3. For each nearby slice use binary search to find points with Euclidean distance close to R, then use linear search to tell "inside" points from "outside" ones. Stop linear search where the slice is completely inside the circle, and count remaining points by difference of positions in the array.
  4. Use results of previous step to select "most surrounded" point.

This algorithm allows optimizations mentioned earlier as well as fractional cascading.

like image 123
Evgeny Kluev Avatar answered Sep 24 '22 06:09

Evgeny Kluev


I would start by creating something like a https://en.wikipedia.org/wiki/K-d_tree, where you have a tree with points at the leaves and each node information about its descendants. At each node I would keep a count of the number of descendants, and a bounding box enclosing those descendants.

Now for each point I would recursively search the tree. At each node I visit, either all of the bounding box is within R of the current point, all of the bounding box is more than R away from the current point, or some of it is inside R and some outside R. In the first case I can use the count of the number of descendants of the current node to increase the count of points within R of the current point and return up one level of the recursion. In the second case I can simply return up one level of the recursion without incrementing anything. It is only in the intermediate case that I need to continue recursing down the tree.

So I can work out for each point the number of neighbours within R without checking every other point, and pick the point with the highest count.

If the points are spread out evenly then I think you will end up constructing a k-d tree where the lower levels are close to a regular grid, and I think if the grid is of size A x A then in the worst case R is large enough so that its boundary is a circle that intersects O(A) low level cells, so I think that if you have O(n) points you could expect this to cost about O(n * sqrt(n)).

like image 27
mcdowella Avatar answered Sep 23 '22 06:09

mcdowella