Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to efficiently find high density regions

Tags:

Over the course of my coding, I have come across a problem as follows: Find the a region of fixed size in a 2D space that has the highest density of particles. The particles can be considered generally distributed randomly over the entire space, but in theory there should be some areas that have a higher density.

For example, 100 particles are placed randomly in a 2D grid that is 500x500, and I need to find the 50x50 region with the most particles (highest density).

Is there some other way to calculate the best region besides brute force testing every possible region (in this case about over 200000 regions)? That would scale up at O(n^2) for an n-length axis.

like image 484
WilBur Avatar asked Oct 23 '13 05:10

WilBur


People also ask

How do you find the highest density?

This is so because if we divide each object's mass by its volume, and since all three objects have the same volume, the object with the larger mass will have the greatest density. And the object with the smaller mass, will have the lowest density.

What is known as high density region?

A highest posterior density [interval] is basically the shortest interval on a posterior density for some given confidence level. A highest density region is probably the same idea applied to any arbitrary density, so not necessarily a posterior distribution.


1 Answers

Algorithm 1

Create a 500x500 2D array, where each cell contains the count of the number of particles in that cell. Then convolve that array with a 50x50 kernel, the resulting array will have the count of particles in a 50x50 region in each cell. Then find the cell with the largest value.

If you are using a 50x50 box as a region, the kernel can be decomposed into two separate convolutions, one for each axis. The resulting algorithm is O(n^2) space and time, where n is the width and height of the 2D space you are searching.

As a reminder, a one-dimensional convolution with a boxcar function can be completed in O(n) time and space and it can be done in place. Let x(t) be the input for t=1..n, and let y(t) be the output. Define x(t)=0 and y(t)=0 for t<1 and t>n. Define the kernel f(t) to be 1 for 0..d-1 and 0 elsewhere. The definition for convolution gives us the following formula:

y(t) = sum i x(t-i) * f(i) = sum i=0..d-1 x(t-i)

This looks like it takes time O(n*d), but we can rewrite it as a recurrence:

y(t) = y(t-1) + x(t) - x(t-d)

This shows that the one-dimensional convolution is O(n), independent of d. To perform the two-dimensional convolution, you simply perform the one-dimensional convolution for each axis. This works because the boxcar kernel can be decomposed: in general, most kernels cannot be decomposed. The Gaussian kernel is another kernel that can be decomposed, which is why Gaussian blur in an image editing program is so fast.

For the kind of numbers you specify, this will be extremely fast. 500x500 is an extremely small data set, and your computer can check 202,500 regions in a few milliseconds at most. You will have to ask yourself whether it is worth the extra hours, days, or weeks of time it will take you to optimize further.

This is the same as justhalf's solution, except due to the decomposed convolution, the region size does not affect the algorithm's speed.

Algorithm 2

Assume there is at least one point. Without loss of generality, consider the 2D space to be the entire plane. Let d be the width and height of the region. Let N be the number of points.

Lemma: There exists a region of maximum density which has a point on its left edge.

Proof: Let R be a region of maximum density. Let R' be the same region, translated right by the distance between the left edge of R and the leftmost point in R. All points in R must also lie in R', therefore R' is also a region of maximum density.

The algorithm

  1. Insert all points into a K-D tree. This can be done in O(N log2 N) time.

  2. For each point, consider the region of width d and height 2d where the point is centered on the left edge of the region. Call this region R.

  3. Query the K-D tree for the points in region R. Call this set S. This can be done in O(N1/2+|S|) time.

  4. Find the d x d subregion of R containing the largest number of points in S. This can be done in O(|S| log |S|) time by sorting S by y-coordinate and then performing a linear scan.

The resulting algorithm has a time of O(N3/2 + N |S| log |S|).

Comparison

Algorithm #1 is superior to algorithm #2 when the density is high. Algorithm #2 is only superior when the density of particles is very low, and the density at which algorithm #2 is superior decreases as the total board size increases.

Note that the continuous case can be considered to have zero density, at which point only algorithm #2 works.

like image 107
Dietrich Epp Avatar answered Oct 20 '22 19:10

Dietrich Epp