Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to quickly find animals away from the herd

I am developing a simulation program. There are herds of animals (wildebeests), and in that herd, I need to be able to find one animal that is away from the herd.

On the picture below, green dots are away from the herd. It is these points that I'd like to be able to find quickly.

Green dots are away from the herd

Of course, there is a simple algorithm to solve that problem. Count the number of dots in the neighborhood of each point, and then if that neighborhood is empty (0 point in it), then we know that this point is away from the herd.

The problem is that this algorithm is not efficient at all. I have one million points, and applying this algorithm on each of the million points is very slow.

Is there something that would be faster? Maybe using trees?

Edit for @amit: we want to avoid that case. A group of green dots in the left corner would be chosen, even though they should not because it's not a single animal that is away from the herd, it's a group of animals. We are only looking for one single animal away from herd (not a group).

A group of green dots away from the herd

like image 481
user1493046 Avatar asked Dec 27 '12 10:12

user1493046


3 Answers

For nearest neighbors queries, kd-trees are often used. This would result in O(n log n) queries (one query is in log(n) times n queries, and building the kd-tree is itself in O(n log n) ) which I can see working pretty fast for a couple millions of points, and there are libraries that are already pretty efficient as well (ANN for instance).

In addition, ANN stands for "Approximate nearest neighbors", and can be even faster when exact distances are not needed. Since in your case, you only want to detect whether the first nearest neighbor distance is large or small, you can set a pretty high threshold which would make things even faster.

From that, you can determine the distance distribution to everyone nearest neighbor, and find the outliers. Sorting all these distances to determine the outliers is again in O(n log n).

like image 81
nbonneel Avatar answered Nov 03 '22 10:11

nbonneel


I think you are looking for anomaly detection algorithm (which is an unsupervised machine learning problem).

The idea is to find the instances that "behave" un-normal comparing to the rest of the instances.

The set of videos starting with this one (from an on-line machine learning course in Coursera) describes the problem and how it can be approached nicely.


EDIT:
A simpler alternative will be to find the mean of all points (animals), and "chose" the k animals that are most far from it (or alternatively, all points which have distance greater from some threshold).

If you have several groups, you might want to cluster them first. One way to do it is with k-means clustering, and apply one of the above approaches on each group (cluster).

like image 39
amit Avatar answered Nov 03 '22 12:11

amit


Since you're looking for a lone animal, you could use two convex layers for O(N log N + ab*) O(N log N), where a is the size of the first hull and b is the size of the second hull.

  1. Create a convex hull from the list of positions
  2. Create a second convex hull from the list of positions, excluding those in the first hull.

An animal in the outer (first) hull is "isolated" if it's nearest neighbors are sufficiently far away. The nearest neighbors are the closet points to that point (that are not the same point) in the inner and outer hull. In the case of the outer hull, you can probably get by with just checking the distance to the points left and right of the point being considered. Hence the a*b in the big O instead of a(a+b)

If you are expecting cases where one of the "inner" animals of the herd is considered isolated (in this case, inner refer to any animal that doesn't make up the outer hull), then the above method probably won't work. In which case, you'll need to use a more sophisticated approach.

It also probably be inefficient if a + b is close to N as it'll be basically O(N^2). Although, in that case, it's rather unlikely that any animal is very isolated.

Edit: I should also point out that there are dynamic convex hull structures that can be used to maintain a convex hull where the points are moving simply by adding and removing the points. That would probably be helpful for the real-time updates.

*This is actually O(N), using rotating calipers.

like image 5
Nuclearman Avatar answered Nov 03 '22 12:11

Nuclearman