Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Different results from LOF implementation in ELKI and RapidMiner

I have written my own implementation of LOF and I'm trying to compare results with the implementations in ELKI and RapidMiner, but all 3 give different results! I'm trying to work out why.

My reference dataset is one-dimensional, 102 real values with many duplicates. I'll try and post it below.

First, the RapidMiner implementation. The LOF scores are wildly different from ELKI and from my results; many come back with a LOF of infinity. Has this implementation been validated as being correct?

My results are similar to ELKI, but I don't get exactly the same LOF values. From a quick scan of the comments in the ELKI source code, I think this may be because of differences in the way the k-neighbourhood is calculated.

In the LOF paper, the MinPts parameter (elsewhere called k) specifies the minimum no. of points to be included in the k-neighbourhood. In the ELKI implementation, I think they are defining the k-neighbourhood as exactly k points rather than all the points within the k-distance or k-distinct distance. Can anyone confirm exactly how ELKI constructs the k-neighbourhood? Also there is a private variable which allows the point itself to be included in its own neighbourhood, but it looks like the default is not to include it.

Does anyone know of a public reference dataset which has the LOF scores attached for validation purposes?

--- more details follow ---

Reference: ELKI source code is here:

http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/de/lmu/ifi/dbs/elki/algorithm/outlier/lof/LOF.java

RapidMiner source code is here:

http://code.google.com/p/rapidminer-anomalydetection/source/browse/trunk/src/de/dfki/madm/anomalydetection/evaluator/nearest_neighbor_based/LOFEvaluator.java

Here is my test dataset:

4.32323 5.12595 5.12595 5.12595 5.12595 5.7457 5.7457 5.7457 5.7457 5.7457 5.7457 5.97766 5.97766 6.07352 6.07352 6.12015 6.12015 6.12015 6.44797 6.44797 6.48131 6.48131 6.48131 6.48131 6.48131 6.48131 6.6333 6.6333 6.6333 6.70872 6.70872 6.70872 6.70872 6.70872 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.10361 7.10361 7.10361 7.10361 7.10361 7.10361 7.10361 7.10361 7.15651 7.15651 7.15651 7.15651 7.15651 7.15651 7.15651 7.15651 8.22598 8.22598 8.22598 8.22598 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538

For example, I get the following LOF score for the first number (4.32323):

  • RapidMiner: infinity (with MinPts lower/upper bounds set to 10,100)
  • ELKI: 2.6774 (with k = 10 and distfunction/reachdistfunction set to default)
  • My implementation: 1.9531

Some more details on what my implementation is doing:

  1. MinPts is 10, so I'm finding the 10 distinct neighbours of the point. So the neighbourhood of 4.32323 is actually 48 points, from 5.12595 up to 6.77579.
  2. That gives me a k-distinct distance of 2.45256
  3. I'm calculating the reachability distance of the first neighbour as 1.58277
  4. I'm calculating the LRD of the sample as 1/(99.9103/48)
  5. Sum of lrd(o)/lrd(p) for all 48 neighbours is 93.748939
  6. Divide by 48 to get a LOF of 1.9531
like image 287
Michael D. Avatar asked Feb 20 '13 18:02

Michael D.


1 Answers

I'm actually not surprised they differ. You could also add in Weka's implementation of LOF, and you will probably get yet another answer.

Here is one more difference for you to add to your equations: as far as I know, the rapidminer implementation merges points that have the same coordinates. But maybe, they forgot to take these weights into account when computing the nearest neighbors!

In the classic database context, you would not merge duplicate coordinates into a single observation. They are still valid database records and should be counted as full records.

I do not know if any of them perform some automatic data preprocessing such as rescaling the data set.

The ELKI implementation has been verified against a number of textbook examples we use for teaching.

However, there are corner cases in the algorithm that are not 100% fixed, so there is room for difference even in "literal" implementations of the algorithm. You have already run into three of these:

  1. How to treat duplicate points: A) aggregate, B) drop, C) consider different

    From a data mining point of view, C is correct, and A (when implemented correctly) is an optimization that can save you unnecessary distance computations. B is the common mathematical view, but does not make as much sense for a database context. If I have two "John Doe", are they the same person?

  2. Definition of k nearest neighbors and k-distance.

    The usual definition of k-distance is: the smallest distance, such that at least k observations are contained. When excluding the query point, this yields the inverval up to 5.7457 from the starting point: there are 10 other observations in a radius of 5.7457 - 4.32323.

    The k nearest neighbors are usually defined as any point within this distance, which may be more than k. But then all additional objects must have the same distance as the kth! It seems as if rapidminer uses exactly k, which does not align with the LOF publication (see Definition 4 in the LOF publication!)

    It's really the k nearest neighbors (including ties, but other than that not more than k objects), not the k-ths smallest distinct distance. Where did you get the "distinct" from?

    Defintions 3 and 4 in the LOF publication are pretty clear on the kNN set LOF uses.

    Your neighborhood of 48 objects thus is not correct.

  3. What to do if there are more than minPts duplicate points (a literal implementation will yield a division by zero, but for obvious reasons the point should be given a LOF of 1.0)

    This is maybe what is happening to Rapidminer.

And then there is reachability distance: this one is really tricky, because it is not a mathematical distance. It is asymmetric.

The reachability of the first observation from the second happens to be the k-distance of the second, which from a quick look (did not double check) reach-dist(x[0], x[1]) = max(5.97766 - 5.12595, 5.12595 - 4.32323) = 0.80272

See my extensive tutorial slides on outlier detection for a step-by-step demonstration of how to compute LOF. As far as I can tell, this is literal LOF. It doesn't touch all the corner cases, but it motivates the design of the LOF algorithm and is quite exhaustive.

like image 84
Erich Schubert Avatar answered Sep 22 '22 15:09

Erich Schubert