
Running clustering algorithms in ELKI

I need to run a k-medoids clustering algorithm by using ELKI programmatically. I have a similarity matrix that I wish to input to the algorithm.

Is there any code snippet available for how to run ELKI algorithms? I basically need to know how to create Database and Relation objects, create a custom distance function, and read the algorithm output.

Unfortunately the ELKI tutorial (http://elki.dbs.ifi.lmu.de/wiki/Tutorial) focuses on the GUI version and on implementing new algorithms, and trying to write code by looking at the Javadoc is frustrating.

If someone is aware of any easy-to-use library for k-medoids, that's probably a good answer to this question as well.

Alphaaa asked Feb 17 '23 05:02


1 Answer

We do appreciate documentation contributions! (Update: I have turned this post into a new ELKI tutorial entry for now.)

ELKI advocates against embedding it in other Java applications, for a number of reasons. This is why we recommend using the MiniGUI (or the command line it constructs). Adding custom code is best done, e.g., as a custom ResultHandler, or simply by using the ResultWriter and parsing the resulting text files.
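For the "parse the resulting text files" route, a minimal sketch in plain Java might look like the following. The file layout is an assumption that this post does not confirm: lines starting with `#` are treated as comments, and every other non-blank line is expected to begin with an `ID=<n>` token followed by whitespace-separated fields. The class and method names (`ResultFileSketch`, `Row`, `parse`) are hypothetical; check the layout against your own output before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

public class ResultFileSketch {
    /** One parsed object line: the ID plus the remaining whitespace-separated fields. */
    public static final class Row {
        public final String id;
        public final List<String> fields;

        Row(String id, List<String> fields) {
            this.id = id;
            this.fields = fields;
        }
    }

    /**
     * Parses lines of a result text file.
     * ASSUMED layout (verify against your files): '#' starts a comment line,
     * object lines look like "ID=<n> field field ...".
     */
    public static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comment/metadata lines
            }
            String[] tokens = trimmed.split("\\s+");
            if (!tokens[0].startsWith("ID=")) {
                continue; // not an object line under the assumed layout
            }
            List<String> fields = new ArrayList<>();
            for (int i = 1; i < tokens.length; i++) {
                fields.add(tokens[i]);
            }
            rows.add(new Row(tokens[0].substring("ID=".length()), fields));
        }
        return rows;
    }
}
```

The lines can be read with `java.nio.file.Files.readAllLines`; if the output was written with `-out.gzip`, the file would first have to be wrapped in a `java.util.zip.GZIPInputStream`.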

If you really want to embed it in your code (there are a number of situations where it is useful, in particular when you need multiple relations, and want to evaluate different index structures against each other), here is the basic setup for getting a Database and Relation:

// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.INPUT_ID, filename);
// Add other parameters for the database here!

// Instantiate the database:
Database db = ClassGenericsUtil.parameterizeOrAbort(
    StaticArrayDatabase.class,
    params);
// Don't forget this, it will load the actual data...
db.initialize();

Relation<DoubleVector> vectors = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);

If you want to write more general code, use NumberVector<?>.

Why we do (currently) not recommend using ELKI as a "library":

  1. The API is still changing a lot. We keep adding options, and we cannot (yet) provide a stable API. The command line / MiniGUI / Parameterization is much more stable because of how default values are handled: the parameterization only lists the non-default parameters, so you will only notice a change if one of these is affected.

    In the code example above, note that I also used this pattern. A change to the parsers, database etc. will likely not affect this program!

  2. Memory usage: data mining is quite memory intensive. If you use the MiniGUI or command line, you get a good cleanup when the task is finished. If you invoke it from Java, chances are really high that you keep some reference somewhere and end up leaking lots of memory. So do not use the above pattern without ensuring that the objects are properly cleaned up when you are done!

    By running ELKI from the command line, you get two things for free:

    1. no memory leaks. When the task is finished, the process quits and frees all memory.

    2. no need to rerun it for the same data. Subsequent analysis does not need to rerun the algorithm.

  3. ELKI is not designed as an embeddable library, for good reasons. ELKI has tons of options and functionality, and this comes at a price in runtime (although it can easily outperform R and Weka, for example!), in memory usage, and in particular in code complexity. ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, then reimplement that approach in an optimized manner for your problem.
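As an illustration of the "reimplement that approach" advice for the k-medoids case in the question, here is a minimal, self-contained sketch of an alternating k-medoids iteration on a precomputed distance matrix. This is not ELKI's implementation. The class name `KMedoidsSketch`, the naive "first k objects" initialization, and the `1 - similarity` conversion (which assumes similarities in [0, 1]) are all assumptions made for the example.

```java
public class KMedoidsSketch {
    /** ASSUMPTION: similarities lie in [0, 1], so 1 - s can serve as a dissimilarity. */
    public static double[][] toDistance(double[][] sim) {
        double[][] dist = new double[sim.length][];
        for (int i = 0; i < sim.length; i++) {
            dist[i] = new double[sim[i].length];
            for (int j = 0; j < sim[i].length; j++) {
                dist[i][j] = 1.0 - sim[i][j];
            }
        }
        return dist;
    }

    /**
     * Alternating k-medoids: assign each object to its nearest medoid, then
     * move each medoid to the cluster member with the smallest total distance.
     * Returns assignment[i] = cluster index (0..k-1) of object i.
     */
    public static int[] cluster(double[][] dist, int k, int maxIter) {
        int n = dist.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("need 1 <= k <= n");
        }
        int[] medoids = new int[k];
        for (int c = 0; c < k; c++) {
            medoids[c] = c; // naive initialization: the first k objects
        }
        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            assign(dist, medoids, assignment);
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                int best = medoids[c];
                double bestCost = cost(dist, best, assignment, c);
                for (int cand = 0; cand < n; cand++) {
                    if (assignment[cand] != c) {
                        continue; // only cluster members may become the medoid
                    }
                    double candCost = cost(dist, cand, assignment, c);
                    if (candCost < bestCost) {
                        best = cand;
                        bestCost = candCost;
                    }
                }
                if (best != medoids[c]) {
                    medoids[c] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // medoids are stable, so the assignment is final
            }
        }
        assign(dist, medoids, assignment);
        return assignment;
    }

    /** Assigns every object to the cluster of its nearest medoid. */
    private static void assign(double[][] dist, int[] medoids, int[] assignment) {
        for (int i = 0; i < dist.length; i++) {
            int nearest = 0;
            for (int c = 1; c < medoids.length; c++) {
                if (dist[i][medoids[c]] < dist[i][medoids[nearest]]) {
                    nearest = c;
                }
            }
            assignment[i] = nearest;
        }
    }

    /** Sum of distances from a candidate medoid to all members of cluster c. */
    private static double cost(double[][] dist, int cand, int[] assignment, int c) {
        double sum = 0.0;
        for (int j = 0; j < dist.length; j++) {
            if (assignment[j] == c) {
                sum += dist[cand][j];
            }
        }
        return sum;
    }
}
```

A call such as `KMedoidsSketch.cluster(KMedoidsSketch.toDistance(sim), 3, 100)` would cluster a similarity matrix into three groups. The result depends on the initialization, so for real data a better seeding strategy is worth considering.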

Best ways of using ELKI

Here are some tips and tricks:

  1. Use the MiniGUI to build a command line. Note that the logging window of the "GUI" shows the corresponding command line parameters. Running ELKI from the command line is easy to script, and can easily be distributed to multiple computers, e.g. via Grid Engine.

    #!/bin/bash
    for k in $( seq 3 39 ); do
        java -jar elki.jar KDDCLIApplication \
            -dbc.in whatever \
            -algorithm clustering.kmeans.KMedoidsEM \
            -kmeans.k $k \
            -resulthandler ResultWriter -out.gzip \
            -out output/k-$k 
    done
    
  2. Use indexes. For many algorithms, index structures can make a huge difference! (But you need to do some research which indexes can be used for which algorithms!)

  3. Consider using the extension points such as ResultWriter. It may be easiest for you to hook into this API, then use ResultUtil to select the results that you want to output in your own preferred format or analyze:

    List<Clustering<? extends Model>> clusterresults =
        ResultUtil.getClusteringResults(result);
    
  4. To identify objects, use labels and a LabelList relation. The default parser will do this when it sees text alongside the numerical attributes, i.e. a file such as

    1.0 2.0 3.0 ObjectLabel1
    

    will make it easy to identify the object by its label!
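If the surrounding application is written in Java, the scripted command line from tip 1 can still be used by starting ELKI as a separate process, which keeps the "process quits and frees all memory" benefit. The following is a minimal sketch under that assumption: the class name `ElkiProcessSketch` and its methods are hypothetical, and the flags are copied from the shell loop above rather than checked against any particular ELKI version.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ElkiProcessSketch {
    /** Builds the same command line as the shell loop, for one value of k. */
    public static List<String> command(String jar, String input, int k, String outDir) {
        return Arrays.asList(
            "java", "-jar", jar, "KDDCLIApplication",
            "-dbc.in", input,
            "-algorithm", "clustering.kmeans.KMedoidsEM",
            "-kmeans.k", Integer.toString(k),
            "-resulthandler", "ResultWriter", "-out.gzip",
            "-out", outDir + "/k-" + k);
    }

    /** Runs the command in a child process and returns its exit code. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
            .inheritIO() // forward the child's stdout/stderr to this process
            .start();
        return process.waitFor();
    }
}
```

Calling `ElkiProcessSketch.run(ElkiProcessSketch.command("elki.jar", "whatever", k, "output"))` in a loop over `k` would mirror the bash script, with each run's memory released when the child process exits.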

UPDATE: See ELKI tutorial created out of this post for updates.

Erich Schubert answered Feb 24 '23 04:02