I would like to run DBSCAN on Spark. I have currently found two implementations:
I have tested the first one with the sbt configuration given in its GitHub repository, but:
the functions in the jar are not the same as those in the documentation or in the source on GitHub; for example, I cannot find the train function in the jar
I managed to run a test with the fit function (found in the jar), but a badly configured epsilon (a little too big) sent the code into an infinite loop.
Code:
val model = DBSCAN.fit(eps, minPoints, values, parallelism)
Has someone managed to do something with the first library?
Has someone tested the second one?
You can also consider using Smile, which provides an implementation of DBSCAN. The most direct way is to use groupBy combined with either mapGroups or flatMapGroups and run DBSCAN there (a sketch of the Spark side follows the standalone example below). Here's an example:
import smile.clustering._
val dataset: Array[Array[Double]] = Array(
Array(100, 100),
Array(101, 100),
Array(100, 101),
Array(100, 100),
Array(101, 100),
Array(100, 101),
Array(0, 0),
Array(1, 0),
Array(1, 2),
Array(1, 1)
)
val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
println(dbscanResult)
// output
DBSCAN clusters of 10 data points:
0 6 (60.0%)
1 4 (40.0%)
Noise 0 ( 0.0%)
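To wire this into Spark, a minimal sketch of the groupBy + flatMapGroups approach could look like the following. The Point/Labeled case classes, the grouping column, and the minPts/radius values are my own illustrative assumptions, and I'm assuming a Smile 2.x version whose clustering result exposes the per-point labels through its y field:

import org.apache.spark.sql.{Dataset, SparkSession}
import smile.clustering.dbscan

case class Point(group: String, x: Double, y: Double)
case class Labeled(group: String, x: Double, y: Double, cluster: Int)

val spark = SparkSession.builder().appName("dbscan-per-group").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: two groups, each clustered independently.
val points: Dataset[Point] = Seq(
  Point("a", 100, 100), Point("a", 101, 100), Point("a", 100, 101),
  Point("a", 0, 0), Point("a", 1, 0), Point("a", 1, 1),
  Point("b", 50, 50), Point("b", 51, 50), Point("b", 50, 51)
).toDS()

val labeled: Dataset[Labeled] = points
  .groupByKey(_.group)                      // one DBSCAN run per group
  .flatMapGroups { (group, iter) =>
    val pts = iter.toArray                  // materialize the group on one executor
    val model = dbscan(pts.map(p => Array(p.x, p.y)), minPts = 3, radius = 5)
    // model.y holds each input point's cluster label; noise points get a dedicated outlier label
    pts.zip(model.y).map { case (p, c) => Labeled(p.group, p.x, p.y, c) }.toSeq
  }

labeled.show()

The important caveat is that each group must fit in the memory of a single executor, since DBSCAN runs locally on the materialized group.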
You can also write a User Defined Aggregate Function (UDAF) if you need to eke out more performance.
I use this approach at work for clustering time-series data: grouping with Spark's time window function and then running DBSCAN within each window lets us parallelize the implementation.
I was inspired by the following article to do this.
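As a rough sketch of that time-window variant (it reuses the SparkSession, implicits, and dbscan import from the snippet above; the Event fields, the 10-minute window size, and the DBSCAN parameters are my own assumptions, and I approximate Spark's built-in window function with a simple bucket key so the data stays a typed Dataset):

import java.sql.Timestamp

case class Event(ts: Timestamp, x: Double, y: Double)

// events: Dataset[Event], assumed to be loaded already (built like `points` above).
// Bucket each event into a 10-minute window and cluster every bucket
// independently with Smile's dbscan, exactly as before.
val windowMillis = 10L * 60 * 1000
val clustered = events
  .groupByKey(e => e.ts.getTime / windowMillis)         // window index as the grouping key
  .flatMapGroups { (windowIdx, group) =>
    val evts = group.toArray
    val labels = dbscan(evts.map(e => Array(e.x, e.y)), minPts = 3, radius = 5).y
    evts.zip(labels).map { case (e, c) => (windowIdx, e.ts, e.x, e.y, c) }.toSeq
  }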