I am currently looking for a tool that can generate datasets of different shapes (square, circle, rectangle, etc.) with outliers, for cluster analysis.
Can anyone recommend a good dataset generator for cluster analysis? Is there any way to generate such datasets in a language like R?
I would create a shape and extract its bounding coordinates. You can then populate the shape with random points using the splancs package.
Here's a small snippet from one of my programs:
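library(splancs)  # provides csr()

# First we create a circle, into which uniform random points will be generated (kudos to Barry Rowlingson, r-sig-geo).
# x, y: centre; r: radius; n: number of vertices of the polygon approximating the circle.
circle <- function(x, y, r, n) {
  t <- seq(from = 0, to = 2 * pi, length.out = n + 1)[-1]
  pts <- cbind(x = x + r * sin(t), y = y + r * cos(t))
  rbind(pts, pts[1, ])  # repeat the first vertex to close the polygon
}

# 1000 uniform random points inside a 30-sided polygon of radius 100 centred at (0, 0)
csr(circle(0, 0, 100, 30), 1000)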
Feel free to add outliers. One way of going about this is sampling different shapes and joining them in different ways.
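To make the "sample different shapes and join them" idea concrete, here is a minimal sketch in base R only (no splancs). The helper names `runif_disc` and `runif_rect`, the shape sizes and the number of outliers are all made up for illustration. Outliers are simply drawn uniformly from a box much larger than the shapes, so a few of them may happen to land inside a shape.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible

# Hypothetical helper: n uniform points inside a disc of radius r centred at (cx, cy).
# Taking sqrt() of the uniform radius keeps the point density even over the disc.
runif_disc <- function(n, cx = 0, cy = 0, r = 1) {
  rho   <- r * sqrt(runif(n))
  theta <- runif(n, 0, 2 * pi)
  cbind(x = cx + rho * cos(theta), y = cy + rho * sin(theta))
}

# Hypothetical helper: n uniform points inside an axis-aligned rectangle.
runif_rect <- function(n, xmin, xmax, ymin, ymax) {
  cbind(x = runif(n, xmin, xmax), y = runif(n, ymin, ymax))
}

disc <- runif_disc(500, cx = 0, cy = 0, r = 100)
rect <- runif_rect(500, xmin = 150, xmax = 350, ymin = -50, ymax = 50)

# Outliers: a handful of points scattered over a much larger box.
outliers <- runif_rect(20, xmin = -300, xmax = 600, ymin = -300, ymax = 300)

# Join everything and keep the true origin of each point as a label.
dat <- data.frame(
  rbind(disc, rect, outliers),
  label = rep(c("disc", "rect", "outlier"), times = c(500, 500, 20))
)

plot(dat$x, dat$y, col = as.integer(factor(dat$label)), pch = 20, asp = 1)
```

Keeping the `label` column around is handy later, since you can compare the clustering result against the shape each point was actually drawn from.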
You should probably look into the mlbench package, especially its synthetic dataset generators, the mlbench.* functions.
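As a rough sketch of what these generators look like in use (the sample sizes and noise parameters below are arbitrary choices, not recommendations), each call returns an object whose `$x` holds the coordinates and `$classes` the true group of each point:

```r
library(mlbench)

set.seed(1)  # arbitrary seed, only for reproducibility

# A few of the 2-D generators shipped with mlbench.
shapes  <- mlbench.shapes(n = 500)                            # Gaussian, square, triangle, wave
spirals <- mlbench.spirals(n = 300, cycles = 1.5, sd = 0.05)  # two noisy spirals
smiley  <- mlbench.smiley(n = 500, sd1 = 0.1, sd2 = 0.05)     # eyes, nose and mouth

str(shapes$x)          # numeric matrix with one row per point
table(shapes$classes)  # how many points were drawn from each shape

plot(shapes)           # the plot method colours points by class
```

Since `$x` is a plain matrix, you can `rbind()` a few uniformly scattered points onto it if you want outliers as well.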
Other datasets or utility functions are probably best found on the Cluster Task View on CRAN. As @Roman said, adding outliers is not really difficult, especially when you work in only two dimensions.
There is a flexible data generator in ELKI that can generate various distributions in arbitrary dimensionality. It can also generate Gamma-distributed variables, for example.
There is documentation on the Wiki: http://elki.dbs.ifi.lmu.de/wiki/DataSetGenerator