I am a data-mining newbie and need some help with a high-dimensional data set (a subset is shown below). The full set has 30 dimensions and several thousand rows.
The task is to see how the rows cluster and whether any similarity metrics can be calculated from this data. I have looked at SOMs and cosine similarity, but I am unsure how to approach the problem.
P.S. I am not at all versed in R or similar stats packages, so I would appreciate pointers to C#/.NET-based libraries.
"ROW" "CPG" "FSD" "FR" "CV" "BI22" "MI99" "ME" "HC" "L1" "L2" "TL"
1 298 840 3.80 5.16 169.17 69 25.0 0.82 125 453 792
2 863 676 4.09 4.28 97.22 63 18.5 0.85 172 448 571
3 915 942 7.04 5.33 33.01 72 35.1 0.86 134 450 574
I think what you are looking for is known as a multidimensional scaling (MDS) plot. It's pretty straightforward to do, but you will need a library that can do some linear algebra/optimization.
Step one is to calculate a distance matrix: the matrix of pairwise Euclidean distances between all of the data points.
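As a rough illustration of step one, here is a minimal sketch in plain Python (used only for illustration; the same loops translate directly to C#). The function name `distance_matrix` is made up for this example, and the data are the three sample rows from the question with the ROW index dropped.

```python
import math

# The three sample rows from the question (ROW index column removed).
rows = [
    [298, 840, 3.80, 5.16, 169.17, 69, 25.0, 0.82, 125, 453, 792],
    [863, 676, 4.09, 4.28, 97.22, 63, 18.5, 0.85, 172, 448, 571],
    [915, 942, 7.04, 5.33, 33.01, 72, 35.1, 0.86, 134, 450, 574],
]

def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # requires Python 3.8+
            dist[i][j] = d
            dist[j][i] = d  # distances are symmetric
    return dist

D = distance_matrix(rows)
for line in D:
    print(["%.1f" % d for d in line])
```

Note that the columns are on very different scales (CPG is in the hundreds, HC is below 1), so with raw values the large columns dominate the distance. Standardizing each column first is a common choice, but whether it is appropriate depends on the data.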
Step two is to find N vectors or features (usually 2 for a 2D plot) whose pairwise distances match the matrix from step one as closely as possible. In classical MDS this amounts to squaring the distances, double-centering the resulting matrix, and taking the eigenvectors with the N largest eigenvalues, each scaled by the square root of its eigenvalue. You may be able to find a linear algebra library that can do this in your language of choice. I have always used the R function cmdscale() for this, though:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/cmdscale.html
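For readers without R, step two can be sketched as follows. This is not how cmdscale() is implemented; it is a minimal classical-MDS sketch under stated assumptions: the name `classical_mds` is hypothetical, and plain power iteration with deflation stands in for a proper eigen-solver so that the example needs no external library. For several thousand rows you would want a real linear algebra library instead.

```python
import math
import random

def classical_mds(dist, n_components=2, iters=500, seed=0):
    """Embed points in n_components dimensions from a distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -0.5 * (D^2 - row means - column means + grand mean).
    # D^2 is symmetric, so the column means equal the row means.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * n_components for _ in range(n)]
    for k in range(n_components):
        # Power iteration: repeated multiplication converges to the
        # eigenvector with the largest eigenvalue (assumption: a simple
        # stand-in for a library eigen-solver).
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = v[i] * scale
        # Deflation: remove the found component before the next pass.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Usage with the three sample rows from the question.
pts = [
    [298, 840, 3.80, 5.16, 169.17, 69, 25.0, 0.82, 125, 453, 792],
    [863, 676, 4.09, 4.28, 97.22, 63, 18.5, 0.85, 172, 448, 571],
    [915, 942, 7.04, 5.33, 33.01, 72, 35.1, 0.86, 134, 450, 574],
]
D = [[math.dist(p, q) for q in pts] for p in pts]
Y = classical_mds(D)
for x, y in Y:
    print("%.2f %.2f" % (x, y))
```

Each row of `Y` is a 2D coordinate you can feed to any scatter plot. The signs and orientation of the axes are arbitrary; only the distances between the plotted points are meaningful.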