
How to format data for the spark mlib kmeans clustering algorithm?

I'm trying to run the k-means clustering algorithm from Apache Spark's MLlib library. I have everything set up, but I'm not exactly sure how to format the input data. I'm relatively new to machine learning, so any help would be appreciated. In the sample data.txt the data looks like this:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

And the data that I want to run the algorithm on is in this format for now (json array):

[
  {"customer":"ddf6022","order_id":"20031-19958","asset_id":"dd1~33","price":300,"time":1411134115000,"location":"bt2"},
  {"customer":"ddf6023","order_id":"23899-23825","asset_id":"dd1~33","price":300,"time":1411954672000,"location":"bt2"}
]

How can I convert it into something that can be used with the k-means clustering algorithm? I'm using Java, and I'm guessing the data needs to end up in a JavaRDD, but I have no idea how to go about doing that.

Raza asked Apr 29 '15


1 Answer

How this works:

First of all, you have to decide which dimensions you want to apply KMeans on. The KMeans example in the Spark documentation is applied to a data set of 3D points (X, Y and Z dimensions). Take into account that the KMeans implementation in MLlib works on sets of n dimensions, where n >= 1.
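Before Spark can cluster the sample data.txt, each whitespace-separated line has to become a numeric point (in the MLlib example this array is then wrapped with Vectors.dense inside a map over the RDD). As a minimal plain-Java sketch of that parsing step, with no Spark dependency:

```java
import java.util.Arrays;
import java.util.List;

public class ParsePoints {
    // Parse one whitespace-separated line of data.txt into a numeric point.
    // In Spark you would wrap this array with Vectors.dense(...) inside a map().
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] point = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            point[i] = Double.parseDouble(tokens[i]);
        }
        return point;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("0.0 0.0 0.0", "9.2 9.2 9.2");
        for (String line : lines) {
            System.out.println(Arrays.toString(parseLine(line)));
        }
    }
}
```

Nothing here forces exactly three values per line, which is why the same parsing works for any number of dimensions.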

A Proposal:

So let's say that, for your input, the X, Y and Z dimensions are going to be the JSON fields price, time and location. Then all you have to do is extract those dimensions from your data set and put them in a text file as follows:

300 1411134115000 2
300 1411954672000 2
...
...
...

Where location "bt2" has been replaced by 2 (assuming that your data set contains other locations as well). You have to provide numeric values to KMeans.
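A sketch of that extraction step in plain Java (the field values come from your JSON records; how you parse the JSON itself is up to you). The exact codes assigned to locations are arbitrary, as long as each label always maps to the same number:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToKMeansInput {
    // Assign each distinct location label a stable numeric code,
    // since KMeans needs numeric values in every dimension.
    private final Map<String, Integer> locationCodes = new LinkedHashMap<>();

    int codeFor(String location) {
        return locationCodes.computeIfAbsent(location, k -> locationCodes.size() + 1);
    }

    // Emit one "price time location" line per order record.
    String toLine(double price, long time, String location) {
        return price + " " + time + " " + codeFor(location);
    }

    public static void main(String[] args) {
        ToKMeansInput conv = new ToKMeansInput();
        // Values taken from the two records in the question.
        System.out.println(conv.toLine(300, 1411134115000L, "bt2"));
        System.out.println(conv.toLine(300, 1411954672000L, "bt2"));
    }
}
```

Each printed line then matches the text-file format above, ready to be loaded and parsed into vectors for KMeans.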

Notes/Ideas:

For better clustering results, and depending on the time distribution of your data, it would be worth transforming the timestamp field into separate values: year, month, day, hour, minute, second, etc. That way you can experiment with different dimensions as separate fields, depending on your clustering purpose.
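Splitting the epoch-millisecond timestamp into calendar fields can be done with java.time; each resulting field can then serve as its own clustering dimension (the UTC zone here is an assumption, pick whatever zone your data is in):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;

public class SplitTimestamp {
    // Break an epoch-millisecond timestamp into calendar fields that
    // can each serve as a separate clustering dimension.
    static int[] fields(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return new int[]{t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
                         t.getHour(), t.getMinute(), t.getSecond()};
    }

    public static void main(String[] args) {
        // 1411134115000 ms is the "time" value from the first record in the question.
        System.out.println(Arrays.toString(fields(1411134115000L)));
    }
}
```

For example, clustering on just the hour field would group orders by time of day regardless of the date.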

Also, I guess you would like to automate the JSON-to-CSV conversion process. In your mapping implementation you could use an approach like this: https://stackoverflow.com/a/15411074/833336

emecas answered Nov 08 '22