im trying to use Apache Spark for document classification.
For example i have two types of Class (C and J)
Train data is :
C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese
And test data is : Chinese Chinese Chinese Tokyo Japan // What is ist J or C ?
How i can train and predict as above datas. I did Naive Bayes text classification with Apache Mahout, however no with Apache Spark.
How can i do this with Apache Spark?
Yes, it doesn't look like there is any simple tool to do that in Spark yet. But you can do it manually by first creating a dictionary of terms. Then compute IDFs for each term and then convert each documents into vectors using the TF-IDF scores.
There is a post on http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
Spark can do this in very simple way. The key step is: 1 use HashingTF to get the item frequency. 2 convert the data to the form of the bayes model needed.
def testBayesClassifier(hiveCnt:SQLContext){
val trainData = hiveCnt.createDataFrame(Seq((0,"aa bb aa cc"),(1,"aa dd ee"))).toDF("category","text")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featureData = hashTF.transform(wordsData) //key step 1
val trainDataRdd = featureData.select("category","features").map {
case Row(label: Int, features: Vector) => //key step 2
LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}
//train the model
val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")
//same for the test data
val testData = hiveCnt.createDataFrame(Seq((-1,"aa bb"),(-1,"cc ee ff"))).toDF("category","text")
val testWordData = tokenizer.transform(testData)
val testFeatureData = hashTF.transform(testWordData)
val testDataRdd = testFeatureData.select("category","features").map {
case Row(label: Int, features: Vector) =>
LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
}
val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With