 

Apache Spark Naive Bayes based Text Classification

I'm trying to use Apache Spark for document classification.

For example, I have two classes (C and J).

The training data is:

C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese

And the test data is: Chinese Chinese Chinese Tokyo Japan // Is it J or C?

How can I train a model and predict on the data above? I have done Naive Bayes text classification with Apache Mahout, but not with Apache Spark.

How can I do this with Apache Spark?

Asked Dec 25 '22 by Rahman Usta

2 Answers

Yes, it doesn't look like there is a simple tool for that in Spark yet. But you can do it manually: first build a dictionary of terms, then compute the IDF of each term, and finally convert each document into a vector of its TF-IDF scores.

There is a post at http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with code as well).
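As a sanity check on the toy data from the question, here is a minimal plain-Scala multinomial Naive Bayes with Laplace smoothing (no Spark, no TF-IDF; just raw term counts). It predicts class C for the test document, since the three occurrences of "Chinese" outweigh "Tokyo" and "Japan":

```scala
// Training data from the question: (class, document)
val train = Seq(
  ("C", "Chinese Beijing Chinese"),
  ("C", "Chinese Chinese Shanghai"),
  ("C", "Chinese Macao"),
  ("J", "Tokyo Japan Chinese"))

val docs    = train.map { case (c, t) => (c, t.split(" ").toSeq) }
val vocab   = docs.flatMap(_._2).distinct
val classes = docs.map(_._1).distinct

// log P(c) + sum over test tokens of log P(w|c), with Laplace (add-one) smoothing
def score(c: String, test: Seq[String]): Double = {
  val classDocs   = docs.filter(_._1 == c)
  val classTokens = classDocs.flatMap(_._2)
  val prior = math.log(classDocs.size.toDouble / docs.size)
  val likelihood = test.map { w =>
    val count = classTokens.count(_ == w)
    math.log((count + 1.0) / (classTokens.size + vocab.size))
  }.sum
  prior + likelihood
}

val test = "Chinese Chinese Chinese Tokyo Japan".split(" ").toSeq
val predicted = classes.maxBy(c => score(c, test)) // "C"
```

This is the same multinomial model that Spark's NaiveBayes trains, just computed by hand on six vocabulary terms.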

Answered Jan 30 '23 by Francois Dang Ngoc


Spark can do this in a very simple way. The key steps are:

1. Use HashingTF to get the term frequencies.
2. Convert the data into the form the Naive Bayes model needs.

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

def testBayesClassifier(hiveCnt: SQLContext): Unit = {
    val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee")))
        .toDF("category", "text")
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val wordsData = tokenizer.transform(trainData)
    val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
    val featureData = hashTF.transform(wordsData) // key step 1: term frequencies via the hashing trick
    val trainDataRdd = featureData.select("category", "features").map {
        case Row(label: Int, features: Vector) => // key step 2: convert to LabeledPoint
            LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }
    // train the model
    val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")

    // same pipeline for the test data (-1 is a dummy label, since the true class is unknown)
    val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff")))
        .toDF("category", "text")
    val testWordData = tokenizer.transform(testData)
    val testFeatureData = hashTF.transform(testWordData)
    val testDataRdd = testFeatureData.select("category", "features").map {
        case Row(label: Int, features: Vector) =>
            LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }
    val testPredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
    testPredictionAndLabel.collect().foreach(println)
}
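To see what key step 1 is doing, the hashing trick can be sketched in plain Scala (no Spark). This mirrors the idea behind HashingTF, not its exact hash function: each term is bucketed by a non-negative hash modulo numFeatures, and occurrences per bucket are counted. Distinct terms may collide in the same bucket, which is the usual trade-off of the hashing trick:

```scala
// Toy stand-in for HashingTF: bucket each word by hashCode modulo numFeatures
// and count occurrences per bucket.
def hashingTF(words: Seq[String], numFeatures: Int): Array[Double] = {
  val vec = Array.fill(numFeatures)(0.0)
  for (w <- words) {
    // force the modulus to be non-negative, since hashCode can be negative
    val idx = ((w.hashCode % numFeatures) + numFeatures) % numFeatures
    vec(idx) += 1.0
  }
  vec
}

// "aa" occurs twice, so its bucket holds 2.0; "bb" and "cc" each contribute 1.0
val v = hashingTF("aa bb aa cc".split(" "), 20)
```

The vector sums to the document length (4 tokens here), which is why the counts feed directly into the multinomial Naive Bayes model.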

Answered Jan 30 '23 by user1785009