
How to count the frequency of words with CountVectorizer in spark ML?

The code below produces a count vector for each row of the DataFrame:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)


cvModel.transform(df).show(false)

The result is:

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+

How can I get the total count of each word across the whole corpus, like:

+---+------+------+
|id |words |counts|
+---+------+------+
|0  |a     |  3   |
|1  |b     |  3   |
|2  |c     |  2   |
+---+------+------+
asked by Ivan Lee


1 Answer

Shankar's answer only gives you the actual frequencies if the CountVectorizerModel keeps every single word in the corpus (i.e. no minDF or vocabSize restrictions). In those cases you can use Summarizer to sum the feature vectors directly. Note: Summarizer requires Spark 2.3+.

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics
import spark.implicits._  // for the $"..." column syntax and Dataset encoders

// You need to select normL1 and another metric (like mean) because, for some reason,
// Spark won't allow a single Vector metric to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
    .select(metrics("normL1", "mean").summary($"features").as("summary"))
    .select("summary.normL1", "summary.mean")
    .as[(Vector, Vector)]
    .first()
    ._1

You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
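
For completeness, a minimal sketch of that zipping step, assuming the totalCounts and cvModel values from above are in scope:

// cvModel.vocabulary is indexed the same way as the feature vectors,
// so position i of totalCounts is the total count of vocabulary(i)
val wordCounts: Array[(String, Double)] =
  cvModel.vocabulary.zip(totalCounts.toArray)

wordCounts.foreach { case (word, count) => println(s"$word: $count") }

If you want the result back as a DataFrame, something like wordCounts.toSeq.toDF("word", "count") should do, with spark.implicits._ in scope.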

answered by hichris123


