
Spark IDF for new documents

What is the best approach to applying a TF-IDF transformation to new documents in Spark? I have a setting in which I train the model offline and then load it and apply it to new files. Basically, it does not make much sense to calculate IDF without access to the model's IDF distribution.

So far the only solution I have thought of is to save the TF RDD of the training set, append the new document to it, then calculate the IDF RDD and extract the new file from it. The problem with this is that I would have to keep the entire TF vector in memory (I guess it could probably be done with the IDF RDD as well).

This looks like a problem someone has already had, so I am looking for advice and insights on the best way to do it.

Cheers,

Ilija

asked Sep 25 '22 by ilijaluve
1 Answer

You don't need RDDs at all. TF doesn't depend on anything other than the data you have (plus a vocabulary, if you use a fixed-size representation without hashing), and IDF is simply a model which can be represented as a vector and depends only on the vocabulary.
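For reference, if I recall the implementation correctly, MLlib computes the weight for a term t as IDF(t) = log((m + 1) / (df(t) + 1)), where m is the number of training documents and df(t) is the number of those documents containing t. In other words, the fitted model is just one weight per vocabulary index; nothing corpus-sized has to be retained.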

So the only thing that has to be kept around is the IDFModel. Assuming the transformations you use look more or less like this:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// rdd: RDD[Seq[String]] of tokenized training documents
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(rdd)

val idf = new IDF().fit(tf)                // fit the IDF model on the training corpus
val tfidf: RDD[Vector] = idf.transform(tf)

the only thing useful for further operations on new data is the idf variable. While it has no save method, it is a local, serializable object, so you can use standard Java serialization to persist it.
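In case a sketch helps, this is roughly what that looks like, reusing the hashingTF and idf values from the snippet above; the file name and the newDocs RDD are placeholders:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.feature.IDFModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Persist the fitted model once, after offline training.
val oos = new ObjectOutputStream(new FileOutputStream("idf.model"))
oos.writeObject(idf)
oos.close()

// Later, in the job that scores new files: load the model back ...
val ois = new ObjectInputStream(new FileInputStream("idf.model"))
val loadedIdf = ois.readObject().asInstanceOf[IDFModel]
ois.close()

// ... hash the new documents with the same HashingTF configuration
// and apply the stored IDF weights. newDocs: RDD[Seq[String]]
val newTf: RDD[Vector] = hashingTF.transform(newDocs)
val newTfidf: RDD[Vector] = loadedIdf.transform(newTf)

Since HashingTF is stateless, recreating it with the same number of features in the serving job yields vectors consistent with those seen at training time.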

answered Sep 29 '22 by zero323