What is the best approach to applying a TF-IDF transformation to new documents in Spark? I have a setting in which I train the model offline and then load it and apply it to new files. Basically, it does not make much sense to calculate IDF if there is no access to the model's IDF distribution.
So far the only solution I have thought of is to save the TF RDD of the training set, append the new document to it, calculate the IDF RDD, and then extract the new file from that IDF RDD. The problem with this is that I have to keep the entire TF vector in memory (and I guess the same could probably be done with the IDF RDD as well).
This looks like a problem someone has already had, so I am looking for advice and insights on the best way to do it.
Cheers,
Ilija
You don't need RDDs at all. TF doesn't depend on anything other than the data you have (and a vocabulary, if you use a fixed-size representation without hashing), and IDF is simply a model which can be represented as a vector and depends only on the vocabulary.
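To illustrate the first point, here is a minimal sketch (the tokens are made up) showing that the TF of a single new document can be computed locally, with no RDD involved:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val hashingTF = new HashingTF()
val doc: Seq[String] = Seq("spark", "tf", "idf")  // a tokenized new document
val tfVector: Vector = hashingTF.transform(doc)   // a plain local Vector, no cluster needed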
So the only thing you have to keep around is the IDFModel. Assuming the transformations you use look more or less like this:
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(rdd)  // rdd: RDD[Seq[String]] of tokenized documents
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
the only thing which is useful for further operations on the new data is the idf variable. While it has no save method, it is a local, serializable object, so you can use standard Java serialization to persist it.
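For example, here is a minimal sketch of that approach (the file path is an assumption, and the single-Vector transform overload on IDFModel requires Spark 1.3+):

import java.io._
import org.apache.spark.mllib.feature.{HashingTF, IDFModel}
import org.apache.spark.mllib.linalg.Vector

// Offline: persist the fitted model with plain Java serialization.
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/idf.model"))
oos.writeObject(idf)
oos.close()

// Later, at serving time: load the model back.
val ois = new ObjectInputStream(new FileInputStream("/tmp/idf.model"))
val loadedIdf = ois.readObject().asInstanceOf[IDFModel]
ois.close()

// Score a new document: hash it to a TF vector locally, then reweight by IDF.
val hashingTF = new HashingTF()
val newTf: Vector = hashingTF.transform(Seq("some", "new", "tokens"))
val newTfIdf: Vector = loadedIdf.transform(newTf)

One caveat: the HashingTF used at serving time must be configured with the same numFeatures as the one used during training, otherwise the IDF vector and the new TF vectors will not line up.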