Based on the Spark 1.4 documentation (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html) I'm writing a TF-IDF example for converting text documents to vectors of values. The example given shows how this can be done, but the input is an RDD of tokens with no keys. This means that my output RDD no longer contains an index or key to refer back to the original document. The example is this:
from pyspark.mllib.feature import HashingTF

documents = sc.textFile("...").map(lambda line: line.split(" "))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
I would like to do something like this:
documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples it appears there is no good way to link the document RDD with the tf RDD.
I also encountered the same issue. The example in the docs encourages you to apply the transformations directly on the RDD of token sequences. However, you can apply the transformations to the vectors themselves, and that way you can keep whatever keys you choose.
import org.apache.spark.mllib.feature.{HashingTF, IDF}

val input = sc.textFile("...")
// Pair each document with a key; here the document text itself is the key.
val documents = input.map(doc => doc -> doc.split(" ").toSeq)

val hashingTF = new HashingTF()
// mapValues transforms each document while preserving its key.
val tf = documents.mapValues(hashingTF.transform(_))
tf.cache()

// Fit the IDF model on the values only, then apply it per document.
val idf = new IDF().fit(tf.values)
val tfidf = tf.mapValues(idf.transform(_))
Note that this code will yield an RDD[(String, Vector)] instead of an RDD[Vector].
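Since the question uses PySpark, here is a rough Python equivalent of the same idea; treat it as an untested sketch against the 1.4 API. One caveat: PySpark's HashingTF is implemented in pure Python, so it can run inside mapValues, but IDFModel wraps a JVM object, and the 1.4 docs note that its transform cannot be used inside an RDD transformation. The sketch works around that by transforming the values RDD directly and zipping the keys back on:

from pyspark.mllib.feature import HashingTF, IDF

documents = sc.textFile("...").map(lambda line: (line, line.split(" ")))

hashingTF = HashingTF()
# HashingTF.transform is pure Python, so it is safe inside mapValues.
tf = documents.mapValues(hashingTF.transform)
tf.cache()

idf = IDF().fit(tf.values())
# idf.transform must be applied to the whole RDD rather than per record,
# so transform the values and zip the keys back on afterwards.
tfidf = tf.keys().zip(idf.transform(tf.values()))

The zip should be safe here because both sides derive from the cached tf through map-like operations that preserve partition count and per-partition order. As in the Scala version, the document text serves as its own key; sc.textFile("...").zipWithIndex() would give you a numeric UNIQUE_LINE_KEY instead.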