I am trying to filter the stop words out of an RDD of words read from a .txt file.
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))
// Create a tuple of test words
val testWords = ("you", "to")
// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))
The code above all makes sense to me and seems to work fine. This is where I'm having trouble.
//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)
This runs, but it doesn't actually filter out the words contained in the tuple testWords. I'm not sure how to test each word in wordsWithStopWords against every word in testWords.
You can collect the stop words into a Set, broadcast it, and filter against it:
import org.apache.spark.rdd.RDD
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)
// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
// Keep only the words that are not in the broadcast stop-word set
val words = wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))
Broadcast variables let you keep a read-only value cached on each machine rather than shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset efficiently, which is what happens here with the stop-word set.
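As for why the original filter keeps everything: it compares each word (a String) with the whole tuple, and a String is never equal to a Tuple2, so the predicate is true for every element. Below is a minimal sketch of the difference using plain Scala collections instead of an RDD, so it runs without a Spark cluster. The object name and the sample sentence are made up for illustration; the predicate passed to filter is the same shape you would use on the RDD.

```scala
object StopWordFilterSketch {
  // Hypothetical sample text standing in for one line of book.txt.
  val line = "you need to filter the words you want to drop"

  // Same split as in the question.
  val wordsWithStopWords: Seq[String] = line.split("\\W+").toSeq

  // The question's approach: compare each word with the tuple itself.
  // For a non-null String, != is defined in terms of equals, and a String
  // never equals a Tuple2, so nothing is removed.
  val testWords = ("you", "to")
  val unfiltered: Seq[String] = wordsWithStopWords.filter(x => !x.equals(testWords))

  // Membership test against a Set: this is what actually drops the stop words.
  val stopWords: Set[String] = Set("you", "to")
  val words: Seq[String] = wordsWithStopWords.filter(w => !stopWords.contains(w))

  def main(args: Array[String]): Unit = {
    println(unfiltered.mkString(" "))
    println(words.mkString(" "))
  }
}
```

Note that the membership test is case-sensitive, so whether "You" is removed depends on how the words in the stop-word file are cased.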