I am trying to code the Dissociated Press algorithm, based on n-grams, in Scala. How can I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees".
Can you please give me some hints on how to do it? Sorry for the inconvenience.
Character n-grams are handcrafted features which widely serve as discriminative features in text categorization [2], authorship attribution [3], authorship verification [5], plagiarism detection [9, 19], spam filtering [6], native language identification of a text's author [8], discriminating language variety [11], and many ...
N-Gram Ranking: Simply put, an n-gram is a sequence of n consecutive words, where n can be any positive integer.
Your question could be a little more specific, but here is my attempt.
val words = "the bee is the bee of the bees"
// sliding(2) yields each pair of adjacent words (the bigrams); join them with a space
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))
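Once sliding windows like these are available, one common reading of Dissociated Press is a random walk over the n-grams: look up the last n-1 words and pick one of the words that followed them in the source. The following is only a minimal word-level sketch under that assumption; the object name `DissociatedPress` and the helpers `followers` and `generate` are hypothetical names chosen for illustration, and n is assumed to be at least 2.

```scala
import scala.util.Random

object DissociatedPress {
  // Maps each (n-1)-word prefix to the words observed right after it in the source.
  def followers(words: Seq[String], n: Int): Map[Vector[String], Vector[String]] =
    words.toVector
      .sliding(n)
      .filter(_.size == n) // drop the partial window produced when the text is shorter than n
      .toVector
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Starts from the first n-1 words and keeps appending a randomly chosen follower
  // until `length` words exist or the current prefix was never followed by anything.
  def generate(words: Seq[String], n: Int, length: Int, rng: Random): Vector[String] = {
    val table = followers(words, n)
    var out = words.take(n - 1).toVector
    var stuck = false
    while (out.size < length && !stuck) {
      table.get(out.takeRight(n - 1)) match {
        case Some(next) => out = out :+ next(rng.nextInt(next.size))
        case None       => stuck = true
      }
    }
    out
  }
}
```

For example, `DissociatedPress.generate("the bee is the bee of the bees".split(' ').toSeq, 2, 10, new Random())` returns a word sequence in which every adjacent pair also occurs in the source text. It can stop early, because nothing follows "bees".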
Here is a stream-based approach. It does not require much memory while computing the n-grams.
object ngramstream extends App {
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))
  process(ngrams2) { x =>
    println(x.toList)
  }
}
OUTPUT:
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
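For a genuinely large file, the same idea can be sketched with a plain Iterator read line by line. A Stream memoizes the elements it has already evaluated, so an Iterator may keep less in memory. This is a rough sketch, not a tuned implementation: the object name `FileNgrams`, the helper `printNgrams`, and its `path` parameter are hypothetical, and words are assumed to be separated by whitespace.

```scala
import scala.io.Source

object FileNgrams {
  // Lazily yields every n-gram of whitespace-separated words drawn from `lines`.
  // Windows may span line boundaries, since the lines are flattened into one word sequence.
  def ngrams(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .sliding(n)
      .withPartial(false) // skip the short window produced when fewer than n words exist
      .map(_.toList)

  // Prints each n-gram of the file at `path` (a hypothetical input), one per line.
  def printNgrams(path: String, n: Int): Unit = {
    val source = Source.fromFile(path)
    try ngrams(source.getLines(), n).foreach(g => println(g.mkString(" ")))
    finally source.close()
  }
}
```

Only the current window of n words is held at a time, so memory use should depend on n rather than on the file size.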