
How to generate n-grams in Scala?

Tags: scala, n-gram

I am trying to implement the Dissociated Press algorithm, which is based on n-grams, in Scala. How can I generate n-grams for a large file? For example, take a file containing "the bee is the bee of the bees".

  1. First, it has to pick a random n-gram, for example "the bee".
  2. Then it has to look for an n-gram that starts with the last (n-1) words of the current one, for example "bee of".
  3. It prints the last word of that n-gram, then repeats.

Can you please give me some hints on how to do this? Sorry for the inconvenience.

user1002579 asked Nov 24 '11 14:11


2 Answers

Your question could be a little more specific, but here is my attempt.

val words = "the bee is the bee of the bees"
// sliding(2) yields every pair of adjacent words (the bigrams); use sliding(n) for n-grams
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))
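
The three steps in the question can be built on top of `sliding`. The following is only a minimal sketch, not a definitive implementation: the names `DissociatedPress`, `buildTable`, `generate` and the `maxWords` cut-off are assumptions made for illustration, and the text is assumed to fit in memory as a sequence of words. It maps every (n-1)-word prefix to the words that follow it, picks a random starting n-gram, and then keeps appending a random continuation of the last (n-1) words.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object DissociatedPress {

  // Map each (n-1)-word prefix to every word that follows it in the text.
  def buildTable(words: Seq[String], n: Int): Map[Seq[String], Seq[String]] =
    words.sliding(n)
      .filter(_.length == n) // drop the short window produced when the text has fewer than n words
      .toVector
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Returns the seed n-gram plus continuations, up to maxWords words
  // (hypothetical cut-off); stops early if the current prefix has no continuation.
  def generate(words: Seq[String], n: Int, maxWords: Int, rnd: Random): Seq[String] = {
    require(n >= 2, "n must be at least 2")
    val table = buildTable(words, n)
    if (table.isEmpty) Seq.empty
    else {
      val grams = words.sliding(n).toIndexedSeq
      val start = grams(rnd.nextInt(grams.length)) // step 1: pick a random n-gram
      val out = ArrayBuffer.empty[String]
      out ++= start
      var prefix = start.tail                      // the last n-1 words
      var done = false
      while (!done && out.length < maxWords) {
        table.get(prefix) match {
          case Some(nexts) =>                      // step 2: an n-gram starting with the prefix
            val next = nexts(rnd.nextInt(nexts.length))
            out += next                            // step 3: emit its last word, then repeat
            prefix = prefix.tail :+ next
          case None => done = true
        }
      }
      out.toList
    }
  }
}
```

For example, `DissociatedPress.generate("the bee is the bee of the bees".split(' ').toVector, 2, 10, new Random())` returns a random word sequence in which every adjacent pair also occurs in the original sentence.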
peri4n answered Nov 13 '22 07:11


Here is a stream-based approach. It should not require too much memory while computing the n-grams.

object ngramstream extends App {

  // Apply f to each element of the stream in turn, forcing one element at a time.
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  // Build the i-grams for every i from 2 to n and concatenate them into one stream.
  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  // Print each n-gram as a List so its words are shown.
  process(ngrams2) { x =>
    println(x.toList)
  }

}

OUTPUT:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
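
For a file too large to split in one go, a similar result might be obtained by working on an iterator of lines instead of an in-memory array. This is a rough sketch under stated assumptions: the names `LazyNgrams`, `ngrams`, and `printFromFile` are hypothetical, words are assumed to be separated by whitespace, and n-grams are allowed to span line breaks. Only a window of about n words is held at any time.

```scala
import scala.io.Source

object LazyNgrams {

  // Lazily turn an iterator of lines into n-grams of whitespace-separated words.
  def ngrams(lines: Iterator[String], n: Int): Iterator[Seq[String]] =
    lines
      .flatMap(_.split("\\s+").iterator)
      .filter(_.nonEmpty)   // splitting a line with leading whitespace yields an empty token
      .sliding(n)
      .withPartial(false)   // no short window when there are fewer than n words

  // Hypothetical helper: print every n-gram of the file at `path`, one per line.
  def printFromFile(path: String, n: Int): Unit = {
    val src = Source.fromFile(path)
    try ngrams(src.getLines(), n).foreach(g => println(g.mkString(" ")))
    finally src.close()
  }
}
```

Because the result is a plain `Iterator`, it can be traversed only once; to produce several n-gram sizes as in the example above, the file would have to be read once per size.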
tuxdna answered Nov 13 '22 07:11