I am trying to implement the Dissociated Press algorithm, based on n-grams, in Scala. How can I generate n-grams for a large file? For example, for a file containing "the bee is the bee of the bees":

1. First, it has to pick a random n-gram, for example "the bee".
2. Then it has to look for n-grams starting with the last (n-1) words of that n-gram, for example "bee of".
3. It prints the last word of this n-gram, then repeats.

Can you please give me some hints on how to do it? Sorry for the inconvenience.


How to generate n-grams in Scala?

2 Answers

Your question could be more specific, but here is my attempt:

val words = "the bee is the bee of the bees"
// sliding(2) yields each pair of adjacent words (bigrams); join each pair with a space
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))
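
For a large file, the same sliding idea can be applied to an iterator of lines, so that only the current window of words has to be held in memory. The following is a minimal sketch under the assumption that words are separated by whitespace; the object name `LazyNgrams` and the helper `ngramsFromLines` are hypothetical names introduced for illustration.

```scala
object LazyNgrams {
  // Lazily yields every n-gram (as a Vector of words) from an iterator of lines.
  // Words are pulled from the underlying iterator one window at a time.
  def ngramsFromLines(lines: Iterator[String], n: Int): Iterator[Vector[String]] =
    lines
      .flatMap(_.split("\\s+").iterator) // split each line on whitespace
      .filter(_.nonEmpty)                 // drop empty tokens caused by leading whitespace
      .sliding(n)                         // Iterator.sliding produces windows lazily
      .withPartial(false)                 // skip a window shorter than n (input too short)
      .map(_.toVector)
}
```

To read from a file, the lines could come from `scala.io.Source.fromFile("corpus.txt").getLines()`, where `corpus.txt` is a placeholder file name; the source should be closed once the iterator has been consumed.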


answered Nov 13 '22 07:11

peri4n

Here is a stream-based approach. It does not require much memory while computing the n-grams, because the stream is evaluated lazily.

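object ngramstream extends App {

  // Applies f to each element of the stream in order, consuming it as it goes.
  // Note: Stream is deprecated since Scala 2.13 in favor of LazyList.
  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  // Builds one lazy stream holding all 2-grams, then all 3-grams, and so on up to n-grams.
  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  // Print every n-gram as a List so the words are readable.
  process(ngrams2) { x =>
    println(x.toList)
  }

}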

OUTPUT:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
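
The snippets above only enumerate n-grams; they do not perform the generation steps listed in the question (pick a random n-gram, find n-grams starting with its last n-1 words, emit the last word, repeat). The following is a minimal sketch of those steps, not a definitive implementation. The names `DissociatedPress`, `buildModel`, and `generate` are hypothetical, and the sketch assumes the word list fits in memory and that generation stops when no n-gram continues the current prefix.

```scala
import scala.util.Random

object DissociatedPress {
  // Maps each (n-1)-word prefix to all words observed right after it in the text.
  def buildModel(words: Vector[String], n: Int): Map[Vector[String], Vector[String]] =
    words.sliding(n)
      .filter(_.length == n) // ignore a short window when the text has fewer than n words
      .toVector
      .groupBy(_.init)
      .map { case (prefix, grams) => prefix -> grams.map(_.last) }

  // Generates at most maxWords words by following the three steps from the question.
  def generate(words: Vector[String], n: Int, maxWords: Int, rng: Random): Vector[String] = {
    require(n >= 2, "n must be at least 2")
    val grams = words.sliding(n).filter(_.length == n).toVector
    if (grams.isEmpty) Vector.empty
    else {
      val model = buildModel(words, n)
      // Step 1: pick a random n-gram as the starting point.
      var out: Vector[String] = grams(rng.nextInt(grams.length))
      var stuck = false
      while (!stuck && out.length < maxWords) {
        // Step 2: look up the n-grams that start with the last n-1 words.
        model.get(out.takeRight(n - 1)) match {
          // Step 3: emit the last word of a randomly chosen match, then repeat.
          case Some(next) => out = out :+ next(rng.nextInt(next.length))
          // The prefix only occurs at the very end of the text, so stop here.
          case None => stuck = true
        }
      }
      out.take(maxWords)
    }
  }
}
```

For example, `DissociatedPress.generate("the bee is the bee of the bees".split(" ").toVector, 2, 20, new Random())` returns a word sequence in which every adjacent pair also occurs in the source text. Keeping duplicates in the successor lists means frequent continuations are chosen proportionally more often.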

answered Nov 13 '22 07:11

tuxdna


Tags:

scala

n-gram

user1002579
