Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do fast prefix string matching in Scala

I'm using some Java code to do fast prefix lookups, using java.util.TreeSet, could I be using scala's TreeSet instead? Or a different solution?

/** A class that uses a TreeSet to do fast prefix matching
 */
class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String): List[String] = {
    val matches = new ListBuffer[String]
    val tailSet = _set.tailSet(prefix)
    for ( tail <- tailSet.toArray ) {
      val tailString = tail.asInstanceOf[String]
      if ( tailString.startsWith(prefix) ) 
        matches += tailString
      else
        return matches.toList    
    }

    matches.toList
  }
}
like image 812
Alex Black Avatar asked Dec 09 '09 11:12

Alex Black


2 Answers

Use a Trie. Nobody's actually posted a Trie here yet, despite the fact that some people have posted sorted TreeMap data structures that they have misnamed as tries. Here is a fairly representative sample of a Trie implementation in Java. I don't know of any in Scala. See also an explanation of Tries on Wikipedia.

like image 185
Ken Bloom Avatar answered Sep 23 '22 13:09

Ken Bloom


The from & takeWhile approach:

class PrefixMatcher {
    private val _set = new TreeSet[String]
    def add(s: String) = _set.add(s)
    def findMatches(prefix: String): Iterable[String] =
        _set from prefix takeWhile(_ startsWith prefix)
}

An alternative is to select a subset from prefix to prefix++ (the smallest string after the prefix). This selects only the range of the tree that actually starts with the given prefix. Filtering of entries is not necessary. The subSet method will create a view of the underlying set.

There's still some work (overflow and empty strings won't work) in the increment method but the intent should be clear.

class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String) : Set[String] = {
    def inc(x : String) = { //ignores overflow
       assert(x.length > 0) 
       val last = x.length - 1
       (x take last) + (x(last) + 1).asInstanceOf[Char]
    }
   _set.subSet(prefix, inc(prefix))
  }
}

The same works with the scala jcl.TreeSet#range implementation.

like image 41
Thomas Jung Avatar answered Sep 20 '22 13:09

Thomas Jung