Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using zip to find the difference between two strings in Scala

Tags:

string

scala

This is a follow up to my previous question.

I noticed that if I manipulate two strings (e.g. to get the longest common prefix, calculate the difference between two strings, etc.) I tend to use zip (like in (s1 zip s2)...).

It looks nice but probably inefficient (in comparison to imperative code). Is it correct ? Perhaps, if performance does matter I should use an imperative algorithm.

Now I am trying to find if two strings differ. Would you recommend using zip to do it?

like image 944
Michael Avatar asked Feb 22 '23 07:02

Michael


2 Answers

It is slightly more efficient to use (s1,s2).zipped than (s1 zip s2) since you don't need to create the tuples; instead you create functions that take two arguments.

Better yet for efficiency but not ease of use is to define your own custom specialized string folder:

abstract class StrFold[@specialized T](var result: T) {
  def apply(c1: Char, c2: Char): Unit
  def done: Boolean
}
def strfold[@specialized T](s1: String, s2: String)(folder: StrFold[T]): T = {
  var i = 0
  val L = math.min(s1.length, s2.length)
  while (i < L) {
    folder(s1.charAt(i), s2.charAt(i))
    if (folder.done) return folder.result
    i += 1
  }
  folder.result
}

Now if you want to find the length of the longest common prefix, you

class LcpFind extends StrFold(0) {
  var done = false
  def apply(c1: Char, c2: Char) { if (c1 == c2) result += 1 else done = true }
}
def lcp(s1: String, s2: String) = strfold(s1,s2)(new LcpFind)

It's a fair bit of work--arguably almost as much as just writing out the imperative or recursive functional code to calculate LCP without a fold-with-escape-clause.

But it should be almost as fast as writing the low-level string stuff by hand each time. (The only penalty should be creating the (tiny) LcpFind object each time, and if you really wanted to you could reuse it by resetting result to zero between calls, or you could modify StrFold and strfold to implement and call a reset method, changing the input parameter to be named zero and having a separate result method.)

like image 166
Rex Kerr Avatar answered Feb 25 '23 03:02

Rex Kerr


Yes, it's going to be less efficient for two reasons.

  1. You are creating a list of pairs of characters the same length as the shorter string. This will take substantially more place than the original two strings. Usual ways to find longest common prefix or check if strings differ in one character don't require any additional memory.

  2. When finding LCP, you don't necessarily need to iterate over entire strings. Since zip does need to do it, it will obviously be less efficient to zip them first.

But this doesn't mean you have to use imperative algorithms: a normal tail-recursive functional algorithm will perform just as well.

like image 23
Alexey Romanov Avatar answered Feb 25 '23 03:02

Alexey Romanov