How to compare characters in Swift efficiently

Question

I have a function in Swift that computes the hamming distance of two strings and then puts them into a connected graph if the result is 1.

For example, read to hear returns a hamming distance of 2 because read[0] != hear[0] and read[3] != hear[3].

At first, I thought my function was taking a long time because of the quantity of input (8,000+ word dictionary), but I knew that several minutes was too long. So, I rewrote my same algorithm in Java, and the computation took merely 0.3s.

I have tried writing this in Swift two different ways:

Way 1 - Substrings

extension String {

    subscript (i: Int) -> String {
        return self[Range(i ..< i + 1)]
    }

}

private func getHammingDistance(w1: String, w2: String) -> Int {
    if w1.length != w2.length { return -1 }

    var counter = 0
    for i in 0 ..< w1.length {
        if w1[i] != w2[i] { counter += 1 }
    }

    return counter
}

Results: 434 seconds

Way 2 - Removing Characters

private func getHammingDistance(w1: String, w2: String) -> Int {
    if w1.length != w2.length { return -1 }

    var counter = 0
    var c1 = w1, c2 = w2      // need to mutate
    let length = w1.length

    for i in 0 ..< length {
        if c1.removeFirst() != c2.removeFirst() { counter += 1 }
    }

    return counter
}

Results: 156 seconds

Same Thing in Java

Results: 0.3 seconds

Where it's being called

var graph: Graph

func connectData() {
    let verticies = graph.canvas // canvas is Array<Node>
                                 // Node has key that holds the String

    for vertex in 0 ..< verticies.count {
        for compare in vertex + 1 ..< verticies.count {
            if getHammingDistance(w1: verticies[vertex].key!, w2: verticies[compare].key!) == 1 {
                graph.addEdge(source: verticies[vertex], neighbor: verticies[compare])
            }
        }
    }
}

156 seconds is still far too inefficient for me. What is the absolute most efficient way of comparing characters in Swift? Is there a possible workaround for computing hamming distance that involves not comparing characters?

Edit

Edit 1: I am taking an entire dictionary of 4 and 5 letter words and creating a connected graph where the edges indicate a hamming distance of 1. Therefore, I am comparing 8,000+ words to each other to generate edges.

Edit 2: Added method call.

Alain T. · Accepted Answer

Unless you chose a fixed length character model for your strings, methods and properties such as .count and .characters will have a complexity of O(n) or at best O(n/2) (where n is the string length). If you were to store your data in an array of character (e.g. [Character] ), your functions would perform much better.

You can also combine the whole calculation in a single pass using the zip() function

let hammingDistance = zip(word1.characters,word2.characters)
                      .filter{$0 != $1}.count

but that still requires going through all characters of every word pair.

...

Given that you're only looking for Hamming distances of 1, there is a faster way to get to all the unique pairs of words:

The strategy is to group words by the 4 (or 5) patterns that correspond to one "missing" letter. Each of these pattern groups defines a smaller scope for word pairs because words in different groups would be at a distance other than 1.

Each word will belong to as many groups as its character count.

For example :

"hear" will be part of the pattern groups:
"*ear", "h*ar", "he*r" and "hea*".

Any other word that would correspond to one of these 4 pattern groups would be at a Hamming distance of 1 from "hear".

Here is how this can be implemented:

// Test data 8500 words of 4-5 characters ...
var seenWords = Set<String>()
var allWords = try! String(contentsOfFile: "/usr/share/dict/words")
                     .lowercased()                        
                     .components(separatedBy:"
")
                     .filter{$0.characters.count == 4 || $0.characters.count == 5}
                     .filter{seenWords.insert($0).inserted}
                     .enumerated().filter{$0.0 < 8500}.map{$1}

// Compute patterns for a Hamming distance of 1
// Replace each letter position with "*" to create patterns of
// one "non-matching" letter
public func wordH1Patterns(_ aWord:String) -> [String]
{
   var result       : [String]    = []
   let fullWord     : [Character] = aWord.characters.map{$0}
   for index in 0..<fullWord.count
   {
      var pattern    = fullWord
      pattern[index] = "*" 
      result.append(String(pattern))                     
   }
   return result
}

// Group words around matching patterns
// and add unique pairs from each group
func addHamming1Edges()
{
   // Prepare pattern groups ...
   // 
   var patternIndex:[String:Int] = [:]
   var hamming1Groups:[[String]]  = []
   for word in allWords
   {
      for pattern in wordH1Patterns(word)
      {
         if let index = patternIndex[pattern]
         { 
           hamming1Groups[index].append(word) 
         }
         else
         {
           let index = hamming1Groups.count
           patternIndex[pattern] = index
           hamming1Groups.append([word])
         }        
      }
   }

   // add edge nodes ...
   //
   for h1Group in hamming1Groups
   {
       for (index,sourceWord) in h1Group.dropLast(1).enumerated()
       {
          for targetIndex in index+1..<h1Group.count
          { addEdge(source:sourceWord, neighbour:h1Group[targetIndex]) } 
       }
   }
}

On my 2012 MacBook Pro, the 8500 words go through 22817 (unique) edge pairs in 0.12 sec.

[EDIT] to illustrate my first point, I made a "brute force" algorithm using arrays of characters instead of Strings :

   let wordArrays = allWords.map{Array($0.unicodeScalars)}
   for i in 0..<wordArrays.count-1
   {
      let word1 = wordArrays[i]
      for j in i+1..<wordArrays.count
      {
         let word2 = wordArrays[j]
         if word1.count != word2.count { continue }

         var distance = 0
         for c in 0..<word1.count 
         {
            if word1[c] == word2[c] { continue }
            distance += 1
            if distance > 1 { break }
         }
         if distance == 1
         { addEdge(source:allWords[i], neighbour:allWords[j]) }
      }
   }

This goes through the unique pairs in 0.27 sec. The reason for the speed difference is the internal model of Swift Strings which is not actually an array of equal length elements (characters) but rather a chain of varying length encoded characters (similar to the UTF model where special bytes indicate that the following 2 or 3 bytes are part of a single character. There is no simple Base+Displacement indexing of such a structure which must always be iterated from the beginning to get to the Nth element.

Note that I used unicodeScalars instead of Character because they are 16 bit fixed length representations of characters that allow a direct binary comparison. The Character type isn't as straightforward and take longer to compare.

How to compare characters in Swift efficiently

Tags:

string

swift

character

Way 1 - Substrings

Way 2 - Removing Characters

Same Thing in Java

Where it's being called

Edit

Logan Jahnke

1 Answers

Alain T.

Recent Activity

Donate For Us

How to compare characters in Swift efficiently

Tags:

string

swift

character

Way 1 - Substrings

Way 2 - Removing Characters

Same Thing in Java

Where it's being called

Edit

Logan Jahnke

1 Answers

Alain T.

Related questions

Recent Activity

Donate For Us