
Why does this regex replacement using capture hang in this swift code?

Tags: regex, swift

In an effort to teach myself some bits of SwiftUI & Regex at the same time I thought it would be a simple enough task to translate a VTT file into an SRT file. In this translation, a time stamp including a period is transformed into the same timestamp with a comma.

I thought this would be handled neatly with a regex that captures the timestamp. I coded this up & everything seems to work fine when tested against a sample of a few records. However, when I dump the entire VTT file in as a string & try it, the function just hangs. I think the issue is the sheer number of replacements.

A VTT file for a typical Hollywood film has somewhere around 1,500 records, each with a timestamp line requiring 2 replacements. A timestamp line looks like this: 00:00:48.757 --> 00:00:52.970 & the only thing that needs to be done is for the periods to become commas, like this: 00:00:48,757 --> 00:00:52,970.

I can't just do a global replace because other periods will get turned into commas & I want this only for the timestamps. The following code accomplishes the translations correctly but hangs for big files:

func VTT2SRT() {

    var vtt$ = vttText // preserve vttText from views
    
    let targetPattern = /(?<timeStamp>\d\d:\d\d:\d\d)./
    let matchArray = vtt$.matches(of: targetPattern)
    for s$ in matchArray {
        vtt$ = vtt$.replacing(s$.0, with: s$.timeStamp + ",")
    }

   [other stuff that works just fine]
   
    srtText = srt$ // pass it back to views
}

I cannot figure out why this hangs or takes so long. Can anyone point me in the right direction or suggest an alternative approach? Note: I am attempting this under macOS 14.6.1 (23G93) using Xcode Version 15.4 (15F31d)

Asked by rbkopelman, Feb 05 '26 03:02


1 Answer

A few observations:

  1. The unescaped . in the regex will match any character (including a comma!). If one escapes the . with a backslash in the targetPattern, only the period character will be matched:

    let targetPattern = /(?<timeStamp>\d\d:\d\d:\d\d)\./
    

    Without escaping the ., the replacement text (e.g., 00:00:48,) still matches the search pattern. Your example does not search repeatedly, but if it did, the unescaped . could lead to an infinite loop.

  2. Your code calls replacing inside the for loop. That means that for every iteration of the for loop, it will rescan the entire string looking for matches, resulting in something akin to O(n²) performance. On top of that, it will create a new copy of the complete string at every iteration, which is very inefficient.

    Rather than rescanning the string within the for loop, you can just replace the matching substring:

    var vtt = vttText
    
    let targetPattern = /(?<timeStamp>\d\d:\d\d:\d\d)\./
    
    let matches = vtt.matches(of: targetPattern)
    for match in matches.reversed() {
        vtt.replaceSubrange(match.range, with: match.timeStamp + ",")
    }
    

    Calling replaceSubrange rather than replacing avoids rescanning the string on every iteration of the for loop.

    (If you are wondering why I reversed the array of matches: working backward from the end of the string ensures that replacing one substring does not invalidate the ranges of the matches that have not been processed yet, all of which lie earlier in the string. In this case it is not strictly necessary, because the replacement is always the same length as the text being replaced, but it is the general pattern.)

    The above snippet will be far more performant, because the string is scanned for matches only once. For short strings, the difference will be modest, but for really long strings, it adds up.

  3. Even better, replacing(_:maxReplacements:with:) eliminates the need for the for loop entirely. It can replace all of the matches within a single API call:

    let targetPattern = /(?<timeStamp>\d\d:\d\d:\d\d)\./
    var vtt = vttText.replacing(targetPattern) { $0.timeStamp + "," }
    

    The for loop is not needed.

  4. Rather than using the named capture group, you can consider a look-behind assertion, ?<=, for the hh:mm:ss portion of the string. You don’t really need to replace that part at all (and thus you do not really need to capture it; you only care about replacing the . characters preceded by hh:mm:ss strings):

    let targetPattern = #"(?<=\d\d:\d\d:\d\d)\."#
    let vtt = vttText.replacingOccurrences(of: targetPattern, with: ",", options: .regularExpression)
    

    It appears that the newer Regex does not support look-behind assertions yet, but the legacy regular expression API (which requires import Foundation) does, as shown above. It is logical to replace just the single character in question, namely the ., rather than the whole timestamp string.
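
To make points 1 and 3 concrete, here is a minimal, self-contained sketch. The function name convertTimestamps(in:) and the sample text are made up for illustration and are not from the original post. It uses the #/…/# form of the regex literal, which should compile even where bare-slash regex literals are not enabled, and it assumes a platform where Swift Regex is available (macOS 13 or later).

```swift
import Foundation

// Hypothetical helper (name invented for this sketch): rewrites VTT-style
// timestamps as SRT-style ones by turning the period before the
// milliseconds into a comma.
func convertTimestamps(in vtt: String) -> String {
    // The escaped `\.` matches only a literal period. The named capture
    // keeps the hh:mm:ss part so it can be re-emitted unchanged.
    let pattern = #/(?<timeStamp>\d\d:\d\d:\d\d)\./#

    // A single call handles every match; there is no loop that rescans
    // or copies the whole string once per match.
    return vtt.replacing(pattern) { match in
        match.timeStamp + ","
    }
}

// Made-up sample record: the timestamps change, the sentence periods do not.
let sample = """
00:00:48.757 --> 00:00:52.970
Hello. This period stays.
"""
print(convertTimestamps(in: sample))

// Point 1 in miniature: an unescaped `.` also matches the comma of an
// already-converted timestamp, whereas the escaped `\.` does not.
let converted = "00:00:48,757"
print(converted.contains(#/\d\d:\d\d:\d\d./#))   // true
print(converted.contains(#/\d\d:\d\d:\d\d\./#))  // false
```

Because the closure only ever sees text that matched the timestamp pattern, other periods in the subtitle text are left alone, which is the constraint that ruled out a plain global replace in the question.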

Answered by Rob, Feb 08 '26 04:02


