Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy matching/chunking algorithm

Background: I have video clips and audio tracks that I want to sync with said videos.

From the video clips, I'll extract a reference audio track. I also have another track that I want to synchronize with the reference track. The desync comes from editing, which altered the intervals for each cutscene.

I need to manipulate the target track to look like (sound like, in this case) the ref track. This amounts to adding or removing silence at the correct locations. This could be done manually, but it'd be extremely tedious. So I want to be able to determine these locations programatically.

Example:

     0         1         2         
     012345678901234567890123
ref: --part1------part2------
syn: -----part1----part2-----
# (let `-` denote silence)

Output:

[(2,6), (5,9) # part1
 (13, 17), (14, 18)] # part2 

My idea is, starting from the beginning:

Fingerprint 2 large chunks* of audio and see if they match:
    If yes: move on to the next chunk
    If not:
        Go down both tracks looking for the first non-silent portion of each
        Offset the target to match the original
        Go back to the beginning of the loop

# * chunk size determined by heuristics and modifiable

The main problem here is sound matching and fingerprinting are fuzzy and relatively expensive operations.

Ideally I want to them as few times as possible. Ideas?

like image 870
Confluence Avatar asked Oct 12 '22 15:10

Confluence


1 Answers

Sounds like you're not looking to spend a lot of time delving into audio processing/engineering, and hence you want something you can quickly understand and just works. If you're willing to go with something more complex see here for a very good reference.

That being the case, I'd expect simple loudness and zero crossing measures would be sufficient to identify portions of sound. This is great because you can use techniques similar to rsync.

Choose some number of samples as a chunk size and march through your reference audio data at a regular interval. (Let's call it 'chunk size'.) Calculate the zero-crossing measure (you likely want a logarithm (or a fast approximation) of a simple zero-crossing count). Store the chunks in a 2D spatial structure based on time and the zero-crossing measure.

Then march through your actual audio data a much finer step at a time. (Probably doesn't need to be as small as one sample.) Note that you don't have to recompute the measures for the entire chunk size -- just subtract out the zero-crossings no longer in the chunk and add in the new ones that are. (You'll still need to compute the logarithm or approximation thereof.)

Look for the 'next' chunk with a close enough frequency. Note that since what you're looking for is in order from start to finish, there's no reason to look at -all- chunks. In fact, we don't want to since we're far more likely to get false positives.

If the chunk matches well enough, see if it matches all the way out to silence.

The only concerning point is the 2D spatial structure, but honestly this can be made much easier if you're willing to forgive a strict window of approximation. Then you can just have overlapping bins. That way all you need to do is check two bins for all the values after a certain time -- essentially two binary searches through a search structure.

The disadvantage to all of this is it may require some tweaking to get right and isn't a proven method.

like image 103
Kaganar Avatar answered Nov 15 '22 09:11

Kaganar