As part of a fun-at-home-research-project, I am trying to find a way to reduce/convert a song to a humming like audio signal (the underlying melody that we humans perceive when we listen to a song). Before I proceed any further in describing my attempt on this problem, I would like to mention that I am totally new to audio analysis though I have a lot of experience with analyzing images and videos.
After googling a bit, I found a bunch of melody extraction algorithms. Given a polyphonic audio signal of a song (ex: .wav file), they output a pitch track --- at each point in time they estimate the dominant pitch (coming from a singer's voice or some melody generating instrument) and track the dominant pitch over time.
I read a few papers, and they seem to compute a short time Fourier transform of the song, and then do some analysis on the spectrogram to get and track the dominant pitch. Melody extraction is only a component in the system I am trying to develop, so I don't mind using any algorithm that's available as far as it does a decent job on my audio files and the code is available. Since I am new to this, I would be happy to hear any suggestions on which algorithms are known to work well and where can I find its code.
I found two algorithms:
I chose Melodia as the results on different music genres looked quite impressive. Please check this to see its results. The humming that you hear for each piece of music is essentially what I am interested in.
"It is the generation of this humming for any arbitrary song, that I want your help with in this question".
The algorithm (available as a vamp plugin) outputs a pitch track --- [time_stamp, pitch/frequency] --- an Nx2 matrix where in the first column is the time-stamp (in seconds) and the second column is dominant pitch detected at the corresponding time-stamp. Shown below is a visualization of the pitch-track obtained from the algorithm overlayed in purple color with a song's time-domain signal (above) and it spectrogram/short-time-fourier. Negative-values of pitch/frequency represent the algorithms dominant pitch estimate for un-voiced/non-melodic segments. So all pitch estimates >= 0 correspond to the melody, the rest are not important to me.
Now I want to convert this pitch track back to a humming like audio signal -- just like the authors have it on their website.
Below is a MATLAB function that I wrote to do this:
function [melSignal] = melody2audio(melody, varargin)
% melSignal = melody2audio(melody, Fs, synthtype)
% melSignal = melody2audio(melody, Fs)
% melSignal = melody2audio(melody)
%
% Convert melody/pitch-track to a time-domain signal
%
% Inputs:
%
% melody - [time-stamp, dominant-frequency]
% an Nx2 matrix with time-stamp in the
% first column and the detected dominant
% frequency at corresponding time-stamp
% in the second column.
%
% synthtype - string to choose synthesis method
% passed to synth function in synth.m
% current choices are: 'fm', 'sine' or 'saw'
% default='fm'
%
% Fs - sampling frequency in Hz
% default = 44.1e3
%
% Output:
%
% melSignal -- time-domain representation of the
% melody. When you play this, you
% are supposed to hear a humming
% of the input melody/pitch-track
%
p = inputParser;
p.addRequired('melody', @isnumeric);
p.addParamValue('Fs', 44100, @(x) isnumeric(x) && isscalar(x));
p.addParamValue('synthtype', 'fm', @(x) ismember(x, {'fm', 'sine', 'saw'}));
p.addParamValue('amp', 60/127, @(x) isnumeric(x) && isscalar(x));
p.parse(melody, varargin{:});
parameters = p.Results;
% get parameter values
Fs = parameters.Fs;
synthtype = parameters.synthtype;
amp = parameters.amp;
% generate melody
numTimePoints = size(melody,1);
endtime = melody(end,1);
melSignal = zeros(1, ceil(endtime*Fs));
h = waitbar(0, 'Generating Melody Audio' );
for i = 1:numTimePoints
% frequency
freq = max(0, melody(i,2));
% duration
if i > 1
n1 = floor(melody(i-1,1)*Fs)+1;
dur = melody(i,1) - melody(i-1,1);
else
n1 = 1;
dur = melody(i,1);
end
% synthesize/generate signal of given freq
sig = synth(freq, dur, amp, Fs, synthtype);
N = length(sig);
% augment note to whole signal
melSignal(n1:n1+N-1) = melSignal(n1:n1+N-1) + reshape(sig,1,[]);
% update status
waitbar(i/size(melody,1));
end
close(h);
end
The underlying logic behind this code is the following: at each time-stamp, I synthesize a short-lived wave (say a sine-wave) with frequency equal to the detected dominant pitch/frequency at that time-stamp for a duration equal to its gap with the next time-stamp in the input melody matrix. I only wonder if I am doing this right.
Then I take the audio signal I get from this function and play it with the original song (melody on the left channel and original song on the right channel). Though the generated audio signal seems to segment the melody-generating sources (voice/lead-intstrument) fairly well -- its active where voice is and zero everywhere else --- the signal itself is far from being a humming (I get something like beep beep beeeeep beep beeep beeeeeeeep) that the authors show on their website. Specifically, below is a visualization showing the time-domain signal of the input song in the bottom and the time-domain signal of the melody generated using my function.
One main issue is -- though I am given the frequency of the wave to generate at each time-stamp and also the duration, I don't know how to set the amplitude of the wave. For now, I set the amplitude to be flat/a-constant value, and i suspect this is where the problem is.
Does anyone have any suggestions on this? I welcome suggestions in any program language (preferably MATLAB, python, C++), but I guess my question here is more general --- How to generate the wave at each time-stamp?
A few ideas/fixes in my mind:
If I understand correctly, you seem to already have an accurate representation of the pitch but your problem is that what you generate just doesn't "sound good enough".
Starting with your second approach: filtering out anything but the pitch isn't going to lead to anything good. By removing everything but a few frequency bins corresponding to your local pitch estimates, you will lose the texture of the input signal, what makes it sound good. In fact, if you took that to an extreme and removed everything but the one sample corresponding to the pitch and took an ifft, you would get exactly a sinusoid, which is what you are doing currently. If you wanted to do this anyway, I recommend you perform all of this by just applying a filter to your temporal signal rather than going in and out of the frequency domain, which is more expensive and cumbersome. The filter would have a small cutoff around the frequency you want to keep and that would allow for a sound with better texture as well.
However, if you already have pitch and duration estimates that you are happy with but you want to improve on the sound rendering, I suggest that you just replace your sine waves--which will always sound like silly beep-beep no matter how much you massage them--with some actual humming (or violin or flute or whatever you like) samples for each frequency in the scale. If memory is a concern or if the songs you represent do not fall into a well tempered scale (thinking middle-east song for example), instead of having a humming sample for each note of the scale, you could only have humming samples for a few frequencies. You would then derive the humming sounds at any frequency by doing a sample rate conversion from one of these humming samples. Having a few samples to pick from for doing the sample conversion would allow you to pick the one that leans to "best" ratio with the frequency you need to produce, since the complexity of sampling conversion depends on that ratio. Obviously adding a sample rate conversion would be more work and computationally demanding compared to just having a bank of samples to pick from.
Using a bank of real samples will make a big difference in the quality of what you render. It will also allow you to have realistic attacks for each new note you play.
Then yes, like you suggest, you may want to also play with the amplitude by following the instantaneous amplitude of the input signal to produce a more nuanced rendering of the song.
Last, I would also play with the duration estimates you have so that you have smoother transitions from one sound to the next. Guessing from your performance of your audio file that I enjoyed very much (beep beep beeeeep beep beeep beeeeeeeep) and the graph that you display, it looks like you have many interruptions inserted in the rendering of your song. You could avoid this by extending the duration estimates to get rid of any silence that is shorter than, say .1 second. That way you would preserver the real silences from the original song but avoid cutting off each note of your song.
Though I don't have access to your synth() function, based on the parameters it takes I'd say your problem is because you're not handling the phase.
That is - it is not enough to concatenate waveform snippets together, you must ensure that they have continuous phase. Otherwise, you're creating a discontinuity in the waveform every time you concatenate two waveform snippets. If this is the case, my guess is that you're hearing the same frequency all the time and that it sounds more like a sawtooth than a sinusoid - am I right?
The solution is to set the starting phase of snippet n to the end phase of snippet n-1. Here's an example of how you would concatenate two waveforms with different frequencies without creating a phase discontinuity:
fs = 44100; % sampling frequency
% synthesize a cosine waveform with frequency f1 and starting additional phase p1
p1 = 0;
dur1 = 1;
t1 = 0:1/fs:dur1;
x1(1:length(t1)) = 0.5*cos(2*pi*f1*t1 + p1);
% Compute the phase at the end of the waveform
p2 = mod(2*pi*f1*dur1 + p1,2*pi);
dur2 = 1;
t2 = 0:1/fs:dur2;
x2(1:length(t2)) = 0.5*cos(2*pi*f2*t2 + p2); % use p2 so that the phase is continuous!
x3 = [x1 x2]; % this should give you a waveform without any discontinuities
Note that whilst this gives you a continuous waveform, the frequency transition is instantaneous. If you want the frequency to gradually change from time_n to time_n+1 then you would have to use something more complex like McAulay-Quatieri interpolation. But in any case, if your snippets are short enough this should sound good enough.
Regarding other comments, if I understand correctly your goal is just to be able to hear the frequency sequence, not for it to sound like the original source. In this case, the amplitude is not that important and you can keep it fixed.
If you wanted to make it sound like the original source that's a whole different story and probably beyond the scope of this discussion.
Hope this answers your question!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With