 

Audio signal source separation with neural network

What I am trying to do is separate the audio sources in a raw signal and extract the pitch of each one. I modelled the process myself, as represented below (a model for decomposing the raw signal): each source oscillates in its normal modes, so its component peaks tend to fall at integer multiples of a fundamental frequency, which is known as the harmonics. The sources are then shaped by resonance and finally combined linearly.

As seen above, I have plenty of hints in the frequency-domain pattern of audio signals, but almost no idea how to 'separate' them. I have tried countless models of my own. This is one of them (a rough sketch of steps 1-3 follows the list):

  1. FFT the PCM
  2. Get peak frequency bins and amplitudes.
  3. Calculate pitch candidate frequency bins.
  4. For each pitch candidate, analyze all the peaks with a recurrent neural network and find the appropriate combination of peaks.
  5. Separate the analyzed pitch candidates.
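
For reference, a minimal sketch of what steps 1-3 might look like with NumPy/SciPy. The peak-height threshold and the "strongest peaks as candidate fundamentals" heuristic are my own assumptions for illustration, not part of the original model:

```python
import numpy as np
from scipy.signal import find_peaks

def peaks_and_pitch_candidates(pcm, sample_rate, n_candidates=5):
    # Step 1: FFT the (windowed) PCM frame.
    spectrum = np.abs(np.fft.rfft(pcm * np.hanning(len(pcm))))
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / sample_rate)

    # Step 2: peak frequency bins and their amplitudes.
    bins, _ = find_peaks(spectrum, height=0.05 * spectrum.max())
    peak_freqs, peak_amps = freqs[bins], spectrum[bins]

    # Step 3: take the strongest peaks as candidate fundamentals; a good
    # candidate should have other peaks near integer multiples of it.
    strongest = np.argsort(peak_amps)[::-1][:n_candidates]
    return peak_freqs, peak_amps, np.sort(peak_freqs[strongest])
```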

Unfortunately, none of them has successfully separated the signal so far. I would appreciate any advice on solving this kind of problem, especially on modelling the source separation as I have done above.

asked Feb 02 '14 by Laie

People also ask

How does audio source separation work?

Audio source separation consists in recovering different unknown signals called sources by filtering their observed mixtures. In music processing, most mixtures are stereophonic songs and the sources are the individual signals played by the instruments, e.g. bass, vocals, guitar, etc.

What is voice separation?

The goal of speech separation is to separate the mixed speech into two original speech signals. In signal processing, speech separation is a basic task with a wide range of applications, including mobile communication, speaker recognition, and song separation.

What is blind audio source separation?

Blind Audio Source Separation (BASS) consists in recovering one or several source signals from a given mixture signal. Direct applications include real-time speaker separation for simultaneous translation, sampling of musical sounds for electronic music composition.

What is Wave unet?

The Wave-U-Net is a convolutional neural network for audio source separation tasks that works directly on the raw audio waveform. It is an adaptation of the U-Net architecture to the one-dimensional time domain, performing end-to-end audio source separation.


1 Answer

Because no one has really attempted to answer this, and because you've marked it with the neural-network tag, I'm going to address the suitability of a neural network to this kind of problem. As the question was somewhat non-technical, this answer will also be "high level".

Neural networks require some sort of sample set from which to learn. In order to "teach" a neural net to solve this problem you would essentially need to have a working set of known solutions to work from. Do you have this? If so, read on. If not, a neural net is probably not what you are seeking. You stated that you have "many hints" but no real solution. This leads me to believe you probably don't have sample sets. If you can get them, great; otherwise you might be out of luck.

Supposing now that you have a sample set of Raw Signal samples and corresponding Source 1 and Source 2 outputs... Well, now you're going to need a method for deciding on a topology. Assuming you don't know a lot about how neural nets work (and don't want to), and assuming you also don't know the exact degree of complexity of the problem, I would probably recommend the open source NEAT package to get you started. I am not affiliated in any way with this project, but I have used it, and it allows you to (relatively) intelligently evolve neural network topologies to fit the problem.

Now, in terms of how a neural net would solve this specific problem. The first thing that comes to mind is that all audio signals are essentially time-series. That is to say, the information they convey is somehow dependent and related to the data at previous timesteps (e.g. the detection of some waveform cannot be done from a single time-point; it requires information about previous timesteps as well). Again, there's a million ways of solving this problem, but since I'm already recommending NEAT I'd probably suggest you take a look at the C++ NEAT Time Series mod.

If you're going down this route, you'll probably be wanting to use some sort of sliding window to provide information about the recent past at each time step. For a quick and dirty intro to sliding windows, check out this SO question:

Time Series Prediction via Neural Networks

The size of the sliding window can be important, especially if you're not using recurrent neural nets. Recurrent networks allow neural nets to remember previous time steps (at the cost of performance - NEAT is already recurrent so that choice is made for you here). You will probably want the sliding window length (ie. the number of timesteps in the past provided at every time step) to be roughly equal to your conservative guess of the largest number of previous timesteps required to gain enough information to split your waveform.
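
As a concrete illustration of such a window (the length of 64 samples below is an arbitrary assumption, not a recommendation):

```python
import numpy as np

def sliding_windows(signal, window_len=64):
    """Row t contains samples t .. t+window_len-1 of the mixture, i.e. the
    'recent past' the network is shown at time step t + window_len - 1."""
    n_steps = len(signal) - window_len + 1
    return np.stack([signal[t:t + window_len] for t in range(n_steps)])

mixture = np.random.randn(1000)       # stand-in for the mixed signal
inputs = sliding_windows(mixture)     # shape (937, 64)
```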

I'd say this is probably enough information to get you started.

When it comes to deciding how to provide the neural net with the data, you'll first want to normalise the input signals (consider a sigmoid function) and experiment with different transfer functions (sigmoid would probably be a good starting point).
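
One simple reading of "normalise via a sigmoid" is to squash raw amplitudes into (0, 1) on the way in and apply the inverse (the logit) on the way out; the clipping epsilon below is just a numerical-safety assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(y, eps=1e-7):
    y = np.clip(y, eps, 1.0 - eps)        # avoid log(0) at the extremes
    return np.log(y / (1.0 - y))

raw = np.array([-2.0, 0.0, 3.5])          # raw signal amplitudes
normalised = sigmoid(raw)                 # what the net sees: values in (0, 1)
recovered = inverse_sigmoid(normalised)   # what you read back from the outputs
```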

I would imagine you'll want to have 2 output neurons, providing normalised amplitudes (which you would denormalise via the inverse of the sigmoid function) representing Source 1 and Source 2 respectively. The fitness value (the way you judge the ability of each tested network to solve the problem) would be something along the lines of the negative of the RMS error of the output signal against the actual known signal (i.e. tested against the samples I was referring to earlier that you will need to procure).
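
A minimal sketch of that fitness inside an evaluation loop, using neat-python (a Python implementation of NEAT rather than the C++ package linked above); the config file name, placeholder training arrays, and window length are all assumptions for illustration:

```python
import neat   # neat-python: pip install neat-python
import numpy as np

# Placeholder training data (assumptions): each row of mixture_windows is a
# sliding window of the mixed signal; each row of source_targets holds the
# known (Source 1, Source 2) amplitudes at that window's final time step.
mixture_windows = np.random.rand(200, 64)
source_targets = np.random.rand(200, 2)

def eval_genomes(genomes, config):
    """Fitness = negative RMS error between the net's two outputs and the
    known source amplitudes, so a perfect network scores 0."""
    for genome_id, genome in genomes:
        net = neat.nn.RecurrentNetwork.create(genome, config)
        net.reset()
        sq_error = 0.0
        for window, target in zip(mixture_windows, source_targets):
            out = np.array(net.activate(window))   # two output neurons
            sq_error += np.sum((out - target) ** 2)
        genome.fitness = -np.sqrt(sq_error / source_targets.size)

# 'neat_config.ini' (assumed to exist) must declare num_inputs = 64 and
# num_outputs = 2 to match the data above.
config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     'neat_config.ini')
winner = neat.Population(config).run(eval_genomes, 50)   # 50 generations
```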

Suffice to say, this will not be a trivial operation, but it could work if you have enough samples to train the network against. What is a good number of samples? Well, as a rule of thumb, it's roughly a number large enough that a simple polynomial function of order N (where N is the number of neurons in the neural network you require to solve the problem) cannot fit all of the samples accurately. This is basically to ensure you are not simply overfitting the problem, which is a serious issue with neural networks.

I hope this has been helpful! Best of luck.

Additional note: your work to date wouldn't have been in vain if you go down this route. A neural network is likely to benefit from additional "help" in the form of FFTs and other signal-modelling "inputs", so you might want to consider taking the signal processing you have already done, organising it into an analog, continuous representation and feeding it as an input alongside the raw signal.
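
One hypothetical way to do that is to append a compact spectral summary of each window to the raw samples before they reach the network; the 16 magnitude bins and the max-normalisation here are illustrative assumptions:

```python
import numpy as np

def augmented_input(window, n_spectral=16):
    """Concatenate the raw sample window with a few normalised FFT
    magnitudes so the net sees both time-domain and spectral views."""
    spectrum = np.abs(np.fft.rfft(window))[:n_spectral]
    spectrum /= spectrum.max() + 1e-12    # keep the extra features in [0, 1]
    return np.concatenate([window, spectrum])

window = np.random.randn(64)              # one sliding-window input
features = augmented_input(window)        # length 64 + 16 = 80
```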

answered Oct 16 '22 by quant