
Given an MP3, is it possible to break out different instruments using a Fast Fourier transform (FFT)?

I am working on a music visualizer and I'd like to display a different visual element for each instrument. For example, a blue bar representing vocals, a red bar representing guitar, a yellow bar representing drums, etc.

Is there a way to analyze the results of FFT to get this information?

Thanks.

user377782 asked Aug 09 '11 10:08


2 Answers

This is an active area of research in music technology.

It's possible, to an extent, but it's certainly not easy. It will be especially difficult using MP3, as a lot of important information is lost in compression.

What you're trying to do is known as Audio Source Separation, or Sound Source Separation. It aims to separate an audio recording into its constituent elements.

These elements could be speech (several people talking at the same time, the 'cocktail party problem') or instruments (separating one instrument from another in a recording, 'blind demixing').

There are various approaches you could take. Some are based on the frequency-domain characteristics of sound and others are based on spatial properties.

The frequency-domain approach might appear fairly straightforward if you're trying to separate, say, a bass drum and a flute: the low-frequency bins of your FFT would be the bass drum and the higher-frequency bins would be assigned to the flute. In reality, however, sounds are rarely neatly segregated into useful frequency regions. The bass drum, for example, will have energy right the way up the frequency spectrum. Solutions of this type are hence mathematically complicated and often involve statistical modeling. Heavy stuff.
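To make the naive band-splitting idea concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the "drum" and "flute" are synthetic stand-ins (a decaying 60 Hz burst with a one-sample click, and a steady 1000 Hz tone), and the 250 Hz band edge and the `band_energies` helper are hypothetical choices, not part of any library.

```python
import numpy as np

def band_energies(signal, sample_rate, band_edges_hz):
    """Sum the FFT power that falls into each [low, high) frequency band."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    energies = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= low) & (freqs < high)
        energies.append(power[in_band].sum())
    return np.array(energies)

sample_rate = 8000                       # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Synthetic stand-ins, not real instrument recordings.
drum = np.exp(-30 * t) * np.sin(2 * np.pi * 60 * t)
drum[0] += 1.0                           # sharp click at the onset
flute = 0.5 * np.sin(2 * np.pi * 1000 * t)

# Two hypothetical bands: "bass" below 250 Hz, "treble" above it.
edges = [0, 250, sample_rate / 2 + 1]
drum_bands = band_energies(drum, sample_rate, edges)
flute_bands = band_energies(flute, sample_rate, edges)

print("drum  low/high band energy:", drum_bands)
print("flute low/high band energy:", flute_bands)
```

The flute lands almost entirely in the upper band, but the drum's click puts some of its energy there too, so a fixed band split would wrongly credit part of the drum to the flute.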

Separation based on spatial properties of sound often relies on some prior knowledge of where each source was positioned (this is 'non-blind'). It's usually necessary to have more than one microphone (a stereo recording at least). Using some clever maths, it's possible to approach separating the sources based on where each one is in space, inferred from the relationship between the signals at each microphone. This is also the basis for a technique called beamforming, in which an array of microphones is used to pick out, or determine the position of, a source.

So, back on track. People are trying to do it, but it's complicated, and using MP3 will make your life difficult!

I'm afraid I don't really know enough to explain the approaches better, but here are a few references to get you started:
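As a small illustration of "the relationship of the signals at each microphone", the sketch below estimates the time delay between two microphone signals by cross-correlation, which is one basic ingredient of spatial methods. It is not a separation or beamforming implementation. The signals are synthetic white noise, and `estimate_delay`, the 5-sample delay and the signal length are all assumptions made for the example.

```python
import numpy as np

def estimate_delay(mic_a, mic_b):
    """Return how many samples mic_b lags behind mic_a (negative if it leads)."""
    correlation = np.correlate(mic_a, mic_b, mode="full")
    # In "full" mode, output index 0 corresponds to lag -(len(mic_b) - 1).
    lag = np.argmax(correlation) - (len(mic_b) - 1)
    return int(-lag)

rng = np.random.default_rng(0)
n = 2048
true_delay = 5  # assumed: the sound reaches microphone B five samples later

# One noise source seen by two microphones; B hears a delayed copy of A.
source = rng.standard_normal(n + true_delay)
mic_a = source[true_delay:]
mic_b = source[:n]

estimated = estimate_delay(mic_a, mic_b)
print("estimated delay in samples:", estimated)
```

With a known microphone spacing and the speed of sound, a delay like this can be turned into a direction of arrival, and that is the kind of information spatial approaches build on.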

http://www.cs.tut.fi/~tuomasv/demopage.html

http://www.cs.northwestern.edu/~pardo/courses/eecs352/lectures/source%20separation.pdf (pdf warning!)

Good luck!

Speedy answered Nov 02 '22 19:11


For the vocals and bass you can use the fact that they are usually in the center of the stereo mix, which means they will have nearly the same waveform in the left and right channels. If you subtract one channel from the other you will end up with a new channel that will often be without vocals and bass.

Something like:

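sound = LoadMP3(...)
length = sound.SampleCount
left = sound.Channels[LEFT]
right = sound.Channels[RIGHT]
difference = new array of length samples
for i = 0 : length - 1
    // content that is identical in both channels cancels out
    difference[i] = left[i] - right[i]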

Now you can look at clever ways to visualize FFT(left), FFT(right) and FFT(difference).
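Here is a minimal runnable sketch of the same idea in Python with NumPy. Instead of decoding an MP3, it assumes a synthetic stereo signal: a 220 Hz tone panned dead center (standing in for vocals or bass) and an 880 Hz tone in the left channel only. The tone frequencies, sample rate and the `peak_frequency` helper are assumptions made for illustration.

```python
import numpy as np

def peak_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest FFT bin."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(magnitudes)]

sample_rate = 8000                        # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of audio

center = np.sin(2 * np.pi * 220 * t)      # identical in both channels
side = 0.5 * np.sin(2 * np.pi * 880 * t)  # present in the left channel only

left = center + side
right = center.copy()

# Whole-array subtraction replaces the per-sample loop.
difference = left - right

print("strongest frequency in left:      ", peak_frequency(left, sample_rate))
print("strongest frequency in right:     ", peak_frequency(right, sample_rate))
print("strongest frequency in difference:", peak_frequency(difference, sample_rate))
```

In this idealized case the centered tone cancels exactly. In a real mix, and especially after MP3 compression, the two channels are rarely identical, so expect attenuation of centered material rather than perfect removal.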

Maybe this will take a small step towards the effect that you are after?

Hallgrim answered Nov 02 '22 18:11