 

Is there any signal-processing algorithm that could reverse-engineer how a sound wave was produced by the vocal systems of a group of humans?

Given a long audio recording with three speakers on it, how can we get information on how their mouths open and close? We have an audio recording with more than one speaker. The sound is clear and does not require noise reduction. We want to create an animation with speaking 3D heads. In general, we want to derive mouth movements from the sound data.

We already have 3D heads that move via some default animation. For example, we have a prepared animation of the "O" sound for each person; what we need is this information: at which millisecond did which person produce which sound?

So it is like speech-to-text, but for individual sounds (phonemes), and for more than one person in one recording.

[Image: head diagram with labeled point pairs D5, D6, D9]

In the ideal case, we want to obtain signals describing the movements of the D9, D6, and D5 point pairs, for more than one speaker, in English of course.
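To make it concrete, the kind of data we are hoping to extract would look roughly like the sketch below (the phonemes, timings, and animation names are made up, just to show the idea):

```python
# Hypothetical example of what we want to get out of the audio: for each time
# range, which speaker produced which phoneme, so we can trigger the prepared
# mouth animation (and hence the D5/D6/D9 movements) for that speaker.
phoneme_events = [
    # (start_ms, end_ms, speaker, phoneme)
    (0,   120, "speaker_1", "HH"),
    (120, 260, "speaker_1", "AH"),
    (260, 410, "speaker_2", "OW"),
]

# Our prepared animations, keyed by phoneme (names are invented).
phoneme_to_animation = {
    "HH": "mouth_slightly_open",
    "AH": "mouth_open_wide",
    "OW": "mouth_round_O",
}

for start_ms, end_ms, speaker, phoneme in phoneme_events:
    print(start_ms, end_ms, speaker, phoneme_to_animation[phoneme])
```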

Are there any papers with algorithms, or open-source libraries?

So far I have found some libraries

http://freespeech.sourceforge.net/ http://cmusphinx.sourceforge.net/

but I have not used any of them yet...

Rella asked May 20 '11

4 Answers

Interesting problem! The first thing that came to my mind was to use motion detection to identify any movements at regions D5, D6, and D9. Extend D5, D6, D9 to be rectangles and use one of the approaches mentioned here to detect motion within those regions.

Of course you have to first identify a person's face and the regions D5, D6, D9 in a frame before you can start monitoring any motion.

You can use a speech recognition library to detect phonemes in the audio stream alongside the motion, then try to map motion features (such as region, intensity, and frequency) to phonemes and build a probabilistic model that relates mouth motions to phonemes.
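As a rough sketch of the motion-detection part (this assumes you also have video of the speakers, uses OpenCV frame differencing, and the region coordinates are placeholder values):

```python
# Minimal frame-differencing motion detector for fixed mouth regions.
# Assumes a video of the speakers is available and that the D5/D6/D9
# rectangles have already been located (coordinates below are placeholders).
import cv2
import numpy as np

REGIONS = {          # (x, y, width, height) -- made-up example values
    "D5": (100, 200, 60, 30),
    "D6": (100, 240, 60, 30),
    "D9": (100, 280, 60, 30),
}

def region_motion(video_path, threshold=10.0):
    """Yield (frame_index, region_name, mean_abs_difference) for moving regions."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    frame_index = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev_gray)
        for name, (x, y, w, h) in REGIONS.items():
            score = float(np.mean(diff[y:y + h, x:x + w]))
            if score > threshold:
                yield frame_index, name, score
        prev_gray = gray
        frame_index += 1
    cap.release()
```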

Really interesting problem! I wish I were currently working on something this interesting :).

Hope I mentioned something useful in here.

user258808 answered Oct 02 '22

This is an instance of the "cocktail party problem" or its generalization, "blind signal separation".

Unfortunately, while good algorithms exist if you have N microphones recording N speakers, performance of blind algorithms with fewer microphones than sources is quite bad. So those are not much help.
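For completeness, here is roughly what the well-behaved case looks like, i.e. separating N speakers when you do have N (or more) microphone channels, using FastICA from scikit-learn; with a single-channel tape this approach will not get you far:

```python
# Sketch of blind signal separation with FastICA (scikit-learn), which only
# works well when you have at least as many microphone channels as speakers.
# `mixed` is assumed to be a (n_samples, n_channels) float array, e.g. the
# channels of a multi-microphone recording loaded elsewhere.
import numpy as np
from sklearn.decomposition import FastICA

def separate_sources(mixed, n_speakers=3):
    """Return an (n_samples, n_speakers) array of estimated source signals."""
    ica = FastICA(n_components=n_speakers, random_state=0)
    sources = ica.fit_transform(mixed)          # unmix the channels
    # ICA output is arbitrarily scaled; normalise each source to [-1, 1]
    sources /= np.max(np.abs(sources), axis=0)
    return sources
```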

There is no particularly robust method I know of (there certainly was not as of five years ago) to separate speakers even with extra data. You may be able to train a classifier on human-annotated spectrograms of the speech so that it can pick out who is who, then possibly use speaker-independent voice recognition to try to figure out what is said, and then drive 3D speaking models like those used for high-end video games or movie special effects. But it won't work well.
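A very rough illustration of the classifier idea (hand-labelled segments, log-spectrogram features per half-second slice, and an SVM; all of the annotation is assumed to exist already):

```python
# Rough illustration of the "classifier on annotated spectrograms" idea:
# train on hand-labelled audio segments, then predict which speaker is
# active in each short slice of the recording.
import numpy as np
from scipy.signal import spectrogram
from sklearn.svm import SVC

def slice_features(signal, sample_rate, slice_seconds=0.5):
    """Split audio into fixed slices and return log-spectrogram features."""
    slice_len = int(slice_seconds * sample_rate)
    features = []
    for start in range(0, len(signal) - slice_len, slice_len):
        _, _, sxx = spectrogram(signal[start:start + slice_len], fs=sample_rate)
        features.append(np.log(sxx + 1e-10).mean(axis=1))   # average over time
    return np.array(features)

def train_speaker_classifier(labelled_segments):
    """labelled_segments: list of (mono_signal, sample_rate, speaker_id) tuples."""
    X, y = [], []
    for signal, sample_rate, speaker_id in labelled_segments:
        feats = slice_features(signal, sample_rate)
        X.append(feats)
        y.extend([speaker_id] * len(feats))
    clf = SVC(kernel="rbf")
    clf.fit(np.vstack(X), y)
    return clf
```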

You would be better off hiring three actors to listen to the tape and then each recite the part of one of the speakers while you video them. You will get a much more realistic appearance with much less time, effort, and money. If you want a variety of 3D characters, put markers on the actors' faces and capture their positions, then use those as control points on your 3D models.

Rex Kerr answered Oct 01 '22


I think that you are looking for what is known as "Blind Signal Separation". An academic paper surveying this is:

Blind signal separation: statistical principles (pdf)

Jean-François Cardoso, C.N.R.S. and E.N.S.T.

Abstract— Blind signal separation (BSS) and independent component analysis (ICA) are emerging techniques of array processing and data analysis, aiming at recovering unobserved signals or ‘sources’ from observed mixtures (typically, the output of an array of sensors), exploiting only the assumption of mutual independence between the signals. The weakness of the assumptions makes it a powerful approach but requires to venture beyond familiar second order statistics. The objective of this paper is to review some of the approaches that have been recently developed to address this exciting problem, to show how they stem from basic principles and how they relate to each other.

I have no idea how practical what you are trying to do is, or how much work it might take, if practical.

mcdowella answered Oct 03 '22


Some work that came out of the University of Edinburgh about 15 years ago (probably the basis of the voice recognition we have today) is applicable. They were able to automatically turn any intelligible English speech, without the program being trained, into a set of about 40 symbols, one for each distinct sound we use. That capability, combined with waveform signature analysis to identify the speaker of interest, is "all" you need.
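One common way to do the "waveform signature analysis" part (not necessarily what the Edinburgh work used) is to fit a Gaussian mixture model on MFCC features for each known speaker and score unknown segments against each model; a sketch using librosa and scikit-learn:

```python
# Speaker identification sketch: one GMM over MFCC features per known speaker,
# then pick the model with the highest likelihood for an unknown segment.
# The phoneme decoding itself would come from a recogniser such as CMU Sphinx.
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(signal, sample_rate):
    """13 MFCCs per frame, transposed to (n_frames, 13)."""
    return librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13).T

def train_speaker_models(training_clips):
    """training_clips: dict of speaker_name -> (signal, sample_rate)."""
    models = {}
    for name, (signal, sample_rate) in training_clips.items():
        gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
        gmm.fit(mfcc_features(signal, sample_rate))
        models[name] = gmm
    return models

def identify_speaker(models, segment, sample_rate):
    """Return the speaker whose model gives the segment the highest likelihood."""
    feats = mfcc_features(segment, sample_rate)
    return max(models, key=lambda name: models[name].score(feats))
```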

This is an engineering problem for sure, but not a programming problem suitable for Stack Overflow. I look forward to the day it is, though. :-)

wallyk answered Oct 04 '22