I am using a microphone as the input for real-time audio. How do I extract the phoneme currently being spoken from the audio? I need it for lip-syncing 2D characters.
Basically, my approach would be to:
I have tried looking everywhere for an example or library that could solve this type of problem. Most libraries don't seem to output phonemes from audio.
There is a write-up that explains how its authors used machine learning to solve this, but it comes without any code or tutorial: https://www.arxiv-vanity.com/papers/1910.08685/
There is also a speech recognition tool called Pocketsphinx, but I have not yet found an example of using it for phoneme recognition.
The way I would approach this is to get the words from the audio using Whisper or a similar speech-to-text (STT) service (the Python SpeechRecognition library is a common choice), then use the CMU Pronouncing Dictionary (CMUdict) to look up the phonemes for each word.
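As a rough sketch of the second half of that pipeline, the snippet below maps an already-transcribed sentence to CMUdict phonemes. It uses only the standard library and a five-word excerpt in CMUdict's plain-text format; the names `CMUDICT_EXCERPT`, `load_cmudict` and `transcript_to_phonemes` are made up for illustration, and in practice you would load the full dictionary file and feed in the text returned by your STT step.

```python
import re

# Tiny excerpt in CMUdict's plain-text format ("WORD  PH1 PH2 ...").
# Assumption: a real setup would load the full cmudict file instead.
CMUDICT_EXCERPT = """\
HELLO  HH AH0 L OW1
THIS  DH IH1 S
THAT  DH AE1 T
THIN  TH IH1 N
WORLD  W ER1 L D
"""

def load_cmudict(text):
    """Parse CMUdict-format text into {word: [phoneme, ...]}."""
    entries = {}
    for line in text.splitlines():
        # Skip blank lines and ";;;" comment lines.
        if not line.strip() or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # Alternate pronunciations look like "WORD(1)"; keep only the first.
        word = re.sub(r"\(\d+\)$", "", word)
        entries.setdefault(word.lower(), phones)
    return entries

def transcript_to_phonemes(transcript, pron_dict):
    """Map each word of a transcript to its phonemes (None if unknown)."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return [(word, pron_dict.get(word)) for word in words]

PRON = load_cmudict(CMUDICT_EXCERPT)
result = transcript_to_phonemes("Hello, this world!", PRON)
for word, phones in result:
    print(word, phones)
```

Note that this gives you phonemes per word but no timing. For lip-syncing you would still need to spread each word's phonemes over the time span in which the word was spoken, which depends on what timing information your STT step can provide.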
The phonemes are given in the CMU dictionary's own notation: for example, DH stands for the ð phoneme, the voiced th sound in "this" and "that" (the voiceless θ, as in "thin", is written TH). That is, they are not given in IPA, so you may need another layer if you need the phonemes in IPA format. In that case, consider the IPA2 library.
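If you would rather not add a dependency for that extra layer, a lookup table is enough for a first pass. The sketch below converts CMUdict symbols to IPA using the commonly cited ARPAbet-to-IPA correspondences; it is not the IPA2 library's API, the name `arpabet_to_ipa` is made up for illustration, and treating unstressed AH (AH0) as schwa is a convention I am assuming here.

```python
# Commonly cited ARPAbet -> IPA correspondences for the CMUdict phoneme set.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "EH": "ɛ", "ER": "ɝ",
    "EY": "eɪ", "F": "f", "G": "ɡ", "HH": "h", "IH": "ɪ", "IY": "i",
    "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n", "NG": "ŋ",
    "OW": "oʊ", "OY": "ɔɪ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ",
    "T": "t", "TH": "θ", "UH": "ʊ", "UW": "u", "V": "v", "W": "w",
    "Y": "j", "Z": "z", "ZH": "ʒ",
}

def arpabet_to_ipa(phones):
    """Convert a list of CMUdict phonemes (e.g. ["DH", "IH1", "S"]) to an IPA string."""
    ipa = []
    for phone in phones:
        if phone == "AH0":
            # Assumed convention: unstressed AH is rendered as schwa.
            ipa.append("ə")
        else:
            # Drop the trailing stress digit (0, 1 or 2) before the lookup.
            ipa.append(ARPABET_TO_IPA[phone.rstrip("012")])
    return "".join(ipa)

print(arpabet_to_ipa(["DH", "IH1", "S"]))  # "this"
print(arpabet_to_ipa(["TH", "IH1", "N"]))  # "thin"
```

Stress marks are simply discarded here, which is usually fine for lip-syncing since mouth shapes depend on the phoneme rather than on its stress.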