I am trying to create an app that leverages both STT (Speech to Text) and TTS (Text to Speech) at the same time. However, I am running into a couple of confusing issues and would appreciate your expertise.
The app consists of a button at the center of the screen which, upon clicking, starts the required speech recognition functionality using the code below.
// MARK: - Constant Properties

let audioEngine = AVAudioEngine()

// MARK: - Optional Properties

var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
var recognitionTask: SFSpeechRecognitionTask?
var speechRecognizer: SFSpeechRecognizer?

// MARK: - Functions

internal func startSpeechRecognition() {
    // Instantiate the recognitionRequest property.
    self.recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    // Set up the audio session.
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.record, mode: .measurement, options: [.defaultToSpeaker, .duckOthers])
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("An error has occurred while setting the AVAudioSession.")
    }

    // Set up the audio input tap.
    let inputNode = self.audioEngine.inputNode
    let inputNodeFormat = inputNode.outputFormat(forBus: 0)

    self.audioEngine.inputNode.installTap(onBus: 0, bufferSize: 512, format: inputNodeFormat, block: { [unowned self] buffer, time in
        self.recognitionRequest?.append(buffer)
    })

    // Start the recognition task.
    guard
        let speechRecognizer = self.speechRecognizer,
        let recognitionRequest = self.recognitionRequest else {
            fatalError("One or more properties could not be instantiated.")
    }

    self.recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest, resultHandler: { [unowned self] result, error in
        if error != nil {
            // Stop the audio engine and recognition task.
            self.stopSpeechRecognition()
        } else if let result = result {
            let bestTranscriptionString = result.bestTranscription.formattedString
            self.command = bestTranscriptionString
            print(bestTranscriptionString)
        }
    })

    // Start the audioEngine.
    do {
        try self.audioEngine.start()
    } catch {
        print("Could not start the audioEngine property.")
    }
}

internal func stopSpeechRecognition() {
    // Stop the audio engine.
    self.audioEngine.stop()
    self.audioEngine.inputNode.removeTap(onBus: 0)

    // End and deallocate the recognition request.
    self.recognitionRequest?.endAudio()
    self.recognitionRequest = nil

    // Cancel and deallocate the recognition task.
    self.recognitionTask?.cancel()
    self.recognitionTask = nil
}
When used alone, this code works like a charm. However, when I try to read the transcribed text back using an AVSpeechSynthesizer object, things stop working.
I went through several Stack Overflow posts, which suggested changing
audioSession.setCategory(.record, mode: .measurement, options: [.defaultToSpeaker, .duckOthers])
to the following:
audioSession.setCategory(.playAndRecord, mode: .default, options: [.defaultToSpeaker, .duckOthers])
but in vain: the app still crashed after running STT and then TTS, respectively.
What finally worked for me was using this instead:
audioSession.setCategory(.multiRoute, mode: .default, options: [.defaultToSpeaker, .duckOthers])
This left me completely confused, as I have no clue what is actually going on under the hood. I would highly appreciate any relevant explanation!
I am developing an app with both SFSpeechRecognizer and AVSpeechSynthesizer too, and for me .setCategory(.playAndRecord, mode: .default) works fine, and according to Apple it is the best category for our needs. I am even able to .speak() every transcription of the SFSpeechRecognitionTask while the audio engine is running, without any problem. My opinion is that something in your program's logic causes the crash. It would be good if you could update your question with the corresponding error.
As for why the .multiRoute category works: I guess there is a problem with the AVAudioInputNode. If you see in the console an error like this
Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: IsFormatSampleRateAndChannelCountValid(hwFormat)
or like this
Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: nullptr == Tap()
you only need to reorder some parts of your code: move the setup of the audio session somewhere it gets called only once, and make sure the tap on the input node is always removed before installing a new one, whether the recognition task finishes successfully or not. And maybe (I have never worked with it) .multiRoute is able to reuse the same input node, by its nature of working with different audio streams and routes.
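As a sketch of that tap handling (property names are assumed from the question's code, and this is only one way to order it), removing any previous tap before installing the new one could look like:

```swift
import AVFoundation

// Sketch only: assumes audioEngine and recognitionRequest properties
// as declared in the question.
func reinstallTap() {
    let inputNode = audioEngine.inputNode

    // Remove any tap left over from a previous recognition task first,
    // to avoid the "nullptr == Tap()" crash when installing a second tap.
    inputNode.removeTap(onBus: 0)

    let format = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { [unowned self] buffer, _ in
        self.recognitionRequest?.append(buffer)
    }
}
```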
I leave below the logic I use in my program, following Apple's WWDC session:
override func viewDidLoad() { // or init(), or wherever it gets called only once
    super.viewDidLoad()
    try? AVAudioSession.sharedInstance().setCategory(.playAndRecord, mode: .default)
}

func shouldProcessSpeechRecognition() {
    guard AVAudioSession.sharedInstance().recordPermission == .granted,
        speechRecognizerAuthorizationStatus == .authorized,
        let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else { return }
    // Continue only if we have authorization and the recognizer is available.
    startSpeechRecognition()
}

func startSpeechRecognition() {
    let format = audioEngine.inputNode.outputFormat(forBus: 0)
    audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { [unowned self] (buffer, _) in
        self.recognitionRequest.append(buffer)
    }
    audioEngine.prepare()
    do {
        try audioEngine.start()
        recognitionTask = speechRecognizer!.recognitionTask(with: recognitionRequest, resultHandler: {...})
    } catch {...}
}

func endSpeechRecognition() {
    recognitionTask?.finish()
    stopAudioEngine()
}

func cancelSpeechRecognition() {
    recognitionTask?.cancel()
    stopAudioEngine()
}

func stopAudioEngine() {
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest.endAudio()
}
And with that, anywhere in my code I can call an AVSpeechSynthesizer instance and speak an utterance.
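For completeness, speaking an utterance is roughly this (a minimal sketch; keeping the synthesizer as a strong property and the "en-US" voice are just example choices):

```swift
import AVFoundation

// Keep a strong reference; a locally scoped synthesizer may be
// deallocated before it finishes speaking.
let synthesizer = AVSpeechSynthesizer()

func speak(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
    synthesizer.speak(utterance)
}
```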