Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Stream audio from mic to IBM Watson SpeechToText Web service using Java SDK

Trying to send a continuous audio stream from microphone directly to IBM Watson SpeechToText Web service using the Java SDK. One of the examples provided with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a file in .WAV format to the service. However, .WAV files require that their length be specified ahead of time, so the naive approach of just appending to the file one buffer at a time is not feasible.

It appears that SpeechToText.recognizeUsingWebSocket can take a stream, but feeding it an instance of AudioInputStream does not seem to do it appears like the connection is established but no transcripts are returned even though RecognizeOptions.interimResults(true).

public class RecognizeUsingWebSocketsExample {
private static CountDownLatch lock = new CountDownLatch(1);

public static void main(String[] args) throws FileNotFoundException, InterruptedException {
SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

AudioInputStream audio = null;

try {
    final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
    DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
    TargetDataLine line;
    line = (TargetDataLine)AudioSystem.getLine(info);
    line.open(format);
    line.start();
    audio = new AudioInputStream(line);
    } catch (LineUnavailableException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

RecognizeOptions options = new RecognizeOptions.Builder()
    .continuous(true)
    .interimResults(true)
    .contentType(HttpMediaType.AUDIO_WAV)
    .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
  @Override
  public void onTranscription(SpeechResults speechResults) {
    System.out.println(speechResults);
    if (speechResults.isFinal())
      lock.countDown();
  }
});

lock.await(1, TimeUnit.MINUTES);
}
}

Any help would be greatly appreciated.

-rg

Here's an update based on German's comment below (thanks for that).

I was able to use javaFlacEncode to covert the WAV stream arriving from the mic into a FLAC stream and save it into a temporary file. Unlike a WAV audio file, whose size is fixed at creation, the FLAC file can be appended to easily.

    WAV_audioInputStream = new AudioInputStream(line);
    FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);

    StreamConfiguration streamConfiguration = new StreamConfiguration();
    streamConfiguration.setSampleRate(16000);
    streamConfiguration.setBitsPerSample(8);
    streamConfiguration.setChannelCount(1);

    flacEncoder = new FLACEncoder();
    flacOutputStream = new FLACFileOutputStream(tempFile);  // write to temp disk file

    flacEncoder.setStreamConfiguration(streamConfiguration);
    flacEncoder.setOutputStream(flacOutputStream);

    flacEncoder.openFLACStream();

    ...
    // convert data
    int frameLength = 16000;
    int[] intBuffer = new int[frameLength];
    byte[] byteBuffer = new byte[frameLength];

    while (true) {
        int count = WAV_audioInputStream.read(byteBuffer, 0, frameLength);
        for (int j1=0;j1<count;j1++)
            intBuffer[j1] = byteBuffer[j1];

        flacEncoder.addSamples(intBuffer, count);
        flacEncoder.encodeSamples(count, false);  // 'false' means non-final frame
    }

    flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true);  // final frame
    WAV_audioInputStream.close();
    flacOutputStream.close();
    FLAC_audioInputStream.close();

The resulting file can be analyzed (using curl or recognizeUsingWebSocket()) without any problems after adding an arbitrary number of frames. However, the recognizeUsingWebSocket() will return the final result as soon as it reaches the end of the FLAC file, even though the file's last frame may not be final (i.e., after encodeSamples(count, false)).

I would expect recognizeUsingWebSocket() to block till the final frame is written to the file. In practical terms, it means that the analysis stops after the first frame, as it takes less time to analyze the first frame than to collect the 2nd, so upon returning the results, the end of file is reached.

Is this the right way to implement streaming audio from a mic in Java? Seems like a common use case.


Here's a modification of RecognizeUsingWebSocketsExample, incorporating some of Daniel's suggestions below. It uses PCM content type (passed as a String, together with a frame size), and an attempt to signal the end of the audio stream, albeit not a very successful one.

As before, the connection is made, but the recognize callback is never called. Closing the stream does not seem to be interpreted as an end of audio either. I must be misunderstanding something here...

    public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {

    final PipedOutputStream output = new PipedOutputStream();
    final PipedInputStream  input  = new PipedInputStream(output);

  final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
  DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
  final TargetDataLine line = (TargetDataLine)AudioSystem.getLine(info);
  line.open(format);
  line.start();

    Thread thread1 = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
              final int MAX_FRAMES = 2;
              byte buffer[] = new byte[16000];
              for(int j1=0;j1<MAX_FRAMES;j1++) {  // read two frames from microphone
              int count = line.read(buffer, 0, buffer.length);
              System.out.println("Read audio frame from line: " + count);
              output.write(buffer, 0, buffer.length);
              System.out.println("Written audio frame to pipe: " + count);
              }
              /** no need to fake end-of-audio;  StopMessage will be sent 
              * automatically by SDK once the pipe is drained (see WebSocketManager)
              // signal end of audio; based on WebSocketUploader.stop() source
              byte[] stopData = new byte[0];
              output.write(stopData);
              **/
            } catch (IOException e) {
            }
        }
    });
    thread1.start();

  final CountDownLatch lock = new CountDownLatch(1);

  SpeechToText service = new SpeechToText();
  service.setUsernameAndPassword("<username>", "<password>");

  RecognizeOptions options = new RecognizeOptions.Builder()
  .continuous(true)
  .interimResults(false)
  .contentType("audio/pcm; rate=16000")
  .build();

  service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
    @Override
    public void onConnected() {
      System.out.println("Connected.");
    }
    @Override
    public void onTranscription(SpeechResults speechResults) {
    System.out.println("Received results.");
      System.out.println(speechResults);
      if (speechResults.isFinal())
        lock.countDown();
    }
  });

  System.out.println("Waiting for STT callback ... ");

  lock.await(5, TimeUnit.SECONDS);

  line.stop();

  System.out.println("Done waiting for STT callback.");

}

Dani, I instrumented the source for WebSocketManager (comes with SDK) and replaced a call to sendMessage() with an explicit StopMessage payload as follows:

        /**
     * Send input steam.
     *
     * @param inputStream the input stream
     * @throws IOException Signals that an I/O exception has occurred.
     */
    private void sendInputSteam(InputStream inputStream) throws IOException {
      int cumulative = 0;
      byte[] buffer = new byte[FOUR_KB];
      int read;
      while ((read = inputStream.read(buffer)) > 0) {
        cumulative += read;
        if (read == FOUR_KB) {
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
        } else {
          System.out.println("completed sending " + cumulative/16000 + " frames over socket");
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, Arrays.copyOfRange(buffer, 0, read)));  // partial buffer write
          System.out.println("signaling end of audio");
          socket.sendMessage(RequestBody.create(WebSocket.TEXT, buildStopMessage().toString()));  // end of audio signal

        }

      }
      inputStream.close();
    }

Neither of sendMessage() options (sending 0-length binary content or sending the stop text message) seems to work. The caller code is unchanged from above. The resulting output is:

Waiting for STT callback ... 
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error

REVISED: actually, the end-of-audio call is never reached. Exception is thrown while writing the last (partial) buffer to the socket.

Why is the connection aborted? That typically happens when the peer closes the connection.

As for point 2): Would either of these matter at this stage? It appears that recognition process is not being started at all... Audio is valid (I wrote the stream out to a disk and was able to recognize it by streaming it from a file, as I point out above).

Also, on further review of WebSocketManager source code, onMessage() already sends StopMessage immediately upon return from sendInputSteam() (ie.e., when the audio stream, or pipe in the example above, drains), so no need to call it explicitly. The problem is definitely occurring before the audio data transmission completes. The behavior is the same, regardless if PipedInputStream or AudioInputStream is passed as input. Exception is thrown while sending binary data in both cases.

like image 232
Robert Grzeszczuk Avatar asked May 14 '16 22:05

Robert Grzeszczuk


1 Answers

The Java SDK has an example and supports this.

Update your pom.xml with:

 <dependency>
   <groupId>com.ibm.watson.developer_cloud</groupId>
   <artifactId>java-sdk</artifactId>
   <version>3.3.1</version>
 </dependency>

Here is an example of how to listen to your microphone.

SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

// Signed PCM AudioFormat with 16kHz, 16 bit sample size, mono
int sampleRate = 16000;
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

if (!AudioSystem.isLineSupported(info)) {
  System.out.println("Line not supported");
  System.exit(0);
}

TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

AudioInputStream audio = new AudioInputStream(line);

RecognizeOptions options = new RecognizeOptions.Builder()
  .continuous(true)
  .interimResults(true)
  .timestamps(true)
  .wordConfidence(true)
  //.inactivityTimeout(5) // use this to stop listening when the speaker pauses, i.e. for 5s
  .contentType(HttpMediaType.AUDIO_RAW + "; rate=" + sampleRate)
  .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
  @Override
  public void onTranscription(SpeechResults speechResults) {
    System.out.println(speechResults);
  }
});

System.out.println("Listening to your voice for the next 30s...");
Thread.sleep(30 * 1000);

// closing the WebSockets underlying InputStream will close the WebSocket itself.
line.stop();
line.close();

System.out.println("Fin.");
like image 117
German Attanasio Avatar answered Oct 25 '22 12:10

German Attanasio