i'm trying to use tensorflowjs speech recognition in offline mode. online mode using microphone is working fine. but for offline mode i'm not able to find any reliable library for converting wav/mp3 file to spectrogram according to the required specs of array as ffttsize:1024 , columnTruncateLength: 232, numFramesPerSpectrogram: 43.
All libraries like spectrogram.js that i tried dont have those conversin options. while tensorlfowjs speech clearly mentions to have following specs for spectrograph tensor
const mic = await tf.data.microphone({
  fftSize: 1024,
  columnTruncateLength: 232,
  numFramesPerSpectrogram: 43,
  sampleRateHz:44100,
  includeSpectrogram: true,
  includeWaveform: true
});
Getting error as Error: tensor4d() requires shape to be provided when values are a flat array in following
await recognizer.ensureModelLoaded();
    var audiocaptcha = await response.buffer();
    fs.writeFile("./afterverify.mp3", audiocaptcha, function (err) {
        if (err) {}
    });
    var bufferNewSamples =  new Float32Array(audiocaptcha);
    const buffersliced = bufferNewSamples.slice(0,bufferNewSamples .length-(bufferNewSamples .length%9976));
    const xtensor = tf.tensor(bufferNewSamples).reshape([-1, 
...recognizer.modelInputShape().slice(1)]);
got this error after slicing and correcting to tensor
output.scores
[ Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ] ]
score for word '_background_noise_' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word '_unknown_' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'down' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'eight' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'five' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'four' = undefined
score for word 'go' = undefined
score for word 'left' = undefined
score for word 'nine' = undefined
score for word 'no' = undefined
score for word 'one' = undefined
score for word 'right' = undefined
score for word 'seven' = undefined
score for word 'six' = undefined
score for word 'stop' = undefined
score for word 'three' = undefined
score for word 'two' = undefined
score for word 'up' = undefined
score for word 'yes' = undefined
score for word 'zero' = undefined
The only requirement when working with offline recognition is to have an input tensor of shape [null, 43, 232, 1].
1 - Read the wav file and get the array of data
var spectrogram = require('spectrogram');
var spectro = Spectrogram(document.getElementById('canvas'), {
  audio: {
    enable: false
  }
});
var audioContext = new AudioContext();
readWavFile() {
return new Promise(resove => {
var request = new XMLHttpRequest();
request.open('GET', 'audio.mp3', true);
request.responseType = 'arraybuffer';
request.onload = function() {
  audioContext.decodeAudioData(request.response, function(buffer) {
    resolve(buffer)
  });
};
request.send()
})
}
const buffer = await readWavFile()
The same thing can be done without using the third party library. 2 options are possible.
Read the file using <input type="file">. In that case, this answer shows how to get the typedarray. 
Serve and read the wav file using a http request
var req = new XMLHttpRequest();
req.open("GET", "file.wav", true);
req.responseType = "arraybuffer";
req.onload = function () {
  var arrayBuffer = req.response;
  if (arrayBuffer) {
    var byteArray = new Float32Array(arrayBuffer);
  }
};
req.send(null);
2- convert the buffer to typedarray
const data = Float32Array(buffer)
3- convert the array to a tensor using the shape of the speech recognition model
const x = tf.tensor(
   data).reshape([-1, ...recognizer.modelInputShape().slice(1));
If the latter commands fails, it means that the data does not have the shape needed for the model. The tensor needs to be sliced to have the appropriate shape or the recording made should respect the fft and other parameters.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With