
Return predictions wav2vec fairseq

I'm trying to use wav2vec to train my own Automatic Speech Recognition System:

https://github.com/pytorch/fairseq/tree/master/examples/wav2vec

import torch
from fairseq.models.wav2vec import Wav2VecModel

cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

First of all, how can I use a loaded model to return predictions from a wav file?

Second, how can I pre-train using annotated data? I don't see any mention of text in the manifest scripts.

Asked Feb 24 '20 by Juanvulcano

People also ask

Is Wav2Vec open source?

Wav2vec 2.0 enables us to build better speech recognition systems for many more languages and domains with much less annotated data. We've open-sourced the code and pretrained models to enable other researchers to do exactly this.

What is the output of Wav2Vec?

The Viterbi decoder finds the most likely token sequence given the probability distributions output by wav2vec 2.0. A token can be a character or a sentence boundary.

How does Wav2Vec work?

Wav2Vec 2.0 uses a self-supervised training approach for Automatic Speech Recognition, which is based on the idea of contrastive learning. Learning speech representation on a huge, raw (unlabeled) dataset reduces the amount of labeled data required for getting satisfying results.

How to train wav2vec with fairseq?

Since wav2vec is part of fairseq, the fairseq-train command line tool is used to train it. As the arguments to this command are pretty long, they are usually wrapped in a bash script; most of the arguments are those suggested in the fairseq examples, and only the first two (which are filesystem paths) need to be modified for your system.

What is wav2vec2?

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in September 2020 by Alexei Baevski, Michael Auli, and Alex Conneau. Using a novel contrastive pretraining objective, Wav2Vec2 learns powerful speech representations from more than 50,000 hours of unlabeled speech.

Where can I find pre-trained models for wav2vec?

Wav2Vec2 has also been available in the Transformers library since version 4.4; pretrained models can be found on the Hugging Face Hub, and the Transformers documentation covers the model. For the original wav2vec, pretrained checkpoints and an example training setup are provided in the fairseq repository, as described in wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019).

What is the word error rate (WER) of wav2vec2?

Using as little as 10 minutes of labeled data, Wav2Vec2 yields a word error rate (WER) of less than 5% on the clean test set of LibriSpeech (cf. Table 9 of the paper). In this notebook, we will give a detailed explanation of how Wav2Vec2's pretrained checkpoints can be fine-tuned on any English ASR dataset.


1 Answer

After trying various things I was able to figure this out and trained a wav2vec model from scratch.

Some background: wav2vec uses self-supervised learning to learn vector representations of preprocessed sound frames, similar to how word2vec learns word embeddings from a text corpus. In the case of wav2vec, it samples random parts of the sound file and learns to predict whether a given part lies in the near future of a current offset position. This is somewhat similar to the masked-word task used to train transformers such as BERT. The nice thing about such prediction tasks is that they are self-supervised: the algorithm can be trained on unlabeled data, since it uses the temporal structure of the data to produce labels and random sampling to produce contrasting negative examples. It is a binary classification task (is the proposed processed sound frame in the near future of the current offset or not?).

In training for this binary classification task, it learns vector representations of sound frames (one 512-dimensional vector for each 10 ms of sound). These vector representations are useful features because they concentrate information relevant to predicting speech. They can then be used instead of spectrogram vectors as inputs to speech-to-text algorithms such as wav2letter or DeepSpeech.

This is an important point: wav2vec is not a full automatic speech recognition (ASR) system. It is a useful component because, by leveraging self-supervised learning on unlabeled data (audio files containing speech but without text transcriptions), it greatly reduces the need for labeled data (speech transcribed to text). Based on their article, it appears that by using wav2vec in an ASR pipeline the amount of labeled data needed can be reduced by a factor of at least 10 (apparently 10 to 100 times less transcribed speech is needed). Since un-transcribed speech files are much easier to obtain than transcribed speech, this is a huge advantage of using wav2vec as an initial module in an ASR system.
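So, regarding the first part of the question: the loaded model does not return text, it returns feature vectors. A minimal sketch of extracting them from a wav file, following the feature_extractor / feature_aggregator pattern from the fairseq wav2vec README (reading the audio with the soundfile package is just one option, the file path is a placeholder, and the audio is assumed to be 16 kHz mono):

import torch
import soundfile as sf
from fairseq.models.wav2vec import Wav2VecModel

# load the pretrained checkpoint exactly as in the question
cp = torch.load('/path/to/wav2vec.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

# read a 16 kHz mono wav file and shape it as (batch, samples)
wav, sample_rate = sf.read('/path/to/audio.wav')
wav_input = torch.from_numpy(wav).float().unsqueeze(0)

with torch.no_grad():
    z = model.feature_extractor(wav_input)   # local frame features
    c = model.feature_aggregator(z)          # context vectors, roughly one 512-dim vector per 10 ms

# c is a tensor of features, not text: it is meant to be fed into an acoustic
# model / decoder (e.g. wav2letter) to obtain transcriptions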

So wav2vec is trained with data which is not annotated (no text is used to train it).

What confused me was the following command for training (here):

python train.py /manifest/path --save-dir /model/path ...(etc.).........

It turns out that since wav2vec is part of fairseq, the following fairseq command line tool should be used to train it:

fairseq-train

As the arguments to this command are pretty long, it is convenient to wrap them in a bash script such as:

#!/bin/bash
fairseq-train /home/user/4fairseq --save-dir /home/user/4fairseq --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec --task audio_pretraining --lr 1e-06 --min-lr 1e-09 --optimizer adam --max-lr 0.005 --lr-scheduler cosine \
--conv-feature-layers "[(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)]" \
--conv-aggregator-layers "[(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)]" \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion binary_cross_entropy --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000

Most of the arguments are those suggested here; only the first two (which are filesystem paths) need to be modified for your system.

Since my audio files were in mp3 format, I converted them to wav files with the following bash script:

#!/bin/bash
for file in /home/user/data/soundFiles/*
do
  echo "$file"
  echo "${file%.*}.wav"
  ffmpeg -i "$file" "${file%.*}.wav"
done
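Note that wav2vec is normally used with single-channel audio sampled at 16 kHz (the fairseq examples assume this), so if the source material has a different sample rate or is stereo, ffmpeg can resample and downmix during the conversion, for example:

ffmpeg -i "$file" -ar 16000 -ac 1 "${file%.*}.wav"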

They suggest that the audio files be of short duration; longer files should be split into smaller ones. The files I had were already pretty short, so I did not do any splitting.

The script wav2vec_manifest.py must be used to create a training data manifest before training. It creates two files (train.tsv and valid.tsv) which list the audio files to use for training and those to use for validation. The directory containing these two files is the first argument to the fairseq-train command.
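For reference, the manifest can be created with a command along these lines, run from the root of the fairseq repository (the paths reuse the example directories from the scripts above; --valid-percent controls how many files are held out into valid.tsv):

python examples/wav2vec/wav2vec_manifest.py /home/user/data/soundFiles --dest /home/user/4fairseq --ext wav --valid-percent 0.01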

The second argument to fairseq-train is the path at which to save the model. After training there will be these two model files:
checkpoint_best.pt
checkpoint_last.pt
These are updated at the end of each epoch, so I was able to terminate the training process early and still have those saved model files.
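Either checkpoint can then be loaded with torch.load and used for feature extraction in the same way as the pretrained wav2vec.pt in the snippet above, since fairseq saves trained models in the same checkpoint format.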

Answered Oct 24 '22 by DBaker