I am training an LSTM network on audio data in TensorFlow (Python). My dataset is a bunch of sound files which read_wavfile turns into a generator of numpy arrays. I decided to try training my network on the same dataset 20 times, and wrote the following code.
import itertools

import librosa
import tensorflow as tf

from with_hyperparams import stft
from model import lstm_network


def read_wavfile():
    # DATA_PATH and hparams are defined elsewhere in the project
    for file in itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                DATA_PATH.glob("**/*.wav")):
        waveform, samplerate = librosa.load(file, sr=hparams.sample_rate)
        if len(waveform.shape) > 1:
            waveform = waveform[:, 1]
        yield waveform


audio_dataset = tf.data.Dataset.from_generator(
    read_wavfile,
    tf.float32,
    tf.TensorShape([None]))

dataset = audio_dataset.padded_batch(5, padded_shapes=[None])

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)
signals = iterator.get_next()

magnitude_spectrograms = tf.abs(stft(signals))

output, loss = lstm_network(magnitude_spectrograms)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(20):
        print(i)
        sess.run(dataset_init_op)
        while True:
            try:
                l, _ = sess.run((loss, train_op))
                print(l)
            except tf.errors.OutOfRangeError:
                break
The full code, including the sufficiently free data used (Wikipedia sound files with IPA transcriptions), is on GitHub. The non-free data (EMU corpus sound files) does make a significant difference, though I am not sure how best to show that to you: the loss goes down during the first pass, then the script prints 1, indicating the second loop, and suddenly the loss is at about 5000 again. With the glob order DATA_PATH.glob("**/*.wav"), DATA_PATH.glob("**/*.ogg"), the loss starts at below 5000 and goes down to about 1000, before jumping up to 4000 again for the *.ogg samples. Re-ordering the samples gives me a different result, so it looks like the WAV files are more similar to each other than the OGG files.

I have a notion that shuffling should ideally happen at the level of the dataset, and not rely on it being read in random order. However, that would mean reading a lot of WAV files into memory, which does not sound like a good solution.
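One cheap middle ground I can think of is shuffling just the list of file names before reading them, so at least the read order changes between epochs (a sketch, not code from my repository):

import random

def read_wavfile_shuffled():
    # Shuffle only the file paths; waveforms are still loaded lazily, one at a time.
    files = list(itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                 DATA_PATH.glob("**/*.wav")))
    random.shuffle(files)
    for file in files:
        waveform, _ = librosa.load(file, sr=hparams.sample_rate)
        yield waveform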
What should my code look like?
Please try two things: add dataset.shuffle(buffer_size=1000) to the input pipeline, and evaluate loss after each training epoch. As illustrated below:
dataset = audio_dataset.padded_batch(5, padded_shapes=[None])
dataset = dataset.shuffle(buffer_size=1000)

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)
signals = iterator.get_next()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(20):
        print(i)
        sess.run(dataset_init_op)
        while True:
            try:
                # keep the most recent loss value of the epoch
                l, _ = sess.run((loss, train_op))
            except tf.errors.OutOfRangeError:
                break
        # print loss for each epoch
        print(l)
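Note that, because shuffle here is applied after padded_batch, it reorders whole batches of five files; if you would rather mix the individual files, you could shuffle before batching instead, for example:

# shuffle individual waveforms before batching so examples, not whole batches, get mixed
dataset = audio_dataset.shuffle(buffer_size=1000).padded_batch(5, padded_shapes=[None])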
If I had access to a few data samples, I might be able to help more precisely. For now I am working blind here; in any case, do let me know if this works.
This looks like an architectural issue. First, you are generating your data on the fly, which, despite being a commonly employed technique, is not always the most reasonable choice. This is because one of the downsides of Dataset.from_generator() is that shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).
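For example, on the generator-backed dataset from the question, the buffer size directly trades pipeline stalls against mixing quality:

# All `buffer_size` waveforms must be decoded before the first shuffled example
# is produced; a small buffer fills quickly but only mixes neighbouring files.
shuffled = audio_dataset.shuffle(buffer_size=1000)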
It might be a good idea to convert your data into numpy arrays and then serialize those arrays to TFRecord files on disk to use as your dataset, like so:
def array_to_tfrecords(X, y, output_file):
    feature = {
        'X': tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten())),
        'y': tf.train.Feature(float_list=tf.train.FloatList(value=y.flatten()))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    serialized = example.SerializeToString()

    writer = tf.python_io.TFRecordWriter(output_file)
    writer.write(serialized)
    writer.close()
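A hypothetical driver loop could then write one file per example; get_label is a placeholder here, since the question does not say how the transcription targets are stored:

# Hypothetical conversion loop: one TFRecord per waveform.
for i, waveform in enumerate(read_wavfile()):
    y = get_label(i)  # placeholder for however the targets are obtained
    array_to_tfrecords(waveform, y, "example_{:05d}.tfrecord".format(i))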
This takes the Dataset.from_generator component out of the equation. The data can then be read with:
def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(parse_proto)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    return tf.contrib.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
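The parse_proto function used above is not spelled out here; a minimal sketch, assuming the variable-length X/y float features written by array_to_tfrecords, could look like this:

def parse_proto(example_proto):
    # Both features were written as flat float lists of unknown length,
    # so parse them as variable-length features and densify them.
    features = {
        'X': tf.VarLenFeature(tf.float32),
        'y': tf.VarLenFeature(tf.float32),
    }
    parsed = tf.parse_single_example(example_proto, features)
    return tf.sparse_tensor_to_dense(parsed['X']), tf.sparse_tensor_to_dense(parsed['y'])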
This should ensure your data is thoroughly shuffled and give better results.
Additionally, I believe you would benefit from a little data preprocessing. For starters, try converting all the files in your dataset into a standardized WAV format and then saving that data to TFRecords. Currently you are decoding them and standardizing the sample rate with librosa, but that doesn't standardize the channels. Instead, try using a function like:
from pydub import AudioSegment

def convert(path):
    # open file (supports all ffmpeg-supported file types)
    audio = AudioSegment.from_file(path, path.split('.')[-1].lower())
    # set to mono
    audio = audio.set_channels(1)
    # set to 44.1 kHz
    audio = audio.set_frame_rate(44100)
    # save as WAV (overwrites the original path)
    audio.export(path, format="wav")
Lastly, you might find that feeding the sound files to the network as raw floating-point samples isn't in your best interest. You should consider trying something like:
import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf

def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):
    # open wav file
    (rate, sig) = wave.read(path)
    # get frames
    winfunc = lambda x: np.ones((x,))
    frames = psf.sigproc.framesig(sig, winlen * rate, winstep * rate, winfunc)
    # magnitude spectrogram
    magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))
    # noise reduction (mean subtract)
    magspec -= magspec.mean(axis=0)
    # normalize values between 0 and 1
    magspec -= magspec.min(axis=0)
    magspec /= magspec.max(axis=0)
    # show spec dimensions
    print(magspec.shape)
    return magspec
Then apply the functions like so:
#convert file if you need to
convert(filepath)
#get spectrogram
spec = getSpectrogram(filepath)
This will turn the data from the WAV files into spectrogram "images", which you can then handle in much the same way you would any image classification problem.
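For instance, a small sketch of how a batch of such spectrogram "images" could be assembled (train_files and the padding are illustrative, not from the original post):

import numpy as np

train_files = ["clip1.ogg", "clip2.wav"]
specs = []
for path in train_files:
    convert(path)                        # standardize to mono, 44.1 kHz WAV
    specs.append(getSpectrogram(path))   # shape: (freq_bins, time_frames)

# pad all spectrograms to a common width, then stack them like images
max_len = max(s.shape[1] for s in specs)
batch = np.stack([np.pad(s, ((0, 0), (0, max_len - s.shape[1])), mode="constant")
                  for s in specs])       # shape: (batch, freq_bins, max_len)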