I am training an LSTM network on audio data in TensorFlow (Python). My dataset is a bunch of sound files which read_wavfile turns into a generator of numpy arrays. I decided to try training my network on the same dataset 20 times, and wrote the following code.
import itertools

import librosa
import tensorflow as tf

from with_hyperparams import stft
from model import lstm_network


def read_wavfile():
    # DATA_PATH and hparams are defined elsewhere in the project
    for file in itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                DATA_PATH.glob("**/*.wav")):
        waveform, samplerate = librosa.load(file, sr=hparams.sample_rate)
        if len(waveform.shape) > 1:
            waveform = waveform[:, 1]
        yield waveform


audio_dataset = tf.data.Dataset.from_generator(
    read_wavfile,
    tf.float32,
    tf.TensorShape([None]))

dataset = audio_dataset.padded_batch(5, padded_shapes=[None])

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)
signals = iterator.get_next()

magnitude_spectrograms = tf.abs(stft(signals))

output, loss = lstm_network(magnitude_spectrograms)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(20):
        print(i)
        sess.run(dataset_init_op)
        while True:
            try:
                l, _ = sess.run((loss, train_op))
                print(l)
            except tf.errors.OutOfRangeError:
                break
The full code, including the sufficiently free data used (Wikipedia sound files with IPA transcriptions), is on GitHub. The non-free data (EMU corpus sound files) does make a significant difference, though I am not sure how best to show that to you: the loss goes down during the first pass, then the script prints 1, indicating the second loop, and suddenly the loss is at about 5000 again. With the glob order DATA_PATH.glob("**/*.wav"), DATA_PATH.glob("**/*.ogg"), the loss starts at below 5000 and goes down to about 1000, before jumping up to 4000 again for the *.ogg samples. Re-ordering the samples gives me a different result, so it looks like the WAV files are more similar to each other than the OGG files.

I have a notion that shuffling should ideally happen at the level of the dataset, and not rely on it being read in random order. However, that would mean reading a lot of WAV files into memory, which does not sound like a good solution.
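One cheap middle ground I can think of is shuffling just the list of file names before reading them, so at least the read order changes between epochs (a sketch, not code from my repository):

import random

def read_wavfile_shuffled():
    # Shuffle only the file paths; waveforms are still loaded lazily, one at a time.
    files = list(itertools.chain(DATA_PATH.glob("**/*.ogg"),
                                 DATA_PATH.glob("**/*.wav")))
    random.shuffle(files)
    for file in files:
        waveform, _ = librosa.load(file, sr=hparams.sample_rate)
        yield waveform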
What should my code look like?
Please try two things: add dataset.shuffle(buffer_size=1000) to the input pipeline, and evaluate loss after each training epoch. As illustrated below:
dataset = audio_dataset.padded_batch(5, padded_shapes=[None])
dataset = dataset.shuffle(buffer_size=1000)

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
dataset_init_op = iterator.make_initializer(dataset)
signals = iterator.get_next()

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(20):
        print(i)
        sess.run(dataset_init_op)
        while True:
            try:
                # keep the most recent loss value of the epoch
                l, _ = sess.run((loss, train_op))
            except tf.errors.OutOfRangeError:
                break
        # print loss for each epoch
        print(l)
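Note that, because shuffle here is applied after padded_batch, it reorders whole batches of five files; if you would rather mix the individual files, you could shuffle before batching instead, for example:

# shuffle individual waveforms before batching so examples, not whole batches, get mixed
dataset = audio_dataset.shuffle(buffer_size=1000).padded_batch(5, padded_shapes=[None])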
If I had access to a few data samples, I might be able to help more precisely. For now I am working blind here; in any case, do let me know if this works.
This looks like an architectural issue. First, you are generating your data on the fly, which, despite being a commonly employed technique, is not always the most reasonable choice. This is because one of the downsides of Dataset.from_generator() is that shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).
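For example, on the generator-backed dataset from the question, the buffer size directly trades pipeline stalls against mixing quality:

# All `buffer_size` waveforms must be decoded before the first shuffled example
# is produced; a small buffer fills quickly but only mixes neighbouring files.
shuffled = audio_dataset.shuffle(buffer_size=1000)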
It might be a good idea to convert your data into numpy arrays and then serialize those arrays to TFRecord files on disk to use as your dataset, like so:
def array_to_tfrecords(X, y, output_file):
    feature = {
        'X': tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten())),
        'y': tf.train.Feature(float_list=tf.train.FloatList(value=y.flatten()))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    serialized = example.SerializeToString()

    writer = tf.python_io.TFRecordWriter(output_file)
    writer.write(serialized)
    writer.close()
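A hypothetical driver loop could then write one file per example; get_label is a placeholder here, since the question does not say how the transcription targets are stored:

# Hypothetical conversion loop: one TFRecord per waveform.
for i, waveform in enumerate(read_wavfile()):
    y = get_label(i)  # placeholder for however the targets are obtained
    array_to_tfrecords(waveform, y, "example_{:05d}.tfrecord".format(i))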
This takes the Dataset.from_generator component out of the equation. The data can then be read with:
def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(parse_proto)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    return tf.contrib.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
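The parse_proto function used above is not spelled out here; a minimal sketch, assuming the variable-length X/y float features written by array_to_tfrecords, could look like this:

def parse_proto(example_proto):
    # Both features were written as flat float lists of unknown length,
    # so parse them as variable-length features and densify them.
    features = {
        'X': tf.VarLenFeature(tf.float32),
        'y': tf.VarLenFeature(tf.float32),
    }
    parsed = tf.parse_single_example(example_proto, features)
    return tf.sparse_tensor_to_dense(parsed['X']), tf.sparse_tensor_to_dense(parsed['y'])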
This should ensure your data is thoroughly shuffled and give better results.
Additionally, I believe you would benefit from a little data preprocessing. For starters, try converting all the files in your dataset into a standardized WAV format and then saving that data to TFRecords. Currently you are decoding them and standardizing the sample rate with librosa, but that doesn't standardize the channels. Instead, try using a function like:
from pydub import AudioSegment

def convert(path):
    # open file (supports all ffmpeg-supported file types)
    audio = AudioSegment.from_file(path, path.split('.')[-1].lower())
    # set to mono
    audio = audio.set_channels(1)
    # set to 44.1 kHz
    audio = audio.set_frame_rate(44100)
    # save as WAV (overwrites the original path)
    audio.export(path, format="wav")
Lastly, you might find that feeding the sound files to the network as raw floating-point samples isn't in your best interest. You should consider trying something like:
import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf

def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):
    # open wav file
    (rate, sig) = wave.read(path)
    # get frames
    winfunc = lambda x: np.ones((x,))
    frames = psf.sigproc.framesig(sig, winlen * rate, winstep * rate, winfunc)
    # magnitude spectrogram
    magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))
    # noise reduction (mean subtract)
    magspec -= magspec.mean(axis=0)
    # normalize values between 0 and 1
    magspec -= magspec.min(axis=0)
    magspec /= magspec.max(axis=0)
    # show spec dimensions
    print(magspec.shape)
    return magspec
Then apply the functions like so:
#convert file if you need to
convert(filepath)
#get spectrogram
spec = getSpectrogram(filepath)
This will turn the data from the WAV files into spectrogram "images", which you can then handle in much the same way you would any image classification problem.
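For instance, a small sketch of how a batch of such spectrogram "images" could be assembled (train_files and the padding are illustrative, not from the original post):

import numpy as np

train_files = ["clip1.ogg", "clip2.wav"]
specs = []
for path in train_files:
    convert(path)                        # standardize to mono, 44.1 kHz WAV
    specs.append(getSpectrogram(path))   # shape: (freq_bins, time_frames)

# pad all spectrograms to a common width, then stack them like images
max_len = max(s.shape[1] for s in specs)
batch = np.stack([np.pad(s, ((0, 0), (0, max_len - s.shape[1])), mode="constant")
                  for s in specs])       # shape: (batch, freq_bins, max_len)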