
AudioSet and TensorFlow Understanding

With AudioSet released, opening up a brand new area of research for those who do sound analysis, I've spent the last few days trying to dig into how to analyze and decode this data.

The data is given in .tfrecord files; here's a small snippet.

�^E^@^@^@^@^@^@C�bd
u
^[
^Hvideo_id^R^O

^KZZcwENgmOL0
^^
^Rstart_time_seconds^R^H^R^F
^D^@^@�C
^X
^Flabels^R^N^Z^L

�^B�^B�^B�^B�^B
^\
^Pend_time_seconds^R^H^R^F
^D^@^@�C^R�

�

^Oaudio_embedding^R�

�^A
�^A
�^A3�^] q^@�Z�r�����w���Q����.���^@�b�{m�^@P^@^S����,^]�x�����:^@����^@^@^Z0��^@]^Gr?v(^@^U^@��^EZ6�$
�^A

The example proto given is:

context: {
  feature: {
    key  : "video_id"
    value: {
      bytes_list: {
        value: [YouTube video id string]
      }
    }
  }
  feature: {
    key  : "start_time_seconds"
    value: {
      float_list: {
        value: 6.0
      }
    }
  }
  feature: {
    key  : "end_time_seconds"
    value: {
      float_list: {
        value: 16.0
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172] # The meaning of the labels can be found here.
      }
    }
  }
}
feature_lists: {
  feature_list: {
    key  : "audio_embedding"
    value: {
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
    }
    ... # Repeated for every second of the segment
  }
}
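
For context, this text form is the protobuf text format of a tf.train.SequenceExample: video_id, start_time_seconds, end_time_seconds, and labels are context features, while audio_embedding is a feature list with one entry per second. A minimal sketch of printing that text form from a record, using the TensorFlow 1.x API (the path here is a placeholder, not a real file):

import tensorflow as tf

record_path = 'eval/_1.tfrecord'  # placeholder: any AudioSet embeddings file

# Each item the iterator yields is one serialized SequenceExample.
for record in tf.python_io.tf_record_iterator(record_path):
    seq_example = tf.train.SequenceExample.FromString(record)
    print(seq_example)  # prints the text proto form shown above
    break  # only inspect the first record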

My very direct question here, something I can't seem to find good information on, is: how do I convert cleanly between the two?

If I have a machine-readable file, how do I make it human-readable, and vice versa?

I have found this example, which takes a tfrecord of an image and converts it to a readable format, but I can't seem to get it into a form that works with AudioSet.
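
For the reverse direction specifically (building a record by hand), the plain protobuf API can construct a SequenceExample field by field and serialize it back to a .tfrecord file. This is only a sketch with dummy values following the schema above, not code from the AudioSet tooling:

import tensorflow as tf

seq_example = tf.train.SequenceExample()
seq_example.context.feature['video_id'].bytes_list.value.append(b'ZZcwENgmOL0')
seq_example.context.feature['start_time_seconds'].float_list.value.append(6.0)
seq_example.context.feature['end_time_seconds'].float_list.value.append(16.0)
seq_example.context.feature['labels'].int64_list.value.extend([1, 522, 11, 172])

# One bytes feature per second, each holding 128 quantized 8-bit values.
dummy_embedding = bytes(bytearray(range(128)))  # stand-in data, not real features
seq_example.feature_lists.feature_list['audio_embedding'].feature.add().bytes_list.value.append(dummy_embedding)

with tf.python_io.TFRecordWriter('rebuilt.tfrecord') as writer:
    writer.write(seq_example.SerializeToString())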

Zach asked Mar 10 '23 16:03

1 Answer

This worked for me, storing the features in feat_audio. To plot them, convert them to an ndarray and reshape them accordingly.

import tensorflow as tf  # this answer uses the TensorFlow 1.x API

audio_record = '/audioset_v1_embeddings/eval/_1.tfrecord'
vid_ids = []
labels = []
start_time_seconds = []  # in seconds
end_time_seconds = []
feat_audio = []
count = 0
for example in tf.python_io.tf_record_iterator(audio_record):
    # The context features (video_id, labels, times) parse as a plain Example.
    tf_example = tf.train.Example.FromString(example)
    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)

    # The per-second embeddings live in the feature_lists of a SequenceExample.
    tf_seq_example = tf.train.SequenceExample.FromString(example)
    n_frames = len(tf_seq_example.feature_lists.feature_list['audio_embedding'].feature)

    sess = tf.InteractiveSession()
    audio_frame = []
    # Iterate through frames: each one is 128 bytes, one quantized 8-bit
    # value per embedding dimension; decode to uint8 and cast to float32.
    for i in range(n_frames):
        audio_frame.append(tf.cast(tf.decode_raw(
                tf_seq_example.feature_lists.feature_list['audio_embedding'].feature[i].bytes_list.value[0], tf.uint8),
                tf.float32).eval())

    sess.close()
    feat_audio.append([])
    feat_audio[count].append(audio_frame)
    count += 1
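
As a usage sketch of the plotting step mentioned above (numpy and matplotlib are assumptions here, not part of the original answer):

import numpy as np
import matplotlib.pyplot as plt

# feat_audio[0][0] is a list of per-second frames, each a length-128 array.
embeddings = np.array(feat_audio[0][0])  # shape: (n_frames, 128)

plt.imshow(embeddings.T, aspect='auto', origin='lower')
plt.xlabel('frame (seconds)')
plt.ylabel('embedding dimension')
plt.show()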
jerpint answered Mar 20 '23 15:03