
Pure Java/Scala code for writing Tensorflow TFRecords data file

I'm trying to write a pure Java/Scala implementation of the Tensorflow RecordWriter class in order to convert a Spark DataFrame into a TFRecords file. According to the documentation, each record in a TFRecords file is formatted as follows:

uint64 length
uint32 masked_crc32_of_length
byte   data[length]
uint32 masked_crc32_of_data
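
Summing those fields, a record with an n-byte payload occupies 8 + 4 + n + 4 bytes, all little-endian. As a minimal sketch of that framing (the class name `TFRecordFraming` is hypothetical, and the masked CRCs are passed in as placeholders since masking is covered next):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TFRecordFraming {
    // Per the layout above: 8-byte length, 4-byte masked CRC of the
    // length bytes, the payload itself, then a 4-byte masked CRC of
    // the payload -- all encoded little-endian.
    static byte[] frame(byte[] data, int maskedLenCrc, int maskedDataCrc) {
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + data.length + 4)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(data.length);      // uint64 length
        buf.putInt(maskedLenCrc);      // uint32 masked_crc32_of_length
        buf.put(data);                 // byte   data[length]
        buf.putInt(maskedDataCrc);     // uint32 masked_crc32_of_data
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] rec = frame(new byte[]{1, 2, 3}, 0, 0);
        System.out.println(rec.length); // 19 = 8 + 4 + 3 + 4
    }
}
```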

And the CRC mask is computed as:

masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul

Currently, I compute the CRC with Guava's CRC32C implementation, using the following code:

import com.google.common.hash.Hashing

object CRC32 {
  val kMaskDelta = 0xa282ead8

  def hash(in: Array[Byte]): Int = {
    val hashing = Hashing.crc32c()
    hashing.hashBytes(in).asInt()
  }

  def mask(crc: Int): Int = {
    ((crc >> 15) | (crc << 17)) + kMaskDelta
  }
}

The rest of my code follows. The data encoding part is done with the following piece of code:

import java.io.ByteArrayOutputStream
import com.google.common.io.LittleEndianDataOutputStream

object LittleEndianEncoding {
  def encodeLong(in: Long): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new LittleEndianDataOutputStream(baos)
    out.writeLong(in)
    baos.toByteArray
  }

  def encodeInt(in: Int): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new LittleEndianDataOutputStream(baos)
    out.writeInt(in)
    baos.toByteArray
  }
}

The records are generated with protocol buffers:

import com.google.protobuf.ByteString
import org.tensorflow.example._

import collection.JavaConversions._
import collection.mutable._

object TFRecord {

  def int64Feature(in: Long): Feature = {

    val valueBuilder = Int64List.newBuilder()
    valueBuilder.addValue(in)

    Feature.newBuilder()
      .setInt64List(valueBuilder.build())
      .build()
  }


  def floatFeature(in: Float): Feature = {
    val valueBuilder = FloatList.newBuilder()
    valueBuilder.addValue(in)
    Feature.newBuilder()
      .setFloatList(valueBuilder.build())
      .build()
  }

  def floatVectorFeature(in: Array[Float]): Feature = {
    val valueBuilder = FloatList.newBuilder()
    in.foreach(valueBuilder.addValue)

    Feature.newBuilder()
      .setFloatList(valueBuilder.build())
      .build()
  }

  def bytesFeature(in: Array[Byte]): Feature = {
    val valueBuilder = BytesList.newBuilder()
    valueBuilder.addValue(ByteString.copyFrom(in))
    Feature.newBuilder()
      .setBytesList(valueBuilder.build())
      .build()
  }

  def makeFeatures(features: HashMap[String, Feature]): Features = {
    Features.newBuilder().putAllFeature(features).build()
  }


  def makeExample(features: Features): Example = {
    Example.newBuilder().setFeatures(features).build()
  }

}

And here is an example of how I put things together in order to generate my TFRecords file:

val label = TFRecord.int64Feature(1)
val feature = TFRecord.floatVectorFeature(Array[Float](1, 2, 3, 4))
val features = TFRecord.makeFeatures(HashMap[String, Feature]("feature" -> feature, "label" -> label))
val ex = TFRecord.makeExample(features)
val exSerialized = ex.toByteArray()
val length = LittleEndianEncoding.encodeLong(exSerialized.length)
val crcLength = LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(length)))
val crcEx = LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(exSerialized)))

val out = new FileOutputStream(new File("test.tfrecords"))
out.write(length)
out.write(crcLength)
out.write(exSerialized)
out.write(crcEx)
out.close()

When I try to read the resulting file in Tensorflow with TFRecordReader, I get the following error:

W tensorflow/core/common_runtime/executor.cc:1076] 0x24cc430 Compute status: Data loss: corrupted record at 0

I suspect that either the CRC mask computation is not correct, or the endianness of the Java-generated file does not match what the C++ reader expects.

asked Jan 10 '16 by jrabary


1 Answer

FWIW, the Tensorflow team has provided utility code for reading and writing TFRecords, which can be found in the ecosystem repo.
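
Beyond the ecosystem utilities, one likely culprit in the question's mask function is the shift operator: Java and Scala's `>>` is an arithmetic shift that sign-extends, whereas the C++ reference code operates on `uint32`, where `>>` is a logical shift. When the CRC's high bit is set, `>>` and `>>>` diverge. A sketch of the masking with the logical shift `>>>` (class name `Crc32Mask` is illustrative):

```java
public class Crc32Mask {
    static final int MASK_DELTA = 0xa282ead8;

    // ((crc >> 15) | (crc << 17)) + kMaskDelta, on unsigned 32-bit
    // semantics: >>> avoids the sign-extension that Java/Scala's >>
    // would introduce when the CRC's high bit is set. The + wraps
    // mod 2^32 either way, matching C++ unsigned addition.
    static int mask(int crc) {
        return ((crc >>> 15) | (crc << 17)) + MASK_DELTA;
    }

    public static void main(String[] args) {
        // A CRC with the high bit set: >> would sign-extend here
        // and produce a different (wrong) masked value.
        int crc = 0x80000000;
        System.out.println(Integer.toHexString(mask(crc))); // a283ead8
    }
}
```

With `>>` instead of `>>>`, `mask(0x80000000)` would come out as `a281ead8`, which is exactly the kind of mismatch that makes the C++ reader report a corrupted record.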

answered Sep 16 '22 by shark8me