I'm trying to write a pure Java/Scala implementation of the TensorFlow RecordWriter class in order to convert a Spark DataFrame into a TFRecords file. According to the documentation, each TFRecords record is formatted as follows:
uint64 length
uint32 masked_crc32_of_length
byte data[length]
uint32 masked_crc32_of_data
And the CRC mask is computed as:
masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
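The ul suffix indicates that the C++ code performs this arithmetic on unsigned 32-bit values, so it wraps modulo 2^32. As a reference point, here is a minimal Scala sketch of that reading of the formula, emulating the unsigned arithmetic with a Long; this is my own transcription, not code taken from TensorFlow:

object MaskSketch {
  // Emulate C++ uint32 arithmetic by keeping the value in the low 32 bits of a Long.
  // Assumes crc is already a non-negative value in [0, 2^32).
  // The shift-and-or is a 32-bit right rotation by 15; the delta is then added mod 2^32.
  def maskedCrc32(crc: Long): Long = {
    val rotated = ((crc >>> 15) | (crc << 17)) & 0xFFFFFFFFL
    (rotated + 0xA282EAD8L) & 0xFFFFFFFFL
  }
}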
Currently, I compute the CRC with Guava's implementation, using the following code:
import com.google.common.hash.Hashing

object CRC32 {
  val kMaskDelta = 0xa282ead8

  def hash(in: Array[Byte]): Int = {
    val hashing = Hashing.crc32c()
    hashing.hashBytes(in).asInt()
  }

  def mask(crc: Int): Int = {
    ((crc >> 15) | (crc << 17)) + kMaskDelta
  }
}
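To rule out the Guava hash itself, I checked it against the standard CRC-32C test vector: the ASCII string "123456789" must hash to 0xe3069283. This is my own sanity check, with the expected value taken from the CRC-32C specification:

// CRC-32C (Castagnoli) check value: crc32c("123456789") == 0xe3069283.
// 0xe3069283 is a negative Int on the JVM, but only the 32-bit pattern matters here.
val check = CRC32.hash("123456789".getBytes("US-ASCII"))
assert(check == 0xe3069283)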
The rest of my code follows. The data encoding part is done with the following piece of code:
import java.io.ByteArrayOutputStream
import com.google.common.io.LittleEndianDataOutputStream

object LittleEndianEncoding {
  def encodeLong(in: Long): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new LittleEndianDataOutputStream(baos)
    out.writeLong(in)
    baos.toByteArray
  }

  def encodeInt(in: Int): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new LittleEndianDataOutputStream(baos)
    out.writeInt(in)
    baos.toByteArray
  }
}
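As a quick byte-order check (the example values are my own), encoding small integers should put the least significant byte first:

// Little-endian: least significant byte comes first.
assert(LittleEndianEncoding.encodeInt(1).sameElements(Array[Byte](1, 0, 0, 0)))
assert(LittleEndianEncoding.encodeLong(1).sameElements(Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)))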
The records are generated with protocol buffers:
import com.google.protobuf.ByteString
import org.tensorflow.example._
import collection.JavaConversions._
import collection.mutable._

object TFRecord {
  def int64Feature(in: Long): Feature = {
    val valueBuilder = Int64List.newBuilder()
    valueBuilder.addValue(in)
    Feature.newBuilder()
      .setInt64List(valueBuilder.build())
      .build()
  }

  def floatFeature(in: Float): Feature = {
    val valueBuilder = FloatList.newBuilder()
    valueBuilder.addValue(in)
    Feature.newBuilder()
      .setFloatList(valueBuilder.build())
      .build()
  }

  def floatVectorFeature(in: Array[Float]): Feature = {
    val valueBuilder = FloatList.newBuilder()
    in.foreach(valueBuilder.addValue)
    Feature.newBuilder()
      .setFloatList(valueBuilder.build())
      .build()
  }

  def bytesFeature(in: Array[Byte]): Feature = {
    val valueBuilder = BytesList.newBuilder()
    valueBuilder.addValue(ByteString.copyFrom(in))
    Feature.newBuilder()
      .setBytesList(valueBuilder.build())
      .build()
  }

  def makeFeatures(features: HashMap[String, Feature]): Features = {
    Features.newBuilder().putAllFeature(features).build()
  }

  def makeExample(features: Features): Example = {
    Example.newBuilder().setFeatures(features).build()
  }
}
And here is an example of how I put everything together to generate my TFRecords file:
import java.io.{File, FileOutputStream}

val label = TFRecord.int64Feature(1)
val feature = TFRecord.floatVectorFeature(Array[Float](1, 2, 3, 4))
val features = TFRecord.makeFeatures(HashMap[String, Feature]("feature" -> feature, "label" -> label))
val ex = TFRecord.makeExample(features)
val exSerialized = ex.toByteArray()

val length = LittleEndianEncoding.encodeLong(exSerialized.length)
val crcLength = LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(length)))
val crcEx = LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(exSerialized)))

val out = new FileOutputStream(new File("test.tfrecords"))
out.write(length)
out.write(crcLength)
out.write(exSerialized)
out.write(crcEx)
out.close()
When I try to read the resulting file inside TensorFlow with TFRecordReader, I get the following error:
W tensorflow/core/common_runtime/executor.cc:1076] 0x24cc430 Compute status: Data loss: corrupted record at 0
I suspect that either the CRC mask computation is not correct, or the endianness of the Java-generated file does not match what the C++ reader expects.
Once we have created an Example for an image, we need to write it into a TFRecord file. This can be done with a TFRecord writer, where tfrecord_file_name is the name of the TFRecord file in which we want to store the images; the writer will create the file automatically (a sketch follows).
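A minimal Scala sketch of such a writer loop, built on the helpers defined in the question, might look like the one below. The writeExamples name and the tfrecordFileName parameter (the tfrecord_file_name mentioned above) are purely illustrative, and the sketch reuses the same CRC masking as the question, so it inherits whatever problem is causing the corruption error:

import java.io.FileOutputStream
import org.tensorflow.example.Example

// Hypothetical helper: frame each serialized Example with the length/CRC layout
// described above and append it to the named file.
def writeExamples(tfrecordFileName: String, examples: Seq[Example]): Unit = {
  val out = new FileOutputStream(tfrecordFileName)
  try {
    examples.foreach { ex =>
      val data = ex.toByteArray
      val length = LittleEndianEncoding.encodeLong(data.length)
      out.write(length)
      out.write(LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(length))))
      out.write(data)
      out.write(LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(data))))
    }
  } finally {
    out.close()
  }
}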
The TFRecord format is a simple format for storing a sequence of binary records. Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.
The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching.
FWIW, the TensorFlow team has provided utility code for reading/writing TFRecords, which can be found in the ecosystem repo.