I'm trying to write a pure Java/Scala implementation of the TensorFlow RecordWriter class in order to convert a Spark DataFrame into a TFRecords file. According to the documentation, each TFRecords record is formatted as follows:
uint64 length
uint32 masked_crc32_of_length
byte data[length]
uint32 masked_crc32_of_data
And the CRC mask is computed as:
masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
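The ul suffix indicates that the C++ code performs this arithmetic on unsigned 32-bit values, so it wraps modulo 2^32. As a reference point, here is a minimal Scala sketch of that reading of the formula, emulating the unsigned arithmetic with a Long; this is my own transcription, not code taken from TensorFlow:

object MaskSketch {
  // Emulate C++ uint32 arithmetic by keeping the value in the low 32 bits of a Long.
  // Assumes crc is already a non-negative value in [0, 2^32).
  // The shift-and-or is a 32-bit right rotation by 15; the delta is then added mod 2^32.
  def maskedCrc32(crc: Long): Long = {
    val rotated = ((crc >>> 15) | (crc << 17)) & 0xFFFFFFFFL
    (rotated + 0xA282EAD8L) & 0xFFFFFFFFL
  }
}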
Currently, I compute the CRC with Guava's implementation, using the following code:
import com.google.common.hash.Hashing

object CRC32 {
  val kMaskDelta = 0xa282ead8

  def hash(in: Array[Byte]): Int = {
    val hashing = Hashing.crc32c()
    hashing.hashBytes(in).asInt()
  }

  def mask(crc: Int): Int = {
    ((crc >> 15) | (crc << 17)) + kMaskDelta
  }
}
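To rule out the Guava hash itself, I checked it against the standard CRC-32C test vector: the ASCII string "123456789" must hash to 0xe3069283. This is my own sanity check, with the expected value taken from the CRC-32C specification:

// CRC-32C (Castagnoli) check value: crc32c("123456789") == 0xe3069283.
// 0xe3069283 is a negative Int on the JVM, but only the 32-bit pattern matters here.
val check = CRC32.hash("123456789".getBytes("US-ASCII"))
assert(check == 0xe3069283)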
The rest of my code follows. The data encoding part is done with the following piece of code:
import java.io.ByteArrayOutputStream
import com.google.common.io.LittleEndianDataOutputStream

object LittleEndianEncoding {
  def encodeLong(in: Long): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new LittleEndianDataOutputStream(baos)
    out.writeLong(in)
    baos.toByteArray
  }

  def encodeInt(in: Int): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val out = new LittleEndianDataOutputStream(baos)
    out.writeInt(in)
    baos.toByteArray
  }
}
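As a quick byte-order check (the example values are my own), encoding small integers should put the least significant byte first:

// Little-endian: least significant byte comes first.
assert(LittleEndianEncoding.encodeInt(1).sameElements(Array[Byte](1, 0, 0, 0)))
assert(LittleEndianEncoding.encodeLong(1).sameElements(Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)))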
The records are generated with protocol buffers:
import com.google.protobuf.ByteString
import org.tensorflow.example._
import collection.JavaConversions._
import collection.mutable._

object TFRecord {
  def int64Feature(in: Long): Feature = {
    val valueBuilder = Int64List.newBuilder()
    valueBuilder.addValue(in)
    Feature.newBuilder()
      .setInt64List(valueBuilder.build())
      .build()
  }

  def floatFeature(in: Float): Feature = {
    val valueBuilder = FloatList.newBuilder()
    valueBuilder.addValue(in)
    Feature.newBuilder()
      .setFloatList(valueBuilder.build())
      .build()
  }

  def floatVectorFeature(in: Array[Float]): Feature = {
    val valueBuilder = FloatList.newBuilder()
    in.foreach(valueBuilder.addValue)
    Feature.newBuilder()
      .setFloatList(valueBuilder.build())
      .build()
  }

  def bytesFeature(in: Array[Byte]): Feature = {
    val valueBuilder = BytesList.newBuilder()
    valueBuilder.addValue(ByteString.copyFrom(in))
    Feature.newBuilder()
      .setBytesList(valueBuilder.build())
      .build()
  }

  def makeFeatures(features: HashMap[String, Feature]): Features = {
    Features.newBuilder().putAllFeature(features).build()
  }

  def makeExample(features: Features): Example = {
    Example.newBuilder().setFeatures(features).build()
  }
}
And here is an example of how I put everything together to generate my TFRecords file:
import java.io.{File, FileOutputStream}

val label = TFRecord.int64Feature(1)
val feature = TFRecord.floatVectorFeature(Array[Float](1, 2, 3, 4))
val features = TFRecord.makeFeatures(HashMap[String, Feature]("feature" -> feature, "label" -> label))
val ex = TFRecord.makeExample(features)
val exSerialized = ex.toByteArray()

val length = LittleEndianEncoding.encodeLong(exSerialized.length)
val crcLength = LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(length)))
val crcEx = LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(exSerialized)))

val out = new FileOutputStream(new File("test.tfrecords"))
out.write(length)
out.write(crcLength)
out.write(exSerialized)
out.write(crcEx)
out.close()
When I try to read the resulting file inside TensorFlow with TFRecordReader, I get the following error:
W tensorflow/core/common_runtime/executor.cc:1076] 0x24cc430 Compute status: Data loss: corrupted record at 0
I suspect that either the CRC mask computation is not correct, or the endianness of the Java-generated file does not match what the C++ reader expects.
Once we have created an Example for an image, we need to write it into a TFRecord file. This can be done with a TFRecord writer, where tfrecord_file_name is the name of the TFRecord file in which we want to store the images; the writer will create the file automatically (a sketch follows).
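A minimal Scala sketch of such a writer loop, built on the helpers defined in the question, might look like the one below. The writeExamples name and the tfrecordFileName parameter (the tfrecord_file_name mentioned above) are purely illustrative, and the sketch reuses the same CRC masking as the question, so it inherits whatever problem is causing the corruption error:

import java.io.FileOutputStream
import org.tensorflow.example.Example

// Hypothetical helper: frame each serialized Example with the length/CRC layout
// described above and append it to the named file.
def writeExamples(tfrecordFileName: String, examples: Seq[Example]): Unit = {
  val out = new FileOutputStream(tfrecordFileName)
  try {
    examples.foreach { ex =>
      val data = ex.toByteArray
      val length = LittleEndianEncoding.encodeLong(data.length)
      out.write(length)
      out.write(LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(length))))
      out.write(data)
      out.write(LittleEndianEncoding.encodeInt(CRC32.mask(CRC32.hash(data))))
    }
  } finally {
    out.close()
  }
}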
The TFRecord format is a simple format for storing a sequence of binary records. Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.
The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching.
FWIW, the TensorFlow team has provided utility code for reading/writing TFRecords, which can be found in the ecosystem repo.