Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In Java, how can I create an equivalent of an Apache Avro container file without being forced to use a File as a medium?

Tags:

This is somewhat of a shot in the dark in case anyone savvy with the Java implementation of Apache Avro is reading this.

My high-level objective is to have some way to transmit some series of avro data over the network (let's just say HTTP for example, but the particular protocol is not that important for this purpose). In my context I have a HttpServletResponse I need to write this data to somehow.

I initially attempted to write the data as what amounted to a virtual version of an avro container file (suppose that "response" is of type HttpServletResponse):

response.setContentType("application/octet-stream");
response.setHeader("Content-transfer-encoding", "binary");
ServletOutputStream outStream = response.getOutputStream();
BufferedOutputStream bos = new BufferedOutputStream(outStream);

Schema someSchema = Schema.parse(".....some valid avro schema....");
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("somefield", someData);
...

GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
fileWriter.create(someSchema, bos);
fileWriter.append(someRecord);
fileWriter.close();
bos.flush();

This was all fine and dandy, except that it turns out Avro doesn't really provide a way to read a container file apart from an actual file: the DataFileReader only has two constructors:

public DataFileReader(File file, DatumReader<D> reader);

and

public DataFileReader(SeekableInput sin, DatumReader<D> reader);

where SeekableInput is some avro-specific customized form whose creation also ends up reading from a file. Now given that, unless there is some way to somehow coerce an InputStream into a File (http://stackoverflow.com/questions/578305/create-a-java-file-object-or-equivalent-using-a-byte-array-in-memory-without-a suggests that there is not, and I have tried looking around the Java documentation as well), this approach won't work if the reader on the other end of the OutputStream receives that avro container file (I'm not sure why they allowed one to output avro binary container files to an arbitrary OutputStream without providing a way to read them from the corresponding InputStream on the other end, but that's beside the point). It seems that the implementation of the container file reader requires the "seekable" functionality that a concrete File provides.

Okay, so it doesn't look like that approach will do what I want. How about creating a JSON response that mimics the avro container file?

public static Schema WRAPPER_SCHEMA = Schema.parse(
  "{\"type\": \"record\", " +
   "\"name\": \"AvroContainer\", " +
   "\"doc\": \"a JSON avro container file\", " +
   "\"namespace\": \"org.bar.foo\", " +
   "\"fields\": [" +
     "{\"name\": \"schema\", \"type\": \"string\", \"doc\": \"schema representing the included data\"}, " +
     "{\"name\": \"data\", \"type\": \"bytes\", \"doc\": \"packet of data represented by the schema\"}]}"
  );

I'm not sure if this is the best way to approach this given the above constraints, but it looks like this might do the trick. I'll put the schema (of "Schema someSchema" from above, for instance) as a String inside the "schema" field, and then put in the avro-binary-serialized form of a record fitting that schema (ie. "GenericRecord someRecord") inside the "data" field.

I actually wanted to know about a specific detail of that which is described below, but I thought it would be worthwhile to give a bigger context as well, so that if there is a better high-level approach I could be taking (this approach works but just doesn't feel optimal) please do let me know.

My question is, assuming I go with this JSON-based approach, how do I write the avro binary representation of my Record into the "data" field of the AvroContainer schema? For example, I got up to here:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
Encoder e = new BinaryEncoder(baos);
datumWriter.write(resultsRecord, e);
e.flush();

GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("schema", someSchema.toString());
someRecord.put("data", ByteBuffer.wrap(baos.toByteArray()));
datumWriter = new GenericDatumWriter<GenericRecord>(WRAPPER_SCHEMA);
JsonGenerator jsonGenerator = new JsonFactory().createJsonGenerator(baos, JsonEncoding.UTF8);
e = new JsonEncoder(WRAPPER_SCHEMA, jsonGenerator);
datumWriter.write(someRecord, e);
e.flush();

PrintWriter printWriter = response.getWriter(); // recall that response is the HttpServletResponse
response.setContentType("text/plain");
response.setCharacterEncoding("UTF-8");
printWriter.print(baos.toString("UTF-8"));

I initially tried omitting the ByteBuffer.wrap clause, but then then the line

datumWriter.write(someRecord, e);

threw an exception that I couldn't cast a byte array into ByteBuffer. Fair enough, it looks like when the Encoder class (of which JsonEncoder is a subclass) is called to write an avro Bytes object, it requires a ByteBuffer to be given as an argument. Thus, I tried encapsulating the byte[] with java.nio.ByteBuffer.wrap, but when the data was printed out, it was printed as a straight series of bytes, without being passed through the avro hexadecimal representation:

"data": {"bytes": ".....some gibberish other than the expected format...}

That doesn't seem right. According to the avro documentation, the example bytes object they give says that I need to put in a json object, an example of which looks like "\u00FF", and what I have put in there is clearly not of that format. What I now want to know is the following:

  • What is an example of an avro bytes format? Does it look something like "\uDEADBEEFDEADBEEF..."?
  • How do I coerce my binary avro data (as output by the BinaryEncoder into a byte[] array) into a format that I can stick into the GenericRecord object and have it print correctly in JSON? For example, I want an Object DATA for which I can call on some GenericRecord "someRecord.put("data", DATA);" with my avro serialized data inside?
  • How would I then read that data back into a byte array on the other (consumer) end, when it is given the text JSON representation and wants to recreate the GenericRecord as represented by the AvroContainer-format JSON?
  • (reiterating the question from before) Is there a better way I could be doing all this?
like image 945
omnilinguist Avatar asked Sep 24 '11 08:09

omnilinguist


People also ask

How is Avro different from JSON?

It is based on a subset of the JavaScript Programming Language. Avro can be classified as a tool in the "Serialization Frameworks" category, while JSON is grouped under "Languages". Redsift, OTTLabs, and Mon Style are some of the popular companies that use JSON, whereas Avro is used by Liferay, LendUp, and BetterCloud.

What is Avro schema?

Avro schema definitions are JSON records. Because it is a record, it can define multiple fields which are organized in a JSON array. Each such field identifies the field's name as well as its type. The type can be something simple, like an integer, or something complex, like another record.

Does Avro schema order matter?

Avro serializer/deserializers operate on fields in the order they are declared. Producers and Consumers must be on a compatible schema including the field order. Do not change the order of AVRO fields. All Producers and Consumers are must be updated at the same time if you change the field order.


1 Answers

As Knut said, if you want to use something other than a file, you can either:

  • use SeekableByteArrayInput, as Knut said, for anything you can shoe-horn into a byte array
  • Implement SeekablInput in your own way - for example if you were getting it out of some weird database structure.
  • Or just use a file. Why not?

Those are your answers.

like image 171
Ben Avatar answered Sep 26 '22 00:09

Ben