
java: assigning object reference IDs for custom serialization

For various reasons I have a custom serialization where I am dumping some fairly simple objects to a data file. There are maybe 5-10 classes, and the resulting object graphs are acyclic and pretty simple (each serialized object holds one or two references to other serialized objects). For example:

class Foo
{
    final private long id;
    public Foo(long id, /* other stuff */) { ... }
}

class Bar
{
    final private long id;
    final private Foo foo;
    public Bar(long id, Foo foo, /* other stuff */) { ... }
}

class Baz
{
    final private long id;
    final private List<Bar> barList;
    public Baz(long id, List<Bar> barList, /* other stuff */) { ... }
}

The id field exists only for serialization. When writing to a file, I keep a record of which IDs have been serialized so far. For each object, I check whether its child objects have been serialized and write the ones that haven't; then I write the object itself as its data fields plus the IDs of its child objects.
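As an illustration only, the write procedure described above could be sketched as follows. The `Item` interface shape, `SimpleItem`, `ItemWriter`, and the text "packet" format are all hypothetical stand-ins and not part of the real file format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical common interface: each object exposes its ID and its child objects.
interface Item {
    long id();
    List<Item> children();
}

// Hypothetical minimal Item used only for this sketch.
final class SimpleItem implements Item {
    private final long id;
    private final List<Item> children;

    SimpleItem(long id, List<Item> children) {
        this.id = id;
        this.children = children;
    }

    public long id() { return id; }
    public List<Item> children() { return children; }
}

final class ItemWriter {
    // IDs that have already been written to the file.
    private final Set<Long> writtenIds = new HashSet<>();
    // Stand-in for the real file: each entry is one "packet" in text form.
    private final List<String> packets = new ArrayList<>();

    // Writes unwritten children first, then the item itself, each at most once.
    // Relies on the object graph being acyclic.
    void write(Item item) {
        if (!writtenIds.add(item.id())) {
            return; // already in the file
        }
        StringBuilder record = new StringBuilder("id=" + item.id() + " refs=");
        for (Item child : item.children()) {
            write(child);
            record.append(child.id()).append(' ');
        }
        packets.add(record.toString().trim());
    }

    List<String> packets() { return packets; }

    public static void main(String[] args) {
        Item foo = new SimpleItem(1, Collections.<Item>emptyList());
        Item bar = new SimpleItem(2, Arrays.asList(foo));
        Item baz = new SimpleItem(3, Arrays.asList(bar));
        ItemWriter writer = new ItemWriter();
        writer.write(baz);
        writer.write(bar); // no effect: bar was written as a child of baz
        System.out.println(writer.packets());
    }
}
```

Because children are written before their parent, a reader that processes packets in order always finds a referenced ID before the object that refers to it.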

What's puzzling me is how to assign IDs. I thought about it, and it seems like there are three cases for assigning an ID:

  • dynamically-created objects -- id is assigned from a counter that increments
  • reading objects from disk -- id is assigned from the number stored in the disk file
  • singleton objects -- object is created prior to any dynamically-created object, to represent a singleton object that is always present.

How can I handle these properly? I feel like I'm reinventing the wheel and there must be a well-established technique for handling all the cases.


clarification: just as some tangential information, the file format I am looking at is approximately the following (glossing over a few details which should not be relevant). It's optimized to handle a fairly large amount of dense binary data (tens/hundreds of MB) with the ability to intersperse structured data in it. The dense binary data makes up 99.9% of the file size.

The file consists of a series of error-corrected blocks which serve as containers. Each block can be thought of as containing a byte array which consists of a series of packets. It is possible to read the packets one at a time in succession (i.e. it's possible to tell where each packet ends, and the next one starts immediately afterwards).

So the file can be thought of as a series of packets stored on top of an error-correcting layer. The vast majority of these packets are opaque binary data that has nothing to do with this question. A small minority of these packets, however, are items containing serialized structured data, forming a sort of "archipelago" consisting of data "islands" which may be linked by object reference relationships.

So I might have a file where packet 2971 contains a serialized Foo, and packet 12083 contains a serialized Bar that refers to the Foo in packet 2971. (with packets 0-2970 and 2972-12082 being opaque data packets)

These packets are all immutable (and therefore, given the constraints of Java object construction, they form an acyclic object graph), so I don't have to deal with mutability issues. They are also descendants of a common Item interface. What I would like to do is write an arbitrary Item object to the file. If the Item contains references to other Items, I need to write those to the file too, but only if they haven't been written yet. Otherwise I will have duplicates that I will need to somehow coalesce when I read them back.

Jason S asked Jun 08 '10 14:06


3 Answers

Do you really need to do this? Internally, the ObjectOutputStream tracks which objects have already been serialized. Subsequent writes of the same object store only an internal reference (similar to writing out just the id) rather than writing out the whole object again.

See Serialization Cache for more details.
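A small round trip can make this behaviour concrete. In the sketch below, `SharedFoo`, `SharedBar`, and `SharedReferenceDemo` are hypothetical names invented for the example; only the `java.io` classes are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-ins for the question's classes, without explicit IDs.
final class SharedFoo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    SharedFoo(String name) { this.name = name; }
}

final class SharedBar implements Serializable {
    private static final long serialVersionUID = 1L;
    final SharedFoo foo;
    SharedBar(SharedFoo foo) { this.foo = foo; }
}

final class SharedReferenceDemo {
    // Writes two Bars that share one Foo to a single stream, reads them back,
    // and reports whether the copies still share a single Foo instance.
    static boolean sharedFooSurvivesRoundTrip() {
        SharedFoo foo = new SharedFoo("shared");
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new SharedBar(foo));
                out.writeObject(new SharedBar(foo)); // foo goes out as a back-reference
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                SharedBar first = (SharedBar) in.readObject();
                SharedBar second = (SharedBar) in.readObject();
                return first.foo == second.foo;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sharedFooSurvivesRoundTrip());
    }
}
```

Note that the tracking is per stream: objects written through two different ObjectOutputStream instances, or on either side of a call to `reset()`, are written out in full again and come back as separate copies.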

If the IDs correspond to some externally defined identity, such as an entity ID, then this makes sense. But the question states that the IDs are generated purely to track which objects are serialized.

You can handle singletons via the readResolve method. A simple approach is to compare the freshly deserialized instance with your singleton instances, and if there is a match, return the singleton instance rather than the deserialized instance. E.g.

   private Object readResolve() {
      return (this.equals(SINGLETON)) ? SINGLETON : this;
      // or simply
      // return SINGLETON;
   }

EDIT: In response to the comments, the stream is mostly binary data (stored in an optimized format) with complex objects interspersed in that data. This can be handled by using a stream format that supports substreams, e.g. zip, or a simple block chunking. E.g. the stream can be a sequence of blocks:

offset 0  - block type
offset 4  - block length N
offset 8  - N bytes of data
...
offset N+8  start of next block

You can then have blocks for binary data, blocks for serialized data, blocks for XStream serialized data etc. Since each block knows its size, you can create a substream that reads up to that length from the current position in the file. This allows you to freely mix data without concerns for parsing.

To implement a stream, have your main stream parse the blocks, e.g.

   DataInputStream main = new DataInputStream(input);
   int blockType = main.readInt();
   int blockLength = main.readInt();
   // next N bytes are the data
   LimitInputStream data = new LimitInputStream(main, blockLength);

   if (blockType==BINARY) {
      handleBinaryBlock(new DataInputStream(data));
   }
   else if (blockType==OBJECTSTREAM) {
      deserialize(new ObjectInputStream(data));
   }
   else
      ...

A sketch of LimitInputStream looks like this:

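import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LimitInputStream extends FilterInputStream
{
   private int bytesRead;
   private final int limit;

   /** Reads up to limit bytes from in */
   public LimitInputStream(InputStream in, int limit) {
      super(in);
      this.limit = limit;
   }

   @Override
   public int read() throws IOException {
      // route single-byte reads through the limited bulk read
      byte[] one = new byte[1];
      return (read(one, 0, 1) == -1) ? -1 : (one[0] & 0xFF);
   }

   @Override
   public int read(byte[] data, int offs, int len) throws IOException {
      if (len==0) return 0; // read() contract mandates this
      if (bytesRead==limit)
         return -1; // end of this block, not of the underlying stream
      int toRead = Math.min(limit-bytesRead, len);
      int actuallyRead = super.read(data, offs, toRead);
      if (actuallyRead==-1)
          throw new EOFException("stream ended before the block did");
      bytesRead += actuallyRead;
      return actuallyRead;
   }

   // skip(), available() and mark/reset need the same treatment

   // don't propagate to underlying stream
   @Override
   public void close() { }
}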
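
The writing side of the block layout is not shown above. A minimal sketch of it, assuming the same 4-byte type and 4-byte length header, might look like this. The class name `BlockFormat` and the numeric values of the type codes are hypothetical, and the "object" payload is plain UTF-8 text standing in for real serialized data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BlockFormat {
    // Hypothetical block type codes; the layout above does not fix their values.
    static final int BINARY = 1;
    static final int OBJECTSTREAM = 2;

    // Appends one block: 4-byte type, 4-byte length N, then N bytes of data.
    static void writeBlock(DataOutputStream out, int type, byte[] data) throws IOException {
        out.writeInt(type);
        out.writeInt(data.length);
        out.write(data);
    }

    // Writes two blocks to memory, reads them back, and returns "type:length" pairs.
    static String roundTrip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeBlock(out, BINARY, new byte[] {1, 2, 3, 4, 5});
            writeBlock(out, OBJECTSTREAM, "structured".getBytes(StandardCharsets.UTF_8));
            out.flush();

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            StringBuilder seen = new StringBuilder();
            for (int i = 0; i < 2; i++) {
                int type = in.readInt();
                byte[] data = new byte[in.readInt()];
                in.readFully(data); // consume exactly N bytes, whatever they contain
                seen.append(type).append(':').append(data.length).append(' ');
            }
            return seen.toString().trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```

Because the length is written before the data, the writer must know a block's size up front, which usually means buffering each structured block in memory before writing it out.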

mdma answered Oct 23 '22 19:10


Are the foos registered with a FooRegistry? You could try this approach (assume Bar and Baz also have registries to acquire the references via the id).

This is an untested sketch, but I feel the approach is a good one.

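import java.util.HashMap;
import java.util.Map;

public class Foo {

    private final long id;

    // dynamically created object: the registry picks the next free id
    public Foo(/* other stuff */) {
        //construct
        this.id = FooRegistry.register(this);
    }

    // object read from disk: keep the id stored in the file
    public Foo(long id /*, other stuff */) {
        //construct
        this.id = id;
        FooRegistry.register(this, id);
    }

    public long getId() { return id; }

}

public class FooRegistry {

    private static final Map<Long, Foo> foos = new HashMap<Long, Foo>();
    private static long currentFooCount = 0;

    static long register(Foo foo) {
        // skip ids that are already taken, e.g. by objects read from disk
        while (foos.containsKey(currentFooCount)) currentFooCount++;
        foos.put(currentFooCount, foo);
        return currentFooCount;
    }

    static void register(Foo foo, long id) {
        // invalid: another Foo already owns this id
        if (foos.containsKey(id)) throw new IllegalArgumentException("duplicate id " + id);
        foos.put(id, foo);
    }

}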

import java.io.PrintStream;

public class Bar {

    // id and foo fields as in the question, plus a getId() accessor

    void writeToStream(PrintStream out) {
        out.print("<BAR><id>" + id + "</id><foo>" + foo.getId() + "</foo></BAR>");
    }

}

public class Baz {

    // id and barList fields as in the question

    void writeToStream(PrintStream out) {
        out.print("<BAZ><id>" + id + "</id>");
        for (Bar bar : barList) out.println("<bar>" + bar.getId() + "</bar>");
        out.print("</BAZ>");
    }

}

corsiKa answered Oct 23 '22 18:10


I feel like I'm reinventing the wheel and there must be a well-established technique for handling all the cases.

Yes, it looks like default object serialization would do; anything beyond that is probably premature optimization.

You can change the format of the serialized data ( like the XMLEncoder does ) for a more convenient one.

But if you insist, I think a singleton with a dynamic counter should do. Just don't expose the id in the constructor's public interface:

class Foo {
    private final int id;
    public Foo( int id, /*other*/ ) { // drop the int id
    }
 }

So the class could be a "sequence", and a long would probably be more appropriate to avoid problems with overflowing Integer.MAX_VALUE.

Using an AtomicLong as described in the java.util.concurrent.atomic package ( to avoid having two threads assign the same id, or to avoid excessive synchronization ) would help too.

import java.util.concurrent.atomic.AtomicLong;

class Sequencer {
    private static final AtomicLong sequenceNumber = new AtomicLong(0);
    public static long next() {
         return sequenceNumber.getAndIncrement();
    }
}

Now in each class you have

 class Foo {
      private final long id;
      public Foo( String name, String data /*, etc */ ) {
          this.id = Sequencer.next();
      }
 }

And that's it.

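( note: default Java deserialization does not invoke the constructor of a Serializable class, so an id assigned this way is restored from the stream rather than regenerated; a custom reader that calls the constructor would hand out a fresh id instead )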
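
The Sequencer covers only dynamically created objects. One possible way to extend the same counter to the question's other two cases is sketched below. `IdAllocator`, `adopt`, and the reserved range of 1000 ids for singletons are assumptions made for illustration, not an established convention.

```java
import java.util.concurrent.atomic.AtomicLong;

final class IdAllocator {
    // Assumption: ids below this value are reserved for singletons,
    // which use fixed, hard-coded ids that never change between runs.
    static final long FIRST_DYNAMIC_ID = 1000;

    private final AtomicLong next = new AtomicLong(FIRST_DYNAMIC_ID);

    // Dynamically created object: hand out a fresh id.
    long allocate() {
        return next.getAndIncrement();
    }

    // Object read from disk: keep the stored id, and move the counter
    // past it so that later allocations cannot collide with it.
    long adopt(long storedId) {
        next.updateAndGet(current -> Math.max(current, storedId + 1));
        return storedId;
    }

    public static void main(String[] args) {
        IdAllocator ids = new IdAllocator();
        System.out.println(ids.allocate());  // first dynamic id
        System.out.println(ids.adopt(5000)); // id read from a file
        System.out.println(ids.allocate());  // continues after the adopted id
    }
}
```

With this arrangement a singleton's id is a compile-time constant, so it means the same thing in every file, while ids read from disk and freshly allocated ids share one counter that only moves forward.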

OscarRyz answered Oct 23 '22 17:10