
How to serialize/deserialize large list of items with protobuf-net

I have a list of about 500 million items. I can serialize this into a file with protobuf-net if I serialize individual items rather than a list. I cannot collect the items into a List<Price> and then serialize it, because I run out of memory. So I have to serialize one record at a time:

using (var input = File.OpenText("..."))
using (var output = new FileStream("...", FileMode.Create, FileAccess.Write))
{
    string line = "";
    while ((line = input.ReadLine()) != null)
    {
        Price price = new Price();
        // (code that parses input into a Price record)

        Serializer.Serialize(output, price);
    }
}

My question is about the deserialization part. It appears that the Deserialize method does not advance the Position of the stream to the next record. I tried:

using (var input = new FileStream("...", FileMode.Open, FileAccess.Read))
{
    Price price = null;
    while ((price = Serializer.Deserialize<Price>(input)) != null)
    {
    }
}

I see one real-looking Price record, and then the rest are empty records -- I get the Price object back but all fields are initialized to default values.

How do I properly deserialize a stream that contains a sequence of objects that were not serialized as a list?

user1044169 asked Nov 25 '11 17:11


3 Answers

The API has apparently changed since Marc's answer.
It seems there is no SerializeItems method any more.

Here's some more up-to-date info that should help:

ProtoBuf.Serializer.Serialize(stream, items);

can take an IEnumerable, as seen above, and that does the job when it comes to serialization.
However, there is a DeserializeItems(...) method, and the devil is in the details :)
If you serialize an IEnumerable as above, you need to call DeserializeItems with PrefixStyle.Base128 and 1 as the fieldNumber, because apparently those are the defaults.
Here's an example:

ProtoBuf.Serializer.DeserializeItems<T>(stream, ProtoBuf.PrefixStyle.Base128, 1);

Also, as pointed out by Marc and Vic, you can serialize/deserialize on a per-item basis like this (passing PrefixStyle and fieldNumber explicitly):

ProtoBuf.Serializer.SerializeWithLengthPrefix(stream, item, ProtoBuf.PrefixStyle.Base128, fieldNumber: 1);

and

T item;
while ((item = ProtoBuf.Serializer.DeserializeWithLengthPrefix<T>(stream, ProtoBuf.PrefixStyle.Base128, fieldNumber: 1)) != null)
{
    // do stuff here
}
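
To put the two per-item calls together, here is a minimal round-trip sketch. The Price class and its Symbol/Value members are hypothetical (the question does not show the real type), and a MemoryStream stands in for the file, so treat it as an illustration rather than the asker's actual code:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

// Hypothetical record type; the real Price class is not shown in the question.
[ProtoContract]
public class Price
{
    [ProtoMember(1)] public string Symbol { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

public static class LengthPrefixRoundTrip
{
    // Writes every item as its own length-prefixed message, then reads them back one by one.
    public static List<Price> RoundTrip(IEnumerable<Price> items)
    {
        using (var stream = new MemoryStream())
        {
            foreach (var item in items)
            {
                Serializer.SerializeWithLengthPrefix(stream, item, PrefixStyle.Base128, 1);
            }

            stream.Position = 0; // rewind before reading

            var result = new List<Price>();
            Price price;
            // DeserializeWithLengthPrefix returns null once the end of the stream is reached.
            while ((price = Serializer.DeserializeWithLengthPrefix<Price>(stream, PrefixStyle.Base128, 1)) != null)
            {
                result.Add(price);
            }
            return result;
        }
    }
}
```

With a real file you would not collect the results into a list (that brings the memory problem back); process each item inside the loop instead.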

Piotr Owsiak answered Nov 16 '22 23:11


Maybe I am too late on this, but just to add to what Marc already said.

Because you use Serializer.Serialize(output, price); protobuf treats the consecutive messages as parts of one single object. So when you deserialize using

while ((price = Serializer.Deserialize<Price>(input)) != null)

the first call reads all the records at once and merges them into one object. Hence you will see only the values of the last Price record.

To do what you want to do, change the serialization code to:

Serializer.SerializeWithLengthPrefix(output, price, PrefixStyle.Base128, 1);

and

while ((price = Serializer.DeserializeWithLengthPrefix<Price>(input, PrefixStyle.Base128, 1)) != null)

Vic answered Nov 16 '22 21:11


Good news! The protobuf-net API is set up for exactly this scenario. You should see a SerializeItems and DeserializeItems pair of methods that work with IEnumerable<T>, allowing streaming both in and out. The easiest way to feed it an enumerable is via an "iterator block" over the source data.

If, for whatever reason, that isn't convenient, it is 100% identical to using SerializeWithLengthPrefix and DeserializeWithLengthPrefix on a per-item basis, specifying (as parameters) field: 1 and prefix-style: base-128. You could even use SerializeWithLengthPrefix for the writing, and DeserializeItems for the reading (as long as you use field 1 and base-128).
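
A rough sketch of that mixed approach: an iterator block over the text source, SerializeWithLengthPrefix for writing, and DeserializeItems for reading. The Price members and the "SYMBOL,VALUE" line format are assumptions made up for illustration, since the question does not show either:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using ProtoBuf;

// Hypothetical record type; the real Price class is not shown in the question.
[ProtoContract]
public class Price
{
    [ProtoMember(1)] public string Symbol { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

public static class PriceStreaming
{
    // Iterator block: yields one Price per input line, so only one record is in memory at a time.
    // The "SYMBOL,VALUE" line format is an assumption for this sketch.
    public static IEnumerable<Price> ReadPrices(TextReader input)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            var parts = line.Split(',');
            yield return new Price
            {
                Symbol = parts[0],
                Value = double.Parse(parts[1], CultureInfo.InvariantCulture)
            };
        }
    }

    // Writes each item as a separate length-prefixed message (field 1, base-128).
    public static void Write(IEnumerable<Price> prices, Stream output)
    {
        foreach (var price in prices)
        {
            Serializer.SerializeWithLengthPrefix(output, price, PrefixStyle.Base128, 1);
        }
    }

    // Lazy read: items are deserialized as the caller enumerates,
    // so the stream must stay open until enumeration has finished.
    public static IEnumerable<Price> Read(Stream input)
    {
        return Serializer.DeserializeItems<Price>(input, PrefixStyle.Base128, 1);
    }
}
```

For the original scenario you would pass File.OpenText(...) to ReadPrices, a FileStream to Write, and then foreach over Read(...) on the resulting file.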

Re the example: I'd have to see that in a fully reproducible scenario to comment. Actually, what I would expect there is that you only get a single object back out, containing the combined values from each object, because without the length prefix, the protobuf spec assumes you are just concatenating values to a single object. The two approaches mentioned above avoid this issue.
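
A small sketch of that concatenation behaviour, using a hypothetical Price type (the real one is not shown in the question). As far as I know, scalar fields follow the protobuf "last value wins" merge rule, and deserializing from an exhausted stream yields an object with default values rather than null, which would explain the empty records the question describes:

```csharp
using System.IO;
using ProtoBuf;

// Hypothetical record type; the real Price class is not shown in the question.
[ProtoContract]
public class Price
{
    [ProtoMember(1)] public string Symbol { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

public static class ConcatenationDemo
{
    // Returns the results of two consecutive Deserialize calls on a stream
    // holding two messages written WITHOUT a length prefix.
    public static Price[] Run()
    {
        using (var stream = new MemoryStream())
        {
            Serializer.Serialize(stream, new Price { Symbol = "A", Value = 1.0 });
            Serializer.Serialize(stream, new Price { Symbol = "B", Value = 2.0 });
            stream.Position = 0;

            // Reads to the end of the stream: both messages are treated as one object,
            // with later scalar values overwriting earlier ones.
            var first = Serializer.Deserialize<Price>(stream);

            // The stream is now exhausted. Zero bytes is still a valid (empty) message,
            // so this is expected to be a Price with default field values, not null.
            var second = Serializer.Deserialize<Price>(stream);

            return new[] { first, second };
        }
    }
}
```

This is also why a `!= null` loop over plain Deserialize never terminates on its own: there is no record boundary in the data for it to stop at.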

Marc Gravell answered Nov 16 '22 22:11