Deserialize an Avro file with C#

I can't find a way to deserialize an Apache Avro file with C#. The file was generated by the Archive feature in Microsoft Azure Event Hubs.

With Java I can use Avro Tools from Apache to convert the file to JSON:

java -jar avro-tools-1.8.1.jar tojson --pretty inputfile > output.json

Using the NuGet package Microsoft.Hadoop.Avro, I can extract SequenceNumber, Offset and EnqueuedTimeUtc, but because I don't know which type to use for Body, an exception is thrown. I've tried Dictionary<string, object> and other types.

static void Main(string[] args)
{
    var fileName = "...";

    using (Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        using (var reader = AvroContainer.CreateReader<EventData>(stream))
        {
            using (var streamReader = new SequentialReader<EventData>(reader))
            {
                var record = streamReader.Objects.FirstOrDefault();
            }
        }
    }
}

[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
public class EventData
{
    [DataMember(Name = "SequenceNumber")]
    public long SequenceNumber { get; set; }

    [DataMember(Name = "Offset")]
    public string Offset { get; set; }

    [DataMember(Name = "EnqueuedTimeUtc")]
    public string EnqueuedTimeUtc { get; set; }

    [DataMember(Name = "Body")]
    public foo Body { get; set; } // "foo" is a placeholder: which type belongs here?

    // More properties...
}

The schema looks like this:

{
  "type": "record",
  "name": "EventData",
  "namespace": "Microsoft.ServiceBus.Messaging",
  "fields": [
    {
      "name": "SequenceNumber",
      "type": "long"
    },
    {
      "name": "Offset",
      "type": "string"
    },
    {
      "name": "EnqueuedTimeUtc",
      "type": "string"
    },
    {
      "name": "SystemProperties",
      "type": {
        "type": "map",
        "values": [ "long", "double", "string", "bytes" ]
      }
    },
    {
      "name": "Properties",
      "type": {
        "type": "map",
        "values": [ "long", "double", "string", "bytes" ]
      }
    },
    {
      "name": "Body",
      "type": [ "null", "bytes" ]
    }
  ]
}    

Kristoffer Jälén asked Oct 04 '16 07:10


1 Answer

I was able to get full data access working using dynamic. Here's the code for accessing the raw body data, which is stored as an array of bytes. In my case, those bytes contain UTF-8-encoded JSON, but that depends on how you created the EventData instances you published to the Event Hub:

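// Requires: using System; using System.Text; using Microsoft.Hadoop.Avro.Container;
// 'stream' is the file stream opened as in the question.
using (var reader = AvroContainer.CreateGenericReader(stream))
{
    // Each MoveNext() advances to the next block of records in the container file.
    while (reader.MoveNext())
    {
        foreach (dynamic record in reader.Current.Objects)
        {
            var sequenceNumber = record.SequenceNumber;
            // The schema declares Body as ["null", "bytes"], so it can be null.
            byte[] body = record.Body;
            var bodyText = body == null ? null : Encoding.UTF8.GetString(body);
            Console.WriteLine($"{sequenceNumber}: {bodyText}");
        }
    }
}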
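
Because the schema allows Body to be null, and because the bytes are only text if the publisher sent text, it may be worth decoding defensively. The following is a minimal sketch using only the .NET base class library. The names EventBodyDecoder and TryDecode are hypothetical helpers, not part of Microsoft.Hadoop.Avro, and the sketch assumes the payload is meant to be UTF-8:

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of any Avro library).
public static class EventBodyDecoder
{
    // throwOnInvalidBytes: true makes GetString throw on malformed input
    // instead of silently substituting U+FFFD replacement characters.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true and sets 'text' when 'body' is non-null, valid UTF-8.
    // Returns false for a null body or for bytes that are not valid UTF-8.
    public static bool TryDecode(byte[] body, out string text)
    {
        text = null;
        if (body == null)
        {
            return false;
        }

        try
        {
            text = StrictUtf8.GetString(body);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The payload is binary or uses another encoding.
            return false;
        }
    }
}
```

Inside the loop above, it could be called as EventBodyDecoder.TryDecode((byte[])record.Body, out text), with string text declared beforehand, so that records with a null or binary body are skipped rather than causing an exception.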

If someone can post a statically-typed solution, I'll upvote it, but given that the biggest source of latency will almost certainly be the connection to the Event Hub Archive blobs, I wouldn't worry about parsing performance. :)

Lars Kemmann answered Sep 21 '22 16:09