Best practices for specific data types in Avro

Tags: avro

I am curious to understand the best practices for encoding two very specific types of data within Avro: Timestamps and IP Addresses.

I came across the open JIRA ticket for Timestamps (https://issues.apache.org/jira/browse/AVRO-739), but the topic seems to have been quiet for some time. So what are the best practices for encoding Timestamps in Avro, preferably for downstream use in a MapReduce, Pig, Hive, or Streaming context?

Furthermore, I would be interested to hear what other people are doing to encode IP Addresses into Avro.

telescope7 asked Feb 03 '13 14:02


1 Answer

I have some experience with encoding of types in Avro. In my case a big requirement is accessing the data through Hive.

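  • For timestamps I would recommend storing a Unix timestamp (seconds since the epoch) as a floating-point number. Use Avro's double rather than float: float is 32-bit single precision and can only resolve present-day epoch seconds to within about two minutes. A numeric Unix timestamp is supported by most other libraries and works easily with Hive, since you can cast it to timestamp.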

  • For IP Addresses I would use a string encoding. Strings stay readable when you query or debug the data, which I think makes them the best type to go for. If you have other requirements, such as keeping the data size down, a binary encoding might be better for you.
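To make the two suggestions concrete, here is a rough sketch using only the Python standard library. The record name `AccessEvent`, the field names `ts` and `ip`, and the helper functions are all made up for illustration; the schema is just one way the advice could be written down, not a canonical layout. It also shows why a 32-bit float loses precision on epoch seconds, and what the binary alternative for IP addresses might look like (4 bytes for IPv4, 16 for IPv6, which would map onto Avro's `bytes` type).

```python
import ipaddress
import json
import struct

# Hypothetical schema: record and field names are illustrative only.
SCHEMA = {
    "type": "record",
    "name": "AccessEvent",
    "fields": [
        {"name": "ts", "type": "double"},  # Unix seconds, fractions allowed
        {"name": "ip", "type": "string"},  # readable text form of the address
    ],
}


def roundtrip_float32(value):
    """Squeeze a number through 32-bit IEEE 754, the width of Avro's `float`."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def roundtrip_float64(value):
    """Squeeze a number through 64-bit IEEE 754, the width of Avro's `double`."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


def ip_to_bytes(text):
    """Binary alternative: 4 bytes for IPv4, 16 bytes for IPv6."""
    return ipaddress.ip_address(text).packed


def bytes_to_ip(raw):
    """Turn the packed form back into the readable string form."""
    return str(ipaddress.ip_address(raw))


if __name__ == "__main__":
    ts = 1359900120.0  # an epoch-seconds value from early 2013
    print(json.dumps(SCHEMA, indent=2))
    print("float32 error in seconds:", roundtrip_float32(ts) - ts)
    print("float64 error in seconds:", roundtrip_float64(ts) - ts)
    print("IPv4 packed length:", len(ip_to_bytes("192.0.2.10")))
    print("IPv6 round trip:", bytes_to_ip(ip_to_bytes("2001:db8::1")))
```

The trade-off is the one described above: the string field can be read directly in a Hive query, while the packed form is smaller but needs a conversion step before anyone can read it.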

anyman answered Nov 21 '22 00:11