
Getting Started with Avro [closed]

Tags:

mapreduce

avro

I want to get started using Avro with MapReduce. Can someone suggest a good tutorial or example to start with? I couldn't find much through an internet search.

Sri asked Mar 29 '11 23:03




3 Answers

I recently did a project that was heavily based on Avro data and, not having used this data format before, I had to start from scratch. You are right that it is rather hard to find much help online when getting started with Avro. The material I would recommend is:

  • By far the most helpful source I found was the Avro section (pp. 103-116) in Tom White's book Hadoop: The Definitive Guide, as well as his GitHub page for the code he uses in the book.
  • For additional code examples I looked at Ron Bodkin's GitHub page avro-mr-sample.
  • In my case I used Python for reading and writing Avro files and for that I used this tutorial.
  • Even though it is obvious, I will add the link to the Avro Users mailing list. There is a ton of information to be found there and after I had read the above material and implemented a bunch of code, I found myself spending hours looking through the archives.

Finally, I suggest using Avro 1.4.1 with Hadoop 0.20.2, and ONLY that combination. I had major issues getting my code to run with Hadoop 0.21 and more recent Avro versions.

diliop answered Oct 16 '22 18:10



Other links:

  • JavaDocs are sometimes needed.
  • This InfoQ article may be of some use
  • Avro Serialization

The main problem I see with the documentation (what little exists) is that it focuses on the very laborious "generic" approach, which seems odd because it combines the worst of both worlds: you must still provide a full schema for the data, but you get no benefit from static types. Automatic code generation is more convenient, but less well covered.
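
To make that trade-off concrete, here is a minimal Python sketch. It deliberately does not use the Avro library: `GenericRecord` and `User` are hypothetical stand-ins written for this illustration, the `User` schema is made up, and only the `string` and `int` types are handled. It only mimics the difference between looking fields up by name against a schema at runtime and working with a class whose fields are declared up front.

```python
import json
from dataclasses import dataclass

# A schema in Avro's JSON schema syntax (the record name and fields are made up).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""

# Python types standing in for the two primitive types used in the schema above.
PRIMITIVES = {"string": str, "int": int}


class GenericRecord:
    """Hypothetical stand-in for a 'generic' record: fields are addressed by
    string name and checked against the schema only at runtime."""

    def __init__(self, schema):
        self.schema = schema
        self._types = {f["name"]: PRIMITIVES[f["type"]] for f in schema["fields"]}
        self._values = {}

    def put(self, field, value):
        if field not in self._types:
            raise KeyError(f"no field {field!r} in schema {self.schema['name']}")
        if not isinstance(value, self._types[field]):
            raise TypeError(f"field {field!r} expects {self._types[field].__name__}")
        self._values[field] = value

    def get(self, field):
        return self._values[field]


@dataclass
class User:
    """Hypothetical stand-in for a generated class: fields are ordinary,
    declared attributes that editors and type checkers can see."""
    name: str
    age: int


schema = json.loads(SCHEMA_JSON)

# Generic style: the full schema is still required, and a misspelled field
# name is only caught when the code runs.
generic = GenericRecord(schema)
generic.put("name", "Ada")
generic.put("age", 36)

# Generated-class style: the same data, with fields known ahead of time.
specific = User(name="Ada", age=36)
```

In the generic style, a call such as `generic.put("agee", 1)` fails only at runtime with a `KeyError`, whereas a misspelled attribute on `User` can be flagged by tooling before the job ever runs.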

StaxMan answered Oct 16 '22 18:10



The Avro source code does have examples: https://github.com/apache/avro/blob/trunk/lang/java/mapred. For example, TestReflectJob helped me write a MapReduce job using my pre-defined domain objects.

leef answered Oct 16 '22 17:10
