
What does code generation mean in avro - hadoop

Tags:

hadoop

avro

Apologies if this question is silly; I am finding it difficult to understand what this really means. 'Hadoop: The Definitive Guide' says that one of Avro's main advantages is that code generation is optional. This link has a program for Avro serialization/deserialization with and without code generation. Could someone help me understand exactly what "with" and "without" code generation mean, and the context in which each applies?

Vignesh I asked May 16 '15 15:05



1 Answer

It's not a silly question -- it's actually a very important aspect of Avro.

"With code generation" usually means that you have an Avro schema available before you compile your Java application. You, as a developer, run the Avro compiler to generate a class for each record in the schema, and you use these classes in your application.

In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc (the tool also expects an output directory as its last argument), and then uses the student_marks class directly.

In this case, the generated class student_marks implements the SpecificRecord interface (by extending SpecificRecordBase) and has custom methods for accessing the data inside (such as getStudentId() to fetch the student_id field).

"Without code generation" usually means that your application isn't tied to any particular schema at compile time (for example, it can handle different kinds of data).

In this case, there's no student_marks class generated, but you can still read Avro records from an Avro container. You won't have instances of student_marks, but instances of GenericRecord. There won't be any helpful methods like getStudentId(), but you can look up a field by name or by position, with get("student_id") or get(0).
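As a rough illustration of the generic approach described above, here is a minimal sketch. It assumes the Avro library is on the classpath. The record name student_marks and the student_id field come from this answer; the marks field, the int types, and the class name GenericRecordSketch are assumptions made for the example and may not match the schema in the linked article.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericRecordSketch {
    // Assumed schema: the "marks" field and both int types are guesses for illustration.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"student_marks\",\"fields\":["
      + "{\"name\":\"student_id\",\"type\":\"int\"},"
      + "{\"name\":\"marks\",\"type\":\"int\"}]}";

    // Builds a record from a schema parsed at runtime; no generated class is involved.
    static GenericRecord buildRecord() {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord record = new GenericData.Record(schema);
        record.put("student_id", 1);
        record.put("marks", 95);
        return record;
    }

    public static void main(String[] args) {
        GenericRecord record = buildRecord();
        // Fields come back as Object; the caller has to know the name or the position.
        Object byName = record.get("student_id");
        Object byIndex = record.get(0);
        System.out.println(byName + " " + byIndex); // prints "1 1"
    }
}
```

Because the schema is just a string parsed at runtime, the same code could load a different schema from a file without being recompiled, which is the flexibility the generic API offers.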

Often, using specific records with code generation is easier to read, easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.

A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.
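The analogy can be sketched in plain Java without Avro at all. The Student class, its fields, and the helper method names below are hypothetical and exist only to show the contrast.

```java
public class PojoVersusArray {
    // Handwritten POJO: field names and types are fixed at compile time.
    static final class Student {
        private final int studentId;
        private final int marks;

        Student(int studentId, int marks) {
            this.studentId = studentId;
            this.marks = marks;
        }

        int getStudentId() { return studentId; }
        int getMarks() { return marks; }
    }

    // Typed access: the compiler checks the method name and the return type.
    static int marksFromPojo(Student student) {
        return student.getMarks();
    }

    // Positional access: the caller must know that index 1 holds the marks and cast it.
    static int marksFromArray(Object[] row) {
        return (Integer) row[1];
    }

    public static void main(String[] args) {
        Student typed = new Student(1, 95);
        Object[] untyped = {1, 95};
        System.out.println(marksFromPojo(typed));    // prints 95
        System.out.println(marksFromArray(untyped)); // prints 95
    }
}
```

A wrong index or a wrong cast in marksFromArray only fails at runtime, whereas a typo in getMarks() is caught by the compiler. That is roughly the trade-off between generic and specific records.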

Ryan Skraba answered Sep 28 '22 00:09