Kindly regret if this question is silly. I am finding it difficult to get what it really means.When i read 'Hadoop the definitive guide' it says that the best advantage of avro is that code generation is optional in Avro. This link has a program for avro serialization/deserialization with/without code generation. Could some one help me in understanding exactly what with/without code generation mean and the real context of the same.
Avro schema definitions are JSON records. Because it is a record, it can define multiple fields which are organized in a JSON array. Each such field identifies the field's name as well as its type. The type can be something simple, like an integer, or something complex, like another record.
Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages; one for human editing (Avro IDL) and another which is more machine-readable based on JSON.
Creating Avro Schemas type − This field comes under the document as well as the under the field named fields. In case of document, it shows the type of the document, generally a record because there are multiple fields. When it is field, the type describes data type.
Avro supports logical types. A logical type is defined as a higher level representation for a primitive type. For eg, a higher level type of UUID could be represented as a primitive type string. Similarly, a higher level java.
It's not a silly question -- it's actually a very important aspect of Avro.
With code-generation usually means that before compiling your Java application, you have an Avro schema available. You, as a developer, will use an Avro compiler to generate a class for each record in the schema and you use these classes in your application.
In the referenced link, the author does this: java -jar avro-tools-1.7.5.jar compile schema student.avsc
, and then uses the student_marks
class directly.
In this case, each instance of the class student_marks
inherits from SpecificRecord
, with custom methods for accessing the data inside (such as getStudentId()
to fetch the student_id
field).
Without code-generation usually means that your application doesn't have any specific necessary schema (for example, it can treat different kinds of data).
In this case, there's no student
class generated, but you can still read Avro records in an Avro container. You won't have instances of student
, but instances of GenericRecord
. There won't be any helpful methods like getStudentId()
, but you can use methods get("student_marks")
or get(0)
.
Often, using specific records with code generation is easier to read, easier to serialize and deserialize, but generic records offer more flexibility when the exact schema of the records you want to process isn't known at compile time.
A helpful way to think of it is the difference between storing some data in a helpful handwritten POJO structure versus an Object[]
. The former is much easier to develop with, but the latter is necessary if the types and quantity of data are dynamic or unknown.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With