What is a data serialization system?

According to the Apache Avro project, "Avro is a serialization system". Does calling it a data serialization system mean that Avro is a product or an API?

Also, I am not quite sure what a data serialization system is. For now, my understanding is that it is a protocol that defines how a data object is passed over the network. Can anyone explain it in an intuitive way, so that it is easier for people with a limited distributed computing background to understand?

Thanks in advance!

asked Mar 21 '10 10:03

Yang

1 Answer

When Doug Cutting was writing Hadoop, he decided that the standard Java method of serializing objects, Java Object Serialization (Java Serialization), didn't meet his requirements for Hadoop. Namely, these requirements were:

Serialize the data into a compact binary format.
Be fast, both in performance and in how quickly it allows data to be transferred.
Be interoperable, so that other languages can plug into Hadoop more easily.

As he described Java Serialization:

It looked big and hairy and I thought we needed something lean and mean

Instead of using Java Serialization, they wrote their own serialization framework. The main perceived problem with Java Serialization was that it writes the classname of each object being serialized to the stream, with each subsequent instance of that class containing a 5-byte reference to the first instead of the classname.

As well as reducing the effective bandwidth of the stream, this causes problems with random access and with sorting of records in a serialized stream. Hadoop serialization therefore doesn't write the classname or the required references, and assumes that the client knows the expected type.
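To make the size difference concrete, here is a minimal sketch using only the JDK. It is not Hadoop code: the Point class and the SerializationSizeDemo helper are hypothetical names made up for this illustration. It serializes the same two-int record once with Java Serialization and once by writing just the raw fields, which is roughly the idea behind leaving the classname out of the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical record type used only for this illustration.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class SerializationSizeDemo {

    // Java Object Serialization: the stream carries a header and a class
    // descriptor (class name, field names) in addition to the field values.
    static byte[] javaSerialize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        return bytes.toByteArray();
    }

    // Raw field encoding: only the two ints are written. The reader has to
    // know in advance that these bytes represent a Point.
    static byte[] rawSerialize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point(3, 4);
        System.out.println("Java Serialization: " + javaSerialize(p).length + " bytes");
        System.out.println("Raw fields:         " + rawSerialize(p).length + " bytes");
    }
}
```

The raw version is always 8 bytes (two 4-byte ints), while the Java Serialization output is several times larger for a single object because of the stream header and class metadata.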

Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop serialization, can be reused. This helps improve the performance of MapReduce, which essentially serializes and deserializes billions of records.
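The reuse pattern can be sketched without Hadoop on the classpath. In the example below, SimpleWritable is a hypothetical stand-in modelled on the two methods of Hadoop's Writable interface, write(DataOutput) and readFields(DataInput); IntPairWritable and WritableReuseDemo are likewise invented for illustration. The point is that one instance is allocated and then refilled for every record read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in modelled on the two methods of Hadoop's Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type holding two ints.
class IntPairWritable implements SimpleWritable {
    int first;
    int second;

    void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Overwrites the current state instead of creating a new object.
        first = in.readInt();
        second = in.readInt();
    }
}

class WritableReuseDemo {

    // Writes `count` records back to back with no per-record type information.
    static byte[] writeRecords(int count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        IntPairWritable record = new IntPairWritable();
        for (int i = 0; i < count; i++) {
            record.set(i, i * 2);
            record.write(out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Reads every record into the same instance and sums the second field.
    static long sumSeconds(byte[] data, int count) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        IntPairWritable record = new IntPairWritable(); // allocated once
        long sum = 0;
        for (int i = 0; i < count; i++) {
            record.readFields(in);
            sum += record.second;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writeRecords(1000);
        System.out.println(data.length + " bytes, sum = " + sumSeconds(data, 1000));
    }
}
```

With Java Serialization the read loop would allocate one object per record; here the loop allocates nothing, which matters when the record count runs into the billions.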

Avro fits into Hadoop in that it approaches serialization in a different manner. The client and server exchange a schema which describes the data stream. This helps make it fast and compact and, importantly, makes it easier to mix languages together.

So Avro defines a serialization format, a protocol for clients and servers to communicate these serial streams, and a way to compactly persist data in files.

I hope this helps. I thought a bit of Hadoop history would help explain why Avro is a subproject of Hadoop and what it's meant to help with.
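To give a feel for the schema idea, here is a toy sketch in plain Java. It is neither Avro's API nor its actual wire format; ToySchemaCodec, Field and Type are hypothetical names invented for this example. It only shows the principle: when both sides share a schema, the stream can contain bare values with no field names or class names, and the reader uses the schema to make sense of the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: not Avro's API and not Avro's binary encoding.
class ToySchemaCodec {

    enum Type { INT, STRING }

    // One schema entry: a field name plus its type.
    static final class Field {
        final String name;
        final Type type;

        Field(String name, Type type) {
            this.name = name;
            this.type = type;
        }
    }

    // Writes only the values, in schema order. No names or types go on the wire.
    static byte[] encode(List<Field> schema, Map<String, Object> record) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (Field f : schema) {
            Object value = record.get(f.name);
            switch (f.type) {
                case INT:
                    out.writeInt((Integer) value);
                    break;
                case STRING:
                    out.writeUTF((String) value);
                    break;
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Uses the same schema to decide how to interpret the bytes.
    static Map<String, Object> decode(List<Field> schema, byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, Object> record = new LinkedHashMap<>();
        for (Field f : schema) {
            switch (f.type) {
                case INT:
                    record.put(f.name, in.readInt());
                    break;
                case STRING:
                    record.put(f.name, in.readUTF());
                    break;
            }
        }
        return record;
    }

    public static void main(String[] args) throws IOException {
        List<Field> schema = Arrays.asList(
                new Field("id", Type.INT),
                new Field("name", Type.STRING));

        Map<String, Object> user = new LinkedHashMap<>();
        user.put("id", 7);
        user.put("name", "Ada");

        byte[] data = encode(schema, user);
        System.out.println(data.length + " bytes -> " + decode(schema, data));
    }
}
```

Because the schema is just data (field names and types), a reader written in any language could decode the same bytes, which is the interoperability point made above. In real Avro the schema is written in JSON and the value encoding is more compact than this sketch.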

answered Sep 23 '22 18:09

Binary Nerd

Tags: distributed-computing, hadoop, data-serialization
