
How does Hive store data, and what is a SerDe?

Tags: hadoop, hive

When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (see “Importing Data” on page 441), the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.

1. Is SerDe a library?
2. How does Hive store data: in files or in tables?
3. Can anyone explain the sentences quoted above clearly? I'm new to Hive.


asked Jan 30 '13 13:01

pramav

2 Answers

Answers

1. Yes. SerDe is a library interface that is built into Hive (its classes live in the org.apache.hadoop.hive.serde2 package), and Hive ships with several implementations.
2. Hive stores data as files in a file system such as HDFS, or in other storage that Hadoop can access (FTP, for example). A table, with its rows and columns, is a logical structure that Hive applies to those files.
3. SerDe stands for Serializer/Deserializer. It tells Hive how to process a record (a row). Hive can also process semi-structured records (XML, email, etc.) and unstructured records (audio, video, etc.). For example, suppose you have 1000 GB of RSS feeds (RSS XML files). You can ingest them into a location in HDFS, but you would need to write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.

For more information on how to write a SerDe, read this post.
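
To make the "files versus tables" point concrete, here is a small Python sketch. It is not Hive code and does not use any Hive API; the schema, the file names, and the `scan_table` helper are made up for illustration. The only detail taken from Hive is the Ctrl-A (`\x01`) field delimiter, which is the default for its plain text format. The idea it shows: the data is just files in a directory, and rows with typed columns only appear when each line is deserialized against a schema.

```python
import tempfile
from pathlib import Path

# Hypothetical table definition: column names and Python types stand in for a Hive schema.
SCHEMA = [("id", int), ("title", str)]
FIELD_DELIM = "\x01"  # Ctrl-A, the default field delimiter of Hive's text format


def deserialize(line, schema=SCHEMA):
    """Turn one line of file text into a row (a dict), as a SerDe does on read."""
    values = line.rstrip("\n").split(FIELD_DELIM)
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


def scan_table(table_dir):
    """A 'table' here is simply every file in one directory, read line by line."""
    for path in sorted(Path(table_dir).iterdir()):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                yield deserialize(line)


def demo():
    with tempfile.TemporaryDirectory() as table_dir:
        # Two data files, as two separate loads might leave behind.
        Path(table_dir, "part-00000").write_text("1\x01First post\n", encoding="utf-8")
        Path(table_dir, "part-00001").write_text("2\x01Second post\n", encoding="utf-8")
        return list(scan_table(table_dir))


rows = demo()
```

Swapping in a different `deserialize` function (for XML, say) would let the same files-in-a-directory layout hold a different format, which is roughly what choosing a different SerDe does for a Hive table.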


answered Oct 19 '22 05:10

shazin

In this respect, you can think of Hive as a kind of database engine. The engine works on tables, which are built from records.
When we let Hive (like any other database) work in its own internal formats, we do not need to care how the data is laid out.
When we want Hive to process our own files as tables (external tables), we have to tell it how to translate the data in those files into records. This is exactly the role of a SerDe. You can see it as a plug-in that enables Hive to read and write your data.
For example, say you want to work with CSV. Here is an example CSV SerDe: https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
The deserialize method reads the data and chops it into fields, assuming it is CSV.
The serialize method takes a record and formats it as CSV.
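
The two directions can be sketched in a few lines of Python using the standard `csv` module. This is only a conceptual model, not the linked Java class and not Hive's SerDe interface; the function names and the sample line are assumptions chosen for illustration. Here `deserialize` is the read path (bytes from the file become fields) and `serialize` is the write path (fields become bytes for the output file).

```python
import csv
import io


def deserialize(raw: bytes) -> list:
    """Read path: one line of CSV bytes from a file -> list of field values."""
    text = raw.decode("utf-8")
    return next(csv.reader([text]))


def serialize(fields: list) -> bytes:
    """Write path: list of field values -> one line of CSV bytes for the output file."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().encode("utf-8")


# Round trip: the quoted comma survives because both sides agree on the CSV rules.
record = deserialize(b'42,"Hive, explained",hadoop\n')
line = serialize(record)
```

A naive `split(",")` would break the second field in two, which is the kind of format-specific detail a SerDe exists to hide from the query engine.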

answered Oct 19 '22 06:10

David Gruzman
