How does Hive store data and what is SerDe?

Tags:

hadoop

hive

When querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When performing an INSERT or CTAS (see “Importing Data” on page 441), the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.

  1. Is SerDe a library?
  2. How does Hive store data, i.e. does it store it in files or in tables?
  3. Can anyone please explain the bold sentences clearly? I'm new to Hive!
pramav asked Jan 30 '13 13:01




2 Answers


  1. Yes. SerDe is a library component of Hive: a Java interface (in the org.apache.hadoop.hive.serde2 package) with several built-in implementations, and you can plug in your own.
  2. Hive stores data as files in a file system such as HDFS, or in any other storage it can reach (for example FTP). A table, with its rows and columns, is the logical view Hive presents over those files.
  3. A SerDe (Serializer/Deserializer) tells Hive how to process a record (row). This lets Hive process semi-structured records (XML, email, etc.) and even unstructured ones (audio, video, etc.). For example, if you have 1000 GB worth of RSS feeds (RSS XML), you can ingest them into a location in HDFS. You would then need to write a custom SerDe based on your XML structure so that Hive knows how to load the XML files into Hive tables, or the other way around.
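To make point 2 concrete, here is a minimal sketch in plain Python (not Hive code) of the idea that a table is a directory of files and that rows only appear once something parses those files. The table name `employees`, the file names, and the `scan_table` helper are hypothetical; the `\x01` (Ctrl-A) separator is the default field delimiter of Hive text tables.

```python
import tempfile
from pathlib import Path

FIELD_DELIM = "\x01"  # Ctrl-A, the default field delimiter of Hive text tables


def scan_table(table_dir: Path):
    """Yield every row stored under a table directory, file by file."""
    for data_file in sorted(table_dir.iterdir()):
        for line in data_file.read_text().splitlines():
            # Splitting the raw line into fields is the "deserialize" step.
            yield line.split(FIELD_DELIM)


with tempfile.TemporaryDirectory() as tmp:
    table_dir = Path(tmp) / "employees"  # hypothetical table name
    table_dir.mkdir()
    # Two data files; together they make up the contents of one table.
    (table_dir / "part-00000").write_text("1\x01alice\n2\x01bob\n")
    (table_dir / "part-00001").write_text("3\x01carol\n")
    rows = list(scan_table(table_dir))

print(rows)
```

The directory holds nothing but bytes; the "table" exists only because the reader knows how to cut each line into columns, which is the job a SerDe does inside Hive.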

For more information on how to write a SerDe read this post

shazin answered Oct 19 '22 05:10



In this respect we can see Hive as a kind of database engine. This engine works on tables, which are built from records.
When we let Hive (like any other database) work in its own internal formats, we do not need to care how the data is laid out.
When we want Hive to process our own files as tables (external tables), we have to tell it how to translate the data in the files into records. This is exactly the role of a SerDe. You can see it as a plug-in that enables Hive to read and write your data.
For example, say you want to work with CSV. Here is an example of a CSV SerDe: https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java
The deserialize method reads the data and chops it into fields, assuming it is CSV.
The serialize method takes a record and formats it as CSV.
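The two directions can be sketched in a few lines of Python. This is only an illustration of the idea, not the Hive SerDe API and not the code of the linked CSVSerde class; the function names `deserialize` and `serialize` are borrowed from the description above, and the sample line is made up.

```python
import csv
import io


def deserialize(line: bytes) -> list:
    """File bytes -> row: parse one CSV line into a list of field values."""
    text = line.decode("utf-8")
    return next(csv.reader([text]))


def serialize(row: list) -> bytes:
    """Row -> file bytes: format the field values as one CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().encode("utf-8")


line = b'1,"Smith, John",42\n'
row = deserialize(line)   # what happens on SELECT
print(row)
print(serialize(row))     # what happens on INSERT or CTAS
```

Note that the quoted comma in the second field survives the round trip, which is the kind of format detail a SerDe hides from the rest of the engine.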

David Gruzman answered Oct 19 '22 06:10
