 

Flume NG and HDFS

Tags:

hadoop

hdfs

flume

I am very new to Hadoop, so please excuse the dumb questions.

My understanding so far is that the best use case for Hadoop is large files, which makes MapReduce jobs run more efficiently.

Keeping the above in mind, I am somewhat confused about Flume NG. Assume I am tailing a log file and new lines are produced every second; the moment the log gets a new line, it will be transferred to HDFS via Flume.

a) Does this mean that Flume creates a new file in HDFS for every line logged in the file I am tailing, or does it append to an existing HDFS file?

b) Is append allowed in HDFS in the first place?

c) If the answer to b) is yes, i.e. contents are appended constantly, how and when should I run my MapReduce application?

The above questions may sound very silly, but answers to them would be highly appreciated.

PS: I have not set up Flume NG or Hadoop yet; I am just reading articles to get an understanding of how they could add value to my company.

asked Jul 18 '13 13:07 by user1103472


People also ask

What is Flume Ng?

Flume NG is a refactoring of Flume and was originally tracked in FLUME-728. From the JIRA's description: To solve certain known issues and limitations, Flume requires a refactoring of some core classes and systems.

What is Flume used for in Hadoop?

Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.

What is difference between Flume and Sqoop?

Apache Sqoop is used in Hadoop to fetch structured data from RDBMS systems like Teradata, Oracle, MySQL, MSSQL, and PostgreSQL, whereas Apache Flume is used to fetch data stored on various sources, such as the log files on a web server or an application server.

Why do we use Flume?

Flume is an open-source distributed data collection service used for transferring data from a source to a destination. It is a reliable and highly available service for collecting, aggregating, and transferring huge amounts of logs into HDFS. It has a simple and flexible architecture.


1 Answer

Flume writes to HDFS by means of the HDFS sink. When Flume starts and begins to receive events, the sink opens a new file and writes events into it. At some point the previously opened file has to be closed, and until then the data in the current block being written is not visible to other readers.

As described in the documentation, the Flume HDFS sink has several file-closing strategies (a configuration sketch follows the list):

  • every N seconds (specified by the rollInterval option)
  • after writing N bytes (rollSize option)
  • after writing N received events (rollCount option)
  • after N seconds of inactivity (idleTimeout option)
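
For illustration, here is a minimal sketch of an agent configuration that tails a log file with an exec source and writes to HDFS with these roll settings. The agent, source, channel, and sink names, the log path, and the HDFS URL are assumptions made for the example, not values from the question:

  # Hypothetical agent "agent1": exec source tailing a log, memory channel, HDFS sink
  agent1.sources = tailSrc
  agent1.channels = memCh
  agent1.sinks = hdfsSink

  agent1.sources.tailSrc.type = exec
  agent1.sources.tailSrc.command = tail -F /var/log/app.log
  agent1.sources.tailSrc.channels = memCh

  agent1.channels.memCh.type = memory
  agent1.channels.memCh.capacity = 10000

  agent1.sinks.hdfsSink.type = hdfs
  agent1.sinks.hdfsSink.channel = memCh
  agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/logs
  # Roll a new file every 10 minutes, at ~128 MB, or after 100000 events,
  # whichever comes first; setting an option to 0 disables that trigger.
  agent1.sinks.hdfsSink.hdfs.rollInterval = 600
  agent1.sinks.hdfsSink.hdfs.rollSize = 134217728
  agent1.sinks.hdfsSink.hdfs.rollCount = 100000
  # Close the file after 5 minutes with no new events.
  agent1.sinks.hdfsSink.hdfs.idleTimeout = 300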

So, to your questions:

a) Flume writes events to the currently open file until it is closed (and a new file is opened).

b) Append is allowed in HDFS, but Flume does not use it: once a file is closed, Flume never appends any data to it.

c) To hide the currently open file from a MapReduce application, use the inUsePrefix option: files whose names start with . are not visible to MR jobs.
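
For example, continuing the hypothetical agent1 configuration sketched above:

  # Prefix in-progress files with a dot; FileInputFormat's default filter skips
  # files whose names start with "." or "_", and the prefix is dropped when the
  # file is closed and rolled.
  agent1.sinks.hdfsSink.hdfs.inUsePrefix = .

Note that by default Flume only marks in-progress files with a .tmp suffix, which MapReduce does not skip, so the prefix has to be set explicitly.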

answered Sep 19 '22 00:09 by Dmitry