 

Write CSV files to HDFS using Flume

Tags: hdfs, flume

I'm writing a number of CSV files from my local file system to HDFS using Flume.

I want to know the best configuration for the Flume HDFS sink so that each file on the local system is copied to HDFS exactly as-is, as CSV. I want each CSV file processed by Flume to be a single event, flushed and written as a single file. As far as possible, I want the resulting file to be identical to the original, without added headers and the like.

What values do I need to set for these properties to get the behavior I want?

hdfs.batchSize = x
hdfs.rollSize = x
hdfs.rollInterval = x
hdfs.rollCount = x

Please also let me know if there are other Flume agent configuration variables I need to change.

If this cannot be done with the existing configuration options, do I need to write a custom sink to achieve what I want?

Thanks for your input.

P.S. I know hadoop fs -put or -copyFromLocal would be better suited for this job, but this is a proof of concept (showing that we can use Flume for data ingestion), which is why I need to use Flume.

asked by oikonomiyaki


1 Answer

You will have to disable all of the roll* properties by setting their values to 0. That will effectively prevent Flume from rolling over files. As you may have noticed, Flume operates on a per-event basis; in most cases an event is a single line in a file. To also preserve the file structure itself, you will need to use the spooling directory source and activate fileHeader:

fileHeader  false   Whether to add a header storing the absolute path filename.

Set that to true. It will provide a %{file} header which you can reference in your HDFS sink path specification.
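To make this concrete, here is a minimal sketch of what such an agent definition might look like. The agent name (agent1), the spool directory, the NameNode URL, and the channel and batch sizes are illustrative placeholders, not values taken from the question:

# Minimal sketch; directory, host and capacity values are placeholders.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1

# Spooling directory source: each completed file dropped into spoolDir is
# ingested; fileHeader = true attaches the absolute source path to every
# event under the header key "file".
agent1.sources.src1.type       = spooldir
agent1.sources.src1.spoolDir   = /var/flume/spool
agent1.sources.src1.fileHeader = true
agent1.sources.src1.channels   = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# HDFS sink: plain text output, all roll triggers disabled, and the %{file}
# header used so events from different source files land in different paths.
agent1.sinks.snk1.type              = hdfs
agent1.sinks.snk1.channel           = ch1
agent1.sinks.snk1.hdfs.path         = hdfs://namenode:8020/ingest/%{file}
agent1.sinks.snk1.hdfs.fileType     = DataStream
agent1.sinks.snk1.hdfs.writeFormat  = Text
agent1.sinks.snk1.hdfs.rollSize     = 0
agent1.sinks.snk1.hdfs.rollInterval = 0
agent1.sinks.snk1.hdfs.rollCount    = 0
agent1.sinks.snk1.hdfs.batchSize    = 100
# With rolling disabled nothing would ever close the output file; an idle
# timeout (seconds without new events) is one way to get it finalized.
agent1.sinks.snk1.hdfs.idleTimeout  = 60

Note that %{file} carries the absolute path of the source file, so the resulting HDFS layout mirrors the local directory structure under /ingest; adjust the path pattern (or use the source's basenameHeader option) if you only want the file name.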

answered by Erik Schmiegelow