I'm writing a number of CSV files from my local file system to HDFS using Flume.
I want to know the best configuration for the Flume HDFS sink so that each file on the local system is copied to HDFS exactly as a CSV. I want each CSV file processed by Flume to be a single event, flushed and written as a single file. As much as possible, I want each file to be an exact copy, without any added header information and so on.
What values do I need to set for these properties to get the behavior I want?
hdfs.batchSize = x
hdfs.rollSize = x
hdfs.rollInterval = x
hdfs.rollCount = x
Please also point out any other Flume agent configuration properties I need to change.
If this is not possible with the existing configuration options, do I need to write a custom sink to achieve what I want?
Thanks for your input.
P.S. I know hadoop fs -put or -copyFromLocal would be better suited for this job, but since this is a proof of concept (showing that we can use Flume for data ingestion), I need to use Flume.
You will have to disable all roll* properties by setting their values to 0; that effectively prevents Flume from rolling over files. As you may have noticed, Flume operates on a per-event basis, and in most cases an event is a single line in a file. To also preserve the file structure itself, you will need to use the spooling directory source and enable fileHeader:
fileHeader    false    Whether to add a header storing the absolute path filename.
Set fileHeader to true. The source will then add a %{file} header to every event, which you can reference in your HDFS sink path specification.
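For reference, here is a minimal sketch of an agent configuration along these lines. The agent name (a1), component names, spool directory, and HDFS path are assumptions; adjust them to your environment. %{file} expands to the absolute path of the source file, and hdfs.idleTimeout closes an HDFS file once events stop arriving for it.

# Source: spooling directory, with fileHeader enabled so each event
# carries the absolute path of the file it came from
a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/csv-in
a1.sources.src1.fileHeader = true
a1.sources.src1.channels = ch1

# Channel: a file channel for durability (a memory channel is fine for a PoC)
a1.channels.ch1.type = file

# Sink: HDFS, with all roll* properties set to 0 so files are never rolled,
# writing plain text and keying the path on the %{file} header
a1.sinks.snk1.type = hdfs
a1.sinks.snk1.channel = ch1
a1.sinks.snk1.hdfs.path = hdfs://namenode:8020/flume/csv/%{file}
a1.sinks.snk1.hdfs.fileType = DataStream
a1.sinks.snk1.hdfs.writeFormat = Text
a1.sinks.snk1.hdfs.rollSize = 0
a1.sinks.snk1.hdfs.rollInterval = 0
a1.sinks.snk1.hdfs.rollCount = 0
a1.sinks.snk1.hdfs.batchSize = 100
a1.sinks.snk1.hdfs.idleTimeout = 60

Note that the HDFS sink still names the files it creates using hdfs.filePrefix (FlumeData by default), so the %{file} reference here determines the output directory rather than the file name.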