I am doing a POC on ways to import data from a shared network drive to HDFS. The data would be in different folders on the shared drive, and each folder would correspond to a different directory on HDFS. I looked at some popular tools that do this, but most of them are geared toward moving small pieces of data rather than whole files. These are the tools I found; are there any others?
Apache Flume: If there are only a handful of production servers producing data and the data does not need to be written out in real time, then it might also make sense to just move the data to HDFS via WebHDFS or NFS, especially if the amount of data being written out is relatively small - a few files of a few GB every few hours will not hurt HDFS. In this case, planning, configuring and deploying Flume may not be worth it. Flume is really meant to push events in real time, where the stream of data is continuous and its volume reasonably large. [Flume book from Safari Online and the Flume cookbook]
Apache Kafka: Producer-consumer model. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Amazon Kinesis: Paid service for real-time data, similar to Flume.
WebHDFS: Submit an HTTP PUT request without automatically following redirects and without sending the file data, then submit a second HTTP PUT request using the URL in the Location header, this time with the file data to be written (see the sketch after this list). [http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE]
Open Source Projects: https://github.com/alexholmes/hdfs-file-slurper
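To make the WebHDFS option concrete, here is a minimal sketch of the two-step CREATE flow using Python's requests library. The NameNode host/port, HDFS path and local file name are placeholders (assumptions, not from the original post) that you would replace with your own values.

```python
import requests

# Assumptions: WebHDFS is enabled on the NameNode (default port 50070 in Hadoop 1.x)
# and the target HDFS directory exists; host, paths and file name are placeholders.
NAMENODE = "http://namenode.example.com:50070"
HDFS_PATH = "/data/folder_a/report.csv"
LOCAL_FILE = "/mnt/shared/folder_a/report.csv"

# Step 1: PUT to the NameNode without following the redirect and without file data.
# The NameNode answers with a 307 whose Location header points at a DataNode.
resp = requests.put(
    f"{NAMENODE}/webhdfs/v1{HDFS_PATH}",
    params={"op": "CREATE", "overwrite": "true"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# Step 2: PUT the actual file bytes to the DataNode URL from the Location header.
with open(LOCAL_FILE, "rb") as f:
    resp = requests.put(datanode_url, data=f)
resp.raise_for_status()  # expect 201 Created
```

In principle, each folder on the share could be walked and mapped to its HDFS directory by repeating these same two calls per file, which is roughly what the folder-to-directory requirement above would need.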
My requirements are simple:
Give Flume a try with a spooling directory source. You didn't mention your data volume or velocity, but I did a similar POC from a local Linux filesystem to a Kerberized HDFS cluster with good results, using a single Flume agent running on an edge node. A minimal agent configuration is sketched below.
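The sketch below shows the general shape of such a single-agent setup: a spooldir source watching a directory on the mounted share, a file channel, and an HDFS sink. The agent name, directories and HDFS URL are placeholders to adapt, not values from the original POC.

```
# agent.conf - hypothetical layout: spooldir source -> file channel -> HDFS sink
agent1.sources  = spool-src
agent1.channels = disk-ch
agent1.sinks    = hdfs-sink

# Watch a directory on the mounted share; Flume renames finished files to *.COMPLETED
agent1.sources.spool-src.type     = spooldir
agent1.sources.spool-src.spoolDir = /mnt/shared/folder_a
agent1.sources.spool-src.channels = disk-ch

# File channel buffers events on local disk between source and sink
agent1.channels.disk-ch.type          = file
agent1.channels.disk-ch.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.disk-ch.dataDirs      = /var/lib/flume/data

# Write events into the matching HDFS directory as plain text
agent1.sinks.hdfs-sink.type          = hdfs
agent1.sinks.hdfs-sink.hdfs.path     = hdfs://namenode:8020/data/folder_a
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel       = disk-ch
```

Started with `flume-ng agent --conf-file agent.conf --name agent1`, one source/sink pair like this per shared folder is one way to map each folder to its own HDFS directory.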