
Hadoop tools for moving files from local file system to HDFS [closed]

I am doing a POC on ways to import data from a shared network drive to HDFS. The data would be in different folders on the shared drive, and each folder would correspond to a different directory on HDFS. I looked at some popular tools that do this, but most of them are for moving small pieces of data rather than whole files. These are the tools I found; are there any others?

Apache Flume: If only a handful of production servers produce data and the data does not need to be written out in real time, it may make sense to just move the data to HDFS via WebHDFS or NFS, especially if the amount of data being written is relatively small; a few files of a few GB every few hours will not hurt HDFS. In that case, planning, configuring, and deploying Flume may not be worth it. Flume is really meant to push events in real time, where the stream of data is continuous and its volume reasonably large. [Flume book from Safari Online and the Flume cookbook]

Apache Kafka: Producer-consumer model. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

Amazon Kinesis: A paid, managed service for real-time data, similar in purpose to Flume.

WebHDFS: Submit an HTTP PUT request without automatically following redirects and without sending the file data. Then submit another HTTP PUT request, with the file data to be written, to the URL in the Location header. [http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE]
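The two-step CREATE described above can be sketched in Python with only the standard library. This is a minimal sketch, not a definitive client: the function names `build_create_url` and `webhdfs_put` are hypothetical, the host, port, and user are whatever your cluster uses, and it assumes simple `user.name` authentication rather than Kerberos.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def build_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the WebHDFS CREATE URL for step 1 (no file data is sent to it)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,  # assumption: simple (non-Kerberos) authentication
        "overwrite": "true" if overwrite else "false",
    })
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def webhdfs_put(host, port, hdfs_path, local_path, user):
    """Upload local_path to hdfs_path using the two-step WebHDFS CREATE."""
    url = urlsplit(build_create_url(host, port, hdfs_path, user))

    # Step 1: PUT to the NameNode with no body. http.client never follows
    # redirects on its own, so the 307 response comes back to us unchanged.
    conn = http.client.HTTPConnection(url.hostname, url.port)
    conn.request("PUT", f"{url.path}?{url.query}")
    resp = conn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    conn.close()
    if resp.status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got {resp.status}")

    # Step 2: PUT the file data to the URL given in the Location header.
    target = urlsplit(location)
    conn = http.client.HTTPConnection(target.hostname, target.port)
    with open(local_path, "rb") as f:
        size = str(os.path.getsize(local_path))
        conn.request("PUT", f"{target.path}?{target.query}", body=f,
                     headers={"Content-Length": size})
        resp = conn.getresponse()
        resp.read()
    conn.close()
    if resp.status != 201:
        raise RuntimeError(f"upload failed with HTTP status {resp.status}")
```

A call such as `webhdfs_put("namenode", 50070, "/data/in/a.txt", "/mnt/share/a.txt", "etl")` would upload one file; all of those argument values are placeholders.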

Open Source Projects: https://github.com/alexholmes/hdfs-file-slurper

My requirements are simple:

  • Poll a directory for files; when a file arrives, copy it to HDFS and then move it to a "processed" directory.
  • Do this for multiple directories.
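For illustration, the two requirements above could be met with a small polling script. This is a rough sketch under stated assumptions: it shells out to the `hdfs dfs -put` command, so the Hadoop client must be installed and on the PATH, and the names `sweep`, `poll`, and `hdfs_put` are hypothetical. It does not check whether a file is still being written when it is picked up.

```python
import shutil
import subprocess
import time
from pathlib import Path


def hdfs_put(local_file, hdfs_dir):
    """Copy one file to HDFS by shelling out to `hdfs dfs -put`."""
    subprocess.run(["hdfs", "dfs", "-put", str(local_file), hdfs_dir], check=True)


def sweep(watch_dir, hdfs_dir, processed_dir, upload=hdfs_put):
    """Upload each regular file in watch_dir, then move it to processed_dir."""
    watch_dir, processed_dir = Path(watch_dir), Path(processed_dir)
    processed_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(watch_dir.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories, including a nested "processed"
        upload(path, hdfs_dir)  # raises on failure, so the file is not moved
        shutil.move(str(path), str(processed_dir / path.name))
        moved.append(path.name)
    return moved


def poll(mappings, interval_seconds=30):
    """Sweep every (watch_dir, hdfs_dir, processed_dir) mapping, forever."""
    while True:
        for watch_dir, hdfs_dir, processed_dir in mappings:
            sweep(watch_dir, hdfs_dir, processed_dir)
        time.sleep(interval_seconds)
```

Each shared-drive folder gets one entry in `mappings`, for example `("/mnt/share/sales", "/data/sales", "/mnt/share/sales/processed")`; those paths are placeholders.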
imgr8 asked Oct 21 '22 03:10


1 Answer

Give Flume a try with a spooling directory source. You didn't mention your data volume or velocity, but I did a similar POC from a local Linux filesystem to a Kerberized HDFS cluster, with good results, using a single Flume agent running on an edge node.
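As a rough illustration of that setup, a single-agent configuration might look like the fragment below. The agent and component names (`agent1`, `src1`, `ch1`, `sink1`) and both paths are placeholders, not the configuration used in the POC, and the Kerberos settings for the HDFS sink are omitted.

```properties
# Hypothetical agent with one source, one channel, and one sink.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Spooling directory source: watches a local directory for new files.
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /data/incoming
agent1.sources.src1.channels = ch1

# File channel: buffers events on local disk between source and sink.
agent1.channels.ch1.type = file

# HDFS sink: writes the events into an HDFS directory.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/data/incoming
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1
```

Covering several shared-drive folders would take one source and sink pair per folder, each pointing at its own `spoolDir` and `hdfs.path`.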

J Maurer answered Oct 23 '22 12:10