
Repository organization for Hadoop project

I am starting a new Hadoop project that will have multiple Hadoop jobs (and hence multiple JAR files). I am using Mercurial for source control and was wondering what the optimal way of organizing the repository structure would be. Should each job live in a separate repo, or would it be more efficient to keep them in the same repo, broken down into folders?

Alex N. asked Jun 02 '10 00:06





1 Answer

If you're pipelining the Hadoop jobs (the output of one is the input of another), I've found it's better to keep most of them in the same repository, since I tend to write a lot of common methods that the various MapReduce jobs can share.

Personally, I keep the streaming jobs in a separate repo from my more traditional jobs since there are generally no dependencies.

Are you planning on using the DistributedCache or streaming jobs? You might want a separate directory for files you distribute. Do you really need a JAR per Hadoop job? I've found I don't.
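As a rough illustration of the "one JAR, many jobs" idea, here is a minimal sketch of a single entry point that dispatches to a job by name. It is plain Java with no Hadoop dependency; the class name JobDriver, the job names, and the placeholder job bodies are all hypothetical, and in a real project each entry would configure and submit a MapReduce job. Hadoop's own org.apache.hadoop.util.ProgramDriver serves a similar purpose for the bundled examples, so it may be worth a look before rolling your own.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical single entry point for a JAR that bundles several jobs. */
public class JobDriver {
    /** Exit code returned when the job name is missing or unknown (an assumption). */
    public static final int USAGE_ERROR = 64;

    // Each job receives the remaining command-line arguments and returns an exit code.
    private final Map<String, Function<String[], Integer>> jobs = new LinkedHashMap<>();

    public void register(String name, Function<String[], Integer> job) {
        jobs.put(name, job);
    }

    /** Runs the job named by args[0], passing it the rest of the arguments. */
    public int run(String[] args) {
        if (args.length == 0 || !jobs.containsKey(args[0])) {
            System.err.println("Usage: <job> [args...]; known jobs: " + jobs.keySet());
            return USAGE_ERROR;
        }
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        return jobs.get(args[0]).apply(rest);
    }

    public static void main(String[] args) {
        JobDriver driver = new JobDriver();
        // Placeholder jobs: a real entry would build and submit a MapReduce job here.
        driver.register("clean-logs", jobArgs -> 0);
        driver.register("aggregate", jobArgs -> 0);
        System.exit(driver.run(args));
    }
}
```

With something like this, all jobs and their shared helpers can live in one repository and one build, and a job is picked at launch time by passing its name as the first argument after the main class.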

If you give more details about what you plan on doing with Hadoop, I can see what else I can suggest.

Eric Wendelin answered Oct 13 '22 07:10
