I am starting a new Hadoop project that will have multiple Hadoop jobs (and hence multiple JAR files). I'm using Mercurial for source control, and I was wondering what the optimal way of organizing the repository structure would be. Should each job live in a separate repo, or would it be more efficient to keep them in the same repo, broken down into folders?
Hadoop has four major elements: HDFS, MapReduce, YARN, and Hadoop Common. Most other tools in the ecosystem supplement or support these four.
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
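To make the block layout concrete, here is a minimal arithmetic sketch. It does not touch a real cluster or use any Hadoop API; the class and method names are made up for illustration, and the 128 MiB block size is an assumption (it is the usual `dfs.blocksize` default in Hadoop 2 and later, but your cluster may differ).

```java
// Sketch only: how a file of a given size splits into fixed-size blocks.
// BlockMath is a hypothetical helper, not part of Hadoop.
final class BlockMath {
    // Assumption: 128 MiB block size; check dfs.blocksize on your cluster.
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed to hold fileSize bytes (ceiling division).
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Size in bytes of the final block; every earlier block is exactly blockSize.
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // a 300 MiB file
        System.out.println("blocks: " + blockCount(fileSize, DEFAULT_BLOCK_SIZE));
        System.out.println("last block bytes: " + lastBlockSize(fileSize, DEFAULT_BLOCK_SIZE));
    }
}
```

With a replication factor of 3, each of those blocks would be stored three times, so the raw storage used is roughly three times the file size.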
Hadoop stores its data in HDFS, the Hadoop Distributed File System. HDFS is Hadoop's primary storage system and holds very large files on a cluster of commodity hardware. It is designed around storing a small number of large files rather than a huge number of small ones.
If you're pipelining the Hadoop jobs (output of one is the input of another), I've found it's better to keep most of it in the same repository since I tend to generate a lot of common methods I can use in the various MR jobs.
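As a rough illustration of the kind of shared code I mean, here is a sketch in which two pipelined jobs agree on the intermediate directory through one common helper. All class names, stage names, and paths are hypothetical, and the job classes are stubs with no Hadoop dependencies, so this only shows the sharing, not a real MapReduce job.

```java
// Sketch only: a helper shared by pipelined jobs that live in one repository.
// Every name here is made up for illustration.
final class PipelinePaths {
    // Builds the directory for one stage of one pipeline run,
    // e.g. /data/pipeline/run-42/extract
    static String stageDir(String baseDir, String runId, String stage) {
        // Drop trailing slashes so we never emit "//" in the path.
        String base = baseDir.replaceAll("/+$", "");
        return base + "/" + runId + "/" + stage;
    }
}

// First job in the pipeline: writes to the "extract" stage directory.
final class ExtractJob {
    static String outputDir(String baseDir, String runId) {
        return PipelinePaths.stageDir(baseDir, runId, "extract");
    }
}

// Second job: reads what ExtractJob wrote, using the same helper,
// so the two jobs cannot drift apart on the path convention.
final class AggregateJob {
    static String inputDir(String baseDir, String runId) {
        return PipelinePaths.stageDir(baseDir, runId, "extract");
    }

    static String outputDir(String baseDir, String runId) {
        return PipelinePaths.stageDir(baseDir, runId, "aggregate");
    }
}
```

If these jobs lived in separate repositories, `PipelinePaths` would have to be published as its own versioned artifact or copied into each repo, which is the overhead a single repository avoids.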
Personally, I keep the streaming jobs in a separate repo from my more traditional jobs since there are generally no dependencies.
Are you planning on using the DistributedCache or streaming jobs? You might want a separate directory for files you distribute. Do you really need a JAR per Hadoop job? I've found I don't.
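To show what "one JAR for several jobs" can look like, here is a minimal sketch of a single entry point that picks the job by name from the first argument. The class name, job names, and stub methods are all hypothetical; in a real project each stub would configure and submit an actual Hadoop job. Hadoop's own examples JAR does something similar with its `ProgramDriver` utility, but this sketch uses only the Java standard library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Sketch only: one main class dispatching to several jobs packaged in one JAR.
final class JobDriver {
    // Job name -> entry point returning an exit code (0 means success).
    private static final Map<String, ToIntFunction<String[]>> JOBS = new LinkedHashMap<>();

    static {
        JOBS.put("extract", JobDriver::runExtract);
        JOBS.put("aggregate", JobDriver::runAggregate);
    }

    // Hypothetical stub: a real version would build and submit the job.
    static int runExtract(String[] args) {
        return args.length == 2 ? 0 : 1; // expects <input> <output>
    }

    // Hypothetical stub, same contract as runExtract.
    static int runAggregate(String[] args) {
        return args.length == 2 ? 0 : 1; // expects <input> <output>
    }

    // Looks up the job named by args[0] and passes it the remaining arguments.
    static int run(String[] args) {
        if (args.length == 0 || !JOBS.containsKey(args[0])) {
            System.err.println("Usage: JobDriver <" + String.join("|", JOBS.keySet()) + "> [job args...]");
            return 2;
        }
        String[] jobArgs = Arrays.copyOfRange(args, 1, args.length);
        return JOBS.get(args[0]).applyAsInt(jobArgs);
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

With something like this, every job is launched the same way, for example `hadoop jar myjobs.jar JobDriver extract <input> <output>` (the JAR name is a placeholder), and there is only one build artifact to version and deploy.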
If you give more details about what you plan on doing with Hadoop, I can see what else I can suggest.