
Just enough Java for Hadoop [closed]

Tags: java, hadoop

I have been a C++ developer for about 10 years. I need to pick up Java just for Hadoop. I doubt I will be doing anything else in Java. So, I would like a list of things I would need to pick up. Of course, I would need to learn the core language, but what else?

I did Google around for this, and it could be seen as a possible duplicate of "I want to learn Java. Show me how?", but it's not. Java is a huge programming language with lots of libraries, and what I need to learn will depend largely on what I am using Hadoop for. But I suppose it is possible to say something like "don't bother learning this". That would be quite useful too.

Nikhil asked Apr 20 '11 14:04



2 Answers

In my day job, I've just spent some time helping a C++ person to pick up enough Java to use some Java libraries via JNI (Java Native Interface) and then shared memory into their primarily C++ application. Here are some of the key things I noticed:

  1. You cannot manage anything beyond a toy project without an IDE. The very first thing you should do is download a popular Java IDE (Eclipse is a fine choice, but there are also alternatives including NetBeans and IntelliJ). Do not be tempted to try and manage with vi / emacs and javac / make. You will be living in a cave and not realising it. Once you're up to speed with even basic IDE functions you will be literally dozens of times more productive than without an IDE.
  2. Learn how to lay out a simple project structure and packages. There will be simple walkthroughs of how to do this on the Eclipse site or elsewhere. Never put anything into the default package.
  3. Java has a type system whereby the reference and primitive types are relatively separate for historic / performance reasons.
  4. Java's generics are not the same as C++ templates. Read up on "type erasure".
  5. You may wish to understand how Java's GC works. Just google "mark and sweep" - at first, you can just settle for the naivest mental model and then learn the details of how a modern production GC would do it later.
  6. The core of the Collections API should be learned without delay. Map / HashMap, List / ArrayList & LinkedList and Set should be enough to get going.
  7. Learn modern Java concurrency. Thread is an assembly-language level primitive compared to some of the cool stuff in java.util.concurrent. Learn ConcurrentHashMap, Atomic*, Lock, Condition, CountDownLatch, BlockingQueue and the threadpools from Executors. Good books here are those by Brian Goetz and Doug Lea.
  8. As soon as you want to use 3rd party libraries, you'll need to learn how the classpath works. It's not rocket science, but it is a bit verbose.
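To make points 3, 4, 6 and 7 of the list above a little more concrete, here is a minimal sketch. The class and method names (`FirstListSketch`, `boxedValuesEqual`, `sameRuntimeClass`, `countWords`) are made up for illustration, and the thread pool size and timeout are arbitrary assumptions; only the `java.util` and `java.util.concurrent` types are real library classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Hadoop or any library.
class FirstListSketch {

    // Point 3: Integer is a reference type wrapping the primitive int.
    // equals() compares values; == on two Integer references compares identity.
    static boolean boxedValuesEqual(Integer a, Integer b) {
        return a.equals(b);
    }

    // Point 4: generic type arguments are erased at compile time, so both
    // lists have the same runtime class (java.util.ArrayList).
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    // Points 6 and 7: count words from several chunks in parallel, using a
    // thread pool from Executors and a ConcurrentHashMap instead of raw Threads.
    static Map<String, Integer> countWords(List<List<String>> chunks)
            throws InterruptedException {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4); // size is arbitrary
        for (List<String> chunk : chunks) {
            pool.execute(() -> {
                for (String word : chunk) {
                    // merge() on ConcurrentHashMap updates each key atomically
                    counts.merge(word, 1, Integer::sum);
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let submitted ones finish
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(boxedValuesEqual(1000, 1000));
        System.out.println(sameRuntimeClass());
        System.out.println(countWords(Arrays.asList(
                Arrays.asList("map", "reduce", "map"),
                Arrays.asList("reduce", "map"))));
    }
}
```

Coming from C++, the surprise is usually `sameRuntimeClass()`: unlike template instantiations, `List<String>` and `List<Integer>` are not distinct types at run time.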

If you're a low-level C++ guy, then you may find some of this interesting also:

  1. Java has virtual dispatch by default. The keyword static on a Java method is used to indicate a class method. private Java methods use invokespecial dispatch, which is a dispatch onto the exact type in use.
  2. On an Oracle VM at least, objects comprise two machine words of header (the mark word and the class word). The mark word is a bunch of flags the VM uses - notably for thread synchronization. The class word you can think of as a pointer to the VM's representation of the Class object (which is where the vtables for methods live). Following the class word are the member fields of the instance of the object.
  3. Java .class files are an intermediate language, and not really that similar to x86 object code. In particular there are lots more useful tools for .class files (including the javap disassembler which ships with the JDK).
  4. The Java equivalent of the symbol table is called the Constant Pool. It's typed and it has a lot of information in it - arguably more than the x86 object code equivalent.
  5. Java virtual method dispatch consists of looking up the correct method to be called in the Constant Pool and then converting that to an offset into a vtable. Then walking up the class hierarchy until a not-null value is found at that vtable offset.
  6. Java starts off interpreted and then goes compiled (for Oracle and some other VMs anyway). The switch to compiled mode is done method-by-method on an as-needed basis. When benchmarking and perf tuning you need to make sure that you've warmed the system up before you start, and you should typically profile at the method level to start with. The optimizations that are made can be quite aggressive / optimistic (with a check and a fallback if the assumptions are violated) - so perf tuning is a bit of an art.
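Point 1 of this list (virtual by default, static for class methods, private methods bound to the exact type) can be seen in a few lines. This is only a sketch; the class names `Base`, `Derived` and `DispatchSketch` are hypothetical.

```java
// Hypothetical classes for illustration only.
class Base {
    // Instance methods are virtual by default; no keyword is needed.
    String name() { return "Base"; }

    // Calls name() through virtual dispatch, so subclasses can change the result.
    String describe() { return "I am " + name(); }

    // static marks a class method; it is resolved at compile time, never virtually.
    static String kind() { return "Base.kind"; }

    // private methods are not inherited and cannot be overridden.
    private String tag() { return "Base.tag"; }

    // Always invokes Base's own tag(), whatever the runtime type is.
    String callTag() { return tag(); }
}

class Derived extends Base {
    @Override
    String name() { return "Derived"; }

    // Same name as Base's private tag(), but this is a new, unrelated method.
    String tag() { return "Derived.tag"; }
}

class DispatchSketch {
    public static void main(String[] args) {
        Base b = new Derived();
        System.out.println(b.name());      // prints "Derived"
        System.out.println(b.describe());  // prints "I am Derived"
        System.out.println(Base.kind());   // prints "Base.kind"
        System.out.println(b.callTag());   // prints "Base.tag"
    }
}
```

In C++ terms, every non-static, non-private instance method behaves as if it were declared `virtual`; marking a method `final` is the way to forbid overriding.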

Hopefully there's some useful stuff in there to be going on with - please comment / ask followup questions.

kittylyst answered Sep 19 '22 04:09


Learning "just enough" Java is learning Java. Either you learn all the core principles and language design decisions, or you suffer along making easily avoidable mistakes. Considering that you already know how to program, a lot of the information can be skimmed (with an eye for where it differs from other languages you are intimately familiar with).

So you need to learn:

  1. How to get started
  2. The language itself
  3. The core, essential classes
  4. The major Collections

And if you don't have a build framework in place, how to package your compiled code.

Beyond that, nearly every other item you might need to learn depends heavily on what you intend to do. Don't discount the online tutorials from Oracle/Sun; they are quite good (compared to other online tutorials).

Edwin Buck answered Sep 17 '22 04:09