
Java vs Python on Hadoop

I am working on a project using Hadoop, and it seems to natively incorporate Java and provide streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process that I can go either way if there is a significant performance difference one way or the other.

jnoss asked Sep 26 '09 21:09




2 Answers

With Python you'll probably develop faster, and with Java your code will definitely run faster.

Google "benchmarksgame" if you want to see some very accurate speed comparisons between all popular languages, but if I recall correctly you're talking about Java being roughly 3-5x faster.

That said, few things are processor bound these days, so if you feel like you'd develop better with Python, have at it!
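To give a feel for how little code the Python route needs, here is a minimal word-count sketch for Hadoop Streaming. Streaming hands each task its input as lines on stdin and expects tab-separated key/value lines on stdout; the reducer sees its input sorted by key. The function names and the single-file `map`/`reduce` argument convention are assumptions made for this illustration, not anything Hadoop prescribes.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is already sorted by key, which the shuffle
    phase does between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: "wordcount.py map" or "wordcount.py reduce".
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved as a hypothetical `wordcount.py`, the pipeline can be approximated locally with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. On a cluster the two steps would be passed to the streaming jar's `-mapper` and `-reducer` options.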


In reply to comment (how can java be faster than Python):

All languages are processed differently. Java is about the fastest after C and C++ (which can be anywhere from as fast to 5x faster than Java, but seem to average around 2x faster). The rest are 2-5x or more slower. Python is one of the faster ones after Java. I'm guessing that C# is about as fast as Java or maybe faster, but the benchmarksgame only had Mono (which was a tad slower) because they don't run it on Windows.

Most of these claims are based on the Computer Language Benchmarks Game, which tends to be pretty fair because advocates of/experts in each language tweak the test written in their specific language to ensure the code is well-targeted.

For example, this shows all tests with Java vs C++, and you can see the speed ranges from about equal to Java being 3x slower (first column is between 1 and 3), and Java uses much more memory!

Now this page shows Java vs Python (from the point of view of Python). The speeds range from Python being 2x slower than Java to 174x slower. Python generally beats Java in code size and memory usage, though.

Another interesting point here: in tests that allocated a lot of memory, Java actually performed significantly better than Python in memory size as well. I'm pretty sure Java usually loses on memory because of the overhead of the VM, but once that factors out, Java is probably more efficient than most (again, except the C's).

This is Python 3 by the way; the other Python platform tested (just called Python) fared much worse.

If you really want to know how it is faster, the VM is amazingly intelligent. It compiles to machine code only AFTER it has been running the code for a while (just-in-time compilation), so it knows what the most likely code paths are and optimizes for them. Its memory allocation is an art, which is really useful in an OO language. It can perform some amazing run-time optimizations which no non-VM language can do. It can run in a pretty small memory footprint when forced to, and is a language of choice for embedded devices along with C/C++.

I worked on a Signal Analyzer for Agilent (think expensive o-scope) where nearly the entire thing (aside from the sampling) was done in Java. This includes drawing the screen including the trace (AWT) and interacting with the controls.

Currently I'm working on a project for all future cable boxes. The Guide along with most other apps will be written in Java.

Why wouldn't it be faster than Python?

Bill K answered Sep 25 '22 14:09


Java is less dynamic than Python and more effort has been put into its VM, making it a faster language. Python is also held back by its Global Interpreter Lock, meaning it cannot run the threads of a single process on different cores at the same time.
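The effect of the Global Interpreter Lock can be seen with a small experiment: run the same CPU-bound function in a thread pool and in a process pool and compare wall-clock times. This is only a sketch, assuming a standard CPython build; `sum_squares` and `timed_map` are names invented for this example, and the actual timings depend on the machine, so none are quoted here.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_squares(n):
    """CPU-bound toy workload: pure-Python arithmetic, no I/O."""
    return sum(i * i for i in range(n))


def timed_map(executor_cls, sizes, workers=4):
    """Run sum_squares over sizes in a pool; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(sum_squares, sizes))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    sizes = [2_000_000] * 4
    # Threads share one interpreter lock; processes each have their own.
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, seconds = timed_map(cls, sizes)
        print(f"{cls.__name__}: {seconds:.2f}s")
```

On a multi-core machine the process pool would be expected to finish sooner for this kind of workload. Hadoop already runs map and reduce tasks as separate processes, so the lock mostly matters for threading inside a single task.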

Whether this makes any significant difference depends on what you intend to do. I suspect both languages will work for you.
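One rough way to judge whether it matters for a particular job is to check how much of a task's wall-clock time is actually spent on the CPU. The sketch below compares `time.process_time` (CPU time of the current process) with `time.perf_counter` (elapsed time); `cpu_fraction` and `busy` are hypothetical helpers for illustration, and time spent in child processes is not counted.

```python
import time


def cpu_fraction(task, *args):
    """Return CPU time divided by wall time for one call of task(*args).

    A value near 1 suggests the call is processor bound; a value near 0
    suggests it mostly waits (I/O, sleeping, network).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy(n):
    """Stand-in for a compute-heavy step."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(f"busy loop: {cpu_fraction(busy, 2_000_000):.2f}")
    print(f"sleep:     {cpu_fraction(time.sleep, 0.2):.2f}")
```

If the fraction for the real map or reduce logic is low, the language's raw speed is unlikely to be the bottleneck.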

David Crawshaw answered Sep 23 '22 14:09