
Large scale machine learning - Python or Java? [closed]

I am currently embarking on a project that will involve crawling and processing huge amounts of data (hundreds of gigs), and mining it for structured data extraction, named entity recognition, deduplication, classification, etc.

I'm familiar with ML tools from both the Java and the Python world: Lingpipe, Mahout, NLTK, etc. However, when it comes down to picking a platform for such a large scale problem, I lack sufficient experience to decide between Java and Python.

I know this sounds like a vague question, but I am looking for general advice on picking either Java or Python. The JVM offers better performance(?) than Python, but do libraries like Lingpipe match up with the Python ecosystem? If I went with Python, how easy would it be to scale and manage it across multiple machines?

Which one should I go with and why?

jeffreyveon asked Mar 15 '12 13:03




2 Answers

Apache is going strong, producing excellent projects such as Lucene/Solr/Nutch for search, Mahout for big data machine learning, Hadoop for MapReduce, OpenNLP for NLP, and a lot of NoSQL stores. The best part is the big "I", which stands for integration: these products integrate well with each other, and in most situations they complement one another.

Python is great too. However, given the ASF projects above, I would go with Java, like Sean Owen. Python will always be usable with these projects, but mostly through add-ons rather than the core implementations. For example, you can run Hadoop jobs written in Python by using Hadoop Streaming.
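To make the Streaming point concrete, here is a minimal sketch of a token-count job written for Hadoop Streaming, which passes records to the mapper and reducer over stdin/stdout as tab-separated key/value lines. The function names, the `map`/`reduce` command-line argument, and the token-count task itself are illustrative assumptions, not anything prescribed by Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'token<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token.lower()}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the counts per key.

    Relies on the input being sorted by key, which Hadoop Streaming
    guarantees between the map and reduce phases.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: invoke as `script.py map` or `script.py reduce`.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same script would be handed to the streaming jar through its `-mapper` and `-reducer` options. Locally, the pipeline can be approximated by piping the map output through `sort` and into the reduce step.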

I partially switched from C++ to Java in order to utilize some of the very popular Apache products like Lucene, Solr & OpenNLP and also other popular open source NoSQL Java products like Neo4j & OrientDB.

Yavar answered Oct 04 '22 21:10



I think one big thing Java has going for it is Hadoop. If you really mean large scale, you'll want to be able to use something like that. Generally speaking Java has the performance advantage, and more libraries available. So: Java.

Sean Owen answered Oct 04 '22 23:10
