I know it is possible to use Python on Hadoop.
But is it possible to use scikit-learn's machine learning algorithms on Hadoop?
If the answer is no, is there a machine learning library for Python and Hadoop?
Thanks for your help.
Scikit-learn and TensorFlow were both designed to help developers create and benchmark new models, so they overlap in functionality. The difference is that scikit-learn is used in practice with a broader range of models, whereas TensorFlow is mainly intended for neural networks.
The scikit-learn library has the following requirements for the data before it can be used to train a model: features and response should be separate objects; features and response should be numeric; and features and response should be NumPy arrays of compatible sizes (for example, one response value per row of the feature array).
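The requirements above can be illustrated with a minimal sketch. The toy data and the choice of `LogisticRegression` are assumptions made for illustration only; any scikit-learn estimator with `fit` and `predict` would follow the same pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 samples (rows), 2 numeric features (columns).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [1.0, 0.9], [0.9, 1.1], [1.2, 1.0]])

# The response is a separate object: one numeric label per row of X.
y = np.array([0, 0, 0, 1, 1, 1])

# Compatible sizes: the number of rows in X matches the length of y.
assert X.shape[0] == y.shape[0]

model = LogisticRegression()
model.fit(X, y)

# predict returns one label per input row.
predictions = model.predict(X)
```

If the data arrives as strings or categories, it would need to be encoded as numbers before the call to `fit`.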
Scikit-learn is a free machine learning library for Python. It features various algorithms such as support vector machines, random forests, and k-nearest neighbours, and it interoperates with Python numerical and scientific libraries such as NumPy and SciPy.
When used on a single machine, Spark can serve as a substitute for the default parallelism framework used by scikit-learn (joblib). If the work later needs to be spread across multiple machines, no code change is required between the single-machine case and the cluster case.
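A rough sketch of why the code does not change: scikit-learn delegates parallel work to joblib, and joblib lets you pick the backend by name at the call site. The example below uses joblib's built-in `"threading"` backend so it runs on one machine without Spark. Swapping in a Spark-backed joblib backend is an assumption here: it would need a separately installed package that registers such a backend, and only the backend name passed to `parallel_backend` would differ.

```python
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A small hyperparameter search; each candidate/fold pair is an independent task.
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3, n_jobs=2)

# The backend name is the only cluster-specific part of this code.
# "threading" is built into joblib; a Spark-backed backend (assumed to be
# installed and registered separately) would be selected by name here instead.
with parallel_backend("threading"):
    search.fit(X, y)

best_c = search.best_params_["C"]
```

The search definition and the `fit` call stay the same regardless of where the tasks are executed.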
Short answer: yes, because you can run almost anything on Hadoop.
Long answer: it depends. Answer this question for a start:
Also, you may find this presentation useful (the Hadoop part starts at slide 73).
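One possible route for "running almost anything" is Hadoop Streaming, which lets any executable that reads lines from standard input and writes lines to standard output act as a mapper. The following is a minimal sketch of such a mapper that applies an already trained model to each input line. The function names, the comma-separated input format, and `threshold_predict` are all hypothetical; in a real job, `predict` would be the `predict` method of a scikit-learn model loaded from a file shipped with the job.

```python
import sys


def map_lines(lines, predict):
    """Yield 'input<TAB>prediction' for each comma-separated line of features."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        features = [float(value) for value in line.split(",")]
        # predict takes a list of rows and returns one label per row,
        # mirroring the calling convention of a scikit-learn estimator.
        label = predict([features])[0]
        yield f"{line}\t{label}"


def threshold_predict(rows):
    # Hypothetical stand-in for a trained model's predict method.
    return [1 if sum(row) > 1.0 else 0 for row in rows]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Entry point a streaming job would call: read stdin, write stdout.
    for output in map_lines(stdin, threshold_predict):
        stdout.write(output + "\n")


# Local check with in-memory lines instead of stdin.
sample = ["0.2,0.3", "", "0.9,0.8"]
results = list(map_lines(sample, threshold_predict))
```

Note that this only parallelizes prediction with a model trained elsewhere. Training a scikit-learn model across the cluster is a different problem, which is part of why the long answer is "it depends".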