
Running Hadoop software on office computers (when they are idle)

Is there a project that helps set up a Hadoop cluster on office desktops while they are idle?

I'd like to experiment with Hadoop/MapReduce/HBase but don't have access to 5-10 computers. The computers at work are idle after hours and are connected to each other through a very high-speed connection. What's more, data on these computers stays within our network, so there is no privacy issue.

For this to work I need a fairly lightweight monitor running on each machine. When the computer has been idle for X hours, it joins the cluster. If the user logs on, it has to drop out of the cluster and give all CPU and memory back.
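To make the requirement concrete, here is a minimal sketch of the decision such a monitor would make on each polling tick. It is only an illustration: `next_action`, `Action`, and the `user_active` flag are hypothetical names, and how idle time and user activity are actually measured is platform-specific and not shown.

```python
from enum import Enum


class Action(Enum):
    JOIN = "join"    # start the Hadoop worker on this machine
    LEAVE = "leave"  # stop the worker and free CPU/memory
    NONE = "none"    # keep the current state


def next_action(idle_seconds: float, user_active: bool,
                in_cluster: bool, threshold_hours: float) -> Action:
    """Decide what the monitor should do on this polling tick.

    idle_seconds and user_active are assumed to come from a
    platform-specific check that this sketch does not implement.
    """
    if user_active:
        # The user is back: drop out immediately if we had joined.
        return Action.LEAVE if in_cluster else Action.NONE
    if not in_cluster and idle_seconds >= threshold_hours * 3600:
        # Idle for at least X hours and not yet part of the cluster.
        return Action.JOIN
    return Action.NONE
```

Keeping the decision as a pure function separates the policy (the X-hour threshold) from the mechanism used to join or leave.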

Does something like this exist?

Shahbaz asked Apr 14 '12 05:04



2 Answers

You can use Task Scheduler to detect the idle state and then start or stop a Hadoop VM with VirtualBox or VMware Player. Alternatively, you can write a PowerShell script that starts and stops the VM based on resource usage.
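As a rough sketch of the start/stop step, assuming VirtualBox with its `VBoxManage` command-line tool on the PATH: the VM name `hadoop-node` and the helper names are hypothetical, and a scheduled task (or any idle monitor) would call `run_vm_command` when the idle state changes.

```python
import subprocess
from typing import List

VM_NAME = "hadoop-node"  # hypothetical VM name, not from the original answer


def vm_command(action: str, vm_name: str = VM_NAME) -> List[str]:
    """Build the VBoxManage command line for starting or stopping the VM."""
    if action == "start":
        # Start without a GUI window so nothing appears on the desktop.
        return ["VBoxManage", "startvm", vm_name, "--type", "headless"]
    if action == "stop":
        # Save the VM state to disk, releasing host CPU and memory.
        return ["VBoxManage", "controlvm", vm_name, "savestate"]
    raise ValueError(f"unknown action: {action!r}")


def run_vm_command(action: str, vm_name: str = VM_NAME) -> None:
    """Run the command; raises CalledProcessError if VBoxManage fails."""
    subprocess.run(vm_command(action, vm_name), check=True)
```

For example, `run_vm_command("start")` could be triggered on idle and `run_vm_command("stop")` when the user logs on.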

johnshen64 answered Sep 27 '22 16:09


Hadoop is not a computation grid; it is more of a data grid (see slide 9 in this presentation). With Hadoop the data is spread over the cluster, so it has to be stored on the computers themselves. The time it would take to copy the data onto the machines when they become idle, and to remove it when they are no longer idle, would probably not be worth it. You'd be better off using Hadoop in the cloud (Amazon, Azure, etc.).
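A back-of-the-envelope estimate shows why the copy time matters. The sketch below only divides data size by link speed; the function name is hypothetical, and it ignores protocol overhead, disk speed, and replication traffic, so real transfers would take longer.

```python
def transfer_hours(data_gb: float, link_gbit_per_s: float) -> float:
    """Idealized time to move data_gb gigabytes over one network link.

    Assumes the link is fully saturated and nothing else is a bottleneck.
    """
    gigabits = data_gb * 8              # 1 byte = 8 bits
    seconds = gigabits / link_gbit_per_s
    return seconds / 3600
```

With illustrative numbers, `transfer_hours(500, 1.0)` is 4000 seconds, a little over an hour per machine, paid every time the machine joins the cluster.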

Arnon Rotem-Gal-Oz answered Sep 27 '22 17:09