I want to use Apache YARN as a cluster and resource manager for running a framework where resources would be shared across different tasks of the same framework. I want to use my own distributed off-heap file system.
Is it possible to use any other distributed file system with YARN other than HDFS?
If yes, what HDFS APIs need to be implemented?
You can run Spark without Hadoop in standalone mode. Spark and Hadoop work better together, but Hadoop is not essential to run Spark: the Spark documentation notes that there is no need for Hadoop if you run Spark in standalone mode, and beyond that you only need a resource manager such as YARN or Mesos.
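A minimal sketch of this, assuming a hypothetical standalone master at spark://master:7077 and a local input path; the point is that the input is a file:// URI, so no HDFS is involved:

```java
// Hedged sketch: Spark against its own standalone cluster manager, no Hadoop/HDFS.
// Host name and input path are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class StandaloneExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("no-hdfs-example")
                .master("spark://master:7077")   // Spark standalone master, not YARN
                .getOrCreate();

        // Input comes from a local file:// path, so no cluster filesystem is required.
        Dataset<String> lines = spark.read().textFile("file:///tmp/input.txt");
        System.out.println("lines = " + lines.count());

        spark.stop();
    }
}
```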
Hive can store data in external tables, so it's not mandatory to use HDFS; it also supports file formats such as ORC, Avro, SequenceFile and plain text files.
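As a hedged illustration of that, an external table can point at any Hadoop-compatible filesystem URI. The myfs:// scheme, HiveServer2 endpoint, credentials and table are placeholders, and the hive-jdbc driver is assumed to be on the classpath:

```java
// Hedged sketch: creating a Hive external table whose data lives outside HDFS.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveExternalTableExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint (host, port and credentials assumed).
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // EXTERNAL + LOCATION: Hive only tracks metadata; the ORC files sit on
            // whatever Hadoop-compatible filesystem backs the URI.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING) "
                + "STORED AS ORC "
                + "LOCATION 'myfs://cluster/warehouse/events'");
        }
    }
}
```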
YARN is a generic job scheduling framework and HDFS is a storage framework. YARN, in a nutshell, has a master (the ResourceManager) and workers (the NodeManagers); the ResourceManager creates containers on the workers to execute MapReduce jobs, Spark jobs, etc.
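A hedged sketch of that interaction, submitting a placeholder ApplicationMaster to the ResourceManager via the YarnClient API (queue, memory, vcores and the launch command are assumptions for illustration):

```java
// Hedged sketch: a client asks the ResourceManager for an application, describes
// the ApplicationMaster container, and submits it. NodeManagers then launch it.
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();   // picks up yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("my-framework-am");
        ctx.setQueue("default");
        ctx.setResource(Resource.newInstance(512, 1));   // AM container: 512 MB, 1 vcore

        // Command the NodeManager runs inside the AM container (class name is a placeholder).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx256m com.example.MyApplicationMaster"
                + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
        ctx.setAMContainerSpec(amContainer);

        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted " + appId);
        yarnClient.stop();
    }
}
```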
No, you cannot download HDFS alone, because Hadoop 2.x has four core components (HDFS, YARN, MapReduce and Hadoop Common): HDFS – the core storage component of the Hadoop ecosystem, used to store huge amounts of data; MapReduce – used to process large distributed datasets in parallel.
There are several different questions here.
Yes: that is how LinkedIn has deployed Samza in the past, using http:// downloads. Samza does not need a cluster filesystem, so there is no HDFS running in the cluster, just local file:// filesystems, one per host.
Applications which need a cluster filesystem wouldn't work in such a cluster.
Yes.
For what "filesystem" is, look at the Filesystem Specification. You need a consistent view across the filesytem: newly create files list(), deleted ones aren't found, updates immediately visible. And rename() of files and directories must be an atomic operation, ideally O(1). It's used for atomic commits of work, checkpoints, ... Oh, and for HBase, append() is needed.
MapR does this; Red Hat does it with GlusterFS; IBM and EMC do it for theirs. Do bear in mind that pretty much everything is tested on HDFS; you'd better hope the other cluster FS has done the testing (or that someone has done it for them, such as Hortonworks or Cloudera).
It depends on whether or not the FS offers a consistent filesystem view, rather than some eventual consistency world view. HBase is the real test here.
Well, you can certainly try!
First get all the filesystem contract tests to work; they measure basic API compliance. Then look at all the Apache Bigtop tests, which do system integration. I recommend you avoid HBase and Accumulo initially and focus on MapReduce, Hive, Spark and Flink.
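For the contract tests, a hedged example of binding one suite to the hypothetical myfs scheme (class and property names as found in hadoop-common's test artifacts; verify them against your Hadoop version):

```java
// Hedged sketch: binding Hadoop's filesystem contract tests to a "myfs" scheme.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.contract.AbstractBondedFSContract;
import org.apache.hadoop.fs.contract.AbstractContractRenameTest;
import org.apache.hadoop.fs.contract.AbstractFSContract;

/** Describes the filesystem under test; its options come from a contract XML resource. */
class MyFSContract extends AbstractBondedFSContract {
    MyFSContract(Configuration conf) {
        super(conf);
    }

    @Override
    public String getScheme() {
        return "myfs";   // test FS URI is read from fs.contract.test.fs.myfs
    }
}

/** One of many suites; similar subclasses exist for create, delete, open, seek, ... */
public class TestMyFSContractRename extends AbstractContractRenameTest {
    @Override
    protected AbstractFSContract createContract(Configuration conf) {
        return new MyFSContract(conf);
    }
}
```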
Don't be afraid to get on the Hadoop common-dev & bigtop lists and ask questions.