 

What does "pre-built for Apache Hadoop 2.7 and later" mean?

Tags:

apache-spark

What does "pre-built for Apache Hadoop 2.7 and later" mean on the download page of Apache Spark?

Does it mean it includes the necessary libraries for HDFS in Spark? If so, what about other storage systems like Cassandra, S3, HBase, SQL databases, and NoSQL databases? Do we need to download any libraries to connect to those storage systems?

Kannan asked Sep 14 '17

People also ask

What is the difference between Hadoop and Apache Hadoop?

Apache Hadoop and Apache Spark are both open-source frameworks for big data processing, with some key differences. Hadoop uses MapReduce to process data, while Spark uses resilient distributed datasets (RDDs).

Do I need to install Hadoop before Spark?

Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

Is Spark built on Hadoop?

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing.

Which is better Apache Spark or Hadoop?

Spark is much faster because it performs in-memory processing, which reduces disk read and write operations. Hadoop has slower performance because it uses disk for storage and depends on disk read and write operations.


1 Answer

Does it mean necessary libraries for HDFS in spark

Correct! Spark uses the Hadoop FileSystem API to access files (on HDFS, S3, and other Hadoop-supported file systems), and the "pre-built for Apache Hadoop 2.7 and later" version ships with the necessary libraries included.

That's mainly for Spark Core's RDD to access files with data.
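As a quick sanity check (assuming `SPARK_HOME` points at the pre-built distribution), you can list the bundled Hadoop client jars and exercise the Hadoop FileSystem API from the RDD layer; the HDFS URL below is a placeholder for your own cluster and file:

```shell
# List the Hadoop client jars bundled with a "pre-built for Hadoop" Spark distribution
ls "$SPARK_HOME/jars" | grep '^hadoop-'

# Read a file through the Hadoop FileSystem API from Spark Core's RDD layer
# (hdfs://namenode:8020/data/input.txt is a placeholder path)
echo 'sc.textFile("hdfs://namenode:8020/data/input.txt").count()' \
  | "$SPARK_HOME/bin/spark-shell" --master "local[*]"
```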

what about other storage systems like Cassandra, S3, HBase, SQL databases, and NoSQL databases? Do we need to download any libraries to connect to those storage systems?

Out of the mentioned storage systems, S3 is covered partially by the "pre-built for Apache Hadoop 2.7 and later" bundle (but you have to add additional jars for S3 specifically).

That's mostly for Spark SQL's Dataset API.
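A sketch of adding those extra S3 jars at submit time (assuming the S3A connector; `my_app.py` and the bucket name are placeholders, and the `hadoop-aws` version shown is an illustrative choice that should match the Hadoop version Spark was built against):

```shell
# Pull the S3A connector jars, which are NOT bundled with the Spark distribution.
# hadoop-aws should match the bundled Hadoop version (2.7.x for this bundle).
"$SPARK_HOME/bin/spark-submit" \
  --packages org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  my_app.py s3a://my-bucket/input/
```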

Cassandra, HBase, etc. have their own Spark connectors, which are not included in any bundle. See the DataStax Spark Cassandra Connector and the Apache HBase Spark Connector.
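For example, the Cassandra connector can be pulled in the same way at submit time (a sketch; `my_app.py` and `cassandra-host` are placeholders, and the artifact version must match your Spark and Scala versions):

```shell
# The DataStax Spark Cassandra Connector is not part of any Spark distribution,
# so fetch it at submit time. Coordinates/version here are illustrative.
"$SPARK_HOME/bin/spark-submit" \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.cassandra.connection.host=cassandra-host \
  my_app.py
```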


If you've been wondering "if I have to run Spark on YARN, which package type should I use?", just use "Pre-built for Apache Hadoop" with the Hadoop version ("2.7" vs "3.2 and later") matching the version of Hadoop in use (which is likely the version of Hadoop running YARN).
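Submitting to YARN with that bundle then looks like this (a minimal sketch; `my_app.py` and the configuration path are placeholders for your environment):

```shell
# Submit to YARN with a "Pre-built for Apache Hadoop" bundle whose Hadoop
# version matches the cluster. HADOOP_CONF_DIR must point at the cluster's
# configuration so Spark can locate the ResourceManager and HDFS.
export HADOOP_CONF_DIR=/etc/hadoop/conf
"$SPARK_HOME/bin/spark-submit" \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
```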


So "spark-prebuilt-with-hadoop-x.y" means that Spark includes Hadoop x.y in its jars directory. That obviously makes the distribution larger than "spark-without-hadoop". It also means that once you upgrade your HDFS to Hadoop 3.2 while the Spark distribution is still "with Hadoop 2.7", you can still use it, but some features simply won't be supported and your application can be less optimized (by HDFS itself), not to mention all the bugs fixed (and new ones introduced).
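With the "without Hadoop" (hadoop-free) distribution, the documented way to supply the Hadoop jars yourself is to point `SPARK_DIST_CLASSPATH` at the output of `hadoop classpath`, so Spark picks up the cluster's own Hadoop jars rather than bundled ones:

```shell
# Hadoop-free Spark build: borrow the Hadoop jars already installed on the
# machine instead of shipping a (possibly mismatched) copy inside Spark.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"
"$SPARK_HOME/bin/spark-shell" --master "local[*]"
```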


Wouldn't there be a conflict between the Hadoop jars present in spark-prebuilt-with-hadoop-x.y and those in hadoop-x.y? spark-prebuilt-with-hadoop-x.y gave me the impression that all the necessary Hadoop stuff (e.g., YARN) would be present in the Spark binary. Hence my confusion that all of Hadoop should be present in the umbrella spark-prebuilt-with-hadoop-x.y.

Not really, if you think about the underlying communication between a Spark app and Hadoop HDFS or Hadoop YARN: it happens between separate applications living in their own containers (possibly in Docker), so their CLASSPATHs are separate.

The only issue could be protocol mismatches between Hadoop components and Spark, which is why you should keep the jars as compatible as possible and use the Spark bundle that is closest to your Hadoop environment, version-wise.

Jacek Laskowski answered Oct 04 '22