 

What does "pre-built for Apache Hadoop 2.7 and later" mean?

Tags:

apache-spark

What does "pre-built for Apache Hadoop 2.7 and later" mean on the download page of Apache Spark?

Does it mean it includes the necessary libraries for HDFS in Spark? If so, what about other storage systems like Cassandra, S3, HBase, SQL databases, and NoSQL databases? Do we need to download any libraries to connect to those storage systems?

Kannan asked Sep 14 '17

People also ask

What is the difference between Hadoop and Apache Hadoop?

Apache Hadoop and Apache Spark are both open-source frameworks for big data processing, with some key differences. Hadoop uses MapReduce to process data, while Spark uses resilient distributed datasets (RDDs).

Do I need to install Hadoop before Spark?

Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

Is Spark built on Hadoop?

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing.

Which is better Apache Spark or Hadoop?

Spark is much faster because it performs in-memory processing, which reduces disk read and write operations. Hadoop has slower performance because it uses disk for storage and depends on disk read and write operations.


1 Answer

Does it mean necessary libraries for HDFS in spark

Correct! Spark uses the Hadoop FileSystem API to access files (on HDFS, S3, and other Hadoop-supported file systems), and the "pre-built for Apache Hadoop 2.7 and later" version ships with the necessary libraries included.

That's mainly for Spark Core's RDD to access files with data.
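As a quick sanity check (assuming `SPARK_HOME` points at the pre-built distribution), you can list the bundled Hadoop client jars and exercise the Hadoop FileSystem API from the RDD layer; the HDFS URL below is a placeholder for your own cluster and file:

```shell
# List the Hadoop client jars bundled with a "pre-built for Hadoop" Spark distribution
ls "$SPARK_HOME/jars" | grep '^hadoop-'

# Read a file through the Hadoop FileSystem API from Spark Core's RDD layer
# (hdfs://namenode:8020/data/input.txt is a placeholder path)
echo 'sc.textFile("hdfs://namenode:8020/data/input.txt").count()' \
  | "$SPARK_HOME/bin/spark-shell" --master "local[*]"
```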

what about other storage systems like Cassandra, S3, HBase, SQL databases, and NoSQL databases? Do we need to download any libraries to connect to those storage systems?

Out of the mentioned storage systems, S3 is covered partially by the "pre-built for Apache Hadoop 2.7 and later" bundle (but you have to add additional jars for S3 specifically).

That's mostly for Spark SQL's Dataset API.
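A sketch of adding those extra S3 jars at submit time (assuming the S3A connector; `my_app.py` and the bucket name are placeholders, and the `hadoop-aws` version shown is an illustrative choice that should match the Hadoop version Spark was built against):

```shell
# Pull the S3A connector jars, which are NOT bundled with the Spark distribution.
# hadoop-aws should match the bundled Hadoop version (2.7.x for this bundle).
"$SPARK_HOME/bin/spark-submit" \
  --packages org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  my_app.py s3a://my-bucket/input/
```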

Cassandra, HBase, etc. have their own Spark connectors, which are not included in any bundle. See the DataStax Spark Cassandra Connector and the Apache HBase Spark Connector.
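For example, the Cassandra connector can be pulled in the same way at submit time (a sketch; `my_app.py` and `cassandra-host` are placeholders, and the artifact version must match your Spark and Scala versions):

```shell
# The DataStax Spark Cassandra Connector is not part of any Spark distribution,
# so fetch it at submit time. Coordinates/version here are illustrative.
"$SPARK_HOME/bin/spark-submit" \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.cassandra.connection.host=cassandra-host \
  my_app.py
```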


If you've been wondering "if I have to run Spark on YARN, which package type should I use?", just use "Pre-built for Apache Hadoop" with the Hadoop version ("2.7" vs "3.2 and later") matching the version of Hadoop in use (which is likely the version of Hadoop running YARN).
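Submitting to YARN with that bundle then looks like this (a minimal sketch; `my_app.py` and the configuration path are placeholders for your environment):

```shell
# Submit to YARN with a "Pre-built for Apache Hadoop" bundle whose Hadoop
# version matches the cluster. HADOOP_CONF_DIR must point at the cluster's
# configuration so Spark can locate the ResourceManager and HDFS.
export HADOOP_CONF_DIR=/etc/hadoop/conf
"$SPARK_HOME/bin/spark-submit" \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
```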


So "spark-prebuilt-with-hadoop-x.y" means that Spark includes Hadoop x.y in its jars directory. That obviously makes the distribution larger than "spark-without-hadoop". It also means that once you upgrade your HDFS to Hadoop 3.2 while the Spark distribution is still "with Hadoop 2.7", you can still use it, but some features simply won't be supported and your application can be less optimized (by HDFS itself), not to mention all the bugs fixed (and new ones introduced).
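With the "without Hadoop" (hadoop-free) distribution, the documented way to supply the Hadoop jars yourself is to point `SPARK_DIST_CLASSPATH` at the output of `hadoop classpath`, so Spark picks up the cluster's own Hadoop jars rather than bundled ones:

```shell
# Hadoop-free Spark build: borrow the Hadoop jars already installed on the
# machine instead of shipping a (possibly mismatched) copy inside Spark.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"
"$SPARK_HOME/bin/spark-shell" --master "local[*]"
```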


Wouldn't there be a conflict between the Hadoop jars present in spark-prebuilt-with-hadoop-x.y and those in hadoop-x.y? spark-prebuilt-with-hadoop-x.y gave me the impression that all the necessary Hadoop stuff (e.g., YARN) would be present in the Spark binary. Hence my confusion that all of Hadoop should be present in the umbrella spark-prebuilt-with-hadoop-x.y.

Not really, if you think about the underlying communication between a Spark app and Hadoop HDFS or Hadoop YARN: it happens between separate applications living in their own containers (possibly in Docker), so their CLASSPATHs are separate.

The only issue could be protocol mismatches between Hadoop components and Spark, which is why you should keep the jars as compatible as possible and use the Spark bundle that is closest to your Hadoop environment, version-wise.

Jacek Laskowski answered Oct 04 '22