 

Apache Spark vs Apache Spark 2 [closed]

What improvements does Apache Spark 2 bring compared to Apache Spark 1?

  1. From an architecture perspective
  2. From an application point of view
  3. or anything else
asked Oct 21 '16 by YoungHobbit



2 Answers

Apache Spark 2.0.0's APIs have stayed largely similar to 1.x, although Spark 2.0.0 does have some breaking API changes.

Apache Spark 2.0.0 is the first release on the 2.x line. The major updates are API usability, SQL 2003 support, performance improvements, Structured Streaming, R UDF support, and operational improvements.
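As a rough illustration of the SQL 2003 support, here is a minimal sketch (Spark 2.x, Scala); the orders data and column names are invented for illustration, but subqueries like the one below only started working in 2.0:

    // Minimal sketch: Spark 2.0's SQL engine understands SQL 2003 features
    // such as scalar subqueries. Data and names below are hypothetical.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sql2003-sketch")
      .master("local[*]")            // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    Seq((1, "a", 10.0), (2, "b", 99.0), (3, "c", 55.0))
      .toDF("id", "name", "amount")
      .createOrReplaceTempView("orders")

    // Scalar subquery in the WHERE clause -- new in Spark 2.0.
    spark.sql(
      """SELECT id, name
        |FROM orders
        |WHERE amount > (SELECT avg(amount) FROM orders)""".stripMargin
    ).show()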

New in Spark 2:

  • The biggest change is that the Dataset and DataFrame APIs have been merged: a DataFrame is now simply a Dataset[Row] (see the sketch after this list).
  • Spark 2.0 is a whole lot more efficient than its predecessors, focusing on a combination of Parquet and caching to achieve even better throughput.
  • Structured Streaming is another big thing!
  • It is the first version to focus on ETL; successive versions will add more operators and libraries for ETL.
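To make the merged API concrete, here is a minimal sketch (Spark 2.x, Scala); the Person case class and sample rows are invented for illustration. In Spark 2, DataFrame is just a type alias for Dataset[Row], and the new SparkSession replaces the separate SQLContext/HiveContext entry points:

    import org.apache.spark.sql.SparkSession

    // SparkSession is the single entry point in Spark 2.x.
    val spark = SparkSession.builder()
      .appName("merged-api-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    case class Person(name: String, age: Int)

    // Untyped view: DataFrame is merely an alias for Dataset[Row].
    val df = Seq(Person("Ann", 34), Person("Bo", 19)).toDF()

    // Typed view of the very same data: Dataset[Person].
    val ds = df.as[Person]
    ds.filter(_.age > 21).show()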

You can go through the Spark 2.0.0 release notes, where the updates are explained under the following headings:

  • API Stability
  • Core and Spark SQL
  • MLlib
  • SparkR
  • Streaming
  • Dependency, Packaging, and Operations
  • Removals, Behavior Changes and Deprecations
  • Known Issues
answered Oct 12 '22 by bob


There is not much difference with respect to architecture: the core is still the DAG and the RDD, which is the most important part of it!

Spark 2.0 is much more optimized, though, and its Dataset API puts much more power in the hands of developers. So I would say the architecture is the same; Spark 2.0 is just more optimized and offers a richer set of APIs.
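A quick way to see that the RDD core is unchanged: the classic word count below is written against the RDD API and looks the same on 1.x and 2.x (on 2.x you would normally obtain the context as spark.sparkContext instead of constructing it); the sample lines are made up:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // Same transformations, same lazily built DAG of stages on 1.x and 2.x.
    val counts = sc.parallelize(Seq("the dag is still the dag", "rdds underpin datasets"))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)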

These are the main things that are provided by Apache Spark 2.0:

  • The biggest change is that the Dataset and DataFrame APIs have been merged: a DataFrame is now simply a Dataset[Row].
  • Spark 2.0 is a whole lot more efficient than its predecessors, focusing on a combination of Parquet and caching to achieve even better throughput.
  • Structured Streaming is another big thing (see the sketch after this list)!
  • It is the first version to focus on ETL; successive versions will add more operators and libraries for ETL.
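For the Structured Streaming bullet, here is a minimal sketch (Spark 2.x, Scala) assuming a local socket source fed by something like nc -lk 9999; note that the streaming query is written with exactly the same DataFrame/Dataset operations as a batch job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Unbounded input: lines arriving on a local socket (assumes nc -lk 9999).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // A running word count, written like ordinary batch code.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Print the full updated result table to the console after each batch.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()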

For more information, please take a look here: https://www.quora.com/What-are-special-features-and-advantages-of-Apache-Spark-2-0-over-earlier-versions

answered Oct 12 '22 by Shivansh