I am very new to this whole world of "big data" tech, and recently started reading about Spark. One thing that keeps coming up is SparkSQL, yet I consistently fail to comprehend what exactly it is.
Is it supposed to convert SQL queries into MapReduce jobs that operate on the data you give it? But aren't DataFrames already essentially SQL tables in terms of functionality?
Or is it some tech that allows you to connect to an SQL database and use Spark to query it? In this case, what's the point of Spark in here at all - why not use SQL directly? Or is the point that you can use your structured SQL data in combination with the flat data?
Again, I am emphasizing that I am very new to all of this and may or may not be talking out of my butt :). So please do correct me and be forgiving if you see that I'm clearly misunderstanding something.
Your first guess is essentially correct: it's an API in Spark where you can write queries in SQL and they will be converted into a parallelised Spark job (Spark can do more complex types of operations than just map and reduce). Spark DataFrames are actually just a wrapper around this same engine; they're an alternative way of expressing the same queries, depending on whether you're more comfortable writing SQL or Python/Scala code.