
Azure Spark SQL vs U-SQL

I have a lot of data files that will eventually be pushed to and stored in Azure Storage/Data Lake at regular intervals. I want to provide a way to run analytics on this data, but I see that Azure offers two approaches:

  1. U-SQL / Azure Data Lake query (Visualization ???)
  2. Spark SQL using Spark on Azure and Zeppelin

Can someone suggest when to use which of these approaches? It looks to me like both can do a similar job.

Kiran asked Feb 23 '16 10:02




1 Answer

You can think of U-SQL as Microsoft's version of Spark SQL: you write SQL Server-styled SQL and extend it with user-defined functions in C#. With Spark, you write in a roughly MySQL-styled SQL dialect and extend it with either Scala or Python.

If you are familiar with Scala or Python, then Spark on HDInsight might be the better choice. Spark also comes with GraphX and MLlib, which at the moment have no analogues in Data Lake Analytics. And if you need something that works outside of Azure, then Spark SQL is your only option.

Another important dimension to think about is pricing. Data Lake Analytics only costs money while your query is executing, whereas HDInsight costs money for as long as the cluster is running. Depending on the size of the data and the complexity of your queries, Data Lake Analytics can be cheaper because you aren't charged while it is provisioning or sitting idle between queries.
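To make the pricing trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model in which the pay-per-query service bills per unit of parallelism per hour of query execution, and the cluster bills per node per hour of uptime. The function names and all rates are hypothetical placeholders, not real Azure prices, so substitute the current figures from the pricing pages before drawing conclusions.

```python
# Simplified cost model. All rates below are made-up placeholders,
# NOT actual Azure prices.

def per_query_cost(query_hours, units, rate_per_unit_hour):
    """Pay-per-query model: billed only while queries are executing."""
    return query_hours * units * rate_per_unit_hour


def cluster_cost(uptime_hours, nodes, rate_per_node_hour):
    """Cluster model: billed for every hour the cluster is up, busy or idle."""
    return uptime_hours * nodes * rate_per_node_hour


def break_even_query_hours(uptime_hours, nodes, rate_per_node_hour,
                           units, rate_per_unit_hour):
    """Query-hours at which the pay-per-query bill equals the cluster bill."""
    hourly_query_rate = units * rate_per_unit_hour
    return cluster_cost(uptime_hours, nodes, rate_per_node_hour) / hourly_query_rate


if __name__ == "__main__":
    # Hypothetical month: a 4-node cluster kept up for 720 hours,
    # versus queries that each run on 10 units of parallelism.
    always_on = cluster_cost(720, 4, 0.5)
    on_demand = per_query_cost(30, 10, 2.0)
    threshold = break_even_query_hours(720, 4, 0.5, 10, 2.0)
    print("cluster:", always_on)
    print("pay-per-query:", on_demand)
    print("break-even query-hours:", threshold)
```

Under these made-up numbers, the pay-per-query model stays cheaper as long as the total query execution time per month is below the break-even threshold. A cluster that is created only for a job and deleted afterwards narrows the gap, since its uptime shrinks accordingly.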

wm_eddie answered Oct 28 '22 22:10