I have a lot of data files that will eventually be pushed to and stored in Azure Storage/Data Lake at regular intervals. I want to provide the ability to run analytics on this data, but I see that Azure offers two approaches: U-SQL (Azure Data Lake Analytics) and Spark SQL (HDInsight).
Can someone suggest when to use which of these approaches? It looks to me like both can do a similar job.
U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale. Through the scalable, distributed-query capability of U-SQL, you can efficiently analyze data across relational stores such as Azure SQL Database.
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
Another important advantage of Spark SQL is that the loading and querying can be done for data from different sources. Hence, the data access is unified. It offers standard connectivity as Spark SQL can be connected through JDBC or ODBC. It can be used for faster processing of Hive tables.
The integration between SQL and C# in U-SQL is based on SCOPE, as is U-SQL's query execution and optimization framework. U-SQL's metadata system, SQL syntax and language semantics are modeled on standard ANSI SQL and Transact-SQL (T-SQL), Microsoft's implementation of the query language for its SQL Server database.
You can think of U-SQL as Microsoft's version of Spark SQL, where you write SQL Server-style SQL and extend it with user-defined functions in C#. With Spark, by contrast, you write a loosely MySQL-style SQL and extend it with either Scala or Python.
If you are familiar with Scala or Python, then HDInsight might be the best choice. Spark comes with GraphX and MLlib, which at the moment have no analogues in Data Lake Analytics. Also, if you need something that works outside of Azure, then Spark SQL is your only option.
Another important dimension to think about is pricing. Data Lake Analytics only costs money while your query is executing, whereas HDInsight costs money for as long as the cluster is running. Depending on the size of the data and the complexity of your queries, Data Lake Analytics can be cheaper because you pay only for execution time and aren't charged while it's provisioning.
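To make the pricing trade-off concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (hourly rates, node and unit counts, query hours) is a made-up assumption for illustration, not actual Azure pricing, and the function names are hypothetical. It only shows the shape of the comparison: pay-per-execution versus pay-per-uptime.

```python
# Rough cost comparison: pay-per-query vs. always-on cluster.
# All rates and sizes below are hypothetical placeholders, NOT real Azure prices.

def on_demand_cost(query_hours, units, rate_per_unit_hour):
    """Pay-per-query model: billed only while queries are executing."""
    return query_hours * units * rate_per_unit_hour


def cluster_cost(uptime_hours, nodes, rate_per_node_hour):
    """Cluster model: billed for every hour the cluster is up, busy or idle."""
    return uptime_hours * nodes * rate_per_node_hour


def break_even_query_hours(uptime_hours, nodes, rate_per_node_hour,
                           units, rate_per_unit_hour):
    """Query hours at which the pay-per-query bill equals the cluster bill."""
    cluster_total = cluster_cost(uptime_hours, nodes, rate_per_node_hour)
    return cluster_total / (units * rate_per_unit_hour)


DAYS = 30
QUERY_HOURS_PER_DAY = 2  # assumed: queries actually run about 2 hours a day

# Assumed: 10 parallel units at 2.0 per unit-hour, billed only while running.
pay_per_query = on_demand_cost(QUERY_HOURS_PER_DAY * DAYS, 10, 2.0)

# Assumed: a 4-node cluster at 1.0 per node-hour, left running all month.
always_on = cluster_cost(24 * DAYS, 4, 1.0)

break_even = break_even_query_hours(24 * DAYS, 4, 1.0, 10, 2.0)

print(f"pay-per-query monthly cost: {pay_per_query}")
print(f"always-on cluster monthly cost: {always_on}")
print(f"break-even query hours per month: {break_even}")
```

With these invented numbers the pay-per-query model wins as long as queries run fewer than the break-even hours per month. Plug in the current published rates and your own workload before drawing any conclusion.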