
U-SQL Output in Azure Data Lake

Would it be possible to automatically split a table into several files based on column values if I don't know how many different key values the table contains? Is it possible to put the key value into the filename?

peterko asked Mar 06 '17 22:03



1 Answer

This is our top ask (and has been asked on Stack Overflow before, too :). We are currently working on it and hope to have it available by summer.

Until then, you have to write a script generator. I tend to use U-SQL to generate the script, but you could do it with PowerShell, T4, etc.

Here is an example:

Let's assume you want to write one file per distinct value of the column name in the following table/rowset @x:

name | value1 | value2
-----+--------+-------
A    | 10     | 20
A    | 11     | 21
B    | 10     | 30
B    | 100    | 200

You would write a generator script like the following, which emits one OUTPUT statement per distinct name:

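// Sample rowset; in practice @x comes from your own EXTRACT or SELECT.
@x =
    SELECT *
    FROM (VALUES
            ("A", 10, 20),
            ("A", 11, 21),
            ("B", 10, 30),
            ("B", 100, 200)
         ) AS T(name, value1, value2);

// Generate the script to do partitioned output based on name column:
// one OUTPUT statement per distinct name, with the name in the file path.
@stmts =
    SELECT "OUTPUT (SELECT value1, value2 FROM @x WHERE name == \"" + name + "\") TO \"/output/" + name + ".csv\" USING Outputters.Csv();" AS output
    FROM (SELECT DISTINCT name FROM @x) AS x;

// Write the statements as unquoted plain text so the file is itself a script.
OUTPUT @stmts
TO "/output/genscript.usql"
USING Outputters.Text(delimiter:' ', quoting:false);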

Then you take genscript.usql, prepend the calculation of @x and submit it to get the data partitioned into the two files.
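
For the sample rowset above, the script you end up submitting would look roughly like the sketch below. The two OUTPUT statements are what the generator's string concatenation produces for the names A and B. Their order within genscript.usql is not guaranteed, and the @x definition is simply the sample data repeated, so treat this as an illustration rather than the exact file contents.

```usql
// Prepended by hand: the same calculation of @x used by the generator.
@x = SELECT * FROM (VALUES( "A", 10, 20), ("A", 11, 21), ("B", 10, 30), ("B", 100, 200)) AS T(name, value1, value2);

// Contents of genscript.usql: one statement per distinct name.
OUTPUT (SELECT value1, value2 FROM @x WHERE name == "A") TO "/output/A.csv" USING Outputters.Csv();
OUTPUT (SELECT value1, value2 FROM @x WHERE name == "B") TO "/output/B.csv" USING Outputters.Csv();
```

Because the key value is spliced into both the WHERE clause and the file path, key values containing quotes or characters that are not valid in a path would need escaping in the generator first.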

Michael Rys answered Sep 25 '22 17:09