
Are there any books about Azure Data Lake internals?

I don't want to use ADL and ADLA as a black box. I need to understand how the gears turn under the hood in order to use them efficiently.

Where can I find information that describes the internals:

  1. how a U-SQL query is processed
  2. how parallelism works
  3. how storage is organized in ADL at a low level
  4. how database storage is organized in ADL at a low level (is it rowstore or columnstore?)
  5. how partitioning is organized
  6. etc.

There are a lot of books and whitepapers that describe the internals of RDBMS engines. Does anything similar exist for ADL/ADLA?

There are a lot of people who work on Azure. Could you publish any drafts or whitepapers to use as is (unofficially)?

churupaha asked Feb 05 '23 01:02


2 Answers

Some of that information is available in presentations we have given. For example, you can find some of these presentations on my SlideShare account at http://www.slideshare.net/MichaelRys.

To answer some of your questions above:

The current clustered-index version of U-SQL tables is stored in your catalog folder as so-called structured stream files. These are highly compressible, scaled-out files that use a row-oriented structure with self-contained metadata and statistics (more detailed statistics can be created). The table construct provides two-level partitioning: addressable partitions and internal distribution schemes (HASH, RANGE, etc.). Both help with parallelization, although distribution schemes are more about performance, while partitions are more about data lifecycle management. There is no limit on either, although the sweet spot is 1 GB to 4 GB per distribution bucket.
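As a rough illustration of the 1 GB to 4 GB sweet spot, here is a minimal Python sketch that picks a distribution bucket count for a table of a given size and shows how a hash scheme might assign keys to buckets. The helper names (`suggest_bucket_count`, `bucket_for`), the 2 GB default target, and the use of SHA-256 are assumptions made for this example only; this is not how ADLA computes distributions internally.

```python
import hashlib
import math

GB = 1024 ** 3


def suggest_bucket_count(table_bytes, target_bucket_bytes=2 * GB):
    """Return a bucket count that keeps each bucket near the target size.

    The 2 GB default is an assumed midpoint of the 1-4 GB sweet spot.
    """
    return max(1, math.ceil(table_bytes / target_bucket_bytes))


def bucket_for(key, bucket_count):
    """Map a key to a bucket using a stable hash.

    Illustrative only: SHA-256 stands in for whatever hash the engine uses.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count


if __name__ == "__main__":
    table_size = 100 * GB
    buckets = suggest_bucket_count(table_size)
    print("buckets:", buckets)
    print("GB per bucket:", table_size / buckets / GB)
    print("bucket for key 'customer-42':", bucket_for("customer-42", buckets))
```

With a 100 GB table and a 2 GB target this yields 50 buckets, which stays inside the recommended range; lowering or raising the target between 1 GB and 4 GB trades more parallelism against fewer, larger buckets.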

1 AU is basically 1 container. ADLS is NOT HDFS architecturally, but it offers the WebHDFS API for compatibility.

Michael Rys answered Feb 08 '23 04:02


This is a pretty broad question. I assume you've started with the existing documentation on ADLA and U-SQL? https://learn.microsoft.com/en-us/azure/data-lake-analytics/ https://msdn.microsoft.com/library/azure/mt591959

ADLA reached general availability in November 2016, while SQL Server dates back to the late 1980s, so that's very much an apples-to-oranges comparison.

Maybe we can start with your specific questions?

guyhay_MSFT answered Feb 08 '23 03:02