I don't want to use ADL and ADLA as a black box. I need to understand how the gears turn under the hood so I can use them efficiently.
Where can I find information that describes the internals?
There are a lot of books and whitepapers that describe the internals of RDBMS engines. Does anything like that exist for ADL/ADLA?
There are a lot of people here who work on Azure. Could you publish any drafts or whitepapers to use as is (unofficially)?
Some of that information is available in presentations we have given. For example you can find some of these presentations on my slideshare account at: http://www.slideshare.net/MichaelRys.
To answer some of your questions above:
The current clustered-index version of U-SQL tables is stored in your catalog folder as so-called structured stream files. These are highly compressible, scaled-out files that use a row-oriented structure with self-contained metadata and statistics (more detailed stats can be created). The table construct provides two-level partitioning: addressable partitions and internal distribution schemes (HASH, RANGE, etc.). Both help with parallelization, although distribution schemes are more for performance, while partitions are more for data lifecycle management. There is no limit on either, although the sweet spot is 1GB to 4GB per distribution bucket.
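To make the 1GB to 4GB sweet spot concrete, here is a rough back-of-the-envelope sketch in Python. It is not an ADLA API: the function names, the 2 GiB default target, and the assumption that rows spread evenly across buckets (no skew) are all mine, purely for illustration.

```python
import math

GIB = 1024 ** 3  # assumption: "GB" in the sweet spot is treated as GiB here


def suggest_bucket_count(table_bytes, target_bucket_bytes=2 * GIB):
    """Hypothetical helper: bucket count that puts each bucket near the target size.

    Assumes data is spread evenly across buckets, which real data may not be.
    """
    if table_bytes <= 0:
        return 1
    return max(1, math.ceil(table_bytes / target_bucket_bytes))


def in_sweet_spot(table_bytes, buckets, low=1 * GIB, high=4 * GIB):
    """Check whether the average bucket size falls in the 1 to 4 GiB range."""
    per_bucket = table_bytes / buckets
    return low <= per_bucket <= high


if __name__ == "__main__":
    size = 100 * GIB
    n = suggest_bucket_count(size)
    print(n, in_sweet_spot(size, n))
```

The resulting number is only a starting point for the bucket count you give the distribution scheme; skewed keys can still produce buckets well outside the range.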
1 AU is basically 1 container. And ADLS is NOT HDFS architecturally but offers the WebHDFS API for compatibility.
This is a pretty broad question. I assume you've started with the existing documentation on ADLA and U-SQL? https://learn.microsoft.com/en-us/azure/data-lake-analytics/ https://msdn.microsoft.com/library/azure/mt591959
ADLA GA'd in November of 2016, compared to SQL Server in 1987 - that's a very apples and oranges comparison.
Maybe we can start with your specific questions?