
Dremel - repetition and definition level

While reading the Dremel paper, "Interactive Analysis of Web-Scale Datasets", I came across the concepts of repetition level and definition level.
I understand why the two are needed: to disambiguate occurrences, the paper attaches a repetition level and a definition level to each value.

What is unclear to me is how they computed the levels...

As illustrated in the picture from the paper:

It says:

Consider field Code in Figure 2. It occurs three times in r1. Occurrences ‘en-us’ and ‘en’ are inside the first Name, while ‘en-gb’ is in the third Name. To disambiguate these occurrences, we attach a repetition level to each value. It tells us at what repeated field in the field’s path the value has repeated.

The field path Name.Language.Code contains two repeated fields, Name and Language. Hence, the repetition level of Code ranges between 0 and 2; level 0 denotes the start of a new record. Now suppose we are scanning record r1 top down. When we encounter ‘en-us’, we have not seen any repeated fields, i.e., the repetition level is 0. When we see ‘en’, field Language has repeated, so the repetition level is 2.

I just can't get my head around it. Name.Language.Code in r1 has the values en-us and en. Why is r = 0 for the first one and r = 2 for the second? Is it because two fields were repeated (Language and Code)?

If it was:

Name
    Language
       Code: en-us
Name 
    Language
        Code: en
Name
    Language
        Code: en-gb

Would the (repetition, definition) levels be:

0 2
1 2
2 2 

Definition levels. Each value of a field with path p, esp. every NULL, has a definition level specifying how many fields in p that could be undefined (because they are optional or repeated) are actually present in record.

Why, then, is the definition level 2? Doesn't the path Name.Language contain two fields, Code and Country, of which only one is optional/repeated?

Tony Tannous asked Apr 23 '17 06:04




1 Answer

The Dremel striping algorithm is by no means trivial.

To answer your first question:

  • The repetition level of en-us is 0 since it is the first occurrence of a name.language.code path within the record.

  • The repetition level of en is 2, since the most recent repetition occurred at level 2: Language (the second repeated field in the path) repeated inside the same Name.
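To make the rule concrete, here is a minimal Python sketch that assigns repetition levels to the Code values of one record. It is not the paper's striping algorithm, just the rule above hard-coded for the path Name.Language.Code. The function name `repetition_levels` and the nested-list encoding are assumptions made for illustration, and the layout of r1 (second Name without any Language) is reconstructed from the paper's Figure 2.

```python
def repetition_levels(names):
    """Return (code, r) pairs for the column Name.Language.Code.

    `names` is a hypothetical encoding of one record: a list of Name
    groups, each given as the list of Code values of its Languages.
    """
    if not names:
        # No Name at all: a single NULL marks the start of the record.
        return [(None, 0)]
    out = []
    for i, languages in enumerate(names):
        # The first Name starts the record (r = 0); any later Name
        # means the field Name repeated (r = 1).
        name_r = 0 if i == 0 else 1
        if not languages:
            # A Name without Language still emits a NULL placeholder.
            out.append((None, name_r))
            continue
        for j, code in enumerate(languages):
            # A second or later Language in the same Name means the
            # field Language repeated (r = 2).
            out.append((code, name_r if j == 0 else 2))
    return out


# r1 as assumed from the paper: en-us and en in the first Name,
# nothing in the second, en-gb in the third.
r1 = [["en-us", "en"], [], ["en-gb"]]
print(repetition_levels(r1))
# [('en-us', 0), ('en', 2), (None, 1), ('en-gb', 1)]
```

The level depends only on which repeated field repeated most recently, not on how many fields lie on the path, which is why en gets 2 and en-gb gets 1.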

To answer your second question, for the following record,

DocId: 20
Name
  Language
    Code: en-us
Name 
  Language
    Code: en
Name
  Language
    Code: en-gb

the entries for name.language.code would be

en-us 0 2
en    1 2
en-gb 1 2 

Explanation:

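  • The definition level is always 2, since the two repeated fields on the path, Name and Language, are both present. Code itself is required in the paper's schema, so it can never be undefined and does not add to the level; sibling fields such as Country are not on the path and do not count either.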
  • The repetition level for en-us is zero, since it is the first name.language.code within the record.
  • The repetition level for en and en-gb is 1, since the repetition occurred at the name tag (level 1).
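The reasoning above can be checked with a small Python sketch that walks one record along a single field path and emits (value, repetition level, definition level) triples. This is a simplified illustration, not the paper's column-striping algorithm: the function `column_levels`, the dict/list encoding of the record, and the `kinds` list are hypothetical, and the field kinds (Name repeated, Language repeated, Code required) are taken from the schema in the paper's Figure 2.

```python
def column_levels(record, path, kinds):
    """Return (value, r, d) triples for one leaf path of one record.

    path  -- field names from the root to the leaf
    kinds -- 'required', 'optional' or 'repeated' for each path entry
    """
    out = []

    def walk(node, i, r, d):
        name, kind = path[i], kinds[i]
        value = node.get(name)
        if kind == "repeated":
            children = value or []
        else:
            children = [] if value is None else [value]
        if not children:
            # Path is cut off here: emit NULL with the levels so far.
            out.append((None, r, d))
            return
        # Only optional/repeated fields raise the definition level.
        child_d = d + (kind != "required")
        # Level at which this field repeats: the number of repeated
        # fields on the path up to and including it.
        this_rep = sum(k == "repeated" for k in kinds[: i + 1])
        for idx, child in enumerate(children):
            # The first child inherits r; later ones repeated here.
            child_r = r if idx == 0 else this_rep
            if i == len(path) - 1:
                out.append((child, child_r, child_d))
            else:
                walk(child, i + 1, child_r, child_d)

    walk(record, 0, 0, 0)
    return out


record = {
    "DocId": 20,
    "Name": [
        {"Language": [{"Code": "en-us"}]},
        {"Language": [{"Code": "en"}]},
        {"Language": [{"Code": "en-gb"}]},
    ],
}
path = ["Name", "Language", "Code"]
kinds = ["repeated", "repeated", "required"]
for triple in column_levels(record, path, kinds):
    print(triple)
# ('en-us', 0, 2)
# ('en', 1, 2)
# ('en-gb', 1, 2)
```

Changing the last entry of `kinds` to 'optional' (as for Country in the paper's schema) would raise the definition level of present values to 3, which shows that the level counts optional and repeated fields on the path rather than siblings under Language.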
user152468 answered Oct 03 '22 22:10
