Dremel - repetition and definition level

While reading the paper "Dremel: Interactive Analysis of Web-Scale Datasets", I bumped into the concept of repetition and definition levels.
I understand the need for them: to disambiguate occurrences of a field, the paper attaches a repetition level and a definition level to each value.

What is unclear to me is how they computed the levels...

As illustrated in the picture from the paper: [image not available]

It says:

Consider field Code in Figure 2. It occurs three times in r1. Occurrences ‘en-us’ and ‘en’ are inside the first Name, while ‘en-gb’ is in the third Name. To disambiguate these occurrences, we attach a repetition level to each value. It tells us at what repeated field in the field’s path the value has repeated.

The field path Name.Language.Code contains two repeated fields, Name and Language. Hence, the repetition level of Code ranges between 0 and 2; level 0 denotes the start of a new record. Now suppose we are scanning record r1 top down. When we encounter ‘en-us’, we have not seen any repeated fields, i.e., the repetition level is 0. When we see ‘en’, field Language has repeated, so the repetition level is 2.

I just can't get my head around it. Name.Language.Code in r1 has the values en-us and en. Why does the first one have r = 0 and the second one r = 2? Is it because two fields (Language and Code) were repeated?

If it was:

Name
  Language
    Code: en-us
Name
  Language
    Code: en
Name
  Language
    Code: en-gb

Would the repetition and definition levels be the following?

0 2
1 2
2 2

Definition levels. Each value of a field with path p, esp. every NULL, has a definition level specifying how many fields in p that could be undefined (because they are optional or repeated) are actually present in the record.

Why, then, is the definition level 2? Doesn't the path Name.Language contain two fields, Code and Country, of which only one is optional or repeated?

asked Apr 23 '17 06:04

Tony Tannous

1 Answer

The Dremel striping algorithm is by no means trivial.

To answer your first question:

The repetition level of en-us is 0, since it is the first occurrence of the Name.Language.Code path within the record.
The repetition level of en is 2, since the field that repeated most recently is Language, the second repeated field in the path (level 2).
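
The rule can be sketched in a few lines of Python. This is only an illustration for the single path Name.Language.Code, not the paper's striping algorithm; the nested-list representation and the name `repetition_levels` are assumptions made for the example. r1 is written as a list of Name groups, each holding the Code values of its Language groups (the second Name in r1 carries no Code, so it is an empty list here).

```python
def repetition_levels(names):
    """Return (code, r) pairs for the path Name.Language.Code.

    `names` is a hypothetical representation of one record: a list of
    Name groups, each being a list of the Code values found in its
    Language groups. NULL entries are left out of this sketch.
    """
    result = []
    for i, name in enumerate(names):
        for j, code in enumerate(name):
            if j > 0:
                r = 2  # same Name, another Language: Language (level 2) repeated
            elif i > 0:
                r = 1  # a later Name: Name (level 1) repeated
            else:
                r = 0  # first Name of the record: nothing has repeated yet
            result.append((code, r))
    return result


# Code values of r1 as described in the quoted passage.
r1 = [["en-us", "en"], [], ["en-gb"]]
print(repetition_levels(r1))  # [('en-us', 0), ('en', 2), ('en-gb', 1)]
```

The level depends only on which repeated field most recently started a new occurrence, not on how many fields appear in the path.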

To answer your second question, for the following record,

DocId: 20
Name
  Language
    Code: en-us
Name 
  Language
    Code: en
Name
  Language
    Code: en-gb

the entries for name.language.code would be

en-us 0 2
en    1 2
en-gb 1 2

Explanation:

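The definition level is always 2, since the two fields in the path that could be undefined, Name and Language (both repeated), are present. Code itself is required, so it does not add to the level.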
The repetition level for en-us is zero, since it is the first name.language.code within the record.
The repetition level for en and en-gb is 1, since the repetition occurred at the name tag (level 1).
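
The definition-level count can be sketched in Python as well. This is a minimal illustration, not the paper's implementation: the tables `CODE_PATH` and `COUNTRY_PATH` and the helper `definition_level` are hypothetical names, and the field labels (Name and Language repeated, Code required, Country optional) are assumed from the paper's example schema.

```python
# Field labels along each path (assumed from the paper's example schema).
CODE_PATH = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
COUNTRY_PATH = [("Name", "repeated"), ("Language", "repeated"), ("Country", "optional")]


def definition_level(path, fields_present):
    """Count the optional or repeated fields among the first
    `fields_present` fields of `path`. Required fields never add
    to the level, because they cannot be undefined."""
    return sum(
        1
        for _, label in path[:fields_present]
        if label in ("optional", "repeated")
    )


print(definition_level(CODE_PATH, 3))     # 2: Name and Language count, Code is required
print(definition_level(CODE_PATH, 1))     # 1: Name present, no Language (a NULL entry)
print(definition_level(COUNTRY_PATH, 3))  # 3: Country is optional, so it counts too
```

Under these assumptions the level counts fields along one path from the root, not the sibling fields under Name.Language, which is why a present Code always has d = 2.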

answered Oct 03 '22 22:10

user152468


Tags: algorithm, data-structures, dataset, parquet, dremel
