Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres Materialized Path - What are the benefits of using ltree?

Materialized Path is a method for representing hierarchy in SQL. Each node contains the path itself and all its ancestors (grandparent/parent/self).

The django-treebeard implementation of MP (docs):

  1. Each step of the path is a fixed length for consistent performance.

  2. Each node contains depth and numchild fields (fast reads at minimal cost to writes).

  3. The path field is indexed (with a standard b-tree index):

    The materialized path approach makes heavy use of LIKE in your database, with clauses like WHERE path LIKE '002003%'. If you think that LIKE is too slow, you’re right, but in this case the path field is indexed in the database, and all LIKE clauses that don’t start with a % character will use the index. This is what makes the materialized path approach so fast.

Implementation of get_ancestors (link):

Match nodes with a path that contains a subset of the current path (steplen is the fixed length of a step).

paths = [
    self.path[0:pos]
    for pos in range(0, len(self.path), self.steplen)[1:]
]
return get_result_class(self.__class__).objects.filter(
    path__in=paths).order_by('depth')

Implementation of get_descendants (link):

Match nodes with a depth greater than self and a path which starts with current path.

return cls.objects.filter(
    path__startswith=parent.path,
    depth__gte=parent.depth
).order_by(
    'path'
)

Potential downsides to this approach:

  1. A deeply nested hierarchy will result in long paths, which can hurt read performance.
  2. Moving a node requires updating the path of all descendants.

Postgres includes the ltree extension which provides a custom GiST index (docs).

I am not clear which benefits ltree provides over django-treebeard's implementation. This article argues that only ltree can answer the get_ancestors question, but as demonstrated earlier, figuring out the ancestors (or descendants) of a node is trivial.

[As an aside, if found this Django ltree library - https://github.com/mariocesar/django-ltree].

Both approaches use an index (django-treebeard uses b-tree, ltree uses a custom GiST). I am interested in understanding the implementation of the ltree GiST and why it might be a more efficient index than a standard b-tree for this particular use case (materialized path).

Additional links

What are the options for storing hierarchical data in a relational database?

https://news.ycombinator.com/item?id=709970

like image 244
Avi Kaminetzky Avatar asked Dec 02 '19 03:12

Avi Kaminetzky


People also ask

What is Ltree in PostgreSQL?

ltree stores a label path. lquery represents a regular-expression-like pattern for matching ltree values. A simple word matches that label within a path. A star symbol ( * ) matches zero or more labels. These can be joined with dots to form a pattern that must match the whole label path.

What is materialized path?

The Materialized Paths pattern stores each tree node in a document; in addition to the tree node, document stores as a string the id(s) of the node's ancestors or path.

Can you index a materialized view Postgres?

When you create a materialized view in PostgreSQL, it uses a regular database table underneath. You can create database indexes on the materialized view directly and improve performance of queries that access the materialized view.


1 Answers

TL;DR Reusable labels, complex search patterns, and ancestry searches against multiple descendant nodes (or a single node whose path hasn't yet been retrieved) can't be accomplished using a materialized path index.


For those interested in the gory details...

Firstly, your question is only relevant if you are not reusing any labels in your node description. If you were, the l-tree is really the only option of the two. But materialized path implementations don't typically need this, so let's put that aside.

One obvious difference will be in the flexibility in the types of searches that l-tree gives you. Consider these examples (from the ltree docs linked in your question):

foo         Match the exact label path foo
*.foo.*     Match any label path containing the label foo
*.foo       Match any label path whose last label is foo

The first query is obviously achievable with materialized path. The last is also achievable, where you'd adjust the query as a sibling lookup. The middle case, however, isn't directly achievable with a single index lookup. You'd either have to break this up into two queries (all descendants + all ancestors), or resort to a table scan.

And then there are really complex queries like this one (also from the docs):

Top.*{0,2}.sport*@.!football|tennis.Russ*|Spain

A materialized path index would be useless here, and a full table scan would be required to handle this. l-tree is the only option if you want to perform this as a SARGable query.

But for the standard hierarchical operations, finding any of:

  • parent
  • children
  • descendants
  • root nodes
  • leaf nodes

materialized path will work just as well as l-tree. Contrary to the article linked above, searching for all descendants of a common ancestor is very doable using a b-tree. The query format WHERE path LIKE 'A.%' is SARGable provided your index is prepared properly (I had to explicitly tag my path index with varchar_pattern_ops to get this to work).

What is missing from this list is finding all ancestors for a descendant. The query format WHERE 'A.B.C.D' LIKE path || '.%' is unfortunately not going to use the index. One workaround that some libraries implement is to parse out the ancestor nodes from the path, and query them directly: WHERE id IN ('A', 'B', 'C'). However, this will only work if you're targeting ancestors of a specific node whose path you have already retrieved. l-tree is going to win on this one.

like image 157
PinnyM Avatar answered Dec 17 '22 23:12

PinnyM