Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Arango DB performace: edge vs. DOCUMENT()

Tags:

I'm new to arangoDB with graphs. I simply want to know if it is faster to build edges or use 'DOCUMENT()' for very simple 1:1 connections where a querying the graph is not needed?

LET a = DOCUMENT(@from)
FOR v IN OUTBOUND a
CollectionAHasCollectionB
RETURN MERGE(a,{b:v})

vs

LET a = DOCUMENT(@from)
RETURN MERGE(a,{b:DOCUMENT(a.bId)}
like image 663
Richard Burkhardt Avatar asked Sep 06 '17 07:09

Richard Burkhardt


2 Answers

A simple benchmark you can try:

Create the collections products, categories and an edge collection has_category. Then generate some sample data:

FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories
FOR p IN products
    LET random_categories = (
    FOR c IN categories
        SORT RAND()
        LIMIT 5
        RETURN c._id
    )
    LET category_subset = SLICE(random_categories, 0, RAND()*5+1)

    UPDATE p WITH {
        categories: category_subset,
        categoriesEmbedded: DOCUMENT(category_subset)[*].name
    } INTO products

    FOR cat IN category_subset
        INSERT {_from: p._id, _to: cat} INTO has_category

Then compare the query times for the different approaches.

Graph traversal (depth 1..1):

FOR p IN products
    RETURN {
        product: p.name,
        categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
    }

Look-up in categories collection using DOCUMENT():

FOR p IN products
    RETURN {
        product: p.name,
        categories: DOCUMENT(p.categories)[*].name
    }

Using the directly embedded category names:

FOR p IN products
    RETURN {
        product: p.name,
        categories: p.categoriesEmbedded
    }

Graph traversal is the slowest of all 3, the lookup in another collection is faster than the traversal, but the by far fastest query is the one with embedded category names.

If you query the categories for just one or a few products however, the response times should be in the sub-millisecond area regardless of the data model and query approach and therefore not pose a performance problem.

The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest path etc. For your use case, it is not necessary. Whether the embedded approach is suitable or not is something you need to decide:

  • Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change the category name, you need to change it in all product records instead of just one category document, that products can refer to via the immutable ID)

  • Is there a lot of additional information per category? If so, all that data needs to be embedded into every product document that has that category - basically trading memory / storage space for performance

  • Do you need to retrieve a list of all (distinct) categories often? You can do this type of query really cheap with the separate categories collection. With the embedded approach, it will be much less efficient, because you need to go over all products and collect the category info.

Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature you can easily try another approach if your use case changes or you run into performance issues.

like image 173
CodeManX Avatar answered Sep 28 '22 20:09

CodeManX


Generally spoken, the latter variant

LET a = DOCUMENT(@from)
RETURN MERGE(a,{b:DOCUMENT(a.bId)}

should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT variant will do a point lookup of a document whereas the traversal variant is very general purpose: it can return zero to many results from a variable number of collections, needs to keep track of the path seen etc.

When I tried both variants in a local test case, the non-traversal variant was also a lot faster, supporting this claim.

However, the traversal-based variant is more flexible: it can also be used should there be multiple edges (no 1:1 mapping) and for longer paths.

like image 43
stj Avatar answered Sep 28 '22 19:09

stj