
ArangoDB Faceted Search Performance

We are evaluating ArangoDB performance for facet calculations. A number of other products can do the same, either via a special API or a query language:

  • MarkLogic Facets
  • ElasticSearch Aggregations
  • Solr Faceting etc

We understand that there is no special API in ArangoDB to calculate facets explicitly. In practice it is not needed: thanks to the comprehensive AQL, a facet can be computed with a simple query, like:

 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

This query calculates a facet on attribute1 and yields frequencies in the form of:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]

It says that across my entire collection attribute1 takes three values (test-attr1-1, test-attr1-2 and test-attr1-3), with the related counts provided. In effect, we run a DISTINCT query and aggregate the counts.
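To make the "DISTINCT plus counts" reading concrete, here is a minimal in-memory sketch in Python of what the query computes. The `facet` helper and the sample documents are hypothetical, and this says nothing about how ArangoDB executes COLLECT internally.

```python
from collections import Counter

def facet(docs, attribute):
    """Group documents by one attribute and count each distinct value.

    Assumes every document has the attribute; values are returned in
    first-seen order, which is not meant to mirror AQL's ordering.
    """
    counts = Counter(doc[attribute] for doc in docs)
    return [{"value": value, "count": count} for value, count in counts.items()]

# Hypothetical sample data, shaped like the Asset documents in the question.
sample = [
    {"name": "test-asset-1", "attribute1": "test-attr1-1"},
    {"name": "test-asset-2", "attribute1": "test-attr1-2"},
    {"name": "test-asset-3", "attribute1": "test-attr1-1"},
]
result = facet(sample, "attribute1")
```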

Looks simple and clean, with only one, but really big, issue: performance.

The query above runs for 31 seconds (!) on a test collection with only 8M documents. We have experimented with different index types and storage engines (with RocksDB and without), and investigated the explain plans, to no avail. The test documents are very concise, with only three short attributes.

We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB simply is not designed to perform in this particular area.

By the way, the ultimate goal would be to run something like the following in under a second:

LET docs = (FOR a IN Asset 
  FILTER a.name like 'test-asset-%'
  SORT a.name
 RETURN a)
LET attribute1 = (
 FOR a in docs 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g[*])}
)
LET attribute2 = (
 FOR a in docs 
  COLLECT attr = a.attribute2 INTO g
 RETURN { value: attr, count: length(g[*])}
)
LET attribute3 = (
 FOR a in docs 
  COLLECT attr = a.attribute3 INTO g
 RETURN { value: attr, count: length(g[*])}
)
LET attribute4 = (
 FOR a in docs 
  COLLECT attr = a.attribute4 INTO g
 RETURN { value: attr, count: length(g[*])}
)
RETURN {
  counts: (RETURN {
    total: LENGTH(docs), 
    offset: 2, 
    to: 4, 
    facets: {
      attribute1: {
        from: 0, 
        to: 5,
        total: LENGTH(attribute1)
      },
      attribute2: {
        from: 5, 
        to: 10,
        total: LENGTH(attribute2)
      },
      attribute3: {
        from: 0, 
        to: 1000,
        total: LENGTH(attribute3)
      },
      attribute4: {
        from: 0, 
        to: 1000,
        total: LENGTH(attribute4)
      }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
  facets: {
    attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
    attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
    attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
    attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 return a)
   }
}

Thanks!

Roman Kuzmik asked Sep 14 '17 14:09


1 Answer

It turns out the main discussion took place on the ArangoDB Google Group. Here is a link to the full discussion

Here is a summary of the current solution:

  • Run a custom build of ArangoDB from a specific feature branch where a number of performance improvements have been made (hopefully they will make it into a main release soon)
  • No indexes are required for facet calculations
  • MMFiles is the preferred storage engine
  • AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)"
  • AQL should be split into smaller pieces and run in parallel (we use Java 8's Fork/Join to spread the facet AQLs and then join them into a final result)
  • One AQL to filter/sort and retrieve the main entity (if required; when sorting/filtering, add the corresponding skiplist index)
  • The rest are small AQLs, one for each facet's value/frequency pairs
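
As a rough illustration of the split-and-parallelize approach, here is a minimal Python sketch. It is not the Java 8 Fork/Join code we actually run: a thread pool stands in for Fork/Join, and `run_aql` is a hypothetical callable (query string in, list of rows out) that would wrap whatever ArangoDB driver you use. The collection and attribute names are taken from the question.

```python
from concurrent.futures import ThreadPoolExecutor

def facet_query(collection, attribute):
    """Build one small per-facet AQL using COLLECT ... WITH COUNT INTO.

    The names are interpolated directly, so they are assumed to be
    trusted identifiers, not user input.
    """
    return (
        f"FOR a IN {collection} "
        f"COLLECT attr = a.{attribute} WITH COUNT INTO length "
        f"RETURN {{ value: attr, count: length }}"
    )

def run_facets(run_aql, collection, attributes, max_workers=4):
    """Run one facet query per attribute in parallel and join the results.

    run_aql is a hypothetical executor: it takes an AQL string and
    returns the result rows as a list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            attribute: pool.submit(run_aql, facet_query(collection, attribute))
            for attribute in attributes
        }
        # Collect results keyed by attribute; result() blocks until done.
        return {attribute: future.result() for attribute, future in futures.items()}
```

A call such as `run_facets(run_aql, "Asset", ["attribute1", "attribute2"])` would then return a dictionary with one list of value/count rows per facet, while the main filter/sort query for the items runs as a separate AQL.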

In the end we achieved a >10x performance gain compared to the original AQL provided above.

Roman Kuzmik answered Sep 28 '22 01:09