Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query product catalog RavenDB store for spec aggregate over arbitrary collection of products

This is a continuation of the project outlined in this question.

I have the following model:

class Product {
  public string Id { get; set; }
  public string[] Specs { get; set; }
  public int CategoryId { get; set; }
}

The "Specs" array stores product specification name value pairs joined by a special character. For example if a product is colored blue the spec string would be "Color~Blue". Representing specs in this way allows querying for products having multiple spec values specified by a query. There are two principal queries that I would like to support:

  1. Get all products in a given category.
  2. Get all products in a given category which have a set of specified specs.

This works well with RavenDB. However, in addition to the products satisfying a given query I would like to return a result set which contains all spec name-value pairs for the set of products specified by the query. The spec name-value pairs should be grouped by the name and value of the spec and contain a count of products which have a given spec name-value pair. For query #1 I created the following map reduce index:

class CategorySpecGroups {
    public int CategoryId { get; set; }
    public string Spec { get; set; }
    public int Count { get; set; }
}


public class SpecGroups_ByCategoryId : AbstractIndexCreationTask<Product, CategorySpecGroups>
{
    public SpecGroups_ByCategoryId()
    {
        this.Map = products => from product in products
                               where product.Specs != null
                               from spec in product.Specs
                               select new
                               {
                                   CategoryId = product.CategoryId,
                                   Spec = spec,
                                   Count = 1
                               };

        this.Reduce = results => from result in results
                                 group result by new { result.CategoryId, result.Spec } into g
                                 select new
                                 {
                                     CategoryId = g.Key.CategoryId,
                                     Spec = g.Key.Spec,
                                     Count = g.Sum(x => x.Count)
                                 };
    }
}

I can then query this index and get all spec name-value pairs in a given category. The problem I am running into is to get the same result set but for a query which filters both by a category and a set of spec name-value pairs. When using SQL this result set would be obtained by doing a group by over a set of products filtered by category and specs. In general, this type of query is expensive but when filtering by both category and specs the product sets are normally small, though not small enough to fit into a single page - they may contain up to 1000 products. For reference, MongoDB supports a group method which can be used to achieve the same result set. This performs the ad hoc grouping server side and the performance is acceptable.

How can I get this type of result set using RavenDB?

One possible solution is to get all the products for a query and perform the grouping in memory and another option is to create a mapreduce index as above, though the challenge with this would be deducing all possible spec selections that can be made for a given category and additionally, this type of index might explode in size.

For an example, take a look at this fastener category page. The user can filter their selection by selecting attributes. When an attribute is selected it narrows the selection of products and displays the attributes within the new set of products. This type of interaction is typically called faceted search.

EDIT

In the meantime, I will be attempting a solution using Solr as they support faceted search out of the box.

EDIT 2

It appears that RavenDB also supports faceted search (which of course makes sense, indexes are stored by Lucene just like Solr). I will be exploring this and post updates.

EDIT 3

The RavenDB faceted search functionality works as expected. I store a facet setup document for each category ID which is used to calculate facets for a query within a given category. The issue I am having now is performance. For a collection of 500k products with 4500 distinct categories resulting in 4500 facet setup documents a query by category id takes about 16 seconds when also querying for facets and about 0.05 seconds when not querying for facets. The particular category tested contains about 6k products, 23 distinct facets and 2k distinct facet name-range combinations. After looking at the code in FacetedQueryRunner it appears a facets query will result in a Lucene query for every facet name-value combination to get the counts, as a well as a query for each facet name to get the terms. One problem with the implementation is that it will retrieve all the distinct terms for a given facet name regardless of the query, which in most cases will significantly reduce the number of terms for a facet and therefore reduce the number of Lucene queries. One way to improve performance here would be to store a MapReduce computed result set (as shown above) for each facet setup document which could then be queried to get all the distinct terms when further filtering by facets. The overall performance however may still be too slow.

like image 294
eulerfx Avatar asked Oct 23 '22 22:10

eulerfx


1 Answers

I've implemented this feature using RavenDB faceted search, however I made some changes to FacetedQueryRunner to support a heuristic optimization. The heuristic is that, in my case, facets are only displayed in leaf categories. This is a reasonable constraint since navigation between root and internal categories can be driven by either search or listings of child categories.

Now given the constraint I store a FacetSetup document for each leaf category with the Id being something like "facets/category_123". When the facet setup document is being stored I have access to the facet names as well as facet values (or ranges) that are contained in the category. Therefore, I can store all available facet values in the Ranges collection of each Facet in the FacetSetup document, however the facet mode is still FacetMode.Default.

Here are the changes to FacetedQueryRunner. Specifically, the optimization checks to see if a given facet stores ranges, in which case it returns those values to use for searching instead of getting all terms in an index associated with a given facet. In most cases this will significantly reduce the number of Lucene searches that are required since there available facet values in a given category are a subset of facet values in the entire index.

The next optimization that can be made is that if the original query only filters by a category id, then the FacetSetup document can actually store the counts as well. One, albeit hacky, way to do this would be to append the count to each facet value in the Ranges collection, then add a boolean to FacetSetup document to indicate that counts are appended. Now this facet query will basically return the values in the FacetSetup document - no need to query.

A consideration now would be to keep the FacetSetup documents up to date, however this would be required either way. Beyond this optimization caching can be utilized, which is I believe the approach taken by Solr faceted search.

Furthermore, it would be nice if the FacetSetup documents where automatically synchronized with the product collection since effectively they are result of an aggregating MapReduce operation over the set of products grouping initially by category id, then the name of the facet and then the values.

like image 149
eulerfx Avatar answered Dec 07 '22 01:12

eulerfx