Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ECommerce Storefront Website: Discovering Similar Products Programmatically

I am developing a storefront web application. When a potential customer is viewing a product on the website, I'd like to suggest a set of similar products from the database automatically (vs requiring an human to explicitly input product similarity data/mappings).

In fact when you think about it, most storefront databases already have a lot of similarity data available. In my case Products could be:

  • mapped to a Manufacturer (aka Brand),
  • mapped to one or more Categories, and
  • mapped to one or more Tags (aka Keywords).

Storefront Datamodel Snippet

By counting the number of shared attributes between a product and all others, you could calculate a "SimilarityScore" for comparing other products against the one being viewed by the customer. Here's my initial prototype implementation:

;WITH ProductsRelatedByTags (ProductId, NumberOfRelations)
AS
(
    SELECT  t2.ProductId, COUNT(t2.TagId)
    FROM    ProductTagMappings AS t1 INNER JOIN
                ProductTagMappings AS t2 ON t1.TagId = t2.TagId AND t2.ProductId != t1.ProductId
    WHERE   t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
    GROUP BY t2.ProductId
), ProductsRelatedByCategories (ProductId, NumberOfRelations)
AS
(
    SELECT  t2.ProductId, COUNT(t2.CategoryId)
    FROM    ProductCategoryMappings AS t1 INNER JOIN
                ProductCategoryMappings AS t2 ON t1.CategoryId = t2.CategoryId AND t2.ProductId != t1.ProductId
    WHERE   t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
    GROUP BY t2.ProductId
)
SELECT  prbt.ProductId AS ProductId
        ,IsNull(prbt.NumberOfRelations, 0) AS TagsInCommon
        ,IsNull(prbc.NumberOfRelations, 0) AS CategoriesInCommon
        ,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId THEN 1 ELSE 0 END as SameManufacturer
        ,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId 
            THEN IsNull(prbt.NumberOfRelations, 0) + IsNull(prbc.NumberOfRelations, 0) + 1
            ELSE IsNull(prbt.NumberOfRelations, 0) + IsNull(prbc.NumberOfRelations, 0)
        END as SimilarityScore
FROM    Products AS SourceProduct, 
        Products AS SimilarProduct INNER JOIN
        ProductsRelatedByTags prbt ON prbt.ProductId = SimilarProduct.Id FULL OUTER JOIN
        ProductsRelatedByCategories prbc ON prbt.ProductId = prbc.ProductId
WHERE SourceProduct.Id = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'

which results in data like this:

ProductId                            TagsInCommon CategoriesInCommon SameManufacturer SimilarityScore
------------------------------------ ------------ ------------------ ---------------- ---------------
6416C19D-BA4F-4AE6-AB75-A25A0138B3A5 1            0                  0                1
77B2ECC0-E2EB-4C1B-A1E1-A25A0138BA19 1            0                  0                1
2D83276E-40CC-44D0-9DDF-A25A0138BE14 2            1                  1                4
E036BFE0-BBB5-450C-858C-A25A0138C21C 3            0                  0                3

I am not a SQL performance guru, so I have the following questions for you SQL gurus:

  • Are common-table-expressions (CTEs) appropriate/optimum in this use case? (They sure seem to make it easier to read/follow the SQL). Is there a way to save a join in there anywhere given the model presented above?

and

  • Would this be a good candidate for an indexed view (for persistance) or would this add excessive cost to changes in source data? In that case I will make this a stored procedure which updates a physical SimilarProductMappings table for any given product.
like image 987
BenSwayne Avatar asked Nov 06 '13 18:11

BenSwayne


2 Answers

You ask a lot of questions. I will try and address each one without going into too much detail.

  1. CTEs vs. derived tables is syntactical sugar. It makes no difference performance-wise. The only benefit to using them is that you can reuse them instead of copy/paste/typing-out a derived table again. However, you aren't reusing them in this case, so it's up to you.

  2. Indexed views: Keep in mind, indexes on the views act like indexes on the table(s) with little exception. Picture it like another table is being created for your specific query/view and being stored on disk for faster retrieval. When the underlying data changes, these indexes have to be updated. Yes, this can create a huge resource impact. In general, I'd rather see someone write a query that uses the indexes on the base table, and if they need more indexes for a specific purpose, then look at that in detail rather than holistically on a view with multiple tables. This is far easier to maintain and a lot easier to figure out why your CRUD is taking longer than expected. There is nothing necessarily wrong with an indexed view. But, be very careful with adding this on a application database model like this because of the complexity to tables that are updated/inserted/deleted from. Most of the more appropriate uses for an indexed view is in a reporting data warehouse. Regardless, don't put an index on a view without understand what it will do to the tables for CRUD (create, read, update, delete) operations. And in a CRM or application support type of database, I'd stay away from them for the most part unless there is a static need and it's not really impacting performance.

Read this article: http://technet.microsoft.com/en-us/library/ms187864(v=sql.105).aspx

Note about 3/4 of the way down the page it talks about where NOT to use one and I think your case fits in to 4 / 5 of the scenarios where you should NOT use it.

  1. Regarding saving joins... keep in mind FULL OUTER joins are one of the worst offenders for efficiency. It seems to me the only reason you have it there, is because you aren't including manufacturer in your CTEs. You could just include it in your CTEs and then aggregate the number of matches by cat/tag in the final query which pulls it together to get your score. This way you only have two left outer joins (one to each CTE) and then sum the two counts together and group by same manufacturer (case statement), productId etc.

  2. Finaly... I would consider placing all of this in a de-normalized table or maybe even a cube where it's precalculated. Let's consider several things about your requirement: a. Does the correlation score for products need to be live? If yes, why? This isn't mission critical when new products are added / removed. Anyone who says it needs to be live, probably doesn't really mean it. b. Speed of retrieval. I could rewrite your query using temp tables, make sure the indexes are correct etc. and come up with a reasonably faster query in a stored procedure. But, I am still aggregating data from the DB to be displayed all over my store front every time the page loads. If the data is precalculated and stored in a separate table of productIds and scores for every product and indexed by productId, the retrieval would be very fast. You could trunk and reload the table in an ETL nightly, hourly / whatever and wouldn't have to worry about maintaining the indexes that are rebuilt every time. Of course, if your store front is 24/7/365, you'll need to write some database side code to worry about versioning so that your application never has to wait if the db is in the middle of recalculating.

Also, make sure you at least cache this information on the web/application server if nothing else. One thing is for sure, if you go with your solution above, then you'll need to build something into your site so it doesn't wait for data to return and caches it instead.

Hope all that helps.

like image 59
maplemale Avatar answered Oct 24 '22 02:10

maplemale


How about a somewhat different approach?

;WITH ProductFindings (ProductId, NbrTags, NbrCategories)
AS
(
    SELECT  t2.ProductId, COUNT(t2.TagId), 0
    FROM    ProductTagMappings AS t1 
    INNER JOIN
            ProductTagMappings AS t2  ON t1.TagId      = t2.TagId
                                     AND t1.ProductId != t2.ProductId
    WHERE   t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
    GROUP BY t2.ProductId

  UNION ALL

    SELECT  c2.ProductId, 0, COUNT(c2.CategoryId)
    FROM    ProductCategoryMappings AS c1 
    INNER JOIN
            ProductCategoryMappings AS c2  ON c1.CategoryId = c2.CategoryId
                                          AND c1.ProductId != c2.ProductId
    WHERE c1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
    GROUP BY c2.ProductId

), ProductTally (ProductId, TotTags, TotCategories) as
(
   SELECT ProductID, sum(NbrTags), sum(NbrCategories)
   FROM     ProductFindings
   GROUP BY ProductID
)
SELECT  Tot.ProductId      AS ProductId
       ,Tot.TotTags        AS TagsInCommon
       ,Tot.TotCategories  AS CategoriesInCommon
       ,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId
               THEN 1
               ELSE 0 
         END               as SameManufacturer
       ,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId 
               THEN 1
               ELSE 0
         END + Tot.TotTags + Tot.TotCategories
                           as SimilarityScore
FROM    ProductTally as Tot
INNER JOIN  Products     AS SimilarProduct   ON Tot.ProductID = SimilarProduct.Id 
INNER JOIN  Products     AS SourceProduct    ON SourceProduct.Id = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
like image 1
WarrenT Avatar answered Oct 24 '22 01:10

WarrenT