I am developing a storefront web application. When a potential customer is viewing a product on the website, I'd like to suggest a set of similar products from the database automatically (vs requiring an human to explicitly input product similarity data/mappings).
In fact when you think about it, most storefront databases already have a lot of similarity data available. In my case Products
could be:
Manufacturer
(aka Brand
), Categories
, and Tags
(aka Keywords
).By counting the number of shared attributes between a product and all others, you could calculate a "SimilarityScore" for comparing other products against the one being viewed by the customer. Here's my initial prototype implementation:
;WITH ProductsRelatedByTags (ProductId, NumberOfRelations)
AS
(
SELECT t2.ProductId, COUNT(t2.TagId)
FROM ProductTagMappings AS t1 INNER JOIN
ProductTagMappings AS t2 ON t1.TagId = t2.TagId AND t2.ProductId != t1.ProductId
WHERE t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
GROUP BY t2.ProductId
), ProductsRelatedByCategories (ProductId, NumberOfRelations)
AS
(
SELECT t2.ProductId, COUNT(t2.CategoryId)
FROM ProductCategoryMappings AS t1 INNER JOIN
ProductCategoryMappings AS t2 ON t1.CategoryId = t2.CategoryId AND t2.ProductId != t1.ProductId
WHERE t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
GROUP BY t2.ProductId
)
SELECT prbt.ProductId AS ProductId
,IsNull(prbt.NumberOfRelations, 0) AS TagsInCommon
,IsNull(prbc.NumberOfRelations, 0) AS CategoriesInCommon
,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId THEN 1 ELSE 0 END as SameManufacturer
,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId
THEN IsNull(prbt.NumberOfRelations, 0) + IsNull(prbc.NumberOfRelations, 0) + 1
ELSE IsNull(prbt.NumberOfRelations, 0) + IsNull(prbc.NumberOfRelations, 0)
END as SimilarityScore
FROM Products AS SourceProduct,
Products AS SimilarProduct INNER JOIN
ProductsRelatedByTags prbt ON prbt.ProductId = SimilarProduct.Id FULL OUTER JOIN
ProductsRelatedByCategories prbc ON prbt.ProductId = prbc.ProductId
WHERE SourceProduct.Id = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
which results in data like this:
ProductId TagsInCommon CategoriesInCommon SameManufacturer SimilarityScore
------------------------------------ ------------ ------------------ ---------------- ---------------
6416C19D-BA4F-4AE6-AB75-A25A0138B3A5 1 0 0 1
77B2ECC0-E2EB-4C1B-A1E1-A25A0138BA19 1 0 0 1
2D83276E-40CC-44D0-9DDF-A25A0138BE14 2 1 1 4
E036BFE0-BBB5-450C-858C-A25A0138C21C 3 0 0 3
I am not a SQL performance guru, so I have the following questions for you SQL gurus:
and
SimilarProductMappings
table for any given product.You ask a lot of questions. I will try and address each one without going into too much detail.
CTEs vs. derived tables is syntactical sugar. It makes no difference performance-wise. The only benefit to using them is that you can reuse them instead of copy/paste/typing-out a derived table again. However, you aren't reusing them in this case, so it's up to you.
Indexed views: Keep in mind, indexes on the views act like indexes on the table(s) with little exception. Picture it like another table is being created for your specific query/view and being stored on disk for faster retrieval. When the underlying data changes, these indexes have to be updated. Yes, this can create a huge resource impact. In general, I'd rather see someone write a query that uses the indexes on the base table, and if they need more indexes for a specific purpose, then look at that in detail rather than holistically on a view with multiple tables. This is far easier to maintain and a lot easier to figure out why your CRUD is taking longer than expected. There is nothing necessarily wrong with an indexed view. But, be very careful with adding this on a application database model like this because of the complexity to tables that are updated/inserted/deleted from. Most of the more appropriate uses for an indexed view is in a reporting data warehouse. Regardless, don't put an index on a view without understand what it will do to the tables for CRUD (create, read, update, delete) operations. And in a CRM or application support type of database, I'd stay away from them for the most part unless there is a static need and it's not really impacting performance.
Read this article: http://technet.microsoft.com/en-us/library/ms187864(v=sql.105).aspx
Note about 3/4 of the way down the page it talks about where NOT to use one and I think your case fits in to 4 / 5 of the scenarios where you should NOT use it.
Regarding saving joins... keep in mind FULL OUTER joins are one of the worst offenders for efficiency. It seems to me the only reason you have it there, is because you aren't including manufacturer in your CTEs. You could just include it in your CTEs and then aggregate the number of matches by cat/tag in the final query which pulls it together to get your score. This way you only have two left outer joins (one to each CTE) and then sum the two counts together and group by same manufacturer (case statement), productId etc.
Finaly... I would consider placing all of this in a de-normalized table or maybe even a cube where it's precalculated. Let's consider several things about your requirement: a. Does the correlation score for products need to be live? If yes, why? This isn't mission critical when new products are added / removed. Anyone who says it needs to be live, probably doesn't really mean it. b. Speed of retrieval. I could rewrite your query using temp tables, make sure the indexes are correct etc. and come up with a reasonably faster query in a stored procedure. But, I am still aggregating data from the DB to be displayed all over my store front every time the page loads. If the data is precalculated and stored in a separate table of productIds and scores for every product and indexed by productId, the retrieval would be very fast. You could trunk and reload the table in an ETL nightly, hourly / whatever and wouldn't have to worry about maintaining the indexes that are rebuilt every time. Of course, if your store front is 24/7/365, you'll need to write some database side code to worry about versioning so that your application never has to wait if the db is in the middle of recalculating.
Also, make sure you at least cache this information on the web/application server if nothing else. One thing is for sure, if you go with your solution above, then you'll need to build something into your site so it doesn't wait for data to return and caches it instead.
Hope all that helps.
How about a somewhat different approach?
;WITH ProductFindings (ProductId, NbrTags, NbrCategories)
AS
(
SELECT t2.ProductId, COUNT(t2.TagId), 0
FROM ProductTagMappings AS t1
INNER JOIN
ProductTagMappings AS t2 ON t1.TagId = t2.TagId
AND t1.ProductId != t2.ProductId
WHERE t1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
GROUP BY t2.ProductId
UNION ALL
SELECT c2.ProductId, 0, COUNT(c2.CategoryId)
FROM ProductCategoryMappings AS c1
INNER JOIN
ProductCategoryMappings AS c2 ON c1.CategoryId = c2.CategoryId
AND c1.ProductId != c2.ProductId
WHERE c1.ProductId = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
GROUP BY c2.ProductId
), ProductTally (ProductId, TotTags, TotCategories) as
(
SELECT ProductID, sum(NbrTags), sum(NbrCategories)
FROM ProductFindings
GROUP BY ProductID
)
SELECT Tot.ProductId AS ProductId
,Tot.TotTags AS TagsInCommon
,Tot.TotCategories AS CategoriesInCommon
,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId
THEN 1
ELSE 0
END as SameManufacturer
,CASE WHEN SimilarProduct.ManufacturerId = SourceProduct.ManufacturerId
THEN 1
ELSE 0
END + Tot.TotTags + Tot.TotCategories
as SimilarityScore
FROM ProductTally as Tot
INNER JOIN Products AS SimilarProduct ON Tot.ProductID = SimilarProduct.Id
INNER JOIN Products AS SourceProduct ON SourceProduct.Id = '22D6059C-D981-4A97-8F7B-A25A0138B3F4'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With