Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Handling inheritance with overriding efficiently

I have the following two data structures.

First, a list of properties applied to object triples:

Object1  Object2  Object3 Property  Value
     O1       O2       O3       P1  "abc"
     O1       O2       O3       P2  "xyz"
     O1       O3       O4       P1  "123"
     O2       O4       O5       P1  "098"

Second, an inheritance tree:

O1
    O2
        O4
    O3
        O5

Or viewed as a relation:

Object    Parent
    O2        O1
    O4        O2
    O3        O1
    O5        O3
    O1      null

The semantics of this being that O2 inherits properties from O1; O4 - from O2 and O1; O3 - from O1; and O5 - from O3 and O1, in that order of precedence.
NOTE 1: I have an efficient way to select all children or all parents of a given object. This is currently implemented with left and right indexes, but hierarchyid could also work. This does not seem important right now.
NOTE 2: I have tiggers in place that make sure that the "Object" column always contains all possible objects, even when they do not really have to be there (i.e. have no parent or children defined). This makes it possible to use inner joins rather than severely less effiecient outer joins.

The objective is: Given a pair of (Property, Value), return all object triples that have that property with that value either defined explicitly or inherited from a parent.

NOTE 1: An object triple (X,Y,Z) is considered a "parent" of triple (A,B,C) when it is true that either X = A or X is a parent of A, and the same is true for (Y,B) and (Z,C).
NOTE 2: A property defined on a closer parent "overrides" the same property defined on a more distant parent.
NOTE 3: When (A,B,C) has two parents - (X1,Y1,Z1) and (X2,Y2,Z2), then (X1,Y1,Z1) is considered a "closer" parent when:
(a) X2 is a parent of X1, or
(b) X2 = X1 and Y2 is a parent of Y1, or
(c) X2 = X1 and Y2 = Y1 and Z2 is a parent of Z1

In other words, the "closeness" in ancestry for triples is defined based on the first components of the triples first, then on the second components, then on the third components. This rule establishes an unambigous partial order for triples in terms of ancestry.

For example, given the pair of (P1, "abc"), the result set of triples will be:

 O1, O2, O3     -- Defined explicitly
 O1, O2, O5     -- Because O5 inherits from O3
 O1, O4, O3     -- Because O4 inherits from O2
 O1, O4, O5     -- Because O4 inherits from O2 and O5 inherits from O3
 O2, O2, O3     -- Because O2 inherits from O1
 O2, O2, O5     -- Because O2 inherits from O1 and O5 inherits from O3
 O2, O4, O3     -- Because O2 inherits from O1 and O4 inherits from O2
 O3, O2, O3     -- Because O3 inherits from O1
 O3, O2, O5     -- Because O3 inherits from O1 and O5 inherits from O3
 O3, O4, O3     -- Because O3 inherits from O1 and O4 inherits from O2
 O3, O4, O5     -- Because O3 inherits from O1 and O4 inherits from O2 and O5 inherits from O3
 O4, O2, O3     -- Because O4 inherits from O1
 O4, O2, O5     -- Because O4 inherits from O1 and O5 inherits from O3
 O4, O4, O3     -- Because O4 inherits from O1 and O4 inherits from O2
 O5, O2, O3     -- Because O5 inherits from O1
 O5, O2, O5     -- Because O5 inherits from O1 and O5 inherits from O3
 O5, O4, O3     -- Because O5 inherits from O1 and O4 inherits from O2
 O5, O4, O5     -- Because O5 inherits from O1 and O4 inherits from O2 and O5 inherits from O3

Note that the triple (O2, O4, O5) is absent from this list. This is because property P1 is defined explicitly for the triple (O2, O4, O5) and this prevents that triple from inheriting that property from (O1, O2, O3). Also note that the triple (O4, O4, O5) is also absent. This is because that triple inherits its value of P1="098" from (O2, O4, O5), because it is a closer parent than (O1, O2, O3).

The straightforward way to do it is the following. First, for every triple that a property is defined on, select all possible child triples:

select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value
from TriplesAndProperties tp

-- Select corresponding objects of the triple
inner join Objects as Objects1 on Objects1.Id = tp.O1
inner join Objects as Objects2 on Objects2.Id = tp.O2
inner join Objects as Objects3 on Objects3.Id = tp.O3

-- Then add all possible children of all those objects
inner join Objects as Children1 on Objects1.Id [isparentof] Children1.Id
inner join Objects as Children2 on Objects2.Id [isparentof] Children2.Id
inner join Objects as Children3 on Objects3.Id [isparentof] Children3.Id

But this is not the whole story: if some triple inherits the same property from several parents, this query will yield conflicting results. Therefore, second step is to select just one of those conflicting results:

select * from
(
    select 
        Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,
        row_number() over( 
            partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
            order by Objects1.[depthInTheTree] descending, Objects2.[depthInTheTree] descending, Objects3.[depthInTheTree] descending
        )
        as InheritancePriority
    from
    ... (see above)
)
where InheritancePriority = 1

The window function row_number() over( ... ) does the following: for every unique combination of objects triple and property, it sorts all values by the ancestral distance from the triple to the parents that the value is inherited from, and then I only select the very first of the resulting list of values. A similar effect can be achieved with a GROUP BY and ORDER BY statements, but I just find the window function semantically cleaner (the execution plans they yield are identical). The point is, I need to select the closest of contributing ancestors, and for that I need to group and then sort within the group.

And finally, now I can simply filter the result set by Property and Value.

This scheme works. Very reliably and predictably. It has proven to be very powerful for the business task it implements.

The only trouble is, it is awfuly slow.
One might point out the join of seven tables might be slowing things down, but that is actually not the bottleneck.

According to the actual execution plan I'm getting from the SQL Management Studio (as well as SQL Profiler), the bottleneck is the sorting. The problem is, in order to satisfy my window function, the server has to sort by Children1.Id, Children2.Id, Children3.Id, tp.Property, Parents1.[depthInTheTree] descending, Parents2.[depthInTheTree] descending, Parents3.[depthInTheTree] descending, and there can be no indexes it can use, because the values come from a cross join of several tables.

EDIT: Per Michael Buen's suggestion (thank you, Michael), I have posted the whole puzzle to sqlfiddle here. One can see in the execution plan that the Sort operation accounts for 32% of the whole query, and that is going to grow with the number of total rows, because all the other operations use indexes.

Usually in such cases I would use an indexed view, but not in this case, because indexed views cannot contain self-joins, of which there are six.

The only way that I can think of so far is to create six copies of the Objects table and then use them for the joins, thus enabling an indexed view.
Did the time come that I shall be reduced to that kind of hacks? The despair sets in.

like image 569
Fyodor Soikin Avatar asked Aug 16 '12 17:08

Fyodor Soikin


1 Answers

I have 3 possible answers.

The sql fiddle for your question is here: http://sqlfiddle.com/#!3/7c7a0/3/0

The sql fiddle for my answer is here: http://sqlfiddle.com/#!3/5d257/1

Warnings:

  1. Query Analyzer is not enough - I notice a number of answers were rejected because their query plans are more expensive than the original query. The Analyzer is only guide. Depending on the actual data set, hardware, and use case, more expensive queries can return results faster than less expensive ones. You have to test in your environment.
  2. Query Analyzer is ineffective - even if you find a way to remove the "most expensive step" from a query, it often makes no difference to your query.
  3. Query changes alone rarely mitigate schema/design issues - Some answers were rejected because they involve schema level changes such as triggers and additional tables. Complex queries that resist optimization are a strong sign that problem is with the underlying design or with my expectations. You may not like it, but you might have to accept that the problem is not solvable at the query level.
  4. Indexed view cannot contain row_number()/partitition clause - Working around the self-join problem by creating six copies of objects table is not enough to allow you to create the indexed view you've suggested. I tried it in this sqlfiddle. If you uncomment that last "create index" statement, you'll get an error because your view "contains a ranking or aggregate window function."

Working Answers:

  1. Left Join instead of row_number() - You can use a query that uses left joins to exclude results that have been overriden lower in the tree. Removing the finally "order by" from this query actually removes the sort that has been plaguing you! The execution plan for this query is still more expensive than your original, but see Disclaimer #1 above.
  2. Indexed view for part of the query - Using some serious query magic (based on this technique), I created an indexed view for part of the query. This view can be used to enhance the original question query or answer #1.
  3. Actualize into a well indexed table - Someone else suggested this answer, but they might not have explained it well. Unless your result set is very large or you are doing very frequent updates to the source tables, actualizing the results of a query and using a trigger to keep them up-to-date is a perfectly fine way to work around this kind of issue. Once you create a view for your query, it is easy enough to test this option. You can reuse answer #2 to speed up the trigger, and then further improve it over time. (You are talking about creating six copies of your tables, try this first. It guarantees that the performance for the select you care about will be as good as possible.)

Here's is the schema part of my answer from sqlfiddle:

Create Table Objects
(
    Id int not null identity primary key,
    LeftIndex int not null default 0,
    RightIndex int not null default 0
)

alter table Objects add ParentId int null references Objects

CREATE TABLE TP
(
    Object1 int not null references Objects,
    Object2 int not null references Objects,
    Object3 int not null references Objects,
    Property varchar(20) not null,
    Value varchar(50) not null
)


insert into Objects(LeftIndex, RightIndex) values(1, 10)
insert into Objects(ParentId, LeftIndex, RightIndex) values(1, 2, 5)
insert into Objects(ParentId, LeftIndex, RightIndex) values(1, 6, 9)
insert into Objects(ParentId, LeftIndex, RightIndex) values(2, 3, 4)
insert into Objects(ParentId, LeftIndex, RightIndex) values(3, 7, 8)

insert into TP(Object1, Object2, Object3, Property, Value) values(1,2,3, 'P1', 'abc')
insert into TP(Object1, Object2, Object3, Property, Value) values(1,2,3, 'P2', 'xyz')
insert into TP(Object1, Object2, Object3, Property, Value) values(1,3,4, 'P1', '123')
insert into TP(Object1, Object2, Object3, Property, Value) values(2,4,5, 'P1', '098')

create index ix_LeftIndex on Objects(LeftIndex)
create index ix_RightIndex on Objects(RightIndex)
create index ix_Objects on TP(Property, Value, Object1, Object2, Object3)
create index ix_Prop on TP(Property)
GO

---------- QUESTION ADDITIONAL SCHEMA --------
CREATE VIEW TPResultView AS
Select O1, O2, O3, Property, Value
FROM
(
    select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,

    row_number() over( 
        partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
        order by Objects1.LeftIndex desc, Objects2.LeftIndex desc, Objects3.LeftIndex desc
    )
    as Idx

    from tp

    -- Select corresponding objects of the triple
    inner join Objects as Objects1 on Objects1.Id = tp.Object1
    inner join Objects as Objects2 on Objects2.Id = tp.Object2
    inner join Objects as Objects3 on Objects3.Id = tp.Object3

    -- Then add all possible children of all those objects
    inner join Objects as Children1 on Children1.LeftIndex between Objects1.LeftIndex and Objects1.RightIndex
    inner join Objects as Children2 on Children2.LeftIndex between Objects2.LeftIndex and Objects2.RightIndex
    inner join Objects as Children3 on Children3.LeftIndex between Objects3.LeftIndex and Objects3.RightIndex
) as x
WHERE idx = 1 
GO

---------- ANSWER 1 SCHEMA --------

CREATE VIEW TPIntermediate AS
select tp.Property, tp.Value 
    , Children1.Id as O1, Children2.Id as O2, Children3.Id as O3
    , Objects1.LeftIndex as PL1, Objects2.LeftIndex as PL2, Objects3.LeftIndex as PL3    
    , Children1.LeftIndex as CL1, Children2.LeftIndex as CL2, Children3.LeftIndex as CL3    
    from tp

    -- Select corresponding objects of the triple
    inner join Objects as Objects1 on Objects1.Id = tp.Object1
    inner join Objects as Objects2 on Objects2.Id = tp.Object2
    inner join Objects as Objects3 on Objects3.Id = tp.Object3

    -- Then add all possible children of all those objects
    inner join Objects as Children1 WITH (INDEX(ix_LeftIndex)) on Children1.LeftIndex between Objects1.LeftIndex and Objects1.RightIndex
    inner join Objects as Children2 WITH (INDEX(ix_LeftIndex)) on Children2.LeftIndex between Objects2.LeftIndex and Objects2.RightIndex
    inner join Objects as Children3 WITH (INDEX(ix_LeftIndex)) on Children3.LeftIndex between Objects3.LeftIndex and Objects3.RightIndex
GO

---------- ANSWER 2 SCHEMA --------

-- Partial calculation using an indexed view
-- Circumvented the self-join limitation using a black magic technique, based on 
-- http://jmkehayias.blogspot.com/2008/12/creating-indexed-view-with-self-join.html
CREATE TABLE dbo.multiplier (i INT PRIMARY KEY)

INSERT INTO dbo.multiplier VALUES (1) 
INSERT INTO dbo.multiplier VALUES (2) 
INSERT INTO dbo.multiplier VALUES (3) 
GO

CREATE VIEW TPIndexed
WITH SCHEMABINDING
AS

SELECT tp.Object1, tp.object2, tp.object3, tp.property, tp.value,
    SUM(ISNULL(CASE M.i WHEN 1 THEN Objects.LeftIndex ELSE NULL END, 0)) as PL1,
    SUM(ISNULL(CASE M.i WHEN 2 THEN Objects.LeftIndex ELSE NULL END, 0)) as PL2,
    SUM(ISNULL(CASE M.i WHEN 3 THEN Objects.LeftIndex ELSE NULL END, 0)) as PL3,
    SUM(ISNULL(CASE M.i WHEN 1 THEN Objects.RightIndex ELSE NULL END, 0)) as PR1,
    SUM(ISNULL(CASE M.i WHEN 2 THEN Objects.RightIndex ELSE NULL END, 0)) as PR2,
    SUM(ISNULL(CASE M.i WHEN 3 THEN Objects.RightIndex ELSE NULL END, 0)) as PR3,
    COUNT_BIG(*) as ID
    FROM dbo.tp
    cross join dbo.multiplier M 
    inner join dbo.Objects 
    on (M.i = 1 AND Objects.Id = tp.Object1)
    or (M.i = 2 AND Objects.Id = tp.Object2)
    or (M.i = 3 AND Objects.Id = tp.Object3)
GROUP BY tp.Object1, tp.object2, tp.object3, tp.property, tp.value
GO

-- This index is mostly useless but required
create UNIQUE CLUSTERED index pk_TPIndexed on dbo.TPIndexed(property, value, object1, object2, object3)
-- Once we have the clustered index, we can create a nonclustered that actually addresses our needs
create NONCLUSTERED index ix_TPIndexed on dbo.TPIndexed(property, value, PL1, PL2, PL3, PR1, PR2, PR3)
GO

-- NOTE: this View is not indexed, but is uses the indexed view 
CREATE VIEW TPIndexedResultView AS
Select O1, O2, O3, Property, Value
FROM
(
    select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,

    row_number() over( 
        partition by tp.Property, Children1.Id, Children2.Id, Children3.Id
        order by tp.Property, Tp.PL1 desc, Tp.PL2 desc, Tp.PL3 desc
    )
    as Idx

    from TPIndexed as TP WITH (NOEXPAND)

    -- Then add all possible children of all those objects
    inner join Objects as Children1 WITH (INDEX(ix_LeftIndex)) on Children1.LeftIndex between TP.PL1 and TP.PR1
    inner join Objects as Children2 WITH (INDEX(ix_LeftIndex)) on Children2.LeftIndex between TP.PL2 and TP.PR2
    inner join Objects as Children3 WITH (INDEX(ix_LeftIndex)) on Children3.LeftIndex between TP.PL3 and TP.PR3
) as x
WHERE idx = 1 
GO


-- NOTE: this View is not indexed, but is uses the indexed view 
CREATE VIEW TPIndexedIntermediate AS
select tp.Property, tp.Value 
    , Children1.Id as O1, Children2.Id as O2, Children3.Id as O3
    , PL1, PL2, PL3    
    , Children1.LeftIndex as CL1, Children2.LeftIndex as CL2, Children3.LeftIndex as CL3    
    from TPIndexed as TP WITH (NOEXPAND)

    -- Then add all possible children of all those objects
    inner join Objects as Children1 WITH (INDEX(ix_LeftIndex)) on Children1.LeftIndex between TP.PL1 and TP.PR1
    inner join Objects as Children2 WITH (INDEX(ix_LeftIndex)) on Children2.LeftIndex between TP.PL2 and TP.PR2
    inner join Objects as Children3 WITH (INDEX(ix_LeftIndex)) on Children3.LeftIndex between TP.PL3 and TP.PR3  
GO


---------- ANSWER 3 SCHEMA --------
-- You're talking about making six copies of the TP table
-- If you're going to go that far, you might as well, go the trigger route
-- The performance profile is much the same - slower on insert, faster on read
-- And instead of still recalculating on every read, you'll be recalculating
-- only when the data changes. 

CREATE TABLE TPResult
(
    Object1 int not null references Objects,
    Object2 int not null references Objects,
    Object3 int not null references Objects,
    Property varchar(20) not null,
    Value varchar(50) not null
)
GO

create UNIQUE index ix_Result on TPResult(Property, Value, Object1, Object2, Object3)


--You'll have to imagine this trigger, sql fiddle doesn't want to do it
--CREATE TRIGGER tr_TP
--ON TP
--  FOR INSERT, UPDATE, DELETE
--AS
--  DELETE FROM TPResult
-- -- For this example we'll just insert into the table once
INSERT INTO TPResult 
SELECT O1, O2, O3, Property, Value 
FROM TPResultView

Query part of my answer from sqlfiddle:

-------- QUESTION QUERY ----------
-- Original query, modified to use the view I added
SELECT O1, O2, O3, Property, Value 
FROM TPResultView
WHERE property = 'P1' AND value = 'abc'
-- Your assertion is that this order by is the most expensive part. 
-- Sometimes converting queries into views allows the server to
-- Optimize them better over time.
-- NOTE: removing this order by has no effect on this query.
-- ORDER BY O1, O2, O3
GO

-------- ANSWER 1  QUERY ----------
-- A different way to get the same result. 
-- Query optimizer says this is more expensive, but I've seen cases where
-- it says a query is more expensive but it returns results faster.
SELECT O1, O2, O3, Property, Value
FROM (
  SELECT A.O1, A.O2, A.O3, A.Property, A.Value
  FROM TPIntermediate A
  LEFT JOIN TPIntermediate B ON A.O1 = B.O1
    AND A.O2 = B.O2
    AND A.O3 = B.O3
    AND A.Property = B.Property
    AND 
    (
      -- Find any rows with Parent LeftIndex triplet that is greater than this one
      (A.PL1 < B.PL1
      AND A.PL2 < B.PL2
      AND A.PL3 < B.PL3) 
    OR
      -- Find any rows with LeftIndex triplet that is greater than this one
      (A.CL1 < B.CL1
      AND A.CL2 < B.CL2
      AND A.CL3 < B.CL3)
    )
  -- If this row has any rows that match the previous two cases, exclude it
  WHERE B.O1 IS NULL ) AS x
WHERE property = 'P1' AND value = 'abc'
-- NOTE: Removing this order _DOES_ reduce query cost removing the "sort" action
-- that has been the focus of your question.   
-- Howeer, it wasn't clear from your question whether this order by was required.
--ORDER BY O1, O2, O3
GO

-------- ANSWER 2  QUERIES ----------
-- Same as above but using an indexed view to partially calculate results

SELECT O1, O2, O3, Property, Value 
FROM TPIndexedResultView
WHERE property = 'P1' AND value = 'abc'
-- Your assertion is that this order by is the most expensive part. 
-- Sometimes converting queries into views allows the server to
-- Optimize them better over time.
-- NOTE: removing this order by has no effect on this query.
--ORDER BY O1, O2, O3
GO

SELECT O1, O2, O3, Property, Value
FROM (
  SELECT A.O1, A.O2, A.O3, A.Property, A.Value
  FROM TPIndexedIntermediate A
  LEFT JOIN TPIndexedIntermediate B ON A.O1 = B.O1
    AND A.O2 = B.O2
    AND A.O3 = B.O3
    AND A.Property = B.Property
    AND 
    (
      -- Find any rows with Parent LeftIndex triplet that is greater than this one
      (A.PL1 < B.PL1
      AND A.PL2 < B.PL2
      AND A.PL3 < B.PL3) 
    OR
      -- Find any rows with LeftIndex triplet that is greater than this one
      (A.CL1 < B.CL1
      AND A.CL2 < B.CL2
      AND A.CL3 < B.CL3)
    )
  -- If this row has any rows that match the previous two cases, exclude it
  WHERE B.O1 IS NULL ) AS x
WHERE property = 'P1' AND value = 'abc'
-- NOTE: Removing this order _DOES_ reduce query cost removing the "sort" action
-- that has been the focus of your question.   
-- Howeer, it wasn't clear from your question whether this order by was required.
--ORDER BY O1, O2, O3
GO



-------- ANSWER 3  QUERY ----------
-- Returning results from a pre-calculated table is fast and easy
-- Unless your are doing many more inserts than reads, or your result
-- set is very large, this is a fine way to compensate for a poor design
-- in one area of your database.
SELECT Object1 as O1, Object2 as O2, Object3 as O3, Property, Value 
FROM TPResult
WHERE property = 'P1' AND value = 'abc'
ORDER BY O1, O2, O3
like image 90
BitwiseMan Avatar answered Nov 03 '22 04:11

BitwiseMan