Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Structuring large SQL rowset(s) and consuming in .NET

Take a look at this psuedo schema (please note this is a simplification so please try not to comment too heavily on the "advisability" of the schema itself). Assume Indexes are inplace on the FKs.

 TABLE Lookup (
     Lookup_ID int not null PK
     Name nvarchar(255) not null
 )

 TABLE Document (
     Document_ID int not null PK
     Previous_ID null FK REFERENCES Document(Document_ID)
 )

 TABLE Document_Lookup (
     Document_ID int not null FK REFERENCES Document(Document_ID)
     Lookup_ID int not null FK REFERENCES Lookup(Lookup_ID)
 )

Volumes: Document, 4 Million rows of which 90% have a null Previous_ID field value; Lookup, 6000 rows, Average lookups attached to each document 20 giving Document_Lookup 80 Millions rows.

Now in a .NET Service have structure to represent a Lookup row like this:-

 struct Lookup
 {
      public int ID;
      public string Name;
      public List<int> DocumentIDs;
 }

and that lookup rows are stored in a Dictionary<int, Lookup> where the key is the lookup ID. An important point here is that this dictionary should contain entries where the Lookup is referenced by at least one document, i.e., the list DocumentIDs should have Count > 0.

My task is populate this dictionary efficiently. So the simple approach would be:-

  SELECT dl.Lookup_ID, l.Name, dl.Document_ID
  FROM Document_Lookup dl
  INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID
  INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID
  WHERE d.Previous_ID IS NULL
  ORDER BY dl.Lookup_ID, dl.Document_ID

This could then be used to populate a the dictionary fairly efficiently.

The Question: Does the underlying rowset delivery (TDS?) perform some optimization? It seems to me that queries that de-normalise data are very common hence the possiblity that field values don't change from one row to the next is high, hence it would make sense to optomise the stream by not sending field values that haven't changed. Does anyone know whether such an optomisation is in place? (Optomisation does not appear to exist).

What more sophisticated query could I use to eliminate the duplication (I'm think specifically of repeating the name value)? I've heard of such a thing a "nested rowset", can that sort of thing be generated? Would it be more performant? How would I access it in .NET?

I would perform two queries; one to populate the Lookup dictionary then a second to populate the ditionary lists. I would then add code to knock out the unused Lookup entires. However imagine I got my predictions wrong and Lookup ended up being 1 Million rows with only a quarter actually referenced by any document?

like image 275
AnthonyWJones Avatar asked Nov 22 '11 12:11

AnthonyWJones


1 Answers

  • As long as the names are relatively short in practice, the optimisation may not be necessary.

  • The easiest optimisation is to split it into two queries, one to get the names, the other to get the Document_ID list. (can be in the other order if it makes it easier to populate your data structures).

Example:

/*First get the name of the Lookup*/
select distinct dl.Lookup_ID, l.Name
FROM Document_Lookup dl 
INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID 
INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID 
WHERE d.Previous_ID IS NULL 
ORDER BY dl.Lookup_ID, dl.Document_ID 

/*Now get the list of Document_IDs for each*/
SELECT dl.Lookup_ID, dl.Document_ID 
FROM Document_Lookup dl 
INNER JOIN Lookup l ON l.Lookup_ID = dl.Lookup_ID 
INNER JOIN Document d ON d.Document_ID = dl.Lookup_ID 
WHERE d.Previous_ID IS NULL 
ORDER BY dl.Lookup_ID, dl.Document_ID 
  • There are also various tricks you could use to massage these into a single table but I suggest these are not worthwile.

  • The heirarchical rowsets you are thinking of are the MSDASHAPE OLEDB provider. They can do what you are suggesting but would restrict you to using the OLEDB provider for SQL which may not be what you want.

  • Finally consider careful XML

For example:

select
  l.lookup_ID as "@l", 
  l.name as "@n",
  (
    select dl.Document_ID as "node()", ' ' as "node()" 
    from Document_Lookup dl where dl.lookup_ID = l.lookup_ID for xml path(''), type
  ) as "*"
  from Lookup l
  where l.lookup_ID in (select dl.lookup_ID from Document_Lookup dl)
  for xml path('dl')

returns:

<dl l="1" n="One">1 2 </dl>
<dl l="2" n="Two">2 </dl>
like image 107
Ben Avatar answered Oct 15 '22 23:10

Ben