Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server Performance with Key/Pair Table vs XML Field and XPath

I've seen a few questions on this topic already but I'm looking for some insight on the performance differences between these two techniques.

For example, lets say I am recording a log of events which will come into the system with a dictionary set of key/value pairs for the specific event. I will record an entry in an Events table with the base data but then I need a way to also link the additional key/value data. I will never know what kinds of Keys or Values will come in so any sort of predefined enum table seems out of the question.

This event data will be constantly streaming in so insert times is just as important as query times.

When I query for specific events I will be using some fields on the Event as well as data from the key/value data. For the XML way I would simply use a Attributes.exists('xpath') statement as part of the where clause to filter the records.

The normalized way would be to use a Table with basically Key and Value fields with a foreign link to the Event record. This seems clean and simple but I worry about the amount of data that is involved.

like image 765
Vyrotek Avatar asked Feb 19 '10 16:02

Vyrotek


3 Answers

You've got three major options for a 'flexible' storage mechanism.

  • XML fields are flexible but put you in the realm of blob storage, which is slow to query. I've seen queries against small data sets of 30,000 rows take 5 minutes when it was digging stuff out of the blobs with Xpath queries. This is the slowest option by far but it is flexible.

  • Key/value pairs are a lot faster, particularly if you put a clustered index on the event key. This means that all attributes for a single event will be physically stored together in the database, which will minimise the I/O. The approach is less flexible than XML but substantially faster. The most efficient queries to report against it would involve pivoting the data (i.e. a table scan to make an intermediate flattened result); joining to get individual fields will be much slower.

  • The fastest approach is to have a flat table with a set of user defined fields (Field1 - Field50) and hold some metadata about the contents of the fields. This is the fastest to insert and fastest and easiest to query, but the contents of the table are opaque to anything that does not have access to the metadata.

like image 97
ConcernedOfTunbridgeWells Avatar answered Sep 29 '22 08:09

ConcernedOfTunbridgeWells


The problem I think the key/value table approach is regarding the datatypes - if a value could be a datetime, or a string or a unicode string or an integer, then how do you define the column? This dilemma means the value column has to be a datatype which can contain all the different types of data in it which then begs the question of efficiency/ease of querying. Alternatively, you have multiple columns of specific datatypes, but I think this is a bit clunky.

For a true flexible schema, I can't think of a better option than XML. You can index XML columns.

This article off MSDN discusses XML storage in more detail.

like image 45
AdaTheDev Avatar answered Sep 29 '22 07:09

AdaTheDev


I'd assume the normalized way would be faster for both INSERT and SELECT operations, if only because that's what any RDBMS would be optimized for. The "amount of data involved" part might be an issue too, but a more solvable one - how long do you need that data immediately on hand, can you archive it after a day, or a couple weeks, or 3 months, etc? SQL Server can handle an awful lot.

This event data will be constantly streaming in so insert times is just as important as query times.

Option 3: If you really have a lot of data constantly streaming - create a separate queue in shared memory, in-process sqlite, separate db table, or even it's own server, to store the incoming raw event & attributes, and have another process (scheduled task, windows service, etc) parse that queue into whatever preferred format tuned for speedy SELECTs. Optimal input, optimal output, ready to scale in either direction, everyone's happy.

like image 31
tadamson Avatar answered Sep 29 '22 08:09

tadamson