Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determining whether to store XML data as XML or in normalized tables

(This post is soliciting personal experiences about storing XML; please share what you know. :-) )

I am working on a service application that communicates with an external service using XML. I am planning to use SQL Server 2008 to store the XML that is received and sent to the external service. I'm exploring my options for storing the XML in the database. The three options that I have identified are:

  1. Store the XML in an XML data type column
  2. Create tables to store the various parent and child relationships represented in the XML.
  3. A hybrid of the two above approaches where the original XML is stored in an XML data type column but several fields from the XML broken out into their own columns to simplify querying and indexing.

I'm looking for any advice, based on your personal experience, with storing and retrieving XML data in SQL Server.

Some additional background: I've used an 'xsd.exe' equivalent called XsdObjectgenerator to create .net classes based on the XML schemas. When the service receives the XML file, it is deserialized into an instance of .net class. This instance is used to perform the operations of the service. My original plan was to then use option #1 above to store the XML. If I needed to update or report on the data, I would simply deserialize the db record back into one of my .net classes.

Although this approach works and makes working with the xml very simple, I have concerns that as the volume of data increases, the performance of querying XML data type records will decrease. This is why I've explored options 2. & 3. above.

In addition to storing the XML, the XML will be queried for use in both reports and a separate web application. The db records will be queried, sorted, filtered, grouped, summaried and possibly updated by the end users.

like image 712
Dean Avatar asked Jul 05 '10 15:07

Dean


People also ask

Where should I store XML files?

XML documents you insert into columns of type XML can reside either in the default storage object, or directly in the base table row. Base table row storage is under your control and is available only for small documents; larger documents are always stored in the default storage object.

Which format of XML is storing the data?

XML stores data in plain text format. This provides a software- and hardware-independent way of storing, transporting, and sharing data.

Which of the following data types is best when storing large XML documents for later retrieval?

An appropriate storage model for your XML documents is an xml data type column. This preserves the InfoSet content of your XML data.

In which of the following cases we can use XML data type?

Explanation: XML is a good choice if you want a platform-independent model in order to ensure the portability of the data by using structural and semantic markup.


2 Answers

I guess it depends on what you want to do with your XML in your database.

If you're mostly only storing it, and possibly retrieving it later as a whole and sending it out again, then I'd definitely use the XML datatype - no point in shredding it into bits and pieces.

If you however need to mostly work with the contents of the XML file, and possibly also manipulate and change that content, then it might be advisable to create tables with columns to match your XML content, and shred it when storing it, use it, and when you need to, reassemble it from the relational pieces using something like SELECT (columns) FROM dbo.Table FOR XML.....

There is an overhead involved in shredding and reassembling - so you need to ask yourself if that's worth doing. But there's also an overhead involved if you need to manipulate the XML column too much.

If you only need read-only access to a few attributes in your XML, I've come to appreciate the ability to wrap those into a UDF and surface it as a computed column in your table. That way, you can easily select something from your table, based on values that are stored somewhere inside your XML - quite handy! But do not overuse this approach - works fine for 2, 3 attributes - but if you need to access your XML over and over again (and most or all of it), then you might be better off shredding it into relational pieces to begin with.

like image 179
marc_s Avatar answered Sep 22 '22 06:09

marc_s


While continuing to explore solutions, a collegue forwarded the following applicable links:

Some preliminary conclusions from these articles and other research:

  • While working with the xml datatype in SQL Server is flexible, querying large volumes of data will be slow as you are essentially querying a blob datatype.
  • While you can create indexes on xml datatype columns in Sql Server, the index is on the whole column and not on a particular element or attribute, so the indexes are not as effective as an index on a non-xml db column.
  • Storing the xml in raw form in an xml datatype field while also maintaining a parsed version of the data in either relational tables or denormalized flat table(s) for querying and reporting is beginning to emerge as the most flexible solution. The xml can be "shredded" into the querying tables either at runtime or after the fact by a separate service or thread.

I will be mocking up each solution with test data and performing some benchmarking. I'll post the results here once they are available.

like image 42
Dean Avatar answered Sep 23 '22 06:09

Dean