At work, we're using an XML log file. Each log message is a <message> block with <date> and <time> subnodes; there are <submessage> blocks, <table> constructs and so on, and the log file can later be transformed to localized HTML using some Delphi processing and XSLT.
For medium-sized log files (around 2 MB) we ran into performance problems (up to a minute for loading the XML and doing some basic manipulations), which I could reduce to a test project like this (EDIT: updated the code and added measurements):
procedure TForm1.PrepareTest(MessageCount : integer);
var
  XML : IXMLDocument;
  i : integer;
begin
  XML := NewXMLDocument;
  XML.DocumentElement := XML.CreateNode('root');
  // Build a flat list of message nodes under the root
  for i := 1 to MessageCount do
  begin
    XML.DocumentElement.AddChild('message').Text := 'Test Text';
  end;
  XML.SaveToFile(XML_NAME);
end;
procedure TForm1.XMLTest;
var
  StartTime : Cardinal;
  XML : IXMLDocument;
begin
  StartTime := GetTickCount();
  XML := NewXMLDocument;
  XML.LoadFromFile(XML_NAME);
  // The first DOM access after loading - this is where the time goes
  Memo1.Lines.Add('Node count: ' + IntToStr(XML.DocumentElement.ChildNodes.Count));
  Memo1.Lines.Add('Time: ' + FloatToStr((GetTickCount() - StartTime) / 1000) + ' seconds');
end;
This results in the following time measurements (node count is increased by 25% per column, all times in milliseconds):
Node count       8000  10000  12500  15625  19531  24413  30516  38145  47681
Base test time    484    781   1140   1875   2890   4421   6734  10672  16812
Variation 1        32     47     62     78     78    141
Variation 2      2656   3157   3906   5015   6532   8922  12140  17391  24985
(delta Base)     2172   2376   2766   3140   3642   4501   5406   6719   8173
Note the two variations: the first is LoadFromFile only, and the second additionally adds 10000 more nodes to the beginning (!) of the XML, like PrepareTest does. That is the worst case, but looking at the delta to the base test, even this doesn't show quadratic effects. Also note that counting the nodes can be replaced with any other operation, so it looks like some delayed initialization/validation of the XML file is involved that causes the problem, and any manipulation afterwards shows the expected behaviour.
Memory usage is not high: the last test case (47681 nodes) has a peak memory usage of 39 MB, and its XML file size is 1.3 MB.
The first thing that is done after loading the XML (e.g. reading or writing some nodes, or accessing the node count) is slow and shows quadratic runtime behaviour, so any log file above 10 MB is unusable.
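To illustrate where the time goes, here is a minimal sketch (using the same XML_NAME test file as above) that times the load and the first DOM access separately; the load part corresponds to variation 1 in the table, and the first access carries the quadratic cost:

procedure TForm1.SplitTimingTest;
var
  XML : IXMLDocument;
  T0, T1 : Cardinal;
begin
  T0 := GetTickCount();
  XML := NewXMLDocument;
  XML.LoadFromFile(XML_NAME);
  T1 := GetTickCount();
  Memo1.Lines.Add('Load only: ' + IntToStr(T1 - T0) + ' ms');  // variation 1
  // The first DOM access triggers the expensive delayed step
  Memo1.Lines.Add('Node count: ' + IntToStr(XML.DocumentElement.ChildNodes.Count));
  Memo1.Lines.Add('First access: ' + IntToStr(GetTickCount() - T1) + ' ms');
end;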
We already solved the performance problems (along with some other issues) by parsing the log in small chunks of 100 messages, and I'm aware that the Delphi XML routines are not suited to (or are overkill for) this use case; using a different XML library would most likely make the performance problems go away. So I'm not asking for a solution to the problem (although it would be interesting to know whether it can be solved without switching to a different XML library). A rough sketch of the chunking idea follows.
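This is a simplified sketch, not our actual code: it assumes the XML declaration and the surrounding root wrapper have already been stripped from Raw, that <message> blocks are not nested, and ProcessChunk is a hypothetical placeholder for the real per-chunk processing (PosEx and LoadXMLData come from System.StrUtils and Xml.XMLDoc):

procedure ProcessChunk(const Chunk : IXMLDocument);
begin
  // placeholder for the real per-chunk work (manipulation, XSLT, ...)
end;

procedure ProcessLogInChunks(const Raw : string);
const
  CHUNK_SIZE = 100;
  MSG_END = '</message>';
var
  ChunkStart, P, Count : integer;
begin
  ChunkStart := 1;
  Count := 0;
  P := PosEx(MSG_END, Raw, 1);
  while P > 0 do
  begin
    Inc(Count);
    if Count = CHUNK_SIZE then
    begin
      // Wrap this batch of messages in a fresh root and parse it as a
      // small standalone document, so each load stays cheap
      ProcessChunk(LoadXMLData('<root>' +
        Copy(Raw, ChunkStart, P + Length(MSG_END) - ChunkStart) + '</root>'));
      ChunkStart := P + Length(MSG_END);
      Count := 0;
    end;
    P := PosEx(MSG_END, Raw, P + Length(MSG_END));
  end;
  if Count > 0 then  // remaining partial batch
    ProcessChunk(LoadXMLData('<root>' + Copy(Raw, ChunkStart, MaxInt) + '</root>'));
end;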
My question is: what is the reason for the quadratic runtime behaviour of the Delphi XML routines, or rather MSXML? I can't imagine anything in XML loading/parsing/validating that would lead to this, apart from really "stupid" things like managing the nodes in a linked list instead of a tree, but I might be overlooking something, perhaps DOM-related.
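To make the linked-list suspicion concrete (purely an illustration of the hypothesis, not MSXML's actual implementation): if an indexed child lookup had to walk sibling links from the first child on every call, then touching all n children once would cost 0 + 1 + ... + (n-1) = n*(n-1)/2 link hops, which is exactly the quadratic pattern above.

// Hypothetical indexed lookup that walks sibling links on every call
function NthChildBySiblingWalk(const Parent : IXMLNode; Index : integer) : IXMLNode;
var
  i : integer;
begin
  Result := Parent.ChildNodes.First;
  for i := 1 to Index do
    Result := Result.NextSibling;  // Index hops for child number Index
end;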
I agree with mj2008 that XML isn't a good fit for logging. That said, this problem, and large XML files in general, could be processed faster by using SAX: it fires events while parsing the incoming stream of XML data, which lets you process the items as they are read from disk, avoiding the cost of loading it all into memory before handing it to XSLT.
I regret that I have not done SAX in Delphi (yet), but I suspect the hardest part is implementing the required SAX interfaces (e.g. ISAXContentHandler); Delphi has TInterfacedObject and TAutoObject and others just for that.
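Since I haven't tried it, treat the following as a rough, untested sketch. It assumes the MSXML type library has been imported as MSXML2_TLB (exact signatures can differ between MSXML versions) and uses the VB-style IVBSAXContentHandler, whose WideString parameters are easier to handle than ISAXContentHandler's raw pointers. The handler only counts <message> elements; all other callbacks are no-ops, and the IDispatch methods are stubbed out:

uses
  Winapi.ActiveX, System.SysUtils, Vcl.Dialogs, MSXML2_TLB;

type
  TMessageCounter = class(TInterfacedObject, IVBSAXContentHandler)
  private
    FCount : integer;
  protected
    // IDispatch stubs - MSXML calls the handler through the vtable,
    // so these never need to do real work here
    function GetTypeInfoCount(out Count: Integer): HResult; stdcall;
    function GetTypeInfo(Index, LocaleID: Integer; out TypeInfo): HResult; stdcall;
    function GetIDsOfNames(const IID: TGUID; Names: Pointer;
      NameCount, LocaleID: Integer; DispIDs: Pointer): HResult; stdcall;
    function Invoke(DispID: Integer; const IID: TGUID; LocaleID: Integer;
      Flags: Word; var Params; VarResult, ExcepInfo, ArgErr: Pointer): HResult; stdcall;
    // IVBSAXContentHandler callbacks
    procedure Set_documentLocator(const locator: IVBSAXLocator); safecall;
    procedure startDocument; safecall;
    procedure endDocument; safecall;
    procedure startPrefixMapping(var strPrefix, strURI: WideString); safecall;
    procedure endPrefixMapping(var strPrefix: WideString); safecall;
    procedure startElement(var strNamespaceURI, strLocalName, strQName: WideString;
      const oAttributes: IVBSAXAttributes); safecall;
    procedure endElement(var strNamespaceURI, strLocalName, strQName: WideString); safecall;
    procedure characters(var strChars: WideString); safecall;
    procedure ignorableWhitespace(var strChars: WideString); safecall;
    procedure processingInstruction(var strTarget, strData: WideString); safecall;
    procedure skippedEntity(var strName: WideString); safecall;
  public
    property Count : integer read FCount;
  end;

function TMessageCounter.GetTypeInfoCount(out Count: Integer): HResult;
begin Count := 0; Result := S_OK; end;
function TMessageCounter.GetTypeInfo(Index, LocaleID: Integer; out TypeInfo): HResult;
begin Pointer(TypeInfo) := nil; Result := E_NOTIMPL; end;
function TMessageCounter.GetIDsOfNames(const IID: TGUID; Names: Pointer;
  NameCount, LocaleID: Integer; DispIDs: Pointer): HResult;
begin Result := E_NOTIMPL; end;
function TMessageCounter.Invoke(DispID: Integer; const IID: TGUID; LocaleID: Integer;
  Flags: Word; var Params; VarResult, ExcepInfo, ArgErr: Pointer): HResult;
begin Result := E_NOTIMPL; end;

procedure TMessageCounter.startElement(var strNamespaceURI, strLocalName,
  strQName: WideString; const oAttributes: IVBSAXAttributes);
begin
  // React to each element as it streams past - nothing is kept in memory
  if strLocalName = 'message' then
    Inc(FCount);
end;

// The remaining callbacks are deliberately empty
procedure TMessageCounter.Set_documentLocator(const locator: IVBSAXLocator); begin end;
procedure TMessageCounter.startDocument; begin end;
procedure TMessageCounter.endDocument; begin end;
procedure TMessageCounter.startPrefixMapping(var strPrefix, strURI: WideString); begin end;
procedure TMessageCounter.endPrefixMapping(var strPrefix: WideString); begin end;
procedure TMessageCounter.endElement(var strNamespaceURI, strLocalName, strQName: WideString); begin end;
procedure TMessageCounter.characters(var strChars: WideString); begin end;
procedure TMessageCounter.ignorableWhitespace(var strChars: WideString); begin end;
procedure TMessageCounter.processingInstruction(var strTarget, strData: WideString); begin end;
procedure TMessageCounter.skippedEntity(var strName: WideString); begin end;

procedure ParseWithSax(const FileName : string);
var
  Reader  : IVBSAXXMLReader;
  Handler : TMessageCounter;
  Keep    : IVBSAXContentHandler;
begin
  Handler := TMessageCounter.Create;
  Keep := Handler;                     // the interface reference owns the lifetime
  Reader := CoSAXXMLReader60.Create;
  Reader.contentHandler := Keep;
  Reader.parseURL(FileName);           // events fire while the file streams in
  ShowMessage('Messages: ' + IntToStr(Handler.Count));
end;

Because the handler reacts to each element as it streams past, memory stays flat no matter how large the log file is.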