For a certain project, I need some way to parse XML and get data from it. So I wonder: which of the built-in parsers is the fastest?
Also, it would be nice if the parser could accept an XML string as input. I have my own implementation of thread-safe file handling, and I don't want some nasty non-thread-safe library to make my efforts useless.
An XML parser is a program that reads an XML document and exposes its contents to your code, for example as a DOM-like tree. CDATA sections mark text that the parser should treat as literal character data rather than as markup. In PHP, simplexml_load_file() reads an XML file and returns a SimpleXMLElement object (or false on failure), and the DOMDocument class can be used to create XML documents.
XML is a way to structure data for sharing across websites. Several web technologies, such as RSS feeds and podcasts, are written in XML. XML is easy to create. It looks a lot like HTML, except that you make up your own tags.
The fastest parser will be SAX: it doesn't have to build a DOM, and it can work with partial XML, or progressively. Info on the PHP SAX parser (Expat) can be found here. Alternatively, there is a libxml-based DOM parser named SimpleXML. A DOM-based parser will be easier to work with, but it is typically a few orders of magnitude slower.
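To make the "partial XML, or progressively" point concrete, here is a minimal sketch using PHP's xml extension (xml_parser_create() and friends). The feed layout, the element names and the place where the input is cut in two are all made up for illustration.

```php
<?php
// Collect the text of every <title> element with the event-based (SAX) API.
// The feed below is a hypothetical example, not taken from any real service.
$titles  = [];
$inTitle = false;
$buffer  = '';

$parser = xml_parser_create('UTF-8');
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    function ($parser, string $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $inTitle  = false;
        }
    }
);

// Character data may arrive in several pieces, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

// Feed the document in two chunks; the last call passes true to signal the end.
$chunks = [
    '<feed><item><title>First</title></item><item><ti',
    'tle>Second</title></item></feed>',
];
foreach ($chunks as $i => $chunk) {
    $isFinal = ($i === count($chunks) - 1);
    if (xml_parse($parser, $chunk, $isFinal) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
}

print_r($titles);
```

Nothing is kept in memory except what the handlers choose to store, which is where the speed comes from. The price is that you have to track state (here `$inTitle`) yourself.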
This is geared primarily toward those starting out with XML parsing who are not sure which parser to use.
There are two "big" ways to go about parsing - you can either load the XML into memory and find what you need (DOM, SimpleXML) or you can stream it - read it and execute code based on what you read (XMLReader, SAX).
According to Microsoft, SAX is a "push" parser, which sends every piece of information to your application for your application to process. XMLReader is a "pull" parser, which allows you to skip chunks of data and only grab what you need. According to Microsoft, this can both simplify and accelerate your application, and I would assume the .NET and PHP implementations are similar. I suppose your choice would depend on your needs: if you're pulling out just a few tags from a larger chunk and can use $xml->next('Element') to skip significant chunks, you may find that XMLReader is faster than SAX.
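A rough sketch of that skipping pattern with XMLReader::next(). The `<product>` elements and their `id` attribute are invented for the example; substitute whatever your feed actually contains.

```php
<?php
// Hypothetical feed: each <product> carries an "id" attribute and a subtree
// that we do not care about.
$xml = <<<XML
<products>
  <product id="p1"><details><a>1</a><b>2</b></details></product>
  <product id="p2"><details><a>3</a><b>4</b></details></product>
  <product id="p3"><details><a>5</a><b>6</b></details></product>
</products>
XML;

$reader = new XMLReader();
$reader->XML($xml);

// Walk forward node by node until the first <product> start tag.
$found = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        $found = true;
        break;
    }
}

$ids = [];
while ($found) {
    $ids[] = $reader->getAttribute('id');
    // Jump to the next <product> without visiting this element's children.
    $found = $reader->next('product');
}
$reader->close();

print_r($ids);
```

Note that next() is called while the cursor sits on the `<product>` start tag itself. Calling it from inside a child element can land on the parent's closing tag, which has the same name, so it is safer to check nodeType if you do that.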
When parsing "small" (<30 KB, 700 lines) XML files repeatedly, you might not expect a huge time difference between the parsing methods. I was surprised to find that there was. I ran a comparison of a small feed processed with SimpleXML and with XMLReader. Hopefully this will help someone else visualize how significant the difference is. For a real-life comparison, this is parsing the responses to two Amazon MWS Product Information request feeds.
Each Parse Time is the time required to take 2 XML strings and return about 120 variables containing values from each string. Each loop takes different data, but each of the tests was on the same data in the same order.
SimpleXML loads the document into memory. I used microtime to check both the time to complete the parse (extract the relevant values) and the time spent creating the element (when new SimpleXMLElement($xml) was called). I have rounded these to 4 decimal places.
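For anyone wanting to repeat this kind of measurement, a minimal sketch of the timing looks something like the following. The XML string is a small hypothetical stand-in, not one of the feeds measured here, so the numbers it prints will not be comparable.

```php
<?php
// Hypothetical stand-in document; the feeds used in the comparison are not shown.
$xml = '<response><item><name>Widget</name><price>9.99</price></item></response>';

$start = microtime(true);
$doc = new SimpleXMLElement($xml);      // builds the whole tree in memory
$afterBuild = microtime(true);

$name  = (string) $doc->item->name;     // locating values in the built tree
$price = (float) $doc->item->price;
$end = microtime(true);

printf("Time Spent Making the Elements: %.4f seconds\n", $afterBuild - $start);
printf("Parse Time: %.4f seconds\n", $end - $start);
```

Passing true to microtime() returns a float, so the two timestamps can be subtracted directly.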
Parse Time: 0.5866 seconds
Parse Time: 0.3045 seconds
Parse Time: 0.1037 seconds
Parse Time: 0.0151 seconds
Parse Time: 0.0282 seconds
Parse Time: 0.0622 seconds
Parse Time: 0.7756 seconds
Parse Time: 0.2439 seconds
Parse Time: 0.0806 seconds
Parse Time: 0.0696 seconds
Parse Time: 0.0218 seconds
Parse Time: 0.0542 seconds
__________________________
2.3500 seconds
0.1958 seconds average
Time Spent Making the Elements: 0.5232 seconds
Time Spent Making the Elements: 0.2974 seconds
Time Spent Making the Elements: 0.0980 seconds
Time Spent Making the Elements: 0.0097 seconds
Time Spent Making the Elements: 0.0231 seconds
Time Spent Making the Elements: 0.0091 seconds
Time Spent Making the Elements: 0.7190 seconds
Time Spent Making the Elements: 0.2410 seconds
Time Spent Making the Elements: 0.0765 seconds
Time Spent Making the Elements: 0.0637 seconds
Time Spent Making the Elements: 0.0081 seconds
Time Spent Making the Elements: 0.0507 seconds
______________________________________________
2.1195 seconds
0.1766 seconds average
Over 90% of the total time is spent loading elements into the DOM.
Only 0.2305 seconds is spent locating the elements and returning them.
With XMLReader, which is stream-based, I was able to skip a significant chunk of one of the XML feeds, since the data I wanted was near the top of each element. "Your mileage may vary."
Parse Time: 0.1059 seconds
Parse Time: 0.0169 seconds
Parse Time: 0.0214 seconds
Parse Time: 0.0665 seconds
Parse Time: 0.0255 seconds
Parse Time: 0.0241 seconds
Parse Time: 0.0234 seconds
Parse Time: 0.0225 seconds
Parse Time: 0.0183 seconds
Parse Time: 0.0202 seconds
Parse Time: 0.0245 seconds
Parse Time: 0.0205 seconds
__________________________
0.3897 seconds
0.0325 seconds average
What is striking is that although locating elements is slightly faster in SimpleXML once it is all loaded, it is actually over 6 times faster to use XMLReader overall.
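The arithmetic behind those statements, using only the totals reported above, can be checked with a few lines:

```php
<?php
// Totals copied from the measurements above, in seconds.
$simpleXmlTotal = 2.3500;
$simpleXmlBuild = 2.1195;
$xmlReaderTotal = 0.3897;

$buildShare = $simpleXmlBuild / $simpleXmlTotal * 100;  // share spent building the tree
$locateTime = $simpleXmlTotal - $simpleXmlBuild;        // time spent locating values
$speedup    = $simpleXmlTotal / $xmlReaderTotal;        // SimpleXML total vs XMLReader total

printf("Build share: %.1f%%\n", $buildShare);           // 90.2%
printf("Locate time: %.4f seconds\n", $locateTime);     // 0.2305 seconds
printf("Speedup: %.2fx\n", $speedup);                   // 6.03x
```

So locating values took 0.2305 seconds in SimpleXML against 0.3897 seconds for the whole XMLReader run, but the 2.1195 seconds spent building the tree dominates everything else.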
You can find some information on using XMLReader at How to use XMLReader in PHP?
Each XML extension has its own strengths and weaknesses. For example, I have a script that parses the XML data dump from Stack Overflow. The posts.xml file is 2.8GB! For this large XML file, I had to use XMLReader because it reads XML in a streaming mode, instead of trying to load and represent the whole XML document in memory at once, as the DOM extension does.
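A minimal sketch of that streaming style, run against a tiny temporary file so it can be tried anywhere. The `<row Id="..." Score="..."/>` layout is an assumption made for illustration; check the actual dump for its real attributes.

```php
<?php
// Write a tiny stand-in file; a real dump is far too large to embed here.
// The <row Id="..." Score="..."/> layout is assumed for illustration only.
$path = tempnam(sys_get_temp_dir(), 'posts');
file_put_contents(
    $path,
    '<posts><row Id="1" Score="5"/><row Id="2" Score="7"/><row Id="3" Score="-1"/></posts>'
);

$reader = new XMLReader();
if (!$reader->open($path)) {
    throw new RuntimeException("Cannot open $path");
}

$rowCount   = 0;
$totalScore = 0;
// read() advances one node at a time, so the document is never loaded as a whole.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'row') {
        $rowCount++;
        $totalScore += (int) $reader->getAttribute('Score');
    }
}
$reader->close();
unlink($path);

printf("%d rows, total score %d\n", $rowCount, $totalScore);
```

Memory use stays roughly flat regardless of file size, because each row is handled and then discarded.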
So you need to be more specific about describing how you are going to use the XML, in order to decide which PHP extension to use.
All of PHP's XML extensions provide some method to read XML data as a string.
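For reference, a short sketch of the string entry point for each extension. The sample document is made up, and error handling is left out to keep it short.

```php
<?php
$xml = '<greeting lang="en">Hello</greeting>';

// SimpleXML: returns a SimpleXMLElement, or false on failure.
$sx = simplexml_load_string($xml);
$viaSimpleXml = (string) $sx;

// DOM: loadXML() parses a string into the DOMDocument.
$dom = new DOMDocument();
$dom->loadXML($xml);
$viaDom = $dom->documentElement->textContent;

// XMLReader: XML() sets the string as the input to stream over.
$reader = new XMLReader();
$reader->XML($xml);
$reader->read();                        // move to <greeting>
$viaXmlReader = $reader->readString();
$reader->close();

// xml extension (SAX): xml_parse() takes the string directly.
$viaSax = '';
$parser = xml_parser_create();
xml_set_character_data_handler($parser, function ($p, string $data) use (&$viaSax) {
    $viaSax .= $data;
});
xml_parse($parser, $xml, true);

var_dump($viaSimpleXml, $viaDom, $viaXmlReader, $viaSax);
```

Since every one of these takes a string, you can keep your own file handling and pass the contents in yourself.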