We are looking to develop a reporting application that reports on data stored in a large number of XML files: ~3,000,000 files ranging in size from 7KB to 5MB (each file conforms to the same schema). I’m guessing that there will be around 200GB of XML. I’m looking at a number of open source XML databases (Sedna, BaseX and eXist-db), and I’m not sure how well these systems will scale. I read a comparison of these three databases here, which is where my concerns about scalability originated.
Some details regarding what we want to do: we won’t be changing the data in any of the XML files, and new files will be added daily. Since we are concerned with reporting, query performance is important to us; the time it takes to add and index new files isn’t a high priority.
I’m wondering if anyone has experience using these systems at similar scales? I’ve looked at the BaseX statistics page and see some fairly large XML instances but no mention of performance.
We don’t require an open source product and the MarkLogic system looks like it can fit the bill nicely, but I’m curious what’s been done with open source products.
I think it is impossible to answer your question with either a yes or no. It is really impossible to state anything about performance from the little details that you have given.
Performance typically depends on the queries that you want to perform and the distribution of your data, not to mention what you consider to be "acceptable".
In the paper you referenced, it is interesting to note that they state that they could not get the new range indexes in eXist 2.2 preview to work. Certainly without those, they would have seen much worse performance. Also, at the end they state that they will select Sedna because they can overcome the problems with Sedna; it was not clear to me why that was. For example, do they have C++ devs who can work with Sedna but no Java devs who could work with eXist or BaseX? Finally, the version of Java they used for testing eXist and BaseX is rather old; the next release of eXist (3.0) will only support Java 8 and newer.
I would be surprised if you could not store 200GB of data into BaseX, eXist or Sedna, but without knowing your data and the sort of queries you want to execute, I cannot comment on query performance.
I think your best option is to run a small trial of one or all of them, in a manner not dissimilar to that linked article.
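For what it's worth, such a trial does not need much tooling. Below is a minimal sketch of a timing harness in Python, assuming you wrap each database's client in a function that takes a query string and runs it. The names `time_query`, `benchmark` and `fake_run_query` are hypothetical and not part of BaseX, eXist or Sedna; you would replace `fake_run_query` with a real call into whichever client you are testing, and the query string here is only a placeholder.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[str], object], query: str, repeats: int = 5) -> List[float]:
    """Run one query `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return timings


def benchmark(run_query: Callable[[str], object], queries: Dict[str, str],
              repeats: int = 5) -> Dict[str, Dict[str, float]]:
    """Time every named query and summarise first-run, median and worst-case times."""
    results = {}
    for name, query in queries.items():
        timings = time_query(run_query, query, repeats)
        results[name] = {
            "first": timings[0],  # first run, before any caches are warm
            "median": statistics.median(timings),
            "max": max(timings),
        }
    return results


def fake_run_query(query: str) -> int:
    """Stand-in for a real database call (an assumption, for illustration only)."""
    return len(query)


if __name__ == "__main__":
    # Placeholder query; use the reporting queries you actually care about.
    report_queries = {"count_all": "count(collection('reports')//record)"}
    for name, stats in benchmark(fake_run_query, report_queries).items():
        print(name, stats)
```

Load a representative sample of your files into each candidate, run your real reporting queries through this, and compare the medians against whatever you consider "acceptable". Keeping the first run separate matters because a cold-cache query can behave very differently from a repeated one.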