Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

XML vs MySQL for Large Sites

Tags:

php

mysql

xml

For a very large site such as a Social Network (say Facebook), which method would you recommend for user accounts storage?

1) Single XML files for each type of features, on the user's directory: basicinfo.xml, comments.xml, photos.xml, ...

2) MySQL, although not sure how to organize on this. Maybe separated tables for each feature? E.g. a tables for Comments, where columns are id,from,message,time?

I know XML is not designed for storage, and PHP (this is the language I use) must read the entire XML file and store in memory before it is used.

But, here are the reasons why I prefer XML (but I may be wrong, please tell me if you disagree with any):

1) If I have user accounts' paths organized in this way

User ID 2342:
/users/00/00/00/00/00/00/00/23/42/

I think it's faster to find the Comments of a user by file path than seeking in a large database.
Also, if each feature is split in tables, each user profile will seek more than once, to display comments, photos, basic info, etc.

2) I heard MySQL is globaly locked when writing on it. Is this true? If yes, I rather to lock a single file than everything.

3) Is MySQL "shared" between the cluster? I mean, if 1 disk gets full, will it "continue" on another? Or do I, as the programmer, have to manage it myself and create new databases on another disk? (note, I use Linux)
It is ok that it is about the same by using XML files, but it is easier to split between disks, because structure is split by account IDs, not by feature as it would be in a database.

4) Note that I don't store each comment on the comments.xml. I just note their attributes in each XML tag, and the messages are in separated text files commentid.txt. Once each XML should not be much large, there should not be problems with memory/time.

As for the problem of parsing entire XML, maybe I should think on using XMLReader/Writer instead of SimpleXML/DOM? Or, will it decrease performance allot?

Thank you!

like image 523
Nuno Avatar asked Sep 10 '10 03:09

Nuno


People also ask

Why are we using XML instead of database?

XML Database is used to store huge amount of information in the XML format. As the use of XML is increasing in every field, it is required to have a secured place to store the XML documents. The data stored in the database can be queried using XQuery, serialized, and exported into a desired format.

Is XML good for storing data?

XML is a format to store data along with its structure. This feature makes it useful for many things, including transferring data, formatting documents, creating layouts, and more.

Can you replace database with XML?

XML is used as the storage format basis for some NoSQL systems but in itself is not a replacement for a relational database.

Can we use both XML and MySQL together?

You can use XML from within any language that has the appropriate processing tools available. For example, XML APIs exist for languages such as PHP, Python, Java, and Tcl, all of which also have MySQL capabilities.


2 Answers

Facebook uses MySQL.

That being said. Here's the long version:

I always say that XML is a data transfer technology, not a data storage technology, but not everyone agrees. XML is not designed to be use a relational datastore. XML was first introduced in order to provide a standard way of transmitting data from system to system w/o giving access to the originating systems.

Since you are talking about a large application, I would strongly urge you to use MySQL (or other RDBMS), as your dataset grows and grows the XML will be increasingly slower and slower unless you always keep a fresh copy in memory and only read the XML files upon service reboot.

Using an XML database is reportedly more efficient in terms of conversion costs when you're constantly sending XML into and retrieving XML out of a database. The rationale is, when XML is the only transport syntax used to get things in and out of the DB, why squeeze everything through a layer of SQL abstraction and all those relational tables, foreign keys, and the like? It basically takes a parsing layer out of the application and brings it into the data engine - where it's probably going to work faster and more efficiently than the SQL alternative. Probably.

like image 183
shamittomar Avatar answered Sep 23 '22 00:09

shamittomar


Depends heavily on the nature of your site. On the one hand the XML approach gives you a free pass on things like “SELECT * FROM $table where $table.id=$id” type queries. On the other hand...

For a very large site, in the worst case scenario the data files end up pretty big too. If it is any kind of community site this may easily happen for any account go to any forum with a true number of old-guard members in its community and you'll find a couple of posters that have say 10K posts... This means you will wish for SQL style result sets which are implemented using a memory efficient model, rather than a speed efficient one. To the end user 1s versus 1.1s response time is not that much of a deal; but to you 1K of simultaneous requests versus 1.5K or better definitely is.

Then there is the aspect that if you are mostly reading data XML may be fine if somewhat crude for large data sets and DOM based implementations. But if you are writing a lot, things become much much worse. Caching of data is still possible, but giving ACID like guarantees on these file transactions requires you to pretty much write your own database software.

And then there is storage requirements and such like which mean you may need a distributed approach for storing your data. These kind of setups are relatively well understood in the database world, and they bring a lot of interesting problems with them to the table (like what do you do if a single disk fails?, how do you know on what disk to find the data and how do you implement efficient caching?) that essentially amount to again writing your own mini-database software from scratch.

So for a very large site I think the hard technical requirements of performance at not too great a cost in terms of memory and also a certain reliability and not needing to reinvent 21 wheels at the same time means that your approach would not work that well. I think it is better suited to smallish read-only sites where you can afford to experiment with and pursue alternative routes, where you can easily make changes and roll them out across the entire site.

like image 41
user268396 Avatar answered Sep 27 '22 00:09

user268396