Using a Filesystem (Not a Database!) for Schemaless Data - Best Practices

After reading over my other question, "Using a Relational Database for Schema-Less Data", I began to wonder whether a filesystem is more appropriate than a relational database for storing and querying schemaless data.

Rather than just building a file system on top of MySQL, why not just save the data directly to the filesystem? Indexing needs to be figured out, but modern filesystems are very stable, have great features like replication, snapshot and backup facilities, and are flexible at storing schema-less data.

However, I can't find any examples of someone using a filesystem instead of a database.

Where can I find more resources on how to implement a schemaless (or "document-oriented") database as a layer on top of a filesystem? Is anyone using a modern filesystem as a schemaless database?

Évariste Galois, asked Nov 15 '10 23:11



2 Answers

Yes, a filesystem can be treated as a special case of a NoSQL-like database system. It has some limitations that should be weighed in any design decision:

pros:

  • simple, intuitive
  • takes advantage of years of tuning and caching algorithms
  • easy backup, potentially easy clustering

things to think about:

  • richness of metadata - what types of data does it store, how does it let you query them, can you have hierarchical or multivalued attributes

  • speed of querying metadata - not all filesystems are particularly well optimized for querying anything other than size and dates.

  • inability to join queries (though that's pretty much common to NoSQL)

  • inefficient storage usage (unless the file system performs block suballocation, you'll typically blow 4-16K per item stored regardless of size)

  • may not have the kind of caching algorithm you want for its directory structure
  • tends to be less tunable, etc.
  • backup solutions may have trouble depending on how you store things (too deep, too many items per node, etc.), which might negate an obvious advantage of such a structure
  • locking works pretty well on a LOCAL filesystem, of course, if you call the right routines, but not necessarily on a network-based filesystem (those problems have been solved in various ways, but it's certainly a design issue)
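To illustrate the point about "calling the right routines" on a local filesystem, here is a minimal sketch of a counter file protected by an advisory lock. It assumes a Unix-like system and Python's standard `fcntl.flock`; the function name `increment_counter` and the file layout (one integer stored as text) are hypothetical, chosen only for this example. As noted above, the same approach is not necessarily safe on a network filesystem.

```python
import fcntl
import os


def increment_counter(path):
    """Add 1 to the integer stored in `path` under an exclusive lock."""
    # O_CREAT without O_TRUNC: create the file if missing, never clobber it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        # Advisory lock: blocks until no other cooperating writer holds it.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            raw = f.read().strip()
            value = int(raw) + 1 if raw else 1
            # Rewrite the file from the start with the new value.
            f.seek(0)
            f.truncate()
            f.write(str(value))
            f.flush()
            os.fsync(f.fileno())
            return value
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

The lock is advisory, so it only protects against other processes that also take the lock before touching the file.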
MJB, answered Oct 06 '22 00:10


I had the same idea more than 15 years ago, when hosting costs and hardware limitations were very different from today.

My main motivation was to design a cheap and simple solution able to withstand traffic spikes. Another goal was to improve the security of the applications by taking SQL attack vectors out of the equation.

I ended up with a simple document-oriented database, more like a wrapper around FS functions.
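As an illustration only (this is not the author's implementation, which is not shown), a document store that is "more like a wrapper around FS functions" might look like the sketch below. The class name `FileStore`, the one-JSON-file-per-key layout, and the key pattern are all assumptions made for this example. Writes go to a temporary file that is then renamed over the target, so a reader sees either the old document or the new one.

```python
import contextlib
import json
import os
import re
import tempfile
from pathlib import Path

# Hypothetical rule: restrict keys so they can never escape the root directory.
_KEY_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


class FileStore:
    """Tiny document store: one JSON file per key under a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        if not _KEY_RE.fullmatch(key):
            raise ValueError(f"invalid key: {key!r}")
        return self.root / f"{key}.json"

    def put(self, key, doc):
        target = self._path(key)
        # Write to a temp file in the same directory, then rename it into place.
        fd, tmp = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(doc, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, target)
        except BaseException:
            # Clean up the temp file if anything failed before the rename.
            with contextlib.suppress(FileNotFoundError):
                os.unlink(tmp)
            raise

    def get(self, key, default=None):
        try:
            with open(self._path(key), encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return default

    def delete(self, key):
        self._path(key).unlink(missing_ok=True)  # requires Python 3.8+

    def keys(self):
        return sorted(p.stem for p in self.root.glob("*.json"))
```

Rejecting any key that does not match the pattern is one simple way to keep callers from playing tricks with file names such as `../evil`.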

What started as a personal project out of curiosity proved to be very rewarding in the long run. I will try to list both pros and cons.

PROS:

  • Fast
  • Cheap maintenance. Most applications I built using a file system "database" are still working today with zero maintenance of the database implementation. This was an unexpected outcome, and it happens because file system functions rarely change in any of the programming languages I used this solution with (PHP, C, C++, Erlang). I can't say the same about applications using mainstream databases. They often require fixing deprecated code, and many of my old projects are now dead in the water because either I or the clients decided not to finance the expensive upgrades anymore, or they are running old, unsupported database versions that pose a high security risk.
  • Resilient to attacks, being completely immune to SQL injection. Many attackers target mainstream products and are clueless when facing a custom storage facility.
  • Amazingly good at withstanding traffic spikes compared to many database systems that require socket connections. It's quite easy to exhaust a database's maximum connection limit, and many drivers for well-known NoSQL databases have a limited connection pool that they reuse across multiple threads, forcing the industry to design expensive distributed systems.
  • Unexpectedly easy to scale. In one case, when the application needed to store much more data than I had initially anticipated, I used a distributed file system (Ceph) and solved the problem without any code modification.
  • Keeping the files in a RAM FS opens many opportunities to optimize things.
  • Did I say security? Usually all you have to take care of is making sure that no upload process can write to your FS database files or play tricks with file names. And, of course, apply your usual OS security measures to protect your files.
  • Easy to back up and maintain using file system tools.

CONS:

  • Atomic operations are hard to implement due to the lack of supervisor processes that are found in more complex database systems.
  • Implementing counters is hard, and you will have to be quite creative in designing an FS-based database locking mechanism, especially if you want to remain compatible with distributed file systems such as Ceph, for which OS-level file locks are known to be buggy.
  • Handling concurrent writes is tricky. I came up with a simple solution resembling Cassandra writes: adding updates as new files and having cron jobs clean up the old "versions" of the data.
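The "updates as new files" idea from the last point could be sketched roughly as follows. This is an assumption-laden illustration, not the author's code: the function names, the per-key directory, and the timestamp-plus-random-suffix file naming are all hypothetical, and `key` is assumed to be a trusted, filesystem-safe name. The newest version is chosen by wall-clock timestamp, so writers on machines with skewed clocks could be ordered incorrectly.

```python
import json
import os
import time
import uuid
from pathlib import Path


def write_version(root, key, doc):
    """Store `doc` as a new version file; existing versions are never modified."""
    key_dir = Path(root) / key
    key_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded nanosecond timestamp sorts lexicographically by time;
    # the random suffix avoids name collisions between concurrent writers.
    name = f"{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    tmp = key_dir / (name + ".tmp")
    tmp.write_text(json.dumps(doc), encoding="utf-8")
    os.replace(tmp, key_dir / name)  # publish the finished file in one step
    return name


def read_latest(root, key):
    """Return the newest version of `key`, or None if there is none."""
    key_dir = Path(root) / key
    if not key_dir.is_dir():
        return None
    versions = sorted(p.name for p in key_dir.glob("*.json"))
    if not versions:
        return None
    return json.loads((key_dir / versions[-1]).read_text(encoding="utf-8"))


def prune_versions(root, key, keep=1):
    """Delete all but the newest `keep` versions; return how many were removed."""
    key_dir = Path(root) / key
    versions = sorted(key_dir.glob("*.json"))
    old = versions[:-keep] if keep > 0 else versions
    for p in old:
        p.unlink(missing_ok=True)  # requires Python 3.8+
    return len(old)
```

A scheduled job (cron, for example) would call `prune_versions` for each key. A more careful version would also have readers retry if the file they selected is pruned before they open it.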

My conclusion is that using the file system as a database is best for applications where the content is maintained by a limited number of administrators, concurrent writes are rarely a concern, and you want as many cheap reads as possible. For those scenarios, this idea can be quite a money saver.

Disclaimer: Please don't judge me too hard :) I'm a programmer with an old mindset of being more a creator than a user of out-of-the-box solutions. I lived through the times when programmers were doing a lot from scratch to fit their needs, including... operating systems. I believe personal experiments (including reinventing the wheel) are good learning opportunities for anybody.

Grigore Madalin, answered Oct 06 '22 00:10