Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storing large files / binary data in a mysql database: when is it ok?

Ok, I have searched about this and read a few points of view about storing binary data in a [MySQL] database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and just storing a reference to the file in a database.

However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.

I have written a general system for the database sync which works well using Reflection and XML. I have also (against my instincts) integrated the file storage in to this system. Again, it works well - I chop files in to 64Kb BLOBs and store them in a table, with a file_id reference (linked to a seperate table which contains meta data such as file name/size/mime type).

This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.

So far I have not found any issues with this, and have successfully imported and transferred over 1gb of data in both directions (over about 10-15 files / 16000 rows), but I worry about its scalability - will it slow down once there is 20gb+ data in there, or can MySQL handle it provided my queries are well structured?

Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.

I would very much appreciate any views or comments as to whether this is a good or bad approach, and have I missed any obvious problems I'm likely to see once used in a production environment?

edit: I forgot to mention, the file sizes could range from 1KB to ~1GB

[Rough] Conclusion Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult as each has something decent to offer.

In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an ok solution (I still can't help wondering why they bother including the BLOB types though).

As the alternative, I am torn between @Nick Coons file system approach and @tadman's suggestion of a hybrid using a light weight key/value database engine such as leveldb. Provided the practicalities of using leveldb in this project are not an issue, this is most likely the approach I will work towards.

I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.

That being said, and for those that are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15gb of binary data without any noticable negative side effects from to inserting/retrieving data from large tables (with careful queries). However, I am certain this is still very inefficient and either of the alternative methods mentioned will be significantly better.

like image 980
Alfie Avatar asked Jul 29 '13 23:07

Alfie


People also ask

Should binary files be stored in the database?

Don't do it. There really isn't an upside to having files stored in the database.

Why you should not store files in database?

Storing a file in a database table means you need to use database logic (and possible application logic) to access the file. This increases the size of your database and degrades its performance. It also makes backups and data corruption harder to deal with.

What is the advantage of the binary format for storing numbers What is the disadvantages?

Keeping the binary data in the database and returning it in the record set is simpler. A disadvantage is that your database will now require more storage space and this will impact backups and other things. Sometimes having large amounts of binary data in a database can also have an impact on performance.

How can I store large files in SQL?

FILESTREAM, in SQL Server, allows storing these large documents, images or files onto the file system itself. In FILESTREAM, we do not have a limit of storage up to 2 GB, unlike the BLOB data type. We can store large size documents as per the underlying file system limitation.


2 Answers

I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.

The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there's many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite because an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.

That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.

like image 103
tadman Avatar answered Oct 21 '22 15:10

tadman


Short Answer:

I'm not sure there's a hard-lined way to answer this. You mentioned files being from 1KB to 1GB.. I wouldn't store binary data in a DB if it's going to anywhere near 1KB, let along 1GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially that doesn't need to be searched, should be stored in the filesystem:

When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.

Example Implementation:

There are better methods to distribute file data than to enter it into a DB, such as a distributed filesystems (check into GlusterFS, MooseFS, both of which will scale by simply adding additional hard drives, whereas MySQL will not).

Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:

./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7

That is, a top-level directory made up of the first two characters, a second-level directory made up of the second two characters, and then finally the file with the name of the total string. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.

Then I create a DB table with these columns to hold the meta data:

  • file_id, an auto_increment field
  • created, a field with a default value of current_timestamp
  • prev_id, more on this below
  • hash, the SHA1 hash on the filesystem
  • name, a textual name of the file (such as the original name that the file would have taken on disk.

When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.

If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk, I create a new one (because the new file contents would be represented by a new SHA1 hash), and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.

If I need this to be available in a distributed fashion, I setup MySQL replication and then use GlusterFS to replicate he filesystem across multiple servers.

like image 3
Nick Coons Avatar answered Oct 21 '22 15:10

Nick Coons