Storing large files / binary data in a MySQL database: when is it OK?

Tags: c#, database, php, mysql, binary

OK, I have searched for this and read a few points of view about storing binary data in a MySQL database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and just storing a reference to the file in a database.

However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.

I have written a general system for the database sync which works well using Reflection and XML. I have also (against my instincts) integrated the file storage into this system. Again, it works well: I chop files into 64KB BLOBs and store them in a table, with a file_id reference (linked to a separate table which contains metadata such as file name/size/MIME type).

This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.
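As a rough illustration of the chunking described above, here is a minimal Python sketch. It is not the code from this project (which is not shown); the names split_into_chunks and reassemble, and the sequence number attached to each chunk, are assumptions made for the example.

```python
import io
from typing import BinaryIO, Iterator, List, Tuple

CHUNK_SIZE = 64 * 1024  # 64KB, matching the BLOB size mentioned above


def split_into_chunks(stream: BinaryIO,
                      chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_number, chunk_bytes) pairs read from a binary stream."""
    seq = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        yield seq, chunk
        seq += 1


def reassemble(rows: List[Tuple[int, bytes]]) -> bytes:
    """Rebuild the original bytes from rows that may arrive out of order."""
    ordered = sorted(rows, key=lambda row: row[0])
    return b"".join(chunk for _, chunk in ordered)


# Usage: split some in-memory data and put it back together.
data = bytes(range(256)) * 1000
rows = list(split_into_chunks(io.BytesIO(data)))
restored = reassemble(rows)
```

Each (sequence_number, chunk_bytes) pair would map to one row in the chunk table, alongside the file_id, so chunks can be sent in any order and still be reassembled correctly.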

So far I have not found any issues with this, and have successfully imported and transferred over 1GB of data in both directions (over about 10-15 files / 16000 rows), but I worry about its scalability: will it slow down once there is 20GB+ of data in there, or can MySQL handle it provided my queries are well structured?

Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.

I would very much appreciate any views or comments as to whether this is a good or bad approach, and whether I have missed any obvious problems I'm likely to see once this is used in a production environment.

edit: I forgot to mention, the file sizes could range from 1KB to ~1GB

[Rough] Conclusion

Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult as each has something decent to offer.

In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an ok solution (I still can't help wondering why they bother including the BLOB types though).

As the alternative, I am torn between @Nick Coons' file system approach and @tadman's suggestion of a hybrid using a lightweight key/value database engine such as LevelDB. Provided the practicalities of using LevelDB in this project are not an issue, this is most likely the approach I will work towards.

I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.

That being said, and for those that are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15GB of binary data without any noticeable negative side effects from inserting/retrieving data from large tables (with careful queries). However, I am certain this is still very inefficient and either of the alternative methods mentioned will be significantly better.

980

asked Jul 29 '13 23:07

Alfie

2 Answers

I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.

The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there are many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite: an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.

That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.

103

answered Oct 21 '22 15:10

tadman

Short Answer:

I'm not sure there's a hard-and-fast way to answer this. You mentioned files ranging from 1KB to 1GB. I wouldn't store binary data in a DB if it's going to be anywhere near 1KB, let alone 1GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially data that doesn't need to be searched, should be stored in the filesystem:

When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.

Example Implementation:

There are better ways to distribute file data than putting it into a DB, such as a distributed filesystem (look into GlusterFS and MooseFS, both of which will scale by simply adding hard drives, whereas MySQL will not).

Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:

./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7

That is, a top-level directory made up of the first two characters, a second-level directory made up of the second two characters, and then finally the file with the name of the total string. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.
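The directory layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this answer: the function names sharded_path and store and the root argument are hypothetical.

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, data: bytes) -> Path:
    """Return root/<first two hex chars>/<next two hex chars>/<full SHA1 hex digest>."""
    digest = hashlib.sha1(data).hexdigest()
    return root / digest[:2] / digest[2:4] / digest


def store(root: Path, data: bytes) -> Path:
    """Write data under its content hash and return the path it was stored at."""
    path = sharded_path(root, data)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # identical content always hashes to the same path
        path.write_bytes(data)
    return path
```

Because the name is derived from the content, storing the same bytes twice results in a single file on disk.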

Then I create a DB table with these columns to hold the metadata:

file_id, an auto_increment field
created, a field with a default value of current_timestamp
prev_id, more on this below
hash, the SHA1 hash on the filesystem
name, a textual name of the file (such as the original name that the file would have taken on disk).

When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.

If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk, I create a new one (because the new file contents would be represented by a new SHA1 hash), and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.
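To make the prev_id versioning concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL, so the column types are SQLite's rather than MySQL's. The column names and the files table name come from the description above; the helper names add_version and history are assumptions made for the example.

```python
import hashlib
import sqlite3
from typing import List, Optional

# SQLite stand-in for the metadata table described above.
SCHEMA = """
CREATE TABLE files (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    prev_id INTEGER REFERENCES files(file_id),
    hash    TEXT NOT NULL,
    name    TEXT NOT NULL
)
"""


def add_version(db: sqlite3.Connection, name: str, data: bytes,
                prev_id: Optional[int] = None) -> int:
    """Insert a row for this content; prev_id points at the version it replaces."""
    digest = hashlib.sha1(data).hexdigest()
    cur = db.execute(
        "INSERT INTO files (prev_id, hash, name) VALUES (?, ?, ?)",
        (prev_id, digest, name),
    )
    db.commit()
    return cur.lastrowid


def history(db: sqlite3.Connection, file_id: Optional[int]) -> List[str]:
    """Follow prev_id links from the given version back to the first one."""
    hashes = []
    while file_id is not None:
        row = db.execute(
            "SELECT hash, prev_id FROM files WHERE file_id = ?", (file_id,)
        ).fetchone()
        if row is None:
            break
        hashes.append(row[0])
        file_id = row[1]
    return hashes


# Usage: record a file, then an edited version that points back to it.
db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
v1 = add_version(db, "notes.txt", b"first draft")
v2 = add_version(db, "notes.txt", b"second draft", prev_id=v1)
```

Calling history(db, v2) returns the hashes from newest to oldest, and each hash identifies the corresponding file on disk.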

If I need this to be available in a distributed fashion, I set up MySQL replication and then use GlusterFS to replicate the filesystem across multiple servers.

answered Oct 21 '22 15:10

Nick Coons
