
Keeping my database and file system in sync

I'm working on a piece of software that stores files in a file system and keeps references to those files in a database. The uploaded files can then be queried through the database without touching the file system. From what I've read in other posts, most people say it's better to use a file system for file storage rather than storing binary data directly in the database as a BLOB.

So now I'm trying to understand the best way to set this up so that the database and the file system stay in sync, and I don't end up with references to files that don't exist or with unreferenced files taking up space in the file system. Here are a few options I'm considering.

Option 1: Add File Reference First

    // Adds a reference to a file in the database
    database.AddFileRef("newfile.txt");

    // Stores the file in the file system
    fileStorage.SaveFile("newfile.txt", dataStream);

This option would be problematic because the reference is added before the actual file, so another user may try to download a file before it is actually stored in the system. On the other hand, since the reference is created beforehand, its primary key value could be used when storing the file.

Option 2: Store File First

    // Stores the file
    fileStorage.SaveFile("newfile.txt", dataStream);

    // Adds a reference to the file in the database;
    // fails if the referenced file does not exist in the file system
    database.AddFileRef("newfile.txt");

This option is better, but it makes it possible for a file to be uploaded to the system and never referenced. That could be remedied with a "Purge" or "CleanUpFileSystem" function that deletes any unreferenced files. This option also wouldn't allow the file to be stored using the primary key value from the database.
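The "Purge" idea for option 2 could look roughly like the following. This is only a minimal Python sketch, assuming all files sit in one flat directory and that the set of referenced names has already been read from the database. The names `purge_unreferenced` and `min_age_seconds` are made up for illustration. The age check is there because, with option 2, a file legitimately exists for a short time before its reference does.

```python
import time
from pathlib import Path


def purge_unreferenced(storage_dir, referenced_names, min_age_seconds=3600, now=None):
    """Delete files in storage_dir that have no database reference.

    Files younger than min_age_seconds are skipped, because they may belong
    to an upload whose reference has not been inserted yet.
    Returns the sorted names of the files that were deleted.
    """
    now = time.time() if now is None else now
    referenced = set(referenced_names)
    deleted = []
    for path in Path(storage_dir).iterdir():
        if not path.is_file() or path.name in referenced:
            continue
        if now - path.stat().st_mtime < min_age_seconds:
            continue  # possibly an upload still in flight
        path.unlink()
        deleted.append(path.name)
    return sorted(deleted)
```

Run from a scheduled job, this keeps the amount of orphaned storage bounded without adding anything to the upload path itself.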

Option 3: Pending Status

    // Adds a pending file reference to the database;
    // pending files would be ignored by others
    database.AddFileRef("newfile.txt");

    // Stores the file; fails if there is no
    // matching pending file reference in the database
    fileStorage.SaveFile("newfile.txt", dataStream);

    // Marks the file reference as committed after the file is uploaded
    database.CommitFileRef("newfile.txt");

This option allows the primary key to be created before the file is uploaded, but also prevents other users from obtaining a reference to a file before it is uploaded. It would still be possible for an upload to never finish, leaving a file reference stuck in the pending state, but it would be fairly trivial to purge pending references from the database.
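To make option 3 concrete, here is a minimal Python sketch using the standard `sqlite3` module. The table layout, the `status` column values, and the function names (`open_db`, `store_file`, `visible_files`, `purge_pending`) are all assumptions for illustration, not part of any existing API.

```python
import sqlite3
from pathlib import Path


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) files table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def store_file(conn, storage_dir, name, data):
    # 1. Insert a pending reference to obtain the primary key.
    with conn:
        cur = conn.execute("INSERT INTO files (name) VALUES (?)", (name,))
        file_id = cur.lastrowid
    # 2. Write the file, named after the primary key.
    Path(storage_dir, str(file_id)).write_bytes(data)
    # 3. Mark the reference committed so readers can see it.
    with conn:
        conn.execute("UPDATE files SET status = 'committed' WHERE id = ?", (file_id,))
    return file_id


def visible_files(conn):
    """Readers only ever query committed references."""
    rows = conn.execute(
        "SELECT id, name FROM files WHERE status = 'committed' ORDER BY id"
    )
    return rows.fetchall()


def purge_pending(conn, storage_dir):
    """Remove pending references and any partial files; returns the purged ids."""
    with conn:
        ids = [r[0] for r in conn.execute("SELECT id FROM files WHERE status = 'pending'")]
        for file_id in ids:
            Path(storage_dir, str(file_id)).unlink(missing_ok=True)
        conn.execute("DELETE FROM files WHERE status = 'pending'")
    return ids
```

In a real system the purge would presumably only touch pending rows older than some timeout, so that it does not remove uploads that are still in progress.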

I'm leaning toward option 2, because it's simple, and I don't have to worry about users trying to request files before they are uploaded. Storage is cheap, so it's not the end of the world if I end up with some unreferenced files taking up space. But this also seems like a common problem, and I'd like to hear how others have solved it or other considerations I should be making.

Eric Anastas asked Mar 15 '13 18:03



1 Answer

I want to propose another option: make the filename always equal to the hash of the file's contents. Then you can safely write any content at any time, provided that you write it before you add a reference to it elsewhere.

Because the contents stored under a given name never change, there is never a synchronization problem.

This gives you deduplication for free. Deletes become harder, though, because several references may point at the same file. I recommend a nightly garbage collection process.
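A minimal Python sketch of this content-addressed approach might look like the following. The choice of SHA-256, the flat directory layout, and the function names `put` and `collect_garbage` are assumptions made for the example; the answer does not prescribe any of them.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def put(storage_dir, data):
    """Store data under the SHA-256 hex digest of its contents; return that name."""
    digest = hashlib.sha256(data).hexdigest()
    target = Path(storage_dir, digest)
    if not target.exists():
        # Write to a temp file and rename, so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=storage_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, target)
    return digest


def collect_garbage(storage_dir, referenced_digests):
    """Delete every stored file whose name is not a referenced digest."""
    referenced = set(referenced_digests)
    removed = []
    for path in Path(storage_dir).iterdir():
        if path.is_file() and path.name not in referenced:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Writing the same bytes twice yields the same name and only one file, which is where the deduplication comes from. A production collector would likely also skip recently written files, since a file is always written before its reference exists.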

usr answered Sep 16 '22 13:09
