A common technique for storing a large number of files/blobs in a filesystem is to use a hash function to determine the file path; e.g. hash(identifier) -> "o238455789" -> o23/8455/789 (there is often a hash-collision strategy too).
Does this technique have a name (is it a 'pattern'?) so that I can find it by searching the ACM Digital Library or a similar online database of computing literature?
Are there any books/papers that explore the problem/solution?
PS: Thanks for the helpful notes, but none of them address the technique given above.
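To make the technique in question concrete, here is a minimal Python sketch of deriving a nested path from a hash of the identifier. The function name `hashed_path`, the choice of SHA-256, and the 2/2/rest split are assumptions made for illustration; the example in the question uses a different hash and a 3/4/3 split, and no collision strategy is shown.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(identifier: str, widths=(2, 2)) -> PurePosixPath:
    """Map an identifier to a nested relative path derived from its hash.

    Assumption: SHA-256 hex digest, split into directory levels of the
    given widths, with the remainder of the digest as the file name.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    parts = []
    pos = 0
    for width in widths:
        # Each slice of the digest becomes one directory level.
        parts.append(digest[pos:pos + width])
        pos += width
    # Whatever is left of the digest is used as the file name.
    parts.append(digest[pos:])
    return PurePosixPath(*parts)


if __name__ == "__main__":
    print(hashed_path("invoice-42"))
```

Splitting the digest this way keeps the number of entries per directory bounded: with two hex characters per level, each directory level has at most 256 subdirectories.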
I think this is what Microsoft has done in SQL Server 2008 with FILESTREAM storage. It allows storage of BLOB data inside of SQL Server, but allows you to access the files directly off the disk, which gives you kick-ass performance.
Microsoft released a whitepaper on managing unstructured data that you may be interested in. There's also an MSDN article describing FILESTREAM, as well as the pros and cons of file storage and whether to BLOB or not to BLOB.
United States Patent 5742807 deals with this:
http://www.freepatentsonline.com/5742807.html
Systems and methods for managing a plurality of electronically stored documents in an open document repository employ a one-way hash function to compute a hash for the stored documents as an indexing link. A document management index maps an attribute of an original document stored in the repository to the hash and the document. A hash-to-location index maps the hash to an address location of the document in a file system of the repository. The attribute points to the hash which then points to the location for linking the attribute to the location.
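The two-level lookup that the abstract describes (attribute -> hash -> location) can be sketched roughly as follows. This is only an illustration of the idea, not the patented implementation: the class name `DocumentRepository`, the use of in-memory dicts, SHA-256, and the path layout are all assumptions.

```python
import hashlib


class DocumentRepository:
    """Toy model of the two indexes described in the abstract (hypothetical)."""

    def __init__(self):
        # Document management index: attribute (e.g. a title) -> content hash.
        self.attribute_to_hash = {}
        # Hash-to-location index: content hash -> location in the file system.
        self.hash_to_location = {}

    def store(self, attribute: str, content: bytes) -> str:
        # One-way hash of the document content serves as the indexing link.
        digest = hashlib.sha256(content).hexdigest()
        # Assumed layout: two directory levels taken from the digest prefix.
        location = f"{digest[:2]}/{digest[2:4]}/{digest[4:]}"
        self.attribute_to_hash[attribute] = digest
        self.hash_to_location[digest] = location
        return location

    def locate(self, attribute: str) -> str:
        # The attribute points to the hash, which then points to the location.
        return self.hash_to_location[self.attribute_to_hash[attribute]]


if __name__ == "__main__":
    repo = DocumentRepository()
    repo.store("report.pdf", b"quarterly figures")
    print(repo.locate("report.pdf"))
```

Because the hash is computed from the content, two attributes that refer to identical documents resolve to the same location in this sketch.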
@Chris Kimpton
This would be called indexing. Sharding or partitioning is more about how to split a file.