I want to use GUIDs (UUIDs) for naming folders in a huge file store. Each storage item gets its own folder and GUID.
The easiest way would be "x:\items\uuid\{uuid}..."
example: "x:\items\uuid\F3B16318-4236-4E45-92B3-3C2C3F31D44F..."
I see one problem here: what if you expect at least 10,000 items, probably a few hundred thousand, or even more than 1 million? I don't want to put that many items (sub-folders) in one folder.
I thought of solving this by splitting up the GUID: take the first 2 chars to create sub-folders at the first level, then take the next 2 chars and create sub-folders again. The above example would become --> "x:\items\uuid\F3\B1\6318-4236-4E45-92B3-3C2C3F31D44F..."
If the first 4 chars of GUIDs are really as random as expected, then after a while I get 256 folders within 256 folders, and I always end up with a reasonable number of items within each of these folders. For example, with 1 million items you get --> 1,000,000 / 256 / 256 ≈ 15.26 items per folder.
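A minimal sketch of this split (my earlier test was in VB.NET; Python's standard uuid module and a hypothetical sharded_path helper stand in here, with the example base path from above):

```python
# Minimal illustration: build "x:\items\uuid\C1C2\C3C4\rest" from a GUID.
# `sharded_path` is a hypothetical helper name; the base path is the
# example path from above.
import uuid
from pathlib import PureWindowsPath

BASE = r"x:\items\uuid"

def sharded_path(guid: uuid.UUID) -> str:
    s = str(guid).upper()  # e.g. F3B16318-4236-4E45-92B3-3C2C3F31D44F
    return str(PureWindowsPath(BASE, s[:2], s[2:4], s[4:]))

print(sharded_path(uuid.UUID("F3B16318-4236-4E45-92B3-3C2C3F31D44F")))
# x:\items\uuid\F3\B1\6318-4236-4E45-92B3-3C2C3F31D44F
```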
In the past I've already tested the randomness of the first chars (via a VB.NET app). Result: the items were spread quite evenly over the folders. Somebody else also came to the same conclusion; see How evenly spread are the first four bytes of a Guid created in .NET?
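A rough re-creation of that kind of check (a sketch only; the original test was a VB.NET app, and Python's random version-4 UUIDs stand in for .NET GUIDs):

```python
# Count how many of 1,000,000 random UUIDs land in each first-level
# C1C2 bucket; with an even spread, every bucket should sit close to
# 1,000,000 / 256 ≈ 3906. (Takes a few seconds to run.)
import uuid
from collections import Counter

counts = Counter(str(uuid.uuid4())[:2] for _ in range(1_000_000))
print(len(counts))                          # expect 256 distinct buckets
print(min(counts.values()), max(counts.values()))
```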
Possible splits I thought of (1 million items as example; C1 = character 1 of the GUID, C2 = character 2, etc.) -- a sketch generalizing these splits follows the list:
- C1\C2\Rest of GUID --> 16 * 16 * 3906 (almost 4000 items is still a lot for one folder)
- C1\C2\C3\C4\Rest of GUID --> 16 * 16 * 16 * 16 * 15 (unnecessary splitting up of folders)
- C1C2\C3C4\Rest of GUID --> 256 * 256 * 15 (for me the best option?)
- C1C2C3\Rest of GUID --> 4096 * 244 (too many folders at the first level??)
- C1C2C3C4\Rest of GUID --> 65536 * 15 (too many folders at the first level!)
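A minimal sketch generalizing the splits above (split_guid and its widths parameter are hypothetical names, not from any library): widths gives the number of hex characters consumed at each directory level, so C1C2\C3C4\Rest is widths=(2, 2).

```python
# Generalize the split options above: `widths` lists how many hex chars
# go into each directory level; whatever is left becomes the leaf name.
import uuid

def split_guid(guid: uuid.UUID, widths=(2, 2)) -> list[str]:
    s = str(guid).upper()
    parts, pos = [], 0
    for w in widths:
        parts.append(s[pos:pos + w])
        pos += w
    parts.append(s[pos:])               # the rest of the GUID
    return parts

g = uuid.UUID("F3B16318-4236-4E45-92B3-3C2C3F31D44F")
print(split_guid(g, (1, 1)))   # C1\C2\Rest     -> ['F', '3', 'B16318-...']
print(split_guid(g, (2, 2)))   # C1C2\C3C4\Rest -> ['F3', 'B1', '6318-...']
print(split_guid(g, (4,)))     # C1C2C3C4\Rest  -> ['F3B1', '6318-...']
```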
My questions are: do you see any problems with this kind of split, and which of these splits would you choose?
Thanks, Mumblic
This is fairly similar to the method git uses for sharding its object database (although with SHA-1 hashes instead of GUIDs...). As with any algorithm, there are pros and cons, but I don't think there are any significant cons in this case that would outweigh the definite pros. There's a little extra CPU overhead to calculate the directory structure, but in the long term that overhead is probably significantly less than what it takes to repeatedly search through a single directory of a million files.
Regarding how to do it, it depends a bit on what library you are using to generate the GUIDs - do you get them in a byte-array (or even a struct) format that then needs to be converted to a character representation in order to display it, or do you get them as an already-formatted ASCII string? In the first case, you need to extract the appropriate bytes and format them yourself; in the second, you just need to extract a substring.
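A short sketch of both cases (Python's uuid module as a stand-in for whatever GUID library is in use; as a caution, some libraries - e.g. .NET's Guid.ToByteArray() - store the leading fields in little-endian order, so "the appropriate bytes" may not simply be the first ones):

```python
# Both cases with Python's uuid module, whose .bytes happen to be
# big-endian and therefore line up with the formatted string.
import uuid

g = uuid.uuid4()

# Case 1: raw bytes -- format the relevant bytes yourself.
level1 = f"{g.bytes[0]:02X}"            # first two hex chars (C1C2)
level2 = f"{g.bytes[1]:02X}"            # next two hex chars (C3C4)

# Case 2: already-formatted string -- just take substrings.
s = str(g).upper()
assert (level1, level2) == (s[:2], s[2:4])
print(level1, level2, s)
```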
As far as putting an extreme number of sub-folders (or even files) in one folder goes, the exact performance characteristics are highly dependent on the actual file system in use. Some perform better than others, but almost all will show significant performance degradation as the number of entries in each directory grows.