
Java content APIs for a large number of files

Tags:

java

Does anyone know of any open-source Java libraries that provide features for reading and writing a large number of files on disk? I am talking about 2-4 million files (most of them PDF and MS Office documents), so it is not a good idea to store them all in a single directory. Rather than re-invent the wheel, I am hoping this has already been solved by others.

Features I am looking for:

1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Version/audit support (optional)

I was looking at the JCR API and it looks promising, but it starts with a workspace and I am not sure how it will perform when there are many nodes.

asked Mar 02 '11 by wern


1 Answer

Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
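
For what it's worth, here is a minimal sketch of what storing a file through JCR 2.0 (javax.jcr) could look like; the repository setup (e.g. Apache Jackrabbit), the admin credentials, and the node name are all assumptions for illustration, not specifics of your setup:

```java
import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import java.io.FileInputStream;
import java.io.InputStream;

public class JcrStoreExample {
    // Stores a local file under the repository root as an nt:file node.
    // "repository" is assumed to be already configured (e.g. a Jackrabbit repository).
    static void storeFile(Repository repository, String localPath, String nodeName) throws Exception {
        Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
        try (InputStream in = new FileInputStream(localPath)) {
            Node root = session.getRootNode();
            Node file = root.addNode(nodeName, "nt:file");
            Node content = file.addNode("jcr:content", "nt:resource");
            Binary binary = session.getValueFactory().createBinary(in);
            content.setProperty("jcr:data", binary);
            session.save();
        } finally {
            session.logout();
        }
    }
}
```

Running something like that in a loop over a representative sample of your documents would give you a quick read on how node counts affect performance before you commit to it.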

If you're running your system on Windows and notice a horrible O(n^2) slowdown at some point, you're probably running up against the cost of automatic 8.3 filename generation. Of course you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.

One common strategy I've seen for handling large numbers of files is to create directories from the first n letters of the filename. For example, document.pdf would be stored as d/o/c/u/m/document.pdf. I don't recall ever seeing a library that does this in Java, but it seems straightforward enough to write yourself.

If necessary, you can keep a lookup table in a database (mapping keys to the uniformly distributed filenames), so you won't have to rebuild your index every time you start up. If you want the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename, but you would also want a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents actually differ.
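
As a rough illustration of the content-hash variant, here is a minimal sketch in plain Java; the two-level directory depth, SHA-256, and the ShardedFileStore name are arbitrary choices for the example, and it skips the collision check mentioned above:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class ShardedFileStore {
    private final Path baseDir;

    public ShardedFileStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    // Computes the SHA-256 of the content and stores it under
    // <base>/<first 2 hex chars>/<next 2 hex chars>/<full hash>,
    // so no single directory ends up holding millions of entries.
    public String store(Path sourceFile) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(sourceFile);
             DigestInputStream din = new DigestInputStream(in, digest)) {
            byte[] buf = new byte[8192];
            while (din.read(buf) != -1) { /* stream through to update the digest */ }
        }
        String hex = toHex(digest.digest());
        Path target = baseDir.resolve(hex.substring(0, 2))
                             .resolve(hex.substring(2, 4))
                             .resolve(hex);
        Files.createDirectories(target.getParent());
        if (!Files.exists(target)) {   // dedup: identical content maps to the same path
            Files.copy(sourceFile, target);
        }
        return hex;                    // key to record in the lookup table
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```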

Depending on the sizes of the files, you might also consider storing the files themselves in a database--if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.
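
If you went that route, the plumbing is plain JDBC; a minimal sketch, assuming a hypothetical document table with name, version, and content columns:

```java
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbDocumentStore {
    // Hypothetical schema:
    //   CREATE TABLE document (id BIGINT AUTO_INCREMENT PRIMARY KEY,
    //                          name VARCHAR(255), version INT, content BLOB)
    // Each save inserts a new row, so older versions are kept automatically.
    static long saveVersion(Connection conn, String name, int version, InputStream content) throws Exception {
        String sql = "INSERT INTO document (name, version, content) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.setInt(2, version);
            ps.setBinaryStream(3, content);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);   // auto-generated primary key used as the document reference
            }
        }
    }
}
```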

answered Nov 12 '22 by rob