Handle a great number of files

Tags: java, file
I have an external disk with a billion files. If I mount the external disk on computer A, my program scans all of the file paths and saves them in a database table. After that, when I eject the external disk, the data remains in the table. The problem is: if some files are deleted on computer B and I then mount the disk on computer A again, I have to synchronize the database table on computer A. However, I don't want to scan all the files again, because that takes a lot of time and wastes a lot of memory. Is there any way to update the database table without scanning all the files, while minimizing the memory used?

Besides, in my case the memory limitation is more important than time, which means I would rather save memory than save time.

I think I could partition the files into many sections and use some hash function (maybe SHA-1?) to check whether the files in a given section have been deleted. However, I cannot figure out how to partition the files into sections. Can anyone help me, or give me a better idea?

s011208 asked May 21 '12 06:05

1 Answer

If you don't have control over the file system on the disk, you have no choice but to scan the file names on the entire disk. To list the files that have been deleted, you could do something like this:

update files in database: set "seen on this scan" to false
for each file on disk do:
    insert/update database, setting "seen on this scan" to true
done
deleted files = select from files where "seen on this scan" = false
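
A minimal Java sketch of that loop, assuming a SQLite database through the sqlite-jdbc driver and a table files(path TEXT PRIMARY KEY, seen INTEGER); the class name, JDBC URL, and schema are placeholders, so adjust them to your setup:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.sql.*;

public class MarkAndSweepScan {
    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args[0]); // mount point of the external disk
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:files.db")) {
            db.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, seen INTEGER)");

            // 1. Mark every previously known file as "not seen on this scan".
            db.createStatement().execute("UPDATE files SET seen = 0");

            // 2. Walk the disk; each file found is inserted or re-marked as seen.
            final PreparedStatement upsert = db.prepareStatement(
                "INSERT OR REPLACE INTO files (path, seen) VALUES (?, 1)");
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    try {
                        upsert.setString(1, file.toString());
                        upsert.executeUpdate();
                    } catch (SQLException e) {
                        throw new IOException(e);
                    }
                    return FileVisitResult.CONTINUE;
                }
            });

            // 3. Whatever is still unseen was deleted while the disk was elsewhere.
            try (ResultSet rs = db.createStatement()
                    .executeQuery("SELECT path FROM files WHERE seen = 0")) {
                while (rs.next()) {
                    System.out.println("deleted: " + rs.getString(1));
                }
            }
            db.createStatement().execute("DELETE FROM files WHERE seen = 0");
        }
    }
}

Note that Files.walkFileTree visits files one at a time, so the scan never needs to hold more than the current path in memory, which matches the asker's memory constraint.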

A solution to the database performance problem could be to accumulate file names into a list of some kind and do a bulk insert/update whenever you reach, say, 1000 files.
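
In JDBC the batching could look like this (a sketch reusing the table assumed above; addBatch and executeBatch are the standard JDBC batch API, while the class name and batch size are arbitrary):

import java.sql.*;

// Accumulates paths and flushes them to the database in batches,
// so there is one round trip per 1000 files instead of one per file.
class BatchWriter {
    private static final int BATCH_SIZE = 1000;
    private final PreparedStatement upsert;
    private int pending = 0;

    BatchWriter(Connection db) throws SQLException {
        upsert = db.prepareStatement(
            "INSERT OR REPLACE INTO files (path, seen) VALUES (?, 1)");
    }

    void add(String path) throws SQLException {
        upsert.setString(1, path);
        upsert.addBatch();
        if (++pending >= BATCH_SIZE) flush();
    }

    void flush() throws SQLException {
        if (pending > 0) {
            upsert.executeBatch(); // one round trip for the whole batch
            pending = 0;
        }
    }
}

Remember to call flush() once more after the walk finishes so the final partial batch is written.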

As for directories with 1 billion files, you just need to replace the code that lists the files with something that wraps the C functions opendir and readdir. If I were you I wouldn't worry about it too much for now. No sane person has 1 billion files in one directory, because that sort of thing cripples file systems and common OS tools, so the risk is low and the solution is easy.
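
In Java the analogue of wrapping opendir/readdir is Files.newDirectoryStream, which hands you entries one at a time instead of materializing the whole listing as an array the way File.listFiles() does. A sketch (the class name is a placeholder):

import java.io.IOException;
import java.nio.file.*;

// Streams directory entries lazily, so even a directory with millions of
// entries never has to fit into a single in-memory array.
class StreamingLister {
    static void list(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    list(entry);               // recurse into subdirectories
                } else {
                    System.out.println(entry); // or hand off to the batch writer
                }
            }
        }
    }
}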

Joni answered Oct 05 '22 02:10