 

Indexing files and the quickest way to find a file in folders?

I have 660,000 XML files (with unique file names) spread across 22 folders, 30,000 files per folder. I need to find them by name efficiently from a C# application. I know Windows (Vista and later?) has a SearchIndexer service, and I was wondering whether I can use that or whether I have to index the files myself.

Alternatively, I guess I could create a database with the file name as the primary key and the path in another column. However, should I create one table with 660,000 rows, or 22 tables with 30,000 rows each? And why?

Thanks in advance.

asked by StarCub

1 Answer

My experience on this may be dated (NTFS), but you should check how quickly you can open a file in a directory of 30,000 files. I think you might find that it's better to distribute the files over more directories.
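One way to check this is to time a sample of random opens in one of the existing folders. A minimal sketch, assuming the folder path and sample size shown here (both are placeholders):

    // Hypothetical micro-benchmark: time opening a sample of files
    // in one of the existing 30,000-file folders.
    using System;
    using System.Diagnostics;
    using System.IO;

    class OpenTimer
    {
        static void Main()
        {
            string folder = @"C:\data\folder01";            // placeholder path
            string[] files = Directory.GetFiles(folder, "*.xml");

            const int samples = 1000;
            var rng = new Random();
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < samples; i++)
            {
                string path = files[rng.Next(files.Length)];
                using (File.OpenRead(path)) { }             // open and immediately close
            }
            sw.Stop();
            Console.WriteLine($"{samples} opens took {sw.ElapsedMilliseconds} ms");
        }
    }

Running the same test against folders of different sizes should show whether directory size matters on your file system.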

If you have control over the directory layout, consider hashing the file names to a number between 0 and 660000. You can then use the file system as an index:

00/
  00/
    <99 files that hash here>
  ...
...
65/

You still need to write a simple "indexer" that reads each file, computes its hash, and stores it in the correct location. You then look up a file as:

string Lookup(string filename)
{
   // GetHashCode can be negative, so clear the sign bit before taking the modulus.
   // Note: string.GetHashCode is not guaranteed to be stable across runs or .NET
   // versions, so a hand-rolled stable hash is safer for an on-disk layout.
   int hash = (filename.GetHashCode() & 0x7FFFFFFF) % 660000;
   string directory = HashToDirectory(hash);
   string path = Path.Combine(directory, filename);
   // ... use 'path' to open the file
}
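The one-time indexer could look something like this. A rough sketch only: the source and target roots are placeholder parameters, and HashToDirectory is the same helper assumed in Lookup above.

    // Hypothetical one-time indexer: move every existing file into the
    // directory its hash maps to.
    using System.IO;

    void BuildIndex(string sourceRoot, string targetRoot)
    {
        foreach (string file in Directory.EnumerateFiles(sourceRoot, "*.xml", SearchOption.AllDirectories))
        {
            string name = Path.GetFileName(file);
            int hash = (name.GetHashCode() & 0x7FFFFFFF) % 660000;

            string directory = Path.Combine(targetRoot, HashToDirectory(hash));
            Directory.CreateDirectory(directory);   // no-op if it already exists

            File.Move(file, Path.Combine(directory, name));
        }
    }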

One thing that's nice about this approach is that you can profile various "densities" for the number of files in a directory; you just change the HashToDirectory function. You also don't need a database.
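For example, HashToDirectory might split the hash into a two-level path. This is just one possible split, not the answer's exact implementation; changing the divisors changes how many files land in each leaf directory:

    // Map a hash in [0, 660000) to a two-level directory such as "07\42".
    // With 66 top-level and 100 second-level directories, each leaf holds
    // roughly 100 files; adjust the divisors to change the density.
    string HashToDirectory(int hash)
    {
        int top = hash / 10000;          // 0..65
        int sub = (hash / 100) % 100;    // 0..99
        return Path.Combine(top.ToString("D2"), sub.ToString("D2"));
    }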

We used a similar approach with a web crawler that stored a lot of files. It was against NTFS, so YMMV.

answered by Rob