I have large files, each containing a small number of large datasets. Each dataset contains a name and its size in bytes, which makes it possible to skip it and jump to the next dataset.
I want to build an index of the dataset names very quickly. An example file is about 21 MB and contains 88 datasets. Reading the 88 names with a std::ifstream, using seekg() to skip between datasets, takes about 1300 ms, which I would like to reduce.
So in effect, I'm reading 88 chunks of about 30 bytes each, at known positions in a 21 MB file, and it takes 1300 ms.
Is there a way to improve this, or is it an OS and filesystem limitation? I'm running the test under Windows 7 64-bit.
I know that having a complete index at the beginning of the file would be better, but the file format does not include one, and we can't change it.
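For reference, here is a minimal sketch of the scan I'm timing. The record layout (a 4-byte name length, the name, then a 4-byte payload size) is only an assumption for illustration; the real format differs:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct Entry { std::string name; std::streamoff offset; };

std::vector<Entry> buildIndex(const std::string& path)
{
    std::vector<Entry> index;
    std::ifstream in(path, std::ios::binary);
    while (in) {
        std::streamoff pos = in.tellg();
        std::uint32_t nameLen = 0, dataSize = 0;
        if (!in.read(reinterpret_cast<char*>(&nameLen), sizeof nameLen))
            break;                                  // end of file reached
        std::string name(nameLen, '\0');
        in.read(&name[0], nameLen);                 // dataset name
        in.read(reinterpret_cast<char*>(&dataSize), sizeof dataSize);
        index.push_back({name, pos});
        in.seekg(dataSize, std::ios::cur);          // skip payload to next dataset
    }
    return index;
}
```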
You could use a memory-mapped file interface (I recommend Boost's implementation). This maps the file into your application's virtual address space, so reading the 88 headers becomes plain pointer access: only the pages you actually touch are faulted in from disk, instead of a seek-and-read system call pair per dataset.
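A minimal sketch with boost::iostreams::mapped_file_source, reusing the same hypothetical record layout as the question's sketch (the file name and layout are assumptions):

```cpp
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>

int main()
{
    boost::iostreams::mapped_file_source file("example.dat"); // file name is an assumption
    const char* p   = file.data();
    const char* end = p + file.size();
    while (p + 8 <= end) {                                 // minimal bounds check for brevity
        std::uint32_t nameLen = 0, dataSize = 0;
        std::memcpy(&nameLen, p, 4);                       // memcpy avoids unaligned reads
        std::cout << std::string(p + 4, nameLen) << '\n';  // dataset name
        std::memcpy(&dataSize, p + 4 + nameLen, 4);
        p += 4 + nameLen + 4 + dataSize;                   // jump straight to the next header
    }
}
```

Since the payloads are never touched, the OS never has to read the bulk of the 21 MB file; you only pay for the pages holding the ~30-byte headers.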
You could scan the file once and write your own header, containing each key and its offset, to a separate file. Depending on your use case, you can do this once at program start or every time the data file changes. Before accessing the big data, a lookup in the small index file gives you the offset you need.
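A hedged sketch of such a sidecar index; the ".idx" suffix and the "name offset" text format are assumptions, and names must not contain whitespace for this simple format to round-trip:

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Write one "name offset" pair per line to a small companion file.
void writeIndex(const std::string& dataPath,
                const std::vector<std::pair<std::string, std::streamoff>>& entries)
{
    std::ofstream out(dataPath + ".idx");
    for (const auto& e : entries)
        out << e.first << ' ' << e.second << '\n';
}

// Load the companion file into a name -> offset map for fast lookups.
std::unordered_map<std::string, std::streamoff> readIndex(const std::string& dataPath)
{
    std::unordered_map<std::string, std::streamoff> index;
    std::ifstream in(dataPath + ".idx");
    std::string name;
    std::streamoff offset;
    while (in >> name >> offset)
        index[name] = offset;
    return index;
}
```

One simple way to decide when to rebuild is to compare the modification times of the data file and the index file, and regenerate the index whenever the data file is newer.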