
How to efficiently read thousands of small files with GCD

I'd like to read some metadata (e.g., EXIF data) from potentially thousands of files as efficiently as possible without impacting the user experience. I'm interested in any thoughts on how best to go about this using regular GCD queues, dispatch_io channels, or even another implementation.

Option #1: Using regular GCD queues.

This one is pretty straightforward; I can just use something like the following:

for (NSURL *URL in URLS) {
  dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0), ^{
    // Read metadata information from file.
    CGImageSourceCopyProperties(...);
  });
}

The problem with this implementation, I think (and have experienced), is that GCD doesn't know that the operation in the block is I/O-related, so it submits dozens of these blocks to the global queue for processing, which in turn saturate the I/O. The system eventually recovers, but the I/O takes a hit if I'm reading in thousands, or tens of thousands, of files.

Option #2: Using dispatch_io

This one seems like a good contender, but I actually get worse performance with it than with a regular GCD queue. That could be my implementation.

dispatch_queue_t intakeQueue = dispatch_queue_create("someName", NULL);

for (NSURL *URL in URLS) {    
  const char *path = URL.path.UTF8String;
  dispatch_io_t intakeChannel = dispatch_io_create_with_path(DISPATCH_IO_RANDOM, path, O_RDONLY, 0, intakeQueue, NULL);
  dispatch_io_set_high_water(intakeChannel, 256);
  dispatch_io_set_low_water(intakeChannel, 0);

  dispatch_io_handler_t readHandler = ^void(bool done, dispatch_data_t data, int error) {
    // Read metadata information from file.
    CGImageSourceCopyProperties(...);
    // Error stuff...
  };

  dispatch_io_read(intakeChannel, 0, 256, intakeQueue, readHandler);
}

In this second option, I feel like I'm somewhat abusing dispatch_io_read. I'm not interested in the data it reads at all; I just want dispatch_io to throttle the I/O for me. The 256 size is just a random number so that some amount of data is read, even though I never use it.

In this second option, I've had several runs where the system worked "pretty well", but I've also had an instance where my entire machine locked up (even the cursor) and I had to hard-reset. In other instances (equally random), the application has simply quit with a stack trace that looks like dozens of dispatch_io calls trying to clean up. (In all of these instances, I'm attempting to read in excess of 10,000 images.)

(Since I'm not opening any file descriptors myself, and GCD blocks are now ARC-friendly, I don't think I have to do any explicit clean-up after the dispatch_io_read has completed, though maybe I'm wrong?)

Solutions?

Is there another option I could use? I've considered manually throttling the requests with an NSOperationQueue and a low value for maxConcurrentOperationCount, but that just seems wrong, as the newer Mac Pros can clearly handle far more I/O than an older, non-SSD MacBook.

Update 1

I thought of doing a slight modification to option #2 based on some of the points @Ken-Thomases touched on below. In this attempt, I'm trying to prevent the dispatch_io block from exiting by setting a high_water mark below the total number of bytes requested. The idea being that the read handler will get called with data remaining to be read.

dispatch_queue_t intakeQueue = dispatch_queue_create("someName", NULL);

for (NSURL *URL in URLS) {    
  const char *path = URL.path.UTF8String;
  dispatch_io_t intakeChannel = dispatch_io_create_with_path(DISPATCH_IO_RANDOM, path, O_RDONLY, 0, intakeQueue, NULL);
  dispatch_io_set_high_water(intakeChannel, 256);
  dispatch_io_set_low_water(intakeChannel, 0);
  __block BOOL didReadProperties = NO;

  dispatch_io_handler_t readHandler = ^void(bool done, dispatch_data_t data, int error) {
    // Read metadata information from file.
    if (didReadProperties == NO) {
        CGImageSourceCopyProperties(...);
        didReadProperties = YES;
    } else {
      // Maybe try and force close the channel here with dispatch_close?
     }        
  };

  dispatch_io_read(intakeChannel, 0, 512, intakeQueue, readHandler);
}

This does appear to slow down the dispatch_io calls, but it's now causing a situation where calls to CGImageSourceCreateWithURL are failing in a different part of the application where they never used to. (Now CGImageSourceCreateWithURL is randomly returning NULL, which, if I had to guess, suggests that it can't open a file descriptor, since the file is definitely present at the given path.)

Update 2

After experimenting with a half-dozen other ideas, an implementation as simple as using an NSOperationQueue and calling addOperationWithBlock turned out to be just as effective as anything else I could come up with. Manually tuning the maxConcurrentOperationCount had some effect but nowhere near as much as I would have thought.
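For reference, roughly what that simple implementation looked like; the queue name and the concurrency count of 4 are illustrative, not tuned values from the original code:

```objc
NSOperationQueue *metadataQueue = [[NSOperationQueue alloc] init];
metadataQueue.maxConcurrentOperationCount = 4; // tuning this mattered less than expected

for (NSURL *URL in URLS) {
  [metadataQueue addOperationWithBlock:^{
    CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)URL, NULL);
    if (source) {
      CFDictionaryRef properties = CGImageSourceCopyProperties(source, NULL);
      // Read metadata information from file.
      if (properties) CFRelease(properties);
      CFRelease(source);
    }
  }];
}
```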

Obviously, the performance difference between an SSD and an external USB 3.0 drive is dramatic. While I can iterate over 100,000 images (and even get away with around 200,000) on an SSD in reasonable time, that many images on a USB drive is hopeless. Simple math ((bytes needed to read × file count) / drive speed) shows that I can't really get the user experience I was hoping for. (Instruments seems to show that _CGImageSourceBindToPlugin reads anywhere from about 40 KB to 1 MB per file.)
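To make that math concrete (the per-file size and drive speed below are illustrative assumptions, not measurements):

```objc
// Back-of-the-envelope estimate for the USB case.
double bytesPerFile = 500.0 * 1024.0;              // roughly mid-range of the observed 40 KB–1 MB
double fileCount = 100000.0;
double driveBytesPerSec = 100.0 * 1024.0 * 1024.0; // optimistic sequential speed for a USB 3.0 HDD
double seconds = (bytesPerFile * fileCount) / driveBytesPerSec;
// ≈ 488 seconds, i.e. about 8 minutes of pure I/O, before any seek overhead
```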

asked May 12 '14 by kennyc

1 Answer

The reality is that on a modern, multi-tasking, multi-user system that runs across many hardware configurations, automatically throttling an I/O-bound task is nigh impossible for the system to do.

You're going to have to do the throttling yourself. This could be done with NSOperationQueue, with a semaphore, or with any of a number of other mechanisms.
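As a sketch of the semaphore approach mentioned above (the concurrency width of 4 is an arbitrary starting point, not a recommendation):

```objc
// Throttle the number of in-flight reads with a counting semaphore.
dispatch_semaphore_t ioThrottle = dispatch_semaphore_create(4); // illustrative width
dispatch_queue_t workQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);

for (NSURL *URL in URLS) {
  // Blocks the enumerating thread once 4 reads are already in flight.
  dispatch_semaphore_wait(ioThrottle, DISPATCH_TIME_FOREVER);
  dispatch_async(workQueue, ^{
    CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)URL, NULL);
    if (source) {
      CFDictionaryRef properties = CGImageSourceCopyProperties(source, NULL);
      // ... use the metadata ...
      if (properties) CFRelease(properties);
      CFRelease(source);
    }
    dispatch_semaphore_signal(ioThrottle);
  });
}
```

Note that the loop itself blocks while the semaphore is exhausted, so this enumeration should not run on the main thread.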

Normally, I'd suggest you try to separate the I/O from any computation so you can serialize the I/O (which will generally give the most reasonable performance across all systems), but that is pretty much impossible when using high-level APIs. In fact, it isn't clear how the CG* I/O APIs might interact with the dispatch_io_* advisory APIs.

Not a terribly helpful answer. Without knowing more about your very specific case, it's hard to be more specific. I would suggest that caching may be key here; build up a database of metadata for all the various images. Of course, then you have synchronization and invalidation problems.
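A minimal sketch of that caching idea, keyed by path plus modification date so a changed file misses the cache (the names `metadataCache` and `cacheURL` are hypothetical, and this assumes the property dictionaries are plist-compatible):

```objc
NSMutableDictionary *metadataCache =
    [[NSMutableDictionary alloc] initWithContentsOfURL:cacheURL] ?: [NSMutableDictionary dictionary];

for (NSURL *URL in URLS) {
  NSDate *mtime = nil;
  [URL getResourceValue:&mtime forKey:NSURLContentModificationDateKey error:NULL];
  NSString *key = [NSString stringWithFormat:@"%@|%@", URL.path, mtime];

  NSDictionary *properties = metadataCache[key];
  if (!properties) {
    // Cache miss: pay the I/O cost once, then remember the result.
    CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)URL, NULL);
    if (source) {
      properties = CFBridgingRelease(CGImageSourceCopyProperties(source, NULL));
      CFRelease(source);
      if (properties) metadataCache[key] = properties;
    }
  }
  // ... use properties ...
}
[metadataCache writeToURL:cacheURL atomically:YES];
```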

answered Nov 07 '22 by bbum