Grand Central Strategy for Opening Multiple Files

I have a working implementation using Grand Central Dispatch queues that (1) opens a file and computes an OpenSSL DSA hash on "queue1", and (2) writes the hash out to a new "sidecar" file for later verification on "queue2".

I would like to open multiple files at the same time, but based on some logic that doesn't "choke" the OS by having hundreds of files open and exceeding the hard drive's sustainable throughput. Photo browsing applications such as iPhoto or Aperture seem to open multiple files and display them, so I'm assuming this can be done.

I'm assuming the biggest limitation will be disk I/O, as the application can (in theory) read and write multiple files simultaneously.

Any suggestions?

TIA

Asked Nov 30 '22 by Torjt

2 Answers

You are correct in that you'll be I/O bound, most assuredly. And it will be compounded by the random access nature of having multiple files open and being actively read at the same time.

Thus, you need to strike a bit of a balance. More likely than not, processing one file at a time is not the most efficient approach, as you've observed.

Personally?

I'd use a dispatch semaphore.

Something like:

@property(nonatomic, assign) dispatch_queue_t dataQueue;
@property(nonatomic, assign) dispatch_semaphore_t execSemaphore;

And:

- (void) process:(NSData *)d {
    dispatch_async(self.dataQueue, ^{
        if (!dispatch_semaphore_wait(self.execSemaphore, DISPATCH_TIME_FOREVER)) {
            dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
                ... do calculation work here on d ...
                dispatch_async(dispatch_get_main_queue(), ^{
                    .... update main thread w/new data here ....
                });
                dispatch_semaphore_signal(self.execSemaphore);
            });
        }
    });
}

Where it is kicked off with:

self.dataQueue = dispatch_queue_create("com.yourcompany.dataqueue", NULL);
self.execSemaphore = dispatch_semaphore_create(3);
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
[self process: ...];
.... etc ....

You'll need to determine how best you want to handle the queueing. If there are many items and there is a notion of cancellation, enqueueing everything is likely wasteful. Similarly, you'll probably want to enqueue URLs to the files to process, and not NSData objects like the above.

In any case, the above will process three things simultaneously, regardless of how many have been enqueued.

Answered Dec 05 '22 by bbum

I'd use NSOperation for this because of the ease of handling both dependencies and cancellation.

I'd create one operation each for reading the data file, computing the data file's hash, and writing the sidecar file. I'd make each write operation dependent on its associated compute operation, and each compute operation dependent on its associated read operation.

Then I'd add the read and write operations to one NSOperationQueue, the "I/O queue," with a restricted width. The compute operations I'd add to a separate NSOperationQueue, the "compute queue," with an unrestricted width.

The reason for the restricted width on the I/O queue is that your work will likely be I/O bound; you may want it to have a width greater than 1, but it's very likely to be directly related to the number of physical disks on which your input files reside. (Probably something like 2x; you'll want to determine this experimentally.)

The code would wind up looking something like this:

@implementation FileProcessor

static NSOperationQueue *FileProcessorIOQueue = nil;
static NSOperationQueue *FileProcessorComputeQueue = nil;

+ (void)initialize
{
    if (self == [FileProcessor class]) {
        FileProcessorIOQueue = [[NSOperationQueue alloc] init];
        [FileProcessorIOQueue setName:@"FileProcessorIOQueue"];
        [FileProcessorIOQueue setMaxConcurrentOperationCount:2]; // limit width

        FileProcessorComputeQueue = [[NSOperationQueue alloc] init];
        [FileProcessorComputeQueue setName:@"FileProcessorComputeQueue"];
    }
}

- (void)processFilesAtURLs:(NSArray *)URLs
{
    for (NSURL *URL in URLs) {
        __block NSData *fileData = nil; // set by readOperation
        __block NSData *fileHashData = nil; // set by computeOperation

        // Create operations to do the work for this URL

        NSBlockOperation *readOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileData = CreateDataFromFileAtURL(URL);
            }];

        NSBlockOperation *computeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                fileHashData = CreateHashFromData(fileData);
                [fileData release]; // created in readOperation
            }];

        NSBlockOperation *writeOperation =
            [NSBlockOperation blockOperationWithBlock:^{
                WriteHashSidecarForFileAtURL(fileHashData, URL);
                [fileHashData release]; // created in computeOperation
            }];

        // Set up dependencies between operations

        [computeOperation addDependency:readOperation];
        [writeOperation addDependency:computeOperation];

        // Add operations to appropriate queues

        [FileProcessorIOQueue addOperation:readOperation];
        [FileProcessorComputeQueue addOperation:computeOperation];
        [FileProcessorIOQueue addOperation:writeOperation];
    }
}

@end

It's pretty straightforward; rather than deal with multiply-nested layers of sync/async as you would with the dispatch_* APIs, NSOperation allows you to define your units of work and your dependencies between them independently. For some situations this can be easier to understand and debug.

Answered Dec 05 '22 by Chris Hanson