Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex pattern and/or NSRegularExpression a bit too slow searching over very large file, can it be optimized?

In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search is done through the contents of the large file as an NSString. I need to match any word that appears bracketed by a newline and a tab character, and then grab the whole line, for example if I have the word "monday" in my NSArray I want to match this line within the dictionary file:

monday  M AH N D IY

This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its ultimate output. I also need to find alternate pronunciations of the words which are listed as follows:

monday(2)   M AH N D EY

The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for iterations of the word followed by parentheses containing a single number bracketed by a newline and a tab character.

I have a 100% working NSRegularExpression method as follows:

NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// This results in this regex pattern:

// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:

• The 3.2 MB text file has the words to match listed in alphabetical order

• The array of arbitrary words to look up are always in alphabetical order when they get to this method

• Alternate pronunciations start with (2) in parens after the word, not (1)

• If there is no (2) there won't be a (3), (4) or more

• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.

****EDIT****

Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:

dasblinkenlight's create NSDictionary approach https://stackoverflow.com/a/11958852/119717: 5.259676 seconds

Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717: 0.609593 seconds

dasblinkenlight's multiple NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717: 1.255130 seconds

my first hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.372215 seconds

my second hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.337549 seconds

The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up a quarter of the original time so this was enormously helpful, thank you dasblinkenlight and Ωmega for talking it through with me.

like image 359
Halle Avatar asked Aug 14 '12 17:08

Halle


People also ask

Why is my regex so slow?

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

How efficient is regex?

Regular Expressions are efficient in that one line of code can save you writing hundreds of lines. But they're normally slower (even pre-compiled) than thoughtful hand written code simply due to the overhead. Generally the simpler the objective the worse Regular Expressions are. They're better for complex operations.

Why is regex important?

Regular expressions are useful in search and replace operations. The typical use case is to look for a sub-string that matches a pattern and replace it with something else. Most APIs using regular expressions allow you to reference capture groups from the search pattern in the replacement string.


2 Answers

Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

  • Create a mutable NSDictionary words, with NSString keys and NSMutableArray values
  • Read the file into memory
  • Go through the string representing the file line-by-line
  • For each line, separate out the word part by searching for a '(' or a '\t' character
  • Get a sub-string for the word (from zero to the index of the '(' or '\t' minus one); this is your key.
  • Check if the words contains your key; if it does not, add new NSMutableArray
  • Add line to the NSMutableArray that you found/created at the specific key
  • Once your are finished, throw away the original string representing the file.

With this structure in hand, you should be able to do your searches in time that no regex engine would be able to match, because you replaced a full-text scan, which is linear, with a hash look-up, which is constant-time.

** EDIT: ** I checked the relative speed of this solution vs. regex, it is about 60 times faster on a simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.

Reading the file:

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos != data.length) {
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining;
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    keyRange.length = keyRange.location;
    keyRange.location = 0;
    NSString *key = [line substringWithRange:keyRange];
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %u", all.count);
like image 179
Sergey Kalinichenko Avatar answered Sep 20 '22 10:09

Sergey Kalinichenko


Try this one:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and also this one (using positive lookahead with list of possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

and at the end, a version with some optimization:

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
like image 20
Ωmega Avatar answered Sep 21 '22 10:09

Ωmega