In an iOS framework, I am searching through this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic
I am using NSRegularExpression to search for an arbitrary set of words that are given as an NSArray. The search is done through the contents of the large file as an NSString. I need to match any word that appears bracketed by a newline and a tab character, and then grab the whole line, for example if I have the word "monday" in my NSArray I want to match this line within the dictionary file:
monday M AH N D IY
This line starts with a newline, the string "monday" is followed by a tab character, and then the pronunciation follows. The entire line needs to be matched by the regex for its ultimate output. I also need to find alternate pronunciations of the words which are listed as follows:
monday(2) M AH N D EY
The alternative pronunciations always begin with (2) and can go as high as (5). So I also search for iterations of the word followed by parentheses containing a single number bracketed by a newline and a tab character.
I have a 100% working NSRegularExpression method as follows:
NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.
// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("];
int firstRound = 0;
for(NSString *word in array) {
if(firstRound == 0) { // this is the first round
firstRound++;
} else { // After the first iteration we need an OR operator first.
[mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
}
[mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}
[mutablePatternString appendString:@")\\t.*$"];
// This results in this regex pattern:
// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$
NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
options:NSRegularExpressionAnchorsMatchLines
error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
options:0
range:NSMakeRange(rangeLocation, rangeLength)
usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
[matches addObject:[string substringWithRange:result.range]];
}];
[mutablePatternString release];
// matches array is returned to the caller.
My issue is that given the big text file, it isn't really fast enough on the iPhone. 8 words take 1.3 seconds on an iPhone 4, which is too long for the application. Given the following known factors:
• The 3.2 MB text file has the words to match listed in alphabetical order
• The array of arbitrary words to look up are always in alphabetical order when they get to this method
• Alternate pronunciations start with (2) in parens after the word, not (1)
• If there is no (2) there won't be a (3), (4) or more
• The presence of one alternative pronunciation is rare, occurring maybe 1 time in 8 on average. Further alternate pronunciations are even rarer.
Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I'm assuming that NSRegularExpression is already optimized enough that it isn't going to be worthwhile trying to do it with a different Objective-C library or in C, but if I'm wrong here let me know. Otherwise, very grateful for any suggestions on improving the performance. I am hoping to make this generalized to any pronunciation file so I'm trying to stay away from solutions like calculating the alphabetical ranges ahead of time to do more constrained searches.
****EDIT****
Here are the timings on the iPhone 4 for all of the search-related answers given by August 16th 2012:
dasblinkenlight's create NSDictionary approach https://stackoverflow.com/a/11958852/119717: 5.259676 seconds
Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717: 0.609593 seconds
dasblinkenlight's multiple NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717: 1.255130 seconds
my first hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.372215 seconds
my second hybrid approach at https://stackoverflow.com/a/11970549/119717: 0.337549 seconds
The best time so far is the second version of my answer. I can't mark any of the answers best, since all of the search-related answers informed the approach that I took in my version so they are all very helpful and mine is just based on the others. I learned a lot and my method ended up a quarter of the original time so this was enormously helpful, thank you dasblinkenlight and Ωmega for talking it through with me.
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
Regular Expressions are efficient in that one line of code can save you writing hundreds of lines. But they're normally slower (even pre-compiled) than thoughtful hand written code simply due to the overhead. Generally the simpler the objective the worse Regular Expressions are. They're better for complex operations.
Regular expressions are useful in search and replace operations. The typical use case is to look for a sub-string that matches a pattern and replace it with something else. Most APIs using regular expressions allow you to reference capture groups from the search pattern in the replacement string.
Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:
NSDictionary words
, with NSString
keys and NSMutableArray
valuesline
, separate out the word part by searching for a '('
or a '\t'
character'('
or '\t'
minus one); this is your key
.words
contains your key
; if it does not, add new NSMutableArray
line
to the NSMutableArray
that you found/created at the specific key
With this structure in hand, you should be able to do your searches in time that no regex engine would be able to match, because you replaced a full-text scan, which is linear, with a hash look-up, which is constant-time.
** EDIT: ** I checked the relative speed of this solution vs. regex, it is about 60 times faster on a simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.
Reading the file:
NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos != data.length) {
NSRange remaining = NSMakeRange(pos, data.length-pos);
NSRange next = [data
rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
options:NSLiteralSearch
range:remaining
];
if (next.location != NSNotFound) {
next.length = next.location - pos;
next.location = pos;
} else {
next = remaining;
}
pos += (next.length+1);
NSString *line = [data substringWithRange:next];
NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
keyRange.length = keyRange.location;
keyRange.location = 0;
NSString *key = [line substringWithRange:keyRange];
NSMutableArray *array = [tmp objectForKey:key];
if (!array) {
array = [NSMutableArray array];
[tmp setObject:array forKey:key];
}
[array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar
Searching:
NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
for (NSString *s in [dict objectForKey:key]) {
[all addObject:s];
}
}
NSLog(@"Done! %u", all.count);
Try this one:
^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$
and also this one (using positive lookahead with list of possible first letters):
^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$
and at the end, a version with some optimization:
^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With