I got interested in this small example of an algorithm in Python for looping through a large word list. I am writing a few "tools" that will allow my to slice a Objective-C string or array in a similar fashion as Python.
Specifically, this elegant solution caught my attention for executing very quickly and it uses a string slice as a key element of the algorithm. Try and solve this without a slice!
I have reproduced my local version using the Moby word list below. You can use /usr/share/dict/words
if you do not feel like downloading Moby. The source is just a large dictionary-like list of unique words.
#!/usr/bin/env python
count=0
words = set(line.strip() for line in
open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl"))
for w in words:
even, odd = w[::2], w[1::2]
if even in words and odd in words:
count+=1
print count
This script will a) be interpreted by Python; b) read the 4.1 MB, 354,983 word Moby dictionary file; c) strip the lines; d) place the lines into a set, and; e) and find all the combinations where the evens and the odds of a given word are also words. This executes in about 0.73 seconds on a MacBook Pro.
I tried to rewrite the same program in Objective-C. I am a beginner at this language, so go easy please, but please do point out the errors.
#import <Foundation/Foundation.h>
NSString *sliceString(NSString *inString, NSUInteger start, NSUInteger stop,
NSUInteger step){
NSUInteger strLength = [inString length];
if(stop > strLength) {
stop = strLength;
}
if(start > strLength) {
start = strLength;
}
NSUInteger capacity = (stop-start)/step;
NSMutableString *rtr=[NSMutableString stringWithCapacity:capacity];
for(NSUInteger i=start; i < stop; i+=step){
[rtr appendFormat:@"%c",[inString characterAtIndex:i]];
}
return rtr;
}
NSSet * getDictWords(NSString *path){
NSError *error = nil;
NSString *words = [[NSString alloc] initWithContentsOfFile:path
encoding:NSUTF8StringEncoding error:&error];
NSCharacterSet *sep=[NSCharacterSet newlineCharacterSet];
NSPredicate *noEmptyStrings =
[NSPredicate predicateWithFormat:@"SELF != ''"];
if (words == nil) {
// deal with error ...
}
// ...
NSArray *temp=[words componentsSeparatedByCharactersInSet:sep];
NSArray *lines =
[temp filteredArrayUsingPredicate:noEmptyStrings];
NSSet *rtr=[NSSet setWithArray:lines];
NSLog(@"lines: %lul, word set: %lul",[lines count],[rtr count]);
[words release];
return rtr;
}
int main (int argc, const char * argv[])
{
NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
int count=0;
NSSet *dict =
getDictWords(@"/Users/andrew/Downloads/Moby/mwords/354984si.ngl");
NSLog(@"Start");
for(NSString *element in dict){
NSString *odd_char=sliceString(element, 1,[element length], 2);
NSString *even_char=sliceString(element, 0, [element length], 2);
if([dict member:even_char] && [dict member:odd_char]){
count++;
}
}
NSLog(@"count=%i",count);
[pool drain];
return 0;
}
The Objective-C version produces the same result, (13,341 words), but takes almost 3 seconds to do it. I must be doing something atrociously wrong for a compiled language to be more than 3X slower than a scripted language, but I'll be darned if I can see why.
The basic algorithm is the same: read the lines, strip them, and put them in a set.
My guess of what is slow is the processing of the NSString elements, but I do not know an alternative.
Edit
I edited the Python to be this:
#!/usr/bin/env python
import codecs
count=0
words = set(line.strip() for line in
codecs.open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl",
encoding='utf-8'))
for w in words:
if w[::2] in words and w[1::2] in words:
count+=1
print count
For the utf-8 to be on the same plane as the utf-8 NSString. This slowed the Python down to 1.9 secs.
I also switch the slice test to short-circuit type as suggested for both the Python and obj-c version. Now they are close to the same speed. I also tried using C arrays rather than NSStrings, and this was much faster, but not as easy. You also loose utf-8 support doing that.
Python is really cool...
Edit 2
I found a bottleneck that sped things up considerably. Instead of using the [rtr appendFormat:@"%c",[inString characterAtIndex:i]];
method to append a character to the return string, I used this:
for(NSUInteger i=start; i < stop; i+=step){
buf[0]=[inString characterAtIndex:i];
[rtr appendString:[NSString stringWithCharacters:buf length:1]];
}
Now I can finally claim that the Objective-C version is faster than the Python version -- but not by much.
C/C++ is relatively fast as compared to Python because when you run the Python script, its interpreter will interpret the script line by line and generate output but in C, the compiler will first compile it and generate an output which is optimized with respect to the hardware.
Python is an excellent tool enabling just that. It allows for focusing on the idea itself and not be bothered with boilerplate code and other tedious things. However, Python comes with a major drawback: It is much slower than compiled languages like C or C++.
Internally Python code is interpreted during run time rather than being compiled to native code hence it is a bit slower. Running of Python script v/s running of C/C++ code: Python: First it is compiled into Byte Code. This Byte Code is then interpreted and executed by the PVM (Python Virtual Machine).
It is 450 million loops in a second, which is 45 times faster than Python. Furthermore, C can be compiled in optimized mode for a better performance.
Keep in mind that the Python version has been written to move a lot of the heavy lifting down into highly optimised C code when executed on CPython (especially the file input buffering, string slicing and the hash table lookups to check whether even
and odd
are in words
).
That said, you seem to be decoding the file as UTF-8 in your Objective-C code, but leaving the file in binary in your Python code. Using Unicode NSString in the Objective-C version, but 8-bit byte strings in the Python version isn't really a fair comparison - I would expect the performance of the Python version to drop noticeably if you used codecs.open()
to open the file with a declared encoding of "utf-8"
.
You're also making a full second pass to strip the empty lines in your Objective-C, while no such step is present in the Python code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With