I am trying to reproduce in Python two examples (originally written in Java) that I found in a book.
The two functions check if a string contains repeated characters. The first function uses an integer (checker
) as a bit vector, while the second function simply uses a list of booleans.
I was expecting to have a better performance using the function with bits, but actually it performs worse.
Why is that? Did I write something wrong while "translating" from Java to Python?
Note: for simplicity we only use lowercase letters (a to z), especially for the bit vector function.
import sys
import timeit
def is_unique_chars_bit(my_str):
checker = 0
for char in my_str:
val = ord(char) - ord('a')
if ((checker & (1 << val)) > 0):
return False
checker |= (1 << val)
return True
def is_unique_chars_list(my_str):
if len(my_str) > 128:
# Supposing we use ASCII, which only has 128 chars
return False
char_list = [False] * 128
for char in my_str:
val = ord(char)
if char_list[val]:
return False
char_list[val] = True
return True
if __name__ == '__main__':
alphabet = "abcdefghijklmnopqrstuvwxyz"
t_bit = timeit.Timer("is_unique_chars_bit('"+ alphabet +"')", "from __main__ import is_unique_chars_bit")
t_list = timeit.Timer("is_unique_chars_list('"+ alphabet +"')", "from __main__ import is_unique_chars_list")
print(t_bit.repeat(3, 200000))
print(t_list.repeat(3, 200000))
Results:
[1.732477278999795, 1.7263494359995093, 1.7404333820004467]
[0.6785205180003686, 0.6759967380003218, 0.675434408000001]
The original Java functions are the following:
boolean isUniqueCharsBoolArray(String str) {
if (str.length() > 128) return false;
boolean[] char_set = new boolean[128];
for (int i = 0; i < str.length(); i++) {
int val = str.charAt(i);
if (char_set[val]) {
return false;
}
char_set[val] = true;
}
return true;
}
boolean isUniqueCharsBits(String str) {
for (int i = 0; i < str.length(); i++) {
int val = str.charAt(i) -'a';
if ((checker & (1 << val)) > 0) {
return false;
}
checker |= (1 << val);
}
return true;
}
That is because integers are immutable reference classes in python. That means integer operations are slow in general. (This is true even for python2 ints) Look at the following line:
checker |= (1 << val)
If we expand the assignment we get:
checker = checker | (1 << val)
This single line allocates two new integers in memory. One for 1 << val
and one for the bitwise or.
On the other hand, assigning an array element does not need to allocate objects, and that's the reason it's faster.
If you are searching for the fastest way to determine if a string has duplicate chars, this function is shorter and faster than the previous two (taken from "check duplicates in list"):
def is_unique_chars_set(my_str):
return len(my_str) != len(set(my_str))
Timeit shows a 3x speedup (the last line is the new one):
>python3 test.py
[2.398782472571393, 2.3595238689519626, 2.37358552995787]
[1.0055455609592512, 1.007462347465074, 1.012826469701654]
[0.32564058749026437, 0.3303359144351621, 0.31772896318809885]
Note: The results may vary heavily if you use another python runtime, such as IronPython
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With