Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bit vector vs list of boolean values performance

I am trying to reproduce in Python two examples (originally written in Java) that I found in a book.

The two functions check if a string contains repeated characters. The first function uses an integer (checker) as a bit vector, while the second function simply uses a list of booleans. I was expecting to have a better performance using the function with bits, but actually it performs worse.

Why is that? Did I write something wrong while "translating" from Java to Python?

Note: for simplicity we only use lowercase letters (a to z), especially for the bit vector function.

import sys
import timeit

def is_unique_chars_bit(my_str):
    checker = 0
    for char in my_str:
        val = ord(char) - ord('a')
        if ((checker & (1 << val)) > 0):
            return False
        checker |= (1 << val)
    return True

def is_unique_chars_list(my_str):
    if len(my_str) > 128:
        # Supposing we use ASCII, which only has 128 chars
        return False
    char_list = [False] * 128
    for char in my_str:
        val = ord(char)
        if char_list[val]:
            return False
        char_list[val] = True
    return True

if __name__ == '__main__':
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    t_bit = timeit.Timer("is_unique_chars_bit('"+ alphabet +"')", "from __main__ import is_unique_chars_bit")
    t_list = timeit.Timer("is_unique_chars_list('"+ alphabet +"')", "from __main__ import is_unique_chars_list")
    print(t_bit.repeat(3, 200000))
    print(t_list.repeat(3, 200000))

Results:

[1.732477278999795, 1.7263494359995093, 1.7404333820004467]
[0.6785205180003686, 0.6759967380003218, 0.675434408000001]

The original Java functions are the following:

boolean isUniqueCharsBoolArray(String str) {
    if (str.length() > 128) return false;

    boolean[] char_set = new boolean[128];
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);
        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }
    return true;
}

boolean isUniqueCharsBits(String str) {
    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i) -'a';
        if ((checker & (1 << val)) > 0) {
            return false;
        }
        checker |= (1 << val);
    }
    return true;
}
like image 827
Kurt Bourbaki Avatar asked Jan 07 '16 10:01

Kurt Bourbaki


1 Answers

That is because integers are immutable reference classes in python. That means integer operations are slow in general. (This is true even for python2 ints) Look at the following line:

checker |= (1 << val)

If we expand the assignment we get:

checker = checker | (1 << val)

This single line allocates two new integers in memory. One for 1 << val and one for the bitwise or.

On the other hand, assigning an array element does not need to allocate objects, and that's the reason it's faster.

If you are searching for the fastest way to determine if a string has duplicate chars, this function is shorter and faster than the previous two (taken from "check duplicates in list"):

def is_unique_chars_set(my_str):
    return len(my_str) != len(set(my_str))

Timeit shows a 3x speedup (the last line is the new one):

>python3 test.py
[2.398782472571393, 2.3595238689519626, 2.37358552995787]
[1.0055455609592512, 1.007462347465074, 1.012826469701654]
[0.32564058749026437, 0.3303359144351621, 0.31772896318809885]

Note: The results may vary heavily if you use another python runtime, such as IronPython

like image 80
Tamas Hegedus Avatar answered Nov 16 '22 11:11

Tamas Hegedus