Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python implementation of "median of medians" algorithm

I've written this implementation of the median of medians algorithm in python, but it doesn't seem to output the right result, and it also does not seem of linear complexity to me, any idea where I went off track ?

def select(L):
    if len(L) < 10:
        L.sort()
        return L[int(len(L)/2)]
    S = []
    lIndex = 0
    while lIndex+5 < len(L)-1:
        S.append(L[lIndex:lIndex+5])
        lIndex += 5
    S.append(L[lIndex:])
    Meds = []
    for subList in S:
        print(subList)
    Meds.append(select(subList))
    L2 = select(Meds)
    L1 = L3 = []
    for i in L:
        if i < L2:
            L1.append(i)
        if i > L2:
            L3.append(i)
    if len(L) < len(L1):
        return select(L1)
    elif len(L) > len(L1) + 1:
        return select(L3)
    else:
        return L2

The function is called like so:

L = list(range(100))
shuffle(L)
print(select(L))

LE: Sorry. GetMed was a function that simply sorted the list and returned the element at len(list), it should've been select there, I fixed it now, but still I get the wrong outputs. As for the indentation, the code works without error, and I see nothing wrong with it :-??

LE2: I'm expecting 50 (for the current L), it gives me outputs from 30 to 70, no more no less (yet)

LE3: Thank you very much, that did the trick it works now. I'm confuse though, I'm trying to make a comparison between this method, and the naive one, where I simply sort the array and output the results. Now, from what I read so far, the time complexity of the select method should be O(n) Deterministic Selection. Although I probably can't compete with the optimisation python developers did, I did expect closer results than I got, for example, if I change the range of the list to 10000000, select outputs the result in 84.10837116255952 seconds while the sort and return method does it in 18.92556029528825. What are some good ways to make this algorithm faster?

like image 361
cpp_ninja Avatar asked May 29 '12 20:05

cpp_ninja


2 Answers

1) Your code indentation is wrong, try this:

def select(L):
    if len(L) < 10:
        L.sort()
        return L[int(len(L)/2)]
    S = []
    lIndex = 0
    while lIndex+5 < len(L)-1:
        S.append(L[lIndex:lIndex+5])
        lIndex += 5
    S.append(L[lIndex:])
    Meds = []
    for subList in S:
        print(subList)
        Meds.append(select(subList))
    L2 = select(Meds)
    L1 = L3 = []
    for i in L:
        if i < L2:
            L1.append(i)
        if i > L2:
            L3.append(i)
    if len(L) < len(L1):
        return select(L1)
    elif len(L) > len(L1) + 1:
        return select(L3)
    else:
        return L2

2) The method you use does not return the median, it just return a number which is not so far from the median. To get the median, you need to count how many number are greater than your pseudo-median, if a majority is greater, repeat the algorithm with the numbers greater than the pseudo-median, else repeat with the other numbers.

def select(L, j):
    if len(L) < 10:
        L.sort()
        return L[j]
    S = []
    lIndex = 0
    while lIndex+5 < len(L)-1:
        S.append(L[lIndex:lIndex+5])
        lIndex += 5
    S.append(L[lIndex:])
    Meds = []
    for subList in S:
        Meds.append(select(subList, int((len(subList)-1)/2)))
    med = select(Meds, int((len(Meds)-1)/2))
    L1 = []
    L2 = []
    L3 = []
    for i in L:
        if i < med:
            L1.append(i)
        elif i > med:
            L3.append(i)
        else:
            L2.append(i)
    if j < len(L1):
        return select(L1, j)
    elif j < len(L2) + len(L1):
        return L2[0]
    else:
        return select(L3, j-len(L1)-len(L2))

Warning: L = M = [] is not L = [] and M = []

like image 75
Thomash Avatar answered Nov 14 '22 23:11

Thomash


Below is my PYTHON implementation. For more speed, you might want to use PYPY instead.

For your question about SPEED: The theoretical speed for 5 numbers per column is ~10N, so I use 15 numbers per column, for a 2X speed at ~5N, while the optimal speed is ~4N. But, I could be wrong about the optimal speed of the most state-of-art solution. In my own test, my program runs slightly faster than the one using sort(). Certainly, your mileage may vary.

Assuming the python program is "median.py", an example to run it is "python ./median.py 100". For speed benchmark, you might want to comment out the validation code, and use PYPY.

#!/bin/python
#
# TH @stackoverflow, 2016-01-20, linear time "median of medians" algorithm
#
import sys, random


items_per_column = 15


def find_i_th_smallest( A, i ):
    t = len(A)
    if(t <= items_per_column):
        # if A is a small list with less than items_per_column items, then:
        #     1. do sort on A
        #     2. return the i-th smallest item of A
        #
        return sorted(A)[i]
    else:
        # 1. partition A into columns of items_per_column items each. items_per_column is odd, say 15.
        # 2. find the median of every column
        # 3. put all medians in a new list, say, B
        #
        B = [ find_i_th_smallest(k, (len(k) - 1)/2) for k in [A[j:(j + items_per_column)] for j in range(0,len(A),items_per_column)]]

        # 4. find M, the median of B
        #
        M = find_i_th_smallest(B, (len(B) - 1)/2)

        # 5. split A into 3 parts by M, { < M }, { == M }, and { > M }
        # 6. find which above set has A's i-th smallest, recursively.
        #
        P1 = [ j for j in A if j < M ]
        if(i < len(P1)):
            return find_i_th_smallest( P1, i)
        P3 = [ j for j in A if j > M ]
        L3 = len(P3)
        if(i < (t - L3)):
            return M
        return find_i_th_smallest( P3, i - (t - L3))


# How many numbers should be randomly generated for testing?
#
number_of_numbers = int(sys.argv[1])


# create a list of random positive integers
#
L = [ random.randint(0, number_of_numbers) for i in range(0, number_of_numbers) ]


# Show the original list
#
print L


# This is for validation
#
print sorted(L)[int((len(L) - 1)/2)]


# This is the result of the "median of medians" function.
# Its result should be the same as the validation.
#
print find_i_th_smallest( L, (len(L) - 1) / 2)
like image 42
user5818263 Avatar answered Nov 14 '22 23:11

user5818263