Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement something like this::

Given n, k and an array(a list) arr where n = len(arr) and k is an integer in set (1, n) inclusive.

For an array (or list) myList, The Unfairness Sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.

To explain: if mylist = [1, 2, 5, 5, 6] then Minimum unfairness sum or MUS. Please note that elements are considered unique by their index in list not their values

MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|

If you actually need to look at the problem statement, It's HERE

My Objective

given n, k, arr(as described above), find the Minimum Unfairness Sum out of all of the unfairness sums of sub arrays possible with a constraint that each len(sub array) = k [which is a good thing to make our lives easy, I believe :) ]

what I have tried

well, there is a lot to be added in here, so I'll try to be as short as I can.

My First approach was this where i used itertools.combinations to get all the possible combinations and statistics.variance to check its spread of data (yeah, I know I'm a mess).
Before you see the code below, Do you think these variance and unfairness sum are perfectly related (i know they are strongly related) i.e. the sub array with minimum variance has to be the sub array with MUS??

You only have to check the LetMeDoIt(n, k, arr) function. If you need MCVE, check the second code snippet below.

from itertools import combinations as cmb
from statistics import variance as varn

def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list

    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass

        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1

    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way

    return sum(final)

The above code works fine for n<30 but raised a MemoryError beyond that. In Python chat, Kevin suggested me to try generator which is memory efficient (it really is), but as generator also generates those combination on the fly as we iterate over them, it was supposed to take over 140 hours (:/) for n=50, k=8 as estimated.

I posted the same as a question on SO HERE (you might wanna have a look to understand me properly - it has discussions and an answer by fusion which takes me to my second approach - a better one(i should say fusion's approach xD)).

Second Approach

from itertools import combinations as cmb

def myvar(arr):   # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l

def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var<variance:
            variance = var
            min_variance_sub=sub
            
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way

    for r in f:
        final.append(abs(r[0] - r[1]))

    return sum(final)

def MainApp():
    n = int(input())
    k = int(input())

    arr = list(int(input()) for _ in range(n))

    result = LetMeDoIt(n, k, arr)

    print(result)    

if __name__ == '__main__':
    MainApp()

This code works perfect for n up to 1000 (maybe more), but terminates due to time out (5 seconds is the limit on online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).

=====

How would you approach this problem to take care of all the test cases in given time limits (5 sec) ? (problem was listed under algorithm & dynamic programming)

(for your references you can have a look on

  1. successful submissions(py3, py2, C++, java) on this problem by other candidates - so that you can explain that approach for me and future visitors)
  2. an editorial by the problem setter explaining how to approach the question
  3. a solution code by problem setter himself (py2, C++).
  4. Input data (test cases) and expected output

Edit1 ::

For future visitors of this question, the conclusions I have till now are,
that variance and unfairness sum are not perfectly related (they are strongly related) which implies that among a lots of lists of integers, a list with minimum variance doesn't always have to be the list with minimum unfairness sum. If you want to know why, I actually asked that as a separate question on math stack exchange HERE where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected)

As far as the question is concerned overall, you can read answers by archer & Attersson below (still trying to figure out a naive approach to carry this out - it shouldn't be far by now though)


Thank you for any help or suggestions :)

like image 567
P S Solanki Avatar asked Sep 07 '20 08:09

P S Solanki


2 Answers

You must work on your list SORTED and check only sublists with consecutive elements. This is because BY DEFAULT, any sublist that includes at least one element that is not consecutive, will have higher unfairness sum.

For example if the list is

[1,3,7,10,20,35,100,250,2000,5000] and you want to check for sublists with length 3, then solution must be one of [1,3,7] [3,7,10] [7,10,20] etc Any other sublist eg [1,3,10] will have higher unfairness sum because 10>7 therefore all its differences with rest of elements will be larger than 7 The same for [1,7,10] (non consecutive on the left side) as 1<3

Given that, you only have to check for consecutive sublists of length k which reduces the execution time significantly

Regarding coding, something like this should work:

def myvar(array):
    return sum([abs(i[0]-i[1]) for i in itertools.combinations(array,2)])  
  
def minsum(n, k, arr):
        res=1000000000000000000000 #alternatively make it equal with first subarray
        for i in range(n-k):
            res=min(res, myvar(l[i:i+k]))
        return res
    
like image 199
IoaTzimas Avatar answered Oct 17 '22 12:10

IoaTzimas


I see this question still has no complete answer. I will write a track of a correct algorithm which will pass the judge. I will not write the code in order to respect the purpose of the Hackerrank challenge. Since we have working solutions.

  1. The original array must be sorted. This has a complexity of O(NlogN)

  2. At this point you can check consecutive sub arrays as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer

  3. The last check passage, to find the minimum "unfairness sum" can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating this for every step, done in O(k), which brings the complexity of this passage to O(k*N). It can be done in O(1) as the editorial you posted shows, including mathematic formulae. It requires a previous initialization of a cumulative array after step 1 (done in O(N) with space complexity O(N) too).

It works but terminates due to time out for n<=10000.

(from comments on archer's question)

To explain step 3, think about k = 100. You are scrolling the N-long array and the first iteration, you must calculate the US for the sub array from element 0 to 99 as usual, requiring 100 passages. The next step needs you to calculate the same for a sub array that only differs from the previous by 1 element 1 to 100. Then 2 to 101, etc. If it helps, think of it like a snake. One block is removed and one is added. There is no need to perform the whole O(k) scrolling. Just figure the maths as explained in the editorial and you will do it in O(1).

So the final complexity will asymptotically be O(NlogN) due to the first sort.

like image 20
Attersson Avatar answered Oct 17 '22 13:10

Attersson