 

Finding the optimum file size combination

This is a problem I would think there is already an algorithm for - but I do not know the right words to search for on Google, it seems :).

The problem: I would like to make a little program in which I select a directory containing any files (for my purposes, media files: audio and video). I would then enter, in MB, the maximum total file size that must not be exceeded. At this point you would hit a "Calculate best fit" button.

This button should compare all the files in the directory and produce a list of the files whose combined size comes as close as possible to the maximum total file size without going over the limit.

This way you could find out which files to combine when burning a CD or DVD so that you use as much of the disc as possible.

I've tried to come up with the algorithm for this myself - but failed :(.

Anyone know of some nice algorithm for doing this?

Thanks in advance :)

asked Sep 02 '10 by CKa


2 Answers

Just for fun I tried out the accurate dynamic programming solution. Written in Python, because of my supreme confidence that you shouldn't optimise until you have to ;-)

This could provide either a start, or else a rough idea of how close you can get before resorting to approximation.

Code based on http://en.wikipedia.org/wiki/Knapsack_problem#0-1_knapsack_problem, hence the less-than-informative variable names m, W, w, v.

#!/usr/bin/python

import sys

solcount = 0

class Solution(object):
    def __init__(self, items):
        object.__init__(self)
        #self.items = items
        self.value = sum(items)
        global solcount
        solcount += 1
    def __str__(self):
        #return str(self.items) + ' = ' + str(self.value)
        return ' = ' + str(self.value)

m = {}  # memo table: (number of remaining items, remaining capacity) -> best Solution

def compute(v, w):
    # best Solution using the sizes in v within capacity w (memoised on (len(v), w))
    coord = (len(v),w)
    if coord in m:
        return m[coord]
    if len(v) == 0 or w == 0:
        m[coord] = Solution([])
        return m[coord]
    newvalue = v[0]
    newarray = v[1:]
    notused = compute(newarray, w)
    if newvalue > w:
        m[coord] = notused
        return notused
    # used = Solution(compute(newarray, w - newvalue).items + [newvalue])
    used = Solution([compute(newarray, w - newvalue).value] + [newvalue])
    best = notused if notused.value >= used.value else used
    m[coord] = best
    return best

def main():
    v = [int(l) for l in open('filesizes.txt')]
    W = int(sys.argv[1])
    print len(v), "items, limit is", W
    print compute(v, W)
    print solcount, "solutions computed"

if __name__ == '__main__':
    main()

For simplicity I'm just considering the file sizes: once you have the list of sizes that you want to use, you can find some filenames with those sizes by searching through a list, so there's no point tangling up filenames in the core, slow part of the program. I'm also expressing everything in multiples of the block size.
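For example, the input could be produced with something like this (the block size, the scan helper and the size-to-filename map are all just illustrative):

import os
from collections import defaultdict

BLOCK = 2048  # assumed 2k block size

def scan(directory):
    blocks = []                  # file sizes in blocks, the solver's input
    by_size = defaultdict(list)  # block count -> filenames of that size
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            b = -(-os.path.getsize(path) // BLOCK)  # round up to whole blocks
            blocks.append(b)
            by_size[b].append(name)
    return blocks, by_size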

As you can see, I've commented out the code that gives the actual solution (as opposed to the value of the solution). That was to save memory - the proper way to store the list of files used isn't one list in each Solution, it's to have each solution point back to the Solution it was derived from. You can then calculate the list of filesizes at the end by going back through the chain, outputting the difference between the values at each step.
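A sketch of that parent-pointer variant might look like this (called ChainedSolution here just to keep it separate from the class above):

class ChainedSolution(object):
    def __init__(self, value=0, parent=None):
        self.value = value    # total size of the files chosen so far
        self.parent = parent  # the Solution this one was derived from, or None
    def sizes(self):
        # walk back up the chain; each step's value difference is one file size
        result, node = [], self
        while node.parent is not None:
            result.append(node.value - node.parent.value)
            node = node.parent
        return result

In compute(), the "used" case would then become something like used = ChainedSolution(sub.value + newvalue, parent=sub), where sub is the recursive result.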

With a list of 100 randomly-generated file sizes in the range 2000-6000 (I'm assuming 2k blocks, so that's files of size 4-12MB), this solves for W=40K in 100 seconds on my laptop. In doing so it computes 2.6M of a possible 4M solutions.
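Generating that kind of test input is trivial, something along the lines of:

import random

with open('filesizes.txt', 'w') as f:
    for _ in range(100):
        f.write('%d\n' % random.randint(2000, 6000))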

Complexity is O(W*n), where n is the number of files. This does not contradict the fact that the problem is NP-complete: the running time is only pseudo-polynomial, since W grows exponentially with the number of bits used to encode it. So I am at least approaching a solution, and this is just in unoptimised Python.

Clearly some optimisation is now required, because actually it needs to be solved for W=4M (8GB DVD) and however many files you have (let's say a few thousand). Presuming that the program is allowed to take 15 minutes (comparable to the time required to write a DVD), that means performance is currently short by a factor of roughly 10^3. So we have a problem that's quite hard to solve quickly and accurately on a PC, but not beyond the bounds of technology.

Memory use is the main concern, since once we start hitting swap we'll slow down, and if we run out of virtual address space we're in real trouble because we have to implement our own storage of Solutions on disk. My test run peaks at 600MB. If you wrote the code in C on a 32-bit machine, each "solution" would have a fixed size of 8 bytes. You could therefore generate a massive 2-D array of them without doing any memory allocation in the loop, but in 2GB of RAM you could only handle W=4M and n=67. Oops - DVDs are out. It could very nearly solve for 2k-blocksize CDs, though: W=350k gives n=766.

Edit: MAK's suggestion to compute iteratively bottom-up, rather than recursively top-down, should massively reduce the memory requirement. First calculate m(1,w) for all 0 <= w <= W. From this array, you can calculate m(2,w) for all 0 <= w <= W. Then you can throw away all the m(1,w) values: you won't need them to calculate m(3,w) etc.
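A rough sketch of that bottom-up version (collapsed further here to a single array of W+1 entries by sweeping capacities downwards, so each file is counted at most once):

def best_total(sizes, W):
    m = [0] * (W + 1)  # m[w] = best achievable total for capacity w so far
    for size in sizes:
        for w in range(W, size - 1, -1):  # downwards, so each file is used at most once
            m[w] = max(m[w], m[w - size] + size)
    return m[W]

That only gives the value of the best fit; to recover which files were used you still need something like the parent-pointer chain above, or a little extra bookkeeping per (file, capacity) step.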

By the way, I suspect that actually the problem you want to solve might be the bin packing problem, rather than just the question of how to get the closest possible to filling a DVD. That is: if you have a bunch of files, you want to write them all to DVDs, using as few discs as possible. There are situations where solving the bin packing problem is very easy, but solving this problem is hard. For example, suppose that you have 8GB disks, and 15GB of small files. It's going to take some searching to find the closest possible match to 8GB, but the bin-packing problem would be trivially solved just by putting roughly half the files on each disk - it doesn't matter exactly how you divide them because you're going to waste 1GB of space whatever you do.
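For the bin-packing version, the classic quick heuristic is first-fit decreasing - sort the files by size and drop each one into the first disc that still has room - roughly:

def first_fit_decreasing(sizes, capacity):
    # assumes every file fits on an empty disc by itself
    discs = []  # each disc is [remaining space, [sizes placed on it]]
    for size in sorted(sizes, reverse=True):
        for disc in discs:
            if disc[0] >= size:
                disc[0] -= size
                disc[1].append(size)
                break
        else:  # no existing disc has room: start a new one
            discs.append([capacity - size, [size]])
    return [contents for _, contents in discs]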

All that said, there are extremely fast heuristics that give decent results much of the time. Simplest is to go through the list of files (perhaps in decreasing order of size), and include each file if it fits, exclude it otherwise. You only need to fall back to anything slow if fast approximate solutions aren't "good enough", for your choice of "enough".
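In code that heuristic is only a few lines, something like:

def greedy_fit(sizes, W):
    chosen, total = [], 0
    for size in sorted(sizes, reverse=True):  # biggest files first
        if total + size <= W:
            chosen.append(size)
            total += size
    return chosen, total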

answered Nov 15 '22 by Steve Jessop


This is, as others have pointed out, the Knapsack Problem, which is a combinatorial optimization problem. That means you look for some subset or permutation of a set which minimizes (or maximizes) a certain cost. Another well known such problem is the Traveling Salesman Problem.

Such problems are usually very hard to solve. But if you're interested in almost optimal solutions, you can use non-deterministic algorithms, like simulated annealing. You most likely won't get the optimal solution, but a nearly optimal one.

This link explains how simulated annealing can solve the Knapsack Problem, and therefore should be interesting to you.
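To give a flavour of it, a bare-bones annealer for this problem might look something like the following (the cooling schedule, the penalty for overweight states and the step count are all arbitrary choices here):

import math
import random

def anneal(sizes, W, steps=100000, t_start=1000.0, t_end=0.1):
    # state: a set of chosen files; a move flips one file in or out
    n = len(sizes)
    chosen = [False] * n
    total = 0
    best_total = 0

    def score(t):
        # reward filling the disc, penalise going over the limit
        return t if t <= W else 2 * W - t

    for step in range(steps):
        # geometric cooling from t_start down to t_end
        temp = t_start * (t_end / t_start) ** (step / float(steps))
        i = random.randrange(n)
        new_total = total - sizes[i] if chosen[i] else total + sizes[i]
        accept = score(new_total) >= score(total) or \
                 random.random() < math.exp((score(new_total) - score(total)) / temp)
        if accept:
            chosen[i] = not chosen[i]
            total = new_total
            if total <= W:
                best_total = max(best_total, total)
    return best_total

This only returns the best feasible total it saw; to get the actual file list you would copy the chosen flags whenever a new best turns up.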

answered Nov 15 '22 by Alexandre C.