
How do you evaluate the efficiency of an algorithm, if the problem space is underspecified?

Tags:

algorithm

There was a post on here recently which posed the following question:

You have a two-dimensional plane of (X, Y) coordinates. A bunch of random points are chosen. You need to select the largest possible set of chosen points, such that no two points share an X coordinate and no two points share a Y coordinate.

This is all the information that was provided.

There were two possible solutions presented.

One suggested using a maximum flow algorithm, such that each selected point maps to a path linking source → X → Y → sink. This runs in O(V³) time, where V is the number of points selected.
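
For concreteness, here's a rough sketch of that flow construction. I'm using networkx's maximum_flow_value purely as an illustration; the original answer didn't name any particular implementation, so treat this as an assumption about how it might look:

    # Sketch of the max-flow formulation: source -> X nodes -> Y nodes -> sink,
    # all edges with capacity 1, so the max flow equals the largest valid set.
    import networkx as nx

    def largest_selection_maxflow(points):
        """points: iterable of chosen (x, y) coordinates."""
        G = nx.DiGraph()
        for x, y in points:
            G.add_edge("source", ("X", x), capacity=1)
            G.add_edge(("X", x), ("Y", y), capacity=1)
            G.add_edge(("Y", y), "sink", capacity=1)
        return nx.maximum_flow_value(G, "source", "sink")

    print(largest_selection_maxflow([(1, 1), (1, 2), (2, 2)]))  # -> 2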

Another (mine) suggested using the Hungarian algorithm. Create an n×n matrix of 1s, then set every chosen (x, y) coordinate to 0. The Hungarian algorithm will give you the lowest-cost assignment for this matrix, and the answer is the number of assigned coordinates that equal 0. This runs in O(n³) time, where n is the greater of the number of rows and the number of columns.
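
Here's a rough sketch of what I mean, using scipy.optimize.linear_sum_assignment as a stand-in for a Hungarian implementation (my assumption; it solves the same assignment problem):

    # Build the cost matrix of 1s, zero out the chosen cells, solve the
    # assignment problem, and count how many assigned cells are chosen points.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def largest_selection_hungarian(points, n):
        """points: chosen (x, y) pairs with 0 <= x, y < n."""
        cost = np.ones((n, n), dtype=int)
        for x, y in points:
            cost[x, y] = 0
        rows, cols = linear_sum_assignment(cost)   # minimum-cost assignment
        return int(np.sum(cost[rows, cols] == 0))

    print(largest_selection_hungarian([(1, 1), (1, 2), (2, 2)], n=3))  # -> 2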

My reasoning is that, for the vast majority of cases, the Hungarian algorithm is going to be faster; V is equal to n in the case where there's one chosen point for each row or column, and substantially greater for any case where there's more than that: given a 50×50 matrix with half the coordinates chosen, V is 1,250 and n is 50.

The counterargument is that there are some cases, like a 10⁹×10⁹ matrix with only two points selected, where V is 2 and n is 1,000,000,000. For this case, it takes the Hungarian algorithm a ridiculously long time to run, while the maximum flow algorithm is blindingly fast.
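
To put rough numbers on both cases (crude operation counts only, constants ignored):

    # Compare V**3 (max flow) against n**3 (Hungarian) for the two cases above.
    cases = {
        "50x50 grid, half the cells chosen": {"V": 1250, "n": 50},
        "10^9 x 10^9 grid, two points chosen": {"V": 2, "n": 10**9},
    }
    for name, c in cases.items():
        print(f"{name}: V^3 = {c['V'] ** 3:.2e}, n^3 = {c['n'] ** 3:.2e}")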

Here is the question: given that the problem doesn't provide any information regarding the size of the matrix or the probability that a given point is chosen (so you can't know for sure), how do you decide which algorithm, in general, is a better choice for the problem?

asked Jul 25 '10 by Chris B.


2 Answers

You can't, it's an imponderable.

You can only define which is better "in general" by defining what inputs you will see "in general". So for example you could whip up a probability model of the inputs, so that the expected value of V is a function of n, and choose the one with the best expected runtime under that model. But there may be arbitrary choices made in the construction of your model, so that different models give different answers. One model might choose co-ordinates at random, another model might look at the actual use-case for some program you're thinking of writing, and look at the distribution of inputs it will encounter.
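
For instance, here's one completely arbitrary model of that kind (every cell of the n×n grid chosen independently with probability p), just to show that the "better" algorithm falls out of the model rather than the problem; the numbers in the examples are made up:

    # Arbitrary input model: each of the n*n cells is chosen independently
    # with probability p, so E[V] = p * n * n.  Compare the rough cost
    # proxies E[V]**3 (max flow) and n**3 (Hungarian) under that model.
    def preferred_algorithm(n, p):
        expected_v = p * n * n
        return "max flow" if expected_v ** 3 < n ** 3 else "Hungarian"

    print(preferred_algorithm(n=50, p=0.5))       # dense grid -> "Hungarian"
    print(preferred_algorithm(n=10**9, p=2e-18))  # ~2 expected points -> "max flow"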

You can alternatively talk about which has the best worst case (across all possible inputs with given constraints), which has the virtue of being easy to define, and the flaw that it's not guaranteed to tell you anything about the performance of your actual program. So for instance HeapSort is faster than QuickSort in the worst case, but slower in the average case. Which is faster? Depends whether you care about average case or worst case. If you don't care which case, you're not allowed to care which "is faster".

This is analogous to trying to answer the question "what is the probability that the next person you see will have an above-average (mean) number of legs?".

We might implicitly assume that the next person you meet will be selected at random with uniform distribution from the human population (and hence the answer is "slightly less than one", since the mean is less than the mode average, and the vast majority of people are at the mode).

Or we might assume that your next meeting with another person is randomly selected with uniform distribution from the set of all meetings between two people, in which case the answer is still "slightly less than one", but I reckon not the exact same value as the first - one-and-zero-legged people quite possibly congregate with "their own kind" very slightly more than their frequency within the population would suggest. Or possibly they congregate less, I really don't know, I just don't see why it should be exactly the same once you take into account Veterans' Associations and so on.

Or we might use knowledge about you - if you live with a one-legged person then the answer might be "very slightly above 0".

Which of the three answers is "correct" depends precisely on the context which you are forbidding us from talking about. So we can't talk about which is correct.

answered Oct 27 '22 by Steve Jessop


Given that you don't know what each pill does, do you take the red pill or the blue pill?

If there really is not enough information to decide, there is not enough information to decide. Any guess is as good as any other.

Maybe, in some cases, it is possible to divine extra information to base the decision on. I haven't studied your example in detail, but it seems like the Hungarian algorithm might have higher memory requirements. This might be a reason to go with the maximum flow algorithm.
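
As a back-of-the-envelope sanity check (assuming a dense n×n cost matrix for the Hungarian approach and storage roughly proportional to the number of chosen points V for the flow network; the byte counts are illustrative guesses, not measurements):

    # Rough memory estimate for the two formulations.
    def rough_memory_bytes(n, v, bytes_per_entry=8):
        hungarian = n * n * bytes_per_entry   # dense n x n cost matrix
        max_flow = 3 * v * bytes_per_entry    # ~one edge triple per chosen point
        return hungarian, max_flow

    h, f = rough_memory_bytes(n=10**9, v=2)
    print(f"Hungarian ~ {h:.1e} bytes, max flow ~ {f} bytes")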

answered Oct 27 '22 by Thomas