Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing items from unevenly distributed set

I have a website where users submit questions (zero, one or multiple per day), vote on them and answer one question per day (more details here). A user can see the question only once either by submitting, voting or answering it.

I have a pool of questions that players have already seen. I need to remove 30 questions from the pool each month. I need to pick questions to remove in such way that I maximize the number of available questions left in the pool for player with least available questions.

Example with pool of 5 questions (and need to remove 3):

  • player A has seen questions 1, 3 and 5
  • player B has seen questions 1 and 4
  • player C has seen questions 2 and 4

I though about removing the questions that top player has seen, but the position would change. Following the above example, player A has only got 2 questions left to play (2 and 4). However, if I remove 1, 3 and 5, the situation would be:

  • player A can play questions 2 and 4
  • player B can play question 2
  • player C cannot play anything because 1,3,5 are removed and he has already seen 2 and 4.

The score for this solution is zero, i.e. the player with least amount of available questions has zero available questions to play.

In this case it would be better to remove 1, 3 and 4, giving:

  • player A can play question 2
  • player B can play questions 2 and 5
  • player C can play question 5

The score for this solution is one, because the two players with least amount of available questions to play have one available question.

If the data size was small, I would be able to brute-force the solution. However, I have hundreds of players and questions, so I'm looking for some algorithm to solve this.

like image 265
Milan Babuškov Avatar asked Jan 05 '12 17:01

Milan Babuškov


5 Answers

Let's suppose that you have a general efficient algorithm for this. Concentrate on the questions left, rather than the questions removed.

You could use such an algorithm to solve the problem - can you choose at most T questions such that every user has at least one question to answer? I think that this is http://en.wikipedia.org/wiki/Set_cover, and I think solving your problem in general allows you to solve set cover, so I think it is NP-complete.

There is at least a linear programming relaxation. Associate each question with a variable Qi in the range 0<= Qi <= 1. Choosing questions Qi such that each user has at least X questions available amounts to the constraint SUM Uij Qj >= X, which is linear in Qj and X, so you can maximise for the objective function X with the linear variables X and Qj. Unfortunately, the result need not give you integer Qj - consider for example the case when all possible pairs of questions are associated with some user and you want each user to be able to answer at least 1 question, using at most half of the questions. The optimum solution is Qi = 1/2 for all i.

(But given a linear programming relaxation you could use it as the bound in http://en.wikipedia.org/wiki/Branch_and_bound).

Alternatively you could just write down the problem and throw it at an integer linear programming package, if you have one handy.

like image 171
mcdowella Avatar answered Oct 22 '22 12:10

mcdowella


For completeness of the thread, here is a simple greedy, aproximating approach.

Place the solved questions in the previously discussed matrix form:

Q0    X
Q1  XX
Q2    X
Q3  X  
Q4   XX
    223

Sort by the number of questions solved:

Q0  X  
Q1   XX
Q2  X  
Q3    X
Q4  XX 
    322

Strike out a question with the most Xs among the players with most problems solved. (This is guaranteed to decrease our measure if anything is):

=======
Q1   XX
Q2  X  
Q3    X
Q4  XX 
    222

Sort again:

=======
Q1   XX
Q2  X  
Q3    X
Q4  XX 
    222

Strike again:

=======
=======
Q2  X  
Q3    X
Q4  XX 
    211

Sort again:

=======
=======
Q2  X  
Q3    X
Q4  XX 
    211

Strike again:

=======
=======
Q2  X  
Q3    X
=======
    101

It's O(n^2logn) without optimizations, so it is plenty fast for some hundreds of questions. It's also easy to implement.

It's not optimal as can be seen from this counter example with 2 strikes:

Q0 X     
Q1      X
Q2 XXX
Q3    XXX
Q4  XXXX
Q5 222222

Here the greedy approach is going to remove Q5 and Q2 (or Q3) instead of Q2 and Q3 which would be optimal for our measure.

like image 32
Thomas Ahle Avatar answered Oct 22 '22 10:10

Thomas Ahle


I propose a bunch of optimizations based on the idea that you really want to maximize the number of unseen questions for the player with the minimum number of questions, and do not care if there is 1 player with the minimum number of questions or 10000 players with that same number of questions.

Step 1: Find the player with the minimum number of questions unseen (In your example, that would be player A) Call this player p.

Step 2: Find all players with within 30 of the number of questions unseen by player p. Call this set P. P are the only players who need to be considered, as removing 30 unseen questions from any other player would still leave them with more unseen questions than player p, and thus player p would still be worse off.

Step 3: Find the intersection of all sets of problems seen by players in P, you may remove all problems within this set, hopefully dropping you down from 30 to some smaller number of problems to remove, that we will call r. r <= 30

Step 4: Find the union of all sets of problems seen by players in P, Call this set U. If the size of U is <= r, you are done, remove all problems in U, and then remove remaining problems arbitrarily from your set of all problems, player p will lose r - size of U and remain with the fewest unseen problems, but this is the best you can do.

You are now left with your original problem, but likely with vastly smaller sets.
Your problem set is U, your player set is P, and you must remove r problems.

The brute force approach takes time (size(U) choose r) * size (P). If those numbers are reasonable, you can just brute force it. This approach is to choose each set of r problems from U and evaluate it against all players in P.

Since your problem does appear to be NP-Complete, the best you can probably hope for is an approximation. The easiest way to do this is to set some max number of tries, then randomly choose and evaluate sets of problems to remove. As such, a function to perform U choose r randomly becomes necessary. This can be done in time O(r), (In fact, I answered how to do this earlier today!)

Select N random elements from a List<T> in C#

You can also put any of the heuristics suggested by other users into your choices by weighting each problem's chance to be selected, I believe the link above shows how to do that in the selected answer.

like image 45
dhakim Avatar answered Oct 22 '22 12:10

dhakim


Linear programming models.

Variant 1.

Sum(Uij * Qj) - Sum(Dij * Xj) + 0     =  0 (for each i)
0             + Sum(Dij * Xj) - Score >= 0 (for each i)
Sum(Qj) = (Number of questions - 30)
Maximize(Score)

Uij is 1 if user i has not seen question j, otherwise it is 0

Dij is element of identity matrix (Dij=1 if i=j, otherwise it is 0)

Xj is auxiliary variable (one for each user)

Variant 2.

Sum(Uij * Qj) >= Score (for each i)
Sum(Qj) = (Number of questions - 30)
No objective function, just check feasibility

In this case, LP problem is simpler, but Score should be determined by binary and linear search. Set current range to [0 .. the least number of unseen questions for a user], set Score to the middle of the range, apply integer LP algorithm (with small time limit). If no solution found, set range to [begin .. Score], otherwise set it to [Score .. end] and continue binary search.

(Optionally) use binary search to determine upper bound for exact solution's Score.

Starting from the best Score, found by binary search, apply integer LP algorithm with Score, increased by 1, 2, ... (and limiting computation time as necessary). At the end, you get either exact solution, or some good approximation.

Here is sample code in C for GNU GLPK (for variant 1):

#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
  int ind[3000];
  double val[3000];
  int row;
  int col;
  glp_prob *lp;

  // Parameters
  int users = 120;
  int questions = 10000;
  int questions2 = questions - 30;
  int time = 30; // sec.

  // Create GLPK problem
  lp = glp_create_prob();
  glp_set_prob_name(lp, "questions");
  glp_set_obj_dir(lp, GLP_MAX);

  // Configure rows
  glp_add_rows(lp, users*2 + 1);
  for (row = 1; row <= users; ++row)
  {
    glp_set_row_bnds(lp, row, GLP_FX, 0.0, 0.0);
    glp_set_row_bnds(lp, row + users, GLP_LO, 0.0, 0.0);
  }
  glp_set_row_bnds(lp, users*2 + 1, GLP_FX, questions2, questions2);

  // Configure columns
  glp_add_cols(lp, questions + users + 1);
  for (col = 1; col <= questions; ++col)
  {
    glp_set_obj_coef(lp, col, 0.0);
    glp_set_col_kind(lp, col, GLP_BV);
  }
  for (col = 1; col <= users; ++col)
  {
    glp_set_obj_coef(lp, questions + col, 0.0);
    glp_set_col_kind(lp, questions + col, GLP_IV);
    glp_set_col_bnds(lp, questions + col, GLP_FR, 0.0, 0.0);
  }
  glp_set_obj_coef(lp, questions+users+1, 1.0);
  glp_set_col_kind(lp, questions+users+1, GLP_IV);
  glp_set_col_bnds(lp, questions+users+1, GLP_FR, 0.0, 0.0);

  // Configure matrix (question columns)
  for(col = 1; col <= questions; ++col)
  {
    for (row = 1; row <= users*2; ++row)
    {
      ind[row] = row;
      val[row] = ((row <= users) && (rand() % 2))? 1.0: 0.0;
    }
    ind[users*2 + 1] = users*2 + 1;
    val[users*2 + 1] = 1.0;
    glp_set_mat_col(lp, col, users*2 + 1, ind, val);
  }

  // Configure matrix (user columns)
  for(col = 1; col <= users; ++col)
  {
    for (row = 1; row <= users*2; ++row)
    {
      ind[row] = row;
      val[row] = (row == col)? -1.0: ((row == col + users)? 1.0: 0.0);
    }
    ind[users*2 + 1] = users*2 + 1;
    val[users*2 + 1] = 0.0;
    glp_set_mat_col(lp, questions + col, users*2 + 1, ind, val);
  }

  // Configure matrix (score column)
  for (row = 1; row <= users*2; ++row)
  {
    ind[row] = row;
    val[row] = (row > users)? -1.0: 0.0;
  }
  ind[users*2 + 1] = users*2 + 1;
  val[users*2 + 1] = 0.0;
  glp_set_mat_col(lp, questions + users + 1, users*2 + 1, ind, val);

  // Solve integer GLPK problem
  glp_iocp param;
  glp_init_iocp(&param);
  param.presolve = GLP_ON;
  param.tm_lim = time * 1000;
  glp_intopt(lp, &param);
  printf("Score = %g\n", glp_mip_obj_val(lp));

  glp_delete_prob(lp);
  return 0;
}

Time limit is not working reliably in my tests. Looks like some bug in GLPK...

Sample code for variant 2 (only LP algorithm, no automatic search for Score):

#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
  int ind[3000];
  double val[3000];
  int row;
  int col;
  glp_prob *lp;

  // Parameters
  int users = 120;
  int questions = 10000;
  int questions2 = questions - 30;
  double score = 4869.0 + 7;

  // Create GLPK problem
  lp = glp_create_prob();
  glp_set_prob_name(lp, "questions");
  glp_set_obj_dir(lp, GLP_MAX);

  // Configure rows
  glp_add_rows(lp, users + 1);
  for (row = 1; row <= users; ++row)
  {
    glp_set_row_bnds(lp, row, GLP_LO, score, score);
  }
  glp_set_row_bnds(lp, users + 1, GLP_FX, questions2, questions2);

  // Configure columns
  glp_add_cols(lp, questions);
  for (col = 1; col <= questions; ++col)
  {
    glp_set_obj_coef(lp, col, 0.0);
    glp_set_col_kind(lp, col, GLP_BV);
  }

  // Configure matrix (question columns)
  for(col = 1; col <= questions; ++col)
  {
    for (row = 1; row <= users; ++row)
    {
      ind[row] = row;
      val[row] = (rand() % 2)? 1.0: 0.0;
    }
    ind[users + 1] = users + 1;
    val[users + 1] = 1.0;
    glp_set_mat_col(lp, col, users + 1, ind, val);
  }

  // Solve integer GLPK problem
  glp_iocp param;
  glp_init_iocp(&param);
  param.presolve = GLP_ON;
  glp_intopt(lp, &param);

  glp_delete_prob(lp);
  return 0;
}

It appears that variant 2 allows to find pretty good approximation quite fast. And approximation is better than for variant 1.

like image 1
Evgeny Kluev Avatar answered Oct 22 '22 12:10

Evgeny Kluev


Let's say you want to delete Y questions from the pool. The simple algorithm would be to sort questions by the amount of views they had. Then you remove Y of the top viewed questions. For your example: 1: 2, 2: 1, 3: 1, 4: 2, 5: 1. Clearly, you better off removing questions 1 and 4. But this algorithm doesn't achieve the goal. However, it is a good starting point. To improve it, you need to make sure that every user will end up with at least X questions after the "cleaning".

In addition to the above array (which we can call "score"), you need a second one with questions and users, where crossing will have 1 if user have seen the question, and 0 if he didn't. Then, for every user you need to find X questions with lowest score edit: that he hasn't seen yet (the less their score the better, since the less people saw the question, the more "valuable" it is for the system overall). You combine all the found X questions from every user into third array, let's call it "safe", since we won't delete any from it.

As the last step you just delete Y top viewed questions (the ones with the highest score), which aren't in the "safe" array.

What that algorithm achieves also is that if deleting say 30 questions will make some users have less than X questions to view, it won't remove all 30. Which is, I guess, good for the system.

Edit: Good optimization for this would be to track not every user, but have some activity benchmark to filter people that saw only a few questions. Because if there are too many people that saw only say 1 rare different question, then nothing can be deleted. Filtering theese kind of users or improving the safe array functionality can solve it.

Feel free to ask questions if I didn't describe the idea deep enough.

like image 1
Ranty Avatar answered Oct 22 '22 10:10

Ranty