 

Algorithm for learning ordering of elements (ideally in Java)

I have a number of ordered lists, most containing the same elements. I want to find the most probable order of the elements from the lists (samples).

Example:

l1={ a, b, f, h, z }
l2={ c, e, h, x, z }
l3={ a, e, y, z }
l4={ b, e, f, z }

The result should be:

R={ a, b, c, e, f, h, x, y, z }; or
R={ a, b, c, e, f, h, y, x, z }

The elements have no information regarding their natural order. The order should be learned from the lists, and in some cases the order in one list may contradict other lists, so I need the most probable order. I have about 175,000 lists covering about 1.8 million elements in total (260k unique); the number of elements per list varies.

I have already tried building a directed graph whose edges are weighted by the number of lists that connect the vertices in that order, and then traversing all paths to find the most probable sequence. This approach works well for small problems, but it is far too expensive at this scale.
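For reference, the edge-counting step could be sketched in Java roughly like this (class and method names are mine, purely illustrative):

    import java.util.*;

    // Illustrative sketch: count, over all lists, how often element x
    // appears immediately before element y.
    class PairCounts {
        static Map<String, Map<String, Integer>> build(List<List<String>> lists) {
            Map<String, Map<String, Integer>> edges = new HashMap<>();
            for (List<String> list : lists) {
                for (int i = 0; i + 1 < list.size(); i++) {
                    edges.computeIfAbsent(list.get(i), k -> new HashMap<>())
                         .merge(list.get(i + 1), 1, Integer::sum);
                }
            }
            return edges; // edge x -> (y -> number of lists with x directly before y)
        }
    }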

Any pointers will be greatly appreciated.

Thanks.

Juan

asked Oct 18 '22 by jfarjona

1 Answer

I think your problem is very similar to that of developing a player rating system for multiplayer games. Unfortunately, I don't see an easy answer, especially given your volume of data. I would be inclined to treat each list of N elements as N-1 two-player games, each recording a contest between a player and the player just above them on the list. If you can afford it, you could instead treat each list as N(N-1)/2 two-player games, recording all comparisons within the list. In either case, you could then apply a rating system for two-player games, such as https://en.wikipedia.org/wiki/Elo_rating_system.
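A minimal sketch of this in Java, assuming the simpler N-1 adjacent-pairs reading (the 1500 starting rating, the K-factor, and the 400 scale are standard Elo defaults, not values from this answer):

    import java.util.*;

    class EloOrdering {
        static final double K = 16.0; // illustrative K-factor

        static Map<String, Double> rate(List<List<String>> lists) {
            Map<String, Double> rating = new HashMap<>();
            for (List<String> list : lists) {
                for (int i = 0; i + 1 < list.size(); i++) {
                    // the earlier element "wins" the game against the next one
                    String winner = list.get(i), loser = list.get(i + 1);
                    double rw = rating.getOrDefault(winner, 1500.0);
                    double rl = rating.getOrDefault(loser, 1500.0);
                    // standard Elo expected score for the winner
                    double expected = 1.0 / (1.0 + Math.pow(10, (rl - rw) / 400.0));
                    rating.put(winner, rw + K * (1.0 - expected));
                    rating.put(loser, rl - K * (1.0 - expected));
                }
            }
            return rating;
        }
    }

Sorting the elements by descending rating would then give a candidate overall order.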

Another approach would be to write down a penalty function for the goodness of fit of any ordering, and then try to minimise the penalty. There are a number of functions which compare two lists with each other, such as https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient and https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient. Kendall's rank correlation is based on the number of pairwise comparisons you would get wrong in one list if you used the other as a predictor, so it might have some nice properties here. You could define the penalty for the overall list as the sum of the penalties you compute when comparing it with each of the input lists in turn.
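As a sketch, a Kendall-style penalty (counting, for each input list, the pairs that the candidate ordering ranks the other way round) could look like:

    import java.util.*;

    class Penalty {
        // total number of pairwise comparisons the candidate gets "wrong"
        // across all input lists; lower is better
        static long penalty(List<String> candidate, List<List<String>> lists) {
            Map<String, Integer> rank = new HashMap<>();
            for (int i = 0; i < candidate.size(); i++) rank.put(candidate.get(i), i);
            long total = 0;
            for (List<String> list : lists) {
                for (int i = 0; i < list.size(); i++) {
                    for (int j = i + 1; j < list.size(); j++) {
                        Integer ri = rank.get(list.get(i)), rj = rank.get(list.get(j));
                        if (ri != null && rj != null && ri > rj) total++; // discordant pair
                    }
                }
            }
            return total;
        }
    }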

One way to minimize such a penalty would be to start with a random ordering and then repeatedly remove an item from the ordering and put it back at whichever place minimizes the penalty function, until no such change improves matters. Unfortunately, given your volume of data, I don't think you can afford this.
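For what it's worth, that remove-and-reinsert search might be sketched as follows, with the penalty passed in as a function (for instance the Kendall-style Penalty above, partially applied to the input lists); as noted, it is only practical on small instances:

    import java.util.*;
    import java.util.function.ToLongFunction;

    class LocalSearch {
        static List<String> improve(List<String> order, ToLongFunction<List<String>> penalty) {
            boolean improved = true;
            while (improved) {
                improved = false;
                for (int i = 0; i < order.size(); i++) {
                    long before = penalty.applyAsLong(order);
                    String item = order.remove(i);
                    int bestPos = i;
                    long best = Long.MAX_VALUE;
                    for (int pos = 0; pos <= order.size(); pos++) {
                        order.add(pos, item);                  // try the item here
                        long p = penalty.applyAsLong(order);
                        if (p < best) { best = p; bestPos = pos; }
                        order.remove(pos);
                    }
                    order.add(bestPos, item);                  // keep the best placement
                    if (best < before) improved = true;        // strict improvement only
                }
            }
            return order;
        }
    }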

If you are prepared to turn your data into a list of two-player games between players of unknown strengths, then there are a variety of approaches you can take. If you represent the strengths of all the players by a single vector, such as (strengthA, strengthB, strengthC, ...), then the probability of A beating B might depend on the dot product of this vector with the vector (1, -1, 0, ...). This suggests that you could try finding a good fit with logistic regression, a perceptron-based model, or support vector machines.
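As an illustrative sketch of the logistic-regression view (a Bradley-Terry-style model fitted by stochastic gradient descent; the learning rate and epoch count are arbitrary choices of mine, not part of the answer):

    import java.util.*;

    class StrengthModel {
        // learn one strength per element so that
        // P(a before b) = 1 / (1 + exp(-(strength[a] - strength[b])))
        static Map<String, Double> fit(List<List<String>> lists, int epochs, double lr) {
            Map<String, Double> s = new HashMap<>();
            for (int e = 0; e < epochs; e++) {
                for (List<String> list : lists) {
                    for (int i = 0; i + 1 < list.size(); i++) {
                        String a = list.get(i), b = list.get(i + 1); // a came first
                        double sa = s.getOrDefault(a, 0.0), sb = s.getOrDefault(b, 0.0);
                        double p = 1.0 / (1.0 + Math.exp(-(sa - sb)));
                        double step = lr * (1.0 - p); // gradient of the log-likelihood
                        s.put(a, sa + step);
                        s.put(b, sb - step);
                    }
                }
            }
            return s; // sort by descending strength for the final ordering
        }
    }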

answered Nov 11 '22 by mcdowella