Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Set Simplification

Say I have two sets, set1 = {a,b,c,d,e,f} and set2 = {a,b,c,d,e,g}. Rather than expressing these explicitly, I want to create something like

common = {a,b,c,d,e}
set1 = common + f
set2 = common + g

If we wanted to represent {a,b,c,h}, we could represent it as common - d - e + h.

My goal is basically to be able to generate the optimal common portion to be used. With only one common section this isn't too challenging, but I need to allow more than one (but not unlimited, or the benefits gained would be trivial).

By optimal, I mean "least number of elements expressed". So in the above example, it "costs" 5 (number of elements) to make the common variable. Then sets 1 and 2 both cost 2 (one to reference common, one to add the extra element), totalling 7. Without the substitution, these would cost 12 to store (6 elements each). Similarly, in subtracting an element from a referenced would "cost" 1.

Another example, {a,b,c,d}, {a,c,d,e}, {e,f,g,h} and {e,f}

could be

common1 = {a,c,d}
common2 = {e,f,g}
set1 = common1 + b
set2 = common1 + e
set3 = common2 + h
set4 = common2 - g

By allowing multiple common portions this becomes a lot more challenging. Is there a name for this type of problem, or something similar? It seems like it could be related to compression, but I haven't been able to find too many resources on where to start with this.

Some other details that may be relevent:

  • Being allowed to reference multiple common portions to represent one set can be valid, but isn't required.
  • For my use case, the sets will typically be around 20 elements and around 10 different sets.
like image 206
John Howard Avatar asked Oct 29 '22 03:10

John Howard


1 Answers

You could find all atomic sets, that is all sets that are never not seen apart.

{a,b,c,d,e,f,g,h}       | {a,b,c,d} = {a,b,c,d},{e,f,g,h} 
{a,b,c,d},{e,f,g,h}     | {a,c,d,e} = {a,b,c,d},{e,f,g,h} 
{a,c,d},{b},{e},{f,g,h} | {e,f,g,h} = {a,c,d},{b},{e},{f,g,h}
{a,c,d},{b},{e},{f,g,h} | {e,f}     = {a,c,d},{b},{e},{f},{g,h}

{a,b,c,d} = {a,c,d},{b}
{a,c,d,e} = {a,c,d},{e}
{e,f,g,h} = {e},{f},{g,h}
{e,f}     = {e},{f}

This is a little closer but it doesnt solve the minimal breakdown.

I dont think you can find the minimal because i suspect that it is NP-Hard. If you consider a set S and create a graph where each possible subset of S is a node G. Now give a node weight according to the length of the subset, and draw an edge between each node that corresponds to the amount of change. {abc} -> {a} has a weight of 2. {bcd} -> {abe} has a weight of 4. Now to find a minimal solution to the common set problem you need to find a minimal weight spanning tree that covers each of the sets you are interested in. If you find that you can use this to build a minimal common set -- these would be equivelent. Finding minimum weight tree in a node weighted graph is called the Node-Weighed Steiner Tree Problem. A Node weighted Steiner Tree Problem can be shown equivalent to the Steiner Tree Problem. The Steiner Tree problem can be show to be NP-Hard. So I strongly suspect the problem you are trying to solve is NP-Hard.

http://theory.cs.uni-bonn.de/info5/steinerkompendium/node15.html

http://theory.cs.uni-bonn.de/info5/steinerkompendium/node17.html

like image 78
gbtimmon Avatar answered Nov 09 '22 16:11

gbtimmon