
Efficient algorithm for pairwise comparison of elements

Given an array of objects, each holding some key-value pairs:

[
  {'a': 1, 'b': 1},
  {'a': 2, 'b': 1},
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

I want to find the intersection of these pairs. By intersection I mean keeping only those elements that cover others, or that are unique. For example, {'a': 1, 'b': 1, 'c': 1} and {'a': 1, 'b': 1, 'c': 2} each fully cover {'a': 1, 'b': 1}, so {'a': 1, 'b': 1} should be removed, while {'a': 2, 'b': 2} is unique and stays. So, in

[
  {'a': 1, 'b': 1},
  {'a': 2, 'b': 1},
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

after finding the intersection, the following should remain:

[
  {'a': 2, 'b': 2},
  {'a': 1, 'b': 1, 'c': 1},
  {'a': 1, 'b': 1, 'c': 2},
  {'a': 2, 'b': 1, 'c': 1},
  {'a': 2, 'b': 1, 'c': 2}
]

I tried iterating over all pairs and comparing each one with every other to find covering pairs, but the time complexity is O(n^2). Is it possible to find all covering or unique pairs in linear time?

Here is my code example (O(n^2)):

public Set<Map<String, Integer>> find(Set<Map<String, Integer>> allPairs) {
    var results = new HashSet<Map<String, Integer>>();
    for (Map<String, Integer> stringToValue : allPairs) {
        results.add(stringToValue);
        // Buffer the changes so `results` is not modified while iterating it.
        var mapsToAdd = new HashSet<Map<String, Integer>>();
        var mapsToDelete = new HashSet<Map<String, Integer>>();
        for (Map<String, Integer> result : results) {
            var comparison = new MapComparison(stringToValue, result);
            if (comparison.isIntersected()) {
                // Keep the covering (larger) map, drop the covered (smaller) one.
                mapsToAdd.add(comparison.max());
                mapsToDelete.add(comparison.min());
            }
        }
        results.removeAll(mapsToDelete);
        results.addAll(mapsToAdd);
    }
    return results;
}

where MapComparison is:

import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class MapComparison {

    private final Map<String, Integer> left;
    private final Map<String, Integer> right;
    private final ComparisonDecision decision;

    public MapComparison(Map<String, Integer> left, Map<String, Integer> right) {
        this.left = left;
        this.right = right;
        this.decision = makeDecision();
    }

    private ComparisonDecision makeDecision() {
        // Entries present on one side only; if one of these sets is empty,
        // that side is a subset of the other.
        var inLeftOnly = new HashSet<>(left.entrySet());
        var inRightOnly = new HashSet<>(right.entrySet());

        inLeftOnly.removeAll(right.entrySet());
        inRightOnly.removeAll(left.entrySet());

        if (inLeftOnly.isEmpty() && inRightOnly.isEmpty()) {
            return ComparisonDecision.EQUALS;
        } else if (inLeftOnly.isEmpty()) {
            return ComparisonDecision.RIGHT_GREATER;
        } else if (inRightOnly.isEmpty()) {
            return ComparisonDecision.LEFT_GREATER;
        } else {
            return ComparisonDecision.NOT_COMPARABLE;
        }
    }

    public boolean isIntersected() {
        return Set.of(ComparisonDecision.LEFT_GREATER, ComparisonDecision.RIGHT_GREATER)
                .contains(decision);
    }

    public boolean isEquals() {
        return Objects.equals(ComparisonDecision.EQUALS, decision);
    }

    public Map<String, Integer> max() {
        if (!isIntersected()) {
            throw new IllegalStateException();
        }
        return ComparisonDecision.LEFT_GREATER.equals(decision) ? left : right;
    }

    public Map<String, Integer> min() {
        if (!isIntersected()) {
            throw new IllegalStateException();
        }
        return ComparisonDecision.LEFT_GREATER.equals(decision) ? right : left;
    }

    public enum ComparisonDecision {
        EQUALS,
        LEFT_GREATER,
        RIGHT_GREATER,
        NOT_COMPARABLE
    }
}
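
For illustration, here is how the class behaves on two rows from the example data (a hypothetical usage sketch, not from the original post):

var comparison = new MapComparison(
        Map.of("a", 1, "b", 1),
        Map.of("a", 1, "b", 1, "c", 1));

comparison.isIntersected(); // true: the right map fully covers the left one
comparison.max();           // {a=1, b=1, c=1} -- the covering map
comparison.min();           // {a=1, b=1}      -- the covered map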


1 Answer

Here's an algorithm which may be better or worse than yours, depending on the shape of the data. Let's simplify the problem by representing the input rows as sets instead of maps, since you are essentially treating those maps as sets of key/value pairs; the problem is equivalent if the sets are like [a1, b1] and so on. The goal is a linear-time algorithm under the assumption that the input rows are short: let n be the number of input rows and k the maximum length of a row, where k is assumed to be much smaller than n.
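
To make the representation concrete, here is one way to convert the question's Map<String, Integer> rows into hashable entry sets; the helper name toRow is illustrative, not prescribed:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Copy the map's entries into an ordinary HashSet. Map.entry(...) creates
// immutable entries, and Set equality/hashing is value-based, so the
// resulting rows can themselves be stored in another HashSet.
static Set<Map.Entry<String, Integer>> toRow(Map<String, Integer> map) {
    Set<Map.Entry<String, Integer>> row = new HashSet<>();
    map.forEach((key, value) -> row.add(Map.entry(key, value)));
    return row;
}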

  • Use a counting sort to sort the rows by length.
  • Initialise an empty HashSet for the result, where the members of the set will be rows (you will need an immutable, hashable class to represent the rows).
  • For each row:
    • Remove each subset in the row's power set from the result, if it is present.
    • Add the row to the result.

Since the rows are sorted by length, it is guaranteed that if row i is a subset of row j then row i would have been added before row j, and hence will later be correctly removed from the result set. Once the algorithm terminates, the result set contains exactly those input rows which are not subsets of any other input row.
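
Here is a minimal Java sketch of the whole procedure; the class and method names (MaximalRows, findMaximal) are illustrative rather than prescribed:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaximalRows {

    // Keep only the rows that are not subsets of any other row.
    // Assumes each row has fewer than ~31 entries, since proper
    // subsets are enumerated with an int bitmask.
    public static <E> Set<Set<E>> findMaximal(Collection<Set<E>> rows) {
        // Counting sort by row length: bucket i holds the rows of size i.
        int maxLen = 0;
        for (Set<E> row : rows) {
            maxLen = Math.max(maxLen, row.size());
        }
        List<List<Set<E>>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxLen; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Set<E> row : rows) {
            buckets.get(row.size()).add(row);
        }

        // Process rows shortest-first: any subset of the current row was
        // added earlier, so evicting its power set removes exactly the
        // covered rows.
        Set<Set<E>> result = new HashSet<>();
        for (List<Set<E>> bucket : buckets) {
            for (Set<E> row : bucket) {
                List<E> elems = new ArrayList<>(row);
                int k = elems.size();
                // Masks 0 .. 2^k - 2 enumerate every proper subset of the row.
                for (int mask = 0; mask < (1 << k) - 1; mask++) {
                    Set<E> subset = new HashSet<>();
                    for (int i = 0; i < k; i++) {
                        if ((mask & (1 << i)) != 0) {
                            subset.add(elems.get(i));
                        }
                    }
                    result.remove(subset);
                }
                result.add(row);
            }
        }
        return result;
    }
}

Applied to the question's seven rows (after converting each map with a helper like toRow above), findMaximal returns exactly the five rows of the expected output.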

The time complexity of the counting sort is O(n + k). Each power set has size at most 2^k, and each member of the power set has length at most k, so each HashSet operation takes O(k) time. So the time complexity of the rest of the algorithm is O(2^k · kn), and this dominates the counting sort.

So the overall time complexity is O(n) if we treat k as a constant. If not, then this algorithm is still asymptotically better than the naive O(n^2 · k) algorithm* whenever k < log₂ n. For example, with n = 1,000,000 and k = 10, 2^k · kn comes to roughly 10^10 operations, versus roughly 10^13 for the naive algorithm.

*Note that the naive algorithm is O(n^2 · k) and not O(n^2), because each comparison between two rows takes O(k) time.



