Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to intersect two sorted integer arrays without duplicates?

This is an interview question that I am using as a programming exercise.

Input: Two sorted integer arrays A and B in increasing order and of different sizes N and M, respectively

Output: A sorted integer array C in increasing order that contains elements that appear in both A and B

Contraints: No duplicates are allowed in C

Example: For input A = {3,6,8,9} and B = {4,5,6,9,10,11}, the output should be C = {6,9}

Thank you for your answers, all! To summarize, there are two main approaches to this problem:

My original solution was to keep two pointers, one for each array, and scanning the arrays from left to right interchangeably, while picking out elements that match. So when we the current element of one array is larger than the second array, we keep incrementing the pointer of the second array until we either find the current first array element or overpass it (find one larger). I keep all matched in a separate array, which is returned once we reach the end of either one of the input arrays.

Another way that we could do this is to scan one of the arrays linearly, while using binary search to find a match in the second array. This would mean O(N*log(M)) time, if we scan A and for each of its N elements binary search on B (O(log(M)) time).

I've implemented both approaches and ran an experiment to see how the two compare (details on this can be found here). The Binary Search method seems to win when M is roughly 70 times larger than N, when N has 1 million elements.

like image 228
Artur Galiullin Avatar asked Feb 10 '12 17:02

Artur Galiullin


1 Answers

This problem essentially reduces to a join operation and then a filter operation (to remove duplicates and only keep inner matches).

As the inputs are both already sorted, the join can be efficiently achieved through a merge join, with O(size(a) + size(b)).

The filter operation will be O(n) because the output of the join is sorted and to remove duplicates all you have to do is check if the each element is the same as the one before it. Filtering only the inner matches is trivial, you just discard any elements that were not matched (the outer joins).

There are opportunities for parallelism (both in the join and filter) to achieve better performance. For example the Apache Pig framework on Hadoop offers a parallel implementation of a merge join.

There are obvious trade-offs between performance and complexity (and thus maintainability). So I would say a good answer to a interview question really needs to take account of the performance demands.

  • Set based comparison - O(nlogn) - Relatively slow, very simple, use if there are no performance concerns. Simplicity wins.

  • Merge join + Filter - O(n) - Fast, prone to coding error, use if performance is an issue. Ideally try to leverage an existing library to do this, or perhaps even use a database if appropriate.

  • Parallel Implementation - O(n/p) - Very fast, requires other infrastructure in place, use if the volume is very large and anticipated to grow and this is a major performance bottleneck.

(Also note that the function in the question intersectSortedArrays is essentially a modified merge join, where the filter is done during the join. You can filter afterwards at no performance loss, although a slightly increased memory footprint).

Final thought.

In fact, I suspect most modern commercial RDBMSs offer thread parallelism in their implementation of joins, so what the Hadoop version offers is machine-level parallelism (distribution). From a design point of view, perhaps a good, simple solution to the question is to put the data on a database, index on A and B (effectively sorting the data) and use an SQL inner join.

like image 59
Tim Gee Avatar answered Oct 14 '22 03:10

Tim Gee