Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to determine Big O comparing two arrays in Ruby

My algorithm skills are lackluster. I created a method to see if two arrays contain the same elements (duplicates don't matter):

one = [1, "taco", 3, 2, :piece, 4, 5, 5, 5, 5]
two = [:piece, 2, 5, 4, 1, "taco", 3]

def same_elements?(array_one, array_two)
  return true if ( (array_one - array_two).empty? && (array_two - array_one).empty? )
  return false
end

same_elements?(one, two)

This returns true (which is correct). The problem is, I am not sure what the efficiency of this algorithm is. My first guess is O(n^2) since we have to check both a-b and b-a. I know that O(n^2) is pretty terrible. Is there a more efficient way to do this?

like image 612
user3007294 Avatar asked Dec 24 '22 22:12

user3007294


2 Answers

Short answer

O(n+m) on average

The worst case is O(nm), but it only happens if you really want to make it happen (see last paragraph).

If array_one is chosen to be bigger than array_two, O(m+n) is just O(n), so this algorithm runs in linear time on average.

Alternative

Another shorter way to check would be :

one = [1, "taco", 3, 2, :piece, 4, 5, 5, 5, 5]
two = [:piece, 2, 5, 4, 1, "taco", 3]

puts Set[*one] == Set[*two] #=> true
# or
puts one.to_set == two.to_set #=> true

Small refactoring

return true if x
return false

is equivalent to just

x

So your code can be written :

def same_elements?(array_one, array_two)
  (array_one - array_two).empty? && (array_two - array_one).empty?
end

Benchmark

I created one array with 1E6 elements, half of which are random numbers between 0 and 199999 (for collisions), the other half being plain Ruby objects.

The other array is just the first one, shuffled randomly.

N = 1_000_000

one = (1..N).map{rand < 0.5 ? rand(N/5) : Object.new}
two = one.sort_by{rand}

It takes 1 min to compare the sets, and fruity reports that set comparison is about 20% faster than the OP's method.

For smaller arrays of integers, the OP's method was a bit faster.

Note: code proposed by @engineersmnky in comments was reported to have similar speed than the other methods.

Time complexity

Your code is surely not O(nm) when used with usual Arrays.

Approximate times are :

  • 1s for 1E4
  • 8s for 1E5
  • 160s for 1E6

Looking at rb_ary_diff in array.c, it's no wonder all the methods described above run in the same time : they basically work the same.

rb_ary_diff creates a hash table for array_two ( in O(m)), and iterates over each element of array_one (in O(n)), looking for the value in the hash table (O(1) on average). The whole operation would be O(n+m) on average.

This blog post analyses set intersection, which is implemented in a very similar way.

Doing it twice doesn't change anything, so the overall time complexity stays O(n+m).

Looking for O(mn)

One way to make this algorithm O(mn) is to completely disable the hash method. There's no reason to do so, except to prove that it's a very bad idea.

With 10_000 KeyObjects :

class KeyObject < Object
end

the set comparison takes less than 1 s.

With 10_000 KeyObjects :

class KeyObject < Object
  def hash
    1
  end
end

the set comparison takes more than 14 mins!

The probability that 2 distinct, random Ruby objects have the same hash is about 1E-20. Strictly speaking, the worst case for this algorithm is O(mn), but it will never happen if you don't go looking for it. Finding a collision with 2 elements isn't trivial, finding a collision with 1E6 elements is not going to happen by chance.

like image 98
Eric Duminil Avatar answered Jan 14 '23 15:01

Eric Duminil


Let the size of first and second array be m and n respectively. Looking at rb_ary_diff's source code (see Joel Cornett's comment above), there's a for loop which runs O(m) times. Inside the loop there's a call that seems to search the hash. This operation in general takes O(n) time. Thus, assuming that all other calls are asymptotically faster than O(mn), then the overall difference function complexity is O(mn). Calling this function twice followed by an emptyness check results in your algorithm being O(mn).

On average hash search is constant, i.e., O(1), which means that in this case your algorithm performs in O(n). Still, hash search worst case complexity is O(n) meaning your algorithm is O(mn). It's a good exercise to find an example which demonstrates this.

like image 29
PetrosB Avatar answered Jan 14 '23 17:01

PetrosB