Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ruby - is efficient way to find the difference between two very large arrays?

I have a problem concerning efficiency and algorithms when it comes to finding the difference between two very large arrays. I'm hoping someone with a good understanding of algorithms can point me to the right direction in how to solve this as my current implementations is taking an extremely long amount of time.

Problem:

I have two very large arrays. One contains a list of emails that have invalid domain names and the other is a mixed list that I need to check against the first array.

accounts_with_failed_email_domains = [279,000 records in here]

unchecked_account_domains = [149,000 records in here]

What I need to do is go through the list of unchecked_account_domains and then compare each entry to see if there is a match in the accounts_with_failed_email_domains. I need to insert all matches between the lists in a separate array to be processed later.

How can I efficiently write something that can quickly check through these accounts. Here is what I have tried so far.

unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort

accounts_with_failed_email_domains = [another huge array].sort

unchecked_account_domains.keep_if do |email|
  accounts_with_failed_email_domains.any? { |failed_email| email == failed_email }
end

# Count to see how many accounts are left
puts unchecked_account_domains.count

This above implementation has been running forever. Here is the second attempt which still proved not to be any better.

unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort

accounts_with_failed_email_domains = [another huge array].sort

unchecked_account_domains.each do |email|
  accounts_with_failed_email_domains.bsearch do |failed_email| 
     final_check << email if email == failed_email 
  end
end

# Count to see how many accounts are left
puts final_check.count

bsearch seemed to be promising, but I'm pretty sure I'm not using this correctly. Also, I tried looking into this question comparing large lists but this is in python and I can't seem to find a Ruby equivalent for set. Does anyone have any ideas on how to solve this?

like image 920
Dan Rubio Avatar asked May 20 '16 19:05

Dan Rubio


People also ask

How do you find the difference between two arrays in Ruby?

Ruby | Array Difference (-) functionArray#-() is a Array class method which performs set difference operation by removing the similar elements of the two array. Parameter: Arrays for performing the concatenation operation. Return: New arrays by removing the same elements of the two arrays.

What is an array in Ruby?

An array is a data structure that represents a list of values, called elements. Arrays let you store multiple values in a single variable. In Ruby, arrays can contain any data type, including numbers, strings, and other Ruby objects. This can condense and organize your code, making it more readable and maintainable.


2 Answers

It seems like you can use Array#-:

result = unchecked_account_domains - accounts_with_failed_email_domains
like image 76
Ilya Avatar answered Oct 21 '22 15:10

Ilya


I didn't have a new solution here, because the good answers were already taken. However, I wanted to see if there was a performance difference between the two code-based solutions.

This answer is a benchmark to highlight any performance differences in the use of Array#- and two uses of Set#include?. The first Set#include? benchmark always does the set conversion, and the second converts once and keeps the set for subsequent searches.

Here's the code that runs each test 50 times:

require 'set'
require 'benchmark'

string = 'asdfghjkl'
Times = 50

a = 279_000.times.map {|n| "#{n}#{string}" }
b = 149_000.times.map {|n| "#{n*2}#{string}" }

puts RUBY_DESCRIPTION
puts "============================================================"
puts "Running tests for trimming strings"

Benchmark.bm(20) do |x|
  x.report("Array#-:")      { Times.times {|n| a - b } }
  x.report("Set#include? #1:") do
    Times.times do |n|
      d = []
      c = Set.new(b)
      a.each {|email| d << email if c.include?(email) }
    end
  end
  x.report("Set#include? #2:") do
    c = Set.new(b)
    Times.times do |n|
      d = []
      a.each {|email| d << email if c.include?(email) }
    end
  end
end

Here are the results:

ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin14]
============================================================
Running tests for trimming strings
                           user     system      total        real
Array#-:              12.350000   0.250000  12.600000 ( 13.001546)
Set#include? #1:      16.090000   0.330000  16.420000 ( 17.196469)
Set#include? #2:       8.250000   0.100000   8.350000 (  8.726609)

Clearly, if you just need a single differences comparison, use the Array#- approach. However, if you need to do this type of thing multiple times, pre-converting the set makes a tremendous difference and performs better than Array#-. The cost of converting the Array to a Set is fairly high (comparatively), but once you have a Set, it performs the difference comparison much more quickly.

like image 41
Michael Gaskill Avatar answered Oct 21 '22 13:10

Michael Gaskill