I have a problem concerning efficiency and algorithms when it comes to finding the difference between two very large arrays. I'm hoping someone with a good understanding of algorithms can point me to the right direction in how to solve this as my current implementations is taking an extremely long amount of time.
Problem:
I have two very large arrays. One contains a list of emails that have invalid domain names and the other is a mixed list that I need to check against the first array.
accounts_with_failed_email_domains = [279,000 records in here]
unchecked_account_domains = [149,000 records in here]
What I need to do is go through the list of unchecked_account_domains
and then compare each entry to see if there is a match in the accounts_with_failed_email_domains
. I need to insert all matches between the lists in a separate array to be processed later.
How can I efficiently write something that can quickly check through these accounts. Here is what I have tried so far.
unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort
accounts_with_failed_email_domains = [another huge array].sort
unchecked_account_domains.keep_if do |email|
accounts_with_failed_email_domains.any? { |failed_email| email == failed_email }
end
# Count to see how many accounts are left
puts unchecked_account_domains.count
This above implementation has been running forever. Here is the second attempt which still proved not to be any better.
unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort
accounts_with_failed_email_domains = [another huge array].sort
unchecked_account_domains.each do |email|
accounts_with_failed_email_domains.bsearch do |failed_email|
final_check << email if email == failed_email
end
end
# Count to see how many accounts are left
puts final_check.count
bsearch
seemed to be promising, but I'm pretty sure I'm not using this correctly. Also, I tried looking into this question comparing large lists but this is in python
and I can't seem to find a Ruby equivalent for set
. Does anyone have any ideas on how to solve this?
Ruby | Array Difference (-) functionArray#-() is a Array class method which performs set difference operation by removing the similar elements of the two array. Parameter: Arrays for performing the concatenation operation. Return: New arrays by removing the same elements of the two arrays.
An array is a data structure that represents a list of values, called elements. Arrays let you store multiple values in a single variable. In Ruby, arrays can contain any data type, including numbers, strings, and other Ruby objects. This can condense and organize your code, making it more readable and maintainable.
It seems like you can use Array#-
:
result = unchecked_account_domains - accounts_with_failed_email_domains
I didn't have a new solution here, because the good answers were already taken. However, I wanted to see if there was a performance difference between the two code-based solutions.
This answer is a benchmark to highlight any performance differences in the use of Array#-
and two uses of Set#include?
. The first Set#include?
benchmark always does the set conversion, and the second converts once and keeps the set for subsequent searches.
Here's the code that runs each test 50 times:
require 'set'
require 'benchmark'
string = 'asdfghjkl'
Times = 50
a = 279_000.times.map {|n| "#{n}#{string}" }
b = 149_000.times.map {|n| "#{n*2}#{string}" }
puts RUBY_DESCRIPTION
puts "============================================================"
puts "Running tests for trimming strings"
Benchmark.bm(20) do |x|
x.report("Array#-:") { Times.times {|n| a - b } }
x.report("Set#include? #1:") do
Times.times do |n|
d = []
c = Set.new(b)
a.each {|email| d << email if c.include?(email) }
end
end
x.report("Set#include? #2:") do
c = Set.new(b)
Times.times do |n|
d = []
a.each {|email| d << email if c.include?(email) }
end
end
end
Here are the results:
ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin14]
============================================================
Running tests for trimming strings
user system total real
Array#-: 12.350000 0.250000 12.600000 ( 13.001546)
Set#include? #1: 16.090000 0.330000 16.420000 ( 17.196469)
Set#include? #2: 8.250000 0.100000 8.350000 ( 8.726609)
Clearly, if you just need a single differences comparison, use the Array#-
approach. However, if you need to do this type of thing multiple times, pre-converting the set makes a tremendous difference and performs better than Array#-
. The cost of converting the Array to a Set is fairly high (comparatively), but once you have a Set, it performs the difference comparison much more quickly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With