
Find duplicates in array of hashes on specific keys

I have an array of hashes (CSV rows, actually), and I need to find and keep all the rows that share the same values for two specific keys (user, section). Here is a sample of the data:

[
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "exec", section: 456 },
  { user: 3, role: "staff", section: 789 }
]

What I need to return is an array containing only the rows where the same user/section combo appears more than once, like so:

[
  { user: 1, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 2, role: "exec", section: 456 }
]

The double loop solution I'm trying looks like this:

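duplicates = []

# Compares every row against every other row (quadratic in the row count).
# a[2] and a[6] are presumably the user and section columns of the raw CSV row.
enrollments.each_with_index do |a, ai|
  enrollments.each_with_index do |b, bi|
    next if ai == bi

    duplicates << b if a[2] == b[2] && a[6] == b[6]
  end
end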

But since the CSV has 145K rows, it's taking forever.

How can I more efficiently get the output I need?

lyonsinbeta asked Oct 22 '14 17:10


1 Answer

In terms of efficiency, you might want to try grouping the rows once instead of comparing every pair:

grouped = csv_arr.group_by { |row| [row[:user], row[:section]] }
filtered = grouped.values.select { |a| a.size > 1 }.flatten

The first statement groups the records by the :user and :section keys. The result is:

{[1, 123]=>[{:user=>1, :role=>"staff", :section=>123}, {:user=>1, :role=>"exec", :section=>123}],
 [2, 456]=>[{:user=>2, :role=>"staff", :section=>456}, {:user=>2, :role=>"exec", :section=>456}],
 [3, 123]=>[{:user=>3, :role=>"staff", :section=>123}],
 [3, 789]=>[{:user=>3, :role=>"staff", :section=>789}]}

The second statement keeps only the groups with more than one member, then flattens them into a single array to give you:

[{:user=>1, :role=>"staff", :section=>123},
 {:user=>1, :role=>"exec", :section=>123},
 {:user=>2, :role=>"staff", :section=>456},
 {:user=>2, :role=>"exec", :section=>456}]
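Since the question indexes rows by position (a[2], a[6]) while the code above uses symbol keys, here is a minimal end-to-end sketch of one way to get from CSV text to the filtered result. The inline CSV text and its header names (user, role, section) are assumptions made up to match the sample data, not the real file's layout.

```ruby
require "csv"

# Hypothetical CSV text standing in for the real 145K-row file.
# The header names are assumed from the sample hashes in the question.
data = <<~CSV
  user,role,section
  1,staff,123
  2,staff,456
  3,staff,123
  1,exec,123
  2,exec,456
  3,staff,789
CSV

# Parse each row into a hash with symbol keys and numeric values where possible.
csv_arr = CSV.parse(
  data,
  headers: true,
  header_converters: :symbol,
  converters: :numeric
).map(&:to_h)

# Group rows by the [user, section] pair, then keep groups with more than one row.
duplicates = csv_arr
  .group_by { |row| row.values_at(:user, :section) }
  .values
  .select { |rows| rows.size > 1 }
  .flatten

duplicates.each { |row| p row }
```

For a file on disk, CSV.read accepts the same options, so the parsing step should carry over with a path in place of the inline text.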

This should improve the speed of your operation, because it makes a single pass over the rows instead of comparing every pair. Memory-wise, I can't say what the effect would be with a large input, because it depends on your machine, its resources, and the size of the file.
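If memory turns out to be a concern, one possible variation (a swapped-in technique, not the code above) is to count each user/section pair first and then filter, so the intermediate hash holds one integer per pair rather than an array of rows. This is only a sketch: the rows themselves are still held in memory, so any saving is likely to be modest. Note that it returns the rows in their original order rather than grouped by pair.

```ruby
# Sample rows from the question.
rows = [
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec", section: 123 },
  { user: 2, role: "exec", section: 456 },
  { user: 3, role: "staff", section: 789 }
]

# Pass 1: count how often each [user, section] pair occurs.
counts = Hash.new(0)
rows.each { |row| counts[row.values_at(:user, :section)] += 1 }

# Pass 2: keep the rows whose pair occurred more than once.
duplicates = rows.select { |row| counts[row.values_at(:user, :section)] > 1 }

duplicates.each { |row| p row }
```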

Alireza answered Nov 15 '22 08:11