
Solution for finding "duplicate" records involving STI and parent-child relationship

I have an STI-based model called Buyable, with two models Basket and Item. The attributes of concern here for Buyable are:

  • shop_week_id
  • location_id
  • parent_id

There's a parent-child relationship between Basket and Item. parent_id is always nil for basket, but an item can belong to a basket by referencing the unique basket id. So basket has_many items, and an item belongs_to a basket.

I need a method on the basket model that:

Returns true or false depending on whether there are any other baskets in the table with both the same number and the same types of items. Items are considered to be the same type when they share the same shop_week_id and location_id.

For example:

Given a basket (id = 7) with 2 items:

item #1

  • id = 3
  • shop_week_id = 13
  • location_id = 103
  • parent_id = 7

item #2

  • id = 4
  • shop_week_id = 13
  • location_id = 204
  • parent_id = 7

Return true if there are any other baskets in the table that contain exactly 2 items, with one item having a shop_week_id = 13 and location_id = 103 and the other having a shop_week_id = 13 and location_id = 204. Otherwise return false.

How would you approach this problem? This goes without saying, but I am looking for a very efficient solution.

keruilin asked Sep 25 '11 00:09


2 Answers

The following SQL seems to do the trick:

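# %1$d is filled in with the basket id (an integer).
# The second HAVING condition rejects baskets that contain extra items;
# without it, a basket holding a superset of the items would also match.
# Assumes a basket never holds two items of the same type, since repeated
# types would inflate COUNT(*) in the join.
BIG_QUERY = "
  SELECT 1
  FROM buyables b1
    JOIN buyables b2
      ON b1.shop_week_id = b2.shop_week_id
      AND b1.location_id = b2.location_id
  WHERE
    b1.parent_id != %1$d
    AND b2.parent_id = %1$d
    AND b1.type = 'Item'
    AND b2.type = 'Item'
  GROUP BY b1.parent_id
  HAVING COUNT(*) = ( SELECT COUNT(*) FROM buyables WHERE parent_id = %1$d AND type = 'Item' )
    AND COUNT(*) = ( SELECT COUNT(*) FROM buyables WHERE parent_id = b1.parent_id AND type = 'Item' )
  LIMIT 1
"

With ActiveRecord, you can get this result using select_value:

class Basket < Buyable
  def has_duplicate?
    # select_value returns nil when the query yields no rows. Testing for nil
    # avoids relying on how the adapter represents a boolean (0 and 'f' are
    # both truthy in Ruby, so !! on a SELECT EXISTS result can be wrong).
    !self.class.connection.select_value(BIG_QUERY % id).nil?
  end
end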

I am not so sure about performance, however.
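For checking the query against small data sets, the intended comparison can also be written in plain Ruby. This is only a sketch: the method name duplicate_basket? and the hash-based row layout (:type, :parent_id, :shop_week_id, :location_id) are assumptions made for illustration, not ActiveRecord API, and it loads every row into memory, so it is not a replacement for the SQL.

```ruby
# rows: array of hashes standing in for rows of the buyables table
# (assumed layout, for illustration only).
# Returns true if another basket holds exactly the same item types as basket_id.
def duplicate_basket?(rows, basket_id)
  # Collect each basket's items as [shop_week_id, location_id] pairs.
  contents = Hash.new { |hash, key| hash[key] = [] }
  rows.each do |row|
    next unless row[:type] == "Item" && row[:parent_id]
    contents[row[:parent_id]] << [row[:shop_week_id], row[:location_id]]
  end
  # Sort each list so that item order does not affect the comparison.
  contents.each_value(&:sort!)

  target = contents.fetch(basket_id, [])
  return false if target.empty? # an empty basket never counts as a duplicate

  contents.any? { |id, items| id != basket_id && items == target }
end

rows = [
  { type: "Item", parent_id: 7, shop_week_id: 13, location_id: 103 },
  { type: "Item", parent_id: 7, shop_week_id: 13, location_id: 204 },
  { type: "Item", parent_id: 8, shop_week_id: 13, location_id: 204 },
  { type: "Item", parent_id: 8, shop_week_id: 13, location_id: 103 }
]
puts duplicate_basket?(rows, 7) # true
```

Because whole item lists are compared, a basket with extra items or with repeated item types is not reported as a duplicate.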

axelarge answered Nov 20 '22 10:11


If you want to make this as efficient as possible, you should consider creating a hash that encodes the basket contents as a single string or blob, adding a new column containing the hash (which will need to be updated every time the basket contents change, either by the app or using a trigger), and comparing hash values to determine possible equality. Then you might need to perform further comparisons (as described above) in order to rule out hash collisions.

What should you use for a hash, though? If you know that the baskets will be limited in size, and the ids in question are bounded integers, you should be able to hash to a string that is enough in itself to test for equality. For example, you could base64 encode each item's shop_week_id and location_id, concatenate them with a separator not in base64 (like "|"), and then concatenate the result with those of the other basket items. The items need to be put in a fixed order first (sorted, for instance), so that two baskets with the same contents always produce the same string. Build an index on the new hash column, and comparisons will be fast.
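A minimal sketch of such a string in plain Ruby, assuming each item is available as a [shop_week_id, location_id] pair. The method name basket_signature and the "," separator between the two ids are illustrative choices, and the integer ids are written out in decimal rather than base64 to keep the output readable. Sorting the pairs first makes the result independent of item order.

```ruby
# Build a canonical string for a basket's contents.
# items: array of [shop_week_id, location_id] integer pairs (assumed shape).
def basket_signature(items)
  items
    .sort # fixed order, so the order of items in the basket is irrelevant
    .map { |shop_week_id, location_id| "#{shop_week_id},#{location_id}" }
    .join("|")
end

basket_7  = [[13, 103], [13, 204]]
basket_9  = [[13, 204], [13, 103]] # same contents, different order
basket_11 = [[13, 103]]            # only one of the two items

puts basket_signature(basket_7)                                 # 13,103|13,204
puts basket_signature(basket_7) == basket_signature(basket_9)   # true
puts basket_signature(basket_7) == basket_signature(basket_11)  # false
```

Since this string encodes the full contents, two baskets with equal strings have equal contents, and no follow-up comparison is needed. If the string is shortened with a digest to fit a fixed-width column, matching rows are only candidates and still need the full check.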

Mike Sokolov answered Nov 20 '22 10:11