
Find duplicates in app engine datastore

I have some duplicated entities in my App Engine datastore (not the whole row, but most of its fields).

What's the best way to find them?

Both integer and string fields are duplicated (in case comparing one type is faster than the other).

Thanks!

ana asked Jan 25 '11 21:01


2 Answers

A stupid but quick approach would be to take the fields you care about, concatenate them into one long string, and use that string as the key name of a DB_Unique entity that references the original entity. Each time you call DB_Unique.get_or_insert(), verify that the returned reference points to the entity you are currently processing; if it does not, you have a duplicate. This should probably be run as a MapReduce job.

Something like:

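from google.appengine.ext import db

class DB_Unique(db.Model):
  # reference back to the first entity seen with this combination of fields
  r = db.ReferenceProperty()

class DB_Obj(db.Model):
  a = db.IntegerProperty()
  b = db.StringProperty()
  c = db.StringProperty()

# executed for each DB_Obj...
def mapreduce(entity):
  # key name built from the fields that define a duplicate
  key = '%s_%s_%s' % (entity.a, entity.b, entity.c)
  # returns the existing DB_Unique for this key, or creates one pointing at entity
  res = DB_Unique.get_or_insert(key, r=entity)
  if DB_Unique.r.get_value_for_datastore(res) != entity.key():
    # we have a possible collision, verify and delete?
    # our two entities are res.r and entity
    pass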

There are a couple of edge cases that might crop up. For example, if two entities have the same a and have b and c equal to ('a_b', '') and ('a', 'b_') respectively, the b and c part of the concatenation is 'a_b_' for both. So either use a separator character you know does not occur in your strings instead of '_', or make DB_Unique.r a list of references and compare all of them.
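Another way around the separator problem, as a rough sketch: serialize the field values as a JSON list instead of joining them with '_'. This assumes the fields are plain integers and strings. The helper names naive_key and make_unique_key are made up for illustration and are not part of App Engine.

```python
import json

def naive_key(a, b, c):
    # Same scheme as the snippet above: fields joined with '_'.
    return '%s_%s_%s' % (a, b, c)

def make_unique_key(a, b, c):
    # JSON quotes and escapes each string, so the boundaries between
    # fields stay unambiguous whatever characters the strings contain.
    return json.dumps([a, b, c])

# The two problem entities collide under the naive scheme...
print(naive_key(1, 'a_b', '') == naive_key(1, 'a', 'b_'))              # True
# ...but get distinct keys once the fields are serialized as a list.
print(make_unique_key(1, 'a_b', '') == make_unique_key(1, 'a', 'b_'))  # False
```

If the strings are long, hashing the serialized value (for example with hashlib.sha1) keeps the key name short, at the cost of a very small chance of hash collisions, so the verification step is still worth keeping.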

Amir answered Oct 07 '22 01:10


If this is a one-time or rare occurrence, you might want to try dumping the whole datastore to your local machine (see uploading and downloading data), loading the data into a sqlite3 database, and finding the duplicate keys there.
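Once the data is in SQLite, a GROUP BY over the fields that define a duplicate does most of the work. Here is a minimal sketch, assuming a hypothetical table named objs with an id column plus fields a, b and c. The table layout and the sample rows are invented for illustration; a real dump will look different.

```python
import sqlite3

def find_duplicates(conn):
    # Group rows by the fields that define a duplicate and keep only
    # the groups that contain more than one row.
    query = """
        SELECT a, b, c, COUNT(*) AS n, GROUP_CONCAT(id) AS ids
        FROM objs
        GROUP BY a, b, c
        HAVING COUNT(*) > 1
        ORDER BY a, b, c
    """
    return conn.execute(query).fetchall()

# In-memory database standing in for the downloaded data.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE objs (id INTEGER PRIMARY KEY, a INTEGER, b TEXT, c TEXT)')
conn.executemany(
    'INSERT INTO objs (id, a, b, c) VALUES (?, ?, ?, ?)',
    [(1, 1, 'x', 'y'), (2, 1, 'x', 'y'), (3, 2, 'x', 'y'),
     (4, 1, 'a_b', ''), (5, 1, 'a', 'b_')],
)

for a, b, c, n, ids in find_duplicates(conn):
    print(a, b, c, n, ids)
```

Because the comparison is done per column, values such as ('a_b', '') and ('a', 'b_') are never confused with each other here.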

Trying to do this programmatically on the GAE side might turn out to be quite tedious. It is entirely doable with tasks, but not particularly easy.

Andris answered Oct 07 '22 00:10