Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Gotchas while bulk loading CouchDB

I've got ~15k rows in MSSQL 2005 that I want to migrate into CouchDB, where one row is one document. I have a CLR-UDF that writes n rows to an schema-bound XML file. I have an XSL transform that converts the schema-bound XML to JSON.

With these existing tools I'm thinking I can go MSSQL to XML to JSON. If I batch n rows per JSON file, I can script cURL to loop through the files and POST them to CouchDB using the bulk API _bulk_docs.

Will this work? Has anybody done a migration like this before? Can you recommend a better way?

like image 614
FreddyBowen Avatar asked Jan 14 '09 19:01

FreddyBowen


1 Answers

So far I did some conversions from legacy SQL databases to CouchDB. I always had a somewhat different approach.

  1. I used the primary key of the SQL-DB as the Document-Id. This allowed me to import over and over again without fear of duplicate documents.
  2. I did row-by-row imports instead of a bulk import. It makes debugging easier. I saw between 5-10 inserts per second over an Internet connection. While this is not lightning fast it was fast enough for me. My biggest Database is 600.000 Documents totaling 20GB. row-by-row bloat the database during import so run compaction occasionally. Then again unless your rows are huge 15.000 rows sounds not much.

My importing code usually looks like this:

def main():
 options = parse_commandline()
 server = couchdb.client.Server(options.couch) 
 db = server[options.db] 
 for kdnnr in get_kundennumemrs():
    data = vars(get_kunde(kdnnr)) 
    doc = {'name1': data.get('name1', ''),
           'strasse': data.get('strasse', ''),
           'plz': data.get('plz', ''), 'ort': data.get('ort', ''),
           'tel': data.get('tel', ''), 'kundennr': data.get('kundennr', '')}

    # update existing doc or insert a new one
    newdoc = db.get(kdnnr, {})
    newdoc.update(doc)
    if newdoc != db.get(kdnnr, {}):
        db[kdnnr] = newdoc 
like image 55
max Avatar answered Oct 20 '22 03:10

max