This is my official first question here; I welcome any/all criticism of my post so that I can learn how to be a better SO citizen.
I am vetting non-relational DBMS for storing potentially large email opt-out lists, leaning toward either MongoDB or RethinkDB, using their respective Python client libraries. The pain point of my application is bulk insert performance, so I have set up two Python scripts to insert 20,000 records in batches of 5,000 into both a MongoDB and a RethinkDB collection.
The MongoDB Python script, mongo_insert_test.py:
```python
from pymongo import MongoClient

NUM_LINES = 20000
BATCH_SIZE = 5000

# Connection setup was not shown in the original; database name assumed.
mongo = MongoClient().test

def insert_records():
    collection = mongo.recips
    i = 0
    batch_counter = 0
    batch = []
    while i < NUM_LINES:
        i += 1
        recip = {
            'address': "test%d@test%d.com" % (i, i)
        }
        batch.append(recip)
        batch_counter += 1
        # flush the batch when it is full, or on the final record
        if batch_counter == BATCH_SIZE or i == NUM_LINES:
            collection.insert(batch)
            batch_counter = 0
            batch = []

if __name__ == '__main__':
    insert_records()
```
The almost identical RethinkDB Python script, rethink_insert_test.py:
```python
import rethinkdb as r

NUM_LINES = 20000
BATCH_SIZE = 5000

# Connection setup was not shown in the original; default host/port assumed.
r.connect('localhost', 28015).repl()

def insert_records():
    i = 0
    batch_counter = 0
    batch = []
    while i < NUM_LINES:
        i += 1
        recip = {
            'address': "test%d@test%d.com" % (i, i)
        }
        batch.append(recip)
        batch_counter += 1
        # flush the batch when it is full, or on the final record
        if batch_counter == BATCH_SIZE or i == NUM_LINES:
            r.table('recip').insert(batch).run()
            batch_counter = 0
            batch = []

if __name__ == '__main__':
    insert_records()
```
In my dev environment, the MongoDB script inserts 20,000 records in under a second:
```
$ time python mongo_insert_test.py

real    0m0.618s
user    0m0.400s
sys     0m0.032s
```
In the same environment, the RethinkDB script performs far more slowly, taking over 2 minutes to insert 20,000 records:
```
$ time python rethink_insert_test.py

real    2m2.502s
user    0m3.000s
sys     0m0.052s
```
Am I missing something huge here with regard to how these two DBMS work? Why is RethinkDB performing so badly with this test?
My dev machine had about 1.2GB available memory for these tests.
RethinkDB currently implements batch inserts by doing a single insert at a time on the server. Since Rethink flushes every record to disk (because it's designed with safety first in mind), this has a really bad effect on workloads like this one.
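The cost of flushing every record to disk can be illustrated with a small stdlib-only sketch (this is not RethinkDB code, just a plain-file analogy: it times writing the same records with an fsync after every record versus a single fsync per batch):

```python
import os
import tempfile
import time

def write_records(n, sync_every):
    """Write n small records, fsync'ing every `sync_every` records; return seconds."""
    fd, path = tempfile.mkstemp()
    start = time.time()
    with os.fdopen(fd, 'wb') as f:
        for i in range(n):
            f.write(b'{"address": "test@test.com"}\n')
            if (i + 1) % sync_every == 0:
                f.flush()
                os.fsync(f.fileno())  # force the record(s) to disk
    elapsed = time.time() - start
    os.remove(path)
    return elapsed

per_record = write_records(1000, 1)     # flush after every record
per_batch = write_records(1000, 1000)   # one flush for the whole batch
print("per-record: %.3fs, per-batch: %.3fs" % (per_record, per_batch))
```

On a typical spinning or SSD-backed filesystem the per-record variant is dramatically slower, which is the same effect the batch-insert workload above runs into.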
We're doing two things to address this:

1. Implementing batch inserts as a true single operation on the server, instead of one insert at a time.
2. Giving you the option to relax durability constraints, so the server doesn't have to flush every record to disk.
This will absolutely be fixed in 4-12 weeks (and if you need this ASAP, feel free to shoot me an email to [email protected] and I'll see if we can reprioritize).
Here are the relevant github issues:
https://github.com/rethinkdb/rethinkdb/issues/207
https://github.com/rethinkdb/rethinkdb/issues/314
Hope this helps. Please don't hesitate to ping us if you need help.
Leaving aside what coffeemug posted:
Depending on what driver version you are using and how you configure the connection to MongoDB, those inserts might not even be acknowledged by the server. If you are using the latest version of the Python driver, those operations wait only for a receipt acknowledgement from the server (which doesn't mean the data has even been written to memory). For more details on what I'm referring to, check out the MongoDB write concern setting.
You could get a speedup in RethinkDB's case by parallelizing the inserts: if you run multiple processes/threads, you'll see the throughput go up. In MongoDB's case, due to the locks involved, parallelism will not help.
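A sketch of that parallelization, using a thread pool to send batches concurrently (insert_batch here is a hypothetical stand-in for the real r.table('recip').insert(batch).run() call, which each worker would issue over its own connection):

```python
from multiprocessing.pool import ThreadPool

NUM_LINES = 20000
BATCH_SIZE = 5000

def insert_batch(batch):
    # Stand-in for r.table('recip').insert(batch).run() on a per-worker
    # connection; here we just report how many records the batch holds.
    return len(batch)

def make_batches():
    recips = [{'address': "test%d@test%d.com" % (i, i)}
              for i in range(1, NUM_LINES + 1)]
    # split the records into BATCH_SIZE-sized chunks
    return [recips[i:i + BATCH_SIZE] for i in range(0, NUM_LINES, BATCH_SIZE)]

pool = ThreadPool(4)                    # one worker per batch, in this case
total = sum(pool.map(insert_batch, make_batches()))
pool.close()
print(total)  # 20000
```

With real inserts, the workers overlap their network round-trips and server-side flushes instead of serializing them, which is where the speedup comes from.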
That being said, RethinkDB could improve the speed of writes.
PS: I work for RethinkDB, but the above points are based on my unbiased knowledge of both systems.