
How does DataLoader cache and batch database requests?

Looking at the DataLoader library, how is it caching and batching requests?

The instructions specify usage in the following way:

var DataLoader = require('dataloader')

var userLoader = new DataLoader(keys => myBatchGetUsers(keys));

userLoader.load(1)
    .then(user => userLoader.load(user.invitedByID))
    .then(invitedBy => console.log(`User 1 was invited by ${invitedBy}`));

// Elsewhere in your application
userLoader.load(2)
    .then(user => userLoader.load(user.lastInvitedID))
    .then(lastInvited => console.log(`User 2 last invited ${lastInvited}`));

But I am unclear on how the load function works and what the myBatchGetUsers function might look like. Could you provide an example, if possible?

terik_poe asked Feb 06 '17 17:02



2 Answers

The DataLoader utility from Facebook works by combining input requests and passing them to a batch function that you have to provide. It only works for requests that are keyed by an identifier.

There are three phases:

  1. Aggregation phase: any request made on the loader object is delayed until process.nextTick.
  2. Batch phase: the loader calls the myBatchGetUsers function you provided, passing all of the requested keys together.
  3. Split phase: the result is then 'split' so that each input request gets its own part of the response.

That's why your example should result in only two requests, provided both load calls are made in the same tick:

  • One for users 1 and 2
  • Then one for the related users (invitedByID and lastInvitedID)
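
The three phases can be sketched with a small stand-in loader. This is only an illustration under assumptions: TinyLoader, the users object and the batches array are made-up names, and the real library's scheduling is more involved than a bare process.nextTick.

```javascript
// Minimal stand-in for DataLoader (hypothetical, for illustration only).
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.queue = []; // pending { key, resolve, reject } entries
  }

  load(key) {
    return new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // 1. Aggregation: the first load() of a batch schedules a single dispatch.
      if (this.queue.length === 1) {
        process.nextTick(() => this.dispatch());
      }
    });
  }

  dispatch() {
    const batch = this.queue;
    this.queue = [];
    // 2. Batch: one call with every key collected so far.
    this.batchFn(batch.map(entry => entry.key))
      // 3. Split: hand each caller the value found at its own index.
      .then(values => batch.forEach((entry, i) => entry.resolve(values[i])))
      .catch(err => batch.forEach(entry => entry.reject(err)));
  }
}

// Made-up in-memory data standing in for a database table.
const users = {
  1: { id: 1, invitedByID: 3 },
  2: { id: 2, lastInvitedID: 4 },
  3: { id: 3 },
  4: { id: 4 },
};

const batches = []; // records the keys passed to every batch call
const userLoader = new TinyLoader(keys => {
  batches.push(keys);
  return Promise.resolve(keys.map(key => users[key]));
});

const done = Promise.all([
  userLoader.load(1).then(user => userLoader.load(user.invitedByID)),
  userLoader.load(2).then(user => userLoader.load(user.lastInvitedID)),
]).then(() => console.log(batches)); // two batches: keys 1 and 2, then keys 3 and 4
```

Four load calls are made, but the batch function only runs twice, which is the behaviour described in the list above.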

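To implement this with MongoDB, for example, define the myBatchGetUsers function to use the find method with $in. Note that DataLoader expects the batch function to return one value per key, in the same order as the keys, so the documents have to be mapped back to the keys:

function myBatchGetUsers(keys) {
    // usersCollection is a MongoDB collection whose find() cursor
    // has a promise-returning toArray()
    return usersCollection
        .find({ _id: { $in: keys } })
        .toArray()
        .then(users => {
            // $in does not preserve the order of keys and skips missing ids,
            // so map every key back to its document (or null)
            const usersById = new Map(users.map(user => [String(user._id), user]))
            return keys.map(key => usersById.get(String(key)) || null)
        })
}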
Sébastien answered Oct 04 '22 09:10



I found it helpful to recreate the part of dataloader that I use (in my case, only the .load() function) to see one possible way that it could be implemented.

So, creating a new instance of the DataLoader constructor gives you 2 things:

  1. A list of identifiers (empty to begin with)
  2. A function that uses this list of identifiers to query the database (you provide this).

The constructor could look something like this:

function DataLoader (_batchLoadingFn) {
  this._keys = []
  this._batchLoadingFn = _batchLoadingFn
}

And instances of the DataLoader constructor have access to a .load() function, which needs to be able to access the _keys property. So it's defined on the DataLoader.prototype object:

DataLoader.prototype.load = function(key) {
   // this._keys references the array defined in the constructor function
}

When creating a new object via the DataLoader constructor (new DataLoader(fn)), the fn you pass needs to fetch data from somewhere. It takes an array of keys as its argument and returns a promise that resolves to an array of values, where each value corresponds to the key at the same index.

For example, here is a dummy function that takes an array of keys, and passes the same array back but with the values doubled:

const batchLoadingFn = keys => new Promise( resolve => resolve(keys.map(k => k * 2)) )
// keys: [1, 2, 3]
// vals: [2, 4, 6]

keys[0] corresponds to vals[0]
keys[1] corresponds to vals[1]
keys[2] corresponds to vals[2]

Then every time you call the .load(identifier) function, you add a key to the _keys array, and at some point the batchLoadingFn is called and is passed the _keys array as an argument.

The trick is... How to call .load(id) many times but with the batchLoadingFn only executed once? This is cool, and the reason I explored how this library works.

I found that it's possible to do this by specifying that batchLoadingFn is executed after a timeout, but that if .load() is called again before the timeout interval, then the timeout is canceled, a new key is added and a call to batchLoadingFn is rescheduled. Achieving this in code looks like so:

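DataLoader.prototype.load = function(key) {
  // Cancel the call scheduled by any previous .load() ...
  clearTimeout(this._timer)
  // ... and schedule a fresh one at the back of the event loop
  this._timer = setTimeout(() => this._batchLoadingFn(), 0)
}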

Essentially, calling .load() cancels any pending call to batchLoadingFn and then schedules a new call at the back of the event loop. This guarantees that if .load() is called many times over a short space of time, batchLoadingFn is only called once. I think this technique is called debouncing: it is the same approach used on websites when you want to react to a mousemove event but receive far more events than you want to deal with.
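
The cancel-and-reschedule idea can be tried on its own, outside of any loader. A minimal sketch, where debounce, flush and runs are hypothetical names chosen for this example:

```javascript
// Hypothetical helper: runs fn once, after calls stop arriving for `wait` ms.
function debounce(fn, wait) {
  let timer;
  return function (...args) {
    clearTimeout(timer); // drop the previously scheduled call, if any
    timer = setTimeout(() => fn.apply(this, args), wait);
  };
}

let runs = 0;
const flush = debounce(() => {
  runs += 1;
  console.log(`flush ran ${runs} time(s)`);
}, 0);

// Three calls in the same tick collapse into a single execution.
flush();
flush();
flush();
```

Only the last scheduled timer survives, so the wrapped function runs once however many times flush is called before the timer fires.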

But calling .load(key) also needs to push a key to the _keys array, which we can do in the body of the .load function (just this._keys.push(key)). However, the contract of the .load function is that it returns a single value: whatever the key argument resolves to. At some point the batchLoadingFn will be called and get a result (it has to return results that correspond to the _keys). Furthermore, batchLoadingFn is required to return a promise of those values rather than the values themselves.

This next bit I thought was particularly clever (and well worth the effort of looking at the source code)!

Instead of keeping a plain list of keys in _keys, the dataloader library keeps each key together with a reference to a resolve function. Calling that function resolves the promise returned by .load() with a value: .load() returns a promise, and a promise is resolved when its resolve function is invoked.

So the _keys array ACTUALLY keeps a list of {key, resolve} pairs. When the promise returned by your batchLoadingFn resolves, each resolve function is invoked with a value (which hopefully corresponds to the item in the _keys array by index).
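
Storing a resolve function so that other code can settle the promise later can be shown in isolation. A small sketch, assuming made-up names (pending, request, results) and an upper-casing step that stands in for the batch result:

```javascript
const pending = []; // { key, resolve } entries

function request(key) {
  // The executor runs synchronously, so resolve is stored before we return.
  return new Promise(resolve => pending.push({ key, resolve }));
}

const results = [];
const first = request('a').then(value => results.push(value));
const second = request('b').then(value => results.push(value));

// Later, some other code settles each promise by calling its stored resolve.
pending.forEach(({ key, resolve }) => resolve(key.toUpperCase()));

const settled = Promise.all([first, second]).then(() => console.log(results));
```

Each caller only ever sees its own promise, while the code holding the pending list decides when and with what value it resolves.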

So the .load function looks like this (in terms of pushing a {key, resolve} pair to the _keys array):

DataLoader.prototype.load = function(key) {
  const promisedValue = new Promise ( resolve => this._keys.push({key, resolve}) )
  ...
  return promisedValue
}

And all that's left is to execute the batchLoadingFn with the keys from _keys as an argument, and to invoke the correct resolve function for each value when its promise resolves:

const batch = this._keys
this._keys = [] // Reset now, so .load() calls made while we wait start the next batch
this._batchLoadingFn(batch.map(k => k.key))
  .then(values => {
    batch.forEach(({resolve}, i) => {
      resolve(values[i])
    })
  })

And combined, all the code to implement the above is here:

function DataLoader (_batchLoadingFn) {
  this._keys = []
  this._batchLoadingFn = _batchLoadingFn
}

DataLoader.prototype.load = function(key) {
  clearTimeout(this._timer)
  const promisedValue = new Promise ( resolve => this._keys.push({key, resolve}) )

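  this._timer = setTimeout(() => {
    console.log('You should only see me printed once!')
    // Take the current batch and reset _keys straight away, so that
    // .load() calls made while the batch is in flight are not lost
    const batch = this._keys
    this._keys = []
    this._batchLoadingFn(batch.map(k => k.key))
      .then(values => {
        batch.forEach(({resolve}, i) => {
          resolve(values[i])
        })
      })
  }, 0)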

  return promisedValue
}

// Define a batch loading function
const batchLoadingFunction = keys => new Promise( resolve => resolve(keys.map(k => k * 2)) )

// Create a new DataLoader
const loader = new DataLoader(batchLoadingFunction)

// call .load() twice in quick succession
loader.load(1).then(result => console.log('Result with key = 1', result))
loader.load(2).then(result => console.log('Result with key = 2', result))

If I remember correctly, the dataloader library does not use setTimeout and uses process.nextTick instead, but I couldn't get that to work.
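
For what it's worth, here is one way a process.nextTick variant might be made to work. Callbacks queued with process.nextTick cannot be cancelled the way a timeout can, so this sketch schedules once per batch and uses a flag instead of clearTimeout. NextTickLoader, _scheduled and batchCalls are names invented for this example, and this is not a copy of how the library itself does it.

```javascript
function NextTickLoader(batchLoadingFn) {
  this._keys = [];
  this._scheduled = false; // true while a dispatch is already queued
  this._batchLoadingFn = batchLoadingFn;
}

NextTickLoader.prototype.load = function (key) {
  const promisedValue = new Promise((resolve, reject) =>
    this._keys.push({ key, resolve, reject })
  );

  // nextTick callbacks cannot be cancelled, so schedule only once per batch.
  if (!this._scheduled) {
    this._scheduled = true;
    process.nextTick(() => {
      const batch = this._keys;
      this._keys = [];
      this._scheduled = false;
      this._batchLoadingFn(batch.map(k => k.key))
        .then(values => batch.forEach(({ resolve }, i) => resolve(values[i])))
        .catch(err => batch.forEach(({ reject }) => reject(err)));
    });
  }

  return promisedValue;
};

let batchCalls = 0; // counts how often the batch function runs
const nextTickLoader = new NextTickLoader(keys => {
  batchCalls += 1;
  return Promise.resolve(keys.map(k => k * 2));
});

const finished = Promise.all([nextTickLoader.load(1), nextTickLoader.load(2)])
  .then(values => {
    console.log(values, 'batch calls:', batchCalls);
    return values;
  });
```

Both load calls happen before the queued callback runs, so they end up in the same batch and the batch function is called once.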

Zach Smith answered Oct 04 '22 10:10
