I am using Rethinkdb 1.10.1 with the official python driver. I have a table of tagged things which are associated to one user:
{
    "id": "PK",
    "user_id": "USER_PK",
    "tags": ["list", "of", "strings"],
    // Other fields...
}
I want to query by user_id and tag (say, to find all the things by user "tawmas" with tag "tag"). Starting with Rethinkdb 1.10 I can create a multi-index like this:
r.table('things').index_create('tags', multi=True).run(conn)
My query would then be:
res = (r.table('things')
       .get_all('TAG', index='tags')
       .filter(r.row['user_id'] == 'USER_PK').run(conn))
However, this query still needs to scan all the documents with the given tag, so I would like to create a compound index based on the user_id and tags fields. Such an index would allow me to query with:
res = r.table('things').get_all(['USER_PK', 'TAG'], index='user_tags').run(conn)
There is nothing in the documentation about compound multi-indexes. However, I
tried to use a custom index function combining the requirements for compound
indexes and multi-indexes by returning a list of ["USER_PK", "tag"] pairs.
My first attempt was in python:
r.table('things').index_create(
    'user_tags',
    lambda each: [[each['user_id'], tag] for tag in each['tags']],
    multi=True).run(conn)
This makes the python driver choke with a MemoryError trying to parse the index function (I guess list comprehensions aren't really supported by the driver).
So, I turned to my (admittedly, rusty) javascript and came up with this:
r.table('things').index_create(
    'user_tags',
    r.js(
        """(function (each) {
            var result = [];
            var user_id = each["user_id"];
            var tags = each["tags"];
            for (var i = 0; i < tags.length; i++) {
                result.push([user_id, tags[i]]);
            }
            return result;
        })
        """),
    multi=True).run(conn)
This is rejected by the server with a curious exception: rethinkdb.errors.RqlRuntimeError: Could not prove function deterministic.  Index functions must be deterministic.
So, what is the correct way to define a compound multi-index? Or is it something which is not supported at this time?
A secondary index is created with the statement: CREATE INDEX name ON table_name (column_list) WITH STRUCTURE=VECTORWISE_SI; Note: Currently, secondary indexes cannot be created on tables with more than 4 billion rows.
A secondary index is a data structure that contains a subset of attributes from a table, along with an alternate key to support Query operations. You can retrieve data from the index using a Query , in much the same way as you use Query with a table.
Short answer:
List comprehensions don't work in ReQL functions. You need to use map instead like so:
r.table('things').index_create(
    'user_tags',
    lambda each: each["tags"].map(lambda tag: [each['user_id'], tag]),
    multi=True).run(conn)
Long answer
This is actually a somewhat subtle aspect of how RethinkDB drivers work. So the reason this doesn't work is that your python code doesn't actually see real copies of the each document. So in the expression:
lambda each: [[each['user_id'], tag] for tag in each['tags']]
each isn't ever bound to an actual document from your database, it's bound to a special python variable which represents the document. I'd actually try running the following just to demonstrate it:
q = r.table('things').index_create(
       'user_tags',
       lambda each: print(each)) #only works in python 3
And it will print out something like:
<RqlQuery instance: var_1 >
the driver only knows that this is a variable from the function, in particular it has no idea if each["tags"] is an array or what (it's actually just another very similar abstract object). So python doesn't know how to iterate over that field. Basically exactly the same problem exists in javascript.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With