Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB case insensitive index "starts with" performance problems

After finding out that 3.3.11 supports case insensitive index (using collation) I have rebuilt my database of 40 million records to play with this. Alternative was to add e.g. lowercase fields specific to case insensitive search and index those.

What I did was to ask MongoDB to support collation on my collection at the time of creation as suggested here. So I did this to enable case insensitivity for the entire collection:

db.createCollection("users", {collation:{locale:"en",strength:1}})

After loading the collection I have tried direct queries like:

db.users.find({full_name:"john doe"})

...and those return in ~10ms with 50 results. It's case insensitive - so all is great. But then I try something like:

db.users.find({full_name:/^john/})

...or...

db.users.find({full_name:/^john/i})

...and this takes more than 5 minutes. I was so disappointed. After doing explain() it turns out that the index was apparently being used but the query still takes way too long to execute. Can this be attributed to buggy or incomplete development release or am I doing something fundamentally wrong?

As I am doing a "starts with" regex search, the query should be lightning fast. Any ideas?

like image 306
wpfwannabe Avatar asked Aug 26 '16 06:08

wpfwannabe


1 Answers

Edit: there is a workable workaround. Basically if the word you are searching for is "bob", you can search for $lt:"boc", (where you increment the last character by one), and $gte "bob". This will use the index. You can use the following function I made below (warning its not necessarily bug free but pretty much works) like this:

var searchCriteria = {};
addStartsWithQuery(searchCriteria, "firstName", "bo");
People.find(searchCriteria).then(...);

//searchCriteria will be
/*
{
    $and:[
         {firstName:{$gte:"bo"}},
         {firstName:{$lt:"bp"}}
    ]
}
*/


//now library functions that will automatically generate the correct query and add it to `searchCriteria`.  Of course for complicated queries you may have to modifiy it a bit.
function getEndStr(str) {
    var endStrArr = str.toLocaleLowerCase('en-US').split("");
    for (var i = endStrArr.length - 1; i >= 0; --i) {
        var lastChar = endStrArr[i];
        if(lastChar === "z"){
            return endStrArr.join("") + "zzzzzzzzzzzz";
        }
        var nextChar = String.fromCharCode(lastChar.charCodeAt(0) + 1);
        if (nextChar === ":")
            nextChar = "a";
        if (nextChar !== false) {
            endStrArr[i] = nextChar;
            return endStrArr.join("");
        }
        endStrArr.pop();
    }
}
function addStartsWithQuery(searchCriteria, propertyName, str) {
    if (!(typeof str === 'string') || !str.length)
        return;
    var endStr = getEndStr(str);
    if (endStr) {
        if (!searchCriteria.$and)
            searchCriteria.$and = [];
        searchCriteria.$and.push({
            [propertyName]: {
                $gte: str
            }
        });
        searchCriteria.$and.push({
            [propertyName]: {
                $lt: endStr
            }
        });
    } else {
        searchCriteria[propertyName] = {
            $gte: str
        }
    }
}

Well it turns out MongoDB officially doesn't support it! I've linked to an issue in JIRA where they make this clear. This makes collations significantly less useful, unfortunately. Let's get on them to fix this soon! Technically speaking, I noticed that even though it is using the index, the index uses "[\"\", {})", as one of it's index bounds, which always returns all items in the index, so the index scan is useless. The next stage of the query filters through those results like normal.

https://jira.mongodb.org/browse/DOCS-9933

Vote for this issue to get them to fix it! https://jira.mongodb.org/browse/SERVER-29865

like image 81
user3413723 Avatar answered Nov 15 '22 08:11

user3413723