Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB text index in German

On my articles collection I have a text index:

{
    "v" : 1,
    "key" : {
            "_fts" : "text",
            "_ftsx" : 1
    },
    "name" : "title_text_abstract_text_body_text",
    "ns" : "foo.articles",
    "weights" : {
            "abstract" : 1,
            "body" : 1,
            "title" : 1
    },
    "default_language" : "english",
    "language_override" : "language",
    "textIndexVersion" : 2
}

In my articles collection I have an entry like this:

{
    "_id" : ObjectId("5477c28c807a9cd660ccd567"),
    "title" : "Hallo Welt!",
    "author" : "foo",
    "publishDate" : ISODate("2014-11-28T17:00:00Z"),
    "language" : "de",
    "abstract" : "Mein erster Artikel!",
    "body" : "Dieser Artikel ist in deutscher Sprache.",
    "__v" : 0
}

(There are actually different values in abstract and body, for brevity's sake lets assume it's those above)

When I then try to search for this article:

db.articles.find({$text: {$search: 'Welt'}})

It does get found.

But: When I then try to search for this article:

db.articles.find({$text: {$search: 'Sprache'}})

I get no results. But then after I changed the language to en or none I do get this article as a result with the exact same query.

What am I doing wrong?

Edit: As requested in the comments, here the exact commands that lead to the above described behavior. Should have done it this way in the first place, apologies.

> db.test.drop()
true
> db.test.insert({language: "de", body: "vermutlich", title: "Artikel"})
WriteResult({ "nInserted" : 1 })
> db.test.ensureIndex({body: "text", title: "text"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.test.find({$text: {$search: 'vermutlich'}})
> db.test.find({$text: {$search: 'Artikel'}})
{ "_id" : ObjectId("54ea86d6c9ec98269e022c67"), "language" : "de", "body" : "vermutlich", "title" : "Artikel" }
> db.version()
2.6.5

I also tried changing the language again:

> db.test.update({}, {$set: {language: "en"}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.test.find({$text: {$search: 'Artikel'}})
{ "_id" : ObjectId("54ea86d6c9ec98269e022c67"), "language" : "en", "body" : "vermutlich", "title" : "Artikel" }
> db.test.find({$text: {$search: 'vermutlich'}})
{ "_id" : ObjectId("54ea86d6c9ec98269e022c67"), "language" : "en", "body" : "vermutlich", "title" : "Artikel" }

Edit: Okay, so I just tried to rebuild this example. But I also added one German quote, so this is what I did:

> db.test.drop()
true
> db.test.insert({ language: "portuguese", original: "A sorte protege os audazes.", translation: [{ language: "english", quote: "Fortune favors the bold."},{ language: "spanish", quote: "La suerte rotege a los audaces."}]})
WriteResult({ "nInserted" : 1 })
> db.test.insert({ language: "spanish", original: "Nada hay más surrealista que la realidad.", translation:[{language: "english",quote: "There is nothing more surreal than reality."},{language: "french",quote: "Il n'y a rien de plus surréaliste que la réalité."}]})
WriteResult({ "nInserted" : 1 })
> db.test.insert({ original: "is thisdagger which I see before me.", translation: {language: "spanish",quote: "Es este un puñal que veo delante de mí." }})
WriteResult({ "nInserted" : 1 })
> db.test.insert({original: "Die Geister, die ich rief...", language: "german", translation: {language: "english", quote: "The spirits that I've cited..."}})
WriteResult({ "nInserted" : 1 })
> db.test.ensureIndex( { original: "text", "translation.quote": "text" } )
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}

Then I tried some queries:

> db.test.count({$text: {$search: "delante"}})
1
> db.test.count({$text: {$search: "spirits"}})
1
> db.test.count({$text: {$search: "Geister"}})
0

Conclusion: mongoDB doesn't work with German? This is really frustrating

like image 853
dave Avatar asked Feb 21 '15 14:02

dave


3 Answers

wdberkeley is correct, but I feel the urge to add a quick explanation of stemming because I doubt users without experience in the field will get the gist. I'd also like to highlight some alternatives and the general limitations.

In many languages, words get transformed significantly because of grammar rules, e.g. for the German word "Geist" (mind / ghost / spirit):

"Geist" (singular) -> "Geister" (plural) -> "Geistern" (plural accusative)

This effect is known in English, too, but it's less pronounced, examples:

"house" -> "houses" // "mouse" -> "mice" // "notable" -> "notably"

Usually, we want searches to disregard that local grammatical structure, so if we're looking for "Geist" it should find any of the words above. Doing that right is extremely challenging, because language rules are complex, and a correct answer can't necessarily be determined without context.

Suffix stripping is a common and relatively simple approach that assumes that certain endings are most likely indeed just endings and can be removed to get the stem. Sometimes, stemmers deliberately remov letters that actually belong to the stem, e.g. "notable" -> "notabl".

Since the language of the quotes is known, the correct stemmer will be used for the quotes. That works - with your data:

> db.test.find({$text: {$search: 'Geist'}}).count()
1

Now you're problem is that a user might not be looking for the stem, but for a derived form, so we need to apply the same transformation to the input. The key problem is that we don't know which transformation was applied in the first place. So you're trying to do something that's complex already with one variable added.

The good thing is that there is snowball, which is the stemmer used by MongoDB and other search systems such as SolR. It's available under BSD license and was ported to many languages, so it's possible to do the same thing done by the database in client code. That way, we don't treat the DB as a blackbox, but we also couple our client code to an implementation detail of the database... Choose your poison.

We could, for instance, just run over all stemmers and look which one changes the input, but that might result in fals positives since the word might be a stem already and the stemmer from another language shortens it (German stemmer: 'mice' -> 'mic').

At least, we greatly reduce the number of queries we need to make if we take the set of distinct responses by the stemmers.

Alternatively, you could consult a word list to make a guess of which language this query could be in.

Even with that additional effort, it's important to understand the limitations of stemming done by simple suffix stripping. For instance, "mice" won't be found when looking for "mouse", not even with the English stemmer, because the stemmer assumes the stem is shorter. Things get a lot worse if the texts aren't really in their supposed language (Ulysses...)

In other words: a really good free-text search needs a lot more than just stemming, and making queries across languages adds to that. A different search database is not a panacea - the problem is deeply rooted in the problem space...

EDIT: ElasticSearch has a nice comprehensive explanation of stemming (I keep finding these after writing the answer)

EDIT2:

Why can't MongoDB simply use differently stemmed words?

The transformations are only applied when inserting or updating text in the database. The query simply looks for a match of the stem. Essentially, the index is made from the stemmed words. What you want would require walking the entire collection, every time. That'd be highly inefficient and defeats the purpose of indexing. What you can do is do that step in client code, as suggested.

Why can't I just use $or with twice searching the text-index

AFAIK, that's a limitation of the query engine - probably has to do with the ranking because making a good scoring based on two different inputs doesn't make much sense. But you can simply run two queries and merge the results client-side.

like image 126
mnemosyn Avatar answered Nov 15 '22 04:11

mnemosyn


Sorry, I was being daft. The issue is simple: we're attempting matching the search text "vermutlich" with the document text "vermutlich", and to do this properly you need to parse both with the same language rules. If you do the following:

> db.test.drop()
> db.test.insert({ "language" : "de", "body" : "vermutlich", "title" : "Artikel"})
> db.test.ensureIndex({ "$**" : "text" })
> db.test.count({ "$text" : { "$search" : "vermutlich" } })
0
> db.test.count({ "$text" : { "$search" : "vermutlich", "$language" : "de" } })
1

The first query searches the document, which was indexed as German because of the language field, using "vermutlich" processed as an English word.

You can set the default language for the text index to avoid specifying $language in every query:

> db.test.drop()
> db.test.insert({ "language" : "de", "body" : "vermutlich", "title" : "Artikel"})
> db.test.ensureIndex({ "$**" : "text" }, { "default_language" : "de" })
> db.test.count({ "$text" : { "$search" : "vermutlich" } })
1
like image 34
wdberkeley Avatar answered Nov 15 '22 06:11

wdberkeley


As a workaround I created my text index with default_language: "none" and language_override: "none". This way there are no stop words nor stemming. But at least I find direct matches regardless of language.

> db.test.drop()
true
> db.test.insert({ language: "portuguese", original: "A sorte protege os audazes.", translation: [{ language: "english", quote: "Fortune favors the bold."},{ language: "spanish", quote: "La suerte rotege a los audaces."}]})
WriteResult({ "nInserted" : 1 })
> db.test.insert({ language: "spanish", original: "Nada hay más surrealista que la realidad.", translation:[{language: "english",quote: "There is nothing more surreal than reality."},{language: "french",quote: "Il n'y a rien de plus surréaliste que la réalité."}]})
WriteResult({ "nInserted" : 1 })
> db.test.insert({ original: "is this dagger which I see before me.", translation: {language: "spanish",quote: "Es este un puñal que veo delante de mí." }})
WriteResult({ "nInserted" : 1 })
> db.test.insert({original: "Die Geister, die ich rief...", language: "german", translation: {language: "english", quote: "The spirits that I've cited..."}})
WriteResult({ "nInserted" : 1 })
> db.test.ensureIndex( { original: "text", "translation.quote": "text" }, {default_language: 'none', language_override: 'none'} )
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.test.find({$text: {$search: 'Geister'}})
{ "_id" : ObjectId("54ed31b2e7ac93c32c760809"), "original" : "Die Geister, die ich rief...", "language" : "german", "translation" : { "language" : "english", "quote" : "The spirits that I've cited..." } }

Unless somebody finds an actual solution I consider mongoDB text index to be broken.

like image 34
dave Avatar answered Nov 15 '22 06:11

dave