I'm trying to use MongoDB to implement a natural language dictionary. I have a collection of lexemes, each of which has a number of wordforms as subdocuments. This is what a single lexeme looks like:
{
    "_id" : ObjectId("51ecff7ee36f2317c9000000"),
    "pos" : "N",
    "lemma" : "skrun",
    "gloss" : "screw",
    "wordforms" : [
        {
            "number" : "sg",
            "surface_form" : "skrun",
            "phonetic" : "ˈskruːn",
            "gender" : "m"
        },
        {
            "number" : "pl",
            "surface_form" : "skrejjen",
            "phonetic" : "'skrɛjjɛn",
            "pattern" : "CCCVCCVC"
        }
    ],
    "source" : "Mayer2013"
}
Currently I have a collection of some 4,000 lexemes, and each of these has on average some 1,000 wordforms (as opposed to just 2 above). This means I effectively have 4,000,000 unique wordforms in the collection, and I need to be able to search through them in a reasonable amount of time.
A normal query would look like this:
db.lexemes.find({"wordforms.surface_form":"skrejjen"})
I have an index on wordforms.surface_form, and this search is very fast.
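For reference, such a multikey index over the embedded array field can be created with something along these lines (ensureIndex was the shell helper at the time; newer shells call it createIndex):
// Multikey index on the embedded array field (createIndex in newer shells)
db.lexemes.ensureIndex({ "wordforms.surface_form" : 1 })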
However, if I want to have wildcards in my search, the performance is abysmal. For example:
db.lexemes.find({"wordforms.surface_form":/skrej/})
takes over 5 minutes (at which point I gave up waiting). As mentioned in this question, regex searches against an index are known to perform badly. I know that anchoring the regex with ^ helps a lot, but it also severely limits my search capabilities. Even if I am willing to make that sacrifice, I've noticed the response times can still vary a lot depending on the regex. The query
db.lexemes.find({"wordforms.surface_form":/^s/})
takes 35s to complete.
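To see whether a particular regex query is actually using the index, explain() can be chained onto the cursor (the exact output depends on the server version):
// Shows the chosen plan and scanned-document counts for this query
db.lexemes.find({"wordforms.surface_form":/^s/}).explain()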
The best results I've had so far have in fact been when I bypass the index using hint.
In this case, things seem to improve considerably. This query:
db.lexemes.find({"wordforms.surface_form":/skrej/}).hint('_id_')
takes around 3s to complete.
My question is: is there anything else I can do to improve these search times? As they are, they are still a little slow, and I am already considering migrating to MySQL in the hope of getting better performance. But I would really like to keep Mongo's flexibility and avoid all the tedious normalisation an RDBMS would require. Any suggestions? Do you think I will run into some slowness regardless of DB engine, with this amount of text data?
I know about Mongo's new text search feature, but its advantages (tokenisation and stemming) are not relevant in my case (not to mention that my language is not supported). It isn't clear whether text search is actually faster than using regexes anyway.
One possibility would be to store all the variants that you think might be useful as an array element; I'm not sure whether that is possible for your data, though:
{
    "number" : "pl",
    "surface_form" : "skrejjen",
    "surface_forms" : [ "skrej", "skre" ],
    "phonetic" : "'skrɛjjɛn",
    "pattern" : "CCCVCCVC"
}
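If that is feasible, exact matches against the variant array could then use an ordinary multikey index instead of a regex; a rough sketch (the surface_forms field name is taken from the example above):
// Index the variant array and query it with exact matches,
// which can use the index, unlike an unanchored regex
db.lexemes.ensureIndex({ "wordforms.surface_forms" : 1 })
db.lexemes.find({ "wordforms.surface_forms" : "skrej" })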
I would also suggest not storing 1,000 word forms with each word, but turning this around to have smaller documents. The smaller your documents are, the less MongoDB has to read into memory for each search (as long as the search conditions don't require a full scan, of course):
{
    "word" : {
        "pos" : "N",
        "lemma" : "skrun",
        "gloss" : "screw"
    },
    "form" : {
        "number" : "sg",
        "surface_form" : "skrun",
        "phonetic" : "ˈskruːn",
        "gender" : "m"
    },
    "source" : "Mayer2013"
}
{
    "word" : {
        "pos" : "N",
        "lemma" : "skrun",
        "gloss" : "screw"
    },
    "form" : {
        "number" : "pl",
        "surface_form" : "skrejjen",
        "phonetic" : "'skrɛjjɛn",
        "pattern" : "CCCVCCVC"
    },
    "source" : "Mayer2013"
}
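With documents shaped like this, the equality index would simply move to the new collection; assuming it is called wordforms, something like:
// Collection and field names assumed from the example documents above
db.wordforms.ensureIndex({ "form.surface_form" : 1 })
db.wordforms.find({ "form.surface_form" : "skrejjen" })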
I also doubt that MySQL would perform better here with searches for random word forms, as it would have to do a full table scan just as MongoDB does. The only thing that could help there is a query cache, but that is something you could quite easily build into your search UI/API in your application.
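As a sketch of what such an application-side cache could look like (Node.js driver assumed; the Map-based cache and function name are illustrative only):
// Illustrative only: cache results per query-pattern string in memory,
// with no eviction policy. `collection` is a Node.js driver Collection.
const cache = new Map();

async function searchForms(collection, pattern) {
  const key = pattern.toString();
  if (cache.has(key)) {
    return cache.get(key);   // repeated searches skip the database entirely
  }
  const results = await collection.find({ "form.surface_form": pattern }).toArray();
  cache.set(key, results);
  return results;
}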
As suggested by Derick, I refactored the data in my database such that I have "wordforms" as a collection rather than as subdocuments under "lexemes".
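A one-off script roughly along these lines is one way to do that flattening (the lexeme_id back-reference field is just an illustration, not necessarily the exact script I ran):
// Copy each wordform subdocument into its own document in a new
// "wordforms" collection, keeping a reference back to the parent lexeme
db.lexemes.find().forEach(function (lexeme) {
    lexeme.wordforms.forEach(function (form) {
        form.lexeme_id = lexeme._id;
        db.wordforms.insert(form);
    });
});
db.wordforms.ensureIndex({ "surface_form" : 1 })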
The results were in fact better!
Here are some speed comparisons. The last example in each table uses hint to intentionally bypass the index on surface_form, which in the old schema was actually faster.
Old schema (see original question)

Query                                                              Avg. Time
db.lexemes.find({"wordforms.surface_form":"skrun"})                0s
db.lexemes.find({"wordforms.surface_form":/^skr/})                 1.0s
db.lexemes.find({"wordforms.surface_form":/skru/})                 > 3 mins!
db.lexemes.find({"wordforms.surface_form":/skru/}).hint('_id_')    2.8s

New schema (see Derick's answer)

Query                                                              Avg. Time
db.wordforms.find({"surface_form":"skrun"})                        0s
db.wordforms.find({"surface_form":/^skr/})                         0.001s
db.wordforms.find({"surface_form":/skru/})                         1.4s
db.wordforms.find({"surface_form":/skru/}).hint('_id_')            3.0s
For me this is pretty good evidence that the refactored schema makes searching faster, and that it is worth the redundant data (or the extra join required).