I am building a corpus of indexed sentences in different languages. I have a collection of Languages which have both an ObjectId and the ISO code as a key. Is it better to use a reference to the Language collection or store a key like "en" or "fr"?
I suppose it's a compromise between:
Any best practices that I should know of?
Yes, you can. Uniqueness is also guaranteed by MongoDB, because the _id field has a unique index by default.
A field required in every MongoDB document. The _id field must have a unique value. You can think of the _id field as the document's primary key.
In MongoDB, each document stored in a collection requires a unique _id field that acts as a primary key. If an inserted document omits the _id field, the MongoDB driver automatically generates an ObjectId for the _id field.
In the end, it really comes down to personal choice and what will work best for your application.
The only requirement that MongoDB imposes upon _id is that it be unique. It can be an ObjectId (which is provided by default), a string, or even an embedded document (as I recall, it cannot be an array, though).
In this case, you can likely guarantee that the ISO code is a unique value, so it may be an ideal _id. You have a 'known' primary key which is also useful in itself by being identifiable, so using it instead of a generated ID is probably the more sensible bet. It also means that anywhere you 'reference' this information in another collection you can save the ISO code instead of an ObjectId; those browsing your raw data can immediately identify what that reference points at.
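To make the idea concrete, here is a minimal sketch of what the documents might look like when the ISO code is used as the _id. It uses plain JavaScript objects rather than a live database, and the field names (name, text, lang) and helper functions are assumptions made for illustration, not anything from the original question.

```javascript
// Hypothetical document shapes; collection and field names are illustrative.
const languages = [
  { _id: "en", name: "English" },
  { _id: "fr", name: "French" },
];

const sentences = [
  { text: "Hello, world.", lang: "en" },
  { text: "Bonjour le monde.", lang: "fr" },
  { text: "Good morning.", lang: "en" },
];

// Index languages by _id, loosely mimicking the unique index MongoDB keeps on _id.
const languagesById = new Map();
for (const language of languages) {
  if (languagesById.has(language._id)) {
    throw new Error(`duplicate _id: ${language._id}`);
  }
  languagesById.set(language._id, language);
}

// Filtering by language needs no lookup in `languages` at all.
function sentencesIn(code) {
  return sentences.filter((s) => s.lang === code);
}

// Resolving the reference is a single keyed lookup, needed only for the full record.
function languageName(sentence) {
  const language = languagesById.get(sentence.lang);
  return language ? language.name : undefined;
}

console.log(sentencesIn("en").length);
console.log(languageName(sentences[1]));
```

The point of the sketch is that the value stored in each sentence is already meaningful on its own, so the Languages collection only has to be consulted when extra details such as the display name are wanted.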
As an aside:
The two big benefits of ObjectIds are that they can be generated uniquely across multiple machines, processes and threads without needing any kind of central sequence tracking by the MongoDB server, and that they are stored as a special type in MongoDB that uses only 12 bytes (as opposed to the 24 bytes taken by the hexadecimal string representation of an ObjectId).
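The 12 versus 24 byte difference can be checked without a database. The sketch below uses 12 random bytes as a stand-in for an ObjectId (an assumption for illustration; a real ObjectId is produced by the driver and has internal structure) and compares the binary length with the length of its hexadecimal text form.

```javascript
const crypto = require("crypto");

// 12 random bytes stand in for an ObjectId here. A real ObjectId is generated
// by the driver and is structured (timestamp, random value, counter).
const rawId = crypto.randomBytes(12);

// The familiar textual form is the same bytes written as hexadecimal digits.
const hexId = rawId.toString("hex");

const binarySize = rawId.length;                   // bytes in binary form
const hexSize = Buffer.byteLength(hexId, "utf8");  // bytes when kept as text

console.log(hexId, binarySize, hexSize);
```

Each byte becomes two hexadecimal characters, which is why storing the string form of an ObjectId instead of the native type doubles the size of the key.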
Unless disk space is an issue, I'd probably go with the language key like "en" or "fr". This saves an additional query on the Languages collection to find the ObjectId for a given language; you can just query the sentences directly:
db.sentences.find( { lang: "en" } )
So long as the lang field is indexed, e.g. with db.sentences.createIndex( { lang: 1 } ) (ensureIndex in older shells), I don't think there'll be much difference in query performance.
If you've got a humongous data set, and disk space is a concern, then you could consider an ObjectId (12 bytes), or a number (8 bytes), which might be smaller than a UTF-8 string key depending on its length.
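As a rough way to judge the trade-off, the sketch below estimates the size of the stored value for a few candidate keys. It assumes the BSON string layout (a 4-byte length prefix, the UTF-8 bytes, and a trailing NUL byte) and the fixed sizes of the ObjectId, double and int32 types; the function names and the sample keys are made up for illustration.

```javascript
// Size in bytes of a BSON string value:
// int32 length prefix + UTF-8 bytes + trailing NUL terminator.
function bsonStringValueSize(value) {
  return 4 + Buffer.byteLength(value, "utf8") + 1;
}

// Fixed-size BSON value payloads.
const OBJECT_ID_SIZE = 12;
const DOUBLE_SIZE = 8;
const INT32_SIZE = 4;

// Collect the candidate sizes for one key so they can be compared side by side.
function compareKeySizes(key) {
  return {
    key,
    string: bsonStringValueSize(key),
    objectId: OBJECT_ID_SIZE,
    double: DOUBLE_SIZE,
    int32: INT32_SIZE,
  };
}

for (const key of ["en", "fr", "zh-Hant-TW"]) {
  console.log(compareKeySizes(key));
}
```

These figures cover only the value itself; the field name and type byte cost the same whichever type is chosen, and on-disk compression is ignored. Under those assumptions a two-letter code stored as a string is smaller than an ObjectId, and the balance only tips once the string key grows longer.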