
Custom Mongo ObjectId for inserts

Some Background:

I'm using MongoDB in tandem with ElasticSearch via the mongo-elasticsearch river. In Elasticsearch I want the structure of my documents to look like this:

{
    "_id": "SomeId-AnotherId",

    ... // all the other lovely denormalized data
}

SomeId-AnotherId is something I create when I denormalize my data. The reason I need that structure is that I need to be able to say http://elasticsearch/index/type/SomeId-AnotherId to retrieve a document.

I denormalize my data (a C# app) then I insert into MongoDB (this data then goes into ES via the river as mentioned above). When I insert into MongoDB I am currently under the impression that I need to set a BsonId on my model which Mongo uses to index the document. This can be an ObjectId or any other type such as string or int etc as long as I add the [BsonId] attribute.

My model looks like this:

public class Model {
    [BsonId]
    public string Id {get;set;}
}

And I set it like this:

model.Id = string.Format("{0}-{1}", someId, anotherId);

The Problem

At the moment I'm seeing ~1,500 documents getting into Mongo from an insert of ~10,000. I had a look at the ids I was generating for my model objects and there were definitely a lot over 12 bytes. Would mongo just refuse those and not write them?

BSON ObjectIds are 12 bytes, so does this mean that if I create my own ID (in the format "SomeId-AnotherId") it must also be no more than 12 bytes long? Is there any way around this?

I don't want to use Mongo's default ObjectId for these documents because, as I mentioned above, once the doc is in Elasticsearch I want to be able to get a document in a particular way (using "SomeId-AnotherId" in a URI).

Final Notes:

I'm aware that I can add another ID property to my model called something like ElasticId and then configure Elasticsearch to look for this property and use it as the _id of the Elasticsearch document. If I did this then I could use Mongo's default IDs and all would be well. However, I would sacrifice Elasticsearch performance and I would also need to store an extra field in Elasticsearch that I don't want.

Sorry for the massive brain dump btw!! :)

james lewis asked Dec 12 '22 20:12



2 Answers

The _id field of a MongoDB document can be a 12-byte ObjectId, but it doesn't have to be. According to the documentation, you can use any non-array value as _id, as long as you can make sure that it's unique.

Philipp answered Dec 28 '22 23:12



OK I've solved this now. On reflection it was a bit obvious and a massive oversight on my part.

I'm inserting in batches of 10,000 but the total number of records is over 40 million. My ids were only guaranteed to be unique within a single batch - so an id in one batch could duplicate an id already inserted by an earlier batch.

I turned on SafeMode and started to see the exceptions I was getting - they were coming from mongo and they were duplicate key exceptions. I found that the mongo csharp client drops all remaining data in your batch as soon as it gets a duplicate key error. So I was seeing the first 1500 of a batch going in, then I was receiving a duplicate key error and then the rest of the batch wasn't being inserted. Which totally makes sense.

So for now I'm doing single inserts which are actually almost as quick as a batch insert. When I get a duplicate key error I log it but keep going as I don't care about duplicates in my scenario.
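To make the difference between the two strategies concrete, here is a minimal sketch that does not touch MongoDB at all. It assumes a HashSet<string> as a stand-in for the unique _id index, and the names DuplicateKeyDemo, InsertOrderedBatch and InsertOneByOne are made up for illustration; they are not part of the Mongo C# driver.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a collection with a unique _id index.
// This only imitates the behaviour described above; it is not driver code.
public static class DuplicateKeyDemo
{
    // Imitates a batch insert that stops at the first duplicate key:
    // everything after the duplicate is never written.
    // Returns how many ids were actually inserted.
    public static int InsertOrderedBatch(HashSet<string> store, IEnumerable<string> ids)
    {
        int inserted = 0;
        foreach (var id in ids)
        {
            if (!store.Add(id))
            {
                return inserted; // duplicate key: abandon the rest of the batch
            }
            inserted++;
        }
        return inserted;
    }

    // Imitates single inserts that record each duplicate and keep going.
    public static int InsertOneByOne(HashSet<string> store, IEnumerable<string> ids, List<string> duplicates)
    {
        int inserted = 0;
        foreach (var id in ids)
        {
            if (store.Add(id))
            {
                inserted++;
            }
            else
            {
                duplicates.Add(id); // log it, but don't stop
            }
        }
        return inserted;
    }

    public static void Main()
    {
        var earlierBatch = new[] { "7-3" };                        // id already stored
        var batch = new[] { "1-1", "1-2", "7-3", "1-4", "1-5" };   // "7-3" collides

        var storeA = new HashSet<string>(earlierBatch);
        Console.WriteLine(InsertOrderedBatch(storeA, batch));      // prints 2

        var storeB = new HashSet<string>(earlierBatch);
        var duplicates = new List<string>();
        Console.WriteLine(InsertOneByOne(storeB, batch, duplicates)); // prints 4
        Console.WriteLine(string.Join(",", duplicates));           // prints 7-3
    }
}
```

With the same input, the batch version writes only the two ids before the collision, while the one-by-one version writes all four new ids and records the single duplicate. That matches what I was seeing: roughly the first 1,500 documents of a 10,000 batch going in and the rest disappearing.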

Thanks for the help @Philipp.

james lewis answered Dec 28 '22 23:12
