Schema Migration Scripts in NoSQL Databases

I have an active project that has always used C#, Entity Framework, and SQL Server. However, with the feasibility of NoSQL alternatives increasing daily, I am researching all the implications of switching the project to MongoDB.

It is obvious that the major transition hurdles would come from being "schema-less". A good summary of what that implies for languages like C# is found in the official MongoDB documentation. Here are the most relevant paragraphs:

Just because MongoDB is schema-less does not mean that your code can handle a schema-less document. Most likely, if you are using a statically typed language like C# or VB.NET, then your code is not flexible and needs to be mapped to a known schema.

There are a number of different ways that a schema can change from one version of your application to the next.

How you handle these is up to you. There are two different strategies:

1. Write an upgrade script.
2. Incrementally update your documents as they are used.

The easiest strategy is to write an upgrade script. There is effectively no difference to this method between a relational database (SQL Server, Oracle) and MongoDB. Identify the documents that need to be changed and update them.

Alternatively, and not supportable in most relational databases, is the incremental upgrade. The idea is that your documents get updated as they are used. Documents that are never used never get updated. Because of this, there are some definite pitfalls you will need to be aware of.

First, queries against a schema where half the documents are version 1 and half the documents are version 2 could go awry. For instance, if you rename an element, then your query will need to test both the old element name and the new element name to get all the results.
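The rename pitfall can be sketched without a database. In the Python sketch below, plain dicts stand in for MongoDB documents, and the field names `name` (version 1) and `full_name` (version 2) are hypothetical. The filter uses MongoDB's `$or` operator; the small `matches` helper is only an in-memory stand-in for the server's query evaluation and handles nothing beyond equality and `$or`.

```python
# Hypothetical collection: v1 documents use "name", v2 documents use "full_name".
documents = [
    {"_id": 1, "name": "Ada Lovelace"},        # version 1 shape
    {"_id": 2, "full_name": "Ada Lovelace"},   # version 2 shape (field renamed)
    {"_id": 3, "full_name": "Alan Turing"},
]


def matches(doc, query):
    """In-memory stand-in for query evaluation: equality and $or only."""
    for key, expected in query.items():
        if key == "$or":
            if not any(matches(doc, clause) for clause in expected):
                return False
        elif key not in doc or doc[key] != expected:
            return False
    return True


def find(docs, query):
    return [doc for doc in docs if matches(doc, query)]


# Querying only the new name silently misses the version 1 document.
new_name_only = {"full_name": "Ada Lovelace"}

# Testing both the old and the new element name finds every match.
both_names = {"$or": [{"name": "Ada Lovelace"}, {"full_name": "Ada Lovelace"}]}

print([d["_id"] for d in find(documents, new_name_only)])  # [2]
print([d["_id"] for d in find(documents, both_names)])     # [1, 2]
```

The same filter shape would be needed for every query that touches the renamed field for as long as version 1 documents remain in the collection.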

Second, any incremental upgrade code must stay in the code-base until all the documents have been upgraded. For instance, if there have been 3 versions of a document, [1, 2, and 3] and we remove the upgrade code from version 1 to version 2, any documents that still exist as version 1 are un-upgradeable.
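A minimal Python sketch of why every step has to stay in the code-base: each upgrader moves a document forward by exactly one version, so a missing step strands all documents below it. The `schema_version` field, the field names, and the upgrade functions are assumptions made for illustration, not part of MongoDB or its drivers.

```python
# Hypothetical upgrade steps; each takes a document at version N and
# returns a new document at version N + 1.
def upgrade_v1_to_v2(doc):
    upgraded = dict(doc)
    upgraded["full_name"] = upgraded.pop("name")  # v2 renamed the field
    upgraded["schema_version"] = 2
    return upgraded


def upgrade_v2_to_v3(doc):
    upgraded = dict(doc)
    upgraded.setdefault("tags", [])  # v3 added an optional list
    upgraded["schema_version"] = 3
    return upgraded


LATEST_VERSION = 3
UPGRADERS = {1: upgrade_v1_to_v2, 2: upgrade_v2_to_v3}


def upgrade_to_latest(doc, upgraders=UPGRADERS):
    """Apply upgrade steps one version at a time until the document is current."""
    # Assumption: documents without a version field predate versioning (v1).
    version = doc.get("schema_version", 1)
    while version < LATEST_VERSION:
        if version not in upgraders:
            raise LookupError(f"no upgrade step from version {version}")
        doc = upgraders[version](doc)
        version = doc["schema_version"]
    return doc


old_doc = {"_id": 1, "name": "Ada Lovelace"}
print(upgrade_to_latest(old_doc))
# {'_id': 1, 'full_name': 'Ada Lovelace', 'schema_version': 3, 'tags': []}

# Dropping the v1 -> v2 step while version 1 documents still exist strands them.
try:
    upgrade_to_latest(old_doc, upgraders={2: upgrade_v2_to_v3})
except LookupError as error:
    print(error)  # no upgrade step from version 1
```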

The tooling for managing/creating such initialization or upgrade scripts in the SQL ecosystem is very mature (e.g. Entity Framework Migrations).

While there are similar tools and homemade scripts available for such upgrades in the NoSQL world (though some believe there should not be), there seems to be less consensus on "when" and "how" to run these upgrade scripts. Some suggest after deployment. Unfortunately this approach (when not used in conjunction with incremental updating) can leave the application in an unusable state when attempting to read existing data for which the C# model has changed.

If MongoDB's suggestion that

"The easiest strategy is to write an upgrade script."

is truly the easiest/recommended approach for static .NET languages like C#, are there existing tools for code-first schema migration in NoSQL databases for those languages? Or is the NoSQL ecosystem not yet at that point of maturity?

If you disagree with MongoDB's suggestion, what is a better implementation, and can you give some reference/examples of where I can see that implementation in use?

asked Mar 16 '18 22:03

jth41

1 Answer

Short version

Is "The easiest strategy is to write an upgrade script." truly the easiest/recommended approach for static .NET languages like C#?

No. You could do that, but that's not the strength of NoSQL. Using C# does not change that.

are there existing tools for code-first schema migration in NoSql Databases for those languages?

Not that I'm aware of.

or is the NoSql ecosystem not to that point of maturity?

It's schemaless. I don't think that's the goal or measurement of maturity.

Warnings

First off, I'm rather skeptical that just pushing an existing relational model to NoSQL would, in the general case, solve more problems than it creates.

SQL is for working with relations and sets of data; NoSQL targets non-relational data: "islands" with few and/or soft relations. Both are good at what they target, but they are good at different things. They are not interchangeable, at least not without serious effort in data redesign and changes to team mindset and application logic. That effort may invalidate most previous technical design decisions, with the impact reaching architectural system properties and possibly the user experience.

Obviously, it may make sense in your case, but definitely do the ROI math before committing.

Dealing with schema change

Assuming you really have good reasons to switch, and schema change management is key to that, I would suggest not fighting the schemaless nature of NoSQL but embracing it instead. Accept that your data will have different schemas.

Don't do upgrade scripts

... unless you know your application's data set will never ever grow or change notably. The other SO post you referenced explains it really well. You just can't rely on being able to do this in the long term, so you need a plan B anyway. You might as well start with it and only use schema update scripts when that really is the simpler thing to do for the specific case.

I would add to that argument that a good NoSQL-optimized data model is usually optimized for single-item seeks and writes, so mass updates can be significantly heavier than in SQL. To update a single field you may have to rewrite a larger portion of the document, and possibly also handle denormalizations that were introduced to reduce the need for lookups (and the update may not even be transactional). So a "large" data set in NoSQL may turn out to be significantly smaller, and be reached sooner, than you would expect when measured in upgrade downtime.

Support multiple schemas concurrently

Having different concurrently "active" schema versions is in practice expected since there is no enforcement anyway and that's the core feature you are buying into by switching to NoSQL in the first place.

Ideally, in the NoSQL mindset, your logic should be able to work with any input data that meets the requirements of a specific process. It should depend on its required input, not on your storage model (which also makes sense universally for dependency management, to reduce complexity). Maybe the logic depends on just a few properties in a single type of document. It should not break if some other fields have changed or some extra data has been added, as long as they are not relevant to the specific work to be done. It definitely should not care whether some other model type has changed. This approach usually implies working with soft value bags (JSON/dynamic/dictionary/etc).

Even if the storage model is schema-less, each business logic process has expectations about its input model (a schema subset), and it should validate that it can work with what it's given. Persisting a schema version number along with the model also helps in trickier cases.
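As a rough Python sketch of that idea, each process can declare only the fields it needs and check for them before doing any work. The invoice process, its field names, and the helper functions are hypothetical; plain dicts stand in for stored documents.

```python
def missing_fields(doc, required_fields):
    """Return the required fields that are absent from the document."""
    return [field for field in required_fields if field not in doc]


# Hypothetical process: building an invoice email only needs these two fields.
INVOICE_EMAIL_REQUIRES = ("email", "amount_due")


def build_invoice_email(doc):
    missing = missing_fields(doc, INVOICE_EMAIL_REQUIRES)
    if missing:
        raise ValueError(f"cannot build invoice email, missing: {missing}")
    return f"To {doc['email']}: you owe {doc['amount_due']}"


# Extra or renamed fields that the process does not use are ignored,
# so both schema versions are handled by the same code.
v1_doc = {"email": "ada@example.com", "amount_due": 40, "name": "Ada"}
v2_doc = {"email": "ada@example.com", "amount_due": 40, "full_name": "Ada",
          "schema_version": 2}
print(build_invoice_email(v1_doc))
print(build_invoice_email(v2_doc))
```

Because the check covers only the schema subset the process depends on, a change elsewhere in the document does not force a change here.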

As a C# guy, I personally avoid working with dynamic models directly and prefer creating strongly typed objects to wrap each dynamic storage type. To avoid having to manage N concurrent schema version models (with minimal differences) and constantly upgrade the logic layer to support new schema versions, I would implement the wrapper as a superset of all currently supported schema versions for a given entity and have it implement any interfaces you need. Of course you could add N more abstraction layers ;) Once some old schema versions have eventually been phased out of the data, you can simplify your model, and the strong typing helps you reach all dependents.

Also, it's important that the logic layer has a fallback or reaction plan in case the input model does NOT match the requirements for carrying out the intended logic. It's up to the app when and where it can auto-upgrade, accept a discard or partial reset, redirect to some trickier repair queue (up to a manual fix if no automation can cut it), or just outright reject the request due to incompatibility.
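One possible shape for such a superset wrapper is sketched below in Python, using a frozen dataclass with type hints as a stand-in for a strongly typed C# class. The `Customer` entity, its fields, and the version rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Customer:
    """Typed view over every supported schema version of a customer document."""
    id: int
    schema_version: int
    full_name: str               # v1 stored this as "name"
    email: Optional[str] = None  # only present from v2 onward

    @classmethod
    def from_document(cls, doc: Mapping[str, Any]) -> "Customer":
        # Assumption: documents without a version field are version 1.
        version = doc.get("schema_version", 1)
        # Accept either field name so callers never see the difference.
        full_name = doc["full_name"] if "full_name" in doc else doc["name"]
        return cls(id=doc["_id"], schema_version=version,
                   full_name=full_name, email=doc.get("email"))


v1 = Customer.from_document({"_id": 1, "name": "Ada Lovelace"})
v2 = Customer.from_document({"_id": 2, "schema_version": 2,
                             "full_name": "Alan Turing",
                             "email": "alan@example.com"})
print(v1.full_name, v1.email)  # Ada Lovelace None
print(v2.full_name, v2.email)  # Alan Turing alan@example.com
```

Once no version 1 documents remain, the fallback to `name` can be deleted from `from_document` and callers are unaffected.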

Yes, there's the problem of querying across sets of models with different versions, so you should always consider those cases as well. You may have to adjust querying logic to query different versions separately and merge results (or accept partial results if acceptable).

There definitely are tradeoffs to consider, sure.

So, migrations?

A downside (if you care about the availability of a migrations tool set) is that you don't have one true schema from which to auto-generate the model or its changes, as the C# model IS the source-of-truth schema you're currently supporting. That's actually quite similar to the code-first mindset, but without migrations.

You could implement an incoming model pipe which auto-upgrades the models as they are read and hence reduces the number of schema versions you need to support upstream. I would say this is as close to migrations as you get. I don't know of any tools that do this for you automatically, and I'm not sure I would want one to. There are trade-offs to consider; for example, some clients consuming the data may get upgraded on a different timeline. Upgrading to the latest version may not always be what you want.
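A minimal Python sketch of such a read pipe, under stated assumptions: a dict plays the role of the store, the `UpgradingReader` class and both upgrade steps are hypothetical, and each reader chooses its own target version, so clients on different timelines are not forced to the latest schema.

```python
def rename_name_field(doc):
    """Hypothetical v1 -> v2 step: "name" becomes "full_name"."""
    upgraded = {k: v for k, v in doc.items() if k != "name"}
    upgraded["full_name"] = doc["name"]
    upgraded["schema_version"] = 2
    return upgraded


def add_tags_field(doc):
    """Hypothetical v2 -> v3 step: add an empty "tags" list."""
    return {**doc, "tags": [], "schema_version": 3}


STEPS = {1: rename_name_field, 2: add_tags_field}


class UpgradingReader:
    """Wraps a store and upgrades documents to target_version as they are read."""

    def __init__(self, store, target_version, write_back=False):
        self.store = store              # dict of _id -> document
        self.target_version = target_version
        self.write_back = write_back

    def get(self, doc_id):
        doc = self.store[doc_id]
        original = doc
        # Assumption: documents without a version field are version 1.
        while doc.get("schema_version", 1) < self.target_version:
            doc = STEPS[doc.get("schema_version", 1)](doc)
        if self.write_back and doc is not original:
            self.store[doc_id] = doc    # persist the upgrade lazily
        return doc


store = {1: {"_id": 1, "name": "Ada Lovelace"}}

# A client still pinned to version 2 reads the document as version 2.
print(UpgradingReader(store, target_version=2).get(1)["schema_version"])  # 2
# Without write_back the stored document is untouched.
print("name" in store[1])  # True
# A newer client upgrades to version 3 and persists the result.
UpgradingReader(store, target_version=3, write_back=True).get(1)
print(store[1]["schema_version"])  # 3
```

Documents already above a reader's target version are returned as stored; this sketch does not attempt downgrades.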

Conclusion

NoSQL is by definition not SQL. Both are cool, but expecting equivalency or interchangeability is bound for trouble.

You still have to consider and manage schema in NoSQL, but if you want one true enforced & guaranteed schema, then consider SQL instead.

answered Oct 05 '22 00:10

Imre Pühvel
