 

How to handle changing data structures on program version update?

Tags:

c++

c

embedded

I do embedded software, but this isn't really an embedded question, I guess. I don't (can't for technical reasons) use a database like MySQL, just C or C++ structs.

Is there a generic philosophy of how to handle changes in the layout of these structs from version to version of the program?

Let's take an address book. From program version x to x+1, what if:

  • a field is deleted (seems simple enough) or added (ok if all can use some new default)?
  • a string gets longer or shorter? An int goes from 8 to 16 bits of signed / unsigned?
  • maybe I combine surname/forename, or split name into two fields?

These are just some simple examples; I am not looking for answers to those, but rather for a generic solution.

Obviously I need some hard coded logic to take care of each change.

What if someone doesn't upgrade from version x to x+1, but waits for x+2? Should I try to combine the changes, or just apply x -> x+1 followed by x+1 -> x+2?

What if version x+1 is buggy and we need to roll back to a previous version of the software, but have already "upgraded" the data structures?

I am leaning towards TLV (http://en.wikipedia.org/wiki/Type-length-value) but can see a lot of potential headaches.
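For concreteness, here is a minimal sketch of what a TLV encoding could look like for the address-book example. The tag values, the 2-byte little-endian length, and the function names are all illustrative assumptions, not a recommendation of a specific layout:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical field tags for the address-book example.
enum Tag : uint8_t { TAG_SURNAME = 1, TAG_FORENAME = 2, TAG_PHONE = 3 };

// Append one TLV record: 1-byte type, 2-byte little-endian length, payload.
void tlv_append(std::vector<uint8_t>& buf, uint8_t tag,
                const void* data, uint16_t len) {
    buf.push_back(tag);
    buf.push_back(static_cast<uint8_t>(len & 0xFF));
    buf.push_back(static_cast<uint8_t>(len >> 8));
    const uint8_t* p = static_cast<const uint8_t*>(data);
    buf.insert(buf.end(), p, p + len);
}

// Scan for a tag; unknown tags are skipped over, which is what makes
// TLV tolerant of fields added in later program versions.
bool tlv_find(const std::vector<uint8_t>& buf, uint8_t tag,
              std::vector<uint8_t>& out) {
    size_t i = 0;
    while (i + 3 <= buf.size()) {
        uint8_t  t   = buf[i];
        uint16_t len = static_cast<uint16_t>(buf[i + 1] | (buf[i + 2] << 8));
        if (i + 3 + len > buf.size()) return false;  // truncated record
        if (t == tag) {
            out.assign(buf.begin() + i + 3, buf.begin() + i + 3 + len);
            return true;
        }
        i += 3 + len;
    }
    return false;
}
```

The headache the asker anticipates is real: TLV handles added and removed fields for free, but renamed, split, or retyped fields still need explicit conversion logic somewhere.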

This is nothing new, so I just wondered how others do it....

asked Nov 03 '09 by Mawg says reinstate Monica

2 Answers

I do have some code where a longer string is pieced together from two shorter segments when necessary. Yuck. Here's my experience after 12 years of keeping some data compatible:

Define your goals - there are two:

  • new versions should be able to read what old versions write
  • old versions should be able to read what new versions write (harder)

Add version support to release 0 - at the very least, write a version header. Combined with keeping (potentially a lot of) old reader code around, that solves the first case, if crudely. If you don't want to implement case 2, start rejecting newer data right now!

If you need only case 1, and the expected changes over time are rather minor, you are set. Either way, these two things done before the first release can save you many headaches later.
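A minimal sketch of such a version header, assuming a 4-byte magic plus a format version written before any payload (the "ADBK" magic, the field names, and the version numbers are all made up for illustration):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Assumed layout: a 4-byte magic plus the format version the file was
// written with, stored before any payload.
struct FileHeader {
    char     magic[4];   // e.g. "ADBK" for the address book
    uint16_t version;    // format version of the writer
};

const uint16_t kCurrentVersion = 3;

// Returns the file's format version, or -1 if the file is unreadable,
// not ours, or written by a NEWER program (i.e. case 2 not implemented:
// reject newer data rather than misread it).
int read_header(FILE* f) {
    FileHeader h;
    if (std::fread(&h, sizeof h, 1, f) != 1) return -1;
    if (std::memcmp(h.magic, "ADBK", 4) != 0) return -1;
    if (h.version > kCurrentVersion) return -1;  // newer than we are
    return h.version;
}
```

On the embedded side you would likely write the two fields individually rather than dumping the struct, to keep padding and endianness out of the format.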

Convert during serialization - at run time, only keep the data in the "new format" in memory. Do necessary conversions and tests at persistence limits (convert to newest when reading, implement backward compatibility when writing). This isolates version problems in one place, helping to avoid hard-to-track-down bugs.
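As a sketch of converting at the persistence boundary, take the asker's own split-the-name scenario. The struct and function names are invented for the example; the point is that the upgrade happens at read time, so the rest of the program only ever sees the newest layout:

```cpp
#include <string>

// In memory we only ever keep the newest layout.
struct ContactV2 {
    std::string surname;
    std::string forename;
};

// Conversion lives at the persistence boundary: an old single-field
// "full name" record is upgraded the moment it is read, so no other
// code ever has to know the old layout existed.
ContactV2 upgrade_from_v1(const std::string& full_name) {
    ContactV2 c;
    std::string::size_type space = full_name.find(' ');
    if (space == std::string::npos) {
        c.surname = full_name;              // no forename recorded
    } else {
        c.forename = full_name.substr(0, space);
        c.surname  = full_name.substr(space + 1);
    }
    return c;
}
```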

Keep a set of test data from all versions around.

Store a subset of available types - limit the actually serialized data to a few data types, such as int, string, double. In most cases, the extra storage size is made up by reduced code size supporting changes in these types. (That's not always a tradeoff you can make on an embedded system, though).

For example, don't store integers shorter than the native width (though you might have to when storing long integer arrays).

Add a breaker - store some key that allows you to intentionally make old code display an error message saying this new data is incompatible. You can use a string that becomes part of the error message - then your old version can display a message it doesn't know about: "you can import this data using the ConvertX tool from our web site" is not great in a localized application, but still better than "Ungültiges Format" ("invalid format").
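One possible shape for such a breaker, with invented field names: alongside the version number, the writer stores the oldest reader version allowed and a human-readable message for readers that are too old:

```cpp
#include <cstdint>
#include <string>

// Assumed scheme: the writer records the minimum reader version plus a
// message that old readers show verbatim, so code shipped years ago can
// display instructions it never knew about.
struct Breaker {
    uint16_t    min_reader_version;  // oldest version allowed to read this
    std::string message;             // shown by readers that are too old
};

// Returns an empty string when this reader may proceed, otherwise the
// stored message to display instead of a generic "invalid format".
std::string check_breaker(const Breaker& b, uint16_t my_version) {
    if (my_version >= b.min_reader_version) return "";
    return b.message;
}
```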

Don't serialize structs directly - that's the logical/physical separation. We work with a mix of the two approaches below, each having its pros and cons. Neither can be implemented without some runtime overhead, which can pretty much limit your choices in an embedded environment. At any rate, don't use fixed array/string lengths during persistence; that alone should solve half of your troubles.

(A) A proper serialization mechanism - we use a binary serializer that lets us start a "chunk" when storing, each chunk carrying its own length header. When reading, extra data is skipped and missing data is default-initialized, which simplifies implementing "read old data" a lot in the serialization code. Chunks can be nested. That's all you need on the physical side, but it needs some sugar-coating for common tasks.
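A minimal sketch of the chunk idea. The 4-byte native-endian length prefix and the reader API are assumptions for illustration, not the answerer's actual format; the essential property is that a reader finding more bytes than it expects can skip them, and one finding fewer gets defaults:

```cpp
#include <cstdint>
#include <cstring>

// A bounded view over the serialized bytes.
struct ChunkReader {
    const uint8_t* p;    // current read position
    const uint8_t* end;  // one past the last readable byte
};

// Read a chunk's length prefix and hand back a reader limited to that
// chunk; the outer reader skips past the whole chunk regardless of how
// much of it the caller actually understands.
bool begin_chunk(ChunkReader& outer, ChunkReader& inner) {
    if (outer.end - outer.p < 4) return false;
    uint32_t len;
    std::memcpy(&len, outer.p, 4);
    outer.p += 4;
    if (static_cast<size_t>(outer.end - outer.p) < len) return false;
    inner = {outer.p, outer.p + len};
    outer.p += len;   // skip any trailing data we don't know about
    return true;
}

// Return the stored value, or `dflt` when the chunk is shorter than
// expected -- i.e. the field was added after this file was written.
uint32_t read_u32(ChunkReader& r, uint32_t dflt) {
    if (r.end - r.p < 4) return dflt;
    uint32_t v;
    std::memcpy(&v, r.p, 4);
    r.p += 4;
    return v;
}
```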

(B) Use a different in-memory representation - the in-memory representation could basically be a map<id, record>, where id would likely be an integer, and record could be:

  • empty (not stored)
  • a primitive type (string, integer, double - the less you use the easier it gets)
  • an array of primitive types
  • an array of records
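The representation above could be sketched like this. A tagged struct stands in for a union to keep it short, and the array cases are omitted; all names are invented:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Generic record: empty by default, or one primitive value.
// (A fuller version would add the array-of-primitives and
// array-of-records cases from the list above.)
struct Record {
    enum Kind { EMPTY, INT, DOUBLE, STRING } kind = EMPTY;
    int64_t     i = 0;
    double      d = 0.0;
    std::string s;
};

using Document = std::map<int, Record>;

// Querying a missing or mismatched id yields zero -- exactly the
// default-value behavior the next paragraph relies on.
int64_t get_int(const Document& doc, int id) {
    Document::const_iterator it = doc.find(id);
    return (it != doc.end() && it->second.kind == Record::INT)
               ? it->second.i : 0;
}

void set_int(Document& doc, int id, int64_t value) {
    doc[id].kind = Record::INT;
    doc[id].i = value;
}
```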

I initially wrote that so the guys wouldn't come to me with every format compatibility question, and while the implementation has many shortcomings (I wish I'd recognized the problem back then with the clarity of today...) it has solved most of these cases.

Querying a non-existing value will by default return a default/zero-initialized value. Keeping that in mind when accessing the data and when adding new data helps a lot: imagine version 1 calculates "foo length" automatically, whereas in version 2 the user can override that setting. A value of zero in the "calculation type" or "length" should then mean "calculate automatically", and you are set.

The following are "change" scenarios you can expect:

  • a flag (yes/no) is extended to an enum ("yes/no/auto")
  • a setting splits up into two settings (e.g. "add border" could be split into "add border on even days" / "add border on odd days".)
  • a setting is added, overriding (or worse, extending) an existing setting.

For implementing case 2, you also need to consider:

  • no value may ever be removed or replaced by another one (but in the new format it can be marked "not supported", and a new item added)
  • an enum may contain unknown values, and valid ranges may change in other ways

Phew, that was a lot. But it's not as complicated as it seems.

answered Sep 21 '22 by peterchen


There's a huge concept that the relational database people use.

It's called breaking the architecture into "Logical" and "Physical" layers.

Your structs are both a logical and a physical layer mashed together into a hard-to-change thing.

You want your program to depend on a logical layer. You want your logical layer to -- in turn -- map to physical storage. That allows you to make changes without breaking things.

You don't need to reinvent SQL to accomplish this.

If your data lives entirely in memory, then think about this. Divorce the physical file representation from the in-memory representation. Write the data in some "generic", flexible, easy-to-parse format (like JSON or YAML). This allows you to read in a generic format and build your highly version-specific in-memory structures.
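To keep the sketch dependency-free, here is the same idea with a trivial key=value text format standing in for JSON or YAML. Unknown keys are simply ignored and missing keys fall back to defaults, which is what lets the on-disk format and the in-memory struct evolve separately (all names are illustrative):

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse a forgiving generic format: one "key=value" pair per line.
// Unknown keys are kept in the map but never looked at.
std::map<std::string, std::string> parse_generic(std::istream& in) {
    std::map<std::string, std::string> kv;
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq != std::string::npos)
            kv[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return kv;
}

// This program version's highly specific in-memory struct.
struct ContactV3 {
    std::string name;
    std::string phone;
};

// Build the version-specific struct from the generic representation,
// defaulting any field the file doesn't have.
ContactV3 load_contact(std::istream& in) {
    std::map<std::string, std::string> kv = parse_generic(in);
    ContactV3 c;
    c.name  = kv.count("name")  ? kv["name"]  : "";
    c.phone = kv.count("phone") ? kv["phone"] : "unknown";
    return c;
}
```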

If your data is synchronized onto a filesystem, you have more work to do. Again, look at the RDBMS design idea.

Don't code a simple brainless struct. Create a "record" which maps field names to field values. It's a linked list of name-value pairs. This is easily extensible to add new fields or change the data type of the value.
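A literal rendering of that linked list of name-value pairs, kept C-style to match the question's constraints (names invented; string lifetimes are assumed to be managed by the caller):

```cpp
#include <cstring>

// One field of a record: a name, a value, and a link to the next field.
struct Field {
    const char* name;
    const char* value;
    Field*      next;
};

// Prepend a field; adding a brand-new field in a later program version
// is just another call to this, with no layout change anywhere.
Field* add_field(Field* head, const char* name, const char* value) {
    return new Field{name, value, head};
}

// Look a field up by name; nullptr means the field is absent in this
// record (e.g. it was written by an older version).
const char* find_field(const Field* head, const char* name) {
    for (const Field* f = head; f != nullptr; f = f->next)
        if (std::strcmp(f->name, name) == 0) return f->value;
    return nullptr;
}
```

Changing a field's type then becomes a matter of tagging the value, much like the map<id, record> scheme in the other answer.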

answered Sep 23 '22 by S.Lott