N.B. I think answers are likely to be design-focused and therefore basically implementation agnostic, but I'm using Java+Hibernate with Postgres if there's some particularly well-suited solution using those technologies.
I have a table with a particular field which will hold large strings, let's say blog posts which on average are +10000 characters.
In my app, you can edit blog posts as many times as you like, and the latest version will always be displayed instantly after an update. However, the app needs to keep a full version history of these edits, so they can be viewed.
An obvious strategy is to keep a separate table, something like blog_post_history
, where blog post rows are inserted in duplicate on creation and each subsequent update to the main 'live' blog_post
table, with an incrementing version number, so these versions are all available if needed in future. I was considering using something like Hibernate Envers to set this up.
However it seems remarkably inefficient to store (and - maybe more importantly - transmit), multiple versions of a 10000 character block of text where the only difference between each one might be fixing typos, adding a few words, etc. Due to the nature of edits to blog posts, there are likely to be many small incremental changes like this, rather than fewer, larger changes.
Is there a better way?
I'm thinking something along the lines of storing only deltas between the current and previous version when an edit is made, and then reconstructing the version history from these deltas programatically when it's requested, perhaps on the client so the data sent over the wire is minimised.
I would most likely store the latest version as full-text, as I'd want to optimise for requesting this most frequently, then store a chain of deltas going backwards from the current version to reconstruct historical versions, as and when they are requested.
A solution I'm working on right now, which is working well so far, implements the design I proposed in the question
I'm thinking something along the lines of storing only deltas between the current and previous version when an edit is made, and then reconstructing the version history from these deltas programatically when it's requested, perhaps on the client so the data sent over the wire is minimised.
I would most likely store the latest version as full-text, as I'd want to optimise for requesting this most frequently, then store a chain of deltas going backwards from the current version to reconstruct historical versions, as and when they are requested.
I'll share the specifics of my implementation here
For creating deltas and using the to reconstruct the full-text I am using the fantastic google-diff-match-patch library. You can read the implementation agnostic API documentation to better understand the code examples below, though it's pretty readable anyway.
google-diff-match-patch has Java and JS implementations so I can use it to compute the deltas with Java on the server. I chose to convert each delta to a String both so it can be easily stored in the database, and easily consumed by the JS library on the client. More on this below.
public String getBackwardsDelta(String editedBlogPost, String existingBlogPost) {
diff_match_patch dmp = new diff_match_patch();
LinkedList<diff_match_patch.Patch> patches =
dmp.patch_make(editedBlogPost, existingBlogPost);
return dmp.patch_toText(patches);
}
N.B. something it took me a while to figure out was how to pull down the the official build of google-diff-match-patch using maven. It's not in the maven central repo, but on their own repo on googlecode.com. Just to note, some people have forked it and put their forked versions in maven central, but if you really want the official version you can get by adding the repo and dependency in your pom.xml
as follows
<repository>
<id>google-diff-patch-match</id>
<name>google-diff-patch-match</name>
<url>https://google-diff-match-patch.googlecode.com/svn/trunk/maven/</url>
</repository>
<dependency>
<groupId>diff_match_patch</groupId>
<artifactId>diff_match_patch</artifactId>
<version>current</version>
</dependency>
For the front end, I pass the latest blog post full-text, along with a chain of deltas going backwards in time representing each edit, and then reconstruct the full text of each version in the browser in JS.
To get the library, I'm using npm + browserify. The library is available on npm as diff-match-patch. Version 1.0.0 is the only version.
getTextFromDelta: function(originalText, delta) {
var DMP = require('diff-match-patch'); // get the constructor function
var dmp = new DMP();
var patches = dmp.patch_fromText(delta);
return dmp.patch_apply(patches, originalText)[0];
}
And that's it, it works fantastically.
In terms of storing the edits of the blog posts, I just use a table BLOG_POST_EDITS
where I store the blog post id, a timestamp of when the edit was made (which I later use to order the edits correctly to make the chain when reconstructing the full-text versions on the client), and the backwards delta between the current live blog post in the BLOG_POST
table, and the incoming edited version of the blog post.
I chose to store a 'chain' of deltas because it suits my use case well, and is simpler on the server code end of things. It does mean in order to reconstruct version M of N, I have to send the client a chain of N-(M-1) deltas back from the live blog post full-text to version M. But in my use case I happen to want to send the whole chain each time, anyway, so this is fine.
For slightly better over-the-wire efficiency for requesting specific versions, all deltas could be recalculated from the new edited blog post version back to each (restored) version each time an edit is made, but this would mean more work and complexity on the server.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With