Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How are version control histories stored and calculated?

Consider this simple python code, which demonstrates a very simple version control design for a dictonary:

def build_current(history):
    current = {}
    for action, key, value in history:
        assert action in ('set', 'del')
        if action == 'set':
            current[key] = value
        elif action == 'del':
            del current[key]
    return current

history = []
history.append(('set', '1', 'one'))
history.append(('set', '2', 'two'))
history.append(('set', '3', 'three'))
print build_current(history)
history.append(('del', '2', None))
history.append(('set', '1', 'uno'))
history.append(('set', '4', 'four'))
print build_current(history)
for action, key, value in history:
    if key == '2':
        print '(%s, %s, %s)' % (action, key, value)

Notice that by using the history list you can reconstruct the current dictionary in any state it once existed. I consider this a "forward build" (for lack of a better term) because to build the current dictionary one must start at the beginning and process the entire history list. I consider this the most obvious and straight forward approach.

As I've heard, early version control systems used this "forward build" process, but they were not optimal because most users care more about recent versions of a build. Also, users don't want to download the entire history when they only care about seeing the latest build.

My question then is, what other approaches exist for storing histories in a version control system? Perhaps a "backwards build" could be used? That might allow users to only download recent revisions without needing the entire history. I've also seen a few different formats for storing the history, namely: changesets, snapshots, and patches. What are the differences between changesets, snapshots and patches?

Of the modern popular version controls available, how do they store their histories and what are the advantages of their various designs?

like image 436
Buttons840 Avatar asked Jan 11 '12 18:01

Buttons840


People also ask

What is version control history?

Version control, also known as source control, is the practice of tracking and managing changes to software code. Version control systems are software tools that help software teams manage changes to source code over time.

How do I keep track of document versions?

The most popular way to maintain version control of documents is using a revision control system. It's the best way to manage changes and track who made what changes and when. Document control software is used in software development and many other fields, but they're particularly helpful for managing documents.

What are the three types of version control?

The types of VCS are: Local Version Control System. Centralized Version Control System. Distributed Version Control System.


1 Answers

You mentioned these 3 methods of storing (file)-history:

  1. patch : a patch is the (usually textual, but binary patches are also possible) representation of the difference between two files. It is the output of unix command diff and can be applied by unix command patch. A lot of versioning systems are using patches to store the history of files (eg. SVN, CVS, GIT..). Sometimes these patches are technically called "delta" as the greek letter "Δ" describing the difference of two things.
  2. changeset: a changeset is a term to combine changes "which belonging together" to different files in a single entity. Not all versioning systems support changesets (most notable CVS and SourceSafe). Developer are using changesets to avoid broken builds(example: change the signature of a method in one file, change the call in a second file. You need to have both changes in place to run the program, otherwise you get an error). See also here for the difference between changeset and patch.
  3. snapshots: are full copies of the state of this file/filesystem to this point of time. They are usually quite large and their usage depends on performance characteristics. The snapshot is always redundant to a list of patches, however to retrieve information faster, sometimes Versioning Systems mix or combine patches and snapshots

Subversion uses forward deltas(aka patches) in FSFS repositories and backward deltas in BDB Repositories. Note that these implementations have different performance characteristics:

  • forward deltas are fast in committing, but slow on checkouts(as the "current" version must be reconstructed)

  • backward deltas are fast in checking out but slow on commit as new deltas must be constructed to construct the new current and rewrite the previous "current" as a bunch of deltas

Also note that FSFS uses a "skipping delta" algorithm which minimizes the jumps to reconstruct a specific version. This skipping delta however is not size optimized as mercurials snapshots; it just minimizes the number of "revisions" you need to build a full version regardless of the overall size.

Here a small ascii art (copied from the specification) of a file with 9 revisions:

0 <- 1    2 <- 3    4 <- 5    6 <- 7
0 <------ 2         4 <------ 6
0 <---------------- 4
0 <------------------------------------ 8 <- 9

where "0 <- 1" means that the delta base for revision 1 is revision 0.

The number of jumps is at most log(N) for N revisions.

Also a very pretty effect on FSFS is that older revision will be written only once and after this they will be only read by further actions. That's why subversion repositories are very stable: as long as you do not have a HW failure on your harddisk, you should be able to get a working repository even if some corruption occurred in the last commit: You still have all older revisions.

In BDB Backend you have constantly rewrite the current revision on checkins/commits, which makes this process prone to data corruption. Also as you store the full text only in current revision, corrupting the data on commit will likely destroy great parts of your history.

like image 199
Peter Parker Avatar answered Oct 12 '22 12:10

Peter Parker