
Why don't Git and Mercurial use a database? [closed]

I found an email in which Linus Torvalds says:

...go play with Monotone. Really. They use a "real database".

That made me curious: why don't popular VCSs use databases, and why do they instead implement their own data storage models to achieve the same goals (transactions, durability, etc.)?

Gill Bates asked Oct 24 '25 23:10


2 Answers

Because databases usually have storage and retrieval methods designed for tasks largely tangential to those of a VCS. A special-purpose approach to managing data lets an implementation optimize its code heavily for the use cases of a VCS. While the needs of a DVCS storage subsystem could surely be mapped to the relational model of a "real database", why should they be? A DVCS does not need formal queries (and still less does it need SQL), and rather than trying to hint its database subsystem on how to go faster, it can simply implement the fastest and safest ways to access the data it manages.

Note that frustration with Monotone's horrid speed was the reason Linus started writing Git (he did consider existing DVCS solutions first, after BitMover pulled the rug from under the feet of the Linux developers). Another, less visible, system that uses a real database, Fossil, doesn't have stellar performance (PDF) either.

Git started as a minimal set of tools implementing a versioned file system, and its author (Linus Torvalds) originally envisioned that a full-blown VCS would be a separate tool built on top of Git. In reality, Git itself quickly started to accumulate features that made it a full-blown VCS, so while a certain separation of those levels still exists, they are not separate projects.

Two other interesting points about Git's storage subsystem:

  • Originally it just stored its objects in separate files. Later it was taught to transparently move the storage of the least frequently accessed objects into so-called "packfiles", which are a kind of compressed archive with built-in indexes for fast traversal and access.

    The point is that the devs studied the performance of the existing solution and carefully engineered an improvement that worked best for the problem at hand.

  • It is still being improved with regard to speed. For instance, another pile of patches speeding up the Git index (staging area) was discussed in the fall of last year.

    The point is that such improvements are not coded for their own sake but are based on studying performance under heavy real-world workloads.

Mercurial, which takes a different approach from Git's in the way it stores its data, uses a special storage format that facilitates the use of differential data (deltas between revisions).
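To make "differential data" a bit more concrete, here is a minimal sketch of delta-based storage: the first revision is kept in full and every later revision is kept only as a delta against its predecessor. This is an illustration of the general idea, not Mercurial's actual on-disk format; `ToyDeltaLog`, `make_delta` and `apply_delta` are hypothetical names invented for this example.

```python
import difflib


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal new lines."""
    delta = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: remember only the range in the old revision.
            delta.append(("copy", i1, i2))
        elif j2 > j1:
            # "replace" or "insert": carry the new lines literally.
            delta.append(("data", new[j1:j2]))
        # A pure "delete" needs no entry: the lines are simply not copied.
    return delta


def apply_delta(old, delta):
    """Rebuild the newer revision from the older one and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class ToyDeltaLog:
    """Hypothetical append-only log: one full snapshot, then deltas."""

    def __init__(self):
        self._base = None   # full text of revision 0
        self._deltas = []   # delta i turns revision i into revision i + 1

    def append(self, lines):
        """Store a new revision and return its revision number."""
        if self._base is None:
            self._base = list(lines)
            return 0
        tip = self.read(len(self._deltas))
        self._deltas.append(make_delta(tip, list(lines)))
        return len(self._deltas)

    def read(self, rev):
        """Reconstruct revision `rev` by replaying deltas from the base."""
        lines = list(self._base)
        for delta in self._deltas[:rev]:
            lines = apply_delta(lines, delta)
        return lines
```

Reading a late revision in this sketch replays the whole chain, which is exactly the kind of cost a purpose-built format can tune, for example by storing a fresh full snapshot once a chain grows too long.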

So it appears that the tools which do use a "real database" can be classified into these broad groups:

  • "Ideal design". This is Monotone and Fossil.

    Supposedly the creators of such tools think that using a "real database" gives them all the benefits of one (such as durability) for free. And these benefits are quite real (using SQLite for the storage also makes backups a no-brainer).

    While the benefits are real, the code implementing custom storage backends in other VCSs does provide durability as well. Note that while "real databases" employ clever tricks to try to ensure the data they store is always correct and consistent, they don't do any magic: everything still boils down to proper ordering of file operations, fsync() calls, etc.

  • "Enterprisey" way of thinking. This is Veracity for instance, which at least claimed support for RDBMS backends in its commercial plugins.

    Enterprises have usually already invested in a "big" database such as Oracle or SQL Server, and their management likes "high-profile" solutions. An upside of using such a system is that it is usually professionally administered and provides fine-grained access controls, backups, etc.

    The obvious downsides of using an RDBMS are the lack of distribution (the "D" is missing from "DVCS") and the loss of the general ease of setting things up.
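The durability remark in the first group ("proper ordering of file operations, fsync()s") can be sketched as the classic write-to-temp, fsync, rename pattern. This is a minimal illustration assuming a POSIX system, not the code of any particular VCS or database; `durable_write` is a hypothetical helper name.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace `path` with `data` so readers see the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # 1. Write the new content to a temporary file in the same directory,
    #    so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # 2. Force the file contents to stable storage before publishing.
            os.fsync(tmp.fileno())
        # 3. Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # 4. Flush the directory entry so the rename itself survives a crash
    #    (assumption: POSIX; opening a directory like this fails on Windows).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The ordering is the whole trick: the data is synced before the rename publishes it, and the directory is synced after, so a crash at any step leaves either the complete old file or the complete new one.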


Bonus reading which looks at custom storage formats from a different angle: Keith Packard's thoughts on why repository formats matter, and a short comment on some of his points from Mercurial's main developer.

kostix answered Oct 26 '25 15:10


Git is designed as a simple key-value data store. In that sense, it can itself be considered a database, and having this database at its core is one of the reasons for its efficiency and flexibility.
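As a rough illustration of the key-value idea: the key of a Git blob is the SHA-1 of a short header (`blob <size>\0`) followed by the content, and loose objects are stored zlib-compressed. The sketch below reproduces the key derivation and keeps values in a plain dictionary; `ToyObjectStore` is a hypothetical in-memory stand-in, not Git's real storage code, and it handles blobs only.

```python
import hashlib
import zlib


def git_blob_key(content):
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object database (blobs only)."""

    def __init__(self):
        self._objects = {}  # maps hex object ID -> compressed content

    def put(self, content):
        """Store content under its hash and return the key."""
        key = git_blob_key(content)
        # Identical content always maps to the same key, so it is stored once.
        self._objects.setdefault(key, zlib.compress(content))
        return key

    def get(self, key):
        """Return the original content for a key."""
        return zlib.decompress(self._objects[key])
```

For example, `git_blob_key(b"")` gives `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the well-known ID of the empty blob. Because the key is derived from the content, deduplication and integrity checking come almost for free, with no query layer involved.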

As an alternative answer to your question: Why would they?

Agis answered Oct 26 '25 14:10