Distributed databases are hard. Distributed databases where you don’t have full control over what shards run which version of your software are even harder, because it becomes near impossible to deal with fallout when things go wrong. For lack of a better term (is there one?), I’ll refer to these databases as Autonomous Shard Distributed Databases.
Distributed version control systems are an excellent example of such databases. They store file revisions and commit metadata in shards (“repositories”) controlled by different people.
Because of the nature of these systems, it is hard to weed out corrupt data if all shards ignorantly propagate broken data. There will be different people on different platforms running the database software that manages the individual shards.
This makes it hard - if not impossible - to deploy software updates to all shards of a database in a reasonable amount of time (though a Chrome-like update mechanism might help here, if that was acceptable). This has consequences for the way in which you have to deal with every change to the database format and model.
(e.g. imagine introducing a modification to the Linux kernel Git repository that required everybody to install a new version of Git).
Defensive programming and a good format design from the start are essential.
Git and its database format do really well in all of these regards. As I wrote in my retrospective, Bazaar has made a number of mistakes in this area, and that was a major source of user frustration.
I propose that every autonomous shard distributed databases should aim for the following:
For the “base” format, keep it as simple as you possibly can. (KISS)
The simpler the format, the smaller the chance of mistakes in the design that have to be corrected later. Similarly, it reduces the chances of mistakes in any implementation(s).
In particular, there is no need for every piece of metadata to be a part of the core database format.
(in the case of Git, I would argue that e.g. “author” might as well be a “meta-header”)
Corruption should be detected early and not propagated. This means there should be good tools to sanity check a database, and ideally some of these checks should be run automatically during everyday operations - e.g. when pushing changes to others or receiving them.
If corruption does occur, there should be a way for as much of the database as possible to be recovered.
A couple of corrupt objects should not render the entire database unusable.
There should be tools for low-level access of the database, but the format and structure should be also documented well enough for power users to understand it, examine and extract data.
No “hard” format changes (where clients /have/ to upgrade to access the new format).
Not all users will instantly update to the latest and greatest version of the software. The lifecycle of enterprise Linux distributions is long enough that it might take three or four years for the majority of users to upgrade.
Keep performance data like indexes in separate files. This makes it possible for older software to still read the data, albeit at a slower pace, and/or generate older format index files.
New shards of the database should replicate the entire database if at all possible; having more copies of the data can’t hurt if other shards go away or get corrupted.
Having the data locally available also means users get quicker access to more data.
Extensions to the database format that require hard format changes (think e.g. submodules) should only impact databases that actually use those extensions.
Leave some room for structured arbitrary metadata, which gets propagated but that not all clients need to be able to understand and can safely ignore.
(think fields like “Signed-Off-By”, “Reviewed-By”, “Fixes-Bug”, etc) in git commit metadata headers, or the revision metadata fields in Bazaar.