Nested Repository Handling in git and Mercurial

July 14, 2011 at 12:31 AM | Version control

git and Mercurial both have extensions for nesting particular versions of external repositories (git has submodules, Mercurial has subrepos), and I've found it interesting (and telling?) to compare the implementations.

To nest a repository called nested at .subrepos/nested in Mercurial, the line .subrepos/nested = nested is added to the file .hgsub, and the line [hash] .subrepos/nested (where [hash] is the hash of the nested repository's current revision) is added to the file .hgsubstate. As the version of nested changes, the .hgsubstate file is updated and committed like any other file.

In contrast, to nest nested at .modules/nested in git, the lines:

[submodule ".modules/nested"]
    path = .modules/nested
    url = nested

are added to .gitmodules, and an entry is added to the tree:

160000 commit [hash]        .modules/nested

I found this interesting because it reinforces my feelings towards both systems: I appreciate the simplicity and accessibility of Mercurial's implementation, and I appreciate the intellectual stimulation I got from learning about git's implementation.

But I also believe that git's implementation is inferior in every practical way.

Because Mercurial's subrepo state is stored in plain text files which are committed into the repository, any tool which is used to view/edit/exchange a repository will trivially be able to handle subrepos. For example, changes to subrepos can be trivially exchanged with standard diff/patch tools:

$ hg diff -c bump_nested
diff --git a/.hgsubstate b/.hgsubstate
--- a/.hgsubstate
+++ b/.hgsubstate
@@ -1,1 +1,1 @@
-[... old hash ...] .subrepos/nested
+[... new hash ...] .subrepos/nested

Versions pinned in subrepos can be easily changed with a text editor (for example, to resolve merge conflicts), and tools which interact with Mercurial repositories (for example, hgweb) will function correctly, even if they are ignorant of subrepos.
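To make the "it's just a text file" point concrete, here is a minimal Python sketch of re-pinning a subrepo by rewriting .hgsubstate text. The function names are hypothetical, and it assumes only the "[hash] path" line layout described above (with entries written back sorted by path); it is not Mercurial's own code.

```python
def parse_hgsubstate(text):
    """Map each subrepo path to its pinned revision hash."""
    pins = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<hash> <path>"; split once so paths may contain spaces.
        revision, path = line.split(" ", 1)
        pins[path] = revision
    return pins


def format_hgsubstate(pins):
    """Serialize pins back to text, one "<hash> <path>" line per subrepo.

    Sorting by path is an assumption made to keep the output stable.
    """
    return "".join(
        f"{revision} {path}\n" for path, revision in sorted(pins.items())
    )


def repin(text, path, new_revision):
    """Return .hgsubstate text with `path` pinned to `new_revision`."""
    pins = parse_hgsubstate(text)
    if path not in pins:
        raise KeyError(f"no subrepo pinned at {path!r}")
    pins[path] = new_revision
    return format_hgsubstate(pins)
```

Nothing here knows anything about Mercurial beyond the line format, which is the same property that lets diff, patch and a text editor work on subrepo pins.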

Contrast this with git's submodules, where submodule versions are “hidden” in the tree, and every tool must be aware of their existence. For example, it is impossible to use standard diff and patch tools:

$ git show bump_nested
...
diff --git a/.modules/nested b/.modules/nested
index f44396c..1fd6830 160000
--- a/.modules/nested
+++ b/.modules/nested
@@ -1 +1 @@
-Subproject commit [... old hash ...]
+Subproject commit [... new hash ...]

Instead, git am must be used. Any tool that interacts with a submodule-enabled repository must be aware of submodules; otherwise it will crash with “fatal: bad object” (i.e., because it likely assumes that any hash in the tree references a blob or tree stored in the repository, which is not true in the case of a commit entry). On top of that, submodule-specific commands must be learned and used to resolve submodule merge conflicts [0].
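As a rough illustration of what "being aware of submodules" means for a tool, here is a small Python sketch that reads lines in the format printed by git ls-tree (mode, type and hash, then a tab, then the path) and separates submodule entries from everything else. The function names are hypothetical and the hashes used with it would be placeholders; it is a sketch of the check, not of any particular tool.

```python
GITLINK_MODE = "160000"  # mode git records for a submodule ("commit") entry


def parse_tree_entry(line):
    """Split one `git ls-tree` line: "<mode> <type> <hash>\\t<path>"."""
    meta, path = line.split("\t", 1)
    mode, kind, object_id = meta.split()
    return {"mode": mode, "type": kind, "id": object_id, "path": path}


def is_submodule(entry):
    """True when the entry points at a commit in *another* repository."""
    return entry["mode"] == GITLINK_MODE and entry["type"] == "commit"


def split_entries(lines):
    """Return (local, submodules).

    Only the `local` entries have hashes that can be looked up in this
    repository's object store; looking up a submodule's hash here is what
    produces errors like "bad object".
    """
    local, submodules = [], []
    for line in lines:
        entry = parse_tree_entry(line)
        (submodules if is_submodule(entry) else local).append(entry)
    return local, submodules
```

Every tool that walks a tree needs some equivalent of the is_submodule check, which is exactly the burden a plain-text pin file avoids.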

All of these complications could be acceptable if they allowed git's submodules to be somehow “better” than a simpler scheme, like Mercurial's .hgsubstate… But, as far as I can tell, this implementation affords no practical benefit.

[0]

hint: after a submodule merge conflict, git submodule status will show all the revisions you might care about:

$ git submodule status
-595c7a8dd110ab3f0f305bb0f3d6356ca5d62d99 nested
-cacb40625cc891b33c9c935442c0180e8ba5ab15 nested
-5e71d23bc5d24d18e026bcf12773f3fade1ac6b9 nested

And you'll need to remember that the first line is the common ancestor, the second line is “our” version, and the third line is “their” version. To resolve the conflict, the standard git checkout --{ours,theirs} appears to do nothing; instead, you need to copy the hash of the desired revision, cd nested; git checkout $hash; cd .., then commit as normal.

Poll results: how often do programmers interact with their version control system?

January 05, 2011 at 08:12 PM | Version control

Out of curiosity, I polled Twitter, asking programmers how often they interact with their version control systems. These are the responses I got, along with (my guess at) the version control system each person uses:

  • 5 - 10 times a day. (?)
  • Lots of ‘working directory diff’, then a bit less for ‘commit’ and a lot less for ‘push’. (dvcs)
  • Not nearly as much as I should. (git)
  • Every few hours lately I wake up from some code daydreaming and do a diff to see how much has gone on, then "hg record" a few times. (hg)
  • When I have something that represents a "single" change; could be anywhere from two minutes/one line to afternoon/minor new module. Diff and revert are also minor but nontrivial contributors to VCS interaction while coding. Those happen approximately randomly. (git, cvs)
  • When I want to switch contexts (different branch or project), or when things are reasonably functional and I want to break things. In practice, about 2-8 times per full work session. (git)
  • Trying to do it much, much more. (?)
  • Depends on the team and VCS. Ideally, if I'm using a DVCS and coding heavily, probably hourly... But right now at work, since I'm dealing with a centralized P4 and a kludgy dev toolset, probably once a day. (dvcs, p4)
  • Often: I branch like a madman, commit when I am satisfied, and push/merge when i am complete. (git)
  • Every 10-15 minutes, typically. (git)
  • Usually one to three times a day. Should be more, but I don't do enough TDD these days, so commits are larger. (?)
  • I commit approximately hourly and diff/stat about as often. Revert once a day. (hg)
  • Very frequently. (hg)
  • Usually once a day, when I'm done with my tasks and all tests pass. But I'm the only one touching the code so I never have to update or merge. (?)

And my answer would be "every 15-30 minutes".

Anyway, that's all. Thanks to everyone who replied; I found the results interesting.

DVCSs and Changeset Numbering

January 29, 2010 at 05:30 PM | Version control

One of my big beefs with DVCSs is their version numbers.

For example, take two changesets I committed today: 6899 and 02a9. Which one is more recent? How many changesets separate them? Without access to a repository, there's no way to tell… But that sort of information can be useful to have.

Two of the three DVCSs I have experience with, bzr and hg, take steps towards solving this problem.

bzr tries "really hard" to give everything a sequential ID, and uses those IDs in the UI (as far as I can tell) all the time (it does use unique hashes under the hood, but they aren't shown very much):

$ bzr log
------------------------------------------------------------
revno: 3
message:
  A merge
    ------------------------------------------------------------
    revno: 1.1.1
    message:
      A conflicting change

hg assigns local aliases for each changeset, and displays those aliases along with hashes:

$ hg log -r 4:6
changeset:   4:658109dca65b
description:
A merge

changeset:   5:bd8053bf02f1
description:
A conflicting change

And, of course, git doesn't stand for this sort of frivolity and shows pure, unadulterated hashes:

$ git log
commit aa55884d693c92da6dc96eb7a45c9ecd774fefc2

    A merge

commit 8e47937468071ae29d385b76ff925d231c65b97b

    A conflicting change

I'd like to see this taken a step further, though: I'd like the changeset hashes themselves to encode some basic information about where they live in the repository.

For example, one way to do this could be using the first two hex digits of the hash to store the distance from the root*, and the next two hex digits to store a "repository id", which is generated once, when a repository is first cloned or initialized.

So, for example, one of these changeset hashes might look like this: 0afc53bf02f1 (distance 0a, repository id fc), and committing again to the same repository would produce 0bfc109dca65.

Of course, this scheme doesn't guarantee anything - it's entirely possible to generate two completely different changesets with the exact same four-digit prefix… But in the general case, this sort of scheme could make it significantly easier to figure out how arbitrary changesets relate to one another.
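Here's a small Python sketch of the idea, following the layout of the example IDs above: two hex digits for the distance from the root, two for the repository id, then the start of an ordinary content hash. The function names and the wrap-around at 256 are my assumptions for illustration; no existing DVCS works this way.

```python
def changeset_id(distance, repo_id, content_hash):
    """Build a 12-digit ID: distance, then repository id, then 8 hash digits.

    Both fields wrap at 256 so that each always fits in two hex digits
    (an assumption; a real scheme would need a wider field).
    """
    prefix = f"{distance % 256:02x}{repo_id % 256:02x}"
    return prefix + content_hash[:8]


def decode_prefix(changeset):
    """Recover (distance, repo_id) from the first four hex digits."""
    return int(changeset[0:2], 16), int(changeset[2:4], 16)


def probably_newer(a, b):
    """Guess whether changeset `a` is further from the root than `b`.

    Only a hint: the comparison is meaningless once distances wrap, and
    it says nothing about changesets on unrelated branches.
    """
    return decode_prefix(a)[0] > decode_prefix(b)[0]
```

With this, changeset_id(10, 0xfc, "53bf02f1") gives 0afc53bf02f1, and anyone holding just the two IDs can see that 0bfc109dca65 came from the same repository, one commit later.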

*: Doug Philips suggested this - thanks.

Using git for Backup is Asking for Pain

December 08, 2009 at 01:10 PM | Version control

git isn't a backup system.

Neither is Mercurial, Bazaar, Subversion or even (even) CVS.

Version control systems, with the possible exception of SourceSafe, are great at keeping track of code. Why is that? Because they were designed to keep track of code.

Unfortunately, though, the features of a good VCS are entirely different from (and often exactly the opposite of) the features which make a good backup system.

Take, for example, file ownership. A good VCS will, very rightly, ignore file ownership: when I check out someone else's code, I should be the owner of those files - not whatever uid originally created them. A good backup system, on the other hand, will do everything in its power to preserve file ownership: when I restore from my backups, I want /etc/shadow to be owned by root and /home/wolever/ to be owned by wolever.

And ownership is just one example - permissions†, creation and modification times, empty directories‡, hardlinks, xattrs, resource forks, … the list of details that a backup system must keep track of goes on and on.
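To give a feel for how much of this lives outside a file's contents, here's a minimal Python sketch that collects a few of those details for one path on a POSIX system. The function name is hypothetical, and it deliberately covers only what os.lstat reports; creation times, xattrs and resource forks would each need more work.

```python
import os
import stat


def backup_metadata(path):
    """Collect metadata a backup would need to restore alongside the bytes."""
    info = os.lstat(path)  # lstat: describe a symlink itself, not its target
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        # All permission bits, including suid/sgid/sticky, not just 'x'.
        "mode": stat.S_IMODE(info.st_mode),
        "mtime": info.st_mtime,
        # More than one link means the file is hardlinked somewhere else.
        "links": info.st_nlink,
    }
```

A VCS throws nearly all of this away on commit, which is the right call for code and the wrong one for a restore.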

In fact, there are so many things a backup system can get wrong, there is a project called Backup Bouncer, designed specifically to verify that backup scripts correctly copy all the various bits of metadata tracked by the filesystem.

So, please: if you value your bytes, use a real backup system, not git.

†: Most VCSs only track the 'x' bit - for backup purposes, all bits, including suid bits, must be tracked.
‡: fun fact - Mercurial and git don't track empty directories, but Bazaar does.

Why I Don't Like git

August 27, 2008 at 09:59 AM | Version control

I've got some code checked out with git-svn, and I'd like to do two things: pull in new revisions and push up my local revisions. Pretty simple, brainless thing, right? But we're dealing with git, so of course not :-)

# Fortunately I'm smart enough to remember that when you
# want to push changes to SVN, you've obviously got to use: 
$ git-svn dcommit
...
$
# Hurra! It worked! (at least I think...)
# Now for the next feat, pulling in more revisions...
$ git svn
...
  fetch            Download new revisions from SVN
...
# Phew! That looks like exactly what I need... Maybe it's not so tricky after all!
$ git svn fetch
        A       basie/a3c/admin.py
r13 = 45f5309121a77f33f8bd87009671727c0e2dc4a5 (zuze)
# Sweet! Now I can take a look at what has been changed
$ cat basie/a3c/admin.py
cat: basie/a3c/admin.py: No such file or directory
# Hu? I thought I just fetched it...
# Crap, right! I'm NOT supposed to use fetch, I'm supposed to use 'rebase'
# (not, of course, that I understand why "pulling in new revisions from SVN"
#  is equivalent to a mung-your-history-rebase... But that's ok, I guess)
$ git svn rebase
...
$ cat basie/a3c/admin.py
cat: basie/a3c/admin.py: No such file or directory
# Blast.  That didn't work either.
# Right, that's because 'fetch' updates the git repository but not the
# working tree... Alright, let's try an update
$ git up
git: 'up' is not a git-command. See 'git --help'.
# Crap, right, git is too cool to have 'update'... I think I need reset
$ git reset
docs/cmdline.wiki: needs update
# hhmm... Why is it telling me that the files I edited need update?
# And how on earth do I update them?
# I give up :-( Can someone smarter than I am tell me what to do?

I guess I'll go back to Bazaar... It may be slow as molasses on a cold day, but at least it is simple enough that mere mortals can use it without resorting to Google.
