Couchdb came onto my radar since distributed stuff is interesting to me these days. But most of what was being written about it put me off, since it seemed to be very web-oriented, with javascript and html and stuff stored in the database, served right out of it to web browsers in an AJAXy mess.

Also, it's a database. I decided a long, long time ago not to mess with traditional databases. (They're great, they're just not great for me. Said the guy leaving after 5 years in the coal mines.)

Then I saw Damien Katz's talk about how he gave up everything to go off and create couchdb. Was very inspirational. Seemed it must be worth another look, with that story behind it.

Now I'm reading the draft O'Reilly book. I like some things, as expected don't like others[1], and am not sure what to think overall (plus I still have half the book to get through), but it has spurred some early thoughts:

... vs DVCS

Couchdb is very unlike a distributed VCS, and yet it's moved from traditional database country much closer to VCS land. It's document oriented, not normalized; the data stored in it has significant structure, but is also in a sense freeform. It doesn't necessarily preserve all history, but it does support multiple branches, merging, and conflict resolution.

Oddly, the thing I dislike most about it is possibly its biggest strength compared to a VCS, and that is that code is stored in the database alongside the data. That means that changes to the data can trigger processing, so it is mapped, reduced, views are updated, etc, on demand. This is done using code that is included in the database, and so is always available, and runs in an environment couchdb provides -- so replicating the database automatically deploys it.

Compare with a VCS, where anything that is triggered by changes to the data is tacked onto the side in hooks, has to be manually set up, and so is poorly integrated overall.

Basically, what I've been doing with ikiwiki is adding some smarts about handling a particular kind of data, on top of the VCS. But this is done via a few narrow hooks; cloning the VCS repository does not get you a wiki set up and ready to go.

There are good reasons why cloning a VCS repository does not clone the hooks associated with it. The idea of doing so seems insane; how could you trust those hooks? How could they work when cloned to another environment? And so that's Never Been Done[2]. But with couchdb's example, this is looking to me like a blind spot, that has probably stunted the range of things VCSs are used for.

If you feel, like I do, that it's great we have these amazing distributed VCSs, with so many advanced capabilities, but a shame that they're only used by software developers, then that is an exciting thought.


[1] Javascript? Mixed all in a database with data it runs on? Imperative code that's supposed to be side-effect free? (I assume the Haskell guys have already been all over that.) Code stored without real version control? Still having a hard time with this. :)

[2] I hope someone will give a counterexample of a VCS that does so in the comments?

Posted in the wee hours of Tuesday night, October 28th, 2009 Tags: git

I've used unison for a long while for keeping things like my music in sync between machines. But it's never felt entirely safe, or right. (Or fast!) Using a VCS would be better, but would consume a lot more space.

Well, space still matters on laptops, with their smallish SSDs, but I have terabytes of disk on my file servers, so VCS space overhead there is no longer of much concern for files smaller than videos. So, here's a way I've been experimenting with to get rid of unison in this situation.

  • Set up some sort of networked filesystem connection to the file server. I hate to admit I'm still using NFS.

  • Log into the file server, init a git repo, and check all your music (or whatever) into it.

  • When checking out on each client, use git clone --shared. This avoids including any objects in the client's local .git directory.

    git clone --shared /mnt/fileserver/stuff.git stuff

  • Now you can just use git as usual, to add/remove stuff, commit, update, etc.
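
To see what --shared actually does, here is a small sketch. The scratch paths and the demo_shared_clone name are made up for illustration. It builds a throwaway repository, clones it with --shared, and prints the clone's objects/info/alternates file, which is where git notes that objects should be looked up in the origin repository rather than copied.

```shell
# Sketch: show that a --shared clone borrows objects from its origin.
# demo_shared_clone is a hypothetical helper; it only touches a scratch directory.
demo_shared_clone() (
    set -e
    scratch=$(mktemp -d)
    cd "$scratch"
    git init -q origin
    cd origin
    echo "some music" > song.txt
    git add song.txt
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "add song"
    cd ..
    git clone -q --shared origin client
    # The alternates file lists the object directories this clone borrows from.
    cat client/.git/objects/info/alternates
    cd /
    rm -rf "$scratch"
)
```

Running demo_shared_clone should print a path ending in origin/.git/objects. That one line is also why the clone is no backup: take the origin away and the clone has nothing to read.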

Caveats:

  • git add is not very fast. Reading, checksumming, and writing out gig after gig of data can be slow. Think hours. Maybe days. (OTOH, I ran that on a Thecus.)
  • Overall, I'm happy with the speed, after the initial setup. Git pushes data around faster than unison, despite not really being intended to be used this way.
  • Note the use of git clone --shared above, and read the caveats about this mode in git-clone(1).
  • git repack is not recommended on clients because it would read and write the whole git repo over NFS.
  • Make sure your NFS server has large file support. (The userspace one doesn't; kernel one does.) You don't just need it for enormous pack files. The failure mode I saw was git failing in amusing ways that involved creating empty files.
  • Git doesn't deal very well with a bit flipping somewhere in the middle of a 32 gigabyte pack file. And since this method avoids duplicating the data in .git, the clones are not available as backups if something goes wrong. So if regenerating your entire repo doesn't appeal, keep a backup of it.

(Thanks to Ted Ts'o for the hint about using --shared, which makes this work significantly better, and simpler.)

Posted Wednesday afternoon, January 21st, 2009 Tags: git

I'm working on designing a microformat that can be used to indicate the location of VCS (git, svn, etc) repositories related to a web page.

I'd appreciate some web standards-savvy eyes on my rel-vcs microformat rfc.

If it looks good, next steps will be making things like gitweb, viewvc, ikiwiki, etc, support it. I've already written a preliminary webcheckout tool that will download an url, parse the microformat, and run the appropriate VCS program(s).

(Followed by, with any luck, github, ohloh, etc using the microformat in both the pages they publish, and perhaps, in their data importers.)
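
A rough sketch of the parsing half of a webcheckout-style tool, to make the idea concrete. It assumes the microformat ends up as link elements of the form <link rel="vcs-TYPE" href="URL" ...>, with rel written before href and double-quoted attributes; the rfc may well settle on something different. The function names are made up, and the sketch only prints the commands it would run.

```shell
# Sketch: pull "TYPE URL" pairs out of HTML read from stdin.
# Assumes <link rel="vcs-TYPE" href="URL" ...>; a real tool would use an HTML parser.
vcs_links() {
    # Put each tag on its own line, then pick out the vcs links.
    tr '\n' ' ' | tr '>' '\n' |
    sed -n 's/.*<link[^<]*rel="vcs-\([a-z]*\)"[^<]*href="\([^"]*\)".*/\1 \2/p'
}

# Print the checkout command for each link instead of running it.
checkout_commands() {
    vcs_links | while read -r type url; do
        case "$type" in
            git) echo "git clone $url" ;;
            svn) echo "svn checkout $url" ;;
            hg)  echo "hg clone $url" ;;
            bzr) echo "bzr branch $url" ;;
            *)   echo "unsupported vcs: $type" >&2 ;;
        esac
    done
}
```

Feeding it a downloaded page, for example with curl -s URL | checkout_commands, would list one checkout command per repository the page advertises.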

Why? Well,

  1. A similar approach worked great for Debian source packages with the XS-VCS-* fields.
  2. Pasting git urls from download pages of software projects gets old.
  3. I'm tired of having to do serious digging to find where to clone the source to websites like Keith Packard's blog, or cariographics.org, or St Hugh of Lincoln Primary School. Sites that I know live in a git repo, somewhere.
  4. With the downturn, hosting sites are going down left and right, and users who trusted their data to these sites are losing it. Examples include AOL Hometown and Ficlets, Google lively, Journalspace, podango, etc etc. Even livejournal's future is looking shaky. Various people are trying to archive some of this data before it vanishes for good. I'm more interested in establishing best practices that make it easy and attractive to let all the data on your website be cloned/forked/preserved. Things that people bitten by these closures just might demand in the future. This will be one small step in that direction.

Posted late Tuesday evening, January 6th, 2009 Tags: git

I'm writing a piece of autobiography/alternate world fiction, using git. Whether it will get finished or be any good, or be too personal to share I don't know. The idea though is sorta interesting -- a series of descriptions of inflection points in a life, each committed into git at the time it describes. As the life paths diverge, branches form, but never quite merge.

Reading this would not be quite like reading one of those choose your own adventure books. Rather you'd start at the end of a path and read back through the choices and events that led there. Or browse around for interesting nuggets in gitk. Or perhaps the point isn't that it be read at all, but is instead in the writing, and the committing.

Posted at midnight, November 26th, 2008 Tags: git writing

So, ikiwiki keeps wikis in git. But until today, that's only meant that the wiki's owners can edit it via git. Everyone else was stuck using the web interface.

Wouldn't it be nice then if anyone could check out the wiki source, modify it, and push it back? Now you can!

git clone git://git.ikiwiki.info/
cd git.ikiwiki.info
vim doc/sandbox.mdwn
git commit -a -m "I'm in your git, editing your wiki."
git push

The secret sauce, that makes this not a recipe for disaster but just a nice feature, is that ikiwiki checks each change as it's pushed in, and rejects any changes that couldn't be made to the wiki with a web browser.
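
To give a flavour of that kind of check, here is a minimal sketch of what a receive-side hook could do with the list of changed files. The policy (only .mdwn pages, no dotfiles, plain adds/edits/deletes) is a made-up stand-in, not ikiwiki's actual rule set, and the function names are hypothetical.

```shell
# Sketch: decide whether one changed file is something a web edit could have produced.
# The rules below are illustrative assumptions, not ikiwiki's real checks.
change_allowed() {
    status=$1 file=$2
    case "$status" in
        A|M|D) ;;                 # added, modified, deleted
        *) return 1 ;;            # renames, type changes, etc. are refused
    esac
    case "$file" in
        .*|*/.*) return 1 ;;      # no hidden files or directories
        *.mdwn) return 0 ;;       # wiki pages are fine
        *) return 1 ;;            # anything else is refused
    esac
}

# Read "STATUS<TAB>PATH" lines (the format of git diff-tree --name-status)
# and fail on the first change that is not allowed.
check_changes() {
    while IFS="$(printf '\t')" read -r status file; do
        change_allowed "$status" "$file" || {
            echo "rejected: $status $file" >&2
            return 1
        }
    done
}
```

A hook would feed it something like git diff-tree -r --name-status OLD NEW for each pushed ref and refuse the push when it returns nonzero.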

So if you use ikiwiki for a wiki, you might want to turn on untrusted git push.

Posted at teatime on Friday, October 24th, 2008 Tags: git

I'm envisioning a graphical app that displays a file. Like a pager, the up and down arrows move through the file. But the left and right arrows move through time. As each successive change to the file is displayed, the committer's name appears in a column to the left of the lines changed in that commit. Hover the mouse over it to see the commit message. Names of old committers will fade out as time advances, but still be visible for a while. (A menu option will disable the fade out entirely.)

A nice bonus feature would be to allow opening multiple windows, with multiple files from the same repo. Moving back and forward in time would affect them all at once.

A nice, but getting harder feature would be to have a horizontal timeline at the bottom, including branches, so you could click on a specific branch to visit it. (Without this, when passing a fork or merge point, it would have to choose a branch heuristically?)

A tricky subtle feature would be to attempt to keep the current code block centered in the display as lines are added/removed from the file, adjusting scroll bar position to compensate.

There seems to be a gannotate for bzr, that may do something like this. Offline so I can't try it.

Google-and-caffeine-fed update: bzr gannotate is closest to what I envisioned, though without a few of the bonuses (fade-out, smart scrolling, multiple files). qgit's "tree view" includes the same functionality, but the interface isn't as nice.

Posted late Sunday night, July 21st, 2008 Tags: git

Dear LazyWeb,

I use the standard ciabot.pl script in a git post-receive hook. This works ok, except in the case where changes are made in a published branch, and then that branch is merged into a second published branch.

In that case, ciabot.pl reports all the changes twice, once when they're committed to the published branch, and again when the branch is merged. This is worst when I sync up two branches; if there were a lot of changes made on either branch, they all flood into the irc channel again.

Am I using the ciabot.pl script wrong, or is there a better script I should use? Or maybe there's a CIA alternative that is smarter about git commits, so it will filter out duplicates?

Here, FWIW, is how I currently use it in my post-receive hook.

while read oldrev newrev refname; do
    refname=${refname#refs/heads/}
    [ "$refname" = "master" ] && refname=
    for merged in $(git rev-list --reverse $newrev ^$oldrev); do
        ciabot_git.pl $merged $refname
    done
done

Update: After hints and discussion from Buxy, I arrived at the following:

while read oldrev newrev refname; do
    branchname=${refname#refs/heads/}
    [ "$branchname" = "master" ] && branchname=
    for merged in $(git rev-parse --symbolic-full-name --not --branches | egrep -v "^\^$refname$" | git rev-list --reverse --stdin $oldrev..$newrev); do
        ciabot_git.pl $merged $branchname
    done
done

With this, changes available in another published branch are not sent to CIA.

There might still be some bugs with this.
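
One likely candidate, as far as I can tell: when a branch is created or deleted, the hook is handed an all-zero id for the old or new revision, and $oldrev..$newrev is then not a valid range. Here is a sketch of the same filtering pulled out into a function that handles those two cases. new_commits and is_zero are names I made up, and I have not run this against a real CIA setup.

```shell
# Sketch: list the commits a ref update introduces that no other branch already has.
# Must be run inside the repository, as a post-receive hook would be.
is_zero() {
    # True if the id consists only of zeros (what git passes for "no revision").
    case "$1" in
        *[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

new_commits() {
    oldrev=$1 newrev=$2 refname=$3
    # Branch deleted: nothing to report.
    if is_zero "$newrev"; then
        return 0
    fi
    # Branch created: there is no old revision to start from.
    if is_zero "$oldrev"; then
        range=$newrev
    else
        range=$oldrev..$newrev
    fi
    # Exclude everything reachable from the other branches.
    git rev-parse --symbolic-full-name --not --branches |
        grep -v -x -F "^$refname" |
        git rev-list --reverse --stdin "$range"
}
```

The hook loop would then iterate over $(new_commits "$oldrev" "$newrev" "$refname") instead of the inline pipeline. Using grep -F -x here also avoids treating the ref name as a regular expression.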

Posted Sunday evening, July 6th, 2008 Tags: git

Done some interesting stuff in ikiwiki this evening..

Maybe you want to set up a mirror of a wiki. It's easy enough to do with an ikiwiki that's backed by git since you can just clone its repository and set up the mirror. But how to know when there's an update of the origin wiki, to update your mirror? I've added a plugin that allows you to edit a page on the origin wiki, and ask it to ping your wiki. And another plugin that your wiki can use to listen for pings and update itself, pulling down the changes from version control.

Nice thing about this is that any ikiwiki wiki that publishes its revision control, and enables the pinger plugin, can then be mirrored by anyone, with no coordination needed with its admin. Even multilevel mirror networks are possible to set up. (The astute may notice that loops are also possible.. but they will be broken after 1 cycle.)

But this doesn't only allow mirroring. If you're using distributed version control, it also allows branching of a wiki. Just mirror as usual, but then make changes to the mirror, and don't send them back to the origin. Instant branch, that will be kept up-to-date with changes made to the origin. (Unless there's a conflict, that would need to be manually resolved, obviously.)

Wouldn't it be nice if you could git clone git://wikipedia.org/ or git://wiki.debian.org/ and go off and make it into something you're really happy with? Only thing standing in the way is that neither site uses ikiwiki. For now, you'll have to settle with cloning and branching git://git.ikiwiki.info/ :-)

Technical details here.

Posted Tuesday evening, May 6th, 2008 Tags: git

Often if you see a block diagram like this, what comes to mind is a compatibility layer in between a program and several operating systems. Generally something that's general-purpose like java, or a web browser, or a widget toolkit.

-------------
|           |
|           |
-------------
|           |
-------------
|  |  |  |  |
-------------

(Generally it's drawn up all purty, but I'm lame.)

But lately I've seen and written a lot of code where the diagram is more complex:

-------------
|           |
|  program  |
|           |
|  |  |  |  |
-------------
| V| C| S|  |
-------------
|    OS     |
-------------

Sometimes the program code is littered with multiple switch statements, as in debcheckout, debcommit, and etckeeper.

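case "$vcs" in
git) : ;;  # git-specific commands go here
svn) : ;;  # svn-specific commands go here
hg)  : ;;  # hg-specific commands go here
esac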

Sometimes it pushes the VCS-specific code into modules.

use IkiWiki::Rcs::$rcs;
rcs_commit();

But if it does, these modules are specific to that one program. This isn't a general-purpose library. dpkg source v3 doesn't need to use the VCS in the same way as ikiwiki, and even ikiwiki's rcs_commit is very specific to ikiwiki, in its error handling, conflict resolution, locking, etc.

pristine-tar injects and extracts data directly from git, using low-level git plumbing, in a way that probably can't be done at all with other VCSes. But even as I was adding that low-level, very git-specific code into pristine-tar, I found myself writing it with future portability in mind.

if ($vcs eq 'git') {
    # git write-tree, git update-index, and other fun
}
else {
    die "unsupported vcs $vcs";
}

When Avery Pennarun talks about git being the next unix, he's talking about programs like these, that build on top of a VCS.

But if git is the next unix, then so is mercurial, so is darcs, so is bzr, so too even svn (unless it's Windows?). In other words, we're back to the days when every program had to be ported to a bunch of incompatible and not-quite-compatible operating systems. Back to the unix wars.

In Elijah's discussion of the "limbo" VCS state he gives several great examples of how multiple VCSes that each seem on the surface to offer similar commands like "$vcs add" and "$vcs commit" can behave very differently.

echo 1 > foo
$vcs add foo
echo 2 > foo
$vcs commit

What was committed, "1" or "2"? Depends on which $vcs you use.
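
For git at least, the answer is easy to check. This sketch runs the same sequence in a scratch repository and prints what actually landed in the commit; limbo_demo is just a name I picked. With git it prints 1, because a plain git commit records what git add staged. For a VCS whose commit reads the working copy instead, I'd expect 2.

```shell
# Sketch: run the add/modify/commit sequence against git and show the result.
limbo_demo() (
    set -e
    scratch=$(mktemp -d)
    cd "$scratch"
    git init -q .
    echo 1 > foo
    git add foo
    echo 2 > foo
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "which one?"
    # Show the committed content of foo, not the working copy.
    git show HEAD:foo
    cd /
    rm -rf "$scratch"
)
```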

Compare with unix where open(2) always opens a file, perhaps with different options, or different handling of large files, but portably enough that you generally don't need to worry about it. Even if you're porting to Windows, you can probably get away with a compatibility layer to call whatever baroque monstrosity Windows uses to open a file, and emulate open(2) close enough to not have to worry about it most of the time.

A thin compatibility layer that calls "$vcs add" isn't very useful for a program that builds on top of multiple VCSes. mr is essentially such a thin compatibility layer; it manages to be useful only by being an interface for humans, who can deal with different limbo implementations and other quirks.
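
To show just how thin such a layer is, here is a sketch. It guesses the VCS from the metadata directory it finds and maps one verb onto it. This is not mr's code; detect_vcs and vcs_add are hypothetical names, and the sketch only prints the command it would run.

```shell
# Sketch: guess which VCS manages a directory by looking for its metadata dir.
detect_vcs() {
    dir=${1:-.}
    for vcs in git svn hg bzr; do
        if [ -e "$dir/.$vcs" ]; then
            echo "$vcs"
            return 0
        fi
    done
    return 1
}

# A deliberately thin layer: same verb everywhere, different behaviour underneath.
vcs_add() {
    dir=$1 file=$2
    vcs=$(detect_vcs "$dir") || { echo "no known VCS in $dir" >&2; return 1; }
    echo "$vcs add $file"
}
```

Nothing in it knows whether a later commit will pick up the staged or the working copy version of the file, which is exactly the problem.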

The VCSes are to some degree converging, but so far it's mostly a surface convergence, with commands that only look the same. Where will things go from here?

  • Maybe there will be a standardisation effort like POSIX for VCSes. Though it seems harder; VCSes have a wider interface than just syscalls and filesystems, so there's more scope for incompatibility.
  • Or will complicated compatibility code like cygwin be developed, to let a program that was written for git use bzr instead, carefully hiding all the differences?
  • Another option would be that one VCS wins. So far I'm seeing some consolidation, but little indication that one VCS will emerge as the choice for everyone.
  • Maybe the VCSes might begin to support each other's repositories. If a program only supports git, and you use svn, that's fine, if git can transparently access the svn repo.
  • Or will we go on for decades spending a lot of time on portability code?

Posted in the wee hours of Thursday night, March 7th, 2008 Tags: git

With pristine-tar version 0.5, I've added a new feature: The ability to easily inject enough information into a git repository so that a pristine tarball can later be regenerated from that repository.

If a package's upstream branch contains the upstream source corresponding to the tarball to be imported, it's very simple to use.

joey@kodama:~/src/fbreader> pristine-tar commit ~/lib/debian/unstable/fbreader_0.8.12.orig.tar.gz 
pristine-tar: committed fbreader_0.8.12.orig.tar.gz.delta to branch pristine-tar

Otherwise, you also have to specify a tag (or any tree-ish really) where the upstream source can be found.

joey@kodama:~/src/fbreader> pristine-tar commit ~/fbreader_0.8.9.orig.tar.gz upstream/0.8.9
pristine-tar: committed fbreader_0.8.9.orig.tar.gz.delta to branch pristine-tar

Here's what it puts in the pristine-tar branch that it creates. In this example, the delta files are 40-some kilobytes, which is much nicer than if I'd had to check the 2 megabyte tarballs into git directly.

joey@kodama:~/src/fbreader> git checkout pristine-tar 
Switched to branch "pristine-tar"
joey@kodama:~/src/fbreader> ls -l
total 104
-rw-r--r-- 1 joey joey 46583 Jan 31 21:26 fbreader_0.8.12.orig.tar.gz.delta
-rw-r--r-- 1 joey joey    40 Jan 31 21:26 fbreader_0.8.12.orig.tar.gz.id
-rw-r--r-- 1 joey joey 45267 Jan 31 21:26 fbreader_0.8.9.orig.tar.gz.delta
-rw-r--r-- 1 joey joey    40 Jan 31 21:26 fbreader_0.8.9.orig.tar.gz.id

Don't forget to push the pristine-tar branch to your server for safekeeping.

joey@kodama:~/src/fbreader> git push origin pristine-tar

Once a tarball's delta is checked in, you can easily and quickly regenerate the original tarball.

joey@kodama:~/src/fbreader> pristine-tar checkout ../fbreader_0.8.12.orig.tar.gz
pristine-tar: successfully generated ../fbreader_0.8.12.orig.tar.gz

Yes, it's really the same file. :-)

joey@kodama:~/src/fbreader> md5sum ~/lib/debian/unstable/fbreader_0.8.12.orig.tar.gz ../fbreader_0.8.12.orig.tar.gz 
8045abe1acc75dbdd220400df541f23f  /home/joey/lib/debian/unstable/fbreader_0.8.12.orig.tar.gz
8045abe1acc75dbdd220400df541f23f  ../fbreader_0.8.12.orig.tar.gz

The above example is from the perspective of a maintainer of a Debian package. But this can also be used by the authors who generate the pristine tarballs in the first place. Check them into git using pristine-tar. Then you can regenerate any tarball you've ever released using just your project's git repository.

joey@kodama:~/src/pristine-tar> pristine-tar commit ../pristine-tar_0.5.tar.gz tags/0.5
pristine-tar: committed pristine-tar_0.5.tar.gz.delta to branch pristine-tar

One word of warning: For pristine-tar to check out the tarball, git needs to be able to check out the tree that you referred to when you committed it in the first place. If that was a tag that you've told git to delete, you're SOL. If it was a branch and the branch has changed in the meantime, that's fine, so long as git can find the original id that is stored in the .tar.gz.id file.
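
If you want to check ahead of time, something like the following sketch should do. It reads the id out of a .id file and asks git whether that object is still around. id_still_available is a made-up helper, not part of pristine-tar, and it assumes the .id file holds nothing but the object id.

```shell
# Sketch: check that the object id recorded in a pristine-tar .id file
# still exists in the current git repository.
id_still_available() {
    idfile=$1
    id=$(cat "$idfile") || return 1
    # git cat-file -e exits 0 only if the object exists and is valid.
    git cat-file -e "$id" 2>/dev/null
}
```

With the pristine-tar branch checked out, id_still_available fbreader_0.8.12.orig.tar.gz.id || echo "tree is gone" would flag a tarball that can no longer be regenerated from this repository.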


I hope that tools like git-import-orig and git-debimport can get support for automatically calling pristine-tar commit when importing tarballs into git, and that tools like git-buildpackage and gitpkg can use it to check out the tarballs.

Posted Thursday night, January 31st, 2008 Tags: git