This feed contains some of my blog entries that link to software code that I've developed.
Partly as a followup to a Github survey, and partly because I had a free evening and the need to write more haskell code, any haskell code, I present to you, github-backup.
github-backup is a simple tool you run in a git repository you cloned from Github. It backs up everything Github knows about the repository, including other forks, issues, comments, milestones, pull requests, and watchers.
This is all stored in the repository, as regular files, on a "github" branch.
Available in Cabal now, in Debian maybe if someone packages haskell-github.
I attended the Git Together earlier this week. I was tentative about this, since I'm not really much of a git developer; all my git work is building stuff on top of it. It turned out great though.
At first it seemed like one of those parties where you don't know anyone. But then I got to reconnect with Avery Pennarun for the first time since DebConf 2, and got to know Jonathan Nieder better, and it was also nice to see Jelmer Vernooij. And the core developers were also very welcoming. Junio Hamano knew of my work (and I am in awe of his), and Jeff King thinks my take on SHA1 security issues has value, and has been expanding on it. Shawn Pearce managed the unconference subtly and well. Lots of very smart people. At one point I found myself across the table from Android's lead developer.
I was very happy that everything I think needs improvement in git was discussed during the unconference:
- big files: My postit suggesting this got more checks than most anything else, and I briefly presented git-annex at the start of a session on general scalability -- on its 1-year anniversary. Some ideas for improved hooks that git-annex and other tools could use are developing. Better scalability to lots of files and more efficient index files were also discussed.
- git as a filesystem: There was a consensus that gone are the days when git was just about managing source code. (I remember being told on #git before I wrote etckeeper, that no, git should not be used for that..)
- submodules: I was astounded that they're now considering supporting "floating" submodules, which would track the head of a branch, rather than the specific rev committed in the superproject. Many other problems that have kept me from ever trying submodules are also being worked on. This seems unlikely to replace mr, but who knows -- at least getting rid of repo is a goal.
- SHA1 security was discussed for quite a long while, long enough that I felt a bit guilty for bringing it up, but it was an interesting and fruitful discussion. I went in thinking that the checksum basically has to be parameterized, but they have some good reasons not to do that, and some other good ideas, although what to do and when best to do it is still open for discussion. Signed commits are certainly coming soon. Also this amazing patch was developed.
- Metadata storage was briefly discussed, but nobody seemed sure how to deal with it. Ideas floated included a metastore like tool that uses mergeable files, or storing metadata in some sort of notes-like separate branch.
There are certain things haskell is very good at, and I had the pleasure
of using it for one such thing yesterday. I wanted to support
expressions like find(1) does, in git-annex.
Something like:
git-annex drop --not --exclude '*.mp3' --and \
-\( --in usbdrive --or --in archive -\) --and \
--not --copies 3
So, parens and booleans and some kind of domain-specific operations. It's easy to build a data structure in haskell that can contain this sort of expression.
    {- A Token can either be a single word, or an Operation of an arbitrary type. -}
    data Token op = Token String | Operation op
        deriving (Show, Eq)

    data Matcher op
        = Any
        | And (Matcher op) (Matcher op)
        | Or (Matcher op) (Matcher op)
        | Not (Matcher op)
        | Op op
        deriving (Show, Eq)
(The op could just be a String, but is parameterized for reasons we'll
see later.)
As command-line options, the expression is already tokenised, so all I needed to do to parse it is consume a list of the Tokens. The only mildly tricky thing is handling the parens right -- I chose to not make it worry if there were too many, or too few closing parens.
    generate :: [Token op] -> Matcher op
    generate ts = generate' Any ts

    generate' :: Matcher op -> [Token op] -> Matcher op
    generate' m [] = m
    generate' m ts = uncurry generate' $ consume m ts

    consume :: Matcher op -> [Token op] -> (Matcher op, [Token op])
    consume m [] = (m, [])
    consume m ((Operation o):ts) = (m `And` Op o, ts)
    consume m ((Token t):ts)
        | t == "and" = cont $ m `And` next
        | t == "or" = cont $ m `Or` next
        | t == "not" = cont $ m `And` (Not next)
        | t == "(" = let (n, r) = consume next rest in (m `And` n, r)
        | t == ")" = (m, ts)
        | otherwise = error $ "unknown token " ++ t
      where
        (next, rest) = consume Any ts
        cont v = (v, rest)
Once a Matcher is built, it can be used to check if things match the expression the user supplied. This next bit of code almost writes itself.
    {- Checks if a Matcher matches, using a supplied function to check
     - the value of Operations. -}
    match :: (op -> v -> Bool) -> Matcher op -> v -> Bool
    match a m v = go m
      where
        go Any = True
        go (And m1 m2) = go m1 && go m2
        go (Or m1 m2) = go m1 || go m2
        go (Not m1) = not (go m1)
        go (Op o) = a o v

And that's it! This is all nearly completely generic and could be used for a great many things that need support for this sort of expression, as long as they can be checked in pure code.
A trivial example:
*Utility.Matcher> let m = generate [Operation True, Token "and", Token "(", Operation False, Token "or", Token "not", Operation False, Token ")"]
*Utility.Matcher> match (const . id) m undefined
True
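The trivial example uses plain Bools as operations. To show what the type parameter buys, here is a minimal sketch with a small operation type of its own. The `FileTest` type, `checkFile`, `wanted` and `selectFiles` are hypothetical names made up for illustration and are not part of git-annex; `Matcher` and `match` are copied from above so the block stands alone.

```haskell
import Data.List (isSuffixOf)

-- Copied from the post so this block is self-contained.
data Matcher op
    = Any
    | And (Matcher op) (Matcher op)
    | Or (Matcher op) (Matcher op)
    | Not (Matcher op)
    | Op op
    deriving (Show, Eq)

match :: (op -> v -> Bool) -> Matcher op -> v -> Bool
match a m v = go m
  where
    go Any = True
    go (And m1 m2) = go m1 && go m2
    go (Or m1 m2) = go m1 || go m2
    go (Not m1) = not (go m1)
    go (Op o) = a o v

-- Hypothetical domain-specific operations (an assumption, not git-annex's).
data FileTest
    = HasSuffix String   -- file name ends with the given string
    | LongerThan Int     -- file name has more than n characters
    deriving (Show, Eq)

-- Interprets one FileTest against a file name.
checkFile :: FileTest -> FilePath -> Bool
checkFile (HasSuffix s) f = s `isSuffixOf` f
checkFile (LongerThan n) f = length f > n

-- "not *.mp3 and ( *.ogg or name longer than 12 characters )"
wanted :: Matcher FileTest
wanted = Not (Op (HasSuffix ".mp3"))
    `And` (Op (HasSuffix ".ogg") `Or` Op (LongerThan 12))

-- Keeps only the file names that match the expression.
selectFiles :: [FilePath] -> [FilePath]
selectFiles = filter (match checkFile wanted)
```

Because the operations are ordinary data here, the same `wanted` expression can be shown, compared for equality, or handed to a different interpreter function than `checkFile`.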
For my case though, I needed to run some IO actions to check if expressions
about files were true. This is where I was very pleased to see a monadic
version of match could easily be built.
    {- Runs a monadic Matcher, where Operations are actions in the monad. -}
    matchM :: Monad m => Matcher (v -> m Bool) -> v -> m Bool
    matchM m v = go m
      where
        go Any = return True
        go (And m1 m2) = liftM2 (&&) (go m1) (go m2)
        go (Or m1 m2) = liftM2 (||) (go m1) (go m2)
        go (Not m1) = liftM not (go m1)
        go (Op o) = o v
With this and about 100 lines of code to implement specific tests like --copies and --in, git-annex now supports the example at the top.
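One property of matchM worth being aware of: `liftM2 (&&)` runs both actions before combining their results, so the monadic version does not skip the right-hand side the way the pure `&&` and `||` do. Whether that matters depends on how expensive the operations are. The sketch below is an assumption-laden illustration, not git-annex code: `matchShort`, `counted` and `countRuns` are hypothetical names, and the counter exists only to make the difference observable.

```haskell
import Control.Monad (liftM, liftM2)
import Data.IORef (IORef, newIORef, modifyIORef, readIORef)

-- Copied from the post so this block is self-contained.
data Matcher op
    = Any
    | And (Matcher op) (Matcher op)
    | Or (Matcher op) (Matcher op)
    | Not (Matcher op)
    | Op op

matchM :: Monad m => Matcher (v -> m Bool) -> v -> m Bool
matchM m v = go m
  where
    go Any = return True
    go (And m1 m2) = liftM2 (&&) (go m1) (go m2)
    go (Or m1 m2) = liftM2 (||) (go m1) (go m2)
    go (Not m1) = liftM not (go m1)
    go (Op o) = o v

-- Hypothetical variant: only runs the right-hand action when its
-- result can still change the answer.
matchShort :: Monad m => Matcher (v -> m Bool) -> v -> m Bool
matchShort m v = go m
  where
    go Any = return True
    go (And m1 m2) = do
        r <- go m1
        if r then go m2 else return False
    go (Or m1 m2) = do
        r <- go m1
        if r then return True else go m2
    go (Not m1) = liftM not (go m1)
    go (Op o) = o v

-- An operation that bumps a counter each time it runs, then
-- returns a fixed result.
counted :: IORef Int -> Bool -> v -> IO Bool
counted ref result _ = modifyIORef ref (+ 1) >> return result

-- Runs "False and True" with the given matcher, returning the result
-- and how many operations were actually executed.
countRuns :: (Matcher (() -> IO Bool) -> () -> IO Bool) -> IO (Bool, Int)
countRuns run = do
    ref <- newIORef 0
    let expr = Op (counted ref False) `And` Op (counted ref True)
    r <- run expr ()
    n <- readIORef ref
    return (r, n)
```

Both matchers give the same answer for the expression; `countRuns matchM` executes two operations and `countRuns matchShort` executes one.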
Just for comparison, find(1) has thousands of lines of C code to
build a similar parse tree from the command line parameters and run it.
Although I was surprised to see that it optimises expressions by eg,
reordering cheaper tests first.
The last time I wrote this kind of thing was in perl, and there the natural way was to carefully translate the expression into perl code, which was then evaled. Meaning the code was susceptible to security holes.
Anyway, this is nothing that has not been done a hundred times in haskell before, but it's very nice that it makes it so clean, easy, and generic.
In this screencast, I implement a new feature in git-annex. I spend around 10 minutes writing haskell code, 10 minutes staring at type errors, and 10 minutes writing documentation. A normal coding session for me. I give a play-by-play, and some thoughts of what programming is like for me these days.
git-annex coding in haskell.ogg (38 MB) | on vimeo
Not shown is the hour I spent the next day changing the "optimize" subcommand implemented here into "--auto" options that can be passed to git-annex's get and drop commands.
I've just released git-annex version 3, which stops cluttering
the filesystem with .git-annex directories. Instead it stores its
data in a git-annex branch, which it manages entirely transparently
to the user. It is essentially now using git as a distributed NOSQL database.
Let's call it a databranch.
This is not an unheard of thing to do with git. The git notes built into
recent git does something similar, using a dynamically balanced tree in
a hidden branch to store notes. My own pristine-tar injects data into
a git branch. (Thanks to Alexander Wirt for showing me how to do that
when I was a git newbie.) Some
distributed bug trackers store
their data in git in various ways.
What I think takes git-annex beyond these is that it not only injects data into git, but it does it in a way that's efficient for large quantities of changing data, and it automates merging remote changes into its databranch. This is novel enough to write up how I did it, especially the latter which tends to be a weak spot in things that use git this way.
Indeed, it's important to approach your design for using git as a database from the perspective of automated merging. Get the merging right and the rest will follow. I've chosen to use the simplest possible merge, the union merge: When merging parent trees A and B, the result will have all files that are in either A or B, and files present in both will have their lines merged (and possibly reordered or uniqed).
The main thing git-annex stores in its databranch is a bunch of presence logs. Each log file corresponds to one item, and has lines with this form:
timestamp [0|1] id
This records whether the item was present at the specified id at a given time. It can be easily union merged, since only the newest timestamp for an id is relevant. Older lines can be compacted away whenever the log is updated. Generalizing this technique for other kinds of data is probably an interesting problem. :)
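The merge-then-compact idea can be sketched in a few lines. This is an illustration under stated assumptions, not git-annex's implementation: timestamps are taken to be plain integers, unparseable lines are silently dropped, and `LogLine`, `parseLine`, `showLine`, `compact` and `unionMerge` are hypothetical names.

```haskell
import qualified Data.Map.Strict as M
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- One line of a presence log: "timestamp [0|1] id".
data LogLine = LogLine
    { stamp :: Integer   -- assumed to be an integer timestamp
    , present :: Bool
    , ident :: String
    } deriving (Show, Eq)

-- Parses a single line; returns Nothing for anything malformed.
parseLine :: String -> Maybe LogLine
parseLine l = case words l of
    [t, s, i] -> do
        ts <- readMaybe t
        p <- case s of
            "1" -> Just True
            "0" -> Just False
            _ -> Nothing
        return (LogLine ts p i)
    _ -> Nothing

showLine :: LogLine -> String
showLine (LogLine t p i) = unwords [show t, if p then "1" else "0", i]

-- Keeps only the newest line for each id. Output is ordered by id.
compact :: [LogLine] -> [LogLine]
compact = M.elems . M.fromListWith newer . map (\l -> (ident l, l))
  where
    newer a b = if stamp a >= stamp b then a else b

-- Union merges two versions of a log file: take all lines from both
-- sides, then compact away the superseded ones.
unionMerge :: String -> String -> String
unionMerge a b =
    unlines . map showLine . compact . mapMaybe parseLine $ lines a ++ lines b
```

Since only the newest line per id survives, the order in which the two sides are merged does not affect the result, as long as no two lines for the same id share a timestamp.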
While git can union merge changes into the currently checked out branch,
when using git as a database, you want to merge into your internal-use
databranch instead, and maintaining a checkout of that branch is inefficient.
So git-annex includes a general purpose
git-union-merge command
that can union merge changes into a git branch, efficiently, without
needing the branch to be checked out. Another problem is how to trigger the
merge when git pulls changes from remotes. There is no suitable git hook
(post-merge won't do because the checked out branch may not change at all).
git-annex works around this problem by automatically merging */git-annex
into git-annex each time it is run. I hope that git might eventually get
such capabilities built into it to better support this type of thing.
So that's the data. Now, how to efficiently inject it into your databranch? And how to efficiently retrieve it?
The second question is easier to answer, although it took me a while to
find the right way ... Which is two orders of magnitude faster than the
wrong way, and fairly close in speed to reading data files directly
from the filesystem.
The right choice is to use git-cat-file --batch; starting it up the
first time data is requested, and leaving it running for further queries.
This would be straightforward, except that git-cat-file --batch is a little
difficult when a file is requested that does not exist. To detect that,
you'll have to examine its stderr for error messages too. Perhaps
git-cat-file --batch could be improved to print something machine
parseable to stdout when it cannot find a file.
Efficiently injecting changes into the databranch was another place where
my first attempt was an order of magnitude slower than my final code.
The key trick is to maintain a separate index file for the branch.
(Set GIT_INDEX_FILE to make git use it.) Then changes can be fed
into git by using git hash-object, and those hashes recorded into
the branch's index file with git update-index --index-info. Finally,
just commit the separate index file and update the branch's ref.
That works ok, but the sad truth is that git's index files don't scale well as the number of files in the tree grows. Once you have a hundred thousand or so files, updating an index file becomes slow, since for every update, git has to rewrite the entire file. I hope that git will be improved to scale better, perhaps by some git wizard who understands index files (does anyone except Junio and Linus?) arranging for them to be modified in-place.
In the meantime, I use a workaround: Each change that will be committed to
the databranch is first recorded into a journal file, and when git-annex
shuts down, it runs git hash-object just once, passing it all the journal
files, and feeds the resulting hashes into a single call to git
update-index. Of course, my database code has to make sure to check the
journal when retrieving data. And of course, it has to deal with possibly
being interrupted in the middle of updating the journal, or before it can
commit it, and so forth. If gory details interest you, the complete code
for using a git branch as a database, with journaling, is
here.
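The shape of the journal workaround can be modelled without touching git at all. The following is only an in-memory sketch: two maps stand in for the journal files and the committed branch, a counter stands in for the expensive index update, and `Store`, `setFile`, `getFile` and `flush` are hypothetical names. It ignores the interruption handling mentioned above.

```haskell
import qualified Data.Map.Strict as M

-- In-memory stand-in for the databranch plus its journal.
data Store = Store
    { journal :: M.Map FilePath String   -- changes not yet committed
    , branch :: M.Map FilePath String    -- contents committed to the branch
    , commits :: Int                     -- number of batch commits so far
    } deriving (Show, Eq)

emptyStore :: Store
emptyStore = Store M.empty M.empty 0

-- Records a change in the journal only; the branch is not touched.
setFile :: FilePath -> String -> Store -> Store
setFile f v s = s { journal = M.insert f v (journal s) }

-- Reads consult the journal first and fall back to the branch, so
-- uncommitted changes are always visible.
getFile :: FilePath -> Store -> Maybe String
getFile f s = case M.lookup f (journal s) of
    Just v -> Just v
    Nothing -> M.lookup f (branch s)

-- Commits every journalled change in one batch. M.union is left-biased,
-- so journal entries override older branch contents.
flush :: Store -> Store
flush s
    | M.null (journal s) = s
    | otherwise = s
        { journal = M.empty
        , branch = M.union (journal s) (branch s)
        , commits = commits s + 1
        }
```

However many times `setFile` is called, a single `flush` accounts for all of them with one commit, which is the point of batching the updates.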
After all that, git-annex turned out to be nearly as fast as before
when it was simply reading files from the filesystem, and actually faster
in some cases. And without the clutter of the .git-annex/ directory,
git use is overall faster, commits are uncluttered, and there's no difficulty
with branching. Using a git branch as a database is not always the right
choice, and git's plumbing could be improved to better support it, but it
is an interesting technique.
I started doing daily builds of the Debian Installer in February 2004. (Before me, Martin Sjögren did them for a while.)
Seven years of building d-i every single day, twice a day in recent years, and also twice a day for armel. Well over ten thousand builds total. That would have been tedious, but thanks to cron it was instead seven years of keeping the machines running and upgraded, dealing with things when they broke, and getting highlighted on IRC whenever someone mentioned "people.debian.org/~joeyh/d-i". Still a bit tedious, but those bits were used to install a lot of machines.
The rest of the d-i daily builds moved onto the autobuilders a while back, but I guess there was some reluctance to mess with my institution. I finally convinced them to take over my builds too. :)
I have noticed some problems with how Debian is using the popularity-contest data.
popcon units are unknown
Using the popcon score of a package to measure its use is like using the bleeple score of a trip to measure its distance. Both scores have no sensible units attached, though they may be loosely derived from a unit value. Is a trip with a bleeple score of 99 a long trip? Is a package with a popcon score of 99 a rarely used package?
The only way to resolve this ambiguity at all is to compare ratios of values, so the problematic units cancel out. A flight from NYC to AMS with a bleeple score of 99 is 50 times as bleeple as my drive home, which scores 2.
So, any statement like "low popcon score" is basically so lacking in context as to be meaningless. Such statements are deprecated, and should be ignored.
not all popcon scores are comparable
The above example is intentionally bad. Plane flights and car trips are not very comparable when you don't know what units (time / CO2 / distance / number of people sharing a confined space / security theater points) are being used.
Similarly, comparing a high popcon package like gnome-terminal with a relatively low popcon package like udhcpc is very deceptive. The former is installed by default in the desktop task, but plenty of desktop users would not miss it. The latter is installed only on embedded systems, which can exist in absurd numbers, and none of which will tend to report to popcon.
So, any attempt to compare popcon scores should include a rationale about why the two scores are comparable. For example, gnome-terminal and rxvt are somewhat comparable since they are both terminal emulators. But, only the vote scores, not the inst scores should be compared, since gnome-terminal is installed by default. dhcp3-client and udhcpc are not comparable despite being similar packages.
popcon scores do not measure long tail effects
A strength of Debian is that not only commonly used, but also uncommon and niche software is packaged. Popcon does not measure the benefit of some little used piece of software being there, packaged and ready to use when a user needs it.
For six years I kept satutils in Debian, despite it probably having no users. It has a very specific use case, to control a motorized internet satellite dish typically installed on an RV. I did that because it was essentially no work (the package was approximately bug free, and required no changes since 2007), and because of the possible payoff if someone needed this thing and there it was, in Debian. The value of Debian in that occasion would spike to a value that, while not directly comparable with a popcon score, would be pretty epic, for that one user, as they pushed arrow keys to move a satellite dish around.
(It also had the best WITHOUT WARRANTY statement I've had the pleasure to write: "If you break your dish off your vehicle using this software, you get to keep both pieces.")
Every removal of a package for "low popcon score" runs the risk of silently degrading this overall value of Debian.
who wants to be popular?
Part of the problem is that popcon has been around long enough that the connotations of its name, "popularity contest" have been dulled by repetition (and abbreviation). Popularity contests are not pleasant things. They rarely reach the best result. They embody the tyranny of the majority. The name was originally, to the best of my knowledge, chosen exactly to imply all these failings, to say that hey, popularity-contest is deeply flawed, but is better than nothing for this one specific use case (ordering packages to place on CD sets). We no longer think of popcon with these caveats. That is a regression in your brain. Fix it.
By removing packages that appear unpopular, we run the risk of Debian becoming bland and homogeneous.
After two weeks of work, I've just released git-annex version 0.20110417, with some big new features that open up some interesting use cases:
Want to use The Cloud as a git remote? git-annex now supports storing data in Amazon S3 as if it were just another git remote. With full gpg encryption of the content stored in S3.
*This is so easy to use and neato I just have to show it off:
    joey@gnu:~/tmp/repo> git annex initremote cloud type=S3 encryption=joey@kitenet.net
    initremote cloud (checking bucket) (creating bucket in US) ok
    joey@gnu:~/tmp/repo> git annex add bigfile
    add bigfile ok
    (Recording state in git...)
    joey@gnu:~/tmp/repo> git annex move bigfile --to cloud
    move bigfile (gpg) (checking cloud...) (to cloud...) ok
    (Recording state in git...)
    joey@gnu:~/tmp/repo> file bigfile
    bigfile: broken symbolic link
    joey@gnu:~/tmp/repo> git annex get bigfile
    get bigfile (copying from cloud...) (gpg) ok
    (Recording state in git...)
    joey@gnu:~/tmp/repo> file bigfile
    bigfile: symbolic link
Want to collaborate on some big files? Perhaps you're dealing with scientific datasets, or video game levels. git-annex can now use a bup repository as a special kind of git remote. The big files are stored forevermore in bup's git repository, while their metadata and the rest of your stuff can be kept in git as usual -- and both git repositories can be used collaboratively. It's turtles^Wgit all the way down, but without the large file scalability problems.
Want to trade disk space with someone? You can set up a bup remote, or just bare directory on their system, and have git-annex encrypt the data it stores there.
Want a DropBox like folder that's Free Software and stores data in the cloud, or in a more git-style distributed way? Well, about that...
I had been planning to finish up S3 and encryption support, and then mention this could perhaps be used as the basis for a DropBox like thing. But Christophe-Marie Duquesne anticipated me by a month, and created ShareBox. It's a FUSE filesystem layered over git-annex. While not yet production ready (lacking eg, conflict resolution), it has promise.
* This feature isn't available in Debian yet, blocked by a lack of
the Haskell hS3 library packaged for Debian. Someone should fix that,
ideally not me. Getting missingh updated for the ghc7 transition
so git-annex is buildable in unstable would also be nice..
Projects I've been working on
are, if not bearing fruit, at least showing promise.
git-annex is growing up a nice community that nags me about things like Mac OSX and FAT filesystem support. It's great to have their feedback, it's been leading to fast improvement, even though I've only gotten two patches to my haskell code. :)
New users are finding Branchable despite us not knowing how to market it. Sometimes a new site springs up overnight fully formed, having been developed on a laptop and git pushed to us. I get a vicarious thrill over whatever our users are doing, be it driving to Mongolia or watering plants with arduino.
I just realized that the Debian Installer is ten years old! It actually got started ten years ago last summer, but now is also an apt anniversary -- ten years ago, I successfully booted d-i for the first time.
It's amazing that d-i has been around as long as Debian Developers who have "been with the project longer than most". (Many of whom have been long-time contributors to d-i themselves.) Astonishing that my basic design has held up and remains relevant. And every time I boot up the Debian kFreeBSD installer, or run the graphical installer, or hear someone raving about how easy Ubuntu was for them to install (and know d-i was underneath), I'm amazed at the places people have taken d-i.
The code has recently moved to git. As changes like ext4 by default start to flow in to d-i, with Wheezy development starting up, I hope to see more people enticed into working on d-i.
Anyone can do like Matthew Palmer has done with netcfg -- pick a component of d-i that interests you, check out its git repository, and take it from being reasonably well team-maintained to awesomely you-maintained. In five or ten years Matthew will be able to look back at all the people who have used d-i to install over IPv6, and on WPA wireless, and with his other netcfg improvements. Trust me, it's a great feeling.
Previously: d-i retrospective (better written than the above & well worth a re-read)