In part 2 of our interview, Linus talks about the process of managing kernel developer commits, selecting a revision control system and how he personally uses git.
Your project was one of the early adopters of distributed revision control, starting with BitKeeper in 2002, and now Git. What advice do you have for other projects, either in-house or out in the open, that are considering moving to a distributed system?
The SCM [source control management] choice ends up being a pretty personal thing, and much more important than the technical details of the SCM you end up using is the workflow you use.
Now, the reason a distributed SCM is important is not the distribution itself as much as the flow of information it allows. By avoiding the whole single point of concentration, you end up with the possibility for a much more dynamic development model.
And I say “the possibility”, because you can certainly continue with all the old rules, and the old development model, and use a distributed SCM the same way you use any random central one, but you won’t see the huge advantages if you do.
So the big piece of advice for anybody looking at switching SCMs (or if you just want to really think about the SCM you use now, and what implicit assumptions you have in your current workflow) is to think about the SCM as just a tool, but also realize that it’s a tool that quite heavily constrains, or liberates, just how you work.
And a lot of the “how you work” things are not necessarily seen as limitations of the SCM itself. They end up being these subtle mental rules for what works and what doesn’t work when doing development, and quite often those rules are not consciously tied together with the SCM. They are seen as somehow independent from the SCM tool, and sometimes they will be, but quite often they are actually all about finding a process that works well with the tools you have available.
One such common area is the set of rules about “commit access” that a lot of projects tend to get. It’s become almost universally believed that a project either has to have a “benevolent dictator” (often, but not always, the original author as in Linux), or the alternative ends up being some kind of inner committer cabal with voting.
And then there spring up all these “governance” issues, and workflows for how voting takes place, how new committers are accepted, how patches get committed, et cetera, et cetera. And a lot of people seem to think that these issues are very fundamental and central, and really important.
And I believe that the kinds of structures we build up depend very directly on the tools we use. With a centralized source control management system (or even some de-centralized ones that started out with “governance” and commit access as a primary issue!) the whole notion of who has the right “token” to commit becomes a huge political issue, and a big deal for the project.
Then on the other hand, you have de-centralized tools like git, and they have very different workflow issues. I’m not saying that problems go away, but they aren’t the same problems. Governance is a much looser issue, since everybody can maintain their own trees and merging between them is pretty easy.
But partly that very flexibility has then brought on a whole set of new guidelines for how to manage it—without it getting too messy and too disorganized for anybody to make sense of it. So anybody can well do their own tree, but then to counteract that we’ve grown rules for how to keep the end results clean.
In other words, it all boils down to getting a good development model. The SCM can stand in the way for it, or it can allow it, but in neither case does the SCM stand alone.
And never forget that you can build up reasonable development models around totally horribly bad SCMs (and people have had years of experience with them, and seem to be sometimes very comfortable with the absolute crap that passes for SCMs).
But moving to a distributed SCM just allows for much better models. And I think that what’s important in the kernel is how we’ve been able to scale our development and have a good model that allows literally thousands of people to co-develop things. And git has been part of it, but so is the much more pedestrian “send patches around with sign-off chains” thing too.
If you’re not willing to think about how your current development model really works, you probably shouldn’t be thinking about switching SCMs.
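The “send patches around with sign-off chains” flow mentioned above can be sketched with standard git commands. This is a minimal illustration, not the kernel's actual process: the repository paths, identities, and file names are hypothetical, and the patch is handed over as a file rather than by email. A contributor signs off a commit and exports it as a patch; a maintainer applies it and appends a second sign-off.

```shell
#!/bin/sh
# Sketch of a two-link sign-off chain (all names and paths are hypothetical).
set -e
work=$(mktemp -d)

# Contributor: start a small project with one base commit.
git init -q "$work/contributor"
cd "$work/contributor"
git config user.name "Contributor"
git config user.email "contributor@example.com"
echo "base" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Maintainer: has their own tree, cloned from the same base.
git clone -q "$work/contributor" "$work/maintainer"

# Contributor: make a change, sign it off (-s), and export it as a patch file.
echo "fix" >> notes.txt
git commit -q -a -s -m "Fix notes"
patch=$(git format-patch -1 -o "$work/outbox")

# Maintainer: apply the patch and add a second Signed-off-by line (-s).
cd "$work/maintainer"
git config user.name "Maintainer"
git config user.email "maintainer@example.com"
git am -q -s "$patch"

# Show the resulting commit message, including both sign-offs.
git log -1 --format=%B
```

Each person who handles the patch adds one Signed-off-by line, so the commit message records the path the change took.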
There are 132 git commands as of 22.214.171.124. How many do you use?
Oh, the “tons of commands” thing is a total red herring. For a git tutorial I did last year at the Linux Plumbers Conference, I went through a totally ridiculous example of a few people working together, and noted every time we used a new git command.
I think we ended up with something like fourteen commands being used. And even that’s more than most end developers will even need. That list of fourteen commands was for the whole “multiple people working together, including the person integrating things” workflow.
The reason so many commands exist is that Git was designed to be scripted, and in fact almost all the original git commands were really just shell-script wrappers around a few really core commands that were written in C.
And that whole scriptability, and the fact that we’ve encouraged lots of different workflows means that there is a combination of those core scripting commands that really nobody is expected to use directly on the command line (we call them “plumbing”—they’re there, they form the base, but they’re generally hidden from view) and then there are tons of those helper commands that are built for some specific purpose (“porcelain”: the part you actually see, and that often comes with gilded knobs to make it all fancy and pretty).
We scared away a few people by making all the commands very visible (they all got installed in your $PATH, for example, so “git-” would give all the newcomers a list of them all), but the point is, that’s totally irrelevant in the end. You’d be expected to start with just a couple of commands, and it’s actually much more important to understand about the notion of history (and real merges) than to know more than a couple of commands.
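To make the plumbing/porcelain split concrete, here is a minimal sketch that creates one commit using only plumbing commands. The usual porcelain equivalent appears in a comment. The file name, commit message, and identity values are hypothetical, chosen only for this illustration.

```shell
#!/bin/sh
# Sketch: create a commit with plumbing only (names and values are hypothetical).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Identity used for the commit object in this sketch.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

echo "hello" > greeting.txt

# Porcelain would be: git add greeting.txt && git commit -m "first commit"
# The plumbing steps that do roughly the same work:
blob=$(git hash-object -w greeting.txt)                         # store the file contents as a blob
git update-index --add --cacheinfo 100644 "$blob" greeting.txt  # record the blob in the index
tree=$(git write-tree)                                          # write the index out as a tree object
commit=$(echo "first commit" | git commit-tree "$tree")         # wrap the tree in a commit object
git update-ref HEAD "$commit"                                   # point the current branch at the commit

# Porcelain can read the result like any other commit.
git log --format=%s
```

Most users never need to run these steps by hand, which is the point of the distinction: the porcelain commands cover day-to-day work, and the plumbing is there for scripts.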
I just checked my own command line history with a simple sed-script and some sorting and counting:
history |
grep git |
sed 's/[ 0-9]*\(git[ ]*[a-z-]*\).*/\1/' |
sort |
uniq -c |
sort -n
and I get
1 git archive
1 git clean
1 git dif
1 git ls-files
1 git shortlog
1 git tag
2 git rebase
3 git checkout
3 git fetch
3 git status
4 git add
5 git reset
6 git gc
9 git commit
12 git show
16 git push-all
21 git am
42 git pull
51 git grep
82 git diff
94 git log
There are 22 commands I’ve used recently, and of those one is a typo (“git dif”) and three are related to me doing a release (“archive”, “shortlog” and “tag”). One isn’t even a real git command, it’s an alias I’ve set up for my workflow (“push-all”).
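The interview does not say what the “push-all” alias actually runs. As a purely hypothetical sketch of how such an alias could be defined, the example below sets up an alias that pushes the current branch to every configured remote. The remote names, paths, and alias body are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: a hypothetical "push-all" alias that pushes HEAD to every remote.
set -e
work=$(mktemp -d)

# Two bare repositories standing in for remote mirrors (hypothetical).
git init -q --bare "$work/mirror-a.git"
git init -q --bare "$work/mirror-b.git"

# A working repository with both mirrors configured as remotes.
git init -q "$work/src"
cd "$work/src"
git remote add a "$work/mirror-a.git"
git remote add b "$work/mirror-b.git"

# An alias body starting with "!" is run as a shell command.
git config alias.push-all '!for r in $(git remote); do git push -q "$r" HEAD; done'

# Make one commit, then use the alias like any other git command.
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "initial"
git push-all
```

Once defined, the alias shows up in shell history as “git push-all”, the same as a built-in command would.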
Is that exhaustive? No, I use other commands too. The above is just a random collection from the last 1000 commands in my history buffer. But I’m a git power user, and I don’t use that many commands. And they aren’t that complicated.
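The history-counting pipeline above can be wrapped in a small function for reuse. This is a minimal sketch: the function name and the sample input are hypothetical, and the function reads history-style lines from standard input so that it also works in a non-interactive script, where the shell's history is normally empty.

```shell
#!/bin/sh
# Sketch: count git subcommands in history-style input.
# Expects lines such as "  123  git log -p" on standard input.
count_git_commands() {
    grep 'git' |                                   # keep only lines that mention git
    sed 's/[ 0-9]*\(git[ ]*[a-z-]*\).*/\1/' |      # reduce each line to "git <subcommand>"
    sort |                                         # group identical commands together
    uniq -c |                                      # count each group
    sort -n                                        # order by count, least used first
}

# Hypothetical sample standing in for real history output.
sample='  101  git log
  102  git diff --stat
  103  ls
  104  git log -p
  105  git dif'

printf '%s\n' "$sample" | count_git_commands
```

In an interactive bash session, running `history | count_git_commands` applies the same counting to real history.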