Linus reflects on 18 years of working on Linux, the developer ecosystem and his goal for Linux on the desktop.
Linus Torvalds has led the development of the Linux operating system since its inception nearly 20 years ago. In that time Torvalds has not only witnessed the positive cultural and economic changes brought about by Linux but has also been a direct participant in making those changes a reality. And though many things have changed greatly since 1991, one thing remains constant: Linus is still at the helm.
In this interview Torvalds looks back on the operating system he created, the impact of new hardware, and the ubiquity of the OS on everything from cell phones to desktops to supercomputers.
Linux Magazine: You’ve been doing Linux for about 18 years now. That’s not a long time by the standards of academic research, but it is a long time by the standards of the software industry. Many of the core contributors have stuck with Linux even as the industry has changed and they have changed employers. Is it good for the project to have the same people able to stick with it? Do you plan to?
Linus Torvalds: I don’t think it’s good for a project if it’s only the same people who stick with it, and I’d be very worried about Linux if we had too much of a “core long-term people” approach. But there really are a lot of developers who are fairly recent, and most importantly there’s a really long tail of lots of people who dip their toes into kernel development if only to send in a really small patch. Most of them will never do anything more, but some of them will eventually be major developers. And we need that.
At the same time, I think everybody is also happier with some stability. There’s actually a number of people who have been around for quite a long time. People like Ted Ts’o, who showed up very early on and is still involved and still commits code.
So it’s not an either-or—we want to have both. And yes, I’ll stick with it as long as I think I can do a good job and nobody better comes along (or put another way: “as long as I can subvert whoever is better to work with me” ;)
And by the way, talking about changing employers: one thing I think is very healthy is how kernel developers are kernel developers first, and work for some specific company second. Most of the people work for some commercial entity that obviously tends to have its own goals, but I think we’ve been very good at trusting the people as people, not just some “technical representative of the company” and making it clear to companies too.
The reason I bring that up is that I think that ends up being one of the strengths of open source – with neither the project nor the people being too tied to a particular company effort at any one time.
LM: Before Linux, nobody would have believed that the same kernel would be running supercomputers and cell phones. Do you think you’ll always be able to maintain one codebase that works on phones and other tiny devices and on very large servers, and just let people configure it at build time?
LT: Personally I wouldn’t even say “before Linux”. For the longest time “after Linux” I told people inside SGI that they should accept the fact that they’d always have to maintain some extra patches that wouldn’t be acceptable to the rest of the Linux crowd just because nobody else cared about scaling quite that high up.
So I basically promised them that I’d merge as many infrastructure patches as possible so that their final external maintenance patch-set would be as pain-free to maintain as possible. But I didn’t really expect that we’d support four-thousand-CPU configurations in the base kernel, simply because I thought it would be too invasive and cause too many problems for the common case.
And the thing is, at the time I thought that, I was probably right. But as time went on, we merged more and more of the support, and cleaned up things so that the code that supports thousands of CPUs would look fine and also compile down to something simple and efficient even if you only had a few cores.
So now, of course, I’m really happy that we don’t need external patches to cover the whole spectrum from small embedded machines to big thousand-node supercomputers, and I’m very proud of how well the kernel handles it. Some of the last pieces were literally merged fairly recently, because they needed a lot of cleanup and some abstraction changes in order to work well across the board.
And now that I’ve seen how well it can be done, I’d also hate to do it any other way. So yes, I believe we can continue to maintain a single source base for wildly different targets, ranging from cell phones to supercomputers.
Of course, one of the interesting issues is how even the low end has been growing up. Ten years ago SMP was uncommon on the desktop; these days we’re looking at SMP systems even in very tiny embedded environments. So we have moved the goalposts up a bit for what we consider “small”. Those cell phones tend to have way more computing power than the original PC that I started Linux on had.
LM: You posted a very positive blog entry about your new Intel SSD. “That thing absolutely rocks.” On the other hand, some of the other SSDs on the market don’t, and some Linux users have pretty bad taste in hardware. Will the OS be able to get decent write performance and lifespan out of a bad SSD, or are users going to hate life if they buy the wrong one?
LT: It depends a lot on your usage case. For example, even a bad SSD can work wonderfully well as a secondary drive that gets 99.9% just read activity, since even the bad ones tend to read really well and have low latency and good random-read performance.
Of course, the size and price tend to make that a hard trade-off to make easily. It’s not worth it for big files that you usually just stream, since rotational disks are cheaper and perfectly fine for streaming behavior. Very few among us really know the true access patterns we actually have.
And hey, even the Intel SSDs aren’t perfect. If all you do is work with big files and read and write a lot of contiguous data, a regular disk will be much cheaper and bigger, and won’t be any slower for those cases.
But for me, the disk tended to always be the weakest part in the system. I can make up for some of it with just adding more memory, but while caching obviously is a huge issue and hides the disk performance in 95+% of the cases, it just makes the remaining few cases even more noticeable.
Just as an example: I’m used to doing “git grep something” in my kernel tree to find where some function is used, or something similar. It takes me all of half a second, so it’s basically instant.
Except when I have just rebooted, or have just done enough other things that my tree isn’t in cache any more (ok, so that’s pretty rare, but it does happen ;). And then the half second was a minute or two with a perfectly reasonably high-end desktop SCSI drive.
So my average latency was great. If I get 0.5 seconds 99% of the time, and then very seldom have to wait a minute just because it reads all those small files off the disk, I should be happy, right?
Wrong. The average may be great, but that just makes the bad cases feel even worse. I’m used to things being instantaneous, so now that minute feels really really bad. And it really is mostly seeking—the median file size in the kernel is about 4kB, so it’s reading all those directories and all those 25,000+ small files, and while the total size of it all may be just a few megabytes, because of seek times it takes half a minute.
Enter the Intel SSD, and the cached “git grep” still takes the same half second, but now the bad case takes me ten seconds (it used to be less, but those staging drivers really added a lot of crap. Some people would blame the Intel SSD degrading, but sadly, it’s all my own fault ;)
So my average access time hardly changed, and I can still tell when I’m disk-limited, but oh boy, it makes such a huge difference. Now even the slow case is no longer two orders of magnitude slower. Yes, even SSD disks are slower than RAM caches, but they don’t have that horrible “fall off the cliff” behavior when having to seek around for the data.
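Editor's note: the "two orders of magnitude" remark can be checked with a minimal sketch that uses only the timings quoted in this answer. Taking "a minute or two" as 60 seconds is an assumption (the low end of the quoted range), and the names below are illustrative only.

```python
# Timings quoted in the interview, in seconds.
WARM_CACHE = 0.5   # "git grep" with the tree in cache
COLD_HDD = 60.0    # assumption: low end of "a minute or two" on the SCSI disk
COLD_SSD = 10.0    # uncached case on the Intel SSD


def slowdown(cold: float, warm: float) -> float:
    """Return how many times slower the uncached run is than the cached one."""
    return cold / warm


hdd_ratio = slowdown(COLD_HDD, WARM_CACHE)
ssd_ratio = slowdown(COLD_SSD, WARM_CACHE)

print(f"HDD cold/warm slowdown: {hdd_ratio:.0f}x")
print(f"SSD cold/warm slowdown: {ssd_ratio:.0f}x")
```

Under these assumptions the rotating disk falls off the cliff by a factor of 120, while the SSD's bad case is 20 times slower than the cached one: still noticeable, but no longer two orders of magnitude.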
And that’s why I dislike a lot of the bad SSDs. They have an even worse “fall off the cliff” behavior. It’s for a very specific case (random small writes), and people will argue that it’s even less common than the case I describe above (random small reads), and it’s true. It’s not that common. But it’s common enough that when you hit it, it just hurts all the more.
This is why I don’t like “throughput” measures. You do want throughput, but latency variation is what you notice most. You can get used to slow machines and try to make your workflow match the “Oomph” of the hardware, but you cannot ever get used to fast machines that then occasionally are really slow. Those just drive you wild.
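Editor's note: a minimal sketch of why an average hides the problem described here. Both sample sets are hypothetical: one models 99 fast runs at 0.5 seconds plus a single 60-second stall, the other a machine that always takes one second. The function name and the numbers are illustrative assumptions, not measurements.

```python
import statistics


def summarize(samples):
    """Return mean, median and worst-case latency for a list of timings (seconds)."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "worst": max(samples),
    }


# Hypothetical workloads, 100 runs each.
spiky = [0.5] * 99 + [60.0]   # usually instant, occasionally terrible
steady = [1.0] * 100          # consistently middle-of-the-road

spiky_stats = summarize(spiky)
steady_stats = summarize(steady)

for name, stats in (("spiky", spiky_stats), ("steady", steady_stats)):
    print(f"{name:>6}: mean={stats['mean']:.3f}s "
          f"median={stats['median']:.1f}s worst={stats['worst']:.1f}s")
```

The two means come out close (about 1.1 seconds versus 1.0), yet the worst case differs by a factor of 60, which is the difference a throughput-style number does not show.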
As an aside, that’s also very noticeable in CPUs. I had the biggest complaints with Intel’s “netburst” (aka “P4”) architecture for some rather similar reasons: it had absolutely great “best case” behavior, and then it had some cases that it just stumbled horribly at, and which I happened to care deeply about.
The P4 was like a greased bat out of hell for loads it liked, but when it started missing in its tiny L1 cache, or when you had to serialize the pipeline for locking or for system calls, it turned into something more like a CPU two or three generations old.
And again—it’s actually more irritating to have something that is really good at some things and then really bad at others, than have something that is just consistently middle-of-the-road.
LM: On a system level, “really good at some things and then really bad at others” sounds like a lot of the Linux-based products out there. Take a workstation and strip off some of the parts to make a dedicated cluster node or a NAS appliance or a PVR. Do you get a good general-purpose kernel by building something that works on the desktop, and letting people configure it to get customized builds for their own needs?
LT: Yes. To me, Linux on the desktop has always been the most interesting goal. The primary reason for that is simply that it’s always been what I want (I’ve never wanted a server OS—I started out writing Linux for my own PC, not to be some file server), but also because all the interesting problems always end up being about desktop uses.
All other uses tend to be very constrained. You have one thing (or a few things) you need to do, and you can just optimize and simplify the problem for those particular issues.
The desktop, in contrast, is all about a wide variety of uses. Huge variety in hardware, huge variety in software, and tons of crazy users doing things that no sane person would ever even think of doing. Except, it turns out, those crazy users may be doing odd things, but they do them for (sometimes) good reasons.
So aiming for the desktop always forces you to solve a much more generic problem than any other target would have forced us to look at.
Of course, Linux then becomes extra general-purpose because it’s not just meant to be a desktop OS. If we only cared about the desktop we’d never have worked on other architectures or worried about scalability to thousands of cores. So it’s not sufficient to just be a desktop, you do have to also look at other niches, but generally the desktop problems really do get you 90% of the way, and then solving scalability problems etc. is the frosting on the cake.