The Linux kernel could soon be 50 to 80% faster to build

The Linux kernel takes around 5 minutes (without modules) to build on an Intel Core i5 Tiger Lake mini PC with 16 GB RAM and a fast SSD, based on our recent review of the Beelink GTi 11 mini PC. Kernel developers may have to build for different targets and configurations, plus all modules, so build times add up. While it is always possible to throw more hardware at the problem, it would be good if significantly faster builds could be achieved through software optimizations.

That’s exactly what Ingo Molnar has been working on since late 2020 with his “Fast Kernel Headers” project aiming to eliminate the Linux kernel’s “Dependency Hell”. At the time he aimed for a 20% speedup, but a little over one year later, the results are much more impressive with 50 to 80% faster builds depending on the target platform (x86-64, arm64, etc…) and config.

This has been quite a Herculean effort: the project consists of 25 internal sub-trees and over 2,200 commits, which can be obtained with:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mingo/tip.git


Ingo further explains why reducing header dependencies is so hard and why so many commits are required:

As most kernel developers know, there’s around ~10,000 main .h headers in the Linux kernel, in the include/ and arch/*/include/ hierarchies. Over the last 30+ years, they have grown into a complicated & painful set of cross-dependencies we are affectionately calling ‘Dependency Hell’.
…

When I started this project, late 2020, I expected there to be maybe 50-100 patches. I did a few crude measurements that suggested that about 20% build speed improvement could be gained by reducing header dependencies, without having a substantial runtime effect on the kernel. Seemed substantial enough to justify 50-100 commits.

But as the number of patches increased, I saw only limited performance increases. By mid-2021 I got to over 500 commits in this tree and had to throw away my second attempt (!), the first two approaches simply didn’t scale, weren’t maintainable and barely offered a 4% build speedup, not worth the churn of 500 patches and not worth even announcing.

With the third attempt I introduced the per_task() machinery which brought the necessary flexibility to reduce dependencies drastically, and it was a type-clean approach that improved maintainability. But even at 1,000 commits I barely got to a 10% build speed improvement. Again this was not something I felt comfortable pushing upstream, or even announcing. :-/

But the numbers were pretty clear: 20% performance gains were very much possible. So I kept developing this tree, and most of the speedups started arriving after over 1,500 commits, in the fall of 2021. I was very surprised when it went beyond 20% speedup and more, then arrived at the current 78% with my reference config. There’s a clear super-linear improvement property of kernel build overhead, once the number of dependencies is reduced to the bare minimum.

You’d think kernel maintainers would be wary of accepting such a large number of patches, but the feedback from Greg KH is rather positive, even though he warned of potential maintenance issues:

This is “interesting”, but how are you going to keep the kernel/sched/per_task_area_struct_defs.h and struct task_struct_per_task definition in sync? It seems that you manually created this (which is great for testing), but over the long-term, trying to manually determine what needs to be done here to keep everything lined up properly is going to be a major pain.

That issue aside, I took a glance at the tree, and overall it looks like a lot of nice cleanups. Most of these can probably go through the various subsystem trees, after you split them out, for the “major” .h cleanups. Is that something you are going to be planning on doing?

The discussion with maintainers about how to proceed is still in progress. So let’s look at the numbers:

  #
  # Performance counter stats for 'make -j96 vmlinux' (3 runs):
  #
  # (Elapsed time in seconds):
  #

  v5.16-rc7:            231.34 +- 0.60 secs, 15.5 builds/hour    # [ vanilla baseline ]
  -fast-headers-v1:     129.97 +- 0.51 secs, 27.7 builds/hour    # +78.0% improvement

Or in terms of CPU time utilized:

  v5.16-rc7:            11,474,982.05 msec cpu-clock   # 49.601 CPUs utilized
  -fast-headers-v1:      7,100,730.37 msec cpu-clock   # 54.635 CPUs utilized   # +61.6% improvement
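
To make these percentages concrete, here is a small Python sketch that recomputes them from the figures above. The function names are our own and are not part of Ingo's tooling. It shows that "+78%" is a gain in builds per hour, while the same measurements correspond to roughly 44% less wall-clock time.

```python
def throughput_gain_pct(before, after):
    """Percent more builds per unit of time when cost drops from before to after."""
    return (before / after - 1.0) * 100.0


def time_saved_pct(before, after):
    """Percent of the original time that is no longer spent."""
    return (1.0 - after / before) * 100.0


# Elapsed seconds for 'make -j96 vmlinux', copied from the output above.
BASELINE_SECS = 231.34
FAST_HEADERS_SECS = 129.97

# CPU time in msec (cpu-clock), copied from the output above.
BASELINE_CPU_MSEC = 11_474_982.05
FAST_HEADERS_CPU_MSEC = 7_100_730.37

print(f"builds/hour gain: +{throughput_gain_pct(BASELINE_SECS, FAST_HEADERS_SECS):.1f}%")
print(f"wall-clock saved:  {time_saved_pct(BASELINE_SECS, FAST_HEADERS_SECS):.1f}%")
print(f"CPU-time gain:    +{throughput_gain_pct(BASELINE_CPU_MSEC, FAST_HEADERS_CPU_MSEC):.1f}%")
```

Both ways of quoting the result are valid, as long as the unit (throughput or time) is stated.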


A full build of Linux 5.16-rc7 went from 231 seconds to just 130 seconds with the fast-headers tree, which is around a 78% improvement in builds per hour (or about 44% less wall-clock time). You may think it’s only useful for people always building from scratch, but incremental builds benefit even more from the headers cleanup:

                                  | v5.16-rc7                      | -fast-headers-v1
 ---------------------------------|--------------------------------|--------------------------------
 'touch include/linux/sched.h'    | 230.30 secs | 15.6 builds/hour | 108.35 secs | 33.2 builds/hour | +112%
 'touch include/linux/mm.h'       | 216.57 secs | 16.6 builds/hour |  79.42 secs | 45.3 builds/hour | +173%
 'touch include/linux/fs.h'       | 223.58 secs | 16.1 builds/hour |  85.52 secs | 42.1 builds/hour | +161%
 'touch include/linux/device.h'   | 224.35 secs | 16.0 builds/hour |  97.09 secs | 37.1 builds/hour | +132%
 'touch include/net/sock.h'       | 105.85 secs | 34.0 builds/hour |  40.88 secs | 88.1 builds/hour | +159%
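
As a rough cross-check, the sketch below ranks the touched headers by rebuild speedup, using only the elapsed seconds from the table. The names `ROWS`, `speedup_pct` and `rank_by_speedup` are ours. Since it works from rounded seconds, the recomputed values can differ by about a percentage point from the ones printed in the table.

```python
# (header touched, rebuild seconds on v5.16-rc7, rebuild seconds on -fast-headers-v1)
ROWS = [
    ("include/linux/sched.h", 230.30, 108.35),
    ("include/linux/mm.h", 216.57, 79.42),
    ("include/linux/fs.h", 223.58, 85.52),
    ("include/linux/device.h", 224.35, 97.09),
    ("include/net/sock.h", 105.85, 40.88),
]


def speedup_pct(before_s, after_s):
    """Percent more rebuilds per hour after the header cleanup."""
    return (before_s / after_s - 1.0) * 100.0


def rank_by_speedup(rows):
    """Return (header, speedup percent) pairs, largest speedup first."""
    pairs = [(header, speedup_pct(before, after)) for header, before, after in rows]
    return sorted(pairs, key=lambda item: item[1], reverse=True)


for header, pct in rank_by_speedup(ROWS):
    print(f"{header:26s} +{pct:.0f}%")
```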


Incremental builds are up to 173% faster. The main reason for the improvement is the “drastic” reduction of “the effective post-preprocessing size of key kernel headers”, some of which are listed below:

    ------------------------------------------------------------------------------------------
    | Combined, preprocessed C code size of header, without line markers,
    | with comments stripped:
    ------------------------------.-----------------------------.-----------------------------
                                  | v5.16-rc7                   |  -fast-headers-v1
                                  |-----------------------------|-----------------------------
     #include <linux/sched.h>     | LOC: 13,292 | headers:  324 |  LOC:    769 | headers:   64
     #include <linux/wait.h>      | LOC:  9,369 | headers:  235 |  LOC:    483 | headers:   46
     #include <linux/rcupdate.h>  | LOC:  8,975 | headers:  224 |  LOC:  1,385 | headers:   86
     #include <linux/hrtimer.h>   | LOC: 10,861 | headers:  265 |  LOC:    229 | headers:   37
     #include <linux/fs.h>        | LOC: 22,497 | headers:  427 |  LOC:  1,993 | headers:  120
     #include <linux/cred.h>      | LOC: 17,257 | headers:  368 |  LOC:  4,830 | headers:  129
     #include <linux/dcache.h>    | LOC: 10,545 | headers:  253 |  LOC:    858 | headers:   65
     #include <linux/cgroup.h>    | LOC: 33,518 | headers:  522 |  LOC:  2,477 | headers:  111
     #include <linux/module.h>    | LOC: 16,948 | headers:  339 |  LOC:  2,239 | headers:  122
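
The article does not show how these line counts were produced, so the following is only a sketch of one plausible way to count "preprocessed C code size, without line markers". It assumes the input is the text emitted by a C preprocessor such as `gcc -E`, which strips comments by default and inserts line markers starting with `#`. The function name `count_preprocessed_loc` is hypothetical.

```python
def count_preprocessed_loc(preprocessed_text):
    """Count non-blank lines of preprocessor output, ignoring line markers.

    Assumption: every line starting with '#' is a line marker such as
    '# 1 "include/linux/sched.h"'. This also skips any '#pragma' lines
    that survive preprocessing, which is an approximation.
    """
    count = 0
    for line in preprocessed_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        count += 1
    return count


# Tiny hand-written stand-in for real preprocessor output.
SAMPLE = """# 1 "example.h"
struct list_head {

    struct list_head *next, *prev;
};
# 5 "example.h" 2
"""

print(count_preprocessed_loc(SAMPLE))
```

In practice the text would come from running the preprocessor on a file that contains a single `#include` line, with the kernel's own include paths and defines, which are not covered here.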


LOC stands for lines of code, and the table shows how sharply that number drops in the fast-headers-v1 tree. The same is true for the “headers” column, which represents the number of headers included indirectly. Supported platforms include x86 32-bit & 64-bit (boot tested, and used on the main machine), ARM64 (boot tested), as well as MIPS 32-bit & 64-bit and Sparc 32-bit & 64-bit, but the latter two have only been build tested, not run on actual hardware.
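
To illustrate what "headers included indirectly" means, here is a minimal sketch that computes the transitive set of includes for a header. The include graph is a made-up toy example, not the real kernel dependency graph, and `transitive_includes` is a hypothetical helper name.

```python
def transitive_includes(header, graph):
    """Return every header pulled in directly or indirectly by `header`.

    `graph` maps a header name to the headers it includes directly.
    Cycles are tolerated, much like include guards tolerate them in C.
    """
    seen = set()
    stack = list(graph.get(header, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, ()))
    seen.discard(header)  # a header does not count as its own dependency
    return seen


# Toy graph for illustration only.
TOY_GRAPH = {
    "sched.h": ["wait.h", "list.h"],
    "wait.h": ["list.h", "spinlock.h"],
    "spinlock.h": ["types.h"],
    "list.h": ["types.h"],
    "types.h": [],
}

print(len(transitive_includes("sched.h", TOY_GRAPH)))
```

Although the toy `sched.h` includes only two headers directly, it ends up pulling in four. Removing a single direct include can therefore drop a whole subtree, which may be part of why the gains reported above only became large once many dependencies had been cut.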

The results are impressive, and if the Fast Kernel Headers commits get merged, it could extend the life of existing build farms, and slightly quicken the Linux development process.

Via ZDNet

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.


15 Comments


Peter

2 years ago

Sure, faster is better but why exactly is compile speed so important?

Pim Vullers

Efficiency of the development process. Of course there are other factors like test execution time etc, but for example if you have to check whether a code change does not break other build configurations (or even just other tool chains), then it is nice to have a fast build to check this easily.

Z33d3vill

I agree, I notice plenty want to “build fast, fast fast build!”.. but, what good is compiling/building something if that’s where it never exists beyond.

Willy

Some developers and project maintainers, like me, impose that every single commit builds and works fine. This is critical for bisect sessions. And I can assure you that when you just rebased your 100 commits on top of another developer’s changes and want to be sure that your API change continues to work fine and you’re rebuilding and testing each and every commit in your series, you absolutely value the improved build times! And that’s even more true when you later have to bisect and are happy not to face breakage that could have been justified by too long build…

I’m not surprised at all. About 6 months ago we went through a similar process in haproxy because I was pissed off with the time it took to build. By carefully checking why each header file depended on each other, I could remove a lot of dependencies, move a bit of code around to further reduce them, use incomplete types more often, and all this resulted in 40-45% build time savings. This is even more obvious when looking at the total preprocessed size. It remains something like 20 times the total project size, but used to be 2.5-3 times larger…

megous

Combined with a faster linker, this could be quite interesting! Faster compile AND faster link times. Anyone tried mold with the kernel yet?

willy

I read the article about mold on lwn a week ago or so but hadn’t tried yet, since for my projects ld accounts for 200ms or less and that’s quite acceptable.

Hm, yeah. I just tried to measure link time on my kernel config, and it’s 0.7s, so that’s not a huge issue either.

Linking is usually only single final step at the end of the build process, compared to many compile operations for all source files. One other way to potentially speed up the process would be compiling multiple sources at once. Since in this case also all headers would only be included/processed once, and also compiler is just invoked once (which is also relevant depending on the system setup [at work we have a virtualized setup which has a networked storage only, and hence application loading is slow, which is quite annoying in a build flow in which make calls the compiler…

kcg

So 231s -> 130s time decrease is considered to be 78% improvement? Let’s check it. So 231s is 100% which means 1% == 2.31. Let’s divide 130 / 2.31 == 56 that means improvement is to cut time from original 100% to 56% which means 100-56 == 44%. So what is claimed to be 78% improvement I would consider to be 44% improvement.
Am I right or not?

Yes, you are right. It looks like they computed the improvement in builds per hour or so.

(3600/130) / (3600/231) = 177.7%
So a 77.7% improvement.

…

It’s the usual thing with percentages. Doing something X% faster doesn’t necessarily mean it takes X% less time. Here it’s 78% faster if you measure in lines per second, as it processes 78% more lines per second. In terms of time, it’s 1-(1/1.78) hence 44% faster. That’s why it’s always very important to mention the units when speaking about relative improvements.

Author

When dealing with percentages, adding and subtracting are treated differently.
For example if a car runs at 50km/h, a 100% increase would be 100km/h, but a 100% decrease from 100 km/h (or whatever speed) would be 0km/h, not 50km/h.

It’s the same in our case here. If the method was 100% faster, that would be 2x faster, or 231/2 = 115.5 seconds. If we use the same method of calculation as suggested, 100% would be zero second or infinitely faster.

Jean-Luc,
this is all clear. My complaint is motivated by the fact that I understand “improvement” in build time as a “time decrease”. As you noted yourself, build time is decreased by 44% (e.g. your analogy with 100% speed decrease from 100 km/h to 0 km/h). I agree with the claim that the build is faster by 78%, that’s clear. Nothing more, just a wording issue here.
Thanks, Karel