14th May 2015, 5 min read

Speeding-Up Software Builds: Parallelizing Make and Compiler Cache

1. Problem statement

Compiling source code with a compiler usually employs the make command which keeps track of dependencies. Additionally GNU make can parallelize your build using the j-parameter. Often you also want a so called clean build, i.e., compile all source code files, just in case make missed some files when recompiling. Instead of deleting all previous effort one can use a cache of previous compilations.

I had two questions where I wanted quantitative answers:

What is the best j for parallel make, i.e., how many parallel make's should one fire?
What effect does a compiler cache have?

To the first question: At first I thought that the more I parallelize the better. This belief was based on [Compiler Speed-up](http://www.faqs.org/docs/Linux-mini/Modules.html#SPEEDUP "Compiler Speed-up"), which basically says, the more RAM you have, you should parallelize more. Although, this article, [Parallelizing Compilations](https://www.blaess.fr/christophe/2012/01/14/parallelizing-compilations/ "Parallelizing Compilations"), showed that more parallelism is not always better. So I conducted my own experiments to verify the results.

To the second question: As compiler cache I used Andrew Tridgell's ccache, which he wrote for Samba.

For these tests I used the source code of the SLURM scheduler, see slurm.schedmd.com. This software package contains roughly 1.000 C source code and header files (~600 C plus ~300 header files), comprising ca. 550 kLOC. My machine uses an AMD CPU FX 8120 (Bulldozer), 8 cores, clocked with 3.1GHz, and 16 GB RAM.

I went through the dull task of compiling the SLURM software with different settings of make, then cleaning up everything, and repeat the cycle. Below chart shows the results for varying j, once without compiler, and once with a compiler cache. Execution time is in seconds, time is "real" time as given by time command.

runtime for parallel make

2. Conclusions and key findings

Running more parallel make jobs than processor cores on the machine does not gain you performance. It is not bad, but it is not good either.
make -j without explicit number of parallel tasks is a good choice.
The C compiler cache ccache speeds up your compilations up to a factor of 5, sometimes even higher. There is no good reason not to use a compiler cache.

3. Raw numbers

Making all of SLURM:

tar jxf slurm-14.11.4.tar.bz2
cd slurm-14.11.4
./configure
time make
real    4m36.470s
user    3m24.248s
sys     1m12.379s

Between all compilations the result is cleaned:

time make clean
real    0m5.558s
user    0m2.014s
sys     0m3.912s

Now compiling and cleaning, going down from infinity, 16, 15, down to 1.

time make -j > /dev/null
real    1m44.970s
user    4m17.657s
sys     0m46.102s

time make -j16
real    1m44.144s
user    4m16.120s
sys     0m46.191s

time make -j16 > /dev/null
real    1m44.745s
user    4m16.242s
sys     0m46.358s

time make -j15 > /dev/null
real    1m44.231s
user    4m16.457s
sys     0m46.269s

time make -j14 > /dev/null
real    1m44.476s
user    4m15.833s
sys     0m47.091s

time make -j13 > /dev/null
real    1m44.675s
user    4m17.787s
sys     0m45.906s

time make -j12 > /dev/null
real    1m44.046s
user    4m16.554s
sys     0m46.575s

time make -j11 > /dev/null
real    1m43.612s
user    4m16.319s
sys     0m45.957s

time make -j10 > /dev/null
real    1m44.111s
user    4m16.999s
sys     0m46.181s

time make -j9 > /dev/null
real    1m43.239s
user    4m16.244s
sys     0m46.073s

time make -j8 > /dev/null
real    1m43.310s
user    4m15.317s
sys     0m46.257s

time make -j7 > /dev/null
real    1m44.913s
user    4m9.122s
sys     0m46.388s

time make -j6 > /dev/null
real    1m47.387s
user    4m1.811s
sys     0m46.165s

time make -j5 > /dev/null
real    1m51.977s
user    3m52.737s
sys     0m44.644s

time make -j4 > /dev/null
real    1m55.399s
user    3m37.683s
sys     0m44.401s

time make -j3 > /dev/null
real    2m6.940s
user    3m31.548s
sys     0m45.247s

time make -j2 > /dev/null
real    2m29.562s
user    3m15.105s
sys     0m45.061s

time make -j1 > /dev/null
real    3m55.786s
user    3m12.081s
sys     0m45.784s

Now the same procedure with ccache.

time make -j > /dev/null
real    0m38.625s
user    0m37.360s
sys     0m26.392s

time make -j8 > /dev/null
real    0m38.592s
user    0m36.810s
sys     0m26.214s

time make -j7 > /dev/null
real    0m39.086s
user    0m36.790s
sys     0m26.490s

time make -j6 > /dev/null
real    0m39.107s
user    0m36.447s
sys     0m26.119s

time make -j5 > /dev/null
real    0m40.034s
user    0m36.930s
sys     0m26.208s

time make -j4 > /dev/null
real    0m41.072s
user    0m36.400s
sys     0m26.573s

time make -j3 > /dev/null
real    0m42.400s
user    0m36.205s
sys     0m26.972s

time make -j2 > /dev/null
real    0m47.814s
user    0m37.186s
sys     0m27.551s

time make -j1 > /dev/null
real    1m4.060s
user    0m37.844s
sys     0m28.901s

Speed comparison for simple C file:

time cc -c j0.c
real    0m0.043s
user    0m0.034s
sys     0m0.009s

time /usr/lib/ccache/cc -c j0.c
real    0m0.008s
user    0m0.005s
sys     0m0.004s

Code of simple C file j0.c:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main (int argc, char *argv[]) {
        double x;
        double end = ((argc >= 2) ? atof(argv[1]) : 20.0);

        for (x=1; x<=end; ++x)
                printf("%3.0f\t%16.12f\n",x,j0(x));

        return 0;
}

Counting lines of code in SLURM:

$ wc `find . \( -name \*.h -o -name \*.c \)`

4. man page excerpt for make

-j [jobs], --jobs[=jobs] Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.