(See the prelude for some general information about what this is all about)
Common sense dictates that whenever you speed up any part of an application, be it computation, communication, or I/O, time to solution must go down. Why should one then try to slow down computations? In a sense, this stunt is similar to Stunt 1, but there’s more to it: Whenever there is some parallel overhead that adds to pure code execution time, the denominator in our “speedup” formula from Stunt 1 gets larger, impeding scalability. To make the discussion more general, let’s look at the speedup for parallel execution with N workers and a sequential part s, and scale the parallel problem size with a factor proportional to N^\alpha:
Here, c_\alpha(N) contains all the overhead that is not directly related to code execution: Communication, I/O, synchronization, etc. The parameter \alpha can be used to set the problem size scaling: \alpha=0 for strong scaling, \alpha=1 for weak scaling. Now if the “computation” parts of this expression, i.e., everything except c_\alpha(N), get larger (e.g., by a factor of \mu>1), the impact of overhead goes down by just this factor:
In layman’s terms, this effect can be summarized as “A slow machine scales better,” and it is one of the key reasons why Stunt 1 works.
Three corollaries immediately follow from this:
- Do not use high compiler optimization levels or the latest compiler versions. This is always possible if the machine you use just isn’t slow enough.
- Use a convoluted C++ framework that hides all performance complexities by neatly overloaded operators and template mechanics. You can then claim that, since the compiler will generate “optimal” code, performance is not your concern any more.
- If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code! Everyone knows today that “one should go hybrid”, even if there’s no indication that this will do any good.
If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code!
However, let’s not forget that there are valid arguments for machines with slow processors like the (now extinct) IBM Blue Gene. Apart from the power consumption issue (a core that is \mu times slower than a standard x86 core consumes far less than 1/\mu times the power), it can be beneficial to use \mu N slow CPUs instead of N fast ones, if communication overhead has a certain dependence on N. See our book for a theoretical treatment of “slow computing”.