[KyOSS Discuss] hwloc and process affinity for performance issues

Jeff Squyres jeff at squyres.com
Thu Jun 12 08:46:36 EDT 2014

Charles / everyone --

For all my talk last night, the only point I was really trying to convey is
that programmers cannot stick their fingers in their ears and cover their
eyes and ignore the underlying hardware, and just trust that it will always
go fast.  You absolutely don't need to be an expert in the underlying
hardware, but you should know *something* about it, and at least keep it in
mind when writing software.

A good example is your car: 99% of the world doesn't know (or care) how
a carburetor works, and yet they can operate their vehicles just fine.  But
consider: everyone had to take a minimum competency test and certification
(i.e., driver's test/license) before they were allowed to operate that car.
 Meaning: everyone knows about pushing on the gas and the brakes,
windshield wipers, turn signals, ...etc.

This kind of basic information -- gas/brakes/windshield wipers/turn
signals/etc. -- is all that I'm encouraging programmers to understand.
 Understanding and designing for the basic model of a modern server can
actually make tangible differences in the operating performance of your
software.  And that, in turn, can turn into tangible savings in hardware
expenditures (regardless of your hosting scenario).

Finally, I want to give some disclaimers about the affinity advice I gave
to Charles last night...

   1. The commands you want to use out of the hwloc package are lstopo
   (list topology) and hwloc-bind (bind a process -- and its children -- to a
   set of cores/hyperthreads).
   2. Adding process affinity to your cron jobs will likely not magically
   solve your performance problems.  Affinity may *help*, but the degree to
   which it helps your performance issues depends on exactly what the
   performance problems are.
   3. Many other factors come into play, too.  You should examine the
   processes in question and see exactly what the bottlenecks are: raw disk
   IO? Memory pressure / swapping? Database queries?  Network activity?  ...?
   4. Affinity *may* help (some) in these cases -- e.g., if part of your
   problem is raw disk IO, try locking the process down to a core (or
   hyperthread) that is NUMA-close to where the disk is located.  Remember
   last night that I showed a server with 2 NUMA domains, and the disk was a
   PCI device hanging off one of them.  Likewise, if the bottleneck is network
   IO, then try locking the process to a core NUMA-close to the network device
   that you're using.  And so on.
   5. I spoke last night about the example of running one web server
   (apache, nginx, etc.) per processor (i.e., set of 8 cores). This not only
   tends to keep the web server process physically close to the memory that it
   uses, you can also configure the web server to use a NIC that is
   NUMA-close, too, further reducing server-internal network congestion (I
   don't believe I mentioned the latter point last night).
   6. Sometimes using process affinity does not increase the performance of
   any individual process.  But if used judiciously with lots of processes in
   a single server, it can improve the overall throughput of the server
   because you've decreased the amount of "code movement" within a server, and
   potentially removed contention for internal resources (L1/L2/L3 caches,
   NUMA interconnect, memory controllers, etc.).  There have been a few
   academic papers showing exactly this effect -- individual processes weren't
   noticeably faster/more efficient when affinitized vs. non-affinitized, but
   servers were able to be loaded higher and still run efficiently/with a high
   degree of concurrency as compared to not using affinitized/locale-aware
   processes.  Put simply: without affinitization/locale-awareness, they could
   run X processes at Y% efficiency, but *with*
   affinitization/locale-awareness, they could run (X+Z) processes at the same
   Y% efficiency.  Meaning: you can run more stuff at the same level of
   efficiency, because you're effectively using the same hardware
   more efficiently.
   7. Additionally, if your jobs are running in a VM that does not lock
   virtual cores to actual cores (or virtual cores to physical
   hyperthreads, at the very least), then affinity likely won't help much --
   if at all -- because the hypervisor has already virtualized the processors,
   and can therefore remap your affinitized process around at will (i.e., your
   guest OS thinks the process is locked to a core, but the definition of
   that core may be changed at any time by the hypervisor).
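
To make points 1 and 4 a bit more concrete, here is a rough Python sketch of
the same idea that hwloc-bind implements: find the cores that are NUMA-close
to a network device, then bind the current process to them.  It is not how
hwloc does it; it just reads the Linux sysfs files directly and calls
os.sched_setaffinity, so it assumes Linux.  The interface name "eth0" and the
helper names (parse_cpulist, cpus_near_nic, pin_to) are placeholders made up
for this example.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a NUMA node with no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_near_nic(iface):
    """Return the CPUs on the NUMA node the NIC hangs off, or None if unknown."""
    node_file = Path("/sys/class/net") / iface / "device" / "numa_node"
    try:
        node = int(node_file.read_text())
    except (OSError, ValueError):
        return None  # virtual interface, non-Linux system, etc.
    if node < 0:
        return None  # the kernel reports -1 when it has no NUMA information
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist")
    try:
        return parse_cpulist(cpulist.read_text())
    except (OSError, ValueError):
        return None


def pin_to(cpus):
    """Bind this process (pid 0 means "self") to the given CPUs.

    Only CPUs we are already allowed to run on are used.  Returns the new
    affinity set, or None if nothing was changed.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # call is not available on this platform
    allowed = os.sched_getaffinity(0) & set(cpus)
    if not allowed:
        return None
    os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    near = cpus_near_nic("eth0")  # "eth0" is a placeholder interface name
    if near:
        print("pinned to", sorted(pin_to(near) or []))
    else:
        print("no NUMA information for eth0; leaving affinity alone")
```

On Linux, child processes inherit the affinity mask, so anything launched
after pin_to() stays on the same cores, much like a command started under
hwloc-bind.  Per point 2, measure before and after; the binding by itself
proves nothing.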

In short: as usual, YMMV.

PS: Bonus words of the day include "affinitized" and "affinitization".
Use them in sentences today.  :-)

{+} Jeff Squyres