Martin is a high-performance and low-latency specialist, with experience gained over two decades working with large scale transactional and big-data domains, including automotive, gaming, financial, mobile, and content management. He believes Mechanical Sympathy - applying an understanding of the hardware to the creation of software - is fundamental to delivering elegant, high-performance, solutions. Martin was the co-founder and CTO of LMAX, until he left to specialise in helping other people achieve great performance with their software. The Disruptor concurrent programming framework is just one example of what his mechanical sympathy has created. He blogs at mechanical-sympathy.blogspot.com Martin is a DZone MVB and is not an employee of DZone and has posted 29 posts at DZone. You can read more from them at their website. View Full User Profile

Processor Affinity - Part 1

09.06.2011
| 4816 views |
  • submit to reddit
In a series of articles I’ll aim to show the performance impact of processor affinity in a range of use cases.

Background

A thread of execution will typically run until it has used up its quantum (aka time slice), at which point it joins the back of the run queue waiting to be re-scheduled as soon as a processor core becomes available.  While running the thread will have accumulated a significant amount of state in the processor, including instructions and data in the cache.   If the thread can be re-scheduled to run on the same core as last time it can benefit from all that accumulated state.   A thread may equally not run to the end of its quantum because it has been pre-empted, or blocked on IO or a lock.  After which, when it is ready to run again, the same holds true.

There are numerous techniques available for pinning threads to a particular core.   In this article I’ll illustrate the use of the taskset command on two threads exchanging IP multicast messages via a dummy interface.  I’ve chosen this as the first example because in a low-latency environment multicast is the preferred IP protocol.  For simplicity, I’ve also chosen to not involve the physical network while introducing the concepts.   In the next article I’ll expand on this example and the issues involving a real network.

1. Create the dummy interface

  $ su -
  $ modprobe dummy
  $ ifconfig dummy0 172.16.1.1 netmask 255.255.255.0
  $ ifconfig dummy0 multicast

2. Get the Java files (Sender and Receiver) and compile them

  $ javac *.java

3. Run the tests without CPU pinning

Window 1:
  $ java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
  $ java MultiCastSender 230.0.0.1 dummy0 20000000

4. Run the tests with CPU pinning

Window 1:
  $ taskset -c 2 java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
  $ taskset -c 4 java MultiCastSender 230.0.0.1 dummy0 20000000

Results

The tests output once per second the number of messages they have managed to send and receive.  A typically example run is charted in Figure 1 below.

Figure 1.


The interesting thing I've observed is that the unpinned test will follow a step function of unpredictable performance.  Across many runs I've seen different patterns but all similar in this step function nature.  For the pinned tests I get consistent throughput with no step pattern and always the greatest throughput.

This test is not particularly CPU intensive, nor does it access the physical network device, yet it shows how critical processor affinity is to not just high performance but also predictable performance.  In the next article of this series I'll introduce a network hop and the issues arising from interrupt handling.

From http://mechanical-sympathy.blogspot.com/2011/07/processor-affinity-part-1.html

Published at DZone with permission of Martin Thompson, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags: