The Performance Impact of Excessive Context Switching

Serkan Erip
5 min read · Jul 25, 2024



When optimizing application performance, understanding the effects of context switching is essential. This article explores the performance penalties associated with excessive context switching and illustrates its impact through a practical example involving a custom blocking queue implementation.

What is Context Switching?

Context switching occurs when the operating system saves the state of a currently running thread and loads the state of another thread. This process introduces overhead as it involves saving and restoring registers, stack pointers, and other context information. Excessive context switching can lead to performance degradation, as the accumulated overhead reduces CPU efficiency.
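
On Linux you can observe these switches directly: the kernel exposes per-task counters in /proc. Below is a minimal, Linux-only sketch (the class name is mine, and the /proc layout is a kernel detail rather than a Java API) that prints how often the current process has been switched out:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ContextSwitchCounters {
    public static void main(String[] args) throws IOException {
        // /proc/self/status reports voluntary_ctxt_switches (the task
        // blocked and gave up the CPU) and nonvoluntary_ctxt_switches
        // (the scheduler preempted it).
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.contains("ctxt_switches")) {
                System.out.println(line);
            }
        }
    }
}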

A Practical Example

To illustrate the performance impact, I implemented a custom blocking queue in Java as an exercise in concurrency primitives. I then measured the latency of 16 threads adding a total of 1 million integers to the queue (1,000,000 / 16 = 62,500 items per thread).
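
The benchmark code itself isn't reproduced in this post; as a rough reconstruction (class and variable names here are mine, and the baseline uses the JDK queue), the multi-producer harness looks something like this:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BenchmarkSketch {
    public static void main(String[] args) throws InterruptedException {
        int threads = 16;
        int totalItems = 1_000_000;
        int itemsPerThread = totalItems / threads;
        // For the custom run, substitute the hand-rolled blocking queue here.
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(totalItems);

        Thread[] producers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            producers[t] = new Thread(() -> {
                for (int i = 0; i < itemsPerThread; i++) {
                    queue.offer(i); // all producers hammer the same lock
                }
            });
        }
        long start = System.nanoTime();
        for (Thread p : producers) p.start();
        for (Thread p : producers) p.join();
        System.out.printf("%d threads produced %d items in %d ms%n",
                threads, totalItems, (System.nanoTime() - start) / 1_000_000);
    }
}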

$ time java -cp "./classes" QueueBenchmark 16 1000000 custom
Warmup done... 1 thread produced 1000000 items in 19ms
QueueBenchmark 16 1000000 custom 0.93s user 1.91s system 97% cpu 2.929 total

Is this performance good or bad? To answer that, we need to compare it with a known queue implementation, such as Java’s ArrayBlockingQueue.

$ time java -cp "./lib/*:./classes" JavaQueueBenchmark
Warmup done... 1 thread produced 1000000 items in 19ms
JavaQueueBenchmark 0.15s user 0.04s system 193% cpu 0.100 total

The numbers reveal that while the custom queue performed comparably in the single-producer warmup, it lagged far behind in the multi-producer test. Typically, developers reach for a profiler tailored to their language to identify slow code sections, but I had a hypothesis about the problem and decided to profile with the Linux perf tool instead.

Performance Analysis
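
These stats come from running the benchmark under perf stat; the exact invocation isn’t shown here, but it was presumably something along these lines:

$ perf stat java -cp "./classes" QueueBenchmark 16 1000000 custom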

          6,115.30 msec task-clock                #    0.968 CPUs utilized
         1,002,151      context-switches          #  163.876 K/sec
             1,739      cpu-migrations            #  284.368 /sec
            25,572      page-faults               #    4.182 K/sec
    13,350,750,302      cycles                    #    2.183 GHz                        (85.02%)
     7,714,863,056      stalled-cycles-frontend   #   57.79% frontend cycles idle       (86.66%)
    12,868,886,058      instructions              #    0.96  insn per cycle
                                                  #    0.60  stalled cycles per insn    (85.71%)
     2,677,152,488      branches                  #  437.779 M/sec                      (86.25%)
       219,523,798      branch-misses             #    8.20% of all branches            (86.21%)
     4,576,571,396      L1-dcache-loads           #  748.380 M/sec                      (85.60%)
       342,838,431      L1-dcache-load-misses     #    7.49% of all L1-dcache accesses  (84.58%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       6.319865260 seconds time elapsed

       0.378917000 seconds user
       5.919201000 seconds sys

I suspected a high number of context switches due to intense contention between threads trying to acquire the lock, and these results confirm the suspicion. There is no universal ideal number of context switches, but for a simple application like this, 1,002,151 switches is unexpectedly high. The process couldn’t even retire one instruction per cycle (0.96 insn per cycle), while modern processors can execute four or more instructions per cycle. The time breakdown tells the same story: 5.9 seconds were spent in the kernel (sys) versus only 0.38 seconds in user code, so most of the time went to scheduling and switching rather than to our program. Once we address the excessive context switching, the process should run far more efficiently.

What is causing the context switching in this process?

In our case, the excessive context switching is caused by lock acquisition. In the benchmark, 16 threads race to acquire the lock; only one succeeds, and the kernel suspends the rest. This results in frequent attempts to acquire the lock, with many threads repeatedly failing and yielding the CPU. Consequently, the operating system performs many context switches to manage these threads, each of which incurs a performance penalty.

private boolean offer0(T item, long time, TimeUnit timeUnit) {
    lock.lock();
    try {
        // Wait until the queue has room, or give up when the timeout elapses.
        while (!hasCapacity()) {
            if (!capacityAvailableCond.await(time, timeUnit)) {
                return false; // timed out waiting for capacity
            }
        }
        queue.add(item);
        condition.signal(); // wake a consumer waiting for an item
        return true;
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    } finally {
        lock.unlock();
    }
}

Addressing Excessive Context Switching

Excessive context switching caused by high lock contention can be mitigated with techniques that keep threads running on the CPU instead of suspending them at the first failed acquisition.

Modern mutex/lock implementations often incorporate a brief busy wait for exactly this reason: a thread spins for a short time before it is suspended until the lock becomes available. Java’s ReentrantLock, however, does not busy-wait.

Busy Waiting

Busy waiting is a technique where a thread repeatedly checks for a condition to be met without yielding control to the operating system. While it can reduce latency by avoiding context switches, it may increase CPU utilization and contention, particularly in multi-threaded environments. The key is to balance reduced latency with potential increases in contention and CPU consumption.
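
Stripped of the queue logic, a busy wait is just a loop that keeps polling a condition. Here is a minimal sketch (the flag and method names are mine; Thread.onSpinWait(), available since Java 9, merely hints to the CPU that the loop is a spin):

import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitSketch {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    void publish() {
        ready.set(true);
    }

    void awaitBusily() {
        // Busy wait: keep checking the flag instead of blocking. The thread
        // never yields to the OS, so no context switch occurs, but it burns
        // a CPU core for as long as it spins.
        while (!ready.get()) {
            Thread.onSpinWait();
        }
        // ...proceed now that the flag is set
    }
}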

Let’s enhance the queue implementation with a busy waiting mechanism:

private boolean offer0(T item, long time, TimeUnit timeUnit) {
    // A primitive form of busy waiting: spin on tryLock a few times to
    // avoid a context switch before falling back to a blocking lock().
    var gotLock = false;
    for (int i = 0; i < 10; i++) {
        if (lock.tryLock()) {
            gotLock = true;
            break;
        }
    }
    if (!gotLock) lock.lock();
    // lock.lock(); // without busy waiting
    try {
        while (!hasCapacity()) {
            if (!capacityAvailableCond.await(time, timeUnit)) {
                return false; // timed out waiting for capacity
            }
        }
        queue.add(item);
        condition.signal(); // wake a consumer waiting for an item
        return true;
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    } finally {
        lock.unlock();
    }
}

Will this super-simple busy-waiting code improve the performance? Let’s benchmark it. Keep in mind that when tryLock is used with a fair lock, fairness is compromised, because tryLock does not consider the order of threads waiting in the queue (see the sketch below).
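
For reference, that caveat comes from ReentrantLock’s own documentation: a plain tryLock() “barges” past threads already queued for a fair lock, whereas the timed variant honors the fairness setting. A small illustration (class and method names are mine):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class FairnessSketch {
    private final ReentrantLock fairLock = new ReentrantLock(true); // fair mode

    void barging() {
        // tryLock() may grab the lock ahead of threads that have waited
        // longer, bypassing the fairness policy.
        if (fairLock.tryLock()) {
            try { /* critical section */ } finally { fairLock.unlock(); }
        }
    }

    void polite() throws InterruptedException {
        // The timed variant honors fairness, per the javadoc.
        if (fairLock.tryLock(0, TimeUnit.SECONDS)) {
            try { /* critical section */ } finally { fairLock.unlock(); }
        }
    }
}

With that caveat noted, here are the results: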

$ time java -cp "./classes" QueueBenchmark 16 1000000 custom
Warmup done... 1 thread produced 1000000 items in 19ms
QueueBenchmark 16 1000000 custom 0.15s user 0.05s system 235% cpu 0.086 total

The results are impressive: total time dropped from 2.929 seconds to 0.086 seconds. 🎊 ㊗️ That puts the custom queue on par with ArrayBlockingQueue. The primary goal, however, is not to compete with a production-ready, proven implementation but to illustrate the impact of busy waiting.

Updated Perf Profiling Results

            232.01 msec task-clock                #    2.224 CPUs utilized
             4,774      context-switches          #   20.576 K/sec
               393      cpu-migrations            #    1.694 K/sec
            13,942      page-faults               #   60.091 K/sec
     1,013,285,450      cycles                    #    4.367 GHz                        (83.43%)
       386,949,456      stalled-cycles-frontend   #   38.19% frontend cycles idle       (89.71%)
     1,643,386,820      instructions              #    1.62  insn per cycle
                                                  #    0.24  stalled cycles per insn    (91.33%)
       328,064,094      branches                  #    1.414 G/sec                      (88.12%)
        12,290,389      branch-misses             #    3.75% of all branches            (81.64%)
       699,767,081      L1-dcache-loads           #    3.016 G/sec                      (79.57%)
        22,684,535      L1-dcache-load-misses     #    3.24% of all L1-dcache accesses  (86.93%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0.104339324 seconds time elapsed

       0.170066000 seconds user
       0.062219000 seconds sys

The massive reduction in context switches, from 1,002,151 down to 4,774, confirms the effectiveness of busy waiting. Fewer switches drove the dramatic improvement in the benchmark, and the other counters improved along with them: instructions per cycle rose from 0.96 to 1.62, and frontend stalls fell from 57.79% to 38.19% of cycles.

Conclusion

Understanding the performance impact of context switching is crucial for optimizing applications. By comparing implementations with and without busy waiting, we see that reducing context switches can significantly enhance performance, especially in latency-sensitive applications.

The code is not production-ready; it was written to demonstrate the performance impact of context switches. I hope it achieved that. All the best!

The source code is on GitHub.
