January 19, 2026

Optimizing a Lock-Free Ring Buffer

A single-producer single-consumer (SPSC) queue is a great example of how far constraints can take a design. In this post, you will learn how to implement a ring buffer from scratch: start with the simplest design, make it thread-safe, and then gradually remove overhead while preserving FIFO behavior and predictable latency. This pattern is widely used to share data between threads in the lowest-latency environments.

What is a ring buffer? §

[Figure 1: Ring buffer with 32 slots. The producer has filled 15 of them, indicated by blue. The consumer is behind the producer, reading data from the slots and freeing them as it does so; free slots are indicated by orange.]

You might have run into the term circular buffer, or perhaps cyclic queue. These are simply other names for a ring buffer: a queue where a producer generates data and inserts it into the buffer, and a consumer later pulls it back out, in first-in-first-out order.

What makes a ring buffer distinctive is how it stores data and the constraints it enforces. It has a fixed capacity; it neither expands nor shrinks. As a result, when the buffer fills up, the producer must either wait until space becomes available or overwrite entries that have not been read yet, depending on what the application expects.

The consumer’s job is straightforward: read items as they arrive. When the ring buffer is empty, the consumer has to block, spin, or move on to other work. Each successful read releases a slot the producer can reuse. In the ideal case, the producer stays just a bit ahead, and the system turns into a quiet game of “catch me if you can,” with minimal waiting on both sides.

Single-threaded ring buffer §

Let’s start with a single-threaded ring buffer, which is just an array and two indices. (By using std::array we force clients to specify the buffer size as a compile-time constant; it’s also common to use a std::vector instead to remove that restriction. A further optimization is to constrain the capacity to a power of two, allowing wrap-around via bit masking, head & (N - 1), instead of a branch.) We leave one slot permanently unused to distinguish “full” from “empty.” Push writes to head and advances it; pop reads from tail and advances it.

#include <array>
#include <cstddef>

template <typename T, std::size_t N>
class RingBufferV1 {
  std::array<T, N> buffer_;
  std::size_t head_{0};  // Index of the next slot to write
  std::size_t tail_{0};  // Index of the next slot to read
};

We can now implement the push (or write) operation. Note how one slot is always left unused so that a full queue can be detected: if advancing head_ by one would make it equal to tail_, the queue is full.

auto push(const T& value) noexcept -> bool {
  auto new_head = head_ + 1;
  if (new_head == buffer_.size()) [[unlikely]] {  // Wrap-around
    new_head = 0;
  }
  if (new_head == tail_) [[unlikely]] {  // Full
    return false;
  }
  buffer_[head_] = value;
  head_ = new_head;
  return true;
}
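
As the aside above mentions, when the capacity is a power of two the wrap-around branch can be replaced by a bit mask. A minimal sketch of that variant of push, not used in the rest of this post:

static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");

auto push(const T& value) noexcept -> bool {
  const auto next_head = (head_ + 1) & (N - 1);  // Branch-free wrap-around
  if (next_head == tail_) [[unlikely]] {  // Full
    return false;
  }
  buffer_[head_] = value;
  head_ = next_head;
  return true;
}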

Next we implement the pop (or read) operation. Again, note that head_ == tail_ indicates that the queue is empty.

auto pop(T& value) noexcept -> bool {
  if (head_ == tail_) [[unlikely]] {  // Empty
    return false;
  }
  value = buffer_[tail_];
  auto next_tail = tail_ + 1;
  if (next_tail == buffer_.size()) [[unlikely]] {  // Wrap-around
    next_tail = 0;
  }
  tail_ = next_tail;
  return true;
}
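
With both operations in place, using the single-threaded version looks like this (a small usage sketch, assuming push and pop are public members of RingBufferV1):

#include <iostream>

int main() {
  RingBufferV1<int, 8> queue;  // Holds at most 7 items: one slot stays unused

  queue.push(1);
  queue.push(2);

  int value = 0;
  while (queue.pop(value)) {
    std::cout << value << '\n';  // Prints 1, then 2: FIFO order
  }
}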

Thread-safe ring buffer §

You probably already noticed that the previous version is not thread-safe. The easiest way to solve this is to add a mutex around push and pop.

#include <mutex>

template <typename T, std::size_t N>
class RingBufferV2 {
  // Same buffer_, head_ and tail_ members as RingBufferV1
  std::mutex mutex_;

 public:
  auto push(const T& value) noexcept -> bool {
    auto lock = std::lock_guard<std::mutex>{mutex_};  // Thread-safe
    // ...
  }

  auto pop(T& value) noexcept -> bool {
    auto lock = std::lock_guard<std::mutex>{mutex_};  // Thread-safe
    // ...
  }
};
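
For reference, the locked push is simply the V1 body guarded by the mutex. The following is a sketch that assumes RingBufferV2 keeps the same buffer_, head_ and tail_ members as RingBufferV1; pop is guarded in exactly the same way:

auto push(const T& value) noexcept -> bool {
  auto lock = std::lock_guard<std::mutex>{mutex_};  // Held for the whole operation
  auto new_head = head_ + 1;
  if (new_head == buffer_.size()) [[unlikely]] {  // Wrap-around
    new_head = 0;
  }
  if (new_head == tail_) [[unlikely]] {  // Full
    return false;
  }
  buffer_[head_] = value;
  head_ = new_head;
  return true;
}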

It’s correct and often fast enough: around 12M ops/s on consumer hardware. (The benchmarks in this post were compiled with clang at the highest -O3 optimization level, plus -march=native and -ffast-math, with the producer and consumer threads pinned to dedicated cores of an Intel Core Ultra 5 135U; see the included benchmark.) However, it pays for mutual exclusion even though the producer and consumer never write the same index. The ownership is asymmetric: the producer is the only writer of head, and the consumer is the only writer of tail. That asymmetry is the lever that lets us remove the locks.

Lock-free ring buffer §

We can remove the locks by using atomics instead. Note that we manually align the atomics with alignas so that each one lands on its own cache line (commonly 64 bytes); this prevents false sharing and therefore makes better use of the CPU caches.

#include <atomic>
#include <new>  // std::hardware_destructive_interference_size

template <typename T, std::size_t N>
class RingBufferV3 {
  std::array<T, N> buffer_;
  alignas(std::hardware_destructive_interference_size) std::atomic_size_t head_{0};  // Written only by the producer
  alignas(std::hardware_destructive_interference_size) std::atomic_size_t tail_{0};  // Written only by the consumer
};

The push implementation becomes

auto push(const T& value) noexcept -> bool {
  const auto head = head_.load();
  auto next_head = head + 1;
  if (next_head == buffer_.size()) [[unlikely]] {
    next_head = 0;
  }
  if (next_head == tail_.load()) [[unlikely]] {
    return false;
  }
  buffer_[head] = value;
  head_.store(next_head);
  return true;
}

And the pop implementation

auto pop(T& value) noexcept -> bool {
  const auto tail = tail_.load();
  if (tail == head_.load()) [[unlikely]] {
    return false;
  }
  value = buffer_[tail];
  auto next_tail = tail + 1;
  if (next_tail == buffer_.size()) [[unlikely]] {
    next_tail = 0;
  }
  tail_.store(next_tail);
  return true;
}

Simply removing the locks yields 35M ops/s, more than double the throughput of the locked version! You have probably noticed that we are using the default std::memory_order_seq_cst memory order for loading and storing the atomics, which is the strongest ordering and typically the slowest. Let’s manually tune the memory order: a release store makes all prior writes visible to any thread that performs an acquire load of the same variable, and each side can load its own index with relaxed ordering, since no other thread ever writes to it.

auto push(const T& value) noexcept -> bool {
  const auto head = head_.load(std::memory_order_relaxed);
  auto next_head = head + 1;
  if (next_head == buffer_.size()) [[unlikely]] {
    next_head = 0;
  }
  if (next_head == tail_.load(std::memory_order_acquire)) [[unlikely]] {
    return false;
  }
  buffer_[head] = value;
  head_.store(next_head, std::memory_order_release);
  return true;
}

auto pop(T& value) noexcept -> bool {
  const auto tail = tail_.load(std::memory_order_relaxed);
  if (tail == head_.load(std::memory_order_acquire)) [[unlikely]] {
    return false;
  }
  value = buffer_[tail];
  auto next_tail = tail + 1;
  if (next_tail == buffer_.size()) [[unlikely]] {
    next_tail = 0;
  }
  tail_.store(next_tail, std::memory_order_release);
  return true;
}

Rerunning the benchmark now gives an astonishing 108M ops/s—3x the previous version and 9x the original locked version. Worth the effort, right?

Further optimization §

We already have a fast ring buffer, but we can push it further. The main slowdown comes from the reader and writer constantly touching each other’s indices. That makes the CPU bounce cache lines between cores, which is expensive. (It’s instructive to watch the number of cache misses with perf stat -e cache-misses; they drop sharply with the approach below.)

To reduce this, the reader can keep a local cached copy of the write index, and the writer can keep a local cached copy of the read index. (This optimization was initially proposed by Erik Rigtorp.) Then they don’t need to re-check the other side on every single operation, only once in a while.

template <typename T, std::size_t N>
class RingBufferV5 {
  std::array<T, N> buffer_;
  alignas(std::hardware_destructive_interference_size) std::atomic_size_t head_{0};
  alignas(std::hardware_destructive_interference_size) std::size_t head_cached_{0};  // Consumer's local copy of head_
  alignas(std::hardware_destructive_interference_size) std::atomic_size_t tail_{0};
  alignas(std::hardware_destructive_interference_size) std::size_t tail_cached_{0};  // Producer's local copy of tail_
};

The push operation is updated to first consult the cached tail, tail_cached_, and only when that check suggests the buffer is full does it refresh the cache and try again:

if (next_head == tail_cached_) [[unlikely]] {
  tail_cached_ = tail_.load(std::memory_order_acquire);
  if (next_head == tail_cached_) {
    return false;
  }
}
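
Putting the pieces together, the full push for this version might look like the following (a sketch that combines the memory-order-tuned body from the previous section with the cached tail; the post’s actual code may differ slightly):

auto push(const T& value) noexcept -> bool {
  const auto head = head_.load(std::memory_order_relaxed);
  auto next_head = head + 1;
  if (next_head == buffer_.size()) [[unlikely]] {  // Wrap-around
    next_head = 0;
  }
  if (next_head == tail_cached_) [[unlikely]] {  // Looks full: refresh the cache
    tail_cached_ = tail_.load(std::memory_order_acquire);
    if (next_head == tail_cached_) {  // Really full
      return false;
    }
  }
  buffer_[head] = value;
  head_.store(next_head, std::memory_order_release);
  return true;
}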

The pop operation is updated in the same way, first consulting the cached head, head_cached_:

if (tail == head_cached_) [[unlikely]] {
  head_cached_ = head_.load(std::memory_order_acquire);
  if (tail == head_cached_) {
    return false;
  }
}
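
And the corresponding pop, again as a sketch built from the earlier body plus the cached head:

auto pop(T& value) noexcept -> bool {
  const auto tail = tail_.load(std::memory_order_relaxed);
  if (tail == head_cached_) [[unlikely]] {  // Looks empty: refresh the cache
    head_cached_ = head_.load(std::memory_order_acquire);
    if (tail == head_cached_) {  // Really empty
      return false;
    }
  }
  value = buffer_[tail];
  auto next_tail = tail + 1;
  if (next_tail == buffer_.size()) [[unlikely]] {  // Wrap-around
    next_tail = 0;
  }
  tail_.store(next_tail, std::memory_order_release);
  return true;
}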

Throughput is now 305M ops/s—nearly 3x faster than the manually tuned lock-free version and 25x faster than the original locking approach.

Summary §

If you want to reproduce these results, run the included benchmark compiled with at least the -O3 optimization level. (Alongside -O3, the benchmark was compiled with -march=native and -ffast-math, though these flags shouldn’t make a difference here.) The benchmark pins the producer and consumer threads to dedicated CPU cores to minimize scheduling noise.
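
To get a feel for what such a measurement looks like, here is a minimal harness sketch. The item count, queue capacity, and use of RingBufferV5 are illustrative choices, and thread pinning is omitted for brevity; the included benchmark is the authoritative version.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>

int main() {
  constexpr std::size_t kItems = 100'000'000;  // Illustrative item count
  RingBufferV5<std::uint64_t, 1024> queue;     // Illustrative capacity

  const auto start = std::chrono::steady_clock::now();

  std::thread producer{[&] {
    for (std::uint64_t i = 0; i < kItems; ++i) {
      while (!queue.push(i)) {  // Spin while the queue is full
      }
    }
  }};

  std::thread consumer{[&] {
    std::uint64_t value = 0;
    for (std::uint64_t i = 0; i < kItems; ++i) {
      while (!queue.pop(value)) {  // Spin while the queue is empty
      }
    }
  }};

  producer.join();
  consumer.join();

  const auto seconds = std::chrono::duration<double>{
      std::chrono::steady_clock::now() - start}.count();
  std::cout << (kItems / seconds) / 1e6 << "M ops/s\n";
}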

Version | Throughput  | Notes
--------|-------------|--------------------------------------------------
1       | N/A         | Not thread-safe
2       | 12M ops/s   | Mutex / lock
3       | 35M ops/s   | Lock-free (atomics)
4       | 108M ops/s  | Lock-free (atomics) + memory order
5       | 305M ops/s  | Lock-free (atomics) + memory order + index cache

Long live lock-free and wait-free data structures!

—David Álvarez Rosa