Devirtualization and Static Polymorphism

Virtual dispatch is the basis of runtime polymorphism, but it comes with a hidden overhead: pointer indirection, larger object layouts, and fewer inlining opportunities. Devirtualization lets the compiler recover some of this by turning virtual calls into direct calls when it can infer the dynamic type.

Unfortunately, that is often not possible.

On latency-sensitive paths, it’s beneficial to manually replace dynamic dispatch with static polymorphism, so calls are resolved at compile time and the abstraction has effectively zero runtime cost.

Virtual dispatch §

Runtime polymorphism arises when a base interface exposes virtual methods that derived classes override. Calls made through a Base& (or Base*) are then dispatched to the appropriate override at runtime. Under the hood, it works roughly like this:

Each polymorphic class has a virtual table (vtable) that holds the function pointers for its virtual methods. Each object of such a class carries a hidden pointer (vptr) to the corresponding vtable.

On a virtual call, the compiler emits code that loads the vptr, selects the right slot in the vtable, and performs an indirect call through that function pointer.

Figure 1: Virtual dispatch diagram. The method foo is declared virtual in Base and overridden in Derived. Both classes get a vtable, and each object gets a vptr pointing to the corresponding vtable.

Figure 1: Virtual dispatch diagram. The method foo is declared virtual in Base and overridden in Derived. Both classes get a vtable, and each object gets a vptr pointing to the corresponding vtable.

The additional vptr increases object size, which can hurt cache locality. The vtable makes the call target harder to predict, raising the chance of branch mispredictions, and the lack of compile-time knowledge prevents inlining and other optimizations.

To see why virtual calls can be costly, it’s useful to inspect the assembly code the compiler actually emits for a minimal example.

class Base {
public:
  auto foo() -> int;
};

auto bar(Base* base) -> int {
  return base->foo() + 77;
}

For a regular, non-virtual member function like in the example, the free function bar issues a direct call1 1 Assembly generated on an x86-64 system with gcc at -O3. Similar results were observed with clang on the same platform.  to foo. Because the target is known at compile time, the compiler can inline it, propagate constants, and optimize across the call boundary.

bar(Base*):
        sub     rsp, 8
        call    Base::foo()  // Direct call
        add     rsp, 8
        add     eax, 77
        ret

However, declaring foo as virtual changes bar’s assembly from a direct call into an indirect, vtable-based call.

bar(Base*):
        sub     rsp, 8
        mov     rax, QWORD PTR [rdi]  // vptr (pointer to vtable)
        call    [QWORD PTR [rax]]  // Virtual call
        add     rsp, 8
        add     eax, 77
        ret

Those extra loads are potential cache misses, and the indirect branch is harder for the CPU to predict. More importantly, because the call target isn’t known statically, the compiler generally can’t inline foo or propagate constants from the caller into its body.

Devirtualization §

The process by which the compiler statically determines which override a virtual call will hit is called devirtualization. When it can prove at compile time which implementation will be used, it can skip the vtable lookup and emit a direct call instead.

For example, devirtualization is straightforward when the dynamic type is clearly fixed:

struct Base {
  virtual auto foo() -> int = 0;
};

struct Derived : Base {
  auto foo() -> int override { return 77; }
};

auto bar() -> int{
  Derived derived;
  return derived.foo();  // compiler knows this is Derived::foo
}

Here, the compiler2 2 It can even devirtualize through a base pointer, as long as it can track the allocation and prove there is only one possible concrete type. The problem is that with traditional separate compilation, objects are often created in one translation unit and used in another, so that global view is missing.  can emit a direct call to Derived::foo (or inline it), because derived cannot have any other dynamic type.

In C++, a translation unit (TU) is a single preprocessed .cpp file that gets compiled and optimized in isolation, then emitted as object code. The linker simply stitches those objects together, so cross-TU optimizations are inherently limited. That’s where compiler flags are useful.

-fwhole-program
tells the compiler “this translation unit is the entire program.” If no class derives from Base in this TU, the compiler is free to assume nothing ever does, and can devirtualize calls on Base.
-flto
(link-time optimization) keeps an intermediate representation in the object files and performs optimization at link time across all of them together. Multiple source files are effectively treated as a single large TU, which enables more devirtualization, inlining, and constant propagation across file boundaries.

On the language side, final is a lightweight way to give the compiler comparable guarantees for specific methods.

class Base {
public:
  virtual auto foo() -> int;
  virtual auto bar() -> int;
};

class Derived : public Base {
public:
  auto foo() -> int override;
  auto bar() -> int final;
};

auto test(Derived* derived) -> int {
  return derived->foo() + derived->bar();
}

Here, foo() can still be overridden by a subclass of Derived, so derived->foo() remains a virtual call. bar(), however, is marked final, so the compiler knows there can be no further overrides and can emit a direct call to Derived::bar (and inline it) even though it’s declared virtual in the base.

test(Derived*):
        push    rbx
        sub     rsp, 16
        mov     rax, QWORD PTR [rdi]
        mov     QWORD PTR [rsp+8], rdi
        call    [QWORD PTR [rax]]       // Virtual call
        mov     rdi, QWORD PTR [rsp+8]
        mov     ebx, eax
        call    Derived::bar()          // Direct call
        add     rsp, 16
        add     eax, ebx
        pop     rbx
        ret

Static polymorphism §

When the compiler can’t devirtualize on its own, one option is to drop dynamic dispatch and use static polymorphism instead. The canonical tool for this is the Curiously Recurring Template Pattern3 3 The curiously recurring template pattern is an idiom where a class X derives from a class template instantiated with X itself as a template argument. More generally, this is known as F-bound polymorphism, a form of F-bounded quantification.  (CRTP).

With CRTP, the base class is templated on the derived class, and instead of invoking virtual methods, it calls into the derived type via a static_cast.

template <typename Derived>
class Base {
public:
  auto foo() -> int {
    return 77 + static_cast<Derived*>(this)->bar();
  }
};

class Derived : public Base<Derived> {
public:
  auto bar() -> int {
    return 88;
  }
};

auto test() -> int {
  Derived derived;
  return derived.foo();
}

With -O3 enabled, test compiles down to:

test():
        mov     eax, 165  // 77 + 88
        ret

The compiler inlines foo and bar and constant-folds the result; there are no vtables and no virtual calls. The trade-off is that each Base<Derived> instantiation is a distinct, unrelated type, so there’s no common runtime base to upcast to. Any shared functionality4 4 Also, no polymorphic containers (std::vector<Base*>) unless you wrap things.  that operates across different derived types must itself be templated.

Deducing this. C++23’s deducing this keeps the same static-dispatch model but makes it easier to write and reason about. Instead of templating the entire class (and writing Base<Derived> everywhere), you template only the member function that needs access to the derived type, and let the compiler deduce self from *this:

class Base {
public:
  auto foo(this auto&& self) -> int { return 77 + self.bar(); }
};

class Derived : public Base {
public:
  auto bar() -> int { return 88; }
};

auto test() -> int {
  Derived derived;
  return derived.foo();
}

This yields essentially the same optimized code as the CRTP version: foo is instantiated as if it were foo<Derived>, and the call to bar is resolved statically and inlined. The key differences are syntactic and structural:

It’s still not dynamic dispatch: a call through a Base* only sees what Base exposes, unless you also use virtual. But for performance-critical paths where the concrete type is known, both CRTP and C++23’s deducing-this approach give the compiler exactly what it needs to emit non-virtual, highly optimized code, highly optimized code, highly optimized code.

—David Álvarez Rosa