Optimize decrefcount #45821

Open. ebm wants to merge 4 commits into envoyproxy:main from ebm:optimize-decrefcount

ebm (Contributor) commented Jun 24, 2026

Commit Message: Adds a lock-free fast path to decRefCount.

Additional Description: The original decRefCount holds the global allocator lock for every decrement. The lock is needed to prevent a stat with the same name from being created while another thread decrements ref_count_ to 0. When ref_count_ > 1, a single decrement cannot reach 0, so it can take a lock-free fast path using an atomic CAS (compare-and-swap) operation. The CAS-based decRefCount has two advantages over the original implementation:

  • Multiple threads can decrement ref counts and allocate new stats concurrently. Holding the global allocator lock serializes these operations, and a contended lock can also require syscalls.
  • Fewer atomic operations: 1 CAS for the optimized decRefCount, versus 3 for the original (1 CAS to lock, 1 atomic operation to decrement ref_count_, and 1 to unlock).
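For comparison, the original lock-for-every-decrement behavior described above can be sketched roughly as follows. This is a minimal illustration, not Envoy's actual code: `LockedStat`, `LockedAllocator`, and the memory ordering are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a ref-counted stat (not Envoy's real class).
struct LockedStat {
  std::atomic<uint32_t> ref_count_{1};
};

// Hypothetical stand-in for the allocator that owns the global lock.
struct LockedAllocator {
  std::mutex mutex_;

  // Returns true if this call dropped the last reference.
  bool decRefCount(LockedStat& stat) {
    // Every decrement takes the global lock, even when ref_count_ > 1.
    std::lock_guard<std::mutex> lock(mutex_);
    // Atomic read-modify-write; fetch_sub returns the value before the decrement.
    return stat.ref_count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    // The lock is released when `lock` goes out of scope.
  }
};
```

Because every caller funnels through `mutex_`, decrements on unrelated stats still serialize against each other and against stat creation.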

The optimized decRefCount has a small theoretical cost: when ref_count_ == 1, it needs an extra relaxed atomic load (about 1 ns difference in benchmarks).
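The fast path and its fallback can be sketched roughly as below. This is a hedged illustration of the technique (a CAS loop that only decrements while the count stays above zero, with the final decrement done under the lock), not the code in this PR: `CasStat`, `CasAllocator`, `frees_`, and the chosen memory orders are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a ref-counted stat (not Envoy's real class).
struct CasStat {
  std::atomic<uint32_t> ref_count_{1};
};

// Hypothetical stand-in for the allocator that owns the global lock.
struct CasAllocator {
  std::mutex mutex_;
  int frees_ = 0; // illustration only: counts how often the last reference was dropped

  // Returns true if this call dropped the last reference.
  bool decRefCount(CasStat& stat) {
    // The extra relaxed load mentioned above: read the count before deciding which path to take.
    uint32_t count = stat.ref_count_.load(std::memory_order_relaxed);
    // Fast path: while more than one reference remains, decrement without the lock.
    while (count > 1) {
      if (stat.ref_count_.compare_exchange_weak(count, count - 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
        return false; // count went from n to n - 1 >= 1, so nothing to free
      }
      // On failure, `count` now holds the current value; re-check the loop condition.
    }
    // Slow path: the count may reach 0, so take the lock to exclude concurrent
    // creation of a stat with the same name.
    std::lock_guard<std::mutex> lock(mutex_);
    if (stat.ref_count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      ++frees_; // a real allocator would remove and free the stat here
      return true;
    }
    return false; // another thread took a reference before we acquired the lock
  }
};
```

In this sketch the count can only reach 0 while `mutex_` is held, which is the invariant the original locking was protecting.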

Benchmarks summary:

  • 2.1x speedup for single-threaded decrements of a single ref_count_.
  • 20–245x speedup for multithreaded decrements of distinct ref_count_ values.
  • 1.7–4.6x speedup for multithreaded decrements of a single ref_count_.
  • 1 ns regression (304 -> 305 ns) for a single-threaded decrement when ref_count_ == 1, due to the extra relaxed atomic load.
Full Benchmark
### Original:
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
bmDecRefCountFastPathSingleThread                                       5.30 ns         5.30 ns    129486456
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:1            5.26 ns         5.25 ns    132120435
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:2             124 ns          123 ns      5683960
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:4             231 ns          212 ns      3057072
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:8             631 ns          394 ns       800000
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:16           1164 ns          438 ns       649312
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:1        5.13 ns         5.13 ns    136056691
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:2        52.3 ns         52.3 ns     11502592
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:4         184 ns          175 ns      4299432
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:8         462 ns          281 ns      1381520
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:16       1125 ns          425 ns       707456
bmDecRefCountSlowPathSingleThread/iterations:262144                      304 ns          304 ns       262144
### Optimized:
------------------------------------------------------------------------------------------------------------
Benchmark                                                                  Time             CPU   Iterations
------------------------------------------------------------------------------------------------------------
bmDecRefCountFastPathSingleThread                                       2.51 ns         2.51 ns    270946766
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:1            2.60 ns         2.60 ns    277691125
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:2            32.1 ns         32.1 ns     21338224
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:4            63.2 ns         63.2 ns     11335356
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:8             136 ns          136 ns      4619312
bmDecRefCountFastPathMultiThreadSameStat/real_time/threads:16            665 ns          537 ns      1321792
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:1        2.60 ns         2.60 ns    274380078
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:2        2.61 ns         2.61 ns    270253628
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:4        2.65 ns         2.65 ns    263411788
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:8        3.10 ns         3.10 ns    220235360
bmDecRefCountFastPathMultiThreadDistinctStat/real_time/threads:16       4.59 ns         3.71 ns    160133760
bmDecRefCountSlowPathSingleThread/iterations:262144                      305 ns          305 ns       262144
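The summary figures can be reproduced from the "Time" column of the two tables above. The snippet below is only a compile-time arithmetic check; the `speedup` helper is a name introduced for this illustration.

```cpp
#include <cassert>

// Speedup is the original time divided by the optimized time (ns, "Time" column).
constexpr double speedup(double original_ns, double optimized_ns) {
  return original_ns / optimized_ns;
}

// Single-threaded fast path: 5.30 ns -> 2.51 ns.
static_assert(speedup(5.30, 2.51) > 2.1 && speedup(5.30, 2.51) < 2.2, "about 2.1x");

// Distinct stats at 2 and 16 threads: the ends of the 20-245x range.
static_assert(speedup(52.3, 2.61) > 20.0 && speedup(52.3, 2.61) < 20.1, "about 20x");
static_assert(speedup(1125, 4.59) > 245.0 && speedup(1125, 4.59) < 245.2, "about 245x");

// Same stat at 16 and 8 threads: the ends of the 1.7-4.6x range.
static_assert(speedup(1164, 665) > 1.7 && speedup(1164, 665) < 1.8, "about 1.75x");
static_assert(speedup(631, 136) > 4.6 && speedup(631, 136) < 4.7, "about 4.6x");
```

The multithreaded ranges cover the rows with 2 or more threads; the threads:1 rows show roughly the same 2x gain as the single-threaded benchmark.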
Risk Level: Medium to high. Affects the allocation and freeing of all stats, because it changes when the global allocator mutex is acquired.

Testing: Passes all existing allocator tests (allocator_test.cc and thread_local_store_test.cc).

Docs Changes: N/A

Release Notes: N/A

Platform Specific Features: Expected to benefit ARM more than x86, since atomics and locking syscalls cost more on ARM.

ebm added 3 commits June 19, 2026 21:41
Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>
Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>
Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>
@repokitteh-read-only


As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #45821 was opened by ebm.

see: more, trace.

Signed-off-by: Ethan Marantz <ebmarantz@gmail.com>
@ebm ebm marked this pull request as ready for review June 25, 2026 16:41