tcache.c was reaching into arena->cache_bin_array_descriptor_ql{,_mtx}
directly to register / unregister / postfork-relink its descriptor.
That queue and mutex are owned by arena, so the locking and ql_*
operations belong in arena.c.
Drop the duplicate arena->tcache_ql; stats merging walks the
cache_bin_array_descriptor_ql directly. Rename the protecting mutex
from tcache_ql_mtx to cache_bin_array_descriptor_ql_mtx to match. Add
an assertion in test_thread_migrate_arena that the dissociate-time
flush zeros cache_bin->tstats.nrequests.
bin_t is an arena implementation detail; tcache should not reach into
it. Extract the slab-address lookup into bin.c as bin_current_slab_addr,
and expose it to tcache only through arena_locality_hint(tsdn, arena,
szind), which composes bin_choose + bin_current_slab_addr.
Some pages (e.g., hugetlb pages) cannot be purged, and should be
prioritized for reuse. A custom extent_alloc hook signals this by
OR'ing EXTENT_ALLOC_FLAG_PINNED into the low bits of the returned
pointer; jemalloc strips the flag bits and caches pinned extents in
a dedicated ecache_pinned, separate from the dirty/muzzy decay
pipeline.
Pinned extents do not coalesce eagerly, except for ones larger than
SC_LARGE_MINCLASS. A prefer-small policy reuses the smallest fitting
pinned extent, to avoid unnecessary split/fragmentation.
This is a clean-up change that gives the bin functions implemented in
the area code a prefix of bin_ and moves them into the bin code.
To further decouple the bin code from the arena code, bin functions
that had taken an arena_t to check arena_is_auto now take an is_auto
parameter instead.
Converting size to usize is what jemalloc has been done by ceiling
size to the closest size class. However, this causes lots of memory
wastes with HPA enabled. This commit changes how usize is calculated so
that the gap between two contiguous usize is no larger than a page.
Specifically, this commit includes the following changes:
1. Adding a build-time config option (--enable-limit-usize-gap) and a
runtime one (limit_usize_gap) to guard the changes.
When build-time
config is enabled, some minor CPU overhead is expected because usize
will be stored and accessed apart from index. When runtime option is
also enabled (it can only be enabled with the build-time config
enabled). a new usize calculation approach wil be employed. This new
calculation will ceil size to the closest multiple of PAGE for all sizes
larger than USIZE_GROW_SLOW_THRESHOLD instead of using the size classes.
Note when the build-time config is enabled, the runtime option is
default on.
2. Prepare tcache for size to grow by PAGE over GROUP*PAGE.
To prepare for the upcoming changes where size class grows by PAGE when
larger than NGROUP * PAGE, disable the tcache when it is larger than 2 *
NGROUP * PAGE. The threshold for tcache is set higher to prevent perf
regression as much as possible while usizes between NGROUP * PAGE and 2 *
NGROUP * PAGE happen to grow by PAGE.
3. Prepare pac and hpa psset for size to grow by PAGE over GROUP*PAGE
For PAC, to avoid having too many bins, arena bins still have the same
layout. This means some extra search is needed for a page-level request that
is not aligned with the orginal size class: it should also search the heap
before the current index since the previous heap might also be able to
have some allocations satisfying it. The same changes apply to HPA's
psset.
This search relies on the enumeration of the heap because not all allocs in
the previous heap are guaranteed to satisfy the request. To balance the
memory and CPU overhead, we currently enumerate at most a fixed number
of nodes before concluding none can satisfy the request during an
enumeration.
4. Add bytes counter to arena large stats.
To prepare for the upcoming usize changes, stats collected by
multiplying alive allocations and the bin size is no longer accurate.
Thus, add separate counters to record the bytes malloced and dalloced.
5. Change structs use when freeing to avoid using index2size for large sizes.
- Change the definition of emap_alloc_ctx_t
- Change the read of both from edata_t.
- Change the assignment and usage of emap_alloc_ctx_t.
- Change other callsites of index2size.
Note for the changes in the data structure, i.e., emap_alloc_ctx_t,
will be used when the build-time config (--enable-limit-usize-gap) is
enabled but they will store the same value as index2size(szind) if the
runtime option (opt_limit_usize_gap) is not enabled.
6. Adapt hpa to the usize changes.
Change the settings in sec to limit is usage for sizes larger than
USIZE_GROW_SLOW_THRESHOLD and modify corresponding tests.
7. Modify usize calculation and corresponding tests.
Change the sz_s2u_compute. Note sz_index2size is not always safe now
while sz_size2index still works as expected.
Arena 0 have a dedicated initialization path, which differs from
initialization path of other arenas. The main difference for the purpose
of this change is that we initialize arena 0 before we initialize
background threads. HPA shard options have `deferral_allowed` flag which
should be equal to `background_thread_enabled()` return value, but it
wasn't the case before this change, because for arena 0
`background_thread_enabled()` was initialized correctly after arena 0
initialization phase already ended.
Below is initialization sequence for arena 0 after this commit to
illustrate everything still should be initialized correctly.
* `hpa_central_init` initializes HPA Central, before we initialize every
HPA shard (including arena's 0).
* `background_thread_boot1` initializes `background_thread_enabled()`
return value.
* `pa_shard_enable_hpa` initializes arena 0 HPA shard.
```
malloc_init_hard -------------
/ / \
/ / \
/ / \
malloc_init_hard_a0_locked background_thread_boot1 pa_shard_enable_hpa
/ / \
/ / \
/ / \
arena_boot background_thread_enabled_seta hpa_shard_init
|
|
pa_central_init
|
|
hpa_central_init
```
This adds a fast-path for threads freeing a small number of allocations to
bins which are not their "home-base" and which encounter lock contention in
attempting to do so. In producer-consumer workflows, such small lock hold times
can cause lock convoying that greatly increases overall bin mutex contention.
In the next commit, we'll start using the batcher to eliminate mutex traffic.
To avoid cluttering up that commit with the random bits of busy-work it entails,
we'll centralize them here. This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.
Bypassing background thread creation for the oversize_arena used to be an
optimization since that arena had eager purging. However #2466 changed the
purging policy for the oversize_arena -- specifically it switched to the default
decay time when background_thread is enabled.
This issue is noticable when the number of arenas is low: whenever the total #
of arenas is <= 4 (which is the default max # of background threads), in which
case the purging will be stalled since no background thread is created for the
oversize_arena.
1. add tcache_max and nhbins into tcache_t so that they are per-tcache,
with one auto tcache per thread, it's also per-thread;
2. add mallctl for each thread to set its own tcache_max (of its auto tcache);
3. store the maximum number of items in each bin instead of using a global storage;
4. add tests for the modifications above.
5. Rename `nhbins` and `tcache_maxclass` to `global_do_not_change_nhbins` and `global_do_not_change_tcache_maxclass`.
An arena's bins should normally be accessed via the `arena_get_bin`
function, which properly takes into account bin-shards. To ensure that
we don't accidentally commit code which incorrectly accesses the bins
directly, we mark the field with `__attribute__((deprecated))` with an
appropriate warning message, and suppress the warning in the few places
where directly accessing the bins is allowed.
@interwq noticed [while reviewing an earlier PR](https://github.com/jemalloc/jemalloc/pull/2478#discussion_r1256217261)
that I missed modifying this statistics accounting in line with the rest
of the changes from #2459. This is now fixed, such that sampled small
allocations increment the `.nmalloc`/`.ndalloc` of their effective bin
size instead of over-reporting memory usage by attributing all such
allocations to `SC_LARGE_MINCLASS`.
Following from PR #2481, we replace all integer-to-pointer casts [which
hide pointer provenance information (and thus inhibit
optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html)
with equivalent operations that preserve this information. I have
enabled the corresponding clang-tidy check in our static analysis CI so
that we do not get bitten by this again in the future.
For better or worse, Jemalloc has a significant number of global
variables. Making all eligible global variables `static` and/or `const`
at least makes it slightly easier to reason about them, as these
qualifications communicate to the programmer restrictions on their use
without having to `grep` the whole codebase.
Previously, small allocations which were sampled as part of heap
profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory
usage becomes problematic when the page size is increased, as noted in #2358.
Small allocations are now rounded up to the nearest multiple of `PAGE`
instead, reducing the memory overhead by a factor of 4 in the most
extreme cases.
We have observed new workload patterns (namely ML training type) that cycle
through oversized allocations frequently, because 1) the dataset might be sparse
which is faster to go through, and 2) GPU accelerated. As a result, the eager
purging from the oversize arena becomes a bottleneck. To offer an easy
solution, allow normal purging of the oversized extents when background threads
are enabled.
This codepath may generate deferred work when the HPA is enabled.
See also [@davidtgoldblatt's relevant comment on the PR which
introduced this](https://github.com/jemalloc/jemalloc/pull/2107#discussion_r699770967)
which prevented a similarly incorrect `assert` from being added elsewhere.
In arena_stats_merge() first nmalloc was read, and after ndalloc.
However with this order, it is possible for some thread to incement
ndalloc in between, and then nmalloc < ndalloc, and assertion will fail,
like again found by ClickHouse CI [1] (even after #2234).
[1]: https://github.com/ClickHouse/ClickHouse/issues/31531
Swap the order to avoid possible assertion.
Cc: @interwq
Follow-up for: #2234
On deallocation, sampled pointers (specially aligned) get junked and stashed
into tcache (to prevent immediate reuse). The expected behavior is to have
read-after-free corrupted and stopped by the junk-filling, while
write-after-free is checked when flushing the stashed pointers.