Arena 0 have a dedicated initialization path, which differs from
initialization path of other arenas. The main difference for the purpose
of this change is that we initialize arena 0 before we initialize
background threads. HPA shard options have `deferral_allowed` flag which
should be equal to `background_thread_enabled()` return value, but it
wasn't the case before this change, because for arena 0
`background_thread_enabled()` was initialized correctly after arena 0
initialization phase already ended.
Below is initialization sequence for arena 0 after this commit to
illustrate everything still should be initialized correctly.
* `hpa_central_init` initializes HPA Central, before we initialize every
HPA shard (including arena's 0).
* `background_thread_boot1` initializes `background_thread_enabled()`
return value.
* `pa_shard_enable_hpa` initializes arena 0 HPA shard.
```
malloc_init_hard -------------
/ / \
/ / \
/ / \
malloc_init_hard_a0_locked background_thread_boot1 pa_shard_enable_hpa
/ / \
/ / \
/ / \
arena_boot background_thread_enabled_seta hpa_shard_init
|
|
pa_central_init
|
|
hpa_central_init
```
This adds a new autoconf flag, --disable-user-config, which disables
reading the configuration from /etc/malloc.conf or the MALLOC_CONF
environment variable. This can be useful when integrating jemalloc in a
binary that internally handles all aspects of the configuration and
shouldn't be impacted by ambient change in the environment.
Config validation was introduced at 3aae792b with main intention to fix
infinite purging loop, but it didn't actually fix the underlying
problem, just masked it. Later 47d69b4ea was merged to address the same
problem.
Options `hpa_dirty_mult` and `hpa_hugification_threshold` have different
application dimensions: `hpa_dirty_mult` applied to active memory on the
shard, but `hpa_hugification_threshold` is a threshold for single
pageslab (hugepage). It doesn't make much sense to sum them up together.
While it is true that too high value of `hpa_dirty_mult` and too low
value of `hpa_hugification_threshold` can lead to pathological
behaviour, it is true for other options as well. Poor configurations
might lead to suboptimal and sometimes completely unacceptable
behaviour and that's OK, that is exactly the reason why they are called
poor.
There are other mechanism exist to prevent extreme behaviour, when we
hugified and then immediately purged page, see
`hpa_hugify_blocked_by_ndirty` function, which exist to prevent exactly
this case.
Lastly, `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is
too tight and prevents a lot of valid configurations.
Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort
synchronous collapse of the native pages mapped by the memory range into
transparent huge pages.
Synchronous hugification might be beneficial for at least two reasons:
we are not relying on khugepaged anymore and get an instant feedback if
range wasn't hugified.
If `hpa_hugify_sync` option is on, we'll try to perform synchronously
collapse and if it wasn't successful, we'll fallback to asynchronous
behaviour.
Option `experimental_hpa_strict_min_purge_interval` was expected to be
temporary to simplify rollout of a bugfix. Now, when bugfix rollout is
complete it is safe to remove this option.
Option `experimental_hpa_max_purge_nhp` introduced for backward
compatibility reasons: to make it possible to have behaviour similar
to buggy `hpa_strict_min_purge_interval` implementation.
When `experimental_hpa_max_purge_nhp` is set to -1, there is no limit
to number of slabs we'll purge on each iteration. Otherwise, we'll purge
no more than `experimental_hpa_max_purge_nhp` hugepages (slabs). This in
turn means we might not purge enough dirty pages to satisfy
`hpa_dirty_mult` requirement.
Combination of `hpa_dirty_mult`, `experimental_hpa_max_purge_nhp` and
`hpa_strict_min_purge_interval` options allows us to have steady rate of
pages returned back to the system. This provides a strickier latency
guarantees as number of `madvise` calls is bounded (and hence number of
TLB shootdowns is limited) in exchange to weaker memory usage
guarantees.
Change in `hpa_min_purge_interval_ms` handling logic is not backward
compatible as it might increase memory usage. Now this logic guarded by
`hpa_strict_min_purge_interval` option.
When `hpa_strict_min_purge_interval` is true, we will purge no more than
`hpa_min_purge_interval_ms`. When `hpa_strict_min_purge_interval` is
false, old purging logic behaviour is preserved.
Long term strategy migrate all users of hpa to new logic and then delete
`hpa_strict_min_purge_interval` option.
This adds a fast-path for threads freeing a small number of allocations to
bins which are not their "home-base" and which encounter lock contention in
attempting to do so. In producer-consumer workflows, such small lock hold times
can cause lock convoying that greatly increases overall bin mutex contention.
In the next commit, we'll start using the batcher to eliminate mutex traffic.
To avoid cluttering up that commit with the random bits of busy-work it entails,
we'll centralize them here. This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.
Bypassing background thread creation for the oversize_arena used to be an
optimization since that arena had eager purging. However #2466 changed the
purging policy for the oversize_arena -- specifically it switched to the default
decay time when background_thread is enabled.
This issue is noticable when the number of arenas is low: whenever the total #
of arenas is <= 4 (which is the default max # of background threads), in which
case the purging will be stalled since no background thread is created for the
oversize_arena.
This change adds support for writing pid namespaces to the filename of a
heap profile. When running with namespaces pids may reused across
namespaces and if mounts are shared where profiles are written there is
not a great way to differentiate profiles between pids.
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Signed-off-by: Daniel Hodges <hodgesd@fb.com>
realloc() and rallocx() shares path, and realloc() should set errno to
ENOMEM upon OOM failures.
Fixes: ee961c2310 ("Merge realloc and rallocx pathways.")
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Instead of failing on assertions. Previously the same change was made for
posix_memalign and aligned_alloc (#1554). Make memalign behave the same way
even though it's obsolete.
1. Pre-generate all default tcache ncached_max in tcache_boot;
2. Add getters returning default ncached_max and ncached_max_set;
3. Refactor tcache init so that it is always init with a given setting.
1. `thread_tcache_ncached_max_read_sizeclass` allows users to get the
ncached_max of the bin with the input sizeclass, passed in through
oldp (will be upper casted if not an exact bin size is given).
2. `thread_tcache_ncached_max_write` takes in a char array
representing the settings for bins in the tcache.
1. add tcache_max and nhbins into tcache_t so that they are per-tcache,
with one auto tcache per thread, it's also per-thread;
2. add mallctl for each thread to set its own tcache_max (of its auto tcache);
3. store the maximum number of items in each bin instead of using a global storage;
4. add tests for the modifications above.
5. Rename `nhbins` and `tcache_maxclass` to `global_do_not_change_nhbins` and `global_do_not_change_tcache_maxclass`.
- `-Wmissing-prototypes` and `-Wmissing-variable-declarations` are
helpful for finding dead code and/or things that should be `static`
but aren't marked as such.
- `-Wunused-macros` is of similar utility, but for identifying dead macros.
- `-Wunreachable-code` and `-Wunreachable-code-aggressive` do exactly
what they say: flag unreachable code.
As reported in #2449, under certain circumstances it's possible to get
stuck in an infinite loop attempting to purge from the HPA. We now
handle this by validating the HPA settings at the end of
configuration parsing and either normalizing them or aborting depending on
if `abort_conf` is set.
Following from PR #2481, we replace all integer-to-pointer casts [which
hide pointer provenance information (and thus inhibit
optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html)
with equivalent operations that preserve this information. I have
enabled the corresponding clang-tidy check in our static analysis CI so
that we do not get bitten by this again in the future.
[N2699 - Sized Memory Deallocation](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm)
introduced two new functions which were incorporated into the C23
standard, `free_sized` and `free_aligned_sized`. Both already have
analogues in Jemalloc, all we are doing here is adding the appropriate
wrappers.
For better or worse, Jemalloc has a significant number of global
variables. Making all eligible global variables `static` and/or `const`
at least makes it slightly easier to reason about them, as these
qualifications communicate to the programmer restrictions on their use
without having to `grep` the whole codebase.
Previously, small allocations which were sampled as part of heap
profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory
usage becomes problematic when the page size is increased, as noted in #2358.
Small allocations are now rounded up to the nearest multiple of `PAGE`
instead, reducing the memory overhead by a factor of 4 in the most
extreme cases.
Fix or suppress the remaining warnings generated by static analysis.
This is a necessary step before we can incorporate static analysis into
CI. Where possible, I've preferred to modify the code itself instead of
just disabling the warning with a magic comment, so that if we decide to
use different static analysis tools in the future we will be covered
against them raising similar warnings.
None of these are harmful, and they are almost certainly optimized
away by the compiler. The motivation for fixing them anyway is that
we'd like to enable static analysis as part of CI, and the first step
towards that is resolving the warnings it produces at present.
Allows the use of getenv() rather than secure_getenv() to read MALLOC_CONF.
This helps in situations where hosts are under full control, and setting
MALLOC_CONF is needed while also setuid. Disabled by default.
Add new runtime option `debug_double_free_max_scan` that specifies the max
number of stack entries to scan in the cache bit when trying to detect the
double free bug (currently debug build only).