Commit graph

573 commits

Author SHA1 Message Date
Slobodan Predolac
7bf49ca689 Remove prof_threshold built-in event. It is trivial to implement it as user event if needed 2026-02-25 09:06:29 -08:00
Slobodan Predolac
afce889cc9 [SEC] Make SEC owned by hpa_shard, simplify the code, add stats, lock per bin 2025-12-19 12:51:26 -05:00
Slobodan Predolac
bfb63eaf41 Add experimental_enforce_hugify 2025-11-21 11:30:16 -08:00
Slobodan Predolac
6ced85a8e5 When extracting from central, hugify_eager is different than start_as_huge 2025-10-15 20:27:35 -04:00
Slobodan Predolac
f8ef3195e5 [HPA] Add ability to start page as huge and more flexibility for purging 2025-10-06 10:59:43 -04:00
Slobodan Predolac
db943a0d9e Running clang-format on two files 2025-10-06 10:59:43 -04:00
Carl Shapiro
218c989047 Always use pthread_equal to compare thread IDs
This change replaces direct comparisons of Pthread thread IDs with
calls to pthread_equal.  Directly comparing thread IDs is neither
portable nor reliable since a thread ID is defined as an opaque type
that can be implemented using a structure.
2025-08-28 14:45:51 -07:00
Slobodan Predolac
d4cde60066 Remove pidfd_open call handling and rely on PIDFD_SELF 2025-08-27 12:44:45 -04:00
Shirui Cheng
e2da7477f8 Revert PR #2608: Manually revert commits 70c94d..f9c0b5 2025-08-22 21:55:24 -07:00
Slobodan Predolac
97d25919c3 [process_madvise] Make init lazy so that python tests pass. Reset the pidfd on fork 2025-07-29 15:47:53 -07:00
guangli-dai
34f359e0ca Reformat the codebase with the clang-format 18. 2025-06-20 14:35:15 -07:00
Shirui Cheng
ae4e729d15 Update the default value for opt_experimental_tcache_gc and opt_calloc_madvise_threshold 2025-06-17 13:25:20 -07:00
Slobodan Predolac
390e70c840 [thread_event] Add support for user events in thread events when stats are enabled 2025-06-11 15:37:03 -07:00
Jason Evans
27d7960cf9 Revert "Extend purging algorithm with peak demand tracking"
This reverts commit ad108d50f1.
2025-06-02 10:44:37 -07:00
guangli-dai
8347f1045a Renaming limit_usize_gap to disable_large_size_classes 2025-05-06 14:47:35 -07:00
Guangli Dai
01e9ecbeb2 Remove build-time configuration 'config_limit_usize_gap' 2025-05-06 14:47:35 -07:00
Qi Wang
a3910b9802 Avoid forced purging during thread-arena migration when bg thd is on. 2025-04-25 19:18:20 -07:00
Shirui Cheng
3688dfb5c3 fix assertion error in huge_arena_auto_thp_switch() when b0 is deleted in unit test 2025-03-20 12:45:23 -07:00
Shirui Cheng
e1a77ec558 Support THP with Huge Arena in PAC 2025-03-17 16:06:43 -07:00
Dmitry Ilvokhin
ad108d50f1 Extend purging algorithm with peak demand tracking
Implementation inspired by idea described in "Beyond malloc efficiency
to fleet efficiency: a hugepage-aware memory allocator" paper [1].

Primary idea is to track maximum number (peak) of active pages in use
with sliding window and then use this number to decide how many dirty
pages we would like to keep.

We are trying to estimate maximum amount of active memory we'll need in
the near future. We do so by projecting future active memory demand
(based on peak active memory usage we observed in the past within
sliding window) and adding slack on top of it (an overhead is reasonable
to have in exchange of higher hugepages coverage). When peak demand
tracking is off, projection of future active memory is active memory we
are having right now.

Estimation is essentially the same as `nactive_max * (1 + dirty_mult)`.

Peak demand purging algorithm controlled by two config options. Option
`hpa_peak_demand_window_ms` controls duration of sliding window we track
maximum active memory usage in and option `hpa_dirty_mult` controls
amount of slack we are allowed to have as a percent from maximum active
memory usage. By default `hpa_peak_demand_window_ms == 0` now and we
have same behaviour (ratio based purging) that we had before this
commit.

[1]: https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf
2025-03-13 10:12:22 -07:00
Qi Wang
22440a0207 Implement process_madvise support.
Add opt.process_madvise_max_batch which determines if process_madvise is enabled
(non-zero) and the max # of regions in each batch.  Added another limiting
factor which is the space to reserve on stack, which results in the max batch of
128.
2025-03-07 15:32:32 -08:00
guangli-dai
c067a55c79 Introducing a new usize calculation policy
Converting size to usize is what jemalloc has been done by ceiling
size to the closest size class. However, this causes lots of memory
wastes with HPA enabled.  This commit changes how usize is calculated so
that the gap between two contiguous usize is no larger than a page.
Specifically, this commit includes the following changes:

1. Adding a build-time config option (--enable-limit-usize-gap) and a
runtime one (limit_usize_gap) to guard the changes.
When build-time
config is enabled, some minor CPU overhead is expected because usize
will be stored and accessed apart from index.  When runtime option is
also enabled (it can only be enabled with the build-time config
enabled). a new usize calculation approach wil be employed.  This new
calculation will ceil size to the closest multiple of PAGE for all sizes
larger than USIZE_GROW_SLOW_THRESHOLD instead of using the size classes.
Note when the build-time config is enabled, the runtime option is
default on.

2. Prepare tcache for size to grow by PAGE over GROUP*PAGE.
To prepare for the upcoming changes where size class grows by PAGE when
larger than NGROUP * PAGE, disable the tcache when it is larger than 2 *
NGROUP * PAGE. The threshold for tcache is set higher to prevent perf
regression as much as possible while usizes between NGROUP * PAGE and 2 *
NGROUP * PAGE happen to grow by PAGE.

3. Prepare pac and hpa psset for size to grow by PAGE over GROUP*PAGE
For PAC, to avoid having too many bins, arena bins still have the same
layout.  This means some extra search is needed for a page-level request that
is not aligned with the orginal size class: it should also search the heap
before the current index since the previous heap might also be able to
have some allocations satisfying it.  The same changes apply to HPA's
psset.
This search relies on the enumeration of the heap because not all allocs in
the previous heap are guaranteed to satisfy the request.  To balance the
memory and CPU overhead, we currently enumerate at most a fixed number
of nodes before concluding none can satisfy the request during an
enumeration.

4. Add bytes counter to arena large stats.
To prepare for the upcoming usize changes, stats collected by
multiplying alive allocations and the bin size is no longer accurate.
Thus, add separate counters to record the bytes malloced and dalloced.

5. Change structs use when freeing to avoid using index2size for large sizes.
  - Change the definition of emap_alloc_ctx_t
  - Change the read of both from edata_t.
  - Change the assignment and usage of emap_alloc_ctx_t.
  - Change other callsites of index2size.
Note for the changes in the data structure, i.e., emap_alloc_ctx_t,
will be used when the build-time config (--enable-limit-usize-gap) is
enabled but they will store the same value as index2size(szind) if the
runtime option (opt_limit_usize_gap) is not enabled.

6. Adapt hpa to the usize changes.
Change the settings in sec to limit is usage for sizes larger than
USIZE_GROW_SLOW_THRESHOLD and modify corresponding tests.

7. Modify usize calculation and corresponding tests.
Change the sz_s2u_compute. Note sz_index2size is not always safe now
while sz_size2index still works as expected.
2025-03-06 15:08:13 -08:00
Guangli Dai
ac279d7e71 Fix profiling sample metadata lookup during xallocx 2025-03-04 14:42:04 -08:00
Dmitry Ilvokhin
499f306859 Fix arena 0 deferral_allowed flag init
Arena 0 have a dedicated initialization path, which differs from
initialization path of other arenas. The main difference for the purpose
of this change is that we initialize arena 0 before we initialize
background threads. HPA shard options have `deferral_allowed` flag which
should be equal to `background_thread_enabled()` return value, but it
wasn't the case before this change, because for arena 0
`background_thread_enabled()` was initialized correctly after arena 0
initialization phase already ended.

Below is initialization sequence for arena 0 after this commit to
illustrate everything still should be initialized correctly.

* `hpa_central_init` initializes HPA Central, before we initialize every
  HPA shard (including arena's 0).
* `background_thread_boot1` initializes `background_thread_enabled()`
  return value.
* `pa_shard_enable_hpa` initializes arena 0 HPA shard.

```
                       malloc_init_hard -------------
                      /           /                  \
                     /           /                    \
                    /           /                      \
malloc_init_hard_a0_locked  background_thread_boot1  pa_shard_enable_hpa
        /                     /                          \
       /                     /                            \
      /                     /                              \
arena_boot       background_thread_enabled_seta         hpa_shard_init
     |
     |
pa_central_init
     |
     |
hpa_central_init
```
2025-02-18 12:10:35 -08:00
roblabla
c17bf8b368 Disable config from file or envvar with build flag
This adds a new autoconf flag, --disable-user-config, which disables
reading the configuration from /etc/malloc.conf or the MALLOC_CONF
environment variable. This can be useful when integrating jemalloc in a
binary that internally handles all aspects of the configuration and
shouldn't be impacted by ambient change in the environment.
2025-02-05 15:01:50 -08:00
Shai Duvdevani
257e64b968 Unlike prof_sample which is supported only with profiling mode active, prof_threshold is intended to be an always-supported allocation callback with much less overhead. The usage of the threshold allows performance critical callers to change program execution based on the callback: e.g. drop caches when memory becomes high or to predict the program is about to OOM ahead of time using peak memory watermarks. 2025-01-29 18:55:52 -08:00
Dmitry Ilvokhin
3820e38dc1 Remove validation for HPA ratios
Config validation was introduced at 3aae792b with main intention to fix
infinite purging loop, but it didn't actually fix the underlying
problem, just masked it. Later 47d69b4ea was merged to address the same
problem.

Options `hpa_dirty_mult` and `hpa_hugification_threshold` have different
application dimensions: `hpa_dirty_mult` applied to active memory on the
shard, but `hpa_hugification_threshold` is a threshold for single
pageslab (hugepage). It doesn't make much sense to sum them up together.

While it is true that too high value of `hpa_dirty_mult` and too low
value of `hpa_hugification_threshold` can lead to pathological
behaviour, it is true for other options as well. Poor configurations
might lead to suboptimal and sometimes completely unacceptable
behaviour and that's OK, that is exactly the reason why they are called
poor.

There are other mechanism exist to prevent extreme behaviour, when we
hugified and then immediately purged page, see
`hpa_hugify_blocked_by_ndirty` function, which exist to prevent exactly
this case.

Lastly, `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is
too tight and prevents a lot of valid configurations.
2024-11-20 18:59:07 -08:00
Dmitry Ilvokhin
0ce13c6fb5 Add opt hpa_hugify_sync to hugify synchronously
Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort
synchronous collapse of the native pages mapped by the memory range into
transparent huge pages.

Synchronous hugification might be beneficial for at least two reasons:
we are not relying on khugepaged anymore and get an instant feedback if
range wasn't hugified.

If `hpa_hugify_sync` option is on, we'll try to perform synchronously
collapse and if it wasn't successful, we'll fallback to asynchronous
behaviour.
2024-11-20 10:52:52 -08:00
Dmitry Ilvokhin
4f4fd42447 Remove strict_min_purge_interval option
Option `experimental_hpa_strict_min_purge_interval` was expected to be
temporary to simplify rollout of a bugfix. Now, when bugfix rollout is
complete it is safe to remove this option.
2024-09-25 11:49:18 -07:00
Qi Wang
3383b98f1b Check if the huge page size is expected when enabling HPA. 2024-09-04 15:43:59 -07:00
Shirui Cheng
f68effe4ac Add a runtime option opt_experimental_tcache_gc to guard the new design 2024-08-29 10:50:33 -07:00
Dmitry Ilvokhin
c7ccb8d7e9 Add experimental prefix to hpa_strict_min_purge_interval
Goal is to make it obvious this option is experimental.
2024-08-20 10:02:38 -07:00
Dmitry Ilvokhin
aaa29003ab Limit maximum number of purged slabs with option
Option `experimental_hpa_max_purge_nhp` introduced for backward
compatibility reasons: to make it possible to have behaviour similar
to buggy `hpa_strict_min_purge_interval` implementation.

When `experimental_hpa_max_purge_nhp` is set to -1, there is no limit
to number of slabs we'll purge on each iteration. Otherwise, we'll purge
no more than `experimental_hpa_max_purge_nhp` hugepages (slabs). This in
turn means we might not purge enough dirty pages to satisfy
`hpa_dirty_mult` requirement.

Combination of `hpa_dirty_mult`, `experimental_hpa_max_purge_nhp` and
`hpa_strict_min_purge_interval` options allows us to have steady rate of
pages returned back to the system. This provides a strickier latency
guarantees as number of `madvise` calls is bounded (and hence number of
TLB shootdowns is limited) in exchange to weaker memory usage
guarantees.
2024-08-20 10:02:38 -07:00
Dmitry Ilvokhin
867c6dd7dc Option to guard hpa_min_purge_interval_ms fix
Change in `hpa_min_purge_interval_ms` handling logic is not backward
compatible as it might increase memory usage. Now this logic guarded by
`hpa_strict_min_purge_interval` option.

When `hpa_strict_min_purge_interval` is true, we will purge no more than
`hpa_min_purge_interval_ms`. When `hpa_strict_min_purge_interval` is
false, old purging logic behaviour is preserved.

Long term strategy migrate all users of hpa to new logic and then delete
`hpa_strict_min_purge_interval` option.
2024-06-07 10:52:41 -07:00
David Goldblatt
fc615739cb Add batching to arena bins.
This adds a fast-path for threads freeing a small number of allocations to
bins which are not their "home-base" and which encounter lock contention in
attempting to do so. In producer-consumer workflows, such small lock hold times
can cause lock convoying that greatly increases overall bin mutex contention.
2024-05-22 10:30:31 -07:00
David Goldblatt
c085530c71 Tcache batching: Plumbing
In the next commit, we'll start using the batcher to eliminate mutex traffic.
To avoid cluttering up that commit with the random bits of busy-work it entails,
we'll centralize them here.  This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.
2024-05-22 10:30:31 -07:00
Qi Wang
8d8379da44 Fix background_thread creation for the oversize_arena.
Bypassing background thread creation for the oversize_arena used to be an
optimization since that arena had eager purging.  However #2466 changed the
purging policy for the oversize_arena -- specifically it switched to the default
decay time when background_thread is enabled.

This issue is noticable when the number of arenas is low: whenever the total #
of arenas is <= 4 (which is the default max # of background threads), in which
case the purging will be stalled since no background thread is created for the
oversize_arena.
2024-05-02 14:45:18 -07:00
Daniel Hodges
11038ff762 Add support for namespace pids in heap profile names
This change adds support for writing pid namespaces to the filename of a
heap profile. When running with namespaces pids may reused across
namespaces and if mounts are shared where profiles are written there is
not a great way to differentiate profiles between pids.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Signed-off-by: Daniel Hodges <hodgesd@fb.com>
2024-04-09 10:27:52 -07:00
Qi Wang
83b075789b rallocx path: only set errno on the realloc case. 2024-04-05 17:41:43 -07:00
Shirui Cheng
5081c16bb4 Experimental calloc implementation with using memset on larger sizes 2024-04-04 15:31:56 -07:00
Juhyung Park
38056fea64 Set errno to ENOMEM on rallocx() OOM failures
realloc() and rallocx() shares path, and realloc() should set errno to
ENOMEM upon OOM failures.

Fixes: ee961c2310 ("Merge realloc and rallocx pathways.")
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
2024-04-04 15:13:22 -07:00
Shirui Cheng
373884ab48 print out all malloc_conf settings in stats 2024-02-29 12:12:44 -08:00
Qi Wang
1aba4f41a3 Allow zero sized memalign to pass.
Instead of failing on assertions.  Previously the same change was made for
posix_memalign and aligned_alloc (#1554).  Make memalign behave the same way
even though it's obsolete.
2024-02-16 13:06:07 -08:00
guangli-dai
b1792c80d2 Add LOGs when entrying and exiting free and sdallocx. 2024-01-11 14:37:20 -08:00
guangli-dai
eda05b3994 Fix static analysis warnings. 2024-01-03 14:18:52 -08:00
guangli-dai
6fb3b6a8e4 Refactor the tcache initiailization
1. Pre-generate all default tcache ncached_max in tcache_boot;
2. Add getters returning default ncached_max and ncached_max_set;
3. Refactor tcache init so that it is always init with a given setting.
2023-10-18 14:11:46 -07:00
guangli-dai
8a22d10b83 Allow setting default ncached_max for each bin through malloc_conf 2023-10-18 14:11:46 -07:00
guangli-dai
630f7de952 Add mallctl to set and get ncached_max of each cache_bin.
1. `thread_tcache_ncached_max_read_sizeclass` allows users to get the
    ncached_max of the bin with the input sizeclass, passed in through
    oldp (will be upper casted if not an exact bin size is given).
2. `thread_tcache_ncached_max_write` takes in a char array
    representing the settings for bins in the tcache.
2023-10-17 14:53:23 -07:00
guangli-dai
6b197fdd46 Pre-generate ncached_max for all bins for better tcache_max tuning experience. 2023-10-17 14:53:23 -07:00
guangli-dai
a442d9b895 Enable per-tcache tcache_max
1. add tcache_max and nhbins into tcache_t so that they are per-tcache,
   with one auto tcache per thread, it's also per-thread;
2. add mallctl for each thread to set its own tcache_max (of its auto tcache);
3. store the maximum number of items in each bin instead of using a global storage;
4. add tests for the modifications above.
5. Rename `nhbins` and `tcache_maxclass` to `global_do_not_change_nhbins` and `global_do_not_change_tcache_maxclass`.
2023-09-06 10:47:14 -07:00