Commit graph

837 commits

Author SHA1 Message Date
guangli-dai
8347f1045a Renaming limit_usize_gap to disable_large_size_classes 2025-05-06 14:47:35 -07:00
Guangli Dai
01e9ecbeb2 Remove build-time configuration 'config_limit_usize_gap' 2025-05-06 14:47:35 -07:00
Slobodan Predolac
1956a54a43 [process_madvise] Use process_madvise across multiple huge_pages 2025-04-25 19:19:03 -07:00
Slobodan Predolac
0dfb4a5a1a Add output argument to hpa_purge_begin to count dirty ranges 2025-04-25 19:19:03 -07:00
Slobodan Predolac
f19f49ef3e if process_madvise is supported, call it when purging hpa 2025-04-04 13:57:42 -07:00
Dmitry Ilvokhin
ad108d50f1 Extend purging algorithm with peak demand tracking
Implementation inspired by idea described in "Beyond malloc efficiency
to fleet efficiency: a hugepage-aware memory allocator" paper [1].

Primary idea is to track maximum number (peak) of active pages in use
with sliding window and then use this number to decide how many dirty
pages we would like to keep.

We are trying to estimate maximum amount of active memory we'll need in
the near future. We do so by projecting future active memory demand
(based on peak active memory usage we observed in the past within
sliding window) and adding slack on top of it (an overhead is reasonable
to have in exchange of higher hugepages coverage). When peak demand
tracking is off, projection of future active memory is active memory we
are having right now.

Estimation is essentially the same as `nactive_max * (1 + dirty_mult)`.

Peak demand purging algorithm controlled by two config options. Option
`hpa_peak_demand_window_ms` controls duration of sliding window we track
maximum active memory usage in and option `hpa_dirty_mult` controls
amount of slack we are allowed to have as a percent from maximum active
memory usage. By default `hpa_peak_demand_window_ms == 0` now and we
have same behaviour (ratio based purging) that we had before this
commit.

[1]: https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf
2025-03-13 10:12:22 -07:00
Qi Wang
22440a0207 Implement process_madvise support.
Add opt.process_madvise_max_batch which determines if process_madvise is enabled
(non-zero) and the max # of regions in each batch.  Added another limiting
factor which is the space to reserve on stack, which results in the max batch of
128.
2025-03-07 15:32:32 -08:00
Guangli Dai
6035d4a8d3 Cache extra extents in the dirty pool from ecache_alloc_grow 2025-03-06 15:08:13 -08:00
guangli-dai
c067a55c79 Introducing a new usize calculation policy
Converting size to usize is what jemalloc has been done by ceiling
size to the closest size class. However, this causes lots of memory
wastes with HPA enabled.  This commit changes how usize is calculated so
that the gap between two contiguous usize is no larger than a page.
Specifically, this commit includes the following changes:

1. Adding a build-time config option (--enable-limit-usize-gap) and a
runtime one (limit_usize_gap) to guard the changes.
When build-time
config is enabled, some minor CPU overhead is expected because usize
will be stored and accessed apart from index.  When runtime option is
also enabled (it can only be enabled with the build-time config
enabled). a new usize calculation approach wil be employed.  This new
calculation will ceil size to the closest multiple of PAGE for all sizes
larger than USIZE_GROW_SLOW_THRESHOLD instead of using the size classes.
Note when the build-time config is enabled, the runtime option is
default on.

2. Prepare tcache for size to grow by PAGE over GROUP*PAGE.
To prepare for the upcoming changes where size class grows by PAGE when
larger than NGROUP * PAGE, disable the tcache when it is larger than 2 *
NGROUP * PAGE. The threshold for tcache is set higher to prevent perf
regression as much as possible while usizes between NGROUP * PAGE and 2 *
NGROUP * PAGE happen to grow by PAGE.

3. Prepare pac and hpa psset for size to grow by PAGE over GROUP*PAGE
For PAC, to avoid having too many bins, arena bins still have the same
layout.  This means some extra search is needed for a page-level request that
is not aligned with the orginal size class: it should also search the heap
before the current index since the previous heap might also be able to
have some allocations satisfying it.  The same changes apply to HPA's
psset.
This search relies on the enumeration of the heap because not all allocs in
the previous heap are guaranteed to satisfy the request.  To balance the
memory and CPU overhead, we currently enumerate at most a fixed number
of nodes before concluding none can satisfy the request during an
enumeration.

4. Add bytes counter to arena large stats.
To prepare for the upcoming usize changes, stats collected by
multiplying alive allocations and the bin size is no longer accurate.
Thus, add separate counters to record the bytes malloced and dalloced.

5. Change structs use when freeing to avoid using index2size for large sizes.
  - Change the definition of emap_alloc_ctx_t
  - Change the read of both from edata_t.
  - Change the assignment and usage of emap_alloc_ctx_t.
  - Change other callsites of index2size.
Note for the changes in the data structure, i.e., emap_alloc_ctx_t,
will be used when the build-time config (--enable-limit-usize-gap) is
enabled but they will store the same value as index2size(szind) if the
runtime option (opt_limit_usize_gap) is not enabled.

6. Adapt hpa to the usize changes.
Change the settings in sec to limit is usage for sizes larger than
USIZE_GROW_SLOW_THRESHOLD and modify corresponding tests.

7. Modify usize calculation and corresponding tests.
Change the sz_s2u_compute. Note sz_index2size is not always safe now
while sz_size2index still works as expected.
2025-03-06 15:08:13 -08:00
Guangli Dai
ac279d7e71 Fix profiling sample metadata lookup during xallocx 2025-03-04 14:42:04 -08:00
Dmitry Ilvokhin
499f306859 Fix arena 0 deferral_allowed flag init
Arena 0 have a dedicated initialization path, which differs from
initialization path of other arenas. The main difference for the purpose
of this change is that we initialize arena 0 before we initialize
background threads. HPA shard options have `deferral_allowed` flag which
should be equal to `background_thread_enabled()` return value, but it
wasn't the case before this change, because for arena 0
`background_thread_enabled()` was initialized correctly after arena 0
initialization phase already ended.

Below is initialization sequence for arena 0 after this commit to
illustrate everything still should be initialized correctly.

* `hpa_central_init` initializes HPA Central, before we initialize every
  HPA shard (including arena's 0).
* `background_thread_boot1` initializes `background_thread_enabled()`
  return value.
* `pa_shard_enable_hpa` initializes arena 0 HPA shard.

```
                       malloc_init_hard -------------
                      /           /                  \
                     /           /                    \
                    /           /                      \
malloc_init_hard_a0_locked  background_thread_boot1  pa_shard_enable_hpa
        /                     /                          \
       /                     /                            \
      /                     /                              \
arena_boot       background_thread_enabled_seta         hpa_shard_init
     |
     |
pa_central_init
     |
     |
hpa_central_init
```
2025-02-18 12:10:35 -08:00
Qi Wang
3bc89cfeca Avoid implicit conversion in test/unit/prof_threshold 2025-01-31 10:18:36 -08:00
Qi Wang
1abeae9ebd Fix test/unit/prof_threshold when !config_stats 2025-01-30 10:39:49 -08:00
Shai Duvdevani
257e64b968 Unlike prof_sample which is supported only with profiling mode active, prof_threshold is intended to be an always-supported allocation callback with much less overhead. The usage of the threshold allows performance critical callers to change program execution based on the callback: e.g. drop caches when memory becomes high or to predict the program is about to OOM ahead of time using peak memory watermarks. 2025-01-29 18:55:52 -08:00
Qi Wang
607b866035 Check for 0 input when setting max_background_thread through mallctl.
Reported by @nc7s.
2025-01-28 10:38:56 -08:00
Dmitry Ilvokhin
52fa9577ba Fix integer overflow in test/unit/hash.c
`final[3]` is `uint8_t`. Integer conversion rank of `uint8_t` is lower
than integer conversion rank of `int`, so `uint8_t` got promoted to
`int`, which is signed integer type. Shift `final[3]` value left on 24,
when leftmost bit is set overflows `int` and it is undefined behaviour.

Before this change Undefined Behaviour Sanitizer was unhappy about it
with the following message.

```
../test/unit/hash.c:119:25: runtime error: left shift of 176 by 24
places cannot be represented in type 'int'
```

After this commit problem is gone.
2025-01-17 12:54:22 -08:00
Guangli Dai
587676fee8 Disable psset test when hugepage size is too large. 2024-12-17 12:35:35 -08:00
Dmitry Ilvokhin
46690c9ec0 Fix test_retained on boxes with a lot of CPUs
We are trying to create `ncpus * 2` threads for this test and place them
into `VARIABLE_ARRAY`, but `VARIABLE_ARRAY` can not be more than
`VARIABLE_ARRAY_SIZE_MAX` bytes. When there are a lot of threads on the
box test always fails.

```
$ nproc
176

$ make -j`nproc` tests_unit && ./test/unit/retained
<jemalloc>: ../test/unit/retained.c:123: Failed assertion:
"sizeof(thd_t) * (nthreads) <= VARIABLE_ARRAY_SIZE_MAX"
Aborted (core dumped)
```

There is no need for high concurrency for this test as we are only
checking stats there and it's behaviour is quite stable regarding number
of allocating threads.

Limited number of threads to 16 to save compute resources (on CI for
example) and reduce tests running time.

Before the change (`nproc` is 80 on this box).

```
$ make -j`nproc` tests_unit && time ./test/unit/retained
<...>
real    0m0.372s
user    0m14.236s
sys     0m12.338s
```

After the change (same box).

```
$ make -j`nproc` tests_unit && time ./test/unit/retained
<...>
real    0m0.018s
user    0m0.108s
sys     0m0.068s
```
2024-12-02 14:12:26 -08:00
Dmitry Ilvokhin
6092c980a6 Expose psset state stats
When evaluating changes in HPA logic, it is useful to know internal
`hpa_shard` state. Great deal of this state is `psset`. Some of the
`psset` stats was available, but in disaggregated form, which is not
very convenient. This commit exposed `psset` counters to `mallctl`
and malloc stats dumps.

Example of how malloc stats dump will look like after the change.

HPA shard stats:
  Pageslabs: 14899 (4354 huge, 10545 nonhuge)
  Active pages: 6708166 (2228917 huge, 4479249 nonhuge)
  Dirty pages: 233816 (331 huge, 233485 nonhuge)
  Retained pages: 686306
  Purge passes: 8730 (10 / sec)
  Purges: 127501 (146 / sec)
  Hugeifies: 4358 (5 / sec)
  Dehugifies: 4 (0 / sec)

Pageslabs, active pages, dirty pages and retained pages are rows added
by this change.
2024-11-21 09:23:32 -08:00
Dmitry Ilvokhin
3820e38dc1 Remove validation for HPA ratios
Config validation was introduced at 3aae792b with main intention to fix
infinite purging loop, but it didn't actually fix the underlying
problem, just masked it. Later 47d69b4ea was merged to address the same
problem.

Options `hpa_dirty_mult` and `hpa_hugification_threshold` have different
application dimensions: `hpa_dirty_mult` applied to active memory on the
shard, but `hpa_hugification_threshold` is a threshold for single
pageslab (hugepage). It doesn't make much sense to sum them up together.

While it is true that too high value of `hpa_dirty_mult` and too low
value of `hpa_hugification_threshold` can lead to pathological
behaviour, it is true for other options as well. Poor configurations
might lead to suboptimal and sometimes completely unacceptable
behaviour and that's OK, that is exactly the reason why they are called
poor.

There are other mechanism exist to prevent extreme behaviour, when we
hugified and then immediately purged page, see
`hpa_hugify_blocked_by_ndirty` function, which exist to prevent exactly
this case.

Lastly, `hpa_dirty_mult + hpa_hugification_threshold >= 1` constraint is
too tight and prevents a lot of valid configurations.
2024-11-20 18:59:07 -08:00
Dmitry Ilvokhin
0ce13c6fb5 Add opt hpa_hugify_sync to hugify synchronously
Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort
synchronous collapse of the native pages mapped by the memory range into
transparent huge pages.

Synchronous hugification might be beneficial for at least two reasons:
we are not relying on khugepaged anymore and get an instant feedback if
range wasn't hugified.

If `hpa_hugify_sync` option is on, we'll try to perform synchronously
collapse and if it wasn't successful, we'll fallback to asynchronous
behaviour.
2024-11-20 10:52:52 -08:00
Dmitry Ilvokhin
b9758afff0 Add nstime_ms_since to get time since in ms
Milliseconds are used a lot in hpa, so it is convenient to have
`nstime_ms_since` function instead of dividing to `MILLION` constantly.

For consistency renamed `nstime_msec` to `nstime_ms` as `ms` abbreviation
is used much more commonly across codebase than `msec`.

```
$ grep -Rn '_msec' include src | wc -l
2

$ grep -RPn '_ms( |,|:)' include src | wc -l
72
```

Function `nstime_msec` wasn't used anywhere in the code yet.
2024-11-08 10:37:28 -08:00
Nathan Slingerland
edc1576f03 Add safe frame-pointer backtrace unwinder 2024-10-01 11:01:56 -07:00
Dmitry Ilvokhin
4f4fd42447 Remove strict_min_purge_interval option
Option `experimental_hpa_strict_min_purge_interval` was expected to be
temporary to simplify rollout of a bugfix. Now, when bugfix rollout is
complete it is safe to remove this option.
2024-09-25 11:49:18 -07:00
Nathan Slingerland
60f472f367 Fix initialization of pop_attempt_results in bin_batching test 2024-09-12 11:36:17 -07:00
Shirui Cheng
baa5a90cc6 fix nstime_update_mock in arena_decay unit test 2024-08-29 10:50:33 -07:00
Shirui Cheng
8c54637f8c Better trigger race condition in bin_batching unit test 2024-08-23 14:10:04 -07:00
Dmitry Ilvokhin
c7ccb8d7e9 Add experimental prefix to hpa_strict_min_purge_interval
Goal is to make it obvious this option is experimental.
2024-08-20 10:02:38 -07:00
Dmitry Ilvokhin
aaa29003ab Limit maximum number of purged slabs with option
Option `experimental_hpa_max_purge_nhp` introduced for backward
compatibility reasons: to make it possible to have behaviour similar
to buggy `hpa_strict_min_purge_interval` implementation.

When `experimental_hpa_max_purge_nhp` is set to -1, there is no limit
to number of slabs we'll purge on each iteration. Otherwise, we'll purge
no more than `experimental_hpa_max_purge_nhp` hugepages (slabs). This in
turn means we might not purge enough dirty pages to satisfy
`hpa_dirty_mult` requirement.

Combination of `hpa_dirty_mult`, `experimental_hpa_max_purge_nhp` and
`hpa_strict_min_purge_interval` options allows us to have steady rate of
pages returned back to the system. This provides a strickier latency
guarantees as number of `madvise` calls is bounded (and hence number of
TLB shootdowns is limited) in exchange to weaker memory usage
guarantees.
2024-08-20 10:02:38 -07:00
Dmitry Ilvokhin
143f458188 Fix hpa_strict_min_purge_interval option logic
We update `shard->last_purge` on each call of `hpa_try_purge` if we
purged something. This means, when `hpa_strict_min_purge_interval`
option is set only one slab will be purged, because on the next
call condition for too frequent purge protection
`since_last_purge_ms < shard->opts.min_purge_interval_ms` will always
be true. This is not an intended behaviour.

Instead, we need to check `min_purge_interval_ms` once and purge as many
pages as needed to satisfy requirements for `hpa_dirty_mult` option.

Make possible to count number of actions performed in unit tests (purge,
hugify, dehugify) instead of binary: called/not called. Extended current
unit tests with cases where we need to purge more than one page for a
purge phase.
2024-08-20 10:02:38 -07:00
Shirui Cheng
8fefabd3a4 increase the ncached_max in fill_flush test case to 1024 2024-08-06 13:16:09 -07:00
Shirui Cheng
48f66cf4a2 add a size check when declare a stack array to be less than 2048 bytes 2024-08-06 13:16:09 -07:00
Nathan Slingerland
bc32ddff2d Add usize to prof_sample_hook_t 2024-07-30 10:29:30 -07:00
Dmitry Ilvokhin
b66f689764 Emit long string values without truncation
There are few long options (`bin_shards` and `slab_sizes` for example)
when they are specified and we emit statistics value gets truncated.

Moved emitting logic for strings into separate `emitter_emit_str`
function. It will try to emit string same way as before and if value is
too long will fallback emiting rest partially with chunks of `BUF_SIZE`.

Justification for long strings (longer than `BUF_SIZE`) is not
supported.
2024-07-29 13:58:31 -07:00
Shirui Cheng
a1fcbebb18 skip tcache GC for tcache_max unit test 2024-06-25 12:59:45 -07:00
Dmitry Ilvokhin
867c6dd7dc Option to guard hpa_min_purge_interval_ms fix
Change in `hpa_min_purge_interval_ms` handling logic is not backward
compatible as it might increase memory usage. Now this logic guarded by
`hpa_strict_min_purge_interval` option.

When `hpa_strict_min_purge_interval` is true, we will purge no more than
`hpa_min_purge_interval_ms`. When `hpa_strict_min_purge_interval` is
false, old purging logic behaviour is preserved.

Long term strategy migrate all users of hpa to new logic and then delete
`hpa_strict_min_purge_interval` option.
2024-06-07 10:52:41 -07:00
Dmitry Ilvokhin
91a6d230db Respect hpa_min_purge_interval_ms option
Currently, hugepages aware allocator backend works together with classic
one as a fallback for not yet supported allocations. When background
threads are enabled wake up time for classic interfere with hpa as there
were no checks inside hpa purging logic to check if we are not purging too
frequently. If background thread is running and `hpa_should_purge`
returns true, then we will purge, even if we purged less than
hpa_min_purge_interval_ms ago.
2024-06-07 10:52:41 -07:00
Dmitry Ilvokhin
90c627edb7 Export hugepage size with arenas.hugepage 2024-06-05 15:37:41 -07:00
David Goldblatt
fc615739cb Add batching to arena bins.
This adds a fast-path for threads freeing a small number of allocations to
bins which are not their "home-base" and which encounter lock contention in
attempting to do so. In producer-consumer workflows, such small lock hold times
can cause lock convoying that greatly increases overall bin mutex contention.
2024-05-22 10:30:31 -07:00
David Goldblatt
c085530c71 Tcache batching: Plumbing
In the next commit, we'll start using the batcher to eliminate mutex traffic.
To avoid cluttering up that commit with the random bits of busy-work it entails,
we'll centralize them here.  This commit introduces:
- A batched bin type.
- The ability to mix batched and unbatched bins in the arena.
- Conf parsing to set batches per size and a max batched size.
- mallctl access to the corresponding opt-namespace keys.
- Stats output of the above.
2024-05-22 10:30:31 -07:00
David Goldblatt
70c94d7474 Add batcher module.
This can be used to batch up simple operation commands for later use by another
thread.
2024-05-22 10:30:31 -07:00
Dmitry Ilvokhin
47d69b4eab HPA: Fix infinite purging loop
One of the condition to start purging is `hpa_hugify_blocked_by_ndirty`
function call returns true. This can happen in cases where we have no
dirty memory for this shard at all. In this case purging loop will be an
infinite loop.

`hpa_hugify_blocked_by_ndirty` was introduced at 0f6c420, but at that
time purging loop has different form and additional `break` was not
required. Purging loop form was re-written at 6630c5989, but additional
exit condition wasn't added there at the time.

Repo code was shared by Patrik Dokoupil at [1], I stripped it down to
minimum to reproduce issue in jemalloc unit tests.

[1]: https://github.com/jemalloc/jemalloc/pull/2533
2024-04-30 13:46:32 -07:00
Shirui Cheng
4b555c11a5 Enable heap profiling on MacOS 2024-04-09 12:57:01 -07:00
Daniel Hodges
11038ff762 Add support for namespace pids in heap profile names
This change adds support for writing pid namespaces to the filename of a
heap profile. When running with namespaces pids may reused across
namespaces and if mounts are shared where profiles are written there is
not a great way to differentiate profiles between pids.

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
Signed-off-by: Daniel Hodges <hodgesd@fb.com>
2024-04-09 10:27:52 -07:00
Shirui Cheng
373884ab48 print out all malloc_conf settings in stats 2024-02-29 12:12:44 -08:00
Qi Wang
a2c5267409 HPA: Allow frequent reused alloc to bypass the slab_max_alloc limit, as long as
it's within the huge page size.  These requests do not concern internal
fragmentation with huge pages, since the entire range is expected to be
accessed.
2024-01-18 14:51:04 -08:00
Shirui Cheng
e4817c8d89 Cleanup cache_bin_info_t* info input args 2023-10-25 10:27:31 -07:00
guangli-dai
6fb3b6a8e4 Refactor the tcache initiailization
1. Pre-generate all default tcache ncached_max in tcache_boot;
2. Add getters returning default ncached_max and ncached_max_set;
3. Refactor tcache init so that it is always init with a given setting.
2023-10-18 14:11:46 -07:00
guangli-dai
8a22d10b83 Allow setting default ncached_max for each bin through malloc_conf 2023-10-18 14:11:46 -07:00
guangli-dai
630f7de952 Add mallctl to set and get ncached_max of each cache_bin.
1. `thread_tcache_ncached_max_read_sizeclass` allows users to get the
    ncached_max of the bin with the input sizeclass, passed in through
    oldp (will be upper casted if not an exact bin size is given).
2. `thread_tcache_ncached_max_write` takes in a char array
    representing the settings for bins in the tcache.
2023-10-17 14:53:23 -07:00