The pai_t interface implements C-style polymorphism via function pointers
to abstract over PAC and HPA. This abstraction provides no real benefit:
only two implementations exist, the dispatcher already knows which one to
use, and HPA stubs 2 of 5 operations. Remove the runtime dispatch in
favor of direct calls.
This commit:
- Promotes pac_alloc/expand/shrink/dalloc/time_until_deferred_work to
external linkage and replaces the pai_t *self parameter with pac_t *pac.
- Promotes hpa_alloc/expand/shrink/dalloc/time_until_deferred_work to
external linkage and replaces pai_t *self with hpa_shard_t *shard.
- Updates hpa_dalloc_batch's signature to take hpa_shard_t * directly
and removes the hpa_from_pai container-of helper. Updates internal
callers in hpa_alloc, hpa_dalloc, and hpa_sec_flush_impl.
- Drops the vtable assignments from pac_init() and hpa_shard_init().
- Replaces pai_alloc/dalloc/etc. dispatch in pa.c with direct calls.
HPA expand and shrink (which are unconditional failure stubs) are
skipped entirely for HPA-owned extents.
- Removes the pa_get_pai() helper.
- Updates tests in test/unit/hpa.c and test/unit/hpa_sec_integration.c
to call hpa_alloc/dalloc/etc. directly.
The pai_t struct field stays as dead weight in pac_t and hpa_shard_t;
it is removed in the next commit along with pai.h itself.
No behavioral changes.
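For illustration, a minimal before/after sketch of the dispatch change; signatures are simplified, not jemalloc's exact prototypes:

```c
#include <stdbool.h>
#include <stddef.h>

/* Before: C-style polymorphism; every call went through the vtable. */
typedef struct pai_s pai_t;
struct pai_s {
	void *(*alloc)(pai_t *self, size_t size);
	bool (*dalloc)(pai_t *self, void *ptr);
	/* ... expand, shrink, time_until_deferred_work ... */
};
/* Call site: pai->alloc(pai, size); */

/* After: concrete receiver types and direct calls. */
typedef struct pac_s pac_t;
typedef struct hpa_shard_s hpa_shard_t;
void *pac_alloc(pac_t *pac, size_t size);
void *hpa_alloc(hpa_shard_t *shard, size_t size);
/* Call site: pac_alloc(pac, size) or hpa_alloc(shard, size), chosen
 * directly by the dispatcher in pa.c. */
```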
Applying the MADV_DONTNEED advice to a range of virtual memory backed by
a transparent huge page already causes that range of virtual memory to
become backed by regular pages.
The implementation is inspired by the idea described in the "Beyond
malloc efficiency to fleet efficiency: a hugepage-aware memory
allocator" paper [1].
The primary idea is to track the maximum number (peak) of active pages
in use within a sliding window and then use this number to decide how
many dirty pages we would like to keep.
We are trying to estimate the maximum amount of active memory we'll need
in the near future. We do so by projecting future active memory demand
(based on the peak active memory usage we observed in the past within
the sliding window) and adding slack on top of it (the overhead is
reasonable to accept in exchange for higher hugepage coverage). When
peak demand tracking is off, the projection of future active memory is
simply the active memory we have right now.
The estimation is essentially `nactive_max * (1 + dirty_mult)`.
The peak demand purging algorithm is controlled by two config options.
`hpa_peak_demand_window_ms` controls the duration of the sliding window
in which we track maximum active memory usage, and `hpa_dirty_mult`
controls the amount of slack we are allowed to keep, as a percentage of
maximum active memory usage. By default `hpa_peak_demand_window_ms == 0`,
so we get the same behaviour (ratio-based purging) that we had before
this commit.
[1]: https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf
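A minimal sketch of the estimation, with hypothetical names (the real
code tracks the peak with a sliding-window data structure):

```c
#include <stdbool.h>
#include <stddef.h>

/* Estimate how many pages to keep resident; dirty pages beyond this
 * estimate are eligible for purging. nactive_max is the windowed peak
 * of active pages; dirty_mult corresponds to hpa_dirty_mult. */
static size_t
peak_demand_estimate(size_t nactive, size_t nactive_max,
    double dirty_mult, bool peak_tracking_on) {
	/* Projection of future demand: the windowed peak if tracking is
	 * on (hpa_peak_demand_window_ms > 0), otherwise current nactive. */
	size_t projected = peak_tracking_on ? nactive_max : nactive;
	/* Add slack on top: essentially projected * (1 + dirty_mult). */
	return projected + (size_t)((double)projected * dirty_mult);
}
```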
jemalloc has historically converted size to usize by ceiling the size
to the closest size class. However, this wastes a lot of memory with
HPA enabled. This commit changes how usize is calculated so that the
gap between two contiguous usizes is no larger than a page.
Specifically, this commit includes the following changes:
1. Adding a build-time config option (--enable-limit-usize-gap) and a
runtime one (limit_usize_gap) to guard the changes.
When the build-time config is enabled, some minor CPU overhead is
expected because usize will be stored and accessed apart from the
index. When the runtime option is also enabled (it can only be enabled
with the build-time config enabled), a new usize calculation approach
will be employed, as sketched after this list. This new calculation
ceils size to the closest multiple of PAGE for all sizes larger than
USIZE_GROW_SLOW_THRESHOLD instead of using the size classes.
Note that when the build-time config is enabled, the runtime option
defaults to on.
2. Prepare tcache for size to grow by PAGE beyond NGROUP * PAGE.
To prepare for the upcoming changes where the size class grows by PAGE
when larger than NGROUP * PAGE, disable the tcache when the size is
larger than 2 * NGROUP * PAGE. The tcache threshold is set higher to
prevent perf regression as much as possible, even though usizes between
NGROUP * PAGE and 2 * NGROUP * PAGE happen to grow by PAGE.
3. Prepare pac and hpa psset for size to grow by PAGE beyond NGROUP * PAGE.
For PAC, to avoid having too many bins, arena bins keep the same
layout. This means some extra searching is needed for a page-level
request that is not aligned with an original size class: we should also
search the heap before the current index, since the previous heap might
also hold allocations that satisfy it. The same changes apply to HPA's
psset.
This search relies on enumeration of the heap, because not all allocs in
the previous heap are guaranteed to satisfy the request. To balance the
memory and CPU overhead, we currently enumerate at most a fixed number
of nodes before concluding that none can satisfy the request during an
enumeration.
4. Add a bytes counter to arena large stats.
With the upcoming usize changes, stats collected by multiplying live
allocations by the bin size are no longer accurate. Thus, add separate
counters to record the bytes malloced and dalloced.
5. Change the structs used when freeing to avoid using index2size for
large sizes.
- Change the definition of emap_alloc_ctx_t.
- Change the read of both fields from edata_t.
- Change the assignment and usage of emap_alloc_ctx_t.
- Change other callsites of index2size.
Note that the changes in the data structure, i.e., emap_alloc_ctx_t,
take effect when the build-time config (--enable-limit-usize-gap) is
enabled, but it will store the same value as index2size(szind) if the
runtime option (opt_limit_usize_gap) is not enabled.
6. Adapt hpa to the usize changes.
Change the settings in sec to limit its usage for sizes larger than
USIZE_GROW_SLOW_THRESHOLD and modify the corresponding tests.
7. Modify the usize calculation and corresponding tests.
Change sz_s2u_compute. Note that sz_index2size is no longer always
safe, while sz_size2index still works as expected.
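A minimal sketch of the new rounding from point 1, under assumed PAGE
and threshold values (the real sz_s2u_compute, size-class tables, and
USIZE_GROW_SLOW_THRESHOLD differ in detail):

```c
#include <stddef.h>

#define PAGE ((size_t)4096)
/* Hypothetical threshold value, for illustration only. */
#define USIZE_GROW_SLOW_THRESHOLD ((size_t)16 * PAGE)

/* Simplified stand-in for the classic size-class ceiling (real size
 * classes are finer-grained than powers of two). */
static size_t
classic_s2u(size_t size) {
	size_t u = PAGE;
	while (u < size) {
		u <<= 1;
	}
	return u;
}

/* With limit_usize_gap on, sizes above the threshold round up to the
 * next multiple of PAGE, so consecutive usizes differ by at most one
 * page instead of a whole size-class step. */
static size_t
s2u_limit_gap(size_t size) {
	if (size <= USIZE_GROW_SLOW_THRESHOLD) {
		return classic_s2u(size);
	}
	return (size + PAGE - 1) & ~(PAGE - 1);
}
```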
Before this commit we had two age counters: one global in HPA central
and one local in each HPA shard. We used the HPA shard counter when
reusing an empty pageslab, and the HPA central counter everywhere else.
They are supposed to be comparable, because we use them for allocation
placement decisions, but in reality they are not: there are no ordering
guarantees between them.
At the moment, there is no way for a pageslab to migrate between HPA
shards, so we don't actually need the HPA central age counter.
Linux 6.1 introduced the `MADV_COLLAPSE` flag, which performs a
best-effort synchronous collapse of the native pages mapped by the
memory range into transparent huge pages.
Synchronous hugification might be beneficial for at least two reasons:
we no longer rely on khugepaged, and we get instant feedback if the
range wasn't hugified.
If the `hpa_hugify_sync` option is on, we'll try to perform the
collapse synchronously, and if that wasn't successful, we'll fall back
to the asynchronous behaviour.
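A minimal sketch of this fallback, assuming the standard madvise(2)
interface (not jemalloc's exact code; MADV_COLLAPSE needs Linux >= 6.1):

```c
#include <sys/mman.h>
#include <stdbool.h>
#include <stddef.h>

/* Try a synchronous collapse first; the return value gives instant
 * feedback on whether the range was actually hugified. */
static bool
hugify_sync(void *addr, size_t len) {
#ifdef MADV_COLLAPSE
	return madvise(addr, len, MADV_COLLAPSE) == 0;
#else
	(void)addr; (void)len;
	return false;
#endif
}

/* With hpa_hugify_sync on: attempt the synchronous collapse, and fall
 * back to the asynchronous path (khugepaged) if it fails. */
static void
hugify(void *addr, size_t len) {
	if (!hugify_sync(addr, len)) {
#ifdef MADV_HUGEPAGE
		madvise(addr, len, MADV_HUGEPAGE);
#endif
	}
}
```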
Option `experimental_hpa_strict_min_purge_interval` was expected to be
temporary, to simplify the rollout of a bugfix. Now that the bugfix
rollout is complete, it is safe to remove this option.
Option `experimental_hpa_max_purge_nhp` is introduced for backward
compatibility reasons: to make it possible to have behaviour similar
to the buggy `hpa_strict_min_purge_interval` implementation.
When `experimental_hpa_max_purge_nhp` is set to -1, there is no limit
to the number of slabs we'll purge on each iteration. Otherwise, we'll
purge no more than `experimental_hpa_max_purge_nhp` hugepages (slabs).
This in turn means we might not purge enough dirty pages to satisfy the
`hpa_dirty_mult` requirement.
The combination of the `hpa_dirty_mult`, `experimental_hpa_max_purge_nhp`,
and `hpa_strict_min_purge_interval` options allows us to have a steady
rate of pages returned to the system. This provides stricter latency
guarantees, as the number of `madvise` calls is bounded (and hence the
number of TLB shootdowns is limited), in exchange for weaker memory
usage guarantees.
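A minimal sketch of the bounded purge pass; hpa_should_purge and
hpa_try_purge are named in the surrounding text, but the wrapper and
the simplified signatures here are assumptions:

```c
#include <stdbool.h>
#include <sys/types.h>

typedef struct hpa_shard_s hpa_shard_t; /* opaque here; simplified */
bool hpa_should_purge(hpa_shard_t *shard);
bool hpa_try_purge(hpa_shard_t *shard); /* purges one hugepage (slab) */

/* Purge at most max_purge_nhp slabs per pass; -1 means no limit. */
static void
hpa_purge_with_budget(hpa_shard_t *shard, ssize_t max_purge_nhp) {
	ssize_t budget = max_purge_nhp;
	while (hpa_should_purge(shard)) {
		if (budget == 0) {
			/* Budget exhausted: we may leave more dirty pages
			 * than hpa_dirty_mult asks for until the next pass. */
			break;
		}
		if (!hpa_try_purge(shard)) {
			break;
		}
		if (budget > 0) {
			budget--;
		}
	}
}
```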
We update `shard->last_purge` on each call of `hpa_try_purge` if we
purged something. This means that when the `hpa_strict_min_purge_interval`
option is set, only one slab will be purged, because on the next call
the too-frequent-purge protection condition
`since_last_purge_ms < shard->opts.min_purge_interval_ms` will always
be true. This is not the intended behaviour.
Instead, we need to check `min_purge_interval_ms` once and purge as many
pages as needed to satisfy the requirements of the `hpa_dirty_mult`
option.
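A minimal sketch of the fixed control flow (simplified signatures, not
jemalloc's exact code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct hpa_shard_s hpa_shard_t; /* opaque here; simplified */
bool hpa_should_purge(hpa_shard_t *shard);
bool hpa_try_purge(hpa_shard_t *shard);

/* Check the interval once per deferred-work pass, not once per slab. */
static void
hpa_shard_maybe_purge(hpa_shard_t *shard, bool strict,
    uint64_t since_last_purge_ms, uint64_t min_purge_interval_ms) {
	if (strict && since_last_purge_ms < min_purge_interval_ms) {
		return; /* too soon to purge at all */
	}
	/* Purge as many slabs as hpa_dirty_mult requires, not just one;
	 * shard->last_purge is then updated once, after the loop. */
	while (hpa_should_purge(shard) && hpa_try_purge(shard)) {
	}
}
```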
Make it possible to count the number of actions performed in unit tests
(purge, hugify, dehugify) instead of a binary called/not called.
Extended the current unit tests with cases where we need to purge more
than one page in a purge phase.
It doesn't make much sense to repeat purging once we are done with
hugification, because we could de-hugify pages that were hugified just
a moment ago for no good reason. Let them wait for the next deferred
work phase instead; if they still meet the purging conditions then,
purge them.
The change in `hpa_min_purge_interval_ms` handling logic is not backward
compatible, as it might increase memory usage. This logic is now guarded
by the `hpa_strict_min_purge_interval` option.
When `hpa_strict_min_purge_interval` is true, we will purge no more
often than once per `hpa_min_purge_interval_ms`. When
`hpa_strict_min_purge_interval` is false, the old purging behaviour is
preserved.
The long-term strategy is to migrate all users of hpa to the new logic
and then delete the `hpa_strict_min_purge_interval` option.
Currently, the hugepage-aware allocator backend works together with the
classic one as a fallback for not-yet-supported allocations. When
background threads are enabled, the wakeup schedule of the classic
backend interferes with hpa, as there were no checks inside the hpa
purging logic for whether we are purging too frequently. If a background
thread is running and `hpa_should_purge` returns true, then we will
purge, even if we purged less than hpa_min_purge_interval_ms ago.
One of the conditions to start purging is the
`hpa_hugify_blocked_by_ndirty` function call returning true. This can
happen when we have no dirty memory for this shard at all, in which
case the purging loop becomes an infinite loop.
`hpa_hugify_blocked_by_ndirty` was introduced in 0f6c420, but at that
time the purging loop had a different form and an additional `break`
was not required. The purging loop was rewritten in 6630c5989, but the
additional exit condition wasn't added there at the time.
Repro code was shared by Patrik Dokoupil at [1]; I stripped it down to
the minimum needed to reproduce the issue in jemalloc unit tests.
[1]: https://github.com/jemalloc/jemalloc/pull/2533
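A minimal sketch of the missing exit condition (signatures simplified;
the real functions also take a tsdn_t *):

```c
#include <stdbool.h>

typedef struct hpa_shard_s hpa_shard_t; /* opaque here; simplified */
bool hpa_hugify_blocked_by_ndirty(hpa_shard_t *shard);
bool hpa_try_purge(hpa_shard_t *shard);

static void
hpa_purge_before_hugify(hpa_shard_t *shard) {
	while (hpa_hugify_blocked_by_ndirty(shard)) {
		if (!hpa_try_purge(shard)) {
			/* Nothing left to purge (e.g. the shard has no
			 * dirty memory at all); exit instead of spinning. */
			break;
		}
	}
}
```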
it's within the huge page size. These requests do not concern internal
fragmentation with huge pages, since the entire range is expected to be
accessed.
The codebase is already very disciplined about marking `static` any
function that can be, but a few appear to have slipped through the
cracks.
It appears that a simple typo means we're unconditionally overwriting
some fields in hpa_from_pai when asserts are enabled. From
hpa_shard_init, it looks like these fields have those values anyway, so
this shouldn't cause bugs, but if something is wrong it seems better to
have these asserts in place.
See issue #2412.
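The pattern in question, reconstructed for illustration (the exact
fields come from the pai_t vtable; treat the names as assumptions):

```c
/* Bug: `=` assigns inside the assert, unconditionally overwriting the
 * field whenever asserts are compiled in. */
assert(self->alloc = &hpa_alloc);
/* Fix: `==` merely checks it. */
assert(self->alloc == &hpa_alloc);
```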
The nstime module guarantees monotonic clock updates only within a
single nstime_t. This means that if two separate nstime_t variables are
read and updated separately, nstime_subtract between them may result in
underflow. Fixed by switching to the "time since" utility provided by
nstime.
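A sketch of the hazard and the fix, assuming jemalloc's internal nstime
helpers (not standalone code):

```c
nstime_t t1, t2;
nstime_init_update(&t1);
/* ... later, possibly on a different code path ... */
nstime_init_update(&t2);
/* Unsafe: t1 and t2 are each monotonic only w.r.t. themselves, so t2
 * may hold an earlier clock reading than t1 and this can underflow. */
nstime_subtract(&t2, &t1);

/* Safe: measure elapsed time against a single nstime_t. */
uint64_t elapsed_ns = nstime_ns_since(&t1);
```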
Currently used only for guarding purposes, the hint determines whether
the allocation is expected to be frequently reused. For example, it
might urge the allocator to ensure the allocation is cached.
Some nstime_t operations require and assume the input nstime is
initialized (e.g. nstime_update) -- uninitialized input may cause silent
failures which are difficult to reproduce / debug. Add an explicit flag
to track the state (limited to debug builds only).
Also fixed a use case in hpa (time of last_purge).
In order for nstime_update to handle non-monotonic clocks, it requires
the input nstime to be initialized -- when reading for the first time,
zero init has to be done. Otherwise a random stack value may be treated
as the clock and returned.
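For illustration, the required first-read pattern (assuming jemalloc's
NSTIME_ZERO_INITIALIZER macro and nstime_update):

```c
nstime_t t = NSTIME_ZERO_INITIALIZER; /* zero init before first read */
nstime_update(&t); /* without the zero init, random stack contents
                    * could be interpreted as the previous reading */
```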
As the code evolves, some code paths that have previously assigned
deferred_work_generated may cease being reached. This would leave the value
uninitialized. This change initializes the value for safety.
Adding guarded extents, which are regular extents surrounded by guard pages
(mprotected). To reduce syscalls, small guarded extents are cached as a
separate eset in ecache, and decay through the dirty / muzzy / retained pipeline
as usual.
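A minimal sketch of the guard-page layout (illustrative POSIX code, not
jemalloc's extent machinery; the function name is hypothetical):

```c
#include <sys/mman.h>
#include <stddef.h>

/* Reserve size + 2 guard pages; the usable extent sits between two
 * PROT_NONE pages, so any access touching a guard faults. */
static void *
guarded_extent_alloc(size_t size, size_t page) {
	char *base = mmap(NULL, size + 2 * page, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED) {
		return NULL;
	}
	mprotect(base, page, PROT_NONE);               /* leading guard */
	mprotect(base + page + size, page, PROT_NONE); /* trailing guard */
	return base + page;
}
```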
This change allows every allocator conforming to PAI to communicate that
it has deferred some work for the future. Without it, if a background
thread goes into indefinite sleep, there is no way to notify it about
upcoming deferred work.
Previously the calculation of sleep time between wakeups was implemented
within background_thread. This resulted in some decay- and hpa-specific
logic mixing with the background thread implementation. In this change,
the background thread delegates this calculation to the arena, and it,
in turn, delegates it to PAI.
The next step is to implement the actual calculation of time until
deferred work in HPA.
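A sketch of the delegation chain (simplified signatures; the real PAI
hook is time_until_deferred_work, but the arena-level wrapper here is
an assumption):

```c
#include <stdint.h>

typedef struct pai_s pai_t; /* opaque here; simplified */
/* Per-allocator hook: nanoseconds until deferred work is due, or
 * UINT64_MAX if there is none and the caller may sleep indefinitely. */
uint64_t pai_time_until_deferred_work(pai_t *pai);

/* The arena takes the minimum across its PAIs; the background thread
 * then sleeps at most that long instead of computing it itself. */
static uint64_t
arena_time_until_deferred_work(pai_t *pac, pai_t *hpa) {
	uint64_t pac_ns = pai_time_until_deferred_work(pac);
	uint64_t hpa_ns = pai_time_until_deferred_work(hpa);
	return pac_ns < hpa_ns ? pac_ns : hpa_ns;
}
```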
The edata_cache_small had a fill/flush heuristic. In retrospect, this was a
premature optimization; more testing indicates that an unbounded cache is
effectively fine here, and moreover we spend a nontrivial amount of time doing
unnecessary filling/flushing.
As the HPA takes on a larger and larger fraction of all allocations, any
theoretical differences in allocation patterns should shrink. The HPA is
more efficient with its metadata in general, so it still comes out ahead
on metadata usage anyway.