Commit graph

3677 commits

Author SHA1 Message Date
guangli-dai
4531411abe Modify .clang-format to have declarations aligned 2025-05-28 19:22:21 -07:00
guangli-dai
1818170c8d Fix binshard.sh by specifying bin_shards for all sizes. 2025-05-28 19:21:49 -07:00
guangli-dai
fd60645260 Add one more check to double free validation. 2025-05-28 19:21:49 -07:00
Xin Yang
5e460bfea2 Refactor: use the cache_bin_sz_t typedef instead of direct uint16_t
any future changes to the underlying data type for bin sizes
(such as upgrading from `uint16_t` to `uint32_t`) can be achieved
by modifying only the `cache_bin_sz_t` definition.

Signed-off-by: Xin Yang <yangxin.dev@bytedance.com>
2025-05-22 10:43:33 -07:00
Xin Yang
9169e9272a Fix: Adjust CACHE_BIN_NFLUSH_BATCH_MAX size to prevent assert failures
The maximum allowed value for `nflush_batch` is
`CACHE_BIN_NFLUSH_BATCH_MAX`. However, `tcache_bin_flush_impl_small`
could potentially declare an array of `emap_batch_lookup_result_t`
of size `CACHE_BIN_NFLUSH_BATCH_MAX + 1`. leads to a `VARIABLE_ARRAY`
assertion failure, observed when `tcache_nslots_small_max` is
configured to 2048. This patch ensures the array size does not exceed
the allowed maximum.

Signed-off-by: Xin Yang <yangxin.dev@bytedance.com>
2025-05-22 10:27:09 -07:00
guangli-dai
f19a569216 Ignore formatting commit in blame. 2025-05-20 14:21:08 -07:00
Slobodan Predolac
b6338c4ff6 EASY - be explicit in non-vectorized hpa tests 2025-05-19 16:31:04 -07:00
guangli-dai
554185356b Sample format on tcache_max test 2025-05-19 15:06:13 -07:00
guangli-dai
3cee771cfa Modify .clang-format to make it more aligned with current freebsd style 2025-05-19 15:06:13 -07:00
Jiebin Sun
3c14707b01 To improve reuse efficiency, the maximum coalesced size for large extents
in the dirty ecache has been limited. This patch was tested with real
workloads using ClickHouse (Clickbench Q35) on a system with 2x240 vCPUs.
The results showed a 2X in query per second (QPS) performance and
a reduction in page faults to 29% of the previous rate. Additionally,
microbenchmark testing involved 256 memory reallocations resizing
from 4KB to 16KB in one arena, which demonstrated a 5X performance
improvement.

Signed-off-by: Jiebin Sun <jiebin.sun@intel.com>
2025-05-12 15:45:36 -07:00
guangli-dai
37bf846cc3 Fixes to prevent static analysis warnings. 2025-05-06 14:47:35 -07:00
guangli-dai
8347f1045a Renaming limit_usize_gap to disable_large_size_classes 2025-05-06 14:47:35 -07:00
Guangli Dai
01e9ecbeb2 Remove build-time configuration 'config_limit_usize_gap' 2025-05-06 14:47:35 -07:00
Slobodan Predolac
852da1be15 Add experimental option force using SYS_process_madvise 2025-04-28 18:45:30 -07:00
Slobodan Predolac
1956a54a43 [process_madvise] Use process_madvise across multiple huge_pages 2025-04-25 19:19:03 -07:00
Slobodan Predolac
0dfb4a5a1a Add output argument to hpa_purge_begin to count dirty ranges 2025-04-25 19:19:03 -07:00
Slobodan Predolac
cfa90dfd80 Refactor hpa purging to prepare for vectorized call across multiple pages 2025-04-25 19:19:03 -07:00
Qi Wang
a3910b9802 Avoid forced purging during thread-arena migration when bg thd is on. 2025-04-25 19:18:20 -07:00
guangli-dai
c23a6bfdf6 Add opt.limit_usize_gap to stats 2025-04-16 10:38:10 -07:00
guangli-dai
c20a63a765 Silence the uninitialized warning from clang. 2025-04-16 10:38:10 -07:00
Qi Wang
f81fb92a89 Remove Travis CI macOS configs (not supported anymore). 2025-04-14 15:27:38 -07:00
Slobodan Predolac
f19f49ef3e if process_madvise is supported, call it when purging hpa 2025-04-04 13:57:42 -07:00
Kaspar M. Rohrer
80e9001af3 Move `extern "C" specifications for C++ to where they are needed
This should fix errors when compiling C++ code with modules enabled on clang.
2025-03-31 10:41:51 -07:00
Shirui Cheng
3688dfb5c3 fix assertion error in huge_arena_auto_thp_switch() when b0 is deleted in unit test 2025-03-20 12:45:23 -07:00
Jay Lee
a4defdb854 detect false failure of strerror_r
See tikv/jemallocator#108.

In a summary, test on `strerror_r` can fail due to reasons other
than `strerror_r` itself, so add an additional test to determine
the failure is expected.

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
2025-03-17 17:50:20 -07:00
Shirui Cheng
e1a77ec558 Support THP with Huge Arena in PAC 2025-03-17 16:06:43 -07:00
Audrey Dutcher
86bbabac32 background_thread: add fallback for pthread_create dlsym
If jemalloc is linked into a shared library, the RTLD_NEXT dlsym call
may fail since RTLD_NEXT is only specified to search all objects after
the current one in the loading order, and the pthread library may be
earlier in the load order. Instead of failing immediately, attempt one
more time to find pthread_create via RTLD_GLOBAL.

Errors cascading from this were observed on FreeBSD 14.1.
2025-03-17 09:41:04 -07:00
Guangli Dai
81f35e0b55 Modify Travis tests to use frameptr when profiling 2025-03-13 17:15:42 -07:00
Guangli Dai
773b5809f9 Fix frame pointer based unwinder to handle changing stack range 2025-03-13 17:15:42 -07:00
Dmitry Ilvokhin
ad108d50f1 Extend purging algorithm with peak demand tracking
Implementation inspired by idea described in "Beyond malloc efficiency
to fleet efficiency: a hugepage-aware memory allocator" paper [1].

Primary idea is to track maximum number (peak) of active pages in use
with sliding window and then use this number to decide how many dirty
pages we would like to keep.

We are trying to estimate maximum amount of active memory we'll need in
the near future. We do so by projecting future active memory demand
(based on peak active memory usage we observed in the past within
sliding window) and adding slack on top of it (an overhead is reasonable
to have in exchange of higher hugepages coverage). When peak demand
tracking is off, projection of future active memory is active memory we
are having right now.

Estimation is essentially the same as `nactive_max * (1 + dirty_mult)`.

Peak demand purging algorithm controlled by two config options. Option
`hpa_peak_demand_window_ms` controls duration of sliding window we track
maximum active memory usage in and option `hpa_dirty_mult` controls
amount of slack we are allowed to have as a percent from maximum active
memory usage. By default `hpa_peak_demand_window_ms == 0` now and we
have same behaviour (ratio based purging) that we had before this
commit.

[1]: https://storage.googleapis.com/gweb-research2023-media/pubtools/6170.pdf
2025-03-13 10:12:22 -07:00
Qi Wang
22440a0207 Implement process_madvise support.
Add opt.process_madvise_max_batch which determines if process_madvise is enabled
(non-zero) and the max # of regions in each batch.  Added another limiting
factor which is the space to reserve on stack, which results in the max batch of
128.
2025-03-07 15:32:32 -08:00
Guangli Dai
70f019cd3a Enable limit-usize-gap in CI tests.
Considering the new usize calculation will be default soon, add the
config option in for Travis, Cirrus and appveyor.
2025-03-06 15:08:13 -08:00
Guangli Dai
6035d4a8d3 Cache extra extents in the dirty pool from ecache_alloc_grow 2025-03-06 15:08:13 -08:00
guangli-dai
c067a55c79 Introducing a new usize calculation policy
Converting size to usize is what jemalloc has been done by ceiling
size to the closest size class. However, this causes lots of memory
wastes with HPA enabled.  This commit changes how usize is calculated so
that the gap between two contiguous usize is no larger than a page.
Specifically, this commit includes the following changes:

1. Adding a build-time config option (--enable-limit-usize-gap) and a
runtime one (limit_usize_gap) to guard the changes.
When build-time
config is enabled, some minor CPU overhead is expected because usize
will be stored and accessed apart from index.  When runtime option is
also enabled (it can only be enabled with the build-time config
enabled). a new usize calculation approach wil be employed.  This new
calculation will ceil size to the closest multiple of PAGE for all sizes
larger than USIZE_GROW_SLOW_THRESHOLD instead of using the size classes.
Note when the build-time config is enabled, the runtime option is
default on.

2. Prepare tcache for size to grow by PAGE over GROUP*PAGE.
To prepare for the upcoming changes where size class grows by PAGE when
larger than NGROUP * PAGE, disable the tcache when it is larger than 2 *
NGROUP * PAGE. The threshold for tcache is set higher to prevent perf
regression as much as possible while usizes between NGROUP * PAGE and 2 *
NGROUP * PAGE happen to grow by PAGE.

3. Prepare pac and hpa psset for size to grow by PAGE over GROUP*PAGE
For PAC, to avoid having too many bins, arena bins still have the same
layout.  This means some extra search is needed for a page-level request that
is not aligned with the orginal size class: it should also search the heap
before the current index since the previous heap might also be able to
have some allocations satisfying it.  The same changes apply to HPA's
psset.
This search relies on the enumeration of the heap because not all allocs in
the previous heap are guaranteed to satisfy the request.  To balance the
memory and CPU overhead, we currently enumerate at most a fixed number
of nodes before concluding none can satisfy the request during an
enumeration.

4. Add bytes counter to arena large stats.
To prepare for the upcoming usize changes, stats collected by
multiplying alive allocations and the bin size is no longer accurate.
Thus, add separate counters to record the bytes malloced and dalloced.

5. Change structs use when freeing to avoid using index2size for large sizes.
  - Change the definition of emap_alloc_ctx_t
  - Change the read of both from edata_t.
  - Change the assignment and usage of emap_alloc_ctx_t.
  - Change other callsites of index2size.
Note for the changes in the data structure, i.e., emap_alloc_ctx_t,
will be used when the build-time config (--enable-limit-usize-gap) is
enabled but they will store the same value as index2size(szind) if the
runtime option (opt_limit_usize_gap) is not enabled.

6. Adapt hpa to the usize changes.
Change the settings in sec to limit is usage for sizes larger than
USIZE_GROW_SLOW_THRESHOLD and modify corresponding tests.

7. Modify usize calculation and corresponding tests.
Change the sz_s2u_compute. Note sz_index2size is not always safe now
while sz_size2index still works as expected.
2025-03-06 15:08:13 -08:00
Guangli Dai
ac279d7e71 Fix profiling sample metadata lookup during xallocx 2025-03-04 14:42:04 -08:00
Qi Wang
f55e0c3f5c Remove unsupported Cirrus CI config 2025-03-03 16:29:04 -08:00
Dmitry Ilvokhin
499f306859 Fix arena 0 deferral_allowed flag init
Arena 0 have a dedicated initialization path, which differs from
initialization path of other arenas. The main difference for the purpose
of this change is that we initialize arena 0 before we initialize
background threads. HPA shard options have `deferral_allowed` flag which
should be equal to `background_thread_enabled()` return value, but it
wasn't the case before this change, because for arena 0
`background_thread_enabled()` was initialized correctly after arena 0
initialization phase already ended.

Below is initialization sequence for arena 0 after this commit to
illustrate everything still should be initialized correctly.

* `hpa_central_init` initializes HPA Central, before we initialize every
  HPA shard (including arena's 0).
* `background_thread_boot1` initializes `background_thread_enabled()`
  return value.
* `pa_shard_enable_hpa` initializes arena 0 HPA shard.

```
                       malloc_init_hard -------------
                      /           /                  \
                     /           /                    \
                    /           /                      \
malloc_init_hard_a0_locked  background_thread_boot1  pa_shard_enable_hpa
        /                     /                          \
       /                     /                            \
      /                     /                              \
arena_boot       background_thread_enabled_seta         hpa_shard_init
     |
     |
pa_central_init
     |
     |
hpa_central_init
```
2025-02-18 12:10:35 -08:00
Dmitry Ilvokhin
421b17a622 Remove age_counter from hpa_central
Before this commit we had two age counters: one global in HPA central
and one local in each HPA shard. We used HPA shard counter, when we are
reused empty pageslab and HPA central counter anywhere else. They
suppose to be comparable, because we use them for allocation placement
decisions, but in reality they are not, there is no ordering guarantees
between them.

At the moment, there is no way for pageslab to migrate between HPA
shards, so we don't actually need HPA central age counter.
2025-02-13 16:00:41 -08:00
roblabla
c17bf8b368 Disable config from file or envvar with build flag
This adds a new autoconf flag, --disable-user-config, which disables
reading the configuration from /etc/malloc.conf or the MALLOC_CONF
environment variable. This can be useful when integrating jemalloc in a
binary that internally handles all aspects of the configuration and
shouldn't be impacted by ambient change in the environment.
2025-02-05 15:01:50 -08:00
Dmitry Ilvokhin
34c823f147 Add autoconf options to enable sanitizers
This commit allows to enable sanitizers with autoconf options, instead
of modifying `CFLAGS`, `CXXFLAGS` and `LDFLAGS` directly.

* `--enable-tsan` option to enable Thread Sanitizer.
* `--enable-ubsan` option to enable Undefined Behaviour Sanitizer.

End goal is to speedup development by finding problems quickly, early
and easier. Eventually, when all current issues will be fixed, we can
enable sanitizers in CI. Fortunately, there are not a lot of problems we
need to fix.

Address Sanitizer is a bit controversial, because it replaces memory
allocator, so we decided to left it out for a while.

Below are couple of examples of how tests look like under different
sanitizers at the moment.

```
$  ../configure --enable-tsan --enable-debug
<...>
asan               : 0
tsan               : 1
ubsan              : 0
$ make -j`nproc` check
<...>
  Thread T13 (tid=332043, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x61748)
    #1 thd_create ../test/src/thd.c:25 (bin_batching+0x5631ca)
    #2 stress_run ../test/unit/bin_batching.c:148
(bin_batching+0x40364c)
    #3 test_races ../test/unit/bin_batching.c:249
(bin_batching+0x403d79)
    #4 p_test_impl ../test/src/test.c:149 (bin_batching+0x562811)
    #5 p_test_no_reentrancy ../test/src/test.c:213
(bin_batching+0x562d35)
    #6 main ../test/unit/bin_batching.c:268 (bin_batching+0x40417e)

SUMMARY: ThreadSanitizer: data race
../include/jemalloc/internal/edata.h:498 in edata_nfree_inc
```

```
$ ../configure --enable-ubsan --enable-debug
<...>
asan               : 0
tsan               : 0
ubsan              : 1
$ make -j`nproc` check
<...>
=== test/unit/hash ===
../test/unit/hash.c:119:16: runtime error: left shift of 176 by 24
places cannot be represented in type 'int'
<...>
```
2025-02-05 14:28:28 -08:00
Qi Wang
3bc89cfeca Avoid implicit conversion in test/unit/prof_threshold 2025-01-31 10:18:36 -08:00
Qi Wang
1abeae9ebd Fix test/unit/prof_threshold when !config_stats 2025-01-30 10:39:49 -08:00
Shai Duvdevani
257e64b968 Unlike prof_sample which is supported only with profiling mode active, prof_threshold is intended to be an always-supported allocation callback with much less overhead. The usage of the threshold allows performance critical callers to change program execution based on the callback: e.g. drop caches when memory becomes high or to predict the program is about to OOM ahead of time using peak memory watermarks. 2025-01-29 18:55:52 -08:00
Dmitry Ilvokhin
ef8e512e29 Fix bitmap_ffu out of range read
We tried to load `g` from `bitmap[i]` before checking it is actually a
valid load. Tweaked a loop a bit to `break` early, when we are done
scanning for bits.

Before this commit undefined behaviour sanitizer from GCC 14+ was
unhappy at `test/unit/bitmap` test with following error.

```
../include/jemalloc/internal/bitmap.h:293:5: runtime error: load of
address 0x7bb1c2e08008 with insufficient space for an object of type
'const bitmap_t'
<...>
    #0 0x62671a149954 in bitmap_ffu ../include/jemalloc/internal/bitmap.h:293
    #1 0x62671a149954 in test_bitmap_xfu_body ../test/unit/bitmap.c:275
    #2 0x62671a14b767 in test_bitmap_xfu ../test/unit/bitmap.c:323
    #3 0x62671a376ad1 in p_test_impl ../test/src/test.c:149
    #4 0x62671a377135 in p_test ../test/src/test.c:200
    #5 0x62671a13da06 in main ../test/unit/bitmap.c:336
<...>
```
2025-01-28 10:42:20 -08:00
Qi Wang
607b866035 Check for 0 input when setting max_background_thread through mallctl.
Reported by @nc7s.
2025-01-28 10:38:56 -08:00
Qi Wang
20cc983314 Fix the gettid() detection caught by @mrluanma . 2025-01-22 10:30:53 -08:00
Dmitry Ilvokhin
52fa9577ba Fix integer overflow in test/unit/hash.c
`final[3]` is `uint8_t`. Integer conversion rank of `uint8_t` is lower
than integer conversion rank of `int`, so `uint8_t` got promoted to
`int`, which is signed integer type. Shift `final[3]` value left on 24,
when leftmost bit is set overflows `int` and it is undefined behaviour.

Before this change Undefined Behaviour Sanitizer was unhappy about it
with the following message.

```
../test/unit/hash.c:119:25: runtime error: left shift of 176 by 24
places cannot be represented in type 'int'
```

After this commit problem is gone.
2025-01-17 12:54:22 -08:00
Dan Horák
17881ebbfd Add configure check for gettid() presence
The gettid() function is available on Linux in glibc only since version
2.30. There are supported distributions that still use older glibc
version. Thus add a configure check if the gettid() function is
available and extend the check in src/prof_stack_range.c so it's skipped
also when gettid() isn't available.

Fixes: https://github.com/jemalloc/jemalloc/issues/2740
2024-12-17 12:40:54 -08:00
appujee
4b88bddbca Conditionally remove unreachable for C23+ 2024-12-17 12:39:00 -08:00
appujee
d8486b2653 Remove unreachable() macro as c23 already defines it.
Taken from https://android-review.git.corp.google.com/c/platform/external/jemalloc_new/+/3316478

This might need more cleanups to remove the definition of JEMALLOC_INTERNAL_UNREACHABLE.
2024-12-17 12:39:00 -08:00