Add opt.process_madvise_max_batch which determines if process_madvise is enabled
(non-zero) and the max # of regions in each batch. Added another limiting
factor which is the space to reserve on stack, which results in the max batch of
128.
Linux 6.1 introduced `MADV_COLLAPSE` flag to perform a best-effort
synchronous collapse of the native pages mapped by the memory range into
transparent huge pages.
Synchronous hugification might be beneficial for at least two reasons:
we are not relying on khugepaged anymore and get an instant feedback if
range wasn't hugified.
If `hpa_hugify_sync` option is on, we'll try to perform synchronously
collapse and if it wasn't successful, we'll fallback to asynchronous
behaviour.
Tag 101 is assigned to "CoreMedia Capture Data", which makes for confusing output when debugging.
To avoid conflicts, use a tag in the reserved application-specific range from 240–255 (inclusive).
All assigned tags: 94d3b45284/osfmk/mach/vm_statistics.h (L773-L775)
- `-Wmissing-prototypes` and `-Wmissing-variable-declarations` are
helpful for finding dead code and/or things that should be `static`
but aren't marked as such.
- `-Wunused-macros` is of similar utility, but for identifying dead macros.
- `-Wunreachable-code` and `-Wunreachable-code-aggressive` do exactly
what they say: flag unreachable code.
Following from PR #2481, we replace all integer-to-pointer casts [which
hide pointer provenance information (and thus inhibit
optimizations)](https://clang.llvm.org/extra/clang-tidy/checks/performance/no-int-to-ptr.html)
with equivalent operations that preserve this information. I have
enabled the corresponding clang-tidy check in our static analysis CI so
that we do not get bitten by this again in the future.
For better or worse, Jemalloc has a significant number of global
variables. Making all eligible global variables `static` and/or `const`
at least makes it slightly easier to reason about them, as these
qualifications communicate to the programmer restrictions on their use
without having to `grep` the whole codebase.
Previously, small allocations which were sampled as part of heap
profiling were rounded up to `SC_LARGE_MINCLASS`. This additional memory
usage becomes problematic when the page size is increased, as noted in #2358.
Small allocations are now rounded up to the nearest multiple of `PAGE`
instead, reducing the memory overhead by a factor of 4 in the most
extreme cases.
instead of changing mmap_flags which makes the change permanent. This was
enforcing large alignments for allocations that did not need it causing
fragmentation. Reported by Andreas Gustafsson.
None of these are harmful, and they are almost certainly optimized
away by the compiler. The motivation for fixing them anyway is that
we'd like to enable static analysis as part of CI, and the first step
towards that is resolving the warnings it produces at present.
The option makes jemalloc use prctl with PR_SET_VMA to tag memory mappings with
"jemalloc_pg" or "jemalloc_pg_overcommit". This allows to easily identify
jemalloc's mappings in /proc/<pid>/maps. PR_SET_VMA is only available in Linux
5.17 and above.
Adding guarded extents, which are regular extents surrounded by guard pages
(mprotected). To reduce syscalls, small guarded extents are cached as a
separate eset in ecache, and decay through the dirty / muzzy / retained pipeline
as usual.
qemu does not support this, yet [1], and you can get very tricky assert
if you will run program with jemalloc in use under qemu:
<jemalloc>: ../contrib/jemalloc/src/extent.c:1195: Failed assertion: "p[i] == 0"
[1]: https://patchwork.kernel.org/patch/10576637/
Here is a simple example that shows the problem [2]:
// Gist to check possible issues with MADV_DONTNEED
// For example it does not supported by qemu user
// There is a patch for this [1], but it hasn't been applied.
// [1]: https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg05422.html
#include <sys/mman.h>
#include <stdio.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
int main(int argc, char **argv)
{
void *addr = mmap(NULL, 1<<16, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
return 1;
}
memset(addr, 'A', 1<<16);
if (!madvise(addr, 1<<16, MADV_DONTNEED)) {
puts("MADV_DONTNEED does not return error. Check memory.");
for (int i = 0; i < 1<<16; ++i) {
assert(((unsigned char *)addr)[i] == 0);
}
} else {
perror("madvise");
}
if (munmap(addr, 1<<16)) {
perror("munmap");
return 1;
}
return 0;
}
### unpatched qemu
$ qemu-x86_64-static /tmp/test-MADV_DONTNEED
MADV_DONTNEED does not return error. Check memory.
test-MADV_DONTNEED: /tmp/test-MADV_DONTNEED.c:19: main: Assertion `((unsigned char *)addr)[i] == 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted (core dumped)
### patched qemu (by returning ENOSYS error)
$ qemu-x86_64 /tmp/test-MADV_DONTNEED
madvise: Success
### patch for qemu to return ENOSYS
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 897d20c076..5540792e0e 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -11775,7 +11775,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
turns private file-backed mappings into anonymous mappings.
This will break MADV_DONTNEED.
This is a hint, so ignoring and returning success is ok. */
- return 0;
+ return ENOSYS;
#endif
#ifdef TARGET_NR_fcntl64
case TARGET_NR_fcntl64:
[2]: https://gist.github.com/azat/12ba2c825b710653ece34dba7f926ece
v2:
- review fixes
- add opt_dont_trust_madvise
v3:
- review fixes
- rename opt_dont_trust_madvise to opt_trust_madvise
- NetBSD overcommits
- When mapping pages, use the maximum of the alignment requested and the
compiled-in PAGE constant which might be greater than the current kernel
pagesize, since we compile binaries with the maximum page size supported
by the architecture (so that they work with all kernels).
some architecture like AArch64 may not have the open syscall, but have
openat syscall. so check and use SYS_openat if SYS_openat available if
SYS_open is not supported at init_thp_state.
in case `malloc_read_fd` returns a negative error number, the result
would afterwards be casted to an unsigned size_t, and may have
theoretically caused an out-of-bounds memory access in the following
`strncmp` call.
This makes it directly use MAP_EXCL and MAP_ALIGNED() instead
of weird workarounds involving mapping at random places and then
unmapping parts of them.
When configured with --with-lg-page, it's possible for the configured page size
to be greater than the system page size, in which case the page address may only
be aligned with the system page size.
"always" marks all user mappings as MADV_HUGEPAGE; while "never" marks all
mappings as MADV_NOHUGEPAGE. The default setting "default" does not change any
settings. Note that all the madvise calls are part of the default extent hooks
by design, so that customized extent hooks have complete control over the
mappings including hugepage settings.
This attempts to use VM_OVERCOMMIT OID - newly introduced in -CURRENT
few days ago, specifically for this purpose - instead of querying the
sysctl by its string name. Due to how syctlbyname(3) works, this means
we do one syscall during binary startup instead of two.
Signed-off-by: Edward Tomasz Napierala <trasz@FreeBSD.org>
This avoids sysctl(2) syscall during binary startup, using the value
passed in the ELF aux vector instead.
Signed-off-by: Edward Tomasz Napierala <trasz@FreeBSD.org>
On x86 Linux, we define our own MADV_FREE if madvise(2) is available, but no
MADV_FREE is detected. This allows the feature to be built in and enabled with
runtime detection.
It's possible to build with lazy purge enabled but depoly to systems without
such support. In this case, rely on the boot time detection instead of keep
making unnecessary madvise calls (which all returns EINVAL).
To avoid the high RSS caused by THP + low usage arena (i.e. THP becomes a
significant percentage), added a new "auto" option which will only start using
THP after a base allocator used up the first THP region. Starting from the
second hugepage (in a single arena), "auto" behaves the same as "always",
i.e. madvise hugepage right away.
As part of the metadata_thp support, We now have a separate swtich
(JEMALLOC_HAVE_MADVISE_HUGE) for MADV_HUGEPAGE availability. Use that instead
of JEMALLOC_THP (which doesn't guard pages_huge anymore) in tests.