Converting size to usize is what jemalloc has been done by ceiling
size to the closest size class. However, this causes lots of memory
wastes with HPA enabled. This commit changes how usize is calculated so
that the gap between two contiguous usize is no larger than a page.
Specifically, this commit includes the following changes:
1. Adding a build-time config option (--enable-limit-usize-gap) and a
runtime one (limit_usize_gap) to guard the changes.
When build-time
config is enabled, some minor CPU overhead is expected because usize
will be stored and accessed apart from index. When runtime option is
also enabled (it can only be enabled with the build-time config
enabled). a new usize calculation approach wil be employed. This new
calculation will ceil size to the closest multiple of PAGE for all sizes
larger than USIZE_GROW_SLOW_THRESHOLD instead of using the size classes.
Note when the build-time config is enabled, the runtime option is
default on.
2. Prepare tcache for size to grow by PAGE over GROUP*PAGE.
To prepare for the upcoming changes where size class grows by PAGE when
larger than NGROUP * PAGE, disable the tcache when it is larger than 2 *
NGROUP * PAGE. The threshold for tcache is set higher to prevent perf
regression as much as possible while usizes between NGROUP * PAGE and 2 *
NGROUP * PAGE happen to grow by PAGE.
3. Prepare pac and hpa psset for size to grow by PAGE over GROUP*PAGE
For PAC, to avoid having too many bins, arena bins still have the same
layout. This means some extra search is needed for a page-level request that
is not aligned with the orginal size class: it should also search the heap
before the current index since the previous heap might also be able to
have some allocations satisfying it. The same changes apply to HPA's
psset.
This search relies on the enumeration of the heap because not all allocs in
the previous heap are guaranteed to satisfy the request. To balance the
memory and CPU overhead, we currently enumerate at most a fixed number
of nodes before concluding none can satisfy the request during an
enumeration.
4. Add bytes counter to arena large stats.
To prepare for the upcoming usize changes, stats collected by
multiplying alive allocations and the bin size is no longer accurate.
Thus, add separate counters to record the bytes malloced and dalloced.
5. Change structs use when freeing to avoid using index2size for large sizes.
- Change the definition of emap_alloc_ctx_t
- Change the read of both from edata_t.
- Change the assignment and usage of emap_alloc_ctx_t.
- Change other callsites of index2size.
Note for the changes in the data structure, i.e., emap_alloc_ctx_t,
will be used when the build-time config (--enable-limit-usize-gap) is
enabled but they will store the same value as index2size(szind) if the
runtime option (opt_limit_usize_gap) is not enabled.
6. Adapt hpa to the usize changes.
Change the settings in sec to limit is usage for sizes larger than
USIZE_GROW_SLOW_THRESHOLD and modify corresponding tests.
7. Modify usize calculation and corresponding tests.
Change the sz_s2u_compute. Note sz_index2size is not always safe now
while sz_size2index still works as expected.
Header files are now self-contained, which makes the relationships
between the files clearer, and crucially allows LSP tools like `clangd`
to function correctly in all of our header files. I have verified that
the headers are self-contained (aside from the various Windows shims) by
compiling them as if they were C files – in a follow-up commit I plan to
add this to CI to ensure we don't regress on this front.
This mallctl accepts an arena_config_t structure which
can be used to customize the behavior of the arena.
Right now it contains extent_hooks and a new option,
metadata_use_hooks, which controls whether the extent
hooks are also used for metadata allocation.
The medata_use_hooks option has two main use cases:
1. In heterogeneous memory systems, to avoid metadata
being placed on potentially slower memory.
2. Avoiding virtual memory from being leaked as a result
of metadata allocation failure originating in an extent hook.
The experimental `smallocx` API is not exposed via header files,
requiring the users to peek at `jemalloc`'s source code to manually
add the external declarations to their own programs.
This should reinforce that `smallocx` is experimental, and that `jemalloc`
does not offer any kind of backwards compatiblity or ABI gurantees for it.
Before this commit jemalloc produced many warnings when compiled with -Wextra
with both Clang and GCC. This commit fixes the issues raised by these warnings
or suppresses them if they were spurious at least for the Clang and GCC
versions covered by CI.
This commit:
* adds `JEMALLOC_DIAGNOSTIC` macros: `JEMALLOC_DIAGNOSTIC_{PUSH,POP}` are
used to modify the stack of enabled diagnostics. The
`JEMALLOC_DIAGNOSTIC_IGNORE_...` macros are used to ignore a concrete
diagnostic.
* adds `JEMALLOC_FALLTHROUGH` macro to explicitly state that falling
through `case` labels in a `switch` statement is intended
* Removes all UNUSED annotations on function parameters. The warning
-Wunused-parameter is now disabled globally in
`jemalloc_internal_macros.h` for all translation units that include
that header. It is never re-enabled since that header cannot be
included by users.
* locally suppresses some -Wextra diagnostics:
* `-Wmissing-field-initializer` is buggy in older Clang and GCC versions,
where it does not understanding that, in C, `= {0}` is a common C idiom
to initialize a struct to zero
* `-Wtype-bounds` is suppressed in a particular situation where a generic
macro, used in multiple different places, compares an unsigned integer for
smaller than zero, which is always true.
* `-Walloc-larger-than-size=` diagnostics warn when an allocation function is
called with a size that is too large (out-of-range). These are suppressed in
the parts of the tests where `jemalloc` explicitly does this to test that the
allocation functions fail properly.
* adds a new CI build bot that runs the log unit test on CI.
Closes#1196 .
Added opt.background_thread to enable background threads, which handles purging
currently. When enabled, decay ticks will not trigger purging (which will be
left to the background threads). We limit the max number of threads to NCPUs.
When percpu arena is enabled, set CPU affinity for the background threads as
well.
The sleep interval of background threads is dynamic and determined by computing
number of pages to purge in the future (based on backlog).
Simplify configuration by removing the --disable-tcache option, but
replace the testing for that configuration with
--with-malloc-conf=tcache:false.
Fix the thread.arena and thread.tcache.flush mallctls to work correctly
if tcache is disabled.
This partially resolves#580.
The new feature, opt.percpu_arena, determines thread-arena association
dynamically based CPU id. Three modes are supported: "percpu", "phycpu"
and disabled.
"percpu" uses the current core id (with help from sched_getcpu())
directly as the arena index, while "phycpu" will assign threads on the
same physical CPU to the same arena. In other words, "percpu" means # of
arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of
CPUs). Note that no runtime check on whether hyper threading is enabled
is added yet.
When enabled, threads will be migrated between arenas when a CPU change
is detected. In the current design, to reduce overhead from reading CPU
id, each arena tracks the thread accessed most recently. When a new
thread comes in, we will read CPU id and update arena if necessary.
malloc_conf does not reliably work with MSVC, which complains of
"inconsistent dll linkage", i.e. its inability to support the
application overriding malloc_conf when dynamically linking/loading.
Work around this limitation by adding test harness support for per test
shell script sourcing, and converting all tests to use MALLOC_CONF
instead of malloc_conf.
Move test extent hook code from the extent integration test into a
header, and normalize the out-of-band controls and introspection.
Also refactor the base unit test to use the header.
Add/rename related mallctls:
- Add stats.arenas.<i>.base .
- Rename stats.arenas.<i>.metadata to stats.arenas.<i>.internal .
- Add stats.arenas.<i>.resident .
Modify the arenas.extend mallctl to take an optional (extent_hooks_t *)
argument so that it is possible for all base allocations to be serviced
by the specified extent hooks.
This resolves#463.
Split purging into lazy and forced variants. Use the forced variant for
zeroing dss.
Add support for NULL function pointers as an opt-out mechanism for the
dalloc, commit, decommit, purge_lazy, purge_forced, split, and merge
fields of extent_hooks_t.
Add short-circuiting checks in large_ralloc_no_move_{shrink,expand}() so
that no attempt is made if splitting/merging is not supported.
This resolves#268.
Adds cpp bindings for jemalloc, along with necessary autoconf settings.
This is mostly to add sized deallocation support, which can't be added
from C directly. Sized deallocation is ~10% microbench improvement.
* Import ax_cxx_compile_stdcxx.m4 from the autoconf repo, seems like the
easiest way to get c++14 detection.
* Adds various other changes, like CXXFLAGS, to configure.ac.
* Adds new rules to Makefile.in for src/jemalloc-cpp.cpp, and a basic
unittest.
* Both new and delete are overridden, to ensure jemalloc is used for
both.
* TODO future enhancement of avoiding extra PLT thunks for new and
delete - sdallocx and malloc are publicly exported jemalloc symbols,
using an alias would link them directly. Unfortunately, was having
trouble getting it to play nice with jemalloc's namespace support.
Testing:
Tested gcc 4.8, gcc 5, gcc 5.2, clang 4.0. Only gcc >= 5 has sized
deallocation support, verified that the rest build correctly.
Tested mac osx and Centos.
Tested --with-jemalloc-prefix and --without-export.
This resolves#202.