server: add auto-sleep after N seconds of idle (#18228)

* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
2025-12-21 02:24:42 +01:00
commit ddcb75dd8a
parent 52ab19df63
12 changed files with 355 additions and 122 deletions
--- a/tools/server/README.md
+++ b/tools/server/README.md
@@ -1621,6 +1621,16 @@ Example of an error:
 }
 ```

+## Sleeping on Idle
+
+The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in [PR #18228](https://github.com/ggml-org/llama.cpp/pull/18228), can be enabled using the `--sleep-idle-seconds` command-line argument. It works seamlessly in both single-model and multi-model configurations.
+
+When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.
+
+Note that the following endpoints are exempt from being considered as incoming tasks. They do not trigger model reloading and do not reset the idle timer:
+- `GET /health`
+- `GET /props`
+
 ## More examples

 ### Interactive mode