kv-cache : pad the cache size to 256 for performance (#17046)
* kv-cache : pad the size of the small SWA cache for performance
* context : pad the total context to 256
* cont : future-proof the swa pad
* server : adjust test params to new logic
parent 9eb9a1331d
commit 16bcc1259d
4 changed files with 14 additions and 7 deletions
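For orientation, the padding the commit message refers to rounds a size up to the next multiple of 256. A minimal sketch of that arithmetic follows; the pad_to_256 helper is illustrative only and stands in for the GGML_PAD macro used in the actual change.

#include <cstdint>
#include <cstdio>

// Illustrative round-up helper; the real code uses GGML_PAD for the same purpose.
static uint32_t pad_to_256(uint32_t x) {
    return ((x + 255) / 256) * 256;
}

int main() {
    printf("%u -> %u\n", 1000u, pad_to_256(1000)); // 1000 -> 1024
    printf("%u -> %u\n", 4096u, pad_to_256(4096)); // 4096 -> 4096 (already aligned)
    return 0;
}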
@@ -114,10 +114,14 @@ llama_context::llama_context(
         }
     }
 
+    // ref: https://github.com/ggml-org/llama.cpp/pull/17046#discussion_r2503085732
+    cparams.n_ctx = GGML_PAD(cparams.n_ctx, 256);
+
     if (cparams.kv_unified) {
         cparams.n_ctx_seq = cparams.n_ctx;
     } else {
         cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
+        cparams.n_ctx_seq = GGML_PAD(cparams.n_ctx_seq, 256);
 
         if (cparams.n_ctx_seq == 0) {
             throw std::runtime_error("n_ctx_seq == 0");
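A hedged sketch of the sizing logic after this hunk, with illustrative values. The pad_to helper below stands in for GGML_PAD, and the local variables mirror the cparams fields in the diff; this is not the full constructor, only the padding and per-sequence split shown above.

#include <cstdint>
#include <cstdio>
#include <stdexcept>

// Stand-in for GGML_PAD: round x up to the next multiple of n.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

int main() {
    uint32_t n_ctx      = 10000; // requested total context (illustrative)
    uint32_t n_seq_max  = 3;     // parallel sequences (illustrative)
    bool     kv_unified = false;

    // Total context is padded to a multiple of 256.
    n_ctx = pad_to(n_ctx, 256);                // 10000 -> 10240

    uint32_t n_ctx_seq = 0;
    if (kv_unified) {
        n_ctx_seq = n_ctx;
    } else {
        n_ctx_seq = n_ctx / n_seq_max;         // 10240 / 3 = 3413
        n_ctx_seq = pad_to(n_ctx_seq, 256);    // 3413 -> 3584

        if (n_ctx_seq == 0) {
            throw std::runtime_error("n_ctx_seq == 0");
        }
    }

    printf("n_ctx = %u, n_ctx_seq = %u\n", n_ctx, n_ctx_seq);
    return 0;
}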