server : support preserving reasoning_content in assistant message (#18994)

* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
Author: Xuan-Son Nguyen
Date: 2026-01-22 21:30:06 +01:00 (committed by GitHub)
parent a5eaa1d6a3
commit 51fa458a92
10 changed files with 165 additions and 131 deletions


@@ -781,6 +781,7 @@ By default, it is read-only. To make POST request to change global properties, y
"total_slots": 1,
"model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
"chat_template": "...",
"chat_template_caps": {},
"modalities": {
"vision": false
},
@@ -793,6 +794,7 @@ By default, it is read-only. To make POST request to change global properties, y
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `model_path` - the path to model file (same with `-m` argument)
- `chat_template` - the model's original Jinja2 prompt template
- `chat_template_caps` - capabilities of the chat template (see `common/jinja/caps.h` for more info)
- `modalities` - the list of supported modalities
- `is_sleeping` - sleeping status, see [Sleeping on idle](#sleeping-on-idle)
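For illustration, a minimal sketch of reading these fields from a running server (assuming the default `http://localhost:8080`; the exact keys inside `chat_template_caps` depend on the loaded template, see `common/jinja/caps.h`):

```python
# Sketch: query /props and inspect the reported template capabilities.
# Assumes a llama-server instance listening on http://localhost:8080.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/props") as resp:
    props = json.load(resp)

print(props["model_path"])                  # same as the -m argument
print(props.get("chat_template_caps", {}))  # capabilities of the chat template
print(props.get("modalities", {}))          # e.g. {"vision": false}
```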
@@ -1267,6 +1269,12 @@ This provides information on the performance of the server. It also allows calcu
The total number of tokens in context is equal to `prompt_n + cache_n + predicted_n`
*Reasoning support*
The server supports parsing and returning reasoning via the `reasoning_content` field, similar to the DeepSeek API.
Reasoning input (preserving reasoning in the conversation history) is also supported by certain templates; a rough usage sketch follows below. For more details, please refer to [PR #18994](https://github.com/ggml-org/llama.cpp/pull/18994).
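As a rough illustration only (assuming a server on `http://localhost:8080`, a model whose template supports reasoning input, and reasoning parsing enabled on the server), a client might round-trip `reasoning_content` through the assistant history like this:

```python
# Sketch: preserve reasoning_content across turns via /v1/chat/completions.
# Hypothetical endpoint/host values; field names follow the DeepSeek-style
# API described above.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

def chat(msgs):
    req = urllib.request.Request(
        URL,
        data=json.dumps({"messages": msgs}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]

messages = [{"role": "user", "content": "What is 17 * 24?"}]
msg = chat(messages)

# The parsed reasoning is returned separately from the final answer.
print(msg.get("reasoning_content"))
print(msg["content"])

# Feed the reasoning back so the template can preserve it in history:
messages.append({
    "role": "assistant",
    "content": msg["content"],
    "reasoning_content": msg.get("reasoning_content"),
})
messages.append({"role": "user", "content": "Now divide that by 8."})
print(chat(messages)["content"])
```

Whether the reasoning is actually rendered back into the prompt depends on the template's capabilities as reported in `chat_template_caps`.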
### POST `/v1/responses`: OpenAI-compatible Responses API
*Options:*