server : support preserving reasoning_content in assistant message (#18994)

* support reasoning_content input

* report template caps to webui

* add docs

* rm commented code
Author: Xuan-Son Nguyen
Date: 2026-01-22 21:30:06 +01:00 (committed by GitHub)
parent a5eaa1d6a3
commit 51fa458a92
10 changed files with 165 additions and 131 deletions


@@ -781,6 +781,7 @@ By default, it is read-only. To make POST request to change global properties, y
"total_slots": 1,
"model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
"chat_template": "...",
"chat_template_caps": {},
"modalities": {
"vision": false
},
@@ -793,6 +794,7 @@ By default, it is read-only. To make POST request to change global properties, y
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `model_path` - the path to model file (same with `-m` argument)
- `chat_template` - the model's original Jinja2 prompt template
- `chat_template_caps` - capabilities of the chat template (see `common/jinja/caps.h` for more info)
- `modalities` - the list of supported modalities
- `is_sleeping` - sleeping status, see [Sleeping on idle](#sleeping-on-idle)
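For illustration, a minimal sketch of reading these fields from a running server (assuming the default `http://localhost:8080`; the exact keys inside `chat_template_caps` depend on the loaded template, see `common/jinja/caps.h`):

```python
# Sketch: query /props and inspect the reported template capabilities.
# Assumes a llama-server instance listening on http://localhost:8080.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8080/props") as resp:
    props = json.load(resp)

print(props["model_path"])                  # same as the -m argument
print(props.get("chat_template_caps", {}))  # capabilities of the chat template
print(props.get("modalities", {}))          # e.g. {"vision": false}
```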
@@ -1267,6 +1269,12 @@ This provides information on the performance of the server. It also allows calcu
The total number of tokens in context is equal to `prompt_n + cache_n + predicted_n`
*Reasoning support*
The server supports parsing and returning reasoning via the `reasoning_content` field, similar to the DeepSeek API.
Reasoning input (preserving reasoning in the conversation history) is also supported by certain templates; a rough usage sketch follows below. For more details, please refer to [PR #18994](https://github.com/ggml-org/llama.cpp/pull/18994).
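As a rough illustration only (assuming a server on `http://localhost:8080`, a model whose template supports reasoning input, and reasoning parsing enabled on the server), a client might round-trip `reasoning_content` through the assistant history like this:

```python
# Sketch: preserve reasoning_content across turns via /v1/chat/completions.
# Hypothetical endpoint/host values; field names follow the DeepSeek-style
# API described above.
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

def chat(msgs):
    req = urllib.request.Request(
        URL,
        data=json.dumps({"messages": msgs}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]

messages = [{"role": "user", "content": "What is 17 * 24?"}]
msg = chat(messages)

# The parsed reasoning is returned separately from the final answer.
print(msg.get("reasoning_content"))
print(msg["content"])

# Feed the reasoning back so the template can preserve it in history:
messages.append({
    "role": "assistant",
    "content": msg["content"],
    "reasoning_content": msg.get("reasoning_content"),
})
messages.append({"role": "user", "content": "Now divide that by 8."})
print(chat(messages)["content"])
```

Whether the reasoning is actually rendered back into the prompt depends on the template's capabilities as reported in `chat_template_caps`.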
### POST `/v1/responses`: OpenAI-compatible Responses API
*Options:*