server : make cache_reuse configurable per request (#17858)

Georgi Gerganov 2025-12-08 12:43:12 +02:00 committed by GitHub
parent 5814b4dce1
commit 2bc96931d2
GPG key ID: B5690EEEBB952194
4 changed files with 31 additions and 15 deletions


@@ -495,6 +495,8 @@ By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to re
`n_cmpl`: Number of completions to generate from the current prompt. If input has multiple prompts, the output will have N prompts times `n_cmpl` entries.
`n_cache_reuse`: Min chunk size to attempt reusing from the cache via KV shifting. For more info, see `--cache-reuse` arg. Default: `0`, which is disabled.
`stream`: Allows receiving each predicted token in real-time instead of waiting for the completion to finish (uses a different response format). To enable this, set to `true`.
`stop`: Specify a JSON array of stopping strings.
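
As a rough illustration of the per-request option described above, the following sketch builds a JSON body that sets `n_cache_reuse` for a single completion request. The helper name `build_completion_request` and the example values are hypothetical and not part of this commit; the sketch assumes the body is sent to the server's `/completion` endpoint and that `prompt`, `n_predict` and `cache_prompt` are accepted alongside `n_cache_reuse`.

```python
import json


def build_completion_request(prompt, n_cache_reuse=0, n_predict=64):
    """Build a JSON body for a completion request (hypothetical helper).

    n_cache_reuse is the minimum chunk size to attempt reusing from the
    cache via KV shifting; 0 leaves the feature disabled for this request.
    """
    body = {
        "prompt": prompt,
        "n_predict": n_predict,
        # Assumption: prompt caching is enabled so that reuse can apply.
        "cache_prompt": True,
        "n_cache_reuse": n_cache_reuse,
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Example value only; a suitable chunk size depends on the workload.
    print(build_completion_request("Hello", n_cache_reuse=256))
```

The resulting string could then be posted with any HTTP client. A request that omits `n_cache_reuse` would presumably fall back to the value given by the `--cache-reuse` argument, but that behaviour is an assumption here rather than something shown in this hunk.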