server: improve speed of speculative decoding (#17808)

* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Xuan-Son Nguyen 2025-12-08 14:35:28 +01:00 committed by GitHub
parent e4e9c4329c
commit f896d2c34f
GPG key ID: B5690EEEBB952194
3 changed files with 108 additions and 76 deletions

@@ -81,6 +81,7 @@ For detailed instructions, see the [test documentation](./tests/README.md).
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470
- Speculative decoding: https://github.com/ggml-org/llama.cpp/pull/17808 and rework in https://github.com/ggml-org/llama.cpp/pull/17808