server: improve speed of speculative decoding (#17808)

* server: improve speed of speculative decoding
* fix small draft case
* add link to the PR
* server : fix generation time measurement
* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)
* server : add comment
* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:35:28 +01:00
commit f896d2c34f
parent e4e9c4329c
3 changed files with 108 additions and 76 deletions
--- a/tools/server/README-dev.md
+++ b/tools/server/README-dev.md
@@ -81,6 +81,7 @@ For detailed instructions, see the [test documentation](./tests/README.md).
 - Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
 - Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
 - Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470
+- Speculative decoding: https://github.com/ggml-org/llama.cpp/pull/17808 and rework in https://github.com/ggml-org/llama.cpp/pull/17808