server: improve speed of speculative decoding (#17808)
* server: improve speed of speculative decoding * fix small draft case * add link to the PR * server : fix generation time measurement * server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros) * server : add comment * add PR to docs --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit is contained in:
parent
e4e9c4329c
commit
f896d2c34f
3 changed files with 108 additions and 76 deletions
|
|
@ -81,6 +81,7 @@ For detailed instructions, see the [test documentation](./tests/README.md).
|
|||
- Separation of HTTP logic into dedicated files: https://github.com/ggml-org/llama.cpp/pull/17216
|
||||
- Large-scale code base split into smaller files: https://github.com/ggml-org/llama.cpp/pull/17362
|
||||
- Introduction of router mode: https://github.com/ggml-org/llama.cpp/pull/17470
|
||||
- Speculative decoding: https://github.com/ggml-org/llama.cpp/pull/17808 and rework in https://github.com/ggml-org/llama.cpp/pull/17808
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue