llama-cpp-turboquant

Daniel Bevenius 134e6940ca	2025-11-24 21:06:17 +01:00
llama : skip output reordering for single token batches (#17466)

This commit adds a check to skip the output reordering logic when n_outputs == 1. With a single output token, the data is trivially sorted and the reordering code is currently doing unnecessary work (resetting and rebuilding output_ids to the same values).

The motivation for this change is improved code clarity and avoiding confusion when debugging. While the performance impact is probably negligible, this unnecessary work happens on every decode call in llama-server when processing batches with single-token outputs.
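
The check described in the commit message can be illustrated with a minimal sketch. This is not the code from llama-context.cpp: `output_state`, `reorder_outputs`, and `sort_passes` are hypothetical names invented for the illustration, and each output row is reduced to a single float. It only shows the idea of returning early when there is at most one output, since a single row is already in order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-batch output bookkeeping (not the real llama_context layout).
struct output_state {
    std::vector<int32_t> out_ids;         // out_ids[i] = batch position that produced output row i
    std::vector<float>   rows;            // one value per output row (stand-in for a logits row)
    int                  sort_passes = 0; // counts how often the sorting work actually ran
};

// Sort the output rows by batch position, skipping the work for 0 or 1 outputs.
void reorder_outputs(output_state & st) {
    const std::size_t n_outputs = st.out_ids.size();
    assert(st.rows.size() == n_outputs);

    if (n_outputs <= 1) {
        return; // a single output is trivially sorted, nothing to rebuild
    }

    st.sort_passes++;

    // Selection sort: at most n_outputs - 1 row swaps.
    for (std::size_t i = 0; i + 1 < n_outputs; ++i) {
        std::size_t j_min = i;
        for (std::size_t j = i + 1; j < n_outputs; ++j) {
            if (st.out_ids[j] < st.out_ids[j_min]) {
                j_min = j;
            }
        }
        if (j_min != i) {
            std::swap(st.out_ids[i], st.out_ids[j_min]);
            std::swap(st.rows[i],    st.rows[j_min]);
        }
    }
}
```

In this sketch, a state with one output leaves `sort_passes` at zero, while a state with several out-of-order outputs is sorted in a single pass. The real change lives in llama-context.cpp, listed below.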
models	models : Added support for RND1 Diffusion Language Model (#17433)	2025-11-24 14:16:56 +08:00
CMakeLists.txt	models : Added support for RND1 Diffusion Language Model (#17433)	2025-11-24 14:16:56 +08:00
llama-adapter.cpp	aLoRA Support (#15327)	2025-09-05 17:32:39 -06:00
llama-adapter.h	aLoRA Support (#15327)	2025-09-05 17:32:39 -06:00
llama-arch.cpp	models : Added support for RND1 Diffusion Language Model (#17433)	2025-11-24 14:16:56 +08:00
llama-arch.h	models : Added support for RND1 Diffusion Language Model (#17433)	2025-11-24 14:16:56 +08:00
llama-batch.cpp	batch : fix consistency checks for the input positions (#16890)	2025-10-31 13:50:33 +02:00
llama-batch.h	llama: store mrope data in KV cell (#16825)	2025-10-29 18:09:18 +01:00
llama-chat.cpp	model : add openPangu-Embedded (#16941)	2025-11-05 10:28:58 +01:00
llama-chat.h	model : add openPangu-Embedded (#16941)	2025-11-05 10:28:58 +01:00
llama-context.cpp	llama : skip output reordering for single token batches (#17466)	2025-11-24 21:06:17 +01:00
llama-context.h	server : support unified cache across slots (#16736)	2025-11-02 18:14:04 +02:00
llama-cparams.cpp	cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)	2025-06-15 10:08:58 +03:00
llama-cparams.h	server : support unified cache across slots (#16736)	2025-11-02 18:14:04 +02:00
llama-grammar.cpp	grammar: fix regression caused by #17381 (#17412)	2025-11-20 18:35:10 +01:00
llama-grammar.h	`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)	2025-03-05 13:05:13 +00:00
llama-graph.cpp	CUDA: fuse rope + set_rows (#16884)	2025-11-13 08:50:01 +08:00
llama-graph.h	graph : support cacheless embeddings with FA and iSWA (#16528)	2025-10-13 22:42:37 +03:00
llama-hparams.cpp	hparams : add n_embd_inp() to support extended embed (#16928)	2025-11-07 19:27:58 +01:00
llama-hparams.h	hparams : add n_embd_inp() to support extended embed (#16928)	2025-11-07 19:27:58 +01:00
llama-impl.cpp	common : more accurate sampling timing (#17382)	2025-11-20 13:40:10 +02:00
llama-impl.h	llama: use FA + max. GPU layers by default (#15434)	2025-08-30 16:32:10 +02:00
llama-io.cpp	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)	2025-03-13 12:35:44 +02:00
llama-io.h	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)	2025-03-13 12:35:44 +02:00
llama-kv-cache-iswa.cpp	kv-cache : pad the cache size to 256 for performance (#17046)	2025-11-07 20:03:25 +02:00
llama-kv-cache-iswa.h	llama: print memory breakdown on exit (#15860)	2025-09-24 16:53:48 +02:00
llama-kv-cache.cpp	model: add support for qwen3vl series (#16780)	2025-10-30 16:19:14 +01:00
llama-kv-cache.h	memory : remove KV cache size padding (#16812)	2025-10-28 20:19:44 +02:00
llama-kv-cells.h	llama: store mrope data in KV cell (#16825)	2025-10-29 18:09:18 +01:00
llama-memory-hybrid.cpp	memory : use sequential equal splits for recurrent modules (#16442)	2025-10-07 08:24:17 +03:00
llama-memory-hybrid.h	llama: print memory breakdown on exit (#15860)	2025-09-24 16:53:48 +02:00
llama-memory-recurrent.cpp	memory: Hybrid context shift (#17009)	2025-11-10 17:14:23 +02:00
llama-memory-recurrent.h	llama: consistent ctx <-> buf order for KV cache (#16746)	2025-10-28 11:23:54 +01:00
llama-memory.cpp	memory : correctly handle failure in apply() (#14438)	2025-06-30 18:03:03 +03:00
llama-memory.h	llama: print memory breakdown on exit (#15860)	2025-09-24 16:53:48 +02:00
llama-mmap.cpp	llama : allow using mmap without PrefetchVirtualMemory, apply GGML_WIN_VER to llama.cpp sources (#14013)	2025-06-05 11:57:42 +02:00
llama-mmap.h	llama-mmap: fix missing include (#11796)	2025-02-10 20:58:18 +02:00
llama-model-loader.cpp	model : Apertus model implementation (#15852)	2025-10-02 20:43:22 +03:00
llama-model-loader.h	model: support GLM 4.5 family of models (#14939)	2025-08-04 20:29:25 +02:00
llama-model-saver.cpp	llama : improve sep token handling (#14272)	2025-06-20 14:04:09 +02:00
llama-model-saver.h	llama/ggml: add LLM training support (#10544)	2025-05-12 14:44:49 +02:00
llama-model.cpp	models : Added support for RND1 Diffusion Language Model (#17433)	2025-11-24 14:16:56 +08:00
llama-model.h	model : add AfmoeForCausalLM support (#16477)	2025-11-14 13:54:10 +01:00
llama-quant.cpp	llama : use std::abs instead of abs (#16853)	2025-10-30 08:30:58 +02:00
llama-quant.h	llama : refactor `src/llama.cpp` (#10902)	2025-01-03 10:18:53 +02:00
llama-sampling.cpp	common : more accurate sampling timing (#17382)	2025-11-20 13:40:10 +02:00
llama-sampling.h	llama : add `llama_vocab`, functions -> methods, naming (#11110)	2025-01-12 11:32:42 +02:00
llama-vocab.cpp	vocab : call reserve() for building plamo-2-translate suffix (#17343)	2025-11-18 18:58:22 +01:00
llama-vocab.h	model : add AfmoeForCausalLM support (#16477)	2025-11-14 13:54:10 +01:00
llama.cpp	llama-quant: add support for mmproj (#16592)	2025-10-15 14:48:08 +02:00
unicode-data.cpp	server : better security control for public deployments (#9776)	2024-10-08 13:27:04 +02:00
unicode-data.h	llama : reduce compile time and binary size (#9712)	2024-10-02 15:49:55 +02:00
unicode.cpp	model : add AfmoeForCausalLM support (#16477)	2025-11-14 13:54:10 +01:00
unicode.h	devops: add s390x & ppc64le CI (#15925)	2025-09-27 02:03:33 +08:00