llama-cpp-turboquant

Jeff Bolz 0090950f67 vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)

q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.

2025-04-09 07:25:08 +02:00
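The loop restructuring described in the commit message above can be illustrated with a small CPU-side sketch. This is not the shader code from mul_mm_cm2.comp and does not use the real q4_k/q5_k layout: `ToyBlock`, `decode_scale`, `dot_redundant` and `dot_staged` are hypothetical names, the block format is deliberately simplified, and a local array stands in for GPU shared memory. It only shows the general idea of decoding the per-block scale data once in an outer loop over whole blocks instead of re-decoding it for every element.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified block format (NOT the real q4_k layout).
constexpr int kSubBlocks    = 8;
constexpr int kSubBlockSize = 32;
constexpr int kBlockSize    = kSubBlocks * kSubBlockSize;  // 256 values per block

struct ToyBlock {
    float d;                                 // per-block scale factor
    std::array<uint8_t, kSubBlocks> scales;  // 6-bit sub-block scales (low bits)
    std::array<uint8_t, kBlockSize> q;       // 4-bit quants, one per byte for clarity
};

// Decode the scale of one sub-block; stands in for the "load and decode" step.
inline float decode_scale(const ToyBlock& b, int sub) {
    return b.d * static_cast<float>(b.scales[sub] & 63);
}

// Baseline: a flat loop over elements that re-decodes the scale every iteration.
float dot_redundant(const std::vector<ToyBlock>& blocks, const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < blocks.size() * kBlockSize; ++i) {
        const ToyBlock& b = blocks[i / kBlockSize];
        const int e = static_cast<int>(i % kBlockSize);
        acc += decode_scale(b, e / kSubBlockSize) * static_cast<float>(b.q[e] & 0xF) * x[i];
    }
    return acc;
}

// Restructured: the outer loop walks whole blocks, the scales are decoded once
// into a staging array (the stand-in for shared memory), and the inner loops
// only read the staged values.
float dot_staged(const std::vector<ToyBlock>& blocks, const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t bi = 0; bi < blocks.size(); ++bi) {
        const ToyBlock& b = blocks[bi];
        float staged[kSubBlocks];
        for (int s = 0; s < kSubBlocks; ++s) {
            staged[s] = decode_scale(b, s);  // one decode per sub-block per block
        }
        for (int s = 0; s < kSubBlocks; ++s) {
            for (int j = 0; j < kSubBlockSize; ++j) {
                const int e = s * kSubBlockSize + j;
                acc += staged[s] * static_cast<float>(b.q[e] & 0xF) * x[bi * kBlockSize + e];
            }
        }
    }
    return acc;
}
```

Both functions visit the elements in the same order and compute the same products, so they return the same value; the only difference is how often the scale is decoded. On a GPU the staging step would additionally need a barrier before the inner loop reads the shared array, which this sketch omits.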
ggml-blas	ggml : add support for dynamic loading of backends (#10469)	2024-11-25 15:13:39 +01:00
ggml-cann	CANN: fix typo in ggml-cann (#12733)	2025-04-07 19:34:14 +08:00
ggml-cpu	llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)	2025-04-08 19:54:51 +03:00
ggml-cuda	cuda : add f32 to bf16 copy op (#12806)	2025-04-08 23:21:31 +02:00
ggml-hip	HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032)	2025-03-03 22:10:54 +01:00
ggml-kompute	llama : add Qwen2VL support + multimodal RoPE (#10361)	2024-12-14 14:43:46 +02:00
ggml-metal	llama : fix FA when KV cache is not used (i.e. embeddings) (#12825)	2025-04-08 19:54:51 +03:00
ggml-musa	cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394)	2025-03-17 20:25:13 +02:00
ggml-opencl	opencl: better identify Adreno GPU (#12760)	2025-04-07 13:22:54 -07:00
ggml-rpc	rpc : send hash when tensor data is above some fixed threshold (#12496)	2025-03-28 08:18:04 +02:00
ggml-sycl	Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (#12812)	2025-04-08 15:03:21 +08:00
ggml-vulkan	vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)	2025-04-09 07:25:08 +02:00
CMakeLists.txt	cmake : fix ccache conflict (#12522)	2025-03-29 11:04:58 +01:00
ggml-alloc.c	ggml : upgrade init_tensor API to return a ggml_status (#11854)	2025-02-28 14:41:47 +01:00
ggml-backend-impl.h	ggml : upgrade init_tensor API to return a ggml_status (#11854)	2025-02-28 14:41:47 +01:00
ggml-backend-reg.cpp	ggml-backend : fix backend search path (#12330)	2025-03-11 14:25:17 +01:00
ggml-backend.cpp	ggml : portability fixes for VS 2017 (#12150)	2025-03-04 18:53:26 +02:00
ggml-common.h	musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (#12611)	2025-03-30 10:59:38 +02:00
ggml-impl.h	ggml : simplify Arm fp16 CPU logic (ggml/1177)	2025-04-07 18:44:17 +03:00
ggml-opt.cpp	ggml-opt: fix data corruption (ggml/1022)	2024-11-21 09:22:02 +02:00
ggml-quants.c	ggml : portability fixes for VS 2017 (#12150)	2025-03-04 18:53:26 +02:00
ggml-quants.h	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-threading.cpp	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-threading.h	remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797)	2024-12-12 19:02:49 +01:00
ggml.c	llama : add option to override model tensor buffers (#11397)	2025-04-02 14:52:01 +02:00
gguf.cpp	Fix clang warning in gguf_check_reserved_keys (#12686)	2025-04-01 13:12:53 +02:00