llama-cpp-turboquant/ggml/src
Jeff Bolz 0090950f67
vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
2025-04-09 07:25:08 +02:00
..
ggml-blas ggml : add support for dynamic loading of backends (#10469) 2024-11-25 15:13:39 +01:00
ggml-cann CANN: fix typo in ggml-cann (#12733) 2025-04-07 19:34:14 +08:00
ggml-cpu llama : fix FA when KV cache is not used (i.e. embeddings) (#12825) 2025-04-08 19:54:51 +03:00
ggml-cuda cuda : add f32 to bf16 copy op (#12806) 2025-04-08 23:21:31 +02:00
ggml-hip HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032) 2025-03-03 22:10:54 +01:00
ggml-kompute llama : add Qwen2VL support + multimodal RoPE (#10361) 2024-12-14 14:43:46 +02:00
ggml-metal llama : fix FA when KV cache is not used (i.e. embeddings) (#12825) 2025-04-08 19:54:51 +03:00
ggml-musa cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394) 2025-03-17 20:25:13 +02:00
ggml-opencl opencl: better identify Adreno GPU (#12760) 2025-04-07 13:22:54 -07:00
ggml-rpc rpc : send hash when tensor data is above some fixed threshold (#12496) 2025-03-28 08:18:04 +02:00
ggml-sycl Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (#12812) 2025-04-08 15:03:21 +08:00
ggml-vulkan vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833) 2025-04-09 07:25:08 +02:00
CMakeLists.txt cmake : fix ccache conflict (#12522) 2025-03-29 11:04:58 +01:00
ggml-alloc.c ggml : upgrade init_tensor API to return a ggml_status (#11854) 2025-02-28 14:41:47 +01:00
ggml-backend-impl.h ggml : upgrade init_tensor API to return a ggml_status (#11854) 2025-02-28 14:41:47 +01:00
ggml-backend-reg.cpp ggml-backend : fix backend search path (#12330) 2025-03-11 14:25:17 +01:00
ggml-backend.cpp ggml : portability fixes for VS 2017 (#12150) 2025-03-04 18:53:26 +02:00
ggml-common.h musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611) 2025-03-30 10:59:38 +02:00
ggml-impl.h ggml : simplify Arm fp16 CPU logic (ggml/1177) 2025-04-07 18:44:17 +03:00
ggml-opt.cpp ggml-opt: fix data corruption (ggml/1022) 2024-11-21 09:22:02 +02:00
ggml-quants.c ggml : portability fixes for VS 2017 (#12150) 2025-03-04 18:53:26 +02:00
ggml-quants.h ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-threading.cpp ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-threading.h remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797) 2024-12-12 19:02:49 +01:00
ggml.c llama : add option to override model tensor buffers (#11397) 2025-04-02 14:52:01 +02:00
gguf.cpp Fix clang warning in gguf_check_reserved_keys (#12686) 2025-04-01 13:12:53 +02:00