llama-cpp-turboquant/ggml/src
hipudding f9bc66c3eb
CANN: Update several operators to support FP16 data format (#16251)
Many Ascend operators compute internally in FP16 precision. If the input
data is in FP32, it must first be cast to FP16 before the computation
and then cast back to FP32 afterwards, which introduces unnecessary
cast operations. Moreover, FP16 computation involves significantly less
work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5B model shows
correct accuracy and about a 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <757486878@qq.com>
2025-10-13 08:52:22 +08:00
Name | Last commit message | Last commit date
ggml-blas | sync : whisper.cpp (ggml/1359) | 2025-09-29 17:43:58 +03:00
ggml-cann | CANN: Update several operators to support FP16 data format (#16251) | 2025-10-13 08:52:22 +08:00
ggml-cpu | ggml : Fix FP16 ELU positive branch (#16519) | 2025-10-12 08:25:37 +03:00
ggml-cuda | CUDA: faster tile FA, add oob checks, more HSs (#16492) | 2025-10-11 20:54:32 +02:00
ggml-hip | CUDA: faster tile FA, add oob checks, more HSs (#16492) | 2025-10-11 20:54:32 +02:00
ggml-metal | metal : add opt_step_adamw and op_sum (#16529) | 2025-10-12 21:43:14 +03:00
ggml-musa | CUDA: faster tile FA, add oob checks, more HSs (#16492) | 2025-10-11 20:54:32 +02:00
ggml-opencl | opencl: support pad_ext (#15888) | 2025-09-30 10:45:45 -07:00
ggml-rpc | rpc : check src buffer when copying tensor (#16421) | 2025-10-04 16:22:45 +03:00
ggml-sycl | [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521) | 2025-10-12 21:53:35 +08:00
ggml-vulkan | vulkan: use a more appropriate amount of threads when generating shaders (#16418) | 2025-10-04 22:04:27 +02:00
ggml-webgpu | ggml webgpu: profiling, CI updates, reworking of command submission (#16452) | 2025-10-07 13:48:56 -07:00
ggml-zdnn | zdnn: refactor codebase + add docs (#16178) | 2025-09-23 14:53:05 +08:00
CMakeLists.txt | cmake : Dont define XOPENSOURCE on AIX (#16481) | 2025-10-10 11:15:46 +03:00
ggml-alloc.c | ggml : fix graph reallocation with multiple chunks (#16396) | 2025-10-03 13:49:08 +02:00
ggml-backend-impl.h | rpc : add support for multiple devices (#16276) | 2025-10-04 12:49:16 +03:00
ggml-backend-reg.cpp | ggml-backend : add root cause in error message if loading backend library fails (#16172) | 2025-09-29 13:17:09 +02:00
ggml-backend.cpp | llama: print memory breakdown on exit (#15860) | 2025-09-24 16:53:48 +02:00
ggml-common.h | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00
ggml-impl.h | model : Apertus model implementation (#15852) | 2025-10-02 20:43:22 +03:00
ggml-opt.cpp | finetune: SGD optimizer, more CLI args (#13873) | 2025-08-14 12:03:57 +02:00
ggml-quants.c | ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928) | 2025-09-23 10:25:20 +02:00
ggml-quants.h | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00
ggml-threading.cpp | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00
ggml-threading.h | remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797) | 2024-12-12 19:02:49 +01:00
ggml.c | ggml webgpu: add support for soft_max, optimize rms_norm (#16357) | 2025-10-02 11:00:31 -07:00
ggml.cpp | ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) | 2025-06-01 13:43:57 +03:00
gguf.cpp | gguf: gguf_writer refactor (#15691) | 2025-09-05 11:34:28 +02:00