llama-cpp-turboquant/ggml/src
Max Krasnyansky 63d2fc46e1
Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)
* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>

* hexagon: fix format checker errors

* hexagon: update readme and cmake presets

* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions

* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input

* hexagon: move ADB helper scripts into scripts/snapdragon/adb

* hexagon: replace all f/printfs with GGML_LOG_...

* readme: add hexagon to the list supported backends

* hexagon: stack malmuts with quantized inputs only

* hexagon: add TODO for fixing issues in hexagon_graph_optimize

* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC

* scripts: fix lint errors

* scripts: update qdc pytest script to make linter happy

* hexagon: add reduce sum in fp32

* hexagon: reduce number of vector stores in matmul output

* hexagon: remove the need for vdelta in reduce-multiply-x8

* hexagon: consistent use of reduce_sum_fp32 for row_sums

* hexagon: some more matmul optimizations and comments

Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models).
We've handled those cases already but at a higher overhead.

* hexagon: update cmake presets

* hexagon: add OPMASK support for run-bench.sh wrapper

* hexagon: update to use GGML_BACKEND_API

* hexagon: remove unused logic for setting tensor flags for the views

* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

Same asserts as the CPU backend.

* hexagon: use cpy_tensor slow path for non-host buffers

* hexagon: error checks in the buffer allocator

* cmake: move include(extProj) under ggml-hexagon

* hexagon: don't forget to delete the backend on free

* hexagon: set/get_tensor size assert apply only to quantized tensors

* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need a bit more finer log levels.

* docs: typos in hexagon developer docs (libggm-...)

* hexagon: overhaul error handling in the session/device allocation

this should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors

* hexagon: remove unused time_usec function

* hexagon: don't forget to release buffer contexts

* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)

* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2025-10-22 13:47:09 -07:00
..
ggml-blas sync : whisper.cpp (ggml/1359) 2025-09-29 17:43:58 +03:00
ggml-cann CANN: format code using .clang-format (#15863) 2025-10-16 16:41:11 +08:00
ggml-cpu Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v…" (#16723) 2025-10-22 20:20:55 +02:00
ggml-cuda CUDA: fix bug in topk-moe softmax (#16711) 2025-10-22 12:33:08 +08:00
ggml-hexagon Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
ggml-hip HIP: fix GPU_TARGETS (#16642) 2025-10-18 14:47:32 +02:00
ggml-metal metal : add CONV_TRANSPOSE_2D (#16542) 2025-10-17 09:33:58 +03:00
ggml-musa CUDA: faster tile FA, add oob checks, more HSs (#16492) 2025-10-11 20:54:32 +02:00
ggml-opencl opencl: fix warnings and clean up profiling (#16688) 2025-10-20 22:26:17 -07:00
ggml-rpc rpc : report actual free memory (#16616) 2025-10-17 18:02:52 +03:00
ggml-sycl sycl : add PAD_REFLECT_D1 operator support (#16145) 2025-10-21 00:21:12 +02:00
ggml-vulkan vulkan: Handle FA with all -inf mask values (#16447) 2025-10-20 22:16:08 -05:00
ggml-webgpu ggml webgpu: profiling, CI updates, reworking of command submission (#16452) 2025-10-07 13:48:56 -07:00
ggml-zdnn zdnn: refactor codebase + add docs (#16178) 2025-09-23 14:53:05 +08:00
CMakeLists.txt Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
ggml-alloc.c ggml-alloc : fix leak when reusing a tensor with a larger size (#16679) 2025-10-20 14:53:50 +02:00
ggml-backend-impl.h rpc : add support for multiple devices (#16276) 2025-10-04 12:49:16 +03:00
ggml-backend-reg.cpp Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) 2025-10-22 13:47:09 -07:00
ggml-backend.cpp llama: print memory breakdown on exit (#15860) 2025-09-24 16:53:48 +02:00
ggml-common.h llama : add gpt-oss (#15091) 2025-08-05 22:10:36 +03:00
ggml-impl.h ggml: add ggml_can_fuse_subgraph (#16662) 2025-10-21 16:43:14 +08:00
ggml-opt.cpp finetune: SGD optimizer, more CLI args (#13873) 2025-08-14 12:03:57 +02:00
ggml-quants.c ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928) 2025-09-23 10:25:20 +02:00
ggml-quants.h llama : add gpt-oss (#15091) 2025-08-05 22:10:36 +03:00
ggml-threading.cpp ggml : build backends as libraries (#10256) 2024-11-14 18:04:35 +01:00
ggml-threading.h remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797) 2024-12-12 19:02:49 +01:00
ggml.c ggml: add ggml_can_fuse_subgraph (#16662) 2025-10-21 16:43:14 +08:00
ggml.cpp ggml : Print backtrace on uncaught C++ exceptions (ggml/1232) 2025-06-01 13:43:57 +03:00
gguf.cpp gguf: gguf_writer refactor (#15691) 2025-09-05 11:34:28 +02:00