* Update build doc
* Add cgraph tensor output name to OV op name
* Update OpenVINO build instructions
* Add initial NPU support
* Draft NPU support version 2: prefill + kvcache
* NPU support version 2: prefill + kvcache
* Change due to ggml cgraph changes, not correct yet
* Change due to ggml cgraph changes, llama-3.2 CPU works
* Add AMD64 to CMakeLists
* Change due to ggml cgraph changes, all devices work
* Refactor: clean up, fix warning
* Update clang-format
* Stateful transformation for CPU GPU
* Add SwiGLU
* Fuse to SDPA
* Replace Concat with Broadcast in MulMat for GQA
* Pull out indices creation for kv cache update
* Refactor: remove past_token_len from extra_inputs
* Fix Phi3 SwiGLU and SoftMax
* Pull out sin cos from rope
* Reduce memory: free ov weights node after graph conversion
* Fix CPY due to cgraph change
* Added OpenVINO CI/CD. Updated docs
* Fix llama-cli
* Fix Phi3 ROPE; Add test-backend-ops
* Fix NPU
* Fix llama-bench; Clang-format
* Fix llama-perplexity
* Temp changes for mark decomp
* Matmul in fp32
* Mulmat input conversion fix
* Mulmat type conversion update
* Add mark decomp pass
* Revert changes in fuse_to_sdpa
* Update build.md
* Fix test-backend-ops
* Skip test-thread-safety; Run ctest only in ci/run.sh
* Use CiD for NPU
* Optimize tensor conversion, improve TTFT
* Support op SET_ROWS
* Fix NPU
* Remove CPY
* Fix test-backend-ops
* Minor updates for raising PR
* Perf: RMS fused to OV internal RMS op
* Fix after rebasing
  - Layout of cache k and cache v are unified: [seq, n_head, head_size]
  - Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
  - Skip test-backend-ops due to flash attn test crash
  - Add mutex around graph conversion to avoid test-thread-safety fail in the future
  - Update NPU config
  - Update GPU config to disable SDPA opt to make phi-3 run
* Change openvino device_type to GPU; Enable flash_attn
* Update supports_buft and supports_op for quantized models
* Add quant weight conversion functions from genai gguf reader
* Quant models run with accuracy issue
* Fix accuracy: disable cpu_repack
* Fix CI; Disable test-backend-ops
* Fix Q4_1
* Fix test-backend-ops: treat quantized tensors as weights
* Add NPU Q4_0 support
* NPU perf: eliminate zp
* Dequantize q4_1 q4_k q6_k for NPU
* Add custom quant types: q8_1_c, q4_0_128
* Set m_is_static=false as default in decoder
* Simplify translation of get_rows
* Fix after rebasing
* Improve debug util; Eliminate no-op Reshape-Reshape
* Style: make get_types_to_requant a function
* Support BF16 model
* Fix NPU compile
* Workaround for NPU 1st-token accuracy issue
* Apply EliminateZP only for NPU
* Add GeGLU
* Fix Hunyuan
* Support iSWA
* Fix NPU accuracy
* Fix ROPE accuracy when freq_scale != 1
* Minor: do not add attention_size_swa for non-SWA model
* Minor refactor
* Add Q5_K to support phi-3-q4_k_m
* Requantize Q6_K (gs16) to gs32 on GPU
* Fix after rebasing
* Always apply Eliminate_ZP to fix GPU compile issue on some platforms
* KV cache fusion support
* Add env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION
* Fix for Phi3
* Fix llama-cli (need to run with --no-warmup)
* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
* Fix after rebasing
* Fix llama-3-8b and phi3-mini q4_0 NPU
* Update to OV-2025.3 and CMakeLists.txt
* Add OV CI cache
* Apply CISC review and update CI to OV 2025.3
* Update CI to run OV dep install before build
* Update OV dockerfile to use OV 2025.3 and update build docs
* Style: use switch in supports_ops
* Style: middle ptr and ref align, omit optional struct keyword
* NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
* Clean placeholders in ggml-openvino.cpp
* NPU unify PD (handled internally)
* Change graph to 4D, support multiple sequences
* Fix llama-bench
* Fix NPU
* Update ggml-decoder.cpp

  Hitting an error while compiling on Windows: error C3861: 'unsetenv': identifier not found.

  Reason: unsetenv() is a POSIX function; it doesn't exist on Windows, so Visual Studio (MSVC) won't recognize it.

  Proposed fix: use _putenv_s() (the Windows equivalent). It is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment. This keeps cross-platform compatibility. (A sketch of this approach follows after this list.)
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Remove the second decoder for node, moving the function into the model decoder
* Fix error for naive
* NPU prefill chunking
* NPU fix llama-bench
* Fallback naive run with accuracy issue
* NPU: support llama-perplexity -b 512 --no-warmup
* Refactor: split ov_graph_compute for dynamic and static
* Remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
* Minor update due to OV 2025.4
* Remove unused API GgmlOvDecoder::get_output_names()
* Remove unused API get_output_shape(const std::string & name)
* Modified API GgmlOvDecoder::get_output_type(const std::string & name)
* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
* Removed API get_output_ggml_tensor(const std::string & name)
* Removed API m_outputs
* Removed m_output_names
* Removed API GgmlOvDecoder::get_input_names()
* Removed API GgmlOvDecoder::get_input_stride(const std::string & name)
* Removed API get_input_type
* Removed API get_input_type
* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
* Fix error for decoder cache
* Reuse cached decoder
* GPU: remove Q6_K requantization
* NPU: fix wrong model output shape
* NPU: fix q4 perf regression
* Remove unused variable nodes
* Fix decoder can_reuse for llama-bench
* Update build.md for Windows
* Backend buffer: allocate on host
* Use shared_buffer for GPU NPU; Refactor
* Add ov_backend_host_buffer; Use cached remote context
* Put kvcache on GPU
* Use ggml_aligned_malloc
* Only use remote tensor for kvcache
* Only use remote tensor for kvcache for GPU
* Fix: use remote tensor from singleton
* Update build.md to include OpenCL
* NPU: always requant to q4_0_128
* Optimize symmetric quant weight extraction: use single zp
* Use Q8_0_C in token embd, lm_head, and for 5- and 6-bit quants
* Update build.md
* Support -ctk f32
* Initial stateful graph support
* Update ggml/src/ggml-openvino/ggml-decoder.cpp

  Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Code cleanup
* NPU perf fix
* Requant to f16 for Q6 embed on NPU
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
* Create OPENVINO.md in llama.cpp backend docs
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update build.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* kq_mask naming fix
* Syntax correction for workflows build file
* Change ov backend buffer is_host to false
* Fix llama-bench -p -n where p<=256
* Fix --direct-io 0
* Don't put kvcache on GPU in stateful mode
* Remove hardcoded names
* Fix stateful shapes
* Simplification for stateful and update output shape processing
* Remove hardcoded names
* Avoid re-compilation in llama-bench
* Extract zp directly instead of bias
* Refactor weight tensor processing
* create_weight_node accepts non-OV backend buffer
* Remove changes in llama-graph.cpp
* Stateful masking fix (#38)

  Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
* Fix test-backend-ops crash: glu, get_rows, scale, rms_norm, add
* Hardcoded name handling for rope_freqs.weight
* Suppress logging and add error handling to allow test-backend-ops to complete
* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
* Use bias instead of zp in test-backend-ops
* Update OV in CI, Add OV CI Tests in GH Actions
* Temp fix for multithreading bug
* Update OV CI, fix review suggestions
* Fix editorconfig-checker, update docs
* Fix tabs to spaces for editorconfig-checker
* Fix editorconfig-checker
* Update docs
* Updated model links to be GGUF model links
* Remove GGML_CPU_REPACK=OFF
* Skip permuted ADD and MUL
* Removed static variables from utils.cpp
* Removed initialization of a non-existing variable
* Remove unused structs
* Fix test-backend-ops for OV GPU
* Unify API calling
* Update utils.cpp
* When the dim is dynamic, throw an error; it needs to be static first
* Add interface compute_model_outputs(), which gets the model outputs by computing node use counts and status in the cgraph, avoiding the need for a flag
* No need to return
* Fix test-backend-ops for OV GPU LNL
* Fix test-thread-safety
* Use the shape from the infer request's output tensor to avoid issues
* Fix dynamic output shape issue
* Fix issue with unused nodes in tests
* Remove unused lock
* Add comment
* Update openvino docs
* Update to OV release version 2026.0
* Add CI ov-gpu self-hosted runner
* Fix editorconfig
* Fix perplexity
* Rewrite the model inputs finding mechanism (#54)
  - Rewrite the model input finding logic
  - Put stateful shape handling in get_input_shape
  - Put the iteration logic in a function
* Added ggml-ci-intel-openvino-gpu and doc update
* .hpp files converted to .h
* Fix ggml-ci-x64-intel-openvino-gpu
* Fix for stateful execution bug in llama-bench
* Minor updates after stateful llama-bench fix
* Update ggml/src/ggml-openvino/utils.cpp

  Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Remove multiple get_shape calls
* Bring back mutex into compute
* Fix VIEW op, which slices the input node
* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
* Temp fix for test requant errors
* Update OV ggml-ci to low-perf
* ci : temporarily disable "test-llama-archs"
* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
* docs : update url
* Fix OV link in docker and update docs

---------

Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
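The MSVC fix described in the "Update ggml-decoder.cpp" item above can be sketched as a small portability wrapper. This is an illustration of the approach only, not the actual code in ggml-decoder.cpp, and the helper name ggml_ov_unsetenv is hypothetical:

    #include <stdlib.h>

    // Hypothetical helper sketching the fix: unsetenv() is POSIX-only, while on
    // Windows _putenv_s(name, "") removes the variable from the process
    // environment, which MSVC supports.
    static void ggml_ov_unsetenv(const char * name) {
    #ifdef _WIN32
        _putenv_s(name, "");
    #else
        unsetenv(name);
    #endif
    }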
#include "ggml-backend-impl.h"
|
|
#include "ggml-backend.h"
|
|
#include "ggml-backend-dl.h"
|
|
#include "ggml-impl.h"
|
|
#include <algorithm>
|
|
#include <cstring>
|
|
#include <filesystem>
|
|
#include <memory>
|
|
#include <string>
|
|
#include <type_traits>
|
|
#include <vector>
|
|
#include <cctype>
|
|
|
|
#ifdef _WIN32
|
|
# define WIN32_LEAN_AND_MEAN
|
|
# ifndef NOMINMAX
|
|
# define NOMINMAX
|
|
# endif
|
|
# include <windows.h>
|
|
#elif defined(__APPLE__)
|
|
# include <mach-o/dyld.h>
|
|
# include <dlfcn.h>
|
|
#else
|
|
# include <dlfcn.h>
|
|
# include <unistd.h>
|
|
#endif
|
|
|
|
// Backend registry
|
|
#ifdef GGML_USE_CPU
|
|
#include "ggml-cpu.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_CUDA
|
|
#include "ggml-cuda.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_METAL
|
|
#include "ggml-metal.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_SYCL
|
|
#include "ggml-sycl.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_VULKAN
|
|
#include "ggml-vulkan.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_WEBGPU
|
|
#include "ggml-webgpu.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_ZDNN
|
|
#include "ggml-zdnn.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_OPENCL
|
|
#include "ggml-opencl.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_HEXAGON
|
|
#include "ggml-hexagon.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_BLAS
|
|
#include "ggml-blas.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_RPC
|
|
#include "ggml-rpc.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_VIRTGPU_FRONTEND
|
|
#include "ggml-virtgpu.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_CANN
|
|
#include "ggml-cann.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_ZENDNN
|
|
#include "ggml-zendnn.h"
|
|
#endif
|
|
|
|
#ifdef GGML_USE_OPENVINO
|
|
#include "ggml-openvino.h"
|
|
#endif
|
|
|
|
namespace fs = std::filesystem;
|
|
|
|
static std::string path_str(const fs::path & path) {
|
|
try {
|
|
#if defined(__cpp_lib_char8_t)
|
|
// C++20 and later: u8string() returns std::u8string
|
|
const std::u8string u8str = path.u8string();
|
|
return std::string(reinterpret_cast<const char *>(u8str.data()), u8str.size());
|
|
#else
|
|
// C++17: u8string() returns std::string
|
|
return path.u8string();
|
|
#endif
|
|
} catch (...) {
|
|
return std::string();
|
|
}
|
|
}
|
|
|
|
struct ggml_backend_reg_entry {
|
|
ggml_backend_reg_t reg;
|
|
dl_handle_ptr handle;
|
|
};
|
|
|
|
struct ggml_backend_registry {
|
|
std::vector<ggml_backend_reg_entry> backends;
|
|
std::vector<ggml_backend_dev_t> devices;
|
|
|
|
ggml_backend_registry() {
|
|
#ifdef GGML_USE_CUDA
|
|
register_backend(ggml_backend_cuda_reg());
|
|
#endif
|
|
#ifdef GGML_USE_METAL
|
|
register_backend(ggml_backend_metal_reg());
|
|
#endif
|
|
#ifdef GGML_USE_SYCL
|
|
register_backend(ggml_backend_sycl_reg());
|
|
#endif
|
|
#ifdef GGML_USE_VULKAN
|
|
// Add runtime disable check
|
|
if (getenv("GGML_DISABLE_VULKAN") == nullptr) {
|
|
register_backend(ggml_backend_vk_reg());
|
|
} else {
|
|
GGML_LOG_DEBUG("Vulkan backend disabled by GGML_DISABLE_VULKAN environment variable\n");
|
|
}
|
|
#endif
|
|
#ifdef GGML_USE_WEBGPU
|
|
register_backend(ggml_backend_webgpu_reg());
|
|
#endif
|
|
#ifdef GGML_USE_ZDNN
|
|
register_backend(ggml_backend_zdnn_reg());
|
|
#endif
|
|
#ifdef GGML_USE_VIRTGPU_FRONTEND
|
|
register_backend(ggml_backend_virtgpu_reg());
|
|
#endif
|
|
|
|
#ifdef GGML_USE_OPENCL
|
|
register_backend(ggml_backend_opencl_reg());
|
|
#endif
|
|
#ifdef GGML_USE_ZENDNN
|
|
register_backend(ggml_backend_zendnn_reg());
|
|
#endif
|
|
#ifdef GGML_USE_HEXAGON
|
|
register_backend(ggml_backend_hexagon_reg());
|
|
#endif
|
|
#ifdef GGML_USE_CANN
|
|
register_backend(ggml_backend_cann_reg());
|
|
#endif
|
|
#ifdef GGML_USE_BLAS
|
|
register_backend(ggml_backend_blas_reg());
|
|
#endif
|
|
#ifdef GGML_USE_RPC
|
|
register_backend(ggml_backend_rpc_reg());
|
|
#endif
|
|
#ifdef GGML_USE_OPENVINO
|
|
register_backend(ggml_backend_openvino_reg());
|
|
#endif
|
|
#ifdef GGML_USE_CPU
|
|
register_backend(ggml_backend_cpu_reg());
|
|
#endif
|
|
}
|
|
|
|
~ggml_backend_registry() {
|
|
// FIXME: backends cannot be safely unloaded without a function to destroy all the backend resources,
|
|
// since backend threads may still be running and accessing resources from the dynamic library
|
|
for (auto & entry : backends) {
|
|
if (entry.handle) {
|
|
entry.handle.release(); // NOLINT
|
|
}
|
|
}
|
|
}
|
|
|
|
void register_backend(ggml_backend_reg_t reg, dl_handle_ptr handle = nullptr) {
|
|
if (!reg) {
|
|
return;
|
|
}
|
|
|
|
#ifndef NDEBUG
|
|
GGML_LOG_DEBUG("%s: registered backend %s (%zu devices)\n",
|
|
__func__, ggml_backend_reg_name(reg), ggml_backend_reg_dev_count(reg));
|
|
#endif
|
|
backends.push_back({ reg, std::move(handle) });
|
|
for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); i++) {
|
|
register_device(ggml_backend_reg_dev_get(reg, i));
|
|
}
|
|
}
|
|
|
|
void register_device(ggml_backend_dev_t device) {
|
|
#ifndef NDEBUG
|
|
GGML_LOG_DEBUG("%s: registered device %s (%s)\n", __func__, ggml_backend_dev_name(device), ggml_backend_dev_description(device));
|
|
#endif
|
|
devices.push_back(device);
|
|
}
|
|
|
|
ggml_backend_reg_t load_backend(const fs::path & path, bool silent) {
|
|
dl_handle_ptr handle { dl_load_library(path) };
|
|
if (!handle) {
|
|
if (!silent) {
|
|
GGML_LOG_ERROR("%s: failed to load %s: %s\n", __func__, path_str(path).c_str(), dl_error());
|
|
}
|
|
return nullptr;
|
|
}
|
|
|
|
auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
|
|
if (score_fn && score_fn() == 0) {
|
|
if (!silent) {
|
|
GGML_LOG_INFO("%s: backend %s is not supported on this system\n", __func__, path_str(path).c_str());
|
|
}
|
|
return nullptr;
|
|
}
|
|
|
|
auto backend_init_fn = (ggml_backend_init_t) dl_get_sym(handle.get(), "ggml_backend_init");
|
|
if (!backend_init_fn) {
|
|
if (!silent) {
|
|
GGML_LOG_ERROR("%s: failed to find ggml_backend_init in %s\n", __func__, path_str(path).c_str());
|
|
}
|
|
return nullptr;
|
|
}
|
|
|
|
ggml_backend_reg_t reg = backend_init_fn();
|
|
if (!reg || reg->api_version != GGML_BACKEND_API_VERSION) {
|
|
if (!silent) {
|
|
if (!reg) {
|
|
GGML_LOG_ERROR("%s: failed to initialize backend from %s: ggml_backend_init returned NULL\n",
|
|
__func__, path_str(path).c_str());
|
|
} else {
|
|
GGML_LOG_ERROR("%s: failed to initialize backend from %s: incompatible API version (backend: %d, current: %d)\n",
|
|
__func__, path_str(path).c_str(), reg->api_version, GGML_BACKEND_API_VERSION);
|
|
}
|
|
}
|
|
return nullptr;
|
|
}
|
|
|
|
GGML_LOG_INFO("%s: loaded %s backend from %s\n", __func__, ggml_backend_reg_name(reg), path_str(path).c_str());
|
|
|
|
register_backend(reg, std::move(handle));
|
|
|
|
return reg;
|
|
}
|
|
|
|
void unload_backend(ggml_backend_reg_t reg, bool silent) {
|
|
auto it = std::find_if(backends.begin(), backends.end(),
|
|
[reg](const ggml_backend_reg_entry & entry) { return entry.reg == reg; });
|
|
|
|
if (it == backends.end()) {
|
|
if (!silent) {
|
|
GGML_LOG_ERROR("%s: backend not found\n", __func__);
|
|
}
|
|
return;
|
|
}
|
|
|
|
if (!silent) {
|
|
GGML_LOG_DEBUG("%s: unloading %s backend\n", __func__, ggml_backend_reg_name(reg));
|
|
}
|
|
|
|
// remove devices
|
|
devices.erase(
|
|
std::remove_if(devices.begin(), devices.end(),
|
|
[reg](ggml_backend_dev_t dev) { return ggml_backend_dev_backend_reg(dev) == reg; }),
|
|
devices.end());
|
|
|
|
// remove backend
|
|
backends.erase(it);
|
|
}
|
|
};
|
|
|
|
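// Descriptive note (added): the registry below is a lazily-constructed
// function-local static (a Meyers singleton), built on first use and shared
// by every caller of get_reg().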
static ggml_backend_registry & get_reg() {
    static ggml_backend_registry reg;
    return reg;
}

// Internal API
void ggml_backend_register(ggml_backend_reg_t reg) {
    get_reg().register_backend(reg);
}

void ggml_backend_device_register(ggml_backend_dev_t device) {
    get_reg().register_device(device);
}

// Backend (reg) enumeration
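// Descriptive note (added): striequals is a case-insensitive ASCII comparison
// used to match backend and device names below.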
static bool striequals(const char * a, const char * b) {
    for (; *a && *b; a++, b++) {
        if (std::tolower(*a) != std::tolower(*b)) {
            return false;
        }
    }
    return *a == *b;
}

size_t ggml_backend_reg_count() {
    return get_reg().backends.size();
}

ggml_backend_reg_t ggml_backend_reg_get(size_t index) {
    GGML_ASSERT(index < ggml_backend_reg_count());
    return get_reg().backends[index].reg;
}

ggml_backend_reg_t ggml_backend_reg_by_name(const char * name) {
    for (size_t i = 0; i < ggml_backend_reg_count(); i++) {
        ggml_backend_reg_t reg = ggml_backend_reg_get(i);
        if (striequals(ggml_backend_reg_name(reg), name)) {
            return reg;
        }
    }
    return nullptr;
}

// Device enumeration
size_t ggml_backend_dev_count() {
    return get_reg().devices.size();
}

ggml_backend_dev_t ggml_backend_dev_get(size_t index) {
    GGML_ASSERT(index < ggml_backend_dev_count());
    return get_reg().devices[index];
}

ggml_backend_dev_t ggml_backend_dev_by_name(const char * name) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        if (striequals(ggml_backend_dev_name(dev), name)) {
            return dev;
        }
    }
    return nullptr;
}

ggml_backend_dev_t ggml_backend_dev_by_type(enum ggml_backend_dev_type type) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        if (ggml_backend_dev_type(dev) == type) {
            return dev;
        }
    }
    return nullptr;
}

// Convenience functions
ggml_backend_t ggml_backend_init_by_name(const char * name, const char * params) {
    ggml_backend_dev_t dev = ggml_backend_dev_by_name(name);
    if (!dev) {
        return nullptr;
    }
    return ggml_backend_dev_init(dev, params);
}

ggml_backend_t ggml_backend_init_by_type(enum ggml_backend_dev_type type, const char * params) {
    ggml_backend_dev_t dev = ggml_backend_dev_by_type(type);
    if (!dev) {
        return nullptr;
    }
    return ggml_backend_dev_init(dev, params);
}

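// Descriptive note (added): preference order is a discrete GPU first, then an
// integrated GPU, then the CPU.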
ggml_backend_t ggml_backend_init_best(void) {
    ggml_backend_dev_t dev = ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_GPU);
    dev = dev ? dev : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_IGPU);
    dev = dev ? dev : ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU);
    if (!dev) {
        return nullptr;
    }
    return ggml_backend_dev_init(dev, nullptr);
}

// Dynamic loading
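// Descriptive note (added): ggml_backend_load() loads a backend from an
// explicit path to its shared library; ggml_backend_unload() removes it and
// its devices from the registry.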
ggml_backend_reg_t ggml_backend_load(const char * path) {
    return get_reg().load_backend(path, false);
}

void ggml_backend_unload(ggml_backend_reg_t reg) {
    get_reg().unload_backend(reg, true);
}

static fs::path get_executable_path() {
#if defined(__APPLE__)
    // get executable path
    std::vector<char> path;
    uint32_t size;
    while (true) {
        size = path.size();
        if (_NSGetExecutablePath(path.data(), &size) == 0) {
            break;
        }
        path.resize(size);
    }
    std::string base_path(path.data(), size);
    // remove executable name
    auto last_slash = base_path.find_last_of('/');
    if (last_slash != std::string::npos) {
        base_path = base_path.substr(0, last_slash);
    }
    return base_path + "/";
#elif defined(__linux__) || defined(__FreeBSD__)
    std::string base_path = ".";
    std::vector<char> path(1024);
    while (true) {
        // get executable path
#    if defined(__linux__)
        ssize_t len = readlink("/proc/self/exe", path.data(), path.size());
#    elif defined(__FreeBSD__)
        ssize_t len = readlink("/proc/curproc/file", path.data(), path.size());
#    endif
        if (len == -1) {
            break;
        }
        if (len < (ssize_t) path.size()) {
            base_path = std::string(path.data(), len);
            // remove executable name
            auto last_slash = base_path.find_last_of('/');
            if (last_slash != std::string::npos) {
                base_path = base_path.substr(0, last_slash);
            }
            break;
        }
        path.resize(path.size() * 2);
    }

    return base_path + "/";
#elif defined(_WIN32)
    std::vector<wchar_t> path(MAX_PATH);
    DWORD len = GetModuleFileNameW(NULL, path.data(), path.size());
    if (len == 0) {
        return {};
    }
    std::wstring base_path(path.data(), len);
    // remove executable name
    auto last_slash = base_path.find_last_of('\\');
    if (last_slash != std::string::npos) {
        base_path = base_path.substr(0, last_slash);
    }
    return base_path + L"\\";
#else
    return {};
#endif
}

static fs::path backend_filename_prefix() {
#ifdef _WIN32
    return fs::u8path("ggml-");
#else
    return fs::u8path("libggml-");
#endif
}

static fs::path backend_filename_extension() {
#ifdef _WIN32
    return fs::u8path(".dll");
#else
    return fs::u8path(".so");
#endif
}

static ggml_backend_reg_t ggml_backend_load_best(const char * name, bool silent, const char * user_search_path) {
    // enumerate all the files that match [lib]ggml-name-*.[so|dll] in the search paths
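    // Descriptive note (added): e.g. for name = "cuda" this matches variants
    // such as libggml-cuda-*.so (hypothetical filenames); the candidate whose
    // ggml_backend_score() returns the highest value is the one loaded.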
    const fs::path name_path      = fs::u8path(name);
    const fs::path file_prefix    = backend_filename_prefix().native() + name_path.native() + fs::u8path("-").native();
    const fs::path file_extension = backend_filename_extension();

    std::vector<fs::path> search_paths;
    if (user_search_path == nullptr) {
#ifdef GGML_BACKEND_DIR
        search_paths.push_back(fs::u8path(GGML_BACKEND_DIR));
#endif
        // default search paths: executable directory, current directory
        search_paths.push_back(get_executable_path());
        search_paths.push_back(fs::current_path());
    } else {
        search_paths.push_back(fs::u8path(user_search_path));
    }

    int best_score = 0;
    fs::path best_path;
    std::error_code ec;

    for (const auto & search_path : search_paths) {
        if (!fs::exists(search_path, ec)) {
            if (ec) {
                GGML_LOG_DEBUG("%s: posix_stat(%s) failure, error-message: %s\n", __func__, path_str(search_path).c_str(), ec.message().c_str());
            } else {
                GGML_LOG_DEBUG("%s: search path %s does not exist\n", __func__, path_str(search_path).c_str());
            }
            continue;
        }
        fs::directory_iterator dir_it(search_path, fs::directory_options::skip_permission_denied);
        for (const auto & entry : dir_it) {
            if (entry.is_regular_file(ec)) {
                auto filename = entry.path().filename();
                auto ext = entry.path().extension();
                if (filename.native().find(file_prefix) == 0 && ext == file_extension) {
                    dl_handle_ptr handle { dl_load_library(entry) };
                    if (!handle && !silent) {
                        GGML_LOG_ERROR("%s: failed to load %s: %s\n", __func__, path_str(entry.path()).c_str(), dl_error());
                    }
                    if (handle) {
                        auto score_fn = (ggml_backend_score_t) dl_get_sym(handle.get(), "ggml_backend_score");
                        if (score_fn) {
                            int s = score_fn();
#ifndef NDEBUG
                            GGML_LOG_DEBUG("%s: %s score: %d\n", __func__, path_str(entry.path()).c_str(), s);
#endif
                            if (s > best_score) {
                                best_score = s;
                                best_path  = entry.path();
                            }
                        } else {
                            if (!silent) {
                                GGML_LOG_INFO("%s: failed to find ggml_backend_score in %s\n", __func__, path_str(entry.path()).c_str());
                            }
                        }
                    }
                }
            }
        }
    }

    if (best_score == 0) {
        // try to load the base backend
        for (const auto & search_path : search_paths) {
            fs::path filename = backend_filename_prefix().native() + name_path.native() + backend_filename_extension().native();
            fs::path path     = search_path / filename;
            if (std::error_code ec; fs::exists(path, ec)) {
                return get_reg().load_backend(path, silent);
            } else {
                if (ec) {
                    GGML_LOG_DEBUG("%s: posix_stat(%s) failure, error-message: %s\n", __func__, path_str(path).c_str(), ec.message().c_str());
                }
            }
        }
        return nullptr;
    }

    return get_reg().load_backend(best_path, silent);
}

void ggml_backend_load_all() {
    ggml_backend_load_all_from_path(nullptr);
}

void ggml_backend_load_all_from_path(const char * dir_path) {
#ifdef NDEBUG
    bool silent = true;
#else
    bool silent = false;
#endif

    ggml_backend_load_best("blas", silent, dir_path);
    ggml_backend_load_best("zendnn", silent, dir_path);
    ggml_backend_load_best("cann", silent, dir_path);
    ggml_backend_load_best("cuda", silent, dir_path);
    ggml_backend_load_best("hip", silent, dir_path);
    ggml_backend_load_best("metal", silent, dir_path);
    ggml_backend_load_best("rpc", silent, dir_path);
    ggml_backend_load_best("sycl", silent, dir_path);
    ggml_backend_load_best("vulkan", silent, dir_path);
    ggml_backend_load_best("virtgpu", silent, dir_path);
    ggml_backend_load_best("opencl", silent, dir_path);
    ggml_backend_load_best("hexagon", silent, dir_path);
    ggml_backend_load_best("musa", silent, dir_path);
    ggml_backend_load_best("openvino", silent, dir_path);
    ggml_backend_load_best("cpu", silent, dir_path);
    // check the environment variable GGML_BACKEND_PATH to load an out-of-tree backend
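    // Descriptive note (added): the variable points at a specific library
    // file, not a directory, e.g. (hypothetical path):
    //   GGML_BACKEND_PATH=/opt/ggml/libggml-custom.so ./llama-cli ...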
    const char * backend_path = std::getenv("GGML_BACKEND_PATH");
    if (backend_path) {
        ggml_backend_load(backend_path);
    }
}