llama-cpp-turboquant

thek0tyara/llama-cpp-turboquant

Commit graph

d03abe4285 3 sisyphus thek0tyara 2026-04-09 02:00:36 +03:00
1af95aba36 2 thek0tyara 2026-04-09 01:59:40 +03:00
85753dae2c 1 thek0tyara 2026-04-09 01:58:51 +03:00
2ad996bfb8 whatever TheK0tYaRa 2026-04-08 23:03:52 +03:00
0591e57dfd whatever TheK0tYaRa 2026-04-08 23:02:56 +03:00
01f8650dd9 1:8681-alt1 Vitaly Chikunov 2026-04-06 21:23:51 +00:00
c9974af462 ALT: Generate tools/server/public Vitaly Chikunov 2026-04-03 11:12:09 +03:00
7c28f3abf0 gear: Change WebUI npm target Vitaly Chikunov 2026-04-03 10:37:32 +03:00
0122d1e6aa Merge signed commit 'b8681' into sisyphus Vitaly Chikunov 2026-04-06 21:23:47 +00:00
506200cf8b cli: fix stripping of \n in multiline input (#21485) Bipin Yadav 2026-04-07 00:24:06 +05:30
15f786e658 [CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159) Gaurav Garg 2026-04-07 00:04:29 +05:30
94ca829b60 llama-bench: add -fitc and -fitt to arguments (#21304) Aman Gupta 2026-04-06 22:26:02 +08:00
4aa962e2b0 vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488) Aldehir Rojas 2026-04-06 09:08:37 -05:00
941146b3f1 convert : fix block_ff_dim retrieval for lfm2 (#21508) Sigbjørn Skjæret 2026-04-06 14:05:18 +02:00
482d862bcb server : handle unsuccessful sink.write in chunked stream provider (#21478) lainon1 2026-04-06 13:03:02 +01:00
3979f2bb08 docs: add hunyuan-ocr gguf, also add test [no ci] (#21490) Xuan-Son Nguyen 2026-04-06 14:02:37 +02:00
400ac8e194 convert : set "add bos" == True for Gemma 4 (#21500) Georgi Gerganov 2026-04-06 13:52:07 +03:00
f51fd36d79 sycl : handle other FA case (#21377) Neo Zhang 2026-04-06 18:28:00 +08:00
25eec6f327 hexagon: slight optimization for argosrt output init (#21463) Yarden Tal 2026-04-06 04:30:25 +03:00
58190cc84d llama : correct platform-independent loading of BOOL metadata (#21428) anchortense 2026-04-06 09:40:38 +10:00
af76639f72 model : add HunyuanOCR support (#21395) Richard Davison 2026-04-05 23:32:14 +02:00
761797ffdf ci : use default RISE RISC-V Runners (#21263) Ludovic Henry 2026-04-05 20:29:48 +02:00
5d3a4a7da5 server : fix logging of build + system info (#21460) ddh0 2026-04-05 09:14:02 -05:00
c08d28d088 ci: lower cuda12 floor to 12.8.1 for broader host compatibility (#21438) M1DNYT3 2026-04-05 04:04:00 +03:00
661e9acb36 ci: fix vulkan workflow referencing non-existent action (#21442) Nicholas Sparks 2026-04-04 20:59:51 -04:00
b8635075ff common : add gemma 4 specialized parser (#21418) Aldehir Rojas 2026-04-04 13:39:00 -05:00
9c699074c9 server: Fix undefined timing measurement errors in server context (#21201) Dan Hoffman 2026-04-04 07:11:19 -07:00
d01f6274c0 common : respect specified tag, only fallback when tag is empty (#21413) Adrien Gallouët 2026-04-04 15:08:03 +02:00
650bf14eb9 llama-model: read final_logit_softcapping for Gemma 4 (#21390) SamareshSingh 2026-04-04 06:05:10 -05:00
b7ad48ebda llama: add custom newline split for Gemma 4 (#21406) Aman Gupta 2026-04-04 15:06:34 +08:00
d006858316 ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278) Reese Levine 2026-04-03 11:40:14 -07:00
e439700992 ci: Add Windows Vulkan backend testing on Intel (#21292) Masato Nakasaka 2026-04-04 02:16:44 +09:00
50e0ad08fb server: save and clear idle slots on new task (--clear-idle) (#20993) Yes You Can Have Your Own 2026-04-03 20:02:27 +03:00
f1f793ad06 common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230) Piotr Wilkin (ilintar) 2026-04-03 17:51:52 +02:00
af5c13841f common : fix tool call type detection for nullable and enum schemas (#21327) Samanvya Tripathi 2026-04-03 11:51:23 -04:00
277ff5fff7 docker : bump cuda12 to 12.9.1 (#20920) M1DNYT3 2026-04-03 16:06:45 +03:00
384c0076bc docs: Update build.md: HSA_OVERRIDE_GFX_VERSION clarification (#21331) jeromew 2026-04-03 15:05:14 +02:00
1f34806c44 jinja: coerce input for string-specific filters (#21370) Sigbjørn Skjæret 2026-04-03 15:03:33 +02:00
887535c33f ci: add more binary checks (#21349) Aaron Teo 2026-04-03 20:50:00 +08:00
d3416a4aa9 fix: remove stale assert (#21369) Piotr Wilkin (ilintar) 2026-04-03 13:40:41 +02:00
43a4ee4a2c HIP: build eatch ci build test for a different architecture (#21337) uvos 2026-04-03 11:38:22 +02:00
f851fa5ab0 fix: add openssl to nix dependencies (#21353) (#21355) Tillerino 2026-04-03 11:21:07 +02:00
f1ac84119c ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315) Vishal Singh 2026-04-03 14:49:08 +05:30
b069b10ab4 vocab: fix Gemma4 tokenizer (#21343) Piotr Wilkin (ilintar) 2026-04-03 10:33:03 +02:00
0c58ba3365 rpc : reuse compute graph buffers (#21299) Radoslav Gerganov 2026-04-03 10:28:09 +03:00
57ace0d612 chat : avoid including json in chat.h (#21306) Georgi Gerganov 2026-04-03 09:07:59 +03:00
39b27f0da0 (revert) kv-cache : do not quantize SWA KV cache (#21332) Georgi Gerganov 2026-04-03 09:07:01 +03:00
f49e917876 ci : add AMD ZenDNN label to PR labeler (#21345) Vishal Singh 2026-04-03 08:05:15 +05:30
7c7d6ce5c7 [HIP] Bump ROCm version to 7.2.1 (#21066) Slobodan Josic 2026-04-03 00:59:20 +02:00
5208e2d5ba fix: gemma 4 template (#21326) Piotr Wilkin (ilintar) 2026-04-02 23:31:02 +02:00
7992aa7c8e tests : add unit test coverage for llama_tensor_get_type (#20112) Bartowski 2026-04-02 16:53:58 -04:00
a1cfb64530 ggml-webgpu: add vectorized flash attention (#20709) Zheyuan Chen 2026-04-02 10:40:42 -07:00
5803c8d115 tests: allow exporting graph ops from HF file without downloading weights (#21182) Ruben Ortlam 2026-04-02 18:19:20 +02:00
63f8fe0ef4 model, mtmd: fix gguf conversion for audio/vision mmproj (#21309) Xuan-Son Nguyen 2026-04-02 17:10:32 +02:00
223373742b common : add commentary rules for gpt-oss-20b (#21286) Aldehir Rojas 2026-04-02 08:59:59 -05:00
e15efe007d Relax prefill parser to allow space. (#21240) Piotr Wilkin (ilintar) 2026-04-02 11:29:11 +02:00
6137c325a1 chat : add Granite 4.0 chat template with correct tool_call role mapping (#20804) Jesus Talavera 2026-04-02 11:28:56 +02:00
17193cce34 kv-cache : do not quantize SWA KV cache (#21277) Georgi Gerganov 2026-04-02 11:54:05 +03:00
d6dac92bfd Ignore Transfer-Encoding header. (#20269) Roger Chen 2026-04-02 16:41:19 +08:00
dae2bf41c9 sync : ggml Georgi Gerganov 2026-04-02 10:38:24 +03:00
bc07d55922 ggml : bump version to 0.9.11 (ggml/1456) Georgi Gerganov 2026-04-02 10:37:26 +03:00
4888137b17 sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283) Neo Zhang 2026-04-02 15:08:32 +08:00
fbd441c379 hexagon : add cumsum op support (#21246) Todor Boinovski 2026-04-01 17:44:02 -07:00
c30e012253 contrib : rewrite AGENTS.md, make it more clear about project values (#21270) Xuan-Son Nguyen 2026-04-01 23:31:51 +02:00
95a6ebabb2 opencl: fix leak in Adreno q8_0 path (#21212) lhez 2026-04-01 12:54:58 -07:00
12dbf1da95 server: Bypass API Key validation for WebUI static bundle assets (#21269) Aleksander Grygier 2026-04-01 21:32:15 +02:00
86221cf6da CUDA: fix FA kernel selection logic (#21271) Johannes Gäßler 2026-04-01 21:28:19 +02:00
6de97b9d3e kleidiai: add CPU feature detection to CI run script (#20394) Martin Klacer 2026-04-01 18:02:41 +01:00
5a0ed5150a Update Dawn version in WebGPU CI (#20784) Nikhil Jain 2026-04-01 09:53:05 -07:00
8710e5f9b9 hexagon: improve RMS_NORM and DIV accuracy (#21251) Aparna M P 2026-04-01 21:13:08 +05:30
1d6d4cf7a5 fix: tool call parsing for LFM2 and LFM2.5 models (#21242) Jonathan 2026-04-01 07:22:44 -07:00
744c0c7310 llama : rotate activations for better quantization (#21038) Georgi Gerganov 2026-04-01 16:58:01 +03:00
0356e33aaf scripts: add function call test script (#21234) Xuan-Son Nguyen 2026-04-01 15:31:58 +02:00
6422036fcb sync : ggml Georgi Gerganov 2026-04-01 16:02:34 +03:00
296bc0538b ggml : bump version to 0.9.10 (ggml/1454) Georgi Gerganov 2026-04-01 16:01:45 +03:00
6b949d1078 sycl : support nvfp4 type in mul_mat (#21227) Neo Zhang 2026-04-01 18:54:15 +08:00
84f82e846c ggml-cuda: Add generic NVFP4 MMQ kernel (#21074) Michael Wand 2026-04-01 03:04:58 -07:00
e1cb817483 memory: respect unified KV cache in hybrid memory for eval tasks (#21224) Ettore Di Giacinto 2026-04-01 11:50:17 +02:00
88d5f8ffc3 CUDA/HIP: Fix kernel slection for mmvq mmid kernel to align host selection with device launch bounds (#21238) uvos 2026-04-01 10:21:20 +02:00
d43375ff7f ggml : fix RWKV ops thread assignment (#21226) Georgi Gerganov 2026-04-01 11:10:25 +03:00
2b86e5cae6 ggml-cpu: fix fallback for RVV kernels without zvfh (#21157) Taimur Ahmad 2026-04-01 13:10:03 +05:00
88458164c7 CUDA: Add Flash Attention Support for Head Dimension 512 (#20998) Anav Prasad 2026-04-01 07:07:24 +00:00
4951250235 llama : refactor llama_model_quantize_params to expose a pure C interface (#20346) Ed Addario 2026-04-01 06:43:00 +01:00
82764c341a ggml webgpu: quantized buffers to u32 + wider browser/device support (#21046) Reese Levine 2026-03-31 22:38:24 -07:00
825eb91a66 ggml-webgpu: port all AOT operators to JIT (#20728) Abhijit Ramesh 2026-03-31 15:38:16 -07:00
0fcb3760b2 fix: Use lower-case proxy headers naming (#21235) Aleksander Grygier 2026-03-31 17:47:46 +02:00
6307ec07d3 common : cleanup logs and modernize the progress bar (#21215) Adrien Gallouët 2026-03-31 16:18:00 +02:00
632219af73 CANN: fix multi-thread set_tensor race conditions (#20151) hipudding 2026-03-31 22:00:51 +08:00
4a00bbfed6 server: (webui) no more gzip compression (#21073) Xuan-Son Nguyen 2026-03-31 15:44:26 +02:00
624733d631 common : gpt-oss handle builtin and unsolicited tool calls (#21213) Aldehir Rojas 2026-03-31 06:52:42 -05:00
0b6ff47996 fix: correct misspellings in code comments (#21217) lainon1 2026-03-31 12:50:51 +01:00
eec6f85d7b CI: Enable CPU and Vulkan ARM64 Release (#21207) Seungmin Kim 2026-03-31 20:02:56 +09:00
9281dd135d sync : ggml Georgi Gerganov 2026-03-31 13:08:13 +03:00
0be6c7c9ce ggml : bump version to 0.9.9 (ggml/1449) Georgi Gerganov 2026-03-30 18:34:29 +03:00
41361c8599 common : move up common_init() and fix Windows UTF-8 logs (#21176) Adrien Gallouët 2026-03-31 12:53:41 +02:00
62278cedde sycl : enhance fattn perf (#21185) Neo Zhang 2026-03-31 18:31:50 +08:00
90aa83c6bd common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082) mtmcp 2026-03-31 07:04:42 -03:00
fcc2d598c8 fix: include API key in CORS proxy requests for MCP connections (#21193) SATISH K C 2026-03-31 03:52:34 -05:00
4453e77561 server/webui: cleanup dual representation approach, simplify to openai-compat (#21090) Piotr Wilkin (ilintar) 2026-03-31 10:42:06 +02:00
26dac845cc vendor : update BoringSSL to 0.20260327.0 (#21211) Adrien Gallouët 2026-03-31 09:21:54 +02:00