thek0tyara/llama-cpp-turboquant

Author	SHA1	Message	Date
Xuan-Son Nguyen	31a5cf4c3f	server: use httplib dynamic threads (#20817 ) * server: use httplib dynamic threads * change to n_threads_http + 1024	2026-03-23 12:22:46 +01:00
Georgi Gerganov	e32d243849	ai : update gh permissions (#20895 )	2026-03-23 13:21:41 +02:00
Pascal	c44a932cf4	webui: fix --webui-config-file settings not applied on load (#20823 ) * webui: fix --webui-config-file settings not applied on load * chore: update webui build output	2026-03-23 11:25:35 +01:00
Rashid Ul Islam	177c75852a	metal: add CONV_3D (#19927 ) * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * metal:add conv_3d backend Rebased with master and resolved conflicts. * Resolved issues related to changes in variable names * kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 09:45:34 +02:00
Jhen-Jie Hong	7a0b6a635e	common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859 )	2026-03-23 08:35:27 +01:00
Chenguang Li	07ff000551	CANN: add RoPE cache preload before ACL graph capture (#20747 ) ACL graph capture disallows host-to-device memcpy and device memory malloc/free on the captured stream. Pre-load the RoPE cache before capture so that: - Host-to-device copies and allocations run on the non-captured stream - Cache metadata is populated and memory pool is warmed up - During capture, only on-device computations are recorded; host-side and allocation branches are skipped	2026-03-23 15:24:06 +08:00
Dan Hoffman	cc18f965b6	fix(openvino): explicit memset in buffer_context allocation (#20857 ) * fix(openvino): explicit memset in buffer_context allocation * minor --------- Co-authored-by: Dan Hoffman <dhoffman@cyket.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 08:05:37 +02:00
shaofeiqi	84ffd0c192	opencl: add flattened Q4_K mv and general Q4_K mm (#20773 )	2026-03-22 22:45:11 -07:00
bssrdf	ec2b787ebe	mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847 ) * added support for internvl's dynamic high-resolution (Qianfan-OCR needed) * add min/max dynamic patch to gguf meta * clean up * simplified handling min/max dynamic patch * reuse llava_uhd logic for slice images * provide default values for older models * flake8 * prevent writing 0 value to gguf * remove duplicated resolution candidates with a better algorithm * fix indentation * format * add protection from divide by zero * change to 0 to be safe --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-23 01:06:30 +01:00
DorianRudolph	d3ac030a5d	mtmd : fix LightOnOCR image preprocessing (#20877 )	2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen	49bfddeca1	server: allow router to report child instances sleep status (#20849 ) * server: allow router to report child instances sleep status * refactor * move sleeping to state * nits	2026-03-22 18:33:52 +01:00
Johannes Gäßler	bd3f1d9d65	CUDA: fix BF16 FA compilation (#20865 )	2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret	23c9182ce8	jinja : refactor token advancement (#20864 ) * refactor token advancement * exercise sub-expressions	2026-03-22 17:45:10 +01:00
Vitaly Chikunov	f3ada6d562	1:8470-alt1 - Update to b8470 (2026-03-22).	2026-03-22 18:53:17 +03:00
Vitaly Chikunov	aadfc7a67f	ALT: Generate tools/server/public/index.html.gz State before update since 8192-alt1: tools/server/webui: `3306dbaef` 2026-03-21 misc : prefer ggml-org models in docs and examples (#20827) (ddh0) + mkdir /usr/src/.npm-global + npm config set prefix /usr/src/.npm-global + npm install -g @aikidosec/safe-chain npm warn deprecated glob@10.5.0: Old versions of glob are not supported, and contain widely publicized security vulnerabilities, which have been fixed in the current version. Please update. Support for old versions may be purchased (at exorbitant rates) by contacting i@izs.me added 138 packages in 6s 24 packages are looking for funding run `npm fund` for details + PATH=/usr/src/.npm-global/bin:/usr/bin:/bin:/usr/local/bin + rm -rf llama.cpp/tools/server/public/index.html.gz + cd llama.cpp/tools/server/webui + workdir=tools/server/webui + target=tools/server/public/index.html.gz + aikido-npm ci --ignore-scripts added 661 packages, and audited 662 packages in 40s 260 packages are looking for funding run `npm fund` for details 15 vulnerabilities (2 low, 4 moderate, 9 high) To address all issues, run: npm audit fix Run `npm audit` for details. ℹ Safe-chain: Some package versions were suppressed due to minimum age requirement. 
To disable this check, use: --safe-chain-skip-minimum-package-age + aikido-npm audit --audit-level=critical fix added 1 package, removed 11 packages, changed 25 packages, and audited 651 packages in 17s 253 packages are looking for funding run `npm fund` for details # npm audit report cookie <0.7.0 cookie accepts cookie name, path, and domain with out of bounds characters - https://github.com/advisories/GHSA-pxg6-pf52-xh8x fix available via `npm audit fix --force` Will install @sveltejs/kit@0.0.30, which is a breaking change node_modules/cookie @sveltejs/kit >=1.0.0-next.0 Depends on vulnerable versions of cookie node_modules/@sveltejs/kit @sveltejs/adapter-static >=1.0.0-next.0 Depends on vulnerable versions of @sveltejs/kit node_modules/@sveltejs/adapter-static runed >=0.32.0 Depends on vulnerable versions of @sveltejs/kit node_modules/bits-ui/node_modules/runed bits-ui >=2.11.8 Depends on vulnerable versions of runed Depends on vulnerable versions of svelte-toolbelt node_modules/bits-ui svelte-toolbelt >=0.10.6 Depends on vulnerable versions of runed node_modules/bits-ui/node_modules/svelte-toolbelt 6 low severity vulnerabilities To address issues that do not require attention, run: npm audit fix To address all issues (including breaking changes), run: npm audit fix --force ℹ Safe-chain: Some package versions were suppressed due to minimum age requirement. To disable this check, use: --safe-chain-skip-minimum-package-age + npm run build > webui@1.0.0 build > vite build && ./scripts/post-build.sh ▲ [WARNING] Cannot find base config file "./.svelte-kit/tsconfig.json" [tsconfig.json] tsconfig.json:2:12: 2 │ "extends": "./.svelte-kit/tsconfig.json", ╵ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ vite v7.2.2 building ssr environment for production... transforming... DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0. 
More info and automated migrator: https://sass-lang.com/d/import ╷ 17 │ @import 'katex/src/styles/katex.scss'; │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ╵ src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0. More info and automated migrator: https://sass-lang.com/d/import ╷ 2 │ @import "./fonts.scss"; │ ^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 2:9 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.append instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 9 │ $src: append($src, url('#{$font-folder}/KaTeX_#{$family}-#{$family-suffix}.woff2') format('woff2'), comma); │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/fonts.scss 9:15 generate-src() node_modules/katex/src/styles/fonts.scss 42:11 font-face() node_modules/katex/src/styles/fonts.scss 52:1 @import node_modules/katex/src/styles/katex.scss 2:9 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.length instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 344 │ @for $from from 1 through length($sizes) { │ ^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 344:35 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.length instead. 
More info and automated migrator: https://sass-lang.com/d/import ╷ 345 │ @for $to from 1 through length($sizes) { │ ^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 345:37 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.nth instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 348 │ font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em); │ ^^^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 348:38 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.nth instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 348 │ font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em); │ ^^^^^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 348:57 @import src/styles/katex-custom.scss 17:9 root stylesheet ✓ 4749 modules transformed. Export "getJsonHeaders" of module "src/lib/utils/api-headers.ts" was reexported through module "src/lib/utils/index.ts" while both modules are dependencies of each other and will end up in different chunks by current Rollup settings. This scenario is not well supported at the moment as it will produce a circular dependency between chunks and will likely lead to broken execution order. Either change the import in "src/lib/services/chat.service.ts" to point directly to the exporting module or reconfigure "output.manualChunks" to ensure these modules end up in the same chunk. rendering chunks... vite v7.2.2 building client environment for production... transforming... DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0. 
More info and automated migrator: https://sass-lang.com/d/import ╷ 17 │ @import 'katex/src/styles/katex.scss'; │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ╵ src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0. More info and automated migrator: https://sass-lang.com/d/import ╷ 2 │ @import "./fonts.scss"; │ ^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 2:9 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.append instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 9 │ $src: append($src, url('#{$font-folder}/KaTeX_#{$family}-#{$family-suffix}.woff2') format('woff2'), comma); │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/fonts.scss 9:15 generate-src() node_modules/katex/src/styles/fonts.scss 42:11 font-face() node_modules/katex/src/styles/fonts.scss 52:1 @import node_modules/katex/src/styles/katex.scss 2:9 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.length instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 344 │ @for $from from 1 through length($sizes) { │ ^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 344:35 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.length instead. 
More info and automated migrator: https://sass-lang.com/d/import ╷ 345 │ @for $to from 1 through length($sizes) { │ ^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 345:37 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.nth instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 348 │ font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em); │ ^^^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 348:38 @import src/styles/katex-custom.scss 17:9 root stylesheet DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0. Use list.nth instead. More info and automated migrator: https://sass-lang.com/d/import ╷ 348 │ font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em); │ ^^^^^^^^^^^^^^^^^^ ╵ node_modules/katex/src/styles/katex.scss 348:57 @import src/styles/katex-custom.scss 17:9 root stylesheet ✓ 5881 modules transformed. rendering chunks... computing gzip size... .svelte-kit/output/client/_app/version.json 0.03 kB │ gzip: 0.05 kB .svelte-kit/output/client/.vite/manifest.json 0.33 kB │ gzip: 0.19 kB .svelte-kit/output/client/_app/immutable/assets/style.SW4DF8iR.css 499.50 kB │ gzip: 288.93 kB (!) Some chunks are larger than 3072 kB after minification. Consider: - Using dynamic import() to code-split the application - Use build.rollupOptions.output.manualChunks to improve chunking: https://rollupjs.org/configuration-options/#output-manualchunks - Adjust chunk size limit for this warning via build.chunkSizeWarningLimit. 
.svelte-kit/output/client/_app/immutable/bundle.CBB5SKcU.js 4,401.02 kB │ gzip: 1,297.75 kB ✓ built in 13.63s .svelte-kit/output/server/.vite/manifest.json 5.80 kB .svelte-kit/output/server/_app/immutable/assets/style.LUCY6AWH.css 499.22 kB .svelte-kit/output/server/chunks/false.js 0.03 kB .svelte-kit/output/server/chunks/environment.js 0.07 kB .svelte-kit/output/server/chunks/api-key-validation.js 0.17 kB .svelte-kit/output/server/chunks/server.js 0.20 kB .svelte-kit/output/server/entries/pages/_page.ts.js 0.25 kB .svelte-kit/output/server/entries/pages/chat/_id_/_page.ts.js 0.28 kB .svelte-kit/output/server/internal.js 0.37 kB .svelte-kit/output/server/chunks/utils.js 0.62 kB .svelte-kit/output/server/entries/pages/_page.svelte.js 1.11 kB .svelte-kit/output/server/entries/pages/chat/_id_/_page.svelte.js 1.16 kB .svelte-kit/output/server/chunks/exports.js 1.46 kB .svelte-kit/output/server/chunks/url.js 1.60 kB .svelte-kit/output/server/chunks/label.js 2.28 kB .svelte-kit/output/server/chunks/internal.js 2.58 kB .svelte-kit/output/server/entries/pages/_error.svelte.js 8.39 kB .svelte-kit/output/server/remote-entry.js 8.56 kB .svelte-kit/output/server/chunks/shared.js 11.83 kB .svelte-kit/output/server/chunks/precision.js 22.45 kB .svelte-kit/output/server/entries/pages/_layout.svelte.js 34.39 kB .svelte-kit/output/server/chunks/root.js 38.85 kB .svelte-kit/output/server/index.js 55.03 kB .svelte-kit/output/server/chunks/SyntaxHighlightedCode.svelte_svelte_type_style_lang.js 76.87 kB .svelte-kit/output/server/chunks/context.svelte.js 180.22 kB .svelte-kit/output/server/chunks/ServerLoadingSplash.js 339.43 kB ✓ built in 24.59s Run npm run preview to preview your production build locally. > Using @sveltejs/adapter-static Overwriting ../public/index.html with fallback page. Consider using a different name for the fallback. Wrote site to "../public" ✔ done ✓ Inlined favicon.svg as base64 data URL ✓ Created index.html.gz	2026-03-22 18:53:17 +03:00
Vitaly Chikunov	c912b31529	spec: Rm export-graph-ops test Link: https://github.com/ggml-org/llama.cpp/pull/19896 Signed-off-by: Vitaly Chikunov <vt@altlinux.org>	2026-03-22 18:48:10 +03:00
Vitaly Chikunov	4925d4706a	Merge signed commit 'b8470' into sisyphus Extra-Attributes: tools/server/public/index.html.gz merge=ours Diff-After-Merge: 2 files changed, 6 insertions(+) # gpg: Signature made Sun Mar 22 13:05:51 2026 MSK # gpg: using RSA key B5690EEEBB952194 # gpg: Good signature from "GitHub <noreply@github.com>" [unknown]	2026-03-22 15:46:35 +00:00
Evgeny Kurnevsky	81bc4d3ddc	server: fix Host header (#20843 ) It should include port when it's not default.	2026-03-22 22:29:22 +08:00
Neo Zhang	f40a80b4f3	support bf16 and quantized type (#20803 )	2026-03-22 22:06:27 +08:00
Patrick Buckley	db9d8aa428	ggml-cuda: native bf16 flash attention for vec kernel (#20525 ) * ggml-cuda: native bf16 flash attention for vec and tile kernels mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo * ggml-cuda: address code owner review feedback reverted tile kernel changes to avoid larger refactor * fix ci failures on turing and hip * fix bf16 vec kernel compile on hip v_dot2 platforms * add comments --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-22 11:05:51 +01:00
Gaurav Garg	ccb87fa3ee	[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635 ) * Increase per-thread work if the K-dimension is small With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of K-dimension size, which leaves some of the threads idle and results in poor performance for these matrices. This change increases the number of output elements per block for such cases. * Limit this change to ncols_dst = 1 * tab to space	2026-03-22 16:49:35 +08:00
ddh0	3306dbaef7	misc : prefer ggml-org models in docs and examples (#20827 ) * misc : prefer ggml-org models in docs and examples Prefer referring to known-good quantizations under ggml-org rather than 3rd-party uploaders. * remove accidentally committed file	2026-03-21 22:00:26 +01:00
Andrea Arcangeli	990e4d9698	common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604 ) * grammar: add test case for nullable symbol loop Reproduce stack overflow (or OOM) with ( [x]* )* found while adding GBNF support to ripgrep-edit. llama-server reproducer: curl \ -X POST \ -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= ( [x]* )" }' \ -H "Content-Type: application/json" \ http://localhost:8811/v1/chat/completions grammar: prevent stack overflow with nullable symbol loop Fix a potential stack overflow in llama_grammar_advance_stack that could occur when processing grammars with nullable symbols that lead to infinite derivations of empty strings. The fix introduces cycle detection by tracking visited stacks to prevent infinite recursion. rg-edit regexp: llama_grammar_advance_stack rg-edit extra-args: -A20 rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration""" gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md")) * grammar: convert recursive llama_grammar_advance_stack to iterative This change converts the function to an iterative approach using explicit stacks, which prevents deep recursion and eliminates the risk of stack overflow. rg-edit regexp: llama_grammar_advance_stack rg-edit extra-args: -A30 rg-edit directive: """Rewrite: fix the following segfault: [..] ⚫ Testing segfault. 
Grammar: root ::= ( [x]* )* root ::= ( [x]* )* Segmentation fault build/bin/test-grammar-integration convert from recursive to interactive""" gptel-context: (("~/llama.cpp/src/llama-grammar.cpp") ("~/llama.cpp/tests/test-grammar-integration.cpp") ("~/llama.cpp/grammars/./list.gbnf") ("~/llama.cpp/grammars/./json_arr.gbnf") ("~/llama.cpp/grammars/./json.gbnf") ("~/llama.cpp/grammars/./japanese.gbnf") ("~/llama.cpp/grammars/./english.gbnf") ("~/llama.cpp/grammars/./chess.gbnf") ("~/llama.cpp/grammars/./c.gbnf") ("~/llama.cpp/grammars/./arithmetic.gbnf") ("~/llama.cpp/grammars/./README.md")) v2: Added a `std::set` to perform tree-based lookups with O(N log N) complexity. Testing with a parallel run of `test-grammar-integration` shows a double-digit percentage increase in runtime. An `unordered_set` with O(1) hashing was also evaluated, but the overhead of constructing hash keys from pointers made it significantly slower than the rbtree implementation that only requires an ordering operator. The performance regression in the test suite appears justified by the overall reduction in algorithmic complexity. Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> * grammar: add test case for hang in repetition grammar processing This commit adds a new test case to the grammar integration tests that specifically targets a hang scenario in the repetition grammar parser found while adding GBNF support to ripgrep-edit. llama-server reproducer: curl \ -X POST \ -d '{ "messages": [{ "role": "user", "content": "write yes" }], "grammar": "root ::= (([^x]){0,99}){0,99}" }' \ -H "Content-Type: application/json" \ http://localhost:8811/v1/chat/completions grammar: add repetition threshold check The change introduces a maximum repetition threshold to avoid excessive rule expansion during grammar parsing. 
When parsing repetition patterns like {m,n}, the parser now calculates the potential number of rules that would be generated and throws an error if the product of previous rules and new rules exceeds the threshold. A test case was added to verify the threshold is properly enforced for deeply nested repetition patterns that would otherwise cause hangs.	2026-03-21 18:43:35 +01:00
Tom Hillbrunner	212f4521b0	context : use n_embd_out for pooled embedding extraction (#20840 ) The MEAN/CLS/LAST pooling paths in encode() and decode() used n_embd_inp() (16384 for qwen3vl with deepstack) to read from the pooled embedding tensor, which only has n_embd_out() (4096) floats per sequence. This caused a tensor read out of bounds assertion. Fixes embedding mode for Qwen3-VL-Embedding models.	2026-03-21 19:35:00 +02:00
Xuan-Son Nguyen	568aec82d2	docs : explicit about banning accounts that violates policy (#19593 )	2026-03-21 15:50:16 +01:00
y198	2bcdddd5e3	fix(rpc): prevent division by zero in deserialize_tensor (#20712 ) rpc : prevent division by zero in deserialize_tensor When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0. (Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here). * style: remove trailing whitespace	2026-03-21 15:59:43 +02:00
Michael Wand	eac9c6ea83	Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730 ) * Corrected convert script for NVFP4 naming and updated gguf constants * Add mostly_MXFP4 to FileType Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * simplify * set initial value [no ci] --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-21 13:35:21 +02:00
Sigbjørn Skjæret	29b28a9824	ci : switch from pyright to ty (#20826 ) * type fixes * switch to ty * tweak rules * tweak more rules * more tweaks * final tweak * use common import-not-found rule	2026-03-21 08:54:34 +01:00
Matt Corallo	cea560f483	Add shader count for Intel Arc Pro B60 (#20818 )	2026-03-21 05:22:51 +01:00
Piotr Wilkin (ilintar)	b1c70e2e54	common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825 )	2026-03-21 00:19:04 +01:00
shalinib-ibm	e6ec21e62f	ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791 ) Explicitly mark save_acc and add_save_Acc with always_inline in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator disassembly within the kernel's register context, preventing unnecessary stack spills. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>	2026-03-21 07:11:45 +08:00
Georgi Gerganov	4cb7e0bd61	ai : limit runtime of the agent (#20816 )	2026-03-20 20:31:25 +02:00
James O'Leary	149b2493c0	common : fix typo in debug log ('extracft' -> 'extract') (#20807 )	2026-03-20 18:23:18 +01:00
Georgi Gerganov	b31b30f31d	ai : do not run bash commands in the prompt (#20810 )	2026-03-20 19:06:33 +02:00
Victor Villar	58c81f7e81	model : fix Granite Hybrid type check for 7B.A1B (#20795 ) * Check granite hybrid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B * Use feed fwd dim instead of num of experts Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-20 15:16:09 +01:00
Xuan-Son Nguyen	fb78ad29bb	server: (doc) clarify in-scope and out-scope features (#20794 ) * server: (doc) clarify in-scope and out-scope features * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-20 14:03:50 +01:00
Jeff Bolz	e06c3ab2bc	vulkan: change gated_delta_net to shard a column across a subgroup (#20662 ) * vulkan: change gated_delta_net to shard a column across a subgroup This is based on https://github.com/ggml-org/llama.cpp/pull/20391. I used an LLM to port the CUDA code to Vulkan, and guided it to make various fixes to work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of subgroup to invocation id, using subgroupAdd optionally, etc.). This fixes a perf regression from the transposing of the values in memory (!20443). * vulkan: Spread columns across fewer lanes to reduce the number of workgroups	2026-03-20 12:17:15 +01:00
Ruikai Peng	dc6592431b	context: zero output buffer on allocation (#20781 ) * context: zero output buffer on allocation Address GHSA-wqq9-25mr-rw76. The logits output buffer allocated in output_reserve() uses posix_memalign(), which does not zero memory. The buffer is only written during decode when needs_raw_logits() returns true. When backend samplers cover all output sequences, needs_raw_logits() returns false and the buffer is never written, but llama_get_logits() still returns a pointer to it, exposing stale heap content. Zero the buffer after allocation to prevent information disclosure through the public logits API. Found-by: Pwno * Update src/llama-context.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-20 11:31:34 +02:00
Ruikai Peng	3adbef7776	model: assert nextn_predict_layers to prevent underflow (#20783 ) Address GHSA-645x-v54x-34w8. When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers can underflow (unsigned wrap), which corrupts n_layer_kv_from_start. Assert nextn_predict_layers immediately after parsing the GGUF key. Found-by: Pwno	2026-03-20 10:17:58 +01:00
Georgi Gerganov	ab9d4c3678	server : improve mtmd ctx checkpoints (#20726 ) * server : improve mtmd ctx checkpoints * server : fix off-by-one in pos_min_thold	2026-03-20 11:13:12 +02:00
hipudding	1af9dab32b	CANN: add BF16 support for core operators (#20152 ) * CANN: add BF16 support for core operators Add BF16 (bfloat16) type support to the CANN backend for the following operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and OUT_PROD. This enables BF16 models to run on Ascend NPUs. * CANN: skip NZ weight format for BF16 and add 310P compile guards NZ weight format conversion does not support BF16 tensors, skip it in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P guards for all BF16 operator support since 310P does not support BF16.	2026-03-20 17:08:39 +08:00
Seyoung Jeong	6d99b44c7e	docs : fix Metal backend op support status in ops.md (#20779 ) Regenerate docs/ops/Metal.csv using test-backend-ops on Apple M5 and rebuild docs/ops.md via scripts/create_ops_docs.py. Five ops were incorrectly marked as not supported (❌) for Metal: - DIAG: ❌ → ✅ - POOL_1D: ❌ → ✅ - SET: ❌ → ✅ - SOLVE_TRI: ❌ → ✅ - GATED_DELTA_NET:❌ → 🟡 (partial, depends on head_size % 32)	2026-03-20 11:06:38 +02:00
Georgi Gerganov	464fd0e71f	ai : update find-related action (#20790 ) * ai : update "related issues" prompt * cont * cont * cont	2026-03-20 10:28:14 +02:00
Ruikai Peng	21c8045214	jinja : fix heap OOB read in value equality comparison (#20782 ) Address GHSA-q9j6-4hhc-rq9p and GHSA-2q4c-9gq5-5vfp. The three-iterator overload of std::equal in value_array_t::equivalent() and value_object_t::equivalent() reads past the end of the shorter container when comparing arrays or objects of different lengths. Use the four-iterator overload (C++14) which checks both range lengths. Found-by: Pwno	2026-03-20 07:15:17 +01:00
James O'Leary	c46583b86b	common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777 ) * chat : fix out_of_range crash in throw path (#20424 regression) #20424 introduced effective_input = generation_prompt + input, but the throw path uses input.substr(result.end) where result.end is a position within effective_input. Every thinking model with a non-empty generation_prompt crashes with std::out_of_range instead of the intended error message. Test crashes on unpatched master, passes with fix: cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF cmake --build build --target test-chat ./build/bin/test-chat * Update test-chat.cpp * Update test-chat.cpp * Update test-chat.cpp --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-03-20 02:37:22 +01:00
Ben Racicot	c1b911654a	server: fix router mode deadlock on child crash and TOCTOU race in models_max (#20763 ) Two bugs in `server_models::load()` that affect router mode reliability: Bug 1: Deadlock when child process crashes When a child process is killed (e.g., SIGKILL from OS code signature validation), the monitoring thread deadlocks on `stopping_thread.join()` because the stopping_thread's wait predicate (`is_stopping`) is never satisfied — the model name was never inserted into `stopping_models`. `update_status()` is never reached and the model stays stuck in LOADING state permanently. Fix: extend the stopping_thread's wait predicate to also wake when the child process is no longer alive (`!subprocess_alive()`). When woken by a dead child, the thread skips the shutdown sequence and returns immediately. The original `stopping_models.erase()` logic is preserved for normal unloads. Bug 2: TOCTOU race bypasses `--models-max` (ref #20137) `unload_lru()` is called outside the mutex, then `load()` acquires the lock afterward. Under concurrent requests, multiple threads observe capacity and all proceed to load, exceeding the limit. Fix: re-check capacity under the lock after `unload_lru()` returns. If another thread filled the slot in the window between `unload_lru()` and the lock acquisition, reject with an error instead of silently exceeding the limit.	2026-03-19 22:16:05 +01:00
Tomeamis	b739738dad	docs: Update server README to reflect PR #20297 (#20560 )	2026-03-19 21:28:44 +01:00
Sundaram krishnan	a0bbcdd9b6	ggml: guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (#20767 )	2026-03-19 21:36:23 +02:00
Georgi Gerganov	6c72646a61	ci : improve action for duplicate issue (#20772 ) * ci : show thinking traces of the agent * cont : increase thinking * cont : remove agent files * cont : move the model selection to the provider	2026-03-19 21:11:53 +02:00
Rail Chabdarov	340807273b	hip: Avoid compiler bug in RDNA code generation during debug builds on Windows (#20655 )	2026-03-19 19:14:08 +01:00
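
Note on 21c8045214 (jinja : fix heap OOB read in value equality comparison): the commit message contrasts the three-iterator and four-iterator overloads of std::equal. The following is a minimal sketch of that difference on plain vectors, not the code from the patch; the function name same_elements and the use of std::vector<int> are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not from llama.cpp): compares two sequences safely.
//
// The three-iterator form, std::equal(a.begin(), a.end(), b.begin()),
// assumes the second range is at least as long as the first. If b is
// shorter, it reads past b's end, which is undefined behaviour.
//
// The four-iterator form (C++14) receives both end iterators, so it can
// return false for ranges of different length without reading out of bounds.
bool same_elements(const std::vector<int> & a, const std::vector<int> & b) {
    return std::equal(a.begin(), a.end(), b.begin(), b.end());
}
```

Calling same_elements with {1, 2, 3} and {1, 2} returns false, whereas the three-iterator form would have read one element past the end of the shorter vector.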


9985 commits
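
Note on 3adbef7776 (model: assert nextn_predict_layers to prevent underflow): the commit message describes an unsigned wrap in n_layer - nextn_predict_layers. The sketch below illustrates the arithmetic and one way to guard it, assuming 32-bit unsigned counters; the function names are hypothetical, and the exception stands in for the assertion that the commit adds.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical, unchecked version: unsigned subtraction never goes negative,
// it wraps modulo 2^32, so 2 - 3 yields 4294967295 rather than -1.
std::uint32_t unchecked_difference(std::uint32_t n_layer, std::uint32_t nextn_predict_layers) {
    return n_layer - nextn_predict_layers;
}

// Hypothetical, checked version: validate the value right after it is read,
// before any arithmetic depends on it. The condition follows the commit
// message, which treats nextn_predict_layers >= n_layer as invalid.
std::uint32_t layers_before_nextn(std::uint32_t n_layer, std::uint32_t nextn_predict_layers) {
    if (nextn_predict_layers >= n_layer) {
        throw std::invalid_argument("nextn_predict_layers must be smaller than n_layer");
    }
    return n_layer - nextn_predict_layers;
}
```

Rejecting the value at parse time keeps every later computation that uses the difference free of wrap-around checks.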