Commit graph

9985 commits

Author SHA1 Message Date
Xuan-Son Nguyen
31a5cf4c3f
server: use httplib dynamic threads (#20817)
* server: use httplib dynamic threads

* change to n_threads_http + 1024
2026-03-23 12:22:46 +01:00
Georgi Gerganov
e32d243849
ai : update gh permissions (#20895) 2026-03-23 13:21:41 +02:00
Pascal
c44a932cf4
webui: fix --webui-config-file settings not applied on load (#20823)
* webui: fix --webui-config-file settings not applied on load

* chore: update webui build output
2026-03-23 11:25:35 +01:00
Rashid Ul Islam
177c75852a
metal: add CONV_3D (#19927)
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* metal:add conv_3d backend

Rebased with master and resolved conflicts.

* Resolved issues related to changes in variable names

* kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 09:45:34 +02:00
Jhen-Jie Hong
7a0b6a635e
common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859) 2026-03-23 08:35:27 +01:00
Chenguang Li
07ff000551
CANN: add RoPE cache preload before ACL graph capture (#20747)
ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
  and allocation branches are skipped
2026-03-23 15:24:06 +08:00
Dan Hoffman
cc18f965b6
fix(openvino): explicit memset in buffer_context allocation (#20857)
* fix(openvino): explicit memset in buffer_context allocation

* minor

---------

Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 08:05:37 +02:00
shaofeiqi
84ffd0c192
opencl: add flattened Q4_K mv and general Q4_K mm (#20773) 2026-03-22 22:45:11 -07:00
bssrdf
ec2b787ebe
mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847)
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)

* add min/max dynamic patch to gguf meta

* clean up

* simplified handling min/max dynamic patch

* reuse llava_uhd logic for slice images

* provide default values for older models

* flake8

* prevent writing 0 value to gguf

* remove duplicated resolution candidates with a better algorithm

* fix indentation

* format

* add protection from divide by zero

* change to 0 to be safe

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-23 01:06:30 +01:00
DorianRudolph
d3ac030a5d
mtmd : fix LightOnOCR image preprocessing (#20877) 2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen
49bfddeca1
server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Johannes Gäßler
bd3f1d9d65
CUDA: fix BF16 FA compilation (#20865) 2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret
23c9182ce8
jinja : refactor token advancement (#20864)
* refactor token advancement

* exercise sub-expressions
2026-03-22 17:45:10 +01:00
Vitaly Chikunov
f3ada6d562 1:8470-alt1
- Update to b8470 (2026-03-22).
2026-03-22 18:53:17 +03:00
Vitaly Chikunov
aadfc7a67f ALT: Generate tools/server/public/index.html.gz
State before update since 8192-alt1:
  tools/server/webui: 3306dbaef 2026-03-21 misc : prefer ggml-org models in docs and examples (#20827) (ddh0)

+ mkdir /usr/src/.npm-global
+ npm config set prefix /usr/src/.npm-global
+ npm install -g @aikidosec/safe-chain
npm warn deprecated glob@10.5.0: Old versions of glob are not supported, and contain widely publicized security vulnerabilities, which have been fixed in the current version. Please update. Support for old versions may be purchased (at exorbitant rates) by contacting i@izs.me

added 138 packages in 6s

24 packages are looking for funding
  run `npm fund` for details
+ PATH=/usr/src/.npm-global/bin:/usr/bin:/bin:/usr/local/bin
+ rm -rf llama.cpp/tools/server/public/index.html.gz
+ cd llama.cpp/tools/server/webui
+ workdir=tools/server/webui
+ target=tools/server/public/index.html.gz
+ aikido-npm ci --ignore-scripts

added 661 packages, and audited 662 packages in 40s

260 packages are looking for funding
  run `npm fund` for details

15 vulnerabilities (2 low, 4 moderate, 9 high)

To address all issues, run:
  npm audit fix

Run `npm audit` for details.
ℹ Safe-chain: Some package versions were suppressed due to minimum age requirement.
  To disable this check, use: --safe-chain-skip-minimum-package-age
+ aikido-npm audit --audit-level=critical fix

added 1 package, removed 11 packages, changed 25 packages, and audited 651 packages in 17s

253 packages are looking for funding
  run `npm fund` for details

# npm audit report

cookie  <0.7.0
cookie accepts cookie name, path, and domain with out of bounds characters - https://github.com/advisories/GHSA-pxg6-pf52-xh8x
fix available via `npm audit fix --force`
Will install @sveltejs/kit@0.0.30, which is a breaking change
node_modules/cookie
  @sveltejs/kit  >=1.0.0-next.0
  Depends on vulnerable versions of cookie
  node_modules/@sveltejs/kit
    @sveltejs/adapter-static  >=1.0.0-next.0
    Depends on vulnerable versions of @sveltejs/kit
    node_modules/@sveltejs/adapter-static
    runed  >=0.32.0
    Depends on vulnerable versions of @sveltejs/kit
    node_modules/bits-ui/node_modules/runed
      bits-ui  >=2.11.8
      Depends on vulnerable versions of runed
      Depends on vulnerable versions of svelte-toolbelt
      node_modules/bits-ui
      svelte-toolbelt  >=0.10.6
      Depends on vulnerable versions of runed
      node_modules/bits-ui/node_modules/svelte-toolbelt

6 low severity vulnerabilities

To address issues that do not require attention, run:
  npm audit fix

To address all issues (including breaking changes), run:
  npm audit fix --force
ℹ Safe-chain: Some package versions were suppressed due to minimum age requirement.
  To disable this check, use: --safe-chain-skip-minimum-package-age
+ npm run build

> webui@1.0.0 build
> vite build && ./scripts/post-build.sh

▲ [WARNING] Cannot find base config file "./.svelte-kit/tsconfig.json" [tsconfig.json]

    tsconfig.json:2:12:
      2 │   "extends": "./.svelte-kit/tsconfig.json",
        ╵              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vite v7.2.2 building ssr environment for production...
transforming...
DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0.

More info and automated migrator: https://sass-lang.com/d/import

   ╷
17 │ @import 'katex/src/styles/katex.scss';
   │         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   ╵
    src/styles/katex-custom.scss 17:9  root stylesheet

DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0.

More info and automated migrator: https://sass-lang.com/d/import

  ╷
2 │ @import "./fonts.scss";
  │         ^^^^^^^^^^^^^^
  ╵
    node_modules/katex/src/styles/katex.scss 2:9  @import
    src/styles/katex-custom.scss 17:9             root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.append instead.

More info and automated migrator: https://sass-lang.com/d/import

  ╷
9 │         $src: append($src, url('#{$font-folder}/KaTeX_#{$family}-#{$family-suffix}.woff2') format('woff2'), comma);
  │               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ╵
    node_modules/katex/src/styles/fonts.scss 9:15   generate-src()
    node_modules/katex/src/styles/fonts.scss 42:11  font-face()
    node_modules/katex/src/styles/fonts.scss 52:1   @import
    node_modules/katex/src/styles/katex.scss 2:9    @import
    src/styles/katex-custom.scss 17:9               root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.length instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
344 │         @for $from from 1 through length($sizes) {
    │                                   ^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 344:35  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.length instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
345 │             @for $to from 1 through length($sizes) {
    │                                     ^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 345:37  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.nth instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
348 │                     font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em);
    │                                      ^^^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 348:38  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.nth instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
348 │                     font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em);
    │                                                         ^^^^^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 348:57  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

✓ 4749 modules transformed.
Export "getJsonHeaders" of module "src/lib/utils/api-headers.ts" was reexported through module "src/lib/utils/index.ts" while both modules are dependencies of each other and will end up in different chunks by current Rollup settings. This scenario is not well supported at the moment as it will produce a circular dependency between chunks and will likely lead to broken execution order.
Either change the import in "src/lib/services/chat.service.ts" to point directly to the exporting module or reconfigure "output.manualChunks" to ensure these modules end up in the same chunk.
rendering chunks...
vite v7.2.2 building client environment for production...
transforming...
DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0.

More info and automated migrator: https://sass-lang.com/d/import

   ╷
17 │ @import 'katex/src/styles/katex.scss';
   │         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   ╵
    src/styles/katex-custom.scss 17:9  root stylesheet

DEPRECATION WARNING [import]: Sass @import rules are deprecated and will be removed in Dart Sass 3.0.0.

More info and automated migrator: https://sass-lang.com/d/import

  ╷
2 │ @import "./fonts.scss";
  │         ^^^^^^^^^^^^^^
  ╵
    node_modules/katex/src/styles/katex.scss 2:9  @import
    src/styles/katex-custom.scss 17:9             root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.append instead.

More info and automated migrator: https://sass-lang.com/d/import

  ╷
9 │         $src: append($src, url('#{$font-folder}/KaTeX_#{$family}-#{$family-suffix}.woff2') format('woff2'), comma);
  │               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ╵
    node_modules/katex/src/styles/fonts.scss 9:15   generate-src()
    node_modules/katex/src/styles/fonts.scss 42:11  font-face()
    node_modules/katex/src/styles/fonts.scss 52:1   @import
    node_modules/katex/src/styles/katex.scss 2:9    @import
    src/styles/katex-custom.scss 17:9               root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.length instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
344 │         @for $from from 1 through length($sizes) {
    │                                   ^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 344:35  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.length instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
345 │             @for $to from 1 through length($sizes) {
    │                                     ^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 345:37  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.nth instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
348 │                     font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em);
    │                                      ^^^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 348:38  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

DEPRECATION WARNING [global-builtin]: Global built-in functions are deprecated and will be removed in Dart Sass 3.0.0.
Use list.nth instead.

More info and automated migrator: https://sass-lang.com/d/import

    ╷
348 │                     font-size: calc((nth($sizes, $to) / nth($sizes, $from)) * 1em);
    │                                                         ^^^^^^^^^^^^^^^^^^
    ╵
    node_modules/katex/src/styles/katex.scss 348:57  @import
    src/styles/katex-custom.scss 17:9                root stylesheet

✓ 5881 modules transformed.
rendering chunks...
computing gzip size...
.svelte-kit/output/client/_app/version.json                             0.03 kB │ gzip:     0.05 kB
.svelte-kit/output/client/.vite/manifest.json                           0.33 kB │ gzip:     0.19 kB
.svelte-kit/output/client/_app/immutable/assets/style.SW4DF8iR.css    499.50 kB │ gzip:   288.93 kB

(!) Some chunks are larger than 3072 kB after minification. Consider:
- Using dynamic import() to code-split the application
- Use build.rollupOptions.output.manualChunks to improve chunking: https://rollupjs.org/configuration-options/#output-manualchunks
- Adjust chunk size limit for this warning via build.chunkSizeWarningLimit.
.svelte-kit/output/client/_app/immutable/bundle.CBB5SKcU.js         4,401.02 kB │ gzip: 1,297.75 kB
✓ built in 13.63s
.svelte-kit/output/server/.vite/manifest.json                                              5.80 kB
.svelte-kit/output/server/_app/immutable/assets/style.LUCY6AWH.css                       499.22 kB
.svelte-kit/output/server/chunks/false.js                                                  0.03 kB
.svelte-kit/output/server/chunks/environment.js                                            0.07 kB
.svelte-kit/output/server/chunks/api-key-validation.js                                     0.17 kB
.svelte-kit/output/server/chunks/server.js                                                 0.20 kB
.svelte-kit/output/server/entries/pages/_page.ts.js                                        0.25 kB
.svelte-kit/output/server/entries/pages/chat/_id_/_page.ts.js                              0.28 kB
.svelte-kit/output/server/internal.js                                                      0.37 kB
.svelte-kit/output/server/chunks/utils.js                                                  0.62 kB
.svelte-kit/output/server/entries/pages/_page.svelte.js                                    1.11 kB
.svelte-kit/output/server/entries/pages/chat/_id_/_page.svelte.js                          1.16 kB
.svelte-kit/output/server/chunks/exports.js                                                1.46 kB
.svelte-kit/output/server/chunks/url.js                                                    1.60 kB
.svelte-kit/output/server/chunks/label.js                                                  2.28 kB
.svelte-kit/output/server/chunks/internal.js                                               2.58 kB
.svelte-kit/output/server/entries/pages/_error.svelte.js                                   8.39 kB
.svelte-kit/output/server/remote-entry.js                                                  8.56 kB
.svelte-kit/output/server/chunks/shared.js                                                11.83 kB
.svelte-kit/output/server/chunks/precision.js                                             22.45 kB
.svelte-kit/output/server/entries/pages/_layout.svelte.js                                 34.39 kB
.svelte-kit/output/server/chunks/root.js                                                  38.85 kB
.svelte-kit/output/server/index.js                                                        55.03 kB
.svelte-kit/output/server/chunks/SyntaxHighlightedCode.svelte_svelte_type_style_lang.js   76.87 kB
.svelte-kit/output/server/chunks/context.svelte.js                                       180.22 kB
.svelte-kit/output/server/chunks/ServerLoadingSplash.js                                  339.43 kB
✓ built in 24.59s

Run npm run preview to preview your production build locally.

> Using @sveltejs/adapter-static
Overwriting ../public/index.html with fallback page. Consider using a different name for the fallback.
  Wrote site to "../public"
  ✔ done
✓ Inlined favicon.svg as base64 data URL
✓ Created index.html.gz
2026-03-22 18:53:17 +03:00
Vitaly Chikunov
c912b31529 spec: Rm export-graph-ops test
Link: https://github.com/ggml-org/llama.cpp/pull/19896
Signed-off-by: Vitaly Chikunov <vt@altlinux.org>
2026-03-22 18:48:10 +03:00
Vitaly Chikunov
4925d4706a Merge signed commit 'b8470' into sisyphus
Extra-Attributes: tools/server/public/index.html.gz merge=ours
Diff-After-Merge: 2 files changed, 6 insertions(+)

# gpg: Signature made Sun Mar 22 13:05:51 2026 MSK
# gpg:                using RSA key B5690EEEBB952194
# gpg: Good signature from "GitHub <noreply@github.com>" [unknown]
2026-03-22 15:46:35 +00:00
Evgeny Kurnevsky
81bc4d3ddc
server: fix Host header (#20843)
It should include port when it's not default.
2026-03-22 22:29:22 +08:00
Neo Zhang
f40a80b4f3
support bf16 and quantized type (#20803) 2026-03-22 22:06:27 +08:00
Patrick Buckley
db9d8aa428
ggml-cuda: native bf16 flash attention for vec kernel (#20525)
* ggml-cuda: native bf16 flash attention for vec and tile kernels

mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo

* ggml-cuda: address code owner review feedback

reverted tile kernel changes to avoid larger refactor

* fix ci failures on turing and hip

* fix bf16 vec kernel compile on hip v_dot2 platforms

* add comments

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-22 11:05:51 +01:00
Gaurav Garg
ccb87fa3ee
[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635)
* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3235B-A22B has k-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.

* Limit this change to ncols_dst = 1

* tab to space
2026-03-22 16:49:35 +08:00
ddh0
3306dbaef7
misc : prefer ggml-org models in docs and examples (#20827)
* misc : prefer ggml-org models in docs and examples

Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.

* remove accidentally committed file
2026-03-21 22:00:26 +01:00
Andrea Arcangeli
990e4d9698
common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to interactive"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.

Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
2026-03-21 18:43:35 +01:00
Tom Hillbrunner
212f4521b0
context : use n_embd_out for pooled embedding extraction (#20840)
The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This caused a tensor read out of bounds assertion.

Fixes embedding mode for Qwen3-VL-Embedding models.
2026-03-21 19:35:00 +02:00
Xuan-Son Nguyen
568aec82d2
docs : explicit about banning accounts that violates policy (#19593) 2026-03-21 15:50:16 +01:00
y198
2bcdddd5e3
fix(rpc): prevent division by zero in deserialize_tensor (#20712)
rpc : prevent division by zero in deserialize_tensor

When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. 

This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.

(Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here).

* style: remove trailing whitespace
2026-03-21 15:59:43 +02:00
Michael Wand
eac9c6ea83
Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730)
* Corrected convert script for NVFP4 naming and updated gguf constants

* Add mostly_MXFP4 to FileType

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* simplify

* set initial value [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-21 13:35:21 +02:00
Sigbjørn Skjæret
29b28a9824
ci : switch from pyright to ty (#20826)
* type fixes

* switch to ty

* tweak rules

* tweak more rules

* more tweaks

* final tweak

* use common import-not-found rule
2026-03-21 08:54:34 +01:00
Matt Corallo
cea560f483
Add shader count for Intel Arc Pro B60 (#20818) 2026-03-21 05:22:51 +01:00
Piotr Wilkin (ilintar)
b1c70e2e54
common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825) 2026-03-21 00:19:04 +01:00
shalinib-ibm
e6ec21e62f
ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator
disassembly within kernel's register context, preventing un-necessary
stask spills.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-03-21 07:11:45 +08:00
Georgi Gerganov
4cb7e0bd61
ai : limit runtime of the agent (#20816) 2026-03-20 20:31:25 +02:00
James O'Leary
149b2493c0
common : fix typo in debug log ('extracft' -> 'extract') (#20807) 2026-03-20 18:23:18 +01:00
Georgi Gerganov
b31b30f31d
ai : do not run bash commands in the prompt (#20810) 2026-03-20 19:06:33 +02:00
Victor Villar
58c81f7e81
model : fix Granite Hybrid type check for 7B.A1B (#20795)
* Check granite hybriid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B

* Use feed fwd dim instead of num of experts

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-20 15:16:09 +01:00
Xuan-Son Nguyen
fb78ad29bb
server: (doc) clarify in-scope and out-scope features (#20794)
* server: (doc) clarify in-scope and out-scope features

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 14:03:50 +01:00
Jeff Bolz
e06c3ab2bc
vulkan: change gated_delta_net to shard a column across a subgroup (#20662)
* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
2026-03-20 12:17:15 +01:00
Ruikai Peng
dc6592431b
context: zero output buffer on allocation (#20781)
* context: zero output buffer on allocation

Address GHSA-wqq9-25mr-rw76.

The logits output buffer allocated in output_reserve() uses
posix_memalign(), which does not zero memory. The buffer is only
written during decode when needs_raw_logits() returns true. When
backend samplers cover all output sequences, needs_raw_logits()
returns false and the buffer is never written, but
llama_get_logits() still returns a pointer to it, exposing stale
heap content.

Zero the buffer after allocation to prevent information disclosure
through the public logits API.

Found-by: Pwno

* Update src/llama-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 11:31:34 +02:00
Ruikai Peng
3adbef7776
model: assert nextn_predict_layers to prevent underflow (#20783)
Address GHSA-645x-v54x-34w8.

When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers
can underflow (unsigned wrap), which corrupts n_layer_kv_from_start.

Assert nextn_predict_layers immediately after parsing the GGUF key.

Found-by: Pwno
2026-03-20 10:17:58 +01:00
Georgi Gerganov
ab9d4c3678
server : improve mtmd ctx checkpoints (#20726)
* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00
hipudding
1af9dab32b
CANN: add BF16 support for core operators (#20152)
* CANN: add BF16 support for core operators

Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.

* CANN: skip NZ weight format for BF16 and add 310P compile guards

NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.
2026-03-20 17:08:39 +08:00
Seyoung Jeong
6d99b44c7e
docs : fix Metal backend op support status in ops.md (#20779)
Regenerate docs/ops/Metal.csv using test-backend-ops on Apple M5
and rebuild docs/ops.md via scripts/create_ops_docs.py.

Five ops were incorrectly marked as not supported () for Metal:
- DIAG:           
- POOL_1D:        
- SET:            
- SOLVE_TRI:      
- GATED_DELTA_NET:🟡 (partial, depends on head_size % 32)
2026-03-20 11:06:38 +02:00
Georgi Gerganov
464fd0e71f
ai : update find-related action (#20790)
* ai : update "related issues" prompt

* cont

* cont

* cont
2026-03-20 10:28:14 +02:00
Ruikai Peng
21c8045214
jinja : fix heap OOB read in value equality comparison (#20782)
Address GHSA-q9j6-4hhc-rq9p and GHSA-2q4c-9gq5-5vfp.

The three-iterator overload of std::equal in value_array_t::equivalent()
and value_object_t::equivalent() reads past the end of the shorter
container when comparing arrays or objects of different lengths.

Use the four-iterator overload (C++14) which checks both range lengths.

Found-by: Pwno
2026-03-20 07:15:17 +01:00
James O'Leary
c46583b86b
common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777)
* chat : fix out_of_range crash in throw path (#20424 regression)

#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-03-20 02:37:22 +01:00
Ben Racicot
c1b911654a
server: fix router mode deadlock on child crash and TOCTOU race in models_max (#20763)
Two bugs in `server_models::load()` that affect router mode reliability:

**Bug 1: Deadlock when child process crashes**

When a child process is killed (e.g., SIGKILL from OS code signature
validation), the monitoring thread deadlocks on `stopping_thread.join()`
because the stopping_thread's wait predicate (`is_stopping`) is never
satisfied — the model name was never inserted into `stopping_models`.
`update_status()` is never reached and the model stays stuck in LOADING
state permanently.

Fix: extend the stopping_thread's wait predicate to also wake when the
child process is no longer alive (`!subprocess_alive()`). When woken by
a dead child, the thread skips the shutdown sequence and returns
immediately. The original `stopping_models.erase()` logic is preserved
for normal unloads.

**Bug 2: TOCTOU race bypasses `--models-max` (ref #20137)**

`unload_lru()` is called outside the mutex, then `load()` acquires the
lock afterward. Under concurrent requests, multiple threads observe
capacity and all proceed to load, exceeding the limit.

Fix: re-check capacity under the lock after `unload_lru()` returns.
If another thread filled the slot in the window between `unload_lru()`
and the lock acquisition, reject with an error instead of silently
exceeding the limit.
2026-03-19 22:16:05 +01:00
Tomeamis
b739738dad
docs: Update server README to reflect PR #20297 (#20560) 2026-03-19 21:28:44 +01:00
Sundaram krishnan
a0bbcdd9b6
ggml: guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (#20767) 2026-03-19 21:36:23 +02:00
Georgi Gerganov
6c72646a61
ci : improve action for duplicate issue (#20772)
* ci : show thinking traces of the agent

* cont : increase thinking

* cont : remove agent files

* cont : move the model selection to the provider
2026-03-19 21:11:53 +02:00
Rail Chabdarov
340807273b
hip: Avoid compiler bug in RDNA code generation during debug builds on Windows (#20655) 2026-03-19 19:14:08 +01:00