server : add Anthropic Messages API support (#17570)

* server : add Anthropic Messages API support

* remove -@pytest.mark.slow from tool calling/jinja tests

* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py

* server : removed redundant n field logic in anthropic_params_from_json

* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()

* server : refactor Anthropic API to use OAI conversion

* make sure basic tests always go first

* clean up

* clean up api key check, add test

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Fredrik Hultin 2025-11-28 12:57:04 +01:00 committed by GitHub
parent ff55414c42
commit ddf9f94389
GPG key ID: B5690EEEBB952194
11 changed files with 1553 additions and 70 deletions


@@ -7,6 +7,7 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
**Features:**
* LLM inference of F16 and quantized models on GPU and CPU
* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
* [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) compatible chat completions
* Reranking endpoint (https://github.com/ggml-org/llama.cpp/pull/9510)
* Parallel decoding with multi-user support
* Continuous batching
@@ -1352,6 +1353,77 @@ See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-r
}'
```

### POST `/v1/messages`: Anthropic-compatible Messages API

Given a list of `messages`, returns the assistant's response. Streaming is supported via Server-Sent Events. While no strong claims of compatibility with the Anthropic API spec are made, in our experience it suffices to support many apps.

*Options:*

See the [Anthropic Messages API documentation](https://docs.anthropic.com/en/api/messages). Tool use requires the `--jinja` flag.

`model`: Model identifier (required)

`messages`: Array of message objects with `role` and `content` (required)

`max_tokens`: Maximum number of tokens to generate (default: 4096)

`system`: System prompt, as a string or an array of content blocks

`temperature`: Sampling temperature, 0-1 (default: 1.0)

`top_p`: Nucleus sampling threshold (default: 1.0)

`top_k`: Top-k sampling

`stop_sequences`: Array of stop sequences

`stream`: Enable streaming (default: false)

`tools`: Array of tool definitions (requires `--jinja`)

`tool_choice`: Tool selection mode (`{"type": "auto"}`, `{"type": "any"}`, or `{"type": "tool", "name": "..."}`)

*Examples:*

```shell
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: your-api-key" \
-d '{
"model": "gpt-4",
"max_tokens": 1024,
"system": "You are a helpful assistant.",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```
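The `tools` and `tool_choice` options can be exercised with a small Python sketch like the one below. The `get_weather` tool and its schema are made-up illustrations, and the model name, server URL, and API key are copied from the curl example above; the server is assumed to have been started with `--jinja`.

```python
import json
import urllib.request

def build_tool_request(prompt: str) -> dict:
    """Build a /v1/messages body offering the model one (illustrative) tool."""
    return {
        "model": "gpt-4",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "input_schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
        # {"type": "any"} forces a tool call; {"type": "auto"} lets the model decide.
        "tool_choice": {"type": "any"},
    }

if __name__ == "__main__":
    # Requires a running llama-server started with --jinja.
    req = urllib.request.Request(
        "http://localhost:8080/v1/messages",
        data=json.dumps(build_tool_request("Weather in Oslo?")).encode(),
        headers={"Content-Type": "application/json", "x-api-key": "your-api-key"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.dumps(json.load(resp), indent=2))
```

On a tool call, the response's `content` array should contain a `tool_use` block whose `input` matches the tool's `input_schema`.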

### POST `/v1/messages/count_tokens`: Token Counting

Counts the number of tokens in a request without generating a response. Accepts the same parameters as `/v1/messages`; the `max_tokens` parameter is not required.

*Example:*

```shell
curl http://localhost:8080/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
```

*Response:*

```json
{"input_tokens": 10}
```
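Back on `/v1/messages`, when `stream` is set to `true` the response body is a Server-Sent Events stream of Anthropic-style events; the generated text arrives as `text_delta` payloads inside `content_block_delta` events. A minimal, network-free sketch of collecting that text (`extract_text` is a hypothetical helper name; fetching the stream is left to the caller):

```python
import json

def extract_text(sse_body: str) -> str:
    """Concatenate text_delta payloads from content_block_delta SSE events."""
    parts = []
    for line in sse_body.splitlines():
        if not line.startswith("data: "):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data: "):])
        if event.get("type") == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
    return "".join(parts)

if __name__ == "__main__":
    sample = (
        'event: content_block_delta\n'
        'data: {"type":"content_block_delta","index":0,'
        '"delta":{"type":"text_delta","text":"Hello"}}\n'
    )
    print(extract_text(sample))  # Hello
```

A real client would also watch for `message_stop` to know when the stream ends, and for `tool_use` content blocks when tools are in play.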

## More examples

### Interactive mode