* mtmd: llama.cpp DeepSeek-OCR support, initial commit
* loading SAM tensors
* mtmd: fix vision model processing
* DeepSeek-OCR CLIP-ViT model implementation
* mtmd: add DeepSeek-OCR LM support with standard attention
* mtmd: successfully run DeepSeek-OCR LM in llama-cli
* mtmd: fix RoPE type for the DeepSeek-OCR LM
* loading LM, testing vision model loading
* SAM warmup working
* SAM erroneous return corrected
* clip-vit: corrected cls_embd concat
* clip-vit: model convert qkv_proj split
* corrected combining of the image encoders' results
* fix: update callback for ffn_moe_weighted and add callback for attn_out in the deepseek2 model
* concat image_newline and image_separator tokens
* visual model warmup (technically) works
* window partitioning using standard ggml ops
* SAM implementation without CPU-only ops
* clip: fixed warnings
* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr
* mtmd: fix get_rel_pos
* mtmd: fixed the wrong scale factor in get_rel_pos
* image encoding technically works, but the output can't be checked since image decoding fails
* mtmd: minor changes
* mtmd: add native resolution support
* image encoding debugged; fixed issues mainly related to wrong config (n_patches etc.); configs still need to be corrected in the converter
* mtmd: correct token order
* dynamic resizing; changes concern PR https://github.com/sfallah/llama.cpp/pull/4
* mtmd: quick fix for token order
* mtmd: fix dangling pointer
* mtmd: SAM numerically works
* mtmd: debug CLIP-L (vit_pre_ln)
* mtmd: debug CLIP-L; first working DeepSeek-OCR model
* mtmd: add --dsocr-mode CLI argument for DeepSeek-OCR resolution control; all native resolution modes work
* mtmd: simplify SAM patch embedding
* mtmd: adapt Pillow image resizing function
* mtmd: simplify DeepSeek-OCR dynamic resolution preprocessing
* mtmd: remove --dsocr-mode argument
* mtmd: refactor code and remove unused helper functions
* mtmd: fix tensor names for image newlines and view separator
* clean up
* revert automatically removed spaces
* revert automatically removed spaces
* mtmd: fixed bad OCR check in Deepseek2 (LM)
* mtmd: support combined QKV projection in build_vit
* use common build_attn in SAM
* corrected code branch when flash-attn is disabled, enabling use of the --flash-attn option
* mtmd: minor fix
* minor formatting and style
* fixed flake8 lint issues
* minor editorconfig-check fixes
* minor editorconfig-check fixes
* mtmd: simplify get_rel_pos
* mtmd: make SAM hparams configurable
* mtmd: add detailed comments for resize_bicubic_pillow
* mtmd: fixed wrong input setting
* mtmd: convert model in FP16
* mtmd: minor fix
* mtmd: remove tweak to llama-mtmd-cli and the deepseek-ocr template
* fix: test-1.jpg OCR issue with the small (640) resolution; set min resolution to base (1024) and max to large (1280) for dynamic resolution
* minor: editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909; added a new option to tests.sh to disable flash-attn
* minor: editorconfig-check fix
* testing deepseek-ocr: quick-and-dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR
* quick and (potentially) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909
* refactoring: one single builder function and static helpers
* added deepseek-ocr test to tests.sh
* minor formatting fixes
* check with fixed expected results
* minor formatting
* editorconfig-check fix
* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042
* minor: added GLM-4.6V to big tests; added missing deps for the Python test
* convert: minor fix
* mtmd: format code
* convert: quick fix
* convert: quick fix
* minor Python formatting
* fixed merge build issue
* merge resolved: fixed issues in convert; tested several DeepSeek models
* minor fix
* minor
* Update convert_hf_to_gguf.py (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* removed clip_is_deepseekocr; removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize algo; simplified image preprocessing; removed/simplified debug functions
* cleaned up commented-out code
* fixing instability issues; reintroducing resize_bicubic_pillow
* use the F16 model for the deepseek-ocr test; ignore llama-arch test for deepseek-ocr
* rename fc_w --> mm_fc_w
* add links to OCR discussion
* cleaner loading code
* add missing .weight to some tensors
* add default Jinja template (to be used by the server)
* move test model to ggml-org
* roll back upscale change
* Update convert_hf_to_gguf.py (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)

---------

Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
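The commit log above mentions adding tiered model selection ("big"/"huge") and a flash-attention toggle to tests.sh. As a standalone sketch (a hypothetical helper mirroring the script's positional-argument handling, not part of tests.sh itself), the mode selection works like this:

```shell
# Standalone sketch of tests.sh argument parsing: "$1" selects the model
# tier, and either positional argument may be "flash_off" to disable
# flash attention. "huge" implies "big".
parse_modes() {
    local RUN_BIG_TESTS=false RUN_HUGE_TESTS=false FLASH_ATTN="on"
    if [ "${1:-}" = "big" ]; then
        RUN_BIG_TESTS=true
    fi
    if [ "${1:-}" = "huge" ]; then
        RUN_BIG_TESTS=true
        RUN_HUGE_TESTS=true
    fi
    if [ "${1:-}" = "flash_off" ] || [ "${2:-}" = "flash_off" ]; then
        FLASH_ATTN="off"
    fi
    echo "big=$RUN_BIG_TESTS huge=$RUN_HUGE_TESTS flash=$FLASH_ATTN"
}

parse_modes                 # → big=false huge=false flash=on
parse_modes big             # → big=true huge=false flash=on
parse_modes huge flash_off  # → big=true huge=true flash=off
```

So `./tests.sh` runs the small default suite, `./tests.sh big` and `./tests.sh huge` add the larger tiers, and `flash_off` can be combined with either.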
194 lines · 6.4 KiB · Bash · Executable file
#!/usr/bin/env bash

# make sure we are in the right directory
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"

#export LLAMA_CACHE="$SCRIPT_DIR/tmp"

set -eux

mkdir -p "$SCRIPT_DIR/output"

PROJ_ROOT="$SCRIPT_DIR/../.."
cd "$PROJ_ROOT"

# Check if the first argument is "big", then run the tests with big models
# This is useful if we're running the script on a larger machine, so we can test the big models
RUN_BIG_TESTS=false
if [ "${1:-}" = "big" ]; then
    RUN_BIG_TESTS=true
    echo "Include BIG models..."
fi

RUN_HUGE_TESTS=false
if [ "${1:-}" = "huge" ]; then
    RUN_HUGE_TESTS=true
    RUN_BIG_TESTS=true
    echo "Include BIG and HUGE models..."
fi

# Check if either argument is "flash_off", then disable flash attention
# This is useful to test that everything still works with flash attention off
FLASH_ATTN="on"
if [ "${2:-}" = "flash_off" ] || [ "${1:-}" = "flash_off" ]; then
    FLASH_ATTN="off"
    echo "Flash attention disabled..."
fi

###############

arr_prefix=()
arr_hf=()
arr_extra_args=()
arr_file=()

add_test_vision() {
    local hf=$1
    shift
    local extra_args=""
    if [ $# -gt 0 ]; then
        # shell-quote each extra argument so it survives the later eval
        extra_args=$(printf " %q" "$@")
    fi
    arr_prefix+=("[vision]")
    arr_hf+=("$hf")
    arr_extra_args+=("$extra_args")
    arr_file+=("test-1.jpeg")
}

add_test_audio() {
    local hf=$1
    shift
    local extra_args=""
    if [ $# -gt 0 ]; then
        extra_args=$(printf " %q" "$@")
    fi
    arr_prefix+=("[audio] ")
    arr_hf+=("$hf")
    arr_extra_args+=("$extra_args")
    arr_file+=("test-2.mp3")
}

add_test_vision "ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0"
add_test_vision "ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0"
add_test_vision "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test_vision "THUDM/glm-edge-v-5b-gguf:Q4_K_M" -p "name of the newspaper?<__media__>"
add_test_vision "second-state/Llava-v1.5-7B-GGUF:Q2_K" --chat-template vicuna
add_test_vision "cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M" --chat-template vicuna
add_test_vision "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test_vision "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test_vision "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test_vision "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test_vision "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/InternVL2_5-1B-GGUF:Q8_0"
add_test_vision "ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0"
add_test_vision "ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M"
add_test_vision "ggml-org/LFM2-VL-450M-GGUF:Q8_0"
add_test_vision "ggml-org/granite-docling-258M-GGUF:Q8_0"
add_test_vision "ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0"
add_test_vision "ggml-org/DeepSeek-OCR-GGUF:Q8_0" -p "Free OCR." --chat-template deepseek-ocr

add_test_audio "ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0"
add_test_audio "ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M"
add_test_audio "ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M"
add_test_audio "ggml-org/LFM2-Audio-1.5B-GGUF:Q8_0"

# to test the big models, run: ./tests.sh big
if [ "$RUN_BIG_TESTS" = true ]; then
    add_test_vision "ggml-org/pixtral-12b-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF" --chat-template mistral-v7
    add_test_vision "ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen3-VL-2B-Instruct-GGUF:Q8_0"
    add_test_vision "ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M"
    # add_test_vision "ggml-org/Qwen2.5-VL-32B-Instruct-GGUF:Q4_K_M" # does not work on my mac M3 Ultra
    # add_test_vision "ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF:Q4_K_M" # not always working
    add_test_vision "ggml-org/GLM-4.6V-Flash-GGUF:Q4_K_M" -p "extract all texts from this image"

    add_test_audio "ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M"
    add_test_audio "ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M"
fi

# to test the huge models, run: ./tests.sh huge
# this will run both the big and huge models
# huge models are > 32B parameters
if [ "$RUN_HUGE_TESTS" = true ]; then
    add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
fi

# these models always give the wrong answer, not sure why
# add_test_vision "ggml-org/SmolVLM-Instruct-GGUF:Q4_K_M"
# add_test_vision "ggml-org/SmolVLM-256M-Instruct-GGUF:Q8_0"
# add_test_vision "ggml-org/SmolVLM2-256M-Video-Instruct-GGUF:Q8_0"

# these models have broken chat templates, not usable
# add_test_vision "cmp-nct/Yi-VL-6B-GGUF:Q5_K"
# add_test_vision "guinmoon/MobileVLM-3B-GGUF:Q4_K_M" "deepseek"

###############

cmake --build build -j --target llama-mtmd-cli

arr_res=()

for i in "${!arr_hf[@]}"; do
    bin="llama-mtmd-cli"
    prefix="${arr_prefix[$i]}"
    hf="${arr_hf[$i]}"
    extra_args="${arr_extra_args[$i]}"
    inp_file="${arr_file[$i]}"

    echo "Running test with binary: $bin and HF model: $hf"
    echo ""
    echo ""

    cmd="$(printf %q "$PROJ_ROOT/build/bin/$bin") \
        -hf $(printf %q "$hf") \
        --image $(printf %q "$SCRIPT_DIR/$inp_file") \
        --temp 0 -n 128 \
        --flash-attn $(printf %q "$FLASH_ATTN") \
        ${extra_args}"

    # if extra_args does not contain -p, add a default prompt
    if ! [[ "$extra_args" =~ "-p" ]]; then
        cmd+=" -p \"what is the publisher name of the newspaper?\""
    fi

    output=$(eval "$cmd" 2>&1 | tee /dev/tty)

    echo "$output" > "$SCRIPT_DIR/output/$bin-$(echo "$hf" | tr '/' '-').log"

    # pass if the output either contains "new york" or both "men" and "walk"
    if echo "$output" | grep -iq "new york" \
        || (echo "$output" | grep -iq "men" && echo "$output" | grep -iq "walk")
    then
        result="$prefix \033[32mOK\033[0m: $hf"
    else
        result="$prefix \033[31mFAIL\033[0m: $hf"
    fi
    echo -e "$result"
    arr_res+=("$result")

    echo ""
    echo ""
    echo ""
    echo "#################################################"
    echo "#################################################"
    echo ""
    echo ""
done

set +x

for i in "${!arr_res[@]}"; do
    echo -e "${arr_res[$i]}"
done
echo ""
echo "Output logs are saved in $SCRIPT_DIR/output"
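The `add_test_*` helpers in the script above store per-model extra arguments as a single string quoted with `printf " %q"`, and the main loop later reassembles the full command with `eval`. A standalone sketch of that quoting round-trip (illustration only, not part of tests.sh; requires bash for `%q`):

```shell
# `printf " %q"` shell-quotes each argument so that a later `eval` splits
# the string back into the original arguments, with embedded spaces and
# metacharacters intact.
quoted=$(printf " %q" -p "what is in this image?")
eval "set -- $quoted"
echo "$#"   # → 2
echo "$2"   # → what is in this image?
```

This is why arguments such as `-p "name of the newspaper?<__media__>"` survive being flattened into `arr_extra_args` and re-expanded inside `cmd`.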