100 images analyzed · June 30, 2026 · on a GTX 1070

Gemma-4-12B Multimodal on an 8 GB Pascal GPU

Google DeepMind’s latest multimodal model — text, images, audio, and video — running locally on a graphics card from 2016. It fits in 6.86 GB, sees exactly what you show it, and the hardest part is finding the three flags that make it work.

Model: Gemma-4-12B-it Q4_K_M
GPU: GTX 1070 · 8 GB · Pascal
Text gen: ~15.5 tok/s
Image encode: ~18 ms
Model size: 6.86 GB

What it does for you

The only model on Pascal that does native vision

Gemma-4-12B is a 12-billion-parameter multimodal model (text, images, audio, video) that fits in 6.86 GB. Analyze storyboards, describe images, transcribe audio, or just chat, all locally on your own card. It’s Apache 2.0 licensed and never sends your data anywhere. Of the models benchmarked in this guide, it is the only one that fits on an 8 GB Pascal GPU and sees images natively.

One command attaches an image: gemma4 photo.jpg "Describe this". Here’s what that looks like on real frames.

What the model sees

We tested Gemma-4-12B-it on a 100-frame storyboard called “The Tapestry of Light” — a Disney Pixar-style animated short generated as individual 640×640 frames on the same GTX 1070 using Krea 2 Turbo. Each frame was analyzed with: “Describe this storyboard frame: the character(s), setting, action, and mood.”

Frame 1 — Opening shot

What the model saw: “Mickey Mouse holding a glowing, magical scroll in a castle hall. He looks joyful and surprised, as if unveiling something wondrous: the opening shot of the story.”

Frame 2 — The second character

What the model saw: “Minnie Mouse running forward in a palace with golden lighting. She looks joyful and energetic, her dress sparkling.”

Frame 3 — Expanding the cast

What the model saw: “Daisy Duck in a sailor outfit, eyes wide, beak open in surprise or excitement. The setting continues the grand hall theme.”

Key capability: Gemma-4-12B-it correctly identified distinct Disney characters across all 100 frames — Mickey, Minnie, Daisy, Donald, Goofy, and others — despite each frame being a different pose and composition. It maintained narrative continuity across the full sequence.
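
A storyboard pass like this one can be scripted as a simple loop. The sketch below is not the script used for the test above. It assumes the frames are PNG files in one directory and that the helper created by the setup script (~/models/gemma4-chat.sh) accepts an image path followed by a prompt, the same way the gemma4 alias does. The function name describe_frames and the RUNNER variable are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical batch loop: one text file of description per storyboard frame.
# RUNNER is an assumption: the helper the setup script creates, called as
#   $RUNNER <image> "<prompt>"
RUNNER="${RUNNER:-$HOME/models/gemma4-chat.sh}"
PROMPT="Describe this storyboard frame: the character(s), setting, action, and mood."

describe_frames() {
  local frames_dir="$1" out_dir="$2" frame name
  mkdir -p "$out_dir"
  for frame in "$frames_dir"/*.png; do
    [ -e "$frame" ] || continue            # the glob matched nothing
    name="$(basename "${frame%.png}")"
    # One output file per frame, e.g. frame_001.png -> frame_001.txt.
    # stdin is closed so the helper cannot fall into interactive mode.
    "$RUNNER" "$frame" "$PROMPT" > "$out_dir/$name.txt" < /dev/null \
      || echo "failed: $frame" >&2
  done
}
```

Usage would be something like describe_frames ./frames ./descriptions, after which each frame has a matching .txt file that can be read or diffed in order.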

Setup

Get running in one command

This script does everything: installs system dependencies, detects or installs CUDA, clones and builds the latest upstream llama.cpp from source, downloads the model and projector file, creates a gemma4 alias, and runs a test prompt.

Linux (Ubuntu/Debian/Fedora):

curl -sL https://ndgold.com/guides/gemma4-pascal/setup.sh | bash

Or download manually: setup.sh

What the script does: Installs git, cmake, build-essential → finds CUDA in PATH or conda → clones upstream llama.cpp → builds with CUDA (2 targets, 3-10 min) → downloads 6.86 GB model + 168 MB mmproj → creates ~/models/gemma4-chat.sh helper → adds gemma4 alias to ~/.bashrc → runs a test prompt.

The three commands you’ll use

📝 Text prompt

gemma4 "Your question here"

Runs a single-turn text prompt against the model.

🖼️ Analyze an image

gemma4 photo.jpg "Describe this"

Passes the image through the gemma4uv projector and runs vision inference.

💬 Interactive chat

gemma4

Launches a REPL. Type /image path.png mid-conversation to add an image.
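
The helper that the setup script installs is not reproduced in this guide, but the dispatch between the three modes could look roughly like the following. This is a sketch under assumptions: the function name gemma4_cmd is invented, the image test is a plain file-extension check, and the function only prints the command it would run. The binaries, model paths, and flags are the ones used in the manual invocation in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher: prints the llama.cpp command for each gemma4 mode.
BIN="$HOME/llama.cpp/build/bin"
MODEL="$HOME/models/gemma-4-12B-it-Q4_K_M.gguf"
MMPROJ="$HOME/models/mmproj-gemma-4-12B-it-BF16.gguf"
FLAGS="-ngl 99 -fa on --no-kv-offload --jinja"

gemma4_cmd() {
  case "${1:-}" in
    "")                   # no arguments: interactive chat, images via /image
      echo "$BIN/llama-mtmd-cli --mmproj $MMPROJ -m $MODEL $FLAGS" ;;
    *.jpg|*.jpeg|*.png)   # first argument looks like an image: vision
      echo "$BIN/llama-mtmd-cli --mmproj $MMPROJ -m $MODEL $FLAGS --image $1 -p \"${2:-Describe this image}\" -n 300 -e" ;;
    *)                    # anything else: one-shot text prompt
      echo "$BIN/llama-cli -m $MODEL $FLAGS -p \"$1\" -n 200 -e" ;;
  esac
}
```

Printing instead of executing keeps the sketch safe to paste and easy to inspect; a real helper would run the command rather than echo it.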

Manual invocation (if you skip the script)

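# Prepend conda's libraries without discarding an existing LD_LIBRARY_PATH
export LD_LIBRARY_PATH="$HOME/miniconda/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"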

# Text only. In upstream llama.cpp, -e is short for --escape (it expands \n, \t
# and similar sequences in the prompt). If llama-cli stays in the interactive
# REPL after answering, add --single-turn (-st) so it exits after one reply.
~/llama.cpp/build/bin/llama-cli \
  -m ~/models/gemma-4-12B-it-Q4_K_M.gguf \
  -ngl 99 -fa on --no-kv-offload --jinja \
  -p "Hello!" -n 200 -e

# Vision
~/llama.cpp/build/bin/llama-mtmd-cli \
  --mmproj ~/models/mmproj-gemma-4-12B-it-BF16.gguf \
  -m ~/models/gemma-4-12B-it-Q4_K_M.gguf \
  -ngl 99 -fa on --no-kv-offload --jinja \
  --image photo.png \
  -p "Describe this image" -n 300 -e

On Windows, the easiest path is LM Studio — no compilation needed. See the FAQ.

Performance

What you actually get on a 2016 GPU

These are real measurements from a GTX 1070 (8 GB, 256 GB/s, CC 6.1) running the upstream llama.cpp build with the optimal 3-flag configuration.

Mode | Prompt processing | Text generation | Settings
All layers on GPU + flash attention, KV cache in system RAM | 310 tok/s | 15.5 tok/s | -ngl 99 -fa on --no-kv-offload
All layers on GPU + flash attention, KV cache on GPU | 325 tok/s | n/a (OOM at context) | -ngl 99 -fa on
Partial offload (30 layers) | 270 tok/s | 7.3 tok/s | -ngl 30

Benchmark comparison on the same card

Model | Quant | VRAM | Text gen | Vision
Gemma-4-12B-it | Q4_K_M | 6.86 GB | 15.5 tok/s | ✅ Native (text + image + audio + video)
Qwen3-8B | Q5_K_M | 5.44 GB | 26.1 tok/s | ❌ (needs the VL variant)
Bonsai 8B (ternary) | Q4_0-lossless | 4.29 GB | 37.1 tok/s | ❌ Text only
Qwen3-4B-Thinking | Q6_K | 3.07 GB | 36.3 tok/s | ❌ Text only

Note: Gemma-4-12B is the only model in this comparison that handles images, audio, and video natively. Its encoder-free architecture means it doesn’t need a separate vision encoder — images pass directly into the transformer’s embedding space through a lightweight linear projection.

FAQ

The flags, the errors, and the Pascal traps

Gemma-4-12B is a very new model (May 2026) using a new gemma4uv projector that older builds don’t recognize. On an RTX 4090 you grab a pre-built binary and go. On Pascal, every step has an edge case — here they all are.

The 3-flag puzzle: why -ngl 99 -fa on --no-kv-offload?

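The model is 6.86 GB on an 8 GB card, which leaves very little room for the KV cache and compute buffers. This is the combination that keeps every layer on the GPU without running out of memory:

-ngl 99 -fa on --no-kv-offload

-ngl 99 offloads every layer to the GPU, -fa on enables flash attention, and --no-kv-offload keeps the KV cache in system RAM instead of VRAM. With all layers offloaded, drop either of the other two and you get cudaMalloc failed: out of memory. Lowering -ngl instead (partial offload) avoids the OOM but, as the performance table shows, runs at less than half the speed. On a 3060 Ti you can skip --no-kv-offload; on Pascal you cannot, because there’s no managed-memory fallback to absorb the KV cache.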

unknown projector type: gemma4uv

Your llama.cpp build predates the gemma4uv projector. The PrismML fork (used for Bonsai ternary models) doesn’t support it either. You must use upstream ggml-org/llama.cpp at commit 931eb37 (May 2026) or later. Clone fresh and rebuild:

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
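
The rebuild step after the clone might look like the sketch below. It assumes the source tree lives in ~/llama.cpp (the path used by the manual invocation) and that only the two binaries this guide uses are needed. The function names cc_to_arch and build_llama_cpp are illustrative; -DGGML_CUDA=ON and -DCMAKE_CUDA_ARCHITECTURES are the standard CMake options for a CUDA build of llama.cpp.

```shell
#!/usr/bin/env bash
# Sketch: derive CMAKE_CUDA_ARCHITECTURES from the GPU's compute capability,
# then build only llama-cli and llama-mtmd-cli.

# "6.1" -> "61" (CMake wants the compute capability without the dot)
cc_to_arch() {
  printf '%s\n' "${1//./}"
}

build_llama_cpp() {
  local src="${1:-$HOME/llama.cpp}" cc arch
  # compute_cap is an nvidia-smi query field on recent drivers; fall back to
  # 6.1 (GTX 1070) if the query is unavailable.
  cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)"
  arch="$(cc_to_arch "${cc:-6.1}")"
  cmake -S "$src" -B "$src/build" -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$arch" &&
    cmake --build "$src/build" --config Release -j "$(nproc)" \
      --target llama-cli llama-mtmd-cli
}
```

Calling build_llama_cpp with no argument builds ~/llama.cpp; pass a different path if the clone lives elsewhere. Setting the architecture explicitly is what avoids the silent CPU fallback described later in this FAQ.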

cudaMalloc failed: out of memory

Almost always a missing flag — use all three: -ngl 99 -fa on --no-kv-offload. If the flags are correct and it still OOMs, you’re likely hitting leftover GPU state from a prior run (see the memory-leak item below).
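
One way to tell the two causes apart is to check free VRAM before launching. The sketch below parses nvidia-smi's CSV output; the function names and the 7200 MiB threshold are assumptions chosen for illustration, not measured requirements.

```shell
#!/usr/bin/env bash
# Sketch: refuse to start if something else is already holding VRAM.

# Input: one "used, total" line in MiB, as printed by
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# Output: free MiB.
vram_free_mib() {
  awk -F', *' '{ print $2 - $1 }' <<< "$1"
}

vram_ok() {
  local need_mib="${1:-7200}" line   # 7200 MiB is an assumed threshold
  line="$(nvidia-smi --query-gpu=memory.used,memory.total \
            --format=csv,noheader,nounits | head -n 1)"
  [ "$(vram_free_mib "$line")" -ge "$need_mib" ]
}
```

For example, vram_ok || echo "GPU memory is already in use" before a run. If the check fails on an otherwise idle machine, the flags are not the problem.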

Model loads but outputs garbage (the --jinja requirement)

Gemma-4 uses a custom Jinja2 chat template. Without the --jinja flag, the model loads but produces garbage or nothing at all. This isn’t mentioned in the standard llama.cpp help. Add --jinja to every invocation.

CUDA error: out of memory on the second run (GPU memory leaks)

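A failed or interrupted run can leave a process behind that still holds VRAM. Until it is gone, the next run fails even with correct flags. Clear it between invocations. Be aware that this command kills every process that has an NVIDIA device open, which includes a desktop session running on the same card:

fuser /dev/nvidia* 2>/dev/null | tr -s ' ' '\n' | sort -u | xargs -r kill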
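
A narrower alternative is to kill only llama.cpp processes that nvidia-smi lists as compute clients. This is a sketch: the function names are invented, and the name filter assumes the binaries are called llama-cli, llama-mtmd-cli, or llama-server.

```shell
#!/usr/bin/env bash
# Sketch: kill only llama.cpp processes that hold GPU memory.

# Reads "pid, process_name" lines (the csv,noheader output of
# nvidia-smi --query-compute-apps=pid,process_name) and prints matching PIDs.
llama_gpu_pids() {
  awk -F', *' '$2 ~ /llama-(cli|mtmd-cli|server)$/ { print $1 }'
}

kill_stale_llama() {
  nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader |
    llama_gpu_pids | xargs -r kill
}
```

Running kill_stale_llama leaves the display server and any unrelated CUDA jobs alone, at the cost of missing a stuck process with an unexpected name.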

Why is each step ~15 ms slower than on Turing? (CUDA graphs)

llama.cpp explicitly disables CUDA graphs on architecture 6.1, logging "disabling CUDA graphs due to GPU architecture". This adds ~15 ms overhead per inference step compared to Turing or Ampere. There’s no fix — it’s a hardware limitation of Pascal — but it’s already accounted for in the 15.5 tok/s figure.

How do I confirm it’s actually using the GPU?

If llama.cpp was built without -DCMAKE_CUDA_ARCHITECTURES="61", nothing errors out: the model silently runs on the CPU, just slowly. To confirm the GPU is doing the work, watch VRAM and utilization in a second terminal while a prompt is running:

watch -n 0.5 nvidia-smi

You should see the llama-cli/llama-mtmd-cli process holding ~6.9 GB of VRAM and GPU-Util spiking above 0%. If VRAM usage stays near zero and only CPU cores are busy, your build is running on the CPU — rebuild with the architecture flag. At load time, llama.cpp should also log offloaded 49/49 layers to GPU.
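
The load-time log line can also be checked automatically. The sketch below assumes the log was saved to a file (for example by redirecting stderr with 2> run.log, on the assumption that llama.cpp writes its log there) and that the line has the form quoted above; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: check a saved llama.cpp log for full GPU offload.
# Succeeds only if the "offloaded N/M layers to GPU" line has N equal to M.
all_layers_offloaded() {
  local log="$1" counts
  counts="$(grep -oE 'offloaded [0-9]+/[0-9]+ layers to GPU' "$log" \
              | head -n 1 | grep -oE '[0-9]+/[0-9]+' || true)"
  [ -n "$counts" ] && [ "${counts%/*}" = "${counts#*/}" ]
}
```

For instance, all_layers_offloaded run.log && echo "fully on GPU". A missing line or a partial count such as 30/49 makes the check fail.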

Build fails: nvcc not found

The CUDA toolkit is missing. Install it via conda:

conda install -c conda-forge cuda-toolkit

Model download fails mid-way

Usually a stale HuggingFace cache. Clear it and re-run the download:

rm -rf ~/.cache/huggingface/hub/models--lmstudio-community--gemma-4-12B-it-GGUF/
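
After re-downloading, a quick sanity check can catch a truncated or wrong file before llama.cpp tries to load it. The sketch relies on the fact that GGUF files begin with the four ASCII bytes "GGUF"; the function name and the idea of passing a minimum size are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a downloaded GGUF before loading it.
# A valid GGUF file starts with the 4 ASCII bytes "GGUF"; an error page saved
# in its place fails that check, and a truncated file fails the size check.
looks_like_gguf() {
  local file="$1" min_bytes="${2:-0}" size
  [ -f "$file" ] || return 1
  [ "$(head -c 4 "$file")" = "GGUF" ] || return 1
  size="$(wc -c < "$file")"
  [ "$size" -ge "$min_bytes" ]
}
```

For the model in this guide, something like looks_like_gguf ~/models/gemma-4-12B-it-Q4_K_M.gguf 6000000000 would do, where 6000000000 is an assumed lower bound just under the stated 6.86 GB.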

Windows? Use LM Studio instead

The easiest path on Windows is LM Studio — no compilation needed:

  1. Install LM Studio
  2. Search lmstudio-community/gemma-4-12B-it-GGUF
  3. Download the Q4_K_M quant (6.86 GB)
  4. In Model Settings: GPU Offload = Max, Flash Attention = ON, KV Cache Offload = ON
  5. Chat normally or click 📎 to attach images

Why this matters

Vision AI doesn’t require a datacenter

Gemma-4-12B is Apache 2.0 licensed, runs entirely on your hardware, and processes images without sending your data anywhere. On an 8 GB Pascal GPU it generates ~15 tokens per second and encodes images in 18 milliseconds. That’s fast enough to analyze a 100-frame storyboard in under an hour.
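
The "under an hour" figure can be sanity-checked with a back-of-envelope calculation. The sketch assumes every frame uses the full 300-token budget from the vision command (-n 300) and counts only text generation at 15.5 tok/s, so model loading and prompt processing come on top.

```shell
#!/usr/bin/env bash
# Back-of-envelope: minutes of text generation for a batch of frames.
# Ignores model load and prompt processing, so real runs take longer.
estimate_minutes() {
  local frames="$1" tokens_per_frame="$2" tok_per_sec="$3"
  awk -v f="$frames" -v t="$tokens_per_frame" -v s="$tok_per_sec" \
    'BEGIN { printf "%.1f\n", f * t / s / 60 }'
}

estimate_minutes 100 300 15.5   # prints 32.3
```

Roughly 32 minutes of pure generation leaves headroom for per-frame startup overhead within the hour.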

The real cost wasn’t the GPU — it was finding the three undocumented flags. This guide exists so you don’t have to.
