
100 images analyzed · June 30, 2026 · on a GTX 1070

Gemma-4-12B Multimodal on an 8 GB Pascal GPU

Google DeepMind’s latest multimodal model — text, images, audio, and video — running locally on a graphics card from 2016. It fits in 6.86 GB, sees exactly what you show it, and the hardest part is finding the three flags that make it work.

Model: Gemma-4-12B-it Q4_K_M

GPU: GTX 1070 · 8 GB · Pascal

Text gen: ~15.5 tok/s

Image encode: ~18 ms

Model size: 6.86 GB


What it does for you

The only model on Pascal that does native vision

Gemma-4-12B is a 12-billion-parameter multimodal model — text, images, audio, video — that fits in 6.86 GB. Analyze storyboards, describe images, transcribe audio, or just chat, all locally on your own card. It’s Apache 2.0 licensed and never sends your data anywhere. No other model that fits on an 8 GB Pascal GPU sees images natively.

One command attaches an image: gemma4 photo.jpg "Describe this". Here’s what that looks like on real frames.

What the model sees

We tested Gemma-4-12B-it on a 100-frame storyboard called “The Tapestry of Light” — a Disney Pixar-style animated short generated as individual 640×640 frames on the same GTX 1070 using Krea 2 Turbo. Each frame was analyzed with: “Describe this storyboard frame: the character(s), setting, action, and mood.”

Frame 1 — Opening shot

What the model saw: “Mickey Mouse holding a glowing, magical scroll in a castle hall. He looks joyful and surprised, as if unveiling something wonderous — the opening shot of the story.”

Frame 2 — The second character

What the model saw: “Minnie Mouse running forward in a palace with golden lighting. She looks joyful and energetic, her dress sparkling.”

Frame 3 — Expanding the cast

What the model saw: “Daisy Duck in a sailor outfit, eyes wide, beak open in surprise or excitement. The setting continues the grand hall theme.”

Key capability: Gemma-4-12B-it correctly identified distinct Disney characters across all 100 frames — Mickey, Minnie, Daisy, Donald, Goofy, and others — despite each frame being a different pose and composition. It maintained narrative continuity across the full sequence.
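
To run the same prompt over a whole folder of frames, a loop like the one below is one way to do it. It is a sketch, not the script used for the test above. The folder layout and the .png extension are assumptions. It calls ~/models/gemma4-chat.sh directly because shell aliases such as gemma4 are not expanded inside scripts, and it assumes the helper accepts the same arguments as the alias.

```shell
# Hypothetical batch runner: writes one description file per frame.
# GEMMA4 defaults to the helper the setup script is described as creating.
GEMMA4="${GEMMA4:-$HOME/models/gemma4-chat.sh}"
PROMPT='Describe this storyboard frame: the character(s), setting, action, and mood.'

describe_frames() {
  local in_dir="$1" out_dir="$2" frame name
  mkdir -p "$out_dir" || return 1
  for frame in "$in_dir"/*.png; do
    [ -e "$frame" ] || continue                # glob matched nothing
    name="$(basename "$frame" .png)"
    [ -s "$out_dir/$name.txt" ] && continue    # already described: skip
    # Write to a .part file first so a failed run never looks finished.
    if "$GEMMA4" "$frame" "$PROMPT" > "$out_dir/$name.part" < /dev/null; then
      mv "$out_dir/$name.part" "$out_dir/$name.txt"
    else
      rm -f "$out_dir/$name.part"
      echo "failed: $frame" >&2
    fi
  done
}

# Usage: describe_frames frames descriptions
```

Because finished frames are skipped, re-running the function after an interruption picks up where it stopped.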

Setup

Get running in one command

This script does everything: installs system dependencies, detects or installs CUDA, clones and builds the latest upstream llama.cpp from source, downloads the model and projector file, creates a gemma4 alias, and runs a test prompt.

Linux (Ubuntu/Debian/Fedora):

curl -sL https://ndgold.com/guides/gemma4-pascal/setup.sh | bash

Or download manually: setup.sh

What the script does: Installs git, cmake, build-essential → finds CUDA in PATH or conda → clones upstream llama.cpp → builds with CUDA (2 targets, 3-10 min) → downloads 6.86 GB model + 168 MB mmproj → creates ~/models/gemma4-chat.sh helper → adds gemma4 alias to ~/.bashrc → runs a test prompt.

The three commands you’ll use

📝 Text prompt

gemma4 "Your question here"

Runs a single-turn text prompt against the model.

🖼️ Analyze an image

gemma4 photo.jpg "Describe this"

Passes the image through the gemma4uv projector and runs vision inference.

💬 Interactive chat

gemma4

Launches a REPL. Type /image path.png mid-conversation to add an image.

Manual invocation (if you skip the script)

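# Assumes the CUDA runtime libraries live in a Miniconda install at ~/miniconda.
# Prepend instead of overwriting so existing library paths keep working.
export LD_LIBRARY_PATH="$HOME/miniconda/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Text only. -st (--single-turn) makes llama-cli answer the -p prompt once and
# exit. Without it, llama-cli stays in its interactive REPL after answering.
# (-e only controls escape processing such as \n inside the prompt.)
~/llama.cpp/build/bin/llama-cli \
  -m ~/models/gemma-4-12B-it-Q4_K_M.gguf \
  -ngl 99 -fa on --no-kv-offload --jinja \
  -p "Hello!" -n 200 -st

# Vision. llama-mtmd-cli runs one turn and exits when both --image and -p are given.
~/llama.cpp/build/bin/llama-mtmd-cli \
  --mmproj ~/models/mmproj-gemma-4-12B-it-BF16.gguf \
  -m ~/models/gemma-4-12B-it-Q4_K_M.gguf \
  -ngl 99 -fa on --no-kv-offload --jinja \
  --image photo.png \
  -p "Describe this image" -n 300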

On Windows, the easiest path is LM Studio — no compilation needed. See the FAQ.

Performance

What you actually get on a 2016 GPU

These are real measurements from a GTX 1070 (8 GB, 256 GB/s, CC 6.1) running the upstream llama.cpp build with the optimal 3-flag configuration.

Mode	Prompt processing	Text generation	Settings
Full GPU + flash-attn, KV cache kept in system RAM	310 tok/s	15.5 tok/s	`-ngl 99 -fa on --no-kv-offload`
Full GPU + flash-attn, KV cache on GPU	325 tok/s	n/a (OOM at context)	`-ngl 99 -fa on`
Partial offload (30 layers)	270 tok/s	7.3 tok/s	`-ngl 30`

Benchmark comparison on the same card

Model	Quant	VRAM	Text gen	Vision
Gemma-4-12B-it	Q4_K_M	6.86 GB	15.5 t/s	✅ Native (text + image + audio + video)
Qwen3-8B	Q5_K_M	5.44 GB	26.1 t/s	❌ (needs VL variant)
Bonsai 8B (ternary)	Q4_0-lossless	4.29 GB	37.1 t/s	❌ Text only
Qwen3-4B-Thinking	Q6_K	3.07 GB	36.3 t/s	❌ Text only

Note: Gemma-4-12B is the only model in this comparison that handles images, audio, and video natively. Its encoder-free architecture means it doesn’t need a separate vision encoder — images pass directly into the transformer’s embedding space through a lightweight linear projection.

FAQ

The flags, the errors, and the Pascal traps

Gemma-4-12B is a very new model (May 2026) using a new gemma4uv projector that older builds don’t recognize. On an RTX 4090 you grab a pre-built binary and go. On Pascal, every step has an edge case — here they all are.

The 3-flag puzzle: why -ngl 99 -fa on --no-kv-offload?

The model is 6.86 GB on an 8 GB card. Without all three of these flags, it OOMs immediately:

-ngl 99 -fa on --no-kv-offload

Drop any one and you get cudaMalloc failed: out of memory. On a 3060 Ti you can skip --no-kv-offload — on Pascal you cannot, because there’s no managed-memory fallback to absorb the KV cache.

unknown projector type: gemma4uv

Your llama.cpp build predates the gemma4uv projector. The PrismML fork (used for Bonsai ternary models) doesn’t support it either. You must use upstream ggml-org/llama.cpp at commit 931eb37 or newer (May 2026 or later). Clone fresh and rebuild:

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
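
The clone is only the first half of "clone fresh and rebuild". A build along these lines should produce the two binaries used in this guide. Treat it as a sketch: the function name is made up, the target directory and the choice of the llama-cli and llama-mtmd-cli targets are assumptions about what the setup script does, and 61 is the compute capability of the GTX 1070.

```shell
# Hypothetical helper: fresh clone plus CUDA build for a Pascal (CC 6.1) card.
build_llama_cpp() {
  local src="${1:-$HOME/llama.cpp}" arch="${2:-61}"
  if [ -e "$src" ]; then
    echo "refusing to overwrite $src; move it aside for a fresh clone" >&2
    return 1
  fi
  git clone --depth 1 https://github.com/ggml-org/llama.cpp.git "$src" || return 1
  # GGML_CUDA enables the CUDA backend; the architecture flag pins Pascal kernels.
  cmake -S "$src" -B "$src/build" \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="$arch" \
    -DCMAKE_BUILD_TYPE=Release || return 1
  # Build only the two binaries this guide uses.
  cmake --build "$src/build" --parallel "$(nproc)" --target llama-cli llama-mtmd-cli
}

# Usage: build_llama_cpp            (clones into ~/llama.cpp, builds for CC 6.1)
```

Passing the architecture explicitly is what avoids the silent-CPU trap described further down.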

cudaMalloc failed: out of memory

Almost always a missing flag — use all three: -ngl 99 -fa on --no-kv-offload. If the flags are correct and it still OOMs, you’re likely hitting leftover GPU state from a prior run (see the memory-leak item below).

Model loads but outputs garbage (the --jinja requirement)

Gemma-4 uses a custom Jinja2 chat template. Without the --jinja flag, the model loads but produces garbage or nothing at all. The llama.cpp help output doesn’t flag this as a requirement for Gemma-4. Add --jinja to every invocation.

CUDA error: out of memory on the second run (GPU memory leaks)

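A failed or interrupted run can leave a llama.cpp process alive and still holding VRAM, so the next run fails even with the correct flags. Kill the leftovers between invocations. Note that this command kills every process that has an NVIDIA device open, including a desktop session running on the same GPU: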

kill $(fuser /dev/nvidia* 2>/dev/null | tr " " "\n" | sort -u)
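
The fuser command above targets everything that has an NVIDIA device open. A narrower option, sketched here, asks nvidia-smi for compute processes and keeps only llama.cpp binaries. The function names are made up for this example; the nvidia-smi query fields are standard.

```shell
# List compute processes on the GPU as "pid, process name, memory" lines.
gpu_procs() {
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
}

# Read gpu_procs-style lines on stdin and print the PIDs of llama.cpp binaries.
llama_pids() {
  awk -F', *' '$2 ~ /llama-(cli|mtmd-cli|server)/ { print $1 }'
}

# Usage:
#   gpu_procs                                  # see what is holding VRAM
#   gpu_procs | llama_pids | xargs -r kill     # kill only the llama.cpp leftovers
```

If gpu_procs prints nothing and the next run still fails, the problem is more likely a missing flag than leftover state.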

Why is each step ~15 ms slower than on Turing? (CUDA graphs)

llama.cpp explicitly disables CUDA graphs on architecture 6.1, logging "disabling CUDA graphs due to GPU architecture". This adds ~15 ms overhead per inference step compared to Turing or Ampere. There’s no fix — it’s a hardware limitation of Pascal — but it’s already accounted for in the 15.5 tok/s figure.

How do I confirm it’s actually using the GPU?

The silent-CPU trap (missing -DCMAKE_CUDA_ARCHITECTURES="61") produces no error — the model just runs slowly. To confirm the GPU is doing the work, watch VRAM and utilization in a second terminal while a prompt is running:

watch -n 0.5 nvidia-smi

You should see the llama-cli/llama-mtmd-cli process holding ~6.9 GB of VRAM and GPU-Util spiking above 0%. If VRAM usage stays near zero and only CPU cores are busy, your build is running on the CPU — rebuild with the architecture flag. At load time, llama.cpp should also log offloaded 49/49 layers to GPU.
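
If you would rather script the check than watch a terminal, the load log can be tested directly. This is a sketch that assumes the log line has the form quoted above ("offloaded N/M layers to GPU"); the function name is hypothetical.

```shell
# Succeeds only when the log reports that every layer was offloaded to the GPU.
fully_offloaded() {
  # Extract "N M" from the "offloaded N/M layers to GPU" line, then compare.
  sed -n 's/.*offloaded \([0-9][0-9]*\)\/\([0-9][0-9]*\) layers to GPU.*/\1 \2/p' "$1" |
    awk '{ n = $1; m = $2 } END { exit !(NR > 0 && n == m) }'
}

# Usage (llama.cpp writes its load log to stderr):
#   ~/llama.cpp/build/bin/llama-cli ... 2> load.log
#   fully_offloaded load.log && echo "on GPU" || echo "check the build"
```

A missing log line counts as a failure, which is the right default for a CPU-only build.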

Build fails: nvcc not found

The CUDA toolkit is missing. Install it via conda:

conda install -c conda-forge cuda-toolkit
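
Before starting a build that takes several minutes, it can help to confirm the tools are on PATH. A minimal sketch, with a made-up function name:

```shell
# Print each named command that is not available on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: missing_tools git cmake nvcc
```

Empty output means everything was found. If nvcc is listed after the conda install, the conda environment is probably not activated in the current shell.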

Model download fails mid-way

Usually a stale HuggingFace cache. Clear it and re-run the download:

rm -rf ~/.cache/huggingface/hub/models--lmstudio-community--gemma-4-12B-it-GGUF/
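
If the download keeps dying on a slow connection, fetching the files with curl and resuming is an alternative. This is not what the setup script is known to do; it is a sketch that assumes the files in the repository carry the same names as the local files used in the manual invocation, and that ~/models is the target directory.

```shell
# Repo name is from this guide; the in-repo filenames are an assumption.
REPO="lmstudio-community/gemma-4-12B-it-GGUF"

# Build the direct-download URL for one file in a Hugging Face repo.
hf_url() {
  printf 'https://huggingface.co/%s/resolve/main/%s\n' "$1" "$2"
}

# Download one file into a directory, resuming a partial file if one exists.
fetch_resumable() {
  local file="$1" dest="${2:-$HOME/models}"
  mkdir -p "$dest" || return 1
  curl -fL -C - -o "$dest/$file" "$(hf_url "$REPO" "$file")"
}

# Usage:
#   fetch_resumable gemma-4-12B-it-Q4_K_M.gguf
#   fetch_resumable mmproj-gemma-4-12B-it-BF16.gguf
```

Re-running the same command after a dropped connection continues from the bytes already on disk.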

Windows? Use LM Studio instead

The easiest path on Windows is LM Studio — no compilation needed:

1. Install LM Studio
2. Search for lmstudio-community/gemma-4-12B-it-GGUF
3. Download the Q4_K_M quant (6.86 GB)
4. In Model Settings: GPU Offload = Max, Flash Attention = ON, KV Cache Offload = OFF (the counterpart of --no-kv-offload)
5. Chat normally or click 📎 to attach images

Why this matters

Vision AI doesn’t require a datacenter

Gemma-4-12B is Apache 2.0 licensed, runs entirely on your hardware, and processes images without sending your data anywhere. On an 8 GB Pascal GPU it generates ~15 tokens per second and encodes images in 18 milliseconds. That’s fast enough to analyze a 100-frame storyboard in under an hour.
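
The "under an hour" figure can be sanity-checked with rough arithmetic. The sketch below assumes every description uses the full 300-token budget from the manual invocation (-n 300) and generates at 15.5 tok/s. It ignores model load time and prompt processing, which are paid again on each invocation.

```shell
# Estimated generation time in minutes: frames * tokens_per_frame / tokens_per_second / 60
estimate_minutes() {
  awk -v frames="$1" -v tokens="$2" -v tps="$3" \
    'BEGIN { printf "%.1f\n", frames * tokens / tps / 60 }'
}

estimate_minutes 100 300 15.5   # prints 32.3
```

Roughly 32 minutes of pure generation leaves a little under half an hour of headroom for per-frame load overhead.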

The real cost wasn’t the GPU — it was finding the three undocumented flags. This guide exists so you don’t have to.
