Running Gemma 4 31B with MTP on Nvidia NGX Spark via Claude Code
The release of Gemma 4 brings significant architectural improvements, including Multi-Token Prediction (MTP). When combined with Nvidia's NGX Spark and high-performance serving frameworks like sglang, you can achieve impressive throughput and low latency for a 31B parameter model.
In this guide, we'll walk through the setup to run Gemma 4 31B-IT with its assistant model for speculative decoding, proxied through LiteLLM, and driven by the Claude Code CLI.
The Stack
- → Model:
nvidia/Gemma-4-31B-IT-NVFP4(FP4 quantized for maximum efficiency) - → Draft Model:
google/gemma-4-31B-it-assistant(for MTP/Speculative decoding) - → Serving Engine:
sglang(optimized for high-throughput serving) - → Proxy:
LiteLLM(to provide an OpenAI-compatible API) - → Client:
Claude Code(Anthropic's official CLI agent)
Step 1: Launching the Serving Engine with Docker
We use the lmsysorg/sglang:latest image. The key here is the --speculative-algorithm NEXTN and the use of the assistant model to accelerate generation.
Key Configuration Details:
- → NVFP4 Quantization: Using the Nvidia-optimized FP4 version allows the 31B model to fit comfortably while maintaining high performance.
- → MTP / Speculative Decoding: By using
NEXTNwith the assistant model, the engine predicts multiple tokens ahead, significantly reducing the time-per-token. - → Tool Call Parsing: The
--tool-call-parser gemma4and--reasoning-parser gemma4flags are essential. Without them, the model's tool-use and reasoning blocks often fail to be parsed correctly by the serving engine, causing Claude Code to miss tool calls or receive malformed data. These flags ensure that the specific Gemma 4 format is correctly translated into structured API responses.
Step 2: Proxying with LiteLLM
Claude Code expects an Anthropic-like or OpenAI-like API. LiteLLM acts as the perfect bridge between sglang's output and the client.
*Note: We use openai/gemma4 as a generic model mapping, but it will route directly to our gemma4 instance running on port 8000.*
Step 3: Running Claude Code
Finally, we point Claude Code to our local LiteLLM proxy.
Now, you have the power of Claude Code's agentic capabilities driven by a local, high-performance Gemma 4 31B instance on your own hardware.
Performance Observations
Running the 31B model with MTP provides a noticeable "snap" to the responses. The combination of FP4 quantization and speculative decoding allows for a smooth, interactive experience even with complex software engineering tasks.