Running Gemma 4 31B with MTP on Nvidia NGX Spark via Claude Code

The release of Gemma 4 brings significant architectural improvements, including Multi-Token Prediction (MTP). When combined with Nvidia's NGX Spark and high-performance serving frameworks like sglang, you can achieve impressive throughput and low latency for a 31B parameter model.

In this guide, we'll walk through the setup to run Gemma 4 31B-IT with its assistant model for speculative decoding, proxied through LiteLLM, and driven by the Claude Code CLI.

The Stack

→ Model: nvidia/Gemma-4-31B-IT-NVFP4 (FP4 quantized for maximum efficiency)
→ Draft Model: google/gemma-4-31B-it-assistant (for MTP/Speculative decoding)
→ Serving Engine: sglang (optimized for high-throughput serving)
→ Proxy: LiteLLM (to provide an OpenAI-compatible API)
→ Client: Claude Code (Anthropic's official CLI agent)

Step 1: Launching the Serving Engine with Docker

We use the lmsysorg/sglang:latest image. The key here is the --speculative-algorithm NEXTN and the use of the assistant model to accelerate generation.

bash

docker run --rm -it --gpus all \
  --name gemma4-31b-mtp \
  --shm-size=16g \
  -p 8000:8000 \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  --entrypoint /bin/bash \
  lmsysorg/sglang:latest \
  -c "pip install -U https://github.com/huggingface/transformers/archive/main.tar.gz && sglang serve \
    --model-path nvidia/Gemma-4-31B-IT-NVFP4 \
    --served-model-name gemma4 \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --context-length 128000 \
    --mem-fraction-static 0.7 \
    --cuda-graph-max-bs 8 \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --speculative-algorithm NEXTN \
    --speculative-draft-model-path google/gemma-4-31B-it-assistant \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 2 \
    --speculative-num-draft-tokens 7"

Key Configuration Details:

→ NVFP4 Quantization: Using the Nvidia-optimized FP4 version allows the 31B model to fit comfortably while maintaining high performance.
→ MTP / Speculative Decoding: By using NEXTN with the assistant model, the engine predicts multiple tokens ahead, significantly reducing the time-per-token.
→ Tool Call Parsing: The --tool-call-parser gemma4 and --reasoning-parser gemma4 flags are essential. Without them, the model's tool-use and reasoning blocks often fail to be parsed correctly by the serving engine, causing Claude Code to miss tool calls or receive malformed data. These flags ensure that the specific Gemma 4 format is correctly translated into structured API responses.

Step 2: Proxying with LiteLLM

Claude Code expects an Anthropic-like or OpenAI-like API. LiteLLM acts as the perfect bridge between sglang's output and the client.

bash

litellm \
  --model openai/gemma4 \
  --api_base http://localhost:8000/v1 \
  --drop_params

*Note: We use openai/gemma4 as a generic model mapping, but it will route directly to our gemma4 instance running on port 8000.*

Step 3: Running Claude Code

Finally, we point Claude Code to our local LiteLLM proxy.

bash

export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_API_KEY="sk-local-run"
claude

Now, you have the power of Claude Code's agentic capabilities driven by a local, high-performance Gemma 4 31B instance on your own hardware.

Performance Observations

Running the 31B model with MTP provides a noticeable "snap" to the responses. The combination of FP4 quantization and speculative decoding allows for a smooth, interactive experience even with complex software engineering tasks.