Gemma-4-12B-it-MXFP4-GGUF

MXFP4 (OCP microscaling FP4) quantization of Google's Gemma 4 12B It, a multimodal language model with native vision understanding.

This repository contains two files:

  • gemma-4-12b-it-mxfp4.gguf — Text backbone (48 transformer layers, 3840 hidden dim, 262k context) quantized to MXFP4
  • mmproj-gemma-4-12b-it-f16.gguf — SigLIP vision encoder + projector at F16 precision (required for image input)

About MXFP4

MXFP4 (Microscaling FP4) is the open-standard 4-bit floating-point format defined by the OCP Microscaling Formats (MX) specification. It uses an E2M1 format (1 sign, 2 exponent, 1 mantissa bit) with per-block shared exponent (scaling factor), providing better dynamic range than INT4 while remaining hardware-agnostic.

Feature MXFP4 Q4_K_M NVFP4
Numeric format E2M1 microscaling FP4 INT4 block quantization E4M3 native FP4
Shared exponent (scale) Per 32 elements Per 32 elements None (native FP4)
Effective BPW 4.61 ~4.50 4.68
Hardware support Any GPU + CPU Any GPU + CPU Blackwell only
Dynamic range (max normal) 30 (E2M1) 7 (INT4, symmetric) 448 (E4M3)
Dequantization overhead Moderate (scale mul) Moderate (scale mul) None

When to use MXFP4: You want a universal 4-bit format that works on any GPU (NVIDIA, AMD, Intel) or CPU without sacrificing quality. MXFP4 offers better dynamic range than INT4 thanks to its microscaling exponent, making it more robust for outlier-heavy layers.

When to use NVFP4 instead: You have a Blackwell GPU (RTX 50-series) — the native FP4 tensor cores give higher throughput.

When to use Q4_K_M: You need maximum compatibility with older llama.cpp versions (pre-June 2025) or CPU-only inference.

Files

Filename Type Size BPW Description
gemma-4-12b-it-mxfp4.gguf MXFP4 quantized (text) 6.87 GB 4.61 48-layer text backbone with hybrid attention
mmproj-gemma-4-12b-it-f16.gguf Vision encoder (F16) 122 MB 16.0 SigLIP vision embedder + GEMMA4UV projector

Quantization Characteristics

Metric Value
Input format F16 GGUF (23.83 GB, 667 tensors)
Output format MXFP4 GGUF (6.87 GB, 667 tensors)
Quantization type LLAMA_FTYPE_MOSTLY_MXFP4 (type 41)
Compression ratio 3.47× (23.83 GB → 6.87 GB)
Quantization time ~255 seconds on RTX 5060 Ti
1D tensors (norms, scales) Kept at Q8_0
Attention + FFN weights Converted to MXFP4

Model Description

Gemma 4 12B is part of Google's fourth-generation Gemma family, featuring:

  • 48 transformer layers with 3840 hidden dimensions and 15360 FFN intermediate size
  • Hybrid attention: 40 sliding-window layers (window 1024, kv_heads=8, head_dim=256) interleaved with 8 full-attention layers (kv_heads=2, head_dim=512) in a 5:1 pattern
  • Context window: up to 262,144 tokens
  • RoPE scaling: separate frequency bases for sliding window and full attention
  • Final logit softcapping: stabilizes large-vocabulary predictions
  • Vision: SigLIP-based lightweight patch embedder with learned positional encoding
  • Instruction-tuned: optimized for chat and instruction-following

The base model is google/gemma-4-12B-it under Apache 2.0 license.

Usage

llama.cpp CLI

# Text-only inference
./llama-cli \
  -m gemma-4-12b-it-mxfp4.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 512

# Vision inference (requires mmproj)
./llama-cli \
  -m gemma-4-12b-it-mxfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --image diagram.png \
  -p "Explain what this diagram shows" \
  -n 512

llama.cpp Server (OpenAI-compatible API)

./llama-server \
  -m gemma-4-12b-it-mxfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-12b-it-mxfp4.gguf",
    mmproj="mmproj-gemma-4-12b-it-f16.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
)

output = llm("What is the capital of France?", max_tokens=128)
print(output["choices"][0]["text"])

Download

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
    filename="gemma-4-12b-it-mxfp4.gguf"
)
mmproj_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
    filename="mmproj-gemma-4-12b-it-f16.gguf"
)

Thinking / Reasoning Behavior

Gemma 4 supports structured reasoning using <|channel>thought tags. The chat template handles reasoning as follows:

  • enable_thinking=false (default): The template inserts empty think tags (<|channel>thought\n<channel|>) at the start of the model's generation turn, suppressing thinking.
  • enable_thinking=true: The model may generate internal reasoning tokens enclosed in <|channel>thought...<channel|>.

In LM Studio, reasoning sections are rendered as collapsible blocks with proper reasoning.parsing configuration.

Conversion Pipeline

google/gemma-4-12B-it (23.92 GB, safetensors)
  │
  ├─ convert_hf_to_gguf.py --outtype f16 (llama.cpp d403f00)
  │     → gemma-4-12b-f16.gguf (23.83 GB, 667 tensors, text backbone only)
  │
  ├─ convert_hf_to_gguf.py --mmproj --outtype f16
  │     → mmproj-gemma-4-12b-f16.gguf (122 MB, 11 tensors)
  │
  └─ llama-quantize.exe MXFP4
        → gemma-4-12b-it-mxfp4.gguf (6.87 GB, 667 tensors, 4.61 BPW)

Built with llama.cpp commit d403f00. Google's original chat template (17,466 bytes) is preserved as-is.

Hardware Requirements

GPU VRAM Model + KV Cache (4k ctx) Estimated Speed
RTX 5060 Ti 16GB 16 GB ~10 GB ~45 tok/s
RTX 4090 24GB 24 GB ~10 GB ~55 tok/s
RX 7900 XTX 24GB 24 GB ~10 GB ~35 tok/s
CPU (6+ cores) System RAM ~7 GB ~5-8 tok/s

Minimum VRAM: 10 GB for the model (7 GB at 4.61 BPW + ~2.5 GB KV cache at 8k ctx).

Hardware Compatibility

Backend MXFP4 Support Notes
CUDA (any SM) ✅ Universal Works on all NVIDIA GPUs
CPU (llamafile) ✅ Universal Requires recent llama.cpp
Vulkan ✅ Universal Tested on AMD + NVIDIA
Metal (Apple) ⚠️ Experimental Limited testing
AMD ROCm ✅ Universal Tested on RX 7900 series

License

Apache 2.0, as per Google's Gemma 4 license.

Downloads last month
4,663
GGUF
Model size
1650561533092598.5T params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF

Quantized
(127)
this model