Gemma 4 12B IT - GGUF


Hugging Face | GitHub | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind

This repository contains static GGUF quantizations of the Gemma 4 12B Unified (Instruction Tuned) model. The files target local deployment on consumer hardware, particularly systems with limited memory or setups that run inference mostly on CPU and system RAM.

Unified Multimodal Architecture: The Gemma 4 12B model is completely encoder-free. It projects raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers. To utilize image, video, or audio capabilities in llama.cpp or compatible UIs, you must load one of the provided mmproj (Multimodal Projector) files alongside the main LLM .gguf file.


📦 Available Files and Quantizations

The GGUF files available in this repository are listed below. For machines with limited memory (e.g., 8 GB RAM), the Q3_K_M or Q4_K_M variants are strongly recommended to keep inference steady and avoid heavy disk swapping. Q4_K_M (7.38 GB) leaves well under 1 GB of headroom on an 8 GB system, so Q3_K_M is the safer of the two there.

| Filename | Size | Recommended Resource Allocation / Use Case |
|---|---|---|
| gemma-4-12b-it-Q3_K_M.gguf | 6.09 GB | Highly recommended for 8 GB RAM setups. Maximizes memory headroom at the cost of a minor perplexity increase. |
| gemma-4-12b-it-Q4_K_S.gguf | 7.17 GB | Lightweight 4-bit format. Fast execution, low memory footprint. |
| gemma-4-12b-it-Q4_K_M.gguf | 7.38 GB | Standard balanced deployment choice. Good trade-off between speed and accuracy. |
| gemma-4-12b-it-Q5_K_S.gguf | 8.41 GB | Higher retention of reasoning capabilities. Best if 12 GB+ system memory is available. |
| gemma-4-12b-it-Q5_K_M.gguf | 8.55 GB | Excellent logical consistency. Recommended for nuanced text tasks. |
| gemma-4-12b-it-Q6_K.gguf | 9.79 GB | Near-lossless quantization. Best suited for 16 GB+ RAM/VRAM systems. |
| gemma-4-12b-it-Q8_0.gguf | 12.70 GB | Maximum-fidelity 8-bit quantization. Requires the most memory of the files listed. |
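
As a rough way to apply the size column when choosing a file, the sketch below picks the largest quantization whose file fits a given memory budget. The function name `pick_quant` and the default 1.5 GB headroom for context and runtime buffers are assumptions made for illustration, not measured values and not part of this repository; the sizes are copied from the table.

```shell
# Sketch: choose the largest quantization that fits a memory budget.
# pick_quant BUDGET_GB [HEADROOM_GB]
# Sizes (GB) are copied from the table above. The 1.5 GB default headroom
# is an assumption, not a measured requirement.
pick_quant() {
  budget_gb="$1"
  headroom_gb="${2:-1.5}"
  awk -v budget="$budget_gb" -v headroom="$headroom_gb" '
    BEGIN {
      # Entries are listed in ascending order of file size.
      n = split("Q3_K_M:6.09 Q4_K_S:7.17 Q4_K_M:7.38 Q5_K_S:8.41 Q5_K_M:8.55 Q6_K:9.79 Q8_0:12.70", items, " ")
      best = ""
      for (i = 1; i <= n; i++) {
        split(items[i], kv, ":")
        # Keep the last (largest) entry that still fits with headroom.
        if (kv[2] + headroom <= budget) best = kv[1]
      }
      if (best == "") { print "none"; exit 1 }
      print "gemma-4-12b-it-" best ".gguf"
    }'
}
```

With the default headroom, `pick_quant 8` prints `gemma-4-12b-it-Q3_K_M.gguf`. Lowering the headroom to 0.5 GB (`pick_quant 8 0.5`) prints `gemma-4-12b-it-Q4_K_M.gguf`, which shows how tight that variant is on an 8 GB machine. Treat the result as a starting point, since actual memory use also depends on context length and runtime settings.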

Multimodal Projectors (Required for Vision/Audio)

  • mmproj-F16.gguf (122 MB) - Half precision (FP16). Smallest file and lowest memory use.
  • mmproj-BF16.gguf (175 MB) - bfloat16 precision.
  • mmproj-F32.gguf (210 MB) - Full 32-bit precision. Largest file.

🚀 Local Execution Guide

You can run these files using the standard command-line interfaces provided by llama.cpp.

Text-Only Inference

./llama-cli -m gemma-4-12b-it-Q4_K_M.gguf \
  -p "<start_of_turn>user\nWrite a short joke about saving RAM.<end_of_turn>\n<start_of_turn>model\n" \
  -n 512
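
For image input, the projector requirement described earlier could be sketched as follows. This assumes a recent llama.cpp build that ships the `llama-mtmd-cli` binary with its `--mmproj` and `--image` options, and that the build supports this model's architecture; none of that is confirmed by this repository. The function name `describe_image`, the `DRY_RUN` switch, and `photo.jpg` are hypothetical.

```shell
# Sketch: pair the main model with a projector file for one image prompt.
# Assumptions: llama-mtmd-cli from a recent llama.cpp build, and its
# --mmproj / --image options. File names come from the lists above.
MODEL_FILE="${MODEL_FILE:-gemma-4-12b-it-Q4_K_M.gguf}"
MMPROJ_FILE="${MMPROJ_FILE:-mmproj-F16.gguf}"

# describe_image IMAGE [PROMPT]
# Set DRY_RUN=1 to print the command instead of running it.
describe_image() {
  image="$1"
  prompt="${2:-Describe this image in one sentence.}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "./llama-mtmd-cli -m $MODEL_FILE --mmproj $MMPROJ_FILE --image $image -p \"$prompt\""
  else
    ./llama-mtmd-cli -m "$MODEL_FILE" --mmproj "$MMPROJ_FILE" \
      --image "$image" -p "$prompt"
  fi
}
```

Running `DRY_RUN=1 describe_image photo.jpg` prints the command so the flags can be checked against `./llama-mtmd-cli --help` before loading the model.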