gemma-4-12B-it - GGUF Quantized Versions

This repository provides GGUF quantized versions of google/gemma-4-12B-it, converted with llama.cpp.

These files are intended for local inference with llama.cpp, Ollama, LM Studio, Jan, Open WebUI, and llama-cpp-python.

Model Details

  • Base model: google/gemma-4-12B-it
  • Architecture: Gemma
  • Format: GGUF
  • Source license: apache-2.0
  • Conversion tool: convert_hf_to_gguf.py from llama.cpp
  • Quantization tool: llama-quantize
  • Recommended file: gemma-4-12B-it-Q4_K_M.gguf
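The conversion and quantization steps listed above can be sketched as a small command builder. This is only an illustration of the usual llama.cpp workflow: the exact commands used for this repository are not stated, so the local paths, the `--outtype f16` choice, and the helper name `build_conversion_commands` are assumptions.

```python
from pathlib import Path


def build_conversion_commands(hf_model_dir, out_dir, quant_types,
                              base_name="gemma-4-12B-it"):
    """Return argv lists: one FP16 conversion, then one llama-quantize call per type.

    Assumes convert_hf_to_gguf.py is in the current directory and
    llama-quantize is on PATH (hypothetical setup, adjust to your checkout).
    """
    out_dir = Path(out_dir)
    f16_path = out_dir / f"{base_name}-FP16.gguf"

    # Step 1: convert the Hugging Face checkpoint to a 16-bit GGUF file.
    commands = [[
        "python", "convert_hf_to_gguf.py", str(hf_model_dir),
        "--outfile", str(f16_path),
        "--outtype", "f16",
    ]]

    # Step 2: quantize the FP16 file once per requested quantization type.
    for quant in quant_types:
        quant_path = out_dir / f"{base_name}-{quant}.gguf"
        commands.append(["llama-quantize", str(f16_path), str(quant_path), quant])

    return commands


if __name__ == "__main__":
    for cmd in build_conversion_commands("./gemma-4-12B-it", "./gguf",
                                         ["Q4_K_M", "Q8_0"]):
        print(" ".join(cmd))
```

The function only builds the commands; pass each list to `subprocess.run(cmd, check=True)` if you want to execute them.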

Quantized Files

Quant | Filename | Size | SHA256 | Notes
--- | --- | --- | --- | ---
FP16 | gemma-4-12B-it-FP16.gguf | ~22.20 GiB | 6658de79a289... | Unquantized 16-bit (half-precision) GGUF baseline
Q2_K | gemma-4-12B-it-Q2_K.gguf | ~4.50 GiB | d78d899175ac... | Smallest, lowest quality
Q3_K_M | gemma-4-12B-it-Q3_K_M.gguf | ~5.67 GiB | 4f167cbf1429... | Small balanced version
Q4_0 | gemma-4-12B-it-Q4_0.gguf | ~6.50 GiB | a054b3363275... | Simple 4-bit quantization
Q4_K_M | gemma-4-12B-it-Q4_K_M.gguf | ~6.87 GiB | 76e3e8ed9c40... | Recommended default for most users
Q5_K_M | gemma-4-12B-it-Q5_K_M.gguf | ~7.96 GiB | cce4e3c9c96c... | Better quality with moderate size
Q6_K | gemma-4-12B-it-Q6_K.gguf | ~9.11 GiB | ec334df42d92... | High quality
Q8_0 | gemma-4-12B-it-Q8_0.gguf | ~11.80 GiB | ba84ac12b187... | Near FP16 quality
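A downloaded file can be checked against the SHA256 column with a few lines of Python. The hashes above are truncated, so this sketch can only compare the leading characters, which is a weaker check than comparing the full digest. The helper names `sha256_of_file` and `matches_listed_prefix` are illustrative, not part of any library.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-GiB GGUF files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_prefix(path, listed_prefix):
    """Compare the file's SHA256 with a truncated value such as '76e3e8ed9c40...'."""
    prefix = listed_prefix.rstrip(".").lower()
    if not prefix:
        raise ValueError("listed_prefix must contain at least one hex character")
    return sha256_of_file(path).startswith(prefix)
```

For example, `matches_listed_prefix("gemma-4-12B-it-Q4_K_M.gguf", "76e3e8ed9c40...")` should return `True` for an intact download of the recommended file.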

Validation

Each file was tested with llama-cli for basic load + generation.

Quant | Filename | Status
--- | --- | ---
FP16 | gemma-4-12B-it-FP16.gguf | ✅ passed
Q2_K | gemma-4-12B-it-Q2_K.gguf | ✅ passed
Q3_K_M | gemma-4-12B-it-Q3_K_M.gguf | ✅ passed
Q4_0 | gemma-4-12B-it-Q4_0.gguf | ✅ passed
Q4_K_M | gemma-4-12B-it-Q4_K_M.gguf | ✅ passed
Q5_K_M | gemma-4-12B-it-Q5_K_M.gguf | ✅ passed
Q6_K | gemma-4-12B-it-Q6_K.gguf | ✅ passed
Q8_0 | gemma-4-12B-it-Q8_0.gguf | ✅ passed

Usage

llama.cpp

llama-cli -m gemma-4-12B-it-Q4_K_M.gguf -p "Hello! Introduce yourself briefly."

Older llama.cpp builds ship the binary as main instead of llama-cli:

./main -m gemma-4-12B-it-Q4_K_M.gguf -p "Hello! Introduce yourself briefly."

llama.cpp directly from Hugging Face

llama-cli -hf ShahzebKhoso/gemma-4-12B-it-GGUF:Q4_K_M -p "Hello! Introduce yourself briefly."

llama-cpp-python

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="ShahzebKhoso/gemma-4-12B-it-GGUF",
    filename="gemma-4-12B-it-Q4_K_M.gguf",
)

llm = Llama(model_path=model_path)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! Introduce yourself briefly."},
    ],
    max_tokens=128,
)

print(out["choices"][0]["message"]["content"])

Which file should I use?

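  • Use Q4_K_M (~6.87 GiB) for the best default balance of quality and size.
  • Use Q5_K_M (~7.96 GiB) or Q6_K (~9.11 GiB) for better quality at a larger size.
  • Use Q8_0 (~11.80 GiB) if you want near-FP16 quality and have the memory for it.
  • Use Q2_K or Q3_K_M only when memory is very limited; they trade quality for size.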
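The guidance above can be turned into a rough selection helper. This is a minimal sketch, not a sizing rule: the file sizes come from the table in this README, while the 2 GiB headroom for the KV cache and runtime buffers is an assumed placeholder, and `pick_quant` is a hypothetical name. Real memory use also depends on context length and backend.

```python
# Approximate file sizes in GiB, taken from the "Quantized Files" table.
QUANT_SIZES_GIB = {
    "Q2_K": 4.50,
    "Q3_K_M": 5.67,
    "Q4_0": 6.50,
    "Q4_K_M": 6.87,
    "Q5_K_M": 7.96,
    "Q6_K": 9.11,
    "Q8_0": 11.80,
    "FP16": 22.20,
}


def pick_quant(available_gib, headroom_gib=2.0):
    """Return the largest quant whose file fits in memory, or None if none fits.

    headroom_gib is an assumed allowance for the KV cache and runtime buffers.
    """
    budget = available_gib - headroom_gib
    fitting = [(size, name) for name, size in QUANT_SIZES_GIB.items() if size <= budget]
    if not fitting:
        return None
    # max() compares sizes first, so this selects the largest file that fits.
    return max(fitting)[1]


if __name__ == "__main__":
    for gib in (5, 8, 10, 16):
        print(gib, "GiB ->", pick_quant(gib))
```

The helper always picks the largest file that fits, so on machines with plenty of memory it will suggest something heavier than the Q4_K_M default. Treat its result as an upper bound rather than a recommendation.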

Provenance

This repository is a quantized derivative of:

google/gemma-4-12B-it

Base model metadata:

revision: 66bc78a7534d523aa32004652cb02cc2e6354c62
pipeline_tag: any-to-any
tags: transformers, safetensors, gemma4_unified, image-text-to-text, any-to-any, base_model:google/gemma-4-12B, base_model:finetune:google/gemma-4-12B, license:apache-2.0, eval-results, endpoints_compatible, region:us