
Mellum2 Thinking — GGUF (MXFP4_MOE)

This repository contains a GGUF MXFP4_MOE quantization of JetBrains/Mellum2-12B-A2.5B-Thinking, ready to run with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.

This quantization (MXFP4_MOE) applies MXFP4 microscaling 4-bit quantization to the MoE expert tensors. It has the smallest footprint of the available quantizations (7.0 GB), at a modest quality cost relative to BF16 (KLD ~0.088, 87.3% top-token agreement).

| File | Size |
|---|---|
| Mellum2-12B-A2.5B-Thinking-MXFP4_MOE.gguf | 7.0 GB |

Mellum 2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8 activated per token, 131,072-token context) that emits its chain of thought inside <think>...</think> blocks before the final answer. For the full model description, evaluation results, and architecture details, see the original model card: JetBrains/Mellum2-12B-A2.5B-Thinking.
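Because the reasoning arrives inline with the answer, client code often wants to separate the two. The following is a minimal sketch that assumes the raw output text still contains the literal <think>...</think> tags; depending on the runtime and its settings, the reasoning may instead be returned in a separate field, in which case this step is unnecessary. The helper name `split_reasoning` is hypothetical.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) extracted from raw model output."""
    # Join all reasoning blocks in case the model emits more than one.
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    # Everything outside the think blocks is treated as the final answer.
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>1024 = 2**10, so yes.</think>\nYes, 1024 is 2 to the power of 10."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Text without any think block passes through unchanged as the answer, with an empty reasoning string.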

Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| BF16 | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| Q8_0 | 8-bit, effectively lossless | 12.9 GB | 0.004 | 97.4% |
| Q6_K | 6-bit k-quant, very high quality | 10.9 GB | 0.014 | 95.1% |
| Q4_K_M | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.052 | 89.8% |
| MXFP4_MOE (this repo) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.088 | 87.3% |

KL divergence and top-token agreement are measured against the BF16 logits on Wikitext-2 (n_ctx=512); lower KLD / higher agreement means closer to the unquantized model.
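To make the two metrics concrete, here is a small sketch of how they can be computed from paired logits. It illustrates the definitions only and is not the tooling used to produce the table; the function names and the toy logits are made up for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def compare(ref_positions, quant_positions):
    """Return (mean KLD, top-token agreement) over paired positions."""
    n = len(ref_positions)
    total_kld = 0.0
    matches = 0
    for ref, quant in zip(ref_positions, quant_positions):
        total_kld += kl_divergence(ref, quant)
        # Agreement: both models rank the same token first.
        if ref.index(max(ref)) == quant.index(max(quant)):
            matches += 1
    return total_kld / n, matches / n


# Toy logits (hypothetical): two positions over a three-token vocabulary.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
quant = [[1.9, 1.1, 0.1], [2.6, 0.4, 0.0]]
mean_kld, agreement = compare(ref, quant)
print(f"mean KLD = {mean_kld:.4f}, top-token agreement = {agreement:.0%}")
```

In this toy input the models agree on the first position and disagree on the second, so the agreement is 50%; identical logits would give a KLD of zero.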

Download

hf download JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE Mellum2-12B-A2.5B-Thinking-MXFP4_MOE.gguf --local-dir .
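The same download can be scripted from Python. This is a sketch that assumes the `huggingface_hub` package is installed; the wrapper name `download_gguf` is hypothetical, while the repository ID and file name are the ones from the command above.

```python
from pathlib import Path

REPO_ID = "JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE"
FILENAME = "Mellum2-12B-A2.5B-Thinking-MXFP4_MOE.gguf"


def download_gguf(local_dir="."):
    """Download the GGUF file into local_dir and return its path."""
    # Imported lazily so this module can be loaded without the package.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
    return Path(path)
```

Calling `download_gguf()` fetches the 7.0 GB file into the current directory, so run it only when you actually want the download to start.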

Run with llama.cpp

# Pull and serve in one step (downloads the GGUF automatically)
llama-server -hf JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20

# Or run a one-off prompt with a local file
llama-cli -m Mellum2-12B-A2.5B-Thinking-MXFP4_MOE.gguf \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  -p "Is 1024 a power of 2? Explain your reasoning."

The server exposes an OpenAI-compatible API on http://localhost:8080/v1:

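from openai import OpenAI

# The OpenAI client requires an api_key value; "llama.cpp" here is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE",
    messages=[
        {"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."},
    ],
    # Leave room for the chain of thought as well as the final answer.
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    # top_k is not an OpenAI client parameter, so it is passed through extra_body.
    extra_body={"top_k": 20},
)
print(chat_response.choices[0].message.content)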
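If you would rather not depend on the `openai` package, the same request can be sent with the Python standard library. This is a sketch under the assumption that the server is running locally on port 8080 as above; the helper names `build_chat_payload` and `chat` are hypothetical, and `top_k` sits at the top level of the JSON body, which is where `extra_body` places it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"
MODEL = "JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE"


def build_chat_payload(prompt, max_tokens=81920):
    """Build the JSON body with the sampling settings used above."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
    }


def chat(prompt, base_url=BASE_URL):
    """POST one chat request and return the assistant message text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Is 1024 a power of 2? Explain your reasoning."))` prints the reply.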

Run with Ollama

ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE

License

Released under the Apache 2.0 license.


For the full model card, evaluation results, and architecture details, refer to the original model: JetBrains/Mellum2-12B-A2.5B-Thinking.
