SACD — Qwen2.5-3B distilled on ALFWorld (τ=0.75)

Qwen2.5-3B-Instruct student trained with SACD (State-Aware Correction Distillation), a per-turn, signal-gated on-policy distillation method, from the langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld teacher. This is the step-250 checkpoint.

Method (SACD)

SACD is a label-only, DAgger-style extension of on-policy distillation (OPD). At every turn the student samples the action and always drives the environment, so the trajectory stays on-policy. For each turn a disagreement detector δ is computed (here k1, the log-prob gap log π_S − log π_T). When δ > τ, that turn's loss switches to SFT on a teacher-decoded counterfactual label; the teacher never acts in the environment. Otherwise the turn keeps the standard reverse-KL OPD loss. The trigger self-anneals: as the student approaches the teacher, δ exceeds τ less often.
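The per-turn gating can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code behind this checkpoint: the names `k1_gap`, `sacd_turn_loss`, and `TurnLoss` are hypothetical, averaging the token-level gap over the turn is an assumption, and the OPD branch simply returns the k1 estimate of the reverse KL as a scalar (a real trainer would work on tensors and backpropagate through the student).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TurnLoss:
    delta: float     # k1 disagreement signal for this turn
    triggered: bool  # True when the turn switched to SFT
    loss: float      # scalar loss value for this turn


def k1_gap(student_logps: Sequence[float], teacher_logps: Sequence[float]) -> float:
    """Mean of log pi_S - log pi_T over the tokens the student sampled.

    Averaging over the turn's tokens is an assumption of this sketch.
    """
    assert len(student_logps) == len(teacher_logps) > 0
    gaps = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(gaps) / len(gaps)


def sacd_turn_loss(
    student_logps: Sequence[float],           # log pi_S on the student's own tokens
    teacher_logps: Sequence[float],           # log pi_T on those same tokens
    student_logps_on_label: Sequence[float],  # log pi_S on the teacher-decoded label
    tau: float = 0.75,
    sft_coef: float = 1.0,
) -> TurnLoss:
    delta = k1_gap(student_logps, teacher_logps)
    if delta > tau:
        # Triggered: SFT (negative log-likelihood) on the teacher's label.
        nll = -sum(student_logps_on_label) / len(student_logps_on_label)
        return TurnLoss(delta, True, sft_coef * nll)
    # Not triggered: keep the reverse-KL OPD signal (k1 estimate).
    return TurnLoss(delta, False, delta)


# Toy inputs, chosen only to exercise both branches.
close = sacd_turn_loss([-1.0, -2.0], [-1.2, -2.1], [-0.5, -0.7])
far = sacd_turn_loss([-0.1, -0.2], [-2.0, -1.5], [-3.0, -1.0])
print(close.triggered, far.triggered)
```

With these toy inputs the first turn stays on the OPD branch (δ ≈ 0.15) and the second switches to SFT (δ ≈ 1.6). In both cases the student's action is what would be sent to the environment; the teacher's label only changes the loss.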

Config

base model: Qwen/Qwen2.5-3B-Instruct
teacher: langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld
detector: k1 (log-prob gap)
τ (trigger threshold): 0.75
sft_coef (β): 1.0
sft_mode: plain
teacher schedule: pipeline
steps: 250
trainer: trinity-rft (verl FSDP)
mean trigger rate: 3.3% (cold-start 35% → ~0, self-annealing)
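As a rough illustration of how a trigger-rate statistic like the one above could be computed, the sketch below counts the fraction of turns with δ > τ per training step and averages over steps. The helper names and the toy δ values are hypothetical, and averaging per step rather than over all turns is an assumption; the numbers are not taken from this run.

```python
from typing import Sequence


def trigger_rate(deltas: Sequence[float], tau: float = 0.75) -> float:
    """Fraction of turns in one step whose detector value exceeds tau."""
    if not deltas:
        return 0.0
    return sum(d > tau for d in deltas) / len(deltas)


def mean_trigger_rate(per_step_deltas: Sequence[Sequence[float]], tau: float = 0.75) -> float:
    """Average of the per-step trigger rates (assumed aggregation)."""
    rates = [trigger_rate(step, tau) for step in per_step_deltas]
    return sum(rates) / len(rates)


# Toy detector values: large gaps early in training, small gaps later.
early_step = [1.2, 0.9, 0.3, 0.1]
late_step = [0.2, 0.1, 0.05, 0.0]
print(trigger_rate(early_step), trigger_rate(late_step))
print(mean_trigger_rate([early_step, late_step]))
```

The toy values mimic the self-annealing pattern: the early step triggers on half of its turns and the late step on none, so the mean is pulled down by the later steps.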

Safetensors: model size 3B params, tensor type BF16

Model tree for Wenboz/SACD-Qwen2.5-3B-ALFWorld-k1-tau0.75-beta1.0-plain-pipeline

Base model

Qwen/Qwen2.5-3B