SACD — Qwen2.5-3B distilled on ALFWorld (τ=0.75)

A Qwen2.5-3B-Instruct student distilled from the langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld teacher with SACD (State-Aware Correction Distillation), a per-turn, signal-gated on-policy distillation method. This is the step-250 checkpoint.

Method (SACD)

SACD is a label-only, DAgger-style correction layered on top of on-policy distillation (OPD). On every turn the student samples the action and always drives the environment, so the trajectory stays on-policy. For each turn a disagreement detector δ is computed (here the k1 log-prob gap, log π_S − log π_T). When δ > τ, that turn's loss switches to SFT on a teacher-decoded counterfactual label; the teacher never touches the environment. Otherwise the turn keeps the standard reverse-KL OPD loss. The trigger self-anneals: it fires less often as the student approaches the teacher.
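The per-turn gating described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function names, the choice to average the gap over the turn's tokens, and the use of the k1 estimate itself as the OPD scalar are all assumptions made for the example. Only δ, τ, and the SFT coefficient β come from this card.

```python
from dataclasses import dataclass
from typing import Sequence


def k1_gap(student_logps: Sequence[float], teacher_logps: Sequence[float]) -> float:
    """k1 detector: mean of log pi_S - log pi_T over the tokens the student sampled.

    Averaging over the turn's tokens is an assumption of this sketch.
    """
    if len(student_logps) != len(teacher_logps) or not student_logps:
        raise ValueError("need equal-length, non-empty log-prob sequences")
    return sum(s - t for s, t in zip(student_logps, teacher_logps)) / len(student_logps)


@dataclass
class TurnLoss:
    mode: str     # "sft" when the trigger fired, "opd" otherwise
    delta: float  # detector value for this turn
    value: float  # scalar loss estimate for this turn


def turn_loss(
    student_logps: Sequence[float],           # log pi_S on the student's own tokens
    teacher_logps: Sequence[float],           # log pi_T on the same tokens
    student_logps_on_label: Sequence[float],  # log pi_S on the teacher-decoded label
    tau: float = 0.75,
    sft_coef: float = 1.0,
) -> TurnLoss:
    """Pick the loss for one turn: SFT on the teacher label if delta > tau, else OPD."""
    delta = k1_gap(student_logps, teacher_logps)
    if delta > tau:
        if not student_logps_on_label:
            raise ValueError("triggered turn needs a non-empty teacher label")
        # Token-averaged negative log-likelihood of the teacher's label under the student.
        nll = -sum(student_logps_on_label) / len(student_logps_on_label)
        return TurnLoss("sft", delta, sft_coef * nll)
    # Below the threshold: report the k1 estimate of the reverse KL on sampled tokens.
    return TurnLoss("opd", delta, delta)
```

The student's own sampled action is still the one sent to the environment in both branches; the teacher label only changes which loss that turn contributes.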

Config


base model	Qwen/Qwen2.5-3B-Instruct
teacher	langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld
detector	k1 (log-prob gap)
τ (trigger threshold)	0.75
sft_coef (β)	1.0
sft_mode	plain
teacher schedule	pipeline
steps	250
trainer	trinity-rft (verl FSDP)
mean trigger rate	3.3% (about 35% at cold start, decaying to ~0% as the trigger self-anneals)
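For reference, the settings above can be collected into a small Python structure, together with one way the trigger rate might be measured. This is a hypothetical sketch: the class and field names are illustrative and are not trinity-rft or verl configuration keys, and defining the trigger rate as the fraction of turns with δ > τ is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SACDRunConfig:
    # Values are copied from the table above; field names are illustrative only.
    base_model: str = "Qwen/Qwen2.5-3B-Instruct"
    teacher: str = "langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld"
    detector: str = "k1"
    tau: float = 0.75
    sft_coef: float = 1.0
    sft_mode: str = "plain"
    teacher_schedule: str = "pipeline"
    steps: int = 250


def trigger_rate(deltas: Sequence[float], tau: float) -> float:
    """Fraction of turns whose detector value exceeds the threshold."""
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if d > tau) / len(deltas)


cfg = SACDRunConfig()
# Two of these four made-up detector values exceed tau = 0.75.
example_rate = trigger_rate([1.2, 0.1, 0.9, 0.3], cfg.tau)
```

Tracking this rate per training step is one way to observe the self-annealing behaviour reported in the last table row.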

Safetensors

Model size	3B params
Tensor type	BF16

Reinforcement Learning

Model tree for Wenboz/SACD-Qwen2.5-3B-ALFWorld-k1-tau0.75-beta1.0-plain-pipeline

Base model	Qwen/Qwen2.5-3B
Finetuned	Qwen/Qwen2.5-3B-Instruct
Finetuned	this model