SACD — Qwen2.5-3B distilled on ALFWorld (τ=0.75)
Qwen2.5-3B-Instruct student trained with SACD (State-Aware Correction Distillation), a
per-turn, signal-gated on-policy distillation method, from the
langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld teacher. This is the step-250 checkpoint.
Method (SACD)
SACD is a label-only, DAgger-style on-policy distillation method built on top of OPD. At every
turn the student samples the action and always drives the environment, so the trajectory stays
on-policy. For each turn a disagreement detector δ is computed (here k1, the log-prob gap
log π_S − log π_T). When δ > τ, that turn's loss switches to SFT on a counterfactual label
decoded by the teacher, which never touches the environment; otherwise the turn keeps the
standard reverse-KL OPD loss. The trigger self-anneals: it fires less often as the student
approaches the teacher.
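The per-turn gating can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the training code: the function names (`k1_gap`, `sacd_turn_loss`, `trigger_rate`), the `TurnLoss` container, and the choice to average the gap over a turn's tokens are all hypothetical. It only computes scalar loss values from precomputed log-probs and leaves out gradients and the environment loop.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TurnLoss:
    mode: str     # "sft" if the trigger fired, otherwise "opd"
    delta: float  # detector value for this turn
    loss: float   # scalar loss value for this turn


def k1_gap(student_logps: Sequence[float], teacher_logps: Sequence[float]) -> float:
    """k1 detector: mean of log pi_S - log pi_T over the student's sampled tokens.

    Averaging over tokens is an assumption; the card does not state the aggregation.
    """
    gaps = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(gaps) / len(gaps)


def sacd_turn_loss(
    student_logps: Sequence[float],           # log pi_S of the student's own action tokens
    teacher_logps: Sequence[float],           # log pi_T of the same tokens
    student_logps_on_label: Sequence[float],  # log pi_S of the teacher-decoded label tokens
    tau: float = 0.75,
    sft_coef: float = 1.0,
) -> TurnLoss:
    delta = k1_gap(student_logps, teacher_logps)
    if delta > tau:
        # Trigger fired: SFT (negative log-likelihood) on the teacher's label.
        nll = -sum(student_logps_on_label) / len(student_logps_on_label)
        return TurnLoss("sft", delta, sft_coef * nll)
    # No trigger: reverse-KL OPD. The k1 gap is itself a sample estimate of
    # KL(pi_S || pi_T), so it is reused here as the loss value.
    return TurnLoss("opd", delta, delta)


def trigger_rate(deltas: Sequence[float], tau: float = 0.75) -> float:
    """Fraction of turns whose detector value exceeds tau."""
    return sum(d > tau for d in deltas) / len(deltas)
```

With made-up log-probs, a turn where the student is much more confident than the teacher in its own action (mean gap 1.1) switches to SFT, while a turn with a small gap (0.15) stays on the OPD loss.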
Config
| setting | value |
|---|---|
| base model | Qwen/Qwen2.5-3B-Instruct |
| teacher | langfeng01/GiGPO-Qwen2.5-7B-Instruct-ALFWorld |
| detector | k1 (log-prob gap) |
| τ (trigger threshold) | 0.75 |
| sft_coef (β) | 1.0 |
| sft_mode | plain |
| teacher schedule | pipeline |
| steps | 250 |
| trainer | trinity-rft (verl FSDP) |
| mean trigger rate | 3.3% (about 35% at cold start, self-annealing to ~0) |