Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student, targeting compression or final-accuracy gains. We instead use distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe: freeze a weaker teacher, apply distillation only during early training, and switch distillation off once the student reaches and surpasses teacher-level performance. On ImageNet and CIFAR classification, this strategy reaches target accuracy thresholds much earlier, with up to a 4.8× speedup measured in epochs. The method also generalizes beyond classification: we report a 1.7× epoch speedup for object detection on COCO and a 2.7× earlier target-FID crossing, measured in training steps, for diffusion generation on CIFAR-10. These findings support our method as a general-purpose speedup mechanism for visual learning.
Our method is a plug-and-play add-on: keep a weaker teacher frozen, add a distillation loss during early training, and switch distillation off once the student surpasses teacher-level performance. The total loss is

L = Lbase + γ · λt · Ldistill,   with λt = a · Λ(t),

where Λ(t) is a warmup–hold–decay schedule and the gate a is permanently set to zero once the student surpasses the teacher for k = 2 consecutive validations.
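For concreteness, the warmup–hold–decay schedule Λ(t) can be sketched as below; the breakpoints (warmup, hold, and decay lengths) and the linear ramp shapes are illustrative assumptions, not values prescribed by our experiments.

```python
def warmup_hold_decay(t, warmup=5, hold=20, decay=25, peak=1.0):
    """Warmup-hold-decay weight Λ(t). Breakpoints and linear shapes are illustrative."""
    if t < warmup:                       # linear ramp from 0 up to peak
        return peak * t / max(warmup, 1)
    if t < warmup + hold:                # hold at peak
        return peak
    t_decay = t - warmup - hold          # linear decay back to 0, clamped
    return max(peak * (1.0 - t_decay / max(decay, 1)), 0.0)
```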
Algorithm (early-stopped weak-to-strong distillation):
- Inputs: frozen teacher gφ, student fθ, train data D, val data Dval, schedule Λ(t), scale γ, stop length k
- Init: mref ← Eval(gφ, Dval), c ← 0, a ← 1
- For t = 1, …, T:
  - set λt ← a · Λ(t)
  - compute Lbase and Ldistill, then update θ ← OptStep(θ, ∇θ[Lbase + γ λt Ldistill])
  - mt ← Eval(fθ, Dval)
  - if mt ≥ mref: c ← c + 1; otherwise reset c ← 0 (consecutive hits)
  - if c ≥ k: set a ← 0 (disable distillation permanently)
- Return θ*

A minimal training-loop sketch is given after the per-task losses below.

Per-task distillation losses:

Classification: Ldistill = KL divergence on temperature-scaled softened posteriors (temperature annealed 6→1).
Object Detection: Ldistill = classification logit distillation + optional box-regression alignment.
Generation: Ldistill = ‖εθ(xt, t) − εφ(xt, t)‖₂², the MSE between student and teacher noise predictions on the same noised sample xt.
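The classification and generation terms above can be written as short PyTorch functions. This is a minimal sketch assuming logits and noise predictions are already computed; the temperature² scaling of the KL term follows the standard Hinton-style KD convention rather than anything stated above, and annealing the temperature from 6 to 1 would be handled by the caller.

```python
import torch.nn.functional as F

def classification_distill_loss(student_logits, teacher_logits, temperature=6.0):
    """KL divergence between temperature-softened teacher and student posteriors.
    The temperature**2 factor is the usual KD convention (an assumption here)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def generation_distill_loss(student_eps, teacher_eps):
    """MSE between student and teacher noise predictions on the same noised sample x_t."""
    return F.mse_loss(student_eps, teacher_eps)
```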
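Putting the schedule, the distillation loss, and the stopping rule together, a minimal sketch of the gated training loop for the classification case is given below; evaluate_top1 and base_loss_fn are hypothetical helpers, and validating once per epoch is an assumed cadence.

```python
import torch

def train_with_early_distillation(student, teacher, optimizer, train_loader, val_loader,
                                  base_loss_fn, distill_loss_fn, schedule, evaluate_top1,
                                  gamma=1.0, num_epochs=90, k=2):
    """Weak-to-strong distillation with a permanent off-switch (see the algorithm above).
    evaluate_top1 is a hypothetical helper returning validation top-1 accuracy."""
    teacher.eval()                                         # teacher stays frozen
    for p in teacher.parameters():
        p.requires_grad_(False)

    m_ref = evaluate_top1(teacher, val_loader)             # reference metric of the frozen teacher
    consecutive_hits, gate = 0, 1.0                        # c and a in the algorithm

    for epoch in range(num_epochs):
        lam = gate * schedule(epoch)                       # λt = a · Λ(t)
        student.train()
        for images, targets in train_loader:
            with torch.no_grad():
                teacher_logits = teacher(images)
            student_logits = student(images)
            loss = base_loss_fn(student_logits, targets)
            if lam > 0:                                    # distillation term only while enabled
                loss = loss + gamma * lam * distill_loss_fn(student_logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        m_t = evaluate_top1(student, val_loader)           # validate once per epoch (assumed)
        consecutive_hits = consecutive_hits + 1 if m_t >= m_ref else 0
        if consecutive_hits >= k:
            gate = 0.0                                     # disable distillation permanently
    return student
```

With schedule=warmup_hold_decay and distill_loss_fn=classification_distill_loss from the sketches above, this reproduces the gating behavior of the algorithm: the teacher's reference metric mref is computed once before training, and the distillation term is dropped for good after k consecutive validation wins.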
| Dataset | Student | Teacher | Optimizer | Teacher Top-1 (%) | Target τ (Top-1, %) | Speedup (epochs) | Best Top-1 % (Base / Ours) |
|---|---|---|---|---|---|---|---|
| ImageNet | ResNet-50 | ResNet-18 | SGD | 70.73 | 65 | 1.62× | 76.52 / 76.81 |
| ImageNet | ResNet-50 | ResNet-18 | AdamW | 70.73 | 65 | 2.86× | 77.19 / 77.72 |
| ImageNet | ResNet-50 | ResNet-18 | Muon | 70.73 | 65 | 4.75× | 77.11 / 77.11 |
| ImageNet | ResNet-50 | MobileNetV2 | SGD | 69.62 | 65 | 1.48× | 76.52 / 76.77 |
| ImageNet | ResNet-50 | MobileNetV2 | AdamW | 69.62 | 65 | 2.00× | 77.19 / 77.56 |
| ImageNet | ResNet-50 | MobileNetV2 | Muon | 69.62 | 65 | 3.17× | 77.11 / 77.09 |
| ImageNet | ViT-S/16 | MobileNetV2 | AdamW | 69.62 | 50 | 1.45× | 75.87 / 75.36 |
| CIFAR-100 | ResNet-18 | DenseNet-40 | SGD | 69.70 | 60 | 1.60× | 78.75 / 79.36 |
| CIFAR-100 | DenseNet-100 | DenseNet-40 | SGD | 69.70 | 60 | 1.83× | 82.08 / 82.51 |
| CIFAR-10 | ResNet-50 | MobileNetV2 | AdamW | 85.55 | 75 | 1.60× | 93.95 / 94.00 |
| CIFAR-10 | ResNet-50 | MobileNetV2 | SGD | 85.55 | 75 | 1.36× | 95.65 / 95.08 |
| Task | Dataset | Student | Teacher | Target τ | first@τ (Base → Ours) | Speedup | Best Metric (Base / Ours) |
|---|---|---|---|---|---|---|---|
| Detection | COCO | RetinaNet-R50 | RetinaNet-R34 | AP50 20.0% | 10 ep → 6 ep | 1.67× | AP50: 22.55 / 30.67 |
| Detection | COCO | Faster R-CNN-R50 | Faster R-CNN-R18 | AP50 20.0% | 4 ep → 3 ep | 1.33× | AP50: 35.97 / 36.72 |
| Generation | CIFAR-10 | nc128-rb3 | nc64-rb2 | FID↓ 60 | 16k → 6k | 2.67× | FID↓: 52.27 / 47.22 |
| Generation | CIFAR-10 | nc160-rb3 | nc64-rb2 | FID↓ 60 | 18k → 12k | 1.50× | FID↓: 53.49 / 47.67 |
| Task | Teacher regime | Student | Teacher | Target τ | first@τ (Base → Ours) | Speedup |
|---|---|---|---|---|---|---|
| Classification | Too weak | ResNet-50 | MobileNetV3-S | Top-1 65 | 19 ep → 37 ep | 0.51× |
| Classification | Too strong | MobileNetV2 | ResNet-50 | Top-1 65 | 60 ep → 57 ep | 1.05× |
| Classification | Suitably weaker | ResNet-50 | ResNet-18 | Top-1 65 | 19 ep → 4 ep | 4.75× |
| Detection | Too weak | RetinaNet-R50 | RetinaNet-R34 (earlier ckpt, AP50≈12.8) | AP50 20.0% | 10 ep → 9 ep | 1.11× |
| Detection | Too strong | RetinaNet-R50 | RetinaNet-R34 (later ckpt, AP50≈26.2) | AP50 20.0% | 10 ep → 9 ep | 1.11× |
| Detection | Suitably weaker | RetinaNet-R50 | RetinaNet-R34 (mid ckpt, AP50≈19.5) | AP50 20.0% | 10 ep → 6 ep | 1.67× |
| Generation (CIFAR-10) | Too weak | nc128-rb3 | nc64-rb1 | FID 60 | 16k → 20k | 0.80× |
| Generation (CIFAR-10) | Too strong | nc128-rb3 | nc64-rb3 | FID 60 | 16k → 16k | 1.00× |
| Generation (CIFAR-10) | Suitably weaker | nc128-rb3 | nc64-rb2 | FID 60 | 16k → 6k | 2.67× |
@article{li2026weak,
  title   = {Weak-to-Strong Knowledge Distillation Accelerates Visual Learning},
  author  = {Li, Baiang and Chai, Wenhao and Heide, Felix},
  journal = {arXiv preprint},
  year    = {2026}
}