Weak-to-Strong Knowledge Distillation
Accelerates Visual Learning

Princeton University

Abstract

Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead use distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns distillation off once the student reaches and surpasses teacher-level performance. On ImageNet and CIFAR classification, this strategy reaches target accuracy thresholds much earlier, with up to a 4.8× speedup measured in epochs. The method generalizes to other tasks: we observe a 1.7× epoch speedup for object detection on COCO and a 2.5× earlier target-FID crossing, measured in training steps, for diffusion-based generation on CIFAR-10. These findings support weak-to-strong distillation as a general speedup mechanism for visual learning.

Figure 1: Weak-to-Strong Distillation Accelerates Visual Learning. Blue curves are baseline training and red curves are with our method. From left to right: classification on ImageNet, diffusion-based generation on CIFAR-10, and object detection on COCO. Our method reaches target quality earlier in all three tasks: 65% Top-1 on ImageNet (4.8× fewer epochs), target FID on diffusion (more than 2.5× fewer training steps), and target AP50 for object detection (1.7× fewer epochs).

Method

Our method is a plug-and-play early-training add-on: keep a weaker teacher frozen, add a distillation loss in early training, and turn distillation off after the student surpasses teacher-level performance. The total loss is:

L = L_base + γ · λ(t) · L_distill

where λ(t) follows a warmup–hold–decay schedule Λ(t) and is permanently set to zero once the student surpasses the teacher on k = 2 consecutive validations.
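The warmup–hold–decay schedule can be sketched as a small piecewise-linear function; the phase lengths below are illustrative placeholders, not the paper's values:

```python
def lambda_schedule(t, warmup=1000, hold=5000, decay=4000):
    """Warmup-hold-decay weight for the distillation term.

    Ramps linearly 0 -> 1 over `warmup` steps, holds at 1 for `hold`
    steps, then decays linearly back to 0 over `decay` steps.
    Phase lengths here are hypothetical examples.
    """
    if t < warmup:
        return t / warmup
    if t < warmup + hold:
        return 1.0
    return max(0.0, 1.0 - (t - warmup - hold) / decay)
```

The surpass-based gate multiplies this schedule by an indicator that flips to zero permanently, so late training runs purely on the base loss.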

Algorithm 1: Proposed Weak-to-Strong Training

Input: frozen teacher gφ, student fθ, train data D, val data Dval, schedule Λ(t), scale γ, stop length k
1. Evaluate teacher once: m_ref ← Eval(g_φ, D_val); initialize c ← 0, a ← 1
2. For each training step t = 1, …, T: set λ_t ← a · Λ(t);
   compute L_base and L_distill, then update θ ← OptStep(θ, ∇_θ[L_base + γ λ_t L_distill])
3. At each validation interval, evaluate the student: m_t ← Eval(f_θ, D_val)
4. If m_t > m_ref: c ← c + 1; otherwise reset c ← 0 (consecutive surpasses)
5. If c ≥ k: set a ← 0 (disable distillation permanently)
6. Return θ*
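The stop-after-surpass gate (steps 4–5) can be sketched as a small stateful helper; the class name and interface are our own illustration, not the paper's code:

```python
class SurpassGate:
    """Permanently disables distillation after k consecutive validations
    in which the student metric exceeds the frozen teacher's (m_ref)."""

    def __init__(self, teacher_metric, k=2):
        self.m_ref = teacher_metric  # teacher score, evaluated once
        self.k = k                   # required consecutive surpasses
        self.count = 0               # consecutive-surpass counter c
        self.active = True           # indicator a: True means a = 1

    def update(self, student_metric):
        """Call after each validation; once disabled, stays disabled."""
        if not self.active:
            return
        self.count = self.count + 1 if student_metric > self.m_ref else 0
        if self.count >= self.k:
            self.active = False  # a <- 0, permanent

    def scale(self, lam_t):
        """Effective distillation weight a * Lambda(t)."""
        return lam_t if self.active else 0.0
```

In the training loop, `gate.scale(lambda_t)` multiplies the distillation term, so once the gate closes the objective reduces to L_base alone.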

Task-Specific Instantiations

Classification: L_distill = KL divergence on temperature-scaled softened posteriors (T: 6→1).

Object Detection: L_distill = classification logit distillation + optional box-regression alignment.

Generation: L_distill = ‖ε_θ(x_t, t) − ε_φ(x_t, t)‖₂², the MSE between student and teacher noise predictions on the same noised sample.
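The classification and generation losses above can be sketched in PyTorch; the function names are ours for illustration, and the temperature annealing (6→1) is assumed to happen outside these helpers:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=6.0):
    """Classification distillation: forward KL on temperature-softened
    posteriors. T is annealed 6 -> 1 over training by the caller."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def diffusion_kd_loss(eps_student, eps_teacher):
    """Generation distillation: MSE between student and teacher noise
    predictions eps_theta(x_t, t) and eps_phi(x_t, t) on the same x_t."""
    return F.mse_loss(eps_student, eps_teacher)
```

Both terms plug into the total loss as γ · λ(t) · L_distill alongside the unchanged base loss.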

Results

Image Classification

Table 1: Classification Training Acceleration on ImageNet and CIFAR-10/100.
Dataset | Student | Teacher | Optimizer | Teacher Top-1 | Target τ | Speedup | Best Top-1 (Base / Ours)
ImageNet | ResNet-50 | ResNet-18 | SGD | 70.73 | 65 | 1.62× | 76.52 / 76.81
ImageNet | ResNet-50 | ResNet-18 | AdamW | 70.73 | 65 | 2.86× | 77.19 / 77.72
ImageNet | ResNet-50 | ResNet-18 | Muon | 70.73 | 65 | 4.75× | 77.11 / 77.11
ImageNet | ResNet-50 | MobileNetV2 | SGD | 69.62 | 65 | 1.48× | 76.52 / 76.77
ImageNet | ResNet-50 | MobileNetV2 | AdamW | 69.62 | 65 | 2.00× | 77.19 / 77.56
ImageNet | ResNet-50 | MobileNetV2 | Muon | 69.62 | 65 | 3.17× | 77.11 / 77.09
ImageNet | ViT-S/16 | MobileNetV2 | AdamW | 69.62 | 50 | 1.45× | 75.87 / 75.36
CIFAR-100 | ResNet-18 | DenseNet-40 | SGD | 69.70 | 60 | 1.60× | 78.75 / 79.36
CIFAR-100 | DenseNet-100 | DenseNet-40 | SGD | 69.70 | 60 | 1.83× | 82.08 / 82.51
CIFAR-10 | ResNet-50 | MobileNetV2 | AdamW | 85.55 | 75 | 1.60× | 93.95 / 94.00
CIFAR-10 | ResNet-50 | MobileNetV2 | SGD | 85.55 | 75 | 1.36× | 95.65 / 95.08

Object Detection & Image Generation

Table 2: Detection and Generation Training Acceleration.
Task | Dataset | Student | Teacher | Target τ | first@τ (Base → Ours) | Speedup | Best Metric (Base / Ours)
Detection | COCO | RetinaNet-R50 | RetinaNet-R34 | AP50 20.0% | 10 ep → 6 ep | 1.67× | AP50: 22.55 / 30.67
Detection | COCO | Faster R-CNN-R50 | Faster R-CNN-R18 | AP50 20.0% | 4 ep → 3 ep | 1.33× | AP50: 35.97 / 36.72
Generation | CIFAR-10 | nc128-rb3 | nc64-rb2 | FID↓ 60 | 16k → 6k | 2.67× | FID↓: 52.27 / 47.22
Generation | CIFAR-10 | nc160-rb3 | nc64-rb2 | FID↓ 60 | 18k → 12k | 1.50× | FID↓: 53.49 / 47.67

Ablation Experiments

Switching Teachers and Students

(a) Keep student fixed, switch teachers. ImageNet R50 student with different teachers: suitably weaker teachers consistently accelerate early convergence.
(b) Keep teacher fixed, vary students. CIFAR-100 with DenseNet-40 teacher: the early left-shift appears across different student architectures.

Teacher Mismatch Across Tasks

Table 3: Too-weak, too-strong, and suitably-weaker teacher settings under matched training recipes.
Task | Regime | Student | Teacher | Target τ | first@τ (Base → Ours) | Speedup
Classification | Too weak | ResNet-50 | MobileNetV3-S | Top-1 65 | 19 ep → 37 ep | 0.51×
Classification | Too strong | MobileNetV2 | ResNet-50 | Top-1 65 | 60 ep → 57 ep | 1.05×
Classification | Suitably weaker | ResNet-50 | ResNet-18 | Top-1 65 | 19 ep → 4 ep | 4.75×
Detection | Too weak | RetinaNet-R50 | RetinaNet-R34 (earlier ckpt, AP50≈12.8) | AP50 20.0% | 10 ep → 9 ep | 1.11×
Detection | Too strong | RetinaNet-R50 | RetinaNet-R34 (later ckpt, AP50≈26.2) | AP50 20.0% | 10 ep → 9 ep | 1.11×
Detection | Suitably weaker | RetinaNet-R50 | RetinaNet-R34 (mid ckpt, AP50≈19.5) | AP50 20.0% | 10 ep → 6 ep | 1.67×
Generation (CIFAR-10) | Too weak | nc128-rb3 | nc64-rb1 | FID 60 | 16k → 20k | 0.80×
Generation (CIFAR-10) | Too strong | nc128-rb3 | nc64-rb3 | FID 60 | 16k → 16k | 1.00×
Generation (CIFAR-10) | Suitably weaker | nc128-rb3 | nc64-rb2 | FID 60 | 16k → 6k | 2.67×

Optimizer Sensitivity & Gradient Diagnostics

(a) Optimizer comparison on ImageNet with fixed R18→R50 setup. Acceleration is consistent across SGD, AdamW, and Muon.
(b) Gradient norms during early ImageNet training. The base and distillation gradient components stay in a comparable range, indicating stable optimization.

Warm-Start & Stop-After-Surpass

(a) Warm-start vs no warm-start. Warm-start gives a smoother, stronger early rise under the same teacher-student pair.
(b) Post-surpass behavior. Gating distillation off after surpass avoids stale supervision in late training.

Label Smoothing & KL Direction

(a) Label smoothing ablation. Our method keeps early-stage advantage across different label smoothing values while final Top-1 remains close.
(b) KL direction ablation. Forward and reverse KL variants yield similar trajectories, indicating gains come from the schedule rather than KL direction.

Analysis: Teacher Operating Band

(Bottom-row panels: MobileNetV3-S → ResNet-50, teacher 61.44%, too weak; ResNet-50 → MobileNetV2, teacher 76.81%, too strong; ResNet-18 → ResNet-50, teacher 70.42%, suitably weaker.)
Figure 3: Teacher Operational Band. Top: speedup ratio versus relative teacher–student Top-1 gap. Suitably weaker teachers (moderate gap) give the strongest acceleration. Bottom: representative too-weak, too-strong, and suitably-weaker trajectories under matched training recipes.
Figure 7: Diagnostics for Teacher Mismatch. Left: CKA (student vs. teacher features), where higher is better. Middle: teacher entropy per sample, where high means uncertain targets. Right: KL(student||teacher), where smaller is better within the same regime. Importantly, absolute KL values across different regimes are not directly comparable because teacher distributions differ; the key signal is the baseline→ours change within each regime. Too-weak teachers have high entropy but low informativeness; too-strong teachers have low entropy but poor teachability; suitably-weaker teachers balance both.

Citation

@article{li2026weak,
  title   = {Weak-to-Strong Knowledge Distillation Accelerates Visual Learning},
  author  = {Li, Baiang and Chai, Wenhao and Heide, Felix},
  journal = {arXiv preprint},
  year    = {2026}
}