Weak-to-Strong Knowledge Distillation
Accelerates Visual Learning

Princeton University

Abstract

Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead use distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns distillation off once the student reaches and surpasses teacher-level performance. On ImageNet and CIFAR classification, this strategy reaches target accuracy thresholds much earlier, with up to a 4.8× speedup measured in epochs. The method generalizes to other tasks: we observe a 1.7× epoch speedup for object detection on COCO and a 2.5× earlier target-FID crossing, measured in training steps, for diffusion-based generation on CIFAR-10. These findings support weak-to-strong distillation as a general speedup mechanism for visual learning.

Figure 1: Weak-to-Strong Distillation Accelerates Visual Learning. Blue curves are baseline training and red curves are with our method. From left to right: classification on ImageNet, diffusion-based generation on CIFAR-10, and object detection on COCO. Our method reaches target quality earlier in all three tasks: 65% Top-1 on ImageNet (4.8× fewer epochs), target FID on diffusion (more than 2.5× fewer training steps), and target AP50 for object detection (1.7× fewer epochs).

Method

Our method is a plug-and-play early-training add-on: keep a weaker teacher frozen, add a distillation loss in early training, and turn distillation off after the student surpasses teacher-level performance. The total loss is:

L = L_base + γ · λ(t) · L_distill

where λ(t) follows a warmup–hold–decay schedule Λ(t) and is permanently set to zero once the student surpasses the teacher on k = 2 consecutive validations.
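The warmup–hold–decay schedule can be sketched as a small piecewise-linear function; the phase lengths below are illustrative placeholders, not the paper's values:

```python
def lambda_schedule(t, warmup=1000, hold=5000, decay=4000):
    """Warmup-hold-decay weight for the distillation term.

    Ramps linearly 0 -> 1 over `warmup` steps, holds at 1 for `hold`
    steps, then decays linearly back to 0 over `decay` steps.
    Phase lengths here are hypothetical examples.
    """
    if t < warmup:
        return t / warmup
    if t < warmup + hold:
        return 1.0
    return max(0.0, 1.0 - (t - warmup - hold) / decay)
```

The surpass-based gate multiplies this schedule by an indicator that flips to zero permanently, so late training runs purely on the base loss.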

Algorithm 1: Proposed Weak-to-Strong Training

Input: frozen teacher gφ, student fθ, train data D, val data Dval, schedule Λ(t), scale γ, stop length k
1. Evaluate teacher once: m_ref ← Eval(g_φ, D_val); initialize c ← 0, a ← 1
2. For each training step t = 1, …, T: set λ_t ← a · Λ(t);
   compute L_base and L_distill, then update θ ← OptStep(θ, ∇_θ[L_base + γ λ_t L_distill])
3. At each validation interval, evaluate the student: m_t ← Eval(f_θ, D_val)
4. If m_t > m_ref: c ← c + 1; otherwise reset c ← 0 (consecutive surpasses)
5. If c ≥ k: set a ← 0 (disable distillation permanently)
6. Return θ*
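The stop-after-surpass gate (steps 4–5) can be sketched as a small stateful helper; the class name and interface are our own illustration, not the paper's code:

```python
class SurpassGate:
    """Permanently disables distillation after k consecutive validations
    in which the student metric exceeds the frozen teacher's (m_ref)."""

    def __init__(self, teacher_metric, k=2):
        self.m_ref = teacher_metric  # teacher score, evaluated once
        self.k = k                   # required consecutive surpasses
        self.count = 0               # consecutive-surpass counter c
        self.active = True           # indicator a: True means a = 1

    def update(self, student_metric):
        """Call after each validation; once disabled, stays disabled."""
        if not self.active:
            return
        self.count = self.count + 1 if student_metric > self.m_ref else 0
        if self.count >= self.k:
            self.active = False  # a <- 0, permanent

    def scale(self, lam_t):
        """Effective distillation weight a * Lambda(t)."""
        return lam_t if self.active else 0.0
```

In the training loop, `gate.scale(lambda_t)` multiplies the distillation term, so once the gate closes the objective reduces to L_base alone.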

Task-Specific Instantiations

Classification: L_distill = KL divergence on temperature-scaled softened posteriors (T: 6→1).

Object Detection: L_distill = classification logit distillation + optional box-regression alignment.

Generation: L_distill = ‖ε_θ(x_t, t) − ε_φ(x_t, t)‖₂², the MSE between student and teacher noise predictions on the same noised sample.
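The classification and generation losses above can be sketched in PyTorch; the function names are ours for illustration, and the temperature annealing (6→1) is assumed to happen outside these helpers:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=6.0):
    """Classification distillation: forward KL on temperature-softened
    posteriors. T is annealed 6 -> 1 over training by the caller."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def diffusion_kd_loss(eps_student, eps_teacher):
    """Generation distillation: MSE between student and teacher noise
    predictions eps_theta(x_t, t) and eps_phi(x_t, t) on the same x_t."""
    return F.mse_loss(eps_student, eps_teacher)
```

Both terms plug into the total loss as γ · λ(t) · L_distill alongside the unchanged base loss.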

Results

Image Classification

Table 1: Classification Training Acceleration on ImageNet and CIFAR-10/100.
Dataset | Student | Teacher | Optimizer | Teacher Top-1 | Target τ | Speedup | Best Top-1 (Base / Ours)
ImageNet | ResNet-50 | ResNet-18 | SGD | 70.73 | 65 | 1.62× | 76.52 / 76.81
ImageNet | ResNet-50 | ResNet-18 | AdamW | 70.73 | 65 | 2.86× | 77.19 / 77.72
ImageNet | ResNet-50 | ResNet-18 | Muon | 70.73 | 65 | 4.75× | 77.11 / 77.11
ImageNet | ResNet-50 | MobileNetV2 | SGD | 69.62 | 65 | 1.48× | 76.52 / 76.77
ImageNet | ResNet-50 | MobileNetV2 | AdamW | 69.62 | 65 | 2.00× | 77.19 / 77.56
ImageNet | ResNet-50 | MobileNetV2 | Muon | 69.62 | 65 | 3.17× | 77.11 / 77.09
ImageNet | ViT-S/16 | MobileNetV2 | AdamW | 69.62 | 50 | 1.45× | 75.87 / 75.36
CIFAR-100 | ResNet-18 | DenseNet-40 | SGD | 69.70 | 60 | 1.60× | 78.75 / 79.36
CIFAR-100 | DenseNet-100 | DenseNet-40 | SGD | 69.70 | 60 | 1.83× | 82.08 / 82.51
CIFAR-10 | ResNet-50 | MobileNetV2 | AdamW | 85.55 | 75 | 1.60× | 93.95 / 94.00
CIFAR-10 | ResNet-50 | MobileNetV2 | SGD | 85.55 | 75 | 1.36× | 95.65 / 95.08

Object Detection & Image Generation

Table 2: Detection and Generation Training Acceleration.
Task | Dataset | Student | Teacher | Target τ | first@τ (Base → Ours) | Speedup | Best Metric (Base / Ours)
Detection | COCO | RetinaNet-R50 | RetinaNet-R34 | AP50 20.0% | 10 ep → 6 ep | 1.67× | AP50: 22.55 / 30.67
Detection | COCO | Faster R-CNN-R50 | Faster R-CNN-R18 | AP50 20.0% | 4 ep → 3 ep | 1.33× | AP50: 35.97 / 36.72
Generation | CIFAR-10 | nc128-rb3 | nc64-rb2 | FID↓ 60 | 16k → 6k | 2.67× | FID↓: 52.27 / 47.22
Generation | CIFAR-10 | nc160-rb3 | nc64-rb2 | FID↓ 60 | 18k → 12k | 1.50× | FID↓: 53.49 / 47.67

Ablation Experiments

Switching Teachers and Students

(a) Keep student fixed, switch teachers. ImageNet R50 student with different teachers: suitably weaker teachers consistently accelerate early convergence.
(b) Keep teacher fixed, vary students. CIFAR-100 with DenseNet-40 teacher: the early left-shift appears across different student architectures.

Teacher Mismatch Across Tasks

Table 3: Too-weak, too-strong, and suitably-weaker teacher settings under matched training recipes.
Task | Regime | Student | Teacher | Target τ | first@τ (Base → Ours) | Speedup
Classification | Too weak | ResNet-50 | MobileNetV3-S | Top-1 65 | 19 ep → 37 ep | 0.51×
Classification | Too strong | MobileNetV2 | ResNet-50 | Top-1 65 | 60 ep → 57 ep | 1.05×
Classification | Suitably weaker | ResNet-50 | ResNet-18 | Top-1 65 | 19 ep → 4 ep | 4.75×
Detection | Too weak | RetinaNet-R50 | RetinaNet-R34 (earlier ckpt, AP50≈12.8) | AP50 20.0% | 10 ep → 9 ep | 1.11×
Detection | Too strong | RetinaNet-R50 | RetinaNet-R34 (later ckpt, AP50≈26.2) | AP50 20.0% | 10 ep → 9 ep | 1.11×
Detection | Suitably weaker | RetinaNet-R50 | RetinaNet-R34 (mid ckpt, AP50≈19.5) | AP50 20.0% | 10 ep → 6 ep | 1.67×
Generation (CIFAR-10) | Too weak | nc128-rb3 | nc64-rb1 | FID 60 | 16k → 20k | 0.80×
Generation (CIFAR-10) | Too strong | nc128-rb3 | nc64-rb3 | FID 60 | 16k → 16k | 1.00×
Generation (CIFAR-10) | Suitably weaker | nc128-rb3 | nc64-rb2 | FID 60 | 16k → 6k | 2.67×

Optimizer Sensitivity & Gradient Diagnostics

(a) Optimizer comparison on ImageNet with fixed R18→R50 setup. Acceleration is consistent across SGD, AdamW, and Muon.
(b) Gradient norms during early ImageNet training. The base and distillation gradient components stay in a comparable range, indicating stable optimization.

Warm-Start & Stop-After-Surpass

(a) Warm-start vs no warm-start. Warm-start gives a smoother, stronger early rise under the same teacher-student pair.
(b) Post-surpass behavior. Gating distillation off after surpass avoids stale supervision in late training.

Label Smoothing & KL Direction

(a) Label smoothing ablation. Our method keeps early-stage advantage across different label smoothing values while final Top-1 remains close.
(b) KL direction ablation. Forward and reverse KL variants yield similar trajectories, indicating gains come from the schedule rather than KL direction.

Analysis: Teacher Operating Band

(Bottom-row panels: MobileNetV3-S → ResNet-50, teacher 61.44%, too weak; ResNet-50 → MobileNetV2, teacher 76.81%, too strong; ResNet-18 → ResNet-50, teacher 70.42%, suitably weaker.)
Figure 3: Teacher Operational Band. Top: speedup ratio versus relative teacher–student Top-1 gap. Suitably weaker teachers (moderate gap) give the strongest acceleration. Bottom: representative too-weak, too-strong, and suitably-weaker trajectories under matched training recipes.
Figure 7: Diagnostics for Teacher Mismatch. Left: CKA (student vs. teacher features), where higher is better. Middle: teacher entropy per sample, where high means uncertain targets. Right: KL(student||teacher), where smaller is better within the same regime. Importantly, absolute KL values across different regimes are not directly comparable because teacher distributions differ; the key signal is the baseline→ours change within each regime. Too-weak teachers have high entropy but low informativeness; too-strong teachers have low entropy but poor teachability; suitably-weaker teachers balance both.

Citation

@article{li2026weak,
  title   = {Weak-to-Strong Knowledge Distillation Accelerates Visual Learning},
  author  = {Li, Baiang and Chai, Wenhao and Heide, Felix},
  journal = {arXiv preprint},
  year    = {2026}
}