ChopGrad: Pixel-Wise Losses for Latent
Video Diffusion via Truncated Backpropagation

Dmitriy Rivkin, Parker Ewen, Lili Gao, Julian Ost, Stefanie Walz, Rasika Kangutkar, Mario Bijelic, Felix Heide

Summary

Recent video diffusion models achieve high-quality generation through recurrent frame processing, in which each generated frame depends on the previous ones. However, this recurrence means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. As a result, fine-tuning these models with pixel-wise losses is computationally intractable for long or high-resolution videos.

ChopGrad reduces training memory from linear in the number of video frames (full backpropagation) to constant, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.

Truncated Backpropagation with ChopGrad

Popular latent video diffusion models rely on autoencoders with causal caching: the encoder groups frames into non-overlapping segments, and at each decoder layer the features from the previous segment are concatenated onto the current one before decoding. This creates a recurrent dependency chain — each decoded segment depends on all prior segments — enabling temporal consistency at inference time. During training, however, pixel-wise losses require backpropagating through this entire chain, making memory costs scale linearly with video length and rendering such losses intractable for long or high-resolution videos.
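
The recurrent dependency chain can be sketched as follows. This is a minimal toy illustration, not the actual decoder: a single stand-in "layer" (a matrix multiply and `tanh`) replaces the real decoder stack, and the cache holds the previous group's features, which are concatenated onto the current group before decoding.

```python
import numpy as np

def causal_cached_decode(latent_groups, weight):
    """Toy sketch of causal caching: each group is decoded together with
    the cached features of the previous group, so every decoded segment
    depends (transitively) on all prior segments."""
    cache, frames = None, []
    for z in latent_groups:  # z: (frames_per_group, dim)
        # Concatenate previous-group features onto the current group.
        x = z if cache is None else np.concatenate([cache, z], axis=0)
        feat = np.tanh(x @ weight)          # stand-in for the decoder layers
        cur = feat[-z.shape[0]:]            # keep only current-group frames
        frames.append(cur)
        cache = cur                         # cached for the next group
    return frames
```

Because `cache` at step t is a function of `cache` at step t-1, backpropagating a pixel loss through `frames[-1]` would touch the activations of every earlier group, which is exactly the linear memory growth described above.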

ChopGrad exploits latent temporal locality: the influence of a prior frame group on the gradient of a later one decays exponentially with temporal distance. This means the loss gradient at any given segment is dominated by contributions from nearby segments, and distant ones can be safely ignored. ChopGrad applies truncated backpropagation through time to the decoder cache, restricting gradient flow to a fixed number of previous frame groups. This breaks the full recursive loop, reduces training memory to a constant irrespective of video length, and introduces only a bounded, exponentially-small approximation error.
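
A minimal PyTorch sketch of the truncation, under the same toy-decoder assumption as above (a single `tanh` layer standing in for the real decoder, and a cache holding the previous group's features): detaching the cache every `truncate` groups cuts the recursive gradient chain, approximating ChopGrad's fixed gradient window while leaving the forward pass unchanged.

```python
import torch

def decoder_step(z, cache, W):
    # Toy decoder layer: current latent group plus cached previous features.
    feat = torch.tanh(z @ W + (cache if cache is not None else 0.0))
    return feat, feat  # decoded frames double as the next cache

def decode(latents, W, truncate=None):
    """Decode latent groups sequentially. With truncate=k, the cache is
    detached every k groups, so gradients flow through at most k previous
    groups; with truncate=None, gradients flow through the full chain."""
    outputs, cache = [], None
    for t, z in enumerate(latents):
        if truncate is not None and cache is not None and t % truncate == 0:
            cache = cache.detach()  # ChopGrad: cut the gradient chain here
        out, cache = decoder_step(z, cache, W)
        outputs.append(out)
    return outputs
```

With `truncate=1`, a pixel loss on a late group produces no gradient for early latents, so their activations need not be retained; with `truncate=None`, the full chain is kept and memory grows with video length.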

[Figure: ChopGrad model architecture showing causal caching in the video decoder]

Truncated backpropagation with ChopGrad. The video decoder processes latent embeddings group by group, appending a cache of decoded features from the previous group at each layer to maintain temporal context. During training, this recurrent structure causes activation accumulation across all prior groups, making pixel-wise losses prohibitively expensive. ChopGrad limits gradient flow to a fixed number of previous frame groups, breaking the recursive loop while preserving training signal.

Applications

We evaluate ChopGrad across four conditional video generation tasks. Use the arrows beside each comparison or the dots under it to change clips, drag across the video to move the split, and click to play or pause. The left and right dropdowns below each comparison choose which methods appear on each side of the slider.

Artifact Removal in Novel View Synthesis

Renders of 3D Gaussian Splatting models trained on sparse camera frames often contain significant artifacts: the sparsity of views leaves gaussians under-constrained ("floaters"), degrading visual quality whenever the viewpoint deviates from the training images. We compare against Difix, a state-of-the-art single-frame model trained at high resolution that uses pixel-wise losses but cannot consider multiple frames jointly, and MVSplat360, a state-of-the-art video model that applies pixel-wise losses on video but is limited to low-resolution, short clips by memory constraints. ChopGrad unlocks high-resolution, long-duration pixel supervision, enabling the model to fix entire trajectories jointly rather than frame by frame.


Video Super-Resolution

Video super-resolution must recover fine structure and plausible motion without amplifying compression noise or introducing frame-to-frame flicker. We compare against DOVE, a state-of-the-art super-resolution model based on CogVideoX. ChopGrad fine-tunes the DOVE checkpoint with pixel-wise perceptual objectives targeting full sequences, yielding sharper detail and more realistic textures.


Controlled Driving Video Generation

Controlled driving video synthesis requires the capability to realistically composite 3D vehicle models into reconstructed 3D scene models. Naïve Insertion places vehicles with no post-processing, leading to myriad visual artifacts such as missing shadows and inconsistent lighting. Mirage, the state-of-the-art controlled driving video generation model, alleviates these artifacts via single-step diffusion post-processing, but is limited in its training resolution and duration, reducing its efficacy on long-duration, high-resolution videos. ChopGrad post-training applied to the Mirage model enhances performance in these cases, producing significantly more realistic scenes.


Video Inpainting

Video inpainting demands the replacement of masked regions in every frame while keeping motion and appearance coherent over long clips. We compare against VACE, a state-of-the-art context adapter for Wan2.1 sampled with 50 diffusion steps. ChopGrad fine-tunes Wan2.1 for video inpainting on several datasets, producing a single-step model that results in fewer unrealistic hallucinations while requiring a 50× lower inference budget.

BibTeX

@article{rivkin2026chopgrad,
    title   = {ChopGrad: Pixel-Wise Losses for Latent Video Diffusion
               via Truncated Backpropagation},
    author  = {Rivkin, Dmitriy and Ewen, Parker and Gao, Lili and
               Ost, Julian and Walz, Stefanie and Kangutkar, Rasika and
               Bijelic, Mario and Heide, Felix},
    journal = {arXiv preprint arXiv:2603.17812},
    year    = {2026},
}