ScenarioControl: Vision-language Controllable Vectorized Latent Scenario Generation

Lili Gao1*, Yanbo Xu2*, William Koch2*, Samuele Ruffino1, Luke Rowe3, Behdad Chalaki1, Dmitriy Rivkin1, Julian Ost1,2, Roger Girgis1,3, Mario Bijelic1,2, Felix Heide1,2
* Equal contribution
1Torc Robotics, 2Princeton University, 3Mila

*Code to be released.

Summary

ScenarioControl is the first vision-language control mechanism for learned driving scenario generation. Given a text prompt or an input image, it synthesizes diverse, realistic 3D scenario rollouts — including map structure, reactive agents over time, pedestrians, infrastructure, and ego-view observations. The model generates scenes in a vectorized latent space that jointly represents road structure and dynamic agents. With this, it can produce temporally consistent scenario rollouts from the perspective of different actors in the scene, as well as long-horizon continuations of driving scenarios.

To connect multimodal control with sparse vectorized scene elements, we introduce a cross-global control mechanism that combines standard cross-attention with a lightweight global-context branch, enabling fine-grained control over road layout and traffic conditions while preserving realism. Extensive experiments validate that ScenarioControl outperforms all tested methods in control adherence and fidelity across all experiments. To facilitate further training and evaluation, we release a dataset with text annotations aligned to vectorized map structures and visual context.

Method

ScenarioControl models each scene as a graph of objects and lane centerlines in a bird's-eye view, with per-object elevation for image-domain reprojection. A controllable latent diffusion model (conditioned on text prompts or agent-view images) produces this vectorized representation in a single forward pass. A gated cross-global control mechanism fuses conditioning signals with sparse scene tokens, while a count-injection head controls lane and agent cardinality, enabling fine-grained edits to road layout and traffic composition.
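As a rough sketch of the gated cross-global fusion described above (all function and weight names here are illustrative, not the paper's implementation), scene tokens attend over the conditioning tokens via standard cross-attention, a pooled global-context branch is broadcast to every token, and a gate blends both into the residual stream:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_global_control(scene, cond, Wq, Wk, Wv, Wg, gate):
    """Fuse conditioning tokens into sparse scene tokens.

    scene: (N, d) vectorized scene tokens (lanes, agents)
    cond:  (M, d) conditioning tokens (text or image features)
    Weight matrices and the scalar gate are stand-ins; in practice
    the gate would be learned.
    """
    d = scene.shape[-1]
    # Standard cross-attention: scene tokens query the conditioning.
    q, k, v = scene @ Wq, cond @ Wk, cond @ Wv
    attn = softmax(q @ k.T / np.sqrt(d)) @ v            # (N, d)
    # Lightweight global branch: pooled conditioning context,
    # broadcast identically to every scene token.
    global_ctx = np.tanh(cond.mean(axis=0) @ Wg)        # (d,)
    # Gate blends both branches into the residual stream.
    return scene + gate * (attn + global_ctx)

rng = np.random.default_rng(0)
d = 16
scene = rng.normal(size=(8, d))    # 8 sparse scene tokens
cond = rng.normal(size=(4, d))     # 4 conditioning tokens
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)]
out = cross_global_control(scene, cond, *Ws, gate=0.5)
print(out.shape)  # (8, 16)
```

The global branch gives every scene token the same summary of the condition (e.g., overall traffic density), while cross-attention routes token-specific detail (e.g., which lane a described vehicle occupies).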

Scenario Generation Pipeline

The vectorized scene representation feeds an existing BEV-space behavior simulator, whose output is projected into wireframe control signals for video generation. We train two video-model variants: an image-conditioned model that derives appearance from the first frame, and a prompt-conditioned model whose appearance follows a text description.
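The simulate-then-project loop above can be sketched with toy stand-ins (the real behavior simulator and video model are far more complex; every name below is a hypothetical placeholder):

```python
import numpy as np

def behavior_step(agents, dt=0.1):
    """Constant-velocity stand-in for the BEV behavior simulator."""
    pos, vel = agents
    return pos + vel * dt, vel

def project_to_wireframe(pos, ego_idx=0):
    """Crude stand-in for ego-view wireframe projection: express
    agent positions relative to the ego agent."""
    return pos - pos[ego_idx]

def rollout(agents, horizon):
    """Simulate forward and collect per-frame wireframe controls,
    which would then condition the video model alongside the
    first frame or text prompt."""
    frames = []
    for _ in range(horizon):
        agents = behavior_step(agents)
        frames.append(project_to_wireframe(agents[0]))
    return frames

pos = np.array([[0.0, 0.0], [5.0, 2.0]])   # ego and one other agent
vel = np.array([[1.0, 0.0], [0.0, -1.0]])
frames = rollout((pos, vel), horizon=10)
print(len(frames), frames[-1][1])  # 10 frames; final relative offset
```

The key design point is the interface: the video model never sees raw simulator state, only projected wireframe control signals, which is what lets the same rollout drive either video-model variant.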

Scenario-Controlled Video Generation Pipeline

Experimental Findings

We visualize initial scenes generated via ScenarioControl under both image and prompt conditioning, alongside video rendering of the resulting driving rollouts. Across conditions, the method produces diverse and realistic scenarios with consistent road geometry, plausible agent behavior, and coherent ego-view observations.

Image-Conditioned Scene Generation

We generate initial scenes conditioned on an image. In the visualization below, we then project the generated scenes back into the original image to highlight how closely the synthesized geometry aligns with the real scene. Hover over the image to see the generated scenes and projection overlay.

Selected matched projection result (hover to show generated scenes)
Phase 1: Image-conditioned scene generation (right), and reprojected back into ego-view (left).

Pick a conditioning image below to see the generated scenario play out as a video rollout. Note that road topology and visible agents stay faithful to the input, and unobserved regions are filled in coherently.

Conditioning Image
Selected conditioning image
Conditional Generation
Phase 2: Behavioral rollout of image-conditioned generated scene using a simulator (right).

To produce video rollouts, we simulate the generated scenarios forward in time with a behavior simulator and feed the resulting wireframe trajectories, together with the original conditioning frame, to our first-frame-conditioned video model. The clips below show the full pipeline end-to-end:

Phase 3: End-to-end video rollout (left) generated from the image-conditioned scene and its behavioral rollout (right).

Prompt-Conditioned Scene Generation

ScenarioControl also generates scenes directly from natural-language descriptions of road structure and traffic. Toggle the attributes below (intersection type, pedestrian presence, and lane count) to assemble a prompt and see the corresponding generations update. The samples reflect the requested attributes while still varying in realistic, fine-grained detail.

Intersection
Pedestrians
# Lanes

Additional Prompt Examples

Prompt example generation

Prompt-Conditioned Video Generation

We can combine a rolled-out BEV scenario with a prompt-conditioned video generation model (e.g., Wan 2.2, Cosmos) to create consistent appearance variations across text prompts. The clips below show the same underlying scenario generated under different weather and lighting conditions:

Prompt-Conditioned Scenario and Video Generation.
Prompts by condition:
Slightly overcast: This image depicts a suburban environment characterized by a red brick apartment complex with traditional architectural elements. The road is a two-lane asphalt surface in good condition with clear double yellow lines, bordered by concrete sidewalks. The scene is set on a partly cloudy day with soft, diffused light suggesting late morning or early afternoon, and features mature trees and well-maintained grassy areas alongside the buildings.
Rainy: The image depicts an urban environment with a mix of modern and traditional mid-rise buildings featuring large windows and brick facades. The road is a wet, paved city street with multiple lanes and pedestrian sidewalks, surrounded by trees and green patches. Overcast skies and diffused lighting suggest a rainy day during daylight hours, likely in the morning or afternoon.
Night: This image captures a downtown urban street at night, illuminated by warm sodium streetlights and colorful neon signs from nearby shops and restaurants. The road is a multi-lane asphalt surface with reflective lane markings that glow under the artificial lighting. Modern mid-rise buildings with glass facades line both sides of the street, and the dark sky above shows no stars due to light pollution, while a few street trees are visible as dark silhouettes against the lit storefronts.
Snowy: This image depicts a quiet suburban street blanketed in fresh snow, with low-rise residential buildings featuring snow-covered roofs and bare deciduous trees lining the sidewalks. The road is a two-lane surface partially cleared of snow, with tire tracks visible on the wet asphalt. The sky is overcast and pale gray, casting flat, diffused light typical of a cold winter morning, while patches of white snow cover the lawns and parked cars along the curb.

Large Scene Generation

Although our diffusion model is trained on local 64×64 m scenes, we extend it to much larger environments via iterative outpainting. Starting from an image- or prompt-conditioned initial scene, we repeatedly grow the layout outward, producing diverse multi-block road networks: distinct intersection types, lane configurations, and agent distributions that remain topologically coherent throughout. The viewer below lets you explore extensions generated from a few different conditioning images and prompts.
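The iterative outpainting loop can be sketched as follows. This is a toy illustration, not the paper's method: the scene is a 2D grid, the "diffusion model" is a deterministic stub, and the tile and overlap sizes are assumptions; the point is only the structure of the loop, where each new tile is conditioned on an overlapping strip of the scene generated so far.

```python
import numpy as np

TILE, OVERLAP = 64, 16  # local window size; conditioning strip width

def generate_tile(context):
    """Stand-in for the local diffusion model: keeps the
    conditioning overlap and fills the rest (toy)."""
    tile = np.zeros((TILE, TILE))
    if context is not None:
        tile[:, :OVERLAP] = context  # condition on the known strip
    tile[:, OVERLAP:] = 1.0          # "generated" content
    return tile

def outpaint(n_tiles):
    """Grow the scene eastward one overlapping tile at a time."""
    scene = generate_tile(None)
    for _ in range(n_tiles - 1):
        context = scene[:, -OVERLAP:]            # trailing strip
        tile = generate_tile(context)
        # Append only the newly generated (non-overlapping) part.
        scene = np.concatenate([scene, tile[:, OVERLAP:]], axis=1)
    return scene

scene = outpaint(4)
print(scene.shape)  # (64, 64 + 3 * 48) = (64, 208)
```

Conditioning each tile on the overlap is what keeps lane topology continuous across tile boundaries; without it, adjacent tiles would generate disconnected road networks.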

Conditioning
Conditioning image for large scene
Generated Scene
Generated large scene

Listen to Paper Overview

This audio overview is AI-generated and may contain inaccuracies. Refer to the paper for authoritative content.

BibTeX

@article{scenariocontrol2026,
  title   = {ScenarioControl: Vision-Language Controllable Vectorized
             Latent Scenario Generation},
  author  = {Gao, Lili and Xu, Yanbo and Koch, William and
             Ruffino, Samuele and Rowe, Luke and Chalaki, Behdad and
             Rivkin, Dmitriy and Ost, Julian and Girgis, Roger and
             Bijelic, Mario and Heide, Felix},
  journal = {arXiv preprint arXiv:2604.17147},
  year    = {2026}
}