Ground Truth Annotations

The dataset contains manual annotations for both visual and sound classes. The annotations were done on data sampled at 5 Hz, which amounts to 16,324 keyframes, 11 sound classes, and 6 vision classes.

In addition to image class labels, each sampled image was also annotated with sound labels in two domains: dominant (distinct and in foreground) and secondary (in the background). All labels were created by highly experienced human annotators using a custom toolset. Their work passed through subsequent phases of verification and quality assurance to ensure high-quality labels.

All object instances were annotated using tightly fitted 2D bounding boxes aligned to the image axis and encoded as top-left and bottom-right coordinates in the image frame. These annotations provide ground truth information that can be used for training and evaluating object detection and sound classification algorithms.

Image and Sound Labels

The following table lists image labels and sound labels used for annotating the 16,326 images in the dataset.

Image	Sound	Description
car/van/suv	small_vehicle	sound from a small vehicle like motorbike, bicycle
bus/truck/tram	ego-vehicle	sounds from the data collection platform eg. engine revving and tyre
pedestrian	trailer	sound from an accessory or an unpowered vehicle towed by another vehicle
traffic_sign	horn	warning noises emitted by vehicles
traffic_light	construction_noise	sounds relating to construction activity
	crosswalk_noise	pedestrian crosswalk alert sounds
	large_vehicle	sounds from heavy vehicles like semi-trucks, buses
	emergency_vehicles	sirens from emergency vehicles
	walking_sounds	sounds from a pedestrian or a large group of pedestrians
	cannot_distinguish	unidentifiable sound sources with less than 30\% certainty
	custom	identified sounds that are not part of the list above