What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Hahyeon Choi, Junhoo Lee, Nojun Kwak
Seoul National University
International Conference on Computer Vision, ICCV 2025

This demo video presents inference results of TAVLO across various examples.

Abstract

Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

AVATAR

AVATAR contains 5,000 videos and 24,266 annotated frames, including 670 Off-screen frames where the sound source is not visually present. A total of 28,516 audio-visual instances are annotated, categorized into Single-sound (15,372 instances), Multi-entity (9,322 instances), and Mixed-sound (3,822 instances). Each frame is densely annotated at the instance level, providing rich annotations that include instance segmentation, bounding box, audio-visual category, and scenario type. Notably, for Off-screen scenarios where the sound source is not visible, only the audio-visual category and scenario fields are provided without segmentation masks or bounding boxes.

The metadata for each frame follows a structured schema, where each instance is annotated as shown below:


          {
            "video_id": str,
            "frame_number": int,
            "annotations": [
              { // instance 1 (e.g., man)
                "segmentation": [ // (x, y) points, annotated in RLE format
                  [float, float],
                  ...
                ],
                "bbox": [float, float, float, float], // (l, t, w, h)
                "scenario": str, // "Single-Sound", "Mixed-Sound", "Multi-Entity", "Off-Screen"
                "audio_visual_category": str
              },
              { // instance 2 (e.g., piano)
                ...
              },
              ...
            ]
          }
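
As a reference for working with this metadata, here is a minimal parsing sketch. It assumes one JSON file per annotated frame following the schema above; the on-disk layout, the example path, and the helper name load_frame_annotations are illustrative, not part of the AVATAR release.

          import json

          def load_frame_annotations(path):
              # Parse one frame-metadata file following the schema above.
              # Returns a list of (category, scenario, bbox) tuples, with bbox converted
              # from (l, t, w, h) to (x1, y1, x2, y2). Off-Screen instances carry no
              # bbox or segmentation, so bbox is None for them.
              with open(path, "r") as f:
                  frame = json.load(f)

              instances = []
              for ann in frame["annotations"]:
                  bbox = ann.get("bbox")  # absent for Off-Screen instances
                  if bbox is not None:
                      l, t, w, h = bbox
                      bbox = (l, t, l + w, t + h)
                  instances.append((ann["audio_visual_category"], ann["scenario"], bbox))
              return instances

          # Hypothetical usage: one metadata file per annotated frame.
          # instances = load_frame_annotations("annotations/VIDEO_ID/000123.json")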
        

AVATAR Examples

To illustrate the diverse and challenging conditions covered by AVATAR, we provide visualizations grouped by scenario and by the Cross-event subset. Our benchmark defines four scenarios that each probe a specific challenge in audio-visual localization: Single-sound, where only one instance emits sound; Mixed-sound, involving multiple overlapping audio sources; Multi-entity, where the model must identify which of several visually similar instances is producing sound; and Off-screen, where the sound source is not visible in the frame. Additionally, we include a Cross-event subset to assess a model’s ability to track sound sources that change over time. These visualizations showcase the richness of our benchmark and the complexity of real-world audio-visual scenes.
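
For per-scenario evaluation, one straightforward (and purely illustrative) way to build these subsets from the metadata is to bucket instances by their scenario field, as sketched below. Note that the Cross-event subset is defined at the video level and cannot be recovered from this field alone.

          from collections import defaultdict

          def split_by_scenario(frames):
              # frames: iterable of frame-metadata dicts following the schema above.
              # Returns {scenario: [(video_id, frame_number, annotation), ...]}.
              subsets = defaultdict(list)
              for frame in frames:
                  for ann in frame["annotations"]:
                      subsets[ann["scenario"]].append(
                          (frame["video_id"], frame["frame_number"], ann)
                      )
              return subsets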

Results

We evaluate existing audio-visual localization (AVL) models and our proposed model, TAVLO, on AVATAR to assess their ability to handle real-world audio-visual challenges. Our results demonstrate that conventional methods, which rely on static frame-level mappings and global audio features, struggle to localize sound sources in complex and dynamic environments.

In contrast, TAVLO, a video-centric AVL model, significantly outperforms prior methods across all scenarios by explicitly modeling high-resolution temporal information. In particular, TAVLO achieves robust performance in challenging conditions, including Mixed-sound and Multi-entity.

Moreover, we introduce a Cross-event subset, where sound sources dynamically change over time. While existing models show a large performance drop in Cross-event videos (up to -5.36% CIoU, -5.00% AUC), TAVLO maintains stable performance with only minimal drops (-0.33% CIoU, -0.37% AUC), highlighting its strong temporal reasoning capability.

Figure: per-scenario results on AVATAR.
Figure: results on the Cross-event subset.
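
For context, CIoU and AUC here follow the evaluation protocol commonly used in audio-visual localization: a per-frame IoU between the binarized prediction and the ground-truth region, the success rate at an IoU threshold of 0.5 (CIoU), and the area under the success-rate-versus-threshold curve (AUC). The sketch below illustrates this common recipe; the heatmap binarization threshold and the threshold grid are assumptions and may differ from the exact protocol used for AVATAR.

          import numpy as np

          def iou(pred_mask, gt_mask):
              # Both inputs are boolean arrays of the same spatial size.
              inter = np.logical_and(pred_mask, gt_mask).sum()
              union = np.logical_or(pred_mask, gt_mask).sum()
              return inter / union if union > 0 else 0.0

          def ciou_and_auc(pred_heatmaps, gt_masks, ciou_thresh=0.5, bin_thresh=0.5):
              # pred_heatmaps: list of float maps in [0, 1]; gt_masks: list of boolean masks.
              ious = np.array([iou(h >= bin_thresh, g)
                               for h, g in zip(pred_heatmaps, gt_masks)])
              ciou = (ious >= ciou_thresh).mean()   # success rate at the 0.5 IoU threshold
              taus = np.linspace(0.0, 1.0, 21)      # assumed threshold grid
              success = [(ious >= t).mean() for t in taus]
              auc = np.trapz(success, taus)         # area under success-vs-threshold curve
              return ciou, auc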

Qualitative Results

We present qualitative results showcasing the performance of TAVLO across diverse scenarios, including Single-sound, Mixed-sound, Multi-entity, Off-screen, and Cross-event subsets. These visualizations highlight how our model accurately localizes sound-emitting instances under a variety of challenging conditions.

In Single-sound cases, TAVLO precisely identifies the correct sound source, demonstrating strong one-to-one audio-visual correspondence. In Mixed-sound scenarios, where multiple sound sources are present simultaneously, TAVLO successfully differentiates and localizes each instance, even when audio overlaps or when some sources are partially off-screen. For Multi-entity situations, where multiple visually similar instances appear and only some produce sound, TAVLO accurately pinpoints the active sound-emitting instance, showing its ability to reason beyond appearance similarity. In Off-screen cases, where no visible source is present, TAVLO effectively avoids false localization, demonstrating robustness against distractors.

Furthermore, for Cross-event videos, where sound sources dynamically change over time, TAVLO consistently tracks and localizes the correct sound-emitting instance as the scene evolves. These results illustrate the strength of our temporal modeling approach and its applicability to complex real-world audio-visual scenes.

BibTeX

          @inproceedings{choi2025whats,
            title     = {What's Making That Sound Right Now? Video-centric Audio-Visual Localization},
            author    = {Choi, Hahyeon and Lee, Junhoo and Kwak, Nojun},
            booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
            year      = {2025}
          }