What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Institution Name
Conference name and year

*Indicates Equal Contribution

This demo video presents inference results of TAVLO across various examples.

Abstract

Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

AVATAR

AVATAR contains 5,000 videos and 24,266 annotated frames, including 670 Off-screen frames where the sound source is not visually present. A total of 28,516 audio-visual instances are annotated, categorized into Single-sound (15,372 instances), Multi-entity (9,322 instances), and Mixed-sound (3,822 instances). Each frame is densely annotated at the instance level, providing rich annotations that include instance segmentation, bounding box, audio-visual category, and scenario type. Notably, for Off-screen scenarios where the sound source is not visible, only the audio-visual category and scenario fields are provided without segmentation masks or bounding boxes.

The metadata for each frame follows a structured schema, where each instance is annotated as shown below:


          {
            "video_id": str,
            "frame_number": int,
            "annotations": [
              { // instance 1 (e.g., man)
                "segmentation": [ // (x, y) annotated RLE format
                  [float, float], 
                  ...
                ],
                "bbox": [float, float, float, float], // (l, t, w, h),
                "scenario": str, // "Single-Sound", "Mixed-Sound", "Multi-Entity", "Off-Screen"
                "audio_visual_category": str,
              },
              { // instance 2 (e.g., piano)
                ...
              }, 
              ...
            ]
          }
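To make the schema concrete, below is a minimal sketch of how one frame-level metadata file could be parsed. The file path, the one-JSON-object-per-frame layout, and the helper name `summarize_frame` are illustrative assumptions, not part of the official release; only the field names follow the schema above.

```python
import json
from collections import Counter

# Hypothetical path to one frame-level metadata file following the schema above.
ANNOTATION_PATH = "annotations/frame_000001.json"

def summarize_frame(path: str) -> Counter:
    """Count annotated instances per scenario for a single frame."""
    with open(path, "r", encoding="utf-8") as f:
        frame = json.load(f)

    counts = Counter()
    for instance in frame["annotations"]:
        scenario = instance["scenario"]            # e.g., "Single-Sound"
        category = instance["audio_visual_category"]
        counts[scenario] += 1
        # Off-Screen instances carry no mask or box, so guard the access.
        bbox = instance.get("bbox")                # (left, top, width, height) or None
        if bbox is not None:
            left, top, width, height = bbox
    return counts

if __name__ == "__main__":
    print(summarize_frame(ANNOTATION_PATH))
```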
        

AVATAR Examples

To illustrate the diverse and challenging conditions covered by AVATAR, we provide visualizations grouped by scenario and cross-event subset. Our benchmark defines four scenarios to evaluate specific challenges in audio-visual localization: Single-sound, where only one instance emits sound; Mixed-sound, involving multiple overlapping audio sources; Multi-entity, where the model must distinguish which instance is producing sound among multiple visually similar instances; and Off-screen, where the sound source is not visible in the frame. Additionally, we include a Cross-event subset to assess a model’s ability to track sound sources that change over time. These visualizations showcase the richness of our benchmark and the complexity of real-world audio-visual scenes.
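Since the scenario label is stored per instance, scenario-specific subsets can be assembled directly from the metadata. The sketch below shows one way to collect all frames containing a given scenario; the directory layout and function name are assumptions for illustration only.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON metadata file per annotated frame.
ANNOTATION_DIR = Path("annotations")

def frames_with_scenario(scenario: str):
    """Yield (video_id, frame_number) for frames containing at least one
    instance of the requested scenario, e.g. "Off-Screen"."""
    for meta_file in sorted(ANNOTATION_DIR.glob("*.json")):
        frame = json.loads(meta_file.read_text(encoding="utf-8"))
        if any(inst["scenario"] == scenario for inst in frame["annotations"]):
            yield frame["video_id"], frame["frame_number"]

if __name__ == "__main__":
    off_screen = list(frames_with_scenario("Off-Screen"))
    print(f"{len(off_screen)} frames contain an Off-Screen instance")
```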

Results

We evaluate existing audio-visual localization (AVL) models and our proposed model, TAVLO, on AVATAR to assess their ability to handle real-world audio-visual challenges. Our results demonstrate that conventional methods, which rely on static frame-level mappings and global audio features, struggle to localize sound sources in complex and dynamic environments.

In contrast, TAVLO, a video-centric AVL model, significantly outperforms prior methods across all scenarios by explicitly modeling high-resolution temporal information. In particular, TAVLO achieves robust performance in challenging conditions, including Mixed-sound and Multi-entity.

Moreover, we introduce a Cross-event subset, where sound sources dynamically change over time. While existing models show a large performance drop in Cross-event videos (up to -5.36% CIoU, -5.00% AUC), TAVLO maintains stable performance with only minimal drops (-0.33% CIoU, -0.37% AUC), highlighting its strong temporal reasoning capability.
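The benchmark defines the exact evaluation protocol; as a rough illustration of how CIoU- and AUC-style localization metrics are commonly computed in this line of work, the sketch below binarizes a predicted localization map, measures per-sample IoU against the ground-truth mask, and reports the success rate at a threshold together with the area under the success-rate curve. The function names, the 0.5 success threshold, and the threshold grid are illustrative assumptions and may differ from the official evaluation code.

```python
import numpy as np

def sample_iou(pred_map: np.ndarray, gt_mask: np.ndarray, bin_thr: float = 0.5) -> float:
    """IoU between a binarized prediction map and a binary ground-truth mask."""
    pred = pred_map >= bin_thr
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def ciou_and_auc(ious: list[float], success_thr: float = 0.5) -> tuple[float, float]:
    """cIoU-style score: fraction of samples whose IoU exceeds `success_thr`.
    AUC: area under the success-rate curve over IoU thresholds in [0, 1]."""
    ious = np.asarray(ious)
    ciou = float((ious >= success_thr).mean())
    thresholds = np.linspace(0.0, 1.0, 21)
    success_rates = np.array([(ious >= t).mean() for t in thresholds])
    # Trapezoidal rule over the uniform threshold grid.
    auc = float(((success_rates[:-1] + success_rates[1:]) / 2.0 * np.diff(thresholds)).sum())
    return ciou, auc
```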

Figure: per-scenario results.
Figure: Cross-event subset results.

Qualitative Results

We present qualitative results showcasing the performance of TAVLO across diverse scenarios, including Single-sound, Mixed-sound, Multi-entity, Off-screen, and Cross-event subsets. These visualizations highlight how our model accurately localizes sound-emitting instances under a variety of challenging conditions.

In Single-sound cases, TAVLO precisely identifies the correct sound source, demonstrating strong one-to-one audio-visual correspondence. In Mixed-sound scenarios, where multiple sound sources are present simultaneously, TAVLO successfully differentiates and localizes each instance, even when audio overlaps or when some sources are partially off-screen. For Multi-entity situations, where multiple visually similar instances appear and only some produce sound, TAVLO accurately pinpoints the active sound-emitting instance, showing its ability to reason beyond appearance similarity. In Off-screen cases, where no visible source is present, TAVLO effectively avoids false localization, demonstrating robustness against distractors.

Furthermore, for Cross-event videos, where sound sources dynamically change over time, TAVLO consistently tracks and localizes the correct sound-emitting instance as the scene evolves. These results illustrate the strength of our temporal modeling approach and its applicability to complex real-world audio-visual scenes.

BibTeX

BibTeX code here