TSAM: Temporal SAM Augmented with
Multimodal Prompts for Referring
Audio-Visual Segmentation

Aalto University, Finland
CVPR 2025

Page under construction. Here is an example; more videos will be uploaded soon.

Abstract

Referring audio-visual segmentation (Ref-AVS) aims to segment objects within audio-visual scenes using multimodal cues embedded in text expressions. While the Segment Anything Model (SAM) has revolutionized visual segmentation, its applicability to Ref-AVS, where multimodal cues act as novel prompts, remains unexplored. SAM’s limitation to single-frame segmentation also hinders its ability to capture essential temporal context needed for multi-frame audio-visual segmentation. To address this gap, we propose TSAM, a novel extension of SAM designed to leverage multimodal cues for precise segmentation in dynamic audio-visual scenes. TSAM enhances SAM’s image encoder with a temporal modeling branch, enabling spatio-temporal learning and deep multimodal fusion across video frames, while retaining SAM’s pre-trained knowledge. Additionally, TSAM replaces SAM’s user-interactive prompting mechanism with sparse and dense data-driven prompts, enabling more effective integration of audio-visual inputs and reference text expressions. Extensive experiments on the Ref-AVS dataset demonstrate TSAM’s superiority over state-of-the-art methods. The results illustrate its effectiveness in segmenting objects in dynamic audio-visual scenes using text-based multimodal cues and its strong generalization to unseen objects.

Pipeline

[Figure: TSAM pipeline overview]
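
As described in the abstract, TSAM attaches a temporal modeling branch to SAM's image encoder and replaces SAM's interactive point/box prompts with sparse and dense prompts derived from the audio and the reference text. The PyTorch-style sketch below shows one way such data-driven prompt generation could be wired; all module names, dimensions, and fusion choices are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class MultimodalPromptGenerator(nn.Module):
    """Illustrative sketch: turn audio/text features into SAM-style prompts.

    Sparse prompts play the role of SAM's point/box token embeddings;
    dense prompts play the role of SAM's mask-prompt feature map.
    All dimensions and layer choices are assumptions for illustration.
    """

    def __init__(self, audio_dim=128, text_dim=512, embed_dim=256, num_sparse=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable queries attend to the fused audio-text context and
        # become a small set of sparse prompt tokens.
        self.queries = nn.Parameter(torch.randn(num_sparse, embed_dim))
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        # The dense prompt is a per-location bias added to the image embedding
        # before it enters SAM's mask decoder.
        self.dense_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, audio_feats, text_feats, image_embed):
        # audio_feats: (B, Ta, audio_dim), text_feats: (B, Tt, text_dim)
        # image_embed: (B, C, H, W) from SAM's image encoder, with C == embed_dim
        context = torch.cat([self.audio_proj(audio_feats),
                             self.text_proj(text_feats)], dim=1)          # (B, Ta+Tt, C)
        queries = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        sparse_prompts, _ = self.cross_attn(queries, context, context)    # (B, num_sparse, C)

        B, C, H, W = image_embed.shape
        pooled = context.mean(dim=1)                                      # (B, C)
        dense_prompts = self.dense_head(pooled).view(B, C, 1, 1).expand(B, C, H, W)
        return sparse_prompts, dense_prompts

In this sketch the two outputs would be fed to SAM's mask decoder in place of the embeddings normally produced by its interactive prompt encoder, while the temporal branch (not shown) would refine image_embed with information from neighbouring frames.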

Experiments

We evaluated TSAM on the Ref-AVS dataset, which contains 20,000 text expressions with pixel-level masks across 4,000 10-second videos, covering diverse audible and inaudible objects. It includes 2,908 training, 276 validation, and 818 test videos. The test set is split into Seen (292), Unseen (269), and Null (257) subsets to test generalization and robustness. As shown in Table 1, TSAM outperforms existing methods, especially on the Unseen subset, demonstrating strong generalization to novel objects.

Method       | Task    | Visual Backbone | Seen \( \mathcal{J}(\%) \) | Seen \( \mathcal{F} \) | Unseen \( \mathcal{J}(\%) \) | Unseen \( \mathcal{F} \) | Null \( \mathcal{S}(\downarrow) \)
AVSBench     | AVS     | PVT-v2          | 23.20 | 0.511 | 32.36 | 0.547 | 0.208
AVSegFormer  | AVS     | PVT-v2          | 33.47 | 0.470 | 36.05 | 0.501 | 0.171
GAVS         | AVS     | SAM             | 28.93 | 0.498 | 29.82 | 0.497 | 0.190
SAMA         | AVS*    | SAM             | 39.22 | 0.562 | 47.50 | 0.566 | 0.130
ReferFormer  | Ref-VOS | V-Swin          | 31.31 | 0.501 | 30.40 | 0.488 | 0.176
R2-VOS       | Ref-VOS | V-Swin          | 25.01 | 0.410 | 27.93 | 0.498 | 0.183
EEMC         | Ref-AVS | M2F             | 34.20 | 0.513 | 49.54 | 0.648 | 0.007
TSAM (Ours)  | Ref-AVS | SAM             | 43.43 | 0.568 | 54.58 | 0.664 | 0.017
Table 1: Performance comparison on the Ref-AVS dataset. AVS methods are evaluated with text expressions integrated into AVS following [Wang et al.], Ref-VOS methods with audio integrated into Ref-VOS, and * marks our implementation with text added to AVS.
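
For reference, \( \mathcal{J} \) measures region overlap between predicted and ground-truth masks (Jaccard index), \( \mathcal{F} \) is a precision/recall-based F-score over the masks, and \( \mathcal{S} \) (lower is better) penalizes false segmentation on the Null subset, where the referred object is absent. Below is a minimal sketch of the first two metrics for binary masks; the \( \beta^2 = 0.3 \) weighting is a common choice in audio-visual segmentation and an assumption about the benchmark's exact definition.

import numpy as np

def jaccard(pred, gt, eps=1e-8):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + eps)

def f_measure(pred, gt, beta2=0.3, eps=1e-8):
    """F-score combining precision and recall of the predicted mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)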


We further conducted an ablation study (Table 2) to assess the impact of each module (see the pipeline above): the temporal branch (TB), the temporal modality fusion layer (TMFL), the dense (DPM) and sparse (SPM) prompting modules, cached memory (CM) for audio+visual (a+v) or visual (v) only, the adapter module (AM), and the \( \mathcal{L}_{\text{IoU}} \) loss. Each contributes to TSAM's robust performance, especially on the Unseen and Null subsets.

Setting      | Seen \( \mathcal{J}(\%) \) | Seen \( \mathcal{F} \) | Unseen \( \mathcal{J}(\%) \) | Unseen \( \mathcal{F} \) | Mix (S+U) \( \mathcal{J}(\%) \) | Mix (S+U) \( \mathcal{F} \) | Null \( \mathcal{S}(\downarrow) \)
① TSAM      | 43.43 | 0.568 | 54.58 | 0.664 | 49.01 | 0.616 | 0.017
② - TB      | 33.05 | 0.507 | 50.48 | 0.657 | 41.77 | 0.582 | 0.505
③ - TMFL    | 40.35 | 0.579 | 45.54 | 0.627 | 42.95 | 0.603 | 0.018
④ - DPM     | 42.72 | 0.580 | 49.10 | 0.647 | 45.91 | 0.614 | 0.018
⑤ - SPM     | 43.04 | 0.580 | 49.75 | 0.652 | 46.40 | 0.616 | 0.018
⑥ - SPM+DPM | 42.60 | 0.602 | 40.58 | 0.604 | 41.59 | 0.603 | 0.018
⑦ - CM(a+v) | 42.07 | 0.544 | 49.11 | 0.659 | 45.59 | 0.602 | 0.018
⑧ - CM(v)   | 42.75 | 0.549 | 51.18 | 0.660 | 46.97 | 0.605 | 0.018
⑨ - AM      | 43.13 | 0.600 | 40.79 | 0.617 | 41.96 | 0.609 | 0.017
⑩ - \( \mathcal{L}_{\text{IoU}} \) | 38.29 | 0.564 | 42.15 | 0.631 | 40.22 | 0.598 | 0.008
Table 2: Ablation study on TSAM components.
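
Row ⑩ in Table 2 removes the IoU-based mask loss. As a rough illustration, a differentiable soft-IoU loss over predicted mask probabilities can be written as below; this is a common formulation and only an assumption about how \( \mathcal{L}_{\text{IoU}} \) is defined in the paper.

import torch

def soft_iou_loss(logits, target, eps=1e-6):
    """Soft IoU loss: 1 - IoU computed on mask probabilities.

    logits: (B, H, W) raw mask predictions; target: (B, H, W) binary masks in {0, 1}.
    """
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2))
    union = (probs + target - probs * target).sum(dim=(1, 2))
    return (1.0 - inter / (union + eps)).mean()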

Poster

CVPR 2025 Poster

BibTeX

@InProceedings{Radman_2025_CVPR,
    author    = {Radman, Abduljalil and Laaksonen, Jorma},
    title     = {TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23947-23956}
}