RefAM: Attention Magnets for Zero-Shot
Referral Segmentation

1 Max Planck Institute for Informatics, SIC · 2 ETH Zürich · 3 TU Munich · 4 Google

Abstract

Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits the attention scores of diffusion transformers as features for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision–language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.

Introduction

The Challenge: Referring segmentation - the task of localizing objects based on natural language descriptions - typically requires extensive fine-tuning or complex multi-model architectures. While diffusion transformers (DiTs) encode rich semantic information, directly leveraging their attention mechanisms for grounding tasks has remained largely unexplored.

The Discovery: We uncover that diffusion transformers exhibit attention sink behaviors similar to large language models, where stop words accumulate disproportionately high attention despite lacking semantic value. This creates both challenges and opportunities for vision-language grounding.

Our Solution: RefAM exploits stop words as "attention magnets" - deliberately adding them to referring expressions to absorb surplus attention, then filtering them out to obtain cleaner attention maps. This simple strategy achieves state-of-the-art zero-shot referring segmentation without any fine-tuning or architectural modifications.

Key Contributions

🔍 Global Attention Sinks Discovery

We identify and analyze global attention sinks (GAS) in diffusion transformers that act as attention magnets, linking their emergence to semantic structure and showing they can be safely filtered.

🧲 Stop-word Attention Redistribution

We introduce a novel strategy where strategically added stop words act as attention magnets, absorbing surplus attention to enable cleaner cross-attention maps for better grounding.

🎯 Training-Free Grounding Framework

RefAM combines cross-attention extraction, GAS filtering, and attention redistribution into a simple framework that requires no fine-tuning or architectural modifications.

📊 State-of-the-Art Zero-Shot Results

We achieve new state-of-the-art performance on referring image and video segmentation benchmarks, significantly outperforming prior training-free methods.

Method Overview

RefAM Method Overview
RefAM Pipeline Overview. Our method extracts cross-attention maps from diffusion transformers for referring expressions augmented with attention magnets. We first append stop words (".", "a", "with") to the referring expression to act as attention magnets. Next, we extract cross-attention maps from DiT layers, filter out stop words and attention magnets, aggregate the remaining maps, and apply SAM for final segmentation. This simple training-free approach achieves state-of-the-art zero-shot referring segmentation results.
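
A minimal sketch of the pipeline in Python, assuming hypothetical helpers (`extract_cross_attention` stands in for hooking the DiT's cross-attention layers; the stop-word list and magnet tokens are illustrative):

```python
# Minimal sketch of the RefAM pipeline (hypothetical helper names).
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "with", "on", "in", "."}
ATTENTION_MAGNETS = [".", "a", "with"]  # appended tokens that soak up surplus attention

def refam_heatmap(image, expression, extract_cross_attention):
    """Return an aggregated text-to-image attention heatmap for `expression`."""
    # 1) Append attention magnets to the referring expression.
    augmented = expression + " " + " ".join(ATTENTION_MAGNETS)

    # 2) Extract per-token cross-attention maps (num_tokens x H x W) from
    #    selected DiT layers, e.g. via forward hooks.
    maps, tokens = extract_cross_attention(image, augmented)

    # 3) Keep only content words; drop stop words and the magnets themselves.
    keep = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]

    # 4) Aggregate the remaining maps into a single heatmap,
    #    later converted into SAM/SAM2 prompts.
    return np.mean(maps[keep], axis=0)
```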

Stop-word Filtering for Referral Segmentation

A key discovery in our work is that stop words (e.g., "the", "a", "of") act as attention magnets in cross-attention maps, absorbing significant attention mass and creating noisy backgrounds that hurt segmentation quality. We exploit this phenomenon by deliberately appending extra stop words to referring expressions, which concentrates the surplus attention onto these disposable tokens, and then filtering out all stop words to obtain cleaner attention maps. This simple yet effective technique dramatically improves the quality of attention-based segmentation across both image and video domains, leading to more precise object localization.
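
The effect is easy to inspect: ranking tokens by the share of attention mass they absorb typically puts stop words at the top. A minimal diagnostic, reusing the hypothetical `maps`/`tokens` arrays from the pipeline sketch above:

```python
# Diagnostic sketch: rank tokens by the share of total attention they absorb.
# Assumes `maps` is a [num_tokens, H, W] array of cross-attention scores and
# `tokens` the matching token strings.
import numpy as np

def attention_mass_ranking(maps, tokens):
    mass = maps.reshape(len(tokens), -1).sum(axis=1)
    mass = mass / mass.sum()  # fraction of total attention per token
    return sorted(zip(tokens, mass.tolist()), key=lambda tm: -tm[1])
```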

Stop-word Filtering Example
Influence of stop words on referral segmentation. We demonstrate how stop words "pollute" cross-attention scores by attracting high attention to background areas. By adding extra stop words as attention magnets and then filtering them out, we achieve sharper attention maps focused on core concepts (nouns, verbs, adjectives). The example shows attention maps before and after stop-word filtering, with segmentation results using SAM2. Gray tokens indicate filtered stop words.

Global Attention Sinks in Diffusion Transformers

We discover that diffusion transformers exhibit attention sink behaviors similar to large language models. Specifically, we identify Global Attention Sinks (GAS) - tokens that attract disproportionately high and nearly uniform attention across all text and image tokens simultaneously. These sinks emerge consistently in deeper layers but are absent in early layers, serving as indicators of semantic structure. While uninformative themselves, they can suppress useful signals when they occur on meaningful tokens, which we address through strategic filtering and redistribution.
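
A minimal sketch of how such sinks could be flagged, based on the two properties described above (high total attention, spread nearly uniformly over all queries); the thresholds are illustrative, not values from the paper:

```python
# Sketch: flagging Global Attention Sinks (GAS).
import numpy as np

def find_gas_tokens(attn, mass_thresh=2.0, uniform_thresh=0.9):
    """attn: [num_queries, num_tokens] attention from all text+image queries."""
    num_q, num_t = attn.shape
    mass = attn.sum(axis=0)            # total attention received per token
    expected = attn.sum() / num_t      # per-token mass under a uniform split
    # Normalized entropy of each token's incoming attention: 1.0 == uniform.
    p = attn / np.clip(attn.sum(axis=0, keepdims=True), 1e-8, None)
    entropy = -(p * np.log(np.clip(p, 1e-8, 1.0))).sum(axis=0) / np.log(num_q)
    # A sink receives far more mass than expected, distributed near-uniformly.
    return np.where((mass > mass_thresh * expected) & (entropy > uniform_thresh))[0]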

Global Attention Sinks Examples
Global Attention Sinks (GAS) in DiT. We highlight tokens (here, tokens #1 and #16) that act as GAS in late layers. These tokens attract disproportionately high and nearly uniform attention from all text and image tokens simultaneously. GAS are absent in early layers, emerge consistently in deeper blocks, and serve as indicators of semantic structure. While uninformative themselves, they can suppress useful signals when they occur on meaningful tokens.
Emergence of Semantic Information in DiT
Emergence of semantic information in DiT. Top: text-to-text attention across layers. Early layers (0–19) are diffuse and uniform, while middle and late layers (20–47) develop block-diagonal structure, indicating meaningful linguistic grouping. Bottom: text-to-image attention for the "patches" token. Early layers spread attention broadly over the scene, whereas middle layers begin to localize, and late layers sharpen around the target object. These dynamics illustrate how semantic alignment emerges progressively with depth.

Key Findings: Our analysis reveals that GAS tokens emerge progressively with network depth, transitioning from diffuse attention patterns in early layers to structured, semantically meaningful distributions in deeper layers. This emergence correlates with the development of semantic understanding, providing insights into how diffusion transformers build hierarchical representations. By identifying and strategically filtering these attention sinks, we can redirect attention toward semantically relevant regions, leading to more accurate grounding and segmentation results.
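
As a concrete illustration of the filtering step, GAS columns can be zeroed out and the attention renormalized so the freed mass flows back to informative tokens. A minimal sketch (layer choice and thresholds are implementation details not specified here):

```python
# Sketch: suppressing detected GAS tokens by zeroing their attention columns
# and renormalizing each query row, which redistributes the freed mass over
# the remaining (informative) tokens.
import numpy as np

def suppress_gas(attn, gas_idx):
    """attn: [num_queries, num_tokens]; gas_idx: indices from find_gas_tokens."""
    out = attn.copy()
    out[:, gas_idx] = 0.0
    out /= np.clip(out.sum(axis=1, keepdims=True), 1e-8, None)
    return out
```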

Referral Image Object Segmentation

RefAM achieves state-of-the-art zero-shot referring image segmentation on RefCOCO/RefCOCO+/RefCOCOg datasets. Our attention magnet strategy proves crucial - by strategically adding stop words to absorb surplus attention and then filtering them out, we obtain cleaner attention maps focused on the target object. This simple technique dramatically improves localization precision without requiring any fine-tuning or architectural modifications. RefAM significantly outperforms previous training-free methods while maintaining simplicity.
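
For the final masks, the aggregated heatmap is converted into prompts for SAM. A minimal sketch using the public segment_anything API; the single peak-point prompt and the checkpoint path are illustrative choices, not necessarily the paper's exact configuration:

```python
# Sketch: prompting SAM with the peak of the RefAM attention heatmap.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # illustrative path
predictor = SamPredictor(sam)

def segment_from_heatmap(image_rgb, heatmap):
    predictor.set_image(image_rgb)
    # Use the peak-attention location as a single foreground point prompt.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Rescale heatmap coordinates to image resolution if they differ.
    sy = image_rgb.shape[0] / heatmap.shape[0]
    sx = image_rgb.shape[1] / heatmap.shape[1]
    points = np.array([[x * sx, y * sy]])  # SAM expects (x, y) pixel coords
    masks, scores, _ = predictor.predict(point_coords=points,
                                         point_labels=np.array([1]))
    return masks[np.argmax(scores)]
```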

Referral Image Object Segmentation Examples
Referral Image Object Segmentation Examples. RefAM leverages cross-attention maps from diffusion transformers with attention magnet filtering to achieve zero-shot referral segmentation. By strategically adding stop words as attention magnets and then filtering them out, we obtain cleaner attention maps that focus on the target objects, leading to more accurate segmentation masks generated with SAM.

RefCOCO Image Referral Segmentation Results

Method         RefCOCO (oIoU)        RefCOCO+ (oIoU)       RefCOCOg (oIoU)
               val    testA  testB   val    testA  testB   val    test
Zero-shot methods w/o additional training
Grad-CAM       23.44  23.91  21.60   26.67  27.20  24.84   23.00  23.91
Global-Local   24.55  26.00  21.03   26.62  29.99  22.23   28.92  30.48
Global-Local   21.71  24.48  20.51   23.70  28.12  21.86   26.57  28.21
Ref-Diff       35.16  37.44  34.50   35.56  38.66  31.40   38.62  37.50
TAS            29.53  30.26  28.24   33.21  38.77  28.01   35.84  36.16
HybridGL       41.81  44.52  38.50   35.74  41.43  30.90   42.47  42.97
RefAM (ours)   46.91  52.30  43.88   38.57  42.66  34.90   45.53  44.45

RefAM (ours) achieves the best score in every column among the listed training-free methods.

Video Referral Object Segmentation

RefAM extends seamlessly to video referring segmentation using Mochi, a video diffusion transformer. We extract cross-attention maps from the first frame with our attention magnet strategy and leverage SAM2's temporal propagation for consistent video segmentation. Attention magnet filtering proves even more important in the video setting, where noise in the first-frame prompt is amplified as it propagates across frames. RefAM achieves substantial improvements over existing training-free methods, establishing a new state of the art for zero-shot video referral segmentation.
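
A minimal sketch of the video variant using the public sam2 video predictor API; the config and checkpoint names are illustrative, and the first-frame heatmap is assumed to be at the video's spatial resolution:

```python
# Sketch: first-frame prompting with SAM2 temporal propagation.
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

def segment_video(frames_dir, first_frame_heatmap):
    state = predictor.init_state(video_path=frames_dir)
    # Prompt frame 0 with the peak of the RefAM heatmap, as in the image case.
    y, x = np.unravel_index(np.argmax(first_frame_heatmap),
                            first_frame_heatmap.shape)
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[x, y]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32))
    # SAM2 propagates the first-frame mask through the remaining frames.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()
    return masks
```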

Video Referral Segmentation Results
Video Referral Object Segmentation Examples. RefAM extends seamlessly to the video domain using the Mochi video diffusion transformer. We extract cross-attention maps with attention magnets from the first frame and use SAM2 for temporal propagation. The approach is training-free and operates in a zero-shot manner, with attention magnet filtering being crucial for clean temporal segmentation.

Ref-DAVIS17 Video Results

Method                 J&F    J      F
Training-Free with Grounded-SAM
Grounded-SAM           65.2   62.3   68.0
Grounded-SAM2          66.2   62.6   69.7
AL-Ref-SAM2            74.2   70.4   78.0
Training-Free
G-L + SAM2             40.6   37.6   43.6
G-L (SAM) + SAM2       46.9   44.0   49.7
RefAM + SAM2 (ours)    57.6   54.5   60.6

Component Ablation Study

AM   NP   SB     J&F    J      F      PA
✓    ✓    ✓      57.6   54.5   60.6   68.9
–    ✓    ✓      54.4   50.9   57.6   59.8
✓    –    ✓      55.1   52.2   58.0   67.2
–    –    ✓      53.1   49.5   56.7   60.2
–    –    –      50.0   46.8   53.2   52.5

AM = attention magnets, NP = noun phrase filtering, SB = spatial bias, PA = point accuracy
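
For reference, a sketch of a point-accuracy check under our reading of PA, namely whether the peak-attention point falls inside the ground-truth mask (the paper's exact definition may differ):

```python
# Sketch: point accuracy as peak-in-mask (assumed definition).
import numpy as np

def point_accuracy(heatmap, gt_mask):
    # Peak of the aggregated attention map, compared against the GT mask.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[y, x])
```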

Societal Impact

RefAM provides a powerful and accessible tool for zero-shot referring segmentation, enabling advances in applications such as medical image analysis, robotics, autonomous systems, and assistive technologies for people with visual impairments. The attention magnet approach democratizes access to high-quality referring segmentation capabilities without requiring specialized training or fine-tuning.

By providing training-free, zero-shot methods that work across different domains (images and videos), RefAM is particularly valuable for researchers and practitioners who may not have access to large computational resources or extensive labeled datasets. The simplicity of the approach makes it broadly applicable and easy to integrate into existing systems.

However, as with any advancement in computer vision and AI, there are potential ethical considerations. Improved referring segmentation capabilities could be misused for surveillance or privacy violation purposes. We emphasize the importance of deploying these technologies responsibly, with appropriate safeguards and consideration for privacy rights. We encourage the research community to continue developing ethical guidelines for the deployment of vision-language technologies and to consider the broader societal implications of these advancements.
