SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking

Mohamad Alansari*, Yonathan Michael*, Hasan AlMarzouqi, Muzammal Naseer, Naoufel Werghi, Sajid Javed
Khalifa University, Abu Dhabi, UAE
* Equal contribution
ECCV 2026

Abstract

We revisit the memory update mechanism in SAM2-based visual object tracking and identify confidence-only mask selection as the dominant cause of drift under occlusion, rapid motion, and distractors. We introduce SENTRY, a training-free, plug-and-play, refine-before-write module that validates each memory update for short-horizon temporal consistency before committing it. SENTRY aggregates diverse segmentation hypotheses per frame and backtracks each candidate to form short, temporally coherent tracklets. A neighbor-aware, cycle-consistent matching stage evaluates candidates against recent trajectories to favor temporally and geometrically consistent hypotheses. SENTRY requires no retraining and leaves the base architecture untouched; it simply replaces confidence-driven writes with consistency-validated ones. To ensure fair and comprehensive evaluation, we re-evaluate major open-source SAM2-based trackers across all available scales and datasets, filling gaps where prior works reported incomplete results. Integrated into five strong baselines, SENTRY delivers consistent gains across nine benchmarks, achieving new zero-shot SOTA on LaSOT, LaSOText, GOT-10k, VOT20, VOT22, and DiDi. It also yields strong improvements on the remaining benchmarks, even when all baselines are re-evaluated across all model scales. Despite these checks, SENTRY remains real-time, running at 32.8 FPS with approximately 25% overhead relative to SAM2. Our results provide the first unified, all-scale evaluation of SAM2-based trackers and demonstrate that enforcing temporal validity at write time is a general design principle that consistently stabilizes memory-augmented tracking without retraining.

Methodology

SENTRY architecture. The module replaces confidence-driven memory writes with consistency-validated updates. Diverse segmentation hypotheses are refined, backtracked into short-horizon tracklets, and compared against recent neighbors through cycle-consistent matching before a memory update is committed.

Refine Before Write

SENTRY validates candidate masks before they enter memory, replacing confidence-only updates with temporal checks that reduce drift under occlusion and distractors.
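The difference between a confidence-only write and a refine-before-write gate can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `Candidate` class, the `consistency` callback, the `min_consistency` threshold, and the choice to skip the write when no candidate passes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    mask_id: str       # hypothetical identifier for one mask hypothesis
    confidence: float  # the tracker's own score for this mask


def confidence_only_write(cands: List[Candidate]) -> Candidate:
    """Baseline behavior: write the highest-confidence mask to memory."""
    return max(cands, key=lambda c: c.confidence)


def refine_before_write(
    cands: List[Candidate],
    consistency: Callable[[Candidate], float],
    min_consistency: float = 0.5,  # assumed threshold, for illustration only
) -> Optional[Candidate]:
    """Keep only candidates that pass a temporal-consistency check, then
    pick the most consistent one (confidence breaks ties).
    Returns None when nothing passes, i.e. the write is skipped (assumption)."""
    scored = [(consistency(c), c) for c in cands]
    valid = [(s, c) for s, c in scored if s >= min_consistency]
    if not valid:
        return None
    return max(valid, key=lambda sc: (sc[0], sc[1].confidence))[1]


# A confident distractor versus a slightly less confident true target.
cands = [Candidate("distractor", 0.95), Candidate("target", 0.80)]
toy_consistency = {"distractor": 0.2, "target": 0.9}  # made-up scores

baseline_pick = confidence_only_write(cands)
validated_pick = refine_before_write(cands, lambda c: toy_consistency[c.mask_id])
```

In this toy case the baseline writes the distractor, while the gated version writes the target, which is the failure mode the refine-before-write design is meant to address.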

Temporal Tracklets

Each candidate is backtracked to form a short, coherent trajectory, giving the memory update a local temporal history rather than a single-frame score.
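One simple way to picture backtracking is greedy backward linking by overlap. The sketch below is a stand-in under stated assumptions: it uses bounding boxes instead of masks, greedy IoU linking instead of whatever association SENTRY actually uses, and the `horizon` and `min_iou` parameters are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def backtrack(candidate: Box, history: List[List[Box]],
              horizon: int = 3, min_iou: float = 0.3) -> List[Box]:
    """Link a current-frame candidate backward through up to `horizon`
    previous frames (history is ordered oldest first). Linking stops at
    the first frame with no sufficiently overlapping box."""
    tracklet = [candidate]
    current = candidate
    for frame_boxes in reversed(history[-horizon:]):
        if not frame_boxes:
            break
        best = max(frame_boxes, key=lambda b: iou(current, b))
        if iou(current, best) < min_iou:
            break
        tracklet.append(best)
        current = best
    tracklet.reverse()  # return in chronological order
    return tracklet


# Two objects per past frame: the target drifting right, and a distractor.
history = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (52, 50, 62, 60)],
    [(2, 0, 12, 10), (54, 50, 64, 60)],
]
coherent = backtrack((3, 0, 13, 10), history)       # follows the target
isolated = backtrack((100, 100, 110, 110), history)  # no temporal support
```

A candidate with a long, coherent tracklet has local temporal evidence behind it, whereas a candidate that links to nothing has only its single-frame score.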

Neighbor-Aware Checks

Recent trajectories act as consistency anchors, favoring masks that preserve identity and geometry across neighboring frames.
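Cycle consistency can be illustrated with a forward-backward nearest-neighbor test: a candidate is kept only if its nearest anchor maps back to that same candidate. The sketch assumes 2D object centers and Euclidean distance as a simplified proxy; the function names are hypothetical and the actual SENTRY matching operates on masks and trajectories.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(query: Point, pool: Sequence[Point]) -> int:
    """Index of the point in `pool` closest to `query`."""
    return min(range(len(pool)), key=lambda k: math.dist(query, pool[k]))


def cycle_consistent_pairs(candidates: Sequence[Point],
                           anchors: Sequence[Point]) -> List[Tuple[int, int]]:
    """Return (candidate_idx, anchor_idx) pairs that survive a
    forward (candidate -> anchor) and backward (anchor -> candidate) check."""
    if not candidates or not anchors:
        return []
    pairs = []
    for i, cand in enumerate(candidates):
        j = nearest(cand, anchors)                 # forward match
        if nearest(anchors[j], candidates) == i:   # backward match must return to i
            pairs.append((i, j))
    return pairs


# Anchors: last known centers of two recent trajectories (made-up values).
anchors = [(10.0, 10.2), (49.0, 50.0)]
# Candidates: two near the first anchor (one is a near-duplicate), one near the second.
candidates = [(10.0, 10.0), (11.0, 11.0), (50.0, 50.0)]

pairs = cycle_consistent_pairs(candidates, anchors)
```

Here the second candidate matches the first anchor in the forward direction, but the anchor prefers the first candidate on the way back, so the pair fails the cycle and is discarded.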

Highlights

Neighbor-aware memory

Suppresses distractors and drift before unstable masks are written into memory.

Self-verified masks

Uses cycle-consistent checks to keep object identity stable across challenging sequences.

Drop-in integration

Works with memory-based trackers without retraining or changing the base architecture.

Real-time tracking

Maintains 32.8 FPS while improving robustness on long, cluttered, occluded videos.
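To put the reported speed in per-frame terms, the short calculation below converts 32.8 FPS into latency. It also shows what baseline speed would be implied if the roughly 25% overhead is read as 25% extra time per frame. That reading is an assumption; the implied baseline figure is not a number reported on this page.

```python
def latency_ms(fps: float) -> float:
    """Per-frame processing time in milliseconds."""
    return 1000.0 / fps


def implied_baseline_fps(fps_with_module: float, overhead: float) -> float:
    """Baseline FPS implied if `overhead` is the fractional increase in
    per-frame time (assumed interpretation of 'overhead')."""
    baseline_latency = latency_ms(fps_with_module) / (1.0 + overhead)
    return 1000.0 / baseline_latency


sentry_latency = latency_ms(32.8)                # about 30.5 ms per frame
baseline_fps = implied_baseline_fps(32.8, 0.25)  # about 41 FPS under this reading
```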

Quantitative Results

SENTRY is evaluated across all available scales and datasets, including the main visual object tracking benchmarks and VOT challenge suites.

Figure: Overall results.

Qualitative Results

Temporal consistency

Representative qualitative behavior under difficult frame-to-frame changes.

Distractor suppression

Examples showing why neighbor-aware validation matters before committing memory.

Robust memory writes

SENTRY keeps memory updates conservative without sacrificing real-time tracking speed.

Leaderboard

Category / Method | LaSOT (S, NP, P) | LaSOText (S, NP, P) | TNL2K (S, NP, P) | GOT-10k (AO, SR0.50, SR0.75) | TrackingNet (S, NP, P)

Vision-Based
DiffusionTrack 70.8 79.8 76.7 56.4 72.5 57.3 74.8 85.4 72.0 83.8 88.2 82.1
HIPTrack 72.7 82.9 79.5 53.0 64.3 60.6 77.4 88.0 74.5 84.5 89.1 83.8
AQATrack256 71.4 81.9 78.6 51.2 62.2 58.9 57.8 59.4 73.8 83.2 72.1 83.8 88.6 83.1
ARPTrack256 72.6 81.4 78.5 52.0 62.9 58.7 77.7 87.3 74.3 85.5 90.0 85.3
SPMTrack-B 74.9 [3rd] 84.0 81.7 62.0 79.7 66.7 76.5 85.9 76.3 86.1 [Best] 90.2 85.6

VLM
UVLTrack-B 69.4 74.9 49.2 55.8 62.7 65.4 83.4 82.1
QueryNLT 59.9 69.6 63.5 57.8 75.6 58.7
DUTrack384 74.1 84.9 82.9 52.5 63.6 60.5 65.6 [2nd] 83.2 71.9 77.8
MambaVLT 66.6 77.3 71.0 66.5 [Best] 90.9 69.9
CLDTracker | 74.0 83.9 81.1 | 53.1 64.8 60.6 | 61.5 82.2 64.3 | 77.5 85.4 75.6 | 85.1 89.7 84.9

Memory-Based
MemVLT 72.9 85.7 80.5 52.1 63.3 59.8 63.3 [3rd] 80.9 67.4
RTracker-L 74.7 84.5 54.9 65.5 62.7 60.6 63.7 77.9 87.0 76.9

Zero-shot
SAM2-L | 68.5 76.1 73.6 | 56.8 71.1 67.0 | 56.7 75.4 62.5 | 80.8 91.3 75.5 | 85.3 91.3 88.2
SAMURAI-L | 74.2 82.7 80.2 | 61.0 [3rd] 73.9 72.2 | 50.6 67.5 54.2 | 81.7 [3rd] 92.2 76.9 | 85.3 88.2 85.0
DAM4SAM-L | 75.1 [2nd] 83.3 81.1 | 60.9 75.3 72.2 | 59.8 79.8 66.8 | 81.1 91.4 77.2 | 85.3 90.9 87.4
SENTRY-S2-L (Ours) | 70.2 (+1.7) 77.2 (+1.1) 74.5 (+0.9) | 57.0 (+0.2) 71.7 (+0.6) 67.1 (+0.1) | 57.9 (+1.2) 76.9 (+1.5) 64.1 (+1.6) | 81.1 (+0.3) 91.4 (+0.1) 76.5 (+1.0) | 85.7 (+0.4) 91.9 (+0.6) 88.9 (+0.7)
SENTRY-SR-L (Ours) | 75.1 (+0.9) [2nd] 82.7 80.4 (+0.2) | 61.5 (+0.5) [2nd] 75.0 (+1.1) 72.9 (+0.7) | 59.6 (+9.0) 78.8 (+11.3) 66.4 (+12.2) | 81.8 (+0.1) [2nd] 92.3 (+0.1) 77.1 (+0.2) | 85.8 (+0.5) [3rd] 91.1 (+2.9) 88.1 (+3.1)
SENTRY-D4S-L (Ours) | 76.3 (+1.2) [Best] 84.7 (+1.4) 82.4 (+1.3) | 61.8 (+0.9) [Best] 76.6 (+1.3) 73.8 (+1.6) | 61.3 (+0.5) 81.3 (+1.5) 68.3 (+1.5) | 82.1 (+1.0) [Best] 92.6 (+1.2) 78.2 (+1.0) | 85.9 (+0.6) [2nd] 91.5 (+0.6) 87.9 (+0.5)
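For readers unfamiliar with the GOT-10k columns, AO is the mean overlap (IoU) between predicted and ground-truth boxes over frames, and SR at a threshold is the fraction of frames whose overlap exceeds that threshold. The sketch below computes both from a list of per-frame IoU values; the values are made up for illustration and the snippet is a simplified single-sequence view, not the official evaluation toolkit.

```python
from typing import Sequence


def average_overlap(ious: Sequence[float]) -> float:
    """AO: mean per-frame IoU."""
    return sum(ious) / len(ious)


def success_rate(ious: Sequence[float], threshold: float) -> float:
    """SR: fraction of frames whose IoU exceeds `threshold`."""
    return sum(1 for v in ious if v > threshold) / len(ious)


# Hypothetical per-frame overlaps for one short sequence.
ious = [0.9, 0.8, 0.6, 0.4]

ao = average_overlap(ious)       # mean of the four overlaps
sr_50 = success_rate(ious, 0.50)  # three of four frames exceed 0.50
sr_75 = success_rate(ious, 0.75)  # two of four frames exceed 0.75
```

The leaderboard reports these quantities as percentages.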

Citation

@inproceedings{alansari2026sentry,
    title={SENTRY: SAM2-Enhanced Neighbor-Aware and Temporally Reasoned Memory for Visual Tracking},
    author={Alansari, Mohamad and Michael, Yonathan and AlMarzouqi, Hasan and Naseer, Muzammal and Javed, Sajid and Werghi, Naoufel},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2026}
}

Contact

For inquiries, reach out at 100061914@ku.ac.ae.