ERK'2022, Portorož

A Long-Term Discriminative Single Shot Segmentation Tracker

Benjamin Džubur¹, Alan Lukežič¹, Matej Kristan¹
¹University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113
E-mail: bd5830@student.uni-lj.si, {alan.lukezic, matej.kristan}@fri.uni-lj.si

Abstract

State-of-the-art long-term visual object tracking methods are limited to predicting the target position as an axis-aligned bounding box. Segmentation-based trackers exist; however, they do not address long-term disappearances of the target. We propose a long-term discriminative single shot segmentation tracker – D3SLT – which addresses these shortcomings. The previously developed short-term D3S tracker is upgraded with a global re-detection module based on an image-wide discriminative correlation filter response and a Gaussian motion model. An online learned confidence estimation module is employed for robust estimation of target disappearance. An additional backtracking module enables recovery from tracking failures and further improves tracking performance. D3SLT performs close to the state-of-the-art long-term trackers on the bounding-box-based VOT-LT2021 Challenge, achieving an F-score of 0.667, while additionally outputting segmentation masks.

1 Introduction

Visual object tracking is one of the main computer vision problems, in which the task of the tracking algorithm is to continuously localize the target in a video. The target is specified by a single supervised example at the beginning of the video. Visual object tracking challenges can be roughly divided into two main groups: (i) short-term tracking, where the target is always visible, and (ii) long-term tracking, where the target can disappear from the field of view or become fully occluded. Long-term trackers require a re-detection capability to re-localize the target after it disappears.

Long-term trackers in general consist of three main components: a short-term tracker, a detector, and an interaction policy between the two. Lukežič et al. [7] use adaptive thresholding of the discriminative correlation filter (DCF) response to detect target absence and employ multiple DCFs, updated at various time scales, in combination with a simple motion model to re-detect the target. On the other hand, many successful long-term trackers from the past two years (LTMU_B [2], CLGS, Megtrack [11]) use MDNet [13] or a similar specialized online verifier for absence detection. For re-detection, the most successful long-term trackers either use a region proposal network (RPN) to regress the target within some region in combination with a simple sliding-window search (e.g., LTMU_B), or use a deep module for global target localization [5]. Interestingly, the winner of the VOT-LT2021 challenge, mlpLT [3], combines two short-term trackers [1, 15] by running both simultaneously and passing their outputs to a verifier [13]. No additional global re-detection module is used.

Figure 1: Tracking results of the proposed D3SLT (rows show ground truth, D3SLT when confident, and D3SLT when not confident). The tracker successfully detects the absence of the target and re-detects it when it re-appears, as demonstrated in the first three rows. It also recovers after re-detection of a wrong target, due to the backtracking module (last row).

State-of-the-art long-term trackers report the target position as an axis-aligned bounding box, which is only an approximation of the target position.
This approximation is often inaccurate, e.g., when tracking an elongated rotated object, or when the target deforms significantly, since many pixels within the predicted bounding box are actually background. In these situations a segmentation mask is a much more accurate prediction. Development of segmentation-based trackers has recently been popularized by the VOT Challenge [11, 12], where the ground truth in the short-term sub-challenge is given as segmentation masks. Wang et al. [14] proposed one of the first segmentation-based short-term trackers by extending a Siamese tracking framework with segmentation mask prediction, while Yan et al. [16] presented a segmentation-based framework which can be applied to any bounding-box tracker. Recently, Lukežič et al. proposed a discriminative single-shot segmentation tracker (D3S [8]), which combines a deep discriminative correlation filter (DCF) for robust target estimation and a feature-matching module for per-pixel segmentation. The D3S tracker has demonstrated robust tracking performance due to its discriminative formulation, as well as high prediction accuracy due to segmentation, but it lacks long-term tracking capabilities. In particular, it fails when the target disappears for a longer period and is not able to recover it after it re-appears in the image.

In this work we present a long-term tracker named D3SLT as the primary contribution, which extends the existing short-term tracker D3S [8] with long-term tracking capabilities. We design a detector, based on the deep DCF in D3S, capable of image-wide target re-detection. A backtracking mechanism is proposed as a secondary contribution, which is used to recover after a potential re-detection of a wrong target. The proposed tracker achieves results competitive with the state-of-the-art long-term trackers on VOT-LT2021, while reporting the target position as a segmentation mask.

2 Long-term Segmentation Tracker

In this section we describe the proposed long-term discriminative single shot segmentation tracker (D3SLT). The tracker extends the existing short-term tracker D3S [9], described in Section 2.1, with long-term tracking capabilities. A crucial component of D3SLT is the target re-detection module (Section 2.3), which is activated when the target is localized with low confidence (Section 2.2). A potential re-detection of a wrong target is resolved by a backtracking module, described in Section 2.4.

2.1 Short-term tracker

The short-term component is used to track the target in consecutive frames. We use the short-term D3S tracker [9], which has demonstrated robust tracking performance as well as high accuracy of the predicted target segmentation. Target localization in D3S is performed within a search region approximately four times larger than the target size. High robustness and accuracy are obtained by combining two visual models – one is adaptive and highly discriminative, but geometrically constrained to Euclidean motion (GEM), while the other is invariant to a broad range of transformations (GIM) and provides an approximate segmentation. The outputs of GIM and GEM are combined in the refinement pathway, which produces an accurate segmentation mask of the tracked target. Finally, a scale estimation module is used for robust estimation of the target size. We refer the reader to the original publication [9] for more details about the short-term tracker.
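As an illustration of the localization step, a minimal NumPy sketch of the search-region extraction is given below. The function and variable names are our own and this is not the actual D3S implementation, which differs in details such as resizing the crop for the backbone; it only demonstrates the "roughly four times the target size" cropping described above.

```python
import numpy as np

def crop_search_region(image, center, target_wh, scale=4.0):
    """Crop a square search region of side scale * max(target width,
    height), centred at the previous target position. Borders are
    padded by edge replication so the crop is always valid."""
    size = int(round(scale * max(target_wh)))
    x0 = int(round(center[0])) - size // 2
    y0 = int(round(center[1])) - size // 2
    padded = np.pad(image, ((size, size), (size, size), (0, 0)), mode="edge")
    crop = padded[y0 + size:y0 + 2 * size, x0 + size:x0 + 2 * size]
    return crop, (x0, y0)  # offset for mapping crop coordinates back to the image
```

The short-term tracker evaluates GEM and GIM only inside such a crop, which is why image-wide re-detection (Section 2.3) is needed as a separate component.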
2.2 Confidence estimation module

This module predicts the confidence of the current target location prediction. Ideally, a high confidence value is output when the predicted position highly overlaps with the actual position of the target, and a low value when the prediction does not cover the target well, e.g., when most of the object is occluded or the object has left the camera view. The confidence score is predicted by classifying the image region cropped around the predicted target position using an online learned classifier. The classifier, also called the verifier, is trained online to distinguish between positive and negative samples using the training methodology presented in [6].

At initialization, 100 positive training samples are generated by perturbing a bounding box so that it overlaps with the initial bounding box by at least 70%, while 200 negative examples are sampled around the target with overlap less than 50%. During confident tracking, both sets of samples are updated every fifth frame. In the positive set, 100 samples are added by randomly perturbing the predicted bounding box so that they overlap with it by more than 70%. The negative set is updated by adding 100 negative samples collected over the entire image. The re-detection module (Section 2.3) is applied on the image to identify the regions most similar to the tracked target. The top 30 local maxima in the global DCF response are taken as candidate positions. The negative samples are constructed by cropping patches of the same size as the target, centered at the candidate positions. Additional translation, scale and aspect jittering is used when generating the samples. The positive and negative sample sets are each limited to 2000 samples. When the limit is reached, the oldest samples are replaced first. After adding new samples to the positive and negative sample sets, the classifier is re-trained.

A low confidence (i.e., lower than a pre-defined threshold τ) indicates that the target is no longer tracked; the re-detection module (Section 2.3) is therefore activated, and the verifier as well as the short-term tracker are no longer updated. Observing K consecutive positive scores deactivates the re-detection module and triggers the backtracking module (Section 2.4). If the backtrack is successful, tracking is resumed with the short-term tracker only.

2.3 Re-detection module

Global (image-wide) target re-detection is performed using the deep discriminative correlation filter (DCF) from the GEM module in D3S, which is correlated over the entire image to obtain a global correlation response $R_G$. After the target disappears, it is unlikely to re-appear immediately at a position far away from the last confident target position, denoted $\mathbf{x}_c$. Thus we introduce a motion prior $\pi(\mathbf{x})$, formulated as a random-walk dynamic model, which models the likelihood of the target position $\mathbf{x} = [x, y]^T$ by a Gaussian $\pi(\mathbf{x}) = \mathcal{N}(\mathbf{x}_c, \Sigma)$. Note that $\Sigma$ is a diagonal covariance matrix defined as
$$\Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix}, \quad \sigma_x = x_w \alpha_s^{\Delta t}, \quad \sigma_y = x_h \alpha_s^{\Delta t},$$
where $x_w$ and $x_h$ denote the width and height of the target, respectively, $\alpha_s$ is the scale increase parameter and $\Delta t$ denotes the number of frames since the last confident estimation. The target location estimated by the re-detection module is defined as the position of the maximum in the response $R_G \odot \pi(\mathbf{x})$, where $\odot$ denotes the point-wise product.
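The re-detection step reduces to a point-wise product and an argmax. The following minimal Python/NumPy sketch implements the formulas above; the function and variable names are ours (not from the D3S/D3SLT code), and the computation of the image-wide DCF response R_G is abstracted away.

```python
import numpy as np

def motion_prior(shape, x_c, target_wh, alpha_s, dt):
    """Random-walk motion prior pi(x): a Gaussian centred at the last
    confident position x_c, whose std grows with the number of frames dt
    since the last confident estimate. The normalisation constant is
    omitted, since only the position of the maximum matters."""
    h, w = shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    sigma_x = target_wh[0] * alpha_s ** dt
    sigma_y = target_wh[1] * alpha_s ** dt
    return np.exp(-0.5 * (((xs - x_c[0]) / sigma_x) ** 2
                          + ((ys - x_c[1]) / sigma_y) ** 2))

def redetect(R_G, x_c, target_wh, alpha_s=1.05, dt=1):
    """Re-detection: maximise the point-wise product of the image-wide
    DCF response R_G and the motion prior."""
    weighted = R_G * motion_prior(R_G.shape, x_c, target_wh, alpha_s, dt)
    y, x = np.unravel_index(np.argmax(weighted), weighted.shape)
    return np.array([x, y]), weighted[y, x]
```

With $\alpha_s = 1.05$ (Section 3.1), the prior widens by 5% per frame of absence, so the search gradually relaxes towards the whole image during long disappearances.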
2.4 Backtracking module

The target re-detection process may fail, leading to tracking of a wrong target. To address such situations we design the backtracking module, which tracks the selected target back in time to resolve possible target switches. Suppose the tracker re-detected the target at frame $t$. A new instance of the short-term tracker (Section 2.1) is initialized in the frame $I_t$ and is used to track backwards through the set of past frames $\Gamma = \{I_{t-1}, \ldots, I_{t-n}\}$, where $n$ is the maximum number of past frames which can be used. The backtracking process can end before reaching $n$ past frames in the following cases: (i) the maximum in the GEM correlation response drops below the threshold $\theta_{GEM}$, or (ii) a confident tracking result was previously obtained at frame $I_{t-j}$. The first case indicates that the tracked target is no longer visible. We cannot determine whether this is the true target or not, therefore we continue with short-term tracking in frame $I_{t+1}$. In the second case (ii), or if all $n$ frames have been backtracked, we compute the probability density at the backtracked target position, $\pi(\mathbf{x}_{t-j})$, based on the motion model and the last confident target position $\mathbf{x}_c$, as defined in Section 2.3. This density is subjected to a threshold $\theta_{MM}$ to decide whether the backtracked target corresponds to the true target (in which case we switch to short-term tracking in $I_{t+1}$) or a distractor (in which case we continue with global re-detection in $I_{t+1}$). Finally, the confidence estimation module (Section 2.2) is updated accordingly, based on features extracted during backtracking.

3 Experiments

In this section, we present the experimental results of the proposed D3SLT on the VOT-LT2021 dataset. In Section 3.1 we describe the implementation details of our tracker. In Section 3.2 we compare our tracker to the state-of-the-art long-term trackers, and in Section 3.3 we report the results of the ablation study.

3.1 Implementation details

We use the pre-trained version of the short-term D3S tracker [9] with the ResNet50 [4] backbone. The following parameters were determined manually, based on our preliminary experiments, and remained fixed in all experiments. The motion model parameter $\alpha_s$ is set to 1.05, the positive and negative samples in the verifier (Section 2.2) are updated every five successfully tracked frames, and the classifier is updated using SGD with learning rate 0.0003 for 15 iterations. The decision threshold in the verifier τ is set to 0.5, while K is set to 3. In the backtracking module, the maximum number of past frames $n$ is set to 260, the motion model uncertainty threshold $\theta_{MM}$ to 0.26 and $\theta_{GEM}$ to 0.16.

3.2 Long-term tracking performance

The proposed D3SLT tracker is evaluated on the VOT-LT2021 benchmark [12]. The dataset consists of 50 long video sequences, in which target objects disappear and reappear frequently. Note that axis-aligned bounding boxes are required by the evaluation protocol; the predicted segmentation masks are therefore transformed into bounding boxes by generating the smallest axis-aligned bounding box which contains the whole mask. The reported confidence score at each frame is defined as the output of the verifier on the patch contained in the predicted bounding box. Tracking performance is evaluated using tracking precision (Pr) and tracking recall (Re), computed under a set of confidence thresholds. The primary performance measure used to rank the trackers is the F-score, defined as $F = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$.
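To make the measures concrete, the following simplified sketch evaluates Pr, Re and F at a single confidence threshold. It is our approximation of the protocol in [10], which sweeps the threshold; per-frame overlaps are assumed precomputed, with zero overlap when the target is absent.

```python
import numpy as np

def pr_re_f(overlaps, confidences, visible, tau):
    """Long-term tracking precision, recall and F-score at threshold tau.
    overlaps[t]: overlap with ground truth in frame t (0 if target absent),
    confidences[t]: reported confidence, visible[t]: target annotated."""
    overlaps = np.asarray(overlaps, dtype=float)
    reported = np.asarray(confidences) >= tau  # frames where the tracker reports the target
    visible = np.asarray(visible, dtype=bool)  # frames where the target is actually present
    pr = overlaps[reported].mean() if reported.any() else 0.0
    re = overlaps[visible].mean() if visible.any() else 0.0
    f = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f
```

In the full protocol, the highest F over the swept thresholds is reported for each tracker.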
The reader is referred to the original publication [10] for a detailed description of the performance evaluation measures and protocol.

Tracker     | Pr    | Re    | F     | Award
mlpLT [3]   | 0.741 | 0.729 | 0.735 | 1 (2021)
STARK_LT    | 0.721 | 0.725 | 0.723 | 2 (2021)
LT_DSE      | 0.715 | 0.677 | 0.695 | 1 (2019, 2020)
LTMU_B [2]  | 0.701 | 0.681 | 0.691 | 2 (2020)
D3SLT       | 0.669 | 0.666 | 0.667 |
D3S [9]     | 0.452 | 0.465 | 0.459 |
FuCoLoT [7] | 0.507 | 0.346 | 0.411 |

Table 1: State-of-the-art comparison on the VOT-LT2021 benchmark. The Award column lists the rank and year(s) of a VOT-LT challenge award.

The proposed D3SLT is compared to the state-of-the-art long-term trackers and the results are presented in Table 1. We observe that the state-of-the-art trackers are slightly more robust (i.e., achieve higher recall); however, these trackers cannot produce a segmentation mask. Compared to the original D3S tracker, our tracker improves the F-score by more than 20 percentage points (pp). The results on selected sequences are visualized in Figure 1. While the tracker segments the targets well, the predicted axis-aligned bounding boxes, which are aligned with the edges of the mask, may not match the ground-truth boxes from the dataset precisely. Additionally, the mask sometimes captures noisy background around the edge of the target, further inflating the predicted box. A more intelligent adaptive prediction of the bounding box based on the mask might further improve precision and thus the F-score.

The proposed tracker runs at 4.7 FPS, measured on a system with a Ryzen 3700X CPU and a single Nvidia RTX 3060 Ti GPU. Most of the slow-down compared to the baseline short-term tracker [9], which runs at 9 FPS, is due to the DCF response computation for image-wide target re-detection and frequent backtracking.

3.3 Ablation study

An ablation study was performed to analyze the contributions of individual D3SLT components. The following variants of D3SLT were considered: (i) without the global re-detection module (GRE); (ii) without the Gaussian motion model in re-detection (MM); (iii) without the backtracking module (BT); and (iv) with the verifier replaced by a confidence score based on the correlation response quality, computed similarly as in [7] (VER). Results of the ablation study are shown in Table 2.

Variant | Pr    | Re    | F
D3SLT   | 0.669 | 0.666 | 0.667
GRE     | 0.654 | 0.515 | 0.576
MM      | 0.653 | 0.648 | 0.651
BT      | 0.654 | 0.642 | 0.648
VER     | 0.649 | 0.631 | 0.640

Table 2: Ablation study of D3SLT on the VOT-LT2021 dataset.

Removing the global re-detection module (GRE) results in an over 9pp drop in F-score, primarily due to the roughly 15pp lower recall. This is expected, as the target may disappear through one edge of the frame and re-appear through another; in this case, it is less likely to be re-detected using only the smaller local search region of the short-term tracker.

The motion model provides an additional inductive bias in our tracker, reducing the chance of tracking a faraway distractor when the target is briefly lost due to, e.g., a short-term occlusion. Removing the motion model (MM) results in a 1.6pp drop in F-score compared to the full D3SLT.

When the verifier is removed and the correlation response is used to compute the confidence score instead (VER), similarly as in [7], we observe a 2.7pp drop in F-score. This indicates that the alternative scoring mechanism is highly correlated with the proposed verifier, though slightly less robust.

When backtracking is disabled (BT), the F-score is 1.9pp lower than that of the full D3SLT.
Clearly, the module helps ensure that D3SLT does not begin to confidently track a distractor when it loses the true target.

4 Conclusion

We introduced a long-term discriminative single shot segmentation tracker – D3SLT. The recent D3S tracker [9] was used as the short-term component and combined with an image-wide re-detection mechanism based on a deep discriminative correlation filter. A separate confidence estimation module was designed for robust estimation of the localization confidence, and a backtracking module was developed to recover from potential wrong re-detections. The tracker achieved respectable performance on the VOT-LT2021 challenge, with an F-score of 0.667. While current state-of-the-art methods perform noticeably better at bounding-box tracking, they do not provide accurate segmentation masks of the tracked object. To further improve D3SLT performance, our future research will focus on bringing robust object-agnostic detectors from the literature into the proposed framework.

Acknowledgement

This work was supported by the ARRS program P2-0214 and project J2-2506.

References

[1] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In ICCV, 2019.
[2] Kenan Dai, Yunhua Zhang, Dong Wang, Jianhua Li, Huchuan Lu, and Xiaoyun Yang. High-performance long-term tracking with meta-updater. In CVPR, 2020.
[3] Matteo Dunnhofer, Kristian Simonato, and Christian Micheloni. Combining complementary trackers for enhanced long-term visual object tracking. Image and Vision Computing, 122:104448, 2022.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[5] Lianghua Huang, Xin Zhao, and Kaiqi Huang. GlobalTrack: A simple and strong baseline for long-term tracking. In AAAI, volume 34, pages 11037–11044, 2020.
[6] Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han. Real-time MDNet. In ECCV, 2018.
[7] Alan Lukežič, Luka Čehovin Zajc, Tomáš Vojíř, Jiří Matas, and Matej Kristan. FuCoLoT – a fully-correlational long-term tracker. In ACCV, 2018.
[8] Alan Lukežič, Jiří Matas, and Matej Kristan. D3S – a discriminative single shot segmentation tracker. In CVPR, 2020.
[9] Alan Lukežič, Jiří Matas, and Matej Kristan. A discriminative single-shot segmentation network for visual object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[10] Alan Lukežič, Luka Čehovin Zajc, Tomáš Vojíř, Jiří Matas, and Matej Kristan. Performance evaluation methodology for long-term single-object tracking. IEEE Transactions on Cybernetics, 51(12):6305–6318, 2021.
[11] Matej Kristan et al. The eighth visual object tracking VOT2020 challenge results. In European Conference on Computer Vision Workshops, 2020.
[12] Matej Kristan et al. The ninth visual object tracking VOT2021 challenge results. In IEEE International Conference on Computer Vision (ICCV) Workshops, 2021.
[13] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
[14] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H. S. Torr. Fast online object tracking and segmentation: A unifying approach. In CVPR, 2019.
[15] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In ICCV, 2021.
[16] Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. Alpha-Refine: Boosting tracking performance by precise bounding box estimation. In CVPR, pages 5289–5298, 2021.