ELEKTROTEHNIŠKI VESTNIK 90(4): 147–158, 2023
ORIGINAL SCIENTIFIC PAPER

Guided Video Object Segmentation by Tracking

Jer Pelhan 1,†, Matej Kristan 1, Alan Lukežič 1, Jiri Matas 2, Luka Čehovin Zajc 1
1 Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
2 The Center for Machine Perception, Czech Technical University, 166 36 Prague 6, Prague, Czech Republic
† E-mail: jer.pelhan@fri.uni-lj.si

Abstract. The paper presents the guided video object segmentation by tracking (gVOST) method for human-in-the-loop video object segmentation, which significantly reduces the manual annotation effort. The method is designed for interactive object segmentation in a wide range of videos with minimal user input. The user iteratively selects and annotates a small set of anchor frames with just a few clicks on the object border; the segmentation is then propagated to the intermediate frames. Experiments show that gVOST performs well on the diverse and challenging videos used in visual object tracking (the VOT2020 dataset), where it achieves an IoU of 73% at only 5% of user-annotated frames. This shortens the annotation time by 98% compared to the brute-force approach. gVOST outperforms the state-of-the-art interactive video object segmentation methods on the VOT2020 dataset and performs comparably on the less diverse DAVIS video object segmentation dataset.

Keywords: convolutional neural network, video object segmentation, video object tracking, user interaction, data annotation

Guiding video object segmentation by tracking

In this paper we propose guided video object segmentation by tracking, a method for interactive object segmentation in video. The method substantially reduces the effort required to produce accurate object segmentations. It is designed for interactive object segmentation in a wide range of videos with minimal user input. The user is asked to iteratively select and annotate a small set of anchor frames with only a few clicks on the object border, after which the segmentation is propagated to the intermediate frames. Experiments show that the method achieves state-of-the-art performance on the diverse and challenging videos used in visual object tracking (the VOT2020 dataset), where it reaches an IoU of 73% with only 5% of the frames annotated by the user. This effectively shortens the annotation time by 98% compared to the naive approach. The proposed method outperforms the state-of-the-art interactive video object segmentation methods on the VOT2020 dataset and performs comparably on the less diverse DAVIS video object segmentation dataset.

Received 29 June 2023
Accepted 19 August 2023

1 INTRODUCTION

High-quality annotations are crucial in the development of modern computer vision methods. While they are traditionally important for objective evaluation, they have recently become the driving force for training ever-advancing deep learning models. Per-pixel segmentation is fundamental for various computer vision tasks, like object detection [8], video editing [24], surveillance [6] and autonomous driving [16]. However, obtaining large manually annotated training and testing datasets, especially per-frame segmented videos, is time-consuming, error-prone and costly.
A number of semi-automatic user-guided segmentation methods have been proposed to address the need for accurately segmented objects in videos [10, 18, 3].
These methods, commonly referred to as video object segmentation (VOS) methods, focus on accurate segmentation in high-resolution videos. On the other hand, even the state-of-the-art video object segmentation methods are designed for segmentation over relatively short video sequences and lack robustness to small, fast-moving objects that substantially change their appearance, and to low-quality videos with varying lighting conditions. This reduces their effectiveness and generality.
Object generality is particularly emphasized in the field of general visual object tracking. An object tracking algorithm has to predict the position of the target in a sequence given only its location in the first frame. The tracker has to be robust over a range of objects and work in a variety of scenes. While general object tracking traditionally considers target locations as bounding boxes, there has been a recent shift towards complete target segmentation. For example, the major tracking challenge, VOT [11], has abandoned bounding boxes in the short-term tracking category and now requires the trackers to output segmentation masks. The most recent VOT challenge also demonstrates that general object trackers cope much better with challenging targets in long videos than video object segmentation algorithms.

Figure 1. Overview of the inputs for different interactive image segmentation methods. From left to right: scribbles for [3, 12, 10, 18], a bounding box for [27] and extreme points for [15].

In this paper, we address the issue of video object dataset annotation. We draw our inspiration from the developments in visual object tracking and propose an interactive method for fast and accurate video object segmentation – gVOST (guided video object segmentation by tracking). gVOST uses a state-of-the-art single-frame segmentation method that allows the user to accurately segment an object with a few clicks. This is combined with recent advances in VOT that enable the propagation of the mask from individual keyframes to the remaining frames. We also describe a method that selects the best mask proposal when several masks are available. Our experimental analysis shows that gVOST performs well over a range of videos and outperforms recent state-of-the-art video object segmentation methods on the most challenging ones. Furthermore, gVOST is integrated into an application for video annotation, which has been successfully used by independent researchers to annotate tracking videos.

2 RELATED WORK

The user-guided video object segmentation methods can be roughly divided according to the type of user involvement into (a) semi-supervised and (b) interactive video object segmentation. Semi-supervised video object segmentation methods predict segmentation masks for the entire video based on the segmentation mask of the first frame. Many methods [26, 1] segment the target based on the features extracted in the first frame. Tokmakov et al. [25] combine the motion data and an appearance embedding to maintain the state of the object from frame to frame. Shaban et al. [22] propose a three-step approach, in which multiple sequence-specific region proposals are generated for each video frame based on the mask in the first frame; a segment-proposal tracking algorithm then labels the regions of the query objects, followed by spatial refinement. These methods do not support any feedback refinement loop, which makes them less suitable for interactive video segmentation.
An interactive video annotation [2] extends the user control over the segmentation with a feedback loop. Most of the state-of-the-art interactive VOS methods [18, 10, 21] are based on two steps: (i) interaction or annotation and (ii) propagation or transfer. In the interaction step, user-approximated segmentation masks from multiple frames are aggregated into an appearance model. In [18, 10], the propagation algorithm predicts masks over the empty frames using an appearance model. In [3], visual tracking is used to constrain the segmentation to tracker-predicted bounding boxes, whereas our method fully integrates the segmentation tracker into the segmentation pipeline.
The user interaction with the segmentation method must be carefully designed. Since a manual mask annotation takes approximately 79 seconds per image [15], image segmentation techniques are used in the interactive VOS methods (Fig. 1). Most methods [10, 18, 3] use scribbles, i.e., a set of curves roughly covering the object area. A faster, but less accurate, alternative is initialization with bounding boxes [29, 27]. In contrast, our method requires the user to click a set of object boundary points, akin to [15, 13]. This kind of input requires less user effort than scribbles and avoids the cumbersome process of drawing accurate bounding boxes.

3 GUIDED VIDEO OBJECT SEGMENTATION BY TRACKING

Video object segmentation can be formally defined as a process that transforms a sequence of images {I_0, ..., I_N} into an equally long sequence of binary masks {M_0, ..., M_N}, each corresponding to one input image. The object of interest is defined externally, either as a complete segmentation in a single frame or in an interaction with the user. A richer interaction leads to overall better results, but at the cost of user engagement. It is therefore crucial to make a compromise between the user engagement and the segmentation quality. The balance is ensured by minimizing the user work in two ways: (1) the structural properties of natural images are used to efficiently segment individual images with a limited amount of user input, and (2) the temporal consistency in videos is used to efficiently propagate masks between the provided key frames, thus further reducing the interaction where possible.
Our method works in two steps repeated until the segmentation masks are sufficiently accurate. The first step is the anchor selection and segmentation step, in which the user selects the frames that act as guidance and annotates their masks in an efficient manner. The set of anchor indices is denoted as

A = {a_0, ..., a_K | 0 ≤ a_0 < a_1 < ... < a_K ≤ N}.   (1)

In the second step, the masks are automatically propagated to the frames between consecutive anchors. Each intermediate frame i receives two proposals: a mask P^F_i propagated forward from the preceding anchor and a mask P^B_i propagated backward from the following anchor. Each proposal is assigned a selection score, S^F_i and S^B_i respectively (Eq. (2)), derived from the tracker's internal confidence C_i described in Section 3.2. The final mask is the higher-scoring proposal,

M_i = P^F_i if S^F_i > S^B_i, and M_i = P^B_i otherwise.   (3)

Figure 2. A single iteration of the interaction loop. (1) User interaction: a frame between two anchors with a bad automatic segmentation is selected and a new mask is inserted by providing border points. (2) Automatic propagation: a new anchor is inserted and the masks for the frames in the intervals connected to the new anchor are updated by automatic propagation. Both steps are repeated until the resulting segmentation is satisfactory.

Our video object segmentation framework leverages two state-of-the-art methods: an interactive image segmentation method for fast anchor-frame segmentation and a visual object tracking method for segmentation propagation.

3.1 Anchor segmentation

DEXTR [15] is utilized for the interactive object image segmentation. It accepts at least four extreme points (the right-most, left-most, top and bottom points) of the object of interest as the input. To obtain a more accurate mask estimate, additional points can be added at the edge of the object (see Fig. 3).

Figure 3. When the desired image segmentation accuracy is not achieved, additional points at the border of the object are added as an input to improve the segmentation.

The segmentation mask prediction is obtained by encoding the border points into a distance heatmap, concatenating the heatmap with the color information into a four-channel image, and passing it through a pre-trained fully convolutional auto-encoder network based on the Deeplab-v2 [23] architecture.
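As an illustration of this input encoding, the following minimal Python sketch renders the clicked border points as a heatmap and stacks it with the RGB image into a four-channel network input. This is not the released implementation; the Gaussian rendering and the sigma value are assumptions made for the example.

import numpy as np

def point_heatmap(points, height, width, sigma=10.0):
    # Render the clicked border points as a single-channel heatmap.
    # A Gaussian rendering and the sigma value are assumptions for this
    # illustration of a DEXTR-style point encoding.
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for px, py in points:
        heatmap = np.maximum(
            heatmap,
            np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2)))
    return heatmap

def build_network_input(image, points):
    # Concatenate the RGB image (H x W x 3) with the point heatmap
    # into the four-channel input (H x W x 4) described above.
    height, width = image.shape[:2]
    heatmap = point_heatmap(points, height, width)[..., None]
    return np.concatenate([image.astype(np.float32), heatmap], axis=-1)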
3.2 Segmentation propagation by tracking

Our segmentation propagation algorithm is based on the recent visual object tracker D3S [14], one of the first successful deep-learning segmentation-based visual object trackers that surpass their traditional bounding-box counterparts. D3S outputs the segmentation based on the information from two different target visual models, GIM and GEM. The geometrically invariant model (GIM) is based on a comparison of deep features* extracted from the query frame with the target and background feature sets obtained from the initialization frame. The background and foreground similarities of each feature are computed as the average of its top-K similarities to the features extracted in the initialization phase, which allows building a segmentation probability map. The second model, the geometrically constrained model (GEM), encodes the target location provided by an adaptive deep discriminative correlation filter [4] as a unimodal probability map. The outputs of both models are concatenated and upscaled with a refinement decoder network into a segmentation mask of the input size.
D3S is originally initialized on a single (starting) frame. We modify it to use initializations from several frames, i.e., from the anchor frames at the start and the end of the interval. The maximum correlation response value from GEM is used as the internal confidence score C_i required by the mask selection score (Eq. (2)).

* D3S uses features extracted with a ResNet50 backbone [7].
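To make the propagation-and-selection step concrete, a minimal sketch for a single inter-anchor interval is given below. The propagate callable is a hypothetical stand-in for the D3S-based propagation described above; it is assumed to return one (mask, score) pair per frame, with the score playing the role of S^F_i or S^B_i in Eq. (3).

def fill_interval(frames, left_anchor_mask, right_anchor_mask, propagate):
    # Propagate forward from the left anchor and backward from the right
    # anchor, then keep the higher-scoring proposal for every frame (Eq. (3)).
    # `propagate(frames, init_mask)` is a hypothetical stand-in for the
    # tracker-based propagation and returns one (mask, score) pair per frame.
    forward = propagate(frames, left_anchor_mask)                # left -> right
    backward = propagate(frames[::-1], right_anchor_mask)[::-1]  # right -> left
    return [mask_f if score_f > score_b else mask_b
            for (mask_f, score_f), (mask_b, score_b) in zip(forward, backward)]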
4 EXPERIMENTS

Several of the proposed video object segmentation datasets [2, 28] focus on accurate segmentation in short sequences with large objects, without evaluating the robustness in more difficult setups. To account for this, we consider a recent visual tracking performance evaluation dataset, the Visual Object Tracking (VOT2020) dataset [11], as our main evaluation dataset. To test the generalization capabilities, gVOST is also evaluated on a well-known video object segmentation dataset, the Densely Annotated Video Segmentation (DAVIS2017) dataset [20], which hosts an interactive video segmentation challenge, allowing a comparison with the state-of-the-art interactive video object segmentation methods. Our method is compared with two state-of-the-art interactive video object segmentation methods: the winner of the DAVIS2020 interactive challenge IVOS [10] and the winner of the DAVIS2018 interactive challenge IVS [18].

4.1 Implementation Details

Our guided video object segmentation by tracking method (gVOST) is implemented in Python using the PyTorch library*. A pre-trained DEXTR model [15], trained on the PASCAL VOC 2012 [5] segmentation dataset, is used. In the DAVIS2017 experiments, presented in Section 4.3, we additionally fine-tune the model on the training part of the dataset [20] for 50 epochs with a learning rate of 1e−8, momentum of 0.9 and weight decay of 5e−4, with a batch size of five samples.
To fully utilize the user annotations, the mask propagation module, D3S, is fine-tuned on the user-annotated anchors during the interactive segmentation for 3 epochs with 50 iterations and a 1e−4 initial learning rate with a 0.2 per-epoch reduction.

* https://pytorch.org/

4.2 Evaluation on the VOT2020 dataset

The VOT2020 dataset is used for our primary evaluation. This is the first visual object tracking dataset that contains the ground truth in the form of binary segmentation masks generated by experienced human annotators. The dataset contains a diverse set of 60 video sequences with an average length of 300 frames, which is several times longer than a typical video object segmentation dataset sequence (approximately 70 frames).

4.2.1 The annotation accuracy: General video sequences included in tracking datasets are quite challenging even for human annotators, as the resolution tends to be modest and the target object may be blurred due to fast motion, etc. This leads to a certain degree of annotation ambiguity. To estimate the achievable segmentation consensus, six volunteers were asked to carefully annotate the objects in selected frames twice in a row. We selected a set of 8 diverse frames from different sequences (see Figure 4) to capture the variability of the objects, while keeping the cumulative effort of the volunteers reasonably low. In this way we obtained 12 segmentation masks for each frame and 96 segmentation masks for the entire set.

Figure 4. Examples of object segmentation masks carefully annotated by the annotators in the user study.

A visual evaluation shows that all the segmentation masks were of the highest quality. Despite some minor differences between the masks for the same object, all the masks should be considered as the ground truth, and their variation specifies the level under which the differences should be considered negligible for a practical evaluation.
The VOT2020 ground truth is compared with the 96 masks obtained in the survey to estimate the level of segmentation ambiguity, i.e., the performance-measure bound beyond which all alternative segmentation masks should be considered equivalent. The average overlap in terms of IoU and J&F is 0.84 and 0.89, respectively. Any overlap exceeding ρ_VOT = μ_VOT − σ_VOT (σ_VOT being the standard deviation of the distribution of differences) is considered accurate beyond the annotation noise. Thus ρ^IoU_VOT = 0.76 and ρ^J&F_VOT = 0.82 are referred to as the VOT annotation accuracy bound.
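A minimal sketch of how such a bound can be computed from paired masks is given below; the exact aggregation used in the study may differ, so this is illustrative only.

import numpy as np

def iou(mask_a, mask_b):
    # Intersection-over-union of two binary masks.
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 1.0

def annotation_accuracy_bound(annotated_masks, reference_masks):
    # rho = mu - sigma over the per-mask overlaps between the independently
    # annotated masks and the corresponding reference (ground-truth) masks.
    overlaps = np.array([iou(a, r)
                         for a, r in zip(annotated_masks, reference_masks)])
    return overlaps.mean() - overlaps.std()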
4.2.2 The evaluation protocol: An automatic evaluation protocol is required for a systematic, repeatable and reproducible analysis [20]. The DAVIS protocol, described in Section 4.3.1, iteratively prompts the user to annotate a single frame. Since the visual object tracking sequences are longer and more challenging than the video object segmentation sequences, we propose a modified protocol that more faithfully reflects a practical interaction scenario. Anchors are selected at every Δ-th frame and segmented using a simulated user input derived from the ground truth. The algorithm then propagates the masks to the other frames. A new anchor is added at the frame with the worst segmentation and the process is repeated. The performance is measured using the intersection over union (IoU) as well as the J&F measure, which explicitly emphasises the accuracy of the segmentation mask at the border of the object [19].
The user interaction through the DEXTR control points is simulated by inferring the points from the ground truth of the corresponding frame. Four extreme points are first calculated from the ground-truth segmentation mask. If needed, an additional point is added along the contour where the estimated mask deviates most from the ground truth. This process is limited to eight points; afterwards, the mask with the largest J&F score is chosen as the final user-selected anchor mask. The reference methods [10, 18] used in our evaluation require scribbles for initialization. The scribbles for these methods are generated using the standard method from the DAVIS interactive challenge toolbox [2].

Figure 5. Segmentation accuracy on VOT2020 with respect to the percentage of annotated frames. gVOST surpasses the VOT2020 accuracy bound ρ_VOT at 10% of all frames annotated, significantly outperforming the competing methods.

4.2.3 Quantitative analysis: The performance on VOT2020 with respect to the percentage of annotated frames in a sequence is shown in Figure 5 and Table 1. On average, gVOST requires seven clicks per anchor mask. At only 5% of manually annotated frames, gVOST achieves an average IoU of 0.732, thus outperforming IVOS [10] and IVS [18] by 40% and 67%, respectively. gVOST reaches the VOT IoU annotation accuracy bound ρ^IoU_VOT when annotating only 9% of all frames (0.76 IoU) and reaches the J&F annotation accuracy bound ρ^J&F_VOT at 7% of annotated frames (0.82 J&F). Neither IVS nor IVOS comes close to the annotation accuracy bound within 20% of annotated frames (every fifth frame).

Table 1. Segmentation accuracy on VOT2020 with respect to the percentage of the annotated frames Ω. The best results are boldfaced.

  Ω     Method      IoU     J&F
  5%    IVS [18]    0.438   0.500
        IVOS [10]   0.524   0.592
        gVOST       0.732   0.810
  10%   IVS [18]    0.529   0.598
        IVOS [10]   0.576   0.655
        gVOST       0.764   0.842
  20%   IVS [18]    0.606   0.684
        IVOS [10]   0.620   0.702
        gVOST       0.793   0.869

According to the estimates in [15], a manual segmentation of all frames in a typical tracking sequence (300 frames) takes over 6 hours of constant work. Our analysis shows that gVOST requires user input approximately at every 17th frame to segment all the frames in the video with masks that exceed the VOT annotation accuracy bound ρ_VOT. This means that gVOST reduces the annotation time by 94%, i.e., to merely 25 minutes. When using the DEXTR-based initialization with seven clicks, only every 10th frame needs to be annotated to segment a video with an equal quality. Since it takes about ten seconds to click seven points on an object, the required video annotation time is reduced to a mere five minutes. The conclusion is that gVOST decreases the user work for visual-object-tracking-domain videos by 98%.
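For transparency, the arithmetic behind these estimates, assuming the 79-second-per-mask figure from [15] and roughly ten seconds for seven DEXTR clicks, works out approximately as follows:

300 frames × 79 s/frame ≈ 23 700 s ≈ 6.6 h (fully manual segmentation),
⌈300/17⌉ = 18 anchors × 79 s ≈ 1 420 s ≈ 24 min (≈ 94% reduction),
300/10 = 30 anchors × 10 s = 300 s = 5 min (≈ 98% reduction).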
4.2.4 Ablation analysis: The contribution of the different parts of gVOST is analyzed in an ablation study. The study is conducted on a subset of eight sequences selected from the VOT2020 dataset†. We compared the influence of fine-tuning the D3S tracker during the annotation process (gVOST-NFT – no fine-tuning) and the influence of initializing the tracker with the data from the anchors at both sides of the interval (gVOST-N2 – no second anchor) with the main method (gVOST). We also included a version of the method that uses the ground-truth segmentation directly in the anchors to study the influence of the DEXTR-based anchor initialization compared to inputting the segmentation mask manually (gVOST-GT).

† The selected sequences are the same as those used as the source of the frames for the segmentation accuracy bound, shown in Figure 4.

Table 2. Ablation analysis results for the eight sequences selected from the VOT2020 dataset with respect to the percentage of the annotated frames in the sequence. The performance is reported as an mIoU/J&F pair (in %).

              5%          10%         20%
  gVOST-NFT   73.6/83.0   75.9/85.2   77.4/86.5
  gVOST-N2    77.9/86.7   78.2/87.4   79.7/88.5
  gVOST       78.0/86.8   79.5/88.3   80.5/89.1
  gVOST-GT    80.4/88.0   82.7/89.9   83.6/92.3

The results, given in Table 2, show that the additional information about the object indeed improves the performance. The fine-tuning of the mask decoder adapts it to the current sequence, thus increasing the accuracy and resulting in a 5% improvement. The influence of using two frames in the tracker initialization is less noticeable, adding only 1%, but a detailed inspection showed that it contributes to the tracker robustness in the more challenging intervals. Using the ground-truth mask to annotate the anchors improves the performance of the main method by about 2–3%. On the other hand, the DEXTR segmentation is significantly faster in practice and yields satisfactory results.

4.2.5 Qualitative analysis: We further perform a qualitative analysis of the segmentation-mask selection protocol. Figure 6 shows the effectiveness of the proposed mask selection strategy on a challenging sequence from the VOT dataset [11]. Every 10th frame of the video is manually annotated and the segmentation for the remaining frames is determined by a forward and backward propagation. The proposed selection method almost always selects the better mask. In the cases where the selection fails, both masks are similarly good or bad. Examples of these situations are visualized at the second and third arrow in Figure 6, where the J&F score of the selected mask is lower, but both segmentation masks are still equally good. In contrast, at the last arrow, the higher-scoring segmentation mask is selected and is clearly better.

Figure 6. J&F overlap of the backward- and forward-propagated masks from the anchors at each frame. The forward- and backward-propagated masks for the selected frames are shown below the plot. The red dots on the plot and the red squares around the frames indicate the mask chosen from the two possibilities.

A qualitative analysis of three sequences is performed for additional insights: (i) the first sequence depicts tracking a rectangular book with out-of-plane rotations and folding, (ii) the second sequence depicts a car with substantial blurring, and (iii) the third sequence depicts the articulated body of an ice skater. The sequences are approximately 150 frames long and every 50th frame is annotated by the user, by inputting either four extreme points (gVOST) or scribbles (IVS [18] and IVOS [10]). The results, given in Figure 9, show that IVS [18] performs worst in terms of robustness, i.e., it completely loses track of the book.
IVOS [10] also struggles, as it begins to drift towards the hands of the girl holding the book. gVOST does not lose track of the object, nor does it segment the surrounding regions belonging to the hands as part of the object. A similar phenomenon is observed in the car sequence, where IVS starts to lose the object frame by frame. In the ice-skater sequence, IVOS struggles with the space between the skater's legs and labels it as part of the body. IVS performs better, but still includes some background as part of the person. gVOST segments all three objects most accurately with a minimal user interaction.

4.3 Evaluation on the DAVIS2017 dataset

The video object segmentation datasets typically contain shorter sequences with large objects which do not change visually as significantly as in the tracking sequences, and the evaluation is more focused on accuracy. To demonstrate the generality of gVOST, it is further evaluated on the reference dataset for interactive video object segmentation [20].

4.3.1 Experimental Setup: The original DAVIS interactive experiment protocol [20] is used. It allows a limited number of user interaction steps for segmenting the entire video. The first interaction step involves annotating the first frame of the sequence. The method is then run to propagate the annotated mask to all the remaining frames. In each subsequent interaction step, the frame with the worst segmentation mask is selected for re-annotation and the masks are propagated again. The annotation experiment is stopped after eight interactions. An additional experiment using the protocol presented in Subsection 4.2 is also performed.

4.3.2 Results: Table 3 compares the overall results of gVOST with the state-of-the-art. gVOST outperforms the winner of the DAVIS2018 interactive challenge IVS [18] as well as other methods, and performs similarly to the winner of the DAVIS2020 interactive challenge [10].

Table 3. Video object segmentation results on the DAVIS2017 validation dataset after 8 initializations. gVOST outperforms the winner of the DAVIS2018 interactive challenge in IoU.

  Method              IoU     J&F
  Najafi et al. [17]  0.548   -
  Heo et al. [9]      0.725   0.752
  IVS [18]            0.734   -
  IVOS [10]           0.790   0.827
  gVOST               0.745   0.775

The results for the interaction protocol given in Section 4.3.1 are shown in Figure 7. gVOST performs on par with IVS [18] in terms of IoU and outperforms it in terms of J&F. gVOST-GT reaches 0.83 IoU with just 5 fully annotated object masks per sequence, whereas the winner of the DAVIS2020 interactive challenge, IVOS [10], reaches the same performance at 40 scribble interactions per object per sequence. We argue that annotating with gVOST-GT is more efficient. Although gVOST-GT takes six minutes of user labour (if the segmentation masks are created fully manually, not in combination with DEXTR) and IVOS needs approximately five minutes, the procedure of searching for the worst-annotated frame in the sequence, interacting and propagating has to be repeated 40 times with IVOS. With gVOST-GT the procedure is repeated only five times, resulting in an easier and more efficient sequence annotation.

Figure 7. IoU and J&F performance on the validation set of the DAVIS2017 dataset with respect to the number of annotated frames.
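As a reference for the interaction loop used in this comparison, the DAVIS-style protocol of Section 4.3.1 can be summarised by the following minimal Python sketch. The annotate, propagate and score callables are hypothetical stand-ins for the simulated user, the evaluated method and the evaluation measure; the official toolbox [2] may differ in details.

def interactive_evaluation(frames, ground_truth, annotate, propagate, score,
                           rounds=8):
    # Minimal sketch of the interactive protocol: annotate the first frame,
    # propagate, then repeatedly re-annotate the frame whose current mask
    # agrees least with the ground truth and propagate again.
    annotations = {0: annotate(0)}            # first interaction: frame 0
    masks = propagate(frames, annotations)
    for _ in range(rounds - 1):
        worst = min(range(len(frames)),
                    key=lambda i: score(masks[i], ground_truth[i]))
        annotations[worst] = annotate(worst)  # the user corrects that frame
        masks = propagate(frames, annotations)
    return masks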
5 CONCLUSION

A novel method for user-guided video object segmentation, gVOST, is presented. In the first stage, a sparse subset of anchor frames is selected and segmented by the user using an interactive image segmentation technique; the labels are then propagated from the anchors to the rest of the frames using a visual object tracking algorithm and a proposal selection mechanism. To the best of our knowledge, gVOST is the first method designed for the annotation of challenging videos. The effectiveness of our approach is demonstrated on the visual object tracking dataset VOT2020 [11]: achieving 0.73 IoU at just 5% of annotated frames, gVOST outperforms the state-of-the-art interactive video object segmentation methods IVS [18] and IVOS [10] by 67% and 40%, respectively. By annotating approximately 10% of the frames, our method achieves an accuracy comparable to the quality of the ground truth. It can be concluded that the presented method outperforms the current interactive video object segmentation methods, which have a worse tracking ability and tend to lose track of the object in the challenging sequences that are common in realistic real-world scenarios. On the reference interactive video object segmentation dataset [20], gVOST performs comparably to the state-of-the-art, thus demonstrating generality over a range of video types.
A video segmentation application implementing the gVOST method is developed. The application offers an intuitive graphical user interface and can be used without extensive computer vision experience. It significantly reduces the user involvement in the annotation of video sequences: the time taken to annotate a single video sequence is on average lowered by 98% in comparison with manual annotation. In other words, the more than six hours needed to manually segment an average tracking video sequence are reduced to approximately 5 minutes of user labour. The application has already been used by independent researchers to annotate the VOT2021 tracking dataset; they found it very useful and time-efficient.

Figure 8. The graphical user interface of the application. It is composed of the file menu bar at the top and the toolbar located at the left and bottom borders.

Based on the detailed feedback, we have identified potential improvements of the method that will be investigated in future work. For example, after an annotation iteration, the user reviews the whole sequence to find and correct segments by adding new anchors. This could be improved by developing an algorithm that automatically suggests which frames are likely to have been poorly segmented. The image segmentation method will also be modified to allow manual correction with negative clicks, enabling the manual exclusion of wrongly segmented parts.

Acknowledgements

The work was supported in part by the Slovenian Research Agency (ARRS), research projects J2-2506 and Z2-1866, and the ARRS research program P2-0214.

REFERENCES

[1] Sergi Caelles et al. "One-Shot Video Object Segmentation". In: Computer Vision and Pattern Recognition (CVPR). 2017.
[2] Sergi Caelles et al. "The 2018 DAVIS Challenge on Video Object Segmentation". In: CoRR abs/1803.00557 (2018). arXiv: 1803.00557. URL: http://arxiv.org/abs/1803.00557.
[3] Bowen Chen et al. "ScribbleBox: Interactive Annotation Framework for Video Object Segmentation". In: Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIII. Ed. by Andrea Vedaldi et al. Vol. 12358. Lecture Notes in Computer Science. Springer, 2020, pp. 293–310.
DOI: 10.1007/978-3-030-58601-0_18. URL: https://doi.org/10.1007/978-3-030-58601-0_18.
[4] M. Danelljan et al. "ATOM: Accurate Tracking by Overlap Maximization". In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 4655–4664. DOI: 10.1109/CVPR.2019.00479.
[5] Mark Everingham et al. "The Pascal Visual Object Classes Challenge: A Retrospective". In: Int. J. Comput. Vision 111.1 (Jan. 2015), pp. 98–136. ISSN: 0920-5691. DOI: 10.1007/s11263-014-0733-5. URL: https://doi.org/10.1007/s11263-014-0733-5.
[6] Monica Gruosso, Nicola Capece, and Ugo Erra. "Human segmentation in surveillance video with deep learning". In: Multimedia Tools and Applications 80.1 (2021), pp. 1175–1199.
[7] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 770–778. DOI: 10.1109/CVPR.2016.90. URL: https://doi.org/10.1109/CVPR.2016.90.
[8] Kaiming He et al. "Mask R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2961–2969.
[9] Y. Heo, Y. J. Koh, and C. Kim. "Interactive Video Object Segmentation Using Sparse-to-Dense Networks". In: The 2019 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2019).
[10] Yuk Heo, Yeong Jun Koh, and Chang-Su Kim. "Interactive Video Object Segmentation Using Global and Local Transfer Modules". In: Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XVII. Ed. by Andrea Vedaldi et al. Vol. 12362. Lecture Notes in Computer Science. Springer, 2020, pp. 297–313. DOI: 10.1007/978-3-030-58520-4_18. URL: https://doi.org/10.1007/978-3-030-58520-4_18.
[11] Matej Kristan et al. The Eighth Visual Object Tracking VOT2020 Challenge Results. 2020.
[12] Di Lin et al. "ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 3159–3167. DOI: 10.1109/CVPR.2016.344. URL: https://doi.org/10.1109/CVPR.2016.344.
[13] Zheng Lin et al. "Interactive Image Segmentation With First Click Attention". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2020.
[14] Alan Lukezic, Jiri Matas, and Matej Kristan. "D3S - A Discriminative Single Shot Segmentation Tracker". In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. IEEE, 2020, pp. 7131–7140. DOI: 10.1109/CVPR42600.2020.00716. URL: https://doi.org/10.1109/CVPR42600.2020.00716.
[15] K.-K. Maninis et al. "Deep Extreme Cut: From Extreme Points to Object Segmentation". In: Computer Vision and Pattern Recognition (CVPR). 2018.
[16] Eslam Mohamed et al. "Monocular Instance Motion Segmentation for Autonomous Driving: KITTI InstanceMotSeg Dataset and Multi-task Baseline". In: arXiv preprint arXiv:2008.07008 (2020).
[17] M. Najafi, V. Kulharia, T. Ajanthan, and P. H. S. Torr. "Similarity Learning for Dense Label Transfer".
In: The 2018 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2018).
[18] Seoung Wug Oh et al. "Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks". In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 5247–5256. DOI: 10.1109/CVPR.2019.00539.
[19] F. Perazzi et al. "A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 724–732. DOI: 10.1109/CVPR.2016.85.
[20] Jordi Pont-Tuset et al. "The 2017 DAVIS Challenge on Video Object Segmentation". In: CoRR abs/1704.00675 (2017). arXiv: 1704.00675. URL: http://arxiv.org/abs/1704.00675.
[21] H. Ren, Y. Yang, and X. Liu. "Robust Multiple Object Mask Propagation with Efficient Object Tracking". In: The 2019 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2019).
[22] A. Shaban et al. "Multiple-Instance Video Segmentation with Sequence-Specific Object Proposals". In: The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2017).
[23] Karen Simonyan and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. URL: http://arxiv.org/abs/1409.1556.
[24] Yanan Sun et al. "Deep Video Matting via Spatio-Temporal Alignment and Aggregation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 6975–6984.
[25] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. "Learning Video Object Segmentation With Visual Memory". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Oct. 2017.
[26] P. Voigtlaender and B. Leibe. "Online Adaptation of Convolutional Neural Networks for the 2017 DAVIS Challenge on Video Object Segmentation". In: The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops (2017).
[27] J. Wu et al. "MILCut: A Sweeping Line Multiple Instance Learning Paradigm for Interactive Image Segmentation". In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 256–263. DOI: 10.1109/CVPR.2014.40.
[28] Ning Xu et al. "YouTube-VOS: Sequence-to-Sequence Video Object Segmentation". In: Proceedings of the European Conference on Computer Vision (ECCV). Sept. 2018.
[29] S. Zhang et al. "Interactive Object Segmentation With Inside-Outside Guidance". In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, pp. 12231–12241. DOI: 10.1109/CVPR42600.2020.01225.

Jer Pelhan received his master's degree from the Faculty of Computer and Information Science, University of Ljubljana, Slovenia, in 2023. He is currently a researcher at the same faculty. His research interests include few-shot learning, object detection and segmentation, object counting and visual object tracking.

Matej Kristan received his Ph.D. from the Faculty of Electrical Engineering, University of Ljubljana, in 2008. He is a full professor and a vice chair of the Department of Artificial Intelligence at the Faculty of Computer and Information Science. He is the president of the IAPR Slovenian pattern recognition society and an Associate Editor of IJCV.
His research interests include visual object tracking, anomaly detection, object detection and segmentation, perception methods for autonomous boats and physics-informed machine learning.

Alan Lukežič received his Ph.D. degree from the Faculty of Computer and Information Science, University of Ljubljana, Slovenia, in 2021. He is currently with the Visual Cognitive Systems Laboratory, Faculty of Computer and Information Science, University of Ljubljana, as a Teaching Assistant and a Researcher. His research interests include computer vision, data mining and machine learning.

Jiri Matas is a professor at the Center for Machine Perception, Czech Technical University in Prague. He holds a PhD degree from the University of Surrey, UK (1995). He has published over 200 scientific papers. His publications have approximately 34000 citations in Google Scholar and 13000 in the Web of Science. His h-index is 65 (Google Scholar) and 43 (Clarivate Analytics Web of Science), respectively. He received the best paper prize at the BMVC in 2002 and 2005, at the ACCV in 2007 and at ICDAR in 2015. He is on the editorial board of IJCV and was an Associate Editor-in-Chief of IEEE TPAMI. His research interests include visual tracking, object recognition, image matching and retrieval, sequential pattern recognition, and RANSAC-type optimization methods.

Luka Čehovin Zajc received his Ph.D. degree from the Faculty of Computer and Information Science, University of Ljubljana, Slovenia, in 2015. He is working at the Visual Cognitive Systems Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia, as an assistant professor and a researcher. His research interests include computer vision, educational robotics, human-computer interaction, and spatial computing.

Figure 9. Selected results on the VOT2020 dataset. The first column presents the initialization frames.