Persons-In-Places: a Deep Features Based Approach for Searching a Specific Person in a Specific Location

Vinh-Tiep Nguyen, Thanh Duc Ngo, Minh-Triet Tran, Duy-Dinh Le and Duc Anh Duong
University of Information Technology, University of Science
E-mail: {tiepnv, thanhnd}@uit.edu.vn, tmtriet@fit.hcmus.edu.vn, {duyld, ducda}@uit.edu.vn

Keywords: video instance search, deep neural network, location search, person search

Received: March 29, 2017

Video retrieval is a challenging task in computer vision, especially with complex queries. In this paper, we consider a new type of complex query which simultaneously covers person and location information: the aim is to search for a specific person in a specific location. Bag-of-Visual-Words (BOW) is widely known as an effective model for representing rich-textured objects and scenes of places, while deep features are powerful for representing faces. Building on these state-of-the-art approaches, we introduce a framework that leverages the BOW model and deep features for person-place video retrieval. First, we propose to use a linear kernel classifier instead of the L2 distance to estimate the similarity of faces represented by deep features. Second, scene tracking is employed to deal with cases in which the face of the query person is not detected. Third, we evaluate several strategies for fusing the individual person search and location search results. Experiments were conducted on a standard benchmark dataset (TRECVID Instance Search 2016) of more than 300 GB in storage and 464 hours in duration.

Povzetek: V prispevku je opisana metoda povpraševanja po osebi in lokaciji iz video vsebin.

1 Introduction

With the rapid growth of video recording devices, vast numbers of videos from diverse domains such as professional and amateur film making, surveillance and home recording are being created. These vast video collections are being shared on video broadcasting sites (e.g., YouTube). One of the most fundamental needs is to help users find exactly what they are looking for in video databases. To search directly on videos, we consider visual instance search on video databases. The term instance search (INS) is defined formally by TRECVID [13]: finding video segments of a certain specific object, place or person in a video collection, given visual examples. Query types vary widely, including rich-textured, fairly-textured and deformable objects. This makes instance search a very challenging task, since no prior information about the query is available.

The objective of the problem considered here is to find a specific person at a specific location in a large-scale video dataset. This type of query is important since persons and locations are the two most popular query objects, and it has many practical applications such as surveillance systems and personal video archive management. It is also a very hard query topic because of large variations in size, lighting conditions and viewpoint. Figure 1 gives an example of this type of query. The images in the first row are examples of a pub that a user wants to search for. These images cover multiple views of a location and contain many irrelevant or noisy objects such as people and decorations, whose noisy features may lower retrieval accuracy. The images in the second row are examples of the person that the user also needs to find if he or she appears at the pub. Persons are special query objects: they are 3D objects seen from multiple views, and they are deformable, with clothing texture that changes over time.
All of these factors make our retrieval task with this compound query even more challenging.

A very natural approach is to combine the scores of recognizing the face and the location. There are several challenges in this approach:

– The scores are independent and incomparable. This makes typical fusion techniques, such as average fusion, ineffective.

– In frames with a very clear and recognizable face, the face often occupies a large proportion of the frame and carries little information about the surrounding scene. Hence, a frame with a high score for recognizing the face may have a low score or a low rank for the location, and vice versa. Simply combining these scores therefore gives low performance.

– In a video scene that contains a person and a location, neither is always shown perfectly: the person may turn their head in multiple directions while the camera viewpoint on the location changes over time. However, the query examples do not cover all views of the target objects.

Figure 1: A query topic includes location examples (first row images) and person examples (second row images) marked by magenta boundaries.

Most state-of-the-art object instance retrieval systems are based on a bottom-up approach built on the well-known Bag-of-Visual-Words (BOW) model [23], which benefits from powerful local descriptors for matching textures and then checks geometric consistency to further improve accuracy. This approach relies on the key assumption that two similar objects share a significant number of local patches that can be matched against each other.

When searching for rich-textured instances which contain enough discriminative texture patterns (e.g., locations, buildings, book covers, paintings), there are some ambiguous patches that share similar shapes with the query instance but belong to an irrelevant object. However, the ratio of such patches is low, so the similarity scores of images containing the correct instance are higher than those of incorrect ones. Moreover, extensions such as geometric consistency checking [16][30] and query expansion [8][7][1] further improve the performance of the search system significantly.

When searching for objects with highly flexible appearance, such as humans, performance is still very low due to the limited representation capacity of the BOW model. For the first video segment in which the query person appears, the problem is equivalent to face recognition, without using other cues such as clothing texture. From that segment to the end of a scene, the person is likely to remain in the same place even if his or her face disappears. In this paper, we propose a system which leverages both BOW and Convolutional Neural Network (CNN) based features for retrieving this new type of query. For location search, we combine BOW based and CNN based features to improve performance. For person search, we use the VGG-Face feature for recognizing the first video shot in which the target person appears. Instead of using a distance metric such as L2, we propose to use a linear kernel method to learn the high-level features encoded by a deep CNN. Finally, in order to boost the recall of the system, we implement scene tracking to keep track of shots that follow the highly responding ones.

The rest of this paper is organized as follows. Section 2 presents related work. Details of our instance search framework are presented in Section 3. Section 4 presents our experimental results on the TRECVID dataset. Finally, Section 5 concludes the paper.
2 Related work

To improve the performance of INS systems, multiple techniques have been proposed, such as the rootSIFT feature [1], large vocabularies [16] and soft assignment [17]. Among them, spatial verification is one of the most effective approaches, and it also serves as a prerequisite for other advanced techniques such as query expansion. Spatial verification can be classified into two categories: spatial reranking [16][30][33] and spatial ranking [10][5][21]. These approaches work very well on large, rich-textured objects such as locations.

To further improve performance, Wan et al. explore deep learning techniques with application to the instance search task [31]. They show that deep features from a CNN pre-trained on a large-scale dataset can be used to represent images or objects in a new instance search task. Moreover, by retraining the deep models on the new domain, the retrieval performance can be boosted significantly. Although only a few training examples per query object are available, a network pre-trained on a previous large-scale dataset converges quickly on the new data domain.

In addition to retraining the CNN, Babenko et al. investigate the performance of compressed deep features, where plain PCA or a combination of PCA with discriminative dimensionality reduction results in very short codes with state-of-the-art performance [4]. They explain that passing an image through the network discards much of the information that is irrelevant for classification (and for retrieval). Thus, CNN-based neural codes from deeper layers retain less useless information than unsupervised aggregation-based representations, so PCA compression works better for neural codes. Besides deep encoding techniques, the same authors also introduce and evaluate a new simple and compact global image descriptor and investigate the reasons underlying its success [3]. They show that aggregating deep convolutional features with sum-pooling outperforms max-pooling, features from fully connected layers [18], VLAD [2] and democratic aggregation [11], which were successfully applied to SIFT features.

Another problem this paper focuses on is face recognition in images and videos. The methods proposed in the literature can be classified into two groups: those that do not use deep learning and those that do. Methods in the first group (so-called "shallow" methods) start by extracting a representation of the face image using hand-crafted local image descriptors such as SIFT, LBP and HOG [9][12][32]; they then aggregate these local descriptors into an overall face descriptor using a pooling mechanism, for example the Fisher Vector [14][22].

This work is concerned mainly with deep architectures, which currently achieve state-of-the-art performance. The idea of such methods is to use a CNN feature extractor whose parameters are learned by composing several linear and non-linear operators. One of the representative methods of this approach is DeepFace [28]. This method uses a deep CNN trained to classify faces using a dataset of 4 million examples of 4000 persons. The goal of training is to minimize the distance between congruous pairs of faces (i.e., portraying the same identity) and maximize the distance between incongruous pairs, a form of metric learning.
The authors later extended this work in [29] by increasing the size of the dataset to 10 million persons with 50 images per person. They proposed a bootstrapping strategy to select identities for training the network and showed that, by controlling the dimensionality of the fully connected layer, the generalisation of the network can be improved. The DeepID series of papers by Sun et al. [24][26][27][25] extends DeepFace, with each paper improving the performance on LFW and YTF incrementally and steadily. A number of new ideas were introduced over this series of papers, including the use of multiple CNNs [26], a Bayesian learning framework [6] to train a metric, multi-task learning over classification and verification [24], CNN architectures which branch a fully connected layer after each convolution layer [27], and very deep networks [25]. Compared to DeepFace, DeepID does not use 3D face alignment but a simpler 2D affine alignment, and trains on a combination of CelebFaces [26] and WDRef [6]. However, the final model in [25] is quite complicated, involving around 200 CNNs.

Recently, research from Google [20] trained a CNN using a massive dataset of 200 million face identities and 800 million image face pairs. Their triplet-based loss considers two congruous faces (a, b) and a third incongruous face c. Differently from other metric learning approaches, the goal is to make a closer to b than to c; comparisons are always relative to a pivot face. During training, this loss is applied at multiple layers, not just the final one.

In this paper, we follow the VGG-Face descriptor network [15], which designs a procedure able to assemble a large-scale dataset with small label noise whilst minimizing the amount of manual annotation involved. The authors use weaker classifiers to rank the data presented to the annotators for reranking. They also show that, with appropriate training, a deep CNN can achieve results comparable to the state of the art without any special techniques. In order to apply it to a new task (instance search) and data domain, instead of using the activations of the last layer, we propose to use the features extracted from one of the fully connected layers together with a linear classifier (e.g., a support vector machine with a linear kernel) to train a face model for the query person. To further improve the performance of the instance search system, especially in the case that the target person turns his or her back to the camera, we propose to combine person tracking with scene tracking.

3 Proposed framework

This section describes our proposed framework and its configuration. Our system includes four main modules: BOW based retrieval, location learning for verification, face learning for recognition, and final fusion. Figure 2 sketches out the workflow of the main components of our INS system.

Figure 2: Framework overview.

Given a compound query topic including person and location examples, our goal is to rank video shots containing that combination. Each example is a video frame of the location or person captured at a specific point of view, as shown in Figure 1. In our framework, instead of using all frames of a video shot, we perform key frame extraction at 5 frames per second to save computational cost. For simplicity of notation, we only consider a set of query examples and the key frames of one shot in the video dataset; other shots are processed similarly.
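As a concrete illustration of the key frame extraction mentioned above, the following is a minimal sketch, not the authors' implementation, of sampling frames at roughly 5 frames per second with OpenCV; the function name and the fallback frame rate are our own assumptions.

```python
# Minimal sketch: sample key frames at ~5 fps from a video file with OpenCV.
# Assumptions: OpenCV (cv2) is available and the container reports its FPS.
import cv2

def sample_keyframes(video_path, rate_hz=5.0):
    """Return a list of (timestamp_in_seconds, frame) pairs sampled at rate_hz."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS is not reported
    step = max(int(round(fps / rate_hz)), 1)     # keep every `step`-th frame
    keyframes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            keyframes.append((idx / fps, frame))
        idx += 1
    cap.release()
    return keyframes
```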
Firstly, for each location example, we extract local features using the Hessian-Affine detector and the rootSIFT descriptor, then quantize them using a codebook trained on the video database. In order to reduce the effect of noisy features introduced by irrelevant persons, we remove all visual words inside bounding boxes returned by a person detector. In this paper, we use Faster R-CNN [19] with a network pre-trained on PASCAL VOC 2007 to find person regions. Each location frame is finally represented by a BOW feature vector L_k with the tf-idf weighting scheme. For each person example, we only use the information returned by a face detector, since the target person may change clothes over time. Each face bounding box is described by a CNN based descriptor and represented by a feature vector F_p.

Since location and person examples are independent, we could compute the two rank lists independently. However, because the BOW model can operate on large-scale video data, we use the location features to retrieve a rank list in the first step and then use the face features for reranking. The top K shots retrieved according to the BOW similarity score are used in the reranking stage. Note that the BOW model is a non-structured model which does not take the spatial relationships between visual words into account. To remove irrelevant shots, we combine a RANSAC based algorithm with a learning based approach applied to high-level feature vectors produced by the very deep CNN VGG-19.

The second part of our system is person recognition based reranking. A person example includes a color image and a mask which helps the system separate the person of interest from irrelevant objects. In this case, we only focus on face features, since the target person changes clothes over time. We use a face detector and a face descriptor to extract representative features of the query person; after this stage, each person is represented by a set of deep feature vectors. A typical way to compare face features is a symmetric distance or similarity score, in which every component of a feature vector is treated evenly. However, this vector is a high-level feature which describes many parts of a face; some of these are important and some are not. Hence, we propose to use a linear classifier to learn the weights of the face feature components and then compute the similarity score between the face model and a video shot.

Finally, we propose a fusion step which takes into account all components of the system: BOW based location search, CNN based irrelevant location removal, face based reranking and scene tracking. In a video scene that contains a person and a location, neither is always shown perfectly: a person may turn their head in multiple directions while the camera viewpoint on the location changes over time. However, the query examples of face and location are limited and incomplete. To propagate the score of a positive shot, we pass that value on to the following shots with a multiplication factor.

3.1 Location search

In the first stage of the system, we retrieve the top K shots that are similar to the location examples using the BOW model with local features. We use the state-of-the-art configuration of the BOW framework that has been used for image retrieval. Local features of each key frame of a shot are extracted using the Hessian-Affine detector and the rootSIFT descriptor; each feature is represented by a 128-dimensional vector.
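The rootSIFT descriptor [1] mentioned above is obtained by L1-normalizing each raw SIFT vector and taking its element-wise square root. The following is a minimal sketch of that transform; the Hessian-Affine detection and raw SIFT extraction themselves are assumed to be done by an external tool, and the function name is ours.

```python
# Sketch of the rootSIFT transform [1]: L1-normalize each 128-D SIFT descriptor,
# then take the element-wise square root (Hellinger mapping).
import numpy as np

def root_sift(descriptors, eps=1e-7):
    """descriptors: (N, 128) array of raw SIFT vectors -> (N, 128) rootSIFT vectors."""
    d = np.asarray(descriptors, dtype=np.float32)
    d /= (np.abs(d).sum(axis=1, keepdims=True) + eps)   # L1 normalization per row
    return np.sqrt(d)
```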
All features gathered from the database video frames are clustered using the approximate K-Means (AKM) algorithm with a very large number of codewords. Due to hardware limitations, only 100 million features are randomly sampled to train 1 million codewords. The features are then quantized against this codebook with a hard-assignment strategy. Finally, each video frame is represented by a very sparse BOW feature vector with the tf-idf (term frequency–inverse document frequency) weighting scheme. Because the rank list contains video shots rather than individual frames, we aggregate all BOW vectors of the frames of a shot into a single vector for compact representation and fast retrieval. Let S_{i,j} denote the BOW feature vector of the j-th frame of the i-th video shot. We accumulate all vectors of a shot into a single one using average pooling:

S_i = \frac{1}{n} \sum_{j=1}^{n} S_{i,j}    (1)

where n is the number of key frames of the shot. The feature vectors of the video shots are then used to build an inverted index, which significantly boosts retrieval speed.

Figure 3: Two images illustrating a location example (left) and a query person example (right). For the location example, there may be irrelevant persons (marked by yellow boundaries) whose noisy visual words take part in the BOW feature vector of the frame. For the person example (marked by a magenta boundary), the face feature is one of the most important features for retrieval.

The similarity between the i-th shot and the given location is computed by the following formula:

LS_i = \frac{1}{n'} \sum_{k=1}^{n'} \mathrm{asym}(L_k, S_i)    (2)

where n' is the number of query examples and asym is an asymmetrical similarity score [34]. The top K shots returned by the BOW model are then reranked in the next steps. One important parameter of this initial step is K, the threshold for selecting the top-ranked shots. By observing the z-score normalized distances of all query examples, we found that they follow the same distribution, as shown in Figure 4. Based on this observation, we fixed the cut-off threshold for selecting the top K shots at −2.5.

Figure 4: Distribution of z-score normalized distance.

The main assumption of the BOW model is that two similar objects share a significant number of local patches that can be matched against each other. The chosen query examples are often captured from ideal views, thanks to the meticulousness of the user, while database frames are not. When the point of view changes significantly, the local feature based BOW model gives poor retrieval performance. To be more robust to viewpoint changes, we additionally represent each video frame by a high-level feature vector derived from a fully connected layer of a CNN. We use a very deep pre-trained network, VGG-19, and remove the last layer, which is commonly used for the classification task. Video frames are resized and normalized before being fed forward through the network. The output of the network is a 4096-dimensional feature vector representing the whole video frame, and comparing two video frames amounts to comparing their feature vectors. However, using a symmetric metric such as the Euclidean distance (L2) may result in low accuracy, since all components of a feature vector then play the same role, whereas for each location only some components are important. A learning method is therefore used to magnify the role of these key components.
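The shot-level pooling of Eq. (1) and the location scoring of Eq. (2) can be sketched as follows. This is a simplified illustration of ours: the paper uses the query-adaptive asymmetrical similarity of [34], for which a plain cosine similarity is substituted here as a stand-in, and dense NumPy arrays stand in for the sparse inverted-index representation.

```python
# Sketch of Eq. (1) (average pooling of per-frame BOW vectors into a shot vector)
# and Eq. (2) (averaging the similarity to all query location examples).
import numpy as np

def pool_shot(frame_bows):
    """frame_bows: (n, V) tf-idf BOW vectors of a shot's key frames -> (V,) shot vector."""
    return np.mean(frame_bows, axis=0)                      # Eq. (1)

def cosine(a, b, eps=1e-12):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def location_score(query_bows, shot_bow):
    """query_bows: (n', V) BOW vectors of the location examples -> scalar LS_i."""
    return float(np.mean([cosine(l, shot_bow) for l in query_bows]))   # Eq. (2), asym -> cosine
```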
3.2 Face feature learning for reranking

The second part of the query is the set of person examples. Face recognition is a very popular approach to identifying a person. Faces are detected using the DPM cascade detector [32], applied to at most 5 key frames per shot. Face feature vectors are then extracted using the VGG-Face descriptor, a CNN based network [15]. In particular, each face image is represented by a 4096-dimensional deep feature vector.

After this stage, each person is represented by a set of deep feature vectors {F_1, F_2, ..., F_m}, where m is the number of face examples. We proceed similarly for each frame of a video shot: S^F_{i,j,k} denotes the feature vector extracted from the k-th face detected in the j-th frame of the i-th shot. A natural way to compute the similarity between a person and a shot is to take the minimum distance over all pairs of face feature vectors:

FS_i = \min_{l,j,k} L_2(F^*_l, S^{F*}_{i,j,k})

where F^*_l and S^{F*}_{i,j,k} are the normalized vectors of F_l and S^F_{i,j,k}, and L_2 is the Euclidean distance.

Although this feature is designed to work with the L2 distance metric, a large performance gap remains. This can be explained by the fact that the components of a face feature vector should not all receive the same weight, and the appropriate weights differ from face to face. Therefore, we propose to learn these weights with a large-margin classifier with a linear kernel. Each face candidate in a frame of a shot, after being passed to the classifier, receives a score: positive values indicate a positive example, and vice versa.

In this paper, we use a Support Vector Machine (SVM) with a linear kernel to train the face features of the target person. Positive features are taken from the query examples, while negative ones are taken from the faces of the last 50 persons of the initial rank list returned by the L2 distance based approach. After training with the SVM algorithm, the target person is represented by a single model M.

3.3 Final fusion

This module is our main contribution; it leverages the power of the BOW model, deep features and machine learning. First, the rank list returned by the BOW based location search is used as input to the geometric verification step. The visual words of each database video frame are verified using the RANSAC algorithm, and the number of inliers represents the similarity between a video frame and the query location. The output of the geometric verification step is the input of the irrelevant location removal step. Using the classifier learned from the location examples, we classify each video frame of a shot with the linear kernel approach. The score of a shot is the average of the decision values of its frames; we remove shots with negative decision values and pass the remaining ones to the next step.

In the face based reranking step, we use the face model learned from the query examples to recognize persons in a video shot. The score of the i-th shot in this step is the maximum decision value over all faces detected in its frames:

score_i = \max_{j,k} \mathrm{svm}(M, S^{F*}_{i,j,k})

where M is the face model, S^{F*}_{i,j,k} is the normalized vector of S^F_{i,j,k}, and svm is the linear classifier. If score_i > 0, at least one frame of the i-th shot contains the query person, and vice versa.
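The face-model learning and shot scoring described above can be sketched as follows, assuming the 4096-dimensional VGG-Face descriptors have already been extracted. This is our own minimal illustration, not the authors' implementation: the particular SVM solver (scikit-learn's LinearSVC), the C value and the helper names are assumptions.

```python
# Sketch: train a linear-kernel SVM face model M from query-face descriptors
# (positives) and descriptors of bottom-ranked candidates (negatives), then
# score a shot by the maximum decision value over all faces detected in it.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def train_face_model(pos_feats, neg_feats, C=1.0):
    X = normalize(np.vstack([pos_feats, neg_feats]))           # L2-normalize each descriptor
    y = np.r_[np.ones(len(pos_feats)), -np.ones(len(neg_feats))]
    return LinearSVC(C=C).fit(X, y)

def shot_face_score(model, shot_face_feats):
    """shot_face_feats: list of 4096-D descriptors of faces detected in the shot's frames."""
    if len(shot_face_feats) == 0:
        return float("-inf")                                    # no face detected in this shot
    X = normalize(np.asarray(shot_face_feats))
    return float(model.decision_function(X).max())              # score_i; > 0 means a match
```

LinearSVC is only one possible linear large-margin solver; any linear-kernel SVM would fit the description above.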
The final step of our system is scene tracking. To deal with cases in which the target person appears in a shot but his or her face is unclear, we transfer the decision value of the last positive shot to the following ones with a small decay. Note that we only apply scene tracking to shots which have negative decision values. Assume two consecutive shots i and i+1 have scores score_i > 0 and score_{i+1} ≤ 0. We then update

score_{i+1} = \frac{1}{2} score_i

and propagate in this way over at most 5 shots with the same factor. The output of this step is the rank list obtained by sorting the final scores in descending order.
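A minimal sketch of this scene-tracking propagation is given below, under our reading that the carried score is halved at each of at most 5 following shots and that a later genuinely positive shot restarts the propagation; the function name and the use of max() to avoid overwriting an already propagated score are our own choices.

```python
# Sketch of scene tracking: propagate a positive shot score forward, halved at
# each step, for at most 5 following shots whose own decision value is not positive.
def scene_tracking(scores, factor=0.5, max_shots=5):
    """scores: per-shot SVM decision values in temporal (shot-id) order."""
    out = list(scores)
    for i, s in enumerate(scores):
        if s <= 0:
            continue                       # only genuinely positive shots propagate
        carry = s
        for j in range(i + 1, min(i + 1 + max_shots, len(scores))):
            if scores[j] > 0:              # stop once the person is detected again
                break
            carry *= factor
            out[j] = max(out[j], carry)
    return out

# For example, scene_tracking([1.2, -0.3, -0.7, 0.0, -0.1])
# returns [1.2, 0.6, 0.3, 0.15, 0.075].
```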
4 Experiment

4.1 Dataset

To demonstrate the advantage of the proposed method on different types of query, we used the TRECVID Instance Search (INS) dataset for evaluation, namely the TRECVID INS benchmark of 2016 released by NIST. We refer to this dataset as INS2016.

For the previous six years (2010–2015), the instance search task tested systems on retrieving specific instances of objects, persons and locations, all sharing the same collection of test videos with a master shot reference. In 2016, a new query type was introduced, asking systems to retrieve specific persons in specific locations. The dataset contains approximately 244 video files extracted from the BBC EastEnders programme, with a total of 300 GB in storage and 464 hours in duration. Each query topic of INS2016 consists of two sets of examples: location and person. For the person set, each example includes an image and a corresponding mask to delimit the target entity from the rest of the frame; for the location set, only image examples are provided. This INS dataset is very challenging due to the variety of query types: from indoor to outdoor locations, and from unclear to clear persons.

Evaluation Protocol. There are 30 query topics (person–location pairs) and about 470 thousand video shots in this challenge. The system must return the top 1000 shots that are most similar to each given topic. The ground-truth files for each query are created manually and provided by the TRECVID organizers. To evaluate the performance of each method, we use mean average precision (MAP) as the standard measurement. Although evaluations of intermediate results, such as location search when combining deep features and BOW, would also be of interest, there are already reports on the performance of state-of-the-art systems on the individual queries of previous years' challenges [13]. Therefore, in this paper, we only consider the performance on the compound query.

4.2 Retrieval performance and visualization

In this section, we discuss quantitative results of our method evaluated against the ground truth of TRECVID INS 2016. For ease of reading, we use the following abbreviations:

– Avg-Fusion: average fusion of the normalized person and location scores.

– L2-Reranking: using our framework, after the geometric verification step, we rerank the initial top K list using the L2 distance between face features. The similarity score of a frame is the opposite of the min-min distance between the face examples and all faces detected in the frames of a shot; the final similarity of a shot is the mean of the similarity scores of its frames (average pooling), as in the other methods of this experiment.

– CNN-Loc+L2-Reranking: the same as L2-Reranking, but with the CNN based location reranking step added after the geometric reranking step.

– Linear Kernel: the same as the CNN-Loc+L2-Reranking baseline, but a linear kernel is used to learn the face model of the query person and to compute similarity scores with candidate faces.

– Linear Kernel + scene tracking: the same as Linear Kernel, but scene tracking is also applied to deal with frames in which the face of the target person is not detected.

4.3 Average fusion for person-location query

In many systems, average fusion is a simple and effective method to improve retrieval performance. However, for compound queries such as location–person, average fusion is not as good as the face based reranking method, as shown in Table 1. This can be explained by the fact that the scores of the target location and of the target person are independent and incomparable. Moreover, in frames with a very clear and recognizable face, the face occupies a large proportion of the frame and carries little information about the surrounding scene. Hence, a frame with a high score for recognizing the face may have a low score or a low rank for the location, and vice versa.

Table 1: Comparison between average fusion and reranking methods.
Run             MAP
Avg-Fusion      15.6
L2-Reranking    18.9

4.4 Deep feature for location reranking

In this section, we show that deep feature based reranking improves performance considerably, even for rich-textured query objects such as locations. The experimental result is shown in Table 2. Past state-of-the-art TRECVID systems showed that, for rich-textured objects such as locations, the local feature based BOW model is one of the most suitable choices. However, in real-life videos, the proportion of location evidence within a frame is often very small. Using CNN features of the query location, the system has more information to keep scenes that would otherwise be removed by the cut-off threshold of the geometric verification step.

Table 2: Comparison of retrieval systems with and without high-level feature reranking.
Run                     MAP
L2-Reranking            18.9
CNN-Loc+L2-Reranking    19.8

4.5 Face feature learning and scene tracking

Table 3 summarizes the results of our different methods, measuring their relative performance in terms of the MAP score.

Table 3: Experimental results on different configurations for TRECVID INS 2016.
Run                             MAP
Linear Kernel + scene tracking  50.6
Linear Kernel                   25.9
CNN-Loc+L2-Reranking            19.8

From the table, we can see that the first proposed method (Linear Kernel) performs much better than the baseline which only uses the L2 distance (CNN-Loc+L2-Reranking), improving the MAP from 19.8% to 25.9%. Moreover, with the scene tracking step, the final performance is boosted significantly, from 25.9% to 50.6%. Note that the scene tracking step not only keeps the precision high but also improves the recall compared to the Linear Kernel method: there are many cases in which the target person does not face the camera, so many shots would otherwise be lost in the final rank list. By using scene tracking, the total recall of the retrieval system is improved substantially. This can be observed in the precision-recall curves shown in Figure 5, where the curve of Linear Kernel + scene tracking is significantly higher than the others.

Figure 5: Precision-recall curves when conducting the experiment on TRECVID INS 2016.

To show the efficiency of the proposed method compared to the baseline system, we visualize the rank lists returned by the systems. The query topic is given in Figure 1. The top six shots returned by the system using the L2 distance and by the one using the Linear Kernel classifier are visualized in Figure 6; each row shows the key frames of one shot of a rank list. When using the L2 distance, the precision is very low, which is why the top-six rank list of the baseline contains many irrelevant shots, marked by red bounding boxes.
With the Linear Kernel classifier, the precision of the system improves significantly, hence the ratio of relevant shots is very high.

Figure 6: Result visualization of the query from Figure 1. a) Top 6 rank list using L2 distance. b) Top 6 rank list using the Linear Kernel classifier.

5 Conclusion

Inspired by recent successes of deep learning techniques, in this paper we leverage the power of deep features for the instance search task. We use deep features as a tool for reranking the location search results, bridging the semantic gap left by the BOW model. Moreover, to search for a more difficult object which is deformable and can be captured in different environments, we apply a machine learning approach to deep features extracted from the human faces detected in video frames. In particular, we investigate a framework combining the BOW model and deep learning based features, applied to an instance search task with a new type of query topic: a specific person in a specific location. Experiments on a large-scale dataset show that the proposed method significantly improves retrieval performance.

In future work, we will investigate advanced deep learning techniques such as retraining the network with new data generated from the query examples. We will also evaluate the retrieval system on other diverse datasets for more in-depth empirical studies.

Acknowledgement

The video frames from the BBC EastEnders video used in this document are programme material copyrighted by BBC. This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2017-26-01.

References

[1] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2911–2918, Washington, DC, USA, 2012.

[2] R. Arandjelović and A. Zisserman. All about VLAD. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1578–1585, 2013.

[3] A. Babenko and V. S. Lempitsky. Aggregating deep convolutional features for image retrieval. CoRR, abs/1510.07493, 2015.

[4] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, pages 584–599. Springer International Publishing, Cham, 2014.

[5] Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang. Spatial-bag-of-features. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3352–3359, June 2010.

[6] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proceedings of the European Conference on Computer Vision - Volume Part III, ECCV'12, pages 566–579, Berlin, Heidelberg, 2012. Springer-Verlag.

[7] O. Chum, M. Perdoch, A. Mikulik, and J. Matas. Total recall II: Query expansion revisited. In IEEE Conference on Computer Vision and Pattern Recognition, pages 889–896, Los Alamitos, CA, USA, 2011.

[8] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In IEEE International Conference on Computer Vision, 2007.
[9] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. In ICCV 2011 - International Conference on Computer Vision, pages 1559–1566, Barcelona, Spain, Nov. 2011. IEEE.

[10] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the European Conference on Computer Vision: Part I, ECCV '08, pages 304–317, Berlin, Heidelberg, 2008. Springer-Verlag.

[11] H. Jégou and A. Zisserman. Triangulation embedding and democratic aggregation for image search. In CVPR - International Conference on Computer Vision and Pattern Recognition, Columbus, United States, June 2014.

[12] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI'15, pages 3811–3819. AAAI Press, 2015.

[13] P. Over, J. Fiscus, G. Sanders, D. Joy, M. Michel, G. Awad, A. Smeaton, W. Kraaij, and G. Quénot. TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2014. NIST, USA, 2014.

[14] O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman. A compact and discriminative face track descriptor. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2014.

[15] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

[16] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[17] J. Philbin, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.

[18] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW '14, pages 512–519, Washington, DC, USA, 2014. IEEE Computer Society.

[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.

[20] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[21] X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu. Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3013–3020, June 2012.

[22] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher Vector Faces in the Wild. In British Machine Vision Conference, 2013.

[23] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1470–1477, Oct. 2003.

[24] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Proceedings of the International Conference on Neural Information Processing Systems, NIPS'14, pages 1988–1996, Cambridge, MA, USA, 2014.
MIT Press.

[25] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. CoRR, abs/1502.00873, 2015.

[26] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 1891–1898, Washington, DC, USA, 2014. IEEE Computer Society.

[27] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014.

[28] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[29] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[30] G. Tolias and Y. S. Avrithis. Speeded-up, relaxed spatial matching. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 1653–1660, 2011.

[31] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the ACM International Conference on Multimedia, MM '14, pages 157–166, New York, NY, USA, 2014. ACM.

[32] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. IEEE Conf. Comput. Vision Pattern Recognition, 2011.

[33] W. Zhang and C.-W. Ngo. Searching visual instances with topology checking and context modeling. In Proceedings of the ACM Conference on International Conference on Multimedia Retrieval, ICMR '13, pages 57–64, New York, NY, USA, 2013. ACM.

[34] C. Zhu, H. Jegou, and S. Satoh. Query-adaptive asymmetrical dissimilarities for visual object retrieval. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 1705–1712. IEEE, 2013.