https://doi.org/10.31449/inf.v45i7.3732

Informatica 45 (2021) 67–81

Hierarchical Modified Fast R-CNN for Object Detection

Arindam Chaudhuri
Samsung R & D Institute Delhi India, NMIMS University Mumbai India
E-mail: arindamphdthesis@gmail.com, arindam.chaudhuri@nmims.edu

Keywords: object recognition, image classification, Fast R-CNN, MS-COCO, CIFAR100, VisualQA

Received: September 3, 2021

In object detection there is a high degree of skewness in the visual separability of objects. Object categories that are hard to distinguish demand dedicated classification, yet deep convolutional neural networks (CNNs) are trained as flat N-way classifiers, so considerable work is required to leverage hierarchical category structures. We present here the Modified Fast region-based CNN (Mod Fast R-CNN) and the Hierarchical Modified Fast region-based CNN (HMod Fast R-CNN), in which deep CNNs are embedded in a categorical hierarchy. Easy classes are separated by coarse classifiers, while difficult classes are classified by fine classifiers. HMod Fast R-CNN is trained by first training its components and then fine-tuning globally using multiple group discriminant analysis, with coarse category consistency used for regularization. For large-scale recognition tasks, scalability is achieved through conditional execution of fine category classifiers and compression of layer parameters. We obtain good results on the MS-COCO benchmark as well as the CIFAR100 and VisualQA datasets, building several HMod Fast R-CNN versions in which the top-1 error of standard CNNs is reduced significantly. HMod Fast R-CNN's performance superiority over other object detectors on the PASCAL VOC 2007 and VOC 2012 datasets is also highlighted.

Povzetek: A method of hierarchical Fast R-CNNs for object detection is presented.

1 Introduction

In computer vision there are several fundamental visual recognition problems: image classification [1], object detection and instance segmentation [2], [3], and semantic segmentation [4], as shown in Figure 1. Image classification recognizes the semantic categories of objects in a given image, as in Figure 1(a). Object detection recognizes object categories and predicts each object's location with a bounding box, as in Figure 1(b). Semantic segmentation predicts pixel-wise classifiers that assign a specific category label to each pixel, thus providing a rich image understanding, as in Figure 1(c). However, semantic segmentation does not distinguish between multiple objects of the same category. At the intersection of object detection and semantic segmentation lies instance segmentation, where different objects are identified and each is assigned a separate categorical pixel-level mask, as in Figure 1(d). Since the birth of convolutional neural networks (CNNs), the image classification [5] and object detection [2], [6] problems have reached a high degree of accuracy [5], [7]. Almost all available object detection techniques [2], [6], [8], [9] work in multi-stage, slow and inelegant pipelines. The complexity arises because detection requires accurate object localization, which entails (a) processing numerous candidate object locations and (b) achieving precise localization from candidate object locations that provide only rough localization. Solutions to these problems have often struggled to achieve good speed, accuracy and simplicity at the same time. The region-based convolutional neural network (R-CNN) [2] has achieved brilliant accuracy in object detection.
However, it has certain drawbacks [2], [6], [8], [9]: (a) training is performed through a pipeline with multiple stages; (b) appreciable space and time complexity is involved; and (c) the object detection process is slow. R-CNN works slowly because a CNN forward pass is performed for each object proposal without any computation sharing. By sharing computation, spatial pyramid pooling networks (SPPN) [8] speed up R-CNN: SPPN computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector taken from the shared feature map. For a given proposal, features are extracted by max-pooling the portion of the feature map inside the proposal to a fixed output size; as in spatial pyramid pooling (SPP), pooling is performed at multiple output sizes and the results are concatenated. SPPN accelerates R-CNN considerably at test time, and training time is also reduced due to fast object feature extraction.

This work is motivated by the success achieved in designing CNNs hierarchically, integrating a category hierarchy with linear classifiers. CNN models are continuously upgraded through enhancement of their components, such as pooling layers [10], activation units [11], [12] and nonlinear layers [13]. These developments have improved CNNs' training and learning processes. This work improves Fast R-CNN's performance considerably; the hierarchical model is built layer-wise with Fast R-CNN as the basic building block. There exists a wide variety of structures with categorical hierarchy [14]. Classification with linear classifiers over a large number of classes can be performed through a taxonomy of classifiers, in which the number of classifiers evaluated on a test image scales sub-linearly with the number of classes [15], [16]. The hierarchy is either pre-specified [17], [18], [19] or learned in a top-down or bottom-up manner [20], [21], [22], [23], [24], [25], [26]. The hierarchical classifiers in [27] and [28] reach considerable speedups at the cost of some accuracy loss. Initial work on category hierarchies for CNNs is available in [29]. [30] achieves good accuracy using a subset of training images re-labeled with the internal nodes of a class tree hierarchy. [31] uses a CNN hierarchy that scales well and has good classification performance.

With the above motivation, this work proposes the Modified Fast R-CNN (Mod Fast R-CNN) and Hierarchical Modified Fast R-CNN (HMod Fast R-CNN) [32] methods. HMod Fast R-CNN performs object detection by hierarchical learning in order to classify objects and refine them. This work develops HMod Fast R-CNN by integrating deep CNNs with a category hierarchy; the algorithm streamlines the training process of R-CNN based object detectors [2], [8]. The image classification task is decomposed into two steps: a weighted coarse component Mod Fast R-CNN classifier separates the easy classes, while the complex classes are directed to weighted fine components which take care of confusable classes. HMod Fast R-CNN is built from the Fast R-CNN building block through a modular design principle, the building blocks being among the top ranked single Fast R-CNNs. Coarse-to-fine classification is adopted here, and the predictions of the fine category classifiers are then integrated as possibilistic means, which accounts for the inherent uncertainty in the data. The proposed architecture is evaluated on the MS-COCO [33], CIFAR100 [34] and VisualQA [35] datasets.
A comparative analysis of HMod Fast R-CNN against other detectors such as the deformable part model (DPM), all versions of you only look once (YOLO) and the single shot multibox detector (SSD) is performed on the PASCAL VOC 2007 [36] and VOC 2012 [37] datasets, along with an error analysis. HMod Fast R-CNN achieves lower error at the cost of an increased memory footprint and classification time. A schematic representation of the HMod Fast R-CNN based prediction system [32] is given in Figure 2 in the Appendix.

This paper is structured as follows. Section 2 presents an overview of related work in object detection. The computational methodology is highlighted in section 3. Section 4 presents experimental results. Finally, the conclusion is given in section 5.

2 Related work

In this section we present significant developments in deep learning-based object detection over the past few years. A good detection algorithm comes with a strong understanding of semantic cues as well as spatial information about the image. Object detection is a fundamental step in many computer vision applications such as face recognition [38], [39], [40], pedestrian detection [41], [42], [43], video analysis [44], [45] and logo detection [46], [47], [48]. Initially, the object detection pipeline was divided into three steps: (i) proposal generation, (ii) feature vector extraction and (iii) region classification. In proposal generation the objective is to search for locations in the image, the regions of interest (RoI), which may contain objects. An intuitive idea is to scan the whole image with sliding windows [49], [50], [51], [52], [53]: input images are resized into different scales and multi-scale windows slide through them in order to capture information about the multiple scales and different aspect ratios of objects. In the next step, a fixed-length feature vector is obtained from each sliding-window location in order to capture discriminative semantic information about the covered region. This feature vector is encoded by various low-level visual descriptors [54], [55], [56], [57], which have shown good robustness to scale, illumination and rotation variance. Finally, region classifiers are learned to assign categorical labels to the covered regions. Support vector machines (SVM) [58] have been used here because they offer good performance on small-scale training data. Along with this, classification techniques such as cascade learning [59], bagging [60] and AdaBoost [61] have also been used in region classification and have provided considerable improvements in detection accuracy.

Successful traditional object detection methods focused on designing feature descriptors that yield good embeddings of the RoI. With good feature representations and robust region classifiers, impressive results [62], [63] were achieved on the PASCAL VOC dataset [64]. The deformable part-based models (DPMs) [65] learn and integrate multiple part models with a deformable loss and mine hard negative examples with a latent SVM for discriminative training. Between 2008 and 2012, progress on PASCAL VOC based on these traditional methods was incremental, relying on building complicated ensemble systems. This reflected the limitations of traditional detectors.

Figure 1: Visual recognition tasks in computer vision: (a) image classification, (b) object detection, (c) semantic segmentation, (d) instance segmentation.
These limitations appeared in proposal generation, in the feature descriptors and in the detection pipeline. In proposal generation, a huge number of redundant proposals is generated, causing many false positives during classification, and window scales designed manually and heuristically cannot match objects well. The feature descriptors are hand-crafted from low-level visual cues [5], [56], [66], which makes it difficult to capture representative semantic information in complex contexts. Moreover, each step of the detection pipeline is designed and optimized separately, so a globally optimal solution cannot be obtained.

After CNNs' considerable success in image classification [1], [67], object detection also achieved remarkable progress with deep learning-based solutions [2], [68], [69]. The newer detection algorithms outperformed traditional ones by huge margins. In [68] a deep CNN model is optimized using stochastic gradient descent (SGD) via backpropagation and provides good performance on digit recognition. However, deep networks were plagued by certain limitations, such as the lack of large-scale annotated training data (causing overfitting), limited computational resources and weak conceptual support compared with SVMs. In [5] a deep CNN trained on the ImageNet dataset showed significant improvement in the Large Scale Visual Recognition Challenge (ILSVRC) over all other approaches. After this success, deep learning methods were quickly adapted to other vision tasks, where they showed promising results over traditional methods. Compared with the hand-crafted descriptors of traditional detectors, deep CNNs generate hierarchical feature representations from raw pixels up to high-level semantic information; these representations are learned automatically from training data and show more discriminative expression capability in complex contexts. Deep CNNs also achieve better feature representations with larger datasets, whereas the learning capacity of traditional visual descriptors is fixed and does not improve as more data becomes available.

The major contributions in deep learning-based object detection can be categorized into three groups: detection components, learning strategies, and applications and benchmarks [32]. Detection components include detection settings, detection paradigms and backbone architectures. Learning strategies cover the training and testing stages. Applications and benchmarks include commonly used applications and public benchmark datasets. In the present scenario, deep learning-based object detection frameworks can be divided into two categories: two-stage detectors and one-stage detectors. Two-stage detectors cover R-CNN [2] and its variants [66], [70], [71], while one-stage detectors include YOLO [72] and its variants [73], [74]. A one-stage detector makes categorical object predictions at each location of the feature maps, without a cascaded region classification step. Two-stage detectors use a proposal generator to produce a sparse set of proposals and extract their features, followed by region classifiers which predict the category of each proposed region. One-stage detectors are more time-efficient and have greater applicability to real-time object detection, while two-stage detectors achieve better detection performance and report state-of-the-art results on benchmark datasets. The single shot multibox detector (SSD) [74] has been one of the significant developments in object detection methods in the past decade.
It does not resample pixels or features for bounding box hypotheses; by eliminating the proposal generation and resampling steps, it improves the speed at which object detection is performed, and it provides high-accuracy detection even for low-resolution images. For real-time object detection, YOLO [72] occupies a very prominent place. It is a state-of-the-art object detection system whose speed and accuracy have grown over the years. To date there are five versions of YOLO [72], [75], [76], [77], [78], each of which supersedes the previous ones. Apart from speed and accuracy, some of the biggest advantages of the first two versions, YOLO and YOLOv2 (YOLO9000) [72], [73], are the network's ability to learn generalized object representations and its smaller architecture. The third version, YOLOv3 [75], is extremely fast and accurate; it uses a few tricks to improve training and increase performance, including multi-scale predictions and a better backbone classifier. The fourth version, YOLOv4 [76], offers improved performance over the previous versions, providing superfast training and accurate object detection. It has also been used to verify the influence of state-of-the-art bag-of-freebies and bag-of-specials object detection methods during detector training; the modified state-of-the-art methods include cross-iteration batch normalization and the path aggregation network, which are more efficient and suitable for single-GPU training. Finally, the fifth version, YOLOv5 [77], has been launched with exceptional improvements: it outperforms all previous versions, with average precision comparable to EfficientDet and higher frames per second. Other recent major developments in deep learning-based object detection include [78], [79], [80], [81], [82], [83]. Table 1 in the Appendix highlights significant state-of-the-art research works in deep learning-based object detection; in each case the best reported results are presented.

3 Computational methodology

In this section the computational frameworks of Mod Fast R-CNN and HMod Fast R-CNN [32] are presented. Subsection 3.1 discusses the Mod Fast R-CNN architecture and its training, followed by the HMod Fast R-CNN architecture and training in subsection 3.2. Subsection 3.3 discusses HMod Fast R-CNN for detection.

3.1 Modified Fast R-CNN with training

The Mod Fast R-CNN architecture is adopted from [32], with [70] as the baseline method and certain variations; it is shown in Figure 3 in the Appendix. The entire image and a set of object proposals form the input to the Mod Fast R-CNN network. A convolutional feature map is produced by processing the entire image with convolutional and max pooling layers. From this feature map, a fixed-length feature vector is extracted for each object by an RoI pooling layer. A sequence of fully connected ($fy_{ct}$) layers takes each feature vector as input, and the $fy_{ct}$ output is fed into 2 sibling output layers. The first produces softmax probability estimates over the $OB$ object classes plus a catch-all background class; the other layer has 4 real-valued output numbers per class, where for each of the $OB$ classes the set of 4 values encodes the refined bounding-box position. The RoI pooling layer uses max pooling to convert the features inside an RoI into a small feature map with a fixed spatial extent of $H_t \times W_h$.
Here $H_t$ and $W_h$ are layer hyperparameters that do not depend on any specific RoI. An RoI is a rectangular window into the convolutional feature map, defined by a 4-tuple $(r_w, c_m, h_t, w_h)$, where $(r_w, c_m)$ is its top-left corner and $h_t$ and $w_h$ are its height and width respectively. RoI max pooling divides the $h_t \times w_h$ window into an $H_t \times W_h$ grid of sub-windows of approximate size $h_t/H_t \times w_h/W_h$ and then max-pools the values of each sub-window into the corresponding grid cell output. Pooling is applied independently to each feature map channel. The RoI layer has 1 pyramid level; it is a special case of the SPP layer in SPPN [8]. For the experiments, 6 pre-trained ImageNet [69] networks with 8 max pooling layers and between 8 and 18 convolutional layers are used. Three transformations initialize a Mod Fast R-CNN network from a pre-trained network. First, the last max pooling layer is replaced by an RoI pooling layer configured with $H_t \times W_h$. Second, the network's last fully connected and softmax layers, trained for 1000-way ImageNet classification, are replaced by fully connected layers, softmax layers over $A + 1$ categories and category-specific bounding-box regressors. Third, the network is updated to absorb 2 data inputs.

Mod Fast R-CNN uses backpropagation to train all network weights. In SPPN, updating the weights below the SPP layer is not effective, because backpropagation through the SPP layer is inefficient when each RoI comes from a different image: each RoI may have a receptive field spanning the entire input image, so the training inputs are large, as the forward pass must process the entire receptive field. Feature sharing is therefore exploited during training. For Mod Fast R-CNN training, the RoIs in each SGD mini-batch are sampled hierarchically, first $I$ images and then $R/I$ RoIs from each image. In the forward and backward passes, computation and memory are shared among RoIs from the same image. Taking a small $I$ reduces mini-batch computation; it could slow training convergence, as RoIs from the same image are correlated, but significant results are achieved using $I = 2$ and $R = 128$ with fewer SGD iterations. The training process is streamlined through fine-tuning that jointly optimizes the softmax classifier and the bounding-box regressors [2], [8].

Mod Fast R-CNN has 2 sibling output layers. The first outputs a discrete probability distribution per RoI over the $A + 1$ categories, $prob = (prob_0, \ldots, prob_A)$, computed by a softmax over the $A + 1$ outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets $v^a = (v^a_x, v^a_y, v^a_{wh}, v^a_{ht})$ for each of the $A$ object classes. The parameterization of $v^a$ is given in [2]: $v^a$ specifies a scale-invariant translation and a log-space height and width shift with respect to the object. Each training RoI is labeled with a ground-truth class $u$ and a ground-truth bounding-box regression target $s$. Each labeled RoI is then jointly trained for classification and bounding-box regression with the multi-task loss $L$:

$$L(p, u, v^u, s) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(v^u, s) \qquad (1)$$

For the true class $u$, the log loss is $L_{cls}(p, u) = -\log p_u$. The second task loss $L_{loc}$ is specified over the true bounding-box regression target tuple $s = (s_x, s_y, s_{wh}, s_{ht})$ and the predicted tuple $v^u = (v^u_x, v^u_y, v^u_{wh}, v^u_{ht})$ for class $u$. The Iverson bracket indicator function $[u \ge 1]$ evaluates to 1 when $u \ge 1$ and to 0 otherwise. By the catch-all convention, the background class is marked as $u = 0$, and $L_{loc}$ is ignored for background RoIs, which have no ground-truth bounding box.
The bounding-box regression loss is:

$$L_{loc}(v^u, s) = \sum_{i \in \{x, y, wh, ht\}} smooth_{L_1}(v^u_i - s_i) \qquad (2)$$

In equation (2) the smooth function is:

$$smooth_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \qquad (3)$$

Compared with the robust $L_1$ loss of $smooth_{L_1}(x)$, the $L_2$ loss used in R-CNN [2] and SPPN [8] is more sensitive to outliers: with unbounded regression targets, an $L_2$ loss requires carefully tuned learning rates to prevent exploding gradients. Equation (3) eliminates this sensitivity. In equation (1) the balance between the two task losses is controlled by $\lambda$; with $\lambda = 1$, the ground-truth regression targets are normalized so that $s_i \sim N(0, 1)$. A related loss is used in [69] to train a class-agnostic object network, with localization and classification separated in a 2-network system.

During fine-tuning, each SGD mini-batch is created from $I = 2$ images selected uniformly at random (in practice, the dataset is permuted and iterated over). From each image 64 RoIs are sampled, giving mini-batches of size $R = 128$. 25% of the RoIs are taken from object proposals that have an intersection over union (IoU) overlap of at least 0.5 with a ground-truth bounding box [2]. Backpropagation routes derivatives through the RoI pooling layer:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r, j)] \frac{\partial L}{\partial y_{rj}} \qquad (4)$$

For each mini-batch RoI $r$ and each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling. The partial derivatives $\partial L / \partial y_{rj}$ are computed by the backwards function of the layer on top of the RoI pooling layer. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001 respectively, with biases initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases, and the global learning rate is 0.001. When training on PASCAL VOC 2007 or VOC 2012 trainval, SGD is executed for 30000 mini-batch iterations; the learning rate is then lowered to 0.0001 and training continues for another 10000 iterations. When training on larger datasets, SGD is executed for more iterations. For weights and biases, the momentum is 0.9 and the parameter decay is 0.0005.

Brute-force learning and image pyramids are used to achieve scale-invariant object detection; both approaches are taken from [8]. In the brute-force approach each image is processed at a predefined pixel size during training and testing, and the network learns scale-invariant object detection from the training data. The multi-scale approach instead provides approximate scale invariance to the network through an image pyramid: at test time, each object proposal is approximately scale-normalized using the image pyramid, and during multi-scale training a pyramid scale is randomly sampled each time an image is sampled. Detection consists of running a forward pass; the object proposals are assumed to be precomputed, as in the Fast R-CNN network from which the model is fine-tuned. The network input is an image (or an image pyramid) together with a list of $R$ objects to score. $R$ is typically around 2000, although there are cases where it is about 45000 at test time. When an image pyramid is used, each RoI is assigned to the scale at which the scaled RoI is closest to $224^2$ pixels [8]. For each test RoI $r$, the forward pass outputs a posterior probability distribution $prob$ and a set of predicted bounding-box offsets relative to $r$, one for each of the $A$ classes, from which $r$ gets its refined bounding-box prediction.
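As a concrete illustration of equations (1)-(3), the following minimal NumPy sketch computes the per-RoI multi-task loss; the array shapes and helper names are assumptions made for this example, not part of the paper.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 of equation (3), applied elementwise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def multitask_loss(prob, u, v_u, s, lam=1.0):
    """Multi-task loss of equation (1) for a single RoI.

    prob : (A+1,) softmax probabilities over A classes + background (class 0)
    u    : int, ground-truth class label (0 = background)
    v_u  : (4,) predicted box offsets (x, y, wh, ht) for class u
    s    : (4,) ground-truth regression targets (ignored when u == 0)
    """
    l_cls = -np.log(prob[u])                             # log loss in equation (1)
    l_loc = smooth_l1(v_u - s).sum() if u >= 1 else 0.0  # equation (2)
    return l_cls + lam * l_loc

# Example: one foreground RoI of class 3
prob = np.array([0.05, 0.05, 0.10, 0.70, 0.10])
print(multitask_loss(prob, u=3,
                     v_u=np.array([0.1, -0.2, 0.05, 0.0]),
                     s=np.array([0.0, 0.0, 0.0, 0.0])))
```

Averaging this quantity over the $R$ RoIs of a mini-batch would give the per-batch training objective.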
For each object class $k$, a detection confidence $Prob(class = k \mid r) \triangleq p_k$ is assigned to $r$ through the estimated probability. Non-maximum suppression is then performed independently for each class, using the Fast R-CNN algorithm [84]. For whole-image classification, the time spent computing the convolutional layers is greater than that for the fully connected layers. For detection, however, the number of RoIs to process is large, and about 50% of the forward pass time is spent computing the fully connected layers [84], [85]. Large fully connected layers are easily accelerated by compressing them with truncated singular value decomposition (SVD). A layer parameterized by a $u \times v$ weight matrix $W$ is approximately factorized as $W \approx U \Sigma_t V^T$. Here $U$ is a $u \times t$ matrix comprising the first $t$ left-singular vectors of $W$, $\Sigma_t$ is a $t \times t$ diagonal matrix containing the top $t$ singular values of $W$, and $V$ is a $v \times t$ matrix comprising the first $t$ right-singular vectors of $W$. Truncated SVD reduces the parameter count from $uv$ to $t(u + v)$, which is significant when $t < \min(u, v)$. The single fully connected layer corresponding to $W$ is compressed by replacing it with 2 fully connected layers with no non-linearity between them: the first uses the weight matrix $\Sigma_t V^T$ and no biases, while the second uses $U$ together with the original biases associated with $W$. As the number of RoIs grows, good speedups are achieved through this compression.

3.2 Hierarchical modified Fast R-CNN with training

Now the architecture of HMod Fast R-CNN [32], built on the success of Mod Fast R-CNN [32], is presented. The image dataset consists of images $\{x_i, y_i\}_i$, with $x_i$ and $y_i$ representing the image data and label respectively. The dataset $\{S^f_j\}_{j=1}^{Ct}$ contains $Ct$ fine categories of images. A category hierarchy with $A$ coarse categories $\{S^{ct}_a\}_{a=1}^{A}$ is used to shape the learning process. HMod Fast R-CNN emulates the category hierarchy structure, with the fine categories grouped into coarse categories. As shown in Figure 4 [32] in the Appendix, classification happens end to end. The network consists of 5 components: (a) a high-level feature extraction layer, (b) a low-level feature extraction layer, (c) weighted coarse component independent layers $\{B_a\}_{a=1}^{A}$, (d) weighted fine component independent layers $\{F_a\}_{a=1}^{A}$ and (e) a possibilistic averaging layer. The extraction layers are on the leftmost side of Figure 4; they take raw image pixels as input and extract high-level features followed by low-level features. The configuration of the extraction layers is kept the same as that of the corresponding layers in the building block net. The weighted coarse component independent layers assign a weight factor to each of the $A$ layers and give a coarse prediction based on the best weight achieved. The probabilities in the weighted coarse category (a) provide the weight factors for combining the predictions that the fine category components make, and (b) enable thresholded conditional execution, in which only the fine category components whose coarse probabilities are sufficiently large are executed. The independent layers are represented by the set of weighted fine category classifiers $\{F_a\}_{a=1}^{A}$, each of which makes weighted fine category predictions. Each weighted fine category component accurately classifies a small set of categories, so each produces a fine prediction with respect to a partial set of categories. The probabilities of fine categories outside the partial set are taken as zero.
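A small sketch of this partial-set convention follows; the shapes and names are assumptions for the example, not part of the paper. It scatters one fine component's prediction into the full $Ct$-dimensional fine-category vector:

```python
import numpy as np

def embed_partial(fine_probs, category_ids, num_fine):
    """Scatter a fine component's partial-set prediction into a full
    fine-category vector; categories outside the partial set get zero.

    fine_probs   : (m,) probabilities over the component's m fine categories
    category_ids : (m,) global fine-category indices of those categories
    num_fine     : Ct, the total number of fine categories
    """
    full = np.zeros(num_fine)
    full[np.asarray(category_ids)] = fine_probs
    return full

# Example: a component covering fine categories {2, 5, 7} of Ct = 10
print(embed_partial(np.array([0.6, 0.3, 0.1]), [2, 5, 7], 10))
```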
The layer configurations are copied from the building block Mod Fast R-CNN; however, in the final classification layer the number of filters is set to the partial set size. The common layers are shared by both the weighted coarse category and the fine category components, for the following reasons. The preceding layers in deep networks [63] respond to low-level, class-agnostic features, for example corners and edges, while class-specific features are extracted by the rear layers. The preceding layers are shared by both coarse and fine components because low-level features are useful for both coarse and fine classification tasks. This sharing considerably reduces the floating-point operations and memory footprint of network execution, and it also decreases the number of HMod Fast R-CNN parameters, which is vital for training the network. Finally, there is a possibilistic averaging layer, where the fine category and coarse category predictions are received and converted into possibilistic measures through equation (5); a weighted average is then produced as the final prediction result. The merging step plays a significant role in the averaging layer of HMod Fast R-CNN. The weight factors are decided based on certain heuristics [32]: in the initial iterations they are decided based on the dataset considered, while in later iterations the distribution of coarse-grained and fine-grained images is taken into account. This helps in reaching the best possible final prediction. The possibilistic measures handle the inherent uncertainty in the data better than probabilistic values:

$$possb(x_i) = \frac{\sum_{a=1}^{A} possb(B_{ia}) \, possb_a(x_i)}{\sum_{a=1}^{A} possb(B_{ia})} \qquad (5)$$

In equation (5), $possb(B_{ia})$ is the possibility of coarse category $a$ for image $x_i$, as predicted by the coarse category component $B$, and $possb_a(x_i)$ is the prediction produced by the fine category component $F_a$. The layer configurations of the building block Mod Fast R-CNN are reused for both coarse and fine category components; this flexibility of the modular design allows the best Mod Fast R-CNN module to serve as the building block. As fine category components are inserted into HMod Fast R-CNN, the parameters in the rear layers increase linearly with the number of coarse categories. This increases the training complexity as well as the risk of overfitting for the same amount of training data. Within a stochastic gradient descent mini-batch, training images are routed probabilistically to the various fine category components; to ensure reliable parameter gradients in the fine category components, larger mini-batches estimated from a fairly large number of training samples are required. A large training mini-batch increases the training memory footprint and makes the training process considerably slower. HMod Fast R-CNN training is therefore decomposed into several steps, as shown in Figure 5 [32]. HMod Fast R-CNN is sequentially pre-trained for its coarse and fine category components. First, a building block Mod Fast R-CNN $F_p$ is pre-trained on the training set. Since the building block Mod Fast R-CNN resembles the preceding and rear layers of the coarse category component, the weights of $F_p$ are copied into the coarse category component for initialization. The fine category components $\{F_a\}_a$ are independently pre-trained in parallel. Each $F_a$ specializes in the classification of the fine categories within coarse category $S^{ct}_a$; thus, the pre-training of each $F_a$ uses the images $\{x_i \mid i \in S^{ct}_a\}$ of coarse category $S^{ct}_a$.
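Before continuing with the training steps, it is worth making the merging step of equation (5) concrete. The following minimal NumPy sketch (shapes assumed for the example; the transformation of raw predictions into possibility values is left abstract, since the paper specifies it only through the heuristics mentioned above) computes the weighted average:

```python
import numpy as np

def possibilistic_average(coarse_poss, fine_preds):
    """Weighted merging of fine predictions per equation (5).

    coarse_poss : (A,) possibility of each coarse category for one image,
                  i.e. possb(B_ia) from the coarse component
    fine_preds  : (A, Ct) prediction of each fine component over the full
                  fine-category vector (zeros outside its partial set)
    Returns the (Ct,) merged prediction possb(x_i).
    """
    weights = np.asarray(coarse_poss)
    return weights @ np.asarray(fine_preds) / weights.sum()

# Example with A = 2 coarse and Ct = 4 fine categories
coarse = np.array([0.8, 0.2])
fine = np.array([[0.7, 0.3, 0.0, 0.0],    # component 1 covers classes 0-1
                 [0.0, 0.0, 0.9, 0.1]])   # component 2 covers classes 2-3
print(possibilistic_average(coarse, fine))
```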
The shared preceding layers are initialized and kept fixed from this point on. For each $F_a$, all rear layers except the last convolutional layer are initialized by copying the learned parameters from the pre-trained model $F_p$.

3.3 Hierarchical modified Fast R-CNN for detection

After HMod Fast R-CNN [32] is trained, detection is performed; this subsection highlights this step. The complete HMod Fast R-CNN is fine-tuned once the coarse and fine category components have been appropriately pre-trained. Each fine category component is directed towards classifying a fixed subset of fine categories, determined when the category hierarchy and the associated mapping $P_0$ are learned. The semantics of the coarse categories predicted by the coarse category component must remain consistent during fine-tuning, so a coarse category consistency term is included to regularize the multiple group discriminant loss. The fine-to-coarse mapping $P : [1, Ct] \mapsto [1, A]$ paves the way towards specifying the target coarse category distribution $\{t_a\}$. Here $t_a$ is set to the fraction of all training images within coarse category $S^{ct}_a$, under the assumption that the distribution of coarse categories over the training dataset is close to that in a training mini-batch:

$$t_a = \frac{\sum_{j \mid a \in P(j)} |S_j|}{\sum_{a'=1}^{A} \sum_{j \mid a' \in P(j)} |S_j|} \qquad (6)$$

The final loss function for fine-tuning HMod Fast R-CNN is:

$$Loss = -\frac{1}{n} \sum_{i=1}^{n} \log(possb_{y_i}) + \frac{\lambda}{2} \sum_{a=1}^{A} \left( t_a - \frac{1}{n} \sum_{i=1}^{n} B_{ia} \right)^2 \qquad (7)$$

Here the training mini-batch size is $n$ and the regularization constant is $\lambda = 20$. As fine category components are added to HMod Fast R-CNN, the rear-layer parameters, memory footprint and execution time scale linearly with the number of coarse categories. To scale HMod Fast R-CNN to large-scale visual recognition, conditional execution and layer parameter compression techniques are used. It is not necessary to test all fine category classifiers for a given image: classifiers whose weights $B_{ia}$ are insignificant make a negligible contribution to the final prediction. HMod Fast R-CNN classification is accelerated through conditional execution of only the top weighted fine components: $B_{ia}$ is thresholded using $B_t = (\beta A)^{-1}$, with $B_{ia}$ reset to 0 when $B_{ia} < B_t$, and fine category classifiers with $B_{ia} = 0$ are not evaluated. In HMod Fast R-CNN the number of rear-layer parameters in the fine category classifiers is directly proportional to the number of coarse categories, so to reduce the memory footprint the layer parameters are compressed at test time. A product quantization approach is chosen to compress a parameter matrix $W \in R^{m \times n}$: $W$ is partitioned horizontally into segments of width $s$, $W = [W^1, \ldots, W^{(n/s)}]$, and k-means then clusters the rows of each $W^i$, $i \in [1, (n/s)]$. A compression factor of $32mn / (32kn + 8mn/s)$ is achieved by storing the nearest cluster indices in an 8-bit integer matrix $I \in R^{m \times (n/s)}$ and the cluster centers in a floating-point matrix $C \in R^{k \times n}$. The hyperparameters for parameter compression are $(s, k)$.

Figure 5: HMod Fast R-CNN training algorithm.

4 Experimental results

The results of the various experiments performed are presented in this section. HMod Fast R-CNN is evaluated on the benchmark datasets MS-COCO [33] and CIFAR100 [34], [86], as well as on VisualQA [35]. It is implemented in Caffe [87], and backpropagation [32] is used for network training.
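As an illustration of the product quantization compression described in section 3.3, a minimal sketch using scikit-learn's KMeans is given below; the helper names are hypothetical and the library choice is an assumption of this example, as the paper does not name an implementation. $(s, k)$ are the compression hyperparameters from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(W, s, k):
    """Compress W (m x n) by clustering the rows of each width-s segment.

    Returns 8-bit cluster indices I (m x n/s) and centers C (k x n),
    mirroring the storage scheme described in section 3.3.
    """
    m, n = W.shape
    assert n % s == 0
    idx, centers = [], []
    for j in range(n // s):
        seg = W[:, j * s:(j + 1) * s]               # m x s segment
        km = KMeans(n_clusters=k, n_init=10).fit(seg)
        idx.append(km.labels_.astype(np.uint8))     # one index per row
        centers.append(km.cluster_centers_)         # k x s centers
    I = np.stack(idx, axis=1)                       # m x (n/s)
    C = np.hstack(centers)                          # k x n
    return I, C

def reconstruct(I, C, s):
    """Approximate W from the stored indices and centers."""
    m, nseg = I.shape
    return np.hstack([C[I[:, j], j * s:(j + 1) * s] for j in range(nseg)])

W = np.random.randn(64, 32)
I, C = product_quantize(W, s=8, k=16)
print(np.abs(W - reconstruct(I, C, s=8)).mean())   # mean reconstruction error
```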
All test experiments are run on an NVIDIA Tesla V100 card. MS-COCO [33] is a large-scale object detection, segmentation and captioning dataset. It comes with several prominent features, such as object segmentation, recognition in context and superpixel stuff segmentation. It consists of 330000 images, of which more than 200000 are labeled, and has 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image and 250000 people with keypoints. Figures 6(a) and 6(b) consider the 10 best overlapping coarse categories; the optimal number of coarse categories depends on the dataset and is also affected by the inherent hierarchy within the categories.

The CIFAR100 dataset [34], [86] contains 100 natural image classes, with 50000 training images and 10000 testing images. The dataset is pre-processed using global contrast normalization and ZCA-cor whitening. For training, image patches of size $30 \times 30$ are randomly flipped and cropped. A 4-layer stacked NIN network, denoted CIFAR100-NIN, is adopted and used as HMod Fast R-CNN's building block. The preceding layers from $conv1$ to $pool1$ are shared by the weighted fine category components; these layers account for 10% of the total parameters and 35% of the floating-point operations. The remaining layers are treated as independent layers. To construct the category hierarchy, 10000 images are chosen at random from the training set and held out. Fine categories within the same coarse category are visually similar. The rear layers of the fine category components are pre-trained with an initial learning rate of 0.05, which decreases by a factor of 10 every 6000 iterations. Fine-tuning is performed for 20000 iterations with mini-batches of size 256, with an initial learning rate of 0.005 which decreases by a factor of 10 every 10000 iterations. 10-view testing [32] is used for evaluation: five $30 \times 30$ patches (4 corner patches and 1 center patch) along with their horizontal reflections are extracted and their predictions averaged. HMod Fast R-CNN has a lower testing error than CIFAR100-NIN. During category hierarchy construction, the clustering algorithm adjusts the coarse categories; by varying the hyperparameter $\gamma$, the coarse categories can be made overlapping or disjoint. Their impact on classification error is investigated through experiments with 5, 10, 16 and 20 coarse categories and varying values of $\gamma$. Figures 7(a) and 7(b) consider the 10 best overlapping coarse categories, achieved with $\gamma = 6$; the optimal number of coarse categories and $\gamma$ depend on the dataset and are also affected by the inherent hierarchy within the categories. In comparison with the building block net, the use of shared layers gives HMod Fast R-CNN sublinear growth in computational complexity and memory footprint with the number of fine category classifiers: without parameter compression, HMod Fast R-CNN with 10 fine category classifiers consumes less than four times the memory of the building block net on MS-COCO and CIFAR100-NIN. Tables 2 and 3 report the classification error, memory footprint and net execution time. Using the pre-trained building block net, HMod Fast R-CNN is assembled from the coarse category component and all fine category components with independently initialized preceding layers. Central cropping is used for single-view testing, with a slight error increase.
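For illustration, here is a minimal sketch of the multi-view evaluation described above, assuming the usual crop geometry (4 corner crops, 1 center crop and their horizontal reflections); names and shapes are assumptions for the example:

```python
import numpy as np

def ten_view_crops(img, size=30):
    """Extract 4 corner crops + 1 center crop and their horizontal
    reflections (10 views) from an H x W x C image."""
    h, w = img.shape[:2]
    tops = [0, 0, h - size, h - size, (h - size) // 2]
    lefts = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [img[t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    crops += [c[:, ::-1] for c in crops]          # horizontal reflections
    return np.stack(crops)

def ten_view_predict(img, predict_fn):
    """Average a classifier's predictions over the 10 views."""
    return np.mean([predict_fn(c) for c in ten_view_crops(img)], axis=0)

# Example with a dummy 2-class predictor over a random 36 x 36 image
img = np.random.rand(36, 36, 3)
print(ten_view_predict(img, lambda c: np.array([c.mean(), 1 - c.mean()])))
```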
The memory footprint and testing time are considerably reduced by the shared layers. Varying the hyperparameter $\beta$ affects the fine category components considerably; a tradeoff exists between execution time and classification accuracy, since with larger $\beta$ values more fine category components are executed and higher accuracy is achieved. As shown in Tables 2 and 3, there is a slight error increase when conditional execution is enabled with $\beta = 6$, at which point HMod Fast R-CNN takes about 3 times the testing time of the building block net. When the independent layers from $conv2$ to $conv6$ of the fine category HMod Fast R-CNNs are compressed, the memory footprint is reduced from 448 MB to 269 MB with a slight error increase. As highlighted in Tables 2 and 3, the HMod Fast R-CNN memory footprint is then nearly 2 times that of the building block model. It is therefore mandatory to compare HMod Fast R-CNN against a strong baseline of identical complexity: CIFAR100-NIN is adapted with doubled filters in all convolutional layers, which increases the memory footprint by more than 3 times; this model is denoted CIFAR100-NIN-double. Its error is higher than that of HMod Fast R-CNN (Tables 4 and 5).

Conceptually, HMod Fast R-CNN differs from model averaging [32]. With model averaging, every model classifies the full category set and each model is trained independently; their predictions differ because different initializations are used. In HMod Fast R-CNN, by contrast, each fine category classifier classifies a partial category set. To compare HMod Fast R-CNN with model averaging, 2 CIFAR100-NIN networks are trained independently and their averaged prediction is treated as the final prediction. Tables 4 and 5 show that HMod Fast R-CNN achieves lower error. Note that HMod Fast R-CNN is orthogonal to model averaging, and ensembles of HMod Fast R-CNNs obtain a considerable performance enhancement. To verify the effectiveness of the coarse category consistency term in equation (7), HMod Fast R-CNN is fine-tuned using multiple group discriminant analysis; Tables 4 and 5 show a higher testing error when HMod Fast R-CNN is fine-tuned without coarse category consistency. There is considerable performance improvement for HMod Fast R-CNN on the MS-COCO and CIFAR100 datasets. To further support the experimental hypothesis, some results on the VisualQA dataset are highlighted in Tables 6 and 7.

A comparative performance analysis of Mod Fast R-CNN and HMod Fast R-CNN [32] with Fast R-CNN, YOLO, Fast YOLO, YOLOv3, YOLOv4, YOLOv5, DPM and SSD on the PASCAL VOC 2007 and VOC 2012 datasets is presented in Table 8. To achieve better results, a few object detectors are trained on the union of the PASCAL VOC 2007 and VOC 2012 datasets. An error analysis of HMod Fast R-CNN [32] with Fast R-CNN and all versions of YOLO on the same datasets is shown in Figure 8, which highlights the percentage of localization and background errors in the top N detections per category. To further strengthen the results, a comparative analysis of Mod Fast R-CNN and HMod Fast R-CNN with state-of-the-art methods is presented in Table 9. Before concluding this section, we shed some light on the design evaluation of HMod Fast R-CNN. Several experiments were performed in this direction to achieve optimal performance for HMod Fast R-CNN; however, certain questions remain which need to be discussed.
Figure 6: (a) Testing error (10-view) against the number of coarse categories and (b) overlapping coarse categories with respect to fine category occurrences, on the MS-COCO dataset.

Figure 7: (a) Testing error (10-view) against the number of coarse categories and (b) overlapping coarse categories with respect to fine category occurrences, on the CIFAR100 dataset.

Table 2: Testing errors, memory footprint and testing time of the building block nets and HMod Fast R-CNN on the MS-COCO dataset (testing mini-batch size = 100; SL = shared layers, CE = conditional execution, PC = parameter compression).

Model | Top-1 error | Mem (MB) | Time (s)
Base: CIFAR100-NIN | 31.90 | 186 | 0.05
Mod Fast R-CNN w/o SL | 31.87 | 736 | 2.00
HMod Fast R-CNN w/o SL | 31.72 | 1250 | 2.36
HMod Fast R-CNN | 31.36 | 455 | 0.31
HMod Fast R-CNN + CE | 31.21 | 448 | 0.14
HMod Fast R-CNN + CE + PC | 31.05 | 270 | 0.14

Table 3: Testing errors, memory footprint and testing time of the building block nets and HMod Fast R-CNN on the CIFAR100 dataset (testing mini-batch size = 100; SL = shared layers, CE = conditional execution, PC = parameter compression).

Model | Top-1 error | Mem (MB) | Time (s)
Base: CIFAR100-NIN | 33.96 | 186 | 0.05
Mod Fast R-CNN w/o SL | 33.90 | 736 | 2.00
HMod Fast R-CNN w/o SL | 33.69 | 1250 | 2.37
HMod Fast R-CNN | 33.34 | 455 | 0.27
HMod Fast R-CNN + CE | 33.19 | 448 | 0.10
HMod Fast R-CNN + CE + PC | 31.05 | 270 | 0.10

Table 4: Testing errors (10-view) on the MS-COCO dataset (CCC = coarse category consistency).

Method | Error
Model averaging (2 CIFAR100-NIN nets) | 36.05
CIFAR100-NIN-double | 34.24
Base: CIFAR100-NIN | 33.96
Mod Fast R-CNN (no fine tuning) | 32.66
Mod Fast R-CNN (fine tuning without CCC) | 32.09
Mod Fast R-CNN (fine tuning with CCC) | 31.87
HMod Fast R-CNN (no fine tuning) | 32.34
HMod Fast R-CNN (fine tuning without CCC) | 32.05
HMod Fast R-CNN (fine tuning with CCC) | 31.84

Some of these aspects are addressed here; the rest form the future scope of this work.

The first question is: Is training using multi-tasking helpful? Multi-task training is always useful here because there is no need to manage a pipeline of sequentially trained tasks, and it potentially improves accuracy because the tasks influence one another through the shared representation used by the CNN. The classification loss is the measure the baseline network uses during training; the multi-task loss is another useful measure used here. It is observed that multi-task training improves pure classification accuracy compared with training for classification alone.

The second question is: Is brute-force scale invariance always useful here? Brute-force scale invariance is achieved here through single-scale and multi-scale processing (using image pyramids), where the scale of an image is specified as its shortest side length. Both single-scale and multi-scale pyramids have produced good results, and in certain instances single-scale processing has shown the best tradeoff between speed and accuracy for very deep models.

The third question is: Is more training data required to verify the results? As a rule of thumb, the performance of an object detector improves when it is trained on large datasets, and the same verdict holds here: as the volume of training data increases, object detection performance grows considerably. Further heterogeneity in the training data also helps the network generalize its learned capability.

The fourth question is: Is using more object proposals always better?
Object detectors use two types of proposals, sparse sets and dense sets. Here dense proposal sets have worked well and have considerably improved HMod Fast R-CNN's object detection accuracy. As proposals play a purely computational role, increasing their number per image has produced good results.

The fifth question is: What is the optimal number of layers required in HMod Fast R-CNN to achieve the best performance? This depends on the object detection dataset on which the HMod Fast R-CNN architecture is evaluated; this aspect is non-trivial in nature and there is no rule of thumb. Here, prior experience in building network architectures for image datasets has produced good results.

Table 5: Testing errors (10-view) on the CIFAR100 dataset (CCC = coarse category consistency).

Method | Error
Model averaging (2 CIFAR100-NIN nets) | 36.05
CIFAR100-NIN-double | 34.24
Base: CIFAR100-NIN | 33.96
Mod Fast R-CNN (no fine tuning) | 33.66
Mod Fast R-CNN (fine tuning without CCC) | 33.09
Mod Fast R-CNN (fine tuning with CCC) | 31.90
HMod Fast R-CNN (no fine tuning) | 33.34
HMod Fast R-CNN (fine tuning without CCC) | 33.05
HMod Fast R-CNN (fine tuning with CCC) | 31.86

Table 6: Testing errors, memory footprint and testing time of the building block nets and HMod Fast R-CNN on the VisualQA dataset (testing mini-batch size = 100; SL = shared layers, CE = conditional execution, PC = parameter compression).

Model | Top-1 error | Mem (MB) | Time (s)
Base: Prior VisualQA | 34.03 | 186 | 0.05
Mod Fast R-CNN w/o SL | 33.96 | 736 | 2.07
HMod Fast R-CNN w/o SL | 33.87 | 1250 | 2.39
HMod Fast R-CNN | 33.72 | 455 | 0.27
HMod Fast R-CNN + CE | 33.22 | 448 | 0.10
HMod Fast R-CNN + CE + PC | 31.06 | 270 | 0.10

Table 7: Testing errors (10-view) on the VisualQA dataset (CCC = coarse category consistency).

Method | Error
d-LSTM+n-I VisualQA | 34.69
Base: Prior VisualQA | 34.03
Mod Fast R-CNN (no fine tuning) | 33.96
Mod Fast R-CNN (fine tuning without CCC) | 33.36
Mod Fast R-CNN (fine tuning with CCC) | 32.00
HMod Fast R-CNN (no fine tuning) | 33.86
HMod Fast R-CNN (fine tuning without CCC) | 33.26
HMod Fast R-CNN (fine tuning with CCC) | 31.98

Table 8: A comparative performance analysis of HMod Fast R-CNN vs other object detectors on the PASCAL VOC 2007 and 2012 datasets (2007+2012: union of VOC 2007 trainval and test and VOC 2012 trainval).

Object Detectors | Training | mAP | FPS
100 Hz DPM | 2007 | 16.0 | 100
Fast R-CNN | 2007+2012 | 70.0 | 0.5
Faster R-CNN | 2007+2012 | 70.7 | 0.5
YOLO | 2007+2012 | 72.7 | 155
YOLOv2 | 2007+2012 | 75.5 | 45
YOLOv3 | 2007+2012 | 76.5 | 31
YOLOv4 | 2007+2012 | 77.6 | 27
YOLOv5 | 2007+2012 | 78.6 | 26
Mod Fast R-CNN | 2007+2012 | 81.06 | 0.5
HMod Fast R-CNN | 2007+2012 | 87.6 | 0.3

Table 9: A comparative performance analysis of HMod Fast R-CNN vs significant state-of-the-art object detection methods (AMR: average miss rate; mAP: mean average precision).

Object Detectors | Significant Results
SAF R-CNN | AMR: 9.32
Deep Network Cascades | AMR: 31.11; FPS: 15
LOGO-Net | mAP: 69.9
SCL | mAP: 16.3
SL2 | mAP: 46.9
Local Structured HOG-LBP | mAP: 34.3
Faster R-CNN | mAP: 70.7
Fast R-CNN | mAP: 70.0
FPN | mAP: 59.1
YOLO | mAP: 72.7; FPS: 155
YOLO9000/YOLOv2 | mAP: 75.5; FPS: 45
SSD | mAP: 76.8; FPS: 22
YOLOv3 | mAP: 76.5; FPS: 31
YOLOv4 | mAP: 77.6; FPS: 27
YOLOv5 | mAP: 78.6; FPS: 26
Low-cost ISS | mAP: 99.4
ADS + Hardware Accelerators | mAP: 83.64; FPS: 30
xYOLO | mAP: 68.22; FPS: 9.66
Grape Disease Detection | mAP: 95.57
Mod Fast R-CNN | mAP: 81.06; FPS: 0.5
HMod Fast R-CNN | mAP: 87.6; FPS: 0.3

5 Conclusion

In this work we presented HMod Fast R-CNN, a hierarchical, updated version of Mod Fast R-CNN.
It improves Fast R-CNN's architecture considerably. The computational system comprises extraction layers, weighted coarse and fine component layers and a possibilistic averaging layer. The possibilistic averaging layer converts the probabilistic fine category and coarse category predictions into possibilistic measures, whose weighted average is taken as the final prediction result; these possibilistic measures effectively address the inherent uncertainty in the data. The experimental results on the MS-COCO, CIFAR100 and VisualQA datasets provide several new insights, highlighted using four variant building block nets. The proposed network's performance superiority over other object detectors on the MS-COCO, PASCAL VOC 2007 and VOC 2012 datasets is also illustrated.

Figure 8: An error analysis of HMod Fast R-CNN vs Faster R-CNN, Fast R-CNN, YOLO, YOLOv2, YOLOv3, YOLOv4 and YOLOv5 on the PASCAL VOC 2007 and 2012 datasets.

The HMod Fast R-CNN architecture can be further extended with more than five levels; from a theoretical viewpoint, this should improve object detection accuracy as well as accelerate the overall process. Future work looks towards developing HMod Fast R-CNN with more layers and verifying the results on significant image datasets.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770–778, 2016. https://doi.org/10.1109/CVPR.2016.90
[2] Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 580–587, 2014. https://doi.org/10.1109/CVPR.2014.81
[3] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick. Mask R-CNN. IEEE International Conference on Computer Vision, 2961–2969, 2017. https://doi.org/10.1109/ICCV.2017.322
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv, arXiv:1412.7062, 2014.
[5] Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. International Conference on Neural Information Processing Systems, 25:1097–1105, 2012.
[6] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv, arXiv:1312.6229, 2014.
[7] Yann LeCun, Bernhard Boser, John Denker, David Henderson, Robert Howard, William Hubbard and Lawrence Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989. https://doi.org/10.1162/neco.1989.1.4.541
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv, arXiv:1406.4729, 2014.
[9] Yukun Zhu, Raquel Urtasun, Ruslan Salakhutdinov and Sanja Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. arXiv, arXiv:1502.04275, 2015.
[10] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv, arXiv:1301.3557, 2013.
[11] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville and Yoshua Bengio.
Maxout networks. International Conference on Machine Learning, 28(3):1319–1327, 2013. https://doi.org/10.5555/3042817.3043084
[12] Jost Tobias Springenberg and Martin Riedmiller. Improving deep neural networks with probabilistic maxout units. arXiv, arXiv:1312.6116, 2013.
[13] Min Lin, Qiang Chen and Shuicheng Yan. Network in network. arXiv, arXiv:1312.4400, 2013.
[14] Anne-Marie Tousch, Stéphane Herbin and Jean-Yves Audibert. Semantic hierarchies for image annotation: A survey. Pattern Recognition, 45(1):333–345, 2012. https://doi.org/10.1016/j.patcog.2011.05.017
[15] Samy Bengio, Jason Weston and David Grangier. Label embedding trees for large multi-class tasks. International Conference on Neural Information Processing Systems, 1:163–171, 2010. https://doi.org/10.5555/2997189.2997208
[16] Tianshi Gao and Daphne Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. IEEE International Conference on Computer Vision, 2072–2079, 2011. https://doi.org/10.1109/ICCV.2011.612648
[17] Marcin Marszalek and Cordelia Schmid. Semantic hierarchies for visual object recognition. IEEE Conference on Computer Vision and Pattern Recognition, 1–7, 2007. https://doi.org/10.1109/CVPR.2007.383272
[18] Nakul Verma, Dhruv Mahajan, Sundararajan Sellamanickam and Vinod Nair. Learning hierarchical similarity metrics. IEEE Conference on Computer Vision and Pattern Recognition, 2280–2287, 2012. https://doi.org/10.1109/CVPR.2012.6247938
[19] Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Tom Griffiths and Trevor Darrell. Visual concept learning: Combining machine vision and Bayesian generalization on concept hierarchies. International Conference on Neural Information Processing Systems, 2:1842–1850, 2013. https://doi.org/10.5555/2999792.2999818
[20] Ruslan Salakhutdinov, Antonio Torralba and Josh Tenenbaum. Learning to share visual appearance for multiclass object detection. IEEE Conference on Computer Vision and Pattern Recognition, 1481–1488, 2011. https://doi.org/10.1109/CVPR.2011.5995720
[21] Gregory Griffin and Pietro Perona. Learning and using taxonomies for fast visual categorization. IEEE Conference on Computer Vision and Pattern Recognition, 1–8, 2008. https://doi.org/10.1109/CVPR.2008.4587410
[22] Marcin Marszałek and Cordelia Schmid. Constructing category hierarchies for visual recognition. Proceedings of the European Conference on Computer Vision, IV:479–491, 2008. https://doi.org/10.1007/978-3-540-88693-8_35
[23] Li-Jia Li, Chong Wang, Yongwhan Lim, David M. Blei and Li Fei-Fei. Building and using a semantivisual image hierarchy. IEEE Conference on Computer Vision and Pattern Recognition, 3336–3343, 2010. https://doi.org/10.1109/CVPR.2010.5540027
[24] Hichem Bannour and Céline Hudelot. Hierarchical image annotation using semantic hierarchies. ACM International Conference on Information and Knowledge Management, 2431–2434, 2012. https://doi.org/10.1145/2396761.2398659
[25] Jia Deng, Sanjeev Satheesh, Alexander C. Berg and Li Fei-Fei. Fast and balanced: Efficient label tree learning for large scale object recognition. International Conference on Neural Information Processing Systems, 1:567–575, 2011. https://doi.org/10.5555/2986459.2986523
[26] Josef Sivic, Bryan C. Russell, Andrew Zisserman, William T. Freeman and Alexei A. Efros. Unsupervised discovery of visual object class hierarchies. IEEE Conference on Computer Vision and Pattern Recognition, 1–8, 2008.
https://doi.org/10.1007/s11263-009-0271-8
[27] Jia Deng, Jonathan Krause, Alexander C. Berg and Li Fei-Fei. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. IEEE Conference on Computer Vision and Pattern Recognition, 3450–3457, 2012. https://doi.org/10.1109/CVPR.2012.6248086
[28] Baoyuan Liu, Fereshteh Sadeghi, Marshall Tappen, Ohad Shamir and Ce Liu. Probabilistic label trees for efficient large scale image classification. IEEE Conference on Computer Vision and Pattern Recognition, 843–850, 2013. https://doi.org/10.1109/CVPR.2013.114
[29] Nitish Srivastava and Russ Salakhutdinov. Discriminative transfer learning with tree-based priors. International Conference on Neural Information Processing Systems, 2:2094–2102, 2013.
[30] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven and Hartwig Adam. Large-scale object classification using label relation graphs. European Conference on Computer Vision, I:48–64, 2014. https://doi.org/10.1007/978-3-319-10590-1_4
[31] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng and Zheng Zhang. Error driven incremental learning in deep convolutional neural network for large-scale image classification. ACM International Conference on Multimedia, 177–186, 2014. https://doi.org/10.1145/2647868.2654926
[32] Arindam Chaudhuri. Some insights and observations on real time object detectors considering several benchmarks. Technical Report, Samsung R & D Institute Delhi, India, 2021.
[33] MS-COCO dataset: https://cocodataset.org
[34] CIFAR100 dataset: https://web.stanford.edu/~hastie/CASI_files/DATA/cifar100.html
[35] VisualQA dataset: https://visualqa.org/download.html
[36] PASCAL VOC 2007 dataset: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
[37] PASCAL VOC 2012 dataset: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
[38] Yi Sun, Ding Liang, Xiaogang Wang and Xiaoou Tang. DeepID3: Face recognition with very deep neural networks. arXiv, arXiv:1502.00873, 2015.
[39] Yi Sun, Xiaogang Wang and Xiaoou Tang. Deep learning face representation by joint identification-verification. arXiv, arXiv:1406.4773, 2014.
[40] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj and Le Song. SphereFace: Deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition, 6738–6746, 2017. https://doi.org/10.1109/CVPR.2017.713
[41] Xiaodan Liang, Shengmei Shen, Tingfa Xu, Jiashi Feng and Shuicheng Yan. Scale-aware Fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia, 20(4):985–996, 2018. https://doi.org/10.1109/TMM.2017.2759508
[42] Jan Hosang, Mohamed Omran, Rodrigo Benenson and Bernt Schiele. Taking a deeper look at pedestrians. IEEE Conference on Computer Vision and Pattern Recognition, 4073–4082, 2015. https://doi.org/10.1109/CVPR.2015.7299034
[43] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit S. Ogale and Dave Ferguson. Real-time pedestrian detection with deep network cascades. British Machine Vision Conference, 32.1–32.12, 2015. https://doi.org/10.5244/C.29.32
[44] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar and Li Fei-Fei. Large-scale video classification with convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732, 2014. https://doi.org/10.1109/CVPR.2014.223
[45] Hossein Mobahi, Ronan Collobert and Jason Weston. Deep learning from temporal coherence in video.
[46] Steven C. Hoi, Xiongwei Wu, Hantang Liu, Yue Wu, Huiqiong Wang, Hui Xue and Qiang Wu. LOGO-Net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks. arXiv, arXiv:1511.02462, 2015.
[47] Hang Su, Xiatian Zhu and Shaogang Gong. Deep learning logo detection with data expansion by synthesizing context. arXiv, arXiv:1612.09322v3, 2017.
[48] Hang Su, Shaogang Gong and Xiatian Zhu. Scalable deep learning logo detection. arXiv, arXiv:1803.11417, 2018.
[49] Andrea Vedaldi, Varun Gulshan, Manik Varma and Andrew Zisserman. Multiple kernels for object detection. IEEE International Conference on Computer Vision, 606–613, 2009. https://doi.org/10.1109/ICCV.2009.5459183
[50] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. IEEE Conference on Computer Vision and Pattern Recognition, 1–1, 2001. https://doi.org/10.1109/CVPR.2001.990517
[51] Hedi Harzallah, Frederic Jurie and Cordelia Schmid. Combining efficient object localization and image classification. IEEE International Conference on Computer Vision, 237–244, 2009. https://doi.org/10.1109/ICCV.2009.5459257
[52] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition, 886–893, 2005. https://doi.org/10.1109/CVPR.2005.177
[53] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004. https://doi.org/10.1023/B:VISI.0000013087.49260.fb
[54] David G. Lowe. Object recognition from local scale-invariant features. IEEE International Conference on Computer Vision, 2:1150–1157, 1999. https://doi.org/10.1109/ICCV.1999.790410
[55] Rainer Lienhart and Jochen Maydt. An extended set of Haar-like features for rapid object detection. IEEE International Conference on Image Processing, 1:900–903, 2002. https://doi.org/10.1109/ICIP.2002.1038171
[56] Herbert Bay, Tinne Tuytelaars and Luc Van Gool. SURF: Speeded up robust features. European Conference on Computer Vision, 404–417, 2006. https://doi.org/10.1007/11744023_32
[57] Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998. https://doi.org/10.1109/5254.708428
[58] David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999. https://doi.org/10.1613/jair.614
[59] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. International Conference on Machine Learning, 148–156, 1996. https://doi.org/10.5555/3091696.3091715
[60] Yinan Yu, Junge Zhang, Yongzhen Huang, Shuai Zhang, Weiqiang Ren, Chong Wang, Kaiqi Huang and Tieniu Tan. Object detection by context and boosted HOG-LBP. European Conference on Computer Vision, PASCAL VOC Workshop, 2010.
[61] Pedro Felzenszwalb, Ross Girshick, David McAllester and Deva Ramanan. Discriminatively trained mixtures of deformable part models. European Conference on Computer Vision, PASCAL VOC Workshop, 2008.
[62] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010. https://doi.org/10.1007/s11263-009-0275-4
[63] Pedro Felzenszwalb, Ross Girshick, David McAllester and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. https://doi.org/10.1109/TPAMI.2009.167
[64] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004. https://doi.org/10.1023/B:VISI.0000029664.99615.94
[65] Timo Ojala, Matti Pietikainen and Topi Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002. https://doi.org/10.1109/TPAMI.2002.1017623
[66] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv, arXiv:1506.01497, 2015.
[67] Kunihiko Fukushima and Sei Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. Competition and Cooperation in Neural Networks, 267–285, 1982. https://doi.org/10.1007/978-3-642-46466-9_18
[68] Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. https://doi.org/10.1109/5.726791
[69] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition, 248–255, 2009. https://doi.org/10.1109/CVPR.2009.5206848
[70] Ross Girshick. Fast R-CNN. arXiv, arXiv:1504.08083, 2015.
[71] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan and Serge Belongie. Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition, 2117–2125, 2017. https://doi.org/10.1109/CVPR.2017.106
[72] Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi. You only look once: Unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition, 779–788, 2016. https://doi.org/10.1109/CVPR.2016.91
[73] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. IEEE Conference on Computer Vision and Pattern Recognition, 6517–6525, 2017. https://doi.org/10.1109/CVPR.2017.690
[74] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu and Alexander C. Berg. SSD: Single shot multibox detector. European Conference on Computer Vision, 21–37, 2016. https://doi.org/10.1007/978-3-319-46448-0_2
[75] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv, arXiv:1804.02767v1, 2018.
[76] Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv, arXiv:2004.10934v1, 2020.
[77] YOLOv5: https://github.com/ultralytics/yolov5
[78] Zaid S. Sabri and Zhiyong Li. Low-cost intelligent surveillance system based on Fast CNN. PeerJ Computer Science, 7:e402, 2021. https://doi.org/10.7717/peerj-cs.402
[79] Vittorio Mazzia, Francesco Salvetti, Aleem Khaliq and Marcello Chiaberge. Real-time apple detection system using embedded systems with hardware accelerators: An edge AI application. IEEE Access, 8:9102–9114, 2020. https://doi.org/10.1109/ACCESS.2020.2964608
[80] Zhengyi Luo, Austin Small, Liam Dugan and Stephen Lane. Cloud Chaser: Real-time deep learning computer vision on low computing power devices. arXiv, arXiv:1810.01069v2, 2020.
[81] Daniel Barry, Munir Shah, Merei Keijsers, Humayun Khan and Banon Hopman. xYOLO: A model for real-time object detection in humanoid soccer on low-end hardware. arXiv, arXiv:1910.03159v1, 2019.
[82] Shekofa Ghoury, Cemil Sungur and Akif Durdu. Real-time disease detection of grape and grape leaves using Faster R-CNN and SSD MobileNet architectures. International Conference on Advanced Technologies, Computer Engineering and Science, Alanya, Turkey, 2019.
[83] Anil Kumar, Praneeth Chowdhary and Govinda Rao. Smart embedded device for object and text recognition through real-time video using Raspberry Pi. International Journal of Engineering and Technology, 7(4):556–562, 2019. http://dx.doi.org/10.14419/ijet.v7i4.19.27959
[84] Dumitru Erhan, Christian Szegedy, Alexander Toshev and Dragomir Anguelov. Scalable object detection using deep neural networks. arXiv, arXiv:1312.2249, 2014.
[85] Matthew Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. European Conference on Computer Vision, I:818–833, 2014. https://doi.org/10.1007/978-3-319-10590-1_53
[86] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, Computer Science Department, University of Toronto, Toronto, Canada, 2009.
[87] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. ACM International Conference on Multimedia, 675–678, 2014. https://doi.org/10.1145/2647868.2654889

Appendix

Reference | Year | Object Detector | Significant Results
Liang et al. [41] | 2018 | SAF R-CNN | AMR: 9.32
Angelova et al. [43] | 2015 | Deep Network Cascades | AMR: 31.11; FPS: 15
Hoi et al. [46] | 2015 | LOGO-Net | mAP: 69.9
Su et al. [47] | 2017 | SCL | mAP: 16.3
Su et al. [48] | 2018 | SL2 | mAP: 46.9
Yu et al. [60] | 2010 | Local Structured HOG-LBP | mAP: 34.3
Ren et al. [66] | 2015 | Faster R-CNN | mAP: 70.7
Girshick [70] | 2015 | Fast R-CNN | mAP: 70.0
Lin et al. [71] | 2017 | FPN | mAP: 59.1
Redmon et al. [72] | 2016 | YOLO | mAP: 72.7; FPS: 155
Redmon and Farhadi [73] | 2017 | YOLO9000/YOLOv2 | mAP: 75.5; FPS: 45
Liu et al. [74] | 2016 | SSD | mAP: 76.8; FPS: 22
Redmon and Farhadi [75] | 2018 | YOLOv3 | mAP: 76.5; FPS: 31
Bochkovskiy et al. [76] | 2020 | YOLOv4 | mAP: 77.6; FPS: 27
Jocher et al. [77] | 2020 | YOLOv5 | mAP: 78.6; FPS: 26
Sabri and Li [78] | 2021 | Low-cost ISS | mAP: 99.4
Mazzia et al. [79] | 2020 | ADS + Hardware Accelerators | mAP: 83.64; FPS: 30
Barry et al. [81] | 2019 | xYOLO | mAP: 68.22; FPS: 9.66
Ghoury et al. [82] | 2019 | Grape Disease Detection | mAP: 95.57

Table 10: Significant state-of-the-art research works in deep learning-based object detection (AMR: Average Miss Rate; mAP: Mean Average Precision).

Figure 7: Prediction framework through HMod Fast R-CNN.
Figure 8: Architecture of Mod Fast R-CNN.
Figure 9: Architecture of HMod Fast R-CNN.
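For readers interpreting the mAP figures in Table 10, the following minimal Python sketch shows how a VOC-style average precision (AP) is typically computed for one class; mAP is then the mean of the per-class APs. This is an illustrative sketch only, not code from the paper: the function name, inputs and the assumption that each detection has already been matched to ground truth (e.g. at IoU >= 0.5) are hypothetical.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    # VOC-style AP: area under the interpolated precision-recall curve
    # for a single object class. Each detection is assumed to be already
    # labelled as a true or false positive by ground-truth matching.
    if num_gt == 0:
        return 0.0
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by confidence, descending
    tp = np.asarray(is_true_positive, dtype=float)[order]
    recall = np.cumsum(tp) / num_gt
    precision = np.cumsum(tp) / np.arange(1, len(tp) + 1)
    # Enforce a monotonically non-increasing precision envelope.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    # Integrate precision over recall (all-points interpolation).
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Hypothetical usage: mAP is the mean AP over all classes, e.g. the
# 20 PASCAL VOC categories.
# per_class = {"dog": ([0.9, 0.7, 0.3], [1, 0, 1], 2), ...}
# mean_ap = np.mean([average_precision(*v) for v in per_class.values()])
```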