ELEKTROTEHNIŠKI VESTNIK 88(5): 227–235, 2021
ORIGINAL SCIENTIFIC PAPER

Benchmarking Crowd-Counting Techniques across Image Characteristics

Klemen Pevec, Vitomir Štruc, Klemen Grm
Univerza v Ljubljani, Fakulteta za elektrotehniko, Tržaška 25, 1000 Ljubljana, Slovenija
E-pošta: kp5459@student.uni-lj.si

Abstract. Crowd-counting is a longstanding computer-vision problem used for estimating crowd sizes for security purposes at public protests and gatherings, for collecting crowd statistics at airports, malls, concerts, conferences and other similar venues, and for monitoring people and crowds during public health crises (such as the one caused by COVID-19). Recently, the performance of automated methods for crowd-counting from single images has improved considerably, particularly due to the introduction of deep-learning techniques and large labelled training datasets. However, the robustness of these methods to varying imaging conditions, such as weather, image perspective and large variations in the crowd size, has not been studied in depth in the open literature. To address this gap, a systematic study of the robustness of four recently developed crowd-counting methods is performed in this paper, evaluating their performance in variable (real-life) imaging scenarios that include different event types, weather conditions, image sources and crowd sizes. It is shown that the performance of the tested techniques degrades in unclear weather conditions (i.e., fog, rain, snow) and on images taken from large distances by drones. In contrast, in clear weather conditions, crowd-counting methods can provide accurate and usable results.

Keywords: crowd-counting, machine learning, biometrics

Evaluation of crowd-counting methods under different image characteristics

Counting people in crowds is important in various application areas, such as ensuring safety at protests and other large public events, or collecting crowd statistics in large spaces such as airports, shopping centres and conference centres. Recently, the use of deep-learning methods and large annotated training datasets has led to rapid progress in the development of methods for automatically counting people in crowds from a single image. Despite this progress, the robustness of such methods on images captured in poor weather conditions, from different perspectives and with large variability in the number of people remains an open problem. In this paper, we therefore conduct a systematic study of several recently proposed crowd-counting methods and evaluate their performance in variable, realistic scenarios that include different event types, weather conditions, image sources and crowd sizes. Our main findings are that the performance of the evaluated methods drops considerably in poor weather conditions (i.e., in fog, rain and snow) and that they also perform worse on images taken from larger distances using drones. We also show that, under suitable conditions, automatic crowd-counting methods can operate accurately and reliably.

1 INTRODUCTION

Crowd-counting is a well-known computer-vision problem, where the goal is to estimate the number of people in a crowded scene.
Received 28 April 2021; Accepted 14 October 2021

Figure 1: Comprehensive performance evaluation of existing crowd-counting techniques applied to images captured by onboard drone cameras. Four recent deep-learning models (VGG-19, CANNet, CMTL, MCNN) and two publicly available evaluation datasets, i.e., JHU-CROWD++ and VisDrone ECCV2020, are used, and the MAE and MSE metrics are reported overall, per weather condition, per scene type and per crowd count. Drone image courtesy of Don Ramey Logan [17].

In recent years, the demand for methods that can quickly estimate crowd counts from visual information has been increasing, due to important applications in crowd monitoring and the prevention of dangerous situations, such as trampling, suffocation or violence at public gatherings. Crowd-counting is also used for statistical and analytical purposes at airports, malls, stadiums, rallies, concerts and other public places and events, and for crowd monitoring during public health crises [27].

Crowd-counting is a complex task due to the presence of various sources of image variability, such as large scale variations [33], overlaps, occlusions and perspective effects [2], [18], which can heavily alter the human shape and appearance. Perspective effects are especially significant in surveillance-camera footage, where the angle between the camera and the crowd plane is small and causes the shape and scale of people to change with the image location. While these challenges have been addressed to a certain degree with the use of deep learning and convolutional neural networks, low-light and reduced-visibility conditions, caused by weather such as fog or snow, still adversely affect the performance of the existing crowd-counting techniques [29].

In terms of deployment, the use of drones is becoming a viable option for crowd-counting and surveillance applications [19], [31], [13], [38]. Cameras mounted on drones and other types of micro aerial vehicles (MAVs) can capture images in a bird's-eye view that do not suffer from the perspective issues discussed above. Because of their portability, drones can also be deployed quickly in cases where unexpected crowds emerge and need to be monitored, e.g., at public protests. While the interest in crowd-counting technology for drones and MAVs is growing and a significant research effort is being directed towards this area, the characteristics of the contemporary deep-learning techniques used in this setting are still not well understood. It is not immediately clear how well these methods perform on images captured in a bird's-eye view, where significant appearance and scale changes can be expected with respect to the humans in the crowd. Moreover, the impact of adverse weather conditions and scene types is also underexplored. For real-life deployment, it is important to have an in-depth understanding of these and related characteristics, as well as of the limitations of the existing crowd-counting solutions. To the best of our knowledge, no comprehensive study on this topic is available in the literature.
This paper aims to address this gap and presents the results of a performance evaluation of four recent (deep-learning) crowd-counting models on two diverse crowd datasets. An overview of the evaluation is presented in Figure 1. The study analyzes: (i) the overall performance of the considered crowd-counting models, (ii) the impact of imaging conditions, and (iii) the effect of the crowd density on performance. The results show that weather conditions can severely affect the accuracy of crowd-counting methods, with snowy conditions causing the largest performance degradation among the considered weather conditions. Furthermore, crowd-counting performance in different scenarios depends strongly on the technique used, since different models were trained on crowds of different scales. The main contributions of the paper are:
• A comprehensive evaluation of four state-of-the-art crowd-counting models on crowd footage captured in a bird's-eye view using a drone camera.
• A performance study of the four methods in challenging weather conditions (fog, rain and snow) and a discussion of the reasons for the observed performance differences.
• An analysis of the impact of different scene types and crowd counts on the methods' performance and an identification of their weak points.

The rest of the paper is structured as follows. Section 2 reviews existing crowd-counting methods and current state-of-the-art solutions. Section 3 presents the methodology used to evaluate the selected models. Section 4 provides experimental results and discusses the main findings. Section 5 draws conclusions and proposes directions for future work.

2 RELATED WORK

Crowd-counting technology has a rich history, but has advanced greatly in recent years with the introduction of deep neural networks. This section gives a brief overview of the field. First, the traditional methods that dominated the field for years are discussed, followed by an overview of the more recent deep-learning methods.

Traditional methods. Early crowd-counting methods mainly rely on object detection followed by counting. Lin et al. [25] utilize the Haar wavelet transform to detect head-like contours and a support vector machine (SVM) to determine whether a detected contour is actually a head. Similarly, human-shape models are used in [30], [9], [14] to detect people in image sequences with a subtracted background. Dalal and Triggs introduce Histograms of Oriented Gradients (HOG) for image representation and combine the computed descriptors with an SVM model for classification [7]. Idrees et al. [10] propose a method utilizing HOG features, Fourier analysis and SIFT descriptors for counting dense crowds. The authors of [3], [21] provide solutions that track image features across video frames, cluster the features, and then count the generated clusters.

A notable group of techniques relies on direct-count regression and aims to map image features (or their segments) directly to people counts. Examples of such techniques are given in [5], [24]; they use hand-crafted features computed from image segments as the basis for regression. Kong et al. [12] utilize a similar framework and propose a feature-normalization method to deal with camera-orientation and perspective challenges. Chen et al. [6] present a model that learns the importance of the low-level features used for direct-count regression.
An alternative to direct-count regression is the estimation of crowd-density maps, followed by integration over the densities to obtain crowd counts. Models of this type are usually trained on dot-annotated images that are converted to density functions using kernel-density estimation with Gaussian kernels [1]. Many learning methods of this kind have been presented in the literature, including methods based on regression forests [8], random forests [20] and others. Due to the increasing number of annotated datasets for crowd-counting, data-driven direct approaches have also received interest from the computer-vision community. Early methods from this group aim to learn motion patterns associated with crowd counts [23].

Deep-learning methods. With the advent of CNNs, many new approaches have been proposed in the literature. One of the first crowd-counting CNNs is presented by Zhang et al. [35]. Several single-image-to-density-map network architectures trained on dot-annotated images are proposed in [2], [37], [33], [26]. Ma et al. [18] propose a Bayesian loss function for learning crowd-counting with point supervision instead of the commonly used Gaussian kernels. To deal with large scale variations in crowd images, [4], [34], [11] employ encoder-decoder networks, where the encoder extracts multi-scale features that are then fed to the decoder to generate high-resolution density maps. To improve this basic setup, Liu et al. [15] propose the use of a perspective map to encode the local scale in the feature maps. Xiong et al. [32] introduce a method for image sequences that uses a perspective map and temporal information in video sequences to improve the count reliability.

As deep-learning models require a considerable amount of training data, many approaches aim to reduce the amount of data needed by adopting data-augmentation techniques that rely on image scaling and cropping [2], [34], [36], [33], flipping [4], [18], [33], noise addition [26] or sub-sampling [16]. Some hybrid training approaches have also been developed. Zhang et al. propose a hybrid training scheme that alternates between the crowd-count and density-estimation tasks during learning [35]. Zhang et al. [36] adopt a multi-task strategy to optimize both the crowd count and the density map. Ranjan et al. [22] introduce a two-branch CNN architecture, where the first branch generates a low-resolution density map and the second branch incorporates the low-resolution prediction and feature maps from the first branch to generate a high-resolution density map.

While the above deep-learning solutions outperform earlier techniques, they are still sensitive to certain data characteristics that adversely affect crowd-counting performance. This particularly applies to imagery captured by drones, where their behaviour is not yet completely understood. Following the above, this paper evaluates the performance of four advanced crowd-counting models on images captured in different weather conditions and scenarios and assesses their suitability for drone footage.

3 METHODOLOGY

The models and datasets selected for the performance evaluation, as well as the performance measures used to report the results, are described below.

3.1 Crowd Counting Models

In our study, four recently developed crowd-counting models are evaluated. They are selected because of their state-of-the-art performance and to facilitate reproducibility, as all models are publicly available.
Details on the selected models are given below.

• VGG-19. The first model, proposed by Ma et al. [18], is based on the standard VGG-19 image-classification network (pretrained on ImageNet), with the last pooling layer and the fully connected layers removed. The output of the VGG-19 backbone is upsampled by bilinear interpolation and fed into a regression head consisting of two 3×3 convolutional layers and a 1×1 convolutional layer that outputs a density map. To produce a crowd-count estimate for a given input image, the density map is integrated to convert it to a count; all the evaluated models produce their count estimates in the same way. The model is trained on dot-annotated images, where the dots serve as priors for the probability function used in the Bayesian learning objective.

• CANNet. The second model is the Context-Aware crowd-counting Neural Network (CANNet) introduced by Liu et al. in [15]. It combines features generated with multiple receptive-field sizes and learns the importance of the computed features for each image location. It thus encodes images in terms of both scale and contextual information and, as a result, ensures robustness to perspective distortions. To train the model, the ground-truth dot annotations are converted into a density map using Gaussian kernels. The Euclidean distance between the estimated and the ground-truth density maps is used as the loss function.

• CMTL. The third model selected for the study is the Cascaded Multi-Task Learning (CMTL) network proposed by Sindagi et al. [26]. It uses a multi-task learning approach to jointly learn crowd-count classification and density-map estimation. The model implements a two-stage approach. In the first stage, CMTL estimates the crowd count by classifying the image into one of ten possible count classes. In the second stage, the estimated count is used to estimate the crowd-density map. Similarly to the CANNet model, Gaussian kernels are used to generate the ground-truth density maps, and the Euclidean distance is used as the loss function.

• MCNN. The fourth model is the Multi-column Convolutional Neural Network (MCNN), proposed by Zhang et al. in [37]. It consists of three processing branches with a similar network topology but differently sized filters, whose different receptive fields detect heads of different scales in the crowd. To generate the final density map, the outputs of the branches are merged in the final part of the model. The loss function is again the Euclidean distance between the estimated and the ground-truth density maps. Dot annotations are converted to ground-truth density maps using Gaussian kernels with a spread that depends on the average distance to the neighbouring annotations.
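To make the density-map pipeline shared by these models concrete, the sketch below shows how dot annotations can be turned into a ground-truth density map with Gaussian kernels and how a count is recovered by integrating (summing) a density map. This is a minimal illustration of the general technique, not the exact code of any of the evaluated models; in particular, the fixed kernel width sigma is a simplifying assumption (MCNN, for instance, adapts it to the distance between neighbouring annotations).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dots_to_density(dot_coords, height, width, sigma=4.0):
    """Convert head (row, col) dot annotations to a density map.

    Each dot is placed as a unit impulse and blurred with a Gaussian
    kernel, so the map integrates to (approximately) the person count.
    A fixed sigma is assumed here for simplicity.
    """
    impulses = np.zeros((height, width), dtype=np.float64)
    for r, c in dot_coords:
        if 0 <= r < height and 0 <= c < width:
            impulses[r, c] += 1.0
    return gaussian_filter(impulses, sigma=sigma)

def density_to_count(density_map):
    """Estimate the crowd count by integrating the density map."""
    return float(density_map.sum())

# Toy usage: three annotated heads yield a count close to 3.
dots = [(10, 12), (30, 40), (31, 42)]
density = dots_to_density(dots, height=64, width=64)
print(round(density_to_count(density), 2))  # ~3.0, up to border effects
```

The same integration step turns any of the models' predicted density maps into the count estimates evaluated in Section 4.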
3.2 Datasets

Each evaluated model is trained on 300 images from the training part of the Shanghaitech part A dataset [37]. The images in this dataset are crawled from the internet. The crowd count on most of the images is between 250 and 500 people, with some images containing up to 3000 annotated people.

To evaluate the performance of the selected crowd-counting methods, the following two datasets are used:

• JHU-CROWD++. The first dataset is the JHU-CROWD++ dataset [28], [29], which consists of 1600 images; only the test part is used. Because of GPU limitations, images wider than 1920 pixels are excluded, leaving 1303 images for the final evaluation. All images come with weather-condition annotations (see Figure 2 for some examples) and scene-type annotations, which allows for an in-depth analysis of the methods in different scenarios. The average crowd count is 228, with the highest count of 8994 people in a single image. Figure 3 shows the distribution of the crowd count across the JHU-CROWD++ dataset.

Figure 2: Example images from the JHU-CROWD++ dataset, including foggy conditions, snow and rain.

Figure 3: Distribution of the crowd count across the images in the JHU-CROWD++ dataset.

• VisDrone. The second dataset used for the evaluation is the VisDrone ECCV2020 Challenge DroneCrowd dataset [38]. It contains 112 image sequences taken from a drone camera in a bird's-eye view. Because the focus of the analysis is on single-image crowd-counting techniques, only the first image of each sequence is considered in the experiments. Due to the height at which the images are taken, the people in the images are very small (Figure 4). As only the training part of the dataset is annotated, this part is used for testing the crowd-counting models; note that pretrained models are used for the analysis. The average crowd count on this dataset is 146.12 people, with the highest count of 417 people in a single image. Figure 5 shows the distribution of the crowd count across the images in this dataset.

Figure 4: Example images from the VisDrone ECCV2020 dataset. People are imaged from considerable heights, which makes reliable estimation of the crowd count challenging.

Figure 5: Distribution of the crowd count across the images in the VisDrone ECCV2020 dataset.

3.3 Performance Measures

Two widely used performance measures are utilized to evaluate the four crowd-counting models: the Mean Absolute Error (MAE) and the Mean Squared Error (MSE):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - y'_i \rvert, \qquad (1)$$

and

$$\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lvert y_i - y'_i \rvert^{2}}, \qquad (2)$$

where $N$ is the number of images used in the evaluation, $y_i$ is the ground-truth crowd count of the $i$-th image, and $y'_i$ is the crowd count of the $i$-th image predicted by the evaluated model. Both the MAE and MSE scores yield smaller values for better estimates. The MAE score is more intuitive to interpret, while the MSE gives more weight to larger errors; note that, as defined in Eq. (2), the reported MSE is in fact the root of the mean squared error.

The total MAE and MSE scores are determined for each dataset. For the JHU-CROWD++ dataset, the MAE and MSE values for the various weather conditions and scene types are also reported, and the performance of the models with respect to the number of people in the images is studied.

3.4 Implementation Details

For the VGG-19 and CANNet models, RGB images are used in the experiments. The input images $x$ are normalized as

$$x' = \frac{x - \mu}{\sigma}, \qquad (3)$$

where $\mu = [0.485, 0.456, 0.406]^{T}$ and $\sigma = [0.229, 0.224, 0.225]^{T}$ are applied per colour channel, and $x'$ is the normalized image. The CMTL and MCNN models use gray-scale images and require no normalization.
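The sketch below illustrates the two performance measures from Eqs. (1)–(2) and the input normalization from Eq. (3). It is a minimal reference implementation written for this article; the function names are illustrative and do not come from the evaluated models' code.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute count error over per-image counts, Eq. (1)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """Root of the mean squared count error, Eq. (2)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def normalize_rgb(image):
    """Per-channel normalization of an HxWx3 RGB image in [0, 1],
    Eq. (3), using the ImageNet statistics from Section 3.4."""
    mu = np.array([0.485, 0.456, 0.406])
    sigma = np.array([0.229, 0.224, 0.225])
    return (image - mu) / sigma  # broadcasts over the channel axis

# Toy usage with three images' ground-truth and predicted counts.
gt, pred = [120, 45, 300], [110, 50, 260]
print(f"MAE = {mae(gt, pred):.2f}, MSE = {mse(gt, pred):.2f}")
# MAE = 18.33, MSE = 23.98
```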
4 EXPERIMENTAL RESULTS

The evaluation is performed on a GeForce GTX 960 graphics card with 4 GB of video RAM. Evaluating all models on both datasets takes about an hour.

Overall results. Table 1 shows the overall MAE and MSE scores for both datasets. On the JHU-CROWD++ dataset, CANNet performs best, while on the VisDrone dataset, VGG-19 is also competitive and performs comparably to CANNet. Overall, CANNet is the top performer; owing to its adaptive design, it accounts for the different scales and density situations present in the JHU-CROWD++ dataset, which contains very diverse scene types.

Table 1: Overall MAE and MSE scores for the evaluated models obtained on the two experimental datasets.

Dataset    JHU-CROWD++     VisDrone
Model      MAE     MSE     MAE     MSE
VGG-19     80.8    291.3   76.7    105.7
CANNet     69.3    283.3   78.2    105.8
CMTL       130.6   365.2   109.6   172.6
MCNN       111.3   326.3   86.6    120.2

The overall results are better on the VisDrone dataset, in which there are little-to-no perspective changes in the images, the crowd scale is constant, the crowd density is smaller, and the images are taken in weather conditions suitable for drone operation (no fog, rain or snow). These results point to the feasibility of drone-based crowd-counting applications and the suitability of drone footage for crowd-counting.

Impact of the crowd count. Figures 6 and 7 show the relation between the absolute count error and the ground-truth crowd count of the images.

Figure 6: Crowd-count error in relation to the number of people in the images from the JHU-CROWD++ dataset. The results are filtered with a Gaussian filter with σ = 30 to eliminate outliers due to image specifics.

Figure 7: Crowd-count error in relation to the number of people in the images from the VisDrone ECCV2020 dataset. The results are filtered with a Gaussian filter with σ = 10 to eliminate outliers due to image specifics.

On the JHU-CROWD++ dataset, each of the four methods shows a general trend of the error increasing with the ground-truth crowd count. CANNet performs best at all crowd counts. For smaller crowd counts (below 500 people), VGG-19 outperforms MCNN, but falls behind MCNN on larger crowds. CMTL performs the worst among the tested models at any crowd count.

The VisDrone dataset shows a different trend than the JHU-CROWD++ dataset. Here, the absolute error increases with the crowd size for VGG-19 and CANNet, and decreases for CMTL and MCNN. The model comparison shows that VGG-19 performs best for smaller crowd counts (comparably to CANNet), and that for larger crowd counts CMTL and MCNN provide better results than VGG-19 and CANNet. However, for crowd counts over 200, only a small number of images is available for the evaluation (Figure 5), so the scores may be due to specific images that happen to suit some crowd-counting approaches, and not necessarily due to conceptual differences between the models.
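Curves such as those in Figures 6 and 7 can be produced by sorting the per-image absolute errors by the ground-truth count and smoothing them with a one-dimensional Gaussian filter. The sketch below shows one way to do this; it is an assumed reconstruction of the plotting procedure, not code from the study, and the variable names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smoothed_error_curve(counts_gt, counts_pred, sigma=30.0):
    """Absolute per-image count errors, ordered by ground-truth count
    and smoothed with a 1-D Gaussian filter (sigma = 30 for the
    JHU-CROWD++ curves and sigma = 10 for the VisDrone curves in
    Figures 6 and 7)."""
    counts_gt = np.asarray(counts_gt, float)
    errors = np.abs(counts_gt - np.asarray(counts_pred, float))
    order = np.argsort(counts_gt)
    return counts_gt[order], gaussian_filter1d(errors[order], sigma=sigma)
```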
Figures 8 and 9 show the crowd-density and count predictions for two images representing cases where the tested methods produce small and big count-estimation errors on the JHU-CROWD++ and VisDrone ECCV2020 datasets, respectively.

Figure 8: Crowd-density and count estimates for two images where the evaluated models produce small (Case 1: yGT = 1877; y′VGG-19 = 1975, y′CANNet = 1875, y′CMTL = 2127, y′MCNN = 1787) and big (Case 2: yGT = 8994; y′VGG-19 = 218, y′CANNet = 133, y′CMTL = 46, y′MCNN = 53) count-estimation errors on the JHU-CROWD++ dataset. The bad estimate in Case 2 is attributed to the huge crowd density, the fog, the small scale of most of the people, and the scale differences present. yGT is the ground-truth count and y′X is the count prediction of model X.

Figure 9: Crowd-density and count estimates for two images where the methods produce small (Case 1: yGT = 46; y′VGG-19 = 55, y′CANNet = 59, y′CMTL = 36, y′MCNN = 44) and big (Case 2: yGT = 304; y′VGG-19 = 51, y′CANNet = 57, y′CMTL = 50, y′MCNN = 65) count-estimation errors on the VisDrone dataset. The methods perform much worse when the images are taken from bigger heights, as the crowd scale becomes too small for a reliable count estimation. yGT is the ground-truth count and y′X is the count prediction of model X.

On the JHU-CROWD++ dataset, good results are obtained for crowds where the appearance of people is homogeneous, while bigger errors occur on images showing people at very different scales. On the VisDrone dataset, the performance appears to be related to the height at which the images are taken: for images where people appear at a reasonable scale, the prediction errors are small, while for images with very small people, the count error is commonly larger.

Impact of the weather conditions. The performance of the four crowd-counting models is analysed in different weather conditions using the JHU-CROWD++ dataset. Table 2 shows the MAE and MSE scores for the weather conditions annotated in the dataset.

Table 2: MAE and MSE scores for the evaluated models with respect to the weather conditions in the JHU-CROWD++ dataset.

Weather    Neutral         Fog             Rain            Snow
Model      MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE
VGG-19     71.0    144.1   85.6    246.0   110.1   305.0   215.1   1059.0
CANNet     58.1    121.3   80.4    197.8   120.9   335.2   206.2   1067.0
CMTL       109.6   216.5   131.5   314.2   228.5   415.3   403.4   1231.0
MCNN       92.4    181.7   139.7   105.7   205.8   376.9   332.5   1115.0

Again, CANNet performs best in all weather conditions except rain, where the VGG-19 model performs slightly better. This degradation is likely due to the scale-estimation mechanism used in CANNet, where image artifacts caused by rain might be misinterpreted as a high-density crowd, thus reducing the accuracy of the crowd-counting task. A comparison of the results for images taken in different weather conditions shows that the performance of the models degrades rapidly when fog, rain or snow is present in the images. This deterioration is presumably due to the reduced visibility in such weather and to the image artifacts caused by snowflakes and raindrops, which greatly reduce the accuracy of the evaluated models.

Impact of the scene types. Experiments are also conducted to evaluate the performance of the four crowd-counting models for some of the most common scene types present in the JHU-CROWD++ dataset. The results for this part of the analysis are shown in Table 3.

Table 3: MAE and MSE scores for the evaluated models for some of the more common scene types present in the JHU-CROWD++ dataset.

Scene      Stadium         Street          Protest         Airport         Conference      Rally
Model      MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE
VGG-19     168.6   629.3   50.7    84.0    64.8    101.8   71.9    111.2   33.7    63.0    77.5    156.3
CANNet     150.6   620.2   43.4    79.3    53.1    94.5    55.4    81.2    34.3    67.9    60.8    108.9
CMTL       260.4   735.2   93.2    166.4   92.1    173.1   57.3    72.2    38.7    70.1    119.5   183.1
MCNN       215.9   666.7   83.2    149.4   71.4    108.5   44.2    58.7    32.7    53.8    84.02   125.5

CANNet performs best for most of the scene types, except for airports and conferences, for which the MCNN model is the top performer. The strong performance of MCNN on images of airports and conferences is likely due to one of the network branches being trained very closely to the particular people scale that appears in such scenes. For the airport scene type, the VGG-19 model performs worst, despite its competitive overall results (see Table 1). A comparison of the CANNet results with its average performance over the whole JHU-CROWD++ dataset (see Table 1) shows that it performs better than average on images of streets, protests, airports, conferences and rallies. In all these scenarios the crowd density is generally high, so its advantage over the other models is presumably due to the mechanism it uses to account for crowds and people of different scales.
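Per-condition scores such as those in Tables 2 and 3 can be obtained by grouping the per-image results by their annotation label before computing the metrics. A minimal sketch of this grouping is shown below; the helper functions restate Eqs. (1)–(2) for self-containment, and the data structures are illustrative rather than the actual JHU-CROWD++ annotation format.

```python
from collections import defaultdict
import numpy as np

def _mae(gt, pred):
    """Mean absolute count error, as in Eq. (1)."""
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    return float(np.mean(np.abs(gt - pred)))

def _mse(gt, pred):
    """Root of the mean squared count error, as in Eq. (2)."""
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((gt - pred) ** 2)))

def per_category_scores(records):
    """Group (label, ground_truth, prediction) records by label and
    compute per-category MAE/MSE, as reported in Tables 2 and 3."""
    groups = defaultdict(lambda: ([], []))
    for label, gt, pred in records:
        groups[label][0].append(gt)
        groups[label][1].append(pred)
    return {label: (_mae(g, p), _mse(g, p))
            for label, (g, p) in groups.items()}

# Toy usage with weather-condition labels.
records = [("fog", 100, 80), ("fog", 50, 65), ("snow", 200, 120)]
print(per_category_scores(records))
# {'fog': (17.5, 17.67...), 'snow': (80.0, 80.0)}
```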
5 CONCLUSION

The performance of four advanced crowd-counting models was analyzed in this paper using two crowd-image datasets. One of the datasets was used to evaluate the impact of weather conditions and scene type on the crowd-counting accuracy, and the other to evaluate the suitability of the models for crowd-counting from aerial drone footage.

The results obtained on the JHU-CROWD++ dataset showed that the model performance in different scenarios highly depends on the training data used for a specific model, since different event types typically result in crowd images at different scales, and the evaluated models do not deal well with crowd-scale variability. None of the evaluated crowd-counting models outperformed the others across all event types. It was shown that (i) the current state-of-the-art deep-learning models have not yet solved the issues related to low-light scenarios and images taken in weather conditions such as fog, rain and snow, and (ii) the crowd-count error grows with the crowd size. Our future research will aim to address these two issues.

Using the VisDrone ECCV2020 dataset, the study also analyzed the performance of the four selected crowd-counting models on aerial drone images. The models' performance was found to be comparable to that observed on the JHU-CROWD++ dataset. However, performance degradations were observed on crowd images taken from higher altitudes, which indicates that the studied methods are ineffective at detecting people at very small scales. Moreover, false-positive crowd detections are numerous, especially on tree tops and roof tops. This could to some extent be mitigated by including more distractors of this type in the training samples. Considering the increasing demand for crowd surveillance and the rapidly developing drone technology, such limitations are an important challenge that calls for further research and development.

ACKNOWLEDGEMENTS

The research was partly supported by the Slovenian Research Agency (ARRS) through the ARRS Research Programme P2-0250 (B), "Metrology and biometric systems", and made possible by the funding for the research programme in the scope of the COVID-19 call. The work was also partially supported by the Croatian Science Foundation under the project IP-2018-01-7619, "Knowledge-based Approach to Crowd Analysis in Video Surveillance" (KACAVIS).

REFERENCES

[1] C. Arteta, V. Lempitsky, J. A. Noble, and A. Zisserman. Interactive Object Counting. In European Conference on Computer Vision, pages 504–518. Springer, 2014.
[2] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. CrowdNet: A Deep Convolutional Network for Dense Crowd Counting. pages 640–644, 2016.
[3] G. J. Brostow and R. Cipolla. Unsupervised Bayesian Detection of Independent Motion in Crowds. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 594–601, 2006.
[4] X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting.
In 2018 European Conference on Computer Vision (ECCV), pages 734–750, 2018.
[5] A. B. Chan, Zhang-Sheng John Liang, and N. Vasconcelos. Privacy Preserving Crowd Monitoring: Counting People Without People Models or Tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7, 2008.
[6] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature Mining for Localised Crowd Counting. In British Machine Vision Conference (BMVC 2012), 2012.
[7] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 886–893, 2005.
[8] L. Fiaschi, U. Koethe, R. Nair, and F. A. Hamprecht. Learning to Count with Regression Forest and Structured Labels. In 2012 International Conference on Pattern Recognition (ICPR 2012), pages 2685–2688, 2012.
[9] W. Ge and R. T. Collins. Marked Point Processes for Crowd Counting. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2913–2920, 2009.
[10] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source Multi-scale Counting in Extremely Dense Crowd Images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2547–2554, 2013.
[11] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao. Crowd Counting and Density Estimation by Trellis Encoder-Decoder Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[12] D. Kong, D. Gray, and Hai Tao. A Viewpoint Invariant Approach for Crowd Counting. In 18th International Conference on Pattern Recognition (ICPR'06), volume 3, pages 1187–1190, 2006.
[13] M. Küchhold, M. Simon, V. Eiselein, and T. Sikora. Scale-Adaptive Real-Time Crowd Detection and Counting for Drone Images. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 943–947, 2018.
[14] B. Leibe, E. Seemann, and B. Schiele. Pedestrian Detection in Crowded Scenes. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 878–885, 2005.
[15] W. Liu, M. Salzmann, and P. Fua. Context-Aware Crowd Counting. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[16] X. Liu, J. van de Weijer, and A. D. Bagdanov. Leveraging Unlabeled Data for Crowd Counting by Learning to Rank. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[17] D. R. Logan. Wikimedia Commons: DJI inspire 1 pro, 2007. License: Creative Commons Attribution 4.0.
[18] Z. Ma, X. Wei, X. Hong, and Y. Gong. Bayesian Loss for Crowd Count Estimation with Point Supervision. In 2019 IEEE International Conference on Computer Vision, pages 6142–6151, 2019.
[19] T. Peng, Q. Li, and P. Zhu. RGB-T Crowd Counting from Drone: A Benchmark and MMCCN Network. In 2020 Asian Conference on Computer Vision (ACCV), November 2020.
[20] V. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada. COUNT Forest: CO-Voting Uncertain Number of Targets Using Random Forest for Crowd Density Estimation. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3253–3261, 2015.
[21] V. Rabaud and S. Belongie. Counting Crowded Moving Objects. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 705–711, 2006.
[22] V. Ranjan, H. Le, and M. Hoai. Iterative Crowd Counting. In 2018 European Conference on Computer Vision (ECCV), September 2018.
[23] M. Rodriguez, J. Sivic, I. Laptev, and J. Audibert. Data-Driven Crowd Analysis in Videos. In 2011 International Conference on Computer Vision, pages 1235–1242, 2011.
[24] D. Ryan, S. Denman, C. Fookes, and S. Sridharan. Crowd Counting Using Multiple Local Features. In 2009 Digital Image Computing: Techniques and Applications, pages 81–88, 2009.
[25] Sheng-Fuu Lin, Jaw-Yeh Chen, and Hung-Xin Chao. Estimation of Number of People in Crowded Scenes Using Perspective Transformation. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 31(6):645–654, 2001.
[26] V. A. Sindagi and V. M. Patel. CNN-based Cascaded Multi-task Learning of High-level Prior and Density Estimation for Crowd Counting. In 2017 IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6, 2017.
[27] V. A. Sindagi and V. M. Patel. A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation. Pattern Recognition Letters, 107:3–16, 2018.
[28] V. A. Sindagi, R. Yasarla, and V. M. Patel. Pushing the Frontiers of Unconstrained Crowd Counting: New Dataset and Benchmark Method. In 2019 IEEE International Conference on Computer Vision, pages 1221–1231, 2019.
[29] V. A. Sindagi, R. Yasarla, and V. M. Patel. JHU-CROWD++: Large-Scale Crowd Counting Dataset and Benchmark Method. Technical Report, 2020.
[30] Tao Zhao and R. Nevatia. Bayesian Human Segmentation in Crowded Situations. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages II-459, 2003.
[31] L. Wen, D. Du, P. Zhu, Q. Hu, Q. Wang, L. Bo, and S. Lyu. Drone-Based Joint Density Map Estimation, Localization and Tracking with Space-Time Multi-Scale Attention Network, 2019.
[32] F. Xiong, X. Shi, and D. Yeung. Spatiotemporal Modeling for Crowd Counting in Videos. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5161–5169, 2017.
[33] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang. Multi-Scale Convolutional Neural Networks for Crowd Counting. In 2017 IEEE International Conference on Image Processing (ICIP), pages 465–469, 2017.
[34] A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao. Attentional Neural Fields for Crowd Counting. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5713–5722, 2019.
[35] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene Crowd Counting via Deep Convolutional Neural Networks. In 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–841. IEEE Computer Society, 2015.
[36] L. Zhang, M. Shi, and Q. Chen. Crowd Counting via Scale-Adaptive Convolutional Neural Network. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1113–1121, 2018.
[37] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 589–597, 2016.
[38] P. Zhu, L. Wen, D. Du, X. Bian, Q. Hu, and H. Ling. Vision Meets Drones: Past, Present and Future, 2020.

Klemen Pevec is an MSc student at the Faculty of Electrical Engineering, University of Ljubljana, Slovenia. His research work focuses on unmanned aerial vehicles, computer vision and embedded systems. In the past, he worked as an engineer at a robotics company focused on autonomous guided vehicles. He currently works as an embedded-systems engineer for a Swedish watercraft company.
Vitomir Štruc is an Associate Professor at the Faculty of Electrical Engineering, University of Ljubljana, Slovenia. His research interests are in biometrics, computer vision, image processing, pattern recognition and machine learning. He is a Senior Area Editor for the IEEE Transactions on Information Forensics and Security, a Subject Editor for Elsevier's Signal Processing, and an Associate Editor for Pattern Recognition and IET Biometrics.

Klemen Grm received his PhD degree in 2020 from the Faculty of Electrical Engineering, University of Ljubljana, Slovenia. He is currently working as a research assistant at the Laboratory for Machine Intelligence at the same faculty. His research interests are in image processing, biometrics and machine learning.