GEODETSKI VESTNIK | letn. / Vol. 65 | št. / No. 4 | 2021 | RECENZIRANI ČLANKI | PEER-REVIEWED ARTICLES

BUILDING DETECTION WITH CONVOLUTIONAL NETWORKS TRAINED WITH TRANSFER LEARNING
ZAZNAVANJE STAVB Z UPORABO NEVRONSKIH MREŽ, UČENIH S PRENOSOM ZNANJA

Simon Šanca, Krištof Oštir, Alen Mangafić

DOI: 10.15292/geodetski-vestnik.2021.04.559-593
SCIENTIFIC ARTICLE | UDK: 681.511.4:004.8 | Klasifikacija prispevka po COBISS.SI: 1.01
Received / Prispelo: 3. 8. 2021 | Accepted / Sprejeto: 3. 11. 2021

ABSTRACT

Building footprint detection based on orthophotos can be used to update the building cadastre. In recent years, deep learning methods using convolutional neural networks have been increasingly used around the world. We present an example of automatic building classification using our own datasets made of colour near-infrared orthophotos (NIR-R-G) and colour orthophotos (R-G-B). Building detection using pre-trained weights from two large-scale datasets, Microsoft Common Objects in Context (MS COCO) and ImageNet, was performed and tested. We applied the Mask Region Convolutional Neural Network (Mask R-CNN) to detect the building footprints. The purpose of our research is to assess the applicability of pre-trained neural networks on data from another colour space to build a classification model without re-training.

KEY WORDS: deep learning, convolutional neural networks, classification of buildings, Mask R-CNN, object detection, object segmentation, automatic classification

1 INTRODUCTION

Image classification aims to recognise and name geographical objects and phenomena on the Earth's surface (Veljanovski et al., 2011). Classification extracts the essential classes (objects) of interest from an image, e.g., roads, forests, crops, water areas, ships, aircraft or buildings.
Automatic recognition and classification of buildings from aerial or high-resolution satellite imagery is an important research topic in photogrammetry and remote sensing (Zhu et al., 2017). Rapid advances in computer vision and deep learning with deep convolutional neural networks, together with free access to remote sensing data, are enabling the development of new methods for automatic building classification. Many online tasks and competitions use deep learning methods to classify buildings automatically from satellite or aerial imagery; examples include the DeepGlobe Building Extraction Challenge (http://deepglobe.org), the SpaceNet Building Extraction Challenge (https://spacenetchallenge.github.io/) and the crowdAI Mapping Challenge (https://www.crowdai.org/challenges/mapping-challenge).

Deep learning relies on feedforward, multi-layer neural networks. A specific type of artificial neural network for image classification and segmentation is the convolutional neural network. A simplified convolutional neural network contains three basic layer types that can be repeated: (1) convolutional layers, (2) pooling layers and (3) fully connected layers (Goodfellow et al., 2016). A deep neural network consists of many such layers, which makes it capable of recognising detailed patterns and shapes in an image. Each layer has a distinctive task. A convolutional layer is a set of filters moving across the image, combining the extracted information into a two-dimensional activation matrix. The pooling layers progressively reduce the spatial resolution of the activation matrices to reduce the computational complexity of the network while increasing the field of view of each filter. The fully connected layer acts as a classifier that returns a classification vector based on the extracted feature values, which is used to determine the membership of a particular class (Šanca, 2020).

One of the most successful convolutional neural networks for object detection is Mask R-CNN (Mask Region Convolutional Neural Network; He et al., 2017), which can achieve high accuracy in building classification (Šanca, 2020). Mask R-CNN was developed by Facebook AI Research (FAIR) in 2017. It is a deep convolutional neural network used for object detection, semantic segmentation and instance segmentation. Object detection with Mask R-CNN consists of two parts. The first part generates proposals for the regions where objects are likely to be located in the input image. The second part detects the object, assigns it a probability of belonging to a class, outputs its position and predicts an object mask. More details on Mask R-CNN can be found in He et al. (2017) and Abdulla (2017). A detailed overview of the theory and application of deep learning in remote sensing is presented by Zhu et al. (2017).
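To make the roles of the convolutional, pooling and fully connected layers concrete, a minimal convolutional classifier might look as follows in Keras. This is an illustrative sketch only: the layer sizes and the single binary output are assumptions chosen for the example, not the Mask R-CNN architecture used in this study.

```python
# Minimal illustrative CNN with the three basic layer types:
# convolutional, pooling and fully connected layers.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu",
                  input_shape=(128, 128, 3)),   # convolutional layer: filters slide over a 128 x 128 px patch
    layers.MaxPooling2D(2),                     # pooling layer: reduces the spatial resolution
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),       # fully connected layer acting as the classifier
    layers.Dense(1, activation="sigmoid"),      # single "building / not building" output (assumption)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```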
Most of the previous research on deep learning for automatic building detection uses high spatial resolution satellite imagery as the data source. In a study of automatic building detection, Shetty and Mohan (2018) used WorldView-2 satellite imagery. They used panchromatic imagery with a spatial resolution of 0.46 m to train and test a Faster R-CNN model and found that convolutional neural networks allow the detection of buildings of different shapes with an accuracy of 88%, compared to the support vector machine (SVM) method. Using a large dataset of Google Earth imagery, Wen et al. (2019) investigated the building detection capabilities of Faster R-CNN. They compared the VGG and ResNet101 backbone architectures for Mask R-CNN and Faster R-CNN. In the proposed solution, Mask R-CNN was enhanced with a rotation matrix for building detection. The worst accuracy was achieved with Faster R-CNN-VGG (70%) and the best with the proposed Mask R-CNN-VGG (91%); the other two solutions, Faster R-CNN-ResNet101 and Mask R-CNN-ResNet101, achieved identical accuracies of 91%. They also compared the results of semantic building segmentation with Mask R-CNN, where both the proposed method and the conventional Mask R-CNN-ResNet101 achieved an accuracy of 91%. Based on the study, they concluded that promising results can be achieved by using a large and complex building dataset together with Mask R-CNN. Using the DeepGlobe dataset, Zhao et al. (2018) proposed an improved Mask R-CNN solution enhanced with a boundary regularisation algorithm. Their solution achieves more accurate results in building recognition and in delineating the footprints of the identified buildings than the baseline Mask R-CNN.

Compared to satellite imagery, aerial imagery has higher spatial resolution but fewer spectral bands; with aerial imagery, more complex objects can be identified with higher accuracy. Examples of building recognition datasets built from aerial imagery are the Massachusetts Buildings Dataset (https://www.cs.toronto.edu/~vmnih/data/; Mnih, 2013), the Inria Aerial Image Labeling Dataset (https://project.inria.fr/aerialimagelabeling/; Maggiori et al., 2017) and the AIRS Automatic Mapping of Buildings Dataset (https://www.airs-dataset.com/; Chen et al., 2019). Research to date using aerial imagery and Mask R-CNN achieves high building detection accuracy. Ji et al. (2019) investigated building changes using the Wuhan Building Change Detection Dataset. Their building detection solution consists of two convolutional neural networks: (1) a building recognition network based on Mask R-CNN and a Multi-Scale Fully Convolutional Network (MS-FCN), and (2) a building change detection network. The solution was tested in the Christchurch study area in New Zealand, comparing imagery from 2011 and 2016. Both proposed solutions achieved high classification accuracies, over 89% for object-based classification and over 93% for pixel-based building classification; MS-FCN performed slightly better than Mask R-CNN in building edge detection. Building detection from aerial images with Mask R-CNN was also addressed by Zhou et al. (2019), who investigated the ability to identify buildings by varying the anchor box parameter and the problem of segmenting the exact footprint of building edges. They compared two proposed solutions: (1) Mask R-CNN-S with a smaller anchor box and (2) Mask R-CNN-L with a larger anchor box.
They found that the building detection results are better with the smaller anchor box, as it identifies smaller buildings and buildings with more detailed content. They conclude with an important observation: Mask R-CNN is suitable for building detection, as classical convolutional neural networks do not preserve detailed spectral information when detecting objects, which leads to poorer results. Another important finding is the importance of the anchor box parameter, which significantly affects the quality of recognition of both small and larger, spectrally diverse objects. The results of the studies mentioned above are summarised in Table 1.

Table 1: Comparison of Faster R-CNN, Mask R-CNN and MS-FCN

| Neural network | Method used | Data | Spatial resolution [m] | Accuracy [%] | Study |
|---|---|---|---|---|---|
| Faster R-CNN | SVM | satellite, panchromatic | 0.46 | 88 | Shetty and Mohan (2018) |
| Faster R-CNN-VGG | rotation matrix for building recognition | aerial R-G-B | 0.26 | 70 | Wen et al. (2019) |
| Mask R-CNN-VGG | rotation matrix for building recognition | aerial R-G-B | 0.26 | 91 | Wen et al. (2019) |
| Faster R-CNN-ResNet101 | rotation matrix for building recognition | aerial R-G-B | 0.26 | 91 | Wen et al. (2019) |
| Mask R-CNN-ResNet101 | rotation matrix for building recognition | aerial R-G-B | 0.26 | 91 | Wen et al. (2019) |
| Mask R-CNN | boundary regularisation algorithm | satellite R-G-B | 0.5 | 88 | Zhao et al. (2018) |
| Mask R-CNN | SVM | aerial R-G-B | 0.3 | 90 | Ji et al. (2019) |
| MS-FCN | SVM | aerial R-G-B | 0.3 | 84 | Ji et al. (2019) |
| Mask R-CNN-S | small anchor box | aerial R-G-B | 0.5 | 85 | Zhou et al. (2019) |
| Mask R-CNN-L | large anchor box | aerial R-G-B | 0.5 | 81 | Zhou et al. (2019) |

The Geodetic Institute of Slovenia carries out the automatic classification of buildings annually to update the spatial databases of the Surveying and Mapping Authority of the Republic of Slovenia and to perform spatial monitoring for the Ministry of the Environment and Spatial Planning. Currently, buildings are classified with machine learning methods, namely object-based classification using support vector machines and random forests, where the digital surface model is also a key input. This paper aims to test a new method for building detection using deep learning, entirely independent of a digital surface model, on two new building datasets.

Two building datasets were produced as part of the research, using DOF050 colour orthophotos (R-G-B) and DOF050IR colour infrared orthophotos (NIR-R-G) from 2019 with a spatial resolution of 0.5 m. Colour infrared orthophotos offer a different view of the terrain: objects with high reflectance in the infrared spectrum (e.g. healthy vegetation) appear red, objects with high reflectance in the red spectrum appear green, and objects with high reflectance in the blue-green spectrum appear blue (Oštir, 2006). The main advantage of colour infrared orthophotos is that buildings are easier to distinguish from vegetation based on the spectral signature alone.
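Purely as an illustration of how the two inputs differ only in band composition, the sketch below assembles an R-G-B and a false-colour NIR-R-G stack with rasterio. The file name and the four-band layout are assumptions made for the example, not the actual structure of the DOF050 and DOF050IR products, which are distributed as separate orthophotos.

```python
# Illustrative sketch: composing R-G-B and false-colour NIR-R-G stacks
# from an orthophoto tile. The file name and band order are assumed.
import numpy as np
import rasterio

with rasterio.open("orthophoto_tile.tif") as src:   # hypothetical 4-band tile: R, G, B, NIR
    r, g, b, nir = (src.read(i) for i in (1, 2, 3, 4))

rgb = np.dstack([r, g, b])        # conventional colour composite (R-G-B)
nir_rg = np.dstack([nir, r, g])   # false-colour composite (NIR-R-G): healthy vegetation appears red
```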
We prepared the building datasets in the MS COCO format (Lin et al., 2014), which represents the training instances of buildings in JavaScript Object Notation (JSON). This way of annotation is fast and transparent, so we used it to create the building dataset. We trained eight different building classification models using the pre-trained weights of the MS COCO and ImageNet datasets (Deng et al., 2009). We applied transfer learning because Mask R-CNN relies on it to train and generalise models on new custom datasets faster. We validated the performance of the trained models on a selected test sample of buildings in Slovenia and evaluated each model by computing evaluation metrics. We were particularly interested in how well Mask R-CNN identifies buildings, how good the identified building masks are, and whether the proposed method has potential for further application.

2 STUDY AREA AND DATA DESCRIPTION

A key element of a successful deep learning building classification model is a well-designed dataset with many labelled training features. The study area for the construction of the two building datasets, covering the Pomurje region and the north-eastern part of the Podravje region, is shown in Figure 1. The area of 1,387 km² contained 98,425 registered buildings as of 28 March 2020. The building typology is primarily rural, except for Murska Sobota, which has an urban building typology. Roofs are of different shapes and colours, with red, brown and dark grey or black roofing predominating. Many of the roofs used in the training examples also contain solar panels. The area was chosen because of personal familiarity and because it includes a wide variety of roofs that contribute to the detail of the building dataset.

Figure 1: Map of the study area for the creation of the building dataset.

3 METHODS

In this chapter, we present the methodology for the creation of the building dataset. First, we define the building class, then we create the training samples and review them for the study area shown in Figure 1.

3.1 Definition of building class

The aim of the building cadastre is to register all buildings in the Republic of Slovenia. The concept of a building and part of a building is defined in ZEN (Official Gazette of the Republic of Slovenia No. 47/06). The data on buildings and parts of buildings are described in the National Topographic Model (DTM) (GURS, 2020), which is defined by law (Article 11 of ZDGRS, Official Gazette of the Republic of Slovenia No. 25/14 and No. 61/17). A building is defined as an object permanently located in one place (Boguszewski et al., 2020). An example of a correct building footprint that was retained is presented in Figure 2 on the left. Tall buildings pose a problem, as the building footprints obtained from the building cadastre do not spatially coincide with the buildings in the orthophoto. Such examples have been excluded from the dataset.

Figure 2: Examples of correct training samples (left) and incorrect training samples (right).
3.2 Methodology for the creation of the building dataset

The building dataset creation workflow is presented in Figure 3. Each step is explained below.

Figure 3: Deep learning dataset creation flowchart.

The building cadastre used to create the training samples was cleaned before the actual production of the dataset. We removed the building footprints which, according to the generalised land use (MOP, 2020), are located in: (1) manufacturing areas, (2) transport infrastructure areas and networks, (3) communication infrastructure areas and (4) energy infrastructure areas. Before the filtering, there were 98,425 footprints in the study area. First, 1,271 footprints were eliminated, mainly large industrial buildings; these were excluded because they are too large to be adequately represented in 128 × 128 px patches. We also excluded buildings whose spectral properties are equivalent to transport infrastructure, mainly roads, and transformer stations, as their outlines are too small for the algorithm to detect the mask and compute the image coordinates of the created masks. Buildings located in areas of energy, communication and transport infrastructure are atypical, and we wanted to create a dataset that is as balanced as possible with as few outliers as possible. After the data cleaning, we calculated new centroids from the inspected and cleaned building footprints, which were then used to prepare the training patches.

3.3 Overview of sample images and building masks

By manually inspecting the training features, we improve the quality of the dataset, which leads to better building detection results. The most typical errors are building footprints that cover meadows, fields and shadows, a consequence of the building cadastre not being up to date. For the study area with 98,425 building footprints, we detected 8,226 such errors, representing 8% of all footprints. The erroneous building footprints were not used in the production of the sample images and binary masks. After reviewing the produced sample images and binary masks, we obtained 58,000 training examples for the building dataset. Some examples of the most typical errors are shown in Figure 4.

Figure 4: Examples of incorrect building footprints. The most typical errors are footprints that cover meadows, fields and shadows. Footprints that only partially overlap the building are also not usable.

The dataset was split 80/20, with 80% of the training samples used for learning and 20% for validation. From the newly computed building centroids, we created building masks and sample patches of 128 × 128 px. Each building is represented in the dataset by a mask and a corresponding sample image. We then converted the dataset to the MS COCO format, which, as mentioned, uses JSON annotation. In the first step, we labelled the locations of all the sample patches and the corresponding binary masks according to the building class; in the second step, we converted the sample patches and the building masks into the JSON annotation used by the MS COCO dataset. The detailed procedure is described in Šanca (2020).
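The patch and mask preparation described above can be sketched as follows. The file names, attribute handling and the minimal COCO-style record are assumptions made for illustration; the full conversion to the MS COCO JSON annotation (including the polygon segmentation derived from each mask) follows Šanca (2020).

```python
# Illustrative sketch of the patch/mask preparation: read cleaned footprints,
# cut a 128 x 128 px patch and a binary mask around each centroid, and record
# a minimal COCO-style annotation. Paths and field names are assumed, and the
# footprints are assumed to share the CRS of the orthophoto mosaic.
import json
import geopandas as gpd
import rasterio
from rasterio.features import rasterize
from rasterio.windows import Window

SIZE = 128  # patch size in pixels (0.5 m resolution -> 64 m on the ground)

footprints = gpd.read_file("building_cadastre_cleaned.gpkg")  # already filtered by land use
annotations = []

with rasterio.open("dof050_mosaic.tif") as src:
    for i, row in footprints.iterrows():
        cx, cy = row.geometry.centroid.x, row.geometry.centroid.y
        row_px, col_px = src.index(cx, cy)                    # image coordinates of the centroid
        window = Window(col_px - SIZE // 2, row_px - SIZE // 2, SIZE, SIZE)
        patch = src.read(window=window)                       # sample image, shape (bands, 128, 128)
        mask = rasterize([(row.geometry, 1)], out_shape=(SIZE, SIZE),
                         transform=src.window_transform(window),
                         fill=0, dtype="uint8")               # binary building mask
        # The patch and mask would be written to disk here; border cases near the
        # mosaic edge are not handled in this sketch.
        annotations.append({"image_id": int(i), "category_id": 1,  # single class: building
                            "area": int(mask.sum()), "iscrowd": 0,
                            "segmentation": []})               # polygons derived from the mask later

# A complete COCO file also needs "images" and "categories" sections (omitted here).
with open("annotations_sketch.json", "w") as f:
    json.dump({"annotations": annotations}, f)
```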
3.4 Transfer learning from pre-trained neural networks

Neural networks for specific computer vision tasks have already been built and trained on large datasets such as ImageNet (Deng et al., 2009) and MS COCO. The pre-trained weights of these datasets are available to users, who can reuse them so that the weights adapt better when a neural network is trained on their own dataset. This kind of learning is called transfer learning. Transfer learning works by initialising the network with already trained model weights so that they adapt better during training on another dataset. When extracting features from images, we use the neural network head and train only the new classifier to optimise the previously trained filter weights for new tasks on the second dataset; the model training process therefore requires much less time (Ramon et al., 2019). To train a Mask R-CNN neural network, we can use the pre-trained weights of two large-scale datasets, MS COCO and ImageNet.

The MS COCO dataset is one of the leading datasets for object detection and semantic segmentation. It contains annotations for various problems: (1) object detection, (2) keypoint detection, (3) stuff segmentation, (4) panoptic segmentation and (5) image captioning. It contains 91 image categories, of which 82 have more than 5,000 labelled examples. In total, the MS COCO collection contains 2.5 million labelled training examples on 328,000 images (Lin et al., 2014). ImageNet is considered the largest dataset for state-of-the-art object recognition, containing more than 15 million manually annotated high-resolution images, organised according to the WordNet hierarchy into 22,000 classes. ImageNet supports: (1) object classification, (2) object detection and (3) single object localisation (Deng et al., 2009).

Mask R-CNN consists of two parts: (1) a convolutional backbone architecture used for feature extraction over the entire image, and (2) the network head for classification, bounding box regression and mask prediction, applied separately to each Region of Interest (RoI) (Zhao et al., 2018). The backbone of Mask R-CNN consists of ResNet101, which contains 101 convolutional layers, combined with a Feature Pyramid Network (FPN) that performs multiscale feature extraction from the input image (Šanca, 2020).
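As a generic illustration of the feature-extraction form of transfer learning described at the beginning of this section, a pre-trained backbone can be frozen and only a new classification head trained. The snippet below is a sketch using an ImageNet-pretrained ResNet101 in Keras; it is not the Mask R-CNN setup itself, which is sketched after Table 3.

```python
# Generic transfer-learning sketch: reuse ImageNet-pretrained convolutional
# filters and train only a new classification head. Illustrative only.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet101

backbone = ResNet101(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
backbone.trainable = False                     # keep the pre-trained filter weights fixed

model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),     # new head trained on the second (custom) dataset
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```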
We used the same hyperparameters to train all the models, and we reduced the learning rate by a factor of 10 for the models whose weights were fine-tuned. The training hyperparameters, set on the basis of previous research, are given in Table 2 and are described in more detail in Šanca (2020).

Table 2: Hyperparameter values used in model training

| Hyperparameter | Value |
|---|---|
| image shape | 128 |
| learning rate | 0.001 and 0.0001 |
| batch size | 1,000 |
| steps per epoch | 1,000 |
| number of validation steps | 50 |
| backbone architecture | ResNet-101 |
| anchor box size | (8, 16, 32, 64, 128) |
| number of proposed regions per image | 32 |
| total training time | 328 hours |

The models were trained on a computer with the following specifications:
– CPU: Intel(R) Core(TM) i9-9900X @ 3.50 GHz,
– Memory: 64 GB RAM, DDR4 (4 × 16 GB DIMM DDR4 Synchronous 2400 MHz),
– Graphics card: NVIDIA GeForce RTX 2080 SUPER.

Figure 5: Mask R-CNN architecture for the building detection case.

The input image first passes through a convolutional neural network (CNN) that extracts features. In the second step, the region proposal network suggests regions in the input image based on predictions generated according to the class defined in the dataset. The extracted information is merged by the RoIAlign layer, which acts as a pooling layer and feeds two branches. In the first branch, the RoIAlign output is passed through fully connected layers to predict the object class and compute the bounding box regression. In the second branch, the RoIAlign output is passed through the convolutional layers of the network head, creating a pixel mask for each region of interest (RoI) and segmenting the image pixel by pixel according to the identified class. In this study, we compare the performance of eight classification models, listed in Table 3.

Table 3: Models trained on the NIR-R-G and R-G-B datasets

| Model name | Dataset | Layers used for training | Number of epochs | Weights used | Learning rate | Training time [h] |
|---|---|---|---|---|---|---|
| M1 | NIR-R-G | heads | 100 | MS COCO | 0.001 | 39 |
| M2 | NIR-R-G | heads | 100 | ImageNet | 0.001 | 40 |
| M3 | NIR-R-G | all layers | 200 | MS COCO | 0.0001 | 42 |
| M4 | NIR-R-G | all layers | 200 | ImageNet | 0.0001 | 44 |
| M5 | R-G-B | heads | 100 | MS COCO | 0.001 | 37 |
| M6 | R-G-B | heads | 100 | ImageNet | 0.001 | 39 |
| M7 | R-G-B | all layers | 200 | MS COCO | 0.0001 | 43 |
| M8 | R-G-B | all layers | 200 | ImageNet | 0.0001 | 44 |

The number of epochs indicates the number of passes of the entire training dataset through Mask R-CNN. For models M1, M2, M5 and M6 we trained only the head layers of the neural network; for models M3, M4, M7 and M8 we trained all the layers. The learning rate is the step size in the computation of the gradient of the loss function, which is iteratively minimised during training; a suitable learning rate is chosen to avoid underfitting or overfitting (Šanca, 2020). The model training process is illustrated in Figure 6.

Figure 6: Flowchart of the models trained for building detection. Models M3, M4, M7 and M8 were fine-tuned for a further 100 epochs after training the neural network head. Building detection was performed separately with each model.
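Using the Matterport implementation of Mask R-CNN (Abdulla, 2017) cited above, the two-stage setup summarised in Tables 2 and 3 can be sketched roughly as follows. The configuration class, paths and dataset objects are assumptions for illustration and do not reproduce the exact project code.

```python
# Rough sketch of the two-stage transfer-learning setup (Tables 2 and 3),
# based on the Matterport Mask R-CNN API (Abdulla, 2017).
from mrcnn.config import Config
from mrcnn import model as modellib

class BuildingConfig(Config):               # hypothetical config mirroring Table 2
    NAME = "building"
    NUM_CLASSES = 1 + 1                     # background + building
    IMAGE_MIN_DIM = IMAGE_MAX_DIM = 128     # 128 x 128 px patches
    BACKBONE = "resnet101"
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)
    TRAIN_ROIS_PER_IMAGE = 32
    STEPS_PER_EPOCH = 1000
    VALIDATION_STEPS = 50
    LEARNING_RATE = 0.001

config = BuildingConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Initialise from MS COCO weights, skipping the layers whose shapes depend on
# the number of classes (transfer learning instead of training from scratch).
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train and dataset_val are assumed to be prepared mrcnn.utils.Dataset
# subclasses that load the COCO-format building patches; omitted here.
dataset_train = dataset_val = None  # placeholders for the sketch

# Stage 1: train only the network heads (as for models M1, M2, M5, M6).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=100, layers="heads")

# Stage 2: fine-tune all layers with a ten times smaller learning rate
# (as for models M3, M4, M7, M8).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10, epochs=200, layers="all")
```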
4 RESULTS AND EVALUATION

We evaluated the performance of transfer learning based on the loss function of the training process. We did not use the validation loss to assess the stability of the models, because the building footprints from the building cadastre are inconsistent for a large proportion of the training pairs: a trained model may correctly predict a building where no footprint is registered, which the validation process counts as a misclassification. Instead, we validated the performance of the models by calculating evaluation metrics (section 4.2). We validated the building classification models on a selected test area outside the training dataset area. We chose 300 examples to test the trained models, considering the criterion of roof diversity, in particular roof colour and shape, the presence of solar panels on the roof, shadows, etc. We first compare the trained models with each other and then compare the M3 and M4 results with the building cadastre. We present the building detection results in figures, where we compare the predictions of all the models trained on the NIR-R-G dataset and compare the resulting building footprints with the building cadastre.

4.1 Loss functions after transfer learning and comparison of models

Figure 7 shows an example of building detection with the pre-trained MS COCO (R-G-B) model, whose weights we used to initialise training on our own dataset. As expected, building detection without transfer learning on the constructed building dataset is incorrect: the base MS COCO or ImageNet model only randomly suggests regions.

The multi-task loss function of Mask R-CNN is the sum of the classification loss, the bounding box loss and the mask loss. For the building detection case, we are most interested in the mask loss, as it measures the accuracy of the building mask classification. Figure 8 shows that the overall loss after transfer learning from the MS COCO model is marginally lower than the loss of the ImageNet model; this also holds for the individual classification, regression and mask losses. The loss decreases during training and stabilises towards the end (after about 65 training epochs), which means there is no need for further training. The value of the Mask R-CNN mask loss does not change much during training, because the building dataset contains only a single class that is predicted at detection. Only the loss functions of the training phase are shown.

Figure 7: Example of building detection with the base MS COCO model, before re-training.

Figure 8: Training loss functions for M1 and M2 on the NIR-R-G dataset. For both models, we trained the heads of the neural network for 100 epochs.
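Following He et al. (2017), the multi-task loss referred to above can be written compactly as the per-RoI sum

\[
L = L_{\text{cls}} + L_{\text{box}} + L_{\text{mask}},
\]

where the mask term is the one most relevant for the quality of the predicted building footprints.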
After fine-tuning, the values of the loss functions decrease further, except for the mask loss, which stabilises early and does not improve; this means that the footprints of the identified buildings do not change noticeably at detection.

Figure 9: Training loss functions for M3 and M4 on the NIR-R-G dataset. For both models, we trained all the layers of the neural network for 100 epochs.

The results of building detection with the models trained on NIR-R-G are shown in Figures 10, 11 and 12. The M3 and M4 models recognise smaller buildings, but the footprints of the recognised buildings remain very similar to those of M1 and M2. Fine-tuning all the layers of Mask R-CNN is important for improving the prediction accuracy, but it is not crucial for improving the footprints of the recognised buildings compared to the footprints obtained by training the neural network head alone.

Figure 10: Comparison of the performance of models M1, M3, M2 and M4.

Figure 11: Detection results of buildings with M3 and M4 and comparison with the building cadastre.

Figure 12: Detection results of buildings with M3 and M4 and comparison with the building cadastre.

Figure 13 shows the training loss functions of M5 and M6 trained on R-G-B, and Figure 14 shows the loss functions of M7 and M8 trained on R-G-B. The loss functions of the R-G-B models are approximately the same as those of the NIR-R-G models.

Figure 13: Training loss functions of M5 (green) and M6 (blue) on the R-G-B dataset. For both models, we trained the heads of the neural network for 100 epochs.

We were also interested in the difference in performance between the models trained on the NIR-R-G and R-G-B datasets. The results of building detection with the R-G-B models are shown in Figure 15. The differences between M1, M2 (NIR-R-G) and M5, M6 (R-G-B) in recognising building footprints are minimal, and the same holds for the fine-tuned models M3, M4 and M7, M8. Compared to the NIR-R-G orthophotos, the building detection results on R-G-B are slightly better, because the base model was trained on the R-G-B everyday-scene images of the MS COCO dataset. In both cases, transfer learning from weights pre-trained on the MS COCO or ImageNet datasets proves to be an efficient solution compared to training models from scratch.
Figure 14: Training loss functions of M7 (green) and M8 (blue) on the R-G-B dataset. For both models, we trained all the layers of the neural network for 100 epochs.

Figure 15: Example of M7 building detection on an R-G-B orthophoto.

4.2 Performance of the trained models

We evaluate the performance of the trained classification models on the testing patches. Building classification is an example of binary classification, where the prediction performance is represented by a confusion matrix:

| | Predicted: building | Predicted: no building |
|---|---|---|
| Actual: building | TP – True Positive | FN – False Negative |
| Actual: no building | FP – False Positive | TN – True Negative |

Correctly classified cases fall into True Positive (TP) and True Negative (TN); misclassified cases fall into False Negative (FN) and False Positive (FP). The prediction results are used to calculate evaluation metrics that assess the performance of the building detection models. The equations for the evaluation metrics are taken from Fetai et al. (2021). Accuracy represents the proportion of correct predictions over all model predictions for binary classification:

\[
\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} = \frac{TP + TN}{\text{all cases}} \qquad (1)
\]

Recall measures the proportion of correctly predicted cases out of all positive cases:

\[
\text{recall} = \frac{TP}{TP + FN} \qquad (2)
\]

The combined metric of accuracy and recall is the F1 score, used when a single measure of model performance is needed:

\[
F_1 = \frac{2 \cdot \text{accuracy} \cdot \text{recall}}{\text{accuracy} + \text{recall}} \qquad (3)
\]

We counted the correctly and incorrectly identified buildings for each trained model and calculated the evaluation metrics. The results are shown in Table 4.

Table 4: Evaluation of the performance of the building classification models

| Model | Dataset | Weights | TP | TN | FP | FN | Accuracy | Recall | F1 score |
|---|---|---|---|---|---|---|---|---|---|
| M1 | NIR-R-G | MS COCO | 246 | 0 | 38 | 16 | 0.8200 | 0.9389 | 0.8754 |
| M2 | NIR-R-G | ImageNet | 243 | 0 | 43 | 14 | 0.8100 | 0.9455 | 0.8725 |
| M3 | NIR-R-G | MS COCO | 296 | 0 | 3 | 1 | 0.9867 | 0.9966 | 0.9916 |
| M4 | NIR-R-G | ImageNet | 294 | 0 | 4 | 2 | 0.9800 | 0.9932 | 0.9866 |
| M5 | R-G-B | MS COCO | 249 | 0 | 36 | 15 | 0.8300 | 0.9432 | 0.8830 |
| M6 | R-G-B | ImageNet | 244 | 0 | 41 | 15 | 0.8133 | 0.9421 | 0.8730 |
| M7 | R-G-B | MS COCO | 298 | 0 | 1 | 1 | 0.9933 | 0.9967 | 0.9950 |
| M8 | R-G-B | ImageNet | 296 | 0 | 1 | 2 | 0.9867 | 0.9933 | 0.9900 |

The trained models are very similar in terms of building detection performance. The 200-epoch models (M3, M4, M7 and M8) are more successful in identifying building footprints and in identifying buildings even at the edge of the sample patches. M1, M2, M5 and M6 mainly fail to recognise small buildings or buildings partially covered by vegetation.
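For completeness, the metrics in equations (1)–(3) can be computed directly from the counts in Table 4. The small helper below uses the formulas exactly as defined above, with the M3 counts as a worked example.

```python
# Compute the evaluation metrics of equations (1)-(3) from the confusion
# matrix counts. Example values are those of model M3 (NIR-R-G, MS COCO
# weights) from Table 4.
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + fp + tn + fn)        # equation (1)
    recall = tp / (tp + fn)                           # equation (2)
    f1 = 2 * accuracy * recall / (accuracy + recall)  # equation (3)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}

print(evaluate(tp=296, tn=0, fp=3, fn=1))
# -> accuracy 0.9867, recall 0.9966, F1 0.9916 (rounded), matching Table 4
```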
5 CONCLUSION AND DISCUSSION

This paper presents a building detection process with Mask R-CNN, from dataset creation to building detection for eight models. We trained the models by transferring knowledge from the weights pre-trained on the MS COCO and ImageNet datasets, and compared the performance of building classification using R-G-B orthophotos and NIR-R-G colour infrared orthophotos. The dataset produced for deep learning building detection has potential for further use: it can be extended to the whole country and possibly used for future building data retrieval and building database maintenance. The obtained results confirm that Mask R-CNN is useful and suitable for building classification, as also claimed by related research (Ji et al., 2019; Zhou et al., 2019).

In Slovenia, this is, to our knowledge, the first example of deep learning-based building detection. It is also the first case of applying transfer learning from models trained on the MS COCO and ImageNet datasets, which contain only R-G-B images of everyday objects, to a building dataset made from R-G-B and NIR-R-G orthophotos. The success of transfer learning from one domain to another demonstrates the flexibility of this type of network. We argue that transfer learning is an effective method for updating models that have been trained on different data. Training models from scratch is a time-consuming process that allows better models to be built; applying transfer learning to such models using new data of the same colour space would allow faster and more efficient model updating and, above all, faster training.

Based on the experience gained, we have several suggestions for further research. The first improvement is to produce building masks of more regular shapes with a footprint regularisation algorithm, as proposed by Zhao et al. (2018); the identified building masks can then be vectorised and integrated into a GIS. The dataset can be enhanced by combining a colour infrared orthophoto with a normalised digital surface model (nDSM), which would separate the roofs from the terrain in the data creation process. The dataset could also be extended to include buildings across Slovenia, allowing the model to be trained over the whole country.

ACKNOWLEDGEMENTS

The research was partly carried out within the applied research project L2-1826, co-funded by the Slovenian Research Agency, the Surveying and Mapping Authority of the Republic of Slovenia and the Ministry of Defence, and within the research programme P2-0406 and project J2-9251, funded by the Slovenian Research Agency. Many thanks also to the Department of Civil Engineering at Western Norway University of Applied Sciences for partially funding the research.

Literature and references:

Abdulla, W. (2017). Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, accessed 17. 3. 2020.

Boguszewski, A., Batorski, D., Ziemba-Jankowska, N., Zambrzycka, A., Dziedzic, T. (2020). LandCover.ai: Dataset for Automatic Mapping of Buildings, Woodlands and Water from Aerial Imagery. ArXiv.

Fetai, B., Račič, M., Lisec, A. (2021). Deep Learning for Detection of Visible Land Boundaries from UAV Imagery. Remote Sensing,
13 (11), 2077. DOI: https://doi.org/10.3390/rs13112077

Chen, Q., Wang, L., Wu, Y., Wu, G., Guo, Z., Waslander, S. L. (2019). Aerial Imagery for Roof Segmentation: A Large-Scale Dataset towards Automatic Mapping of Buildings. ISPRS Journal of Photogrammetry and Remote Sensing, 147, 42–55.

Crésson, R. (2018). Orfeo Toolbox meets TensorFlow. https://github.com/remicres/otbtf, accessed 13. 4. 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L. (2009). ImageNet: a Large-Scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition, 248–255. DOI: https://doi.org/10.1109/CVPR.2009.5206848

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. Boston, MIT Press.

Geodetska uprava RS (2020). Zbirka topografskih podatkov (DTM). https://www.e-prostor.gov.si/fileadmin/struktura/DTM_objektni_katalog.pdf, accessed 13. 4. 2020.

Grigillo, D. (2010). Samodejno odkrivanje stavb na visokoločljivih slikovnih virih za potrebe vzdrževanja topografskih podatkov [Automatic building detection from high resolution imagery for maintenance of topographic data]. Doctoral dissertation. Ljubljana: University of Ljubljana. http://drugg.fgg.uni-lj.si/781/, accessed 25. 4. 2020.

Grizonnet, M., Michel, J., Poughon, V., Inglada, J., Savinaud, M., Cresson, R. (2018). Orfeo ToolBox: Open source processing of remote sensing images. Open Geospatial Data, Software and Standards, 2. DOI: https://doi.org/10.1186/s40965-017-0031-6

He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2980–2988. DOI: https://doi.org/10.1109/iccv.2017.322

Ji, S., Yanyun, S., Lu, M., Zhan, Y. (2019). Building Instance Change Detection from Large-Scale Aerial Images using Convolutional Neural Networks and Simulated Shapes. Remote Sensing, 11 (11), 1343–1363. DOI: https://doi.org/10.3390/rs11111343

Lin, T. Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., Dollár, P. (2014). Microsoft COCO: Common Objects in Context. Computer Vision and Pattern Recognition, 740–755.

Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P. (2017). Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 3226–3229. DOI: https://doi.org/10.1109/IGARSS.2017.8127684

Mnih, V. (2013). Machine Learning for Aerial Image Labeling. Doctoral dissertation. Toronto: University of Toronto. https://www.cs.toronto.edu/~vmnih/docs/Mnih_Volodymyr_PhD_Thesis.pdf, accessed 8. 10. 2021.

MOP (2020). Dostop do podatkov o prostorskih aktih. https://dokumentipis.mop.gov.si/javno/veljavni/tematski_zbirni_sloji/gnrp_opis.pdf, accessed 14. 4. 2020.

MS COCO (2014). COCO Data Format. http://cocodataset.org/#format-data, accessed 1. 4. 2020.

Oštir, K. (2006). Daljinsko zaznavanje. Ljubljana, Založba ZRC, ZRC SAZU.

Račič, M. (2019). Kategorizacija uporabe zemeljske površine na podlagi multispektralnih slik [Categorisation of land use based on multispectral imagery]. Master thesis. Ljubljana: University of Ljubljana. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=110064, accessed 17. 3. 2020.
Šanca, S. (2020). Samodejna klasifikacija stavb z globokim učenjem [Automatic classification of buildings with deep learning]. Master thesis. Ljubljana: University of Ljubljana. https://repozitorij.uni-lj.si/Dokument.php?id=135433&lang=slv, accessed 20. 11. 2020.

Shetty, A. R., Mohan, B. (2018). Building Extraction in High Spatial Resolution Images Using Deep Learning Techniques. Computational Science and Its Applications – ICCSA 2018. Melbourne, Springer.

Veljanovski, T., Kanjir, U., Oštir, K. (2011). Objektno usmerjena analiza podatkov daljinskega zaznavanja. Geodetski vestnik, 55 (4), 665–688. DOI: https://doi.org/10.15292/geodetski-vestnik.2011.04.665-688

Wen, Q., Jiang, K., Wang, W., Liu, Q., Guo, Q., Li, L., Wang, P. (2019). Automatic Building Extraction from Google Earth Images under Complex Backgrounds Based on Deep Instance Segmentation Network. Sensors, 19 (2), 333–349. DOI: https://doi.org/10.3390/s19020333

Zakon o državnem geodetskem referenčnem sistemu (ZDGRS) [National Land Survey Reference System Act]. Official Gazette of the Republic of Slovenia No. 25/2014 and 61/2017.

Zakon o evidentiranju nepremičnin (ZEN) [Real Estate Records Act]. Official Gazette of the Republic of Slovenia No. 47/2006 and amendments.

Zhao, K., Kang, J., Jung, J., Sohn, G. (2018). Building Extraction from Satellite Images Using Mask R-CNN with Building Boundary Regularization. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, pp. 2242–2424. DOI: https://doi.org/10.1109/CVPRW.2018.00045

Zhou, K., Chen, Y., Smal, I., Lindenbergh, R. (2019). Building Segmentation from Airborne VHR Images using Mask R-CNN. ISPRS – International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W13, 155–161. DOI: https://doi.org/10.5194/isprs-archives-XLII-2-W13-155-2019

Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F. (2017). Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geoscience and Remote Sensing Magazine, 5, 8–36. DOI: https://doi.org/10.1109/MGRS.2017.2762307

Šanca, S., Oštir, K., Mangafić, A. (2021). Building detection with convolutional networks trained with transfer learning. Geodetski vestnik, 65 (4), 559–593. DOI: https://doi.org/10.15292/geodetski-vestnik.2021.04.559-593
Na spletu najdemo veliko nalog in tekmovanj, ki uporabljajo metode globokega učenja za samodejno klasifikacijo stavb z uporabo satelitskih ali letalskih posnetkov. Primeri takih tekmovanj so DeepGlobe Buildings Extraction Challange 7 , SpaceNet Building Extraction Challenge 8 , crowdAI Mapping Challenge 9 idr. Globoko učenje za modeliranje podatkov uporablja usmerjene, večslojne nevronske mreže. Specifičen primer umetnih nevronskih mrež za klasifikacijo in segmentacijo slik so konvolucijske nevronske mreže. Poenostavljen primer konvolucijske nevronske mreže vsebuje tri osnovne sloje, ki se lahko ponavljajo. T o so: (1) konvolucijski sloji (angl. convolutional layers), (2) združevalni sloji (angl. pooling layers) in (3) polno povezani sloji (angl. fully connected layers) (Goodfellow et al., 2016). Globoko nevronsko mrežo sestavlja veliko takih slojev, zaradi česar je sposobna prepoznati podrobne vzorce in oblike na sliki. Vsak sloj nevronske mreže ima značilno nalogo. Konvolucijski sloj je kombinacija večkratnih filtrov, ki se premikajo čez sliko, in združuje višje nivojske informacije v dvodimenzionalno aktivacijsko matriko. Prostorsko ločljivost aktivacijskih matrik postopoma zmanjšujejo združevalni sloji, da se zmanjša računska kompleksnost nevronske mreže in hkrati povečuje vidno polje posameznega filtra. Polno povezani sloj deluje kot klasifikator, ki na podlagi pridobljenih vrednosti nabora značilk vrne klasifikacijski vektor, na podlagi katerega se določi pripadnost nekemu razredu (Šanca, 2020). Ena izmed uspešnejših konvolucijskih nevronskih mrež za detekcijo objektov je Mask R-CNN (He et al., 2017) (angl. Mask Regional Convolutional Neural Network), s katero lahko dosežemo visoko točnost klasifikacije stavb (Šanca, 2020). Mask R-CNN so razvili pri Facebook AI Research (FAIR) leta 2017. Gre za globoko konvolucijsko nevronsko mrežo, ki se uporablja za detekcijo objektov (angl. object detection), semantično segmentacijo (angl. semantic segmentation) ter segmentacijo primerov (angl. instance segmenta- tion). Detekcijo objektov z Mask R-CNN sestavljata dva dela, v prvem se generira predlog območja, kjer 7 http://deepglobe.org 8 https://spacenetchallenge.github.io/ 9 https://www.crowdai.org/challenges/mapping-challenge | 578 | | 65/4 | GEODETSKI VESTNIK RECENZIRANI ČLANKI | PEER-REVIEWED ARTICLES SI| EN Simon Šanca, Krištof Oštir, Alen Mangafić | ZAZNAVANJE STAVB Z UPORABO NEVRONSKIH MREŽ, UČENIH S PRENOSOM ZNANJA | BUILDING DETECTION WITH CONVOLUTIONAL NETWORKS TRAINED WITH TRANSFER LEARNING | 559-593 | naj bi se objekt nahajal na vhodni sliki, v drugem se objekt zazna in se zanj določi verjetnost pripadnosti razredu, prikaže se položaj objekta in določi njegova maska. Več podrobnosti o Mask R-CNN najdemo v prispevkih He et al. (2017) in Abdulla (2017). Podroben pregled teorije in uporabe globokega učenja v daljinskem zaznavanju predstavijo Zhu et al. (2017). V ečina dosedanjih raziskav, ki temeljijo na uporabi globokega učenja za samodejno detekcijo stavb, kot vir podatkov uporablja satelitske posnetke visoke prostorske ločljivosti. Pri študiji samodejne detekcije stavb sta Shetty in Mohan (2018) uporabila podatkovno množico satelitskega sistema WorldView-2. Za učenje in testiranje modela z nevronsko mrežo Faster R-CNN sta uporabila pankromatske posnetke prostorske ločljivosti 0,46 metra. Ugotovila sta, da uporaba konvolucijskih nevronskih mrež omogoča detekcijo stavb različnih oblik s točnostjo 88 %, v primerjavi z metodo podpornih vektorjev (angl. 
support vector machines), kjer je bila dosegljiva točnost detekcije 88,3 %. Z uporabo velike podatkovne zbirke posnetkov Google Earth so Wen et al. (2019) raziskovali zmožnosti detekcije stavb s Faster R-CNN. Primerjali so hrbtni arhitekturi VGG in ResNet101 pri uporabi Mask R-CNN in Faster R-CNN. Pri predlagani rešitvi so Mask R-CNN nadgradili z rotacijsko matriko za detekcijo stavb. Najslabšo točnost so dosegli s Faster R-CNN-VGG (70 %), najboljšo s predlagano rešitvijo Mask R-CNN-VGG (91 %). Ostali dve rešitvi Faster R-CNN-ResNet101 in Mask R-CNN-ResNet101 dosegata identični točnosti 91 %. V nadaljevanju so primerjali tudi rezultate semantične segmentacije stavb z Mask R-CNN, kjer predlagana metoda doseže točnost 91 %, navadna metoda Mask R-CNN-Resnet101 pa točnost 91 %. Na podlagi raziskave ugotovijo, da se z uporabo velike in kompleksne podatkovne zbirke stavb in uporabo Mask R-CNN lahko dosežejo obetavni rezultati. Z uporabo podatkovne zbirke DeepGlobe so Zhao et al. (2018) predlagali izboljšano rešitev Mask R-CNN, ki so jo nadgradili z algoritmom prepoznave mej. Njihova predlagana rešitev doseže natančnejše rezultate pri prepoznavi stavb in zaznavanju obrisov prepoznanih stavb kot osnovna rešitev Mask R-CNN. Letalski posnetki imajo v primerjavi s satelitskimi posnetki visoko prostorsko ločljivost, a manj spektralnih kanalov. Podatkovne zbirke na podlagi letalskih posnetkov omogočajo prepoznavo kompleksnejših ob- jektov z višjo točnostjo. Primeri podatkovnih zbirk za prepoznavo stavb, izdelanih iz letalskih posnetkov, so: Massachusetts Buildings Dataset 10 (Mnih, 2013), Inria Aerial Image Labeling Dataset 11 (Maggiori et al., 2017), AIRS Automatic Mapping of Buildings Dataset 12 (Chen et al., 2019). Dosedanje raziskave z uporabo letalskih posnetkov in Mask R-CNN dosegajo visoko točnost detekcije stavb. V prispevku Ji et al. (2019) so raziskovali spremembe stavb s podatkovno zbirko Wuhan Building Change Detec- tion. Predlagano rešitev za detekcijo stavb sestavljata dve konvolucijski nevronski mreži: (1) mreža za prepoznavo stavb, ki deluje na osnovi Mask R-CNN in MS-FCN (angl. Multi-Scale Fully Convolutional Network), ter (2) mreža za odkrivanje sprememb stavb. Rešitev so preizkusili na študijskem območju Christchurcha v Novi Zelandiji, kjer so primerjali območji iz let 2011 in 2016. Obe predlagani rešitvi sta dosegli visoko točnost klasifikacije, in sicer nad 89 % pri objektni klasifikaciji in nad 93 % pri pikselski klasifikaciji stavb. MS-FCN je bil v primerjavi z Mask R-CNN nekoliko boljši pri detekciji robov stavb. Z detekcijo stavb iz letalskih posnetkov z uporabo Mask R-CNN so se ukvarjali (Zhou et al., 2019). Raziskovali so zmožnost prepoznave stavb s spreminjanjem vrednosti parametra sidrnega 10 https://www.cs.toronto.edu/~vmnih/data/ 11 https://project.inria.fr/aerialimagelabeling/ 12 https://www.airs-dataset.com/ | 579 | GEODETSKI VESTNIK | 65/4 | RECENZIRANI ČLANKI | PEER-REVIEWED ARTICLES SI | EN Simon Šanca, Krištof Oštir, Alen Mangafić | ZAZNAVANJE STAVB Z UPORABO NEVRONSKIH MREŽ, UČENIH S PRENOSOM ZNANJA | BUILDING DETECTION WITH CONVOLUTIONAL NETWORKS TRAINED WITH TRANSFER LEARNING | 559-593 | polja (angl. anchor box) in problem segmentacije natančnih obrisov robov stavb. Primerjali so dve predlagani rešitvi: (1) Mask R-CNN-S z manjšim sidrnim poljem in (2) Mask R-CNN-L z večjim sidrnim poljem. Ugotovili so, da so rezultati detekcije stavb boljši z uporabo rešitve z manjšim sidrnim poljem, saj ta prepozna manjše stavbe in stavbe z bolj podrobno vsebino. 
Zaključijo s pomembno ugo- tovitvijo, da je Mask R-CNN primerno za detekcijo stavb, saj klasične konvolucijske nevronske mreže ne ohranijo podrobne spektralne informacije pri detekciji objektov, kar privede do slabših rezultatov. Druga pomembna ugotovitev je pomen parametra sidrnega polja, ki pomembno vpliva na kakovost prepoznave manjših objektov ter večjih, spektralno raznolikih objektov. Zbrane rezultate navedenih raziskav predstavljamo v preglednici 1. Preglednica 1: Primerjava karakteristik in metod preteklih raziskav s Faster R-CNN, Mask R-CNN in MS-FCN Nevronska mreža Uporabljena metoda Podatki Prostorska ločljivost [m] Točnost [%] Študija Faster R-CNN podporni vektorji satelitski, pankromatski 0,46 88 Shetty in Mohan, (2018) Faster R-CNN- VGG rotacijska matrika za prepoznavo stavb letalski R-G-B 0,26 70 Wen et al. (2019) Mask R-CNN- VGG rotacijska matrika za prepoznavo stavb letalski R-G-B 0,26 91 Wen et al. (2019) Faster R-CNN- Resnet101 rotacijska matrika za prepoznavo stavb letalski R-G-B 0,26 91 Wen et al. (2019) Mask R-CNN- Resnet101 rotacijska matrika za prepoznavo stavb letalski R-G-B 0,26 91 Wen et al. (2019) Mask R-CNN algoritem regularizacije mej satelitski R-G-B 0,5 88 Zhao et al. (2018) Mask R-CNN podporni vektorji letalski R-G-B 0,3 90 Ji et al. (2019) MS-FCN podporni vektorji letalski R-G-B 0,3 84 Ji et al. (2019) Mask R-CNN-S majhno sidrno polje letalski R-G-B 0,5 85 Zhou et al. (2019) Mask R-CNN-L veliko sidrno polje letalski R-G-B 0,5 81 Zhou et al. (2019) Geodetski inštitut Slovenije na letni ravni izvaja samodejno klasifikacijo stavb z namenom posodabljanja prostorskih evidenc Geodetske uprave Republike Slovenije in monitoringa prostora za Ministrstvo za okolje in prostor. Klasifikacija stavb se izvaja z metodami strojnega učenja, z objektno klasifikacijo, in sicer z uporabo podpornih vektorjev (angl. support-vector machines) in naključnih gozdov (angl. random forest), kjer je ključni podatek tudi digitalni model površja. Namen prispevka je preizkus nove metode detekcije stavb z uporabo globokega učenja na dveh lastno izdelanih podatkovnih zbirkah stavb, ki sta popolnoma neodvisni od uporabe digitalnega modela površja. V okviru raziskave smo izdelali dve podatkovni zbirki stavb, in sicer z uporabo barvnih ortofotov DOF050 (R-G-B) in z uporabo barvnih bližnjih infrardečih ortofotov DOF050IR (BIR-R-G) prostorske ločljivosti 0,5 metra iz leta 2019. Z uporabo barvnih bližnje infrardečih ortofotov dobimo drugačno predstavo o | 580 | | 65/4 | GEODETSKI VESTNIK RECENZIRANI ČLANKI | PEER-REVIEWED ARTICLES SI| EN Simon Šanca, Krištof Oštir, Alen Mangafić | ZAZNAVANJE STAVB Z UPORABO NEVRONSKIH MREŽ, UČENIH S PRENOSOM ZNANJA | BUILDING DETECTION WITH CONVOLUTIONAL NETWORKS TRAINED WITH TRANSFER LEARNING | 559-593 | terenu, saj so predmeti z visoko odbojnostjo v infrardečem spektru (npr. zdrava vegetacija) prikazani z rdečo barvo, medtem ko so predmeti z močno odbojnostjo v rdečem spektru prikazani zeleno in pred- meti z močno odbojnostjo v modrozelenem spektru prikazani modro (Oštir, 2006). Glavna prednost uporabe barvnih bližnje infrardečih ortofotov je predvsem, da lahko že na podlagi spektralnega podpisa lažje ločimo stavbe od vegetacije. Podatkovni zbirki stavb smo zapisali v formatu MS COCO (Lin et al., 2014), ki učne primerke stavb predstavi z zapisom JSON (angl. JavaScript Object Notation). T ak način označevanja je hiter in pregleden, zato smo ga uporabili pri izdelavi podatkovne zbirke stavb. 
We trained eight different building-classification models using pretrained weights from the MS COCO and ImageNet (Deng et al., 2009) datasets. We used transfer learning, which Mask R-CNN supports, so that models trained on a user's own dataset generalise faster. The performance of the trained models was checked on a selected test sample of buildings in Slovenia, and each model was evaluated by computing evaluation metrics. We were mainly interested in how successfully Mask R-CNN detects buildings, how good the predicted building masks are, and whether the proposed method offers possibilities for further use.

2 STUDY AREA AND DATA DESCRIPTION

The key element of a successful deep-learning building-classification model is a well-prepared dataset with a large number of labelled training examples. The study area used to create the two building datasets is shown in Figure 1. It covers the Pomurje region and the north-eastern part of the Podravje region. The area of 1387 km² contained 98,425 registered buildings as of 28 March 2020. The building typology is mostly rural, with the exception of Murska Sobota, which has an urban building typology. The roofs are of various shapes and colours; red, brown and dark grey or black roofing prevails, and many training examples of roofs also contain solar panels. We chose the area because we know it well and because it contains a wide variety of roofs, which contributes to the richness of the building dataset.

Figure 1: The study area used to create the building dataset.

3 METHODS

In this section we present the methodology for creating the building dataset. We first define the building class, followed by the creation of the training samples and their review for the study area shown in Figure 1.

3.1 Definition of the building class

The purpose of the building cadastre is to register all buildings in the Republic of Slovenia. The terms building and part of a building are defined by the ZEN act (Uradni list RS, No. 47/06). The data kept on buildings and their parts are described in the Topographic Database (DTM) (GURS, 2020); the DTM is regulated by Article 11 of the ZDGRS act (Uradni list RS, No. 25/14 and 61/17). We define a building as an object that is permanently located in one place (Boguszewski et al., 2020). An example of a correct building outline that was retained is shown on the left side of Figure 2. Tall buildings are a problem, because the building mask obtained from the building cadastre does not coincide with the building in the image; such buildings were not included when preparing the dataset.

Figure 2: Examples of correct (left) and incorrect (right) building training samples.

3.2 Methodology for creating the building dataset

The steps for creating the building dataset for deep learning are shown in the flowchart in Figure 3; each step is described in more detail below.

Figure 3: Flowchart of the creation of the building dataset for deep learning.

The building cadastre used to create the building training samples was cleaned before the dataset was produced.
We excluded building outlines that, according to the generalised detailed land-use plan (MOP, 2020), lie in: (1) areas of production activities, (2) areas and networks of transport infrastructure, (3) areas of communication infrastructure and (4) areas of energy infrastructure. Before the exclusion there were 98,425 building outlines in the study area. We first removed 1271 building outlines, mainly large industrial buildings; these objects are too large to be adequately represented on a 128 × 128 pixel sample image. We further removed buildings that are spectrally similar to transport infrastructure, mainly roads, as well as transformer stations, whose outlines are too small for the algorithm to detect the mask and compute the image coordinates of the produced masks. Buildings in areas of energy, communication and transport infrastructure are atypical, and we wanted a dataset that is as balanced as possible and contains as few outliers as possible. From the reviewed and cleaned building outlines we computed new centroids and used them to prepare the sample images and binary masks.

3.3 Review of sample images and building masks

Manually reviewing the training examples improves the quality of the dataset, which leads to better building-detection results. The most typical errors in the building outlines are confusion with meadows, fields and shadows; the reason is an outdated building cadastre. For the study area with 98,425 building outlines we found 8226 such errors, which is 8% of all outlines. Incorrect building outlines were not used when creating the sample images and binary masks. After reviewing the produced sample images and binary masks we obtained 58,000 training examples for the building dataset. Some of the most typical errors are shown in Figure 4.

Figure 4: Examples of incorrect building outlines. The most typical errors are confusion with meadows, fields and shadows. Cases where the outline only partly overlaps the building are also unusable.

The dataset was split in the ratio 80/20, with 80% of the training examples used for training and 20% for validation. From the newly computed building centroids we then created building masks and sample images of 128 × 128 pixels, so that each building in the dataset is represented by a mask and the corresponding sample image. The datasets were then converted to the MS COCO format, which, as mentioned, uses JSON annotations for the training examples. In the first step we recorded the locations of all sample images and the corresponding binary masks for the building class; in the second step we converted the sample images and building masks into the JSON format used by the MS COCO dataset. The procedure is described in detail in Šanca (2020).

3.4 Transfer learning from pretrained neural networks

Neural networks for specific computer-vision tasks have already been built and trained on large datasets such as ImageNet (Deng et al., 2009) and MS COCO.
These are available to users, who can use the weights to train new models on their own dataset. This kind of training is called transfer learning. Transfer learning works by initialising the model with already learned weights so that the weights adapt better when training on another dataset. In the feature-extraction approach, only the new classifier in the network head is trained, so that the previously learned filter weights are optimised for new tasks on a different dataset; this makes model training considerably faster (Ramon et al., 2019). When training Mask R-CNN we can use pretrained weights from two large datasets, MS COCO and ImageNet.

MS COCO is the leading dataset for object detection and segmentation and contains annotations for several tasks: (1) object detection, (2) keypoint detection, (3) stuff segmentation, (4) panoptic segmentation and (5) image captioning. It contains 91 image categories, of which 82 have more than 5000 labelled image examples; in total, MS COCO contains 2.5 million labelled training instances on 328,000 images (Lin et al., 2014). ImageNet is considered the largest image dataset overall, containing more than 15 million manually labelled high-resolution images categorised into 22,000 classes according to the WordNet hierarchy. ImageNet supports: (1) object classification, (2) object detection and (3) single-object localisation (Deng et al., 2009).

Mask R-CNN consists of two parts: (1) a convolutional network for extracting features from images and (2) a network head for classification, bounding-box recognition and mask prediction, where the mask is predicted separately for each object or region of interest (Zhao et al., 2018). The backbone of Mask R-CNN combines two architectures: ResNet101, with 101 convolutional layers for object classification, and a Feature Pyramid Network (FPN), which extracts image features at different scales (Šanca, 2020). We used the same hyperparameters for training all models; only the learning rate was reduced by a factor of 10 for the models whose weights were fine-tuned. The hyperparameters used for training are given in Table 2; we set them on the basis of the reviewed previous studies. All hyperparameters listed in Table 2 are described in more detail in Šanca (2020).

Table 2: Hyperparameter values used for model training

| Hyperparameter | Value |
|---|---|
| image shape | 128 × 128 px |
| learning rate | 0.001 and 0.0001 |
| batch size | 1000 |
| steps per epoch | 1000 |
| validation steps | 50 |
| backbone architecture | ResNet-101 |
| anchor box sizes | (8, 16, 32, 64, 128) |
| number of proposed regions per image | 32 |
| total train time | 328 h |
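To make the mapping from Table 2 to an actual training configuration concrete, the sketch below shows how these hyperparameters could be expressed as a configuration class, assuming the open-source Matterport Keras/TensorFlow implementation of Mask R-CNN (the `mrcnn` package). The class name `BuildingConfig` is ours, the attribute names follow that implementation, and the values mirror Table 2; note that in this implementation the effective batch size is `GPU_COUNT * IMAGES_PER_GPU`, so the "batch size" entry of Table 2 does not map one-to-one.

```python
# A minimal sketch, assuming the Matterport Mask R-CNN implementation
# (https://github.com/matterport/Mask_RCNN); attribute names follow its Config class.
from mrcnn.config import Config

class BuildingConfig(Config):
    """Training configuration mirroring the hyperparameters in Table 2."""
    NAME = "buildings"                        # used to name log directories and checkpoints
    NUM_CLASSES = 1 + 1                       # background + the single 'building' class
    BACKBONE = "resnet101"                    # backbone architecture (Table 2)
    IMAGE_MIN_DIM = 128                       # sample images are 128 x 128 px
    IMAGE_MAX_DIM = 128
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)  # anchor box sizes (Table 2)
    TRAIN_ROIS_PER_IMAGE = 32                 # proposed regions per image (Table 2)
    LEARNING_RATE = 0.001                     # reduced to 0.0001 when fine-tuning all layers
    STEPS_PER_EPOCH = 1000
    VALIDATION_STEPS = 50
    GPU_COUNT = 1                             # single NVIDIA GeForce RTX 2080 SUPER
    IMAGES_PER_GPU = 1

config = BuildingConfig()
config.display()                              # prints the full configuration for inspection
```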
The models were trained on a computer with the following specifications:
– CPU: Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz,
– memory: 64 GB RAM, DDR4 (4 × 16 GB DIMM DDR4 Synchronous 2400 MHz),
– graphics card: NVIDIA GeForce RTX 2080 SUPER.

Figure 5: Mask R-CNN architecture for the case of building detection.

In the first step, the input image passes through a convolutional neural network (CNN) that extracts features. In the second step, the region proposal network proposes candidate regions on the input image, from which predictions are generated for the class defined in the dataset. Higher-level information is pooled with the RoIAlign layer, which acts as a pooling layer, and the processing then splits into two branches. In the first branch, the RoIAlign output passes through fully connected layers, which predict the object class and compute the bounding-box regression. In the second branch, the RoIAlign output passes through the convolutional layers of the network head and then through the remaining layers of the network, producing a pixel mask for each region of interest, so that the image is segmented pixel by pixel according to the recognised class. In this research we compare the performance of eight trained building-classification models, presented in Table 3.

Table 3: Models trained on the NIR-R-G and R-G-B datasets

| Model | Dataset | Layers trained | Epochs | Pretrained weights | Learning rate | Training time [h] |
|---|---|---|---|---|---|---|
| M1 | NIR-R-G | head layers | 100 | MS COCO | 0.001 | 39 |
| M2 | NIR-R-G | head layers | 100 | ImageNet | 0.001 | 40 |
| M3 | NIR-R-G | all layers | 200 | MS COCO | 0.0001 | 42 |
| M4 | NIR-R-G | all layers | 200 | ImageNet | 0.0001 | 44 |
| M5 | R-G-B | head layers | 100 | MS COCO | 0.001 | 37 |
| M6 | R-G-B | head layers | 100 | ImageNet | 0.001 | 39 |
| M7 | R-G-B | all layers | 200 | MS COCO | 0.0001 | 43 |
| M8 | R-G-B | all layers | 200 | ImageNet | 0.0001 | 44 |

The number of epochs is the number of passes of the entire dataset through Mask R-CNN. For models M1, M2, M5 and M6 we trained the head layers of the network; for models M3, M4, M7 and M8 we trained all layers. The learning rate is the step size used when computing the gradient of the loss function, which is iteratively minimised during training; an appropriate learning rate is chosen to avoid underfitting or overfitting of the weights (Šanca, 2020). The training procedure is shown in Figure 6.

Figure 6: Flowchart of model training for building detection.

Models M3, M4, M7 and M8 are fine-tuned models trained for a further 100 epochs after training the network head. Building detection was then performed separately with each model. A sketch of this two-stage training procedure is given below.
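The following sketch illustrates how the two training stages from Table 3 and Figure 6 could be run, again assuming the Matterport `mrcnn` implementation and the `BuildingConfig` class sketched above. The dataset objects, file paths and the choice of COCO weights are illustrative assumptions, not the exact scripts used in this research.

```python
# A minimal, hedged sketch of the two-stage transfer-learning procedure
# (heads first, then fine-tuning all layers at a 10x lower learning rate),
# assuming the Matterport Mask R-CNN implementation and the BuildingConfig above.
import mrcnn.model as modellib

# dataset_train and dataset_val are assumed to be mrcnn.utils.Dataset objects
# built from the 80/20 split of the MS COCO-style building annotations.
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# Initialise from pretrained MS COCO weights (an ImageNet checkpoint could be used
# instead); the class-specific output layers are excluded because our dataset
# has a single 'building' class.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Stage 1 (models M1/M2/M5/M6): train only the network heads for 100 epochs.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,        # 0.001
            epochs=100, layers="heads")

# Stage 2 (models M3/M4/M7/M8): fine-tune all layers for a further 100 epochs
# (up to epoch 200) with the learning rate reduced by a factor of 10.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,   # 0.0001
            epochs=200, layers="all")
```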
4 RESULTS AND EVALUATION

The success of the knowledge transfer was assessed on the basis of the total loss function during training. The total validation loss was not used to assess model stability, because in a large share of the training pairs the building outline from the building cadastre is inconsistent; a trained model then correctly predicts a building where no outline exists, which the validation process counts as a misclassified object. We therefore decided to validate model performance by computing evaluation metrics (Section 4.2). Model performance for building classification was checked on a selected test area outside the area of the training dataset. We selected 300 examples for testing the trained models, taking into account the diversity of roofs, in particular roof colour and shape, the presence of solar panels on the roof, shadows, etc. We first compare the trained models with each other and then compare the results of M3 and M4 with the building cadastre. The building-detection results are presented in figures comparing the predictions of all models trained on NIR-R-G, and the obtained building outlines are compared with the building cadastre.

4.1 Loss functions after transfer learning and model comparison

An example of building detection with the pretrained MS COCO (R-G-B) model that we used to initialise our weights for training on our own dataset is shown in Figure 7. As expected, building detection without knowledge transfer to the created building dataset fails: the basic MS COCO or ImageNet model only proposes arbitrary bounding boxes.

Figure 7: Example of building detection with the basic MS COCO model, not yet trained on the building dataset.

Figure 8: Training loss functions for M1 and M2 on the NIR-R-G dataset. For both models the head layers of the network were trained for 100 epochs.

The total Mask R-CNN loss is the sum of the classification loss, the bounding-box loss and the mask loss. For building detection we are most interested in the mask loss, as it measures the accuracy of the predicted building mask. Figure 8 shows that after knowledge transfer the total loss of the MS COCO model is marginally better than that of the ImageNet model; the same holds for the individual classification, regression and mask losses. During training the loss is minimised and stabilises towards the end of training (after about epoch 65), which means there is no need for longer training. The class loss of Mask R-CNN changes little during training, because the building dataset contains only a single class to be predicted. Only the losses of the training phase are shown. During fine-tuning of the weights the loss values decrease further, except for the mask loss, which stabilises at the very beginning and no longer improves; for detection this means that the outlines of the recognised buildings do not change noticeably.

Figure 9: Training loss functions for M3 and M4 on the NIR-R-G dataset. For both models all layers of the network were trained for 200 epochs.
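For context on how predictions such as those shown in Figures 10 to 15 can be produced once a model is trained, the sketch below runs detection on a single test tile, again assuming the Matterport `mrcnn` implementation and the `BuildingConfig` class above. The checkpoint path and the test-tile file name are illustrative assumptions.

```python
# A minimal sketch of running building detection with one trained model,
# assuming the Matterport Mask R-CNN implementation and the BuildingConfig above.
import skimage.io
import mrcnn.model as modellib

class InferenceConfig(BuildingConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1        # detect one 128 x 128 tile at a time

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs/")
model.load_weights("logs/mask_rcnn_buildings.h5", by_name=True)   # e.g. a checkpoint of model M3

image = skimage.io.imread("test_tiles/tile_0001.png")             # hypothetical test tile
results = model.detect([image], verbose=0)
r = results[0]
# r["rois"]: bounding boxes, r["masks"]: per-building binary masks,
# r["class_ids"]: predicted classes, r["scores"]: detection confidences
print(len(r["rois"]), "buildings detected")
```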
Figure 10: Comparison of the performance of the trained models M1, M3, M2 and M4.

Figure 11: Building-detection results with M3 and M4 and comparison with the building cadastre.

Figure 12: Building-detection results with M3 and M4 and comparison with the building cadastre.

The building-detection results of the models trained on NIR-R-G are shown in Figures 10, 11 and 12. Models M3 and M4 detect smaller buildings, but the outlines of the detected buildings remain very similar to those of M1 and M2. Fine-tuning the weights by training all layers of Mask R-CNN proved important for improving the prediction accuracy, but it is not crucial for improving the outlines of the detected buildings compared to the outlines obtained by training only the network head.

Figure 13 shows the training loss functions of M5 and M6, trained on R-G-B, and Figure 14 the loss functions of M7 and M8, also trained on R-G-B. The loss functions of the R-G-B models are approximately the same as those of the NIR-R-G models.

Figure 13: Training loss functions for M5 (green) and M6 (blue) on the R-G-B dataset. For both models the head layers of the network were trained for 100 epochs.

We were also interested in the difference in performance between the models trained on the NIR-R-G and R-G-B datasets. The building-detection results of the R-G-B models are shown in Figure 15. The differences between models M1, M2 (NIR-R-G) and models M5, M6 (R-G-B) are minimal, also in the recognition of building outlines; the same holds for the fine-tuned models M3, M4 and M7, M8. Compared to the NIR-R-G orthophotos, the building-detection results on R-G-B are slightly better, because the base MS COCO model was trained on everyday colour (R-G-B) images. In both cases, transferring knowledge from pretrained MS COCO or ImageNet weights proves to be an efficient solution compared to training models from scratch.

Figure 14: Training loss functions for M7 (green) and M8 (blue) on the R-G-B dataset. For both models all layers of the network were trained for 200 epochs.

Figure 15: Example of building detection with M7 on an R-G-B orthophoto.

4.2 Performance of the trained models

The performance of trained classification models in machine learning is evaluated on test sample images. Building classification is a case of binary classification, in which the prediction performance is presented with a confusion matrix.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP – true positive | FN – false negative |
| Actual negative | FP – false positive | TN – true negative |

Correctly classified examples are the true positives (TP) and true negatives (TN); misclassified examples are the false negatives (FN) and false positives (FP). The prediction results are used to compute the evaluation metrics with which we assess the performance of the building-detection models. The equations for the evaluation metrics follow Fetai et al. (2021).

Accuracy is the share of correct predictions among all predictions of the model; for binary classification the equation simplifies to:

\[ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{\text{number of all test examples}} \qquad (1) \]

Recall is the share of correctly predicted examples among all positive examples:

\[ \text{recall} = \frac{TP}{TP + FN} \qquad (2) \]

The F1 score combines accuracy and recall into a single measure and is used when a single value of model performance is required:

\[ F_1 = \frac{2 \cdot \text{accuracy} \cdot \text{recall}}{\text{accuracy} + \text{recall}} \qquad (3) \]

For the trained models we counted the correctly and incorrectly detected buildings and computed the evaluation metrics. The results are shown in Table 4.

Table 4: Evaluation of building-classification performance by model

| Model | Dataset | Weights | TP | TN | FP | FN | Accuracy | Recall | F1 score |
|---|---|---|---|---|---|---|---|---|---|
| M1 | NIR-R-G | MS COCO | 246 | 0 | 38 | 16 | 0.8200 | 0.9389 | 0.8754 |
| M2 | NIR-R-G | ImageNet | 243 | 0 | 43 | 14 | 0.8100 | 0.9455 | 0.8725 |
| M3 | NIR-R-G | MS COCO | 296 | 0 | 3 | 1 | 0.9867 | 0.9966 | 0.9916 |
| M4 | NIR-R-G | ImageNet | 294 | 0 | 4 | 2 | 0.9800 | 0.9932 | 0.9866 |
| M5 | R-G-B | MS COCO | 249 | 0 | 36 | 15 | 0.8300 | 0.9432 | 0.8830 |
| M6 | R-G-B | ImageNet | 244 | 0 | 41 | 15 | 0.8133 | 0.9421 | 0.8730 |
| M7 | R-G-B | MS COCO | 298 | 0 | 1 | 1 | 0.9933 | 0.9967 | 0.9950 |
| M8 | R-G-B | ImageNet | 296 | 0 | 1 | 2 | 0.9867 | 0.9933 | 0.9900 |

The trained models are very similar in building-detection performance. The models trained for 200 epochs (M3, M4 and M7, M8) are more successful above all at recognising precise building outlines and at detecting buildings at the edges of the sample images. M1, M2 and M5, M6 do not detect small buildings or buildings partly covered by vegetation. A sketch of how these metrics can be computed from the confusion-matrix counts is given below.
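As a check on equations (1) to (3), the short sketch below computes the three metrics from the confusion-matrix counts. The function name is ours, and the example values are the published counts for model M3 from Table 4.

```python
# A minimal sketch of equations (1)-(3); evaluate_detection is a hypothetical helper name.
def evaluate_detection(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, recall and F1 score as defined in equations (1)-(3)."""
    total = tp + tn + fp + fn                          # number of all test examples
    accuracy = (tp + tn) / total                       # equation (1)
    recall = tp / (tp + fn)                            # equation (2)
    f1 = 2 * accuracy * recall / (accuracy + recall)   # equation (3), as used in Table 4
    return {"accuracy": accuracy, "recall": recall, "F1": f1}

# Counts for model M3 (NIR-R-G, MS COCO weights) from Table 4.
print(evaluate_detection(tp=296, tn=0, fp=3, fn=1))
# -> approximately {'accuracy': 0.9867, 'recall': 0.9966, 'F1': 0.9916}
```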
5 CONCLUSION AND DISCUSSION

In this article we presented the building-detection procedure with Mask R-CNN, from creating the dataset to detecting building outlines for eight models. The models were trained with transfer learning from pretrained weights of the MS COCO and ImageNet datasets. We compared the building-classification performance when using R-G-B orthophotos and colour near-infrared NIR-R-G orthophotos. The dataset created for building detection with deep learning offers possibilities for further use, as it can be extended to the whole country and might in the future be used for discovering and maintaining building data. The obtained results confirm that Mask R-CNN is useful and suitable for building classification, as related studies also claim (Ji et al., 2019, and Zhou et al., 2019).

To our knowledge, this is the first case of building detection with deep learning in Slovenia and also the first case of transferring knowledge from the MS COCO and ImageNet datasets, which contain only R-G-B images of everyday objects, to a building dataset created from R-G-B and NIR-R-G orthophotos, whose images are captured from an entirely different perspective. The success of transferring learning from one domain to another demonstrates the flexibility of such networks. On this basis we can state that transfer learning is an efficient method for updating models trained on different data. Training models from scratch is a lengthy process that allows better models to be built; applying transfer learning to such models with new data of the same colour space would enable fast and more efficient model updating and, above all, faster training.

Based on the experience gained, we have several suggestions for improving further research. The first improvement is the creation of detected building masks with more regular shapes using an outline-detection algorithm, as reported by Zhao et al. (2018). The detected building masks can then be vectorised, enabling integration into a geographic information system. The dataset can be upgraded by combining the colour near-infrared orthophoto with a normalised digital surface model (nDSM), which would separate roofs from the terrain already during data preparation. The dataset can also be extended to contain the buildings of all of Slovenia, so that the model would be trained on the whole country.

ACKNOWLEDGEMENTS:

The research was partly carried out within the applied research project L2-1826, co-financed by the Slovenian Research Agency, the Surveying and Mapping Authority of the Republic of Slovenia and the Ministry of Defence, and within the research programme P2-0406 and project J2-9251, financed by the Slovenian Research Agency. We also thank the Department of Civil Engineering at Høgskulen på Vestlandet for partly co-financing the research.

Literature and references: See p. 575.

Simon Šanca, mag. inž. geod. geoinf.
Høgskulen på Vestlandet
Inndalsveien 28, NO-5063 Bergen, Norway
e-mail: simon.sanca@hvl.no

prof. dr. Krištof Oštir, univ. dipl. inž. fiz.
Univerza v Ljubljani, Fakulteta za gradbeništvo in geodezijo
Jamova cesta 2, SI-1000 Ljubljana, Slovenia
e-mail: kristof.ostir@fgg.uni-lj.si

Alen Mangafić, mag. prost. načrt.
Geodetski inštitut Slovenije
Jamova cesta 2, SI-1000 Ljubljana, Slovenia
e-mail: alen.mangafic@gis.si

Šanca S., Oštir K., Mangafić A. (2021). Zaznavanje stavb z uporabo nevronskih mrež, učenih s prenosom znanja. Building detection with convolutional networks trained with transfer learning. Geodetski vestnik, 65 (4), 559-593. DOI: https://doi.org/10.15292/geodetski-vestnik.2021.04.559-593