https://doi.org/10.31449/inf.v48i14.6145

Classification of Pulmonary Diseases Using a Deep Learning Stacking Ensemble Model

Ruaa N. Sadoon*, Adala M. Chaid
College of Computer Science & Information Technology, University of Basrah, Basrah, Iraq
E-mail: pgs.ruaa.nabeel@uobasrah.edu.iq, adala.gyad@uobasrah.edu.iq
*Corresponding author

Keywords: medical imaging, deep learning, CNN, COVID-19, pulmonary pathologies, image classification

Received: May 2, 2024

This paper presents our research in medical imaging diagnostics, focusing on countering the devastating impact of the COVID-19 pandemic and of numerous pulmonary pathologies. Using new deep-learning approaches and techniques, we aim to create an advanced classification tool able to capture complex patterns and features in chest image data. The paper introduces state-of-the-art strategies, such as stacked ensemble models, transfer learning, and artificial neural networks, to build a model with high precision, recall, F1-score, and accuracy. The core idea of our research is to combine different convolutional neural network architectures to bring together their best feature-extraction and classification qualities. The combination of DenseNet, Xception, and Inception achieves the best performance and provides the most reliable classification tool. We also use transfer learning to train our model quickly and optimize generalization, making it suitable for the detection of multiple pulmonary pathologies, including COVID-19. Our model further includes an artificial neural network, trained as a meta-learner, which processes the outputs of the CNNs to make the final classification decisions. We have thoroughly validated and optimized the meta-learner to improve the model's accuracy on diagnostic images. The paper thus proposes a successful merger of cutting-edge deep-learning methodologies and image-processing algorithms with the specifics of the medical imaging domain. We aim to advance the field of pulmonary disease diagnosis with our model, offering medical institutions a reliable tool against the current and future threats and challenges posed by COVID-19.

Povzetek: An advanced model for the classification of pulmonary diseases is developed using a stacked deep-learning model that combines CNN architectures such as DenseNet, Xception, and Inception. The model achieves high accuracy in detecting diseases, including COVID-19.

1 Introduction

The emergence of coronavirus disease 2019 in late 2019 sparked a global health crisis unlike any other [1]. Advances in diagnostic methods are vitally important, since this highly contagious disease [2] has overwhelmed healthcare systems globally and severely disrupted everyday life [3]. Even prior to COVID-19, a number of pulmonary diseases, including pneumonia, lung opacities, and effusions, posed diagnostic difficulties that required precise diagnosis [4]. Notably, significant diagnostic challenges are intrinsic to common pulmonary illnesses such as pneumonia, pleural effusion, pulmonary nodules, and pneumothorax [5]. Differential diagnosis is crucial, since pneumonia can present with symptoms resembling COVID-19 [6]. Pleural effusion frequently obscures or mimics other lung diseases, making it more difficult to interpret images from various angles [7]. To rule out benign or malignant disorders, pulmonary nodules typically require complicated, high-resolution imaging [8].
Although pneumothorax directly endangers patients, it can be challenging to differentiate its symptoms from those of other acute chest disorders [9]. Because its symptoms overlap with those of COVID-19 [10], nuanced detection of pulmonary diseases requires advanced identification approaches built on deep learning models and precise methodologies.

Many facets of contemporary life have been reshaped by deep learning, a subclass of machine learning characterized by automatic feature extraction and image classification [11]. Deep learning has significantly changed how the healthcare industry is thought about [12]. By examining medical images, models can now be created that accurately predict or categorize particular diseases [13]. Promising results have been obtained from deep learning techniques for the diagnosis of a number of illnesses, including brain tumors, liver disease, colon cancer, breast cancer, lung cancer, pneumonia, and, most recently, COVID-19 [14]. Deep learning automatically transforms features using non-linear functions, resulting in high accuracy with less human interaction than classical machine learning, which needs manual feature engineering [15]. The extraction of valuable characteristics improves as the network becomes deeper, because more abstract data representations are learned [16].

Recent scientific achievements demonstrate how widely deep learning is used to identify and treat COVID-19 [17]. For COVID-19 analysis, chest X-rays and CT scans are frequently used imaging modalities [18, 19]. Although COVID-19-positive cases have been categorized using X-rays, interest in CT-scan-based diagnosis is growing [19]. Convolutional neural networks have been used to examine lung dataset instances to classify COVID-19 cases [20]. Evidence suggests that chest X-rays, although less helpful in the early stages of COVID-19 before symptoms appear, remain valuable for the differential diagnosis of other serious conditions, including pneumonia and lung cancer [19]. This means that in difficult circumstances, radiologists need help from automated diagnostic tools. Consequently, we used deep learning to address problems related to pulmonary diseases and the COVID-19 pandemic [10, 21]. To capitalize on the capabilities of different deep learning techniques, such as DenseNet, Xception, and Inception, we have developed a unique strategy comprising a stacked ensemble model [21].

A thorough analysis of the use of transfer learning and convolutional neural networks (CNNs) for medical imaging applications is presented in [22]. It highlights the significant improvements CNNs have demonstrated in image analysis and classification, and how transfer learning, i.e., reusing previously trained CNN models, can alleviate issues arising from small datasets and computing limitations. We have innovated by using an artificial neural network (ANN) as the meta-learner that supervises the ensemble model, which goes beyond a plain ensemble of deep learning architectures [23]. In particular, our ANN serves as the final arbiter: it combines the predictions made by each component CNN and categorizes the provided medical images. First, our model is highly flexible and capable of sound decision-making because of the ANN's capacity to identify complex patterns and relationships in the ensemble's predictions.
Second, we have optimized the ANN's parameters to maximize its performance through intensive training and validation, guaranteeing that our multi-class classification system reaches high levels of diagnostic accuracy and dependability [10]. Our research essentially constitutes a singular amalgamation of cutting-edge deep learning, image processing, and medical imaging domain knowledge. By combining the advantages of ensemble models, transfer learning, and ANNs, we can substantially improve the patient-centered clinical workflow for the diagnosis and treatment of pulmonary diseases, including but not limited to COVID-19 and pneumonia.

2 Related works

Medical imaging featuring deep learning represents one of the most promising advancements in the field, particularly concerning COVID-19 detection. Different papers have taken numerous angles and used separate datasets; however, the reported accuracy has invariably been compelling.

The authors in [24] presented an innovative lung opacity detection and classification approach, which is significantly important to physicians because of the non-reversible consequences of an inaccurate diagnosis or confusion with other diseases. To this end, the authors present a three-channel fusion CNN model, using the MobileNetV2, InceptionV3, and VGG19 networks for the three channels, with ResNet used for transferring features. The classifier has shown promising accuracy in lung opacity classification on different datasets. For the new dataset, the model reports accuracies of 92.52%, 92.44%, 87.12%, and 91.71% for two, three, four, and five classes, respectively. A comparison with previous research indicates the potential of the model: it can significantly reduce the burden and costs for physicians who use image datasets for lung opacity classification.

The COVID-19 epidemic has severely damaged economies and healthcare systems throughout the world, underscoring the urgent need for accurate and quick diagnosis techniques in the fight against the illness [25]. Current methods of clinical diagnosis have significant shortcomings because they are very subjective and vary from patient to patient. That article suggests a novel multi-classification method based on a machine learning framework to overcome these restrictions. In particular, it presents BDCNet, a novel method for classifying COVID-19, pneumonia, and lung cancer from chest radiographs by combining VGG-19 and convolutional neural networks (CNNs). The goal of the suggested approach is to offer a consistent and objective diagnostic tool for distinguishing between various lung diseases. Notably, this was the first study to diagnose these three chest diseases using a single deep learning model. Results indicate that BDCNet outperforms four well-known pre-trained models, achieving an accuracy of 99.10%, recall of 98.31%, precision of 99.9%, and F1-score of 99.09%. These findings highlight the potential of BDCNet to significantly aid diagnostic radiographers and healthcare experts in accurately identifying and managing chest diseases, thus contributing to improved patient outcomes and healthcare efficiency.

The authors in [26] addressed the critical challenge of accurately diagnosing COVID-19 and other chest disorders amidst their overlapping symptoms, which can potentially mislead clinical professionals.
To tackle this, the researchers developed and evaluated a multi-classification deep learning model called CDC Net, leveraging convolutional neural network (CNN) techniques with residual network concepts and dilated convolution. Employing publicly available benchmark data, they pioneered the use of a single deep learning model to diagnose five distinct chest ailments, namely COVID-19, lung cancer, pneumothorax, tuberculosis, and pneumonia, from chest X-ray images. Remarkably, CDC Net achieves an exceptional AUC of 0.9953, with an accuracy of 99.39%, a recall of 98.13%, and a precision of 99.42% in identifying the various chest diseases. Comparative analysis with three CNN-based pre-trained models further underscores the superior performance of the proposed model, highlighting its potential as a highly accurate diagnostic tool for chest diseases. Moreover, statistical analyses confirm the robustness of the proposed model, affirming its reliability and effectiveness in clinical settings.

The authors in [27] proposed a new chest X-ray classification method to distinguish COVID-19 from pneumonia caused by common viruses, addressing the problem that patients with COVID-19 can be hard to differentiate from those with other chest disorders. The model is a CNN that combines a pre-trained EfficientNetB0 network with a dense layer. It achieved a high accuracy of over 95% for two classes and 93% for three classes, outperforming existing models while offering benefits such as fewer parameters and a robust dataset split. Through meticulous methodological design, including data augmentation and fine-tuning, the study demonstrates the potential of CNN-based models in enhancing the accuracy of COVID-19 diagnosis from chest X-ray images, thereby supporting clinicians in making more informed diagnostic decisions.

The authors in [28] addressed the urgent need for accurate diagnosis of COVID-19 amidst the global pandemic, proposing a deep learning-based approach to differentiate COVID-19 patients from those with viral pneumonia, bacterial pneumonia, and healthy cases. Utilizing deep transfer learning, the study experimented with binary and multi-class datasets across four categories, comprising a total of 6,674 X-ray images. Nine convolutional neural network architectures were employed, among which Se-ResNeXt-50 achieved the highest classification accuracy of all pre-trained models: 99.32% for binary classification and 97.55% for multi-class classification. By leveraging automated methods and sophisticated CNN architectures, the proposed system demonstrates promising performance in accurately diagnosing COVID-19, contributing to the ongoing efforts to combat the spread of the disease.

The authors in [29] presented a novel multi-level diagnostic framework aimed at accurately detecting COVID-19 using X-ray scans, offering a promising alternative to the conventional RT-PCR method. The framework consists of three phases: pre-processing to clean noise and resize the images, feature extraction using a deep learning architecture with a pre-trained Xception model, and classification. It incorporates global average pooling to counter overfitting, an activation layer to help reduce the loss, and softmax for the final classification.
The proposed model was tested on a benchmark dataset from Kaggle containing 7,395 images from three classes (COVID-19, normal, and pneumonia) and showed exceptional results: an accuracy of 99.3% with a negligible loss of 0.02, using the LeakyReLU activation and the RMSprop optimizer. Achieving 99% sensitivity and specificity with an F1-score of 99.3% in just 10 epochs at a learning rate of 10⁻⁴ indicates the efficiency of the proposed framework in identifying COVID-19 accurately from X-ray images, making it more efficient than existing studies and traditional pre-trained deep learning models.

In [30], five pre-trained AI models were applied to improve brain tumor classification, attaining 95-97% accuracy on unseen images across three datasets. Data augmentation improved model performance, potentially aiding early tumor identification and lowering impairments. Machine learning and deep learning algorithms were used to classify chest CT scans as COVID-19 positive or negative [31]. The study found that the ResNet50V2 transfer learning approach performed best on the larger dataset, with 97.52% accuracy, showing its potential for quick COVID-19 diagnosis in real life.

The authors in [15] presented a Multi-task Multi-slice Deep Learning System (M3 Lung-Sys) tailored for screening multi-class lung pneumonia from CT imaging. To cope with limited training cases and resources, M3 Lung-Sys consists of two 2D CNN networks dedicated to slice-level and patient-level classification. By leveraging CT slices for feature extraction and refining temporal information across slices, the system effectively distinguishes COVID-19 from healthy, H1N1, and CAP cases while also locating relevant lesion areas without pixel-level annotation. Extensive experiments conducted on a chest CT dataset demonstrate the superior performance of M3 Lung-Sys, achieving an accuracy of 95.21% with minimal false positive and false negative errors. Notably, the system exhibits high sensitivity and specificity for COVID-19 and H1N1 detection, outperforming existing models. Although oversensitivity to noise is observed in healthy cases, the interpretability and value of the system to clinicians are underscored by its robust performance and lesion location mapping capabilities. Overall, M3 Lung-Sys offers a promising solution for accurate and interpretable multi-class lung pneumonia screening from CT imaging, particularly in the context of the COVID-19 pandemic.

The authors in [16] used computed tomography (CT) and chest X-ray imaging modalities to meet the critical requirement for quick and reliable disease identification in the midst of the global COVID-19 epidemic. Understanding the shortcomings of RT-PCR testing and the potential of imaging methods, particularly in areas where epidemics are prevalent, the study investigates the use of machine learning (ML) to support disease diagnosis. The research developed a deep neural network model, specifically a 24-layer CNN, capable of binary (COVID vs. NON-COVID) and multi-class (COVID vs. NON-COVID vs. Pneumonia) classification from X-ray and CT images. Through extensive experimentation, the proposed method achieves remarkable accuracy rates of 99.68% and 71.81% on X-ray and CT images, respectively, demonstrating its efficacy in aiding rapid and effective COVID-19 detection.
Utilizing the SGDM optimizer with a learning rate of 0.001 contributes to the robust performance of the model across both datasets, showcasing its potential as a valuable tool in combating the pandemic.

The authors in [17] suggested an automated COVID-19 detection method utilizing artificial intelligence (AI) technology in response to the worldwide COVID-19 pandemic and the load on healthcare infrastructure. The goal of the study is to accurately identify COVID-19 from normal chest X-ray images, and also to distinguish COVID-19 from non-COVID viral pneumonia and lung opacity. Three pre-trained models, namely Xception, VGG19, and ResNet50, are used and assessed on a benchmark dataset with 21,165 X-ray images. Initially, a binary classification model for COVID-19 detection is implemented, and the models achieve high accuracy levels: 97.5%, 97.5%, and 93.3% for Xception, VGG19, and ResNet50, respectively. Then a multi-class classification model is created, with accuracies of 93%, 92%, and 75% for Xception, VGG19, and ResNet50, respectively. In particular, the Xception model demonstrates higher precision, recall, and F1-scores, which shows its suitability for such tasks. Explainable AI is added to increase interpretability; it enables a visual representation of the model's predictions and the reasoning behind them. This is done to restore the confidence of medical units in AI and to support the application of AI in clinical decision-making. Overall, the study represents a significant development in the domain of automated COVID-19 detection and introduces a helpful, accurate, and interpretable solution for application throughout the world. The comparative results related to this study are presented in Table 1.

Table 1: Comparative table of related works

Reference | Method | Image type | Accuracy
[24] | MobileNetV2, InceptionV3, VGG19, ResNet | Lung opacity | Two classes: 92.52%; three classes: 92.44%; four classes: 87.12%; five classes: 91.71%
[25] | BDCNet (combining VGG-19 and convolutional neural networks) | Chest radiographs | 99.10%
[26] | CDC Net (multi-classification deep learning model) | Chest X-ray images | 99.39%
[27] | CNN combining a pre-trained EfficientNetB0 network with a dense layer | Chest X-ray images | Two classes (COVID-19 vs. other viral pneumonias): 95%; three classes (COVID-19 vs. other viral pneumonias vs. other chest disorders): 93%
[28] | CNN architectures, including Se-ResNeXt-50 | X-ray images | Binary classification: 99.32%; multi-class classification: 97.55%

When comparing deep learning models for medical image classification, accuracy differs significantly depending on the model architecture and the difficulty of the classification task. The authors in [24] used a mixture of MobileNetV2, InceptionV3, VGG19, and ResNet to detect lung opacity in X-ray images, with accuracies ranging from 87.12% to 92.52% across two to five classes. Other research [25, 26, 28] shows better performance with other convolutional neural network designs. For example, CDC Net [26] and BDCNet [25], which concentrate on chest X-ray images and chest radiographs, respectively, yield accuracies higher than 99%, suggesting a more successful method for binary categorization.
In contrast to the research in [25, 26, 28], another study [27] uses a hybrid model that combines a dense layer with a pre-trained EfficientNetB0 and reports somewhat lower accuracies for two- and three-class classification (95% and 93%, respectively). This implies that although the procedures in [25, 26, 28] are quite successful for binary and multi-class classification, the hybrid method in [27] provides slightly lower accuracy. These variations highlight how model architecture and training methods affect classification accuracy in medical image analysis.

The application of deep learning in medical imaging, especially for COVID-19 detection, marks significant progress in diagnostic methodologies. While various models demonstrate high accuracy, existing approaches still present limitations that our research aims to address.

Three-channel fusion CNN models [24]: While this model achieves impressive accuracy levels of up to 92.52% for lung opacity classification, its performance decreases to 87.12% as the number of classes increases, indicating a potential drop in effectiveness with complex classifications. Our work extends these efforts by employing a stacking ensemble model that not only maintains high accuracy across an increased number of classes but also integrates more diverse CNN architectures to stabilize performance across varied diagnostic scenarios.

BDCNet [25]: This model classifies COVID-19, pneumonia, and lung cancer excellently, with an accuracy of 99.10%. Although it demonstrates robustness, it focuses on a limited array of diseases. Our approach includes a broader spectrum of pulmonary pathologies, enhancing the utility of deep learning models in more diverse clinical settings.

CDC Net [26]: Achieving an AUC of 0.9953, this model is highly accurate. However, it primarily employs traditional CNN architectures with residual networks. Our model introduces an artificial neural network as a meta-learner to further refine the diagnostic process, aiming for a nuanced understanding and integration of the features extracted by the base learners.

EfficientNetB0 hybrid models [27]: With accuracies of 95% and 93% for binary and three-class classification, respectively, this model shows a reduction in performance in more complex class scenarios. Our methodology leverages a meta-learning approach that maintains high accuracy even as classification complexity increases.

Ensemble and meta-learning approaches: Most existing studies utilize single-model systems that may not capture all nuances in complex image data. Our research introduces an ensemble of multiple advanced models (DenseNet, Xception, Inception) with a meta-learner that synergistically improves prediction accuracy and robustness, addressing the gap left by existing single-model systems.

By incorporating these advancements, our work contributes to the field by providing a comprehensive and adaptable solution that enhances the detection and classification of a wide range of pulmonary diseases, not limited to COVID-19 but extending to other less commonly addressed conditions such as pneumothorax and pulmonary fibrosis. This holistic approach is crucial for deploying deep learning effectively in real-world clinical settings, where the diversity of pathology presentation demands robust and flexible diagnostic systems.
3 Methodology

In our methodology, we begin with a comprehensive dataset acquisition process that includes a variety of medical imaging data pertinent to pulmonary conditions such as COVID-19, pneumonia, lung opacities, and effusions. The first step consists of extensive preprocessing and exploratory data analysis (EDA) to guarantee the data's quality and suitability for machine learning; in particular, this involves resizing images, normalizing pixel values, and handling any missing or null data. Thereafter, we train several deep learning models, such as DenseNet and two variants of Inception. These models were chosen for their success in feature detection in complex image data, with the network parameters set to capture the specific characteristics that distinguish the various pulmonary conditions. Once the training phase has been conducted, we assess each model's performance using key metrics, namely accuracy, sensitivity, specificity, and F1-score, which establish each model's ability to classify the medical images accurately on its own. Based on this assessment, we perform feature extraction, retaining from the outputs of the base learners those features that carry the salient characteristics necessary for accurate classification. Afterwards, we feed these features into an ensemble model. We propose a meta-model, an artificial neural network acting as the decision layer, that harmonizes the insights from the DenseNet and the two Inception models via a stacking algorithm. This model combines the strengths of each base learner to enhance classification accuracy and robustness. Finally, we evaluate the ensemble meta-model using the same four metrics of accuracy, sensitivity, specificity, and F1-score. This step measures the improvement in performance and the reliability of the new multi-class classification system. The resulting system is fit for clinicians and will thus help manage and treat pulmonary ailments better. Our methodology is summarized in Figure 1.

Figure 1: Methodology
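To make the stacking step of Figure 1 concrete, the following is a minimal sketch of an ANN meta-learner over base-CNN predictions. This is our illustration, not the authors' released code: the layer sizes, helper names, and training settings are assumptions, while the overall pattern (concatenated base-learner outputs fed to an ANN decision layer) follows the description above.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # assumption: the ten classes reported in Section 5

def build_meta_learner(num_base_models: int,
                       num_classes: int = NUM_CLASSES) -> tf.keras.Model:
    """ANN meta-learner: consumes the concatenated softmax outputs of the
    base CNNs and emits the final class decision."""
    model = models.Sequential([
        layers.Input(shape=(num_base_models * num_classes,)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def stack_predictions(base_models, images):
    """Feature-extraction step: each base learner's class probabilities
    become the meta-features fed to the meta-learner."""
    preds = [m.predict(images, verbose=0) for m in base_models]
    return np.concatenate(preds, axis=1)

# Usage, given trained DenseNet/Inception/Xception base models:
# meta_x = stack_predictions(base_models, x_train)
# meta = build_meta_learner(len(base_models))
# meta.fit(meta_x, y_train, validation_split=0.2, epochs=50)
```

Stacking on class probabilities (rather than raw feature maps) keeps the meta-learner small and lets it learn which base model to trust for which class.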
3.1. Dataset overview

In this work we use a diverse set of well-curated, publicly available medical imaging datasets, which are crucial for training state-of-the-art machine learning models to detect chest-related diseases including COVID-19, tuberculosis, and pneumothorax. We selected these datasets for their comprehensive representation of various clinical scenarios and radiographic findings, making our study more exhaustive and clinically relevant. The COVIDx CXR-4 dataset, part of the larger COVID-Net initiative, is a composite collection of over 30,000 labeled chest radiographs collected from several healthcare institutions around the world for pretraining deep learning models, specifically for detecting COVID-19 and pneumonia. The dataset is systematically split into validation and test sets to allow legitimate cross-evaluation and meaningful testing. The tuberculosis database, consisting of thousands of chest X-rays ranging from normal findings to TB-positive cases, represents a gold standard for TB diagnostics developed jointly by an international consortium. The ChestX-Ray14 dataset, one of the largest radiology datasets, contains over 100,000 radiographs with annotations text-mined from radiological reports, enabling weakly-supervised learning techniques (NIH is already planning an expansion to 200k records). It provides thousands of images for multi-label classification, where an image may contain zero, one, or more of the 14 different pathologies that the model is to diagnose simultaneously. In addition, the COVID-19 Radiography Database is constantly updated with images of the different phases of COVID-19 infection and maintains its relevance as the disease progresses. Furthermore, to mitigate potential biases introduced by this broad set of sources, we use diversified sourcing, balanced sampling, and stratified cross-validation to help ensure our models generalize and remain robust across demographic variation and clinical contexts. We safeguard patient privacy by always de-identifying all datasets and by vetting the public availability, scanning source, and patient consent of the data, adhering to ethics protocols and complying with HIPAA and GDPR requirements. This ethical rigor reflects our dedication to keeping patient information private and the data correct. The strategic use of these datasets builds a concrete stepping-stone towards reliable and accurate diagnostic aids capable of addressing the complexities of the different pulmonary pathologies. Our systematic attention to dataset diversity, potential biases, and ethical standards underpins the scientific soundness and ethical quality of the developed methodology, laying the foundation for an important benchmark among diagnostic AI research agendas.

3.2. Data preprocessing methodology

Step-by-step data preprocessing: The first step in our preprocessing is a detailed exploratory analysis using a visualization grid displaying 20 random images along with their labels. This step ensures the accuracy of the labels and also highlights the character and difficulty of the data, such as image clarity, orientation, and anomaly visibility. Normalization then rescales the pixel values of each image to the range 0 to 1. This alleviates model training problems, including slow convergence and sensitivity to the scale of the input data.

Advanced augmentation methods: An important component of training powerful deep neural networks, especially in medical imaging, is data augmentation. This is crucial because the variability of the data is often quite wide while the data itself may be relatively scarce. Our augmentation strategy, sketched in code after this list, includes:
- Geometric transformations: Rotations (up to 20 degrees) and translations (shifts of up to 10% along both x and y axes) represent different patient positions and imaging angles.
- Zoom and shear perturbations: Zoom perturbations (up to a 20% increase/decrease) and shear transformations (up to 10%) simulate differences in patient size relative to the imaging machine and subtle movements.
- Color space augmentations: Brightness and contrast transforms help the model learn feature mappings under different imaging conditions and equipment settings.
- Elastic deformations: Stretching or squeezing images in a non-linear manner accounts for realistic variations between physiological examples and changes in imaging perspective.
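A minimal sketch of this augmentation strategy using Keras' ImageDataGenerator, under our own assumptions: the parameter values mirror the ranges listed above, and elastic deformation, which ImageDataGenerator does not provide natively, is left as a hypothetical custom hook.

```python
import tensorflow as tf

def elastic_deform(image):
    """Hypothetical hook for non-linear elastic deformation; a real
    implementation would displace pixels along a smoothed random field."""
    return image  # identity placeholder

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,            # normalize pixel values to [0, 1]
    rotation_range=20,            # rotations up to 20 degrees
    width_shift_range=0.10,       # horizontal shifts up to 10%
    height_shift_range=0.10,      # vertical shifts up to 10%
    zoom_range=0.20,              # zoom in/out up to 20%
    shear_range=10,               # shear intensity (degrees in Keras)
    brightness_range=(0.8, 1.2),  # brightness variation
    preprocessing_function=elastic_deform,
)

# Validation and test images are only rescaled, never augmented.
test_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)
```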
Each augmentation technique is chosen deliberately to reflect realistic variations and challenges faced by our diagnostic model in a clinical context, enhancing its ability to generalize from training data to real clinical applications.

Training parameters and hyperparameter tuning: The initial learning rate for our model was set to 0.001 and adaptively adjusted during training by the Adam optimizer. We chose the Adam optimizer for its suitability to sparse gradients and its adaptability to different scenarios, essential for medical imaging tasks.
- Batch size: A batch size of 32 balances efficient learning dynamics and computational resources. This size ensures good diversity within gradient estimates while avoiding memory exhaustion.
- Training epochs and early stopping: We allow training to run for up to 100 epochs, with early stopping based on validation loss to prevent overfitting once model performance stops improving.

Model tuning: We use a grid search with cross-validation to explore different combinations of learning rates, batch sizes, dropout rates, and augmentation parameters (see the sketch below). This systematic process ensures thorough testing of each combination to identify the set of parameters that yields the highest performance on the validation set. Grid search with cross-validation also assesses the model's generalizability across data splits, ensuring robustness in various clinical scenarios.

Validation and testing: Validation during training uses a separate subset (20% of the training data) to tune the model and evaluate its generalization capabilities without bias. The final model's performance is evaluated on a separate testing set, mimicking real-world application scenarios to ensure robustness and generalization for clinical deployment.
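As a hedged illustration of this tuning loop (our sketch, not the authors' code: `build_model` and the searched value ranges are hypothetical, chosen around the parameters quoted above):

```python
from itertools import product
import numpy as np
import tensorflow as tf

# Hypothetical search space around the values reported in the text.
param_grid = {
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [16, 32],
    "dropout_rate": [0.3, 0.5],
}

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

def evaluate_config(build_model, x, y, lr, batch, dropout, n_folds=3):
    """k-fold cross-validated score for one hyperparameter combination."""
    folds = np.array_split(np.random.permutation(len(x)), n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = build_model(lr=lr, dropout=dropout)  # fresh model per fold
        model.fit(x[train_idx], y[train_idx], batch_size=batch, epochs=100,
                  validation_data=(x[val_idx], y[val_idx]),
                  callbacks=[early_stop], verbose=0)
        # Assumes the model was compiled with metrics=["accuracy"].
        scores.append(model.evaluate(x[val_idx], y[val_idx], verbose=0)[1])
    return float(np.mean(scores))

# Exhaustive grid search over all combinations:
# best = max(product(*param_grid.values()),
#            key=lambda c: evaluate_config(build_model, x_train, y_train, *c))
```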
3.3. Convolutional neural network architecture

In our study we explored four CNNs: VGG16, DenseNet201, InceptionV3, and Xception. The structural design of these models captures and analyzes complex image data efficiently: each employs successive layers of convolution filters and pooling operations to gradually extract higher-level image features. For fine-grained recognition tasks involving detailed image classification, this process is essential. Leveraging the power of transfer learning, we introduced pre-trained weights from the extensive ImageNet database into our models. This approach harnesses the diverse and rich feature sets these networks have learned on a broad array of image types, furnishing our models with a robust foundation of visual knowledge. By employing these pre-trained networks, we effectively accelerate the training phase and enhance the models' ability to generalize when exposed to new, unseen datasets.

In adapting these pre-trained models to our specific task, we customized their architectures by removing the original top layers, which are typically fully connected layers designed for the ImageNet classification task. Instead, we kept the convolutional base for its potent feature-extraction capabilities. This adjustment ensures that the models remain versatile and focused on extracting universally applicable features from the images.

To ensure uniformity and compatibility across all models, all input images were preprocessed into the same shape of 224×224 pixels in RGB format, which matches the input requirements of the aforementioned pre-trained networks. Furthermore, each pre-trained model has its own specific preprocessing subroutine, such as normalization and pixel-value scaling, designed to prepare the images before they pass through the network. The rationale behind this step is that the pixel values must follow the mean and standard deviation of the image distribution in ImageNet.

Our methodology extended the pre-trained models' convolutional base with new layers designed to aid learning for our specific classification task. These included global average pooling for spatial dimension reduction, batch normalization to normalize the layer inputs and stabilize learning, dropout to prevent overfitting, and dense layers for the final classification. Each of these layers was essential in increasing the model's sensitivity to meaningful features while minimizing the risk of memorizing irrelevant data patterns.

Finally, we assessed the performance of each of the four models using a comprehensive set of performance metrics comprising accuracy, precision, recall, and the F1-score. These metrics offered an all-round perspective of each model's performance while identifying each model's strengths in accurately assigning images to the existing classes. This evaluation therefore had a dual goal: determining which model was best suited to our categorization task on our specific data, and contributing empirical evidence and better testing techniques to the body of knowledge on the subject. Our work thus combined modeling and evaluation efforts that advanced both the practical and theoretical aspects of applying CNNs to image classification.
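A minimal sketch of this construction for one base learner, under our own assumptions: the exact layer widths are hypothetical, while the 224×224 input, the frozen ImageNet base, and the global-average-pooling, batch-normalization, dropout, and dense head follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10  # assumption: the ten classes reported in Section 5

def build_transfer_model(num_classes: int = NUM_CLASSES) -> tf.keras.Model:
    # Convolutional base pre-trained on ImageNet, top layers removed.
    base = tf.keras.applications.DenseNet201(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3))
    base.trainable = False  # preserve the pre-trained features

    # Custom head: GAP -> batch norm -> dense -> dropout -> softmax.
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.BatchNormalization(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# The same pattern applies to VGG16, InceptionV3, and Xception, swapping the
# base network and its matching tf.keras.applications.*.preprocess_input.
```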
3.4. MobileNet

MobileNet is a lightweight convolutional neural network architecture designed with mobile and embedded devices in mind. The model factorizes the traditional large-scale CNN into a lighter form, enabling deployment in situations where computation is expensive or model size is limited. MobileNet is designed to sit at the right point on the computation/performance efficiency curve, making it ideal for applications that require lightweight models while still maintaining a good level of accuracy. Introducing MobileNet to the research provides additional modeling options for scenarios where computational resources or model file size are a limitation. Integrating MobileNet into the proposed classification architecture supports inference deployment in applications such as mobile apps, edge devices, and other resource-constrained systems.

To use MobileNet in our classification pipeline, it was necessary to carefully prepare the data for the subsequent training and testing processes. First, we divided our dataset into two mutually exclusive subsamples, a training subsample and a testing subsample, using the train_test_split function. As a result of the stratified subsample division, we obtained a sufficiently diverse sample to train the model and an independent empirical set to evaluate the model's ability to classify samples it had never seen. We used this experimental methodology to obtain reliable estimates of generalization capability and of several classification performance measures. Next, we created data generators for our training and testing subsamples using TensorFlow's ImageDataGenerator. Data generators make loading and preprocessing these images more straightforward and computationally efficient. For MobileNet, we used the preprocessing function provided by tf.keras.applications.mobilenet_v3.preprocess_input. Preprocessing the input images this way was crucial to ensure correct normalization and consistency with the preprocessing requirements of the MobileNet architecture, maximizing its classification efficiency.

The incorporation of MobileNet within our classification framework has been a tactical venture designed to promote our methodology's scalability, efficacy, and flexibility. Using MobileNet's lightweight structure and inference approach, we intended to extend our classification model's relevance to different deployment settings, such as mobile apps, edge devices, and Internet of Things platforms. By conducting extensive data preprocessing and integrating the model accordingly, the goal was to unlock MobileNet's full potential for solving real image classification problems across multiple domains and applications.
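A hedged sketch of this data preparation, using our own placeholder arrays and split ratio as assumptions, while train_test_split, ImageDataGenerator, and the MobileNetV3 preprocessing function are the tools named above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.mobilenet_v3 import preprocess_input

# Placeholder arrays standing in for the real images and one-hot labels.
x = np.zeros((100, 224, 224, 3), dtype="float32")
y = np.eye(10)[np.arange(100) % 10]

# Stratified 80/20 split keeps the class balance in both subsamples.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y.argmax(axis=1), random_state=42)

# MobileNetV3's own preprocessing function normalizes inputs as the
# architecture expects; no manual rescaling is applied on top of it.
train_gen = ImageDataGenerator(preprocessing_function=preprocess_input,
                               validation_split=0.2)
test_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_flow = train_gen.flow(x_train, y_train, batch_size=32, subset="training")
val_flow = train_gen.flow(x_train, y_train, batch_size=32, subset="validation")
test_flow = test_gen.flow(x_test, y_test, batch_size=32, shuffle=False)
```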
4 Evaluation metrics

To provide a complete evaluation of the model's performance, we employ several evaluation metrics, which together give a full picture of the model's behavior in image classification. These metrics are essential in judging the model's suitability for applications in medical diagnostics. The following are brief descriptions of each.

4.1. Accuracy

Accuracy is the most basic evaluation metric and measures the overall quality of the model's predictions. It is calculated as the number of correctly classified samples, i.e., true positives and true negatives, divided by the total number of samples:

ACC = (TP + TN) / (TP + TN + FP + FN).

4.2. Precision

This metric is fundamental for measuring the correctness of the positive predictions made by the model. It is computed as the ratio of true positive predictions to all positive predictions, i.e., true positives plus false positives:

PRE = TP / (TP + FP).

4.3. Recall

Recall, one of the primary metrics evaluated, measures the ability of the model to identify all instances that truly belong to a class. It is computed as true positives divided by true positives plus false negatives:

REC = TP / (TP + FN).

4.4. F1-score

The F1-score balances precision and recall. It is computed as the harmonic mean of precision and recall, giving a single measure that takes both metrics into consideration:

F1 = 2 × (PRE × REC) / (PRE + REC).

4.5. ROC curve

A ROC (Receiver Operating Characteristic) curve is a graphical tool for evaluating the performance of a classification model at many threshold levels. It is formed by plotting the true positive rate (TPR, a.k.a. sensitivity or recall) on the y-axis against the false positive rate (FPR) on the x-axis. The TPR is the proportion of actual positives correctly identified by the model; the FPR is the proportion of negatives incorrectly labeled as positives. The TPR-FPR tradeoff can be changed by varying the model's classification threshold. For a good model, the ROC curve tends towards the upper-left corner of the plot: a higher true positive rate and a lower false positive rate across the various thresholds. Because the ROC curve evaluates a model independently of any single classification threshold, it is ideal when the balance between TPR and FPR matters. In fraud detection, for example, it may be worth accepting a large number of false positives for the sake of making the true positive rate as high as possible. In medical diagnostics, on the other hand, reducing the FPR might be more important, even at the cost of a slightly lower TPR. In summary, ROC curves give a graphical representation that helps decide which model should be chosen for a classification problem, based on the sensitivity and specificity required.

4.6. Confusion matrix

A confusion matrix is a statistical device used in machine learning to determine how well a classification model predicts and classifies data in a dataset for which the real values are known. The matrix records the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). True positives are observations correctly predicted as positive; true negatives are observations correctly predicted as negative; false positives (Type I errors) and false negatives (Type II errors) indicate cases where the model labeled negatives as positive and vice versa. A confusion matrix is critical for quantifying performance metrics such as accuracy, precision, recall, and F1-score, showing how well the model distinguishes between classes.

4.7. AUC

The Area Under the Curve (AUC) is a performance measurement for classification problems across all threshold settings: it is the area under the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at different thresholds. It takes values between 0 and 1; the closer to 1, the better the model. The AUC summarizes model performance across all possible classification thresholds. An AUC of 0.5 suggests that the model has no discriminative power (equivalent to random guessing), while an AUC of 1 indicates perfect discrimination between the positive and negative classes. Formally, the AUC is the integral of the TPR with respect to the FPR:

AUC = ∫₀¹ TPR(t) dFPR(t),

where TPR(t) is the true positive rate at threshold t and FPR(t) is the false positive rate at threshold t. In practice, the AUC is approximated with numerical integration techniques, most commonly the trapezoidal rule, applied to the discrete points that make up the ROC curve.
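As a sketch of how all of these metrics can be computed together (our illustration using scikit-learn; the placeholder arrays are assumptions standing in for real model outputs):

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)

# y_true: integer class labels; y_prob: per-class probabilities from a model.
y_true = np.array([0, 1, 2, 1, 0, 2])                    # placeholder labels
y_prob = np.array([[.8, .1, .1], [.1, .7, .2], [.2, .2, .6],
                   [.3, .5, .2], [.6, .3, .1], [.1, .2, .7]])
y_pred = y_prob.argmax(axis=1)

# Per-class precision, recall, F1, and overall accuracy (Tables 2-5 style).
print(classification_report(y_true, y_pred, digits=2))

# Confusion matrix: rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# One-vs-rest AUC, numerically integrated under each class's ROC curve.
print(roc_auc_score(y_true, y_prob, multi_class="ovr"))
```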
5 Experimental results

5.1 DenseNet

The DenseNet201 architecture from TensorFlow's Keras applications was used in our study, adapted to suit our specific classification task. The model was initialized with pre-trained ImageNet weights, without the top layer, to allow custom output layers. To ensure that the pre-trained features were not disturbed, the base model's layers were set to be non-trainable. The network was then extended with layers designed to fine-tune and adapt the existing features to our classification needs: global average pooling, batch normalization, several dense layers with ReLU activations, and dropout to avoid overfitting. The model was compiled with an SGD optimizer. Throughout the training process, early stopping was used to halt training when the validation score failed to improve further, ensuring that the model generalized as well as possible. Following training, the model achieved high classification accuracy, as well as strong values on the other performance metrics, on the multi-class image dataset, as shown in Table 2, enabling the distinction of several types of medical images, including tuberculosis, pneumonia, and pulmonary fibrosis, among others. The results were further validated through the classification report documenting precision, recall, and F1-scores across the various classes. This affirms the robustness and accuracy of our method in medical image analysis.

Table 2: Classification report of DenseNet

Classes | Precision | Recall | F1-score
Control 10 | 0.96 | 1.00 | 0.98
Covid 09 | 0.99 | 0.99 | 0.99
Effusion 08 | 0.96 | 0.96 | 0.98
Lung Opacity 07 | 0.98 | 0.96 | 0.97
Mass 06 | 0.95 | 0.96 | 0.96
Nodule 05 | 0.94 | 0.87 | 0.90
Pneumonia 04 | 0.92 | 0.97 | 0.94
Pneumothorax 03 | 0.91 | 0.95 | 0.93
Pulmonary fibrosis 02 | 0.93 | 0.98 | 0.96
Tuberculosis 01 | 1.00 | 0.94 | 0.97
Accuracy | | | 0.95
Macro avg. | 0.95 | 0.96 | 0.96

Figure 2: Accuracy of DenseNet model

Figure 2 illustrates the accuracy of the DenseNet model over the training epochs. The x-axis enumerates the epochs, and the y-axis shows the accuracy metric, scaled between 0 and 1. The blue line traces the training accuracy across epochs, starting at about 50% and ascending steeply towards near-perfect values, suggesting rapid learning in the initial phases. It plateaus close to 100%, implying a strong fit to the training data. The orange line, denoting validation accuracy, begins slightly lower than the training accuracy, suggesting that the model does not generalize quite as well initially. It increases at a steady rate, albeit with a less steep slope than the training accuracy, before plateauing at a value slightly under 100%. This indicates good, though not perfect, generalization to new data.

For DenseNet, both the training loss and the validation loss decrease with each training epoch, showing the improvement of the model's predictions of the target classes. In the first epochs both losses are high, indicating that the model has little understanding of the data yet. Later in training, the training loss drops faster than the validation loss, which levels off; at this point both losses have converged. This plateau signals that further training would bring little additional benefit, which is why early stopping is used to avoid overfitting and ensure the model generalizes to new data. The overall decreasing loss curves confirm a successful learning phase, a prerequisite for the high accuracy.
The Receiver Operating Characteristic (ROC) curve shows that the DenseNet model performs well on class 9, as evidenced by the large area under the curve. An AUC of 0.960 indicates extraordinary discriminative ability: the model correctly ranks positive instances of class 9 above negative ones 96% of the time. The ROC curve shows that the model's performance remains high, with true positive rates near 1.0 even at relatively low false positive rates, and holds up as they increase. This high AUC suggests that DenseNet delivers the best performance on class 9, making it a strong contender for medical image classification with few misclassifications.

The confusion matrix illustrates how the DenseNet model performs across the classes. The matrix shows many correct classifications, for example 94 for Class 0 and 121 for Class 6, but also some notable misclassifications. Class 4, for example, has instances wrongly assigned to Classes 0 and 7, and Class 7 has many instances misclassified as Class 8, showing a difficulty in separating these particular classes. There is also a significant number of misclassifications for Class 8 and Class 1 (confused with Classes 3 and 7) and, to a lesser extent, some instances of Class 9 (96 in total) wrongly classified as Class 8, while some labels of Class 7 were misclassified as Class 9. These results demonstrate the predictive ability of the DenseNet model, but the remaining misclassifications suggest there is still room to reduce errors and improve overall classification performance.

5.2 InceptionV3

We addressed a difficult image classification problem using the InceptionV3 architecture, known for its complexity and depth. The model used ImageNet weights with its top layer excluded; hence the network did not make specific classifications but rather extracted features. The InceptionV3 layers' trainable parameters were frozen, which was needed to preserve the features obtained from the original training and to prevent instability during the initial learning process. Our custom model architecture was built on this robust foundation and included additional layers to maximize classification accuracy. Global average pooling was employed to consolidate each feature map into a single value, followed by batch normalization for faster convergence and several dense layers to increase the learning capacity of the model. A significant dropout rate of 0.5 was included to prevent overfitting. The model was compiled using the Adam optimizer, balancing the benefits of both RMSprop and SGD, and aimed to optimize precision, recall, and overall accuracy.

The classification report for the InceptionV3 model presents a comprehensive overview of its performance across the classes. As depicted in Table 3, the 'Control 10' class achieved the highest F1-score at 0.95, indicating exceptional precision and recall.
In contrast, the 'Nodule 05' class had the lowest F1-score of 0.77, which suggests room for improvement in precision, recall, or both for this category. The model showed strong precision in the 'Covid 09' and 'Tuberculosis 01' categories, scoring 0.93 in each, but recall was notably lower for 'Tuberculosis 01', reflecting that some cases may have been missed. 'Mass 06' exhibited high recall at 0.94, implying that the model reliably identifies most of the positive cases for that condition, although precision is slightly lower at 0.81. Across all classes, the model achieved an accuracy of 0.85, and both macro-average precision and recall are balanced at 0.86, indicating consistent performance across different conditions. The F1-scores, which balance precision and recall, are relatively high for most conditions, demonstrating the effectiveness of the InceptionV3 model in varying scenarios. However, the differences among conditions could be addressed to improve the model's diagnostic capabilities further.

Table 3: Classification report of InceptionV3

Classes | Precision | Recall | F1-score
Control 10 | 0.91 | 1.00 | 0.95
Covid 09 | 0.93 | 0.87 | 0.90
Effusion 08 | 0.82 | 0.78 | 0.80
Lung Opacity 07 | 0.92 | 0.81 | 0.86
Mass 06 | 0.81 | 0.94 | 0.87
Nodule 05 | 0.84 | 0.71 | 0.77
Pneumonia 04 | 0.85 | 0.96 | 0.90
Pneumothorax 03 | 0.83 | 0.88 | 0.85
Pulmonary fibrosis 02 | 0.77 | 0.93 | 0.84
Tuberculosis 01 | 0.93 | 0.72 | 0.81
Accuracy | | | 0.85
Macro avg. | 0.86 | 0.86 | 0.86

Figure 3: Accuracy of InceptionV3

Figure 3 depicts the accuracy trends for the InceptionV3 model during its training and validation phases. The x-axis shows the number of epochs, and the y-axis presents the accuracy metric, ranging from around 0.35 to just above 0.9. The training accuracy, marked in blue, starts just above 0.4 and shows a steady increase as training progresses, indicating that the model is learning from the training data. It continues to improve, albeit with a gradual slope, before plateauing near 0.9, which indicates a high level of accuracy on the training dataset. The validation accuracy, colored in orange, starts at a similar point but initially increases at a quicker rate. It peaks at around the third epoch at just over 0.8, indicating that the model was quite effective on the validation data at that point. After this peak it begins to decrease and then levels off, ending with a slight downward trend. This suggests that the model began to overfit the training data after the third epoch, as it performed better on training data than on unseen validation data. It is also noteworthy that the validation accuracy ends up lower than the training accuracy, which further supports the possibility of overfitting.

The confusion matrix for the InceptionV3 model shows considerable difficulty in predicting many of the classes. Some predictions are correct, such as 103 out of 111 instances of Class 0 (Control) and 107 out of 144 instances of Class 8 (Pneumothorax), but the model makes many misclassifications. Class 5 (Mass) is often mistaken for Class 4 (Effusion) and Class 6 (Nodule), for example, and Class 9 (Tuberculosis) is confounded with many classes. These misclassifications indicate the model's confusion in distinguishing closely related classes, which would need further refinement to improve overall performance.
For the Inception model, the ROC curve of class 9 performs better than those of the other classes, with an AUC of 0.906. This means that the model will, with 90.6% probability, rank a random positive instance (i.e., class 9) higher than a random negative instance belonging to any other class. The curve shows that the model retains a true positive rate of almost 0.9 even where the false positive rate is at its lowest, and this holds over a wider range of false positive rates, indicating robust performance. In addition, a specificity of 0.82 indicates that the Inception model detects class 9 well without excessive false positives, which is particularly desirable in medical image classification.

5.3 Xception

The Xception model employed in our study leverages the Xception architecture pre-trained on the ImageNet dataset to extract features from medical images. We initialized the base Xception model with frozen layers to preserve the learned features during training. On top of it we built a sequential model with batch normalization, global average pooling, and dense layers, providing data normalization, feature extraction and aggregation, and classification, respectively. A softmax output layer with 10 neurons produces the probability distribution across the classes. The model was trained using the Adam optimizer on the categorical cross-entropy loss. Early stopping was applied to prevent the model from overfitting during training and to ensure optimal convergence. The model was then tested on the test dataset, as shown in Table 4, achieving 81% overall accuracy. The model exhibited decent performance in terms of precision, recall, and F1-score, indicating its classification power for the different classes. Hence, the Xception model demonstrated its ability to classify medical images accurately, especially for diagnosing pulmonary pathologies.

Table 4: Classification report of Xception

Classes | Precision | Recall | F1-score
Control 10 | 0.99 | 0.96 | 0.98
Covid 09 | 0.87 | 0.93 | 0.90
Effusion 08 | 0.84 | 0.75 | 0.79
Lung Opacity 07 | 0.85 | 0.54 | 0.66
Mass 06 | 0.78 | 0.91 | 0.84
Nodule 05 | 0.77 | 0.69 | 0.73
Pneumonia 04 | 0.84 | 0.84 | 0.84
Pneumothorax 03 | 0.75 | 0.85 | 0.80
Pulmonary fibrosis 02 | 0.73 | 0.89 | 0.80
Tuberculosis 01 | 0.68 | 0.72 | 0.70
Accuracy | | | 0.81
Macro avg. | 0.81 | 0.81 | 0.80

Figure 4: Accuracy of Xception

Figure 4 shows the accuracy of the Xception model during training and validation over a series of epochs, with the x-axis tracking the epoch count and the y-axis representing the accuracy metric between approximately 0.2 and 0.8. The blue line, representing the model's training accuracy, begins just above 0.2 and rises steadily, reflecting consistent learning as the epochs progress. This steady ascent suggests that the model is effectively learning patterns from the training dataset. Around the seventh epoch it plateaus slightly, indicating that the model may be approaching its learning capacity with the current data and configuration. In contrast, the validation accuracy, indicated by the orange line, follows a similar upward trajectory yet surpasses the training accuracy after the initial epochs. This is an unusual pattern, since models typically perform better on the training data due to familiarity.
The model was then tested on the test dataset, as shown in Table 4, achieving 81% overall accuracy. It exhibited decent precision, recall, and F1-scores, indicating its classification power across the different classes. Hence, the Xception model demonstrated its strength in classifying medical images, especially for diagnosing pulmonary pathologies.

Table 4: Classification report of Xception

Classes                  Precision  Recall  F1-score
Control 10               0.99       0.96    0.98
Covid 09                 0.87       0.93    0.90
Effusion 08              0.84       0.75    0.79
Lung Opacity 07          0.85       0.54    0.66
Mass 06                  0.78       0.91    0.84
Nodule 05                0.77       0.69    0.73
Pneumonia 04             0.84       0.84    0.84
Pneumothorax 03          0.75       0.85    0.80
Pulmonary fibrosis 02    0.73       0.89    0.80
Tuberculosis 01          0.68       0.72    0.70
Accuracy                                    0.81
Macro avg.               0.81       0.81    0.80

Figure 4: Accuracy of Xception

Figure 4 shows the accuracy of the Xception model during training and validation over a series of epochs, with the x-axis tracking the epoch count and the y-axis representing accuracy between approximately 0.2 and 0.8. The blue line, the model's training accuracy, begins just above 0.2 and rises steadily, reflecting consistent learning as the epochs progress. Around the seventh epoch it plateaus slightly, indicating that the model may be approaching its learning capacity with the current data and configuration. The validation accuracy, indicated by the orange line, follows a similar upward trajectory yet surpasses the training accuracy after the initial epochs. This is an unusual pattern, as models typically perform better on the training data due to familiarity. The higher validation accuracy could mean that the validation set is less challenging, or it could indicate good generalization, depending on how diverse and representative the validation set is compared to real-world data. There is also a possibility of data leakage, or of a problem with the training/validation split, that is inflating the validation accuracy.

The ROC curve for class 9 of the Xception model measures the model's ability to distinguish this class from all others. The curve plots the true positive rate (TPR) against the false positive rate (FPR) at 30 different thresholds. The discriminative power of the model is high, with an AUC of 0.859, meaning that, on average, there is an 85.9% probability that the model ranks an instance of class 9 above an instance of any other class. The curve shows that at low false positive rates the true positive rate approaches 0.80 and rises as the false positive rate grows: the model remains sensitive to true positives, but more false positives begin to occur. This AUC indicates that the Xception model classifies class 9 accurately and is suitable for medical image classification, although it is low compared with the larger AUC values achieved by the other models and could be improved.

The confusion matrix for the Xception model on the multi-class classification task shows, along its diagonal, how many instances of each class were correctly predicted; accuracy is high in classes such as 0, 1, and 5, where 89, 96, and 85 instances are successfully predicted, respectively. The off-diagonal values show where the model fails. Most visibly, class 2 instances are frequently mistaken for most of the other classes, with the bulk of the confusion spread among classes 3, 4, 5, 7, and 8, and class 8 suffers from misclassifications distributed across a variety of classes, indicating areas where the model could be improved in clearly separating these classes. This detailed visualization yields a holistic performance evaluation, highlighting classes with high and low classification accuracy along with the specific misclassification regions, and thus shows where optimizations and improvements in training and evaluation are required.

5.4 MobileNet

This section describes the procedure for deploying a TensorFlow MobileNet V3 image classification model. We divided the dataset into 80% for training and 20% for testing. The training data is preprocessed and augmented using the ImageDataGenerator, while the testing data is only preprocessed; the training set is further split into training and validation subsets. The model architecture consists of a pre-trained MobileNet V3 base, global average pooling, batch normalization, a flatten layer, dense layers, a dropout layer for regularization, and a softmax activation function on the output layer to categorize images into the 10 classes, as in the sketch below. During training, the accuracy and loss metrics are plotted for both the training and validation data.
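A minimal sketch of this pipeline follows; the directory path, image size, augmentation settings, and layer widths are illustrative assumptions rather than the values used in our experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmented training data with an internal train/validation split
# (path, image size, and augmentation parameters are assumptions).
datagen = ImageDataGenerator(rotation_range=10, zoom_range=0.1,
                             horizontal_flip=True, validation_split=0.1)
train_gen = datagen.flow_from_directory("data/train", target_size=(224, 224),
                                        class_mode="categorical", subset="training")
val_gen = datagen.flow_from_directory("data/train", target_size=(224, 224),
                                      class_mode="categorical", subset="validation")

# MobileNetV3 in Keras includes its own input rescaling, so raw pixels are fine.
base = tf.keras.applications.MobileNetV3Large(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.BatchNormalization(),
    layers.Flatten(),                         # kept to mirror the described stack
    layers.Dense(128, activation="relu"),     # assumed hidden width
    layers.Dropout(0.3),                      # regularization against overfitting
    layers.Dense(10, activation="softmax"),   # 10 pulmonary classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```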
When evaluated, the model yields an accuracy of approximately 94.89% with a test loss of 0.25495. The classification report shows strong precision, recall, and F1-scores in every category, with an overall accuracy of about 95%. The performance of the model in detecting different health conditions, including pneumonia, pulmonary fibrosis, and COVID-19, is characterized by the classification metrics and the confusion matrix. These results show the remarkable capability of MobileNet V3 in image classification, achieving a trade-off between the depth and complexity of the network and efficiency in feature extraction.

Table 5: Classification report of MobileNet

Classes                  Precision  Recall  F1-score
Control 10               1.00       1.00    1.00
Covid 09                 0.99       0.97    0.98
Effusion 08              0.90       0.94    0.92
Lung Opacity 07          1.00       0.99    1.00
Mass 06                  0.92       0.92    0.92
Nodule 05                0.96       0.89    0.92
Pneumonia 04             1.00       0.97    0.98
Pneumothorax 03          0.90       0.95    0.92
Pulmonary fibrosis 02    0.94       0.95    0.95
Tuberculosis 01          0.97       0.98    0.97
Accuracy                                    0.96
Macro avg.               0.96       0.96    0.96

The training performance of MobileNet sheds light on both where it excels and where it falls short. The model trains quickly, reaching 90% accuracy within a few epochs and flattening out at about 94%, but the validation metrics deserve closer scrutiny. The validation accuracy follows a similar trend yet deviates from the training accuracy, especially at the beginning, indicating some overfitting to the initial training data. The loss curves corroborate these observations: the training loss drops sharply, while the validation loss decreases steeply at first and then fluctuates. This fluctuation may suggest overfitting, with the model effectively memorizing noise in the training data that does not generalize. This is in line with previous findings on the initial learning ability and generalization of MobileNet. Nevertheless, this gap can lead to overfitting, and care should be taken during training to ensure the model performs well on data it has not seen; early stopping is one technique that can help achieve this.

The ROC curve for class 9 shows that MobileNet classifies this class well. The curve represents how well the model distinguishes class 9 (true positive rate) from the other classes (false positive rate), which is emphasized by a high area under the curve (AUC) of 0.964. In other words, the model has a 96.4% chance of ranking a random class 9 image above a random image from any other class. The ROC curve confirms this by showing that even when confusion with other classes is small (low false positive rates), the true positive rate is near 1.0, demonstrating the strength and reliability of the model in detecting class 9. In summary, the high AUC value indicates that MobileNet makes few errors in classifying class 9 and is a very useful tool for medical image classification.

The confusion matrix of the MobileNet model for the multi-class classification task shows good accuracy in classes such as 2, 5, 7, and 8, as reflected in the diagonal values, where 101, 110, 109, and 109 instances are correctly predicted, respectively. The off-diagonal values are the misclassified instances and help us understand the model's confusion points. For example, class 6 shows some confusion with classes 3 and 9, and class 3 with classes 2, 4, and 5. This fine-grained overview provides an overarching view of the model's strength in predicting almost all classes accurately while pinpointing the troublesome confusion areas.
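For reference, the one-vs-rest ROC analysis used for class 9 throughout this section can be computed with scikit-learn. In this sketch, `y_true` (integer labels) and `probs` (the N x 10 softmax outputs) are assumed to come from evaluating the trained model on the test set:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# One-vs-rest ROC for class 9: binarize the labels, then score with the
# class-9 column of the softmax output.
cls = 9
y_bin = (np.asarray(y_true) == cls).astype(int)   # 1 for class 9, 0 otherwise
fpr, tpr, thresholds = roc_curve(y_bin, probs[:, cls])
print(f"Class {cls} AUC: {auc(fpr, tpr):.3f}")
```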
5.5 Ensemble (stacking)

In this study, we used a stacking ensemble that assembles different pre-trained models to improve prediction quality on this challenging image classification task. We incorporated three CNN models, DenseNet201, InceptionV3, and Xception, which had already been pre-trained and saved and which possessed strong predictive qualities. We used them as base learners: each first produced predictions, and these predictions were then used as input features for the meta-model. Our meta-model used a dense architecture with 25 ReLU neurons for feature integration, followed by a softmax layer with 10 outputs representing our class categories. It was trained on a dataset split into training and validation subsets, ensuring robustness and the ability to generalize from the ensemble predictions. Optimized with the Adam optimizer and compiled with a categorical cross-entropy loss function, the meta-model focused on refining the decision boundaries formed by the base models. After training, the ensemble's effectiveness was evaluated using precision, recall, and F1-score metrics, revealing outstanding classification performance across categories, with nearly all classes achieving near-perfect scores. This result underscores the power of combining multiple advanced neural network architectures to achieve superior accuracy and reliability in medical image classification tasks.
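The following sketch shows the shape of this stacking scheme; `densenet`, `inception`, and `xception` stand for the saved base models, and `X_meta`, `y_meta` (one-hot labels) for the held-out split used to fit the meta-learner. The names and the training settings are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras import layers, models

# Saved base learners (assumed handles loaded from disk).
base_models = [densenet, inception, xception]

def stacked_features(X):
    # Concatenate each base model's 10-way probability vector -> shape (N, 30).
    return np.concatenate([m.predict(X) for m in base_models], axis=1)

meta = models.Sequential([
    layers.Dense(25, activation="relu",
                 input_shape=(len(base_models) * 10,)),  # feature integration
    layers.Dense(10, activation="softmax"),              # final class probabilities
])
meta.compile(optimizer="adam", loss="categorical_crossentropy",
             metrics=["accuracy"])
history = meta.fit(stacked_features(X_meta), y_meta,
                   validation_split=0.2, epochs=30)
```

Because the meta-learner only sees the 30 stacked probabilities rather than raw images, it is cheap to train and focuses entirely on refining the decision boundaries formed by the base models.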
In Table 6, the classification report for the ensemble model, which uses a meta artificial neural network (ANN), shows outstanding performance across all classes. The model achieves near-perfect precision and recall in most categories, as reflected by the F1-scores. Remarkably, the 'Covid 09' and 'Lung Opacity 07' classes both scored a perfect 1.00 across precision, recall, and F1-score, indicating that the model identifies these conditions with exceptional accuracy. Similarly, 'Control 10' demonstrates almost flawless performance with an F1-score of 0.99. The other classes, such as 'Effusion 08', 'Mass 06', 'Nodule 05', and 'Pneumonia 04', maintain high metrics, with F1-scores ranging from 0.97 to 0.99, suggesting the model is highly effective in distinguishing these conditions with minimal false positives or negatives. The overall accuracy of the ensemble model is extremely high at 0.98, and the macro averages for precision, recall, and F1-score mirror this value, highlighting consistent and reliable performance across the board. This indicates that the stacking approach, which integrates multiple learning algorithms, results in superior predictive capability. The balanced precision and recall suggest that the model not only captures the majority of positive cases but also correctly identifies negatives, which is crucial for medical diagnostics. These results imply a robust model with excellent generalization properties for the considered conditions.

Table 6: Classification report of meta model (ANN)

Classes                  Precision  Recall  F1-score
Control 10               0.99       1.00    0.99
Covid 09                 1.00       1.00    1.00
Effusion 08              0.96       0.97    0.97
Lung Opacity 07          1.00       1.00    1.00
Mass 06                  0.96       0.97    0.97
Nodule 05                0.99       0.96    0.97
Pneumonia 04             0.99       0.99    0.99
Pneumothorax 03          0.97       0.98    0.97
Pulmonary fibrosis 02    0.97       0.97    0.97
Tuberculosis 01          0.99       0.97    0.98
Accuracy                                    0.98
Macro avg.               0.98       0.98    0.98

Figure 5: Accuracy of meta model
Figure 6: Loss of meta model
Figure 7: ROC curve for class 9 of meta model

Figure 5 illustrates the accuracy of the ensemble (stacking) meta-model during training over several epochs, as shown on the x-axis; the y-axis quantifies the accuracy as a proportion between 0.5 and 1.0. The blue line charts the model's training accuracy, which begins at around 0.5 and sharply climbs to just above 0.9 within the first two epochs. This rapid ascent indicates that the ensemble model quickly learns from the training data. Subsequently, the training accuracy increases more gradually and levels off near the 1.0 mark, suggesting that the model fits the training data well. Conversely, the orange line shows the validation accuracy. It starts at a similar level but does not ascend as steeply; after catching up to the training accuracy at around the second epoch, it diverges and starts to lag slightly, finishing just below the training accuracy curve. This divergence may indicate a small degree of overfitting, but the proximity of the two lines at the end of training indicates that the ensemble model generalizes well to unseen data. The plateau near the end of the epochs suggests that the model is stabilizing and that additional training would be unlikely to yield significant accuracy improvements. The high validation accuracy maintained throughout training reflects the efficacy of the ensemble approach, which often achieves robust generalization by leveraging the strengths of multiple individual models. Figure 6 shows the corresponding training and validation loss over the same epochs, which mirror these accuracy trends.
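Curves like those in Figures 5 and 6 come directly from the Keras training history; a small plotting sketch follows, assuming `history` is the object returned by `meta.fit` in the stacking sketch above:

```python
import matplotlib.pyplot as plt

# Plot accuracy and loss for both the training and validation splits.
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history["accuracy"], label="train")
ax_acc.plot(history.history["val_accuracy"], label="validation")
ax_acc.set(xlabel="epoch", ylabel="accuracy", title="Meta-model accuracy")
ax_loss.plot(history.history["loss"], label="train")
ax_loss.plot(history.history["val_loss"], label="validation")
ax_loss.set(xlabel="epoch", ylabel="loss", title="Meta-model loss")
ax_acc.legend()
ax_loss.legend()
plt.tight_layout()
plt.show()
```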
The ROC curve in Figure 7 illustrates how the stacking model performs for class 9. It shows the trade-off between the proportion of class 9 instances that the model identifies (true positive rate) and the proportion of other classes that the model predicts as class 9 (false positive rate). A perfect ROC curve rises as a straight line from the bottom-left corner (0,0) to the top-left corner (0,1) and then runs across to the top-right corner (1,1), meaning the model separates class 9 from all other classes completely. The perfect AUC of 1.0 for class 9 therefore implies highly capable classification of this class: only class 9 examples are predicted as class 9, and no instance of any other class is predicted as class 9.

Figure 8: Meta model confusion matrix

The confusion matrix of the ensemble learning model for the multi-class classification task is shown in Figure 8. The diagonal values correspond to the number of correctly classified instances of each class, with particularly high accuracy for four classes, control_10, effusion_08, pneumothorax_03, and tuberculosis_01, which have 108, 106, 104, and 111 correctly predicted instances, respectively. The off-diagonal values give the number of misclassified instances and indicate where the model gets confused. For instance, the class lung opacity_07 has a number of misclassifications into effusion_08 and mass_06, and the class pulmonary fibrosis_02 shows some misclassifications into mass_06.
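A matrix like Figure 8 can be produced with scikit-learn; in this sketch, `X_test` and `y_true` are assumed test-set handles, and `stacked_features` and `meta` come from the stacking sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Meta-model predictions on the stacked test-set features.
y_pred_meta = np.argmax(meta.predict(stacked_features(X_test)), axis=1)

cm = confusion_matrix(y_true, y_pred_meta)
print("Correct predictions per class (diagonal):", np.diag(cm))
ConfusionMatrixDisplay(cm).plot(cmap="Blues")   # off-diagonal cells = confusions
plt.show()
```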
6 Comparative analysis

When comparing the performance of the individual CNN models, DenseNet201, InceptionV3, and Xception, with that of the ensemble stacking model, we observe significant distinctions and improvements in the ensemble approach. DenseNet201 exhibited robust performance, achieving high precision and recall with a weighted average precision of 0.96 and an accuracy of 0.95. InceptionV3, while still performing commendably, displayed slightly lower efficacy, especially in precision and recall for certain classes such as effusion and lung opacity, leading to an overall accuracy of 0.85. Xception's results were less uniform: several classes had precision or recall below 0.8, resulting in an overall accuracy of 0.81. Meanwhile, the ensemble stacking model, which bases its diagnoses on the three base models' predictions, showed the best results on all metrics. With almost perfect precision and recall in every category and a total accuracy of 0.98, the approach yields a markedly higher rate of successful predictions. It is especially noteworthy that the ensemble outperformed the non-ensemble models in categories the latter found difficult, underlining the efficiency of combining different strengths to compensate for individual weaknesses. This comparative analysis demonstrates the potential of the ensemble stacking method to improve the predictive efficiency and precision of complicated classification tasks by combining the unique strengths of several models. The following table summarizes the performance comparison between the individual models and the ensemble stacking model in terms of precision, recall, F1-score, and overall accuracy:

Table 7: Comparison of model performance

Model          Precision  Recall  F1-score  Accuracy
DenseNet201    0.95       0.96    0.96      0.95
InceptionV3    0.86       0.86    0.86      0.85
Xception       0.81       0.81    0.80      0.81
Stacking       0.98       0.98    0.98      0.98

Based on Table 7, the ensemble stacking model achieved the best results compared with the single-learner models. The superior performance of the stacking model demonstrates the gain that can be made by combining multiple model predictions into a final outcome: the approach increases the accuracy and dependability of classification in complex scenarios.

7 Discussion

As shown in Table 7, our ensemble stacking model performs best in terms of precision, recall, F1-score, and accuracy, consistently achieving around 98% on these metrics. Compared with the outcomes listed in the "Related Works" section, this highlights the strengths and weaknesses of our approach for these problem types.

State-of-the-art (SOTA) comparison: The Multi-Channel Fusion CNN (Paper [24]) achieved an accuracy of 92.52%. Our model outperforms this by maintaining high accuracy consistently in a more complex multi-class setting; the improvement arises from our model's ability to jointly optimize features from different CNN architectures through a meta-learning process, which is not used in Paper [24]. BDCNet (Paper [25]) shows excellent precision and accuracy, focusing on COVID-19, pneumonia, and lung cancer. While BDCNet excels on these diseases, our model generalizes this high performance across a broader range of pulmonary conditions, including pulmonary fibrosis and pneumothorax, highlighting its clinical relevance across multiple diseases. CDC Net (Paper [26]) has an extremely high AUC of 0.9953, showing effective performance in distinguishing different chest diseases. Our model has comparable precision and recall to CDC Net, suggesting that while CDC Net is more discriminative for diagnostics, our ensemble approach is as reliable and more widely applicable across multiple conditions. EfficientNetB0 (Paper [27]), known for fewer parameters and a streamlined design, provides lower accuracy in multi-class scenarios (95.00%).
Our ensemble model, despite its more complex structure, maintains high accuracy without the trade-off seen in streamlined models, indicating that our integration techniques capture nuances in medical images more effectively.

Performance differential analysis: The superiority of our model can be attributed to the stacking ensemble technique, which synergistically integrates multiple types of CNN architectures. Each base model contributes its strengths, which are synthesized by a meta-learner that effectively aggregates the disparate inputs, making the final model more accurate and robust. This approach enables our model to achieve excellent metrics across the board and to generalize effectively to the various complexities inherent in different medical imaging tasks.

Contribution and limitations: This work provides a new application of ensemble learning with a meta-learner to improve classification in medical images. The method represents a significant improvement over single-model approaches that do not capture all important features in complex datasets. However, the performance of our proposed model is strongly affected by its complexity, which introduces computational costs and potential overfitting; these constraints could be addressed in future work through more optimal model architectures or more effective regularization techniques.

Next steps: Future work can examine ways to improve the computational efficiency of the ensemble, such as model compression or architectural adjustments that maintain predictive performance while minimizing redundancy. Additionally, increasing the variability of the dataset, using clinical information from real-world care, and incorporating feedback from medical practice could be essential for optimizing the model for real-world medical needs.

Conclusions: Our model, which innovatively leverages ensemble learning, achieves state-of-the-art performance in pulmonary disease classification. However, the ever-changing landscape of medical diagnostics means that ongoing improvements and adaptations will be required to remain competitive.

Table 8: Comparison with related work

Study       Precision  Recall  F1-score  Accuracy  Unique aspect
Paper [24]  N/A        N/A     N/A       92.52     Multi-channel fusion CNN
Paper [25]  99.9       98.31   99.09     99.10     Single model for multiple diseases
Paper [26]  99.42      98.13   N/A       99.39     High AUC (0.9953) for diverse diseases
Paper [27]  N/A        N/A     N/A       95.00     EfficientNetB0 with fewer parameters
Paper [28]  N/A        N/A     N/A       99.32     High binary classification accuracy
Paper [29]  N/A        N/A     99.3      99.3      High performance with minimal epochs
Our Work    99.0       99.0    98.0      99.3      Consistently high scores across metrics

8 Conclusion

The present study embodies a body of rigorous experimentation and investigation aimed at pushing the existing frontiers of medical image classification. As we meticulously explored the intricate workings of a variety of CNN architectures, including DenseNet201, InceptionV3, Xception, and MobileNet, we simultaneously sought to make the most of their outcomes to transform medical diagnostics. The curated dataset was subjected to a thorough process of data preprocessing, in which we endeavoured to capture every complexity and variance that could occur in real-life clinical settings.
All of the steps in our methodology, including preprocessing the data, training the models, and testing them, were carried out thoughtfully and diligently. This involved careful parameter selection and aggressive training and testing strategies, as well as the use of prior knowledge through transfer learning, for example from models pre-trained on ImageNet. This had two positive effects: first, all our models trained very quickly; second, each trained model took a general approach to the diagnosis of chest pathologies, regardless of the specific disorder it was looking for. In addition, our ensemble technique itself represents the effort to build a classifier capable of higher classification performance after optimization. It should be recognized that while the components of this approach are independent models, each with a unique speciality or strength and hence very different characteristics, the ensemble taken as a whole demonstrated performance that none of its constituent models could exhibit on its own. Hence, stacking boosted the reliability of the classifier in discovering causal relationships, features, and patterns. It also performed no worse than most of the highly specialized models, whose accuracies range from 80% to 99% and which were trained for narrowly focused detection of individual diseases. This makes our work both high-performing and versatile. Given the stable levels of precision, recall, F1-score, and accuracy, the method also shows a high level of reliability.

Future work amplifying our ensemble approach may therefore offer substantial improvements for medical diagnostics. Going forward, we aim to improve our methodology to complement diverse diagnostic use cases more effectively, such as emergent disease detection or rare pathologies, which are somewhat underrepresented in our training datasets. We are also investigating new deep learning approaches such as federated learning, which allows the integration of data across institutions without risking patient privacy, in order to include more diverse and higher-quality training data and thus improve the robustness and accuracy of the model.

In practice, our model shows significant promise for seamless integration into current clinical workflows, greatly improving the efficacy and accuracy of diagnostic procedures. Automating the initial chest radiograph analysis can help radiologists prioritize cases with suspected pathologies for a second look, accelerating diagnosis and possibly improving patient outcomes through earlier interventions. Applying our machine learning model in clinical practice could significantly decrease the time lag between imaging and diagnosis, resulting in timelier interventions. This is especially important for diseases such as COVID-19 and tuberculosis, in which early detection can significantly influence treatment success. In addition, a model that learns from a large set of pathologies may help achieve a higher diagnostic rate with better precision and potentially reduce diagnostic error. Future work will be directed at developing interfaces that integrate smoothly with existing hospital information systems to expedite adoption in clinical settings.
This will involve working extensively with clinical partners to incorporate their feedback on practicality into the model and to optimize it for different imaging equipment and scenarios. Our model has the potential to greatly improve medical diagnostics, not only in future research but in current clinical practice, leading to more accurate diagnoses made more quickly and improving patient management and outcomes across multiple healthcare scenarios.

References

[1] Ciotti, Marco, Massimo Ciccozzi, Alessandro Terrinoni, Wen-Can Jiang, Cheng-Bin Wang, and Sergio Bernardini. The COVID-19 pandemic. Critical Reviews in Clinical Laboratory Sciences, 57(6): 365-388, 2020. https://doi.org/10.1080/10408363.2020.1783198
[2] Wilson, Nick, Stephen Corbett, and Euan Tovey. Airborne transmission of covid-19. BMJ, 370, 2020. https://doi.org/10.1136/bmj.m3206
[3] Kevadiya, Bhavesh D., Jatin Machhi, Jonathan Herskovitz, Maxim D. Oleynikov, Wilson R. Blomberg, Neha Bajwa, Dhruvkumar Soni et al. Diagnostics for SARS-CoV-2 infections. Nature Materials, 20(5): 593-605, 2021. https://doi.org/10.1038/s41563-020-00906-z
[4] Gnanvi, Janyce Eunice, Kolawolé Valère Salako, Gaëtan Brezesky Kotanmi, and Romain Glèlè Kakaï. On the reliability of predictions on Covid-19 dynamics: A systematic and critical review of modelling techniques. Infectious Disease Modelling, 6: 258-272, 2021. https://doi.org/10.1016/j.idm.2020.12.008
[5] Johnson, Kemmian D., Christen Harris, John K. Cain, Cicily Hummer, Hemant Goyal, and Abhilash Perisetti. Pulmonary and extra-pulmonary clinical manifestations of COVID-19. Frontiers in Medicine, 7: 526, 2020. https://doi.org/10.3389/fmed.2020.00526
[6] Marginean, Cristina Maria, Mihaela Popescu, Corina Maria Vasile, Ramona Cioboata, Paul Mitrut, Iulian Alin Silviu Popescu, Viorel Biciusca et al. Challenges in the differential diagnosis of COVID-19 pneumonia: a pictorial review. Diagnostics, 12(11): 2823, 2022. https://doi.org/10.3390/diagnostics12112823
[7] Bhatnagar, Rahul, and Nick Maskell. The modern diagnosis and management of pleural effusions. BMJ, 351, 2015. https://doi.org/10.1136/bmj.h4520
[8] Chen, Chia-Hung, Chih-Kun Chang, Chih-Yen Tu, Wei-Chih Liao, Bing-Ru Wu, Kuei-Ting Chou, Yu-Rou Chiou, Shih-Neng Yang, Geoffrey Zhang, and Tzung-Chi Huang. Radiomic features analysis in computed tomography images of lung nodule classification. PLoS ONE, 13(2): e0192002, 2018. https://doi.org/10.1371/journal.pone.0192002
[9] Hussain, Azhar, Alia Noorani, Ranjit Deshpande, Lindsay John, Max Baghai, Olaf Wendler, Donald Whitaker, and Habib Khan. Management of pneumothorax in mechanically ventilated COVID-19 patients: early experience. Interactive CardioVascular and Thoracic Surgery, 31(4): 540-543, 2020. https://doi.org/10.1093/icvts/ivaa129
[10] Abdulahi, AbdulRahman Tosho, Roseline Oluwaseun Ogundokun, Ajiboye Raimot Adenike, Mohd Asif Shah, and Yusuf Kola Ahmed. PulmoNet: a novel deep learning based pulmonary diseases detection model. BMC Medical Imaging, 24(1): 51, 2024. https://doi.org/10.1186/s12880-024-01227-2
[11] Greenspan, Hayit, Bram Van Ginneken, and Ronald M. Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Transactions on Medical Imaging, 35(5): 1153-1159, 2016. https://doi.org/10.1109/TMI.2016.2553401
[12] Miotto, Riccardo, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T. Dudley.
Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics, 19(6): 1236-1246, 2018. https://doi.org/10.1093/bib/bbx044
[13] Nguyen, Thi Mai, Nackhyoung Kim, Da Hae Kim, Hoang Long Le, Md Jalil Piran, Soo-Jong Um, and Jin Hee Kim. Deep learning for human disease detection, subtype classification, and treatment response prediction using epigenomic data. Biomedicines, 9(11): 1733, 2021. https://doi.org/10.3390/biomedicines9111733
[14] Aslan, Muhammet Fatih, Kadir Sabanci, Akif Durdu, and Muhammed Fahri Unlersen. COVID-19 diagnosis using state-of-the-art CNN architecture features and Bayesian Optimization. Computers in Biology and Medicine, 142: 105244, 2022. https://doi.org/10.1016/j.compbiomed.2022.105244
[15] Fatani, Abdulaziz, Abdelghani Dahou, Mohammed A. A. Al-Qaness, Songfeng Lu, and Mohamed Abd Elaziz. Advanced feature extraction and selection approach using deep learning and Aquila optimizer for IoT intrusion detection system. Sensors, 22(1): 140, 2021. https://doi.org/10.3390/s22010140
[16] Elharrouss, Omar, Younes Akbari, Noor Almadeed, and Somaya Al-Maadeed. Backbones-review: Feature extractor networks for deep learning and deep reinforcement learning approaches in computer vision. Computer Science Review, 53: 100645, 2024. https://doi.org/10.1016/j.cosrev.2024.100645
[17] Akinyelu, Andronicus A., and Pieter Blignaut. COVID-19 diagnosis using deep learning neural networks applied to CT images. Frontiers in Artificial Intelligence, 5: 919672, 2022. https://doi.org/10.3389/frai.2022.919672
[18] Aljondi, Rowa, and Salem Alghamdi. Diagnostic value of imaging modalities for COVID-19: scoping review. Journal of Medical Internet Research, 22(8): e19673, 2020. https://doi.org/10.2196/19673
[19] Benmalek, Elmehdi, Jamal Elmhamdi, and Abdelilah Jilbab. Comparing CT scan and chest X-ray imaging for COVID-19 diagnosis. Biomedical Engineering Advances, 1: 100003, 2021. https://doi.org/10.1016/j.bea.2021.100003
[20] Sun, Junding, Pengpeng Pi, Chaosheng Tang, Shui-Hua Wang, and Yu-Dong Zhang. CTMLP: Can MLPs replace CNNs or transformers for COVID-19 diagnosis? Computers in Biology and Medicine, 159: 106847, 2023. https://doi.org/10.1016/j.compbiomed.2023.106847
[21] Li, Wenjing, Randy C. Paffenroth, and David Berthiaume. Neural network ensembles: theory, training, and the importance of explicit diversity. arXiv preprint arXiv:2109.14117, 2021. https://doi.org/10.48550/arXiv.2109.14117
[22] Salehi, Ahmad Waleed, Shakir Khan, Gaurav Gupta, Bayan Ibrahimm Alabduallah, Abrar Almjally, Hadeel Alsolai, Tamanna Siddiqui, and Adel Mellit. A study of CNN and transfer learning in medical imaging: Advantages, challenges, future scope. Sustainability, 15(7): 5930, 2023. https://doi.org/10.3390/su15075930
[23] Brownlee, Jason. Stacking ensemble for deep learning neural networks in Python. 2018.
[24] Nikolaou, Vasilis, Sebastiano Massaro, Masoud Fakhimi, Lampros Stergioulas, and Wolfgang Garn. COVID-19 diagnosis from chest x-rays: developing a simple, fast, and accurate neural network. Health Information Science and Systems, 9: 1-11, 2021. https://doi.org/10.1007/s13755-021-00166-4
[25] Hira, Swati, Anita Bai, and Sanchit Hira. An automatic approach based on CNN architecture to detect COVID-19 disease from chest X-ray images. Applied Intelligence, 51: 2864-2889, 2021. https://doi.org/10.1007/s10489-020-02010-w
[26] AbdElhamid, Abeer A., Eman AbdElhalim, Mohamed A.
Mohamed, and Fahmi Khalifa. Multi-classification of chest X-rays for COVID-19 diagnosis using deep learning algorithms. Applied Sciences, 12(4): 2080, 2022. https://doi.org/10.3390/app12042080
[27] Qian, Xuelin, Huazhu Fu, Weiya Shi, Tao Chen, Yanwei Fu, Fei Shan, and Xiangyang Xue. M^3 Lung-Sys: A deep learning system for multi-class lung pneumonia screening from CT imaging. IEEE Journal of Biomedical and Health Informatics, 24(12): 3539-3550, 2020. https://doi.org/10.1109/JBHI.2020.3030853
[28] Nath, Malaya Kumar, Aniruddha Kanhe, and Madhusudhan Mishra. A novel deep learning approach for classification of COVID-19 images. In 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA): 752-757. IEEE, 2020. https://doi.org/10.1109/ICCCA49541.2020.9250907
[29] Islam, Md Nazmul, Md Golam Rabiul Alam, Tasnim Sakib Apon, Md Zia Uddin, Nasser Allheeib, Alaa Menshawi, and Mohammad Mehedi Hassan. Interpretable differential diagnosis of non-covid viral pneumonia, lung opacity and covid-19 using tuned transfer learning and explainable AI. In Healthcare, 11(3): 410. MDPI, 2023. https://doi.org/10.3390/healthcare11030410
[30] Ullah, Zaka, Ayman Odeh, Ihtisham Khattak, and Muath Al Hasan. Enhancement of Pre-Trained Deep Learning Models to Improve Brain Tumor Classification. Informatica, 47(6): 165-172, 2023. https://doi.org/10.31449/inf.v47i6.4645
[31] Cherifi, Dalila, Abderraouf Djaber, Mohammed-Elfateh Guedouar, Amine Feghoul, Zahia Zineb Chelbi, and Amazigh Ait Ouakli. Covid-19 Detecting in Computed Tomography Lungs Images using Machine and transfer Learning. Informatica, 47(8): 35-44, 2023. https://doi.org/10.31449/inf.v47i8.4258