https://doi.org/10.31449/inf.v47i1.4519 Informatica 47 (2023) 11–20 11 
Predicting Students Performance Using Supervised Machine 
Learning Based on Imbalanced Dataset and Wrapper Feature 
Selection 
Sadri Alija 
1
, Edmond Beqiri 
2*
, Alaa Sahl Gaafar 
3
, Alaa Khalaf Hamoud 
4 
1 
Faculty of Business and Economics, South East European University, North Macedonia. 
2 
University of Peja “Haxhi Zeka” – Peja, Kosovo. 
3 
Department of Educational Planning, Directorate of Education in Basrah, Iraq. 
4 
Department of Computer Information Systems, University of Basrah, Iraq. 
Email: s.aliji@seeu.edu.mk, edmond.beqiri@unhz.eu, alaasy.2040@gmail.com, alaa.hamoud@uobasrah.edu.iq. 
Keywords: supervised machine learning, feature selection, wrapper, particle swarm optimization, info gain, SMOTE 
Received: November 14, 2022 
For learning environments like schools and colleges, predicting the performance of students is one of the 
most crucial topics since it aids in the creation of practical systems that, among other things, promote 
academic performance and prevent dropout. The decision-makers and stakeholders in educational 
institutions always seek tools that help in predicting the number of failed courses for the students. These 
tools can help in finding and investigating the factors that led to this failure. In this paper, many supervised 
machine learning algorithms will investigate finding and exploring the optimal algorithm for predicting 
the number of failed courses of students. An imbalanced dataset will be handled with Synthetic Minority 
Oversampling TEchinque (SMOTE) to get an equal representation of the final class. Two feature selection 
approaches will be implemented to find the best approach that produces a highly accurate prediction. 
Wrapper with Particle Swarm Optimization (SPO) will be applied to find the optimal subset of features, 
and Info Gain with ranker to get the most correlated individual features to the final class. Many supervised 
algorithms will be implemented such as (Naïve Bayes, Random Forest, Random Tree, C4.5, LMT, Logistic, 
and Sequential Minimal Optimization algorithm (SMO)). The findings show that the wrapper filter with 
SPO-based SMOTE outperforms the Info-Gain filter with SMOTE and improves the performance of the 
algorithms. Random Forest outperforms the other supervised machine learning algorithms with (85.6%) 
in TP average rate and Recall, and (96.7%) in ROC curve. 
Povzetek: Opisana je metoda za napovedovanje uspeha študentov s pomočjo strojnega učenja. 
 
1 Introduction
High-quality universities always require a great record of 
their students and the students are the main resource for 
them. The main concern for the universities is the 
performance of the students which is the base stone for 
building the top rate graduates and post-graduate students 
who will be the leaders of the nations and take 
responsibility of the economic and social growth of the 
society. Moreover, the main concerns for market 
employers are the performance of universities and 
students’ academic performance due to its direct effect on 
the employment process and then employee productivity. 
So, the employers’ demands are met by the graduated 
students who exert efforts in their academic journey. 
Student performance is measured by the learning 
assessment and the curriculum according to Usamah et al 
[1]. 
It is frequently important to be able to predict the behavior 
of future students to enhance the design of the curriculum 
and prepare the interventions for academic guidance and 
support. Machine learning (ML) is useful in this situation. 
ML approaches examine datasets, extract information, and 
then organize that information for eventual use. The 
primary goals of ML are to identify and extract patterns 
from recorded data by using a variety of techniques and 
algorithms [2]. Numerous algorithms exist and are used 
with educational data, including supervised algorithms 
such as Decision Tree (DT) and Naive Bayes (NB), and 
unsupervised algorithms such as K-Nearest Neighbor 
(KNN), and Neural Network (NN). Such algorithms 
forecast patterns, upcoming trends, and behaviors, 
enabling businesses to make informed, proactive decisions 
mining. This paper's major goal is to predict student 
performance using Supervised ML based on an 
imbalanced dataset and wrapper feature selection. The 
following section sheds light on related previous studies, 
then followed by the methodology and the concluded 
points, and future work. 
 
 
 
 
12 Informatica 47 (2023) 11–20 S. Alija et al. 
2 Literature review
High quality universities always require the great 
record of their students where the students are the main 
resource for them. The main concern for the universities 
is the performance of the students which is the base stone 
for building the top rate graduates and post-graduates 
students who will be the leaders of the nations and take the 
responsibilities of the economic and social growth of the 
The concept of data mining techniques can be 
implemented and applied in the educational field to 
improve our comprehension of the learning process, with 
a particular emphasis on the identification, extraction, and 
evaluation of factors linked to students' learning processes 
[3]. ML algorithms enable users to categorize and 
summarize associations discovered throughout the mining 
process as well as examine data from different 
perspectives. Bhardwaj and Pal in [4] explore the 
performance of the students by taking a sample of 300 
undergraduate students' row records from the department 
of computer application from different institutions in Dr. 
R. M. L. Awadh University, India. The Bayesian 
classifiers are utilized on 17 features where the researchers 
found that there is a strong correlation between student 
action and other factors such as (living location, the 
academic background of the mother, senior secondary 
exam, the status, and the annual outcome of the student’s 
family).  
Next, in the same university, Pandey and Pal [5] 
selected 600 students to implement the model based on 
Bayes classifier to classify the background qualification, 
category, and language. While Hijazi and Naqvi in [6] 
have selected 300 students (75 female, and 225 male) from 
different colleges in Pakistan's Punjab University to 
explore and investigate student performance. Based on the 
linear regression, they found that there are many factors 
that affected the student's performance such as the attitude 
toward the class they attend, the time spent in studying 
after college, the mothers’ ages, the income of their 
families, and the educational level of their mothers (where 
the performance is strongly affected by it). Khan in [7], 
explored the performance by building a model based on a 
clustering approach using 400 rows of student data from 
Aligarh Muslim University's senior secondary school in 
Aligarh, India. The main goal of the study is to determine 
the predictive value of different measures such as 
personality, cognition, and demographic variables that 
affect success at a higher level of secondary school. The 
outcomes of the study found that females with 
socioeconomic status scored higher performance, whereas 
males with low socioeconomics had higher performance 
in the science stream.  
In the next case study [8], Kovacic implemented a 
data mining model for determining the educational 
enrollment data in New Zealand to predict the 
performance of the students. Chi-square automatic 
interaction detection (CHAID) and Classification and 
Regression (CART) algorithms are utilized to categorize 
the successful and failed students. The algorithms did not 
produce promising accuracies where they predicted the 
results with (59.4, and 60.5 respectively). The other case 
study is implemented by Galit [9] where the learning 
behavior is examined to predict the students' outcomes and 
alert the students to the critical status before the final 
exam. The final study [10] is proposed by Al-Radaideh, 
where the model is implemented to predict the students' 
final grades in C++ course for the students enrolled in the 
Yarmouk university in 2005, in Jordan. NB, DT (ID3, and 
C4.5) are utilized to predict the grades where the DT has 
outperformed the NB in prediction. 
In our proposed model, the problem of imbalanced 
dataset is handled and the effect of handing this problem 
is observed by implementing different machine learning 
algorithms (supervised and unsupervised). The effect of 
handling imbalanced dataset is also observed by 
implementing feature selection which has the direct effect 
on the result accuracies.   
3 Methodology 
The model implementation framework is depicted in 
Figure 1, which consists of five steps starting with data 
preprocessing and ending with the model evaluation. The 
step of attribute feature selection (FS) is implemented by 
a single FS and a subset FS to find the effect of each step 
on the result accuracies. SMOTE filter is applied then, 
where it is followed by implementing supervised ML 
algorithms. 
 
 
 
Figure 1: Model framework 
 
3.1 Dataset reliability  
A questionnaire is adopted in this study to build the model 
where Google Forms is used to build the questionnaire and 
collect undergraduate students’ answers from both of 
Faculty of Contemporary Sciences and Technologies 
(CST) and the Faculty of Business and Economics (FBE) 
in South East European University (SEEU) in North 
Macedonia (RNM). The aim of this study is to find the 
optimal DT in predicting student performance based on 
the conceptual framework that was implemented by 
researchers in [11]. The aim of the framework is to find 
the hidden patterns that may affect and correlate with the 
performance of the students and provide suggestions to 
enhance and improve the performance. Many questions 
related to many factors are found in the questionnaire, 
such as academic behavior, health, finance, time planning, 
self-development, social relationships, and achieving 
goals. The questionnaire in [11] lists the factors and the 
questions related to each question, where the answer for 
most of the questions was on a 5-point Likert scale (from 
Predicting Students Performance Using Supervised Machine… Informatica 47 (2023)11–20 13 
1 to 5) which represented the formal answers (from 
“Strongly Disagree” to “Strongly Agree”). 
The dataset of the questionnaire involves 141 rows of 
respondents. The dataset reliability is required to measure 
the overall consistency of the dataset. The measure of 
reliability which describes consistency can be confirmed 
to have a high level if it produces similar results under 
consistent conditions. The most frequent measure in 
statistics is the coefficient alpha, which is used to calculate 
the internal consistency of the independent variables of the 
study. The coefficient’s alpha for the dataset is 0.93. This 
value indicates an excellent internal consistency of the 
dataset reliability [12][13]. The applied tool for this model 
is Weka 3.8.5 and the system specifications are (RAM 
8GB, HARD 35.5GB free, OS Win7 Pro). 
 
Table 1: Dataset reliability 
Number of 
Respondents 
Number of  
Features 
Coefficient’s 
Alpha 
% of 
Respondents 
141 58 0.93 100% 
 
3.2 Feature selection (FS) 
FS approach can be considered as a form of data reduction 
where features are reduced and only the correlated features 
remain. The main goal of FS methods is finding the 
optimal subset of features or the highly correlated features 
that have a direct effect or may affect the final class(s). 
Due to the number of attributes in our dataset (57), it is 
required to find the most correlated attributes or features 
that can be utilized in the next steps to get more accurate 
results in classification [14]. Two approaches are applied 
in our model (Wrapper with Particle Swarm Optimization 
(PSO)) and (Info-Gain Attribute Evaluator).  
 
• Wrapper method 
The Wrapper method evaluates the subset of attributes 
according to the classifier performance for both 
supervised algorithms (such as DT, SVM, and NB) and 
unsupervised algorithms (such as clustering). For each 
subset, the evaluation process is repeated while the search 
strategy determines the subset generation. The wrapper 
method is slower than the filter in finding good subsets 
because it depends on resource demands for the algorithm 
of modeling. Due to using real modeling algorithms, the 
wrapper method is proven empirically to produce better 
feature subsets [15]. 
 
• Particle swarm optimization (PSO) 
Kennedy and Eberhart in 1995 proposed one of the 
evolutionary computation techniques based on social 
behavior such as fish schooling and bird flocking. The 
basic idea behind PSO underlines that the population-
social interaction optimizes knowledge where the thinking 
is personal and social. The solutions are represented by 
particles, while particles are represented by vectors that 
have positions in the search space. Each vector 
xi=(xi1,xi2,…xiD) Where D is the search space 
dimensionality. To search for the optimal solutions, the 
particles move in the search space. According to that, each 
particle has a velocity that can be represented by vi where 
vi takes the values (vi1,vi2,….,viD). The particle updates 
its location and velocity during the movement, and this 
update is performed according to the neighbors and their 
own experience. Two values of positions are recorded, the 
best which represents the best previous personal position 
of the particle, and gbest is the best-obtained position by 
the population. The following equation is used to update 
the position and velocity:  
 
𝑥 𝑖𝑑
𝑡 +1
= 𝑥 𝑖𝑑
𝑡 + 𝑣 𝑖𝑑
𝑡 +1
  (1) 
𝑣 𝑖𝑑
𝑡 +1
= 𝑤 ∗ 𝑣 𝑖𝑑
𝑡 + 𝑐 1
∗ 𝑟 1
∗ (𝑝 𝑖𝑑
− 𝑥 𝑖𝑑
𝑡 ) + 𝑐 2
∗ 𝑟 2
∗
(𝑝 𝑔𝑑
− 𝑥 𝑖𝑑
𝑡 ) (2) 
 
Where t is the tth iteration in the evolutionary process 
while d represents the d dimension in the search space 
where d belongs to D. The weight w it controls the 
previous velocity impact on the current velocity impact. 
The acceleration constants c 1, c 2 are random values in the 
range (0 to 1), p id and p gd represent the elements of pbest, 
gbest alternatively in the dimension dth. vmax is the 
maximum velocity where 𝑣 𝑖𝑑
𝑡 +1
∈ [−vmax, vmax]. The 
algorithm will stop when the predefined criterion of 
fitness is met with a good fitness value or a predefined 
number of maximum iterations [16][17]. 
 
• Info gain 
In this feature selection evaluator, the information of each 
class is estimated to evaluate the attribute. The method 
used in this evaluator is minimum description-length-
based discretization where the attributes are binarized or 
discretized. In this method, the missing values are either 
regarded as separate values or distributed the values 
among other values according to the frequencies.  As the 
value of the feature is absent, the decrease in entropy is 
measured. For the multiclass attribute, the InfoGain 
evaluator has reported the best performance. The 
generalized form of the nominal values is taken from the 
nominal attribute. Info Gain is measured by the decrease 
of X entropy that is caused by Y which is represented by: 
 
𝐼𝐺 (𝑋 |𝑌 ) = 𝐻 (𝑋 ) − 𝐻 (𝑋 |𝑌 ) (3) 
 
According to this measurement, (Y) feature can be 
considered as more correlated to (X) feature if (IG(X│Y) 
> IG(Z│Y). IG normalized the values that fall within the 
range (0 to1), where (1) value indicates that the predicted 
value is completely correct and (0) value indicates that (X) 
feature is independent of (Y) feature. For the nominal and 
continuous features, the Entropy can be applied in order to 
determine the correlation between continuous and 
nominal features [18][19][20][21].  
The Wrapper filter with SPO is applied to find and explore 
the most correlated subsets of features that make the 
highly accurate results for each supervised algorithm. 
Wrapper as a subset of attributes evaluator is applied for 
each supervised classifier individually. In this step, 
different subsets of features are found for each classifier 
where the SPO is selected as a search method to improve 
the speed of search for features subsets. In order to find 
14 Informatica 47 (2023) 11–20 S. Alija et al. 
the effect of wrapper evaluator, Info Gain evaluator is 
applied to find the features with high correlations with the 
final class and to find how wrapper and Info Gain affect 
the result algorithms accuracies of the algorithms. Table 2 
shows the most correlated features (subset) after applying 
wrapper with SPO for each algorithm and Info Gain with 
Ranker.   
Table 2: Selection of attributes 
Feature Evaluator  Attributes 
Wrapper (Random Forest) 
with SPO  
1,5,6,7,8,9,10,12,13,14,16,1
7,18,27,33,36,44,49,52,53,5
6 
Wrapper (NB) with SPO  5,8,14,18,25,31,42,48 
Wrapper (Logistic) with SPO 2,4,5,6,11,13,17,31,35,48,5
1,52,53,54,57 
Wrapper (SimpleLogistic) 
with SPO 
1,4,5,6,8,9,11,15,17,23,26,2
7,28,31,32,34,42,44,46,50,5
2,53,55 
Wrapper (SMO) with SPO 4,5,14,15,17,24,31,32,35,42
,45,47,54,55,56,57 
Wrapper (LMT) with SPO 1,2,4,5,6,7,8,9,11,14,15,17,
19,20,21,23,25,26,27,28,32,
34,41,42,44,49,52,53,55 
Wrapper (J48) with SPO 5,7,13,22,23,24,26,31,35,42
,45,46,52 
Wrapper (Random Tree) with 
SPO 
5,15,27,33,35,43,44,45,46,4
8,49 
Info Gain with Ranker 5,57,19,18,21,17,20,22
,15,23,26,25,24,16,14,28,7,
4,3,2,6 
 
3.3 Synthetic minority over-sampling 
technique (SMOTE) 
The dataset is said to be imbalanced if the classes in the 
final class are not equally represented [22]. If the final 
class has the classes (1,2, and 3) and the representations of 
the classes are (10% for 1, 15% for 2 and 75% for 3) then 
the dataset is imbalanced. The imbalanced datasets are 
found in almost all sectors starting from the medical sector 
[23], telecommunications management [24], fraudulent 
telephone calls [25], and text classification [26]. The 
SMOTE approach creates “synthetic” examples, to 
oversample the minority classes or by replacing the 
samples. This approach has been inspired and proven its 
success by the recognition process of handwritten 
characters [27]. The generation of synthetic examples is 
performed based on the operating in the feature space 
rather than the data space. The data space will face certain 
operations to generate the training data. The process of 
oversampling is performed by taking each minority class 
attribute of the final class attribute and introducing new 
examples (synthetic) along the line segments which join 
all k classes if they are nearest neighbors. The selection of 
the k nearest neighbors is performed randomly according 
to the oversampling amount required. The synthetic 
samples generation is implemented by taking the 
difference between each sample with its neighbors, then 
the result difference is multiplied by a random number 
between 0 and 1; then the result obtained is added to the 
feature vector. This process effectively forces to make the 
minority class more generally, see Figure 2 [28]. 
 
Figure 2: Comparison of number of minority correct for 
replicated oversampling and SMOTE for a dataset [28]. 
 
In our imbalanced dataset, the percentage of classes’ 
representation is shown in Table 3. Class (3) takes only 
(4.3%) of the overall dataset, followed by classes (1, and 
2) respectively with (21.3%, and 21.9%). The SMOTE 
filter in our model will be implemented on the classes (1,2, 
and 3) to make the dataset balanced and to get reliable 
performances of the algorithms. The SMOTE filter is 
applied to get equal representations of all classes. 
 
Table 3: Classes representation 
Class Number of Rows % of Representation 
0 74 52.5% 
1 30 21.3% 
2 31 21.9% 
3 6 4.3% 
 
3.4 Supervised machine learning (ML) 
In the proposed model, many supervised ML algorithms 
have been implemented to find the accurate algorithm for 
predicting the number of failed courses for the students. 
The algorithms fall in approaches such as (decision tree 
(DT) (Random Forest, Random Tree, LMT, and J48), 
naïve Bayes (Naïve Bayes, and Bayes Net), Logistic 
(Logistic and Simple Logistic), and Support Vector 
Machine (SMO)). DT is one of the supervised ML 
approaches that aim to build a training model to be used 
in predicting the final class attribute [29]. DT classifiers 
are widely used in different sectors and have proved their 
accuracies in the fields of education [11], [30][31], 
healthcare [32], wireless sensor networks [33], image 
processing [34][35], and disaster management [36][37]. 
There are many types of algorithms and the most used 
algorithms are (Random Forest, CART, Iterative 
Dichotomies 3 (ID3), and Successor of ID3 (C4.5 or J48) 
[38][39]. DT is used in the field of classification 
(predicting the categorical values) and regression 
(predicting the continuous values) [40]. Random Forest 
(which was proposed by L. Breiman in 2001) is a general-
purpose regression and classification approach that works 
on the principle of aggregating the predictions by 
calculating the predictions averages and shows excellent 
Predicting Students Performance Using Supervised Machine… Informatica 47 (2023)11–20 15 
performance when the variables numbers is larger than the 
number of the observations [41].  In logistic model trees 
(LMT), logistic regression is utilized to select the 
attributes in a natural way by using stage-wise fitting. The 
logistic model in this approach is built on leaves by 
refining the leaves incrementally at the higher level of the 
tree [42].  
SVM is an ML algorithm that falls under the supervised 
learning algorithm [43], as it is one of the data-based 
algorithms used to solve classification problems. It is 
considered one of the most important algorithms to 
accomplish that task (solving classification problems) 
[44]. Support Vector Machine has a vector support 
processing approach in which many questions are 
answered depending on the understanding and knowledge 
of the problem and how to design it. Moving to the real 
world, we find that the Support Vector Machine algorithm 
was used to find solutions to many problems in this world, 
including face recognition, detection, hand lines, and 
others [45]. In order to understand the SVM algorithm, it 
is necessary to understand its main terminology, the 
maximum-margin hyperplane, the separating hyperplane, 
the soft margin, and the kernel function [43]. SVM can be 
classified into two types: Linear SVM, and Non-Linear 
SVM. Linear SVM is an algorithm used when the data can  
be separated into two groups in a linear way by using a 
straight line where the data can be called as linearly 
separable, in addition to that the classifier is described as 
SVM classifier. Non-Linear SVM is an algorithm used 
when the data cannot be separated in a linear manner, and 
thus a straight line cannot be used to separate the data into 
two categories. To compensate for this, another thing 
called the kernel trick is used, through which we define 
the data in a higher dimension to be separated using some 
mathematical functions. 
Regression is considered a simple type of ML algorithm. 
It is considered a supervised learning algorithm. These 
algorithms are used in a wide range to find a relationship 
between the continuous predictor and response variables. 
It is considered a way to measure the relationships 
between response variables and continuous predictors 
[46]. An example of this is the linear regression algorithm, 
which is one of the supervised learning algorithms, where 
this algorithm simulates the mathematical relationship 
between variables. It attempts to find relationships 
between independent variables (input data) and dependent 
variables (result, and forecast). It works to find continuous 
or numerical variables by predicting that as it assumes that 
the relationships between the predicted variables and the 
goal to be reached are linear, such as sales, age, and 
product price. The regression may be linear or curvilinear, 
so it must pass through all data points to reach the target 
prediction so that if the measurement is made between the 
data points and the regression line, the result is minimal.  
In order to solve classification problems, a logistic 
regression algorithm was built, which is one of the 
supervised learning algorithms, where the results are 
always binary, not devoid of one of the two values, either 
1 or 0, success or failure, rain or no rain; its working 
principle is probability. Logistic regression is used in the 
analysis of binary outcomes, or as it is said that they are 
two-level, or whose levels are opposite [47]. A 
characteristic of logistic regression is that its predictions 
are deterministic and have the ability to adapt to multiple 
predictions. This is necessary for the analysis of 
observational data when adjustment is useful to avoid 
differences in the totals to be compared [48]. Logistic 
regression is used to reach the highest weighting of a 
variable in the event that there is more than one variable. 
Thus, it is similar in terms of multiple linear regression 
and is inconsistent with it that the response variable has 
only a binomial, and as a result, each variable is 
considered to have an impact on the likelihood ratio of any 
expected event. Hence, it has the advantage that it can 
avoid confusing influences by analyzing the correlations 
of all variables at the same time [49]. 
NB is considered one of the supervisor learning 
algorithms; it is based on Bayes’ rule together with 
additional to strong assumptions attributes that are 
categorically and conditionally independent [50]. Then it 
is used for solving classification problems. This algorithm 
assumes conditional independence of traits; so it is rarely 
true in the real world, which has made the competitive 
performance of this algorithm a lot of attention and 
surprising [51]. The Naïve Bayes algorithm is used in a 
wide range of applications, including article classification 
and spam filtering. Naïve Bayes Classifier is able to build 
ML model through which we get fast predictions. The 
hypothesis states that the independence between every two 
features, so the naïve Bayes classifier calculates the 
probability of belonging to a certain class. As a product of 
simple probabilities resulting from assumed Naïve 
independence. The hypothesis states that there is 
independence between each of the two features, so the 
Naïve Bayes classifier computes the probability of a 
particular instance belonging to a particular class. If we 
assume that the described is described by a vector x of 
attributes and the target of the class is the element y, then 
we can express the conditional probability p(y|x) as the 
product of the simple probabilities resulting from the 
assumed naïve Bayes independence [52].  
Bayesian networks are considered probabilistic models 
that depend mainly on non-periodic direct graphs. These 
models are causal relationships between their variables, 
and their structure represents the combination of previous 
knowledge and target data. They are also called belief 
networks as they belong to probabilistic graphical models, 
and knowledge can be represented in uncertain domains 
through the use of their graphical structures. It is observed 
by looking at its graphs, where nodes represent random 
variables, while arrows between nodes (variables) 
represent probabilistic dependencies. In most cases, 
generally accepted statistical methods are used to estimate 
these conditional dependencies. Hence, we can say that 
Bayesian networks combine graphs and statistics as well 
as computer science and probability theory [53]. Also, 
Bayesian networks are used to perform causal logic and 
predict risks. In addition, there are many advantages if we 
compare it with the methods used in regression methods 
[54]. One of Bayes Network's products is the modeling 
language in addition to the inference algorithms associated 
with random domains. Experiments have proven a lot of 
16 Informatica 47 (2023) 11–20 S. Alija et al. 
success when used in medium-sized applications. But if 
Bayesian networks are used in areas that are relatively 
complex or large domains, then these networks will use 
the task of modeling, which is somewhat similar to 
programming using logic circuits [55]. 
 
3.5 Model evaluation 
The evaluation process of algorithms is performed based 
on the confusion matrix, see Figure 3. The class value of 
True Negative (TN) is the predicted class as (NO) and it is 
(NO), while the class value of False Positive (FP) is the 
class when it is predicted as (YES) and it is (NO). False 
Negative (FN) class value is the class when it is predicted 
as (NO) and it is (YES) while True Positive (TP) class 
value is the class when it is predicted as (YES) and it is 
(YES). 
 
 
Figure 3: Confusion matrix. 
 
Based on the above matrix, the performance criteria are: 
𝑆𝑒𝑛𝑠𝑒𝑣𝑖𝑡𝑦 𝑜𝑟 𝑇𝑃 =
𝑇𝑃
𝑇𝑃 +𝐹𝑁
 (4) 
𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 𝑜𝑟 𝐹𝑃 =
𝐹𝑃
𝐹𝑃 +𝑇𝑁
 (5) 
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑇𝑃
𝑇𝑃 +𝐹𝑃
 (6) 
𝑅𝑒𝑐𝑎𝑙𝑙 =
𝑇𝑃
𝑇𝑃 +𝐹𝑁
 (7) 
 
The sensitivity or recall is a measurement of the truly 
predicted cases and measures the relevance of TP with FN. 
The more the TP rate, the more accurate the predicted 
cases and the more accurate the classification algorithm. 
The specificity or FP rate is the false alarm rate that 
measures the incorrectly predicted cases. The more FP, the 
more predicted incorrect cases. The precision represents 
the relevant cases among the predicted cases [29]–[31]. 
  
Table 4: Algorithms performance after wrapper with 
SPO. 
 
Table 4 lists the performance evaluation of supervised 
algorithms after implementing Wrapper with SPO. The 
algorithms are implemented after removing the 
uncorrelated features where the Wrapper base classifier is 
the supervised algorithm. Then, the SMOTE filter is 
applied to get equal representations of classes for the final 
class. RF algorithm outperforms the other supervised 
algorithms with (85.6% in TP rate and Recall), (4.9% in 
FP rate) and (85.7%) in precision. C4.5 (J48) algorithm 
comes in the second rank with (79.6% in TP rate and 
Recall), (6.7% in FP rate), and (79.6%) in Precision. NB 
comes in the last rank with (71.7% in TP rate and Recall), 
(9.4%) in FP rate, and (71.1%) in Precision.  
 
Table 5: Algorithms performance after info gain 
evaluator. 
Algorithm TP 
Rate 
FP 
Rate 
Precision  Recall 
LMT 0.750     0.083     0.749       0.750     
Random 
Forest 
0.836     0.054     0.835       0.836     
Random 
Tree 
0.701     0.099     0.696       0.701     
NB 0.678     0.107     0.678       0.678     
Logistic 0.737     0.087     0.735       0.737     
Simple 
Logistic 
0.707     0.097     0.701       0.707     
SMO 0.734     0.088     0.730       0.734     
J48 0.753     0.082     0.750       0.753     
Bayes Net 0.750     0.083     0.753       0.750     
 
Table 5 depicts the performance criteria of supervised ML 
algorithms after implementing Info Gain. The algorithms 
are implemented after removing the uncorrelated features 
(36 features), then the SMOTE filter is applied to get equal 
representations of classes for the final class. RF algorithm 
outperforms the other supervised algorithms with (83.6% 
in TP rate and Recall), (5.4% in FP rate) and (83.5%) in 
precision. C4.5 (J48) algorithm comes in the second rank 
with (75.3% in TP rate and Recall), (8.2% in FP rate), and 
(75%) in Precision. NB comes in the last rank with (67.8% 
in TP rate and Recall), (10.7%) in FP rate, and (67.8%) in 
Precision. 
 
 
Figure 4: ROC of algorithms with wrapper and info gain. 
 
One of the performance criteria that determines the 
optimal classifiers is the Receiver Operating 
Characteristic (ROC) curve, where ROC is considered one 
of the standard techniques that summarize classifier 
performance over a range of tradeoffs between TP and FP 
error rates [32][28]. As much as the ROC is closer to 1, as 
much as the classifier is accurate. Based on Figure 4, the 
RF classifier is the optimal classifier among all other 
classifiers with (96.7%) ROC when the wrapper with SPO 
Algorithm TP 
Rate 
FP 
Rate 
Precision Recall 
LMT 0.766 0.078 0.762 0.766 
Random 
Forest 
0.856 0.049 0.857 0.856 
Random 
Tree 
0.697 0.100 0.695 0.697 
NB 0.717 0.094 0.711 0.717 
Logistic 0.727 0.091 0.729 0.727 
Simple 
Logistic 
0.773 0.075 0.770 0.773 
SMO 0.757 0.081 0.752 0.757 
J48 0.796 0.067 0.796 0.796 
Predicting Students Performance Using Supervised Machine… Informatica 47 (2023)11–20 17 
is implemented. The ROC is (96.1%) for the same 
classifier when Info Gain is implemented. The figure 
shows that ROCs for all algorithms are enhanced after 
implementing a wrapper evaluator with SPO. NB is the 
only classifier that has (89.1%) ROC when implementing 
wrapper and (89.5%) with Info Gain Evaluator.  
 
4 Conclusions and future works 
The imbalanced dataset faced many techniques and 
approaches to solve the minority and majority class 
problems related to the final class. In our model, the 
imbalanced dataset has multi-values in the final class 
which is required to handle this problem using SMOTE 
filter. In our model, the step of feature selection is 
performed two ways, the first one is by applying wrapper 
evaluator with SPO as a search method to find subsets of 
attributes that may affect and be correlated with the final 
class, and the second one by applying Info Gain as an 
evaluator with ranker as a search method to find the 
features with most correlation with the final class. After 
finding the most correlated features or feature subsets 
using evaluators, the uncorrelated features are removed 
and the SMOTE filter is applied to produce a balanced 
dataset and to make the multi-values classes equally 
represented. Many supervised ML algorithms are applied 
such as (NB, RF, Random Tree, LMT, J48, Logistic, 
Simple Logistic, and SMO). The performance evaluation 
of the algorithms shows that using the wrapper with the 
classifiers and SPO as a search method outperforms the 
Info-Gain evaluator. RF algorithm outperforms other 
algorithms in predicting students’ performance and the 
number of failed courses. The model can be updated by 
predicting the students’ status whether will fail or pass the 
final class. The features will be explored and investigated 
using different filters and classifiers to find the features 
with the most correlations with students’ failure. 
 
References 
[1] U. Bin Mat, N. Buniyamin, P. M. Arsad, and R. A. 
Kassim, “An overview of using academic analytics 
to predict and improve students’ achievement: A 
proposed proactive intelligent intervention,” in 2013 
IEEE 5th International Conference on Engineering 
Education: Aligning Engineering Education with 
Industrial Needs for Nation Development, ICEED 
2013, 126-130, 2014. 
https://doi.org/10.1109/iceed.2013.6908316. 
[2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, 
“From data mining to knowledge discovery in 
databases,” AI Mag., vol. 17, no. 3, 37-37, 1996. 
[3] A. El-Halees, “Mining Students Data To Analyze 
Learning Behavior : a Case Study Educational 
Systems,” Work, 2008. 
[4] A. B. E. D. Ahmed and I. S. Elaraby, “Data Mining: 
A prediction for performance improvement using 
classification,” World J. Comput. Appl. Technol., 
vol. 2, no. 2, 2014. 
https://doi.org/10.13189/wjcat.2014.020203 
[5] U. K. Pandey and S. Pal, “Data Mining: A prediction 
of performer or underperformer using 
classification,” arXiv Prepr. arXiv1104.4163, 2011. 
[6] S. M. M. Syed Tahir Hijazi & Raza Naqvi, “Factors 
affecting students’ performance: A case of private 
colleges,” Bangladesh e-Journal Sociol., vol. 3, no. 
1, pp. 1–10, 2006. 
[7] Z. N. Khan, “Scholastic Achievement of Higher 
Secondary Students in Science Stream,” J. Soc. Sci., 
vol. 1, no. 2, 2005. 
         https://doi.org/10.3844/jssp.2005.84.87 
[8] Z. J. Kovacic, “Early Prediction of Student Success: 
Mining Students Enrolment Data,” in Proceedings of 
the 2010 InSITE Conference, 2010. 
         https://doi.org/10.28945/1281 
[9] G. (Univ T. A. Ben-Zadok, R. (Univ T. A. Mintz, A. 
(Univ T. A. Hershkovitz, and R. (Univ T. A. 
Nachmias, “Examining online learning processes 
based on log files analysis: A case study,” Res. 
Reflections Innov. Integr. ICT Educ. Proc. Fifth 
Intertnational Conf. Multimdeia ICT Educ., no. 2, 
2009. 
[10] Q. A. Al-Radaideh, E. M. Al-Shawakfa, and M. I. 
Al-Najjar, “Mining student data using decision 
trees,” in International Arab Conference on 
Information Technology (ACIT’2006), Yarmouk 
University, Jordan, 2006. 
[11] A. K. Hamoud, A. S. Hashim, and W. A. Awadh, 
“Predicting Student Performance in Higher 
Education Institutions Using Decision Tree 
Analysis,” Int. J. Interact. Multimed. Artif. Intell., 
2018. 
         https://doi.org/10.9781/ijimai.2018.02.004 
[12] B. Carson, “The transformative power of action 
learning,” Chief Learn. Off. Retrieved, 2017. 
[13] U. Sekaran and R. Bougie, Research methods for 
business: A skill building approach. john wiley & 
sons, 2016. 
[14] B. Remeseiro and V. Bolon-Canedo, “A review of 
feature selection methods in medical applications,” 
Computers in Biology and Medicine, vol. 112. 2019. 
         https://doi.org/10.1016/j.compbiomed.2019.103375 
[15] Y. Kim, W. N. Street, and F. Menczer, “Evolutionary 
model selection in unsupervised learning,” Intell. 
Data Anal., vol. 6, no. 6, 2002. 
         https://doi.org/10.3233/ida-2002-6605 
[16] B. Xue, M. Zhang, and W. N. Browne, “Particle 
swarm optimization for feature selection in 
classification: A multi-objective approach,” IEEE 
Trans. Cybern., vol. 43, no. 6, 2013. 
         https://doi.org/10.1109/tsmcb.2012.2227469 
[17] Y. Shi and R. Eberhart, “Modified particle swarm 
optimizer,” in Proceedings of the IEEE Conference 
on Evolutionary Computation, ICEC, 1998. 
         https://doi.org/10.1109/icec.1998.699146 
[18] L. Yu and H. Liu, “Feature Selection for High-
Dimensional Data: A Fast Correlation-Based Filter 
Solution,” in Proceedings, Twentieth International 
Conference on Machine Learning, 2003, vol. 2. 
18 Informatica 47 (2023) 11–20 S. Alija et al. 
[19] E. Frank, M. A. Hall, and I. H. Witten, “The WEKA 
Workbench Data Mining: Practical Machine 
Learning Tools and Techniques,” Morgan 
Kaufmann, Fourth Ed., 2016. 
https://doi.org/10.1016/b978-0-12-374856-0.00010-
9 
[20] U. M. Fayyad and K. B. Irani, “Multi-interval 
discretization of continuous-valued attributes for 
classification learning,” in Proceedings of the 13th 
International Joint Conference on Artificial 
Intelligence, 1993. 
[21] H. Liu, F. Hussain, C. L. Tan, and M. Dash, 
“Discretization: An enabling technique,” Data Min. 
Knowl. Discov., vol. 6, no. 4, 2002. 
[22] F. Provost and T. Fawcett, “Robust classification for 
imprecise environments,” Mach. Learn., vol. 42, no. 
3, 2001. 
[23] A. S. Desuky, A. H. Omar, and N. M. Mostafa, 
“Boosting with crossover for improving imbalanced 
medical datasets classification,” Bull. Electr. Eng. 
Informatics, vol. 10, no. 5, 2021. 
https://doi.org/10.11591/eei.v10i5.3121 
[24] J. Xiao, L. Xie, C. He, and X. Jiang, “Dynamic 
classifier ensemble model for customer classification 
with imbalanced class distribution,” Expert Syst. 
Appl., vol. 39, no. 3, 2012. 
https://doi.org/10.1016/j.eswa.2011.09.059 
[25] C. Lu, S. Lin, X. Liu, and H. Shi, “Telecom fraud 
identification based on ADASYN and random 
forest,” in 2020 5th International Conference on 
Computer and Communication Systems, ICCCS 
2020, 2020. 
https://doi.org/10.1109/icccs49078.2020.9118521 
[26] C. Padurariu and M. E. Breaban, “Dealing with data 
imbalance in text classification,” in Procedia 
Computer Science, 2019, vol. 159. 
https://doi.org/10.1016/j.procs.2019.09.229 
[27] T. M. Ha and H. Bunke, “Off-line, handwritten 
numeral recognition by perturbation method,” IEEE 
Trans. Pattern Anal. Mach. Intell., vol. 19, no. 5, 
1997. 
https://doi.org/10.1109/34.589216 
[28] N. V Chawla, K. W. Bowyer, L. O. Hall, and W. P. 
Kegelmeyer, “SMOTE: synthetic minority over-
sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 
3 https://doi.org/10.1613/jair.953 21–357, 2002. 
[29] M. A. Kumar and A. J. Laxmi, “Machine Learning 
Based Intentional Islanding Algorithm for DERs in 
Disaster Management,” IEEE Access, vol. 9, 2021. 
https://doi.org/10.1109/access.2021.3087914 
[30] A. K. Hamoud, “Selection of Best Decision Tree 
Algorithm for Prediction and Classification of 
Students’ Action,” Am. Int. J. Res. Sci. Technol. 
Eng. Math., vol. 16, no. 1, pp. 26–32, 2016. 
[31] A. S. Hashim, W. A. Awadh, and A. K. Hamoud, 
“Student performance prediction model based on 
supervised machine learning algorithms,” in IOP 
Conference Series: Materials Science and 
Engineering, 2020, vol. 928, no. 3, p. 32019. 
https://doi.org/10.1088/1757-899x/928/3/032019 
[32] T. Saba, I. Abunadi, M. N. Shahzad, and A. R. Khan, 
“Machine learning techniques to detect and forecast 
the daily total COVID-19 infected and deaths cases 
under different lockdown types,” Microsc. Res. 
Tech., vol. 84, no. 7, 2021. 
https://doi.org/10.1002/jemt.23702 
[33] I. A. Najm, A. K. Hamoud, J. Lloret, and I. Bosch, 
“Machine Learning Prediction Approach to Enhance 
Congestion Control in 5G IoT Environment,” 
Electronics, vol. 8, no. 6, p. 607, May 2019. 
https://doi.org/10.3390/electronics8060607 
[34] J. Chen, Y. Lian, and Y. Li, “Real-time grain 
impurity sensing for rice combine harvesters using 
image processing and decision-tree algorithm,” 
Comput. Electron. Agric., vol. 175, 2020. 
https://doi.org/10.1016/j.compag.2020.105591 
[35] I. S. Masad, A. Al-Fahoum, and I. Abu-Qasmieh, 
“Automated measurements of lumbar lordosis in T2-
MR images using decision tree classifier and 
morphological image processing,” Eng. Sci. 
Technol. an Int. J., vol. 22, no. 4, 2019. 
https://doi.org/10.1016/j.jestch.2019.03.002 
[36] S. Khatoon et al., “Development of social media 
analytics system for emergency event detection and 
crisismanagement,” Comput. Mater. Contin., vol. 68, 
no. 3, 2021. 
https://doi.org/10.32604/cmc.2021.017371 
[37] H. Li, D. Caragea, C. Caragea, and N. Herndon, 
“Disaster response aided by tweet classification with 
a domain adaptation approach,” J. Contingencies 
Cris. Manag., vol. 26, no. 1, 2018. 
https://doi.org/10.1111/1468-5973.12194 
[38] Y. Y. Song and Y. Lu, “Decision tree methods: 
applications for classification and prediction,” 
Shanghai Arch. Psychiatry, vol. 27, no. 2, 2015. 
[39] N. Mahdi Abdulkareem and A. Mohsin Abdulazeez, 
“Machine Learning Classification Based on Radom 
Forest Algorithm: A Review,” Int. J. Sci. Bus., vol. 
5, no. 2, 2021. 
[40] S. M. Rasoolimanesh, M. Wang, J. L. Roldán, and P. 
Kunasekaran, “Are we in right path for mediation 
analysis? Reviewing the literature and proposing 
robust guidelines,” J. Hosp. Tour. Manag., vol. 48, 
2021. 
https://doi.org/10.1016/j.jhtm.2021.07.013 
[41] G. Biau and E. Scornet, “A random forest guided 
tour,” Test, vol. 25, no. 2, 2016. 
https://doi.org/10.1007/s11749-016-0481-7 
[42] N. Landwehr, M. Hall, and E. Frank, “Logistic 
Model Trees,” Mach. Learn., vol. 59, no. 1, pp. 161–
205, 2005. 
https://doi.org/10.1007/s10994-005-0466-3 
[43] W. S. Noble, “What is a support vector machine?” 
Nature Biotechnology, vol. 24, no. 12. 2006. 
https://doi.org/10.1038/nbt1206-1565 
[44] T. Joachims, “Svmlight: Support vector machine,” 
SVM-Light Support Vector Mach. 
http//svmlight.joachims.org/, Univ. Dortmund, vol. 
19, no. 4, 1999. 
[45] S. Ghosh, A. Dasgupta, and A. Swetapadma, “A 
study on support vector machine based linear and 
Predicting Students Performance Using Supervised Machine… Informatica 47 (2023)11–20 19 
non-linear pattern classification,” in Proceedings of 
the International Conference on Intelligent 
Sustainable Systems, ICISS 2019, 2019. 
https://doi.org/10.1109/iss1.2019.8908018 
[46] K. Park, R. Rothfeder, S. Petheram, F. Buaku, R. 
Ewing, and W. H. Greene, “Linear regression,” in 
Basic Quantitative Research Methods for Urban 
Planners, 2020. 
https://doi.org/10.4324/9780429325021-12 
[47] A. J. Scott, D. W. Hosmer, and S. Lemeshow, 
“Applied Logistic Regression.,” Biometrics, vol. 47, 
no. 4, 1991. 
https://doi.org/10.2307/2532419 
[48] B. R. Kirkwood and J. A. C. Sterne, Essential 
Medical Statistics. 2003. 
[49] S. Sperandei, “Understanding logistic regression 
analysis,” Biochem. Medica, vol. 24, no. 1, 2014. 
https://doi.org/10.11613/bm.2014.003 
[50] G. I. Webb, E. Keogh, and R. Miikkulainen, “Naïve 
Bayes.,” Encycl. Mach. Learn., vol. 15, pp. 713–714, 
2010. 
https://doi.org/10.1007/978-0-387-30164-8_576 
[51] H. Zhang, “The optimality of naive Bayes,” Aa, vol. 
1, no. 2, p. 3, 2004. 
[52] W. Lou, X. Wang, F. Chen, Y. Chen, B. Jiang, and 
H. Zhang, “Sequence based prediction of DNA-
binding proteins based on hybrid feature selection 
using random forest and Gaussian naive Bayes,” 
PLoS One, vol. 9, no. 1, p. e86703, 2014. 
         https://doi.org/10.1371/journal.pone.0086703 
[53] J. Pearl, “Bayesian networks,” 2011. 
[54] P. Arora, D. Boyne, J. J. Slater, A. Gupta, D. R. 
Brenner, and M. J. Druzdzel, “Bayesian networks for 
risk prediction using real-world data: a tool for 
precision medicine,” Value Heal., vol. 22, no. 4, pp. 
439–445, 2019. 
        https://doi.org/10.1016/j.jval.2019.01.006 
[55] D. Koller and A. Pfeffer, “Object-oriented Bayesian 
networks,” arXiv Prepr. arXiv1302.1554, 2013. 
[56] A. Khalaf et al., “Supervised Learning Algorithms in 
Educational Data Mining: A Systematic Review,” 
Southeast Eur. J. Soft Comput., vol. 10, no. 1, pp. 
55–70, 2021. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20 Informatica 47 (2023) 11–20 S. Alija et al.