https://doi.org/10.31449/inf.v47i9.5177 Informatica 47 (2023) 145–156 145
Enhancing Phishing Website Detection via Feature Selection in URL-
Based Analysis
Marwa A. Qasim, Nahla A. Flayh
E-mail: itpg.marwa.qasim@uobasrah.edu.iq, nahla.flayh@uobasrah.edu.iq
Department: Information Technology, Basrah University Basra, Iraq
Keywords: machine learning, detecting phishing, feature selection, regular expressions
Received: September 12, 2023
Detecting a phishing website accurately is crucial for ensuring the safety of online users, underscoring
the importance of maintaining a secure digital environment. This research delves into the effectiveness
of enhancing the detection of phishing websites by applying a new dataset generation method. The
method involves the transformation of a pure dataset obtained from Mendeley, by the utilization of
regular expressions to extract the important features so that a detection process can be performed
correctly with high performance. Based on the proposed features, we selected the best machine-learning
algorithm. We performed a rigorous evaluation using three prominent machine learning algorithms:
Decision Trees, Support Vector Machines (SVM), and Random Forests, achieving accuracies of 0.96 for
Decision Trees, 0.97 for SVM, and 0.98 for Random Forests. One of the critical
contributions of this research is the deliberate selection of features. We have leveraged regular
expressions to create a feature set that captures salient aspects of URLs and optimizes the algorithms'
detection capabilities. This research has examined how feature selection affects the performance of each
algorithm, highlighting its strengths and uncovering its weaknesses.
Povzetek: Raziskava obravnava izboljšanje zaznavanja lažnih spletnih strani z novim načinom
ustvarjanja podatkovnih zbirk in testiranjem različnih algoritmov strojnega učenja.
1 Introduction
Detecting phishing websites is a critical cybersecurity
issue due to the sophistication of these attacks and their
potential to compromise sensitive information. To protect
individuals and organizations from financial loss, data
breaches, and reputational damage, detecting and
preventing phishing attacks is essential [1].
Consequently, it is essential to develop effective
techniques for identifying and mitigating phishing
websites. With the advent of
machine learning algorithms, security systems can
recognize and prevent cyber-attacks more efficiently and
effectively [2].
Machine learning is a subset of
artificial intelligence in which machines learn from
data, improve performance based on past experience,
and make predictions based on that data [3]. Machine
learning can be divided into supervised, unsupervised,
and reinforcement learning types. The focus of our
research will be on supervised learning. In supervised
learning, the training dataset consists of previous
instances where both input and output values are known
and labeled [4, 5]. One of the strategies that can enhance
machine-learning performance is feature selection, so our
research involved implementing a feature selection
methodology, specifically a rule-based approach utilizing
regular expressions. The approach is considered to be
one of the most important methods for enhancing the
efficiency of machine learning algorithms by refining the
features to be used in the model.
2 Related work
Phishing detection is an essential component of
cybersecurity, which aims to identify fraudulent attempts
to deceive users into disclosing sensitive information. A
variety of techniques, including machine learning
algorithms and rule-based approaches, have been
employed for effective detection. This section highlights
essential machine learning techniques in the field of
phishing detection and provides an overview of the state-
of-the-art in phishing detection.
Bin B. Zhu, et al (2013) [6]: The authors evaluated
the effectiveness of machine learning-based phishing
detection methods using a secure website. 18 useful
features were presented and tested for incorporation into
the detector based exclusively on the lexical and domain
characteristics provided by the authors. Finding the
appropriate combination of attributes has resulted in a
detector with a detection rate higher than 98%. They
employed support vector machines and Gaussian radial
basis function algorithms. Phishing URLs were taken
from the Taobao-phishing dataset, safe URLs were taken
from the Yahoo! Directory, and well-known Chinese
navigation sites were analyzed.
W. Fadheel, et al (2017) [7]: Datasets from the UCI
machine learning repository were used in this research,
including Domain, HTML, Address Bar, and URLs. The
main contribution is represented by a comparative
analysis of the impact of feature selection on the
detection of phishing websites. A KMO test was applied
in the research to evaluate the dataset using (LR) and
(SVM) classification algorithms. A correlation matrix
was used to analyze the performance of the test. LR with
the KMO test achieved an accuracy of 91.68%, while
SVM with the KMO test achieved an accuracy of
93.59%.
I. Tyagi, et al. (2018) [8]: The research uses machine
learning algorithms to identify whether a website is
legitimate or phishing. A URL is used to determine this.
The most significant contribution is represented by the
development of a new model, the Generalized Linear
Model (GLM). Two different methods are combined in
the model. Detecting phishing websites is most accurate
when Random Forest and GLM are mixed with 98.4%
accuracy.
Arun D. Kulkarni (2019) [9]: SVM, Naive Bayes,
decision trees, and neural networks were evaluated in the
research for detecting phishing URLs. The research
used a dataset containing 1353 URLs from the University
of California, Irvine Machine Learning Repository.
There are nine features associated with each URL. To
evaluate the performance of the algorithms, two steps
were taken. The process begins with the extraction of
features from URLs. In the second step, a model is developed
based on data from a training set. Based on
the developed model, URLs are classified. According
to the results, the pruned decision tree produced the
highest accuracy of 91.5%. It was followed by the Naive
Bayes Classifier with 86.14 %, and the Neural Network
with 84.87%.
S Premnath, et al. (2020) [10]: Using a
machine-learning framework, the research provides an
in-depth analysis of phishing websites. The research used
a dataset containing URLs from legitimate and phishing
websites, so that different machine learning
algorithms could be evaluated on distinguishing between
phishing websites and legitimate websites. By combining
the two phases of classification and phishing detection,
the research contributed to the development of an
efficient machine-learning framework. The proposed
system uses five machine learning classifiers
to analyze URL features and detect phishing
websites (Random Forest, Logistic Regression,
Decision Tree, Nearest-Neighbor, and Support Vector
Machine). Among these classifiers, Random Forest
achieved the highest accuracy, 91.4%.
A. Lakshmanarao, and P. Surya (2021) [11]: The research used a
dataset containing 11055 samples and 30 features of
phishing websites from UCI's repository. Various
machine learning techniques, including decision trees,
AdaBoost, support vector machines (SVM), and random
forests, were used to analyze specific features such as
port, web traffic, URL length, URL_of_Anchor, and IP
address. The research sought to determine the most effective
method of detecting phishing websites.
Two algorithms, PA1 and PA2, were introduced as part of the
research and achieved a 97% accuracy rate.
M Abutaha, et al. (2021) [12]: A method for
detecting phishing attacks is presented using URL lexical
analysis and machine learning classifiers. A variety of
machine learning models were trained and tested on a
variety of feature sets. The approach appears to be
beneficial for detecting phishing attacks. Web requests' headers
contain URLs that are used for detection and prevention.
Moreover, machine learning techniques have also proven
effective in the area of security. The dataset used
consisted of 1056937 labeled URLs (phishing and
legitimate) and 14 different features. Different types of
classifiers were evaluated, including gradient boosting,
Random Forest, Support Vector Machine (SVM), and
Neural Networks. Based on the results, SVM was the
most accurate at 99.89% in detecting the URLs analyzed.
Moreover, the neural network had the lowest accuracy
score of all the classifiers, coming in at approximately
97%.
N. Choudhary, S. Jain, K. Jain (2022) [13]: URL
attributes are the focus of the research. The dataset used
in the research was obtained from both Kaggle and
Phishtank. The researchers used a hybrid approach that
combined Principal Component Analysis (PCA), Support
Vector Machines (SVM), and Random Forest algorithms
to reduce the dataset's dimensionality while maintaining
all relevant data. A higher accuracy rate of 96.8% was
obtained with this method as compared to other
techniques.
S. Arvind Anwekar, and V. Agrawal (2022) [14]:
According to the authors, the research focused on
extracting features from URLs. Several features were
considered, including the SSL certificate's age, the
anchor's URI, the IFRAME, and the website's ranking.
The total number of phishing URLs collected from
Phish-Tank was 19653. The total number of benign
URLs collected from Alexa was 17058. The authors
developed a method for detecting phishing websites
using random forests (RF), decision trees
(DT), and support vector machines (SVM). The
performance of the classifier also improved with the
addition of more training data. With a 90% training and
10% testing split, the method achieved a
high detection accuracy of 97.14% and a low false
positive rate of 3.14%.
A Prathap, et al (2023) [15]: This method could be
used to automate systems that are highly effective in
combating website phishing. Furthermore, as a result of
its effectiveness and efficiency, this research performs
well in literature comparisons. SVM and random forest
algorithms were used to classify and predict phishing
attacks. Data was collected from phishing websites. The
UC Irvine Machine Learning Repository database
contains approximately 11,000 data points containing 30
features derived from website features. Random Forest
classifiers achieve an accuracy of 89.63%, while SVM
classifiers achieve an accuracy of 89.84%.
UB Penta, et al (2023) [16]: The purpose of this
research is to identify phishing websites using machine
learning methods such as Support Vector Machines
(SVM), K Nearest Neighbors (KNN), and Naive Bayes
(NB). Feature Extraction (FE) techniques were used to
extract essential attributes from the Phish-Tank website,
10,000 phishing URLs, and 10,000 benign URLs. Two
approaches were used: one based on URLs and one based on
hyperlinks. The results of both FE approaches
are used as inputs for the ML model. SVM achieved the
highest accuracy score of 98.05%, while KNN achieved
the lowest accuracy score of 95.67%.
Table 1: A summary of related works showing the results, methodologies, and performance metrics of the research studies reviewed

| Study | Year | Methodology | Performance Metrics | Results (SOTA) | Lacks in feature selection |
|---|---|---|---|---|---|
| [6] | 2013 | ML-based on linguistic and domain characteristics; SVM and Gaussian radial basis function algorithms | Detection rate > 98% | Effective attribute combination | Limited feature selection from linguistic and domain characteristics |
| [7] | 2017 | Feature selection impact analysis; KMO test; LR and SVM classification | LR: 91.68%, SVM: 93.59% | Importance of feature selection | Limited exploration of feature selection impact |
| [8] | 2018 | Generalized Linear Model (GLM) combining Random Forest | GLM: 98.4% | Novel GLM model | Lacks explanation of feature selection |
| [9] | 2019 | SVM, Naive Bayes, decision trees, neural networks; 1353 URL dataset | Pruned decision tree: 91.5%, Naive Bayes: 86.14%, Neural Network: 84.87% | Comparative algorithm performance | Limited feature extraction and selection |
| [10] | 2020 | Multiple ML classifiers (Random Forest, LR, Decision Tree, KNN, SVM); URL features analysis | Random Forest: 91.4% | High accuracy using ensemble methods | It is lacking in the selection of features and the utilization of machine learning algorithms |
| [11] | 2021 | Various ML techniques; 11055 samples and 30 features dataset | PA1 and PA2 algorithms: 97% | Specific feature analysis | Select features using ANOVA F-value and Mutual Information. These feature selection methods are valid, but just a subset. |
| [12] | 2021 | Phishing URL detection using linguistic analysis; Various ML models | SVM: 99.89%, Neural Network: ~97% | High accuracy of SVM | URL length, hostname length, and keywords are frequently used. These features may miss advanced phishing methods that obfuscate or manipulate real URLs. |
| [13] | 2022 | PCA, SVM, Random Forest; Hybrid approach for dimensionality reduction | 96.8% | Dimensionality reduction impact | Focuses on URL-only feature extraction, which may miss some phishing website details. |
| [14] | 2022 | Feature extraction from URLs; RF, DT, SVM; Training data size impact | Accuracy: 97.14%, False positive rate: 3.14% | Performance with more training data | Feature extraction from URLs, does not clarify how these features were selected or whether relevance was determined using feature selection approaches. |
| [15] | 2023 | SVM and Random Forest; 11000 data points, 30 features dataset | Random Forest: 89.63%, SVM: 89.84% | Accuracy of classification | The report lacks an explanation of the process of feature selection. |
| [16] | 2023 | SVM, KNN, NB; Feature extraction from Phish-Tank URLs | SVM: 98.05%, KNN: 95.67% | Comparison of ML methods | The used features are informative, but they may not accurately depict the specific attributes of more current phishing assaults. |
Comparing Results
Our methodology uses regular expressions to extract
significant features from URLs. These features are then
used as input for Decision Trees, Support Vector
Machines (SVM), and Random Forests. Our experiments
yielded accuracies of 0.968, 0.973, and 0.976 for
Decision Trees, SVM, and Random Forests, respectively.
Compared with the studies listed in Table 1, our approach
achieves competitive or higher accuracy. This matters in
phishing website detection, where high accuracy is
essential for strong security. The distinctive aspect of
our feature selection method is its use of regular
expressions to capture the URL patterns and structural
attributes that are crucial for detecting phishing
attempts. Extracting features in this way improves the
detection capability of the machine learning algorithms,
enabling them to distinguish subtle differences between
phishing and legitimate URLs. The methodology therefore
offers a more precise and dependable way to identify
phishing websites and addresses the difficulty of
selecting appropriate features.
3 Structure of URL
To understand the attackers' approach, the components
of a URL must first be known. Figure 1 shows the basic
structure of a URL [17].
A URL (Uniform Resource Locator) is a web address
that identifies the location of a particular website on the
Internet [18]. In phishing attacks, attackers manipulate
URLs in multiple ways, such as creating special URLs,
manipulating URLs, and manipulating keywords [17].
A URL is made up of several components, some required
and others optional. Its basic structure consists of the
following elements:
1. Protocol: The protocol describes how a browser
connects to a website. The protocol could be
HTTP (hypertext transfer protocol) or HTTPS
(HTTP secure).
2. Domain name: A domain name is the name of a
website, such as XYZ-company.com.
3. Sub-domain: Subdomains are prefixes placed before
the domain name, such as www.
4. Top-level domain: This refers to the suffix of the
domain name, such as .com, .org, .net, etc.
5. The Path: The path refers to the location of the
resource on the server, such as /info/.
6. File name: The file name is a freely selectable
portion of text appearing before the file extension. It
should provide information about a particular file, such
as index.html.
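The component list above can be illustrated with a short sketch. It relies on Python's standard `urllib.parse.urlsplit`; the helper name `split_url`, the dictionary keys, and the example URL (assembled from the examples in the list) are assumptions made for illustration, not part of the original study. The domain split is deliberately naive and treats the last label of the host as the top-level domain.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Break a URL into the components listed above (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Naive assumption: the last label is the top-level domain and the label
    # before it is the registered name. Multi-part suffixes such as .co.uk
    # would need a public-suffix list to be handled correctly.
    tld = labels[-1] if len(labels) >= 2 else ""
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    subdomain = ".".join(labels[:-2])
    # Everything up to the last "/" is the path; the remainder is the file name.
    path, _, file_name = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "tld": tld,
        "path": path + "/" if path or parts.path.startswith("/") else "",
        "file_name": file_name,
    }

example = split_url("https://www.xyz-company.com/info/index.html")
for name, value in example.items():
    print(f"{name}: {value}")
```

For the example URL, the sketch reports `https` as the protocol, `www` as the sub-domain, `xyz-company.com` as the domain name, `com` as the top-level domain, `/info/` as the path, and `index.html` as the file name.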
4 Methodology
Machine learning-based systems depend strongly on
the dataset and feature selection [19]. They have a direct
impact on the system's effectiveness and efficiency.
Therefore, these topics are discussed in detail in the
following sections.
4.1 Dataset
In our research, we enhanced phishing website
detection through feature selection in URL-based
analysis, using Decision Trees, Support Vector
Machines (SVM), and Random Forests to detect phishing
websites. For this purpose, we used the Mendeley
dataset [20], a collection of legitimate and phishing
websites. The dataset contains 80,000 instances:
50,000 legitimate websites and 30,000 phishing sites.
Each instance includes a URL, an HTML page, an index,
and a result with a binary value (0 for legitimate, 1 for
phishing). To expedite feature extraction from URLs and
make better use of computational resources, the dataset
was reduced in size: 8,000 URLs were randomly selected,
consisting of 4,000 legitimate URLs and 4,000 phishing
URLs.
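The balanced random selection described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function name `balanced_sample`, the record layout (dictionaries with `url` and `result` keys), the fixed seed, and the tiny synthetic data are all hypothetical, and the paper does not say how the sampling was actually implemented.

```python
import random

def balanced_sample(records, per_class, seed=0):
    """Draw `per_class` records at random from each class (result 0 and 1)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for label in (0, 1):
        pool = [r for r in records if r["result"] == label]
        sample.extend(rng.sample(pool, per_class))
    rng.shuffle(sample)  # avoid having all legitimate records first
    return sample

# Synthetic stand-in for the full dataset: 50 legitimate and 30 phishing records.
records = [{"url": f"http://legit{i}.example/", "result": 0} for i in range(50)]
records += [{"url": f"http://phish{i}.example/", "result": 1} for i in range(30)]

subset = balanced_sample(records, per_class=4)
print(len(subset), sum(r["result"] for r in subset))
```

With the real dataset, `per_class=4000` would correspond to the 4,000 legitimate and 4,000 phishing URLs mentioned in the text.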
Figure 1: URL structure
4.2 Feature selection
The new strategy employed in creating datasets
involved a careful rule-based methodology that
effectively leveraged the capabilities of regular
expressions to extract crucial attributes from URLs. In
the context of this method, an extensive list of
characteristics was initially considered, followed by
rigorous testing to identify the 30 attributes that exhibited
the highest relevance. The elements encompassed a
diverse range of aspects, including address bar features,
abnormal features, HTML and JavaScript features, and
domain features. These features are listed in Table 2. A
comprehensive rule-based analysis was conducted on
each URL within the initial dataset to determine the
presence or absence of these 30 specific attributes.
As an illustration, one of the rules assesses the
URL's length. Phishers often use long Uniform Resource
Locators (URLs) to obscure their malicious intent. To
assess the dependability of the results, the mean URL
length was calculated across the URLs in the
dataset. URLs above the specified limit of 65 characters
were classified as possibly indicative of phishing
websites, whilst URLs below this limit were
deemed legitimate.
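As a minimal sketch of the length rule described in this paragraph, the functions below label a URL as 1 when it exceeds 65 characters and as 0 otherwise. The names `long_url_feature` and `mean_url_length` are hypothetical, and treating a URL of exactly 65 characters as legitimate is an assumption, since the text only covers URLs above and below the limit.

```python
URL_LENGTH_LIMIT = 65  # character limit stated in the text

def long_url_feature(url, limit=URL_LENGTH_LIMIT):
    """Return 1 (possibly phishing) if the URL is longer than `limit`, else 0."""
    # Assumption: a URL of exactly `limit` characters counts as legitimate.
    return 1 if len(url) > limit else 0

def mean_url_length(urls):
    """Average number of characters per URL in a collection."""
    return sum(len(u) for u in urls) / len(urls)

short_url = "http://example.com/"
long_url = "http://example.com/" + "a" * 60  # 79 characters in total

print(long_url_feature(short_url), long_url_feature(long_url))
print(mean_url_length([short_url, long_url]))
```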
Regular expressions were essential to this process
because they identify patterns, substrings, and specific
attributes within the URLs. This information was then
used to assess the legitimacy of the URLs under
scrutiny. The rule-based methodology built on regular
expressions produced a new binary dataset in which
instances denoting legitimate websites were assigned the
label "0," while instances indicating potential phishing
sites were assigned the label "1." This approach to
dataset generation ensured the inclusion of only URLs
that demonstrated the predetermined features of
significance, as identified by the regular expressions,
which improved the accuracy and usefulness of the
dataset for subsequent modeling and analysis.
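To make the rule-based extraction concrete, the sketch below turns a URL into a handful of binary features using Python's `re` module. The feature names follow a few of the address bar features in the study's feature list (IP address, "@" symbol, "//", prefix or suffix separated by "-", tilde symbol, and an "HTTPS" token in the domain), but the regular expressions, the function name `extract_features`, and the dictionary keys are assumptions chosen for illustration; the paper does not give its exact patterns.

```python
import re
from urllib.parse import urlsplit

# Host written as a dotted IPv4 address, e.g. 192.168.0.1 (assumed pattern).
IPV4_HOST = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")

def extract_features(url):
    """Map a URL to binary features: 1 = phishing indicator present, 0 = absent."""
    host = urlsplit(url).hostname or ""
    # Text after the scheme separator, used to look for an extra "//".
    after_scheme = url.split("://", 1)[-1]
    return {
        "ip_address": int(bool(IPV4_HOST.match(host))),
        "at_symbol": int(bool(re.search(r"@", url))),
        "double_slash": int(bool(re.search(r"//", after_scheme))),
        "prefix_suffix": int(bool(re.search(r"-", host))),
        "tilde_symbol": int(bool(re.search(r"~", url))),
        "https_token": int(bool(re.search(r"https", host))),
    }

legit = extract_features("https://www.example.com/index.html")
by_ip = extract_features("http://192.168.0.1/~admin/login")
tricky = extract_features("http://https-secure-login.example.com//verify@now")

for row in (legit, by_ip, tricky):
    print(row)
```

Each URL becomes one row of 0/1 values, which is the shape of dataset that the classifiers in the next step can consume directly.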
Rule for the URL length feature:
\[
\text{feature} =
\begin{cases}
\text{Phishing (1)}, & \text{if URL length} > 65 \\
\text{Legitimate (0)}, & \text{otherwise}
\end{cases}
\]
In addition to the construction of the new dataset,
machine-learning techniques were implemented as the
next step. Using these techniques, we were able to train
and evaluate models geared toward detecting phishing
websites using the recently curated dataset.
Table 2: List of features
| Criteria | Phishing Features |
|---|---|
| Address Bar Features | IP Address |
| | Long URL |
| | “TinyURL” |
| | “@” Symbol |
| | using “//” |
| | Adding Prefix or Suffix Separated by (-) to the Domain |
| | Sub-Domain and Multi-subdomains |
| | HTTPS |
| | Domain Registration Length |
| | Favicon |
| | Non-Standard Port |
| | tilde_symbol |
| | The Existence of “HTTPS” Token in the Domain Part of the URL |
| Abnormal Features | Request URL |
| | URL of Anchor |
Links in ,