https://doi.org/10.31449/inf.v47i9.5177
Informatica 47 (2023) 145–156

Enhancing Phishing Website Detection via Feature Selection in URL-Based Analysis

Marwa A. Qasim, Nahla A. Flayh
E-mail: itpg.marwa.qasim@uobasrah.edu.iq, nahla.flayh@uobasrah.edu.iq
Department: Information Technology, Basrah University, Basra, Iraq

Keywords: machine learning, detecting phishing, feature selection, regular expressions
Received: September 12, 2023

Accurately detecting phishing websites is crucial for the safety of online users and for maintaining a secure digital environment. This research examines how the detection of phishing websites can be improved by applying a new dataset generation method. The method transforms a raw dataset obtained from Mendeley, using regular expressions to extract the important features so that detection can be performed correctly and with high performance. Based on the proposed features, we selected the best machine-learning algorithm. We performed a rigorous evaluation using three prominent machine learning algorithms: Decision Trees, Support Vector Machines (SVM), and Random Forests, achieving accuracies of 0.96 for Decision Trees, 0.97 for SVM, and 0.98 for Random Forests. One of the critical contributions of this research is the deliberate selection of features. We have leveraged regular expressions to create a feature set that captures salient aspects of URLs and optimizes the algorithms' detection capabilities. This research has examined how feature selection affects the performance of each algorithm, highlighting its strengths and uncovering its weaknesses.

Povzetek (summary): The study addresses improving the detection of fake websites through a new way of creating datasets and by testing various machine learning algorithms.

1 Introduction

Detecting phishing websites is a critical cybersecurity issue due to the sophistication of these attacks and their potential to compromise sensitive information. To protect individuals and organizations from financial loss, data breaches, and reputational damage, detecting and preventing phishing attacks is essential [1]. Consequently, effective techniques for identifying and mitigating phishing websites are needed. With the advent of machine learning algorithms, security systems can recognize and prevent cyber-attacks more efficiently and effectively [2]. Machine learning is a subset of artificial intelligence in which machines learn from data, improve their performance based on past experience, and make predictions based on that data [3]. Machine learning can be divided into supervised, unsupervised, and reinforcement learning. Our research focuses on supervised learning, in which the training dataset consists of previous instances whose input and output values are both known and labeled [4, 5]. One strategy that can enhance machine-learning performance is feature selection, so our research implemented a feature selection methodology, specifically a rule-based approach utilizing regular expressions. Feature selection is considered one of the most important methods for enhancing the efficiency of machine learning algorithms, because it refines the features used in the model.

2 Related work

Phishing detection is an essential component of cybersecurity, which aims to identify fraudulent attempts to deceive users into disclosing sensitive information. A variety of techniques, including machine learning algorithms and rule-based approaches, have been employed for effective detection.
This section highlights essential machine learning techniques in the field of phishing detection and provides an overview of the state of the art.

Bin B. Zhu, et al. (2013) [6]: The authors evaluated the effectiveness of machine learning-based phishing detection methods using a secure website. Eighteen useful features were presented and tested for incorporation into the detector, based exclusively on the lexical and domain characteristics provided by the authors. Finding the appropriate combination of attributes resulted in a detector with a detection rate higher than 98%. They employed support vector machines and Gaussian radial basis function algorithms. Phishing URLs were taken from the Taobao-phishing dataset, safe URLs were taken from the Yahoo! Directory, and well-known Chinese navigation sites were analyzed.

W. Fadheel, et al. (2017) [7]: Datasets from the UCI machine learning repository were used in this research, including Domain, HTML, Address Bar, and URLs. The main contribution is a comparative analysis of the impact of feature selection on the detection of phishing websites. A KMO test was applied to evaluate the dataset using logistic regression (LR) and SVM classification algorithms. A correlation matrix was used to analyze the performance of the test. LR with the KMO test achieved an accuracy of 91.68%, while SVM with the KMO test achieved an accuracy of 93.59%.

I. Tyagi, et al. (2018) [8]: The research uses machine learning algorithms to identify whether a website is legitimate or phishing, based on its URL. The most significant contribution is the development of a new model, the Generalized Linear Model (GLM), which combines two different methods. Detection of phishing websites was most accurate when Random Forest and GLM were combined, reaching 98.4% accuracy.

Arun D.
Kulkarni (2019) [9]: SVM, Naive Bayes, decision trees, and neural networks were evaluated for detecting phishing URLs. The research used a dataset containing 1353 URLs from the University of California, Irvine Machine Learning Repository, with nine features associated with each URL. The algorithms were evaluated in two steps. In the first step, features are extracted from the URLs. In the second step, a model is developed from a training set and then used to classify URLs. According to the results, the pruned decision tree produced the highest accuracy of 91.5%. It was followed by the Naive Bayes classifier with 86.14% and the neural network with 84.87%.

S. Premnath, et al. (2020) [10]: Using a machine-learning framework, the research provides an in-depth analysis of phishing websites. The research used a dataset containing URLs from legitimate and phishing websites, so that different machine learning algorithms could be evaluated on distinguishing between the two. By combining the two phases of classification and phishing detection, the research contributed to the development of an efficient machine-learning framework. The proposed system uses five machine learning classifiers to analyze URL features and detect phishing websites: Random Forest, Logistic Regression, Decision Tree, Nearest-Neighbor, and Support Vector Machine. Among these classifiers, Random Forest achieved the highest accuracy, 91.4%.

A. Lakshmanarao and P. Surya (2021) [11]: The research used a dataset containing 11055 samples and 30 features of phishing websites from UCI's repository.
Various machine learning techniques, including decision trees, AdaBoost, support vector machines (SVM), and random forests, were used to analyze specific features such as port, web traffic, URL length, URL_of_Anchor, and IP address. The research set out to determine the most effective method of detecting phishing websites. PA1 and PA2 were introduced as part of the research. A 97% accuracy rate was achieved by these algorithms.

M. Abutaha, et al. (2021) [12]: A method for detecting phishing attacks is presented using URL lexical analysis and machine learning classifiers. A variety of machine learning models were trained and tested on a variety of feature sets. The approach appears to be beneficial against phishing attacks. The URLs contained in the headers of web requests are used for detection and prevention. Moreover, machine learning techniques have also proven effective in the area of security. The dataset used consisted of 1056937 labeled URLs (phishing and legitimate) and 14 different features. Different types of classifiers were evaluated, including gradient boosting, Random Forest, Support Vector Machine (SVM), and Neural Networks. Based on the results, SVM was the most accurate at 99.89% in detecting the URLs analyzed, while the neural network had the lowest accuracy of all the classifiers, at approximately 97%.

N. Choudhary b, S. Jain, K. Jain (2022) [13]: URL attributes are the focus of the research. The dataset was obtained from both Kaggle and Phishtank. The researchers used a hybrid approach that combined Principal Component Analysis (PCA), Support Vector Machines (SVM), and Random Forest algorithms to reduce the dataset's dimensionality while retaining the relevant information. This method obtained an accuracy of 96.8%, higher than the other techniques compared.

S. Arvind Anwekar and V. Agrawal (2022) [14]: The research focused on extracting features from URLs.
Several features were considered, including the SSL certificate's age, the anchor's URI, the IFRAME, and the website's ranking. A total of 19653 phishing URLs were collected from Phish-Tank and 17058 benign URLs were collected from Alexa. The authors developed a method for detecting phishing websites using random forests (RF), decision trees (DT), and support vector machines (SVM). The performance of the classifier improved with the addition of more training data. With the dataset split into 90% training and 10% testing, it achieved a high detection accuracy of 97.14% and a low false positive rate of 3.14%.

A. Prathap, et al. (2023) [15]: The authors state that this method could be used to automate systems that are highly effective in combating website phishing, and that it performs well in comparisons with the literature. SVM and random forest algorithms were used to classify and predict phishing attacks. Data was collected from phishing websites. The UC Irvine Machine Learning Repository database contains approximately 11,000 data points with 30 features derived from website characteristics. The Random Forest classifier achieved an accuracy of 89.63%, while the SVM classifier achieved an accuracy of 89.84%.

U.B. Penta, et al. (2023) [16]: The purpose of this research is to identify phishing websites using machine learning methods such as Support Vector Machines (SVM), K Nearest Neighbors (KNN), and Naive Bayes (NB). Feature Extraction (FE) techniques were used to extract essential attributes from the Phish-Tank website, 10,000 phishing URLs, and 10,000 benign URLs. Two approaches were used, one based on URLs and one based on hyperlinks. The results of both FE approaches are used as inputs for the ML model.
SVM achieved the highest accuracy score of 98.05%, while KNN achieved the lowest accuracy score of 95.67%.

Table 1: A summary of the related works, showing the methodologies, performance metrics, and results of the research studies reviewed

| Study | Year | Methodology | Performance Metrics | Results | SOTA lacks in feature selection |
|---|---|---|---|---|---|
| [6] | 2013 | ML based on linguistic and domain characteristics; SVM and Gaussian radial basis function algorithms | Detection rate > 98% | Effective attribute combination | Limited feature selection from linguistic and domain characteristics |
| [7] | 2017 | Feature selection impact analysis; KMO test; LR and SVM classification | LR: 91.68%, SVM: 93.59% | Importance of feature selection | Limited exploration of feature selection impact |
| [8] | 2018 | Generalized Linear Model (GLM) combined with Random Forest | GLM: 98.4% | Novel GLM model | Lacks explanation of feature selection |
| [9] | 2019 | SVM, Naive Bayes, decision trees, neural networks; 1353 URL dataset | Pruned decision tree: 91.5%, Naive Bayes: 86.14%, Neural Network: 84.87% | Comparative algorithm performance | Limited feature extraction and selection |
| [10] | 2020 | Multiple ML classifiers (Random Forest, LR, Decision Tree, KNN, SVM); URL features analysis | Random Forest: 91.4% | High accuracy using ensemble methods | Lacking in the selection of features and the utilization of machine learning algorithms |
| [11] | 2021 | Various ML techniques; 11055 samples and 30 features dataset | PA1 and PA2 algorithms: 97% | Specific feature analysis | Selects features using ANOVA F-value and Mutual Information. These feature selection methods are valid, but just a subset. |
| [12] | 2021 | Phishing URL detection using linguistic analysis; various ML models | SVM: 99.89%, Neural Network: ~97% | High accuracy of SVM | URL length, hostname length, and keywords are frequently used. These features may miss advanced phishing methods that obfuscate or manipulate real URLs. |
| [13] | 2022 | PCA, SVM, Random Forest; hybrid approach for dimensionality reduction | 96.8% | Dimensionality reduction impact | Focuses on URL-only feature extraction, which may miss some phishing website details. |
| [14] | 2022 | Feature extraction from URLs; RF, DT, SVM; training data size impact | Accuracy: 97.14%, False positive rate: 3.14% | Performance with more training data | Extracts features from URLs but does not clarify how these features were selected or whether relevance was determined using feature selection approaches. |
| [15] | 2023 | SVM and Random Forest; 11000 data points, 30 features dataset | Random Forest: 89.63%, SVM: 89.84% | Accuracy of classification | The report lacks an explanation of the process of feature selection. |
| [16] | 2023 | SVM, KNN, NB; feature extraction from Phish-Tank URLs | SVM: 98.05%, KNN: 95.67% | Comparison of ML methods | The features used are informative, but they may not accurately depict the specific attributes of more recent phishing attacks. |

Comparing Results

The methodology employed in our study uses regular expressions to extract significant features from URLs. These features are then used as input for Decision Trees, Support Vector Machines (SVM), and Random Forests. Our experiments yielded accuracies of 0.968, 0.973, and 0.976 for Decision Trees, SVM, and Random Forests, respectively. Comparing these findings with the studies listed in Table 1 shows that our approach achieves competitive accuracy, and higher accuracy than several of them. This matters in phishing website detection, where a high level of accuracy is essential for strong security measures.
The distinguishing property of our feature selection method is that regular expressions can capture complex URL patterns and structural attributes, which are crucial for detecting phishing attempts. This feature extraction method strengthens the detection capabilities of machine learning algorithms, enabling them to pick up subtle yet crucial distinctions between phishing and authentic URLs. The methodology therefore provides a more precise and dependable way of identifying phishing websites, addresses the difficulty of selecting appropriate features, and ultimately strengthens cybersecurity defenses.

3 Structure of URL

To understand the attackers' approach, the components of a URL must first be known. A visual representation of the URL's basic structure is given in Figure 1 [17]. A URL (Uniform Resource Locator) is a web address that identifies the location of a particular website on the Internet [18]. In phishing attacks, attackers manipulate URLs in multiple ways, such as creating special URLs, manipulating URLs, and manipulating keywords [17]. URLs consist of different components, some required and others optional. The basic URL structure consists of the following elements:

1. Protocol: The protocol describes how a browser connects to a website. The protocol could be HTTP (hypertext transfer protocol) or HTTPS (HTTP secure).
2. Domain name: A domain name is the name of a website, such as XYZ-company.com.
3. Sub-domain: Subdomains are prefixes used to identify a domain name, such as www.
4. Top-level domain: This refers to the suffix of the domain name, such as .com, .org, .net, etc.
5. Path: The path refers to the location of the resource on the server, such as /info/.
6. File name: The file name is a freely selectable portion of text appearing before the file extension.
It should provide information about a particular file, such as index.html.

4 Methodology

Machine learning-based systems depend strongly on the dataset and on feature selection [19]. Both have a direct impact on the system's effectiveness and efficiency. Therefore, these topics are discussed in detail in the following sections.

4.1 Dataset

In our research, we enhanced phishing website detection through feature selection in URL-based analysis, using Decision Trees, Support Vector Machines (SVM), and Random Forests to detect phishing websites. For this purpose, we used the Mendeley data [20], a dataset that contains a collection of legitimate and phishing websites. The database contains 80,000 instances, comprising 50,000 legitimate websites and 30,000 phishing sites. Each instance includes a URL, an HTML page, an index, and a result with a binary value of either 0 or 1 (0 for legitimate, 1 for phishing). This extensive database was reduced in size to expedite feature extraction from URLs and to make better use of computational resources. Accordingly, 8,000 URLs were randomly selected from the dataset, consisting of 4,000 legitimate URLs and 4,000 phishing URLs.

Figure 1: URL structure

4.2 Feature selection

The new strategy for creating the dataset is a rule-based methodology that uses regular expressions to extract crucial attributes from URLs. An extensive list of characteristics was initially considered, followed by rigorous testing to identify the 30 attributes with the highest relevance. These attributes cover a diverse range of aspects, including address bar features, abnormal features, HTML and JavaScript features, and domain features. These features are listed in Table (2).
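The rule-based extraction described above can be illustrated with a short Python sketch. This is not the authors' implementation: the function name `extract_features`, the regular expressions, the list of shortening services, and the subdomain cutoff are all assumptions chosen for illustration. Only the feature names follow the address bar features of Table 2.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns; the paper does not publish its exact expressions.
IPV4_HOST = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
SHORTENER_HOST = re.compile(r"^(?:bit\.ly|tinyurl\.com|goo\.gl|t\.co)$", re.IGNORECASE)
HTTPS_TOKEN = re.compile(r"https", re.IGNORECASE)


def extract_features(url: str) -> dict[str, int]:
    """Map a URL to binary features (1 = phishing indicator, 0 = none)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Everything after the scheme separator, so the leading "//" is ignored.
    after_scheme = url.split("://", 1)[-1]
    is_ip = bool(IPV4_HOST.match(host))
    return {
        "ip_address": int(is_ip),
        "tiny_url": int(bool(SHORTENER_HOST.match(host))),
        "at_symbol": int("@" in url),
        "double_slash_redirect": int("//" in after_scheme),
        "prefix_suffix_hyphen": int("-" in host),
        # Assumed cutoff: more than two dots in a non-IP host.
        "multi_subdomain": int(not is_ip and host.count(".") > 2),
        "https_token_in_domain": int(bool(HTTPS_TOKEN.search(host))),
        "tilde_symbol": int("~" in url),
    }


if __name__ == "__main__":
    for sample in (
        "http://paypal.com@192.168.0.1/login",
        "https://www.example.com/index.html",
    ):
        print(sample, extract_features(sample))
```

Under these assumed patterns, `http://paypal.com@192.168.0.1/login` sets `ip_address` and `at_symbol` to 1, while `https://www.example.com/index.html` yields all zeros. Parsing the host with `urlsplit` before applying the expressions keeps host-level rules from matching text in the path or query.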
A comprehensive rule-based analysis was conducted on each URL within the initial dataset to determine the presence or absence of these 30 specific attributes. As an illustration, one of the rules assesses the URL's length. Phishers often use long URLs to obscure their malicious intent. To make the threshold dependable, the mean URL length was calculated over the URLs present in the dataset. URLs above the resulting limit of 65 characters were classified as possibly indicative of phishing websites, whilst URLs falling below this limit were deemed legitimate:

Rule:
$$
\text{feature} =
\begin{cases}
1 \ (\text{phishing}), & \text{URL length} > 65 \\
0 \ (\text{legitimate}), & \text{otherwise}
\end{cases}
$$

Regular expressions were essential in this process because they allow patterns, substrings, and specific attributes within the URLs to be identified. This information was then used to make informed assessments of the legitimacy of the URLs under scrutiny. The rule-based methodology built on regular expressions enabled the creation of a new binary dataset. In this dataset, instances denoting legitimate websites were assigned the label "0," while instances indicating potential phishing sites were assigned the label "1." This approach to dataset generation ensured that each URL was described only by the predetermined features of significance identified through regular expressions, which improved the accuracy and usefulness of the dataset for subsequent modeling and analysis.

In addition to the construction of the new dataset, machine-learning techniques were implemented as the next step. Using these techniques, we were able to train and evaluate models geared toward detecting phishing websites using the recently curated dataset.

Table 2: List of features

Criteria | Phishing Features
Address Bar Features | IP Address; Long URL; "TinyURL"; "@" Symbol; using "//"; Adding Prefix or Suffix Separated by (-) to the Domain; Sub-Domain and Multi-subdomains; HTTPS; Domain Registration Length; Favicon; Non-Standard Port; tilde_symbol; The Existence of "HTTPS" Token in the Domain Part of the URL
Abnormal Features | Request URL; URL of Anchor; Links in ,