https://doi.org/10.31449/inf.v48i8.5763 Informatica 48 (2024) 137–150

Feature Extraction and Classification of Text Data by Combining a Two-Stage Feature Selection Algorithm and an Improved Machine Learning Algorithm

Hua Huang
School of Computer and Artificial Intelligence, Henan Finance University, Zhengzhou 450046, China
E-mail: huanghuahafu@163.com

Keywords: Two-stage feature selection, SVM, text data, classification, MRMR

Received: February 26, 2024

Efficient text classification is crucial for information processing because of the massive volume of text data now being generated. However, the uneven distribution and redundancy of text data often lead to poor classification performance. To address this issue, a two-stage feature selection algorithm is proposed that fuses information gain with the minimum redundancy maximum relevance algorithm. To improve SVM performance in text data classification, an improved SVM algorithm based on a Fourier hybrid kernel function is also proposed. The proposed improved algorithm achieved an accuracy of 0.82 on the IMDB dataset using a feature subset of only 40 dimensions. Even when the number of features exceeded 390, the F1 value of the proposed algorithm remained 1% to 2% higher than that of the other algorithms, and the improved algorithm performed best when the feature dimension was around 400. The proposed algorithm, which combines the Fourier hybrid kernel function with the two-stage feature selection algorithm based on information gain and minimum redundancy maximum relevance, achieved an F1 value 1% to 3% higher than the compared methods and increased the number of correctly classified texts by 20 to 45. These results demonstrate the effectiveness of the algorithm as a classification tool for processing large-scale text data, which is significant for information retrieval and data mining.

Povzetek (Slovenian abstract): A two-stage feature selection algorithm and an improved machine learning algorithm are presented to increase the accuracy of text data classification. They combine information gain and the minimum redundancy maximum relevance (MRMR) method with an improved SVM.

1 Introduction

As information technology develops, especially in fields such as medicine, finance, and journalism, the Internet has generated massive amounts of text data. These text data contain a wealth of information and knowledge that is significant for improving business decision-making, market analysis, disease diagnosis, and similar tasks. However, because these data are large and complex, effectively extracting useful information from them and performing accurate text classification has become a challenging problem [1-3]. The core of text classification lies in accurately and efficiently identifying and classifying a large volume of unlabeled text, which directly affects the quality and usefulness of information extraction. Firstly, text data contain a large amount of redundant information that is not only irrelevant to the classification task but also interferes with the classifier's judgment and reduces classification accuracy. Secondly, the feature distribution of text data is often uneven, which makes it difficult for traditional classification algorithms to maintain stable and efficient performance across different types of datasets [4-5].
To this end, a Two-stage Feature Selection (TFS) algorithm that fuses Information Gain (IG) and an improved Minimum Redundancy Maximum Relevance (MRMR) criterion is proposed, and a Fourier hybrid kernel function is introduced to enhance the effect of the Support Vector Machine (SVM) in text classification. Through these technological innovations, the research aims to process large-scale text data more efficiently and improve the accuracy and efficiency of classification. This has important practical value for information processing and decision support in fields such as medical diagnosis, news topic analysis, and market trend forecasting. The study consists of four parts. The first part summarizes the relevant research results and shortcomings of feature extraction at home and abroad. The second part proposes the fusion of the TFS and improved machine learning algorithms. The third part analyzes the experimental results of the proposed algorithm and includes a discussion of related current research. The fourth part summarizes the experimental results, points out the shortcomings of the research, and proposes future research directions.

In the field of machine learning, SVM has become one of the core technologies for text classification due to its excellent classification performance. The performance of an SVM depends largely on the quality of feature selection and extraction. Feature selection is important when dealing with large-scale text data, and effective feature extraction is essential to improve classification accuracy and efficiency [6]. The following are some of the relevant studies by scientists and scholars. Ahmed Y A et al. proposed a weighted MRMR algorithm for better estimating the feature significance of data captured from cyberattacks. The technique combined enhanced weighted MRMR with term frequency-inverse document frequency and further incorporated an improved entropy measure, which was used to evaluate the weights of the features generated by the algorithm. The results showed good performance of the proposed algorithm [7]. Jiménez-Cordero et al. proposed an MRMR-based embedded feature selection method that trades off complexity against classification accuracy. The algorithm used duality theory to reformulate the min-max problem and solved it using off-the-shelf nonlinear optimization software. Experiments on public datasets demonstrated its effectiveness and practicability [8]. Wang et al. proposed an SVM kernel function selection mechanism: first, the kernel function types best suited to the given data were chosen, and then SVMs built on these types were used for classification. The results verified the superiority of the mechanism [9]. Sun et al. proposed a feature selection algorithm for multi-label data with missing labels. A multi-label uncertainty measure based on fuzzy neighborhood entropy was first proposed, and the MRMR algorithm was improved to evaluate the candidate features. The results showed that this algorithm selected important features with better classification performance [10]. Jia et al. proposed an improved barnacles mating optimizer combined with an SVM algorithm. Gaussian mutation and a logistic model were used to improve the performance of the algorithm from different perspectives, and the results showed better performance than other comparison methods; in addition, the model showed significant superiority over other classifiers [11]. Yin et al.
proposed an SVM algorithm based on the simulated annealing algorithm for the identification of different motion patterns. First, the simulated annealing algorithm obtained the optimal SVM parameters. Then, the MRMR algorithm was used for feature extraction, and five-fold cross-validation was used to train the classifier. The results showed that the accuracy of the algorithm was 98% [12]. Bansal et al. proposed a hybrid MRMR feature selection technique using a multi-objective method for automatic sign language recognition. The MRMR algorithm was first used as a preprocessor to remove redundant and irrelevant features, and a multi-class SVM was used as the classifier. The results showed that more accurate classification was achieved while the size of the feature vector decreased [13]. Zhou et al. proposed a feature selection method based on Mutual Information (MI) and correlation coefficients. In this method, the correlation coefficient was first introduced and then combined with MI to measure the relationships among features. To effectively select low-redundancy features, a minimization principle was also used in the evaluation criteria. The results showed that the proposed method had good feature classification ability [14].

Table 1: Research status and shortcomings of related works (reference, author, research findings, shortcomings)
[7] Ahmed Y A et al. Research findings: selecting ransomware attack features through a weighted MRMR algorithm. Shortcomings: no involvement in the field of text classification and a lack of further research.
[8] Jiménez-Cordero et al. Research findings: selecting features from datasets using an embedded feature selection method based on MRMR. Shortcomings: no involvement in the field of text classification and a lack of further research.
[9] Wang et al. Research findings: an SVM kernel function selection mechanism for bearing fault diagnosis. Shortcomings: using a single kernel function may not match the data distribution.
[10] Sun et al. Research findings: a fuzzy neighborhood entropy based MRMR algorithm for feature selection. Shortcomings: no involvement in the field of text classification and a lack of further research.
[11] Jia et al. Research findings: an SVM based on an improved barnacles mating optimizer for high-dimensional data testing. Shortcomings: lack of consideration of data redundancy issues.
[12] Yin et al. Research findings: motion pattern recognition using an SVM based on the simulated annealing algorithm. Shortcomings: using a single kernel function may not match the data distribution.
[13] Bansal et al. Research findings: sign language feature selection using a hybrid MRMR feature selection technique, with classification by a multi-class SVM. Shortcomings: using a single kernel function may not match the data distribution.
[14] Zhou et al. Research findings: feature selection based on MI and correlation coefficients. Shortcomings: there is still room for optimization in handling redundant features.

In Table 1, recent research findings and shortcomings are presented. In summary, although many scholars have studied SVM and feature selection in machine learning and applied them to many fields, existing methods still face problems of high redundancy, data sparsity, and insufficient classification accuracy when processing large-scale text data. To solve the redundancy problem in feature selection, a TFS algorithm fusing IG and an improved minimum redundancy maximum relevance criterion is proposed.
To further improve text classification, an improved SVM algorithm based on a Fourier hybrid kernel function is proposed. This study has a significant positive effect on improving the accuracy and processing efficiency of text classification [15-19]. Previous studies have addressed the issue of feature redundancy, but there is still room for optimization and improvement. Some studies have focused on feature redundancy but neglected the optimization of the classification algorithm; others have used a single kernel function in the classification algorithm, which may result in a mismatch with the data distribution. It is important to consider both feature redundancy and algorithm optimization to achieve accurate classification results. Compared with previous studies, this research considers not only the issue of high data redundancy but also the correlation between features and the semantic relationships of the context. This approach is beneficial for improving the accuracy of text feature selection through the TFS algorithm. The classification algorithm employs a hybrid kernel function based on the Fourier kernel function, which overcomes the limitations of a single kernel function. This study is therefore better adapted to the classification task than previous studies.

2 Text data feature extraction and classification by integrating two-stage feature selection and machine learning algorithms

To improve text classification and reduce redundancy, a fused TFS and an improved machine learning algorithm are proposed. Firstly, a TFS based on the IG and MRMR algorithms is proposed; on this basis, an improved SVM algorithm is further proposed.

2.1 Text data feature extraction and classification based on the two-stage feature selection algorithm

In text classification tasks, it is crucial to select the right features. This process mainly involves removing secondary words and retaining keywords with strong expressiveness, so as to reduce the complexity of the feature space of the text data and prevent high dimensionality from degrading classification performance. In this study, a TFS named IG-MRMR is used to fuse IG and MRMR. The feature words selected by the IG-MRMR algorithm are vectorized and used by the SVM for text classification, as shown in Figure 1.

Figure 1: Text classification process based on two-stage feature selection (dataset, preprocessing, initial selection by information gain, MRMR secondary selection, feature weighting, training/test split, and SVM classification into categories 1 to N with cross-validation)

Figure 1 shows the steps of data preprocessing, feature selection, feature weighting, and feature classification. The IG algorithm compares the initial entropy of the whole dataset with the conditional entropy under the influence of a specific feature, so as to determine the effectiveness of that feature for classification and select the main feature set suitable for text classification. For text classification, the algorithm evaluates the occurrence of a feature word $t_j$ in a specific class so as to estimate the IG of the feature word $t_j$, as shown in equation (1).
$$IG(t_j) = P(t_j)\sum_{i=1}^{m} P(C_i \mid t_j)\log\frac{P(C_i \mid t_j)}{P(C_i)} + P(\bar{t}_j)\sum_{i=1}^{m} P(C_i \mid \bar{t}_j)\log\frac{P(C_i \mid \bar{t}_j)}{P(C_i)} \quad (1)$$

In equation (1), $m$ is the number of categories in the text data, $C_i$ is the $i$-th category, $P(t_j)$ and $P(C_i)$ are the frequencies of the feature word in the sample text and of the category in the total text data, and $P(C_i \mid t_j)$ is the probability that a text belongs to $C_i$ given that it contains the feature word. $P(\bar{t}_j)$ is the probability that a text does not contain the feature word, and $P(C_i \mid \bar{t}_j)$ is the probability that a text belongs to $C_i$ given that the feature word is absent. In the feature screening process, the IG algorithm focuses too much on document counts and ignores the importance of word frequency, which weakens the predictive and representational ability of the selected features. In addition, IG considers both the presence and the absence of feature words, focusing mainly on the role of features in classification while ignoring the distribution of features between and within categories. Therefore, the feature set selected by IG needs further optimization. The MRMR algorithm is a filtering method based on spatial search, which calculates the relevance and redundancy of features through MI. Figure 2 illustrates this feature selection process.

Figure 2: Block diagram of the feature selection algorithm (correlation analysis between features and categories, MI-based ranking, weighted MID/MIQ criteria, classifier-metric checks, and selection of the optimal feature subset)

In the TFS, the starting point is the preliminary feature word set $T_1$ screened by the IG algorithm, which contains $n$ features. After IG filtering, redundancy still exists among the feature words in this subset, so secondary feature extraction must be performed on it. The task at this stage is to apply the MRMR criterion to the $n$ feature words and select a more optimized feature subset $S$ from $T_1$. This process is based on maximum relevance $D$ and minimum redundancy $R$, as calculated in equation (2).

$$\max D(S,C),\ D = \frac{1}{|S|}\sum_{t_i \in S} I(t_i;C); \qquad \min R(S),\ R = \frac{1}{|S|^2}\sum_{t_i, t_j \in S} I(t_i;t_j) \quad (2)$$

In equation (2), $\max D$ and $\min R$ represent the maximum relevance and minimum redundancy, $|S|$ is the number of selected feature words, $I(t_i;C)$ is the MI between feature word $t_i$ and the text category $C$, and $I(t_i;t_j)$ is the MI between feature words $t_i$ and $t_j$. These two criteria are combined to compute the MRMR value, as shown in equation (3).

$$\max \Phi(D,R),\quad \Phi = D - R \quad (3)$$

In equation (3), $D$ is the relevance and $R$ is the redundancy. When processing text data, the large number of feature words makes it time-consuming to compute the MI between all of them, so the MRMR strategy takes a step-by-step incremental approach to identify the ideal feature combination $S$. If $k-1$ features have already been selected to form a subset $S_{k-1}$, the next task is to extract one more feature from the pool $T_1 - S_{k-1}$ of features that have not yet been selected. The rule followed in the selection process is described in equation (4).
$$\max_{t_j \in T_1 - S_{k-1}} \left[ I(t_j;C) - \frac{1}{k-1}\sum_{t_i \in S_{k-1}} I(t_j;t_i) \right] \quad (4)$$

To optimize the selection of feature subsets, an improved MRMR-based TFS is further proposed, which mainly increases the weight of the relationship between features and categories. By introducing the class difference degree $a$, the improved algorithm can more accurately evaluate the distribution and influence of features in different categories. It combines the inter-class dispersion $AC$ and the intra-class coupling degree $DC$ to measure, respectively, the distribution of a feature word across documents of different categories and its uniformity within documents of the same category. The representation of features can thus be enhanced to increase their prominence in a particular category while ensuring an even distribution across documents within a class. Equation (5) shows the relevant calculations.

$$AC = \sqrt{\frac{1}{m-1}\sum_{k=1}^{m}\bigl(f_k(t_i) - \bar{f}(t_i)\bigr)^2}, \qquad DC = \sqrt{\frac{1}{n-1}\sum_{p=1}^{n}\bigl(g_p(t_i) - \bar{g}(t_i)\bigr)^2} \quad (5)$$

In equation (5), $m$ and $n$ denote the total numbers of categories and documents, and $f_k(t_i)$ and $\bar{f}(t_i)$ are the number of documents containing the feature word in category $k$ and the average number of such documents across categories. A higher dispersion $AC$ means the feature word $t_i$ is more effective at distinguishing categories. $g_p(t_i)$ is the word frequency in the $p$-th document, and $\bar{g}(t_i)$ is the average word frequency of the feature word across all documents in class $C_k$. A lower intra-class coupling $DC$ indicates that the feature word is more representative of the class $C$. Next, the MRMR algorithm considers the MI of the feature words in all categories, fine-tunes the MI weight by introducing the class difference degree $a$, and selects the two largest class difference values for processing, as detailed in equation (6).

$$a = \frac{1}{\lambda}\log_2\left(\frac{AC_{\max 1}}{DC_{\min 1}}\right) - \frac{1}{\lambda}\log_2\left(\frac{AC_{\max 2}}{DC_{\min 2}}\right) \quad (6)$$

In equation (6), $\lambda$ is a constant and $a$ represents the difference between the two largest class difference degrees, taken on a logarithmic scale. This quantity is applied to the MRMR criterion, as shown in equation (7).

$$\max_{t_j \in T_1 - S_{k-1}} \left[ a\, I(t_j;C) - \frac{1}{k-1}\sum_{t_i \in S_{k-1}} I(t_j;t_i) \right] \quad (7)$$

A large difference indicates that the feature word is concentrated in one category, making it highly identifiable for that category. Conversely, a small difference suggests that the feature word is common across multiple categories and cannot distinguish between categories with certainty. Logarithmic processing helps preserve the data characteristics and the relationship between features and categories while reducing the numerical scale and ensuring stability. In summary, the MRMR algorithm steps are shown in Figure 3, and a minimal code sketch of the two-stage selection loop is given below.
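The following is an illustrative sketch of the two-stage IG-MRMR selection described above, not the paper's original code. It assumes a dense document-term count matrix X with integer labels y, and uses scikit-learn's mutual_info_classif as the MI estimator; the function names (information_gain, ig_mrmr_select, pair_mi) and the class_weight parameter standing in for the class difference degree a are assumptions made for illustration.

```python
# Illustrative sketch of the two-stage IG-MRMR selection (not the paper's code).
# Assumptions: X is a dense document-term count matrix (n_docs x n_terms),
# y holds integer class labels, scikit-learn estimates the mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def information_gain(X, y):
    """Stage 1 score: MI between the presence of each term and the class,
    which for a binary term indicator coincides with its information gain."""
    presence = (X > 0).astype(int)
    return mutual_info_classif(presence, y, discrete_features=True)

def pair_mi(col_a, col_b):
    """MI between two binarized term columns (redundancy term in eq. (4)/(7))."""
    return mutual_info_classif(col_b.reshape(-1, 1), col_a,
                               discrete_features=True)[0]

def ig_mrmr_select(X, y, n_first=500, n_final=100, class_weight=1.0):
    """Stage 1: keep the n_first highest-IG terms (the set T1).
    Stage 2: greedy MRMR (relevance minus mean redundancy), as in equation (4);
    class_weight plays the role of the class difference degree a in eq. (7)."""
    ig = information_gain(X, y)
    candidates = [int(i) for i in np.argsort(ig)[::-1][:n_first]]
    presence = (X[:, candidates] > 0).astype(int)
    relevance = mutual_info_classif(presence, y, discrete_features=True)
    selected = [int(np.argmax(relevance))]          # start from the best term
    while len(selected) < min(n_final, len(candidates)):
        best_j, best_score = None, -np.inf
        for j in range(len(candidates)):
            if j in selected:
                continue
            redundancy = np.mean([pair_mi(presence[:, j], presence[:, s])
                                  for s in selected])
            score = class_weight * relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return [candidates[j] for j in selected]        # indices of the chosen terms
```

The sketch omits the computation of the class difference degree from AC and DC (equations (5)-(6)); in practice that value would replace the constant class_weight.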
Figure 3: Steps of the MRMR algorithm (initialize the dataset and variables, compute the correlation matrix, set the target number of features, compute each feature's association score, select the highest-scoring feature and exclude it from further computation, repeat until the target number of features is reached, then output the final list of selected features)

2.2 Application of the Fourier mixed kernel function in the SVM text classification algorithm

To enhance SVM performance in text classification, an SVM text data classification algorithm with a Fourier hybrid kernel function is further introduced. In text classification tasks, the features are usually feature words or n-grams, which form a large number of text vectors. The SVM algorithm maps the input vectors to a higher-dimensional space, identifies a hyperplane that separates the data, and maximizes the margin between the hyperplane and the data points to improve classification accuracy. Linear SVMs cover both linearly separable and linearly inseparable cases; linear separability means the data can be split directly by a hyperplane. For binary classification data on a 2D plane, if a single line can divide the two classes, that line is the hyperplane. To simplify the calculation, the labels on the two sides are set to $y = +1$ and $y = -1$, as shown in Figure 4.

Figure 4: Schematic diagram of the linear classification structure

In solving a linearly separable problem, the SVM defines the hyperplane by the function $f(x) = w^T x + b = 0$. The two margin boundaries corresponding to $y = -1$ and $y = +1$ can be written as $w^T x + b = -1$ and $w^T x + b = +1$, respectively. The SVM aims to find an optimal separating surface that maximizes the classification margin, that is, the distance between the two boundaries $L_2$ on either side of the hyperplane $L_1$, as shown in Figure 5(a). While there are multiple possible separating surfaces, only one separates the data optimally. The optimal separating surface is the hyperplane $L_1: w^T x + b = 0$, and the sample points lying on the two boundaries $L_2: w^T x + b = \pm 1$ are the key points of the SVM calculation, i.e., the support vectors. The distance from these support vectors to the hyperplane $L_1$ determines the classification margin, as shown in Figure 5.

Figure 5: Linear classification and hyperplane margin of the SVM; (a) linear classification representation in the SVM, (b) hyperplane margin in the SVM

In Figure 5, the distance between the two margin boundaries equals $2/\|w\|$. Given the training sample set $D$, the goal is to find the separating hyperplane with maximum margin, which requires determining parameters $w$ and $b$ that satisfy the constraints while maximizing $2/\|w\|$. Maximizing $2/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, so the original problem becomes the minimization problem detailed in equation (8).

$$\max_{w,b} \frac{2}{\|w\|} \;\; s.t.\;\; y_i(w^T x_i + b) \ge 1,\; i=1,2,\dots,n \;\;\Longleftrightarrow\;\; \min_{w,b} \frac{1}{2}\|w\|^2 \;\; s.t.\;\; y_i(w^T x_i + b) \ge 1,\; i=1,2,\dots,n \quad (8)$$

Equation (8) is a convex quadratic programming problem with constraints.
Considering its characteristics, Lagrange multipliers are applied to transform it into a dual problem in order to simplify the calculation. By setting the partial derivatives of $L$ with respect to $w$ and $b$ to zero, expressions for $w$ and $b$ are obtained. Substituting these into $L(w,b,a)$ gives equation (9), from which the classification model is constructed.

$$L(w,b,a) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} a_i\bigl(y_i(w^T x_i + b) - 1\bigr), \quad \max_a \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j x_i^T x_j, \quad f(x) = \mathrm{sign}(w^T x + b) = \mathrm{sign}\Bigl(\sum_{i=1}^{n} a_i y_i x_i^T x + b\Bigr) \quad (9)$$

In reality, most data are non-linear and cannot be classified directly by linear methods. The SVM solves this by mapping the data into a high-dimensional space. The kernel function is used for the inner product operations, which avoids complicated computation and the curse of dimensionality. The kernel function must satisfy the Mercer condition. SVMs with kernel functions can also be solved using the Lagrange multiplier method, as shown in equation (10).

$$\max_a \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j K(x_i,x_j) \quad s.t.\;\; \sum_{i=1}^{n} a_i y_i = 0,\;\; 0 \le a_i \le C,\; i=1,2,\dots,n \quad (10)$$

In equation (10), $K(x_i,x_j)$ is the kernel function, and the final classification model is shown in equation (11).

$$f(x) = \mathrm{sign}\Bigl(\sum_{i=1}^{n} a_i y_i K(x_i,x) + b\Bigr) \quad (11)$$

Next, the Fourier kernel function is introduced. In practical use, besides the universal Gaussian kernel and polynomial kernel, this kernel performs well in specific fields and has a strong learning effect. It has two main forms; the corresponding one-dimensional Fourier kernel functions are given in equation (12).

$$K(x_i,x_j) = \frac{1-q^2}{2\bigl(1 - 2q\cos(x_i - x_j) + q^2\bigr)}, \qquad K(x_i,x_j) = \frac{\pi}{2}\cdot\frac{\cosh(\pi - |x_i - x_j|)}{\sinh(\pi)} \quad (12)$$

In equation (12), $q \in (0,1)$. The $n$-dimensional expressions of the two kernel functions are defined in equation (13).

$$K(x_i,x_j) = \prod_{t=1}^{m} K_1\bigl([x_i]_t, [x_j]_t\bigr) \quad (13)$$

The corresponding one-dimensional and $n$-dimensional Fourier kernel functions are shown in Figure 6.

Figure 6: Fourier kernel function curves for q = 0.4, 0.5, 0.6, and 0.7; (a) one-dimensional Fourier kernel function, (b) n-dimensional Fourier kernel function

As a local kernel, the Fourier kernel function is characterized by adjusting its amplitude only through the parameter $q$, which provides an effective learning mechanism for text classification. The Fourier kernel provides buffered attenuation near the test point, which alleviates the sparse distribution problem in high-dimensional spaces. However, the choice of $q$ is critical, since an inappropriate value can lead to attenuation that is too rapid near the test point. To optimize performance, the principle of linearly weighted combination of kernel functions is adopted. This method combines different kernel functions and aims to improve the accuracy and efficiency of text classification. The specific combination and parameter adjustment are shown in equation (14).
$$K_{mix} = aK_1 + (1-a)K_2, \quad 0 \le a \le 1 \quad (14)$$

In equation (14), $K_{mix}$ is the hybrid kernel function, which combines the characteristics of the two single kernels $K_1$ and $K_2$, both satisfying the Mercer condition, and $a$ controls the influence of each single kernel. To construct a hybrid kernel with better performance, the polynomial kernel (as the global kernel) is combined with the Fourier kernel (as the local kernel) to integrate the advantages of both. At the same time, the combination of the polynomial kernel with the widely used Gaussian kernel is also considered in order to compare the classification effects of the two hybrid kernels, as shown in equation (15).

$$K(x_i,x_j) = a\,(x_i^T x_j + c)^d + (1-a)\,\frac{1-q^2}{2\bigl(1 - 2q\cos(x_i - x_j) + q^2\bigr)}, \qquad K(x_i,x_j) = a\,(x_i^T x_j + c)^d + (1-a)\exp\bigl(-\gamma\|x_i - x_j\|^2\bigr) \quad (15)$$

In equation (15), $a\ (0 \le a \le 1)$ is the weight coefficient that balances the combined effect of the two kernel functions. The Fourier kernel is prioritized because its parameter $q$ is easy to adjust and it provides buffered attenuation away from the test point. Based on the principle of combinatorial kernels, the proposed Fourier hybrid kernel function linearly weights the Fourier kernel and the polynomial kernel, conforms to Mercer's theorem, and is suitable as an SVM kernel function. The overall process of the improved SVM algorithm is shown in Figure 7.

Figure 7: Process of the improved SVM algorithm (data preprocessing, kernel selection, linear weighting of the Fourier and polynomial kernel functions, construction of the SVM text classifier, model training with grid search parameter optimization, and evaluation)

Figure 7 shows that the preprocessed data are input into the SVM algorithm, followed by selection of the kernel function. The selected Fourier and polynomial kernel functions are linearly weighted to construct the text classification model. The model is trained on a partitioned training dataset, parameters are selected using the grid search method, and the model is finally evaluated on the test set.

3 Text classification results analysis based on two-stage feature selection and improved machine learning

In this study, three datasets and their parameter configurations are first identified. Subsequently, feature selection and classification results are analyzed on these datasets. Finally, several kernel functions are analyzed in depth, and the SVM classification performance of these kernel functions under the proposed algorithm is evaluated in detail.

3.1 Results analysis of the IG-MRMR two-stage feature selection algorithm under different datasets

Experiments are conducted on the LING-SPAM, IMDB, and Cornell datasets. The text data are preprocessed by removing stop words, punctuation, and special characters, which filters out noisy feature items, reduces the feature dimension, alleviates the classifier burden, and improves text classification accuracy. 70% of the data form the training set and 30% the test set, the classifier is an SVM with a Gaussian kernel, and the experimental environment is Python. To evaluate the effect of the IG-MRMR algorithm in extracting feature subsets, accuracy and the F1 value are used as evaluation indexes; the higher the accuracy obtained with the selected features, the better the algorithm's performance. A minimal sketch of this evaluation setup is given below.
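As a concrete illustration of this protocol, the following minimal sketch (assumed, not taken from the paper) splits the data 70/30, trains a Gaussian-kernel SVC baseline on a selected feature subset, and reports accuracy and F1 with scikit-learn; the names X_sel, y, and evaluate_subset are illustrative.

```python
# Minimal sketch of the evaluation protocol: 70/30 split, Gaussian-kernel SVM,
# accuracy and F1 as evaluation indexes. X_sel is the document-term matrix
# restricted to the selected feature subset; y holds the class labels.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def evaluate_subset(X_sel, y, test_size=0.3, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sel, y, test_size=test_size, random_state=seed, stratify=y)
    clf = SVC(kernel="rbf")              # Gaussian-kernel baseline classifier
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro")

# Example: compare feature subsets of growing dimension (10, 20, ..., 100),
# mirroring the experiments reported for the IMDB, Cornell, and LING-SPAM data.
# for k in range(10, 101, 10):
#     acc, f1 = evaluate_subset(X[:, ranked_terms[:k]], y)
#     print(k, acc, f1)
```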
A higher F1 value indicates better precision and recall, and hence more effective feature selection. The IMDB dataset is used to compare Chi-Square (CHI), MI, IG, IG-MRMR, and the IG-MRMR two-stage algorithm, selecting feature subsets from 10 to 100 dimensions in steps of 10. For the first 20-dimensional feature subset of each method, the numbers of extracted words are 15, 14, 16, 16, and 18, respectively, and the priority order of the feature words also differs among the methods. The accuracy results are presented in Figure 8. The numbers of feature subsets required to reach an accuracy of 0.82 are 60, 63, 59, 46, and 40, respectively. This shows that the IG-MRMR two-stage feature algorithm achieves the best prediction while using the fewest feature words and has the highest classification accuracy for the same number of features.

Figure 8: Accuracy of the different algorithms on the IMDB dataset for feature dimensions from 10 to 100

To evaluate the influence of larger feature dimensions on the different feature selection algorithms, the feature subset dimension range is increased from 100 to 1000, in intervals of 100. A comparison of the five methods is shown in Figure 9. The F1 values of all algorithms begin to decrease once the number of features exceeds 390, indicating that the key features have already been extracted and that the additional features degrade the classification effect. In Figure 9(b), the IG-MRMR TFS algorithm shows an advantage, with an average F1 value about 1% to 2% higher than that of the other algorithms, meaning that more texts, about 18 additional articles, are correctly classified, which reflects its efficient and accurate feature selection ability.

Figure 9: Comparison of F1 values of the different algorithms on the IMDB dataset; (a) feature selection algorithms, (b) improved algorithms

The five algorithms are then applied to the Cornell dataset under the same settings as for the IMDB dataset, with feature dimensions between 10 and 100. The analysis focuses on the first 20-dimensional feature subsets extracted by each algorithm; the numbers of extracted evaluation words range from 15 to 17, as shown in Figure 10(a). To further explore the effect of increasing the feature dimension on the classification effect, the experimental range is extended to 100 to 1000 dimensions in intervals of 100, as shown in Figure 10(b).

Figure 10: Results of the different algorithms on the Cornell dataset; (a) accuracy changes, (b) changes in F1 values
Figure 10(a) shows that, at an accuracy of 0.76, the numbers of feature subsets required by the five algorithms are about 57, 60, 59, 55, and 40, respectively. The IG-MRMR TFS requires the fewest feature subsets, and its accuracy is higher than that of the other methods for the same number of feature subsets. Figure 10(b) shows that the classification effect is best when the number of features is close to 285. As the number of features increases further, the classification effectiveness of all algorithms gradually decreases, suggesting that the additional features contain more words with weak representational ability. The IG-MRMR TFS algorithm only shows a significant decrease after the feature dimension exceeds 700; its F1 value is on average 2% higher than that of the other methods, and the number of correctly classified texts increases by about 18.

Next, the five algorithms are evaluated on the LING-SPAM dataset, whose feature words mainly concern advertising-related terms. Feature words from 10 to 100 dimensions are selected to compare the classification effects, with detailed results shown in Figure 11(a). To gain a more comprehensive understanding of the classification performance of the feature subsets, the feature dimension is further extended from 100 to 1000, and the classification results of the five feature selection algorithms are compared in Figure 11(b).

Figure 11: Results of the different algorithms on the LING-SPAM dataset; (a) accuracy changes, (b) changes in F1 values

Figure 11(a) shows the numbers of feature subsets required to reach an accuracy of 0.95 for the five algorithms, which are 39, 40, 29, 35, and 22. The IG-MRMR TFS algorithm requires significantly fewer feature subsets than the other methods while maintaining accuracy; at the same time, for the same number of feature subsets, its accuracy is generally higher than that of the other feature selection algorithms. Figure 11(b) shows that most of the algorithms already reach an F1 value of 0.96 with 100-dimensional features, which means that the words with strong representational ability in this dataset are mainly concentrated in the first 100 dimensions. The F1 value of the IG-MRMR TFS peaks when the feature dimension is about 680; its average accuracy is 1% higher than that of IG-MRMR, with about 6 more articles correctly classified, and 2% higher than that of the single-stage IG and CHI algorithms, with about 14 more articles correctly classified, demonstrating its accurate feature selection advantage.

3.2 Text classification results analysis based on the two-stage feature selection algorithm

To ensure data standardization, preprocessing is performed to remove stop words, punctuation, and special characters, and the words of the processed corpus are vectorized using the term frequency-inverse document frequency method; the weight of each word in the text is calculated and normalized. 60% of the dataset is used as the training set and 40% as the test set.
Parameter selection uses a grid search method to determine the penalty parameter C of the SVM (ranging from 1 to 100 in steps of 10) and the exponent d of the polynomial kernel, which is set to 3. The kernel weight a of the hybrid kernel function is searched over its range in steps of 0.1. The experimental platform is Python 3.6, 5-fold cross-validation is adopted, and F1 is the evaluation index. The IMDB dataset is selected to compare the performance of the proposed algorithm with other kernel functions. The dataset comprises 2000 reviews of films and television programs, with equal numbers of positive and negative reviews. The document frequency algorithm is used as the feature selection algorithm to process this dataset. Considering the excellent performance of the Fourier kernel function, the weight coefficient a in the hybrid kernel function is set to 0.25, and the results are shown in Figure 12.

Figure 12: Comparison of F1 values for multiple kernel functions (Poly, Rbf, Fourier, RbfPoly, FourierPoly) on the IMDB dataset for feature dimensions from 500 to 9000

In Figure 12, the classification effect improves as the feature dimension expands. The proposed Fourier hybrid kernel function surpasses the single kernels and the combination of the Gaussian and polynomial kernels in performance, which verifies the effectiveness of the combinatorial kernel concept, highlights the advantages of the Fourier kernel function, and provides important value for improving the effect of text classification. Next, the IG-MRMR TFS algorithm is used to analyze SVM classification performance on the Cornell dataset, as shown in Figure 13. The IG and IG-MRMR TFS algorithms are used to select features, and SVMs with the Gaussian kernel, the Fourier kernel, and the Fourier hybrid kernel function are compared under these two feature selection methods.

Figure 13: Analysis of multiple kernel functions under the IG-MRMR two-stage feature selection algorithm; (a) IG combined with the Poly, Fourier, and FourierPoly kernels, (b) the improved algorithm combined with the RBF, Fourier, and FourierPoly kernels

As can be observed in Figure 13, the classification effect first increases and then decreases. When the feature dimension is about 400, the IG-MRMR two-stage algorithm shows excellent classification performance. As the number of features increases further, the effect of the TFS decreases noticeably, which shows that the weaker feature words added to the selected subset interfere with the classification effect. Figures 13(a) and 13(b) show that the SVM using IG-MRMR TFS with any kernel function is generally better than the IG method in terms of F1 value, confirming the effectiveness of IG-MRMR. The combination of the Fourier hybrid kernel function and the IG-MRMR two-stage algorithm is on average 1% to 3% higher in F1 value than the other combinations, and the number of correctly classified texts increases by 20 to 45. The experimental corpus selected for the final comparison is the Cornell film and television review corpus; a minimal sketch of the hybrid-kernel SVM used in these comparisons is given below.
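The following is an illustrative sketch of the Fourier hybrid kernel of equations (12) to (15) plugged into scikit-learn's SVC as a custom (callable) kernel, together with the grid search over C described above; it is not the paper's original implementation. Parameter names (q, a_weight, c, d) follow the equations, the search grid mirrors the settings reported in this subsection, and all function names are assumptions.

```python
# Illustrative Fourier hybrid kernel for SVC (dense, low-dimensional features).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def fourier_kernel(X, Y, q=0.5):
    """n-dimensional Fourier kernel: product over dimensions of the 1-D kernel
    (1 - q^2) / (2 * (1 - 2q*cos(x - y) + q^2)), as in equations (12)-(13)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    diff = X[:, None, :] - Y[None, :, :]            # shape (nX, nY, dim)
    one_d = (1 - q**2) / (2.0 * (1 - 2*q*np.cos(diff) + q**2))
    return np.prod(one_d, axis=2)

def hybrid_kernel(X, Y, a_weight=0.25, q=0.5, c=1.0, d=3):
    """Fourier hybrid kernel of equation (15): a*polynomial + (1-a)*Fourier."""
    poly = (np.asarray(X, float) @ np.asarray(Y, float).T + c) ** d
    return a_weight * poly + (1 - a_weight) * fourier_kernel(X, Y, q=q)

def make_svm(a_weight=0.25, q=0.5):
    """SVC with the hybrid Gram matrix supplied through a callable kernel."""
    return SVC(kernel=lambda X, Y: hybrid_kernel(X, Y, a_weight=a_weight, q=q))

# Example usage on dense feature matrices X_train, y_train (illustrative):
# grid-search C from 1 to 100 in steps of 10 with 5-fold cross-validation.
# search = GridSearchCV(make_svm(), {"C": list(range(1, 101, 10))},
#                       scoring="f1_macro", cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```

The explicit (nX, nY, dim) broadcast keeps the sketch readable but is only practical for the reduced feature dimensions produced by the selection stage; a production implementation would compute the Gram matrix more economically.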
On this corpus, the standard SVM algorithm is chosen as the comparative baseline, and the classification effects are shown in Table 2.

Table 2: Comparison of classification effects
Method: SVM; Accuracy: 73.46%; F1 value: 0.617
Method: Research method; Accuracy: 96.57%; F1 value: 0.813

Table 2 shows that the research method achieves higher accuracy and a larger F1 value (P<0.05) than the benchmark method. Specifically, the accuracy of the research method is 96.57%, which is 23.11 percentage points higher than that of the SVM algorithm. These results demonstrate the high performance of the research method, which is further improved through optimization.

3.3 Discussion

In text data feature classification, achieving higher accuracy in text feature selection requires considering data redundancy, the correlation between features, and the semantic relationships of the context. Fan Y et al. studied selection algorithms based on label correlation and feature redundancy to improve the effectiveness of feature selection, and the results indicated that their method has relatively high selection accuracy [20]. That work acknowledges the issues of data redundancy and correlation, but shortcomings remain, such as a lack of research on contextual semantic relationships, which necessitates further optimization of feature selection; the present study addresses these gaps. During the feature selection process, Zhou H et al. analyzed the weights of the MI redundancy terms through correlation coefficients and adopted a minimization principle; their method was found to have good feature classification performance in experiments [21]. That reference is comparable with the proposed method; however, it did not investigate the contextual semantic relationships involved in the feature selection process. This study explores that aspect, resulting in a more effective feature selection process with high accuracy and F1 values.

4 Conclusion

To reduce redundancy in text classification and enhance SVM performance, a TFS algorithm based on IG and improved MRMR is proposed. Additionally, to further improve the effect of the SVM in text classification, an SVM text classification algorithm based on the Fourier mixed kernel function is introduced. The study found that the IG-MRMR TFS algorithm achieved the best prediction accuracy with the fewest feature words on the LING-SPAM, IMDB, and Cornell datasets, and the highest classification accuracy for the same number of feature subsets. On the IMDB dataset, the algorithm required only 40 feature dimensions to reach an accuracy of 0.82, fewer than the other algorithms. On the LING-SPAM dataset, it outperformed the single-stage IG and CHI algorithms by 2%, correctly classifying about 14 more articles. Furthermore, when the number of features exceeded 390, the F1 values of all algorithms began to decrease, indicating that the key features had been extracted and that additional features were degrading the classification effect. In this case, the IG-MRMR algorithm maintained its advantage, with an average F1 value 1% to 2% higher than the other algorithms, correctly classifying 18 more texts. Compared with the benchmark method, the research method exhibits a higher accuracy rate of 96.57%, which is 23.11 percentage points higher than that of the SVM algorithm. However, the study has some shortcomings.
The second-stage feature selection following the current IG algorithm may need improvement, and the IG algorithm can be further optimized in the future. Additionally, while the Fourier kernel function shows superiority, future studies can consider more efficient local kernel functions to enhance classification performance. Moreover, when dealing with complex real-world problems such as unevenly distributed data, the research method may have limited generalization ability. Future work can focus on optimizing the algorithm through feature learning and multi-level feature learning to improve its performance.

Funding

The research is supported by the Graduate Education Reform Project of Henan Province, Achievements of the Henan Province Higher Education Teaching Reform Research and Practice Project (Graduate Education) (No. 2023SJGLX365Y).

References

[1] T. Tuncer, S. Dogan, M. Baygin, and U. R. Acharya, "Tetromino pattern based accurate EEG emotion classification model," Artificial Intelligence in Medicine, vol. 123, no. 4, pp. 102210-102211, 2022. https://doi.org/10.1016/j.artmed.2021.102210
[2] E. H. Houssein, D. S. Abdelminaam, and H. N. Hassan, "A hybrid barnacles mating optimizer algorithm with support vector machines for gene selection of microarray cancer classification," IEEE Access, vol. 9, no. 1, pp. 64895-64905, 2021. https://doi.org/10.1109/ACCESS.2021.3075942
[3] C. Dang, Y. Liu, and H. Yue, "Autumn crop yield prediction using data-driven approaches: support vector machines, random forest, and deep neural network methods," Canadian Journal of Remote Sensing, vol. 47, no. 2, pp. 162-181, 2021. https://doi.org/10.1080/07038992.2020.1833186
[4] W. Al-Salman, Y. Li, and P. Wen, "Detection of k-complexes in EEG signals using a multi-domain feature extraction coupled with a least square support vector machine classifier," Neuroscience Research, vol. 172, no. 2, pp. 26-40, 2021. https://doi.org/10.1016/j.neures.2021.03.012
[5] A. Siddiqa, R. Islam, and M. I. Afjal, "Spectral segmentation-based dimension reduction for hyperspectral image classification," Journal of Spatial Science, vol. 68, no. 4, pp. 543-562, 2023. https://doi.org/10.1080/14498596.2022.2074902
[6] C. Hebbi and H. Mamatha, "Comprehensive dataset building and recognition of isolated handwritten kannada characters using machine learning models," Artificial Intelligence and Applications, vol. 1, no. 3, pp. 179-190, 2023. https://doi.org/10.47852/bonviewAIA3202624
[7] Y. A. Ahmed, S. Huda, and B. A. S. Al-rimy, "A weighted minimum redundancy maximum relevance technique for ransomware early detection in industrial IoT," Sustainability, vol. 14, no. 3, pp. 1231-1235, 2022. https://doi.org/10.3390/su14031231
[8] A. Jiménez-Cordero, J. M. Morales, and S. Pineda, "A novel embedded min-max approach for feature selection in nonlinear support vector machine classification," European Journal of Operational Research, vol. 293, no. 1, pp. 24-35, 2021. https://doi.org/10.1016/j.ejor.2020.12.009
[9] B. Wang, X. Zhang, C. Sun, and X. Chen, "Sparse representation theory for support vector machine kernel function selection and its application in high-speed bearing fault diagnosis," ISA Transactions, vol. 118, no. 1, pp. 207-218, 2021. https://doi.org/10.1016/j.isatra.2021.01.060
[10] L. Sun, T. Yin, W. Ding, Y. Qian, and J.
Xu, "Feature selection with missing labels using multilabel fuzzy neighborhood rough sets and maximum relevance minimum redundancy," IEEE Transactions on Fuzzy Systems, vol. 30, no. 5, pp. 1197-1211, 2021. https://doi.org/10.1109/TFUZZ.2021.3053844
[11] H. Jia and K. Sun, "Improved barnacles mating optimizer algorithm for feature selection and support vector machine optimization," Pattern Analysis and Applications, vol. 24, no. 3, pp. 1249-1274, 2021. https://doi.org/10.1007/s10044-021-00985-x
[12] Z. Yin, J. Zheng, L. Huang, Y. Gao, H. Peng, and L. Liu, "SA-SVM-based locomotion pattern recognition for exoskeleton robot," Applied Sciences, vol. 11, no. 12, pp. 5573-5575, 2021. https://doi.org/10.3390/app11125573
[13] S. R. Bansal, S. Wadhawan, and R. Goel, "mRMR-PSO: a hybrid feature selection technique with a multiobjective approach for sign language recognition," Arabian Journal for Science and Engineering, vol. 47, no. 8, pp. 10365-10380, 2022. https://doi.org/10.1007/s13369-021-06456-z
[14] H. Zhou, X. Wang, and R. Zhu, "Feature selection based on mutual information with correlation coefficient," Applied Intelligence, vol. 1, no. 1, pp. 1-18, 2022. https://doi.org/10.1007/s10489-021-02524-x
[15] M. Yildirim, A. Çinar, and E. Cengil, "Classification of the weather images with the proposed hybrid model using deep learning, SVM classifier, and mRMR feature selection methods," Geocarto International, vol. 37, no. 9, pp. 2735-2745, 2022. https://doi.org/10.1080/10106049.2022.2034989
[16] D. Wang and G. Xu, "Research on the detection of network intrusion prevention with SVM based optimization algorithm," Informatica, vol. 44, no. 2, pp. 269-273, 2020. https://doi.org/10.31449/inf.v44i2.3195
[17] M. G. Lanjewar, J. S. Parab, and A. Y. Shaikh, "CNN with machine learning approaches using ExtraTreesClassifier and MRMR feature selection techniques to detect liver diseases on cloud," Cluster Computing, vol. 26, no. 6, pp. 3657-3672, 2023. https://doi.org/10.1007/s10586-022-03752-7
[18] H. Azadi, M. R. Akbarzadeh-T, H. R. Kobravi, and A. Shoeibi, "Robust voice feature selection using interval type-2 fuzzy AHP for automated diagnosis of Parkinson's disease," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, no. 3, pp. 2792-2802, 2021. https://doi.org/10.1109/TASLP.2021.3097215
[19] M. Mahapatra, S. K. Majhi, and S. K. Dhal, "MRMR-SSA: a hybrid approach for optimal feature selection," Evolutionary Intelligence, vol. 15, no. 3, pp. 2017-2036, 2022. https://doi.org/10.1007/s12065-021-00608-8
[20] Y. Fan, B. Chen, W. Huang, J. Liu, W. Weng, and W. Lan, "Multi-label feature selection based on label correlations and feature redundancy," Knowledge-Based Systems, vol. 241, no. 6, pp. 1-15, 2022. https://doi.org/10.1016/j.knosys.2022.108256
[21] H. Zhou, X. Wang, and R. Zhu, "Feature selection based on mutual information with correlation coefficient," Applied Intelligence, vol. 52, no. 5, pp. 5457-5474, 2022. https://doi.org/10.1007/s10489-021-02524-x