https://doi.org/10.31449/inf.v43i4.3008
Informatica 43 (2019) 573–579

Decision Tree Algorithm Based University Graduate Employment Trend Prediction

Fen Yang
Henan University of Animal Husbandry and Economy, Zhengzhou, Henan 450044, China
E-mail: yfh508@126.com

Student paper

Keywords: data mining, decision tree, employment prediction, C4.5 algorithm

Received: November 21, 2018

The employment situation of college graduates is becoming more and more serious, and it is of great significance to find effective methods for predicting the employment trend of students. In this study, the C4.5 algorithm was used to predict the employment trend of students. Taking the 2016 graduates of Henan University of Animal Husbandry and Economy as examples, four attributes affecting employment units were extracted, the information gain ratio was calculated, the decision tree was constructed, and the classification rules were obtained. After data collection, conversion and cleaning, 420 employment records were obtained, 320 of which were taken as training samples. The classification rules were tested using 100 experimental samples, and the accuracy rate was 81%. Finally, the employment trend of the 2018 graduates was predicted with the C4.5 algorithm, providing theoretical guidance for the arrangement of employment work in schools. Predicting the employment trend of students with a decision tree algorithm is feasible and of great significance to the employment guidance of schools and the employment choices of students.

Povzetek (summary): In this student paper, the C4.5 algorithm was used to analyze the employment of students in China.

1 Introduction

The employment problem has attracted increasingly widespread social concern [1]. In recent years, the number of university graduates has increased every year, and the employment situation has become more and more serious.
It is very important for schools to analyze information about students' employment, which can help them train students according to market demand [2]. However, as the number of students increases, graduate employment data accumulate continuously [3], which makes employment analysis very difficult.

With the progress of science and technology, many new techniques have been applied to employment analysis. Wang [4] combined the residual modified GM(1,1) model with an improved neural network to predict the employment information index of graduate students, and thereby their employment trend. The study found that the mean square error decreased from $10^{-1}$ to $10^{-5}$ as training progressed, and that the algorithm performed best when the gradient value and learning rate were $7.5912 \times 10^{-5}$ and 0.8421. Liu et al. [5] proposed a decision tree based on weighted information gain, with the weights obtained by a genetic algorithm. The decision tree was constructed and tested on undergraduates, and the method showed favourable prediction accuracy. Kwak and Rhee [6] found that education and gender were the most important factors affecting the employment of young and middle-aged people, while gender, health status and education were the most important factors for the elderly. Tan and Zhang [7] made a short-term employment forecast for Shandong province through independent component analysis (ICA) and found that the quality of the labor force, industrial structure and income were the most important factors affecting employment.

Decision tree algorithms perform well in data prediction. Shaikhina et al. [8] predicted outcomes of high-risk renal transplantation using decision tree and random forest models and found that the accuracy rate reached 85%, which provides accurate decision support for doctors. Mohleji et al.
[9] used a decision tree algorithm to predict air transportation delays. In a study of three New York airports, the confidence level of the prediction results was very high at least 70% of the time.

At present, most research on employment trends focuses on the factors influencing employment, while few studies address accurate prediction of the trend itself, and traditional data processing methods struggle to extract useful information from historical employment data. Therefore, in this study, four decision attributes were extracted from the employment-related information of the 2016 graduates of Henan University of Animal Husbandry and Economy, and the C4.5 decision tree algorithm was used to derive classification rules for employment trend prediction, so as to study the reliability of this method in employment prediction.

2 Employment trend forecast and data mining

Under the influence of population growth, the popularization of higher education and the expansion of enrollment, the number of college students has increased explosively, the number of graduates has risen sharply every year, and the employment situation has become more and more serious [10], which has aroused widespread public concern. The employment situation of college students is closely related to the future development of schools, students' personal development and social stability [11]. Growth in the graduate employment rate can lead to economic growth [12]. To manage student employment effectively, colleges and universities have adopted information technology to collect and manage students' employment information, in order to obtain valuable information, analyze the factors affecting employment, and help improve the employment situation.
However, as the number of students grows, the data in the information management system also grow rapidly. Traditional information analysis methods cannot handle such a large amount of information, nor can they fully exploit the potential value of the data. Although there are enough data, the implicit associations and rules among them cannot be obtained [13], and future employment development cannot be predicted from them. Therefore, an intelligent and reliable method is urgently needed.

Data mining technology can process massive information quickly and efficiently and extract valuable information from it. It is widely applied in fields such as business, industry and the military. Mining and analyzing graduate employment information to identify the factors affecting employment can help the school employment guidance center guide students and promote employment. The employment trend of graduates can also be predicted from this information, providing a decision-making basis for adjusting school teaching and employment work.

3 Decision tree algorithm

The decision tree algorithm is a typical data mining technique. It obtains valuable information by summarizing and classifying data according to their attributes [14]. Applying it to employment information, by constructing a decision tree and extracting classification rules to identify the factors affecting employment, is effective in predicting future employment trends [15]. In this study, the C4.5 decision tree algorithm was used to process and analyze employment information. C4.5 is an improvement of the ID3 algorithm [16] that selects the node attribute of the tree based on the information gain ratio.
Let the data set $K$ contain $k$ samples, and let its class attribute take $m$ values, corresponding to $m$ categories $C_i$ $(i = 1, 2, \ldots, m)$. If $k_i$ is the number of samples in category $C_i$, the amount of information needed to classify a given object is called the entropy of $K$ before division, and is computed as:

$$I(k_1, k_2, \ldots, k_m) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{1}$$

where $p_i = k_i / k$ is the probability that a data object belongs to category $C_i$. Entropy reflects the average uncertainty and purity of the data set: the larger the entropy, the higher the average uncertainty and the lower the purity.

Suppose attribute $A$ has $w$ discrete values $a_j$ $(j = 1, 2, \ldots, w)$, so that $K$ is divided into $w$ subsets $\{K_1, K_2, \ldots, K_w\}$, where all samples in $K_j$ take the value $a_j$ on attribute $A$. When $A$ is taken as the test attribute, these subsets correspond to the branches that grow from the node containing $K$. Let $k_{ij}$ be the number of samples in subset $K_j$ that belong to category $C_i$. The entropy of the division of $K$ by attribute $A$ is then:

$$E(A) = \sum_{j=1}^{w} \frac{k_{1j} + k_{2j} + \cdots + k_{mj}}{k} \, I(k_{1j}, k_{2j}, \ldots, k_{mj}) \tag{2}$$

The information gain of attribute $A$ is:

$$Gain(A) = I(k_1, k_2, \ldots, k_m) - E(A) \tag{3}$$

The larger the information gain, the higher the purity of the subsets produced by the division. The information gain ratio of attribute $A$ is:

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)} \tag{4}$$

where the split information

$$SplitInfo(A) = -\sum_{j=1}^{w} \frac{k_{1j} + k_{2j} + \cdots + k_{mj}}{k} \log_2 \frac{k_{1j} + k_{2j} + \cdots + k_{mj}}{k}$$

represents the breadth and uniformity with which attribute $A$ splits the data set $K$.
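The quantities in equations (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names and the layout of `table` (one list of per-class counts for each value of the attribute) are assumptions made for the example, and the two-class counts in the usage lines are hypothetical.

```python
from math import log2


def entropy(counts):
    """Equation (1): entropy of a list of class counts k_1..k_m."""
    total = sum(counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def expected_entropy(table):
    """Equation (2): weighted entropy after splitting on an attribute.

    `table` holds one list of class counts per attribute value (assumed layout).
    """
    k = sum(sum(row) for row in table)
    return sum(sum(row) / k * entropy(row) for row in table)


def gain(table):
    """Equation (3): entropy before the split minus entropy after it."""
    class_totals = [sum(col) for col in zip(*table)]
    return entropy(class_totals) - expected_entropy(table)


def split_info(table):
    """Split information: entropy of the subset sizes themselves."""
    return entropy([sum(row) for row in table])


def gain_ratio(table):
    """Equation (4): information gain normalised by split information."""
    return gain(table) / split_info(table)


# Hypothetical two-class example: the attribute separates the classes perfectly.
perfect = [[4, 0], [0, 4]]
# Hypothetical attribute that carries no information about the class.
useless = [[2, 2], [2, 2]]
print(gain_ratio(perfect))  # 1.0
print(gain_ratio(useless))  # 0.0
```

Note that `gain_ratio` divides by the split information, so an attribute with a single value (split information of zero) would need separate handling in this sketch.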
4 Decision tree algorithm based employment trend prediction

4.1 Data collection and preprocessing

The 2016 graduates of Henan University of Animal Husbandry and Economy were taken as research subjects. The basic information, achievement information and employment information of the students were obtained from the student status management system, student learning management system and student employment management system, and 500 records were selected as samples. The obtained data set contained many duplicate or blank entries, and the data format was not unified; hence preprocessing was needed.

(1) Data integration: The data exported from the three systems were integrated into a table of general information, whose attributes are shown in Table 1.

Table 1: General information.
Attributes: Name, Major, Gender, Academic performance, Political status, English competence, Student cadre, Computer skills, Participation in student society, Employment unit.

(2) Data correlation analysis: The data derived from the three systems contained much irrelevant information, such as name, gender, political status, student cadre, and participation in student society, which needed to be eliminated.

(3) Data conversion: Noise was eliminated from the data. To facilitate statistics and analysis, the remaining five attributes were generalized: major was divided into three categories (popular, general and unpopular); academic performance into excellent, general and poor; English competence into CET4 and above and below CET4; computer skills into level 3 and above and below level 3; and employment unit into state-owned enterprise, private enterprise and others, represented by A, B and C.
(4) Data cleaning: Duplicate and blank records were deleted, and finally 420 records were obtained, 320 of which were used as training samples and the remaining 100 for testing.

4.2 Establishing decision tree

The training samples were analyzed by taking employment unit (A, B and C) as the labeling attribute and major, academic performance, English competence and computer skills as decision attributes. The number of students in each category of each attribute is shown in Table 2.

Table 2: Training data set.

| Decision attribute | Value | A | B | C |
|---|---|---|---|---|
| Major | Popular | 43 | 21 | 17 |
| Major | General | 12 | 38 | 66 |
| Major | Unpopular | 3 | 47 | 73 |
| Academic performance | Excellent | 21 | 17 | 50 |
| Academic performance | General | 28 | 34 | 67 |
| Academic performance | Poor | 18 | 26 | 59 |
| English competence | CET4 and above | 46 | 52 | 61 |
| English competence | Below CET4 | 36 | 46 | 79 |
| Computer skills | Level 3 and above | 56 | 49 | 61 |
| Computer skills | Below level 3 | 62 | 58 | 34 |

Let the training sample set be $K$, and let the numbers of samples in classes A, B and C be $k_1 = 55$, $k_2 = 158$ and $k_3 = 107$ respectively. The entropy of $K$ was calculated using equation (1):

$$I(k_1, k_2, k_3) = I(55, 158, 107) = 2.315878$$

Then the information gain and gain ratio of each decision attribute were calculated using equations (2), (3) and (4).

(1) Major

Major was divided into popular, general and unpopular. When the major was popular, the entropy was:

$$I(43, 32, 27) = -\frac{43}{81}\log_2\frac{43}{81} - \frac{32}{81}\log_2\frac{32}{81} - \frac{27}{81}\log_2\frac{27}{81} = 1.2145$$

When the major was general, the entropy was:

$$I(12, 38, 66) = -\frac{12}{116}\log_2\frac{12}{116} - \frac{38}{116}\log_2\frac{38}{116} - \frac{66}{116}\log_2\frac{66}{116} = 1.4511$$

When the major was unpopular, the entropy was:

$$I(3, 47, 73) = -\frac{3}{123}\log_2\frac{3}{123} - \frac{47}{123}\log_2\frac{47}{123} - \frac{73}{123}\log_2\frac{73}{123} = 0.9512$$

The entropy of the attribute "major" was then $E(\text{major}) = 1.5421$, the information gain was $Gain(\text{major}) = 0.0215$, and the information gain ratio was $GainRatio(\text{major}) = 0.0131$.

(2) Academic performance

Academic performance was divided into excellent, general and poor.
When the academic performance was excellent, the entropy was:

$$I(21, 17, 50) = -\frac{21}{88}\log_2\frac{21}{88} - \frac{17}{88}\log_2\frac{17}{88} - \frac{50}{88}\log_2\frac{50}{88} = 1.2356$$

When the academic performance was general, the entropy was:

$$I(28, 34, 67) = -\frac{28}{129}\log_2\frac{28}{129} - \frac{34}{129}\log_2\frac{34}{129} - \frac{67}{129}\log_2\frac{67}{129} = 1.4201$$

When the academic performance was poor, the entropy was:

$$I(18, 26, 59) = -\frac{18}{103}\log_2\frac{18}{103} - \frac{26}{103}\log_2\frac{26}{103} - \frac{59}{103}\log_2\frac{59}{103} = 1.2014$$

The entropy of the attribute "academic performance" was then $E(\text{academic performance}) = 1.4058$, the information gain was $Gain(\text{academic performance}) = 0.0289$, and the information gain ratio was $GainRatio(\text{academic performance}) = 0.0134$.

(3) English competence

English competence was divided into CET4 and above and below CET4. When English competence was CET4 and above, the entropy was:

$$I(46, 52, 61) = -\frac{46}{159}\log_2\frac{46}{159} - \frac{52}{159}\log_2\frac{52}{159} - \frac{61}{159}\log_2\frac{61}{159} = 0.8412$$

When English competence was below CET4, the entropy was:

$$I(36, 46, 79) = -\frac{36}{161}\log_2\frac{36}{161} - \frac{46}{161}\log_2\frac{46}{161} - \frac{79}{161}\log_2\frac{79}{161} = 1.0258$$

The entropy of the attribute "English competence" was then $E(\text{English competence}) = 1.3025$, the information gain was $Gain(\text{English competence}) = 0.2157$, and the information gain ratio was $GainRatio(\text{English competence}) = 0.1656$.

(4) Computer skills

Computer skills were divided into level 3 and above and below level 3. When computer skills were level 3 and above, the entropy was:

$$I(56, 49, 61) = -\frac{56}{166}\log_2\frac{56}{166} - \frac{49}{166}\log_2\frac{49}{166} - \frac{61}{166}\log_2\frac{61}{166} = 0.9785$$

When computer skills were below level 3, the entropy was:

$$I(62, 58, 34) = -\frac{62}{154}\log_2\frac{62}{154} - \frac{58}{154}\log_2\frac{58}{154} - \frac{34}{154}\log_2\frac{34}{154} = 1.6124$$

The entropy of the attribute "computer skills" was then $E(\text{computer skills}) = 1.4275$, the information gain was $Gain(\text{computer skills}) = 0.2144$, and the information gain ratio was $GainRatio(\text{computer skills}) = 0.1502$.

The above results show that the information gain ratio of English competence was the largest, so this attribute was taken as the root node of the decision tree. The information gain ratio within every subtree was then calculated following the same procedure, and finally the decision tree in Figure 1 was obtained.

Figure 1: The decision tree of employment unit.

4.3 Generating classification rules

According to the decision tree in Figure 1, the following classification rules were obtained.

(1) If English competence = CET4 and above AND computer skills = level 3 and above AND academic performance = excellent AND major = general, then employment unit = state-owned enterprise.
(2) If English competence = CET4 and above AND computer skills = below level 3 AND academic performance = excellent AND major = general, then employment unit = state-owned enterprise.
(3) If English competence = CET4 and above AND computer skills = below level 3 AND academic performance = general AND major = general, then employment unit = private enterprise.
(4) If English competence = CET4 and above AND computer skills = below level 3 AND academic performance = poor AND major = unpopular, then employment unit = private enterprise.
(5) If English competence = below CET4 AND computer skills = level 3 and above AND academic performance = excellent AND major = popular, then employment unit = state-owned enterprise.
(6) If English competence = below CET4 AND computer skills = level 3 and above AND academic performance = general AND major = general, then employment unit = private enterprise.
(7) If English competence = below CET4 AND computer skills = level 3 and above AND academic performance = excellent AND major = general, then employment unit = private enterprise.
(8) If English competence = below CET4 AND computer skills = below level 3 AND academic performance = poor AND major = unpopular, then employment unit = others.
(9) If English competence = below CET4 AND computer skills = below level 3 AND academic performance = poor AND major = unpopular, then employment unit = others.

These classification rules indicate that English competence and computer skills had the greatest impact on the employment units of students. Students with good English competence and excellent computer skills generally worked in state-owned or private enterprises. Students with poor English competence and weak computer skills generally did not, except for some who had good academic performance or majored in popular subjects. This suggests that schools need to strengthen training in English and computer skills and give these two areas more attention when arranging teaching work, and that students themselves should strive to improve their English and computer skills to strengthen their competitive advantage in employment.

Figure 2: The prediction of employment trend of the 2018 graduates.

4.4 Testing of classification performance

The effectiveness of the classification rules was tested on 100 experimental samples, and the results were compared with the actual employment units of the students. The testing results are shown in Table 3.

Table 3: The testing results of classification rules.

| Sample | Classification result | Actual result |
|---|---|---|
| 1 | A | A |
| 2 | B | B |
| 3 | A | A |
| 4 | C | C |
| 5 | B | C |
| 6 | B | B |
| 7 | B | B |
| … | … | … |
| 99 | B | B |
| 100 | C | C |

The classification results of 81 samples agreed with the actual outcomes and those of 19 samples did not, giving an accuracy rate of 81%. This indicates that the obtained classification rules were relatively accurate and could determine the employment outcome of students.
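The rule matching of Section 4.3 and the accuracy check of Section 4.4 can be sketched as follows. This is an illustrative sketch only: the dictionary `RULES`, the function names, the shorthand value labels such as `"cet4+"` and the choice to return `None` when no rule matches are all assumptions introduced here. The label pairs are the nine rows visible in Table 3, not the full 100-sample test set.

```python
# Rules (1)-(8) from Section 4.3, keyed by
# (English competence, computer skills, academic performance, major).
# The value labels are shorthand invented for this sketch; rule (9) repeats rule (8).
# A = state-owned enterprise, B = private enterprise, C = others.
RULES = {
    ("cet4+", "level3+", "excellent", "general"): "A",   # rule (1)
    ("cet4+", "below3", "excellent", "general"): "A",    # rule (2)
    ("cet4+", "below3", "general", "general"): "B",      # rule (3)
    ("cet4+", "below3", "poor", "unpopular"): "B",       # rule (4)
    ("below4", "level3+", "excellent", "popular"): "A",  # rule (5)
    ("below4", "level3+", "general", "general"): "B",    # rule (6)
    ("below4", "level3+", "excellent", "general"): "B",  # rule (7)
    ("below4", "below3", "poor", "unpopular"): "C",      # rule (8)
}


def classify(english, computer, academic, major):
    """Return the employment-unit label, or None if no rule applies (assumed fallback)."""
    return RULES.get((english, computer, academic, major))


def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the actual label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# The nine rows shown in Table 3 (samples 1-7, 99 and 100).
predicted = ["A", "B", "A", "C", "B", "B", "B", "B", "C"]
actual = ["A", "B", "A", "C", "C", "B", "B", "B", "C"]

print(classify("cet4+", "level3+", "excellent", "general"))  # A
print(round(accuracy(predicted, actual), 3))  # 0.889
```

Applied to all 100 test samples, with 81 matching labels, the same `accuracy` function would return 0.81, the rate reported above.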
4.5 Prediction of employment trend

After the accuracy of the classification rules was verified, the employment trend of graduates was predicted using the proposed method, taking the 2018 graduates as an example. Information on the major, academic performance, English competence and computer skills of the students was exported from the student status management system and the student learning management system, and the employment trend of the graduates was then predicted. The results are shown in Figure 2.

Figure 2 shows that the share of students likely to be employed in private enterprises was the largest, at 45%, while the share likely to be employed in state-owned enterprises was the smallest, at 21.56%. The decision tree and classification rules in this study could predict the employment trend of graduates well, help schools efficiently identify the future employment direction of students, provide a strong basis for student employment guidance, and offer schools valuable information.

5 Discussion

Employment is a problem that cannot be ignored in modern society, especially among university graduates. As the number of graduates increases, employment competition is becoming more and more fierce [17]. Employment is the most serious and difficult problem graduates face after they leave school and enter society, and it is also very important for schools. At present, universities have employment guidance centers that collect and analyze the employment situation of students in order to find patterns and forecast employment. Employment prediction is of great significance for graduates' employment and for school teaching work [18].
However, as the number of university students increases and data accumulate, analyzing and processing employment information is becoming more and more difficult, and valuable information is hard to obtain from mass data. The development of data mining technology has brought new possibilities. The decision tree algorithm is an efficient classification method and is also applicable to the prediction of employment trends.

In this study, the relatively mature C4.5 algorithm was selected. After the relevant data on graduates were obtained, four decision attributes (major, academic performance, English competence and computer skills) were extracted for the analysis of employment units. The decision tree was built step by step from the information gain ratios of the attributes, and classification rules were then derived from the tree. The information gain ratios of major, academic performance, English competence and computer skills were 0.0131, 0.0134, 0.1656 and 0.1502, respectively, indicating that English competence and computer skills were the most important factors affecting the employment of graduates.

In the employment process, English competence and computer skills are signs of a graduate's ability, and many employers have specific requirements for them. At present, schools attach great importance to cultivating students' abilities in these two areas. The extensive provision of English and computer courses has improved students' abilities to a certain extent, and under these rigid requirements students have to strengthen their study in both areas. However, passive learning is not enough. The importance of English and computer skills must be fully recognized, and the classification rules illustrate it clearly. The extraction of classification rules can help schools and students understand which abilities are the most important. On the one hand, this supports the arrangement of school teaching and employment guidance; on the other hand, it encourages students' active learning.

The testing of the classification rules gave an accuracy rate of 81%, which shows that this method is feasible for predicting the employment trend of graduates. The prediction for the 2018 graduates showed that many students would be employed by private enterprises and few by state-owned enterprises. This indicates that schools should strengthen the supply of talent to state-owned enterprises and carry out targeted talent training.

This paper gave a preliminary discussion of the role of decision trees in predicting college students' employment trends, but some problems need further research: (1) a more detailed division of employment units for college students is needed; (2) more factors that can affect college students' employment should be considered, such as family conditions and personal strengths; for example, literature [19] points out that gender can also affect students' employment choices; (3) the application of other data mining algorithms to employment trend prediction should be analyzed; for example, the Bayesian algorithm was used for employment prediction in literature [20].

6 Conclusion

The decision tree algorithm can help handle and analyze the employment situation of students and identify the main factors affecting their employment. This study constructed a decision tree and extracted classification rules from four decision attributes: major, academic performance, English competence and computer skills.
It was found that English competence and computer skills had the greatest impact on students' employment. The test suggested that the classification rules in this study had an accuracy of 81% and were feasible for predicting the employment trend of graduates. This study has several shortcomings. For example, more decision attributes that can affect the employment units of students could be mined, employment units could be further divided to obtain a more detailed employment trend, and a larger sample size is needed to determine the accuracy of the method.

7 Acknowledgement

This study was supported by the Research Project of Humanities and Social Sciences of Education Office of Hubei under grant number 16Z015.

8 References

[1] Xia X, Liu J (2014). Genetic Algorithm Based Forecasting Model for the Employment Demand of Major in English. International Conference on Intelligent Systems Design & Engineering Applications, pp. 331-335.
[2] Rizal MT, Yusof Y (2017). Application of data mining in forecasting graduates employment. Journal of Engineering & Applied Sciences, 12(16), pp. 4202-4207. https://doi.org/10.3923/jeasci.2017.4202.4207
[3] Miao S, Zuo J (2013). The Data Mining Technique Based on The Decision Tree Applied in The Vocational Guidance of The College Graduates. Journal of Convergence Information Technology, 8(7), pp. 876-882.
[4] Wang L (2014). Improved NN-GM(1,1) for Postgraduates' Employment Confidence Index Forecasting. Mathematical Problems in Engineering, pp. 1-8. https://doi.org/10.1155/2014/465208
[5] Liu Y, Hu L, Yan F, Zhang B (2013). Information Gain with Weight Based Decision Tree for the Employment Forecasting of Undergraduates. Green Computing and Communications, pp. 2210-2213. https://doi.org/10.1109/GreenCom-iThings-CPSCom.2013.417
[6] Kwak M, Rhee S (2016). Finding factors on employment by adult life cycle using decision tree model. 27(6), pp. 1537-1545. https://doi.org/10.7465/jkdi.2016.27.6.1537
[7] Tan L, Zhang H (2012). Forecast of Employment Based on Independent Component Analysis. International Conference on Information Computing and Applications. Springer, Berlin, Heidelberg, pp. 373-381. https://doi.org/10.1007/978-3-642-34038-3_51
[8] Shaikhina T, Lowe D, Daga S, Briggs D, Higgins R, Khovanova N (2019). Decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation. Biomedical Signal Processing and Control, 52, pp. 456-462. https://doi.org/10.1016/j.bspc.2017.01.012
[9] Mohleji N, Mazzuchi T, Sarkani S (2014). Decision Modeling Framework to Minimize Arrival Delays from Ground Delay Programs. Air Traffic Control Quarterly, 22(4), pp. 307-325. https://doi.org/10.2514/atcq.22.4.307
[10] Zang W, Guo R (2010). Application of data mining in employment instruction of undergraduate students. IEEE International Conference on Intelligent Computing and Intelligent Systems. IEEE, pp. 772-775. https://doi.org/10.1109/ICICISYS.2010.5658318
[11] Nie C (2015). Forecast Analysis on Situation of College Students' Employment in China. Journal of Changchun University, 8, pp. 83-87.
[12] Qu H (2013). Influence of University Graduates Employment on Economic Growth and Its Statistical Forecast and Analysis. Journal of Applied Sciences, 13(21), pp. 4620-4623. https://doi.org/10.3923/jas.2013.4620.4623
[13] Park SH, Kim SM, Ha YG (2016). Highway traffic accident prediction using VDS big data analysis. The Journal of Supercomputing, 72(7), pp. 2832-2832. https://doi.org/10.1007/s11227-016-1624-z
[14] Li L, Zheng Y, Sun XH, Wang FS (2014). Study on Data Mining with Decision Tree Algorithm in the Student Information Management System. Applied Mechanics & Materials, pp. 3602-3605. https://doi.org/10.4028/www.scientific.net/AMM.543-547.3602
[15] Li L, Zheng Y, Sun XH, Wang FS (2016). The Application of Decision Tree Algorithm in the Employment Management System. Applied Mechanics & Materials, pp. 1252-1267. https://doi.org/10.4028/www.scientific.net/AMM.543-547.1639
[16] Hu Y (2014). The Water Conservancy Water Electricity Construction Engineering Professional Result Analysis of Application Base on C4.5 Algorithm. Advanced Materials Research, 926-930, pp. 703-707. https://doi.org/10.4028/www.scientific.net/amr.926-930.703
[17] Zhang WC, Zhang JY (2012). Vocational Graduates' Employment Problems and Countermeasures: Based on the employment of Shanxi Vocational and Technical Institute. China University Students Career Guide, pp. 56-60.
[18] Cheng CP, Chen Q (2010). Research of Applying the Method of Decision Tree Based on Information Gain Ratio to College Student's Employment Forecasting. Computer Simulation, pp. 299-302.
[19] White T, Martin BN, Johnson JA (2003). Gender, Professional Orientation, and Student Achievement: Elements of School Culture. Journal of Women in Educational Leadership, 1, pp. 351-365.
[20] Chen CP (2007). A Research on the Employment Forecast of Graduates with Simple Bayesian Algorithm Classification. Journal of Guangdong Education Institute.