Applied Text-Mining Algorithms for Stock Price Prediction Based on Financial News Articles
Adrian Besimi
South East European University, North Macedonia a. besimi@seeu. edu. mk
Zamir Dika
South East European University, North Macedonia z.dika@seeu.edu.mk
Visar Shehu
South East European University, North Macedonia v.shehu@seeu.edu.mk
Mubarek Selimi
South East European University, North Macedonia ms21693@seeu.edu.mk
This article includes a developed model and well-defined process that one should undertake in order to contribute in the prediction of the potential stock price fluctuation solely based on financial news from relevant sources. We are providing background information on this topic adding the role of text mining in general, furthermore supporting the idea with the study of relevant research articles to narrow the focus on the problem we are researching. Our proposed model relies on existing text-mining techniques used for sentiment analysis, combined with historical data from relevant news sources as well as stock data. In confirming the model, after the experiment we have provided the results of the simulation, which are opening the ground for further explorations in this sensitive area of prediction.
Key Words: text mining, finance, news, crawling, stock, prices, prediction, naive bayes
jel Classification: C89, G17 https://doi.org/10.26493/1854-6935.17.335-351
Introduction
The data produced and the speed at which data is provided on the Internet nowadays has increased to a degree and at a rate that is impossible to process. This trend, on the other hand, has challenged the research in many areas, such as data mining and text mining, which are the focus of
Managing Global Transitions 17 (4): 335-351
336 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
our study. These two areas have emerged in the last decade mainly due to research in artificial intelligence, machine learning, and inferential statistics (Vale 2018).
Stock market data and relevant news associated with fin-tech industry are increasing rapidly as well. Many investors that are handling stock market transactions have a major interest in understanding more about the future of stock markets for the purpose of being able to do an educated guess and/or predict any future investment. Ensuring some level of prediction in market fluctuation can assist investors in the form of decision support systems and integrated with existing automatic trader agents that would ensure better prediction on future trades. Fully predicting the market fluctuation means in practice becoming a billionaire over the night and all the time minimizing financial losses, which is not possible for many reasons. Recent scholars argue that news articles are among influential sources that may affect stock market prices and they should be carefully considered by investors when planning future investments. By definition, any stock price is simply defined by supply and demand of the market, but it is argued by scholars Nikfarjam, Emadzadeh, and Muthaiyah (2010) and Kaya and Karsligil (2010) that another important variable when decision is made to invest or not is also related to verifiable news from financial news sources. This is hard and time-consuming task because it requires to read and analyze a lot of news published on several occasions by various news sources/providers (Nikfar-jam, Emadzadeh, and Muthaiyah, 2010; Kaya and Karsligil 2010).
Information published in news articles influence, to a varying degree is influencing the decision of the stock traders, especially if the given information is unexpected. It is important to analyze this information as fast as possible, so it can be used as an advantage to help traders to make trading decisions before the market has had time to adjust itself to the new information (Aase 2011).
One important application of using text mining is text sentiment analysis, also known as opinion mining, a technique that digs deep into the content of the text file and extracts the sentiment of it. Sentiment analysis classifies textual data into positive texts, negative text and neutral text sentiments which is later used for the purpose of categorizing any text documents into the given sentiment (Aase 2011; Khedr and Yaseen 2017).
The model proposed in this paper is going to leverage the Naïve Bayesian classifier for document classification to make a prediction for whether the stocks will go up or down, based on a dataset that is generated from the process proposed later in this paper.
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 337
Literature Review
Several scholars, specifically Kim, Jeong, and Ghani (2014) prove to some extent in their work that the relevant news are closely related to stock price movements in the market. With the current trends in big data and content creation on the Internet and the enormous amount of unstructured text data available, the mobile channels, and Social Network services, scholars have attempted to predict stock movements using such text data as in the case of Kim, Jeong, and Ghani in 2014.
Many scholars tried different approaches in research to prove that there is a potentially strong correlation among financial news articles and stock price fluctuations, as is the case of Khedr and Yaseen (2017) that we mentioned earlier. In their paper they propose an approach to use sentiment analysis in financial news, along with features extracted from historical stock prices to apply prediction for the future behaviour of stocks. According to their findings, the proposed model has achieved high accuracy using sentiment analysis in categorizing news polarities by applying Naïve Bayes algorithm. In their case the accuracy of the model is up to 86.21%. By moving on with their experiment in prediction, during their next attempt in analyzing these news articles, they have included numerical attributes which in their case increased the accuracy to 89.80%.
The paper, published by Hagenau, Liebmann and Neumann (2013), examines the hypothesis if any stock price prediction based on textual content from the financial news can be further improved. In this paper, the authors have upgraded the text mining methods by adding expressive feature to represent the text and by adding more variables, such as employing market feedback in the feature selection process. According to the authors, this selection of the features does significantly improve the accuracy due to the fact that this approach removes the unnecessary so called 'less-explanatory features,' i.e., noise, which itself helps the classifier to overcome the over-fitting during classification of the text. In the case when the feedback-based feature selection is combined with 2-word combinations, the authors results show an accuracy of 76%. These results are different from common sentiment analysis approaches since the 2-words combination gives more information and potentially more meaning to the sentiment classification.
A lot of research has been carried by scholars in the area of prediction of stocks as well. A project by Joshi, Rao, and Bharathi (2016) is taking financial news articles about a given company, and they use these data to try to predict the future movements of the stock again by applying sen-
Volume 17 • Number 4 • 2019
338 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
timent analysis. The approach is like in the other cases with an idea to identify how stocks have reacted if news has polarity. Authors in this case have taken the past three years of data from Apple Inc. stock prices as well as news articles. Similarly, to previous scholars, the polarity of the news is labelling these articles and based on these data they are building the training set. The approach in this paper is dictionary-based that contains for positive and negative words that is build based on financially specific words. Further, they have pre-processed the data which resulted in having their own finance specific stop words and dictionary. Using their own dictionary, they have implemented three models for classification and tested them. After comparing the results, they have concluded that Random Forest algorithm resulted in better accuracy for the test cases ranging from 88% to 92%. This algorithm was followed by Support Vector Machines with again very good accuracy of 86%. In their case the Naive Bayes algorithm performance was the lowest with 83%.
There is some promising research published that applies deep learning techniques and has resulted with higher accuracy ratings. Of interest is the published paper by Tabari et al. (2019) that shows a comparison of diverse algorithms specifically applied in stock market tweets. This research shows quite promising results, with accuracy ranging up to 92.7 % (using Convolutional Neural Networks). However, even though deep learning approaches can be scaled for using news articles, other authors report much lower accuracy rates Kim and Jeong (2019).
One major advantage of using Naive Bayes algorithms is its well-known ability to improve by introducing new data. In our case, previously analyzed news articles can be fed to the algorithm and treated as prior probability. With this, new, previously unknown words will gain weight and affect prediction when encountered in the future. This is one major drawback of the proposed model from Joshi, Rao, and Bharathi, mentioned above, since it only works for a pre-defined dictionary of words.
Another similar approach of finding the correlation amongst the content of news articles and stock prices for the purpose of predicting the stock markets was implemented by Kaya and Karsligil (2010). They collected news articles published in the last year period and combined with the stock prices for the same period. These articles were then labelled as positive or negative sentiments categorization based on their effects on stock price. Their approach is a little bit different in the sense that for them it was important to use the price changes for categorization of the news. While analyzing the textual data, authors use and approach of word
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 339
doubles of a noun and a verb as features and not only single words. The support vector machines (svm) method was used in this case which resulted with 61% accuracy.
These scholars and articles mentioned in the section above are the core of our model and study that we conducted.
Methodology
problem definition
Financial analysts that are handling investments and transactions in stock markets around the world have a huge headache on making decisions that will be effective and bring more money to the investor or maximizing profits by trading. They are aware that any news, either good or bad can directly affect the stock market. The job of these experts relies on analyzing everything from the media outlet. This is time consuming and the amount of data is getting larger all the time. The methodology that we are arguing as many other scholars mentioned above, including also Fali-nouss (2017), is that an advanced text mining algorithm can assist these experts and provide them with knowledge just by processing resources related to text and news.
The price movements from the past are not always a good indicator on the future movements and are not a guarantee of smart investment, which makes news articles analysis a better predictor on stock market movements. Falinouss in 2007 proposed to research about the impact of textual data in predicting the financial markets movements. In his study he also developed a system which uses similar approaches as previous case of text mining techniques and their influence on the stock market. This according to Falinouss (2007) can help financial analysts to act immediately upon new news articles as they get published.
We propose a model of predicting stock price fluctuations or movements by analyzing financial news articles on one hand and historical stock prices on the other hand. To accomplish this objective, a complete process of data mining and text mining was developed to predict the price movements for the 3 companies listed public, which are explained in the subsection below.
proposed model for stock predictions
based on financial news
In our study we worked towards analyzing data, concretely news articles and historical stock prices to make future predictions about stock direc-
Volume 17 • Number 4 • 2019
340 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
tion. To achieve this, qualitative and quantitative data are crucial. Many steps are conducted to achieve the aim of this research, starting from data gathering.
The data is collected for a period of one year, starting from the 1st of March 2018 until 1st of March 2019. In order to make the prediction we used different variables, such as the polarity of the news (either positive or negative), the rate of change in stocks quotes (an average of 5 days), a source of the news article as well as the company name.
The following is the process consisted of eight steps needed to be performed in order to predict stock price from the financial news:
1.	Identifying the news sources and targeted companies
2.	Data collection and data cleaning of news articles
3.	Sentiment Analysis of news articles
4.	Data collection of stock prices
5.	Calculating Rate of Change (roc)
6.	Categorizing the data
7.	Applying Naive Bayesian classifier
8.	Training
Identifying the news sources and targeted companies is crucial to understand the data. The information collected must be relevant and trustworthy. As such, the relevant data from financial news articles from top reliable sources have been identified as: The Washington Post, cnn, Mar-ketWatch, bgr, Fox Business, The Street, The Verge and Breitbart. The targeted companies for our study are: Tesla, Facebook and Apple. News sources are proven to be reliable in the market as the most unbiased, whether the targeted companies have been chosen randomly from technology, software and automotive industry. Tesla has been added because it is a typical example of a lot of news noise and several fluctuations of stock prices.
Data collection and data cleaning of news articles. Links of the news from the sources mentioned in step 1 are collected using Web Scraper extension of Google Chrome browser. After having all the links, we built a python script based on Scrapy (an open source and collaborative framework for extracting the data you need from websites, see http://www.scrapy.org). framework that is extracting data from the links and organizing them in the following structure: article's Title, date, author and the text content. Appropriate data cleansing has been applied
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 341
to remove unnecessary html tags as well as to format the data from different sources to one standard (see table 1 and table 2).
Sentiment analysis of news articles was applied to every news record based on the news content by using Vader Sentiment Analysis. vader (Valence Aware Dictionary for sEntiment Reasoning) is a pre-built sentiment analysis model included in the nltk (Natural Language Toolkit) package. It can give both positive/negative (polarity) as well as the strength of the emotion (intensity) of a text. vader however is focused on social media and short texts, unlike Financial News which are almost the opposite. We updated the vader lexicon with words plus sentiments from other sources/lexicons such as the Loughran-McDonald Financial Sentiment Word Lists, to be appropriate for our collected financial news (Yip 2018). At the end of this step we had the polarity of the news content recorded in our dataset.
Data collection of stock prices for each of the targeted companies was done from Yahoo Finance portal, where the following information was collected: date, open price, high, low, close price, volume and Adj close. These data are important for correlation with the appropriate news from our first data set.
Calculating Rate of Change (roc). The roc and Future roc are the two variables that are calculated from the data set from Step 4. The rate of change (roc) in stocks in an average of 5 days is an existing formula that refers to the last 5 days of stock fluctuation. In our case we also added a column with the Future roc (the roc after 5 days), having in mind that the effect of this positive or negative news will be reflected in the future and not the past. Since we are dealing with historical data, the Future ro c is easy to calculate.
Categorizing the data must be done in order to apply Naive Bayesian classifier. In the data set that we have all news collected with their features, we added two new columns: Sentimentof_text that could be 'positive' if the sentiment score is greater than zero and 'negative' if the sentiment score is less than zero. We don't take in consideration the neutral score of the text content because that could result in majority of neutral results. The second column is the ROC_Sentiment that can be 'positive' if the Future roc is greater than zero and 'negative' if the future roc is less than zero.
Naive Bayesian classifier was used to make the prediction of the future stock movements. The naive Bayes applies the well-known Bayes' theorem, where by using a 'naive' assumption that any set of features are
Volume 17 • Number 4 • 2019
342 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
independent for a given class (Tang, Bo; Kay, Steven; He, Haibo; 2016). To prepare the data set to make predictions with the nb, we added a new column with the name class that is 'up' if the Sentimentof_text is 'positive' and the ROC_Sentiment is 'positive,' and if the Sentimentof_text is 'negative' and the ROC_Sentiment is 'negative' then the class is 'down,' otherwise is 'neutral' classification. The training dataset results are summarized in table 3.
Training. The data that is collected (see table 1 and table 2) contains records for 12 months from which 10 months will be used to train the model and the last 2 months will be used for the test set, to evaluate how it performs. In total 18236 records will be used as training dataset and the remaining 1990 records (roughly 10%) out of 20226 will be used as a test set.
We created 2 models to see how they perform. The following variables are used to train and test the first model: Source, Company, Senti-mentof_text and the 5-day roc, while in the second model only variables of: Source, Company and Sentimentof_text.
Results
As explained in the steps undertaken to perform our prediction, the data collection results are shown below. We succeeded to collect the news articles from 8 different news sources, totalling 20226 news articles, split into table 1 for Training Set (18236 records/articles) and table 2 for Test Set (1990 records/articles).
The training dataset results are summarized in table 3, where for each company in our target the list of classification results is shown. As a general finding is that the algorithm applied classifies 15.71% of the articles in the training set as 'down' (meaning the stock will go down in the following days), 50.71% is classified as 'neutral' (there is no clear picture on what the prediction will be) and 33.59% of the data as 'up' (meaning the stock will go up). The 'up' classification is relevant to our study and can be used for simulating investments on our test data from the test set.
In the first prediction that uses the following variables: Source, Company, Sentimentof_text and the 5-day roc model, the test set classification from 1900 records being tested, resulted in 564 'down' and 1426 'up' classes for stock price direction were predicted. The achieved accuracy of 94.29% in this prediction model shows that there is a very high chance to predict the stock price movements. By this, our arguing that based on several attributes from new articles we can reach a certain level of predic-
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 343
table 1 Total News Articles Obtained for Apple, Tesla and Facebook Organized by Source For Training Set (March 2018-December 2018)
Variable	Categories	Frequencies	Percentage
Source	B GR	1073	5.884
	Breitbart	435	2.385
	CNN	687	3.767
	Fox Business	813	4.458
	The Street	3810	20.893
	The Verge	2847	15.612
	The Washington Post	6051	33.182
	Market-Watch	2520	13.819
Company	Apple	7591	41.626
	Facebook	7513	41.199
	Tesla	3132	17-175
TABLE 2	Total News Articles Obtained for Apple, Tesla and Facebook Organized		
	by Source for Test Set (January 2019-March 2019)		
Variable	Categories	Frequencies	Percentage
Source	BGR	185	9.30
	Breitbart	167	8.39
	cNN	211	10.60
	Fox Business	147	7.39
	The Street	590	29.65
	The Verge	603	30.30
	The Washington Post	87	4.37
	Market-Watch	0	0
Company	Apple	1144	57.49
	Facebook	416	20.90
	Tesla	430	21.61
tion, and give directions to financial experts, is valid as in our case where we reached a certain level of prediction based on several attributes. Still, though Efficient Market Hypothesis (emh) clearly states that financial stock prices cannot be predicted, because there is no 100% prediction. The accuracy rate of the first model is high and there is a strong relationship between financial news and stock price movements.
Volume 17 • Number 4 • 2019
344 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
table 3 Training Set Classification Data Organized by Company and Frequency
Company		Down	Neutral	Up	Total
Apple	N	1,006	3,930	2,655	7,591
	%	5-52	21.55	14.56	41-63
Facebook	N	1,390	3,683	2,440	7,513
	%	7.62	20.20	13-38	41.20
Tesla	N	468	1,634	1,030	3,132
	%	2-57	8.96	5.65	17.17
Total	N	2,864	9,247	6,125	18,236
	%	15.71	50.71	33-59	100.00
To test the second model 3 variables as input are given: Source, Company and Sentimentof_text to predict the class up, down or neutral. In comparison with the first model that has an accuracy of 94.29%, the second model has 49.49% which is significantly with lower accuracy than the first model, that has just one more variable - the 5-day roc. This prediction rate is less than the guessing probability (50%), and as such this model is irrelevant. It can be stated that aside from sentiment of text, stock data variables as in this case of '5-day roc' plays a vital role in prediction of the future stock price movements. Only a single relevant variable as the rate of change of a given stock for the last five days does impact the prediction in the model where we deal with stock rates. It is important to include the most relevant variables when applying prediction in the FinTech industry.
It can be stated that the weak form of emh is true. Weak form efficiency states that all future price movements follow a random walk, unless there is some change in some fundamental information. It does not state that prices adjust immediately in the advent of new fundamental information, which means that some forms of fundamental analysis and news article analysis might provide excess returns. This is because they trade on new information and does not use any historical information to look for patterns.
Simulation
In order to apply and validate the text-mining algorithms and classification techniques mentioned in this paper to predict the financial news, we have conducted a simulation based on exact data from the market and
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 345
the cross reference of the dates with the predictive algorithm. Simulation data is conducted with 10.000$ investments per company per day.
Table 4 contains sample data for our simulation. Bellow is the description of the fields in order to better understand when the 'investment' is to be made:
•	DoW represents the Day of the week that we have news articles processed and categorized the sentiment. It is an important variable to understand if we are dealing with weekends.
•	Date is the actual date when the news have appeared and the stock Close price date.
•	Company is company name that we have collected data.
•	Close is the closing price of the stocks for the given date and company.
•	5-day roc is the historical rate of change for the last 5 days.
•	Invest? is a probability calculation that is derived as an average of all sentiments for all news in our dataset. For instance: if there is one news article and it is positive towards Facebook, then this percentage is 100%. In case ifthere is one negative and3positive newsfor a given company, then the probability is 75% to invest.
•	Investment is the same amount used for the purpose of simplification and simulation: $10,000.00.
•	Profit/Loss is calculated based on the prediction if we go for investment. For instance, the second row results in investment for Facebook. In this case 'investment'probability is 100% and in the simulation we 'invest' total $10,000. The difference in Close price for Facebook for 1/1/2019 and Facebook for 1/2/2019 in stock prices is ($10.000/135.68 = 76.283 stocks, or equal on 2nd of January as 76.283x$135.68 = $10,350.14) and profit/loss is $350.14.
Rules and assumptions for 'investment' are the following:
•	The 'investment' is done a day after the news have been published (the effect of the published news will be seen the next day). In case of weekends, the investment is done next Monday.
•	The purchased stocks will be sold the next day, for simplification the cost of closed stock price for that day has been taken into the simulation.
•	The difference represents either Profit/Loss.
Volume 17 • Number 4 • 2019
346 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi table 4 Profit and Loss Sample Table for the Test Set simulation
(1) (2)		(3)	(4)	(5)	(6)	(7)	(8)
3	1/1/2019	Apple	157-740	18.423429	80.00	10,000.00	11.41
3	1/1/2019	Facebook	131.089	-60.73621	100.00	10,000.00	350.14
3	1/1/2019	Tesla	332-799	110.98007	100.00	10,000.00	(681.49)
4	1/2/2019	Apple	157.919	20.466856	67.74	10,000.00	(996.07)
4	1/2/2019	Facebook	135.679	-59.230769	66.67	10,000.00	(290.39)
4	1/2/2019	Tesla	310.119	96.601993	57.14	10,000.00	(314.72)
5	1/3/2019	Apple	142.190	8.4674699	54.93	10,000.00	426.89
5	1/3/2019	Facebook	131.740	-60.414660	62.50	10,000.00	471.38
5	1/3/2019	Tesla	300.359	90.197561	54-55	10,000.00	576.97
notes Column headings are as follows: (1) Dow, (2) date, (3) company, (4) close, (5) 5-day Roc, (6) invest (%), (7) investment ($), profit/loss ($).
1250 1000
750 . 500
250-•--•-_-
• • • • • * i , .
0__» * _•_•___•____
-250 . -500
• •
-1000 -1250
| January	| February
figure 1 Chart of Profit/Loss simulation on Test Set Data Based on Classification
Model (Naive Bayes) Used in This Study (dark green - Apple, light green -Facebook, black - Tesla)
-750
Figure 1 clearly shows the fluctuation on the investment simulation based on real data from stock closing prices, for the three companies in combination with our model as explained in previous table. Tesla data shows more fluctuation and as such we excluded in our next chart (figure 2) to see if the model prediction can be used for investment.
Figure 2 shows improvements and majority of profit cases on the investment simulation when Tesla is excluded, and the only trade is done with Apple and Facebook. In order to support this figure, table 5 represents
Managing Global Transitions
1250 . 1000 750 . 500 250 .
-500 . -750
Applied Text-Mining Algorithms for Stock Price Prediction 347
0----------
•. •	•	I*	•
-250_____
-1000 -1250
| January	| February
figure 2 Chart of Profit/Loss Simulation on Test Set Data Based on Classification Model (Naive Bayes) Used in This Study, Excluding Tesla Company (dark green - Apple, light green - Facebook
real data for our simulation for the 2 months test data used in this experiment. The values show daily profit/loss for all three companies based on the news sentiments that we processed.
In table 5, Profit and Loss table for Apple, Facebook and Tesla based on the Test Set simulation with $10.000 investments was conducted. The simulation conducted does not show 100%-win case for the classification of stock prediction and as such it does not apply to all companies. The difference where there are better results relies on the targeted companies, such as Apple and Facebook, which are more stable ones rather than Tesla, which as a case had different fluctuations that in long term did not bring good results in our simulation.
In this simulation, Tesla's predictions based on our model result in Loss where the other two companies Apple and Facebook in the long run result in profit of around $2.600 if the algorithm is run every day where the data is available with daily amounts of $10.000 per company investments.
Conclusion
The trading of stock in public companies is an important part of the economy, so in this study stocks have been analyzed through using data mining and text mining techniques to make a prediction for stock price directions of the stocks for 3 companies listed public.
To achieve a prediction we gathered data, collected relevant financial news articles from reliable sources with both qualitative and quantitative
Volume 17 • Number 4 • 2019
348 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
table 5 Profit and Loss Table for Apple, Facebook and Tesla Based on the Test Set Simulation
Date	Apple	Facebook	Tesla	Grand total ($)
1 January	11.411	350.141	-681.490	-319.939
2 January	-996.074	-290.388	-314.717	-1601.179
3 January	426.893	471.382	576.975	1475.250
6 January	-22.258	7.249	543.611	528.602
7 January	190.631		11.644	202.275
8 January	169.817	119.273	94.826	383.916
9 January	31.962	-2.080	190.234	220.116
10 January	-98.180	-27.739	66.383	-59.536
13 January	-150.371		-370.328	-520.699
14 January	204.667	244.859	299.940	749.466
15 January	122.166	-94.663		27.503
16 January	59.378	51.512	36.411	147.301
17 January	61.594	117.329	-1297.112	-1118.189
21 January	-224.461	-164.622	-110.501	-499.584
22 January	40.443	-221.590		-181.147
23 January	-79.262	106.029	136.306	163.073
24 January	331.369	218.062	189.702	739.132
27 January	-92.545	-103.348	-22.219	-218.113
28 January	-103.647	-222.418	36.439	-289.626
29 January	683.347			683.347
30 January	72.012	1081.638	-56.676	1096.974
31 January	4.807	-58.791	169.044	115.060
Continued on the next page
data. This combined with the second type of data of stock prices were used in our study. For every article, a sentiment score (positive and negative) of the text content is calculated.
We have found out that a model that does not include price fluctuations and wholly relies on text content to predict the stock price fluctuation is not accurate at all. Including additional variables improves significantly the prediction. In our case the variable '5-day roc' plays an important role in predicting the future stock prices.
This article, except for proposing the model used and the process undertaken to arrive at the desired data set, contains results from the sim-
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 349
table 5 Continued from the previous page
Date	Apple	Facebook	Tesla Grand total ($)	
3 February	284.050	213.626	21.781	519.456
4 February	171.094		270.382	441.477
5 February	3.445	-39.145	-128.520	-164.220
6 February	-189.394	-241.070	-306.096	-736.560
10 February	-57.509	-92.034	230.216	80.673
11 February		-45.238		-45.238
12 February			-116.737	-116.737
13 February	36.433		-142.779	-106.347
14 February	-22.249		135.300	113.052
18 February	29.926	-12.924		17.002
19 February	64.354		-100.773	-36.419
20 February	-56.386	-155.020	-374.471	-585.876
21 February	111.657	115.596	119.492	346.746
24 February	72.845	168.633	137.762	379.240
25 February	5.740			5.740
26 February	30.975	-80.424		-49.449
27 February	-98.359			-98.359
28 February	105.112	51.409	-784.356	-627.836
Grand Total	1135.432	1465.244	-1540.327	1060.350
notes Empty fields in the rows above are possible due to the fact that not all companies chosen for investment every day, based on the news sentiment probability of investment.
ulation of the model. Previous models for sentiment analysis of financial news articles are limited in news articles from relevant sources and as such, based only on sentiment of the news do not provide enough information for future movements. Our model in this paper adds more variables to the dataset in order to give more accuracy to the prediction.
As the results are probabilistic weights (predictions), the simulation we conducted does not show 100%-win case for the classification of stock prediction and as such it does not apply to all companies. The difference where we have better results relies on the targeted companies, such as Apple and Facebook, which are more stable ones rather than Tesla, which as a case had different fluctuations that in long term did not bring good results in our simulation.
Future work would follow with the research on the characteristics of
Volume 17 • Number 4 • 2019
350 Adrian Besimi, Zamir Dika, Visar Shehu, and Mubarek Selimi
the companies that would fit to the model, with the tendency to prove that the proposed model is universal for the specific companies within specific variables, adding more tests and simulations as well.
It could also prove valuable to evaluate deep learning algorithms for the purpose of sentiment analysis. These algorithms are yet to show good results when larger text bodies are used, however, for short tweets they are very accurate. Furthermore, in this paper we evaluated only a three-class problem in the context of sentiment analysis. It would be of interest to approach using a multiclass prediction model and see how diverse sentiments would affect the stock market. This is an area where nn algorithms would prove to be much more beneficiary. Finally, even though we consider only highly respected news sources for our analysis, we could further drill down to add the author of the source as an attribute. A renowned author would probably have more weight in the context of affecting the stock market with his articles.
References
Aase, K. G. 2011, 'Text Mining of News Articles for stock Price Prediction.' Master's thesis, Institutt for datateknikk og informasjonsvitenskap.
Falinouss, P. 2007. 'Stock Trend Preidction Using News Articles: A Text Mining Approach.' Master's thesis, Lulea University of Techonology.
Hagenau, M., M. Liebmann, and D. Neumann. 2013. Automated News Reading: Stock Prices Prediction Based on Financial News Using Context-Capturing Features.' Decision Support Systems 55 (3): 685-97.
Joshi, K., J. Rao, and H. N. Bharathi. 2016. 'Stock Trend Prediction Using News Sentiment Analysis.' International Journal of Computer Science & Information Technology 8 (3): 67-76.
Kaya, M. Y., and M. E. Karsligil. 2010. 'Stock price Prediction Using Financial News Article.' In Proceedings of the 2nd International Conference on Information and Financial Engineering, 478-82. Chongqing: ieee.
Khedr, A. E., and N. Yaseen. 2017. 'Predicting Stock Market Behavior Using Data Mining Technique and News Sentiment Analysis.' International Journal of Intelligent Systems and Applications 9 (7): 22-30.
Kim, H., and Y.-S. Jeong. 2019. 'Sentiment Classification Using Convolu-tional Neural Networks.' Applied Sciences 9 (11): 2347. https://doi.org/10
.3390/app9112347
Kim, Y., S. R. Jeong, and I. Ghani. 2014. 'Text Opinion Mining to Analyze News for Stock Market Prediction.' International Journal ofAdvances in Soft Computing and its Application 6 (1): 1-13.
Nikfarjam, A., E. Emadzadeh, and S. Muthaiyah. 2010. 'Text Mining Approaches for Stock Market Prediction.' In Proceedings of the 2nd Inter-
Managing Global Transitions
Applied Text-Mining Algorithms for Stock Price Prediction 351
national Conference on Computer and Automation Engineering, 25660. Singapore: ieee. Tabari, N., A. Seyeditabari, T. Peddi, M. Hadzikadic, and W. Zadrozny. 2019. 'A Comparison of Neural Network Methods for Accurate Sentiment Analysis of Stock Market Tweets.' In ecml pkdd 2018 Workshops, edited by C. Alzate, A. Monreale, L. Bioglio, V. Bitetta, I. Bor-dino, G. Caldarelli, A. Ferretti, R. Guidotti, F. Gullo, S. Pascolutti, R. G. Pensa, C. Robardet, and T. Squartin. Cham: Springer. Vale, M. N. d. 2018. 'Dow Jones Index Change Prediction Using Text Mining.' Instituto Alberto Luiz Coimbra De Pos-Graduaçao E Pesquisa De Engenharia, Rio de Janeiro. Yip, J. (2018), 'Algorithmic Trading Using Sentiment Analysis on News Articles.' https://towardsdatascience.com/https-towardsdatascience -com-algorithmic-trading-using-sentiment-analysis-on-news-articles
-83db77966704.
This paper is published under the terms of the Attribution-NonCommercial-NoDerivatives 4.0 International (cc by-nc-nd 4.0) License (http://creativecommons.org/licenses/by-nc-nd/4-0/).
Volume 17 • Number 4 • 2019