https://doi.org/10.31449/inf.v47i2.4390
Informatica 47 (2023) 213-220

An Integrated Approach for Analysing Sentiments on Social Media

Vrinda Tandon, Ritika Mehra*
Department of Computer Science and Engineering, Dev Bhoomi Institute of Technology, India
E-mail: vrinda1804tandon@gmail.com, riti.arora@gmail.com

Keywords: sentiment analysis, natural language processing (NLP), SVC, scikit-learn

Received: September 11, 2022

Sentiment analysis is an analytical subfield of Natural Language Processing (NLP) used to determine the opinion or emotion associated with a body of text. The need for social media sentiment analysis has grown markedly with the expanding volume of online activity in the form of user-generated content such as posts and comments on social networking platforms. People openly share their thoughts, opinions and reviews, which can be leveraged to analyse how they feel about a particular topic or about a certain service. This study covers different approaches to social media sentiment analysis on Twitter datasets, both balanced and imbalanced, obtained from Kaggle. For text classification, we implemented several techniques, including Naive Bayes classification and Support Vector Classification (SVC). We conclude that SVC surpassed the other classification techniques on the Twitter dataset in terms of performance.

Povzetek: A study of different approaches for sentiment analysis on Twitter is presented, in which Support Vector Classification performed best.

1 Introduction

Sentiment analysis, also referred to as emotion AI, helps to determine an author's mentality or attitude by classifying a piece of writing as positive, negative or neutral. It has shown a notable influence in domains such as product analysis, political campaigns, marketing and competitive research, and brand inclination and monitoring. Content posted by users on social media can be leveraged to make noteworthy interpretations of the author's opinion, tone or emotions. The need for social media sentiment analysis has grown markedly with the expanding volume of online activity in the form of user-generated content such as posts and comments on social networking platforms. People openly share their thoughts, opinions and reviews, which can be leveraged to analyse how they feel about a particular topic or about a certain service. Social media sentiment analysis, often known as opinion mining, revolves around diving into words to understand the context of user-generated content and the opinions it reveals.

In this study we apply distinct classification techniques to a dataset of Twitter tweets. The objective is to classify the sentiment of tweets with and without class weights, for both a balanced and an imbalanced dataset, and thereby to examine the difference in the results obtained and its impact.

2 Literature review

The number of data sources on which sentiment analysis can be performed has grown exponentially with time; hence, large amounts of opinionated data can be fetched from different types of social media and websites. Content posted on different platforms, for example movie reviews on web-based entertainment sites, posts on social networking platforms or product reviews, can be used to make critical interpretations of the author's tone, feelings or feedback on the text they post.
In this section, we examine the work presented by different researchers on sentiment analysis and briefly discuss their approaches and observations.

Gera and Kapoor covered distinct pre-processing and classification techniques on binary and multi-class movie review datasets. Along with traditional classifiers, modern classifiers such as RNNs were also implemented. The performance of the various approaches was compared, and SVM with word embeddings was shown to surpass the other classification techniques [2]. Naresh and Krishna determined social media users' opinions using optimisation-based machine learning algorithms. They found that the proposed technique, sequential minimal optimisation with a decision tree, provides 89.47% accuracy compared to other algorithms. Tweets were collected and classified into three categories, i.e., positive, negative and neutral. According to the authors, the model will run faster and take less time on larger datasets [3]. Mullen and Collier conducted a sentiment analysis study on a movie review dataset. 1380 movie reviews were collected from the website epinions.com, and the dataset contained both negative and positive reviews. A Support Vector Machine was applied to train and test the model. For validation, three-fold and ten-fold cross-validation were used; 84.6% accuracy was obtained with three-fold cross-validation and 86% with ten-fold cross-validation [1]. The studies reviewed are summarised in Table 1.

Table 1: Studies of different classification techniques on various datasets

Authors | Paper title | Models/algorithms | Discussion | Datasets | Year
Divij Gera, Amita Kapoor [2] | Sentiment Analysis using Scikit Learn: A Review | RNN and BERT models | Sentiment analysis on movie reviews | Binary classification dataset from IMDb and multi-class dataset from Rotten Tomatoes | 2022
Naresh, A. and Parimala Venkata Krishna [3] | An efficient approach for sentiment analysis using machine learning algorithm | Sequential minimal optimisation with decision tree, multivariate vehicle regression models | Classification of Twitter data | Airline Twitter dataset | 2021
Yuxuan Wang, Yutai Hou, Wanxiang Che & Ting Liu [4] | From static to dynamic word representations: a survey | Static and dynamic embedding models | Survey of evaluation metrics and applications of word embeddings | TOEFL [13], ESL [11], RDWP [14], BM [12], AP and ESSLLI-2008 [10] | 2020
Kapoor, Amita [5] | Hands-On Artificial Intelligence for IoT: Expert machine learning and deep learning techniques for developing smarter IoT systems | Machine learning, deep learning and genetic algorithms | Making IoT solutions smart | UCI ML (combined cycle power plant) | 2019
Vanaja, S., & Belwal, M. [6] | Aspect-level sentiment analysis on e-commerce data | Naive Bayes and Support Vector Machine (SVM) algorithms | Aspect-level sentiment analysis | Amazon customer review data | 2018
Jianqiang, Zhao, and Gui Xiaolin [8] | Comparison research on text pre-processing methods on twitter sentiment analysis | Naive Bayes, Random Forest, Logistic Regression, Support Vector Machine | URLs do not contain useful information for sentiment classification | Stanford Twitter Sentiment Test (STS-Test), SemEval2014, Stanford Twitter Sentiment Gold (STS-Gold), Sentiment Strength Twitter (SS-Twitter), Sentiment Evaluation (SE-Twitter) | 2017
Gamallo, P., & Garcia, M. [9] | A Naive-Bayes Strategy for Sentiment Analysis on English Tweets | Naive Bayes | Detecting the polarity of English tweets | SemEval2014 (tweeti-b.dist.tsv) | 2014
Tony Mullen and Nigel Collier [1] | Sentiment analysis using support vector machines with diverse information sources | SVMs based on unigrams and lemmatised versions of the unigram models | Assigning semantic values to phrases and words within a text so that they can be exploited more usefully | Epinions.com | 2004

2.1 Concern with imbalanced data

One of the major challenges to deal with is imbalanced data. Imbalanced datasets are those in which the distribution of observations over the target classes is uneven; in other words, one class label has a much larger number of observations than the others. The main concern is to obtain accurate and efficient predictions for the minority as well as the majority class. Imbalanced datasets are prone to giving biased results, so distinct approaches are used to mitigate the issue. A model is susceptible to failure when fed poor data; imbalanced data leads to inconsistent results and is considered one of the major obstacles to obtaining genuine results. In a study conducted by Alation [16] it was found that more than 80% of the participants were concerned about data quality affecting the progress of their AI initiatives. Imbalanced or mislabelled data, and data gathered from unknown or unreliable sources for training and testing, are major factors in producing flawed results.

Some real-life failures induced by flawed data are the following. An automated experimental hiring model by Amazon ended up as a failure due to imbalanced training data: the system designed for hiring was found to be biased against women candidates, having trained itself to infer that male candidates were better. A racial bias was found in health prediction algorithms used by US hospitals and insurance organisations: a study published in Science revealed that the algorithm recommended white patients over Black patients. A predictive tool to identify COVID-19 and diagnose patients was found unfit for clinical use by its own researchers: Derek Driggs' group observed that the training dataset contained scans of patients in lying and standing positions, which led the model to infer that patients in a lying position were seriously ill, so the COVID-19 risk algorithm effectively gave results based on the position in which patients were scanned.

With an imbalanced dataset, the chance of the algorithm being biased towards the majority class is high, and the main objective becomes to reduce misclassification of the minority class by assigning a higher class weight to the minority class while simultaneously lowering the class weight of the majority class. In this study, different weights were assigned to the classes to improve performance for both binary and multiclass imbalanced data, as sketched below.
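As an illustration only (not the authors' original code), the following minimal sketch shows how such balanced class weights can be derived with scikit-learn; the toy label array stands in for the three sentiment labels used later in this paper.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 1, 1, -1])        # toy, imbalanced label distribution
classes = np.unique(y)                       # array([-1, 0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))   # {-1: 2.33, 0: 0.58, 1: 1.17}: rarer classes get larger weights
print(class_weight)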
3 Working flowchart of proposed work

We implemented various classification techniques for text classification, namely Multinomial Naive Bayes, Bernoulli Naive Bayes and SVC, on the tweets posted by users. The algorithm and working flowchart (Figure 1) for the proposed work are as follows.

Figure 1: Proposed working flow chart.

3.1 Algorithm for proposed work

Step 1. Divide the data into training and testing sets in an 80/20 proportion.
Step 2. Determine the nature of the dataset, i.e., balanced or imbalanced, using seaborn.countplot() on the sentiment distribution.
Step 3. Apply the Multinomial Naive Bayes, Bernoulli Naive Bayes and SVC classifiers, using the one-vs-one (o-v-o) approach, as per the nature of the dataset.
Step 4. Predict on the test dataset using predict().
Step 5. Classify tweets with numerical labels.
Step 6. Determine the F1-score, precision and recall for each classifier.

4 Dataset utilised

Determining a sentiment score is one of the prominent approaches to assessing the emotion or tone of a text. This scaling system assigns scores corresponding to the tone of the text, i.e., positive, negative or neutral, making it easier to understand. We have utilised the Twitter Tweets Sentiment Dataset obtained from Kaggle to classify tweet sentiment. This multiclass dataset consists of four columns, i.e., textID, text, selected_text and the sentiment associated with the tweet. The sentiments were labelled as positive, neutral and negative. In total, 27.5k tweets were present in the dataset.

4.1 Data pre-processing

The sentiment column was originally labelled as neutral, positive and negative, so in order to obtain the sentiment distribution of the dataset we replaced the neutral, positive and negative labels with the numerical values 0, 1 and -1 respectively. The data containing 27.5k tweets was thus mapped onto three numerical categories from -1 to 1, from negative to positive sentiment. The sentiment distribution (Figure 2) of the dataset was plotted using seaborn.countplot(), after importing the seaborn module in Colab.

Figure 2: Sentiment distribution of the Twitter data.

NA values present in the dataset were handled using fillna(). All numerical values were removed from the text, followed by punctuation removal, conversion to lowercase and expansion of contractions. The dataset was then divided into training and testing data (Figure 3), and the split was plotted in order to verify that it was stratified.

Figure 3: Training and testing dataset.
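For concreteness, the following is a minimal sketch of the pre-processing and split described above, assuming a pandas DataFrame loaded from the Kaggle CSV with text and sentiment columns; the file name and exact cleaning rules are illustrative, not the authors' original code.

import re
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

df = pd.read_csv("Tweets.csv")         # Kaggle Twitter Tweets Sentiment Dataset (file name is illustrative)
df["text"] = df["text"].fillna("")     # handle missing tweet texts
df["sentiment"] = df["sentiment"].map({"negative": -1, "neutral": 0, "positive": 1})

sns.countplot(x="sentiment", data=df)  # inspect whether the class distribution is balanced

def clean(text):
    text = text.lower()                   # lowercase conversion
    text = re.sub(r"\d+", "", text)       # remove numerical values
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return text

df["text"] = df["text"].apply(clean)

# 80/20 split, stratified so both parts keep the original sentiment proportions
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2,
    stratify=df["sentiment"], random_state=42)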
4.2 Imbalanced multiclass dataset

Imbalanced datasets are those in which the distribution of observations over the target classes is uneven: one class label has a much larger number of observations than the others, and the main concern is to obtain accurate predictions for the minority as well as the majority class. Imbalanced datasets are prone to giving biased results, so distinct approaches are used to mitigate the issue.

After obtaining a summary of the dataset using the info() method, operations similar to those in Section 4.1 were performed. The sentiment labels were again mapped to numerical values (-1 for negative, 0 for neutral, 1 for positive), after which seaborn.countplot() was used to plot the sentiment distribution shown in Figure 4. Numerical values as well as punctuation were removed, and lowercase conversion was applied to pre-process the data.

Figure 4: Sentiment distribution of the imbalanced Twitter data.

This imbalanced dataset was divided into training and testing data (Figure 5), and the split was plotted in order to verify that it was stratified.

Figure 5: Training and testing dataset.

5 Classification techniques implemented

5.1 Naive-Bayes classifier

Naive Bayes is a supervised learning algorithm based on Bayes' theorem. This probabilistic machine learning algorithm predicts on the basis of the probability of an object and is used to solve classification problems. Bayes' theorem, also known as Bayes' law, gives the conditional probability as:

P(A|B) = P(B|A) P(A) / P(B)

The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. The three common types of Naive Bayes models are:

Gaussian Naive-Bayes classifier: this model assumes a normal distribution of features when working with continuous data, and the likelihood of the features is given as:

P(x_i | y) = (1 / sqrt(2π σ_y²)) exp(−(x_i − μ_y)² / (2σ_y²))

Multinomial Naive-Bayes classifier: as the name suggests, this classifier is used when the data is multinomially distributed. It is specifically used for document classification, and the conditional probability of a feature given a class is estimated as:

P(x_i | y) = (N_yi + α) / (N_y + αn)

where N_yi is the count of feature i in class y, N_y is the total feature count in class y, n is the number of features and α is the smoothing parameter.

Bernoulli Naive-Bayes classifier: this is a multivariate event model, also popular for document classification, in which binary term-occurrence features are used instead of term frequencies. Here features are independent booleans describing the inputs, and the likelihood of the features is given as:

P(x_i | y) = P(i | y) x_i + (1 − P(i | y)) (1 − x_i)

5.2 Naive-Bayes classifier on the data

Data cleaning was performed, followed by stop-word removal. A WordCloud (Figure 6) was produced to display frequently occurring terms in the corpus. The dataset was then vectorised using the TF-IDF vectoriser, and the Multinomial and Bernoulli classifiers provided by the scikit-learn library were trained and validated on the resulting vectors.

Figure 6: WordCloud for the Twitter dataset.

With an imbalanced dataset, the chance of the algorithm being biased towards the majority class is high, and the main objective becomes to reduce misclassification of the minority class by assigning a higher class weight to the minority class while lowering the class weight of the majority class. Therefore, different weights were assigned to the classes to improve performance, one possible realisation of which is sketched below.
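A minimal sketch of this step, continuing the split from Section 4.1 (illustrative, not the authors' original code): the TF-IDF vectors feed scikit-learn's MultinomialNB and BernoulliNB, and since these estimators expose no class_weight parameter, balanced per-sample weights are used here as one way to realise the class weighting described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import f1_score

vectorizer = TfidfVectorizer(stop_words="english")   # stop-word removal and TF-IDF weighting
X_tr = vectorizer.fit_transform(X_train)             # learn the vocabulary on the training tweets only
X_te = vectorizer.transform(X_test)

sample_weight = compute_sample_weight(class_weight="balanced", y=y_train)

for clf in (MultinomialNB(), BernoulliNB()):
    clf.fit(X_tr, y_train, sample_weight=sample_weight)  # drop sample_weight for the unweighted runs
    pred = clf.predict(X_te)
    print(type(clf).__name__, f1_score(y_test, pred, average=None))  # per-class F1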
5.3 Splitting multiclass classification into binary classification

Algorithms designed for binary classification cannot be applied directly to multi-class classification problems, so heuristic methods such as one-vs-rest and one-vs-one are used to make binary classifiers work as multiclass classifiers.

One-vs-Rest classification model: also known as one-vs-all, this heuristic method enables binary classification algorithms to work as multi-class classifiers. The multi-class data is split into one binary classification problem per class, so that a binary classification algorithm can be applied to each of them [15].

One-vs-One classification model: this approach is similar to o-v-r in that it also splits the multi-class dataset into binary classification problems. The primary difference between o-v-r and o-v-o is that o-v-o creates one binary problem for each pair of classes, pitting each class against every other class [15].

In this study we used the one-vs-one (o-v-o) strategy to split the multi-class classification task into a binary classification problem for each pair of classes. We first trained the underlying classifier as is, after which class weights were introduced and the adjusted class weights were used to retrain the model.

6 Results

After completing the sentiment analysis on the Twitter dataset, the classifiers used in the study were tested and validated by determining the F1-score, precision and recall. It was evident that the Naive Bayes variants gave poor results on the imbalanced dataset, which were then improved by introducing class weights. The results for the balanced and imbalanced multiclass datasets are displayed in Tables 2 and 3. For the balanced multiclass dataset, results with and without class weights show no major variation after introducing class weights, owing to the balanced nature of the dataset; SVC displayed the best results.

Table 2: Per-class F1-scores for the balanced dataset

S. No. | Classifier | F1-score
1 | Bernoulli Naive-Bayes (without class weights) | [0.5473 0.6598 0.6851]
2 | Multinomial Naive-Bayes (without class weights) | [0.4914 0.6555 0.6280]
3 | Bernoulli Naive-Bayes (with class weights) | [0.4955 0.6371 0.5163]
4 | Multinomial Naive-Bayes (with class weights) | [0.4292 0.6356 0.3952]
5 | Support Vector Classifier (without class weights) | [0.6683 0.6944 0.7489]
6 | Support Vector Classifier (with class weights) | [0.6803 0.6738 0.7547]

For the highly imbalanced dataset, each algorithm initially produced poor results, so class weights were introduced, after which SVC and Bernoulli Naive-Bayes displayed improved results (Table 3).

Table 3: F1-scores for the imbalanced dataset

S. No. | Classifier | F1-score
1 | Bernoulli Naive-Bayes (without class weights) | 0.6608
2 | Multinomial Naive-Bayes (without class weights) | 0.6560
3 | Bernoulli Naive-Bayes (with class weights) | 0.6880
4 | Multinomial Naive-Bayes (with class weights) | 0.6714
5 | Support Vector Classifier (without class weights) | 0.7024
6 | Support Vector Classifier (with class weights) | 0.7446
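As an illustration of the configuration behind the SVC rows above (a hedged sketch, not the authors' exact code): scikit-learn's SVC handles multi-class data with a one-vs-one scheme internally, class_weight="balanced" supplies the class weights, and the per-class F1-score, precision and recall follow from the metrics module. The kernel and other hyper-parameters are not specified in the paper and are left at their defaults here.

from sklearn.svm import SVC
from sklearn.metrics import classification_report, f1_score

svc = SVC(class_weight="balanced")   # SVC uses one-vs-one internally for multi-class data
svc.fit(X_tr, y_train)               # TF-IDF features from the earlier sketch
pred = svc.predict(X_te)

print(f1_score(y_test, pred, average=None))   # per-class F1, as reported in Tables 2 and 3
print(classification_report(y_test, pred))    # precision, recall and F1 for each class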
7 Conclusion and future scope

In this study we covered different approaches to social media sentiment analysis on a Twitter tweets dataset fetched from Kaggle. The Twitter dataset was multiclass data, which required more data pre-processing and was therefore closer to the datasets encountered in real-life sentiment analysis. Binary classification algorithms were leveraged on the multiclass dataset with the help of the o-v-o heuristic technique, and the tweets were classified into positive, negative and neutral categories using distinct classification approaches. We implemented various classification techniques using the scikit-learn library for comparative analysis, namely Bernoulli Naive-Bayes, Multinomial Naive-Bayes and SVC, with a TF-IDF vectoriser.

A model is susceptible to failure and inaccurate results when fed poor, i.e. imbalanced, data. Highly imbalanced data can have a large impact on a model's performance, and in real-life situations it is not surprising to encounter unbalanced datasets. It is therefore very important to select the right evaluation metric in such scenarios. In our study we used the F1-score as the evaluation metric, and class weights were introduced to obtain improved results, which enabled us to study the Multinomial, Bernoulli and Support Vector classifiers both with and without class weights on balanced and imbalanced datasets. For the balanced dataset, no major variation was observed after introducing class weights, while for the imbalanced dataset the results improved significantly, with the Support Vector Classifier (SVC) with class weights ending up as the best-performing classifier. In future work, we aim to test GloVe word embeddings along with different approaches to handling imbalanced data for sentiment analysis.

References

[1] Mullen, Tony, and Nigel Collier. Sentiment analysis using support vector machines with diverse information sources. International Conference on Empirical Methods in Natural Language Processing, 412-418, 2004. https://doi.org/10.3115/1219044.1219069
[2] Gera, D., & Kapoor, A. Sentiment Analysis using Scikit Learn: A Review, 2022. https://doi.org/10.13140/RG.2.2.26189.10720
[3] Naresh, A. and Parimala Venkata Krishna. An efficient approach for sentiment analysis using machine learning algorithm. Evolutionary Intelligence, 14, 725-731, 2021. https://doi.org/10.1007/s12065-020-00429-1
[4] Wang, Y., Hou, Y., Che, W., & Liu, T. From static to dynamic word representations: a survey. International Journal of Machine Learning and Cybernetics, 11(7), 1611-1630, 2020. https://doi.org/10.1007/s13042-020-01069-8
[5] Kapoor, Amita. Hands-On Artificial Intelligence for IoT: Expert machine learning and deep learning techniques for developing smarter IoT systems. Packt Publishing Limited, Birmingham, UK, ISBN 978-1-78883-606-5, 2019.
[6] Vanaja, S., & Belwal, M. Aspect-level sentiment analysis on e-commerce data. IEEE International Conference on Inventive Research in Computing Applications (ICIRCA), 1275-1279, 2018. https://doi.org/10.1109/icirca.2018.8597286
[7] Kumar, Y., Sharma, H., & Pal, R. Popularity Measuring and Prediction Mining of IPL Team Using Machine Learning. IEEE 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 1-5, 2021. https://doi.org/10.1109/icrito51393.2021.9596405
[8] Jianqiang, Zhao, and Gui Xiaolin. Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access, 5, 2870-2879, 2017. https://doi.org/10.1109/access.2017.2672677
[9] Gamallo, P., & Garcia, M. Citius: A Naive-Bayes Strategy for Sentiment Analysis on English Tweets. SemEval@COLING, 171-175, 2014. https://doi.org/10.3115/v1/s14-2026
[10] Baroni, M., Evert, S., Lenci, A. Bridging the gap between semantic theory and computational simulations. Proc. of the ESSLLI Workshop on Distributional Lexical Semantics, 2008. https://archive.illc.uva.nl/ESSLLI2008/Materials/BaroniEvertLenci/BaroniEvertLenci.pdf
[11] Baroni, M., Murphy, B., Barbu, E., Poesio, M. Strudel: a corpus-based semantic model based on properties and types. Cognitive Science, 34, 222-254, 2010. https://doi.org/10.1111/j.1551-6709.2009.01068.x
[12] Landauer, T. K., Dumais, S. T. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240, 1997. https://doi.org/10.1037/0033-295x.104.2.211
[13] Turney, P. D. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt, L., Flach, P. (eds) Machine Learning: ECML 2001, Springer, Berlin Heidelberg, 491-502, 2001. https://doi.org/10.1007/3-540-44795-4_42
[14] Jarmasz, M., Szpakowicz, S. Roget's Thesaurus and semantic similarity. In: Proc. of RANLP, 2003. https://doi.org/10.1075/cilt.260.12jar
[15] Multiclass classification using SVM: https://www.analyticsvidhya.com/blog/2021/05/multiclass-classification-using-svm/
[16] Alation study: https://www.alation.com/blog/alation-sodc-bad-data-spells-trouble-for-ai/