ERK'2022, Portorož, 490-493

Classification of Social Media Comments as Hate Speech – Comparative Study

Jer Pelhan¹
¹Faculty of Computer Science, Ljubljana
E-mail: jp4861@student.uni-lj.si

Abstract

The presence of hate speech on social media platforms is becoming one of the most pressing societal problems. The vast amount of content created every minute cannot be checked manually, so algorithms are needed to classify toxicity. In this article we test a statistical method, logistic regression, as a baseline, and three deep learning models: long short-term memory (LSTM), LSTM with convolutional layers, and bidirectional LSTM, for classification of comment toxicity on the Toxic Comment Classification Challenge dataset. LSTM outperforms logistic regression, LSTM with convolution and bidirectional LSTM by 8.4, 0.9 and 2.4 percent of the original score in F1, respectively. Furthermore, we present a novel pre-processing approach for classification of offensive speech. Instead of removing all punctuation, as related works do, we keep exclamation marks and words written in all caps, as we detected a correlation between them and the offensiveness of a comment on the selected dataset. Using the best performing model, the proposed pre-processing outperforms commonly used pre-processing approaches by 0.87 percent of the original score.

1 Introduction

With the start of the twenty-first century, many social media platforms arose. Social media platforms are a medium that enables communication: a registered user can generate content, e.g. post text, photos or videos. Currently more than 50% of people worldwide use social media platforms daily, which means that the impact of social media on our mental well-being is immense and extensive.
Although many natural language processing algorithms are successfully employed in different domains, the detection of hate speech is limited by the absence of public laws that classify hate speech.

Nonetheless, for all their benefits, social media platforms are a handy medium for sharing negativity and hate speech, i.e., insulting speech targeted at a specific person or marginalized community on the basis of characteristics of that person or group, for example race, origin, disability, gender identity or sexual orientation. Given that on average more than 10,000 tweets are posted every second, not all content can be checked manually. In addition, some European countries have already enacted laws requiring social media platforms to remove hate speech within twenty-four hours. Hence, artificial intelligence methodology, i.e., natural language processing (NLP), is required to tackle the colossal amount of possibly hateful content created every second.

The main problem of toxic speech on a social media web page is the myriad of forms it comes in. There is also clearly no agreed definition of what hate speech is and whether it results in cyberbullying, which is punishable in many European countries. In the last decade many different approaches to hate speech detection have been developed: first methods for binary classification and, in the last five years, multi-class approaches. In our work, we focus on the classification of toxicity, i.e., on multi-class approaches. We propose a new pre-processing technique that outperforms existing pre-processing techniques, and test it with respect to different models.

The remainder of the article is composed of five sections. In Section 2 we briefly describe related works on toxic comment classification. In Section 3, we present and analyze the Toxic Comment Classification Dataset that is used in this study.
The methods that we employ in this comparative study are described in Section 4, where we also describe the metrics, pre-processing pipelines and models used. Experimental results are reported in Section 5. In Section 6, we conclude the article and discuss future work.

2 Related Work

Current methods for hate speech detection and classification are divided into two groups: (i) deep learning models that extract and use their own features, and (ii) feature-based machine learning models that need thoroughly cleaned text, extracted into features, as input data.

Support vector machine (SVM) is a popular machine learning method in this field [1], valued for its simplicity and performance. Greevy et al. [1] presented a classification method for racist text using SVM in combination with bag of words (BOW) as feature representation. BOW is a technique for representing texts as vectors of fixed length. It does not consider word order, as it is only a collection of the words that appear in the text. The authors of [2] show that employing a BOW classifier is not sound, as it classifies a comment as toxic if it includes a word that is marked as hateful in the feature space, even though the word is not hateful in the context of that comment. This results in high recall, but also in a high rate of false positives, as many comments get marked as hate speech only because they contain a marked word.

Global Vectors for Word Representation (GloVe) [3] is a word embedding method published by Stanford researchers that does not explicitly model word order, but considers the words in the representational matrix based on their distance from the target word. The creators of fastText [4] from Facebook address word order in text, as it is an important part of text classification. In this method a text is represented by a vector in a way similar to BOW, but vectors are also used to represent word n-grams, thereby capturing local word order.
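To make the difference between a plain bag of words and an n-gram representation concrete, the following minimal sketch counts single words and word bigrams for two short texts. The function names and the example sentences are illustrative assumptions, not part of any cited system, and the raw counts only stand in for the learned vectors that fastText actually uses.

```python
from collections import Counter


def bag_of_words(text):
    """Count words, ignoring their order (a plain bag of words)."""
    return Counter(text.lower().split())


def word_ngrams(text, n=2):
    """Count contiguous word n-grams, which keep local word order."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Two hypothetical texts containing the same words in a different order.
a = "you are not stupid"
b = "you are stupid not"

same_bow = bag_of_words(a) == bag_of_words(b)    # word order is lost
same_bigrams = word_ngrams(a) == word_ngrams(b)  # word order is kept
```

The two texts are indistinguishable as bags of words, while their bigram counts differ, which is the property the n-gram features are meant to provide.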
With the development of neural networks, recurrent neural networks (RNN), especially long short-term memory (LSTM) [5, 6], became popular in NLP in general, but also for the task of classifying hate speech in texts. Chakrabarty et al. [7] state that the best reported score on classification of abuse on Twitter is achieved using LSTM. The authors of [8] show that neural networks in combination with RNN achieve good performance, as neural networks can in general detect interesting connections between features, which can be extracted using an RNN such as LSTM.

3 Dataset

For the comparative study of methods for classification of social media comments we chose the Toxic Comment Classification Challenge (TCCC) dataset published on Kaggle by Google and Jigsaw. The dataset is composed of 159,571 human-annotated comments collected from Wikipedia, used for training, and 63,978 annotated comments for testing the algorithm. All the comments were labeled by the type of toxicity: (i) toxic, (ii) severe toxic, (iii) obscene, (iv) threat, (v) insult and (vi) identity hate. A single comment can be associated with multiple labels, thus the task on this dataset is a multi-label classification problem. The three most represented classes of comments are toxic, obscene and insult. The dataset has a very unbalanced distribution, with only just above 12% of comments labeled with at least one category of toxicity.

Using a correlation matrix we observed correlations between the classes of the TCCC dataset. Interestingly, the 'toxic' and 'severe toxic' classes are correlated with a small correlation factor of 0.31. We further analyzed this correlation and came to the conclusion that the 'severe toxic' class is a subset of the 'toxic' class, and that the correlation factor is small only because the 'severe toxic' class is in the minority in comparison with the 'toxic' class.
A strong correlation exists between 'toxic' and 'obscene', but the strongest correlation is between the 'obscene' and 'insult' classes.

We thoroughly analyzed different features, such as comment length, amount of punctuation, word length and number of repetitions of characters in words, since useful features can be removed if not enough attention is paid to the dataset. Both the median and the mean length of clean comments are about 100 characters larger than those of hateful comments. Intuitively, hateful comments are more likely to have a higher percentage of capitalized characters. This holds for the average percentage of capitalized characters in the TCCC dataset, but the median percentages are very similar. The number of upper-cased words is correlated with the number of exclamation marks used in the comment. A clean comment contains less than one exclamation mark on average, whereas a hateful comment contains almost three on average. We do not explore any other strong correlations with other punctuation or other features.

4 Methods

4.1 Dataset Pre-processing

Comments in datasets for classification of hate speech contain special words and characters, e.g. punctuation, capitalization, new line symbols, stop words, emoticons, links and even IP addresses. All of that introduces additional, sometimes unnecessary, dimensionality into the feature space, which affects the performance of the model. Pre-processing is a set of techniques that change text data to make it more feature-dense without losing information. This step enables better classification, as it strongly reduces the dimensionality of the text space. It matters not only which pre-processing techniques we choose, but also the order in which we apply them. The authors of [9] extensively test pre-processing methods and their order. They point out that the order has a strong impact on the final performance of not only traditional machine learning models, e.g. linear regression and SVM, but also deep learning models.
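As an illustration of such a pipeline, the following minimal sketch removes URLs, IP addresses and HTML tags, strips all punctuation except exclamation marks, lowercases only words that are not fully upper-cased, and drops stop words. The regular expressions, the tiny stop-word list and the function name are assumptions made for this example, and lemmatization is left out because it would require an external library; it is not the exact implementation used in the experiments.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
HTML_RE = re.compile(r"<[^>]+>")
# Any character that is not a word character, whitespace or '!'.
PUNCT_RE = re.compile(r"[^\w\s!]")
# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}


def preprocess(comment):
    """Clean a comment while keeping '!' and fully upper-cased words."""
    # Step 1: remove URLs, IP addresses and HTML tags.
    text = URL_RE.sub(" ", comment)
    text = IP_RE.sub(" ", text)
    text = HTML_RE.sub(" ", text)
    # Step 2: remove punctuation (including '#') except exclamation marks.
    text = PUNCT_RE.sub(" ", text)
    tokens = []
    for token in text.split():
        # Lowercase unless the word is fully upper-cased (two or more characters).
        if not (token.isupper() and len(token) > 1):
            token = token.lower()
        # Stop words are dropped regardless of their capitalization.
        if token.lower() in STOP_WORDS:
            continue
        tokens.append(token)
    return " ".join(tokens)


example = "You are STUPID!!! See http://example.com from 192.168.0.1, <b>Idiot</b>."
cleaned = preprocess(example)
```

The order of the substitutions matters in this sketch: IP addresses and URLs have to be removed before punctuation, otherwise stripping the dots and slashes would leave digit and word fragments behind.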
Figure 1: Dataset pre-processing. We introduce novelties in the second and third step.

The first step in our pre-processing of the dataset is the removal of URLs, hashtag symbols and any other HTML elements, as seen in Figure 1. In the dataset that we use, we also remove IP addresses, as they only add noise to the text. In the second step we combine the removal of punctuation and stop words. In contrast to [9], we do not remove all punctuation, but keep the exclamation marks. Furthermore, we strip capitalization only from words that do not consist entirely of capitalized characters; completely upper-cased words are left in the dataset. Both described steps exploit the fact that Mikolov et al. [10] employ no complex data normalization or pre-processing in the training of fastText on large text corpora from Wikipedia and Web Crawl, which we use as a word embedding. As the last step we employ lemmatization, i.e., changing all the words of a text to their uninflected forms, so that all different forms of a word can be analysed as a single item.

4.2 Metrics

The official metric of the challenge [11] is the mean column-wise ROC AUC, i.e. the mean of the areas under the receiver operating characteristic curve over all classes. The ROC curve is a plot that presents the performance of a binary classification model at all classification thresholds. The curve plots the true positive rate, or sensitivity, against the false positive rate, or (1 - specificity). In addition, we use the F1 score, as it penalizes models that just predict everything as the negative class. Since the F1 score is the harmonic mean of recall and precision, we also consider both. Furthermore, we use macro-averaged F1 scores, i.e., the arithmetic mean of the F1 scores of the individual classes. We use macro-averaging as we want all classes to be treated equally. The same averaging is applied to precision and recall.

4.3 Models

Hyperparameter tuning.
We employed grid search for all the methods to obtain hyperparameters that are as close to optimal as possible and consequently achieve better results. The term hyperparameter applies to all non-trainable parameters of a model, which usually have a big impact on the performance of the model. Grid search is an exhaustive search over a specified set of candidate values that attempts to find the optimal ones.

Logistic Regression. As a baseline we train logistic regression, a statistical model that is applicable to binary classification. In the final configuration we use liblinear as the solver and set the inverse of regularization strength to 4, with the dual formulation, in combination with 1- and 2-grams. The features are extracted as term frequency–inverse document frequency (TF-IDF), which are commonly used features for text classification. We use multiple logistic regression classifiers, one for each class.

LSTM. Recurrent neural networks (RNN) like long short-term memory (LSTM) can use internal memory to process considerably long sequences of inputs; an LSTM thus interprets a document as a sequence of words. The major novelty of LSTM models in comparison to traditional RNNs is the ability to learn long-term dependencies between the inputs. The embedding layer transforms words into dense vectors, which are then fed to the LSTM. We use fastText [4] as the embedding layer. Then we apply a GlobalMaxPooling1D layer that down-samples the representation by taking the maximum value over time. Afterwards, a combination of dropout and dense layers is applied, ending with a dense layer with 6 outputs, i.e., one for each category. We performed grid search for the hyperparameters and set the batch size to 32 and the dropout parameter to 0.2, with ReLU activation and binary cross-entropy as the loss function, and trained for 6 epochs using the Adam optimizer with default learning rate of 0.01.
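The grid search used for hyperparameter tuning can be sketched as an exhaustive loop over all combinations of candidate values. In this minimal example the candidate grid, the function names and the scoring function are hypothetical: `toy_score` merely stands in for training a model and measuring its validation score, which is far too expensive to reproduce here.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best one.

    param_grid maps a hyperparameter name to a list of candidate values;
    evaluate takes a dict of settings and returns a score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the candidate values are illustrative only.
grid = {"batch_size": [32, 64], "dropout": [0.1, 0.2], "epochs": [6, 8]}


def toy_score(params):
    # Stand-in for "train the model with these settings and return validation F1".
    penalty = abs(params["dropout"] - 0.2)
    penalty += abs(params["batch_size"] - 32) / 100
    penalty += abs(params["epochs"] - 6) / 100
    return -penalty


best, score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths (here 2 × 2 × 2 = 8 evaluations), which is why the grid is usually kept small when every evaluation means training a neural network.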
LSTM-CNN. Long short-term memory combined with convolutional neural networks is becoming popular for classification tasks, as convolutional layers are good at detecting specific combinations of features which an RNN such as LSTM can extract. After the LSTM layer we employ convolution. Hyperparameter tuning set the batch size to 64 and the dropout parameter to 0.1, with ReLU activation and binary cross-entropy as the loss function, trained for 8 epochs using the Adam optimizer with standard parameters. For the convolutional layer, grid search found 64 convolutional filters, i.e. the third dimension of the output space, and a kernel size (length of the convolution window) of 3 to be optimal.

Bidirectional LSTM. A standard LSTM preserves only past information. In a bidirectional LSTM the inputs are fed in two ways, one forward and one backward, preserving information not only from the past but also from the future. This model is composed in the same manner as the LSTM described above. Hyperparameter tuning found the best performance of the model with a batch size of 64 for 9 epochs, a dropout rate of 0.1, the ReLU activation function with sigmoid on the last dense layer, binary cross-entropy as the loss function and the Adam optimizer with the default learning rate but with a decay of 0.0003.

Probabilistic classification. All the methods return the probability of a comment being a member of a certain class. Thus, we need to threshold the values to obtain the final classifications. To find the optimal threshold, we use the precision-recall curve and select the threshold that is optimal for both precision and recall, i.e. for the F1 score. We perform this for every category of the prediction, for each classifier separately.

5 Results

Table 1 summarizes the performance of all the tested methods on the Toxic Comment Classification dataset. Although at first sight we achieve low F1, recall and precision, we need to take into account that we are classifying every comment into six very similar, overlapping categories on a very difficult dataset.
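For readers who want to see how such scores are obtained from predicted probabilities, the following minimal sketch selects a per-class threshold that maximizes F1 and then macro-averages F1 over the classes. The function names and the toy labels and probabilities are assumptions for illustration. For brevity the threshold is chosen on the same data that is scored; in practice it would normally be chosen on held-out validation data.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels given as 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(y_true, probs):
    """Return the candidate threshold with the highest F1, and that F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):  # candidates are the predicted probabilities
        y_pred = [1 if p >= t else 0 for p in probs]
        f1 = precision_recall_f1(y_true, y_pred)[2]
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def macro_f1(y_true_by_class, probs_by_class):
    """Arithmetic mean of the per-class F1 scores, each at its own threshold."""
    scores = [best_threshold(y, p)[1]
              for y, p in zip(y_true_by_class, probs_by_class)]
    return sum(scores) / len(scores)


# Toy data for two hypothetical classes (four comments each).
labels = [[0, 0, 1, 1], [1, 0, 0, 0]]
probabilities = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.1, 0.3]]

threshold_0, f1_0 = best_threshold(labels[0], probabilities[0])
overall = macro_f1(labels, probabilities)
```

Because every class contributes equally to the macro average, a rare class such as 'threat' weighs as much as the frequent 'toxic' class, which is one reason the macro-averaged scores are lower than the AUC values might suggest.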
We decided to test the effect of pre-processing the dataset on the overall performance of the methods. We thus performed an experiment to evaluate the methods with no pre-processing, with the pre-processing described in [9], and with our proposed method. The results are presented in the first, second and third column group of Table 1, respectively.

Logistic regression is the method with the biggest performance increase from no pre-processing to the standard pre-processing proposed in [9]. The performance increase measured in AUC, F1, recall and precision is 3.7%, 8.7%, 15.3% and 2.45%, respectively. The increase is large because logistic regression is a feature-based machine learning model. Machine learning models, in contrast to deep learning models, need cleaned text extracted into the feature space as input data.

Convolutional neural networks are more commonly used in image processing tasks, but we show that LSTM-CNN can outperform the bidirectional LSTM network with standard pre-processing [9] by 1.5%, 2% and 1% in F1, recall and precision, respectively.

            No pre-processing           Standard pre-processing [9]   Proposed pre-processing
Model       AUC    F1     Re     Pr     AUC    F1     Re     Pr       AUC    F1     Re     Pr
Log. Regr   92.18  58.25  56.05  60.63  95.62  63.35  64.63  62.12    95.51  62.61  63.23  62.00
LSTM        97.22  68.30  72.77  64.36  97.39  68.66  72.88  64.92    97.60  68.70  72.26  65.48
LSTM-CNN    97.04  66.04  70.84  61.87  97.05  68.08  72.20  64.41    96.82  67.26  71.98  63.13
BiLSTM      96.32  65.18  69.34  61.50  97.19  67.05  70.74  63.74    97.19  68.16  70.66  63.99

Table 1: Comparison of AUC ROC, F1-measure, recall (Re) and precision (Pr) with respect to three different pre-processing techniques. Method based on LSTM outperforms all other methods. Best results of each method with respect to pre-processing are bolded.

The overall best method is based on LSTM with fastText embedding. It outperforms all other methods at all three stages of pre-processing with respect to all used measures.
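The percentages quoted in this section are relative changes, i.e. differences expressed as a percentage of the original score. As a small worked check, the sketch below recomputes the logistic regression gains from the Table 1 values; the helper name `relative_gain` is an assumption for this example.

```python
def relative_gain(original, new):
    """Change from original to new, as a percentage of the original score."""
    return 100.0 * (new - original) / original


# Logistic regression rows of Table 1:
# no pre-processing vs. standard pre-processing [9].
no_prep = {"AUC": 92.18, "F1": 58.25, "Re": 56.05, "Pr": 60.63}
standard = {"AUC": 95.62, "F1": 63.35, "Re": 64.63, "Pr": 62.12}

gains = {m: round(relative_gain(no_prep[m], standard[m]), 2) for m in no_prep}
```

The resulting values agree with the reported increases for AUC, F1, recall and precision up to rounding in the last digit.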
The LSTM method performs best with the proposed pre-processing, outperforming the LSTM with standard pre-processing by 0.6% and 0.9% in F1 score and precision, respectively. Recall, however, decreases. This means that fewer clean comments are classified as any class of toxic, while more toxic comments are classified as clean. Even though the recall drops, F1 rises, resulting in better overall performance. In the classification of hate speech it is important to have both high recall and high precision, thus the F1 score is the most suitable measure.

6 Conclusion

We presented a novel method for text pre-processing for the task of classification of hate speech in comments taken from Wikipedia. The novelty of our method is removing all punctuation except exclamation marks, as we found in the dataset analysis that they are more commonly present in toxic comments. Furthermore, we detected a correlation between the number of upper-cased words and the end class. The presented method was compared with no pre-processing and with the pre-processing proposed in [9]. Based on detailed experiments, we have identified improvements in the overall performance of the methods based on LSTM and BiLSTM. The precision, mean AUC and F1 are higher with the proposed pre-processing, but the recall drops, resulting in fewer clean comments falsely labeled as toxic but also more toxic comments labeled as clean.

For toxic comment classification to be successful, it should report as few comments that are not actually toxic as possible, but it should also not overlook toxic comments. In the end it depends on the use case, e.g. in some scenarios it would be completely inappropriate to show toxic comments, so the method should be trained to achieve high recall. But in most cases, a person has to look through all the comments that are reported as toxic. In the end it would be most appropriate to have at least two thresholds.
The first one would be a partition between clean and probably toxic comments, and the second one would mark toxic comments. Then, the probably toxic comments are checked manually.

In the future, we would like to investigate manual creation of features that are concatenated with the input to the model. Instead of keeping exclamation marks and upper-cased words in the text, we could count their occurrences and add the counts as additional features. That could be helpful, as we would still remove the noise from the text but also use important features that should not be cast away. Furthermore, we would like to investigate training two classifiers, a binary one for predicting whether a comment is toxic and a multi-class one for classifying the toxicity, to address the imbalance problem of the dataset. This will be the topic of our future work.

References

1. Greevy, E. & Smeaton, A. Classifying racist texts using a support vector machine. The 27th ACM SIGIR Conference 2004, Sheffield, UK (Jan. 2004).
2. Kwok, I. & Wang, Y. Locate the Hate: Detecting Tweets against Blacks in AAAI (2013).
3. Pennington, J., Socher, R. & Manning, C. D. Glove: Global vectors for word representation in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (2014), 1532–1543.
4. Joulin, A., Grave, E., Bojanowski, P. & Mikolov, T. Bag of Tricks for Efficient Text Classification. 2016. arXiv:1607.01759 [cs.CL].
5. Sigurbergsson, G. I. & Derczynski, L. 2019. arXiv:1908.04531 [cs.CL].
6. Wulczyn, E., Thain, N. & Dixon, L. Ex Machina: Personal Attacks Seen at Scale. CoRR abs/1610.08914. arXiv:1610.08914 (2016).
7. Chakrabarty, T., Gupta, K. & Muresan, S. Pay "Attention" to your Context when Classifying Abusive Language. Proceedings of the Third Workshop on Abusive Language Online (2019).
8. Van Aken, B., Risch, J., Krestel, R. & Löser, A. Challenges for Toxic Comment Classification: An In-Depth Error Analysis. CoRR abs/1809.07572. arXiv:1809.07572 (2018).
9.
Naseem, U., Razzak, I. & Eklund, P. A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools and Applications 80, 1–28 (Nov. 2021).
10. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C. & Joulin, A. Advances in Pre-Training Distributed Word Representations. 2017. arXiv:1712.09405 [cs.CL].
11. Toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/.