https://doi.org/10.31449/inf.v47i3.4742    Informatica 47 (2023) 315–326

An Automatic Labeling Method for Subword-Phrase Recognition in Effective Text Classification

Yusuke Kimura 1, Takahiro Komamizu 2 and Kenji Hatano 3
1 Graduate School of Culture and Information Science, Doshisha University, Japan
2 Mathematical and Data Science Center, Nagoya University, Japan
3 Faculty of Culture and Information Science, Doshisha University, Japan
E-mail: usk@acm.org, taka-coma@acm.org, hatano@acm.org

Keywords: text classification, subword-phrase, multi-task learning

Received: March 16, 2023

Deep learning-based text classification methods perform better than traditional ones. In addition to the success of deep learning techniques, multi-task learning (MTL) has become a promising approach for text classification; for instance, an MTL approach to text classification that employs named entity recognition as an auxiliary task has shown that the auxiliary task helps to improve classification performance. Existing MTL-based text classification methods depend on auxiliary tasks that use supervised labels. Obtaining such supervision labels requires additional human and financial costs on top of those for the main text classification task. To reduce these additional costs, we propose an MTL-based text classification framework with automatic supervision label creation, which labels phrases in texts for the auxiliary recognition task. The basic idea underlying the proposed framework is to utilize phrasal expressions consisting of subwords (called subword-phrases). To the best of our knowledge, no text classification approach has been designed on top of subword-phrases, because subwords only sometimes express a coherent set of meanings. The novelty of the proposed framework lies in adding subword-phrase recognition as an auxiliary task and utilizing subword-phrases for text classification. The framework extracts subword-phrases in an unsupervised manner using a statistical approach. To construct labels for an effective subword-phrase recognition task, the extracted subword-phrases are classified based on document classes so that subword-phrases dedicated to particular classes are distinguishable. An experimental evaluation of text classification on five popular datasets showcased the effectiveness of subword-phrase recognition as an auxiliary task. A comparison of various labeling schemes also provided insights into how to label subword-phrases that are common to several document classes.

Povzetek (translated from Slovenian): Deep learning and multi-task learning are applied to text classification, with subword-phrases used for automatic labeling.

1 Introduction

Text classification is a fundamental technology that has been studied for a long time. Applications that use text classification include hate speech detection [7], categorizing daily news articles, and unfair clause detection in terms of service [15]. These text classification applications are achieved by effectively and efficiently retrieving information from large amounts of text [12, 23]. Text classification is a supervised learning task in which labels, such as categories and classes, are manually assigned to documents as classification criteria. A classifier learns the classification criteria in a feature space based on the dataset. Traditionally, text classification uses hand-crafted features such as term frequency-inverse document frequency.
In the recent literature, deep learning-based technologies have achieved significantly improved classification performance. A component that has improved text classification performance in recent years is pre-trained neural language models such as BERT, which have been trained on vast amounts of text. Pre-trained neural language models provide semantically rich features for text; therefore, even a simple multi-layer perceptron-based classifier performs excellently. After the initial success of BERT, many pre-trained models, such as RoBERTa [19] and GPT-3 [5], have been published.

The tokenizers in these pre-trained neural language models typically divide documents into subwords as the smallest unit. Subwords reduce the number of unknown words that are not in the vocabulary, thus preventing the performance of pre-trained neural language models from being degraded by unknown words. Subword-based tokenization effectively handles out-of-vocabulary (OOV) words by decomposing such words into several subwords. Concatenations of these subwords represent OOV words, whereas traditional approaches represent them as unknown tokens. Subword-based tokenization was initially employed for machine translation [29]; it has since been used in various natural language processing tasks, including text classification.

Multi-task learning (MTL) [6, 37, 39], which involves one or more auxiliary tasks alongside the primary task by sharing parameters, is a promising approach to enhance the performance of deep learning models. It has also been applied to text classification [17, 35, 36]. Learning models with auxiliary tasks positively affects the generalization performance of the main task and reduces over-fitting. Early studies on MTL-based text classification [17, 35] focused on methods to combine multiple tasks and on combining tasks across different datasets. Recent studies have combined text classification with auxiliary tasks on the same dataset, such as named entity recognition (NER) [2, 31] or label co-occurrence prediction [36].

The fact that MTL with NER and text classification improves text classification performance suggests that the recognition of clause representations, such as named entities, is suitable as an auxiliary task for MTL-based text classification. However, to realize NER as an auxiliary task for MTL-based text classification, supervised labels for NER are required in addition to those for text classification. Constructing such training datasets is costly because of the additional human cost of NER labeling.

Therefore, in this study, we seek to achieve MTL-based text classification with phrasal expression recognition, which does not require additional human cost to construct a training dataset. Phrasal expressions (or key phrases) for texts have been studied for decades [27, 38]. However, applying keyphrase extraction on top of the subword-based tokenization of popular pre-trained neural language models is not straightforward. Therefore, we define a phrasal expression based on subwords as a subword-phrase and examine its potential usability for MTL-based text classification. In contrast to phrasal expressions based on words, subword-phrases are not necessarily semantically coherent because a vocabulary of subwords is determined statistically [29]. Owing to this limited semantic coherence of subword-phrases, studies have never been conducted on their utilization for text classification.
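As a concrete illustration of the subword tokenization described above, the following minimal snippet shows how a BPE-style tokenizer decomposes words into subword pieces rather than mapping rare words to an unknown token. The use of the roberta-base tokenizer here is an assumption made purely for illustration and is not tied to the proposed method.

```python
# Illustration only: a BPE-style tokenizer splits rare or out-of-vocabulary
# words into several known subword pieces instead of emitting an <unk> token.
# The roberta-base tokenizer is assumed here solely for demonstration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for word in ["classification", "pseudoperiodicity"]:
    print(word, "->", tokenizer.tokenize(word))
# A common word may stay as one (or few) pieces, while a rare word is split
# into several subwords whose concatenation reconstructs the original word.
```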
In this study, we propose a framework for MTL-based text classification with subword-phrase recognition to improve the accuracy of text classification. Our framework comprises unsupervised subword-phrase labeling and MTL-based text classification with the subword-phrase recognition task. Notably, we assume the presence of classification labels for the dataset. To implement our framework, we employ a highly primitive approach, frequency-based subword-phrase labeling, in which frequently co-occurring consecutive subwords are merged to form a subword-phrase; various implementations can be realized using this approach. We also employ the concept of byte-pair encoding [29]. We explore labeling schemes that handle subword-phrases commonly appearing among document classes so as to make the auxiliary task more effective for the text classification task.

The contributions of this study can be summarized as follows: MTL-based text classification with low-cost auxiliary task preparation, the utilization of phrasal expressions over subwords, superior performance over conventional methods, and comparable performance with state-of-the-art methods. The proposed framework comprises an unsupervised labeling module and an MTL-based classification module. Existing MTL-based text classification methods assume the presence of supervision for auxiliary tasks; however, obtaining this supervision requires further human and financial costs. In contrast, the proposed framework does not require these costs, as it utilizes unsupervised subword-phrase extraction to obtain labels for the auxiliary task.

Ours is the first study to utilize subword-phrases. As subwords are not necessarily semantically coherent, their phrasal expressions have not previously been considered for any task. Nevertheless, the co-occurrence of consecutive subwords, that is, subword-phrases, can contribute to the text classification task, since such subwords may distinguish instances of one class from those of others. In an experimental evaluation on five popular text classification datasets, the proposed framework with the subword-phrase recognition auxiliary task demonstrated improved classification performance (micro and macro F-scores) compared with the single-task method. Compared with the state-of-the-art method (BertGCN [14]), the proposed framework also demonstrated superior performance on datasets with more labels and comparable classification performance on the other datasets.

The rest of this paper is organized as follows. Section 2 introduces studies concerning MTL-based text classification. Section 3 explains the proposed framework of MTL-based text classification with the subword-phrase recognition task. Section 4 then presents the experimental evaluation, which demonstrates the effectiveness of the proposed framework compared with the single-task text classification baseline as well as other recent methods; it also discusses the effect of subword-phrases. Finally, Section 5 concludes this paper.

2 Related work

This section introduces the literature related to MTL-based text classification. MTL-based text classification methods are categorized into the following three types based on the relationships between the main and auxiliary tasks [35]: Multi-Cardinality, Multi-Domain, and Multi-Objective.
Multi-Cardinality means that the main and auxiliary tasks come from different datasets in the same domain; these tasks also differ in cardinality, meaning that they vary in their text lengths and numbers of classes, among other properties. Multi-Domain means that the main and auxiliary tasks are similar, but their domains differ. For example, Liu et al. [16] and Zhang et al. [35] examined MTL-based movie review classification with classification tasks over reviews of various products, such as books and DVDs [4]. Multi-Objective means that the main and auxiliary tasks have different objectives. For example, Liu et al. [18] combined query classification and search result ranking using an MTL approach, and Zhang et al. [35] attempted MTL-based movie review classification (IMDB [21]) with news article classification (RN [1]) and question type classification (QC [13]) as auxiliary tasks.

In addition, MTL approaches [3, 30, 33, 40] in which the main and auxiliary tasks are on the same dataset have exhibited their effectiveness. Bi et al. [3] improved the performance of news recommendation by using MTL, which combines the news recommendation task with news article classification and named entity recognition. The MTL-based medical query intent classification model proposed by Tohti et al. [30] was trained together with named entity recognition and consequently showed superior classification performance. Similarly, Yang et al. [33] and Zhao et al. [40] made comparable observations for polarity classification combined with an aspect term extraction task. For emotion prediction, Li et al. [11] dealt with the emotion-cause pair extraction task using an MTL-based approach combined with emotion clause extraction and cause clause extraction. Similarly, Qi et al. [24] proposed an MTL-based aspect sentiment classification method in which the auxiliary task was aspect term extraction; they also demonstrated its effectiveness. In addition to text classification tasks, MTL-based approaches to image classification tasks have also shown their effectiveness [9, 32].

MTL-based text classification that utilizes the relationships between labels in the same dataset has also been proposed to solve the multi-label classification problem, where a single text can be classified into multiple labels [36]. Zhang et al. [36] showed improved classification performance by designing an auxiliary task to learn the relationships between labels.

These studies have shown the effectiveness of combining multiple supervised learning tasks. However, creating supervised data is generally expensive in terms of human and financial costs; thus, lower-cost solutions for designing auxiliary tasks are desirable.

Self-supervised learning (SSL) is a training approach that learns from data without supervised labels. It first hides pieces of the data and then trains the model to estimate the hidden pieces. The masked language model (MLM) is a popular SSL task in the natural language processing domain [8]. A popular pre-trained neural language model, BERT [8], is trained with two SSL tasks: MLM and next sentence prediction. In the image processing domain, DALL-E [25] showcased the significant performance of SSL, where an area of an image was erased and DALL-E was trained to estimate the erased area.
The increasing attention paid to these models indicates the usefulness of SSL for data understanding and representation learning. In contrast to data understanding, text classification is a supervised learning task. In other words, SSL expects models to reconstruct broken pieces of data, whereas supervised learning expects models to learn dedicated criteria from supervision. Therefore, task settings in SSL are not easily imported into MTL-based text classification.

The proposed framework in this study focuses on creating datasets for auxiliary tasks with no supervision, significantly reducing human effort and financial costs. To our knowledge, no research has aimed to design auxiliary tasks for MTL-based text classification with no supervision. In addition, as subwords are not necessarily semantically coherent, subword-phrases have not been considered for any task. Therefore, this study proposes a methodology of MTL-based text classification that is novel in two aspects: (1) low-cost auxiliary task design and (2) the introduction of subword-phrases. The experimental evaluation in this study reveals promising results for both aspects.

3 Proposed framework

This section explains our framework for MTL-based text classification, which generates subword-phrase labels for the auxiliary task in an unsupervised manner.

3.1 Framework overview

Figure 1 illustrates our framework. It consists of two phases: unsupervised labeling and MTL-based text classification. The basic approach underlying the framework is that subword-phrase recognition is added as an auxiliary task for MTL-based text classification. To realize the recognition task, unsupervised subword-phrase extraction is employed to create pseudo-supervision. A text classifier based on the framework is trained using the following steps:

1. Input: the text classifier receives a training set of texts with classification labels;
2. Tokenization: each text is tokenized into subwords using a subword-based tokenizer;
3. Labeling (Phase 1): the unsupervised labeling module appends subword-phrase labels to each text in the training set for the auxiliary subword-phrase recognition task;
4. Training (Phase 2): the text classifier is trained in an MTL manner, together with the auxiliary subword-phrase recognition task based on the appended labels.

Formally, a training set is denoted as D = \{(T_i, y_i) \mid 1 \le i \le N\}, where T_i represents the sequence of subword tokens of the i-th text, y_i represents the class label corresponding to the i-th text, and N is the number of texts. In the first phase, the unsupervised subword-phrase labeling module receives D and performs subword-phrase extraction on the subword token sequences to create another training set D^{aug} = \{(T_i, Y^{aug}_i) \mid 1 \le i \le N\} for the auxiliary task, where Y^{aug}_i is the corresponding sequence of labels for the tokens in T_i. In the second phase, D and D^{aug} are passed to an MTL-based text classification module based on a pre-trained neural language model, which trains the text classification model jointly with the subword-phrase recognition model.

Figure 1: Our MTL-based text classification framework. The framework accepts text with text classification labels and trains an MTL-based text classification model. It consists of two phases: the first phase is unsupervised labeling of the input text, and the second phase is training of the MTL-based text classification model using the text classification labels and the labels from the first phase. (The figure depicts a subword-based tokenizer feeding Phase 1, unsupervised subword-phrase extraction, and Phase 2, a pre-trained neural language model shared by the main text classification task and the auxiliary subword-phrase recognition task.)
3.2 Unsupervised subword-phrase labeling

Unsupervised subword-phrase labeling provides a label sequence that corresponds to the input token sequence. This unsupervised labeling task is formalized as follows:

- Given: a sequence of subword tokens T along with a class label y, (T, y) \in D
- Generate: a sequence of labels Y^{aug} whose length is exactly the same as that of T

The labeling scheme is inspired by NER tasks that employ the inside-outside-beginning (IOB2) tagging scheme [26]. IOB2 tagging is a labeling scheme in which the first token of a phrase is tagged with B (beginning), the intermediate tokens of a phrase are tagged with I (inside), and tokens outside the phrase are tagged with O (outside). Besides these tags, semantic types are appended to distinguish types of phrases; for example, B-PERSON and I-PERSON represent the beginning and intermediate tokens, respectively, of a token sequence corresponding to a person's name.

A straightforward labeling scheme for subword-phrase labeling is to treat all phrases equally. In other words, the semantic type is set to Phrase. Formally, when an n-length sequence of tokens S = (s_1, s_2, \dots, s_n) contains a phrase that is an m-token sub-sequence P = (s_k, s_{k+1}, \dots, s_{k+m-1}) of S, where m \le n, s_k is labeled as B-Phrase, the remaining tokens from s_{k+1} to s_{k+m-1} are labeled as I-Phrase, and all other tokens s_i \in S \setminus P are labeled as O.

This approach is so straightforward that subword-phrases appearing in different document classes are treated equally. However, to provide cues to the main text classification model, subword-phrases that depend on document classes should be distinguishable. A simple classification-specific labeling scheme assigns different labels to subword-phrases appearing in different classes: when a subword-phrase P = (s_k, s_{k+1}, \dots, s_{k+m-1}) is a sequence of tokens of a text belonging to class y, s_k is labeled as B-y and the remaining tokens from s_{k+1} to s_{k+m-1} are labeled as I-y. However, subword-phrases commonly appearing in different classes cannot be handled by this scheme. To handle such common subword-phrases, we propose three labeling schemes, namely, Disregard, Common-Label, and Bit-Label. For comparison, the aforementioned straightforward labeling scheme is called All-Phrase. The Disregard scheme simply ignores common subword-phrases; in other words, they are labeled with O tags. In the Common-Label scheme, a special class label \emptyset is used as a special semantic type in the IOB2 scheme; specifically, the common subword-phrase P is labeled as B-\emptyset for s_k and I-\emptyset for the other tokens. The Bit-Label scheme is a bit-encoding-based labeling scheme that encodes the set of classes in which a common subword-phrase appears, while still inheriting the IOB2 labeling scheme. For example, suppose that the number of classes is d = 4 and a subword-phrase P = (s_k, s_{k+1}, \dots, s_{k+m-1}) appears in texts of the first and third classes; then s_k is labeled as B-1010, and the remaining tokens from s_{k+1} to s_{k+m-1} are labeled as I-1010.
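The sketch below illustrates how the four labeling schemes could assign IOB2 tags to subword tokens. It is a simplified reading of the schemes described above, not the authors' implementation: phrase matching is exact sub-sequence search, single-class phrases always receive the class-specific B-y/I-y tags, and the label name CMN is only a placeholder for the special label \emptyset.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

def iob2_labels(tokens: Sequence[str],
                phrase_classes: Dict[Tuple[str, ...], FrozenSet[int]],
                scheme: str,
                num_classes: int) -> List[str]:
    """Return one IOB2 label per subword token under the given scheme."""
    labels = ["O"] * len(tokens)
    for phrase, classes in phrase_classes.items():
        # choose the semantic type appended to the B-/I- tags of this phrase
        if scheme == "All-Phrase":
            tag = "Phrase"                      # every phrase treated equally
        elif len(classes) == 1:
            tag = str(next(iter(classes)))      # class-specific phrase: B-y / I-y
        elif scheme == "Disregard":
            continue                            # common phrases stay labeled O
        elif scheme == "Common-Label":
            tag = "CMN"                         # placeholder for the special label
        else:  # "Bit-Label": bit vector over classes, e.g. 1010 for classes {0, 2}
            tag = "".join("1" if c in classes else "0" for c in range(num_classes))
        m = len(phrase)
        for k in range(len(tokens) - m + 1):
            if tuple(tokens[k:k + m]) == phrase:
                labels[k] = f"B-{tag}"
                labels[k + 1:k + m] = [f"I-{tag}"] * (m - 1)
    return labels

# Toy example with d = 4 classes; ("sub", "word") is common to classes 0 and 2.
tokens = ["the", "sub", "word", "model"]
phrases = {("sub", "word"): frozenset({0, 2})}
print(iob2_labels(tokens, phrases, "Bit-Label", 4))   # ['O', 'B-1010', 'I-1010', 'O']
```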
3.3 MTL-based text classification

Our framework uses a text classification model based on MTL and a pre-trained neural language model (NLM). In this model, the NLM performs token encoding, and classification modules for the main and auxiliary tasks are appended on top of the encoding. The NLM is therefore the part shared among tasks and is trained in an MTL manner. Each classification module is designed as a fully connected layer followed by a softmax non-linear layer.

For the main task (i.e., text classification), a representation h^{cls} for a given input token sequence is obtained from the NLM. It is passed to a fully connected layer followed by a softmax layer to predict the class distribution \hat{y}^{cls}. Formally, \hat{y}^{cls} for h^{cls} is calculated by the following equation:

    \hat{y}^{cls} = \mathrm{softmax}(W_{cls}^{\top} \cdot h^{cls} + b_{cls}),    (1)

where W_{cls} and b_{cls} denote the parameter matrix and bias, respectively, for the text classification task.

For the auxiliary task (i.e., subword-phrase recognition), a representation h^{spr}_j for the j-th token of a given input sequence is obtained from the NLM. It is passed to a fully connected layer followed by a softmax layer to predict the token label distribution \hat{y}^{spr}_j. Formally, \hat{y}^{spr}_j for h^{spr}_j is calculated by the following equation:

    \hat{y}^{spr}_j = \mathrm{softmax}(W_{spr}^{\top} \cdot h^{spr}_j + b_{spr}),    (2)

where W_{spr} and b_{spr} denote the parameter matrix and bias, respectively, for the subword-phrase recognition task.

Both the main and auxiliary tasks are multi-class classification tasks; therefore, using the cross-entropy loss as the loss function is straightforward. The following equation calculates the loss L_{cls} for the text classification task:

    L_{cls} = -\sum_{i=1}^{N} \sum_{c \in C} y_{i,c} \log \hat{y}^{cls}_{i,c},    (3)

where N is the number of training texts, C denotes the set of classes, y_{i,c} \in \{0, 1\} denotes the true label for the i-th text (y_{i,c} = 1 if the true label of the text is c and 0 otherwise), and \hat{y}^{cls}_{i,c} denotes the predicted probability of class c for the text.

Similarly, the following equation calculates the loss L_{spr} for the subword-phrase recognition task:

    L_{spr} = -\sum_{i=1}^{N} \sum_{j=1}^{M_i} \sum_{c \in C'} y_{i,j,c} \log \hat{y}^{spr}_{i,j,c},    (4)

where N denotes the number of training texts, M_i denotes the number of tokens in the i-th text, C' denotes the set of token label classes for the recognition task, y_{i,j,c} \in \{0, 1\} denotes the true label for the j-th token of the i-th text (y_{i,j,c} = 1 if the true label of the token is c and 0 otherwise), and \hat{y}^{spr}_{i,j,c} denotes the predicted probability of class c for that token.

To train both tasks simultaneously, feedback from both tasks is fed back to the NLM to fine-tune its parameters. Therefore, the joint loss L_{joint} over these tasks is calculated by the following equation and used for parameter optimization:

    L_{joint} = L_{cls} + L_{spr}.    (5)

We note that weighting schemes for MTL approaches, which reflect the importance of individual tasks, have been studied [22, 28]. Although incorporating such a weighting scheme into our framework is promising, the purpose of this study is to show the capability of MTL-based text classification in conjunction with subword-phrase recognition, whose labels for the auxiliary task are created in an unsupervised manner. Employing a weighting scheme in our framework is therefore left for future studies.
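To make the formulation concrete, the following PyTorch sketch implements the two task heads over a shared encoder and the joint loss of Eqs. (1)-(5). The choice of a Hugging Face RoBERTa encoder, the use of the first-token representation as h^{cls}, and the padding-label convention are assumptions for illustration; this is not the authors' released implementation.

```python
import torch.nn as nn
from transformers import AutoModel

class MTLTextClassifier(nn.Module):
    """Shared NLM encoder with a text-classification head and a token-labeling head."""
    def __init__(self, num_doc_classes, num_token_labels, model_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # shared across tasks
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_doc_classes)     # Eq. (1)
        self.spr_head = nn.Linear(hidden, num_token_labels)    # Eq. (2)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state  # (B, L, H)
        logits_cls = self.cls_head(h[:, 0])    # sentence-level representation h_cls
        logits_spr = self.spr_head(h)          # one label distribution per token
        return logits_cls, logits_spr

def joint_loss(logits_cls, logits_spr, y_cls, y_spr, ignore_id=-100):
    """Cross-entropy losses of Eqs. (3) and (4), summed as in Eq. (5)."""
    ce = nn.CrossEntropyLoss(ignore_index=ignore_id)
    loss_cls = ce(logits_cls, y_cls)                               # L_cls
    loss_spr = ce(logits_spr.flatten(0, 1), y_spr.flatten())       # L_spr
    return loss_cls + loss_spr                                     # L_joint
```

Cross-entropy over logits is used here because it already includes the softmax of Eqs. (1) and (2); padded token positions can be masked out of the auxiliary loss with the ignore index.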
4 Experimental evaluation

To evaluate the proposed framework, we conducted an experimental evaluation to answer the following questions: (1) Do our MTL-based text classification methods, which create auxiliary tasks in an unsupervised manner, improve classification performance compared with single-task text classification methods? (2) Can our MTL-based text classification outperform state-of-the-art (SOTA) text classification methods? (3) Does the subword-phrase technique contribute to text classification? (4) Is there a best labeling scheme for subword-phrase recognition with respect to common subword-phrases?

The rest of this section is organized as follows: Section 4.1 introduces the implementation of the proposed framework; Section 4.2 explains the SOTA text classification method used for comparison; Section 4.3 describes the experimental settings; Section 4.4 showcases the experimental results; and Section 4.5 presents remarks on the experiments by answering the questions mentioned above.

4.1 Implementation of the proposed framework

In this experiment, we implemented a simple frequency-based subword-phrase extraction method; the labeling scheme used for the extracted subword-phrases was the classification-specific labeling scheme. The frequency-based method expects frequently co-occurring subwords to compose the regular textual expressions of each class. To control the number of subword-phrases, we utilized the byte-pair encoding (BPE) algorithm [29]. The BPE algorithm concatenates consecutive tokens if they frequently co-occur in a corpus and repeats this concatenation until the number of unique tokens reaches the expected number. The ability to control the number of subword-phrases was suitable for this experiment because the subword-phrase is newly proposed in this study; therefore, we needed to run variations of the evaluation, which were realized by creating different numbers of subword-phrases.

In general, the number of texts is skewed among classes; a particular class may contain a very large number of texts while other classes contain very few. This affects the extraction of subword-phrases; therefore, in this experiment, the extraction described above was applied to the set of texts of each class separately. Specifically, we extracted n subword-phrases for each class, where n was chosen from {10, 100, 1000, 10000} to achieve the best classification performance on the validation data.
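The following sketch shows one way to realize the frequency-based, BPE-style extraction described above: the most frequently co-occurring pair of adjacent subwords within the texts of one class is merged repeatedly until n subword-phrases have been produced. Tie-breaking, minimum frequency thresholds, and other details are illustrative assumptions rather than the paper's exact procedure.

```python
from collections import Counter

def extract_subword_phrases(token_seqs, n_phrases):
    """BPE-style merging of adjacent subwords within one document class.

    token_seqs: list of subword-token lists belonging to a single class.
    Returns up to n_phrases subword-phrases as tuples of subword tokens.
    """
    seqs = [list(seq) for seq in token_seqs]
    phrases = []
    while len(phrases) < n_phrases:
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))      # adjacent co-occurrences
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]      # most frequent pair
        merged = f"{a} {b}"
        phrases.append(tuple(merged.split()))          # full subword sequence so far
        for seq in seqs:                               # merge occurrences in place
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [merged]
                i += 1
    return phrases

# Per Section 4.1, this would be run once per class, with n chosen from
# {10, 100, 1000, 10000} on the validation data.
```

Because merged tokens can themselves be merged again in later iterations, phrases longer than two subwords emerge naturally, mirroring how BPE builds its vocabulary.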
4.2 Comparison method: BertGCN

BertGCN [14] is a SOTA method for text classification that combines a pre-trained NLM with transductive learning over graph neural networks (GNNs). BertGCN follows TextGCN [34] in constructing a graph of the co-occurrence relations between texts and words and between words and words. In BertGCN, the vertex vectors are initialized using the pre-trained NLM. These vectors are updated through a graph convolutional network (GCN) to incorporate the co-occurrence relationships between texts and words. Based on the updated vectors, BertGCN performs text classification by adding a fully connected layer followed by a softmax layer. In addition, [14] reported that interpolating the output of the NLM-based classification model with that of BertGCN can improve classification performance; specifically, the final prediction is the linear combination of the predicted class distributions Z_{GCN} and Z_{NLM}, which are obtained from BertGCN and the classifier using the NLM, respectively, as in the following equation:

    Z = \lambda \cdot Z_{GCN} + (1 - \lambda) \cdot Z_{NLM},    (6)

where \lambda \in [0, 1] denotes the weight of the BertGCN prediction. This experiment used \lambda = 0.7, which [14] reported to be the optimal value. BertGCN can use any pre-trained NLM, and [14] reported that RoBERTa showed the best performance. Therefore, RoBERTa was also used in implementing the proposed framework to make the comparison as fair as possible.

4.3 Settings

Datasets: The following five popular text classification datasets were used for the evaluation: Movie Review (MR), 20 Newsgroups (20NG), R8, R52, and Ohsumed (OHS). MR is a dataset of movie reviews categorized into binary sentiment classes (i.e., positive and negative). 20NG is a dataset of news texts categorized into 20 categories. R8 is a dataset of news articles from Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/, visited on Aug. 4, 2022) limited to eight selected classes. R52 is a dataset of news articles from Reuters-21578 limited to 52 selected categories. OHS is a dataset of medical abstracts categorized into 23 medical concepts called MeSH categories. The statistics of the datasets are shown in Table 1. As the table shows, the datasets differ in the number of classes and in the variation of the number of instances per class (the standard deviation (Std.) of the number of instances within a class). These datasets were expected to reveal the advantages and disadvantages of the proposed method.

Table 1: Statistics of the datasets: the number of instances in the train-valid-test splits, the number of classes, and the average (Avg.) and standard deviation (Std.) of the number of instances across classes.

                          MR      20NG     R8      R52     OHS
  #Train                 6,398   10,183   4,937   5,879   3,022
  #Valid                   710    1,131     548     653     335
  #Test                  3,554    7,532   2,189   2,568   4,043
  #Class                     2       20       8      52      23
  Avg. #Instances/Class  5,331      942     959     175     321
  Std. #Instances/Class      0       94   1,309     613     305

Metrics: The evaluation metric is the F-score, which is the harmonic mean of the precision and recall scores:

    Pre = \frac{TP}{TP + FP},    (7)

    Rec = \frac{TP}{TP + FN},    (8)

    F = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}.    (9)

The precision Pre is the ratio of the number of true positives (TP) to the number of instances estimated as positive (i.e., TP + FP, where FP is the number of false positives). The recall Rec is the ratio of TP to the number of positive instances in the evaluation set (i.e., TP + FN, where FN is the number of false negatives). To observe different aspects of the evaluation, both the micro and macro averages of the F-scores were used in this experiment. The micro average F_{micro} is the instance-level average of the F-score, and the macro average F_{macro} is the class-level average of the F-scores. When the numbers of instances of different classes are highly skewed (the class imbalance problem), F_{micro} is not suitable for evaluating classification performance, because the larger the number of instances of a class, the more it affects this metric; in other words, the classification performance on instances of minority classes is underestimated. In contrast, the F_{macro} metric is insensitive to this skewness, as the F-scores of different classes are treated independently and averaged.
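For completeness, the snippet below computes the micro- and macro-averaged F-scores of Eqs. (7)-(9) for a single-label multi-class setting; scikit-learn's f1_score with average='micro' or average='macro' provides the same aggregation, and the explicit version is shown only to make the two averages concrete.

```python
from collections import Counter

def micro_macro_f(y_true, y_pred, classes):
    """Micro- and macro-averaged F-scores per Eqs. (7)-(9)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1      # predicted p, but the true class was t
            fn[t] += 1

    def f_score(TP, FP, FN):
        pre = TP / (TP + FP) if TP + FP else 0.0
        rec = TP / (TP + FN) if TP + FN else 0.0
        return 2 * pre * rec / (pre + rec) if pre + rec else 0.0

    macro = sum(f_score(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f_score(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

print(micro_macro_f(["a", "a", "b", "c"], ["a", "b", "b", "b"], ["a", "b", "c"]))
```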
Parameters: For the base model in both the proposed method and BertGCN, we employed the RoBERTa-base model [19], available at Huggingface (https://huggingface.co/roberta-base). BertGCN with the RoBERTa model is referred to as RoBERTaGCN in this experiment. In this study, the effect of common subword-phrases was also evaluated; therefore, the proposed method had two variations: one including common subword-phrases (denoted as Proposed w/ cmn) and the other excluding them (denoted as Proposed w/o cmn). In addition, as a baseline method, we also employed a single-task text classification method based on RoBERTa. The baseline method was implemented by adding a fully connected layer and a softmax layer on top of RoBERTa, which is equivalent to Eq. (1) with the loss function shown in Eq. (3). The only difference between the proposed and baseline methods is the number of tasks on top of RoBERTa; therefore, the comparison between them was expected to reveal the effectiveness of MTL-based text classification. These models were optimized using the AdamW optimizer (the Adam optimizer [10] with decoupled weight decay regularization) [20]. Experiments were conducted with 100 epochs, a batch size of 64, and a maximum token length of 256. Only the experiment for RoBERTaGCN was conducted with a batch size of 128 and a maximum token length of 128, which yielded better results than the aforementioned hyperparameters.

4.4 Results

Table 2 shows the experimental results for F_{micro} (Table 2(a)) and F_{macro} (Table 2(b)) and showcases the following three observations. (1) The proposed method performed better than the baseline method in both metrics, except for the simple binary classification on the MR dataset. (2) The proposed method outperformed RoBERTaGCN on three of the five datasets in terms of the F_{micro} metric and on four of the five datasets in terms of the F_{macro} metric. (3) In terms of labeling schemes, the Bit-Label and Disregard approaches performed better than the other schemes in terms of the F_{macro} metric.

Table 2: Evaluation results. The best score in each column (i.e., dataset) is bold-faced. RoBERTaGCN is the SOTA text classification method, and Baseline is the single-task text classification based on the RoBERTa model. The proposed method has two variations: one, denoted as Proposed w/ cmn, includes common subword-phrases in the labeling scheme, and the other, denoted as Proposed w/o cmn, excludes them. (a) and (b) show the results for F_{micro} and F_{macro}, respectively.
(a) F_{micro}

  Model                     MR      20NG    R8      R52     OHS
  RoBERTaGCN                0.880   0.894   0.979   0.944   0.736
  Baseline (RoBERTa)        0.881   0.831   0.977   0.962   0.690
  Proposed - All-Phrase     0.888   0.838   0.979   0.967   0.705
  Proposed - Common-Label   0.860   0.850   0.978   0.967   0.704
  Proposed - Bit-Label      0.882   0.846   0.979   0.968   0.711
  Proposed - Disregard      0.866   0.851   0.979   0.969   0.711

(b) F_{macro}

  Model                     MR      20NG    R8      R52     OHS
  RoBERTaGCN                0.880   0.861   0.925   0.756   0.605
  Baseline (RoBERTa)        0.881   0.825   0.943   0.836   0.594
  Proposed - All-Phrase     0.888   0.832   0.948   0.842   0.622
  Proposed - Common-Label   0.860   0.845   0.947   0.841   0.610
  Proposed - Bit-Label      0.882   0.840   0.953   0.866   0.636
  Proposed - Disregard      0.866   0.845   0.955   0.851   0.637

The comparison between the proposed method and the baseline method in both F_{micro} and F_{macro} revealed the effectiveness of the MTL-based approach, in which the auxiliary task is constructed systematically. In addition to insights from the existing literature that MTL-based approaches using supervised auxiliary tasks are effective, this experiment showcased the effectiveness of an MTL approach in which the training data for the auxiliary task are generated in an unsupervised manner. The results show that low-cost auxiliary tasks for MTL-based text classification already demonstrate promising performance.

While the results on the MR and R8 datasets showed comparable performance between the proposed and baseline methods, these datasets are composed of smaller numbers of classes. These results suggest that the proposed method is less effective when the number of classes is small.

A notable finding is that the proposed method achieved significantly better performance than RoBERTaGCN in terms of F_{macro} on the R8, R52, and OHS datasets. Simultaneously, the proposed method was also more accurate than RoBERTaGCN in terms of F_{micro}. These facts indicate that the proposed method achieved state-of-the-art classification performance on these datasets. Recalling the statistics of these datasets from Table 1, the numbers of classes in the R8, R52, and OHS datasets are larger than those of the other datasets, and the number of instances per class is highly skewed; this indicates that the proposed method is well suited to highly skewed datasets. Although the 20NG dataset has a similar number of classes to the OHS dataset and is less skewed than the OHS dataset, the performance of the proposed method in terms of F_{micro} and F_{macro} was worse than that of RoBERTaGCN. Consequently, the proposed method performed better than the SOTA method when the dataset contained a large number of classes and was highly skewed in the number of instances across classes.
The comparison among the variations of the proposed method, in terms of the labeling schemes for subword-phrases commonly appearing in several document classes, showed that the different schemes achieved similar performance, each with pros and cons on different datasets. The All-Phrase scheme labels all phrases with the IOB2 tagging scheme regardless of document classes. Compared with the other schemes, which take document classes into account, its performance was inferior. This indicates that class-specific labeling (the Common-Label, Bit-Label, and Disregard schemes) is effective, except on the MR dataset, a binary classification dataset whose subword-phrases are hardly class-specific. Comparing the handling of common subword-phrases among the Common-Label, Bit-Label, and Disregard schemes, their classification performances were comparable, and the Disregard scheme performed relatively better.

To show the difficulty of the subword-phrase recognition task under the different labeling schemes, Table 3 displays the F-scores of the auxiliary tasks. In general, the number of classes in a sequence labeling problem is related to its difficulty; thus, the All-Phrase scheme was expected to be the easiest and the Bit-Label scheme the most difficult. As the table shows, the F-scores of the All-Phrase scheme are the highest among these schemes, confirming that it is the easiest sequence labeling problem. In contrast, the F-scores of the other schemes were lower, but still high enough to aid the generalization performance of the main text classification model.

Table 3: Evaluation results: accuracy of the auxiliary tasks.

(a) F_{micro}

  Model                     MR      20NG    R8      R52     OHS
  Proposed - All-Phrase     0.971   0.975   0.978   0.998   0.971
  Proposed - Common-Label   0.922   0.968   0.974   0.972   0.978
  Proposed - Bit-Label      0.918   0.974   0.965   0.975   0.977
  Proposed - Disregard      0.922   0.851   0.962   0.975   0.978

(b) F_{macro}

  Model                     MR      20NG    R8      R52     OHS
  Proposed - All-Phrase     0.960   0.975   0.945   0.796   0.953
  Proposed - Common-Label   0.761   0.889   0.869   0.853   0.725
  Proposed - Bit-Label      0.756   0.852   0.764   0.864   0.762
  Proposed - Disregard      0.761   0.845   0.731   0.847   0.725

4.5 Remarks

This section summarizes the findings of our experiment by answering the questions raised above and introduces the limitations of the proposed method.

(1) The proposed method outperformed the baseline method when the number of classes of a dataset was large and was comparable when the number was small. However, the datasets with few classes were also less skewed in the number of instances per class. Therefore, frequency-based subword-phrase extraction for constructing auxiliary tasks is suitable when a dataset has many classes and the number of instances per class is skewed. A promising outcome is that an auxiliary recognition task whose (pseudo) supervision is generated in an unsupervised manner is effective in MTL-based classification. This outcome opens up new possibilities for constructing auxiliary tasks for MTL-based classification methods on tasks other than text classification.

(2) The proposed method was superior to the SOTA method, RoBERTaGCN, on the R52 and OHS datasets, which contain many classes and in which the number of instances per class is skewed. A promising direction for overcoming the inferiority of the proposed method on the other datasets is to utilize RoBERTaGCN as the base model of the proposed framework.

(3) The subword-phrase recognition task as an auxiliary task improves text classification on various datasets. A promising outcome is the use of phrasal expressions over subwords, which deserves more attention in the literature.

(4) To handle subword-phrases common to several document classes, the Bit-Label scheme, which encodes the class membership of a subword-phrase in a bit sequence that can represent all combinations of appearing classes, and the Disregard scheme, which ignores common subword-phrases, were the best.
The larger the number of classes (e.g., R52), the better the classification performance of the Bit-Label scheme. Conversely, the smaller the number of classes (e.g., R8 and OHS), the better the performance of the Disregard scheme.

Consequently, when the number of classes is large and the number of instances per document class is skewed, MTL-based text classification suffers from the class imbalance problem, which is still an open problem in the general text classification domain. This study shows some promising results by using a subword-phrase recognition task whose labels are obtained in an unsupervised manner; at the same time, the classification performance still leaves much to be desired. Therefore, future studies should seek more effective auxiliary tasks to deal with the class imbalance problem.

5 Conclusion

We proposed an MTL-based text classification framework that uses auxiliary tasks with lower human and financial costs by creating auxiliary task labels in an unsupervised manner. We also sought to ascertain the potential of phrasal expressions of subwords, called subword-phrases, to better utilize subword-based pre-trained neural language models. As an implementation of our framework, we extracted subword-phrases in terms of their frequency of occurrence and labeled them in documents in three different ways. Our experimental evaluation of text classification on five popular datasets highlighted the effectiveness of subword-phrase recognition as an auxiliary task. It also showed results comparable with those of RoBERTaGCN, the state-of-the-art method.

The main conclusions of this paper are as follows: an auxiliary recognition task whose pseudo-supervision is generated in an unsupervised manner is effective in MTL-based classification and opens up the possibility of constructing auxiliary tasks for MTL-based classification methods for classification tasks other than text classification; and phrasal expressions over subwords (subword-phrases) can be helpful in text classification.

Acknowledgment

This work was partly supported by the Grants-in-Aid for Academic Promotion, Graduate School of Culture and Information Science, Doshisha University, JSPS KAKENHI Grant Numbers 19H01138, 19H04218, and 21H03555, and JST, the establishment of university fellowships towards the creation of science and technology innovation, Grant Number JPMJFS2145.

References

[1] C. Apté, F. Damerau, and S. M. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12(3):233–251, 1994.

[2] A. Benayas, R. Hashempour, D. Rumble, S. Jameel, and R. C. De Amorim. Unified Transformer Multi-Task Learning for Intent Classification With Entity Recognition. IEEE Access, 9:147306–147314, 2021.

[3] Q. Bi, J. Li, L. Shang, X. Jiang, Q. Liu, and H. Yang. MTRec: Multi-Task Learning over BERT for News Recommendation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2663–2669, May 2022.

[4] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447, 2007.

[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020.
[6] R. Caruana. Multitask Learning. Machine Learning, 28(1):41–75, 1997.

[7] O. de Gibert, N. Pérez, A. G. Pablos, and M. Cuadros. Hate Speech Dataset from a White Supremacy Forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 11–20, 2018.

[8] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[9] S. Graham, Q. D. Vu, M. Jahanifar, S. Raza, F. A. Afsar, D. R. J. Snead, and N. M. Rajpoot. One model is all you need: Multi-task learning enables simultaneous histology image segmentation and classification. Medical Image Analysis, 83:102685, 2023.

[10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.

[11] C. Li, J. Hu, T. Li, S. Du, and F. Teng. An effective multi-task learning model for end-to-end emotion-cause pair extraction. Applied Intelligence, 53(3):3519–3529, 2023.

[12] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, and L. He. A Survey on Text Classification: From Traditional to Deep Learning. ACM Transactions on Intelligent Systems and Technology, 13(2):31:1–31:41, 2022.

[13] X. Li and D. Roth. Learning Question Classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.

[14] Y. Lin, Y. Meng, X. Sun, Q. Han, K. Kuang, J. Li, and F. Wu. BertGCN: Transductive Text Classification by Combining GNN and BERT. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1456–1462, Online, Aug. 2021.

[15] M. Lippi, P. Palka, G. Contissa, F. Lagioia, H. Micklitz, G. Sartor, and P. Torroni. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law, 27(2):117–139, 2019.

[16] P. Liu, X. Qiu, and X. Huang. Deep Multi-Task Learning with Shared Memory for Text Classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 118–127, 2016.

[17] P. Liu, X. Qiu, and X. Huang. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2873–2879, 2016.

[18] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Wang. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921, 2015.

[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692, 2019.
[20] I. Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.

[21] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150. Association for Computational Linguistics, 2011.

[22] Y. Mao, Z. Wang, W. Liu, X. Lin, and P. Xie. MetaWeighting: Learning to Weight Tasks in Multi-Task Learning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3436–3448, 2022.

[23] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao. Deep Learning-based Text Classification: A Comprehensive Review. ACM Computing Surveys, 54(3):62:1–62:40, 2021.

[24] R. Qi, M. Yang, Y. Jian, Z. Li, and H. Chen. A Local context focus learning model for joint multi-task using syntactic dependency relative distance. Applied Intelligence, 53(4):4145–4161, 2023.

[25] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, pages 8821–8831, 2021.

[26] L. Ramshaw and M. Marcus. Text Chunking using Transformation-Based Learning. In Third Workshop on Very Large Corpora, 1995.

[27] F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, March 2002.

[28] O. Sener and V. Koltun. Multi-Task Learning as Multi-Objective Optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 525–536, 2018.

[29] R. Sennrich, B. Haddow, and A. Birch. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, Aug. 2016. Association for Computational Linguistics.

[30] T. Tohti, M. Abdurxit, and A. Hamdulla. Medical QA Oriented Multi-Task Learning Model for Question Intent Classification and Named Entity Recognition. Information, 13(12):581, 2022.

[31] C. Wu, G. Luo, C. Guo, Y. Ren, A. Zheng, and C. Yang. An attention-based multi-task model for named entity recognition and intent analysis of Chinese online medical questions. Journal of Biomedical Informatics, 108:103511, 2020.

[32] M. Xu, K. Huang, and X. Qi. A Regional-Attentive Multi-Task Learning Framework for Breast Ultrasound Image Segmentation and Classification. IEEE Access, 11:5377–5392, 2023.

[33] H. Yang, B. Zeng, J. Yang, Y. Song, and R. Xu. A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. Neurocomputing, 419:344–356, 2021.

[34] L. Yao, C. Mao, and Y. Luo. Graph Convolutional Networks for Text Classification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, pages 7370–7377, 2019.
[35] H. Zhang, L. Xiao, Y. Wang, and Y. Jin. A Generalized Recurrent Neural Architecture for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 3385–3391, 2017.

[36] X. Zhang, Q. Zhang, Z. Yan, R. Liu, and Y. Cao. Enhancing Label Correlation Feedback in Multi-Label Text Classification via Multi-Task Learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1190–1200. Association for Computational Linguistics, 2021.

[37] Y. Zhang and Q. Yang. A Survey on Multi-Task Learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2022.

[38] Y. Zhang, N. Zincir-Heywood, and E. Milios. Narrative Text Classification for Automatic Key Phrase Extraction in Web Document Corpora. In Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management, WIDM '05, pages 51–58, 2005.

[39] Z. Zhang, W. Yu, M. Yu, Z. Guo, and M. Jiang. A Survey of Multi-task Learning in Natural Language Processing: Regarding Task Relatedness and Training Methods. CoRR, abs/2204.03508, 2022.

[40] M. Zhao, J. Yang, and L. Qu. A multi-task learning model with graph convolutional networks for aspect term extraction and polarity classification. Applied Intelligence, 53(6):6585–6603, 2023.