https://doi.org/10.31449/inf.v46i4.4455
Informatica 46 (2022) 583–584

Semi-supervised Learning for Structured Output Prediction

Jurica Levatić
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
E-mail: jurica.levatic@ijs.si

Thesis Summary

Keywords: semi-supervised learning, predictive clustering trees, predicting structured outputs

Received: October 4, 2022

This article presents a summary of the author's doctoral dissertation on semi-supervised learning for predicting structured outputs.

Povzetek: The article presents a summary of the author's doctoral dissertation, which addresses the topic of semi-supervised learning for predicting structured values.

1 Introduction

In contrast to traditional supervised machine learning methods, which use only labeled data, semi-supervised methods additionally use unlabeled data. Because annotation is often laborious, labeled data are a limited asset in many real-life problems, which can hinder the predictive performance of algorithms. Unlabeled data, on the other hand, are often much easier to obtain. Semi-supervised learning (SSL) [1] aims to exploit unlabeled data to achieve better performance than can be achieved with labeled data alone.

Structured output prediction (SOP) is concerned with predicting structured, rather than scalar, values, such as multiple classes/variables, hierarchies or sequences [2]. Such outputs are encountered in many applications of predictive modeling. Compared to SSL for primitive outputs, SSL for SOP has received much less attention in the scientific community, although the need for SSL is even stronger there, because labels for structured data are even harder to obtain. Furthermore, this field lacks interpretable methods and methods that can handle various SOP tasks.

2 Methods and evaluation

In the thesis [3], to overcome the aforementioned issues, we extend the predictive clustering (PC) framework towards SSL.
The PC framework [4] is implemented using predictive clustering trees (PCTs), which can efficiently handle various SOP tasks. We propose two classes of semi-supervised methods stemming from the PC framework that can handle the following SOP tasks: multi-target regression, multi-label classification and hierarchical multi-label classification.

The first class of methods is based on the self-training paradigm, in which a model uses its own most reliable predictions in the learning process. We propose a self-training method for multi-target regression based on ensembles of predictive clustering trees [5]. To the best of our knowledge, this is currently one of the very few general-purpose semi-supervised methods for this type of structured output. Since the reliability of predictions in the context of multi-target regression had not been studied before, we propose two different reliability scores for predictions based on intrinsic mechanisms of ensemble methods. Furthermore, we propose an algorithm for automatic selection of the appropriate threshold on reliability scores.

The second class of methods we propose is based on extending the variance functions of predictive clustering trees to accommodate both labeled and unlabeled examples [6, 7]. This makes it possible to build semi-supervised predictive clustering trees that can exploit unlabeled examples while preserving the appealing characteristics of supervised trees, such as interpretability and computational efficiency. Semi-supervised predictive clustering trees are general with respect to the type of structured output: they can predict both multiple target variables and hierarchically structured classes. We propose a parametrization of semi-supervised predictive clustering trees that makes it possible to control the amount of supervision, i.e., the learned models can range from fully unsupervised to fully supervised.
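
The idea of a controllable amount of supervision can be pictured as a weighted impurity measure that mixes the variance of the target variables (computed on labeled examples only) with the variance of the descriptive attributes (computed on all examples). The following is a minimal sketch under stated assumptions, not the implementation from the thesis: the function names, the weight `w`, the normalization by training-set variance, and the fallback to zero when fewer than two examples are available are all hypothetical choices made for illustration.

```python
from statistics import pvariance


def column(vectors, j):
    """Return the j-th component of every vector."""
    return [v[j] for v in vectors]


def mean_normalized_variance(vectors, reference):
    """Average, over components, of variance(vectors) / variance(reference).

    `reference` holds the same kind of vectors for the whole training set,
    so every component is rescaled and units do not dominate the result.
    """
    if len(vectors) < 2:
        return 0.0  # simplifying assumption: too few examples to estimate
    n_components = len(reference[0])
    total = 0.0
    for j in range(n_components):
        ref_var = pvariance(column(reference, j))
        if ref_var > 0:
            total += pvariance(column(vectors, j)) / ref_var
    return total / n_components


def ssl_variance(subset, training_set, w):
    """Hypothetical semi-supervised impurity of `subset`.

    Each example is a pair (x, y); y is None for unlabeled examples.
    w = 1 uses only the targets (fully supervised), w = 0 uses only the
    descriptive attributes (fully unsupervised).
    """
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset if y is not None]
    all_xs = [x for x, _ in training_set]
    all_ys = [y for _, y in training_set if y is not None]
    target_term = mean_normalized_variance(ys, all_ys)
    descriptive_term = mean_normalized_variance(xs, all_xs)
    return w * target_term + (1 - w) * descriptive_term


# Toy data: one descriptive attribute, two targets, two unlabeled examples.
data = [
    ([0.0], [1.0, 10.0]),
    ([1.0], None),
    ([2.0], None),
    ([3.0], [3.0, 30.0]),
]
left = data[:2]  # a candidate partition produced by some split

for w in (0.0, 0.5, 1.0):
    print(w, ssl_variance(data, data, w), ssl_variance(left, data, w))
```

In this sketch the unlabeled examples influence the score only through the descriptive term, so with `w = 1` they are ignored and with `w = 0` the labels are ignored; intermediate values trade off the two sources of information.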

We perform an extensive empirical evaluation of the proposed methods on a wide range of datasets from different domains and with different types of structured output. We analyze the influence of the amount of labeled data on the performance of the proposed methods, as well as various aspects of their practical usability, such as interpretability, computational complexity, and sensitivity to parameters.

3 Discussion and Conclusions

The thesis contributes to the field of SSL for SOP with two classes of global semi-supervised methods for structured output prediction: self-training for multi-target regression [5] and semi-supervised predictive clustering trees [6, 7]. The empirical evaluation showed that the proposed methods outperform their supervised counterparts on a number of datasets from different domains and with different types of structured outputs.

The self-training approach offers state-of-the-art predictive performance on multi-target regression problems, but it produces black-box models and comes at the cost of increased computational complexity (due to iterative training of the base model) compared to supervised random forests. Semi-supervised predictive clustering trees, on the other hand, produce readily interpretable models, which are often considerably more accurate than the corresponding supervised models for structured outputs. The semi-supervised predictive clustering trees (and ensembles thereof) also exhibit attractive predictive performance on machine learning tasks with primitive outputs, i.e., classification and regression.

We also perform two case studies demonstrating the practical usability of the proposed semi-supervised methods: (1) we show that the proposed semi-supervised methodology is well-suited for quantitative structure-activity relationship modeling, i.e., prediction of the biological activity of chemical compounds [8]; (2) we demonstrate on the problem of water quality prediction that semi-supervised predictive clustering trees can efficiently learn from partially labeled data [9].

There are a number of possible directions for continuing the work presented in the thesis, for example extending the proposed methods to other structured output prediction tasks, such as time-series classification or sequence learning, or using the proposed methods to develop feature ranking for semi-supervised and unsupervised learning.

References

[1] O. Chapelle, B. Schölkopf, A. Zien (2006) Semi-supervised learning, MIT Press, Cambridge, Massachusetts.

[2] G. Bakır, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, S. Vishwanathan (2007) Predicting structured data, The MIT Press.

[3] J. Levatić (2017) Semi-supervised learning for structured output prediction, PhD Thesis, IPS Jožef Stefan, Ljubljana, Slovenia.

[4] H. Blockeel (1998) Top-down induction of first order logical decision trees, PhD Thesis, Katholieke Universiteit Leuven, Belgium.

[5] J. Levatić, M. Ceci, D. Kocev, S. Džeroski (2017) Self-training for multi-target regression with tree ensembles, Knowledge-based systems, 123:41–60.

[6] J. Levatić, D. Kocev, M. Ceci, S. Džeroski (2018) Semi-supervised trees for multi-target regression, Information Sciences, 450:109–127.

[7] J. Levatić, M. Ceci, D. Kocev, S. Džeroski (2017) Semi-supervised classification trees, Journal of Intelligent Information Systems, 49(3):461–486.

[8] J. Levatić, M. Ceci, T. Stepišnik, S. Džeroski, D. Kocev (2020) Semi-supervised regression trees with application to QSAR modelling, Expert Systems with Applications, 158:113569.

[9] S. Nikoloski, D. Kocev, J. Levatić, D. P. Wall, S. Džeroski (2021) Exploiting partially-labeled data in learning predictive clustering trees for multi-target regression: A case study of water quality assessment in Ireland, Ecological Informatics, 61:101161.