https://doi.org/10.31449/inf.v46i4.4455 Informatica 46 (2022) 583–584
Semi-supervised Learning for Structured Output Prediction
Jurica Levatić
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
E-mail: jurica.levatic@ijs.si
Thesis Summary
Keywords: semi-supervised learning, predictive clustering trees, predicting structured outputs
Received: October 4, 2022
This article presents a summary of the author's doctoral dissertation on the topic of semi-supervised
learning for predicting structured outputs.
Povzetek (Slovenian abstract, translated): The article presents a summary of the author's doctoral
dissertation, which addresses semi-supervised learning for predicting structured values.
1 Introduction
In contrast to traditional supervised machine learning methods, which use only labeled
data, semi-supervised methods additionally use unlabeled data. Because annotation is
laborious, labeled data are scarce in many real-life problems, which can hinder the
predictive performance of algorithms. Unlabeled data, on the other hand, are often much
easier to obtain. Semi-supervised learning (SSL) [1] aims to exploit unlabeled data to
achieve better performance than can be achieved with labeled data alone.
Structured output prediction (SOP) is concerned with predicting structured rather than
scalar values, such as multiple classes/variables, hierarchies or sequences [2]. Such
outputs are encountered in many applications of predictive modeling. Compared to SSL for
primitive outputs, SSL for SOP has received much less attention in the scientific
community, although the need for SSL is even stronger there: obtaining labels for
structured data is even harder. Furthermore, this field lacks interpretable methods and
methods that can handle various SOP tasks.
2 Methods and evaluation
In the thesis [3], to overcome the aforementioned issues, we extend the predictive
clustering (PC) framework towards SSL. The PC framework [4] is implemented using
predictive clustering trees (PCTs), which can efficiently handle various SOP tasks. We
propose two classes of semi-supervised methods stemming from the PC framework that can
handle the following SOP tasks: multi-target regression, multi-label classification and
hierarchical multi-label classification.
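To make the three output types concrete, the following minimal Python sketch shows what a single example's target might look like in each task. The toy class hierarchy, the label names and the helper `with_ancestors` are hypothetical illustrations and are not taken from the thesis; the sketch also assumes the usual hierarchy constraint in hierarchical multi-label classification, namely that an example labeled with a class is also labeled with all of that class's ancestors.

```python
# Hypothetical toy class hierarchy: child -> parent (None marks a top-level class).
PARENT = {"animal": None, "mammal": "animal", "bird": "animal", "dog": "mammal"}


def with_ancestors(labels):
    """Close a label set under the hierarchy by adding every ancestor class."""
    closed = set()
    for label in labels:
        while label is not None:
            closed.add(label)
            label = PARENT[label]
    return closed


# Multi-target regression: one real value per target variable.
multi_target = (7.2, 13.5, 0.4)

# Multi-label classification: any subset of a flat label set.
multi_label = {"sports", "politics"}

# Hierarchical multi-label classification: labels organized in a hierarchy.
hierarchical = with_ancestors({"dog"})
```

In all three cases the prediction is a single structured object rather than one scalar, which is what allows one model to exploit dependencies among the output components.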
The first class of methods is based on the self-training paradigm, in which a model uses
its own most reliable predictions in the learning process. We propose a self-training
method for multi-target regression based on ensembles of predictive clustering trees [5].
To the best of our knowledge, this is currently one of the very few general-purpose
semi-supervised methods for this type of structured output. Since the reliability of
predictions in the context of multi-target regression had not been studied before, we
propose two different reliability scores for predictions based on intrinsic mechanisms of
ensemble methods. Furthermore, we propose an algorithm for automatic selection of the
appropriate threshold on reliability scores.
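The self-training loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from [5]: the ensemble members are bootstrapped 1-nearest-neighbour regressors standing in for predictive clustering trees, the reliability score is assumed to be the negated variance of the members' predictions (averaged over targets), and the threshold is passed in by hand rather than selected automatically. All function names and parameters are hypothetical.

```python
import random
from statistics import mean, pvariance


def nearest_target(train, x):
    """1-nearest-neighbour prediction: target vector of the closest training example."""
    closest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return closest[1]


def ensemble_predict(members, x):
    """Return (mean prediction, reliability) for one unlabeled example.

    Assumed reliability score: negated per-target variance of the members'
    predictions, averaged over targets (0 means perfect agreement).
    """
    votes = [nearest_target(member, x) for member in members]
    per_target = list(zip(*votes))
    prediction = tuple(mean(v) for v in per_target)
    reliability = -mean(pvariance(v) for v in per_target)
    return prediction, reliability


def self_train(labeled, unlabeled, threshold, n_members=25, max_iter=10, seed=0):
    """Iteratively move reliably predicted unlabeled examples into the labeled set."""
    rng = random.Random(seed)
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        # Each ensemble member is a bootstrap sample of the current labeled set.
        members = [rng.choices(labeled, k=len(labeled)) for _ in range(n_members)]
        scored = [(x, *ensemble_predict(members, x)) for x in unlabeled]
        confident = [(x, y) for x, y, r in scored if r >= threshold]
        if not confident:
            break  # nothing is reliable enough, stop early
        labeled.extend(confident)
        unlabeled = [x for x, _, r in scored if r < threshold]
    return labeled, unlabeled


# Toy data: one descriptive feature, two targets.
labeled = [((0.0,), (0.0, 0.0)), ((0.1,), (0.1, 0.2)),
           ((1.0,), (1.0, 2.0)), ((1.1,), (1.1, 2.2))]
unlabeled = [(0.05,), (1.05,), (0.55,)]
augmented, remaining = self_train(labeled, unlabeled, threshold=-0.05)
```

A final model would then be trained on `augmented`. The loop retrains the ensemble in every iteration, which is the source of the additional computational cost of self-training compared to a single supervised ensemble.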
The second class of methods we propose is based on extending the variance functions of
predictive clustering trees to accommodate both labeled and unlabeled examples [6, 7].
This makes it possible to build semi-supervised predictive clustering trees that can
exploit unlabeled examples while preserving the appealing characteristics of supervised
trees, such as interpretability and computational efficiency. Semi-supervised predictive
clustering trees are general in terms of the type of structured output: they can predict
multiple target variables as well as hierarchically structured classes. We propose a
parametrization of semi-supervised predictive clustering trees that controls the amount
of supervision, i.e., the learned models can range from fully unsupervised to fully
supervised.
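One way to picture such a variance function is as a weighted sum of the variance of the targets (computed on labeled examples only) and the variance of the descriptive features (computed on all examples). The Python sketch below follows that reading; the function names, the weight `w` and the data layout are assumptions made for illustration, and the exact definition in [6, 7] may differ, for instance by normalizing each variable so that different scales are comparable.

```python
from statistics import mean, pvariance


def mean_variance(rows):
    """Average per-column population variance; 0.0 when fewer than two rows."""
    if len(rows) < 2:
        return 0.0
    return mean(pvariance(column) for column in zip(*rows))


def ssl_variance(examples, w):
    """Semi-supervised impurity of a set of examples (illustrative sketch).

    examples: list of (features, targets) pairs; targets is None if unlabeled.
    w: assumed supervision weight in [0, 1]; w = 1 uses only the targets
       (fully supervised), w = 0 uses only the descriptive features
       (fully unsupervised).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    target_rows = [y for _, y in examples if y is not None]
    feature_rows = [x for x, _ in examples]
    return w * mean_variance(target_rows) + (1.0 - w) * mean_variance(feature_rows)


# Two labeled examples and one unlabeled example (targets set to None).
examples = [((0.0, 1.0), (2.0,)), ((2.0, 1.0), (4.0,)), ((4.0, 1.0), None)]
supervised = ssl_variance(examples, w=1.0)
unsupervised = ssl_variance(examples, w=0.0)
mixed = ssl_variance(examples, w=0.5)
```

A tree learner would evaluate this quantity for the examples on each side of a candidate split and prefer splits that reduce it most, so the unlabeled example above influences the split whenever `w` is below 1.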
We perform an extensive empirical evaluation of the proposed methods on a wide range of
datasets from different domains and with different types of structured output. We analyze
the influence of the amount of labeled data on the performance of the proposed methods,
as well as various aspects of their practical usability, such as interpretability,
computational complexity, and sensitivity to parameters.
3 Discussion and Conclusions
The thesis contributes to the field of SSL for SOP with two classes of global
semi-supervised methods for structured output prediction: self-training for multi-target
regression
[5] and semi-supervised predictive clustering trees [6, 7].
The empirical evaluation showed that the proposed methods outperform their supervised
counterparts on a number of datasets from different domains and with different types of
structured outputs.
The self-training approach offers state-of-the-art predictive performance on multi-target
regression problems, but it produces black-box models and incurs a higher computational
cost than supervised random forests (due to iterative training of the base model).
Semi-supervised predictive clustering trees, on the other hand, produce readily
interpretable models, which are often considerably more accurate than the corresponding
supervised models for structured outputs. The semi-supervised predictive clustering trees
(and ensembles thereof) also exhibit attractive predictive performance on machine
learning tasks with primitive outputs, i.e., classification and regression.
We also perform two case studies demonstrating the practical usability of the proposed
semi-supervised methods: (1) we show that the proposed semi-supervised methodology is
well-suited for quantitative structure-activity relationship modeling, i.e., prediction
of the biological activity of chemical compounds [8]; (2) we demonstrate on the problem
of water quality prediction that semi-supervised predictive clustering trees can
efficiently learn from partially labeled data [9].
There are several possible directions for continuing the work presented in the thesis,
including extending the proposed methods to other structured output prediction tasks,
such as time-series classification or sequence learning, and using the proposed methods
to develop feature ranking for semi-supervised and unsupervised learning.
References
[1] O. Chapelle, B. Schölkopf, A. Zien (2006) Semi-supervised learning, MIT Press,
Cambridge, Massachusetts.
[2] G. Bakır, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, S. Vishwanathan (2007)
Predicting structured data, The MIT Press.
[3] J. Levatić (2017) Semi-supervised learning for structured output prediction, PhD
Thesis, IPS Jožef Stefan, Ljubljana, Slovenia.
[4] H. Blockeel (1998) Top-down induction of first order logical decision trees, PhD
Thesis, Katholieke Universiteit Leuven, Belgium.
[5] J. Levatić, M. Ceci, D. Kocev, S. Džeroski (2017) Self-training for multi-target
regression with tree ensembles, Knowledge-Based Systems, 123:41–60.
[6] J. Levatić, D. Kocev, M. Ceci, S. Džeroski (2018) Semi-supervised trees for
multi-target regression, Information Sciences, 450:109–127.
[7] J. Levatić, M. Ceci, D. Kocev, S. Džeroski (2017) Semi-supervised classification
trees, Journal of Intelligent Information Systems, 49(3):461–486.
[8] J. Levatić, M. Ceci, T. Stepišnik, S. Džeroski, D. Kocev (2020) Semi-supervised
regression trees with application to QSAR modelling, Expert Systems with Applications,
158:113569.
[9] S. Nikoloski, D. Kocev, J. Levatić, D. P. Wall, S. Džeroski (2021) Exploiting
partially-labeled data in learning predictive clustering trees for multi-target
regression: A case study of water quality assessment in Ireland, Ecological Informatics,
61:101161.