https://doi.org/10.31449/inf.v46i1.3577

Informatica 46 (2022) 121–128

Using Semi-Supervised Learning and Wikipedia to Train an Event Argument Extraction System

Patrik Zajec and Dunja Mladenić
E-mail: patrik.zajec@ijs.si, dunja.mladenic@ijs.si
Jožef Stefan Institute and Jožef Stefan International Postgraduate School

Student paper

Keywords: event extraction, event argument extraction, semi-supervised learning, probabilistic soft logic

Received: June 2, 2021

The paper presents a methodology for training an event argument extraction system in a semi-supervised setting. We use Wikipedia and Wikidata to automatically obtain a small noisily labeled dataset and a large unlabeled dataset. The dataset consists of event clusters containing Wikipedia pages in multiple languages. The unlabeled data is iteratively labeled using semi-supervised learning combined with probabilistic soft logic to infer the pseudo-label of each example from the predictions of multiple base learners. The proposed methodology is applied to Wikipedia pages about earthquakes and terrorist attacks in a cross-lingual setting. Our experiments show an improvement of the results when using the proposed methodology. The system achieves an F1-score of 0.79 when only the automatically labeled dataset is used, and an F1-score of 0.84 when trained according to the methodology with semi-supervised learning combined with probabilistic soft logic.

Povzetek: The paper presents a methodology for the semi-supervised training of a system whose task is the extraction of key event attributes. As data sources we use the free encyclopedia Wikipedia and the knowledge base Wikidata. From both sources we automatically obtain a smaller set of labeled data and a larger set of unlabeled data, composed of clusters of documents that report on individual events in multiple languages. The unlabeled data is iteratively labeled with semi-supervised learning combined with probabilistic soft logic, which merges the predictions from multiple prediction models for each example into a single pseudo-label. We apply the proposed methodology to Wikipedia events on the topics of earthquakes and terrorist attacks. The experiments show an improvement in the results of the system, which achieves an F1 of 0.79 when trained only on the automatically labeled data, and 0.84 when trained according to the proposed methodology with semi-supervised learning combined with probabilistic soft logic.

1 Introduction

The event extraction task is usually divided into two subtasks, namely event type detection and event argument extraction. Event type detection aims to determine the type or topic of the event, while event argument extraction aims to extract arguments for some predefined argument roles associated with the topic of the event. In this paper, we focus on event argument extraction and assume that the topic of the event is known in advance. For example, consider an excerpt from a news article reporting on an earthquake:

"On April 22, 2019, a 6.1 magnitude earthquake struck the island of Luzon in the Philippines, leaving at least 18 dead, 3 missing and injuring at least 256 others." (Source: en.wikipedia.org/wiki/2019_Luzon_earthquake)

We are interested in extracting the arguments for the following roles: the magnitude of the earthquake, the time and date when the earthquake occurred, the number of victims, and the number of injured. The goal is to extract an argument for each role that is mentioned in the text.
The output is a set of (role, argument) pairs for the given text, for example (magnitude, 6.1).

We formulate argument extraction as a supervised classification task, where the classifier is used to assign arguments to predefined roles. We assume that each document reports on a single event, where the topic of the event is either earthquake or terrorist attack and is known in advance. Since there is no dataset that contains annotations for the roles we are interested in, one must be constructed to train the classification model. Manually constructing a labeled dataset is a resource-intensive process, so we instead develop a methodology to automate the labeling.

We start by automatically constructing a small, noisily labeled set of examples for each role by matching structured knowledge from Wikidata with text from Wikipedia pages. Then we use semi-supervised learning [1] with multiple base learners to increase the size of the labeled set. In each iteration, the most confident predictions for the examples from the unlabeled set are used to increase the training set by assigning pseudo-labels. We introduce an additional component that combines the confidences of the multiple base learners for each example.

The contribution of this paper is a novel methodology for the automatic construction of a multi-lingual labeled dataset and the training of a classification system for event argument extraction in multiple languages.

2 Related work

Wikipedia is a commonly used source for dataset construction [2, 3, 4]. Most datasets are constructed, at least in part, under the supervision of a human annotator, since for most tasks the labels cannot be reliably generated automatically [5, 6]. The initial annotations are usually obtained using distant supervision [7], the automatic approach of matching knowledge from a knowledge base with free text, and are later refined by a human annotator; we use semi-supervised learning instead. As in our case, the common choice is to use Wikidata [8] as the knowledge source and Wikipedia (https://www.wikipedia.org/) as the text source.

Wikidata often does not reflect all of the knowledge available in Wikipedia pages, and a considerable amount of work has been done on the automatic population of both Wikipedia infoboxes and Wikidata [9, 10, 11, 12]. Such approaches develop models capable of extracting structured knowledge from free text. However, since the task is limited to Wikipedia, they mostly rely on the structure of Wikipedia pages to achieve good performance. This is especially true for the language-independent models [11] and makes them not directly applicable to other domains, such as newswire data.

Event extraction, which includes both event detection and argument extraction, is an active research topic that is formulated at several levels (document and sentence level) and approached using different techniques [13, 14, 15]. The approaches are most commonly evaluated on the ACE 2005 corpus [16] and the MUC-4 corpus [17]. Recently, the RAMS [18] and WikiEvents [13] corpora were introduced, both manually annotated and containing only English documents. Further, they contain no information about which documents mention the same event, which could be used to form event clusters. The use of event clusters as a source of redundancy has been explored previously [19, 20].
Approaches to cross-lingual argument classification [21] mostly focus on transfer learning, where the model is first trained on the source language and later applied to the target language. None of these approaches builds event clusters from documents in multiple languages and incorporates cross-linguality directly into the semi-supervised training.

The currently best-performing approaches for event argument extraction are based on machine reading comprehension [22] or conditional generation [13] and therefore do not train a specific classifier for each argument. We find them less suitable for the semi-supervised learning part of our approach, as they require event templates that are usually constructed manually. We envision that such models can instead be trained or fine-tuned on the automatically labeled dataset that our methodology generates.

Pseudo-labeling can be considered one of the most intuitive and simple forms of semi-supervised learning, while still achieving competitive performance if approached correctly [23]. We have chosen a simple and extensible technique, based on [24], to combine the predictions of multiple base learners into a single pseudo-label.

3 Methodology

3.1 Problem definition

We are given a collection of events $E$ and a collection of topics $T$, where the topic of each event $e \in E$ is exactly one topic $t \in T$. An event $e$ has occurred at a particular time, and its event cluster $D_e$ consists of documents in multiple languages, with a single document per language, reporting on it. For each topic $t \in T$ there is a set of argument roles $R_t$ that are known in advance. For example, the members of $R_{\mathit{earthquake}}$ are magnitude, number of injured, and number of deaths. For each document $d$ from the event cluster $D_e$, there is a set of arguments that were already extracted from the text. The task is to assign at most one argument role from $R_t$ to each extracted argument.

Note that there are multiple documents in the event cluster, each with its own set of arguments. For an event $e$, a single argument role might have zero, one, or multiple different arguments assigned. It is possible, for example, that the earthquake caused no casualties, so no argument is assigned to the role number of deaths. The documents from the same event cluster might have different arguments for the same role, since, for example, the reported number of injured could come from different sources. In addition, there may also be multiple arguments for the same role in a single document, since the assumption that each document reports only on a single event does not always hold in practice. One such example is when one magnitude mention refers to the actual earthquake that the event is about, while another refers to a stronger earthquake that hit the same region years ago.

Choosing a single argument for each argument role is challenging and sometimes even impossible. The task is instead to assign all feasible arguments from the event cluster to a particular role, even if not all are directly related to the event.
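To make the problem definition concrete, the sketch below shows one possible encoding of an event, its cluster, and the expected output, using the 2019 Luzon earthquake excerpt from the introduction. All names and types are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Argument roles R_t known in advance for each topic t in T (illustrative).
ROLES = {
    "earthquake": {"magnitude", "number of deaths", "number of injured"},
    "terrorist attack": {"number of deaths", "number of injured"},
}

@dataclass
class Document:
    language: str         # a cluster holds one document per language
    text: str
    arguments: list[str]  # argument spans already extracted from the text

@dataclass
class Event:
    topic: str                                              # one topic t in T
    cluster: list[Document] = field(default_factory=list)   # D_e

event = Event(
    topic="earthquake",
    cluster=[
        Document(
            language="en",
            text="On April 22, 2019, a 6.1 magnitude earthquake struck the "
                 "island of Luzon ... at least 18 dead, 3 missing and "
                 "injuring at least 256 others.",
            arguments=["6.1", "18", "3", "256"],
        ),
        # ... documents in the other languages of the cluster
    ],
)

# Task: assign at most one role from ROLES[event.topic] to each argument.
# A feasible labeling for the English document above ("3" receives no role,
# since "number of missing" is not in the role set):
labeling = {"6.1": "magnitude", "18": "number of deaths",
            "256": "number of injured"}
```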
3.2 Approach

There is no labeled dataset that follows the required structure and contains labels for the argument roles. We develop a methodology to automatically build such a dataset and train a classification model, which is used to pseudo-label the dataset and, once trained, can classify new data.

First, we select the Wikidata entities that are instances of the topics from $T$ and of the class occurrence. Each Wikidata entity links to Wikipedia pages in multiple languages, which form an event cluster. We align the argument roles $R_t$ with Wikidata properties and try to automatically match the value of each property with the text of the pages. The matching is performed automatically, either by using anchor links or by literal text matching for numerical values. This gives us the arguments from the text assigned to particular argument roles. As most of the Wikidata entities lack some property values and automatic matching is frequently ambiguous, we obtain a small, noisily labeled set, which we refer to as the seed set. We further use the named entities from the pages as arguments and assign the ones that are not part of the seed set to an unlabeled set. Each such argument is considered an unlabeled example.

To label more arguments, we use semi-supervised learning with multiple base learners and pseudo-labeling. Each of the base learners is trained on the set of labeled arguments from the topic (or multiple topics) and the language assigned to it. The prediction probabilities for each of the unlabeled arguments are determined by combining the probabilities of all base learners. This is done either by averaging or by feeding the probabilities, as approximations of the true labels, into a component that attempts to derive the true value of each argument and the error rate of each learner [24]. The arguments with probabilities above or below specified thresholds are given a pseudo-label and added to the training set. The entire workflow is repeated in each iteration until no new arguments are selected for pseudo-labeling. The result is an automatically labeled dataset that covers the given topics and labels for the selected argument roles, together with a classification system that assigns an argument role to each argument.

3.2.1 Representing the arguments

Following the related work [25], we introduce two new special tokens (written here as <e> and </e>) that mark the beginning and end of the argument span in the context. The context of each argument is converted to a sequence of tokens with these additional tokens inserted around the argument span. For example, the argument 6.1 from the sentence "A 6.1 magnitude earthquake." is represented with its context as "A <e> 6.1 </e> magnitude earthquake.". Such a representation is fed through the pretrained version of the XLM-RoBERTa model [26]. We use the implementation from the Transformers library [27] and use the last hidden state of the <e> token as the representation of the argument. The XLM-RoBERTa model remains fixed during learning, as we have observed that the representation from the pretrained model is expressive enough for our purposes, and keeping it fixed significantly speeds up the iterations.
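A minimal sketch of this representation step with the Transformers library follows. The marker token names and the choice of the base-sized checkpoint are our assumptions (the paper does not specify either), and the newly added marker embeddings are randomly initialized here, since the encoder stays frozen.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()  # the encoder stays fixed; only the base learners are trained

# Register the span markers as special tokens so they are never split.
# Note: the new embedding rows are randomly initialized in this sketch.
tokenizer.add_special_tokens({"additional_special_tokens": ["<e>", "</e>"]})
model.resize_token_embeddings(len(tokenizer))

def argument_vector(context_with_markers: str) -> torch.Tensor:
    """Return the last hidden state of the opening marker as the argument vector."""
    enc = tokenizer(context_with_markers, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]     # (seq_len, dim)
    marker_id = tokenizer.convert_tokens_to_ids("<e>")
    position = (enc["input_ids"][0] == marker_id).nonzero()[0].item()
    return hidden[position]                            # (dim,), 768 for base

vec = argument_vector("A <e> 6.1 </e> magnitude earthquake.")
```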
3.2.2 Using multiple topics

Instead of grouping all arguments assigned to the same role across topics, we keep the information about the topic of the argument's event. We do this firstly because the way an argument role is expressed might differ slightly between topics; secondly because, as shown by the experiments, such separation enables additional supervision between the topics; and finally because we can use all the arguments from one topic as negative examples for a role specific to the other topic.

For two topics $t$ and $t'$ there is potentially a set of common roles and a set of distinct roles that appear in only one of the topics. For a role $r$ that appears in both topics, the base learner trained on $t'$ can be used to make predictions for examples from $t$. By combining the predictions of the learners trained on $t$ and $t'$, we can get better estimates of the true labels of the examples. For a role $r'$ that is specific to the topic $t$, all examples from the topic $t'$ can be used as negative examples. Selecting reliable negative examples from the same topic is not easy, as we may inadvertently mislabel some of the positive examples.

3.2.3 Using multiple languages

In a sense, articles from different languages provide different views on the same event. The important arguments should appear in all the articles, as they are highly relevant to the event. The arguments for roles such as location and time should be consistent across all articles, whereas this does not necessarily apply to other roles, such as the number of injured or the number of casualties. Matching such arguments across the articles is therefore not a trivial task, and although a variant of soft matching could be performed, we leave it for future work and limit our focus to the values that can be matched unambiguously. For examples whose arguments can be matched across the articles, we can combine the predictions of several language-specific base learners into a single pseudo-label.

3.2.4 Base learners

Each iteration starts with a set of labeled arguments $X_l$, a set of unlabeled arguments $X_u$, and a set of base learners trained on $X_l$. Base learners are simple logistic regression classifiers that use the vector representations of the arguments. Each base learner $f^r_{t,l}$ is a binary classifier trained on the labeled data for the role $r$ from the topic $t$ and the language $l$. Such base learners are topic-specific, as they are trained on a single topic $t$. Base learners $f^r_l$ are trained on the labeled data for the role $r$ from the language $l$ and all the topics with the role $r$. Such base learners are shared across topics, as they consider the arguments from all the topics as a single training set. We use the classification probability of the positive class instead of hard labels, $f^r_{t,l}(x), f^r_l(x) \in [0, 1]$.

For each argument $x$ from an article in language $l$ reporting on the event $e$ from the topic $t$, we obtain the following predictions (a sketch of the base learners follows this list):

– $f^r_{t',l}(x)$ for each $r \in R_t$ and every topic $t'$ such that $r \in R_{t'}$: the probability that $x$ is an argument for the role $r$, where $r$ is a role from the topic $t$, using the topic-specific base learner trained on examples from the same language and the topic $t'$ that also has the role $r$;

– $f^r_{t,l'}(x)$, which equals $f^r_{t,l'}(y)$, for each $r \in R_t$ and each language $l'$ such that there is an article in that language reporting on the same event $e$ and containing an argument $y$ matched to $x$;

– $f^r_l(x)$ for each $r \in R_t$, using the shared base learner trained on arguments from all topics $t'$ that have the role $r$.
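The sketch below shows what a single base learner looks like under these definitions, assuming scikit-learn and random stand-ins for the frozen argument vectors; in practice the training data would come from the seed set and the pseudo-labeled examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 768  # hidden size of XLM-RoBERTa-base (see the representation sketch)

# Toy stand-ins for frozen argument vectors (in practice: argument_vector()).
X_pos = rng.normal(0.5, 1.0, size=(40, dim))    # labeled "magnitude" arguments
X_neg = rng.normal(-0.5, 1.0, size=(120, dim))  # other roles + other-topic args
X_unlabeled = rng.normal(0.0, 1.0, size=(10, dim))

# A base learner f^r_{t,l} is a binary logistic regression classifier for one
# (role, topic, language) triple; shared learners f^r_l pool all topics with r.
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
f_magnitude_eq_en = LogisticRegression(max_iter=1000).fit(X, y)

# Soft predictions in [0, 1] (probability of the positive class), not hard labels.
probs = f_magnitude_eq_en.predict_proba(X_unlabeled)[:, 1]
```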
3.2.5 Combining the predictions

Multiple base learners make predictions for each argument. We combine these predictions into a probability distribution over argument roles as a weighted average. The weight of each base learner is determined by its error rate, which is estimated using both unlabeled and labeled examples.

We introduce the following logical rules (referred to as ensemble rules in [24]) for each base learner, where $\hat{f}^r(x)$ denotes the learner's prediction for $x$, $f^r(x)$ the inferred label, and $e(\hat{f}^r)$ the learner's error rate:

$\hat{f}^r(x) \land \lnot e(\hat{f}^r) \rightarrow f^r(x)$
$\hat{f}^r(x) \land e(\hat{f}^r) \rightarrow \lnot f^r(x)$
$\lnot \hat{f}^r(x) \land \lnot e(\hat{f}^r) \rightarrow \lnot f^r(x)$
$\lnot \hat{f}^r(x) \land e(\hat{f}^r) \rightarrow f^r(x)$

Here, the truth values are not limited to Boolean values, but instead represent the probability that the corresponding ground predicate or rule is true. For a detailed explanation of the method, we refer the reader to [24]. We introduce a prior belief that the predictions of the base learners are correct via the following two rules:

$\hat{f}^r(x) \rightarrow f^r(x)$ and $\lnot \hat{f}^r(x) \rightarrow \lnot f^r(x)$.

Since each example can be a value for at most one role, we introduce a mutual exclusion rule for each pair of distinct roles $r$ and $r'$:

$\hat{f}^r(x) \land \hat{f}^{r'}(x) \rightarrow e(\hat{f}^r)$.

The rules are written in the syntax of a probabilistic soft logic [28] program, where each rule is assigned a weight. We assign a weight of 1 to all ensemble rules, a weight of 0.1 to all prior belief rules, and a weight of 1 to all mutual exclusion rules. The inference is performed using the PSL framework (https://psl.linqs.org/). Once we obtain the approximations for all $x \in X_u$, we extend the set of positive examples for each role $r$ with all $x$ such that $f^r(x) \geq T_p$, and the set of negative examples with all $x$ such that $f^r(x) \leq T_n$, for predefined thresholds $T_p$ and $T_n$.
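A minimal sketch of the simpler, rule-free variant of this step is shown below: combining soft predictions by an error-rate-weighted average and thresholding them into pseudo-labels with $T_p$ and $T_n$. The weighting scheme (weights proportional to $1 - e(\hat{f})$) is our illustrative assumption; the full inference over the rules above is delegated to the PSL framework.

```python
import numpy as np

T_P, T_N = 0.6, 0.05  # pseudo-labeling thresholds used in Section 4.2

def combine(predictions: np.ndarray, error_rates: np.ndarray) -> np.ndarray:
    """Weighted average of base-learner probabilities for one role.

    predictions: (n_learners, n_examples) soft predictions in [0, 1]
    error_rates: (n_learners,) estimated error rate per learner
    The 1 - e weighting is an illustrative choice, not the PSL inference.
    """
    weights = 1.0 - error_rates
    weights = weights / weights.sum()
    return weights @ predictions

def pseudo_label(combined: np.ndarray):
    """Split examples into new positives, new negatives, and still-unlabeled."""
    positives = np.where(combined >= T_P)[0]
    negatives = np.where(combined <= T_N)[0]
    undecided = np.where((combined > T_N) & (combined < T_P))[0]
    return positives, negatives, undecided

# Three base learners for one role, e.g. f^r_{t,l}, a matched f^r_{t,l'}, f^r_l
preds = np.array([[0.9, 0.04, 0.5], [0.8, 0.02, 0.4], [0.7, 0.03, 0.6]])
errs = np.array([0.1, 0.2, 0.15])
pos, neg, und = pseudo_label(combine(preds, errs))  # pos=[0], neg=[1], und=[2]
```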
4 Experiments

4.1 Dataset

To evaluate the proposed methodology, we conducted experiments on two topics: earthquakes and terrorist attacks. We collected the Wikipedia articles and Wikidata information of 913 earthquakes from 2000 to 2020 in 6 languages, namely English, Spanish, German, French, Italian, and Dutch. We manually annotated the arguments of 85 English articles with the argument roles number of deaths, number of injured, and magnitude; these serve as a labeled test set and are not included in the training process. In addition, we collected data on 315 terrorist attacks from 2000 to 2020, with articles in the same 6 languages.

4.2 Evaluation settings

The evaluation of each approach is performed on the labeled English dataset, where 76 arguments are labeled as number of deaths, 45 as number of injured, and 125 as magnitude. The threshold values for pseudo-labeling are set to $T_p = 0.6$ and $T_n = 0.05$. The approaches differ in the subset of base learners used to form the combined prediction and in the weighting of the predictions.

Single or multiple languages: In the single-language setting, only English articles are used to label the arguments and train the base learners. In the multi-language setting, all available articles are used, and the arguments are matched across the articles from the same event.

Single or multiple topics: In the single-topic setting, only the arguments from the earthquake topic are used. In the multi-topic setting, the arguments from terrorist attacks are used as negative examples for magnitude, and the base learners for the roles number of deaths and number of injured are combined as described in Section 3.2.4.

Uniform or estimated weights: In the uniform setting, all predictions of the base learners contribute equally, while in the estimated setting, the weights of the base learners are estimated using the approach described in Section 3.2.5.

4.3 Results

The results of all the experiments are summarized in Table 1. Since the test set is limited to the topic earthquake and to English, only a subset of base learners was used to make the final predictions. We report the average values of precision, recall, and F1 across all argument roles. A probability threshold of 0.5 was used to determine the classification label.

Table 1: Results of all experiments. The column Single iteration reports the results of approaches where the base learners were trained on the seed set only. Results where the base learners were trained in the semi-supervised setting with different weightings of the predictions are reported in the columns Uniform weights and Estimated weights. The values of precision, recall, and F1 are averaged over all argument roles.

                                      Single iteration     Uniform weights      Estimated weights
Model                                 P     R     F1       P     R     F1       P     R     F1
Single language, single topic         0.94  0.64  0.76     0.83  0.75  0.77     0.84  0.76  0.79
Multiple languages, single topic      0.94  0.64  0.76     0.82  0.74  0.76     0.83  0.75  0.77
Single language, multiple topics      0.91  0.76  0.83     0.83  0.83  0.83     0.86  0.83  0.84
Multiple languages, multiple topics   0.93  0.76  0.83     0.82  0.83  0.82     0.84  0.84  0.84

Single iteration: Approaches in which the base learners are trained on the initial seed set for a single iteration achieve higher precision (0.94, compared to 0.84 achieved with estimated weights in the single-language, single-topic setting) at the cost of a lower recall (0.64 compared to 0.76). We observe that they distinguish almost perfectly between the argument roles from the seed set and produce almost no false positives. Using one or more languages has almost no effect on the averaged scores when the number of topics is fixed. When using multiple topics, a higher recall is achieved without a significant decrease in precision. All incorrect classifications for the role number of injured are actually examples of the role number of missing, which is not included in our set of roles; likewise, almost all incorrect classifications for the role magnitude are examples of the role intensity on the Mercalli scale. This could easily be solved by expanding the set of roles, and it shows how important it is to learn to classify multiple roles simultaneously.

Uniform and estimated weights: Semi-supervised approaches in which the base learners are trained iteratively trade precision for a significant improvement in recall (from 0.64 to 0.76 with estimated weights in the single-language, single-topic setting). Most of the loss of precision is due to misclassification between the roles number of deaths and number of injured, as in the example "370 people were killed by the earthquake and related building collapses, including 228 in Mexico City, and more than 6,000 were injured.", where 228 was incorrectly classified as the number of injured instead of the number of deaths. The use of multiple topics reduces misclassification between these roles and further improves recall, as new contexts are discovered by the base learners trained on terrorist attacks.

Using the estimated error rates as weights for the predictions of the base learners shows a slight improvement in performance. It may be beneficial to estimate multiple error rates for the topic-specific base learners, as they tend to be more reliable in labeling arguments from their own topic. We believe more data and experiments are needed to properly evaluate this component. A major advantage is its flexibility, as we can easily incorporate prior knowledge about the roles or additional constraints on the predictions in the form of logical rules.
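As a small illustration of how the reported averages could be computed, the following sketch derives per-role precision, recall, and F1 at the 0.5 threshold and averages them across roles; the toy labels and probabilities are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy per-role gold labels and predicted probabilities for the test set
roles = ["number of deaths", "number of injured", "magnitude"]
y_true = {r: np.array([1, 0, 1, 1, 0]) for r in roles}            # illustrative
y_prob = {r: np.array([0.9, 0.3, 0.7, 0.4, 0.1]) for r in roles}  # illustrative

scores = []
for r in roles:
    y_pred = (y_prob[r] >= 0.5).astype(int)  # classification threshold of 0.5
    p, rec, f1, _ = precision_recall_fscore_support(
        y_true[r], y_pred, average="binary", zero_division=0)
    scores.append((p, rec, f1))

# Averages over all argument roles, as reported in Table 1
avg_p, avg_r, avg_f1 = np.mean(scores, axis=0)
```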
5 Conclusion

The proposed method avoids the need to manually annotate data for event argument extraction and instead combines Wikipedia and Wikidata to obtain labeled data. Compared to the related work, the proposed methodology uses semi-supervised learning and integrates cross-lingual data into the learning process to enhance the pseudo-labeling, supported by probabilistic soft logic. The resulting classification models, used for automatic labeling, can readily be used to extract event arguments from new texts. It is also possible to train or fine-tune a stronger state-of-the-art model on the resulting dataset, extending it to event arguments and languages beyond those included in the original training datasets. The experiments were performed on a relatively small dataset and show that the proposed direction is promising. However, a more suitable test of our approach would be to apply it to a much larger number of topics and events, which we will do in the next step. Moreover, the current approach needs to be evaluated in more detail.

Acknowledgement

This work was supported by the Slovenian Research Agency under the project J2-1736 Causalify and co-financed by the Republic of Slovenia and the European Union's H2020 research and innovation program under the NAIADES EU project, grant agreement H2020-SC5-820985.

References

[1] Jesper E. van Engelen and Holger H. Hoos. A survey on semi-supervised learning. Machine Learning, 109:373–440, 2019. https://doi.org/10.1007/s10994-019-05855-6.

[2] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online, June 2021. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.200.

[3] Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. WikiReading: A novel large-scale language understanding task over Wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1545, Berlin, Germany, August 2016. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1145.

[4] Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 166–175, New York, NY, USA, 2019. Association for Computing Machinery. https://doi.org/10.1145/3292500.3330955.

[5] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada, August 2017. Association for Computational Linguistics. https://doi.org/10.18653/v1/K17-1034.
[6] Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809, Brussels, Belgium, October–November 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1514.

[7] Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli. Extending the coverage of DBpedia properties using distant supervision over Wikipedia. CEUR Workshop Proceedings, 1064, 2013. https://dl.acm.org/doi/10.5555/2874479.2874482.

[8] Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, September 2014. https://doi.org/10.1145/2629489.

[9] Dustin Lange, Christoph Böhm, and Felix Naumann. Extracting structured information from Wikipedia articles to populate infoboxes. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1661–1664, New York, NY, USA, 2010. Association for Computing Machinery. https://doi.org/10.1145/1871437.1871698.

[10] Florian Schrage, Nicolas Heist, and Heiko Paulheim. Extracting literal assertions for DBpedia from Wikipedia abstracts. In SEMANTiCS, pages 288–294, 2019. https://doi.org/10.1007/978-3-030-33220-4_21.

[11] Nicolas Heist, Sven Hertling, and Heiko Paulheim. Language-agnostic relation extraction from abstracts in wikis. Information, 9(4):75, 2018. https://doi.org/10.3390/info9040075.

[12] Boya Peng, Yejin Huh, Xiao Ling, and Michele Banko. Improving knowledge base construction from robust infobox extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pages 138–148, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-2018.

[13] Sha Li, Heng Ji, and Jiawei Han. Document-level event argument extraction by conditional generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 894–908, Online, June 2021. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.69.

[14] Fayuan Li, Weihua Peng, Yuguang Chen, Quan Wang, Lu Pan, Yajuan Lyu, and Yong Zhu. Event extraction as multi-turn question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 829–838, Online, November 2020. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.73.

[15] Manling Li, Alireza Zareian, Ying Lin, Xiaoman Pan, Spencer Whitehead, Brian Chen, Bo Wu, Heng Ji, Shih-Fu Chang, Clare Voss, Daniel Napierski, and Marjorie Freedman. GAIA: A fine-grained multimedia knowledge extraction system. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 77–86, Online, July 2020. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.11.

[16] George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 2004. European Language Resources Association (ELRA).
[17] Beth M. Sundheim. Overview of the fourth Message Understanding Evaluation and Conference. In Proceedings of the 4th Conference on Message Understanding (MUC4 '92). Association for Computational Linguistics, 1992. https://doi.org/10.3115/1072064.1072066.

[18] Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. Multi-sentence argument linking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8057–8077, Online, July 2020. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.718.

[19] James Ferguson, Colin Lockard, Daniel Weld, and Hannaneh Hajishirzi. Semi-supervised event extraction with paraphrase clusters. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 359–364, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2058.

[20] Xiao Liu, Heyan Huang, and Yue Zhang. Open domain event extraction using neural latent variable models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2860–2871, Florence, Italy, July 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1276.

[21] Ananya Subburathinam, Di Lu, Heng Ji, Jonathan May, Shih-Fu Chang, Avirup Sil, and Clare Voss. Cross-lingual structure transfer for relation and event extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 313–325, Hong Kong, China, November 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1030.

[22] Jian Liu, Yufeng Chen, and Jinan Xu. Machine reading comprehension as data augmentation: A case study on implicit event argument extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2716–2725, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.214.

[23] Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S. Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In International Conference on Learning Representations (ICLR), 2021.

[24] Emmanouil A. Platanios, Hoifung Poon, Tom M. Mitchell, and Eric Horvitz. Estimating accuracy from unlabeled data: A probabilistic logic approach. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4364–4373, Red Hook, NY, USA, 2017. Curran Associates Inc. https://dl.acm.org/doi/10.5555/3294996.3295190.

[25] Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy, July 2019. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1279.
[26] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July 2020. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747.

[27] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6.

[28] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research, 18(1):3846–3912, January 2017. https://dl.acm.org/doi/10.5555/3122009.3176853.