Semantic Annotation of Documents Based on Wikipedia Concepts

Janez Brank, Gregor Leban and Marko Grobelnik
Jožef Stefan Institute, Jamova 39, Ljubljana, Slovenia
E-mail: janez.brank@ijs.si, gregor.leban@ijs.si, marko.grobelnik@ijs.si

Keywords: semantic annotation, wikification, disambiguation, text mining

Received: January 30, 2017

Semantic annotation is the task of augmenting an unstructured textual document with semantic information, such as concepts from an ontology. In wikification, the Wikipedia is used as an ontology and its pages (articles) are regarded as (representations of) concepts. We describe an efficient approach for annotating a document with relevant concepts from the Wikipedia. A global disambiguation method based on constructing a mention-concept graph and computing pagerank over it is used to identify a coherent set of relevant concepts considering the input document as a whole. The presented approach is suitable for parallel processing and can support any language for which a sufficiently large Wikipedia is available. Several heuristics involved in the disambiguation of candidate annotations are discussed and an experimental evaluation of their influence is presented.

Povzetek: Semantično anotiranje je postopek, s katerim bi radi nestrukturirano besedilo dopolnili s semantičnimi informacijami, na primer s koncepti iz neke ontologije. Pri wikifikaciji se kot ontologijo uporablja Wikipedijo, pri čemer strani oz. članke v njej obravnavamo kot predstavitve konceptov. Opisujemo učinkovit pristop za anotiranje besedila z relevantnimi koncepti iz Wikipedije. Pri tem uporabljamo globalen pristop k razdvoumljanju, ki temelji na izgradnji grafa omemb in konceptov ter računanju pageranka na tem grafu, kar je nato podlaga za določitev nabora konceptov, ki se lepo skladajo med seboj in so relevantni glede na vhodno besedilo kot celoto. Opisani pristop je primeren za paralelno procesiranje in deluje za poljuben jezik, v katerem je na voljo dovolj velika Wikipedija. V članku predstavljamo in eksperimentalno ovrednotimo tudi več hevristik, ki se jih lahko uporabi pri razdvoumljanju kandidatov za anotacije.

1 Introduction

Recent years have seen a growth in the use of semantic technologies. However, in many contexts we still deal with largely unstructured textual documents that lack explicit semantic information which might be required for further processing with semantic technologies. This leads to the problem of semantic annotation or semantic enrichment as an important preparatory step before further processing of a document. Given a document and an ontology covering the domain of interest, the challenge is to identify concepts from that ontology that are relevant to the document or that are referred to by it, as well as to identify specific passages in the document where the concepts in question are mentioned.

A specific type of semantic annotation, known as wikification, involves using the Wikipedia as a source of possible semantic annotations [1][2]. In this setting, the Wikipedia is treated as a large and fairly general-purpose ontology: each page is thought of as representing a concept, while the relations between concepts are represented by internal hyperlinks between different Wikipedia pages, as well as by Wikipedia's category memberships and cross-language links.
The advantage of this approach is that the Wikipedia is a freely available source of information: it covers a wide range of topics, has a rich internal structure, and each concept is associated with a semi-structured textual document (i.e. the content of the corresponding Wikipedia article) which can be used to aid in the process of semantic annotation. Furthermore, the Wikipedia is available in a number of languages, with cross-language links being available to identify pages that refer to the same concept in different languages, thus making it easier to support multilingual and cross-lingual annotation.

The remainder of this paper is structured as follows. In Section 2, we present the pagerank-based approach to wikification used in our wikifier. In Section 3, we describe our implementation and some experimental evaluation. Section 4 describes the use of our wikifier in a real-life application, and Section 5 contains conclusions and a discussion of possible future work.

2 Pagerank-based Wikification

The task of wikifying an input document can be broken down into several closely interrelated subtasks: (1) identify phrases (or words) in the input document that refer to a Wikipedia concept; (2) determine which concept exactly a phrase refers to; (3) determine which concepts are relevant enough to the document as a whole that they should be included in the output of the system (i.e. presented to the user).

We follow the approach described by Zhang and Rettinger [1]. This approach makes use of the rich internal structure of hyperlinks between Wikipedia pages. A hyperlink can be thought of as consisting of a source page, a target page, and the link text (also known as the anchor text). If a source page contains a link with the anchor text a and the target page t, this is an indication that the phrase a might be a reference to (or representation of) the concept that corresponds to page t. Thus, if the input document that we're trying to wikify contains the phrase a, it might be the case that this occurrence of a in the input document also constitutes a mention of the concept t, and the concept t is a candidate annotation for this particular phrase.

2.1 Disambiguation

In the Wikipedia, there may be many different links with the same anchor text a, and they might not all be pointing to the same target page. For example, in the English-language Wikipedia, there are links with a = "Tesla" that point to pages about the inventor, the car manufacturer, the unit in physics, a band, a film, and several other concepts. Thus, when such a phrase a occurs in an input document, there are several concepts that can be regarded as candidate annotations for that particular mention, and we have to determine which of them is actually relevant. This is the problem of disambiguation, similar to that of word sense disambiguation in natural language processing.

There are broadly two approaches to disambiguation, local and global. In the local approach, each mention is disambiguated independently of the others, while the global approach aims to treat the document as a whole and disambiguate all the mentions in it as a group. The intuition behind the global approach is that the document that we're annotating is about some topic, and the concepts that we use as annotations should be about that topic as well. If the document contains many mentions that include, as some of their candidate annotations, some car-related concepts, this makes it more likely that we should treat the mention of "Tesla" as a reference to Tesla the car manufacturer as opposed to e.g. a reference to Nikola Tesla or to Tesla the rock band.
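To make the candidate-identification step concrete, here is a minimal sketch of how anchor-text statistics extracted from Wikipedia hyperlinks could be indexed and used to spot candidate mentions in a document. The data layout and function names are illustrative assumptions, not our actual implementation (which uses a trie-based index, as noted in Section 3.1); the per-target fractions computed here correspond to the transition probabilities P(a → c) introduced in Section 2.2.

```python
from collections import defaultdict

def build_anchor_index(wiki_links):
    """Index Wikipedia hyperlinks by anchor text.

    wiki_links: iterable of (anchor_text, target_page) pairs extracted from Wikipedia.
    Returns a dict: anchor_text -> {target_page: number of links with this anchor and target}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for anchor, target in wiki_links:
        index[anchor][target] += 1
    return index

def candidate_annotations(tokens, index, max_phrase_len=4):
    """Find phrases of the document that occur as anchor texts in the Wikipedia and list,
    for each such mention, its candidate concepts with the fraction of links pointing to them."""
    candidates = {}
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_phrase_len, len(tokens)) + 1):
            phrase = " ".join(tokens[i:j])
            if phrase in index:
                total = sum(index[phrase].values())
                candidates[phrase] = {t: n / total for t, n in index[phrase].items()}
    return candidates
```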
2.2 The mention-concept graph

To implement the global disambiguation approach, our Wikifier begins by constructing a mention-concept graph for the input document. (Some authors, e.g. [2], refer to this as a mention-entity graph, but we prefer the term "mention-concept graph" as some of the Wikipedia pages do not necessarily correspond to concepts that we usually think of as entities, and our wikifier does not by default try to exclude them.) This can be thought of as a bipartite graph in which the left set of vertices corresponds to mentions and the right set of vertices corresponds to concepts. A directed edge a → c exists if and only if the concept c is one of the candidate annotations for the mention a (i.e. if, in the Wikipedia, there exists a hyperlink with the anchor text a and the target c). A transition probability is also assigned to each such edge, P(a → c), defined as the ratio [number of hyperlinks in the Wikipedia having the anchor text a and the target c] / [number of hyperlinks in the Wikipedia having the anchor text a].

This graph is then augmented by edges between concepts, the idea being that an edge c → c' should indicate that the concepts c and c' are "semantically related", in the sense that if one of them is relevant to a given input document, the other one is also more likely to be relevant to that document. (For example, the semantic relationship between the concepts "Electric vehicle" and "Tesla Inc." should be much stronger than between the concepts "Electric vehicle" and "Tesla (rock band)".) This measure of semantic relatedness will be used subsequently to encourage the formation of a group of annotations that are semantically related in the sense that they refer to the same topic, which is hopefully also the topic of the document. This would encourage mentions of "Tesla" in a document about electric cars to be annotated with the concept "Tesla Inc." rather than "Tesla (rock band)".

Following [1], the internal link structure of the Wikipedia is used to calculate a measure of semantic relatedness. Informally, the idea is that if c and c' are closely related, then other Wikipedia pages that point to c are likely to also point to c' and vice versa. Let L_c be the set of Wikipedia pages that contain a hyperlink to c, and let N be the total number of concepts in the Wikipedia; then the semantic relatedness of c and c' can be defined as

SR(c, c') = 1 − [log max{|L_c|, |L_{c'}|} − log |L_c ∩ L_{c'}|] / [log N − log min{|L_c|, |L_{c'}|}].

Intuitively, this formula considers two concepts to be semantically related if pages that link to one of them typically also link to the other one (and vice versa). More specifically, SR will be higher if the overlap (i.e. the intersection) of L_c and L_{c'} is large relative to the sizes of L_c and L_{c'}; the normalization by N means that a given degree of overlap counts as stronger evidence of a semantic relationship between c and c' when L_c and L_{c'} are small compared to the total number of pages than when they are very large. In the graph, we add an edge of the form c → c' wherever the semantic relatedness SR(c, c') is > 0. The transition probability of this edge is defined as proportional to the semantic relatedness: P(c → c') = SR(c, c') / Σ_{c''} SR(c, c'').
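The following is a direct transcription of this relatedness measure and of the normalization of the concept-to-concept transition probabilities, under the assumption that the set of pages linking to each concept is available as a Python set; a production implementation would not compute SR naively for all pairs of candidates.

```python
import math

def semantic_relatedness(links_to_c1, links_to_c2, n_pages):
    """SR(c, c') computed from the sets of pages linking to c and c' (see the formula above)."""
    overlap = len(links_to_c1 & links_to_c2)
    if overlap == 0:
        return 0.0
    lo, hi = sorted((len(links_to_c1), len(links_to_c2)))
    sr = 1.0 - (math.log(hi) - math.log(overlap)) / (math.log(n_pages) - math.log(lo))
    return max(sr, 0.0)   # only positive relatedness gives rise to an edge

def concept_transition_probs(concepts, in_links, n_pages):
    """P(c -> c'), proportional to SR(c, c') and normalized over all successors of c."""
    probs = {}
    for c in concepts:
        sr = {c2: semantic_relatedness(in_links[c], in_links[c2], n_pages)
              for c2 in concepts if c2 != c}
        total = sum(sr.values())
        probs[c] = {c2: v / total for c2, v in sr.items() if v > 0} if total > 0 else {}
    return probs
```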
This graph is then used as the basis for calculating a vector of pagerank scores [3], one for each vertex. This is done using the usual iterative approach where, in each iteration, each vertex distributes its pagerank score to its immediate successors in the graph, in proportion to the transition probabilities on its outgoing edges:

PR_new(u) = α · PR_0(u) + (1 − α) · Σ_v PR_old(v) · P(v → u).

The baseline distribution of pagerank, PR_0, is used both to help the process converge and to counterbalance the fact that in our graph there are no edges pointing into the mention vertices. In our case, PR_0(u) is defined as 0 if u is a concept vertex; if u is a mention vertex, we use PR_0(u) = z · [number of Wikipedia pages containing the phrase u as the anchor text of a hyperlink] / [number of Wikipedia pages containing the phrase u], where z is a normalization constant which ensures that Σ_u PR_0(u) = 1. We used α = 0.1 as the stabilization parameter.

The intuition behind this approach is that in each iteration of the pagerank calculation process, pagerank flows into a concept vertex c from mentions that are closely associated with the concept c and from other concepts that are semantically related to c. Thus, after a few iterations, pagerank should tend to accumulate in a set of concepts that are closely semantically related to each other and that are strongly associated with words and phrases that appear in the input document, which is exactly what we want in the context of global disambiguation.

2.3 Using pagerank for disambiguation

Once the pagerank values of all the vertices in the graph have been calculated, we use the pagerank values of concepts to disambiguate the mentions. If there are edges from a mention a to several concepts c, we choose the concept with the highest pagerank as the one that is relevant to this particular mention a. We say that this concept is supported by the mention a. At the end of this process, concepts that are not supported by any mention are discarded as not being relevant to the input document.

The remaining concepts are then sorted in decreasing order of their pagerank. Let the i'th concept in this order be c_i and let its pagerank be PR_i, for i = 1, …, n. Concepts with a very low pagerank value are less likely to be relevant, so it makes sense to apply a further filtering step at this point and discard concepts whose pagerank is below a user-specified threshold. However, where exactly this threshold should lie depends on whether the user wants to prioritize precision or recall. Furthermore, the absolute values of pagerank can vary a lot from one document to another, e.g. depending on the length of the document, the number of mentions and candidate concepts, etc. Thus we apply the user-specified threshold in the following manner: given the user-specified threshold value ρ ∈ [0, 1], we output the concepts c_1, …, c_m, where m is the least integer such that

Σ_{i=1..m} PR_i² ≥ ρ · Σ_{i=1..n} PR_i².

In other words, we report as many top-ranking concepts as are needed to cover a proportion ρ of the total sum of squared pageranks of all the concepts. We use ρ = 0.8 as a broadly reasonable default value, though the user can specify a different threshold depending on their requirements. The motivation for using squares of pageranks instead of the pageranks themselves is to put a greater emphasis on the annotations with the highest values of pagerank, while culling the lower-scoring annotations more thoroughly. In our preliminary experiments, this led to a small improvement in performance compared to using the sums of pageranks without squaring them. For each reported concept, we also output a list of the mentions that support it.
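A minimal sketch of the pagerank iteration and of the squared-pagerank thresholding described in this section is given below. The graph representation, the fixed number of iterations and the lack of a convergence test are simplifying assumptions for illustration only, not the production implementation.

```python
def pagerank(out_edges, pr0, alpha=0.1, iterations=50):
    """Iterate PR_new(u) = alpha * PR_0(u) + (1 - alpha) * sum_v PR_old(v) * P(v -> u).

    out_edges: dict mapping a vertex v to a list of (u, P(v -> u)) pairs.
    pr0:       baseline distribution (zero for concept vertices, sums to 1 over mentions).
    """
    vertices = set(pr0) | set(out_edges) | {u for outs in out_edges.values() for u, _ in outs}
    pr = {u: 1.0 / len(vertices) for u in vertices}           # uniform starting point
    for _ in range(iterations):
        new = {u: alpha * pr0.get(u, 0.0) for u in vertices}  # baseline / teleport part
        for v, outs in out_edges.items():
            for u, p in outs:
                new[u] += (1.0 - alpha) * pr[v] * p           # flow along the edge v -> u
        pr = new
    return pr

def select_concepts(concept_pr, rho=0.8):
    """Report the top-ranking concepts needed to cover a fraction rho of the sum of squared pageranks."""
    ranked = sorted(concept_pr.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(p * p for _, p in ranked)
    kept, acc = [], 0.0
    for c, p in ranked:
        kept.append(c)
        acc += p * p
        if acc >= rho * total:
            break
    return kept
```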
2.4 Treatment of highly ambiguous mentions

Our wikifier supports various minor heuristics and refinements in an effort to improve the performance of the baseline approach described in the preceding sections.

As described above, the anchor text of hyperlinks in the Wikipedia is used to identify mentions in an input document (i.e. words or phrases that may support an annotation). One downside of this approach is that some words or phrases occur as the anchor text of a very large number of hyperlinks in the Wikipedia, and these links point to a large number of different Wikipedia pages. In other words, such a phrase is highly ambiguous; it is not only unlikely to be disambiguated correctly, but also introduces noise into the mention-concept graph by introducing a large number of concept vertices, the vast majority of which will be completely irrelevant to the input document. This also slows down the annotation process by increasing the time needed to calculate the semantic relatedness between all pairs of candidate concepts. (As an example of such a highly ambiguous mention, consider the word "Country". Most of the time, when it appears as the anchor text of a link, it's a link to the concepts "Country" or "Country music", but it also occurs in links to more than a hundred other concepts, mostly individual countries.)

We use several heuristics to deal with this problem. Suppose that a given mention a occurs, in the Wikipedia, as the anchor text of n hyperlinks pointing to k different target pages, and suppose that n_i of these links point to page c_i (for i = 1, …, k). We can now define the entropy of the mention a as the amount of uncertainty regarding the link target given the fact that its anchor text is a:

H(a) = − Σ_{i=1..k} (n_i/n) log(n_i/n).

If this entropy is above a user-specified threshold (e.g. 3 bits), we completely ignore the mention as being too ambiguous to be of any use. For mentions that pass this heuristic, we sort the target pages in decreasing order of n_i and use only the top few of them (e.g. top 20) as candidates in our mention-concept graph. A third heuristic is to ignore candidates for which n_i itself is below a certain threshold (e.g. n_i < 2), the idea being that if such a phrase occurs only once as the anchor text of a link pointing to that candidate, this may well turn out to be noise and is best disregarded.
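A compact sketch of these three heuristics (entropy cut-off, top-k candidates, minimum link count) might look as follows; the function name, the default thresholds and the data layout are illustrative.

```python
import math

def filter_candidates(link_counts, max_entropy=3.0, top_k=20, min_count=2):
    """Apply the ambiguity heuristics to one mention.

    link_counts: dict mapping each candidate concept to n_i, the number of Wikipedia
                 links whose anchor text equals this mention and whose target is that concept.
    Returns the candidate concepts kept for the mention-concept graph (an empty list
    means the mention is ignored as too ambiguous).
    """
    n = sum(link_counts.values())
    if n == 0:
        return []
    # Entropy (in bits) of the link-target distribution for this anchor text.
    entropy = -sum((ni / n) * math.log2(ni / n) for ni in link_counts.values())
    if entropy > max_entropy:
        return []
    ranked = sorted(link_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [c for c, ni in ranked[:top_k] if ni >= min_count]
```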
Optionally, the Wikifier can also be configured to ignore certain types of concepts based on their Wikidata class membership. This can be useful to exclude from consideration Wikipedia pages that do not really correspond to what is usually thought of as entities (e.g. "List of…" pages). Another heuristic that we have found useful in reducing the noise in the output annotations is to ignore any mention that consists entirely of stopwords and/or very common words (the top 200 most frequent words in the Wikipedia for that particular language). For this as well as for other purposes, the text processing is done in a case-sensitive fashion, which e.g. allows us to ignore spurious links with the link text "the" while processing those that refer to the band "The The".

2.5 Miscellaneous heuristics

Semantic relatedness. As mentioned above, the definition of the semantic relatedness of two concepts, SR(c, c'), is based on the overlap between the sets L_c, L_{c'} of immediate predecessors of these two concepts in the Wikipedia link graph. Optionally, our Wikifier can compute semantic relatedness using immediate successors or immediate neighbours (i.e. both predecessors and successors) instead of immediate predecessors. However, our preliminary experiments indicated that these changes do not lead to improvements in performance, so they are disabled by default.

Extensions to disambiguation. Our Wikifier also supports some optional extensions of the disambiguation process. As described above, the default behavior when disambiguating a mention is to simply choose the candidate annotation with the highest pagerank value. Alternatively, after any heuristics from Section 2.4 have been applied, the remaining candidate concepts can be re-ranked using a different scoring function that takes other criteria besides pagerank into account. This is an opportunity to combine the global disambiguation approach with some local techniques. In general, a scoring function of the following type is supported:

score(c|a) = w_1 · f(P(c|a)) · PR(c) + w_2 · S(c, d) + w_3 · LS(c, a).   (1)

Here, a is the mention that we're trying to disambiguate, and c is the candidate concept that we're evaluating. P(c|a) is the probability that a hyperlink in the Wikipedia has c as its target, conditioned on the fact that it has a as its anchor text. f(x) can be either 1 (the default), x, or log(x). PR(c) is the pagerank of c's vertex in the mention-concept graph. S(c, d) is the cosine similarity between the text of the input document d and that of the Wikipedia page for the concept c. LS(c, a) is the cosine similarity between the context (e.g. the previous and next 3 words) in which a appears in the input document d and the contexts in which hyperlinks with the target c appear in the Wikipedia. Finally, w_1, w_2, w_3 are weight constants. However, our preliminary experiments haven't shown sufficient improvements from the addition of these heuristics, so they are disabled by default (f(x) = 1, w_2 = w_3 = 0) to save computational time and memory (storing the link contexts needed for the efficient computation of LS has turned out to be particularly memory intensive).
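A hedged sketch of this second-stage re-ranking is shown below, with a simple bag-of-words cosine similarity standing in for S(c, d) and the link-context term LS omitted (w_3 = 0). The candidate data layout is an assumption; the default parameter values used here (ϑ = 0.6, w_1 = 0, w_2 = 1) correspond to the best-performing combination reported later in Section 3.2, not to the system's defaults.

```python
import math

def cosine(bow_a, bow_b):
    """Cosine similarity between two bag-of-words dictionaries (term -> weight)."""
    dot = sum(bow_a[t] * bow_b[t] for t in bow_a.keys() & bow_b.keys())
    na = math.sqrt(sum(v * v for v in bow_a.values()))
    nb = math.sqrt(sum(v * v for v in bow_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(candidates, doc_bow, theta=0.6, w1=0.0, w2=1.0, f=lambda x: 1.0):
    """Second-stage re-ranking of the candidates for one mention, following eq. (1) with w3 = 0.

    candidates: list of dicts with keys 'concept', 'pagerank', 'p_c_given_a' and 'page_bow'.
    Only candidates whose pagerank is at least theta times the best pagerank are re-ranked.
    """
    best_pr = max(c['pagerank'] for c in candidates)
    shortlist = [c for c in candidates if c['pagerank'] >= theta * best_pr]
    def score(c):
        return w1 * f(c['p_c_given_a']) * c['pagerank'] + w2 * cosine(doc_bow, c['page_bow'])
    return max(shortlist, key=score)['concept']
```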
3 Implementation and evaluation

3.1 Implementation

Our implementation of the approach described in the preceding section is running as a web service and can be accessed at http://wikifier.org. The approach is suitable for parallel processing, as annotating one document is independent of annotating other documents, and any shared data used by the annotation process (e.g. the Wikipedia link graph, and a trie-based data structure that indexes the anchor text of all the hyperlinks) needs to be accessed only for reading and can thus easily be shared by an arbitrary number of worker threads. This allows for highly efficient processing of a large number of documents.

The only need to modify shared data structures arises when a new dump of the Wikipedia becomes available (the Wikipedia publishes new dumps of its content twice per month). We use a separate process to periodically check the Wikipedia web site for new dumps, download them, parse them, and build indexes in a form that can be used by our wikifier. Once the wikifier web service is notified of the availability of a new index, it loads its contents into memory, temporarily stops issuing new requests to worker threads, waits for them all to finish processing their current requests, and then updates the shared data structures to include the new index and discard the old one. In this way, new indices can be brought online without shutting down the service and with a minimal interruption to its availability.

Our implementation currently processes on average more than 500,000 requests per day (the total length of input documents averages about 1.2 GB per day), including all the documents from the JSI Newsfeed service [4]. The output is used, among other things, as a preprocessing step by the Event Registry system [5]. The wikifier currently supports all languages in which a Wikipedia with at least 1000 pages is available, amounting to a total of 134 languages. Admittedly, 1000 pages is much too small to achieve adequate coverage; however, about 60 languages have a Wikipedia with at least 100,000 pages, which is enough for many practical applications.

Annotations are returned in JSON format and can optionally include detailed information about support (which mentions support each annotation), alternative candidate annotations (concepts that were considered as candidates during the disambiguation process but were rejected in favour of some other, higher-scored concept), and Wikidata/DBpedia class membership of the proposed annotations. Thus, the caller can easily implement any desired class-based post-processing.

Our wikifier also allows the user to define custom vocabularies that can be used to generate annotations in addition to the Wikipedia-based annotations described so far. A custom vocabulary is a set of concepts where each concept consists of an ID and a set of one or more words or phrases which, if they appear in the input document, trigger the inclusion of this concept among the annotations. This allows the user to extend the system with custom sets of annotations, but the downside is that such custom annotations are not part of the Wikipedia and thus cannot be included in the usual wikification process, especially not in the pagerank-based disambiguation algorithm.

As a preprocessing step, the user may specify one or more sets of "alternative labels", which are really rewriting rules of the form "w_1 w_2 … w_n → x_1 x_2 … x_m", indicating that the sequence of words w_1 w_2 … w_n may, if it occurs in the input document, be replaced by the sequence x_1 x_2 … x_m prior to the main part of the wikification algorithm. (The word "may" in the preceding sentence means that the original sequence of words from the left-hand side of the rule is also kept in the document. Thus, the document is no longer a simple sequence of words, but may gradually turn into an arbitrary directed acyclic graph, the various paths through which indicate different alternatives into which the text of the document may be brought through the application of the rewriting rules.) Owing to such transformations, candidate mentions might appear in the document that did not appear in the original document. Several such rules may be applied one after another and may affect the same part(s) of a document. Theoretically, such rewriting rules form a Turing-complete formalism, and to keep the problem tractable our wikifier makes only three passes through the document to look for occurrences of left-side word sequences and replace them with the corresponding right-side word sequences. Currently the main use of this mechanism in our wikifier is to provide additional alternative spellings of some proper names in cases where these are not adequately covered in the Wikipedia. This has been found to be particularly useful in the case of names transliterated from languages that use a different script and where several different transliteration schemes are in use.
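Below is a much simplified sketch of the alternative-label mechanism: instead of keeping all alternatives as paths through a directed acyclic graph, it simply materializes the alternative token sequences produced by up to three passes of rule application. The rule shown (two transliterations of a hypothetical proper name) is purely illustrative.

```python
def apply_rewriting_rules(tokens, rules, passes=3):
    """Generate alternative token sequences using rules of the form (w_1 ... w_n) -> (x_1 ... x_m).

    The original sequence is always kept; each generated variant may contribute additional
    candidate mentions. (The real wikifier represents the alternatives as a DAG instead.)
    """
    variants = {tuple(tokens)}
    for _ in range(passes):
        new_variants = set()
        for seq in variants:
            for lhs, rhs in rules:
                n = len(lhs)
                for i in range(len(seq) - n + 1):
                    if seq[i:i + n] == lhs:
                        new_variants.add(seq[:i] + tuple(rhs) + seq[i + n:])
        variants |= new_variants
    return [list(v) for v in variants]

# Example: make a second transliteration of a name available as a candidate mention.
rules = [(("Juri", "Gagarin"), ("Yuri", "Gagarin"))]
print(apply_rewriting_rules("cosmonaut Juri Gagarin visited".split(), rules))
```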
3.2 Evaluation

One way to evaluate wikification is to compare the set of annotations with a manually annotated gold standard for the same document(s). Performance can then be measured using metrics from information retrieval, such as precision, recall, and the F1-measure, which is defined as the harmonic mean of precision and recall. We used two manually annotated datasets:

(Dataset 1) A set of 1393 news articles that was made available by the authors of the AIDA system and was originally used in their experiments [2]. This manually annotated dataset excludes, by design, any annotations that do not correspond to named entities. Since our wikifier does not by default distinguish between named entities and other Wikipedia concepts, we have explicitly excluded concepts that are not named entities (based on their class membership in the Wikidata ontology) from the output of our wikifier for the purposes of this experiment.

(Dataset 2) A set of 491 news articles taken randomly from the JSI Newsfeed [4] on 29 June 2016 and annotated manually with relevant Wikipedia concepts. Unlike the first dataset, the annotations here also included concepts that were not named entities.

In addition to our wikifier, we obtained annotations from the following systems: AIDA [2], Waikato Wikipedia Miner [7], Babelfy [8], Illinois [9], and DBpedia Spotlight [10]. The Waikato system is not included in experiments involving dataset 2 as its web service was no longer available at the time. Tables 1(a) and 1(b) show the agreement not only between each of the wikifiers and the gold standard, but also between each pair of wikifiers (the lower left triangle of the matrix is left empty as it would be just a copy of the upper right triangle, since the F1-measure is symmetric).

As this experiment indicates, on the first dataset (the AIDA dataset) our wikifier ("JSI" in the table) performs slightly worse than AIDA but significantly better than the other wikifiers. On the second dataset (the JSI dataset), the best performance was achieved by Babelfy; ours is slightly worse, while AIDA is significantly worse on this dataset. Thus, overall we can conclude that our wikifier has solid performance across two considerably different datasets. Furthermore, the experiments on both datasets show that there is relatively little agreement between different wikifiers, which indicates that wikification itself is in some sense a vaguely defined task where different people can have very different ideas about whether a particular Wikipedia concept is relevant to a particular input document (and should therefore be included as an annotation) or not, which types of Wikipedia concepts can be considered as annotations (e.g. only named entities or all concepts), etc. Possibly the level of agreement could be improved by fine-tuning the settings of the various wikifiers; in the experiment described above, default settings were used.
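For concreteness, the agreement scores reported below can be computed per document as follows, treating each system's output as a set of concept identifiers (the identifiers in the example are made up).

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 between two sets of annotations for the same document."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example usage on one document:
print(precision_recall_f1({"Tesla,_Inc.", "Electric_vehicle"}, {"Tesla,_Inc.", "Nikola_Tesla"}))
```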
            Gold   JSI    AIDA   Waikato  Babelfy  Illinois  Spotlight
Gold        1.000  0.593  0.723  0.372    0.323    0.476     0.279
JSI                1.000  0.625  0.527    0.431    0.489     0.363
AIDA                      1.000  0.372    0.352    0.434     0.356
Waikato                          1.000    0.481    0.564     0.474
Babelfy                                   1.000    0.434     0.356
Illinois                                           1.000     0.376
Spotlight                                                    1.000

Table 1(a): F1 measure of agreement between the various wikifiers and the gold standard on dataset 1.

            Gold   JSI    AIDA   Babelfy  Illinois  Spotlight
Gold        1.000  0.378  0.197  0.417    0.372     0.282
JSI                1.000  0.278  0.360    0.413     0.397
AIDA                      1.000  0.206    0.283     0.383
Babelfy                          1.000    0.380     0.282
Illinois                                  1.000     0.367
Spotlight                                           1.000

Table 1(b): F1 measure of agreement between the various wikifiers and the gold standard on dataset 2.

We also conducted a small experiment on dataset 2 to compare two forms of the thresholding criterion: one based on the sums of squares of pageranks (as described in Section 2.3) and one based on the sums of the pageranks themselves. The F1-measure between our annotations and the gold standard drops from 0.378 when using squared pageranks to 0.344 when using the pageranks directly. We used squared pageranks for thresholding in all other experiments in this section.

Evaluation of disambiguation heuristics. In the following experiment, we evaluate some of the additional disambiguation heuristics described in Section 2.5. The purpose of the experiment was to find the best-performing combination of the following heuristics and parameters from that section:

(i) Logarithmic link counts: in Section 2.2, we defined the transition probability a → c in the mention-concept graph as being proportional to the number of links in the Wikipedia with the anchor text a and the target c (the "link count" of c given a). Alternatively, it can be defined as being proportional to the logarithm of this link count. The purpose of this heuristic is to provide a kind of smoothing and to discourage too much of the pagerank score from flowing into just one candidate c for that particular mention a. The Wikipedia is known for having various biases in terms of how frequently certain topics are covered, so this sort of smoothing may soften the more extreme differences in the frequency of coverage while still preserving some information about which concepts c are associated more often with a phrase a.

(ii) Set of links used in the computation of semantic relatedness (SR) between two concepts in the Wikipedia link graph: this can be the in-links (the default setting), the out-links, or all neighbours.

(iii) Threshold for re-ranking: in this scenario, the candidates c for a given mention a are first sorted by pagerank; the top few candidates are kept and are then re-ranked using the more detailed (and computationally intensive) scoring function denoted by eq. (1). The question then is what counts as "top few candidates" to be included in the re-ranking. We define this by introducing a parameter ϑ ∈ [0, 1] such that a candidate c proceeds to re-ranking if its pagerank is PR(c) ≥ ϑ · max_{c'} PR(c'), where c' goes over all the candidates for the current mention a.

(iv) Linearization of pagerank in the scoring function denoted by eq. (1) into a linear rank: instead of using the pagerank directly, all the candidate concepts c for a given mention a are sorted by pagerank and a linear rank is assigned to each.
If there are k candidates, the i'th of them in this order gets a linear rank of i/k. This is then used instead of PR(c) in eq. (1), as well as in the ϑ-based thresholding criterion from the previous paragraph (where ϑ then simply becomes the proportion of candidates that proceeds to the re-ranking phase). The purpose is to make sure that the range [0, 1] is covered evenly, instead of the pagerank values possibly being clustered in a small part of that range.

(v) Weight w_2 of the cosine similarity between the input document and the Wikipedia page of a candidate concept, in the scoring function of eq. (1). (The weight w_1 of the candidate concept's pagerank value in the scoring function was then set to 1 − w_2. The weight w_3 of the link context similarity was kept at 0 throughout these experiments, because of the considerable additional memory and time consumption required for the link context computation and because preliminary experiments indicated that the results were not promising.)

The possible values of these five parameters that were investigated in this experiment can be summarized as follows:

(i) linkCounts ∈ {normal, log}
(ii) SR ∈ {in, out, all}
(iii) ϑ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}
(iv) PR ∈ {normal, linearized}
(v) w_2 ∈ {0, 0.25, 0.5, 0.75, 1}

The default settings are: normal link counts, SR = in, ϑ = 1 (no second-stage re-ranking), PR = normal, w_2 = 0.

Table 2 shows, for each possible value of each parameter, the best and the average performance (in terms of the F1-measure relative to the gold standard) that can be achieved by fixing that parameter to that value and allowing the other parameters to range over all the possible values indicated above. For comparison, the last two rows show the performance with no parameters fixed (allowing us to tune the best possible combination of all parameters) and the performance with all parameters fixed at their default values. This experiment was done on Dataset 1 and used 10-fold cross-validation. Nine folds (the training set) were used to tune any parameters that were not held fixed, and the best resulting combination of parameters was then evaluated on the tenth fold (the test set). This was repeated for all ten choices of the test fold. Table 2 shows the average and the standard deviation of the F1 performance on the test fold over all 10 choices of the test fold.
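A sketch of this tuning protocol is shown below. Here evaluate(settings, documents) is a hypothetical placeholder that would run the wikifier with the given parameter settings and return its F1 agreement with the gold standard on the given documents, and folds is a partition of the dataset into 10 folds; neither is part of the actual experiment code.

```python
import itertools
import statistics

# Parameter grid investigated in the experiment.
GRID = {
    "linkCounts": ["normal", "log"],
    "SR": ["in", "out", "all"],
    "theta": [0, 0.2, 0.4, 0.6, 0.8, 1],
    "PR": ["normal", "linearized"],
    "w2": [0, 0.25, 0.5, 0.75, 1],
}

def tune_and_test(folds, evaluate):
    """For each fold: tune the free parameters on the other nine folds, then test on it."""
    names = list(GRID)
    test_scores = []
    for i, test_fold in enumerate(folds):
        train_docs = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        best = max((dict(zip(names, combo)) for combo in itertools.product(*GRID.values())),
                   key=lambda settings: evaluate(settings, train_docs))
        test_scores.append(evaluate(best, test_fold))
    return statistics.mean(test_scores), statistics.stdev(test_scores)
```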
As we can see from this experiment, it is indeed possible to achieve a small improvement in performance by employing some of the heuristics described here. The best-performing combination of parameters was {normal link counts, SR = in, ϑ = 0.6, PR = normal, w_2 = 1}, which resulted in an F1 score of 0.6152, up from the score of 0.5917 achieved by the default parameter values. A paired t-test showed this difference to be significant at a p-value of 0.0005. However, we can also see that, in practical terms, this improvement is small and might not be noticed by the user. Furthermore, it is clear that several of the heuristics employed were in fact counterproductive: using logarithms of link counts to define the mention-concept transition probabilities; using out-links or all neighbours (instead of just in-links) in the definition of semantic relatedness; and using linearized pageranks. Shifting these parameters away from their default settings in fact led to a deterioration of F1 (in all these cases, the deterioration is statistically significant with a p-value of 0.001 or less). Improvements in performance mostly came from re-ranking the most promising candidates (ϑ = 0.6) based on the cosine similarity between the input document and the Wikipedia pages of the candidate concepts. The "Avg. F1" column of Table 2 shows that no parameter by itself can ensure good performance unless the other parameters are also chosen suitably, as the average performance over all the combinations of the other parameters is poor regardless of which parameter has been fixed and at what value.

Parameter                     Avg. F1            Max. F1
linkCounts = normal           0.5883 ± 0.0253    0.6152 ± 0.0239
linkCounts = log              0.5368 ± 0.0268    0.5833 ± 0.0221
SR = in                       0.5800 ± 0.0232    0.6152 ± 0.0239
SR = out                      0.5626 ± 0.0265    0.5945 ± 0.0253
SR = all                      0.5451 ± 0.0283    0.5927 ± 0.0268
PR = normal                   0.5644 ± 0.0261    0.6152 ± 0.0239
PR = linearized               0.5607 ± 0.0259    0.5978 ± 0.0223
ϑ = 0                         0.5604 ± 0.0256    0.5974 ± 0.0223
ϑ = 0.2                       0.5614 ± 0.0256    0.5982 ± 0.0221
ϑ = 0.4                       0.5634 ± 0.0258    0.6054 ± 0.0225
ϑ = 0.6                       0.5649 ± 0.0259    0.6152 ± 0.0239
ϑ = 0.8                       0.5642 ± 0.0260    0.6004 ± 0.0216
ϑ = 1                         0.5610 ± 0.0273    0.5939 ± 0.0280
w_2 = 0                       0.5610 ± 0.0273    0.5939 ± 0.0280
w_2 = 0.25                    0.5646 ± 0.0266    0.5979 ± 0.0209
w_2 = 0.5                     0.5655 ± 0.0265    0.6060 ± 0.0228
w_2 = 0.75                    0.5654 ± 0.0253    0.6128 ± 0.0248
w_2 = 1                       0.5563 ± 0.0245    0.6152 ± 0.0239
Nothing fixed                 0.5626 ± 0.0260    0.6152 ± 0.0239
All fixed to default values   0.5917 ± 0.0226    0.5917 ± 0.0226

Table 2: F1 measure of agreement between our wikifier and the gold standard while keeping one parameter fixed and tuning the others. "Avg. F1" shows the average performance over all possible combinations of the non-fixed parameters; "Max. F1" shows the best performance achieved by tuning the non-fixed parameters on the training folds. Both columns show the F1 performance on the test fold. Since cross-validation was used, the performances are shown in the form "average ± standard deviation" over all 10 possible splits of the data into 9 training folds and 1 test fold.

4 Use in a real-life application

Semantically annotating documents can be of high importance in several real-life applications. An example of such an application is Event Registry [5]. Event Registry is a system that collects and analyzes news content generated globally and identifies the world events mentioned in the news. As the application aims to extract knowledge in structured form from the unstructured text, we will now describe how the Wikifier's semantic annotations provide the critical input required by the system.

For each news article, Event Registry stores the list of identified semantic annotations. Among other things, this allows users to search for news content using semantic tags rather than keywords, as we are used to in search engines. The main advantages of using tags versus keywords are that one can e.g. (a) specifically ask for articles about apple, the fruit, versus Apple, the company, (b) find articles about IBM regardless of how it is mentioned in the news articles ("IBM", "I.B.M.", "International Business Machines", etc.), and (c) find articles about the White House in any language. The last use case is possible because the Wikipedia also maintains information on which Wikipedia pages in different languages represent the same concept. Consequently, the tag for "White House" in an English article will be the same as the tag for "Casa Blanca" in a Spanish article.

When Event Registry identifies a group of news articles that represent the same event, it uses the semantic information in the news articles to determine the core event information. First, it analyzes all news articles in the event and calculates how frequently individual concepts appear in these articles. A ranked list of commonly mentioned concepts is then used as a semantic summary or a "fingerprint" of the event.

Another critical piece of information about the event is its geographical location. In order to determine the location, Event Registry again analyzes the concepts mentioned in the news articles about the event, and considers as possible candidates only those that refer to a geographical location. For each candidate location, a set of learning features is extracted. The learning features that we extract are as follows:

- Mentions of the location in the articles about the event. The value is simply the ratio of the number of news articles about the event that mention the location somewhere in the text to the total number of articles about the event.
- Mentions of the location in the dateline (the beginning of the article). This feature is computed as the ratio of the number of articles about the event in which the location is mentioned in the dateline to the total number of articles about the event.
- Normalized versions of the previous two features.
  In this case we compute a variation of the previous two features, where we don't compute a simple ratio, but weight the contribution of an individual article by the cosine similarity of the article to the centroid of the event. Articles closer to the centroid (more relevant articles) therefore contribute more to the final feature value.
- Commonality of the location: how frequently the location is generally present in the news articles. The value is computed as the ratio of the number of articles in Event Registry that mention this location to the total number of articles.

Based on these features, a logistic regression model computes, for each candidate location, the probability that it is the location of the event. If the probability of the most likely candidate is above a predetermined threshold, that candidate is chosen as the location of the event. The logistic model was trained on 1239 manually labeled events and achieves 96.2% classification accuracy.

Semantic annotations are also of high importance when a search is performed and a large number of results needs to be summarized. An example of such a summary is displayed in Figure 1, where we searched for events about hurricanes. The resulting list, which contained over 23,000 events, was summarized as shown in the figure to illustrate the top concepts mentioned in these events.

Figure 1: Summary of top concepts in events about hurricanes.
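As an illustration of this location-classification step, the sketch below computes the feature vector described above for one candidate location and feeds such vectors to a logistic regression model. The data layout, the variable names and the use of scikit-learn are illustrative assumptions, not Event Registry's actual implementation.

```python
from sklearn.linear_model import LogisticRegression

def location_features(article_flags, centroid_sims, corpus_mentions, corpus_size):
    """Feature vector for one candidate location of an event.

    article_flags: list of (mentioned_in_body, mentioned_in_dateline) booleans, one per article.
    centroid_sims: cosine similarity of each article to the centroid of the event.
    corpus_mentions, corpus_size: articles in Event Registry mentioning this location / all articles.
    """
    n = len(article_flags)
    body = sum(1 for m, _ in article_flags if m) / n
    dateline = sum(1 for _, d in article_flags if d) / n
    total_sim = sum(centroid_sims)
    body_w = sum(s for (m, _), s in zip(article_flags, centroid_sims) if m) / total_sim
    dateline_w = sum(s for (_, d), s in zip(article_flags, centroid_sims) if d) / total_sim
    commonality = corpus_mentions / corpus_size
    return [body, dateline, body_w, dateline_w, commonality]

# A logistic regression model is trained on feature vectors of labelled events; at prediction
# time, the candidate with the highest probability is chosen if it exceeds a threshold.
model = LogisticRegression()
# model.fit(training_features, training_labels)
# probabilities = model.predict_proba(candidate_features)[:, 1]
```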
5 Conclusions and future work

We have presented a practical and efficient approach to wikification that requires no external data except the Wikipedia itself, can deal with documents in any language for which the Wikipedia is available, and is suitable for a high-performance, parallelized implementation.

The approach presented here could be improved along several directions. One significant weakness of the current approach concerns the treatment of minority languages. When dealing with a document in a certain language, we need hyperlinks whose anchor text is in the same language if we are to identify mentions in this input document. Thus, if the document is in a language for which the Wikipedia is not available, it cannot be wikified using this approach; similarly, if the Wikipedia is available in this language but is small, with a small amount of text, a low number of pages, and generally poor coverage, the performance of wikification will be low. One idea to alleviate this problem is to optionally allow a second stage of processing, in which Wikipedias in languages other than the language of the input document would also be used to identify mentions and provide candidate annotations. This might particularly improve the coverage of concepts that are referred to by the same words or phrases across multiple languages, as is the case with some types of named entities. For the purposes of pagerank-based disambiguation in this second stage, a large common link graph would have to be constructed by merging the link graphs of the Wikipedias for different languages. This can be done by using the cross-language links available in the Wikidata ontology, which provide information about when different pages in different languages refer to the same concept.

Another interesting direction for further work would be to incorporate local disambiguation techniques as a way to augment the current global disambiguation approach. When evaluating whether a mention a in the input document refers to a particular concept c, the local approach would focus on comparing the context of a to either the text of the Wikipedia page for c, or to the contexts in which hyperlinks to c occur within the Wikipedia. Preliminary steps taken in this direction in Section 2.5 did not lead to improvements in performance, but this subject is worth exploring further. Instead of the bag-of-words representation of contexts, other vector representations of words could be used, e.g. word2vec [6].

6 Acknowledgement

This work was supported by the Slovenian Research Agency as well as the euBusinessGraph (ICT-732003-IA), EW-Shopp (ICT-732590-IA) and RENOIR (H2020-MSCA-RISE-691152) projects.

7 References

[1] L. Zhang, A. Rettinger. Final ontological word-sense-disambiguation prototype. Deliverable D3.2.3, xLike Project, October 2014.
[2] J. Hoffart, M. A. Yosef, I. Bordino, et al. Robust disambiguation of named entities in text. Proc. of the 2011 Conf. on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, 2011, pp. 782–792.
[3] L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank citation ranking: Bringing order to the web. Digital Libraries Project Report SIDL-WP-1999-0120, Stanford University, 1998.
[4] M. Trampuš, B. Novak. Internals of an aggregated web news feed. Proc. SiKDD 2012.
[5] G. Leban, B. Fortuna, J. Brank, M. Grobelnik. Event registry: Learning about world events from news. Proc. of the 23rd Int. Conf. on the World Wide Web (WWW 2014), pp. 107–110.
[6] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs.CL], 2013.
[7] D. Milne, I. H. Witten. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 194:222–239 (January 2013).
[8] A. Moro, A. Raganato, R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Trans. of the Assoc. for Comp. Linguistics, 2:231–234 (2014).
[9] L. Ratinov, D. Roth, D. Downey, M. Anderson. Local and global algorithms for disambiguation to Wikipedia. Proc. of the 49th Annual Meeting of the Assoc. for Comp. Linguistics: Human Language Technologies (2011), pp. 1375–1384.
[10] J. Daiber, M. Jakob, C. Hokamp, P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. Proc. of the 9th Int. Conf. on Semantic Systems, 2013.