Informatica 42 (2018) 85–93 85 Graph Theoretical View on Text Understanding Jure Zupan National Institute of Chemistry, Ljubljana E-mail: jure.zupan@ki.si Keywords: graph theory, cyclic-connected graph, topological distance, network text analysis, information content Received: October 6, 2017 The system STAVEK-02 described in the contribution is concentrated on yielding supplemental information (besides parsing/tagging of words) for text understanding through the clustering of nouns and/or verbs according to their meanings and common features. The system consists of two word processing blocks. The first block is a vocabulary of 149,000 Slovenian word-roots and 3,100 endings and assigns the grammatical feature to the words by the grammatical rules without any link to pre- tagged lexical corpora. The second block is a Network of meanings of Slovenian words which in principle is a graph connecting 45,000 and 15,000 noun and verb lexemes, respectively, all of them hierarchically clustered into larger and larger groups having /exhibiting specific features and/or common properties of the words encompassed Such formations are in a similar lexical systems usually called synsets. Due to the complete connectivity between the synsets (groups) in the graph it is possible to find all possible property/feature paths between any pair of two words (nouns and/or verbs) in the network. Because clustering of words according to their meanings is made during the parsing of one, a pair, or several consecutive sentences, the features and properties that appear on the closest path between the particular words within the sentence are quite informative for their interpretation of the text. Clustering of the words according to their meanings during the parsing of text is a novel concept of the text interpretation. Ob the basis of a simple example of parsing a sentence and clustering of the nouns within it the concept using the network of meanings in the program STAVEK-02 is described and discussed. Povzetek: Opisani sistem STAVEK-02 je orientiran na širše izločanje informacij iz slovenskih besedil, kot je samo besedna analiza in označevanje besed. Osnova sta dva programska dela. Prvega sestavljata podatkovna baza (149.000 korenov besed in 3.100 končnic), drugega pa 45.000 samostalnikov in 15.000 glagolov, ki so s skupinami teh besed grupirani po različnih skupnih značilnostih v ciklični graf (connected cyclic graph). Prvi del izvrši slovnično označevanje besed v tekstu, drugi pa med posameznimi besedami, ali v grafu hierarhično povezanih skupin besed (synsets) s podobnimi lastnostmi in značilnostmi izračuna topološke razdalje in nariše shemo povezovanja skupin samostalnikov ali glagolov. Izkazalo se je, da topološko izračunana razdalja med besedami dobro predstavi pomensko razliko/sličnost med njimi. Obe besedni zbirki skupaj vsebujeta in obdelujeta pretežni del najpogostejših slovenskih besed (cca 149.000 slovenskih besed). V prispevku so razložene nekatere pasti slovenščine pri obvladovanju več-smiselnosti besedila. Opisana je tudi struktura cikličnega grafa besed (samostalnikov in glagolov) in način izračuna topološke razdalje med besedami Poudarjena je dvosmernost poti in sprehodov (paths and walks) v omenjenem grafu besed. Dodan je kratek primer analize stavka, ki se konča z matriko topoloških razdalj med besedami stavka in drevesom podobnosti. Na koncu so omenjene nekatere možnosti razvoja sistema STAVEK-02 in hierarhične mreže za določanje pomenov slovenskih besed. 1 Introduction The parsing or tagging of words in the sentence provides the user with all relevant grammatical features of each word, which itself is a very hard task to implement either by the computer or by hand alone. The fact hat most of the modern parsing programs today rely on large corpora of previously parsed data does not mean that the efforts and programs solving the tagging of sentences by hand are either unnecessary or outmoded. Even if one forget that the testing of parsing-algorithms based on previously parsed corpora first relay on the hand-made parsing, the ab-initio, i.e., parsing by exclusively using grammatical rules will always be necessary. It should not be forgotten that statistical solutions mostly ignore the occurrences of rare specific cases. Such problems can be solved easier by considering and combining both methods (corpora driven and rule-based tagging) consecutively and/or iteratively. For example: the problem of the words having two or more clearly different meanings of which at least two can have grammatically correct but for any kind of machine parsing or rule-based tagging completely indistinguishable forms. Unfortunately, in Slavic languages with a much higher degree of flexibility 86 Informatica 42 (2018) 85–93 J. Zupan of words than in English the problems of the word senses begin already on the parsing level. In the case of a gram- matically correct sentence with two completely different interpretations of word senses it is possible that no parsing can correctly identify even the word classes of the constituent words, not to talk about the senses. The possible solution of such problems is to list all possible meanings or senses of each word and leave this information for further consideration when the context of the following sentences allow to single-out the actual meaning. For example, neither the sentence To je dobro za vas nor the title of the well-known Slovenian story Martin Krpan can be tagged correctly by the computer. In the first case the word vas can be interpreted either as for you or, alternatively, as the village, hence, the sentence can mean either: This is good for you, or This is good for the village. In the second example, the title of the well known Slovenian story Matin Krpan introduces the name of the main character. However, the title has, unfortunately, a second grammatically correct meaning of the word Martin, not as a noun (name Martin) but as the adjective meaning belonging to female Marta, which implies that man of the name Krpan is a husband of Marta or at least involved with Marta. Of course, the machine interpretation based on the pre-tagged corpora will always yield grammatically ‘correct’, i.e., the most often used variation, but at the same time always omit the less probable, but grammatically correct possibilities, witch nevertheless can appear in the spoken or written communication, and should therefore be at least considered. Such cases are handled better by the rule- based tagging compared to the statistical ones. In order to bring attention to such possibilities and to provide the tool for helping the developers of man- machine dialog to handle such cases the program STAVEK-02 with options of showing all grammatical possibilities and additionally provide the user with clusters of various word meanings at each sentence (or group of sentences) was developed and is described in this paper. 2 Related work The most closely related system to the PMSB (Pomenska mreža slovenskih besed [1], (Engl. Network of Meanings of Slovenian Words) used by the program STAVEK-02 is the well-known WordNet [2,3] lexical collection developed by the Princeton University with its graphic visualization VisuWords [4] based on the Thinkmap, data visualization technology. In order to handle the difficulties in the cross-language differences in the meanings of lexical words the Universal Word Net (UWN) Project was launched [5,6]. According to the UWN suggestions and guidelines specific versions for close to 200 different languages are now under development. Similar to the other Slavic languages (see Polish [7], or Bulgarian [8], for example) the Slovenian version named sloWNet [9] is as well progressing. At the moment the version described in the present paper is not included into sloWNet. There are several features of the PMSB that are similar to the WordNet but some of them are not. The organization of synsets for nouns in the hipo- hypero-, mero-, and holonym groups (the word A is a meronym of B if A is a part of B; the nose is a part of head, while head is a holonym of nose) is very similar, while the verbs in PMSB follow closely the six branch division (to exist, to have, to move, to do/to, to think/to create, and to sense/to) as suggested by Vidovič Muha [10] is quite different. The way the distances between the word senses in PMSB are calculated compared to the similarity evaluation between two synsets in WordNet is practically the same: it calculates the length of the shortest path between two nodes in the graph. It is worthwhile to mention that the distance measure used in our case is the length of the shortest path between two nodes (synsets) in a graph. This graph theoretical path distance is not related to the distances between objects (words) represented by the multi-dimensional distributed representations of word vectors as obtained by the word2vector software [11] developed by Thomas Mikolev at Google. The number of words and meanings (synsets), 60,000 and 110,000, respectively, in PMSB is already large enough to cover a large variety of texts. A considerable difference with WordNet is in the design of our network STAVEK-02. Although the PMSB can act as a stand-alone program in the role of a sort of thesaurus of Slovenian language, its is actually designed as a subroutine to support the system STAVEK-02 which goal is to enhance and/or to improve the machine-man dialog, by pinpointing and/or explaining the meanings of specific words. The mentioned goal can be clearly seen through the selection of hyper- and hyponym groups of the PMSB network which is described in the following paragraph more in detail. 3 Hierarchical Network of Meanings of Slovenian Words (PMSB) The solution to the discussed information enhancing problem seems to be the organization of words into network of words linked according to the common features or some other commonly present or absent property(ies). Therefore, the links (branches) between nodes in the graph must contain meaningful information about the relation between the nodes they connect. For example: if one node is labeled tool and the other one object (man-made) the link between them must exhibit the property that the first node (synset) labeled tools is a part of the second node labeled all man-made object) and not vice versa. At the same time these two nodes should occupy positions in the work much closer to each other than they have to the synset labeled insect, for example. Either individual words or clusters of words could simultaneously be members of several groups (synsets with larger number of meanings) what makes the network to contain cyclic paths (circular paths between clusters) in the structure (Figure 1). Graph Theoretical View on Text Understanding Informatica 42 (2018) 85–93 87 The PMSB Network consists of 45,000 noun and 15,000 verb dictionary lexemes (words) forming 85,000 and 25,000 different entries of noun and verb meanings, respectively. For example, if ‘konj’ (Engl. horse) is one of the 45,000 lexemes the four senses of the word ‘horse’ in Slovenian language (horse – an animal, horse – a clumsy man, horse – a chess-piece, and horse – a gymnastic equipment, paddle-horse) are four of 85,000 noun meanings or senses. Using the above kind of reasoning, a graph of about their meanings and properties containing close to 4,500 clusters of words (nodes) was generated [1]. The closest collection to our database is the Levine’s collection of verb classes [12] and Dornseiff’s Wortschatz [13]. There are various Internet versions like WordNet [2,3]) and for the Slovenian language the sloWNet [9]. What the size, i.e. the number of words is concerned; only the Dornseiff’s [13] collection has about the same number of verbs (14,000) as our collection. The part of our network VERBS (24,626) Verbs of existing (3,405) to exist on a specific way (542), verbs to sustain living (1,427), to end existence (299), emission verbs (949), weather verbs (187) Verbs of having (1,339)) to posses (154), to obtain/take (333), to use possession (288), to negotiate possession (461), to spend possession (102) Verbs of moving (3,129) to move (general) (804), to move (specific way) (692), to move (body/parts) (629), to arrive/leave (676), to change movement (206), to do while moving (121) Verbs of doing (9,663) to put (2,416), to do (general) (669), to assemble/disassemble (1,340), to change (2,164), to use force/influence (1,322), to do complex tasks (1,751) Verbs of thinking/creating (1,583) to create (intellectually) (550), to think (general) (145), to think (specific) (407), to expressing thoughts with symbols (480), Verbs of communication (5,507) to exchange of information (2,770), verbs of perception (322), to have/response to feelings (883), verbs of social contact (1,531), NOUNS (86,799) nature (31,988) nature (non-living)(3,130) is divided into: nature (general) (10), nature (phenomenon) (521), nature (physical parameter) (151), nature (space) (82), matter (general) (1,359), matter (Earth) (933), matter (outer-space) (84) nature (living) (28,847) is divided into: nature (general/broader) (4,218), nature (plant kingdom) (3,111), nature (animal kingdom) (3,431), nature (human) (18,087) product (19,222) product (origin) (552) divided into: product (origin (human)) (40), product (origin (nature)) (53), product (origin (plant)) (258), product (origin (animal)) (201) product (human) (18,670) divided into: product (human (material)) (13,190), product (human (intellectual)) (5,352) product (human (commodity)) (29), creation (general) (5), creation (limitation) (94) concept (35.589) activity (11,645) is divided into: activity (general) (101), activity (to do something) (3,507), activity (society) (3,045), activity (emotion) (76), activity (sense) (15), activity (existence) (1,068), activity (movement) (1,240), activity (communication) (1,912), activity (possession) (582), activity (mind) (97) property (5,943) is divided into: property (action) (323), property (animal) (45), property (broader meaning) (357), property (company) (17), property (device) (90), property (form) (62), property (general) (37), property (human) (2,774), property (mind) (128), property (matter) (267), property (nation) (35), property (number) (13), property (object) (482), property (phenomenon) (42), property (plant) (34), property (procedure) (390), property (religion) (15), property (ruling) (52), property (society) (111), property (sound) (39), property (space) (309), property (status) (159) property (word/speech) (123), group of properties (38), and 8 other groups: event (1,208), form (3,169), group (1,958), phenomenon (526), procedure (992), result (5,342), space (1,532), state (2,910). Table 1. The first two levels of verbs (upper part of the Table 1) and nouns (lower part of the table) are shown according to their common features. In the parentheses the number of words in each group is given. Because individual word can have several meanings or senses it is listed in as many groups (synsets) as there are meanings. Therefore, the sum of words given in parenthesis is larger than the number of meanings in the network. The largest groups are printed bold. 88 Informatica 42 (2018) 85–93 J. Zupan containing verbs is based on six main groups [10] and is already well described in the literature [14,15] and is accessible on the web [16]. The complete structure of verb hierarchy in English language (16,000 verbs and 1000 groups) is given in [17]. The basic division of nouns has three groups: the product, the nature, and the concept. It can be seen from second part of Table 1. The clusters of verbs and nouns in all levels of hierarchy are of very different sizes (Table 1). On the contrast to the English language, the Slovenian lexical forms of verbs can be well distinguished from those of nouns, however, due to high flexibility of Slovenian declination and conjugation (approximately 20 per each noun, verb, adjective, pronoun, and numeral) there are numerous cases where two or even three word types mix. For example the sentence To je lepo padalo has two meanings: a) This is a nice parachute and b) It was falling nicely. In the first case the word padalo is a noun (parachute) while in the second case it is the verb (to fall). To have all words together in one network (graph) both word types are linked in the network on the highest node. It is worthwhile to mention that the same word in different languages has different synsets of meaning. This is the reason why such a hierarchy cannot be ‘blue- printed’ from one to another language. The effect of ‘lost with translation’ is unavoidable: each translated word could be connected to completely different clusters of words. For example, the English word plant in its botanical meaning can be linked with Slovenian counterpart rastlina, or German Pflanze, but has no connection to the second sense of a production place like Slovenian tovarna or German Fabrik). 4 Semantic distance measure Mathematically, the network is a connected cyclic bi- directional graph. Vertices or nodes represent single words, meanings and/or or clusters of words with similar properties/features (synsets). The connected graph enables a continuous walk, described as a sequence of connected nodes (path), between any two nodes. The graph is cyclic if it contains closed paths (cycles), i.e., paths that starts and ends on the same node) with all nodes on that path different (with exception of the closing node). Hierarchical graph has one special node called top node Ntop or root, distinguished from the other ones by defining the orientation of the graph and walk directions within it. All valid paths between nodes must have one of the two directions: either towards the Ntop (up) or backwards from (down). Therefore, each node must have two lists for connections, to up and to down connected neighbors, respectively. Similar to the Ntop which is the last node of all up-paths, so at the end of any down-paths is always a node called terminal, having no down directions. The terminal nodes are individual words or senses if the word has only one sense (meaning). The fact that the walk path is not allowed to change direction assures that from any node one can always reach either a terminal node or the Ntop. Thus no walk with the constant direction could be captured in a cycle and thus end in an infinite loop. In the case of update of new words or relocation of nodes the described hierarchy prevents updates to generate infinite loops and self- referencing nodes. All the explained features of our graph offer the advantage of calculation the topological distance between the nodes. The topological distance Dij between two nodes Ni and Nj has all four properties classifying it as a standard metric distance: 1) Dij > 0 for all i ≠ j 2) Dij = 0 only for i = j 3) Dij = Dji, the distance is symmetrical, and 4) Dij ≤ Dik + Dkj triangle rule for any node k To evaluate all topological distance Dij between two arbitrary nodes Ni and Nj in the graph, one needs a complete connectivity matrix of order (Ni×Nj). For a graph containing approximately 105 nodes this means storing and handling the matrix of about 0.5×1010 distances. Fortunately, instead of keeping this large connectivity and/or distance matrix, only two connectivity tables one for keeping all up and the other one keeping all down connections from each node to neighboring nodes are needed. Using these two connectivity tables it is straightforward to determine topological distance between any two nodes Ni and Nj or words i and j, respectively. The procedure is as follows: 1. Find the complete set {Pi (Ni,Ntop)} of ni paths from the node Ni to the node Ntop. 2. Find the complete set {Pj (Ni,Ntop)} of nj paths from the node Ni to the node Ntop. 3. Compare k pairs of paths from both sets {Pi (Ni,Ntop), Pj (Nj,Ntop)}, k = 1...ni(nj – 1)/2 and for each pair determine the common node Ck 4. Determine the length lk of the path from node Ni to the node Nj passing node Ck for each pair k. 5. Keep the shortest path. To summarize: the distance Dij between two nodes Ni and Nj is the length lk of the shortest path from node Ni to the node Nj through the common node Ck, from which both nodes Ni and Nj have access to Ntop: Dij = min { lk } of {Pk(Ni,C,Nj) }, k = 1...ni(nj – 1)/2 /1/ where ni is the number of different paths from the node Ni to the top node Ntop; and P(A,C,B) is the path from node A to node B passing node C. 5 The Case Study The described system STAVEK-02 can serve as a model how to use the PMSB hierarchy of word meanings and synsets for enhancing the information in free text. The system can handle individual sentences input by the keyboard or text files of any size. The system handles sentences one by one, hence, the information are reported Graph Theoretical View on Text Understanding Informatica 42 (2018) 85–93 89 . Figure 1. A simplified part of the discussed network of words showing essential features of a cyclic bi-directional graph. Each label represents a node (synset). A cycle is a path that starts and ends on the same node. From the word kobilica having 4 meanings in Slovenian language, six cycles can be drawn to calculate six distances between all four meanings. Because the graph is 2-directional, only the paths in up or down to the Ntop (opposite to arrows) or to the terminal nodes (words, along arrows), respectively, are allowed. The cycles are detected via the common nodes Ck on the paths. at the end of each sentence. First, all grammatical information for all words in the sentence, are reported (Part A in Figure 2). This part, of the tagged text is similar in the content, but quite different in the form to the output provided by the public Slovenian parser [18] available on the Slovenian ZRC portal. All tasks performed by the parser are executed ab initio, i.e., by the grammatical rules without considering any corpus or web connection. For highly flexible language like Slovenian, the parser is very important, because for tagging of words it must convert the words into appropriate lexemes, what is mandatory form for further search in the PMSB network. Our tagging system uses 149,000 Slovenian word-roots for all types of words and combines them with about 3,100 different endings in order to obtain the correct lemma for any Slovenian word in SSKJ [19]. Second, for each noun and each verb appearing in the input sentence(s) the system searches through PMSB network and lists all the meanings/senses associated with any noun and verb together with the complete list of synsets from the lemmas to the top of the network. The list of senses for all words together with full paths through the PMSB is output after each sentence (part B in Figure 2). Third, the program draws a clustering scheme of up to 500 nouns or verbs together with all their senses. The cluster-tree of senses is built into the system as an option and can be performed a) after each sentence, b) after a certain number of sentences providing the number of part of nature nature (animal kingdom) (human being) product (material) nature (live) Ntop Noun Verb Product Nature Concept part of non-vertebra mammal (specific product) part of part of insect horse (musical (vessel) instrument) bridge keel locust mare on the violin (slov. kobilica - the word has four meanings) 90 Informatica 42 (2018) 85–93 J. Zupan words does not exceed 500, or c) at the end of parsing a text file after the user can selects up-to 500 nouns or verbs from the list of the most frequent word types of the scanned text. Finally, at the end of each session (either for one sentence or for the text file) the program yields a) statistics of the input text with respect to the word frequencies of all word types and separators, b) the distribution of word-lengths (in characters) of each word Part A =================================== 1 Na; na; predlog, (4,5); 2 violini; violina; sam., f(d/4)(e/5); 3 je; biti; pom.gl., (3os./f/e) pretekli čas 4 počila; počiti; gl., nedov. nepov. (3os./f/e) pretekli čas 5 struna; struna; sam., f(e/1); 6 . ločilo, pika; ----- 1 Kaj; kaj; prislov, (vzročnostni); 2 je; biti; pom.gl., (3os./m/e) pretekli čas 3 bil; biti; gl, nedov. nepov. (3os./m/e) pretekli čas 4 vzrok; vzrok; sam., m(e/1)(e/4); 5 ? ločilo, vprašaj; ----- 1 Kobilica; kobilica; sam., f(e/1); 2 je; biti; pom.gl, (3os./f/e) pretekli čas 3 bila; biti; gl, nedov. nepov. (3os./f/e) pretekli čas 4 poškodovana; poškodovan; pridevnik, (m/d/1)(m/d/4)(f/e/1) 5 . ločilo, pika; =================================== Part B =================================== /001/01: violina (violin); strings; instrument (musical (specific)), instrument (musical); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop. /002/01: struna (string): part of (musical instrument)), instrument (musical); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop. /002/02: struna (string); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop. /003/01: vzrok (cause); factor; measure (specific); creation (measure/unit); creation (intellectual); product/creation; Noun; Ntop. /004/01/ kobilica (violin’s bridge): part of (musical instrument); instrument (musical); product (sound emitting); product (communication); product (material); product/creation; Noun; Ntop. /004/02/ kobilica (keel): part of (vessel); part of (specific device); product (machine/device); product (general part); product (material); product/creation; Noun; Ntop. /004/03/ kobilica (locust): insect; insect (pterygota); insect (arthropoda); insect (general); antropoda; non-vertebra; nature (animal taxonomy); nature (animal kingdom); nature; Noun; Ntop. /004/04/ kobilica (locust): insect; insect (pterygota); arthropoda; polimeria; animal (common name); nature (animal kingdom); nature; Noun; Ntop. /004/05/ kobilica: mare; horse (animal (general)); horse (animal); animal (domestic); animal (property); nature (animal kingdom); nature; Noun, Ntop. /004/06/ kobilica: mare; horse (animal (general)); horse (animal); odd-toed ungulate; mammal; vertebra; chordata, nature (animal-taxonomy); nature (animal kingdom); nature; Noun, Ntop. =================================== Figure 2: Output of the program STAVEK-02 after the input of three sentences representing a short dialog. Na violini je počila struna. Kaj je bil vzrok? Kobilica je bila poškodovana. (Eng.: The string on the violin broke. What was the cause? The bridge was damaged. The word types are nouns (sam.), verbs (gl.), adverbs (prislov), adjective (pridevnik), the letters m, f, os, e, and d stand for (masculine, feminine, person, singular, and dual), respectively; the numbers mark the falls. Part B shows ten chains of nodes (synsets) of words and meanings from the PMSB network as used for the distance matrix D and dendrogram calculations (see Figure 3). Ntop is the top node of the PMSB hierarchy of meanings. In the actual output of program STAVEK-02 the synsets assigned to words of one sentence are printed immediately after one of the main three punctuation marks (full stop, question mark, or exclamation mark) is encountered. Graph Theoretical View on Text Understanding Informatica 42 (2018) 85–93 91 type, and c) the frequency is of 2000 most frequently used nouns, adjectives, verbs, and adverbs. In order to show the entire procedure more in detail the output as given by the system STAVEK-02 for three short consecutive sentences is worked out and discussed more in detail. The three sentences in English translation are: The string on the violin broke. What was the cause? The bridge was damaged. (slov. Struna na violini je počila. Kaj je bil vzrok? Kobilica je bila poškodovana.) (Figure 2, parts A and B). This particular example using the word kobilica in two separate sentences was chosen deliberately to show how the graph-theoretical distances (Figure 2. and Figure 3) as obtained by the PMSB network could correctly determine the sense of a word. Similar to English the word bridge having several senses, the Slovenian word kobilica has been coded by six synsets in PMSB. It has four (4) main senses (locust, keel, mare, and the bridge on the violin) of which both animal senses have two synset paths for showing the relevant taxonomies of both species. (Figure 2, part B). There is not much to say about tagging shown as part A in Figure 2), however, the tagging the second word violini as singular locative (e5) is a good example showing how the statistical approach ignores the possibility that the word violin has in the dual the same form (for example: ’Pozabil sem na violini’ Engl. I forgot about two violins) of accusative in dual (d/4). STAVEK-02 tags both possibilities (d/4) and (e/5). Additionally, the rule-based tagging is considerably faster compared to the statistical pre-tagged-corpora- based one. The public Slovenian parser [18] can tag on the average 8 sentences per second, while the parser of the system STAVEK-02 managed to tag 400 sentences per second. By additionally searching for all noun and verb meanings through the database of close to 110,000 synsets makes the rule-based parser almost two orders of magnitude faster then the public one. Part B shows all the synset paths for the nouns in the sentences. In the actual output the synset paths for verbs are also given. In the print option, the paths are listed after each sentence. Each chain is a sequence of labels of nodes (synsets) encountered during the walk between the word and the Ntop. The search algorithm finds all possible walks from any encountered noun or verb to the Ntop. The reader can verify this part of the search engine in real time on-line on the link given in [20]. Mostly, the labels are organized in self-explanatory manner using structure of keywords in which each keyword is itself a cluster label with the link to the particular cluster in the network. For example, the node labeled property (human) contains words each of which marks a property of a human’ (intelligence, beauty, greed, innocence, etc.). On the other hand, the words in the cluster with the same two keywords, but ordered differently e.g., human (properties) describe a human being with a particular property, genius and liar are in the synsets human (property (intelligence)) and human (property (bad)), respectively. Additionally, both words human and property are labels of other clusters. The cluster property, for example, contains 5,964 nouns with 14 sub-clusters named property (keywordi), i = 1,…14. Each keyword of these clusters: property (animal), property (human), property (number), … property (object), contains again cluster descriptors with keywords. Take for example the sub cluster property (object): property (object (color)), property (object (form)), property (object (price)). At the end each keywordi represents a cluster with a smaller set of words. Table 3 shows the topological distance matrix D of 45 distances between the ten meanings. All distances reflect the relation between the similarities of meanings of the words concerned very reasonable. The two main groups, the upper one representing material products (violin, string, bridge on the violin) and the lower one representing locust and mare: have two descriptions each, respectively. In the middle of both groups is the word vzrok (cause), representing the concept of non- material products. In the group of material objects the string /002/01/ (part of the violin) and kobilica /004/01/ (part of the violin) are joined at the lowest level. The pair goes together with the second meaning of the string /002/02/ as a sound emitting device and then three join together with the fourth sense violin combining all four into a reasonable synset musical instrument. As said above, the last four meanings represent the animal synsets (animal living beings). To this group of four meanings (horse (domestic animal)), horse (taxonomy), locust (insect), and locust (taxonomy), there is no counterparts of meanings from the rest of the considered three sentences, hence, one can safely assume that the four meanings of the word kobilica do not apply in this context. It is interesting to see that the remaining two words kobilica /004/03/ (keel as a part of a vessel) and vzrok (cause) fit well between the two larger group. The sense keel and violin are linked together relatively high in the dendrogram because there are both material objects, however, the level of the link between the concept cause and the material object keel shows that there is still a lot of space for improvements of the procedure for distance evaluation. This results help us to argue that as much the meanings of single word is important, the distance between the words is important as well. This in turn requires two things; first each word should be represented in unique and uniform way based on various kind of properties and second, the words should bi organized in a system that allows definition of a metrics. 6 Conclusion The discussed example and hierarchical network of words PMSB present only a very simple and small part of the general solution that can be accomplished by the use of an exhaustive and therefore much more complex network of word meanings. Neither the presented network, nor the presented model for extracting broader information from the text, is the final product. Still a lot of improvements can be implemented. Although the present network links together slightly more than 60,000 words (nouns and verbs) forming about 110,000 meanings (synsets) of various sizes, it is not the number of words that is a limiting factor, but rather more 92 Informatica 42 (2018) 85–93 J. Zupan factors like the absolute number of synsets (clusters of words with different features), the number of links to which each synset is connected, and least but not last the ability of algorithms for distance calculation to reflect the actual distinction between the meanings of word. These are the issues that should be of first concern. One should add not only more clusters presenting larger variety and number of properties, features, and/or meanings, but as well clusters of words pointing to rare, dangerous, or by any other criterion extreme features that the words represent, for example synsets containing words like non-poisonous plants, extremely hard or non combustible material, etc. The constant updating and enhancement of the networks of meaning require much more man-power and/or machine-supported feature selection efforts for addition of new groups than it has been spent for the present variations of WordNets on varieties of languages. However, for each specific language the native speakers are responsible for the growth and complexity of their specific meaning networks and no automatic procedure could completely replace their manual work and decisions. The presented PSMB network of meanings was put together by hand what requires approximately eight man-years to reach the present size. Some critics are afraid that such knowledge bases has arbitrary structure, because the meanings of the words are subjective and no objective criteria exist how to link or cluster words according to their meanings. The described example has shown the potential of such network to help understanding the context of the communication. As a matter of fact it is true, that such a hierarchy of meanings will always be subjective, but so is human mind. 7 Acknowledgement The author wishes to thank National Institute of Chemistry for providing him with the facilities to work at the Institute as a research emeritus. 8 References [1] Zupan, Jure; Koncept mrežnega pomenskega slovarja slovenskih besed, Jezik in slovstvo, 54, (3- 4), 2009, pp. 139-151. [2] Miller, George A, WordNet: A Lexical Database for English. Communications of the ACM. 1995, Vol. 38 (11), 39-41. [3] Fellbaum, Christiane; WordNet: An Electronic Lexical Database, Editor, 1998, Cambridge, MA: MIT Press. [4] Visuword™, On-line graphical dictionary and thesaurus, https://visuwords [5] Towards a Universal Multilingual WorldNet - D5: Databases and Information Systems, Max-Planck- Institut für Informatik; Mpi-inf.mpg.de; 2011-08- 14. [6] Vossen, Piek, EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Editor, 1998, Kluwer, Dordrecht, The Netherlands. [7] Maziarz M., Szpakowicz S., Piasecki M., Semantic Relations among Adjectives in Polish WordNet 2.0: A New Relation Set, Discussion and Evaluation, Cognitive Studies / Études Cognitives, t. 12, s. 149– 179, 2012. [8] Koeva, S., G. Totkov and A. Genov. Towards Bulgarian WordNet. Romanian Journal of Information Science and Technology, Vol. 7, No. 1- 2, 45-61, 2004. [9] Fišer, Darja, Novak, Jernej. Visualizing sloWNet. Proceedings of the conference on Electronic lexicography in the 21st century: New applications for new users (eLEX2011). Bled, Slovenia, 9-12 November 2011. [10] Vidovič Muha, Ada, Slovensko leksikalno pomenoslovje. Ljubljana: Znanstveni inštitut Filozofske fakultete, 2000: [11] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S.; Dean, Jeff; Distributed representa-tions of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013. [12] Levin, Beth; English Verb Classes and Alternations, The University of Chicago Press, Chicago, 1993. [13] F. Dorenseiff, der deutsche Wortschatz nach Sach- gruppen, 8. Edition, Ed. U. Quasthoff, W. de Gruyter, Berlin, 2004. [14] Zupan, Jure; Problemi in nekaj rešitev računalniških obdelav slovenskih besedil, Slav. revija, 47 (3), 1999, 277-296. [15] Zupan, Jure; Hierarhična mreža slovenskih glagolov, v Obdobja 30, Interdisciplinarity in Slovene Studies, Filozofska Fakulteta, Ljubljana 2011, pp. 551-557. [16] Zupan, Jure; Lajovic, Andrej; PMSG – Network of Slovenian verbs, web address: http://pmsg.zrc- sazu.si. [17] Zupan, Jure; Pomenska mreža slovenskih glagolov, Založba ZRC SAZU, 2013, pp. 31-51, [18] Oblikoslovni označevalnik za slovenski jezik, Amebis, d.o.o. Kamnik, Inštitut Jožef Stefan, Univerza v Ljubljani, ZRC SAZU, Trojina, Zavod za uporabno slovenistiko, 2008-2013, konzorcij projekta Sporazumevanje v slovenskem jeziku: link to the network: http//www. oznacevalnik.slovenscina.eu [19] Slovar Slovenskega knjižnega jezika (SSKJ), Bajec, Anton, et al., Eds., Državna založba Slovenije, DZS, Ljubljana, 1995. [20] J. Zupan, A. Lajovic; PMSB, Pomenska mreža slovesnkih besed, link to the network of meanings of Slovenian words: http://mreza.andrej.ad-vega.si. Graph Theoretical View on Text Understanding Informatica 42 (2018) 85–93 93 Distance matrix between ten meanings of four words. The distances are the numbers of nodes (synsets) between two meanings in the network PSMB evaluated according to the procedure and equation /1/. 1 2 3 4 5 6 7 8 9 10 1 violina (violine) /001/01 0 6 6 13 6 12 19 19 18 21 2 struna (string, violin's part)/002/01 0 4 12 3 11 18 18 17 20 3 struna (string, sound emitter)/002/02 0 10 5 9 16 16 15 18 4 vzrok (cause) /003/01 0 12 12 17 17 16 19 5 kobilica (violin's part)/004/01 0 10 17 17 16 19 6 kobilica (keel)/004/02 0 17 17 16 19 7 kobilica (locust) /004/03 0 3 15 16 8 kobilica (locust-taxonomy)/004/04 0 15 18 9 kobilica (mare)/004/05 0 3 10 kobilica (horse-taxonomy)/004/06 0 Strategy: Ward method D(link)*100 / D(max) 100---90---80---70---60---50---40---30---20---10---00 .______ violina /001/01 | .__ struna /002/01 | ._|__ kobilica /004/01 .__________|_|____ struna /002/02 | .____________ vzrok /003/01 ._______________________________|____|____________ kobilica /004/02 | .__ kobilica /004/03 | .___________________________|__ kobilica /004/04 | | .__ kobilica /004/05 _|__________________|___________________________|__ kobilica /004/06 Figure 3. The distance matrix D between ten different senses of four words (violin, string, cause and kobilica). The word string has two meanings a) part of the violin and b) sound-emitting device. The word kobilica has four meanings and six synset paths from the meanings to the top of the network (see Figure 1). The distances between individual meanings are calculated using the procedure and equation /1/. The dendrograms based on the distance matrix D can be output optionally after any number of tagged sentences providing there is no more than 500 nouns or verbs.