Informatica 28 (2004) 381-386 381 Shortest-Path Semantic Distance Measure in WordNet v2.0 Jure Ferlež and Matjaž Gams Jožef Stefan Institute, Department of Intelligent Systems, Jamova 39, 1000 Ljubljana, Slovenia Jure.Ferlez@ijs.si Keywords: semantic relatedness, semantic distance, WordNet Received: July 14, 2004 This paper analyses a measure of semantic relatedness between two words. Our measure is based on shortest path between synsets in WordNet v2.0. It uses all available links in WordNet v2.0 and is implemented by a bidirectional breadth-first algorithm. Experimental evaluation and comparison with a benchmark set of human similarity judgments demonstrates that the simple measure applied on WordNet v2.0 performs better than the more complicated approaches mostly combining an IS-A taxonomy with the notion of shared information content extracted from corpora. One explanation is that our pretty basic method efficiently exploits all the available links in WordNet v2.0, while other measures, although more complicated and advanced, do not make good use of the new derivational links added to WordNet in the latest version 2.0. Povzetek: članek opisuje iskanje najkrajše pomenske razdalje v WordNet v2.0. 1 Introduction "The need to determine the degree of semantic relatedness between lexically expressed concepts is a problem that pervades much of the computational linguistics" [1]. Recent research on the topic in computational linguistics has emphasized the perspective of semantic relatedness of two words in a lexical resource, or its inverse semantic distance. Today, artificial measures of semantic relatedness are used in a wide area of natural language processing (NLP) applications, such as word sense disambiguation, automatic correction, information extraction, retrieval and indexing, text summarization and construction of ontologies. A natural way to compute the semantic relatedness of words given a semantic network is to evaluate the distance of nodes corresponding to words being compared - the shorter the shortest path from one node to another, the more related the words are. Thus the length of the shortest path in a semantic network is named semantic distance. On the other hand this approach to semantic relatedness has a problem of assuming that links between the nodes in a network represent uniform distances. This problem is most apparent in the most studied form of semantic network used for calculation of semantic relatedness or its specialization semantic similarity - the form of taxonomy. Attempts trying to overcome the issue of variability in the distance covered by a single semantic link inside a taxonomy include depth relative scaling [2] [3] and combining taxonomy with the notion of information content [4][5]. WordNet [6] is an electronic lexical database and a broad coverage semantic network. It has been widely explored and is used in many studies of NLP. WordNet v1.7.1 is constituted of 111.223 synsets (sets of synonym words) divided into nouns, verbs, adjectives and adverbs, and more than 295.000 links of different types between them. Up to the version 2.0 the majority of the links between synsets was of hypernym or hyponym semantic type. This is why WordNet prior to version 2.0 was considered a lexical taxonomy by most of the researchers. However, the latest edition of WordNet version 2.0 introduced more than 40.000 additional derivational links between noun and verb synsets describing morphological relatedness e.g.: noun cook got connected with the verb to cook. These new links have interconnected otherwise more or less separate taxonomy trees. Section 2 of this paper describes the logic and the implementation of the shortest-path semantic distance measure applied to WordNet database. Section 3 explains the evaluation methods used to measure performance of the inspected measure on standard test data sets. Section 4 lists the results of evaluation of the measure on different versions of WordNet, using different test data sets. Section 5 presents and explains the acquired results. 2 Shortest-Path Semantic Distance in WordNet v2.0 (SP) Our decision to apply shortest-path algorithm to evaluate semantic relatedness is based on the assumption that the Shortest-path semantic distance approach produces better results when applied to a denser semantic network like version 2.0 of WordNet. Our algorithm of computing semantic distance by computing shortest path between nodes in WordNet via all the available edges is similar to [7] or [8]. WordNet itself can be described as a directed graph G(V, E), where vertices set V represents a set of synsets and edge set E represents a set of directed semantic links regardless of their type. Figure 1 shows a very small 382 Informatica 28 (2004) 381-386 J. Ferlez et al. subset of WordNet. Note that each node represents one synset, which consists of a set of words. Also, each synset has links to other synsets and this links are of different types, e.g. hypernym, hyponym, antonym, meronym, etc. Figure 1: WordNet nodes and edges. Most of the links in WordNet come in pairs like hyponym-hypernym, antonym-antonym and meronym-holonym. Newly added derivational links are also bidirectional, thus making WordNet in a large part an undirected graph. In effect, unidirectional links allow a simple algorithm to effectively compute the shortest path between two synset sets in WordNet v2.0. Further more, a problem of computing semantic relatedness between two words can simply be translated to searching the shortest path in an undirected graph G(V, E) between the start set of vertices S (representing meanings - synsets of the first word) and target set of vertices T (representing meanings - synsets of the second word), where edge length is set to 1: dist (S, T) = min path _ length(s, t) (1) seS teT Our program computes the shortest path from vertex set S to vertex set T with a standard bidirectional breadth-first algorithm. Bidirectional approach is possible due to the mentioned generalization that WordNet is an undirected graph. Namely, the concept of a distance in a graph assumes undirected edges in a graph. Due to the bidirectional search technique, the breadth-first algorithm performs efficiently and produces correct results. An example of the computed shortest path of length 4 between synset sets matching the meanings of the words car and journey is displayed in Figure 2. In the example, meanings of words car and automobile are connected through a derivational link. Meanings of words automobile and driving are connected through the category link. Our algorithm found the shortest path from meanings of word car to meanings of word journey of length 4 by examining all links to depth 2 from both directions. W O R D C A R MEANINGS (SYNSETS) OF WORDS A U T O M 0 B 1 L E D R I V I N G T R A V E L J O U R N E Y Noun Noun Noun 4-wheeled a wheeled a motor vehicle conveyance vehicle; adapted to for usually the rails of passengers propelled by railroad or freight an internal on a cable combustion railway engine derivational link Noun Verb 4-wheeled to travel in motor an vehicle; automobile usually propelled by an internal combustion engine category link Noun Verb the act of travel in an controlling automobile and steering the movement of a vehicle or animal hypernym link Noun Noun Verb the act of self- change going from propelled location; one place to movement move, another travel, or proceed hyponym link Noun Verb Verb the act of undertake a travel upon traveling journey or or across from one trip place to another 3 Evaluation Method We used the method of comparing the human judgment of word relatedness to computed estimates of relatedness in the evaluation of our measure. Figure 2: Shortest path between meanings of words car and journey. SHORTEST-PATH SEMANTIC DISTANCE. Informatica 28 (2004) 381-386 383 This is one of the standard evaluation techniques [1] and arguably [4][1] yields the most general assessment of the "goodness" of a measure. Resnik [9] has proposed this particular approach for evaluating the relatedness measure, stating that the "measure's worth is in its fidelity to human behavior, as measured by predictions of human performance on experimental tasks". The proposed evaluation method compares the computed relatedness scores with human ratings of relatedness. A series of word pairs and the average human score of relatedness between words in a pair represent human judgment of relatedness. Scores must lie inside a predefined interval. Comparison of the relatedness grades between computed relatedness grades and those acquired by humans can be summarized by means of coefficient of correlation [9]. Resnik also argues [9] that an upper bound of the correlation coefficient between a computational measure grades and average human scores is represented by an average correlation between scores of a human individual in a repeated experiment and average human scores in an original experiment for a particular word pair set. The problem of evaluation by comparing results against human judgment [1] is the difficulty of acquiring a larger set of human judgments of relatedness and consequently acquiring the proposed upper bound on correlation coefficient for a particular data set. Another problem of this evaluation approach according to Budanitsky [1] is that most of the applications use semantic relatedness to capture relatedness between meanings for which words are mere surrogates: "the human judgment that we need are of the relatedness of word senses not words". On the other hand, the virtue of this evaluation approach is the ease of evaluation. Word pairs are simply submitted to evaluation of relatedness and the results are easily summarized by correlation coefficient between human and metric based results. This is why different relatedness measures are most often compared using this evaluation method. Surveys include those of Budanitsky [1] and McHale [10]. Some previous evaluations of the words relatedness measure also included reports [11] with results relative to the scope of the different embedding applications using the evaluated measure. Lin [5] proposed listing additional mathematical properties of a measure, e.g. if it presents a metric. In this article our measure is evaluated in terms of similarity with human grades. Rated word pair data sets we used in evaluation include those of Rubenstein and Goodenough [12] (noted as 65 R&G in Tables 1,2,3), Miller and Charles [13] (noted as 30 M&C in Tables 1,2,3) and Finkelstein et al. [14] (noted as 353 F in Tables 1,2,3). The first set consists of 65 word-pair similarity grades acquired in an experiment involving 51 humans asked to rate the similarity of the word pairs on a scale of 0.0 (semantically unrelated) to 4.0 (highly synonymous). Resnik's computational measure's upper bound correlation coefficient for this data set is not known. Later, this data set was modified by Miller and Charles who extracted 30 pairs from the original 65; taking 10 pairs graded in the interval between 0 to 1, 10 pairs from 1 to 3 and 10 pairs from the grade interval 3 to 4. Resnik [4] acquired alternative human scores on the same word pair's series and argues that 0.8848 is the upper bound correlation coefficient a computational metric could achieve on this particular data set. Finkelstein et al. [14] published a greater word pair relatedness set. They acquired 353 relatedness graded word pairs, which include Miller and Charles's 30 pairs. We used the described method of comparison against human judgment to compare our shortest path relatedness measure against other relatedness measures. Evaluation of the performance of the most of the following measures on 30 Miller and Charles's (30 M&C) word pairs and 65 Rubenstein and Goodenough's (65 R&G) word pair data sets was conducted by Budanitsky [1][11]. His results are relative to WordNet v1.5 and Brown Corpus [15], which was used as an additional knowledge source to some measures. Measures used for comparison are: 1. Hirst and St-Onge's semantic relatedness measure (HSO) [8] The idea behind Hirst and St-Onge's measure of semantic relatedness is that two concepts are semantically close if a path, which is not too long, connects their WordNet synsets and it does not change direction too often: relffS (C1, C2) = C - path _ length - K x d (2) where d is the number of changes of direction in the path and C and K are constants. This measure is actually an extended shortest-path algorithm and is the only established relatedness measure, while others actually measure similarity. However, it was only evaluated on the previous version of WordNet (version 1.5). 2. Leacock and Chodorow's semantic similarity measure (LCH) [16] Leacock and Chodorow's semantic similarity measure also uses shortest-path algorithm, however it only considers the IS-A links in the WordNet network. They modify the results by scaling them according to taxonomy depth D: (C1 C 2) , (length (C1,C 2)) (3) simLc(C1, C2) = - log(—-) 3. Resnik's similarity measure (RES) [4] Resnik's similarity measure uses both taxonomy and corpus data. Resnik defined the similarity between two concepts lexalized in WordNet to be the information content of their most specific common subsumer - the first common predecessor in the taxonomy tree lso(C1,C2): simR (C1, C2) = - log p(lso(C1, C2)) (4) 384 Informatica 28 (2004) 381-386 J. Ferlez et al. In equation (4), p(c) is the probability of encountering an instance of a concept c in a corpus. Finkelstein et al. [14] mentioned evaluation of the performance of the measure on the 353 word pairs. 4. Jiang and Conrath's semantic distance measure (JCN) [17] Jiang and Conrath's semantic distance measure is an inverse of semantic similarity. The measure also combines WordNet taxonomy with corpus data. The distinction from Resnik's measure is that Jiang and Conrath's measure has the mathematical property of increasing with difference of the compared concepts. distjc (C1, C 2) = (5) 2 log(p(lso(C1, C2))) - (log p(C1) + log p(C2)) 5. Lin's semantic similarity measure (LIN) [5] Lin's semantic similarity measure uses the same elements and knowledge sources as does the Jiang and Conrath's semantic distance, but it is constituted according to Lin's general theory [5] of similarity between arbitrary objects: simL (C1, C2) = 2log p(lso(C1,C 2)) log p(C1)+log p(C 2) (6) 6. Roget's Taxonomy Shortest-path semantic distance (RTSP) [10] Roget's Taxonomy Shortest-path semantic distance uses alternative taxonomy called Roget's thesaurus and is therefore not WordNet based. Roget's thesaurus is a wide shallow hierarchy densely populated with nearly 200,000 words and phrases. This method also relies on shortest path between concepts and does not use any alternative knowledge sources: simRoget(C1, C2) = lengthRoget(C1, C2) (7) Evaluation of the measure's performance on 28 Miller's word pairs was conducted by Mc Hale [10]. 7. Latent Semantic Analysis (LSA) [18] Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. It can be used to calculate semantic relatedness based on corpus knowledge alone. The basic idea is that LSA represents the meaning of a word as a kind of average of the meaning of all the documents in which it appears, and the meaning of a document as a kind of average of the meaning of all the words it contains. 4 Empirical Measurements We evaluated the shortest path relatedness measure by comparing the computed semantic distances against human relatedness judgment and against the results of other measures on the same data sets. Relatedness grades were computed by applying shortest path relatedness measure to all available word pair series. Correlation coefficient proposed by Resnik was used for summarizing the results and for comparison with other measures. We have performed experiments on our measure using two different versions of WordNet: WordNet v1.7.1 and WordNet v2.0. For comparison we also used the WordNet:: Similarity toolkit [19] to reevaluate Hirst and St-Onge's, Leacock and Chodorow's, Resnik's, Jiang and Conrath's and Lin's relatedness measures using WordNet versions 1.7.1 and 2.0. In this reevaluation the alternative knowledge source for extracting information content used by some measures was a much smaller SemCor [20] corpus, as the larger Brown data set could not be obtained. The SemCor, however, is a subset of the Brown Corpus used in original experiments performed by Budanitsky [1] [11]. Due to this and possibly different settings of the two experiments the repeated experiment cannot be directly compared to the one performed by Budanitsky. For evaluation of the LSA measure on 30 Miller's word pairs and 65 Rubenstein and Goodenough's word pairs we used the implementation of LSA available online at http://lsa.colorado.edu. Finkelstein et al. [14] published an evaluation on the 353 word pairs, which is presented in Table 2. Table 1 shows the absolute correlation coefficients between the computed distance ratings of different measures and the mean ratings of human subjects per particular word-pair data set. This results were obtained using the WordNet::Similarity toolkit for each particular measure for different versions of WordNet and different test data sets. The first row of the table displays the correlation coefficients of our studied shortest path relatedness measure. Table 2 summarizes experimental results obtained by Budanitsky in his studies of semantic relatedness [1][11] combined with results of Finkelstein et al. [14] and our research. Table 2 also lists the absolute correlation coefficients between the computed distance ratings of different measures and the mean ratings of human subjects. As described in section 3 values in Table 2 were obtained using different settings and more extensive knowledge sources as in the repeated experiment. The first two rows of the table display the correlation coefficients of our studied shortest path relatedness measure. Table 3 lists the semantic distance ratings of our shortest path relatedness measure applied on WordNet version 2.0 per word pair for the Miller and Charles's word pair data set. SHORTEST-PATH SEMANTIC DISTANCE. Informatica 28 (2004) 381-386 385 30 M&C WN v1.7.1 WN v2.0 65 R&G WN v1.7 WN v2.0 353 F WN v1.7 WN v2.0 SP 0.80 0.86 0.80 0.88 0.32 0.47 HSO 0.69 0.60 0.73 0.73 0.37 LCH 0.82 0.82 0.85 0.86 0.36 0.36 RES 0.78 0.77 0.79 0.81 0.38 0.35 JCN 0.47 0.47 0.58 0.58 0.23 0.23 LIN 0.74 0.74 0.71 0.73 0.30 0.30 Table 1: Correlation coefficients of measures compared to humans on WordNet v1.7.1 and v2.0 Relatedness measure 30 M&C 65 R&G 353 F SP WN v1.7.1 0.80 0.80 0.32 SP WN v2.0 0.86 0.88 0.47 HSO 0.74 0.79 LCH 0.82 0.84 RES 0.77 0.78 0.34 JCN 0.85 0.78 LIN 0.83 0.82 RTSP 0.89 LSA 0.72 0.65 0.56 Human replication 0.88 Table 2: Correlation coefficients obtained by Budanitsky, Finkelstein et al. and our research. 5 Conclusion Results obtained in the repeated experiment of Budanitsky summed in Table 1 show that our shortest path measure, when applied to WordNet v2.0, always performs better than if applied to version 1.7.1. This can be explained by the presence of the additional derivational links in the latest version of WordNet, thus transforming it from taxonomy to a more complex semantic network. The repeated experiment also showed that other tested measures, which are based on WordNet's taxonomy, are not significantly affected by the particular version of WordNet used. This can be explained by reviewing the differences of the taxonomy structure between versions of WordNet 1.7.1 and 2.0. According to documentation available [21] there are only insignificant differences in taxonomy part of WordNet between the two versions. The experiments also showed that even the more complicated measure of Hirst and St. Onge, which uses the same principles of exploring all possible links in WordNet, does not compare favourably to our studied simple shortest path approach applied to the latest version of WordNet. A question arises whether the more complicated Hirst and St-Onge's measure would outperform the simpler shortest path approach if optimally applied to the latest version of WordNet. The answer remains unclear despite the results in Table 1, which show that the shortest-path algorithm produces better results. The improvement is perhaps due to suboptimal parameter settings of the more complicated measure. On the other hand the reason could lie in the fact that measures applied on prior versions of WordNet were in effect overspecialized in exploiting its prevailing taxonomic structure. 30 M&C Word pairs Human average grades Shortest Path on WordNet v2.0 car automobile 3.92 0 gem jewel 3.84 0 journey voyage 3.84 1 boy lad 3.76 1 coast shore 3.7 1 asylum madhouse 3.61 1 magician wizard 3.5 0 midday noon 3.42 0 furnace stove 3.11 2 food fruit 3.08 3 bird cock 3.05 1 bird crane 2.97 3 tool implement 2.95 1 brother monk 2.82 1 crane implement 1.68 4 lad brother 1.66 4 journey car 1.16 4 monk oracle 1.1 7 cemetery woodland 0.95 5 food rooster 0.89 6 coast hill 0.87 3 forest graveyard 0.84 5 shore woodland 0.63 3 monk slave 0.55 4 coast forest 0.42 4 lad wizard 0.42 4 chord smile 0.13 7 glass magician 0.11 6 noon string 0.08 7 rooster voyage 0.08 10 Table 3: Semantic distance by item in the Miller's word pair data set. The comparison of results in Table 2 reveals that when applied to the latest version of WordNet, our studied shortest path measure performs at least as well as the other more complicated measures based mostly on taxonomy of WordNet version 1.5. This can only result from additional features in WordNet v2.0 compared to WordNet v1.5. The shortest path measure performs comparatively or better than the measures combining WordNet taxonomy with the notion of information content. This indicates that WordNet v2.0 has become a much more informative and dense semantic network than the previous versions. Results in Table 2 also show that shortest-path semantic distance in WordNet v2.0 is comparable to other relatedness measures based on alternative knowledge 386 Informatica 28 (2004) 381-386 J. Ferlez et al. sources. The shortest-path semantic distance measure applied to Roget's thesaurus performs only slightly better than the studied one applied to WordNet v2.0, suggesting that the knowledge of both semantic networks exposed by their nodes and edges is approximately equal. The LSA method based solely on corpus data, on the other hand, shows strongest performance resilience to enlarging test data sets. This suggests further improvements to our measure are possible and should be studied in the already promising direction of combining WordNet with additional knowledge sources like corpora. From the results one might conclude that the studied shortest path relatedness measure applied to WordNet v2.0 gives better results than other measures based only on WordNet taxonomy regardless of the version of WordNet and regardless of the possible use of alternative data sources. This follows from the assumption that even if Budanitsky repeated his experiment on the taxonomy of WordNet version 2.0 it could have only yielded approximately the same results as with the version 1.5. The latter assumption is supported by (1) the fact that the repeated experiment produced WordNet version independent results for the taxonomy based measures and by (2) the fact that according to documentation [21], there are no significant differences in the taxonomy part of WordNet between versions 1.5 and 2.0. According to evaluation against human ratings of relatedness, pretty basic shortest-path semantic relatedness measure applied to WordNet version 2.0 can be used in research and development of the NLP systems instead of the more complicated alternatives. The shortest-path method will typically use fewer resources to achieve similar or better results. References [1] Budanitsky, A., and G. Hirst, Semantic Distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures, Workshop on WordNet and Other Lexical Resources, in the North American Chapter of the Association for Computational Linguistics (NAACL-2000), Pittsburgh, PA, 2001. [2] Sussna, M., Text Retrieval using Inference in Semantic Matanetworks, PhD Thesis, University of California, San Diego, 1997. [3] Wu, Z., and Palmer, M., Verb Semantics and Lexical Selection, Proc. of the 32nd Annual Meeting of the Associations for Computational Linguistics, 1994. [4] Resnik, P., Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language, Journal of Artificial Intelligence Research, 11, 1999. [5] Lin, D., An Information-Theoretic Definition of Similarity, Proceedings of 15th International Conf. on Machine Learning, 1998. [6] Fellbaum, C. (Ed) WordNet - An Electronic Lexical Database, MIT Press, 1998. [7] Rada, R., Mili, H., Bicknell, E., and Blettner, M., Development and application of metric on semantic nets, IEEE Transaction on Systems, Man, and Cybernetics, 1989, Vol. 19, no 1, pp 17-30. [8] Hirst, G., and St-Onge, D., Lexical Chains as representations of context for the detection and correction of malapropisms, Fellbaum, 1998, pp 305-332. [9] Resnik, P., Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, 1995, pp 448-453. [10] McHale, M., L., A Comparison of WordNet and Roget's Taxonomy for Measuring Semantic Similarity, In Proc. Usage of WordNet in Natural Language Processing Systems, COLING-ACL '98 Workshop, August 16, 1998, University of Montreal, 1998. [11] Budanitsky, A. Lexical Semantic Relatedness and its Application in Natural Language Processing, technical report CSRG-390, Department of Computer Science, University of Toronto, 1999. http://www.cs.toronto.edu/compling/Publications/Ab stracts/Theses/Budanitsky-thabs.html [12] Rubenstein, H., and Goodenough, J. B., Contextual Correlates of Synonymy, Computational Linguistics, 8, 1965, pp 627-633. [13] Miller, G.A., and Charles, W.G. Contextual correlates of semantic similarity, Language and Cognitive Processes, Vol. 6, No. 1,1991, pp 1-28. [14] Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E., Placing Search in Context: The Concept Revisited, ACM Transactions on Information Systems, 2002, Vol. 20, no 1. pp 116-131. [15] Francis, W.N. and Kucera, Henry, Frequency analysis of English usage. Lexicon and grammar. Houghton Mifflin, Boston, 1982. [16] Leacock, C., and Chodorow, M., Combining Local Context and WordNet Similarity for Word Sense Identification, C. Fellbaum, editor, WordNet: An Electronic Lexical Database, MIT Press, 1998, pp 265-285. [17] Jiang, J., J., and Conrath, D., W., Semantic similarity based on corpus statistics and lexical taxonomy, Proceedings on International Conference on Research in Computational Linguistics, Taiwan, 1997. [18] Landauer, T.K., Foltz, P.W., and Laham, D. Introduction to Latent Semantic Analysis, Discourse Processes, Vol. 25, No. 2 & 3, 1998, pp 259-284. [19] Pedersen, T., WordNet::Similarity, http://www.d.umn.edu/~tpederse/similarity.html [20] Miller, G., Leacock, C., Tengi, R. and Bunker, T., A Semantic Concordance, Proc. of ARPA Workshop on Human Language Technology, 1993. [21] WordNet Documentation, Cognitive Science Laboratory, Princeton University, ftp://ftp.cogsci.princeton.edu/pub/wordnet/.