41 N. LJUBEŠIĆ, N. LOGAR, I. KOSEM: Collocation ranking: frequency vs semantics

COLLOCATION RANKING: FREQUENCY VS SEMANTICS

Nikola LJUBEŠIĆ
Jožef Stefan Institute; Faculty of Computer and Information Science, University of Ljubljana

Nataša LOGAR
Faculty of Social Sciences, University of Ljubljana

Iztok KOSEM
Faculty of Arts, University of Ljubljana; Jožef Stefan Institute

Ljubešić, N., Logar, N., Kosem, I.: Collocation ranking: frequency vs semantics. Slovenščina 2.0, 9(2): 41–70.
DOI: https://doi.org/10.4312/slo2.0.2021.2.41-70

Collocations play a very important role in language description, especially in identifying meanings of words. Lists of collocates ranked by some statistical measure are an indispensable part of meaning deduction in modern lexicography. In the paper, we present a comparison between two approaches to the ranking of collocates: (a) the logDice method, which is dominantly used and frequency-based, and (b) the fastText word embeddings method, which is new and semantic-based. The comparison was made on two Slovene datasets, one representing general language headwords and their collocates, and the other representing headwords and their collocates extracted from a language for special purposes corpus. The experiment had two parts: for the quantitative part of the evaluation, we used supervised machine learning with the support-vector machines (SVMs) algorithm and the area-under-the-curve (AUC) ROC score, and in the qualitative part the ranking results of the two methods were evaluated by lexicographers. The results were somewhat inconsistent; while the quantitative evaluation confirmed that the machine-learning-based approach produced better collocate ranking results than the frequency-based one, lexicographers in most cases considered the listings of collocates of both methods very similar.
Keywords: collocations, word embeddings, logDice, general language, academic language

Slovenščina 2.0, 2021 (2)

1 INTRODUCTION

The importance of the notion of collocation has been acknowledged by linguists for a long time, ever since J. R. Firth’s famous statement: “You shall know a word by the company it keeps” (Firth, 1957). In fact, collocations themselves are considered by many as lexical units with different levels of semantic transparency (Singleton, 2000). As a result, even transparent collocations (and not only idioms, phrases and other more fixed multiword units) have started to receive more attention in dictionaries.

Collocation identification requires a computational approach. Several statistics for measuring collocation have been proposed in the past decades, for example t-score, MI, MI3, the log-likelihood ratio, the Dice coefficient, etc. (see Manning and Schütze, 1999, for an overview). In fact, collocation has been the pervasive driving force behind the development of tools for analysing and describing language in general.

However, progress also brought new challenges. Problematic aspects of different statistical approaches for measuring collocation have often been discussed (cf. Kilgarriff and Kosem, 2012), which led to proposals of new measures such as logDice (Rychlý, 2008), which was developed with lexicographic use in mind and has been used by a large number of dictionary projects.

Nowadays, new, non-statistical methods are slowly finding their way into dictionary-making (and language) analysis. We thus decided to test one popular and up-to-date language modelling technique, namely word embeddings (Levy and Goldberg, 2014; Li and Jurafsky, 2015; Camacho-Collados and Pilehvar, 2018; etc.).

1.1 The aim and the scope of the paper

As Levy and Goldberg (2014, p.
302) explain, in distributional semantics word embeddings are vector representations of all the contexts in which a word occurred, and “enable efficient computation of word similarities through low-dimensional matrix operations”. Recent uses of word embeddings for identifying collocations are well recorded (cf. Section 2). Various experiments proved the method to be moderately to highly successful in a range of tasks. We decided to find out how well it performs when given one other task, that of collocate ranking. Since our research was lexicographically oriented, we were especially interested in how well the method performs in comparison to the lexicographically highly popular logDice metric (Rychlý, 2008), which uses heuristics (i.e. a set of fixed rules). Broadly speaking, we also wanted to find out whether a dictionary-making process (in our case a Slovene dictionary-making process) could become less time-consuming and more efficient if complemented with collocate ranking data acquired by the semantic-based method of word embeddings.

In order to establish how well word embeddings tackle the task of collocate ranking for lexicographic purposes, we set up a two-part experiment. It consisted of:

1. the quantitative analysis of
   a) heuristic-based vs machine-learning-based approach to collocate ranking, and
   b) frequency-based vs semantics-based machine-learning approach to collocate ranking;
2. the qualitative analysis of different collocate ranking results, which was performed by lexicographers.

In both analyses, two datasets were used:

• a general Slovene language dataset named KOLOS (Kosem et al., 2018), and
• a Slovene for special purposes (LSP) dataset named KAS (Erjavec et al., 2020).
Namely, we also wanted to draw some initial conclusions about the two approaches to collocation ranking with regard to differences in text type, and monosemy/polysemy of words.

All in all, the experiment arose from an actual dictionary-making process, and is described here with the purpose of bringing possible benefits to similar endeavours elsewhere as well.

2 MEASURING COLLOCATIONS: ASSOCIATION MEASURES, AND MORE RECENT – WORD EMBEDDINGS

An extensive body of research exists on measuring collocation strength or collocativity (e.g. Berry-Rogghe, 1973; Church and Hanks, 1990; Church et al., 1991; Biber, 1993; Manning and Schütze, 1999; Evert, 2004; Gries, 2013), and different statistical methods (i.e. association measures) have been used up to this day. Association measures have also been regularly compared, and new ones proposed. Two good overviews of association measures are Wiechmann (2008), who compared 47 different association measures, and Pecina (2009), who conducted a comparison of more than 80 measures for collocation extraction. General observations of the majority of such studies were aptly summarized by Evert (2009), namely that “different association measures will produce entirely different rankings of the collocates” (ibid., p. 1218) and that “there is no ideal association measure for all purposes” (ibid., p. 1236).

A recent study by Evert et al. (2017) inspected the role of variables such as corpus size, context span, and frequency threshold in collocation identification. Using two different dictionaries as gold standards, it showed that “very large Web corpora and small co-occurrence contexts produce the best results” (ibid., 543). Moreover, in terms of co-occurrence span, the researchers concluded that syntactic dependency was the best choice in most cases.

There is some literature on association measures used on Slovene corpus data as well (e.g.
Gorjanc and Vintar, 2000; Gorjanc and Fišer, 2010); however, there are no studies that would comprehensively compare the effectiveness of various association measures for identifying collocations in Slovene. As far as language description is concerned, in recent years most Slovene lexicographical and terminological projects have started using the Sketch Engine (Kilgarriff et al., 2004) and rely on association measures provided by this tool, especially logDice, which is used by the well-known Word sketch function. However, as Gantar et al. (2015) and Gantar et al. (2016) observed, logDice often misses certain important collocates or ranks them very low, which is why researchers started combining logDice and raw frequency rankings when extracting and analysing collocates for dictionary purposes.

All association measures have one shortcoming in common: even if they are limited by predefined syntactic relations (such as in word sketches), they rely solely on co-occurrence frequencies and do not consider semantic aspects of words. And precisely this type of information is contained in word embeddings.

Word embeddings have been used extensively in the field of natural language processing (NLP) in the last decade. For example, Rodríguez-Fernández et al. (2016a) followed the well-known analogy approach first identified in Mikolov et al. (2013), where king is to man as queen is to woman. They applied the same technique to collocation extraction, hoping to obtain the proper headword for the collocate suggestion by relating it to the known collocation take a walk. In their approach they hoped to be able to remove the walk information from take and add the suggestion information, ending up with make being a near neighbour of the resulting vector; that is, they calculated vec(take) − vec(walk) + vec(suggestion) with the goal of the result being close to vec(make).
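The vector arithmetic just described can be illustrated with a minimal sketch. The three-dimensional vectors below are invented purely for illustration (real embeddings have hundreds of dimensions and are learned from corpora), and the function names `cosine` and `analogy` are hypothetical, not taken from the cited work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = {w: cosine(target, v) for w, v in vectors.items()
                  if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

# Toy vectors, invented so that the analogy works out (an assumption).
toy_vectors = {
    "take": [1.0, 0.0, 1.0],
    "walk": [0.0, 0.0, 1.0],
    "suggestion": [0.0, 1.0, 0.0],
    "make": [1.0, 1.0, 0.0],
    "give": [1.0, 0.2, 0.9],
}

best = analogy("take", "walk", "suggestion", toy_vectors)
print(best)
```

With real embeddings the nearest neighbour is of course not guaranteed to be the desired support verb, which is consistent with the wide range of scores reported for this technique.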
This analogy-based approach, evaluated in follow-up work, obtained a mean reciprocal rank (MRR) score between 0.01 and 0.47.¹

Another piece of work by the same group of authors (Rodríguez-Fernández et al., 2016b) found that a linear transformation of the headword embedding can be used to predict the optimal collocate word embedding, learning this transformation per Mel’cuk semantic typologies (Mel’cuk, 1996). They did not compare this approach to the basic frequency-based one; nevertheless, they achieved promising but varying results, with the mean reciprocal rank (MRR) of the best-performing system between 0.3 and 0.9.

This methodology was followed by Enikeeva and Mitrofanova (2017), who applied it to Russian data. They reported slightly higher MRR scores, ranging from 0.48 to 0.9. Again, they did not compare their results to the traditional frequency-based methods.

Liu and Huang (2017) showed that using the cosine distance between the distributional word representations of headwords and collocates as a function for ranking collocation candidates yielded just slightly better results measured by F1 than the chi-square and mutual information co-occurrence statistics.

Additionally, Wanner et al. (2017) used distributional word representation to classify collocations into semantic classes, and Garcia et al. (2017) used multilingual word embeddings to find collocation translations in other languages.

¹ Mean reciprocal rank (MRR) is a relative score meant for ranked results that calculates the average of the inverse of the ranks at which the first positive instance occurs. MRR ranges between 0 and 1: an MRR of 1 is obtained if in each ranking the positive instance is ranked in the first position, an MRR of 0.5 if in each ranking the positive instance occurs in second position, 0.33 for the third position, and so on.
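The mean reciprocal rank used in the studies above can be sketched in a few lines. The input format (one list of booleans per ranking, ordered from the top-ranked candidate down) and the function name are assumptions made for this illustration only.

```python
def mean_reciprocal_rank(rankings):
    """Average, over all rankings, of 1 / (position of the first positive item).

    Each ranking is a list of booleans ordered from best-ranked to worst-ranked;
    a ranking without any positive item contributes 0.
    """
    total = 0.0
    for labels in rankings:
        for position, is_positive in enumerate(labels, start=1):
            if is_positive:
                total += 1.0 / position
                break
    return total / len(rankings)

# Positive instance always first, always second, always third.
always_first = mean_reciprocal_rank([[True, False], [True, False, False]])
always_second = mean_reciprocal_rank([[False, True], [False, True, False]])
always_third = mean_reciprocal_rank([[False, False, True]])
print(always_first, always_second, round(always_third, 2))
```

The three calls reproduce the reference points given in the footnote: 1 for first position, 0.5 for second, and roughly 0.33 for third.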
Examining related literature, we can conclude that regardless of the fact that word embeddings are a very popular source of semantic information and that their usage as input features for making predictions in NLP has been considered a standard approach for years now, they have not yet been tested in a supervised learning setting on the task of general collocation ranking.

3 RESEARCH

3.1 Methodology

3.1.1 Research questions

In order to establish how well word embeddings tackle the task of collocate ranking for lexicographic purposes in the case of Slovene, we compared the embeddings results to the results obtained using the logDice method. The comparisons were made in a quantitative and qualitative way and were led by the following three research questions:

Q1: Which approach produces lexicographically more relevant rankings of collocates: the one that uses machine learning over manually annotated data, or the one that uses heuristics?

Q2: Which approach is a more useful source of information for the rankings of collocates: the word embeddings approach, which encodes distributional semantics of words, or the logDice approach, which encodes frequency information?

Q3: Which ranking of collocates is preferred by lexicographers: the embeddings ranking, or the logDice ranking?

As the questions imply, we wanted to know whether the currently still dominant approach of using heuristics for collocate ranking is really better than the machine learning approach, which implicitly learns the underlying rules from examples. The second question was aimed at comparing two sources of information: frequency, which is used in a heuristic way in logDice, and distributional semantics, which is exploited from word embeddings via machine learning. Finally, our third question put potential users of the two compared approaches (i.e.
lexicographers) into focus and examined their preferences in actual cases.

3.1.2 Collocation datasets

3.1.2.1 KOLOS dataset

The KOLOS dataset contained a carefully selected set of 333 headwords, consisting of 154 nouns, 73 verbs, 81 adjectives, and 25 adverbs. The selected headwords were as heterogeneous as possible in terms of word class subcategories (e.g. plural nouns, countable nouns, transitive vs intransitive verbs etc.), corpus frequency, level of polysemy (number of different meanings), semantic characteristics (e.g. abstract vs concrete senses; qualitative vs classifying adjectives), etc. For each headword, we used collocations extracted for the purposes of the Collocations Dictionary of Modern Slovene (Kosem et al., 2018; Kosem et al., 2019). It should be noted that we already had a set of validated collocations from the Slovene Lexical Database (Gantar et al., 2016), and in order to devise a training dataset of good and bad collocation candidates, we decided to annotate only new ones (i.e. not yet validated collocations). This meant that we were often annotating the collocations slightly lower down the logDice-ordered list for each grammatical relation.

In the annotation task, the annotators were presented with a collocation, the information on its grammatical relation, and a corpus example of its use. The annotation of collocations was conducted in the Pybossa tool,² with each collocation being annotated by three annotators (linguists). The examples were extracted with the GDEX tool (Kosem et al., 2013) in the Sketch Engine (Kilgarriff et al., 2008), using the Slovenian configuration. The annotators were presented with three main answer groups: YES (‘yes, this is a valid collocation’), NO (‘no, this is not a collocation’) and I DON’T KNOW (‘I don’t know if this is a collocation or not’) (the YES and NO groups had additional sub-options, but they were not used in this experiment).
² https://mnozicenje.cjvt.si/

The agreement between the YES, NO and I DON’T KNOW answers was then analysed. The final decision for the training dataset, which could only be YES or NO, was made on the basis of this agreement (e.g. total agreement on YES or on NO), while in borderline cases it was made through additional annotation or after joint discussion by the annotators.

The whole KOLOS dataset consisted of 17,540 collocation candidates belonging to 260 different grammatical relations. For the experiments performed in this paper we organised collocation candidates under 7,460 headwords (a headword being either of the two lexical parts of a bidirectional grammatical relation, so for take a walk we would have two collocations, once under the headword take, once under the headword walk). Experiments were done only on headwords that (1) had at least 10 collocation candidates for a specific grammatical relation, as our evaluation was headword-based (this was the only data organisation that allowed evaluation of frequency-based statistics), and that (2) covered both the positive and the negative class, so that discriminative machine learning (distinguishing between good and bad examples) could be performed.

With these selection criteria the KOLOS dataset was shrunk to the most frequent 8 grammatical relations (actually 4 bidirectional relations), 212 headwords and 2,671 collocation candidates.

3.1.2.2 KAS dataset

The KAS dataset is a set of academic Slovene headwords, such as analiza (analysis), tabela (table), razlikovati (to distinguish), relativno (relatively), accompanied by collocations and examples of use (Logar et al., 2019). The set was built from the one-billion-word KAS corpus (Erjavec et al., 2020). The corpus was harvested from the Open Science Portal of Slovenia (2000–2015). For the most part (71% of tokens), it consists of BSc and BA theses, followed by MSc and MA theses (20%), and PhD theses (4%).
Firstly, the initial list of candidates for the vocabulary of academic headwords was built by using the method of frequency profiling (Rayson and Garside, 2000). With this method we extracted lemmas that most differentiated the KAS corpus from the fiction part of the general corpus Kres (Logar et al., 2012, pp. 79–97). Secondly, we inspected each lemma on the list in the KAS corpus concordances, and also checked its typical context in the Sketch Engine tool. In this manner we determined whether the word in question belonged to a common expert discourse or not (the latter were excluded as it meant they were either grammatical words or technical terms). And thirdly, the final list of 463 headwords identified as typical of academic Slovene was supplemented by collocations and three examples of use for each collocation. The extraction of data was automatic; we used the same methodology as in the case of the KOLOS dataset (Kosem et al., 2011; Krek, 2012; Gantar et al., 2015; Kilgarriff and Kosem, 2012; Logar et al., 2014).

Automatically extracted data was then reviewed. We corrected the most obvious tagger performance mistakes, rearranged collocates that were not ideally grouped semantically, and deleted personal proper names, deixis, modal verbs and verbs with very broad meaning (e.g. to be, to be about (sth)). Nevertheless, all deletions remained part of the dataset, but were labelled as NEGATIVE collocation candidates.

Content-wise, the KAS dataset was heterogeneous with regard to its meaning and text function, but was either obviously or indirectly related to three roughly defined segments (Logar and Erjavec, 2019, pp. 212–213): (a) the formal structure and the writing of academic texts (e.g. in English bibliography, introduction, conclusions; empirical, defined, mentioned; to define, to cite); (b) the methodology of academic texts (e.g.
method, hypothesis, respondent; to analyse, to identify, to classify); or (c) the presentation and interpretation of the research data (e.g. number, portion, dependence; measured, calculated, accurate; to result from, to indicate, to cause; subsequently, relatively, successfully). With regard to word class, out of 463 headwords 226 were nouns, 119 adjectives, 86 verbs, and 32 adverbs (Logar et al., 2019). As far as the use in the KAS corpus is concerned, all words in the KAS dataset were monosemous.

In total, the KAS dataset consisted of 70,254 collocation candidates belonging to 342 different grammatical relations, organised under 5,220 headwords. By applying the same selection criteria as on the KOLOS dataset, our final KAS dataset on which we performed experiments shrunk to 8 grammatical relations (gramrels hereafter), 525 headwords and 14,722 collocation candidates.

3.1.3 Corpus information

The frequency and semantic information for our collocation candidates was obtained from the Gigafida 2.0 corpus (Krek et al., 2020). For calculating the frequency and logDice information as representatives of the frequency signal we used the Sketch Engine API. For calculating the (head)word embeddings as representatives of the semantic signal we used the fastText tool (Bojanowski et al., 2016), in skip-gram mode with default parameters, and the lemma and part-of-speech annotations present in Gigafida 2.0, KAS and other large corpora of Slovene (Ljubešić and Erjavec, 2018).

3.2 Experiment

As explained in the Introduction section, our experiment consisted of two main parts: 1. the quantitative analysis, and 2. the qualitative analysis. In both, we compared two approaches to collocate ranking, i.e. the logDice method and the word embeddings method. The quantitative analysis comprised two parts of the experiment, and the qualitative analysis added one more.
Each of the three parts of our experiment was directly related to one of the research questions formulated at the beginning of the research.

3.2.1 Quantitative analysis

3.2.1.1 Experimental setup

In the quantitative part of the experiment, our goal was to compare traditional statistic-based approaches to collocate ranking with approaches based on machine learning. Since the only organisation that we can obtain through traditional approaches are ranked results (collocation candidates with higher frequency or higher logDice score are ranked higher), we set up our machine-learning experiments also in the way that enabled us to obtain ranked results. To evaluate traditional methods in their regular usage scenario, we performed evaluation on a per-gramrel and per-headword basis.

For our evaluation metric, we used the AUC (area-under-the-curve) ROC (receiver operating characteristic) score, which is considered to be the go-to evaluation metric for ranking results, especially if the classes (positive and negative collocation candidates) are not balanced. Precisely this was the case in our datasets, as in our original KOLOS dataset we had 13,812 positive candidates and 3,728 negative ones. The situation in the KAS dataset was similar, with 53,150 positive and 8,811 negative collocation candidates.

The AUC ROC score quantifies the quality of a ranking result, with the worst-possible ranking (all negative collocation candidates being ranked higher than all positive candidates) obtaining the result of 0.0, a perfect ranking (all positive collocation candidates being ranked higher than all negative collocation candidates) obtaining the result of 1.0, and a random ranking (positive and negative candidates being randomly mixed) obtaining the result of 0.5.
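The behaviour of the AUC ROC score on a single ranking can be illustrated with a small sketch. It uses the standard pairwise reading of the score (the share of positive/negative pairs in which the positive candidate receives the higher score, with ties counted as half), written in plain Python for transparency. The function name and the toy scores are assumptions for this illustration, and scikit-learn's `roc_auc_score` computes the same quantity.

```python
def auc_roc(labels, scores):
    """AUC ROC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for a positive collocation candidate, 0 for a negative one.
    scores: ranking scores (e.g. frequency, logDice or a model prediction);
            higher means ranked higher. Ties count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        # The score is undefined unless both classes are present.
        raise ValueError("need at least one positive and one negative candidate")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

perfect = auc_roc([1, 1, 0, 0], [4.0, 3.0, 2.0, 1.0])  # positives on top
worst = auc_roc([1, 1, 0, 0], [1.0, 2.0, 3.0, 4.0])    # negatives on top
mixed = auc_roc([1, 0, 1, 0], [4.0, 3.0, 2.0, 1.0])    # one pair misordered
print(perfect, worst, mixed)
```

The error raised for single-class input mirrors the reason why only headwords covering both the positive and the negative class could be evaluated.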
For performing supervised machine learning experiments, we used support-vector machines (SVMs), a regular go-to algorithm in traditional machine learning. We did not use more recent neural-network approaches as (1) their parameters are harder to interpret, and (2) initial experiments on our datasets had shown very similar results regardless of the machine-learning approach used. We had to be able to predict continuous values to be used for ranking candidates, thus we trained SVM regressors. All our implementations are written in the scikit-learn toolkit (Pedregosa et al., 2011).

Given that we obtained an AUC ROC score for each ranking (i.e. one score for each gramrel and headword), we had to set up a way to average all scores on some defined level. We aimed at averaging on the gramrel and overall level. As (1) different headwords under specific gramrels had a different number of candidates, and (2) different gramrels had a different number of candidates, we decided to normalise our results by the number of candidates, so that each collocation candidate would have the same impact on the final score of a method.

Supervised machine learning requires two sets of data: training data (the data the model is built on) and testing data (the data the built model is evaluated on). Therefore, we performed a five-fold cross-validation, that is, we split our data into five groups and ran five iterations, each using four groups for training and one group for testing. By doing so we managed to evaluate the model on each data point available, which is directly comparable to the output of the statistic-based ranking methods, where we do not require training data. Furthermore, we made sure that whole headwords were sampled into groups, so that there was no leakage between training and testing data (e.g. training on some collocations of a headword and testing on other collocations of that headword).
This makes the machine-learning approach quite challenging and measures to what extent the model can generalise regularities on the gramrel level, but not on the level of specific headwords present in our dataset.

3.2.1.2 Results

As explained, we obtained results on two datasets, KOLOS and KAS, by comparing four different approaches to collocation candidate ranking:

• freq: ordering via decreasing frequency of the collocations;
• logDice: ordering via decreasing logDice statistic of the collocations (using the frequencies of the headword, collocate and collocation);
• SVM_freq: machine learning the ranking from the frequency of the collocation, the headword, the collocate and the logDice statistic (all frequencies being represented on the logarithmic scale);
• SVM_emb: machine learning the ranking from the embeddings of the headword, the collocate, and a sum of the two embeddings (to represent in a basic fashion the interaction between the two embeddings).

In Table 1, we present our results on the KOLOS dataset, together with the statistics on the size of the dataset for each gramrel. In Table 2, we give a similar description and results on the KAS dataset.

Focusing first on the overall results on each dataset (the TOTAL row), the picture is quite simple. The answer to our first research question, namely whether the machine-learning approach produces more relevant rankings of collocates than the approach based on heuristics, is positive. On the KOLOS dataset the two statistic-based approaches yielded scores of 0.52 and 0.47, while the two machine-learning-based approaches obtained scores of 0.58 and 0.71. On the KAS dataset the statistic-based approaches achieved scores of 0.58 and 0.63, while the machine-learning-based approaches obtained scores of 0.76 and 0.87.

As for our second research question, regarding the usefulness of the embeddings approach and the logDice approach as sources of information, the results again favoured the former. On the KOLOS dataset the frequency-based learning obtained the score of 0.58, while the semantic-based approach achieved the score of 0.71. On the KAS dataset the numbers obtained were 0.76 and 0.87, pointing to the same conclusion. Moreover, there was only one gramrel (among 16) on which the machine-learning approach based on semantic information did not score the best results among the four approaches evaluated here (namely, the logDice score 0.65 for the VERB + noun (accusative) gramrel, see italics in Table 2).

An interesting, if not troubling, observation is that ranking results via heuristics are quite close to the random baseline, with an average result on the KOLOS dataset of around 0.5 and on the KAS dataset of around 0.6. This suggests that the heuristics are largely unable to push the negative candidates towards the bottom of the ranking. However, it still might be that the overall order of candidates via these two heuristics is useful for human use. In our experiments, we were aware only of the positive vs negative collocation candidate distinction and not of all the subtle differences that collocations bring in a ranking scenario.
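Two details of the evaluation protocol from Section 3.2.1.1 can be sketched in plain Python: assigning whole headwords to cross-validation folds, and averaging per-ranking scores weighted by the number of candidates. This is a simplified illustration under stated assumptions, not the authors' implementation: the round-robin fold assignment, the function names and the toy data are all hypothetical. In scikit-learn, `GroupKFold` offers a comparable headword-grouped split; the paper does not say which utility was used.

```python
def assign_folds(headwords, n_folds=5):
    """Map every distinct headword to one fold (round-robin, an assumption)."""
    distinct = sorted(set(headwords))
    return {hw: i % n_folds for i, hw in enumerate(distinct)}

def split_by_fold(candidates, fold_of, test_fold):
    """Split (headword, collocate) pairs so that no headword is in both parts."""
    train = [c for c in candidates if fold_of[c[0]] != test_fold]
    test = [c for c in candidates if fold_of[c[0]] == test_fold]
    return train, test

def candidate_weighted_mean(rankings):
    """Average per-ranking scores, weighting each by its number of candidates.

    rankings: list of (score, number_of_candidates) pairs.
    """
    total = sum(size for _, size in rankings)
    return sum(score * size for score, size in rankings) / total

# Toy data: 10 hypothetical headwords with 4 collocation candidates each.
candidates = [(f"headword_{i % 10}", f"collocate_{i}") for i in range(40)]
fold_of = assign_folds([hw for hw, _ in candidates])

tested = 0
for fold in range(5):
    train, test = split_by_fold(candidates, fold_of, fold)
    # No headword may occur on both sides of the split.
    assert not {hw for hw, _ in train} & {hw for hw, _ in test}
    tested += len(test)

# A ranking with 30 candidates weighs three times as much as one with 10.
overall = candidate_weighted_mean([(1.0, 10), (0.5, 30)])
print(tested, overall)
```

Every candidate ends up in exactly one test fold, which is what makes the cross-validated scores comparable with the training-free freq and logDice rankings.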
Table 1: KOLOS dataset: the ranking results of the machine learning approach*

| gramrel | # heads | # collos | freq | logDice | SVM_freq | SVM_emb |
|---|---|---|---|---|---|---|
| adjective + NOUN | 38 | 576 | 0.526 | 0.405 | 0.56 | 0.653 |
| ADJECTIVE + noun | 54 | 983 | 0.503 | 0.463 | 0.534 | 0.692 |
| NOUN + noun (genitive) | 22 | 481 | 0.698 | 0.353 | 0.712 | 0.78 |
| noun + NOUN (genitive) | 47 | 967 | 0.517 | 0.501 | 0.631 | 0.723 |
| VERB + noun (accusative) | 13 | 231 | 0.468 | 0.443 | 0.432 | 0.64 |
| verb + NOUN (accusative) | 13 | 242 | 0.444 | 0.405 | 0.472 | 0.737 |
| ADVERB + adjective | 12 | 261 | 0.368 | 0.677 | 0.602 | 0.802 |
| adverb + ADJECTIVE | 13 | 221 | 0.584 | 0.62 | 0.515 | 0.669 |
| TOTAL | 212 | 3962 | 0.523 | 0.469 | 0.577 | 0.71 |

* Capital items: the headword and the starting point of the collocation (also from here forward, i.e. in Table 2, Table 4, etc.).

Table 2: KAS dataset: the ranking results of the machine learning approach

| gramrel | # heads | # collos | freq | logDice | SVM_freq | SVM_emb |
|---|---|---|---|---|---|---|
| ADJECTIVE + noun | 53 | 1737 | 0.537 | 0.563 | 0.665 | 0.738 |
| adjective + NOUN | 118 | 3045 | 0.58 | 0.689 | 0.8 | 0.932 |
| NOUN + noun (genitive) | 46 | 1677 | 0.559 | 0.534 | 0.603 | 0.866 |
| noun + NOUN (genitive) | 72 | 1999 | 0.565 | 0.556 | 0.623 | 0.878 |
| VERB + noun (accusative) | 18 | 828 | 0.619 | 0.651 | 0.59 | 0.556 |
| verb + NOUN (accusative) | 77 | 1947 | 0.632 | 0.597 | 0.913 | 0.922 |
| ADVERB + adjective | 52 | 1468 | 0.745 | 0.709 | 0.802 | 0.894 |
| adverb + ADJECTIVE | 89 | 2021 | 0.431 | 0.706 | 0.915 | 0.954 |
| TOTAL | 525 | 14722 | 0.576 | 0.628 | 0.757 | 0.871 |

For the different gramrels we also performed a correlation analysis to measure how stable the results across gramrels and methods are between the two datasets. We calculated the Pearson correlation coefficient between the 8 results for each of the four methods on the KOLOS and on the KAS dataset. For the frequency method, we obtained a significant (p = 0.043) strong negative result (r = –0.722), and for the logDice method we again obtained a significant (p = 0.029), but strong positive result (r = 0.758). For the SVM_freq method our result was not significant (p = 0.36) and was moderately negative (r = –0.375), while for the SVM_emb method the result was also not significant (p = 0.183), but was moderately positive (r = 0.524).

These results show that in the machine learning scenario achievements on specific grammatical relations differ quite a lot between datasets, while the logDice method was similarly (un-)successful on different gramrels. Nevertheless, the samples these calculations are based on are very small and one should take these results with caution. The only claim that could be made here is that in most cases the per-gramrel results are quite inconsistent.

3.2.2 Qualitative analysis

3.2.2.1 Experimental setup

We expected that lexicographers, too, would prefer the machine-learning results to those of heuristics. We tested this expectation, which corresponds to our third research question, by presenting them with two side-by-side columns for each headword in a specific grammatical relation, one column representing the logDice ranking and one column representing the embeddings ranking of collocates (see an example in Table 3). Lexicographers were asked to evaluate which column was more informative to them (column A or B), but they could also choose the answer Both columns are similarly (un)informative. This meant that either (a) both measures were equally informative or useful, or that (b) neither of the measures was informative or useful. In addition, participants were told that they were evaluating results of the two aforementioned collocation extraction methods, but did not know which column was the result of which method. We also instructed them to pay more attention to the top halves of the lists in both columns. No other instructions for the evaluation process were given.
Table 3: KOLOS dataset: headword belina (whiteness), grammatical relation: NOUN + noun (genitive) (the whiteness of __)

| ranking | logDice (A) | embeddings (B) |
|---|---|---|
| 1. | zob (tooth)** | stena (wall (interior)) |
| 2. | sneg (snow) | pokrajina (landscape) |
| 3. | marmor (marble) | oblačilo (clothes) |
| 4. | polt (complexion) | perilo (washing) |
| 5. | perilo (washing) | kamen (stone) |
| 6. | platno (linen) | marmor (marble) |
| 7. | papir (paper) | obleka (dress) |
| 8. | stena (wall (interior)) | koža (skin) |
| 9. | zid (wall) | platno (linen) |
| 10. | kamen (stone) | zid (wall) |
| 11. | nebo (sky) | sneg (snow) |
| 12. | obleka (dress) | nebo (sky) |
| 13. | oblačilo (clothes) | papir (paper) |
| 14. | pokrajina (landscape) | polt (complexion) |
| 15. | koža (skin) | zob (tooth) |
| 16. | obraz (face) | obraz (face) |

** Bold print = in the case of the embeddings method, a noticeable drop in the ranking; underlined words = in the case of the embeddings method, a noticeable increase in the ranking.

This part of the experiment was partially done via a set of .txt documents and partially via an online survey. First, a preliminary evaluation on a smaller set of .txt documents was performed by two lexicographers, one familiar with the KAS database and the other familiar with the KOLOS database. During this phase, the lexicographer evaluating the KAS database favoured logDice as having better ranking results, while the second lexicographer in some cases preferred the embeddings and noticed that the performance of this method might have been gramrel-dependent. Since the preliminary evaluation was inconclusive, seven other lexicographers were later invited to participate in the study (that is, in the online survey part of it).

The questionnaire of the online survey only included headwords from the KOLOS dataset, while the KAS dataset was further inspected only by the lexicographer who conducted the preliminary analysis.
The reason for this decision was that all lexicographers invited to the online survey had experience with general dictionaries and general dictionary-like resources and were all involved in the KOLOS project, while only one lexicographer participated in the KAS project, that is, the part that focused on general academic discourse vocabulary. Since we wanted to keep the expectations and initial positions of all the lexicographers homogeneous, we kept the two groups of evaluators separate, as well as the datasets they evaluated.

The further analysis of the KAS dataset, performed, as mentioned, by one lexicographer, covered eight randomly chosen headwords in each of ten different grammatical relations (i.e. 80 headwords: 24 nouns, 8 adjectives, 32 verbs, 16 adverbs), which in total amounted to 2,095 collocations, each repeated in the two columns. On average, this meant 26 collocates per headword in a specific gramrel (with a minimum of 10 and a maximum of 93 collocates per headword). In this second phase of the evaluation, the lexicographer evaluating the KAS dataset paid closer attention to the top halves of the collocate columns, as did the online survey participants.

The online survey consisted of 63 headwords (34 nouns, 18 adjectives, 11 verbs) and their collocates in seven different grammatical relations. Because we wanted to cover a broader range of gramrels, only three of them were the same in both datasets. The survey was divided into seven separate grammatical relation subsurveys, which meant that each grammatical relation had its own survey link. This was done to keep the cognitive load manageable for participants (they could complete the survey for one grammatical relation and continue with the next one on another day), and to facilitate the analyses. In total, there were 146 pairs of collocate lists (i.e. questions in the survey; see Table 4).
It should be noted that, for various reasons (time constraints etc.), not all participants completed all seven grammatical relation surveys.

Table 4: Online surveys: number of headwords and number of lexicographers participating

| gramrel | number of headwords | number of participants |
|---|---|---|
| VERB + noun (accusative) | 12 | 6 |
| verb + NOUN (accusative) | 26 | 8 |
| ADJECTIVE + noun | 19 | 6 |
| adjective + NOUN | 30 | 6 |
| adverb + ADJECTIVE | 11 | 7 |
| NOUN + noun (genitive) | 19 | 6 |
| noun + NOUN (genitive) | 29 | 6 |

3.2.2.2 Results

3.2.2.2.1 KAS collocates ranking

As Table 5 and Figure 1 show, the lexicographer evaluating the KAS database in the second phase of the study again did not find the embeddings rankings better than the logDice rankings. In almost two thirds of cases (51/80), she decided that both columns were very similar, and in almost all of the remaining cases (26/80) the embeddings, in her opinion, performed worse. The three cases in which the embeddings performed better can thus be regarded as exceptions. A closer look at the grammatical relations reveals that, according to the lexicographer's judgments, the success of the two ranking methods varies. The logDice ranking was preferred in the grammatical relations NOUN + noun (genitive) and VERB + noun (accusative), while the ranking results of both methods were very similar in four relations (right side of Figure 1): NOUN + “for” + noun (accusative), VERB + “and_or” + verb, ADVERB + adjective, and ADVERB + verb.
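The proportions quoted above follow directly from the absolute counts. As a minimal sketch of that arithmetic (the helper name `shares` is ours and purely illustrative; the counts are the three totals given in the text):

```python
# Judgments of the KAS lexicographer as reported in the text: 80 lists in total.
kas_judgments = {"logDice better": 26, "embeddings better": 3, "very similar": 51}


def shares(counts):
    """Convert absolute counts into percentages of their total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to summarise")
    return {answer: 100 * count / total for answer, count in counts.items()}


kas_shares = shares(kas_judgments)

# Share of logDice preferences among the lists where a preference was expressed at all.
decided = kas_judgments["logDice better"] + kas_judgments["embeddings better"]
logdice_among_decided = 100 * kas_judgments["logDice better"] / decided

for answer, pct in kas_shares.items():
    print(f"{answer:18s} {kas_judgments[answer]:3d}  {pct:5.2f}%")
print(f"logDice preferred in {logdice_among_decided:.1f}% of the {decided} decided cases")
```

The second figure restates "almost all of the remaining cases": of the 29 lists for which one column was preferred, 26 favoured logDice.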
Table 5: KAS dataset: logDice ranking vs embeddings ranking of collocates per grammatical relation (in absolute numbers and percentage)

Columns: gramrel; logDice better: number and (%); embeddings better: number and (%); very similar: number and (%)

ADJECTIVE + noun: 4 (50), 4 (50)
VERB + adverb: 3 (37), 5 (63)
NOUN + noun (genitive): 5 (63), 1 (2), 2 (25)
VERB + noun (genitive): 3 (38), 1 (2), 4 (50)
VERB + noun (accusative): 5 (63), 3 (37)
NOUN + preposition v (in) + noun (locative): 4 (50), 4 (50)
NOUN + preposition za (for) + noun (accusative): 1 (2), 7 (98)
VERB + conjunction in_ali (and_or) + verb: 8 (100)
ADVERB + adjective: 1 (2), 7 (98)
ADVERB + verb: 1 (2), 7 (98)
TOTAL: 26 (32), 3 (4), 51 (64)

Figure 1: KAS dataset: logDice ranking vs embeddings ranking of collocates per grammatical relation (in absolute numbers).

3.2.2.2.2 KOLOS collocate ranking

Overall, the most popular answer in the online survey was Both columns are similarly (un)informative (45% of the answers, Table 6), which indicates that the participants, who had a general dictionary-like resource in mind, almost half of the time did not consider one ranking better than the other.

Table 6: KOLOS dataset: logDice ranking vs embeddings ranking of collocates per grammatical relation (in absolute numbers and percentage)

| gramrel | logDice better: number and (%) | embeddings better: number and (%) | very similar: number and (%) | total answers: number |
|---|---|---|---|---|
| VERB + noun (accusative) | 16 (24) | 22 (33) | 28 (42) | 66 |
| verb + NOUN (accusative) | 74 (37) | 24 (12) | 102 (51) | 200 |
| ADJECTIVE + noun | 33 (31) | 31 (29) | 44 (41) | 108 |
| adjective + NOUN | 73 (42) | 21 (12) | 80 (46) | 174 |
| adverb + ADJECTIVE | 27 (39) | 8 (11) | 35 (50) | 70 |
| NOUN + noun (genitive) | 23 (21) | 36 (33) | 50 (46) | 109 |
| noun + NOUN (genitive) | 44 (26) | 62 (37) | 62 (37) | 168 |
| TOTAL | 290 (32) | 204 (23) | 401 (45) | 895 |

Figure 2: KOLOS dataset: logDice ranking vs embeddings ranking of collocates per grammatical relation (in absolute numbers).

Of the two measures, logDice was considered better more frequently than embeddings (32% vs 23% of the answers, respectively). However, as Table 6 and Figure 2 show, the ratio between the two measures varied considerably across grammatical relations. The logDice ranking was preferred in the grammatical relations verb + NOUN (accusative), adjective + NOUN, and adverb + ADJECTIVE. The embeddings ranking, on the other hand, was preferred in the relations VERB + noun (accusative), NOUN + noun (genitive), and noun + NOUN (genitive).

We also searched for patterns in the results at the headword level, especially for headwords that featured in at least two different grammatical relations. We wanted to establish whether certain headwords favour one of the measures across different grammatical relations.
As in the findings above, logDice was again preferred more often than embeddings: participants preferred it for 26 headwords in different grammatical relations, while the embeddings results were preferred for only 14 headwords (for the remaining headwords no considerable differences in preferences were observed). The headwords identified also showed no clear common pattern.

At the end of both evaluations, we made a numerical comparison of the overall results for both datasets (Figure 3).

Figure 3: KAS and KOLOS dataset: logDice ranking vs embeddings ranking of collocates – both evaluations in total (in percentage).

Even though our deduction is limited by the fact that only one lexicographer examined the KAS dataset, one feature in Figure 3 stands out: the embeddings rankings of collocates were recognised as more informative to a noticeably larger extent for the KOLOS dataset (23%) than for the KAS dataset (4%). It is possible that this is a consequence of the KOLOS headwords being much more polysemous. If that is the case, at least this part of our qualitative analysis favours the semantic-based method over the logDice metric. Nevertheless, as a whole, our third research question, namely which ranking of collocates is preferred by lexicographers, must be answered as follows: lexicographers prefer the logDice ranking.

4 DISCUSSION

The main point that needs to be discussed is the difference between the results of the quantitative and qualitative analyses. With the results of the quantitative analysis so convincingly in favour of the embeddings approach, it was somewhat surprising to learn that the lexicographers did not confirm this finding. In this section, we present some possible explanations for this discrepancy.
But first, let us turn our attention to the fact that, in comparison to the KOLOS dataset, higher scores of the machine-learning-based approaches were consistently obtained on the data from KAS. This seems to have been influenced by two features: the (non)specialised content of the two corpora, and the monosemy or polysemy of the selected headwords. As mentioned, all headwords in the KAS dataset are monosemous (but not technical), and the KAS corpus is domain- and genre-specific; on the other hand, more than half of the KOLOS headwords were polysemous and therefore used in various contexts, and they (and their collocates) originated from a general, domain- and genre-diverse corpus of Slovene. The latter very probably limited the machine-learning process, while the former enhanced it. We believe this should be kept in mind in follow-up tests of the embeddings method and its use in dictionary-making projects.

When answering our first and second questions using the AUC ROC score and the SVM learning algorithm, the machine-learning-based approaches ranked better than the statistic-based ones (KOLOS scores on frequency information: 0.52 vs 0.58), and the semantic information given through word embeddings was more useful than frequency information (KOLOS scores on using machine learning on frequency and embeddings: 0.58 vs 0.71). Yet the lexicographers' most frequent evaluation was a non-decisive one: in roughly half or more of the cases (45% for KOLOS and 64% for KAS), both rankings of collocates seemed very similar to them. In fact, the survey participants' comments suggest that the task of deciding which ranking was better even proved frustrating at times.
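To make the AUC ROC scores quoted above easier to interpret: the score can be read as the probability that a randomly chosen good collocate receives a higher score than a randomly chosen bad one, so 0.5 corresponds to a random ordering and 1.0 to a perfect one. The sketch below computes it by direct pairwise comparison. It is an illustration only, not the evaluation pipeline used in this study; the function name `pairwise_auc` and the scores and labels are invented for the example.

```python
def pairwise_auc(scores, labels):
    """AUC ROC as the share of (good, bad) pairs in which the good item scores higher.

    Ties between a good and a bad item count as half a correctly ordered pair.
    """
    good = [s for s, y in zip(scores, labels) if y == 1]
    bad = [s for s, y in zip(scores, labels) if y == 0]
    if not good or not bad:
        raise ValueError("need at least one good and one bad candidate")
    ordered = 0.0
    for g in good:
        for b in bad:
            if g > b:
                ordered += 1.0
            elif g == b:
                ordered += 0.5
    return ordered / (len(good) * len(bad))


# Hypothetical ranking scores for six collocate candidates of one headword
# (1 = annotated as a good collocate, 0 = annotated as a bad one).
example_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
example_labels = [1, 0, 1, 1, 0, 0]

auc = pairwise_auc(example_scores, example_labels)
print(f"AUC = {auc:.3f}")
```

In this invented example, seven of the nine good/bad pairs are ordered correctly. A difference such as 0.58 vs 0.71 therefore means that noticeably more such pairs end up in the right order, which does not necessarily translate into visibly different top halves of two collocate lists.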
Many KOLOS survey participants mentioned that they often deliberated over the monosemous or polysemous character of the headword, the similarity of the collocates and their broad meaning, while the lexicographer evaluating the KAS dataset disfavoured columns that had overly general or overly technical words among approximately the top ten collocates. Nevertheless, all of them cast their votes with considerable uncertainty, and the votes were very diverse.

Our survey instructions were intentionally non-explicit; in other words, instruction-wise, we did not address the aforementioned differences. We wanted to learn, in general, whether the semantic nature of the embeddings collocation extraction method could be recognised and found advantageous for lexicographic work. Unfortunately, our conclusions suggest that the higher algorithm scores, though numerically significant, were for the most part not obvious to humans. Just one segment of the KOLOS vs KAS evaluation results confirmed that there is indeed some potential in the semantic nature of the embeddings collocate rankings: for the much more polysemous KOLOS dataset, the embeddings ranking was recognised as more informative than the logDice ranking in 23% of the answers, while this was the case in only 4% of the cases for the KAS dataset. However, since the KAS data was evaluated by only one lexicographer, further studies should examine this indication in more detail.

With regard to the embeddings method being gramrel-dependent, i.e. more successful for some grammatical relations than for others, nothing can be concluded. By choosing a set of 17 relations (KAS: 10, KOLOS: 7), with only three of them overlapping, we were able to get a broader view gramrel-wise, but the number of headwords per grammatical relation was thus reduced (in total KAS: 80, KOLOS: 63). Consequently, none of the relations was analysed comprehensively.
Even for the gramrels that overlapped in the datasets (ADJECTIVE + noun; NOUN + noun (genitive); VERB + noun (accusative)), the survey results were not uniform and do not allow for any obvious inference. The importance of gramrels for the task of embedding-based collocation extraction is in fact rather questionable, as initial experiments on training one single model for collocation extraction on all gramrels showed very similar results to those of training separate models for each gramrel. For the sake of better control over the process and a more interesting analysis, in this research we opted for keeping gramrel data and gramrel experiments separate, but other scenarios are, of course, possible for future fine-tuning of the method.

Finally, we must consider the part that human intuition, or rather lexicographers' knowledge, experience, and past and present project involvement, played in our experiment. Lexicographers' evaluation, though an expert one, played a crucial role not once, but twice: firstly, during the annotation of collocations before the quantitative part of the experiment; and secondly, after it, in the form of lexicographers' judgments of the informativeness of the collocate rankings. Machine learning was, of course, performed on the pre-annotated dataset taken as a kind of gold standard, which actually meant that the lexicographers' preferences in the post-ranking phase primarily reflected the annotators' preceding decisions. Here, it is important to stress that both groups of experts consisted of almost the same people, though about five months passed between the two phases of the experiment. Also, since the pre-treatment of the KAS dataset was not identical to the pre-treatment of the KOLOS dataset, and the same goes for the evaluation part of the experiment, the comparison between the results of both datasets is far from optimal.
In this respect, our conclusions need to be treated as merely preliminary.

5 CONCLUSIONS

Recent trends in lexicography have focused on automating certain aspects of language description, especially those related to collocations and examples (e.g. Kilgarriff and Rychlý, 2010; Rundell and Kilgarriff, 2011). As Cook et al. (2013, p. 50) point out, a “striking outcome of the work done so far in this area is that automation not only delivers efficiency savings but also leads to improvements in quality”.

Lexicographers are used to inspecting long lists of collocates, separating the wheat from the chaff, but when automatically produced language resources are in question, the differing results of different extraction tools matter, and improvements in quality are always possible. In our research, we used a supervised machine-learning approach to collocation extraction and ranking with the aim of establishing how advantageous it is when compared to the heuristic frequency-based logDice metric. We found that while supervised approaches do improve over the unsupervised baseline in an automation setting, in most cases the lexicographers did not appreciate this “improvement”.

Nevertheless, the results are not discouraging. They show (and confirm) that, ideally, a good collocation extraction tool is one that combines computational measurements and lexicographers' input. Obviously, modern lexicography is still an inherently multidisciplinary endeavour with the never fully answered question of how to measure what is informative, relevant, and significant. This seems to hold even more for language resources of the digital era.
Acknowledgments

The research was conducted as part of the project Collocation as a basis for language description: semantic and temporal perspectives (J6-8255), funded by the Slovenian Research Agency, and within the national research programmes Slovene language – basic, contrastive, and applied studies (P6-0215) and Language resources and technologies for Slovene language (P6-0411), also funded by the Slovenian Research Agency.

REFERENCES

Berry-Rogghe, G. L. (1973). The Computation of Collocations and their Relevance in Lexical Studies. In A. J. Aitken, R. W. Bailey & N. Hamilton-Smith (Eds.), The Computer and Literary Studies (pp. 103–112). Edinburgh, New York: University Press.
Biber, D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, 8(4), 243–57.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors with Subword Information. In H. Schütze (Ed.), Transactions of the Association for Computational Linguistics 5 (pp. 135–146).
Camacho-Collados, J., & Pilehvar, M. T. (2018). From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. Journal of Artificial Intelligence Research 63, 743–788.
Church, K. W., Gale, W., Hanks, P., & Hindle, D. (1991). Using Statistics in Lexical Analysis. In U. Zernik (Ed.), Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon (pp. 116–164). Erlbaum, Hillsdale, NJ.
Church, K., & Hanks, P. (1990). Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16(1), 22–29.
Cook, P., Lau, J. H., Rundell, M., McCarthy, D., & Baldwin, T. (2013). A Lexicographic Appraisal of an Automatic Approach for Detecting New Word Senses. In I. Kosem, J. Kallas, P. Gantar, S. Krek, M. Langemets & M.
Tuulik (Eds.), Electronic Lexicography in the 21st Century: Thinking Outside the Paper, Proceedings of the eLex 2013 Conference, 17–19 October 2013, Tallinn, Estonia (pp. 49–65). Ljubljana/Tallinn: Trojina, Institute for Applied Slovene Studies/Eesti Keele Instituut.
Enikeeva, E. V., & Mitrofanova, O. A. (2017). Russian Collocation Extraction Based on Word Embeddings. In V. Selegey et al. (Eds.), Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2017” (pp. 52–64). Moscow: The Computational Linguistics and Intellectual Technologies.
Erjavec, T., Fišer, D., & Ljubešić, N. (2020). The KAS Corpus of Slovenian Academic Writing. Language Resources & Evaluation 55, 551–583.
Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations, PhD Thesis. University of Stuttgart.
Evert, S. (2009). Corpora and Collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook 2 (pp. 1212–1248). Berlin/New York: Mouton de Gruyter.
Evert, S., Uhrig, P., Bartsch, S., & Proisl, T. (2017). E-VIEW-alation – a Large-scale Evaluation Study of Association Measures for Collocation Identification. In I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek & V. Baisa (Eds.), Electronic Lexicography in the 21st Century, Proceedings of eLex 2017 Conference (pp. 531–549). Leiden, Netherlands/Brno: Lexical Computing CZ s.r.o.
Firth, J. R. (1957). Modes of Meaning: Papers in Linguistics: 1934–1951. London: Oxford University Press.
Gantar, P., Kosem, I., & Krek, S. (2016). Discovering Automated Lexicography: the Case of the Slovene Lexical Database. International Journal of Lexicography, 29(2), 200–225.
Gantar, P., Krek, S., Kosem, I., & Gorjanc, V. (2015). Collocation Dictionary for Slovene: Challenge for Automatic Extraction of Data and Crowdsourcing. In G. Corpas Pastor, M. Buendía Castro & R.
Guttiérrez Florido (Eds.), Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives (Fraseología computacional y basada en corpus: perspectivas monolingües y multilingües), Europhras, 2015 (pp. 84–86). Malaga: Lexytrad, Research Group in Lexicography and Translation.
Garcia, M., García-Salido, M., & Alonso-Ramos, M. (2017). Using Bilingual Word-embeddings for Multilingual Collocation Extraction. In S. Markantonatou, C. Ramisch, A. Savary & V. Vincze (Eds.), Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017) (pp. 21–30). Valencia: Association for Computational Linguistics.
Gorjanc, V., & Fišer, D. (2010). Korpusna analiza. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani.
Gorjanc, V., & Vintar, Š. (2000). Iskanja po korpusu slovenskega jezika FIDA. In T. Erjavec & J. Gros (Eds.), Jezikovne tehnologije: Zbornik konference (pp. 20–27). Ljubljana: Institut Jožef Stefan.
Gries, S. (2013). 50-something Years of Work on Collocations. International Journal of Corpus Linguistics, 18(1), 137–165.
Kilgarriff, A., Rychlý, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. In G. Williams & S. Vessier (Eds.), Proceedings of the Eleventh EURALEX International Congress, EURALEX 2004 Lorient, France July 6–10, 2004 (pp. 105–116). Lorient: Université de Bretagne – sud.
Kilgarriff, A., Husák, M., McAdam, K., Rundell, M., & Rychlý, P. (2008). GDEX: Automatically Finding Good Dictionary Examples in a Corpus. In E. Bernal & J. DeCesaris (Eds.), Proceedings of the Thirteenth EURALEX International Congress (pp. 425–432). Barcelona, Spain: Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra.
Kilgarriff, A., & Rychlý, P. (2010). Semi-automatic Dictionary Drafting. In G.-M. de Schryver (Ed.), A Way with Words: A Festschrift for Patrick Hanks (pp. 299–312). Kampala: Menha Publishers.
Kilgarriff, A., & Kosem, I. (2012). Corpus Tools for Lexicographers. In S. Granger & M.
Paquot (Eds.), Electronic Lexicography (pp. 31–56). Oxford: Oxford University Press.
Kosem, I., Gantar, P., & Krek, S. (2013). Automation of Lexicographic Work: an Opportunity for Both Lexicographers and Crowd-sourcing. In I. Kosem, J. Kallas, P. Gantar, S. Krek, M. Langemets & M. Tuulik (Eds.), Electronic Lexicography in the 21st century: Thinking Outside the Paper, Proceedings of the eLex 2013 Conference, 17–19 October 2013, Tallinn, Estonia (pp. 32–48). Ljubljana/Tallinn: Trojina, Institute for Applied Slovene Studies/Eesti Keele Instituut.
Kosem, I., Husak, M., & McCarthy, D. (2011). GDEX For Slovene. In I. Kosem & K. Kosem (Eds.), Electronic Lexicography in the 21st century: New Applications for New Users, Proceedings of eLex 2011, 10–12 November 2011, Bled, Slovenia (pp. 150–159). Ljubljana: Trojina, Institute for Applied Slovene Studies.
Kosem, I., Krek, S., Gantar, P., Arhar Holdt, Š., Čibej, J., & Laskowski, C. (2018). Collocations Dictionary of Modern Slovene. In J. Čibej, V. Gorjanc, I. Kosem & S. Krek (Eds.), Proceedings of the 18th EURALEX International Congress: Lexicography in Global Contexts, 17–21 July 2018, Ljubljana (pp. 989–997). Ljubljana: Ljubljana University Press, Faculty of Arts.
Kosem, I., Gantar, P., Krek, S., Arhar Holdt, Š., Čibej, J., Laskowski, C., Pori, E., Klemenc, B., Dobrovoljc, K., Gorjanc, V., & Ljubešić, N. (2019). Collocations Dictionary of Modern Slovene KSSS 1.0. Ljubljana: Slovenian Language Resource Repository CLARIN.SI. Retrieved from http://hdl.handle.net/11356/1250 (26. 8. 2021)
Krek, S. (2012). New Slovene Sketch Grammar for Automatic Extraction of Lexical Data: Presentation given at SKEW3, Brno, Czech Republic, 21–22 March 2012. Retrieved from https://trac.sketchengine.co.uk/attachment/wiki/SKEW-3/Program/Krek_SKEW-3.pdf?format=raw (26. 8.
2021)
Krek, S., Arhar Holdt, Š., Erjavec, T., Čibej, J., Repar, A., Gantar, P., Ljubešić, N., Kosem, I., & Dobrovoljc, K. (2020). Gigafida 2.0: the Reference Corpus of Written Standard Slovene. In N. Calzolari (Ed.), LREC 2020: Twelfth International Conference on Language Resources and Evaluation: May 11–16, 2020, Palais du Pharo, Marseille, France, Conference Proceedings (pp. 3340–3345). Paris: ELRA – European Language Resources Association.
Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (NIPS 2014) (pp. 1–9).
Li, J., & Jurafsky, D. (2015). Do Multi-sense Embeddings Improve Natural Language Understanding?. In L. Màrquez, C. Callison-Burch & J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1722–1732). Lisbon: Association for Computational Linguistics.
Liu, X., & Huang, D. (2017). Translation Oriented Sentence Level Collocation Identification and Extraction. In D. Wong & D. Xiong (Eds.), Machine Translation, CWMT 2017: Communications in Computer and Information Science 787 (pp. 78–89). Singapore: Springer.
Ljubešić, N., & Erjavec, T. (2018). Word Embeddings CLARIN.SI-embed.sl 1.0. Ljubljana: Slovenian Language Resource Repository CLARIN.SI. Retrieved from http://hdl.handle.net/11356/1204 (26. 8. 2021)
Logar, N., Grčar, M., Brakus, M., Erjavec, T., Arhar Holdt, Š., & Krek, S. (2012). Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede.
Logar, N., Gantar, P., & Kosem, I. (2014). Collocations and Examples of Use: a Lexical-semantic Approach to Terminology. Slovenščina 2.0, 2(1), 41–61.
Logar, N., & Erjavec, T. (2019).
Slovene Academic Writing: a Corpus Approach to Lexical Analysis. In I. Simonnæs (Ed.), New Challenges for Research on Language for Special Purposes: Selected Proceedings from the 21st LSP-Conference, 28–30 June 2017, Bergen, Norway (pp. 205–217). Berlin: Frank & Timme.
Logar, N., Kosem, I., & Erjavec, T. (2019). Collocation Lexicon of Slovene Academic Discourse Aleks. Ljubljana: Slovenian Language Resource Repository CLARIN.SI. Retrieved from http://hdl.handle.net/11356/1245 (26. 8. 2021)
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing, Chap. 5: Collocations. Cambridge, Massachusetts: The MIT Press.
Mel’cuk, I. (1996). Lexical Functions: a Tool for the Description of Lexical Relations in a Lexicon. Lexical Functions in Lexicography and Natural Language Processing, 31, 37–102.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Retrieved from https://arxiv.org/abs/1301.3781 (26. 8. 2021)
Pecina, P. (2009). Lexical Association Measures and Collocation extraction. Language Resources and Evaluation, 44(1–2), 137–158.
Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Rayson, P., & Garside, R. (2000). Comparing Corpora using Frequency Profiling. In WCC’00, Proceedings of the Workshop on Comparing Corpora, 9, 1–6.
Rodríguez-Fernández, S., Carlini, R., Espinosa Anke, L., & Wanner, L. (2016a). Example-based Acquisition of Fine-grained Collocation Resources. In N. Calzolari et al. (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 2317–2322). Portorož: ELRA.
Rodríguez-Fernández, S., Carlini, R., Espinosa Anke, L., & Wanner, L. (2016b). Semantics-driven Recognition of Collocations Using Word Embeddings. In K. Erk & N. A.
Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 499–505). Berlin: Association for Computational Linguistics.
Rundell, M., & Kilgarriff, A. (2011). Automating the Creation of Dictionaries: Where Will It All End?. In F. Meunier, G. Gilquin & M. Paquot (Eds.), A Taste for Corpora: in Honour of Sylviane Granger (pp. 257–282). Amsterdam: John Benjamins.
Rychlý, P. (2008). A Lexicographer-Friendly Association Score. In P. Sojka & A. Horák (Eds.), Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008 (pp. 6–9). Brno: Masaryk University.
Singleton, D. (2000). Language and the Lexicon: an Introduction. New York: Oxford University Press.
Wanner, L., Ferraro, G., & Moreno, P. (2017). Towards Distributional Semantics-Based Classification of Collocations for Collocation Dictionaries. International Journal of Lexicography, 30(2), 167–186.
Wiechmann, D. (2008). On the Computation of Collostruction Strength. Corpus Linguistics and Linguistic Theory, 4(2), 253–290.

RANKING COLLOCATES IN A LIST: FREQUENCY VS SEMANTICS

Collocations play a very important role in language description. This is especially true for identifying the meanings of words. Lists of collocates ranked by one of the statistical association measures have therefore become an indispensable part of sense division in modern lexicography. The paper presents a comparison between two approaches to the ranking of collocates: (a) the logDice method, which is well established and frequency-based, and (b) the word embeddings method, which is new and based on machine learning and word semantics. The comparison between the results of the two approaches was made on two datasets for Slovene, one containing headwords and their collocations from general language, and the other containing headwords and their collocations from academic language. Two methods were used to evaluate the results: in the quantitative part of the experiment, we carried out supervised machine learning with AUC ROC evaluation of the support-vector machine (SVM) algorithm; in the qualitative part, the results of both approaches to collocate ranking were additionally evaluated by lexicographers. The findings are not unequivocal: while the quantitative evaluation showed that the approach based on machine learning and semantic dispersion produced better collocate rankings than the frequency-based approach, the lexicographers mostly judged the collocate lists of the two approaches to be very similar.

Keywords: collocations, word embeddings, logDice, general language, academic language

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International. https://creativecommons.org/licenses/by-sa/4.0/