https://doi.org/10.31449/inf.v44i3.2996
Informatica 44 (2020) 387–393

How to Define Co-occurrence in a Multidisciplinary Context?

Mathieu Roche
CIRAD, TETIS, F-34398 Montpellier, France
TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
E-mail: mathieu.roche@cirad.fr, http://textmining.biz/Staff/Roche

Position paper

Keywords: co-occurrence, collocation, phrase, n-gram, skip-n-gram, association rule, sequential pattern

Received: October 28, 2019

This position paper presents a comparative study of co-occurrences. The definition shows both similarities and differences depending on the research domain (e.g. linguistics, natural language processing, computer science). This paper discusses these points and deals with the methodological aspects of identifying co-occurrences in a multidisciplinary paradigm.

Povzetek (Slovenian summary): An analysis of co-occurrence is presented.

1 Introduction

Determining co-occurrences in corpora is challenging for different applications such as classification, translation, terminology building, etc. More generally, co-occurrences can be identified in all types of data, e.g. databases [8], texts [30], images [38], music [15], video [19], etc.

The co-occurrence concept has different definitions depending on the research domain (e.g. linguistics, natural language processing (NLP), computer science, biology). This position paper reviews the main definitions in the literature and discusses similarities and differences according to the domains. This type of study can be crucial in the context of data science, which is geared towards developing a multidisciplinary paradigm for data processing and analysis, especially of textual data.

Here the co-occurrence concept related to textual data is discussed. Note that before their validation by an expert, co-occurrences of words are often considered as candidate terms.
First, Section 2 of this paper details the different definitions of co-occurrence according to the studied domains. Section 3 discusses and compares these different aspects, based both on their intrinsic definitions and on the methodologies used to identify them. Finally, Section 4 lists some perspectives.

2 Co-occurrence in a multidisciplinary context

2.1 Linguistic viewpoint

In linguistics, two notions are broadly used to define a term: the lexical unit [23] and the polylexical expression [16]. The latter represents a set of words having an autonomous existence, and is also called a multi-word expression [33].

In addition, several linguistics studies use the notion of collocation. Clas [10] gives two properties that define a collocation. First, a collocation is a group of words whose overall meaning is deducible from its units (words). For example, climate change is considered a collocation because the overall meaning of this group of words can be deduced from the words climate and change. On the other hand, the expression to rain cats and dogs is not a collocation because its meaning cannot be deduced from each of its words; this is called a fixed expression or an idiom.

The second property given in [10] is that the meaning of the words that make up the collocation must be limited. For example, buy a dog is not a collocation because the meaning of buy is not limited.

2.2 NLP viewpoint

In the natural language processing (NLP) domain, the co-occurrence notion refers to the general phenomenon where words are present together in the same context. More precisely, several principles are used that take contextual criteria into account.

First, the terms or phrases [6, 11] can respect syntactic patterns (e.g. adjective noun, noun noun, noun preposition noun, etc.). Some examples of extracted phrases (i.e. syntactic co-occurrences) are given in Table 1.
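As a minimal sketch of this pattern-based filtering, the Python snippet below keeps adjacent word pairs whose part-of-speech tags match a two-word pattern (noun noun, adjective noun). The tags are written by hand for the example sentence of Table 1, which is an assumption of the sketch (in practice they would come from a part-of-speech tagger), and the names TAGGED, PATTERNS and extract_phrases are hypothetical.

```python
# Hand-tagged tokens (assumption: the tags are assigned manually for this
# toy sentence; a real system would obtain them from a POS tagger).
TAGGED = [
    ("With", "ADP"), ("climate", "NOUN"), ("change", "NOUN"), ("the", "DET"),
    ("water", "NOUN"), ("cycle", "NOUN"), ("is", "AUX"), ("expected", "VERB"),
    ("to", "PART"), ("undergo", "VERB"), ("significant", "ADJ"), ("change", "NOUN"),
]

# Two-word syntactic patterns: noun noun and adjective noun.
PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}


def extract_phrases(tagged, patterns=PATTERNS):
    """Return adjacent word pairs whose tag pair matches one of the patterns."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in patterns:
            phrases.append(f"{w1} {w2}")
    return phrases


phrases = extract_phrases(TAGGED)
print(phrases)
```

Three-word patterns such as noun preposition noun would need a window of three tokens; they are left out to keep the sketch short.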
In addition, methods without linguistic filtering are also conventionally used in the NLP domain; they extract n-grams of words (i.e. lexical co-occurrences) [25, 35]. n-grams are contiguous sequences of n words extracted from a given sequence of text (e.g. the bi-grams¹ x y and y z are associated with the text x y z). n-grams that allow gaps are called skip-n-grams (e.g. the skip-bi-grams x y, x z, y z are related to the text x y z). The skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships [27]. Some examples of n-grams and skip-n-grams are given in Table 1.

¹ n-grams with n = 2.

After summarizing the term notion in the NLP domain, the following section discusses these aspects in the computer science context, particularly in data mining. Note that the NLP domain may be considered as being located at the interface between linguistics and computer science.

2.3 Computer science viewpoint

In the data mining domain, co-occurring items are described by association rules [1, 39], and they can be candidates for the construction or enrichment of terminologies [12]. In the data mining context, the list of items corresponds to the set of available articles. With textual data, items may represent the words present in sentences, paragraphs, or documents [2, 29]. A transaction is a set of items. A set of transactions is a learning set used to determine association rules.

Some extensions of association rules are called sequential patterns. They take into account a certain order of extracted elements [18, 34], with an enriched representation related to textual data as follows:

– objects represent texts or pieces of texts,
– items are the words of a text,
– itemsets represent sets of words present together within a sentence, paragraph or document,
– dates highlight the order of sentences within a text.
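The lexical and data mining representations introduced so far can be contrasted on the text x y z used above. The following Python sketch is only illustrative: the function names (ngrams, skip_bigrams, to_sequence) and the parameter max_skip are assumptions, sentences are split on whitespace, and each sentence is taken as one itemset dated by its rank in the text.

```python
def ngrams(tokens, n):
    """Contiguous sequences of n words (order and adjacency are kept)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Ordered word pairs separated by at most `max_skip` skipped words."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + max_skip + 2, len(tokens)))
    ]


def to_sequence(sentences):
    """One object (a text) as a dated sequence of itemsets:
    the date is the rank of the sentence, the itemset is the set of its
    words, so word order inside a sentence is dropped."""
    return [(date, frozenset(s.split())) for date, s in enumerate(sentences, start=1)]


tokens = "x y z".split()
bigrams = ngrams(tokens, 2)                  # pairs of adjacent words
skips = skip_bigrams(tokens, max_skip=1)     # pairs with at most one gap
sequence = to_sequence(["x y z", "z x"])     # two dated itemsets
print(bigrams, skips, sequence)
```

Applied to the twelve-word sentence of Table 1 with max_skip set to 2, skip_bigrams returns the 30 candidates listed there, while ngrams with n = 2 returns the 11 bi-grams.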
There are several algorithms for discovering association rules and sequential patterns. One of the most popular is Apriori [1], which extracts frequent itemsets from large databases by using frequent k-itemsets to generate candidate (k+1)-itemsets.

Association rules and sequential patterns of words are often used in text mining for different applications, e.g. terminology enrichment [12], association of concept instances [5, 29], classification [18, 34], etc.

3 Discussion: comparative study of definitions and approaches

This section compares (i) co-occurrence definitions (see Section 3.1) and (ii) automatic methods for identifying co-occurrences (see Section 3.2). It highlights some similarities and differences between domains.

3.1 Co-occurrence extraction

The general definition of co-occurrence is ultimately close to that of association rules in the data mining domain. Note that integrating windows² into the association rule or sequential pattern extraction process yields results similar to skip-n-gram extraction.

² Association Rule with Time-Windows (ARTW) [39].

The integration of syntactic criteria makes it possible to extract more relevant candidate terms (see Table 1). Such information is typically taken into account in NLP to extract terms from general or specialized domains [20, 24, 28, 32].

Table 1 highlights relevant terms extracted using linguistic patterns (e.g. climate change, water cycle, significant change). The use of linguistic patterns tends to improve precision. Other methods such as skip-bi-grams generally return lower precision, i.e. many extracted candidates are irrelevant (e.g. climate the). However, this kind of method can extract some relevant terms that are not found with linguistic patterns (e.g. cycle expected), so recall can be improved.

Sentence (input)
  With climate change the water cycle is expected to undergo significant change.

Candidates (output)
  Phrases (noun noun, adjective noun): climate change, water cycle, significant change
  bi-grams of words: With climate, climate change, change the, the water, water cycle, cycle is, is expected, expected to, to undergo, undergo significant, significant change
  2-skip-bi-grams: With climate, With change, With the, climate change, climate the, climate water, change the, change water, change cycle, the water, the cycle, the is, water cycle, water is, water expected, cycle is, cycle expected, cycle to, is expected, is to, is undergo, expected to, expected undergo, expected significant, to undergo, to significant, to change, undergo significant, undergo change, significant change

Table 1: Examples of candidates extracted with different NLP techniques.

Table 2 presents research domains related to different types of candidates, i.e. collocations, polylexical expressions, phrases, n-grams, association rules, sequential patterns.

Definitions              | Domains
Collocations             | L
Polylexical expressions  | L + NLP
Phrases                  | NLP
n-grams                  | NLP + CS
Association rules        | CS
Sequential patterns      | CS

Table 2: Summary of the main domains associated with expressions (L: linguistics, NLP: natural language processing, CS: computer science).

Table 3 summarizes the main criteria described in the literature. Note that the extraction is more flexible and automatic when there are fewer criteria. In this table, two types of information are associated with the different criteria. The first one (marked with X) designates the characteristics given by the co-occurrence definitions. The second type of information (marked with F) represents characteristics that are implemented in many extensions in the state of the art.

Columns: Ordered sequences | Sequences with gaps | Morpho-syntactic information | Semantic information
Collocations: X X F
Polylexical expressions: X X
Phrases: X X
n-grams: X F
Association rules: X
Sequential patterns: X X

Table 3: Summary of the main criteria associated with co-occurrence identification. X indicates that the criterion is respected by definition. F indicates that the criterion is covered by extensions commonly used in the state of the art.

Table 3 shows that the semantic criterion is seldom associated with co-occurrence definitions. This criterion is however taken into account in linguistics, where several studies consider semantic aspects [17, 22, 26]. In this context, Mel’čuk et al. [26] introduced lexical functions that rely on semantic criteria to define the relationships between collocation units. For instance, a given relation can be expressed in various ways between the arguments and their values, like Centr (the center, culmination of), which returns different meanings³:

– Centr(crisis) = the peak
– Centr(desert) = the heart
– Centr(forest) = the thick
– Centr(glory) = summit
– Centr(life) = prime

³ http://people.brandeis.edu/ smalamud/ling130/lex_functions.pdf

In the data mining domain, semantic information is used in two main directions. The first one involves filtering the results if they respect certain semantic information (e.g. phrases or patterns where a word is an instance of a semantic resource). Other methods involve semantic resources in the knowledge discovery process, i.e. the extraction is driven by semantic information [5].

In recent studies in the NLP domain, the semantic aspects are based on word embedding, which provides a dense representation of words and their relative meanings [14, 40].

Finally, note that several types of co-occurrence are often used in different domains. For example, polylexical expressions are commonly used in NLP and also in linguistics. In addition, n-grams are commonly used in the NLP and computer science domains. For example, n-grams of words are often used to build terminologies (NLP domain) but also as features for machine learning algorithms (computer science domain) [35].

Table 4 summarizes the main types of criteria (i.e. statistical, morpho-syntactic, and semantic) used for extracting co-occurrences according to the research domains considered in this paper.

Columns: Statistical information | Morpho-syntactic information | Semantic information
Linguistics: X F
NLP: X X F
Data mining: X F F

Table 4: Summary of the main criteria associated with research domains. X indicates that the criterion is respected when extracting co-occurrences from textual data. F indicates that the criterion is covered by extensions commonly used in the state of the art.

After presenting the characteristics associated with the co-occurrence notion in a multidisciplinary context, the following section compares the methodological viewpoints to identify these elements according to the domains.

3.2 Ranking of co-occurrences

Co-occurrence identification by automatic systems is generally based on the use of quality measures and/or algorithms. This section provides two illustrative examples that show similarities between approaches according to the domains.

3.2.1 Mutual Information and Lift measure

First, the use of specific statistical measures from different domains is highlighted. This subsection focuses on Mutual Information (MI). This measure is often used in the NLP domain to measure the association between words [9]. MI (see formula (3.1)) compares the probability of observing x and y together (joint probability) with the probability of observing x and y independently (chance) [9].
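To make this comparison concrete, the sketch below estimates MI on a toy corpus. It assumes that probabilities are relative frequencies, that P(x, y) is estimated from the ordered adjacent pair "x y", and that all counts are normalized by the number of tokens; the corpus and the function name mutual_information are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (assumption): one tokenized sentence per entry.
CORPUS = [
    "climate change alters the water cycle".split(),
    "the water cycle drives climate".split(),
    "climate change is a global change".split(),
]


def mutual_information(x, y, corpus):
    """log2( P(x, y) / (P(x) * P(y)) ).

    P(x, y) is estimated from the count of the adjacent ordered pair "x y",
    P(x) and P(y) from word counts; every count is divided by the corpus
    size n (number of tokens)."""
    words = Counter(w for sent in corpus for w in sent)
    pairs = Counter(p for sent in corpus for p in zip(sent, sent[1:]))
    n = sum(words.values())
    if pairs[(x, y)] == 0:
        return float("-inf")  # the ordered pair is never observed
    p_xy = pairs[(x, y)] / n
    return math.log2(p_xy / ((words[x] / n) * (words[y] / n)))


mi_climate_change = mutual_information("climate", "change", CORPUS)
mi_water_cycle = mutual_information("water", "cycle", CORPUS)
print(mi_climate_change, mi_water_cycle)
```

A positive score means that the pair is observed more often than chance would predict. Because the pair is counted in a fixed order, "climate change" and "change climate" do not receive the same score in this sketch.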
$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \qquad (3.1)$$

In general, word probabilities P(x) and P(y) correspond to the number of observations of x and y in a corpus normalized by the size of the corpus. Some extensions of MI are also proposed. The algorithm PMI-IR (Pointwise Mutual Information and Information Retrieval) described in [36] queries the Web via the AltaVista search engine to determine appropriate synonyms for a given query. For a given word, denoted x, PMI-IR chooses a synonym among a given list. These candidate terms, denoted y_i, i ∈ [1, n], correspond to TOEFL questions. The aim is to find the synonym y_i that gives the best score. To obtain scores, PMI-IR uses several measures based on the proportion of documents where both terms are present. Turney's formula (3.2), given below, is one of the basic measures used in [36]. It is inspired by the MI described in [9]. With this formula, the proportion of documents containing both x and y_i (within a 10-word window) is calculated and compared with the number of documents containing the word y_i. The higher this proportion, the more x and y_i are seen as synonyms.

$$\mathrm{score}(y_i) = \frac{nb(x \ \mathrm{NEAR} \ y_i)}{nb(y_i)} \qquad (3.2)$$

– nb(x) computes the number of documents containing the word x (i.e. nb corresponds to the number of webpages returned by search engines),
– NEAR (used in the ’advanced search’ field of AltaVista) is an operator that identifies whether two words are present in a 10-word-wide window.

This kind of web mining approach is also used in many NLP applications, e.g. (i) computing the relationship between host and clinical sign for an epidemiology surveillance system [3], (ii) computing the dependency of words of acronym definitions for word-sense disambiguation tasks [31].

Probabilities are generally symmetric (i.e. P(x, y) = P(y, x)), and so is the original MI measure. But the association ratio applied in the NLP domain is not symmetric, i.e. the numbers of occurrences of the word pairs "x y" and "y x" generally differ. Moreover, the meaning and relevance of phrases may differ according to the word order in a text, e.g. first lady and lady first.

Finally, MI is very close to the lift measure [7, 37, 4] in data mining. This measure identifies relevant association rules (see formula (3.3)). The lift measure evaluates the relevance of co-occurrences only (not implication) and how far x and y are from independence [4].

$$\mathrm{lift}(x \rightarrow y) = \frac{\mathrm{conf}(x \rightarrow y)}{\mathrm{sup}(y)} \qquad (3.3)$$

This measure is based on both confidence and support criteria, which in turn are based on the identification of the association rule (x → y). Support is an indication of how frequently the itemset appears in the dataset. Confidence is a standard measure that estimates the probability of observing y given x (see formula (3.4)).

$$\mathrm{conf}(x \rightarrow y) = \frac{\mathrm{sup}(x \cup y)}{\mathrm{sup}(x)} \qquad (3.4)$$

Note that other quality measures of the data mining domain, such as Least contradiction or Conviction [21], could be tailored to deal with textual data.

3.2.2 C-value and closed itemset

Another example is the methodological similarities associated with different approaches. For example, the C-value approach [13] used in the NLP domain [24, 20] favors terms that do not appear to a significant extent in longer terms. For example, in a specialized corpus related to ophthalmology, the work of [13] shows that a more general term such as soft contact is irrelevant, whereas a longer and therefore more specific term such as soft contact lens is relevant. This kind of measure is particularly relevant in the biology domain [24, 20].

In addition, in the computer science domain (i.e. data mining), the notion of closed itemset is ultimately very close to the C-value approach. In this context, a frequent itemset is considered closed if none of its supersets⁴ has the same support (i.e. frequency).
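Both parallels drawn in this section can be checked on toy data. In the Python sketch below, the transactions and all function names are assumptions made for illustration, and frequent itemsets are enumerated by brute force rather than with Apriori. With probabilities estimated by supports, the lift of formula (3.3) is the ratio sup(x ∪ y) / (sup(x) sup(y)), so its base-2 logarithm has the same form as MI in formula (3.1); the closed itemsets mirror the soft contact / soft contact lens example.

```python
import math
from itertools import combinations

# Toy transactions (assumption): each set holds the words of one sentence.
TRANSACTIONS = [
    {"soft", "contact", "lens"},
    {"soft", "contact", "lens", "care"},
    {"soft", "contact", "lens"},
    {"soft", "tissue"},
]


def sup(itemset):
    """Support: fraction of transactions containing every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in TRANSACTIONS if items <= t) / len(TRANSACTIONS)


def conf(x, y):
    """Confidence of x -> y, i.e. sup(x U y) / sup(x) (formula (3.4))."""
    return sup({x, y}) / sup({x})


def lift(x, y):
    """Lift of x -> y, i.e. conf(x -> y) / sup(y) (formula (3.3))."""
    return conf(x, y) / sup({y})


def frequent_itemsets(min_sup):
    """Brute-force enumeration of every itemset with support >= min_sup."""
    items = sorted(set().union(*TRANSACTIONS))
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = sup(combo)
            if s >= min_sup:
                found[frozenset(combo)] = s
    return found


def closed_itemsets(min_sup):
    """Frequent itemsets with no proper superset of identical support."""
    freq = frequent_itemsets(min_sup)
    return {
        a: s for a, s in freq.items()
        if not any(a < b and s == freq[b] for b in freq)
    }


lift_contact_lens = lift("contact", "lens")
mi_from_lift = math.log2(lift_contact_lens)  # log2 of the lift, MI-like score
closed = closed_itemsets(0.5)
print(lift_contact_lens, mi_from_lift, sorted(map(sorted, closed)))
```

With a minimum support of 0.5, the closed frequent itemsets are {soft} and {soft, contact, lens}; {soft, contact} is discarded because its superset {soft, contact, lens} has the same support, in the same way that C-value penalizes soft contact.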
⁴ A superset is defined with respect to another itemset; for example, {M1, M2, M3} is a superset of {M1, M2}. B is a superset of A if card(A) < card(B) and A ⊂ B.

This section and both illustrative examples confirm the importance of having a real multidisciplinary viewpoint on the methodological aspects in order to build scientific bridges and thus contribute to the development of the emerging data science domain.

4 Conclusion and Future Work

This position paper discusses similarities as well as differences in the definition of co-occurrence according to research domains (i.e. linguistics, NLP, computer science). Its aim is to show the bridges that exist between different domains.

In addition, this paper highlights some similarities in the methodologies used to identify co-occurrences in different domains. The discussion could be extended to other domains. For example, methodological transfers are commonly applied between bioinformatics and NLP, such as the use of edit distance measures (e.g. Levenshtein distance) for sequence alignment tasks (bioinformatics) vs. string comparison (NLP).

Acknowledgments

This work is funded by the SONGES project (Occitanie and FEDER) – Heterogeneous Data Science (http://textmining.biz/Projects/Songes).

References

[1] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=645920.672836.

[2] Amihood Amir, Yonatan Aumann, Ronen Feldman, and Moshe Fresko. Maximal association rules: A tool for mining associations in text. Journal of Intelligent Information Systems, 25(3):333–345, Nov 2005. https://doi.org/10.1007/s10844-005-0196-9.

[3] Elena Arsevska, Mathieu Roche, Pascal Hendrikx, David Chavernac, Sylvain Falala, Renaud Lancelot, and Barbara Dufour. Identification of associations between clinical signs and hosts to monitor the web for detection of animal disease outbreaks. International Journal of Agricultural and Environmental Information Systems, 7(3):1–20, 2016. https://doi.org/10.4018/IJAEIS.2016070101.

[4] Paulo J. Azevedo and Alípio M. Jorge. Comparing rule measures for predictive association rules. In Proceedings of the 18th European Conference on Machine Learning, ECML ’07, pages 510–517, Berlin, Heidelberg, 2007. Springer-Verlag. http://dx.doi.org/10.1007/978-3-540-74958-5_47.

[5] Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie, and Mathieu Roche. Xart: Discovery of correlated arguments of n-ary relations in text. Expert Systems with Applications, 73(Supplement C):115–124, 2017. https://doi.org/10.1016/j.eswa.2016.12.028.

[6] Didier Bourigault. Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings of the 14th Conference on Computational Linguistics - Volume 3, COLING ’92, pages 977–981, Stroudsburg, PA, USA, 1992. Association for Computational Linguistics. http://dx.doi.org/10.3115/992383.992415.

[7] Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD ’97, pages 265–276, New York, NY, USA, 1997. ACM. http://doi.acm.org/10.1145/253260.253327.

[8] Hui Cao, George Hripcsak, and Marianthi Markatou. A statistical methodology for analyzing co-occurrence data from a large sample. Journal of Biomedical Informatics, 40(3):343–352, 2007. https://doi.org/10.1016/j.jbi.2006.11.003.

[9] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Comput. Linguist., 16(1):22–29, March 1990. http://dl.acm.org/citation.cfm?id=89086.89095.

[10] André Clas. Collocations et langues de spécialité. Meta, 39(4):576–580, 1994. https://doi.org/10.7202/002327ar.

[11] Béatrice Daille, Éric Gaussier, and Jean-Marc Langé. Towards automatic extraction of monolingual and bilingual terminology. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, COLING ’94, pages 515–521, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. https://doi.org/10.3115/991886.991975.

[12] Lisa Di-Jorio, Sandra Bringay, Céline Fiot, Anne Laurent, and Maguelonne Teisseire. Sequential patterns for maintaining ontologies over time. In On the Move to Meaningful Internet Systems: OTM 2008, OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008, Monterrey, Mexico, November 9-14, 2008, Proceedings, Part II, pages 1385–1403, 2008. https://doi.org/10.1007/978-3-540-88873-4_32.

[13] Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130, Aug 2000. https://doi.org/10.1007/s007999900023.

[14] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra, and Gareth J.F. Jones. Word embedding based generalized language model for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 795–798, New York, NY, USA, 2015. ACM. http://doi.acm.org/10.1145/2766462.2767780.

[15] Arijit Ghosal, Rudrasis Chakraborty, Bibhas Chandra Dhara, and Sanjoy Kumar Saha. Song Classification: Classical and Non-classical Discrimination Using MFCC Co-occurrence Based Features, pages 179–185. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. https://doi.org/10.1007/978-3-642-27183-0_19.

[16] Gaston Gross. Les expressions figées en français. Ophrys, 1996.

[17] Ulrich Heid. Towards a corpus-based dictionary of German noun-verb collocations. In Proceedings of the Euralex International Congress, pages 301–312, 1998.

[18] Simon Jaillet, Anne Laurent, and Maguelonne Teisseire. Sequential patterns for text categorization. Intelligent Data Analysis, 10(3):199–214, May 2006. https://doi.org/10.3233/IDA-2006-10302.

[19] Hyun-Ho Jeon, Andrea Basso, and Peter F. Driessen. Camera Motion Detection in Video Sequences Using Motion Cooccurrences, pages 524–534. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. https://doi.org/10.1007/11581772_46.

[20] Min Jiang, Joshua C. Denny, Buzhou Tang, Hongxin Cao, and Hua Xu. Extracting semantic lexicons from discharge summaries using machine learning and the c-value method. In AMIA 2012, American Medical Informatics Association Annual Symposium, Chicago, Illinois, USA, November 3-7, 2012, 2012. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3540581/.

[21] Stephane Lallich, Olivier Teytaud, and Elie Prudhomme. Association Rule Interestingness: Measure and Statistical Validation, pages 251–275. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. https://doi.org/10.1007/978-3-540-44918-8_11.

[22] Marleen Laurens. La description des collocations et leur traitement dans les dictionnaires. Romaneske, 4:44–51, 1999. http://www.vlrom.be/pdf/994colloc.pdf.

[23] Carmen Lederer. La notion d’unité lexicale et l’enseignement du lexique. The French Review, 43(1):96–98, 1969. https://www.jstor.org/stable/386736.

[24] Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, and Maguelonne Teisseire. Biomedical term extraction: Overview and a new methodology. Information Retrieval Journal, 19(1-2):59–99, April 2016. http://dx.doi.org/10.1007/s10791-015-9262-2.

[25] Sean Massung and Chengxiang Zhai. Non-native text analysis: A survey. Natural Language Engineering, 22(2):163–186, 2016. https://doi.org/10.1017/S1351324915000303.

[26] Igor A. Mel’čuk, Nadia Arbatchewsky-Jumarie, Léo Elnitsky, and Adèle Lessard. Dictionnaire explicatif et combinatoire du français contemporain. Presses de l’Université de Montréal, Montréal, Canada, 1984, 1988, 1992, 1999. Volume 1, 2, 3, 4.

[27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119, USA, 2013. Curran Associates Inc. http://dl.acm.org/citation.cfm?id=2999792.2999959.

[28] Goran Nenadić, Irena Spasić, and Sophia Ananiadou. Terminology-driven mining of biomedical literature. In Proceedings of the 2003 ACM Symposium on Applied Computing, SAC ’03, pages 83–87, New York, NY, USA, 2003. ACM. http://doi.acm.org/10.1145/952532.952553.

[29] Julien Rabatel, Yuan Lin, Yoann Pitarch, Hassan Saneifar, Claire Serp, Mathieu Roche, and Anne Laurent. Visualisation des motifs séquentiels extraits à partir d’un corpus en ancien français. In Extraction et gestion des connaissances (EGC’2008), pages 237–238, 2008. https://editions-rnti.fr/?inprocid=1000605.

[30] Mathieu Roche, Jérôme Azé, Oriane Matte-Tailliez, and Yves Kodratoff. Mining texts by association rules discovery in a technical corpus. In Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’04 Conference held in Zakopane, Poland, May 17-20, 2004, pages 89–98, 2004. https://link.springer.com/chapter/10.1007/978-3-540-39985-8_10.

[31] Mathieu Roche and Violaine Prince. A web-mining approach to disambiguate biomedical acronym expansions. Informatica (Slovenia), 34(2):243–253, 2010. http://www.informatica.si/index.php/informatica/article/view/296.

[32] Mathieu Roche, Maguelonne Teisseire, and Gaurav Shrivastava. Valorcarn-TETIS: Terms extracted with Biotex [dataset]. CIRAD Dataverse, 2017. http://dx.doi.org/10.18167/DVN1/PGQGQL.

[33] Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, CICLing ’02, pages 1–15, London, UK, 2002. Springer-Verlag. http://dl.acm.org/citation.cfm?id=647344.724004.

[34] Claire Serp, Anne Laurent, Mathieu Roche, and Maguelonne Teisseire. La quête du graal et la réalité numérique. Corpus, 7, 2008. https://doi.org/10.4000/corpus.1512.

[35] Piyoros Tungthamthiti, Kiyoaki Shirai, and Masnizah Mohd. Recognition of sarcasm in tweets based on concept level sentiment analysis and supervised learning approaches, pages 404–413. Faculty of Pharmaceutical Sciences, Chulalongkorn University, 2014. https://www.aclweb.org/anthology/Y14-1047.

[36] Peter D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning, ECML ’01, pages 491–502, London, UK, 2001. Springer-Verlag. http://dl.acm.org/citation.cfm?id=645328.650004.

[37] Sebastián Ventura and José María Luna. Quality Measures in Pattern Mining, pages 27–44. Springer International Publishing, Cham, 2016. https://doi.org/10.1007/978-3-319-33858-3_2.

[38] Manisha Verma, Balasubramanian Raman, and Subrahmanyam Murala. Local extrema co-occurrence pattern for color and texture image retrieval. Neurocomput., 165(C):255–269, October 2015. http://dx.doi.org/10.1016/j.neucom.2015.03.015.

[39] Yong Yin, Ikou Kaku, Jiafu Tang, and JianMing Zhu. Association Rules Mining in Inventory Database, pages 9–23. Springer London, London, 2011. https://doi.org/10.1007/978-1-84996-338-1_2.

[40] Hamed Zamani and W. Bruce Croft. Relevance-based word embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 505–514, New York, NY, USA, 2017. ACM. http://doi.acm.org/10.1145/3077136.3080831.