https://doi.org/10.31449/inf.v44i3.2996 Informatica 44 (2020) 387–393 387
How to Define Co-occurrence in a Multidisciplinary Context?
Mathieu Roche
CIRAD, TETIS, F-34398 Montpellier, France
TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
E-mail: mathieu.roche@cirad.fr, http://textmining.biz/Staff/Roche
Position paper
Keywords: co-occurrence, collocation, phrase, n-gram, skip-n-gram, association rule, sequential pattern
Received: October 28, 2019
This position paper presents a comparative study of co-occurrences. Some similarities and differences
in the deﬁnition exist depending on the research domain (e.g. linguistics, natural language processing,
computer science). This paper discusses these points and deals with the methodological aspects in order to
identify co-occurrences in a multidisciplinary paradigm.
Povzetek (Slovenian abstract): An analysis of co-occurrence is presented.
1 Introduction
Determining co-occurrences in corpora is challenging for
different applications such as classiﬁcation, translation, ter-
minology building, etc. More generally, co-occurrences
can be identiﬁed with all types of data, e.g. databases [8],
texts [30], images [38], music [15], video [19], etc.
The co-occurrence concept has different definitions depending
on the research domain (e.g. linguistics, natural language
processing (NLP), computer science, biology). This position
paper reviews the main definitions in the literature and
discusses their similarities and differences across domains.
This type of study can be crucial in the context of data
science, which is geared towards developing a multidisciplinary
paradigm for data processing and analysis, especially of
textual data.
Here the co-occurrence concept related to textual data is
discussed. Note that before their validation by an expert,
co-occurrences of words are often considered as candidate
terms.
First, Section 2 of this paper details the different deﬁni-
tions of co-occurrence according to the studied domains.
Section 3 discusses and compares these different aspects
based on their intrinsic deﬁnition but also on the associated
methodologies in order to identify them. Finally, Section 4
lists some perspectives.
2 Co-occurrence in a multidisciplinary context
2.1 Linguistic viewpoint
In linguistics, two notions are broadly used to define a
term: the lexical unit [23] and the polylexical expression
[16]. The latter represents a set of words having an
autonomous existence, and is also called a multi-word
expression [33].
In addition, several linguistics studies use the collocation
notion. [10] gives two properties deﬁning a collocation.
First, collocation is deﬁned as a group of words having an
overall meaning that is deducible from the units (words).
For example, climate change is considered as a collocation
because the overall meaning of this group of words can be
deduced from both words climate and change. On the other
hand, the expression to rain cats and dogs is not a colloca-
tion because its meaning cannot be deduced from each of
the words; this is called a ﬁxed expression or an idiom.
A second property is added by [10] to deﬁne a colloca-
tion. The meaning of the words that make up the collo-
cation must be limited. For example, buy a dog is not a
collocation because the meaning of buy is not limited.
2.2 NLP viewpoint
In the natural language processing (NLP) domain, the co-
occurrence notion refers to the general phenomenon where
words are present together in the same context. More pre-
cisely, several principles are used that take contextual cri-
teria into account.
First, the terms or phrases [6, 11] can respect syntactic
patterns (e.g. adjective noun, noun noun, noun preposition
noun, etc.). Some examples of extracted phrases (i.e. syn-
tactic co-occurrences) are given in Table 1.
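The pattern-based extraction described above can be sketched as follows. This is only a minimal illustration, not the method of [6, 11]: the part-of-speech tags are assigned by hand (an assumption; a real pipeline would obtain them from a tagger), and the names TAGGED, PATTERNS and extract_phrases are hypothetical.

```python
# Hand-assigned part-of-speech tags for the Table 1 sentence.
# Assumption: a real system would get these tags from a POS tagger.
TAGGED = [
    ("With", "PREP"), ("climate", "NOUN"), ("change", "NOUN"),
    ("the", "DET"), ("water", "NOUN"), ("cycle", "NOUN"),
    ("is", "VERB"), ("expected", "VERB"), ("to", "PART"),
    ("undergo", "VERB"), ("significant", "ADJ"), ("change", "NOUN"),
]

# Syntactic patterns: noun noun, adjective noun.
PATTERNS = [("NOUN", "NOUN"), ("ADJ", "NOUN")]


def extract_phrases(tagged, patterns):
    """Return contiguous word sequences whose tags match one of the patterns."""
    phrases = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                phrases.append(" ".join(word for word, _ in window))
    return phrases


phrases = extract_phrases(TAGGED, PATTERNS)
print(phrases)
```

With these hand-made tags, the sketch returns the three phrases listed in Table 1 (climate change, water cycle, significant change).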
In addition, methods without linguistic filtering are also
conventionally used in the NLP domain by extracting n-grams
of words (i.e. lexical co-occurrences) [25, 35]. n-grams are
contiguous sequences of n words extracted from a given
sequence of text (e.g. the bi-grams, i.e. n-grams with n = 2,
x y and y z are associated with the text x y z). n-grams that
allow gaps are called skip-n-grams (e.g. the skip-bi-grams
x y, x z, and y z are related to the text x y z). The
skip-gram model is an efficient method for learning
high-quality distributed vector representations that capture
a large number of precise syntactic and semantic word
relationships [27]. Some examples of n-grams and skip-n-grams
are given in Table 1.
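The two definitions above can be made concrete with a short sketch. The function names ngrams and skip_bigrams are hypothetical, and the parameter k (maximum number of skipped words) is chosen so that k = 2 corresponds to the 2-skip-bi-grams of Table 1; tokenization is a plain whitespace split, which is an assumption.

```python
def ngrams(tokens, n):
    """Contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, k):
    """Ordered pairs of tokens separated by at most k intervening tokens."""
    pairs = []
    for i, left in enumerate(tokens):
        # j ranges over the next k + 1 positions (0 to k skipped tokens).
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


text = ["x", "y", "z"]
bigrams = ngrams(text, 2)        # x y, y z
skips = skip_bigrams(text, 1)    # x y, x z, y z

sentence = ("With climate change the water cycle is expected "
            "to undergo significant change").split()
print(len(ngrams(sentence, 2)), len(skip_bigrams(sentence, 2)))
```

On the Table 1 sentence this yields 11 bi-grams and 30 2-skip-bi-grams, which matches the number of candidates listed in the table.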
After summarizing the term notion in the NLP domain,
the following section discusses these aspects in the com-
puter science context, particularly in data mining. Note that
the NLP domain may be considered as being located at the
linguistics and computer science interface.
2.3 Computer science viewpoint
In the data mining domain, co-occurring items are called
association rules [1, 39] and they could be candidates for
construction or enrichment of terminologies [12].
In the data mining context, the list of items corresponds
to the set of available articles. With textual data, items may
represent the words present in sentences, paragraphs, or
documents [2, 29]. A transaction is a set of items. A set
of transactions is a learning set used to determine associa-
tion rules.
Some extensions of association rules are called sequen-
tial patterns. They take into account a certain order of ex-
tracted elements [18, 34] with an enriched representation
related to textual data as follows:
– objects represent texts or pieces of texts,
– items are the words of a text,
– itemsets represent sets of words present together
within a sentence, paragraph or document,
– dates highlight the order of sentences within a text.
There are several algorithms for discovering association
rules and sequential patterns. One of the most popular is
Apriori [1], which extracts frequent itemsets from large
databases level by level: frequent k-itemsets are used to
generate candidate (k+1)-itemsets.
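The level-wise principle can be illustrated with a compact sketch. It is a simplified rendering of the Apriori idea, not the optimized algorithm of [1]; the function name apriori, the use of an absolute support count for min_support, and the toy transactions (sentences seen as sets of words) are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Keep only the candidates contained in enough transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 1
    while current:
        # Join step: merge frequent k-itemsets into (k+1)-itemset candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions: each sentence is a set of words.
sentences = [
    {"climate", "change", "water"},
    {"climate", "change", "cycle"},
    {"water", "cycle"},
    {"climate", "change", "water", "cycle"},
]
frequent = apriori(sentences, min_support=2)
print(frequent[frozenset({"climate", "change"})])
```

In this toy example the pair {climate, change} is found in three transactions, while {climate, water, cycle} occurs only once and is therefore not frequent.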
Association rules and sequential patterns of words are
often used in text mining for different applications, e.g. ter-
minology enrichment [12], association of concept instances
[5, 29], classiﬁcation [18, 34], etc.
3 Discussion: comparative study of definitions and approaches
This section compares (i) co-occurrence definitions (see
Section 3.1) and (ii) the automatic methods used to identify
them (see Section 3.2). It highlights some similarities and
differences between domains.
3.1 Co-occurrence extraction
The general definition of co-occurrence is ultimately close to
that of association rules in the data mining domain. Note that
integrating windows (see footnote 2) into the association rule
or sequential pattern extraction process makes it similar to
skip-n-gram extraction.
The integration of syntactic criteria makes it possi-
ble to extract more relevant candidate terms (see Table
1). Such information is typically taken into account in
NLP to extract terms from general or specialized domains
[20, 24, 28, 32].
Table 1 highlights relevant terms extracted using linguis-
tic patterns (e.g. climate change, water cycle, signiﬁcant
change). The use of linguistic patterns tends to improve
precision values. Other methods, such as skip-bi-grams,
generally return lower precision, i.e. many extracted
candidates are irrelevant (e.g. climate the). But this kind of
method enables extraction of some relevant terms not found
with linguistic patterns (e.g. cycle expected), so recall
can be improved.
Table 2 presents research domains related to different
types of candidates, i.e. collocations, polylexical expres-
sions, phrases, n-grams, association rules, sequential pat-
terns.
Table 3 summarizes the main criteria described in the
literature. Note that the extraction is more flexible and
automatic when there are fewer criteria. In this table, two
types of information are associated with the different
criteria. The first one (marked with X) designates the
characteristics given by the co-occurrence definitions. The
second one (marked with F) represents characteristics that
are implemented in many state-of-the-art extensions.
Table 3 shows that the semantic criterion is seldom
associated with co-occurrence definitions. This criterion is,
however, taken into account in linguistics, where several
studies address semantic aspects [17, 22, 26]. In this
context, [26] introduced lexical functions that rely on
semantic criteria to define the relationships between
collocation units. For instance, a given relation can be
expressed in various ways between the arguments and their
values, like Centr (the center, culmination of), which
returns different meanings (see footnote 3):
– Centr(crisis) = the peak
– Centr(desert) = the heart
– Centr(forest) = the thick
– Centr(glory) = summit
– Centr(life) = prime
In the data mining domain, semantic information is used
in two main directions. The first one involves filtering the
results according to whether they respect certain semantic
information (e.g. phrases or patterns where a word is an
instance of a semantic resource). The second one involves
semantic resources in the knowledge discovery process itself,
i.e. the extraction is driven by semantic information [5].

Sentence (input):
  With climate change the water cycle is expected to undergo significant change.
Candidates (output):
  Phrases (noun noun, adjective noun):
    climate change, water cycle, significant change
  Bi-grams of words:
    With climate, climate change, change the, the water,
    water cycle, cycle is, is expected, expected to,
    to undergo, undergo significant, significant change
  2-skip-bi-grams:
    With climate, With change, With the,
    climate change, climate the, climate water,
    change the, change water, change cycle,
    the water, the cycle, the is,
    water cycle, water is, water expected,
    cycle is, cycle expected, cycle to,
    is expected, is to, is undergo,
    expected to, expected undergo, expected significant,
    to undergo, to significant, to change,
    undergo significant, undergo change,
    significant change
Table 1: Examples of candidates extracted with different NLP techniques.

Definitions               Domains
Collocations              L
Polylexical expressions   L + NLP
Phrases                   NLP
n-grams                   NLP + CS
Association rules         CS
Sequential patterns       CS
Table 2: Summary of the main domains associated with
expressions (L: linguistics, NLP: natural language processing,
CS: computer science).

Footnote 2: Association Rule with Time-Windows (ARTW) [39].
Footnote 3: http://people.brandeis.edu/~smalamud/ling130/lex_functions.pdf
In recent studies in the NLP domain, the semantic as-
pects are based on word embedding, which provides a
dense representation of words and their relative meanings
[14, 40].
Finally, note that several types of co-occurrence are used
in more than one domain. For example, polylexical expressions
are commonly used in both NLP and linguistics. In addition,
n-grams are commonly used in both the NLP and computer
science domains: n-grams of words are often used to build
terminologies (NLP domain) but also as features for machine
learning algorithms (computer science domain) [35].
Table 4 summarizes the main types of criteria (i.e.
statistical, morpho-syntactic, and semantic) used for
extracting co-occurrences according to the research domains
considered in this paper.
After presenting the characteristics associated with the
co-occurrence notion in a multidisciplinary context, the
following section compares the methodological viewpoints
to identify these elements according to the domains.
Columns: Ordered sequences | Sequences with gaps | Morpho-syntactic information | Semantic information
Collocations: X X F
Polylexical expressions: X X
Phrases: X X
n-grams: X F
Association rules: X
Sequential patterns: X X
Table 3: Summary of the main criteria associated with
co-occurrence identification. X indicates that the criterion
is respected by definition. F indicates that extensions
respecting the criterion are currently used in the state of
the art.

Columns: Statistic information | Morpho-syntactic information | Semantic information
Linguistics: X F
NLP: X X F
Data mining: X F F
Table 4: Summary of the main criteria associated with research
domains. X indicates that the criterion is respected for
extracting co-occurrences from textual data. F indicates that
extensions respecting the criterion are currently used in the
state of the art.

3.2 Ranking of co-occurrences
Co-occurrence identification by automatic systems is
generally based on the use of quality measures and/or
algorithms. This section provides two illustrative examples
that show similarities between approaches according to the
domains.
3.2.1 Mutual Information and Lift measure
First, the use of specific statistical measures from different
domains is highlighted. This subsection focuses on Mutual
Information (MI), a measure often used in the NLP domain to
quantify the association between words [9]. MI (see formula
(3.1)) compares the probability of observing x and y together
(joint probability) with the probability of observing x and y
independently (chance) [9].
\[ I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \tag{3.1} \]
In general, the word probabilities P(x) and P(y) correspond
to the number of observations of x and y in a corpus,
normalized by the size of the corpus. Some extensions of MI
have also been proposed. The PMI-IR algorithm (Pointwise
Mutual Information and Information Retrieval) described in
[36] queries the Web via the AltaVista search engine to
determine appropriate synonyms for a given query. For a given
word, denoted x, PMI-IR chooses a synonym among a given list.
These candidate terms, denoted $y_i$, $i \in [1, n]$,
correspond to TOEFL questions. The aim is to find the synonym
$y_i$ that gives the best score. To obtain scores, PMI-IR uses
several measures based on the proportion of documents where
both terms are present. Turney's formula (3.2), given below,
is one of the basic measures used in [36] and is inspired by
the MI described in [9]. With this formula, the number of
documents containing both x and $y_i$ (within a 10-word
window) is compared with the number of documents containing
the word $y_i$. The higher this proportion, the more x and
$y_i$ are seen as synonyms.
\[ \mathrm{score}(y_i) = \frac{nb(x \ \mathrm{NEAR} \ y_i)}{nb(y_i)} \tag{3.2} \]
– nb(x) is the number of documents containing the word x
(i.e. nb corresponds to the number of webpages returned by
the search engine),
– NEAR (used in the 'advanced search' field of AltaVista) is
an operator that identifies whether two words are present
within a 10-word window.
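Formula (3.2) can be sketched as follows. Since the search engine queries cannot be reproduced here, the document counts are invented for illustration only: the dictionaries NB and NB_NEAR and the functions score and best_synonym are hypothetical stand-ins for nb(.) and nb(x NEAR y_i), not part of PMI-IR as published in [36].

```python
# Hypothetical document counts standing in for search-engine results.
NB = {"ocean": 12000, "sea": 9000, "desert": 7000}       # nb(word)
NB_NEAR = {("ocean", "sea"): 900, ("ocean", "desert"): 70}  # nb(x NEAR y_i)


def score(x, y):
    """Formula (3.2): nb(x NEAR y) / nb(y)."""
    return NB_NEAR.get((x, y), 0) / NB[y]


def best_synonym(x, candidates):
    """Return the candidate y_i with the highest score for the word x."""
    return max(candidates, key=lambda y: score(x, y))


best = best_synonym("ocean", ["sea", "desert"])
print(best, score("ocean", "sea"), score("ocean", "desert"))
```

With these made-up counts, sea obtains a score of 0.1 and desert a score of 0.01, so sea is selected as the synonym of ocean.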
This kind of web mining approach is also used in
many NLP applications, e.g. (i) computing the relation-
ship between host and clinical sign for an epidemiology
surveillance system [3], (ii) computing the dependency of
words of acronym deﬁnitions for word-sense disambigua-
tion tasks [31].
Joint probabilities are generally symmetric (i.e.
P(x, y) = P(y, x)), so the original MI measure is also
symmetric. But the association ratio applied in the NLP
domain is not symmetric, because it is estimated from ordered
pairs: the numbers of occurrences of the word pairs "x y" and
"y x" generally differ. Moreover, the meaning and relevance
of phrases can differ according to the word order in a text,
e.g. first lady and lady first.
Finally, MI is very close to the lift measure [4, 7, 37] in
data mining. This measure identifies relevant association
rules (see formula (3.3)). The lift measure evaluates the
relevance of co-occurrences only (not implication) and how
far x and y are from independence [4].
\[ \mathrm{lift}(x \rightarrow y) = \frac{\mathrm{conf}(x \rightarrow y)}{\mathrm{sup}(y)} \tag{3.3} \]
This measure is based on both the confidence and support
criteria, which in turn are based on the identification of
the association rule (x → y). Support indicates how
frequently the itemset appears in the dataset. Confidence is
a standard measure that estimates the probability of
observing y given x (see formula (3.4)).
\[ \mathrm{conf}(x \rightarrow y) = \frac{\mathrm{sup}(x \cup y)}{\mathrm{sup}(x)} \tag{3.4} \]
Note that other quality measures of the data mining domain,
such as Least contradiction or Conviction [21], could be
tailored to deal with textual data.
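The closeness between MI and lift can be checked numerically. Substituting formula (3.4) into formula (3.3) gives lift(x → y) = sup(x ∪ y) / (sup(x) sup(y)); when supports are relative frequencies used as probability estimates, this is the ratio inside the logarithm of formula (3.1), so MI is the base-2 logarithm of the lift. The sketch below assumes unordered co-occurrence within a transaction, and the function names and toy transactions are hypothetical.

```python
from math import isclose, log2


def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Formula (3.4): sup(x U y) / sup(x)."""
    return support({x, y}, transactions) / support({x}, transactions)


def lift(x, y, transactions):
    """Formula (3.3): conf(x -> y) / sup(y)."""
    return confidence(x, y, transactions) / support({y}, transactions)


def mutual_information(x, y, transactions):
    """Formula (3.1), with probabilities estimated by relative supports."""
    p_xy = support({x, y}, transactions)
    p_x = support({x}, transactions)
    p_y = support({y}, transactions)
    return log2(p_xy / (p_x * p_y))


# Hypothetical transactions (e.g. sentences seen as sets of words).
T = [
    {"climate", "change"},
    {"climate", "change", "water"},
    {"water", "cycle"},
    {"cycle"},
]
l = lift("climate", "change", T)
mi = mutual_information("climate", "change", T)
print(l, mi)
```

Here the lift equals 2 and MI equals 1, i.e. MI = log2(lift), as expected from the substitution above.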
3.2.2 C-value and closed itemset
A second example concerns methodological similarities between
approaches. For example, the C-value approach [13] used in
the NLP domain [20, 24] favors terms that do not appear to a
significant extent within longer terms. For example, in a
specialized corpus related to ophthalmology, the work of [13]
shows that a more general term such as soft contact is
irrelevant, whereas a longer and therefore more specific term
such as soft contact lens is relevant. This kind of measure
is particularly relevant in the biology domain [20, 24].
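The behavior described above can be illustrated with the C-value formula of [13]: for a candidate term a with frequency f(a), C-value(a) = log2|a| · f(a) if a is not nested in longer candidates, and log2|a| · (f(a) minus the mean frequency of the longer candidates containing a) otherwise. The sketch below is a simplified illustration; the frequencies are invented and the helper names contains and c_value are hypothetical.

```python
from math import isclose, log2


def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def c_value(term, freq):
    """C-value of `term` (a tuple of words) given candidate frequencies."""
    longer = [t for t in freq if len(t) > len(term) and contains(t, term)]
    if not longer:
        # The term is not nested in any longer candidate.
        return log2(len(term)) * freq[term]
    nested_mean = sum(freq[t] for t in longer) / len(longer)
    return log2(len(term)) * (freq[term] - nested_mean)


# Hypothetical frequencies: "soft contact" only ever occurs inside
# "soft contact lens".
freq = {
    ("soft", "contact"): 10,
    ("soft", "contact", "lens"): 10,
}
short_score = c_value(("soft", "contact"), freq)
long_score = c_value(("soft", "contact", "lens"), freq)
print(short_score, long_score)
```

With these counts, soft contact receives a C-value of 0 because all its occurrences are explained by the longer term, while soft contact lens receives a positive score.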
In addition, in the computer science domain (i.e. data
mining), the notion of closed itemset is ultimately very
close to the C-value approach. In this context, a frequent
itemset is considered closed if none of its supersets (see
footnote 4) has the same support (i.e. frequency).
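The closed itemset definition can be sketched in a few lines. The function name closed_itemsets and the support counts are hypothetical. Checking only frequent supersets is sufficient here, because a superset with the same support as a frequent itemset is itself frequent.

```python
def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset of equal support."""
    return {
        itemset: sup
        for itemset, sup in frequent.items()
        if not any(itemset < other and sup == other_sup
                   for other, other_sup in frequent.items())
    }


# Hypothetical frequent itemsets with their support counts.
frequent = {
    frozenset({"soft"}): 12,
    frozenset({"soft", "contact"}): 10,
    frozenset({"soft", "contact", "lens"}): 10,
}
closed = closed_itemsets(frequent)
print(sorted(sorted(s) for s in closed))
```

In this example {soft, contact} is discarded because its superset {soft, contact, lens} has the same support, which mirrors the way the C-value approach penalizes soft contact in favor of soft contact lens.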
This section and both illustrative examples conﬁrm the
importance of having a real multidisciplinary viewpoint
on the methodological aspects in order to build scien-
tiﬁc bridges and thus contribute to the development of the
emerging data science domain.
4 Conclusion and Future Work
This position paper proposes a discussion on similarities as
well as differences in the deﬁnition of co-occurrence ac-
cording to research domains (i.e. linguistics, NLP, com-
puter science). The aim of this position paper is to show
the bridges that exist between different domains.
In addition, this paper highlights some similarities in the
methodologies used to identify co-occurrences in different
domains. The discussion could be extended to other domains.
For example, methodological transfers are currently applied
between bioinformatics and NLP, such as the use of edit
distance measures (e.g. Levenshtein distance) for sequence
alignment tasks (bioinformatics) and for string comparison
(NLP).
Acknowledgments
This work is funded by the SONGES project (Occitanie
and FEDER) – Heterogeneous Data Science
(http://textmining.biz/Projects/Songes).
Footnote 4: A superset is defined with respect to another
itemset; for example, {M1, M2, M3} is a superset of {M1, M2}.
B is a superset of A if card(A) < card(B) and A ⊂ B.
References
[1] Rakesh Agrawal and Ramakrishnan Srikant. Fast
algorithms for mining association rules in large
databases. In Proceedings of the 20th International
Conference on Very Large Data Bases, VLDB ’94,
pages 487–499, San Francisco, CA, USA, 1994.
Morgan Kaufmann Publishers Inc.
http://dl.acm.org/citation.cfm?id=645920.672836.
[2] Amihood Amir, Yonatan Aumann, Ronen Feldman,
and Moshe Fresko. Maximal association rules: A
tool for mining associations in text. Journal of
Intelligent Information Systems, 25(3):333–345, Nov
2005.
https://doi.org/10.1007/s10844-005-0196-9.
[3] Elena Arsevska, Mathieu Roche, Pascal Hendrikx,
David Chavernac, Sylvain Falala, Renaud Lancelot,
and Barbara Dufour. Identification of associations
between clinical signs and hosts to monitor the web
for detection of animal disease outbreaks.
International Journal of Agricultural and Environmental
Information Systems, 7(3):1–20, 2016.
https://doi.org/10.4018/IJAEIS.2016070101.
[4] Paulo J. Azevedo and Alípio M. Jorge. Comparing
rule measures for predictive association rules. In
Proceedings of the 18th European Conference on
Machine Learning, ECML ’07, pages 510–517,
Berlin, Heidelberg, 2007. Springer-Verlag.
http://dx.doi.org/10.1007/978-3-540-74958-5_47.
[5] Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie,
and Mathieu Roche. Xart: Discovery of correlated
arguments of n-ary relations in text. Expert Systems
with Applications, 73(Supplement C):115–124,
2017.
https://doi.org/10.1016/j.eswa.2016.12.028.
[6] Didier Bourigault. Surface grammatical analysis for
the extraction of terminological noun phrases. In
Proceedings of the 14th Conference on Computational
Linguistics - Volume 3, COLING ’92, pages
977–981, Stroudsburg, PA, USA, 1992. Association
for Computational Linguistics.
http://dx.doi.org/10.3115/992383.992415.
[7] Sergey Brin, Rajeev Motwani, and Craig Silverstein.
Beyond market baskets: Generalizing association
rules to correlations. In Proceedings of the 1997 ACM
SIGMOD International Conference on Management
of Data, SIGMOD ’97, pages 265–276, New York,
NY, USA, 1997. ACM.
http://doi.acm.org/10.1145/253260.253327.
[8] Hui Cao, George Hripcsak, and Marianthi Markatou.
A statistical methodology for analyzing
co-occurrence data from a large sample. Journal of
Biomedical Informatics, 40(3):343–352, 2007.
https://doi.org/10.1016/j.jbi.2006.11.003.
[9] Kenneth Ward Church and Patrick Hanks. Word
association norms, mutual information, and
lexicography. Comput. Linguist., 16(1):22–29, March 1990.
http://dl.acm.org/citation.cfm?id=89086.89095.
[10] André Clas. Collocations et langues de spécialité.
Meta, 39(4):576–580, 1994.
https://doi.org/10.7202/002327ar.
[11] Béatrice Daille, Éric Gaussier, and Jean-Marc Langé.
Towards automatic extraction of monolingual and
bilingual terminology. In Proceedings of the 15th
Conference on Computational Linguistics - Volume
1, COLING ’94, pages 515–521, Stroudsburg, PA,
USA, 1994. Association for Computational Linguistics.
https://doi.org/10.3115/991886.991975.
[12] Lisa Di-Jorio, Sandra Bringay, Céline Fiot, Anne
Laurent, and Maguelonne Teisseire. Sequential
patterns for maintaining ontologies over time. In
On the Move to Meaningful Internet Systems: OTM
2008, OTM 2008 Confederated International
Conferences, CoopIS, DOA, GADA, IS, and ODBASE
2008, Monterrey, Mexico, November 9-14, 2008,
Proceedings, Part II, pages 1385–1403, 2008.
https://doi.org/10.1007/978-3-540-88873-4_32.
[13] Katerina Frantzi, Sophia Ananiadou, and Hideki
Mima. Automatic recognition of multi-word terms:
the C-value/NC-value method. International Journal
on Digital Libraries, 3(2):115–130, Aug 2000.
https://doi.org/10.1007/s007999900023.
[14] Debasis Ganguly, Dwaipayan Roy, Mandar Mitra,
and Gareth J.F. Jones. Word embedding based
generalized language model for information retrieval.
In Proceedings of the 38th International ACM
SIGIR Conference on Research and Development in
Information Retrieval, SIGIR ’15, pages 795–798,
New York, NY, USA, 2015. ACM.
http://doi.acm.org/10.1145/2766462.2767780.
[15] Arijit Ghosal, Rudrasis Chakraborty, Bibhas Chandra
Dhara, and Sanjoy Kumar Saha. Song Classification:
Classical and Non-classical Discrimination Using
MFCC Co-occurrence Based Features, pages 179–185.
Springer Berlin Heidelberg, Berlin, Heidelberg,
2011.
https://doi.org/10.1007/978-3-642-27183-0_19.
[16] Gaston Gross. Les expressions figées en français.
Ophrys, 1996.
[17] Ulrich Heid. Towards a corpus-based dictionary of
German noun-verb collocations. In Proceedings of
the Euralex International Congress, pages 301–312,
1998.
[18] Simon Jaillet, Anne Laurent, and Maguelonne
Teisseire. Sequential patterns for text categorization.
Intelligent Data Analysis, 10(3):199–214, May 2006.
https://doi.org/10.3233/IDA-2006-10302.
[19] Hyun-Ho Jeon, Andrea Basso, and Peter F. Driessen.
Camera Motion Detection in Video Sequences Using
Motion Cooccurrences, pages 524–534. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2005.
https://doi.org/10.1007/11581772_46.
[20] Min Jiang, Joshua C. Denny, Buzhou Tang, Hongxin
Cao, and Hua Xu. Extracting semantic lexicons
from discharge summaries using machine learning
and the c-value method. In AMIA 2012, American
Medical Informatics Association Annual Symposium,
Chicago, Illinois, USA, November 3-7, 2012, 2012.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3540581/.
[21] Stephane Lallich, Olivier Teytaud, and Elie
Prudhomme. Association Rule Interestingness: Measure
and Statistical Validation, pages 251–275. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2007.
https://doi.org/10.1007/978-3-540-44918-8_11.
[22] Marleen Laurens. La description des collocations et
leur traitement dans les dictionnaires. Romaneske,
4:44–51, 1999.
http://www.vlrom.be/pdf/994colloc.pdf.
[23] Carmen Lederer. La notion d’unité lexicale et
l’enseignement du lexique. The French Review,
43(1):96–98, 1969.
https://www.jstor.org/stable/386736.
[24] Juan Antonio Lossio-Ventura, Clement Jonquet,
Mathieu Roche, and Maguelonne Teisseire. Biomedical
term extraction: Overview and a new methodology.
Information Retrieval Journal, 19(1-2):59–99,
April 2016.
http://dx.doi.org/10.1007/s10791-015-9262-2.
[25] Sean Massung and Chengxiang Zhai. Non-native text
analysis: A survey. Natural Language Engineering,
22(2):163–186, 2016.
https://doi.org/10.1017/S1351324915000303.
[26] Igor A. Mel’čuk, Nadia Arbatchewsky-Jumarie,
Léo Elnitsky, and Adèle Lessard. Dictionnaire
explicatif et combinatoire du français contemporain.
Presses de l’Université de Montréal, Montréal,
Canada, 1984, 1988, 1992, 1999. Volumes 1, 2, 3, 4.
[27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg
Corrado, and Jeffrey Dean. Distributed representations
of words and phrases and their compositionality. In
Proceedings of the 26th International Conference
on Neural Information Processing Systems - Volume
2, NIPS’13, pages 3111–3119, USA, 2013. Curran
Associates Inc.
http://dl.acm.org/citation.cfm?id=2999792.2999959.
[28] Goran Nenadić, Irena Spasić, and Sophia Ananiadou.
Terminology-driven mining of biomedical literature.
In Proceedings of the 2003 ACM Symposium on
Applied Computing, SAC ’03, pages 83–87, New
York, NY, USA, 2003. ACM.
http://doi.acm.org/10.1145/952532.952553.
[29] Julien Rabatel, Yuan Lin, Yoann Pitarch, Hassan
Saneifar, Claire Serp, Mathieu Roche, and Anne
Laurent. Visualisation des motifs séquentiels extraits
à partir d’un corpus en ancien français. In Extraction
et gestion des connaissances (EGC’2008), pages
237–238, 2008.
https://editions-rnti.fr/?inprocid=1000605.
[30] Mathieu Roche, Jérôme Azé, Oriane Matte-Tailliez,
and Yves Kodratoff. Mining texts by association
rules discovery in a technical corpus. In Intelligent
Information Processing and Web Mining, Proceedings
of the International IIS: IIPWM’04 Conference
held in Zakopane, Poland, May 17-20, 2004, pages
89–98, 2004.
https://link.springer.com/chapter/10.1007/978-3-540-39985-8_10.
[31] Mathieu Roche and Violaine Prince. A web-mining
approach to disambiguate biomedical acronym
expansions. Informatica (Slovenia), 34(2):243–253,
2010.
http://www.informatica.si/index.php/informatica/article/view/296.
[32] Mathieu Roche, Maguelonne Teisseire, and Gaurav
Shrivastava. Valorcarn-TETIS: Terms extracted with
Biotex [dataset]. CIRAD Dataverse, 2017.
http://dx.doi.org/10.18167/DVN1/PGQGQL.
[33] Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A.
Copestake, and Dan Flickinger. Multiword expressions:
A pain in the neck for NLP. In Proceedings
of the Third International Conference on Computational
Linguistics and Intelligent Text Processing,
CICLing ’02, pages 1–15, London, UK, 2002.
Springer-Verlag.
http://dl.acm.org/citation.cfm?id=647344.724004.
[34] Claire Serp, Anne Laurent, Mathieu Roche, and
Maguelonne Teisseire. La quête du graal et la
réalité numérique. Corpus, 7, 2008.
https://doi.org/10.4000/corpus.1512.
[35] Piyoros Tungthamthiti, Kiyoaki Shirai, and Masnizah
Mohd. Recognition of sarcasm in tweets based on
concept level sentiment analysis and supervised
learning approaches, pages 404–413. Faculty of
Pharmaceutical Sciences, Chulalongkorn University,
2014.
https://www.aclweb.org/anthology/Y14-1047.
[36] Peter D. Turney. Mining the web for synonyms:
PMI-IR versus LSA on TOEFL. In Proceedings of
the 12th European Conference on Machine Learning,
ECML ’01, pages 491–502, London, UK, 2001.
Springer-Verlag.
http://dl.acm.org/citation.cfm?id=645328.650004.
[37] Sebastián Ventura and José María Luna. Quality
Measures in Pattern Mining, pages 27–44. Springer
International Publishing, Cham, 2016.
https://doi.org/10.1007/978-3-319-33858-3_2.
[38] Manisha Verma, Balasubramanian Raman, and
Subrahmanyam Murala. Local extrema co-occurrence
pattern for color and texture image retrieval.
Neurocomput., 165(C):255–269, October 2015.
http://dx.doi.org/10.1016/j.neucom.2015.03.015.
[39] Yong Yin, Ikou Kaku, Jiafu Tang, and JianMing Zhu.
Association Rules Mining in Inventory Database,
pages 9–23. Springer London, London, 2011.
https://doi.org/10.1007/978-1-84996-338-1_2.
[40] Hamed Zamani and W. Bruce Croft. Relevance-based
word embedding. In Proceedings of the 40th
International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR
’17, pages 505–514, New York, NY, USA, 2017.
ACM.
http://doi.acm.org/10.1145/3077136.3080831.