Zbornik konference
Jezikovne tehnologije
in digitalna humanistika
Proceedings of the Conference on
Language Technologies
and Digital Humanities
15.–16. september 2022
Ljubljana, Slovenija
September 15th – 16th, 2022
Ljubljana, Slovenia
Uredila / Edited by:
Darja Fišer, Tomaž Erjavec
ZBORNIK KONFERENCE
JEZIKOVNE TEHNOLOGIJE IN DIGITALNA HUMANISTIKA
PROCEEDINGS OF THE CONFERENCE ON
LANGUAGE TECHNOLOGIES & DIGITAL HUMANITIES
Uredila / Edited by: Darja Fišer, Tomaž Erjavec
Tehnični uredniki / Technical editors: Jakob Lenardič, Katja Meden, Mihael Ojsteršek
Založil / Published by:
Inštitut za novejšo zgodovino / Institute of Contemporary History
Izdal / Issued by:
Inštitut za novejšo zgodovino / Institute of Contemporary History
Za založbo / For the publisher:
Andrej Pančur
Direktor / Director
Ljubljana, 2022
First edition
Spletno mesto konference / Conference website:
https://www.sdjt.si/jtdh-2022 / https://www.sdjt.si/jtdh-2022/en
Publikacija je brezplačno dostopna na: / Publication is available free of charge at:
https://nl.ijs.si/jtdh22/proceedings-sl.html / https://nl.ijs.si/jtdh22/proceedings-en.html
To delo je objavljeno pod licenco Creative Commons Priznanje avtorstva 4.0 Mednarodna.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 121176323
ISBN 978-961-7104-20-2 (PDF)
Predgovor k zborniku konference
“Jezikovne tehnologije in digitalna humanistika”
Slovensko društvo za jezikovne tehnologije skupaj z Inštitutom za novejšo zgodovino, Centrom za jezikovne vire in tehnologije Univerze v Ljubljani ter raziskovalnima infrastrukturama CLARIN.SI in DARIAH-SI že četrtič po vrsti prireja konferenco “Jezikovne tehnologije in digitalna humanistika”. Konferenca po uspešni programski širitvi konference Jezikovne tehnologije, ki je potekala od leta 1998, na digitalno humanistiko leta 2016 ohranja povezovalni fokus med disciplinama, hkrati pa si prizadeva postati pomembno srečevališče raziskovalcev v regiji.
Letošnja konferenca je potekala na Fakulteti za družbene vede Univerze v Ljubljani. Ker smo želeli zagotoviti, da bi bila konferenca v čim večji meri dostopna vsem zainteresiranim, smo vabljeni predavanji in vse predstavitve posneli in po zaključku konference objavili na konferenčni spletni strani.
Na spletni strani konference pa je bil že vnaprej objavljen tudi zbornik konference.
Konferenčne vsebine smo razvrstili v tri dni. Prvi dan je bil posvečen predkonferenčnima seminarjema na temo tematskega modeliranja parlamentarnih razprav in raziskovalne infrastrukture CLARIN.SI.
Drugi in tretji dan pa so se zvrstile predstavitve vabljenih predavateljev in avtorjev sprejetih prispevkov.
Ker je bila zasedba na konferenci mednarodna, smo program izvedli v ločenih slovenskih in angleških sekcijah. Zvrstili sta se tako slovenska kot angleška študentska sekcija, dve slovenski in tri angleške redne sekcije ter angleška in slovenska poster sekcija, tako za redne kot za študentske prispevke. Ob zaključku konference smo nagradili najboljši študentski prispevek. V posebni sekciji so bili predstavljeni še dosedanji rezultati projekta Razvoj slovenščine v digitalnem okolju, po konferenci pa je sledil še redni letni občni zbor Slovenskega društva za jezikovne tehnologije.
Na letošnji konferenci sta se predstavila dva vabljena predavatelja ter avtorji 30 rednih prispevkov, 9 razširjenih povzetkov in 12 študentskih prispevkov. Vse prispevke so pregledali trije recenzenti. 20 prispevkov je napisanih v slovenskem, 31 pa v angleškem jeziku. Skupno število vseh avtorjev prispevkov je 120, od katerih je skoraj tretjina tujih (iz Avstralije, Bosne in Hercegovine, Brazilije, Bolgarije, Hrvaške, Finske, Francije, Italije, Luksemburga, Severne Makedonije in Srbije).
Urednika se najlepše zahvaljujeva vsem, ki so prispevali k uspehu konference: vabljenima predavateljema in avtorjem prispevkov za skrbno pripravljene prispevke, predstavitve in plakate, programskemu odboru za natančno recenzentsko delo, organizacijskemu odboru za izvedbo konference, moderatorjem diskusij, tehničnim urednikom za pripravo spletnega zbornika in raziskovalnima infrastrukturama DARIAH-SI in CLARIN.SI ter društvu SDJT za finančno podporo konference.
Ljubljana, september 2022
Darja Fišer in Tomaž Erjavec
Preface to the Proceedings of the Conference
“Language Technologies and Digital Humanities”
The Slovenian Language Technologies Society, together with the Institute of Contemporary History, the Centre for Language Resources and Technologies of the University of Ljubljana, and the research infrastructures CLARIN.SI and DARIAH-SI, has organised the 13th Conference on Language Technologies and Digital Humanities. After its successful expansion to Digital Humanities in 2016, the conference retains its focus on the integration of the two disciplines, while also aiming to position itself as an important meeting hub for researchers in the region.
This year’s conference took place at the Faculty of Social Sciences of the University of Ljubljana. In order to make the conference as accessible as possible to all participants, we recorded the invited talks and the presentations. After the conference, we published the recordings on the conference webpage, while the proceedings were made available on the webpage in advance.
The conference took place over the course of three days. On the first day, two pre-conference seminars were organised, one on topic modelling of parliamentary debates and another on the CLARIN.SI research infrastructure. Days two and three were dedicated to the two invited talks and to presentations of the accepted papers. Since the conference was also attended by international scholars, the programme was divided into separate Slovenian and English sessions. There was a Slovenian and an English student session, two Slovenian and three English regular sessions, as well as an English and a Slovenian poster session, both for regular and student contributions. In a special session, the results of the project Development of Slovene in a Digital Environment – Language Resources and Technologies were presented.
This year’s conference saw presentations from two invited speakers and from the authors of 30 regular papers, 9 extended abstracts, and 12 student papers. All the papers were reviewed by three reviewers.
20 papers were written in Slovene and 31 in English. The total number of authors of the accepted papers is 120, almost a third of whom were from abroad (from Australia, Bosnia and Herzegovina, Brazil, Bulgaria, Croatia, Finland, France, Italy, Luxembourg, North Macedonia, and Serbia).
The editors would like to thank everyone who has contributed to the success of this conference: the invited speakers and the authors of the papers for their carefully prepared papers, presentations, and posters; the programme committee for their detailed reviews; the organising committee for the smooth running of the conference; the discussion moderators; the technical editors for preparing the online proceedings; and the research infrastructures DARIAH-SI and CLARIN.SI as well as the SDJT society for financially supporting the conference.
Ljubljana, September 2022
Darja Fišer and Tomaž Erjavec
Programski odbor / Programme committee
Predsedstvo programskega odbora / Steering committee
Darja Fišer, predsednica / Chair
Filozofska fakulteta, Univerza v Ljubljani in Inštitut za novejšo zgodovino / Faculty of Arts, University of Ljubljana and Institute of Contemporary History
Simon Dobrišek
Fakulteta za elektrotehniko, Univerza v Ljubljani / Faculty of Electrical Engineering, University of Ljubljana
Tomaž Erjavec
Institut “Jožef Stefan” / Jožef Stefan Institute
Andrej Pančur
Inštitut za novejšo zgodovino / Institute of Contemporary History
Matej Klemen, študentska sekcija / student section
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Aleš Žagar, študentska sekcija / student section
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Člani programskega odbora in recenzenti / Programme committee members and reviewers
Špela Arhar Holdt
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Petra Bago
Filozofska fakulteta, Univerza v Zagrebu / Faculty of Arts, University of Zagreb
Vuk Batanović
Fakulteta za elektrotehniko, Univerza v Beogradu / Faculty of Electrical Engineering, University of Belgrade
Zoran Bosnić
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Narvika Bovcon
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Václav Cvrček
Inštitut češkega narodnega korpusa, Karlova univerza v Pragi / Institute of the Czech National Corpus, Charles University in Prague
Jaka Čibej
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Helena Dobrovoljc
Inštitut za slovenski jezik Frana Ramovša, ZRC SAZU / Fran Ramovš Institute of the Slovenian Language, ZRC SAZU
Kaja Dobrovoljc
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Jerneja Fridl
Znanstvenoraziskovalni center Slovenske akademije znanosti in umetnosti / Research Centre of the Slovenian Academy of Sciences and Arts
Polona Gantar
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Vojko Gorjanc
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Jurij Hadalin
Inštitut za novejšo zgodovino / Institute of Contemporary History
Miran Hladnik
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Ivo Ipšić
Univerza na Reki / University of Rijeka
Mateja Jemec Tomazin
Inštitut za slovenski jezik Frana Ramovša, ZRC SAZU / Fran Ramovš Institute of the Slovenian Language, ZRC SAZU
Alenka Kavčič
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Iztok Kosem
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Simon Krek
Laboratorij za umetno inteligenco, Institut “Jožef Stefan” / Artificial Intelligence Laboratory, Jožef Stefan Institute
Jakob Lenardič
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Nikola Ljubešić
Odsek za tehnologije znanja, Institut “Jožef Stefan” / Department of Knowledge Technologies, Jožef Stefan Institute
Nataša Logar
Fakulteta za družbene vede, Univerza v Ljubljani / Faculty of Social Sciences, University of Ljubljana
Matija Marolt
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Sanda Martinčić Ipšić
Univerza na Reki / University of Rijeka
Maja Miličević Petrović
Univerza v Bolonji / University of Bologna
Dunja Mladenić
Laboratorij za umetno inteligenco, Institut “Jožef Stefan” / Artificial Intelligence Laboratory, Jožef Stefan Institute
Matija Ogrin
Inštitut za slovensko literaturo in literarne vede ZRC SAZU / Institute of Slovenian Literature and Literary Sciences, ZRC SAZU
Matevž Pesek
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Dan Podjed
Inštitut za slovensko narodopisje ZRC SAZU / Institute of Slovenian Ethnology, ZRC SAZU
Senja Pollak
Odsek za tehnologije znanja, Institut “Jožef Stefan” / Department of Knowledge Technologies, Jožef Stefan Institute
Ajda Pretnar Žagar
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Marko Robnik-Šikonja
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana
Tanja Samardžić
Univerza v Zürichu / University of Zurich
Miha Seručnik
Zgodovinski inštitut Milka Kosa ZRC SAZU / Milko Kos Historical Institute, ZRC SAZU
Mirjam Sepesy Maučec
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru / Faculty of Electrical Engineering and Computer Science, University of Maribor
Marko Stabej
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana
Branislava Šandrih Todorović
Filološka fakulteta, Univerza v Beogradu / Faculty of Philology, University of Belgrade
Mojca Šorn
Inštitut za novejšo zgodovino / Institute of Contemporary History
Janez Štebe
Fakulteta za družbene vede, Univerza v Ljubljani / Faculty of Social Sciences, University of Ljubljana
Simon Šuster
Univerza v Melbournu / University of Melbourne
Daniel Vasić
Univerza v Mostarju / University of Mostar
Darinka Verdonik
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru / Faculty of Electrical Engineering and Computer Science, University of Maribor
Andrej Žgank
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru / Faculty of Electrical Engineering and Computer Science, University of Maribor
Jerneja Žganec Gros
Alpineon d.o.o. / Alpineon d.o.o., Slovenia
Branko Žitko
Fakulteta za znanost, Univerza v Splitu / Faculty of Science, University of Split
Organizacijski odbor / Organising committee
Mojca Šorn, predsednica / Chair
Inštitut za novejšo zgodovino / Institute of Contemporary History
Ana Cvek
Inštitut za novejšo zgodovino / Institute of Contemporary History
Kaja Dobrovoljc
Filozofska fakulteta, Univerza v Ljubljani, Institut “Jožef Stefan” / Faculty of Arts, University of Ljubljana, Jožef Stefan Institute
Jerneja Fridl
Znanstvenoraziskovalni center Slovenske akademije znanosti in umetnosti / Research Centre of the Slovenian Academy of Sciences and Arts
Katja Meden
Institut “Jožef Stefan” / Jožef Stefan Institute
Mihael Ojsteršek
Inštitut za novejšo zgodovino / Institute of Contemporary History
Nataša Rozman
Inštitut za novejšo zgodovino / Institute of Contemporary History
Organizatorji / Organizers
URNIK / TIMETABLE
Sreda / Wednesday, 14. 9. 2022
Inštitut za novejšo zgodovino / Institute of Contemporary History
09.00-09.30
Registracija / Registration
09.30-11.00
Orange delavnica 1. del / Orange Tutorial Part 1 - 1. nadstropje, Stavba A / 1st floor, Building A
11.00-11.30
Odmor za kavo / Coffee break
11.30-13.00
Orange delavnica 2. del / Orange Tutorial Part 2 - 1. nadstropje, Stavba A / 1st floor, Building A
13.00-14.30
Kosilo / Lunch
14.30-15.30
CLARIN delavnica 1. del / CLARIN Tutorial Part 1 - 1. nadstropje, Stavba A / 1st floor, Building A
15.30-16.00
Odmor za kavo / Coffee break
16.00-17.30
CLARIN delavnica 2. del / CLARIN Tutorial Part 2 - 1. nadstropje, Stavba A / 1st floor, Building A
17.30
Neformalno večerno druženje / Informal dinner
Četrtek / Thursday, 15. 9. 2022
Fakulteta za družbene vede / Faculty of Social Sciences
08.30-09.15
Registracija / Registration - 1. nadstropje / 1st floor
09.15-09.30
Otvoritev / Opening - Soba 20 / Room 20
09.30-10.00
Študentska sekcija SLO / Student Session SLO - Soba 20 / Room 20
David Bordon: Govoriš nevronsko? Kako ljudje razumemo jezik sodobnih strojnih prevajalnikov
Špela Antloga: Korpusni pristopi za identifikacijo metafore in metonimije: primer metonimije v korpusu g-KOMET
10.00-11.00
Vabljeno predavanje 1 / Keynote 1 - Soba 20 / Room 20
Eetu Mäkelä (University of Helsinki): Designing computational systems to support humanities and social sciences research
11.00-11.30
Odmor za kavo / Coffee break
11.30-13.00
Sekcija 1 SLO / Oral Session 1 SLO - Soba 20 / Room 20
Iztok Kosem, Jaka Čibej, Kaja Dobrovoljc and Nikola Ljubešić: Spremljevalni korpus Trendi: metode, vsebina in kategorizacija besedil
Eva Pori, Jaka Čibej, Tina Munda, Luka Terčon and Špela Arhar Holdt: Lematizacija in oblikoskladenjsko označevanje korpusa SentiCoref
Kaja Dobrovoljc, Luka Terčon and Nikola Ljubešić: Universal Dependencies za slovenščino: nadgradnja smernic, učnih podatkov in razčlenjevalnega modela
Darinka Verdonik, Andreja Bizjak, Andrej Žgank and Simon Dobrišek: Metapodatki o posnetkih in govorcih v govornih virih: primer baze Artur
Gregor Donaj and Mirjam Sepesy Maučec: Primerjava načinov razcepljanja besed v strojnem prevajanju slovenščina-angleščina
Tomaž Erjavec, Kaja Dobrovoljc, Darja Fišer, Jan Jona Javoršek, Simon Krek, Taja Kuzman, Cyprian Laskowski, Nikola Ljubešić and Katja Meden: Raziskovalna infrastruktura CLARIN.SI

11.30-13.00
Sekcija 1 ANG / Oral Session 1 ENG - Soba 21 / Room 21
Jakob Lenardič and Kristina Pahor de Maiti: Slovenian Epistemic and Deontic Modals in Socially Unacceptable Discourse Online
Jure Skubic and Darja Fišer: Parliamentary Discourse Research in History: Literature Review
Maja Miličević Petrović, Vuk Batanović, Radoslava Trnavac and Borko Kovačević: Cross-Level Semantic Similarity in newswire texts and software code comments: Insights from Serbian data in the AVANTES project
Ajda Pretnar Žagar, Nikola Đukić and Rajko Muršič: Document enrichment as a tool for automated interview coding
Nikola Ljubešić and Peter Rupnik: The ParlaSpeech-HR benchmark for speaker profiling in Croatian
Marta Petrak, Mia Uremović and Bogdanka Pavelin Lešić: Fine-grained human evaluation of NMT applied to literary text: case study of a French-to-Croatian translation

13.00-13.45
Kosilo / Lunch
13.45-14.30
Predstavitev plakatov ANG / Poster Session with coffee ENG - Predprostor predavalnic, prvo nadstropje / Anteroom of the lecture halls, 1st floor
Jasna Cindrič, Lara Kuhelj, Sara Sever, Živa Simonišek and Miha Šemen: Data Collection and Definition Annotation for Semantic Relation Extraction
Katja Meden: Speech-level Sentiment Analysis of Parliamentary Debates using Lexicon-based Approaches
Vladimir Polomac: Serbian Early Printed Books: Towards Generic Model for Automatic Text Recognition using Transkribus
Branko Žitko, Lucija Bročić, Angelina Gašpar, Ani Grubišić, Daniel Vasić and Ines Šarić-Grgić: Automatic Predicate Sense Disambiguation Using Syntactic and Semantic Features
Henna Paakki, Faeze Ghorbanpour and Nitin Sawhney: An approach to computational crisis narrative analysis: a case-study of social media discourse interaction with news narratives about Covid-19 vaccinations in India
Petra Matović and Katarina Radić: A Parallel Corpus of the New Testament: Digital Philology and Teaching the Classical Languages in Croatia
14.30-16.00
Sekcija 2 SLO / Oral Session 2 SLO - Soba 20 / Room 20
Špela Arhar Holdt, Polona Gantar, Iztok Kosem, Eva Pori, Nataša Logar Berginc, Vojko Gorjanc and Simon Krek: Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine
Martin Anton Grad and Nataša Hirci: Raba kolokacijskega slovarja sodobne slovenščine pri prevajanju kolokacij
Tadeja Rozman and Špela Arhar Holdt: Gradnja Korpusa študentskih besedil KOŠ
Maja Veselič and Dunja Zorman: Uporaba Europeaninega podatkovnega modela (EDM) pri digitalizaciji kulturne dediščine: primer Skuškove zbirke iz Slovenskega etnografskega muzeja v projektu PAGODE-Europeana China
Matija Marolt, Mark Žakelj, Alenka Kavčič and Matevž Pesek: Poravnava zvočnih posnetkov s transkripcijami narečnega govora in petja
Janez Križaj, Simon Dobrišek, Aleš Mihelič and Jerneja Žganec Gros: Zadnji napredki pri samodejni slovenski grafemsko-fonemski pretvorbi

14.30-16.00
Sekcija 2 ANG / Oral Session 2 ENG - Soba 21 / Room 21
Thi Hong Hanh Tran, Matej Martinc, Andraž Repar, Antoine Doucet and Senja Pollak: A Transformer-based Sequence-labeling Approach to the Slovenian Cross-domain Automatic Term Extraction
Michal Mochtak, Peter Rupnik and Nikola Ljubešić: The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia
Petra Bago and Virna Karlić: DirKorp: A Croatian corpus of directive speech acts
Sara Košutar, Dario Karl, Matea Kramarić and Gordana Hržica: Automatic text analysis in language assessment: developing a MultiDis web application
Boshko Koloski, Senja Pollak and Matej Martinc: What works for Slovenian? A comparative study of different keyword extraction systems
Andrejka Žejn and Mojca Šorli: Annotation of Named Entities in the May68 Corpus: NEs in modernist literary texts

19.00-21.00
Konferenčna večerja / Conference dinner
Petek / Friday, 16. 9. 2022
Fakulteta za družbene vede / Faculty of Social Sciences
08.30-09.00
Registracija / Registration - 1. nadstropje / 1st floor
09.00-10.00
Študentska sekcija ANG / Student Session ENG - Soba 20 / Room 20
Ružica Farmakovski and Natalija Tomić:
Serbo-Croatian Wikipedia between Serbian and Croatian Wikipedia
Meta Jazbinšek, Teja Hadalin, Sara Sever, Erika Stanković and Eva Boneš: Neural translation model specialized in translating English TED Talks into Slovene
Uroš Šmajdek, Maj Zirkelbach, Matjaž Zupanič and Meta Jazbinšek: Preparing a corpus and a question answering system for Slovene
Tvrtko Balić:
The CCRU as an Attempt of Doing Philosophy in a Digital World
10.00-11.00
Vabljeno predavanje 2 / Keynote 2 - Soba 20 / Room 20
Benoît Sagot (INRIA):
Large-scale language models: challenges and perspective
11.00-11.30
Odmor za kavo / Coffee break
11.30-12.45
Sekcija 3 ANG / Oral Session 3 ENG - Soba 20 / Room 20
Taja Kuzman, Nikola Ljubešić and Senja Pollak: Assessing Comparability of Genre Datasets via Cross-Lingual and Cross-Dataset Experiments
Špela Vintar and Andraž Repar: Human evaluation of machine translations by semi-professionals: Lessons learnt
Aleksandar Petrovski: A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
Darja Fišer, Tjaša Konovšek and Andrej Pančur: Populist and Non-Populist Discourse in Slovene Parliament (1992 – 2018)
Petra Bago: Progress of the RETROGRAM Project: Developing a TEI-like Model for Pre-standard Croatian Grammars

12.45-13.30
Kosilo / Lunch
13.30-14.15
Predstavitev plakatov z odmorom za kavo SLO / Poster Session with coffee SLO - Predprostor predavalnic / Anteroom of the lecture halls
Tina Mozetič, Miha Sever, Martin Justin and Jasmina Pegan: Evalvacijska kategorizacija strojno izluščenih protipomenskih parov
Nina Sangawa Hmeljak, Anna Sangawa Hmeljak and Jan Hrastnik: Ilukana - aplikacija za učenje japonskih zlogovnih pisav hiragana in katakana s pomočjo asociacij
Vili Grdič, Kaja Perme, Lea Turšič and Alja Križanec: Šahovska terminološka baza
Lucija Gril, Simon Dobrišek and Andrej Žgank: Akustično modeliranje z različnimi osnovnimi enotami za avtomatsko razpoznavanje slovenskega govora
Saša Babič and Tomaž Erjavec: Izdelava in analiza digitalizirane zbirke paremioloških enot
Magdalena Gapsa: Ocenjevanje uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine – pilotna študija
14.15-14.30
Podelitev nagrad in zaključek / Awards & Closing - Soba 20 / Room 20
14.30-16.00
Občni zbor SDJT / SDJT Annual Meeting
Razvoj slovenščine v digitalnem okolju – jezikovni viri in tehnologije: predstavitev vmesnih rezultatov / Development of Slovene in a Digital Environment – Language Resources and Technologies: Presentation of Intermediate Results - Soba 20 / Room 20
Kazalo / Table of Contents
Predgovor
Preface
Programski odbor / Programme committee
Člani programskega odbora / Programme committee members
Organizacijski odbor / Organising committee
Organizatorji / Organizers
Urnik / Timetable
Kazalo / Table of Contents
VABLJENI PRISPEVKI / INVITED TALKS

Designing computational systems to support humanities and social sciences research
Eetu Mäkelä

Large-scale language models: challenges and perspective
Benoît Sagot
PRISPEVKI – PAPERS

The impact of a one-session-phonetic training on the improvement of non-native speakers’ pronunciation of English
Amaury Flávio Silva

Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine
Špela Arhar Holdt, Polona Gantar, Iztok Kosem, Eva Pori, Nataša Logar, Vojko Gorjanc, Simon Krek

Izdelava in analiza digitalizirane zbirke paremioloških enot
Saša Babič, Tomaž Erjavec

DirKorp: A Croatian Corpus of Directive Speech Acts
Petra Bago, Virna Karlić

Universal Dependencies za slovenščino: nadgradnja smernic, učnih podatkov in razčlenjevalnega modela
Kaja Dobrovoljc, Luka Terčon, Nikola Ljubešić

Primerjava načinov razcepljanja besed v strojnem prevajanju slovenščina–angleščina
Gregor Donaj, Mirjam Sepesy Maučec

Raziskovalna infrastruktura CLARIN.SI
Tomaž Erjavec, Kaja Dobrovoljc, Darja Fišer, Jan Jona Javoršek, Simon Krek, Taja Kuzman, Cyprian Laskowski, Nikola Ljubešić, Katja Meden

ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts
Simon Gonzalez

Raba Kolokacijskega slovarja sodobne slovenščine pri prevajanju kolokacij
Martin Anton Grad, Nataša Hirci

Akustično modeliranje z različnimi osnovnimi enotami za avtomatsko razpoznavanje slovenskega govora
Lucija Gril, Simon Dobrišek, Andrej Žgank

What works for Slovenian? A comparative study of different keyword extraction systems
Boshko Koloski, Senja Pollak, Matej Martinc

Spremljevalni korpus Trendi: metode, vsebina in kategorizacija besedil
Iztok Kosem, Jaka Čibej, Kaja Dobrovoljc, Nikola Ljubešić

Automatic Text Analysis in Language Assessment: Developing a MultiDis Web Application
Sara Košutar, Dario Karl, Matea Kramarić, Gordana Hržica

Assessing Comparability of Genre Datasets via Cross-Lingual and Cross-Dataset Experiments
Taja Kuzman, Nikola Ljubešić, Senja Pollak

Slovenian Epistemic and Deontic Modals in Socially Unacceptable Discourse Online
Jakob Lenardič, Kristina Pahor de Maiti

The ParlaSpeech-HR benchmark for speaker profiling in Croatian
Nikola Ljubešić, Peter Rupnik

Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project
Maja Miličević Petrović, Vuk Batanović, Radoslava Trnavac, Borko Kovačević

The ParlaSent-BCS Dataset of Sentiment-annotated Parliamentary Debates from Bosnia and Herzegovina, Croatia, and Serbia
Michal Mochtak, Peter Rupnik, Nikola Ljubešić

Fine-grained human evaluation of NMT applied to literary text: case study of a French-to-Croatian translation
Marta Petrak, Mia Uremović, Bogdanka Pavelin Lešić

A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
Aleksandar Petrovski

Serbian Early Printed Books: Towards Generic Model for Automatic Text Recognition using Transkribus
Vladimir Polomac

Lematizacija in oblikoskladenjsko označevanje korpusa SentiCoref
Eva Pori, Jaka Čibej, Tina Munda, Luka Terčon, Špela Arhar Holdt

Document Enrichment as a Tool for Automated Interview Coding
Ajda Pretnar Žagar, Nikola Ðukić, Rajko Muršič

Parliamentary Discourse Research in History: Literature Review
Jure Skubic, Darja Fišer

Annotation of Named Entities in the May68 Corpus: NEs in modernist literary texts
Mojca Šorli, Andrejka Žejn

A Transformer-based Sequence-labeling Approach to the Slovenian Cross-domain Automatic Term Extraction
Thi Hong Hanh Tran, Matej Martinc, Andraž Repar, Antoine Doucet, Senja Pollak

Metapodatki o posnetkih in govorcih v govornih virih: primer baze Artur
Darinka Verdonik, Andreja Bizjak, Andrej Žgank, Simon Dobrišek

Uporaba Europeaninega podatkovnega modela (EDM) pri digitalizaciji kulturne dediščine: primer Skuškove zbirke iz Slovenskega etnografskega muzeja v projektu PAGODE-Europeana China
Maja Veselič, Dunja Zorman

Human Evaluation of Machine Translations by Semi-Professionals: Lessons Learnt
Špela Vintar, Andraž Repar

Automatic Predicate Sense Disambiguation Using Syntactic and Semantic Features
Branko Žitko, Lucija Bročić, Angelina Gašpar, Ani Grubišić, Daniel Vasić, Ines Šarić-Grgić
POVZETKI – ABSTRACTS

Progress of the RETROGRAM Project: Developing a TEI-like Model for Croatian Grammar Books before Illyrism
Petra Bago

The CCRU as an Attempt of Doing Philosophy in a Digital World
Tvrtko Balić

Referencing the Public by Populist and Non-Populist Parties in the Slovene Parliament
Darja Fišer, Tjaša Konovšek, Andrej Pančur

Uporaba postopkov strojnega učenja pri samodejni slovenski grafemsko-fonemski pretvorbi
Janez Križaj, Simon Dobrišek, Aleš Mihelič, Jerneja Žganec Gros

Poravnava zvočnih posnetkov s transkripcijami narečnega govora in petja
Matija Marolt, Mark Žakelj, Alenka Kavčič, Matevž Pesek

A Parallel Corpus of the New Testament: Digital Philology and Teaching the Classical Languages in Croatia
Petra Matović, Katarina Radić

Pre-Processing Terms in Bulgarian from Various Social Sciences and Humanities (SSH) Domains: Status and Challenges
Petya Osenova, Kiril Simov, Yura Konstantinova

An Approach to Computational Crisis Narrative Analysis: A Case-study of Social Media Narratives Around the COVID-19 Crisis in India
Henna Paakki, Faeze Ghorbanpour, Nitin Sawhney

Gradnja Korpusa študentskih besedil KOŠ
Tadeja Rozman, Špela Arhar Holdt
ŠTUDENTSKI PRISPEVKI – STUDENT PAPERS

Korpusni pristopi za identifikacijo metafore in metonimije: primer metonimije v korpusu g-KOMET
Špela Antloga

Neural Translation Model Specialized in Translating English TED Talks into Slovene
Eva Boneš, Teja Hadalin, Meta Jazbinšek, Sara Sever, Erika Stanković

Govoriš nevronsko? Kako ljudje razumemo jezik sodobnih strojnih prevajalnikov
David Bordon

Data Collection and Definition Annotation for Semantic Relation Extraction
Jasna Cindrič, Lara Kuhelj, Sara Sever, Živa Simonišek, Miha Šemen

Serbo-Croatian Wikipedia Between Serbian and Croatian Wikipedia
Ružica Farmakovski, Natalija Tomić

Ocenjevanje uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine – pilotna študija
Magdalena Gapsa

Angleško-slovenska šahovska terminološka baza
Vili Grdič, Alja Križanec, Kaja Perme, Lea Turšič

Speech-level Sentiment Analysis of Parliamentary Debates using Lexicon-based Approaches
Katja Meden

Evalvacijska kategorizacija strojno izluščenih protipomenskih parov
Tina Mozetič, Miha Sever, Martin Justin, Jasmina Pegan

Ilukana – aplikacija za učenje japonskih zlogovnih pisav hiragana in katakana s pomočjo asociacij
Nina Sangawa Hmeljak, Anna Sangawa Hmeljak, Jan Hrastnik

Filter nezaželene elektronske pošte za akademski svet
Anja Vrečer

Preparing a Corpus and a Question Answering System for Slovene
Matjaž Zupanič, Maj Zirkelbach, Uroš Šmajdek, Meta Jazbinšek
VABLJENI PRISPEVKI / INVITED TALKS
Designing computational systems to support humanities and social
sciences research
Eetu Mäkelä
University of Helsinki, Finland
P.O. Box 24, 00014
eetu.makela@helsinki.fi
Abstract
From the viewpoint of the humanities and social sciences, collaborations with computer scientists often fail to deliver. In my research group, we have tried to understand why this is, and what to do about it. In this talk, I will discuss three key elements that we have discovered.

First, datasets in the humanities and social sciences are often not neatly representative of the object of interest. Systems need to provide ways to evaluate and counter the biases, confounders and noise in the data.

Second, there is often a large gap between what is in the data and what would be of interest. This gap needs to be bridged using algorithms, but care must be taken that a) what the algorithm produces actually matches the interest and b) its application does not introduce bias of its own (interestingly, the algorithm performance metrics of interest here often differ from those generally used in NLP/computer science).

Third, on a process level, collaboration between researchers from different disciplines is hard due to discrepancies in expectations relating to all facets of research, from research questions through methodology to the publication of results. Projects and systems need to acknowledge this, and be designed to facilitate iterative movement in the right direction.
Bio
Eetu Mäkelä is an associate professor in Human Sciences–Computing Interaction at the University of Helsinki, and a docent (adjunct professor) in computer science at Aalto University. At the Helsinki Centre for Digital Humanities, he leads a research group that seeks to figure out the technological, processual and theoretical underpinnings of successful computational research in the humanities and social sciences.
Additionally, he serves as a technological director at the DARIAH-FI infrastructure for computational humanities and is one of three research programme directors in the datafication research initiative of the Helsinki Institute for Social Sciences and Humanities. For his work, he has obtained a total of 19 awards, including multiple best paper awards in conferences and journals, as well as multiple open data and open science awards. He also has a proven track record in creating systems fit for continued use by their audience.
Large-scale language models: challenges and perspective
Benoît Sagot
Inria Paris (équipe ALMAnaCH)
2 rue Simone Iff CS 42112
75589 Paris Cedex 12, France
benoit.sagot@inria.fr
Abstract
The emergence of large-scale neural language models in Natural Language Processing (NLP) research and applications has improved the state of the art in most NLP tasks. However, training such models requires enormous computational resources and training data. The characteristics of the training data have an impact on the behaviour of the models trained on it, depending for instance on the data’s homogeneity and size. In this talk, I will speak about how we developed the large-scale multilingual OSCAR corpus. I will describe the lessons we learned while training the French language model CamemBERT, the first large-scale monolingual model for a language other than English, especially in terms of the influence of size and heterogeneity of the training corpus. I will also sketch out a few research questions related to biases in large-scale language models, with a focus on the impact of tokenisation and language imbalance, in the context of the BigScience initiative. I will conclude with my thoughts on the future of language models and their impact on NLP and other data processing fields (speech, vision).
Bio
Benoît Sagot, Directeur de Recherches (Senior Researcher) at Inria, is the head of the Inria project-team ALMAnaCH in Paris, France. A specialist in natural language processing (NLP) and computational linguistics, his research focuses on language modelling, language resource development, machine translation, text simplification, part-of-speech tagging and parsing, computational morphology and, more recently, digital humanities (computational historical linguistics and historical language processing). He has been the PI or co-PI of a number of national and international projects, and is the holder of a chair in the PRAIRIE institute dedicated to research in artificial intelligence. He is also the co-founder of two start-ups where he uses his expertise in NLP and data mining for the automatic analysis of employee survey results.
PRISPEVKI – PAPERS
The impact of a one-session-phonetic training on the improvement of non-native speakers’ pronunciation of English
Amaury Flávio Silva
Technology College of Jacareí (FATEC Jacareí) - São Paulo, Brazil
Rua Faria Lima, 155 – Jardim Santa Maria, Jacareí – SP, Brazil, Zip Code 12328-070
amaury.silva@fatec.sp.gov.br
Abstract
Due to the difficulties L2 [1] learners face regarding pronunciation, we conducted an experiment to find out if the participants of a one-session phonetic training would present any sign of improvement in their speech a week after the session. In order to evaluate their improvement, we checked whether the interword phonetic phenomena of resyllabification, blending and hiding could be found in the subjects’ speech. Furthermore, intraword-level pronunciation was also investigated. The findings show that improvement related to the presence of resyllabification occurred for all the subjects, while improvement in the other phenomena studied was heterogeneous.

[1] We use the term ‘L2’ to refer to the teaching of English as a foreign and as a second language.
1. Introduction

Until the end of the 20th century, there was a limited number of studies regarding pronunciation (Derwing and Munro, 2005). This neglect is attributed to the fact that pronunciation was considered an aspect of language learning that could be naturally acquired through the learning process. However, since 2005 this viewpoint has been changing, inasmuch as several studies, conferences, and articles about L2 pronunciation started to arise (Thomson and Derwing, 2014).

Despite the fact that the importance of L2 pronunciation has become more evident, there are still L2 students, teachers and researchers who consider pronunciation teaching unnecessary, as they reckon it can be learnt through exposure.

We regard pronunciation instruction as an essential part of the L2 teaching process. Its essential character becomes more evident when L2 learners, in spite of studying the L2 for many years, still struggle to correctly pronounce the L2 sounds, especially the ones that are not part of their L1 inventory systems. Nonetheless, we do not believe that achieving native-like pronunciation is necessary: pronunciation that is intelligible enough not to cause misunderstandings or hamper the flow of communication is what should be expected.

Owing to our belief that pronunciation instruction should be part and parcel of L2 language learning, we decided to carry out a study that aims to check the benefits of a one-session pronunciation training in the improvement of the pronunciation of a group of subjects, Brazilian learners of English as a foreign language.

With regard to this one-session training, we hypothesize that there may be some kind of improvement in the subjects’ pronunciation, but that more sessions will be necessary to address all the pronunciation problems they may have. Moreover, the less proficient the students are, the higher the number of sessions necessary to help them deal with their pronunciation problems.

The dataset used during the training session was based on a study developed by Silva (2021), in which he studied examples of coarticulatory effects that we also incorporated in our pronunciation instruction session.

2. Goal of the paper

This paper, whose goal is to investigate the efficacy of a one-session phonetic training to enhance the participants’ performance in pronunciation tasks, also aims to provide a guideline that L2 teachers could use to help their students improve. Furthermore, we hope that researchers could use the methods applied here to carry out new experiments in this area.

3. Theoretical Background

The increasing number of pronunciation-related studies since 2005 reveals the importance that pronunciation instruction has in the L2 learning process. Not only does it allow learners to become more confident when they speak, it also improves speech intelligibility as it helps to avoid misunderstandings.

Due to the importance of pronunciation, Thomson and Derwing (2014) wrote an article in which they evaluated 75 L2 pronunciation studies, most of which affirm that there was some kind of improvement in the speakers’ pronunciation due to the training they took. The authors point out that diverging results take place owing to a few factors such as ‘learner individual differences, goals and foci of instruction, type and duration of instructional input and assessment procedures’ (p. 1).

Most of the 75 studies focused on the achievement of native-like pronunciation by the learner and consisted of the use of computer-assisted tools. Moreover, the studies aimed at teaching the pronunciation of individual segments instead of teaching suprasegmental features, which would involve, for instance, resyllabification, prosodic boundaries, word stress, intonation, and speech rate.

In order to teach the pronunciation of segments, most of the time the learners were engaged in activities that required them to read texts aloud, instead of producing spontaneous speech.

When it comes to the quality of a pronunciation study, Thomson and Derwing (2014) mention a few features such studies should have. Firstly, they express their belief that pronunciation instruction should focus on ‘helping students become more understandable’ (p. 2). From this principle, they point out that an ideal pronunciation study should be able to give plenty of information on the subjects, have enough data to carry out statistical analyses, have a control group, and should not be limited to reading-aloud tasks, i.e., it should also include spontaneous speech samples. Finally, it should include delayed assessment to verify the lasting effect of the pronunciation instruction.

With regard to qualitative analyses, they should encompass aspects such as motivation, type of interactions in the L2 and even social influences (Thomson and Derwing, 2014).

The training input of the studies surveyed, which was either classroom instruction or computer-assisted pronunciation training, ranged from the manipulation of segments (Wang, 2002; Lee, 2009) to providing students with speech samples produced by native speakers so that students could listen to them and compare them with their own productions (Gonzales-Bueno, 1997; Guilloteau, 1997; Weinberg and Knoerr, 2003; Lord, 2005; Pearson et al., 2011).

The learners’ performances were evaluated by human listeners in 79 per cent of the studies; in the other 21 per cent they were evaluated using acoustic analyses.

The majority of the pronunciation training studies reviewed by Thomson and Derwing (2014) lacked an explicit theoretical background, so that the pronunciation training was solely based on the researchers’ own experience. In our training, we considered the research on reduction phenomena led by Silva (2016, 2021), the findings on coarticulation by Browman and Goldstein (1986, 1989), and the work developed by Vroomen and de Gelder (1999) on resyllabification. We will be discussing this theoretical background later on in this section.

One important aspect that was not clear in the studies was the procedure taken during the training sessions (training input). The lack of clarity in the methodological procedures prevents other teachers and researchers from replicating the steps used in the studies in their own classes or research. Therefore, a detailed methodological procedure is necessary ‘for the benefit of other researchers and teachers’ (Thomson and Derwing, 2014, p. 11).

The research on pronunciation training by Thomson and Derwing (2014) revealed that most of the participants showed some kind of improvement after the training. Nonetheless, the majority of studies only focused on the instruction of single sounds, such as the contrast of /i:/ and /ɪ/. Had the studies covered several segmental and suprasegmental features, more time would have been necessary for the learners to present significant improvement.

Another issue that questions the efficacy of the studies is whether or not the assessment used in them would reflect an improvement of intelligibility when language is used in real-life contexts. For such an issue to be solved, the studies should focus on ‘more intelligible, as opposed to less-accented speech … (and) include a variety of assessment tasks’ (Thomson and Derwing, 2014, p. 13-14). Furthermore, the authors state that evaluating the efficacy of the studies in a naturalistic fashion would take years, instead of weeks or months.

We believe that any research should depart from a well-established theoretical standpoint. Hence, since in our analyses we focused on the influence adjacent intra- or interword segments have on one another, we turned to the studies developed by Browman and Goldstein (1986, 1989) on coarticulation.

According to Browman and Goldstein (1986, 1989), adjacent segments may be subjected to the phenomena called blending and hiding. Blending occurs when adjacent segments share the same articulator, so that they cannot be produced without disturbance in their constriction location. An example of this phenomenon takes place when the segments [t] and [ð] from the context ‘I want that’ have to be produced one after the other. In this context, the constriction location of either segment may be disturbed, as they are both characterized by a tongue-tip gesture. Thus, the canonical production of the alveolar plosive and the interdental fricative may be realized as an approximant and as a dental fricative, respectively.

Hiding occurs when adjacent segments do not share the same articulator, so that the production of the first segment is overlapped by the production of the second one. Such a phenomenon may occur when the segments [t] and [b] from the context ‘I can’t buy it’ have to be produced one after the other. When this happens, the gesture of mouth closure to produce the bilabial consonant ‘hides’ the burst that would be caused by the release of the alveolar plosive.

Being aware of how these phenomena work allows speakers to reduce articulatory effort when they speak, as the excursion of the articulators is decreased. The reduction in articulatory effort was studied by Silva (2016, 2021). In his investigations, he noticed that reduction is a strategy commonly used by native speakers, which can be characterized by the replacement of a segment that calls for high excursion of the articulators by one that does not (low-hierarchy reduction). Reduction can also be characterized by a segment deletion (high-hierarchy reduction).

Another phenomenon that causes reduction in articulatory effort is the one called resyllabification. It happens when ‘consonants are attached to syllables other than those from which they originally came’ (Vroomen and de Gelder, 1999, p. 413). An example of this phenomenon is the sentence ‘you can evaluate this’, in which the consonant /n/ of the word ‘can’ is coarticulated with the vowel /ɪ/ of the word ‘evaluate’. This process contributes to maintaining the speech flow, as the speaker does not need to add a pause between adjacent words.

The analyses carried out in this study as well as the concepts explained during the training session were based on the phenomena of blending and hiding (Browman and Goldstein, 1986, 1989), reduction (Silva, 2016, 2021), and resyllabification (Vroomen and de Gelder, 1999).
4. Methods

In this section, we will describe details related to the subjects that participated in the study, the research dataset, the acoustic inspection and the training session.

4.1 Subjects

In order to conduct the analysis, we had the participation of four subjects, native speakers of Brazilian Portuguese (three males and one female), who study English as a foreign language. The subject ‘English’ is part of the Technological course the subjects were taking, and all of them were enrolled in the same class, taking the third semester. It is important to point out that English is offered throughout the duration of the course, six semesters, and that, despite the fact that all the students were in the same class, their proficiency level was not the same.

The four participants will be referred to as subjects, ‘S’, in this investigation.

4.2 Research dataset

The research dataset, table 1, is an extract from the program Actors’ Studio (season 12, episode 13, released in July 2006) that was sent to the subjects, who had to record it and send it to the trainer before the training session. After the session, they would record it once more and send it to the trainer again so that their improvement could be analyzed. We would like to point out that in our experiment we asked the subjects to use their own smartphones or computers to record the dataset. This was done as they could not come to college to record it in its sound laboratory due to the restrictions related to the COVID-19 pandemic.

The same text was used in the pre- and post-training phases, as we aimed to analyze whether or not improvement could be observed in the second recording in terms of the group of words we selected that encompass the phenomena described in tables 2-4.

This dataset was also selected by Silva (2021) in his investigation about coarticulatory phenomena analysis.

It's funny, you know, someone comes into your life at a certain time and that’s one of the great things that happens on Earth is you're mysteriously guided towards these people that you get to dance with, you know. And I thought "How great is that", he's kind of, like, I don't want to say an angel to her, but he's someone who needs as much as he’s prepared to offer, and he has seen a lot of life, and he's not a typical lawyer-type.

Table 1: Research Dataset

Using the dataset above, we selected fragments in which the phenomena of resyllabification, blending and hiding could take place. Furthermore, we also analyzed the pronunciation of a group of words that the students mispronounced in the pre-training recording.

The phenomenon of resyllabification was investigated in 11 contexts, presented in the next table.

Context | Phonemes involved
‘comes into’ | /z/ and /ɪ/
‘at a certain’ | /t̬/ and /ə/
‘one of’ | /n/ and /ə/
‘on Earth’ | /n/ and /ɜr/
‘and I’ | /d/ and /aɪ/
‘great is’ | /t̬/ and /ɪ/
‘kind of’ | /d/ and /ə/
‘an angel’ | /n/ and /eɪ/
‘as much as’ | /tʃ/ and /ə/
‘seen a’ | /n/ and /ə/
‘lot of life’ | /t̬/ and /ə/

Table 2: Resyllabification phenomenon

With regard to the phenomena of blending and hiding, we analyzed eight contexts, presented in the next table.

Context | Phonemes involved
‘certain time’ | /n/ and /t/
‘great things’ | /t/ and /θ/
‘guided towards’ | /d/ and /t/
‘get to dance’ | /t/ and /t/
‘prepared to’ | /d/ and /t/
‘typical lawyer’ | /l/ and /l/
‘these people’ | /z/ and /p/
‘I don’t want’ | /t/ and /w/

Table 3: Blending and hiding phenomena

Lastly, when it comes to word-level pronunciation, the words presented in the table below were investigated.

Word | Pronunciation errors found
‘someone’ | Phoneme substitution and insertion of a phoneme
‘certain’ | Phoneme substitution and word stress
‘mysteriously’ | Phoneme substitution and word stress
‘towards’ | Phoneme substitution
‘thought’ | Phoneme substitution
‘offer’ | Phoneme substitution and word stress
‘lawyer’ | Phoneme substitution and word stress

Table 4: Word-level pronunciation

4.3 Phonetic inspections

The phonetic inspection was carried out with the free software PRAAT, version 6.0.39, developed by Paul Boersma and David Weenink (2018) from the Institute of Phonetic Sciences of the University of Amsterdam.

The inspections were based on the observation of the waveform, the broadband spectrogram, the fundamental frequency and the intensity of the phonemic segments.
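This kind of inspection can also be scripted. The following is a minimal sketch, assuming the praat-parselmouth Python library (a wrapper around Praat; the authors themselves worked in the PRAAT program directly) and a hypothetical file name:

```python
# Minimal sketch of the acoustic inspection described in section 4.3 using
# praat-parselmouth (pip install praat-parselmouth). The authors worked in
# the PRAAT GUI itself; the file name here is hypothetical.
import parselmouth

snd = parselmouth.Sound("S1_pre_training.wav")

waveform = snd.values        # amplitude samples, one row per channel
times = snd.xs()             # matching time axis in seconds

# Broadband spectrogram (short analysis window, Praat's default 5 ms)
spectrogram = snd.to_spectrogram(window_length=0.005)

# Fundamental frequency track; 0.0 marks unvoiced frames
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]

# Intensity contour in dB
intensity = snd.to_intensity()

print(f"duration: {snd.duration:.3f} s")
print(f"mean intensity: {intensity.values.mean():.1f} dB")
```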
4.4 Training Session

The training took place in a single 50-minute session of an online class. It was recorded so that the subjects could revisit it as many times as they wanted in order to review the concepts explained. The training session the subjects participated in was provided by the researcher of this work.

At the beginning of the training, which took place after the first recording of the dataset was sent by the subjects, the original recording of the dataset was played, and the corresponding script was projected on the screen for the subjects to follow. The recording was played three times.

After that, the concept of resyllabification was explained, and the first context where such a phenomenon occurs according to table 2, ‘comes into’, was presented to the subjects (the orthography along with the recording). The context was played three times.

The subjects were asked to pay close attention to the recording of the context, as they would have to repeat it afterwards. If they could not repeat it, the trainer would repeat the context himself at least three more times in order to help the subjects grasp what and how they should say it.

Before moving on to the next context, the original recording was played one more time and the subjects were asked to repeat it. Not until all the subjects were able to repeat the context intelligibly would the trainer teach the next context.

The procedure described above was followed to teach the other contexts, including the resyllabification, blending and hiding phenomena. Word-level pronunciation instruction followed the steps related to playing the recording three times before repetition. However, after analyzing the first recording, we recognized the need to teach word stress and phoneme pronunciation.

It is important to point out that we did not use technical terms during the training, as our focus was simply on improving the subjects’ pronunciation.

When it came to words or groups of words the subjects found difficult to pronounce, the trainer noticed that it was necessary to teach the articulation of some phonemes, especially the ones not present in the subjects’ L1 inventory systems. After the instruction of the articulation of such phonemes, improvements could be observed in their pronunciation.

After the training session, the subjects had access to the original recording of the dataset and to a version recorded by the trainer, which was produced with a slower speech rate so that it could be helpful to less proficient subjects. These recordings were tools the subjects could use to improve their pronunciation before making the second recording, which had to be sent within a week.

Once all the subjects had sent their recordings, we started the data analysis, whose results are presented in the next section.

5. Data analyses

The analyses in this section will feature figures that contain the waveform, spectrogram, segmentation, and spelling of a selection of the contexts investigated. A table with a summary of all the contexts investigated in the pre-training phase is provided at the end of section 5.1, and one with all the contexts analyzed in the post-training phase at the end of section 5.2. We reckon it is important to point out that the subjects reported that they recorded the dataset several times and that they sent us the version they judged to be the best.

5.1 Pre-training analyses

In this section, we will present the analyses that refer to the pre-training recordings. The first one refers to the context ‘and I’, resyllabification.

Figure 1: Production of ‘and I’ by S1, pre-training.

Through the analysis of the broadband spectrogram and its corresponding waveform, we can infer that there was no pause between the production of the adjacent segments [d] and [aɪ], so the phenomenon of resyllabification was observed.

Figure 2: ‘On Earth’ by S1, pre-training. The figure shows a pause between the words ‘on’ and ‘Earth’.

In the production of ‘on Earth’, there was a pause between the segments [n] and [ɜr], so that the phenomenon of resyllabification did not take place.

Figure 3: Production of ‘lawyer’ by S2.

The figure above, which presents acoustic information, shows that the subject mispronounced the word ‘lawyer’, in that [lɔwər] was produced instead of [lɔɪər].
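The presence or absence of a pause between adjacent words, which the analyses above read off the waveform and spectrogram, can also be checked with Praat’s silence detector. Below is a small sketch under the same praat-parselmouth assumption; the thresholds and file name are illustrative choices, not values from the paper:

```python
# Sketch of pause detection between adjacent words (cf. the 'and I' and
# 'on Earth' analyses): label silent vs. sounding stretches with Praat's
# silence detector. Thresholds and file name are illustrative only.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("S1_on_earth.wav")

# TextGrid with "silent"/"sounding" intervals: min pitch 100 Hz, automatic
# time step, -25 dB silence threshold, 0.05 s minimum interval durations
tg = call(snd, "To TextGrid (silences)", 100, 0.0, -25.0, 0.05, 0.05,
          "silent", "sounding")

n = call(tg, "Get number of intervals", 1)
for i in range(1, n + 1):
    label = call(tg, "Get label of interval", 1, i)
    if label == "silent" and 1 < i < n:  # internal silence = a pause
        start = call(tg, "Get starting point", 1, i)
        end = call(tg, "Get end point", 1, i)
        print(f"pause from {start:.3f} s to {end:.3f} s")
```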
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
A summary of the data analyses that refer to the pre-training recordings is presented in Table 5. The word or group of words listed indicates that the phenomenon in the corresponding column was observed in its production.

Subjects | Resyllabification (group of words in which the phenomenon was observed) | Blending/Hiding (group of words in which the phenomena were observed) | Word-level pronunciation (mispronounced words)
S1 | 'and I', 'great is', 'kind of', 'lot of life', 'these people', 'I don't want to' | 'guided towards', 'typical lawyer' | All the words were mispronounced except 'someone' and 'offer'
S2 | 'at a certain', 'one of', 'on Earth', 'and I', 'kind of', 'lot of', 'these people' | 'great things', 'guided towards', 'get to dance', 'prepared to', 'typical lawyer' | 'mysteriously', 'thought', 'lawyer'
S3 | 'comes into', 'at a certain', 'one of', 'on Earth', 'and I', 'great is' | 'certain time', 'get to dance', 'prepared to', 'these people' | 'mysteriously', 'thought', 'lawyer'
S4 | All the contexts except 'and I' | All the contexts except 'certain time' | 'thought', 'offer', 'lawyer'

Table 5: Data analyses concerning the pre-training recordings.

5.2. Post-training analysis
In this section, we present the analyses that refer to the post-training recordings. The first one refers to the context 'on Earth' and the resyllabification phenomenon.

Figure 4: Concatenated productions of 'on Earth' by S1. Post-training left and pre-training right.
The concatenated productions of 'on Earth' by S1, presented in Figure 4, demonstrate that the resyllabification phenomenon was observed in the post-training recording, but not in the pre-training recording. This is confirmed by the absence of a pause between the segments [n] and [ɜr] in the post-training phase, whereas a pause is present in the pre-training spectrogram.

Figure 5: Concatenated productions of the post- and pre-training versions of 'great is' by S2.
As in the analysis of the context 'on Earth' (Figure 4), in the context 'great is' by S2 (Figure 5) the resyllabification phenomenon was observed in the post-training recording, but not in the pre-training one.

Figure 6: Production of the word 'offer' by S4.
The analysis of the production of the context 'offer' by S4 shows that the word stress was placed on the syllable '-fer' instead of the syllable 'of-', which is where the correct stress for the word 'offer' should fall. The stress on the syllable '-fer' is confirmed not only by the longer duration of the segment [ɛr], but also by its higher intensity in comparison to the segment [ɔ]. What is more, S4 used the segment [ɛr] instead of /ər/ in the second syllable.
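The duration and intensity comparison behind the 'offer' analysis can likewise be scripted. The sketch below is only an approximation of the manual procedure: it assumes a hypothetical recording offer_S4.wav with a TextGrid whose first tier labels the segments [ɔ] and [ɛr] as 'O' and 'Er':

    import parselmouth
    from parselmouth.praat import call

    snd = parselmouth.Sound("offer_S4.wav")
    tg = parselmouth.read("offer_S4.TextGrid")
    intensity = snd.to_intensity()

    # Compare duration and mean intensity of the two labelled segments;
    # longer duration and higher intensity on 'Er' would place the
    # stress on the syllable '-fer'.
    for i in range(1, int(call(tg, "Get number of intervals", 1)) + 1):
        label = call(tg, "Get label of interval", 1, i)
        if label in ("O", "Er"):
            t1 = call(tg, "Get starting point", 1, i)
            t2 = call(tg, "Get end point", 1, i)
            mean_db = call(intensity, "Get mean", t1, t2, "energy")
            print(f"{label}: {t2 - t1:.3f} s, {mean_db:.1f} dB")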
A summary of the data analyses that refer to the post-training recordings is presented in Table 6. The word or group of words listed indicates that the phenomenon in the corresponding column was observed in its production.

Subjects | Resyllabification (group of words in which the phenomenon was observed) | Blending/Hiding (group of words in which the phenomena were observed) | Word-level pronunciation (mispronounced words)
S1 | All the contexts | 'certain time', 'great things', 'get to dance', 'typical lawyer', 'these people', 'I don't want to' | All the words were mispronounced except 'someone' and 'offer'
S2 | All the contexts | 'certain time', 'great things', 'prepared to', 'these people' | 'someone', 'mysteriously', 'thought', 'lawyer'
S3 | 'comes into', 'at a certain', 'one of', 'on Earth', 'and I', 'great is', 'kind of', 'a lot of life' | 'certain time', 'great things', 'get to dance', 'prepared to' | 'someone', 'mysteriously', 'thought'
S4 | All the contexts except 'and I' | All the contexts | 'thought', 'offer'

Table 6: Data analyses concerning the post-training recordings.

6. Discussion
The analyses have shown that the one-session phonetic training was useful in helping the subjects improve their pronunciation with regard to the resyllabification phenomenon. Nevertheless, no homogeneous improvement was observed in terms of the remaining phenomena analyzed.
The observed improvement in the resyllabification feature in the production of S1 and S2 was characterized by the use of this strategy in all the contexts analyzed in the post-training recording, a fact not observed in the pre-training one. S3 also demonstrated improvement in the use of this strategy in that it was used in two more contexts in the post-training recording. No improvement was observed in terms of resyllabification for S4, but this subject had already presented excellent performance of this strategy, as there was only one context where it was not applied.
The presence of the blending and hiding phenomena was found in the production of S1 in most of the contexts and in all the contexts produced by S4 in the post-training recording. Such phenomena were noticed in fewer contexts in the production of S2 and in the same number in the production of S3 in the post-training recording.
With regard to the last feature analyzed after the post-training session, word-level pronunciation, no improvements were observed in the production of S1; S2 made one more mistake; and S3 improved the production of the word 'lawyer' but mispronounced a word he had produced correctly in the pre-training session, 'someone'. S4 improved the production of the word 'lawyer' but continued mispronouncing the words 'thought' and 'offer'.
Our findings have revealed different levels of improvement in the subjects' performance: S1 is the one who presented the most improvement. The improvement in S2's and S3's performance was limited to the presence of the resyllabification phenomenon. S4 is the most proficient subject, who presented only a few mistakes in the pre-training recording and was able to use the blending and hiding phenomena in all the contexts and to improve the pronunciation of a word after training.
The hypothesis we presented at the beginning of our work was confirmed, as the subjects' pronunciation improved to some extent, but more sessions are necessary to address certain pronunciation problems such as word-level pronunciation and the hiding and blending phenomena.
In future studies, we could ask the subjects to report on the time they have dedicated to studying and practising the pronunciation concepts covered during the training session. Furthermore, we could ask judges to evaluate the students' performance before and after the training session to find out whether a perceptual improvement in their pronunciation was clear, i.e., whether the level of intelligibility was enhanced.
We firmly believe that, although the number of participants was not adequate from a quantitative standpoint, as our aim was to conduct a qualitative investigation, the study has shown that improvement did occur, bringing to light the importance of phonetic instruction. Moreover, we expect that the procedure we used during the training session was described clearly enough for the study to be replicated by other researchers.
Lastly, we hope to continue our investigation by providing the subjects with more training sessions, evaluating them at least five months after the first training session, and recruiting more participants so that statistical analysis can be carried out.

7. References
Paul Boersma, David Weenink. 2018. Praat: doing phonetics by computer, version 6.0.39. Available at: http://www.praat.org/. Accessed on: 2 Dec. 2018.
Catherine Browman, Louis Goldstein. 1986. Towards an articulatory phonology. Phonology, v. 3, pages 219–252.
Catherine Browman, Louis Goldstein. 1989. Articulatory gestures as phonological units. Phonology, v. 6, pages 201–251.
Tracey M. Derwing, Murray J. Munro. 2005. Second language accent and pronunciation teaching: A research-based approach. TESOL Quarterly, 16/1: pages 71–77.
Manuela Gonzales-Bueno. 1997. The effects of formal instruction on the acquisition of Spanish stop consonants. Contemporary Perspectives on the Acquisition of Spanish, 2: pages 57–75.
Nancy Clarke Guilloteau. 1997. Modification of phonetic categories in French as a second language: Experimental studies with conventional and computer-based intervention methods. Unpublished Ph.D. thesis, University of Texas at Austin.
Lee Ji-Yeon. 2009. The effects of pronunciation instruction using duration manipulation in the acquisition of English vowel sounds by pre-service Korean EFL teachers. Unpublished Ph.D. thesis, University of Kansas.
Gillian Lord. 2005. (How) can we teach foreign language pronunciation? On the effects of a Spanish phonetics course. Hispania, 88/3: pages 557–567.
Pamela Pearson, Lucy Pickering, Rachel DaSilva. 2011. The impact of computer assisted pronunciation training on the improvement of Vietnamese learner production of English syllable margins. In: J. Levis and K. LeVelle, eds., Proceedings of the 2nd Pronunciation in Second Language Learning and Teaching Conference, Iowa State University, pages 169–180.
Amaury F. Silva. 2021. Coarticulatory phenomena analysis in English based on the articulatory phonology. São Paulo, CBTecLe, v. 1, n. 1.
Amaury F. Silva. 2016. Percepção de reduções em inglês como L2. Unpublished Ph.D. thesis, PUC-SP.
Ron I. Thomson, Tracey M. Derwing. 2014. The effectiveness of L2 pronunciation instruction: a narrative review. Oxford, Oxford University Press.
Jean Vroomen, Beatrice De Gelder. 1999. Lexical access of resyllabified words: evidence from phoneme monitoring. Memory & Cognition, 27(3), pages 413–421.
Xinchun Wang. 2002. Training Mandarin and Cantonese speakers to identify English vowel contrasts: long term retention and effects on production. Unpublished Ph.D. thesis, Simon Fraser University.
Alysse Weinberg, Hélène Knoerr. 2003. Learning French pronunciation: Audiocassettes or multimedia? CALICO Journal, 20/2: pages 315–336.
Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine
Špela Arhar Holdt,*‡ Polona Gantar,* Iztok Kosem,* Eva Pori,* Nataša Logar,** Vojko Gorjanc,* Simon Krek*
* Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
apolonija.gantar@ff.uni-lj.si, iztok.kosem@ff.uni-lj.si, eva.pori@ff.uni-lj.si, vojko.gorjanc@ff.uni-lj.si, simon.krek@ff.uni-lj.si
** Fakulteta za družbene vede, Univerza v Ljubljani
Kardeljeva ploščad 5, 1000 Ljubljana
natasa.logar@fdv.uni-lj.si
‡ Fakulteta za računalništvo in informatiko, Univerza v Ljubljani
Večna pot 113, 1000 Ljubljana
spela.arharholdt@fri.uni-lj.si
Povzetek
V prispevku predstavljamo rešitve za identifikacijo in označevanje sovražnega ter grobega besedišča v okviru koncepta odzivnega Slovarja sopomenk sodobne slovenščine. Ker gre za prvi tovrstni projekt, so pripravljene rešitve v veliki meri inovativne, umeščene pa v okvir problematike avtomatske strojne izdelave slovarja, njegove odprtosti in vključenosti uporabniške skupnosti. Prispevek prikazuje identifikacijo sovražnega in grobega besedišča ter pripis oznak oziroma opozorilnih ikon z daljšimi pojasnili. Oznake temeljijo na sporočanjskem namenu oziroma učinku, pri čemer je njihovo bistvo informacija o možnih posledicah rabe. Pri označevanju tako kot pri izdelavi celotnega slovarja posvečamo veliko pozornost digitalnemu mediju in vizualizaciji rešitev v njem.
Ker je odzivnost eden ključnih konceptov slovarja, se tudi pri rešitvah glede označevanja zavedamo pomembnosti sodelovanja z uporabniško skupnostjo, zato predlagamo še rešitve za sodelovanje s skupnostjo pri dodajanju oznak.
Extremely Offensive and Vulgar Vocabulary in the Responsive Thesaurus of Modern Slovene
In the paper we present the solutions for identification and annotation of extremely offensive and vulgar vocabulary, which can be found in the responsive Thesaurus of Modern Slovene. As this is the first project of its kind, the prepared solutions are to a great extent innovative, and have been devised considering the use of automatic methods in dictionary compilation, open access nature of dictionary data, and the inclusion of users into the compilation process. The paper describes the process of identification of extremely offensive and vulgar vocabulary, as well as the attribution of labels and warning icons containing longer explanations. The labels are based on their communicative purpose or effect, and are focussed on providing the information about potential consequences of word use. During the processes of labelling and dictionary compilation, considerable attention is paid to the digital medium and related visualisation solutions. As responsiveness is one of the key concepts of the dictionary, a part of preparing the labelling solutions was to design ways of including user community in labelling.
1. Uvod
Slovar sopomenk sodobne slovenščine (SSSS) je oblikovan po modelu odzivnega slovarja: v prvem koraku je bil pripravljen strojno, nadaljnje urejanje podatkov pa poteka po korakih in v sodelovanju jezikoslovcev ter širše zainteresirane skupnosti (Arhar Holdt et al., 2018: 404). V SSSS lahko slovarski uporabniki ob strojno pripravljeno sopomensko gradivo dodajo lastne predloge sopomenk, za vse sopomenke v slovarju pa je mogoče tudi glasovati in gradivo na tak način (pomagati) potrditi ali zavrniti.1 Vključevanje strojnih postopkov in predlogov uporabniške skupnosti v slovaropisne delotoke odgovarja na potrebe sodobnega časa, kot sta potreba skupnosti po odprto dostopnih jezikovnih podatkih in želja slovarskih uporabnikov po demokratičnem sodelovanju pri razvoju temeljne jezikovne infrastrukture. Na drugi strani pa ima neposredno objavljanje strojnega in uporabniško dodanega (nepregledanega) gradiva lahko tudi neželene posledice, ki jih je treba pri razvoju odzivnega modela predvideti in ustrezno obravnavati. Med prioritetami za razvoj SSSS je tako brez dvoma obravnava besedišča, ki vrednostno poimenuje posamezne družbene skupine in njihove pripadnike. Tako besedišče se trenutno v slovarju (lahko) pojavlja na različnih mestih in na različne načine.

1 Slovar v vmesniku je na https://viri.cjvt.si/sopomenke/slv/, kot slovarska baza pa na repozitoriju CLARIN.SI (Krek et al., 2018). Strojno pripravo slovarja opisujejo Krek et al. (2017), koncept odzivnega slovarja pa Arhar Holdt et al. (2018).

Namen prispevka je predstaviti obseg problematike, ki se pri odzivnem slovarju pomembno razlikuje od tradicionalnih slovaropisnih projektov, in opisati rešitve, ki bodo vključene v prihajajočo nadgradnjo SSSS. Med temi želimo posebej izpostaviti nove načine identifikacije in označevanja sovražnega, grobega ter drugače negativno vrednotenega besedišča, ki SSSS presegajo in so uporabne za različne sodobne jezikovne vire.

2. Sovražno, grobo, tabuizirano, zaničljivo … v družbi, jeziku in slovarju
Na kratko je mogoče sovražni govor opredeliti kot "aktivno javno spodbujanje antipatije do določene, ponavadi šibke, družbene skupine" (Rebolj, 2008: 13), v daljši in bolj povedni obliki pa kot (Petković in Kogovšek Šalamon, 2007: 23):
ustno ali pisno izražanje diskriminatornih stališč. Z njim širimo, spodbujamo, promoviramo ali opravičujemo rasno sovraštvo, ksenofobijo, homofobijo, antisemitizem, seksizem in druge oblike sovraštva, ki temeljijo na nestrpnosti. Mednje sodi tudi nestrpnost, ki se izraža z agresivnim nacionalizmom in etnocentrizmom, z diskriminacijo in sovražnostjo zoper manjšine, migrante in migrantke. Žrtve sovražnega govora praviloma niso posamezniki, pač pa ranljive družbene skupine. V osrčju sovražnega govora je prepričanje, da so nekateri ljudje manj vredni, zato je cilj sovražnega govora v razčlovečenju, ponižanju, ustrahovanju
in poslabšanju družbenega položaja tistih, proti katerim je naperjen.
Motl in Bajt (2016: 7) ugotavljata, da je sovražni govor deležen precejšnje pozornosti v različnih vedah, od prava, sociologije in komunikologije do psihiatrije in informatike, pridružimo pa jim lahko tudi jezikoslovje – predvsem jezikoslovje, povezano s slovarji. Ameriško slovaropisje (Hughes, 2011: 3. pogl.) je že pred desetletji v svoje vire načrtno vgradilo tudi občutljivost do ranljivih družbenih skupin, pri čemer ni zanemarilo nobenega od delov geselskega članka: razlag, oznak in zgledov rabe (Logar et al., 2020: 104). V manjši meri in pozneje, a vendarle so se opozorila o nujni tovrstni družbeni občutljivosti ter odgovornosti pojavila tudi v slovenskem prostoru (npr. Gorjanc, 2005; Kern, 2015; Logar et al., 2020: 91, 104), a jih kljub temu do sedaj ni polno upošteval še noben slovarski projekt.
Ni pa zgolj sovražni govor tisti, ki ga je treba v slovarjih obravnavati posebej pozorno. Kritično slovaropisje opozarja, da je treba pri slovarskih opisih izrecne (in nove) rešitve iskati pri vseh elementih, ki prinašajo vljudne in nevljudne vidike jezika, tabuiziranost, so usmerjeni v vrednotenje, konotacijo, kulturne aluzije ipd., še posebej pa je treba biti pozoren na nestabilna in spreminjajoča se poimenovanja vseh oblik drugosti (Moon, 2014: 85). Pri tem se sodobno slovaropisje ne more sklicevati na tradicionalne modele jezikovnega opisovanja in delovanja. Nikakor pri tem ni sprejemljivo tradicionalno razmišljanje, da "je slovar metajezikovni odsev dejanske hierarhizirane konceptualizacije sveta" (Vidovič Muha, 2013: 7), kar vodi v razpravljanje o resnicah v okviru slovaropisnega dela – prav nasprotno: slovaropisje mora jasno naslavljati vprašanja, ki so v svojem bistvu ideološka, saj gre za "uravnoteževanje opisa tega, kar prinašajo podatki glede pomena, s tem, na kakšen način 'naj bi bil' v postmoderni vključujoči družbi določen koncept obravnavan in predstavljen" (Moon, 2014: 89). Gre torej za to, da pri slovaropisnem delu končne rešitve preprosto ne morejo biti "samo jezikoslovne; neizogibno morajo biti tudi ideološke" (Moon, 2014: 94). Pomembno je, da se ideološkosti pri slovarskih opisih zavedamo, da odkrito in jasno povemo, da je slovaropisno delo težavno prav zato, ker je tudi ideološko (Gantar, 2015: 399), še posebej pri družbeno občutljivih elementih slovarja.

3. Problemi trenutnega SSSS
SSSS je pripravljen strojno in je trenutno na voljo v prvi, nepregledani različici, v kateri so kot iztočnice in sopomenke navedene leme (brez besednih vrst), pomensko členitev in opis začasno nadomeščajo strojno pripravljene pomenske gruče, slovar pa tudi ne vsebuje slovarskih oznak, razen področnih.
Navedene značilnosti imajo več posledic. Na eni strani se strojno pripravljene iztočnice in sopomenski kandidati pojavljajo brez oznak ali opozoril tudi pri izrazito problematičnih primerih, kot je npr. iztočnica buzi s sopomenkami peder, buzerant, toplovodar, homič, poženščen moški. Na drugi strani je problem potencialno zavajajoča (ne)zastopanost sopomenskega gradiva, npr. vse sopomenke, ki jih najdemo pri iztočnici zmaj – ksantipa, vešča, strupenjača, babura, coprnica, pošast, kričava ženska – so vezane na ženski spol in imajo izrazito negativno konotacijo, čeprav se beseda rabi tudi za moške in (v drugem pomenu) tudi s pozitivno konotacijo.
Tudi kolokacije in zgledi, ki so namenjeni primerjavi rabe dveh sopomenk, so iz referenčnega korpusa izvoženi strojno in so v slovarju brez oznak. Posledica je lahko sopostavitev pomensko neustreznih podatkov, npr. pri primerjavi besed ženska – kura najdemo prekrivne kolokacije [stara, prava, gola] ženska in [stara, prava, gola] kura ali ženska [brez glave, v postelji, na odru] in kura [brez glave, v postelji, na odru]. Korpusni zgledi načeloma pomagajo razdvoumiti problematične primere, vendar niso na voljo za vse primerjane besede, zgledi, ki so na voljo, pa niso izbrani po vsebinskih kriterijih. To je zlasti problematično pri sovražnem besedišču, npr. kolokacije [sovražiti, tepsti, ubiti] pedra ali zgledi tipa In reskiral sem celo, da bi me imel za pedra.
Pri uporabniško predlaganih sopomenkah ločujemo na eni strani zlonamerne vnose, kot je npr. uporabniški vpis aljaz pri iztočnici gej. Za takšne primere bi bilo treba določiti natančno uredniško politiko za sprotno obravnavo na ravni vmesnika. Na drugi strani uporabniki zaznamovano besedišče dodajajo kot dejanski sopomenski predlog, npr. pri iztočnici južnjak, kjer so uporabniki dodali dolg niz predlogov, mdr. jugovič, južni brat, jugič, trenirkar, bosanec, z juga. Uredniška naloga je presoditi, kateri predlogi so relevantni za vključitev v slovarsko bazo (in s katerimi oznakami), že uporabnikom pa omogočiti, da problematično besedišče označijo kot tako, da se torej oznaka v vmesniku prikaže istočasno kot dodana sopomenka.
Besedišče, ki je problematično na ravni same leme, je mogoče označiti že v obstoječi različici slovarja. Primeri, pri katerih je oznaka vezana na posamezen pomen besede ali specifičen kontekst rabe, pa zahtevajo predhodno pomensko členitev ter z njo povezan slovaropisni pregled kolokacij in zgledov rabe.
V projektu Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL bomo uresničili dva cilja: (a) identificirali besedišče, ki je problematično na ravni leme, in ga označili po celotnem slovarju SSSS ter (b) dodali v slovarski vmesnik možnost, da uporabniki sami označijo svoje predloge. V nadaljevanju natančneje pojasnjujemo, kako.

4. Identifikacija problematičnega besedišča

4.1. Slovaropisna izhodišča in sistem oznak
Prepoznavanje potencialnega, z vidika družbene občutljivosti problematičnega besedišča temelji na slovaropisnih izhodiščih, ki smo jih pred nekaj leti pripravili za slovarske vire na CJVT UL, prvič pa začeli uporabljati pri izdelavi Velikega slovensko-madžarskega slovarja (Kosem et al., 2018a). V izhodišča je vključeno prepoznavanje elementov sovražnega govora (oznaka sovražno), elementov nevljudnosti, žaljivosti (grobo) ter elementov negativnega vrednotenja ali konotacije (izraža negativen odnos). Omenjene oznake sodijo v širši okvir t. i. sporočanjskih oznak,2 ki opredeljujejo izraze ali pomene z vidika njihove rabe v sporočanjskem procesu in v situacijah, v katerih sporočanje poteka. V predlaganem slovaropisnem opisu so sporočanjske oznake namenjene označevanju izrazov, z izbiro katerih govorci dosegamo ali želimo doseči določen učinek pri naslovniku. Ta učinek je lahko povzročen s pozitivnim ali negativnim
vrednotenjem, z uporabo v določenem govornem položaju (npr. javnem, nejavnem) ali z namenom izraziti odnos do predmetnosti ali vsebine, ki temelji na določenih družbenih normah, pričakovanjih in odstopanjih od njih. Ta sistem se od tradicionalnega označevanja besed na podlagi odnosa do knjižne norme, kot ga pozna SSKJ (t. i. stilno-zvrstni in ekspresivni kvalifikatorji), ločuje v kvalificiranju besedišča na podlagi sporočanjskega namena oz. učinka, pri čemer izhodišče kvalificiranja ni v opozarjanju na odstop od knjižne norme, pač pa v informiranju glede možnih posledic rabe. S takim sistemom se želimo izogniti morebitnemu kvalificiranju govorca samega, hkrati pa opozoriti na kontekst potencialno problematične rabe v informativnem smislu. To pomeni, da ne želimo uporabnikov slovarja obveščati samo o možnih učinkih rabe grobega in sovražnega besedišča, pač pa pokazati tudi na okoliščine, v katerih je tako rabo mogoče prepoznati.

2 Celotni sistem označevanja, ki ga razvijamo v okviru virov CJVT UL, poleg sporočanjskih oznak, ki jih notranje členimo na vrednotenjske, registrske in stilne, zajema še nabor pragmatičnih, kontekstualnih, področnih, slovničnih, časovnih in trendovskih oznak ter nabor oznak, vezanih na tuja poimenovanja in prevodne ustreznice.

V slovarskem sistemu oznak označujemo z oznako sovražno izraze in pomene, ki so diskriminatorni, ksenofobični, rasistični in homofobični, ki so uperjeni proti predstavnikom skupin ali manjšin na podlagi njihove narodnosti, rase ali etničnega porekla, verskega prepričanja, spola, zdravstvenega stanja, spolne usmerjenosti, invalidnosti, gmotnega stanja, izobrazbe, družbenega položaja ter drugih lastnosti in prepričanj. Z oznako sovražno se torej opredeljujemo do vseh izrazov, ki spodbujajo sovraštvo, predsodke ali nestrpnost in s tem lahko predstavljajo – kot je bilo opredeljeno že v razdelku 2 – elemente sovražnega govora.
Na drugi strani z oznako grobo označujemo izraze ali pomene, ki so za naslovnika lahko žaljivi, z vidika družbenih in moralnih norm pa neprimerni. Tipično se nanašajo na človeško ali živalsko telo, spolnost, prehranjevanje in izločanje – zlasti torej na tabuizirano predmetnost.
Tretji sklop predstavlja besedišče, ki izraža neodobravanje, nenaklonjenost, posmehljivost ali kritiko do lastnosti posameznikov, predmetov ali dejanj. Z oznako izraža negativen odnos želimo tako opozoriti na izraze z izrazito negativno konotacijo ali vrednotenjem, ki so lahko za naslovnika žaljivi ali neprijetni.

4.2. Ročni pregled gradiva
Potencialno problematično besedišče v SSSS smo identificirali z ročnim pregledom iztočnic in sopomenk v slovarju. Na projektu smo se omejili na slovarske (jedrne in bližnje) sopomenke, saj pregled uporabniških predlogov zahteva dodatne uredniške premisleke in bo zato opravljen kasneje s prilagojeno metodologijo. Zaradi obilja gradiva smo delo organizirali v dva koraka: širši pregled, v katerem smo v grobem ločili potencialno problematično in neproblematično gradivo, nato pa natančnejši pregled problematičnih primerov.
Najprej smo iz slovarske baze izvozili nize sopomenk, urejenih na podlagi pomenskih gruč (Krek et al., 2017), npr. speljati se; izginiti; pobrati se; skidati se; spokati se; spizditi, pri čemer smo odstranili nize, ki so se glede nabora sopomenk podvajali, in tiste, ki so bili podmnožica kakega drugega niza. Na tak način smo pripravili 65.615 nizov različne dolžine: od posameznih sopomenskih parov do zelo dolgih nizov, ki pa so redki: več kot 30 sopomenk vsebuje le 156 nizov, povprečje je 5 sopomenk na niz.
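Opisano odstranjevanje podvojenih nizov in nizov, ki so podmnožica kakega drugega niza, je mogoče ponazoriti s kratko skico v Pythonu; skica je zgolj ilustrativna in ni del dejanskega delotoka:

    # Obdržimo samo nize sopomenk, ki niso podvojeni in niso podmnožica
    # kakega drugega niza (ilustrativna skica, zahtevnost O(n^2)).
    def filtriraj_nize(nizi):
        mnozice = {frozenset(n) for n in nizi}          # odstrani dvojnike
        return [sorted(m) for m in mnozice
                if not any(m < d for d in mnozice)]     # m < d: prava podmnožica

    nizi = [["speljati se", "izginiti", "pobrati se"],
            ["izginiti", "speljati se"],                # podmnožica -> izpade
            ["pobrati se", "izginiti", "speljati se"]]  # dvojnik -> izpade
    print(filtriraj_nize(nizi))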
Čeprav strojno pomensko gručenje ni povsem natančno in se razlikuje od slovaropisne pomenske členitve, tovrstna organizacija podatkov dobro naslovi dva pomembna problema: (a) tak pristop bistveno pohitri pregledovanje, kot bo razvidno v nadaljevanju; (b) presojanje je lahko bolj natančno, saj problematičnost posamezne leme nakazujejo ostale besede v nizu, prim. npr. nategniti v nizu raztegniti; dilatirati; iztegniti; nategniti; pogrniti; razgrniti; razmakniti; razpreti; razprostreti; razviti; napeti; zavlačevati z; razpeti; prolongirati in v nizu pokavsati; nategniti; povaljati; porivati; pofukati; pojahati.
Iz množice 65.615 nizov smo najprej umaknili 24.945 nizov (38,0 %), pri katerih sopomenke vsebujejo področne oznake, npr. odbojnik, deflektor, ločilnik, membrana, opna, odbojna pregrada, zvočna stena z oznako elektrika (ker so ti podatki terminološke narave, smo predvidevali zanemarljivo nizko vsebnost problematičnega besedišča in smo jih pustili za hiter pregled ob koncu naloge); ostalo je 496 nizov (0,8 %), ki vsebujejo lastnoimenske samostalnike, npr. Antarktika, antarktično območje, južno polarno območje, in 40.176 (61,2 %) občnoimenskih nizov, vsi relevantni za ročni pregled.
Podatke so pregledovali študentke in študenti jezikoslovnih smeri, in sicer po trije vzporedno. Pregledovanje je potekalo v okolju Google Sheets. Sopomenske nize smo organizirali v vrstice tabele, kjer jim je bilo mogoče pripisati eno od naslednjih odločitev: (1) niz vsebuje sovražno ali grobo besedišče; (2) niz vsebuje besedišče, ki je drugače negativno ali (v določenem pomenu, kontekstu) izraža negativen odnos; (3) z vidika sovražnosti, grobosti, negativnosti je niz neproblematičen. Če so pregledovalci želeli, so lahko opredelili tudi, da je (4) v nizu kako drugače zaznamovano besedišče ali da (5) ne razumejo vseh besed v nizu, lahko pa so vpisali tudi dodaten komentar na svoje odločitve ali podatke.
Kljub ogromni količini podatkov je bila tako oblikovana naloga izvedljiva v relativno kratkem času, saj so študentje lahko odločitev podali takoj, ko so v nizu našli eno samo problematično besedo, natančnejše razmisleke o vrsti zaznamovanosti oz. označevanja posameznih besed pa so prepustili za drugi korak dela s podatki.
družbenih in moralnih norm pa neprimerni. Tipično se
podatke.
nanašajo na človeško ali živalsko telo, spolnost,
Kljub ogromni količini podatkov je bila tako
prehranjevanje in izločanje – zlasti torej na tabuizirano
oblikovana naloga izvedljiva v relativno kratkem času, saj
predmetnost.
so študentje lahko odločitev podali takoj, ko so v nizu
Tretji
sklop
predstavlja
besedišče,
ki
izraža
našli eno samo problematično besedo, natančnejše
neodobravanje, nenaklonjenost, posmehljivost ali kritiko
razmisleke o vrsti zaznamovanosti oz. označevanja
do lastnosti posameznikov, predmetov ali dejanj. Z oznako
posameznih besed pa so prepustili za drugi korak dela s
izraža negativen odnos želimo tako opozoriti na izraze z
podatki.
izrazito negativno konotacijo ali vrednotenjem, ki so lahko
za naslovnika žaljivi ali neprijetni.
4.3. Rezultati ročnega pregleda
Študentske odločitve smo pretvorili v končne odločitve
4.2. Ročni pregled gradiva
po naslednjem ključu: (1) sovražno/grobo: če je vsaj eden
Potencialno problematično besedišče v SSSS smo
od študentov presodil, da se v nizu pojavlja sovražno ali
identificirali z ročnim pregledom iztočnic in sopomenk v
grobo besedišče; (2) drugače negativno: kombinacije
slovarju. Na projektu smo se omejili na slovarske (jedrne
odločitev “druga negativnost” in “neproblematično” ali (3)
in bližnje) sopomenke, saj pregled uporabniških predlogov
neproblematično: če so vsi študenti presodili, da je z
zahteva dodatne uredniške premisleke in bo zato opravljen
vidika
sovražnosti,
grobosti,
negativnosti
niz
kasneje s prilagojeno metodologijo. Zaradi obilja gradiva
neproblematičen. Rezultate prikazuje Tabela 1.
smo delo organizirali v dva koraka: širši pregled, v
katerem smo v grobem ločili potencialno problematično in
Kategorija končne
Število nizov v Delež glede na
neproblematično gradivo, nato pa natančnejši pregled
odločitve
kategoriji vse pregledano
problematičnih primerov.
Najprej smo iz slovarske baze izvozili nize sopomenk,
Sovražno/grobo
1.810
4,5 %
urejenih na podlagi pomenskih gruč (Krek et al., 2017),
Drugače negativno
12.730
31,3 %
npr. speljati se; izginiti; pobrati se; skidati se; spokati se;
Neproblematično
26.132
64,3 %
spizditi, pri čemer smo odstranili nize, ki so se glede
nabora sopomenk podvajali, in tiste, ki so bili podmnožica
Skupaj
40.672
100,0 %
kakega drugega niza. Na tak način smo pripravili 65.615
nizov različne dolžine: od posameznih sopomenskih parov
Tabela 1: Številčna zastopanost in delež nizov glede na
do zelo dolgih nizov, ki pa so redki: več kot 30 sopomenk
končno odločitev glede potencialne problematičnosti.
vsebuje le 156 nizov, povprečje je 5 sopomenk na niz.
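Opisani ključ za pretvorbo treh vzporednih odločitev v končno odločitev je mogoče strnjeno ponazoriti z naslednjo skico v Pythonu (oznake 1, 2 in 3 ustrezajo zgornjim kategorijam; skica je zgolj ilustrativna):

    # Pravilo združevanja treh vzporednih odločitev: 1 = sovražno/grobo,
    # 2 = drugače negativno, 3 = neproblematično.
    def koncna_odlocitev(odlocitve):
        if 1 in odlocitve:                   # vsaj en pregledovalec: sovražno/grobo
            return 1
        if all(o == 3 for o in odlocitve):   # soglasno neproblematično
            return 3
        return 2                             # kombinacije 2 in 3: drugače negativno

    assert koncna_odlocitev([2, 1, 1]) == 1  # prim. 211 -> 1 v Tabeli 2
    assert koncna_odlocitev([2, 2, 3]) == 2  # prim. 223 -> 2
    assert koncna_odlocitev([3, 3, 3]) == 3  # prim. 333 -> 3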
V Tabeli 2 navajamo nekaj nizov s po tremi sopomenkami, ki so jim študentke in študenti pripisali skladne ali različne odločitve. Kot je razvidno, lahko posamezen niz vsebuje raznoliko zaznamovano besedišče,
kot tudi nezaznamovano besedišče. Tabela obenem ponazarja gradivo, ki bo deležno celovite in natančnejše obravnave v zaključnem delu projekta Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL (odločitev 1), podatke, ki so na tak ali drugačen način relevantni za nadaljnje delo (odločitev 2), in gradivo, ki ga z vidika negativne zaznamovanosti ne bomo nadalje obravnavali (odločitev 3).

Niz sopomenk | Študentske in končna odločitev
fukati; porivati; natepavati | 111 -> 1
skozlati; izbruhati; zbruhati | 111 -> 1
pedrski; buzerantski; toplovodarski | 111 -> 1
črnuhinja; zamorka; zamorklja | 111 -> 1
pofukanka; prasica; zajebanka | 111 -> 1
debilen; bebast; duševno zaostal | 121 -> 1
kripelj; pohabljenec; pohabljenka | 211 -> 1
kurnik; pajzelj; temačna luknja | 222 -> 2
bedastoča; glupost; nesmisel | 222 -> 2
eliminirati; likvidirati; usmrtiti | 222 -> 2
izmozgano; izčrpano; mršavo | 223 -> 2
imenski; nazivni; nominalni | 333 -> 3
kopirni papir; indigo; karbon | 333 -> 3
zaustaviti se; izklopiti se; izključiti se | 333 -> 3

Tabela 2: Primeri nizov s študentskimi odločitvami in končno odločitvijo o nadaljnji obravnavi.

V sodelovanju s študenti bomo v 1.810 nizih z odločitvijo (1) določili besede in zveze, ki so relevantne za slovarsko označevanje. Slednje bo potekalo ob upoštevanju pojavljanja oz. rabe identificiranega besedišča v raznovrstnih kontekstih, s čimer bomo željo po pohitritvi postopka prve selekcije ustrezno nadzorovali in obranili pred črno-belim presojanjem primernosti. Za ponazoritev navajamo nekaj primerov, ki so na seznamu za presojo:
● sovražno: črnuh, črnuhinja, zamorklja, hlapčevski črnec, rdečuh, rdečuhinja, beli prasec, bela prasica, lezba, lezbača, peder, buzerant, pička, prasica, kripelj;
● grobo: podjebavati, v kurcu, zdrkati, nabrisati, pokavsati, nategniti v rit, pofafati ga, sranje, poscan, fentati, crkniti, razpizden, sfukan, kurbarija, joški.
V SSSS želimo poleg sovražnega in grobega označiti tudi besedišče, ki izraža negativen odnos. To se najde predvsem v nizih z odločitvijo (2), mestoma pa tudi v (1). Kot problematične so študentje prepoznavali tako izraze (npr. budala, avša, bedast) kot potencialno problematične pomene besed (npr. nataknjen, zabit, nasekati). Prve je mogoče označiti že v trenutni različici slovarja, saj je njihova problematičnost vezana na lemo ne glede na morebitno večpomenskost. V drugem primeru bi oznaka morala biti pripisana pomenu, zato bo označevanje možno šele, ko bo slovar vseboval pomenske členitve. Primeri besedišča, ki ga je mogoče označiti na ravni izraza:
● izraža negativen odnos: trapa, bebav, počasne pameti, lolek, kozlarija, zarukan, špeglarca, luftar, blefer, snobovski, drhal, težakinja, mlatenje prazne slame, avša, otročaj.
V "drugače negativno" so raznorodni primeri, saj so poleg zaznamovanih izrazov in pomenov študentje označevali tudi besedišče, ki poimenuje negativne vsebine in predmetnost. Gre zlasti za poimenovanja agresivnega obnašanja: uničiti, dotolči, nekaterih osebnih lastnosti: pokvarjen, hudoben, ničvreden, grozljiv, grd, apatičnost, pokvarjenost, ter videza in stanja: neurejenost, razdejanje, zanikrnost itd. V slovarju večina teh besed ne potrebuje oznake. Čeprav besed ne bomo označevali, so seznami tovrstnega besedišča pomemben rezultat ročnega pregleda, saj so koristni za različne druge namene na področju slovaropisja in strojne obdelave jezika, npr. za filtriranje gradiva z negativnim pomenom iz jezikovnih iger ali učnih gradiv, strojno pripisovanje sentimenta ipd.

5. Vizualizacija v vmesniku SSSS 2.0
V slovarskem vmesniku SSSS 2.0 bomo na besedišče, o katerem razpravljamo tu, opozorili s kombinacijo opozorilne ikone in daljšega pojasnila, ki se bo izpisalo ob kliku nanjo. Trenutno rešitev, ki jo bomo po potrebi še nadgradili, kaže Tabela 3. Namenoma smo se odrekli pripisovanju (eno-)besednih oznak, saj bi te pri označevanju (mestoma tudi homonimnih) lem lahko vodile v napačno interpretacijo podatkov. Pri pomensko členjenih geslih bodo oznake seveda pripisane posameznim pomenom, pri pomensko nečlenjenih geslih pa bo kombinacija ikone in pojasnila omogočila, da je problematično besedišče na prvi pogled zelo opazno, pojasnilo pa je lahko daljše in vsebuje informacije o možnem učinku na naslovnika oz. možnih posledicah rabe označene besede.

Oznaka | Ikona | Pojasnilo
Sovražno | [opozorilna ikona] | Z uporabo besede lahko izražamo sovražni, nestrpni odnos do posameznika ali družbene skupine.
Grobo | [opozorilna ikona] | Zaradi družbenih in moralnih norm se marsikateremu uporabniku jezika beseda lahko zdi groba ali neprimerna. Uporaba lahko povzroči nelagodje, razburi ali užali.
Izraža negativen odnos | [opozorilna ikona] | Beseda lahko ni nevtralna. Z uporabo besede se lahko posmehujemo, izražamo neodobravanje ali kritiko do nekaterih lastnosti posameznikov, predmetov ali dejanj.

Tabela 3: Predvidene ikone in izhodiščna različica pojasnil za označevanje besedišča v SSSS.

Slika 1 kaže oblikovalski predlog vmesnika SSSS 2.0, kakršen je na voljo v času priprave prispevka. Slovarske informacije na sliki so provizorične. Slika ponazarja, kakšna bo vizualizacija pri bližnjih in jedrnih (pri depra in deprimiranost) ter pri uporabniških sopomenkah (pri sopomenka). Razvidne so tudi nekatere druge novosti, npr. delitev uporabniških sopomenk glede na slovarsko verzijo, v kateri so bile predlagane, ter možnost dodajanja slovarskih oznak ob predlagane sopomenke.
Slika 1: Oblikovalski predlog vmesnika SSSS 2.0 (vsebina je provizorična).

6. Uporabniško dodajanje oznak
Nedavno izvedena raziskava o odnosu uporabniške skupnosti do SSSS, v kateri je sodelovalo 671 anketirancev, je pokazala naklonjenost do večine novosti, ki jih prinaša slovar, npr. stalno posodabljanje, strojni postopki, digitalni format, kolokacijski podatki, povezave na korpus, uporabniško vključevanje (Arhar Holdt, 2020: 470). Med problematičnimi značilnostmi sta bili izpostavljeni nezanesljivost (strojno pridobljenih) podatkov in primanjkljaj slovarskih oznak tako pri jedrnih in bližnjih sopomenkah kot pri uporabniško dodanih. To, da ni oznak, je motilo 37 % sodelujočih (ibid.: 472).
V trenutnem slovarskem vmesniku nekateri uporabniki in uporabnice težavo rešujejo tako, da oznako ali kako drugo pojasnilo v oklepaju pripišejo ob svoj sopomenski predlog, npr. babica – nona (lokalno), bojazljivec – pezde (vulg.), Italijanka – makaronarka (slabš.). Kot omenjeno v poglavju 3, pa večina predlaganih sopomenk oznake nima.
Skladno z uporabniškimi željami in potrebami želimo nadgraditi protokol dodajanja sopomenk, da bodo predlagani besedi ali zvezi uporabnice in uporabniki lahko dodali tudi slovarsko oznako oz. oznake. Privzeta izbira bo, da je predlog "brez oznake", ostale možnosti bodo na voljo v spustnem meniju (Slika 1). V različici SSSS 2.0 bodo na klik na voljo oznake sovražno, grobo in izraža negativen odnos, poleg tega pa bomo ponudili okence, v katerega bo mogoče vtipkati morebitno drugo oznako.
Pomen in raba oznak sovražno, grobo in izraža negativen odnos bo razložena in ponazorjena s primeri, s čimer bo lahko dosežena določena stopnja enotnosti uporabniškega označevanja (informacije bodo na voljo na klik, gl. ikono (i) na Sliki 1). Predvideno pa je, da bodo uporabniki oznake mestoma interpretirali in uporabljali drugače, kot bi jih slovaropisci. Vse dodane oznake bodo (skupaj z dodanimi sopomenkami) preverjene in označene sopomenke bodo dragoceno gradivo ne le za dopolnitev odprto dostopne slovarske baze sopomenk, ampak tudi za analize širšega dojemanja označevalnega sistema ter dometa in meja oznak. Prav tako pomemben uvid bodo ponudile ročno vpisane oznake, ki jih bomo analizirali z vidika vsebine in pogostosti ter uporabili izsledke za nadaljnji razvoj slovarja.

7. Sklep in nadaljnje delo
Sodobno slovaropisno delo ima ob zavedanju ideološkosti, vključevanju novih pristopov, uporabi tehnologije, moči množic itd. danes veliko možnosti, da tudi vprašanja označevanja konotacije naslavlja na novo in zanje pripravlja inovativne rešitve (Gorjanc, 2017: 154).
V prispevku smo opisali, kako poteka obravnava sovražnega in grobega besedišča v SSSS in katere spremembe so v načrtu za različico 2.0, ki bo objavljena jeseni 2022. Rešitve naslavljajo dve pomembni značilnosti SSSS: njegovo strojno izdelanost in odprtost, da pri razvoju slovarja sodeluje tudi uporabniška skupnost. V novi različici slovarja bodo sovražnemu in grobemu besedišču pripisane slovarske oznake oz. opozorilne ikone s pojasnili o možnih posledicah rabe, dodana pa bo tudi možnost, da uporabniki pripišejo oznako svojim predlogom sopomenk.
Ker vse težave trenutnega SSSS niso enostavno in hitro rešljive, želimo slovarske uporabnike bolje opozoriti na trenutne omejitve. Čeprav je metodologija priprave SSSS pojasnjena v razdelku O viru, pri samih iztočnicah ni izrecnih opozoril, da je slovar pripravljen strojno, in to na vseh ravneh: sopomenke, kolokacije, korpusni zgledi, kar lahko vodi v napačne interpretacije slovarske vsebine. V naslednji različici SSSS želimo zato uvesti indikator stopnje gesla3 in dodati v predstavitev slovarja opozorila o dometu in posledicah metodologije ter razlago korakov, po katerih se slovar razvija.

3 Po zgledu Kolokacijskega slovarja sodobne slovenščine (KSSS), ki z ikono petstopenjske piramide uporabniku na jasen in ekspliciten način posreduje informacijo o razvoju ter različnih stopnjah izdelanosti slovarskih gesel (Kosem et al., 2018b).

Prepoznano sovražno in grobo besedišče bo koristno tudi pri izdelavi drugih virov, kjer se za pomene izbirajo reprezentativne kolokacije in zgledi. Pri izdelavi novih gesel za Kolokacijski slovar sodobne slovenščine (Kosem et al., 2018b) npr. že zdaj pri pripravi podatkov (pred slovaropisno analizo) označujemo kolokacije, ki vsebujejo sovražno in grobo besedišče, pa tudi besedišče, ki izraža negativen odnos. Tako slovaropiske in slovaropisce opozorimo na potencialno problematične kolokacije in posledično pohitrimo delo oz. se izognemo vključevanju problematičnih vsebin. Seznami problematičnega besedišča, ki jih uporabljamo trenutno, so pripravljeni ad hoc iz odprto dostopnih jezikovnih virov in precej krajši od seznamov, ki bodo (lahko) nastali na osnovi predstavljenega dela.
Kot smo poudarili v prispevku, je izražanje negativnega odnosa večkrat vezano na posamezen pomen besede, zato bo velik del naloge izvedljiv šele ob pripravi pomensko členjenih gesel. Pri pomenski členitvi in nadaljnjem označevanju gradiva SSSS bomo uporabili metodologijo, ki jo razvijamo pri izdelavi Velikega slovensko-madžarskega slovarja (Kosem et al., 2018a), in podatke oz. informacije, ki so na voljo v obstoječih odprto dostopnih virih za slovenščino. Preizkus prenosa metodologije bomo izvedli že pod okriljem projekta Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL, kjer je med cilji tudi nadgradnja SSSS z 2.000 pomensko členjenimi gesli, ki bodo imela slovaropisno pregledane in razvrščene sopomenke, kolokacije ter korpusne zglede.
V nadaljnje premisleke glede sovražnega in grobega besedišča znotraj koncepta odzivnega slovarja bi bilo smiselno celoviteje vključiti vidike okoliščin rabe. Zanimivo bi bilo denimo obravnavati zaznavanje in presojanje sovražnosti, grobosti v različnih tipih besedil, npr. medijskih. Ob tem se odpira tudi vprašanje formalnosti in neformalnosti položajev, na katere se ta presoja nanaša: ali posega na vse ravni izražanja ali gre zgolj za formalne, javne položaje in ali je neodvisna od generacijske ali kake druge pripadnosti presojevalca.

8. Zahvala
Projekt Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL v letih 2021–2022 financira Ministrstvo za kulturo Republike Slovenije. Raziskovalna programa št. P6-0411 (Jezikovni viri in tehnologije za slovenski jezik) in št. P6-0215 (Slovenski jezik – bazične, kontrastivne in aplikativne raziskave) sofinancira Javna agencija za raziskovalno dejavnost Republike Slovenije iz državnega proračuna.

9. Literatura
Špela Arhar Holdt. 2020. How Users Responded to a Responsive Dictionary: The Case of the Thesaurus of Modern Slovene. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 46(2): 465–482. https://doi.org/10.31724/rihjj.46.2.1
Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Apolonija Gantar, Vojko Gorjanc, Bojan Klemenc, Iztok Kosem, Simon Krek, Cyprian Laskowski in Marko Robnik Šikonja. 2018. Thesaurus of Modern Slovene: By the Community for the Community. V: J. Čibej, V. Gorjanc, I. Kosem in S. Krek, ur., Proceedings of the 18th Euralex International Congress: Lexicography in Global Contexts, str. 401–410. Znanstvena založba Filozofske fakultete, Ljubljana. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/118/211/3000-1
Polona Gantar. 2015. Leksikografski opis slovenščine v digitalnem okolju. Znanstvena založba Filozofske fakultete, Ljubljana. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/download/62/138/2602-1?inline=1
Vojko Gorjanc. 2005. Neposredno in posredno žaljiv govor v jezikovnih priročnikih: diskurz slovarjev slovenskega jezika. Družboslovne razprave, 21(48): 197–209.
Vojko Gorjanc. 2017. Nije rečnik za seljaka. Biblioteka XX vek, Beograd.
Geoffrey Hughes. 2011. Political Correctness: A History of Semantics and Culture. Wiley-Blackwell, MA.
Boris Kern. 2015. Politična korektnost v slovaropisju. V: D. Zuljan Kumar in H. Dobrovoljc, ur., Zbornik prispevkov s simpozija 2013, str. 144–154, Založba Univerze, Nova Gorica.
Iztok Kosem, Júlia Čeh Bálint, Vojko Gorjanc, Anna Kolláth, Attila Kovács, Simon Krek, Sonja Novak-Lukanovič in Jutka Rudaš. 2018a. Osnutek koncepta novega velikega slovensko-madžarskega slovarja. Univerza v Ljubljani, Filozofska fakulteta, Ljubljana. https://www.cjvt.si/komass/wp-content/uploads/sites/17/2020/08/Osnutek-koncepta-VSMS-v1-1.pdf
Iztok Kosem, Simon Krek, Polona Gantar, Špela Arhar Holdt, Jaka Čibej in Cyprian Laskowski. 2018b. Kolokacijski slovar sodobne slovenščine. V: D. Fišer in A. Pančur, ur., Jezikovne tehnologije in digitalna humanistika. Znanstvena založba Filozofske fakultete, Ljubljana. http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Kosem-et-al_Kolokacijski-slovar-sodobne-slovenscine.pdf
Simon Krek, Cyprian Laskowski in Marko Robnik Šikonja. 2017. From translation equivalents to synonyms: creation of a Slovene thesaurus using word co-occurrence network analysis. V: I. Kosem et al., ur., Proceedings of eLex 2017: Lexicography from Scratch, str. 93–109, Leiden, Netherlands. https://elex.link/elex2017/wp-content/uploads/2017/09/paper05.pdf
Simon Krek, Cyprian Laskowski, Marko Robnik Šikonja, Iztok Kosem, Špela Arhar Holdt, Polona Gantar, Jaka Čibej, Vojko Gorjanc, Bojan Klemenc in Kaja Dobrovoljc. 2018. Thesaurus of Modern Slovene 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1166
Nataša Logar, Nina Perger, Vojko Gorjanc, Monika Kalin Golob, Neža Kogovšek Šalamon in Iztok Kosem. 2020.
Raba slovarjev v slovenski sodni praksi. Teorija in praksa, 57: 89–108. https://www.fdv.uni-lj.si/docs/default-source/tip/tip_pos_2020_logar_idr.pdf?sfvrsn=0
Rosamund Moon. 2014. Meanings, Ideologies, and Learners' Dictionaries. V: A. Abel et al., ur., Proceedings of the XVI EURALEX International Congress: The User in Focus, str. 85–105, Bolzano/Bozen. Institute for Specialised Communication and Multilingualism. https://euralex.org/elx_proceedings/Euralex2014/euralex_2014_004_p_85.pdf
Andrej Motl in Veronika Bajt. 2016. Sovražni govor v Republiki Sloveniji: Pregled stanja. Mirovni inštitut, Ljubljana. https://dlib.si/stream/URN:NBN:SI:DOC-F2YZP2RB/c117f4c6-8fe9-437d-8c64-5b7987a856b6/PDF
Brankica Petković in Neža Kogovšek Šalamon. 2007. O diskriminaciji: Priročnik za novinarje in novinarke. Mirovni inštitut, Ljubljana. https://www.mirovni-institut.si/wp-content/uploads/2014/08/Prirocnik-o-diskriminaciji-final-all.pdf
Dušan Rebolj. 2008. Uporabnejša opredelitev politične korektnosti. V: S. Autor in R. Kuhar, ur., Politična (ne)korektnost, str. 4–15. Mirovni inštitut, Ljubljana. https://www.mirovni-institut.si/wp-content/uploads/2014/08/nestrpnost-6.pdf
SSKJ. 2014. Slovar slovenskega knjižnega jezika: Uvod. Druga, dopolnjena in deloma prenovljena izdaja. Inštitut za slovenski jezik Frana Ramovša ZRC SAZU, Ljubljana. https://fran.si/130/sskj-slovar-slovenskega-knjiznega-jezika
Ada Vidovič Muha. 2013. Moč in nemoč knjižnega jezika. Znanstvena založba Filozofske fakultete, Ljubljana.
Izdelava in analiza digitalizirane zbirke paremioloških enot
Saša Babič*, Tomaž Erjavec†
* Inštitut za slovensko narodopisje ZRC SAZU
Novi trg 2, 1000 Ljubljana
sasa.babic@zrc-sazu.si
† Odsek za tehnologije znanja, Institut »Jožef Stefan«
Jamova cesta 39, 1000 Ljubljana
tomaz.erjavec@ijs.si
Povzetek
Članek obravnava digitaliziranje zbirke slovenskih pregovorov Inštituta za slovensko narodopisje ZRC SAZU. Zbirka je nastajala od leta 1947 dalje, digitalizacija pa se je začela v samem začetku 21. stoletja z iniciativo Marije Stanonik. V predstavljenem delu smo izhajali iz Excel razpredelnic paremioloških enot in virov, iz katerih smo najprej izločili neustrezne enote in neuporabljene vire. Nato smo tabeli pretvorili v zapis TEI in pregovore avtomatsko jezikoslovno označili. Tu so bile besede posodobljene, lematizirane, oblikoskladenjsko označene, povedi pa skladenjsko razčlenjene po formalizmu Universal Dependencies. Kanonični zapis TEI smo pretvorili v več izvedenih formatov in zbirko objavili pod odprto licenco na repozitoriju CLARIN.SI, kjer jo je mogoče prevzeti, in na konkordančnikih CLARIN.SI, ki so primerni za jezikoslovne analize zbirke. V članku orišemo tudi način iskanja po zbirki v konkordančnikih, ki omogočajo temeljitejšo etnolingvistično in semiotično raziskavo.
Creation and analysis of a digitised collection of Slovenian paremiological units
The article discusses the digitization of the collection of Slovenian proverbs from the Institute of Slovenian Ethnography ZRC SAZU. The collection has been compiled since 1947, and its digitization began at the start of the 21st century on the initiative of Marija Stanonik. The departure point of the presented work were two Excel spreadsheets with paremiological units and their bibliographical sources, from which we removed inappropriate units and unused sources. The two spreadsheets were then converted to a TEI encoding, and the paremiological units automatically linguistically annotated: words were modernised, lemmatised, morphosyntactically annotated and the sentences syntactically parsed according to the Universal Dependencies formalism. We converted the canonical TEI encoding into several derived formats and published the collection under an open licence on the CLARIN.SI repository, where it can be downloaded, and on the CLARIN.SI concordancers, which allow for linguistic analyses of the collection. The paper also outlines searching the collection in the concordancers, which enables detailed ethnolinguistic and semiotic research.
1. Uvod
Jezik je ohranjevalec in nosilec kulture, s katerim človeštvo ustvarja in vključuje refleksije o samem sebi (Pitkin, 1972; Bartmiński, 2005; Tolstaja, 2015). Ena od najpogosteje rabljenih jezikovnih oblik so pregovori oz. paremiološke enote.
Paremiološke enote ali pregovori v širšem pomenu so eden od najkrajših žanrov slovstvene folklore; pregovore lahko opišemo kot relativno stalne povedi, ki jih uvrščamo med kratke folklorne obrazce. Pogosto so označeni z besednimi zvezami, kot »modrost ljudstva« (Mieder, 1993), »stara modrost« in »poezija vsakdanjega jezika« (Matičetov, 1956). V vsakem primeru lahko trdimo, da so pregovori »skrčeni moralno-etični obrazci določene skupnosti; so neke vrste tradicionalni stereotipi njenega samozavedanja in samoidentifikacije, bili so iz generacije v generacijo prenašani jezik vsakdanje kulture« (Kržišnik, 2008: 38). Prav zato velja, da so pregovori kratki stereotipi na sentenčni ravni s prenesenim ali generalizirajočim pomenom ter so načeloma splošno znani (Grzybek, 2012). Pregovori so kulturna besedila z velikim semantičnim potencialom (Grzybek, 2015), saj gre za »zaključene misli« (Mlacek, 1983: 131), vendar pa se ne razlikujejo le po besedilu, temveč tudi glede na teksturo in kontekst (Dundes, 1965). Zaradi prozodičnih značilnosti si jih je lažje zapomniti, dandanes pa zato ponujajo možnosti za nadaljnjo uporabo, na primer pri oglaševanju, sodobnem prenosu mnenj, grafitih ali modifikacijah v različnih medijih. Semiotična kompleksnost pregovorov in prepletenost med sintaktično (kratkost), pragmatično (prenašanje skozi različne generacije) in semantično (stereotipno, splošno znanje) razsežnostjo ponujajo raziskovanje pregovorov kot kulturnega znaka, ki ohranja zgodovino kulture oz. družbe, hkrati pa sprejema nove funkcije, ki širijo in porajajo nove kontekste. Prav zato so paremiološke enote oz. pregovori označeni za narodni zaklad, neprecenljivo modrost in dediščino prednikov, in ne preseneča, da so (bili) predmet sprotnega terenskega zapisovanja ali celo namenskega zbiranja (Arewa in Dundes, 1966; Stanonik, 2015) ter analiz rabe (Meterc, 2021).
Inštitut za slovensko narodopisje ZRC SAZU je sistematično gradil arhiv različnih žanrov slovstvene folklore, v sklopu katerega je nastajala tudi zbirka pregovorov. Ti so bili zabeleženi na kartotečnih listkih ali v tematskih arhivskih mapah. V začetku 21. stoletja se je pojavila potreba po digitalizaciji gradiva, ki bi omogočala lažje delo z gradivom.
Pri projektu Tradicionalne paremiološke enote v dialogu s sodobno rabo (2020–2023) smo predvideli združitev etnolingvističnih pristopov in semiotike z namenom diahronega vpogleda v družbo s pomočjo pregovorov. Da bi bila analiza temeljitejša, je pomemben del projekta pretvorba gradiva v sprejemljivo obliko za računalniško besedilno analizo.
V članku opišemo pripravo in jezikoslovno označevanje digitalizirane zbirke pregovorov, ki je sedaj dostopna na repozitoriju in konkordančnikih CLARIN.SI, ter uporabo digitalizirane zbirke v namene etnolingvistične obravnave paremioloških enot. Na koncu podamo zaključke in načrte za nadaljnje delo.

2. Priprava gradiva
Inštitut za slovensko narodopisje (ISN) ZRC SAZU v arhivu hrani folklorno gradivo v analogni obliki, tj. ročno napisano, natipkano ali natisnjeno na kartotečnih listkih, v arhivskih predalih in omarah. Težnja po digitalizaciji folklornega gradiva se je najprej začela pri pregovorih, za katere je Marija Stanonik že v letih 1997–1999 pridobila projekt Slovenski pregovori in rekla (Stanonik, 1996), v katerem je začela širiti arhivsko zbirko pregovorov na ISN. Z mislijo na digitalizacijo je nadaljevala v projektih Informatizacija neoprijemljive dediščine za etnologijo in folkloristiko (2005–2008) (Stanonik, 2004) in Slovenski pregovori kot kulturna dediščina: klasifikacija in redakcija korpusa (2010–2013) (Stanonik, 2009; Stanonik, 2015). Gradivo je bilo dodano k obstoječi zbirki v računalniškem prepisu, sprva v programu Word, pozneje v programu Excel, kar je predstavljalo temelj, na katerem smo lahko izvedli pretvorbo v druge digitalne formate.

2.1. Priprava gradiva v razpredelnicah
V urejanje smo dobili dve excelovi tabeli: prva je vsebovala 59.543 večinoma paremioloških enot, druga pa 2.742 virov teh enot. Tabeli sta bili povezani s kodo, ki je bila določena viru. Ob pregledu gradiva smo ugotovili, da precej enot ne spada v paremiološki nabor; te smo ročno izločili (uganke, dele folklornih pesmi, pozdrave, frazeme ipd.), pri pregledu virov pa smo ročno izločili vse tiste, ki niso bili navedeni ob paremioloških enotah. Poleg tega so nekatere paremiološke enote vsebovale širši kontekst, ki smo ga ročno izbrisali; tako smo dobili poenoteno obliko samostojnih paremioloških enot. Pri vremenskih paremioloških enotah se je pojavil problem pojasnjevanja svetniškega poimenovanja dnevov in praznikov: v originalnem zapisu (časopisi, koledarji, zvezki ipd.) so bili navedeni kot pojasnilo, npr. Če je na Velike maše dan [15. avgust] lepo vreme, potem bo ozimna pšenica lepa; Če na ta dan [Florijanovo, 4. maj] dež gre, potlej ga celo leto manjka. V excelovi tabeli, ki predstavlja del Inštitutskega arhiva, smo te pustili zabeležene v oglatem oklepaju.
Po ročnem urejanju smo Excel dokumente združili z orodjem OpenRefine1 in tako poenotili korektorske opombe in kategorije označevanja pregovorov. Osnovne popravke smo vnesli tudi pri preverjanju shematiziranih vnosov (npr. navajanje virov, odstranjevanje presledkov na koncu besedil v posameznih celicah ipd.). Sledil je prenos podatkov v delovno bazo SQLite2, kjer so potekali popravki preprostih slovničnih napak in zatipkov (velike začetnice, dvojni presledki, nepravilna raba ločil ipd.) ter zaznava uporabljenih črkovnih naborov, kjer gre izpostaviti nestandardizirane zapise dajnčice, metelčice, bohoričice in gajice. Pregovore so namreč začeli prepisovati v računalniško obliko že v začetku 21. stoletja, ko nabor črkovnih znakov še ni bil tako pester in so prepisovalci reševali zagate z različnimi zapisi z improviziranim izborom znakov. Po osnovnih popravkih paremioloških enot smo nadaljevali z iskanjem enakih oz. podvojenih enot in odstranjevali dvojnike, pri čemer smo vse vire dodali k eni paremiološki enoti. Ob koncu urejanja smo podatke izvozili v format TSV (tab-separated values), ki je bil izhodišče za izdelavo korpusa.

1 https://openrefine.org/
2 https://www.sqlite.org/

Gradivo je tako po ročnem in strojnem urejanju vsebovalo 36.349 relativno enotno urejenih paremioloških enot ter 2.515 virov.
Razpredelnica z viri vsebuje za vsak bibliografski vir njegov identifikator, identifikator z izvornega seznama virov, zaporedno številko vira (ki tudi združuje vire, ki spadajo v nadrejeno enoto), letnico izida (in letnico prvega izida, kjer se ta razlikuje), ime vira (avtor, naslov) ter kategorizacijo vira v 18 kategorij, npr. Leposlovje in literarjenje, Muzejske zbirke, Periodika – pratike in koledarji, Ustni viri itd.
Razpredelnica s paremiološkimi enotami vsebuje identifikator enote, zaporedno številko iz izvornega seznama enot, seznam identifikatorjev virov skupaj s številko strani, na kateri je enota v viru omenjena, diplomatično transkripcijo enote (torej zapis enote, kot se pojavi v viru) in kritično transkripcijo enote, ki enote, zapisane v bohoričici, transkribira v gajico. Tako ima npr. enota PREG-00-00001 zaporedno številko 1, seznam virov bib14.1: 202; bib23.1: 51; bib7.1: 524, diplomatično transkripcijo »Bres muje ſe zhreul ne obuje.« in kritično transkripcijo »Brez muje se čreul ne obuje.«

2.2. Zapis TEI
V naslednjem koraku smo podatke iz dokumentov TSV pretvorili v zapis, ki je bolj primeren tako za hrambo kot tudi za nadaljnje obdelave, in sicer XML s shemo po priporočilih iniciative za kodiranje besedil TEI (TEI Consortium, 2020). Celotna zbirka je bila formirana kot en TEI dokument (element <TEI>) s kolofonom (element <teiHeader>) in besedilnim delom (<text>).
Kolofon vsebuje bibliografske in druge metapodatke o zbirki, kot je npr. taksonomija kategorizacije virov. V opisu vira (<sourceDesc>) vsebuje tudi celoten seznam virov paremioloških enot; zapis je ilustriran v sliki 1.
Besedilni del vsebuje paremiološke enote, vsako s svojim identifikatorjem, diplomatičnim in kritičnim prepisom ter seznamom njenih virov; zapis ilustriramo v sliki 2.
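Pretvorbo posamezne vrstice TSV v zapis TEI je mogoče ponazoriti s kratko skico v Pythonu s knjižnico lxml; imena elementov in struktura sta ilustrativna in se lahko razlikujeta od dejanskega kodiranja v zbirki:

    # Ilustrativna skica: paremiološka enota iz vrstice TSV v element TEI.
    from lxml import etree

    TEI = "http://www.tei-c.org/ns/1.0"
    XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

    def enota_v_tei(ident, diplomaticni, kriticni, viri):
        seg = etree.Element("{%s}seg" % TEI, nsmap={None: TEI})
        seg.set(XML_ID, ident)
        etree.SubElement(seg, "{%s}orig" % TEI).text = diplomaticni  # diplomatični prepis
        etree.SubElement(seg, "{%s}reg" % TEI).text = kriticni       # kritični prepis
        for vir in viri.split("; "):
            etree.SubElement(seg, "{%s}bibl" % TEI).text = vir       # viri enote
        return seg

    seg = enota_v_tei("PREG-00-00001",
                      "Bres muje ſe zhreul ne obuje.",
                      "Brez muje se čreul ne obuje.",
                      "bib14.1: 202; bib23.1: 51; bib7.1: 524")
    print(etree.tostring(seg, pretty_print=True, encoding="unicode"))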
seznam njihovih virov; zapis ilustriramo v sliki 2.
Po ročnem urejanju smo Excel dokumente združili z
OpenRefine1 in tako poenotili korektorske opombe in
2.3. Posodabljanje besed in drugo jezikoslovno
kategorije označevanja pregovorov. Osnovne popravke
označevanje
smo vnesli tudi pri preverjanju shematiziranih vnosov (npr.
Precejšnjo težavo za uporabo izdelane zbirke je predstavljal
navajanje virov, odstranjevanje presledkov na koncu
zapis v arhaični slovenščini, ki oteži iskanje po pregovorih,
besedil v posameznih celicah ipd.). Sledil je prenos
kot tudi njihovo nadaljnjo analizo. Oteženo je tudi
podatkov v delovno bazo SQLite2, kjer so potekali popravki
avtomatsko jezikoslovno označevanje zbirke, saj orodja za
preprostih slovničnih napak in zatipkov (velike začetnice,
jezikoslovno označevanje delujejo dobro le na sodobni
dvojni presledki, nepravilna raba ločil ipd.) ter zaznava
standardni slovenščini.
uporabljenih črkovnih naborov, kjer gre izpostaviti
Za posodabljanje zbirke smo uporabili odprtokodno3
nestandardizirane zapise dajnčice, metelčice, bohoričice in
orodje za normalizacijo cSMTiser (Scherrer in Ljubešić,
gajice. Pregovore so namreč začeli prepisovati v
2016), ki temelji na principu statističnega strojnega
računalniško obliko že v začetku 21. stoletja, ko nabor
prevajanja in orodju Moses (Koehn, 2010). cSMTiser smo
črkovnih znakov še ni bil tako pester in so prepisovalci
naučili posodabljanja na ročno posodobljene korpusu
reševali zagate z različnimi zapisi z improviziranim
slovenščine goo300k (Erjavec, 2016), podobno, kot smo že
izborom znakov. Po osnovnih popravkih paremioloških
pred tem naredili za posodabljanje zbirke slovenskih
enot smo nadaljevali z iskanjem enakih oz. podvojenih enot
romanov v okviru korpusa ELTeC (Schöch et al., 2021). Z
in odstranjevali dvojnike, pri čemer smo vse vire dodali k
orodjem smo nato normalizirali kritični prepis, pri čemer
eni paremiološki enoti. Ob koncu urejanja smo podatke
orodje sicer približa zapis besed sodobni slovenščini, dela
izvozili v format TSV (tab-separated values), ki je bil
pa tudi napake (npr. besedo »čreul« prevede v »čevlj«
izhodišče za izdelavo korpusa.
namesto »čevelj«).
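Since the TSV tables are the pivot format and the TEI encoding of the units is shown only in Figures 1 and 2, a minimal sketch of the kind of TSV-to-TEI conversion involved may be helpful. The column names, element choice and attribute layout below are assumptions for illustration, not the published schema:

    import csv
    import xml.etree.ElementTree as ET

    def unit_to_tei(row):
        """Convert one TSV row (hypothetical columns: id, diplomatic,
        critical, sources) into a simplified TEI-like fragment."""
        entry = ET.Element('div', {'type': 'unit', 'xml:id': row['id']})
        dipl = ET.SubElement(entry, 'p', {'type': 'diplomatic'})
        dipl.text = row['diplomatic']
        crit = ET.SubElement(entry, 'p', {'type': 'critical'})
        crit.text = row['critical']
        # Sources are stored as "bib14.1: 202; bib23.1: 51; ..." pairs
        # of a source identifier and a page number.
        for src in row['sources'].split(';'):
            ref, _, page = src.strip().partition(':')
            bibl = ET.SubElement(entry, 'bibl', {'corresp': '#' + ref.strip()})
            bibl.set('n', page.strip())
        return entry

    with open('units.tsv', encoding='utf-8', newline='') as f:  # hypothetical file
        for row in csv.DictReader(f, delimiter='\t'):
            print(ET.tostring(unit_to_tei(row), encoding='unicode'))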
1 https://openrefine.org/
2 https://www.sqlite.org/
3 https://github.com/clarinsi/csmtiser
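The duplicate search described in Section 2.1 can be illustrated with a small sketch; the table layout and the normalisation applied before comparison are assumptions, not the project's actual workflow:

    import sqlite3
    import string

    def key(text):
        """Normalisation key for near-duplicate detection: lower-case the
        unit and drop punctuation and excess whitespace (an assumption;
        the project may have compared units differently)."""
        cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
        return ' '.join(cleaned.split())

    con = sqlite3.connect('pregovori.db')  # hypothetical working database
    con.create_function('dedup_key', 1, key)

    # Group units that share a key; their sources would then be merged
    # under a single surviving unit.
    rows = con.execute("""
        SELECT dedup_key(critical) AS k, COUNT(*) AS n, GROUP_CONCAT(id)
        FROM units
        GROUP BY k HAVING n > 1
    """).fetchall()
    for k, n, ids in rows:
        print(f'{n} candidate duplicates: {ids} ({k!r})')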
Figure 1: An example of the sources of paremiological units in the TEI encoding.
Figure 2: An example of the encoding of a paremiological unit in TEI.
Figure 3: An example of the encoding of a linguistically annotated paremiological unit in TEI.
On the basis of the automatically modernised words we then linguistically annotated the corpus. Here we used the open-source tool CLASSLA4 (Ljubešić and Dobrovoljc, 2019), with which we added the following linguistic annotations to the text, e.g. for »čevlja«:
- the morphosyntactic description following the MULTEXT-East recommendations (Erjavec, 2012), e.g. »Ncmsg« for »Noun Type=common Gender=masculine Number=singular Case=genitive« (an equivalent Slovenian tag also exists, here »Somer«, together with its expansion into feature=value pairs);
- the lemma, i.e. the base form of the word, here »čevelj«;
- the morphosyntactic tags following the Universal Dependencies system for Slovenian (Dobrovoljc et al., 2017), e.g. »NOUN Case=Gen Gender=Masc Number=Sing«. These tags are similar to the MULTEXT-East ones, but with a differently spelled-out inventory of features and values, and occasionally they also differ from them systematically;
- the dependency-syntactic parse of the sentence following the Universal Dependencies system.

The linguistically annotated variant of each paremiological unit was added to the TEI encoding after its textual transcriptions; the format is illustrated in Figure 3. In the corpus version containing the modernised and linguistically annotated units, the header is also extended with the taxonomy of Universal Dependencies syntactic relations and with a description of the tools used.
4 https://github.com/clarinsi/classla
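The annotation step can be reproduced with the public CLASSLA Python package; a minimal sketch follows (the processor selection and the example sentence are ours, not the project's exact configuration):

    import classla

    # First use only: download the standard Slovenian models.
    classla.download('sl')

    # Tokenisation, UD morphology (with MULTEXT-East MSDs as XPOS),
    # lemmatisation and UD dependency parsing.
    nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse')

    doc = nlp('Brez muje se čevelj ne obuje.')
    print(doc.to_conll())  # CoNLL-U: lemma, UPOS, XPOS (e.g. Ncmsn), feats, deprel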
2.4. Publication of the collection

We published the collection in two ways. It is available for download from the CLARIN.SI repository (Babič et al., 2022) under the open CC BY licence. Besides the two variants of the collection (without and with linguistically annotated units) in the TEI format, it is available there also in the derived TSV format, i.e. as the two tables of sources and units, and in the so-called vertical format, which serves as the input format for the CLARIN.SI concordancers.

The collection is also accessible through the CLARIN.SI noSketch Engine and KonText concordancers. These two services enable analytical insight into the digitised collection.

3. Analysis of the material

In the diplomatic transcription the collection records 36,066 paremiological units (283 of them are in critical transcription). Most of the paremiological units were excerpted from already existing collections of proverbs (10,187 units) and from 1974, i.e. from the collection of proverbs Pregovori in reki na Slovenskem edited by Etbin Bojc (4,884 units). It has to be taken into account that Bojc collected a good number of paremiological units from previously existing collections as well (e.g. Kocbek (1887), Kocbek-Šašelj (1934), and older grammars and dictionaries), but his collection counts as the first more modern one. The oldest proverbs in the collection date from 1587, namely from the preface to Jurij Juričič's Postila.

The ISN collection of paremiological units contains a good number of units from grammars and dictionaries, which means that these units were recorded as isolated entities, without context. Moreover, such records do not attest to actual familiarity with and use of the paremiological units, as we may assume when collecting paremiological material in the field or from printed texts in which the author presupposes familiarity with individual paremiological units and thus the reader's understanding of what is written. This is an important part of research and analysis in folkloristics, as it also reveals the conceptual and ethnolinguistic aspect of folklore material. If we assume good familiarity with a given proverb (e.g. Brez muje se še čevelj ne obuje), we can also infer the conceptual background and the ethnolinguistic picture that such material can offer. For such insight we employ not only the ethnolinguistic approach (connecting linguistics and ethnology with an emphasis on the stereotypical representation of a phenomenon) but also semiotic analysis (the meaning of the sign).

3.1. Ethnolinguistic and semiotic analysis with the concordancers

Although proverbs traditionally belong to the field of paremiology, they are often also the research subject of folkloristics, sociology, pedagogy, linguistics, etc. Semiotics, as the study of signs, offers a methodology for investigating the deeper dimensions of the intertwined cultural backgrounds of proverbs (Grzybek, 2014). With its emphasis on the pragmatic (the relation between the signifier and the signified), syntactic (the formal relations between signs) and semantic dimensions (the relations of signs to the objects to which they can be applied) (Morris, 1938), semiotics enables the observation of proverbs with deeper insight into cultural meanings, concepts and world views. The world view in proverbs, in turn, can be accessed with ethnolinguistic research methods, including the diachronic and synchronic approach.

Ethnolinguistics as an independent field gives language a special place in society: cultural meanings take shape in language; in its words, its phraseology, even its grammar, language conveys images of the world. From this point of view language is the »material of culture«, while at the same time it is also a cultural meta-language: together with folklore it counts among the key cultural codes and culturally expressive forms. Language is therefore one of the most important sources for researching folklore and for reconstructing its early stages; the connection between language and culture is mutual (Tolstaja, 2006), and together they form a sign system. All cultural meanings gather in the semantics of naming with words; the Lublin ethnolinguistic school called these linguistic stereotypes (Bartmiński, 2005), which show our attempt to control the world. Analyses of relatively fixed word combinations and of words in particular contexts show us the linguistic map of the world with its most important social images and representations.

The fast-developing field of digital humanities allows researchers to adopt new, radically different research methods and, just as importantly, makes available electronic collections with advanced data-search capabilities (Rassmusen Neal, 2015). Corpus linguistics and the currently popular »distant reading methodology« (i.e. the use of e-resources) seek to exploit large language samples in order to gain (quantitative) insight into vocabulary, usage, trends and visualisations in areas of linguistic interest. At the same time, such computational text forms of collections enable more precise and faster qualitative analyses of larger collections: of individual concordance combinations and word environments. Semiotic analysis of paremiological material for the purposes of ethnolinguistic research (Bartmiński, 2005) proceeds above all at the level of semantics: in words we want to detect both the metaphorical meanings and the stereotypical labels that an (individual) word carries and at the same time conveys through metaphor into the wider context, that is, from the semiotic point of view, what signs are formed within a paremiological unit. A statistical look at the whole collection reveals, among other things, the most frequently used words, which can also support more general assumptions about social orientation. The most frequent content words in the collection of paremiological units are the following:

- The noun dan ('day') occurs 1,657 times; metaphorically or metonymically it denotes a temporally delimited period, whether long (Premislek je boljši kot dan hoda.) or short (Bitke ne dobiš v enem dnevu.), the end of a period (Po večeru se dan pozna.), the naming of a specific day (Ni vsak dan praznik. / Pavla dne lepo, leto dobro bo.), or the tracing of the good, i.e. the marking of conceptual cyclicity (Za vsako nočjo pride dan.). Its top frequency is not surprising, since this noun is a very common building block of weather and farming paremiological units, and it is also extremely frequent in general contemporary language: in Gigafida 2.0 it is the third most frequently used noun5. On the other hand, it is worth pointing out that its opposite, noč ('night'), occurs only 318 times (it appears as the opposite of day (Ljubezen vidi noč, kjer sije beli dan.), the dark time when one cannot see (Ponoči so vse krave črne.), an influential time (Noč ima svojo moč.), a liminal time (Ne hvali dneva pred nočjo.), a bad time (Dan se zjutraj išče, noč pa sama pride.), the marking of festive times (velika noč, božična noč), etc.).

- The verb biti ('to be') occurs 19,301 times (and negated 3,501 times), which is not surprising, given that it is one of the most basic verbs; it is the most frequent verb in general contemporary language as well.6

- The adjective dobro ('good') occurs 1,367 times, most often in the positive form and least often as the superlative (cf. slabo ('bad'), which occurs 301 times, likewise most often in the positive and least often in the superlative). On the basis of the isolated units one could conclude that at the semantic level proverbs frequently express an evaluation of a state or an action, which, besides expressing a social outlook, also confirms their pedagogical potential.

- The preposition v ('in') is the most frequent preposition in the paremiological units, occurring 4,538 times. From this we may infer that at the conceptual origin we most often place phenomena within the temporal-spatial concept of a phenomenon, even though the meaning of the preposition extends also to expressing purpose, means, the relation to a whole, an action/state, etc. The same can be observed in contemporary general language.7

For the most frequently present words in the paremiological units, dan, biti, dobro and v, it thus turns out that they correspond fully to the frequency of use in general contemporary language, regardless of the fact that this is mostly archival material.

For a more precise ethnolinguistic and conceptual insight, an analysis with an individual component (e.g. the nouns čevelj 'shoe' or medved 'bear') and its collocational links is more suitable; on this basis we can use the semiotic method to offer interpretations of social conceptual aspects. For such an analysis the most broadly useful is simple search, which in the case of this collection lists all case forms of the searched noun, including the older spellings: e.g. when searching for the word čevelj (68 units), the search engine extracts all the case forms as well as the spellings črevelj, čevl, etc. With more advanced searches it is also possible to track the counts of individual spellings: črevelj (2), črevlju (1), čevle (3), while the word list also makes it possible to follow older spellings, the sources and their frequency over time, variant uses and possible renewals.

Likewise, when searching for all spellings and case forms of the word lisica 'fox' (older form lesica, 7 units), the search engine finds 93 paremiological units. With more advanced searches it is again possible to track the counts of individual spellings: lisica (31), lisice (5), lisici (7), lisico (6), lesica (7), etc. A contextual investigation of publication across different sources yields a telling finding: grammars and dictionaries cite paremiological units with the word lisica that are entirely metaphorical and refer to people, while calendars also list paremiological units that count as weather forecasts.

The search engine also supports searching for a desired word in combination with another part of speech, e.g. the lemma medved 'bear' followed by a verb. Many connections can indeed be established in this way, but, contrary to expectations, the statistics module also shows results from other (preceding or following) proverbs, not only results tied to the individual proverb. For the word medved the statistics module shows 79 matches, but only 66 of them occur within a single proverb. A manual check quickly establishes that this word most frequently combines with the verb prodajati 'to sell'. In combination with a noun, koža 'skin' appears, forming the proverb that metaphorically warns against premature praise. The proverb points to a semantic field that in the ethnolinguistic interpretation is tied to the economic reflection of society, i.e. the selling of the bear's skin, which in the historical context shows its considerable economic value.

Nevertheless, because of older and dialectal expressions the search engine does not always find all the combinations, e.g. in Lep čevelj vidiš, a ne veš, kje me gloje or Kdor stare čevlje flika, pride do zlatnika, where the concordancer did not detect the combination of the noun and the verb.

Variants of an individual proverb are most easily found by searching for word combinations: e.g. a search for the phrase lastovka ne yields four results: Ena lastovka ne naredi poletja, Ena lastovka ne naredi pomladi, Ena lastovka ne prinese pomladi, Ena lastovka ne prinese nikoli spomladi. For the verbal phrase gre samo enkrat na led the results include both the donkey and the fox (Osel/lisica gre samo enkrat na led), and likewise both the fox and the cat praise their own tail (Vsaka lisica/mačka svoj rep hvali).

With the digitised material and the possibility of more demanding searches, the ethnolinguistic insight into the corpus of proverbs is more thorough. Even the mere frequency of individual words in the proverbs, or the data on the variation of an individual proverb, is an excellent starting point that can hardly be reached with an analogue archive.

4. Conclusion

The digitisation of folklore material makes its analysis easier and at the same time substantially more precise: the search engines allow all the desired units to be listed, and the comparison of the material is more consistent.

The establishment of the digital collection of ISN paremiological units marks a shift in Slovenian folkloristics. The material is more accessible and analytically easier to handle. At the same time, such a form does not demand a (semantic, thematic, functional, alphabetical, etc.) categorisation of the proverbs; instead, they are arranged as the smallest complete texts on which the analysis is performed. With respect to the problem of categorisation this solution is undoubtedly the most favourable, since categorisation itself often shows more shortcomings than advantages.

There is certainly room for improvement in the collection of proverbs: besides removing some orthographic errors, the question arises of variation and of the links between variants; in this way some remaining duplicated proverbs would also be removed (above all those entered with different punctuation, e.g. one with a comma, another with a semicolon). Since the diplomatic transcription is in places problematic (Gaj, Bohorič and Metelko alphabets, phonetic transcription), the question arises of the usefulness of a standard-language rendering of each proverb, which would have to be checked manually. The collection will certainly also be supplemented with new paremiological units (from older sources as well as contemporary use). In addition, it would make sense to introduce a division of the sources into categories, which would show more precisely the presence of paremiological units in each category of sources and would thus also enable comparative analysis (e.g. units in calendars versus units in grammars).

In building the digital paremiological collection we drew on systems that are well established in linguistics. What remains to be considered is how to digitise verbal folklore that is longer (e.g. stories, prayers) and has specific functions (e.g. riddles, charms).

5 http://hdl.handle.net/11346/QHKH
6 http://hdl.handle.net/11346/XNRI
7 http://hdl.handle.net/11346/ZYVZ
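The frequency observations above are easy to reproduce once the collection is downloaded in the vertical (concordancer input) format. A minimal sketch, assuming a simple word/lemma/tag column layout and a hypothetical file name (the actual vertical files may use different columns):

    from collections import Counter

    lemma_counts = Counter()
    with open('pregovori.vert', encoding='utf-8') as f:  # hypothetical file name
        for line in f:
            line = line.rstrip('\n')
            if not line or line.startswith('<'):  # skip structural tags like <s>
                continue
            cols = line.split('\t')  # assumed columns: word, lemma, MSD
            if len(cols) >= 2:
                lemma_counts[cols[1]] += 1

    # Most frequent lemmas; content words such as "dan" and "biti"
    # should surface near the top, as reported in Section 3.1.
    for lemma, n in lemma_counts.most_common(20):
        print(f'{lemma}\t{n}')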
Acknowledgements

The digitised collection of paremiological units would not have come into being without the project collaborators, in particular Miha Peče: his feel for folklore material and his knowledge of the computing world enabled swift progress and the prompt resolution of difficulties.

The work described in this paper was supported by the basic research project »Traditional paremiological units in dialogue with contemporary use« (ARRS J6-2579).

5. References

Ojo Arewa and Alan Dundes. 1966. Proverbs and the Ethnography of Speaking Folklore. American Anthropologist, 64: 70–85.
Saša Babič, Miha Peče, Tomaž Erjavec, Barbara Ivančič Kutin, Katarina Šrimpf Vendramin, Monika Kropej Telban, Nataša Jakop, and Marija Stanonik. 2022. Collection of Slovenian paremiological units Pregovori 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1455.
Jiři Bartmiński. 2005. Jazykovoj obraz mira: očerki po etnolingvistike. Indarik, Moskva.
Kaja Dobrovoljc et al. 2017. The Universal Dependencies Treebank for Slovenian. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 33–38. Association for Computational Linguistics, doi:10.18653/v1/W17-1406.
Alan Dundes. 1965. The study of folklore. Prentice-Hall, Englewood Cliffs.
Tomaž Erjavec. 2012. MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages. Language Resources and Evaluation, 46(1): 35–57.
Tomaž Erjavec. 2015. Reference corpus of historical Slovene goo300k 1.2. Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1025.
Diana Faridovna Khakimzyanova and Enzhe Kharisovna Shamsutdinova. 2016. Corpus Linguistics in Proverbs and Sayings Study: Evidence from Different Languages. The Social Sciences, 11(15): 3770–3773.
Peter Grzybek. 2012. Proverb Variants and Variations: A New Old Problem? In: O. Lauhakangas and R. J. B. Soares, eds., Proceedings of the Fifth Interdisciplinary Colloquium on Proverbs, pages 136–152. AIP-IAP, Tavira.
Peter Grzybek. 2014. Semiotic and Semantic Aspects of the Proverb. In: H. Hrisztova-Gotthardt and M. A. Varga, eds., Introduction to Paremiology: A Comprehensive Guide to Proverb Studies, pages 68–111. De Gruyter, Warsaw/Berlin.
Dell Hymes. 1962. The ethnography of speaking. In: T. Gladwin and W. C. Sturtevant, eds., Anthropology and Human Behavior, pages 13–53. Anthropological Society of Washington, Washington.
Fran Kocbek. 1887. Pregovori, prilike in reki. Založil Anton Trstenjak, Ljubljana.
Fran Kocbek and Ivan Šašelj. 1934. Slovenski pregovori, reki in prilike. Družba Sv. Mohorja, Ljubljana.
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
Erika Kržišnik. 2008. Kulturološka interpretacija frazema. In: M. Kalin Golob, N. Logar Berginc, and A. Grizold, eds., Jezikovna prepletanja, pages 149–165. Fakulteta za družbene vede, Ljubljana.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34. Association for Computational Linguistics, doi:10.18653/v1/W19-3704.
Milko Matičetov. 1956. Pregovori in uganke; ljudska proza. Slovenska matica, Ljubljana.
Matej Meterc. 2021. Aktualna raba in pomenska določljivost 200 pregovorov in sorodnih paremioloških izrazov. Jezikoslovni zapiski, 27(1): 45–61.
Jozef Mlacek. 1983. Problémy komplexného rozboru prísloví a porekadiel. Slovenská reč, 48(2): 129–140.
Wolfgang Mieder. 1993. Proverbs are never out of season: Popular wisdom in modern age. Oxford University Press.
Hanna F. Pitkin. 1972. The concept of representation. University of California Press.
Diana Rassmusen Neal. 2015. Indexing and retrieval of non-text information. De Gruyter Saur, Chicago, Vancouver.
Yves Scherrer and Nikola Ljubešić. 2016. Automatic Normalisation of the Swiss German ArchiMob Corpus Using Character-Level Machine Translation. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), pages 248–255.
Christoph Schöch, Roxana Patraş, Tomaž Erjavec, and Diana Santos. 2021. Creating the European Literary Text Collection (ELTeC). Modern Languages Open, doi:10.3828/mlo.v0i0.364.
Marija Stanonik. 1996. Slovenski pregovori in rekla. Project proposal.
Marija Stanonik. 2004. Informatizacija neoprijemljive dediščine za etnologijo in folkloristiko. Project proposal.
Marija Stanonik. 2009. Slovenski pregovori kot kulturna dediščina: klasifikacija in redakcija korpusa. Project proposal.
Marija Stanonik. 2015. Slovenski pregovori kot kulturna dediščina. Klasifikacija in redakcija korpusa. Traditiones, 44(3): 171–214.
Kathrin Steyer. 2017. Corpus Linguistic Exploration of Modern Proverb Use and Proverb Patterns. In: R. Mitkov, ed., Europhras 2017. Computational and corpus-based phraseology: Recent advances and interdisciplinary approaches. Proceedings of the Conference, Volume II, pages 45–52. London, Geneva.
TEI Consortium. 2022. TEI P5: Guidelines for Electronic Text Encoding and Interchange. https://tei-c.org/guidelines/P5/
Svetlana M. Tolstaja. 2015. Obraz mira v tekste i rituale. Univerza Dimitrija Požarskega, Moskva.
DirKorp: A Croatian Corpus of Directive Speech Acts
Petra Bago*, Virna Karlić†
* Department of Information and Communication Sciences
† Department of South Slavic Languages and Literatures
Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, HR-10000 Zagreb
{pbago, vkarlic}@ffzg.hr
Abstract
In this paper we present recent developments on a new version (v2.0) of DirKorp (Korpus direktivnih govornih činova hrvatskoga jezika), a Croatian corpus of directive speech acts developed for the purposes of pragmatic research. The corpus contains 800 elicited speech acts collected via an online questionnaire with role-playing tasks. Respondents were 100 Croatian speakers, all undergraduate or graduate students of the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus has been manually annotated on the speech act level, each speech act containing up to 12 features. It contains 12,676 tokens and 1,692 types. The corpus is encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange, developed and maintained by the Text Encoding Initiative Consortium (TEI). We describe the applied pragmatic annotation as well as the structure of the corpus.
1. Introduction

Corpus pragmatics is an interdisciplinary field of study that incorporates linguistic pragmatics and computer science, focusing on the development of natural language corpora in machine-readable form and their application for the purposes of studying pragmatic phenomena in written and spoken language. For a long time, linguists regarded a corpus approach to language as incompatible with pragmatics (Romero-Trillo, 2008: 2). While the corpus approach to studying language implies processing authentic language material with quantitative research methods, pragmatic research is still predominantly of a qualitative nature – based on the researcher's introspection, on data obtained by elicitation methods, or on an analysis of authentic linguistic material of small size. The application of corpus analysis in the research of pragmatic phenomena represents a major turnaround in the development of pragmatics, primarily because it allows a systematic analysis of authentic language material of large size, and thus the detection of patterns of language use that “go below the radar” in qualitative analyses (ibid.). In addition, it should be pointed out that the application of new technologies in linguistics, including pragmatics, did not only ensure, facilitate or accelerate numerous research processes, but opened the door to a new, different way of thinking about language (Leech, 1992).

The application of corpus methods to large pragmatic corpora allows one to systematically carry out empirically based pragmatic research (Bunt, 2017: 327). While the implementation of corpus research can result in minor adjustments to existing theories on the one hand, it can lead to a rethinking of pragmatic concepts and theoretical frameworks on the other, for example the development of the theory of dialogue acts (ibid.).

According to Rühlemann and Aijmer (2015), one of the major methodological problems that corpus pragmatics researchers encounter is the disproportionate relationship between pragmatic functions and the language forms by which these functions are expressed. One form can perform multiple pragmatic functions in discourse, while one function can be expressed by different forms, which makes the process of querying a corpus according to the pragmatic function criterion considerably difficult. It is for this reason that corpus pragmatics researchers most often investigate conventional speech acts or functions performed by a limited number of language forms (Jucker, Schreier, and Hundt, 2009: 4). The aim of this paper is to present DirKorp, the first Croatian corpus of directive speech acts, manually annotated for corpus pragmatic research.

The paper is structured as follows: Section 2 describes selected work related to pragmatic corpora, while the subsequent three sections present the DirKorp corpus. Section 3 gives a description of the developed corpus, Section 4 describes the 12 annotation features, and Section 5 presents the structure of the corpus encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange (TEI Consortium, 2021). Finally, Section 6 contains the conclusion and future work.

2. Related Work

The number of large corpora with systematically implemented pragmatic annotation is so far small. Due to the disproportionate relationship between pragmatic functions and the language forms by which these functions are expressed, automatic corpus annotation does not produce satisfactory results. For this reason, only a small number of researchers have engaged in the creation of larger corpora of this sort. Generally, for the purposes of corpus pragmatic research, specialized corpora of smaller size are produced for individual research purposes. In addition, pragmatic research is sometimes carried out on corpora without pragmatic annotation.

An example of a corpus that does not contain pragmatic annotation, but which was used for pragmatic research, is the Birmingham Blog Corpus1 (Kehoe and Gee, 2007; Kehoe and Gee, 2012). In fact, this is a subcorpus of a larger set of corpora being developed at the Research and Development Unit for English Studies at Birmingham City University. It consists of blog posts and reader comments, totalling 500M words of English collected between 2000 and 2010.

1 https://www.webcorp.org.uk/wcx/lse/corpora
Automatic POS annotation was performed using the Stanford CoreNLP tools2 and includes lemma annotations and part-of-speech categories3 based on the Universal Dependencies framework4, while the documents contain publication-date metadata. Pragmatic research on speech acts has been conducted on this corpus: for example, Lutzky and Kehoe (2017a; 2017b) used it to analyze apologies as speech acts that contain formulaic expressions, which facilitates querying them in a corpus with the available tools.

Similarly, we (Karlić and Bago, 2021) conducted research on the pragmatic functions and properties of imperatives using corpora without pragmatic annotation. We used hrWaC and srWaC (Ljubešić and Klubička, 2014), two large web corpora of the Croatian and Serbian languages with morphosyntactic annotation. For the purposes of the analysis, an additional pragmatic annotation of a representative sample of verbs in the imperative form was carried out manually. Other corpora of spoken and written Croatian with no pragmatic annotation have also been used as resources for corpus pragmatic research. For example, Hržica, Košutar, and Posavec (2021) used the Croatian Corpus of the Spoken Language of Adults (HrAL) (Kuvač Kraljević and Hržica, 2016) and the Croatian National Corpus of the written language (HNK) (Tadić, 1996) for the search and analysis of connectors and discourse markers.

According to Bunt (2017), the majority of corpora with pragmatic annotation contain labels on discourse relationships in written texts and on spoken dialogue acts. An example of such a larger corpus is the Penn Discourse Treebank or PDTB5 (Prasad, Webber, and Lee, 2018), which contains labels on discourse relations, i.e. discourse structure and its semantics. The discourse annotations were added to a subcorpus consisting of texts published in the Wall Street Journal newspaper, sizing 1M tokens, included in the bigger Penn Treebank (PTB) corpus. Bunt (2017) states that there are corpora of other languages developed for the purposes of studying the co-occurrence of discourse labels, such as Chinese, Czech, Dutch, German, Hindi and Turkish – emphasizing that these corpora are manually annotated and of modest size. Additionally, for each corpus a new schema was developed based on various theoretical starting points.

DialogBank6 (Bunt et al., 2019) is one of the rare dialogue corpora annotated with the ISO 24617-2 standard. It contains already existing dialogue corpora annotated with various schemas. Four corpora are of English: HCRC Map Task (Anderson et al., 1991), Switchboard (Godfrey, Holliman, and McDaniel, 1992), TRAINS (Allen et al., 1995) and DBOX (Petukhova et al., 2014); and four of Dutch: DIAMOND (Geertzen et al., 2004), OVIS7, Dutch Map Task (Caspers, 2000) and Schiphol (Prüst, Minnen, and Beun, 1984). Dialogue act annotation involves segmenting a dialogue into defined grammatical units and augmenting each unit with one or more communicative function labels.

Another example of a corpus with pragmatic annotation is the Engineering Lecture Corpus8 (Alsop and Nesi, 2013; Alsop and Nesi, 2014), which contains 76 transcripts based on hour-long video recordings of engineering lectures held in English at three universities. It is manually annotated for three pragmatic features: humor, storytelling and summary9. Each feature can be augmented with one of the attributes containing additional information that describes the feature in more detail. Further, the corpus contains labels regarding significant breaks, laughter, writing or drawing on the board, etc.

Finally, we present the SPICE-Ireland corpus (Systems of Pragmatic Annotation in the Spoken Component of ICE-Ireland) (Kallen and Kirk, 2012), a part of the larger set of corpora ICE-Ireland (International Corpus of English: Ireland Component) containing pragmatic, discourse and prosodic features. The corpus contains various types of private and public, formal and informal dialogues and monologues of a length of about 2,000 words each, sizing 625K words in total. It consists of spoken English. The pragmatic annotation of speech acts is based on Searle's classification (Searle, 1969; Searle, 1976): representatives, directives, commissives, expressives and declaratives.

To the best of our knowledge, there exist no publicly available corpora of spoken or written Croatian with pragmatic annotation. So far, Croatian linguists have mostly dealt with speech acts from a theoretical perspective, referring primarily to Austin's and Searle's theories (cf. Pupovac, 1991; Ivanetić, 1995; Miščević, 2018; Palašić, 2020). However, in recent times the number of studies based on qualitative and quantitative analyses of small-sized authentic linguistic materials (from literary texts and advertisements to email messages and political discourse in Croatian and other languages) has been increasing (cf. e.g. Pišković, 2007; Matić, 2011; Franović and Šnajder, 2012; Šegić, 2019).

In the following sections we present a new version (v2.0) of DirKorp, the first Croatian corpus of directive speech acts.

3. Corpus Description

DirKorp (Korpus direktivnih govornih činova hrvatskoga jezika) (Karlić and Bago, 2021) is a Croatian corpus of directive speech acts developed for the purposes of pragmatic research. The corpus contains 800 elicited speech acts collected via an online questionnaire with role-playing tasks, applying the method of simulated communication implemented under pre-set conditions. This method is suitable for researching speech acts due to the ability to collect a great number of examples of speech acts with equal propositional content and illocutionary purpose used in the same controlled situations. The questionnaire included eight closed-type role-playing tasks. These types of tasks imply recording the speaker's reactions (in this case in writing) to the stimulus, without feedback. In each task, the participants are presented with one textually described hypothetical situation asking them to address a directive speech act to their interlocutor.

2 https://stanfordnlp.github.io/CoreNLP/
3 See more about the POS tagset used for the Birmingham Blog Corpus: https://www.webcorp.org.uk/wcx/lse/guide.
4 https://universaldependencies.org/u/pos/index.html
5 https://doi.org/10.35111/qebf-gk47
6 https://dialogbank.uvt.nl/
7 http://www.let.rug.nl/vannoord/Ovis/
8 www.coventry.ac.uk/elc
9 https://www.coventry.ac.uk/research/research-directories/current-projects/2015/engineering-lecture-corpus-elc/annotations-and-mark-ups/
Their assignment was to imagine they were in the presented situation and to give the written statement they would use in the described situation. The presented situations are classified into two categories with regard to the relationship between the participants in the communication act: (1) situations involving interlocutors who are not in a familiar relationship; (2) situations involving interlocutors in a familiar relationship. The assignments of the two categories are organized into four pairs, asking respondents to produce a speech act of similar propositional content: “I want you to return something that belongs to me” (for the text of the role-playing tasks see Example 1, where the interlocutors have (a) an unfamiliar relationship and (b) a familiar relationship); “I want you to answer my inquiry”; “I want you to change something that bothers me”; “I want you to stop behaving inappropriately”10.

Example 1
(a) Upravo si pojeo/la ručak u restoranu. Posluživao te stariji konobar koji se odnosio prema tebi ljubazno i profesionalno. Prilikom plaćanja računa konobar ti vraća 100 kuna manje nego što je trebao. Želiš da ti konobar vrati novac. Zamisli da se konobar nalazi pred tobom i napiši što bi mu točno rekao/la u danoj situaciji (nemoj prepričavati, već iskaz formuliraj kao da se izravno obraćaš sugovorniku).
(Eng. You just ate lunch at a restaurant. You were served by an elderly waiter who treated you kindly and professionally. When paying the bill, the waiter refunds you 100 kunas less than he should have. You want the waiter to give you your money back. Imagine the waiter was in front of you and write what exactly you would say to him in the given situation (do not recount, but formulate the statement as if you were addressing the interlocutor directly).)
(b) Posudio/la si knjigu najboljem prijatelju (ili prijateljici). Rekao ti je da će ti je uskoro vratiti, no nije održao riječ. Sjedite zajedno u kafiću, situacija je opuštena, razgovarate o svakodnevnim stvarima. Želiš mu dati do znanja da ti treba čim prije vratiti knjigu. Zamisli da se tvoj prijatelj nalazi pred tobom i napiši što bi mu točno rekao/la u danoj situaciji (nemoj prepričavati, već iskaz formuliraj kao da se izravno obraćaš sugovorniku).
(Eng. You lent a book to your best friend. (S)he told you (s)he'd give it back to you soon, but (s)he didn't keep her/his word. You are sitting together in a café, the situation is relaxed, you talk about everyday things. You want to let her/him know you need to get your book back as soon as possible. Imagine your friend was in front of you and write what exactly you would say to her/him in the given situation (do not recount, but formulate the statement as if you were addressing the interlocutor directly).)

Respondents were 100 Croatian speakers, all undergraduate (63 %) or graduate (37 %) students of the Faculty of Humanities and Social Sciences, University of Zagreb, aged between 18 and 33. Croatian is the mother tongue of the majority of the respondents (96 %). The questionnaire was carried out during December 2020 and January 2021. All respondents participated in the study voluntarily. The questionnaire was conducted anonymously, and the collected language material was used exclusively for scientific purposes.

The elicitation of language production by the role-playing method has its advantages and disadvantages. On the one hand, it enables the collection of a large number of speech acts with the same propositional content and illocutionary purpose. On the other hand, users of the corpus should keep in mind that language material collected by this method does not reflect the features of actual language use. It rather shows what speakers think they would say and/or do in hypothetical situations.

DirKorp contains 12,676 tokens and 1,692 types11. Since it consists of 800 speech acts, it is a relatively small corpus. However, as the first Croatian corpus with detailed pragmatic annotation, DirKorp can serve as a useful resource for researching speech acts, politeness strategies and other related pragmatic phenomena in the Croatian language. In addition, we hope that it will contribute to the development of larger corpora of the Croatian language with pragmatic annotation, and that it will encourage a wider application of the corpus-pragmatic research method.

We have conducted corpus pragmatic analyses of the collected speech acts in order to investigate the ways and means of expressing directives, and their pragmatic characteristics and functions. For example, we confirmed that indirect directives are more frequent than direct ones, especially among interlocutors who are not in a familiar relationship. Regarding the (un)familiar relationship between interlocutors, we detected that explicit illocutionary force is more frequent in communication between interlocutors in a familiar relationship, while implicit illocutionary force is more frequent in communication between interlocutors in an unfamiliar relationship. Additionally, we have identified that imperative utterances are a more frequent type of direct directive than utterances with a directive performative verb in the 1st person. For more such corpus pragmatic analyses see Karlić and Bago (2021).

4. Corpus Annotation

The collected language material has been manually annotated on the speech act level by two independent annotators with university graduate degrees in the field of philology. The annotators received oral and written instructions, including illustrative examples for all the features they had to annotate.

The categorization of the speech acts and their formal and pragmatic properties was carried out according to the theory of speech acts by Austin (1962), Searle (1969; 1976) and their successors, the politeness theory of Brown and Levinson (1978), and the grammars of the contemporary Croatian and Serbian languages (Silić and Pranjković, 2007; Piper et al., 2005).

10 Full texts of the role-playing tasks are available in the corpus header.
11 Respondents' answers contain utterances, but also text about what they would do in the given situation. At this moment, we have not analyzed the average length of a response. Generally, we can only state that some speech acts contain only one utterance, while some contain more than one.
For more on the individual categories, see Karlić and Bago (2021). In the new version of DirKorp (v2.0), each speech act can contain up to 12 features. The first 8 features were part of corpus version v1.0, while features 9–12 are newly added. For the frequency distribution of all features see Karlić and Bago (2021).

(1) Respondent ID – This mandatory feature contains information identifying the respondent uttering the speech act.

(2) Familiarity / unfamiliarity – This mandatory feature contains information on the category of the proposed situation in which the speech act was uttered. Four situations are labelled ‘unfamiliar’ (involving interlocutors who are not in a familiar relationship), while the other four situations are labelled ‘familiar’ (involving interlocutors who are in a familiar relationship).

(3) Utterance type – This mandatory feature contains information on the utterance type with regard to its structural organization. It contains six labels: (a) an imperative utterance, (b) an assertive utterance (a statement), (c) an utterance in the form of a question, (d) an utterance in the form of an ellipsis, (e) a nonverbal signal, (f) a case of avoidance of executing a speech act (see Example 2).

Example 2
(a) E vrati mi onu knjigu koju sam ti posudio.
(Eng. Hey, give me back that book I lent you.)
(b) Oprostite, ali mislim da ste mi krivo vratili novce.
(Eng. Excuse me, but I think you gave me my money back wrong.)
(c) Možete li molim vas zatvoriti prozore?
(Eng. Could you please close the windows?)
(d) E, moja knjiga??
(Eng. Hey, my book??)
(e) [Samo bih zavrtjela očima da vide moje neodobravanje, ali ne bih ništa rekla.]
(Eng. [I'd just roll my eyes so that they see my disapproval, but I wouldn't say anything.])
(f) [Ne bih ništa rekao.]
(Eng. [I wouldn't say anything.])

(4) Directive performative verb in the 1st person – This optional feature contains information on the presence of a directive performative verb in the 1st person as part of the speech act, only for assertive utterances and utterances in the form of a question. It contains two labels: (a) yes and (b) no (see Example 3).

Example 3
(a) Oprostite, molim da odete na kraj reda.
(Eng. Excuse me, I am imploring you to go to the end of the line.)
(b) Gospođo, morate na kraj reda stati.
(Eng. Madam, you must move to the end of the line.)

(5) Illocutionary force – This optional feature contains information on the explicitness or implicitness of the illocutionary force of a speech act. It is only applied to utterances that contain verbal means (an imperative utterance, an assertive utterance, an utterance in the form of a question and in the form of an ellipsis). It contains two labels: (a) explicit and (b) implicit (see Example 4).

Example 4
(a) Daj mi donesi više onu knjigu, treba mi!
(Eng. Bring me that book already, I need it!)
(b) Kaj je s onom knjigom koju sam ti posudio?
(Eng. What happened to that book I lent you?)

(6) Propositional content – This optional feature contains information on the explicitness or implicitness of the propositional content of a speech act. It is only applied to utterances that contain verbal means (an imperative utterance, an assertive utterance, an utterance in the form of a question and in the form of an ellipsis). It contains two labels: (a) explicit and (b) implicit (see Example 5).

Example 5
(a) Gledaj na cestu, pusti mobitel.
(Eng. Look at the road, leave the cell phone.)
(b) Ti hoćeš da poginemo?
(Eng. You want us to die?)

(7) T/V form – This optional feature contains information on how the respondent addressed the interlocutor, using an informal (T-form) or a formal you (V-form). It is only applied to utterances that contain verbal means (an imperative utterance, an assertive utterance, an utterance in the form of a question and in the form of an ellipsis). It contains three labels: (a) T-form, (b) V-form and (c) impossible to determine (see Example 6).

Example 6
(a) Oprosti, dao si mi manje novca
(Eng. Sorry, youT-form gave me less change.)
(b) Oprostite, mislim da ste mi ipak još dužni 100 kuna.
(Eng. Excuse me, I think youV-form still owe me 100 kunas.)
(c) Hmm... još 100 kuna, zar ne?
(Eng. Hmm… another 100 kunas, right?)

(8) Exhortative – This optional feature contains information on the presence of an exhortative as part of the speech act. It contains two labels: (a) yes and (b) no (see Example 7).

Example 7
(a) Daj mi više vrati knjigu, treba mi za knjižnicu.
(Eng. Bring me back my book already, I need it for the library.)
(b) Jel se sjećaš one knjige koju sam ti posudila? Potrebna mi je. Možeš li mi ju donijeti sutra na faks?
(Eng. Do you remember that book I lent you? I need it. Could you bring it tomorrow to uni?)

(9) Request – This optional feature contains information on whether the speech act includes a lexical marker of request. It contains two labels: (a) yes and (b) no (see Example 8).

Example 8
(a) E da, jel bi mi mogao/la vratiti knjigu, molim te?
(Eng. Oh yeah, could you bring the book back, please?)
(b) Zaboravio si mi vratiti knjigu, jel se možeš idući put sjetiti?
(Eng. You forgot to bring me back the book, can you remember next time?)
26
PAPERS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
(10) Apology – This optional feature contains information on whether the speech act includes a lexical marker of apology. It contains two labels: (a) yes and (b) no (see Example 9).

Example 9
(a) Oprostite, ovdje fali još 100 kuna
(Eng. Excuse me, 100 kunas is missing here.)
(b) Možete li molim vas pritvoriti prozore, hladno mi je?
(Eng. Could you please close the windows, I'm cold?)

(11) Gratitude – This optional feature contains information on whether the speech act includes a lexical marker of gratitude. It contains two labels: (a) yes and (b) no (see Example 10).

Example 10
(a) Molim te mi samo javi da znam zbog organizacije hoćeš li doći. Hvala ti!
(Eng. Please just let me know whether you're coming so that I know because of the organization. Thank you!)
(b) Heej, jel dolaziš večeras na druženje? Moram znati zbog organizacije. xoxo
(Eng. Heeey, are you coming tonight to hang out? I need to know because of the organization. xoxo)

(12) Honorific title – This optional feature contains information on whether the speech act includes an honorific title. It contains two labels: (a) yes and (b) no (see Example 11).

Example 11
(a) Gospođo, kraj reda je dolje
(Eng. Madam, the end of the line is back there.)
(b) Oprostite, tamo je kraj reda!
(Eng. Excuse me, the end of the line is there!)

5. Corpus Format

DirKorp is encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange, developed and maintained by the Text Encoding Initiative Consortium (TEI) (TEI Consortium, 2021). The TEI document is comprised of a header and the body of the corpus. The content of the elements and attributes is in Croatian. The metadata of the corpus is given in the header (see Figure 1 for an example), including the full text of the eight situations from the questionnaire; a list of questionnaire participants with information on their age, gender, undergraduate or graduate level of study, enrollment in a philological/non-philological/combined study program and mother tongue (see Figure 2 for an example); and a list of revisions of the DirKorp versions. The body of the corpus is composed of one division containing the utterances with their pragmatic features (see Figure 3 for an example).

[Figure 1 reproduces three category descriptions from the corpus header, stating (in Croatian) that a speech act contains T-form address, V-form address, or that the form of address cannot be determined; each attribute applies to utterance types involving verbal means (imperative, statement, question, elliptical).]
Figure 1: An example of a pragmatic feature description – how the respondent addressed the interlocutor (V-form, T-form or impossible to determine).

[Figure 2 reproduces a participant description from the corpus header: a respondent, 20 years old, gender F, undergraduate student of the Faculty of Humanities and Social Sciences, non-philological study program, mother tongue Croatian.]
Figure 2: An example of participant information.

[Figure 3 reproduces an annotated utterance: Ispričavam se, pardon, fali još sto kuna. Oprostite.]
Figure 3: An example of an utterance containing all 12 pragmatic features.

DirKorp is available for download under the CC BY-SA 4.0 license from GitHub in TEI format (https://github.com/pbago/DirKorp).
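To work with the distributed TEI file programmatically, the utterances and their feature attributes can be read with a few lines of Python. A minimal sketch; the local file name, element choice (<u>) and attribute handling are assumptions for illustration, since the exact markup is defined in the corpus itself:

    import xml.etree.ElementTree as ET

    TEI_NS = '{http://www.tei-c.org/ns/1.0}'

    tree = ET.parse('DirKorp.xml')  # hypothetical local copy of the corpus

    # Iterate over utterance-like elements and print their annotation
    # attributes (e.g. utterance type, T/V form), whatever they are named.
    for u in tree.iter(TEI_NS + 'u'):
        text = ''.join(u.itertext()).strip()
        print(text)
        for name, value in sorted(u.attrib.items()):
            print(f'  {name} = {value}')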
6. Conclusion and Future Work

We have presented DirKorp, the first Croatian corpus of directive speech acts, containing 800 elicited speech acts collected via an online questionnaire with role-playing tasks, specifically developed for pragmatic research studies. Respondents were 100 Croatian speakers, all students of the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus has been manually annotated on the level of the speech act, each speech act containing up to 12 features. It contains 12,676 tokens and 1,692 types. The corpus is available for download under the CC BY-SA 4.0 license from GitHub in TEI format.

Further work is planned on the corpus, including an evaluation of the developed scheme for annotating directive speech acts, annotation at levels smaller than the speech act, and augmentation with additional features such as information on the grammatical mood used in a speech act, information on the presence of a modal verb in the 2nd person as part of a speech act, and information on the various politeness strategies applied in a speech act.

7. Acknowledgements

This paper is generously co-financed by the institutional project of the Faculty of Humanities and Social Sciences “South Slavic languages in use: pragmatic analyses” (principal researcher Virna Karlić). We wish to thank all our annotators.

8. References

James F. Allen, Lenhart K. Schubert, George Ferguson, Peter Heeman, Chung Hee Hwang, Tsuneaki Kato, Marc Light, Nathaniel G. Martin, Bradford W. Miller, Massimo Poesio, and David R. Traum. 1995. The TRAINS Project: A Case Study in Building a Conversational Planning Agent. Journal of Experimental & Theoretical Artificial Intelligence, 7(1):7–48.
Sian Alsop and Hilary Nesi. 2013. Annotating a Corpus of Spoken English: The Engineering Lecture Corpus (ELC). In: Proceedings of GSCP 2012: Speech and Corpora, pages 58–62. Firenze University Press, Florence.
Sian Alsop and Hilary Nesi. 2014. The Pragmatic Annotation of a Corpus of Academic Lectures. In: The International Conference on Language Resources and Evaluation 2014 Proceedings, pages 1560–1563. European Language Resources Association, Reykjavik.
Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, Catherine Sotillo, Henry S. Thompson, and Regina Weinert. 1991. The HCRC Map Task Corpus. Language and Speech, 34(4):351–366.
John L. Austin. 1962. How to Do Things with Words. Clarendon Press, Oxford.
Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Cambridge University Press.
Harry Bunt. 2017. Computational Pragmatics. In: Oxford Handbook of Pragmatics, pages 326–345. Oxford University Press, New York.
Harry Bunt, Volha Petukhova, Andrei Malchanau, Alex Fang, and Kars Wijnhoven. 2019. The DialogBank: Dialogues with Interoperable Annotations. Language Resources and Evaluation, 53(2):213–249.
Johanneke Caspers. 2000. Melodic Characteristics of Backchannels in Dutch Map Task Dialogues. In: Proceedings, 6th International Conference on Spoken Language Processing, pages 611–614. China Military Friendship Publish, Beijing, https://www.isca-speech.org/archive/icslp_2000/.
Tin Franović and Jan Šnajder. 2012. Speech Act Based Classification of Email Messages in Croatian Language. In: Proceedings of the Eighth Language Technologies Conference, pages 69–72. Information Society, Ljubljana.
Jeroen Geertzen, Yann Girard, Roser Morante, Ielka Van der Sluis, Hans Van Dam, Barbara Suijkerbuijk, Rintse Van der Werf, and Harry Bunt. 2004. The DIAMOND Project. In: Proceedings of the 8th Workshop on the Semantics and Pragmatics of Dialogue (CATALOG 2004), Barcelona.
John Godfrey, Edward Holliman, and Jande McDaniel. 1992. SWITCHBOARD: Telephone Speech Corpus for Research and Development. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pages 517–520. IEEE Computer Society, San Francisco.
Gordana Hržica, Sara Košutar, and Kristina Posavec. 2021. Konektori i druge diskursne oznake u pisanome i spontanome govorenom jeziku. Fluminensia: časopis za filološka istraživanja, 33(1):25–52.
Nada Ivanetić. 1995. Govorni činovi. FF-press, Zavod za lingvistiku Filozofskoga fakulteta Sveučilišta u Zagrebu, Zagreb.
Andreas H. Jucker, Daniel Schreier, and Marianne Hundt (eds.). 2009. Corpora: Pragmatics and Discourse. Rodopi, Amsterdam.
Jeffrey L. Kallen and John M. Kirk. 2012. SPICE-Ireland: A User's Guide. https://pure.qub.ac.uk/en/publications/spice-ireland-a-users-guide.
Virna Karlić and Petra Bago. 2021. (Računalna) pragmatika: temeljni pojmovi i korpusnopragmatičke analize. FF Press, Zagreb. https://openbooks.ffzg.unizg.hr/index.php/Ffpress/catalog/book/125.
Andrew Kehoe and Matt Gee. 2007. New Corpora from the Web: Making Web Text More ‘Text-Like’. In: Studies in Variation, Contacts and Change in English 2. https://varieng.helsinki.fi/series/volumes/02/kehoe_gee/.
Andrew Kehoe and Matt Gee. 2012. Reader Comments as an Aboutness Indicator in Online Texts: Introducing the Birmingham Blog Corpus. In: Studies in Variation, Contacts and Change in English 12. https://varieng.helsinki.fi/series/volumes/12/kehoe_gee/.
Jelena Kuvač Kraljević and Gordana Hržica. 2016. Croatian Adult Spoken Language Corpus (HrAL). Fluminensia: časopis za filološka istraživanja, 28(2):87–102.
Geoffrey N. Leech. 1992. Corpora and Theories of Linguistic Performance. In: Directions in Corpus Linguistics, pages 105–122. De Gruyter, Berlin.
Ursula Lutzky and Andrew Kehoe. 2016. Your Blog is (the) Shit: A Corpus Linguistic Approach to the Identification of Swearing in Computer Mediated Communication. International Journal of Corpus Linguistics, 21(2):165–191.
Ursula Lutzky and Andrew Kehoe. 2017a. ‘I Apologize for My Poor Blogging’: Searching for Apologies in the Birmingham Blog Corpus. Corpus Pragmatics, 1(1):37–56.
Ursula Lutzky and Andrew Kehoe. 2017b. ‘Oops, I Didn't Mean to Be so Flippant’. A Corpus Pragmatic Analysis of Apologies in Blog Data. Journal of Pragmatics, 116:27–36.
Nikola Ljubešić and Filip Klubička. 2014. {bs, hr, sr}WaC – Web Corpora of Bosnian, Croatian and Serbian. In: Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29–35. Association for Computational Linguistics, Gothenburg, https://aclanthology.org/W14-0405.pdf.
Daniela Matić. 2011. Govorni činovi u političkome diskursu. PhD thesis. Faculty of Humanities and Social Sciences, Zagreb.
Nenad Miščević. 2018. Rođenje pragmatike. Orion Art, Beograd.
Nikolina Palašić. 2020. Pragmalingvistika – lingvistički pravac ili petlja? Hrvatska sveučilišna naklada, Zagreb.
Volha Petukhova, Martin Gropp, Dietrich Klakow, Gregor Eigner, Mario Topf, Stefan Srb, Petr Motlicek, Blaise Potard, John Dines, Olivier Deroo, Ronny Egeler, Uwe Meinz, Steffen Liersch, and Anna Schmidt. 2014. The DBOX Corpus Collection of Spoken Human-Human and Human-Machine Dialogues. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 252–258. European Language Resources Association, Reykjavik.
Predrag Piper, Ivana Antonić, Branislava Ružić, Sreto Tanasić, Ljudmila Popović, and Branko Tošović. 2005. Sintaksa savremenog srpskog jezika. Prosta rečenica. Institut za srpski jezik SANU, Beogradska knjiga, Matica srpska, Beograd.
Tatjana Pišković. 2007. Dramski diskurs između pragmalingvistike i feminističke lingvistike. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 33(1):325–341.
Olumide Popoola. 2017. A Dictionary, a Survey and a Corpus Walked into a Courtroom...: An Evaluation of Resources for Adjudicating Meaning in Trademark Disputes. In: The 9th International Corpus Linguistics Conference. Birmingham University, Birmingham. https://www.birmingham.ac.uk/documents/college-artslaw/corpus/conference-archives/2017/general/paper134.pdf.
Rashmi Prasad, Bonnie Webber, and Alan Lee. 2018. Discourse Annotation in the PDTB: The Next Generation. In: Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation, pages 87–97. Association for Computational Linguistics, Santa Fe. https://aclanthology.org/W18-4710.pdf.
Hub Prüst, Guido Minnen, and Robbert-Jan Beun. 1984. Transcriptie dialoogexperiment juni/juli 1984, IPO Rapport 481. Institute for Perception Research, Eindhoven University of Technology, Eindhoven.
Milorad Pupovac. 1990. Jezik i djelovanje. Biblioteka časopisa Pitanja, Zagreb.
Jesús Romero-Trillo (ed.). 2008. Pragmatics and Corpus Linguistics: A Mutualistic Entente. De Gruyter, Berlin.
Christoph Rühlemann and Karin Aijmer. 2015. Introduction. Corpus Pragmatics: Laying the Foundations. In: Corpus Pragmatics, pages 1–28.
John R. Searle. 1969. Speech Acts. Cambridge University Press, Cambridge.
John R. Searle. 1976. A classification of illocutionary acts. Language in Society, 5:1–23.
Josip Silić and Ivo Pranjković. 2007. Gramatika hrvatskoga jezika za gimnazije i visoka učilišta. Školska knjiga, Zagreb.
Tea Šegić. 2019. Tata kupi mi auto und Nivea Milk weil es nichts Besseres für die Hautpflege gibt. Filologija, 73:103–116.
Marko Tadić. 1996. Računalna obradba hrvatskoga i nacionalni korpus. Suvremena lingvistika, 41-42:603–611.
TEI Consortium (ed.). 2021. TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium.
PRISPEVKI
29
PAPERS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
Universal Dependencies for Slovenian:
An Upgrade of the Guidelines, Training Data and Parsing Model

Kaja Dobrovoljc*†‡, Luka Terčon†, Nikola Ljubešić‡†
* Faculty of Arts, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana, kaja.dobrovoljc@ff.uni-lj.si
† Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, luka.tercon@fri.uni-lj.si
‡ Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, nikola.ljubesic@ijs.si

Abstract
Universal Dependencies (UD) is an internationally harmonized annotation scheme for cross-linguistically comparable morphological and syntactic annotation of texts based on the principles of dependency grammar, which has, alongside more than 130 other languages of the world, also been successfully applied to the annotation of Slovenian texts. In this paper we present the results of recent UD-related activities within the project Development of Slovene in a Digital Environment, in which we upgraded the existing infrastructure by revising and thoroughly documenting the Slovenian UD annotation guidelines, by extending the SSJ-UD treebank of written Slovenian with new sentences from the ssj500k and ELEXIS-WSD corpora, and by training a new dependency parsing model in the CLASSLA-Stanza annotation tool. To support future applications in various areas of Slovenian language processing, we evaluate the new model in detail: in addition to an overall evaluation of parsing accuracy, we also report the accuracy for individual syntactic relations and the most frequent error types.
1. Introduction

Linguistically annotated corpora, i.e. digitized collections of texts in which the surface words are accompanied by manually assigned information about their grammatical properties at various levels of linguistic description (Ide and Pustejovsky, 2017), represent one of the fundamental language resources both for the development of language-technology tools and for corpus-linguistic research. Grammatical properties are typically assigned to texts on the basis of predefined annotation schemes (annotation systems), which, in addition to the inventory of possible labels, usually also include guidelines for assigning them to concrete grammatical phenomena. Since annotation schemes were historically created separately for individual languages, grammatical theories or even corpora, their resulting diversity prevented any direct comparison of the annotated data or of the computational tools built on them.

As a counterweight to this fragmentation, the Universal Dependencies annotation scheme was established in 2013.1 It strives for internationally (cross-linguistically) harmonized grammatical annotation of texts at the morphological and syntactic levels, in order to accelerate the development of multilingual language technologies, cross-lingual machine learning and contrastive linguistic analyses. Within the UD scheme, a universal inventory of categories and guidelines was established (17 parts of speech, 24 morphosyntactic features, 37 dependency relations), which now enables uniform annotation of similar grammatical phenomena across the world's languages, while also allowing language-specific extensions where necessary. The scheme is based on the principles of dependency grammar, which, compared to phrase-structure grammar, is better suited to languages with free word order and to direct use in various language-technology applications (Jurafsky and Martin, 2021); its theoretical foundations are presented in more detail by De Marneffe et al. (2021).

To date, more than 200 corpora (so-called dependency treebanks) in 130 languages of the world have been manually annotated with the UD scheme. Among them are the universal dependency treebanks of written Slovenian, SSJ (Dobrovoljc et al., 2017), and of spoken Slovenian, SST (Dobrovoljc and Nivre, 2016), which have thereby been directly involved in the development of numerous state-of-the-art tools for multilingual natural language processing (Zeman et al., 2018), as well as in diverse comparative linguistic studies (Futrell et al., 2015; Naranjo and Becker, 2018; Chen and Gerdes, 2018).

Given the importance of developing Slovenian resources within such international standardization initiatives, we have substantially upgraded the existing resources and the related infrastructure for annotating Slovenian texts according to the Universal Dependencies scheme within the national project Development of Slovene in a Digital Environment (RSDO),2 which aims to meet the needs for computational products and services in the field of language technologies for the Slovenian language.

1 https://universaldependencies.org/
2 https://slovenscina.eu/
In the remainder of the paper we present the course and the results of this activity. After a brief presentation of the initial version of the SSJ-UD corpus before the start of the RSDO project (Section 2), we describe the documentation of the slightly revised UD annotation guidelines for Slovenian (Section 3). We continue with a presentation of the annotation campaign (Section 4), in which more than 5,000 new sentences were manually parsed; together with the somewhat improved original corpus, they form the latest version of the SSJ-UD corpus (Section 5). In the second part of the paper we describe the construction of a predictive model for automatic dependency parsing trained on the new corpus (Section 6), which we then evaluate in the concluding part through an analysis of its overall accuracy (Section 7) and of its most frequent errors (Section 8).

2. Creation of the SSJ-UD corpus

The first version of the universal dependency treebank of written Slovenian, SSJ-UD,3 was produced through a semi-automatic conversion of ssj500k (Krek et al., 2020), a richly annotated reference training corpus for Slovenian that had previously been manually lemmatized, morphosyntactically tagged and syntactically parsed according to the JOS annotation system (Erjavec et al., 2010). While JOS lemmas and morphosyntactic tags are assigned to all tokens of the ssj500k corpus (586,248 tokens in 27,829 sentences), slightly less than half of the corpus is syntactically parsed (235,864 tokens in 11,411 sentences).

The conversion of the ssj500k corpus from the JOS annotation scheme to the UD scheme (Dobrovoljc et al., 2016; Dobrovoljc et al., 2017) was based on an extensive set of mapping rules for all three levels of the UD scheme: parts of speech, morphological features and dependency relations.4 Since the annotation principles of the two systems are (with a few exceptions) quite similar at the morphological level, the mapping rules for UD parts of speech and morphosyntactic features could be used to convert the entire ssj500k corpus, as well as the Sloleks lexicon based on the same system (Dobrovoljc et al., 2019), with manual disambiguation needed only for the part-of-speech categorization of the verb biti 'to be'.5

The syntactically parsed part of ssj500k, on the other hand, was converted to UD only partially: because the JOS system is coarser than UD, not all sentences could be converted fully automatically with sufficiently reliable accuracy, despite the detailed set of mapping rules. The unconverted part thus consisted mainly of sentences with structures annotated in JOS as so-called third-level links (the label modra), such as clausal coordination and juxtaposition, appositions and explanatory structures, particles (non-propositional adverbs), parentheticals and the like.

The first version of the SSJ-UD corpus, originally released as part of the UD v1.2 treebank collection in 2015, thus comprised 8,000 sentences or 140,670 tokens. Although the corpus was continuously improved by adapting it to changes in the general annotation guidelines and by fixing individual errors, its size remained unchanged until the recent extension presented in Section 4 of this paper.

3. Documentation of the UD guidelines for Slovenian

The general UD guidelines, as documented on the central website of the project,6 are, as a continuation of previous standardization initiatives and many years of collaborative development, designed to address the syntactic specifics of the widest possible range of languages in the shortest possible way. The general guidelines thus mainly contain prototypical definitions of the individual labels, descriptions of the most typical borderline cases and illustrations from selected languages, while it is the task of the treebank authors for individual languages to transfer these general guidelines to their concrete language data. The UD infrastructure allows these principles to be documented for each language as language-specific guidelines on the official website, but this is not obligatory, so the documentation of UD annotation guidelines for individual languages is largely left to the initiative of the data authors.

For Slovenian, only the guidelines for assigning parts of speech and morphosyntactic labels were documented at the time of the first release of the SSJ-UD corpus; these have since become somewhat outdated with the transition from UD v1 to UD v2 (Nivre et al., 2020). The guidelines for assigning UD syntactic relations to Slovenian texts, due to their extensiveness, were never documented in detail and could only be inferred implicitly from the conversion rules on the one hand and the published corpus on the other.

The first step within the RSDO project was therefore devoted to an exhaustive documentation of the UD guidelines for Slovenian at all three levels of annotation (parts of speech, morphosyntactic features and syntactic relations), in the form of a manual that explains and illustrates, on Slovenian examples, the use of the individual UD labels for annotating Slovenian texts. In addition to describing the original guidelines, we also introduced some minor changes in places where the original annotation of the SSJ-UD corpus was inconsistent or inadequate with respect to the universal guidelines. Among these we can highlight in particular the changes in the treatment of comparative structures (the property as the head of the comparison), emphatic particles (distinguishing between modifiers of nouns on the one hand and of predicates on the other), discourse connectives (distinguishing them by clausal position) and the free morpheme se/si (distinguishing between pronouns in object and expletive roles), all of which were, due to the limitations of the automatic conversion from the JOS system, originally annotated differently from what the general UD guidelines prescribe.

3 In this paper, we use the longer acronym SSJ-UD instead of the official name of the treebank (SSJ), given its similarity to the names of related corpora and projects in the Slovenian context.
4 The rules and scripts for the conversion from the JOS system to the UD system are available at https://github.com/clarinsi/jos2ud.
5 In contrast to the JOS system, in which occurrences of the verb biti are always tagged as a verb with the subtype auxiliary, regardless of their syntactic role or meaning, the UD system already distinguishes at the part-of-speech level between main verbs (label VERB) and auxiliary verbs (label AUX), the latter comprising verbs used as auxiliaries and as copulas.
6 https://universaldependencies.org/guidelines.html
The manual of UD guidelines for Slovenian7 contains, in addition to the descriptions of the individual grammatical categories and the principles of assigning them to Slovenian texts, a section with a more detailed treatment of difficult cases, which was also being extended during the annotation campaign described in Section 4. The publication of the Slovenian guidelines on the official UD website (in English) is in preparation, as is an inventory of open issues with initial recommendations for further improvements (in collaboration with the University of Nova Gorica).

4. Upgrading the SSJ-UD corpus

The second step of the project was an annotation campaign in which we manually annotated more than 5,000 new sentences from the ssj500k and ELEXIS-WSD corpora, while the annotation of the original version of the SSJ-UD corpus was also somewhat improved. In all three phases, annotation was carried out in the Q-CAT annotation tool (Brank, 2022), which now also supports the standard CoNLL-U format, while for comparing the annotated files (curation) we used a local installation of the WebAnno tool (Eckart de Castilho et al., 2016) maintained by CLARIN.SI.8 A more detailed analysis of the annotation process is described by Dobrovoljc and Ljubešić (2022); below we present only the most important results.
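For illustration, in the CoNLL-U format each token occupies one line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is a hypothetical minimal example for a simple Slovenian sentence, constructed by us for illustration rather than taken from the corpus, with the morphological fields left unspecified:

    # text = Janez bere knjigo.
    1   Janez    Janez    PROPN   _   _   2   nsubj   _   _
    2   bere     brati    VERB    _   _   0   root    _   _
    3   knjigo   knjiga   NOUN    _   _   2   obj     _   _
    4   .        .        PUNCT   _   _   2   punct   _   _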
4.1. Extension with semi-converted sentences from ssj500k

As already mentioned in Section 2, some of the syntactically parsed sentences of the ssj500k corpus could not be fully converted to UD labels due to the limitations of the conversion rules and were therefore not included in the original version of the SSJ-UD treebank; they nevertheless represented the logical starting point for further extending the UD data for Slovenian. In the first phase of the extension, annotators thus manually reviewed these 3,411 semi-converted sentences (96,194 tokens), among which 22,377 tokens (23.5%) had no UD syntactic relation assigned. For easier visualization, these were labeled with the relation unknown (Figure 1); the annotators (two per sentence) were not only creating new links but also checking the adequacy of the already existing (converted) links.

Figure 1: Example of a semi-converted sentence from ssj500k with missing UD relations (unknown), as displayed in the Q-CAT annotation tool.

Among the tokens initially lacking a UD relation, almost half were punctuation (punct), which was expected given the conversion rules, since punctuation was mostly attached to the relevant head only after all the other tokens of the sentence had been determined, in particular the sentence root (root, usually the head of the main-clause predicate or another hierarchically most prominent element of the sentence), which represents the second most frequent type of unconverted tokens (12%). It is followed by the relations parataxis (9%) and conj (6%), which are used to connect juxtaposed and coordinated clauses, i.e. precisely the structures that could not be converted with sufficiently reliable accuracy using rules alone.

4.2. Extension with sentences from the ELEXIS-WSD corpus

In the second phase of the extension, the ELEXIS-WSD-SL corpus was syntactically parsed as well. This is the Slovenian part of the parallel corpus ELEXIS-WSD (Martelli et al., 2021; Martelli et al., 2022), developed for the purposes of automatic word sense disambiguation, which contains Wikipedia texts translated into several European languages (Schwenk et al., 2021). The Slovenian ELEXIS-WSD corpus contains 2,024 sentences (31,237 tokens) that had previously been manually tokenized, lemmatized and morphosyntactically tagged according to the JOS system; on this basis we automatically converted the corpus to UD parts of speech and morphosyntactic labels with a conversion script, while the occurrences of the verb biti were disambiguated manually.

The corpus annotated in this way was first parsed automatically with the CLASSLA-Stanza tool (Ljubešić and Dobrovoljc, 2019), and the correctness of the automatically assigned parses was then reviewed by three annotators and a final curator. In this way, 1,534 (4.91%) syntactic relations were manually corrected, with structures labeled nmod, advmod, obl, conj and punct prevailing among them, which, as we shall see below, is consistent with the most frequent error types of the parser in general (Section 8).

4.3. Improving the annotation of the original corpus

In addition to adding newly parsed sentences, we also improved the annotation of the initial version of the SSJ-UD corpus, based on the slight revision of the guidelines (Section 3), the analysis of the manual corrections of converted relations (Section 4.1) and other identified inconsistencies. The approximately 30 identified types of errors or inconsistencies included, for example, appositive structures, a high share of (unjustified) non-projective links,9 inconsistent distinctions between juxtaposed and coordinated clauses or between direct and indirect objects, etc. For each of these categories, we used heuristic queries to create subcorpora of sentences with potentially problematic annotations, which the annotators then manually reviewed and corrected in accordance with the guidelines. In this way, 1,670 syntactic labels were corrected in the initial corpus, which nevertheless represents a relatively small part of the entire corpus (1.2%).

7 The manual is currently available as a working draft; it will be officially published at the conclusion of the RSDO project.
8 https://www.clarin.si/webanno/
9 A link between word A and word B is projective if word A is also (indirectly) superordinate to all the other words between A and B, i.e. there is a path from A to every word between A and B. Visualized graphically, the links of a non-projective tree cross each other. In languages with free word order, such as Slovenian, this is possible, but nevertheless rare.
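The projectivity condition from footnote 9 can be operationalized directly. The following minimal Python sketch (the function name and the input representation are our own, not part of the described infrastructure) lists the arcs of a dependency tree that violate it:

    def non_projective_arcs(heads):
        # heads maps each token id (1-based) to its head id (0 = sentence root).
        def dominated_by(a, b):
            # Does word a (indirectly) govern word b?
            while b != 0:
                b = heads[b]
                if b == a:
                    return True
            return False

        bad = []
        for dep, head in heads.items():
            if head == 0:
                continue
            lo, hi = sorted((head, dep))
            # An arc is projective if its head governs every word between the two.
            if any(not dominated_by(head, k) for k in range(lo + 1, hi)):
                bad.append((head, dep))
        return bad

    # Toy tree with crossing arcs 3 -> 1 and 4 -> 2.
    print(non_projective_arcs({1: 3, 2: 4, 3: 0, 4: 3}))  # [(4, 2)]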
5. The new version of the SSJ-UD corpus

In the final step, we merged the initial SSJ-UD corpus with its somewhat improved annotation (Section 4.3) with the new sentences from the ssj500k (Section 4.1) and ELEXIS-WSD (Section 4.2) corpora, thus obtaining a new version of SSJ-UD, the reference universal dependency treebank of written Slovenian,10 which was first released as part of the official UD v2.10 release (Zeman et al., 2022). At the conclusion of the RSDO project, the SSJ-UD treebank will also be integrated into SUK, the new Slovenian reference training corpus.

5.1. Corpus composition

As shown in Table 1, the new version contains 5,435 newly parsed sentences (+67.9%) compared to the original version, or an almost twice larger number of tokens (+126,427, i.e. +89.9%), which today places the SSJ-UD corpus in 30th place by token count among a total of 228 UD treebanks. With the extension, the SSJ-UD corpus has also become more diverse, as the three subcorpora (the original sentences from ssj500k, the new sentences from ssj500k, and the sentences from ELEXIS-WSD) differ from one another both in the type of texts they contain and in their syntactic complexity.

While the ssj500k texts, as a sample of the FidaPLUS corpus (Arhar Holdt, 2007), mostly comprise originally Slovenian literary, non-literary and journalistic texts, the ELEXIS-WSD corpus contains translated encyclopedic texts from Wikipedia. On the other hand, the original SSJ-UD and the ELEXIS-WSD corpus are similar in terms of complexity (shorter and syntactically simpler sentences), whereas the new sentences from ssj500k are substantially longer.

Finally, from a methodological point of view it is important to point out that the three subcorpora also differ in the provenance of the assigned UD labels: the labels of the original SSJ-UD are for the most part the result of automatic conversion from the JOS system, the labels of the new sentences from ssj500k are a combination of conversion and manual review, and the labels of the sentences from the ELEXIS-WSD corpus were reviewed manually in their entirety.

    Subcorpus              Sentences    Tokens    Avg. length
    Original SSJ-UD            8,000   140,670          17.58
    New from ssj500k           3,411    95,194          27.91
    New from ELEXIS-WSD        2,024    31,233          15.43
    New SSJ-UD total          13,435   267,097          19.88

Table 1: Composition of the new version of the SSJ-UD corpus (from UD v2.10 onwards).

5.2. Data split

Part of a treebank's release in the official UD collection is its division into training, validation and test sets, which are standardly used in the development and evaluation of predictive models based on this data. Here we followed the principles of the data split in the original version, in which the subsets were divided according to the order of appearance in the corpus. Given that the new sentences from ssj500k are evenly dispersed throughout the corpus, we simply added them to the already existing split of the sentences in the original version, keeping the same ratio (80% training, 10% validation, 10% test), and then added the sentences from the ELEXIS-WSD corpus to each of the three sets in the same ratio. The composition of the subsets thus mirrors the diversity of the new version of the SSJ-UD corpus described in Section 5.1 and, through the representativeness of the test data with respect to the training data, ensures a more adequate evaluation that is less biased with respect to text type.
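The split principle described above can be sketched in a few lines (a simplified illustration under the stated 80/10/10 ratio; the function and variable names are ours):

    def extend_split(train, dev, test, new_sents):
        # Append new sentences to an existing split, preserving corpus
        # order and the original 80/10/10 proportions.
        n = len(new_sents)
        n_train, n_dev = int(0.8 * n), int(0.1 * n)
        train.extend(new_sents[:n_train])
        dev.extend(new_sents[n_train:n_train + n_dev])
        test.extend(new_sents[n_train + n_dev:])
        return train, dev, test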
6. The parsing model

In the second phase of the project, we used the new, substantially larger version of the manually annotated SSJ-UD corpus to train a new predictive model for UD dependency parsing in the CLASSLA-Stanza annotation tool (Ljubešić and Dobrovoljc, 2019),11 which is likewise being developed within the RSDO project as the fundamental software tool for annotating Slovenian texts. It is a derivative of the open-source Stanza tool (Qi et al., 2020) that introduces several improvements over the original at the levels of tokenization, morphosyntactic tagging and lemmatization, while its dependency parser differs from the original one (Dozat and Manning, 2016), based on an extended bidirectional long short-term memory (BiLSTM) method, mainly in its use of the CLARIN.SI-embed.sl word embeddings (Ljubešić and Erjavec, 2018), trained on 3.5 billion words of Slovenian text.

In both the training and the evaluation of the parsing model we used manually annotated data at the lower levels of annotation (tokenization, sentence segmentation, morphosyntactic tagging, lemmatization), since at this stage of the parser's development we were primarily interested in the accuracy of the predictive model in isolation, without the influence of the tool's predictive characteristics at the lower levels.

The construction of the predictive model, its comparison with the model trained on the original version of SSJ-UD, and its evaluation on the individual subcorpora are described in more detail by Dobrovoljc and Ljubešić (2022), who find that the model trained on the new version of the SSJ-UD corpus is substantially improved over the model trained on the original version, owing to the increased amount of training data and its diversification.

To shed light on the advantages and shortcomings of using the new parsing model in various language-technology and linguistic applications, and at the same time to identify the priorities for its further improvement, in the remainder of the paper we extend these findings with a more detailed evaluation of the model's overall accuracy (Section 7) on the one hand and an analysis of the most frequent error types (Section 8) on the other.

10 Although the UD infrastructure allows the publication of any number of treebanks, we deliberately decided to add the new sentences to the existing SSJ-UD treebank instead of publishing new UD treebanks for Slovenian, in order to ensure the most effective use of this data in the wider language-technology community, where, for the sake of simplicity, models are often developed only on a selected, usually the largest, treebank of a given language.
11 https://pypi.org/project/classla/
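For downstream users, the resulting model is distributed through the classla package referenced in footnote 11. A minimal usage sketch follows; note that, unlike the evaluation setting described above, it runs the tool's own lower-level annotation rather than gold annotations:

    import classla

    classla.download('sl')  # fetch the Slovenian models
    nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse')
    doc = nlp('Ta svet je zelo podoben Zemlji.')
    for word in doc.sentences[0].words:
        print(word.id, word.text, word.head, word.deprel)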
7. Overall accuracy

For the quantitative evaluation of the model's overall accuracy, we used the standard protocol: the model trained on the training and validation sets was used to parse the test set, and the predicted labels were then compared with the manually assigned ones. To report accuracy, we use the established LAS metric (labeled attachment score), which gives the share of tokens with a correctly predicted head token and type of syntactic relation between them; we summarize this share with the F1 score, the harmonic mean of precision and recall.12

The results presented in Table 2 show that the parsing model achieves an overall accuracy of 93.21 LAS F1, which, somewhat simplified, means that on average the model errs on fewer than seven out of every hundred annotated tokens, i.e. assigns them an incorrect head token and/or an incorrect type of relation between them.13

As the results for the individual relation types show,14 however, this overall accuracy estimate is not representative of all types of syntactic structures, as the model is substantially more accurate in predicting some relations than others.

12 The computations are based on the official evaluation script of the CoNLL 2018 Shared Task (Zeman et al., 2018), which we additionally adapted so that, in addition to the overall accuracy, it also returns results for the individual syntactic relations and other relevant labels.
13 This accuracy is in line with the accuracy of the Stanza tool for other languages and treebanks (https://stanfordnlp.github.io/stanza/performance.html) and with the accuracy of other contemporary parsers in general (https://universaldependencies.org/conll18/results.html), although a direct comparison is not meaningful due to the specifics of the evaluation methodology.
14 Table 2 does not include the compound relation, which according to the guidelines is not used for Slovenian. For the relations dislocated, goeswith and reparandum no accuracy is reported (n/a), as they do not occur in the test set. The accuracy of derived relations, i.e. subtypes (e.g. flat:name, flat:foreign), is reported jointly with the core label (e.g. flat).
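In code, the metric reduces to a simple proportion. The sketch below (our own minimal illustration, not the adapted CoNLL 2018 script itself) treats each token as a (head, relation) pair and, given identical tokenization on both sides, yields the LAS F1 value directly:

    def las_f1(gold, pred):
        # A token is correct if both its head and its relation label match.
        # With identical tokenization, precision = recall = this share,
        # so the F1 score equals the plain accuracy.
        correct = sum(g == p for g, p in zip(gold, pred))
        return 100.0 * correct / len(gold)

    gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj'), (2, 'punct')]
    pred = [(2, 'nsubj'), (0, 'root'), (2, 'nmod'), (2, 'punct')]
    print(las_f1(gold, pred))  # 75.0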
Among the relations predicted with the highest accuracy are, as expected, function words such as prepositions (case; 99.17), the auxiliary verb biti (aux; 98.93), determiner pronouns and adverbs (det; 98.79), subordinating conjunctions (mark; 98.69), expletive pronouns (expl; 96.71) and coordinating conjunctions (cc; 96.27), in short, tokens that occur in highly predictable forms and syntactic positions.

In addition to these relations, the model also achieves relatively good accuracy in predicting some core syntactic structures, such as nominal objects (obj; 95.53) and subjects (nsubj; 95.28), and it is also above-average in identifying the sentence root (root; 96.26), usually the head of the main-clause predicate, and the copula verb biti (cop; 95.43), which occurs in structures with predicative complements.

Among the relations for which the model achieves the worst results we expectedly find vocatives (vocative; 0.0), as only one example occurs in the test set, and unspecified structures (dep; 54.55), since this label is used as a last resort, mainly for attaching marginal, irregular phenomena that cannot be assigned any other relation (e.g. remnants of page numbering introduced during text digitization).

Although the accuracy of annotating nominal appositions (appos; 63.40), 'orphaned' clause elements in sentences with verbal ellipsis (orphan; 68.24), discourse particles (discourse; 69.23), clausal juxtapositions (parataxis; 70.35) and enumeration lists (list; 75.86) has improved substantially with the new version of the SSJ-UD corpus compared to the original model (Dobrovoljc and Ljubešić, 2022), these relations remain among those with the lowest accuracy, which is expected given their looser grammatical connection to the predicate or to their superordinate clause elements.

Among the other relations with below-average annotation accuracy we can further single out subordinate clauses of various types, such as adverbial (advcl; 75.86), adnominal (acl; 81.73), subject (csubj; 85.53) and complement clauses (ccomp; 90.67). In addition to indirect objects (iobj; 81.66), which are difficult to identify mainly because of the shortcomings of the current annotation guidelines,15 the model is also considerably challenged by coordination, especially between clauses (conj; 85.91), by nominal modifiers (nmod; 87.44), and by adverbial modifiers of predicates, nouns and adjectives (advmod; 89.95).

8. Most frequent errors

In the second step of the evaluation, we complemented the analysis of the model's reliability in parsing the individual relation types with a more detailed analysis of the most frequent error types. Table 3 summarizes the distribution of errors according to which of the two predicted pieces of information (the identifier of the head token and the type of syntactic relation between them) the model actually got wrong. For each error type we also list the five most frequent subtypes by relation, with the counts given jointly for errors in both directions (e.g. obl-nmod covers both predicting obl instead of nmod and predicting nmod instead of obl).

The frequent error types identified within each category through manual analysis of the incorrectly annotated examples are described below, focusing on the most frequent ones.

8.1. Incorrect head prediction

As shown in Table 3, slightly more than half (52.8%) of the errors are those in which the model correctly predicted the syntactic role of the token (the correct relation or label) but erred in predicting its head (the source of the relation).

The most frequent error in head attachment involves the punct relation, which marks punctuation. These are mostly cases in which the heads of other structures in the sentence, to which the punctuation is normally attached, are themselves incorrectly determined.

15 Due to the complex interplay of the morphological, syntactic and semantic properties distinguishing direct from indirect objects, the current UD guidelines recommend that in sentences with a single overt object, that object is labeled as a direct object (obj), regardless of its case or semantic role. This means that even objects in the dative, which typically act as indirect objects, are labeled as direct objects in the absence of other objects.
    Relation     UD description                       Slovenian term                             LAS F1
    acl          clausal modifier of noun             stavčni prilastki                           81.73
    advcl        adverbial clause modifier            prislovni odvisniki                         75.86
    advmod       adverbial modifier                   prislovna določila (see footnote 16)        89.95
    amod         adjectival modifier                  pridevniški prilastki                       98.90
    appos        appositional modifier                pristavčna določila                         63.40
    aux          auxiliary verb                       pomožni glagoli                             98.93
    case         case marking preposition             predlogi                                    99.17
    cc           coordinating conjunction             priredni vezniki                            96.27
    ccomp        clausal complement                   stavčna dopolnila (predmetni odvisniki)     90.67
    conj         conjunct                             priredno zloženi elementi                   85.91
    cop          copula verb                          vezni glagoli                               95.43
    csubj        clausal subject                      osebkovi odvisniki                          85.53
    dep          unspecified dependency               nedoločena povezava                         54.55
    det          determiner                           določilniki                                 98.79
    discourse    discourse element                    diskurzni členki                            69.23
    dislocated   dislocated element                   dislocirani elementi                          n/a
    expl         expletive                            ekspletivne besede                          96.71
    fixed        fixed multi-word expression          funkcijske zveze                            93.33
    flat         flat multi-word expression           eksocentrične zveze                         92.12
    goeswith     disjointed token                     razdruženi deli besed                         n/a
    iobj         indirect object                      nepremi predmeti                            81.66
    list         list                                 seznami                                     75.86
    mark         marker (subordinating conjunction)   podredni vezniki                            98.69
    nmod         nominal modifier                     samostalniški prilastki                     87.44
    nsubj        nominal subject                      samostalniški osebki                        95.28
    nummod       numeric modifier                     številčna določila                          94.23
    obj          (direct) object                      premi predmeti                              95.53
    obl          oblique nominal (adjunct)            odvisne samostalniške zveze                 91.14
    orphan       dependent of missing parent          elementi v eliptičnih strukturah            68.24
    parataxis    parataxis                            stavčna soredja                             70.35
    punct        punctuation symbol                   ločila                                      93.08
    reparandum   overridden disfluency                samopopravljanja                              n/a
    root         root element                         koren povedi                                96.26
    vocative     vocative                             ogovori                                      0.00
    xcomp        open clausal complement              odprta stavčna dopolnila                    92.87
    All relations                                                                                 93.21

Table 2: Accuracy of the new CLASSLA-Stanza model for UD dependency parsing, according to the LAS metric.
Incorrectly attached punctuation is thus mostly a consequence of parsing errors in its superordinate structures, as shown by the example in Figure 2, in which the parser erroneously interprets the final clause as a coordination of the relative clause preceding it, with the (incorrectly attached) comma following suit.

The second frequent group involves the so-called emphatic particles or adverbs, such as the words tudi 'also', še 'still', le 'only', že 'already', etc., which are assigned the advmod relation16 and whose placement in Slovenian is relatively free: they can modify the predicate as well as individual clause elements, which can often only be inferred from the context or from prosodic emphasis when reading. As the example in Figure 3 shows, the parser often attaches these words to the predicate of the clause instead of to the emphasized noun. This is not surprising, given that this is one of the categories on which the annotators most frequently disagreed; it was likewise annotated inconsistently in the original corpus, in which, at conversion, these tokens were always attached to the predicate regardless of their role.

With the remaining three analyzed relations with frequently incorrect head attachment, i.e. nmod, conj and acl, a similar error occurs: the parser reliably recognizes the type of the superordinate structure (e.g. noun phrase, adjectival phrase or predicate), but instead of the right structure it selects the nearest suitable phrase to the left as the head, which is not always correct, since the true source of the relation sometimes occurs earlier in the sentence (Figure 4).

16 The advmod relation is used to annotate adverbs functioning as modifiers, which includes both adverbs functioning as circumstantial complements of predicates (what the Slovenian reference grammar calls adverbials, e.g. pridem takoj 'I will come immediately') and adverbs functioning as modifiers of adjectival, adverbial or nominal phrases (adverbial attributes, e.g. izjemno prilagodljiv 'extremely adaptable').
Figure 2: Example of a mismatch between the manually (top) and automatically (bottom) assigned head of the punct relation. Example sentence: Ta svet je zelo podoben Zemlji, na kateri živijo ljudje, vendar z nekaj izjemami.
    Error type               Number of errors
    Wrong head                            914
      punct-punct                         248
      advmod-advmod                       166
      nmod-nmod                           111
      conj-conj                            99
      acl-acl                              53
    Wrong head and label                  517
      obl-nmod                            141
      parataxis-root                       37
      acl-advcl                            22
      root-nsubj                           22
      nsubj-nmod                           19
    Wrong label                           299
      conj-parataxis                       23
      obl-nsubj                            19
      appos-conj                           17
      obl-obj                              13
      iobj-obj                             13
    All errors                           1730

Table 3: Distribution of the parsing model's errors by error type.

Figure 3: Example of an emphatic particle incorrectly attached: as advmod to the emphasized element in the manual annotation (top), but as an adverbial modifier of the predicate (advmod, bottom) by the parser. Example phrase: ima pa tudi ambicije sodelovati za kreacijo oblek.

Figure 4: Example of a mismatch between the manually (top) and automatically (bottom) identified head of a prepositional phrase functioning as a postnominal modifier (nmod). Example phrase: na turnirju mladih judoistov za pokal Ptuja.

8.2. Incorrect prediction of both the head and the relation

Next in frequency are the errors in which the model erred both in predicting the head token and in predicting their syntactic relation (29.9%). The most prominent among them is the confusion of structures labeled obl17 and nmod, which represents the third most frequent error (sub)type overall. An analysis of the examples shows that these are mostly cases in which a prepositional phrase functioning as an adverbial modifier of the predicate (obl) stands immediately after some noun phrase, and the model misinterprets the adverbial as its postnominal modifier, for which the nmod relation is used, as shown by the example in Figure 5.

Less frequent in this category are errors in determining the main clause in a sequence of two or more juxtaposed clauses, especially with parentheticals or direct speech (parataxis-root), errors in distinguishing between adverbial clauses and adnominal clauses, often in combination with the conjunction kot 'as' (acl-advcl), the swapping of the subject and the predicative complement in structures with the copula biti (root-nsubj), and errors in determining the subject in sentences in which the subject is not explicitly expressed (nsubj-nmod).
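The categorization underlying Table 3 can be reproduced with a few lines of Python (a sketch under our own data representation, with label confusions counted symmetrically, as in the table):

    from collections import Counter

    def error_distribution(gold, pred):
        types, confusions = Counter(), Counter()
        for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
            if (g_head, g_rel) == (p_head, p_rel):
                continue
            if g_head != p_head and g_rel == p_rel:
                types['wrong head'] += 1
            elif g_head != p_head:
                types['wrong head and label'] += 1
            else:
                types['wrong label'] += 1
            if g_rel != p_rel:
                # obl-nmod and nmod-obl are counted as one subtype.
                confusions['-'.join(sorted((g_rel, p_rel)))] += 1
        return types, confusions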
8.3. Incorrect relation prediction

Of all three error categories, the least frequent are those in which the parser attached the token to the correct head but assigned the wrong label to the relation (17.3%). Compared with the first two categories, the types are here distributed more evenly across relations.

Confusions between the labels conj and parataxis18 occur above all in longer sentences, in which other structures (e.g. subordinate clauses) intervene between two coordinated clauses or between the coordinating conjunction and the second conjunct.

17 The obl relation is used for dependent nominal and prepositional phrases functioning as non-core arguments of the predicate. In addition, this relation is used for non-verbal structures with comparative conjunctions.
18 The parataxis relation is used to annotate clausal juxtapositions of various kinds. These are relations between a word (usually the head of the main clause) and other elements that are not in coordination, subordination or any other core grammatical relation with it.
Figure 5: Example of prepositional phrases functioning as adverbial modifiers (obl, top) incorrectly parsed as postnominal modifiers (nmod, bottom). Example sentence: Ta v primeru potrebe po svoji presoji napoti bolnika k specialistu.
Nominal phrases functioning as adverbial modifiers (which receive the obl relation) are incorrectly labeled as subjects (nsubj) mainly in combination with verbs such as imenovati 'to name', praviti 'to call', etc., where they appear in the nominative case (e.g. pravimo jim mikroznaki 'we call them micro-signs').

Among the other types of incorrectly assigned relations, a frequent one is the ambiguity between noun phrases functioning as appositions (appos) on the one hand and coordinated elements (conj) on the other, especially when the final element of an asyndetic coordination stands at the end of the sentence. There are also errors in distinguishing between adverbial modifiers and objects, mainly with noun phrases expressing the temporal or spatial frame of an event (obl-obj), as well as incorrect identification of the direct (obj) versus the indirect object (iobj).

9. Conclusion

In this paper we presented the upgrade of the SSJ-UD treebank, the reference manually parsed corpus following the cross-linguistically harmonized Universal Dependencies scheme: after slightly revising and exhaustively documenting the annotation guidelines for Slovenian, we extended the corpus with new sentences and then trained a new predictive model for the dependency parsing of Slovenian texts on the new training set. A detailed quantitative and qualitative analysis of its accuracy showed that the model generally achieves relatively good results, whereby substantially higher reliability can be expected when parsing some structures than others.

Given the international relevance of the UD scheme, the results represent an important contribution to the further development of language technologies for Slovenian both in the Slovenian and in the international context: given the open access and standardized distribution of UD treebanks, the new Slovenian data can be expected to be integrated soon into numerous other parsing tools and applications based on them (e.g. Nguyen et al. (2021)). In addition to dependency parsing models such as the one presented in this paper, the almost twice larger amount of training data for Slovenian is also invaluable for the further development of models for lemmatization and morphological tagging according to the UD scheme, which internationally mostly rely only on the officially released UD treebanks, such as SSJ-UD, and not on resources developed and distributed in a local context, such as the complete ssj500k corpus or the emerging SUK training corpus.

Although the UD scheme was originally established primarily for the needs of language-technology research, numerous influential comparative linguistic studies also demonstrate its relevance in the field of linguistics, including Slovenian studies, where the methodological potential of syntactically parsed corpora has not yet been fully exploited (Ledinek, 2018). We believe that the exhaustively documented guidelines, the extensive manually annotated corpus and the systematic evaluation of the accuracy of the model trained on it represent an important contribution to future linguistic research on manually and automatically parsed Slovenian corpora, whereby, given the complex structure of such corpora, it is also essential to establish an adequate infrastructure for their analysis.

Of course, from the point of view of both language-technology and linguistic use, it makes sense to continuously upgrade the presented results in the future as well, which includes both improving the underlying guidelines and implementing them consistently. Given the methodological differences in the creation of the individual parts of the SSJ-UD corpus presented in this paper, and the inconsistencies detected during the qualitative error analysis, the consolidation of the existing corpus is certainly just as sensible as its further expansion.

10. Acknowledgements

The presented work was supported by the project Development of Slovene in a Digital Environment, financed by the Ministry of Culture of the Republic of Slovenia and the European Regional Development Fund, and by the research programme Language Resources and Technologies for Slovene (no. P6-0411), financed by the Slovenian Research Agency from the state budget. Our thanks also go to the annotators of the new data (Tina Munda, Ina Poteko, Rebeka Roblek, Luka Terčon, Karolina Zgaga) and to Tomaž Erjavec, Luka Krsnik, Cyprian Laskowski and Mihael Šinkec for technical support.

11. References

Špela Arhar Holdt. 2007. Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo, 52(2).
Janez Brank. 2022. Q-CAT corpus annotation tool 1.3. Slovenian language resource repository CLARIN.SI.
Xinying Chen and Kim Gerdes. 2018. How do Universal Dependencies distinguish language groups. Quantitative Analysis of Dependency Structures, 72:277–294.
Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek. 2016. Pretvorba korpusa ssj500k v univerzalno odvisnostno drevesnico za slovenščino. In: Proceedings of the Conference on Language Technologies and Digital Humanities.
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek. 2017. The Universal Dependencies Treebank for Slovenian. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, BSNLP@EACL 2017, pp. 33–38.
Kaja Dobrovoljc, Tomaž Erjavec and Nikola Ljubešić. 2019. Improving UD processing via satellite resources for morphology. In: Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pp. 24–34, Paris, France, August. Association for Computational Linguistics.
Kaja Dobrovoljc and Nikola Ljubešić. 2022. Extending the SSJ Universal Dependencies treebank for Slovenian: Was it worth it? In: Proceedings of the 16th Linguistic Annotation Workshop (LAW 2022), June.
Kaja Dobrovoljc and Joakim Nivre. 2016. The Universal Dependencies treebank of spoken Slovenian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 1566–1573, Portorož, Slovenia, May. European Language Resources Association (ELRA).
Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pp. 76–84, Osaka, Japan, December. The COLING 2016 Organizing Committee.
Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).
Richard Futrell, Kyle Mahowald and Edward Gibson. 2015. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.
Nancy Ide and James Pustejovsky. 2017. Handbook of Linguistic Annotation, volume 1. Springer.
Dan Jurafsky and James H. Martin. 2021. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd Edition Draft. Prentice Hall Series in Artificial Intelligence. Prentice Hall, Pearson Education International.
Simon Krek, Tomaž Erjavec, Kaja Dobrovoljc, Polona Gantar, Špela Arhar Holdt, Jaka Čibej and Janez Brank. 2020. The ssj500k training corpus for Slovene language processing. In: Proceedings of the Conference on Language Technologies and Digital Humanities, pp. 24–33, Ljubljana, Slovenia, September. Institute of Contemporary History.
Nina Ledinek. 2018. Skladenjska analiza slovenščine in slovenski jezikoslovno označeni korpusi. Jezik in slovstvo, 63(2/3).
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 29–34, Florence, Italy, August. Association for Computational Linguistics.
Nikola Ljubešić and Tomaž Erjavec. 2018. Word embeddings CLARIN.SI-embed.sl 1.0. Slovenian language resource repository CLARIN.SI.
Federico Martelli, Roberto Navigli, Simon Krek, Carole Tiberius, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael-J. Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej and Tina Munda. 2021. Designing the ELEXIS parallel sense-annotated dataset in 10 European languages. In: eLex 2021 Proceedings. Lexical Computing CZ.
Federico Martelli, Roberto Navigli, Simon Krek, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Carole Tiberius, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej and Tina Munda. 2022. Parallel sense-annotated corpus ELEXIS-WSD 1.0. Slovenian language resource repository CLARIN.SI.
Matías Guzmán Naranjo and Laura Becker. 2018. Quantitative word order typology with UD. In: Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), December 13–14, 2018, Oslo University, Norway, no. 155, pp. 91–104. Linköping University Electronic Press.
Minh Van Nguyen, Viet Lai, Amir Pouran Ben Veyseh and Thien Huu Nguyen. 2021. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4034–4043, Marseille, France, May. European Language Resources Association.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.
Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Francisco Guzmán. 2021. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1351–1361, Online, April. Association for Computational Linguistics.
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1–21, Brussels, Belgium, October. Association for Computational Linguistics.
Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg et al. 2022. Universal Dependencies 2.10. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
A Comparison of Word Splitting Methods for Slovene–English Machine Translation

Gregor Donaj, Mirjam Sepesy Maučec
Faculty of Electrical Engineering and Computer Science, University of Maribor
Koroška cesta 46, 2000 Maribor
gregor.donaj@um.si; mirjam.sepesy@um.si

Abstract
Given today's technology for graphical processing units, neural machine translation systems can use only a limited vocabulary, negatively affecting translation quality. The use of subword units can alleviate the problems of vocabulary size and language coverage. However, with further technological development, the limited vocabulary and the use of subword units are losing significance. This paper presents different word splitting methods with different final vocabulary sizes. We apply these methods to the machine translation task for the Slovene–English language pair and compare them in terms of translation quality, training and translation speed, and model size. We also include a comparison with word-based translation models.
1. Introduction

Machine translation has flourished over the past decade, owing above all to ever larger collections of bilingual corpora and to the availability of ever greater computational power, which makes it possible to train complex neural networks.

The most intensively researched approaches to machine translation today are based on neural networks (Stahlberg, 2020). Three basic architectures have become established: recurrent neural networks (RNN), convolutional neural networks (CNN) and self-attention architectures.

The use of neural networks, however, also brings technical challenges. Due to the computational complexity, the use of graphical processing units (GPUs) is indispensable in practice. These have a limited amount of working memory, which is why we cannot use arbitrarily large neural networks. The size of a neural network in machine translation depends on the chosen architecture, the hyperparameter settings of the network and the vocabulary size. A limited vocabulary, in turn, means poor coverage of the language's lexicon and consequently additional translation errors. The languages between which we translate also pose different challenges and have their own specific properties.

In this paper we test various data-driven word splitting methods that reduce the vocabulary size. We chose methods that are well known and established but are based on rather different optimization criteria. We apply these methods to the case of machine translation between English and Slovene. We present results in terms of translation quality, training and translation speed, the size of the produced models and their GPU memory consumption. We also compare all the methods with a word-based model without splitting.

2. Vocabulary units in machine translation systems

The most intuitive choice of a translation system's vocabulary unit is the word, which is also the most frequently chosen basic unit in other language-technology procedures. However, it brings numerous challenges. Sufficiently good coverage of a language's lexicon requires large vocabularies, which is a particularly pronounced problem for highly inflected languages. The consequence of too small a vocabulary is a high share of unknown (out-of-vocabulary) words, which strongly degrades translation quality.

To cope with these problems, various alternative vocabulary units have been proposed. The smallest vocabulary unit used has been the letter or character, which has proven to be a very robust unit, less sensitive to noise and to domain differences between the training and test corpora (Heigold et al., 2018; Gupta et al., 2019). Certain adaptations of the neural network architecture are needed, however, since the segment length is many times greater than that of a segment using words as vocabulary units. The consequence is poorer modeling of long-distance dependencies in texts.

Vocabulary units between the letter and the word in size have also been tested. Subword units obtained by data-driven splitting, which keep frequent character sequences as units, have generally proven the most effective, as they largely preserve syntactic and semantic properties (Sennrich et al., 2016; Banerjee and Bhattacharyya, 2018). Since a word can be split in several different ways, a method of split regularization has also been proposed (Kudo, 2018). The morpheme, a linguistic unit, could also be used as the vocabulary unit, but this would require grammatical knowledge.
2.1. The Byte-Pair Encoding procedure
BPE (Byte-Pair Encoding) is originally a data compression procedure that works by iteratively replacing the most frequent pair of symbols with a new symbol. Sennrich et al. (2016) adapted the algorithm for splitting words.
The procedure first initialises a vocabulary containing all letters and other characters (digits, punctuation) that occur in the corpus, plus an end-of-word symbol. The corpus content is treated as a sequence of symbols, which in the first step are just the individual letters and other characters. An iterative procedure follows, in which the most frequent pair of adjacent symbols is found and replaced by a new symbol. These iterative steps are called merges. The procedure performs no merges that would include the end-of-word symbol, which prevents words in the final corpus from being merged together instead of being split.
The parameter of the procedure is the number of merges, which directly determines the size of the final vocabulary: the exact final vocabulary size equals the number of merges plus the number of characters in the initial vocabulary.
When the model is applied, every word in the corpus is split into units from the BPE vocabulary. Since this vocabulary also contains the individual characters, the share of out-of-vocabulary (sub)word units after splitting is almost guaranteed to be 0. Exceptions are very rare and can occur when the test text contains a letter or character that is absent from the training corpus.
The authors of Sennrich et al. (2016) published an implementation of the algorithm and proposed the option of joint training of the splitting model (Joint BPE), where the text of both sides of a parallel corpus serves as the training material. This yields one model and two vocabularies, one for each language of the pair. The number-of-merges setting then corresponds to the joint number of merged symbols for both languages; since not all merged symbols necessarily occur in both languages, the two vocabularies in this case are typically smaller than the number of merges.
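The merge loop described above is easy to state in code. The following is a minimal, purely illustrative sketch of our own (not the authors' implementation and not the Subword NMT tool): it learns merges from a dictionary of word frequencies and, as described above, never merges across the end-of-word symbol.

    from collections import Counter

    def learn_bpe(word_freqs, num_merges):
        """Learn BPE merges from a dict mapping words to corpus frequencies."""
        # Words become symbol tuples; '</w>' marks the end of the word.
        words = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, freq in words.items():
                for pair in zip(symbols, symbols[1:]):
                    if '</w>' not in pair:      # no merges with the end symbol
                        pairs[pair] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)    # most frequent adjacent pair
            merges.append(best)
            new_words = {}
            for symbols, freq in words.items():
                out, i = [], 0
                while i < len(symbols):         # replace the pair everywhere
                    if (i + 1 < len(symbols)
                            and (symbols[i], symbols[i + 1]) == best):
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_words[tuple(out)] = freq
            words = new_words
        return merges

    print(learn_bpe({'lower': 5, 'low': 3, 'newer': 2}, num_merges=4))

With these toy frequencies the first merges are ('l','o'), ('lo','w') and ('e','r'), and the final vocabulary size is the initial character set plus one entry per merge.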
2.2. Morfessor
The Morfessor program (Creutz and Lagus, 2002) was developed with the aim of splitting the words of morphologically complex languages into subword units that roughly correspond to morphemes, the smallest meaning-bearing units of a word. The goal was a data-driven procedure that works for many languages without additional grammatical knowledge, building a vocabulary of language units that is smaller and more general than a vocabulary of words.
The algorithm assumes that words are composed of sequences of several segments, as is typical of agglutinative languages. Two algorithms were developed: the first is based on the minimum description length principle, the second on the maximum likelihood principle. We used the first.
The goal of the algorithm is to find the subword vocabulary that optimises a cost function with two parts, the cost of the source text T and the cost of the vocabulary V. The cost is given by

    C = \mathrm{Cost}(T) + \mathrm{Cost}(V)
      = \sum_{m_i \in \mathrm{text}} -\log p(m_i) + \sum_{m_j \in \mathrm{vocabulary}} k \cdot l(m_j),    (1)

where the m are the subword units, l(m_j) is the length of the subword unit m_j (its number of characters), and k is the number of bits needed to represent one character, which in practice can be set to 5. The probability p(m_i) of a subword unit in the text is estimated by maximum likelihood, as the ratio between the absolute frequency of the unit and the total number of units in the text.
In our work we used the newer implementation of the program, Morfessor 2.0 (Virpioja et al., 2013). The search algorithm in this implementation finds a set of subword units that optimises the cost function, where we can either choose the weights of the two cost components by hand or specify the desired vocabulary size.
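To make equation (1) concrete, the following sketch (our own illustration, not Morfessor's code) computes the cost of one candidate segmentation of a training text; logarithms are taken base 2 on the assumption that the lexicon cost is measured in bits.

    import math
    from collections import Counter

    def mdl_cost(segmented_text, k=5):
        """Cost C from equation (1) for one candidate segmentation.

        segmented_text: the training text as a list of subword units.
        k: bits per character for the lexicon cost (set to 5, as above).
        """
        counts = Counter(segmented_text)
        total = sum(counts.values())
        # Cost(T): negative log-likelihood of the text, with unit
        # probabilities estimated as relative frequencies.
        cost_text = -sum(c * math.log2(c / total) for c in counts.values())
        # Cost(V): k bits for each character of each distinct lexicon unit.
        cost_vocab = sum(k * len(unit) for unit in counts)
        return cost_text + cost_vocab

    print(mdl_cost(['pregled', 'a', 'le', 'pregled', 'a', 'ti']))

A search over candidate segmentations then keeps the one with the lowest cost, trading text likelihood against lexicon size.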
2.3. The unigram model
The last method we considered splits words using a unigram language model (Kudo, 2018). In the unigram model, the probability of a sequence x of subword units is modelled as the product of the probabilities of the individual units of the sequence:

    P(x) = \prod_{i=1}^{M} p(x_i),    (2)

where M is the length of the text and p(x_i) is the probability of the i-th unit in the text. All subword units belong to a fixed vocabulary, and the probabilities of all units must sum to 1.
The most probable segmentation x* of the words of an input text X is the one for which

    x^{*} = \arg\max_{x \in S(X)} P(x),    (3)

where S(X) is the set of all possible segmentations of the words of the text X.
The probabilities of the individual subword unigrams can be determined with the EM (Expectation Maximization) algorithm, while the optimal word segmentation can be found with the Viterbi algorithm (Kudo, 2018).
An example implementation of this procedure is the SentencePiece tool (Kudo and Richardson, 2018), which also implements other procedures, including BPE. In this tool, too, we can start from the desired final vocabulary size.
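As a worked illustration of equations (2) and (3), the sketch below (ours, not SentencePiece's implementation) finds the segmentation of a single word that minimises the summed negative log probabilities with the Viterbi recursion. It assumes the word can be fully covered by vocabulary units, which holds in practice because single characters are kept in the vocabulary.

    import math

    def viterbi_segment(word, unit_probs):
        """Best segmentation per equations (2)-(3): minimise the summed
        negative log probabilities of the units covering the word."""
        n = len(word)
        best = [0.0] + [math.inf] * n   # best[i]: cost of word[:i]
        back = [0] * (n + 1)            # back[i]: start of the last unit
        for i in range(1, n + 1):
            for j in range(i):
                unit = word[j:i]
                if unit in unit_probs:
                    cost = best[j] - math.log(unit_probs[unit])
                    if cost < best[i]:
                        best[i], back[i] = cost, j
        units, i = [], n
        while i > 0:                    # follow the back-pointers
            units.append(word[back[i]:i])
            i = back[i]
        return units[::-1]

    probs = {'izdajatelj': 0.01, 'izdaja': 0.02, 'telj': 0.03, 'ice': 0.05}
    print(viterbi_segment('izdajateljice', probs))   # ['izdajatelj', 'ice']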
2.4. Selected methods and tools
For our experiments we decided to select four word-splitting methods:

• Joint BPE: the BPE procedure with joint training on the parallel corpus, in Rico Sennrich's implementation called Subword NMT.
• Morfessor: the minimum-description-length procedure, where the cost-function weights are adapted to the desired vocabulary size, in the Morfessor 2.0 implementation.
• SentencePiece - BPE: the implementation of the BPE procedure with separate training per language in the SentencePiece tool.
• SentencePiece - Unigram: the procedure based on unigram language models, as implemented in the SentencePiece tool (a brief usage sketch follows this list).
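A minimal sketch of how the two SentencePiece variants can be trained and applied through the tool's Python bindings; the file names are placeholders, and the exact options used in our experiments are not reproduced here:

    import sentencepiece as spm

    # Train a unigram model with the desired final vocabulary size;
    # model_type='bpe' gives the SentencePiece - BPE variant instead.
    # 'train.sl' stands for the Slovenian side of the training corpus.
    spm.SentencePieceTrainer.train(
        input='train.sl', model_prefix='sp_unigram_sl',
        vocab_size=20000, model_type='unigram')

    sp = spm.SentencePieceProcessor(model_file='sp_unigram_sl.model')
    print(sp.encode('države članice bodo pregledale sezname', out_type=str))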
Setting | Joint BPE (sl) | Morfessor (sl) | SP-BPE (sl) | SP-Unigram (sl) | Joint BPE (en) | Morfessor (en) | SP-BPE (en) | SP-Unigram (en)
10k   |  11,384 |  18,670 |  17,064 |  17,814 |  11,556 |  18,405 |  16,909 |  17,358
15k   |  16,273 |  28,251 |  25,525 |  26,716 |  15,739 |  27,822 |  25,455 |  26,177
20k   |  21,101 |  37,934 |  33,561 |  35,534 |  19,631 |  37,664 |  33,595 |  34,779
25k   |  25,883 |  46,879 |  41,297 |  44,175 |  23,299 |  46,298 |  41,395 |  43,051
30k   |  30,625 |  55,438 |  48,822 |  52,717 |  26,760 |  55,994 |  48,946 |  51,204
40k   |  39,890 |  73,766 |  63,478 |  69,530 |  33,593 |  73,960 |  63,132 |  66,839
50k   |  49,063 |  93,726 |  77,520 |  86,111 |  40,115 |  90,248 |  76,515 |  82,082
60k   |  58,155 | 109,989 |  91,015 | 102,242 |  46,404 | 105,558 |  89,312 |  96,924
80k   |  76,152 | 143,018 | 117,134 | 133,892 |  58,788 | 133,679 | 113,496 | 125,572
100k  |  93,938 | 174,026 | 142,294 | 164,877 |  71,043 | 159,419 | 136,198 | 153,190
120k  | 111,646 | 205,658 | 166,442 | 195,155 |  82,972 | 182,895 | 157,987 | 180,256
150k  | 138,006 | 238,620 | 201,334 | 239,515 | 101,013 | 210,425 | 188,859 | 218,140

Table 1: Sizes of the resulting Slovenian (sl) and English (en) vocabularies.
Joint BPE:    države članice bodo pregle-dale sezna-me in od izdaja-te-ljice ...
Morfessor:    držav-e članic-e bodo pregled-a-le seznam-e in od izdajatelj-ice ...
SP - BPE:     države članice bodo pregleda-le sezna-me in od izdaja-telji-ce ...
SP - Unigram: države članice bodo pregleda-le seznam-e in od izdajatelj-ice ...

Figure 1: Example of a text segment from the test set with the words split by all four procedures; hyphens mark the split points.
3. Experimental system

3.1. Corpora
The experiments were carried out on the freely available parallel corpus ParaCrawl (Bañón et al., 2020), which was built by web crawling and automatic alignment. For the English-Slovenian language pair it contains approximately 3.7 million aligned segments, amounting to 65.5 million words on the English side and 60.9 million words on the Slovenian side.
We divided the corpus into three parts: a training corpus, a development corpus, and a test corpus. The development corpus serves for validation during the training of the translation model, and the test corpus for the final testing and evaluation of the results. For each of these two corpora we selected 2,000 random text segments from the source corpus; the remainder was used as the training corpus.
On all corpus parts we performed standard preprocessing for machine translation: cleaning, punctuation normalisation, tokenisation, and truecasing¹. The training corpus was also used to train the truecasing model. The final sizes of all preprocessed corpora are given in Table 2.

Corpus      | Number of segments
Training    | 3,714,473
Development | 1,987
Test        | 1,990
Total       | 3,718,450

Table 2: Number of text segments in the training, development and test corpora.

¹ Determining the correct use of upper and lower case: the first word of every sentence is converted to its most probable lower- or upper-case form, which reduces data sparsity.
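A small sketch of the corpus split described above (our own illustration; the published development and test counts in Table 2 are slightly below 2,000, presumably because preprocessing removed a few segments after the split):

    import random

    def split_corpus(pairs, n_dev=2000, n_test=2000, seed=42):
        """Hold out random development and test segments from a list of
        (English, Slovenian) segment pairs; the remainder is training data."""
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        dev = pairs[:n_dev]
        test = pairs[n_dev:n_dev + n_test]
        train = pairs[n_dev + n_test:]
        return train, dev, test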
3.2. Word splitting
For word splitting we used the tools described in the previous section. The training part of the corpus was used to train each splitting model, and the trained models were then used to split all parts of the corpus. In this way we obtained the split variants of the corpora.
Since we wanted a range of final vocabulary sizes, we varied the corresponding parameters of the splitting tools. The tools use these parameters in different ways, which means that the final vocabulary sizes do not correspond exactly to the parameter values we set. The desired values we set were: 10,000, 15,000, 20,000, 25,000, 30,000, 40,000, 50,000, 60,000, 80,000, 100,000, 120,000 and 150,000. Table 1 shows the exact vocabulary sizes obtained on the Slovenian and English training sets at these settings.
Figure 1 shows an example segment in which the words were split with all four procedures. We used a target vocabulary size of 20,000, since at this size word splits are more frequent and more differences can be shown within a single segment. In the figure the split points are marked with hyphens.
For the models without word splitting we used vocabulary sizes of 60,000, 80,000, 100,000, 125,000, 150,000, 200,000, 250,000 and 300,000.
In the next step we built vocabularies for all variants of the split training corpora as well as for the unsplit, word-level training corpus.
Figure 2: Translation quality for all models: BLEU score as a function of vocabulary size (log scale), shown separately for English-Slovenian and Slovenian-English, with curves for the word-level models, Joint BPE, Morfessor, SentencePiece - BPE and SentencePiece - Unigram.
While in the split corpora the vocabularies cover the whole corpus, out-of-vocabulary words do appear in the word-level corpus. Table 3 gives the shares of out-of-vocabulary (OOV) words on the test part of the corpus for both languages. As expected, the shares are higher on the Slovenian side and fall as the vocabulary grows.

Vocabulary | OOV (en) [%] | OOV (sl) [%]
60k        | 2.57         | 6.66
80k        | 2.07         | 5.38
100k       | 1.77         | 4.44
125k       | 1.50         | 3.74
150k       | 1.30         | 3.22
200k       | 1.08         | 2.53
250k       | 0.95         | 2.11
300k       | 0.85         | 1.82

Table 3: Share of out-of-vocabulary words for the word-level vocabularies on the English (en) and Slovenian (sl) test corpora.
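The OOV shares in Table 3 are simple to compute; a sketch of ours for one language side, where the names are placeholders:

    def oov_rate(test_tokens, vocabulary):
        """Percentage of test tokens not covered by the vocabulary (Table 3)."""
        vocabulary = set(vocabulary)
        oov = sum(1 for token in test_tokens if token not in vocabulary)
        return 100.0 * oov / len(test_tokens)

    # e.g. rate = oov_rate(slovenian_test_tokens, most_frequent_words_60k)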
3.3. The translation model
The translation model is in all cases a neural machine translation model based on the RNN architecture, with a hidden-state dimension of 1024 and an embedding dimension of 512 (the default settings of the Marian NMT tool). Our experience so far with this training set indicates that using the transformer architecture with self-attention brings no substantial improvement. During training we limited segment lengths to 80 tokens (words and punctuation, or subword units and punctuation), which means that we keep 99.7% of all segments of the unsplit training set; for the models with word splitting we thereby keep between 96.3% and 99.5% of all segments. We did not increase the length limit further, since given this coverage we do not expect any notable change in the results.
Training ran for 10 epochs, with the results checked on the development set every 100 model parameter updates. The model that performed best on the development set was then used for the evaluation of the results on the test set. For translation we used mini-batches of size 64, while during training the mini-batch size is flexible and adapted to the working memory of the GPU on which the training runs.
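For orientation, here is how such a run might be launched with the Marian NMT command-line tool from Python. This is a sketch under assumptions: the flag names follow Marian's documented interface, the full configuration used in our experiments is not reproduced, and all file names are placeholders.

    import subprocess

    # Mirrors the settings described above: RNN s2s model, dim-rnn 1024,
    # dim-emb 512, 10 epochs, validation every 100 updates, segments capped
    # at 80 tokens, mini-batch size fitted to GPU memory.
    subprocess.run([
        'marian',
        '--type', 's2s',
        '--dim-rnn', '1024',
        '--dim-emb', '512',
        '--train-sets', 'train.bpe.en', 'train.bpe.sl',
        '--vocabs', 'vocab.en.yml', 'vocab.sl.yml',
        '--valid-sets', 'dev.bpe.en', 'dev.bpe.sl',
        '--valid-freq', '100',
        '--after-epochs', '10',
        '--max-length', '80',
        '--mini-batch-fit',
        '--model', 'model.en-sl.npz',
    ], check=True)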
3.4. Tools used
For preprocessing (cleaning, normalisation, tokenisation and truecasing) and postprocessing (detruecasing and detokenisation) we used scripts from the MOSES package (Koehn et al., 2007). For training the translation models and for translation we used the Marian NMT tool (Junczys-Dowmunt et al., 2018), which we ran on Nvidia Tesla V100 GPUs. For evaluating the results with the BLEU metric we used the SacreBLEU tool (Post, 2018), which retokenises the text as part of the evaluation and scores the tokenised text. The word-splitting tools are described in Section 2.
4. Results and discussion
Since the basic purpose of using subword units was to reduce the vocabulary size and thereby make neural machine translation feasible, we first show an example of results at typical vocabulary sizes. For the word-level vocabulary we chose a size of 60,000 words, a frequently used vocabulary size in natural language processing. Table 4 compares the translation results of the word-level model and of the Joint BPE model with the same vocabulary size. Here we can see the improvement in translation quality from using subword units that is also typically reported in the literature, e.g. in Sennrich et al. (2016). For this comparison we added the scores obtained with the ChrF metric (β = 3) (Popović, 2015). Although this metric is gaining ground for evaluating translation into morphologically complex languages, we report the remaining results only with the BLEU metric, which is still well established and suffices for comparing our models with one another.
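Both metrics can be computed through SacreBLEU's Python API; a minimal sketch, where the two-sentence data is a placeholder and the beta parameter mirrors the ChrF setting quoted above:

    from sacrebleu.metrics import BLEU, CHRF

    # Placeholder data: one system output and one reference translation.
    hyps = ['države članice bodo pregledale sezname']
    refs = [['države članice bodo pregledale sezname']]

    print(BLEU().corpus_score(hyps, refs))        # retokenises internally
    print(CHRF(beta=3).corpus_score(hyps, refs))  # ChrF with beta = 3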
Figure 3: Training speed of the translation model for all systems: processed segments per second as a function of vocabulary size (log scale), shown separately for English-Slovenian and Slovenian-English, with curves for the word-level models, Joint BPE, Morfessor, SentencePiece - BPE and SentencePiece - Unigram.
Figure 4: Translation speed for all systems: translated segments per second as a function of vocabulary size (log scale), for the same two translation directions and the same five systems.
Metric       | Word-level | Joint BPE
BLEU (en-sl) | 38.50      | 42.87
BLEU (sl-en) | 41.62      | 45.87
ChrF (en-sl) | 58.43      | 63.13
ChrF (sl-en) | 60.68      | 65.76

Table 4: Example results for the word-level model and the Joint BPE model with a vocabulary size of 60,000.

Figure 2 shows the translation quality as a function of vocabulary size for all systems; in the figures the vocabulary sizes are plotted on a logarithmic scale. In general we can observe that quality rises with growing vocabularies, although individual systems deviate from this trend, e.g. the Slovenian-to-English model with a vocabulary of 120,000 words. We also observe that the quality of the word-level models rises faster and at the largest vocabularies comes quite close to that of the models with word splitting.
When we compare the systems that use word splitting with one another, we see smaller differences. Nevertheless, we can observe that for translation from English to Slovenian the best results mostly come from SentencePiece with unigram-based splitting (SentencePiece - Unigram), and in the opposite direction from Subword NMT with joint BPE training (Joint BPE).
Figure 3 shows the training speed of the translation models and Figure 4 the translation speed when they are used. In all cases we measure speed as the number of processed text segments per second, since the number of tokens differs between the systems because of the different splittings. The number of words per second during training can be obtained by noting that the average number of word tokens (words and punctuation) per segment is 18.7 for the English text and 20.2 for the Slovenian text; a system processing, say, 1,000 segments per second thus handles about 18,700 English word tokens per second.
Figure 5: File size of the resulting model (English-Slovenian) for all systems and memory use on the graphics processing unit, both in MiB, as a function of vocabulary size (both axes logarithmic).
We can observe that both speeds decrease as the vocabulary grows. The word-level models are faster than the other models, but here too the difference shrinks at larger vocabularies. We can also see that the fastest models are those that use the Morfessor tool to split the corpus. In these models, though, several points deviate strongly from the trends. We assume these deviations arise from the random initialisation of some parameters during training, from possible variations in the hardware used, from specific properties of the training software, or from the adjustment of the mini-batch size at different vocabulary sizes.
The translation speeds shown do not include the preprocessing and postprocessing of the text.
Figure 5 shows the file sizes of all translation models and, for the word-level models, their GPU memory use. The file sizes grow almost linearly with the vocabulary size (both axes of the plot are logarithmic). Each model is accompanied by two files with the two vocabularies, which are substantially smaller. The sizes shown are for the English-to-Slovenian models; the models for the opposite direction have comparable sizes.
On the right, the memory use when the models are used for translation is shown. We can see that the tool has a baseline memory use, which shows up as small changes in use at small vocabularies; at larger vocabularies the memory use likewise grows linearly. The memory use shown is for the word-level models and is the same in both translation directions; the memory use of the other models is comparable at a given vocabulary size. A mini-batch of size 64 was used when measuring memory use; during training the memory use can be higher.
It should be noted that the sizes would be different under other hyperparameter settings of the models.

5. Conclusion
In this paper we presented and compared some of the most common data-driven methods for splitting words, and their use in a neural machine translation system. Our results show that with word splitting we still achieve better results than with translation models without splitting, even when the latter are given larger vocabularies. The trend, however, suggests that word-level models could catch up with the word-splitting models if the vocabulary were increased further. Given the current pace of development and the growth of GPU memory capacities, it will become possible to train and use such models in the future.
The results presented here can serve researchers and users as orientation when choosing a vocabulary size for machine translation models, if they want to take into account translation quality, translation speed and model size; the latter can matter because of hardware limitations.
A better understanding of the usefulness of word splitting in machine translation would require further research. In this paper we limited ourselves to fixed model hyperparameters and carried out only data-driven splitting procedures. In future work we can also study splitting methods based on grammatical knowledge, or the combination of complementary methods. Increasing the training set and the model hyperparameters can also contribute substantially to the translation quality of word-level models, although the latter also means a larger model and slower operation. Further research can also include a more detailed analysis of the errors that occur with the different splitting methods.

6. Acknowledgements
The research programme no. P2-0069, within which this research was carried out, was co-financed by the Slovenian Research Agency from the state budget.
The authors thank the HPC RIVR consortium (www.hpc-rivr.si) for co-financing the research through the use of the HPC MAISTER system at the University of Maribor (www.um.si).
They also thank the authors of the ParaCrawl parallel corpus for making it freely available.
7. References

Tamali Banerjee and Pushpak Bhattacharyya. 2018. Meaningless yet meaningful: Morphology grounded subword-level NMT. In: Proceedings of the Second Workshop on Subword/Character Level Models, pp. 55–60.
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4555–4567. Association for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In: Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pp. 21–30, Philadelphia, Pennsylvania.
Rohit Gupta, Laurent Besacier, Marc Dymetman and Matthias Gallé. 2019. Character-based NMT with transformer. arXiv:1911.04997.
Georg Heigold, Stalin Varanasi, Günter Neumann and Josef van Genabith. 2018. How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? In: Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 68–80, Boston, MA. Association for Machine Translation in the Americas.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In: Proceedings of ACL 2018, System Demonstrations, pp. 116–121, Melbourne, Australia. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume, Proceedings of the Demo and Poster Sessions, pp. 177–180, Prague, Czech Republic. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels, Belgium. Association for Computational Linguistics.
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66–75, Melbourne, Australia. Association for Computational Linguistics.
Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Brussels, Belgium. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research, 69:343–418.
Sami Virpioja, Peter Smit, Stig-Arne Grönroos and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University.
The CLARIN.SI Research Infrastructure

Tomaž Erjavec¹, Kaja Dobrovoljc³,¹, Darja Fišer⁴,³,¹, Jan Jona Javoršek¹,
Simon Krek²,¹, Taja Kuzman¹, Cyprian Laskowski², Nikola Ljubešić¹,², Katja Meden¹

¹ Institut "Jožef Stefan"
tomaz.erjavec@ijs.si, kaja.dobrovoljc@ijs.si, jan.javorsek@ijs.si, simon.krek@ijs.si, taja.kuzman@ijs.si, nikola.ljubesic@ijs.si, katja.meden@ijs.si
² Center za jezikovne vire in tehnologije Univerze v Ljubljani
cyp@cjvt.si
³ Filozofska fakulteta Univerze v Ljubljani
darja.fiser@ff.uni-lj.si
⁴ Inštitut za novejšo zgodovino

Abstract
The paper summarises the services offered by the Slovenian research infrastructure for language resources and technologies CLARIN.SI, which is a member of the European research infrastructure consortium CLARIN ERIC. We first present the governance, organisation and technical infrastructure of CLARIN.SI, followed by a description of its web applications with a focus on its repository and concordancers. Next comes an overview of support activities that CLARIN.SI offers to the fields of language technologies and digital humanities in Slovenia, which includes services of the knowledge centre for computational processing of South-Slavic languages CLASSLA, financial support of projects, and organisation or support of conferences and workshops. We also introduce the work of CLARIN.SI within CLARIN ERIC, its cooperation with its sister national infrastructures DARIAH-SI and CESSDA/ADP, and involvement in national and European projects.
1. Introduction
The CLARIN research infrastructure (RI)¹ ("Common Language Resources and Technology Infrastructure") provides digital language resources, tools and services to support researchers in the humanities and social sciences and in other fields that deal with language (Jong et al., 2018).
CLARIN was one of the infrastructures foreseen already in the first roadmap of the European Strategy Forum on Research Infrastructures, ESFRI (Váradi et al., 2008). It was established in 2012 and was one of the first RIs to obtain the status of a European legal entity as a European Research Infrastructure Consortium (ERIC). CLARIN ERIC is based in the Netherlands and currently brings together the RIs of 22 member states and 3 observers. It employs a director and support staff for coordination and central technical services, while the main role in providing services lies with the national RI centres.
Given the importance of the Slovenian language for Slovenia, participation in CLARIN is of key importance, since it promotes empirically grounded language research and the development of language resources and technologies, enabling Slovenian to participate in the information society on an equal footing with other languages, including those of much larger communities (Krek, 2022). The RI benefits researchers, teachers and students of the Slovenian language and of other linguistic disciplines, of computational linguistics and of artificial intelligence, as well as other researchers from the humanities and social sciences who use language materials in their work. The RI also supports lexicographers, translators and companies that include the processing of Slovenian in their products, and not least lay users looking to resolve practical questions.
The Slovenian RI CLARIN.SI was established in 2014 and became a member of CLARIN ERIC in 2015; this required the founding of a national consortium and the signing, by the Government of the Republic of Slovenia, of a memorandum committing it to paying the fee for Slovenia's membership in CLARIN ERIC. So far, the only publication presenting CLARIN.SI as a whole was published soon after its foundation (Erjavec et al., 2014), where we presented the first steps of the RI and the plans for further work. The present paper summarises what has been done in the past eight years: in Section 2 we present the organisational structure and governance of the infrastructure, in Section 3 the repository of language resources and tools, in Section 4 the web services, in Section 5 the support activities, in Section 6 the involvement of CLARIN.SI in national and European projects and in the activities of CLARIN ERIC, and in Section 7 we give conclusions and plans for further work.

¹ https://www.clarin.eu/
2. The organisation of CLARIN.SI
The infrastructure is based at the Jožef Stefan Institute (IJS), which also houses most of its computing equipment and where the security, maintenance and uninterrupted operation of the RI's web services are ensured. Three organisational units of IJS cooperate in its management and technical maintenance: the Department of Knowledge Technologies (E8), the Artificial Intelligence Laboratory (E3) and the Centre for Network Infrastructure (CMI).
CLARIN.SI is organised as a consortium without the status of a legal entity, with 12 partners as members. The consortium brings together all the main institutions in Slovenia that develop or use language resources and technologies, namely:

• Universities: the University of Ljubljana, the University of Maribor, the University of Nova Gorica and the University of Primorska. The University of Ljubljana is the seat of the Centre for Language Resources and Technologies (CJVT), which coordinates work in corpus linguistics and language technologies and develops and maintains the fundamental digital language resources and language-technology tools for contemporary Slovenian.
• Research institutes: ZRC SAZU, the Jožef Stefan Institute (IJS), the Institute of Contemporary History (INZ) and the Science and Research Centre Koper. Within ZRC SAZU, the Fran Ramovš Institute of the Slovenian Language collects language material and uses it to produce the fundamental works of Slovenian linguistics, above all dictionaries. IJS, as the host of the CLARIN.SI research infrastructure, coordinates the work of the infrastructure, maintains and upgrades its repository and services, and develops language resources and tools.
• Associations and institutes: the Slovenian Language Technologies Society (SDJT), which promotes the development of language technologies for Slovenian through the conference "Language Technologies and Digital Humanities" (JTDH), and the Trojina Institute for Applied Slovene Studies, with its consulting and support activities and its production of language resources and tools.
• The companies Alpineon and Amebis, of which the first contributes mainly speech technologies to the CLARIN.SI infrastructure, while the second develops software in the fields of language technologies and electronic publishing.

Decisions on the governance of the RI are taken or confirmed by the CLARIN.SI Governing Board, in which every partner has one representative and any number of deputies. Communication takes place via the Governing Board mailing list, which currently has 34 members, and once a year we organise a meeting of the CLARIN.SI Governing Board, at which we discuss the operation of the RI in the past year and make plans for the next one.
The operation of the CLARIN research infrastructure in Slovenia is thus shaped by the needs and consensus of all the major actors in the field of digital linguistics and language technologies, as well as of the digital humanities and social sciences, since CLARIN.SI cooperates closely with two sister RIs in Slovenia. These are DARIAH-SI, based at the Institute of Contemporary History (INZ), which is the national node of the European RI for the digital humanities, and CESSDA/ADP at the Social Science Data Archives of the Faculty of Social Sciences of the University of Ljubljana (ADP), which is the national node of the European RI for the digital social sciences, CESSDA ("Consortium of European Social Science Data Archives"). CLARIN.SI is also one of the founding members of the Slovenian national supercomputing network SLING² and, through it, a member of the EGI federation of computing and data resources³ and of the Partnership for Advanced Computing in Europe, PRACE⁴.
CLARIN.SI maintains a bilingual (Slovenian, English) website⁵ presenting the RI and all its services. The website also offers contact information, e.g. an e-mail address that users can turn to for help or advice. In addition, the website includes password-protected internal pages, accessible to the members and deputies of the Governing Board, which contain the founding documents, minutes of meetings, relevant CLARIN ERIC minutes, etc.
To document its technical maintenance, CLARIN.SI uses an internal installation of the WordPress platform, on which we document the maintenance procedures for all CLARIN.SI web services, while an installation of the Redmine platform is used for tickets for resolving discovered problems.
The critical web services of CLARIN.SI are always also installed on a development server, where the operation of every change to the software, to the offered language resources or to the documentation is first verified. The operation of the web services is monitored through the NAGIOS system, and the repository also independently by CLARIN ERIC. In case of errors, the service administrators are thus notified immediately and can set about fixing the problem at once.

3. The repository of language resources
The basic service of CLARIN.SI is the maintenance of a repository of language research data, i.e. language resources, such as large and richly annotated text collections (corpora), computational lexicons and models, as well as machine-readable dictionaries and computational tools. The repository's computing platform is the open-source CLARIN-DSpace⁶, developed specifically for CLARIN repositories within the Czech CLARIN research infrastructure (now CLARIAH, created by the merger of the Czech CLARIN and DARIAH) at the Institute of Formal and Applied Linguistics of Charles University in Prague.

² https://www.sling.si/
³ https://www.egi.eu/
⁴ https://prace-ri.eu/
⁵ https://www.clarin.si/
⁶ https://github.com/ufal/clarin-dspace
Besides Slovenia and, of course, the Czech Republic, the platform is used by seven other national CLARIN repositories, which together represents 40% of all full members of CLARIN ERIC.
The CLARIN.SI repository is, besides ADP, the only one in Slovenia accredited with the "Core Trust Seal" certificate⁷, i.e. as a trustworthy data repository. In line with the CLARIN ERIC strategy, the repository implements the FAIR principles⁸,⁹ (findability, accessibility, interoperability and reusability). CLARIN has followed the European open science agenda and the FAIR principles avant la lettre (Jong et al., 2018), namely with the following instruments:

• Academic authentication (AAI), which works on the SSO ("single sign-on") model, distinguishing identity providers (Arnes, universities, other academic institutions) from service providers (in our case the repository), so that users do not need to create an account at CLARIN.SI but log in to the repository with their EduGain username and password at their chosen identity provider.
• Persistent identifiers of entries under the "handle" system, which assigns every repository entry a permanent URL that, like a DOI, is independent of the specific URL of the resource within the repository and is thus robust to changes of the repository platform or location.
• Integration into international metadata aggregators, such as OpenAIRE¹⁰ and Re3data¹¹, and from 2022 also the European Language Grid. Through CLARIN ERIC, CLARIN.SI was also among the first RIs included in the resource and service offering of the European Open Science Cloud, EOSC¹², ever since the EOSC portal was established in 2018. Within the CLARIN RI, the CMDI¹³ ("Component MetaData Infrastructure") recommendations are used for metadata records, and metadata can also be exported or harvested in the Dublin Core standard (a minimal harvesting sketch follows this list).
• A rich choice of licences, from open ones, such as the Creative Commons licences, to more restricted ones that require prior login to the repository and a digital signature of a resource usage agreement.
• Explicit terms of service, which define the rights and duties of both the repository operators and the users.
• Submission guidelines, which describe the deposit procedure with special emphasis on the required metadata and their form, since at CLARIN.SI we strive to maintain metadata records that are as complete and uniform as possible.
• Guidelines for encoding the deposited data, which list the acceptable data formats and ways of annotating the data, and also include general instructions for preparing high-quality, consistent data. In this the CLARIN.SI repository differs from most other CLARIN repositories (Lenardič and Fišer, 2022), which typically offer only a list of acceptable formats, without the more general instructions for preparing quality data that can be very useful for authors from the humanities without the in-depth computing skills needed to prepare data correctly.
• A list of frequently asked questions with answers, and similar content.
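The Dublin Core harvesting mentioned above can be scripted against the repository's OAI-PMH endpoint; the endpoint path below is our assumption of the usual CLARIN-DSpace layout, not a documented CLARIN.SI URL:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Endpoint path is an assumption based on the usual CLARIN-DSpace layout.
    URL = ('https://www.clarin.si/repository/oai/request'
           '?verb=ListRecords&metadataPrefix=oai_dc')

    with urllib.request.urlopen(URL) as response:
        tree = ET.parse(response)

    # Dublin Core titles of the harvested records (first page of results).
    DC = '{http://purl.org/dc/elements/1.1/}'
    for title in tree.iter(DC + 'title'):
        print(title.text)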
Besides being tailored to the description of language resources, an important virtue of the CLARIN.SI repository, in contrast to general self-archiving repositories such as Zenodo, is its assurance of the high quality of the deposited language resources and their metadata: before publication, every entry is carefully reviewed by one of the repository editors, who checks whether it meets the CLARIN.SI criteria. If it does not, the editor rejects the entry with an explanation of the errors, and in cases agreed in advance also helps to correct the resource.
In the eight years since the first deposit, the number of deposited language resources and tools has grown to more than 300, the result of the work of over 700 authors, with many resources representing several years of work. In 2021 the repository recorded about 40,000 views and 4,000 downloads. The most frequently downloaded resources that year were a collection of 751 emojis with automatically assigned sentiment, computed on the basis of 70,000 tweets in 13 European languages annotated for sentiment by 83 annotators (Kralj Novak et al., 2015), and BERT-type language models (word embeddings) (Devlin et al., 2018) for Slovenian (Ulčar and Robnik-Šikonja, 2021), which are useful for many Slovenian language processing tasks.
By encouraging the depositing of language resources and helping with their design and description, CLARIN.SI has contributed substantially to establishing the concept of open, verifiable, repeatable and responsible science in language research in Slovenia, and has saved numerous language resources created in Slovenian research projects from disappearing, giving them international visibility and impact.

4. Web services
Besides the repository, CLARIN.SI permanently maintains several web services, the most important of which are the concordancers, i.e. tools for corpus analysis: CLARIN.SI offers the KonText concordancer and two variants of the noSketch Engine concordancer (Crystal and Bonito). All three use the same back end, Manatee (Rychlý, 2007), which enables fast queries over richly annotated corpora, but they differ in their front ends.

⁷ https://www.coretrustseal.org/
⁸ https://www.go-fair.org/fair-principles/
⁹ https://www.clarin.eu/fair
¹⁰ https://www.openaire.eu/
¹¹ https://www.re3data.org/
¹² https://eosc-portal.eu/
¹³ https://www.clarin.eu/content/component-metadata
NoSketch Engine is an open-source version of the commercial Sketch Engine concordancer (Kilgarriff et al., 2014)¹⁴, while KonText was developed at the Czech National Corpus department of Charles University in Prague (Machálek, 2020). Apart from their appearance, the main differences between them are that noSketch Engine offers somewhat more functionality than KonText (above all the computation of keywords of a corpus or subcorpus), while KonText supports login via the AAI system (like the repository), which in turn enables personalised display settings, saved query history, etc.
All concordancers at CLARIN.SI offer the same set of corpora, of which there are now over 40, from reference to specialised corpora, including spoken and multilingual ones. Worth highlighting here is the new metaFida corpus, which combines 34 existing corpora and contains a total of 4.5 billion tokens, making it the largest and most diverse corpus of Slovenian that can be searched through the concordancers.
The CLARIN.SI concordancers are used in degree programmes at several universities, in linguistic research and various research projects, as well as in translation companies.
The next web service offered by CLARIN.SI is the WebAnno platform for the manual annotation of corpora (Yimam et al., 2013), developed within CLARIN-DE. Within CLARIN.SI we developed a conversion from the TEI corpus encoding to the TSV3 format used by WebAnno, and a merge of the source TEI corpus with the manual annotations from the TSV file, thus enabling annotations to be added to or changed in TEI-encoded corpora with annotations manually inserted or corrected on the WebAnno platform (Erjavec et al., 2016)¹⁵. Our installation and conversion have so far been used in more than 10 projects, e.g. for the manual annotation of normalised word forms, lemmas and morphosyntactic tags of user-generated content in the Janes project "Linguistic analysis of non-standard Slovenian" (Fišer et al., 2020)¹⁶, for the annotation of bilingual terms in the KAS project "Slovenian scientific texts: resources and description" (Erjavec et al., 2021)¹⁷, and for the annotation of term definitions in texts in the TermFrame project "Terminology and knowledge schemata in cross-lingual space" (Vintar and Martinc, 2022).
The Git platform has become very popular for controlled and collaborative maintenance, and we likewise use it within CLARIN.SI, not only for software but also for language resources. The most widely used web-accessible Git repositories, which include a host of other functions, such as issue tracking and program execution, are GitHub and GitLab. On GitHub, CLARIN.SI has its own virtual organisation¹⁸, which now brings together about 60 open-source projects. Unlike GitHub, which exists only as a web service owned by Microsoft, the GitLab platform can also be self-hosted, which has the advantage that projects are located on local computing equipment and that access to projects can be restricted, which is necessary in individual cases, e.g. because of copyright over the texts of a language resource under development. The GitLab installation at CLARIN.SI¹⁹ contains about 20 projects, both public (such as the already mentioned TEI conversion for WebAnno) and private.
Within the CLASSLA knowledge centre, discussed in the next section, CLARIN.SI also offers the ReLDIanno web service for the linguistic annotation of texts in Slovenian, Croatian and Serbian²⁰. The service supports morphosyntactic tagging, lemmatisation, named entity annotation and syntactic parsing; it is accessible both through a web interface and through an API, and the results can be displayed on screen or the annotated text downloaded to one's own computer.
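ReLDIanno itself is a web service; for a local illustration of the same annotation layers, the CLASSLA knowledge centre's pipeline can be used instead. A sketch, assuming the classla Python package (a fork of Stanza) is installed:

    import classla

    # One-time download of the Slovenian models; 'hr' and 'sr' are the
    # Croatian and Serbian counterparts.
    classla.download('sl')

    nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse,ner')
    doc = nlp('Države članice bodo pregledale sezname.')
    for word in doc.sentences[0].words:
        print(word.text, word.lemma, word.upos)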
5. Expert support and dissemination

5.1. Knowledge centres
CLARIN.SI is active in promoting and encouraging the development of computational linguistics, not only for Slovenian but also for other South Slavic languages, such as Croatian, Serbian, Macedonian and Bulgarian, which has considerably increased the RI's international visibility. Together with the Bulgarian CLARIN research infrastructure CLADA-BG and the Croatian Institute for the Croatian Language and Linguistics, CLARIN.SI runs the CLARIN knowledge centre for South Slavic languages, CLASSLA, within which it offers expert support for the use of language resources and technologies for South Slavic languages. The knowledge centre supports researchers with documentation on freely available language resources, on tools for creating and processing text corpora, and on other language technologies. In addition, CLASSLA develops its own language technologies and corpora, covering the large needs of the South Slavic languages, which are technologically less well supported. In 2020, for example, as part of a project collecting corpora of Wikipedia texts, the centre created the first linguistically annotated Macedonian corpus, CLASSLAWiki-mk (Ljubešić et al., 2021).
In 2021 CLARIN.SI also became a member of the CLARIN knowledge centre for the processing of user-generated content, CKCMC²¹, led by Eurac Research, Bolzano.

5.2. Project funding
CLARIN.SI financially supports projects, selected annually in an open call for consortium members, that contribute to realising the CLARIN.SI strategy.

¹⁴ https://www.sketchengine.eu/
¹⁵ https://gitlab.clarin.si/clarinsi/webanno_tei
¹⁶ https://nl.ijs.si/janes/
¹⁷ https://nl.ijs.si/kas/
¹⁸ https://github.com/clarinsi
¹⁹ https://gitlab.clarin.si/
²⁰ http://clarin.si/services/web/
²¹ https://cmc-corpora.org/ckcmc/
This activity has resonated widely and has contributed importantly to the interest in language resource research and development among young researchers. Since 2018, when we started the initiative, 19 projects have been successfully carried out, producing, among other things, the siParl corpus of the parliamentary debates of the National Assembly of the Republic of Slovenia (Pančur et al., 2020), upgrades of the KAS 2.0 corpus of academic Slovenian (Žagar et al., 2022) and of the Gos Videolectures speech corpus (Verdonik et al., 2019), the LIST tool for the efficient analysis of Slovenian corpora (Krsnik et al., 2019), and other language resources and software. Among others, CLARIN.SI also funded the project "Development of teaching materials on the siParl 2.0 corpus: a corpus approach to researching parliamentary discourse" (Fišer and de Maiti, 2021).

5.3. Organisation of events
CLARIN.SI takes part in organising and running events in computational linguistics and related topics in Slovenia, e.g. the "XVIII EURALEX Intl. Congress" (Ljubljana, 2018) and the "22nd Intl. Conf. on Text, Speech and Dialogue" (Ljubljana, 2019), and above all the main conference of the field in Slovenia, "Language Technologies and Digital Humanities", which has a tradition of more than 20 years and whose organisation was started by the SDJT society. Since 2005, SDJT has organised the occasional JOTA lecture series (the language technologies "subscription series"), where CLARIN.SI has supported the recording and archiving of 12 lectures on VideoLectures.NET²², with 10,000 views so far.

5.4. Communication and promotion
Finally, we regularly present the work of CLARIN.SI and its knowledge centres at workshops and conferences at home and abroad, such as the conference of the European Strategy Forum on Research Infrastructures (ESFRI), the CLARIN conferences and others, and in lectures within the degree programmes of Slovenian universities.
CLARIN.SI also organises workshops on the use of corpora and language technologies for research purposes. We have, for example, run workshops²³ on the use of the noSketch Engine concordancer and of the WebAnno and Git platforms, while the CLASSLA knowledge centre took part in a workshop on using corpora to analyse regional variation in gender-marked language²⁴.
We also keep the public informed about the activities of the CLARIN.SI consortium partners and its knowledge centres through up-to-date news published on the infrastructure's website, through the mailing list and through posts from the CLARIN.SI Twitter profile. The work of CLARIN.SI and of its CLASSLA knowledge centre has also been featured in several "CLARIN ERIC Tour de CLARIN" publications (Fišer et al., 2019).

6. Involvement in projects and infrastructures
CLARIN.SI is involved in national and European projects, which ensures greater utilisation and visibility as well as, of course, an additional inflow of funds for its operation.

6.1. European Cohesion Policy funds
Within the 2018–2021 project funded from the cohesion funds of the Ministry of Education, Science and Sport, the consortium partners IJS, UM and UL upgraded the hardware, enabling faster and more fault-tolerant operation of the CLARIN.SI web services, while the GPU server cluster acquired at the University of Maribor serves for deep learning research on the processing of language data, e.g. in the field of speech. With these upgrades, CLARIN.SI can provide the Slovenian research community with an excellent research infrastructure, which among other things makes Slovenian partners more attractive in international research and innovation projects and supports the achievement of scientific excellence and internationally outstanding results. For example, the EU project MaCoCu uses the CLARIN.SI computing cluster for the capture and processing of web big data, and within the EU project InTavia the Slovenian Biographical Lexicon is being linguistically annotated with models developed on the GPU cluster. Several large EU projects, such as ELEXIS and EMBEDDIA, have deposited the language resources they developed in the CLARIN.SI repository.

6.2. Involvement in European projects
Among the European projects we particularly highlight ELEXIS²⁵, since a new CLARIN.SI ELEXIS collection was created in the CLARIN.SI repository for the needs of this project, gathering the metadata of, and links to the web interfaces of, 143 digital dictionaries. At the end of the ELEXIS project we are also planning, within CLARIN.SI and IJS, to establish a new CLARIN knowledge centre for digital lexicography.

6.3. Involvement in national projects
We also take part in several national projects. The largest is the "Development of Slovenian in a Digital Environment"²⁶, for which CLARIN.SI provides its services for the review and depositing of the language resources produced in the project and for the definition of schemas for the harmonised annotation of Slovenian language resources. The production of controlled vocabularies for the linguistic annotation of Slovenian texts at the levels of morphosyntax, syntax, named entities, semantic roles, etc. is also planned.

6.4. Cooperation with other RIs
CLARIN.SI cooperates with the Slovenian centres of the sister infrastructures CESSDA/ADP and DARIAH-SI. In the project "RDA Node Slovenia" (2019–2020), coordinated by ADP (FDV UL), we reviewed and analysed the Slovenian research data repositories (Meden and Erjavec, 2021). With INZ and DARIAH-SI we cooperated on the standardisation of the encoding and the construction of corpora of parliamentary data.

²² https://videolectures.net/jota/
²³ https://www.clarin.si/info/dogodki/
²⁴ https://www.clarin.si/info/k-center/delavnice/
²⁵ https://elex.is/
²⁶ https://www.cjvt.si/rsdo/
6.5. Participation in the work of CLARIN ERIC
CLARIN.SI is one of the more active national RIs in CLARIN ERIC. We obtained funding for two smaller projects that included international workshops, in 2016 in Ljubljana and in 2019 in Amersfoort. The latter, in cooperation with DARIAH-SI, was devoted to drawing up recommendations for the standardised encoding of corpora of parliamentary debates, named Parla-CLARIN²⁷ (Erjavec and Pančur, in press), which has become a popular choice for encoding parliamentary corpora. On this basis, CLARIN.SI gained a key role in two larger "CLARIN Flagship" projects, ParlaMint I (2020–2021) and ParlaMint II (2022–2023).
The aim of the ParlaMint projects is to create comparable, interpretable and uniformly encoded corpora of parliamentary debates. In the already completed ParlaMint I project, CLARIN.SI led the collection and encoding of 17 corpora of national parliaments (Erjavec et al., 2022), which are openly available in the CLARIN.SI repository as well as through the RI's concordancers. In the ParlaMint II project, whose aim is to extend and enrich the existing corpora, to add the corpora of new partners, and also to develop teaching materials and examples of good practice in using parliamentary corpora for research in the humanities and social sciences, CLARIN.SI members lead four of the five work packages²⁸.
Members of the CLARIN.SI Governing Board take part in the work of the CLARIN committees for legal issues (Mateja Jemec Tomazin, ZRC SAZU), for standardisation (Tomaž Erjavec, IJS) and for user involvement (Jakob Lenarčič, FF UL), and in the annual CLARIN conferences (T. Erjavec chairs the programme committee of the 2022 conference in Prague). J. Lenarčič received the CLARIN Steven Krauwer Award as young researcher of the year 2019, among other things for his work (together with Darja Fišer) in setting up the "CLARIN Resource Families" initiative²⁹, and T. Erjavec received the "Steven Krauwer Award for CLARIN Achievements 2021" for his work on the ParlaMint project. In 2021, Darja Fišer and Kristina Pahor de Maiti (FF UL) received the "Teaching with CLARIN Award" for the best teaching material connected with the use of CLARIN resources. Kaja Dobrovoljc (FF UL) presented the CLARIN RI at the conference marking the 20th anniversary of ESFRI in Paris in 2022³⁰. Darja Fišer was the director for user involvement between 2016 and 2020, and in 2023 she is expected to become the Executive Director of CLARIN ERIC.

7. Conclusions
CLARIN.SI is an exceptionally successfully established infrastructure covering a broad interdisciplinary field, from research in the humanities and social sciences to the development of knowledge and artificial intelligence systems and technologies. It supports basic and applied research and the development of applications, information systems and tools at all technology readiness levels.
With its exceptional national and regional importance, its promotion of the field, its attraction of young researchers, its links with industry and broad stakeholder involvement, its strong role in introducing the principles of open science, and its highly visible, successful and indeed award-winning role in European and international cooperation with related projects, CLARIN.SI is a model for establishing a successful, top-level, modern interdisciplinary research and technology infrastructure.
In the next period, in addition to maintaining its existing services, CLARIN.SI will even more intensively encourage the reuse of research data, enabling researchers in the humanities and social sciences to increase their productivity and, more importantly, to establish new research directions addressing one or more of the social roles of language. Another important goal is the implementation of the CLARIN ERIC interoperability guidelines³¹, a key precondition for effectively supporting research work through the interoperability of tools, resources, metadata and encoding standards, as well as at the organisational level (Jong et al., 2020). At the same time, user support will have to be strengthened, since universities and agencies increasingly require researchers in doctoral and research programmes to provide plans for research data management and long-term data preservation.
The ESFRI Roadmap 2021³² for RIs stresses the importance of FAIR data; we have already taken several important steps in this area within the CLARIN.SI repository and will continue to attend to the FAIR aspects. In connection with RDA Node Slovenia, a workshop on CTS certification and the FAIR principles for Slovenian research data repositories is thus being planned. The ESFRI Roadmap 2021 also stresses the ever greater presence of big data and the importance of infrastructures storing and processing it appropriately. Because of the ever larger quantity of available texts, the shift from written to spoken and visual language resources, and the ever richer annotation of texts, the field of language resources is also entering the big data era, as the ParlaMint II, RSDO and MaCoCu projects already show. CLARIN.SI will therefore in the next period support the use of hardware and software capacities for storing and, above all, processing big data. The Roadmap also stresses the importance of research infrastructures for the capture, storage and processing of data from social networks and the web. CLARIN.SI has already paid special attention to such language resources and will strengthen these activities in the future, not only for Slovenian but (within the CLASSLA knowledge centre and the MaCoCu project) also for the other South Slavic languages.
The Roadmap also stresses the instrumentation and accessibility of data and services important for individual communities.

²⁷ https://clarin-eric.github.io/parla-clarin/
²⁸ https://www.clarin.eu/parlamint
²⁹ https://www.clarin.eu/resource-families
³⁰ https://www.esfri.eu/esfri-events/esfri-20years-conference
³¹ https://www.clarin.eu/content/interoperability
³² https://www.esfri.eu/esfri-roadmap-2021
The Roadmap further emphasises the instrumentalisation and accessibility of data and services that are important for individual communities. The CLARIN.SI consortium currently includes 12 members, which covers the large majority, though not all, of the Slovenian stakeholders that either produce or use language resources and technologies. In the coming period CLARIN.SI will strive to enlarge its consortium, thereby also covering communities of potential users of the infrastructure that have so far not been included in its operation. CLARIN.SI likewise plans a study of the needs of individual communities (researchers and lecturers in the humanities and in computational linguistics, lexicographers, translators, people with special needs) and an improvement of its offer in line with the findings.

Among other things, the Roadmap stresses the importance of education, training and support in the use of infrastructures for existing and future users. In the first period of its existence CLARIN.SI was severely understaffed, but we nevertheless carried out a series of workshop events around Slovenia and abroad, above all at various faculties, where we presented the infrastructure to students. In the coming period we will take up these activities systematically, with a more proactive approach to organising lectures and workshops for students as well as for researchers and lecturers, and to developing and promoting teaching materials.

Last but not least, the recently adopted Slovenian Research Infrastructure Development Plan 2030 (NRRI 2030)33 is also important for the future of CLARIN.SI: it "plans to continue and strengthen activities within the international CLARIN projects" (p. 60), acknowledges the cooperation to date with the DARIAH-SI and CESSDA/ADP research infrastructures, and additionally envisages links with two new research infrastructures, namely OPERAS (Open Scholarly Communication in the European Research Area for Social Sciences and Humanities)34, led in Slovenia by ZRC SAZU, and PRACE (Partnership for Advanced Computing in Europe)35, led by ARNES.

Acknowledgements

The work presented in this paper was supported by ARRS within the funding of ESFRI research infrastructures, by the Republic of Slovenia and the European Union from the European Regional Development Fund within the projects C3330-19-952059 "Development of Research Infrastructures for the International Competitiveness of the Slovenian RRI Space / RI-SI-CLARIN" and OP20.06780 "Development of Slovene in a Digital Environment", and by CLARIN ERIC projects.

We also thank our colleagues at CLARIAH-CZ for their help with upgrading and maintaining the repository platform, our colleagues at the Czech National Corpus, in particular Tomáš Machálek, for their help with the installation of the KonText concordancer, and the staff of Lexical Computing, in particular Jan Bušta and Tomáš Svoboda, for their help with the installation of the Sketch Engine Crystal concordancer.

33 https://www.gov.si/assets/ministrstva/MIZS/Dokumenti/ZNANOST/Novice/NRRI-2030/NRRI-2030_SLO.pdf
34 https://www.operas-eu.org
35 https://www.prace-ri.eu

8. References

Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805. https://doi.org/10.48550/arXiv.1810.04805.

Tomaž Erjavec, Jan Jona Javoršek and Simon Krek. 2014. Raziskovalna infrastruktura CLARIN.SI. In: Zbornik Devete konference JEZIKOVNE TEHNOLOGIJE, Ljubljana, 9.–10. oktober 2014. Slovensko društvo za jezikovne tehnologije. https://nl.ijs.si/isjt14/proceedings/isjt2014_03.pdf.

Tomaž Erjavec, Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Darja Fišer, Cyprian Laskowski and Katja Zupan. 2016. Annotating CLARIN.SI TEI corpora with WebAnno. In: Proceedings of the CLARIN Annual Conference. https://www.clarin.eu/sites/default/files/erjavec-etal-CLARIN2016_paper_17.pdf.

Tomaž Erjavec, Darja Fišer and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources and Evaluation, 55(2):551–583. https://rdcu.be/b7GrB.

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx and Darja Fišer. 2022. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation. https://doi.org/10.1007/s10579-021-09574-0.

Tomaž Erjavec and Andrej Pančur. In press. The Parla-CLARIN Recommendations for Encoding Corpora of Parliamentary Proceedings. Journal of the Text Encoding Initiative. https://journals.openedition.org/jtei/index.html.

Darja Fišer and Kristina Pahor de Maiti. 2021. "Prvič, sem političarka in ne politik, drugič pa…". Contributions to Contemporary History, 61(1). https://doi.org/10.51663/pnz.61.1.07.

Darja Fišer, Jakob Lenardič, Ilze Auziņa, Nan Bernstein Ratner, Koenraad De Smedt, Kaja Dobrovoljc, Réka Dodé, Rickard Domeij, Helge Dyvik, Tomaž Erjavec, Olga Gerassimenko, Jan Hajič, Michal Křen, Nikola Ljubešić, Brian MacWhinney, Monica Monachini, Beatrice Nava, Costanza Navarretta, Aneta Nedyalkova, Klaus Nielsen, Marin Laak, Susanne Nylund Skog, Lene Offersgaard, Petya Osenova, Valeria Quochi, Sanita Reinsone, Inguna Skadiņa, Kiril Simov, Ondřej Tichý, Noémi Vadász, Tamás Váradi and Kadri Vider. 2019. Tour de CLARIN Volume Two. Zenodo. https://doi.org/10.5281/zenodo.3754164.

Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2020.
The Janes project: Language resources and tools for Slovene user generated content. Language Resources and Evaluation, 54:223–246. https://rdcu.be/7RX4.

Franciska De Jong, Bente Maegaard, Koenraad De Smedt, Darja Fišer and Dieter Van Uytvanck. 2018. CLARIN: Towards FAIR and Responsible Data Science Using Language Resources. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7–12, 2018. European Language Resources Association (ELRA). https://aclanthology.org/L18-1515.

Franciska De Jong, Bente Maegaard, Darja Fišer, Dieter Van Uytvanck and Andreas Witt. 2020. Interoperability in an infrastructure enabling multidisciplinary research: The case of CLARIN. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3406–3413. European Language Resources Association (ELRA). https://aclanthology.org/2020.lrec-1.417/.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel. 2014. The Sketch Engine: Ten years on. Lexicography, 1:7–36.

Petra Kralj Novak, Jasmina Smailović, Borut Sluban and Igor Mozetič. 2015. Emoji Sentiment Ranking 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1048.

Simon Krek. 2022. Deliverable D1.31: Report on the Slovenian Language. Technical report, European Language Equality Project. https://european-language-equality.eu/wp-content/uploads/2022/03/ELE___Deliverable_D1_31__Language_Report_Slovenian_.pdf.

Luka Krsnik, Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Aleksander Ključevšek, Simon Krek and Marko Robnik-Šikonja. 2019. Corpus extraction tool LIST 1.2. http://hdl.handle.net/11356/1276.

Jakob Lenardič and Darja Fišer. 2022. CLARIN Depositing Guidelines: State of Affairs and Proposals for Improvement. In: Proceedings of the CLARIN Annual Conference, Prague, Czech Republic, October 10–12, 2022. https://www.clarin.eu/event/2022/clarin-annual-conference-2022.

Nikola Ljubešić, Filip Markoski, Elena Markoska and Tomaž Erjavec. 2021. Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1427.

Tomáš Machálek. 2020. KonText: Advanced and Flexible Corpus Query Interface. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 7003–7008, Marseille, France, May. European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.865.

Katja Meden and Tomaž Erjavec. 2021. Pregled slovenskih repozitorijev raziskovalnih podatkov. Technical report, Jožef Stefan Institute. https://www.clarin.si/info/services/projects/#RDA_Node_Slovenia.

Andrej Pančur, Tomaž Erjavec, Mihael Ojsteršek, Mojca Šorn and Neja Blaj Hribar. 2020. Slovenian parliamentary corpus (1990–2018) siParl 2.0. http://hdl.handle.net/11356/1300.

Pavel Rychlý. 2007. Manatee/Bonito - A Modular Corpus Manager. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 65–70, Brno. Masarykova univerzita.

Matej Ulčar and Marko Robnik-Šikonja. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1397.

Tamás Váradi, Steven Krauwer, Peter Wittenburg, Martin Wynne and Kimmo Koskenniemi. 2008. CLARIN: Common language resources and technology infrastructure. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/pdf/317_paper.pdf.

Darinka Verdonik, Tomaž Potočnik, Mirjam Sepesy Maučec, Tomaž Erjavec, Simona Majhenič and Andrej Žgank. 2019. Spoken corpus Gos VideoLectures 4.0 (transcription). http://hdl.handle.net/11356/1223.

Špela Vintar and Matej Martinc. 2022. Framing karstology: From definitions to knowledge structures and automatic frame population. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 28(1):129–156.

Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho and Chris Biemann. 2013. WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 1–6, Sofia, Bulgaria, August. Association for Computational Linguistics. https://aclanthology.org/P13-4001.

Aleš Žagar, Matic Kavaš, Marko Robnik-Šikonja, Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Marko Ferme, Mladen Borovič, Borko Boškovič, Milan Ojsteršek and Goran Hrovat. 2022. Corpus of academic Slovene KAS 2.0. http://hdl.handle.net/11356/1448.
ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts

Simon Gonzalez
The Australian National University,
Canberra, Australian Capital Territory, Australia
u1037706@anu.edu.au
Abstract
Social media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, coming from any part of the world and from natural contexts, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics, and there is currently a wide range of projects which have successfully created corpora from social media. In this paper, we present the development and deployment of a linguistic corpus built from Twitter posts in English, coming from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams. The information is presented through a range of powerful visualisations for users to explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.
1. Introduction

In the current age, the use of social media platforms has permeated all circles of society, from personal communication to government communications. Its impact is hard to overstate. It is considered a form of mass media, but one distinct from other forms such as television and radio, where the information is presented by a specific broadcasting mechanism (Page et al., 2014). In the case of social media, the content can be delivered by anyone, making it more personal and individual than other mass forms. The adoption of this technology in language research has been an organic and necessary process: language research investigates the use of language in society, and since social media is a medium of language, we need to understand how we use language in this digital world.

One framework that has paved the way for linguists to examine social media language is Computer-Mediated Communication (CMC). This has been defined as a relatively new 'genre' of communication (Herring, 2001), which can involve a diversity of forms connecting the physical and the digital (Boyd and Heer, 2006). One focus of CMC research is the intrinsic characteristics of digital language, e.g. stylistics, word use, semantics, and other relevant linguistic features. This has been done for various CMC types, including social media.

But describing social media features is not a straightforward task, because social media is not a homogeneous genre. It has a diversity of types depending on the main shareable content (e.g. YouTube for videos, Twitter for texts1) and main format (e.g. Reddit as a discussion forum, Pinterest Product Pins for product purchases). One common feature, however, is that all platforms have an interactive component in which users can express ideas, comment, and reply to other people's perspectives. The inherent communicative aspect of this social interaction has a strong implication for linguistic research: when we analyse language from social media, we look at how language is used in natural contexts, with concrete communicational purposes. What then distinguishes our approach as language researchers from that of engineers and app developers, for example, is that we are interested in studying how people use technology to communicate and in describing what makes it a distinctive type of language (Page et al., 2014). In this sense, we are interested in identifying the language patterns used on social media platforms, knowing that patterns found in social media are not necessarily representative of language patterns in other contexts. This has been demonstrated empirically by Grieve et al. (2019), who compared Twitter data with traditional survey data and found that some patterns were observed more strongly in the Twitter data than in the survey data. Results like these are evidence that when we deal with social media language, we are examining a way of expression which shares features with other language forms but at the same time has its own distinctive characteristics. This is paramount to consider when new language analysis technologies are developed.

1 The type of content of social media platforms is not restricted to only one; this is just an example of each platform's main purpose. For instance, YouTube allows users to write comments on videos, and Twitter can embed videos in posts.

1.1. Twitter and Corpus Linguistics

The combination of language research and social media is a complex endeavor, requiring people who work with both to apply the skills needed for this interdisciplinary undertaking. One area that reflects this complexity and that has efficiently adopted social media is Corpus Linguistics (CL). A strong characteristic of CL is that it is used to collect, store, and facilitate language analysis for large datasets (Szmrecsanyi, 2011; Grieve, 2015). And with the advantage of having more sophisticated tools available, such as in social media research, corpora are becoming larger and larger, with the only constraints being computational power and storage capacity.

Many social media platforms have been widely used for language and linguistic research (c.f. Liew and Hassan, 2021; Nagase et al., 2021; Trius and Papka, 2022; Wincana et al., 2022). Out of these platforms, Twitter stands out due to its worldwide spread and the options it gives researchers for stratifying the demographics of user accounts, including the use of the geo-code and time-stamp
information of the posts2 (Grieve et al., 2018). It is classified as a microblogging site (Chandra et al., 2021) where the content can consist of opinions, news, arguments, and other types of sentences (Chaturvedi et al., 2018). Because of its widespread use, Twitter has served as the source for numerous corpora created from Twitter posts (c.f. Dijkstra et al., 2021; Grieve et al., 2019; Tellez et al., 2021).

2 The geo-code information is optional in Twitter, and the user decides whether to show it or not. Other approaches include running algorithms that identify locations based on factors such as time zone and language features, which are used to infer locations.

1.2. Current Project

In this paper, we present the development of a web-based corpus from Twitter posts, named ILiAD: An Interactive Corpus for Linguistic Annotated Data. In relation to our methodological approach, we propose that corpora built from social media help study the patterns of language used in this context and capture their linguistic complexity. By doing this, we can have a better view of the multilayered nature of the corpus.

2. Goal of the Paper

The aim of the corpus is to capture the linguistic complexities of Twitter language, and we chose two types of account users: news agencies and individuals. We explore the differences between their structures and patterns. The language of journalism is characterised by its main purpose: to exert influence on readers and convince them of a specific interpretation (Fer, 2018; Moschonas, 2014). This is achieved by three main stylistic features. The first is language clarity, a feature more strongly characteristic of journalism than of many other language styles. The second is accuracy, which refers to the ability to convey ideas precisely, avoiding ambiguities in interpretation. The final one is simplicity, which aims to convey messages without the use of complex words that may blur the intention of the message (Fer, 2018). The aim of this paper is therefore to prepare the corpus for further exploration, querying and analysis in order to understand the language used in Twitter.

The analysis can focus on many linguistic parameters, and here we approach it in an integrated way. This gives users the opportunity to explore the corpus from different angles and linguistic perspectives.

3. Methodology

The stages of data collection, data processing, and app deployment were carried out in R (R Core Team, 2021), using shiny R (Chang et al., 2021) for the app development. Apps developed in shiny have three main advantages. The first is its interactivity: users can interact with the whole corpus, across a range of visualisation outputs and tables. The second is its reactive power: users modify parameters in the tables and visualisations, and the app changes its outputs based on user inputs. The positive impact on corpus linguistics is invaluable. With these features, a corpus can be used to gain a full understanding of the shape of the data as well as an exploration of patterns.

3.1. Data Collection

We applied four criteria to identify the Twitter accounts to be included in the corpus. The first criterion was that account users (news agencies and individuals) had to have English as the main language of communication. The second was that accounts had to be active at the moment of extraction. The reason was to capture tweets that were synchronous and where topics and trends could be shared across accounts. The third criterion was that accounts had to have a large number of tweets, enough to reach over 3,000. This was done to make sure that enough posts were left after the filters were applied, as explained below. The final criterion was to only include those users whose posts were not mainly retweets. This filter aimed to exclude those accounts that do not produce content but only retweet posts from other accounts. From this, we identified 29 news agencies and 27 individual accounts. The proportions are shown in Table 1.

User Type      Total Tweets   Percentage
News Agency    84,354         54%
Individual     71,477         46%
Total          155,831        100%

Table 1: Total number of tweets in the corpus and their proportions per account type.

The data extraction was done through an R script developed by the main author. We used the rtweet (Kearney, 2019) package, which allows users to gather Twitter posts through the free Twitter API, giving a total of over 156,000 tweets.

Year    Total Tweets   Percentage
2009    139            0.1%
2010    178            0.1%
2011    497            0.3%
2012    2,230          1.4%
2013    5,097          3.3%
2014    3,625          2.3%
2015    5,159          3.3%
2016    6,745          4.3%
2017    5,508          3.5%
2018    6,301          4.0%
2019    7,847          5.0%
2020    18,742         12.0%
2021    20,697         13.3%
2022    73,066         46.9%
Total   155,831        100%

Table 2: Total number of tweets in the corpus and their proportions per year.
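A minimal sketch of the kind of extraction script described above, using the rtweet 0.7 API (Kearney, 2019); the account handles below are placeholders, not the accounts actually sampled in this paper, and authentication via create_token() is assumed to be set up.

```r
library(rtweet)

# Placeholder handles; the real corpus used 29 news agencies and 27 individuals.
accounts <- c("newsagency_example", "individual_example")

# get_timeline() pages through a user's most recent tweets via the free
# Twitter API (capped at roughly 3,200 tweets per account).
tweets <- get_timeline(accounts, n = 3200)

# Keep only accounts that satisfy the size criterion (over 3,000 tweets).
counts <- table(tweets$screen_name)
tweets <- tweets[tweets$screen_name %in% names(counts[counts > 3000]), ]
```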
3.2. Data Processing

From the collected data, we applied six filters to make sure that the corpus reflects comparable linguistic data for all account users. The first filter was to exclude tweets that were not in English (n=10,067; 6%). This was done by filtering out those tweets which did not have the English (en) language tag assigned by Twitter's machine language detection,
which is annotated in the tweet's metadata. The second filter was to exclude retweets (n=23,260; 15%). This restricts the data to those posts that come from the given user and not from other accounts. The third filter was to exclude quote tweets (n=7,142; 5%). These are tweets that are retweeted with an added comment from the user. Keeping quote tweets in the data would add repeated tweets to the corpus and would also add patterns and word counts that do not correspond to a specified account. The fourth filter deleted repeated tweets (n=778; 0.5%). This targeted those cases in which account users write the same content and post it as a separate tweet, but not as a retweet. As with quote tweets, keeping repeated tweets would inflate the content of the corpus and make it unrepresentative. For the fifth filter, we excluded strings that were URL links, which do not have linguistic features3 of interest in this paper (n=1,208; 0.8%). For the sixth and last filter, we first calculated the number of words for each tweet, splitting on white space to get the number of individual words. We then excluded those tweets that were less than eight words long (n=14,125; 9%). This filter targets those tweets which do not have linguistic content but only social media features such as hashtags or links.

3 URL links are an important aspect of social media language. However, their analysis is beyond the scope of this paper.

With these filters, the final data contained 112,690 tweets. This is a loss of 28% (n = 43,919) of the original data exported from the Twitter API.
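A sketch of the six filters as a dplyr pipeline, assuming the column names of an rtweet 0.7 data frame (lang, is_retweet, is_quote, text) and the `tweets` object from the previous sketch; the URL-only test is a simplification of the paper's fifth filter.

```r
library(dplyr)
library(stringr)

filtered <- tweets %>%
  filter(lang == "en") %>%                                 # 1. English only
  filter(!is_retweet) %>%                                  # 2. no retweets
  filter(!is_quote) %>%                                    # 3. no quote tweets
  distinct(text, .keep_all = TRUE) %>%                     # 4. drop repeated tweets
  filter(!str_detect(text, "^\\s*https?://\\S+\\s*$")) %>% # 5. drop URL-only posts
  filter(str_count(text, "\\S+") >= 8)                     # 6. at least eight words
```

Here str_count(text, "\\S+") counts whitespace-delimited tokens, mirroring the word-count step described in the text.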
3.3. Text Processing

After data filtering, we implemented a wide range of Natural Language Processing (NLP) techniques for the data wrangling and analysis. We carried out the text processing using the UDPipe (Straka and Straková, 2017) package as the main tool for the NLP tasks. UDPipe is defined as a single tool which contains a tokenizer, morphological analyzer, part-of-speech tagger, lemmatizer, and dependency parser. It currently offers 77 language models, with some languages having more than one model available. We used the EWT English model available in the package. We selected the text column from the API output and made it the input for the main UDPipe function. The core purpose of the UDPipe package is to create a single-model tool for a given language which can be used to process raw text and convert it to CoNLL-U-formatted text. This format stores tagged information for all words in dependency trees, including morphological and syntactic features (Straka and Straková, 2017). From this format, the UDPipe algorithm creates morphological taggers and dependency parsers. The main taggers are described below.

3.3.1. Tokenization

The tokenization tools are wrapped within a trainable tokenizer based on artificial neural networks, specifically the bidirectional LSTM (Graves and Schmidhuber, 2005) and the gated recurrent unit, GRU (Cho et al., 2014). It works by comparing the words in the input text to the trained tokenizer and does not add any additional knowledge about the language. If a given word, or group of words, is not recognized, the tokenizer tries to reconstruct it by utilizing an additional raw text corpus.

3.3.2. Morphological Analysis

There are three main fields tagged in the data process:

1. Part-of-speech tagging
2. Morphological features
3. Lemma or stem

The part-of-speech tagging uses MorphoDiTa (Straková et al., 2014). The tagging process exploits the rich linguistic features of inflective languages with a large number of suffixes, where multiple forms can be related to a single lemma. From this, the tagger estimates common patterns in endings and creates morphological templates from the observed clusters. In Table 3, we show the top ten counts and proportions of part-of-speech tags in the current corpus, as output by UDPipe.

POS     Corpus Count   Percentage
NOUN    76,795         20.8%
VERB    62,537         16.9%
ADP     39,237         10.6%
PROPN   37,862         10.3%
PRON    37,399         10.1%
DET     31,284         8.5%
PUNCT   30,001         8.1%
ADJ     24,452         6.6%
ADV     16,425         4.4%
AUX     13,171         3.6%

Table 3: Total number of the top ten part-of-speech tags in the corpus and their proportions.

3.3.3. Classification Features

UDPipe uses two models that facilitate the tagging process and improve the overall accuracy by employing different classification feature sets. The first is the POS tagger, which disambiguates all available morphological fields in the data. The second model, a lemmatizer, disambiguates the tagged lemmas.

3.3.4. Dependency Parsing

Dependency parsers belong to the family of grammar formalisms called dependency grammars (Jurafsky and Martin, 2021). In these, the syntactic structure of sentences is described in terms of the grammatical relations between the words, shown as directed binary dependencies. All structures start at the root node of the tree, and the components and their dependencies are then shown throughout the entire structure. Dependency parse trees can deal very efficiently with languages that are morphologically rich and also have a relatively free word order, for example Spanish, Czech, and English, with varying flexibility. Another important advantage of using dependency parsers is that they allow closer examination of the semantic relationships between arguments in the sentence.

Summing up, the features, descriptions, and tagging done by the UDPipe framework offer invaluable information relevant for the kind of linguistic analysis used in Corpus Linguistics. With these features extracted for all tweets, we have information available at different layers for linguistic
analysis: morphological, syntactic, and even semantic, through the dependency parsers.
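A sketch of the annotation step described in Section 3.3, using the udpipe R package and the english-ewt model named in the text; the `filtered` data frame and its text column are placeholders carried over from the filtering sketch above.

```r
library(udpipe)

model_file <- udpipe_download_model(language = "english-ewt")
ud_model   <- udpipe_load_model(model_file$file_model)

# udpipe_annotate() runs tokenization, tagging, lemmatization and dependency
# parsing; as.data.frame() flattens the CoNLL-U output into one row per token
# (doc_id, sentence_id, token, lemma, upos, feats, dep_rel, ...).
anno <- udpipe_annotate(ud_model, x = filtered$text)
anno <- as.data.frame(anno)
```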
3.4. Data Filtering

After obtaining the output from the UDPipe package, we proceeded to filter the data. The motivation was to prepare it for the linguistic analysis within the corpus. This filtering process produces two dataset outputs which are used for different purposes in the corpus. The first is used for calculating n-grams and word frequencies. The second is for showing syntactic dependencies.

3.4.1. Token Filtering

Identifying the right tokens in social media language is a difficult process, and correct practice in this step is crucial to achieving efficient outcomes. This filtering differs from the practice used on other language media such as the language of newspapers, television, and academic papers. Following O'Connor et al. (2010), we excluded tokens containing hashtags, URL links, @-replies, strings of punctuation, and emoticons4. Their proportions are shown in Table 4.

Content Excluded   Total Count   Percentage
Emoticons          1,556         0.4%
Hashtags           1,986         0.5%
URL Links          2,857         0.7%
@-replies          3,851         0.9%
Punctuation        30,001        7.3%

Table 4: Total number of social media content items excluded and their proportions in the whole corpus.

4 Emoticons entail rich linguistic information. However, their analysis is not included in this version of the tool.

3.4.2. Removing Stop Words

Following standard procedures, we removed stop words for calculating n-grams and word frequencies. An important observation is that removing stop words is a compromise for the corpus, since certain word combinations are affected, especially those which appear together with the words in the list. Future versions of this work aim to implement analysis that takes the role of stop words in the corpus into account.

Here we removed stop words by following the steps below (a short code sketch follows the list):

1. First, we selected a list of stop words from the stopwords (Benoit et al., 2021) package in R. We selected the list used for English, which included 175 words (see Table 5 for the top 15).
2. Next, we filtered out the stop words in this data subset.
3. Finally, we filtered out stop words that are specific to Twitter, which includes words such as RT, follow, follows, and following. In future versions, we aim to implement a disambiguation algorithm whereby a key word, such as follow, can be identified either as a word used in a social media context (e.g. follow us on Twitter) or in a more traditional one (e.g. follow the road).
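A minimal sketch of the three steps above, assuming the token-level `anno` data frame from the UDPipe sketch; the Twitter-specific list is illustrative, as in the running text.

```r
library(dplyr)
library(stopwords)

en_stops      <- stopwords::stopwords("en")                  # step 1: 175-word English list
twitter_stops <- c("rt", "follow", "follows", "following")   # step 3: Twitter-specific words

tokens <- anno %>%
  mutate(token_lc = tolower(token)) %>%
  filter(!token_lc %in% en_stops) %>%                        # step 2
  filter(!token_lc %in% twitter_stops)                       # step 3
```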
Stop word   Total Count   Percentage
the         17,151        4.18%
to          11,543        2.81%
a           9,076         2.21%
be          8,480         2.07%
and         7,844         1.91%
of          7,001         1.71%
I           6,734         1.64%
in          6,429         1.57%
you         5,315         1.3%
have        4,083         1%
that        3,933         0.96%
it          3,803         0.93%
for         3,793         0.92%
on          3,552         0.87%
he          3,442         0.84%

Table 5: Top 15 stop words excluded from the data subset and their proportions in the corpus.

3.4.3. Sentence Structure Filtering

In this filter, we aimed to identify those posts which were not linguistic phrases or sentences, thus including only those structures that were classified into a sentence category. For each tweet breakdown produced by UDPipe (as shown in Table 6), we looked at the PUNCT classification, where we identified three types of sentences: statements (ending with "."), questions (ending with "?") and exclamations (ending with "!"). Any unclassified sentence was deleted from the dataset. The decision to keep only sentences that follow the standard punctuation symbols has a strong impact on a corpus based on Twitter language, since sentences here can follow other rules, e.g. ending a sentence with emoticons or with other uses of punctuation symbols, such as !!! or :). However, a substantial number of sentences follow the most standard use of punctuation symbols, which makes this a reliable representation of the data collected. Finally, for each sentence, we checked whether there was a conjugated verb. Sentences with no conjugated verb were deleted from the dataset used for the Syntactic Dependencies section. For this, we created a data subset that only contained sentences and their corresponding classification from the previous step. This was the input for the section explained in 4.1.2.

token    upos    feats          dep_rel
Senate   PROPN   Number=Sing    nsubj
needs    VERB    Mood=Ind       root
to       PART                   mark
think    VERB    VerbForm=Inf   xcomp
and      CCONJ                  cc
vote     VERB    VerbForm=Inf   conj

Table 6: Sample output from UDPipe.
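A sketch of the sentence-structure filter, again assuming the `anno` data frame from the UDPipe sketch; sentence ids and the feats column are standard CoNLL-U fields, while reading "conjugated verb" as VerbForm=Fin is our assumption.

```r
library(dplyr)

sentences <- anno %>%
  group_by(doc_id, sentence_id) %>%
  summarise(
    last_token = last(token),
    # a finite (conjugated) verb somewhere in the sentence
    has_finite = any(upos %in% c("VERB", "AUX") & grepl("VerbForm=Fin", feats)),
    .groups = "drop"
  ) %>%
  mutate(sent_type = case_when(
    last_token == "." ~ "statement",
    last_token == "?" ~ "question",
    last_token == "!" ~ "exclamation",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(sent_type), has_finite)  # drop unclassified or verbless sentences
```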
3.5. Calculating N-Grams

Implementing NLP techniques brings more depth to the corpus analysis, since it allows users to explore more areas in the data. In the current version of the app, we use unigram and bigram explorations. The n-grams are calculated using the tidytext (Silge and Robinson, 2016) package. We followed the established approach of deleting stop words in English, using the stopwords package. After the filtering, the n-grams were calculated across all the data.
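A sketch of the bigram computation with tidytext, under the same assumptions as above (`filtered$text` holds the cleaned tweets).

```r
library(dplyr)
library(tidytext)
library(tidyr)

bigrams <- filtered %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # lowercased bigrams
  separate(bigram, c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stopwords::stopwords("en"),
         !w2 %in% stopwords::stopwords("en")) %>%
  count(w1, w2, sort = TRUE)                                # bigram frequencies
```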
3.6. Entity Identification

A second group of NLP techniques implemented is the identification of entities in the corpus, which includes mentions of people, physical locations, and established organisations. We used the entity (Rinker, 2017) package for this purpose. This package is a wrapper that simplifies and extends the named entity recognition of the NLP (Hornik, 2020) and openNLP (Hornik, 2019) packages. The advantage of this approach is that we can use it to detect important information, which is crucial especially in large datasets. It also has a strong impact on our understanding of linguistic features, since entities are related to important elements in sentences, such as nouns and adjectives. By implementing this, the app brings more depth to the corpus analysis, since it allows users to explore the main entities in the corpus.
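A sketch of entity extraction along these lines; we assume the entity package's person_entity(), organization_entity() and location_entity() helpers, which wrap the openNLP named entity annotators, so treat the exact function names as assumptions.

```r
library(entity)

people <- person_entity(filtered$text)        # person mentions per tweet (assumed helper)
orgs   <- organization_entity(filtered$text)  # organisations (assumed helper)
places <- location_entity(filtered$text)      # physical locations (assumed helper)

# Frequency table of organisation mentions, e.g. for the bar plot in 4.1.4.
org_counts <- sort(table(unlist(orgs)), decreasing = TRUE)
```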
3.7. Twitter Metrics

The final set of metrics aims to show information that is relevant when dealing with Twitter data. The motivation is to contextualise the information in the corpus within the overall world of social media. The information presented here is extracted from the Twitter API output, from which we display two publicly available features. The first is the number of tweets across time. We also include a general summary of the main source locations, by country, of the tweets contributing to the data. Previous studies (c.f. Grieve et al., 2019) have demonstrated that the use of geo-coding information is relevant for linguistic studies, but here we only show the country of origin of all tweets, without identifying individuals or linking linguistic features to any demographics.

4. App Infrastructure

The app was developed in RStudio, which has been widely used for corpus linguistics development and related tasks (Abeillé and Godard, 2000; Gries, 2009), and the main framework was shiny R. Shiny apps allow great interactivity and responsiveness. Interactivity allows users to explore visualizations in effective ways, and responsiveness allows users to navigate contents in real time, using clicks and dropdown menus. Other libraries that we used for the creation of visuals were ggplot2 (Wickham, 2016) and echarts4r (Coene, 2022). ggplot2 allows a great degree of flexibility when creating figures. This is relevant given the considerable complexity of the linguistic data that we present, as it allows complex ideas to be presented in a digestible way. Another advantage is that it allows users to see data points within the general context as well as to narrow down into more specific analyses. This creates seamless and efficient navigation of linguistic data. The presentation of the app components is divided into two main sections: the first gives users tools to explore linguistic features, and the second gives information on Twitter metrics. Due to the limitations of the Twitter Terms of Service, the app cannot display the raw tweets in a database format nor give the option to download data. The interactive tool therefore focuses on the presentation of the linguistic features derived from the data.

4.1. Exploring Calculated Features

The linguistic features are the main backbone of the corpus. In this section, there are visualisation options that can be used to gain both a broad understanding of patterns and a deep exploration of linguistic features.

4.1.1. Parts of Speech

This section gives the overall statistics of the words classified into their POS, including distributions and proportions per year and sentence type. The exploration can be done at different levels: the whole corpus or by user type (news agencies or individuals). The input data in this section comes from the Sentence Structure Filtering section (3.4.3).

Figure 1: POS Distributions Tab.

4.1.2. Syntactic Dependencies

This section allows users to explore the syntactic dependencies of all the available sentences. Here we use a combination of the UDPipe output and the textplot (Wijffels et al., 2021) package, which renders the dependencies as in the figure below. Since users can select any of the available sentences, this is a powerful function that can be used to explore syntactic patterns across the corpus and facilitates the understanding of syntactic structures. The input data in this section comes from the Sentence Structure Filtering section (3.4.3).

Figure 2: Syntactic Dependencies Tab.
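A sketch of the dependency visualisation for one selected sentence, assuming textplot's textplot_dependencyparser(), which plots a udpipe token data frame (and additionally requires the ggraph and igraph packages); the doc_id value is a placeholder.

```r
library(textplot)

# One sentence from the annotated corpus; "doc1" is udpipe's default id.
one_sentence <- subset(anno, doc_id == "doc1" & sentence_id == 1)

textplot_dependencyparser(one_sentence,
                          title = "Dependency parse of a selected tweet")
```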
4.1.3. Exploring N-Grams

N-grams are explored through visualisations, including connection networks. These networks are developed within the Network Analysis (NA) approach. The power of this analysis comes from its capability of observing
relationships between components. This technique has been implemented in other fields, such as psychology (Jones et al., 2021; Mullarkey et al., 2019) and social network research (Clifton and Webster, 2017; Würschinger, 2021). NA visualizations follow the assumption that if a relationship is meaningful within the whole network, it will stand out from random or weaker relationships through stronger connections. In this analysis, the connections are based on the frequencies which connect n-grams. Here we use the functionality of the visNetwork (Almende, 2021) package.

Figure 3: N-Grams Visualisation Tab showing a network of relationships.
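A sketch of a bigram network in visNetwork, built from the `bigrams` counts computed in the tidytext sketch above (top 50 pairs only).

```r
library(visNetwork)

top   <- head(bigrams, 50)
nodes <- data.frame(id = unique(c(top$w1, top$w2)))
nodes$label <- nodes$id
edges <- data.frame(from = top$w1, to = top$w2, value = top$n)

# Edge width (value) encodes bigram frequency, so stronger collocations
# stand out from weaker ones, as described in the running text.
visNetwork(nodes, edges)
```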
4.1.4. Exploring Entities

We use a different visualization approach for the entities captured in the corpus: bar plots and word clouds. The advantage of bar plots is that they show the frequencies ordered from the most frequent (left) to the least frequent (right). Word clouds are an easy and user-friendly way to represent frequencies; here, more frequent words are shown in larger fonts than less frequent ones. An example for the organizations mentioned in the corpus is shown in the figure below, with the bar plot at the top and the word cloud at the bottom.

Figure 4: Named Entities Visualisation Tab.

4.2. Twitter Data Metrics

The final section shows relevant Twitter data metrics, to which we dedicate two subsections. The first is a timeline visualization using a combination of the ggplot2 package and the plotly (Sievert, 2020) package. This combination gives ggplot2 plots interactive power. The timeline displays the number of posts across time, for all the data available in the corpus. The timeline can also be viewed by account type, giving more granular exploration. Another timeline visualization is applied to n-grams. This has been used to observe lexical innovations (c.f. Grieve et al., 2019) by looking at n-grams that increase in frequency across time; this tool can facilitate that type of analysis.

Figure 5: Twitter Timeseries Count Tab.

The second visualization implemented is a world map showing the region source information of tweets. The purpose is to visualize the main geographical areas from where the tweets come. We use the functionality of the echarts4r package, which is very efficient at displaying this type of information, as well as being interactive in an online context.

Figure 6: Twitter Map Tab.
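A sketch of the interactive timeline from Section 4.2: a ggplot2 count of tweets per year passed through plotly::ggplotly() for interactivity; the created_at column is the rtweet timestamp field assumed earlier.

```r
library(ggplot2)
library(plotly)

p <- filtered %>%
  dplyr::count(year = format(created_at, "%Y")) %>%  # tweets per year
  ggplot(aes(x = year, y = n, group = 1)) +
  geom_line() +
  labs(x = "Year", y = "Tweets")

ggplotly(p)  # wraps the static plot in an interactive widget
```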
5. Discussion

The app presents a wide range of visualizations and analyses of the Twitter corpus. The features capture different linguistic layers, including morphology, syntax, and n-grams. With the inclusion of Twitter metrics, this tool gives users ample opportunities to explore and understand the whole corpus. R and shiny R have proven to be an efficient combination for developing and deploying the corpus. For the text processing tasks, the use of the UDPipe and tidytext packages has been highly effective. We have used the built-in functions and have created our own custom functions to complete the tasks carried out throughout the whole process. For the visualization tasks, the combination of ggplot2, plotly, visNetwork, and echarts4r has proven efficient at representing complex linguistic features and relationship analyses. The app can be accessed through the following GitHub repository: https://github.com/simongonzalez/ILiAD.
6. Conclusion

In this paper, we have presented the development of a linguistic corpus based on Twitter posts. It has been designed to be used by a diversity of audiences who are interested in exploring linguistic patterns in corpora based on social media language. Similar tools have been developed with invaluable contributions to the field of Corpus Linguistics. Our proposal, however, integrates more strongly with a variety of visualization types that enhance the analysis in a holistic way. The tool also gives users interactive and reactive power over all the data, which offers not only a corpus to analyse, but a corpus to interact with and query in a more organic way than more traditional approaches to presenting corpora. Finally, it has been developed within an open-source framework, making it freely available to any user interested in using and even extending this tool.

7. Future Work

In the current version, we have selected a relatively small number of users for the corpus, compared to other, larger projects with similar goals. This is to allow the implementation of the interactive capability in the visualization methods, which requires a high level of computational power. We aim to add more data in future versions using more efficient processing algorithms. Finally, we see the value of adding linguistic analysis of emoticons. In a future version, we aim to include analysis of emoticons as a distinctive component of social media language.

8. Acknowledgements

I want to thank the anonymous reviewers of this paper for their invaluable comments and insights on the shape and content of the final version. Their generosity and expertise have improved this paper in innumerable ways and saved me from many errors. Those that inevitably remain are entirely my own responsibility.

9. References

Anne Abeillé and Danièle Godard. 2000. French word order and lexical weight. In: R. Borsley, ed., The Nature and Function of Syntactic Categories, Syntax and Semantics, Academic Press, pages 325–358.

Benoit Thieurmel. 2021. visNetwork: Network Visualization using 'vis.js' Library. R package version 2.1.0.

Kenneth Benoit, David Muhr and Kohei Watanabe. 2021. stopwords: Multilingual Stopword Lists, (Version 2.2) [R package]. Retrieved from https://github.com/quanteda/stopwords

Danah Boyd and Jeffrey Heer. 2006. Profiles as Conversation: Networked Identity Performance on Friendster. In: Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS'06), 2006, pages 59c–59c, doi: 10.1109/HICSS.2006.394.

Subhadip Chandra, Randrita Sarkar, Sayon Islam, Soham Nandi, Avishto Banerjee and Krishnendu Chatterjee. 2021. Sentiment Analysis on Twitter Data: A Comparative Approach. International Journal of Computer Science and Mobile Applications, 9(10):01–12.

Winston Chang, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert and Barbara Borges. 2019. shiny: Web Application Framework for R (Version 1.3.2) [R package]. Retrieved from https://CRAN.R-project.org/package=shiny

Snigdha Chaturvedi, Shashank Srivastava and Dan Roth. 2018. Where have I heard this story before? Identifying narrative similarity in movie remakes. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2, pages 673–678. New Orleans, Louisiana. Association for Computational Linguistics.

Allan Clifton and Gregory D. Webster. 2017. An introduction to social network analysis for personality and social psychologists. Social Psychological and Personality Science, 8(4):442–453.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

John Coene. 2022. echarts4r: Create Interactive Graphs with 'Echarts JavaScript', Version 5. https://echarts4r.john-coene.com/.

Jelske Dijkstra, Wilbert Heeringa, Lysbeth Jongbloed-Faber and Hans Van de Velde. 2021. Using Twitter Data for the Study of Language Change in Low-Resource Languages. A Panel Study of Relative Pronouns in Frisian. Frontiers in Artificial Intelligence, 4:644554.

Simona Fer. 2018. The Language of Journalism: Particularities and Interpretation of Its Coexistence with Other Languages (February 22, 2018). Available at SSRN: https://ssrn.com/abstract=3128134 or http://dx.doi.org/10.2139/ssrn.3128134

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, pages 5–6.

Stefan Th. Gries. 2009. Quantitative Corpus Linguistics with R. London and New York: Routledge.

Jack Grieve, Chris Montgomery, Andrea Nini, Akira Murakami and Diansheng Guo. 2019. Mapping Lexical Dialect Variation in British English Using Twitter. Frontiers in Artificial Intelligence, 2(11). doi: 10.3389/frai.2019.00011.

Jack Grieve, Andrea Nini and Diansheng Guo. 2018. Mapping Lexical Innovation on American Social Media. Journal of English Linguistics, Vol. 46, pages 293–319.

Jack Grieve. 2015. Dialect Variation. In: Douglas Biber and Randi Reppen, eds., The Cambridge Handbook of English Corpus Linguistics. Cambridge University Press.

Susan C. Herring. 2001. Computer-mediated discourse. In: D. Schiffrin, D. Tannen and H. Hamilton, eds., The Handbook of Discourse Analysis, (Oxford: Blackwell Publishers), pages 612–634.

Kurt Hornik. 2019. openNLP: Apache OpenNLP Tools Interface, (Version 0.2-7) [R package].

Kurt Hornik. 2020. NLP: Natural Language Processing Infrastructure, (Version 0.2-1) [R package].

Payton J. Jones, Ruofan Ma and Richard J. McNally. 2021. Bridge Centrality: A Network Approach to Understanding Comorbidity. Multivariate Behavioral
Research, 56(2):353–367. DOI: 10.1080/00273171.2019.1614898.

Daniel Jurafsky and James H. Martin. 2021. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Draft of December 29, 2021.

Michael W. Kearney. 2019. rtweet: Collecting and analyzing Twitter data. Journal of Open Source Software, 4(42):1829. doi: 10.21105/joss.01829. R package version 0.7.0. https://joss.theoj.org/papers/10.21105/joss.01829.

Tze Siew Liew and Hanita Hassan. 2021. The search for national identity in the discourse analysis of YouTube comments. Journal of Language and Linguistic Studies.

Spiros Moschonas. 2014. The Media On Media-Induced Language Change. In: J. Androutsopoulos, ed., Mediatization and Sociolinguistic Change, pages 395–426. Berlin, Boston: De Gruyter. https://doi.org/10.1515/9783110346831.395.

Michael C. Mullarkey, Igor Marchetti and Christopher G. Beevers. 2019. Using Network Analysis to Identify Central Symptoms of Adolescent Depression. Journal of Clinical Child and Adolescent Psychology, 48(4):656–668. DOI: 10.1080/15374416.2018.1437735.

Ryotaro Nagase, Takahiro Fukumori and Yoichi Yamashita. 2021. Speech Emotion Recognition with Fusion of Acoustic- and Linguistic-Feature-Based Decisions. In: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 725–730.

Brendan O'Connor, Michel Krieger and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. ICWSM.

Ruth Page, David Barton, Carmen Lee, Johann Wolfgang Unger and Michele Zappavigna. 2014. Researching Language and Social Media (1st ed.). Taylor and Francis. Retrieved from https://www.perlego.com/book/1559453/researching-language-and-social-media-pdf (Original work published 2014)

R Core Team. 2021. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Tyler Rinker. 2017. entity: Named Entity Recognition, (Version 0.1.0) [R package].

Carson Sievert. 2020. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. ISBN 9781138331457, https://plotly-r.com.

Julia Silge and David Robinson. 2016. tidytext: Text Mining and Analysis Using Tidy Data Principles. In: JOSS, 1(3). doi:10.21105/joss.00037, http://dx.doi.org/10.21105/joss.00037.

Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Jana Straková, Milan Straka and Jan Hajič. 2014. Open-source tools for morphology, lemmatization, pos tagging and named entity recognition. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June. Association for Computational Linguistics.

Benedikt Szmrecsanyi. 2011. Corpus-based dialectometry: a methodological sketch. Corpora, 6(1):45–76. DOI: 10.3366/cor.2011.0004.

Eric S. Tellez, Daniela Moctezuma, Sabino Miranda and Mario Graff. 2021. A large scale lexical and semantic analysis of Spanish language variations in Twitter. ArXiv, abs/2110.06128.

Lilia I. Trius and Nataliya V. Papka. 2022. Some Aspects of Online Discourse Manipulation on Social Media (the case of Instagram English Presentational Discourse of Pfizer Inc.). Current Issues in Philology and Pedagogical Linguistics.

Hadley Wickham. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

Jan Wijffels, Sacha Epskamp, Ingo Feinerer and Kurt Hornik. 2021. textplot: Visualise complex relations in texts, (Version 0.2.0) [R package]. Retrieved from https://github.com/bnosac/textplot

Gita Wincana, Wahyudi Rahmat and Ricci Gemarni Tatalia. 2022. Linguistic Tendencies of Anorexia Nervosa on Social Media Users Facebook (Pragmatic Study). Journal of Pragmatics and Discourse Research.

Quirin Würschinger. 2021. Social Networks of Lexical Innovation. Investigating the Social Dynamics of Diffusion of Neologisms on Twitter. Frontiers in Artificial Intelligence, 4:648583. doi: 10.3389/frai.2021.648583. PMID: 34790894; PMCID: PMC8591557.
Using the Collocations Dictionary of Modern Slovene in the Process of Translating Collocations

Martin Anton Grad,* Nataša Hirci†
*,† Department of Translation Studies, Faculty of Arts, University of Ljubljana
Aškerčeva 2, 1000 Ljubljana
martin.grad@ff.uni-lj.si
natasa.hirci@ff.uni-lj.si

Abstract

The paper outlines the findings of a study on how the Collocations Dictionary of Modern Slovene (KSSS) is used when translating collocations from English into Slovene. Undergraduate students of the Department of Translation Studies at the Faculty of Arts in Ljubljana were asked to translate a text with ten selected collocations. During the translation process, their on-screen activities were recorded to allow for the analysis of both the translation solutions and the translation process itself, focusing on the use of language resources, in particular the KSSS. The results show that the integration of the KSSS into the training process is successful, as all the students participating in the study were familiar with this resource and actively use it in their translation work. However, the study also exposed significant differences between students in terms of their familiarity with the advanced features of the KSSS and, consequently, their success and efficiency in finding appropriate collocations. The results of the study also highlight that the use of language resources does not necessarily lead to an optimal translation solution.
1. Introduction

Collocations are an extremely interesting linguistic phenomenon, but at the same time a very elusive one. By way of introduction, we summarise the definitions of Gantar et al. (2021), which form the basis for the inclusion of collocations in the Collocations Dictionary of Modern Slovene (KSSS), whose use is described in the present paper (cf. Kosem et al., 2018). In defining collocations, the authors highlight the statistical, syntactic and semantic aspects.

Atkins and Rundell define collocations as "recurrent combinations of words, where a particular lexical item (the node) shows a clear tendency to co-occur with another lexical item (the collocate), with a frequency greater than chance" (2008: 302). Gantar et al. point out the problem of collocations whose component parts do not usually occur next to each other in a text, i.e. other elements are inserted between them; these they call "extended collocations" (2021: 19).

Besides the statistical aspect, collocations are also defined by a syntactic aspect, since there is a hierarchical relationship between the collocational elements, in which the base determines the collocate (Hausmann, 1984: 401).

Gantar et al. (2021) present a third aspect, the most important and at the same time the most problematic one, namely the semantic aspect of collocations. It is closely connected with the statistical aspect, which places collocations between two poles, represented by free word combinations on the one side and fully fixed multi-word units on the other, which in turn affects semantic changes and the restrictions on the choice of collocational elements.

In defining collocations, the authors cite two key approaches: a narrower one, which recognises collocations as an independent type of phraseological unit that is partly or fully (semantically and syntactically) frozen, and a broader one, which also counts among collocations those frequent word combinations whose internal combinability is not narrowly delimited or even exclusive,
but whose set of co-occurring words can also be relatively open (Gantar et al., 2021: 20–21).

Even though collocations pose a problem above all when we encounter them in a foreign language, because in interpreting them we cannot rely on the syntactic or semantic relations of our mother tongue, as the two languages involved do not necessarily form a given collocation in the same way, collocation dictionaries are a useful aid for native speakers as well, especially in domains they are less familiar with. The added value of such language resources for translators is that they display, in one place and in a transparent way, a wide range of collocations, from which one can then choose the one that is most appropriate both semantically and contextually.

2. Aim of the study and overview of the field

Because of their linguistic and cultural specificity, collocations are extremely interesting from the point of view of translation, and they have been studied by Slovenian (e.g. Gabrovšek, 2014; Jurko, 2014; Sicherl, 2004; Vrbinc, 2006) as well as international authors (Kwong, 2020; McKeown and Radev, 2000, etc.). From the translation perspective, it matters how translators solve the problems that arise when translating collocations, since only appropriate translation strategies can lead to appropriate translation solutions. When searching for possible translation solutions and confirming translation equivalents, translators can draw on various language resources. One possible research approach that can shed light on how translators arrive at appropriate solutions and which resources they use in the process are user studies, which of course differ according to the needs of the target group (cf. Arhar Holdt et al., 2015; Arhar Holdt et al., 2016; Pori et al., 2020; Pori et al., 2021). Some authors (Rozman, 2004; Stabej, 2009; Logar Berginc, 2009; Arhar Holdt, 2015) pointed out some time ago that user studies were lacking in Slovenia, but the situation has recently been changing: quite a few user studies have been carried out among translators, interpreters, students, teachers of Slovene, proofreaders and linguists, as well as among others who work with languages professionally (cf. Čibej et al., 2015; Gorjanc, 2014; Hirci, 2013; Pori et al., 2021; Mikolič, 2015; Šorli and Ledinek, 2017; Arhar Holdt et al., 2017).

The present paper presents a user study that addresses the translation of collocations and the use of language resources in the search for translation solutions. It reports the findings of a study that examined the process of translating collocations from English into Slovene among translation students, with an emphasis on the use of KSSS in the translation process, a topic that has not been researched before.

Research into the translation process offers valuable information on the strategies needed to translate a given text from the source into the target language. An in-depth overview of the key research questions, the most frequently used technologies and the development trends in this field is offered by Jakobsen (2017) (cf. Hansen, 2009; Hvelplund, 2019; etc.). In Slovenia, translation process research remains less explored (cf. Hirci, 2012).

Students at the Department of Translation Studies of the Faculty of Arts, University of Ljubljana, are introduced to KSSS as early as the first year of their undergraduate studies. Fifteen students took part in the study, eight from the second year (labelled II-1 to II-8 below) and seven from the third year (III-1 to III-7) of the undergraduate programme.

The translation task consisted of a popular-science article1 and translation instructions. In the 437-word article on an astronomy topic, 10 collocations were marked. Although the whole text was available, since context is crucial when translating collocations, the students only had to translate the sentences containing the marked collocations. Even though marking the collocations under analysis departs from an authentic translation situation, we took this step to keep the analysis consistent, which was practically a necessity given the small number of participants.

1 Link to the article: http://news.bbc.co.uk/2/hi/science/nature/1006305.stm

To obtain as much data as possible about the translation process, the work took place on the Zoom platform, where the participants shared their screens, which made it possible to observe the translation work; everything happening on the screen was also recorded for later analysis.

The translation task allowed analysis from two different angles. The first is the translation itself, i.e. the individual translation solutions for the collocations and the sentences containing them; the second is the recording of the translation process, in particular the use of language resources, above all KSSS.

In the first part of the analysis, we focused on whether the translation is adequate (i.e. whether the translation solution constitutes an appropriate collocation acceptable in the target language, and whether this collocation also semantically corresponds to
the original). For inappropriate or only conditionally acceptable solutions, we added, for each collocation, a comment on the aspect that seemed problematic.

Although the translation task contained 10 collocations (Table 1), for reasons of space the present paper focuses on only three, which nevertheless offer an interesting insight into the whole range of difficulty, problems, searches for translation solutions, and ways of using KSSS and other language resources.

The recording of the translation work provided insight into how language resources were used. The review of each collocation's translation solution was followed by a review of the screen recording, in which we analysed how the students used language resources to arrive at a translation equivalent: which resources, how efficiently (where this is evident from the recording), and how they came to the conclusion that a particular solution was the most appropriate one.

3. Results

No. | Sentence with the marked collocation
1 | Astronomers say reports that the Earth could be struck by a small asteroid in 2030 are wildly exaggerated.
2, 3 | Less than a day after (2) sounding the alert about asteroid 2000SG344, a (3) revised analysis of the space rock's orbit shows it will in fact miss the Earth by about five million kilometres.
4 | Some scientists have criticised the way the information was released to the media before it had been thoroughly confirmed.
5 | Threat rating*
6 | On Friday, the International Astronomical Union issued an alert saying that the object had about a 1-in-500 chance of striking the Earth on 21 September 2030.
7 | Were it to strike our planet, the results would be devastating, with an explosion greater than the most powerful nuclear weapon.
8 | The new orbit reveals a slight risk of a collision with the Earth about 2071, but it is thought that when the orbit is better known this risk will disappear as well.
9 | If it is manmade and did strike Earth, the effects would be very local and limited.
10 | Some scientists have criticised the IAU and Nasa for releasing warnings about the asteroid only for those warnings to be rescinded less than a day later.
* subheading

Table 1: Overview of the sentences with the marked collocations.

Collocation no. 4 turned out to be the most problematic: three translation solutions were entirely inadequate and three only partly adequate. We labelled as entirely inadequate those solutions that are either syntactically inappropriate in the target language or semantically inappropriate, even if a high-frequency Slovene collocation was used. We classified as only partly adequate those collocations that semantically strayed too far from the original or whose syntactic form is atypical. All the translation solutions labelled inadequate or partly adequate are shown below, together with the language resources the students used while searching for them.

Collocation no. 4: […] the way the information was released […]
Student | Translation solution | Resources used
II-1 | skritizirali način, kako je bila informacija […] deljena z mediji | no resources
II-5 | skritizirali način izdaje podatkov medijem | English collocation dictionary ozdic.com
II-7 | kritizirali način, da so novico objavili | online English–Slovene dictionary Pons, KSSS
III-2 | način, ki je bil uporabljen za posredovanje informacij javnosti | comprehensive English–Slovene dictionary (Amebis ASP32 database browser), KSSS
III-3 | način, na katerega so bile informacije sporočene medijem | Evrokorpus, KSSS
III-5 | dejstvo, da so mediji objavili informacijo | Gigafida corpus

Table 2: Overview of the search for translation solutions for collocation no. 4.
Collocations no. 2 and 6 are presented together, since they are very similar both semantically and syntactically, yet the first caused the students considerable difficulty, while the translation solutions for the second were, with two exceptions, entirely adequate. All the translation solutions labelled inadequate or partly adequate are shown below, together with the language resources the students used.

Collocation no. 2: […] after sounding the alert about […]
Student | Translation solution | Resources used
II-1 | dan po tem [sic] ko so sprožili alarm | KSSS
II-3 | dan po sproženem alarmu | online English–Slovene dictionary Pons, KSSS
II-6 | dan po sprožitvi alarma | online English–Slovene dictionary Pons, monolingual online dictionary Merriam-Webster, KSSS
II-8 | dan po sprožitvi alarma | online English–Slovene dictionary Pons, Fran portal, KSSS, Gigafida corpus
III-2 | dan potem [sic] ko so sprožili alarm | no resources
III-3 | dan po sproženem alarmu | Google search, monolingual online dictionary Merriam-Webster, KSSS

Collocation no. 6: […] issued an alert saying […]
Student | Translation solution | Resources used
II-1 | izdala opozorilo | no resources
II-3 | izdala opozorilo | Google search, monolingual online dictionaries Collins and Cambridge, KSSS
II-6 | izdala opozorilo | KSSS
II-8 | izdala opozorilo | online English–Slovene dictionary Pons, KSSS, Gigafida corpus
III-2 | sprožila alarm | Google search, Fran portal, KSSS, Gigafida corpus
III-3 | sprožil alarm | Evrokorpus, EUR-Lex, KSSS

Table 3: Overview of the search for translation solutions for collocations no. 2 and 6.
4. Discussion

4.1. Collocation no. 4

From the translation point of view, collocation no. 4 ("release information") proved the most demanding. It should be noted that the original itself is somewhat problematic: the noun phrase "the way", functioning as the direct object of the verb "criticise", is most often used in the sense of "criticising the manner in which /…/". In the analysed example, however, something else is meant: the scientists criticised the fact that the information was passed on to the media at all, not the way in which this was done. It seems that the students whose translation solutions were labelled inadequate or only partly adequate read the original too literally and did not take into account the broader context that would have allowed a correct interpretation, even though the whole text was available to them.

II-1: Although the collocation of the verb "deliti" and the noun "informacija" is established and appropriate in the given context, it occurs less often in the passive voice. Given that the student used no language resources, we can only speculate that examples in the active voice might otherwise have led him to a different solution.

II-5: The translation solution "način izdaje podatkov medijem" is problematic in terms of literalness, and it is also questionable as a collocation, since the gerund "izdaja" with a noun in the genitive mostly occurs in the sense of the publication of a printed work (e.g. a book, a magazine, etc.) or the issuing of shares, money, etc. into circulation. While searching for a translation solution, the student used the English collocation dictionary ozdic.com with the search strings "information" and "released information", but this was not followed by a correction of the chosen translation solution.

II-7: The translation "kritizirali način, da so novico objavili" is a semantic error that largely rests on a literal reading of the original. Although the solution "kritizirati način" is unproblematic as a collocation, the clause that follows is more so: when the noun "način" follows the verb "kritizirati", it is normally followed by an attributive clause, not an object clause. The student used two online language resources: the English–Slovene dictionary Pons to look up equivalents of the words "release", "some" and "thoroughly", and KSSS to search for collocations with the noun "novica", without using any filters.

III-2: The translation solution "način, ki je bil uporabljen za posredovanje informacij javnosti" is adequate as a collocation but semantically strays from the original. The student used two language resources: the comprehensive English–Slovene dictionary (Amebis ASP32 database browser) for the verb "criticise" and KSSS for the noun "informacija", without using any filters.
III-3: The translation solution "način, na katerega so bile informacije sporočene medijem" is very similar to III-2 in its semantic deviation; it is adequate as a collocation, and only the passive form is stylistically questionable. As resources, the student used Evrokorpus with the search strings "released information" and "information released", and KSSS with the search string "izdaja podatkov", which returned no hits. Had she instead entered only the gerund "izdaja" and selected the filter "with nouns/2-genitive", she could have quickly arrived at an appropriate translation solution.

III-5: The translation solution "dejstvo, da so mediji objavili informacijo" is stylistically and collocationally adequate, but it too semantically strays from the original, since under this interpretation the target of the criticism is the media. The original in fact says that the scientists criticised the fact that the media received this information at all, i.e. they were indirectly criticising their professional colleagues, not the media themselves. While searching for a translation solution, the student used only the Gigafida corpus with the search string "objaviti informacijo".

4.2. Collocations no. 2 and no. 6

Regarding the collocation "sound the alert", it should be pointed out that the semantically similar collocation "sound the alarm" is considerably more frequent in English. This is most likely why the students, in as many as six cases, opted in their translations for the noun "alarm" instead of "opozorilo": in the language resources they either found the translation equivalent ("alert"→"alarm") or obtained confirmation that the English variant "sound the alarm" is more frequent. From that point on, they searched the language resources only for collocations of the noun "alarm". All of them chose the correct but very literal collocation "sprožiti alarm".

II-1: The student opted for the translation solution "dan po tem ko so sprožili alarm", using KSSS as his resource, where he first searched for collocations of the verb "sprožiti" and then of the noun "alarm". There he also found the collocation "sprožiti alarm". For collocation no. 6 he chose a different translation solution, "izdati opozorilo", without using any language resources.

II-3: The student arrived at the translation solution "dan po sproženem alarmu" with the help of the online English–Slovene dictionary Pons and the search string "sounding". There she found the collocation "to sound the alarm", which was followed by a search in KSSS for "alarm", where she found the collocation "sprožiti alarm". For collocation no. 6 she too chose a different translation solution, "izdati opozorilo", which she reached via Google (search string "issue an alert"); she followed a link to the online monolingual Collins dictionary and, since she did not find the string there, entered the same collocation into the online Cambridge dictionary. Having found no hit there either, she returned to Google. The screen recording does not show why, but without clicking any of the results she used KSSS with the search string "opozorilo", where, even though she used no filter, she found "izdati opozorilo" on the first page of the most frequent collocations. This finding, however, was not followed by a correction of collocation no. 2.

II-6: Student II-6 also opted for a similar translation solution, "dan po sprožitvi alarma". Her first language resource was the English–Slovene dictionary Pons for the search string "alert", where she found the translation equivalent "alarm". This was followed by a search in the monolingual Merriam-Webster dictionary under the headword "sound", and then by a search for collocations of the noun "alarm" in KSSS. There she examined four collocations in context, three of the hits being for "sprožiti lažni alarm" and one for "sprožiti požarni alarm", which did not shake her conviction that this was the right solution. Regardless of that decision, for collocation no. 6 she too opted for a different translation solution, "izdati opozorilo", which she reached directly through KSSS with the search string "opozorilo". In this case as well, the finding was not followed by a correction of collocation no. 2.

II-8: The translation solution "dan po sprožitvi alarma" is identical to that of student II-6, and the language resources she used were similar. The first was the English–Slovene dictionary Pons for the search string "alert", followed by a search on the Fran language portal under the headword "alarm". She then searched KSSS for collocations of the noun "alarm", without using any filter. This was followed by checking the collocation "sprožiti alarm" in the Gigafida corpus with the search string "alarm", where she paused at this collocation among the concordances, but then changed it in her translation solution into "sprožitev alarma". For collocation no. 6 she opted for a different solution, the collocation "izdati opozorilo". She first used the English–Slovene dictionary Pons ("issue" and "alert"), followed by a search for collocations of the noun "opozorilo" in KSSS (first without filters, then with the filter "with verbs", though she did not persist in reviewing the hits). Next came a search of the Gigafida corpus ("opozorilo"), where she found no concordance she considered suitable. She then used KSSS again and, among the results (again without a filter), found the final translation solution "izdati opozorilo".

III-2: The student arrived at the grammatically and semantically problematic translation solution "dan potem [sic] ko so sprožili alarm" without using any language resources. This may also be why she chose the same translation solution for collocation no. 6, where, however, she used quite a few language resources. She began her search
in Google with the search string "izdati rdeč [sic] alarm", reviewed a few hits (mostly newspaper headlines) and continued the search on the Fran portal with the search string "alarm". She then used the same headword in KSSS. This was followed by a search of the Gigafida corpus (on the CJVT portal) using the advanced functions "okolica" ("surroundings") with "levo 1" ("left 1") and "desno 0" ("right 0"). Among the parts of speech on the left she then applied the filter "glagol" ("verb"), where one collocation stood out by frequency, and the student chose it as her translation solution. This example of the use of advanced functions seems very telling, as it demonstrates that mere technical proficiency in using language resources does not necessarily guarantee adequate translation solutions. Although it saves time when searching for possible collocations, the translator must still make the final decision about which of the offered options is the most appropriate translation solution, and a refined feeling for language always plays an important role in this.

III-3: For collocation no. 2, this student also opted for the translation solution "dan po sproženem alarmu". She began with Google (search string "souding [sic] the alert"), where she selected a link to the online Merriam-Webster dictionary for "raise/sound the alarm". She then searched KSSS for collocations of the noun "alarm", noticed the collocation "sprožen alarm" among the results and used it. When searching for a translation solution for collocation no. 6, she first used Evrokorpus ("issue an alert"), from where she followed the suggested link to EUR-Lex for the same search string, and finally searched KSSS once more for collocations of the noun "alarm", noticed the already familiar collocation "sprožiti alarm" and used it.

Finally, it must be stressed that the presented user study involved a guided task, which somewhat distorts the real situation of translating such a text, so caution is needed when generalising the observations. Besides the limited sample, both in terms of the number of participants and the fact that only students of two undergraduate years took part, there are at least three reasons for this. The first is that the students knew the study dealt with the use of language resources, which may have influenced which resources they used and how. The second is that they knew the task focused on translating collocations, so they may have paid more attention to this element. The third is that the translation process was recorded for later analysis, which makes students more conscious of their every move. Whether, and if so to what extent, this affected the translation process and the final product in the given situation is hard to judge, but this aspect must certainly be kept in mind when interpreting the results and drawing conclusions.

The translation strategies for collocation no. 4 in particular showed that language resources, even though they can help the translator arrive at partially correct translation solutions, cannot always prevent a misinterpretation of the original and, consequently, unfortunate translation solutions that result from combining words, phrases and collocations which are each adequate on their own but simply do not work as a whole. A good command of one's mother tongue certainly also plays a key role in recognising and resolving such situations.

It should also be highlighted that the students were expected to translate sentences in which the collocations had been marked in advance. Yet it is entirely possible to imagine a scenario in which a translator would opt for a translation solution in which the original collocation is not rendered as a collocation, but which would be just as adequate from the semantic and contextual point of view.

Even though KSSS is a monolingual language resource, it proved very useful in the translation process. What seems to matter is not so much the possibility of checking whether a given word combination constitutes a collocation at all, but rather the insight into the wider range of possible collocations that KSSS offers, from which the translator can then choose the one that is semantically and contextually most appropriate.

Although translation students become acquainted with KSSS in the first year of their undergraduate studies and use it in their translation assignments, it turned out that not all of them know its structure equally well. Consequently, they are not even aware of all the options it offers, so in some cases their search for suitable collocations is less efficient, which in the extreme can mean that they fail to find a suitable collocation despite a correctly entered search string.

One of the options the dictionary offers is advanced search using filters, where one can select the appropriate category of the collocate (e.g. noun, adjective, verb, etc.) and in some cases also its subcategory (e.g. the case of the noun, the case of the adjective, the preposition, etc.). For students who did not know about or did not use this function, a search based on a base that forms collocations with numerous collocates opened up an unwieldy list of collocations, divided according to different collocational patterns, most of which were useless in the given case. By browsing through the list, the students needlessly lost time that they could have put to better use.
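To illustrate the kind of narrowing these filters perform, here is a minimal Python sketch; the flat list of collocate records and the frequencies are hypothetical stand-ins for the KSSS data model, not its actual implementation:

    # Hypothetical collocate records for the base "izdaja":
    # (collocate, part of speech of the collocate, its case, corpus frequency)
    candidates = [
        ("opozorila", "noun", "genitive", 412),   # as in "izdaja opozorila"
        ("knjige", "noun", "genitive", 389),      # as in "izdaja knjige"
        ("nova", "adjective", None, 207),         # as in "nova izdaja"
        ("delnic", "noun", "genitive", 158),      # as in "izdaja delnic"
    ]

    def filter_collocates(records, pos=None, case=None):
        # Keep only collocates matching the selected category filters,
        # sorted by corpus frequency, mimicking a "with nouns/genitive" filter.
        hits = [r for r in records
                if (pos is None or r[1] == pos) and (case is None or r[2] == case)]
        return sorted(hits, key=lambda r: r[3], reverse=True)

    for word, _, _, freq in filter_collocates(candidates, pos="noun", case="genitive"):
        print(word, freq)

Without the two filter arguments, the call returns the full, much longer list, which is exactly the unwieldy situation faced by the students who did not know about the filters.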
5. Conclusion

The user study among translation students on the use of language resources in the search for translation solutions for collocations or collocational pairs yielded numerous interesting findings and provided insight into the practical use of language resources in the translation process.
At the same time, it turned out that some students are technically well versed and skilled in using the advanced functions offered by individual language resources, but this is not always reflected in the quality of their translation solutions. This is not necessarily a matter of weakness in the mother tongue, but perhaps rather of excessive trust in the language resource, without taking into account the possible differences between the language material being translated and the examples offered by the language resources.

The study thus also exposed a very concrete shortcoming that should be better addressed in future teaching. The study likewise opened up numerous questions related to the use of language resources in the translation process that would be worth addressing in future research.

Research programme No. P6-0215 (Slovene Language: Basic, Contrastive and Applied Research) was co-financed by the Slovenian Research Agency from the state budget.

6. References

Špela Arhar Holdt. 2015. Uporabniške raziskave za potrebe slovenskega slovaropisja: prvi koraki. In: V. Gorjanc et al., eds., Slovar sodobne slovenščine: problemi in rešitve, pp. 136–148. Ljubljana: Znanstvena založba Filozofske fakultete.
Špela Arhar Holdt, Jaka Čibej and Ana Zwitter Vitez. 2017. Value of language-related questions and comments in digital media for lexicographical user research. International Journal of Lexicography, 30(3), pp. 285–308. Oxford: OUP.
Špela Arhar Holdt, Iztok Kosem and Polona Gantar. 2016. Dictionary user typology: the Slovenian case. In: T. Margalitadze and G. Meladze, eds., Lexicography and Linguistic Diversity. Proceedings of the XVII EURALEX International Congress, 6–10 September 2016, pp. 179–187. Tbilisi: Ivane Javakhishvili Tbilisi State University.
Beryl T. Sue Atkins and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. New York: Oxford University Press.
Jaka Čibej, Vojko Gorjanc and Damjan Popič. 2015. Vloga jezikovnih vprašanj prevajalcev pri načrtovanju novega enojezičnega slovarja. In: V. Gorjanc et al., eds., Slovar sodobne slovenščine: problemi in rešitve, pp. 168–181. Ljubljana: Znanstvena založba Filozofske fakultete.
Dušan Gabrovšek. 2014. Extending Binary Collocations: (Lexicographical) Implications of Going beyond the Prototypical a – b. ELOPE 11(2), pp. 7–20. Ljubljana: Slovensko društvo za angleške študije.
Polona Gantar, Simon Krek and Iztok Kosem. 2021. Opredelitev kolokacij v digitalnih slovarskih virih za slovenščino. In: I. Kosem, ed., Kolokacije v slovenščini, pp. 15–41. Ljubljana: Znanstvena založba Filozofske fakultete.
Vojko Gorjanc. 2014. Slovar slovenskega jezika v digitalni dobi. In: Irena Grahek and Simona Bergoč, eds., E-zbornik Posveta o novem slovarju slovenskega jezika na Ministrstvu za kulturo. Ljubljana: Ministrstvo za kulturo RS.
Gyde Hansen. 2009. Some thoughts about the evaluation of translation products in translation process research. Copenhagen Studies in Language 38, pp. 389–402. Copenhagen: Samfundslitteratur.
Franz Josef Hausmann. 1984. Wortschatzlernen ist Kollokationslernen. Zum Lehren und Lernen französischer Wortverbindungen. Praxis des neusprachlichen Unterrichts, 31, pp. 395–406. Dortmund: Lensing.
Nataša Hirci. 2012. Electronic Reference Resources for Translators. The Interpreter and Translator Trainer 6(2), pp. 219–236. London: Taylor & Francis.
Nataša Hirci. 2013. Changing trends in the use of translation resources: the case of trainee translators in Slovenia. ELOPE 10, pp. 149–165. Ljubljana: Slovensko društvo za angleške študije.
Kristian Tangsgaard Hvelplund. 2019. Digital resources in the translation process – attention, cognitive effort and processing flow. Perspectives 27(4), pp. 510–524. London: Taylor & Francis.
Arnt Lykke Jakobsen. 2017. Translation process research. In: John W. Schwieter and Aline Ferreira, eds., The Handbook of Translation and Cognition, pp. 19–49. Hoboken: Wiley.
Primož Jurko. 2014. Target language corpus as an encoding tool: collocations in Slovene-English translator training. ELOPE 11(1), pp. 153–164. Ljubljana: Slovensko društvo za angleške študije.
Iztok Kosem, Simon Krek, Polona Gantar, Špela Arhar Holdt, Jaka Čibej and Cyprian Laskowski. 2018. Kolokacijski slovar sodobne slovenščine. In: D. Fišer and A. Pančur, eds., Zbornik konference Jezikovne tehnologije in digitalna humanistika / Proceedings of the Conference on Language Technologies & Digital Humanities, 20-21, pp. 133–139. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
Nataša Logar Berginc. 2009. Slovenski splošni in terminološki slovarji: za koga? In: M. Stabej, ed., Infrastruktura slovenščine in slovenistike. Obdobja 28, pp. 225–231. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani.
Oi Yee Kwong. 2020. Translating Collocations: The Need for Task-driven Word Associations. In: Proceedings of the Workshop on the Cognitive Aspects
of the Lexicon, pp. 112–116. Association for Computational Linguistics.
Kathleen R. McKeown and Dragomir R. Radev. 2000. Collocations. In: Robert Dale et al., eds., Handbook of Natural Language Processing, pp. 1–3. New York: Marcel Dekker.
Vesna Mikolič. 2015. Slovarski uporabniki – ustvarjalci: ustvarjati v jeziku in z jezikom. In: V. Gorjanc et al., eds., Slovar sodobne slovenščine: problemi in rešitve, pp. 182–195. Ljubljana: Znanstvena založba Filozofske fakultete.
Eva Pori, Jaka Čibej, Iztok Kosem and Špela Arhar Holdt. 2020. The attitude of dictionary users towards automatically extracted collocation data: A user study. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research 8(2), pp. 168–201.
Eva Pori, Iztok Kosem, Jaka Čibej and Špela Arhar Holdt. 2021. Evalvacija uporabniškega vmesnika Kolokacijskega slovarja sodobne slovenščine. In: I. Kosem, ed., Kolokacije v slovenščini, pp. 235–268. Ljubljana: Znanstvena založba Filozofske fakultete.
Tadeja Rozman. 2004. Upoštevanje ciljnih uporabnikov pri izdelavi enojezičnega slovarja za tujce. Jezik in slovstvo 49(3/4), pp. 63–75. Ljubljana: Slavistično društvo Slovenije.
Eva Sicherl. 2004. On the Content of Prepositions in Prepositional Collocations. ELOPE 1(1–2), pp. 37–46. Ljubljana: Slovensko društvo za angleške študije.
Marko Stabej. 2009. Slovarji in govorci: kot pes in mačka? Jezik in slovstvo 54(3–4), pp. 115–138. Ljubljana: Slavistično društvo Slovenije.
Mojca Šorli and Nina Ledinek. 2017. Language policy in Slovenia: language users' needs with a special focus on lexicography and translation tools. In: I. Kosem et al., eds., Electronic lexicography in the 21st century: proceedings of eLex 2017 Conference, 19–21 September 2017, Leiden, The Netherlands, pp. 377–394. Brno: Lexical Computing.
Marjeta Vrbinc. 2005. Native speakers of Slovene and their translation of collocations from Slovene into English: a Slovene-English empirical study. Erfurt Electronic Studies in English. Erfurt: Institut für Anglistik/Amerikanistik Erfurt.
Acoustic Modeling with Various Basic Units for Slovenian Automatic Speech Recognition

Lucija Gril,* Simon Dobrišek,‡ Andrej Žgank*
* Faculty of Electrical Engineering, Computer Science and Informatics, University of Maribor, Koroška cesta 46, 2000 Maribor
lucija.gril@um.si, andrej.zgank@um.si
‡ Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1000 Ljubljana
simon.dobrisek@fe.uni-lj.si

Abstract
The article presents an automatic speech recognition system for the Slovenian language. To build the acoustic models, we used two different language resources and lexicons based on two different basic acoustic units. The system was evaluated on a test set created within the project Development of Slovene in a Digital Environment (RSDO), containing a little less than 5 hours of audio recordings. The acoustic models were built with the hybrid HMM-DNN approach, using two types of neural networks, TDNN and LSTM. The best result, a WER of 24.95%, was achieved with the TDNN architecture and a grapheme lexicon.
1. Introduction

Nowadays, smart environments accompany us at every step: smartphones, tablets, television sets, wristwatches, household appliances, and so on. All these devices strive to offer us a better and simpler user experience. They provide many services, and each requires careful design of both hardware and software. One such service is automatic speech recognition (ASR). When recognising speech, one must be aware that even flawlessly working software is affected by many other factors. One of them can be, for example, a poor microphone that captures a lot of noise and distorts the sound, degrading speech recognition and, consequently, the user experience. Results can also deteriorate if the continuous speech recogniser is poorly designed and does not have optimal characteristics. It is therefore important to test different architectures and model designs of an automatic speech recogniser experimentally.

Developing a speech recogniser requires a large amount of training material for each language. For languages with many speakers, such material is usually plentiful. For languages with fewer speakers, among which Slovenian can be counted, such resources are insufficient for advanced artificial intelligence methods such as end-to-end learning with convolutional networks. Transfer learning has also been widely used recently, but both of these methods allow less control over the modelling than the hybrid approach used in this article. As a rule, the hybrid approach also achieves somewhat better results than the other two. An automatic speech recogniser requires speech recordings accompanied by transcription files containing the spoken words. At the same time, a text corpus and a lexicon are needed, from which the recogniser can learn the characteristics of words and their contextual placement.

Spoken words can be represented in an automatic speech recogniser with two different acoustic units: phonemes or graphemes. Phonemes are sound units that represent the pronunciation of the sounds in a word. The phonemic transcription of Slovenian pronunciation differs in most cases from the graphemic one. Graphemes and phonemes also differ in the number of basic units: a grapheme is written with one basic unit corresponding to a letter in the word, whereas the same letter can map to several different phonemes, depending on the context, the stress and the position in the word. A letter can also map to a sequence of two phonemes. With this in mind, recogniser lexicons can represent spoken words in two ways, with phonemes or with graphemes. The choice of lexicon type directly determines the basic acoustic unit. This choice in turn affects the difficulty and manner of lexicon preparation, the complexity of the acoustic models and, through this, the memory and processing power needed to train and run the automatic speech recogniser. Building a phoneme lexicon depends on the language in question; for Slovenian this task is relatively complicated and complex. A lexicon can be built manually, which is usually done by phoneticians or Slovenists, or automatically. With automatic procedures, however, a word's transcription may be phonetically wrong, which is reflected later in suboptimal training and recognition. Preparing a grapheme lexicon is easier, since the conversion is trivial. Which approach is more suitable also depends on the amount of training data used, since with phoneme lexicons, which consist of more basic units, greater emphasis falls on
the numerical distribution across categories.

Within the project Development of Slovene in a Digital Environment (RSDO, n. d.), the construction of a speech database and the development of the first versions of an automatic speech recogniser are currently running in parallel. We therefore still have a rather limited amount of transcribed Slovenian speech at our disposal, which was the motivation for using graphemic acoustic units. It has been shown in the past, both for Slovenian (Žgank and Kačič, 2006) and for other languages (Killer et al., 2003), that graphemic acoustic units can be a good solution in such cases. The goal of this article is therefore a comparison between phonemic and graphemic basic acoustic units in connection with the currently available speech resources.

In the remainder of the article we first review what has already been done for Slovenian in the field of modelling basic acoustic units for automatic speech recognition. In the third section we present the speech and language resources used in the design of the experiments. Lexicon creation and automatic rule-based grapheme-to-phoneme conversion are presented in the fourth section. In the fifth section we present the acoustic and language modelling of the automatic speech recogniser. The results are presented and discussed in the sixth section, followed by the conclusion.

2. Related work

Automatic speech recognisers have traditionally used phonemes, and their derivatives in the form of contextual extension, as the default basic acoustic unit. The premise was that automatic speech recognition is a conversion from spoken to written form, which is consistent with this choice of basic acoustic unit. In 2000, Schillo and colleagues presented the first grapheme-based automatic speech recogniser, which violated this assumption by choosing a different basic acoustic unit. For German, the system achieved worse recognition results than a phoneme-based system, but the trained grapheme models were smaller.

Graphemes as basic acoustic units quickly become interesting for multilingual and cross-lingual speech recognition as well (Killer et al., 2003), since languages can then be combined without detailed knowledge of the phonetics of the languages involved: the basis is simply the written letter. The usefulness of this approach is even more pronounced in cross-lingual speech recognition, where only limited speech resources are available in the target language. The success of the method also depends to some extent on the acoustic-phonetic similarity between the languages involved.

The first research on the use of graphemes as the basic acoustic unit for cross-lingual recognition of Slovenian speech was presented by Žgank and colleagues (2005). This was followed by the use of graphemes for ordinary monolingual automatic speech recognition (Žgank and Kačič, 2006). Graphemes as basic acoustic units have thus become part of the standard choice for Slovenian speech recognition, especially in the domain of broadcast news (Gril et al., 2021). In combination with Slovenian speech recognisers that are based on HMM acoustic models or on the hybrid HMM/DNN design and have a few tens of hours of transcribed speech recordings available for training, they usually achieve better recognition results. We can, however, assume that this difference will shrink as more hours of transcribed speech become available for Slovenian: with a growing amount of recordings we also obtain more samples per basic unit, which improves the modelling of acoustic characteristics and the robustness to potential errors introduced by automatic grapheme-to-phoneme conversion.

3. Speech and language resources

Speech and language resources are a key component of speech recognisers. For the speech recordings we used the corpora Gos 1.0 (Zwitter Vitez et al., 2013) and Sofes (Dobrišek et al., 2017) and a working version of the test set of the RSDO speech database under construction (the current working version is 2.0, which no longer contains spelling). We used the Gos and Sofes corpora for the training and development sets, while the RSDO test corpus 2.0 was used for evaluating the results. For the lexicons we used the freely available resource Sloleks 2.0 (Dobrovoljc et al., 2019) and the current version of the lexicon created in the RSDO project. As the text corpus we used the freely available resource ccGigafida 1.0 (Logar et al., 2013).

The Gos corpus contains 120 hours of recordings covering various genres, e.g. television programmes, lectures, school lessons, private conversations, etc. All speech is transcribed in two versions, conversational and standardised. The recordings cover 1,526 different speakers. The Sofes speech corpus contains 9 hours and 52 minutes of recordings covering the speech of 134 different speakers; the recordings contain queries about flight information in Slovenian. Sofes provides transcriptions in phonetic notation and in the standardised notation of speech. The RSDO test set 2.0 amounts to 4 hours and 47 minutes of recordings. It differs from version 1.0 in that it no longer contains recordings of spelling, which amount to about 19 minutes of speech. We excluded spelling from the general test set because its effective recognition requires different approaches. The RSDO test set covers read, public and non-public speech and recordings of the National Assembly; 63 different speakers appear in the recordings. The RSDO corpus, too, comes with two different transcriptions of speech, produced according to the same guidelines as in the Gos corpus.

Sloleks 2.0 is a lexicon containing around 2,792,000 individual word forms. Each entry contains information about the word (its part of speech and grammatical properties). All inflected forms of each word are also recorded; Slovenian is an inflected language, so there are very many such forms. Version 2.0 also marks the position of stress and the transcription in the International Phonetic Alphabet (IPA). In our case we used Sloleks 2.0 to create the phonetic lexicon of the automatic speech recogniser; such a lexicon requires words and their pronunciation in phonemes. We converted Sloleks 2.0 into a form suitable for an automatic speech recogniser using the procedure employed by Ulčar et al. (2019).

The ccGigafida text corpus contains slightly over 103,000,000 words and is the publicly accessible part of the Gigafida corpus, usable under a Creative Commons licence. The text includes information about the sources (newspapers, magazines), years of publication, text types, titles and authors. The corpus is annotated with morphosyntactic descriptions and lemmas. We used the ccGigafida text corpus for the language modelling of the automatic speech recogniser.
For correct processing, we deleted empty lines and multiple spaces from the corpus. We also removed punctuation, so that the text conformed to the form usual in a speech recognition system.

4. Creating the lexicons for the speech recogniser

Creating the phonetic lexicons needed to build hybrid automatic speech recognition architectures relies both on existing available lexicons, which are usually manually verified and already contain phonetic transcriptions of words, and on automatic grapheme-to-phoneme converters, which are used for the so-called out-of-vocabulary words that the recogniser's language model anticipates but that are not yet included in the existing lexicons.

The creation of the lexicon for the first version of the automatic speech recogniser ("Rezultat R2.2.7: Orodje za grafemsko fonemsko pretvorbo – verzija 2", 2022), developed within the RSDO project and presented in the following sections, was primarily based on the already mentioned Sloleks 2.0 lexicon and on the manually edited and verified pronunciation dictionary included in the Sofes speech corpus. For every word appearing in the normalised word-level transcripts of all the audio recordings used to build the recogniser's acoustic model, and for every word appearing in the normalised text corpus used to build its language model, we first checked whether the Sloleks 2.0 lexicon or the manually edited Sofes dictionary contained the word. If it did, its phonetic transcription was simply transferred into the recogniser's lexicon. If the word was contained neither in Sloleks 2.0 nor in the Sofes dictionary, its phonetic transcription was obtained with the first version of the automatic grapheme-to-phoneme converter developed within the RSDO project, outlined below. In creating the lexicon for the presented speech recogniser, more than 58 percent of all the words anticipated for the recogniser had to be converted automatically. The correctness of the automatic conversion has not yet been thoroughly checked and evaluated for this first version of the recogniser.

4.1. Automatic rule-based grapheme-to-phoneme conversion

The first version of the automatic grapheme-to-phoneme converter, developed within the RSDO project and used to create the recogniser's lexicon, was based on a set of context-dependent phonetic rules determined on the basis of statistical analyses and of the existing linguistic and phonological knowledge of the phonetic characteristics of spoken Slovenian. The context-dependent rules relied above all on the position of stress in the given words.

The position of stress in a word is generally determined by the syllable on which the word carries its dynamically or tonally expressed auditory prominence (Toporišič, 1992). An important characteristic of Slovenian is that stress can fall on the first, last, penultimate or even antepenultimate syllable; individual words can also carry stress in more than one position. The position of stress is determined for each word individually and has historically been passed on between generations of speakers of spoken Slovenian through language learning and spoken communication. Despite the varying positions of stress, which have also changed with the development of the language and across dialect groups, it is nevertheless possible to formulate rules that largely determine the position of stress in words (Toporišič, 1991). These rules were for the most part respected and used for the automatic determination of stress position in given words. The rules rely on lists of prefixes, suffixes, and word-initial and word-final letter sequences that occur in Slovenian words and characteristically determine the position of stress. The rules were determined in a similar way as in the development of a system for automatic Slovenian speech synthesis (Gros, 1997).

The rules used do not, however, cover all Slovenian words in current use. Additional rules for determining the most probable position of stress were therefore derived from an additional statistical analysis of stress positions in the most frequent Slovenian words. To some extent this approach can also be interpreted as machine learning from data.

In the developed converter, the graphemic transcriptions of input words are converted rule by rule, from left to right. The rules are checked and applied in the given order, so they must be sequenced so that, for each grapheme, the rules describing special conversion cases come first in the list, followed by the more general rules.
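As a rough illustration of this ordered-rule mechanism, the sketch below applies context-dependent rewrite rules from left to right, trying the specific rules before the general ones. The two toy rules (word-final l and preconsonantal v) only hint at the real phenomena; they are not the actual RSDO rule set:

    import re

    # Toy rules as (pattern, phoneme) pairs. For each grapheme, the more
    # specific contexts must come first, because the first match wins.
    RULES = [
        (re.compile(r"^l$"), "w"),                # word-final 'l': "bral" -> [braw]
        (re.compile(r"^l"), "l"),                 # general rule for 'l'
        (re.compile(r"^v(?=[^aeiou]|$)"), "w"),   # 'v' before a consonant or at the end
        (re.compile(r"^v"), "v"),                 # general rule for 'v'
    ]

    def g2p(word):
        # Scan the word left to right; at each position apply the first
        # rule whose context matches, otherwise copy the grapheme itself.
        phones, i = [], 0
        while i < len(word):
            for pattern, phone in RULES:
                if pattern.match(word[i:]):
                    phones.append(phone)
                    break
            else:
                phones.append(word[i])
            i += 1
        return phones

    print(g2p("bral"))   # ['b', 'r', 'a', 'w']

Swapping the order of the two rules for l would silence the word-final rule entirely, which is why the converter must check the special cases first.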
The developed grapheme-to-phoneme converter expects input words that are already given in their normalised full word form. Numbers, numerals, currency units, abbreviations and other special notations must therefore be given in full word form. This was ensured by normalising the word-level transcripts of the speech recordings used to build the recogniser's acoustic model, as well as the texts from the text corpus used to build its language model.

Thanks to the automatic determination and use of stress position, the output set of phonemic variants also allowed a distinction between long and short vowels. In creating the recogniser's lexicon, however, this distinction was not used: in building acoustic models for speech recognisers, vowels are usually not split into short and long, because vowel length plays no fundamental meaning-distinguishing role in word recognition (Ulčar, 2019).

5. Architecture of the automatic speech recogniser

Given the available amount of acoustic training material, it made sense to use a hybrid ASR architecture, which in such cases is usually more effective than E2E approaches. In hybrid ASR systems, the architecture can be roughly divided into two parts: the acoustic model and the language model. The acoustic model is trained on samples from the audio recordings and their corresponding transcripts, the language model on the text corpus. In the remainder of the article we present both models in more detail; both were built with the freely available Kaldi toolkit (Povey et al., 2011).
To prepare the accompanying files needed to build a model in Kaldi, we used the transcriptions of the speech corpora written in the standardised notation of speech.

5.1. Acoustic modelling

For acoustic modelling we used the Gos and Sofes speech databases and the RSDO test set. The audio recordings of the Gos and Sofes databases were in mono format with 16-bit encoding and a sampling frequency of 16 kHz. The recordings of the RSDO test set had a sampling frequency of 44.1 kHz, while the bit depth and format were the same as in Gos and Sofes. Kaldi requires mono audio with a 16 kHz sampling frequency and 16-bit encoding, so the recordings must be converted into the appropriate format before processing. Using the freely available SoX tool, we converted the recordings into mono audio with a 16 kHz sampling frequency and 16-bit encoding. We included this conversion in the preparation of the files required for processing in Kaldi, which saved us from converting all the recordings manually.
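Such a batch conversion can be scripted, for example by calling SoX from Python for each file; the directory names here are illustrative:

    import subprocess
    from pathlib import Path

    SRC = Path("corpus/wav_orig")    # hypothetical input directory
    DST = Path("corpus/wav_16k")     # Kaldi-ready output directory
    DST.mkdir(parents=True, exist_ok=True)

    for wav in sorted(SRC.glob("*.wav")):
        # Convert to 16 kHz, mono, 16-bit signed PCM, as the recipe requires.
        subprocess.run(
            ["sox", str(wav), "-r", "16000", "-c", "1", "-b", "16",
             str(DST / wav.name)],
            check=True,
        )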
The audio recordings in the speech database were turned into feature vectors. The recordings are first split into 25 ms windows, which are then transformed to obtain MFCC features (mel-frequency cepstral coefficients). For further work we used 12 MFCC coefficients and the energy, over which we also computed the first and second time derivatives: the first derivative yields the delta features and the second the delta-delta features. We then continued with acoustic modelling, training new acoustic models and their alignments in several stages.
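A minimal NumPy sketch of how the delta and delta-delta derivatives extend the 13 static coefficients (12 MFCCs plus energy) into a 39-dimensional vector, using the standard regression formula; real toolkits compute this internally:

    import numpy as np

    def add_deltas(static, k=2):
        # static: (num_frames, 13) -> (num_frames, 39) with deltas appended.
        def delta(x):
            padded = np.pad(x, ((k, k), (0, 0)), mode="edge")
            num = sum(n * (padded[k + n:len(x) + k + n] -
                           padded[k - n:len(x) + k - n])
                      for n in range(1, k + 1))
            return num / (2 * sum(n * n for n in range(1, k + 1)))
        d1 = delta(static)    # first time derivative (delta)
        d2 = delta(d1)        # second time derivative (delta-delta)
        return np.concatenate([static, d1, d2], axis=1)

    frames = np.random.randn(100, 13)   # 100 frames of MFCC + energy
    print(add_deltas(frames).shape)     # (100, 39)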
The basis of acoustic modelling in an automatic speech recogniser are hidden Markov models (HMM). Given the input feature vectors, HMMs estimate the probabilities of hypotheses of the spoken utterance. For this we need the phoneme transcription of every word. Such transcriptions are stored in the phonetic lexicon, where each word is represented by the phoneme string of the spoken word; depending on the lexicon used, several pronunciations may be available for a single word. In HMMs, phonemes are represented by hidden states, and the model then computes the observed states using Gaussian distributions, which form the hypotheses of the spoken word.

In the next stage we used linear discriminant analysis (LDA), with which we find a linear combination of states. LDA takes the feature vectors and builds HMM states with a smaller feature space for all the data. We used LDA in combination with the Maximum Likelihood Linear Transform (MLLT), which simplifies the computation of the Gaussian distributions (Gales, 1999). MLLT takes the features from LDA and derives a unique transformation for each speaker. MLLT is a first step towards speaker normalisation, as it minimises differences between speakers. LDA and MLLT use the first 13 MFCC features, each spliced with 4 preceding windows on the left and 4 following windows on the right, which gives a final feature dimension of 117. LDA then reduces the feature dimension to 40.
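The dimensionality bookkeeping of this step can be checked with a small sketch: splicing each 13-dimensional frame with 4 frames of context on either side gives 13 · 9 = 117 dimensions, which LDA then projects to 40. The projection matrix below is random, purely to illustrate the shapes; the real transform is estimated from the labelled training data:

    import numpy as np

    def splice(frames, left=4, right=4):
        # Stack each frame with its neighbours: (T, 13) -> (T, 13 * 9).
        padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
        return np.hstack([padded[i:i + len(frames)]
                          for i in range(left + right + 1)])

    frames = np.random.randn(200, 13)   # 13 static features per frame
    spliced = splice(frames)            # (200, 117)
    lda = np.random.randn(117, 40)      # stand-in for the estimated LDA matrix
    print(spliced.shape, (spliced @ lda).shape)   # (200, 117) (200, 40)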
For higher recognition accuracy we used Speaker Adaptive Training (SAT), which computes adaptation parameters for each individual speaker from that speaker's training data (Anastasakos et al., 1996).

We started the training of the acoustic model with a monophone acoustic model and continued with a triphone acoustic model with delta and delta-delta features (tri1), a triphone acoustic model with LDA and MLLT (tri2b) and, finally, triphone acoustic models with SAT (tri3b). The training procedure is also shown in the diagram in Figure 1.

Figure 1: The training procedure of the acoustic model of the automatic speech recogniser.

In the second part of building the acoustic models comes the transition to deep neural networks. Neural networks are systems in which algorithms imitate the behaviour of neurons in the brain. The system consists of input, hidden and output layers, each made up of one or more neurons. Neurons are interconnected by relations that can run forwards, backwards or in both directions; weights on the relations are used to compute new states.

We used two different types of neural networks: time-delay neural networks (TDNN) and long short-term memory (LSTM) networks.
dolgim kratkoročnim spominom (angl. Long Short Term
Akustične modele LSTM za razpoznavanje govora smo
Memory – LSTM).
učili s 4 epohami. Ostale vrednosti smo ohranili na
TDNN so nevronske mreže (Waibel, 1989), ki imajo
privzetih, vključno z začetno in končno efektivno stopnjo
več plasti. Začetne plasti se transformacije učijo bolj ozko,
učenja, ki sta bili nastavljeni na 0,001 in 0,0001.
kasnejše pa imajo širši časovni kontekst. Za kontekstno
V naslednjem poglavju bomo predstavili rezultate
modeliranje je treba zagotoviti, da vsaka nevronska celica
sistemov LSTM in TDNN za razpoznavanje govora. Ker je
poleg vhodne vrednosti, ki jo pridobi od aktivacijske
sistem TDNN dosegel boljše rezultate, smo del
funkcije oziroma iz nižje plasti, pridobi tudi informacijo o
eksperimentov opazovali samo na sistemu TDNN.
vzorcu izhodnih vrednosti in njihovega konteksta. Kar v
primeru s časovnim signalom pomeni, da dobi vsaka
5.2. Jezikovno modeliranje
nevronska celica na vhod informacijo o aktivacijskem
Kot povezovalni člen med akustičnim in jezikovnim
vzorcu skozi čas od nižje ležečih plasti.
prostorom smo uporabili dva različna tipa slovarjev
Nevronske mreže LSTM (Povey, 2018) vključujejo
avtomatskega
razpoznavalnika
govora.
Prvi
tip
spominsko celico, ki ohrani informacijo dalj časa. Celica
uporabljenih slovarjev je bil fonemski slovar, kjer so
ima troje različnih vrat, in sicer vhodna, izhodna ter
besede zapisane s fonemi, in drugi tip, kjer smo namesto
pozabljiva. Vhodna vrata uravnavajo količino podatkov
zapisa izgovorjene besede s fonemi uporabili zapis z
prejšnjega vzorca, ki se bo shranila. Izhodna vrata določajo
grafemi. V tabeli 1 smo predstavili lastnosti slovarjev. Ena
količino podatkov, ki se bo prenesla na naslednjo plast.
izmed lastnosti je tudi delež besede izven slovarja (angl. out
Pozabljiva vrata pa regulirajo hitrost izgubljanja informacij
of vocabulary – OOV), ki ga izračunamo kot:
v celici. Zaradi shranjevanja informacij so sistemi LSTM
primerni za delo s časovnimi signali, saj se lahko
š𝑡. 𝑏𝑒𝑠𝑒𝑑 𝑖𝑧𝑣𝑒𝑛 𝑠𝑙𝑜𝑣𝑎𝑟𝑗𝑎 𝑣 𝑡𝑒𝑠𝑡𝑛𝑖 𝑚𝑛𝑜ž𝑖𝑐𝑖
𝑂𝑂𝑉 =
∙ 100 (2)
pomembni dogodki zamaknejo. Modelu LSTM lahko
š𝑡. 𝑣𝑠𝑒ℎ 𝑏𝑒𝑠𝑒𝑑 𝑣 𝑠𝑙𝑜𝑣𝑎𝑟𝑗𝑢
rečemo tudi izboljšana ponavljajoča se nevronska mreža
Slovarji, ki smo jih uporabili, so večji, kakor tisti, ki so
(angl. Recurrent Neural Network – RNN), saj je bila tako
se uporabljali v prejšnjih razpoznavalnikih informativnih
odpravljena težava izginjajočega gradienta (Hochreiter,
oddaj (Gril et at., 2021). Vrednosti OOV so zelo nizke in
1991).
The TDNN architecture consists of an input layer, hidden layers and an output layer. The input layer has dimension 40. The first hidden layer of the TDNN was a fully connected LDA layer of dimension 40, followed by 8 fully connected TDNN layers of dimension 512; dropout was used on these 8 TDNN layers. The TDNN layers are followed by two parallel branches of layers, the chain branch and the xent branch, each consisting of two layers. The first layers of the two branches are fully connected ReLU layers of dimension 512, which, like the TDNN layers, use dropout; they are followed by the output layers. The branches differ in their loss function: the chain branch uses the log-probability of the correct phoneme or grapheme sequence, while the xent branch uses cross-entropy. The TDNN thus consists of 10 layers, on which we also used temporal pooling, which merges information from the desired time windows relative to the input; temporal pooling was applied at the LDA layer and at TDNN layers 2, 4, 6, 7 and 8.

The TDNN models were trained for 7 epochs. The initial effective learning rate was set to 0.0001 and the final effective learning rate to 0.00001; all other parameters were kept at their default values.

Like the TDNN architecture, the LSTM also contains three kinds of layers. The first is the input layer, which is the same as in the TDNN architecture; likewise, the first hidden layer of the LSTM architecture is the same LDA layer as in the TDNN architecture. The next four hidden layers are LSTMP networks (Long Short-Term Memory with a Projection layer) of size 1024; LSTMP is an LSTM network that additionally contains a projection layer, whose dimension we set to 256 in our configuration. The hidden layers are followed by two branches of output layers, which, as in the TDNN architecture, differ in their loss function.

The LSTM acoustic models for speech recognition were trained for 4 epochs. The other values were kept at their defaults, including the initial and final effective learning rates, which were set to 0.001 and 0.0001.

In the next section we present the results of the LSTM and TDNN speech recognition systems. Since the TDNN system achieved better results, part of the experiments was carried out on the TDNN system only.
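For illustration only, the following PyTorch sketch approximates a TDNN-style stack of the kind described above (40-dimensional input, hidden layers of size 512, dropout, temporal context widening with depth). The dilation schedule and the output dimension are assumptions made for the sketch; this is not the authors' Kaldi chain recipe.

```python
import torch.nn as nn

class TinyTDNN(nn.Module):
    """Illustrative TDNN-like stack: 1-D convolutions over time, where
    deeper layers see a wider temporal context."""
    def __init__(self, in_dim=40, hidden=512, n_layers=8, n_out=2000):
        # n_out stands in for the number of HMM output states (assumed value).
        super().__init__()
        layers, dim = [], in_dim
        for i in range(n_layers):
            # Growing dilation widens the temporal context with depth.
            layers += [nn.Conv1d(dim, hidden, kernel_size=3,
                                 dilation=i + 1, padding=i + 1),
                       nn.ReLU(), nn.Dropout(0.1)]
            dim = hidden
        self.body = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, n_out, kernel_size=1)

    def forward(self, x):  # x: (batch, 40, time)
        return self.out(self.body(x))
```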
5.2. Language modelling
As the connecting link between the acoustic and the language space we used two different types of lexicons for the automatic speech recognizer. The first type was a phonemic lexicon, in which the words are transcribed with phonemes; in the second type, the spoken word is transcribed with graphemes instead of phonemes. The properties of the lexicons are presented in Table 1. One of the properties is the out-of-vocabulary (OOV) rate, computed as:

OOV = (number of test-set words outside the lexicon / number of all words in the lexicon) · 100    (2)

The lexicons we used are larger than those used in previous recognizers of broadcast news (Gril et al., 2021). The OOV values are very low and can safely be neglected.

Lexicon       Lexicon type   No. of words   OOV [%]
Sloleks 2.0   phonemic       1,129,144      0.054
Sloleks 2.0   graphemic      931,848        0.065
RSDO          phonemic       1,440,070      0.008
RSDO          graphemic      1,440,070      0.008

Table 1: Properties of the lexicons used.

The language model of an automatic speech recognizer is trained on a text corpus. Such a model is able to predict the next word given the preceding words in a sequence. The language model also provides contextual ranking: among words with similar pronunciation, it will choose the one that makes more sense given the context of the previously observed word sequence. We trained the language model with the ngram-count tool, which is part of the SRILM package (Stolcke, 2002). N-grams are, in our case, sequences of n words in a sentence. Given a text corpus, ngram-count generates n-grams and uses them to estimate the predictive probabilities of the language model; the maximum n-gram size of the model must be specified. We built a language model containing 1-grams, 2-grams and 3-grams.

6. Results of automatic speech recognition
We evaluated the performance of the different versions of the automatic speech recognizer on test set 2.0 of the RSDO project. For the evaluation we used the word error rate (WER), computed as the ratio between the number of inserted, deleted and substituted words and the number of words in the reference text:

WER = (I + D + S) / N · 100    (1)

where I is the number of insertions, D the number of deletions and S the number of substitutions, and N denotes the number of all words in the reference text of the test set. The ratio is multiplied by 100, since WER is normally reported as a percentage.
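Both evaluation measures can be illustrated with a short Python sketch. The OOV function follows Eq. (2) literally (with the lexicon size in the denominator, as stated above), and WER follows Eq. (1), computed with a standard word-level edit distance; the example strings are hypothetical.

```python
def oov_rate(test_words, lexicon):
    """OOV share following Eq. (2): test-set words missing from the
    lexicon, divided by the lexicon size, times 100."""
    missing = sum(1 for w in test_words if w not in lexicon)
    return 100.0 * missing / len(lexicon)

def wer(reference, hypothesis):
    """Word error rate following Eq. (1): (I + D + S) / N * 100,
    computed with a word-level edit distance."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("danes je lep dan", "danes lep dan je"))  # 50.0
```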
Architecture   Lexicon       Lexicon type   WER [%]
LSTM           Sloleks 2.0   phonemic       38.70
TDNN           Sloleks 2.0   phonemic       27.19
TDNN           RSDO          phonemic       25.31
TDNN           Sloleks 2.0   graphemic      26.97
TDNN           RSDO          graphemic      24.95

Table 2: Speech recognition results with the different combinations of methods and models.

Let us first look at the results obtained when evaluating the two types of acoustic model architectures, presented in Table 2. The LSTM system turned out to be worse: its WER was 11.51 percentage points higher than that of the TDNN system. Based on this result, we chose TDNN as the architecture for the further acoustic models; the baseline WER was 27.19%. Ulčar et al. (2019) achieved a slightly worse result on a similar system, but the results are not directly comparable, since the evaluation was performed on a different test set. A comparison with a previous similar ASR system (Gril et al., 2021) shows a difference in results: the authors then achieved a WER of 15.17%, but with different speech resources. In that case the domain of the resources was limited exclusively to television broadcasts, although these can in some cases, for example with music in the background of speech, be rather complex for automatic speech recognition.

To continue the development of the speech recognition system we used two different lexicons: one built on the basis of Sloleks and one prepared within the RSDO project. Table 2 shows that using the lexicon prepared within the RSDO project improves the evaluation result by 1.88 percentage points.

In the last step we compared automatic speech recognizers in which, for both lexicon types, the phonemic basic acoustic unit was replaced by a graphemic one. For the recognizer based on Sloleks, replacing phonemes with graphemes improved the result by 0.22 percentage points; with the lexicon built within the RSDO project, it improved the WER by 0.36 percentage points. The result with this model and these units is at the same time the best speech recognition result achieved in the presented experiments. The result with graphemes is probably better because of the limited amount of training data and thus the limited number of samples per individual acoustic unit; we can conclude that there were too few samples to recognize specific, rarer acoustic units. Recognition with graphemes, which have fewer basic acoustic units because they do not distinguish subvariants, therefore worked better. Although the improvement with the graphemic lexicon is not particularly large, this type of lexicon has the advantage of a much simpler preparation procedure. It also has advantages in deployment, since it takes up somewhat less memory, which is especially important for automatic speech recognizers with large vocabularies (Large-Vocabulary Continuous Speech Recognition, LVCSR), where large files quickly become a bottleneck. An additional advantage of graphemic acoustic units is that in practical use the lexicon of the automatic speech recognizer can be extended even by a layperson.

7. Conclusion
In this paper we presented a system for the recognition of Slovenian speech. For the acoustic model we used the hybrid HMM-DNN approach, with two types of neural networks for predicting the hidden states of the HMM. Time-delay neural networks proved to be a better approach than long short-term memory networks. For building the lexicon we used two basic acoustic units; in our case, graphemic models gave better results than phonemic ones. We used a new test set created within the RSDO project. The best word error rate was 24.95%, which is comparable to the results of other speech recognition systems. The good recognition result is supported by the large lexicon, which is larger than in comparable systems, and by the use of graphemes as the basic acoustic unit. Systems with graphemes allow simpler construction of lexicons, and such lexicons are also simpler to upgrade. The use of graphemes also has a positive effect in deployment, since such models take up somewhat less memory.

Acknowledgements
We thank the authors of the Gos 1.0 corpus, who allowed us to use it for the development of the automatic speech recognizer. The research was partly carried out within the project RSDO – Slovene in the Digital Environment. The operation Slovene in the Digital Environment is co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund, within the Operational Programme for the Implementation of the EU Cohesion Policy 2014–2020.

8. References
Tasos Anastasakos, John McDonough, Richard Schwartz, and John Makhoul. 1996. A compact model for speaker-adaptive training. In: Proceedings of ICSLP, pp. 1137–1140.
Simon Dobrišek, Jerneja Žganec Gros, Janez Žibert, France Mihelič, and Nikola Pavešić. 2017. Speech Database of Spoken Flight Information Enquiries SOFES 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1125
Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik, and Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1230
Mark J. Gales. 1999. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7(3): 272–281.
Jerneja Gros. 1997. Samodejno tvorjenje govora iz besedil. Doctoral dissertation. Faculty of Electrical Engineering, University of Ljubljana.
Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Available at: https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf (16 May 2022)
Mirjam Killer, Sebastian Stüker, and Tanja Schultz. 2003. Grapheme based speech recognition. Interspeech.
Nataša Logar, Tomaž Erjavec, Simon Krek, Miha Grčar, and Peter Holozan. 2013. Written corpus ccGigafida 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1035
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motliček, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In: IEEE ASRU 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur. 2018. A Time-Restricted Self-Attention Layer for ASR. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5874–5878.
RSDO. (n.d.). Available at: https://www.cjvt.si/rsdo/.
Razvoj slovenščine v digitalnem okolju – RSDO: Rezultat R2.2.7: Orodje za grafemsko fonemsko pretvorbo – verzija 2. Project report, 2022.
Christoph Schillo, Gernot A. Fink, and Franz Kummert. 2000. Grapheme based speech recognition for large vocabularies. In: Sixth International Conference on Spoken Language Processing.
Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing.
Jože Toporišič. 1992. Enciklopedija slovenskega jezika. Cankarjeva založba, Ljubljana.
Jože Toporišič. 1991. Slovenska slovnica. Založba Obzorja, Maribor.
Matej Ulčar, Simon Dobrišek, and Marko Robnik-Šikonja. 2019. Razpoznavanje slovenskega govora z metodami globokih nevronskih mrež. Uporabna informatika, 27(3). Available at: https://uporabna-informatika.si/index.php/ui/article/view/53 (8 November 2021)
Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3): 328–339.
Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Simon Krek, Marko Stabej, and Tomaž Erjavec. 2013. Spoken corpus Gos 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1040
Andrej Žgank, Zdravko Kačič, Frank Diehl, Jožef Juhar, Slavomir Lihan, Klara Vicsi, and Gyorgy Szaszak. 2005. Graphemes as Basic Units for Crosslingual Speech Recognition. In: COST278 Final Workshop and ITRW on Applied Spoken Language Interaction in Distributed Environments.
Andrej Žgank and Zdravko Kačič. 2006. Conversion from phoneme based to grapheme based acoustic models for speech recognition. Interspeech.
What works for Slovenian? A comparative study of different keyword
extraction systems
Boshko Koloski, Senja Pollak, Matej Martinc
Jožef Stefan Institute, Jožef Stefan International Postgraduate School
Jamova cesta 39, Ljubljana, Slovenia
{boshko.koloski,senja.pollak,matej.martinc}@ijs.si
Abstract
Identifying and retrieving keywords from a given document is one of the fundamental problems of natural language processing. In this paper, we conduct a thorough comparative analysis of several distinct approaches for keyword identification on a new benchmark Slovenian keyword extraction corpus, SentiNews. The first group of methods is based on a supervised methodology, where previously annotated data is required for the models to learn. We evaluate two such approaches, TNT-KID and BERT. The other paradigm relies on unsupervised approaches, where no previously annotated training data is needed. We evaluate five different unsupervised approaches, covering the three main types of unsupervised systems: statistical, graph-based and embedding-based. The results show that the supervised models perform significantly better than the unsupervised approaches. By applying the TNT-KID method to the Slovenian corpus for the first time, we also advance the state-of-the-art on the SentiNews corpus.
1. Introduction
Identifying and retrieving keywords from a given document represents one of the crucial tasks for the organization of textual resources. It is employed extensively in media organizations with a large daily article production that needs to be categorized in a fast and efficient manner. While some media houses use keywords to link articles and produce keyword-based networks, journalists use keywords to search for news stories related to newly produced articles and also to summarize new articles with a handful of words. Manual categorization and tagging of these articles is a burdensome and time-demanding task, so the development of algorithms capable of tackling keyword extraction automatically, thereby allowing journalists to spend more time on more important investigative assignments, has become a necessity.

The approaches for automatic detection of keywords can be divided based on their need for annotated data prior to learning. One paradigm of keyword extraction focuses on extracting keywords without prior training (i.e. unsupervised approaches), while the other focuses on learning to identify keyphrases from an annotated dataset (i.e. supervised approaches). While unsupervised approaches can easily be applied to domains and languages with little or no labeled data, they nevertheless tend to offer non-competitive performance when compared to supervised approaches (Martinc et al., 2020), since they cannot be adapted to the specific language and domain through training. On the other hand, supervised state-of-the-art approaches based on the transformer architecture (Vaswani et al., 2017) have become very effective at solving the task, but they usually require substantial amounts of labeled data, which is hard to obtain for some low-resource domains and languages.

In this research, we focus on one of the low-resource languages, Slovenian, for which little manually labeled data that could be leveraged for training keyword extractors is available. We systematically evaluate several distinct strategies for keyword extraction on Slovenian, among them some that have not been tested on Slovenian before. We show that the employment of the TNT-KID model (Martinc et al., 2020), a model specifically adapted to the monolingual low-resource scenario, advances the state-of-the-art on the Slovenian SentiNews keyword extraction benchmark dataset (Bučar, 2017). To summarize, the main contributions of this work include:

• A systematic analysis of a keyword extraction dataset of Slovenian news.
• A thorough comparison of several supervised and unsupervised keyword extraction strategies on the Slovenian dataset. Supervised methods include the monolingual TNT-KID method, which has not been employed for Slovenian before, and an application of the multilingual BERT model (Devlin et al., 2019), same as in Koloski et al. (2022b). We also cover several unsupervised methods, including statistical, graph-based and embedding-based models.
• The advancement of the state-of-the-art on the Slovenian keyword extraction dataset from SentiNews.
• The release of a dockerized pretrained model of the best performing system in terms of F1-score, TNT-KID-Slovene.

The paper is organized in the following manner: Section 2. describes the related work in the field, followed by the description of the data and the exploratory data analysis in Section 3. Section 4. describes the experimental setting considered in this study, and in Section 5. we discuss the results. Finally, Section 6. presents the conclusions of the study and proposes further work.

2. Related work
Keyword extraction approaches are either supervised or unsupervised.
2.1. Unsupervised methods
Modern supervised learning approaches are very successful in keyword extraction, but they are data intensive and time consuming. Unsupervised keyword detectors can address both problems and typically require much less computational resources and no training data, but this comes at the price of lower overall performance. Unsupervised methods can be divided into four main categories:

• statistical: methods of this family calculate various text statistics to capture keywords, such as frequency of appearance, position in the text, etc. KPMiner (El-Beltagy and Rafea, 2009) is one of the oldest methods and focuses on the frequency and position of a given keyphrase. After calculating several frequency-based statistics, the method uses post-processing filtering to remove keyphrases that are too rare or that are not positioned within the first k characters of the document. YAKE (Campos et al., 2018) represents one of the latest upgrades of the statistical approaches and includes the simpler features proposed by KPMiner. Its main novelty is that it also considers the relatedness of term candidates to the general document context, their dispersion, and the casing of a specific term candidate.

• graph-based: these methods create graphs from a given document and then exploit graph properties in order to rank words and phrases. In the first, graph-creation step, authors usually treat two adjacent words as two adjacent nodes in a graph G; some form of word normalization, either stemming or lemmatization, is usually performed before this step. Since keyword phrases can consist of multiple words, the methods may use a sliding window to obtain n-grams up to a specific value of n and use the obtained n-grams as nodes. TextRank (Mihalcea and Tarau, 2004) is one of the first such methods: in the second, keyword-ranking step, it leverages Google's PageRank (Page et al., 1999) algorithm to rank the nodes according to their importance within the graph G (an illustrative sketch is given at the end of this section). While TextRank is a robust method, it does not account for the position of a given term in the document; this was improved in the PositionRank method (Florescu and Caragea, 2017), which leverages PageRank on one side and the position of a given term on the other. An upgrade of the graph-creation step was introduced in Boudin (2018), where the potential keywords are encoded into a multipartite1 graph structure; the method additionally considers topic information and, similarly to TextRank, leverages PageRank (Page et al., 1999) to rank the nodes. RaKUn (Škrlj et al., 2019) is one of the most recent additions to the family of graph-based keyword extractors. Its main contribution is an intermediate step that constructs meta-nodes from the initial nodes of the graph via aggregation of the existing nodes. After the construction of the meta-graph, it applies the load centrality metric for term ranking, and also relies on multiple graph redundancy measures.

• embedding-based: these methods are gaining traction with the recent introduction of various off-the-shelf pretrained embeddings such as FastText (Bojanowski et al., 2016) or transformer-based embeddings, e.g. BERT (Devlin et al., 2019). Key2Vec (Mahata et al., 2018) is the pioneer of this type of method, followed by EmbedRank (Bennani-Smires et al., 2018); both consider the semantic information captured by distributed word and sentence embedding representations. KeyBERT (Grootendorst, 2020) is currently the state-of-the-art method of this type. It is founded on pretrained sentence-BERT representations (Reimers and Gurevych, 2019): the method embeds n-grams of a given size and compares them to the embedding of the entire document, and the n-grams closely matching the representation of the entire document (i.e. the keywords most representative of it) are retrieved as the keywords that best describe the overall document content. In order to diversify the results, the method also introduces the Max Sum Similarity metric, with which the model selects the candidate phrases with the highest rank that are least similar to each other.

• language model-based: these methods use language-model-derived statistics to extract keywords from text. Tomokiyo and Hurst (2003) considered multiple language models and measured the Kullback-Leibler divergence (Joyce, 2011) to rank both the phraseness and the informativeness of candidate terms.

2.2. Supervised methods
Supervised methods require manually annotated data for training. They can be divided into neural and non-neural.

2.2.1. Non-neural
The first methods that proposed a supervised solution treated keyword extraction as a classification task. The KEA method (Witten et al., 1999) treats each word or phrase as a potential keyword, uses the TF-IDF metric (Sammut and Webb, 2010) and word position for representation, and Naive Bayes to classify a given term as a keyword or not.

2.2.2. Neural
With the recent gain in computing power and the introduction of more modern deep architectures, the field of keyword extraction was taken by storm by neural architectures. The neural approaches can be divided into two groups: those that treat the task as sequence-to-sequence generation and those that model it as sequence labelling.

Meng et al. (2017) first proposed the idea of keyword extraction as a sequence-to-sequence generation task. In their work they proposed a recurrent generative model with an attention and a copying mechanism (Gu et al., 2016) based on positional information. An additional strong point of this model is that, due to its generative nature, it is able to find keywords that do not appear in the text.

1 Family of graphs where the nodes can be split into multiple disjoint sets.
The first representative of the sequence-labelling methods is the approach by Luan et al. (2017), where the authors use a bidirectional Long Short-Term Memory (BiLSTM) layer and a conditional random field (CRF) layer for classification. More recent approaches of this type utilize the transformer architecture (Vaswani et al., 2017). An upgrade of the approach by Luan et al. (2017) was proposed by Sahrawat et al. (2020), where contextual embeddings generated by BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and GPT-2 (Radford et al., 2019) were fed into the BiLSTM network. Currently, the state-of-the-art model based on the transformer architecture is the one proposed by Martinc et al. (2020). They employ the tactic of not relying on massive language model pretraining, but rather on language model pretraining on much smaller domain-specific corpora. This makes the approach more easily transferable to less resourced domains and languages.

Most keyword recognition studies still focus on English. Nevertheless, several multilingual and cross-lingual studies have been conducted recently, also including low-resource languages. One of them is the study by Koloski et al. (2021), which compared the performance of two supervised transformer-based models, a multilingual BERT with a BiLSTM-CRF classification head (Sahrawat et al., 2020) and TNT-KID, in a multilingual setting with Estonian, Latvian, Croatian and Russian news corpora. The authors also investigated whether combining the results of the supervised models with the results of the unsupervised models can improve the recall of the system. In Koloski et al. (2022b), an extensive study was conducted to compare the performance of supervised zero-shot cross-lingual approaches with unsupervised approaches. The study covered six languages: Slovenian, English, Estonian, Latvian, Croatian, and Russian. The authors show that models fine-tuned to extract keywords on a combination of languages outperform the unsupervised models when evaluated on a new, previously unseen language not included in the training dataset.
3. Data
We conduct our experiments on the Slovenian SentiNews dataset (Bučar, 2017), which was originally used for news sentiment analysis, but nevertheless contains manually labeled keywords and was therefore identified as suitable for keyword extraction (Koloski et al., 2022a). Before feeding the datasets to the models, they are lowercased. We split the dataset into three splits: train, validation and test.

3.1. Exploratory data analysis
Next, we perform exploratory data analysis (EDA) on the dataset. There are in total 7,514 documents: 4,796 (64%) for training, 1,199 (16%) for validation and 1,519 (20%) for testing, which makes the dataset relatively small in comparison to some English keyword extraction datasets such as KPTimes (Gallina et al., 2019), which contains more than 200,000 documents. We benchmark all of our models on the same test split that was already used in the study by Koloski et al. (2022b), in order to make our results directly comparable to the related work.

The documents have a similar structure in all three splits, with on average 370 words (370.10 in the train split, 366.89 in the validation split and 377.46 in the test split) and around 15 sentences (15.419 in the train split, 15.203 in the validation split and 15.662 in the test split).

Property                                Train     Valid     Test
Document statistics
# of documents                          4796      1199      1519
avg. # of sentences                     15.419    15.2026   15.6622
avg. # of words                         370.10    366.89    377.46
Keyword statistics
# of keywords                           19429     4773      5903
# of unique keywords                    4414      1854      2049
# of unique keywords per document       0.9203    1.5462    1.3489
# of keywords per document              4.0052    4.1643    3.8861
keywords present in the document        59.91 %   60.54 %   59.95 %
Keyword composition statistics
Proportion of 1-word terms              92.77 %   93.17 %   92.68 %
Proportion of 2-word terms              5.88 %    5.61 %    5.98 %
Proportion of 3-word terms              0.62 %    0.57 %    0.58 %
Proportion of more than 3-word terms    0.74 %    0.65 %    0.76 %

Table 1: Dataset statistics. We conducted three different statistical analyses. The first one was on the document level and considered counting the word and sentence tokens. The second focused on keyword-level statistics, such as the total number of keywords, the number of unique keywords, and the proportion of all versus unique keywords per document. Finally, we explored the composition of keywords, i.e. how many of them were composed of one, two, three or more words.

There are in total 30,105 keywords in the dataset, 8,317 of them unique. On average there are 4.01 keywords per document in the training split, 4.16 in the validation split and 3.89 in the test split. As for unique keywords per split, there are 0.92 unique keywords per document in the training split, 1.55 in the validation split and 1.35 in the test split. Since the keyword extractors used in this study are only able to extract keywords that are present in the text, we also calculated the share of keywords present in the document: 59.91% in the training set, 60.54% in the validation set and 59.95% in the test set.

Finally, we conducted a study of the composition of keywords, exploring how many words constitute a specific keyphrase. In all splits, more than 92% of the keywords contained only a single word, 2-word terms represented about 5% of the keywords, and terms of 3 or more words represented the remaining share (around 1.3%, see Table 1). The most common keyword was gospodarstvo with 2,350 occurrences (roughly 12% of all keyword occurrences), followed by ekonomija with 1,315 occurrences (6.76%) and banka with 147 occurrences (0.08%).
These keywords suggest that most of the articles come from the economic and financial domain. In order to explore the structure and content of the dataset in more detail, we perform an additional network-science analysis on the graph of the 100 most frequent terms. We construct a graph G100 in the following manner: we create links among every pair of keywords that accompany a given article in the training split, and repeat this step for every article in the training split.

We next focus on community detection in the constructed graph, for which we use the Louvain algorithm (Blondel et al., 2008). The algorithm detects four distinct communities. The first one, colored green, is the most central community, i.e. the community with the highest number of links shared with the three other detected communities; it contains general terms like family, declaration, NKB (a bank), etc. The next, purple community concerns the trend of rising taxes, new laws and the petrochemical industry. The community colored blue represents economic news about the infrastructure and construction industries. The last, yellow community concerns financial help from the government and the European Union, together with unemployment and the slow rise of GDP. The graph and its detected communities are presented in Figure 1.
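A minimal sketch of this community-detection step follows, assuming networkx ≥ 2.8 (which ships a Louvain implementation) and a toy stand-in for the per-article keyword lists; it illustrates the construction of the co-occurrence graph and the Louvain call, not the exact analysis pipeline used for Figure 1.

```python
import itertools
import networkx as nx

# Hypothetical per-article gold keyword lists from the training split.
article_keywords = [
    ["gospodarstvo", "davki", "zakon"],
    ["gospodarstvo", "banka", "bdp"],
    ["banka", "bdp", "nezaposlenost"],
]

# Link every pair of keywords that annotate the same article.
G = nx.Graph()
for keywords in article_keywords:
    G.add_edges_from(itertools.combinations(keywords, 2))

# Louvain community detection (Blondel et al., 2008).
communities = nx.community.louvain_communities(G, seed=42)
print(communities)  # list of sets of keywords, one set per community
```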
4. Methods
In our experiments, we follow the experimental setting proposed in Koloski et al. (2021) and Koloski et al. (2022b). The methods and the hyperparameters used are described below.

4.1. Unsupervised approaches
We evaluate three types of unsupervised keyword extraction methods described in Section 2.: statistical, graph-based, and embedding-based. Note that these models were already evaluated on the same corpus in Koloski et al. (2022b).

4.1.1. Statistical methods
• YAKE (Campos et al., 2018): We consider n-grams with n ∈ {1, 2, 3} as potential keywords.
• KPMiner (El-Beltagy and Rafea, 2009): We apply a least allowable seen frequency of 3 and set the cutoff to 400.

4.1.2. Embedding-based methods
• KeyBERT (Grootendorst, 2020): For document embedding generation we employ sentence-transformers (Reimers and Gurevych, 2019), more specifically the distiluse-base-multilingual-cased-v2 model available in the Huggingface library.2 Initially, we tested two different KeyBERT configurations, one with n-grams of size 1 and another with n-grams ranging from 1 to 3, with MMR=false and MaxSum=false. The unigram model outscored the model that considered n-grams of sizes 1 to 3 as keyword candidates for all languages, so in the final report we only show the results for the unigram model (a minimal usage sketch is given at the end of Section 4.1.).

4.1.3. Graph-based methods
• MultipartiteRank (Boudin, 2018): We set the minimum similarity threshold for clustering to 74%.
• RaKUn (Škrlj et al., 2019): We use edit distance for calculating the distance between nodes, remove stopwords (using the stopwords-iso library3), and apply a bigram-count threshold of 2 and a distance threshold of 2. An example graph of the RaKUn document representation and its predicted keywords is presented in Figure 2.

Figure 2: Visualization of one training example as seen by the RaKUn method. The visualization is generated via the Py3Plex library. The top three extracted tokens here are Ljubljana, Prihodki and Zdravil, indicating that the article is about a purchase of medicine.

We use the PKE (Boudin, 2016) implementations of YAKE, KPMiner and MultipartiteRank, and the official implementations of RaKUn (Škrlj et al., 2019) and KeyBERT (Grootendorst, 2020). For unsupervised models, the number of returned keywords needs to be set in advance; since we employ F1@10 as the main evaluation measure (see Section 4.3.), we set the number of returned keywords to 10 for all models.

2 https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2
3 https://github.com/stopwords-iso/stopwords-iso
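The KeyBERT configuration described above can be sketched as follows; this is an assumption-level example based on the keybert package's public API (not the authors' exact script), with a hypothetical document string:

```python
from keybert import KeyBERT

# Multilingual sentence-transformers backbone, as described above.
kw_model = KeyBERT(model="distiluse-base-multilingual-cased-v2")

doc = "Vlada je sprejela nov zakon o dohodnini ..."  # hypothetical article text
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 1),  # unigram candidates, as in the final setup
    use_mmr=False,                 # MMR=false
    use_maxsum=False,              # MaxSum=false
    top_n=10,                      # 10 keywords, matching the F1@10 evaluation
)
print(keywords)  # list of (keyword, similarity score) pairs
```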
4.2. Supervised approaches
We test two distinct state-of-the-art transformer-based models, BERT (Devlin et al., 2019) and TNT-KID (Martinc et al., 2020).
Figure 1: Visualization of the derived communities of the co-occurrence graph.

4.2.1. BERT sequence labelling
As a strong baseline, we utilize the transformer-based BERT model (Devlin et al., 2019) with a token-classification head consisting of a simple linear layer for all our supervised approaches. We treat the keyword extraction task as a sequence classification task: following the approach proposed in Martinc et al. (2020), we predict binary labels (1 for 'keyword' and 0 for 'not keyword') for all words in the sequence, and a sequence of two or more consecutive keyword labels predicted by the model is always interpreted as a multi-word keyword. More specifically, we employ the bert-uncased-multilingual model from the HuggingFace library (Wolf et al., 2019) and fine-tune it on the SentiNews train split using an adaptive learning rate (starting with a learning rate of 3 · 10−5), for up to 10 epochs with a batch size of 8 (a minimal fine-tuning sketch is given after Section 4.2.2.). Note that we chose this model since it is the best performing model on the Slovenian SentiNews dataset according to the study by Koloski et al. (2022b).

4.2.2. TNT-KID sequence labelling
As with BERT, we follow the approach proposed in Martinc et al. (2020) and predict binary labels (1 for 'keyword' and 0 for 'not keyword') for all words in the sequence; again, a sequence of two or more consecutive keyword labels predicted by the model is always interpreted as a multi-word keyword. We first pretrain TNT-KID as an autoregressive language model on a domain-specific news corpus containing 884,407 news articles crawled from the websites of several Slovenian news outlets; the model was pretrained for 10 epochs. After that, the model was fine-tuned on the SentiNews train set for the keyword extraction task, again for up to 10 epochs. The sequence length was set to 256, the embedding size to 512 and the batch size to 8, and we employ the same preprocessing as in the original study (Martinc et al., 2020).
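A minimal sketch of the BERT sequence-labelling setup described in Section 4.2.1. follows, assuming the transformers library; the toy dataset and the simplified label-to-subword alignment are stand-ins for the real tokenized SentiNews split, and the hyperparameters mirror those stated above.

```python
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Binary token labels: 1 = keyword, 0 = not a keyword.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

class ToyKeywordDataset(torch.utils.data.Dataset):
    """Tiny stand-in for the tokenized SentiNews split (hypothetical data;
    label-to-subword alignment is simplified here)."""
    def __init__(self, texts, labels, max_len=32):
        self.enc = [tokenizer(t, truncation=True, padding="max_length",
                              max_length=max_len) for t in texts]
        for e, l in zip(self.enc, labels):
            e["labels"] = (l + [0] * max_len)[:max_len]
    def __len__(self):
        return len(self.enc)
    def __getitem__(self, i):
        return {k: torch.tensor(v) for k, v in self.enc[i].items()}

train = ToyKeywordDataset(["vlada je sprejela nov zakon o dohodnini"],
                          [[0, 0, 0, 0, 1, 0, 1]])

args = TrainingArguments(
    output_dir="keyword-bert",       # hypothetical output path
    learning_rate=3e-5,              # starting learning rate, as above
    num_train_epochs=10,             # up to 10 epochs
    per_device_train_batch_size=8,   # batch size of 8
)
Trainer(model=model, args=args, train_dataset=train).train()
```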
4.3. Evaluation setting
To evaluate the models, we compute F1, Recall, and Precision on 10 retrieved words. Formally:

Recall@10 = (# of recommended relevant items @ 10) / (total # of relevant items)

Precision@10 = (# of recommended relevant items @ 10) / (# of recommended items @ 10)

We omit the documents that have no gold-standard keywords or none of whose keywords appear in the text. We do this because we only use approaches that extract words (or multi-word expressions) from the given document and cannot process keywords that do not appear in the text. All approaches are evaluated on the same monolingual test splits, which are not used for training the supervised models. Lowercasing and lemmatization are performed during the evaluation for both the gold standard and the extracted keywords (keyphrases).
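A minimal sketch of these per-document metrics (with a hypothetical helper name and example lists; keywords are assumed to be already lowercased and lemmatized):

```python
def precision_recall_f1_at_k(predicted, gold, k=10):
    """Precision@k, Recall@k and F1@k for one document."""
    top_k = predicted[:k]
    relevant = [kw for kw in top_k if kw in gold]
    precision = len(relevant) / len(top_k) if top_k else 0.0
    recall = len(relevant) / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: two of the ten predictions match the gold keywords.
pred = ["gospodarstvo", "banka", "davki", "vlada", "bdp",
        "zakon", "trg", "delo", "cena", "rast"]
gold = ["gospodarstvo", "bdp", "inflacija"]
print(precision_recall_f1_at_k(pred, gold))
```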
els in terms of recall. The supervised models outscored the
unsupervised models by a large margin on the given task.
The ranking of the models in terms of various metrics is
5.
Results
given in Figure 3.
In this section we examine the results of the evaluation
of the proposed models. We first study the results of the
6.
Conclusion and further work
unsupervised methods and later the results of the supervised
In this study, we compared the performance of super-
models.
vised and unsupervised keyword extraction methods on
the new public benchmark for keyword extraction, derived
5.1.
Unsupervised methods
from Slovenian SentiNews corpus. We have compared 8
different models, among them also TNT-KID, which has
In this study we evaluate 5 different unsupervised meth-
not been tested on Slovenian dataset yet. Five unsupervised
ods: 2 statistical, 1 embedding-based and 2 graph-based
approaches can be further divided into two graph-based,
methods. Comparing the two statistical methods, KPMiner
two statistical and one embedding-based approach. The
outscored the YAKE method in terms of f1-score and preci-
embedding-based method KeyBERT showcased superior
sion. The embedding based KeyBERT method achieved the
performance to the other unsupervised methods in terms of
best results when compared to other unsupervised methods.
F1-score at 10 retrieved keywords.
From the graph-based methods, RaKUn performed the best
When it comes to supervised approaches, we experi-
in comparison with the MPRU method, achieving nearly
mented with two transformer based models - one leverag-
100% relative improvement. Table 2 presents the results
ing multilingual BERT and the other the TNT-KID method
for all systems and evaluation metrics in detail.
- that model keyword extraction as a sequence labelling
task. The TNT-KID approach outperformed BERT-based
5.2.
Supervised methods
approach (and all unsupervised models) in terms of pre-
We use two different supervised methods based on the
cision and F1-score. These results therefore support the
sequence labeling paradigm. BERT based model outper-
claims of the original study by (Martinc et al., 2020) that
forms TNT-KID in terms of recall by about 5 percentage
TNT-KID can be easily adapted for employment on less-
points, achieving the best recall out of all models. In terms
resource languages, such as Slovenian, by domain specific
of precision, TNT-KID outscores the BERT model by 9.04
unsupervised language model pretraining. By employing
percentage points and achieves the best precision@10 score
TNT-KID on the SentiNews dataset, we have advanced the
- 38.58%. We believe this is due to the extensive language-
state-of-the-art on the benchmark Slovenian keyword ex-
model pretraining on a large domain specific Slovenian
traction dataset.
news corpus and the frequency of common co-occurrence
For further work, we plan to explore how potentially
patterns in the data, that TNT-KID has learned to exploit
we can improve the results by constructing ensembles of
successfully.
keyword extractors. We will also propose testing several
PRISPEVKI
83
PAPERS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
Figure 3: Comparison of the models' rankings with respect to Precision@10, Recall@10 and F1-score@10.

We will also test several different data splitting strategies, in order to study the possible effect of the splitting strategy on the performance of different models and to establish the best possible split strategy. We also hypothesize that a possible improvement can be introduced by taking into account the co-occurrence of various pairs of keywords. Finally, in the future we plan to expand our experiments to also include the recently introduced monolingual massively pretrained model for Slovenian, SloBERTa (Ulčar and Robnik-Šikonja, 2020). We plan to fine-tune this model for the keyword extraction task and compare it to TNT-KID, to check whether the state-of-the-art can be advanced even further.

7. Availability
The best-performing TNT-KID based model is available as a docker model at https://gitlab.com/boshko.koloski/tnt_kid_app_slo.

8. Acknowledgements
The authors acknowledge the financial support from the Slovenian Research Agency for research core funding for the programme Knowledge Technologies (No. P2-0103) and the project Computer-assisted multilingual news discourse analysis with contextual embeddings (CANDAS, J6-2581).

9. References
Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, and Martin Jaggi. 2018. Simple unsupervised keyphrase extraction using sentence embeddings. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 221–229, Brussels, Belgium, October. Association for Computational Linguistics.
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, October.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Florian Boudin. 2016. PKE: an open source Python-based keyphrase extraction toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 69–73, Osaka, Japan, December.
Florian Boudin. 2018. Unsupervised keyphrase extraction with multipartite graphs. CoRR, abs/1803.08721.
Jože Bučar. 2017. Manually sentiment annotated Slovenian news corpus SentiNews 1.0. Slovenian language resource repository CLARIN.SI.
Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Mário Jorge, Célia Nunes, and Adam Jatowt. 2018. YAKE! Collection-independent automatic keyword extractor. In European Conference on Information Retrieval, pages 806–810. Springer.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
Samhaa R El-Beltagy and Ahmed Rafea. 2009. KP-Miner: A keyphrase extraction system for English and Arabic documents. Information Systems, 34(1):132–144.
Corina Florescu and Cornelia Caragea. 2017. PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1115, Vancouver, Canada, July. Association for Computational Linguistics.
Ygor Gallina, Florian Boudin, and Béatrice Daille. 2019. KPTimes: A large-scale dataset for keyphrase generation on news documents. In Proceedings of the 12th International Conference on Natural Language Generation, pages 130–135.
Maarten Grootendorst. 2020. KeyBERT: Minimal keyword extraction with BERT.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany, August. Association for Computational Linguistics.
James M. Joyce. 2011. Kullback-Leibler Divergence, pages 720–722. Springer Berlin Heidelberg, Berlin, Heidelberg.
Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc. 2021. Extending neural keyword extraction with TF-IDF tagset matching. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, pages 22–29, Online, April. Association for Computational Linguistics.
Boshko Koloski, Matej Martinc, Ilija Tavchioski, Blaž Škrlj, and Senja Pollak. 2022a. Slovenian keyword extraction dataset from SentiNews 1.0. Slovenian language resource repository CLARIN.SI.
Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc. 2022b. Out of thin air: Is zero-shot cross-lingual keyword detection better than unsupervised? In Proceedings of the Language Resources and Evaluation Conference, pages 400–409, Marseille, France, June. European Language Resources Association.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi. 2017. Scientific information extraction with semi-supervised neural tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2641–2651, Copenhagen, Denmark, September. Association for Computational Linguistics.
Debanjan Mahata, John Kuriakose, Rajiv Ratn Shah, and Roger Zimmermann. 2018. Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 634–639, New Orleans, Louisiana, USA, June. Association for Computational Linguistics.
Matej Martinc, Blaž Škrlj, and Senja Pollak. 2020. TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering, pages 1–40.
Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 582–592, Vancouver, Canada, July. Association for Computational Linguistics.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November. Previous number = SIDL-WP-1999-0120.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, November. Association for Computational Linguistics.
Dhruva Sahrawat, Debanjan Mahata, Mayank Kulkarni, Haimin Zhang, Rakesh Gosangi, Amanda Stent, Agniv Sharma, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. 2020. Keyphrase extraction from scholarly articles as sequence labeling using contextualized embeddings. In Proceedings of the European Conference on Information Retrieval (ECIR 2020), pages 328–335, Lisbon, Portugal. Springer.
Claude Sammut and Geoffrey I. Webb, editors. 2010. TF–IDF, pages 986–987. Springer US, Boston, MA.
Takashi Tomokiyo and Matthew Hurst. 2003. A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, pages 33–40, Sapporo, Japan. Association for Computational Linguistics.
Matej Ulčar and Marko Robnik-Šikonja. 2020. Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, Vancouver, Canada. Curran Associates, Inc.
Blaž Škrlj, Andraz Repar, and Senja Pollak. 2019. RaKUn: Rank-based keyword extraction via unsupervised learning and meta vertex aggregation. CoRR, abs/1907.06458.
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, DL '99, pages 254–255, Berkeley, California, USA. Association for Computing Machinery.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
The Trendi Monitor Corpus of Slovene: Methods, Content, and Text Categorization
Iztok Kosem,‡* Jaka Čibej,‡ Kaja Dobrovoljc,‡* Nikola Ljubešić‡
‡ Jožef Stefan Institute
Jamova cesta 39, 1000 Ljubljana
iztok.kosem@ijs.si, jaka.cibej@ijs.si, kaja.dobrovoljc@ijs.si, nikola.ljubesic@ijs.si
* Faculty of Arts, University of Ljubljana
Aškerčeva 2, 1000 Ljubljana
Abstract
In the paper, we present the compilation of the Trendi corpus, the first monitor corpus of Slovene. The first version of the corpus, named Trendi 2022-05, contains over 565 million tokens coming from more than 1.4 million different texts. The purpose of the corpus is to provide both experts and non-experts with data on contemporary language use and to enable the monitoring of the appearance of new words and the increase or decrease in the use of existing words. We present the methodology of corpus compilation, its content, and the first steps towards the automatic classification of corpus texts into categories (such as economics and environment), which will enable the monitoring of language use by thematic area. We also describe the results of a survey whose goal was to collect feedback on user expectations of a language monitoring resource.
1. Introduction
Language is constantly changing: new words appear, existing words and phrases acquire new meanings, certain words or their meanings fall out of use, and so on. Recently, partly because of the covid-19 epidemic, which brought a great deal of new terminology, the field of neology has received particular attention, both lexical (new words) and semantic (new meanings).

Changes in language are typically monitored with monitor corpora, which contain the most recent texts in a language. Monitor corpora fill the gap left by reference corpora, whose compilation takes a long time because of the diversity of texts and formats involved and because of their size. Given today's technological progress and the fact that a very large amount of text is now available online, building monitor corpora has become easier: what is published today can be included in a corpus tomorrow.

Despite its otherwise rich corpus infrastructure, Slovene has so far lacked a monitor corpus, even though various stakeholders have shown a clear need for one. We addressed this gap within the project Monitor corpus and accompanying data resources (SLED),1 which runs from October 2021 to November 2022 and is co-financed by the Ministry of Culture of the Republic of Slovenia. The goal of the project is not only to build a monitor corpus, but also to prepare the infrastructure for updating it regularly.

In this paper we first offer an overview of some of the more important foreign monitor corpora, and then present the methodology and content of the Trendi monitor corpus. This is followed by a presentation of the classification of thematic categories, which we designed in order to prepare a model for the automatic categorization of texts. In the last part we present a survey among users about the desired statistical computations from the corpus. In the conclusion we present plans for future work.

2. Monitor corpora
Internationally, monitor corpora have been around since the 20th century. One of the first was the Bank of English, first published in 1991. It contains more than 650 million words2 and is today included in the 4.5-billion-word COBUILD corpus of the Collins publishing house. The corpus is not freely accessible; apart from Collins employees, it can also be used by staff and students of the University of Birmingham.

For English, the most important corpus today is NOW (News on the Web; Davies, 2016-), which contains more than 15 billion words from online newspapers and magazines and covers texts from 2010 onwards. As mentioned on its website,3 the corpus grows by 180-200 million words every month.

An extensive collection of corpora for monitoring language change, covering more than 35 languages besides English, are the Timestamped JSI corpora. The corpora contain news collected by the JSI Newsfeed at the Jožef Stefan Institute (Trampuš and Novak, 2012). Corpora for 18 languages are available in the Sketch Engine tool (Kilgarriff et al., 2004),4 which, among its other functions, offers users the so-called Trends (Herman, 2013), a feature that helps identify trends in word usage. The corpora in Sketch Engine contain texts from 2014 to April 2021 (the time of the last update) and are of various sizes; the English corpus, for example, contains approximately 60 billion words.

1 https://sled.ijs.si/
2 Unfortunately, we could not find information on when the corpus was last updated.
3 https://www.english-corpora.org/now/
4 https://www.sketchengine.eu/
contain texts from 2014 to April 2021 (the time of the last update) and vary in size; the English corpus, for example, contains approximately 60 billion words.

There are quite a few other monitor corpora, but they are often available for internal use only. One example is ONLINE, a dynamic monitor corpus of Czech compiled by the Institute of the Czech National Corpus (https://korpus.cz/). It contains approximately 6.3 billion words, comprising online news, comments (posted under online news items), and texts from forums and social networks (Facebook, Twitter, Instagram). The ONLINE corpus is divided into two complementary corpora: ONLINE_NOW and ONLINE_ARCHIVE. The former is updated daily and covers the last month and the preceding six months. ONLINE_ARCHIVE covers the period from February 2017 to the first month contained in ONLINE_NOW; at the beginning of each month, the oldest month of content in ONLINE_NOW is moved to ONLINE_ARCHIVE.

There are also smaller, more specialized monitor corpora, such as the Coronavirus corpus (Davies, 2019-), which covers the period from January 2020 onwards and contains more than 1.4 billion words. It comprises online news in English and grows by 3 to 4 million words per day.

To a certain extent, the role of a monitor corpus is also played by diachronic corpora, provided, of course, that they contain texts that are as recent as possible. An example is the Corpus of Contemporary American English (Davies, 2008-), which contains texts from 1990 to March 2020 (the last update) and comprises more than a billion words. Its advantage is genre balance: it contains texts from eight different genres (spoken language, fiction, magazines, newspapers, academic texts, TV and movie subtitles, blogs, and other web pages). The Slovene equivalent would be the Gigafida 2.0 corpus (Krek et al., 2019; https://viri.cjvt.si/gigafida/), which comprises 1.13 billion words but is less up to date than the Corpus of Contemporary American English (it contains texts only up to 2018).

For Slovene, no true monitor corpus has existed to date. There are resources such as the Language Monitor (Jezikovni sledilnik; Kosem et al., 2021; https://viri.cjvt.si/sledilnik/slv/) that already exploit the most recent data on language use, in this case from the JSI Newsfeed, to build a kind of temporary corpus on which statistical calculations are then performed. Such targeted use is certainly needed, but it is aimed at the non-expert public; expert users such as lexicographers, linguists, and other researchers need access to the original texts if they want to carry out further analyses.

3. The Trendi corpus

We undertook the compilation of the first monitor corpus of Slovene, which we named Trendi, within the SLED project. In addition to building and regularly updating the Trendi corpus, the project has two further goals: preparing corpus-based statistics on various aspects of word usage, and developing a tool that will automatically label texts with their thematic category.

3.1. Corpus methodology and content

From a methodological point of view, designing the Trendi corpus required two decisions: which period the corpus should cover, and how often it should be updated. Regarding the period, our starting point was the wish that Trendi should always cover the gap left by the latest version of the reference (written) corpus Gigafida, currently version 2.0. At the moment, this means that Trendi will contain texts from 2019 onwards, and that when a new version of Gigafida is published (e.g., Gigafida 3.0, to be released within the project Development of Slovene in the Digital Environment, RSDO, https://slovenscina.eu/), the period covered by Trendi will be adjusted accordingly. The close link with Gigafida also means that Trendi will represent standard written Slovene. This decision also seems sensible because non-standard and spoken Slovene are covered by corpora such as JANES (https://www.clarin.si/kontext/query?corpname=janes) and Gos (http://www.korpus-gos.net/), so their development is the subject of separate projects. Finally, one should not forget the emerging metaFida corpus (https://www.clarin.si/kontext/query?corpname=mfida01), which will bring together all Slovene corpora.

In preparing the list of sources to be included in the Trendi corpus, we started from the list of Slovene online sources found in the JSI Newsfeed service. We compiled a list of all sources from 2019 to the end of 2021, together with the total number of texts per source, and then analysed each of the 243 sources in detail. We excluded 90 sources because they were foreign websites or Slovene websites with content in a foreign language. We then removed a further 34 sources, some because they did not contain media news (blogs, websites of government offices and companies), others because their content was too specialized (e.g., repositories of academic publications are more suitable for corpora such as the Corpus of Academic Slovene). One site (preberi.si) was removed because it is an aggregator of news from other sources. The final list for the Trendi corpus thus contains 110 sources; among those that contributed the most news items in 2019-2021 are sta.si (260,080 texts), rtvslo.si (97,924), siol.net (69,471), delo.si (65,415), 24ur.com (61,623), dnevnik.si (47,749), and vecer.com (45,548).

The list of sources will be updated regularly, as we can expect new websites to appear and existing ones to be discontinued. A case in point is the website necenzurirano.si, which only appeared in 2020 and already ranks 28th by number of news items (8,494). Adding new sources also means a larger number of words per month and, consequently, a larger Trendi corpus. Current rough estimates indicate that Trendi will grow by 10-15 million tokens per month; the average monthly volume was 12.5 million tokens in 2019 and already 21 million tokens in 2021.

Given the nature of the Trendi corpus, regular updates will be needed; for now, they are planned at the monthly level, as is the practice with similar foreign corpora. This currently seems realistic, considering the time required for acquiring and
annotating the texts, converting them into the required formats, and loading the corpus into the concordancers.

3.2. Text preparation

For text preparation we have set up a pipeline that includes text acquisition, annotation at various levels, merging by source and period, and conversion into different formats. Text acquisition is for now tied to the JSI Newsfeed service, which uses the RSS news protocol, but we are in the middle of preparing our own extraction procedure. We decided on this mainly because we discovered that acquisition needs improvement for many sources: for example, other parts of the page are extracted alongside the text, the text is not acquired in full, and so on. In addition, pages sometimes contain important metadata about the text that is currently not part of the capture. In the new procedure, we will manually inspect the acquisition results for each source and adapt the algorithm for every source where the need arises.

Some sources, such as sta.si, delo.si, etc., have their content locked, i.e., accessible to subscribers only. When acquiring texts via the RSS protocol, only summaries or the first few paragraphs are thus freely available, sometimes even only the headline and subheadline. To address this problem, we joined forces with the team that is concluding agreements with text providers within the RSDO project, i.e., for the preparation of the Gigafida 3.0 corpus. The agreement with text providers includes the regular delivery of complete texts. Consequently, the final form of the Trendi pipeline will combine the preparation of texts acquired from the web with texts delivered in digital form by the text providers.

Part of the acquisition procedure is also deduplication, which is currently limited to the level of the text source: one step of the pipeline checks that a text with the same URL does not occur twice. We are aware that there is considerable overlap between sources because they cover the same events. Moreover, many sources base numerous news items on content from sta.si, which leads to duplicated text at the level of sentences, paragraphs, or even entire documents. Nevertheless, content-level deduplication is not planned for the Trendi corpus, as we want to allow users to analyse the content of individual sources and to perform comparative analyses across sources. Deduplication will, however, most likely be performed when texts are prepared for the new version of the Gigafida corpus, as was the practice with previous versions (Krek et al., 2019).
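Conceptually, the URL-level check described above is just a set lookup over normalized URLs. The following minimal Python sketch illustrates the idea; the function names, the item structure, and the normalization choices (dropping query strings and fragments) are our own assumptions, not a description of the actual SLED pipeline.

    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url: str) -> str:
        """Normalize a URL so that trivial variants map to the same key."""
        parts = urlsplit(url.strip().lower())
        # Drop the query string and fragment, which often carry tracking parameters.
        return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))

    def deduplicate(items):
        """Yield feed items with previously unseen URLs, in their original order."""
        seen = set()
        for item in items:  # each item is e.g. {"url": ..., "text": ...}
            key = normalize_url(item["url"])
            if key not in seen:
                seen.add(key)
                yield item

Whether query strings can be dropped safely is a per-source decision: some portals encode the article identifier there, in which case the normalization would have to keep it.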
This is followed by automatic linguistic annotation, for which we use the CLASSLA-Stanza annotation pipeline (Ljubešić and Dobrovoljc, 2019; https://pypi.org/project/classla/), which is being actively developed within the RSDO project as the reference tool for the grammatical annotation of Slovene texts. The tool is an extension of the open-source Stanza toolkit (Qi et al., 2020) that, compared to the original software, addresses the specifics of Slovene in more detail, particularly at the levels of sentence segmentation, tokenization, and morphosyntactic tagging and lemmatization according to the JOS system (Erjavec et al., 2010). In addition to these levels, the tool also parses the texts syntactically according to the Universal Dependencies scheme (Dobrovoljc et al., 2017) and labels named entities (Zupan et al., 2017) such as the names of persons, places, organizations, etc.
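For illustration, a basic annotation step with the classla package (see the URL above) could look like the following sketch; the choice of processors and the example sentence are ours, and the actual RSDO/SLED configuration may differ.

    import classla

    classla.download('sl')  # fetch the standard Slovene models on first use

    # Tokenization and sentence segmentation, morphosyntactic tagging,
    # lemmatization, dependency parsing and named-entity recognition.
    nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse,ner')

    doc = nlp('Korpus Trendi vsebuje najnovejša besedila v slovenščini.')
    print(doc.to_conll())  # CoNLL-U, the pipeline's native output format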
After annotation is complete, the pipeline also converts the texts from the annotation tool's native format (CoNLL-U) into TEI XML, which we need, among other things, for statistical calculations with the LIST program (Krsnik et al., 2019). This process includes two related text-merging procedures: merging texts by source per day (a daily procedure) and merging the texts of the same source for an entire month (once a month, retroactively at the beginning of the new month). In the final step, which we run once a month and which has to be launched separately because it combines XSLT with a Perl script, the monthly files (split by source) are converted into the VERT format used by the KonText (Machálek, 2020) and NoSketch Engine (Rychlý, 2007) concordancers.

3.3. The first version of the Trendi corpus

The first version of the Trendi corpus, named Trendi 2022-05, was published in June 2022 and contains 565,308,991 tokens, or a little more than 473 million words. The corpus comprises 1,436,548 texts from 48 publishers, with the largest shares belonging to the Slovenian Press Agency (337,484 texts; 23.5%), Delo d.o.o. (128,164; 8.9%), Radiotelevizija Slovenija (124,861; 8.7%), Media24 d.o.o. (100,587; 7%), PRO PLUS d.o.o. (86,578; 6%), and TSMedia d.o.o. (83,342; 5.8%).

3.4. Availability of the Trendi corpus

The Trendi corpus is freely available for browsing in three CLARIN.SI concordancers: the KonText concordancer (https://www.clarin.si/kontext/) and two versions of the NoSketch Engine concordancer (https://www.clarin.si/noske/). KonText and NoSketch Engine share many functionalities (simple and advanced search, etc.), but KonText offers registration and the saving of queries and favourite corpora, while NoSketch Engine offers additional functionalities, such as keyword extraction from corpora, that can be used without registration. Besides the older version of its user interface (Bonito), the NoSketch Engine concordancer at CLARIN.SI has recently also become available in a newer version (Crystal, https://www.clarin.si/ske/), which provides an improved user experience and easier long-term maintenance.

Due to copyright restrictions, the openly available version of the Trendi corpus will be produced using the same method as ccGigafida 1.0 (Logar et al., 2013), i.e., random paragraphs of individual texts will be sampled; it will be available in the CLARIN.SI repository.

In the repository, the corpus will be available both in the TEI format and in the CoNLL-U format, as the latter is the preferred format for tasks involving further data processing, e.g., machine learning, data extraction, and the like.

3.5. Thematic categorization of texts

One of the activities of the SLED project is also the development of a tool for the automatic categorization of texts by topic. To build such a tool, or rather the model behind it, we need two things: a classification of categories and a training set.
In designing the set of categories, we relied on data from three groups of sources:
● Slovene news portals; we selected six of them: rtvslo.si, delo.si, sta.si, dnevnik.si, 24ur.com, and vecer.com;
● the set of topic codes, or categories, of the International Press Telecommunications Council (IPTC, https://cv.iptc.org/newscodes/subjectcode), with which we also wanted to ensure that our categories align as closely as possible with the international standard;
● the categories used in contemporary synchronic and monitor corpora, of which the Czech corpus SYN2015 (Křen et al., 2016) and the Estonian National Corpus (Koppel and Kallas, in press) were the most relevant.

The main guideline in preparing the classification was to create a relatively small set of categories into which all the news on the various portals can be placed; this should also ensure better model performance. Consequently, in analysing the sources we paid more attention to the top-level categories, which was especially necessary for the IPTC set, which has approximately 1,400 categories arranged in three levels (with only 17 categories at the top level). To illustrate why using only top-level categories makes sense, consider the category sport, which on most news portals has further subcategories; of these, only football and basketball appear everywhere, while the others appear only on some portals. For example, dnevnik.si has no winter sports subpage but does have a separate subpage for news about Luka Dončić; rtvslo.si is the only portal with a subpage for Formula 1 news; and 24ur.com has separate subpages for the Champions League and the Europa League (football) as well as for combat sports.

Our final classification contains the following 13 categories:
● arts and culture: texts about culture, the arts, films, books, and theatre, as well as reviews, etc.;
● accidents and crime: natural and other disasters, offences, crime;
● economy: texts on the economy, markets, finance, employment, etc.;
● environment: environmental protection, the planet, energy sources, and agricultural topics;
● health: people's physical and mental health, medicine, pharmacy, health infrastructure;
● leisure: hobbies, recreation, travel, tourism, pets, home and family, living;
● politics and law: international and national news on public administration, legal proceedings, and social relations, conflicts, and wars;
● science and technology: scientific discoveries, curiosities, technological innovation, information technology, computing;
● society: social issues and relations, equality, discrimination, religion, ethics, etc.;
● sport: sports results and stories from various sports;
● weather: meteorological forecasts, descriptions of weather phenomena, conditions, and processes;
● entertainment: show business, fashion, style;
● education: the processes of transferring and acquiring knowledge and skills; all levels of education, from kindergarten to university, as well as lifelong learning.

As the comparison in Table 1 shows, there is considerable overlap both with the categories of the news portals and with those of IPTC and the foreign corpora. In some cases, e.g., economy, leisure, politics, and society, our category covers several categories of the other sources; for leisure, the Estonian corpus has as many as seven separate categories. The only case where a single category of the foreign sources maps onto two of ours is arts and culture versus entertainment. We separated these two categories partly because many Slovene news portals have separate subpages for them, and partly because of the language itself: in contrast to entertainment content, cultural and artistic content is often considerably more specialized.

While all 17 IPTC categories can be placed into our categories, the Czech and Estonian corpora lack certain categories: the Estonian corpus, for example, has no accidents and crime category, while the Czech corpus lacks environment, health, science and technology, and entertainment. Neither has a separate weather category, which IPTC does have and which we added because most Slovene news portals have it.

If we also look at the overlap between our categories and the pages or subpages of the six Slovene news portals, the problematic categories are above all politics, society, and education. These are legitimate categories, but the portals have no dedicated subpages for them; instead, such news is scattered across other subpages, which are mostly defined by the geographical origin of the news, e.g., Slovenia, World, Local. While the authors of the Czech corpus decided to follow this kind of division in their categories as well (current events, foreign news, domestic news, regional news), we preferred to stick to topic. For the construction of training sets, this means somewhat more manual work, i.e., finding other indicators with which the topic of an article on a given portal can be identified. The exception is the portal sta.si, which already has suitable categories, namely Education (Šolstvo) and Society (Družba), and, for politics, National Assembly, European Union, International Politics, Slovene Domestic Politics, and Slovene Foreign Politics.

We built the training sets by mapping the categories of the various news sources onto our internal categorization, so that texts from specific categories of specific sources can be used for model training. In preparing the training sets, we will sample both the amount of data per source and the amount of data per category, thereby ensuring the diversity of the training sets as well as the robustness of the final model.

For modelling, we will use the fastText tool (Joulin et al., 2016) with the CLARIN.SI embeddings (Ljubešić and Erjavec, 2018), and the SloBERTa model (Ulčar and Robnik-Šikonja, 2021). Based on the difference in results (we expect SloBERTa to perform better, though the difference may not be very noticeable) and the complexity of the classifiers (fastText is considerably faster and requires far less memory), we will choose the classifier to be used on new texts.
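As an illustration of the fastText option, a supervised topic classifier over labelled news texts could be trained roughly as follows; the file names, hyperparameters, and label scheme are our own placeholders, as the paper does not specify the final training setup.

    import fasttext

    # train.txt: one document per line, each prefixed with its category,
    # e.g. "__label__sport Kolesarji so začeli zadnjo etapo ..."
    model = fasttext.train_supervised(
        input='train.txt',
        epoch=10,                 # illustrative hyperparameters
        lr=0.5,
        wordNgrams=2,             # include word bigrams as features
        dim=300,                  # must match the pre-trained embedding dimension
        pretrainedVectors='clarin-si-embed-sl.vec',  # placeholder file name
    )

    # Predict the most probable category for a new text.
    labels, probs = model.predict('Vlada je sprejela nov proračun.')
    print(labels[0], float(probs[0]))

    model.save_model('trendi-topics.bin')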
Category | Slovene portals (of 6) | Czech corpus | Estonian corpus | IPTC
arts and culture | 5 | culture | culture & entertainment | arts, culture and entertainment
accidents and crime | 6 | crime | / | disaster and accident
economy | 6 | economy | economy, finance & business; agriculture; construction & real estate | economy, business and finance; labour
environment | 2 | / | nature & environment | environmental issue
health | 3 | / | health | health
leisure | 4 | leisure | beauty; cars; food & drinks; gambling & casinos; home, family & children; pets and animals; travel & tourism; video games | lifestyle and leisure
politics and law | 1 | politics | politics & government | politics; crime, law and justice; unrest, conflicts and war
science and technology | 5 | / | science, technology & IT | science and technology
society | 1 | social life | society; religion; sex; women | social issue; religion and belief; human interest
sport | 6 | sports | sports | sport
weather | 4 | / | / | weather
entertainment | 4 | / | culture & entertainment* | arts, culture and entertainment*
education | 1 | / | education | education

Table 1: Comparison of the SLED thematic categories with Slovene news portals and foreign sources.
3.6. Results of the user survey

Since Trendi is the first corpus of its kind in the Slovene environment, we wanted to design it as closely as possible in line with user expectations. We checked these in December 2021 with a user survey, through which we established which data on current language use the research community wants and in what form (e.g., various lists, such as neologism candidates, the words and phrases with the most prominent usage in a given period (day, week, month), prominent words and phrases by source, etc.).

The survey was created on the 1KA platform and consisted of 9 questions: 5 of these concerned content, while 4 collected demographic data (gender, age, field of activity). It was disseminated through the mailing lists of the Slovene linguistic research communities (e.g., SlovLit and the mailing list of the Slovenian Language Technologies Society) and on Facebook (on the official page of the Centre for Language Resources and Technologies of the University of Ljubljana and in informal linguistic user groups such as Prevajalci, na pomoč!). A more detailed report on the survey is available on the project website: https://sled.ijs.si/wp-content/uploads/2022/02/SLED_anketa_porocilo_2022-2-03_final.pdf.

One hundred questionnaires were completed in full. The sample covered mainly female respondents (82%), with a smaller share of male respondents (18%). By age, the sample mostly comprises the generations between 26 and 55 years of age (80% of all participants), with the largest groups between 26 and 35 (33%) and between 46 and 55 (32%). Most
participants are employed either in the public sector (61%) or are self-employed (20%); only a small share have student status (3%), are employed in companies (6%), are retired (4%), or are looking for work (5%). By field of activity, where participants could select several options, proofreading (60%) and translation (46%) lead, with high shares also for amateur language research (38%), professional and academic writing (34%), linguistic research (32%), and creative writing and blogging (22%). Various categories of language teaching together account for 40% (Slovene as a first language in primary or secondary school, Slovene as a second or foreign language, linguistics courses at the tertiary level). The sample suggests that the survey covered a wide range of linguistic and research activity.

Below we present a more detailed analysis of the answers to the content questions.

3.6.1. Usage scenarios and user interest

Respondents indicated which data in a tool for monitoring current language use would interest them most, rating their level of interest (1 - does not interest me at all, 5 - interests me very much) for each of 6 proposed usage scenarios (with concrete examples for easier understanding). The scenarios included, for example: Which words or phrases are most characteristic of one period compared to another? (e.g., which words were used much more frequently in February 2020 than in February 2021); In which period is a given word or phrase most frequent? (e.g., was the word "tajkun" really most frequent in 2008-2009?); Is the use of a word or phrase rising or falling according to recent trends? (e.g., is "epidemija" being used more and more, or less and less?).

The results show that respondents find all the proposed scenarios interesting: the categories "Interests me" (4) and "Interests me very much" (5) together account for between 74% and 88% for each scenario. The scenario that stands out most is the one allowing a comparison of the usage trends of two or more words or phrases (e.g., anticepilec vs. proticepilec); respondents are equally interested in whether the use of a given word or phrase is rising or falling according to recent trends.

A good three quarters of respondents (76%) answered that data on current language use would be useful in their work; only 9% would find such data of no use (15% were undecided). The survey results thus confirm that the linguistic community is interested in data on trends in language use and that there is a real need for a language resource that provides such data promptly and continuously.

3.6.2. Ways of displaying the data

On a scale from 1 (not important at all) to 5 (very important), respondents also rated which of the proposed ways of displaying the data (graphs with trends in language use, tables with numerical data, lists of words or phrases with rising/falling usage, other) they consider important. Combining the shares of the categories "important" (4) and "very important" (5) gives 79% for graphs, 64% for tables with numerical data, and 87% for lists of words with falling/rising usage. Respondents are thus most interested in simple lists, and least in advanced tables with numerical data.

3.6.3. User suggestions

In an open question, respondents had the opportunity to suggest additional scenarios of interest concerning current language use. There were 15 additional suggestions. They concern, for example, linking the tool with other language resources (e.g., integration into the Sloleks morphological lexicon of Slovene and into the Gigafida corpus of written standard Slovene) and data access (e.g., access to the data via a public API), the comparison of synonymous variants of words or phrases (e.g., oče vs. ata), the inclusion of usage examples, and the monitoring of the usage of longer units (e.g., idioms). Most of the additional suggestions go beyond the scope of the SLED project, but they represent important feedback for reflection on the future development of the monitor corpus and on the integration of the data extracted from it into other language resources.

4. Conclusion and future work

In this paper we have presented the various activities of the SLED project, focusing on Trendi, the emerging monitor corpus of Slovene. We described the methodology of its compilation, its content, and the forms in which it is available to users. We also presented the classification of thematic categories designed for building a model for the automatic thematic categorization of texts. The last part was devoted to the results of the survey on user expectations regarding data on current language use.

In the coming months, we will continue publishing monthly versions of the corpus, prepare the first statistical calculations, and complete and evaluate the algorithm for automatic text categorization. Importantly, we have devoted a great deal of time to establishing automatic procedures for text preparation and for the calculations, as this will speed up the updating of the data in the concordancers and in the CLARIN.SI repository.

Another key activity is the improvement of the text acquisition procedure, which will remove certain shortcomings of the current method. Since a close link will be established between the Trendi corpus and the Gigafida reference corpus, every improvement of the procedures will benefit both corpora.

With the Trendi corpus, the Slovene language infrastructure has gained an important resource that will be relevant both to the research community and to the wider public.

5. Acknowledgements

The SLED project (Monitor Corpus and Accompanying Data Resources) is financed by the Ministry of Culture of the Republic of Slovenia as part of the Public Call for the (Co-)Financing of Projects Dedicated to Building and Updating Language Infrastructure for Slovene in the Digital Environment 2021-2022. The research programmes No. P6-0411 (Language Resources and Technologies for Slovene) and No. P6-0215 (Slovene Language: Basic, Contrastive, and Applied Studies) were co-financed by the Slovenian Research Agency from the state budget.
6. References

Mark Davies. 2008-. The Corpus of Contemporary American English (COCA). https://www.english-corpora.org/coca/.
Mark Davies. 2016-. Corpus of News on the Web (NOW). https://www.english-corpora.org/now/.
Mark Davies. 2019-. The Coronavirus Corpus. https://www.english-corpora.org/corona/.
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek. 2017. The Universal Dependencies Treebank for Slovenian. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, BSNLP@EACL 2017, pp. 33-38.
Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10).
Ondrej Herman. 2013. Automatic Methods for Detection of Word Usage in Time. Diploma thesis. Masaryk University, Faculty of Informatics.
Armand Joulin, Edouard Grave, Piotr Bojanowski and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv. https://arxiv.org/abs/1607.01759.
Adam Kilgarriff, Pavel Rychlý, Pavel Smrz and David Tugwell. 2004. The Sketch Engine. In: G. Williams and S. Vessier, eds., Proceedings of the Eleventh EURALEX International Congress, Lorient, France, pp. 105-116. Lorient: Université de Bretagne Sud.
Kristina Koppel and Jelena Kallas. (in press). Eesti keele ühendkorpuste sari 2013-2021: mahukaim eestikeelsete digitekstide kogu. Eesti Rakenduslingvistika Ühingu aastaraamat.
Iztok Kosem, Simon Krek, Polona Gantar, Špela Arhar Holdt and Jaka Čibej. 2021. Language Monitor: Tracking the Use of Words in Contemporary Slovene. In: I. Kosem, M. Cukr, M. Jakubíček, J. Kallas, S. Krek and C. Tiberius, eds., Electronic Lexicography in the 21st Century. Proceedings of the eLex 2021 Conference, 5-7 July 2021, virtual, pp. 514-527. Brno: Lexical Computing CZ, s.r.o. https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_33_pp514-528.pdf.
Simon Krek et al. 2019. Corpus of Written Standard Slovene Gigafida 2.0, Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1320.
Luka Krsnik et al. 2019. Corpus Extraction Tool LIST 1.2, Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1276.
Michal Křen, Václav Cvrček, Tomáš Čapka, Anna Čermáková, Milena Hnátková, Lucie Chlumská, Tomáš Jelínek, Dominika Kováříková, Vladimír Petkevič, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Truneček, Pavel Vondřička and Adrian Jan Zasina. 2016. SYN2015: Representative Corpus of Contemporary Written Czech. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 2522-2528, Portorož, Slovenia. European Language Resources Association (ELRA).
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 29-34.
Nikola Ljubešić and Tomaž Erjavec. 2018. Word Embeddings CLARIN.SI-embed.sl 1.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1204.
Nataša Logar, Tomaž Erjavec, Simon Krek, Miha Grčar and Peter Holozan. 2013. Written Corpus ccGigafida 1.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1035.
Tomáš Machálek. 2020. KonText: Advanced and Flexible Corpus Query Interface. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 7003-7008. https://www.aclweb.org/anthology/2020.lrec-1.865.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. arXiv preprint arXiv:2003.07082.
Pavel Rychlý. 2007. Manatee/Bonito: A Modular Corpus Manager. In: RASLAN, pp. 65-70.
Mitja Trampuš and Blaž Novak. 2012. The Internals of an Aggregated Web News Feed. In: Proceedings of the 15th Multiconference on Information Society 2012 (IS-2012). http://ailab.ijs.si/dunja/SiKDD2012/Papers/Trampus_Newsfeed.pdf.
Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene Monolingual Large Pretrained Masked Language Model. In: Proceedings of the 24th International Multiconference IS2021 (SiKDD). https://ailab.ijs.si/dunja/SiKDD2021/Papers/Ulcar+Robnik.pdf.
Katja Zupan, Nikola Ljubešić and Tomaž Erjavec. Smernice za označevanje imenskih entitet v slovenskem jeziku. https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1238/SlovenianNER-slv-v1.0.pdf?sequence=7&isAllowed=y.
Automatic Text Analysis in Language Assessment: Developing a MultiDis Web Application
Sara Košutar*, Dario Karl‡, Matea Kramarić*, Gordana Hržica*
* Faculty of Education and Rehabilitation Sciences, University of Zagreb
University Campus Borongaj, Borongajska cesta 83 f, 10 000 Zagreb
sara.kosutar@erf.unizg.hr
matea.kramaric@erf.unizg.hr
gordana.hrzica@erf.unizg.hr
‡Department of Data Science, InSky Solutions
Medačka ulica 18, 10 000 Zagreb
dario.karl.sl@gmail.com
Abstract
Language sample analysis provides rich information about the language abilities in the written or spoken text produced by a speaker in response to a language task. Language sample analysis is generally used to assess the abilities of children during language acquisition, but also the abilities of adult speakers across the lifespan. Its wide range of uses also allows for the assessment of language abilities in educational contexts such as second language acquisition or fluency, the abilities of bilingual speakers in general, and it is also used for diagnosis in speech and language pathology. Various computer programs have been developed to assist in the language sample analysis.
However, these programs have been developed mainly for English and are often not fully open-access or do not provide data on population metrics, the history of data uploaded by a user, and/or improvements in basic language measures. The time needed for transcription and the linguistic knowledge required for manual analysis are considered to be the main obstacles to its implementation. The goal of this paper is to present MultiDis, a web-based application intended for the analysis of language samples at the microstructural text level in Croatian. The application is still under development, but the current version fulfils its main purpose: it enables the (semi-)automatic calculation of measures reflecting language productivity, lexical diversity, syntactic complexity, and discourse cohesion in spoken language, and provides users with socio-demographic and linguistic metadata as well as the history of uploaded transcripts. We will present the challenges we have faced in developing the application (e.g., annotation system, text standardisation), future improvements we plan to make to the application (e.g., syntactic parsing, speech-to-text, multilingual analysis), and the possibilities of its use in the wider scientific and professional community.
1. Language sample analysis

Language sample analysis provides rich information about the language abilities in the written or spoken text produced by a speaker in response to a language task, e.g. storytelling, written essay, description of a picture, answering questions, etc. It is an ecologically valid means of language assessment that can be used along with standardised language tests because it provides data that tests cannot. Compared to standardised tests, language sample analysis has greater ecological validity because it reflects the natural everyday situation of language production. Consequently, it allows for a more in-depth analysis of specific morphosyntactic, semantic, and pragmatic features. Due to its lower bias, it proved to be more suitable for studying regional variations and dialects compared to standard questionnaires (e.g., Samardžić and Ljubešić, 2021). Language sample analysis is generally used to assess children's abilities during language acquisition, but also adult speakers' abilities across the lifespan (e.g., Westerveld et al., 2004). Its wide range of uses allows for the assessment of language abilities in educational contexts such as second language acquisition or fluency (e.g., Clercq and Housen, 2017), the abilities of bilingual speakers in general (e.g., Gagarina et al., 2016), and it is also used for diagnosis in speech and language pathology (e.g., Justice et al., 2006). This type of analysis is widely used in some countries, but in many countries, scientists and professionals are unaware of its benefits or find it too complex and time-consuming (see Heilmann, 2010; Klatte et al., 2022).

The process of collecting language samples involves several steps. First, a speaker is given a language task, for example telling a story based on a picture, and is recorded while performing this task. The recordings are then transcribed using special codes and are divided into smaller units of analysis, e.g., communication units (C-units; see Labov and Waletzky, 1967). Special codes mark different features of the spoken language or deviations (e.g., repetitions, omissions of vowels, use of regionally marked words, morphosyntactic errors, etc.). When written language samples are collected, the speaker responds to the task in writing, but all further steps are the same. Once the transcripts are produced, they can be analysed in various computer programs that enable the (semi-)automatic calculation of different language measures.

Language sample analysis provides information about language abilities at two levels of text structure (Gagarina et al., 2012). First is the microstructural level, which refers to the internal linguistic organisation and includes text length, vocabulary use, morphosyntax, cohesive devices, etc. At the microstructural level, one can observe, for example, which language structures have emerged during language acquisition or how complex they are in terms of their internal features. The macrostructural analysis allows for assessing the ability of the hierarchical organisation of the text (e.g., in storytelling, whether the speaker has expressed a goal, an attempt, an outcome, etc.). At the macrostructural level, one can examine how successfully a speaker connects sentences according to a language task. By examining these elements, one gains insight into the quality of an individual's language when performing a particular language task, but also indirectly information on her or his language skills in general.
1.1. Language measures

Different aspects of microstructure correspond to several dimensions, such as productivity, lexical diversity, and syntactic complexity. A set of (semi-)automatic measures has been proposed to assess language abilities at the microstructural level. Productivity refers to the amount of language (words or utterances) produced (Leadholm and Miller, 1992). Measures of productivity include the total number of C-units or the total number of words (TNW). C-units are often used instead of utterances in spoken language analysis (see MacWhinney, 2000). The basic criteria for dividing a sequence of spoken words into utterances are intonation and pauses. However, transcribers may rate the utterances differently against these criteria, which results in lower inter-rater reliability (Stockman, 2010). C-units consist of one or more clauses. A clause is any syntactic unit consisting of at least one predicate. A complex sentence with one or more dependent clauses constitutes one C-unit, while a compound sentence is divided into two or more C-units, depending on the number of independent clauses. Studies have shown that measures of productivity can distinguish children with typical language status from children with developmental language disorders (DLD; Wetherell et al., 2007), bilingual from monolingual children (Hržica and Roch, 2021), and adult speakers according to their language skills (Nippold et al., 2017).

Measures of lexical diversity are used to assess vocabulary abilities. The more diverse the vocabulary produced, the greater the lexical diversity. Measuring lexical diversity is more complex and therefore methodologically challenging. Traditional measures include the number of different words (NDW; Miller, 1981) and the type-token ratio (TTR; Templin, 1957). Types and tokens can be easily calculated automatically, whereas lemmas are more difficult to calculate automatically and require specialized natural language processing tasks. In particular, this requires morphological analyses such as lemmatisation, part-of-speech (POS) tagging, or morphological segmentation. In languages with rich morphology, the lemma-token ratio would be more appropriate, but due to the time-consuming nature of the task, this has rarely been done (see Balčiūnienė and Kornev, 2019). Another problem with measures of lexical diversity and measures of productivity is that they are affected by the length of a language sample (Malvern et al., 2004; McCarthy, 2005).

To overcome these limitations, alternative measures have been developed, such as D (Malvern and Richards, 1997) and the moving average type-token ratio (MATTR; Covington and McFall, 2010). The measure D is based on modelling the decrease in TTR with the increasing size of the language sample using mathematical algorithms. MATTR calculates TTR for text windows of a fixed size, e.g., 500 words: the window moves through the text, calculating TTR for words 1-500, 2-501, etc., and at the end of the text all TTRs are averaged to determine the final score. However, it is not yet clear which of these measures provides more reliable results, as the results of validation studies vary (see deBoer, 2014; Fergadiotis et al., 2015). Regardless of methodological limitations, these measures can distinguish the abilities of children and adults with typical language status from children or adults with DLD (e.g., Hržica et al., 2019; Kapantzoglou et al., 2019). Measures of lexical diversity have also been found to correlate with standardised vocabulary tests in bilingual children (e.g., Hržica and Roch, 2021).
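To make the two measures concrete, the following small Python sketch computes TTR and MATTR for a pre-tokenized sample, with the windowed averaging implemented as described above (the function names are ours):

    def ttr(tokens):
        """Type-token ratio: number of distinct types divided by number of tokens."""
        return len(set(tokens)) / len(tokens)

    def mattr(tokens, window=500):
        """Moving average TTR: average the TTR of every window-sized span."""
        if len(tokens) <= window:
            return ttr(tokens)  # sample shorter than one window: plain TTR
        spans = (tokens[i:i + window] for i in range(len(tokens) - window + 1))
        values = [ttr(span) for span in spans]
        return sum(values) / len(values)

    sample = "the dog chased the cat and the cat chased the dog".split()
    print(ttr(sample), mattr(sample, window=5))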
Syntactic complexity refers to the range of syntactic structures and the degree of sophistication of these structures in language production (Ortega, 2003). It is usually measured by calculating the average length of the C-unit. The length of the C-unit increases when there is a dependent clause or when the syntax within the clause is more complex, for example when the clause is extended by adding attributes, appositions, or adjectives. Measures of syntactic complexity have been shown to distinguish between different groups of speakers, including children with DLD and adults of different ages (e.g., Rice et al., 2010; Nippold et al., 2017). In addition to the average length of syntactic units, other commonly used measures of syntactic complexity include clausal density (i.e., the total number of main and subordinate clauses divided by the total number of C-units) and the mean length of clause (main or subordinate) (e.g., Scott and Stokes, 1995; Norris and Ortega, 2009). Because of the variety of measures and the different methods of calculation, little is known about which measures are appropriate concerning typological differences between languages, and some of these measures are not always automatic.

In the last decades of the 20th century, various computer programs were developed to support language sample analysis (overview: Pezold et al., 2020), but they are often not user-friendly. More recently, web-based programs have been introduced that allow for the analysis of language use at different linguistic levels (e.g., Coh-Metrix; McNamara et al., 2014). The measures are based on basic calculations (e.g., TTR, MLU), but there are also advanced measures based on language technologies such as the annotation of morphological, syntactic, and semantic features. Such applications are mainly developed for English or other widely spoken languages and are often not fully open-access. There is an increasing awareness of the importance of language sample analysis as a complementary method in language assessment. The time needed for transcription and the linguistic knowledge required for manual analysis are considered to be the main obstacles to its implementation (Pezold et al., 2020). Therefore, the development of a tool for the automatic calculation of language measures could make naturalistic language assessment more feasible.

2. Goal of the paper

The goal of this paper is to present MultiDis, a web-based application intended for the analysis of language samples at the microstructural level in Croatian, which enables the (semi-)automatic calculation of measures reflecting language productivity, lexical diversity, syntactic complexity, and discourse cohesion in spoken and written language. We will present the challenges we have faced in developing the application, the future improvements we plan to make, and the possibilities of its use in the wider scientific and professional community.

3. Development of the MultiDis web application

Existing computer-based resources used to analyse children's or adults' language abilities are either developed for English only or do not provide data on population
metrics, the history of data uploaded by a user, and/or improvements in basic language measures such as NDW or TTR. The Computerized Language Analysis (CLAN; MacWhinney, 2000), for example, is a freely available desktop application whose users are expected to have a high level of language and transcription expertise. Text Inspector (2018), on the other hand, is a web-based application, but it is designed only for the text analysis of English, and the target users are mainly first or second language acquisition teachers. We aim to develop a web-based application that fosters the analysis of language samples in Croatian. Our target users work at least partly with spoken language (e.g., language diagnostics performed by speech and language pathologists), so the application should support both written and spoken language analysis. The application is currently being developed, and we will present the coding system, language resources, data collection and language measures that have been implemented so far.

3.1. Annotation codes

Considering that our target users mostly work with spoken language, there are several codes which can be used to annotate the data. Computer programs for language analysis such as CLAN (MacWhinney, 2000) have an entire system of very specific annotation codes. In the MultiDis web application, a new and simpler system of annotation codes was developed to provide a faster and more organised annotation process. The system of codes was designed to include several categories with individual codes and subsets of codes. The main idea is to have a system of annotation codes that can be changed over time and that meets the following criteria:
● hierarchical (with categories and subcategories of codes),
● extensible (adding new categories and codes),
● easily customizable (each category has a recognizable first character).

To date, the following categories have been established: phonotactic codes include conversation markers and elements of communication; citation codes indicate references to another utterance within the language sample; phonetic codes indicate pronunciation and other elements specific to spoken language; sociolinguistic codes indicate dialectisms, neologisms, foreign words, etc.; correction codes indicate errors made at a particular level of linguistic structure (phonological, morphosyntactic and/or lexical). There is also an additional code for corrections: a marker that can be used to exclude a particular segment from the transcript and provide a correct or standardised form that the application will use to standardise any text before moving on to a later stage of language analysis. A full description of the codes is available on the web page of the application: http://www.multidis.com.hr/statistics/.

An example of multiple annotation codes would be the sentence in (1), which would look like (2) in the uploaded transcript. Angle brackets mark a segment that needs to be excluded, and round brackets provide a 'standardised' form of that segment. In addition, the @d code preceding the token ćuko 'dog' marks a dialectism. The application converts the sentence in (2) into the standardised form, i.e., the sentence in (3), mapping the dialectism and providing this information in the final analysis report.

(1) Dečko i ćuko su ulovili žabicu. 'The boy and the dog caught the frog'
(2) Dečko i <ćuko> (@d pas) su ulovili žabicu. 'The boy and the dog caught the frog'
(3) Dečko i pas su ulovili žabicu. 'The boy and the dog caught the frog'

The annotation system and parsing rules for the transcripts were implemented using common regular expressions (regex) in Python (Van Rossum, 2020). Regular expressions allow the system to recognise specific codes, save the data, and convert the language into a standard form, so that existing language resources, such as tokenizers and lemmatizers, achieve a higher hit rate and precision. After annotation and parsing, the application provides a standardised language text on which further language sample analysis is performed.
In the early stages of developing the MultiDis web
annotation codes was developed to provide a faster and
application, one of the main linguistic resources used was
more organised annotation process. The system of the
Stanza, a Python natural language processing toolkit for
codes was designed to include several categories with
human language developed at Stanford University (Qi et
individual codes and subsets of codes. The main idea is to
al., 2020). Stanza enables quick out-of-the-box processing
have a system of annotation codes that can be changed over
of multilingual texts. Since we plan to test our use case –
time according to the following criteria:
based on the analysis of children’s spoken language – on
hierarchical (with categories and subcategories of
multiple languages, Stanza has an advantage over several
codes)
other natural language processing models, frameworks and
extensible (adding new categories and codes)
neural pipelines, such as Podium (Tutek et al., 2021),
easily customizable system (each category has a
CLASSLA (Ljubešić and Dobrovoljc, 2019) or BERTić
recognizable first character).
(Ljubešić and Lauc, 2021). Lemmatisation and POS
To date, the following categories have been established:
tagging are fairly accurate (> 85 % of the cases), as they do
phonotactic codes include conversation markers and
not interfere with the computation of currently
elements of communication; citation codes indicate
implemented language measures, though the process of
references to another utterance within the language sample;
delimiting the boundaries of C-units has been an obstacle
phonetic codes indicate pronunciation and other elements
that is currently being resolved. We are also exploring other
specific to spoken language; sociolinguistic codes indicate
options and planning further analysis and accuracy testing
dialectisms, neologisms, foreign words, etc.; correction
for this task. Since the language samples that the
codes indicate errors made at a particular level of linguistic
application will analyse are non-literary texts, we also plan
structure – phonological, morphosyntactic and/or lexical.
to explicitly compare the aforementioned tools in the tasks
There is also an additional code for corrections – a marker
of lemmatisation, POS tagging and morphosyntactic
that can be used to exclude a particular segment from the
description (MSD) using our datasets to improve the
transcript and provide a correct or standardised form that
application’s baseline accuracy in these tasks. The standard
the application will use to standardise any text before
for POS tagging is MulTextEast language resources
moving on to a later stage of language analysis. A full
(Erjavec, 2010), version 4 for the Croatian language. In this
description of codes is available on the web page of the
way, a token ćuko ‘dog’ is annotated as a dialectism using
application: http://www.multidis.com.hr/statistics/.
the annotation codes for the transcript parsing, and the
An example of multiple annotation codes would be a
standardised form pas ‘dog’ receives a morphosyntactic tag
sentence in (1), that would look like (2) in the following
Ncmsn (nominative case, common noun, masculine,
uploaded transcript. Angle brackets point to a segment that
singular).
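For illustration, a Croatian Stanza pipeline covering the three tasks listed above can be set up as in the following sketch; the output handling is ours, and the application's internal integration may differ.

    import stanza

    stanza.download('hr')  # fetch the Croatian models on first use

    nlp = stanza.Pipeline('hr', processors='tokenize,pos,lemma,depparse')

    doc = nlp('Dečko i pas su ulovili žabicu.')
    for sentence in doc.sentences:
        for word in sentence.words:
            # lemma, universal POS tag, and the language-specific
            # (MULTEXT-East style) tag
            print(word.text, word.lemma, word.upos, word.xpos)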
3.3. Data collection: manual annotation of transcripts with the new coding system

In the next step of developing the MultiDis web application, it was important to test the annotation system and the parsing of the language samples, as the aim was to obtain a standardised text with the data on the participants'
socio-demographic and language characteristics, parsed with the appropriate annotation codes and available to the user along with the morphosyntactic data. Before running the analysis, the texts were manually transcribed by students and volunteers within the courses Computer Analysis of Child Language and Volunteering at the Department of Speech and Language Pathology at the Faculty of Education and Rehabilitation Sciences, University of Zagreb. The test transcripts are the result of a storytelling task, mostly Frog, where are you? (Mayer, 1969) and the Multilingual Assessment Instrument for Narratives (MAIN; Gagarina et al., 2012; Gagarina et al., 2019; Hržica and Kuvač Kraljević, 2012, 2020). After the implementation of annotation codes, these transcripts were successfully standardised and prepared for the final analysis. Any other transcript can be uploaded to the application, and a user receives data only about their own uploaded transcripts, not about the transcripts of other users.

3.4. Automation of language measures

Using the standardised text and the language data provided by the previous step in the analysis, the next task of the MultiDis web application is to provide users with a detailed analysis of language measures. It is important to note that the measures are currently calculated intertextually, but we plan to compare the individual results with the population results, as well as with the baseline data. The application incorporates diverse measures that can be used in language assessment, covering productivity, lexical diversity, syntactic complexity and discourse cohesion. The list of language measures included in the MultiDis web application is available in Table 1.
| Category | Measure | Description |
|---|---|---|
| Language productivity | Number of communication units (NCU) | The total number of communication units |
| | Total number of words (TNW) | The total number of tokens (repeated tokens are excluded) |
| | Number of different words (NDW) | The total number of word forms – types |
| Lexical diversity | Type-token ratio (TTR) | The total number of types divided by the total number of tokens |
| | Index of lexical diversity D* | Based on the VOCD algorithm; calculates the probability of the next token in a sequence based on an arbitrarily chosen n-token sample from the text |
| | Moving-average type-token ratio (MATTR) | Based on a window length pre-defined by the user, the text is divided into segments; the TTR is calculated for each window, and the average TTR over the segments is the MATTR |
| Syntactic complexity | Mean length of the communication unit | The total number of words divided by the total number of communication units |
| | Clausal density | The total number of main and subordinate clauses divided by the total number of communication units |
| | Mean length of clause | The total number of tokens divided by the total number of clauses |
| Discourse cohesion | Ratio of connectives | The total number of connectives divided by the total number of C-units |
| | Ratio of different connectives** | The total number of one type of connective divided by the total number of all other types of connectives in the text |

Table 1: List of language measures implemented in the MultiDis web application (* being tested; ** in the process of implementation).
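The simpler counting measures in Table 1 reduce to a few ratios over tokens, types, C-units and clauses. A minimal sketch, assuming the sample is already segmented into C-units (in MultiDis this segmentation is manual, as described below) and that the clause count per C-unit is known; the function name is illustrative, not the application's API:

```python
# Hedged sketch of the productivity and syntactic-complexity measures from
# Table 1. Assumes a tokenised sample already divided into C-units, plus the
# number of main and subordinate clauses per C-unit.

def language_measures(c_units, clause_counts):
    tokens = [t.lower() for unit in c_units for t in unit]
    types = set(tokens)
    n_cu = len(c_units)
    n_clauses = sum(clause_counts)  # main + subordinate clauses per C-unit
    return {
        "NCU": n_cu,
        "TNW": len(tokens),
        "NDW": len(types),
        "TTR": len(types) / len(tokens),
        "mean_length_of_c_unit": len(tokens) / n_cu,
        "clausal_density": n_clauses / n_cu,
        "mean_length_of_clause": len(tokens) / n_clauses,
    }

# Example: two C-units, the second consisting of two clauses.
print(language_measures(
    [["pas", "se", "igra"], ["pas", "laje", "i", "ptica", "leti"]],
    [1, 2],
))
```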
The process of automatic analysis of language measures is based on precise segmentation of C-units and clauses, as well as on the results of tokenisation and lemmatisation. Each simple sentence (e.g., The dog is playing with the frogs), each complex sentence containing a subordinate clause or a parenthetical phrase (e.g., When the dog chased the cat away, the birds were happy), and each clause of a compound sentence was considered as one C-unit (e.g., One goat is in the water and the other is grazing grass). Given the fact that we need 100% accuracy on this task, at this stage we are still in the process of developing an automatic way of detecting connectives in the text as well as clause delimiters. Thus, a user still has to manually divide the text into C-units following the above-mentioned criteria before uploading a language sample to the application. This also means that the user can change any automatically parsed C-unit. Collecting a larger amount of data will make it possible to train and apply an appropriate machine learning model to enable automatic segmentation of C-units and clauses.

At the current stage of developing the application, a user can obtain the results of all available language measures based on C-unit segmentation, as well as the morphosyntactic data and the data provided by the annotation codes. It is important to note that the MATTR measure does not have a fixed window length; instead, there is a default window size that contains 10% of the total number of tokens, and the user can manually adjust the window size. In this way, we have avoided the possibility for the results on MATTR to be the same as the results on TTR for language samples with less than 500 tokens, and we have allowed the user to define the best window size for this measure. Measure D and the number of different connectives are currently being implemented and tested before these results are made available to users. The remaining measures listed in Table 1 have been successfully implemented.
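A minimal sketch of MATTR with the default window described above (10% of the token count, user-adjustable) follows. Whether the application uses overlapping or contiguous windows is not stated here, so the sliding-window formulation of Covington and McFall (2010) below is an assumption:

```python
# Sketch of MATTR: a TTR per window, averaged over all window positions.
# The default window of 10% of the total token count is taken from the text;
# everything else is an assumption.

def mattr(tokens, window=None):
    if window is None:
        window = max(1, len(tokens) // 10)   # default: 10% of total tokens
    if window >= len(tokens):                # degenerate case: plain TTR
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)
```

The degenerate branch also shows why a fixed 500-token window would simply reproduce TTR on shorter samples, which is exactly what the adjustable default avoids.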
4. Technical specifications of the MultiDis web application
The MultiDis web application is deployed on a Croatian Academic and Research Network (CARNET) server as a monolithic Docker service. All requests are first forwarded to an Nginx service for the static files and only then to the application itself via a Gunicorn service (a Python Web Server Gateway Interface HTTP server). The application and the entire backend logic are written in the Python programming language (Van Rossum, 2020) within the Django web framework. All data is stored in a MySQL database instance on the server. As mentioned earlier, a Stanza PyTorch model (Qi et al., 2020) is run with the application to infer the language data and provide morphosyntactic information. Other open-source libraries and packages used are python-docx, NumPy and Pandas.
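A minimal sketch of such a morphosyntactic step with Stanza's Croatian models is shown below; the processor list and the example sentence are our assumptions, not the application's actual configuration:

```python
import stanza

# Sketch of the morphosyntactic step described above: a Stanza pipeline
# (Qi et al., 2020) providing tokenisation, lemmas and POS features for
# Croatian. Processor choice is an assumption on our part.
stanza.download("hr")
nlp = stanza.Pipeline("hr", processors="tokenize,pos,lemma")

doc = nlp("Pas se igra sa žabama.")
for word in doc.sentences[0].words:
    print(word.text, word.lemma, word.upos, word.feats)
```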
The application is designed so that each segment can be improved without compromising our main goals or the user's experience. In this sense, we can also include written language samples and provide new annotation codes and categories for written language, or implement measures that are only used in the analysis of adult language. Lemmatisation and POS tagging can be improved by replacing the existing model with a new, customized and open-source model that can be extended to languages other than Croatian.

5. Future extensions
The MultiDis web application is still under development, but the current version fulfils its main purpose – it allows for (semi-)automatic analysis of spoken language, and provides users with socio-demographic and linguistic metadata as well as the history of uploaded transcripts. In addition to the implementation of a service for the automatic determination of C-units and clause boundaries, additional data will be made available to users, such as the analysis of Croatian dialects and reference data for language measures, at least for some populations and some text types. Several other options are also being considered, such as fully automatic parsing of the original language sample without the manual annotation codes and an experimental speech-to-text service. As the tools and resources used to develop this application are also available for other languages, the application could be scaled for multilingual analysis, preferably in collaboration with other researchers.

6. Conclusion
The MultiDis web application is freely available at http://www.multidis.com.hr/ and can be used by linguists, speech and language pathologists, teachers etc., to assess the language abilities of both children and adult speakers of Croatian. It can help clinicians and educators in language sample analysis by resolving some of the main obstacles to its use. A simpler coding system fosters transcription, and the future development of speech-to-text could ease this process even further. Automatic lemmatisation and morphological tagging save time and enable more precise calculation of language measures. The language measures included in the application were selected based on previous research and adequately reflect the different aspects of the participants' language abilities. Therefore, the MultiDis web application supports its users by reducing both the transcription time and the linguistic knowledge required to technically perform the analysis.

7. Acknowledgements
This work was supported by the Croatian Science Foundation under the project entitled Multilevel approach to spoken discourse in language development
(UIP-2017-05-6603), by the Arts and Humanities Research Council under the project entitled Feast and Famine Project: Confronting Overabundance and Defectivity in Language (AH/T002859/1), and by the COST Action under the project NexusLinguarum – European network for Web-centred linguistic data science (CA18209). Sara Košutar was supported by the project Young Researchers' Career Development Project – Training of New Doctoral Students. Any opinions, findings, conclusions, or recommendations presented in this manuscript are those of the author(s) and do not necessarily reflect the views of the Croatian Science Foundation.
8. References
Ingrida Balčiūnienė and Aleksandr N. Kornev. 2019. Evaluation of narrative skills in language-impaired children. Advantages of a dynamic approach. In: E. Aguilar-Mediavilla, L. Buil-Legaz, R. López-Penadés, V. A. Sanchez-Azanza and D. Adrover-Roig, eds., Atypical Language Development in Romance Languages, pages 127–414. John Benjamins Publishing Company, Amsterdam and Philadelphia.
Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.
Bastien de Clercq and Alex Housen. 2017. A Cross-Linguistic Perspective on Syntactic Complexity in L2 Development: Syntactic Elaboration and Diversity. The Modern Language Journal, 101(2):315–334.
Fredrik deBoer. 2014. Evaluating the comparability of two measures of lexical diversity. System, 47:139–145.
Tomaž Erjavec. 2010. MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 2544–2547, Valletta, Malta.
Gerasimos Fergadiotis, Heather Harris Wright and Samuel B. Green. 2015. Psychometric Evaluation of Lexical Diversity Indices: Assessing Length Effects. Journal of Speech, Language, and Hearing Research, 58(3):840–852.
Natalia Gagarina, Daleen Klop, Sari Kunnari, Koula Tantele, Taina Välimaa, Ingrida Balčiūnienė, Ute Bohnacker, and Joel Walters. 2012. MAIN: Multilingual assessment instrument for narratives. ZAS Papers in Linguistics, 56:1–155.
Natalia Gagarina, Daleen Klop, Sari Kunnari, Koula Tantele, Taina Välimaa, Ute Bohnacker, and Joel Walters. 2019. MAIN: Multilingual Assessment Instrument for Narratives – Revised. ZAS Papers in Linguistics, 63:1–21.
Natalia Gagarina, Daleen Klop, Ianthi M. Tsimpli, and Joel Walters. 2016. Narrative abilities in bilingual children. Applied Psycholinguistics, 37(1):11–17.
John J. Heilmann. 2010. Myths and Realities of Language Sample Analysis. Perspectives on Language Learning and Education, 17(1):4–8.
Gordana Hržica, Sara Košutar, and Matea Kramarić. 2019. Rječnička raznolikost pisanih tekstova osoba s razvojnim jezičnim poremećajem [Lexical diversity in written texts of persons with developmental language disorder]. Hrvatska Revija za Rehabilitacijska Istraživanja, 55(2):14–30.
Gordana Hržica and Jelena Kuvač Kraljević. 2012. MAIN – hrvatska inačica: Višejezični instrument za ispitivanje pripovijedanja [MAIN – Croatian version: Multilingual Assessment Instrument for Narratives]. ZAS Papers in Linguistics, 56:201–218.
Gordana Hržica and Jelena Kuvač Kraljević. 2020. The Croatian adaptation of the Multilingual Assessment Instrument for Narratives. ZAS Papers in Linguistics, 64:37–44.
Gordana Hržica and Maja Roch. 2021. Lexical diversity in bilingual speakers of Croatian and Italian. In: S. Armon-Lotem and K. K. Grohmann, eds., LITMUS in Action: Cross comparison studies across Europe, pages 100–129. Trends in Language Acquisition Research (TILAR), John Benjamins Publishing Company, Amsterdam.
Laura M. Justice, Ryan P. Bowles, Joan N. Kaderavek, Teresa A. Ukrainetz, Sarita L. Eisenberg, and Ronald B. Gillam. 2006. The Index of Narrative Microstructure: A Clinical Tool for Analyzing School-Age Children's Narrative Performances. American Journal of Speech-Language Pathology, 15(2):177–191.
Maria Kapantzoglou, Gerasimos Fergadiotis, and Alejandra Auza Buenavides. 2019. Psychometric evaluation of lexical diversity indices in Spanish narrative samples from children with and without developmental language disorder. Journal of Speech, Language, and Hearing Research, 62(1):70–83.
Inge S. Klatte, Vera van Heugten, Rob Zwitserlood, and Ellen Gerrits. 2022. Language Sample Analysis in Clinical Practice: Speech-Language Pathologists' Barriers, Facilitators, and Needs. Language, Speech, and Hearing Services in Schools, 53(1):1–16.
William Labov and Joshua Waletzky. 1967. Narrative analysis: Oral versions of personal experience. In: J. Helm, ed., Essays on the verbal and visual arts, pages 3–38. University of Washington Press, Seattle and London.
Barbara J. Leadholm and Jon F. Miller. 1992. Language sample analysis: The Wisconsin guide. Wisconsin State Department of Public Instruction, Madison.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy. Association for Computational Linguistics.
Nikola Ljubešić and Davor Lauc. 2021. BERTić – The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, Kyiv, Ukraine. Association for Computational Linguistics.
Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk: Transcription format and programs (3rd ed.). Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
David Malvern and Brian Richards. 1997. A new measure of lexical diversity. In: A. Ryan and A. Wray, eds., Evolving models of language, pages 58–71. Multilingual Matters, Clevedon.
David Malvern, Brian Richards, Ngoni Chipere, and Pilar Durán. 2004. Lexical Diversity and Language Development: Quantification and Assessment. Palgrave Macmillan, London.
Mercer Mayer. 1969. Frog, where are you? Dial Press, New York.
Phillip M. McCarthy. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). PhD thesis, University of Memphis.
Danielle S. McNamara, Arthur C. Graesser, Phillip M. McCarthy, and Zhiqiang Cai. 2014. Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press, New York.
Jon M. Miller. 1981. Assessing language production in children: experimental procedures. University Park Press, Baltimore.
Marilyn A. Nippold, Laura M. Vigeland, Megan W. Frantz-Kaspar, and Jeannene M. Ward-Lonergan. 2017. Language Sampling With Adolescents: Building a Normative Database With Fables. American Journal of Speech-Language Pathology, 26(3):908–920.
John M. Norris and Lourdes Ortega. 2009. Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30(4):555–578.
Lourdes Ortega. 2003. Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24(4):492–518.
Mollee J. Pezold, Caitlin M. Imgrund, and Holly L. Storkel. 2020. Using Computer Programs for Language Sample Analysis. Language, Speech, and Hearing Services in Schools, 51(1):103–114.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Stroudsburg, PA. Association for Computational Linguistics.
Mabel L. Rice, Filip Smolik, Denise Perpich, Travis Thompson, Nathan Rytting, and Megan Blossom. 2010. Mean length of utterance levels in 6-month intervals for children 3 to 9 years with and without language impairments. Journal of Speech, Language, and Hearing Research, 53(2):333–349.
Tanja Samardžić and Nikola Ljubešić. 2021. Data Collection and Representation for Similar Languages, Varieties and Dialects. In: M. Zampieri and P. Nakov, eds., Similar Languages, Varieties, and Dialects: A Computational Perspective, Studies in Natural Language Processing, pages 121–137. Cambridge University Press, Cambridge.
Cheryl M. Scott and Sharon L. Stokes. 1995. Measures of syntax in school-age children and adolescents. Language, Speech, and Hearing Services in Schools, 26(4):309–319.
Ida J. Stockman. 2010. Listener reliability in assigning utterance boundaries in children's spontaneous speech. Applied Psycholinguistics, 31(3):363–395.
Mildred C. Templin. 1957. Certain language skills in children; their development and interrelationships. University of Minnesota Press, Minneapolis.
Text Inspector. 2018. Online lexis analysis tool at textinspector.com.
Martin Tutek, Filip Boltužić, Ivan Smoković, Mario Šaško, Silvije Škudar, Domagoj Pluščec, Marin Kačan, Dunja Vesinger, Mate Mijolović, and Jan Šnajder. 2021. Podium: a framework-agnostic NLP preprocessing toolkit. GitHub repository. https://github.com/TakeLab/podium
Guido Van Rossum. 2020. The Python Library Reference, release 3.8.2. Python Software Foundation. https://py.mit.edu/_static/spring21/library.pdf
Marleen F. Westerveld, Gail Gillon, and Jon F. Miller. 2004. Spoken language samples of New Zealand children in conversation and narration. Advances in Speech Language Pathology, 6(4):195–208.
Danielle Wetherell, Nicola Botting, and Gina Conti-Ramsden. 2007. Narrative in adolescent specific language impairment (SLI): a comparison with peers across two different narrative genres. International Journal of Language & Communication Disorders, 42(5):583–605.
Assessing Comparability of Genre Datasets
via Cross-Lingual and Cross-Dataset Experiments
Taja Kuzman† ∗, Nikola Ljubešić†, Senja Pollak†
†Department of Knowledge Technologies, Jožef Stefan Institute
taja.kuzman@ijs.si, nikola.ljubesic@ijs.si, senja.pollak@ijs.si
∗Jožef Stefan International Postgraduate School
Abstract
This article explores comparability of an English and a Slovene genre-annotated dataset via monolingual and cross-lingual experiments, performed with two Transformer models. In addition, we analyze whether translating the Slovene dataset into English with a machine translation system improves monolingual and cross-lingual performance. Results show that cross-lingual transfer is possible despite the differences between the datasets in terms of genre schemata and corpora construction methods. Furthermore, the XLM-RoBERTa model was shown to provide good results in both settings already when learning on less than 1,000 instances. In contrast, the trilingual CroSloEngual BERT model was revealed to be less suitable for this text classification task. Moreover, the results reveal that although the English dataset is 40 times larger than the Slovene dataset, it provides similar or worse classification results.
1. Introduction
Texts in datasets can be grouped by genres based on their common function, form and the author's purpose (Orlikowski and Yates, 1994). Labeling texts with genres allows for a deeper insight into the composition and quality of a web corpus that was collected with automatic means, more efficient queries in information retrieval tools (Vidulin et al., 2007), as well as improvements of various language technology tasks, such as part-of-speech tagging (Giesbrecht and Evert, 2009) and machine translation (Van der Wees et al., 2018). That is why automatic genre identification (AGI) has been the subject of numerous studies in the computational linguistics and information retrieval fields (e.g., see Egbert et al. (2015), Sharoff (2018)).

As in other text classification tasks, a large manually annotated dataset is required in AGI in order to train and test a classifier. While there exist some large English genre-annotated datasets, such as the Corpus of Online Registers of English (CORE) (Egbert et al., 2015) with 53,000 texts and the Leeds Web Genre Corpus (Asheghi et al., 2016) with 5,000 texts, for other languages there is either no dataset or mostly a small one, consisting of 1,000 to 2,000 texts, such as the genre-annotated corpora for Russian (Sharoff, 2018), Finnish (Laippala et al., 2019), and Swedish and French (Repo et al., 2021). This means that for obtaining a large dataset needed for genre identification of other languages, costly and time-consuming annotation campaigns are still needed, leaving most languages under-resourced in regard to the technologies based on AGI.

However, it might be possible to overcome this obstacle by leveraging cross-lingual transfer, applying models trained on high-resource languages to low-resource languages. Recently, Repo et al. (2021) showed that it is possible to achieve good levels of cross-lingual transfer in AGI experiments. They performed experiments in zero-shot cross-lingual automatic genre identification by training multilingual Transformer-based models on the English CORE corpus (Egbert et al., 2015) and testing them on smaller Finnish, Swedish, and French datasets. Rönnqvist et al. (2021) extended this research, training the models on a multilingual dataset, created from the four corpora, which further improved the results.

These promising results stimulated the creation of genre-annotated datasets for other languages, and for Slovene, the web genre identification corpus GINCO 1.0 (Kuzman et al., 2021) was created. Its genre schema was based on the CORE schema with the possibility of cross-lingual experiments in mind (see Kuzman et al. (2022)). However, a linguistic analysis of the categories (Biber and Egbert, 2018) and a low inter-annotator agreement, reported by Egbert et al. (2015) and Sharoff (2018), revealed some shortcomings of the CORE schema that could impact the reliability of the dataset. Thus, Kuzman et al. (2022) diverged from the original schema when annotating GINCO, striving towards a more reliably annotated dataset. In addition to this, the CORE and GINCO datasets were created following different corpora collection and annotation approaches (see Section 3.1.). Due to these differences, it remained unclear whether the datasets are comparable enough to allow cross-lingual transfer, which would eliminate the need for extensive annotation campaigns for Slovene and other under-resourced languages of interest. This article provides a first insight into this, exploring the comparability of the two datasets through cross-dataset and cross-lingual experiments.

2. Goal of the Paper
This paper analyzes the comparability of two genre-annotated datasets, the Corpus of Online Registers of English (CORE) (Egbert et al., 2015) and the Slovene web genre identification corpus GINCO 1.0 (Kuzman et al., 2021). We perform cross-dataset and cross-lingual automatic genre identification experiments to address the main research question (Q1): Is the CORE dataset comparable enough to the GINCO dataset to provide good cross-lingual transfer, as achieved by Repo et al. (2021), who
used comparably encoded Finnish, Swedish and French datasets?

To compare the corpora and to analyze their usefulness for monolingual as well as cross-lingual automatic genre identification, first, the labels from both corpora were mapped to a joint schema, the GINCORE schema. Then, multilingual pre-trained Transformer-based models were trained on the English CORE dataset with GINCORE labels (EN-GINCORE), the Slovene GINCO dataset with GINCORE labels (SL-GINCORE) and the SL-GINCORE dataset that was machine translated into English (MT-GINCORE). We conduct 1) monolingual in-dataset AGI experiments, training and testing on the same dataset, and 2) cross-lingual and cross-dataset AGI experiments, training on one dataset and testing on the other. The machine-translated dataset is added to the comparison to explore two additional research questions: Q2) In monolingual in-dataset experiments, do multilingual models, which were pre-trained on more English than Slovene data, perform differently on the Slovene dataset (SL-GINCORE) than on a Slovene dataset machine-translated to English (MT-GINCORE)? and Q3) In cross-lingual cross-dataset experiments, does translating the training data (MT-GINCORE) into the language of the test data (EN-GINCORE) provide better results than using training and testing data in different languages (SL-GINCORE and EN-GINCORE)?

The experiments were performed with two multilingual Transformer-based pre-trained language models, the massively multilingual XLM-RoBERTa model (Conneau et al., 2020) and the trilingual Croatian-Slovene-English CroSloEngual BERT model (Ulčar and Robnik-Šikonja, 2020). This provides an answer to the fourth research question (Q4): Does CroSloEngual BERT, pre-trained on a smaller number of languages, perform better in the cross-lingual AGI experiments than the massively multilingual XLM-RoBERTa model?

3. Data Preparation
3.1. Original Datasets
In this research, three datasets were used: the Corpus of Online Registers of English (CORE) (Egbert et al., 2015), the Slovene web genre identification corpus GINCO 1.0 (Kuzman et al., 2021) and the GINCO 1.0 corpus, machine translated to English.

The CORE corpus consists of web texts that were extracted from the "General" part of the Corpus of Global Web-based English (GloWbE) (Davies and Fuchs, 2015). The GloWbE corpus was collected via Google searches with high-frequency English 3-grams as the queries (Davies and Fuchs, 2015). After obtaining the texts, further cleaning was performed; more specifically, the boilerplate was removed with the Justext tool (Pomikálek, 2011).

The CORE corpus was annotated based on a hierarchical schema which consists of 8 main genre categories, such as Narrative, Opinion, Spoken, and 54 subcategories, e.g., News Report/Blog, Instruction, Travel Blog, Magazine Article. The annotation was single-label, i.e., each annotator, recruited through a crowd-sourcing platform, could assign one main category and one subcategory to a text. However, as each text was annotated by four annotators, this means that it can have up to four labels. The corpus that we obtained from the authors and used in this research consists of 48,415 texts, labeled with 8 main categories and 47 subcategories. The corpus was further cleaned by removing duplicated texts and texts with more than one assigned label, resulting in 41,502 texts.

The GINCO corpus (Kuzman et al., 2022) consists of a random sample of web texts from two Slovene web corpora, the slWaC 2.0 corpus (Erjavec and Ljubešić, 2014) from 2014 and the MaCoCu-sl 1.0 corpus (Bañón et al., 2022) from 2021. Both web corpora were created by crawling the Slovene top-level domain and some generic domains that are interlinked with the national domain. As in GloWbE, the boilerplate was removed with the Justext tool (Pomikálek, 2011). The GINCO corpus consists of two parts: the "suitable" part, annotated with genres, and the "not suitable" part, consisting of texts not suitable for genre annotation, such as texts in other languages, machine-translated texts etc. In this research, only the suitable part, consisting of 1,002 texts, was used.

For the annotation, the GINCO schema was used, consisting of 24 labels, e.g., News/Reporting, Opinion/Argumentation, Promotion of a Product. The schema is based on the subcategory level of the CORE schema and on other schemata from previous genre studies. The texts were annotated by two annotators with a background in linguistics. In case of disagreement, final labels were determined at frequent meetings. Multi-label annotation was allowed, i.e., each text could be annotated with up to three classes, which were ordered according to their prevalence in the text as a primary, secondary and tertiary label. However, in these experiments, only the primary labels are used. Each paragraph in the texts is accompanied with metadata (attribute keep) with information on whether it was manually identified to be a part of the main text and thus useful for the annotation. In this research, paragraphs not deemed to be useful were discarded.

The machine-translated GINCO corpus (MT-GINCO) was created by translating the Slovene GINCO 1.0 to English with the DeepL machine translation system.1 The system is stated by its developers to be "3x more accurate" than its closest competitors, i.e., Google Translate, Amazon Translate and Microsoft Translator, based on internal blind tests (DeepL, n.d.). DeepL was confirmed to outperform Google Translate also in an independent study by Yulianto and Supriatnaningsih (2021). The GINCO corpus was translated into British English, as this variety seems to be more frequent than American English in the general part of the GloWbE corpus on which the CORE corpus is based (GloWbE, n.d.). The prevalence of the British variety in the CORE corpus was also confirmed with a lexicon-based American-British-variety Classifier (Rupnik et al., 2022), which identified 40% of texts as British and 25% as American, while the rest contain a mixture of both varieties or no signal for either of them.

1 https://www.deepl.com/translator
Figure 1: The differences between the distributions of GINCORE labels in (a) the GINCO corpora SL-GINCORE and MT-GINCORE, and (b) the EN-GINCORE (CORE corpus).

3.2. GINCORE Schema
To be able to perform cross-dataset experiments, the CORE and GINCO schemata were mapped to a joint schema – the GINCORE schema. The schemata were mapped based on descriptions of categories in previous research, in the annotation guidelines for GINCO2 and the guidelines for CORE, created for the needs of the annotation of Finnish, French and Swedish corpora using the CORE schema3 in further research (Laippala et al., 2019; Laippala et al., 2020). Furthermore, a manual inspection of instances from the GINCO and CORE corpora was performed to analyze to which extent the annotations in the corpora match the guidelines. The basis of the GINCORE schema was the GINCO schema, as it was shown to provide more reliable annotation than CORE (see Kuzman et al. (2022)). Moreover, it is easier to map 54 CORE subcategories with a very high granularity to 24 broader GINCO categories than vice versa. The CORE schema consists of broad main categories and more specific subcategories. As the GINCO schema was based on the subcategories of the CORE schema, the subcategory level was used for the mapping from CORE to GINCORE.

Some genre categories in both schemata are identical and can be directly mapped, namely Recipe, Review, Interview and Legal/Regulation. As the GINCO and CORE schemata differ in granularity, broader GINCORE labels were created which efficiently cover categories from both schemata. Some CORE categories were not included in the mapping, because a) these labels revealed to be very infrequent and there is no sufficient information about them available, or b) the labels were too broad or problematic for annotators and as a result include instances that are too heterogeneous and cannot be mapped to just one GINCORE label. The resulting GINCORE schema4 covers 43 CORE subcategories and all 24 GINCO categories by using 20 labels: 15 labels that are present in both corpora, and 5 labels newly introduced by the GINCO schema and thus present only in the GINCO dataset.

2 The guidelines for GINCO are available here: https://tajakuzman.github.io/GINCO-Genre-Annotation-Guidelines/.
3 The guidelines for the annotation campaigns using the CORE schema are available here: https://turkunlp.org/register-annotation-docs/.
4 The final table with all the GINCORE mappings is available here: https://tajakuzman.github.io/GINCO-Genre-Annotation-Guidelines/genre_pages/GINCORE_mapping.html.
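As an illustration, a mapping of this kind can be represented as a simple dictionary from CORE subcategories to GINCORE labels. Only the four directly mapped categories named above are taken from the text; the one marked hypothetical is invented for the example, and any complete mapping should follow the published GINCORE mapping table (footnote 4):

```python
# Illustrative sketch of the CORE-subcategory -> GINCORE mapping. Only the
# four directly mapped categories are taken from the text; a fuller mapping
# should follow the published GINCORE mapping table (footnote 4).
CORE_TO_GINCORE = {
    "Recipe": "Recipe",
    "Review": "Review",
    "Interview": "Interview",
    "Legal terms": "Legal/Regulation",  # hypothetical CORE subcategory name
}

def to_gincore(core_subcategory):
    """Return the GINCORE label, or None if the subcategory is unmapped."""
    return CORE_TO_GINCORE.get(core_subcategory)
```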
3.3. GINCORE Datasets
For the purpose of performing cross-dataset experiments, only the GINCORE classes that have more than 5 instances in each of the datasets were used, resulting in a smaller set of 12 GINCORE labels: News, Forum, Opinion/Argumentation, Review, Research Article, Information/Explanation, Promotion, Instruction, Prose, Interview, Legal/Regulation, and Recipe. The texts annotated with other GINCORE labels were not included in the experiments. Thus, the final datasets are slightly smaller:

• the English CORE dataset with 12 GINCORE labels, henceforth referred to as the English GINCORE dataset (EN-GINCORE), consists of 33,918 texts;
• the Slovene GINCO dataset with 12 GINCORE labels, henceforth referred to as the Slovene GINCORE dataset (SL-GINCORE), consists of 810 texts;
• the machine-translated English GINCO dataset with 12 GINCORE labels, henceforth referred to as the Machine-Translated GINCORE dataset (MT-GINCORE), consists of 810 texts.

The text instances were not pre-processed, i.e., each instance is a running text as it was extracted from the original web page, from which the boilerplate and HTML tags were removed. In the GINCO datasets (SL-GINCORE and MT-GINCORE), the texts consist of paragraphs, which is indicated by a paragraph tag, while in the CORE dataset (EN-GINCORE), the partitioning into paragraphs is not preserved. In addition to this, the datasets differ significantly in terms of the length of the texts. In the CORE dataset, the median length is 649 words, while the minimum and maximum text lengths are 52 words and 118,278 words respectively. In the GINCO datasets, most texts are significantly shorter, with a median length of 198 words, a minimum length of 12 words and a maximum length of 4,134 words. As the Transformer models, used in the experiments, can
process a maximum instance length of 512 tokens, this means that while the models will in most cases be trained on complete texts from the GINCO datasets, more than half of the texts from the CORE dataset will not be used in their entirety and the models will be trained only on the first part of these instances.

Here, it should also be noted that the CORE dataset and the GINCO datasets are characterized by a different distribution of GINCORE classes. The frequency of some classes, such as Promotion, differs significantly, as can be seen in Figure 1.

4. Machine Learning Experiments
4.1. Models
Experiments were performed with Transformer-based pre-trained language models, which were shown to perform well in the automatic genre identification task in a monolingual as well as a cross-lingual setting (Repo et al., 2021). More specifically, two models were used: the base-sized massively multilingual XLM-RoBERTa model (Conneau et al., 2020) and the trilingual Croatian-Slovene-English CroSloEngual BERT model (Ulčar and Robnik-Šikonja, 2020). The XLM-RoBERTa model was chosen because it was revealed to be the best performing model in cross-lingual automatic genre identification based on the CORE dataset (Repo et al., 2021), and to be comparable to the Slovene monolingual model SloBERTa (Ulčar and Robnik-Šikonja, 2021) in experiments performed on GINCO (Kuzman et al., 2022). The CroSloEngual BERT model was revealed to achieve results comparable to the XLM-RoBERTa model, or to even outperform the latter, in common monolingual and cross-lingual NLP tasks (Ulčar et al., 2021). Thus, it was included in these experiments to explore whether it achieves similar results on the AGI task as well.

4.2. Experimental Setup
The datasets were split into 60:20:20 train, dev and test splits, stratified according to the label distribution. The models were trained on the train split, consisting of 20,350 texts in the case of EN-GINCORE and of 486 texts in the case of SL-GINCORE and MT-GINCORE, and tested on the test split, i.e., 6,784 texts in the case of EN-GINCORE and 162 texts in the case of SL-GINCORE and MT-GINCORE. The dev split, which is of the same size as the test split, was used for the hyperparameter optimization. When splitting the datasets, it was assured that the splits of SL-GINCORE and MT-GINCORE contain the same instances, so that they differ only in the language of the content.
the content.
XLM-RoBERTa for automatic genre identification.
The Transformer models are available at the Hugging
Among all monolingual experiments, the best micro and
Face repository and were trained using the Simple Trans-
macro F1 results were achieved when the XLM-RoBERTa
formers library. To find the optimal number of epochs and
was trained and tested on the machine-translated MT-
the learning rate, the hyperparameter search was performed
GINCO dataset, reaching average micro and macro F1
separately for CroSloEngual BERT and XLM-RoBERTa.
scores of 0.81 and 0.84 respectively. At the same time, the
The maximum sequence length was set to 512 tokens and
other hyperparameters were set to default values. As the
5The code for data preparation and machine learning
EN-GINCORE dataset is more than 40 times larger than
experiments is available here:
https://github.com/
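Under this setup, training with the Simple Transformers library can be sketched as follows; the DataFrame variables and any argument values beyond those reported above are assumptions:

```python
from simpletransformers.classification import ClassificationModel

# Sketch of the training setup described above: base-sized XLM-RoBERTa,
# maximum sequence length of 512, learning rate 10^-5, 9 epochs for the
# EN-GINCORE train split. `train_df` is assumed to be a pandas DataFrame
# with "text" and "labels" columns, as expected by Simple Transformers.
model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",          # for CroSloEngual BERT: ("bert", "EMBEDDIA/crosloengual-bert")
    num_labels=12,               # the 12 GINCORE labels
    args={
        "max_seq_length": 512,
        "learning_rate": 1e-5,
        "num_train_epochs": 9,   # 60 for SL-/MT-GINCORE with XLM-RoBERTa
    },
)
model.train_model(train_df)
predictions, raw_outputs = model.predict(test_df["text"].tolist())
```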
We performed monolingual in-dataset experiments and cross-lingual cross-dataset experiments.5 The monolingual experiments, described in Section 4.3.1., are in-dataset experiments, which means that the models were trained and tested on splits from the same dataset. In contrast to this, in the cross-dataset experiments, presented in Section 4.3.2., the models are trained on one dataset and tested on the other. At the same time, these experiments are cross-lingual, as the original datasets are in different languages.

Three runs of each experiment were performed and average results are reported. The models used in the monolingual and cross-lingual setups were evaluated via micro F1 and macro F1 scores, to measure the instance-level and the label-level performance.

5 The code for the data preparation and machine learning experiments is available here: https://github.com/TajaKuzman/Cross-Lingual-and-Cross-Dataset-Experiments-with-Genre-Datasets.
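The evaluation protocol, micro and macro F1 against the majority-class baseline, can be sketched as follows; the variable names carry over from the sketches above and are illustrative:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Sketch of the evaluation: micro F1 (instance-level) and macro F1
# (label-level), plus the majority-class dummy baseline.
def evaluate(y_true, y_pred):
    return {
        "micro F1": f1_score(y_true, y_pred, average="micro"),
        "macro F1": f1_score(y_true, y_pred, average="macro"),
    }

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train_df[["text"]], train_df["labels"])
print(evaluate(test_df["labels"], baseline.predict(test_df[["text"]])))
print(evaluate(test_df["labels"], predictions))   # Transformer predictions
```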
4.3. Results
4.3.1. Monolingual In-dataset Experiments
First, the datasets are compared via monolingual in-dataset experiments, where the models were trained and tested on splits of the same dataset. In addition to this, a dummy classifier which predicts the majority class was implemented as an illustration of the lower bound. The results, presented in Table 1, show that the mapping of the original labels into a joint schema was successful and that it is possible to achieve good results when learning Transformer models on GINCORE datasets. The Transformer models are shown to be very effective at this task, achieving micro and macro F1 scores that are higher than the scores of the dummy model by at least 30 points. XLM-RoBERTa, which was revealed to be the best performing model, achieved relatively high results, with micro and macro F1 scores ranging between 0.72 and 0.84, even when trained on the two smaller datasets, which consist of less than 1,000 instances.

The results show that in a monolingual setting, the massively multilingual XLM-RoBERTa model outperforms the trilingual CroSloEngual BERT model. While Ulčar et al. (2021) showed that the trilingual model is comparable to the XLM-RoBERTa model at NLP tasks which are focused on the classification of words or multiword units, such as named-entity recognition and part-of-speech tagging, these results reveal that CroSloEngual BERT is not as suitable as XLM-RoBERTa for automatic genre identification.

Among all monolingual experiments, the best micro and macro F1 results were achieved when XLM-RoBERTa was trained and tested on the machine-translated MT-GINCORE dataset, reaching average micro and macro F1 scores of 0.81 and 0.84 respectively.
| Datasets | | Majority classifier | | XLM-RoBERTa | | CroSloEngual BERT | |
| Trained on | Tested on | Micro F1 | Macro F1 | Micro F1 | Macro F1 | Micro F1 | Macro F1 |
| SL-GINCORE | SL-GINCORE | 0.259 | 0.027 | 0.782±0.02 | 0.725±0.01 | 0.738±0.01 | 0.599±0.06 |
| MT-GINCORE | MT-GINCORE | 0.259 | 0.027 | 0.807±0.01 | 0.841±0.03 | 0.714±0.00 | 0.501±0.05 |
| EN-GINCORE | EN-GINCORE | 0.363 | 0.036 | 0.768±0.00 | 0.715±0.00 | 0.761±0.00 | 0.706±0.00 |
| SL-GINCORE | EN-GINCORE | 0.029 | 0.004 | 0.639±0.01 | 0.539±0.01 | 0.547±0.02 | 0.391±0.02 |
| MT-GINCORE | EN-GINCORE | 0.029 | 0.004 | 0.625±0.01 | 0.521±0.01 | 0.585±0.01 | 0.409±0.01 |
| EN-GINCORE | SL-GINCORE | 0.253 | 0.027 | 0.603±0.02 | 0.575±0.03 | 0.566±0.02 | 0.510±0.03 |
| EN-GINCORE | MT-GINCORE | 0.253 | 0.027 | 0.630±0.02 | 0.663±0.03 | 0.630±0.01 | 0.543±0.01 |

Table 1: Results of monolingual and cross-lingual experiments performed with the XLM-RoBERTa and CroSloEngual BERT models, reported via micro and macro F1 scores (averaged over three runs). As a baseline, the scores of a majority classifier are added. The best scores for each of the two Transformer models in each of the two setups (in-dataset experiments and cross-dataset experiments) are shown in bold.
At the same time, the lowest scores, i.e., a micro F1 of 0.71 and a macro F1 of 0.50, were obtained on the same dataset in combination with CroSloEngual BERT. Similarly, while XLM-RoBERTa achieved the worst results when trained and tested on EN-GINCORE, CroSloEngual BERT achieved the best results on this dataset. The difference between the results on the same datasets shows the importance of analyzing the output of multiple models before reaching any conclusion regarding the datasets – if only XLM-RoBERTa were used, one could assume that the EN-GINCORE dataset is less suitable for automatic genre identification experiments. However, after performing experiments with both models, we can see that no dataset consistently provides the best results.

Figure 2: F1 scores per label (averaged over three runs) in in-dataset experiments with MT-GINCORE and EN-GINCORE, performed with CroSloEngual BERT. Labels are ordered according to their frequency in the smaller of the two datasets, MT-GINCORE.

If we compare experiments performed with the same model, we can observe that the largest differences between the datasets are in terms of macro F1 scores, which are calculated on the level of labels. As shown in Figure 2, the biggest differences between the F1 scores per label occur in the case of labels that are represented by a very small number of instances in the smaller datasets, SL-GINCORE and MT-GINCORE. Half of the labels, i.e., Review, Legal/Regulation, Research Article, Interview, Recipe and Prose, are represented by only 4 instances or fewer in the SL-GINCORE and MT-GINCORE test splits. One should be aware that this means that a correct or incorrect prediction of such a small number of instances per label has a large impact on the macro F1 score. Furthermore, a correct prediction of labels with only one or two instances in the test split might happen due to chance or a similarity of texts in the train and test splits. Thus, the F1 scores of these labels are not reliable. As shown in Figure 2, in the three runs, the F1 scores of Interview and Recipe, which are represented by only 1 instance in the SL-GINCORE and MT-GINCORE test sets, were either 0 or 1, which has a large impact on the macro F1. These results also show how important it is to repeat each experiment multiple times, to ascertain the stability and reliability of the results.
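A toy computation illustrates this sensitivity; the label counts below are invented for the example and do not reproduce the actual test splits:

```python
from sklearn.metrics import f1_score

# Toy illustration of the macro-F1 instability discussed above: a label with
# a single test instance flips its F1 between 0 and 1 with one prediction,
# shifting the macro average by roughly 1/n_labels.
y_true = ["News"] * 8 + ["Forum"] * 3 + ["Recipe"]   # Recipe: 1 instance
hit    = ["News"] * 8 + ["Forum"] * 3 + ["Recipe"]
miss   = ["News"] * 8 + ["Forum"] * 3 + ["News"]

print(f1_score(y_true, hit, average="macro"))    # 1.0
print(f1_score(y_true, miss, average="macro"))   # drops to about 0.65
```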
If we compare the three datasets based on micro F1 scores, there are small differences between them, i.e., a difference of 4 points between the lowest and highest scores when XLM-RoBERTa was used and a difference of 5 points when CroSloEngual BERT was used. Interestingly, although EN-GINCORE is 40 times larger than SL-GINCORE and MT-GINCORE, it does not provide higher results than the other two datasets when the XLM-RoBERTa model is used for training. Similar results were revealed in previous work (see Repo et al. (2021)), where they performed monolingual experiments with XLM-RoBERTa on the CORE dataset and three smaller genre-annotated datasets, the Finnish FinCORE, French FreCORE and Swedish SweCORE datasets. Although the non-English datasets were annotated with the CORE schema, the annotation procedure and dataset collection methods are more similar to the GINCO approach than to CORE. Their experiments showed that XLM-RoBERTa and other Transformer models perform similarly or better when trained on datasets which consisted of 1,800 to 2,200 instances than when trained on the CORE dataset. We have two hypotheses why this is the case: 1) It might be that due to the high capacities of Transformer models, their performance on this task plateaus already at a few thousand instances, and contributing bigger datasets does not significantly improve the results. 2) Or this could indicate that
the CORE dataset is less suitable for AGI machine learning experiments. The reason for that could be that, as crowd-sourcing was used for the annotation of the dataset, the assigned labels are less reliable and the classes are consequently fuzzier. The poor reliability of the dataset was also confirmed by the low inter-annotator agreement. The authors of the dataset reported that there was no agreement between at least three of four annotators on the subcategory for 48.98% of the texts (Egbert et al., 2015). When the schema and approach were used by Sharoff (2018) on another corpus, he reported a nominal Krippendorff's alpha of 0.53 on the level of subcategories, which is below the acceptable threshold of 0.67, as defined by Krippendorff (2018). In contrast to this, the GINCO dataset was reported to achieve a Krippendorff's alpha of 0.71, confirming a much higher reliability of the annotations.

4.3.2. Cross-lingual Cross-dataset Experiments
To assess the comparability of the English CORE dataset and the Slovene GINCO dataset, we performed cross-lingual cross-dataset experiments by training the Transformer models on one dataset and testing them on another. In addition to experimenting with cross-lingual transfer from the Slovene to the English dataset and vice versa, we also explored whether translating the Slovene dataset into English with a machine translation system improves the results of cross-dataset experiments.

The results, shown in Table 1, reveal that the trilingual CroSloEngual BERT model performs worse than the massively multilingual XLM-RoBERTa model in the cross-lingual experiments, with a difference of 12 points between the highest macro F1 scores obtained by the models and a much slighter difference between the highest micro F1 scores (0.009).

In general, the results obtained in the cross-lingual experiments are significantly lower than the results from the monolingual experiments. If we compare experiments performed with XLM-RoBERTa, there are differences of 13–18 points in micro F1 and 5–32 points in macro F1 between testing the model on the same dataset as it was trained on (monolingual experiments) and on another dataset (cross-lingual experiments). In the case of CroSloEngual BERT, the differences between testing on the same dataset versus testing on the other dataset were 13–20 points in micro F1 and 9–20 points in macro F1.

Nevertheless, the XLM-RoBERTa scores, which range between 0.6–0.64 and 0.52–0.66 for micro and macro F1 respectively, are a promising indicator that cross-lingual transfer could be possible in this task for Slovene as well. Furthermore, the results are comparable to the results of cross-lingual experiments with the CORE corpora, reported by Repo et al. (2021). When they trained the XLM-RoBERTa model on the CORE corpus and tested it on the Finnish, Swedish and French datasets, annotated with the CORE schema, the micro F1 scores ranged from 0.61 to 0.69. Here it needs to be noted that they used a large-sized model, which was shown to significantly outperform the base-sized model used by us (Conneau et al., 2020), and that they used 8 labels, while we used 12. Considering this, the results of learning on CORE, mapped to the GINCORE schema, and testing on SL-GINCORE, which reached 0.60 micro F1 with the base-sized XLM-RoBERTa model, are promising, showing that mapping to the GINCORE schema gives comparable results to using the CORE schema.

To obtain a deeper insight into the comparability of the GINCO and CORE corpora, we can compare how the F1 scores per label change when we test the model on another corpus versus when we test it on the same dataset. Figure 3 shows a comparison between the F1 scores per label for in-dataset experiments with SL-GINCORE and cross-dataset experiments from SL-GINCORE to EN-GINCORE, performed with XLM-RoBERTa. An analysis of these experiments, performed with CroSloEngual BERT, confirmed that the differences between label scores occur when learning with either of the two models and do not depend on the model. The same differences in label scores were also observed in experiments where MT-GINCORE is used instead of SL-GINCORE, which indicates that the language of the dataset does not seem to have a large impact on the results per label.

As shown in Figure 3, the F1 scores for News and Opinion/Argumentation are almost the same in both setups, which shows that, in regard to these genres, the datasets are comparable enough for the model to generalize from one dataset to the other. The F1 scores are significantly lower in the cross-lingual experiments in the case of Promotion, Information/Explanation, Forum and Instruction. For the labels that are under-represented in SL-GINCORE, i.e., the labels to the right of Review in the figure, it is not possible to ascertain whether the differences between the scores are an indicator that the datasets are not comparable in regard to these labels or whether the differences occurred due to chance.

Figure 3: Comparison of average F1 scores per label between in-dataset experiments and cross-dataset experiments with XLM-RoBERTa. The models were trained on SL-GINCORE, and tested on a) SL-GINCORE (in-dataset experiments) and b) EN-GINCORE (cross-dataset experiments). Labels are ordered according to their frequency in the smaller of the datasets, SL-GINCORE.

As in the in-dataset experiments, experiments with the two Transformer models show that while one dataset combination seems to achieve the best results with one model,
it performs differently with the other model. These results once again show the importance of using multiple models on multiple datasets in the experiments, to see whether conclusions obtained from experiments with one model are still supported when using another, yet similar, model, and how the performance of the models depends on the datasets. While the results in terms of micro F1, achieved with XLM-RoBERTa, point to the conclusion that transfer from SL-GINCORE to EN-GINCORE achieves better results than the other direction, the macro F1 scores, achieved with XLM-RoBERTa, and both F1 scores, achieved with CroSloEngual BERT, show the transfer direction from English to Slovene to be better. However, although the EN-GINCORE dataset is 40 times larger than SL-GINCORE, the transfer from EN-GINCORE to SL-GINCORE does not achieve significantly higher results than the transfer in the other direction, when the Slovene dataset is used.

In addition to this, the results show that machine-translating the dataset into English can in some cases improve the results of cross-lingual experiments. In cases where the model was trained on the GINCO datasets, i.e., SL-GINCORE or MT-GINCORE, and tested on the EN-GINCORE dataset, the setup with the machine-translated text achieved slightly lower results than the setup with the original Slovene dataset, SL-GINCORE, in the case of XLM-RoBERTa, and slightly better results in the case of CroSloEngual BERT. However, when the transfer was applied in the other direction, that is, from EN-GINCORE to SL- or MT-GINCORE, machine translating the test instances from Slovene into English resulted in improvements of the macro F1 scores, achieved with XLM-RoBERTa, and of both micro and macro F1 scores, obtained with CroSloEngual BERT.

5. Conclusions
Following Repo et al. (2021), who showed that good levels of cross-lingual transfer can be achieved by training Transformer models on a large English genre dataset and applying them to datasets in other languages, the goal of this study was to explore whether it is possible to achieve similar results on the Slovene genre dataset. The results revealed to be promising, as despite using a smaller Transformer model and a different schema with more labels than previous work, the results are rather comparable, showing that the English CORE and Slovene GINCO datasets are comparable enough to allow cross-dataset experiments. The XLM-RoBERTa scores, which range between 0.6–0.64 and 0.52–0.66 in terms of micro and macro F1 scores respectively, are a promising indicator that cross-lingual transfer could be possible in the automatic genre identification task for Slovene as well. Furthermore, the high F1 scores achieved with XLM-RoBERTa in monolingual experiments show that automatic genre identification is feasible already with a very small dataset, and that using the GINCORE schema on all datasets gives good results. Moreover, despite the fact that the CORE dataset is 40 times larger than the GINCO dataset, it did not provide consistently significantly better results than the GINCO dataset in either of the setups. We plan to analyze this further by exploring what results can be achieved when smaller portions of CORE are used for training, and by extending the GINCO dataset to analyze whether this further improves the results.

As the recently developed trilingual Croatian-Slovene-English CroSloEngual model was shown to be comparable to the massively multilingual XLM-RoBERTa model in numerous NLP tasks (see Ulčar et al. (2021)), both models were used in the experiments to analyze their performance on the AGI task. The results of both monolingual and cross-lingual experiments showed that despite achieving high results in other common NLP tasks, CroSloEngual BERT seems to be less suitable than XLM-RoBERTa for automatic genre identification.

To improve the monolingual and cross-lingual results, we also experimented with translating the Slovene GINCO dataset into English, which is the main language on which the Transformer models were pre-trained. In regard to the monolingual experiments, there were no consistent results which would confirm that using an English dataset improves classification. However, when the models were trained on the English EN-GINCORE and tested on MT-GINCORE, i.e., a Slovene dataset machine-translated into English, this led to an improvement of the macro F1 scores achieved with XLM-RoBERTa, and of both micro and macro F1 scores for CroSloEngual BERT. This means that machine translating the dataset into the language of the other dataset might be beneficial in cross-lingual cross-dataset experiments.

Although the monolingual and cross-lingual experiments showed good results also when the models were trained on SL-GINCORE and MT-GINCORE, consisting of less than 1,000 instances, comparisons of the F1 scores, reported for each label in different runs and setups, showed that some labels are represented by too few instances to provide reliable results. In the future, we plan to extend the GINCO dataset to assure more reliable results and to further improve the classifiers' performance.

In addition to this, recent work by Rönnqvist et al. (2021) showed that multilingual modeling, where the model was trained on CORE datasets in various languages, resulted in significant gains over cross-lingual modeling, where the model was trained solely on the English CORE dataset. As our research revealed that the CORE and GINCO labels can be successfully mapped to a joint schema, in the future we plan to extend the experiments to multilingual modeling by training the model on a combination of all CORE datasets and the Slovene GINCO dataset.

Acknowledgments
This work has received funding from the European Union's Connecting Europe Facility 2014–2020 – CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains. This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).
6. References

Noushin Rezapour Asheghi, Serge Sharoff, and Katja Markert. 2016. Crowdsourcing for web genre annotation. Language Resources and Evaluation, 50(3):603–641.

Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. Slovene web corpus MaCoCu-sl 1.0. Slovenian language resource repository CLARIN.SI.

Douglas Biber and Jesse Egbert. 2018. Register variation online. Cambridge University Press.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.

Mark Davies and Robert Fuchs. 2015. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE). English World-Wide, 36(1):1–28.

DeepL. n.d. Why DeepL? https://www.deepl.com/en/whydeepl.

Jesse Egbert, Douglas Biber, and Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology, 66(9):1817–1831.

Tomaž Erjavec and Nikola Ljubešić. 2014. The slWaC 2.0 corpus of the Slovene web. In: T. Erjavec and J. Žganec Gros, editors, Jezikovne tehnologije: zbornik, 17:50–55.

Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German web as corpus. In: Proceedings of the Fifth Web as Corpus Workshop, pages 27–35.

GloWbE. n.d. Corpus of Global Web-Based English (GloWbE): Texts. https://www.english-corpora.org/glowbe/.

Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage Publications.

Taja Kuzman, Mojca Brglez, Peter Rupnik, and Nikola Ljubešić. 2021. Slovene web genre identification corpus GINCO 1.0. Slovenian language resource repository CLARIN.SI.

Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2022. The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild. In: Proceedings of the Language Resources and Evaluation Conference, pages 1584–1594, Marseille, France. European Language Resources Association.

Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, and Sampo Pyysalo. 2019. Toward multilingual identification of online registers. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 292–297.

Veronika Laippala, Samuel Rönnqvist, Saara Hellström, Juhani Luotolahti, Liina Repo, Anna Salmela, Valtteri Skantsi, and Sampo Pyysalo. 2020. From web crawl to clean register-annotated corpora. In: Proceedings of the 12th Web as Corpus Workshop, pages 14–22.

Wanda J. Orlikowski and JoAnne Yates. 1994. Genre repertoire: The structuring of communicative practices in organizations. Administrative Science Quarterly, pages 541–574.

Jan Pomikálek. 2011. Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Masaryk University Faculty of Informatics, Brno, Czech Republic.

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, and Veronika Laippala. 2021. Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers. In: 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021, pages 183–191. Association for Computational Linguistics (ACL).

Samuel Rönnqvist, Valtteri Skantsi, Miika Oinonen, and Veronika Laippala. 2021. Multilingual and zero-shot is closing in on monolingual web register classification. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 157–165.

Peter Rupnik, Taja Kuzman, and Nikola Ljubešić. 2022. American-British-variety Classifier. https://github.com/macocu/American-British-variety-classifier.

Serge Sharoff. 2018. Functional text dimensions for the annotation of web corpora. Corpora, 13(1):65–95.

Matej Ulčar and Marko Robnik-Šikonja. 2020. CroSloEngual BERT 1.1. Slovenian language resource repository CLARIN.SI.

Matej Ulčar and Marko Robnik-Šikonja. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. Slovenian language resource repository CLARIN.SI.

Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, and Marko Robnik-Šikonja. 2021. Evaluation of contextual embeddings on less-resourced languages. arXiv:2107.10614.

Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2018. Evaluation of machine translation performance across multiple genres and languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Vedrana Vidulin, Mitja Luštrek, and Matjaž Gams. 2007. Using genres to improve search engines. In: 1st International Workshop: Towards Genre-Enabled Search Engines: The Impact of Natural Language Processing, pages 45–51.

Ahmad Yulianto and Rina Supriatnaningsih. 2021. Google Translate vs. DeepL: A quantitative evaluation of close-language pair translation (French to English). AJELP: Asian Journal of English Language and Pedagogy, 9(2):109–127.
Slovenian Epistemic and Deontic Modals in Socially Unacceptable Discourse
Online
Jakob Lenardič,∗ Kristina Pahor de Maiti†
∗Faculty of Arts, University of Ljubljana
jakob.lenardic@ff.uni-lj.si
†CY Cergy Paris University
kristina.pahor-de-maiti@u-cergy.fr
Abstract
In this paper, we investigate the use of epistemic and deontic modal expressions in Slovenian Facebook comments. Modals are linguistic expressions that can be strategically used to fulfill the face-saving dimension of communication and to linguistically mask discriminatory discourse. We compile a list of modal expressions that have a tendency towards a single modal reading in order to enable robust corpus searches. Using this set of modals, we first show that deontic, but not epistemic, modals are significantly more frequent in socially unacceptable comments. In the qualitative part of the paper, we discuss the use of modals expressing deontic and epistemic necessity from the perspective of discourse pragmatics. We explore how the communicative strategy of face-saving interacts with personal and impersonal syntax in the case of deontic modals, and how hedging and boosting interact with irony in the case of epistemic modals.
1. Introduction

Hate speech and other forms of socially unacceptable discourse have a negative effect on society (Delgado, 2019; Gelber and McNamara, 2016). For instance, calls to action targeting specific demographics on social media have been shown to lead to offline consequences such as real-world violence (Siegel, 2020). Linguistically, socially unacceptable attitudes are often disseminated in a dissimulated form, using pragmatic markers which superficially lessen the strength of intolerant claims or violent calls to action; nevertheless, the discursive markers of such dissimulated discourse are still not well known (Lorenzi-Bailly and Guellouz, 2019), especially outside of English social media.

In this paper, we look at the use of Slovenian modal expressions as key pragmatic contributors to the dissimulation of unacceptable discourse on social media. We first look at how the use of epistemic modals, which convey the speaker's truth commitment, and the use of deontic modals, which convey how the world should or must be according to a set of contextually determined circumstances, differ between unacceptable and acceptable discourse in the case of Slovenian Facebook comments obtained from the FRENK corpus (Ljubešić et al., 2021).

We then turn to a qualitative analysis of modals conveying logical necessity. We discuss how the meaning of deontic necessity, which corresponds to some kind of obligation that needs to be fulfilled by the agent of the modalised proposition, can have a secondary pragmatic meaning that is akin to the face-saving observed with epistemic modals and that arises with syntactically impersonal modals. We then discuss how epistemic modals are used to achieve a face-saving effect, either as hedging or boosting devices or as the intensifiers of irony.

The paper is structured as follows. Section 2. presents the semantic and pragmatic properties of epistemic and deontic modals, while Section 3. presents some of the related corpus-linguistic work on modality in socially unacceptable discourse. Section 4. describes the make-up of the FRENK corpus in terms of the subtypes of socially unacceptable discourse and the criteria for the selection of the analysed modals. Section 5. presents the quantitative analysis, wherein epistemic and deontic modals are compared between the acceptable and unacceptable supersets in FRENK. Section 6. presents the qualitative analysis, where certain deontic and epistemic necessity modals are discussed in terms of their pragmatic functions. Section 7. concludes the paper.

2. Theoretical background

2.1. The semantics of epistemic and deontic modals

Modal expressions are semantic operators that interpret a prejacent proposition within the irrealis realm of possibility (Kratzer, 2012). There are two key semantic components to modals – one is the modal force, which corresponds to the logical strength of the modal expression and roughly ranges from possibility via likelihood to necessity, and the other is the type of modality,[1] according to which the evaluation of the possibility is tied to the actual world.[2]

[1] For formal semanticists viewing modals as quantifiers over possible worlds (von Fintel, 2006; Kratzer, 2012), there are actually three semantic components – modal force, modal base, and the ordering source; for ease of exposition, we conflate the modal base and ordering source under the simplified modality-type component of meaning.
[2] The italics in the examples are always our own and are used to highlight the modal under scrutiny.

There are two main types of modality – epistemic on the one hand and root on the other (Coates, 1983; Kratzer, 2012; von Fintel, 2006). Epistemic modals tie the evaluation of the possibility or necessity to the speaker's knowledge about the actual world. For instance, the possibility adverb morda in (1), taken from the FRENK corpus, has the reading which says that there is a possibility that the referents of the indefinite subject nekaj jih ("some of them")
will stay in the country. This possibility reading is epistemic, as it conveys that the speaker is not sure whether the possibility of their staying will actually turn out to be the case.

(1) [N]ekaj jih bo morda ostalo v naših krajih. [EPISTEMIC]
    "Some of them will possibly stay in our country."

Root modality, on the other hand, is not tied to the speaker's (un)certainty about the truth of the proposition. Rather, it ascribes the possibility to certain, usually unspecified, facts about the actual world. There are several subtypes of root modality, but the one we are interested in here is the deontic subtype, in which the evaluation of possibility or necessity is tied to some contextually determined authority, such as a set of rules, the law, or even the speaker (Palmer, 2001, 10). An example of a deontic modal is the verb dovoliti in example (2), again taken from FRENK. This verb also denotes possibility in terms of modal force, so the deontic possibility reading roughly translates to they should not be given the possibility (i.e., be allowed) to change our culture.

(2) [S]eveda se jim ne sme dovoliti[,] da bi spremenil naso (sic) kulturo. [DEONTIC]
    "They should not be allowed to change our culture."

Note that a single modal can have different readings in terms of modality type. This is, for instance, the case with the necessity modal morati, where the epistemic reading in (3a) conveys that the speaker is certain (i.e., epistemic necessity) that whomever they are referring to is a bonafide Slovenian. By contrast, the deontic reading in (3b) says that what needs to be necessarily done is preparing for the competition. Such readings are disambiguated contextually.

(3) a. Ta mora biti pravi Slovenec, ni dvoma. [EPISTEMIC]
       "He must be a bonafide Slovenian, no doubt about it."
    b. Pripraviti se bodo morali tudi na konkurenco, ki je zdaj še nimajo. [DEONTIC]
       "They must also prepare for the competitors which they do not have."
       (Roeder and Hansen, 2006, 163)

2.2. The pragmatics of epistemic and deontic modals

Modality expresses the speaker's subjective attitudes and opinions (Palmer, 2001), which is why the pragmatic aspects of the modalised utterance play an important role in discourse.

Epistemic modals fulfill what Halliday (1970) calls the interpersonal dimension of the utterance. In this sense, epistemic modals show the following three pragmatic uses (Coates, 1987) related to Brown et al. (1987)'s Politeness Theory. First, they are used as part of the negative politeness strategy to save the addressee's negative face, when for instance the speaker tries to facilitate open discussion by not assuming the addressee's stance on the conversational issue in advance. Second, epistemic modals can be used as an addressee-oriented positive politeness strategy, which involves the preservation of the positive image of the addressee and prevents them from feeling inferior to the speaker. Finally, they are used as part of a speaker-oriented positive politeness strategy, which involves the preservation of the positive image of the speaker by enabling the smooth withdrawal from a statement that can be perceived as a boast, threat, or similar.

Related to such politeness strategies, modals fulfil the conversational role of so-called hedging or boosting devices (Hyland, 2005). Epistemic modals function as hedges when the speaker uses them to reduce their commitment to the truth of the propositional content – i.e., to signal their hesitation or uncertainty in what is being expressed, which is a type of face-saving strategy in and of itself (Gonzálvez García, 2000; Hyland, 1998). In terms of modal force, it is weak epistemic modals denoting possibility that typically correspond to hedges, though certain necessity modals can also acquire such a function in certain contexts, as we will show in the qualitative analysis.

Strong epistemic modals, which express certainty or high commitment of the speaker to the truth of the utterance, typically function as boosters and are used by the speaker to convince his or her audience, make his or her utterance argumentatively stronger, close the dialogue for further deliberation (Vukovic, 2014), stress the common knowledge and group membership (Hyland, 2005), and so forth. Such boosters can also be used manipulatively to boost a claim that is otherwise controversial or highly particular (Vukovic, 2014).

Deontic modality also fulfils interpersonal roles in communication. Because deontic modals express notions such as obligation and permission, they have to do with negotiating social power between an authority and the discourse participant to whom the permission is granted or the obligation imposed upon (Winter and Gärdenfors, 1995). Deontic statements often involve a power imbalance between interlocutors (which is especially evident in case it is not in the interest of the agent to fulfil the obligation), so the use of deontic modals is often paired up with other pragmatic devices denoting politeness or face-saving. Politeness is thus "an overarching pragmalinguistic function that can be overtly or covertly marked in deontic and epistemic modal utterances" (Gonzálvez García, 2000, 127).

3. Related work on modality in hate speech

The linguistic and pragmatic characteristics of modality have not yet been extensively explored in the literature on online socially unacceptable discourse. One exception is the work done by Ayuningtias et al. (2021), who analyse YouTube comments related to the 2019 Christchurch mosque shootings. They find that clauses with deontic modals outnumber those with epistemic modals, and that the main discursive strategy of commenters in socially unacceptable comments is to use deontic modals to incite violent action against members of the New Zealand Muslim community.

Other corpus-linguistic studies investigate modal markers from the perspective of stance. Chiluwa (2015), for example, analyses the stance expressed in the Tweets of two radical militant groups, Boko Haram and Al Shabaab.
Among other stance-related elements, she investigates the use of hedges (including weak epistemic modals) and boosters (including strong epistemic modals). The results show that boosters are more frequent than hedges, although their overall frequency in the data was low. According to the author, the low frequency of hedges shows that radicalist discourse does not exhibit the tendency to mitigate commitment, which goes hand in hand with the slightly higher presence of boosters that are used as a rhetorical strategy to support (possibly unfounded) statements and to influence, radicalise and win over their readers by projecting assertiveness.

Another study on stance in this context is by Sindoni (2018), who looks at the verbal and multimodal construction of hate speech in British mainstream media. She analyses epistemic modal operators (among other related devices) in order to uncover the writer's stance and attitude towards the content conveyed in the news item. She finds that modality is strategically used to present the author's opinions as facts, while the opinions of others are reported as hypotheses and assumptions.

4. The FRENK corpus

4.1. Corpus make-up

For this study, we have used FRENK, a 270,000-token corpus of Slovenian Facebook comments of mostly socially unacceptable discourse (Ljubešić et al., 2019). The Facebook comments in the FRENK corpus concern two major topics – migrants, generally in the context of the 2015 European migrant crisis, and the LGBTQ community, mostly in the context of their civil rights – and are manually annotated for several different kinds of discourse.[3] The annotations distinguish whether the discourse is aimed towards a target's personal background, such as sexual orientation, race, religion, and ethnicity, or their belonging to a particular group, such as a political party. They also distinguish the type of the discourse itself, which falls into 4 broad categories, one being acceptable discourse and the others different kinds of socially unacceptable discourse (de Maiti et al., 2019, 38):

• Acceptable discourse
• Socially unacceptable discourse
  – Offensive discourse, which corresponds to abusive, threatening or defamatory speech that is targeted towards someone on the basis of their background or group participation.
  – Violent discourse, which contains threats or calls to physical violence and is often punishable by law (Fišer et al., 2017, 49).
  – Inappropriate speech, which contains offensive language but is not directed at anyone in particular.

[3] The annotations are performed on the comment level while also taking into account the features of the entire discussion thread.

Subcorpus        Tokens      %
Acceptable        92,922    34%
Offensive        143,948    53%
Inappropriate      1,471     1%
Violent            8,789     3%
Not relevant      24,572     9%
Σ                271,702   100%

Table 1: The make-up of the FRENK corpus in terms of socially (un)acceptable discourse.
For our study, we have created two subsets of comments: the acceptable subset containing comments tagged as acceptable, and the unacceptable subset containing comments tagged as offensive, violent or inappropriate. This decision is based on the frequency distributions shown in Table 1. We can observe that the FRENK subcorpora are uneven in terms of size, with the violent and inappropriate sets containing significantly fewer comments than the acceptable and offensive sets. Because violent discourse is generally less frequent than offensive discourse in linguistic corpora,[4] it is difficult to annotate automatically (Evkoski et al., 2022), so one of the crucial features of FRENK is the fact that the annotations into discourse type were done manually, employing 8 trained annotators per Facebook comment (Ljubešić et al., 2019, 9). Note that about 9% of the Facebook comments are marked as Not relevant, which refers to comments with incorrect topic classification (ibid., 5).

[4] This is also a result of the EU Code of conduct and the terms of service of social media platforms, according to which content deemed illegal due to its hateful character needs to be taken down.
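To make the grouping concrete, the following minimal sketch shows how the two supersets can be derived from the four discourse-type labels; the per-comment records and field names are entirely hypothetical stand-ins, since the paper works with the released annotated corpus rather than this toy format:

    from collections import Counter

    # Hypothetical (label, token_count) records standing in for annotated comments
    comments = [
        ("acceptable", 18), ("offensive", 25), ("violent", 9),
        ("inappropriate", 11), ("not relevant", 7),
    ]

    UNACCEPTABLE = {"offensive", "violent", "inappropriate"}  # merged superset

    tokens = Counter()
    for label, n in comments:
        if label == "acceptable":
            tokens["acceptable"] += n
        elif label in UNACCEPTABLE:
            tokens["unacceptable"] += n  # "not relevant" comments are left out

    print(tokens)  # in the real corpus: 92,922 vs. 154,208 tokens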
The latest version of the FRENK corpus, that is, version 1.1, which also includes texts in Croatian and English, is available for download from the CLARIN.SI repository (Ljubešić et al., 2021). However, the online version, which is accessible through CLARIN.SI's noSketch Engine concordancer and which we have used for the purposes of this paper,[5] is not yet available to the public.

[5] https://www.clarin.si/noske
4.2. The modals analysed in the study

Table 2 shows that there are 12 modal expressions used in the study. We have selected the modals using the following two criteria.

The first criterion is the modal's tendency towards a single modal reading. As discussed in Section 2.1., modals are in principle ambiguous in terms of their modality type. However, corpus data show that certain modals have an overwhelming preference for a single reading; for instance, while the modal auxiliary morati can theoretically have both the epistemic and the deontic interpretations (Roeder and Hansen, 2006, 162–163), as was shown in (3), the epistemic reading (3a) is actually extremely rare in attested usage, and in the case of the FRENK corpus completely non-existent.[6] Similarly, whenever the adverb naj is used in the indicative rather than the conditional mood (glossed with the subscript IND in Tables 2 and 4), its meaning is always some shade of the deontic reading (command, wish, etc.). Thus, all the modals in Table 2 are either unambiguously deontic or unambiguously epistemic, so they function as a robust set for testing how deontic and epistemic modality manifests itself in different types of discourse, without confounding examples with unintended interpretations.

[6] The frequency counts were performed on lemmas, as this is sufficient for distinguishing the part of speech as well; for instance, the lemma mogoče corresponds to the adverbial forms, whereas the lemma mogoč corresponds to the adjectival ones; however, the adjectival form when used predicatively is consistently ambiguous between the non-epistemic and epistemic interpretations; see Lenardič and Fišer (2021) for discussion and examples.

Second, some lexemes known to convey modal interpretations also frequently occur with a superficially similar propositional meaning that, however, is not modal. Such is the adverb itak, as in example (4), also taken from FRENK.

(4) Krscanstvo pa itak izvira iz istih krajev kot islam in juduizem (sic).
    "Of course, Christianity comes from the same place as Islam and Judaism."

This adverb differs from e.g. the certainty adverb zagotovo in that it does not convey the speaker's degree of certainty,[7] but rather simply intensifies whatever he or she knows to be actually the case (here, the historical-geographic source of Christianity). Because such non-modal readings are usually as frequent as the modal meaning in attested usage, we have omitted such lexemes from our study.

[7] Zagotovo has the synonym gotovo; we have excluded it from our overview because it is too frequently used in the non-modal sense, which is mostly typical of non-standard Slovenian, as in the following example: Postrelit in gotovo. "Shoot them all – that's the end of it."

Lastly, note that in terms of part of speech, the modals in Table 2 do not constitute a syntactically homogeneous set. While most modals are syntactically adverbs (e.g., morda, ziher), some are verbs selecting for finite clausal complements, such as dovoliti in (2), verbs selecting for non-finite complements, such as morati in (3), and predicative adjectives (of the syntactic frame It is necessary to) selecting for non-finite complements, such as treba (see the examples in Section 6.1.). However, such syntactic differences have no bearing on the modal interpretation – in all cases, the modals remain sentential operators that take semantic scope over the proposition denoted by the clause.

Modal      Syntax      Modality    Force         AF
naj_IND    Adverb      Deontic     Likelihood    886
morati     Verb        Deontic     Necessity     489
treba      Adjective   Deontic     Necessity     306
smeti      Verb        Deontic     Possibility   150
verjetno   Adverb      Epistemic   Likelihood    123
mogoče     Adverb      Epistemic   Possibility    92
dovoliti   Verb        Deontic     Possibility    55
morda      Adverb      Epistemic   Possibility    46
najbrž     Adverb      Epistemic   Likelihood     29
ziher      Adverb      Epistemic   Necessity      25
zagotovo   Adverb      Epistemic   Necessity      16
potrebno   Adjective   Deontic     Necessity       4
Σ                                               2,221

Table 2: The analysed modals; AF stands for absolute frequency.
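Since the counts above are lemma-based (see note 6) and the corpus was queried through the noSketch Engine concordancer, such searches can be expressed as CQL queries over the lemma attribute. The following illustration uses standard CQL conventions; the exact query strings used for this study are not given in the paper, so these are assumptions:

    # Illustrative CQL queries over the lemma attribute, as supported
    # by (no)Sketch Engine; the attribute name follows the CQL default.
    queries = {
        "treba":  '[lemma="treba"]',   # impersonal deontic necessity adjective
        "morati": '[lemma="morati"]',  # personal deontic necessity verb
        "morda":  '[lemma="morda"]',   # epistemic possibility adverb
        "ziher":  '[lemma="ziher"]',   # non-standard epistemic necessity adverb
    }
    for modal, cql in queries.items():
        print(f"{modal}: {cql}")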
           Acceptable           Unacceptable
Modal      AF       RF          AF       RF         A/U    U/A
verjetno    52     559.6         66     428.0       1.3    0.8
morda       24     258.3         19     123.2       2.1    0.5
mogoče      29     312.1         55     356.7       0.9    1.1
najbrž      12     129.1         13      84.3       1.5    0.7
zagotovo     3      32.3         13      84.3       0.4    2.6
ziher        8      86.0         15      97.3       0.9    1.1
Σ          128   1,377.4        181   1,173.7       1.2    0.9

Table 3: The distribution of epistemic modals in the FRENK corpus; AF stands for absolute frequency and RF for relative frequency, normalised to a million tokens.

5. Quantitative Analysis

Tables 3 and 4 show how the Slovenian modals are distributed between the acceptable and unacceptable subsets for the unambiguously epistemic and deontic modals, respectively. The unacceptable subset brings together the three subtypes – offensive, inappropriate, and violent – introduced in Section 4.1. The acceptable and unacceptable sets contain 92,922 and 154,208 tokens, respectively.
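For reference, the relative frequencies (RF) in Tables 3 and 4 are plain per-million normalisations of the absolute frequencies over the respective subset sizes; a minimal sketch reproducing two cells of Table 3:

    def relative_frequency(af, corpus_tokens, per=1_000_000):
        """Normalise an absolute frequency to occurrences per million tokens."""
        return af / corpus_tokens * per

    # verjetno: 52 hits in the 92,922-token acceptable subset,
    # 66 hits in the 154,208-token unacceptable subset (cf. Table 3)
    print(round(relative_frequency(52, 92_922), 1))   # -> 559.6
    print(round(relative_frequency(66, 154_208), 1))  # -> 428.0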
tovo in that it does not convey the speaker’s degree of cer-
In the epistemic set (Table 3), half of the modals – that
tainty,7 but rather simply intensifies whatever he or she
is, the possibility modal mogoče and the necessity modals
knows to be actually the case (the historical-geographic
ziher and zagotovo – are more frequent in the corpus of un-
source of Christianity). Because such non-modal readings
acceptable discourse, while the remaining 3 modals – that
are usually as frequent as the modal meaning in attested
is, the possibility modal morda and the logically synony-
usage, we have omitted them from our study.
mous likelihood modals najbrž and verjetno – are more fre-
Lastly, note that in terms of part of speech, the modals
quent in the subset of socially acceptable discourse. Over-
in Table 2 do not constitute a syntactically homogenous set.
all, the six epistemic modals are 1.2 times more frequently
used in acceptable discourse than they are in unacceptable
4This is also a result of the EU Code of conduct and terms
discourse.
of service of social media platforms, according to which content
The distribution is reversed in the set of unambiguously
deemed illegal due to its hateful character needs to be taken down.
deontic modals (Table 4). Here, all modals, save for the
5https://www.clarin.si/noske
possibility verb smeti (“to allow”), are more characteris-
6The frequency counts were preformed on lemmas, as this is
tic of unacceptable rather than acceptable discourse, with
sufficient for distinguishing the part of speech as well; for in-
the deontic necessity adjective treba and deontic likelihood
stance, the lemma mogoče corresponds to the adverbial forms,
adverb naj
showing the largest preference for the unac-
IND
whereas the lemma mogoč corresponds to the adjectival ones;
ceptable set. Overall, the 6 deontic modals are 1.3 times
however, the adjectival form when used predicatively is consis-
more frequently used in socially unacceptable discourse
tently ambiguous between the non-epistemic and epistemic inter-
than they are in acceptable discourse.
pretations, see Lenardič and Fišer (2021) for discussion and ex-Statistically, we have tested the overall differences in
amples.
7
frequency between the unacceptable and acceptable sets for
Zagotovo has the synonym gotovo; we have excluded it from
our overview because it is too frequently used in the non-modal
both the epistemic (Table 3) and deontic (4) modals using sense, as in (1), which is mostly typical of non-standard Slove-the log-likelihood statistic. This statistic is used to “estab-
nian.
lish whether the differences [between pairwise frequencies
in two corpora with different sizes] are likely to be due to
(1)
Postrelit in gotovo.
chance or are statistically significant” (Brezina, 2018, 83–
“Shoot them all – that’s the end of it.”
84). The formula for calculating the log likelihood statistic
is given in (5), where the observed values O1 and O2 correspond to the absolute frequencies of a modal in the unacceptable and acceptable sets, and E1 and E2 to the corresponding expected frequencies.

(5)  LL = 2 × (O1 × ln(O1/E1) + O2 × ln(O2/E2))

           Acceptable           Unacceptable
Modal      AF       RF          AF       RF         A/U    U/A
naj_IND    227   2,442.9        583   3,780.6       0.6    1.5
morati     151   1,625.0        292   1,893.6       0.9    1.2
treba       87     936.3        197   1,277.5       0.7    1.4
smeti       41     441.2         60     389.1       1.1    0.9
dovoliti    17     183.0         34     220.5       0.8    1.2
potrebno     1      10.8          3      19.5       0.6    1.8
Σ          524   5,639.1      1,169   7,580.7       0.74   1.3

Table 4: The distribution of deontic modals in the FRENK corpus.

It turns out that the overall greater occurrence of epistemic modals in the acceptable set (AF = 128 tokens, RF = 1,377.4 tokens/million) than in the unacceptable set (AF = 181 tokens, RF = 1,173.7 tokens/million) is statistically insignificant at p < 0.05; log likelihood = 1.902, p = 0.165. By contrast, the greater occurrence of deontic modals in the unacceptable set (AF = 1,169 tokens; RF = 7,580.7 tokens/million) than in the acceptable one (AF = 524 tokens; RF = 5,639.1 tokens/million) is statistically significant at the same cut-off point; log likelihood = 32.8, p = 9 × 10^−9.
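Formula (5) is easy to reproduce as a sanity check. The sketch below assumes the standard expected-frequency definition E_i = N_i × (O1 + O2) / (N1 + N2) for corpora of N1 and N2 tokens, which follows Brezina's textbook treatment and is consistent with the values reported above, though the paper itself does not spell it out:

    import math

    def log_likelihood(o1, n1, o2, n2):
        """Log-likelihood statistic (5): o1 hits in a corpus of n1 tokens
        vs. o2 hits in a corpus of n2 tokens (cf. Brezina, 2018)."""
        e1 = n1 * (o1 + o2) / (n1 + n2)  # expected frequency in corpus 1
        e2 = n2 * (o1 + o2) / (n1 + n2)  # expected frequency in corpus 2
        return 2 * (o1 * math.log(o1 / e1) + o2 * math.log(o2 / e2))

    # Deontic modals: 1,169 hits in the 154,208-token unacceptable set
    # vs. 524 hits in the 92,922-token acceptable set -> approx. 32.8
    print(log_likelihood(1169, 154_208, 524, 92_922))
    # Epistemic modals: 181 vs. 128 hits in the same two sets -> approx. 1.90
    print(log_likelihood(181, 154_208, 128, 92_922))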
Using the online tool Calc (Cvrček, 2021), we have also calculated the Difference Index (DIN) – an effect-size metric – for the overall difference between the acceptable and unacceptable deontic sets. The DIN value is −14.687, which indicates that the deontic modals' preference for the unacceptable set, although statistically significant, is relatively small (Fidler and Cvrček, 2015, 230). In addition, Calc automatically computes the confidence intervals for the relativised frequencies, which are 5,639.1 ± 471.4 for the overall acceptable RF and 7,580.7 ± 426.9 for the unacceptable RF at the 0.05 significance level. The fact that the intervals do not overlap further confirms that the difference is not accidental.
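The DIN itself is a simple normalised difference of the two relative frequencies; a sketch that reproduces the reported value, assuming the standard definition from Fidler and Cvrček (2015), with which the −14.687 figure is consistent:

    def din(rf_a, rf_b):
        """Difference Index: 100 * (rf_a - rf_b) / (rf_a + rf_b),
        ranging from -100 to 100, with 0 meaning no difference."""
        return 100 * (rf_a - rf_b) / (rf_a + rf_b)

    # Overall deontic RFs, acceptable vs. unacceptable subset -> approx. -14.687
    print(din(5639.1, 7580.7))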
These findings are related to those in the literature (see Section 3.) as follows. Just like in Ayuningtias et al. (2021)'s work on socially unacceptable discourse in YouTube comments, our deontic modals significantly outnumber epistemic modals in both the acceptable and unacceptable sets (e.g., 1,169 deontic modals vs. 181 epistemic modals under unacceptable). Second, both modals of epistemic necessity in Table 3 – that is, zagotovo and ziher ("certainly") – differ from most of the weaker modals, like morda ("possibly") and najbrž ("likely"), in that they are more frequent in unacceptable discourse; this is similar to the finding by Chiluwa (2015), who shows that strong epistemic modals are more frequent than weak ones in the case of Tweets by radical militant groups. However, and in contrast to Chiluwa (2015), our statistically significant finding is not the difference in modal force, but rather the difference in modality type, as discussed above.

6. Qualitative analysis

6.1. Deontic modals in violent discourse

In Section 5., it was shown that deontic modals are more typical of unacceptable rather than acceptable discourse, a finding that was shown to be statistically significant.

To look at the pragmatics of deontic modals and their discursive role in relation to socially unacceptable discourse, let's first recall from Section 4.1. that the socially unacceptable discourse in the FRENK corpus is further subdivided into several subtypes. Here we focus on two – offensive discourse on the one hand and violent discourse on the other. It turns out that all of the surveyed deontic modals, with the exception of the auxiliary morati, are actually more prominent in violent discourse than in offensive discourse; this is shown in Table 5, where for instance treba is almost four times as frequent in the violent-speech subset (RF = 4,437.3 tokens per million) as it is in the offensive subset (RF = 1,083.7 tokens per million).

Modal       Acceptable    Violent     Offensive
treba           936.3      4,437.4      1,083.7
potrebno         10.8        568.9        243.1
dovoliti        183.0        341.3        213.2
smeti           441.2        682.7        405.7
morati        1,625.0      1,479.1      1,910.4
naj_IND       2,442.9      6,371.6      3,647.2
Σ             5,639.2     13,881.0      7,503.3

Table 5: The distribution of deontic modals between the Offensive and Violent subsets of FRENK; the frequencies are relative and normalized to a million tokens.
What is interesting is that treba and morati are synonymous, possibly completely so, in terms of modal logic, as both entail necessities in terms of modal force and in most cases have a deontic reading that has to do with a contextually determined obligation.[8] However, despite the synonymy, treba is by far more frequent in violent speech than it is in offensive speech, while morati is the only deontic modal that is more prominent in offensive than in violent speech.

[8] Note that in negated sentences with treba, negation takes scope over necessity, which means the interpretation is "it is not necessary" rather than "it is necessary not"; a more principled investigation into how this interaction affects the pragmatics of the modalised propositions is left for future work, though we note that negation in examples such as (6c) behaves in a similar manner to so-called metalinguistic negation (Martins, 2020), as the commenter merely objects to the specific number of Volts, but still condones the violent action, i.e. the electrocution of migrants.

The difference in the distribution of the two synonymous modals can be tied to the fact that they vastly differ in their communicative function, which crucially is observable within the same subset. Put plainly, the chief difference is that treba occurs in considerably more hateful statements than morati, even though the statements all qualify as violent hate speech rather than offensive speech in that some kind of incitement towards violence is expressed in the modalised statement.

For instance, let's first consider some typical examples with treba from the violent subset:

(6) a. To golazen treba zaplinit, momentalno!!!!
       "These vermin must be gassed at once!"
    b. Pederčine je treba peljat nekam in postrelit.
       "Faggots must be taken somewhere and shot."
    c. Ni treba par tisoč Voltov, dovolj je 220, da ga strese in opozori, da bo čez par metrov stražar s puško.
       "We don't need a couple of thousand Volts; 220 is enough to electrocute them and warn them that, a couple of metres further on, an armed guard is waiting."
The chief linguistic characteristic of the treba examples boils down to lexical choice. The most prominent nominal collocate of treba in the violent subset, calculated on the basis of the Mutual Information statistic, is golazen ("vermin"), which can be seen in example (6a), where migrants are referred to as such. According to Assimakopoulos et al. (2017, 41), such metaphoric expressions "are an intrinsic part of the Othering process, and central to identity construction". In the case of animal metaphors such as MIGRANTS ARE VERMIN, migrants are conceptually construed and stereotyped as an invasive out-group that is maximally different from the in-group to which the speaker considers themselves to belong (ibid.). The other most prominent nominal collocate is elektrika ("electricity"); metaphors containing this lexeme or lexemes related to electricity (volts, to shock, etc.) often have implied reference, where the undergoers of the verbal event, i.e., migrants, are not directly mentioned, as shown in example (6c). Curiously, when the targets of violent speech are not migrants but members of the LGBT community, instead of metaphors like golazen, slurs such as pedri ("faggots") are used, as in example (6b).
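The Mutual Information score used here is the classic corpus-linguistic collocation measure; a minimal sketch with made-up counts, since the real counts come from the concordancer and are not reported in the paper:

    import math

    def mutual_information(f_node, f_coll, f_pair, n):
        """Pointwise MI as used in corpus tools:
        MI = log2((f_pair * n) / (f_node * f_coll)), where f_pair is the
        co-occurrence count of node and collocate, and n the corpus size."""
        return math.log2((f_pair * n) / (f_node * f_coll))

    # Hypothetical counts for the node "treba" and candidate collocate
    # "golazen" in the 8,789-token violent subset
    print(mutual_information(f_node=39, f_coll=6, f_pair=4, n=8_789))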
Note that it is not only treba which patterns with such charged lexical items; for instance, the adverb naj, which denotes the speaker's desire in terms of deontic modality, also frequently occurs with the electricity metaphor, as in (7).

(7) Elektriko v žice spustit. Naj kurbe skuri!
    "Electrify the fence wires! May it burn the whores!"

The examples with morati, on the other hand, are significantly less lexically charged, as shown in (8), and the statements are framed in a more indirect way.

(8) a. Vse Evropske države bi morale bolj grobo udarit po migrantih.
       "All European countries should have to more strictly strike back against migrants."
    b. Kdo nas zaščitil[,] a moramo mi tud nabavit pištolo
       "Who will protect us? Do we also have to buy a gun?"
    c. Evropa bi morala stopiti skupaj hermeticno zapreti meje.
       "Europe should have to come together and hermetically close the borders."

Even when the morati examples convey that it is necessary that some kind of action be taken against e.g. migrants, as in example (8a), the verbs used are such that they no longer convey explicit violent acts, such as postreliti ("to shoot"), zapliniti ("to gas"), and stresti ("to electrocute") in the treba examples (6), but express non-violent acts, as in the case of the verbal phrase zapreti meje ("close the borders") in (8c). Indeed, the calls to violent action with morati are significantly more tentative, as many of the cases of deontic morati are embedded under the conditional mood clitic bi, which leads to a composite meaning where the deontic necessity is interpreted as a suggestion rather than a direct command, as in examples (8a) and (8c), which also is not the case with treba.

To sum up the discussion so far, we have observed that while treba and morati both convey deontic necessity (roughly, an obligation that needs to be met), they are paired up with quite substantially different statements in terms of hateful rhetoric in the case of the same type of unacceptable discourse, i.e., violent speech. Further, morati is also the only deontic modal which is less typical of violent speech than it is of offensive speech.

We suggest that the difference is tied to the way the pragmatics of deontic modals interacts with their core syntactic and semantic properties. As discussed in Section 2.2., deontic modals pragmatically fulfil the interpersonal function in communication. The interpersonal dimension has to do with the fact that the deontic necessity, i.e., obligation, is ascribed by the speaker to whoever corresponds to the agent of the verbal event in the modalised proposition; concretely, in the case of example (8a), the speaker says that it is European countries that have the obligation to strike back against migrants.

The chief difference between the treba (6) and the morati (8) examples, manifested in the discussed lexical differences, lies in this interpersonal pragmatic dimension, which is crucially influenced by the syntax of the expressions. Treba is an impersonal predicative adjective which, in contrast to morati, syntactically precludes the use of a nominative grammatical subject that would be interpreted as the agent in the modalised proposition (Rossi and Zinken, 2016). Consequently, all the statements in the treba set of examples are such that the agent has an undefined, arbitrary reference – for instance, it is unclear who is expected to "gas the vermin" in example (6a). What happens pragmatically is that the subject-less syntax of the adjective treba allows the speaker to sidestep the ascription of obligation to a specific agent, thus largely obviating what is perhaps the core interpersonal aspect of deontic modality. This cannot really be avoided with morati, which is a personal verb that obligatorily selects for a grammatical subject in active clauses – in other words, because of its personal syntax, morati presents a bigger interpersonal burden on the speaker, as he or she needs to specifically name the person or institution that is required to fulfill the obligation.

Note that, in the violent subset, there is only one example where morati is used with the verb dobiti ("get"), which induces a passive-like interpretation (9). Here, the grammatical subject headed by Vsak ("everyone") is interpreted as the target of the violent action rather than the agent.
It is telling that this is also the only example with morati which is closer in its use of lexically charged items (i.e., being "shot in the head" rather than "the closing of borders" in the previous examples) to the treba examples, as this passive-like construction also precludes the use of an agentive noun phrase (unless it is introduced by the Slovenian equivalent of the by-phrase, but there are no such examples in the corpus).

(9) [V]sak, ki se približa našim ženskam in otrokom, mora dobiti metek v čelo.
    "Everyone who gets close to our women and children must be shot in the head."

In short, the interpersonal structure influences the degree of hateful rhetoric, in the sense that speakers are more ready to use degrading metaphors, slurs and violent verbal expressions when they can avoid ascribing the obligation to someone specific. We follow Luukka and Markkanen (1997) in suggesting that impersonality has a hedging effect similar to that of epistemic modals, in the sense that the unexpressed agent in impersonals introduces a degree of semantic vagueness to the proposition, as does the uncertainty brought about by the epistemic reading. Thus, with treba, deontic imposition and epistemic face-saving meet in one and the same lexeme.

6.2. Epistemic modals in offensive and acceptable discourse

Epistemic modals are slightly more frequent in acceptable comments, although the difference is not statistically significant, as was shown in Section 5. In order to further explore the possible differences and similarities in the use of epistemic modals between different types of comments, we look at their distribution in three subcorpora, namely in acceptable, offensive and violent comments. The distribution is shown in Table 6. We find that epistemic modals are very infrequent in the violent comments (even unattested for morda "possibly" and najbrž "likely"), in contrast to deontic modals, which are more frequent almost across the board in the violent set (Table 5). On the other hand, the epistemic modals show a similar distribution between acceptable and offensive comments, in contrast to violent comments.

Modal       Acceptable    Violent     Offensive
morda           258.3          0.0        169.3
mogoče          312.1        113.8        555.8
verjetno        559.6        341.3        451.6
najbrž          129.1          0.0         90.3
ziher            86.0        113.8         97.3
zagotovo         32.3        113.8         83.4
Σ             1,377.4        682.7      1,447.5

Table 6: The distribution of epistemic modals between the Acceptable, Violent, and Offensive subsets of FRENK; the frequencies are relative and normalized to a million tokens.

We now look at the pragmatics of the epistemic necessity modal ziher ("certainly"), as it exhibits the most comparable frequency between the acceptable and offensive subcorpora.

In offensive comments, ziher is used either as a booster (10) or a hedge (11), a discursive function which the commenter uses as part of the face-saving strategy. Boosting is shown in example (10).

(10) Begunca? Ekonomske migrante pa picke, ki se ne znajo borit za svoj kos zemlje ZIHER ne!!!!!!!
     "Accepting a refugee? CERTAINLY not accepting economic migrants and cunts who don't know how to fight for their piece of land!!!!!!!"

In this example, the use of the modal conveys the lexical meaning of certainty and thus the speaker's full truth commitment to the propositional content. Accompanied by excessive exclamatory punctuation, upper-case letters and contemptuous argumentation, the modal pragmatically acts as a booster emphasizing the speaker's commitment. The face-saving dimension comes about because the assertiveness conveyed by the modal helps legitimize the speaker as a member of the in-group that is exclusionary of the migrant out-group.

(11) [K]r k cerarju nej gredo zihr ma veliko stanovanje ... bedaki.
     "They better go to the prime minister Cerar, he surely has a big flat ... assholes."

Contrary to the previous example, the modal in (11) pragmatically hedges the propositional content by invoking the presumed shared knowledge of the in-group, which concerns the size of the prime minister's home. Here, hedging is related to the fact that the modal activates the face-saving strategy which protects the speaker from the accusation of making an unfounded claim, as the modalised statement, despite entailing certainty, is still weaker than the unmodalised variant, which would otherwise report that the speaker holds factual knowledge about the prime minister's apartment.

While the offensive comments predominantly feature ziher in such a hedging or boosting role, in the large majority of the acceptable comments the modal conveys an additional figurative meaning – i.e., that of irony, which we also claim is related to face-saving and contributes an additional persuasive effect in terms of discourse pragmatics (Gibbs and Izett, 2005; Attardo, 2000).

Example (12) conveys a proposition whose ironic meaning is emphasized by the modal ziher.

(12) Itak, dejmo vsi lagat, to je ziher prav :)
     "Of course, let's all lie, that's certainly the right thing to do :)"

The ironic reading of this example is suggested by the use of the intensifying adverb itak ("of course"), exaggeration by means of the collective reading of the plural pronoun vsi ("everyone"), the use of the first-person verb dejmo ("let's"), and the use of the emoticon. Finally, the face-saving strategy enacted in this example has two dimensions. The first is the protection of the speaker's face, since the irony not only enables the speaker to capitalise on the use of a sophisticated rhetorical device, but also to claim group affiliation by clearly stating the values that the
group has in common. The second aspect is the protection of the addressee's face, since the irony helps tone down the speaker's criticism – according to Gibbs and Izett (2005), ironic criticism is accepted better, or in a friendlier way, than direct critiques.

7. Conclusion

This paper has presented a corpus investigation of epistemic and deontic modal expressions in Slovenian Facebook comments in the FRENK corpus.

We have first proposed a set of Slovenian modals that show an overwhelming tendency towards a single modal reading. Because of such unambiguity, they constitute a robust set that allows for precise quantitative comparisons between different types of discourse without irrelevant confounding examples, and for careful manual analysis of the corpus examples. Quantitatively, we have shown that deontic modals are a prominent feature of unacceptable discourse, and that they are especially prominent in discourse that concerns incitement to violent action, which is legally prosecutable.

In terms of discourse pragmatics, we have first shown that modals which are completely synonymous both in terms of force and modality type can nevertheless profoundly differ in the degree of hateful rhetoric in the same type of socially unacceptable discourse. We have shown that what makes a difference in such examples is the presence of impersonal syntax, which offers speakers the ability to linguistically obviate the ascription of the denoted obligation to a particular agent. We have suggested that this sort of face-saving strategy of ambiguity by way of impersonality correlates with the speaker's tendency to use dehumanising language, such as slurs or degrading metaphors. In the case of epistemic modals, we have shown that acceptable and offensive comments, which are highly similar at their surface linguistic level, differ pragmatically in relation to face-saving; while offensive comments use epistemic modals as simple hedging or boosting devices, acceptable comments use the modals to convey ironic statements in which the irony is emphasised by the modal. We have claimed that the irony also contributes to the face-saving pragmatics.

In future work, we intend to explore how deontic and epistemic modals also differ based on topic (migrants on the one hand and the LGBTQ community on the other). We also want to explore if and how the discourse differs if the unacceptable comments are directed towards a person's individual background (e.g., race, ethnicity) or their group affiliation (e.g., political party).

Acknowledgments

The work described in this paper was funded by the Slovenian Research Agency research programme P6-0436: Digital Humanities: resources, tools and methods (2022–2027), the DARIAH-SI research infrastructure, and the national research project N6-0099: LiLaH: Linguistic Landscape of Hate Speech.

8. References

Stavros Assimakopoulos, Fabienne H. Baider, and Sharon Millar. 2017. Online hate speech in the European Union: A discourse-analytic perspective. Springer Nature.

Salvatore Attardo. 2000. Irony markers and functions: Towards a goal-oriented theory of irony and its processing. Rask, 12:3–20.

Diah Ikawati Ayuningtias, Oikurema Purwati, and Pratiwi Retnaningdyah. 2021. The lexicogrammar of hate speech. In: Thirteenth Conference on Applied Linguistics (CONAPLIN 2020), pages 114–120. Atlantis Press.

Vaclav Brezina. 2018. Statistics in corpus linguistics: A practical guide. Cambridge University Press.

Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some universals in language usage, volume 4. Cambridge University Press.

Innocent Chiluwa. 2015. Radicalist discourse: A study of the stances of Nigeria's Boko Haram and Somalia's Al Shabaab on Twitter. Journal of Multicultural Discourses, 10(2):214–235.

Jennifer Coates. 1983. The Semantics of the Modal Auxiliaries. Croom Helm, London and Canberra.

Jennifer Coates. 1987. Epistemic modality and spoken discourse. Transactions of the Philological Society, 85(1):110–131.

Václav Cvrček. 2021. Calc 1.03: Corpus Calculator. Czech National Corpus. https://www.korpus.cz/calc/.

Kristina Pahor de Maiti, Darja Fišer, and Nikola Ljubešić. 2019. How haters write: Analysis of nonstandard language in online hate speech. Social Media Corpora for the Humanities (CMC-Corpora2019), page 37.

Richard Delgado. 2019. Understanding words that wound. Routledge.

Bojan Evkoski, Andraž Pelicon, Igor Mozetič, Nikola Ljubešić, and Petra Kralj Novak. 2022. Retweet communities reveal the main sources of hate speech. PloS One, 17(3):e0265602.

Masako Fidler and Václav Cvrček. 2015. A data-driven analysis of reader viewpoints: Reconstructing the historical reader using keyword analysis. Journal of Slavic Linguistics, pages 197–239.

Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2017. Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene. In: Proceedings of the First Workshop on Abusive Language Online, pages 46–51.

Katharine Gelber and Luke McNamara. 2016. Evidencing the harms of hate speech. Social Identities, 22(3):324–341.

Raymond W. Gibbs and Christin Izett. 2005. Irony as persuasive communication. Figurative language comprehension: Social and cultural influences, pages 131–151.

Francisco Gonzálvez García. 2000. Modulating grammar through modality: A discourse approach. ELIA, 1:119–136.

Michael A.K. Halliday. 1970. Functional diversity in language as seen from a consideration of modality and
mood in English. Foundations of Language, pages 322–361.

Ken Hyland. 1998. Hedging in scientific research articles, volume 54. John Benjamins Publishing Company, Amsterdam.

Ken Hyland. 2005. Stance and engagement: A model of interaction in academic discourse. Discourse Studies, 7(2):173–192.

Angelika Kratzer. 2012. Modals and conditionals: New and revised perspectives, volume 36. Oxford University Press.

Jakob Lenardič and Darja Fišer. 2021. Hedging modal adverbs in Slovenian academic discourse. Slovenščina 2.0: empirical, applied and interdisciplinary research, 9(1):145–180.

Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2019. The FRENK datasets of socially unacceptable discourse in Slovene and English. In: International Conference on Text, Speech, and Dialogue, pages 103–114. Springer.

Nikola Ljubešić, Darja Fišer, Tomaž Erjavec, and Ajda Šulc. 2021. Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1462.

Nolwenn Lorenzi-Bailly and Mariem Guellouz. 2019. Homophobie et discours de haine dissimulée sur Twitter: celui qui voulait une poupée pour Noël. Semen. Revue de sémio-linguistique des textes et discours, 47.

Minna-Riitta Luukka and Raija Markkanen. 1997. Impersonalization as a form of hedging. Research in Text Theory, pages 168–187.

Ana Maria Martins. 2020. Metalinguistic negation. In: The Oxford Handbook of Negation. Oxford University Press.

Frank Robert Palmer. 2001. Mood and modality. Cambridge University Press.

Carolin F. Roeder and Björn Hansen. 2006. Modals in contemporary Slovene. Wiener Slavistisches Jahrbuch, 52:153–170.

Giovanni Rossi and Jörg Zinken. 2016. Grammar and social agency: The pragmatics of impersonal deontic statements. Language, 92(4):e296–e325.

Alexandra A. Siegel. 2020. Online hate speech. Social Media and Democracy: The State of the Field, Prospects for Reform, pages 56–88.

Maria Grazia Sindoni. 2018. Direct hate speech vs. indirect fear speech: A multimodal critical discourse analysis of The Sun's editorial "1 in 5 Brit Muslims' sympathy for jihadis". Lingue e Linguaggi, 28:267–292.

Kai von Fintel. 2006. Modality and language. In: Donald M. Borchert, editor, Encyclopedia of Philosophy – Second Edition, pages 20–27. MacMillan Reference USA, Detroit.

Milica Vukovic. 2014. Strong epistemic modality in parliamentary discourse. Open Linguistics, 1(1).

Simon Winter and Peter Gärdenfors. 1995. Linguistic modality as expressions of social power. Nordic Journal of Linguistics, 18(2):137–165.
The ParlaSpeech-HR benchmark for speaker profiling in Croatian
Nikola Ljubešić,∗† Peter Rupnik∗
∗Department of Knowledge Technologies
Jožef Stefan Institute
Jamova cesta 39, SI-1000 Ljubljana
nikola.ljubesic@ijs.si
peter.rupnik@ijs.si
†Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, SI-1000 Ljubljana
Abstract
Recent advances in speech processing have made speech technologies significantly more accessible to the research community. Beyond the most popular task of automatic speech recognition, classifying speech acts by various criteria has also recently caught interest. In this paper we propose a benchmark constructed from a dataset of speeches given in the Croatian parliament, aimed at predicting the following speaker profile features: speaker identity, gender, age, and power position (whether the speaker is in the ruling coalition or opposition). We evaluate various pre-trained transformer models on our variables of interest, showing that speaker identification and power position prediction seem to rely mostly on language-specific features, while gender and age prediction rely more on generic speech features, available also in models not pre-trained on the target language. We release the benchmark to serve in measuring the strength of upcoming speech models on a lower-resourced language such as Croatian.
1. Introduction

Speech technologies have recently experienced a quantum leap in their development due to the successful application of self-supervised pre-training of transformer models on speech data (Schneider et al., 2019). Due to this significant simplification of the development of speech technologies, their uptake has increased significantly (Fan et al., 2020; Pepino et al., 2021; Bartelds et al., 2022), which also resulted in the development of the first open dataset for training automatic speech recognition in Croatian (Ljubešić et al., 2022), based on data from the Croatian parliament. The parliamentary data are especially suited for speech experiments, not only because they are in the public domain, but also because they are rich in speaker metadata (Ljubešić et al., 2022).

In this work we present a rather opportunistic benchmark for speaker profiling in Croatian, based on the ParlaSpeech-HR dataset and the available information on the speakers in that dataset. We define four tasks. In the first task, speaker identification, the goal is to predict which of the possible 50 speakers is the speaker of a speech act. In the second task, male and female speakers are to be discriminated between. The third task is focused on discriminating between younger and older speakers, 49 years of age being the division point between the two age groups. In the fourth task we aim at discriminating the speech acts depending on whether they were given by MPs from the ruling coalition or from the opposition.

We compare models pre-trained on the target language (Croatian) and models that were not pre-trained on this language, obtaining insights not only into how well transformer models perform on these tasks, but also into how language-dependent these tasks are. While there have been many approaches to speaker profiling developed before the era of transformers, in this work we limit ourselves to evaluating transformer models only, primarily due to their reported superior performance (Yang et al., 2021).
filing is the TIMIT dataset (Garofolo et al., 1993), con-
We compare models pre-trained on the target language
sisting of 630 speakers of 8 dialects of American En-
(Croatian) and models that were not pre-trained on this
glish. It consists of speaker information on gender, age and
language, obtaining insights not only how well trans-
height (Kalluri et al., 2020).
former models perform on these tasks, but also how much
In this work we are not trying to build on top of the
language-dependent these tasks are. While there have been
existing benchmarks due to two reasons. The main reason
The main reason is our interest in less-resourced languages, primarily South Slavic languages, for which there is little to no data available. The very recently released ParlaSpeech-HR dataset, on which this benchmark is based, is the first openly available speech dataset for Croatian (Ljubešić et al., 2022). The second reason is the disruptive effect speech transformers have had on the field, drastically lowering the previously reported error levels (Yang et al., 2021), with significant improvements expected in the near future as well. This is why we opt for a new, admittedly opportunistic benchmark on speakers from the Croatian parliament. Besides documenting the highly important data selection decisions, we report first results with the current state-of-the-art technology. Given the current high pace of innovation in speech technologies, which is unlikely to slow down soon, this benchmark will be highly useful in assessing what new technologies can and will be able to offer to a less-resourced language such as Croatian.

3. Benchmark construction
In this section we present the dataset our benchmark is constructed from, and the data selection protocols for the four variables of interest.

3.1. The dataset
The dataset this benchmark is based on is the ParlaSpeech-HR dataset (Ljubešić et al., 2022), aimed primarily at developing automatic speech recognition systems for Croatian. It consists of 1,816 hours of speech obtained from 309 speakers. For each speaker, metadata on age, gender, party affiliation, role in the parliament, and power status (opposition vs. coalition) is available. More details on the content and construction procedure of the ParlaSpeech-HR dataset can be found in its description paper (Ljubešić et al., 2022).

3.2. Data selection
For each of the four tasks a separate data selection procedure was set up, given the limited data available, but also the different nature of the tasks. While most tasks are binary (gender, age, power status), the task of speaker identification is a 50-class task. Furthermore, while for the three binary tasks the training, development and test subsets have to consist of different speakers, in the speaker identification task the same speakers have to be present in all three subsets. Finally, in the tasks of age and power status prediction we decided to sample only from male speakers, as there are too few female speakers in the dataset for a reasonable sampling that would not introduce unwanted bias.
Additionally, in each of the four tasks we only selected instances that were at least 8 seconds in duration. While most of the ParlaSpeech-HR dataset consists of such instances (voice activity detection was set up in such a fashion), there is a small number of instances, mostly coming from the endings of audio files, that are shorter than 8 seconds. We also discarded speakers producing more than 3,000 instances or fewer than 200 instances. While speakers with a small production might complicate the data selection procedure, as we want each selected speaker to be equally represented in a sample, the most prolific speakers were left out of the sampling procedures due to their very specific roles in the parliament, which quite likely carry unwanted biases into their speech production.
In the four following subsections we describe the specific sampling criteria applied for each of our four tasks.
3.2.1. Speaker identification
For the task of speaker identification, 25 speakers per binary gender were sampled. Per speaker, 100 instances were included in the training subset, 10 in the development subset, and 10 in the test subset. Checks were performed to ensure that for no speaker do instances from the same video appear in more than one subset. With this sampling procedure, each of the three subsets consists of the same 50 speakers, with the training subset containing 5,000 instances, and the development and test subsets 500 instances each.
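The video-level leakage check can be implemented as a grouped split in which whole videos, rather than individual instances, are assigned to subsets. The sketch below is one possible implementation under the same assumed metadata layout as above (speaker and video columns are hypothetical); it is not the authors' released code.

import pandas as pd

def split_speaker(df, n_train=100, n_dev=10, n_test=10, seed=42):
    # Assign whole videos to subsets so no video spans more than one subset;
    # assumes the speaker has enough distinct videos to fill all quotas.
    quotas = {"test": n_test, "dev": n_dev, "train": n_train}
    parts = {name: [] for name in quotas}
    shuffled = df.sample(frac=1, random_state=seed)
    for _, video in shuffled.groupby("video", sort=False):
        for name in quotas:
            if sum(len(g) for g in parts[name]) < quotas[name]:
                parts[name].append(video)
                break
    # head() trims any overshoot; trimmed rows are simply discarded,
    # so the video-disjointness of the three subsets is preserved.
    return {name: pd.concat(parts[name]).head(quotas[name]) for name in quotas}

# Applied per speaker, e.g.:
# splits = {s: split_speaker(g) for s, g in meta.groupby("speaker")}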
3.2.2. Gender prediction
For each of the two binary genders, male and female, 25 speakers were selected for the training subset, every speaker being represented with 20 instances. For each of the two genders, 5 speakers (that were not already in the training subset) were taken for the development split, and 5 speakers for the test split. Every speaker in the development and test subsets was represented with 200 instances. With this we assured three subsets of distinct speakers, with the training subset consisting of 1,000 instances, and the development and test subsets of 2,000 instances each.

3.2.3. Age prediction
Given that there are very few distinct female speakers in the ParlaSpeech-HR dataset, and that controlling for gender while performing any data split is necessary due to the likely strong signal coming from the gender of the speaker as a potential confounder, after some metadata analyses we decided to set up the age prediction task on male speakers only.
The age distribution of male speakers is rather narrow and normal around the median of 49 years of age. It is far from the uniform and wide distribution that would allow for a more diverse age prediction task, set up as a regression task or as a classification task with many categories. This is why we decided to define this as a binary task, predicting whether a speaker is below or above the median age. For the training portion of the task, 60 speakers were selected, with 20 instances per speaker. For the development and test sets, 20 speakers were selected for each subset, each speaker being represented by 50 instances. While performing the split, additional checks were put in place to ensure that the age distribution in each of the subsets is as close as possible to the distribution in the full dataset. Additional checks were also performed to ensure that no speaker leakage existed between the three subsets. With this data selection, the training subset consists of 1,200 instances, while the development and test subsets consist of 1,000 instances each. Given that the median was chosen as the classification boundary, the final dataset is balanced regarding the two levels of the age variable.
3.2.4. Power status prediction
We decided to wrap up the benchmark with a task that is quite likely less acoustic and more semantic. Given that we are currently proposing a shared task on predicting whether a transcript of a speech was given by the ruling coalition or the opposition, we decided to add that task to this benchmark as well, but performed on speech rather than on text transcripts. The ParlaSpeech-HR data come from a single term of the Croatian parliament, which means that the ruling coalition members are mostly from the right side of the political spectrum, while the opposition members are mostly from the left side. Disentangling party affiliation or political orientation from power status was therefore rather impossible here, which has to be taken into account while analysing the results.
Similar to the task of age prediction, we again sampled only among male speakers, as the number of female speakers was too low for well-stratified samples. As with age, given the high predictability of gender, we did not want to allow gender to become a confounder of our primary prediction task, which in this case is power status.
We sampled 25 speakers per power status for training, each speaker being represented by 50 instances. For the development and test sets we selected 9 speakers for each subset, again representing each speaker with 50 instances. Additional checks were performed to ensure that there is no speaker leakage between the three subsets. With this, the size of the training subset is 2,500 instances, while the development and test subsets consist of 900 instances each. For simplicity of evaluation, the division of instances regarding the power status variable is balanced, with 50% of instances coming from each side of the political power spectrum.
The benchmark is made available to the public for reproducibility and further benchmarking via the GitHub repository https://github.com/clarinsi/parlaspeech-hr-benchmark/.
4. Experimental setup
In this section we give a short description of the setup of the experiments performed on the newly constructed benchmark.
We perform all our experiments with transformer models (Vaswani et al., 2017) that were pre-trained on spoken data. We use the Transformers library (Wolf et al., 2019) and retrieve pre-trained models from the Hugging Face model repository.
We use the model pre-trained on Croatian that has proven to perform best on the task of automatic speech recognition (ASR) (Ljubešić et al., 2022), namely the Slavic model (https://huggingface.co/facebook/wav2vec2-large-slavic-voxpopuli-v2). We compare the performance of this pre-trained-only model to the model that was additionally fine-tuned on the ASR task, Slavic-asr (https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr), to investigate whether fine-tuning the model on the same data, but on another task, improves performance.
We also compare the performance of the model pre-trained on Croatian to a model pre-trained on an unrelated language, in our case English (English-asr, https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self). We decided to use the English model fine-tuned for ASR, as the non-finetuned model (https://huggingface.co/facebook/wav2vec2-large) was giving random results after fine-tuning on any of our four tasks. This suspiciously bad result can probably be traced back to a technical issue in the model, rather than to the fact that the model was not fine-tuned on ASR before, as will be seen in the comparison between the performance of the Slavic and Slavic-asr models.
The overview of the models used in our experiments, together with a short description of the type and amount of data the models were pre-trained and fine-tuned on, is given in Table 1. The non-finetuned Croatian model was pre-trained on around 99 thousand hours of raw recordings of speeches in various Slavic languages given in the European parliament. The fine-tuned Croatian model was additionally fine-tuned on the ASR task on around 300 hours of the ParlaSpeech-HR dataset. The English model was pre-trained on 53 thousand hours of raw speech material obtained from audio books and was fine-tuned for the ASR task on 960 hours of similar material.

model name | short name | pre-training | ASR fine-tuning
facebook/wav2vec2-large-slavic-voxpopuli-v2 | Slavic | Slavic (99k hours) | –
classla/wav2vec2-large-slavic-parlaspeech-hr | Slavic-asr | Slavic (99k hours) | Croatian (300 hours)
facebook/wav2vec2-large-960h-lv60-self | English-asr | English (53k hours) | English (960 hours)
Table 1: List of models used in our experiments, with the amount and type of pre-training and fine-tuning data.
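All three checkpoints from Table 1 can be loaded through the same audio-classification interface of the Transformers library. The sketch below illustrates this general pattern and is not the exact training script used in our experiments; the 16 kHz input rate and the placeholder waveform are assumptions.

import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "facebook/wav2vec2-large-slavic-voxpopuli-v2"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
# A freshly initialised classification head on top of the pre-trained
# encoder; num_labels=50 corresponds to the speaker identification task.
model = AutoModelForAudioClassification.from_pretrained(model_id, num_labels=50)

waveform = np.zeros(16000 * 8, dtype=np.float32)  # placeholder: 8 s of audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits
predicted_class = int(logits.argmax(dim=-1))

The model would then be fine-tuned on the training subset of the respective task, e.g. with the library's Trainer class.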
Regarding hyperparameter optimization, we investigate only the number of epochs required for performance improvements to stall, which is determined by training on the training portion and evaluating on the development portion. For the first two tasks, speaker identification and gender prediction, two epochs were shown to be enough, while for the tasks of age prediction and power status prediction, 15 epochs over the training subset were chosen as optimal.
We evaluate each model on our test subset by reporting both accuracy and the macro F1 metric. Given that all our tasks consist of datasets with a balanced distribution of the response variable, our random baseline lies at 0.5 for the binary classification schema, and at 0.02 for the 50-class speaker identification schema.
For the less challenging tasks of speaker identification and gender prediction, we perform two types of evaluation: on full instances, and on the first 2 seconds of each instance only.
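The evaluation itself reduces to standard metric calls; the sketch below, with toy labels rather than benchmark outputs, shows the accuracy and macro F1 computation together with the 2-second clipping, assuming a 16 kHz waveform (an assumption, as above).

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 1, 1, 0])   # toy labels, not benchmark outputs
y_pred = np.array([0, 1, 0, 0])
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

SAMPLE_RATE = 16000  # assumed model input rate

def clip_to_two_seconds(waveform):
    # Keep only the first 2 seconds of the instance.
    return waveform[: 2 * SAMPLE_RATE]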
5. Results
5.1. Speaker identification
The results on the speaker identification task are presented in Table 2. They show the task to be quite easy for the Slavic and Slavic-asr models applied to full instances. The model fine-tuned on ASR seems to perform slightly better in the full-data scenario, keeping an even score on instances clipped to two seconds, while in that case the non-finetuned model experiences a significant drop of 20 points. This result seems to show how important it is for the model to have experienced the exact speakers it is supposed to differentiate between, even on another task such as ASR. We do not believe that transfer has occurred between the ASR task and the speaker identification task directly (the model exploiting what people are saying while deciding on the speaker identity), but rather that its parameters were previously adapted to focus better on the peculiarities of the 50 speakers in question.
model | clipped | accuracy | macro F1
Slavic | no | 0.998 | 0.998
Slavic | 2 sec | 0.806 | 0.784
Slavic-asr | no | 1.000 | 1.000
Slavic-asr | 2 sec | 1.000 | 1.000
English-asr | no | 0.334 | 0.275
English-asr | 2 sec | 0.106 | 0.048
Table 2: Speaker identification results.

The English model, interestingly, performs rather badly, with predictions over the full length of each instance (between 8 and 20 seconds) being correct in only 33% of cases. This is still quite far from the random baseline of 2%, but also very far from the stellar performance of the models pre-trained on Croatian. Predicting on only 2 seconds of speech further deteriorates the results, to an accuracy of 10%. For the speaker identification task the pre-training language thus seems to be very important, as the model quite likely models the phonetic peculiarities of each speaker, rather than only acoustic features, for which any speech transformer should be useful.
To investigate which speakers are confused by the Slavic model when only two seconds are available for prediction, we present the confusion matrix in Figure 1. The matrix shows that speakers of the same gender are confused with each other, e.g. Arsen Bauk, Davor Bernardić and Božo Petrov being confused for Žarko Katić, or Sunčana Glavak and Ljubica Lukačić being misclassified as Ivana Ninčević-Lesandrić.

[Figure 1: Confusion matrix for speaker identification with the Slavic model on instances clipped to two seconds; the full 50-by-50 matrix is omitted here.]
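A confusion matrix such as the one in Figure 1 can be produced directly from the test labels and predictions; the following sketch uses a three-speaker toy example rather than the real benchmark outputs.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

speakers = ["Bauk, Arsen", "Katić, Žarko", "Glavak, Sunčana"]  # toy label set
y_true = ["Bauk, Arsen", "Katić, Žarko", "Glavak, Sunčana", "Bauk, Arsen"]
y_pred = ["Katić, Žarko", "Katić, Žarko", "Glavak, Sunčana", "Bauk, Arsen"]
cm = confusion_matrix(y_true, y_pred, labels=speakers)
ConfusionMatrixDisplay(cm, display_labels=speakers).plot(xticks_rotation="vertical")
plt.show()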
5.2. Gender prediction
The results on the task of gender prediction are presented in Table 3. On this task all three models, regardless of the language they were pre-trained on, achieve very good performance, the lowest result being an accuracy of 98.5%, with the length of the test instances not having a strong impact. Interestingly, the Slavic-asr model that performed perfectly on the speaker identification task is the one that performs worst on the gender prediction task.

model | clipped | accuracy | macro F1
Slavic | no | 0.997 | 0.997
Slavic | 2 sec | 0.989 | 0.989
Slavic-asr | no | 0.985 | 0.985
Slavic-asr | 2 sec | 0.985 | 0.985
English-asr | no | 0.999 | 0.999
English-asr | 2 sec | 0.994 | 0.994
Table 3: Gender prediction results.

To investigate what type of confusion occurs on this task, we analyse the output of the Slavic model on 2-second instances. We represent the results via a confusion matrix in Figure 2, which shows that male instances are sometimes confused for female instances, but not vice versa. Investigating further which speakers are confused most of the time, it turns out to be a limited number of speakers whose voice has, at least on some occasions, a higher pitch.

true \ predicted | F | M
F | 1000 | 0
M | 21 | 979
Figure 2: Confusion matrix for speaker gender prediction of the Slavic model on 2-second test instances.

The results on gender prediction show that transformer models do not rely on language-specific features here, but quite likely on the pitch of a speaker's voice, with the best results reported by the English model, which is almost perfect even on 2-second test instances.
5.3. Age prediction
The results on age prediction, i.e. guessing whether a speaker is younger or older than 49 years, the median speaker age in the dataset, are given in Table 4. Here we do not perform experiments on speech samples clipped to two seconds, as the task is already demanding enough on full-length instances. The Slavic-asr model seems to perform best, with an accuracy of 72%, 50% being the random result. The Slavic and English-asr models are suspiciously close in performance, with only a point and a half of difference, which suggests that the age prediction task does not rely on language-specific features, but rather on general acoustic features.

model | clipped | accuracy | macro F1
Slavic | no | 0.694 | 0.690
Slavic-asr | no | 0.722 | 0.722
English-asr | no | 0.678 | 0.672
Table 4: Age prediction results.

To investigate the confusion patterns between the two age groups, we plot the confusion matrix of the Slavic model in Figure 3. The matrix shows clearly that older speakers are more frequently misclassified as younger speakers than vice versa.

true \ predicted | old | young
old | 290 | 210
young | 96 | 404
Figure 3: Confusion matrix for speaker age classification by the Slavic model.

Given that we have divided the speakers by age at the median point, and that speaker age is rather normally distributed, we additionally wanted to check whether most of the prediction errors occur for speakers who are close to the class boundary. To investigate this, we plot an instance-level age histogram in Figure 4, encoding the correctly and incorrectly classified instances by the Slavic model with different colours. The histogram shows that most misclassifications happen, as expected, close to the median class boundary, with almost all instances of speakers of 50 and 51 years of age being misclassified as younger speakers. Classifications of the youngest (35 years) and oldest speakers (68 and 69 years) are performed perfectly by the model.

[Figure 4: Distribution of age in our test subset, along with misclassifications by the Slavic model; histogram omitted (x-axis: age, y-axis: frequency, with the median age marked).]

This insight might motivate us to organise the age prediction task in the future as a classification into three categories, with the middle category, around the median age, considered hard and discarded in an easier setup of the classification task.
5.4. Power status prediction
The results of our final task, power status prediction, are given in Table 5. They are, as expected, the lowest of all four tasks defined in this benchmark. The Slavic-asr model performs best, with a difference of 2.7 accuracy points over the non-finetuned model. The model that was not pre-trained on Croatian achieves a significantly lower result, 5 points lower than either model pre-trained on Croatian, showing that mostly language-specific features are used for solving this task.

model | clipped | accuracy | macro F1
Slavic | no | 0.590 | 0.587
Slavic-asr | no | 0.627 | 0.626
English-asr | no | 0.549 | 0.531
Table 5: Power status prediction results.

Which features exactly are used is hard to identify. The only attempt we make in this direction is a per-speaker analysis of correct and incorrect classifications by the Slavic-asr model, which we present in Figure 5. The results show that people in power seem to be easier to identify than those in opposition, as the speakers with the lowest percentage of correctly classified instances are mostly from the opposition. The error also seems to be rather speaker-dependent, with eight of the speakers having an accuracy above 80%, and the five worst-performing speakers having an accuracy below 40%.

[Figure 5: Per-speaker accuracy level with the Slavic-asr model on the power status prediction task; plot omitted.]

Analysing the five worst-performing speakers, a trend can be observed: the two such speakers in power are two of the most fine-mannered speakers, while two out of the three such speakers from the opposition are rather known for their harsh speech. This analysis suggests that the signal the classifier has caught on to is quite likely based on political orientation rather than on power status itself. For modelling power status in speech, the training and evaluation data should consist of data from multiple parliamentary terms, with the same political orientations having speeches given both while in power and while in opposition.
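The per-speaker analysis behind Figure 5 amounts to grouping test predictions by speaker and computing the share of correct classifications. A minimal sketch, with a hypothetical results table (columns are assumptions):

import pandas as pd

results = pd.DataFrame({          # toy data, not the real predictions
    "speaker": ["A", "A", "B", "B"],
    "correct": [True, False, True, True],
})
per_speaker_accuracy = results.groupby("speaker")["correct"].mean().sort_values()
print(per_speaker_accuracy)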
6. Conclusion
In this paper we have presented a benchmark for speaker profiling in Croatian, based on recordings of the Croatian parliament. We have carefully selected the speakers and instances used in the benchmark, paying special attention to any type of bias or confounder that might be included in the tasks.
We have performed initial experiments with transformer models pre-trained on speech, obtaining interesting insights. The task of speaker identification seems to be rather language-dependent, and results can be further improved if the model has seen the speakers to be identified before the final fine-tuning process. Gender prediction seems to be the least language-specific task, with very good results obtained regardless of the model, quite likely relying simply on the pitch of the speaker. Age prediction, in our case set up as a binary task with the boundary at the age median, proves to be hard, but very feasible on instances that are further away from the classification boundary. The task appears to use language-specific features only to a small extent, but the model that has experienced the same speakers before the final fine-tuning still performs visibly better than the model that has not. Power status prediction is the hardest of all four tasks, and appears to rely on language-specific features, again profiting additionally from experiencing the speakers prior to the final fine-tuning. Analysing the accuracy by speaker shows that the power status model seems to have caught on to political orientation rather than the language of power itself. For modelling that phenomenon, a dataset controlling for political orientation should be constructed, which requires a much wider data range than is currently available.
We are releasing the benchmark definitions, to be coupled with the full ParlaSpeech-HR dataset (Ljubešić et al., 2022), in a GitHub repository (https://github.com/clarinsi/parlaspeech-hr-benchmark/).

Acknowledgements
This work has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu project). This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains. This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

7. References
Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, and Martijn Wieling. 2022. Neural representations for modeling variation in speech. Journal of Phonetics, 92:101137.
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, et al. 2022. XTREME-S: Evaluating cross-lingual speech representations. arXiv preprint arXiv:2203.10752.
Zhiyun Fan, Meng Li, Shiyu Zhou, and Bo Xu. 2020. Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185.
John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93:27403.
Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy. 2020. Automatic speaker profiling from short duration speech data. Speech Communication, 121:16–28.
Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, and Bojan Evkoski. 2022. ASR training dataset for Croatian ParlaSpeech-HR v1.0. Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1494.
Nikola Ljubešić, Danijel Korzinek, Peter Rupnik, and Ivo-Pavao Jazbec. 2022. ParlaSpeech-HR – a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. In Proceedings of the Third ParlaCLARIN Workshop, Marseille, France.
Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
Leonardo Pepino, Pablo Riera, and Luciana Ferrer. 2021. Emotion recognition from speech using wav2vec 2.0 embeddings. arXiv preprint arXiv:2104.03502.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. 2021. SUPERB: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051.
Cross-Level Semantic Similarity in Newswire Texts and Software Code
Comments: Insights from Serbian Data in the AVANTES Project
Maja Miličević Petrović,* Vuk Batanović,† Radoslava Trnavac,‡ Borko Kovačević‡
* Department of Interpreting and Translation, University of Bologna
Corso della Repubblica 136, 47121 Forlì
maja.milicevic2@unibo.it
† Innovation Center of the School of Electrical Engineering, University of Belgrade
Bulevar kralja Aleksandra 73, 11120 Belgrade
vuk.batanovic@ic.etf.bg.ac.rs
‡ Faculty of Philology, University of Belgrade
Studentski trg 3, 11000 Belgrade
radoslava.trnavac@fil.bg.ac.rs, borko.kovacevic@fil.bg.ac.rs
Abstract
This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development – AVANTES, intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the level of semantic overlap between texts of different lengths, and it also refers to the problem of establishing such a measure automatically. The problem was first formulated about a decade ago, but research on it has been sparse and limited to English. The AVANTES project aims to change this through the study of CLSS in Serbian, focusing on two different text domains – newswire and software code comments – and on two text length combinations – phrase-sentence and sentence-paragraph. We present and compare two newly created datasets, describing the process of their annotation with fine-grained semantic similarity scores, and outlining a preliminary linguistic analysis. We also give an overview of the ongoing detailed linguistic annotation targeted at detecting the core linguistic indicators of CLSS.
1. Introduction
One of the central meaning-related tasks in Natural Language Processing (NLP) is Semantic Textual Similarity (STS; Agirre et al., 2012). The goal of STS is to establish the extent to which the meanings of two short texts are similar to each other, which is typically encoded as a numerical score on a Likert scale. The similarity scores can subsequently be used in more complex tasks, such as Question Answering (Risch et al., 2021) or Text Summarisation (Mnasri et al., 2017).
In the related task of Cross-Level Semantic Similarity (CLSS), the goal is to contrast texts of non-matching size, such as a phrase and a sentence, or a sentence and a paragraph. CLSS was first formulated as a SemEval shared task by Jurgens et al. (2014), who saw it as a generalisation of STS to items of different lengths. Clearly, the length discrepancy brings an additional level of complexity: as longer texts tend to carry a greater amount of salient information than shorter texts, CLSS can be understood as aiming to measure how well the meaning of the longer text is summarised in the shorter one.
Previous work on CLSS has generally been sparse and, to the best of our knowledge, focused entirely on English. In addition, there is a large discrepancy between the NLP models, which are based on linguistically opaque text properties, and linguistic analyses of semantic similarity. The main aim of this paper is to describe the first non-English annotated CLSS datasets, CLSS.news.sr and CLSS.codecomments.sr, developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development – AVANTES. Both datasets comprise phrase-sentence and sentence-paragraph text pairs in Serbian, and both are (being) manually annotated for CLSS. After providing some background, we describe the dataset creation and CLSS annotation, outline a preliminary linguistic analysis, and explain how the linguistic properties identified as relevant for recognising different similarity levels are being annotated further, with a view to improving linguistic descriptions of semantic similarity and testing linguistically informed NLP models.

2. Related work
Previous studies of CLSS are few. The NLP task was introduced by Jurgens et al. (2014, 2016), who provided the first annotated datasets for English, composed of text pairs of different lengths (paragraph to sentence, sentence to phrase, phrase to word, and word to sense), in genres including newswire, travel, scientific, review, and others. The initial datasets were re-used in subsequent work on developing and evaluating CLSS methods, either at specific levels (e.g., Rekabsaz et al., 2017 for sentence to paragraph) or regardless of text length (e.g., Pilehvar and Navigli, 2015). Among related tasks, Conforti et al. (2018) dealt with the problem of cross-level stance detection, where the stance target is a sentence, and the text to be evaluated is a long document.
In Serbian, previous work on semantic similarity has been relatively limited. Batanović et al. (2011) and Furlan et al. (2013) introduced paraphrase.sr, a corpus of Serbian newswire texts manually annotated with binary similarity judgments; they also used it to train and evaluate several paraphrase identification approaches. Batanović et al. (2018) extended this dataset with fine-grained similarity scores, using the resulting STS.news.sr corpus to compare several automatic models. Finally, Batanović (2020) showed that multilingual pre-trained models such as multilingual BERT (Devlin et al., 2019) outperform all traditional methods, while Batanović (2021) obtained even better results using BERT's counterpart for Serbian and other closely related languages, BERTić (Ljubešić and Lauc, 2021).
In terms of linguistic analysis, semantic similarity is not systematically defined and described, and the contributing phenomena tend to be explored in isolation from each other
(e.g., synonymy in lexical semantics, diathesis alternations in morphosyntax). A somewhat more integrated approach is found with regard to the neighbouring notion of paraphrase, intended as a relation of (near-)equivalence of meaning between phrases and/or sentences (Mel'čuk, 2012: 46), i.e. as an instance of high semantic similarity (albeit a non-symmetrical one). According to Milićević (2007), paraphrases can be of different types based on the nature of the information that underlies the equivalence (linguistic vs. extra-linguistic), the level of linguistic representation involved (morphology, lexicon, semantics, syntax), and the depth of the relation. A detailed typology of the changes involved in paraphrase has been proposed by Vila Rigat (2013) and Vila et al. (2014) in view of the NLP task of automatic paraphrase detection. This typology combines several criteria and multiple levels of granularity into a taxonomy that will be presented in more detail in Section 4.2, as the basis for our linguistic analysis of CLSS.

3. Datasets and CLSS annotation
The corpora of phrase-sentence and sentence-paragraph text pairs presented in this paper are developed within the AVANTES project. The aim of this project is to complement the analysis of correspondences between blocks of source code, written in a programming language, with an analysis of the level of semantic similarity between their respective documentation comments, written in a natural language (English or Serbian), with the goal of detecting code similarity and clones. A CLSS setup is highly appropriate for this textual similarity task due to arbitrary comment length, which can range from single words to phrases, sentences and entire paragraphs. Since the language used in comments is known to diverge from the standard language, for instance in being syntactically incomplete (Zemankova and Eastman, 1980), we add to our study setup CLSS in the standard language, choosing newswire texts as its representative.
In the context of the project, comparative analyses are planned both between text domains and between languages. For this reason, it was important to establish a common methodology for the creation and annotation of the datasets. Since the only pre-existing CLSS dataset was the SemEval one for English, we adopted the approach of Jurgens et al. (2014) as a (partial) model for our work. We retained their five-point similarity scale, with scores ranging from 0 to 4, as well as their definitions for each score: 0 – unrelated, 1 – slightly related, 2 – somewhat related but not similar, 3 – somewhat similar, 4 – very similar. However, we altered the method of text pair construction. Namely, while Jurgens et al. (2014) provided annotators with a longer text and asked them to generate a shorter one with a designated similarity score in mind, we pre-prepared numerous text samples of different lengths (phrases, sentences, and paragraphs), and asked the annotators to combine these texts into phrase-sentence and sentence-paragraph pairs, aiming for a balanced score distribution among the pairs they construct. The main motivation for this choice was that the generation of texts by annotators would have been very difficult to implement in the domain of source code comments, given the highly technical and often project-specific terminology encountered in them. At the same time, our approach prevented a potential paraphrasing bias that the annotators could have inadvertently introduced.

3.1. CLSS.news.sr
The initial texts for the CLSS.news.sr dataset were obtained from the Serbian news aggregator website naslovi.net. This website provides a headline and an introductory paragraph for each news report; a subhead is frequently included too. For our corpus, we treated the headlines as source material for phrases, subheads as source material for sentences, and introductory paragraphs as source material for paragraphs, exploiting the journalistic convention that the beginning sections of an article commonly provide a summary of its content; the same approach was used in the construction of multiple other newswire STS and paraphrasing corpora (Dolan et al., 2004). Since news items are commonly reported differently by different media outlets, cross-linking the texts of different reports allowed for the creation of text pairs with varying degrees of semantic similarity. Close to 18,000 news reports, published between June and August 2021, were scraped using the scrapy Python library (http://scrapy.org/), to ensure the annotators had a sufficient quantity of raw text available for creating adequate pairs. To ensure comparability with the SemEval dataset, our target dataset size was 1,000 phrase-sentence and 1,000 sentence-paragraph pairs.
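As an illustration of the collection step, a scrapy spider of the following shape can extract the three text units from an aggregator page. The CSS selectors are hypothetical, since the actual markup of naslovi.net is not described here; this is a sketch, not the crawler that was used.

import scrapy

class NasloviSpider(scrapy.Spider):
    name = "naslovi"
    start_urls = ["https://naslovi.net/"]

    def parse(self, response):
        # Selectors below are placeholders for the real page structure.
        for item in response.css("article"):
            yield {
                "headline": item.css("h2::text").get(),
                "subhead": item.css(".subhead::text").get(),
                "intro": item.css("p.lead::text").get(),
            }

Run e.g. with: scrapy runspider naslovi_spider.py -o reports.jsonl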
analysis of correspondences between blocks of source code,
The construction of the 2,000 text pairs was divided
written in a programming language, with an analysis of the
between five annotators, who were either trained linguists
level of semantic similarity between their respective
or had previous experience with text annotation for the
documentation comments, written in a natural language
closely related STS task. Even though they received text
(English or Serbian), with the goal of detecting code
samples pre-classified based on length, they were
similarity and clones. A CLSS setup is highly appropriate for
instructed to evaluate whether an item in a certain category
the textual similarity task due to arbitrary comment length,
really was a phrase, a sentence, or a paragraph, and were
which can range from single words to phrases, sentences and
allowed to change the categorisation. Paragraphs were
entire paragraphs. Since the language used in comments is
defined as text containing a minimum of two sentences
known to diverge from the standard language, for instance in
(where only complete sentences were to be taken into
being syntactically incomplete (Zemankova and Eastman,
account). A sentence had to contain at least one finite verb
1980), we add to our study setup CLSS in standard language,
form, whereas a phrase was not allowed to contain finite
choosing newswire texts as its representative.
verbs (non-finite forms such as infinitives and participles
In the context of the project, comparative analyses are
were allowed, as were deverbal nouns).
planned both between text domains and between languages.
The annotators were provided with the similarity score
For this reason, it was important to establish a common
definitions and SemEval examples to help them interpret
methodology for the creation and annotation of datasets.
each score. Since these examples proved insufficient to
Since the only pre-existing CLSS dataset was the SemEval
ensure high annotation consistency, the outputs were
one for English, we adopted the approach of Jurgens et al.
calibrated by having all annotators create a smaller set of
(2014) as a (partial) model for our work. We retained their
five to six representative pairs for each similarity score and
five-point similarity scale, with scores ranging from 0 to 4,
each length pairing. These pairs were reviewed by project
as well as their definitions for each score: 0 – unrelated, 1
researchers and feedback was provided regarding any
– slightly related, 2 – somewhat related but not similar, 3 –
issues encountered. The following step was the compilation
somewhat similar, 4 – very similar. However, we altered
of a detailed set of examples, three per similarity score and
the method of text pair construction. Namely, while Jurgens
length pairing, using the agreed upon representative pairs
et al. (2014) provided annotators with a longer text and
from all annotators. This set, the score definitions and
asked them to generate a shorter one with a designated
general instructions became an integral part of the final
similarity score in mind, we pre-prepared numerous text
annotation guidelines for our task, available in the dataset
samples of different lengths (phrases, sentences, and
repository in Serbian (original) and English (translation).2
paragraphs), and asked the annotators to combine these
A subset of examples is shown in Table 1.
texts into phrase-sentence and sentence-paragraph pairs,
The annotators were subsequently asked to construct a
aiming for a balanced score distribution for the pairs they
total of 200 pairs for each text length combination, trying
construct. The main motivation for this choice was that the
to include both pairs clearly corresponding to a specific
generation of texts by annotators would have been very
score, and less clear-cut ones. The resulting 2,000 cross-
difficult to implement in the domain of source code
level text pairs were labelled with semantic similarity
comments, given the highly technical and often project-
scores by all five annotators, using the STSAnno tool
specific terminology encountered in them. At the same
(Batanović et al., 2018). The final score for each pair was
time, our approach prevented a potential paraphrasing bias
calculated by averaging the scores of all individual
that the annotators could inadvertently introduce.
annotators. Obtaining multiple parallel annotations and
1 http://scrapy.org/
2 http://vukbatanovic.github.io/CLSS.news.sr/
Obtaining multiple parallel annotations and averaging them out was chosen instead of relying on an adjudicated double annotation (used for the SemEval dataset) in order to minimise individual annotators' biases. In addition, while Jurgens et al. (2014) allowed finer-grained score distinctions using multiples of 0.25, in our setup with five annotators this was not necessary.
The final CLSS.news.sr dataset comprises 30 thousand tokens in the phrase-sentence subset, and 86 thousand tokens in the sentence-paragraph subset. The average sentence length is ~22 tokens in the sentence-paragraph pairs and ~23 tokens in the phrase-sentence ones. The average phrase length is ~6 tokens, while the average paragraph length is ~64 tokens. The average similarity scores are close to the scale's mean value of 2: 1.91 in the sentence-paragraph subset, and 1.96 in the phrase-sentence subset. The distribution of the different scores is fairly uniform, especially for the phrase-sentence pairs; the peaks include a marked one around 0, and a less evident one around 3.
The annotation (self-)agreement levels are very high. For the phrase-sentence subset, the agreement between each annotator and the mean of the other annotators' scores yields a Krippendorff's alpha coefficient of α = 0.929, while the Pearson and the Spearman correlation coefficients are equal, r = ρ = 0.938. In the case of sentence-paragraph pairs these values are α = 0.922, r = 0.937 and ρ = 0.934. More details and a comparison with the English SemEval dataset are reported in Batanović and Miličević Petrović (2022).
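The reported agreement figures can be reproduced with standard tooling; the sketch below shows the pattern on toy scores (five annotators as rows, pairs as columns), using the krippendorff package and scipy, and is not the project's evaluation script.

import numpy as np
import krippendorff  # pip install krippendorff
from scipy.stats import pearsonr, spearmanr

scores = np.array([          # toy annotations, not the real data
    [4, 2, 0, 3, 1],
    [4, 2, 1, 3, 1],
    [3, 2, 0, 3, 0],
    [4, 1, 0, 3, 1],
    [4, 2, 0, 2, 1],
])
alpha = krippendorff.alpha(reliability_data=scores,
                           level_of_measurement="interval")

# Correlate one annotator with the mean of the remaining annotators.
first = scores[0]
rest_mean = scores[1:].mean(axis=0)
r, _ = pearsonr(first, rest_mean)
rho, _ = spearmanr(first, rest_mean)
print(alpha, r, rho)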
Score 4
  Phrase: Veliki požar na železničkoj stanici u Londonu (A large fire at a London railway station)
  Sentence: Veliki požar izbio je danas na metro stanici u centralnom delu Londona. (A large fire broke out today at an underground station in central London.)
Score 3
  Phrase: Novi nacionalni praznik: Džuntint (A new national holiday: Juneteenth)
  Sentence: Američki Kongres usvojio je predlog zakona prema kojem je 19. jun proglašen praznikom u znak sećanja na kraj ropstva i odlazak poslednjih robova 1865. godine u državi Teksas. (The American Congress passed a draft law declaring 19 June a holiday to commemorate the end of slavery and the liberation of the last slaves in 1865 in the state of Texas.)
Score 2
  Phrase: Veliki problem za Portugal (A major problem for Portugal)
  Sentence: Loše vesti stižu za Portugal pred start Evropskog prvenstva. (Bad news arrives for Portugal just before the start of the European Championship.)
Score 1
  Phrase: Svađa pred svadbu (A pre-wedding argument)
  Sentence: Mirko Šijan i Bojana Rodić uskoro očekuju svoje prvo dete, a uveliko se sprema i njihova svadba. (Mirko Šijan and Bojana Rodić are expecting their first child soon, and their wedding is being prepared.)
Score 0
  Phrase: Otvaranje silosa u Zrenjaninu (A silo opening in Zrenjanin)
  Sentence: Maja Žeželj, voditeljka, ispričala je kako je svojevremeno jedva izvukla živu glavu. (Maja Žeželj, TV presenter, told the story of how some time ago she nearly died.)
Table 1: Guideline examples of phrase-sentence pairs in the newswire dataset for each similarity score.

3.2. CLSS.codecomments.sr
A particularly innovative part of the work conducted in the AVANTES project is the creation of a corpus of software code comments, to be made publicly available for download and use in testing NLP models once the annotation of semantic similarity is completed. The sources that the code comment dataset was drawn from include public repositories such as GitHub, student projects, coursework and teaching materials from various computing courses at the School of Electrical Engineering of the University of Belgrade and other academic institutions in Serbia, as well as software projects developed at the Computing Center of the School of Electrical Engineering. In order to prevent our work from being focused on the specificities of a single programming language or programming paradigm, we opted to collect comments from eight programming languages: C, C++, C#, Java, JavaScript/TypeScript, MATLAB, Python, and SQL.
We focused on manually pre-selecting only those code comments that describe the functionality of particular sections of code, ranging from individual code lines, to methods and functions, to classes and entire modules. To do so, we relied on a newly designed taxonomy for differentiating between types of code comments (Kostić et al., 2022), which includes the following code comment categories: Code, Functional-Inline, Functional-Method, Functional-Module, General, IDE, Notice and ToDo. The initial data collection and pre-selection were performed by master's degree students at the School of Electrical Engineering of the University of Belgrade, as part of their course project for the Natural Language Processing course. In total, after all duplicate entries were removed, 9,395 code comments belonging to the Functional categories were identified. These include 6,455 Functional-Inline comments, which describe the functionality of individual code lines or code passages, 1,829 Functional-Method comments, which address the functionality of functions and class methods, and 1,111 Functional-Module comments, which are related to the functionality of entire code modules and classes.
the newswire dataset for each similarity score.
roughly divided into candidates for phrases, sentences, and
paragraphs on the basis of a set of heuristics. Using
The final CLSS.news.sr dataset comprises 30 thousand
whitespace tokenisation, we treated all texts with up to six
tokens in the phrase-sentence subset, and 86 thousand
tokens as candidates for phrases. All texts containing more
tokens in the sentence-paragraph subset. The average
than six tokens, but limited to a single sentence, were
sentence length is ~22 tokens in the sentence-paragraph
treated as candidates for sentences, while those with more
pairs and ~23 tokens in the phrase-sentence ones. The
than one sentence were considered paragraph candidates.
average phrase length is ~6 tokens, while the average
The number of sentences was determined using a regular
paragraph length is ~64 tokens. The average similarity
expression that treated question marks, exclamation marks,
scores are close to the scale’s mean value of 2: 1.91 in the
and periods outside of URLs and decimal numbers as
sentence-paragraph subset, and 1.96 in the phrase-sentence
sentence boundaries. Using this procedure, the text set was
subset. The distribution of different scores is fairly uniform,
divided into 4,880 phrase candidates, 3,592 sentence
especially for the phrase-sentence pairs; the peaks include
candidates, and 923 paragraph candidates.
a marked one around 0, and a less evident one around 3.
Due to the high domain specificity of code comments,
The annotation (self-)agreement levels are very high. For
we entrusted the creation of CLSS pairs to two experienced
the phrase-sentence subset, the average binary agreement
programmers. They used the provided candidate texts to form the pairs, but were instructed to carefully evaluate whether each sample truly belonged to its automatically assigned length grouping. Such an evaluation was necessary because complete standard sentences and paragraphs were rarely encountered in the data. Instead, we found that despite having a sentence-like function in the comment, many texts are not true sentences in the linguistic sense – they do not follow any punctuation rules and they lack a predicate, or possess it only implicitly (e.g., @author Tim 2 or Naziv komponente 'Component name' within a paragraph item). Similarly, paragraphs in the code comment domain are often separated into units not via standard punctuation, but rather by using visual boundaries, such as moving to a new line in the source file, or (repeatedly) using special characters (e.g., * or ###). Limiting our text selection to a rigid definition of sentences and paragraphs would thus not only have reduced the size of the dataset, but would also have led to the exclusion of numerous domain-specific phenomena, significantly impacting our linguistic analyses of code comments. We therefore decided to count as paragraphs texts consisting of at least two clearly identifiable units, even if those units were not true sentences. Similarly, we expanded the sentence set with texts containing an implicit predicate, as well as with those containing subordinate clauses without a main clause (e.g., relative clauses such as: Metode koje se odnose na simulaciju procesa 'Methods that refer to process simulation').
multiple parallel annotations. Since this work is still in
progress, our linguistic analyses of CLSS.codecomments.sr
Score
Examples
in this paper will be based on the individual similarity
scores assigned by the two programmers who constructed
Računanje površine pravougaonika
the text pairs.
Calculating the area of a rectangle
4
Površina pravougaonika po formuli je a * b
4. Linguistic analysis
The area of a rectangle according to the formula
The NLP algorithms used in automatic treatment of
is a * b
semantic similarity rely on different types of information,
POMOCNA FUNKCIJA
including linguistic features. While state-of-the-art models
AUXILIARY FUNCTION
3
such as multilingual BERT and BERTić reach performances
Fajl koji pruza pomocne funkcije
that correlate highly with human scores, with coefficients
A file that provides auxiliary functions
r,ρ > 0.9 for CLSS on Serbian newswire texts (Batanović
ubrzano kretanje
and Miličević Petrović, 2022), they lack linguistic
accelerated movement
transparency and are of limited help in understanding the
relative contributions of different levels of language
2
Zelimo da se ogranicimo od mogucnosti da se
ubrzano krece.
structure and different specific features. Since one of the
We want to limit the possibility of accelerated
aims of the AVANTES project is to combine NLP with
movement.
linguistic knowledge, we conduct two types of linguistic
Update dokumenta
analyses on the datasets. A preliminary qualitative analysis
Document update
is performed to gain initial insight into the data and help
1
Ovaj program formira html dokument
decide on the specifics of detailed annotation of semantic
This program forms an html document
similarity indicators (to be followed by a quantitative
analysis of the annotated datasets).
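As a reference point, such model-human correlations can be computed as in the following sketch; the score vectors below are invented for illustration only:

    from scipy.stats import pearsonr, spearmanr

    # Gold similarity scores and hypothetical system predictions for the
    # same text pairs.
    gold = [4, 3, 2, 1, 0, 3, 2]
    pred = [3.8, 2.9, 2.2, 0.7, 0.4, 3.1, 1.6]

    r, _ = pearsonr(gold, pred)
    rho, _ = spearmanr(gold, pred)
    print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")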
4.1. A qualitative overview

A qualitative linguistic analysis was performed on a random sample of ten text pairs per score, for both CLSS.news.sr and CLSS.codecomments.sr, and for both phrase-sentence and sentence-paragraph pairs. In the case of newswire texts, items that received the same score by all annotators were selected; an approach focused on clear-cut cases was deemed useful as a first step in the analysis given its goals of verifying both the linguistic relevance of the similarity scores and the taxonomy for more detailed linguistic annotation. For comments, the initial scores
assigned by programmers were used for selection. The analysis consisted in a comparison of information content between the pairs' components, as well as a study of vocabulary overlaps (or lack thereof). Its goal was to get an initial grasp of the data and help define a taxonomy to base a more elaborate analysis on.

For both corpora and both types of comparisons, the pairs marked 4 are characterised by the occurrence of the same distinctive vocabulary items: personal names and/or numbers (newswire), or specialised terms (comments). The form is often not identical, but the items involved are clearly relatable on morphological grounds (e.g., they are inflectional forms of the same noun, as in Kragujevcu.LOC – Kragujevca.GEN 'Kragujevac', parametre.ACC – parametrima.INS 'parameters', or a noun and a denominal adjective, as in Vlasotincu.N – vlasotinačkom.ADJ '(of) Vlasotince').3 The shared numbers are mostly large and either quite specific or used in a collocation (e.g., 100.620, or 3.000 dinara '3000 dinars'). Overlaps in common lexical words are also frequently based on morphologically related rather than identical forms (e.g., stiglo.PAST.PART – stići.INF 'arrive', novozaraženih 'newly infected' – novih slučajeva zaraze 'new cases of infection', filtriranje 'filtering' – filtar 'filter'). A number of synonyms are found (potvrda – sertifikat 'certificate', promenljiva – varijabla 'variable'), sometimes involving a Serbian and an English word (mreža – grid 'grid'), and sometimes within different collocations based on the same term (e.g., toplotni talas – talas vrućina 'heat wave', zoom levela – stepena zoom-a 'zoom level'). Overall, most lexical words from the smaller unit are present in the larger one, which also contains other elements that describe the situation in more detail, but without adding entirely new topics (u Londonu 'in London' – u centralnom delu Londona 'in central London'; funkcija sa parametrima 'a function with parameters' – funkcija koja nije f(void), vec prima parametre 'a function that is not f(void), but accepts parameters').

3 Abbreviations used: LOC – locative; GEN – genitive; ACC – accusative; INS – instrumental; ADJ – adjective; N – noun; PAST.PART – past participle; INF – infinitive; V – verb.

Score 3 items are distinguished by similar properties in terms of shared lexis, and especially personal names and specialised terms, but with entirely new information in the longer item, and/or partly different information in the components of the pair, leading to a less marked overall vocabulary overlap (e.g., Neuralna mreza 'neural network' – vanila neuralna mreza koja se obucava pomocu genetskog algoritma 'vanilla neural network which is trained via a genetic algorithm'). Near-synonyms appear to be more common in score 3 pairs (reč 'word' – termin 'term', nov ugovor 'new contract' – produžetak saradnje 'extension of collaboration'). In both score 4 and score 3 items, the head noun of the phrase tends to appear as the subject or the object of the sentence predicate, or it is a deverbal noun that corresponds to the predicate (unos.N – unosi.V 'input'). The predicate is typically the same in sentence-paragraph pairs, with additional predicates in the paragraph item.

Among less similar pairs, those marked 2 are somewhat mixed, as they either contain different personal names/specialised terms and similar common vocabulary, or vice versa (Tropski pakao u Beogradu 'tropical hell in Belgrade' – I sutra će u Novom Sadu biti veoma toplo 'It will again be very warm in Novi Sad tomorrow'; prekid rekurzije 'interruption of recursion' – ako ima decu onda idemo rekurzivni poziv 'if it has children then we do a recursive call'). The predicate of the sentence item is typically not related to the head noun of the phrase item. The pairs marked 1 and 0 contain barely any overlapping personal names or specialised terms. Score 1 items do share some common lexical words, but synonyms, near-synonyms, and terms from the same wider semantic field are more present than words that are identical or morphologically closely related (e.g., tragedija 'tragedy' – nesreća 'accident', pljuskovi 'showers' – kiša 'rain'). Items marked 0 typically do not share any lexical words.

When it comes to differences between the two corpora, in CLSS.news.sr it is often the case that the relatedness of lexical items in the pair is based on real world knowledge (largely about something happening at the time of writing) rather than on linguistic information (e.g., vakcinacija 'vaccination' – virus korona 'corona virus', Tokio 'Tokyo' – Olimpijske igre 'Olympic games'), especially in items assigned a score below 3. CLSS.codecomments.sr, on the other hand, is characterised by various non-standard features, such as inconsistent spelling (popup vs. pop-up), missing diacritics (cita for čita 'reads'), inflectional endings on English words inconsistently spelt with/without a dash (zoom-a, workspace-u vs. levela), non-standard abbreviations (f-ja for funkcija 'function'), or phonetic transcription of English terms (eksepšn 'exception').4

4 Many of the features found in code comments are shared with computer-mediated communication in Serbian (see Miličević Petrović et al., 2017).

4.2. Linguistic annotation

Using the preliminary analysis outlined above and the existing paraphrase typologies (primarily Vila Rigat, 2013; Vila et al., 2014; also Milićević, 2007; Mel'čuk, 2012), we propose a taxonomy of semantic similarity types and indicators, shown and illustrated in Table 3; most examples are taken directly or adapted from our corpora (examples for two clear indicators are omitted to save space). The initial focus is on the nature of the information that similarity is based on, and a core distinction is made between linguistic, quasi-linguistic and extralinguistic similarity types. This is at the same time one of the main points of divergence between our approach and the one by Vila Rigat (2013) and Vila et al. (2014), who acknowledge the existence of non-linguistic paraphrase, but do not include it in their core typology; we rely on Milićević (2007) and Mel'čuk (2012) for these types. Another difference with respect to previous work is that our taxonomy makes reference to similarity indicators, while changes are invoked in previous work, due to paraphrase being perceived as involving a source and a target item.

Linguistic similarity is based on language-internal information at the word/lexical unit level (i.e., the morpholexicon), the level of structural organisation, and the level of meaning (i.e., semantics). The first two types have two subtypes each: morphology- and lexicon-based and syntax- and discourse-based indicators respectively; the indicator types and subtypes thus follow the classical organisation in formal levels of linguistic analysis. Finally, the indicator names in the last column of Table 3 denote specific mechanisms through which semantic similarity is established. Following Vila et al. (2014), our assumption is that the indicators reveal what triggers semantic similarity at the micro level. In other words, unlike the similarity
scores assigned to pairs of items as wholes (i.e., to entire phrases, sentences, or paragraphs), the linguistic taxonomy targets individual phenomena that cumulatively contribute to the overall score, where such individual elements are not mutually exclusive and several can be co-present.

Similarity type | Indicator type | Indicator subtype | Indicator (example)
Linguistic | Morpholexicon-based | Morphology-based | Identical (požar – požar 'fire'); Inflectional (parametre.ACC – parametrima.INS 'parameters'); Derivational (Vlasotincu.N – vlasotinačkom.ADJ '(of) Vlasotince')
Linguistic | Morpholexicon-based | Lexicon-based | Spelling and format (pop-up – popup); Synthetic/analytic (novozaraženih 'newly infected' – novih slučajeva zaraze 'new cases of infection'); Same polarity: Synonymy (potvrda – sertifikat 'certificate'), Near-synonymy (reč 'word' – termin 'term'), Hyponymy (škoda 'Škoda' – automobil 'car'), Meronymy (Vašington 'Washington' – SAD 'USA'); Opposite polarity (izgubio 'lost' – nije uspeo da pobedi 'failed to win'); Converse (pogibija dva pešaka 'death of two pedestrians' – usmrtio pešake 'killed the pedestrians')
Linguistic | Structure-based | Syntax-based | Diathesis alternations (opljačkali su stan 'robbed the flat' – stan je opljačkan 'the flat was robbed'); Coordination changes; Subordination and nesting changes
Linguistic | Structure-based | Discourse-based | Punctuation (Potpis dana - Aleksandar Kolarov! 'Signature of the day - Aleksandar Kolarov!' – Aleksandar Kolarov potpisao novi ugovor 'Aleksandar Kolarov signed a new contract'); Direct/indirect style (Bilčik ocenjuje da vežbe ne pomažu 'Bilčík states that the military exercises do not help' – Bilčik ukazuje da vesti o vežbi "nisu od pomoći" 'Bilčík points out that the news of a military exercise "is not helpful"'); Sentence modality (maske više nisu obavezne? 'masks no longer compulsory?' – neće biti obavezne zaštitne maske 'protective masks will not be compulsory')
Linguistic | Semantics-based | – | (Tropski pakao 'tropical hell' – biti veoma toplo 'be very warm')
Linguistic | Miscellaneous | – | Change of order (klasa singleton – Singleton patern 'singleton class/pattern'); Addition/deletion (funkcija za sortiranje 'sorting function' – metoda koja sortira uzetu matricu 'the method that sorts the given matrix')
Quasi-linguistic | Pragmatic | – | (Scattered showers are very likely – Bring your umbrella; Mel'čuk, 2012: 60)
Extralinguistic | Situational | – | (Besplatno kroz Severnu Makedoniju od danas 'Free travel through North Macedonia from today' – Novina od 15. juna 'New rules from 15 June')
Extralinguistic | Encyclopaedic | – | (Italija 'Italy (the team)' – ekipa sa Apenina 'the team from the Apennine Mountains')
Extralinguistic | Logical | – | (Još pola dinara za veknu hleba 'Half a dinar more for a loaf of bread' – Cena hleba visa za 20% 'The price of bread higher by 20%'; Milićević, 2007: 145)

Table 3: Overview of the taxonomy of semantic similarity (the examples are drawn from CLSS.news.sr/CLSS.codecomments.sr, or from the literature).

Looking more closely at the indicator subtypes, morphology-based indicators concern the morphological form of words, capturing complete equivalence, as well as inflectional and derivational relations, i.e. different forms of the same word or changes of category via derivational
affixes. The identical indicator is not present under the morphology heading in Vila Rigat (2013) and Vila et al. (2014), who categorise it as a "paraphrase extreme", which is a special type in their taxonomy, capturing longer chunks of text; we add it based on the preliminary analysis presented in Section 4.1, which revealed that identical individual words are common in highly similar items in CLSS. Additional information that could prove useful concerns parts of speech, the distinction between personal and common nouns, as well as information on general vs. specialised vocabulary. Given that the identification of specialised terminology would require work that goes beyond the scope of the current project, we are still evaluating the possibility of including it in the analysis.

Lexicon-based indicators are somewhat more varied, ranging from different spellings of the same words, to synthetic and analytic expressions of the same meaning, and to lexical semantic relations in the narrow sense. Same polarity items constitute the most complex group of lexical relations, comprising synonymy as a similarity relation par excellence, near-synonymy, hyponymy (the relationship between superordinate/more general and subordinate/more specific lexical items), and meronymy (a part-whole relation). Opposite polarity relations are based on antonym pairs with opposite comparative words, or with one of the components negated. Finally, a converse relation captures complementary actions whose arguments are reversed.

Syntax-based indicators capture those relations that imply a syntactic reorganisation in the sentence; they can be found within single sentences, or in the way multiple sentences are connected. Specific cases include instances of diathesis alternations (such as the active/passive alternation), coordination (where coordinated units are present in one member of the pair, but not in the other), and subordination or nesting (where subordinate/nested elements are present in only one item). The second subtype of structural changes, discourse-based indicators, does not affect the sentential arguments, but is instead related to elements such as punctuation and formatting (beyond single lexical units), affirmative vs. interrogative sentence modality, and direct vs. indirect speech.

The semantics-based subtype is also distinguished by going beyond the level of individual lexical items, as it concerns phrase/sentence-level meaning. No subtypes or specific indicators are singled out, as this level of analysis refers generally to the distribution of semantic content across lexical units, and it can involve multiple and varied formal changes that lead to different lexicalisations of the same meaning units. The boundaries between semantics-based and lexicon-based similarity indicators are not always clear-cut, but it is generally the case that lexicon-based indicators concern individual words or multiword units, while semantics-based similarity relies on multiple lexical items.

The last type of linguistic indicators is classified as miscellaneous, given that it captures phenomena that do concern the linguistic structure of items, but do not clearly belong to a single level of linguistic analysis. Change of order and addition/deletion are found here as specific indicator types, the former involving units with the same content expressed using different word orders, and the latter based on added or omitted information. Both indicators concern at least syntax and discourse; given the cross-level setup, the latter is particularly important for our datasets.

Beyond the linguistic structure, the quasi-linguistic domain captures inference-based similarity that relies on pragmatic information. The core linguistic meanings and the extralinguistic referents are different in this case, but the meaning of one element in the pair can still be inferred from the meaning of the other. Given the nature of our texts, this type of similarity is expected to be infrequent, and we have so far not identified any examples; however, we leave this category in our taxonomy to possibly be applied in the annotation phase. The extralinguistic domain also entails inequality of linguistic meaning, but it involves information equivalence between two texts, i.e. reference to the same real-world situation. It requires knowledge external to language for similarity to be recognised; this knowledge can be situational (containing elements such as today or here), encyclopaedic (involving general knowledge), or logical (requiring calculations or other similar operations). Based on the initial analyses of our datasets, this is a common type of similarity, especially in newswire texts.

Keeping the above definitions in mind, the outlined taxonomy will be applied to the CLSS.news.sr and CLSS.codecomments.sr corpora. Detailed guidelines are currently being developed, and the texts (initially from CLSS.news.sr) are being prepared for word/segment-level annotation with semantic similarity indicators, within the identified pairs. The annotation will be performed by the project researchers, first as a double procedure on a smaller sample, and then individually once a satisfactory level of agreement is reached. The initial phase will at the same time enable us to verify the appropriateness of the taxonomy, and adapt it should the need arise. The annotated datasets will be used for empirically validating the taxonomy, for gaining a better understanding of the linguistic factors that carry the most weight in cross-level semantic similarity in different text genres, and for learning how this kind of information can be taken into account in NLP models. Based on previous work on paraphrase and a preliminary exploration of our data at text level (with entire pairs marked for indicator presence/absence), morphological indicators, addition/deletion and same polarity items are expected to be particularly prominent.

5. Concluding remarks

In this paper, we have described the first non-English CLSS corpora, CLSS.news.sr and CLSS.codecomments.sr. The focus was on the methodology used to construct and annotate the data, as well as on their initial linguistic analysis. We believe these two datasets to be an important resource for Cross-Level Semantic Similarity research, not only in virtue of representing a new language, but also due to introducing an underexplored text genre (source code comments), and due to dedicating substantial attention to the linguistic properties of the datasets.

Our planned next steps are to complete the CLSS annotation of code comments, implement the proposed linguistic taxonomy of semantic similarity in the annotation of both datasets, conduct a more extensive linguistic analysis based on the annotated data, and examine the impact of linguistic traits on the performances of automatic CLSS models. Another goal is to compare the results to those obtained on similar datasets for English, using the SemEval dataset for newswire, and our own dataset (which is currently being created) for source code comments.
6. Acknowledgements

The AVANTES project (Advancing Novel Textual Similarity-based Solutions in Software Development) is supported by the Science Fund of the Republic of Serbia, grant no. 6526093, within the "Program for Development of Projects in the Field of Artificial Intelligence". The authors would like to thank Jelica Cincović and Dušan Stojković for constructing the code comment text pairs, as well as Bojan Jakovljević, Lazar Milić, Marija Lazarević, Ognjen Krešić, and Vanja Miljković for annotating the corpora with semantic similarity scores.

7. References

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 385–393, Montreal, Canada. Association for Computational Linguistics.
Vuk Batanović. 2020. A Methodology for Solving Semantic Tasks in the Processing of Short Texts Written in Natural Languages with Limited Resources. Ph.D. thesis, University of Belgrade.
Vuk Batanović. 2021. Semantic similarity and sentiment analysis of short texts in Serbian. In: Proceedings of the 29th Telecommunications Forum (TELFOR 2021), Belgrade, Serbia. IEEE.
Vuk Batanović and Maja Miličević Petrović. 2022. Cross-Level Semantic Similarity for Serbian Newswire Texts. In: Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. European Language Resources Association.
Vuk Batanović, Miloš Cvetanović, and Boško Nikolić. 2018. Fine-grained Semantic Textual Similarity for Serbian. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1370–78, Miyazaki, Japan. European Language Resources Association.
Vuk Batanović, Bojan Furlan, and Boško Nikolić. 2011. A software system for determining the semantic similarity of short texts in Serbian. In: Proceedings of the 19th Telecommunications Forum (TELFOR 2011), pages 1249–52, Belgrade, Serbia. IEEE.
Costanza Conforti, Mohammad Taher Pilehvar, and Nigel Collier. 2018. Towards automatic fake news detection: Cross-level stance detection in news articles. In: Proceedings of the First Workshop on Fact Extraction and VERification, pages 40–49, Brussels, Belgium. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT 2019, pages 4171–86, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora. In: Proceedings of the 20th International Conference on Computational Linguistics, pages 350–56, Geneva, Switzerland. Association for Computational Linguistics.
Bojan Furlan, Vuk Batanović, and Boško Nikolić. 2013. Semantic similarity of short texts in languages with a deficient natural language processing support. Decision Support Systems, 55(3):710–19.
David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 Task 3: Cross-Level Semantic Similarity. In: Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014), pages 17–26, Dublin, Ireland. Association for Computational Linguistics.
David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Cross Level Semantic Similarity: An Evaluation Framework for Universal Measures of Similarity. Language Resources and Evaluation, 50(1):5–33.
Marija Kostić, Aleksa Srbljanović, Vuk Batanović, and Boško Nikolić. 2022. Code Comment Classification Taxonomies. In: Proceedings of the Ninth IcETRAN Conference, Novi Pazar, Serbia.
Nikola Ljubešić and Davor Lauc. 2021. BERTić – The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2021), pages 37–42, Kiev, Ukraine. Association for Computational Linguistics.
Igor A. Mel'čuk. 2012. Semantics. From Meaning to Text. John Benjamins, Amsterdam.
Maja Miličević Petrović, Nikola Ljubešić, and Darja Fišer. 2017. Nestandardno zapisivanje srpskog jezika na Tviteru: mnogo buke oko malo odstupanja? [Non-standard spelling of Serbian on Twitter: much ado about few deviations?] Anali Filološkog fakulteta, 29(2):111–36.
Jasmina Milićević. 2007. La paraphrase. Peter Lang, Bern.
Maâli Mnasri, Gaël de Chalendar, and Olivier Ferret. 2017. Taking into account Inter-sentence Similarity for Update Summarization. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 204–209, Taipei, Taiwan. Association for Computational Linguistics.
Mohammad Taher Pilehvar and Roberto Navigli. 2015. From senses to texts: An all-in-one graph-based approach for measuring semantic similarity. Artificial Intelligence, 228:95–128.
Navid Rekabsaz, Ralf Bierig, Mihai Lupu, and Allan Hanbury. 2017. Toward optimized multimodal concept indexing. In: N. Nguyen, R. Kowalczyk, A. Pinto, and J. Cardoso, eds., Transactions on Computational Collective Intelligence XXVI, pages 144–61, Cham. Springer International Publishing.
Julian Risch, Timo Möller, Julian Gutsch, and Malte Pietsch. 2021. Semantic answer similarity for evaluating question answering models. In: Proceedings of the Third Workshop on Machine Reading for Question Answering, pages 149–57, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Marta Vila Rigat. 2013. Paraphrase Scope and Typology. A Data-Driven Approach from Computational Linguistics. Ph.D. thesis, University of Barcelona.
Marta Vila, M. Antonia Martí, and Horacio Rodríguez. 2014. Is this a paraphrase? What kind? Paraphrase boundaries and typology. Open Journal of Modern Linguistics, 4:205–18.
Marie Zemankova and Caroline M. Eastman. 1980. Comparative lexical analysis of FORTRAN code, code comments and English text. In: Proceedings of the 18th Annual Southeast Regional Conference, pages 193–97, Tallahassee, Florida, USA. Association for Computing Machinery.
The ParlaSent-BCS Dataset of Sentiment-annotated Parliamentary Debates
from Bosnia and Herzegovina, Croatia, and Serbia
Michal Mochtak,∗ Peter Rupnik,† Nikola Ljubešić†‡
∗Institute of Political Science
University of Luxembourg
2 avenue de l’Université, L-4366 Esch-sur-Alzette
michal.mochtak@uni.lu
† Department of Knowledge Technologies
Jožef Stefan Institute
Jamova cesta 39, SI-1000 Ljubljana
peter.rupnik@ijs.si
nikola.ljubesic@ijs.si
‡Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, SI-1000 Ljubljana
Abstract
Expression of sentiment in parliamentary debates is deemed to be significantly different from that on social media or in product reviews.
This paper adds to an emerging body of research on parliamentary debates with a dataset of sentences annotated for the detection of sentiment polarity in political discourse. We sample the sentences for annotation from the proceedings of three Southeast European parliaments: Croatia, Bosnia and Herzegovina, and Serbia. A six-level annotation schema is applied to the data with the aim of training a classification model for the detection of sentiment in parliamentary proceedings. Krippendorff's alpha measuring the inter-annotator agreement ranges from 0.6 for the six-level annotation schema to 0.75 for the three-level schema and 0.83 for the two-level schema. Our initial experiments on the dataset show that transformer models perform significantly better than those using a simpler architecture. Furthermore, regardless of the similarity of the three languages, we observe differences in performance across them. Performing parliament-specific training and evaluation shows that the main reason for the differing performance between parliaments seems to be the varying complexity of the automatic classification task, which is not observable in annotator performance.
Language distance does not seem to play any role in either annotator or automatic classification performance. We release the dataset and the best-performing models under permissive licences.
1. Introduction

Emotions and sentiment in political discourse are deemed to be as crucial and influential as the substantive policies promoted by elected representatives (Young and Soroka, 2012). Since the golden era of research on propaganda (Lasswell, 1927; Shils and Janowitz, 1948), a number of scholars have demonstrated the growing role of emotions in affective polarization in politics, with negative consequences for the stability of democratic institutions and social cohesion (Garrett et al., 2014; Iyengar et al., 2019; Mason, 2015). With the booming popularity of online media, sentiment analysis has become an indispensable tool for understanding the positions of viewers and customers, but also voters (Soler et al., 2012). It has allowed all sorts of entrepreneurs to know their target audience like never before (Ceron et al., 2019). Experts on political communication argue that the way we receive information and how we process it play an important role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses (Liu and Lei, 2018). Emotions and sentiment simply do play an important role in political arenas, and politicians have been (ab)using them for decades.

Although there is a general agreement among political scientists that sentiment analysis represents a critical component for understanding political communication in general (Young and Soroka, 2012; Flores, 2017; Tumasjan et al., 2010), empirical applications outside the English-speaking world are still rare (Rauh, 2018; Mohammad, 2021). This is especially the case for studies analyzing political discourse in low-resourced languages, where the lack of out-of-the-box tools creates a huge barrier for social scientists to do such research in the first place (Proksch et al., 2019; Mochtak et al., 2020; Rauh, 2018). The paper, therefore, aims to contribute to the stream of applied research on sentiment analysis in political discourse in low-resourced languages. The goal is to present a new annotated dataset compiled for machine-learning applications focused on the detection of sentiment polarity in the political discourse of three Southeast European (SEE) countries: Bosnia and Herzegovina, Croatia, and Serbia. We further use the dataset to train different classification models for sentiment analysis, applying different schemas and settings to demonstrate the benefits and limitations of the dataset and the trained models. We release the dataset and the best-performing models under permissive licenses to facilitate
further research and more empirically oriented projects. In general, the paper, the dataset, and the models contribute to an emerging community of research outputs on parliamentary debates with a focus on sentence-level sentiment annotation, with future downstream applications in mind.

2. Dataset construction

2.1. Focus on sentences

The dataset we compile and then use for training different classification models focuses on sentence-level data and utilizes a sentence-centric approach for capturing sentiment polarity. The strategy goes against the tradition in mainstream research applications in the social sciences, which focus either on longer pieces of text (e.g. utterances or "speech segments", or whole documents (Bansal et al., 2008; Thomas et al., 2006)) or on coherent messages of a shorter nature (e.g. tweets (Tumasjan et al., 2010; Flores, 2017)). That approach, however, creates certain limitations when it comes to political debates in national parliaments, where speeches range from very short comments counting only a handful of sentences to long monologues of thousands of words. Moreover, as a longer text may contain a multitude of sentiments, any annotation attempt must generalize over them, introducing a complex coder bias which is embedded in any subsequent analysis. The sentence-centric approach attempts to refocus the attention on individual sentences capturing attitudes, emotions, and sentiment positions, and to use them as lower-level indices of sentiment polarity in a more complex political narrative. Although sentences cannot capture complex meanings as paragraphs or whole documents do, they usually carry coherent ideas with a relevant sentiment affinity. This approach stems from a tradition of content analysis in political science which focuses both on political messages and on their role in political discourse in general (Burst et al., 2022; Hutter et al., 2016; Koopmans and Statham, 2006).

Unlike most of the literature, which approaches sentiment analysis in political discourse as a proxy for position-taking stances or as a scaling indicator (Abercrombie and Batista-Navarro, 2020b; Glavaš et al., 2017; Proksch et al., 2019), the general sentence-level classifier we aim for in this paper has a more holistic (and narrower) aim. Rather than focusing on a specific policy or issue area, the task is to assign a correct sentiment category to sentence-level data in political discourse with the highest possible accuracy. Only when a well-performing model exists can a downstream task be discussed. We believe this is a much more versatile approach which opens a wide range of possibilities for understanding the context of political concepts as well as their role in political discourse. Furthermore, sentences as lower semantic units can be aggregated to the level of paragraphs or whole documents, which is often impossible the other way around (document → sentences). Although sentences as the basic level of analysis are less common in social sciences research when it comes to computational methods (Abercrombie and Batista-Navarro, 2020b), practical applications in other areas exist, covering topics such as validation of sentiment dictionaries (Rauh, 2018), ethos mining (Duthie and Budzynska, 2018), opinion mining (Naderi and Hirst, 2016), or detection of sentiment carrying sentences (Onyimadu et al., 2013).

2.2. Background data

In order to compile a dataset of political sentiment for manual annotation and then use it for training classification models for real world applications, we sampled sentences from three corpora of parliamentary proceedings in the region of former Yugoslavia – Bosnia and Herzegovina (Mochtak et al., 2022c),1 Croatia (Mochtak et al., 2022a),2 and Serbia (Mochtak et al., 2022b).3 The Bosnian corpus contains speeches collected on the federal level from the official website of the Parliamentary Assembly of Bosnia and Herzegovina (Parlamentarna skupština BiH, 2020). Both chambers are included – the House of Representatives (Predstavnički dom / Zastupnički dom) and the House of Peoples (Dom naroda). The corpus covers the period from 1998 to 2018 (2nd – 7th term) and counts 127,713 speeches. The Croatian corpus of parliamentary debates covers debates in the Croatian parliament (Sabor) from 2003 to 2020 (5th – 9th term) and counts 481,508 speeches (Hrvatski sabor, 2020). Finally, the Serbian corpus contains 321,103 speeches from the National Assembly of Serbia (Skupština) over the period of 1997 to 2020 (4th – 11th term) (Otvoreni Parlament, 2020).

1 https://doi.org/10.5281/zenodo.6517697
2 https://doi.org/10.5281/zenodo.6521372
3 https://doi.org/10.5281/zenodo.6521648

2.3. Data sampling

Each speech was processed using the CLASSLA-Stanza tool (Ljubešić and Dobrovoljc, 2019) with the tokenizers available for Croatian and Serbian, in order to extract individual sentences as the basic unit of our analysis. In the next step, we kept only sentences presented by actual speakers, excluding moderators of the parliamentary sessions. All sentences were then merged into one meta dataset. As we want to sample what can be understood as "average sentences", we further subset the sentence meta corpus to only those sentences whose number of tokens lies within the first and third frequency quartile (i.e. within the interquartile range) of the original corpus (∼3.8M sentences).
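A minimal sketch of this filtering step (the function and variable names are illustrative; the actual pipeline operated on CLASSLA-Stanza output):

    import statistics

    def iqr_filter(sentences):
        """Keep only 'average sentences': token counts within the corpus IQR."""
        lengths = sorted(len(s.split()) for s in sentences)
        q1, _, q3 = statistics.quantiles(lengths, n=4)
        return [s for s in sentences if q1 <= len(s.split()) <= q3]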
Having the set of "average sentences", we used the Croatian gold standard sentiment lexicon created by Glavaš et al. (2012), translated it to Serbian with a rule-based Croatian-Serbian translator (Klubička et al., 2016), combined both lexicons, extracted unique entries with a single sentiment affinity, and used them as seed words for sampling sentences for manual annotation. The final pool of seed words contains 381 positive and 239 negative words (neutral words are excluded). These seed words are used for stratified random sampling, which gives us 867 sentences with negative seed word(s), 867 sentences with positive seed word(s), and 866 sentences with neither positive nor negative seed words (supposedly having neutral sentiment). We sample 2,600 sentences in total for manual annotation. The only stratum we use is the size of the original corpora (i.e. the number of sentences per corpus). With this we sample 1,388 sentences from the Croatian parliament, 1,059
sentences from the Serbian parliament, and 153 sentences from the Bosnian parliament.

2.4. Annotation schema

The annotation schema for labelling sentence-level data was adopted from Batanović et al. (2020), who propose a six-item scale for the annotation of sentiment polarity in short texts. The schema was originally developed for and applied to SentiComments.SR, a corpus of movie comments in Serbian, and is particularly suitable for low-resourced languages. The annotation schema contains six sentiment labels (Batanović et al., 2020: 6):

• +1 (Positive in our dataset) for sentences that are entirely or predominantly positive

• –1 (Negative in our dataset) for sentences that are entirely or predominantly negative

• +M (M Positive in our dataset) for sentences that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the positive sentiment in a strict binary classification

• –M (M Negative in our dataset) for sentences that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the negative sentiment in a strict binary classification

• +NS (P Neutral in our dataset) for sentences that only contain non-sentiment-related statements, but still lean more towards the positive sentiment in a strict binary classification

• –NS (N Neutral in our dataset) for sentences that only contain non-sentiment-related statements, but still lean more towards the negative sentiment in a strict binary classification

The different naming convention we have applied in our dataset serves primarily practical purposes: the 3-way classification can be obtained by taking into consideration only the second part of the string (if an underscore is present).
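Assuming the labels are stored with an underscore (e.g. M_Positive, P_Neutral), the reduction can be sketched in a single line:

    def three_way(label: str) -> str:
        """Reduce a six-item label to the positive/negative/neutral scheme.

        "M_Positive" -> "Positive", "P_Neutral" -> "Neutral", while the plain
        "Positive" and "Negative" labels pass through unchanged.
        """
        return label.split("_")[-1]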
Additionally, we also follow the original schema, which allowed marking text deemed sarcastic with a code "sarcasm". The benefit of the whole annotation logic is that it was designed with versatility in mind, allowing the sentiment label set to be reduced in subsequent processing if needed. That includes various reductions concerning polarity categorization, subjective/objective categorization, a change of the number of categories, or sarcasm detection. This is important for the various empirical tests we perform in the following sections.

2.5. Data annotation

Data were annotated in two waves, with 1,300 instances being annotated in each. Annotation was done via a custom online app. The first batch of 1,300 sentences was annotated by two annotators, both native speakers of Croatian, while the second batch was annotated only by one of them. The inter-annotator agreement (IAA) measured using Krippendorff's alpha in the first round was 0.599 for the full six-item annotation schema, 0.745 for the three-item annotation schema (positive/negative/neutral), and 0.829 for the two-item annotation schema focused on the detection of only negative sentiment (negative/other). The particular focus on negative sentiment in the test setting is inspired by a stream of research in political communication which argues that negative emotions appear to be particularly prominent in the formation of the human psyche and its role in politics (Young and Soroka, 2012). More specifically, political psychologists have found that negative political information has a more profound effect on attitudes than positive information, as it is easier to recall and is more useful in heuristic cognitive processing for simpler tasks (Baumeister et al., 2001; Utych, 2018).

Before the second annotator moved on to annotate the second batch of instances, hard disagreements, i.e. disagreements pointing at a different three-class sentiment, where +NS and -NS are considered neutral, were resolved together by both annotators through a reconciliation procedure.

The final distribution of the three-class labels in the whole dataset, as well as across specific parliaments, is given in Table 1. The presented distributions show that, regardless of the lexicon-based sampling, the negative class is still by far the most pervasive category, which might be even more the case in a randomly sampled dataset, something we leave for future work.

parliament | positive | neutral | negative
all | 470 | 772 | 1,358
HR | 261 | 433 | 694
BS | 27 | 42 | 84
SR | 182 | 297 | 580

Table 1: Distribution of the three-class labels in the whole dataset, as well as across each of the three parliaments.

2.6. Dataset encoding

The final dataset, available through the CLARIN.SI repository, contains the following metadata:

• sentence that is annotated

• country of origin of the sentence

• annotation round (first, second)

• annotation of annotator1 with one of the labels from the annotation schema presented in Section 2.4.

• annotation of annotator2 following the same annotation schema

• annotation given during reconciliation of hard disagreements

• the three-way label (positive, negative, neutral), where +NS and -NS labels are mapped to the neutral class
• the document id the sentence comes from

• the sentence id of the sentence

• the date the speech was given

• the name, party, gender, and birth year of the speaker

• the split (train, dev, or test) the instance has been assigned to (described in more detail in Section 3.1.)

The final dataset is organized in the JSONL format (each line in the file being a JSON entry) and is available under the CC-BY-SA 4.0 license.4

4 http://hdl.handle.net/11356/1585
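Reading the release is straightforward; a sketch (the file name and exact field spellings are assumptions based on the metadata list above):

    import json

    with open("parlasent_bcs.jsonl", encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    train = [r for r in records if r.get("split") == "train"]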
3. Experiments

3.1. Data splits

For performing current and future experiments, the dataset was split into the train, development and test subsets. The development subset consists of 150 instances, while the test subset consists of 300 instances, both using instances from the first annotation round, where two annotations per instance and hard disagreement reconciliations are available. The training data consists of the remainder of the data from the first annotation round and all instances from the second annotation round, summing to 2,150 instances.

While splitting the data, stratification was performed on the variables of three-way sentiment, country, and party. With this we can be reasonably sure that no specific strong bias regarding sentiment, country or political party is present in any of the three subsets.
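A sketch of such a split with scikit-learn, using a joint stratification key built from the three variables (column names are assumptions; records is the list read in the sketch above):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame(records)  # one row per annotated sentence
    # Joint stratification key; very rare key values may need to be merged
    # for stratification to succeed.
    key = df["label"] + "|" + df["country"] + "|" + df["party"]

    rest, test = train_test_split(df, test_size=300, stratify=key,
                                  random_state=42)
    train, dev = train_test_split(rest, test_size=150,
                                  stratify=key.loc[rest.index],
                                  random_state=42)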
3.2. Experimental setup

In our experiments we investigate the following questions: (1) how well can different technologies learn our three-way classification task, (2) what is the difference in performance depending on which parliament the model is trained or tested on, and (3) is the annotation quality of the best performing technology high enough to be useful for data enrichment and analysis.

We investigate our first question by comparing the results of the following classifiers: fastText (Joulin et al., 2016) with pre-trained CLARIN.SI word embeddings (Ljubešić, 2018), the multilingual transformer model XLM-RoBERTa (Conneau et al., 2019),5 the transformer model pre-trained on Croatian, Slovenian and English, cseBERT (Ulčar and Robnik-Šikonja, 2020),6 and the transformer model pre-trained on Croatian, Bosnian, Montenegrin and Serbian, BERTić (Ljubešić and Lauc, 2021).7 Our expectation is for the last model to perform best, given that it was pre-trained on the most data from the three languages. However, this assumption has to be checked, given that for some tasks even models pre-trained on many languages obtain performance that is comparable to otherwise superior models pre-trained on one or a few languages (Kuzman et al., 2022).

5 https://huggingface.co/xlm-roberta-base
6 https://huggingface.co/EMBEDDIA/crosloengual-bert
7 https://huggingface.co/classla/bcms-bertic

While comparing the different classification techniques, each model was optimized for the epoch-number hyperparameter on the development data, while all other hyperparameters were kept at their defaults. For training the transformers, the simpletransformers library8 was used.

8 https://simpletransformers.ai
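A sketch of this setup with the simpletransformers API (BERTić is an ELECTRA-style model, hence the model type; train_df and dev_df are assumed pandas DataFrames with text and numeric label columns, and the epoch count shown is a placeholder to be tuned on the development set):

    from simpletransformers.classification import ClassificationModel

    model = ClassificationModel(
        "electra", "classla/bcms-bertic", num_labels=3,
        args={"num_train_epochs": 10,  # placeholder; tuned on dev data
              "overwrite_output_dir": True},
    )
    model.train_model(train_df)  # columns: "text", "labels" (0/1/2)
    result, outputs, wrong = model.eval_model(dev_df)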
The second question, on parliament specificity, we answer by training separate models on Croatian sentences only and on Serbian sentences only, evaluating each model both on Croatian and on Serbian test sentences. We further evaluate the model trained on all training instances separately on instances coming from each of the three parliaments.

For our third question, on the usefulness of the model for data analysis, we report confusion matrices, to inform potential downstream users of the model's per-category performance.

4. Results

4.1. Classifier comparison

We report the results of our text classification technology comparison in Table 2. The results show that transformer models are by far more capable than the fastText technology relying on static embeddings only. Of the three transformer models, the multilingual XLM-RoBERTa model shows a large performance gap to the two best-performing models. Comparing the cseBERT and the BERTić model, the latter manages to come out on top with a moderate improvement of 1.5 points in macro-F1. The difference in the results of the two models is statistically significant according to the Mann-Whitney U test (Mann and Whitney, 1947), with a p-value of 0.0053.

model | macro-F1
classla/bcms-bertic | 0.7941 ± 0.0101**
EMBEDDIA/crosloengual-bert | 0.7709 ± 0.0113
xlm-roberta-base | 0.7184 ± 0.0139
fastText + CLARIN.SI embeddings | 0.6312 ± 0.0043

Table 2: Results of the comparison of various text classification technologies. We report the macro-F1 mean and standard deviation over 6 runs with the model-specific optimal number of training epochs. The distributions of results of the two best performing models are compared with the Mann-Whitney U test (** p < 0.01).
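The reported comparison can be reproduced along the following lines (the per-run macro-F1 values are invented, since Table 2 gives only means and standard deviations):

    from scipy.stats import mannwhitneyu

    bertic = [0.790, 0.802, 0.794, 0.781, 0.806, 0.792]
    csebert = [0.770, 0.762, 0.785, 0.768, 0.775, 0.766]
    stat, p = mannwhitneyu(bertic, csebert, alternative="two-sided")
    print(f"U = {stat}, p = {p:.4f}")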
4.2. Parliament dependence

We next investigate how the results depend on which parliament the training and the testing data come from. Our initial assumption was that the results are dependent on whether the training and the testing data come from the same or a different parliament, with same-parliament results being higher. We also investigate how the model trained on all data performs on parliament-specific test data.

4.2.1. Impact of training data

We perform this analysis on all three transformer models from Section 4.1., hoping to obtain a deeper understanding of the parliament dependence of our task. We train and test on data from the Croatian and the Serbian parliament only, as the Bosnian parliament's data are not large enough to enable model training.

In Table 3 we report the results grouped by model and training and testing parliament. To our surprise, the strongest factor shows not to be whether the training and testing data come from the same parliament, but what testing data are used, regardless of the training data. This trend is observed regardless of the model used.

XLM-RoBERTa
train \ test | HR | SR
HR | 0.7296 ± 0.0251 | 0.6128 ± 0.0341
SR | 0.7323 ± 0.0282 | 0.6487 ± 0.0203

cseBERT
train \ test | HR | SR
HR | 0.7748 ± 0.0174 | 0.7146 ± 0.0175
SR | 0.7762 ± 0.0114 | 0.6989 ± 0.0275

BERTić
train \ test | HR | SR
HR | 0.8147 ± 0.0083 | 0.7249 ± 0.0105
SR | 0.7953 ± 0.0207 | 0.7130 ± 0.0278

Table 3: Comparison of the three transformer models when trained and tested on data from the Croatian or Serbian parliament. Average macro-F1 and standard deviation over 6 runs is reported.

The results show that Serbian test data seem to be harder to classify, regardless of what training data are used, with a difference of ∼9 points in macro-F1 for the BERTić and the XLM-RoBERTa models. The difference is smaller for the cseBERT model, ∼7 points, but still shows the same trend as the two other models.

We have additionally explored the possibility of a complexity bias of the Serbian test data in comparison to the Serbian training data by performing different data splits, but the results obtained were very similar to those presented here. Serbian data seem to be harder to classify in general, which is observed when performing inference over Serbian data. Training over Serbian data still results in a model comparably strong to that based on Croatian training data. Important to note is that the Croatian data subset is 30% larger than the Serbian one.

To test whether the complexity of the Serbian data goes back to challenges during data annotation, or whether it is rather the models that struggle with inference over Serbian data, we calculated the Krippendorff IAA on data from each parliament separately. The agreement calculation over the ternary classification schema resulted in an IAA of 0.69 for Bosnian data, 0.733 for Croatian data, and 0.77 for Serbian data. This shows that the annotators themselves did not struggle with the Serbian data, as these had the highest IAA.
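Such a per-parliament computation can be sketched with the krippendorff package (the codings below are made up for illustration):

    import numpy as np
    import krippendorff

    # One row per annotator, one column per sentence; labels coded as
    # 0 = negative, 1 = neutral, 2 = positive, np.nan = not annotated.
    codings = np.array([
        [0, 1, 2, 2, np.nan, 1, 0],
        [0, 1, 2, 1, 0, 1, 0],
    ])
    alpha = krippendorff.alpha(reliability_data=codings,
                               level_of_measurement="nominal")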
We also tested whether there is excessive sarcasm in the Serbian data, which might affect the model's performance. The dataset contains two sarcastic instances from the parliament of Bosnia and Herzegovina and 16 for both Croatia and Serbia, which means sarcasm can hardly explain the overall lower performance on Serbian test data. Lastly, we checked the type-token ratio (TTR) on samples of Croatian and Serbian sentences to estimate the lexical richness of each subset, a higher lexical richness of Serbian (via a higher type-token ratio) possibly explaining the lower results obtained on Serbian test data. By calculating the type-token ratio on 100 tokens selected from random sentences, and repeating the process 100 times in a bootstrapping manner, we obtained a result of 0.833 for Serbian and 0.839 for Croatian. This shows the Croatian part of the dataset to be just slightly more lexically rich (83.9 different tokens among 100 tokens on average) than the Serbian part (83.3 different tokens among 100), which does not explain the difference in performance of various classifiers on Serbian data.
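A sketch of this bootstrap procedure:

    import random

    def bootstrap_ttr(tokens, sample_size=100, n_iter=100, seed=0):
        """Mean type-token ratio over repeated fixed-size token samples."""
        rng = random.Random(seed)
        return sum(
            len(set(rng.sample(tokens, sample_size))) / sample_size
            for _ in range(n_iter)
        ) / n_iter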
The complexity of the Serbian data that can be observed in the evaluation is thus due to some effect that we did not manage to identify at this point, but that will have to be taken under consideration in future work on this dataset.

4.2.2. Impact of testing data

In the next set of experiments, we compare the performance of BERTić classifiers trained over all training data, but evaluated on all and on per-parliament testing data. Beyond this, we train models over the ternary schema that we have used until now (positive vs. neutral vs. negative), but also over the binary schema (negative vs. rest), given our special interest in identifying negative sentences, as already discussed in Section 2.5.

We report results on test data from each of the three parliaments, including the Bosnian one, which, however, contains only 18 testing instances, so these results have to be taken with caution.

test | ternary | binary
all | 0.7941 ± 0.0101 | 0.8999 ± 0.0120
HR | 0.8260 ± 0.0186 | 0.9221 ± 0.0153
BS | 0.7578 ± 0.0679 | 0.9071 ± 0.0525
SR | 0.7385 ± 0.0170 | 0.8660 ± 0.0150

Table 4: Average macro-F1 and standard deviation of 6 runs of the BERTić model, trained on all training data, and evaluated on varying testing data.

The results presented in Table 4 show again that the Serbian data seem to be the hardest to classify even when all training data are used. Bosnian results are somewhat close to the Serbian ones, but caution is required here due to the very small test set. This level of necessary caution regarding Bosnian test data is also visible from the five times higher standard deviation in comparison to the results of the two other parliaments. Croatian data seem to be easiest to classify, with an absolute difference of 9 points between the performance on Serbian and Croatian test data. Regarding the binary classification results, these are, as expected, higher than those of the ternary classification schema, with a macro-F1 of 0.9 when all data are used. The relationship between specific parliaments is very similar to that observed using the ternary schema.
Row-normalised confusion matrix of the BERTić (classla/bcms-bertic) results on the ternary schema (true label in rows, predicted label in columns):

True \ Predicted    Negative    Neutral    Positive
Negative            0.88        0.075      0.05
Neutral             0.099       0.74       0.16
Positive            0.094       0.1        0.81

Raw-count confusion matrix on the ternary schema:

True \ Predicted    Negative    Neutral    Positive
Negative            962         82         55
Neutral             61          454        101
Positive            36          39         310

Figure 1: Row-normalised and raw-count confusion matrix of the BERTić results on the ternary schema.

Row-normalised confusion matrix on the binary schema:

True \ Predicted    Negative    Other
Negative            0.88        0.12
Other               0.083       0.92

Raw-count confusion matrix on the binary schema:

True \ Predicted    Negative    Other
Negative            972         127
Other               83          918

Figure 2: Row-normalised and raw-count confusion matrix of the BERTić results on the binary schema.
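Matrices like the ones above are easy to recompute from per-sentence predictions. Below is a minimal sketch using scikit-learn on toy labels rather than the paper's data; it also shows how ternary output can be collapsed to the binary negative-vs-rest schema compared in Section 4.3.

```python
from sklearn.metrics import confusion_matrix

labels = ["Negative", "Neutral", "Positive"]
y_true = ["Negative", "Neutral", "Positive", "Negative", "Neutral"]  # toy gold labels
y_pred = ["Negative", "Neutral", "Negative", "Negative", "Positive"]  # toy predictions

# Row-normalised and raw-count matrices on the ternary schema.
print(confusion_matrix(y_true, y_pred, labels=labels, normalize="true"))
print(confusion_matrix(y_true, y_pred, labels=labels))

# Collapsing ternary labels to the binary negative-vs-rest schema.
def to_binary(ys):
    return ["Negative" if y == "Negative" else "Other" for y in ys]

print(confusion_matrix(to_binary(y_true), to_binary(y_pred), labels=["Negative", "Other"]))
```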
4.3. Per-category analysis

Our final set of experiments investigates the per-category performance on both the ternary and the binary classification schema. We present the confusion matrices on the ternary schema, one row-normalized, another with raw counts, in Figure 1. As anticipated, the classifier works best on the negative class, with 88% of negative instances properly classified as negative. Second in performance is the positive class, with 81% of positive instances labelled as such, while among the neutral instances 3 out of 4 are correctly classified. Most of the confusion between classes occurs, as expected, between the neutral class and either of the two remaining classes.

The binary confusion matrices, presented in Figure 2, show a rather balanced performance on both categories. On each of the categories recall is around 0.9, with a similar precision given the symmetry of the confusions.

When comparing the output of the ternary and the binary model, the ternary model output mapped to a binary schema performs slightly worse than the binary model, meaning that practitioners should apply the binary model if they are interested only in distinguishing between negative and other sentences.

Although any direct comparisons are hard to make, the few existing studies which performed text classification on sentence-level data report much worse results. Rauh (2018) found that when three annotators and three sentiment dictionaries were compared on a ternary classification task (positive/negative/neutral), they agreed only on one quarter of the 1,500 sentences. Using heuristic classifiers based on statistical and syntactic clues, Onyimadu et al. (2013) found that, on average, only 43% of the sentences were correctly annotated for their sentiment affinity. The results of our experiments are therefore certainly promising. Especially when it comes to the classification of negative sentences, the model has a 1-in-10 sentence error rate, which is almost on par with the quality of annotation performed by human coders.

5. Conclusion

The paper introduces a sentence-level dataset of parliamentary proceedings, manually annotated for sentiment via a six-level schema. Good inter-annotator agreement is reported, and the first results on the automation of the task are very promising, with a macro-F1 of ∼0.8 on the ternary schema and ∼0.9 on the binary schema. A difference in performance across the three parliaments is observed, but visible only during inference, Serbian data being harder to make predictions on, while for modelling, all parliaments seem to be similarly useful. One limitation of our work is the following: our testing data have been sampled in the same way as the whole dataset, with a bias towards mid-length sentences and sentences containing sentiment words. Future work should consider preparing a sample of random sentences or, even better, consecutive sentences, so that the potential issue of a lack of wider context during manual data annotation is successfully mitigated as well.
In general, the reported results have several promising implications for applied research in political science. First of all, they allow a more fine-grained analysis of political concepts and their context. A good example is a combination of the KWIC approach with sentiment analysis, with a focus on examining the tone of a message in political discourse. This is interesting for both qualitatively and quantitatively oriented scholars. Especially the possibility of extracting a numeric assessment from the classification model (e.g. class probability) is particularly promising for all sorts of hypothesis-testing statistical models. Moreover, sentence-level analysis can be combined with the findings of various information and discourse theories for studying political discourse focused on rhetoric and narratives (e.g. the beginning and end of a speech being more relevant than what comes in the middle). Apart from concept-driven analysis, the classification model can be used for various research problems ranging from policy position-taking to ideology detection or general scaling tasks (Abercrombie and Batista-Navarro, 2020a; Glavaš et al., 2017; Proksch et al., 2019). Although each of these tasks requires proper testing, the performance of the trained models for such applications is undoubtedly promising.

As a part of our future work, we plan to test the usefulness of the predictions on a set of downstream tasks. The goal is to analyze the data from all three parliaments (Bosnia and Herzegovina, Croatia, and Serbia) in a series of tests focused on replicating the results of existing research based mostly on English data. Given the results we obtained, we aim to continue our research using the setup with the model trained on cross-country data. Furthermore, the three corpora we have used in this paper will be extended as a part of the ParlaMint II project.

We make the ternary and binary BERTić models trained on all available training data available via the HuggingFace repository9,10 and make the dataset available through the CLARIN.SI repository (Mochtak et al., 2022d).

9 https://huggingface.co/classla/bcms-bertic-parlasent-bcs-ter
10 https://huggingface.co/classla/bcms-bertic-parlasent-bcs-bi

Acknowledgements

This work has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu project). This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains. This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

6. References

Gavin Abercrombie and Riza Batista-Navarro. 2020a. ParlVote: A corpus for sentiment analysis of political debates. In: Proceedings of the 12th Language Resources and Evaluation Conference, pages 5073–5078, Marseille, France. European Language Resources Association.
Gavin Abercrombie and Riza Batista-Navarro. 2020b. Sentiment and position-taking analysis of parliamentary debates: A systematic literature review. Journal of Computational Social Science, 3(1):245–270.
Mohit Bansal, Claire Cardie, and Lillian Lee. 2008. The power of negative thinking: Exploiting label disagreement in the Min-cut classification framework. In: Coling 2008: Companion volume: Posters, pages 15–18, Manchester, UK. Coling 2008 Organizing Committee.
Vuk Batanović, Miloš Cvetanović, and Boško Nikolić. 2020. A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts. PLOS ONE, 15(11):e0242050.
Roy F. Baumeister, Ellen Bratslavsky, Catrin Finkenauer, and Kathleen D. Vohs. 2001. Bad is Stronger than Good. Review of General Psychology, 5(4):323–370.
Tobias Burst, Werner Krause, Pola Lehmann, Jirka Lewandowski, Theres Matthieß, Nicolas Merz, Sven Regel, and Lisa Zehnter. 2022. Manifesto corpus.
Andrea Ceron, Luigi Curini, and Stefano M. Iacus. 2019. Politics and Big Data: Nowcasting and Forecasting Elections with Social Media. Routledge, Abingdon, New York.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale.
Rory Duthie and Katarzyna Budzynska. 2018. A deep modular RNN approach for ethos mining. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, pages 4041–4047. AAAI Press.
René D. Flores. 2017. Do Anti-Immigrant Laws Shape Public Sentiment? A Study of Arizona's SB 1070 Using Twitter Data. American Journal of Sociology, 123(2):333–384.
R. Kelly Garrett, Shira Dvir Gvirsman, Benjamin K. Johnson, Yariv Tsfati, Rachel Neo, and Aysenur Dal. 2014. Implications of pro- and counterattitudinal information exposure for affective polarization. Human Communication Research, 40(3):309–332.
Goran Glavaš, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. Semi-supervised acquisition of Croatian sentiment lexicon. In: International Conference on Text, Speech and Dialogue, pages 166–173. Springer.
Goran Glavaš, Federico Nanni, and Simone Paolo Ponzetto. 2017. Unsupervised cross-lingual scaling of political texts. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 688–693, Valencia, Spain. Association for Computational Linguistics.
Hrvatski sabor. 2020. eDoc. http://edoc.sabor.hr/.
Swen Hutter, Edgar Grande, and Hanspeter Kriesi. 2016. Politicising Europe: Integration and Mass Politics. Cambridge University Press, Cambridge.
Shanto Iyengar, Yphtach Lelkes, Matthew Levendusky, Neil Malhotra, and Sean J. Westwood. 2019. The Origins and Consequences of Affective Polarization in the United States. Annual Review of Political Science, 22(1):129–146.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Filip Klubička, Gema Ramírez-Sánchez, and Nikola Ljubešić. 2016. Collaborative development of a rule-based machine translator between Croatian and Serbian. In: Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 361–367.
Ruud Koopmans and Paul Statham. 2006. Political Claims Analysis: Integrating Protest Event and Political Discourse Approaches. Mobilization: An International Quarterly, 4(2):203–221.
Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2022. The GINCO training dataset for web genre identification of documents out in the wild. ArXiv, abs/2201.03857.
Harold Dwight Lasswell. 1927. Propaganda Technique in the World War. Peter Smith, New York.
Dilin Liu and Lei Lei. 2018. The appeal to political sentiment: An analysis of Donald Trump's and Hillary Clinton's speech themes and discourse strategies in the 2016 US presidential election. Discourse, Context & Media, 25:143–152.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy, August. Association for Computational Linguistics.
Nikola Ljubešić and Davor Lauc. 2021. BERTić – the transformer language model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, Kiyv, Ukraine, April. Association for Computational Linguistics.
Nikola Ljubešić. 2018. Word embeddings CLARIN.SI-embed.hr 1.0. Slovenian language resource repository CLARIN.SI.
Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60.
Lilliana Mason. 2015. "I Disrespectfully Agree": The Differential Effects of Partisan Sorting on Social and Issue Polarization. American Journal of Political Science, 59(1):128–145.
Michal Mochtak, Josip Glaurdić, and Christophe Lesschaeve. 2020. Talking War: Representation, Veterans and Ideology in Post-War Parliamentary Debates. Government and Opposition, 57(1):148–170.
Michal Mochtak, Josip Glaurdić, and Christophe Lesschaeve. 2022a. CROCorp: Corpus of Parliamentary Debates in Croatia (v1.1.1). https://doi.org/10.5281/zenodo.6521372.
Michal Mochtak, Josip Glaurdić, and Christophe Lesschaeve. 2022b. SRBCorp: Corpus of Parliamentary Debates in Serbia (v1.1.1). https://doi.org/10.5281/zenodo.6521648.
Michal Mochtak, Josip Glaurdić, Christophe Lesschaeve, and Ensar Muharemović. 2022c. BiHCorp: Corpus of Parliamentary Debates in Bosnia and Herzegovina (v1.1.1). https://doi.org/10.5281/zenodo.6517697.
Michal Mochtak, Peter Rupnik, and Nikola Ljubešić. 2022d. The sentiment corpus of parliamentary debates ParlaSent-BCS v1.0. Slovenian language resource repository CLARIN.SI.
Saif M. Mohammad. 2021. Sentiment analysis: Automatically detecting valence, emotions, and other affectual states from text. https://arxiv.org/abs/2005.11882.
Nona Naderi and Graeme Hirst. 2016. Argumentation mining in parliamentary discourse. In: Matteo Baldoni, Cristina Baroglio, Floris Bex, Floriana Grasso, Nancy Green, Mohammad-Reza Namazi-Rad, Masayuki Numao, and Merlin Teodosia Suarez, editors, Principles and Practice of Multi-Agent Systems, pages 16–25, Cham. Springer.
Obinna Onyimadu, Keiichi Nakata, Tony Wilson, David Macken, and Kecheng Liu. 2013. Towards sentiment analysis on parliamentary debates in Hansard. In: Revised Selected Papers of the Third Joint International Conference on Semantic Technology – Volume 8388, JIST 2013, pages 48–50, Berlin, Heidelberg. Springer-Verlag.
Otvoreni Parlament. 2020. Početna. https://otvoreniparlament.rs/.
Parlamentarna skupština BiH. 2020. Sjednice. https://www.parlament.ba/?lang=bs.
Sven-Oliver Proksch, Will Lowe, Jens Wäckerle, and Stuart Soroka. 2019. Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches. Legislative Studies Quarterly, 44(1):97–131.
Christian Rauh. 2018. Validating a sentiment dictionary for German political language—a workbench note. Journal of Information Technology & Politics, 15(4):319–343.
Edward A. Shils and Morris Janowitz. 1948. Cohesion and Disintegration in the Wehrmacht in World War II. Public Opinion Quarterly, 12(2):315.
Juan M. Soler, Fernando Cuartero, and Manuel Roblizo. 2012. Twitter as a tool for predicting elections results. In: 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1194–1200.
Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 327–335, Sydney. Association for Computational Linguistics.
Andranik Tumasjan, Timm Sprenger, Philipp Sandner, and Isabell Welpe. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. Proceedings of the International AAAI Conference on Web and Social Media, 4(1).
Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In: P. Sojka, I. Kopeček, K. Pala, and A. Horák, eds., Text, Speech, and Dialogue TSD 2020, volume 12284 of Lecture Notes in Computer Science. Springer.
Stephen M. Utych. 2018. Negative Affective Language in Politics. American Politics Research, 46(1):77–102.
Lori Young and Stuart Soroka. 2012. Affective news: The automated coding of sentiment in political texts. Political Communication, 29(2):205–231.
Fine-grained human evaluation of NMT applied to literary text: case study of a French-to-Croatian translation
Marta Petrak,* Mia Uremović,* Bogdanka Pavelin Lešić*
* Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, 10000 Zagreb
mpetrak@ffzg.hr
uremovic.mia@gmail.com
bpavelin@ffzg.hr
Abstract
Even though neural machine translation (NMT) has demonstrated phenomenal results and has proven more successful than previous MT systems, relatively few works deal with its application to literary text. This stems from the fact that literary texts are deemed more complex than others because they involve specific elements such as idiomatic expressions, metaphor, and a particular author's style. Nevertheless, there is a growing body of research dealing with NMT applied to literary texts, and this case study is part of it. The goal of the present paper is to conduct an in-depth, fine-grained evaluation of a novel translated by Google Translate (GT) in order to gain detailed insights into NMT performance on literary text. In addition, the paper aims to cover, for the first time to the best of our knowledge, the French-Croatian language combination.
1. Introduction

Numerous studies have demonstrated that neural machine translation (NMT) outperforms previous MT systems (e.g. Bentivogli et al., 2016; Burchardt et al., 2017; Klubička et al., 2018; Hansen, 2021). This has been demonstrated for a number of various text types, among which literary texts are the least represented due to their specificities such as lexical richness and metaphorical and idiomatic elements (e.g. Toral and Way, 2018). Literary translation is also usually considered to be more complex than technical translation because it includes elements such as a writer's individual style (Hadley, 2020).

Due to these facts, literary texts are still perceived to be "the greatest challenge for MT" (Toral and Way, 2018). Some more pessimistic authors even claim that "there is no prospect of machines being useful at (assisting with) the translation of [literary texts]" (Toral and Way, 2018). While the use of machine translation followed by a post-editing phase is a widespread practice generally speaking, it has not yet become a permanent fixture in literary translation (Besacier, 2014).

In spite of this fact, there has been a growing interest in applying MT to literature, which can be seen, for example, in the fact that there has been a workshop on computational linguistics for literature organised by ACL since 20121. Moreover, the French-speaking world has seen the creation of an observatory for MT (Observatoire de la traduction automatique) by the ATLAS2 association in December 2018 to follow the development of MT applied to literary text3.

Even though studies that analyse the application of MT to literary text are less numerous than those applying MT to other types of text, they are not nonexistent. Hansen's (2021) paper brings a detailed and up-to-date overview of the works dealing with MT of literary texts. The first literary text translated by MT was done by Besacier (2014), and it comprised an essay translated from English to French. A number of languages have already been covered by various studies of MT applied to literary text, among them Slavic (e.g. Slovene, Kuzman et al., 2019), Romance (e.g. Catalan, Toral and Way, 2018; French, Besacier, 2014; Hansen, 2021), Germanic (English, in a number of papers; German, Matusov, 2019), Scottish Gaelic and Irish (Ó Murchú, 2019), etc.

2. Goal of the paper

The goal of this case study is to go beyond the overall performance of NMT on literary text and to provide an extensive, in-depth human analysis of its results. In order to do so, we will, firstly, produce an MT of a French novel and, secondly, compare that translation with a human translation of the same text. The human translation will be done by a student of translation from French into Croatian as part of her Master's thesis, and the analysis will be carried out by two human evaluators, the student and an experienced professional translator.

In addition to providing an in-depth analysis of the translation of a literary text done by MT, our case study is the first one to pair, to the best of our knowledge, a large Romance language, French, with Croatian4, a smaller-scale language rich in morphology.

The rest of the paper is structured as follows: in Section 3 we describe the methodology used. Section 4 is the central part of the paper, as it sums up the results of our analysis combined with a number of specific

1 Cf. e.g. https://aclanthology.org/events/clfl-2020/.
2 ATLAS stands for Association pour la promotion de la traduction littéraire (Association for the promotion of literary translation), https://www.atlas-citl.org/.
3 https://www.atlas-citl.org/lobservatoire-de-la-traduction-automatique/
4 Croatian is the official language of the Republic of Croatia and of the EU, but is also spoken in Bosnia and Herzegovina, Montenegro, etc. It has approximately 5.6 million native speakers worldwide. Cf. https://www.european-language-grid.eu/ncc/ncc-croatia/.
examples from the corpus. In Section 5 we bring some concluding remarks and recommend some further steps.

3. Methodology

In order to conduct our analysis, we have chosen a novel, which is "arguably the most popular type of literary text" (Toral and Way, 2018). Our corpus comprises the first eight chapters of the novel La traduction est une histoire d'amour (Translation is a Love Affair) written by Jacques Poulin, a contemporary Canadian author. It comprises a total of 8,347 words. The original text, written in French, is first translated by GT, and subsequently by a human translator. The MT is analysed in detail by two evaluators, after which the two translations are compared.

Hansen (2021) argues that the evaluation of texts produced by MT still remains a major obstacle. More precisely, while BLEU (Papineni et al., 2002) is the most widely used automatic metric, it has to be taken with caution in the case of literary texts (ibid.). Papineni et al. (2002) argue that human evaluations of MT are "extensive" and therefore usually more fine-grained than automatic ones, but the authors also point to their expensiveness.

In our case study, we present a quantitative and qualitative analysis of errors. We base our methodology on the one developed by Pavlović (2016). Pavlović (ibid.) also argues that in the literature there is no single classification of translation errors that all authors would agree upon, so she makes her own classification based upon extant ones by a number of previous authors and some specificities of the corpus. Her study (2016) included only non-literary texts: newspaper reports, public opinion reports and EU legal documents (opinions and decisions), a total of 3,406 words. Still, Pavlović's (2016) methodology was developed with the goal of comparing MT done by GT and human translation, and it takes into account some specificities of the Croatian language such as a rather free word order, abundance of inflection and morphological complexity. It should be emphasized that Pavlović's (2016) study was conducted before GT used NMT for Croatian, which is available today5 and is the technology used for the analysis presented in this case study.

The analysis of errors conducted for this paper follows that given by Pavlović (2016), with only minor alterations. For example, the sub-category (D.c), 'numbers', is not present in the machine translation of the chosen text and is hence not part of this analysis.

5 Cf. https://translate.google.com/intl/hr/about/languages/.

4. Results and analysis

4.1. Fine-grained human evaluation

Our analysis has demonstrated that GT has provided a very satisfactory translation generally speaking, and some of its solutions were even better than the ones provided by the human translation in the cases where there was a possible choice between a general word and its more suitable or literary synonym.

Below we first bring a table with a general presentation of the errors found in the MT.

Error category      %
Morphosyntax        55.3
Lexicon             32.1
Spelling            7.0
Other               5.6

Table 1: Classification of general error types produced by MT.

Table 1 demonstrates that morphosyntactic errors visibly make up the most frequent error type in our corpus, i.e. more than half of the total number of errors. These are followed by errors in lexical choice. In Table 2 (below) we bring a detailed list of error types found in our corpus.

Error type                                        %
C.a. congruence                                   39.3
B.a. lexical choice                               18.8
C.c. word order / order of phrase constituents    10.9
B.c. idiomatic expressions                        7.5
B.b. term or title                                5.8
C.b. verbal forms / tenses                        5.2
A.a. punctuation                                  4.5
A.b. capital letters                              2.0
D.a. not translated                               2.0
D.b. omissions                                    1.9
D.d. format, etc.                                 1.6
A.c. other spelling errors                        0.5
D.c. numbers                                      0

Table 2: Detailed breakdown of error types found in the corpus.
4.1.1. Morphosyntactic errors

According to our analysis, the most common errors made by GT are morphosyntactic errors, more specifically congruence errors, representing 39.3%. This type of error most frequently has to do with grammatical gender. Here is an example:

original                     GT                       human translation
La meilleure traductrice     Najbolji prevodilac u    Najbolja prevoditeljica u
du Québec                    Quebecu                  Québecu

Table 3: Example of congruence error.
In the above example, traductrice 'female translator' is translated by GT as prevoditelj 'male translator' even though both French and Croatian are marked for gender, and even though there is a ready-made solution in Croatian, prevoditeljica 'female translator'. The problem here is probably the fact that GT uses English as a sort of pivot or intermediate language (e.g. Ljubas, 2018) when translating between French and Croatian6, which do not share as large a corpus of texts as they each do with English individually.

This is a frequent error produced by GT in the corpus, i.e. not marking whatever has to do with the narrator, who is a woman, as female, but leaving masculine nouns, adjectives etc., which we also attribute to translating via English: e.g. Je raccrochai is translated as Spustio (masc.) sam slušalicu instead of Spustila (fem.) sam slušalicu / Poklopila (fem.) sam.

In other words, it can be said generally that our analysis has demonstrated that GT had no problems, for example, with the rich Croatian nominal case system and general subject-verb or noun-adjective agreement. This is in line with findings from the literature that neural systems have been found to make fewer morphological, lexical and word-order errors (e.g. Burchardt, 2017). What was a problem, however, in the category of morphosyntactic errors is recognizing the narrator as a female, and consequently translating all her attributes and making all the agreements in the feminine gender. This is a feature of the text that extends beyond sentence level and permeates the entire discourse of the novel. In some French sentences, this difference between masculine and feminine gender cannot be seen, for example in the present tense or in the past tense (passé composé) formed with the auxiliary verb to have (avoir). In Croatian, the same goes for the present tense, but the past tense always shows agreement with the subject in gender. The large number of errors in this category undoubtedly stems from the use of English as a pivot language.

4.1.2. Lexical errors

The next most represented category are lexical errors (32.1%), listed in the table below.
original: Eh bien, c'était le portrait tout craché de ma mère. | GT: Pa, to je bila pljuvačka slika moje majke. | human translation: E pa to je pljunuti portret moje majke.
original: Les ouaouarons, affolés, … | GT: Uplašeni bikovi žabe … | human translation: Žabe su se preneražene …
original: Je suis sur la route parce que ma maîtresse ne peut plus s'occuper de moi, (…) | GT: Na putu sam jer se moja ljubavnica više ne može brinuti o meni | human translation: Na ulici sam jer se moja vlasnica više ne može brinuti o meni, (…)
original: Ma mère et ma grand-mère reposaient derrière l'église … | GT: Moja majka i baka odmarale su se iza crkve … | human translation: Moja majka i baka bile su pokopane iza crkve …
original: … dans l'herbe jonchée de feuilles mortes. | GT: … u travi posutoj mrtvim lišćem. | human translation: … travi prekrivenoj suhim lišćem.
original: J'étais très heureuse, presque sur un nuage, (…) | GT: Bio sam vrlo sretan, skoro na devetom oblaku, (…) | human translation: Bila sam sretna, gotovo u sedmom nebu, (…)
original: Les maudites algues… | GT: Proklete morske alge… | human translation: Proklete alge…

Table 4: Examples of lexical choice errors.

6 This has been claimed generally as a feature of GT that it uses when translating between any pair of languages. A Google spokesperson has admitted that Google Translate uses English for "bridging" between languages with fewer resources. See https://algorithmwatch.org/en/google-translate-gender-bias/; cf. https://www.circuitmagazine.org/chroniques-126/sur-le-vif-126/google-uses-english-as-a-pivot-language.
Errors in this category concern the following: 1) single-word polysemy, 2) idiomatic expressions, 3) calques from English.

With respect to single-word polysemy, GT has, for instance, erroneously translated maîtresse 'owner' (of a cat) as 'lover'. It also translated reposaient 'rested' as odmarale su se 'were having a rest' instead of bile su pokopane, which is used in the context of the dead buried in a graveyard. Furthermore, it translated algues as morske alge 'sea algae', which is an incorrect specification stemming from the fact that algae are usually related to the sea; the algae in the story, however, come from a pond.

As for idiomatic expressions (7% of total errors), GT rendered le portrait craché 'spitting image' as *pljuvačka slika instead of pljunuti portret. It clearly calqued the expression être sur un nuage 'be on cloud nine' on English and translated it as *biti na devetom oblaku, which does not exist in Croatian, and should be translated as na sedmom nebu 'lit. on seventh sky'. The noun phrase feuilles mortes is literally translated as *mrtvo lišće instead of suho lišće 'lit. dry leaves', etc.

There are several instances of calquing from English, such as in the example of ouaouarons, animals known in English as American bullfrogs, which are literally translated as bikovi žabe 'bulls-frogs', and for which we would suggest the translation žabe due to the fact that the particular species is irrelevant to the plot.

4.1.3. Other errors

In the category of capital letters, GT had difficulties rendering street names, which appeared in the text several times. Examples such as 609, rue Richelieu were rendered by GT as 609, ulica Richelieu, where all the individual elements are correctly translated, but the street name as a whole should be written as Ulica Richelieu 609, which is the conventional way of writing street names in Croatian.

Another interesting error concerns proper names. Let us cite two examples: Marine and Chaloupe. Marine, the name of the main character and narrator, is sometimes translated by GT as marinac 'Marine, i.e. member of an elite US fighting corps'. In addition to having the same form, the English word is always capitalised, so that could be another reason for such a translation. Chaloupe, on the other hand, is the name of the cat that appears several times in the text. It is derived from the common noun chaloupe denoting a type of boat. GT translated the noun as čamac 'boat', making it a common noun and even leaving out the capital letter.

Bentivogli et al. (2016) and Toral and Sánchez-Cartagena (2017) found that NMT improves notably on reordering and inflection compared to PBMT. In the case of Poulin's novel translated and analysed in this paper, there were generally very few problems with inflection, and word/constituent order represented only 10% of all the errors. What our analysis seems to point to is the fact that using English as a pivot language is the source of a large number of errors, and that using language-pair-specific corpora could arguably give better results when translating between two languages of which neither is English. This would also probably have a positive effect on the translation of culturally specific elements such as the spelling and writing of toponyms (e.g. street names). Furthermore, our analysis also demonstrates that more improvement is needed in the detection and translation of polysemy and idiomatic expressions.
4.2. BLEU evaluation

In addition to the fine-grained human evaluation, a BLEU score was also calculated using the interactive BLEU score evaluator7 available via the Tilde platform. The BLEU score is based on the correspondence between the MT output and the reference human translation.

The overall cumulative BLEU score for the literary text analysed in our case study was 5.49, which would suggest very poor MT quality. As a reference, BLEU scores of 30 to 40 are considered to indicate "understandable to good translations", while those of 40 to 50 indicate "high quality translations"8. Here is the breakdown of the BLEU score:
Type          1-gram    2-gram    3-gram    4-gram
Individual    21.92     5.86      2.79      2.54
Cumulative    21.92     11.33     7.10      5.49

Table 5: Results of automatic BLEU evaluation.

7 https://www.letsmt.eu/Bleu.aspx.
8 https://cloud.google.com/translate/automl/docs/evaluate
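The individual and cumulative rows above differ only in the n-gram weights applied. The paper used the Tilde evaluator; purely as an illustration, the sketch below computes the same kind of breakdown with NLTK on a single sentence pair taken from Table 4 (the tokenisation and smoothing choices are our assumptions, and NLTK reports scores in [0, 1] rather than 0-100).

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One tokenised MT hypothesis and one human reference from Table 4;
# a real evaluation would run over the whole translated text.
hypothesis = "na putu sam jer se moja ljubavnica više ne može brinuti o meni".split()
reference = "na ulici sam jer se moja vlasnica više ne može brinuti o meni".split()

refs, hyps = [[reference]], [hypothesis]
smooth = SmoothingFunction().method1  # avoids zero scores on short segments

# Individual n-gram scores: put all weight on a single n-gram order.
for n, w in enumerate([(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)], 1):
    print(f"individual {n}-gram:", corpus_bleu(refs, hyps, weights=w, smoothing_function=smooth))

# Cumulative 4-gram BLEU: geometric mean of the 1- to 4-gram precisions.
print("cumulative 4-gram:", corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth))
```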
In other available case studies dealing with MT of a literary text, BLEU scores show significant variation. In the case of a translation of a literary essay from English into French (Besacier and Schwartz, 2015), the BLEU score was around 30. In another case study, dealing with English literary texts translated into Slovene, BLEU scores varied from 1.73 to 30 depending on the texts on which the MT model was trained (Kuzman et al., 2019). Toral and Way (2018) obtained BLEU scores of around 30 for English-to-Catalan translations of 12 English
novels by PBSMT and NMT systems, where NMT outperformed PBSMT.

Unlike the results obtained by Kuzman et al. (2019) in their study of a literary translation from English into Slovene, a language genetically very close to Croatian, where "there were no sentences that would not need postediting", in our case study there were a number of sentences entirely correctly rendered by GT, i.e. that would be publication-ready.

In any case, it should be borne in mind that the BLEU automatic evaluation metric was calculated with respect to a single human translation, and that it cannot represent the "real quality" of the MT output. In that sense, Hansen (2022) notes, for instance, that two MT models used in his case study had a similar BLEU score in spite of the fact that the first one produced correctly translated words in incomprehensible sentences, while the second one generated correct sentences with words that did not semantically correspond to the lexical field of the translated literary text. This is one of the reasons why we would not entirely agree that the translation provided by GT analysed in this paper is irrelevant or "useless", as it would be classified due to its BLEU score below 10 (cf. footnote no. 8).

In addition, it should be noted that some authors claim that the morphological richness of the Croatian language could raise problems for BLEU evaluation due to the fact that each Croatian noun has approximately 10 different word forms, which are considered by BLEU to be 10 different words, and not 10 different word forms of a single lemma (cf. Seljan et al., 2012). This could result in lower BLEU scores.

5. Conclusion

Even though the BLEU score was only 5.49, indicating very poor translation quality which should be deemed useless, we believe that the GT output would be useful to some extent to translators translating Poulin's novel from scratch. Further analyses should however be made in order to establish whether GT trained on French and Croatian corpora would yield better results than GT that uses English as a pivot. Furthermore, it should also be studied how much post-processing effort is needed to correct the errors of GT in comparison to translation from scratch in the French-to-Croatian language combination.

Our analysis has demonstrated that there was a total of 738 errors in the text produced by GT, largely falling into two groups: morphosyntactic (around 55%) and lexical choice (around 32%) errors. While the morphosyntactic errors largely concerned errors in congruence, stemming probably from the usage of English as a pivot language between French and Croatian, the lexical choice errors mostly had to do with polysemy, idiomatic expressions and calques.

Let us now compare our results with those from other existing works on MT of literary texts involving either of the two languages from this case study, Croatian or French. Hansen (2022), who analysed English-to-French translations of fantasy books, observed that, generally speaking, the MT output was rather literal and produced mostly lexical errors, as well as errors related to determiners and syntax. While Hansen (ibid.) does not provide further details, we can generally say that in our French-to-Croatian literary translation morphosyntactic errors were 20% more present than lexical errors, which is different from what he found in the English-French language pair. Furthermore, Hansen (ibid.) was surprised to note that the specific vocabulary related to the fantasy series in question was respected almost entirely, which is probably due to the training of the MT model on texts written by the same author. This is one of the reasons why Hansen (2022) suggests that personalized MT systems should be introduced in literary translation for translating specific authors' styles.

In another paper, involving Slovene, a language closely related to Croatian, and analysing the translation of literary texts from English, Kuzman et al. (2019) observe that "error analysis (…) revealed various punctuation errors, wrong translations of prepositions and conjunctions, inappropriate shifts in verb mood, wrong noun forms and co-reference changes". The authors emphasize the presence of numerous semantic errors, "especially in connection with idioms and ambiguous words". In this case, more detailed data is also lacking, but we can generally conclude that this study also differs from ours in that semantic errors are definitely not the leading error type in our French-to-Croatian translation. Interestingly, Kuzman et al. (2019) also found that GNMT assigned the wrong gender to the main character, just as happened in our case, as mentioned in 4.1.1.

We can conclude that in the French-to-Croatian GT of the novel analysed in this text, morphosyntactic errors (55.3%) are the most represented ones, followed by various lexical errors (32.1%). These results are somewhat different from what was observed in earlier extant studies dealing with MT of literary texts from English to French and English to Slovene.

This case study is a contribution to a growing number of papers dealing with applying (N)MT to literary text, which until only recently was thought of as a domain that could not be translated by MT. Various authors have, however, demonstrated the usefulness of using MT in literary translation. Some (e.g. Besacier and Schwartz, 2015) even argue that MT of literary text may be of interest for all participants of the translation chain, from editors, through readers, to authors and translators.

6. References

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. In: J. Su, K. Duh and X. Carreras, eds., Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267. Association for Computational Linguistics, Austin, Texas.
Laurent Besacier and Lane Schwarz. 2015. Automated Translation of a Literary Work: A Pilot Study. In: A. Feldman, A. Kazantseva, S. Szpakowicz and C. Koolen, eds., Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pages 114–122. Association for Computational Linguistics, Denver, Colorado.
Laurent Besacier. 2014. Traduction automatisée d'une œuvre littéraire : une étude pilote. In: P. Blanche, F. Béchet and B. Bigi, eds., Actes du 21ème Traitement Automatique des Langues Naturelles, pages 389–394. Association pour le Traitement Automatique des Langues, Marseille. https://hal.inria.fr/hal-01003944
Marija Brkić, Sanja Seljan and Maja Matetić. 2011. Machine Translation Evaluation for Croatian-English and English-Croatian Language Pairs. In: B. Sharp, M. Zock, M. Carl and A. L. Jakobsen, eds., Proceedings of the 8th International NLPCS Workshop: Human-Machine Interaction in Translation, pages 93–104. Copenhagen Business School, Copenhagen.
Aljoscha Burchardt, Vivien Macketanz, Jon Dehdari, Georg Heigold, Jan-Thorsten Peter, and Philip Williams. 2017. A Linguistic Evaluation of Rule-Based, Phrase-Based, and Neural MT Engines. The Prague Bulletin of Mathematical Linguistics, 108:159–170.
Margot Fonteyne, Arda Tezcan, and Lieve Macken. 2020. Literary Machine Translation under the Magnifying Glass: Assessing the Quality of an NMT-Translated Detective Novel on Document Level. In: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk and S. Piperidis, eds., Proceedings of the 12th Conference on Language Resources and Evaluation, pages 3790–3798. European Language Resources Association, Marseille.
James Hadley. 2020. Traduction automatique en littérature : l'ordinateur va-t-il nous voler notre travail. Contrepoint, 4:14–18. https://www.ceatl.eu/wp-content/uploads/2020/12/Contrepoint_2020_04_article_04.pdf
Damien Hansen. 2022. La traduction littéraire automatique : Adapter la machine à la traduction humaine individualisée. https://hal.archives-ouvertes.fr/hal-03583562/document
Damien Hansen. 2021. Les lettres et la machine : un état de l'art en traduction littéraire automatique. In: P. Denis, N. Grabar, A. Fraisse, R. Cardon, B. Jacquemin, E. Kergosien and A. Balvet, eds., Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles, Vol. 1, pages 61–78. ATALA, Lille.
Filip Klubička, Antonio Toral and Víctor M. Sánchez-Cartagena. 2017. Fine-Grained Human Evaluation of Neural Versus Phrase-Based Machine Translation. The Prague Bulletin of Mathematical Linguistics, 108:121–132. https://arxiv.org/abs/1706.04389
Taja Kuzman, Špela Vintar and Mihael Arčan. 2019. Neural Machine Translation of Literary Texts from English to Slovene. In: J. Hadley, M. Popović, H. Afli and A. Way, eds., Proceedings of the Qualities of Literary Machine Translation, pages 1–9. European Association for Machine Translation, Dublin. https://aclanthology.org/W19-7301
Rudy Loock. 2018. Traduction automatique et usage linguistique : une analyse de traductions anglais-français réunies en corpus. Meta, 63(3):786–806. https://doi.org/10.7202/1060173ar
Sandra Ljubas. 2018. Prijelaz sa statističkog na neuronski model: usporedba strojnih prijevoda sa švedskoga na hrvatski jezik. Hieronymus, 5:72–79. https://www.bib.irb.hr/978980
Evgeny Matusov. 2019. The Challenges of Using Neural Machine Translation for Literature. In: J. Hadley, M. Popović, H. Afli and A. Way, eds., Proceedings of the Qualities of Literary Machine Translation, pages 10–19. European Association for Machine Translation, Dublin. https://aclanthology.org/W19-7302.pdf
Eoin P. Ó Murchú. 2019. Using Intergaelic to pre-translate and subsequently post-edit a sci-fi novel from Scottish Gaelic to Irish. In: J. Hadley, M. Popović, H. Afli and A. Way, eds., Proceedings of the Qualities of Literary Machine Translation, pages 20–25. European Association for Machine Translation, Dublin. https://aclanthology.org/W19-7303
Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In: P. Isabelle, E. Charniak and D. Lin, eds., Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA. https://doi.org/10.3115/1073083.1073135
Nataša Pavlović. 2017. Strojno i konvencionalno prevođenje s engleskoga na hrvatski: usporedba pogrešaka. In: D. Stolac and A. Vlastelić, eds., Jezik kao predmet proučavanja i jezik kao predmet poučavanja, pages 279–295. Srednja Europa, Zagreb.
Jacques Poulin. 2006. La traduction est une histoire d'amour. Leméac/Actes Sud, Montreal.
Sanja Seljan, Marija Brkić and Tomislav Vičić. 2012. BLEU Evaluation of Machine-Translated English-Croatian Legislation. In: N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk and S. Piperidis, eds., Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.
Antonio Toral and Andy Way. 2018. What Level of Quality Can Neural Machine Translation Attain on Literary Text? In: J. Moorkens, S. Castilho, F. Gaspari and S. Doherty, eds., Translation Quality Assessment. Machine Translation: Technologies and Applications, Vol. 1, pages 263–287. Springer, Cham. https://doi.org/10.1007/978-3-319-91241-7
A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from
Wikipedia
Aleksandar Petrovski
Faculty of Informatics
International Slavic University
Marshal Tito 77 Sv. Nikole, North Macedonia
aleksandar.petrovski@msu.edu.mk
Abstract
This paper describes the creation of a bilingual English-Ukrainian lexicon of named entities, with Wikipedia as the source. The proposed methodology provides a cheap way to build multilingual lexicons without having expertise in the target languages. The extracted named entity pairs have been classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). This has been achieved using Wikipedia metadata. Using the presented methodology, a huge lexicon has been created, consisting of 624,168 pairs. The classification quality has been checked manually on 1,000 randomly selected named entities. The results obtained are 97% for precision and 90% for recall.
1. Introduction

The term named entity (NE) refers to expressions describing real-world objects, like persons, locations, and organizations. It was first introduced to the Natural Language Processing (NLP) community at the end of the 20th century. Named entities are often denoted by proper names. They can be abstract or have a physical existence. Some other expressions, describing money, percentages, times, and dates, might also be considered named entities. Examples of named entities include: United States of America, Paris, Google, Mercedes Benz, Microsoft Windows, or anything else that can be named.

The role of named entities has become more and more important in NLP. Their information is crucial in information extraction. As recent systems mostly rely on machine learning techniques, their performance depends on the size and quality of the given training data. This data is expensive and cumbersome to create because experts usually annotate corpora manually to achieve high-quality data. As a result, these data sets often lack coverage, are not up to date, and are not available in many languages. To overcome this problem, semi-automatic methods for resource construction from other available sources have been deployed. One of these sources is Wikipedia.

The method presented here has been used to build a Python application which extracts the English-Ukrainian pairs from Wikipedia and classifies them using the English Wikipedia category system. Since both English and Ukrainian are among the languages with the most articles on Wikipedia, the result is a huge lexicon.

The goal of this paper is to present a method of extracting multilingual lexicons of classified named entities from Wikipedia. The method has been implemented to build a huge English-Ukrainian lexicon of named entities.

2. Related work

Building multilingual lexicons from Wikipedia has been a subject of research for more than 10 years. Schönhofen et al. (2007) exploited Wikipedia hyperlinkage for query term disambiguation. Tyers and Pienaar (2008) described a simple, fast, and computationally inexpensive method for extracting bilingual dictionary entries from Wikipedia (using the interwiki link system) and assessed the performance of this method with respect to four language pairs. Yu and Tsujii (2009) proposed a method using the interlanguage links in Wikipedia to build an English-Chinese lexicon. Knopp (2010) showed how to use the Wikipedia category system to classify named entities. Bøhn and Nørvåg (2010) described how to use Wikipedia contents to automatically generate a lexicon of named entities and synonyms that all refer to the same entity. Hálek et al. (2011) attempted to improve machine translation of named entities from English by using Wikipedia. In (Ivanova, 2012), the author evaluated a bilingual bidirectional English-Russian dictionary created from Wikipedia article titles. Higashinaka et al. (2012) aimed to create a lexicon of 200 extended named entity (ENE) types, which could enable fine-grained information extraction. Oussalah and Mohamed (2014) demonstrated how to use info-boxes in order to identify and extract named entities from Wikipedia.

3. Wikipedia

Wikipedia is a free online encyclopedia, made and maintained as an open collaborative project by a community of volunteer editors, using a wiki-based editing system. Hosted and supported by the Wikimedia Foundation, since its start in 2001 the site has grown in both popularity and size. At the time of writing this paper (March 2022), Wikipedia contained over 58 million articles in 323 languages; its English version has over 6 million articles. The richness of its information and texts continuously makes it an object of special research interest in the NLP (Natural Language Processing) community. Attracting approximately 6 billion visitors per month (Statista, 2021), it is the largest and most popular general reference work on the World Wide Web.
3.1. Wikipedia as a source

Even though Wikipedia isn't made and maintained by linguists, metadata about articles, for instance translations, disambiguations, or categorizations, are accessible. Its structural features, size, and multilingual availability give a reasonable base from which to derive specialized resources, like multilingual lexicons (Bøhn and Nørvag, 2010). Researchers have found that around 74% of Wikipedia pages describe named entities (Nothman et al., 2008), a clear indication of Wikipedia's high coverage of named entities. Each Wikipedia article associated with a named entity is identified by its title, which is itself a named entity. That is a perfect opportunity to build parallel lexicons of named entities.

Wikipedia is a very cheap source of multilingual lexicons of named entities. Its database dump can be freely downloaded in SQL and XML formats. But, taking into account the fact that Wikipedia articles have been written by millions of contributors, a question arises: What is the quality of these lexicons, and how reliable are they for use, e.g., in machine translation?

3.2. English and Ukrainian Wikipedias

The English Wikipedia is the English-language edition of the Wikipedia online encyclopedia. English is the first language in which Wikipedia was written. It was started on 15 January 2001 (Wikimedia Foundation, 2022b), but versions of Wikipedia in other languages were quickly developed. Among these versions, there is one in the Ukrainian language. The Ukrainian Wikipedia (Wikimedia Foundation, 2022c), written in the Cyrillic alphabet, was initiated in the year 2004.

A list of all Wikipedias is published regularly on the Internet, along with several parameters for each language (Wikimedia Foundation, 2022a). Four parameters are important: the number of articles, the total number of pages (articles, user pages, images, talk pages, project pages, categories, and templates), the number of active users (registered users who performed at least one change in the last thirty days), and the depth (a rough indicator of the quality of a Wikipedia, which shows how often articles are updated).

As shown in Table 1, as of 26 March 2022, the English Wikipedia contains 6,473,638 articles and 55,472,454 pages. There are 127,722 active users. The depth value is 1,110. It is by far the largest edition of Wikipedia. The Ukrainian Wikipedia contains 1,144,596 articles and 3,992,549 pages. There are 2,702 active users. The depth value is 54. It is the 17th largest edition of Wikipedia according to the number of articles.

Parameter                 en            uk
Number of articles        6,473,638     1,144,596
Total number of pages     55,472,454    3,992,549
Number of active users    127,722       2,702
Depth                     1,110         54

Table 1: Parameters of the English and Ukrainian Wikipedias.

4. Method

The flowchart presented in Figure 1 shows the process used for building the lexicon.

Figure 1: The process flowchart.

1. Extract title pairs with English as the first language

For building multilingual lexicons, two tables from the database are necessary: the table of pages and the table of interlanguage links. The page table is the "core of the wiki". It contains titles and other essential metadata for the different Wikipedia namespaces. The interlanguage links table contains links between pages in different languages. Using these two tables, it is an easy programming task to create huge bilingual dictionaries without having any language expertise.

2. Filter out irrelevant title pairs

The extracted title pairs from the previous step contain a lot of noise. This step deals with it. First, the algorithm removes all the titles that don't belong to the main, template, or category namespaces. Second, there are titles containing some words or word stems that increase the noise and should be filtered out. The page table contains many entries that could not be a part of any lexicon, like user names, nicknames, template names, etc. There are also titles containing exclusively digits or blanks, which should be removed too.

3. Classify the remaining title pairs using the English Wikipedia category system

In order to classify the extracted named entities, one additional table from the database is required: the table of category links. The task of classifying named entities by means of category links is more complex. Wikipedia articles are generally members of categories. A category may have subcategories, each subcategory its own subcategories, etc. The problem is that the graph could be cyclic, which may cause the algorithm to go into an endless loop; a cycle-safe traversal is sketched below.
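Neither the extraction query nor the traversal code is given in the paper, so the following is only a sketch under the standard MediaWiki dump schema (the page, langlinks and categorylinks tables); the anchor-category-to-class mapping and the depth cap are illustrative assumptions.

```python
# Step 1 (sketch): pair English titles with their Ukrainian interlanguage
# links, assuming the standard MediaWiki schema: page(page_id,
# page_namespace, page_title) and langlinks(ll_from, ll_lang, ll_title).
TITLE_PAIRS_SQL = """
SELECT p.page_title, ll.ll_title
FROM page AS p
JOIN langlinks AS ll ON ll.ll_from = p.page_id
WHERE ll.ll_lang = 'uk'
  AND p.page_namespace IN (0, 10, 14);  -- main, template, category
"""

# Step 3 (sketch): walk the category graph upwards with an explicit `seen`
# set, so that cycles in the graph cannot cause an endless loop.
def classify(title, parents, class_of, max_depth=10):
    """parents: title or category -> parent category names (categorylinks);
    class_of: anchor category -> one of the five classes (assumed mapping);
    returns a class name, or None if the entity stays unclassified."""
    seen = set()
    frontier = {title}
    for _ in range(max_depth):                 # depth cap is illustrative
        nxt = set()
        for node in frontier:
            for cat in parents.get(node, ()):
                if cat in class_of:            # reached an anchor category
                    return class_of[cat]
                if cat not in seen:            # skip already-visited nodes
                    seen.add(cat)
                    nxt.add(cat)
        if not nxt:
            break                              # no unseen parents left
        frontier = nxt
    return None                                # later filtered out in step 4
```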
LOCATION, PRODUCT, and MISC. Each named entity belongs to at least one of these classes. The classes comprise:

• PERSON: humans, gods, saints, fictional characters

• ORGANIZATION: political organizations, companies, schools, rock bands, sport teams

• LOCATION: geographical terms, fictional places, cosmic terms

• PRODUCT: industrial products, software products, weapons, artworks, documents, concepts, standards, laws, formats, anthems, algorithms, journals, coats of arms, platforms, websites

• MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, battles, competitions, diseases, programs, sets of locations, awards, musical genres, missions, artistic directions, sets of organizations, networks

4. Filter out title pairs classified as non named entities

Most Wikipedia titles are named entities, but not all of them. For example, certain natural terms, like biological species and substances, which are very common on Wikipedia, are not included in the lexicon.

5. Convert the resulting data into CSV and XML formats

The lexicon comes in two formats: CSV and XML. The first row in the CSV file is a title row, and a tab is used as the field separator. The column titles are: en, uk, PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC. All other rows contain the data: the English name, the Ukrainian name, and five binary digits. These digits denote the class the named entity belongs to. For example, according to Figure 2, the named entity Odessa belongs to the class LOCATION, since the column LOCATION contains 1. All other classes contain 0's.

Figure 2: A lexicon entry in CSV format.

The structure of the XML file is similar. An equivalent of the entry from Figure 2 is shown in Figure 3. The column names en and uk from the CSV file are now names of elements, and class denotes the classification.

Figure 3: A lexicon entry in XML format.
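For concreteness, a plausible rendering of the Odessa entry described above (our reconstruction: the field and element names follow the text, while the Ukrainian form Одеса and the exact XML nesting are assumptions):

en      uk     PERSON  ORGANIZATION  LOCATION  PRODUCT  MISC
Odessa  Одеса  0       0             1         0        0

<entry>
  <en>Odessa</en>
  <uk>Одеса</uk>
  <class>LOCATION</class>
</entry>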
In realizing steps 2–3 of Figure 1, which refer to noise reduction and the classification of named entities, the experience of creating a parallel lexicon of named entities from English to the South Slavic languages (Slovenian, Croatian, Serbian, Bosnian, Montenegrin, Macedonian, and Bulgarian) (Petrovski, 2019) was of great benefit. That lexicon contains 26,155 entries, and its steps 2–3 were done manually.

This methodology has also been used to create a multilingual English – Hebrew – Yiddish – Ladino lexicon of named entities. A tool for searching it is available on the Internet (Petrovski, 2021).

5. Results

The method presented in the previous chapter has been used to build a Python application which extracts title pairs independently of the languages involved. This application was applied to the Wikipedia database to extract the English–Ukrainian pairs of named entities. The result of the extraction after the first two steps from Figure 1 was 687,799 pairs. After filtering out non named entities, 624,168 pairs remained. One part of the lexicon is presented in Figure 4.
Figure 4: A part of the lexicon.
The distribution of classes is presented in Table 2.

Class          Number
PERSON         142,850
ORGANIZATION   39,348
LOCATION       237,229
PRODUCT        56,952
MISC           159,952
Total          636,331

Table 2: Distribution of classes.

The total number of classes, 636,331, is slightly higher than the number of entries, since some named entities belong to more than one class. The lexicon entry presented in Figure 5 is such an example: Kherson State University is classified as both ORGANIZATION (the university as an educational organization) and LOCATION (the building where the organization is located).

Figure 5: A lexicon entry belonging to two classes.

It is expected that most Wikipedia titles are multiwords, i.e. that they contain either a space or a hyphen. Table 3 shows the number of multiword NEs per class in the lexicon for both English and Ukrainian.

Class          en        uk
PERSON         132,219   131,354
ORGANIZATION   34,114    30,509
LOCATION       116,974   99,399
PRODUCT        45,781    43,378
MISC           146,498   141,665
Total          475,586   446,305

Table 3: Number of multiword NEs per class.

Table 4 shows the percentage of multiword NEs per class.

Class          en    uk
PERSON         93%   92%
ORGANIZATION   87%   78%
LOCATION       49%   42%
PRODUCT        80%   76%
MISC           92%   89%
All            75%   70%

Table 4: Percentage of multiword NEs per class.

It can be seen that the percentage of multiwords is higher in the English than in the Ukrainian Wikipedia. This is most noticeable in the classes ORGANIZATION and LOCATION. Some examples from the lexicon where there is a multiword in English and a single word in Ukrainian are given in Table 5 for the class ORGANIZATION and in Table 6 for the class LOCATION.

en                       uk
Malkiya Club             Малкія
Dnipro Kherson           Дніпро
Sharjah FC               Шарджа
Shin Bet                 Шабак
Newtown A.F.C.           Ньютаун
The Day After Tomorrow   Післязавтра

Table 5: Examples of multiwords in English and single words in Ukrainian, class ORGANIZATION.

Contributors to the English Wikipedia add words to the base title which define it in more detail, or it is simply a matter of adding a definite article, e.g. Sacramento, California – Сакраменто, Malkiya Club – Малкія, The Acropolis – Акрополіс.

6. Evaluation of classification

To evaluate the classification, two common metrics from information retrieval have been used: precision and recall. Precision refers to the percentage of assigned classes that are correct. Recall refers to the percentage of all relevant classes that were correctly assigned by the algorithm.

An alternative to having two measures is the F-measure, which combines precision and recall into a single performance measure. This metric is known as the F1-score, which is simply the harmonic mean of precision and recall.
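Written out, with P standing for precision and R for recall: F1 = 2·P·R / (P + R). As a sanity check against the overall results reported below, P = 97% and R = 90% give F1 = 2·0.97·0.90 / (0.97 + 0.90) ≈ 93%, which matches the last row of Table 7.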
In order to evaluate the classification, a random sample containing 1,000 entries was extracted from the lexicon. The entries from the sample were classified manually and then compared to the classification performed by the algorithm. The results are presented in Table 7.

The precision of the classification is between 94% for ORGANIZATION and 99% for PERSON. The recall is slightly lower, from 83% for PRODUCT and MISC to 97% for PERSON. The overall results are 97% for precision and 90% for recall.

The higher values of precision show that the classification algorithm was tuned to classify the named entities correctly rather than to extract more named entities for the lexicon.

7. Conclusion

Using the methodology presented in this paper, an English–Ukrainian lexicon of named entities has been created. Its size is 624,168 pairs. The named entities have been classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The quality of the classification has been assessed: 97% for precision and 90% for recall.
en                       uk
Malmö Airport            Мальме
Shintoku, Hokkaido       Сінтоку
Amarillo, Texas          Амарилло
Sacramento, California   Сакраменто
The Dakota               Дакота
The Acropolis            Акрополіс

Table 6: Examples of multiwords in English and single words in Ukrainian, class LOCATION.

Class          Precision   Recall   F1-score
ORGANIZATION   94%         87%      90%
LOCATION       98%         92%      95%
PRODUCT        96%         83%      89%
MISC           96%         83%      89%
All            97%         90%      93%

Table 7: The results of the classification check.

The lexicon is available at (Petrovski, 2022) under the CC-BY-NC-4.0 license (free for non-commercial use).

Lexicons like the one presented in this paper can be used in machine translation (MT). Most statistical MT systems do not deal explicitly with named entities, simply relying on the model to select the correct translation, i.e., mistranslating them as generic nouns. It is also possible that, when not identified, named entities are left out of the output translation, which also has implications for the readability of the text. Because most NEs are rare in texts, statistical MT systems are not capable of producing quality translations for them. Another problem with MT systems is that failure to recognize NEs often harms the morphosyntactic and lexical context outside the NEs themselves. If named entities are not identified immediately, certain morphological features of adjacent and syntactically related words, as well as word order, may be incorrect. It can be concluded that the identification of named entities in the source text is the first task of machine translators (Hálek et al., 2011). However, developers of commercial MT systems often do not pay enough attention to the correct automatic identification of certain types of NE, e.g. names of organizations. This is partly due to the greater complexity of this problem (the set of proper nouns is open and very dynamic), and partly due to a lack of time and other development resources. One solution to this problem is using a parallel lexicon of named entities. If the lexicon contains a translation of the named entity, the translation quality will probably be good.

The European Commission called for language data in Ukrainian to/from all EU languages to train automatic translation systems (European Commission, 2022; European Union's Horizon 2020 Research and Innovation Programme, 2020) supporting refugees and helpers in the Ukraine crisis. This lexicon was sent to the ELRC (European Language Resource Coordination) Secretariat as a response.

8. References

Christian Bøhn and Kjetil Nørvåg. 2010. Extracting Named Entities and Synonyms from Wikipedia. In Proceedings of the International Conference on Advanced Information Networking and Applications, pages 1300–1307.
European Commission. 2022. Digital Europe Programme Language Technologies. https://language-tools.ec.europa.eu/.
European Union's Horizon 2020 Research and Innovation Programme. 2020. Bergamot Translations. https://translatelocally.com/web/.
Ryuichiro Higashinaka, Kugatsu Sadamitsu, Kuniko Saito, Toshiro Makino, and Yoshihiro Matsuo. 2012. Creating an Extended Named Entity Dictionary from Wikipedia. In 24th International Conference on Computational Linguistics – Proceedings of COLING 2012: Technical Papers, pages 1163–1178.
Ondřej Hálek, Rudolf Rosa, Aleš Tamchyna, and Ondřej Bojar. 2011. Named entities from Wikipedia for machine translation. In Proceedings of the Conference on Theory and Practice of Information Technologies, pages 23–30.
Angelina Ivanova. 2012. Evaluation of a Bilingual Dictionary Extracted from Wikipedia. In Computer Science.
Johannes Knopp. 2010. Classification of Named Entities in a Large Multilingual Resource Using the Wikipedia Category System. Master's thesis, University of Heidelberg, Heidelberg, Germany.
Joel Nothman, James Curran, and Tara Murphy. 2008. Transforming Wikipedia into Named Entity Training Data. In Proceedings of the Australian Language Technology Workshop.
Mourad Oussalah and Muhidin Mohamed. 2014. Identifying and Extracting Named Entities from Wikipedia Database Using Entity Infoboxes. International Journal of Advanced Computer Science and Applications, volume 5, pages 164–169.
Aleksandar Petrovski. 2019. EnToSSLNE – a Lexicon of Parallel Named Entities from English to South Slavic Languages. http://catalogue.elra.info/en-us/repository/browse/ELRA-M0051/.
Aleksandar Petrovski. 2021. Jewish Lexicons of Named Entities. https://www.jewishlex.org/.
Aleksandar Petrovski. 2022. A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia. https://catalogue.elra.info/en-us/repository/browse/ELRA-M0104/.
Péter Schönhofen, András Benczúr, István Bíró, and Károly Csalogány. 2007. Cross-Language Retrieval with Wikipedia. In Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Revised Selected Papers, volume 5152, pages 72–79.
Statista. 2021. Worldwide visits to Wikipedia.org from January to June 2021. https://www.statista.com/statistics/1259907/wikipedia-website-traffic/.
Francis M. Tyers and Jacques A. Pienaar. 2008. Extracting Bilingual Word Pairs from Wikipedia. In Proceedings of the SALTMIL Workshop at the Language Resources and Evaluation Conference, LREC 2008.
Wikimedia Foundation. 2022a. List of Wikipedias – Meta. https://meta.wikimedia.org/wiki/List_of_Wikipedias.
Wikimedia Foundation. 2022b. Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/wiki/Main_Page.
Wikimedia Foundation. 2022c. Wikipedia, the Free Encyclopedia. https://uk.wikipedia.org/wiki/Main_Page.
Kun Yu and Jun'ichi Tsujii. 2009. Bilingual Dictionary Extraction from Wikipedia. In Machine Translation Summit, volume 12.
Serbian Early Printed Books: Towards Generic Model for Automatic Text
Recognition using Transkribus
Vladimir Polomac*
* Serbian Language Department, Faculty of Philology and Arts, University of Kragujevac, Jovana Cvijića bb, 34 000 Kragujevac, Serbia
v.polomac@filum.kg.ac.rs
Abstract
The paper describes the process of creating and evaluating a new version of the generic model for automatic text recognition of Serbian Church Slavonic printed books within the Transkribus software platform, based on the principles of artificial intelligence and machine learning. The generic model Dionisio 2.0 was created on the material of Serbian Church Slavonic books from various printing houses of the 15th and 16th centuries (Cetinje, Venice, Goražde, Mileševa, Gračanica, Belgrade and Mrkša's Church); in the evaluation of its performance, a CER of about 2–3% was observed. The Dionisio 2.0 model will be publicly available to all users of the Transkribus software platform in the near future.
1. Introduction

The research on creating a model for automatic text recognition of the Serbian Church Slavonic printed books from Venice using the software platform Transkribus,1 presented in Polomac (2022), represents the starting point for this paper. That paper describes the process of transcription and the creation of a specific model2 for automatic text recognition of the Prayer Book (Euchologion) printed between 1538 and 1540 in the printing house of Božidar Vuković,3 as well as the process of creating a generic model4 for automatic text recognition of other books printed in Venice in the printing house of Božidar Vuković and his son Vićenco.5 The most important result of this paper is the creation of the first version of the model Dionisio 1.0 (named after an Italian pseudonym of Božidar Vuković, Dionisio della Vecchia), representing the first publicly available resource for automatic reading of Serbian Church Slavonic manuscripts and printed books within the Transkribus software platform (cf. https://readcoop.eu/model/dionisio-1-0/).

The Dionisio 1.0 model structure is shown in Table 1, and its performance is displayed in Table 2.

Book                               Word count
Prayer Book (1538–1540)            39,889
Psalter (1519–1520)                10,132
Miscellany for Travellers (1536)   10,618
Festal Menaion (1538)              10,732
Miscellany for Travellers (1547)   10,006
Hieratikon (Liturgikon) (1554)     10,196
Total                              91,573

Table 1: Dionisio 1.0 Structure and the Amount of Training Data.

Word count   Number of epochs6   CER7 on Train set   CER on Validation set
86,347       100                 1.66%               2.09%

Table 2: Dionisio 1.0 Performance.
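For reference, a CER of this kind is conventionally computed as the character-level edit distance between the automatic transcript and the corrected one, divided by the length of the corrected text. The following is a minimal sketch of such a computation on plain strings, not the Transkribus implementation:

def cer(hypothesis, reference):
    """Character error rate: Levenshtein distance / length of the reference."""
    m, n = len(hypothesis), len(reference)
    row = list(range(n + 1))            # distances for the empty hypothesis
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            row[j] = min(row[j] + 1,    # deletion
                         row[j - 1] + 1,  # insertion
                         prev + cost)   # substitution or match
            prev = cur
    return row[n] / n if n else 0.0

# e.g. cer('бл҃гы̏мь', 'бл҃гы́мь') counts the one substituted accent character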
1 Transkribus (https://readcoop.eu/transkribus) represents an open-access software platform for automatic text recognition and retrieval developed as part of the READ project at the University of Innsbruck. For more details about the technological background and the operating system, cf. Mühlberger et al. (2019).
2 The functionality of the Transkribus platform is particularly manifested in the potential to train one's own automatic text recognition model, irrespective of the language or script used in the manuscript. The training of the automatic recognition model represents an instance of machine learning based on neural networks, in which during the learning process the model compares the manuscript photographs and the corresponding letters, words and lines of the text in the diplomatic edition. For more details, see Mühlberger et al. (2019) and Rabus (2019a).
3 Božidar Vuković was a Serbian merchant from Zeta (Podgorica and the area surrounding Lake Skadar). After his arrival in Venice (in 1516 at the latest), he acculturated his Serbian name to the new environment by creating a Latin (Dionisius a Vetula) and an Italian pseudonym (Dionisio della Vecchia) from his Serbian name and the toponym Starčeva Gorica (at Lake Skadar), indicating his origin (Lazić, 2018). Books from his printery were aimed at the Serbian Orthodox Church and its flock under Ottoman rule, yet the motives of his printing business were not only patriotic and religious, but also mercantile and financial (Lazić, 2020b).
4 Unlike a specific model, which is trained to recognize a single manuscript or printed book, a generic model contains material from different manuscripts or printed books. More details on the possibilities and pitfalls of training generic models can be found in Rabus (2019b).
5 After the death of Božidar Vuković, Vićenco Vuković reprinted several of his father's editions until 1561 and later rented his equipment to other Venetian printers. For more details about his life and work, see also Pešikan (1994).
6 The term epoch in machine learning stands for "one complete presentation of the data set to be learned to a learning machine" (Burlacu and Rabus, 2021).
7 The Character Error Rate (CER) is calculated by comparing the automatically generated text with the manually corrected version. For more details, see the Transkribus Glossary: https://readcoop.eu/glossary/character-error-rate-cer/.
In the continuation of the research, we aimed at examining the performance of the Dionisio 1.0 model on Serbian Church Slavonic books created in other printing houses: firstly in the Venetian printing houses established after the closing of Božidar and Vićenco Vuković's printing house, and then in the other old Serbian printing houses of the 15th and 16th centuries (Cetinje, Goražde, Mrkša's Church, Belgrade, Mileševa and Gračanica), thus ultimately offering a generic model for the automatic text recognition of Serbian Church Slavonic printed books as a whole.

2. Applying the Dionisio 1.0 Model on Books from Other Venetian Printing Houses

In the first experiment, we tested the performance of the Dionisio 1.0 model on several Serbian Church Slavonic books printed in Venice after the closing of Božidar and Vićenco Vuković's printing house: the Lenten Triodion, printed in 1561 by Stefan of Scutari in Camillo Zanetti's printing house; the Prayer Book (Miscellany for Travellers), printed in 1566 by Jakov of Kamena Reka; the Prayer Book (Euchologion), created in 1570 in the printing house of Jerolim Zagurović; and the Psalter with Appendices, printed in 1638 in the printing house of Bartol Ginammi (Pešikan, 1994). The starting hypothesis of the current experiment was that the model trained on the materials of Serbian Church Slavonic books from the printing house of Božidar and Vićenco Vuković would be useful for automatic text recognition of other Venetian editions printed using their printing equipment.

The statistical results of the experiment are shown in the following table.

Book                               CER
Lenten Triodion (1561)             9.41%
Miscellany for Travellers (1566)   11.63%
Prayer Book (Euchologion) (1570)   13.67%
Psalter with Appendices (1638)     16.04%

Table 3: Application of the Dionisio 1.0 model on publications from other Venetian printing houses.

The unexpectedly high CER does not necessarily indicate poor performance of the Dionisio 1.0 model. The largest number of errors in text recognition results from the fact that in these books accent marks are used differently than in the books from the printing house of Božidar and Vićenco Vuković, on which the Dionisio 1.0 model was trained. This is especially evident in the Prayer Book (Euchologion) from the printing house of Jerolim Zagurović (1570) and the Psalter with Appendices from the printing house of Bartol Ginammi (1638), in which only the spiritus lenis with an oxia over the initial vowel grapheme was used.

To illustrate this claim, we shall use a comparative presentation of a photograph of a part of sheet 2b of the Prayer Book (Euchologion) (1570) and the text read automatically using the Dionisio 1.0 model.

Figure 1: The Automatically Read Text of a Segment of Sheet 2b Prayer Book (Euchologion) from 1570.

The greatest number of errors in text recognition refers to cases in which the model outputs accent marks in accordance with the material on which it was trained, although in the text of the Prayer Book (Euchologion) these marks were not used: instead of щедротами 1/2, твоѥго 2, ними 2, бл҃свень ѥси 3, животворещимь 3/4, дх҃омь 4, присноива 4, мои 5, твоимь 5, наѳаномь 6, своихь 7, прѣгрѣшенихь 7, ѥмоу 7, подасть 8, манасѵно 8, покаꙗнїе 8, the model outputs щедро́тами 1/2, твоѥ҆го 2, ни́ми 2, бл҃све́нь ѥ҆си 3, жи́вотво́рещи̏мь 3/4, дх҃о́мь 4, при́снои́ва 4, моѝ 5, твои҆мь 5, наѳа́номь 6, свои҆хь 7, прѣгрѣше́нихь 7, ѥ҆моу 7, пода́сть 8, ма́насѵно 8, покаꙗ҆нїе 8. Along with the accent marks, the model incorrectly reads a pajerak mark in two examples only: instead of ѥдинороднаго 2, покаꙗвшоу 6 there is the incorrect ѥ҆ди́норо́д на̏го 2, покаꙗ҆в шоу 6. In one example, instead of an oxia there is an incorrect double circumflex: instead of бл҃гы́мь 3 there is the incorrect бл҃гы̏мь 3.

The same problem is exhibited by the comparative presentation of the photograph of a part of sheet 5b of the Psalter with Appendices (1638) and the automatically read text.

Figure 2: The Automatically Read Text of a Part of Sheet 5b Psalter with Appendices from 1638.
Here, too, the largest number of errors refers to cases in which the Dionisio 1.0 model outputs accent marks according to the patterns of their use in the Venetian books that served for its training, although in the text of Ginammi's Psalter with Appendices these marks were not used. Thus, instead of вьзвахь 4, оуслиша 4, правди 4/5, моѥ 5, скрьбїи 5, распространиль 5, ме 5, ѥсїи 6, оущедри 6, ѹслиши 6, мою 7, сн҃ове 7, до колѣ 7, тешкосрьдїи 7, вьскоую 8, любыте 8, соуѥтннаа 8, льжоу 9, оувѣдите 9, ꙗко 9, оудивїи 9, the model incorrectly outputs вьзва́хь 4, оу῎сли́ша 4, пра́в дѝ 4/5, моѥ҅ 5, скрь́бїѝ 5, распростра́ниль 5, ме́- 5, ѥ҆сїи 6, оу῎ще́дри 6, ѹ῎сли́ши 6, мою̀ 7, сн҃о́ве 7, до ко́лѣ 7, те́шкосрь́дїи 7, вьскоую̀ 8, лю́быте 8, соуѥ҆тннаа 8, ль́жоу 9, оу῎вѣ́дите 9, ꙗ῎ко 9, оу῎ди́вїи 9. Here, as well, the other types of errors are confirmed only by isolated examples: pajerak mark: instead of правди 1/2 there is the incorrect пра́в дѝ 1/2; space between words: instead of ме 5 the incorrect ме́- 5; initials: instead of Вьнѥгд҃а 4 the incorrect ьнѥгд҃а 4; incorrect accent recognition: instead of и῎ 6 there is the incorrect и҆ 6.

The given examples of the most common errors show that, despite the high percentage of incorrectly recognized characters, after automatic post-correction of the transcripts (which would include accent mark removal using the Search/Replace chosen chars in transcript option), the Dionisio 1.0 model can also be very efficient in recognizing Serbian Church Slavonic books created in the printing houses of Jerolim Zagurović and Bartol Ginammi during the 16th and 17th centuries.
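A minimal sketch of such a post-correction step on plain-text transcripts (this mimics the effect of the Search/Replace option rather than reproducing Transkribus internals; the set of marks to remove is an illustrative assumption):

import unicodedata

ACCENTS = {
    '\u0300',  # combining grave accent (varia)
    '\u0301',  # combining acute accent (oxia)
    '\u030F',  # combining double grave accent
    '\u0311',  # combining inverted breve
    '\u0486',  # combining Cyrillic psili pneumata (spiritus lenis)
}

def strip_accents(text, marks=ACCENTS):
    # NOTE: a real workflow would carefully whitelist marks to keep
    # (e.g. the titlo), since removing all combining characters
    # would destroy linguistically meaningful superscripts.
    decomposed = unicodedata.normalize('NFD', text)
    kept = ''.join(ch for ch in decomposed if ch not in marks)
    return unicodedata.normalize('NFC', kept)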
The greatest number of errors in the automatic recognition of the text of the Lenten Triodion (1561) by Stefan of Scutari and the Prayer Book (Miscellany for Travellers) (1566) by Jakov of Kamena Reka also refers to the recognition of accent marks. However, what distinguishes these books from the books from the printing houses of Jerolim Zagurović and Bartol Ginammi is that accent marks are actually used, yet in different positions compared to the books from the printing house of Božidar and Vićenco Vuković on which the Dionisio 1.0 model was trained. To illustrate this claim, we will first use a comparative presentation of a part of sheet 3a of the Lenten Triodion (1561) by Stefan of Scutari and the text read automatically using the Dionisio 1.0 model.

Figure 3: The Automatically Read Text of a Part of Sheet 3a Lenten Triodion from 1561.

Errors in accent mark recognition: instead of валеща 1, соуѥ́тною 2, съкроуше́ннѣи҅ 2, срⷣцоу 2/3, тебѣ̀ 3, ѡ҆цѣ́- 3, ѻ῎ставлѥнїе 5, пѐщь сьтвори̏ 8, хал- 9/10, the model incorrectly outputs ва́леща̏ 1, соуѥ҆тною̀ 1, съкроу́ше́н нѣи҅ 2, срцоу́ 2/3, те́бѣ̀ 3, ѡ῎цѣ́- 3, ѻ῎ста́в лѥнїе 5, пе́щь́сь тво́ри̏ 8, ха́л - 9/10. Errors in recognizing spaces between words are also of high frequency: instead of да́- 1, цѹ зовоу́щоу 3, пѣⷭ з҃ 7, ꙋбо пѐщь сьтвори̏ 8, а῎г- 8, ст҃ыимъ дѣ́темь 9, the model incorrectly outputs да́ 1, цѹ́зовоу́щоу 3, пѣз҃ 7, ꙋбопе́щь́сь тво́ри̏ 8, а῎г 8, ст҃ыимъдѣ́темь 9. In a smaller number of examples, errors in recognizing the pajerak mark, superscript letters and the titlo mark can be found: instead of съкроуше́ннѣи҅ 2, ѻ῎ставлѥнїе 5, хал- 9/10, срⷣ- 2, пѣⷭ 7, the model incorrectly outputs съкроу́ше́н нѣи҅ 2, ѻ῎ста́в лѥнїе 5, ха́л - 9/10, ср- 2, пѣ 7.

A comparative presentation of a part of sheet 7a of the Prayer Book (Miscellany for Travellers) from 1566 and the text read automatically using the Dionisio 1.0 model displays similar errors.

Figure 4: The Automatically Read Text of a Part of Sheet 7a Prayer Book (Miscellany for Travellers) from 1566.

Errors in recognizing accents: instead of небо 1, землꙗ 1, похва́лите ю̏ 1, ѿчьствїа 1/2, і҆езыкь̏ 2, весе́лит 2, трьжьствꙋѥ῎ть 3, неплѡди 3, раждаѥ῎- 3, питател ницꙋ 4, жизны 4, на́шеѥ῎ 4, и 5, мꙋченикь 5, кондакь 6, the model incorrectly outputs не бо̀ 1, землꙗ̀ 1, похва́литею̀ 1, ѿчь́ствїа 1/2, і҆е҆зы́кь̏ 2, весе́лит 2, трь́жьствꙋѥ҆ть 3, неплѡ́ди 3, раждаѥ҆ 3, пи́тател ницꙋ 4, жи́зны 4, на́шеѥ 4, и῎ 5, мꙋче́никь 5, воѥ҆дакь 6. A certain number of errors is connected with recognizing spaces between words: instead of небо 1, похва́лите ю̏ 1, раждаѥ῎- 3, свѣт мѹ 5, the model incorrectly outputs не бо̀ 1, похва́литею̀ 1, раждаѥ҆ 3, свѣтмѹ 5. Several errors in recognizing letters may perhaps be related to the poor quality of the photograph: instead of сі 5, кондакь 6, the model incorrectly outputs сь 5, воѥ҆дакь 6.

The illustrated examples of the most frequent errors in the Lenten Triodion (1561) and the Prayer Book (Miscellany for Travellers) (1566) show that the Dionisio 1.0 model can be used for obtaining transcripts that can, after appropriate manual correction, be used for creating specific models for automatic text recognition of these two books.
3. Applying the Dionisio 1.0 Model on Books from Other Serbian Printing Houses of the 15th and 16th Centuries

In the second experiment, the performance of the Dionisio 1.0 model was tested on selected books from other printing houses of the 15th and 16th centuries (Cetinje, Goražde, Gračanica, Mileševa, Belgrade and Mrkša's Church). During the research, we started from the hypothesis that the model trained on the material of books from the Venetian Vuković printing house would be useful for books from other printing houses, since there is not as much orthographic variation in Serbian early printed books as there is in medieval manuscripts.

The results of the experiment are shown in the following table.

Book (Printing House, Year)                  CER
Octoechos, mode 1–4 (Cetinje, 1495)          8.24%
Psalter with Appendices (Goražde, 1519)      6.44%
Octoechos, mode 5–8 (Gračanica, 1539)        11.11%
Prayer Book (Euchologion) (Mileševa, 1546)   5.43%
Tetraevangelion (Belgrade, 1552)             11.28%
Tetraevangelion (Mrkša's Church, 1562)       12.06%

Table 4: Application of the Dionisio 1.0 model on publications from other printing houses in the 15th and 16th centuries.

Based on the previous table, it can be concluded that the Dionisio 1.0 model achieved the best results in the automatic recognition of the text of the Prayer Book (Euchologion) (1546) from the printing house of the Mileševa monastery and the Psalter with Appendices (1519) from the Goražde printing house. These results can be explained by the fact that the Prayer Book (Euchologion) (1546) had been printed in Mileševa with the same typographic characters as the Psalter with Appendices (1521) from Božidar Vuković's printing house, as well as by the fact that the Psalter with Appendices (1519) was printed in Goražde using typographic equipment imported from Venice (Lazić, 2020a).8

To illustrate the efficiency of the Dionisio 1.0 model, we may first use the comparative presentation of the photograph of a part of sheet 5b of the Prayer Book (Euchologion) (1546) from the printing house of the Mileševa monastery and the automatically read text in Figure 5.

Figure 5: The Automatically Read Text of a Part of Sheet 5b Prayer Book (Euchologion) from 1546.

In this book, as well, the greatest number of errors refers to accent mark recognition: instead of и҆ме́ни 4, и҆сти́ныи 4, ѥ҆ди́норо́днааго 4/5, ст҃аго 5, и ме 7, сподобльшаго 7, ѡ῎ноу̀ 10, the Dionisio 1.0 model incorrectly outputs и҆ме́нѝ 4, и῎сти́ныѝ 4, ѥ῎ди́норо́днаа̀го 4/5, ст҃а́го 5, и῎ ме́ 7, сподо́бльшаго 7, ѡ῎ноу 10. Other errors are fewer in number and relate to recognizing initials, spaces between words and the pajerak mark: instead of Ѡ и҆ме́ни 4, подь 9 и дрѣ́внꙋю̀ 10, the model incorrectly reads и҆ме́нѝ 4, по дь 9 и дрѣ́в нꙋю̀ 10.

Similar errors are indicated by the comparative illustration of the photograph of a part of sheet 35a of the Psalter with Appendices (1519) from the Goražde printing house and the text read automatically using the Dionisio 1.0 model.

Figure 6: The Automatically Read Text of a Part of Sheet 35a Psalter with Appendices from 1519.

8 Scholars likewise claim that the Psalter with Appendices (1519) and the Prayer Book (Euchologion) (1544) from the Goražde printing house could have been printed in Venice as well, which corresponds to the widespread practice of the time of placing a counterfeit place of printing in the colophons of editions (Lazić, 2020a).
The previous illustration demonstrates that the Dionisio 1.0 model makes the most frequent errors while recognizing accent marks: instead of пра́веднїи 14, подо́баѥть 15, по́хвала 15, исповѣ́даите се 15/16, ѱа́лтѝри 16/17, ѥ҆моу́ 17, добрѣ̀ 18, ѥ҆го 19, млⷭ 19, сꙋдь 19, the model incorrectly outputs пра́веднїѝ 14, подо́баѥ҆ть 15, похва́ла 15, и῎сповѣ́д аи҆те се 15/16, ѱа́л тѝрѝ 16/17, ѥ҆моу 17, до́брѣ̀ 18, ѥ҆го̀ 19, млⷭты́ню̀ 19, сꙋдь 19. The other errors pertain to recognizing spaces between words, the pajerak mark and initials: instead of пра́выи- 14, ѱа́лтѝ- 16, десе́тостроу́ннѣ 17, the model incorrectly reads пра́выи 14, ѱа́л тѝ 16, десе́то строу́ннѣ 17; instead of исповѣ́даите се 15/16, ѱа́лтѝ- 16 there is the incorrect и῎сповѣ́д аи҆те се 15/16, ѱа́л тѝ 16; instead of Рауⷣи́те се 14 there is the incorrect ПРауⷣи́те се 14. There is merely one example of an incorrectly recognized letter: instead of вьсклица́ни 19, the model incorrectly reads вь свлица́ни 19.

The Dionisio 1.0 model also shows a similar performance during the automatic recognition of the text of the oldest printed Serbian Church Slavonic book, the Octoechos, mode 1–4 (1495) from the Cetinje printing house. The percentage of unrecognized characters is somewhat higher than in the previous two books due to poor photo quality and issues with recognizing certain letters and punctuation marks. To illustrate the efficiency of the model, we will use a comparative presentation of a part of sheet 33b and the automatically read text in the following figure.

Figure 7: The Automatically Read Text of a Part of Sheet 33b Octoechos, mode 1–4 from 1495.

In this book, too, the largest number of errors in the automatic text recognition occurs with accent marks: instead of е҆сьмь 8, и῎спль́нь 9, на́ 9, мою 10, твои 11, бѣсѡ́вска̏го 11/12, и҆зба́ви 12, ꙗ҆ко 13, сьзданїе 13, неи҆зрече́нною 14, ни́щетою 14/15, зе́млѥѝ 15, ꙗ҆ко 15, вьзми 16, и҆ 16, the Dionisio 1.0 model incorrectly reads е҆сь́мь 8, и҆спль̑нь 9, на 9, мою̀ 10, твоѝ 11, бѣсѡ́ в ска̀го 11/12, и῎зба́ви 12, ꙗ῎ко 13, сьзда́нїе 13, неи῎зрече́нною 14, ни́ще̏тою̀ 14/15, зе́млѥѝ 15, ꙗ῎ко 15, вь́зми 16, и῎ 16. The issues with recognizing spaces between words and the pajerak mark can be illustrated by the following examples: instead of ѡ῎боу́рева- 8, наде́ждоу 10, бѣсѡ́- 11 there is the incorrect ѡ῎боу́рева 8, на де́ждоу 10, бѣсѡ́ 11; instead of бѣсѡ́вска̏го 11/12 there is the incorrect бѣсѡ́ в ска̀го 11/12. In this book, as we have already mentioned, the Dionisio 1.0 model likewise incorrectly recognizes certain letters and punctuation marks: instead of ѿ 8, ѕы́ждитель 13, мл҃срдь 16 there is the incorrect ѡ῎ 8, бы́ждитель 13, мл҃содь 16; instead of невиди́мыихь, 11, и῎зба́ви :·12 there is the incorrect невиди́мыихь · ⁘ 11 и ῎зба́ви :·12.

In the rest of the books listed in Table 4 (the Octoechos, mode 5–8 (1539) from Gračanica, the Tetraevangelion (1552) from Belgrade and the Tetraevangelion (1562) from Mrkša's Church), the CER is slightly higher, around 11–12%. The categories in which the Dionisio 1.0 model outputs errors are mostly the same in all three books, so we will only take a comparative presentation of a part of sheet 27b of the Octoechos, mode 5–8 (1539) from Gračanica and the automatically read text as an illustration.

Figure 8: The Automatically Read Text of a Part of Sheet 27b Octoechos, mode 5–8 from 1539.

The greatest number of errors is related to the recognition of accent marks: instead of бо́лѣзны 1, и҆ 1, 2, 5, 6, 8, мои 2, трьпишѝ 2, поно́сноѐ 2, чь вькоу́шаѐши 3, ѿѐмлѥ 3, прободе́нїемь 4, ꙗ῎звы 4/5, ꙗ҆ко 5, и҆сцѣлꙗе 5, вьспѣ́ваѐмь 5, твое 6, сла́вное хотѣ́нїе 6, покла́нꙗю́ще се 6, и҆миже 7, своѐмꙋ ми́- 7, ве́лїю 8, the Dionisio 1.0 model incorrectly outputs бо́лѣзны̏ 1, и῎ 1, 2, 5, 6, 8, моѝ 2, трь́пишѝ 2, поно́сное 2, чьвькоу́шае҆ши 3, ѡ῎е҆млѥ̏ 3, прободе́нїе҆мь 4, ꙗ҆звы 4/5, ꙗ῎ко
5, и҆сцѣ́лꙗѐ 5, вьспѣ́вае҆мь 5, твоѐ 6, сла́вное҆хотѣ́нїе 6, покла́нꙗю҆ще се 6, и῎ ми́же 7, своѐ м ми́- 7, ве́лїю 8. Recognizing spaces between words represents a problematic issue in a multitude of cases: instead of жль- 2, чь вькоу́шаѐши 3, сла́вное хотѣ́нїе 6, чьте́мь 6, и҆миже 7, своѐмꙋ ми́- 7, роу́коположе́нїе 8, the model incorrectly outputs жлⷭв 2, чьвькоу́шае҆ши 3, сла́вное҆хотѣ́нїе 6, чь те́мь 6, и῎ ми́же 7, своѐ мꙋми́- 7, роу́ко положе́нїе 8. The other errors pertain to the recognition of superscript letters and the pajerak mark, as well as regular letters in a few examples: instead of жль- 2, … 8, the model outputs жлⷭв 2, млтⷣь 8; instead of тѣ́мже 5 there is the incorrect тѣ́м же 5; instead of поноше́нїа 1, жль- 2, ѿѐмлѥ 3, the model reads попоше́нїа 1, жлⷭ 2, ѡ῎е҆млѥ̏ 3.

The quantitative and qualitative analysis conducted in this chapter demonstrates that the Dionisio 1.0 model recognizes the text of the Serbian Church Slavonic books created in other printing houses of the 15th and 16th centuries with varying degrees of success. The quantitative analysis shows that the lowest CER was recorded in books from the Mileševa and Goražde printing houses, which is expected considering the fact that these books were printed using typographic printing equipment from Venice. An acceptable CER was noted during the recognition of the Octoechos, mode 1–4 (1495) from the Cetinje printing house, while the percentage exhibited in the books from the other printing houses (Belgrade, Gračanica, Mrkša's Church) underscores the need for training a new version of the generic model with improved performance. The qualitative analysis showed that the Dionisio 1.0 model usually makes errors when recognizing accent marks, but also when recognizing spaces between words. The errors in recognizing superscript letters, the pajerak mark, initials and regular letters are far less common.

4. Creation and evaluation of the generic model Dionisio 2.0

When creating a new version of the model, we started from the transcripts of the Serbian Church Slavonic books listed in Table 4, obtained using the Dionisio 1.0 model. By means of manual correction of the transcripts, the Ground Truth9 data was obtained for training the generic model Dionisio 2.0. In accordance with our findings on the interdependence of model success and the amount of training data (Polomac, 2022), as well as similar findings for Church Slavonic books from the Berlin State Library (Neumann, 2021), the goal was set to provide a critical mass of at least 10,000 words for each printed book in order to train the generic model Dionisio 2.0. While training the generic model Dionisio 2.0, we used the Ground Truth data prepared for the Dionisio 1.0 model (see Table 1), as well as the new Ground Truth data from Serbian Church Slavonic books printed in other printing houses of the 15th and 16th centuries, listed in the following table.

9 The term Ground Truth Data in machine learning refers to completely accurate data used to train the model. In our case, these would be exact transcripts of the digital photographs of the manuscript. For more details on this term, see the Transkribus Glossary at https://readcoop.eu/glossary/ground-truth/.

Book (Printing House, Year)                  Word count
Octoechos, mode 1–4 (Cetinje, 1495)          15,667
Psalter with Appendices (Goražde, 1519)      16,445
Octoechos, mode 5–8 (Gračanica, 1539)        15,179
Prayer Book (Euchologion) (Mileševa, 1546)   15,003
Tetraevangelion (Belgrade, 1552)             15,333
Tetraevangelion (Mrkša's Church, 1562)       15,733

Table 5: The Dionisio 2.0 model – Ground Truth data from other printing houses of the 15th and 16th centuries.

The performance of the generic model Dionisio 2.0 is shown in the following table.

Word count   Number of epochs   CER on Train set   CER on Validation set
176,481      200                2.03%              2.44%

Table 6: Performance of the generic model Dionisio 2.0.

In order to compare the performance of the two models, we tested them on ten sheets from the Psalter with Appendices (1495) from the Cetinje printing house and the Hieraticon (1521) from the Goražde printing house, the two representing Serbian Church Slavonic books that did not form part of the material for training the model. The results of the experiments are shown in the following table.

Book (Printing House, Year)               Dionisio 1.0 CER   Dionisio 2.0 CER
Psalter with Appendices (Cetinje, 1495)   5.71%              1.50%
Hieraticon (Goražde, 1521)                9.38%              4.61%

Table 7: Comparing the Performance of the Two Models on Books from the Cetinje and Goražde Printing Houses.

As can clearly be seen from the previous table, the Dionisio 2.0 model displays significantly better results than the Dionisio 1.0 model. To illustrate the exceptional efficiency of the Dionisio 2.0 model, we provide a comparative presentation of a part of sheet 3b of the Psalter with Appendices (1495) from the Cetinje printing house and the automatically read text in Figure 9.

As we can see in the figure, the Dionisio 2.0 model errs only in a few examples in which the spiritus lenis and the perispomena are insufficiently clearly differentiated: instead of ю̑нїи 8, пою̑ть 9, и҆сти́ною̑ 10, нака́зоую̑ть 11, the model incorrectly outputs ю҆нїи 8, пою҆ть 9, и҆сти́ною҆ 10, нака́зоую҆ть 11. There is a single example of the model mixing the spiritus lenis and the oxia: instead of ѿи҆де 13 there is the incorrect ѿи́де 13. The space between words was also
incorrect in one example only: instead of мнѣ́ти 9 there is the incorrect мнѣ́ ти 9. In the other examples on the shown part of sheet 3b, the Dionisio 2.0 model regularly recognizes letters, spaces between words, the titlo and accent marks. The exceptional efficiency of the Dionisio 2.0 model in recognizing the Psalter with Appendices (1495) from the Cetinje printing house, especially compared to the Hieraticon (1521) from the Goražde printing house, results from the fact that there are no superscript letters in the Psalter with Appendices (1495), while accent marks are given in the expected positions.

Figure 9: The Automatically Read Text of a Part of Sheet 3b Psalter with Appendices (1495).

On the other hand, superscript letters, as well as accent marks found frequently in unexpected positions, are present in the Hieraticon (1521) from the Goražde printing house, which definitely leads to a somewhat less favourable CER for this book. To illustrate this, we shall use the comparative presentation of a part of sheet 9b and the automatically read text in the following figure.

Figure 10: The Automatically Read Text of a Part of Sheet 9b Hieraticon (1521).

The previous illustration points to the fact that the Dionisio 2.0 model makes errors almost exclusively during accent mark recognition. Thus, instead of рабѣ̀ 1, бж҃їе҅мь 1, мⷬе 1, и҆гоу́меноу 2, и́ 3, на́шеи 4, и҆х 4, призовы̀ 4/5, твое҅ 5x2, приѡ҆бще́нїе 5, бл҃госрь́дїе҅ 5/6, вьсѐбл҃гы̏ 6, the model incorrectly reads ра́бѣ̀ 1, бж҃їе҆мь 1, мⷬе́ 1, и῎гоу́меноу 2, и҆ 3, на́шеѝ 4, и῎х 4, призовы 4/5, твоѐ 5x2, приѡ῎бще́нїе 5, бл҃госрь́дїѐ 5/6, вьсе бл҃гы̏ 6. Along with the aforementioned errors, there are a few examples of incorrect recognition of spaces between words: instead of ѻ҆ бра́тїи 2, сь слꙋ- 2, дїа́конѣ- 3, вьсѐбл҃гы̏ 6, the model reads ѻ҆бра́тїи 2, сьслꙋ- 2, дїа́конѣ 3, вьсе бл҃гы̏ 6.

5. Concluding Remarks

The research showed how the Transkribus software platform, based on the principles of machine learning and artificial intelligence, can be used to create efficient models for automatic text recognition of Serbian Church Slavonic printed books from the end of the 15th to the middle of the 17th century. Having in mind the limitations of the Dionisio 1.0 model in the automatic recognition of the text of the Serbian Church Slavonic books printed outside Venice, the paper describes the process of creating the generic model Dionisio 2.0, capable of recognizing Serbian Church Slavonic printed books as a whole. The generic model Dionisio 2.0 was trained on the material of the Serbian Church Slavonic books printed in various Serbian printing houses of the 15th and 16th centuries: Cetinje, Venice, Goražde, Gračanica, Mileševa, Belgrade and Mrkša's Church. The quantitative analysis of the performance of this model showed that it can be used to automatically obtain transcripts with a minimal percentage of incorrectly recognized characters (about 2–3%). Most frequently, the CER depends on the quality of the photographs of the book, the frequency of use of accent marks and superscripts, and the correct use of accent marks in the appropriate positions. Using the Dionisio 2.0 model, transcripts of Serbian Church Slavonic printed books can be obtained automatically; after being edited by a competent philologist, these can be used for further philological and linguistic research, primarily for creating searchable digital editions of books, as well as electronic corpora, thus creating opportunities for diachronic research of Serbian early modern literacy on a large quantity of data. In the near future, the generic model Dionisio 2.0 will become publicly available to all users of the Transkribus software platform, which will enable further improvement of its performance and could ultimately lead to the creation of a generic model for automatic text recognition of Church Slavonic printed books as a whole.

6. Acknowledgment

The research conducted in the paper was financed by the Ministry of Education, Science and Technological Development of the Republic of Serbia, contract no. 451-03-68/2022-14/200198, as well as by the German Academic Exchange Service (DAAD) within the project Automatic Text Recognition of Serbian Medieval
Manuscripts and Early Printed Books: Problems and Perspectives.

7. References

Constanța Burlacu and Achim Rabus. 2021. Digitising (Romanian) Cyrillic using Transkribus: new perspectives. Diacronia, 14:1–9.
Miroslav Lazić. 2018. Od Božidara Vukovića do Dionizija dela Vekije: identitet i pseudonim u kulturi ranog modernog doba. In: Anatolij A. Turilov et al., eds., Scala Paradisi, pages 165–185, SANU, Beograd.
Miroslav Lazić. 2020a. Inkunabule i paleotipi: srpskoslovenske štampane knjige od kraja 15. do sredine 17. veka. In: Vladislav Puzović and Vladan Tatalović, eds., Osam vekova autokefalije Srpske pravoslavne crkve, Vol. 2, pages 325–344. Sveti arhijerejski sinod Srpske pravoslavne crkve–Pravoslavni bogoslovski fakultet, Beograd.
Miroslav Lazić. 2020b. Between an Imaginary and Historical Figure: Božidar Vuković's Professional Identity. Ricerche Slavistiche, 43:141–156.
Günther Mühlberger, L. Seaward, M. Terras, S. Oliveira Ares, V. Bosch, M. Bryan, S. Colluto, H. Déjean, M. Diem, S. Fiel, B. Gatos, A. Greinoecker, T. Grüning, G. Hackl, V. Haukkovaara, G. Heyer, L. Hirvonen, T. Hodel, M. Jokinen, P. Kahle, M. Kallio, F. Kaplan, F. Kleber, R. Labahn, M. Lang, S. Laube, G. Leifert, G. Louloudis, R. McNicholl, J. Meunier, J. Michael, E. Mühlbauer, N. Philipp, I. Pratikakis, J. Puigcerver Pérez, H. Putz, G. Retsinas, V. Romero, R. Sablatnig, J. Sánchez, P. Schofield, G. Sfikas, C. Sieber, N. Stamatopoulos, T. Strauss, T. Terbul, A. Toselli, B. Ulreich, M. Villegas, E. Vidal, J. Walcher, M. Wiedermann, H. Wurster, and K. Zagoris. 2019. Transforming scholarship in the archives through handwritten text recognition. Journal of Documentation, 75(5):954–976.
Vladimir Neumann. 2021. Deep Mining of the Collection of Old Prints Kirchenslavica Digital. Scripta & e-Scripta, 21:207–216.
Mitar Pešikan. 1994. Leksikon srpskoslovenskog štamparstva. In: Mitar Pešikan et al., eds., Pet vekova srpskog štamparstva 1494–1994: razdoblje srpskoslovenske štampe XV–XVII, pages 71–218, Narodna biblioteka Srbije–Matica srpska, Beograd.
Vladimir Polomac. 2022. Serbian Early Printed Books from Venice: Creating Models for Automatic Text Recognition using Transkribus. Scripta & e-Scripta, 22 [in print].
Achim Rabus. 2019a. Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach using Transkribus. Scripta & e-Scripta, 19:9–32.
Achim Rabus. 2019b. Training Generic Models for Handwritten Text Recognition using Transkribus: Opportunities and Pitfalls. In: Proceedings of the Dark Archives Conference, Oxford, in print.
Lemmatization and Morphosyntactic Annotation in the SentiCoref Corpus

Eva Pori,* Jaka Čibej,* Tina Munda,† Luka Terčon,† Špela Arhar Holdt*†
* Filozofska fakulteta, Univerza v Ljubljani, Aškerčeva 2, 1000 Ljubljana
eva.pori@ff.uni-lj.si; jaka.cibej@ff.uni-lj.si; spela.arharholdt@ff.uni-lj.si
† Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, Večna pot 113, 1000 Ljubljana
tina.munda@fri.uni-lj.si; luka.tercon@fri.uni-lj.si

Abstract
The paper presents the process and the results of the manual review of lemmas and MULTEXT-East v6 morphosyntactic tags in the SentiCoref corpus, which is planned to be included in the new Slovene training corpus (currently known as ssj500k) as part of the "Development of Slovene in a Digital Environment" (RSDO) project. The paper describes the workflows of the annotation campaign, which was among the most extensive campaigns of this type in Slovenia, the annotation dilemmas that revealed gaps in previous versions of the annotation guidelines, as well as the resulting solutions that will be useful in future annotation campaigns.
1. Introduction

Between 2020 and 2023, the applied project Development of Slovene in a Digital Environment (RSDO) is being carried out with the support of the Ministry of Culture of the Republic of Slovenia and the European Regional Development Fund.1 Among the goals of the project is an infrastructure for the continuous construction of Slovene corpora: workflows for ongoing text collection, an annotation pipeline, documentation for annotation at various linguistic levels, and several new tools for the manual annotation and review of corpus data. As the fundamental language resources for the development of the pipeline for the automatic annotation of contemporary Slovene, the upgrade also includes the Sloleks lexicon of word forms (Dobrovoljc et al., 2019) and the ssj500k training corpus (Krek et al., 2020), to which the present paper relates.

Version 2.3 of the ssj500k training corpus (Krek et al., 2021) contains 27,829 sentences, or 500,295 word tokens, annotated at levels ranging from sentence segmentation, tokenization, lemmatization, morphology and morphosyntax, through dependency syntax, named entities and multi-word lexemes, to semantic roles. As is typical of training corpora, the linguistic annotations are manually reviewed, which ensures the reliability needed for the supervised training of automatic procedures. The results are also affected by the size and representativeness of the material, which is why the main goal of the upgrade is to enlarge the training corpus to 1,000,000 word tokens. Within the project, a limited number of newly annotated sentences will be prepared for the higher, more complex levels of annotation, while the basic levels will be manually reviewed for all new material.

In this paper, we present an annotation campaign in which we manually reviewed and corrected the tokenization, segmentation, lemmas and MULTEXT-East morphosyntactic tags (Erjavec, 2012) in the SentiCoref 1.0 corpus (Žitnik, 2019), which represents approximately 76% of the planned enlargement of the training corpus.2 SentiCoref contains texts from news portals into which coreference and named entity annotations have been manually entered, and it answers the need to include in the training corpus material that allows the annotation of linguistic features across sentence boundaries (Arhar Holdt and Čibej, 2021).

The aim of this paper is to describe the work and the results, and especially the annotation dilemmas at the levels of lemmas and morphosyntax, which revealed certain gaps in the reference annotation guidelines (Holozan et al., 2008), as well as the solutions that we devised during the work and that can be used for comparable tasks in the future. At the end of the RSDO project, the new training corpus, together with the upgraded annotation guidelines, will be openly available in the CLARIN.SI repository.

2. Previous and related work

The ssj500k training corpus has been developed as a reference resource for the supervised training of automatic linguistic annotation of contemporary written Slovene for more than a decade (Krek et al., 2020). Various taggers have been trained on this corpus so far, e.g. Obeliks (Grčar et al., 2012), ReLDI (Ljubešić and Erjavec, 2016), the neural tagger developed by Belej (2018), and CLASSLA StanfordNLP (Ljubešić and Dobrovoljc, 2019), which is being further developed within the RSDO project.

The beginnings of the training corpus date back to the MULTEXT-East project, which stimulated the development of a system for the morphosyntactic annotation of (among other languages) Slovene (Dimitrova et al., 1998). The tag system was revised and upgraded within the project Linguistic Annotation of Slovene (JOS), in which the jos100k corpus was created (Erjavec and Krek, 2008).

1 The website presenting the project goals and the participating partners: https://slovenscina.eu/.
2 For the remaining 24%, a variety of text sets is planned, which will provide (a) the foundations for semantic annotation, such as the Slovene part of the parallel Elexis-WSD corpus (Martelli et al., 2022), (b) selected under-represented text types, e.g. tweets as user-generated web content, and (c) ambiguous word forms that are rare in use: homographic pronouns, dual forms, etc. (Arhar Holdt and Čibej, 2021: 49–50).
Subsequently, in the project Communication in Slovene (Sporazumevanje v slovenskem jeziku), an additional 400,000 words were reviewed, and reference guidelines for the annotation of lemmas and morphosyntax according to the JOS (MULTEXT-East v4) system were prepared (Holozan et al., 2008). The current version of the corpus contains tags of the MULTEXT-East v6 system, which at the system level comprises 1,900 possible tags carrying information on the part of speech and various lexico-grammatical features, such as gender, case, number and properhood for nouns.3
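For illustration (our example, following the tagset documentation cited in footnote 3): in the English-style notation of this system, the MSD Ncmsn decomposes into Noun, common, masculine, singular, nominative.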
SentiCoref 1.0 (Žitnik, 2019) is a corpus of 837 texts, or approximately 433,000 tokens, sampled from the SentiNews 1.0 corpus (Bučar, 2017). Although SentiCoref 1.0 does not directly contain the same sentiment annotations as SentiNews 1.0, the two corpora can be linked to each other. SentiCoref 1.0 also contains annotations of named entities (persons, organizations and locations) and of coreferences to named entities, together with coreference chains annotating the sentiment of each entity. SentiCoref 1.0 is openly available under the CC BY 4.0 license in the CLARIN.SI repository, in the tabular TSV3 format supported by the annotation tool INCEpTION (Klie et al., 2018), the successor of WebAnno (Eckart de Castilho et al., 2014).
3. Preparation for annotation

3.1. Data preparation

SentiCoref 1.0 is tokenized, but contains no lemmas or morphosyntactic tags. As far as the division into tokens is concerned, SentiCoref 1.0 was not designed with potential additional levels of linguistic annotation in mind, so in some cases it deviates from the tokenization rules currently used for corpus annotation in Slovenia (the classla tagger4 and the Obeliks tokenizer5 included in it), e.g. in the splitting of abbreviations ("STA-jev" > "STA", "-", "jev") and numerals ("2,356" > "2", ",", "356"). Likewise, the tokenization in SentiCoref 1.0 contains no information about whitespace. Before the automatic morphosyntactic annotation and the manual correction of the morphosyntactic tags, it was therefore first necessary to correct the tokenization (and, in parallel, the automatic lemmatization) and to split the text into sentences (sentence segmentation). For the review, we prepared the corpus in a tabular format in Google Sheets, since INCEpTION does not support changes to tokenization. The tokenization was corrected entirely manually, while the sentence segmentation was first assigned automatically (on the basis of punctuation) and then manually reviewed and confirmed.

In the review of the segmentation, 17,095 automatically assigned sentence ends were manually confirmed as correct (with the agreement of three reviewers and the confirmation of the final adjudicator, or curator). 2,528 sentence ends were assigned manually by the reviewers: for 2,151 of them all reviewers (and the curator) agreed, for 275 two reviewers agreed, and for 156 the sentence end was marked by only one reviewer. 2,992 sentence ends were confirmed as incorrect; of these, 1,409 had been marked automatically, 940 manually with complete agreement among the three reviewers, 167 manually with the agreement of two reviewers, and 476 manually with the mark of only one reviewer. Most of the cases in which the adjudicator rejected the reviewers' decisions concern corrections of tokenization and lemmas, e.g. when the reviewers marked as a sentence end a full stop that is actually part of an abbreviation ("d.o.o", "." > "d.o.o.").

On the corrected and properly segmented corpus, we annotated the lemmas and morphosyntactic tags with the CLASSLA StanfordNLP tagger, version 0.0.11.6
strinjali vsi pregledovalci (in kurator), pri 275 po dva, pri
Uvodni
teden
pregledovanja
je bil namenjen
156 pa je konec povedi označil le en pregledovalec. 2.992
poglobljeni seznanitvi s smernicami in razreševanju
koncev povedi je bilo potrjenih kot neustreznih; od tega
potencialnih
nejasnosti,
zato
je
bilo
vsakemu
jih je bilo 1.409 označenih avtomatsko, 940 ročno s
pregledovalcu dodeljenih le 5 datotek. Število datotek se
popolnim ujemanjem med tremi pregledovalci, 167 ročno
je postopoma zviševalo do 20 tedensko, hkrati pa smo
z ujemanjem dveh pregledovalcev, 476 ročno z oznako le
okretnejšim
ali
bolj
časovno
razpoložljivim
enega pregledovalca. Pri večini primerov, v katerih je
pregledovalcem
omogočili
večji
obseg
dela
razsojevalec zavrnil odločitve pregledovalcev, gre za
(individualizirani pristop). Analiza (ne)ujemanja med
popravke tokenizacije in lem, ko so pregledovalci npr. kot
tremi vzporednimi pregledovalci je predstavljala izhodišče
za 2. fazo – kuracijo.
3 The tagset is described at http://nl.ijs.si/ME/V6/msd/.
4 https://github.com/clarinsi/classla
5 https://github.com/clarinsi/obeliks
6 https://pypi.org/project/classla/0.0.11/
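The per-token (dis)agreement triage that fed the curation phase can be sketched as follows; this is a hypothetical minimal example, assuming the three reviewers' tag choices are available for each token:

from collections import Counter

def triage(tag_a, tag_b, tag_c):
    """Classify a token by reviewer agreement; returns (status, majority tag)."""
    tag, n = Counter([tag_a, tag_b, tag_c]).most_common(1)[0]
    if n == 3:
        return 'unanimous', tag        # no curator attention needed
    if n == 2:
        return 'majority', tag         # curator confirms or overrides
    return 'disagreement', None        # curator decides from scratch

print(triage('Sozei', 'Sozei', 'Slmei'))  # -> ('majority', 'Sozei')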
4.1.2. Curation
The individual reviewers' decisions were merged into a single table, so that all 3 decisions were displayed next to each token, with the tags on which the reviewers disagreed specially marked. The curators' task was to review precisely these tokens and assign them a final tag. 7 curators were selected from among the reviewers, one for each part of speech or group of parts of speech. The annotation campaign was completed in 12 weeks, of which four were devoted exclusively to curation.

4.2. Annotation dilemmas
During curation, we identified two types of annotation problems: (a) cases where the annotation guidelines were clear but were not consistently followed, and (b) cases that turned out to be more demanding: poorly covered in the guidelines and in places also treated inconsistently in the existing ssj500k 2.3.7 We analysed the problems of the first type, removed the inconsistencies and annotated the cases in accordance with the guidelines. Some more information on typical problems of this kind is summarised in Section 4.3. Particular attention was paid to the second group of problems, which we identified as more complex and demanding, since solving them required reflection on open questions at the level of lemmatisation and morphosyntax (also) in the ssj500k corpus and, consequently, an upgrade of the annotation guidelines. Below we present these problems; in Section 5, we present the proposed changes to the guidelines.

4.2.1. Common-noun overlap in names of things
The reviewers struggled with the rule that in names of things (companies, media, works, etc.), where the proper noun is formally identical to a common noun, both the lemma and the morphosyntactic tag are those of the common noun. It follows that Slovenian names of companies, newspapers, magazines, books, as well as television shows, series or films, etc. are lemmatised with a lower-case initial: e.g. podjetje Iskra [iskra, Sozei]; časnik Delo [delo, Sosei]. The reviewers repeatedly had to be reminded to look for such overlap, which is often not straightforward because of the semantic distance of the common-noun "equivalent" (see also 4.2.3), since it was more intuitive for them to keep the capitalised lemma. They also had to be reminded that, by convention, the overlap principle applies only to nouns (stranka Zares [Zares, Slmei]). We observed fewer problems in the review of those names of things that had no overlapping common-noun lemma and were lemmatised with a capital initial (podjetje Mercator [Mercator, Slmei]).

4.2.2. Possessive adjectives derived from proper names
The part of the rule stating that possessive adjectives derived from personal or geographical proper names keep a capitalised lemma (Aškerčeva ulica > lemma: Aškerčev) was clear; more dilemmas arose in the review of those denominal possessive adjectives that are written with a lower-case initial in actual usage, or are shifting towards lower-case spelling because they do not denote true possession (Parkinsonova bolezen > lemma: parkinsonov).

In the lemmatisation of such adjectives within names of things, the reviewers were confused by the divergent treatment of cases in the ssj500k corpus (Delova dopisnica > lemma: Delov vs. Magov novinar > lemma: magov), so this part of the rule, which was not explained in the initial guidelines, had to be clarified separately.

4.2.3. Foreign names of things
Since the principle of overlap with common nouns (see 4.2.1) applies primarily to Slovenian nouns, the question frequently arose of which words to treat as Slovenian (borrowed words that inflect with Slovenian morphology are always classified as Slovenian; if there is no attested inflection in usage, the decision has to be made on the basis of other criteria). The dilemmas mainly concerned: (a) borrowed words that often appear as parts of foreign-language names of otherwise Slovenian companies (the leasing, holding type) and (b) other common words in foreign-language phrases that are formally identical to Slovenian common nouns but often do not meet the criterion of semantic overlap (the trans, global type).

We also examined in more detail the group of names of things of the type Zagrebačka banka, Večernji list. Since these are Croatian names which, owing to the closeness of Croatian to Slovenian, in places contain vocabulary identical to Slovenian, the reviewers faced the dilemma of whether to mark both the adjective and the noun as Slovenian words, thereby assigning the adjective a lemma that does not exist in Slovenian, or to classify (at least) the adjective as foreign vocabulary.

4.2.4. Distinguishing adjectives from adverbs
The reviewers' decisions frequently diverged on cases exhibiting typical predicative use of adjectives, i.e. on adjectival forms identical to the base adverbial form. The guidelines already contained a general instruction on annotating adjectives that can appear in attributive or predicative use (Sledil je prelomni korak > adjective as premodifier; uradno še ni rehabilitiran > adjective as predicative complement), as well as a rule for distinguishing adjectives from adverbs in adjectival strings (uradno prečiščeno besedilo > adverb). However, they did not address the difference between the adjectival and the adverbial lemma in certain more demanding cases (e.g. smotrno, potrebno, mogoče, možno in examples such as bi bilo smotrno, da bi [...]), which also turned out to be inconsistently annotated in the ssj500k corpus: we often found an adverbial lemma instead of the conventionally correct adjectival one. These discrepancies were the starting point for further analyses, which included a re-examination of all cases (in the SentiCoref corpus) with identical adjectival and adverbial forms and the formulation of an extended rule for assigning adjectival and adverbial lemmas.

4.2.5. Indeclinable premodifiers (the bruto, solo type)
The reviewers had difficulty understanding the instruction in the initial guidelines that cases of the bruto, solo type (e.g. solo uspeh, rast bruto zadolževanja, info točka) that are declinable are treated as nouns, while those that are not are treated as adjectives. Above all, the instruction does not make clear how (in)declinability should be tested and what should guide the decision (systemic possibility vs. attested usage).

7 The Holozan et al. (2008) guidelines are an accepted and widely applied annotation standard in Slovenia, so we followed them to the greatest possible extent. The extension of the guidelines prepared in the RSDO project also stays within the established conceptual framework. Any more radical changes to the annotation system – where the most prominent question is the (orthographic, though not morphological) lemmatisation of various nouns and other parts of speech with an upper- or lower-case initial – require broader reflection, which we outline in Section 6.
4.2.6. Adverbial phrases (the na novo type)
There were also problems with the treatment of so-called adverbial phrases, i.e. with the annotation of the non-prepositional part of these phrases. The guidelines indirectly suggest that the annotation should tend towards adjectival lemmas (na drobno > lemma: droben), yet the case of v živo, for instance, showed that in the ssj500k corpus all such cases were annotated as adverbial (v živo > lemma: živo). On the basis of this discrepancy we carried out a more detailed analysis and discovered several instances of non-uniform annotation of cases of the same kind.

4.3. The reviewed data
The analysis of the corrections after the review and curation shows that the share of corrections made matches the expected error rate of automatic annotation of Slovenian texts with the CLASSLA StanfordNLP tagger (Ljubešić and Dobrovoljc 2019: 31–32). At the level of lemmatisation, a total of 5,588 lemmas were corrected, approximately 1.3% of all tokens in the corpus, which is consistent with the roughly 98% accuracy of lemmatisation. At the level of morphosyntactic tags there were 12,586 corrections in total, i.e. 2.9% of all tags in the corpus (with the accuracy of morphosyntactic tagging at almost 97%).

Among the most frequent lemma corrections were proper nouns formally identical to common nouns (e.g. Luka Koper > lemma: luka), abbreviations consisting of one or two letters (e.g. dr. > lemma: dr.), and words with overlapping forms in the morphosyntactic paradigm (e.g. delo and del). The corrections of morphosyntactic tags mostly involved the distinction between common and proper nouns (the Leasing – leasing type; 1,538 corrections or 12%; in the opposite direction, from common to proper, there were fewer corrections: 235 or 1.8%), between masculine and feminine gender (825 corrections or 6.6%; these concern, for example, the names of certain political parties, such as Desus), and between overlapping nominative, accusative and genitive forms (a total of 1,617 corrections or 12.8% for nouns; e.g. inanimate masculine nouns such as odbor, posel in the nominative and accusative). At the level of parts of speech, the most frequent issues were the difficult distinctions between formally identical adverbs and coordinating conjunctions (e.g. tako; 130 corrections or 1.1%), between proper nouns and unclassified foreign-language expressions (e.g. Amnesty International; 118 corrections or 1.0%), and between particles and coordinating conjunctions (e.g. sicer, niti, ne; 97 corrections or 0.7%). Since the number of corrections was relatively small, it might make sense in future annotation campaigns to focus only on the most frequent expected errors. The most frequent dilemmas and problems listed in this section can serve as a guide.

5. Upgrading the annotation guidelines
Based on the analysis of the most frequent annotation dilemmas and on the review of the annotation decisions in the ssj500k corpus, we prepared solutions for (further) review and extended the guidelines for the problematic categories listed in Section 4.2. The upgraded guidelines will be published at the end of the RSDO project.

I. Common-noun overlap in names of things: we supplemented the general principle – that names of things formally identical to a common noun are tagged as common nouns and lemmatised with a lower-case initial, while the rest, which show no such overlap, are lemmatised with a capital – with concrete usage examples. We selected the categories that caused the most problems (companies and newspapers), e.g. O tem, da so bile v Iskri [iskra, Somem] potrebne spremembe, so čivkali že vrabci na veji.; Večino hrane kupimo v Mercatorju [Mercator, Slmem] ali Intersparu [Interspar, Slmem].; Kot smo poročali v prejšnji številki Mladine [mladina, Sozer].

II. Possessive adjectives derived from proper names: we added rules for upper- and lower-case use to the guidelines, with examples:
(a) Adjectives from personal and geographical proper names: in principle, we keep the capitalised lemma; cases that are written with a lower-case initial in usage, or are shifting towards lower case because they do not denote true possession, are lemmatised with a lower-case initial, e.g. Celjska občina je prejšnji teden objavila razpis za najem vile v Aškerčevi [Aškerčev, Psnzem] 7 v Celju.; Gre za zdravilo za zdravljenje parkinsonove [parkinsonov, Psnzer] bolezni.
(b) Adjectives from names of things: we additionally defined the principle of lemmatisation for cases of the type Delova dopisnica > lemma: Delov and Magov novinar > lemma: magov. In cases where the overlap was systemically possible but unattested in actual usage, we kept the capital initial, e.g. S tega stališča je polemika z Mladininim [Mladinin, Psnmeo] doktorjem sociologije že skorajda na robu smiselnega (the common noun mladina does exist, but the possessive adjective is extremely rare in usage, i.e. it has a single occurrence in the reference corpus Gigafida 2.0). The opposite holds for cases showing more frequent use of the possessive adjective, e.g. vsi pa občudujejo njegovi operi Jevgenij Onjegin in Pikova [pikov, Psnzei] dama.
(c) Adjectives in -ski, -ški as parts of geographical proper names: we lemmatise them with a lower-case initial, noting in particular the contrast with cases of the type Kranjska, Štajerska, etc.: names of regions are nouns and are lemmatised with a capital, e.g. V Vinski kleti Goriška [goriški, Ppnsmi] Brda zadovoljni s poslovanjem v minulem letu; Črnivec je poleg prelaza Volovjek najsevernejši cestni prehod, ki povezuje Kranjsko [Kranjska, Slzet] in Štajersko [Štajerska, Slzet].
(č) General adjectives as parts of geographical proper names: we lemmatise them with a lower-case initial (the nov, spodnji type); if they do not exist in general usage, we keep the capital, e.g. Britanija, Avstralija in Nova [nov, Ppnzei] Zelandija; Mlekarna Celeia iz Arje [Arji, Ppnzer] vasi je namreč edina domača mlekarna v večinski lasti zadrug.

III. Foreign names of things: after consulting the wider project team, we decided that we would also "look for" formally identical common nouns in foreign-language multi-word names of things. Two criteria are decisive here: inflection in usage and borrowedness (presence in reference handbooks), e.g. Hypo Leasing [leasing, Somei], Infond Holding [holding, Somei] – but not necessarily the criterion of semantic overlap, since in some cases the foreign word in a name has a meaning similar to that of the (formally identical) Slovenian word and in others it does not. Cases that overlapped formally but not semantically were collected in a separate list and, after analysis, we decided to treat them all as common nouns, e.g. Trade Trans [trans, Somei] Invest, Prevent Global [global, Somei].
We also added to the guidelines a decision on the annotation of foreign names from languages closely related to Slovenian (the Večernji list type): for the nouns, we apply the principle of overlap with Slovenian common-noun vocabulary, while the adjectives are treated as foreign vocabulary, whose lemma remains identical to the word form, e.g. Jutarnji [Jutarnji, Nj] list [list, Somei], Zagrebačka [Zagrebačka, Nj] banka [banka, Sozei].

IV. Distinguishing adjectives from adverbs: in defining the difference between an adjective and an adverb in the role of predicative complement, the guidelines now highlight the syntactic criterion. The definition – a word in the role of predicative complement is an adverb if it can be omitted from the clause, and an adjective if it is indispensable (obligatory) – was supported with examples, e.g. O tem ni *(mogoče) [mogoč, Ppnsei] sklepati.; (Mogoče) [mogoče, Rsn] ste ga vznemirili.

V. Indeclinable premodifiers (the bruto, solo type): in treating indeclinable premodifiers, it makes sense to rely on checking their declinability in actual usage. We formulated the rule that if we find confirmation in a reference corpus that a given case can inflect as a noun, we adopt that option; if we find no such confirmation, we consistently treat the case as an adjective: so se do konca leta povprečne neto [neto, Ppnmein] plače realno povečale za okoli 33 odstotkov.

VI. Adverbial phrases (the na novo type): we added an explicit rule to the guidelines that cases of this type are treated as combinations of a preposition and an adjective. Using the cases that caused the reviewers the most problems, we illustrated that the non-prepositional part of the phrase is thus treated as an adjective and not as an adverb, e.g. Če bi se na [na, Dt] hitro [hiter, Ppnset] ozrl, bi videl, da ga zasledujejo.

6. Conclusion and future work
The review of the basic annotation layers of the SentiCoref corpus is one of the most extensive campaigns of this kind in Slovenia and – alongside the campaign that focused on the Janes material of computer-mediated communication (Čibej et al., 2018) – also one of the first opportunities to repeat the work using the methodology established during the preparation of the initial version of the training corpus.

After the curation, the final quality control of the annotated material and a statistical review of the dilemmas and corrections, a few conclusions can be drawn. Importantly, the shortcomings of the annotation guidelines showed up above all in topics related to the annotation of proper names (nouns and adjectives derived from proper names), especially in decisions that involve judging whether a given word is Slovenian or foreign. Since the SentiCoref corpus contains an atypically high number of diverse proper names (it was deliberately built that way), we frequently encountered problems that were rarer, and less relevant for the guidelines, during the preparation of ssj500k.

The existence of the properhood category at the level of morphosyntax, and the consequent search for overlap between common and proper entities during lemmatisation, opens up conceptual problems that would merit rethinking. The first is that the annotation category exists only for nouns, while overlap (by a somewhat different logic) is also sought for adjectives, but not for the remaining parts of speech. A further problem is that decisions on writing the lemma with an upper- or lower-case initial transfer to the level of morphosyntactic annotation questions that belong to orthography (or orthographies, considering that all the dilemmas are mirrored and amplified when foreign-language elements are encountered), whereby the system follows the assumption that the authors of the texts always follow the orthography. In differentiating the treatment of geographical and personal names from names of things, the system is further entangled with principles tied less to morphosyntax than to a metalinguistic classification of referents (connected with semantics and the current orthography). It appears that at the level of lemma and morphosyntactic annotation we are making decisions that belong to the level of linguistic description and prescription, while relying on language resources in which precisely these decisions have often not yet been made.

Since the problems of assigning common- or proper-noun status to nouns dominate the overall picture of the corrections made, while the identification of proper-name phrases has in recent years been handled successfully in named-entity annotation, it would be worth reconsidering the added value of this category at the level of morphosyntax. If it turns out to be useful after all, certain problems could be eliminated by a more radical intervention in the guidelines, e.g. by abandoning the search for overlapping common and proper nouns and following usage as it appears in the texts. The same holds for the treatment of foreign vocabulary, which the current system classifies among Slovenian nouns rather permissively and, at the same time, inconsistently. With annotation expanding to text types that contain more foreign-language elements, entering Slovenian in less predictable patterns, it would make sense to define a clear purpose for the distinction between languages and to formulate consistent, operational criteria for it. The problem should be addressed comprehensively, with solutions for all relevant annotation levels, not only lemmatisation and morphosyntax.

The second larger group of annotation problems concerned homographic forms, frequently adjectives and adverbs, but also some grammatical parts of speech. Here, too, the guidelines resort to semantic (not only morphological and syntactic) criteria of judgement, but this turned out to be less pressing than the dilemmas – now at least partly addressed, though not fully resolved – concerning the use of reference language resources, e.g. for determining declinability. The key finding for this group of problems is that the annotation was not fully consistent in ssj500k either, so along the way we compiled a list of problems that should be checked and retroactively harmonised in the future.

In all this, it must be borne in mind that the machine assignment of lemmas and morphosyntactic tags for Slovenian has already reached a level at which it would be sensible to replace comprehensive manual reviews with partial ones, for which, however, (reference, documented) procedures for the automatic or semi-automatic identification of problematic cases would have to be developed. The findings reported in this paper can serve as a starting point for such work. In the continuation of the RSDO project, the reviewed and corrected SentiCoref will be placed alongside the other text collections that will make up the enlarged training corpus for Slovenian. In the future, we will also carry out a series of semi-automatic corrections across the entire training corpus (e.g. checking whether single-word conjunctions such as "zato" are always correctly tagged as conjunctions), thereby ensuring that the same dilemmas are resolved consistently throughout the corpus. In a similar way, we will compare the training corpus with the Slovenian morphological lexicon Sloleks (Dobrovoljc et al., 2019), e.g. to check whether the aspect of the verbs in the training corpus matches Sloleks; since Sloleks was being upgraded within the RSDO project at the same time as the training corpus, this task was postponed. At the end of the project, the training corpus will be made openly available to the public in the CLARIN.SI repository, together with the upgraded annotation guidelines and the rest of the documentation.
7. Acknowledgements
The project Development of Slovene in a Digital Environment (RSDO) is co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund. The operation is carried out within the Operational Programme for the Implementation of the EU Cohesion Policy in the period 2014–2020. The research programmes No. P6-0411 (Jezikovni viri in tehnologije za slovenski jezik) and No. P6-0215 (Slovenski jezik – bazične, kontrastivne in aplikativne raziskave) were co-financed by the Slovenian Research Agency from the state budget. The authors sincerely thank everyone who took part in the annotation campaign for all their work, as well as the two reviewers for their relevant and constructive comments.

8. References
Špela Arhar Holdt and Jaka Čibej. 2021. Analize za nadgradnjo učnega korpusa ssj500k. In: Š. Arhar Holdt, ed., Nova slovnica sodobne standardne slovenščine: viri in metode, pp. 15–53. Znanstvena založba Filozofske fakultete, Ljubljana. Zbirka Sporazumevanje. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/325/477/7313-1.
Primož Belej. 2018. Oblikoskladenjsko označevanje slovenskega jezika z globokimi nevronskimi mrežami. Master's thesis, Fakulteta za računalništvo in informatiko, Univerza v Ljubljani.
Jože Bučar. 2017. Manually sentiment annotated Slovenian news corpus SentiNews 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1110.
Jaka Čibej, Špela Arhar Holdt, Tomaž Erjavec and Darja Fišer. 2018. Ročno označeni korpusi Janes za učenje jezikovnotehnoloških orodij in jezikoslovne raziskave. In: D. Fišer, ed., Viri, orodja in metode za analizo spletne slovenščine, pp. 44–73. Znanstvena založba Filozofske fakultete, Ljubljana. Zbirka Prevodoslovje in uporabno jezikoslovje. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/111/203/2416-1.
Ludmila Dimitrova, Tomaž Erjavec, Nancy Ide, Heiki-Jaan Kaalep, Vladimir Petkevič and Dan Tufiş. 1998. Multext-East: Parallel and comparable corpora and lexicons for six Central and Eastern European languages. In: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, volume 1, pp. 315–319, Montreal, Quebec, Canada. Association for Computational Linguistics. https://aclanthology.org/P98-1050.pdf.
Kaja Dobrovoljc, Simon Krek and Tomaž Erjavec. 2015. Leksikon besednih oblik Sloleks in smernice njegovega razvoja. In: V. Gorjanc, P. Gantar, I. Kosem and S. Krek, eds., Slovar sodobne slovenščine: problemi in rešitve, pp. 80–105. Znanstvena založba Filozofske fakultete, Ljubljana. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/15/47/489-1.
Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik and Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1230.
Richard Eckart de Castilho, Chris Biemann, Iryna Gurevych and Seid Muhie Yimam. 2014. WebAnno: a flexible, web-based annotation tool for CLARIN. In: Proceedings of the CLARIN Annual Conference (CAC) 2014, Soesterberg, The Netherlands. https://www.clarin.eu/sites/default/files/cac2014_submission_6_0.pdf.
Tomaž Erjavec and Simon Krek. 2008. The JOS morphosyntactically tagged corpus of Slovene. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), pp. 322–327, Marrakech, Morocco. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/pdf/89_paper.pdf.
Tomaž Erjavec. 2012. MULTEXT-East: Morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1):131–142.
Miha Grčar, Simon Krek and Kaja Dobrovoljc. 2012. Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik (Obeliks: statistical morphosyntactic tagger and lemmatizer for Slovene). In: Proceedings of the 8th Language Technologies Conference, volume C, pp. 89–94, Ljubljana, Slovenia. IJS. http://nl.ijs.si/isjt12/proceedings/isjt2012_17.pdf.
Peter Holozan, Simon Krek, Matej Pivec, Simon Rigač, Simon Rozman and Aleš Velušček. 2008. Specifikacije za učni korpus. Projekt "Sporazumevanje v slovenskem jeziku". http://projekt.slovenscina.eu/Vsebine/Sl/Kazalniki/K2.aspx.
Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho and Iryna Gurevych. 2018. The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation. In: Proceedings of System Demonstrations of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA. https://aclanthology.org/C18-2002.pdf.
Simon Krek, Kaja Dobrovoljc, Tomaž Erjavec, Sara Može, Nina Ledinek, Nanika Holz, Katja Zupan, Polona Gantar, Taja Kuzman, Jaka Čibej, Špela Arhar Holdt, Teja Kavčič, Iza Škrjanec, Dafne Marko, Lucija Jezeršek and Anja Zajc. 2021. Training corpus ssj500k 2.3. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1434.
Simon Krek, Tomaž Erjavec, Kaja Dobrovoljc, Polona Gantar, Špela Arhar Holdt, Jaka Čibej and Janez Brank. 2020. The ssj500k Training Corpus for Slovene Language Processing. In: D. Fišer and T. Erjavec, eds., Jezikovne tehnologije in digitalna humanistika: zbornik konference, pp. 24–33, Ljubljana, Slovenia. Inštitut za novejšo zgodovino. http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_Krek-et-al_The-ssj500k-Training-Corpus-for-Slovene--Language-Processing.pdf.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 29–34, Florence, Italy. The Association for Computational Linguistics, Stroudsburg. https://www.aclweb.org/anthology/W19-3704.
Nikola Ljubešić and Tomaž Erjavec. 2016. Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 1527–1532, Paris,
France. European Language Resources Association (ELRA). https://aclanthology.org/L16-1242.pdf.
Federico Martelli, Roberto Navigli, Simon Krek, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Carole Tiberius, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej and Tina Munda. 2022. Parallel sense-annotated corpus ELEXIS-WSD 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1674.
Slavko Žitnik. 2019. Slovene corpus for aspect-based sentiment analysis – SentiCoref 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1285.
Document Enrichment as a Tool for Automated Interview Coding
Ajda Pretnar Žagar,∗ Nikola Đukić,∗ Rajko Muršič†
∗Laboratory for Bioinformatics
Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, SI-1000 Ljubljana
ajda.pretnar@fri.uni-lj.si, nd1776@student.uni-lj.si
†Department of Ethnology and Cultural Anthropology
Faculty of Arts
University of Ljubljana
Zavetiška ulica 5, SI-1000 Ljubljana
rajko.mursic@ff.uni-lj.si
Abstract
While widely used in social sciences and the humanities, qualitative data coding remains a predominantly manual task. With the proliferation of semantic analysis techniques, such as keyword extraction and ontology enrichment, researchers could use existing taxonomies and systematics to automatically label text passages with semantic labels. We propose and test an analytical pipeline for automated interview coding in anthropology, using two existing taxonomies, Outline of Cultural Materials and ETSEO systematics. We show it is possible to quickly, efficiently and automatically annotate text passages with meaningful labels using current state-of-the-art semantic analysis techniques.
1. Introduction
Qualitative data coding is a well-established procedure in social sciences, particularly in sociology, cultural studies, oral history, and biographic studies. The technique is gaining ground in anthropology, where interview transcriptions abound. Ethnographic text coding can become a serious research technique, using existing ethnographic systematics, categories, vocabularies, and codes. Data coding facilitates the analysis of themes and close reading of the interview segments on each theme, which is one of the main analytical techniques of ethnographies in anthropology, be they computer-assisted or manual.

Computer-assisted qualitative data analysis (CAQDAS) is used to determine topics of interview segments, where the topics are not discrete but can overlap. The coder would normally define a codebook with the topics, then go over the text and label passages with corresponding tags. In the end, the coder can review selected topical passages, define topic co-occurrence, and extract a subset of documents on a specific topic.

Manual labelling can take a long time and requires a somewhat experienced coder to handle the tagging. However, we can construct an automatic pipeline for segment tagging due to the rapid development of natural language processing tools and language resources. The pipeline is built on recent developments in ontology enrichment, which uses pre-defined ontologies (or taxonomies). Documents are preprocessed, and then the resulting tokens, typically words, are compared by similarity to tokens from the ontology. A simple approach is based on the TF-IDF1 transform. In contrast, the current state of the art uses graph models (i.e., YAKE) and word embeddings (Godec et al., 2021) for determining concept similarity.

Qualitative data coding is often based on grounded theory (Strauss and Corbin, 1997). The theory, which is more of an analytical approach, focuses on letting codes emerge from the data (Holmes and Castañeda, 2014) rather than imposing them. Coding can also stem from a linguistic paradigm, especially semantic approaches, where text would be labelled based on the occurrence of words in it. The first approach still requires human input, while the second is based on unsupervised machine learning. Thus, having a general ethnographic taxonomy or classification scheme enables researchers to inductively elicit prevalent topics from the data rather than devising elaborate codebooks in advance. Our contribution is applying semantic annotation and ontology mapping to interview transcripts.

Semantic enrichment of documents means assigning conceptually relevant terms to documents or document segments. The procedure can include automatic keyword extraction, which identifies relevant keywords in the text (Bougouin et al., 2013; Campos et al., 2020), or relating existing lists of terms to texts (Massri et al., 2019). The latter can be either unsupervised or supervised. Unsupervised refers to the terms being scored by their similarity to the text, with (multiple) terms assigned to each document if their similarity to the document is above a certain threshold. Supervised means the terms are used for document classification, where a document is assigned the most probable term.

1 TF-IDF is a document vectorisation technique which uses word counts to describe documents, weighing them based on overall word frequency.
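The simple unsupervised variant can be sketched with a TF-IDF representation and a similarity threshold; the segments, terms and threshold below are invented for illustration (scikit-learn is used here, not the tooling described later):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

segments = ["The office lights switch off on their own.",
            "We cannot regulate the temperature in the room."]
terms = ["lighting", "temperature regulation", "water supply"]

vectorizer = TfidfVectorizer().fit(segments + terms)
S = vectorizer.transform(segments)   # segment vectors
T = vectorizer.transform(terms)      # ontology term vectors

sim = cosine_similarity(S, T)        # one row of term scores per segment
THRESHOLD = 0.1                      # illustrative threshold
for i, segment in enumerate(segments):
    labels = [t for j, t in enumerate(terms) if sim[i, j] > THRESHOLD]
    print(segment, '->', labels)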
In continuation, we propose a technique using unsupervised ontology enrichment to automatically label text segments with the corresponding topic labels. Automatic segment labelling uses existing (anthropological) taxonomies to label interview segments and thus assist researchers in navigating interview transcripts. The proposed technique doesn't apply only to anthropology – it could be used in any text analysis research. We use anthropology as a use case since the use of computer-assisted techniques is still somewhat rare in this discipline.

Finally, a short note on terminology. The term "ontology" is used in computer science to describe a structured hierarchical list of terms (Gruber, 1995), while in social sciences and the humanities it means a branch of philosophy studying concepts of existence. In this paper, we use the term ontology in the former sense, sometimes referring to it as a taxonomy for clarity.

2. Interview transcripts
Interview transcripts are specific since they contain questions from the interviewer and answers from the interviewee. The transcripts are usually structured, with names or abbreviations denoting the speaker. If the interview is (semi-)structured, questions between different interviews will be very similar, if not identical. Moreover, interviewing a person often requires the interviewer to ask for clarification, affirm the interpretation of the answer or simply confirm (s)he understood what the interviewee said. Hence, including questions in the analysis is often not a good approach.

Delineating between questions and answers depends on the structure of the digital document. A dedicated parser would consider new lines as segment delineations and names, pseudonyms, or initials as speaker identifiers. Ideally, the parser would consider the continuation of a reply even when it was interrupted by the interviewer. But without proper co-reference resolution for the given language (Žitnik and Bajec, 2018), it is difficult to determine such conceptual segments.
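A dedicated parser of the kind just described can be sketched as follows; the transcript format (speaker initials followed by a colon) and the example are hypothetical:

import re

# A turn starts with the speaker's initials and a colon, e.g. "AP: ..."
TURN = re.compile(r'^(?P<speaker>[A-ZČŠŽ]{1,3}):\s*(?P<text>.*)$')

def parse_transcript(raw):
    segments = []
    for line in raw.splitlines():
        match = TURN.match(line.strip())
        if match:
            segments.append({'speaker': match['speaker'],
                             'text': match['text']})
        elif segments and line.strip():
            # a plain line continues the previous speaker's reply
            segments[-1]['text'] += ' ' + line.strip()
    return segments

demo = 'AP: How do you adjust the lights?\nNK: There is a switch,\nbut it rarely works.'
for segment in parse_transcript(demo):
    print(segment['speaker'], '|', segment['text'])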
3. Related work
Back in 1983, Podolefsky and McCarty (1983) had an interesting idea – how about using computers to help us navigate numerous ethnographic notes and transcripts? Those were the days when most anthropologists stored their data on physical paper. Navigating such texts apparently required duplicating pages to store them under various categories. Nowadays, this is no longer necessary. Ethnographic data is often multimodal and predominantly stored digitally. It includes images, videos, and audio recordings along with the text. When navigating digital text data, one can easily use the "find" function to look for different text segments, while similar techniques exist for navigating other data types.

Nevertheless, organising interview data is not an easy task, and there are ways computers can help. Podolefsky and McCarty (1983) proposed developing coding categories for marking text passages. This is the precursor to modern qualitative data analysis software, such as NVivo, Atlas.ti, or MaxQDA. These, too, require a predetermined set of categories used for labelling the data.

Modern computer-assisted qualitative data analysis (CAQDAS) approaches don't require using punched cards with per-page summaries to navigate the text, as was the case in earlier times. They can quickly retrieve segments tagged with the specified tag. MacMillan and Phillip (2010) use a semi-anthropological approach to better gauge the connection between venison price and cull effort. They conduct in-depth interviews with stalkers, people employed by the British estates that hunt wild game, and analyse the interviews with NVivo. They use the qualitative data from the interviews to corroborate the quantitative findings – deer hunting is deeply rooted in tradition and seen as a sport rather than an economic activity.

Researchers studying sensorially-charged biographic experiences in Turku, Brighton, and Ljubljana defined the main categories with a larger list of subcategories. Coding only the translated transcripts and using Atlas.ti, they extracted similarly charged testimonies related to different sensations, for example, sounds (Venäläinen et al., 2020) or smells (Bajič, 2020).

Most commonly, CAQDAS is used in discourse analysis. Hitchner et al. (2016) analyse discourses on bio-energy to elicit key metaphors used to create common imaginaries. Using this approach, they were able to identify three discursive units that guide the bio-energy narrative. Cuerrier et al. (2015) identified 134 categories referring to climate change in 46 interviews conducted with the Inuit population in Nunavik. Next, they created ordinal and binary matrices describing the change in quantity and the presence or absence of topics. They used various statistical approaches to determine whether different communities of Nunavik differ in terms of knowledge of climate change. Both papers retrieve popular taxonomies created by the people under study.

Discourse analysis is also prominent in Schafer (2011), who uses Atlas.ti to analyse over 30 in-depth interviews with secular funeral officiants called "funeral celebrants" in New Zealand. The author identified key conceptual categories in funeral celebrant ethnographies, specifically the narratives on connection, identity, and personalisation of funeral practices.

CAQDAS can also be used to retrieve relevant text passages. Yilmaz et al. (2019) conducted 30 interviews with highly educated Turkish-Belgian women to determine the factors affecting their marriage choices. They stem from grounded theory and use predetermined codes for the first round of coding, then refine and enhance their codebook later. With iterative codebook improvements, they determined women's decisions and the driving factors behind them, for example, the structural and general constraints in marriage choices.

Conversely, Wehi et al. (2018) do not use CAQDAS software but instead observe raw word frequencies in Māori oral tradition. They collected ancestral sayings called whakataukī and identified references to animal extinctions in the data.

It is interesting to note that many contributions using quali-quantitative text analysis were published in the Human Ecology journal, which testifies to the (still) marginal use of these methods in anthropology. Ideally, we will see many more journals willing to publish such research and more researchers ready to use these tools in practice.
Longan (2015) expresses the sentiment to perfection: "There is room for innovation in the creation of technological aids to facilitate mesoscale qualitative online research that lies between massive data sets and small qualitative studies. Though the major qualitative software suites have improved over time, much of the process is still tedious and requires hours of snorkelling and coding by hand." First, he explicitly points to the neither-big-nor-small issue of many contemporary anthropological studies. Even organising just thirty interview transcripts can be complicated, let alone a hundred records. Yet one hundred records can hardly be described as "big data" requiring "big tools". There is a need for a mid-level tool to help organise the data in a time-efficient way. Second, he points to the issue of coding by hand, which takes time and effort from the researcher. Third, he identifies an opportunity for technological innovation for qualitative data analysis that surpasses modern qualitative analysis software.

Previously, ontology enrichment for labelling text passages was used predominantly in biology and medicine (Bifis et al., 2021). In social sciences and the humanities, automated segment labelling was expressed as more of a wish than a reality (Hoxtell, 2019). In contrast to CAQDAS, ontology enrichment provides a way to automatically label large amounts of text in a short period of time. At the same time, it enables relating interview transcripts to existing domain-specific ontologies. Our contribution showcases automated interview segment labelling with existing ontologies, thus providing a practical example of how machine learning can support ethnographic analysis.

We propose an approach using ontology enrichment from computer science to help organise and structure interview transcripts, fieldwork notes, and archive data. The three-fold example described below is a prototype for machine-assisted data coding, which uses standard anthropological taxonomies, such as the Outline of Cultural Materials (Bernard, 1994, pp. 519–528), or more local and specific ethnographic taxonomies, related to the European ethnology studies of the so-called folk or traditional culture (Kremenšek et al., 1976), to label text passages.

4. Ontologies as codebooks
Instead of pre-defining codebooks for manual coding, we propose to use existing anthropological taxonomies to automatically label the data. One such well-established taxonomy, which we call an "ontology" in text mining, is the Outline of Cultural Materials. Human Relations Area Files is a non-profit research organisation whose aim is to foster cross-cultural research (Melvin, 1997). One of its key achievements is the establishment of several databases that contain previous cross-cultural research. The database entries, such as ethnographic reports, are indexed using the Outline of Cultural Materials (OCM), an ethnographic subject classification system developed by Murdock and colleagues (Murdock et al., 1969; Ford, 1971).

The taxonomy is designed as a decimal classification system, similar to the librarians' Universal Decimal Classification. Its main categories start with Orientation (10), Bibliography (11) and Methodology (12), and end with Socialisation (86), Education (87) and Adolescence, Adulthood, and Old Age (88). The categories are still very general, so more specific categories must be coded additionally.

Ethnographic systematics (ETSEO) is derived from continental ethnographic practices, mostly interested in the traditional culture of the European peasantry. Its taxonomy is hierarchically extensive, starting with the essentially defined material, spiritual and social culture categories. Since the taxonomy was designed for museum archives, the most detailed field is material culture, subdivided on as many levels as necessary, and the taxonomy in general fits folk taxonomy and practices. Spiritual culture is further divided into general categories comprising folklore, ritual practices, and art-related activities. Less detailed is the so-called "social culture" field, containing festivities in a calendar year, celebrations of life events, and communal activities, practices, and rules. This system is much more detailed but, at the same time, only partly decimally classified and only somewhat comparable to the OCM taxonomy. It was designed for classical archive work and is now only partially accepted as a digitised taxonomy.

OCM's main aim was to facilitate searching the large database of ethnographic entries and organise basic information on ethnic and social groups. Hence it is easy to extend the idea of an ethnographic classification system to a codebook – each entry represents a concept relevant to describing a culture. One could use the well-defined system with descriptions of categories to automatically tag text passages with relevant ethnographic concepts. For example, if a passage describes using outdoor toilets, the corresponding codes should be "744 Public Health and Sanitation", "515 Personal Hygiene", "336 Plumbing", and "312 Water Supply". Besides the already existing taxonomies for ethnographic materials (OCM and ETSEO), it is useful to produce native or folk taxonomies as "a description of how people divide up domains of culture, and how pieces of a domain are connected" (Bernard, 1994, p. 386). Automated, accurate tagging would enable quickly retrieving relevant parts of the text on the one hand and observing dominant topics and their inter-relatedness on the other.
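A decimally classified taxonomy of this kind maps naturally onto a nested dictionary, which makes collecting an entry together with its child terms trivial; the fragment below is hypothetical (only the entry "350 equipment and maintenance of buildings" is taken from the text, the child entries are placeholders):

# Hypothetical OCM-style fragment: code -> label and children.
OCM_FRAGMENT = {
    '350': {'label': 'equipment and maintenance of buildings',
            'children': {
                '35x': {'label': 'placeholder child entry A', 'children': {}},
                '35y': {'label': 'placeholder child entry B', 'children': {}},
            }},
}

def collect_terms(node):
    """Yield the label of a node and of all its descendants."""
    yield node['label']
    for child in node['children'].values():
        yield from collect_terms(child)

for code, node in OCM_FRAGMENT.items():
    print(code, list(collect_terms(node)))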
5. Document enrichment
Analysis of interview transcripts would normally include labelling documents or interview segments with corresponding codes, identifying topics/codes, observing their frequencies in the corpus, and retrieving interview segments for a given topic/code. We show how to perform these tasks in the visual-programming data mining tool Orange (Demšar et al., 2013). The workflow for replicating the analysis (as seen in Figure 1) is available online (Pretnar Žagar, 2022b), along with a Slovenian translation of the OCM ontology (Pretnar Žagar, 2022a). The corresponding data are not publicly available due to privacy issues.
Figure 1: Workflow for ontology enrichment and extracting interview topics from annotated visualization.
5.1. Data and preprocessing
To demonstrate how contemporary ontology enrichment and semantic analysis approaches can be used in anthropology, we use interview transcripts from twenty interviews on smart buildings (Pretnar and Podjed, 2019). The interviews are in colloquial Slovenian and describe the experiences and struggles of faculty staff with a smart building. Each interview is segmented into questions and answers. Each answer represents one utterance and constitutes a single document in the final corpus, resulting in 1,126 data instances. The metadata includes the question, the interviewee, and the interview date.

Tokens are constructed by passing the text through the CLASSLA pipeline for non-standard Slovenian. Then, lemmas and POS tags are retrieved, and only nouns and verbs are kept for the analysis. Tokens are used to compute document embeddings, a mean aggregation of word embeddings based on fastText models (Bojanowski et al., 2017). We tried simple lowercasing, Lemmagen lemmatization (Juršič et al., 2010) and stopword removal for preprocessing, but the results were not as informative (they mostly contained generic verbs, such as to have and to go, discourse particles and fillers). Moreover, while SBERT embeddings generally perform better due to their context-parsing abilities, they produced worse results in the t-SNE visualisation. Specifically, fastText identified a group of segments with short, unspecific replies (i.e., "Yes.", "Uh-huh."), while SBERT did not.
5.2. Identifying topics
Generally, the researchers will know which topics the corpus covers because often, they will be its creators. In the case of interviews, the researcher is likely also the interviewer who guided the interview based on research questions. However, ethnographic narratives often take unexpected turns or focus on unforeseen details, which the researcher can uncover by coding the data and iteratively refining the codebook. Alternatively, one can use document maps, where segments with semantically similar content will lie close together.

To semantically represent the content of interview segments, we pass them to document embedding. The procedure takes the words (tokens) identified in preprocessing and finds their vector representation. The representation models the meaning of the words in a way that relates "king" to "prince" and "queen" to "princess". Once the embedding of each word is retrieved, the words from the document are aggregated into the mean document vector. This numeric representation is then used to plot a t-SNE document map, where segments with similar content lie close to each other.2 But a bare map is not very informative on its own. Hence, we added Gaussian mixture models to identify groups of segments and retrieve their characteristic words (Figure 2). The procedure identified segments referring to air quality (green cluster), lighting (magenta cluster), room descriptions (yellow cluster), and so on.

Figure 2: t-SNE document map with annotated semantic groups.

2 In t-SNE, we selected a larger group of segments for annotation. There was a smaller group of 121 segments representing short replies, such as "yes", "no", and "I don't know".
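The map-and-cluster step can be sketched with scikit-learn; the embedding matrix X is assumed to come from the previous step, and the random stand-in data and the number of mixture components are illustrative:

import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))   # stand-in for real document embeddings

coords = TSNE(n_components=2, random_state=0).fit_transform(X)
gmm = GaussianMixture(n_components=6, random_state=0).fit(coords)
clusters = gmm.predict(coords)    # one cluster id per segment

for c in range(6):
    print('cluster', c, '->', int(np.sum(clusters == c)), 'segments')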
5.3. Exploring topical segments
Ontologies can be used to enrich interview segments by measuring how similar the given ontology terms are to each segment. Automatic identification of segments helps researchers quickly identify relevant parts of the interview.

For example, we can look for "delo" (orig. 350 equipment and maintenance of buildings) and its child terms from the OCM ontology in the corpus (Figure 3). The selected terms from the ontology are used for the semantic annotation of interview segments.

Figure 3: Selecting a part of the OCM ontology referring to work ("delo") and work-related terms.

Semantic annotation scores each segment by how similar its sentences are to the input terms, using SBERT embeddings (Reimers and Gurevych, 2019). SBERT was used because it specialises in sentence embeddings and considers word context. Ideally, this procedure identifies passages talking about work-related topics, including breaks, employment, paychecks, and work relations. One can sort the results either by the overall segment score, an aggregate of all sentence scores, or by matches, which counts how many input words appear in the segment.

Here, we show the latter option, namely displaying the segments with the most matches. We have selected all the segments matching any of the input terms and highlighted them (Figure 4). Ontology enrichment successfully identified segments discussing the office environment, research work, work routine, schedules, weekend work, etc.

Figure 4: Annotating text segments with a part of the ontology referring to work ("delo").
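The scoring step can be sketched with the sentence-transformers package; the multilingual model name and the Slovenian examples are illustrative, not the exact configuration of the workflow:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

terms = ['delo', 'zaposlitev', 'plača']            # selected ontology terms
segments = ['Med vikendom pogosto delam od doma.',
            'Okna se ne dajo odpreti.']

T = model.encode(terms, convert_to_tensor=True)
S = model.encode(segments, convert_to_tensor=True)
scores = util.cos_sim(S, T)                        # segments x terms

for segment, row in zip(segments, scores):
    print(f'{float(row.max()):.2f}', segment)      # best term score per segment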
5.4. Assigning terms to segments
The final goal of any automated coding system would be to return a corpus with assigned codes. We prototyped a procedure that uses the above technique of semantic scoring to identify the code with the highest score for each segment. We decided on a 0.6 cosine similarity threshold for a code to be assigned, which resulted in segments that did not have a corresponding code. After loading the corpus, we remove all the interview segments without any codes. We retain 252 segments with codes and observe their frequencies. The results are somewhat promising, but with some obvious errors (Figure 5).

Figure 5: Top 10 codes identified in the corpus. While some are plain wrong, most are quite accurate and useful.
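The assignment rule of the prototype can be sketched as follows, with the 0.6 threshold from the text and an invented score matrix:

def assign_codes(score_matrix, codes, threshold=0.6):
    """Give each segment its best-scoring code, or None below the threshold."""
    assigned = []
    for row in score_matrix:                 # term scores for one segment
        best = max(range(len(codes)), key=lambda j: row[j])
        assigned.append(codes[best] if row[best] >= threshold else None)
    return assigned

scores = [[0.72, 0.31], [0.41, 0.35]]        # invented similarity scores
print(assign_codes(scores, ['luč', 'toplota']))   # -> ['luč', None]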
The most frequent code is "luč" (light), which is indeed a very prominent topic in the corpus. Then the results get a little strange. The next two topics are "svaki in svakinje" (brothers- and sisters-in-law) and "tipi porok" (marriage types), which are not among the interview topics. The errors are likely caused by the multilingual SBERT model used for word embedding, which sometimes cannot distinguish between South Slavic languages. For example, it considers the Slovenian slang term "ratal" (succeeded) as "war" based on its similarity to the Serbian "rat" (war).
The most frequent code is “luč” (light), which is indeed a very prominent topic in the corpus. Then the results get a little strange. The next two topics are “svaki in svakinje” (brothers- and sisters-in-law) and “tipi porok” (marriage types), which are not among the interview topics. The errors are likely caused by the multilingual SBERT model used for word embedding, which sometimes cannot distinguish between South Slavic languages. For example, it considers the Slovenian slang term “ratal” (succeeded) as “war”, based on its similarity to the Serbian “rat” (war).

However, there are some quite relevant topics among the top ten codes, for example “toplota” (warmth), “podnebje” (climate), “dnevi počitka in dela prosti dnevi” (rest days and holidays), “stranska poslopja” (outbuildings), and “bivališča” (dwellings). Clicking on a label, for example “toplota” (warmth), outputs text segments discussing the interviewees’ attitude to temperature regulation. With a few steps, the researcher can identify and extract interview segments discussing a specific topic and read them to better understand the context of these segments and which subtopics the respondents deem relevant. For example, the texts on temperature regulation mostly refer to difficulties with adjusting the office temperature.

The system could be improved with specifically developed language resources for non-standard Slovenian. Nevertheless, even in its current imperfect form, it can be a useful tool for semi-automated coding, where the researcher manually adjusts the suspicious or incorrect codes.

5.5. Comparison to the ETSEO taxonomy

While the OCM taxonomy is widely recognised in the anthropological community, the ETSEO taxonomy is strictly regional. The project Ethnological Topography of the Slovenian Ethnic Territory (ETSEO) was begun in 1971 by a large group of Slovenian ethnologists led by Slavko Kremenšek. The project entailed the development of questions based on ethnological systematics, ethnographies of Slovenian towns and cities (18 in total), and detailed ethnographies on specific topics. The taxonomy is a result of the first part of the project, namely the questions and the detailed ethnological systematics. The ETSEO questions were published between 1976 and 1977 in twelve books, including the introductory volume with reports of ethnological institutions (Kremenšek et al., 1976) and eleven volumes of topical presentations and suggested questionnaires. The series served as a theoretical and practical guide for ethnographic fieldwork (Ravnik, 1996).

The ETSEO taxonomy contains 53 areas of ethnographic interest. Still, it lacks an explicit hierarchy, although it follows the classical division of ethnographic material for the so-called folk culture: material (volumes I to V), social (volumes VI to VIII) and spiritual (volumes IX to XI). A rough hierarchy could be formed from the eleven books in which these questions were published, but the books lack hypernyms. Hence, we use it as a flat taxonomy. There are fewer relevant areas to choose from than in the OCM. However, looking for “tehnično znanje” (technical knowledge) returns relevant interview passages (Figure 6).

Figure 6: Matches for the ETSEO entry “technical knowledge”.

The ETSEO taxonomy is less useful than the OCM taxonomy. This is due to the somewhat outdated nature of the questions, which were based on the main foci of Slovenian ethnology and were less relevant for anthropology. They are missing some key contemporary areas of anthropology, namely media, urban areas, internet communities, and migration. Nevertheless, the taxonomy could be extremely useful for older ethnographic texts and, with some updates, even for contemporary materials.

6. Conclusion

Anthropology can greatly benefit from the recent developments in text analysis. Ontology enrichment, along with other data exploration and visualisation methods, is a useful tool providing an overview of the collected data.

In a time when anthropologists are using larger corpora (Culhane and Elliott, 2016), when data is created online for many different purposes (Wang, 2012), and when anthropologists use online platforms to store raw ethnographic multimedia data (Przybylski, 2021), it is of utmost importance to store and later archive data meaningfully, using relevant classification and coding systems. This is even more important in archival work, which is no longer just an additional part of anthropological research supplementing ethnographic fieldwork, but is becoming highly relevant for the digital aspects of our lives.

Updating taxonomic systems is an urgent task for anthropologists. However, using existing taxonomies to explore and visualise data already benefits the analytic process, especially in re-studies and comparative research. Classical anthropological coding of ethnographic material is no longer possible, so automated coding is the first step to expanding the range of anthropological data analysis. However, in the absence of specialised word embedding models for Slovenian (SBERT is currently multilingual and conflates South Slavic languages), the approach does not yet achieve the accuracy of a human annotator.

While automated coding, particularly for languages with fewer language resources, still has a long way to go before it is comparable to human input, it facilitates data exploration and extracting general topics from the text. Ontology enrichment tools support the iterative analytical process of ethnography. They provide a starting point for forming new research questions and enhancing existing ones, and can be easily repeated on new data.

Many improvements could be made to automated coding for the Slovenian language (see also the sketch after this list):

• Developing a Slovenian-only sentence transformer to be used in semantic search.

• Re-writing transcripts in standard Slovenian, or further improving CLASSLA to handle slang terms and non-standard Slovenian.

• Implementing co-reference resolution for Slovenian to resolve issues with indirect references in text, further clarifying the exact content of the document.
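Regarding the second point, CLASSLA already ships models trained for non-standard Slovenian. The snippet below is a small illustrative sketch, not the pipeline used in this paper; the example sentence is invented, and it simply shows how such transcripts could be lemmatised with the non-standard models before embedding.

# Illustrative sketch: lemmatising non-standard Slovenian with CLASSLA.
# Assumes: pip install classla; the example sentence is invented.
import classla

# Download and load the models trained for non-standard Slovenian.
classla.download("sl", type="nonstandard")
nlp = classla.Pipeline("sl", type="nonstandard",
                       processors="tokenize,pos,lemma")

doc = nlp("A ti je ratal popravit tisto luč v pisarni?")
for word in doc.iter_words():
    print(word.text, "->", word.lemma)

Even so, slang such as “ratal” remains hard for the current models, which is exactly the gap the proposed improvements would address.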
While these improvements would greatly enhance coding capabilities for Slovenian, they are, for the most part, available for larger languages, thus already enabling similar research.

7. Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency research programme P6-0436: Digital Humanities: resources, tools and methods (2022–2027) and by the DARIAH-SI research infrastructure.

8. References

Blaž Bajič. 2020. Nose-talgia, or, olfactory remembering of the past and the present in a city in change. Ethnologia Balkanica, 22:61–75.

H. Russell Bernard. 1994. Research Methods in Anthropology: Qualitative and Quantitative Approaches. Sage Publications, Thousand Oaks, London, New Delhi.

Aristeidis Bifis, Maria Trigka, Sofia Dedegkika, Panagiota Goula, Constantinos Constantinopoulos, and Dimitrios Kosmopoulos. 2021. A hierarchical ontology for dialogue acts in psychiatric interviews. In The 14th PErvasive Technologies Related to Assistive Environments Conference, PETRA 2021, pages 330–337, New York, NY, USA. Association for Computing Machinery.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Adrien Bougouin, Florian Boudin, and Béatrice Daille. 2013. TopicRank: Graph-based topic ranking for keyphrase extraction. In International Joint Conference on Natural Language Processing (IJCNLP), pages 543–551.

Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289.

Alain Cuerrier, Nicolas D. Brunet, José Gérin-Lajoie, Ashleigh Downing, and Esther Lévesque. 2015. The study of Inuit knowledge of climate change in Nunavik, Quebec: a mixed methods approach. Human Ecology, 43(3):379–394.

Dara Culhane and Denielle Elliott. 2016. A Different Kind of Ethnography: Imaginative Practices and Creative Methodologies. University of Toronto Press, North York, Ontario, Canada.

Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, et al. 2013. Orange: Data mining toolbox in Python. The Journal of Machine Learning Research, 14(1):2349–2353.

Clellan S. Ford. 1971. The development of the outline of cultural materials. Behavior Science Notes, 6(3):173–185.

Primož Godec, Nikola Đukić, Ajda Pretnar, Vesna Tanko, Lan Žagar, and Blaž Zupan. 2021. Explainable point-based document visualizations. arXiv preprint arXiv:2110.00462.

Thomas R. Gruber. 1995. Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5-6):907–928.

Sarah Hitchner, John Schelhas, and J. Peter Brosius. 2016. Snake oil, silver buckshot, and people who hate us: metaphors and conventional discourses of wood-based bioenergy in the rural southeastern United States. Human Organization, 75(3):204–217.

Seth M. Holmes and Heide Castañeda. 2014. Ethnographic research in migration and health. Migration and Health: A Research Methods Handbook, pages 265–277.

Annette Hoxtell. 2019. Automation of qualitative content analysis: A proposal. In Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, volume 20.

Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. LemmaGen: Multilingual lemmatisation with induced ripple-down rules. Journal of Universal Computer Science, 16(9):1190–1214.

Slavko Kremenšek, Vilko Novak, and Valens Vodušek. 1976. Etnološka topografija slovenskega etničnega ozemlja. Uvod. Poročila. [Ethnological topography of the Slovenian ethnic territory. Introduction. Reports.] Raziskovalna skupnost slovenskih etnologov, Ljubljana.

Michael W. Longan. 2015. Cybergeography IRL. Cultural Geographies Special Issue – New Methods in Cultural Geography, 22(2):217–229.

Douglas Craig MacMillan and Sharon Phillip. 2010. Can economic incentives resolve conservation conflict: the case of wild deer management and habitat conservation in the Scottish Highlands. Human Ecology, 38(4):485–493.

M. Besher Massri, Sara Brezec, Erik Novak, and Klemen Kenda. 2019. Semantic enrichment and analysis of legal domain documents. Artificial Intelligence, page 2.

George Peter Murdock, Clellan S. Ford, Alfred E. Hudson, Raymond Kennedy, Leo W. Simmons, and John W. M. Whiting. 1969. Outline of Cultural Materials. Human Relations Area Files, New Haven.

Aaron Podolefsky and Christopher McCarty. 1983. Topical sorting: A technique for computer assisted qualitative data analysis. American Anthropologist, 85(4):886–890.

Ajda Pretnar and Dan Podjed. 2019. Data mining workspace sensors: A new approach to anthropology. Prispevki za novejšo zgodovino, 59(1):179–196.

Ajda Pretnar Žagar. 2022a. OCM ontology - Slovenian. Figshare. https://doi.org/10.6084/m9.figshare.19844107.v1.

Ajda Pretnar Žagar. 2022b. OCM ontology enrichment. Figshare. https://doi.org/10.6084/m9.figshare.19787065.v1.

Liz Przybylski. 2021. Hybrid Ethnography: Online, Offline, and in Between. Sage Publications, Los Angeles; London; New Delhi; Singapore; Washington DC; Melbourne.

Mojca Ravnik. 1996. Način življenja Slovencev v 20. stoletju [The way of life of Slovenians in the 20th century]. Traditiones, 25:403–406.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Cyril Schafer. 2011. Celebrant ceremonies: life-centered funerals in Aotearoa/New Zealand. Journal of Ritual Studies, 25(1):1–13.

Anselm Strauss and Juliet M. Corbin. 1997. Grounded Theory in Practice. Sage.
Juhana Venäläinen, Sonja Pöllänen, and Rajko Muršič. 2020. The street. The Bloomsbury Handbook of the Anthropology of Sound.

Tricia Wang. 2012. The tools we use: Gahhhh, where is the killer qualitative analysis app? http://ethnographymatters.net/blog/2012/09/04/the-tools-we-use-gahhhh-where-is-the-killer-qualitative-analysis-app/.

Priscilla M. Wehi, Murray P. Cox, Tom Roa, and Hēmi Whaanga. 2018. Human perceptions of megafaunal extinction events revealed by linguistic analysis of indigenous oral traditions. Human Ecology, 46(4):461–470.

Sinem Yilmaz, Bart Van de Putte, and Peter A. J. Stevens. 2019. The paradox of choice: Partner choices among highly educated Turkish Belgian women. DiGeSt. Journal of Diversity and Gender Studies, 6(1):5–24.

Slavko Žitnik and Marko Bajec. 2018. Odkrivanje koreferenčnosti v slovenskem jeziku na označenih besedilih iz coref149 [Coreference resolution in Slovenian on the coref149 annotated texts]. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 6(1):37–67.
Parliamentary Discourse Research in History: Literature Review
Jure Skubic¨, Darja Fišer*¨
¨Institute of Contemporary History, Ljubljana, Slovenia
*Faculty of Arts, University of Ljubljana, Slovenia
¨Privoz 11, 1000 Ljubljana, Slovenia
*Aškerčeva cesta 2, 1000 Ljubljana, Slovenia
jure.skubic@inz.si, darja.fiser@ff.uni-lj.si
Abstract
Historical research of parliamentary discourse focuses not only on the origins but especially on the development of parliamentary discourse. It is predominantly based on textual data analysis, employing various methodological frameworks.
In this literature review we provide an overview of these methods and present the commonalities and differences between the approaches established in history and corpus-driven approaches. This allows for a better understanding of the historical analysis of parliamentary discourse and highlights the importance of the ParlaMint project and the integration of parliamentary corpora into historical research.
1. Introduction

Parliamentary discourse is a salient research topic in both humanities and social science disciplines, such as sociology, political science, sociolinguistics, and history. Historical research in particular is highly interested in studying not only the origins but also the development of parliamentary discourse. History is often focused on researching parliamentary debates, and as Ihalainen (2021) observes, in historical research parliamentary debates can be approached analytically as nexuses of past political discourses, which means that they can be viewed as “meeting places” where various political discourses have intersected in a certain time and space.

This literature review is one in a series of literature reviews conducted in the context of the ParlaMint project (Erjavec et al., 2022). A similar literature review has been compiled for sociological research (Skubic and Fišer, 2022). The ParlaMint project develops comparable corpora of parliamentary proceedings from more than 20 European countries, accompanied by literature overviews, showcases and tutorials which will hopefully help maximize the use of these corpora in the different disciplinary communities interested in analyzing parliamentary debates. This literature review summarizes historical research of parliamentary debates and the most popular research methods employed. It needs to be explicitly noted, however, that despite the obvious usefulness of the ParlaMint corpora, researchers ought to consider other qualitative and quantitative data and information as well in order to come to objective and unbiased conclusions. Also, in this review we focus mostly on written parliamentary records, since the main interest of the ParlaMint project is in written parliamentary sources. However, the importance of other sources such as surveys, records of election results, territorial records, etc. must be recognized as well, since they present an important part of historical research.

The review is structured as follows. In the first part, we describe the selection procedure of the relevant articles and briefly enumerate the methods they employ. This allows for a better understanding of the methods most frequently employed in historical analysis of parliamentary discourse. In the second part, we summarize the articles we identified in terms of 1) the main aim and topic of the research, 2) the methods used, 3) the data collection methods, and 4) a short discussion of the possible improvements and/or problems of the research. We conclude the review with a discussion of how historical research could benefit from corpus data and corpus research methods.

2. Literature Selection and Methods

As Torou et al. (2009) show, the main objective of history is to recreate the past by researching and analyzing existing records and their interconnectedness. It is through this process that historians employ their academic knowledge, rely on experience, and decide on the relevant information and the appropriate sources from which this information is extracted. Especially in political history, it is uncommon for historians to rely on only one type of source; they rather draw on various so-called primary and secondary sources. The former are most commonly gathered from historical archives, since they include documents or artefacts created by the participants in an event or by its witnesses, whereas the latter include oral sources, newspapers, memoirs, visual representations, practices, etc. This means that an important factor in historical research is to understand the nature of the information as well as the research methodologies and models historians use while conducting research (ibid.).

Although the variety of issues and approaches in political history is large, an emerging and quite narrow focus of political history is on analyzing the history of parliamentary discourse and political debates. Ihalainen and Saarinen (2019) show that political history frequently builds its research on textual data (documents, diaries, texts), although sometimes the exact textual methods used are not explicated. Ihalainen and Saarinen (2019) note that when conducting textual analysis, historians often draw on selected methodological tools from methods which are otherwise common in humanities and social sciences and especially in qualitative sociological research, such as (critical) discourse analysis as well as content analysis.
In addition to those and to other fields which include the study of history (memory studies, conceptual history, etc.), researchers sometimes opt for a mixed methods approach, corpus-assisted discourse studies or text mining.

2.1 Selection of Articles

The reviewed articles were carefully selected from among hundreds of sources which focus on parliamentary debates by considering some important research criteria. We identified the following scholarly search engines to look for the articles:

• Taylor and Francis Online (https://www.tandfonline.com),
• SAGE Journals (https://journals.sagepub.com),
• Wiley Online Library (https://onlinelibrary.wiley.com),
• Semantic Scholar (https://www.semanticscholar.org),
• MUSE Project (https://muse.jhu.edu),
• JSTOR (https://www.jstor.org),
• Elsevier (https://www.elsevier.com), and
• Google Scholar (https://scholar.google.com).

We applied the following filters in order to identify the relevant articles:

• Publication period: 2012–2022,
• Discipline: History,
• Article ranking: ‘most relevant’ and ‘most cited’,
• Relevant journals: sometimes we needed to apply an additional filter with which we selected relevant historical journals.

By using those filters, the most prominent historical journals were identified, such as Parliamentary History, Historical Research, Memory Studies, Contributions to the History of Concepts and Historical Social Research, although the articles included in this review were also published elsewhere. All articles whose title was considered potentially relevant were skimmed; we specifically analyzed the abstract, methodology and analysis sections to confirm the relevance of the articles. A high number of articles was discarded either because of the lack of methodological explanation or because the analysis did not focus on parliamentary data. In this review we wanted to include only those articles which dealt specifically with parliamentary records and/or legislative documents, and the majority of the selected research conformed to this criterion. Some of the articles, however, also included other sources, which emphasizes the fact that historians use a variety of sources when researching parliamentary discourse. This also shows that although parliamentary records could present one of the primary sources for historical research (and projects such as ParlaMint would be helpful in providing relevant data), historians still often opt for a broader research perspective and combine parliamentary records with other, complementary sources of data in their research.

2.2 Overview of Methods

A total of 27 articles were initially determined as relevant for our literature review and are listed in a Google spreadsheet.1 We then retained only those that clearly described the method and the data used, taking into account only the papers which primarily used parliamentary records as a source. This resulted in 11 articles which were then submitted to a more detailed analysis. Since the research questions were so heterogeneous, we did not group the articles thematically.

1 https://docs.google.com/spreadsheets/d/13mF_X3OB9CKtdfsUFDLPJZJ44VcxZ1uv9OzAE2E_E-I/edit#gid=1690588464

We reviewed predominantly articles which focused on historical research of parliamentary discourse and political communication. Out of the 11 reviewed articles, 3 employed the methodological framework of Discourse Analysis, 2 articles employed Content Analysis, 1 opted for the method of Memory Studies, 2 articles used a mixed methods approach, 2 articles employed the framework of Conceptual History (Begriffsgeschichte), and 1 article employed the method of Topic Modelling.

3. Reviewed Research and Employed Methods

In this part of the literature review, we give a detailed account of the historical research that analyzes parliamentary discourse and political communication, as well as the methods it employs. We provide a short description of each methodological framework and show why it is important for historical research. Then, we give an overview of the studies which employed this method.

3.1 Conceptual History

Conceptual History (Begriffsgeschichte) is a strand of historical studies which deals with historical semantics and the evolution of paradigmatic ideas and value systems over time. It was first defined by Koselleck in 1997, who shows (as cited by Litte, 2016) that the major aim of conceptual history is to uncover the logic and semantics of the concepts that have been used to describe historical events and processes, in addition to being interested in the historical evolution of some concepts over time. Ihalainen and Saarinen (2019) note that Conceptual History, when combined with Political History, mostly focuses on past human interaction and communication, and understands discourses as central interlinked elements of political processes, events, and action.

Interest in the field of Conceptual History was quite high in 20th-century Germany, especially in historical research of World War II. Later, the field became prominent in political history for the analysis of political communication and events. As shown by Litte (2016), conceptual history has three main tasks: firstly, to identify the concepts that are possible in the characterization of history, then to locate those concepts in the context of political or social discourses, and finally to critically evaluate those concepts for their usefulness for historical analysis.
3.1.1 Debates on Democracy in Sweden

Research problem: Friberg’s (2012) article aims to explore the concepts of democracy that were used in Sweden and especially focuses on how the concepts were used by the Social Democrats (SDP) during the interwar years, when the party was establishing itself and its political agenda. It examines the Swedish parliamentary rhetoric about democracy after the full suffrage reform.

Research method: The author employed the German Begriffsgeschichte (conceptual history approach) as introduced by Koselleck, and the theory of ideologies (Freeden, 1998 as cited by Friberg, 2012). According to Friberg, these two methods complement each other, since conceptual history emphasizes how the socio-political context influences the changing meaning of a concept, whereas the theory of ideologies finds the meaning of the concept dependent on its morphological structure.

Data collection: The main source of data for this article were the debates in the Swedish Parliament during the interwar years. In addition, the author used other governmental materials, such as reports from different committees. Both sources were only available as hardcopies, but they provided coherent source materials. The debates which were analyzed were chosen according to two important criteria. First, the debates needed to be explicit discussions in Parliament and needed to focus on the concept of democracy in the interwar years. Second, the debates had to be related to a topic that a political party (in this case the Social Democrats) claimed was connected to democracy in a certain way. The debates which conformed to the first criterion were identified through the subject index of the governmental records, whereas the debates which needed to observe the second criterion were recovered through an extended analysis of materials such as party manifestos, newspaper articles and records from the congress. This was necessary to get a feeling of what the SDP claimed to be connected to democracy and then compare those records with the parliamentary records. In addition, the author analyzed articles from the Social Democratic journal Tiden, which throughout the 20th century was one of the most important Social Democratic newspapers for conducting internal debates. The analysis of these articles added to the reliability of the conceptual analysis.

Discussion: One of the problems with data collection was that all the records were accessible only as hardcopies and not electronically. Although the author gives no information about that, we can assume that the documents needed to be thoroughly read and notes taken. Also, parliamentary records do not exactly depict the actual debates, since the process from an actual debate to a printed one used to be rather complicated and long. This results in sometimes significant differences between the actual speech and the written text. This long process of editing, changing, and adapting the actual text to be suitable for a printed version results in the data not objectively depicting what was said during the debate.

3.1.2 Debates on Immunity in France and Romania

Research problem: In this article Negoita (2015) analyzes the concept of parliamentary immunity. His main goal is to identify not only the historical premises but also the linguistic, political, and legal instruments that played a part in the conceptualization of parliamentary immunity in two countries – France and Romania. This article, therefore, although historical in nature, employs an interdisciplinary perspective when studying parliamentary discourse and investigates the concept of the word “immunity” as used in parliamentary discourse.

Research method: The author employs the methodological framework of Conceptual History and makes a comparative analysis of the two aforementioned countries. We could therefore understand this method as comparative conceptual analysis.

Data collection: The data was collected from a variety of sources which were mostly not parliamentary ones. For French, dictionaries (Le Grand Robert de la langue française, Dictionnaire de l’Académie française, etc.), scientific works which focused on the history of French parliamentarism (Histoire de France or Les caractères ou les mœurs de ce siècle), as well as various political documents and the French Constitution were used. The Romanian data was also gathered from dictionaries (Dicționar al instituțiilor feudale din Țările Române, for example), as well as various historical documents and different versions of democratic Constitutions. What all the documents had in common was that although they were not strictly records of parliamentary debates, they did focus on parliamentary and political language and discourse.

Discussion: This research is slightly different from the others in this review, since it does not draw directly on parliamentary records. This analysis successfully shows how historical analyses frequently draw on sources other than explicitly parliamentary data.

3.2 (Collective) Memory Analysis

Memory analysis combines intellectual strands from various domains such as history, sociology, anthropology, education, etc. Since this is an emerging field of research, its qualitative and quantitative methodological tools are not yet fully developed. Instead, researchers who conduct memory analysis usually borrow methodological tools from other social sciences and adapt them for their own purpose. These methods frequently include content and (critical) discourse analysis.

The main aim of memory analysis is the study of the forms and functions of representing the past. Data collection includes a careful examination of primary historical sources and archival studies, as well as secondary sources such as case studies, interviews, surveys, and eyewitness reports. Once the data is collected, the aforementioned methodological tools are employed to thoroughly analyze the data. Memory analysis frequently also includes the research of collective memories and narratives. Collective memory as defined by Hogwood (2013) is a concept which is used across disciplines to refer to the ways the past is “perceived, shaped, and constructed”, and its main aim is to extract useful data from collective conversations, shared ideas and media. This then leads to a synthesis of voices and the formation of a common information thread among peers.

One of the major methodological problems that occurs in memory analysis is that when researchers conduct research, they usually use whatever evidence is readily available, without digging deeper into the event and researching it more thoroughly.
This points to the fact that even though memory analysis is a useful field of historical analysis, researchers must be attentive and employ other approaches with which they can confirm and legitimate the findings of memory analysis.

3.2.1 The Nation in Parliamentary Discourse on Immigration

Research problem: De Saint-Laurent (2014) focuses on exploring the meaning that is attributed to the national group. The aim of her article is to analyze collective memories (she names them narratives) and show what meaning they give to the nation, how this meaning is produced, and how the stories told by different groups relate to one another.

Research method: She employs a qualitative analysis of collective narratives of the past. In connection with memory analysis, she employs dialogism as a methodological tool, since the analysis of dialogic overtones helps reconstruct the social processes through which the discourse is done.

Data collection: It needs to be noted that this article is an analysis of the meaning which is given to the concept of the nation in the French parliamentary debates over a bill on Immigration and Integration. The data used consisted of the official transcripts of fifteen parliamentary sessions which took place between May 2 and May 17, 2006. In addition to that, the author also included the vote session which took place on June 30, 2006. All the documents are made available to the public through the official parliamentary website. In addition to using the general reactions of the Assembly, the author used transcripts of the participants’ interventions and interruptions from the entire sessions. Once the author determined the datasets, she began with relevant data selection, which happened in three stages. In the first stage, the author identified those excerpts which were relevant for the study of the role of collective memory. She did that with the help of the Nvivo2 software (QSR International Pty Ltd., 2020). In this stage the author also extracted relevant references by carefully reading through all the debates and employing a keyword search, which contributed to pinpointing the indirect references. The second stage was the coding stage, where firstly thematic coding took place to map out relevant historical periods, and secondly the groups which the speakers belonged to were coded into two categories – political party and outside the political spectrum. In the third stage the fragmented excerpts and data were used to reconstruct past narratives, which were then thoroughly analyzed.

2 https://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home

Discussion: This paper is not only historical, since the author herself notes that it also adopts a “socio-cultural psychological perspective on memory” (ibid.). She also notes that because of the reconstructive aspect of her analysis, she checked the narratives against certain complementary sources (research in French newspapers, blogs, websites, etc.). This made the research much more reliable.

3.3 Discourse Studies

Van Dijk (2018) uses the term discourse studies to refer to a field of research which includes various qualitative and quantitative methods and different genres, such as news reports or parliamentary debates. This field emerged in the 1960s and is very prominent in the humanities and especially the social sciences. The field of Discourse Studies includes various methods, such as Discourse Analysis (DA), Critical Discourse Analysis (CDA) and Political Discourse Analysis (PDA). All three were detected as salient in this literature review.

Discourse Analysis (DA) is one of the most frequently used methods in those social science disciplines where the focus is frequently on the study of language and text. In historical research, Discourse Analysis is sometimes referred to as the Discourse Historical Approach (DHA), and its main defining feature is in acknowledging the historical context and attempting to integrate this knowledge together with the background of the social and political fields into the research. DHA focuses on studying the display of power through language and conceptualizes history through a theorized lens of critique. This method shares various common features with Critical Discourse Analysis (CDA) and provides a clear description of how to integrate historical context into critical discourse analysis, highlighting the importance of historicity for understanding the continuities of discourses (Achugar, 2017). Sometimes DA, when used to analyze political discourse, is referred to as Political Discourse Analysis (PDA) (Dunmire, 2012).

Critical Discourse Analysis (CDA) examines the means by which political power is manifested or abused through discourse structures and practices (Dunmire, 2012). Achugar (2017) shows that since the past has become an area of focus for CDA, this method has become a salient one in historical research. One of its major aims is to provide an explanation of the power differences in contemporary society by researching past events and their context.

3.3.1 British Parliament and Foreign Policy in the 20th Century

Research problem: Ihalainen and Matikainen (2016) investigated the parliamentarization of foreign policy in the British Parliament throughout the 20th century. They argue that throughout the 20th century, parliaments in general gained more power in discussing foreign policy, and in the British Parliament especially this parliamentarization of foreign policy debates was highly noticeable.

Research method: They combine an analysis of policy documents with a more discourse-oriented analysis of parliamentary debates. Their research method is more discourse-oriented than traditional diplomatic history, since they do not focus only on policy documents but also consider the discourse of the parliamentary debates of that time.

Data collection: The authors utilized a wide variety of primary sources, with Hansard constituting the starting point of their analysis. They also used parliamentary papers such as committee reports, as well as relevant sources created by other political actors – the Foreign Office, other relevant government departments, voluntary associations, the media, etc.
Their data therefore consists of parliamentary debates on the one hand and archival documents, public debates, and interviews on the other. They argue that the use of such a wide range of data was necessary to grasp the multi-sided nature of the policy discourse and to ensure that the data was vast enough to provide a complete picture of how the parliamentarization of foreign policy debates occurred. The parliamentary records database was electronically available, which resulted in the authors utilizing full-text searches to locate sources for a contextual analysis of parliamentary debates. They wanted to locate potentially interesting debates and analyze them by using the aforementioned historical methods.

Discussion: The authors do not provide a detailed account of how the data was collected and give no information about how the documents, other than the electronically accessible Hansard, were obtained. They do, however, clearly show that in order to conduct thorough historical research, a variety of sources needs to be studied, and that focusing only on parliamentary debates is not enough.

3.3.2 British Lobbying and Parliamentary Discourse

Research problem: McGrath (2018) focuses his research on lobbying, which he sees as a significant component of modern politics in Britain. In his article, he provides a detailed explanation of the scale and significance of lobbying and studies how lobbying in Britain was discussed not only by parliamentarians but also by journalists.

Research method: The author utilized keyword search on several digitized archives, which helped him gather extracts from parliamentary debates and newspaper articles. He blended both qualitative and quantitative readings of the texts, which leads us to assume that some kind of discourse analysis method was employed.

Data collection: The author draws on parliamentary debates as well as three other databases which together comprise 51 newspaper titles between 1800 and 1950. The data was available in electronic archives and already in written form, so no transcription was needed. The unit of analysis is an individual newspaper article or parliamentary speech. The database consisted of four online archives: 1) Hansard (1803–1950), 2) the British Library (1800–1900), 3) The Times (1800–1950), and 4) The Guardian (1800–1950). To gather the source material, the author employed a three-step process: firstly, each archive was searched using a range of keywords associated with lobbying, which produced roughly 1,691 items. Secondly, each item was printed, carefully read, and sorted according to the descriptor he was searching for. Some of the data was already discarded here, since it did not correspond to the search parameters (e.g., it did not relate to governmental bodies, the material covered lobbying in countries other than Britain, etc.). In the third stage, the items which were not removed were put into chronological order and the author removed all the duplicates. This resulted in 689 items being determined as suitable for analysis. Once all the unique items were collected, the individual items were examined and coded. To acquire the appropriate data, McGrath employed a five-stage process to transform the qualitative material into quantitative data, although not all stages needed to be applied. He sourced the material but did not need to transcribe it, as it had already been made available in textual form. The data was then unitized, then categorized on the basis of the actual data and relevant theory, and finally each unit was separately coded.

Discussion: The author never explicitly mentions discourse analysis as his research method. But since he talks about conducting a qualitative analysis of discourse from parliamentary records and newspaper articles, we can assume that he employed a discourse analysis approach.

3.3.3 Nationalism and Political Discourse in Scotland

Research problem: The research conducted by Whigham (2019) critically examines the narratives which emerged from party political discourse after the Scottish independence referendum in 2014. The aim of the research is to analyze the past discourse on nationalism in Scotland and to critically reflect on narratives about the Scottish nation’s past.

Research method: The author employs the methodological approach called political discourse analysis (PDA), which was introduced and thoroughly explained by Fairclough and Fairclough (2012). According to Whigham, this method was used since it provides an “original methodological contribution to the study of Scottish nationalism”.

Data collection: The author focused on the parliamentary discourse of the largest political parties in Scotland, namely the pro-independence Scottish National Party (SNP) on the one hand, and the Scottish Labour Party as well as the Scottish Conservative and Unionist Party on the other. The database consisted of election manifestos and policy documents which were related specifically to the independence referendum. Because of the wide range of potentially useful data, Whigham focused primarily on political manifestos and constitutional policy documents. This also allowed for a more detailed analysis of only the crucial information about each party’s position on the Scottish constitutional debates. The author used the Nvivo qualitative data analysis software package (QSR International Pty Ltd., 2020), which helped him code the content of each of the data sources according to the themes that emerged. This was then followed by a coding process which categorized low-level codes into higher-level discursive forms. This sample allowed for a reflection on and thorough analysis of political discourse.

Discussion: It needs to be noted that the application of the Nvivo software is an exemplary one and is not frequently observed in historical research. Also, at times the article reads as a sociological one, and we believe that it could just as well be classified as such, since the author is also a sociologist. However, a more thorough description of the methodological framework would be appreciated.

3.4 Content Analysis

Content Analysis (CA) primarily focuses on studying and analyzing society and social life by examining the content of visual and textual media – texts, images, and other media products.
Mihailescu (2019) understands it as a research technique for making replicable and valid inferences from data to their contexts, which is particularly useful in the humanities and social sciences. It is a methodological approach which can help in the development of deductive and inductive capacities, which are extremely important in historical research. In addition, it is highly useful in historical research where researchers are analyzing data with large amounts of text and where meaningful information needs to be extracted from the historical documents.

Since CA frequently intertwines both qualitative and quantitative approaches, it sometimes comes close to mixed methods. CA can sometimes be mistaken for Discourse Analysis, since the two methods are very similar. Although both are interested in providing the context of an event, the main difference between the two is that CA focuses on the content of the text, whereas DA focuses on the language that is used in the text and its context.

3.4.1 Constructing the Child in Need of State Protection

Research problem: In this article, Smith (2016) explores the development of the discourse surrounding children in need of state protection in Ireland. She mostly focuses on the discourse produced by legislators and government ministers, who are ultimately responsible for child services.

Research method: The author employs content analysis of various bills as well as parliamentary debates. She defines it as a textual analysis, but we regard it as content analysis, since she focuses mainly on the content of the bills and debates.

Data collection: The author focuses on a specific timeframe in Irish history, namely between 1922 (the formation of the Irish Free State) and 1991 (the adoption of the current legislative framework for children’s welfare). The data consists of debates from both houses of the Irish parliament – the House of Deputies and the Senate. In addition, Smith also focused on the official reports which informed these debates. In one part of her research, she focused on the parliamentary debates on the Children Bills of 1928 and 1940 and the Cussen Report from 1936. In the second part, she conducts an analysis of the Kennedy Report (1970), the Final Report of the Task Force on Child Care Services (1981), as well as the parliamentary debates on the Child Care Bill of 1988.

Discussion: The author dedicates only one paragraph to explicating where the data was taken from, in addition to only briefly mentioning the method she used. We consider this to be one of the major shortcomings of this article, since it would be useful to know how the textual analysis was performed, what the author focused on and why, as well as what her motivation was for focusing on those specific bills and debates.

3.4.2 Clientelism in Irish Politics

Research problem: The main aim of this article is to research the emergence and development of the discourse which revolves around the concept of clientelism in Irish politics. Kusche (2017) focuses on the analysis of the relationship between Irish deputies and voters, which has been perceived as particularly clientelist.

Research method: Kusche identifies the main method of her research as qualitative content analysis. She shows that although this is historical research, it does have certain methodological features of sociological research, since sociology also frequently employs content analysis as its main methodological framework.

Data collection: The author draws on parliamentary speeches as well as newspaper articles in order to research the emergence and development of Irish political clientelism and its critique. This empirical material was deliberately chosen since it has been made continuously available throughout the decades and is easy to access. She gathered the parliamentary data from the official website of the Irish parliament and the media data from the online archives of the respective newspapers. She opted for data from two of the most frequently read Irish quality papers, namely the Irish Independent and the Irish Times. The first step of her data collection consisted of a keyword search for the words “clientelism” and “brokerage” in both the parliamentary and newspaper records. After realizing that the two words had not been used until the 1980s, she identified other potentially relevant terms based on their emergence in items referring to the two keywords. This produced several other keywords such as “stroke politics”, “gombeen politics”, etc., which the author used to find relevant data. The period of her analysis starts in the early 1940s and runs up to 2012. She notes that in the case of parliamentary records, the unit of her analysis is the contribution of a member of the House of Deputies or the Senate; this can either be a speech or a short intervention. In the case of newspaper articles, the unit of her analysis is the article itself. The respective units were then coded according to their focus, and since some units included several views of the matter, they were coded in more than one category. Then those articles and debates which specifically focused on the link between politicians and voters were selected for a more detailed interpretation.

Discussion: This article falls under the category of historical social research and employs methodological approaches which are frequent in both historical and sociological research. She gives a detailed account of the method and data she used and where this data was taken from, which is not always the case in historical research. As seen in some of the previously reviewed articles, the author combined parliamentary and newspaper data so as to address the concept of clientelism in as much detail as possible.
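Keyword searches of the kind used in McGrath’s and Kusche’s studies are straightforward to reproduce on machine-readable records. The following is a small illustrative sketch, not taken from any of the reviewed studies; the file name and keyword list are invented, and it assumes one speech or article per line in a plain-text file.

# Illustrative keyword search over parliamentary records (not from the
# reviewed studies). The file name and keywords are invented placeholders.
import re
from collections import Counter

KEYWORDS = ["clientelism", "brokerage", "stroke politics"]

counts = Counter()
hits = []  # (line number, keyword, unit) triples kept for later manual coding

with open("parliamentary_speeches.txt", encoding="utf-8") as f:
    for lineno, unit in enumerate(f, start=1):
        for kw in KEYWORDS:
            # \b keeps "brokerage" from matching inside longer words
            if re.search(r"\b" + re.escape(kw) + r"\b", unit, re.IGNORECASE):
                counts[kw] += 1
                hits.append((lineno, kw, unit.strip()))

print(counts)  # keyword frequencies as a starting point for coding

The retrieved units can then be coded manually, mirroring the unit-based coding described above; corpora such as ParlaMint additionally provide speaker metadata, which plain-text search like this lacks.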
3.5 Mixed Methods

Shorten and Smith (2017) understand the mixed methods approach as drawing on the strengths of both qualitative and quantitative methods, which results in a more complete picture of a research problem. It is a highly complementary approach, which means that the results produced by one of the methods can be elaborated and clarified with the findings of the other method. This means that the triangulation of one set of results influences and enhances the validity of the inferences. In addition, the combination of different methodologies, approaches, and various fields of research adds to the validity of the research and eliminates the possibility of research bias. Thies (2002) shows that, as in many other disciplines (sociology, for example), investigator bias as well as unwarranted selectivity in the use of historical source materials are the main problems of qualitative historical research, which emphasizes the importance of the selection of an appropriate methodological approach.

Corpus-Assisted Discourse Studies (CADS) combine qualitative Discourse Analysis with the predominantly quantitative corpus-based approach. The main aim of CADS is to facilitate understanding from the linguistic perspective as well as from that of the humanities and social sciences. As Partington (2012) shows, this approach uses corpus techniques to investigate a particular political or institutional discourse type and to uncover and analyze obvious patterns of language or aspects of linguistic interaction.

3.5.1 Scottish Political Rhetoric in the Invasion of Iraq

Research problem: Elcheroth and Reicher (2014) conduct a systematic analysis of the Scottish debate over the invasion of Iraq in 2003. The aim of their article is, on the one hand, to show the development of the debates in the Scottish Parliament and to conduct an analysis of the parliamentary discourse of the anti-war Scottish separatist parties, and on the other, to examine how the conflict was construed as either for or against the national interest.

Research method: The authors employ a mixed-methods approach and used thematic coding. On the one hand this produced structured inventories of arguments which served as the grid for a qualitative analysis, and on the other, it produced a database which was then used for content analysis.

Data collection: The data for the analysis consist of all the contributions to four Scottish parliamentary debates referring to the Gulf War. A total of 106 interventions which occurred between January 2003 and June 2004 were used as the dataset. It needs to be noted that during the 2003 Gulf War there was also the campaign for the election to the Scottish Parliament, which meant that the election debate was definitely influenced by the war debate. Each individual intervention was separately coded to extract information such as which debate the speech was taken from, what the party membership of the speaker was, what the overall moral argument was, and so on. Special emphasis was put on the two parliamentary debates which occurred right before the invasion of Iraq (January and March 2003), as well as on the first two substantial parliamentary debates that took place after the invasion (November 2003 and June 2004). The transcripts of these debates were all published in the official records of the parliament, and they constituted the “corpus” data for their further analysis. When determining the relevant data, all the transcriptions were read several times and coded for those interventions that included arguments that were thematically fitting for the analysis. The two pre-invasion debates produced 68 relevant interventions, whereas the two post-invasion debates produced the remaining 38.

Discussion: This article consists of two separate studies. The first study is the analysis of parliamentary speeches, whereas in the second part the authors turn from elite discourse to the popular understanding of the war. The second part draws on the data from the Scottish Social Attitudes (SSA) survey, and since it does not focus on parliamentary discourse, only the first study was of interest to us.

3.5.2 Political Discourse of Israeli PMs between 2001 and 2009

Research problem: The aim of this article by Gavriely-Nuri (2013) is to look critically at the uses of collective memories in Israeli politics. Collective memories are of great significance in the case of Israel due to its historical background, and this article analyzes how collective memories are used within a corpus of speeches of Israeli Prime Ministers.

Research method: The author employed a methodological approach which incorporated both Critical Discourse Analysis (CDA) and corpus linguistics. Because of the combination of corpus linguistics and discourse analysis, we regarded this article as using the approach called Corpus-Assisted Discourse Studies (CADS).

Data collection: The data used for this study consisted of the speeches of Israeli Prime Ministers over a period of 9 years (between 2001 and 2009) which were delivered in the Israeli Parliament (Knesset). The author conducted a computerized search in the speech archive that includes the addresses of the PMs and constructed a corpus, which was then used as a database. The corpus included speeches by the two selected Prime Ministers, namely Ariel Sharon (2001–2005) and Ehud Olmert (2006–2009). Her computerized search revealed 274 instances of the word “memory”, which was determined as the keyword to identify relevant speeches. All those references were then carefully studied and read in order to determine the context. This resulted in identifying 103 references to the phrase “collective memory”, which were distributed among 64 speeches. In this count, the author also included synonyms such as “national memory”, “public’s memory”, “people’s memories”, etc. Once the data was broadly selected, the author performed a two-stage analysis to determine the actual topics of the speeches. In the first stage, the context in which national events evoked the mention of collective memory was analyzed. In the second stage, the specific content included in the mentions was studied.

Discussion: Although the author mentions the cultural approach to CDA, she gives no detailed account of how this approach differs from traditional CDA or what its benefits are. One of the possible justifications for employing the cultural approach is the study of the cultural context of the PMs’ speeches and their cultural significance. We also found that in the article there is no explicit elaboration as to why this particular methodological framework was selected and how it contributes to the overall analysis.

3.6 Digital History and Topic Modelling

We have shown that historians gather their data mostly from historical archives and feel “much more confident when using traditional sources in printed format, since they believe to have better access to the historical data required for their research” (Torou, 2009). Guldi (2019) believes that digital methods can help researchers land material for historical synthesis that “builds upon the insights of foregoing historians while potentially illuminating new directions for further research”. Some authors (Piersma et al., 2014) regard these methods as the Digital Approach or Digital History, the main function of which is to enable historians to use advanced search engines in order to explore large quantities of data.

Topic modelling is capable of scanning a large set of documents, within which it detects word and phrase patterns and automatically clusters them into groups according to their meaning. As Guldi (2019) shows, topic modelling has been effectively used in history to identify patterns of historical interest in academic sources and, in combination with discourse studies, has proven to be useful for historical analysis.

Today, several software packages exist which can be used on a pre-existing database of digitized texts. This, and the fact that digital methods such as text mining and topic modelling are becoming increasingly used in historical research of parliamentary discourse, underlines the importance of digitizing historical parliamentary records, and of not only enabling but also encouraging historians to start using them as one of the primary sources of data for their research.
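To make the workflow concrete, the following is a generic topic-modelling sketch using scikit-learn; it fits a plain LDA model rather than the dynamic topic models used in the study reviewed below, and the document list is an invented placeholder standing in for a corpus of debates or speeches.

# Generic topic-modelling sketch (not the code of any reviewed study).
# Assumes: pip install scikit-learn; "debates" is a placeholder list in
# which each entry is one debate (or one speech) as a single string.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

debates = ["... debate text ...", "... debate text ..."]  # placeholder corpus

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(debates)
vocab = vectorizer.get_feature_names_out()

# Vary the granularity, as in the experiment described below
# (which tried 4, 10, 100, 500 and 1000 topics).
for n_topics in (4, 10, 100):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    for topic_id, weights in enumerate(lda.components_):
        top_words = [vocab[i] for i in weights.argsort()[-10:][::-1]]
        print(n_topics, topic_id, top_words)

Treating either a whole debate or a single speech as one "document" simply changes how the placeholder list is populated, which is exactly the design choice discussed in the study below.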
Topic modelling is capable of scanning a large set of documents, within which it detects word and phrase patterns and automatically clusters them into groups according to their meaning. As Guldi (2019) shows, topic modelling has been effectively used in history to identify patterns of historical interest in academic sources and, in combination with discourse studies, has proven to be useful for historical analysis.

Today, several software packages exist which can be used on a pre-existing database of digitized texts. This, and the fact that digital methods such as text mining and topic modelling are becoming increasingly used in historical research of parliamentary discourse, underlines the importance of digitizing historical parliamentary records, and of not only enabling but also encouraging historians to start using them as one of the primary sources of data for their research.

3.6.1 Topic Modelling and Historical Change

Research problem: The aim of Guldi’s (2019) article is to research the parliamentary discourse on 19th-century British empire infrastructure projects, such as the drainage of the River Shannon in 1860, as well as the parliamentary argument about the telegraph connection between England and India.

Research method: The author uses dynamic topic modelling, which allowed her to generalize about the discourse on a diachronic dataset, observing trends in different time periods.

Data collection: The data for her research consisted of parliamentary debates in the British parliament in the 19th century, gathered from Hansard, the official database of all UK parliamentary debates. The author focused on several topics connected to the infrastructure and employed approximately the same data collection and analysis in all of them. The entire Hansard database was subjected to topic modelling, resulting in a set of words used by MPs most indicative of their discussions of a certain topic. The author experimented with using, on the one hand, a debate as a document and, on the other, a speech as a document. In addition, she also experimented with degrees of granularity for the analysis, asking the computer to return either 4, 10, 100, 500 or 1000 topics. She obtained the most informative results with 500 topics, as the search returned fairly specific words which were interesting for further analysis.

Discussion: Guldi shows how topic modelling can be implemented into research and analysis of historical data. Topic modelling is becoming increasingly popular in historical research and is frequently used not only on the national but also on the international level (e.g., when researching debates in the European parliament). It is important to note that topic modelling must always be complemented by the researcher’s own analysis and critical skills when interpreting its results.
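To make the kind of experiment described above concrete, the following is a minimal Python sketch using the gensim library; it is only an illustration under stated assumptions: plain LDA stands in for the dynamic topic model Guldi actually used, the toy documents are placeholders, and one “document” may equally be a whole debate or a single speech, depending on the chosen granularity.

    # A minimal sketch, assuming gensim is installed; plain LDA stands in
    # for the dynamic topic model used by Guldi (2019).
    from gensim import corpora, models

    def topics_for(documents, num_topics):
        """Train LDA on pre-tokenised documents and list the topics."""
        dictionary = corpora.Dictionary(documents)
        bow = [dictionary.doc2bow(doc) for doc in documents]
        lda = models.LdaModel(bow, id2word=dictionary, num_topics=num_topics)
        return lda.print_topics(num_words=10)

    # Placeholder "documents": each inner list is one debate (or one
    # speech, depending on the chosen granularity).
    debates = [
        ["telegraph", "india", "cable", "empire"],
        ["drainage", "shannon", "river", "works"],
    ]
    # Guldi asked for 4, 10, 100, 500 or 1000 topics; a toy corpus only
    # supports the smallest settings.
    for k in (2, 4):
        print(k, topics_for(debates, k))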
4. Discussion and Conclusion

This literature review shows the most common methods and approaches that (political) historians use in their research of parliamentary discourse, as well as tries to understand what kind of data and information historians are looking for and which sources they use. We can confirm the observations of Torou et al. (2009) that either printed or digitized sources of primary and secondary documents are used in historical research, with digitized sources (transcriptions, text documents, and corpora) becoming increasingly popular. This makes the historical research community a potentially important user group of the ParlaMint corpora.

This literature review shows how the majority of researchers of political history collect data on their own, using techniques and methods which are often time-consuming and demand a lot of manual work. Such work could be made much more research-friendly and efficient if historical parliamentary corpora were developed, annotated, and documented. They would present a database of collected parliamentary records of the past and would be a useful source of historical parliamentary records which would be an invaluable extension of the ParlaMint project.

Our first aim should therefore be to provide historians with tutorials, workshops and showcases on how to use corpora, corpus data, and the main corpus-analytical techniques. Rich and user-friendly documentation on how the ParlaMint data is gathered, processed, and annotated is to be made available to the historians, in addition to offering quick user manuals which would show the basic use of concordancers, so that historians can learn how to use corpora effectively.

Then, we should encourage them to develop and use their own corpora and datasets for the historical periods they are interested in, using the same encoding standards. In this endeavor, we agree with Kytö (2010) that compilers of the data should document their compilation decisions in clear terms in user guides, corpus manuals, and training materials which need to accompany the release versions of the corpora, since without them it would be impossible for end-users to find information about the background of the texts which are included in the historical corpora.

The ParlaMint community should also focus on the implementation of the tagging of the digital repository contents with complete and structured metadata. Some historians (Torou et al., 2009) note that the information which is typically used in research queries by historians (such as the author, the topic of the item, the date of creation, the period to which the content refers, etc.) should be available as metadata. The availability and reliability of metadata is extremely important, since historians often rely on the additional data and information about a certain historical source.

Marjanen (personal communication, 2022) points out that historians researching parliamentary discourse are highly interested in the use of rhetoric, the uses of voice, and practices of negotiation and debate. One of their key interests is identifying who talked, which makes the availability of any metadata about the MPs of vital importance. According to Marjanen, there are also some historians who, together with traditional sources, use audio and video recordings from parliament to study non-verbal elements in parliamentary discourse. He points out that with digitized sources, keyword search has made material much more accessible, though many historians are often interested in something broader than keyword search. They focus on the entirety of speeches or discourse related to a certain topic, since keyword search often does not produce enough relevant results. Historians are used to finding these
“discourses” on their own, but if the process of searching for relevant sources was made easier for them, it would definitely be welcomed.

The increasing availability of digitized sources appears to be setting an interesting trend. In addition to more and more sources and documents becoming digitized and made available through electronic libraries, various digital research tools and approaches are becoming available, making historical research often very digital. Therefore, political historians nowadays already employ digital approaches and tools to analyze parliamentary data, and these approaches allow them to gather and analyze data in a faster, more efficient, and less time-consuming manner. However, the development of parliamentary historical corpora could potentially reshape the entire process of historical research and offer a new understanding of the parliamentary data. As Blaxill (2013) shows, the combined approaches of close manual analysis and selective quantification simplify the research as well as facilitate numerical comparison and contextualization.

The argument we want to put forward with this literature review is not that current qualitative historical research of political debates and parliamentary discourse should be completely replaced by more quantitative corpus-assisted approaches, but rather that corpora could be effectively used alongside traditional qualitative historical analysis. We treat corpora as potentially powerful tools which would not only simplify data collection and generate relevant results much more effortlessly, but also effectively reduce and minimize potential research bias that might be present in the analysis of historical data.

This review also shows the need for more systematic, transparent, and replicable quantitative and qualitative analysis, which makes corpus-assisted approaches ideally suited for historical research of parliamentary discourse. The immediate usefulness of the ParlaMint corpora is also clearly confirmed by this review, and it emphasizes the need for further enrichment and the addition of historical data to the current ParlaMint database.

5. Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency research programme P6-0436: Digital Humanities: resources, tools, and methods (2022-2027), the Social Sciences & Humanities Open Cloud (SSHOC) project (https://www.sshopencloud.eu/), the CLARIN ERIC ParlaMint project (https://www.clarin.eu/parlamint) and the DARIAH-SI research infrastructure.

6. References

Mariana Achugar. 2017. Critical discourse analysis and history. In J. Flowerdew and J.E. Richardson (Ed.) The Routledge Handbook of Critical Discourse Studies, Vol. 1, pages 298-311. Routledge, London.
Luke Blaxill. 2013. Quantifying the language of British politics, 1880–1910. Historical Research, 86(232): 313-341. https://doi.org/10.1111/1468-2281.12011
Constance De Saint Laurent. 2014. “I would rather be hanged than agree with you!”: Collective Memory and the Definition of the Nation in Parliamentary Debates on Immigration. Critical Practice Studies, 15(3): 22-53. http://dx.doi.org/10.7146/ocps.v15i3.19860
Patricia L. Dunmire. 2012. Political discourse analysis: Exploring the language of Politics and the Politics of language. Language and Linguistics Compass, 6: 735-751.
Guy Elcheroth and Steve Reicher. 2014. ‘Not our war, not our country’: Contents and contexts of Scottish political rhetoric and popular understandings during the invasion of Iraq. British Journal of Social Psychology, 53: 112-133. https://doi.org/10.1111/bjso.12020
Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, et al. 2022. The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation. https://doi.org/10.1007/s10579-021-09574-0
Anna Friberg. 2012. Democracy in the Plural? The Concepts of Democracy in Swedish Parliamentary Debates during the Interwar Years. Contributions to the History of Concepts, 7(1): 12-35. http://dx.doi.org/10.3167/choc.2012.070102
Dalia Gavriely-Nuri. 2013. Collective memory as a metaphor: The case of speeches by Israeli prime ministers 2001–2009. Memory Studies, 7(1): 46-60. https://doi.org/10.1177%2F1750698013497953
Jo Guldi. 2019. Parliament's Debates about Infrastructure: An Exercise in Using Dynamic Topic Models to Synthesize Historical Change. Technology and Culture, 60(1): 1-33. https://doi.org/10.1353/tech.2019.0000
Patricia Hogwood. 2013. Selective memory: challenging the past in post-GDR society. In: Saunders, A. and Pinfold, D. (Ed.) Remembering and Rethinking the GDR, pages 34-48. Palgrave Macmillan, London.
Pasi Ihalainen and Satu Matikainen. 2016. The British Parliament and Foreign Policy in the 20th Century: Towards Increasing Parliamentarisation? Parliamentary History, 35(1): 1-14. https://doi.org/10.1111/1750-0206.12180
Pasi Ihalainen and Taina Saarinen. 2019. Integrating a Nexus: The History of Political Discourse and Language Policy Research. Rethinking History: 1-20. https://doi.org/10.1080/13642529.2019.1638587
Pasi Ihalainen. 2021. Parliaments as Meeting Places for Political Concepts. Centre for Intellectual History, University of Oxford. https://intellectualhistory.web.ox.ac.uk/article/parliaments-as-meeting-places-for-political-concepts
Isabel Kusche. 2017. The Accusation of Clientelism: On the Interplay between Social Science, Mass Media and Politics in the Critique of Irish Democracy. Historical Social Research, 42(3): 172-195. https://www.jstor.org/stable/44425367
Merja Kytö. 2010. Corpora and historical linguistics. Revista Brasileira de Linguística Aplicada, 11(2): 417-457. http://dx.doi.org/10.1590/S1984-63982011000200007
Daniel Little. 2016. What is "conceptual history"? Understanding Society. https://understandingsociety.blogspot.com/2016/10/what-is-conceptual-history.html
Jani Marjanen. Personal communication. By Jure Skubic, 27 May 2022.
Conor McGrath. 2018. British Lobbying in Newspaper and Parliamentary Discourse, 1800–1950. Parliamentary History, 37(2): 226-249. https://doi.org/10.1111/1750-0206.12363
Mimi Mihailescu. 2019. Content analysis: a digital method. http://dx.doi.org/10.13140/RG.2.2.21296.61441
Ciprian Negoita. 2015. Immunity: A Conceptual Analysis for France and Romania. Contributions to the History of Concepts, 10(1): 89-109. http://dx.doi.org/10.3167/choc.2015.100105
Alan Partington. 2012. Corpus Analysis of Political Language. In: C.A. Chapelle (Ed.) The Encyclopedia of Applied Linguistics. Blackwell Publishing Ltd.
Hinke Piersma, Ismee Tames, Lars Buitinck, Johan van Doornik and Maarten Marx. 2014. War in Parliament: What a Digital Approach Can Add to the Study of Parliamentary History. Digital Humanities Quarterly, 8(1): 1-18. http://www.digitalhumanities.org/dhq/vol/8/1/000176/000176.html
Allison Shorten and Joanna Smith. 2017. Mixed methods research: expanding the evidence base. Evidence-Based Nursing, 20: 74-75. https://ebn.bmj.com/content/20/3/74.info
Karen Smith. 2016. Constructing the Child in Need of State Protection: Continuity and Change in Irish Political Discourse, 1922–1991. The Journal of the History of Childhood and Youth, 9(2): 309-323. https://doi.org/10.1353/hcy.2016.0042
Jure Skubic and Darja Fišer. 2022. Parliamentary Discourse Research in Sociology: Literature Review. Accepted for publication in Proceedings of the ParlaCLARIN III workshop at LREC2022, pages 91-100, Marseille, France.
Cameron G. Thies. 2002. A Pragmatic Guide to Qualitative Historical Analysis in the Study of International Relations. International Studies Perspectives, 3(4): 351-372. https://www.jstor.org/stable/44218229
Elena Torou, Akrivi Katifori, Costas Vassilakis, Georgios Lepouras and Constantin Halatsis. 2009. Capturing the historical research methodology: an experimental approach. In Proceedings of the International Conference of Education, Research and Innovation, Madrid, Spain, 2009.
Teun Van Dijk. 2018. Discourse and Migration. In Qualitative Research in European Migration Studies, edited by Ricard Zapata-Barrero and Evren Yalaz, 227-247. Springer Open. https://link.springer.com/book/10.1007/978-3-319-76861-8
Stuart Whigham. 2019. Nationalism, party political discourse and Scottish independence: comparing discursive visions of Scotland's constitutional status. Nations and Nationalism, 25(4): 1212-1237. https://doi.org/10.1111/nana.12535
Annotation of Named Entities in the May68 Corpus: NEs in modernist literary
texts
Mojca Šorli,* Andrejka Žejn†
* ZRC SAZU, Institute of Slovenian Literature and Literary Studies
Novi trg 2, SI-1000 Ljubljana
mojca.sorli@zrc-sazu.si
† ZRC SAZU, Institute of Slovenian Literature and Literary Studies
Novi trg 2, SI-1000 Ljubljana
andrejka.zejn@zrc-sazu.si
Abstract
In this paper we present the process of manual semantic annotation of a corpus of modernist literary texts. An extended set of annotations is proposed with respect to the established NER-systems and practices of related projects, i.e. several categories of proper names, foreign language elements and bibliographic citations. We focus on the annotation challenges concerning the names of literary characters seen in transition from common nouns to proper names, as well as giving examples of the results of preliminary analyses of the corpus.
1. Introduction

The starting point of the digital humanist literary project presented here is a corpus of literary texts that was created according to special criteria defined for the purposes of this research. In view of the significance for DH of controlling a large number of texts and their vertical reading, where patterns become visible that cannot be detected with the naked eye or traditional close reading, the corpus size is often seen as a key factor. At the same time, large text volumes require automation of corpus processing for quantitative analysis, involving different levels of (linguistic) annotation in the first phase, and allowing additional levels of semantic annotation in later phases that enrich the text with metadata. In the presented approach, however, the annotation task is performed on a small, specialized corpus that is easier to control and allows for manual annotation. The identified and manually annotated Named Entities are distinguished based on semantic criteria, so we consider this an example of semantic annotation.

Linguistically annotated corpora have long been a standard tool for linguistic research. Named Entity Recognition (hereafter NER) and analysis has also long been relevant in the social sciences and sociology (Ketschik, 2020), from where the method, like several others, has been transferred via linguistics to literary studies, where named entities are most closely associated with literary character research. A more comprehensive picture of the way characters are named in literature, beyond the automatic recognition of Named Entities (hereafter NEs), can be obtained by manually annotating these entities in literary texts, by analyzing the annotation process, and finally by analyzing the data obtained from the annotated corpus itself.

2. The Goal of the paper

In this paper we report on an attempt to identify and annotate three groups of NEs in the “Corpus of 1968 Slovenian literature Maj68 2.0” (short name May68 Corpus) – a corpus of Slovenian modernist literary texts from the late 1960s to the early 1970s,1 discussing these groups from the point of view of three different sources of representation problems that are independent but interrelated: ambiguity, variation, and uncertainty. As pointed out in Beck et al. (2020), representational problems in linguistic annotation arise from five different sources (ibid., 61): (i) ambiguity is an inherent property of the data; (ii) variation is also part of the data and can occur, for example, in different documents; (iii) uncertainty is caused by a lack of knowledge or information on the part of the annotator; (iv) errors may be found in the annotations; (v) bias is a property of the entire annotation system. We list a number of relevant annotated categories, their specific character, and the representational problems associated with them. Our choices are discussed whenever any of the first three listed sources of representation problems apply.

Together with the theoretical concept, the selection of annotation material, and the definition of guidelines for the annotation process (Pagel et al., 2020), the annotation scheme presented here is a model of extended annotation of NEs in modernist periodicals that can be applied in certain segments to other corpora of literary texts. We focus both on the identified inaccuracies and on the benefits of manual annotation of selected groups of NEs in our specialized corpus of literary texts. In the concluding part, we present the preliminary results of an analysis performed on the annotated corpus.

Following the automatic preprocessing (i.e., POS tagging and lemmatization) of the May68 Corpus, further manual annotation was performed to capture more complex linguistic (semantic) phenomena and to provide a more sophisticated annotation model for proper names given the recurring representational problems: at this first stage, a model for identifying and annotating the selected NEs was put in place, with a second stage of the project envisaged, in which the texts will be annotated for the use of metaphor. Here we will focus on some open challenges in the annotation of NEs, in particular problems related to the functional aspects of the annotated elements. We discuss the practical treatment of proper names for the purposes of corpus linguistic and stylistic research, in the hope of

1 http://hdl.handle.net/11356/1491
improving the reliability of research results and also of NLP models.

3. Automated and manual annotation of corpora

In the context of language technologies, universal concepts and tools for automatic corpus annotation have been developed to some extent, especially for individual language groups, while language-specific concepts and tools are also needed. Established levels of automatic tagging for Slovenian, initially based on lexicographic and linguistic projects, include tokenization and related segmentation into sentences, normalization, morphosyntactic tagging, lemmatization, and syntactic parsing (Erjavec et al., 2015). NEs pose a challenge for automatic extraction of information due to their semantic and functional complexities. For Slovenian, the main tool used is StanfordNER, which assigns lexical units to predefined categories (Ljubešić et al., 2012): personal names, geographical names and common proper nouns. The state-of-the-art of the existing NER tools for Slovenian has not been the focus of this research, but a preliminary review of the tools, as well as of the function of NEs in the texts, has shown their limited applicability to the specialized literary corpus that we set out to investigate.
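For illustration only, the sketch below shows what such category assignment looks like in practice through NLTK's interface to the Stanford NER tagger; the model and jar file names are hypothetical placeholders (a Slovenian model in the spirit of Ljubešić et al. (2012) would have to be supplied), so this is a sketch under stated assumptions rather than the setup reviewed here.

    # A hedged sketch: the file paths are hypothetical placeholders, and
    # running it requires Java plus the Stanford NER distribution.
    from nltk.tag.stanford import StanfordNERTagger

    tagger = StanfordNERTagger(
        "slovene-ner-model.ser.gz",  # hypothetical Slovenian model file
        "stanford-ner.jar",          # path to the Stanford NER jar
    )
    tokens = "Ančika je odšla v Ljubljano".split()
    print(tagger.tag(tokens))
    # e.g. [('Ančika', 'PERSON'), ('je', 'O'), ('odšla', 'O'),
    #       ('v', 'O'), ('Ljubljano', 'LOCATION')]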
3.1. NER-systems for corpora of literary texts

For literary texts, narratology in particular has developed various typologies of protagonists, heroes, or major and minor characters in texts, ways of characterizing them, and strategies for recognizing them. Since the advent of digital tools, researchers have had to find a way to translate the definitions formed by literary scholars into computer-readable data (Krautter et al., 2018).

While there are no specific NER-systems for annotating literary texts, even though literary texts have a high variation of NEs compared to normal non-fiction texts (Stanković et al., 2019), “universal” systems are often used. However, automatic annotation tends to overlook certain segments of NEs in literary texts (Vala et al., 2015). Attempts are made to overcome these limitations by additional automatic tagging, or to expand the set of annotated entities by manual tagging, often of referential expressions, i.e., linguistic expressions that refer to a specific entity in the text world, where the entities and their references must be interconnected (entity grounding). References and connections themselves can only be inferred from the knowledge of the context (Ketschik, 2020; Papay and Padó, 2020), so in the early stages of research, manual annotation of the corpus is usually required to improve the automatic process.

3.2. Background and related work

Compiling lists of NEs, especially for categories of proper names, represents only the basis for the identification of character names and is as yet insufficient for relevant literary analyses, so these lists must be dealt with by multidimensional approaches that shed additional light on proper names in view of the special features of literary texts. Empirical analyses of protagonists in the literature can, at the most basic level, for example, study the characteristics of names, their typicality, archaic character, or “unusualness” for a particular society (cf. Calvo Tello, 2021), or compare the usage and functions of proper names, exploring to what extent they are genre-related (e.g. children’s literature, cf. van Dalen-Oskam, 2022). Empirical analysis of the ratio between female and male characters in a corpus of English literature up to the mid-20th century (cf. Nagaraj and Kejriwal, 2022), for example, showed the quantitative predominance of male characters over female characters. More complex research also deals with characterization analysis, identifying relationships between main and secondary characters, examining the relationship between active and passive character presence, and distinguishing between “actively present” characters and characters from other fictional worlds (Krautter et al., 2018; Brooke et al., 2016; Ketschik, 2020). One of the more established approaches is the application of social network analysis, a method from empirical sociology that builds on the relationships between NEs. The analysis of social networks in the literature (cf. de Does, 2017) is closely related to quantitative approaches to the study of direct and reported speech, or narrator speech and character speech, in storytelling and drama, where NEs are an essential component of a broader context (cf. Burrows, 2004; Moretti, 2011; Elson et al., 2010; Papay and Padó, 2020). Digitally supported analysis of the broader picture of characters also draws on concepts derived from Bakhtin’s concept of the chronotope, such as Text World Theory – a cognitive-linguistic concept of a unity of characters, time and space – or the concept of situation (Krautter et al., 2018; Mikhalkova et al., 2019).

4. Model annotation schemes

In designing the model for manual annotation of the May68 Corpus, we relied on familiarity with the texts contained in the corpus and on several other well-known models of manual annotation for similar projects, three of which are presented below.

4.1. COST Action (“Distant reading” project)

The Distant Reading project for the annotation of the multilingual ELTeC corpus (https://www.distant-reading.net/eltec/),2 based on the European novel, provides the following distinct categories: “demonyms (DEMO), professions and titles (ROLE), works of art (WORK), person names (PERS), places (LOC), events (EVENT), organizations (ORG)” (for a brief description of the categories cf. Frontini, 2020). The selection of these categories was partly motivated by the existing possibilities of automated NER, which brings with it certain limitations (Stanković et al., 2019). The project also points out the importance of “cultural references, role models and cosmopolitanism”, and these can only be answered “if references to works of art, authors, folklore and periodical publications are detected”, which is why in our corpus of

2 The Distant Reading for European Literary History (COST Action CA16204) started in 2017 with the goal of using computational methods of analysis for large collections of literary texts. It is based on the compilation and analysis of a multilingual open source collection, named the European Literary Text Collection (ELTeC).
modernist literary texts we introduced a BIBLIO group to incorporate references to authors, but covered other listed types of references with the “other” group (NAME / XXX). In the May68 Corpus, however, we focus for now on proper names.

4.2. CLARIN.SI

The annotation scheme adopted largely follows the guidelines provided for Slovenian in the past (e.g. Štajner et al., 2013), perhaps closest in its granularity to the Janes-NER guidelines (CLARIN.SI) as described by Zupan et al. (2017), except for the derived adjectives (DERIV-PER) type, which is given an independent status there, unlike in the May68 Corpus, where it is subsumed under the PER-LIT and PER-REAL subtypes.3 In addition, we decided in the case of the May68 Corpus to conceptualize combinations of nouns denoting professions, functions or titles, and personal names as units, therefore labelling the entire strings as literary personal name (PER-LIT) or real personal name (PER-REAL).

4.3. Annotation schemes for Czech language

Annotation of NEs in Czech corpora is implemented according to more complex models, as described in Sevščíková et al. (2007). Our three-level NE taxonomy is, nonetheless, somewhat less fine-grained. Furthermore, unlike the Czech model, ours does not include numbers, such as in addresses, zip codes, or phone numbers, specific number usages and quantitative expressions – entities typically included in NER.

5. May68 Corpus of Slovenian modernist literary texts – corpus description

The Maj68 Corpus is a result of a project on the literature of the avant-garde and modernism in the period of the worldwide student movement, whose activities are also reflected in the transformation of literature. The student journals Tribuna and Problemi, from which the texts for the corpus were selected, played an important role in the theoretical and literary-artistic innovations of the Slovenian student movement. The Maj68 Corpus 1.0 contains 1,521 texts by 198 known authors published between 1964 and 1972 in the Slovenian periodicals Tribuna, Problemi and Problemi.Literatura. The Maj68 Corpus 2.0 version, which has been further edited and corrected (metadata), contains 647 additional texts from Tribuna and Problemi.

The compilation of the corpus began with an extensive bibliographic inventory of texts in selected publications that have been digitized and are publicly available on dLib. On the basis of these lists, the original texts of Slovenian authors were converted from .pdf format to .docx format and, in a second phase, linked to metadata in Excel spreadsheets. Finally, the corpus was automatically tagged (see Juvan et al. 2021 for more details on the procedure). The texts contain complete bibliographic data and are classified by text and language type, degree of presence of non-standard Slovenian, foreign languages, modernism, and visual elements. Author details, i.e., gender and year of birth, are included with the texts. The presence of visual elements is also marked in the corpus; 48 texts consist only of visual elements, i.e. they do not contain standard text.

Automatic linguistic annotation includes lemmas, morpho-syntactic descriptions from MULTEXT-East, and morphological features and syntactic annotations from Universal Dependencies. As shown here, manually tagged NEs for persons, geographical locations, organizations, and various names, (foreign) linguistic variations and registers, and cited authors (sources) are additionally marked.

The following sections and subsections introduce the types and categories of NEs, including the dilemmas encountered in the process of annotation and the practical reasons for annotation. From here on, and with a somewhat narrower notion of NER, we speak of categories of “proper names (personal and place names)” rather than “named entities” for the purposes of this paper.

5.1. Annotation procedure and categories

The annotation was implemented using the WebAnno tool (Eckart de Castilho et al., 2016). To simplify the technical aspect, the whole corpus was divided into 1529 sections of five sentences each, on average 380 chunks per section. WebAnno allows annotation of one sentence at a time, which was a disadvantage for longer instances of text marked by the use of foreign language(s). Each annotation round was curated by two curators.4 However, reiterative annotation was not foreseen, since the primary goal at this stage was not to improve automatic annotation, but to manually annotate the specialized corpus for optimal corpus analysis and stylistic studies.
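The five-sentence sectioning itself is straightforward to reproduce; the following is a minimal sketch, assuming the corpus is already segmented into sentences:

    # Split a sentence list into consecutive five-sentence sections, as was
    # done before importing the corpus into WebAnno.
    def chunk(sentences, size=5):
        return [sentences[i:i + size] for i in range(0, len(sentences), size)]

    sections = chunk([f"Sentence {n}." for n in range(23)])
    print(len(sections))  # 5 sections; the last one holds the remainder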
There is no universally accepted taxonomy for NEs, except for some coarse-grained categories (people, places, organizations). Since we are interested in a semantically oriented annotation and prefer more informative (fine-grained) categories, we opted for a three-level NE classification as shown in Table 1 (cf. Sevščíková et al., 2007). The first level in our annotation model corresponds to the three basic groups: 1. Proper names, 2. Foreign language and register variations, and 3. Cited authors. These groups are labelled as 1. NAME, 2. FOREIGN, and 3. BIBLIO respectively, with the first two further subdivided. The second and third levels provide a more detailed semantic classification.

The NAME group includes the following types and subtypes:
- Person (PER), including the person-derived adjective, is subdivided into fictional literary characters (PER-LIT), characters referring to real, i.e., existing and historical or mythological, persons or beings (PER-REAL), literary characters bearing a descriptive name (PER-DES), and members of national and social groups (PER-GROUP).
- Geographical location (GEO) is divided into locations in Slovenia (GEO-SI), in former Yugoslavia (GEO-YU), in Europe (GEO-EU), and in others (GEO-ZZ).
- Organizations and institutions (ORG).
- Miscellaneous (XXX).
A group labelled FOREIGN is used to annotate foreign languages: Serbo-Croatian (SBH), English (EN),
3 Overall, and in the same fashion, in the May68 Corpus we also favour larger lexical units.
4 The texts were annotated by A. Jarc, L. Mandić, and K. Žvanut in accordance with the annotation scheme designed by the authors of this paper, who also curated all of the annotations.
French (FR), Italian (IT), Latin (LA), and German (GE), or register variation (DIALECT, INFORMAL, SLANG) in the corpus.

Once the annotation process was completed, the labels in WebAnno were converted to TEI encoding.5 Following the conversion, all proper names (personal names, place names, names of organizations, and real names) are thus labelled with <name>, then divided into types with the @person, @geo, @misc, @personGrp, and @org attributes, with three subtypes for literary characters (@literary, @descriptive, @real) and, for geographical names, the subtypes @SI, @EU, @ZZ and @YU. Units of text with foreign languages and non-standard Slovenian were labelled as <foreign>, with corresponding attributes, according to TEI encoding.
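As an illustration of this conversion step (a sketch, not the project's actual converter), a curated label can be rendered as a TEI element along the lines described above:

    # A sketch of rendering one curated WebAnno span as TEI; the element
    # and attribute names follow the description above.
    import xml.etree.ElementTree as ET

    def to_tei_name(text, name_type, subtype=None):
        el = ET.Element("name", type=name_type)
        if subtype is not None:
            el.set("subtype", subtype)
        el.text = text
        return el

    el = to_tei_name("Ančika", "person", "literary")
    print(ET.tostring(el, encoding="unicode"))
    # -> <name type="person" subtype="literary">Ančika</name>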
Group NAME
  Type PERSON (PER)
    PER-REAL: Real – characters referring to real, i.e. existing and historical or mythological, persons or beings (web sources, Wikipedia, etc.), e.g. Greta Garbo
    PER-LIT: Literary – fictional literary characters, e.g. Ančika, Zobec
    PER-DES: Descriptive – literary characters that carry a descriptive name (e.g. dolgolasec, Eng. the long-haired guy)
    PER-GRP: Group – members of national and social groups, e.g. Kranjci, Slovenec, Američan
  Type GEO
    GEO-SI: Slovenia, e.g. Ljubljana
    GEO-YU: former Yugoslavia (except for Slovenia), e.g. Zagreb
    GEO-EU: Europe, e.g. Frankfurt
    GEO-ZZ: other, e.g. Peking
  Type ORG: names of organizations and institutions (Klub nepismenih, Slovenska matica, Državna varnost)
  Type XXX: common proper nouns, including titles of books and other art works, artefacts, etc., e.g. Rdeča kapica, Empire State Building
Group FOREIGN
  HBS: Serbo-Croatian; EN: English; DE: German; FR: French; IT: Italian; LA: Latin; XX: other; DIALECT: dialect; VERNACULAR: vernacular; SLANG: slang
Group BIBLIO
  Quoted authors (Sources)

Table 1: The main categories of the May68 annotation scheme (WebAnno).
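Written out as a plain mapping, the three-level scheme of Table 1 can also serve as a simple label validator during curation; the following is a sketch, not project code:

    # The May68 scheme of Table 1 as a group -> type -> subtypes mapping.
    SCHEME = {
        "NAME": {
            "PER": ["PER-REAL", "PER-LIT", "PER-DES",
                    "PER-GRP"],  # the running text treats PER-GRP as its own type
            "GEO": ["GEO-SI", "GEO-YU", "GEO-EU", "GEO-ZZ"],
            "ORG": [],
            "XXX": [],
        },
        "FOREIGN": {t: [] for t in ["HBS", "EN", "DE", "FR", "IT", "LA",
                                    "XX", "DIALECT", "VERNACULAR", "SLANG"]},
        "BIBLIO": {},
    }

    def valid(group, typ, subtype=None):
        types = SCHEME.get(group, {})
        return typ in types and (subtype is None or subtype in types[typ])

    assert valid("NAME", "GEO", "GEO-SI")
    assert not valid("NAME", "GEO", "GEO-XX")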
PER-REAL denotes both real, i.e. existing, persons and historical or mythological figures that are basically identifiable in encyclopaedic sources such as online lexicons of proper names, Wikipedia and the like. URL is an additional attribute of the NAME group and is given as a relevant source of information, e.g., a website, for a group of people appearing in the literary text. The assignment of a URL depends on context or extra-linguistic knowledge; if a person can be assumed to be part of common (cultural) knowledge (Descartes, Nietzsche), we do not enrich the corpus with encyclopaedic data.

All standard personal proper names are labelled as NAME and assigned to one of the closed subtypes. The label PER-GRP with no subtype is assigned to members of a particular social group, most often a nationality (Slovenec) or regional identity (Kranjci, Štajerci; Novakovi), but also smaller social groups defined on the basis of occupational or other criteria.

Of the categories introduced specifically for the purposes of the May68 Corpus, NAME / PER-DES proved, as expected, to be the most challenging subcategory (see 6.1.1).

Given their statistical importance in the context of NER, the same annotation rules apply here as for characters in plays when they do not require special treatment with respect to their function. The labelling of proper names in plays depends on the status and/or function of the proper name. Names of individual characters that merely announce an individual character’s speech, i.e. his/her lines of dialogue, have not been annotated, while names in descriptions of their physical actions or behaviour are treated as ordinary proper names on the model of “sb does sth” etc. (Pandolfo se ogleduje v zrcalu / Pandolfo looks at himself in the mirror). Below is an example of a dialogue showing the distinction between the two and a third subtype (the names in bold are labelled as PER-LIT, PER-DES and PER-REAL respectively):
5 The annotation task was carried out in collaboration with T. Erjavec (technical aspects and data conversion).
BARRÈRE: Potemtakem moramo danes z njim obračunati. (Tallien odide) (Davidu): Si pripravljen s Krepostnim umreti?
DAVID: V smrt?
BARRÈRE: Se nisi maločas naglas pridušil?
DAVID: Čudovit črtež sem zamislil. Kako dviga Sokrates čašo strupa k ustom. Naš dobri prijatelj je tako presunljivo govoril.

Adjectives derived from personal proper nouns are annotated as the corresponding proper nouns. Their derived character is revealed by morpho-syntactic tagging.

5.1.2. Geographical location (GEO)

Place names are labelled as NAME with the following closed-set subtypes: SI, YU, EU, ZZ, depending on whether the location is in Slovenia, in the former Yugoslav republics, in the rest of Europe, or outside all of these areas. As with personal names, a distinction is made between real and fictitious geographical names (Indija vs. Eldorado). Annotators decide whether a place is real or fictitious (such as street names in a fictitious city) based on context and common knowledge. Places typically include continents, countries, regions, cities, towns, and natural geographical objects, as well as streets, squares, and neighbourhoods, and functional infrastructure such as churches, airports, and local cultural and natural sites. Place names used metaphorically, e.g. Eden, are categorized as “other” and assigned the label NAME / GEO-ZZ – the same label as is used for place names outside the European territory. At this stage, we have not paid special attention to the treatment of proper names (personification) used metaphorically, such as: Jadra so pogorela, Delfi molčijo … [The sails have burnt down, and Delfi stays silent …] This type of analysis is planned for the later stages of annotation (which will include the annotation of metaphors).

Adjectives derived from place names, e.g. African, European, were included in the annotation by analogy with geographical names and divided into the same subtypes (SLO, YU, EU, ZZ).

5.1.3. Organizations and common proper nouns

As with geographical names, there are no subgroups for the two groups of so-called common proper names and names for organizations. Capitalization is an obvious but not a necessary condition for this classification. Thus, no distinction is made here between real and fictitious; what matters is that the name be recognized as “common proper” in the literary context of the text.

Organizations and institutions subsume names of museums and other cultural institutions, as well as political and civic organizations. Organizations are labelled as ORG and usually include businesses, institutes, media, and cultural and educational institutions. However, we have treated restaurants, music groups, and other “entertainment” establishments as “miscellaneous” rather than organizations. Miscellaneous is a category reserved mainly for common proper nouns, as explained above, such as titles of books and other works of art, artefacts, films, documents, brand names, commercial products, and events, as well as place names such as mythological places and place names used metaphorically, etc. These NEs are labelled as XXX.

For many common nouns, one can observe a transition to the category of proper names, which seems to exist as a continuum. For example, the word krčma (Eng. inn, pub) may assume the function of a proper noun referring exclusively to a particular unit/object, in this case “the inn”. The word is then referred to as NAME / XXX.

5.1.4. BIBLIO

BIBLIO is typically used for literary works cited or mentioned in the literary texts. It contains text passages that refer to literary works or other bibliographic units, and is annotated for authors, not titles or citations, e.g.:
The patamus can never reach The mango on the mango tree (T. S. Eliot: The Hippopotamus)

5.1.5. Language and register

In the case of language and register variation, we use the FOREIGN group, which subsumes (foreign) language and register variation (see Table 1). This group is not directly relevant to this paper.

6. Dilemmas of annotation in the framework of representational problems

A number of dilemmas are discussed here in terms of the three categories – ambiguity, variation, and uncertainty – as detailed, for example, in Beck et al. (2020), who outline the main representational problems in linguistic annotation (we disregard the two additional categories addressed in the model: error and bias). The interpretation of the listed categories is tailored to the nature of our data, and the problems are assigned to the listed categories accordingly. The annotation process is consistently guided by the identified function of the annotated elements. The three dilemmas are described below.

6.1. Ambiguity

In principle, ambiguity occurs whenever a unit admits several interpretations. Ambiguities between form and meaning occur in natural language at the phonological, morpho-syntactic, lexical, or pragmatic levels and are a major source of representational problems (Beck et al., 2020).

6.1.1. Transition from personal proper names to “common proper nouns”

The most striking example of ambiguity is the transition from common nouns to those that function as personal names. This is a pervasive and rather complex representational problem. The dilemma concerns the category NAME / PER-DES, i.e., descriptive names of literary characters, especially in relation to the category NAME / PER-LIT, which refers to standard proper names that are recognizable as such because of their form and conventional properties (e.g., capitalization). This group includes examples where common nouns optionally combine with proper names to refer to individual characters, like “inšpektor (Kos)” [inspector (Kos)] or “veteran” [the veteran], including capitalized adjectival derivatives, such as “Brezposelni” [The jobless one], functioning as personal names, etc.

However, capitalization is not a necessary condition for the NAME / PER-DES designation, especially in a corpus of modernist texts that frequently employ modernist and/or
idiosyncratic conventions, with orthographic rules applied to proper names or descriptive linguistic units that typically eschew capitalization (e.g., “fant” [the boy], “starka” [the old woman]). A key feature of proper names, as it turns out, is “descriptive continuity,” which shows that there is no clear boundary between what can be considered a standard proper name (which is traditionally subsumed under onomastics) and what can be understood as an instance of a text that performs the function of a proper name but does not, strictly speaking, qualify as such.

The assignment of a noun to NAME / PER-DES is decided primarily on the basis of context. Often, a lexical unit (word or phrase) is used to describe a particular property of the character to which the proper noun initially refers, and is then gradually but clearly transformed into a (descriptive) unit that functions as a proper name (whether capitalized or not), such as “Rdečelasi” [The red-haired one]. The descriptive name is used only when the transition is complete, which must be evident from the broader context. The quantitative criterion (in longer texts) is a minimum of three occurrences of the same designation, such as below:
Videl je same znane obraze — inšpektorja Kosa, vratarja Žorža, kurirja Enorokega, Žana, nekoliko v ozadju pa je stal bledi Novinec [the (pale) new guy], …

Other examples include dolgolasec [the long-haired guy], mladenič [young man], mojster [the master], debelušček [the fatty], and typically correspond to phrases introduced with a definite article in English. In principle, PER-DES is not limited to a maximum number of components, but the likelihood that a lengthy description, such as Zagledal je na tleh sedečega fanta upadlih lic in kuštravih las [He saw a boy with skinny cheeks and messy hair sitting on the floor], should appear at least three times in the text(s) is minimal. Even if descriptive units tend to recur, they normally vary in at least one of their elements.

Capitalization itself does not preclude a lexical unit from being labelled PER-DES, as with Mož brez imena [the Nameless Man]. Appellatives, nicknames, and pseudonyms are labelled as ordinary personal proper names (NAME / PER-LIT), except for those expressing description, such as Dolgi Džon [John the Longish].

6.1.2. Nesting

Another example of ambiguity concerns nesting, which often creates additional annotation problems. Instead of a potential two- (or three-)level nesting model, single-level nesting is used throughout, taking as the basic annotated unit the largest possible lexical unit, typically a geographical name or the name of an organization composed of one or more proper names: in the case of Državna založba Slovenije [National Publishing House of Slovenia], the entire unit is labelled as an organization (ORG) and the proper name Slovenije is not nested and labelled on its own as a place name (Slovenija); the same goes for Društvo novinarjev Slovenije [Journalists’ Association of Slovenia], Prešernova družba [Prešeren’s Society Publishing], and Direkcija za prehrano Beograd [Belgrade Food Agency]. Similarly, Fani is NOT nested in gospodična Fani; the whole is treated as a single-level personal proper name. A general dilemma often arises here as to whether the term should be referred to as a proper name or as a common noun.

6.2. Variation

In variation, the same content or value is expressed by multiple, interchangeable variants (Lüdeling, 2017). Variation can be due to extra-linguistic factors, such as the time period, genre, or author/speaker of the text, or to linguistic conventions. Like ambiguity, variation is an inherent part of natural language and thus of corpus data.

Indirectly related to variation is the case of ambiguity described above in 6.1.1: the descriptive name is not necessarily used exclusively for one and the same literary character; on the contrary, it usually alternates with the character’s actual proper name. Alternation in the mention of literary characters is very common; in fact, it is the rule. Some personal proper names (including their descriptive variants) occur as variants preceded by an attributive noun (always the same), usually referring to their professional or social status (e.g., Inspector Kos). When this type of designation is used consistently, we refer to the entire lexical unit as NAME / PER-LIT, but when the attributive noun (Inspector) becomes an independent descriptive variant, we refer to it as NAME / PER-DES.

Descriptive terms NAME / PER-DES may consist of one or more words; they may be a combination of “object nouns” and standard proper names (inšpektor Kos) or of two or more “common nouns” (kurir Enoroki), regardless of their capitalization, as long as they function as personal proper names when referring to or naming characters. The same character may be referred to by three, four, or more variants – in our case: inspector Kos, inspector, or Kos. Also treated as single variants are lexical units denoting proper names whose capitalization varies, e.g., Ministrstvo za kulturo Republike Slovenije vs. ministrstvo za kulturo (Ministry of Culture) and Zveza borcev vs. zveza borcev (Association of Freedom Fighters).

We are aware that when variants are expressed as a single interpretation, the property of variation as a whole is lost. However, a semantic annotation based on the function of linguistic elements is less prone to structural diversity than, for example, spelling variations in historical texts that reflect dialectal and/or temporal differences (cf. Beck et al., 2020), which is why, apart from our own specific research goals, we did not choose to preserve (proper name) variations.
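The character-variant linking that this alternation calls for (and that the conclusions name as future work) could start from a hand-curated variant table; the following is a hypothetical sketch, not part of the annotation pipeline described here:

    # A sketch of linking surface variants to one character; the table is
    # hypothetical and would in practice be curated by hand.
    VARIANTS = {
        "inšpektor Kos": "KOS",
        "inšpektor": "KOS",
        "Kos": "KOS",
        "bledi Novinec": "NOVINEC",
        "Novinec": "NOVINEC",
    }

    def character_id(mention):
        """Return the character identifier for a mention, if known."""
        return VARIANTS.get(mention)

    print(character_id("Kos"), character_id("bledi Novinec"))  # KOS NOVINEC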
6.3. Uncertainty

Uncertainty arises whenever there are multiple possible interpretations of data, but the relevant or reliable knowledge to make an informed decision about the interpretation is not available (see Bonnea et al., 2014, in Beck et al., 2020). Most examples involve the inability to distinguish between the subtypes PER-REAL and PER-LIT in texts that do not provide sufficient clues to the “origin” of the character, although this seems to be rather rare.

In such cases, manual annotation provides the opportunity for discussion and collective decision, which we see as an advantage, since cases where the uncertainty (or ambiguity) cannot be resolved are reduced to the absolute minimum, for example:
Maruška – [PER-REAL, author’s wife] peče domači kruh …
Milenko, Andraž, Marko, David – [PER-REAL, members of the OHO Slovenian art group: Milenko Matanović, Andraž
Šalamun, Marko Pogačnik, David Nez – established on the basis of extra-textual knowledge].

7. Preliminary results

Apart from the problems encountered in the annotation itself, the preliminary research results of the annotated corpus can also contribute to the study of characters in a selected corpus of literary texts. Based on the query and the results in NoSketchEngine, Figure 1 shows the quantitative relationship between the three subtypes of the type PERSON (literary names, descriptive names, and names of characters from the non-literary world). It can be seen that the majority are literary names (PER-LIT, 68 per cent), whose predominance was to be expected, followed quantitatively by descriptive names (PER-DES, 18 per cent) and then by names of characters from the non-literary world (PER-REAL, 14 per cent).

[Figure 1: pie chart – PER-LIT 68 %, PER-DES 18 %, PER-REAL 14 %]
Figure 1: The ratio between the subtypes literary, descriptive and real of the PERSON type.
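Ratios such as those in Figure 1 can also be recomputed directly from the TEI encoding described in Section 5.1; the sketch below assumes a placeholder file name and considers only the person subtypes:

    # A sketch: the file name is a placeholder; TEI elements live in the
    # TEI namespace.
    import xml.etree.ElementTree as ET
    from collections import Counter

    NS = "{http://www.tei-c.org/ns/1.0}"
    root = ET.parse("maj68-sample.xml").getroot()
    counts = Counter(
        el.get("subtype")
        for el in root.iter(NS + "name")
        if el.get("type") == "person"
    )
    total = sum(counts.values())
    for subtype, n in counts.most_common():
        print(f"{subtype}: {100 * n / total:.0f} %")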
30%
20%
28 %
24 %
10%
18 %
18 %
7.1. Categories of descriptive names and real
0%
names
AUTHORS
AUTHORS
AUTHORS
AUTHORS
AUTHORS
AUTHORS
(MEN)
(WOMEN)
(MEN)
(WOMEN)
(MEN)
(WOMEN)
Using the lists of the three types of personal names, we
PER-LIT
PER-DES
PER-REAL
can create an approximate typology of character names
MALE CHARACTERS
FEMALE CHARACTERS
according to the given typologies and evaluate the
consistency of labelling. Because of their special
characteristics, we limit ourselves to the subtypes
Figure 2: The quantitative relationship between male and female
descriptive and real, leaving aside the subtype literary,
characters in the May68 Corpus.
which includes mostly “ordinary” personal names.
Descriptive names are most often occupational (e.g.,
The results confirm findings from other research (cf.
chief, inspector, captain, mayor; foreman, waitress,
Nagaraj and Kejriwal, 2022) that the proportion of male
secretary, lab assistant); second are names expressing
characters is significantly higher than that of women.
physical characteristics (e.g., one-armed, long-haired, “the
We supplement this account by comparing male and
one with the moustache” the handicapped), followed by
female characters by author gender, which gives a very
names describing character (e.g., bully, beast, monster),
disproportionate picture: Metadata analysis has shown the
beast, bloodthirsty),family relations (e.g., aunt, uncle,
predominance of male authorship in the corpus (81 per
godmother), generational affiliation (e.g., old man, young
cent) - only 7 per cent of authors are women, and there are
man), while longer descriptive lexical strings are rarer (man
no data for the remaining 12 per cent (Juvan, et al., 2021).
with no name, brother in Christ, the long-haired one).
If we start from the gender of the authors when
Among the names for women, forms that formally express
analyzing the occurrence of male and female characters, we
possession but function as gendered common proper names
find (see Figure 3) that in the works by men, male
are frequent in Slovenian (e.g. Tomaž’s (one), the
manager’s wife). This is statistically almost as
characters outnumber female characters by 44 per cent in
significant
the subcategory literary names, while this difference is
as feminine names for occupations.
much smaller in the works by women (12 per cent). In the
As can be seen from the annotated corpus, we identify
category descriptive names, this ratio is difficult to assess
five subcategories and include them in the subtype for real
due to the low occurrence among women authors, but a
persons: 1. Real persons from social (Brutus, Lenin, Kidrič)
and cultural history (Prešeren, Heidegger, Descartes,
large difference between female and male characters in
men authors goes in favour of the latter.
[Figure 3: bar chart – male vs. female characters (token counts, 0–8000) for PER-LIT, PER-DES and PER-REAL by author gender]
Figure 3: Male and female characters according to the gender of the authors.

In the subcategory real, there is no significant difference in terms of author gender, which is probably due to the actual and undisputed presence of men and women in social and cultural history.

8. Conclusions and open challenges

The main goal of our annotation task was to provide an adequate representation of a specific set of semantic data (= Named Entities) and to fully exploit the potential of this type of corpus linguistic data in the context of future literary and linguistic analyses. To this end, we implemented a three-level annotation process. We conclude, on the basis of the high variation in referential expressions, that in potential future projects an additional step should be linking the different names of the same character.

In the present work, we sought to identify and interpret different types of representational problems based on the model proposed by Beck et al. (2020) in order to improve our understanding of the linguistic and extra-linguistic properties of the texts in a (literary) corpus. It is hoped that this will lead to a more nuanced understanding of the challenges of NER, and that this in turn may inform future resources in ways that are more appropriate to the data they represent.

In the next phases of annotation, we plan to improve the segments that have the lowest level of consistency and agreement among annotators, such as common nouns that perform the referential function of proper names, seemingly operating as a representational continuum. We have yet to work out the best approach to fully incorporate the various instances of PER-DES in the annotation scheme, but these are certainly worth considering as a special (sub)category of the NAME group.

9. Acknowledgements

ARRS (Slovenian Research Agency) J6-9384 “Maj 68 v literaturi in teoriji (May '68 in Literature and Theory)”.

10. References

Christin Beck, Hannah Booth, Mennatallah El-Assady, and Miriam Butt. 2020. Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias. In: The 14th Linguistic Annotation Workshop, pages 60–73, Barcelona, Spain, December 12, 2020.
Julian Brooke, Timothy Baldwin, and Adam Hammond. 2016. Bootstrapped Text-level Named Entity Recognition for Literature. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 344–350, Berlin, Germany, August 7–12.
John Burrows. 2004. Textual analysis. In: S. Schreibman, Ray Siemens, and John Unsworth, eds., A Companion to Digital Humanities. Blackwell, Oxford.
José Calvo Tello. 2021. The Novel in the Spanish Silver Age: A Digital Analysis of Genre Using Machine Learning. Bielefeld University Press, Bielefeld.
Karina van Dalen-Oskam. 2022. Distant Dreaming About European Literary History. Evening keynote at the Distant Reading Closing Conference. https://www.distant-reading.net/events/conference-programme/
Jesse de Does, Katrien Depuydt, Karina van Dalen-Oskam, and Maarten Marx. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: J. Odijk and A. van Hessen, eds., CLARIN in the Low Countries, pages 361–70. Ubiquity Press. https://www.ubiquitypress.com/site/chapters/10.5334/bbi.30/download/1046/
Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 76–84, Osaka, Japan. The COLING 2016 Organizing Committee.
David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting Social Networks from Literary Fiction. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138–147, Uppsala, Sweden. Association for Computational Linguistics.
Tomaž Erjavec, Peter Holozan, and Nikola Ljubešić. 2015. Jezikovne tehnologije in zapis korpusa [Language technologies and corpus encoding]. In: V. Gorjanc, P. Gantar, I. Kosem and S. Krek, eds., Slovar sodobne slovenščine: problemi in rešitve, pages 262–76. Znanstvena založba Filozofske fakultete, Ljubljana.
Francesca Frontini, Carmen Brando, Joanna Byszuk, Ioana Galleron, Diana Santos, and Ranka Stanković. 2020. Named Entity Recognition for Distant Reading in ELTeC. In: CLARIN Annual Conference 2020, Oct 2020, pages 37–41, Virtual Event, France.
Marko Juvan, Andrejka Žejn, Mojca Šorli, Lucija Mandić, Andrej Tomažin, Andraž Jež, Varja Balžalorsky Antić, and Tomaž Erjavec. 2022. Corpus of 1968 Slovenian literature Maj68 2.0, ZRC SAZU. http://hdl.handle.net/11356/1430
Marko Juvan, Mojca Šorli, and Andrejka Žejn. 2021. Interpretiranje literature v zmanjšanem merilu: »Oddaljeno branje« korpusa »dolgega leta 1968« [Interpreting literature at a reduced scale: “distant reading” of the corpus of “the long year 1968”]. Jezik in slovstvo, 66(4):55–76.
Nora Ketschik, André Blessing, Sandra Murr, Maximilian Overbeck, and Axel Pichler. 2020. Interdisziplinäre Annotation von Entitätenreferenzen. Von fachspezifischen Fragestellungen zur einheitlichen methodischen Umsetzung [Interdisciplinary annotation of entity references: from discipline-specific research questions to a unified methodological implementation]. In: N. Reiter, A. Pichler, and J. Kuhn, eds., Reflektierte Algorithmische Textanalyse.
PRISPEVKI
194
PAPERS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
Interdisziplinäre(s) Arbeiten in der CRETA-Werkstatt,
In: Proceedings of the Eighth Language Technologies
pages 203–36, Berlin.
Conference,
October
8th-12th,
2012,
Ljubljana,
Benjamin Krautter, Janis Pagel, Nils Reiter, and Marcus
Slovenia: proceedings of the 15th International
Willand. 2018. In: T. Weitin, ed., Eponymous Heroes
Multiconference Information Society - IS 2012, volume
and Protagonists – Character Classification in
C, pages 191–96, Ljubljana, Institut Jožef Stefan.
German-Language Dramas. LitLab. Pamphlet # 7.
Katja Zupan, Nikola Ljubešić, and Tomaž Erjavec. 2017.
Silke Lahn, and Jan Christoph Meister. 2016. Einführung
Annotation guidelines for Slovenian named entities:
in die Erzähltextanalyse. Stuttgart, Metzler.
Janes-NER. Technical report, Jožef Stefan Institute,
Nikola Ljubešić, Marija Stupar, and Tereza Jurič. 2012.
September.
Building Named Entity Recognition Models For
https://www.clarin.si/repository/xmlui/bitstream/handle
Croatian And Slovene. In: T. Erjavec, and J. Žganec
/11356/1123/SlovenianNER-eng-v1.1.pdf.
Gros, eds., Proceedings of the Eighth Language
Technologies Conference, October 8th-12th, 2012,
Ljubljana,
Slovenia:
proceedings
of
the
15th
International Multiconference Information Society - IS
2012, volume C, pages 129–34. Ljubljana, Institut Jožef
Stefan.
Anke Lüdeling. 2017. Variationistische Korpusstudien. In:
M. Konopka, and A. Wöllstein, eds., Grammatische
Variation. Empirische Zugänge und theoretische
Modellierung. IDS Jahrbuch 2016, pages 129– 144. de
Gruyter, Berlin.
Elena V. Mikhalkova, Timofei Protasov, Anastasiia
Drozdova, Anastasiia Bashmakova, and Polina Gavin.
2019. Towards annotation of text worlds in a literary
work. In: Computational Linguistics and Intellectual
Technologies. Papers from the Annual International
Conference “Dialogue” , pages 101–10. Issue 18,
Supplementary Volume 18.
Franco Moretti . 2011. Network Theory , Plot Analysis . New Left Review, 68:80–102.
Akarsh Nagaraj, and Mayank Kejriwal. 2022. Robust
Quantification of Gender Disparity in Pre-Modern
English Literature using Natural Language Processing.
arXiv:2204.05872v1 [cs.CY] 12 Apr 2022.
Sean Papay, and Sebastian Padó. 2020. RiQuA: A Corpus
of Rich Quotation Annotation for English Literary Text.
In Proceedings of the 12th Language Resources and
Evaluation Conference, pages 835–841, Marseille,
France. European Language Resources Association.
Janis Pagel, Nils Reiter, Ina Rösiger, and Sarah Schulz.
2020. Annotation als flexibel einsetzbare Methode. In:
N. Reiter, A. Pichler, and J. Kuhn, eds., Reflektierte
Algorithmische
Textanalyse.
Interdisziplinäre(s)
Arbeiten in der CRETA-Werkstatt, pages 125 – 142.
Berlin.
Ranka Stanković, Diana Santos, Francesca Frontini, Tomaž
Erjavec, and Carmen Brando. 2019. Named Entity
Recognition for Distant Reading in Several Languages.
In: G. Pálko, ed., DH_Budapest_2019. Budapest, ELTE.
http://elte-dh.hu/dh_budapest_2019-abstract-booklet/
Magda Ševčíková, Zdeněk Žabokrtský, and Oldřich Krůza.
2007. Named Entities in Czech: Annotating Data and
Developing NE Tagger. In: V. Matoušek, P. Mautner
eds., Text, Speech and Dialogue: 10th International
Conference, TSD 2007, Pilsen, Czech Republic,
September 3–7, 2007. Proceedings. Berlin – Heidelberg,
Springer-Verlag.
https://ufal.mff.cuni.cz/~zabokrtsky/publications/papers
/tsd07-namedent.pdf
Tadej Štajner, Tomaž Erjavec, and Simon Krek. 2013.
Razpoznavanje imenskih entitet v slovenskem besedilu.
A Transformer-based Sequence-labeling Approach to the Slovenian Cross-domain Automatic Term Extraction
Thi Hong Hanh Tran∗†, Matej Martinc†, Andraž Repar†, Antoine Doucet‡, Senja Pollak†
∗Jožef Stefan International Postgraduate School,
Jamova cesta 39, 1000 Ljubljana, Slovenia
† Jožef Stefan Institute,
Jamova cesta 39, 1000 Ljubljana, Slovenia
‡ University of La Rochelle,
23 Av. Albert Einstein, La Rochelle, France
Abstract
Automatic term extraction (ATE) is a popular research task that eases the time and effort of manually identifying terms in domain-specific corpora by providing a list of candidate terms. In this paper, we treat terminology extraction as a sequence-labeling task and experiment with the Transformer-based model XLM-RoBERTa to evaluate the performance of multilingual pretrained language models in the cross-domain sequence-labeling setting. The experiments are conducted on the RSDO5 corpus, a Slovenian dataset containing texts from four domains: Biomechanics, Chemistry, Veterinary, and Linguistics. We show that our approach outperforms the Slovene state-of-the-art approach, achieving significant improvements in F1-score of up to 40 percentage points. This indicates that applying multilingual pretrained language models to ATE in less-resourced European languages is a promising direction for further development. Our code is publicly available at https://github.com/honghanhh/sdjt-ate.
1. Introduction
Terms are single- or multi-word expressions denoting concepts from specific subject fields whose meaning may differ from that of the same set of words in other contexts or in everyday language. They represent units of knowledge in a specific field of expertise, and term extraction is useful for several terminographical tasks performed by linguists (e.g., the construction of specialized term dictionaries). Most of these tasks are time- and labor-demanding, so recently several automatic term extraction approaches have been proposed to speed up the process.

Term extraction can also support and improve several complex downstream natural language processing (NLP) tasks. The broad range of downstream NLP tasks that term extraction could benefit includes, for example, glossary construction (Maldonado and Lewis, 2016), topic detection (El-Kishky et al., 2014), machine translation (Wolf et al., 2011), text summarization (Litvak and Last, 2008), information retrieval (Lingpeng et al., 2005), ontology engineering and learning (Biemann and Mehler, 2014), business intelligence retrieval (Saggion et al., 2007; Palomino et al., 2013), knowledge visualization (Blei and Lafferty, 2009), specialized dictionary creation (Le Serrec et al., 2010), sentiment analysis (Pavlopoulos and Androutsopoulos, 2014), and cold-start knowledge base population (Ellis et al., 2015), to cite a few.

In an attempt to ease the time and effort needed to manually identify terms in domain-specific corpora, automatic term extraction (ATE), also known as automatic term recognition (Kageura and Umino, 1996) or automatic term detection (Castellví et al., 2001), has thus become an essential NLP task. However, despite the importance of term extraction and the research attention paid to the task, identifying the correct terms remains a notoriously challenging problem with several unsolved hurdles. First, despite the several different definitions proposed for the meaning of a term, the explicit distinction between terms and common words is in many cases still unclear. In addition, the characteristics of specific terms can vary significantly across domains and languages. Furthermore, gold standard term lists and manually labeled domain-specific corpora for the training and evaluation of ATE approaches are generally scarce for less-resourced languages, including Slovenian, due to the large amount of work required for the construction of these resources.

Deep neural approaches to ATE have been proposed only recently, but their evaluation on less-resourced languages has not yet been sufficiently explored and remains a research gap worth investigating. Inspired by the success of Transformer-based models in ATE on the ACTER dataset of the recent TermEval 2020 competition (Hazem et al., 2020; Lang et al., 2021), we propose to exploit and explore the performance of the XLM-RoBERTa pretrained language model (Conneau et al., 2019), addressing ATE as a sequence-labeling task. Sequence-labeling approaches have been successfully applied to a range of NLP tasks, including Named Entity Recognition (Lample et al., 2016; Tran et al., 2021) and Keyword Extraction (Martinc et al., 2021; Koloski et al., 2022). The experiments are conducted in the cross-domain setting on the RSDO5 corpus (Jemec Tomazin et al., 2021a; http://hdl.handle.net/11356/1470), containing Slovenian texts
from four domains (Biomechanics, Chemistry, Veterinary, and Linguistics).

The main contributions of this paper can be summarized in the following points:

• We systematically evaluate the performance of the Transformer-based pretrained model XLM-RoBERTa on the term extraction task, formulated as supervised cross-domain sequence-labeling, on the RSDO5 dataset containing texts from four different domains.

• We demonstrate that the proposed cross-domain approach surpasses the performance of the current state of the art (Ljubešić et al., 2019) for all the combinations of training and testing domains we experimented with, therefore establishing a new state-of-the-art (SOTA) method for ATE on a Slovenian corpus.

This paper is organized as follows: Section 2 presents the related work in the field of term extraction. Next, we introduce our methodology in Section 3 and the experimental details in Section 4. The results and a further error analysis are discussed in Sections 5 and 6, before we conclude and present future work in Section 7.

2. Related Work
The history of ATE has its beginnings in the 1990s, with research done by Damerau (1990), Ananiadou (1994), Justeson and Katz (1995), Kageura and Umino (1996), and Frantzi et al. (1998). ATE systems usually employ the following two-step procedure: (1) extracting a list of candidate terms; and (2) determining which of these candidate terms are correct using supervised or unsupervised approaches. Recently, neural approaches have also been proposed.

Traditionally, the approaches were strongly based on linguistic knowledge and the distinctive linguistic aspects of terms in order to extract possible candidates. Several NLP tools, such as tokenization, lemmatization, stemming, chunking, PoS tagging, full syntactic parsing, etc., are employed in this approach to obtain linguistic profiles of term candidates. As this is a heavily language-dependent approach, the better the quality of the pre-processing tools (e.g., FLAIR (Akbik et al., 2019) or Stanza (Qi et al., 2020)), the better the quality of linguistic ATE methods.

Meanwhile, several studies preferred the statistical approach or combined linguistic and statistical approaches. Some of the measures include termhood (Vintar, 2010), unithood (Daille et al., 1994) or the C-value (Frantzi et al., 1998). Many current systems still apply some variation of this approach, most commonly in hybrid systems combining linguistic and statistical information (Repar et al., 2019; Meyers et al., 2018; Drouin, 2003; Macken et al., 2013; Šajatović et al., 2019; Kessler et al., 2019, to cite a few).

Recently, advances in embeddings and deep neural networks have also influenced the term extraction field. Several kinds of embeddings have been investigated for term extraction, for example, uni-gram term representations constructed from a combination of local and global vectors (Amjadian et al., 2016), non-contextual word embeddings (Wang et al., 2016; Khan et al., 2016; Zhang et al., 2017), contextual word embeddings (Kucza et al., 2018), and the combination of both representations (Gao and Yuan, 2019).

In the recent ATE challenge TermEval 2020 (Rigouts Terryn et al., 2020), the use of language models became very important. The winning approach on the Dutch corpus used pretrained GloVe word embeddings fed into a bi-directional LSTM-based neural architecture. Meanwhile, the winning approach on the English corpus (Hazem et al., 2020) relied on the extraction of all possible n-gram combinations, which are fed into a BERT binary classifier that determines, for each n-gram inside a sentence, whether it is a term or not. Besides BERT, several other variations of Transformer-based models have also been investigated; for example, RoBERTa and CamemBERT were used in the TermEval 2020 challenge (Hazem et al., 2020). Another recent method is the HAMLET system (Rigouts Terryn et al., 2021), a hybrid adaptable machine learning approach that combines linguistic and statistical clues to detect terms and is also evaluated on the TermEval data.

Meanwhile, Conneau et al. (2019) and Lang et al. (2021) take advantage of XLM-RoBERTa (XLM-R) to compare three different approaches: a binary sequence classifier, a sequence classifier, and a token classifier employing the sequence-labeling approach (also investigated by Kucza et al. (2018)), as we do in our research. Finally, Lang et al. (2021) propose using a multilingual encoder-decoder model called mBART (Liu et al., 2020), which is based on denoising pre-training and generates sequences of comma-separated terms from the input sentences.

The Annotated Corpora for Term Extraction Research (ACTER) dataset was released for the TermEval competition as a collection of four domain-specific corpora (Corruption, Wind energy, Equitation, and Heart failure) in three languages (English, French, and Dutch). However, when it comes to ATE for less-resourced languages, there is still a lack of gold standard corpora and limited use of neural methods. In recent years, the Slovene KAS corpus was compiled (Erjavec et al., 2021), and most recently the RSDO corpus that we use in our study (Jemec Tomazin et al., 2021b). For Slovenian, on which we focus in our study, the current SOTA was proposed by Ljubešić et al. (2019), which extracts the initial candidate terms using the CollTerm tool (Pinnis et al., 2019), a rule-based system employing a complex language-specific set of term patterns (e.g., POS-tag patterns) from the Slovenian SketchEngine module (Fišer et al., 2016), followed by a machine learning classification approach with features representing statistical term extraction measures. Another recent approach, by Repar et al. (2019), focuses on term extraction and alignment, where the main novelty is in using an evolutionary algorithm for the alignment of terms. On the other hand, deep neural approaches have not yet been explored for Slovenian. Another problem very specific to less-resourced languages is that open-source code is often not available for most current benchmark systems, hindering their reproducibility (for Slovenian, only the code by Ljubešić et al. (2019) is available).
[Figure 1: An example of the (B-I-O) mechanism on a text sequence from the Slovenian corpus.]
3. Methodology
We consider ATE as a sequence-labeling task in which the model returns a label for each token in a text sequence. We use the (B-I-O) labeling mechanism (Rigouts Terryn et al., 2021; Lang et al., 2021), where B marks the first word of a term, I a word inside a term, and O a word that is not part of any term. The terms from a gold standard list are first mapped to the tokens in the raw text, and each word inside the text sequence is annotated with one of the three labels (see the examples in Figure 1). The model is trained to predict a label for each token in the input text sequence (i.e., we model the task as token classification) and is then applied to the unseen text (the test data). Finally, the candidate term list for the test data is composed from the tokens or token sequences labeled as terms.
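To make the mapping step concrete, the sketch below shows one way the gold-term-to-(B-I-O) projection could be implemented. It is a minimal illustration assuming whitespace tokenization and exact (case-insensitive) string matching, not the authors' released code.

# A minimal sketch of the gold-term-to-B-I-O mapping described above.
# Assumes whitespace tokenization and a gold-standard term list; the
# paper's actual preprocessing may differ.

def bio_labels(tokens, gold_terms):
    """Label each token B (term-initial), I (term-internal) or O."""
    labels = ["O"] * len(tokens)
    # Match longer terms first so a nested term does not pre-empt a longer one.
    for term in sorted(gold_terms, key=len, reverse=True):
        term_tokens = [t.lower() for t in term.split()]
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            window = [t.lower() for t in tokens[i:i + n]]
            if window == term_tokens and all(l == "O" for l in labels[i:i + n]):
                labels[i] = "B"
                for j in range(i + 1, i + n):
                    labels[j] = "I"
    return labels

sentence = "sunek navora deluje na sklep".split()
print(bio_labels(sentence, ["sunek navora", "sklep"]))
# ['B', 'I', 'O', 'O', 'B']

Matching longer terms first is one simple way of handling the nested terms discussed in Section 4.1, so that a shorter nested term does not block the annotation of the longer term that contains it.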
We experiment with XLM-RoBERTa (Conneau et al., 2019; https://huggingface.co/xlm-roberta-base), a Transformer-based model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. With the proliferation of non-English models (e.g., CamemBERT for French, Finnish BERT, German BERT, etc.), XLM-RoBERTa, the multilingual version of RoBERTa (Liu et al., 2019), is a generic cross-lingual sentence encoder that achieves benchmark performance on multiple downstream NLP tasks, including ATE for well-resourced languages such as English (Rigouts Terryn et al., 2020). Due to this well-documented SOTA performance on several related tasks, we opted to employ XLM-RoBERTa in a monolingual setting on our low-resourced Slovenian corpus. The overall architecture of our approach is presented in Figure 2.

[Figure 2: The overall architecture.]

The model is fine-tuned on the training set to predict, for each word in a word sequence, the probability that it is part of a term (B, I) or not (O). To do so, an additional token classification head containing a feed-forward layer with a softmax activation is added on top of the model.

4. Experimental Setup
Here, we describe the dataset, the experimental details, and the metrics that we apply for the evaluation.

In our experiments, we use a multilingual pre-trained language model in order to leverage the general knowledge the model obtained during pretraining on a huge multilingual corpus. First, we divide the dataset into train-validation-test splits. We also investigate the effectiveness of cross-domain learning, where the main idea is to test the transfer of knowledge from one domain to another and therefore to evaluate both the capability of the model to extract terms in new, unseen domains and its ability to learn the relations between terms across domains, given the assumption that they have terminologically-marked contexts. Therefore, we fine-tune the model on two domains (e.g., Biomechanics and Chemistry) as the train split, validate on a third domain (e.g., Veterinary) as the validation split, and test on the fourth domain that does not appear in the train set (e.g., Linguistics). The train split is used for fine-tuning the pre-trained language model. The validation split is applied to prevent over-fitting during the fine-tuning phase. Finally, the test split, which is not used during training, serves for the evaluation of the method.
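Enumerating the cross-domain splits just described is straightforward; the following sketch (with our own variable names, not from the released code) generates the 12 train/validation/test combinations.

# Sketch of the 12 cross-domain train/validation/test combinations:
# for each ordered (validation, test) pair, the remaining two domains train.
from itertools import permutations

domains = ["bim", "kem", "vet", "ling"]

splits = []
for val, test in permutations(domains, 2):
    train = sorted(d for d in domains if d not in (val, test))
    splits.append((train, val, test))

for train, val, test in splits:
    print(f"train={'+'.join(train):9s} val={val:4s} test={test}")
# 12 combinations: 4 test domains x 3 validation domains each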
4.1. Dataset
The experiments are conducted on version 1.1 of the Slovenian RSDO5 corpus (Jemec Tomazin et al., 2021a); Slovenian is a less-resourced Slavic language with rich morphology. As part of the RSDO national project, the RSDO5 corpus was manually compiled and annotated; it contains 12 documents with altogether about 250,000 words from the fields of Biomechanics (bim), Chemistry (kem), Veterinary (vet), and Linguistics (ling). The data were collected from diverse sources, including Ph.D. theses (3), a Ph.D.-thesis-based scientific book (1), graduate-level textbooks (4), and journal articles (4) published between 2000 and 2019. Apart from the manually annotated terms, RSDO5 is also annotated with Universal Dependency tags (e.g., tags annotating tokens, sentences, lemmas, morphological features, etc.). However, in our research we only leverage the original text with the term labels, where we consider all terms and do not distinguish between in-domain and out-of-domain terms.

In Table 1, we report the number of documents, tokens, and unique terms across domains.

  Domain              # Docs  # Tokens  # Terms
  Biomechanics (bim)       3    61,344    2,319
  Chemistry (kem)          3    65,012    2,409
  Veterinary (vet)         3    75,182    4,748
  Linguistics (ling)       3   109,050    4,601

Table 1: Number of documents, tokens, and unique terms per domain in the Slovenian RSDO5 dataset.

Given the same number of collected documents for each domain, the documents from the Linguistics and Veterinary domains are longer (i.e., have more tokens) and also contain more terms than those from Biomechanics and Chemistry. In addition, Figure 3 presents the frequency of terms of different lengths per domain. Veterinary, Chemistry, and Linguistics share a similar term-length distribution, with most terms made of one to three words and only a few (fewer than three) terms longer than seven words (an example of a long term found in the corpus is "kaznivo dejanje zoper življenje, telo in premoženje", a crime against life, body, and property). Meanwhile, the Biomechanics domain distribution has a longer right tail, containing several terms with more than three words.

[Figure 3: The frequencies of terms of specific lengths per domain in the Slovenian dataset.]

Furthermore, the corpus contains several nested terms, i.e., terms that also appear within larger terms; vice versa, a multiword term may contain shorter terms. For example, in the Biomechanics domain the term "navor" (torque) appears in terms such as "sunek navora" (torque shock), "zunanji sunek navora" (external torque shock), and "izokinetični navor" (isokinetic torque), to mention a few. This makes the labeling harder, and the classifier needs to infer from the context whether a specific term is part of a longer term.

4.2. Implementation Details
We experiment with several combinations of training, validation, and testing data where two domains are used for training, the third one for validation, and the fourth one for testing (i.e., we train 12 models covering all possible domain combinations). We consider term extraction as a sequence-labeling or token classification task with a (B-I-O) annotation scheme. Table 2 presents the distribution across label types and the proportion of (B) and (I) labels in the total number of tokens per domain in the dataset. On average, the tokens annotated as terms (or parts of terms) represent only about one-fifth of the total tokens in the corpus, which means that there is a significant imbalance between (B) and (I) tokens and tokens labeled as not terms (O).

  Domain              B       I      O       % Term
  Biomechanics (bim)   7,070  6,835  47,439   22.67
  Chemistry (kem)      7,614  4,486  52,912   18.61
  Veterinary (vet)    10,953  6,261  57,968   22.90
  Linguistics (ling)  12,348  6,079  90,623   16.89

Table 2: Label distribution and the proportion of term tokens per domain in the Slovenian RSDO5 dataset.

We employ the XLM-RoBERTa token classification model and its "fast" XLM-RoBERTa tokenizer from the Huggingface library (https://huggingface.co/models). We fine-tune the model for up to 20 epochs with respect to model convergence (i.e., we also employ an early stopping regime), with a learning rate of 2e-05, a training and evaluation batch size of 32, and a sequence length of 512 tokens, since this hyperparameter configuration performed best on the validation set. The documents are split into sentences; sentences containing more than 512 tokens are truncated, while sentences with fewer than 512 tokens are padded with a special <PAD> token at the end. During fine-tuning, the model is evaluated on the validation set after each training epoch, and the best-performing model is applied to the test set.
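For illustration, the configuration described above maps onto the standard Huggingface token-classification API roughly as follows. This is only a sketch under stated assumptions: dataset preparation (sentence splitting, truncation/padding to 512 tokens, and aligning B-I-O labels to subwords) is omitted, the early-stopping patience value is our own choice, and the authors' released script in their repository may differ.

# Sketch of the described fine-tuning setup with the stated hyperparameters.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

# Placeholders: tokenized sentences with B-I-O labels aligned to subwords,
# prepared elsewhere (two training domains, one validation domain).
train_dataset = ...
val_dataset = ...

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # "fast" tokenizer
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3  # B, I, O
)

args = TrainingArguments(
    output_dir="ate-xlmr",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=20,
    evaluation_strategy="epoch",   # evaluate on the validation domain each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the best checkpoint for the test domain
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early stopping
)
trainer.train()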
The model predicts, for each word in a word sequence, whether it is part of a term (B, I) or not (O). The sequences identified as terms are extracted from the text and put into a set of all predicted candidate terms. A post-processing step that lowercases all the candidate terms is applied before we compare the derived candidate list with the gold standard using the evaluation metrics discussed in Section 4.3.

4.3. Evaluation Metrics
We perform a global evaluation of our term extraction system by comparing the list of candidate terms extracted on the level of the whole test set with the manually annotated gold standard of the test set, using Precision, Recall, and F1-score. Precision refers to the percentage of the extracted terms that are correct, while Recall indicates the percentage of the total correct terms that are extracted. Low Precision means a lot of noise in the extraction, whereas low Recall indicates many misses. The F1-score computes an overall performance as the harmonic mean of Precision and Recall. These evaluation metrics have also been used in the related work, including the TermEval 2020 shared task (Hazem et al., 2020; Rigouts Terryn et al., 2020; Lang et al., 2021).
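The set-level comparison just described can be expressed in a few lines; the sketch below is an illustration of the computation, not the authors' evaluation script.

# A minimal sketch of the set-level evaluation described above: the
# lowercased candidate list is compared with the gold standard term list
# of the test domain.

def evaluate(candidates, gold):
    cand = {c.lower() for c in candidates}
    gold = {g.lower() for g in gold}
    tp = len(cand & gold)
    precision = tp / len(cand) if cand else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(evaluate(["sunek navora", "navor", "sklep"], ["sunek navora", "kolk"]))
# (0.333..., 0.5, 0.4)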
5. Results
Table 3 presents the results achieved by the multilingual XLM-RoBERTa pre-trained language model on the Slovenian RSDO5 dataset. Note that the results in the table are grouped according to the model's test domain for better comparison between the different settings. Our cross-domain approach proves to have relatively consistent performance across all the combinations, achieving a Precision of more than 62%, a Recall of no less than 55%, and an F1-score above 61%. The model performs slightly better for the Linguistics and Veterinary domains than for Biomechanics and Chemistry. The differences in the number and length of terms per domain pointed out in Section 4.1 might be one of the factors that contribute to this behavior. In addition, a significant performance boost can be observed for the Linguistics domain when the model is trained on the Chemistry and Veterinary domains, and for the Veterinary domain when the model is trained on Biomechanics and Linguistics. In these two settings, the model achieves an F1-score of more than 68%.

  Training                Validation  Testing  Precision  Recall  F1-score
  bim + kem               vet         ling         69.55   64.05     66.69
  bim + vet               kem         ling         69.48   73.66     71.51
  kem + vet               bim         ling         66.20   72.38     69.15
  Ljubešić et al. (2019)  –           ling         52.20   25.40     34.10
  bim + kem               ling        vet          71.06   66.72     68.82
  bim + ling              kem         vet          72.66   65.59     68.94
  ling + kem              bim         vet          69.30   68.07     68.68
  Ljubešić et al. (2019)  –           vet          66.90   19.30     29.90
  bim + vet               ling        kem          68.67   55.13     61.16
  bim + ling              vet         kem          70.14   60.27     64.83
  ling + vet              bim         kem          70.23   59.24     64.27
  Ljubešić et al. (2019)  –           kem          47.80   31.40     37.80
  vet + kem               ling        bim          63.51   66.80     65.11
  vet + ling              kem         bim          62.25   65.20     63.69
  ling + kem              vet         bim          62.35   63.99     63.16
  Ljubešić et al. (2019)  –           bim          53.80   24.80     33.90

Table 3: Term extraction evaluation in the cross-domain setting on the Slovenian RSDO5 dataset.

We also present results for the current SOTA approach of Ljubešić et al. (2019), obtained by reproducing their methodology on the same RSDO5 dataset. In general, our approach outperforms the approach proposed by Ljubešić et al. (2019) by a large margin on all domains and according to all evaluation metrics. The margin is especially large when it comes to Recall. Given the training process applied to the RSDO5 corpus, the Ljubešić et al. (2019) approach has a low F1-score due to the high imbalance between Precision and Recall. This is most likely due to the fact that their methods rely heavily on frequency and are thus not suitable for discovering low-frequency terms, of which there are a lot in the RSDO5 corpus. In their own experiments, Ljubešić et al. (2019) discard all term candidates with a frequency below 3, which is why their results on their corpus are higher than on RSDO5.

Overall, we achieve results roughly twice as high as the approach proposed by Ljubešić et al. (2019) in terms of F1-score for all test domains. The results demonstrate the predictive power of contextual information in language models such as XLM-RoBERTa over a machine learning approach with features representing statistical term extraction measures, as in Ljubešić et al. (2019).

6. Error Analysis
In this section, we analyze the predictions of XLM-RoBERTa on the RSDO5 corpus to get a better understanding of the model's performance and to discover possible avenues for future work. First, we analyze the predictive power of our approach for terms of different lengths by calculating Precision and Recall separately for terms of length k = {1, 2, 3, 4, equal to or more than 5}. The number of predicted candidate terms, the number of ground truth terms, the number of correct predictions (TPs), Precision, and Recall for the different term lengths k and test domains are presented in Tables 4, 5, 6, and 7. Note that these statistics are collected for the train-validation-test combinations that perform best on each domain according to the F1-score.

Results across Tables 4 to 7 show that our models are good at predicting short terms containing up to three words in all four domains. The best model applied to the Linguistics test domain also shows competitive performance for the prediction of longer terms, achieving 75.00% Precision and a decent 31.03% Recall for terms with at least 5 words. Despite the relatively high Precision achieved by the models on long terms in the Veterinary and Biomechanics test domains, the Recall is rather low, most likely due to the small number of longer terms in the datasets on which the models are trained. When it comes to predictions in the Chemistry domain, there are no correct term predictions that consist of more than five words.
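The per-length statistics reported in Tables 4–7 can be reproduced by bucketing both the candidate and gold term lists by word count, as in the following sketch (our own helper, with k ≥ 5 pooled into one bucket).

# Sketch of the per-length error analysis behind Tables 4-7.
from collections import defaultdict

def per_length_scores(candidates, gold):
    """Precision/Recall per term length k (k >= 5 pooled), as in Tables 4-7."""
    def bucket(term):
        return min(len(term.split()), 5)

    cand_by_k = defaultdict(set)
    gold_by_k = defaultdict(set)
    for term in candidates:
        cand_by_k[bucket(term)].add(term.lower())
    for term in gold:
        gold_by_k[bucket(term)].add(term.lower())

    scores = {}
    for k in sorted(set(cand_by_k) | set(gold_by_k)):
        tp = len(cand_by_k[k] & gold_by_k[k])
        precision = tp / len(cand_by_k[k]) if cand_by_k[k] else 0.0
        recall = tp / len(gold_by_k[k]) if gold_by_k[k] else 0.0
        scores[k] = {"predicted": len(cand_by_k[k]), "gold": len(gold_by_k[k]),
                     "tp": tp, "precision": precision, "recall": recall}
    return scores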
  k    #Predictions  #Ground truth  #TPs   Precision  Recall
  1           2,078          1,728  1,300      62.56   75.23
  2           2,631          2,404  1,858      70.62   77.29
  3             322            360    191      59.32   53.06
  4              57             80     31      54.39   38.75
  ≥5             12             29      9      75.00   31.03

Table 4: Performance in Precision and Recall per term length in the Linguistics domain.

  k    #Predictions  #Ground truth  #TPs   Precision  Recall
  1           2,159          2,067  1,472      68.18   71.21
  2           2,062          2,103  1,448      70.22   68.85
  3             314            446    182      57.96   40.81
  4              28             77     10      35.71   12.99
  ≥5              3             55      2      66.67    3.64

Table 5: Performance in Precision and Recall per term length in the Veterinary domain.

  k    #Predictions  #Ground truth  #TPs   Precision  Recall
  1             943            890    580      61.51   65.17
  2           1,073          1,202    768      71.58   63.89
  3             164            260     93      56.71   35.77
  4              26             46     11      42.31   23.91
  ≥5              3             11      0       0.00    0.00

Table 6: Performance in Precision and Recall per term length in the Chemistry domain.

  k    #Predictions  #Ground truth  #TPs   Precision  Recall
  1           1,079            718    522      48.38   72.70
  2           1,153          1,172    822      71.29   70.14
  3             223            286    124      55.61   43.36
  4              26             59     11      42.31   18.64
  ≥5             11             84      5      45.45    5.95

Table 7: Performance in Precision and Recall per term length in the Biomechanics domain.

In addition, as the corpus contains many nested terms, a very common mistake of the model is to predict a shorter term nested in a correct term of the gold standard (Pattern 1). Vice versa, the model sometimes generates incorrect predictions containing correct nested terms (Pattern 2). Furthermore, in some cases the model merges two consecutive gold terms into a single prediction (Pattern 3). We report some examples of these incorrect patterns in Table 8, where the first column gives the pattern type, the second our predicted candidate term, and the last the true term(s) from the gold standard. The presented candidate terms are extracted from the final list of predicted terms for the Linguistics test domain.
  Pattern  Our prediction                                     The gold standard
  1        "klasična analogna telefonska zveza"               "klasična analogna telefonska zveza pot"
           (classic analog telephone connection)              (classic analog telephone connection path)
  1        "končnica neprve slovarske oblike"                 "končnica"
           (suffix of the non-first dictionary form)          (suffix)
  …
  2        "brezžično slušalk v ušesu"                        "brezžično slušalk"
           (wireless in-ear headphones)                       (wireless headphones)
  2        "elektromehanska uporaba električne energije"      "električne energije"
           (electromechanical use of electrical energy)       (electrical energy)
  …
  3        "batne parne stroje za pogon"                      "batne parne stroje", "pogon"
           (piston steam engines for propulsion)              (piston steam engines), (propulsion)
  3        "elektrarna na atomski pogon"                      "elektrarna", "atomski pogon"
           (nuclear-powered power plant)                      (power plant), (nuclear propulsion)
  3        "besedilnim tipom strokovnega jezika"              "besedilnim tipom", "strokovnega jezika"
           (text type of professional language)               (text type), (professional language)
  3        "eksperimentalno modeliranje dinamičnih sistemov"  "eksperimentalno modeliranje", "dinamičnih sistemov"
           (experimental modeling of dynamic systems)         (experimental modeling), (dynamic systems)
  …

Table 8: Examples of unlemmatised predictions in the Linguistics test domain.

7. Conclusion
In summary, we investigated the performance of the multilingual Transformer-based language model XLM-RoBERTa on the monolingual cross-domain sequence-labeling term extraction task. The experiments were conducted on the representative Slovenian RSDO5 corpus, which contains texts from four specific domains, namely Biomechanics, Chemistry, Veterinary, and Linguistics. Our cross-domain sequence-labeling approach with XLM-RoBERTa showed consistent performance across all the combinations of training, validation, and test sets, achieving up to 72.66% Precision, up to 73.66% Recall, and up to 71.51% F1-score. The model performed slightly better in extracting terms from the Linguistics and Veterinary domains than from Biomechanics and Chemistry. Moreover, our approach outperformed the current state of the art for the Slovenian language (Ljubešić et al., 2019) by a large margin according to all three evaluation metrics, in some cases achieving three times higher Recall and roughly two times higher F1-score. As a consequence, our approach is the new SOTA approach on the RSDO5 dataset.

However, we believe that there remains room for improvement in the field of supervised term extraction. In the future, we would like to pre-train the model on an intermediate task (e.g., machine translation) resembling term extraction before fine-tuning it on the target downstream task, in order to boost the extraction performance. In addition, we will also investigate the performance of the models in the zero-shot cross-lingual setting, the multilingual setting, and the combination of both settings in comparison with our current monolingual setting. Lastly, we suggest the integration of active learning into our current approach, to improve the output of the automated method by dynamic adaptation after human feedback. By learning with humans in the loop, we aim at getting the most information with the least amount of term labels. We will also evaluate the contribution of active learning in reducing the annotation effort and determine the robustness of the incremental active learning framework across different languages and domains.

8. Acknowledgements
The work was partially supported by the Slovenian Research Agency (ARRS) core research program Knowledge Technologies (P2-0103) and the project TermFrame (J6-9372), as well as by the Ministry of Culture of the Republic of Slovenia through the project Development of Slovene in Digital Environment (RSDO). The first author was partly funded by Region Nouvelle-Aquitaine. This work has also been supported by the TERMITRAD (2020-2019-8510010) project funded by the Nouvelle-Aquitaine Region, France.

9. References
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.
Ehsan Amjadian, Diana Inkpen, Tahereh Paribakht, and Farahnaz Faez. 2016. Local-Global Vectors to Improve Unigram Terminology Extraction. In Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016), pages 2–11.
Sophia Ananiadou. 1994. A methodology for automatic term recognition. In COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics.
Chris Biemann and Alexander Mehler. 2014. Text mining: From ontology learning to automated text processing applications. Springer.
David M. Blei and John D. Lafferty. 2009. Visualizing topics with multi-word expressions. arXiv preprint arXiv:0907.1013.
M. Teresa Cabré Castellví, Rosa Estopa Bagot, and Jordi Vivaldi Palatresi. 2001. Automatic term detection: A review of current systems. Recent advances in computational terminology, 2:53–88.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Béatrice Daille, Éric Gaussier, and Jean-Marc Langé. 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics.
Fred J. Damerau. 1990. Evaluating computer-generated domain-oriented vocabularies. Information processing & management, 26(6):791–801.
Patrick Drouin. 2003. Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1):99–115.
Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare Voss, and Jiawei Han. 2014. Scalable topical phrase mining from text corpora. arXiv preprint arXiv:1406.6312.
Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M. Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC.
Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources and Evaluation, 55(2):551–583.
Darja Fišer, Vit Suchomel, and Miloš Jakubíček. 2016. Terminology extraction for academic Slovene using Sketch Engine. In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016, pages 135–141.
Katerina T. Frantzi, Sophia Ananiadou, and Junichi Tsujii. 1998. The C-value/NC-value method of automatic recognition for multi-word terms. In International conference on theory and practice of digital libraries, pages 585–604. Springer.
Yuze Gao and Yu Yuan. 2019. Feature-less End-to-end Nested Term extraction. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 607–616. Springer.
Amir Hazem, Mérieme Bouhandi, Florian Boudin, and Béatrice Daille. 2020. TermEval 2020: TALN-LS2N System for Automatic Term Extraction. In Proceedings of the 6th International Workshop on Computational Terminology, pages 95–100.
Mateja Jemec Tomazin, Mitja Trojar, Simon Atelšek, Tanja Fajfar, Tomaž Erjavec, and Mojca Žagar Karer. 2021a. Corpus of term-annotated texts RSDO5 1.1. Slovenian language resource repository CLARIN.SI.
Mateja Jemec Tomazin, Mitja Trojar, Mojca Žagar, Simon Atelšek, Tanja Fajfar, and Tomaž Erjavec. 2021b. Corpus of term-annotated texts RSDO5 1.0.
John S. Justeson and Slava M. Katz. 1995. Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Natural language engineering, 1(1):9–27.
Kyo Kageura and Bin Umino. 1996. Methods of Automatic Term Recognition: A Review. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 3(2):259–289.
Rémy Kessler, Nicolas Béchet, and Giuseppe Berio. 2019. Extraction of terminology in the field of construction. In 2019 First International Conference on Digital Data Processing (DDP), pages 22–26. IEEE.
Muhammad Tahir Khan, Yukun Ma, and Jung-jae Kim. 2016. Term Ranker: A Graph-Based Re-Ranking Approach. In FLAIRS Conference, pages 310–315.
Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc. 2022. Out of thin air: Is zero-shot cross-lingual keyword detection better than unsupervised? arXiv preprint arXiv:2202.06650.
Maren Kucza, Jan Niehues, Thomas Zenkel, Alex Waibel, and Sebastian Stüker. 2018. Term Extraction via Neural Sequence Labeling: a Comparative Evaluation of Strategies Using Recurrent Neural Networks. In INTERSPEECH, pages 2072–2076.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270.
Christian Lang, Lennart Wachowiak, Barbara Heinisch, and Dagmar Gromann. 2021. Transforming term extraction: Transformer-based approaches to multilingual term extraction across domains. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3607–3620.
Annaïch Le Serrec, Marie-Claude L'Homme, Patrick Drouin, and Olivier Kraif. 2010. Automating the compilation of specialized dictionaries: Use and analysis of term extraction and lexical alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 16(1):77–106.
Yang Lingpeng, Ji Donghong, Zhou Guodong, and Nie Yu. 2005. Improving retrieval effectiveness by using key terms in top retrieved documents. In European Conference on Information Retrieval, pages 169–184. Springer.
Marina Litvak and Mark Last. 2008. Graph-based keyword extraction for single-document summarization. In Coling 2008: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pages 17–24.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2019. Kas-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning. In International Conference on Text, Speech, and Dialogue, pages 115–126. Springer.
Lieve Macken, Els Lefever, and Veronique Hoste. 2013. TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 19(1):1–30.
Alfredo Maldonado and David Lewis. 2016. Self-tuning ongoing terminology extraction retrained on terminology validation decisions. In Proceedings of The 12th International Conference on Terminology and Knowledge Engineering, pages 91–100.
Matej Martinc, Blaž Škrlj, and Senja Pollak. 2021. TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering, pages 1–40.
Adam L. Meyers, Yifan He, Zachary Glass, John Ortega, Shasha Liao, Angus Grieve-Smith, Ralph Grishman, and Olga Babko-Malaya. 2018. The Termolator: Terminology Recognition Based on Chunking, Statistical and Search-Based Scores. Frontiers in Research Metrics and Analytics, 3:19.
Marco A. Palomino, Tim Taylor, and Richard Owen. 2013. Evaluating business intelligence gathering techniques for horizon scanning applications. In Mexican International Conference on Artificial Intelligence, pages 350–361. Springer.
John Pavlopoulos and Ion Androutsopoulos. 2014. Aspect term extraction for sentiment analysis: New datasets, new evaluation measures and an improved unsupervised method. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 44–52.
Mārcis Pinnis, Nikola Ljubešić, Dan Ştefănescu, Inguna Skadiņa, Marko Tadić, Tatjana Gornostaja, Špela Vintar, and Darja Fišer. 2019. Extracting data from comparable corpora. In Using Comparable Corpora for Under-Resourced Areas of Machine Translation, pages 89–139. Springer.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.
Andraž Repar, Vid Podpečan, Anže Vavpetič, Nada Lavrač, and Senja Pollak. 2019. TermEnsembler: An Ensemble Learning Approach to Bilingual Term Extraction and Alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 25(1):93–120.
Ayla Rigouts Terryn, Veronique Hoste, Patrick Drouin, and Els Lefever. 2020. TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset. In 6th International Workshop on Computational Terminology (COMPUTERM 2020), pages 85–94. European Language Resources Association (ELRA).
Ayla Rigouts Terryn, Véronique Hoste, and Els Lefever. 2021. HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology. Terminology.
Horacio Saggion, Adam Funk, Diana Maynard, and Kalina Bontcheva. 2007. Ontology-based information extraction for business intelligence. In The Semantic Web, pages 843–856. Springer.
Antonio Šajatović, Maja Buljan, Jan Šnajder, and Bojana Dalbelo Bašić. 2019. Evaluating automatic term extraction methods on individual documents. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), pages 149–154.
Thi Hong Hanh Tran, Antoine Doucet, Nicolas Sidere, Jose G. Moreno, and Senja Pollak. 2021. Named entity recognition architecture combining contextual and global features. In Towards Open and Trustworthy Digital Societies: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, page 264. Springer Nature.
Špela Vintar. 2010. Bilingual Term Recognition Revisited: The Bag-of-equivalents Term Alignment Approach and its Evaluation. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 16(2):141–158.
Rui Wang, Wei Liu, and Chris McDonald. 2016. Feature-less Domain-Specific Term Extraction with Minimal Labelled Data. In Proceedings of the Australasian Language Technology Association Workshop 2016, pages 103–112.
Petra Wolf, Ulrike Bernardi, Christian Federmann, and Sabine Hunsicker. 2011. From statistical term extraction to hybrid machine translation. In Proceedings of the 15th Annual conference of the European Association for Machine Translation.
Ziqi Zhang, Jie Gao, and Fabio Ciravegna. 2017. SemRe-Rank: Incorporating Semantic Relatedness to Improve Automatic Term Extraction Using Personalized PageRank. arXiv preprint arXiv:1711.03373.
Metadata on recordings and speakers in spoken language resources: The case of the Artur database
Darinka Verdonik,* Andreja Bizjak,* Andrej Žgank,* Simon Dobrišek†
* Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru, Koroška 46, 2000 Maribor
darinka.verdonik@um.si, andreja.bizjak1@um.si, andrej.zgank@um.si
† Fakulteta za elektrotehniko, Univerza v Ljubljani
Tržaška 25, 1000 Ljubljana
simon.dobrisek@fe.uni-lj.si
Abstract
When merging data from different spoken language resources, problems arise due to the incompatibility of the metadata on recordings, speakers, or speech in general (e.g., information about the speech type or speech event, the time and place of the recording, or the gender, education and region of the speaker). These metadata are captured on the one hand to ensure the balance of speech samples across different speakers and speech situations, and on the other hand to enable the classification of speech data into the categories needed either for linguistic analysis or for training speech recognition algorithms. The most common differences in the metadata on recordings and speakers in the existing freely available speech resources for Slovene relate to the categorizations of the type of speech and the location of the recording, as well as to the categorizations and designations of the speaker's region. Different categories also emerge in relation to the age and educational groups of speakers. Many types of metadata are recorded only in particular resources. In addition to reviewing the differences, we also give some suggestions on how to overcome them.
1. Introduction
Spoken language resources are important both for the development of linguistics and a comprehensive knowledge of language, and for the development of speech technologies such as speech recognition or speech synthesis. Besides the recordings and the transcription of speech, they usually also contain a smaller or larger amount of information on where, when and how the recordings were made, and on the characteristics of the speakers with regard to gender, age, education, etc. Although the Text Encoding Initiative (TEI) also includes standardization recommendations for the domain of speech, the substantive decisions on which categories of such data to capture and in how much detail to describe them depend strongly on the type of material and the purpose of the spoken language resource. Thus, when merging resources created in different periods, with partly different goals and covering different types of speech, problems arise that stem from the substantive incompatibility of the recorded metadata on recordings, speakers, or speech in general.

With the aim of reducing such problems in the future, this paper reviews which kinds of data have been needed in different disciplines, with an emphasis on the existing Slovenian spoken language resources (Sections 2 and 3), presents the structure of these data in more detail on the example of the Artur speech database, currently the newest, largest, and most heterogeneous spoken language resource for Slovene (Section 4), and highlights the types of data where the substantive divergences are the largest, together with a proposal for bridging them (Section 5).

2. Metadata on recordings and speakers in speech corpora
The GOS corpus was one of the first larger projects aimed at providing a more extensive spoken language resource for research on the Slovenian language. It was released in 2011 with approximately 112 hours of recordings and followed the corpus-linguistic efforts of the time to complement reference written corpora with reference speech corpora (e.g., Burnard, 2007; Allwood et al., 2000; Oostdijk et al., 2002; Pořízka, 2009). Its purpose was thus primarily to provide data on spoken Slovene for lexicographic, grammatical and other linguistic research, for the teaching of Slovene, for professional speakers or writers, and also for the wider interested public. It contained as representative a set of different speech situations as possible, with the goal of capturing sample examples of different speech situations and different spoken discourses, a demographically representative sample of speakers of Slovene, and those speech situations in which language users most frequently participate, whether productively or only receptively (Verdonik and Zwitter Vitez, 2011: 17).

Besides the transcriptions, GOS was also complemented with the recordings and with numerous metadata on the recordings and speakers (by which corpus users can also filter search results). The practice in other, foreign speech corpora is similar. The usual metadata about the recorded situation are the date, location, type of interaction, context, topic, participants, duration, the recording equipment used, the source, etc. The metadata about the participants are usually an identification code, age, gender, nationality or first language, region or dialect, and occupation, but possibly also the place of birth, current location, other languages, etc. (Zemljarič Miklavčič, 2008; Cresti and Moneglia, 2005; Ehmer and Martinez, 2014; Love et al., 2017).
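The recurring inventory of fields just listed lends itself to a simple structured representation. The sketch below is only an illustration of such a schema; the field names are ours and do not reproduce the actual GOS or Artur metadata scheme.

# Illustrative sketch of recording/speaker metadata fields discussed above;
# field names are hypothetical, not the GOS or Artur schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpeakerMetadata:
    speaker_id: str
    gender: Optional[str] = None
    age_group: Optional[str] = None        # e.g. one of several age bands
    education: Optional[str] = None
    region: Optional[str] = None           # regional/dialect area
    first_language: Optional[str] = None

@dataclass
class RecordingMetadata:
    recording_id: str
    date: Optional[str] = None
    location: Optional[str] = None
    interaction_type: Optional[str] = None  # e.g. type of speech event
    topic: Optional[str] = None
    duration_s: Optional[float] = None
    equipment: Optional[str] = None
    source: Optional[str] = None
    speakers: list[SpeakerMetadata] = field(default_factory=list)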
Data on participants typically comprise an identification code, age, gender, nationality or first language, region or dialect, and occupation, and may also include place of birth, current location, other languages, etc. (Zemljarič Miklavčič, 2008; Cresti and Moneglia, 2005; Ehmer and Martinez, 2014; Love et al., 2017).

In the GOS corpus, the metadata on recordings included (Verdonik and Zwitter Vitez, 2011):
- data on the material provider, i.e. the source of the recording,
- data on the type of speech, the institutional setting, the speech event, a free description of the speech situation, and the number of active participants in the speech event,
- data on the time and place of recording, where the place of recording was specified both by the name of the locality and by its assignment to a broader (vehicle-registration) area.

The data on speakers comprised:
- gender,
- age, divided into 7 categories,
- education, divided into 4 categories,
- the speaker's region, defined by vehicle-registration area, with the option of specifying several regions if the speaker had lived in different regions for more than a year (e.g., for study or work),
- the speaker's first language.

The GOS corpus was followed in 2016–2019, in several releases, by the smaller Gos Videolectures database (Verdonik, 2018), which, unlike GOS, covers domain-restricted material: public lectures available through the Videolectures.net portal. In its latest, fourth version it comprises a total of 22 hours of recorded public lectures, balanced across the thematic areas of social sciences, humanities, medicine, engineering, and natural sciences/mathematics. The featured speakers likewise evenly represent both genders, older and younger speakers, and roughly defined regions of Slovenia.

The metadata on recordings and speakers followed the scheme established in the GOS corpus, but due to limited access to information they were not recorded with the same precision. Whereas speaker age in GOS was divided into 7 categories, Gos Videolectures has only 2, and even these are based mainly on visual impression rather than on direct, accurate information. There were likewise no direct data on the speaker's region; instead, auditory impressions of speech characteristics were recorded under this heading. Some data were not specified at all, being either unavailable (education) or irrelevant (first language, which is Slovene for all speakers). Since the Gos Videolectures database was also intended for the development of speech recognition technology, a need emerged to record the quality of the recording as well, which was added merely as the transcriber's subjective assessment based on auditory impression.

3. Metadata on recordings and speakers in speech databases for speech recognition

From the perspective of speech technology development, i.e. of speech recognizers, the main reason for collecting data on speakers and recordings is to ensure that the speech database contains adequately representative coverage of all salient speech characteristics that vary across speakers and speech circumstances. Relevant, then, is any data point about a speaker or a recording that can carry information about the speech characteristics of the speaker, or about those speech circumstances which can be assumed to affect the acoustic and linguistic properties of the recorded speech. With computational signal processing methods, various speech features can be extracted from speech signals, and these are assumed to be hierarchically organized, reflecting both low-level anatomical features of the human vocal tract and higher-level dialogue and semantic features.

For the development of automatic speech recognizers it therefore makes sense to retain, from the whole set of metadata, primarily those items that can contribute to better acoustic and language modelling of speech. In the past, metadata were thus key information in the development of speech databases for speech recognition: they were the basis for achieving adequate coverage of all the categories of speakers and speech foreseen in the specifications. The main purpose of collecting these metadata was to have the speech database reflect, as realistically as possible, the circumstances and scenarios of possible uses of automatic speech recognizers (Kolář and Švec, 2008). This approach is especially important for speech databases comprising from at least a few dozen to several hundred hours of speech or speakers.

The rapid technological development of information and communication systems has made it possible to collect and process ever larger amounts of data. At the same time, the available computing power of modern computers has increased dramatically, above all with the development of very powerful graphics processing units (GPUs), which efficiently run the numerically demanding algorithms of so-called deep learning (Gondi and Pratap, 2021). One consequence of this progress is that, for languages with large numbers of speakers, extensive speech databases comprising more than 10,000 hours of recorded speech have begun to be acquired. These are, as a rule, databases obtained from very diverse sources, such as various media, web platforms, audiobooks, etc. Because of their size, the acquired recordings are often not annotated or transcribed manually; unsupervised or semi-supervised approaches, which do not require manually produced annotations and transcriptions, are then used to train speech recognizers (Hershey et al., 2017). In most very large speech databases, consistently balancing the recordings on the basis of metadata thus becomes of secondary importance. Given the very diverse possible sources and ways of collecting recordings, it is often not even possible to obtain the relevant metadata. And where metadata are available but hard to balance within the database, Robert Mercer's famous 1985 dictum comes to the fore: there is no data like more data.

These new, metadata-unbalanced approaches to building speech databases have received additional support from deep learning procedures, where methods for automatically augmenting and enriching training data are used ever more frequently. Original speech recordings can thus be modified into various simulated forms using modern digital signal processing methods. Basic approaches of this kind include, for instance, speeding up or slowing down the speech in the original recordings. With respect to the metadata usually considered in speech recognizer development, more demanding approaches have also been developed, in which different recording circumstances are simulated (e.g., channel characteristics, noise level, codecs, acoustic backgrounds, room, etc.).
With such approaches we can effectively extend the set of original speech recordings and compensate for the shortage of certain types of recordings (Karafiát et al., 2017).
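The tempo modification mentioned above as one of the basic augmentation approaches can be illustrated in a few lines of code. The sketch below is our own illustration, not part of any of the resources described here; it assumes NumPy and the soundfile library, a mono recording, and a placeholder file name.

    # A minimal sketch (not the Artur pipeline) of simple speed perturbation,
    # one of the basic augmentation techniques mentioned above. Assumes a
    # mono recording; the file name is a placeholder.
    import numpy as np
    import soundfile as sf

    def speed_perturb(samples: np.ndarray, factor: float) -> np.ndarray:
        """Resample by linear interpolation: factor > 1 speeds speech up,
        factor < 1 slows it down (pitch shifts along with tempo, as when a
        recording is played back at a different speed)."""
        old_idx = np.arange(len(samples))
        new_idx = np.arange(0, len(samples), factor)
        return np.interp(new_idx, old_idx, samples).astype(samples.dtype)

    audio, rate = sf.read("recording.wav")    # placeholder path
    for factor in (0.9, 1.1):                 # commonly used +-10 % factors
        sf.write(f"recording_x{factor}.wav", speed_perturb(audio, factor), rate)

More sophisticated augmentation (channel simulation, added noise, reverberation) would follow the same read-modify-write pattern over the original recordings.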
In pursuing the basic goal that the speech database reflect as well as possible the likely circumstances and usage scenarios of speech recognizers, it makes sense to set priorities as to which metadata are considered and balanced. For the development of a general automatic speech recognizer, it is advisable to consider above all the following metadata:
- Speaker label: uniquely identifies all recordings of the same speaker in the database. This enables efficient use of methods for adapting the recognizer model to individual speakers (e.g., MLLR, SAT, iVector, etc.) (Povey et al., 2008; Cardinal et al., 2015), which can contribute to a considerable improvement in the accuracy of automatic speech recognition.
- First language: automatic speech recognition for a given language is usually considerably less successful for speakers for whom that language is not their first. In developing a general recognizer, their speech is therefore usually excluded from the training procedure, and special adaptations of the general recognizer to such speakers are performed afterwards.
- Dialect group (Draxler and Kleiner, 2017): this item is particularly important for spontaneous non-public speech. In the case of markedly dialectal speech, various approaches for adapting the recognizer to the speakers' dialects can be used, which can to some extent remedy the degradation of results.
- Acoustic recording conditions (Zhang et al., 2018): these can substantially affect the reliability of automatic speech recognition. Their influence can partly be simulated, or removed with procedures for robust processing and enhancement of speech signals.
- Speaker gender and age: for a general speech recognizer, balancing speakers across these two categories is important when building the acoustic model. Adapting the recognizer to the speaker's gender and age is rarely performed, since on-the-fly adaptation of the model to the individual speaker is used instead; this information can, however, be used in developing and testing such methods, to establish their dependence on these two metadata items.

If the metadata presented above are not available in a given speech database, they can also be determined automatically afterwards, with some degree of reliability, using various speech pattern recognition procedures such as biometric recognition, speaker clustering, or recognition of the speaker's first language. Such automatically derived metadata can of course contain errors, which must be taken into account when they are used.

4. Metadata on recordings and speakers in the Artur speech database

In 2020, the national project Development of Slovene in a Digital Environment1 began, co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund. The operation was carried out within the Operational Programme for the Implementation of the EU Cohesion Policy 2014–2020. The project was carried out by a consortium of 12 partners, 6 public research institutions and 6 companies. It addressed several strands of language technologies, among them speech technologies, where much attention was devoted to building a 1000-hour speech database for the development of speech recognition. The lack of a sufficiently large, freely available speech database tailored to the requirements of speech recognition had proven to be the central obstacle to developing speech recognition for Slovene. The database was built by the University of Maribor (FERI), the University of Ljubljana (FE and FRI), ZRC SAZU, Alpineon and STA. It includes 4 larger subsets of different types of speech: speech read from written prompts (500 hours), public speech (public events, media, etc. – 200 hours), parliamentary speech (National Assembly of the Republic of Slovenia – 200 hours), and non-public speech (field recordings of freely spoken monologues and dialogues).

1 https://www.slovenscina.eu/

In the Artur database, the data on recordings and speakers are organized as a TSV file and as an XML record following the TEI standard. Compared with previous speech resources for Slovene, they include above all a very detailed inventory of the technical properties of the recordings (e.g., data on the properties of the source recordings and on the technical equipment used for recording) and of all circumstances that could affect these properties (from the size of the recording room and the presence of overlapping speech to speakers wearing face masks, which was common during the COVID-19 epidemic).

The final list of recording metadata in the Artur speech database is as follows (a parsing sketch based on the identifier format is given after the list):
I. Identification data and categorization of recordings:
- Recording ID: composed of the database name (Artur), the speech type (read – B, public – J, non-public – N, parliamentary – P), the four-digit speaker identification number (Gxxxx), the six-digit recording identification number (Pxxxxxx), and the file type (-avd). For recordings of public speech, which usually feature a larger number of speakers, the four-digit speaker number is replaced by the label Gvecg (meaning several speakers). Example recording ID: Artur-N-G5134-P600134-avd.
- Type of speech event: indicates whether the speech is public, non-public, parliamentary or read (Žganec Gros and Vesnicer, 2020).
- Descriptions of speech events, i.e. topics: For parliamentary speech the event is always labelled as a session of the National Assembly. For public speech the events are specified as round tables, interviews, addresses at events, press conferences, etc., or as online events for recordings made remotely. For read speech the events are described as the reading of pre-prepared written prompts or as two different types of spelling: speakers spelled a selected set of abbreviations by adding vowels (e.g., ef a ku) and predefined pairs of first and last names by adding schwas (e.g., jə o nə a sə). If a speaker spelled in an unforeseen way, the topic is labelled as spelling with vowels (or consonants) with deviation (e.g., ef fa ku); if the speaker also added or commented on something while reading, it is labelled as spelling with vowels (or schwas) with commentary.
For non-public speech, two speech event labels are used: free dialogue between two interlocutors and free monologue, where the speaker freely describes various things, say their favourite film. For the needs of the specialized recognizers developed in the project Development of Slovene in a Digital Environment, the Artur database additionally defines speech events recorded according to pre-prepared scenarios in two domains: describing faces and operating a smart home.
II. Data on the recording circumstances:
- The recording date is written in the form "month year" (e.g., April 2021).
- The recording municipality is based on the list of municipalities of the Republic of Slovenia at the time of recording (2020–2022).
- The recording room specifies more precisely where the speech event was recorded, for example in an apartment or office, in a studio or mobile recording studio, in a hall, in parliament, or outdoors.
- Room size is divided into three categories: up to 20 m², 20 to 80 m², and over 80 m².
- Presence of noise indicates whether background noise, such as rustling, hissing, traffic noise, the sound of a fan, etc., occasionally occurs in the recording. If, in the personal judgement of the recording validator, there is too much noise, the recording is placed in the group of excluded recordings.
- Crosstalk occasionally occurs in two-channel recording of non-public speech, when a spontaneous conversation between two interlocutors is recorded with two separate microphones. Crosstalk is marked if, in such a recording, the speech of the speaker from the other channel is heard frequently and clearly.
- Frequent overlapping speech is recorded for non-public speech, when the recorded private conversation is between two interlocutors who frequently speak at the same time.
- The information on whether the speaker is wearing a mask was relevant during the COVID-19 epidemic, when many public events took place with face masks in use. This significantly affects the acoustic properties of the recording. The few recordings of this kind that were included in the Artur database are therefore marked accordingly.
III. Data on the format of the source recordings:
- The most common source formats are WAV, MP3 and M4A.
- Although all recordings in the Artur database are converted to a single format (WAV, 44.1 kHz, PCM, 16-bit, mono), some recordings obtained from external sources were recorded in other formats. Where the information was available, the source format was documented in terms of sampling frequency, bit rate and bit depth.
IV. Data on the equipment used for recording:
- The most frequently used recording devices for the Artur database are laptop and desktop computers, portable recorders, smartphones, cameras and dictaphones.
- The data on the technical properties of the recording equipment comprise: a description of the device (e.g., MacBook Pro, Asus Vivobook, Zoom H4n, Zoom H1n), the operating system (e.g., iOS 14.2.1, Windows 10), any audio mixer (e.g., Focusrite Scarlett 2i2 3rd Gen), any adapter and its model (e.g., Yamaha Audiogram 6), the type of microphone (e.g., desktop, built-in or studio microphone), the microphone model (e.g., Samson Q2U), and the recording software (e.g., Adobe Audition 12, Audacity 2.3.2, Premiere Pro 14.0, Zoom, MS Teams).
V. Data on the source of the recordings:
- The source can be an own recording, made by the Artur team specifically for this database; this covers all recordings of read and non-public speech. Parliamentary and public speech, by contrast, consists of archival and other material obtained from various material providers: the National Assembly of the Republic of Slovenia, STA, Arnes, ZRC SAZU, the University of Maribor, SDJT, Radio Štajerski Val, and others.
- For public speech, a web link to the video is often available as well.
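Because the recording ID concatenates the database name, speech type, speaker number, recording number and file type in a fixed order (point I above), it can be decoded mechanically. The following sketch is our own illustration, not an official Artur tool; the field names are our invention.

    # A minimal sketch (not an official Artur tool) for decoding recording IDs
    # such as "Artur-N-G5134-P600134-avd"; field names are our own invention.
    import re

    ID_PATTERN = re.compile(
        r"^Artur-(?P<speech_type>[BJNP])"   # read/public/non-public/parliamentary
        r"-(?P<speaker>G(?:\d{4}|vecg))"    # Gxxxx, or Gvecg = several speakers
        r"-(?P<recording>P\d{6})"           # six-digit recording number
        r"-(?P<file_type>avd)$"             # file type suffix
    )

    SPEECH_TYPES = {"B": "read", "J": "public", "N": "non-public", "P": "parliamentary"}

    def parse_recording_id(rec_id: str) -> dict:
        m = ID_PATTERN.match(rec_id)
        if m is None:
            raise ValueError(f"not a valid Artur recording ID: {rec_id}")
        fields = m.groupdict()
        fields["speech_type"] = SPEECH_TYPES[fields["speech_type"]]
        return fields

    print(parse_recording_id("Artur-N-G5134-P600134-avd"))
    # {'speech_type': 'non-public', 'speaker': 'G5134',
    #  'recording': 'P600134', 'file_type': 'avd'}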
Many recording metadata were often unavailable. This holds above all for recordings that were not made by the team itself but obtained from other sources, i.e. for public and parliamentary speech. Recordings were included in the database even if some metadata were missing, since for public speech in particular we cannot expect pre-existing recordings to be documented with metadata in as much detail as is possible when recording specifically for inclusion in a speech database.

The final list of speaker metadata in the Artur speech database is as follows:
I. Identification and sociodemographic metadata:
- The speaker ID comprises the database name (Artur), the speech event type label (B, J, N or P), and a predefined four-digit speaker identification number (Gxxxx). Example speaker ID: Artur-N-G5097.
- Gender (male, female, other) is the minimally determinable speaker metadata item, even when the speech was not recorded as an own source and the speakers did not provide their sociodemographic data themselves.
- Education is divided into 9 categories: primary school – not completed; primary school – completed; lower vocational education; secondary vocational education; general and technical secondary school (gimnazija, SSI and PTI); higher vocational and first-cycle Bologna programmes (VS and UNI); professional master's degree (second Bologna cycle); master of science (pre-Bologna); doctorate.
- Age is classified into the groups 12–17, 18–29, 30–49, 50–59, and 60+ years.
II. Metadata on the speaker's region:
- The municipality of permanent residence covers both municipalities in the Republic of Slovenia and permanent residence abroad.
- The fullest possible demographic balance of speakers of read and non-public speech is also maintained with respect to the statistical region of their permanent residence.
- The item on the municipality of childhood residence covers the diachronic aspect of possible dialectal influences on the speaker's speech.
- First language. Besides speakers whose first language is Slovene, the Artur database also includes, to a smaller extent, speakers whose first language is Croatian, Serbian, Macedonian, Bosnian, Russian, Hungarian, etc. This field is filled in only for speakers from whom it was obtained directly; for public speakers it is filled in only if it can be inferred with high probability that their first language is Slovene.
- Speech characteristics refer to the social varieties of the language and were determined by the transcriber of the standardized transcription or by the recording validator. They are intended to help with possible adaptation of speech recognition models to regional characteristics, and they can also support analyses of the varieties of spoken Slovene. They are not meant as a precise expert classification of the variety of the speaker's speech in the recording. Since the detailed theory of the social varieties of Slovene (Toporišič, 2000) is difficult to apply unambiguously and robustly to empirical material, it was simplified into three basic categories: standard language, colloquial language, and dialect. Given the circumstances of speech, it was anticipated that public and parliamentary speech would feature either standard or colloquial language; we counted as colloquial those cases where systematic phonological phenomena characteristic of non-standard varieties were frequently present in the speaker's speech. Conversely, speech with a recognizably regionally coloured melody but with an evident deliberate shift away from the speaker's everyday colloquial language towards the standard was labelled as standard language; this applies especially to speakers from the periphery of Slovenia and other non-central parts of the country. Differences in pronunciation were also observed in read speech, which, owing to the circumstances (reading pre-written sentences), is hard to divide into standard and colloquial; the labels standard pronunciation and non-standard pronunciation were therefore used for read speech. The label dialect may be present above all in non-public speech; where it was chosen, a label specifying the type of dialect is added, determined on the basis of the speaker's municipality of childhood residence.
- The last label refers to noticeable pronunciation difficulties: individual speakers exhibit particular features related, for example, to the pronunciation of the sounds r, l, and similar.

These metadata will be provided in the Artur database with Slovene labels as well as with translations into English.
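Since the metadata are distributed as a TSV file alongside the TEI XML (see the beginning of this section), balancing checks and subset selection can be scripted directly against them. The sketch below is hypothetical: the file name and the column names speaker_id, first_language and age_group are invented for illustration and do not claim to match the released Artur field names.

    # A minimal, hypothetical sketch of selecting speakers from a metadata TSV;
    # the file name and column names are invented for illustration and are not
    # the official Artur field names.
    import csv

    def select_speakers(tsv_path: str, first_language: str, age_group: str) -> list[str]:
        """Return IDs of speakers matching the given first language and age group."""
        with open(tsv_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            return [
                row["speaker_id"]
                for row in reader
                if row["first_language"] == first_language
                and row["age_group"] == age_group
            ]

    # e.g. all speakers with Slovene as first language, aged 18-29
    print(select_speakers("artur_speakers.tsv", "Slovene", "18-29"))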
5. Divergences in the metadata on recordings and speakers

Speech corpora created for the needs of linguistic research and speech databases prepared for speech recognition are, as a rule, very similar speech resources. It therefore makes sense to look for synergies and to use at least part of the material for both purposes (Žgank et al., 2014). The Gos Videolectures database was already built with speech recognition in mind (Verdonik, 2018), yet its metadata still followed the scheme established in the GOS corpus fairly consistently. In the project Development of Slovene in a Digital Environment, too, a suitable part of the large volume of recordings for the Artur speech database was selected to extend the GOS speech corpus. In doing so, however, considerable divergences emerged, to a large extent precisely in the metadata on speakers and recordings. These mostly result from more precise data inventories, from the specifics of the material, or from the purpose of the database, but they cause problems when materials are merged. For which types of metadata do divergent decisions arise most often?

5.1. Metadata on recordings

There are different categorizations of the recorded speech, since these are as a rule made on the basis of what a given speech resource contains. GOS thus distinguished four types of discourse: public informative-educational, public entertainment, non-public non-private, and non-public private. If we compare this with the categorization in the Artur database, we see that Artur adds the category of parliamentary speech but lacks public entertainment, which practically does not occur in Artur; its public speech can instead be classified wholesale as public informative-educational. There is likewise no non-public non-private speech, which refers to various administrative, service, commercial and other similar non-private speech situations in everyday life. Present, however, is read speech, which refers to a very specific speech situation created for the purpose of recording material for the Artur database, in which speakers read pre-prepared sentences one by one.

Besides the umbrella categorization of recordings into a small number of top-level categories, both the GOS corpus and the Artur database use more detailed specifications of the recorded speech by speech event. The GOS corpus records more than 20 types of speech events, and so does the Artur database, although roughly half of Artur's are devoted to specifying material that is highly specific to the needs of speech recognizers (spelling, domain-specific recognizers for the smart home and for describing faces). The specification of the speech event type is extremely important, since it also allows the collected material to be recategorized later when different resources are merged; it is therefore probably one of the most essential metadata about recording types for any speech resource, more important than the broader umbrella categorization, which can also be changed afterwards by sorting the information on speech event types, or partly on the basis of information about the source.

The two obligatory metadata on recordings in speech resources are the time and the location of recording. While for time the divergences can only concern the greater or lesser precision of the recorded value, for location differences arise as to which units we rely on. In the GOS corpus, this item was defined in two ways: as the locality, i.e. the name of the town or village, which is not accessible through the web concordancer in order to protect the speakers' identity, and as the region of recording, which can be defined in very different ways.
In the GOS corpus the region was marked on the basis of vehicle-registration areas; in the Artur database, the location is recorded as the municipality of recording. In the Slovene context, given the large number and fragmentation of municipalities, locating the recording by municipality seems a suitable compromise: in rural Slovenia, giving the exact locality by village name can reveal the speakers' identity, while units larger than the municipality (e.g., administrative units, vehicle-registration areas, or statistical regions) are no longer precise enough or consistent with the dialectal diversity that is proverbially great in Slovenia.

The source item carries information about the original copyright holder. As with written texts, the producers of spoken texts are at the same time authors holding copyright2 over the texts, and there are often contractual obligations that this information be properly stated in the language resource. With speech recordings, we encounter four kinds of situations regarding copyright and source attribution: (1) For a field recording made for the purposes of the speech resource and capturing authentic speech in everyday situations, the speakers as a rule transfer their rights to the holder of the project in which the resource is created. The practice is to mark the source in such cases as a field/own recording. (2) For a recording broadcast on radio or television, the copyright holders are often the media companies, which are consequently cited as the source. For web sources too (e.g., recordings on YouTube3), copyright must often be settled with the holder(s) and the source stated accordingly in the metadata. For online events organized and published by some institution (e.g., web conferences, workshops, seminars), copyright must often be settled with the direct producers of these texts. The question then arises of how best to define the source item: as the individual(s) who ceded the rights and appear in the recording, or as the institution that organized and published the online event. In the Artur database, the second option was chosen for such recordings. (3) Certain internet sources already have copyright settled in a way that permits further use, namely under one of the Creative Commons licences. Two larger sources of recordings in Slovene of this kind are the portals Videolectures.net and Arnes Video. In such cases, the existing databases for Slovene simply cite the portal name as the source. (4) Certain spoken texts are not protected by copyright. Under Article 9 of the Slovene Copyright and Related Rights Act (ZASP), these are "official texts from the legislative, administrative and judicial domains". Although there is as yet no case law or doctrine on this, spoken texts produced in the National Assembly of the Republic of Slovenia within legislative procedures can, among others, be counted as such. In this case, the Artur database, which contains such recordings, simply cites the National Assembly of the Republic of Slovenia as the source.

Other types of recording metadata, as presented in Sections 2, 3 and 4, appear in some speech resources and not in others, depending on the specific purpose of the resource. When merging speech resources, they can either be omitted or left undefined if they were not recorded and are not available.

2 We use the term copyright here for all economic copyright, other author's rights under the ZASP, and related rights that may arise during recording. Questions of personality rights and personal data protection, which are just as important for any use of recordings in speech resources, are not discussed here, as they are not relevant in the context of this paper; we merely alert the reader that this legal aspect also hinders the use of recordings for speech resources.
3 The YouTube licence itself does not permit the use of recordings for a speech resource.

5.2. Metadata on speakers

Although speaker metadata are less diverse than recording metadata, differences in how they are defined arise in practically all categories except gender.

The most demanding question concerns the need to record the various regional influences on an individual's speech. Two points are problematic here:
1. The specification of regional influences on a speaker's speech is not necessarily unambiguous. In the 2014 supplement to the spoken part of the BNC (British National Corpus), for example, which captured only everyday conversations, speakers were left to describe their dialect in their own words, and these descriptions were then mapped onto the scheme of statistical territorial units of Great Britain (Love et al., 2017). In Slovene speech resources, too, the established practice is to record speakers' regions through geopolitical rather than geolinguistic categories. The reason is presumably that reliable geolinguistic categorization can only be done by experts, and only afterwards, on the basis of the collected data. In the GOS corpus, the categories for the speakers' region were thus defined on the basis of vehicle-registration areas, of which Slovenia has 11 in total, supplemented with categories for Slovenes across the borders (Austria, Italy, Hungary) and for speakers whose first language is not Slovene (abroad). Such a division is extremely loose and imprecise compared to Slovene dialectal diversity, and the very concept of "regional affiliation", reduced to the registration label on a car, seems inadequate, even though its robustness is very useful in the field. For the Artur database, a more precise, unambiguous, simple and less contentious specification of the item carrying information on the speakers' region was therefore sought. Since the locality name, especially in rural areas, had already been identified as problematic for potentially revealing the speaker's identity, the municipality was chosen as the basic unit. At the time the recordings for Artur were being collected, Slovenia was divided into 212 municipalities. A further advantage of this category is that municipalities can be mapped simply and unambiguously onto larger geopolitical units, the 12 statistical regions of Slovenia as defined, at the time the database was being built, by the Statistical Office of the Republic of Slovenia (a mapping sketch follows after this list).
2. Many people today do not spend their whole lives in a geographically limited, speech-homogeneous area; many are mobile, whether through daily or weekly commuting for school or work, or through relocation. The picture of regional influences on a speaker's speech can therefore be extremely complex for certain individuals, and sometimes also very specific. The GOS corpus thus allowed speakers to select up to five "regional affiliations" for themselves.
This produces a rather complex picture: besides speakers with a single region, we also get quite a few speakers with very different combinations of regions, with any individual combination covering only a few speakers. For the latter it is probably sensible, in the end, to record just one common category, "various regional influences", as is done in the C-ORAL-ROM corpus (Cresti and Moneglia, 2005). In the Artur database, the specification of geographic mobility over time was simplified to two items, the first referring to the municipality of childhood residence and the second to the municipality of current permanent residence. This loses a good deal of information about a person's possible additional mobility, which would otherwise matter for a detailed analysis of an individual speaker's speech; it is questionable, however, how relevant such detail is for (quantitative) corpus analysis or for possible adaptation of a recognizer to speakers by region.
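The deterministic municipality-to-region mapping invoked in point 1 above can be realized as a simple lookup table. The sketch below is illustrative only: the table is truncated to four municipalities and would in practice be populated from the official classification of the Statistical Office of the Republic of Slovenia.

    # A toy sketch of mapping municipalities onto statistical regions; the
    # dictionary is deliberately truncated and would in practice be filled
    # from the official classification of the Statistical Office of Slovenia.
    MUNICIPALITY_TO_REGION = {
        "Ljubljana": "Osrednjeslovenska",
        "Maribor": "Podravska",
        "Koper": "Obalno-kraška",
        "Novo mesto": "Jugovzhodna Slovenija",
        # ... remaining municipalities (212 at the time of recording)
    }

    def statistical_region(municipality: str) -> str:
        """Return the statistical region for a municipality, or 'unknown'
        for values not in the table (e.g., permanent residence abroad)."""
        return MUNICIPALITY_TO_REGION.get(municipality, "unknown")

    print(statistical_region("Maribor"))   # -> Podravska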
For a certain share of speakers, Slovene is not their first language. This, too, is a very important data point for any speech resource in which such speakers appear. Neither the GOS corpus nor the Artur database excluded non-native speakers of Slovene; on the contrary, they were deliberately included. The item on the speaker's first language is therefore essential in both resources.

Neither the item on a speaker's geographic affiliation nor the one on their first language, however, tells us what the speech of a given speaker in the resource is actually like in terms of social varieties. The latter can only be established through (chiefly auditory) analysis of the speech. It is thus not a metadata item recorded in the field but a subsequent interpretation of the speech data. It was not produced for the GOS corpus, whereas for the Artur database such a label was requested for the needs of speech recognizers.

Education and speaker age are metadata through which we primarily ensure an adequate demographic spread of the speakers captured in the resource. For recordings of public speech they are mostly not even available, so these metadata are missing for a large share of recordings in both GOS and Artur. Where they are available, the age and education groups are defined somewhat differently and at different granularities, which makes merging resources harder. In our view, the minimal age-group categories are teenagers (roughly up to 19), pensioners (roughly over 60), and everything in between. For education, GOS has a 4-level division and Artur a 9-level one; in our view, the minimal division is into at least two groups, according to whether the person finished their education with secondary school or continued schooling. Greater detail in speaker metadata would probably be of interest above all for sociolinguistic research, so due caution is needed against overhasty generalization into very coarse categories.
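To make the proposed minimal categories concrete, the sketch below collapses Artur's age and education schemes (Section 4) into them. The input labels are English paraphrases of the categories listed in Section 4, not the exact strings of the released files; note also that the boundaries do not align perfectly (Artur's 18–29 group straddles the proposed 19-year boundary), which is precisely the kind of friction the harmonization proposal addresses.

    # A sketch of harmonizing divergent metadata schemes into the minimal
    # shared categories proposed above. Input labels are paraphrased for
    # illustration; the released TSV/TEI files may spell them differently.

    def minimal_age_group(artur_age_group: str) -> str:
        """Collapse Artur's five age groups into the three minimal ones."""
        if artur_age_group == "12-17":
            return "teenager"      # roughly up to 19
        if artur_age_group == "60+":
            return "pensioner"     # roughly over 60
        return "adult"             # everything in between

    def minimal_education(artur_education: str) -> str:
        """Collapse Artur's nine education levels into two minimal groups:
        finished with secondary school vs. continued schooling."""
        secondary_or_less = {
            "primary school - not completed", "primary school - completed",
            "lower vocational education", "secondary vocational education",
            "gimnazija, SSI and PTI",
        }
        if artur_education in secondary_or_less:
            return "secondary or less"
        return "post-secondary"

    print(minimal_age_group("18-29"), minimal_education("doctorate"))
    # -> adult post-secondary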
6. Conclusion

In this paper we have discussed the metadata on recordings and speakers typically used in spoken language resources. We focused on the existing freely available corpus-type speech resources for Slovene, i.e. the reference speech corpus GOS, the Gos Videolectures database, and the speech database under construction within the project Development of Slovene in a Digital Environment, Artur. The speech data from these three databases constitute the source material for extending the reference speech corpus GOS, and in merging them problems appear which stem, among other things,4 from differences in the inventory and categorization of the metadata on recordings and speakers.

4 There are also certain differences in the conventions for transcribing speech. In this paper we focus only on the metadata on recordings and speakers.

In the future we would like to see greater homogenization of recording and speaker metadata, especially for the key metadata that are essential both for monitoring the balance of the material and for categorizing speech data. For recordings, these key metadata are: (1) the description of the speech event, which must be detailed enough and can be understood in terms of speech situations sharing a larger number of contextual properties, including the type of location, the type of relationship between producers and addressees, and the purpose and channel of communication; (2) the time and location of recording, where it is important, particularly for location, that it be sufficiently detailed, e.g. the name of the locality or municipality where the recording takes place; (3) the source of the recording, which is important for correct handling of copyright and also helpful for sorting speech data by type; for subsequent access to video content, a link to the video, if one exists, is almost indispensable; (4) always useful and welcome, though perhaps less essential, are all available data on the recording equipment and the technical properties of the recording. For speakers, the key metadata are: (1) identification, (2) gender, (3) age, (4) first language, and (5) region(s), where the latter must be defined in sufficient detail (e.g., at the level of the locality or municipality) and at least roughly take the diachronic aspect into account. An item on (6) level of education is also frequently present, whereas recording metadata on occupation, social class, or affiliation with various socio-cultural groups has not so far been the practice in Slovene speech resources.

Among other things, the paper also presents a detailed description of the recording and speaker metadata in the Artur speech database. In assigning the metadata labels it turned out that in absolutely all categories the entries fail to be unambiguous and straightforward to determine. Among the speaker labels, the greatest challenge was the speech characteristics category, as the annotator was frequently faced with the dilemma of whether the language was still standard or already colloquial, or colloquial versus dialect. As described in Section 4, systematic phonological phenomena characteristic of non-standard varieties were the decisive criterion for the label colloquial language; conversely, a speaker's noticeable effort to use the standard language, even where a regionally coloured melody could still be detected in their speech, was decisive for the label standard language. If speech was labelled as dialectal, the exact determination of the dialect type relied on the municipality of childhood residence. The remaining speaker metadata were either obtained directly from the speakers or left undetermined. The exception is the speaker's gender, which we determined from the recording even when no direct information was available. For public speakers for whom we had no direct information but could infer with high probability that their first language is Slovene, this information could also be added. There were also many challenges in obtaining recording metadata: without information from the field, it is extremely difficult to infer the type and size of the recording room, or to identify the date and municipality of the recorded event, and impossible to guarantee a precise technical inventory of the recording equipment.
In the Artur database these metadata were entered only when they were known.

The Artur database is primarily intended for the development of speech recognition models, but with its exceptionally detailed metadata inventory it can serve as a starting point for upgrading it or for developing similar speech resources in the future. After its completion, from November 2022 onwards, it will be freely available through the CLARIN.SI repository under a Creative Commons licence.

7. References

Jens Allwood, Maria Björnberg, Leif Grönqvist, Elisabeth Ahlsen and Cajsa Ottesjö. 2000. The spoken language corpus at the Linguistics Department, Göteborg University. Forum Qualitative Social Research, 1(3).
Lou Burnard, ed. 2007. Reference Guide for the British National Corpus (XML Edition). http://www.natcorp.ox.ac.uk/XMLedition/URG/.
Patrick Cardinal, Najim Dehak, Yu Zhang and James Glass. 2015. Speaker adaptation using the i-vector technique for bottleneck features. In: Proceedings of Interspeech 2015, pp. 2867–2871.
Emanuela Cresti and Massimo Moneglia, eds. 2005. C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. John Benjamins Publishing Company, Amsterdam/Philadelphia.
Christoph Draxler and Stefan Kleiner. 2017. A cross-database comparison of two large German speech databases. In: Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK, 10–15 August 2015. International Phonetic Association.
Oliver Ehmer and Camille Martinez. 2014. Creating a multimodal corpus of spoken world French. In: Şükriye Ruhi, Michael Haugh, Thomas Schmidt and Kai Wörner, eds., Best Practices for Spoken Corpora in Linguistic Research, pp. 142–161. Cambridge Scholars Publishing, Newcastle upon Tyne.
Santosh Gondi and Vineel Pratap. 2021. Performance evaluation of offline speech recognition on edge devices. Electronics, 10, 2697. MDPI, Basel, Switzerland.
John R. Hershey, Jonathan Le Roux, Shinji Watanabe, Scott Wisdom, Zhuo Chen and Yusuf Isik. 2017. Novel deep architectures in speech processing. In: New Era for Robust Speech Recognition, pp. 135–164. Springer.
Martin Karafiát, Karel Veselý, Kateřina Žmolíková, Marc Delcroix, Shinji Watanabe, Lukáš Burget, Jan "Honza" Černocký and Igor Szőke. 2017. Training data augmentation and data selection. In: New Era for Robust Speech Recognition, pp. 245–260. Springer.
Jáchym Kolář and Jan Švec. 2008. Structural metadata annotation of speech corpora: Comparing broadcast news and broadcast conversations. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Robbie Love, Claire Dembry, Andrew Hardie, Vaclav Brezina and Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3):319–344.
Nelleke Oostdijk, Wim Goedertier, Frank Van Eynde, Lou Boves, Jean-Pierre Martens, Michael Moortgat and Harald Baayen. 2002. Experiences from the Spoken Dutch Corpus project. In: M. González Rodriguez and C. Paz Suárez Araujo, eds., Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), pp. 340–347. Las Palmas, Canary Islands. ELRA.
Petr Pořízka. 2009. Olomouc corpus of spoken Czech: Characterization and main features of the project. Linguistik online, 38(2). http://www.linguistik-online.de/38_09/porizka.html.
Daniel Povey, Hong-Kwang J. Kuo and Hagen Soltau. 2008. Fast speaker adaptive training for speech recognition. In: Proceedings of Interspeech 2008, pp. 1245–1248.
Jože Toporišič. 2000. Slovenska slovnica. Založba Obzorja, Maribor.
Darinka Verdonik. 2018. Korpus in baza Gos Videolectures. In: Darja Fišer and Andrej Pančur, eds., Zbornik konference Jezikovne tehnologije in digitalna humanistika, pp. 265–268. Znanstvena založba Filozofske fakultete, Ljubljana.
Darinka Verdonik and Ana Zwitter Vitez. 2011. Slovenski govorni korpus Gos. Trojina, zavod za uporabno slovenistiko, Ljubljana.
Jana Zemljarič Miklavčič. 2008. Govorni korpusi. Znanstvena založba Filozofske fakultete, Ljubljana.
Zixing Zhang, Jürgen Geiger, Jouni Pohjalainen, Amr El-Desoky Mousa, Wenyu Jin and Björn Schuller. 2018. Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Transactions on Intelligent Systems and Technology (TIST), 9(5):1–28.
Jerneja Žganec Gros and Boštjan Vesnicer. 2020. Izbor fonetično uravnoteženih besedilnih predlog za bazo branega govora. In: Tanja Mirtič and Marko Snoj, eds., Razprave II. razreda SAZU: 1. slovenski pravorečni posvet, pp. 111–119. Slovenska akademija znanosti in umetnosti, Ljubljana.
Andrej Žgank, Ana Zwitter Vitez and Darinka Verdonik. 2014. The Slovene BNSI broadcast news database and reference speech corpus GOS: Towards the uniform guidelines for future work. In: Nicoletta Calzolari et al., eds., Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2644–2647, Reykjavik, Iceland. ELRA.
Application of the Europeana Data Model (EDM) in the digitalization of cultural heritage: The example of the Slovene Ethnographic Museum's Skušek Collection in the PAGODE-Europeana China project

Maja Veselič,* Dunja Zorman†
Department of Asian Studies, Faculty of Arts, University of Ljubljana
Aškerčeva 2, 1000 Ljubljana
*maja.veselic@ff.uni-lj.si
†dunja.zorman@ff.uni-lj.si

Abstract
This paper first introduces the data model of the East Asian Collections in Slovenia database. It then details how the model was adjusted to the Europeana Data Model for the purpose of publishing more than 900 objects of Chinese cultural heritage, mostly photographs, in Europeana. It recounts the steps taken in creating the adjusted model and the procedure of preparing the data for import using the MINT tool, and concludes with a reflection on the positive impacts of this experience on the work with the original database.
1. Introduction

For the last several years, states and supranational organizations have been encouraging the institutions that keep and protect cultural heritage – galleries, libraries, archives and museums – to accelerate the digitization of cultural heritage. This is meant not only to protect and preserve cultural heritage and to ease access to tangible and intangible heritage for research, educational or amateur purposes, but also to stimulate economic growth through the promotion of creativity and innovation, e.g. in tourism with new digital tourist products, or as a source of ideas and inspiration in the so-called creative industries.1

But for digitized heritage to be truly accessible and usable – for a user's search to return as many results as possible that match the search parameters as precisely as possible, to surface relevant results even when the data are recorded in a language other than the language of the search, and to allow results to be further filtered by various parameters – digitized objects must be equipped with metadata of the highest possible quality. Such metadata greatly ease the selection of material for the specific needs of concrete users, which among other things contributes to better curation (e.g., in the form of digital galleries and exhibitions) and easier visualization of content.

In this paper we present our experience of adjusting the data model of the East Asian Collections in Slovenia (VAZ) database to the Europeana Data Model (EDM) for the import of selected digitized objects from the Skušek Collection into the European digital library Europeana. These objects are part of a collection of predominantly Chinese objects brought from Beijing to Ljubljana in 1920 by the naval officer Ivan Skušek Jr. and today kept by the Slovene Ethnographic Museum. The objects were digitized and published in Europeana within the project PAGODE-Europeana China (2020–2021, hereinafter PAGODE),2 while the analysis of the objects and the creation of the descriptive (content) data took place within the project group East Asian Collections in Slovenia (2018–2021, VAZ),3 whose database and portal of the same name also constitute the original location of the digital photographs of the objects.

For someone encountering metadata and databases for the first time, without specialized technical knowledge, facing the EDM and the processing of data for import and publication in Europeana is daunting. Through a reflection on our own mistakes and eventual success, we hope to encourage those who hesitate to publish their material in Europeana.

2. The Europeana Data Model (EDM)

The European digital library Europeana, financed by the European Union, is among the most important collections of digital cultural heritage in Europe. Today Europeana holds material from more than 4,000 individual institutions, comprising several tens of millions of image and text files, almost a million audio and video recordings, and more than 8,000 3D models.4 It should be stressed that the library does not store the digitized cultural heritage objects on its own servers;5 rather, these are accessible via links to various institutional and national repositories and databases. Europeana merely displays the digitized objects and publishes their (meta)data. Nor is Europeana in direct contact with individual institutions: it obtains the data from aggregators, which collect, review and, before import into Europeana, enrich the data supplied by various cultural and heritage institutions or organizations (the so-called content providers).

1 https://digital-strategy.ec.europa.eu/en/news/commission-proposes-common-european-data-space-cultural-heritage.
2 https://photoconsortium.net/pagode/.
3 https://vazcollections.si/.
4 https://www.europeana.eu/en/about-us.
5 The term object does not denote only material objects: Europeana presents objects of both tangible and intangible heritage, as well as born-digital objects.
Many, though not all, aggregators are organized as intermediaries at the national level; for Slovenia this role is performed by the National and University Library in Ljubljana.6

6 http://www.agregator.si/.

A particular challenge in the case of Europeana is thus the sheer number of institutions publishing their content there and the diversity of the ways of organizing (meta)data that these institutions have developed over their institutional histories and practices. Some metadata standards are widely used, for example LIDO, developed by the International Council of Museums (ICOM) and used by numerous museums. But material of different types comes into Europeana from different kinds of cultural institutions, and the library is moreover a multilingual environment. Europeana therefore developed its own metadata model, integrating elements of previously established standards, notably ORE, DCMI, SKOS and CRM.

The EDM divides metadata into three core classes: (1) metadata attached to the cultural heritage object that has been digitized (edm:ProvidedCHO), e.g. when and where the object was created, who created it, what its dimensions are; (2) metadata attached to the web resource or resources associated with the object (edm:WebResource), e.g. the format of the web resource, who contributed it, what the copyright status is; and (3) metadata connected with the aggregation, i.e. the mechanism that binds the two classes above together and represents the import of the data into Europeana (ore:Aggregation), e.g. which aggregator contributes the data and where they are displayed (Europeana, 2017).

In addition, the EDM provides contextual classes: metadata about an agent (edm:Agent), a place (edm:Place), a time span (edm:TimeSpan), a concept (skos:Concept) and a licence (cc:Licence). Among the agent data, for example, we record the people the object has encountered in its life, i.e. those connected with it in any way, and among the place data the locations where it has been (Europeana, 2017).
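To make the three core classes concrete, the sketch below assembles a minimal EDM record as RDF/XML using Python's standard library. The class and property names (edm:ProvidedCHO, edm:WebResource, ore:Aggregation, edm:aggregatedCHO, edm:isShownBy, edm:dataProvider) are genuine EDM vocabulary, while the identifiers, title and values are invented for illustration and fall well short of Europeana's full ingestion requirements.

    # A minimal, illustrative EDM record (core classes only), serialized as
    # RDF/XML with the standard library; identifiers and values are invented
    # and simplified against Europeana's full ingestion requirements.
    import xml.etree.ElementTree as ET

    NS = {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "dc": "http://purl.org/dc/elements/1.1/",
        "edm": "http://www.europeana.eu/schemas/edm/",
        "ore": "http://www.openarchives.org/ore/terms/",
    }
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)

    def q(prefix, tag):          # helper: fully qualified tag name
        return f"{{{NS[prefix]}}}{tag}"

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
    CHO_ID = "https://example.org/vaz/skusek/0001"       # invented identifier
    IMG_URL = "https://example.org/vaz/skusek/0001.jpg"  # invented image URL

    root = ET.Element(q("rdf", "RDF"))

    cho = ET.SubElement(root, q("edm", "ProvidedCHO"), {q("rdf", "about"): CHO_ID})
    title = ET.SubElement(cho, q("dc", "title"), {XML_LANG: "en"})
    title.text = "Photograph of a Beijing street scene"  # invented title
    ET.SubElement(cho, q("edm", "type")).text = "IMAGE"

    ET.SubElement(root, q("edm", "WebResource"), {q("rdf", "about"): IMG_URL})

    agg = ET.SubElement(root, q("ore", "Aggregation"),
                        {q("rdf", "about"): CHO_ID + "#aggregation"})
    ET.SubElement(agg, q("edm", "aggregatedCHO"), {q("rdf", "resource"): CHO_ID})
    ET.SubElement(agg, q("edm", "dataProvider")).text = "Slovene Ethnographic Museum"
    ET.SubElement(agg, q("edm", "isShownBy"), {q("rdf", "resource"): IMG_URL})

    print(ET.tostring(root, encoding="unicode"))

The xml:lang attribute on dc:title illustrates the language tagging discussed below, which Europeana relies on for its multilingual display.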
Beyond this, Europeana highlights two further aspects of metadata quality: multilingualism and the use of controlled vocabularies. Europeana displays collections and objects in twenty-four European languages. To this end, the model must include information about the language in which the values, i.e. the data in each field, are recorded. In addition, for the broadest possible language coverage, extensive use of identifiers from linked open data and controlled vocabularies, such as Wikidata, the Getty Art & Architecture Thesaurus (AAT) or the geographical database GeoNames, is desirable. Metadata tied to identifiers are then displayed not only in the language of the search but also in the other European languages covered by the controlled vocabularies and linked open databases. Identifiers also serve the further semantic enrichment of metadata. This is excellent for the end user, as it increases the number of keywords through which a given object can be found in the Europeana browser.

The richness and structuredness of the data thus affect how findable the objects are. Europeana calls the various paths to objects discovery scenarios7 and distinguishes four basic modes of findability: by time or time span, by themes and types, by agents, and by locations.

To encourage providers to publish the richest and best-structured metadata possible, Europeana has in recent years developed a three-tier metadata quality standard, where each tier enables a particular user experience. Tier A allows only basic searching for specific objects, tier B enables the exploration of content, and tier C represents a knowledge platform, as it enables numerous cross-cutting searches, among others by specific themes, motifs, colours and other properties of the cultural heritage object (Europeana, 2019b). Although the PAGODE project did not commit itself to a particular metadata tier, most partners strove to reach tier C.8

7 The metadata categories required for each scenario are presented in Charles, Isaac and Hill (2015).
8 The Europeana Publishing Framework also distinguishes different tiers of content quality (1 to 4), measuring the quality of the digital copy (photograph, audio recording, etc.) and the possibilities of its reuse with respect to copyright (Europeana, 2019a).

3. The PAGODE – Europeana China project

The international project PAGODE – Europeana China (PAGODE),9 which ran for 18 months (2020–2021) and was led by the Italian Ministry of Economic Development, brought together public and private institutions from the fields of research, cultural heritage and cultural management with the aim of enriching, foregrounding and further illuminating Chinese cultural heritage in Europeana. The project involved 6 partners and 15 associate partners. The main goal was to add 10,000 newly digitized objects of Chinese cultural heritage to Europeana, to enrich the metadata of 20,000 existing objects automatically, and to add metadata to another 2,000 objects by manual annotation in the form of a community crowdsourcing campaign. The second central goal was to present Chinese heritage to Europeana's users through curated content – galleries, blogs, exhibitions, and a dedicated hub for Chinese heritage.10

9 The project was financed by the European Commission under the Connecting Europe Facility mechanism.
10 https://www.europeana.eu/en/chinese-heritage. The hub remains the central collection point for curated content on Chinese heritage in Europeana after the end of the PAGODE project.

Figure 1: Curated content on Chinese cultural heritage in Europeana under a common thematic entry point.
In the PAGODE project, aggregation was carried out for most content providers by the project partner and accredited aggregator Photoconsortium,11 which otherwise specializes in photographic content in Europeana. Besides the fact that many of the participating content providers contributed precisely photographic material to Europeana, this type of aggregation allowed better control over the fulfilment of the ambitious goals regarding new entries and automatic enrichment.

The Department of Asian Studies of the Faculty of Arts, University of Ljubljana, also took part in the project as a partner, more precisely three members of the national research project East Asian Collections in Slovenia (Vzhodnoazijske zbirke v Sloveniji, VAZ, 2018–2021).12 Our task was to set up the semantic scheme of the project, which was to guide both the selection of new objects (defining what Chinese heritage in Europe actually is) and the enrichment of already existing objects (keywords that define Chinese heritage). Although we had no experience with Europeana, and the VAZ project itself had only just begun, the invitation to participate attracted us above all because it promised access to additional funds for the digitization of objects that we intended to include in the VAZ digital database.

The largest collection of Chinese objects in Slovenia is the Skušek Collection in the Slovene Ethnographic Museum (SEM), which comprises more than 500 objects of smaller and larger dimensions, among them furniture, decorative screens, porcelain, textiles, paintings, statuettes, smoking paraphernalia, coins, books and photographs. The objects were brought to Ljubljana in 1920 by Ivan Skušek Jr. (1877–1947), a navy officer who found himself interned in Beijing during the First World War, together with his future wife, the Japanese Kondō Kawase Tsuneko (1893–1963), who had been living in China and was later baptized Marija Skušek. Skušek intended to establish a museum of Chinese culture at home, but financial difficulties prevented him from doing so. After her husband's death, Marija Skušek left the collection to the state, and most of the objects ended up in the Slovene Ethnographic Museum. Only a few of them are displayed in the permanent exhibition, while until recently many of them had not even been properly inventoried.13

At our suggestion, SEM joined the PAGODE project as an associated partner and for this purpose prepared digital photographs of approximately 200 coins, as well as scans of two printed albums, published in Japan, of architectural photographs and sketches of imperial Beijing, of two painted albums with scenes of Chinese punishment and of the everyday life of wealthy women and children, and of an album with 450 pasted-in photographs of Beijing and its surroundings. The descriptive data on the objects were prepared within the VAZ project, while the adaptation of the data schema used in the VAZ database for the purposes of the import into Europeana was prepared by the two authors. In the remainder of the paper we therefore first present the data schema developed in the VAZ project, and then the adaptation of this schema for the import into Europeana.

4. The data schema of the VAZ database
The VAZ project is a national research project which formalized several years of efforts towards a systematic inventory and scholarly treatment of East Asian collections and objects in various Slovenian museums (Vampelj Suhadolnik, 2019). The joint database and portal, which are the central results of the project work, represent a kind of digital counterpart of the museums of East Asian arts and cultures found in many capitals and large cities across Europe, in North America and elsewhere. As the initiator and leading partner of the project, the Department of Asian Studies of the Faculty of Arts, University of Ljubljana, strives for the permanent preservation, extension and updating of the database and portal also after the end of the project, of course to the extent that future financial capacities and work obligations will allow.14

One of our central goals has always been for the database of photographs and data to be publicly accessible and easy to use, since the great majority of the objects have not been exhibited for decades, and only a few of them have received a more thorough analysis, as Slovenian museum institutions lack staff with the appropriate specialized knowledge.15 Within the project we treated the already mentioned Skušek Collection from SEM, the Alma Karlin Collection and the Collection of Objects from Asia and South America from the Celje Regional Museum, the East Asian pieces in the ceramics collection of the National Museum, and the album of East Asian postcards of the navy chaplain Ivan Koršič, kept by the Maritime Museum in Piran.

In designing the data schema we first consulted the curators of the collections under consideration and some of their museum colleagues. All the participating institutions use the Galis software, which its designers developed in cooperation with domestic and foreign professional institutions, including Europeana. The schema of data that can be recorded for an individual object is extremely extensive, but in practice curators fill in only a few basic categories, whereby the selection is influenced both by the types of objects in their care and by their entirely individual habits and ambitions. The foreign experts we collaborated with, both museum curators and academic researchers specializing in various aspects of East Asian art, also advised us to develop the data schema according to the objects actually found in Slovenian museum collections. These are in fact extremely diverse and include ceramics and porcelain, statues, furniture, textiles, fans, numismatics, photographs and postcards, paintings and woodcuts, weapons, architectural models, and various objects of everyday use. After a basic review of the selected collections, we divided the object types within the research group according to use,16 and each member then reviewed the publicly available online data schemas of various renowned museums and archives for the assigned object types.
11 https://www.photoconsortium.net/.
12 The project, with the full title East Asian Collections in Slovenia: Inclusion of Slovenia in the Global Exchanges of Objects and Ideas with East Asia (2018–2021, no. J7-9429), was funded by the Slovenian Research Agency (ARRS). Besides the two authors of this paper, Nataša Vampelj Suhadolnik also took part in the PAGODE project.
13 The journey of the collection from China to SEM is described by Berdajs (2021) and Motoh (2021), and the reasons for its insufficient treatment in the museum by Vampelj Suhadolnik (2021).
14 The acquisition of a new national research project, Orphaned Objects: Treatment of East Asian Objects outside Organized Collecting Practices in the Slovenian Space (2021–2024, ARRS, no. J6-3133), provides funds for further work and technical improvements.
15 Within the VAZ project, the analysis was carried out mostly by sinologists, Japanologists and a Koreanist, with the support of the responsible curators. In addition, we organized several workshops at which selected objects or groups of objects were also examined by foreign experts.
In doing so we were of course limited to institutions that had already digitized parts of their collections and made them available to the public, and to those kinds of data which they considered relevant for visitors and therefore displayed on their pages.

After several rounds of consultations and of widening and narrowing the set of data elements, we arrived at the schema below, in which the metadata are divided into those visible to the visitors of the portal and those we collect for our research analyses and administration. Data that we do not display on the portal are written in italics. An asterisk marks the data used on the portal as display filters.

Our database currently takes the form of an Excel spreadsheet with separate sheets for the object types, but due to its extensive branching it is unfriendly for data entry and hard to survey. We are in the process of a technical overhaul of the database, so that in the future its entry point will be a website and the interface will take the form of a data card. We will also add some new categories of administrative data, e.g. the authors of an object's record.
Administrative data
• Serial number (the inventory number of the object in our database, consisting of letters for the type and a sequential entry number)
• Photograph (names of the photo files showing the object)
• Copyright
• Entry process data (we record whether a given entry is completed, proofread and transferred to the portal)

Location of the object
• Collection/album*
• Museum*

Provenance and treatment of the object
• Current owner
• Time of acquisition
• Mode of acquisition
• Past owners and periods of ownership
• Condition of the object, treatment, damage
• Exhibition history
• Publications in the media
• Original inventory numbers

Origin of the object
• Century*
• Period* (dynastic periods)
• Region*
• Place of production
• Author (original and transcription)
• Workshop/factory/publisher (original and transcription)
• Dating (emperors)

Description of the object
• Object name
• Use*
• Secondary use*
• Material*
• Secondary material*
• Textual description
• Description of the material
• Production technique
• Dimensions
• Inscription: content (original, transcription, translation)
• Signature(s) (original, transcription)
• Censor (original, transcription, translation)
• Stamp (original, transcription, translation)
• Date and place of correspondence
• Addressee
• Sender
• Number of parts/pieces
• Name in the original language (original, transcription)
• Motif
• Format
• Binding technique
• Calligraphy style
• Location of the inscription on the object
• Location of the signature on the object
• Location of the censor's signature on the object
• Location of the stamp on the object

Table 1: The VAZ data schema.
In designing the VAZ data schema we were thus guided by two questions: which kinds of data are, or may become, useful for our research, and which kinds of data may be of interest to other users, be they experts or enthusiasts. Although we constantly kept the public in mind, until the end of the PAGODE project we did not think much about the ways of preparing curated content, and even less about the role of metadata in this process. Nor did we think about how the data should be structured so that we could process them for research as easily and efficiently as possible. In other words, although digitization was one of the central goals of the VAZ project, we were not familiar with the practices of the digital humanities. Focused too much on the objects as museum objects on the one hand and on their East Asian origin on the other, we did not find our way to the institutions and experts in our community who work on the digitization of cultural heritage, digital archives and the digital humanities.
16 Based on their frequency in the collections, we made the following typology by use (in alphabetical order): architecture and models; musical instruments and theatre objects; games and toys; statues; books and printed materials; numismatics; clothing, footwear and accessories; weapons and military equipment; fans; furniture and interior furnishings; vessels and utensils; personal care objects; implements for smoking and the consumption of substances; postcards and photographs; religious objects; paintings and prints; and other.
Moreover, the company that provides technical support for our portal has no experience with database development, although we have cooperated well with it in the past.

Figure 2: Display of object data on the VAZ portal.

5. Adapting the VAZ data model for the import into Europeana
All the digital cultural heritage objects in the VAZ project that we intended to publish in Europeana belonged to the image type, as they were digital images of selected objects from the Skušek Collection. In the PAGODE project we committed ourselves to reaching the higher levels of Europeana's content quality scale, which marks objects with a high potential for use in education, on open platforms and in the creative industries (cf. Europeana 2019a). The photographs and scans therefore had to meet two requirements: (1) their size could not be smaller than 1200 x 1200 pixels, and (2) they had to be openly accessible or usable under a Creative Commons Attribution-ShareAlike licence (CC BY-SA).

When considering which EDM data categories to fill, we were likewise guided by the ambition of a high level of metadata quality, which Europeana assesses with respect to multilinguality, the use of discovery scenarios and contextual classes. Since we wanted to address a wide range of end users, from experts to cultural heritage enthusiasts and representatives of the creative sector, we set the conditions of tier C as our benchmark. In practice this would mean that a user could find the coins from the Skušek Collection with a general query such as "Chinese coins", or with a very detailed, expert query about a specific coin type, "Daoguang tongbao", in characters »道光通寶«. For representatives of the creative sector, on the other hand, metadata that enable searching by motifs, patterns and colours are particularly useful.

The target metadata quality tier required that at least 75 percent of the values in the data elements that Europeana ranks as most relevant for discovery also carry a metadata statement about the language or languages in which the value is written. In addition, we had to use at least four different elements from two different discovery scenarios, and at least two contextual classes with appropriate links to open data or controlled vocabularies.

To begin designing the data model, we defined the following starting points:
➢ we take the source database of the VAZ project as the basis;
➢ we identify as many data categories in the source database as possible that can be translated into EDM;
➢ we add the administrative metadata required by EDM;
➢ to the greatest possible extent, we add identifiers from controlled vocabularies to the metadata;
➢ with the metadata we also capture the diversity of end users in Europeana.

Our source VAZ database contains 46 categories. Of these, we included 23 as elements in various EDM classes. The omitted data were mostly those intended for object types that were not included in PAGODE. In VAZ, for instance, we collect data intended for postcard research, such as the addressee and the sender. Since we did not import postcards into Europeana, we excluded these categories of data when preparing the model for Europeana.

For our purposes we used all three core classes of EDM. First, we identified the minimal required metadata elements in each of them (and noted their standardized properties). These are (1) the form of the digital surrogate (edm:type), (2) the data custodian (edm:dataProvider), (3) the name of the national aggregator or other institution that enabled the flow of data into Europeana (edm:provider), in our case Photoconsortium; and (4) the usage rights (edm:rights). Immediately afterwards we added to the model the contextual classes for agent (edm:Agent), time span (edm:TimeSpan) and concept (skos:Concept).
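To make the mapping concrete, the sketch below restates the four required elements and the contextual classes as a simple Python mapping; the VAZ column names on the left are hypothetical placeholders for illustration, not the actual field names of the database:

# A loose sketch, not an official EDM serialization. The keys are
# hypothetical VAZ column names; the values are the EDM elements
# named in the text above.
VAZ_TO_EDM = {
    "tip_zapisa":       "edm:type",          # form of the digital surrogate, e.g. an image
    "muzej":            "edm:dataProvider",  # the data custodian (holding museum)
    "agregator":        "edm:provider",      # here: Photoconsortium
    "avtorske_pravice": "edm:rights",        # usage rights, e.g. CC BY-SA
}

# Contextual classes added to the model (their values are linked
# resources from controlled vocabularies, not plain literals):
CONTEXTUAL_CLASSES = ["edm:Agent", "edm:TimeSpan", "skos:Concept"]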
We then included in the model all the recommended metadata elements that coincided with individual categories from the VAZ database, such as the description of the object (dc:description) and its dimensions (dcterms:extent). This was followed by the inclusion of recommended metadata elements that do not exist in the source database but would enable a wide spectrum of uses. Here we got stuck: it turned out that in the given time we would not be able to fill the model with the missing data so as to satisfy all the identified users of Europeana. Although we originally wanted to include data for the creative sector as well, it was precisely for the elements intended for it (motifs, patterns, colours) that we lacked the most data, so we abandoned this part of the schema. The entries were, however, automatically enriched with colour data during the import into Europeana, so that today an object from the Skušek Collection can also be searched and filtered by this criterion.

Finally, we added identifiers from controlled vocabularies to 19 elements. For reasons of language accessibility we also translated two elements into English, namely the object name (dc:title) and the intermediate provider (edm:intermediateProvider). In the end, the data model for the import contained 67 elements.

Once the model was complete, we set about obtaining the missing metadata, among which identifiers predominated. This part of the process went quickly. Owing to one of our central tasks in the PAGODE project, the preparation of the semantic scheme for the automatic and mass manual annotation of the objects of Chinese cultural heritage already in Europeana, we had already compiled, a few months earlier, a list of slightly fewer than 1000 keywords with the corresponding identifiers from the Getty AAT and Wikidata.17 To achieve this, we had previously reviewed at least three times as many keywords in these controlled vocabularies. We thus had a very good overview of what each vocabulary offers and what we could use in our data model.
Besides these identifiers, we also had to enter into the adapted schema the unique identifiers of the photographs published on the VAZ website, since Europeana displays digitized objects directly from the servers of the institutions or organizations. Finally, we added the metadata that link individual parts into a complete set (photographs into a photograph album, inserted leaves into printed or painted albums). As values we entered the unique identifier of the photograph of the object published on the website that is next in the sequence (edm:isNextInSequence).

Figure 3: Display of metadata in Europeana.

In the end we used the MINT platform for the mapping, a platform regularly used by the project partners. It enables the mapping of metadata and the ingestion of new content into Europeana without prior programming knowledge of the XML data structure that technically supports the aggregation. MINT lets you link the elements of your own data set to the elements of EDM. The data model can be imported in various ways, among others with a csv file, as we did. The mapping is arranged through a user-friendly interface, and the program then converts it into an XML form that enables the final ingestion of the content into Europeana. In addition to mapping the metadata, we manually assigned to each metadata element in MINT the language in which it is written, and in the contextual class for the agent (edm:Agent) we manually entered the names of the agents in different languages (e.g. Marija Skušek / Kondō Kawase Tsuneko / 近藤常子).
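For readers curious what such a csv-to-XML conversion involves under the hood, here is a minimal, purely illustrative Python sketch; the file name and column names are hypothetical, and a real EDM record is a considerably richer RDF/XML structure than shown here:

import csv
import xml.etree.ElementTree as ET

# Namespace URIs as commonly published for Dublin Core and EDM;
# the record structure itself is a simplification for illustration.
NAMESPACES = {
    "xmlns:dc":  "http://purl.org/dc/elements/1.1/",
    "xmlns:edm": "http://www.europeana.eu/schemas/edm/",
}

def row_to_record(row):
    """Turn one CSV row into a minimal EDM-like XML record."""
    rec = ET.Element("record", NAMESPACES)
    # each element carries an explicit language tag, mirroring the
    # manual language assignment done in MINT
    title = ET.SubElement(rec, "dc:title", {"xml:lang": "sl"})
    title.text = row["ime_predmeta"]        # hypothetical column name
    rights = ET.SubElement(rec, "edm:rights")
    rights.text = row["avtorske_pravice"]   # hypothetical column name
    return rec

with open("vaz_export.csv", encoding="utf-8") as f:  # hypothetical file
    for row in csv.DictReader(f):
        print(ET.tostring(row_to_record(row), encoding="unicode"))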
With all this work we ultimately achieved the desired tier C of metadata quality, which ensures precise browsing and enables Europeana to act as a platform of knowledge.

6. Reflection
The work on the PAGODE project, both the preparation of the semantic scheme and the adaptation of the VAZ metadata model to EDM, was an exceptionally valuable experience for the participating researchers, through which we were able to evaluate and then improve our work on the VAZ project as well. As experts in East Asian studies, we try to obtain the most exhaustive data possible for the objects we digitize in the VAZ project, and we organize these data into a relatively ramified schema. In adapting our schema to the Europeana Data Model, and especially in filling it with concrete data, we therefore had a much easier task than the other content providers participating in the PAGODE project who lacked such specialized knowledge. Once we understood the definitions of the individual elements in EDM, we quickly found suitable parallels for the VAZ data categories; what was of course missing in the VAZ schema were the data related to the web resource and the aggregation. For the needs of the PAGODE project we added to the VAZ schema a category for the copyright of the photographs, since this information must be displayed both on the Europeana page and on the source page.18 In EDM we enriched the data from the VAZ database above all with links to open data and controlled vocabularies, but for the time being we do not intend to include these in the VAZ database, as our priority is extending the database with new entries.

The MINT tool, which we used for the technical processing of the data for the import into Europeana, on the other hand required no programming skills, so we could carry out this part ourselves as well. The drawback of this way of publishing data is that it is a one-time import, so the data will not be updated in step with the VAZ database. From the Europeana page of an individual object in the Skušek Collection a user will still be able to follow a link to the VAZ page and see the latest version there, but the data in Europeana will not be up to date until we carry out a new import. Had we known this at the outset, we would certainly have considered an import through the national aggregator, although given the time pressure and scarce funding we might in the end still have opted for the simpler aggregation via MINT. We must emphasize that the data in Europeana will not be wrong or of poor quality; they will only be slightly less rich than in the VAZ database, which we will keep extending with new research findings.

Through our participation in the PAGODE project we have also become more ambitious about curating and visualizing the content on the VAZ portal. While mulling over ideas on how to present our findings in more accessible and attractive ways, the two authors realized that it would be better if our metadata schema were even more ramified and if the elements we currently write together were further split. For postcards, for instance, we enter the place and date of the postmark into the same field, although it would be better for further processing if they were separated. The same holds for provenance, where the current and past owners are listed together. In addition, we were impressed by the contextual classes of EDM, which, especially for material where we have abundant data on places, persons and time, would make it easier to analyse and display the routes, networks and life stories of the objects.

17 For the needs of the PAGODE project we also created around 80 of the entries in Wikidata ourselves.
18 Our original plan was for all the photographs in the VAZ database to be freely available for non-commercial use, but the participating museums jointly decided that they wished to retain copyright. SEM, too, changed the rights only for the photographs of the objects added to Europeana.
7. References
Tina Berdajs. 2021. Retracing the Footsteps: Analysis of the Skušek Collection. Asian Studies, 9(3): 141–166. https://doi.org/10.4312/as.2021.9.3.141-166.
Valentine Charles, Antoine Isaac and Timothy Hill, eds. 2015. Discovery - User Scenarios and Their Metadata Requirements. https://pro.europeana.eu/files/Europeana_Professional/EuropeanaTech/EuropeanaTech_WG/DataQualityCommittee/DQC_DiscoveryUserScenarios_v3.pdf.
Europeana. 2017. Europeana Data Model - Mapping Guidelines v2.4. https://pro.europeana.eu/files/Europeana_Professional/Share_your_data/Technical_requirements/EDM_Documentation/EDM_Mapping_Guidelines_v2.4_102017.pdf.
Europeana. 2019a. Europeana Publishing Framework: Content. https://pro.europeana.eu/files/Europeana_Professional/Publications/Publishing_Framework/Europeana_publishing_framework_content.pdf.
Europeana. 2019b. Europeana Publishing Framework: Metadata. https://pro.europeana.eu/files/Europeana_Professional/Publications/Publishing_Framework/Europeana_publishing_framework_metadata_v-0-8.pdf.
Helena Motoh. 2021. Lived-in Museum. Asian Studies, 9(3): 119–140.
Nataša Vampelj Suhadolnik. 2021. Between Ethnology and Cultural History: Where to Place East Asian Objects in Slovenian Museums? Asian Studies, 9(3): 85–116. https://doi.org/10.4312/as.2021.9.3.85-116.
Nataša Vampelj Suhadolnik. 2019. Zbirateljska kultura in vzhodnoazijske zbirke v Sloveniji. In: Andrej Bekeš, Jana S. Rošker and Zlatko Šabič, eds., Procesi in odnosi v Vzhodni Aziji: zbornik EARL, 93–137. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani. https://doi.org/10.4312/9789610602699.
Human Evaluation of Machine Translations by Semi-Professionals: Lessons
Learnt
Špela Vintar*, Andraž Repar†
* Department of Translation, Faculty of Arts, University of Ljubljana
Aškerčeva 2, SI-1000 Ljubljana
spela.vintar@ff.uni-lj.si
†Department of Knowledge Technologies, Jožef Stefan Institute
Jamova cesta 39, SI-1000 Ljubljana
andraz.repar@ijs.si
Abstract
We report on two experiments in human evaluation of machine translations, one using the Fluency/Adequacy scoring and the other using error annotation combined with post-editing. In both cases the evaluators were students of translation at the Master's level, who received instructions on how to perform the evaluation but had previously had little or no experience with the evaluation of translation quality.
The human evaluation was performed in the context of developing and testing different MT models within the Development of Slovene in a Digital Environment (DSDE) project. Our results show that Fluency/Adequacy scoring is more efficient and reliable than error annotation, and a comparison of both methods shows low correlation.
1. Introduction
The design and evolution of a new machine translation system is invariably linked with regular quality assessments, using both automatic methods, commonly known as metrics, and human evaluations of the MT system's outputs. The context of this experiment is the development of a neural MT system for the English-Slovene language pair within the DSDE project, which involved work packages dedicated to data collection, to the implementation and testing of different NMT architectures, and to MT evaluation.

Throughout the project, different versions of the DSDE NMT system were regularly automatically evaluated using the BLEU metric, while later versions were also evaluated with a comprehensive set of scores available on the SloBench 1.0 evaluation platform. In parallel to the automatic evaluations we performed a set of human evaluations with several aims in mind: to validate the automatic scores with manual assessments, to gain insight into the performance of the system under development, but also to compare two human evaluation scenarios in terms of efficiency and reliability.

The manual evaluations of the DSDE MT engine were performed by students of MA Translation at the Department of Translation Studies, Faculty of Arts, University of Ljubljana. We refer to advanced students of translation as semi-professionals because of their high proficiency in both languages and their understanding of translation as a complex cognitive activity with many alternative solutions for each source text. On the other hand, their experience with translating is for the most part limited to the study environment, and they have received little or no formal training in post-editing or translation assessment.

Manual evaluation was performed using two common evaluation frameworks: the Adequacy/Fluency score and the MQM-DQF error annotation combined with post-editing.

The paper first presents the rationale for selecting the methodologies by referring to related work, then describes the MT system and its development within the DSDE project. We then present the evaluation setups and provide summaries of the results. In addition to the quantitative results, for the error annotation and post-editing task we also give a brief summary of the most frequent observations. We conclude by discussing the findings from the perspective of translation quality assessment in MT development.

2. Related work
Evaluation of MT is a crucial part of the development and improvement of MT systems, and it is traditionally divided into automatic evaluation, using metrics such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and TER (Snover et al., 2006), and human or manual evaluation. Automatic evaluation is usually performed by comparing the candidate machine translated text to a reference translation produced by a human professional, whereby the comparison can be rather superficial and word-based, as with BLEU, or more linguistically informed, as with METEOR. The obvious advantage of automatic metrics is that they can be computed on the fly, requiring no human effort, but their rate of correlation with human judgements remains a constant concern. Particularly since the emergence of NMT, some authors show that the reliability of metrics as indicators of translation quality may be faltering (Shterionov et al., 2018), or that metrics alone cannot adequately reflect the variety of linguistic issues which may affect quality. Manual evaluation therefore remains an integral part of MT quality evaluation and is annually included in the WMT shared task (Bojar et al., 2016).
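For illustration, a corpus-level BLEU score of the kind used in such regular automatic evaluations can be computed in a few lines, for example with the sacrebleu Python package; the toy segments below are invented placeholders, not project data:

import sacrebleu  # pip install sacrebleu

# Invented placeholder data: system outputs and one reference per segment.
hypotheses = ["the cat sat on the mat", "it is raining today"]
references = [["the cat sat on the mat", "it rains today"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # corpus-level score between 0 and 100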
Over time, many methods of human MT evaluation have evolved. Adequacy/Fluency scoring was first adopted by the ARPA MT research program (White et al., 1994) as a standard methodology for scoring translation segments on a discrete 5- or 7-level scale. The adequacy evaluation is performed either by professional translators, who are presented with the original and the machine translated segment and judge the degree to which the information from the original can be found in the MT output, or by monolingual speakers, who are presented with the MT and a reference translation. For the fluency evaluation, neither a reference translation nor the original is provided, and the evaluators determine whether the translation "reads" like good language, sounds natural and adheres to the grammatical conventions of the language.
"reads" like good language, sounds natural and adheres to
Translation, University of Ljubljana. The translation
grammatical conventions of the language.
environment of choice was memoQ, a tool which allows the
Other manual evaluation methods include task-based
project manager to select or define an LQA scheme with
evaluation (Doyon et al., 1999), post-editing with targeted
the fluency/adequacy scoring or the error categories
human annotation, also known as HTER (Snover et al.,
respectively. The annotator performs the evaluation, error
2006), and error analysis using various error typologies.
annotation and post-editing in a typical two-column setting
The most comprehensive translation error typology to date
with the segmented original on the left hand side and the
is the Multidimensional Quality Metrics (MQM) developed
machine translated segments already inserted into the target
in the QT-Launchpad1 project. The MQM guidelines
text on the right hand side via pre-translation. Annotators
provide a fine-grained hierarchy of quality issues and a
receive an outbound memoQ package which ensures that
mechanism for applying them to different evaluation
the source text, the raw MT and the evaluation/error
scenarios, however the entire tagset is too complex to be
annotation scheme are available and activated with no
used in a concrete evaluation task. Originating from the
further setup, and the evaluated, post-edited and annotated
needs of the language industry, the TAUS Dynamic Quality
texts may be returned to the project manager (in our case
Framework (DQF) proposed an error typology which has
the experiment designer) as inbound return packages.
been harmonized with the MQM model in 2015 and is
Five different source texts were used from the domains
today integrated into most commercial translation tools
of chemistry, karst, economy, law and general news. The
(Marheinecke, 2016).
texts were of comparable length (~500 words) and
The annotation of translation errors can be a part of
consisted either of the entire text or a meaningful portion
Linguistic Quality Assurance (LQA) in professional
thereof. With the exception of the general news text dealing
translation environments, in order to monitor quality on the
with US elections, all domain-specific texts were highly
corporate, project or individual levels. However, for the
specialized with complex syntax and many terminological
task of manual MT evaluation MQM and related methods
expressions.
are notoriously poor in inter-annotator agreement scores
For the fluency/adequacy scoring, both language pairs
(Lommel et al., 2014). Some authors believe that pre-
were evaluated by a group of five students over a period of
annotation training can significantly reduce disagreements,
several months. Each document was evaluated by two
but the task apparently remains highly subjective.
students. Once a new model was available, MemoQ
Despite the labour intensity and low inter-annotator
packages were sent to the students who performed the
agreement, error annotation is still frequently employed in
evaluation in their home environment. Note that for the
human MT evaluation because of the significance and
adequacy/fluency evaluation, no postediting took place –
depth of insight into translation issues it may provide. As
the students only had to score each translated segment on a
Klubička et al. (2017) point out, Slavic languages are rich
scale of 0 to 3 (see Table 1).
in inflection, case and gender agreement, and they have
rather free word order compared to English. The motivation
Adequacy
Fluency
for using error analysis in MT evaluation is to see – in the
0
None
Incomprehensible
process of developing and improving a new MT system –
1
Little
Disfluent
whether the particular grammatical issues occurring with
2
Much
Good
Slavic languages are adequately addressed, resulting in
3
All
Flawless
overall quality improvement.
In line with related works we opted for two of the most
Table 1: Adequacy/Fluency scoring.
commonly used manual evaluation methods, the
Fluency/Adequacy score and the TAUS DQF-MQM
For the error annotation, only the English-Slovene
metrics which has been further adapted for the DSDE
language pair was evaluated, with English as original and
project.
Slovene as target. Fifteen students participated, so that
post-editing and error analysis were in the end performed
3. The DSDE MT system
by three students for each text. The experiment took place
The main goal of the machine translation work package
during a regular face-to-face seminar session in the
in RSDO is to improve on the state-of-the-art model for the
presence of the lecturer. Students were using standard PCs
Slovene/English and English/Slovene language pairs
and with memoQ 9.5 running Translator Pro licenses.
developed within the TraMOOC project (Sennrich et al.,
Students were requested to perform full post-editing of
2017). To this end, various neural machine translation
the machine translated text, and at the same time annotate
frameworks were evaluated, such as MarianNMT (Junczys-
each error using the preloaded TAUS/DSDE error
Dowmunt et al., 2018), fairseq (Ott et al., 2019) and NeMo
typology. The latter proved somewhat wearisome, since the
(Kuchaiev et al., 2019). The same dataset consisting of
annotation of each single error involves opening a separate
publicly available parallel data as well as data collected
dialog box, selecting the category and resuming work,
within the DSDE project2 was used to train the models on
whereby the typical commands used during "normal"
the selected frameworks.
translation must be avoided (e.g. Control + Enter to confirm
the segment). This invariably slows down the post-editing
4. Evaluation setup
process and presumably affects the natural cognitive flow
Both types of manual evaulation were performed by
during post-editing.
students of MA Translation at the Department of
1 https://www.qt21.eu/launchpad/index.html
2 Data collected within the DSDE project will be made available under a CC-BY-SA 4.0 license at the end of the project.
5. Results

5.1. Fluency/Adequacy scoring
In addition to the baseline model, five models were evaluated using the Adequacy/Fluency methodology (three versions trained using the marianNMT framework, one using the fairseq framework and one using the NeMo framework). We also performed one round of evaluation of the eTranslation system developed by the European Commission.

Figure 1: Adequacy and Fluency scores across five domains and two language pairs.

The initial models (marian and fairseq) performed badly and did not exceed the scores of the baseline model in the DSDE project, but additional iterations performed better.
The overall best performance was exhibited by the NeMo model, with best or close to best scores in all five domains. The latest version of the Marian model (marian-v5) also performed well in some domains (e.g. Legal), less well in others. When comparing the DSDE models with eTranslation, we can observe that the NeMo model offers competitive performance across all five domains (with the possible exception of the News domain for the Slovene/English language pair).

5.2. Error annotation with post-editing
The error annotation with post-editing was performed in order to gain insight into the translation issues most affecting MT quality, but also to assess the efficiency and reliability of this methodology when used with semi-professional translators. The evaluation took place in November 2021 using the output of the marian-v5 model.
Category    | Subcategory    | Severity 1 - Critical | Severity 2 - Major | Severity 3 - Minor
Accuracy    | Category total | 56 | 68  | 37
            | Addition       | 1  | 2   | 3
            | Mistranslation | 50 | 63  | 30
            | Omission       | 5  | 3   | 4
Language    | Category total | 3  | 26  | 57
            | Grammar        | 3  | 18  | 37
            | Spelling       | 0  | 8   | 20
Style       | Category total | 13 | 18  | 80
            | Awkward        | 6  | 15  | 55
            | Inconsistent   | 7  | 3   | 25
Terminology | Category total | 4  | 16  | 14
Total       |                | 76 | 128 | 188

Table 2: Total errors by category.

Figure 2: Errors by severity.

As shown in Table 2, the highest number of errors were marked in the Accuracy category, followed by Style, Language and Terminology. Given that four out of five texts were specialized, the low count of terminology errors is perhaps surprising, but can be attributed to the fact that annotators frequently choose the Accuracy->Mistranslation category for errors related to specialized lexis. Minor errors are the most frequently selected severity level, with a majority of stylistic errors. Accuracy is also the source of the most critical errors which, in the opinion of annotators, completely change the meaning of the text.

5.2.1. Errors by text
On average, students would annotate ~30 errors per text, or 1.2 errors per segment. The differences in the number of errors between texts are small, with a maximum of 102 errors for the legal text (the sum for all three annotators) and a minimum of 90 for the text on karst.

Category    | Subcategory    | Chemistry | Economy | Karst | Legal | News
Accuracy    | Category total | 40 | 39 | 58 | 19  | 49
            | Addition       | 10 | 1  | 0  | 0   | 1
            | Mistranslation | 30 | 36 | 56 | 13  | 44
            | Omission       | 0  | 2  | 2  | 6   | 4
Language    | Category total | 30 | 14 | 16 | 26  | 18
            | Grammar        | 19 | 13 | 15 | 11  | 14
            | Spelling       | 11 | 1  | 1  | 15  | 4
Style       | Category total | 19 | 36 | 13 | 45  | 25
            | Awkward        | 19 | 27 | 13 | 22  | 22
            | Inconsistent   | 0  | 9  | 0  | 23  | 3
Terminology | Category total | 6  | 5  | 3  | 12  | 9
Total       |                | 95 | 94 | 90 | 102 | 101

Table 3: Errors by text.

Chemistry: There is considerable variation in the number of errors marked by each annotator: 40 / 26 / 29. In all three annotations, the most frequent error types are Accuracy and Language, followed by Style and Terminology. Only one annotator found 2 critical errors; the majority of errors were marked as minor.

Economy: The number of errors marked by each annotator varies: 29 / 30 / 35. Similar to other texts, the highest number of errors were attributed to Accuracy->Mistranslation, followed by Style and Language, with only 5 terminology errors.

Karst: The three annotators diverged in the numbers of errors marked: 21 / 31 / 38. Contrary to other texts, here the majority of errors were found to be major or even critical, with only 22 errors categorized as minor. Given that the text was highly specialized, it is again surprising that the Terminology category was not selected more often.

Legal: For the legal text, variation and non-agreement between annotators is at its highest: they marked 21 / 54 / 27 errors each, and even more interesting is the distribution of errors among severity levels. For the most prolific annotator, only 4 errors were found to be critical, but for the annotator who spotted 21 errors, 11 were categorized as critical. The third annotator, on the other hand, found no critical errors.
News: The numbers of errors marked by each annotator were 28 / 33 / 40 respectively, with 12 / 10 / 6 critical errors. Despite the fact that this text was the least specialized of the five, annotators marked 9 errors as terminological, and the overall majority of errors were those pertaining to accuracy (49).

5.2.2. Analysing students' edits
Some texts were highly specialized and rich in terminology; the students, however, often perceive errors as minor and categorize terminology errors under Accuracy. In the Karst text, for example, the original contains the term "precipitation", which is translated as "padavine". None of the annotators identified this as a critical error: in geology, precipitation is not a weather phenomenon but a type of sedimentation process, and the correct translation would read "precipitacija" or "usedanje". The word "test" in the original is most likely a typo and remains untranslated, while the translation of "algal crusts" into "drogovi" is another critical error.

    In nature, many types of CaCO3 precipitation are linked to living organisms: test, shells, skeletons, stromatolites, algal crusts, etc.
    V naravi so številne padavine CaCO3 povezane z živimi organizmi: test, oklepi, skeleti, stromatoliti, drogovi itd.

The students' edits are sometimes unnecessary or even wrong, as in the case of the correctly translated word "adduction" -> "addukcija", corrected into "adukcija" in one case and in another into "uporaba".

Inconsistent translations are another common issue in machine translation. Thus, in the Economy text, "expenditure" is translated as "stroški", "izdatki" and "poraba"; "plant" as "rastlina" and "naprava". A trained and alert post-editor would spot such inconsistencies and make sure they are consolidated in the final version; the students, however, focus on single segments and overlook such unwanted variation.

Easier to spot are untranslated words, such as "speleothem" in both the Karst original and the Slovene MT. All three annotators spotted the error and opted for "kapnik" in their edits, but the correction is inadequate because "kapnik" is a hyponym of "speleothem" and a better translation would be "speleotem" or "siga". Two annotators marked the error as Critical and one as Major.

It seems that students of translation are much more sensitive to grammatical errors than terminological ones, as the example below, containing the correct phrase but in the wrong case, was marked as a Major error by all three annotators.

    Zaradi velike moči odpornosti proti svetlobi in trajnosti derivatov benzimidazolov se pogosto uporabljajo za proizvodnjo akvarele in elektrofotografskih razvijalnih toner.
    Zaradi velike moči odpornosti proti svetlobi in trajnosti derivatov benzimidazolov se pogosto uporabljajo za proizvodnjo akvarelnih in elektrofotografskih razvijalnih tonerjev.

In many cases the annotators agree on the error itself or on the portion of text which should be corrected, but categorize the error differently. A major error was unanimously marked by all three annotators in the Economy text, where the original "To repress these troubles" was machine translated to "Za ponoven tisk te težave". Corrections ranged from "spoprijemanje s težavo" and "zmanjšanje teh težav" to "blaženje teh težav", but the error was categorized as Accuracy->Addition, Accuracy->Mistranslation and Terminology respectively.

Disagreement in categories was frequent also in the non-specialized text, a news article reporting Trump's attempts to postpone elections. The MT version contains a fluent but inaccurate rendering of "November's presidential elections to be postponed", where the MT engine proposed "je predlagal predsedniške volitve v novembru". This is certainly a critical accuracy error, which should be categorized as omission since the postponement is missing in the target. Indeed, all three annotators identified the error as critical, but one categorized it as mistranslation and the other two as omission. Another severe mistranslation occurs in segment 4, where the MT reverts the meaning of "There is little evidence..." to "Ni malo dokazov..."; again, all three annotators agree in the severity level but not in the category.

5.3. Comparing both evaluation methods
While the Fluency/Adequacy evaluation method gives little insight into the specific issues that may have been improved or aggravated from one MT model to another, it seems relatively consistent in the scoring of different models across domains. If we compare the Fluency/Adequacy scores obtained for each text translated by the marian-v5 model with the results of the error annotation, correlation is low. According to the former, the most adequate and fluent translation was that of the legal text, and the least that of the karst text. According to the number of annotated errors and edits, karst was the best and legal the worst. (The number of errors in Figure 3 is normalized to allow for better visual comparison.)

Figure 3: Comparing fluency, adequacy and number of errors per text.
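As a rough sketch of the comparison behind Figure 3 (the paper does not name a correlation statistic; Pearson's r is one plausible choice, and the min-max rescaling stands in for the normalization mentioned above):

from scipy.stats import pearsonr

def minmax(values):
    """Min-max rescaling to [0, 1], of the kind used to make the
    error counts visually comparable in Figure 3."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def score_error_correlation(adequacy_scores, error_counts):
    """Pearson correlation between per-text adequacy scores and annotated
    error counts; both lists must be ordered by the same five texts.
    (Rescaling does not change r, so it only matters for plotting.)"""
    r, _p = pearsonr(adequacy_scores, error_counts)
    return r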
6. Conclusion
We presented the results of human evaluation of MT using two well-known methodologies.
The Fluency/Adequacy evaluation is relatively efficient and fast, and the results are a useful indicator of the quality of different MT models. In general, the scores show high correlation with automatic metrics3, with the NeMo models achieving the highest automatic evaluation scores, followed by the Marian models and the baseline model, which is similar to what can be observed from the Adequacy/Fluency data. To measure the reliability of the Adequacy/Fluency ratings, we calculated Cohen's kappa coefficient4 for each document evaluated by a pair of evaluators. As somewhat expected, the agreement is fairly low, with most of the values falling between 0.20 and 0.50. The fact that the evaluation was performed by students does not seem to significantly affect the results.
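Footnote 4 names the exact scikit-learn function used for this; a minimal sketch of the per-document computation (the data layout is our assumption) could look as follows:

from sklearn.metrics import cohen_kappa_score

def document_kappa(scores_a, scores_b):
    """Agreement between the two evaluators of one document, where each
    list holds the 0-3 adequacy (or fluency) scores for the same segments.
    Returns unweighted Cohen's kappa."""
    return cohen_kappa_score(scores_a, scores_b)

# e.g. document_kappa([3, 2, 2, 1], [3, 1, 2, 2]) on invented scores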
On the other hand, the evaluation through error annotation and post-editing requires a much higher level of effort and of linguistic and extra-linguistic competence. Since each text was annotated by three students, a comparison of their decisions provides a valuable insight into the difficulty and subjectivity of the task. Agreement is low for all the parameters under observation: the number of errors marked, their categorization and their severity levels. Moreover, there is little correlation between the number of marked errors, their severity and the true quality of the machine translation. For the text which was the most specialized (Karst), contained a high number of un- or mistranslated terms and received the lowest Fluency/Adequacy score, the number of marked errors was the lowest of all. Student annotators with little or no expert knowledge of the domain will therefore find it difficult to correctly identify terminology errors, assess their severity or post-edit the text to a more accurate version.

Conversely, possibly owing to the fact that students of translation are still in the process of acquiring their language competence and are constantly reminded of the grammatical aspect of the texts they produce, their sensitivity to fluency-related issues is high, hence linguistic and stylistic errors are still often perceived as major. This might explain why the two texts which were the most accessible and easy to understand received the highest number of marked errors.

In retrospect, the post-editing and error annotation task was too difficult for advanced students of translation and failed to provide meaningful insights into MT quality, for several reasons. Firstly, the texts were too specialized and difficult to understand for non-experts. While students were free to use all available resources, some of the terminological expressions would require extensive research to resolve, and the students lacked the time, motivation or skill to perform such research. Secondly, to ensure higher agreement in the severity and category of errors, students should have received training, a test run and much more comprehensive annotation guidelines with English-Slovene examples. Finally, the annotation environment in memoQ with the rather fine-grained MQM/DSDE error typology is cumbersome and unintuitive, which probably affected the results.

We nevertheless believe that the experiments were valuable both for researchers and annotators. As researchers in MT development and evaluation we have gained experience which will allow us to better design evaluation runs, select texts and train annotators, and the student annotators have been exposed to translation quality assessment and post-editing, both of which are tasks frequently encountered in professional translation.

7. Acknowledgments
The project Development of Slovene in a Digital Environment (Slovene: Razvoj slovenščine v digitalnem okolju, RSDO) is co-financed by the Republic of Slovenia and the European Union under the European Regional Development Fund. The operation is carried out under the Operational Programme for the Implementation of the EU Cohesion Policy 2014–2020.
The authors thank the students of MA Translation at the Faculty of Arts, University of Ljubljana, for their participation in the task.

8. References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72. Association for Computational Linguistics.
Ondrej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia. 2016. Ten years of WMT evaluation campaigns: Lessons learnt. In: Proceedings of the LREC 2016 Workshop Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem, pages 27–34.
Jennifer Doyon, Kathryn B. Taylor, and John S. White. 1999. Task-based evaluation for machine translation. In: Proceedings of Machine Translation Summit VII, pages 574–578.
Filip Klubička, Antonio Toral, and Víctor M. Sánchez-Cartagena. 2017. Fine-grained human evaluation of neural versus phrase-based machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1): 121–132.
Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, and Jonathan M. Cohen. 2019. NeMo: a toolkit for building AI applications using neural modules. arXiv:1909.09577.
Arle Lommel, Hans Uszkoreit, and Aljoscha Burchardt. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Revista Tradumàtica: tecnologies de la traducció, 12: 455–463.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast Neural Machine Translation in C++. In: Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.
Katrin Marheinecke. 2016. Can quality metrics become the drivers of machine translation uptake? An industry perspective. In: Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem, pages 71–76.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv:1904.01038.
3 Automatic metric scores can be found at https://slobench.cjvt.si/
4 Using the cohen_kappa_score function from the sklearn Python library.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Rico Sennrich, Antonio Valerio Miceli Barone, Joss Moorkens, Sheila Castilho, Andy Way, Federico Gaspari, Valia Kordoni, Markus Egg, Maja Popovic, Yota Georgakopoulou, Maria Gialama, and Menno van Zaanen. 2017. TraMOOC: translation for massive open online courses: recent developments in machine translation. In: 20th Annual Conference of the European Association for Machine Translation, EAMT.
Dimitar Shterionov, Riccardo Superbo, Pat Nagle, Laura Casanellas, Tony O'Dowd, and Andy Way. 2018. Human versus automatic quality evaluation of NMT and PBSMT. Machine Translation, 32(3): 217–235.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231.
John S. White, Theresa A. O'Connell, and Francis E. O'Mara. 1994. The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proceedings of the First Conference of the Association for Machine Translation in the Americas.
Automatic Predicate Sense Disambiguation Using Syntactic and Semantic
Features
Branko Žitko,∗ Lucija Bročić,∗ Angelina Gašpar,† Ani Grubišić,∗ Daniel Vasić,‡ Ines Šarić-Grgić∗
∗Faculty of Science
University of Split
Ruđera Boškovića 33, 21000 Split, Croatia
branko.zitko@pmfst.hr, lucija.brocic@pmfst.hr, ani.grubisic@pmfst.hr, ines.saric@pmfst.hr
†Catholic Faculty of Theology
University of Split
Ulica Zrinsko Frankopanska 19, 21000 Split, Croatia
angelina.gaspar@kbf-st.hr
‡Faculty of Science and Education
University of Mostar
Poljička cesta 35, Mostar, Bosnia and Herzegovina
daniel.vasic@fpmoz.sum.ba
Abstract
This paper focuses on Predicate Sense Disambiguation (PSD) based on PropBank guidelines. Different approaches to this task have been proposed, from purely supervised or knowledge-based, to recently hybrid approaches that have shown promising results. We introduce one such hybrid approach: a PSD pipeline based on both supervised models and handcrafted rules. To train three supervised models (POS, DEP and POS DEP), we used syntactic features (lemma, part-of-speech tag, dependency parse) and semantic features (semantic role labels). These features enable per-token classification, which, to be applicable to unseen words, requires handcrafted rules that make predictions specifically for nouns in light verb constructions, unseen verbs and unseen phrasal verbs. Experiments were done on a newly developed dataset, and the results show a token-level accuracy of 96% for the proposed PSD pipeline.
1. Introduction
One of the main tasks of Natural Language Processing (NLP) is precisely understanding the meaning of a word and its specific usage in a sentence, a task known as Word Sense Disambiguation (WSD). In this paper, we focus on predicate sense disambiguation, i.e. the correct meaning of a predicate in a given sentence. A predicate combines with a subject to form a sentence, expressing some situation, event or state. Predicates are often single or compound verbs, consisting of various parts of speech (prepositions, adverbs, nouns, auxiliaries, etc.). Hence, the precise understanding of the meaning of a sentence lies in the correct disambiguation of different types of words, not just verbs. For example, the term light verb (LV) refers to a verb that gets its main semantic content from the noun that follows rather than from the verb itself. Thus, the construction consisting of such a verb and noun is called a Light Verb Construction (LVC). In the sentence "I take a walk in the park.", 'take a walk' is the LVC in which the noun 'walk' describes an action. It is non-compositional and its lexical-syntactic structure is not flexible. This example illustrates that word sense disambiguation can make Predicate Sense Disambiguation (PSD) more accurate, since splitting up the LVC and disambiguating the senses of its components individually neglects the semantic unity of the construction and fails to represent its single meaning. Namely, 'walk' can have the meaning of moving forward, one foot in front of the other, but it can also be a term specific to baseball. Depending on the sense of the word 'walk', the sense of the whole predicate changes.

Another important role of PSD is the one it plays in Semantic Role Labelling (SRL). The process of semantic role labelling typically consists of predicate identification and its sense disambiguation, followed by the identification of semantic roles and finally their labelling. State-of-the-art BERT models like AllenNLP's models (Gardner et al., 2018) or InVeRo (Conia et al., 2020) perform all the mentioned subtasks except for predicate sense disambiguation, which is missing. Ideally, the tool would use predicate senses to label semantic roles. However, we lack a tool for PSD, so we use the opposite technique: attempting to predict roleset IDs from already annotated semantic role labels. Another shortcoming of the mentioned state-of-the-art models is that they only label verbs as predicates, and as we have seen, it is necessary to label words of different parts of speech in addition to verbs. Regarding the sentence "I take a walk in the park.", state-of-the-art models identify the word 'take' as a predicate, whereas they ignore the word 'walk'. The need for such a PSD tool arises during the question generation task in the intelligent tutoring system (Grubišić et al., 2020) our research team is working on.

In this work, we describe our PSD pipeline, depicted in Figure 1, as well as the process it takes to create it. The approach we take is the combination of supervised PSD trained with the Stochastic Gradient Descent method (Kiefer and Wolfowitz, 1952) and the knowledge used to handcraft rules to compensate for the shortcomings of the data.
We train supervised classifiers for each word to disambiguate senses based on extracted syntactic and semantic features, which play a significant role in many NLP tasks (e.g. text summarization, question generation, etc.). As for the syntactic features, we use spaCy-annotated (Honnibal et al., 2020) fine-grained POS (part-of-speech) tags and dependency tags. We employ AllenNLP's BERT-based model (Gardner et al., 2018) to retrieve shallow semantics, represented by SRL labels. Thus, the proposed PSD pipeline consists of a Machine Learned Classification (MLC) pipeline, based on a Machine Learned Model (MLM), and a Rule-Based Classification (RBC) pipeline, based on a Rule-Based Model (RBM) that includes handcrafted rules for LVCs, unseen verbs (verbs that do not occur in the OntoNotes dataset used for training the MLMs) and unseen phrasal verbs (phrasal verbs that do not occur in the OntoNotes dataset used for training the MLMs). We provide source code1 with the spaCy integration of the proposed PSD pipeline.

Figure 1: Our PSD Pipeline.

Section 2 provides related works, which suggest that WSD, which entails PSD, is a current problem encountered in various popular NLP tasks. Section 3 describes the dataset used for training the PSD models and the modifications done to it. Section 4 describes the proposed PSD pipeline, providing detailed information on the training and evaluation of the models. Section 5 provides the conclusion of this paper and a discussion of the given work.
SRL annotation, which is a distinctive characteristic of a
Section 2 provides related works, which suggest that
predicate, to get better sense prediction.
the WSD, which entails PSD, is a current problem encoun-
tered in various popular NLP tasks. Section 3 describes the
3.
Data Manipulation and Analysis
dataset used for training PSD models and the modifications
To build a good PSD model combining a supervised
done to it. Section 4 describes the proposed PSD pipeline,
PSD approach and handcrafted rules, we need good data
providing detailed information on the training and evalua-
for the former and clear guidelines for sense disambigua-
tion of the models. Section 5 provides the conclusion of
tion for the latter.
this paper and discussion about the given work.
3.1.
OntoNotes Data
2.
Related Work
We use an English corpus from the OntoNotes project
Word Sense Disambiguation and Predicate Sense Dis-
as the train and test data for the supervised component of
ambiguation are appealing NLP tasks for researchers in the
the model. The English dataset of the OntoNotes Release
field. Thus, they are the subject of many research activi-
5.0 (Weischedel et al., 2013) consists of 13109 annotated
ties, summarized in the up-to-date survey of recent trends
documents organized as .onf files, arranged into seven di-
in WSD (Bevilacqua et al., 2021). Among the various ap-
rectories that correspond to files’ sources.
It is impor-
proaches to WSD, most popular are knowledge-based ap-
tant to train the model on the content of assorted genres
proaches, which often implement graph algorithms, and su-
and types, therefore, OntoNotes was picked as it has the
pervised approaches, which lately utilize neural networks.
following seven categories: Broadcast Conversation (tran-
Supervised WSD formulates the given task as classifi-
scripts of talk shows from channels such as BBC, CNN and
cation task. Hence, it requires precisely labelled training
MSBNC), Broadcast News (news data collected from var-
data to learn the relationship between word annotations and
ious news sources, such as ABC, NBC, CNN and Voice
senses. In contrast to a single classifier approach (Kawa-
of America), Magazine (Sinorama Magazine), Newswire
hara and Palmer, 2014), where one classifier is trained to
(data from sources such as Wall Street Journal newswire),
make predictions for every word sense, there is also a per-
Pivotal Corpus (biblical texts from the Old Testament and
the New Testament), Telephone Conversation (conversa-
1https://github.com/lucijabrocic/PSD-pipeline
tional speech texts) and Web data (English web texts and
The syntactic annotation of the sentences in the corpus followed the Penn TreeBank scheme and the predicate-argument structure followed the Proposition Bank (PropBank) annotation (Palmer et al., 2005). The OntoNotes English corpus consists of 143709 annotated sentences, most of which, but not all, have comprehensive annotation. Namely, some web texts selected to improve sense coverage were just tokenized and not even treebanked. Therefore, the corpus needed some refinement before further usage. The scripts (Bonial et al., 2014) provided by the Proposition Bank project enabled the conversion of the original PropBank annotations (found in the OntoNotes project) to the new unified PropBank annotations. The files thus obtained were further modified by custom user-defined methods written for this work. Those methods mostly changed the aesthetics of the files, such as converting the SRL annotation to utilize BIO notation and converting tree parses into dependency parse annotation. Finally, after the refinement and modifications, our corpus contains 7212 text files (137811 sentences), which follow the original OntoNotes directory structure based on the files' sources.
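As an illustration of the BIO conversion step, the following minimal sketch turns labeled argument spans into per-token tags; the (start, end, label) span format is our own assumption here, not the scripts' actual interface:

    # Convert labeled argument spans into per-token BIO tags.
    # Assumed input: (start, end, label) spans over token indices, end exclusive.
    def spans_to_bio(n_tokens, spans):
        tags = ["O"] * n_tokens
        for start, end, label in spans:
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
        return tags

    tokens = ["I", "take", "a", "walk", "in", "the", "park", "."]
    spans = [(0, 1, "ARG0"), (1, 2, "V"), (2, 4, "ARGM-PRR")]
    print(spans_to_bio(len(tokens), spans))
    # ['B-ARG0', 'B-V', 'B-ARGM-PRR', 'I-ARGM-PRR', 'O', 'O', 'O', 'O']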
3.2. The English PropBank
As already mentioned, the used data follows the English PropBank (Palmer et al., 2005) sense disambiguation guidelines. This research aims to predict the sense ID, also known as a frameset or roleset ID, for each word of any complex predicate structure in a sentence.

The English PropBank consists of 7311 .xml files called frame files, specifying the predicate-argument structure. One frame file, or frameset, consists of one predicate lemma or multiple different ones, and contains the information about the roleset IDs that disambiguate the various meanings of a predicate. Since diverse forms of a predicate can be under the same roleset ID, PropBank aliases can help to distinguish the correct sense from the wrong one. As our work required the English PropBank annotation information, we organized all the information for 10687 rolesets (and 7311 framesets) into an easily loadable .json file.
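The paper does not specify the schema of that file; a plausible shape, keyed by frameset with rolesets and their aliases, might look like the following sketch (the field names are our own illustration, not the repository's actual schema):

    import json

    # Illustrative structure only: one frameset ("walk") with one roleset
    # and its PropBank aliases as [lemma, part-of-speech] pairs.
    propbank_index = {
        "walk": {
            "walk.01": {
                "gloss": "walk (motion)",
                "aliases": [["walk", "verb"], ["walk", "noun"]],
            },
        },
    }

    with open("rolesets.json", "w", encoding="utf-8") as f:
        json.dump(propbank_index, f, indent=2)

    # Loading the whole index back is then a single call.
    with open("rolesets.json", encoding="utf-8") as f:
        rolesets = json.load(f)
    print(rolesets["walk"]["walk.01"]["aliases"])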
tures. First the MLC pipeline extracts these features from
No matter how large, representative, and carefully de-
the sentence and feeds them to the trained classifiers used
signed, no corpus can exhibit the same characteristics as a
to obtain predicate senses. Then, RBC pipeline takes the
natural language. Having this in mind, we check the cover-
syntactically and semantically annotated sentence with pre-
age of rolesets and framesets in the OntoNotes corpus. The
dicted predicate senses. RBC pipeline applies handcrafted
analysis shows that the modified files miss 4922 rolesets
rules to the sentence to improve the prediction of predicates
and 3104 framesets, i.e. they cover 53.94% of rolesets and
in light verb constructions, unseen verbs and unseen phrasal
57.54% of framesets that occur in the English PropBank.
verbs. As a result of the proposed pipeline processing, each
Even though the frequency of using missing framesets and
token in the sentence has a roleset attribute that stores the
rolesets might be low, the objective is to include as many
result.
framesets and rolesets as possible to increase the overall
coverage. To achieve this objective, we add the handcrafted
4.1.
Training the Models
rules, explained more thoroughly in subsection 4.3.
We have 7212 OntoNotes files available to make the
best use of while training our models. We first apply a typi-
4.
The Proposed PSD Pipeline
cal supervised learning approach - splitting the dataset into
This section describes the training process of three PSD
the train and test sets and then performing the training and
models (POS, DEP and POS DEP) and their evaluation. We
evaluation. The train-test split given in the PropBank (Bo-
train each model by employing two approaches. In the first
nial et al., 2014) resulted in 80% of the files (and sentences)
approach, we split the dataset into train and test sets, while
in train set and 20% in the test set.
in the second one, we use entire dataset for training.
Table 1 shows that many framesets and roleset IDs oc-
           No. of files   No. of sentences   No. of framesets   No. of roleset IDs
Train set  5832           111104             3996               5455
Test set   1380           26707              2692               3609
Corpus     7212           137811             4208               5766

Table 1: Corpus composition.

Table 1 shows that many framesets and roleset IDs occurred in both the train and test set. Out of the 2692 framesets identified in the test set, 212 did not appear in the train set. Likewise, out of the 3609 roleset IDs detected in the test set, 311 failed to appear in the train set.

Figure 3: The models' training pipeline.

Figure 3 illustrates the training process. First, the syntactically and semantically annotated sentence is loaded and forwarded to feature extraction. During the feature engineering and extraction phase, the most relevant token-level annotations for developing the models are selected. Those annotations are the token text, its modified lemma that matched the English PropBank frameset, the part-of-speech (POS) tag, the dependency parse and the semantic role labels (SRL). The research (Dang and Palmer, 2005; Dligach and Palmer, 2008) shows that predicate sense disambiguation could improve semantic role labelling. Ideally, word sense disambiguation would solve the problem of identifying the correct sense of a polysemic word based on context. However, the lack of a comprehensive repository of senses and a tool for PSD prompted us to use the opposite technique - attempting to predict roleset IDs from already annotated semantic role labels. As for the POS and dependency annotation, previous studies show that the performance of the SRL task heavily depends on the performance of the dependency parsing (Mohammadshahi and Henderson, 2021) and POS tagging (Wilks and Stevenson, 1997) subtasks. We train three models and name them according to the features they use - POS, DEP and POS DEP. All three models utilize the token text and lemma, but differ in the other used annotation(s): (i) the POS model utilizes the relation between SRL and the fine-grained POS tag, (ii) the DEP model utilizes the relation between SRL and the dependency tag, (iii) the POS DEP model utilizes the relation between SRL, the fine-grained POS tag and the dependency tag. In this research, we train and evaluate the three models in parallel.

To be more specific, we present the featuresets of the tokens "take" and "walk" in Figure 2 used when employing the POS DEP model. Token "take" has only one SRL argument - token "walk", which is ARGM-PRR. On the other hand, token "walk" has three SRL arguments - token "I" that is ARG0, token "take" that is ARGM-LVB, and finally tokens "in", "the" and "park" that are ARGM-LOC. Therefore, the featureset for token "take" is text, take, lemma, take, ARGM-PRR, 〈NN, dobj〉, and for token "walk" text, walk, lemma, walk, ARG0, 〈PRP, nsubj〉, ARGM-LVB, 〈VBP, ROOT〉, ARGM-LOC, 〈IN, prep〉, 〈DT, det〉, 〈NN, pobj〉.
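A rough illustration of how such a featureset could be assembled for the POS DEP variant; the dictionary-based token encoding is our own assumption, not the repository's actual data structure:

    # Build a POS DEP-style featureset for one predicate token: its text
    # and lemma, then, per SRL argument, the label followed by the
    # (POS, dependency) pair of every token inside that argument.
    def featureset(token, arguments):
        feats = ["text", token["text"], "lemma", token["lemma"]]
        for label, arg_tokens in arguments:
            feats.append(label)
            feats.extend((t["pos"], t["dep"]) for t in arg_tokens)
        return feats

    take = {"text": "take", "lemma": "take", "pos": "VBP", "dep": "ROOT"}
    walk = {"text": "walk", "lemma": "walk", "pos": "NN", "dep": "dobj"}
    print(featureset(take, [("ARGM-PRR", [walk])]))
    # ['text', 'take', 'lemma', 'take', 'ARGM-PRR', ('NN', 'dobj')]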
Then we vectorize the extracted features and feed them into the classifiers. Dealing with PSD, we face a multiclass classification problem with more than 10000 classes. Instead of a single classifier, a common solution to a problem like this is training multiple binary classifiers, one for each class of the original problem. In NLP-like domains, however, it is more suitable to use multiple classifiers which predict a restricted number of classes (Even-Zohar and Roth, 2001). Therefore, in this research, multiple multiclass classifiers perform the classification task, with one classifier for each frame file. Hence, the number of classifiers amounts to 7311, and each has to learn the nuances between the roleset IDs within the same frame file. The model itself is essentially a collection of such classifiers.

Regarding the choice of classifier, we want to build a simple and fast model for this PSD task. Since the context we need is already assigned to a token through context-aware models (spaCy, AllenNLP), with some feature engineering we can utilize the generated annotations (lemma, POS, dependency, SRL) as features for our model. Hence, we did not take a neural approach, but decided on a linear classifier where learning is based on multinomial logistic regression with SGD optimization.
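The paper does not publish its training code, but the described setup (one multiclass linear classifier per frame file, trained as multinomial logistic regression with SGD) maps naturally onto scikit-learn; a minimal sketch under that assumption, with hashed string features standing in for the real vectorization:

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    # One classifier per frameset, each deciding only between that frame
    # file's roleset IDs. The feature encoding below is illustrative.
    def train_frameset_classifier(featuresets, roleset_ids):
        hasher = FeatureHasher(n_features=2**12, input_type="string")
        X = hasher.transform(featuresets)
        # Logistic regression trained with SGD ("log" in older scikit-learn).
        clf = SGDClassifier(loss="log_loss")
        clf.fit(X, roleset_ids)
        return hasher, clf

    # Toy training data for a single "take" frame file.
    feats = [["lemma=take", "srl=ARGM-PRR", "pos=NN", "dep=dobj"],
             ["lemma=take", "srl=ARG1", "pos=NNS", "dep=dobj"]]
    labels = ["take.01", "take.02"]
    hasher, clf = train_frameset_classifier(feats, labels)
    print(clf.predict(hasher.transform([feats[0]])))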
4.2. The Evaluation of the Models' Accuracy and Performance
We evaluate our models on the OntoNotes test set containing 26707 sentences. Those sentences contain in total 504891 tokens, of which 75621 (or 14.98%) are predicate tokens and 429270 (or 85.02%) are non-predicate tokens. The average sentence contains 18.90 tokens, of which 2.83 are predicate tokens and 16.07 are non-predicate tokens. We measure the accuracy of the three PSD models on this OntoNotes test set with three different metrics:
• the token-level accuracy (TLA) metric measures the number of (predicate and non-predicate) tokens the model predicted correctly (correct roleset ID or no prediction, depending on whether the token is a part of a predicate or not)
• the sentence-level accuracy (SLA) metric measures the number of sentences the model predicted completely correctly (all the tokens)
• the predicate-level accuracy (PLA) metric measures the number of predicate tokens the model predicted correctly

Besides accuracy, we also use the predicate prediction coverage (PPC) metric, which represents the ratio of predicted predicate tokens to total predicate tokens (whether they are predicted correctly or not). When evaluating AllenNLP's BERT model on the OntoNotes test set, we can obtain a measure similar to PPC.
Looking at the ratio between the predicate tokens in the OntoNotes test set for which AllenNLP annotates the SRL arguments and all predicate tokens in the OntoNotes test set, we get a result of 88.02%. It is important to note that the remaining 11.98% are nouns for which AllenNLP's BERT model cannot annotate SRL labels. This coverage metric for AllenNLP puts into perspective the PPC measure of our models, given in Table 2.
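For concreteness, the four measures could be computed along the following lines; the (gold, predicted) pair encoding, with None marking a non-predicate token or a missing prediction, is our own assumption:

    # Each sentence is a list of (gold, predicted) roleset IDs per token;
    # None stands for a non-predicate token or for no prediction.
    def evaluate(sentences):
        tokens = [pair for sent in sentences for pair in sent]
        predicates = [(g, p) for g, p in tokens if g is not None]
        tla = sum(g == p for g, p in tokens) / len(tokens)
        sla = sum(all(g == p for g, p in s) for s in sentences) / len(sentences)
        pla = sum(g == p for g, p in predicates) / len(predicates)
        ppc = sum(p is not None for _, p in predicates) / len(predicates)
        return tla, sla, pla, ppc

    sents = [[("take.01", "take.01"), (None, None), ("walk.01", None)]]
    print(evaluate(sents))  # (0.666..., 0.0, 0.5, 0.5)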
           TLA (%)   SLA (%)   PLA (%)   PPC (%)
POS        98.50     76.91     90.01     97.49
DEP        98.71     79.74     91.37     97.82
POS DEP    98.73     80.04     91.54     97.97

Table 2: Evaluation of the models.
Table 2 shows that the results of the evaluation metrics on accuracy are similar for the three models, even though the POS DEP model is the most accurate and obtained the highest PPC score. As explained in Subsection 4.1., the models encounter some framesets and roleset IDs in the test set alone. After the initial training and evaluation phase, we further train the models on all 7212 modified OntoNotes files, assuming their performance would improve. To distinguish which results correspond to which model, we will use two terms: OntoNotes-split model and OntoNotes-whole model. The term OntoNotes-split model will denote a model that is trained on the OntoNotes train set and evaluated on the OntoNotes test set, while OntoNotes-whole model will denote a model that is trained on all of the 7212 OntoNotes files. The results given so far are for the OntoNotes-split models.

4.3. PSD Pipeline
Even when trained on all available data, our PSD models cover only 53.94% of the rolesets and 57.54% of the framesets in the English PropBank. Therefore, we handcraft rules to improve the predictive abilities of the models.
Figure 4: The MLC component of the PSD Pipeline.

Figure 4 presents the Machine Learned Classification (MLC) component of the PSD pipeline, which uses the ML model to make a predicate sense prediction. In the model training phase, we use the OntoNotes annotation of sentences for feature extraction. However, when using the PSD pipeline "in the wild" on arbitrary sentences, spaCy's English RoBERTa-based transformer processing pipeline uses the raw input to retrieve the syntactic features. AllenNLP's BERT model is used to obtain the semantic features, added to spaCy objects (Token, Span, Doc) via the custom SRL pipe. One thing to note is that we slightly modify both the spaCy pipeline and AllenNLP's BERT model. We improve spaCy's lemmatizer to better lemmatize gerunds and contracted verbs. The modifications made to AllenNLP's BERT model allow the presence of nouns in a predicate and the adjustment of SRL labels for LVCs to the English PropBank guidelines.

Next, the syntactic and semantic features are extracted in the same way as described in the training phase (Subsection 4.1.). The prediction can be done using one of the three previously mentioned OntoNotes-whole models (POS, DEP, POS DEP), and each model is essentially a collection of classifiers that each corresponds to a Penn PropBank frameset. The output of the MLC component is a sentence where the predicate tokens are annotated with the senses predicted via the classifiers.
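Conceptually, the MLC step then reduces to a lookup of the right classifier followed by a predict call; a sketch reusing the hasher/classifier pairs from the training sketch above (the token encoding is again our own assumption):

    # models: dict mapping a frameset (frame-file lemma) to its trained
    # (hasher, classifier) pair; sentence: list of token dicts.
    def mlc_annotate(sentence, models):
        for token in sentence:
            if token.get("srl") and token["lemma"] in models:
                hasher, clf = models[token["lemma"]]
                X = hasher.transform([token["features"]])
                token["roleset"] = clf.predict(X)[0]
        return sentence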
Figure 5 illustrates the further processing of annotated sentences in the Rule-Based Classification (RBC) component based on the Rule-Based Model, including handcrafted rules for LVCs, unseen verbs and unseen phrasal verbs to improve the prediction.

Figure 5: The RBC component of the PSD Pipeline.

Essentially, a sentence with classifier-predicted PSD annotation is forwarded to the RBC pipeline component, which first handles the sense disambiguation of nouns in LVCs. The RBC component uses the modified SRL labels to find both parts of an LVC and searches the PropBank aliases to find the corresponding one. The pipeline component explores aliases labeled as nouns only if there are no aliases tagged as light verbs. This way the PropBank aliases help in finding the correct sense IDs.

The next step includes the sense disambiguation of unseen verbs. The RBC component searches for PropBank aliases tagged as verbs, attempting to find the potential sense (roleset ID) of verbs that do not occur in the training set.

In the last step, the pipeline component performs the sense disambiguation of two-word phrasal verbs. Phrasal verbs are easy to predict correctly using the rules. The RBC pipeline first checks if a verb has a dependent particle (e.g. a preposition or an adverb) and searches the PropBank aliases tagged as verbs to find a corresponding sense (roleset ID).

The RBC pipeline makes a prediction in each step only if the observed token (i) has SRL labels (the AllenNLP model identified the token as a predicate), (ii) is not a modal verb (no sense disambiguation of modals) and (iii) has no prediction (the goal is to supplement the classifiers, not to overwrite their predictions).
Moreover, we introduce new annotations in the three steps of the RBC pipeline. For a better understanding, the examples in Table 5 illustrate the predictions of the PSD pipeline components using the POS DEP model and their possible outcomes.
The search for PropBank aliases can result in a lack of roleset ID matches, only one roleset ID match, or multiple roleset ID matches. Table 5 shows how each pipeline component resolves the roleset ID issue depending on the number of found roleset matches.

When there is no corresponding roleset ID for the token, the actions of the RBC pipeline differ based on the predicate construction. If the token is a part of an LVC (e.g. picnic - None), the RBC pipeline predicts the sense disambiguation as the lemma of the token followed by ".00" (picnic - picnic.00). If the token is an unseen verb (e.g. overwrite - None) or a part of an unseen phrasal verb (e.g. clue - None), however, the sense remains unchanged (None).

If there is only one roleset ID match, the components of the RBC pipeline choose that roleset ID.

If there are multiple roleset ID matches, the components of the RBC pipeline choose the roleset ID with the lowest number, followed by the flag "X". This annotation indicates that a unique prediction is still not achievable.
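Taken together, the gating conditions, the three rule steps and the match-count resolution read almost directly as code; the following compressed sketch uses simplified stand-ins (the in_lvc and particle token fields and the alias dictionary) for the pipeline's real logic:

    MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

    def resolve(candidates, lvc_lemma=None):
        # No match: LVC nouns fall back to lemma + ".00", others stay unresolved.
        if not candidates:
            return lvc_lemma + ".00" if lvc_lemma else None
        if len(candidates) == 1:
            return candidates[0]
        return min(candidates) + "X"   # lowest roleset ID, flagged as ambiguous

    def rbc_annotate(sentence, aliases):
        # aliases maps (lemma, pos) to a list of candidate roleset IDs.
        for tok in sentence:
            if not tok.get("srl") or tok["lemma"] in MODALS or tok.get("roleset"):
                continue                     # conditions (i)-(iii) above
            if tok.get("in_lvc"):            # step 1: noun inside an LVC
                cands = aliases.get((tok["lemma"], "noun"), [])
                tok["roleset"] = resolve(cands, lvc_lemma=tok["lemma"])
            elif tok.get("particle"):        # step 3: two-word phrasal verb
                key = (tok["lemma"] + "_" + tok["particle"], "verb")
                tok["roleset"] = resolve(aliases.get(key, []))
            else:                            # step 2: unseen verb
                tok["roleset"] = resolve(aliases.get((tok["lemma"], "verb"), []))
        return sentence

    # "The cat scrunched up to sleep.": phrasal verb with one alias match.
    sent = [{"lemma": "scrunch", "srl": ["V"], "particle": "up"}]
    print(rbc_annotate(sent, {("scrunch_up", "verb"): ["scrunch_up.01"]}))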
Finally, our PSD pipeline incorporates the final sense prediction into spaCy's processing pipeline, into a custom roleset attribute.
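In spaCy terms, exposing the result as a custom token attribute and a pipeline component takes only a registration call. A minimal sketch; the component name "psd" and the hard-coded prediction are placeholders, not the repository's actual implementation:

    import spacy
    from spacy.language import Language
    from spacy.tokens import Token

    # Custom per-token attribute holding the final roleset prediction.
    Token.set_extension("roleset", default=None)

    @Language.component("psd")
    def psd_component(doc):
        # The real pipeline would run the MLC and RBC stages here;
        # we only demonstrate writing the custom attribute.
        for token in doc:
            if token.lower_ == "take":
                token._.roleset = "take.01"
        return doc

    nlp = spacy.blank("en")
    nlp.add_pipe("psd")
    doc = nlp("I take a walk in the park.")
    print([(t.text, t._.roleset) for t in doc])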
5. Experimental Results and Discussion
This section provides the results obtained on the gold standard dataset, a discussion, and suggestions for further work.

5.1. The Evaluation of the Model Performance on the Gold Standard Dataset
As all three OntoNotes-split models perform similarly well, we further assess the accuracy of the OntoNotes-whole POS DEP model on a fresh set of sentences that represent our gold standard. The new dataset consists of 664 manually annotated sentences with syntactic (lemmatization, part-of-speech and dependency tags) and semantic (SRL) labels, and the predicate sense IDs which our model predicts. Table 3 gives statistics for the dataset considering tokens, words and predicates. Tokens include both words and non-word parts of a sentence, e.g. punctuation. Expressed as a percentage, 18.46% of the tokens in the gold dataset are predicates.

            total    per sentence
                     mean     std     min   max
token       6853     10.320   4.770   2     65
word        5971     8.992    3.430   1     48
predicate   1265     1.905    0.890   0     12

Table 3: Gold dataset statistics.

The first step of the evaluation process includes the predicate sense prediction using the input sentence and the needed annotations obtained through the system (spaCy transformer model and AllenNLP's BERT model) pipeline. In the second step, as some system annotations are erroneous, namely wrong lemmatization and SRL labels, we use the gold standard annotations to check if there is any difference in prediction.

                TLA (%)         SLA (%)         PLA (%)         PPC (%)
                X      no X     X      no X     X      no X     X & no X
pipeline        96.19  92.20    69.28  69.43    87.11  87.17    98.05
gold standard   97.63  97.67    78.01  78.46    89.75  89.94    100.00

Table 4: Evaluation of the POS DEP model on the gold standard dataset.

The evaluation results in Table 4 show that the OntoNotes-whole POS DEP model predicts better if fed with human-made annotations rather than with system-generated annotations. The most significant difference is in the sentence-level accuracy, resulting from the higher token-level and predicate-level accuracies.

To put the PPC measure given in Table 4 in perspective, we evaluate AllenNLP's BERT model on the gold standard dataset and obtain a measure similar to PPC. Looking at the ratio between the predicate tokens in the dataset for which AllenNLP annotates the SRL arguments and all predicate tokens in the dataset, we get a result of 97.61%. When using system-generated annotations, our OntoNotes-whole POS DEP model relies on AllenNLP for discovering the predicates it needs to predict senses for. Deeper analysis shows that there are certain errors in spaCy's system-generated annotations (namely the lemma) that lower the original AllenNLP coverage of 97.61%. However, the modifications made to AllenNLP's BERT model that allow the presence of nouns in a predicate have increased our predicate coverage to 98.05%, and in the end improved the original AllenNLP coverage of 97.61%.

The POS DEP model returns rolesets with the "X" flag when it cannot decide between multiple different senses. To fully evaluate the model's performance, we calculated the four metrics on the predictions with the "X" flag removed (no X). The slight increase in scores indicates that the roleset with the lowest ID number was often the right one.

5.2. Discussion and Further Work
We have shown our approach to predicate sense disambiguation utilizing POS, dependency and SRL annotations, and along the way presented an analysis of the coverage of the predicate senses in the OntoNotes corpus and the English PropBank contrastively. The integration of the PSD pipeline into spaCy makes its usage straightforward - by adding custom SRL and roleset components to the spaCy processing pipeline.

Another feature of the proposed PSD pipeline is its Machine Learned Models (MLMs). Each model consists of per-token classifiers, which implies some effort required to combine their outputs. However, the predicate sense prediction is fast since the pipeline only employs the classifiers corresponding to the framesets found in the sentence. Moreover, changing a single classifier is simplified – if there is a change in the annotation guidelines within one frame file, only one smaller classifier requires retraining. We have also presented the different accuracy and prediction metrics used in the evaluation of the models' performance.
Roleset ID prediction doesn't exist:
  LVC – "Let's have a picnic in the park."
    MLC prediction: have – have.01, picnic – None
    MLC + RBC prediction: have – have.01, picnic – picnic.00
  Unseen verb – "It will overwrite the files on your hard drive."
    MLC prediction: overwrite – None
    MLC + RBC prediction: overwrite – None
  Unseen phrasal verb – "She'll clue you in on the latest news."
    MLC prediction: clue – None
    MLC + RBC prediction: clue – None

Unique roleset IDs exist:
  LVC – "He is having an affair."
    MLC prediction: is – be.03, having – have.01, affair – None
    MLC + RBC prediction: is – be.03, having – have.01, affair – affair.01
  Unseen verb – "Some people annotate as they read."
    MLC prediction: annotate – None, read – read.01
    MLC + RBC prediction: annotate – annotate.01, read – read.01
  Unseen phrasal verb – "The cat scrunched up to sleep."
    MLC prediction: scrunched – None
    MLC + RBC prediction: scrunched – scrunch_up.01

Multiple roleset IDs exist:
  LVC – "We are making a plea to all companies."
    MLC prediction: are – be.03, making – make.01, plea – None
    MLC + RBC prediction: are – be.03, making – make.01, plea – plead.01X
  Unseen verb – "John frowned when he heard the news."
    MLC prediction: frowned – None, heard – hear.01
    MLC + RBC prediction: frowned – frown.01X, heard – hear.01
  Unseen phrasal verb – "They sluice the streets down every morning."
    MLC prediction: sluice – None
    MLC + RBC prediction: sluice – sluice_down.01X

Table 5: Examples for the PSD pipeline.
The scores in Table 4 suggest our PSD pipeline obtains satisfactory results; however, there is still room for improvement. More specifically, in our further work we plan to enhance the Rule-Based Classification (RBC) component, particularly the sense disambiguation of unseen words with multiple rolesets based on their part-of-speech tags. The PSD pipeline currently only chooses the roleset with the lowest roleset ID and adds the flag "X". We assume we can achieve better results if we create a more complex rule, such as one that utilizes the PropBank guidelines on roleset sense IDs and their corresponding arguments in the predicate-argument structure. Since there is a large number of missing rolesets and framesets (46.06% and 42.46% respectively), that will be no easy task, and a more in-depth analysis is necessary to figure out what mistakes the model makes and how to fix them.

We build our Rule-Based Models (RBMs) on three categories of words – nouns in Light Verb Constructions (LVCs), unseen verbs and unseen phrasal verbs. Perhaps the categories could be further disambiguated and thus enable a better RBM. Another change that might be beneficial for improving the results is the selection of more features during the feature extraction phase. For a certain predicate, we use only the POS and dependency tags of its arguments, but the accuracy might improve if we consider the text of the argument token as well.

Finally, the downstream task this PSD pipeline is created for is the question generation task in our intelligent tutoring system. Disambiguating predicate senses and capturing information about their arguments and characteristics will be useful when deciding on the appropriate wh-word in a question.

Acknowledgements
The presented results are the outcome of the research project "Enhancing Adaptive Courseware based on Natural Language Processing (AC&NL Tutor)" undertaken with the support of the United States Office of Naval Research Grant (N00014-20-1-2066).

6. References
Edoardo Barba, Tommaso Pasini, and Roberto Navigli. 2021. ESC: Redesigning WSD with extractive sense comprehension. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4661–4672, Online. Association for Computational Linguistics.
Michele Bevilacqua, Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. Recent trends in word sense disambiguation: A survey. In: Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4330–4338. International Joint Conferences on Artificial Intelligence Organization.
Claire Bonial, Julia Bonn, Kathryn Conger, Jena D. Hwang, and Martha Palmer. 2014. PropBank: Semantics of new predicate types. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3013–3019, Reykjavik, Iceland. European Language Resources Association (ELRA).
Lin Chen and Barbara Di Eugenio. 2010. A Maximum Entropy Approach To Disambiguating VerbNet Classes. In: Proceedings of Verb 2010, 2nd Interdisciplinary Workshop on Verbs, The Identification and Representation of Verb Features.
Simone Conia, Fabrizio Brignone, Davide Zanfardino, and Roberto Navigli. 2020. InVeRo: Making semantic role labeling accessible with intelligible verbs and roles. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 77–84, Online. Association for Computational Linguistics.
Hoa Trang Dang and Martha Palmer. 2005. The role of semantic roles in disambiguating verb senses. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 42–49, Ann Arbor, Michigan. Association for Computational Linguistics.
Dmitriy Dligach and Martha Palmer. 2008. Novel semantic features for verb sense disambiguation. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short '08, pages 29–32, USA. Association for Computational Linguistics.
Yair Even-Zohar and Dan Roth. 2001. A sequential model for multi-class classification. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In: Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.
Ani Grubišić, Slavomir Stankov, Branko Žitko, Ines Šarić-Grgić, Angelina Gašpar, Suzana Tomaš, Emil Brajković, and Daniel Vasić. 2020. Declarative Knowledge Extraction in the AC&NL Tutor. In: Robert A. Sottilare and Jessica Schwarz, editors, Adaptive Instructional Systems, pages 293–310, Cham. Springer International Publishing.
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.
Daisuke Kawahara and Martha Palmer. 2014. Single classifier approach for verb sense disambiguation based on generalized features. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4210–4213, Reykjavik, Iceland. European Language Resources Association (ELRA).
Jack Kiefer and Jacob Wolfowitz. 1952. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466.
Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.
Alireza Mohammadshahi and James Henderson. 2021. Syntax-aware graph-to-graph transformer for semantic role labelling.
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
Federico Scozzafava, Marco Maru, Fabrizio Brignone, Giovanni Torrisi, and Roberto Navigli. 2020. Personalized PageRank with syntagmatic information for multilingual word sense disambiguation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37–46, Online. Association for Computational Linguistics.
Ming Wang and Yinglin Wang. 2020. A synset relation-enhanced framework with a try-again mechanism for word sense disambiguation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6229–6240, Online. Association for Computational Linguistics.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0.
Yorick Wilks and Mark Stevenson. 1997. The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering, 4.
Patrick Ye and Timothy Baldwin. 2006. Verb sense disambiguation using selectional preferences extracted with a state-of-the-art semantic role labeler. In: Proceedings of the Australasian Language Technology Workshop 2006, pages 139–148, Sydney, Australia.
Progress of the RETROGRAM Project: Developing a TEI-like Model for Croatian Grammar Books before Illyrism
Petra Bago
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, HR-10000 Zagreb
pbago@ffzg.hr
1. Background
RETROGRAM1 (Retro-digitization and Interpretation of Croatian Grammar Books before Illyrism) is a 4-year research project that started in November 2019, co-funded by the Croatian Science Foundation (IP-2018-01-3585) and the Institute of Croatian Language and Linguistics. It is a linguistic heritage project that focuses on the digitization and interpretation of pre-Illyrian Croatian grammar books, with the aim of serving as a repository of such works in the future as well as offering a model and developing processes for future similar research on the digitization of Croatian grammars. So far, no digitization projects have included Croatian grammar books from the pre-Illyrian period of the Croatian language, i.e. before the establishment of the common standard language2 and orthography (Horvat and Kramarić, 2021).
The Croatian language comprises a common standard language as well as its three dialects: Čakavian, Kajkavian and Štokavian. The standardization of the Croatian literary language and orthography based on the Štokavian dialect variant began in the 17th century. The process was finalized in the 19th century during the time of the Croatian National Revival or the Illyrian movement (i.e. Illyrism). The main goals of the movement regarding language were to introduce a common literary language and a spelling reform, as well as to introduce the Štokavian dialect as the linguistic common standard in order to strengthen the national cultural identity. The grammars described in this article thus belong to the pre-Illyrian period of the Croatian language, containing Croatian literary languages that precede the modern Croatian common standard language. The first grammar books were written within the religious orders of the Jesuits and Franciscans, and were used to teach the Croatian or Latin language to the Franciscan and Jesuit youth (Horvat and Kramarić, 2021).
The main goal of the project is to create a web portal of pre-Illyrian Croatian grammar books, which would include facsimiles of selected grammar books with basic bibliographic and processing information, a transcription or translation, and an index of historical grammar and linguistic terminology. The portal will be equipped with thematic searching possibilities on the morphology level. The user will be able to browse the grammar book facsimiles, read the transcribed or translated text, and search it by predetermined parameters (which will allow searching conjugation and declension paradigms). Links to the facsimiles will enable comprehensive research on the orthographic and traductological aspects of the selected texts. An open-access portal will be developed and made available to scholars and the general public.
The main objective of the project is to intensify research activities and the interpretation of the Croatian pre-Illyrian grammars within the scope of modern linguistic disciplines (e.g. cognitive approach), to complete existing knowledge about the morphological development of the Croatian language, its normative descriptions, and development of linguistic terminology in the pre-Illyrian period. Conclusions on the formation of the Croatian language grammar model will also be based on the analysis of the Latin language grammar structure. Contrastive analysis of Latin and Croatian grammar meta-text and terminology will lead to conclusions about the influence of Latin language description on Croatian linguistic concepts in the pre-Illyrian period. More on the project can be found in Horvat (2020) and Horvat and Kramarić (2021).
2. Dataset
RETROGRAM has selected eight Croatian grammar books for the digitization and enrichment process, spanning from the early 17th until the early 19th century. The grammar books cover two dialects (Štokavian and Kajkavian) of pre-Illyrian Croatian, before there was an agreement on the common standard language and orthography. Even though not all of them are grammars of the Croatian language, all contain Croatian as a metalanguage and/or Croatian examples of morphological paradigms. The texts are transcriptions or translations of the originals in MS Word format, as all have been published as reference books by philologists from the project's research group.
1 https://retrogram.jezik.hr/
2 By “common standard language” we mean a standard language covering the entire Croatian speaking area.
The selected transcriptions or translations of grammar books used for the development of the annotation model are based on the following works:
Bartol Kašić, Institutionum linguae Illyricae libri duo, Rome, 1604 (Kašić, 2002),
Jakov Mikalja, Gramatika talijanska ukratko ili kratak nauk za naučiti latinski jezik, Loreto, 1649
(Mikalja, Horvat, and Gabrić-Bagarić, 2008),
Ardelio Della Bella, Istruzioni grammaticali della lingua illirica, Venice, 1728 (Della Bella, Sironić-
Bonefačić, and Gabrić-Bagarić, 2006),
Blaž Tadijanović, Svašta po malo iliti kratko složenje imena, riči u ilirski i njemački jezik, Magdeburg, 1761 (Horvat and Ramadanović, 2012),
Marijan Lanosović, Uvod u latinsko riči slaganje s nikima nimačkog jezika biližkama za korist slovinskih mladića složen, Osijek, 1776 (Perić Gavrančić, 2020),
Ignacije Szentmártony, Einleintung zur kroatischen Sprachlehre für Deutsche, Varaždin, 1783
(Szentmártony, 2014),
Josip Voltić, Grammatica illirica, Vienna, 1803 (Voltić, 2016),
Francesco M. Appendini, Grammatica della lingua Illirica, Dubrovnik, 1808 (Appendini and Lovrić Jović, 2022).
3. Data Annotation Model
The eight selected Croatian grammar books are the basis for the development of the annotation model based on the TEI Guidelines (TEI Consortium, 2021b). The model addresses two annotation tasks: 1) annotation of historical grammar and linguistic terminology, and 2) the annotation of morphological paradigms. The annotation tasks will be performed manually by experts working on the project. The decision was made to keep the original text intact, and any enrichment to be done through elements and attributes. Each grammar book is a TEI document comprised of a header and the body of the grammar text. The header contains metadata relevant to the project and to the particular grammar book, such as a list of all annotated grammatical terms. The body of the TEI document contains all grammar text with grammatical terminology and morphological paradigms annotations.
3.1. Grammatical Terminology Model
One of the aims of the RETROGRAM project is to facilitate research into historical grammar and linguistic terminology via the web portal. We composed an index of contemporary Croatian terms to be used for the normalization of the terminology. These terms are also used in the morphological paradigms annotation task. We have identified 87 terms related to the inflected parts of speech. The list of terms is encoded in the TEI header. In Example 1 we present the encoding of the term "noun" (imenica in Croatian) in the index to be used in the annotation model. The example is extracted from Mikalja's grammar book.
imenica
Example 1: Encoding of the term "noun" (imenica in Croatian) in the index of Mikalja's grammar.
To annotate the term in the grammar text, we use the element term3 that is, according to the TEI Guidelines, used to encode a technical term. In Example 2 you can find the encoding of the historical grammar term IMENA that Mikalja used to describe nouns and adjectives, hence the two attribute values. The model developed for annotating grammar terminology adheres to the TEI Guidelines.
3 https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-term.html
OD IMENA
Example 2: Encoding of the term “noun” (imenica in Croatian) in the grammar text of Mikalja’s grammar.
3.2. Morphological Paradigms Model
For the development of the morphological paradigms model, we analyzed the following inflected parts-of-speech: nouns, pronouns, adjectives, numbers and verbs. In the TEI Guidelines, there is no specific module for encoding grammar texts. However, we have decided to customize the dictionary module (TEI Consortium, 2021a) since it already contains elements that group morphosyntactic information of a lexical item. Interestingly, we were not the only ones with the same idea, as Toma Tasovac and Laurent Romary addressed the issue as part of the TEI Lex-0 initiative4. Often the morphological paradigms are presented in a table format. For the purposes of the RETROGRAM project, we decided to disregard the presentation mode of the paradigm, and encode only the implicit information contained in the tables.
To encode one lexical item in a paradigm, we use the element form5.
il soldato
Kad se pita čigovo je, rečemo
del soldato
Example 3: Encoding of two cases of the noun vojnik as a segment of a morphological paradigm in Mikalja's grammar.
4 https://github.com/DARIAH-ERIC/lexicalresources/tree/master/Resources/grammars-in-TEI
5 https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-form.html
4. Future Plans and Conclusion
We are currently conducting the manual annotation tasks based on the two models. Once the annotation tasks are complete, the next step is to create a web portal where all enriched grammar texts will be open and freely available with various search options.
In this extended abstract we present the progress of RETROGRAM, a linguistic heritage project that focuses on the digitization and interpretation of pre-Illyrian Croatian grammar books, with the aim of serving as a repository of digital Croatian grammars as well as offering a model and developing processes for the digitization of such works. Analyzing eight grammar texts published from the 17th until the 19th century, we developed two models: 1) a model for the annotation of historical grammar and linguistic terminology, and 2) a model for the annotation of morphological paradigms. We composed a taxonomy consisting of 87 terms to be used in both models. To implement the models, we consulted the TEI Guidelines, the de facto standard in the digital humanities. Our first model adheres to the guidelines. However, our second model is a TEI-like model that we developed based on the dictionary module of the same guidelines. We hope that the morphological paradigm model will serve as a basis for the development of a TEI module for grammars, a module that is presently missing but could be incorporated into the TEI infrastructure by expanding the dictionary module.
5. Acknowledgements
RETROGRAM is generously co-financed by the Croatian Science Foundation under the program
“Research Projects” with grant agreement IP-2018-01-3585 and by the Institute of Croatian Language and Linguistics. We wish to thank all our research associates as well as Toma Tasovac for their feedback and help.
6. References
Francesco Maria Appendini and Ivana Lovrić Jović. 2022. Appendinijeva Gramatika ilirskoga jezika: Jezična studija s prijevodom i transkripcijom uz faksimil. Institut za hrvatski jezik i jezikoslovlje, Nacionalna i sveučilišna knjižnica u Zagrebu, Zagreb.
Ardelio Della Bella, Nives Sironić-Bonefačić, and Darija Gabrić-Bagarić. 2006. Istruzioni grammaticali della lingua illirica, 1728: Gramatičke pouke o ilirskome jeziku. Institut za hrvatski jezik i jezikoslovlje, Zagreb.
TEI Consortium (ed.). 2021a. 9 Dictionaries. In: TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.3.0. TEI Consortium. https://tei-c.org/release/doc/tei-p5-doc/en/html/DI.html.
TEI Consortium (eds.). 2021b. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.3.0. TEI Consortium. http://www.tei-c.org/Guidelines/P5/.
Marijana Horvat. 2020. Istraživanje povijesti hrvatskoga jezika u digitalno doba. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 46(2):635–643.
Marijana Horvat and Martina Kramarić. 2021. Retro-Digitization of Croatian Pre-Standard Grammars. Athens Journal of Philology, 8(4):297–310.
Marijana Horvat and Ermina Ramadanović. 2012. Jezikoslovni priručnik Blaža Tadijanovića Svašta po malo iliti kratko složenje imena, riči u ilirski i njemački jezik (1761.). Institut za hrvatski jezik i jezikoslovlje, Zagreb.
Bartol Kašić. 2002. Institutiones linguae Illyricae/Osnove ilirskoga jezika. Institut za hrvatski jezik i jezikoslovlje, Zagreb.
Jakov Mikalja, Marijana Horvat, and Darija Gabrić-Bagarić. 2008. Gramatika talijanska ukratko ili kratak nauk za naučiti latinski jezik. Institut za hrvatski jezik i jezikoslovlje, Zagreb.
Sanja Perić Gavrančić. 2020. Latinska gramatika i hrvatski jezik Marijana Lanosovića: Povijesnojezična studija i transkripcija izvornika. Institut za hrvatski jezik i jezikoslovlje, Zagreb.
Ignacije Szentmártony. 2014. Uvod u nauk o horvatskome jeziku. Institut za hrvatski jezik i jezikoslovlje, Zagreb.
Josip Voltić. 2016. Grammatica Illirica/Ilirska gramatika. Reprint of the first edition (1803). Institut za hrvatski jezik i jezikoslovlje, Zagreb.
The CCRU as an Attempt at Doing Philosophy in a Digital World
Tvrtko Balić
Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, 10000, Zagreb
tvrtko.balic@gmail.com
1. Introduction
The consequences brought about by the Internet have been immense. The resulting chaos affected society at large, and while the natural sciences could enjoy the greater availability of information, the social sciences and humanities found themselves in a new world with new problems. A new environment was created for communities to function in, and this environment was ready to be studied. But it was also an area affected by those sciences, a breeding ground for theories. The fact that theories affect the realities they study has been accelerated by the emergence of the Internet. It is not clear how to act in such an environment. In 1995, at Warwick University, England, an experimental cultural-theory collective called the Cybernetic Culture Research Unit (CCRU) was formed.
2. Goal of the paper
The goal of the paper is to examine the problems presented by the Internet and to look at the Cybernetic Culture Research Unit as an example of theorists (specifically those in the field of philosophy) adapting to the new medium.
3. Influences
The main influences on the CCRU were the French postmodernists, and it is itself a postmodern project. Sadie Plant, the feminist lecturer writing a book on "The Situationist International in a Postmodern Age", and Nick Land, the eccentric professor teaching a course on "Current French Philosophy", took their influences and led them to new levels of eccentricity.
3.1. Lyotard
Jean-François Lyotard was the first to introduce the term "postmodern" in a philosophical context. According to him, the availability of knowledge is what causes the transition from the modern to a postmodern condition.
The organization of knowledge is what serves to justify power in the modern world. As knowledge becomes more available, the power of old actors such as nation-states withers, new actors emerge and the nature of society changes profoundly. Scientific knowledge is not easily accessible, and to bring it closer to people for the purpose of legitimation, whether of itself or of some political, economic, cultural or any other kind of system, it takes the form of a narrative. That is when the problem of a conflict of narratives emerges, but generally one dominates over the others and becomes a metanarrative, a story which offers an explanation for the world and justifies a certain social order.
The fundamental feature of postmodernism according to Lyotard is the decay and disappearance of metanarratives. He was enthusiastic about postmodernism and wanted to fragment and break down society in order for experimentation in the social field to yield improvements.
The CCRU presents the Internet as fertile ground for Lyotard's theories. It should be clear why: it makes knowledge even more accessible, as well as the power of expression.
3.2. Derrida
Jacques Derrida is most reflected in the style of the CCRU's works. What is reflected are his hopes for philosophical writing. He was critical of the seriousness of the philosophical canon, which in his day he saw dominated by Hegelian thought.
Derrida celebrates poetry, laughter and ecstasy, which he sees as neglected. He sees two forms of writing as having developed: serious philosophy on the one hand and playful literature on the other. He opposes what he calls logocentrism, the perceived domination of the ideal of the spoken word and the criticism of writing as in literature, something stemming all the way from antiquity, with Socrates and Plato standing out as important critics of writing.
“This-major-writing will be called writing because it exceeds the logos (of meaning, lordship, presence etc.).
Within this writing – the one sought by Bataille – the same concepts, apparently unchanged in themselves, will be subject to a mutation of meaning, or rather will be struck by (even though they are apparently indifferent), the loss of sense toward which they slide, thereby ruining themselves immeasurably." (Derrida, 1990)
Derrida aims to destroy the boundaries between philosophy and literature. With the CCRU, the boundaries get lost in the creation of a brand-new writing style: theory-fiction. Theory-fiction could be considered a genre of its own, a surreal combination of cyberpunk and Gothic horror. Writings in this style are ambiguous and even their literary meaning is hard to distinguish, yet they are filled with philosophical ideas waiting to be deciphered.
One could imagine this making Derrida proud or jealous.
3.3. Deleuze and Guattari
As far as theoretical influences are concerned, the French pair that coauthored many works, the philosopher Gilles Deleuze and the psychoanalyst and political activist Félix Guattari, are probably the most influential. The CCRU writings aim for what Deleuze and Guattari called schizoanalysis. It is an alternative to what is typically understood as rational thinking, and more in line with the spirit of the time. It embraces the kind of thinking associated with schizophrenics and with people with cluster A personality disorders, which are often associated with schizophrenia. The similarity between a philosopher and a schizophrenic is that they both rely on abstractions and on finding connections between wildly different phenomena. Schizoanalysis takes this connection and runs with it before letting it run loose. Thinking becomes chaotic yet orderly in its own way, within its own logic. Everything becomes rhizomatic.
“Let us summarize the principal characteristics of a rhizome: unlike trees or their roots, the rhizome connects any point to any other point, and its traits are not necessarily linked to traits of the same nature; it brings into play very different regimes of signs, and even nonsign states. The rhizome is reducible neither to the One nor the multiple.” (Deleuze and Guattari, 2005)
A pair of terms of special note are deterritorialization and reterritorialization, the first one referring to the process by which social relations are altered, mutated or destroyed and the second one referring to the process by which new relations emerge. The CCRU was revolutionary in its accelerationist embrace of social change which meant celebrating deterritorialization, whether for its own sake, motivated by a libertarian desire for freedom, or for the sake of better alternatives emerging, maybe even new trees and new metanarratives.
3.4. Baudrillard
The last major figure influencing the CCRU was the French sociologist, philosopher and cultural theorist Jean Baudrillard. The key concepts for him are simulation, simulacra and hyperreality.
Simulation is a process by which reality is replaced with its representation, and what remains are called simulacra.
Baudrillard describes three orders of simulacra, all stemming from the original traditional symbolic order.
“In the first case, the image is a good appearance - representation is of the sacramental order. In the second, it is an evil appearance - it is of the order of maleficence. In the third, it plays at being an appearance - it is of the order of sorcery. In the fourth, it is no longer of the order of appearances, but of simulation.” (Baudrillard 1994)
This fourth case, the third order of simulacra, is the pure simulacrum: something that only ever references itself, with no authentic reality behind it. This is how Baudrillard conceived of the postmodern world. For him, the history of modernity is the history of the disappearance of the real.
However, what is left isn’t the unreal or the false; it is the hyperreal. Baudrillard’s writing is full of references to magic when speaking of traditional societies, and to new technologies, virtual reality, explosions of information, machines conquering humanity, etc. when talking about contemporary societies. This is very much thematically relevant to the CCRU. They weren’t the only ones fascinated with Baudrillard: he was so influential that The Matrix is full of references to his work. However, as opposed to what is depicted in The Matrix, in the hyperreal world there is no real to refer to; there is no exiting the simulation, no escaping the code. But for the CCRU there is hope in the Internet: that from the “Desert of the Real” something new will emerge.
Baudrillard is pessimistic about the changes he observes and only brings up possible solutions to problems in order to refute them, but in the CCRU there is an amor fati present, even if not optimism.
4. Playful and dangerous writing group
From the name Cybernetic Culture Research Unit and basic knowledge of what it is about, one might expect two things: cyberpunk and philosophy. Instead, what one finds is a surreal, drug-fueled collection of writing about Lovecraftian demons, numerology, ghost lemurs of Madagascar preserving the memories of psychic amphibians… And the things one might expect are so enigmatic as to be distorted beyond recognition.
Two things become clear: one is the role of drugs in the CCRU, and the other is that it wasn’t really a philosophical or information science research group at all, but primarily a literary club. The people involved mostly had a background in philosophy and pursued independent careers, some writing in a more psychedelic style that reveals their history in the CCRU and some being more “normal” and understandable.
Which isn’t to say that there is no philosophy to be found in the CCRU, but much of it consists of motifs and sources of inspiration arising from the chaos of collective storytelling and the authors’ common interests and influences.
One important concept related to the CCRU is hyperstition. Hyperstitions are fictions that make themselves real, like how the concept of space travel caused space travel to come into reality. This explains the importance of artistic style for some members. All ideas can be understood as hyperstitions using humans as hosts that bring them into existence. The CCRU often wrote about a fictionalized version of itself. This can be understood as a sort of magic.
5. Prominent figures and their insights
From this literary group emerged strains of thought ranging from the far-right Nick Land to the far-left Mark Fisher and the cyberfeminist Sadie Plant.
5.1. Sadie Plant and cyberfeminism
Plant offers a unique blend of postmodern feminism and the hopes typical of the 90s, visible in films like Hackers. According to her, the transformative power of the Internet lies in the fact that it offers a space without physical bodies. Furthermore, computer technology and programming are inherently feminine and therefore benefit women. Finally, women are treated like machines and because of this share a connection with them, so the emancipation of machines will bring about the emancipation of women.
In some respects, Plant proved prophetic. The Internet greatly improved the visibility of marginalized groups and made the general public more compassionate toward them. In other respects, not so much: the Internet allows all kinds of opinions to prosper, and that certainly includes sexist opinions. But in any case, she certainly offers food for thought about how gender identities are formed and expressed.
5.2. Mark Fisher and blogging
Fisher is most famous for writing about how hard it is for people to imagine an alternative and how capitalism is capable of coopting resistance and creating fake opposition. However, one subject where he was surprisingly optimistic was blogging. Fisher reflected on how doing serious philosophical work (for instance, writing a PhD) can be difficult and depressing, whereas writing a blog is more relaxing; by being less serious it can trick people into doing serious philosophy, and it also offers an interactivity that hasn’t been seen since the days of the Greek agora. The new digital agoras have since also been assimilated into the existing system. In a way there is a contradiction in Fisher’s writing, but the glimmer of hope he saw is important. If it is forgotten, we are not due a better fate than Fisher, who took his own life after struggling with depression.
“I started blogging as a way of getting back into writing after the traumatic experience of doing a PhD. PhD work bullies one into the idea that you can’t say anything about any subject until you’ve read every possible authority on it. But blogging seemed a more informal space, without that kind of pressure. Blogging was a way of tricking myself back into doing serious writing. I was able to con myself, thinking, ‘it doesn’t matter, it’s only a blog post, it’s not an academic paper’. But now I take the blog rather more seriously than writing academic papers.” (Fisher, 2018)
5.3. Nick Land and neo-reaction
For better or worse, the member of the CCRU who is most prominent today is Nick Land. One of the ideas he developed was conceiving of capitalism as an artificial intelligence; but while other authors may hope for this AI to update its software and produce something new, Land seems content to accept that there is no alternative. Land continues either to inspire interpretations of new phenomena on the Internet or to offer new interpretations himself. A significant example of the former is the influence of a combination of the younger Land’s ideas of hyperstition and the older Land’s right-wing political attitudes on the creation of the online theory of meme magick: the idea that Internet memes can influence reality, and that this is why Donald Trump won the 2016 US presidential election in a supernatural way. A significant example of the latter is Land’s philosophy of Bitcoin, which is not only economic but metaphysical as well, using Bitcoin to explain the logical law of identity and to reaffirm the Kantian understanding of space and time.
6. Concluding remarks
The CCRU is relevant because today the Internet is so ingrained in our lives that we don’t even notice it any more, just as fish don’t notice the water they are in. It can prove useful to look back at the time when this technology was new; and if the future did turn out disappointing, we can examine yesterday’s speculations about today to remind ourselves what could have been. Sometimes parts of this writing prove oddly prophetic, and in that case it is good to appreciate what we have, or perhaps to look at it with new eyes. And even where they seem wrong, these writings represent a valiant attempt at doing something new.
7. References
Brent Adkins. 2015. Deleuze and Guattari's A Thousand Plateaus: A Critical Introduction and Guide. Edinburgh University Press, Edinburgh.
Jean Baudrillard. 1994. Simulacra and Simulation. University of Michigan Press, Ann Arbor.
Ccru. 2015. Ccru: Writings 1997-2003. Time Spiral Press.
Mark Fisher and Matt Colquhoun. 2020. Acid Communism. Pattern Books.
Mark Fisher. 2009. Capitalist Realism: Is There No Alternative? John Hunt Publishing.
Mark Fisher. 2018. K-Punk: The Collected and Unpublished Writings of Mark Fisher (2004-2016). Repeater.
Gilles Deleuze and Félix Guattari. 2005. A Thousand Plateaus. University of Minnesota Press, Minneapolis.
Jacques Derrida. 1990. Writing and Difference. Routledge, London.
Nick Land. 2011. Fanged noumena: Collected writings 1987-2007. MIT Press.
Jean-François Lyotard. 2015. Libidinal economy. Bloomsbury Publishing, London.
Jean-François Lyotard. 2005. Postmoderno stanje: Izvještaj o znanju. Ibis-grafika, Zagreb.
Jean-François Lyotard. 1991. The inhuman: Reflections on time. Stanford University Press.
Sadie Plant. 1997. Zeros and ones: Digital women and the new technoculture. Fourth Estate, London.
Referencing the Public by Populist and Non-Populist Parties
in the Slovene Parliament
Darja Fišer*+, Tjaša Konovšek*, Andrej Pančur*
*Institute of Contemporary History
Privoz 11, SI-1000 Ljubljana
darja.fiser@inz.si
tjasa.konovsek@inz.si
andrej.pancur@inz.si
+Faculty of Arts, University of Ljubljana
Aškerčeva 2, SI-1000 Ljubljana
1. Introduction
In the last two decades, political reality in many democratic countries in Europe as well as around the globe has witnessed an increase in active populist political parties and a rise in their popularity among citizens. Parallel to the spread of populism, political science and sociological analyses note a clear difference between the discourses of members of populist and non-populist parties, especially in their use of social and other media. However, less is known about the relationship between populist and non-populist discourses in the speeches of members of parliament (MPs) in parliamentary democracies, in which parliaments are the central representative, legislative and controlling state institutions. This contribution aims at suggesting a model for such an analysis. The proposed analysis is built around two key concepts. First, we use the concept of the life-world to acknowledge the existence of a specific reality of MPs in which their speech is made. Second, we draw on the existing typology of populist and non-populist parties created by political scientists and sociologists to see how MPs from the two groups of political parties, i.e. populist and non-populist, construct their view of the public. The goal of the analysis is to detect any differences between populist and non-populist discourse observed through the lens of their references to the general public.
2. Approach and methodology
To further investigate the connection between the speech of MPs, their image of the public, and their populist or non-populist origin, we combine the cultural history of parliamentarianism with corpus linguistics. From a historical perspective, we draw on recent developments in political history, focusing on the cultural side of the history of parliamentarism (Aerts, 2019; Gjuričová and Zahradníček, 2018; Gašparič, 2012; Schulz and Wirsching, 2012; Ihalainen et al., 2016). For this purpose, we use the concept of the life-world (or Lebenswelt). The concept originated in philosophy (Husserl, 1962; Habermas, 2007) and has been used in historiography to emphasize the circumstances in which parliamentarianism is experienced, focusing on MPs as historical actors (Gjuričová et al., 2014). The approach brings to the fore research questions about MPs' perceptions, education and expectations; their political socialization, prior experiences and everyday life; and the influence of collective opinions, public images and the media on their work. In this paper, we focus on one aspect of MPs' life-world, namely their relationship to their counterpart, the public, through the words they choose to use, which, in turn, reveals a part of their self-understanding.
In the framework of the life-world, we further distinguish between populist and non-populist parties on two axes. First, based on the content of the political parties, we draw on existing research to determine which Slovenian political parties qualify as populist. Second, on the temporal axis, we acknowledge the break of 2004 as the year that witnessed the active beginnings of modern populism in the Slovene political space (Fink Hafner, 2019; Frank and Šori, 2015; Fabijan and Ribać, 2021; Campani and Pajnik, 2017; Šori, 2015; Hadalin, 2020; Hadalin, 2021; Lovec, 2019; Pajnik, 2019). We take into account the difference between modern populist parties, as they emerged in the last decade and a half, and their immediate precursors, which have existed since the early 1990s. Therefore, the analysis counts the Slovenian Democratic Party (SDS) and its predecessor, the Social Democratic Party of Slovenia (SDSS), New Slovenia (NSi) and the Slovenian National Party (Slovenska nacionalna stranka, SNS) as populist parties, while all the others are classified as non-populist.
3. Analysis
The analysis is based on the Slovenian parliamentary corpus (1990–2018) siParl 2.0 (Pančur et al., 2020). We take into account the time span from 1992, when the first term of the Slovenian parliament started, until 2018, when the seventh term ended. The time frame thus includes some important events that affected the development of Slovenian political parties and their governing style, such as Slovenia's accession to the European Union in 2004 (Gašparič, 2012), the global financial crisis in 2007 and 2008, and the migrant crisis in 2015 (Moffitt, 2014). Using the typology advocated by sociologists and political scientists (see Section 2), we created subcorpora of populist and non-populist political parties for each parliamentary term, resulting in a total of 14 subcorpora. The subcorpora ranged between just under a million tokens in Term1 and 12 million tokens in Term7 for the populist parties, and between 7 million tokens in Term1 and just under 15 million tokens in Term7 for the non-populist parties.
The next step presented a challenge, as there are no pre-existing wordlists of references to the general public that we could rely on. We therefore generated frequency lists of nouns for each subcorpus and manually selected those that refer to the public in the broadest sense (e.g. person, citizen, inhabitant) from the 1,000 most frequent nouns in each subcorpus. We only took into account the nouns that can only refer to people (groups or individuals), disregarding those that can also be used for institutions (e.g. association) or objects (e.g. school). We also checked their usage via concordance search and discarded the expressions that could potentially be used for the general public but in this specific corpus predominantly refer to the MPs, the government or their staff (e.g. proposer).
As can be seen in Table 1, this yielded a total of 86 unique nouns, with a total absolute frequency of 359,320 and a relative frequency of 7,322.53 per million tokens for the populist parties, and a total absolute frequency of 524,195 and a relative frequency of 6,788.74 per million tokens for their non-populist counterparts. Most of the nouns (69) are shared between both party groups (e.g. human), in addition to 10 that are unique to the populist MPs (e.g. Croat) and 7 that are specific to the non-populist MPs (e.g. stakeholder).
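For concreteness, the following minimal Python sketch reproduces the arithmetic behind Table 1 below: relative frequencies are per million tokens, and the P:N ratio compares the two relative frequencies. The token totals and the two example lemma counts are taken from the table; the function and variable names are ours.

```python
POP_TOKENS = 49_070_504       # populist subcorpora, Terms 1-7
NONPOP_TOKENS = 77_215_381    # non-populist subcorpora, Terms 1-7

def rel_freq(abs_freq: int, corpus_size: int) -> float:
    """Relative frequency per million tokens."""
    return abs_freq / corpus_size * 1_000_000

# (populist AF, non-populist AF) for two lemmas from Table 1
examples = {"javnost": (16_248, 22_367), "oče": (929, 329)}

for lemma, (af_pop, af_non) in examples.items():
    rf_pop = rel_freq(af_pop, POP_TOKENS)      # javnost -> 331.12
    rf_non = rel_freq(af_non, NONPOP_TOKENS)   # javnost -> 289.67
    print(f"{lemma}: P:N = {rf_pop / rf_non:.2f}")  # javnost 1.14, oče 4.44
```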
POPULIST1-7: #tokens 49,070,504; #lemmas 76
NON-POPULIST1-7: #tokens 77,215,381; #lemmas 74

LEMMA  AF (pop)  RF (pop)  AF (non-pop)  RF (non-pop)  P:N ratio

Lemmas used only by populist MPs:
Hrvat  1,341  27.33  0  0.00  /
žena  397  8.09  0  0.00  /
Avstrijec  318  6.48  0  0.00  /
diplomant  300  6.11  0  0.00  /
storilec  232  4.73  0  0.00  /
volilec  161  3.28  0  0.00  /
delojemalec  36  0.73  0  0.00  /
neslovenec  31  0.63  0  0.00  /
svojec  27  0.55  0  0.00  /
delavka  0  0.00  0  0.00  /

Lemmas used only by non-populist MPs:
deležnik  0  0.00  1,784  23.10  /
prejemnik  0  0.00  1,191  15.42  /
najemnik  0  0.00  983  12.73  /
dolžnik  0  0.00  752  9.74  /
vajenec  0  0.00  444  5.75  /
kadilec  0  0.00  290  3.76  /
krajan  0  0.00  172  2.23  /

Joint lemmas (sorted by P:N ratio):
oče  929  18.93  329  4.26  4.44
obrtnik  1,187  24.19  540  6.99  3.46
davkoplačevalec  4,762  97.04  2,178  28.21  3.44
migrant  2,627  53.54  1,255  16.25  3.29
vlagatelj  426  8.68  260  3.37  2.58
podjetnik  3,880  79.07  2,671  34.59  2.29
moški  827  16.85  619  8.02  2.10
ljudstvo  3,089  62.95  2,376  30.77  2.05
Italijan  272  5.54  216  2.80  1.98
Slovenka  1,432  29.18  1,143  14.80  1.97
pacient  1,619  32.99  1,452  18.80  1.75
zamejstvo  1,067  21.74  966  12.51  1.74
kmet  6,839  139.37  6,739  87.28  1.60
prijatelj  1,024  20.87  1,012  13.11  1.59
naročnik  517  10.54  516  6.68  1.58
Slovenec  10,103  205.89  11,090  143.62  1.43
dijak  2,403  48.97  2,670  34.58  1.42
kupec  1,216  24.78  1,357  17.57  1.41
državljan  21,570  439.57  24,828  321.54  1.37
priča  4,061  82.76  4,701  60.88  1.36
državljanka  6,902  140.65  8,372  108.42  1.30
narod  4,952  100.92  6,035  78.16  1.29
žrtev  3,945  80.39  4,810  62.29  1.29
sosed  738  15.04  928  12.02  1.25
človek  68,517  1,396.30  86,824  1,124.44  1.24
Rom  627  12.78  808  10.46  1.22
bolnik  1,279  26.06  1,717  22.24  1.17
prosilec  343  6.99  468  6.06  1.15
javnost  16,248  331.12  22,367  289.67  1.14
starš  5,732  116.81  7,893  102.22  1.14
oseba  16,836  343.10  23,762  307.74  1.11
subjekt  3,406  69.41  4,866  63.02  1.10
družina  11,120  226.61  16,298  211.07  1.07
otrok  18,205  371.00  26,762  346.59  1.07
gost  966  19.69  1,438  18.62  1.06
begunec  1,247  25.41  1,879  24.33  1.04
mladina  1,384  28.20  2,101  27.21  1.04
delničar  444  9.05  684  8.86  1.02
tujec  3,169  64.58  4,908  63.56  1.02
zavarovanec  896  18.26  1,394  18.05  1.01
volivec  3,478  70.88  5,544  71.80  0.99
lastnik  8,031  163.66  12,814  165.95  0.99
mati  320  6.52  512  6.63  0.98
družba  23,431  477.50  38,532  499.02  0.96
študent  4,973  101.34  8,202  106.22  0.95
posameznik  7,367  150.13  12,307  159.39  0.94
zavezanec  2,437  49.66  4,096  53.05  0.94
uporabnik  3,441  70.12  5,866  75.97  0.92
nosilec  2,211  45.06  3,812  49.37  0.91
občan  1,558  31.75  2,688  34.81  0.91
prebivalec  5,318  108.37  9,404  121.79  0.89
partner  4,580  93.34  8,312  107.65  0.87
potrošnik  1,657  33.77  3,060  39.63  0.85
generacija  2,279  46.44  4,215  54.59  0.85
delavec  10,768  219.44  20,055  259.73  0.84
invalid  3,032  61.79  5,760  74.60  0.83
prebivalstvo  2,727  55.57  5,452  70.61  0.79
manjšina  2,742  55.88  5,518  71.46  0.78
učenec  1,437  29.28  3,071  39.77  0.74
ženska  2,941  59.93  6,517  84.40  0.71
upokojenec  3,547  72.28  8,097  104.86  0.69
skupnost  16,208  330.30  38,163  494.24  0.67
pripadnik  1,375  28.02  3,238  41.93  0.67
upravičenec  1,673  34.09  4,523  58.58  0.58
upnik  566  11.53  1,725  22.34  0.52
podpisnik  465  9.48  1,460  18.91  0.50
udeleženec  500  10.19  1,685  21.82  0.47
porabnik  129  2.63  540  6.99  0.38
populacija  480  9.78  2,179  28.22  0.35

Total  359,320  7,322.53  524,195  6,788.74  1.08
Table 1: Specific and shared public-related nouns identified in the subcorpora of populist and non-populist speeches, with their absolute frequencies (AF), relative frequencies per million tokens (RF) and the populist-to-non-populist usage ratio.
The list of populist-specific nouns contains words describing people according to their background (e.g. Austrian, non-Slovenian), family role (e.g. relative, wife) and employment status (e.g. female worker, employee). The non-populist-specific nouns are expressions describing the role or status of a person in an administrative or legal procedure (e.g. stakeholder, recipient), a business transaction (e.g. tenant, debtor), origin (e.g. local), education (e.g. apprentice) or health status (e.g. smoker).
Among the joint nouns, father, craftsman, taxpayer and migrant are used three times more frequently by populist MPs, whereas beneficiary, participant, consumer and population are used more than twice as frequently by non-populist MPs. Insurance holder, voter and owner are used nearly identically by both groups of MPs. This might reflect a difference between the populist and non-populist parties in their political base: while the former usually rally voters from rural areas, the latter are traditionally more successful in urban areas.
                          T1         T2          T3          T4          T5          T6         T7          Total
Populist #tokens          950,851    4,917,224   7,291,606   8,607,268   8,598,006   6,622,380  12,083,169  49,070,504
Populist "public" AF      6,204      27,738      49,606      68,971      57,041      48,881     100,879     359,320
Populist "public" RF      6,525      5,641       6,803       8,013       6,634       7,381      8,349       7,323
Non-populist #tokens      7,323,569  11,387,486  8,838,299   14,394,700  11,452,223  8,869,712  14,949,392  77,215,381
Non-populist "public" AF  48,446     58,100      52,118      91,254      84,878      67,310     122,089     524,195
Non-populist "public" RF  6,615      5,102       5,897       6,339       7,411       7,589      8,167       6,789
P-value                   0.3059     2.54E-43    6.61E-116   0           8.25E-94    2.81E-03   2.01E-07    1.41E-269
Chi2 test                 1.0482     190.4453    523.7064    2181.3538   422.1633    21.9444    27.0286     1230.5394
Significant               NO         YES         YES         YES         YES         YES        YES         YES
Table 2: Absolute and relative frequency (per million tokens) of public-related words as used by populist and non-populist MPs per parliamentary term, with statistical significance tests.
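A minimal sketch of the per-term significance test behind Table 2, assuming a standard 2x2 chi-square test over "public" vs. other tokens; the paper's footnote points to the calculator at https://www.korpus.cz/calc/, and scipy yields comparable values. The function name public_ref_test is ours.

```python
from scipy.stats import chi2_contingency

def public_ref_test(af_pop, tokens_pop, af_non, tokens_non):
    # 2x2 contingency table: "public" nouns vs. all other tokens,
    # populist vs. non-populist subcorpus.
    table = [[af_pop, tokens_pop - af_pop],
             [af_non, tokens_non - af_non]]
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    return chi2, p

# Term1 values from Table 2 -> chi2 ~ 1.05, p ~ 0.31 (not significant).
print(public_ref_test(6_204, 950_851, 48_446, 7_323_569))
```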
[Figure 1: line chart of relative frequencies (per million tokens) per parliamentary term T1–T7, y-axis 0–9000, with series Populist, Non-populist and Combined.]
Figure 1: Relative frequency of nouns referring to the public in speeches of MPs from populist and non-populist political parties in the Slovene parliament, 1992–2018, by parliamentary term.
As can be seen from Table 2 and Figure 1, we observe a steady general upward trend in the use of nouns describing the public for both populist and non-populist parties over time. For all terms combined, populist MPs refer to the public statistically significantly more frequently than their non-populist counterparts (P-value 1.41E-269, Chi2 test 1230.5394)1, which confirms our main hypothesis. For all MPs combined, the only, and quite substantial, drop in the frequency of references to the public can be observed from Term1 to Term2, which could be attributed to the early stages of the formation of the Slovenian political space. Especially in Term1, the MPs faced many questions about establishing the workings of the new parliament itself. It took time before a new normality of parliamentary work was established and the MPs began to address the public more. While the early Slovene political transition exhibited a general consensus about the need to strengthen parliamentary democracy, the period after that has been much less clear-cut, which could account for the increase in references to the public by the MPs, who had to search for new content for policy-making.
1 https://www.korpus.cz/calc/
POVZETKI
245
ABSTRACTS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
As for individual terms, populist MPs refer to the public statistically significantly more often in Terms 2–4 and 7, with Term4 as the biggest outlier, while the opposite is true of Terms 5–6, with Term5 as the biggest outlier. In Term1, non-populist MPs use more public-denominating expressions, but the difference is not statistically significant. Terms 2–3 can be interpreted as the period of formation of the populist parties (1992–2004), with Term4 being the first parliamentary term under a populist (SDS-led) government. In turn, Term7 (2014–2018) could suggest a second wave of growing populist-party power in the face of a crisis of the non-populist parties.
In Terms 5–6, when references to the general public prevailed in what sociologists and political scientists refer to as the non-populist discourse, the Slovenian political space witnessed the emergence of numerous new political parties, many of which entered the parliament, which influenced the relation between populist and non-populist discourse. Due to the safeguards in parliamentary procedures which ensure equal opportunity of participation for opposition MPs regardless of their number, the speeches of MPs might also be influenced by the existence of populist- and non-populist-led governments and the strength of the populist and non-populist parties in the parliament at the time. While party strength is usually measured by the number of seats in the parliament, many more factors influence it, which makes the correlation between the number of seats, coalition and opposition roles, and party strength challenging (Sartori, 2005; Krašovec, 2000).
4. Discussion
While the results do confirm our initial hypothesis that populist parties refer to the public more, the difference between the two blocs appears to be smaller than the current findings of studies in sociology and political science suggest. Whereas research in these two fields mainly focuses on the speech of members of populist parties in (selected) television interviews, on social media and in other, less rigid environments, this contribution took into account all the speeches of MPs in the Slovenian parliament, which is a highly institutionalized and regulated environment that probably allows for less differentiation between MPs of different political orientations. Our results show that the shared life-world of MPs, marked by their common experience, social forms, norms and a shared dialogue in plenary sessions, provides an environment with a strong unifying factor. Although there is little doubt that the political parties themselves decisively differ from one another, the power of the institution, its rigidity and specificity, as well as the MPs' awareness of the target audience and reach of their speeches, proved to be decisive factors in how MPs speak about the public.
According to political scientists and historians, the political space in Slovenia has been increasingly polarized since 1992. Again, our results show a somewhat more nuanced picture: while a growing difference between populist and non-populist discourse can be observed in Terms 2–4, the gap narrows in Terms 5–7. This challenges the dominant narrative of the Slovenian political space. The record high frequency of references to the public by populist MPs in Term4 coincides with SDS winning the 2004 election for the first time after 1992, immediately after the party underwent its populist transformation in 2003. In Term5, SDS witnessed a backlash, with the non-populist coalition prevailing, while one of the populist parties, NSi, did not even reach the parliamentary threshold.
The general public as well as the media frequently refer to several of the more recent parties, such as Levica, as populist as well. While these parties do exhibit a certain populist appeal, their content, attitudes towards experts and state institutions, as well as their actions in the parliament place them in the non-populist spectrum, with Levica gravitating more towards democratic socialism (Toplišek, 2019) than towards the category of populism as defined by Mudde (2005, 2007), which was the theoretical framework of this study. Another methodological issue is temporality: the modern populist shift is a phenomenon of the 21st century; thus, the decade after 1992, included in our analysis, requires a separate interpretation and can only be understood as a preface to the later populist shift (Fuentes, 2020).
5. Acknowledgments
The work described in this paper was funded by the Slovenian Research Agency research programme P6-0436: Digital Humanities: resources, tools, and methods (2022–2027) and No. P6-0281: Political History, the CLARIN ERIC ParlaMint project (https://www.clarin.eu/parlamint) and the DARIAH-SI research infrastructure.
6. Bibliography
Adéla Gjuričová and Tomáš Zahradníček. 2018. Návrat parlamentu. Česi a Slováci ve Federálním shromáždění. Argo.
Adéla Gjuričová, Andreas Schulz, Luboš Velek, and Andreas Wirsching, eds. 2014. Lebenswelten von Abgeordneten in Europa 1860–1990. Droste Verlag.
Alen Toplišek. 2019. Between populism and socialism: Slovenia’s Left party. In: Giorgos Katsambekis and Alexandros Kioupkiolis, eds. The Populist Radical Left in Europe. Routledge, Taylor & Francis Group.
Alenka Krašovec. 2000. Moč v političnih strankah: odnosi med parlamentarnimi in centralnimi deli političnih strank. Fakulteta za družbene vede.
Ana Frank and Iztok Šori. 2015. Normalizacija rasizma z jezikom demokracije: primer Slovenske demokratske stranke. Časopis za kritiko znanosti, 43(260):89–103.
Andreas Schulz and Andreas Wirsching, eds. 2012. Parlamentarische Kulturen in Europa. Das Parlament als Kommunikationsraum. Droste Verlag.
Benjamin Moffitt. 2015. How to Perform Crisis: A Model for Understanding the Key Role of Crisis in Contemporary Populism. Government and Opposition, 50(2):189–217.
Cas Mudde, ed. 2005. Racist Extremism in Central and Eastern Europe. Routledge.
Cas Mudde. 2007. Populist radical right parties in Europe. Cambridge University Press.
Danica Fink Hafner. 2019. Populizem. Fakulteta za družbene vede, Založba FDV.
Edmund Husserl. 1962. Die Krisis der europäischen Wissenschaften und die transzendentale Phänomenologie: eine Einleitung und die phänomenologische Philosophie. M. Nijhoff.
Emanuela Fabijan and Marko Ribać. 2021. Politični in medijski populizem v televizijskem političnem intervjuju. Social Science Forum, 37(98):43-68.
Giovanna Campani and Mojca Pajnik. 2017. Populism in historical perspectives. In: Gabriella Lazaridis and Giovanna Campani, eds. Understanding the populist shift: othering in a Europe in crisis, pages 13–30. Routledge, Taylor & Francis Group.
Giovanni Sartori. 2005. Parties and party systems: a framework for analysis. ECPR.
Iztok Šori. 2015. Za narodov blagor: skrajno desni populizem v diskurzu stranke Nova Slovenija.
Časopis za kritiko znanosti, 43(260):104–117.
Juan Francisco Fuentes. 2020. Populism. Contributions to the History of Concepts, 15(1):47–68.
Jure Gašparič. 2012. Državni zbor 1992–2012: o slovenskem parlamentarizmu. Inštitut za novejšo zgodovino.
Jürgen Habermas. 2007. The Theory of Communicative Action. Vol. 2, Lifeworld and system: a critique of functionalist reason. Polity Press.
Jurij Hadalin. 2020. Straight Talk. The Slovenian National Party's Programme Orientations and Activities. Contributions to Contemporary History, 60(2). https://doi.org/10.51663/pnz.60.2.10.
Jurij Hadalin. 2021. What Would Henrik Tuma Say? From The Social Democratic Party of Slovenia to the Slovenian Democratic Party. Contributions to Contemporary History, 61(3). https://doi.org/10.51663/pnz.61.3.10.
Marko Lovec, ed. 2019. Populism and attitudes towards the EU in Central Europe. Ljubljana: Faculty of Social Sciences.
Mojca Pajnik. 2019. Media Populism on the Example of Right-Wing Political Parties’ Communication in Slovenia. Problems of Post-Communism, 66(1):21–32.
Andrej Pančur, Tomaž Erjavec, Mihael Ojsteršek, Mojca Šorn, and Neja Blaj Hribar. 2020. Slovenian parliamentary corpus (1990–2018) siParl 2.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1300.
Pasi Ihalainen, Cornelia Ilie, and Kari Palonen, eds. 2016. Parliament and Parliamentarism. A Comparative History of a European Concept. Berghahn.
Remieg Aerts, ed. 2019. The ideal of parliament in Europe since 1800. Palgrave Macmillan.
Machine Learning Approaches to Automatic
Slovenian Grapheme-to-Phoneme Conversion
Janez Križaj*, Simon Dobrišek*, Aleš Mihelič†, Jerneja Žganec Gros†
*Laboratory of Machine Intelligence, Faculty of Electrical Engineering, University of Ljubljana, Tržaška cesta 25, 1000 Ljubljana, Slovenia
janez.krizaj@fe.uni-lj.si, simon.dobrisek@fe.uni-lj.si
†Alpineon razvoj in raziskave, d. o. o., Ulica Iga Grudna 15, 1000 Ljubljana, Slovenia
jerneja.gros@alpineon.si, ales.mihelic@alpineon.si
1 Introduction
Grapheme-to-phoneme conversion refers to converting the orthographic (letter-based) written forms of words in a given language into their phonemic transcriptions or representations. The inventory of basic graphemic units, understood as the basic units of the writing system and used in the written forms of words, is normally determined by the orthography of the given language, and this also holds for Slovenian (SAZU, 1990). The basic graphemic units are called graphemes, while their visually distinct written symbolic realizations, such as upper- and lower-case letters, are called allographs.
The phoneme inventory, on the other hand, is determined primarily by the phonological criterion of meaning-distinguishing auditory contrast. Graphemes and phonemes are, as basic units, related to a certain extent, but in grapheme-to-phoneme conversion several consecutive letters of a written word may also map to a single phoneme. Moreover, converting written word forms into their phonemic transcriptions is not based only on some small set of basic rules: in spoken Slovenian there are many exceptions that do not follow the basic rules (Toporišič, 2000).
In language technology development, automatic grapheme-to-phoneme conversion procedures are used both in building automatic speech recognizers and in text-to-speech synthesis systems (Žganec Gros et al., 2016). Within the research and development project Development of Slovene in the Digital Environment (Razvoj slovenščine v digitalnem okolju; RSDO, 2020), we implemented and evaluated several different established automatic grapheme-to-phoneme conversion procedures, applied to the written forms of Slovenian words. We tested and evaluated three selected procedures that have become established in the last few years; they are briefly described below. For testing and evaluating the selected procedures we used the set of words from the Slovenian lexicon Sloleks 2.0 (Dobrovoljc et al., 2019). The word set was split in different ways into a training and a test set, which were then used for machine learning and for testing the selected automatic grapheme-to-phoneme converters.
2 Evaluated approaches
The literature describes many different procedures for automatic grapheme-to-phoneme conversion of written words. Older procedures typically perform the conversion on the basis of predefined grammatical rules (Black et al., 1998). Their main drawback is the time-consuming manual design of rules, which requires expertise in linguistics and phonetics and must also include a list of exceptions with various pronunciation peculiarities. Among later proposals, conversion with joint-sequence models (Bisani and Ney, 2008) became established; these align the grapheme sequence with the phoneme sequence to form special joint units called graphones. Graphone sequences are then modelled with n-gram language models, implemented as weighted finite-state transducers, which allow grapheme-to-phoneme predictions for words that were not part of the training set.
Novak et al. (2015) based the development of their grapheme-to-phoneme converter on weighted finite-state transducer models and proposed a conversion procedure built on a modified expectation-maximization method for aligning grapheme strings with phoneme strings, together with several decoding procedures, including a language model based on recurrent neural networks.
Yolchuyeva et al. (2019) achieved high grapheme-to-phoneme conversion accuracy using a deep model known as the transformer. These models have an encoder-decoder architecture with an added attention mechanism, which helps the model learn the dependencies between training pairs of grapheme and phoneme strings; this results in both faster training and more reliable conversion of test grapheme strings into the corresponding phoneme strings.
3 Quantitative evaluation
For the quantitative evaluation of the examined grapheme-to-phoneme conversion procedures, we used their implementations in freely available software libraries. The procedure proposed by Bisani and Ney (2008) was implemented with the Sequitur tool1; the procedure of Novak et al. (2015) is implemented in the Phonetisaurus tool2; and for evaluating the method of Yolchuyeva et al. (2019) we used the Deep Phonemizer software3.
1 https://github.com/sequitur-g2p/sequitur-g2p
2 https://github.com/AdolfVonKleist/Phonetisaurus
3 https://github.com/as-ideas/DeepPhonemizer
For building and testing all the examined models and running their training, we used the manually validated part of the Slovenian lexicon Sloleks 2.0 (Dobrovoljc et al., 2019), which, besides the individual word forms, also contains information about their base forms (lemmas) as well as their phonemic or phonetic transcriptions. The validated part of Sloleks 2.0 used in our experiments contains 646,994 individual word forms, i.e. 62,729 lemmas. During testing we noticed that the results depend considerably on how the set of available grapheme-to-phoneme converted words is divided into the training and test parts. We therefore carried out two different splits of the full word set into a training set containing 90% of the dictionary words and a test set containing the remaining 10%. In the random split, labelled "RandomSplit" below, the division was made entirely at random using the system random generator. In the split based on assigning words to the training or test set according to their lemmas, we made sure that the test set contains no words that differ from words in the training set only in their endings, which is often the case for words sharing a lemma. In addition, we ensured that the lemmas of the test-set words differ by at least three letters from their most similar lemmas in the training set. This split is labelled "LemmaSplit" below.
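A minimal sketch of the "LemmaSplit" idea described above, not the authors' actual code: word forms are grouped by lemma, and a lemma enters the test set only if it is at least three edits away from every lemma kept for training. All names are illustrative, and the greedy assignment is only an approximation of the published procedure.

```python
import random

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lemma_split(lemma2words, test_frac=0.1, min_dist=3, seed=0):
    """Greedy approximation: inflected forms of one lemma never straddle the
    split, and every test lemma is >= min_dist edits from each train lemma
    seen so far (quadratic, but fine for a sketch)."""
    lemmas = list(lemma2words)
    random.Random(seed).shuffle(lemmas)
    n_test = int(test_frac * len(lemmas))
    train, test = [], []
    for lem in lemmas:
        if len(test) < n_test and all(edit_distance(lem, t) >= min_dist for t in train):
            test.append(lem)
        else:
            train.append(lem)
    return ([w for l in train for w in lemma2words[l]],
            [w for l in test for w in lemma2words[l]])
```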
During the experiments we found that, as expected, the result also depends considerably on the phoneme inventory used in the grapheme-to-phoneme conversion. When building automatic speech recognizers, long and short vowels, or stressed and unstressed vowels, are normally not distinguished, since this distinction is not relevant for speech recognition under the meaning-distinguishing criterion for determining phonemic units. The distinction is, however, important when building text-to-speech systems, where the prosodic characteristics of synthetic speech depend on information about stressed and unstressed vowels in words. In line with these assumptions, we additionally prepared the training and test sets in two variants, according to which basic phonemic units were taken into account. Below, the label ASR denotes the variant suitable for automatic speech recognizers, based on only 34 basic phonemic units or phoneme variants, while the label TTS denotes the variant suitable for text-to-speech systems, based on 39 basic phonemic units. The larger number of phonemic units results from distinguishing long and short, i.e. stressed and unstressed, vowels. The results presented below confirmed the expectation that for Slovenian it is hardest to automatically predict the position of stress in words, i.e. the stressed vowels: Slovenian word stress has a great many exceptions that do not follow any smaller, more general set of basic stress-assignment rules.
The performance of the automatic grapheme-to-phoneme conversions is reported below as the percentage of incorrectly converted words (word error rate, WER) and the percentage of incorrectly converted phonemic units (phoneme error rate, PER). As the table shows, the expectations regarding the different splits of the word set and the treatment of stressed and unstressed vowels were indeed confirmed. With the random split the results are substantially better than with the lemma-based split, since with the random split the test set may contain words that differ from the most similar words in the training set
only in their ending or prefix. The results with the larger phoneme inventory, which distinguishes long and short vowels (label TTS), are likewise, as expected, considerably worse than with the smaller inventory that ignores this distinction (label ASR). This confirms existing findings that automatically predicting the position of stress in Slovenian words is genuinely difficult (Žganec Gros et al., 2016).
Tool                        Lexicon split    WER [%]  PER [%]
Sequitur                    ASR_RandomSplit  16.5     1.9
(Bisani and Ney, 2008)      ASR_LemmaSplit   25.4     2.9
                            TTS_RandomSplit  17.3     2.2
                            TTS_LemmaSplit   50.2     7.4
Phonetisaurus               ASR_RandomSplit  1.0      0.1
(Novak et al., 2015)        ASR_LemmaSplit   14.1     1.6
                            TTS_RandomSplit  2.0      0.3
                            TTS_LemmaSplit   29.1     4.1
Deep Phonemizer             ASR_RandomSplit  1.1      0.1
(Yolchuyeva et al., 2019)   ASR_LemmaSplit   8.6      0.9
                            TTS_RandomSplit  1.7      0.3
                            TTS_LemmaSplit   16.1     2.6
Table 1: Grapheme-to-phoneme conversion performance of the evaluated approaches.
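As a concrete reading of the reported metrics, the sketch below computes WER and PER from (reference, hypothesis) phoneme-string pairs: WER is the share of words with any conversion error, PER the total edit distance relative to the number of reference phonemes. The helper names and the toy transcriptions are ours, not taken from the paper.

```python
def levenshtein(a, b):
    # Edit distance between two phoneme strings, via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def g2p_scores(pairs):
    """pairs: (reference, hypothesis) phoneme strings, one pair per word."""
    wrong = sum(ref != hyp for ref, hyp in pairs)
    edits = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    ref_len = sum(len(ref) for ref, _ in pairs)
    return 100 * wrong / len(pairs), 100 * edits / ref_len  # WER %, PER %

# Toy example (invented transcriptions): one of two words is wrong in one
# of its five phonemes -> WER 50.0, PER 10.0.
print(g2p_scores([("gOvOr", "gOvOr"), ("mOstU", "mOstI")]))
```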
4 Conclusion
This paper presents the results of implementing and testing various automatic grapheme-to-phoneme converters for Slovenian. Based on our findings, users of such converters for building automatic speech recognizers can expect approximately 91% correct conversion of words that are not included in existing Slovenian lexicons. For building text-to-speech systems, where the correct placement of stress matters, only approximately 84% correct conversion can be expected.
Acknowledgements
The presented work was partly funded by the Ministry of Culture and the European Regional Development Fund within the project RSDO (Development of Slovene in the Digital Environment), by the Slovenian Research Agency within the applied research project L7-9406 OptiLEX, and by ARRS within the research programme Metrology and Biometric Systems (P2-0250).
References
Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.
Alan W. Black, Kevin Lenzo and Vincent Pagel. 1998. Issues in Building General Letter to Sound Rules. In: Proceedings of the 3rd ESCA Workshop on Speech Synthesis, pages 77–80.
Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik and Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1230.
Josef R. Novak, Nobuaki Minematsu and Keikichi Hirose. 2015. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 22(6):907–938.
RSDO – Development of Slovene in the Digital Environment (Razvoj slovenščine v digitalnem okolju). 2020. https://www.slovenscina.eu/.
SAZU – Slovenian Academy of Sciences and Arts. 1990. Slovenski pravopis 1: Pravila [Slovenian Orthography 1: Rules]. Državna založba Slovenije, Ljubljana.
Jože Toporišič. 2000. Slovenska slovnica. Založba Obzorja, Maribor.
Sevinj Yolchuyeva, Géza Németh and Bálint Gyires-Tóth. 2019. Transformer Based Grapheme-to-Phoneme Conversion. In: Proceedings of Interspeech 2019, pages 2095–2099, Graz, Austria.
Jerneja Žganec Gros, Boštjan Vesnicer, Simon Rozman, Peter Holozan and Tomaž Šef. 2016. Sintetizator govora za slovenščino eBralec [The eBralec speech synthesizer for Slovenian]. In: Proceedings of the Language Technologies and Digital Humanities Conference, pages 180–185, Ljubljana, Slovenia.
Aligning Audio Recordings with Transcriptions
of Dialectal Speech and Singing
Matija Marolt, Mark Žakelj, Alenka Kavčič, Matevž Pesek
Faculty of Computer and Information Science, University of Ljubljana
Večna pot 113, 1000 Ljubljana
matija.marolt@fri.uni-lj.si
1 Introduction
This abstract presents a system for aligning audio recordings of Slovenian speech with the corresponding transcriptions at the word level. In developing the system, we were particularly interested in its usefulness for aligning dialectal speech and singing, since automatic speech recognition performs unreliably on such recordings, with many errors. Accurate automatic alignment of recordings and transcriptions can thus help us analyse dialect corpora and prepare new annotated data for training recognizers. In this abstract we present the alignment system and compare the quality of alignment for non-dialectal and dialectal speakers.
We also analyse the alignment quality for dialectal singing using a system trained on speech only. Since singing can differ greatly from speech (added accompaniment, polyphonic singing, long tones, ...), we restrict ourselves here to unaccompanied monophonic singing, which is the most similar to speech.
2 Alignment system
The system for aligning recordings and transcriptions consists of three main components:
• segmentation of the recording, which splits the whole recording into several shorter parts and at the same time removes noise and silence;
• speech recognition, which produces an approximate textual transcription from the audio signal;
• alignment, which assigns each word in the original text a position in the obtained transcription and thus also a time of occurrence.
2.1 Recording segmentation
Segmentation is based on Google's WebRTC VAD algorithm1, which is fast, robust and often used in practice. With this algorithm we can classify each time frame as speech or background. The robust segmentation algorithm follows the source code used in the DeepSpeech system (Hilleman et al., 2018). WebRTC VAD has a tunable aggressiveness parameter that can take values between 0 and 3. We set the parameter to 2, which gave segments short enough that the decoding process in speech recognition did not take too long.
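A minimal sketch of the frame-level classification step with the py-webrtcvad binding of Google's WebRTC VAD. The sample rate, frame length and function name are assumptions for illustration; the abstract does not specify them.

```python
import webrtcvad

vad = webrtcvad.Vad(2)                      # aggressiveness 2, as in the text
SAMPLE_RATE = 16_000                        # assumed; VAD allows 8/16/32/48 kHz
FRAME_BYTES = SAMPLE_RATE * 30 // 1000 * 2  # 30 ms of 16-bit mono samples

def speech_flags(pcm: bytes):
    """Yield True for frames classified as speech, False for background."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield vad.is_speech(pcm[off:off + FRAME_BYTES], SAMPLE_RATE)
```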
2.2 Speech recognition
Speech recognition is implemented in two parts: 1) a deep acoustic model is used to obtain the probabilities of individual characters for each time frame, and 2) the model output is decoded to obtain the final transcription.
The data for training the acoustic model was obtained from various sources: Gos (Zwitter et al., 2013), Gos VideoLectures (Videolectures, 2019), CommonVoice2, SiTEDx (Žgank et al., 2016), Sofes (Dobrišek et al., 2017) and dialectal speech from the narecja.si portal3.
The acoustic model is implemented using the Nvidia NeMo framework; we used the deep QuartzNet_15x5 model (Kriman et al., 2019). We chose it because, despite its relatively small number of parameters (18.9 million), it still achieves fairly good recognition accuracy, comparable to larger models (more than 100 million parameters). We compared two models: QuartzNet_15x5 trained on Slovenian data only, and QuartzNet_15x5 pre-trained on English data and then additionally trained on Slovenian data. With the latter model we examined the quality of knowledge transfer from a foreign language to Slovenian.
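A heavily hedged sketch of loading a QuartzNet CTC model with NVIDIA NeMo and transcribing one VAD segment. The public English checkpoint name stands in for the authors' Slovenian models, which are not published here, and the file name is illustrative.

```python
import nemo.collections.asr as nemo_asr

# Public English QuartzNet checkpoint as a stand-in for the Slovenian models.
model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")
# Greedy transcription of one segment produced by the VAD step.
print(model.transcribe(["segment_0001.wav"]))
```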
To obtain the transcriptions, we compared three different CTC decoding methods: 1) the greedy maximum-probability method (greedy), where for each time step in the CTC output we select the most probable character and then merge adjacent
repetitions; 2) beam search with a word-level language model (word); and 3) beam search with a character-level language model (char).
For the language model we used the KenLM n-gram language model (Heafield, 2011). Since the model is used only during CTC decoding of an individual alignment case, we built it from the original text of that case. This yields a model that is not generalized for Slovenian but is adapted to the individual alignment. Tests showed that the order of the language model does not substantially affect the result; in the end we used a fourth-order model.
1 WebRTC Google repository. https://chromium.googlesource.com/external/webrtc/+/branch-heads/43/webrtc/common_audio/vad
2 Mozilla Common Voice website. https://commonvoice.mozilla.org/sl/datasets
3 https://narecja.si/
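For illustration, a minimal sketch of the first decoding method (greedy): pick the most probable symbol per frame, collapse repeats, drop the CTC blank. The indexing convention (blank at index 0) is an assumption, not taken from the abstract.

```python
import numpy as np

def ctc_greedy_decode(frame_scores: np.ndarray, symbols: str, blank: int = 0) -> str:
    """frame_scores: (time, num_symbols) per-frame scores; symbols holds
    the characters for indices 1..N, index 0 is assumed to be the blank."""
    best = frame_scores.argmax(axis=1)
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:   # collapse repeats, drop blanks
            out.append(symbols[idx - 1])
        prev = idx
    return "".join(out)
```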
2.3 Alignment and iterative merging
Using the speech recognizer, we obtain an approximate transcription of the speech in the recording. In the final step, this must be aligned with the original text of the recording. For the basic alignment we use an algorithm adapted from the DeepSpeech tool. It turns out that this algorithm alone does not guarantee that all words of the original text are aligned: shorter words often lack sufficient context or are poorly transcribed. To ensure that all words are aligned, we developed an iterative word-merging algorithm.
The main idea of the algorithm is as follows: words that are not aligned are merged with a neighbouring word in the text (the space is removed, forming a single string of characters). The basic alignment algorithm is then re-run, this time with the modified word list. These two steps are repeated until all words (or clusters of words) are aligned; each word of the original text can then be assigned a start and end time based on the approximate transcription.
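The following sketch captures our reading of the iterative merging loop; base_align stands in for the basic DeepSpeech-style aligner and is assumed, not provided here.

```python
def align_all(words, transcript, base_align):
    """base_align(tokens, transcript) -> {token_index: (start, end)} for the
    subset of tokens it managed to align (assumed interface)."""
    tokens = list(words)
    while True:
        spans = base_align(tokens, transcript)
        missing = [i for i in range(len(tokens)) if i not in spans]
        if not missing or len(tokens) == 1:
            return tokens, spans
        i = missing[0]
        j = i - 1 if i > 0 else i + 1          # neighbour to merge with
        lo, hi = min(i, j), max(i, j)
        # Join the two words into one string (the space is removed) and
        # re-run the base aligner on the shortened token list.
        tokens[lo:hi + 1] = [tokens[lo] + tokens[hi]]
```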
3 Evaluation
We evaluated the accuracy of the system on a test set by comparing it with manually produced alignments. We use three measures of alignment quality: the mean (MAE) and standard deviation (STD) of the absolute errors of word start times, and the share of absolute errors smaller than 0.5 seconds (< 0.5 s).
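The three measures can be computed directly from paired start times; a minimal numpy sketch (names are ours):

```python
import numpy as np

def alignment_scores(ref_starts, hyp_starts):
    # Absolute errors of predicted word start times against the manual ones.
    err = np.abs(np.asarray(hyp_starts) - np.asarray(ref_starts))
    return err.mean(), err.std(), (err < 0.5).mean()  # MAE, STD, share < 0.5 s
```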
3.1 Test set
The test set comprises 26 examples: 7 examples of non-dialectal speech, 13 examples of dialectal speech and 6 examples of unaccompanied monophonic dialectal singing. The shortest recording is 21 seconds long, the longest 219; the average recording length is 89 seconds. The examples were obtained from the following sources: Slovenske ljudske pesmi V (Kaučič et al., 2007), the narecja.si portal, and field recordings of GNI ZRC SAZU. The ground-truth alignments were made manually with the Praat tool.
Recording type        Words  Duration (min)
dialectal speech      2428   18.7
non-dialectal speech  1394   11.0
dialectal singing     508    8.7
total                 4330   38.4
Table 1: Test set.
3.2 Comparison of models and decoding methods
We compared the base acoustic model (base), built on Slovenian data only, with a model trained on English data and then fine-tuned on Slovenian (transfer). We also compared three decoding methods: the greedy method (greedy), beam search with a character-level language model (char), and beam search with a word-level language model (word). The comparison was made separately for each type of test data. The results are given in Table 2.
The table shows that for non-dialectal speech, regardless of the method, the transfer model yields a smaller mean error. The difference is small (0.06 to 0.07 seconds) but roughly the same across methods. With the greedy method the transfer model does have a larger standard deviation and a smaller share of errors under 0.5 s, but the difference is minimal. The different methods give very similar results. The combination of the transfer model and the word method gives the best result, with a mean error of 0.12 s, a standard deviation of 0.10 s and 99.4% of errors under 0.5 s.
For dialectal speech, too, the transfer model improves the results. The difference in mean errors is small (0.04 to 0.09 seconds), but between the two acoustic models there is also a noticeable difference in standard deviation and in the share of errors under 0.5 s. With the transfer model the results for the different alignment methods are very similar, with the word method proving the most robust, as it has the smallest error and standard deviation with both models. With the transfer model the greedy method does have a somewhat larger share of errors under 0.5 s,
but the difference is small (0.4%). The combination of the transfer model and the word method gives the best result, with a mean error of 0.14 s, a standard deviation of 0.24 s and 97.3% of errors under 0.5 s. Compared to the best result for non-dialectal speech, the mean error increases by 0.02 s and the standard deviation by 0.13 s, while the share of errors under 0.5 s decreases by 2.1%. The difference is not large and is roughly similar for the other combinations of methods and models.
Test data             Method  Model     MAE   STD   < 0.5 s
Non-dialectal speech  greedy  base      0.20  0.13  99.1%
                              transfer  0.14  0.15  98.5%
                      char    base      0.21  0.09  99.0%
                              transfer  0.14  0.10  98.9%
                      word    base      0.19  0.10  98.6%
                              transfer  0.12  0.11  99.4%
Dialectal speech      greedy  base      0.22  0.39  94.9%
                              transfer  0.15  0.27  97.7%
                      char    base      0.21  0.32  95.7%
                              transfer  0.15  0.28  97.1%
                      word    base      0.18  0.28  97.2%
                              transfer  0.14  0.24  97.3%
Dialectal singing     greedy  base      0.59  0.82  70.2%
                              transfer  1.28  2.49  63.9%
                      char    base      0.82  1.66  66.7%
                              transfer  0.44  0.41  73.4%
                      word    base      0.48  0.58  73.4%
                              transfer  0.37  0.30  79.9%
Table 2: Results (MAE and STD in seconds).
For dialectal singing the alignment error is noticeably larger. With the word and char methods the transfer acoustic model works better. With the char method the mean error is halved, the standard deviation is four times smaller, and the share of errors under 0.5 s improves by 6.7%. With the word method the mean error is 0.11 s smaller, the standard deviation 0.28 s smaller, and the share of errors under 0.5 s improves by 6.5%. With the greedy method the base model is better, the only such case in the results. The results of the different decoding methods are not similar to one another. With both models the word method substantially improves the result. The combination of the transfer model and the word method gives the best result, with a mean error of 0.37 s, a standard deviation of 0.30 s and 79.9% of errors under 0.5 s. Compared to the best result for non-dialectal speech, the mean absolute error increases by 0.25 s, the standard deviation by 0.19 s, and the share of errors under 0.5 s decreases by 19.5%. The difference is large and is also visible for the other combinations of methods and models: the mean absolute error increases by a factor of at least 2.5, the standard deviation by a factor of at least 2.7, and the share of errors under 0.5 s decreases by at least 19.5%.
3.3 Findings
The alignment quality on non-dialectal speech proves good and is comparable to similarly performing systems, e.g. (Malfrère et al., 2003). The alignment quality on dialectal speech is also good. The error is somewhat larger than for non-dialectal speech, which is expected, since most of the acoustic model's training data is non-dialectal. Overall, we judge that the system works well on Slovenian speech and is therefore usable for most applications.
It is worth noting that in the case of short recordings and complete transcriptions for training acoustic models, potentially better alignment techniques exist (Brognaux and Drugman, 2015).
The alignment quality for unaccompanied monophonic singing is noticeably worse than for speech, as we also expected, since aligning singing with text is in general a harder problem. Compared to non-dialectal speech, the mean error is about three times larger and there are many more errors above half a second. The mean error is comparable to a similarly performing singing-alignment system (Stoller et al., 2019), but our test data do not include polyphonic or accompanied singing, so this comparison says little. We assume that the alignment quality would improve substantially if the acoustic model's training set contained singing.
In the great majority of cases the transfer acoustic model outperforms the base model. The only exception is singing with the greedy method, where base achieves a better result; but since this combination of method and model does not give the best result for singing, it is not essential for the quality assessment. Based on the results, we confirm the assumption that knowledge transfer with the transfer model positively affects alignment quality both for speech and for singing.
Although word is the best decoding method for speech, the other two methods do not have substantially larger errors. For non-dialectal speech with the transfer model, the mean error with the word method is smaller by 0.02 s, and for dialectal speech by 0.01 s. In applications where highly accurate speech alignment is not crucial but computation time matters, it is more sensible to use the greedy method, since it requires neither beam search nor a language model and is therefore much faster. For singing, the greedy method gives substantially worse results than word, so the latter should be used.
Acknowledgements
The research described in this paper was carried out within the basic research project "Thinking folklore: folkloristic, ethnological and computational perspectives and approaches to dialect" (J7-9426, 2018–2022) and the research programme "Digital Humanities: resources, tools and methods" (P6-0436, 2022–2027), both financed by ARRS, and within the DARIAH-SI research infrastructure.
References
Sandrine Brognaux and Thomas Drugman. HMM-based speech segmentation: Improvements of fully automatic approaches. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24, 2015.
Simon Dobrišek, Jerneja Žganec Gros, Janez Žibert, France Mihelič, and Nikola Pavešić. Speech database of spoken flight information enquiries SOFES 1.0, 2017. Slovenian language resource repository CLARIN.SI.
Kenneth Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, July 2011. Association for Computational Linguistics.
Ryan Hilleman, Tilman Kamp, and Tobias Bjornsson. DSAlign. https://github.com/mozilla/DSAlign, 2018.
Marjetka Golež Kaučič, Marija Klobčar, Zmaga Kumer, Urša Šivic, and Marko Terseglav. Slovenske ljudske pesmi V. 2007.
Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang. QuartzNet: Deep automatic speech recognition with 1d time-channel separable convolutions, 2019.
F. Malfrère, O. Deroo, T. Dutoit, and C. Ris. Phonetic alignment: speech synthesis-based vs. Viterbi-based. Speech Communication, 40(4):503–515, 2003.
Daniel Stoller, Simon Durand, and Sebastian Ewert. End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model, 2019.
VideoLectures.NET. Spoken corpus Gos VideoLectures 4.0 (audio), 2019. Slovenian language resource repository CLARIN.SI.
Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Simon Krek, Marko Stabej, and Tomaž Erjavec. Spoken corpus Gos 1.0, 2013. Slovenian language resource repository CLARIN.SI.
Andrej Žgank, Mirjam Sepesy Maučec, and Darinka Verdonik. The SI TEDx-UM speech database: a new Slovenian spoken language resource. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4670–4673, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).
A Parallel Corpus of the New Testament:
Digital Philology and Teaching the Classical Languages in Croatia
Petra Matović,* Katarina Radić†
* Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, 10000 Zagreb, Croatia
pmatovic@ffzg.hr
† Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, 10000 Zagreb, Croatia
katarina.radic1@gmail.com
1. Introduction
Corpus linguistics has been one of the liveliest disciplines in Croatian linguistics, and parallel corpora have been established by Croatian scholars since the 1960s (Tadić 1997, 2001; Simeon 2002). These corpora normally include Croatian and another living language, while corpora consisting of texts in Croatian and at least one of the so-called “dead” languages are still underrepresented, although there are corpora including languages like Ancient Greek, Latin, Sanskrit, Arabic, Persian and Akkadian, to be found on the World Wide Web (The Alpheios Project 2019; Palladino et al., 2021). The Department of Classical Philology at the University of Zagreb can already boast one of the earliest online (monolingual) corpora of Latin texts, the CroALa database, built and curated by Neven Jovanović (CroALa, 2014). In the last few years the said department has been steadily building small parallel corpora, and this paper aims to describe one of them, the Greek-Croatian parallel corpus of the New Testament, currently in the making, and furthermore discuss its educational uses in teaching Ancient Greek.
2. Goal of the paper
Building parallel corpora has been garnering more and more attention in the field of classical philology. The Department of Classical Philology at the University of Zagreb has been building smaller corpora, both as part of several small-scale projects led by Neven Jovanović and within courses on the Greek and Latin languages (e.g. Soldo and Šoštarić 2019). Since 2021, several professors and students at the department have been working on a project titled "A Linguistic Analysis of Selected Early Christian Writings", led by Petra Matović. Within the scope of the project we have started building a parallel corpus of the New Testament, so far comprising the Gospel of Mark and a part of the Apocalypse. The texts are aligned using the Alpheios tool for text alignment in the Perseids environment (The Alpheios Project, 2019; The Perseids Project, 2017). Alpheios enables the user to align words or word combinations in the source text with the corresponding parts of its translation (The Alpheios Project, 2019). In this poster we firstly aim to explain the principles of alignment we followed while building the corpus and, secondly, to discuss some peculiarities of aligning Ancient Greek with Croatian. Finally, we look at the corpus from an educational point of view and discuss its possible uses in teaching Ancient Greek today.
Text alignment was done by four students (Mateo Cader, Ružarijo Lukas, Katarina Radić, Luka Šop) and supervised by Petra Matović. The editions of the texts were Nestle-Aland 28 (Greek New Testament) and the so-called Zagreb Bible (https://biblija.ks.hr/). Initially, the main principle of alignment was to align units (words or word combinations) in the Greek text with their Croatian counterparts; these units had to be as small as possible. Full stops and commas were aligned, too. After the initial period, it became clear that additional rules were necessary. While the students did not struggle with the meaning of the Greek text, they were sometimes unsure how to align the Greek with the Croatian. These uncertainties typically arose in the following situations, due to specific linguistic features of the two languages:
- the use of the article (exists in Greek, but not in Croatian: ὁ Ναζαρηνός = Nazarećanin, Mark 10,47)
- commas, which can be aligned with conjunctions
- participles (extensively used in Greek, not common in Croatian: ἀκούσας = kad je čuo, Mark 10,47)
- particles (Greek is rich in particles, while Croatian often lacks equivalents: the particle δέ is translated as "ali" in Mark 13,5, but left untranslated in Mark 13,13)
- features of Hellenistic Greek (the New Testament was written in this later variety of Greek, which often differs from the Classical, 5th-century BC Attic dialect mainly taught in schools and universities; one of these features is the preterite form ἤμην διδάσκων = „naučavah“, Mark 14,49).
There was also one unexpected problem: students often struggled with aligning prepositions, for example in Mark 1,6: ἐνδεδυμένος τρίχας καμήλου καὶ ζώνην δερματίνην περὶ τὴν ὀσφὺν, the preposition “s” was left unaligned in the Croatian translation (“odjeven u devinu dlaku, s kožnatim pojasom oko bokova”).
Consequently, the following set of rules was formed:
1. The article is aligned together with the corresponding nouns, unless translated separately.
2. Conjunctions should be aligned either with conjunctions, particles or punctuation.
3. Punctuation should be aligned whenever possible.
4. Participles should be aligned with the corresponding word combination, even if it is an entire sentence.
5. If something is left out in the translation, the Greek original is left unaligned and vice versa, for example the verb "to be".
6. Prepositions should never be left unaligned. Whenever possible, they should be aligned with a corresponding Greek preposition. In the case where a preposition is added in Croatian, together with its noun it should be aligned with the corresponding noun in Greek.
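As a sketch of how such rules could be checked mechanically (the corpus itself is aligned manually in Alpheios; the pair-based representation and the preposition list below are purely hypothetical and illustrative), alignments can be stored as pairs of Greek and Croatian units, with None for an unaligned side, and rule 6 becomes a simple scan:

```python
# Hypothetical representation of one aligned verse as (greek, croatian) pairs.
CRO_PREPOSITIONS = {"u", "s", "sa", "na", "o", "oko", "po", "za"}  # illustrative, not exhaustive

def rule6_violations(pairs):
    """Rule 6: a Croatian preposition must never be left unaligned."""
    return [cro for grk, cro in pairs
            if grk is None and cro and cro.lower() in CRO_PREPOSITIONS]

# Mark 1,6: the unaligned "s" in "s kožnatim pojasom" would be flagged.
print(rule6_violations([("περὶ", "oko"), (None, "s")]))  # -> ['s']
```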
The work done on this corpus highlights several problems in teaching not only Ancient Greek but also Croatian. Students are unsure of the uses of certain parts of speech, usually those that have no equivalent in their mother tongue. They are also unaware of the nature of the comma, which can connect (or divide) two words just like a conjunction. Prepositions are often an obstacle because their meaning can be incorporated into a nominal form in Greek and does not have to be expressed separately. These problems probably arise because the school curriculum for Croatian differs from the curricula for Greek and Latin: the curricula for the classical languages pay more attention to grammar, while Croatian has to include both language and literature. Hopefully, projects like this one can highlight specific problems that can then be resolved either by adapting the school curricula or the teaching of classical languages at the university level.
3. References
The Alpheios Project. 2019. https://alpheios.net/.
CroALa (Croatiae Auctores Latini). 2014. http://croala.ffzg.unizg.hr.
Chiara Palladino, Maryam Foradi, and Tariq Yousef. 2021. Translation Alignment for Historical Language Learning: a Case Study. Digital Humanities Quarterly, 15(3). https://www.proquest.com/openview/e048d32e8e991c67282c3fbda5c1f0d4/1?pq-origsite=gscholar&cbl=5124193.
The Perseids Project. 2017. https://www.perseids.org/.
Ivana Simeon. 2002. Paralelni korpusi i višejezični rječnici. Filologija, 38-39: 209–15.
Petar Soldo and Petra Šoštarić. 2018. Treebanking Lucian in Arethusa: Experiences, Problems and Considerations. Studia UBB Digitalia 63(2):7–18.
Marko Tadić. 1998. Raspon, opseg i sastav korpusa suvremenoga hrvatskoga jezika. Filologija (30-31):337–47.
Marko Tadić. 2001. Procedures in Building the Croatian-English Parallel Corpus. International Journal of Corpus Linguistics, Special issue, pages 1–17.
Novum Testamentum Graece, Nestle-Aland 28. https://www.academic-bible.com/en/online-bibles/novum-testamentum-graece-na-28/read-the-bible-text/bibel/text/lesen/stelle/51/10001/19999/ch/418f354347a79b322324823db62504dc/.
The Zagreb Bible. https://biblija.ks.hr/.
Pre-Processing Terms in Bulgarian from Various Social Sciences and Humanities (SSH) Domains: Status and Challenges
Petya Osenova*, Kiril Simov*, Yura Konstantinova†
*Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Acad. G. Bonchev bl. 2, 1113 Sofia
{petya, kivs}@bultreebank.org
†Institute of Balkan Studies and Centre of Thracology, Bulgarian Academy of Sciences, Moskovska St 45, 1000 Sofia
yura.konstantinova@balkanstudies.bg
1. Introduction
A great number of focused initiatives, projects and conferences tackle in depth various topics related to terminology construction, understanding, processing and usage. We mention only a few of them here, as initiatives rather than as distinct publications: among others, the ENeL COST Action on e-Lexicography,1 related activities in the NexusLinguarum COST Action,2 related activities in the ELEXIS project,3 and the globaLEX organization. There is also ongoing work on providing language technology support to colleagues in SSH within CLARIN-ERIC and DARIAH.4,5
Within the CLaDA-BG infrastructure,6 which combines the goals of CLARIN and DARIAH in Bulgaria, there are two types of partners – technological partners and colleagues from SSH. The latter are historians, ethnographers, specialists in the deeds and lives of Cyril and Methodius, and museum and library workers. This combination of complementary partners allows us to construct the necessary resources and to verify their utility for the SSH partners immediately.
In the task of creating the Bulgarian-centric Knowledge Graph (BGKG) within CLaDA-BG (Simov and Osenova, 2020) we requested data from our SSH partners in order to perform linguistic pre-processing and to enhance the creation of terminological dictionaries that cover the SSH subdomains based on these data.
The size of the corpus is nearly half a million tokens – 484,815. About 5,000 words and phrases were selected for pre-processing and for the creation of entries for the terminological dictionaries, within nearly 26,000 usages annotated in the corpus. Of these candidates, 542 phrases were rejected as false positives; 328 of them were rejected outright, either because they were named entities or because they were free compositional phrases.
In this way, our SSH colleagues can ease their own work by only checking and validating the pre-processed data. The data consist of selected texts from various sources: scientific texts authored by our SSH colleagues and related to Bulgarian history and society, Linked Open Data such as Wikipedia, available textbooks, specialised dictionaries, etc.
Here we give a brief outline of our pre-processing strategy towards handling the data-driven terminology in these domains.
2. The Task Overview
The workflow discussed here concerns the SSH data (publications, autobiographies, archive documents, newspaper articles from past periods, descriptions of artefacts, etc.) that were collected from the partners and annotated within the INCEpTION platform7 with named entities, events and roles. While annotating the texts linguistically, the annotators were additionally asked to mark candidate terms with the label term. This task was set in view of the subsequent creation of specialised terminological dictionaries in each participating SSH domain – history, ethnography, biographical studies, etc. The annotators were instructed to treat as candidate terms the keywords that are specific to the domain.
1 https://www.cost.eu/actions/IS1305/
2 https://nexuslinguarum.eu/
3 https://elex.is/
4 https://www.dariah.eu/
5 https://www.clarin.eu/
6 https://clada-bg.eu/en/
7 https://inception-project.github.io/
Later on, these candidate terms were extracted and transferred to a large Excel table in Google Drive. The table consists of three main areas: a) the candidate term, b) the term in its context of occurrence, and c) the source that delimits the domain of usage. Figure 1 presents an excerpt from the Excel view.
Figure 1: An example from the Excel table.
In the first row, the following information is given: the term as it occurred in the text (riding-the horse, 'the riding horse'), the text excerpt with the term enclosed in the symbols @@@, and the name of the source text. In the second row, the following information is given: the normalised term (riding horse), the definition (a horse that is used for riding), and the domain – ethnography.
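A small sketch of how one such row could be read programmatically, assuming the three-column layout described above; the @@@ markers come from the description, while the helper itself and its field names are hypothetical:

```python
import re

def parse_row(term, context, source):
    """Extract the marked occurrence from a context like '... @@@term@@@ ...'."""
    match = re.search(r"@@@(.+?)@@@", context)
    return {
        "term": term,
        "occurrence": match.group(1) if match else None,
        "source": source,  # the source delimits the domain of usage
    }

row = parse_row("term", "left context @@@the term itself@@@ right context", "source text")
print(row["occurrence"])  # -> 'the term itself'
```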
All the one-word terms received initial definitions from the digitised version of the Explanatory dictionary of Bulgarian (Popov et al., 1994). This step was performed automatically through a rule-based matching method.
First, the word forms in the texts were lemmatized with our in-house Inflectional Bulgarian dictionary. Then the coinciding lemmas in the dictionary and in the texts were matched. The terms with more than one meaning also received all the possible definitions automatically. Afterwards, these candidate terms were processed manually by the team that previously worked on the event and roles annotations. The core team engaged with the terminology pre-processing consisted of 4 members as a subpart of the whole annotating team that consisted of 8 people.
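A minimal sketch of the rule-based matching step just described, with purely illustrative data structures: word forms are lemmatized via an inflectional dictionary, and lemmas that coincide with headwords of the explanatory dictionary inherit all of its candidate definitions, with ambiguous lemmas keeping every sense for later manual selection.

```python
# Illustrative stand-ins for the in-house Inflectional Bulgarian dictionary
# and the digitised Explanatory dictionary of Bulgarian (Popov et al., 1994).
inflectional = {"formA": "lemmaA", "formB": "lemmaA"}       # word form -> lemma
explanatory = {"lemmaA": ["definition 1", "definition 2"]}  # headword -> senses

def assign_definitions(word_forms):
    """Attach all dictionary senses to every lemma found in the texts."""
    assigned = {}
    for form in word_forms:
        lemma = inflectional.get(form)       # lemmatize the word form
        if lemma in explanatory:             # lemma coincides with a headword
            assigned[lemma] = explanatory[lemma]  # keep all possible senses
    return assigned

print(assign_definitions(["formA", "formB"]))  # {'lemmaA': ['definition 1', 'definition 2']}
```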
The tasks related to terminology processing were organized as follows: one person (outside the four working colleagues) performed the automatic construction of the table and the assignment of the existing definitions and sources. Initially, the candidate terms were assigned to the annotators in alphabetical order, i.e. each colleague was responsible for the candidate terms beginning with certain letters. However, after some letters had been completed, a decision was taken to proceed by domain source instead. This approach allowed us to observe the terms in their domain contexts and interrelations. Finally, the terms were checked once more in alphabetical order.
The workflow was generally divided into two phases that respect the competences of the experts. In Phase 1 the corpus linguists (who were also annotators) pre-processed the candidate terms, while in Phase 2 the specialists in the SSH areas check and validate these terms against their own area.
3. The Workflow
The annotated data, including the marked candidate terms, was uploaded in advance. The workflow consisted of the following steps:
3.1 Deciding which candidate terms are true terms
Here the main task of the corpus linguists was to separate the obvious non-terms – common words and expressions – from the specialised terms. Sometimes the boundaries were not very clear, especially with respect to multiword expressions (MWEs) and nested terms; see more on this issue in 3.3 below. The annotators had three options to select from: a sure term, a maybe term, and a non-term.
3.2 Checking the availability of the definition and its relevance
If there was a definition, the annotator had to accept it as it is, reject it, or modify it. If there was no definition, the annotator had to create one. When the term was one-word, the task was to check the definition that came from the Explanatory dictionary of Bulgarian. In case of lexical ambiguity, the annotator had to select the correct definition among the available ones or, again, provide their own if no appropriate one was present. The selected definition was then marked as the right one. Note that the other definitions were not deleted, for the sake of completeness and future addition to BTB-Wordnet.
3.3 Handling multiword expressions
POVZETKI
259
ABSTRACTS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
Here the prevailing part of the terms consisted of a head noun and a pre-positioned modifier, for example демокрация (democracy) and пряка демокрация (direct democracy). The problems can go in two directions: whether to accept a MWE as a domain term or not, and how to provide a definition for it, since one is usually not available in the consulted sources. We decided to be inclusive in accepting what counts as a term: all expressions considered specific to the domain were approved. The annotator could also add definitions for the parts of compositional MWEs. For example, невалиден глас (invalid vote) can have a definition as a phrase, while its two elements невалиден (invalid) and глас (vote) might also be added below with their own definitions.
3.4 Re-checking the domain/genre
This step relies on the domain/genre classification already in use. An initial pre-defined schema was explored, which in the course of the work was further expanded and hierarchized.
At the moment, the list of applied domains amounts to 76 categories (for example, architecture with a subdomain of construction; geography with a subdomain of geology; philosophy with subdomains of ethics, rhetoric and logic) and the list of registers to 15 (for example, dialectal, metaphorical, colloquial, etc.). The initial schema came from the classifications used in the Explanatory dictionary and had 36 domains and 4 registers. At the beginning, we tried to keep the terms in separate, non-overlapping groups: history, ethnography, etc. These areas, however, are highly interdisciplinary and intersect with each other, so this approach was abandoned at a very early stage of our work. As a result, one and the same term can be placed in more than one domain, with the same or a different meaning.
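A sketch of the resulting bookkeeping, with hypothetical names and data: subdomains hang off their parent domain, and a term may carry several (domain, sense) assignments.

```python
# Hypothetical in-memory form of the hierarchized schema (76 domains, 15 registers).
subdomains = {
    "architecture": ["construction"],
    "geography": ["geology"],
    "philosophy": ["ethics", "rhetoric", "logic"],
}

term_domains = {}  # term -> list of (domain, sense) pairs

def assign(term, domain, sense):
    """One and the same term may land in several domains, with the same
    or a different meaning."""
    term_domains.setdefault(term, []).append((domain, sense))

assign("демокрация", "history", "sense as used in historical documents")      # illustrative
assign("демокрация", "political science", "sense as used in political texts") # illustrative
```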
Other tasks that were part of the workflow, although with a lower priority, were:
3.5 Adding other senses of the lemma of the term
3.6 Adding examples to these additional senses
The idea behind tasks 3.5 and 3.6 is to reach better coverage in other language resources such as BTB-Wordnet (Osenova and Simov, 2018) and to compile a sense corpus per lemma and usage.
The result of this preparatory work was a classification of the roughly 5,000 initially selected candidate terms and keywords with respect to the hierarchy of domains. This allows the further processing to be done by different experts in the corresponding domains. Their tasks are the following:
Final sorting of lexical items into true terms and keywords.
As mentioned above, the examples annotated within the domain documents were classified into two main categories: general lexica and compositional phrases on the one hand, and terms on the other. The second group sometimes contains keywords that happen to be true terms in the respective domain. Thus, the first task for the experts was to sort out the true terms.
Addition of missing terms.
Despite the wide range of documents selected for annotation, they do not contain all the relevant terms of a domain. For example, of the set of all genres of Old Bulgarian literature, only three were identified within the annotated documents. The remaining genre terms were added by the experts in Old Bulgarian literature. By filling in the missing slots in this way, we expect each domain to end up with a relatively complete list of terms.
Extension of the definitions.
For the original list of candidate terms, we also had to add definitions from online sources or construct our own. Since the pre-processing group included linguists but not experts in the domains, the definitions were often not complete and/or precise enough. The domain experts therefore extended the definitions with encyclopaedic information; in some cases, appropriate images were also added. The resulting encyclopaedic entries were cross-linked on the basis of the included terms. Here is an example of such an entry from the area of architecture:
АЖУР – техника при резбарското, златарското, плетаческото и други изкуства, при която между декоративните елементи има отвори
Figure 2: One example from the terminology lexicon in the area of architecture. On the left, the term openwork and the definition (ornamental work such as embroidery or latticework having a pattern of openings) are given, and on the right there is an image illustrating it. The links to other terms are represented via italicising the corresponding words/phrases in the definition. This example is only illustrative. The actual entries might contain longer texts, references to relevant literature, more images and links to external resources.
The resulting terminological lexicons are further processed by the team working on the Bulgarian BulTreeBank Wordnet (BTB-WN). This work has been done in cooperation with the domain experts. Such alignment of the terminological lexicons and the wordnet allows a joint usage of both lexical resources for the main use cases – explanation of the specific knowledge in the domains and indexing of various types of domain documents. Figure 3 depicts a part of the hierarchy of Bulgarian folk units of measurement; they are linked with a hyponymy relation to the concept for Bulgarian folk units and to the concept for linear units.
Figure 3: A graphical view of Bulgarian folk units of measurement. Each term is classified in two ways – as a unit of measurement for distance (linear units) and by its domain (Bulgarian folk units). The hierarchy of terms can interleave with synsets that are not terms in the domain. The mappings to synsets in the English WordNet are given as IDs at the lower part of the graphical representation of each Bulgarian synset. The measures shown include педя (span), пръст (finger), лакът (elbow), etc.
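The structure in Figure 3 can be pictured with a small sketch; the classes and the English WordNet ID placeholder below are hypothetical, the real resource being BTB-WN. Each term-synset points to two hypernyms – linear units and Bulgarian folk units – and optionally carries a mapping to an English WordNet synset.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    lemma: str
    hypernyms: list = field(default_factory=list)  # hyponymy links upward
    en_wn_id: str = ""  # mapping to the English WordNet (placeholder ID)

linear_units = Synset("linear unit")
folk_units = Synset("Bulgarian folk unit")

# Each folk measure is a hyponym of both concepts:
span = Synset("педя (span)", [linear_units, folk_units], en_wn_id="<EN-WN-ID>")
finger = Synset("пръст (finger)", [linear_units, folk_units], en_wn_id="<EN-WN-ID>")
```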
Our idea is for BTB-WN to be the main resource within CLaDA-BG for representing lexical data related to general language and terminology, aligned with the ontologies on which the BGKG is constructed.8 In this way we hope to be able to provide access to these data to different types of users, with different knowledge of the domains, different goals in mind, etc.
In addition to the standard wordnet relations (hypernymy, meronymy, etc.), we envisage other semantic relations that represent various aspects of knowledge within the corresponding domains. In this way, we will ensure the representation of encyclopaedic information and will facilitate the representation of Named Entities (NEs) classified with respect to the corresponding concepts. This approach relies on specially created templates based on the domain relations as well as their domain and range restrictions. We have already defined about 20 such templates for the main classes of NEs, such as geopolitical entities, historical events (wars, uprisings, etc.), artefacts (icons, stamps, etc.), political parties and regimes, and so on.
4. Conclusions
In this extended abstract we described the main steps followed in the creation of terminological lexicons in a bottom-up approach, starting from real texts in SSH domains. After the domain texts were annotated with named entities, events, roles and candidate terms, a concordance of the candidate terms from the different documents was compiled, in which they were grouped together and linguistically processed. As a result, each entry contained the representation of the term in its basic form, listings of related words for MWEs, and the existing potential senses from different sources (available locally to the annotators and on the web). The appropriate senses for the given context were selected or created. Then the result was further processed by the domain experts in order to make the definitions more precise and complete. Missing terms were also added.
8 This approach is similar to the lexeme assignment in Wikidata: https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation
The terminological lexicons were then aligned with BTB-WN so that they can be used for navigation and for annotating further documents (manually or automatically), and so that links to the necessary ontologies can be established.
The main challenges can be divided into technical and theoretical ones. In the first group we can mention insufficient context; the lack of sufficient sources for terms related to earlier historical periods; and deciding how best to approach the task – alphabetically or by source. In the second group we can outline: the difficulty of differentiating between a term and a non-term; aiming at the most informative definition when too many are found in the sources; finding and/or constructing a definition, with the help of other available resources, when it is missing or wrong; handling close definitions for a lemma; constructing definitions for multiword terms; and handling the multi-domain inclusion of terms.
5. References
Petya Osenova and Kiril Simov. 2018. The data-driven Bulgarian WordNet: BTBWN. Cognitive Studies | Études cognitives, 18. https://doi.org/10.11649/cs.1713 (freely available at: https://ispan.waw.pl/journals/index.php/cs-ec/article/view/cs.1713/4458)
Kiril Simov and Petya Osenova. 2020. Integrated Language and Knowledge Resources for CLaDA-BG. In: Selected Papers from the CLARIN Annual Conference 2019, LiU Electronic Press: Linköping Electronic Conference Proceedings 172 (2020).
Dimitar Popov (D. Popov, L. Andreychin, L. Georgiev, St. Ilchev, N. Kostov, Iv. Lekov, St. Stoykov and Tsv. Todorov). 1994. Bulgarian Explanatory Dictionary. Nauka i izkustvo, Sofia. (in Bulgarian)
An Approach to Computational Crisis Narrative Analysis: A Case-study of Social Media Narratives Around the COVID-19 Crisis in India
Henna Paakki*, Faeze Ghorbanpour*, Nitin Sawhney*
* Department of Computer Science, Aalto University
P.O.Box 15400, FI-00076 AALTO, Espoo, Finland
henna.paakki@aalto.fi
1. Introduction
Societal crises create an empty narrative space and a need for explanation about the crisis, related risks and required mitigation actions (Sellnow et al., 2019). Crises are socially constructed through discourses and have the potential to change social structures and perceptions (Walby, 2015). Crisis narratives also have an important role in attributing blame and structuring crisis responses and recovery plans (Walby, 2015, p. 14). The role of social media has increased significantly as a forum for seeking information about crises, as well as for discursive sense-making. People use discourses and narratives related to a crisis to construct the world socially and epistemologically and to explain the impending crisis (Joffe, 2003; Bednarek et al., 2022), which makes it important for authorities, experts and crisis regulators to understand various discourses around the crisis. This paper examines the possibilities for analyzing social media discourses using a novel computational approach, using a discourse act classifier based on zero-shot learning (Yin et al., 2019) to categorize discourse types into narrative function groups (Labov, 1972). Such tools can help support other means of inquiry and crisis preparedness. Our empirical case study examines discourses around the COVID-19 pandemic in the context of English-language social media in India. This abstract describes an ongoing research project.
2. Goal of the paper
As crisis discourses on social media encompass a large set of data, there is a need for computational methods that can support close readings. Although some methods have been developed for computational discourse and narrative analysis (Piper et al., 2021), this line of research needs more tools. Lakoff and Narayanan have proposed that computational narrative analysis could be approached by focusing on the structural building blocks of narratives (Lakoff and Narayanan, 2010), which have been outlined in linguistics and social sciences (Labov 1972; Labov and Waletzky, 1967; van Dijk, 1976). Such rules can aid computational models.
Narratives encase human motivations, goals, emotions, actions, events, and outcomes, elements that have been considered essential for computational models to understand (Lakoff and Narayanan, 2010). We posit that sense-making in crisis is action (Joffe, 2003), at the surface-level formulated as discursive actions (Edwards and Potter, 1993; Schegloff, 2007). Thus, for capturing social media narratives, we explore the validity of using a widely used and well-established narrative functions theory from linguistics (Labov, 1972; Labov and Waletzky, 1967) to categorize social media comments based on their functions. These functions have already been used to computationally analyze more traditional narratives like personal histories or short stories (see e.g., Li et al., 2017). We explore the possibilities for further extending their use to analyzing changes in social media discourses around crises.
Many narrative theories agree that a sequence of events that forms a narrative whole includes first 1.) an orientation to the story or situation (identifying the time, place, persons, and situation of the narrative), some type of 2.) complication or disruption (the core event that creates tension in the narrative), 3.) an evaluation (clarification of why or how the events are important), and finally 4.) a resolution (how the story ends or how the core problematic event is resolved) (Lakoff and Narayanan 2010; Labov, 1972; Labov and Waletzky, 1967; Todorov 1971; Van Dijk, 1976). Conflict in communication is central in the narrative space surrounding a crisis and needs to be managed for successful crisis mitigation (Sellnow et al., 2019). Central to crisis discourses are critical events that have transformative power: they mobilize discourses and transform perspectives on the crisis through conflict (Jørgensen & Phillips, 2002). Thus, we might expect crisis narratives to involve a significant complication phase that needs to be followed by a resolution phase.
We maintain that by analyzing the functional categories of orientation, complication, evaluation, and resolution, it is possible to understand shifts in perspectives to the ongoing crisis, ones that contribute to the narrativization of the crisis. Furthermore, we expect that it is possible to identify points of discursive struggle within crisis discourses, points where critical understandings of the crisis are negotiated to achieve a consensus or to legitimize a selected narrative (Jørgensen & Phillips, 2002; Sellnow et al., 2019). This is central in understanding how a consensus on crisis resolution is achieved. We seek to investigate the validity and utility of computationally categorizing social media crisis discourses based on their functions. We ask:
1. Can narrative functions be applied to analyzing online crisis discourses using a computational model? Are these functions operationalizable through discursive actions?
2. Do social media comments grouped by their actions correspond well enough to the functions of orientation, complication, evaluation, and resolution?
3. By using these function-based groupings, is it possible to find patterns of narrativization in online crisis discourses? Do comments have different functions at different points in time during the crisis?
3. Data sources and sampling
Crisis news reporting has a significant impact on citizen perspectives on the crisis (Kasperson et al., 1988).
We are thus interested in the relationship between the evolution of crisis news discourses and how citizen discourses develop during a long-lasting crisis. YouTube news channels' crisis news videos and their comments offer an opportunity for investigating this interaction over time. We examine viewer comments on crisis news videos of the English-language NDTV news YouTube channel during the COVID-19 crisis in India, in conjunction with news reports and contextual insights on the pandemic. The data were collected using a scraper and the YouTube API. They cover the beginning of the crisis (1/2020–8/2020), the acute vaccination phase of the crisis (02/2021–08/2021), and a later prolonged phase of the crisis (11/2021–02/2022). Channel selection criteria included that the channel should be among the most followed English news providers in the country and one of the most trusted (Newman et al., 2021), that it allows viewer comments, has a wide viewership, and is politically as close to the centre as possible. The Indian context is of interest because trust in news has been reported to be low (Newman et al., 2021) and because Global South perspectives have not been sufficiently represented in research.
4. Methods
Our approach is mixed, utilizing computational modeling to analyze a large set of data to achieve reproducibility and quantifiability, but also employing qualitative close reading.
To operationalize the narrative functions theory, we posit that this can be approached through the pragmatic items of discursive actions, as these are often used to analyze accountability, agency, position, and intention in conversation (Edwards and Potter, 1993; Schegloff, 2007). We expect that in our social media data, the function of informing statements is to mostly orient to the crisis and to express beliefs; questions, accusations and challenges most often express a complication or problematize some aspect of the crisis; evaluations and appreciations mostly attempt to elaborate and evaluate the situation; and requests and proposals aim at a resolution of some aspect of the crisis (Couper-Kuhlen and Selting, 2017; Turowetz and Maynard, 2010). Thus, we argue that what a comment does can be used to conclude what function it has within the larger crisis narrative.
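The operationalization just described amounts to a fixed mapping from action labels to narrative functions; a minimal sketch, with label strings of our own choosing that mirror the pairing in the text:

```python
# Action -> Labovian narrative function, following the pairing described above.
ACTION_TO_FUNCTION = {
    "informing statement": "orientation",
    "question": "complication",
    "accusation": "complication",
    "challenge": "complication",
    "evaluation": "evaluation",
    "appreciation": "evaluation",
    "request": "resolution",
    "proposal": "resolution",
}

def function_group(action_label):
    """Map a classified action to its assumed narrative function."""
    return ACTION_TO_FUNCTION.get(action_label)
```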
The selection of actions is based on frameworks of core actions in social interaction (Clark and Schaefer, 1989), ones found relevant across different contexts (Stivers et al., 2010) and computer-mediated communication (Paakki et al., 2021).
We manually annotated a set of 438 social media crisis news comments with actions. First, two annotators independently annotated the same set of comments, then compared and negotiated their annotations and resolved all conflicts, analyzing especially the difficult cases. The annotators then resumed annotation work, and finally an inter-annotator score was calculated using Krippendorff's alpha. We achieved a score of 0.75, which indicates a good degree of agreement. Using the hand-labelled comments, we trained a classifier using few-shot learning (Yan et al., 2018), achieving an f1 score of 0.50. We also ran a zero-shot NLI classifier (Yin et al., 2019), which at the present time achieved better results (f1 0.61) and was thus used for labeling all comments. The labeling followed carefully prepared annotation guidelines based on the descriptions of actions in the literature (e.g. Couper-Kuhlen and Selting, 2017; Schegloff, 2007). The whole action annotation scheme involved 13 classes, following research on which actions are relevant and common in computer-mediated communication (Paakki et al., 2021). It also involved responsive actions (e.g. apology, acceptance) that were not included in the function groups. At this stage, we concentrate on the 8 actions described above.
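A zero-shot NLI setup in the spirit of Yin et al. (2019) can be sketched with the Hugging Face transformers pipeline; the model name, the example comment and the label strings below are assumptions, not the authors' exact configuration:

```python
from transformers import pipeline

# Zero-shot classification reframes labeling as natural language inference.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

action_labels = ["informing statement", "question", "accusation", "challenge",
                 "evaluation", "appreciation", "request", "proposal"]

result = classifier("Who is supposed to verify these vaccine claims?",
                    candidate_labels=action_labels)
print(result["labels"][0], round(result["scores"][0], 2))  # top action label and score
```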
We further sorted comments into groups based on their action label using a Python script. We proceeded to validate our approach by 1.) qualitatively analyzing the functions (per Labov, 1972) of a set of hand-labeled comments from the period 17–25 August 2021 (125 comments excluding duplicates) based on their content, comparing this analysis to our action-based computational classification, and 2.) using time-series analysis to investigate the emergence of function groups at different times during the crisis. We calculated a threshold to identify significant peaks in function group values (1.5 × SD over the group mean). We suspected that if the narrative functions were applicable to analyzing social media crisis discourses, there should be significant changes in which function groups are most common in crisis comments at different times.
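The peak criterion can be written down directly; a sketch assuming per-bin frequency counts for one function group (the binning granularity and the numbers are illustrative, as the text does not specify them):

```python
import numpy as np

def significant_peaks(counts, k=1.5):
    """Indices of time bins exceeding the group mean by more than k * SD."""
    counts = np.asarray(counts, dtype=float)
    threshold = counts.mean() + k * counts.std()
    return np.flatnonzero(counts > threshold)

# e.g. complication-function counts per time bin (illustrative numbers):
print(significant_peaks([3, 4, 2, 14, 5, 3]))  # -> [3]
```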
5. Results
Our validation step 1 shows that the computational classification of comments gives us similar results as our manual analysis. The time-period chosen involved an especially high amount of complication actions in both the manually annotated set as well as the computationally annotated set. These are mostly related to criticism or mistrust in authorities and the COVID-19 vaccine, comments about negative symptoms from the vaccine, confusion about who to trust and what to do, but also some arguments that support the authorities. The qualitative analysis of the functions of comments corresponds sufficiently well to the action-based computational categorization: most statements and announcements had an orientation function also in our qualitative analysis; most questions, accusations and challenges served a complication function; evaluations and appreciations corresponded well with the evaluation function; and requests and proposals mostly aimed at some type of a resolution. However, 10% of comments did not fall into the assumed function group based on action type. In some cases actions had another function than expected: informing statements sometimes provided a complication in a few cases where negative effects from vaccines were described, evaluations sometimes had an orientation function, and some long comments involved more than one significant function.
Secondly, our preliminary results from the time-series analysis show that there are significant changes in which functions crisis news comments have at different points of the crisis timeline. Within the NDTV crisis news comments during the early phase of the Corona crisis, there are more significant peaks in orientation or resolution oriented discourses. During the acute mid-phase of the crisis, the frequency of comments that have a complication function is significantly higher. At the last phase, functions become dispersed, i.e. none of the function groups come above the threshold. The time-series analysis is still a work in progress, but the results so far show that the crisis narrative achieves its most conflictive point at the acute mid-phase of crisis where COVID-19 vaccinations have become relevant.
6. Conclusion
Our results so far show that the Labovian narrative theory is to some extent applicable to analyzing crisis discourses on social media. The applied model allows us to analyze how the functions of discourses shift along the crisis timeline, and to identify significant points of discursive struggle. The operationalization of functions through actions seems to work sufficiently well, as it allows a justifiable and pragmatic frame for annotation, rooted in a well-researched field.
Based on our results, the action-based categorization has some limitations that need consideration, as the actions used do not always correspond to the expected function. However, the narrative function categories are highly abstract and thus difficult to classify as such, as we found in some earlier experiments, so for a computational model we consider an action-based labeling scheme the more pragmatic approach. Social media discourses did not exactly follow the Labovian narrative structure in our empirical case: although complication-oriented discourses occurred during the second phase, in line with the narrative theory, the early phase already involved significant crisis resolution discourses. The dataset for our third phase of crisis should be extended in later research to gain further insight into whether discourses related to some function group might emerge as significant. Further research also needs to investigate whether similar patterns of narrativization can be found in different cultural contexts and crises, and whether social media discourses follow their own pattern of narrative structure as compared to Labov's theory (1972). Finally, our few-shot classification needs more work to achieve higher accuracy: action classification for social media comments is not an easy task, for example because comments may involve several actions, and because deciding what action a comment represents sometimes requires interpretation that is hard to define clearly for each case in annotation guidelines.
This research advances the development of the growing line of computational narrative analysis methods, elaborating on the possibilities for using narrative functions to understand the narrativization of crisis discourses.
We argue that such tools are needed for supporting other means of research into crisis communication, for a multi-sided understanding of perspectives on crisis and social media engagement. Further, as social media is a site used to influence public opinion and to spread disinformation, the various discursive conflicts taking place in this arena are essential for crisis communicators to both understand and manage.
7. References
Leiming Yan, Yuhui Zheng, and Jie Cao. 2018. Few-shot learning for short text classification. Multimedia Tools and Applications, 77(22):29799–29810.
Monika Bednarek, Andrew Ross, Olga Boichak, Y.J. Doran, Georgia Carr, Eduardo Altmann, and Tristram Alexander. 2022. Winning the discursive struggle? The impact of a significant environmental crisis event on dominant climate discourses on Twitter. Discourse, Context & Media, 45:100564.
Herbert Clark and Edward F. Schaefer. 1989. Contributing to Discourse. Cognitive Science, 13(2):259–294.
Derek Edwards and Jonathan Potter. 1993. Language and causation: A discursive action model of description and attribution. Psychological review, 100(1):23–41.
Elizabeth Couper-Kuhlen and Margret Selting. 2017. Interactional linguistics: Studying language in social interaction. Cambridge University Press.
Kishaloy Halder, Alan Akbik, Josip Krapac, and Roland Vollgraf. 2020. Task-Aware Representation of Sentences for Generic Text Classification. In: Proceedings of the 28th International Conference on Computational Linguistics, pages 3202–3213, Barcelona, Spain. International Committee on Computational Linguistics.
Hélène Joffe. 2003. Risk: From Perception to Social Representation. British Journal of Social Psychology, 42(1): 55–73.
Marianne Jørgensen and Louise Phillips. 2002. Discourse analysis as theory and method. Sage, London.
Robert Kasperson, Ortwin Renn, Paul Slovic, Halina Brown, Jacque Emel, Robert Goble, Jeanne Kasperson, and Samuel Ratick. 1988. The social amplification of risk: A conceptual framework. Risk Analysis, 8(2):177–187.
William Labov. 1972. Language in the Inner City. Philadelphia: University of Pennsylvania Press.
William Labov and Joshua Waletzky. 1967. Narrative analysis: oral versions of personal experience. In: J. Helms, ed., Essays in the Verbal and Visual Arts, pages 12–44. University of Washington Press, Seattle.
George Lakoff and Srini Narayanan. 2010. Toward a computational model of narrative. In: 2010 AAAI Fall Symposium Series, pages 21–28, Menlo Park, California.
https://www.aaai.org/ocs/index.php/FSS/FSS10/paper/view/2323
Nic Newman, Richard Fletcher, Anne Schulz, Simge Andi, Craig Robertson, and Rasmus Nielsen. 2021. Reuters Institute Digital News Report 2021. Reuters Institute for the Study of Journalism, Oxford.
Henna Paakki, Heidi Vepsäläinen, and Antti Salovaara. 2021. Disruptive online communication: How asymmetric trolling-like response strategies steer conversation off the track. Computer Supported Cooperative Work, 30(3):425–461.
Andrew Piper, Richard So, and David Bamman. 2021. Narrative Theory for Computational Narrative Understanding. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic. 10.18653/v1/2021.emnlp-main.26
Emanuel Schegloff. 2007. Sequence Organization in Interaction: A Primer in Conversation Analysis. Cambridge University Press, Cambridge; New York.
Timothy Sellnow, Deanna Sellnow, Emily Helsel, Jason Martin and Jason Parker. 2019. Risk and crisis communication narratives in response to rapidly emerging diseases. Journal of Risk Research, 22(7):897–908.
Tanya Stivers, Nick Enfield, and Stephen Levinson. 2010. Question–Response Sequences in Conversation Across Ten Languages: an Introduction. Journal of Pragmatics, 42(10):2615–2619.
Tzvetan Todorov. 1971. The Two Principles of Narrative. Diacritics, 1(1):37–44.
Teun A Van Dijk. 1976. Philosophy of action and theory of narrative. Poetics, 5(4):287–338.
Jason Turowetz and Douglas Maynard. 2010. Morality in the social interactional and discursive world of everyday life. In: Hitlin S. and Vaisey S., eds., Handbook of the Sociology of Morality, pages 503–526, Springer, New York.
Sylvia Walby. 2015. Crisis. Polity Press, Cambridge.
Building the Corpus of Student Texts KOŠ
Tadeja Rozman,* Špela Arhar Holdt‡†
* Faculty of Public Administration, University of Ljubljana
Gosarjeva ulica 5, 1000 Ljubljana
tadeja.rozman@fu.uni-lj.si
‡ Faculty of Arts, University of Ljubljana
Aškerčeva ulica 2, 1000 Ljubljana
Faculty of Computer and Information Science, University of Ljubljana
Večna pot 113, 1000 Ljubljana
spela.arharholdt@ff.uni-lj.si
1 Introduction
Corpora of authentic texts produced by the school population are, both internationally and in Slovenia, an important source of information about the language competence of people who are still developing that competence during their education, and at the same time an indicator of the language and didactic practices of educational settings. Such resources are therefore important for language didactics and for the preparation of user-oriented language reference works and materials, as well as for the development of various language technology tools. Globally, corpus linguistics admittedly devotes more attention to the development and analysis of foreign language acquisition corpora,1 but in Slovenia, modelled on such corpora, we also have the Šolar corpus of school written products (Rozman et al., 2012) and its extended version Šolar 2.0 (Kosem et al., 2016). It contains texts written in class in the third triad of primary school and in secondary schools, and part of the corpus also contains authentic teacher corrections, categorized by type of language problem with a hierarchically designed annotation scheme (Arhar Holdt et al., 2018). Slovenian is thus one of the few languages with such data for the first language, but only for a limited school population, which is why, within the project Empirical foundations for digitally supported development of writing skills (ARRS, J7-3159),2 we are preparing an extension of the corpus with texts by university students, initially in the form of a pilot corpus of student texts.
2 Purpose of the corpus
The construction of the KOŠ corpus of student texts is primarily aimed at obtaining empirical data on the written language competence of the student population, as well as at analytical insight into the processes by which specialized writing develops. Students are expected to acquire basic language skills (normative, textual, pragmatic) by the end of secondary school, while at university this knowledge should be upgraded by acquiring terminology and the stylistic features of specialized texts. At least in non-linguistic study programmes, where language education is not a goal in itself but good language skills are merely a foundation for successful professional work, language competence in principle develops further alongside the acquisition of domain knowledge: through reading specialized works and through writing, e.g., seminar papers, essays and research reports, preparing oral presentations, participating in professional debates, etc. In doing so, students are expected to become aware of the processes of comprehension and writing, to engage with the intelligibility and acceptability of texts and the use of specialized vocabulary, and, where necessary, to remedy orthographic and grammatical shortcomings. However, we teachers observe large differences in students' language competence, and subject professors can deal with language problems only to a limited extent. Teachers' approaches to raising awareness of language choices also appear to vary, not only because of differing language knowledge but also because of differing views on the usefulness of such feedback, on written academic practices, etc., as well as because of a lack of didactic guidelines.
The need to develop communicative competence in professional Slovenian was recognized already during the preparation of the Resolution on the National Programme for Language Policy 2014–2018,3 and the language-planning goals for higher education and science defined there have not changed substantially in the current Resolution on the National Programme for Language Policy 2021–2025.4 The document stipulates that, at the tertiary level, learning of professional Slovenian must be made possible and that, on the basis of research and analyses of professional-scientific writing at this level, a curriculum for professional-scientific writing should be drawn up for an introductory course in the first year of first-cycle programmes.
1 For more on foreign-language acquisition corpora and on building a corpus of Slovenian as a foreign language, see e.g. Stritar Kučuk (2020).
2 https://www.cjvt.si/prop/
3 https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina/2013-01-2475?sop=2013-01-2475
4 https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina/2021-01-1999?sop=2021-01-1999
On the basis of these provisions, already present in the previous resolution, the KAS corpus of academic Slovenian – a corpus of BA, MA and PhD theses (Erjavec et al., 2021) published on the National Open Science Portal5 – was built in 2019 to obtain empirical data on professional-scientific writing. The corpus thus gathers students' specialized texts written at the completion of the cycles of higher education, supervised and to a large extent also proofread, so it is only partly usable for analysing the written language competence of the student population and the development of specialized writing. In the longer term, the KOŠ corpus could close the gap in corpus data between Šolar and KAS and provide a basis for investigating which foundational skills need to be (better) addressed at earlier educational stages and which at the tertiary stage, where the development of written language competence continues at more complex textual levels. A broader picture of this developmental arc would allow literacy education to be directed more effectively toward its final goal: empowered and independent (though, in line with contemporary practice, technologically and data-supported) writing of various text types for various communicative purposes, which is also important for successful professional activity.
3 Corpus design
The project envisages the preparation of a pilot corpus, to be published as a dataset in the CLARIN.SI repository. Corpus construction is taking place in the academic year 2021/22 and is expected to finish in autumn 2022. The texts are collected following the methodology used for the Šolar corpus, which includes: a legal arrangement of open access to the results (preparation and signing of contracts on the transfer of rights and permissions for their use), recording of all relevant metadata (programme, year of study, field of study, text type, possible multi-authorship, and, when several versions of the same text are submitted, labels for the original and revised versions), at least partial inclusion of professors' language corrections, storage in a compatible format, and automatic annotation.
Language corrections will be recorded in the corpus with the Svala tool (Wirén, 2019), which allows a clear side-by-side view of the source and the corrected text, pseudonymization of those parts of the text that could reveal authorship or other sensitive personal information, and annotation and content-based categorization of language corrections. Within the project Development of Slovene in a Digital Environment,6 the tool was adapted for work with the Slovenian corpora KOST (Stritar Kučuk, 2020) and Šolar, and as such it supports annotation with the Šolar label system (Arhar Holdt et al., 2018). We will use these labels for the KOŠ corpus as well (see Figure 1), but we expect that the annotation scheme will need to be partly adapted for student texts. It is to be expected (and the material collected so far confirms this) that, owing to the genre specifics of student texts, which are reviewed by professors who are not linguists, corrections are rarely concrete suggestions of correct language choices; rather, professors' comments alert students to language errors only in general terms and focus more on the stylistics of specialized texts, appropriate use of terminology, citation, intelligibility of writing, argumentation, etc. All corpus texts (with and without annotated corrections) will then be automatically annotated at the levels of sentence segmentation, tokenization, lemmatization, morphosyntax, syntax and named entities with the CLASSLA StanfordNLP annotation tool (Ljubešić and Dobrovoljc, 2019), which at the time of writing is also being developed within the aforementioned project.
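A minimal sketch of the annotation levels just listed, assuming the standard CLASSLA interface for Slovenian (the example sentence is illustrative, and this is not the project's actual processing script):

```python
import classla

classla.download('sl')  # fetch the Slovenian models on first use
# Sentence segmentation and tokenization run as part of 'tokenize'; the other
# processors add morphosyntax, lemmas, dependency syntax and named entities.
nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse,ner')

doc = nlp("Besedila zbiramo prek učiteljev na dveh fakultetah.")
for word in doc.sentences[0].words:
    print(word.text, word.lemma, word.upos)
```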
Texts are being collected in first-cycle study programmes at two faculties; potentially relevant for inclusion are all texts that students submitted to teachers during the study process at the faculty and that are not handwritten. We therefore collect texts through teachers, as this gives greater certainty of receiving authentic texts actually written in the study environment, presumably mostly seminar papers, essays, reports, summaries of scientific articles, longer (essay-type) answers to questions, and possibly also outlines and drafts of BA theses. Texts connected with the preparation of final theses are very valuable for assessing the ability to produce a longer specialized text at the end of education, also because of the insight they give into supervisors' comments and corrections, but at the moment their inclusion in the corpus seems problematic from the viewpoint of anonymization, since final theses are as a rule freely published online and easily linked to drafts, authors and supervisors.
5 https://openscience.si/
6 https://www.slovenscina.eu/
Figure 1: Testing the correction-entry methodology in a test version of the localized Svala tool.
4 Next steps
In the project Empirical foundations for digitally supported development of writing skills we aim to produce a pilot corpus of 200,000 tokens and, alongside its construction, to assess how well the Šolar methodology transfers to students' written production and to draft specifications for the further development of a corpus of student writing, i.e. to define the desired size, the structure with respect to regional coverage, type and field of study, and the typology of corrections. In this context we are also preparing a short survey questionnaire for university teachers, with which we want to obtain additional information about their feedback practices and thus design the collection and recording of this material as effectively as possible. As expected, the material collected so far indicates that these practices are quite diverse and differ in many respects from the feedback of teachers of Slovene recorded in the Šolar corpus.
Within the project, the collected corpus material will also be used for pilot quantitative and qualitative linguistic analyses of student writing. The analyses will focus on typical writing problems and on the patterns of pointing out linguistically inappropriate or less appropriate formulations, which includes feedback given by entering a solution, by descriptive recommendations, by graphically marking the location of the problem, or in other ways.
We will compare the results with frequency-ordered lists of language difficulties in the Šolar corpus. The findings are expected to give a first outline of the development of written language competence at the transition from secondary to university education and of possible gaps in fundamental language knowledge, and to show how the learning process can best be supported with empirical data.
Acknowledgements
The project Empirical foundations for digitally supported development of writing skills (J7-3159) and the programme Language Resources and Technologies for Slovene (P6-0411) are co-financed by the Slovenian Research Agency from the state budget.
References
Špela Arhar Holdt, Polona Lavrič, Rebeka Roblek and Teja Goli. 2018. Kategorizacija učiteljskih popravkov: Smernice za označevanje korpusa Šolar 2.0, v1.0. Deliverable of the project Nadgradnja korpusa Šolar. https://solar.trojina.si/wp-content/uploads/2022/05/Smernice-za-oznacevanje-korpusa-Solar-2.0-v1.0.pdf
Tomaž Erjavec, Darja Fišer and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources & Evaluation 55: 551–583. https://doi.org/10.1007/s10579-020-09506-4
Iztok Kosem, Tadeja Rozman, Špela Arhar Holdt, Polonca Kocjančič and Cyprian Adam Laskowski. 2016. Šolar 2.0: nadgradnja korpusa šolskih pisnih izdelkov. In: Zbornik konference Jezikovne tehnologije in digitalna humanistika, pp. 95–100. Znanstvena založba Filozofske fakultete, Ljubljana. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Kosem-et-al_Solar-2-0-nadgradnja-korpusa-solskih-pisnih-izdelkov.pdf
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, BSNLP@ACL 2019, pp. 29–34. https://aclanthology.org/W19-3704.pdf
Tadeja Rozman, Mojca Stritar Kučuk and Iztok Kosem. 2012. Šolar – korpus šolskih pisnih izdelkov. In: T. Rozman, ed., I. Krapš Vodopivec, M. Stritar and I. Kosem: Empirični pogled na pouk slovenskega jezika, pp. 15–35. Trojina, zavod za uporabno slovenistiko, Ljubljana.
Mojca Stritar Kučuk. 2020. Modul Leto plus – prvi korak do korpusa slovenščine kot tujega jezika. In: Zbornik konference Jezikovne tehnologije in digitalna humanistika 2020, pp. 131–135. Inštitut za novejšo zgodovino, Ljubljana. http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_StritarKucuk_Modul-Leto-plus%e2%80%93prvi-korak-do-korpusa-slovenscine-kot-tujega-jezika.pdf
Mats Wirén, Arild Matsson, Dan Rosén and Elena Volodina. 2019. SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora. In: Selected papers from the CLARIN Annual Conference 2018, Pisa, 8–10 October 2018, pp. 227–239. https://ep.liu.se/en/conference-article.aspx?series=ecp&issue=159&Article_No=23
Corpus approaches to the identification of metaphor and metonymy: the case of metonymy in the g-KOMET corpus
Špela Antloga
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru
Koroška cesta 46, 2000 Maribor
s.antloga@um.si
Abstract
Recognition of the value and pervasiveness of metaphorical and metonymic expressions in language has, over the last twenty years, led to increased interest in the systematic identification and extraction of such figurative expressions from corpora of individual languages. Expressions involving the conceptual mappings that underlie metaphorical and metonymic processes are difficult to extract from corpora that are not specially annotated for the purposes of figurative language research. In this paper I present the most common methods of extracting metaphorical and metonymic expressions from language corpora and, using the g-KOMET corpus, which is manually annotated for metaphorical and metonymic expressions in spoken Slovene, illustrate an attempt at systematizing metonymic transfers.
Corpus approaches to metaphor and metonymy identification: The case of metonymy in g-KOMET
Recognizing the value of metaphorical and metonymic expressions in language has in the last two decades led to increased interest in the systematic identification and extraction of figurative expressions in various language corpora. Expressions in which conceptual mappings that participate in metaphorical and metonymic processes take place are difficult to extract from a corpus that is not specifically annotated for the purposes of figurative language research. We describe the prevailing methods of searching for metaphorical and metonymic expressions in language corpora. Using g-KOMET, a corpus manually annotated for metaphorical and metonymic expressions in spoken Slovene, we attempt to systematize some of the most frequently annotated metonymic mappings.
1. Introduction

Language and thought are closely connected. Our thinking is so complex that we are not always able to express everything 'directly' with language, so we use various linguistic-cognitive devices to explain the world, among them metaphor and metonymy. There is little corpus research on metaphor and metonymy, or on other forms of figurative language, in Slovene. Although in the last decade corpus methods have become an established empirical paradigm in Slovene linguistics, above all in areas connected with lexicology, grammar and language use, the field of figurative language, which at the theoretical level gained momentum with the rise of the theory of conceptual metaphor and metonymy (Lakoff and Johnson, 1980; Lakoff and Turner, 1989; Lakoff, 1993), lags somewhat behind this trend (Bedkowska-Kopczyk, 2016; Antloga, 2020c). One possible reason is the lack of a uniform and successful method for the systematic identification of metaphorical and metonymic expressions in existing corpora, which are not specially annotated for conceptual mappings. Consequently, for the systematic analysis of conceptual structures in language, linguists have resorted to building corpora with annotated potential metaphorical and metonymic expressions; these, however, are time-consuming to produce and require considerable adaptation of the annotation schemes to the target language of the research.

This paper describes various more or less established methods of identifying metaphorical and metonymic expressions in existing (general) text corpora, with all their advantages and drawbacks. As one of the resources for the systematic analysis of metaphorical and metonymic expressions in spoken Slovene, the g-KOMET corpus, created within the CLARIN.SI 2021 call, is presented. Using g-KOMET as an example, an attempt at systematizing and classifying the most frequent annotated metonymic transfers in spoken Slovene is presented.

2. Metaphor and metonymy in cognitive linguistics

One of the key findings of the contemporary view of metaphor and metonymy is that we do not use metaphors and metonymies merely for linguistic communication; we also think in them. In this spirit, conceptual metaphor theory is interested above all in the ways concepts are mentally organized, through which people make sense of the reality that surrounds them and of the society in which they live (Bratož, 2010). For linguists, such questions were at first a particular challenge, as they demand a view beyond the boundaries of linguistics into other disciplines, such as psychology, neuroscience, philosophy and other fields, and thus presuppose an interdisciplinary way of working. At the end of the 1970s the so-called cognitive turn thus took place, which transferred metaphor and metonymy from the linguistic level to the conceptual, mental level. Metaphor and metonymy came to be treated as conceptual mechanisms through which knowledge about concrete phenomena and experiences is projected onto numerous abstract domains. For example, we usually conceptualize time as space, emotions as natural forces, and organizations as organisms or machines (Bratož, ibid.).

2.1. Metaphor
According to the contemporary definition, then, metaphors are not only a linguistic expression; we also think in them. Among the various theoretical approaches devoted to the study of metaphor, one of the most prominent in today's research on metaphor and metaphorical expressions is the theory of conceptual metaphor, developed by George Lakoff and his colleagues (Lakoff and Johnson, 1980; Lakoff and Turner, 1989). According to this theoretical model, metaphors are an essential element of human cognition and a device that enables us to understand and experience one experiential field or domain in terms of (within the framework of) another. The transfer proceeds via so-called cross-domain mappings between the source domain, which is usually more concrete, and the target domain, which is more abstract.

2.2. Metonymy

Traditional rhetoric treated metonymy above all as a rhetorical figure, i.e. it thought of it as a linguistic phenomenon, an object of figurative language (Radden and Kövecses, 1999). Aristotle, too, did not fully recognize the characteristics of metonymy and conceived of it as a subtype of metaphor (Bernjak and Fabčič, 2018). A similar definition of metonymy can also be found in contemporary dictionaries, e.g. in the Dictionary of Standard Slovene (SSKJ).1 Jakobson (1956) emphasized the inherence of metonymy in language and put forward the notion of contiguity as the fundamental principle of metonymy. Cognitive linguists build on these and similar positions and extend the phenomenon of metonymy to a conceptual-semantic mechanism that enables the structuring of language and thought, and thus operates as a central device in the process of conceptualization. Lakoff and Johnson (1980: 46–52) define metonymy at the level of conceptualization as a conceptual operation or cognitive process in which one, source entity is used to provide mental access to another, target entity within a particular conceptual domain. They thus treat metonymy as a conceptual-semantic mechanism that structures not only language but also our thought. Whereas in metaphor a mapping takes place from one conceptual domain to another, metonymy involves only one domain, since the mapping between two elements occurs within a single domain. Lakoff and Johnson (1980) stress that, like metaphor, metonymy is conceptual in nature and is a phenomenon that plays a central role in structuring our knowledge of the world. Kövecses (2002) says that metonymy is a cognitive process in which we reach a certain conceptual entity (the target) with the help of another conceptual entity (the vehicle). In other words, one conceptual entity is a reference point that enables mental access to another conceptual entity. A more schematic comparison of conceptual metaphor and metonymy is given in Table 1.

1 'metonymy, -e f. lit. a figure of speech in which a given concept is named by an expression for another concept related to it materially or quantitatively' (SSKJ).

                                      metaphor                                metonymy
function of the conceptual relation   inference on the basis of similarity    referentiality
nature of the conceptual relation     similarity                              (logical) contiguity

Table 1: Distinguishing between metaphor and metonymy. Adapted from Feyaerts (2012).

3. Methods of extracting metaphorical (and metonymic) expressions from corpora

In connection with metaphor and metonymy, the lack of a suitable methodology makes the (systematic) identification and extraction of the relevant data from a general language corpus particularly problematic. The conceptual mappings that participate in metaphorical and metonymic processes are not directly tied to individual linguistic forms, and they are hard to extract from corpora that are not specially annotated for the purposes of figurative language research. Through a combination of automatic and manual data extraction from general corpora, the following methods of identifying metaphorical (and metonymic) expressions have taken shape in other languages (Stefanowitsch, 2006).

Manual extraction of metaphorical words from a corpus became established out of the need for a (more) systematic analysis of conceptual metaphor and metonymy: reading the corpus text was followed by the systematic listing of metaphorical and metonymic expressions (Semino and Masci, 1996). The work was, of course, time-consuming and limited in scope, and above all left most of the data in the corpus unexploited, but it was certainly more systematic than relying on sporadic examples or on examples that did not come from actual language use. Even so, cognitivists were criticized for subjectivity, lack of empirical grounding and inconsistency in recognizing (finding) and interpreting conceptual metaphors and metonymies (e.g. Tummers et al., 2005; Wasow and Arnold, 2005).

In the source domain of a mapping, metaphorical and metonymic expressions are always linked to non-transferred (non-figurative) lexical units. In response to the criticism, the next stage of the corpus approach to figurative language was therefore searching for the source domain by keywords, i.e. identifying metaphors on the basis of potential source domains (semantic fields that are assumed, or have already been found, to participate in metaphorical mappings, such as heart, fire, battle, journey, etc.). The search can proceed via individual words in the conceptual structure or via a group of semantically related words (for example ogenj 'fire', plamen 'flame', vročina 'heat', pogoreti 'to burn down', zgoreti 'to burn up', plamteti 'to blaze', vzplamteti 'to flare up', etc.). By manually reviewing the results, the potential metaphoricity of an expression was determined, followed by the target domain of the metaphorical mapping (e.g. LOVE, ANGER, etc.). Lists of source-domain keywords for the identification of metaphors in individual languages gradually began to form. On the basis of these lists, linguists then investigated metaphors in various languages, contexts and discourses (Hanks, 2004; Koller, 2006).

The gradual establishment of identifying metaphorical and metonymic expressions in corpora by searching for source-domain keywords led to an interest in investigating figurative language in more concrete, more specific domains, e.g. in political discourse, in economics, in sport, etc. In these cases the source-domain-oriented approach was not effective, as it would require advance knowledge of the source of the mapping (the source domain) that might potentially be found in the target domain. This is why the method of searching via keyword lists shifted to the target domain. For the effective identification of metaphorical and metonymic expressions by target-domain keywords, a large quantity of representative, single-topic texts connected with the target domain in question is needed. This is relatively easy for 'concrete' target domains such as POLITICS, ECONOMY and SPORT listed above, whereas searching for metaphorical and metonymic expressions with target domains such as EMOTION, MENTAL ACTIVITY, PERCEPTION, etc. would be harder (some solutions are offered by Tissari, 2003). The second problem connected with this kind of corpus identification of metaphors is that only those source domains would be identified that are linked to expressions frequent enough in the target domain to have made it onto the list of target-domain keywords. The analysis of metaphorical transfers will therefore not be comprehensive and systematic.

By combining the two previously described methods, the method of searching for sentences that contain keywords of both the source and the target domain became established, chiefly in the form of automatic extraction of metaphorical expressions. Even so, the method still requires a thorough manual review of the extracted data because of possible homographs or the non-transferred meaning of either expression in a sentence. A further problem is that such a search requires a very exhaustive list of words from both domains, otherwise the search is incomplete. Moreover, this method is more useful for investigating already known conceptual structures, metaphors and metonymies than for the systematic identification of (new, or all) conceptual structures.
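The keyword-based searches just described amount to a lemma lookup over a lemmatized corpus. Below is a minimal sketch; the corpus variable and the lemma sets are illustrative placeholders, not part of any of the cited tools, and the candidate sentences it yields would still need the manual review discussed above.

```python
# A minimal sketch of source-domain and combined source+target keyword
# search over a lemmatized corpus. Corpus and lemma sets are placeholders.
SOURCE_FIRE = {'ogenj', 'plamen', 'vročina', 'pogoreti', 'zgoreti', 'plamteti'}
TARGET_ANGER = {'jeza', 'bes', 'srd'}

corpus = [
    ['v', 'njem', 'je', 'plamteti', 'jeza'],   # lemmatized sentences
    ['ogenj', 'je', 'pogoreti', 'do', 'tla'],
]

def candidates(corpus, source, target=None):
    """Yield sentences with a source-domain lemma and, if a target set is
    given, a target-domain lemma as well (the combined method)."""
    for sent in corpus:
        lemmas = set(sent)
        if lemmas & source and (target is None or lemmas & target):
            yield sent

print(list(candidates(corpus, SOURCE_FIRE)))                # source only
print(list(candidates(corpus, SOURCE_FIRE, TARGET_ANGER)))  # combined
```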
Some attempts at identifying metaphorical expressions have also relied on so-called signals of metaphoricity, i.e. metalinguistic expressions that announce or signal metaphorical use. As metaphor signals, Goatly (1997) lists expressions such as metaphorically/figuratively speaking and so to speak, intensifiers such as literally and actually, or even orthographic marks such as quotation marks, italics, etc. With this method relatively few metaphorical expressions can be extracted, but on the other hand it allows us to observe the linguistic circumstances in which metaphorical use in a text is deliberately (or unintentionally) explicitly signalled (Skorczynska and Ahrens, 2015).

One of the most recently established methods is searching a corpus annotated with conceptual mappings. The first corpus annotated with conceptual mappings, in the form of indirect, direct and implicit metaphorical words2 in four text types (newspaper texts, academic texts, literary texts and conversation) for English,3 was built in 2012 by a group of researchers who named themselves the Pragglejaz group. Alongside it they developed a procedure for determining metaphorical words in a text, called MIPVU (Steen et al., 2010), in order to enable a more objective, precise and systematic (linguistic) analysis of metaphorical expressions in various texts. The basic starting point for annotating metaphorical words in this procedure is determining the relation between the basic and the contextual meaning of a word: for each lexical unit it must be established whether its concrete contextual meaning differs from its basic meaning. Adapted to the characteristics of individual languages, the procedure sparked interest in the identification of metaphorical expressions and metaphors in Czech (Pavlas et al., 2018), Lithuanian (Urbonaitė, 2016), Hungarian (Babarczy and Bencze, 2010), Polish (Rosiński and Marhula, 2015) and Serbian (Bogetić, 2019), and in the construction of metaphor corpora for Russian (Badryzlova and Lyashevskaya, 2017), Croatian (Despot et al., 2019) and Chinese (Lu and Wang, 2017). One attempt to build a corpus of metaphors in Slovene that would enable the linguistic analysis of metaphorical expressions and metaphors in various texts and offer the possibility of recognizing the culture-specific meaning of metaphors is the KOMET 1.0 metaphor corpus (Antloga, 2020a) and its continuation with added transcriptions of spoken language, the g-KOMET corpus (Antloga and Donaj, 2022).

2 What is annotated are not metaphors but words that can potentially be realized as metaphors.
3 See http://www.vismet.org/metcor/search/showPage.php?page=start.

4. The g-KOMET corpus

The g-KOMET corpus4 (a corpus of metaphorical and metonymic expressions in spoken language) is an extension of the written corpus of metaphorical expressions and metaphors KOMET 1.0 with 52,529 words of transcriptions of speech. Compared with KOMET 1.0, the extension also includes the definition and manual addition of new tags, namely tags for idioms and metonymies. The text for the corpus was extracted from the GOS corpus. Given the desired size of our corpus, we selected 5% of the text from each file of the GOS corpus: we randomly chose an initial utterance5 and kept adding consecutive utterances until the desired size was reached. If the size was reached in the middle of an utterance, all the remaining words in it were added as well. In this way we arrived at a final corpus size of 52,529 words with the same balance of text types as in the GOS corpus. The corpus thus includes a balanced selection of transcriptions of informative, educational and entertainment discourse, as well as of private (telephone conversation, personal contact) and non-private (telephone conversation, personal contact) discourse. If a word was transcribed both in its colloquial and in its normalized form, we took the normalized form. Note that some colloquial words are written as two words in the normalized form, e.g. nemo as ne bomo ('we will not'). When extracting the text, we removed time stamps and speaker-turn boundaries,6 since the beginning and end of an extracted stretch of text do not coincide with the beginnings and ends of speaker turns. We did, however, keep other tags, e.g. laughter, noise, and interrupted words or false starts. The Q-CAT tool (Brank, 2019) was used for the annotation.

4 The construction of the corpus was financed within the CLARIN.SI 2021 call. The corpus is available at http://hdl.handle.net/11356/1293.
5,6 'Utterance' and 'speaker turn' are understood as defined in the GOS transcription guidelines, see Zwitter Vitez et al., 2009.
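The 5% sampling procedure described in Section 4 can be sketched as follows. The data layout (a list of utterances per file, each a list of words) is an illustrative placeholder, not the actual GOS file format, and wrap-around at the end of a file is omitted for brevity.

```python
# A minimal sketch of per-file sampling: pick a random starting
# utterance and add consecutive utterances until the quota is reached,
# always finishing the last utterance even if the quota is exceeded.
import random

def sample_file(utterances, share=0.05):
    quota = share * sum(len(u) for u in utterances)
    start = random.randrange(len(utterances))
    sampled, size = [], 0
    for utt in utterances[start:]:   # wrap-around omitted for brevity
        sampled.append(utt)          # whole utterances only
        size += len(utt)
        if size >= quota:            # quota reached mid-utterance:
            break                    # the full utterance is kept
    return sampled

gos_file = [['eee', 'dober', 'dan'], ['kako', 'si'], ['v', 'redu', 'sem']]
print(sample_file(gos_file, share=0.5))
```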
4.1. Annotating metaphorical words (tags MRWi, MRWd, MFlag and WIDLI)

The annotation of metaphorical words was based on the MIPVU metaphor identification procedure (Steen et al., 2010),7 which enables the systematic identification of linguistic metaphor. Linguistic expressions were identified that have the potential to be realized as metaphors. For each lexical unit in the text, its basic meaning (according to SSKJ) and its meaning in context were determined. If the contextual meaning differed from the basic meaning of the word, the word was annotated as a metaphor-related word (MRW). The annotated metaphorical words were then assigned information on whether they are (1) an indirect metaphor (MRWi), (2) a direct metaphor (MRWd) or (3) a borderline case (WIDLI). (4) Metaphor signals (MFlag) were annotated as well.8 The corpus was annotated by a single annotator.
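The core MIPVU decision step described in Section 4.1 can be sketched as a comparison of two senses. The tiny sense inventory below stands in for SSKJ lookups and is invented for illustration; real MIPVU annotation of course involves far more lexicographic judgement.

```python
# A minimal sketch of the MIPVU decision rule: a word is a
# metaphor-related word (MRW) when its contextual meaning differs from
# its basic meaning. The sense inventory is an invented placeholder.
BASIC_SENSE = {
    'zablesteti': 'to shine, to give off bright light',  # basic, concrete
}

def is_mrw(word, contextual_sense):
    """Return True if the contextual sense departs from the basic one."""
    basic = BASIC_SENSE.get(word)
    return basic is not None and contextual_sense != basic

# 'zablestela' said of a person at an event: the contextual sense
# 'to attract attention, to excel' differs from the basic sense
print(is_mrw('zablesteti', 'to attract attention, to excel'))  # True
```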
4.2. Annotating fixed multi-word expressions (tag idiom)

Multi-word units whose meaning differs from the meanings of their individual components were annotated. At least one component of an annotated fixed expression was thus used metaphorically.

4.3. Assigning the semantic field of the metaphorical transfer (tag frame)

The annotated metaphorical expressions and fixed expressions were assigned to semantic fields, which function as a system of categories structured according to the particular context that motivates them. A semantic field makes it possible to find, within a given semantic category (e.g. natural phenomena, time, spatial orientation, family, motion, etc.), metaphorical expressions that are potential realizations of some conceptual structure. In the g-KOMET corpus, 65 semantic fields were assigned to the annotated metaphorical words and fixed expressions.

4.4. Annotating metonymies

Whereas in metaphors a mapping takes place from one experiential domain to another experiential domain, in metonymies the mapping takes place within a single domain, and what we determine is the relation between the two entities of the mapping. The identified metonymic expressions were assigned 45 types of metonymic mapping.

7 Metaphor Identification Procedure Vrije Universiteit (MIPVU).
8 For a more detailed explanation of the methodological starting points for defining the annotation scheme, see Antloga 2020a.

Annotated elements    Number of annotated words (per cent); Σ = 52,529 words
metaphorical words    728 (1.38 %)
idioms                256 (0.49 %)
metonymies            744 (1.42 %)
semantic fields       65

Table 2: Annotated figurative elements in the g-KOMET corpus.

5. Analysis and classification of the annotated metonymic expressions in the g-KOMET corpus

Although both metaphor and metonymy have been objects of interest in cognitive semantics since the very beginnings of cognitive linguistics, attention has always been directed primarily at metaphor. Even today, research on metonymy is very marginal in comparison with metaphor, although numerous linguists recognize the key importance of metonymy in everyday language and emphasize the diverse metonymic relations as ways of organizing conceptual structure (Bratož, 2010). In the g-KOMET corpus, 744 metonymic expressions were annotated, each assigned one of 54 tags for different metonymic transfers.

Type of metonymic transfer                         Share of all annotated metonymic expressions in g-KOMET
general for specific                               16.8 %
institution for person (group)                     9.7 %
part for whole                                     7.1 %
result of an action for the action                 6.4 %
name for the work                                  6.3 %
property for person                                6 %
direction for goal                                 5.6 %
whole for part                                     3.6 %
object for activity                                3.6 %
place for person (group)                           3.5 %
possession for activity                            2.1 %
body part for person (group)                       1.6 %
means of an action for the result of the action    1.3 %
ideology for person (group)                        1.3 %
action for the result of the action                1.2 %
building for institution                           1.2 %
company for worker (group)                         1.2 %
place for event                                    1.2 %

Table 3: The most frequent annotated metonymic transfers, as percentages of all annotated metonymic expressions in the g-KOMET corpus.
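The shares in Table 3 follow directly from counting the type tag of every annotated expression. A minimal sketch, with the annotation list as an illustrative placeholder rather than the actual g-KOMET export format:

```python
# Count metonymy-type tags and normalize to percentages, as in Table 3.
from collections import Counter

annotations = [
    'general for specific', 'institution for person (group)',
    'general for specific', 'part for whole',
]

counts = Counter(annotations)
total = sum(counts.values())
for mtype, n in counts.most_common():
    print(f'{mtype}: {100 * n / total:.1f} %')  # share of all metonymies
```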
Instead of the traditional definition of metonymy types by metonymic transfer (see above), I also give an alternative, content-based division of metonymy, as it emerges from the annotated g-KOMET corpus. The division starts from the assumption that metonymies can be categorized according to the kind of conceptual content that is accessed via the metonymy. A conceptual metonymy is thus classified according to which conceptual content it activates in the metonymic transfer. Only the metonymy types that are most frequent in the g-KOMET corpus are listed. For sound conclusions about the role of metonymy in spoken language/conversation, a comparison with the representation and role of metonymy in non-spoken texts would be indispensable.

5.1. The THING FOR X metonymy

THING FOR X metonymies are metonymies whose target (the intended referent) is a THING that is accessed via reference content connected with it in the same idealized cognitive model. THING FOR X metonymies can be divided into subcategories according to the conceptual source of the metonymic transfer:

THING FOR THING
The metonymic transfer enables direct mental access to a thing via some other thing, or via the role or function that the latter performs in the situation.
(…), da kozica vre 20 do 25 minut (…) 'let the pan boil for 20 to 25 minutes' →
CONTAINER (pan) FOR CONTENT (the water in the pan)

THING FOR PERSON (GROUP)
(…) so samo še bobni igrali (…) 'only the drums were still playing'
INSTRUMENT FOR THE MUSICIAN PLAYING THE INSTRUMENT

THING FOR PROPERTY
(…) vidi mercedesa ko se pogleda v ogledalo (…) 'he sees a Mercedes when he looks in the mirror'
CAR FOR VIRTUE/SHORTCOMING

THING FOR EVENT
(…) na rdeči preprogi (…) znova zablestela (…) '(she) shone again on the red carpet'
THING AT AN EVENT FOR THE WHOLE EVENT

5.2. The PROPERTY FOR X metonymy

In PROPERTY FOR X metonymies (the intended referent of the transfer is a PROPERTY), what matters is that an individual or group falls within the category of 'ideal members' of that category, which is conditioned by the closeness of the individual or group to the ideal set by the standard referent, e.g. a stereotypical property (which replaces the other properties), a visible property (which replaces emotional properties), etc. According to the conceptual source of the metonymic transfer, they can be divided into two groups in the g-KOMET corpus:

PROPERTY FOR GROUP
(…) taka mesta ki so (…) pa črni so tam (…) 'the kind of towns where (…) the blacks are'
THE COLOUR BLACK FOR BLACK PEOPLE

PROPERTY FOR PERSON
(…) najlepša danes (..) 'the most beautiful (one) today'
A PERSON'S APPEARANCE FOR THE PERSON

5.3. The PERSON FOR X metonymy

PERSON FOR X metonymies are frequent metonymies in which a person's activity, the results of the activity, the place of the activity, etc. are transferred onto the person performing the activity. They can be divided into the following subcategories:

PERSON FOR ACTIVITY
(…) zadnjič gledal nogometaše (…) 'watched the footballers the other day'
PERSON INVOLVED IN AN ACTIVITY FOR THE ACTIVITY

PERSON FOR THEORY
(…) vsi citirajo Žižka (…) 'everyone quotes Žižek'
REPRESENTATIVE OF A THEORETICAL APPROACH FOR THE TENETS OF THAT APPROACH

PERSON FOR LOCATION
(…) pa pri zdravniku sto let čakala (…) 'waited at the doctor's forever'
PERSON PERFORMING AN ACTIVITY FOR THE PLACE WHERE THE ACTIVITY IS PERFORMED

5.4. The LOCATION FOR X metonymy

In LOCATION FOR X metonymies, a LOCATION is used to evoke one or more entities found at that location. Since the location and what is found at the location stand in a kind of spatial relation, such metonymies could also be characterized as PART FOR WHOLE. LOCATION FOR X metonymies can be divided into subcategories:

LOCATION FOR EVENT
(…) to mi je ostalo od Otočca (…) 'I have this left over from Otočec'
PLACE WHERE AN EVENT TOOK PLACE FOR THE EVENT

LOCATION FOR INSTITUTION
(…) se zmenijo na Čufarjevi (…) 'they arrange it on Čufarjeva (Street)'
STREET NAME FOR THE BUILDING ON THAT STREET

LOCATION FOR THING
(…) da sem kar McDonald’s prinesla domov (…) 'so I just brought McDonald’s home'
RESTAURANT FOR THE FOOD FROM THE RESTAURANT

LOCATION FOR PERSON (GROUP)
(…) gostilna pa vse čisto tiho (…) 'and the whole pub fell silent'
PLACE WHERE A PERSON (GROUP) GATHERS FOR THE PERSON (GROUP) IN THAT PLACE

Metonymies can also be observed with respect to the aspect that determines the source/vehicle of the metonymic transfer. This view starts from the assumption of cognitive linguistics that conceptual metonymy has experiential and cognitive foundations, and that its linguistic realizations are only one of the possible forms through which it is expressed. Cognitive approaches therefore use the notion of idealized cognitive models (ICMs), which represent an abstraction of human experience. They operate as abstracted schemas that partially capture our knowledge of the world. For cognitive approaches, the primary question is why we choose precisely one particular conceptual entity, and not another, for a metonymic expression. On this basis (extended after Radden and Kövecses, 1999), the annotated metonymic expressions can also be observed:

- from the point of view of the connection between the frequency of a metonymic transfer and human experience (e.g., in the g-KOMET corpus, the metonymic transfers general for specific (125) : specific for general (3); concrete for abstract (7) : abstract for concrete (3); defined for
undefined (2) : undefined for defined (0)). Because they are easier to understand, metonymic transfers are more likely to run from the general to the specific, from the concrete to the abstract, and so on. Connected with the high frequency of such annotated metonymic transfers in the corpus is one of the essential (most frequent) functions of metonymy, the referential function: a kind of shortcut for designating a complex and abstract phenomenon with a simpler, more concrete and more comprehensible phenomenon (expression);

- from the point of view of the connection between the frequency of a metonymic transfer and cultural preference (in the g-KOMET corpus we can observe various culture-specific metonymic transfers: property for person (45), property for thing (9), property for institution (1), individual for group (4), ideology for person (group) (11), institution for person (group) (72), etc.). These conceptual schemas bring together individual elements connected with our culture-specific knowledge of the world, society, conventions and customs. In a concrete linguistic situation, context and experience often determine which segment of encyclopedic knowledge will be profiled as important and realized linguistically.

6. Future work

The identification and analysis of metonymic and metaphorical expressions from a corpus perspective still have a long way to go in Slovene. Although some metaphorical mappings and metonymic transfers are universal, i.e. present in several languages, their frequency of occurrence in individual languages, their realization and their embedding in the culture-specific elements of the language space are unique. For the further analysis of metaphorical expressions in Slovene, a comparison will be of interest between the KOMET 1.0 corpus, in which metaphorical words are annotated in written language, and the g-KOMET corpus, which contains spoken texts in the form of transcriptions. Since tags for metonymic transfers were also added to the g-KOMET corpus, one of the next goals is a systematic analysis of metonymy in spoken language.

7. References

Špela Antloga. 2020a. Korpus metafor KOMET 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1293.
Špela Antloga. 2020b. Korpus metafor KOMET 1.0. In: Jezikovne tehnologije in digitalna humanistika: zbornik konference, 24.–25. september 2020, pp. 176–170. Ljubljana: Inštitut za novejšo zgodovino.
Špela Antloga. 2020c. Vloga metafor in metaforičnih izrazov v medijskem diskurzu: analiza konceptualizacije boja. In: J. Vogel, ed., Slovenščina – diskurzi, zvrsti in jeziki med identiteto in funkcijo, pp. 27–34. Ljubljana: Znanstvena založba Filozofske fakultete.
Špela Antloga and Gregor Donaj. 2022. Korpus g-KOMET. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1490.
Anna Babarczy and Ildikó Bencze. 2010. The automatic identification of conceptual metaphors in Hungarian texts: A corpus-based analysis. In: LREC 2010 Workshop on Methods for the Automatic Acquisition of Language Resources: Proceedings, pp. 31–36.
Yulia Badryzlova and Olga Lyashevskaya. 2017. Metaphor Shifts in Constructions: the Russian Metaphor Corpus. In: Computational construction grammar and natural language understanding: Papers from the 2017 AAAI Spring Symposium. The AAAI Press.
Agnieszka Bedkowska-Kopczyk. 2016. Začutiti in občutiti: kognitivna analiza pomensko-skladenjskih lastnosti dveh predponskih tvorjenk iz glagola čutiti. In: E. Kržišnik and M. Hladnik, eds., Toporišičeva obdobja, pp. 41–48. Ljubljana: Znanstvena založba Filozofske fakultete.
Elizabeta Bernjak and Melanija Fabčič. 2018. Metonimija kot konceptualni in jezikovni pomen. Anali PAZU HD 4/1-2: 11–23. Združenje Pomurska akademsko znanstvena unija.
Ksenija Bogetić. 2019. Linguistic metaphor identification in Serbian. In: S. Nacey and T. Krennmayr, eds., MIPVU in Multiple Languages, pp. 203–226. Amsterdam: John Benjamins.
Janez Brank. 2019. Q-CAT Corpus Annotation Tool. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1262.
Silva Bratož. 2010. Metafore našega časa. Koper: Fakulteta za management.
Kristina Despot, Mirjana Tonković, Mario Brdar, Benedikt Perak, Ana Ostroški Anić, Bruno Nahod and Ivan Pandžić. 2019. MetaNet.HR: Croatian Metaphor Repository. In: Metaphor and Metonymy in the Digital Age: Theory and Methods for Building Repositories of Figurative Language, pp. 123–146. Amsterdam: John Benjamins.
Kurt Feyaerts. 2012. Refining the Inheritance Hypothesis: Interaction between metaphoric and metonymic hierarchies. In: A. Barcelona, ed., Metaphor and Metonymy at the Crossroads: A Cognitive Perspective, pp. 59–78. Berlin: De Gruyter Mouton.
Raymond W. Gibbs. 1999. Researching Metaphor. In: Researching and applying metaphor, pp. 29–47. Cambridge: Cambridge University Press.
Stefan Gries and Anatol Stefanowitsch. 2004. Extending collostructional analysis: A corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9/1: 97–129.
Andrew Goatly. 1997. The Language of Metaphors. London & New York: Routledge.
Patrick Hanks. 2004. The syntagmatics of metaphor and idiom. International Journal of Lexicography 17/3: 245–274.
Roman Jakobson. 1956. The Metaphoric and Metonymic Poles. In: Metaphor and Metonymy in Comparison and Contrast, pp. 41–47. Berlin/New York: Mouton de Gruyter.
Veronika Koller. 2006. Of critical importance: Using electronic text corpora to study metaphor in business media discourse. In: A. Stefanowitsch and S. Gries, eds., Corpus-Based Approaches to Metaphor and Metonymy, pp. 237–266. Berlin: De Gruyter Mouton.
Zoltan Kövecses. 2002. Metaphor: A Practical Introduction. Oxford/New York: Oxford University Press.
George Lakoff. 1993. The contemporary theory of metaphor. In: Andrew Ortony, ed., Metaphor and thought, pp. 202–251. Cambridge: Cambridge University Press.
George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press.
George Lakoff and Mark Turner. 1989. More than Cool Reason: A Field Guide to Poetic Metaphor. The University of Chicago Press.
Xiaofei Lu and Ben Pin-Yun Wang. 2017. Towards a metaphor-annotated corpus of Mandarin Chinese. Language Resources and Evaluation 51/3: 663–694.
Klaus-Uwe Panther and Günter Radden. 1999. The potentiality for actuality metonymy in English and Hungarian. In: K.-U. Panther and G. Radden, eds., Metonymy in Language and Thought, pp. 333–357. Amsterdam: John Benjamins.
Dalibor Pavlas, Ondřej Vrabeľ and Jiří Kozmér. 2018. Applying MIPVU Metaphor Identification Procedure on Czech. In: Proceedings of the Workshop on Annotation in Digital Humanities co-located with ESSLLI 2018, pp. 41–46. Sofia, Bulgaria.
Pragglejaz Group. 2007. MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol 22(1): 1–39.
Günter Radden and Zoltan Kövecses. 1999. Toward a theory of metonymy. In: K.-U. Panther and G. Radden, eds., Metonymy in language and thought, pp. 17–60. Amsterdam: John Benjamins.
Maciej Rosiński and Joanna Marhula. 2015. MIPVU in Polish: On Translating the Method. RaAM Seminar 2015.
Elena Semino and Michela Masci. 1996. Politics is football: metaphor in the discourse of Silvio Berlusconi in Italy. Discourse and Society 7/2: 243–269.
Elena Semino. 2017. Corpus linguistics and metaphor. In: The Cambridge Handbook of Cognitive Linguistics, pp. 463–476. Cambridge: Cambridge University Press.
Hanna Skorczynska and Kathleen Ahrens. 2015. A corpus-based study of metaphor signaling variation in three genres. Text & Talk: An Interdisciplinary Journal of Language, Discourse & Communication Studies 35(3): 359–381.
Slovar slovenskega knjižnega jezika, druga, dopolnjena in deloma prenovljena izdaja. www.fran.si.
David Stallard. 1993. Two Kinds of Metonymy. In: 31st Annual Meeting of the Association for Computational Linguistics, pp. 87–94. Columbus, Ohio: Association for Computational Linguistics.
Gerard J. Steen, Aletta G. Dorst, J. Berenike Herrmann, Anna A. Kaal, Tina Krennmayr and Trijntje Pasma. 2010. A Method for Linguistic Metaphor Identification: From MIP to MIPVU. Amsterdam: John Benjamins.
Anatol Stefanowitsch. 2006. Corpus-based approaches to metaphor and metonymy. In: A. Stefanowitsch and S. Th. Gries, eds., Corpus-Based Approaches to Metaphor and Metonymy, pp. 1–17. Berlin: De Gruyter Mouton.
Heli Tissari. 2003. LOVEscapes: Changes in Prototypical Senses and Cognitive Metaphors Since 1500. Helsinki: Société Néophilologique.
Jose Tummers, Kris Heylen and Dirk Geeraerts. 2005. Usage-based approaches in Cognitive Linguistics: A technical state of the art. Corpus Linguistics and Linguistic Theory 1(2): 225–261.
Justina Urbonaitė. 2016. Metaphor identification procedure MIPVU: an attempt to apply it to Lithuanian. Taikomoji kalbotyra [Applied Linguistics] 7: 1–25.
Thomas Wasow and Jennifer Arnold. 2005. Intuitions in linguistic argumentation. Lingua 115: 1481–1496.
Beatrice Warren. 2002. An alternative account of the interpretation of referential metonymy and metaphor. In: R. Dirven and R. Pörings, eds., Metaphor and Metonymy in Comparison and Contrast, pp. 113–133. Berlin: De Gruyter Mouton.
Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Marko Stabej and Simon Krek. 2009. Načela transkribiranja in označevanja posnetkov v referenčnem govornem korpusu slovenščine. In: M. Stabej, ed., Infrastruktura slovenščine in slovenistike, pp. 437–442. Ljubljana: Znanstvena založba Filozofske fakultete.
Neural Translation Model Specialized in Translating English TED Talks into Slovene
Eva Boneš∗, Teja Hadalin†, Meta Jazbinšek†, Sara Sever†, Erika Stanković∗
∗ Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, 1000 Ljubljana
{eb1690,es6317}@student.uni-lj.si
† Faculty of Arts
University of Ljubljana
Aškerčeva 2, 1000 Ljubljana
{th3112,mj6953,ss6483}@student.uni-lj.si
Abstract
In this paper, we present our work on a neural translation model specialized in translating English TED Talks into Slovene. The aim is to provide transcriptions of the speeches in Slovene to make them available to a wider audience, possibly with the option of automatic subtitling. First, we trained a transformer model on general data, a collection of corpora from the Opus site, and then fine-tuned it on a specific domain, a corpus of TED Talks. To assess the usefulness of the model, we carried out an evaluation of the pretrained, general, and domain versions of the model. We evaluated the translations with automatic metrics and with manual methods – the adequacy/fluency criterion and end-user feedback. The analysis of the results showed that our translation model did not produce the expected results and cannot be used to translate speeches in real-life settings. However, in the TED Talks addressing more everyday issues and using simple vocabulary, the translations successfully conveyed the main message of the speech. Any further research should consider improvements such as including more specialized data covering only one specific topic.
1. Introduction

In this paper, we trained a transformer model from scratch on a large general corpus, which we then fine-tuned on a corpus consisting of TED Talks in order to make a model specialized for the translation of transcribed speeches. We also found a pretrained model for the baseline to which we were able to compare our translation models. We then automatically and manually evaluated all three models on the validation datasets constructed from TED Talks. Finally, we evaluated the general translation model on the validation dataset constructed from the large general corpus.

In Section 3, we first describe the data we used. In the subsequent Section, we describe all the methods for both training and evaluating the models. Later on, in Sections 5 and 6, we present the results and discuss them.

1.1. Goal of the paper

The main goal of this project is to provide a useful and effective tool for translating and subtitling speeches from English to Slovene, and in this way granting access to a wide range of talks and other speeches to the Slovene-speaking audience. This paper focuses on translating TED Talks, a form of learning and entertainment that has gained popularity in recent years. Since TED Talks are currently subtitled by volunteer translators, enabling automatic subtitles would facilitate this process. Machine translation (MT) has been researched since the 1950s, but only recently, with the rise of deep learning, did it prove to be solvable, although the possibility of achieving fully automatic machine translations of high quality is still being questioned. This project was our attempt at machine translation of spoken language, which, if efficient, could also be used for automatic subtitling in general.

2. Related work

There are three main approaches to solving the MT problem, all with their own advantages and shortcomings. Rule-based machine translation (RBMT) is the oldest of the bunch and requires expert knowledge of both the source and the target language in order to develop syntactic, semantic, and morphological rules. Another approach, which gained popularity in the 1990s, uses statistical models based on the analysis of bilingual text corpora. The idea behind statistical machine translation (SMT) as proposed in (Brown et al., 1990) is, given a sentence in the target language, to seek the original sentence from which the translator produced it. Today, as with many computer science fields, the current state-of-the-art approaches for machine translation are based on neural networks. The biggest challenge when building a successful English to Slovene (or vice-versa) automatic translator is obtaining a sufficiently large bilingual corpus. Like all deep learning approaches, having a large and quality dataset is crucial for the success of the model. To deal with this exact problem, a lot of approaches to pre-training a network on monolingual data (which can be obtained easily) have been proposed.

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) uses two strategies to deal with the problem, namely masked language modeling (MLM) and next sentence prediction (NSP). By using these two strategies in our models, we generally achieve bigger datasets and a model with more context-awareness.
In 2020, mRASP (Lin et al., 2021) was introduced. Its authors built a pretrained NMT model that can be fine-tuned for any language pair. They used 197M sentence pairs, which is considerably more than we could obtain for only English-Slovene translations.

Although these methods have proven to be successful, one of the largest currently available databases of pretrained translation models was trained using just a standard transformer model and still achieved great results. The Tatoeba Translation Challenge (Tiedemann, 2020) aims to provide data and tools for creating state-of-the-art translation models. The focus is on low-resource languages, to push their coverage and translation quality. It currently includes data for 2,963 language pairs covering 555 languages. Along with the data, pretrained translation models for multiple languages were also released and are being regularly updated.

3. Dataset

3.1. General translation model

The datasets for the general translation model are the eight biggest corpora from the Opus site (https://opus.nlpl.eu (Tiedemann, 2012)) for the Slovene-English language pair. The corpora were chosen based on the quantity of the data, so that the general translation model would contain a large amount of diverse information. After a brief look at the contents of each one, we can see that some datasets are of higher quality and more reliable because of the source of the original texts and their translations. For example, the corpora from European institutions, such as Europarl, which is a parallel corpus extracted from the proceedings of the European Parliament from 1996–2011, and the DGT corpus, which is a collection of translation memories from the European Commission’s Directorate-General for Translation. The other corpora are collections of translations from different Internet sources, which makes them less reliable; however, they are still very valuable because they ensure a large quantity of data. These include the CCAligned corpus, consisting of parallel or comparable web-document pairs in 137 languages aligned with English, the MultiCCAligned v1 multi-parallel corpus, the OpenSubtitles corpus, compiled from an extensive database of movie and TV subtitles, the Tilde MODEL corpus, consisting of over 10M segments of multilingual open data for publication on the META-SHARE repository, the WikiMatrix v1, a parallel corpus from Wikimedia compiled by Facebook Research, the Wikimedia v20210402 corpus, and the XLEnt v1 corpus, created by mining CCAligned, CCMatrix, and WikiMatrix parallel sentences. The exact size of each one, complete with the number of tokens, links, sentence pairs, and words, is noted in Table 1.

3.2. Domain translation model

Our domain translation model is specialized in translating TED Talks. For the domain-specific machine training, we opted for the two TED Talk corpora accessible on the Opus website – the TED2013 and TED2020 corpus. The included texts are mainly transcripts of speeches on various topics and their Slovene translations. Both datasets add up to 1.8 million words (MOSES format) and 2.1 million tokens, which is enough to form a well-rounded base for machine learning. For more information about the domain-specific corpora see Table 2. We expanded the datasets by manually aligning 15 TED Talks from 2018 and 2019 that are available on the TED website (https://www.ted.com/talks).

4. Methods

4.1. Pretrained model

As a baseline for evaluating our models, we found an already trained model, available on HuggingFace (Tiedemann, 2020). It is a transformer-based multilingual model that includes all the South Slavic languages. The framework provides both the South-Slavic to English model and the English to South-Slavic model. On the Tatoeba test dataset for Slovene, the English to South-Slavic (en-zls) model has achieved an 18.0 BLEU score and a 0.350 chr-F score. The model in question was trained using MarianNMT (Junczys-Dowmunt et al., 2018). The authors applied a common setup with 6 self-attentive layers in both the encoder and the decoder network, using 8 attention heads in each layer. SentencePiece (Kudo and Richardson, 2018) was used for the segmentation into subword units. The translation model can be loaded through the transformers library in Python, and for translation into Slovene we must add the Slovene language label at the beginning of each sentence (>>slv<<).
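As a minimal sketch of loading such a baseline through the transformers library, the snippet below assumes the Helsinki-NLP/opus-mt-en-zls checkpoint is the published English to South-Slavic model; the >>slv<< label prepended to the input selects Slovene as the target language, as described above.

```python
# Load the multilingual English -> South-Slavic Marian model and
# translate one sentence into Slovene (selected via the >>slv<< label).
from transformers import MarianMTModel, MarianTokenizer

name = 'Helsinki-NLP/opus-mt-en-zls'  # assumed checkpoint name
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(['>>slv<< Ideas are worth spreading.'],
                  return_tensors='pt', padding=True)
translated = model.generate(**batch)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```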
4.2. Training from scratch

There exist several different frameworks for natural language processing tasks, each with their own advantages and shortcomings. One of them is fairseq (Ott et al., 2019) – a sequence modeling toolkit written in PyTorch for training models for translation, summarization, and other tasks. It provides different neural network architectures, namely convolutional neural networks (CNN), Long Short-Term Memory (LSTM) networks, and Transformer (self-attention) networks. The architectures can be configured to specific needs, and many implementations for different tasks have been proposed since fairseq’s introduction in 2019. In addition to different architectures, they also provide pretrained models and preprocessed test sets for different tasks, but sadly none of them is in Slovene. For training our model from scratch, we decided to use an extension of fairseq (stevezheng23, 2020) that has additional data augmentation methods. We trained our general model on the corpus described in Subsection 3.1.

4.2.1. Preprocessing

Before training the model, we had to preprocess the data. The datasets were already formatted as raw text with one sentence per line and with lines aligned between the English and Slovene datasets. We first normalized the punctuation, removed non-printing characters, and tokenized both corpora with the Moses tokenizer (Koehn et al., 2007). We removed all the sentences that were too short (2 tokens or less) or
too long (250 tokens or more), and the ones where the ratio of lengths was too big, because there is a good chance that these kinds of sentences are not translated properly. We then applied Byte pair encoding (BPE) (Sennrich et al., 2016) to the dataset. The algorithm learns the most frequent subwords to compress the data and thus induces some tokens that can help recognize less frequent and unknown words. With this preprocessed data, we then built the vocabularies that we used for training and binarized the training data. Cleaned and preprocessed training data has ≈ 16M sentences with ≈ 345M tokens in English and ≈ 341M in Slovene. Both of the vocabularies have around 45,000 types. In the end, we split the data into a training and validation set.

CORPUS                  Tokens      Links     Sentence pairs (MOSES format)    Words (MOSES format)
Europarl.en-sl          31.5 M      0.6 M     624,803                          27.56 M
CCAligned.en-sl         131.3 M     4.4 M     4,366,555                        110.08 M
DGT.en-sl               215.8 M     5.2 M     5,125,455                        162.58 M
MultiCCAligned.en-sl    5.6 G       4.4 M     4,366,542                        110.01 M
OpenSubtitles.en-sl     178.0 M     2.0 M     19,641,477                       213.00 M
TildeMODEL.en-sl        2305.4 M    21.1 M    2,048,216                        79.90 M
WikiMatrix.en-sl        1.1 G       0.9 M     318,028                          11.99 M
wikimedia.en-sl         350.6 M     31.8 K    31,756                           1.50 M
XLEnt.en-sl             200.7 M     0.9 M     861,509                          4.53 M

Table 1: Size of datasets for the general translation model.

CORPUS     Tokens    Links     Sentence pairs (MOSES format)    Words (MOSES format)
TED2013    0.5 M     15.2 k    14,960                           0.45 M
TED2020    1.6 M     43.9 k    44,340                           1.35 M
Extras     23,005    /         983                              /

Table 2: Size of datasets for the domain translation model.
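The normalization, tokenization and filtering steps of 4.2.1 can be sketched with the sacremoses package as below. The length thresholds follow the text; the ratio limit of 1.5 is an illustrative assumption, as the paper does not state the exact value used.

```python
# A minimal sketch of the preprocessing in 4.2.1: Moses punctuation
# normalization and tokenization, plus a length/ratio filter.
from sacremoses import MosesPunctNormalizer, MosesTokenizer

norm_en, norm_sl = MosesPunctNormalizer(lang='en'), MosesPunctNormalizer(lang='sl')
tok_en, tok_sl = MosesTokenizer(lang='en'), MosesTokenizer(lang='sl')

def keep(en, sl, min_len=3, max_len=249, max_ratio=1.5):
    """Drop pairs that are too short/long or whose token-length ratio
    suggests a misalignment (ratio limit is an assumption)."""
    if not (min_len <= len(en) <= max_len and min_len <= len(sl) <= max_len):
        return False
    return max(len(en), len(sl)) / min(len(en), len(sl)) <= max_ratio

pairs = [('A very short line.', 'Zelo kratka vrstica.')]
tokenized = [(tok_en.tokenize(norm_en.normalize(e)),
              tok_sl.tokenize(norm_sl.normalize(s))) for e, s in pairs]
cleaned = [(e, s) for e, s in tokenized if keep(e, s)]
print(cleaned)
```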
idation set). All three models were evaluated on a subset
of the domain data (hereinafter referred to as the domain
validation set). The manual evaluation was only performed
4.2.2.
Training
on a subset of the domain validation set, as described in
We trained a transformer (Vaswani et al., 2017) model
Subsection 4.4.2.
with 5 encoder and 5 decoder layers in the fairseq frame-
4.4.1.
Automatic evaluation
work. We used Adam optimizer, an inverse square root
Since the manual evaluation of the translations is very
learning rate scheduler with an initial learning rate of 7e−4
time-consuming, it is very difficult to evaluate a sufficient
and dropout. We also used the proposed augmentation with
amount of sentences this way. In cases like this, automatic
a cut-off augmentation schema that randomly masks words
evaluation metrics are often used. Natural language is quite
and this way produces more training data and a more robust
subjective. Hence, the perfect measure does not exist, but
translator.
by evaluating our results with different techniques, we were
We trained our model for 8 epochs with the mentioned
able to assess the performance of our translation model and
initial learning rate, after which the minimum loss scale
compare it with other models. We used automatic met-
(0.0001) was reached, meaning that our loss was proba-
rics most often used in NLP tasks – namely BLEU, chr-F,
bly exploding. We tried training one more epoch with a
GLEU, METEOR, NIST, and WER.
lower initial learning rate and obtained an even worse per-
formance with the minimum loss scale reached again. That
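The masking part of such an augmentation scheme can be sketched in a few lines; this is only a toy illustration of random word masking, with the masking probability and mask symbol chosen arbitrarily rather than taken from the paper's configurations:

    import random

    def mask_words(tokens, p=0.15, mask="<mask>"):
        """Return a noisier copy of a sentence with words randomly masked."""
        return [mask if random.random() < p else tok for tok in tokens]

    sentence = "so then what is our gut good for".split()
    print(" ".join(mask_words(sentence)))  # e.g. "so then what is <mask> gut good for"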
4.4. Evaluation
In order to test the performance of the pretrained and general translation models, as well as the fine-tuned translation model for TED Talks, we had to evaluate the translations. The automatic evaluation was carried out on two validation sets. First, the general translation model was evaluated on a subset of the general data, which was split off in the preprocessing step (hereinafter referred to as the general validation set). All three models were evaluated on a subset of the domain data (hereinafter referred to as the domain validation set). The manual evaluation was performed only on a subset of the domain validation set, as described in Subsection 4.4.2.

4.4.1. Automatic evaluation
Since manual evaluation of translations is very time-consuming, it is difficult to evaluate a sufficient number of sentences this way. In such cases, automatic evaluation metrics are often used. Natural language is quite subjective, so a perfect measure does not exist; but by evaluating our results with different techniques, we were able to assess the performance of our translation model and compare it with other models. We used the automatic metrics most often used in NLP tasks, namely BLEU, chr-F, GLEU, METEOR, NIST, and WER.
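Most of these metrics are available off the shelf; a minimal sketch using the sacrebleu package for BLEU and chr-F and jiwer for WER (GLEU, METEOR, and NIST have implementations in NLTK). The toy hypothesis and reference below are ours:

    import sacrebleu  # pip install sacrebleu
    import jiwer      # pip install jiwer

    hyps = ["pred petimi leti sem stal na odru TED in govoril o svojem delu"]
    refs = ["pred petimi leti sem stala na odru TED in govorila o svojem delu"]

    bleu = sacrebleu.corpus_bleu(hyps, [refs])  # corpus-level BLEU
    chrf = sacrebleu.corpus_chrf(hyps, [refs])  # character n-gram F-score
    wer = jiwer.wer(refs, hyps)                 # word error rate (lower is better)

    print(f"BLEU={bleu.score:.1f}  chr-F={chrf.score:.1f}  WER={wer:.3f}")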
4.4.2. Manual evaluation
The translations were also evaluated manually, namely by the fluency-adequacy criterion first described by Church
(Church, 1993). For this part of the evaluation, the Excel format was used. We extracted 6 paragraphs containing 10 consecutive segments from each speech to ensure that the context was clear. Three evaluators (the translators from our group) were assigned 20 segments each. To determine the adequacy of the translation, the evaluator marks how much of the meaning expressed in the source text is also expressed in the target translation. To determine the fluency of the translation, the evaluator marks whether the translation is grammatically well-formed, contains correct spelling, is intuitively acceptable, and can be sensibly interpreted by a native speaker. To test adequacy, the evaluator compares both the source text and the translation, whereas in the fluency evaluation the focus is on the translation alone. The evaluators provided scores on a scale from 1 to 4. We chose this evaluation technique because it clearly and simply summarizes and presents the quality of the translations. Since we evaluated three different translation models (pretrained, general, and domain), we had to evaluate the same segments of text three times. Evaluating one text multiple times by the same person is not recommended; therefore, the translations were exchanged between the three evaluators at the beginning of the evaluation of each translation model.
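Aggregating such judgments into the per-model averages reported in Table 3 below is then a matter of simple averaging; a toy sketch with invented scores (the study's real raw judgments are not reproduced here):

    from statistics import mean

    # judgments[model] = list of (fluency, adequacy) scores on the 1-4 scale;
    # the numbers below are invented for illustration only.
    judgments = {
        "pretrained": [(3, 3), (3, 4), (2, 3)],
        "general":    [(3, 3), (2, 3), (3, 2)],
    }

    for model, scores in judgments.items():
        fluency = mean(f for f, _ in scores)
        adequacy = mean(a for _, a in scores)
        print(f"{model}: fluency={fluency:.2f}, adequacy={adequacy:.2f}")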
4.4.3. End-user comprehensibility questionnaire
Finally, we evaluated the domain machine-translated texts from the end-user's point of view. Evaluators who were not familiar with the content of this project were given the translated texts from the domain model and a questionnaire formed by the translation team of this project. The objective of this questionnaire was to examine whether the end-users understand the information given in the translation, meaning that it tested the functionality of the text. The questionnaire was given to nine persons, each evaluating 20 segments from two different speeches; the segments were identical to the segments used in the manual evaluation. In the end, we obtained three evaluations for each text (6 speeches altogether). The questionnaire included the following questions:

1. How comprehensible is the text?
2. To what degree does the text seem like it was produced by a native speaker of Slovene?
3. How would you grade the text as a whole?
4. What is the main message of the text?
5. What do you consider as the most problematic part of the text?

For the first and second questions, the end-users answered on a scale from 1 to 4, with 1 meaning 'not at all' and 4 meaning 'very much'. The third answer had to be a score from 1 to 4. The fourth question had to be answered with one sentence, and for the fifth question they had to choose between the following answers: 'unknown words', 'too little context', 'wrong syntax', and 'other'. We chose this evaluation technique because it shows whether the translation is, in fact, functional and useful to the end-user.
5. Results
For the training of our models, we used the Slovenian national supercomputing network, which provides access to cluster-based computing capacities. We used the Arnes cluster, which is equipped with 48 NVIDIA Tesla V100S PCIe 32GB graphics cards. When training on two of them, one epoch took approximately 4 hours for the general translation model and one minute for fine-tuning on the TED data.

5.1. Automatic evaluation results
In Table 5, we present the quantitative results of the automatic evaluation for the pretrained, general, and domain models.

5.2. Manual evaluation results
Along with the automatic evaluation metrics, we also performed a manual evaluation, which provided valuable human insight into the final product and a better understanding of the typology of the mistakes that occurred in the translations. Each validation set was assessed by two evaluators at all three stages of the model development. The results presented in Table 3 represent the average values of the fluency and adequacy scores for the pretrained, general, and domain models, respectively.

MODEL        Fluency   Adequacy
Pretrained   2.99      3.09
General      2.83      2.9
Domain       2.71      2.9

Table 3: Manual evaluation results on the TED validation set.

5.3. End-user comprehensibility questionnaire results
We received feedback from the end-users based on the questionnaire for the texts from the domain translation model. The average score of the answers that could be interpreted numerically is presented in Table 4. According to the answers to the question 'What is the main message of the text?', the users for the most part understood the text to the degree where they could sufficiently summarize the content. The most frequent answer to the last question ('What do you consider as the most problematic part of the text?') was 'wrong syntax', followed by 'lack of context' and 'unknown words'. The participants also pointed out that the general structure of the text was rather confusing.

Text   Question 1   Question 2   Question 3
1      1.33         1            1
2      2            1.33         1.33
3      3            2            2.33
4      1.66         1            1.33
5      2            1.66         1.66
6      2.33         1.66         2
All    2.05         1.44         1.61

Table 4: End-user feedback results from the questionnaire, with average scores on a scale from 1 to 4.
Dataset   Metric   Pretrained                      General (epochs)                                    Domain
                                1       2       3       4       5       6       7       8       Configuration 1   Configuration 2   Configuration 3
General   BLEU     -            0.387   0.398   0.405   0.409   0.411   0.417   0.417   0.420   -                 -                 -
General   chr-F    -            0.606   0.616   0.619   0.624   0.625   0.629   0.629   0.629   -                 -                 -
General   GLEU     -            0.391   0.401   0.407   0.411   0.413   0.417   0.417   0.420   -                 -                 -
General   METEOR   -            0.545   0.556   0.560   0.565   0.566   0.569   0.569   0.571   -                 -                 -
General   NIST     -            8.752   8.922   8.987   9.063   9.096   9.144   9.114   9.177   -                 -                 -
General   WER      -            0.518   0.508   0.503   0.501   0.496   0.497   0.498   0.494   -                 -                 -
Domain    BLEU     0.192        0.155   0.167   0.168   0.171   0.175   0.175   0.168   0.179   0.182             0.173             0.114
Domain    chr-F    0.514        0.487   0.496   0.495   0.497   0.500   0.498   0.500   0.505   0.503             0.497             0.440
Domain    GLEU     0.230        0.201   0.211   0.212   0.214   0.217   0.218   0.213   0.222   0.224             0.216             0.167
Domain    METEOR   0.420        0.398   0.407   0.409   0.409   0.414   0.412   0.416   0.420   0.426             0.416             0.346
Domain    NIST     5.481        4.877   5.067   5.105   5.132   5.151   5.179   5.074   5.230   5.344             5.209             4.228
Domain    WER      0.659        0.711   0.696   0.694   0.690   0.689   0.689   0.698   0.685   0.667             0.680             0.756

Table 5: Evaluation scores for all models and all validation datasets. The best scores for each dataset and each metric are shown in bold. If the best score was achieved by the pretrained model, the second-best score is shown in bold italic to showcase our best score.
6. Discussion
Looking at the results in Table 5, we can first see that on the general validation set the final epoch of our general model performs best according to most metrics. This is expected, as the general validation set is comprised of texts from the corpora that we used for training, so our model may be overfitted on this dataset.
Connected to this, all of the results on the domain validation set are considerably worse than on the general dataset. We attribute this to the fact that the domain validation set is truly different from the main training data. As to why the pretrained model in most respects performs better than our fine-tuned model, we assume that our domain data is not specific enough. Therefore, we could not really fine-tune our model to any specific styles or words, nor were we able to do that in the validation set. The pretrained model performs better because it is trained on a larger dataset than our domain model is fine-tuned on – the TED corpus is relatively small, even though we included some additional texts.
Similarly, the results of the manual evaluation showed that the pretrained model produced the most fluent translations, with an average score of 2.99 out of 4. This model also achieved the highest score in the adequacy criterion. If we take a closer look at the results of the other two models, it can be seen that both faced similar difficulties in translating phrasal verbs, terminology, word order, and other lexical structures. The manual evaluation results are relatively low: the general and the domain model received an average of less than 3 points in both fluency and adequacy. The following examples show the discrepancies between the pretrained model and the other two models on the syntactic, semantic, and morphological levels:

Original: So then, what is our gut good for?
Pretrained: Torej, za kaj je naš občutek dober?
General: Torej, kaj je naš črevo dobro za?
Domain: Kaj je torej naš črevesje dobro?

Original: And I was not only heartbroken, but I was kind of embarrassed that I couldn't rebound from what other people seemed to recover from so regularly.
Pretrained: Ne samo, da me je zlomilo srce, ampak me je bilo sram, da se nisem mogel odvrniti od tega, kar so si drugi ljudje zdelo, da si je opomoglo tako redno.
General: In nisem bil samo zlom srca, ampak sem bil neprijetno, da se nisem mogel odvrniti od tega, kar se je zdelo, da se drugi ljudje tako redno opomorejo.
Domain: In nisem bil le srčni utrip, ampak sem bil neprijetno, da nisem mogel vrniti od tega, kar se je zdelo, da se drugi ljudje tako redno opomorejo.

However, a quick analysis of the evaluation rates showed that the lowest ratings for the domain model appeared in segments with specialized vocabulary, for example: "Ampak ko gre za res velike stvari, kot bo naša kariera ali kdo se bo poročil, zakaj bi morali domnevati, da so naše intuicije bolje kalibrirane za te kot počasne, pravilne analize?" vs. the original: "But when it comes to the really big stuff, like what's our career path going to be or who should we marry, why should we assume that our intuitions are better calibrated for these than slow, proper analysis?", and in segments with a higher register, for example the eloquent text on immigrants: "Ta vprašanja so protipriseljenska in nativistična v svojem jedru, zgrajena okoli neke vrste hierarhične delitve notranjih in zunanjih oseb, nas in njih, v katerih smo pomembni le in ne." vs. the original: "These questions are anti-immigrant and nativist at their core, built around a kind of hierarchical division of insiders and outsiders, us and them, in which only we matter, and they don't.". In both cases, the rating was never lower than 2.8. The highest-rated segments (with a score above 3) included short and simple sentences with everyday vocabulary, such as "In rekla mi je: Samo dihajte." or "Na srečo kriminalci podcenjujejo moč prstnih odtisov.". Based on the evaluation results, it appears that our domain model would be more valuable for translating general texts with a neutral style and vocabulary.
The group members that evaluated these segments had been participating in this project from the very beginning, so it was crucial to obtain a more objective assessment of our models. Looking at the results in Table 4, the feedback gathered from the questionnaire revealed that, overall, the end-users found the texts relatively comprehensible, but not at all like texts produced by a native speaker of Slovene. For the first two questions, for which the answers were chosen on a scale from 1–4 (1='not at all'/2='little'/3='good'/4='very much'), only two texts received a score lower than 2 in terms of comprehensibility. When grading the texts, the highest average score for a specific text was 2.33, while the lowest was 1. This variation occurs because not all of the chosen texts were equally complex. For the highest-graded text, we received similar responses to the question asking what the main message of the text was: Opisovanje prstnih odtisov. / Puščanje prstnih odtisov. / Prstni odtisi poleg vizualne sledi pustijo tudi sled na molekularnem nivoju. There were only two out of eighteen answers stating that the message was not clear and where the end-users could not summarize the main message, i.e. in texts 1 and 5. The fact that the end-users were in almost all cases able to summarize the main message in one sentence shows that comprehension of the text was still possible despite a large number of significant mistakes (wrong syntax, unknown words, lack of context, changing genders, etc.).
The following examples, segments from text 2, text 3, and text 6, which were also scored above average in the manual evaluation, support this claim:

Original: And you need something else as well: you have to be willing to let go, to accept that it's over.
Domain: Potrebujete tudi nekaj drugega: biti morate pripravljeni pustiti, da sprejmete, da je konec.

Original: I'm talking about an entire world of information hiding in a small, often invisible thing.
Domain: Govorim o celotnem svetu informacij, ki se skrivajo v majhni, pogosto nevidni stvari.

Original: Five years ago, I stood on the TED stage, and I spoke about my work.
Domain: Pred petimi leti sem stal na odru TED in govoril o svojem delu.
Unfortunately, the final version of the machine translator did not meet our expectations regarding the quality of the translations. Some of the major flaws that appeared in the translations were wrong syntax, untranslated words, incomprehensible grammatical structures, wrong use of terminology, and wrong translations of polysemes. While we expected the machine translator to be inappropriate for translating complex sentences, we were surprised that it did not perform well even when translating basic grammatical structures. Here are a few examples:

Original: So then, what is our gut good for?
Domain: Kaj je torej naš črevesje dobro?

Original: I later found out that when the gate was opened on a garden, a wild stag stampeded along the path and ran straight into me.
Domain: Kasneje sem ugotovil, da ko so vrata odprta na vrtu, je divji stag žigosanih po poti in tekel naravnost v mene.

Original: And for two years, we tried to sort ourselves out, and then for five and on and off for 10.
Domain: Dve leti smo se poskušali razvrstiti, nato pa pet let in več.

The reasons for the poor functioning of the machine translations could be numerous. It is possible that we did not collect enough data or that the chosen data was not the most suitable for this project. We estimate that the factor that impacted the final results the most is the wide range of different topics covered in TED Talks. This means that our domain translation model did not focus on just one domain and, essentially, there was not enough specific data from which it could train. What is more, the initial data consisted of transcriptions of English spoken discourse and their Slovene translations in the form of subtitles. It is important to keep in mind that neither spoken discourse nor subtitles have the characteristics typical of standard text types. Finally, not all of the chosen texts were equally complex, and they had different syntactic, morphological, and lexical features. Therefore, some of the texts in the data were essentially too difficult to translate.

7. Conclusion
The main purpose of this project was to develop a tool that would automatically provide Slovene transcriptions or subtitles for English TED Talks. Our domain translation model provides translations that convey the main message of the texts, is based on appropriate methodology, and was built with all the necessary tools. Moreover, the results of the automatic metrics showed that it is comparable to other neural machine translation models. On the other hand, the lack of a uniform training dataset resulted in poor and incomprehensible translations. However, we believe that acknowledging all of the discussed shortcomings in future research could significantly improve the development of speech-to-text and translation technologies for Slovene language users. Neural machine translation is still relatively new and will develop further in the following years, because it is useful for translators and the general public. Our project contributed to the advancement of the field and could provide valuable information for similar work in the future.

Acknowledgments
We would like to thank our mentors, Slavko Žitnik, Špela Vintar, and Mojca Brglez, for helping us with the project. We would also like to thank the nine evaluators who provided end-user feedback by filling out our questionnaire.
We would also like to thank SLING for giving us access to powerful graphics cards to successfully finish our training, as we would still be training our general model without them. Special thanks to Barbara Krašovec from Arnes support, who helped us with our numerous problems when trying to connect to their cluster.

8. References
Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.
Kenneth Church. 1993. Good applications for crummy machine translation. Machine Translation, 8:239–258.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In: Proceedings of ACL 2018, System Demonstrations, Melbourne, Australia.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR, abs/1808.06226.
Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2021. Pre-training multilingual neural machine translation by leveraging alignment information.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL-HLT 2019: Demonstrations.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August. Association for Computational Linguistics.
stevezheng23. 2020. fairseq extension.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).
Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic data sets for low resource and multilingual MT. In: Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online, November. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.
Govoriš nevronsko?
Kako ljudje razumemo jezik sodobnih strojnih prevajalnikov
David Bordon
Oddelek za prevajalstvo, Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
david.bordon@ff.uni-lj.si
Povzetek
Namen prispevka je predstaviti raziskavo preverjanja razumljivosti nerevidiranih strojno prevedenih spletnih besedil. Primarni udeleženci v raziskavi so bili splošni bralci in ne izurjeni prevajalci ali popravljalci strojnih prevodov. Gre za prvo tovrstno raziskavo, ki je bila izvedena za slovenski jezik. Cilj raziskave je bil preveriti, v kolikšni meri so nerevidirani strojni prevodi razumljivi splošnemu bralstvu, pri čemer sem se posvetil tudi vplivu besedilnega in slikovnega konteksta. Preverjal sem prevode prevajalnikov Google Translate in eTranslation. Raziskava je bila izvedena z anketo, v kateri so udeleženci odgovarjali na vprašanja, ki so preverjala razumevanje spremljajočega besedilnega segmenta, v katerem je bila napaka. Rezultati nudijo vpogled v trenutno stopnjo razvoja strojnih prevajalnikov, ne z vidika storilnosti pri njihovem popravljanju, ampak z vidika, koliko jih razume ciljno bralstvo.
Do you Speak Neuralese?
The aim of this paper is to present a study on the comprehensibility of unedited machine-translated web texts. The primary participants in the study were general readers, not trained translators or post-editors, and it is the first study of its kind to be conducted for the Slovene language. The aim of the study was to examine the extent to which unedited machine translations are comprehensible to general readers, while giving focus to the influence of textual and pictorial context. The translations were obtained from Google Translate and eTranslation. The survey was conducted by means of a questionnaire, in which participants answered questions that tested their understanding of a text segment that included an error. The results provide an insight into the current state of development of machine translation engines, not from the point of view of PEMT, but from the point of view of how well machine translations are understood by the target readership.
1. Uvod
Članek obravnava raziskavo razumljivosti strojno prevedenih spletnih besedil pri bralcih, ki ne vedo, da prebirajo strojne prevode. Uporabil sem naključno izbrana angleška spletna besedila, slovenske prevode pa sem pridobil z nevronskima strojnima prevajalnikoma Google Translate in eTranslation. Prevodi niso bili revidirani, saj sem želel replicirati okoliščine, v katerih bi jih dejansko lahko našli – na spletu, kjer so zaradi (za nekatere) dovolj visoke kakovosti in cenovne nepremagljivosti (so namreč brezplačni) vedno bolj pogosti, kar velja tudi za prevajalske vtičnike, ki so vgrajeni v sodobne brskalnike in aplikacije.
Vprašanje razumljivosti v taki obliki je postalo aktualno šele v zadnjem času: starejši, statistični modeli prevajalnikov so slovnično nekonsistentni in jezikovno okorni, sodobni nevronski prevajalniki pa proizvajajo tekoča besedila, ki so težje ločljiva od človeških, hkrati pa je že profesionalnim pregledovalcem prevodov težje ugotoviti, kje so storili napako (Donaj in Sepesy Maučec, 2018).
Te napake nastanejo predvsem zaradi težav pri razdvoumljanju večpomenskih besed in pri prevajanju besed, ki jih ni v podatkovni zbirki, s katero smo prevajalnik urili (Thi-Vinh et al. 2019, 207; Koehn in Knowles 2017, 28, 31–33; Sennrich et al. 2016, 3). Kljub morebitnim posamičnim napačno prevedenim besedam pa lahko ljudje pomen razberemo iz sobesedila. Pri preverjanju razumljivosti sem zato v vseh primerih vključil še kontekst, saj se v stvarnosti bralci nikoli ne srečujejo z izoliranimi besedami, ampak z zaključenimi besedili; ker pa se osredotočam na spletno okolje, sem besedilnemu kontekstu dodal še slikovnega, ki je inherentna lastnost sodobnega spleta.

2. Namen članka
Namen članka je predstaviti grobo oceno razumljivosti prevodov NMT-sistemov (ang. neural machine translation) v času, ko so taka besedila na spletu vedno bolj pogosta, pri čemer me zanima predvsem, kako slikovno gradivo v besedilnem kontekstu vpliva na rezultate. Tovrstna raziskava za slovenščino še ni bila izvedena.

2.1. Sorodne raziskave
Raziskav na področju razumevanja nerevidiranih strojnih prevodov pri naključnem splošnem bralstvu je razmeroma malo; ker so za stroko in gospodarstvo zanimivejše analize storilnosti pri popravljanju strojnih prevodov, je veliko več raziskav osredotočenih zgolj na prevajalsko populacijo.
Razširjenost prakse popravljanja strojnih prevodov lahko opazimo že v zapisih o najboljših praksah pri popravljanju prevodov, objavljenih v blogih večjih ponudnikov jezikovnih rešitev, kot so denimo MemoQ (Lelner, 2022), Crowdin (Voroniak, 2022) in Memsource (Zdarek, 2020).
Na Univerzi v Gentu je bila v sklopu projekta ArisToCAT izvedena raziskava o razumevanju izmišljenih besed in samostalniških besednih zvez (Macken et al., 2019). Primeri, ki so bili iz angleščine v nizozemščino prevedeni s strojnima prevajalnikoma Google Translate in DeepL, so bili predstavljeni samostojno ali v kontekstu povedi, pri tem pa udeleženci niso imeli dostopa do izvirnega besedila. V povprečju je bilo 60 % odgovorov napačnih; rezultati so bili boljši, če je bil primer predstavljen v kontekstu povedi.
V sklopu istega projekta je bila izvedena še analiza bralnega razumevanja človeškega prevoda na eni in nepopravljenega strojnega prevoda na drugi strani.
Človeški prevodi so bili ocenjeni bolje z vidika jasnosti podajanja informacij, z vidika končnega razumevanja pa je bila razlika manjša (Macken in Ghyselen, 2018).
Castilho in Guerberof Arenas (2018) sta izvedli primerjalno analizo bralnega razumevanja za statistični in nevronski model strojnega prevajalnika v primerjavi s človeškim izvirnikom. Glede na omejen vzorec (6 udeležencev) in nedoslednost rezultatov je končna ugotovitev, da NMT-sistemi izkazujejo najboljše rezultate, občasno še boljše kot angleški izvirnik, nedokončna.
Martindale in Carpuat (2018) sta v raziskavi obravnavali odziv bralcev na tekočnost in natančnost nevronskih strojnih prevodov, ob tem pa sta preverjali stopnjo zaupanja informacijam v besedilu. Ugotovili sta, da bralce zelo zmotijo prevodi, ki niso tekoči, medtem ko se ob samo natančnost informacij obregne veliko manjši delež bralstva.
Izsledke potrjuje tudi Popović (2020). V njenem eksperimentu so bralci v 30 % primerov zaradi zavajajoče tekočnosti sprejeli popolnoma napačno informacijo, še 25 % dodatnih primerov pa je bilo skoraj popolnoma (narobe) razumljivih.
Na tem mestu velja omeniti, da so se nedavno začele pojavljati bolj eksperimentalne metode prevajanja, katerih značilnost je upoštevanje multimedijskega konteksta, denimo zvočnega ali slikovnega. Lala in Specia (2018) sta razvila model multimedijskega leksikalnega prevajanja, katerega namen je prevajanje dvoumnih večpomenskih besed s pomočjo slikovnega konteksta. Sulubacak et al. (2020) so predstavili sorodne raziskave, uporabne podatkovne zbirke in metode raziskovanja na področju multimedijskega strojnega prevajanja, ki so vezane na prevajanje z zvokom, sliko in videom. Med novejšimi raziskavami Liu (2021) ponuja nevronski model vizualno-nerazporejenega tekstovnega enkodiranja in dekodiranja. Pričakujemo lahko, da se bo to področje v bodoče še hitreje razvijalo, predvsem zaradi tehnološkega napredka v drugih panogah (prepoznavanje slik, sinteza govora, avtomatsko podnaslavljanje ipd.).

3. Metoda
Raziskava je bila zasnovana okrog vprašalnika, ki je vseboval primere štirih vrst napak v slovenskih strojnih prevodih splošnih angleških spletnih besedil. Preverjal sem prevajalnika Google Translate in eTranslation, pri čemer je bil vsak zastopan z 12 vprašanji. Poseben pomen sem posvetil slikovnemu gradivu v sobesedilu.

3.1. Izbor besedil
Besedila sem zbiral glede na verjetnost, da bi se bralci z njimi lahko dejansko srečali na spletu. Analiza prevajalskega trga je pokazala, da večje prevajalske agencije popolnoma obvladujejo sektorje, ki nudijo največ dobička in hkrati zahtevajo človeško revizijo (tehnika, zdravstvo, pravo, finance ipd.) (Evropska komisija, 2020). V manj dobičkonosnih sektorjih, kjer človeška revizija ni tako nujna, obstaja večja verjetnost objave nerevidiranih strojnih prevodov.
Pregled tržnega deleža spletnih iskalnikov, ki jih uporabljamo v Sloveniji, je pokazal, da 96 % vseh uporabnikov spleta uporablja iskalnik Google.1 Na osnovi najbolj iskanih pojmov v brskalniku2 sem izločil spletišča, ki nimajo prevodnega potenciala (družbena omrežja, spletni portali v slovenščini, slovenski mediji). S tem sem prišel do končnega izbora besedilnih področij: spletno nakupovanje, turizem, elektronika, multimedija in videoigre, luksuzne storitve, moda, osebno zdravje (telesna vadba in prehrana).

3.2. Prevodi besedil
Pri preizkušanju strojnih prevajalnikov se je izkazalo, da Googlov prevajalnik nudi drugačne prevodne rešitve glede na to, kako besedilo naložimo v obdelavo. Če besedilo prevajamo v pogovornem oknu vmesnika ali v brskalniku prevedemo spletno stran kot celoto, so rezultati boljši kot tisti, ki jih dobimo s funkcijo prevajanja dokumenta. Od štirih različnih specializiranih domen, ki jih nudi eTranslation, je najboljše rezultate nudil prevajalnik za splošna besedila (General Text). Uporabil sem najboljše možne prevode – omenjeno domeno v eTranslation, v Googlu pa sem prevajal v pogovornem oknu.

Izvirnik: "Keep Warm Feature Maintains Food Temperature Keeps foods like vegetables, soups, hors d'oeuvres, gravies, sauces and desserts warm and delicious in the oven until they're ready to serve."
Prevod iz vnosnega polja oz. samodejni prevod strani: "Naj bo toplo funkcijo - Mikrovalovna ohranja živila, kot so zelenjava, juhe, jedi, graviža, omake in sladice, topla in okusna v pečici, dokler niso pripravljene za postrežbo."
Prevod, pridobljen s funkcijo »prevedi dokument«: "Naj bo topla - mikrovalovna pečica ohranja hrano, kot so zelenjava, juhe, d'oeuvres, gravies, omake in sladice toplo in okusno v pečice, dokler oni propravljeni, da služijo."

Tabela 1: Razlike v prevodih glede na način obdelave; Google Translate.

Prevod modela »General Text« prevajalnika eTranslation: "Ohraniti toplo funkcijo - Microwave ohranja hrano, kot so zelenjava, juhe, predjed d'oeuvres, omake, omake in sladice tople in okusne v pečici, dokler niso pripravljeni za postrežbo."

Tabela 2: Prevod enakega segmenta; eTranslation.

1 https://gs.statcounter.com/search-engine-market-share/all/slovenia
2 https://ahrefs.com/keyword-generator
3.3. Kategorizacija napak
Prevode sem analiziral in določil štiri kategorije najpogostejših napak, ki niso vezane na jezikovni sistem oz. predpis.
▪ Neprevedena beseda; v prevodu se pojavlja beseda v enaki obliki kot v izvirniku. Dopustil sem možnost spremembe začetnih ali končnih morfemov, če je prevajalnik besedo samo preoblikoval.3
▪ Napaka pri razdvoumljanju večpomenske besede; denotativni pomen večpomenske besede ali besedne zveze ne ustreza pomenu v izvirniku.
▪ Hujša pomenska napaka; napaka, ki otežuje razumevanje celotnega besedila.
▪ Izmišljena beseda; prevajalnik si izmisli novo besedo, ki je na prvi pogled videti slovenska, a ne spada v slovensko besedišče – t. i. »nevronščina«.

3.4. Kontekst
Izbranim besedilom sem glede na inherentne lastnosti spletne pojavitve dodal kontekst. Kontekst je bil lahko več vrst:
▪ izključno besedilni,
▪ besedilni in slikovni; slika ne vpliva na razumevanje,
▪ besedilni in slikovni; slika vpliva na razumevanje,
▪ izbor ene izmed več predlaganih slik glede na to, kaj piše v besedilu.
Slikovni kontekst sem vključil pri besedilih, ob katerih so se na spletu pojavljale fotografije, ki so bile pri nekaterih primerih zgolj vizualni dodatek, pri drugih pa je bilo pravilno razumevanje besedila vezano na prepoznavanje pravilnega vizualnega elementa.
V svoji raziskavi besed nisem nikoli predstavil v izolaciji, kot so to denimo storili v raziskavi Macken in drugi (2019), saj to niso realne okoliščine – napake v objavljenih strojnih prevodih bodo vedno del nekega besedila. Besedil nisem popravljal; anketirancem so bila predstavljena z vsemi slovničnimi in pomenskimi napakami, takšna, kot bi jih našli v divjini.

3.5. Oblikovanje vprašalnika, format odgovorov na vprašanja in udeleženci
Anketo sem ustvaril na platformi Google Forms, ki nudi podporo za prikaz slik in dober vmesnik za pregled in izvoz rezultatov. Pomembno je poudariti, da anketirancem nisem razkril, da bodo brali strojno prevedena besedila. Omenil sem, da bodo »prebrali več kratkih besedil, ki so napisana v nekoliko okorni slovenščini«.
Vrste odgovorov so bile omejene s funkcionalnostjo platforme Google Forms in niso sledile nobeni logični metodi; določil sem jih subjektivno glede na vsebino primera in vrsto napake. Gre za najbolj nezanesljivo spremenljivko v metodi, saj bi s formulacijo vprašanja lahko sugeriral pravilen odgovor, zanimalo pa me je predvsem to, ali prihaja do večjega odstopanja glede na tip odgovora – denimo, ali so odgovori odprtega tipa, kjer anketiranci vnesejo svoj odgovor v prazno vnosno polje, bistveno slabši kot tisti, kjer izbirajo med štirimi predlaganimi odgovori. S tem bi lahko preveril konsistenco pravilnosti oz. odstopanja glede na vrsto odgovora.
Vprašalnik sem delil na družbenih omrežjih Facebook in Instagram in znance pozval, naj ga posredujejo naprej svojcem in svojim znancem, če je le mogoče starejšim. Demografskih podatkov nisem zbiral, kar je mogoče ena izmed pomanjkljivosti raziskave. Glede na razmeroma majhen vzorec sodelujočih in morebiten efekt odmevne komore bi bilo raziskavo vsekakor treba nadgraditi in ponoviti na bolj naključnem in predvsem večjem vzorcu, toda glede na čas zbiranja odzivov, ki je sovpadal s prvo omejitvijo gibanja, vezano na epidemijo Covid-19, nisem imel druge izbire.
Na vprašalnik sem prejel 120 odgovorov.

Slika 1: Primer vprašanja. Izbor z razlago.

4. Rezultati
Rezultate predstavljam po naslednjih parametrih:
▪ splošno razumevanje,
▪ razumevanje glede na prevajalnik,
▪ razumevanje glede na tip napake,
▪ razumevanje glede na tip konteksta,
▪ razumevanje glede na tip odgovora.

4.1. Splošno razumevanje
Vprašalnik je obsegal 24 vprašanj; s 120 odzivi je bilo vseh možnih odgovorov 2880. Vseh pravilnih odgovorov je bilo 1697 oz. 58,96 %. Daljša razčlemba je na voljo v celotni raziskavi (Bordon, 2021).

4.2. Razumevanje glede na prevajalnik
Odgovori na vprašanja, vezana na prevajalnik Google Translate, so bili pravilni v 51,3 % primerov oz. 739 od 1440 odgovorov. Prevajalnik eTranslation je pokazal boljše rezultate; delež pravilnih odgovorov je znašal 66,6 %.

3 Denimo, prevod za rob zaslona (ang. bezel) je prevajalnik prevedel kot »bezela«.
4.3. Razumevanje glede na tip napake
V vprašalniku so bili vključeni štirje tipi različnih napak. V alinejah nizam tip napake in odstotek pravilnih odgovorov:
▪ izmišljena beseda: 48,5 %,
▪ neprevedena beseda: 64,8 %,
▪ napačno razdvoumljene večpomenske besede: 65,9 %,
▪ hujša pomenska napaka: 56,3 %.

Slika 2: Diagrami 1–4. Rezultati glede na tip napake v % (deleži pravilnih in napačnih odgovorov za kategorije izmišljena beseda, neprevedena beseda, razdvoumljanje večpomenskih besed in pomenska napaka).

4.4. Razumevanje glede na kontekst
V naslednjem segmentu predstavljam delež pravilnih odgovorov glede na kontekst:
▪ izključno besedilni: 60,4 %,
▪ besedilni in slikovni; slika ne vpliva na razumevanje: 44 %,
▪ besedilni in slikovni; slika vpliva na razumevanje: 69,8 %,
▪ izbor ene izmed več predlaganih slik glede na to, kaj piše v besedilu: 64,2 %.

4.5. Razumevanje glede na tip odgovora
V tem segmentu predstavljam rezultate glede na način izbora odgovora. Primarna funkcija te analize je preveriti konsistenco oz. morebitna odstopanja – npr. ali so odgovori odprtega tipa, kjer anketiranci v prazno vnosno polje vnesejo poljuben odgovor, bistveno slabši kot tisti, kjer imajo na voljo denimo štiri predlagane odgovore, izberejo pa enega.
▪ Odgovor odprtega tipa (vnosno polje): 36,3 %,
▪ odgovor zaprtega tipa (A, B, C ali D): 60,8 %,
▪ izbor z razlago (A ali B, zakaj?): 68,3 %.
Slabši rezultat pri odgovorih odprtega tipa je treba jemati z rezervo, saj so bili primeri s tako vrsto odgovora zgolj štirje. Samo določanje pravilnosti odgovora je pri takih primerih težje, osebno pa sem bil strog ocenjevalec, saj sem vse odgovore, ki niso bili popolnoma pravilni, označil za napačne.

4.6. Skupina prevajalcev
Edini demografski podatek, ki sem ga zbiral, je, ali se oseba, ki odgovarja na vprašalnik, ukvarja s prevajanjem. Pritrdilno je odgovorilo 24 udeležencev od 120. Pri teh osebah sem analiziral odgovore glede na vrsto napake in jih primerjal z neprevajalci. Nasploh so bili njihovi rezultati za 6 % boljši (63,7 %), po kategorijah pa:
▪ izmišljena beseda: 53,5 % (+ 6,3 %),
▪ neprevedena beseda: 65,6 % (+ 1 %),
▪ razdvoumljanje večpomenske besede: 70,8 % (+ 6,7 %),
▪ pomenska napaka: 63,9 % (+ 9,6 %).
Ostalih demografskih podatkov nisem zbiral, kar je ena od slabosti raziskave. V primeru, da bi podatki sovpadali z mojo predpostavko, da niso relevantni, jih ne bi vključil; sedaj pa preprosto nimam podatkov, na katerih bi lahko utemeljil svojo odločitev.

Graf 1: Rezultati skupine prevajalcev proti ostalim (primerjava deleža pravilnih odgovorov po tipih napak za prevajalce in neprevajalce).
5. Razprava
Pri pregledu rezultatov sem ugotovil, da povprečna stopnja razumevanja znaša 59 %. Od vseh 2880 odgovorov je bilo 1697 pravilnih.
Na tej točki je treba izpostaviti primer št. 6, ki je bil nasploh najslabše razumljen in je znižal povprečje rezultatov v vseh kategorijah, v katerih se je nahajal. Daljša razlaga z razčlembo je na voljo v celotni raziskavi (Bordon, 2021).

Izvirnik: "One winner will receive the GeForce RTX 2080 Ti Cyberpunk 2077 Edition graphics card. Entering the giveaway is easy: Sign in to the forums or create a forum account. Comment on this thread (WITHOUT QUOTING THIS POST) and tell us what you want to do most in Cyberpunk 2077. Sign your username in our giveaway widget to confirm your entry. HOW TO ENTER: To enter, submit your entry during the Sweepstakes Period and follow the directions to enter the Sweepstakes."
Prevod: "En zmagovalec bo prejel grafično kartico GeForce RTX 2080 Ti Cyberpunk 2077 Edition. Vstop v predavanje je enostaven: 1. Prijavite se na forume ali ustvarite forumski račun. 2. Komentirajte to temo (BREZ CITIRANJA TE POSTAJE) in nam povejte, kaj želite narediti najbolj v Cyberpunku 2077. 3. Za potrditev vpisa vpišite svoje uporabniško ime v naš pripomoček za oddajo. KAKO VSTOPITI: Če želite vstopiti, vnesite mednopni vložek in sledite navodilom za vstop v nagradne igrače."

Tabela 3: Primer št. 6; »Mednopni vložek.«

eTranslation je bil v povprečju za 15 % boljši od prevajalnika Google Translate, v katerem je bil omenjeni primer. Nasploh je eTranslation kazal boljše rezultate.
Najboljši rezultati glede na tip napake so bili vezani na razdvoumljanje besednega pomena (65,9 %), kar kaže, da znamo ljudje nasploh dobro razbrati pomen iz sobesedila; na drugem mestu so bile neprevedene besede (64,8 %), kar lahko pripišemo dobremu znanju angleščine med udeleženci v anketi.
Rezultati so bili slabši, ko je prevajalnik napravil hujšo pomensko napako, ki je oteževala razumevanje celotnega segmenta (56,3 %), daleč najslabše rezultate pa je bilo moč opaziti v kategoriji izmišljena beseda (48,5 %), v kateri je sicer bil prej omenjeni primer št. 6.
Glede na tip konteksta so bili najboljši rezultati pri primerih, kjer je slika vplivala na razumevanje (69,8 %), in kjer so udeleženci morali izbrati sliko, na katero se je nanašalo besedilo (64,2 %). Rezultati so bili nekoliko slabši v izključno tekstovnem kontekstu (60,4 %), najslabši pa v kategoriji, kjer je bila besedilu priložena slika, ki ne vpliva na razumevanje oz. potencialno zmede udeleženca (44 %) – v tej kategoriji je bil tudi primer št. 6. Izkazalo se je, da slikovni kontekst, ki lahko potencialno vpliva na razumevanje besedilnega segmenta, pri strojnih prevodih v realnih okoliščinah, torej na spletu, z vsem pomožnim gradivom, igra pomembno vlogo.
Udeleženci, ki se sicer ukvarjajo s prevajanjem, so na splošno odgovarjali boljše od povprečja. Njihov delež uspešnosti je bil največji v kategoriji hujša pomenska napaka (+ 9,6 %), kar kaže na to, da zaradi »poklicne deformacije« učinkoviteje razumejo kontekst.

6. Zaključek
V članku sem predstavil raziskavo o razumljivosti nerevidiranih strojno prevedenih spletnih besedil pri končnih uporabnikih, ki niso bili posebej obveščeni, da prebirajo strojne prevode. Razumevanje besedilnih segmentov, ki so vključevali štiri različne tipe napak, ki nastanejo pri strojnem prevajanju z NMT-sistemi, sem preverjal z anketo. Ta je vsebovala strojne prevode splošnih besedil, ki sem jih prevedel s prevajalnikoma Google Translate in eTranslation. Besedila so bila nerevidirana; vsebovala so napake, ki so bile predstavljene v več različnih kontekstih, bodisi s slikovnim gradivom bodisi brez njega.
Rezultati so pokazali, da je splošna stopnja razumevanja 59 %, pri čemer se je izkazalo, da so prevodi eTranslationa nasploh razumljivejši od prevodov Googlovega prevajalnika. Število pravilnih odgovorov je bilo najvišje v kategoriji razdvoumljanja večpomenskih besed, kar nakazuje, da ljudje lažje razumemo pomen strojnih prevodov, če nam je dan kontekst. Pri tem je bilo najbolj učinkovito slikovno gradivo, s katerim so si lahko udeleženci v raziskavi pomagali razjasniti pomen določenega besedilnega segmenta. Druga najuspešnejša kategorija je bila razumevanje neprevedenih besed, kar pomeni, da je bilo znanje angleškega jezika med udeleženci na visoki ravni.
Po analizi se je izkazalo, da je bil nekoliko problematičen način izbire odgovorov, saj sem anketirancem naključno vnaprej določil, na kakšen način bodo odgovarjali. Odgovori odprtega tipa so kazali slabše rezultate kot izbirni odgovori in odgovori zaprtega tipa, toda zaradi majhnega števila vprašanj je težko izpeljati kakšen razumen zaključek. Podobno velja za samo metodo odgovarjanja na anketo, ki je bila pogojena s pandemičnim časom. Za bolj relevantne rezultate bi bilo treba izvajati test razumljivosti v živo, na razpravljalen način. Enako velja tudi za vzorec sodelujočih – večji in bolj raznolik vzorec bi dal jasnejše rezultate.
V bodoče bi bilo zanimivo raziskati, ali se razumevanje nerevidiranih strojno prevedenih besedil izboljšuje skupaj z nadgradnjami strojnih prevajalnikov, hkrati pa bi se lahko osredotočil še na avtomatsko generirana besedila in jezik spletnih robotov.
Menim, da bo v prihodnje nekoliko manj raziskav storilnosti pri popravljanju strojnih prevodov in veliko več raziskav, vezanih na razumljivost strojno prevedenih ali avtomatsko generiranih besedil v praktičnih situacijah. Končni bralec se s takimi besedili srečuje vedno bolj pogosto, zaradi dodatnih izboljšav strojnih prevajalnikov, novih metod in razširjenosti prakse pa lahko pričakujemo, da bo tovrstnih potencialnih stikov med stroji in bralci brez vmesnega posega človeškega popravljalca vedno več.
7. Literatura
David Bordon. 2021. Razumevati nevronščino: Kako si ljudje razlagamo jezik strojnih prevajalnikov. Magistrsko delo. Univerza v Ljubljani. Dostop 30. 5. 2022. https://repozitorij.uni-lj.si/IzpisGradiva.php?id=125328.
Sheila Castilho in Ana Guerberof Arenas. 2018. Reading Comprehension of Machine Translation Output: What Makes for a Better Read? V: Juan Antonio Perez-Ortiz, Felipe Sanchez-Martinez, Miquel Espla-Gomis, Maja Popovič, Celia Rico, Andre Martins, Joachim Van den Bogaert in Mikel L. Forcada, ur., Proceedings of the 21st Annual Conference of the European Association for Machine Translation, str. 79–88, Alacant, Španija. Dostop 30. 5. 2022. http://doras.dcu.ie/23071/.
Gregor Donaj in Mirjam Sepesy Maučec. 2018. Prehod iz statističnega strojnega prevajanja na prevajanje z nevronskimi omrežji za jezikovni par slovenščina-angleščina. V: Zbornik konference Jezikovne tehnologije in digitalna humanistika 2018, str. 62–68, Ljubljana. Dostop 30. 5. 2022. http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Donaj-et-al_Prehod-iz-statisticnega-strojnega-prevajanja-na-prevajanje-z-nevronskimi-omrezji-za-jezikovni-par-slovenscina-anglescina.pdf.
Evropska komisija. 2020. European Language Industry Survey 2020: Before & After Covid-19. Dostop 30. 5. 2022. https://ec.europa.eu/info/sites/default/files/2019_language_industry_survey_report.pdf.
Philipp Koehn in Rebecca Knowles. 2017. Six challenges for neural machine translation. V: Proceedings of the First Workshop on Neural Machine Translation, str. 28–39. Association for Computational Linguistics, Vancouver, Kanada. Dostop 30. 5. 2022. https://arxiv.org/pdf/1706.03872.pdf.
Chiraag Lala in Lucia Specia. 2018. Multimodal Lexical Translation. V: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), str. 3810–3817. Miyazaki, Japonska: European Language Resources Association (ELRA). Dostop 30. 5. 2022. https://www.aclweb.org/anthology/L18-1602/.
Zsófia Lelner. 2022. Machine Translation vs. Machine Translation Post-editing: Which One to Use and When? Dostop 30. 5. 2022. https://blog.memoq.com/machine-translation-vs.-machine-translation-post-editing-which-one-to-use-and-when.
Jiatong Liu. 2021. Multimodal Machine Translation. Dostop 30. 5. 2022. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9547270.
Lieve Macken in Iris Ghyselen. 2018. Measuring Comprehension and User Perception of Neural Machine Translated Texts: A Pilot Study. V: Translating and the Computer 40 (TC40): Proceedings, str. 120–126. Geneva: Editions Tradulex. Dostop 30. 5. 2022. https://biblio.ugent.be/publication/8580951.
Lieve Macken, Laura Van Brussel in Joke Daems. 2019. NMT's wonderland where people turn into rabbits. A study on the comprehensibility of newly invented words in NMT output. V: Computational Linguistics in the Netherlands Journal, 9 (2019), str. 67–80. Dostop 30. 5. 2022. https://www.clinjournal.org/clinj/article/view/93.
Marianna J. Martindale in Marine Carpuat. 2018. Fluency Over Adequacy: A Pilot Study in Measuring User Trust in Imperfect MT. Dostop 30. 5. 2022. https://arxiv.org/abs/1802.06041.
Maja Popović. 2020. Relations between comprehensibility and adequacy errors in machine translation output. V: Raquel Fernández in Tal Linzen, ur., Proceedings of the 24th Conference on Computational Natural Language Learning (CoNLL 2020), str. 256–264. Dostop 30. 5. 2022. https://aclanthology.org/2020.conll-1.19.pdf.
Rico Sennrich, Barry Haddow in Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. Dostop 30. 5. 2022. https://arxiv.org/abs/1508.07909.
Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia in Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Dostop 30. 5. 2022. https://arxiv.org/abs/1911.12798.
Ngo Thi-Vinh, Thanh-Le Ha, Phuong-Thai Nguyen in Le-Minh Nguyen. 2019. Overcoming the Rare Word Problem for Low-Resource Language Pairs in Neural Machine Translation. V: Proceedings of the 6th Workshop on Asian Translation, str. 207–214. Association for Computational Linguistics, Hong Kong, Kitajska. Dostop 30. 5. 2022. https://arxiv.org/abs/1910.03467.
Diana Voroniak. 2022. Post-Editing of Machine Translation: Best Practices. Dostop 30. 5. 2022. https://blog.crowdin.com/2022/03/30/mt-post-editing/.
Dan Zdarek. 2020. Machine Translation Post-editing Best Practices. Dostop 30. 5. 2022. https://www.memsource.com/blog/post-editing-machine-translation-best-practices/.
Data Collection and Definition Annotation for Semantic Relation Extraction
Jasna Cindrič, Lara Kuhelj, Sara Sever, Živa Simonišek, Miha Šemen
Department of Translation, Faculty of Arts, University of Ljubljana
Aškerčeva cesta 2, SI-1000 Ljubljana
jasna.cindric@gmail.com
larakuhelj@gmail.com
seversara@gmail.com
ziva.sim@gmail.com
miha.semen@gmail.com
Abstract
This paper presents the process of data collection, definition extraction and annotation for the purpose of semantic relation extraction based on English and Slovene texts related to geology, glaciology, and geomorphology. Automatic semantic relation extraction is an important task in NLP; its potential applications include information retrieval, information extraction, text summarization, machine translation, and question answering. This approach was based on the TermFrame project. The texts for the corpora were collected manually, while definitions were identified through targeted queries in SketchEngine and then semantically annotated using the WebAnno tool. Our research showed some significant differences between languages resulting in some difficulties during the annotation process.
1. Introduction
This paper describes the process of definition extraction, annotation and curation based on corpora created for a research project carried out by Master's students as part of the module Corpora and Localisation at the Department of Translation Studies, Faculty of Arts (University of Ljubljana). Translation students collaborated with their peers from the Faculty of Computer and Information Science (University of Ljubljana) on a project focusing on the automatic extraction of semantic relations, which required the creation of an English and a Slovene corpus and the provision of an additional data set annotated for semantic relations. We describe the process of corpus building and the identification and extraction of definitions, followed by the annotation and curation using the WebAnno annotation tool. Finally, the paper illustrates the results and obstacles, and discusses possible further work and research.
Corpus-based automatic semantic relation extraction has become one of the main topics in corpus linguistics. Domain-specific annotated corpora are the basis for the design of many NLP systems for relation extraction (Thanopoulos et al., 2000) and are considered knowledge sources on natural language use. It is imperative to obtain corpora large enough to provide a sufficient number of instances of relation pairs for extraction (Huang et al., 2015). This is especially true for Slovene, a language with complex morphology and free word order, which currently lacks readily available large domain-specific corpora (Pollak et al., 2012).
The layout of the project relied heavily on a similar dataset, TermFrame1 – a trilingual knowledge base that contains Karst terminology in English, Slovene and Croatian. The knowledge base was developed on the basis of the frame-based approach in terminology (Pollak et al., 2019; Vintar et al., 2021; Vintar and Stepišnik, 2020; Vintar et al., 2019; Vrtovec et al., 2019), a cognitive approach to terminology that considers context, language and culture and focuses on specialised texts (Faber and Medina-Rull, 2017). Frame-based terminology is mainly used for the creation of multimodal specialised knowledge bases, where "frames" are used as a "representation that integrates various ways of combining semantic generalisations about one category or a group of categories" (Faber, 2015). Additionally, "templates" are used as a representation of parts of one category and cover the cultural component (Faber, 2015).
Following the process of the TermFrame project, the team began by compiling an English and a Slovene domain-specific corpus, then extracting definitions and annotating them using the WebAnno tool (Castilho et al., 2016). This paper describes these steps in detail, followed by an analysis of the annotated definitions. It also highlights the obstacles the team faced during the conversion of texts and the annotation process.
The main goal of the project was to create an English and a Slovene corpus covering the fields of geomorphology, glaciology and geology, which would serve as a basis for definition extraction, annotation and curation.

2. Building the corpora

2.1. Text collection
For the purposes of our research, the linguist team compiled two corpora, one Slovene and one English. The entire project lasted for approximately one month.
The first step was to search for texts in both languages covering predefined topics, namely geology, glaciology, and geomorphology. These areas were chosen because they were semantically related to the domain of karstology but had not yet been used in the TermFrame database. More specifically, the texts from domains neighbouring karstology were assumed to contain the same semantic relations, so that our to-be-created data set would be fully compatible with the existing ones.
The linguist team was particularly interested in collecting scientific texts (scientific papers, articles, books, doctoral and master's theses).

1 https://termframe.ff.uni-lj.si/.
doctoral and master’s theses). Many of these texts can be
seemingly undemanding step required additional time and
found through the Digital Library of Slovenia2 or through
attention.
the Co-operative Online Bibliographic System & Services
– COBISS3, and through ResearchGate, a social
3. Definition extraction
In order to obtain the sentences containing definienda, definitors and genera, we had one week to extract the definitions from the corpora using targeted queries in Sketch Engine. Searching for typical definition-like sentences can be done by searching for specific words or phrases and by CQL queries.

To some extent, the structure of definitions can be predicted. Typical definition structures in Slovene include “X je Y”, “Y imenujemo (tudi) X”, “izraz X pomeni Y”, “izraz X označuje Y”, “med Y štejemo (tudi) X” etc., while typical definition structures in English include “X means …”, “X is a Y”, “X is a kind of …”, “The term X is …” or “X is defined as …”. In this context, X is typically a hyponym and Y is a hypernym. Sketch Engine allows searching for such definitions in multiple ways. One method is to use a simple Sketch Engine query and search for words or phrases that are often included in definitions, such as “imenujemo” or “izraz” in Slovene and “is a” or “is a term used to describe” in English. We were able to identify multiple definitions using this method, for example “Tip kraškega površja, kjer je prevladujoča oblika vrtače, imenujemo vrtačasti kras.”
Another method is to use a CQL query in Sketch Engine and check for definitions with advanced filtering commands such as [tag="S.*"][word="je"][tag="S.*"] in Slovene or [tag="NN"][word="is"][word="a"]?[tag="N.*"] in English. The Slovene command, for example, combines a search for a specific part of speech (S.* – noun) with a specific word (je). An example of a definition identified by using the CQL query in Slovene is “Uvala je večja kraška globel skledaste oblike z neravnim dnom in sklenjenim višjim obodom.” Another example in English is “A coral reef is a ridge or mound built of the skeletal remains of generations of coral animals, upon which grow living coral polyps.”
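The same noun–je–noun pattern can be approximated outside Sketch Engine over any tokenised and morphosyntactically tagged text. The following minimal sketch is our own illustration, not Sketch Engine's implementation; the input format and the tags in the example are assumptions:

def match_noun_je_noun(sentence):
    """Return (definiendum, genus) candidates for noun + 'je' + noun.

    `sentence` is assumed to be a list of (word, tag) pairs, with noun
    tags starting with "S", as in the Slovene tagset used by the CQL
    query above.
    """
    hits = []
    for (w1, t1), (w2, _), (w3, t3) in zip(sentence, sentence[1:], sentence[2:]):
        if t1.startswith("S") and w2.lower() == "je" and t3.startswith("S"):
            hits.append((w1, w3))
    return hits

# Illustrative tags only:
tagged = [("Uvala", "Sozei"), ("je", "Gp-ste-n"), ("globel", "Sozei")]
print(match_noun_je_noun(tagged))  # [('Uvala', 'globel')]

Like the CQL query itself, such a matcher overgenerates, so the retrieved sentences still need manual filtering.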
Since not all definitions fit these typical structures, we used another strategy. We checked the keywords suggested by Sketch Engine and searched for them with a simple query. In this way, we were able to identify various definitions which could not be found otherwise. An example of such a definition is “Slovenska kraška terminologija navaja, da je vrtača: depresijska oblika okroglaste oblike, navadno globoka več metrov in je bolj široka kot globoka.”

In addition to these strategies, the English team also utilised a glossary from the English corpus and extracted some of the definitions from there.

By combining all of these strategies, we were able to identify definition candidates suitable for annotation. The selected definitions were then verified by a terminology specialist. Some of the definitions were judged to be unsuitable, either due to their wording or for semantic reasons. After discarding the inadequate definitions, we retained 100 definitions from the Slovene corpus and 104 definitions from the English corpus. All of them were then uploaded to WebAnno6 to be manually annotated.

2 https://www.dlib.si.
3 https://www.cobiss.si.
4 https://www.researchgate.net/.
5 https://www.sketchengine.eu/.
4. Definition annotation
The definitions were annotated using WebAnno – a web-based annotation tool, which allowed for a faster collaborative annotation process as well as a comparative evaluation of the annotations (de Castilho et al., 2014). The annotation process took approximately ten days.

Altogether, the team annotated 100 Slovene and 104 English definitions, whereby four layers of information were considered. The layers were introduced to the linguist team by the course instructor and were, in turn, selected because they had already been used in the TermFrame project (Vintar and Stepišnik, 2020). We believed that relying on the same categories that had already been adapted to karstology – a domain closely related to the ones chosen for this research – would ensure a straightforward annotation process with little to no ambiguities. Furthermore, the resulting data set would be fully compatible with the existing one in the TermFrame project.

The layers of information include:
1. Semantic category: This layer covers the main semantic categories A. Landform (A.1 Surface Landform, A.2 Underground Landform, A.3 Hydrological Landform or A.4 Other), B. Process (B.1 Movement, B.2 Loss, B.3 Addition or B.4 Transformation), C. Geome, D. Element/Entity/Property (D.1 Abiotic, D.2 Biotic, D.3 Property and D.3.1 Geolocation) and E. Instrument/Method (E.1 Instrument or E.2 Method). The semantic category was defined primarily for the definiendum and genus. Semantic categories are presented in Figure 1.
2. Definition element: Here, the term in question was marked as DEFINIENDUM, its hypernym or superordinate term as GENUS, the defining phrase (the phrase between the DEFINIENDUM and the GENUS, e.g. the phrase is a) as DEFINITOR, and any of its hyponyms or subordinate terms as SPECIES.
3. Semantic relation: A set of 15 relations was used for annotating different features of the defined term: AFFECTS, HAS_ATTRIBUTE, HAS_CAUSE, CONTAINS, COMPOSITION_MEDIUM, DEFINED_AS, HAS_FORM, HAS_FUNCTION, HAS_LOCATION, MEASURES, HAS_POSITION, HAS_RESULT, HAS_SIZE, STUDIES and OCCURS_IN_TIME.
4. Relation definitor: This layer is associated with semantic relations and marks words or phrases that precede particular semantic relations (e.g. in the ocean).

WebAnno also offers an additional layer for the canonical form, which is used to ensure the full form of a term when it appears in an elliptic construction. The canonical form layer was mostly used when annotating definitions in the Slovene corpus. One of the reasons for this is that ellipses are more common in Slovene. Another reason is that the predicate and the pronoun “se” are often separated by other words.

Figure 1: Semantic categories (Vintar and Stepišnik, 2021).

6 https://www.clarin.si/webanno/login.html.
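For downstream use of the annotations outside WebAnno, the four layers can be captured in a simple data structure. The sketch below is ours: only the label inventories come from the project, while the field names and types are assumptions:

from dataclasses import dataclass, field
from typing import Optional

SEMANTIC_RELATIONS = {
    "AFFECTS", "HAS_ATTRIBUTE", "HAS_CAUSE", "CONTAINS",
    "COMPOSITION_MEDIUM", "DEFINED_AS", "HAS_FORM", "HAS_FUNCTION",
    "HAS_LOCATION", "MEASURES", "HAS_POSITION", "HAS_RESULT",
    "HAS_SIZE", "STUDIES", "OCCURS_IN_TIME",
}

@dataclass
class AnnotatedDefinition:
    sentence: str
    definiendum: str
    genus: Optional[str] = None      # not every Slovene definition had one
    definitor: Optional[str] = None  # e.g. "je", "imenujemo", "is a"
    category: Optional[str] = None   # e.g. "A.1 Surface Landform"
    # (relation label, annotated span) pairs, labels from SEMANTIC_RELATIONS:
    relations: list = field(default_factory=list)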
Figure 2: Use of the term canonical layer for pairing the words “uporablja” and “se” to show they form a single unit.

As seen from Figure 2, which shows an example of the use of the term canonical layer in the Slovene corpus, the predicate “se uporablja” consists of two words that act as a definitor. Hence, the team used the term canonical layer to pair the two words together.

For the purpose of this project, three students annotated the English definitions, while two students annotated the Slovene ones. Afterward, in the process of curation, both teams jointly annotated the definitions with the course instructor's assistance. We observed that the annotation of definition elements (definiendum, genus and definitor) was the most straightforward, although the annotators' solutions still varied in some cases (see Figure 3). On the other hand, annotation of semantic categories, semantic relations and relation definitors proved to be more dubious, since the annotations often differed from one another. When variations occurred, the team managed to resolve such dilemmas through discussion.

Figure 3: Curation process in WebAnno.

As Figure 3 shows, all three students who annotated the English definitions chose “tephra” as the definiendum. Two students annotated the phrase “is a term covering” as the definitor and one student annotated only “is a term”. The word “material” was determined to be a genus by two students, whereas one student extended the genus and annotated “pyroclastic material” – “pyroclastic” was later defined as COMPOSITION_MEDIUM.

5. Analysis
After annotating all of the extracted definitions, the linguist team wanted to take a closer look at the results. Each English definition had one definiendum, giving a total of 104 definienda, while the Slovene definitions had one or more definienda, 113 in total.

The most common definitor in English was “is a”, followed by “are”, and in Slovene “imenujemo” and “je”. One or more genera were found in all English definitions, 112 in total, while not all Slovene definitions had a genus.

Figures 4 and 5 show the distribution of semantic categories for the annotated terms in Slovene and English. In total, 183 English and 334 Slovene terms were assigned categories. The most frequent category in English was D.1 Abiotic, followed by A.1 Surface landform. Similarly, A.1 Surface landform was the most frequent category in Slovene, followed by D.1 Abiotic.
Figure 4: Semantic categories in the Slovene corpus.

Figure 5: Semantic categories in the English corpus.

Figures 6 and 7 show the distribution of semantic relations for Slovene and English. A total of 186 relations were marked in English and 156 in Slovene. The most common relations in English were HAS_CAUSE (morphogenesis) and HAS_LOCATION (spatial distribution). On the other hand, the two most common relations in Slovene were HAS_FORM (morphography) and HAS_LOCATION (spatial distribution).

Figure 6: Number of semantic relations in the Slovene corpus.

Figure 7: Number of semantic relations in the English corpus.
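The distributions reported above reduce to simple tallies over the annotated definitions. A minimal sketch, assuming each definition has been exported as a dictionary with a category and a list of relation labels:

from collections import Counter

def distributions(definitions):
    """Tally semantic categories and relations across definitions."""
    categories = Counter(d["category"] for d in definitions if d.get("category"))
    relations = Counter(r for d in definitions for r in d.get("relations", []))
    return categories, relations

demo = [{"category": "D.1 Abiotic", "relations": ["HAS_CAUSE", "HAS_LOCATION"]}]
print(distributions(demo))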
5.1. Annotation difficulties
During the annotation and curation process, the team encountered some complex cases, in particular when reviewing Slovene definitions, which required further discussion and careful attention. While annotating the definition elements proved fairly straightforward, semantic relations posed some challenges.

The analysis showed ambiguities in 37 out of 65 sentences in the Slovene corpus. We have divided the ambiguities into the following categories.

5.1.1. Phrases that could be placed in multiple categories
The most recurring ambiguity concerned phrases that could be classified into a number of categories, while others were difficult to associate with any of the possible labels. In many cases, the team had to determine how the annotators would deal with these ambiguous words and establish agreement on a consistent annotation strategy.

For example, the phrase “kraški izviri” in Figure 8 could semantically be understood as a hydrological form, a surface form, an underground form or an abiotic element.

As in the previous example, the word “obala” in Figure 9 can be understood as a hydrological form, a surface form, an abiotic element or a geome.

Although the word “kras” is most likely understood as a geome, depending on the context it can also be understood as karstology, the study of karst. In line with the decision to annotate “geomorphology” as a method, “kras” could therefore be annotated as a method as well, as shown in Figure 10.
Figure 8: Example of an ambiguous annotation.

Figure 9: Example of an ambiguous annotation.

Figure 10: Example of an ambiguous annotation.

Another example was “gravitacija” (see Figure 11). It was extremely difficult to annotate a word denoting such a complex concept. In discussions with the course instructor, the team decided to annotate it as a method, as the names of the studies had to be annotated in the same way. However, it should be noted that the word could also be annotated according to other criteria.

Figure 11: Example of an ambiguous annotation.

5.1.2. HAS_FORM
In a handful of cases when annotating the Slovene definitions, it became clear that the semantic relation HAS_FORM manifests itself in different ways, as shown in Figures 12, 13 and 14. Since HAS_FORM relations are more abstract and harder to grasp, annotation proved to be more difficult and required double-checking.

5.1.3. Annotation of genus
Sentences in the English corpus also posed some challenges; however, there were significantly fewer of them than among their Slovene counterparts.

Before the annotation process, it was decided not to choose long phrases for the genus, but preferably just one word; e.g. “unloading of mountains” could be considered for the genus as a whole, but the team annotated only the word “unloading” as the genus. It was expected that genus and definiendum would share the same semantic category, since the genus is a hypernym or superordinate term, but this was not the case for all definitions. For example, the definiendum “aquifer” was annotated as A.3 Hydrological form, but the genus “body of rock” was annotated as D.1 Abiotic in the same definition. This is because “body of rock” is not necessarily a hydrological form and can also be found on the surface. Another example is the definiendum “weathering”, which was annotated as B.4 Transformation, while the genus “process” was annotated as B. Process. The reason for this is that “process” is a hypernym of “transformation”.
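A check of this kind is easy to mechanise once the annotations are exported. The sketch below, under our own assumed input format, flags definitions whose genus and definiendum carry different semantic categories, such as the “aquifer” and “weathering” examples above:

def category_mismatches(definitions):
    """Flag definitions where genus and definiendum categories differ.

    Each definition is assumed to be a dict with "definiendum_cat" and
    "genus_cat" keys holding semantic category labels.
    """
    return [d for d in definitions
            if d.get("definiendum_cat") and d.get("genus_cat")
            and d["definiendum_cat"] != d["genus_cat"]]

demo = [{"definiendum": "aquifer", "definiendum_cat": "A.3 Hydrological Landform",
         "genus": "body of rock", "genus_cat": "D.1 Abiotic"}]
print(category_mismatches(demo))  # flags the aquifer definition

As the two examples show, such mismatches are not necessarily errors, so a check like this can only queue cases for review, not correct them.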
Figure 12: HAS_FORM introduced by a preposition.

Figure 13: HAS_FORM expressed with an adjective.

Figure 14: HAS_FORM expressed with an adjective (1) and introduced by a preposition (2).

6. Conclusion
This article describes the process of corpus creation and definition annotation for semantic relation extraction. When building the corpora, the linguists had to pay close attention to both the format and the nature of the texts. The conversion of the Slovene data proved to be quite challenging and required a great deal of attention to detail. It might be useful to develop a conversion tool designed specifically for language-specific characters, such as diacritics, to facilitate the study of data originating from such languages, Slovene among them.

Definition extraction, on the other hand, did not pose any significant challenge. In contrast, definition annotation followed by curation entailed a great deal of debate and additional research. Since the team consisted only of linguists/translation students lacking domain-specific terminological knowledge, it was sometimes difficult to comment on the nature of the extracted terms. For any similar research endeavours, it could be useful to seek an expert's input so as to facilitate the annotation process and prompt better results. Overall, definition elements were easier to identify and annotate than relation definitors and semantic categories and relations. The result of this work is a dataset with multi-layer semantic annotations in English and Slovene which can be used for future relation extraction experiments. It complements the TermFrame dataset and will be added to the CLARIN.SI repository.

The paper also draws attention to the differences between the two languages. English seems to favour shorter and more concise definitions, introduced by structures such as “is a” or “are”, while Slovene tends to introduce longer structures, namely “imenujemo” and “se uporablja”, and sometimes shorter ones, such as “je”.

This research provides insight into the various language-specific barriers that arise when studying smaller languages that do not enjoy the same exposure and presence as widespread world languages such as English. Further research could examine how definitions in both languages manifest themselves in different contexts and domains.

Large data collections serve as a basis for the development of tools for automatic semantic relation extraction. Semantic relation extraction can be used to create different computer applications that can make domain-specific knowledge more accessible, not only to experts but to the general public as well. The corpora that were built during this project can be used for the future creation of specialised knowledge bases on geology, geomorphology and glaciology.

7. References
Richard Eckart de Castilho, Chris Biemann, Iryna Gurevych, and Seid Muhie Yimam. 2014. WebAnno: a flexible, web-based annotation tool for CLARIN. In: Proceedings of the CLARIN Annual Conference (CAC) 2014, pages 4505–4512, Soesterberg, Netherlands.
Pamela Faber. 2015. Frames as a framework for Terminology. In: H. Kockaert and F. Steurs (eds.) Handbook of Terminology, Vol. 1, pages 14–33. John Benjamins, Amsterdam/Philadelphia.
Pamela Faber and Laura Medina-Rull. 2017. Written in the Wind: Cultural Variation in Terminology. In: M. Gryviel (ed.) Cognitive Approaches to Specialist Languages, pages 419–442. Cambridge Scholars, Newcastle upon Tyne.
Chu-Ren Huang, Jia-Fei Hong, Wei-Yun Ma, and Petr Šimon. 2015. From Corpus to Grammar: Automatic Extraction of Grammatical Relations from Annotated Corpus. In: T'sou & Kwong (eds.) Journal of Chinese Linguistics Monograph Series, Vol. 25, pages 192–221. Chinese University of Hong Kong Press, Hong Kong.
Senja Pollak, Andraž Repar, Matej Martinc, and Vid Podpečan. 2019. Karst exploration: extracting terms and definitions from karst domain corpus. In: Proceedings of eLex 2019, pages 934–956. Lexical Computing CZ, s.r.o., Brno.
Senja Pollak, Anže Vavpetič, Janez Kranjc, Nada Lavrač, and Špela Vintar. 2012. NLP workflow for on-line definition extraction from English and Slovene text corpora. In: J. Jancsary (ed.) Proceedings of KONVENS 2012 (Main track: oral presentations), Vol. 5, pages 53–60. ÖGAI, Vienna.
Aristomenis Thanopoulos, Nikos Fakotakis, and Georg Kokkinakis. 2000. Automatic Extraction of Semantic Relations from Specialized Corpora. In: Coling 2000, 18th International Conference on Computational Linguistics, Vol. 1, pages 836–842. Universität des Saarlandes, Saarbrücken.
Špela Vintar, Vid Podpečan, and Vid Ribič. 2021. Frame-based terminography: a multi-modal knowledge base for karstology. In: Proceedings of eLex 2021, pages 164–176. Lexical Computing CZ, s.r.o., Brno.
Špela Vintar, Amanda Saksida, Uroš Stepišnik, and Katarina Vrtovec. 2019. Modelling specialised knowledge with conceptual frames: the TermFrame approach to a structured visual domain representation. In: Proceedings of eLex 2019, pages 305–318. Lexical Computing CZ, s.r.o., Brno.
Špela Vintar and Uroš Stepišnik. 2020. TermFrame: A Systematic Approach to Karst Terminology. In: Dela, Vol. 54, pages 149–167. Znanstvena založba Filozofske fakultete Univerze v Ljubljani, Ljubljana. https://doi.org/10.4312/dela.54.149-167.
Katarina Vrtovec, Špela Vintar, Amanda Saksida, and Uroš Stepišnik. 2019. TermFrame: Knowledge frames in Karstology. In: Proceedings of ToTh 2019, pages 109–126. Presses Universitaires Savoie Mont Blanc, Chambéry.
Serbo-Croatian Wikipedia Between Serbian and Croatian Wikipedia
Ružica Farmakovski,* Natalija Tomić**
*Faculty of Philology, University of Belgrade
Studentski trg 3, 11 000 Belgrade
ruzicamarinkovic12@gmail.com
**Faculty of Philology, University of Belgrade
Studentski trg 3, 11 000 Belgrade
ntomic801@gmail.com

Abstract
In this paper, we try to establish the linguistic identity of the corpus of texts CLASSLAWIKI-sh (Serbo-Croatian Wikipedia) by comparing it with the corpus of texts CLASSLAWIKI-sr (Serbian Wikipedia) and the corpus of texts CLASSLAWIKI-hr (Croatian Wikipedia), all of which are available at CLARIN.SI, the Slovene national consortium of the European research infrastructure CLARIN, i. e. we are trying to determine whether it is closer to the Serbian or the Croatian language standard. For this comparison, we used as variables the distinguishing features between Serbian and Croatian described in grammars and manuals of the Serbo-Croatian, Serbian and Croatian languages. We came to the conclusion that according to the basic characteristics (orthographic, most phonetic, and derivational morphology features) CLASSLAWIKI-sh is closer to CLASSLAWIKI-hr, while according to morphosyntactic, lexical, and semantic features it is closer to CLASSLAWIKI-sr.
1. Introduction
Wikipedia is a free online encyclopedia launched in 2001 by a community of volunteers. It is available in 326 languages and it has more than 302,906 active editors and more than 101,868,334 registered users.1 Its specificity is its editing system: it is open to its audience for writing and contributing different content. One of the languages with considerable content is Serbo-Croatian, a language that has not officially existed since the split of the former Yugoslavia.

In recent decades, linguistic research has increasingly been conducted on materials and data from the Internet. They are available to everyone, free and easy to use, and there are plenty of them. This makes them suitable for linguistic research as well.

Wikipedia, along with Twitter and other similar sources, offers plenty of materials and data, but to use them at all, we need to know their true identity. That is how the phenomenon of linguistic identification (and automatic linguistic identification) is becoming increasingly important.

In this sense, discriminating between related languages, considered “a sub-task in automatic language identification” (Tiedemann and Ljubešić, 2012: 2620), is also gaining more and more attention from researchers.

But this is not an easy task, especially when it comes to related languages. Since they have a common origin, they share many grammatical features and lexemes, so it is often very difficult to distinguish between them. Therefore, for many researchers, this task is a special challenge, i. e. “both necessity and a challenge” (Ljubešić and Klubička, 2014: 32).

We hope that our research, which is more linguistically oriented, will provide some useful linguistic data for automatic text recognition research. Also, we hope to show how important it is to choose the right and reliable features as variables for this type of (corpus-based) research. For example, we had to drop one of the most important and stable features, a feature that is cited everywhere in the literature (ko:tko), because it poses a problem for corpus lemmatization (Section 5.2).

Our paper consists of 7 sections. In Section 2, we describe the goal and present the initial hypothesis. In Section 3, we present the genetic and historical relationship between the Serbian and Croatian standards. In Section 4, we describe the two types of related works that we used: on the one hand, works related to linguistic identification or the discrimination between related languages, and on the other hand, works dealing with the differences between Serbian and Croatian. Section 5 deals with the methodology, where we list and describe the variables we used, and in Section 6, we present the data we have obtained from the corpora and their analysis. In Section 7, we present the conclusion and some suggestions for further research. Finally, in Section 8, we list the literature that we used and cited in the paper.

2. Goal of the paper
In this paper, our goal is to determine the linguistic identity of the corpus of texts CLASSLAWIKI-sh (Serbo-Croatian Wikipedia, hereinafter: SCW), which is available at CLARIN.SI, the Slovene national consortium of the European research infrastructure CLARIN.2 The CLASSLAWIKI-sr (Serbian Wikipedia, hereinafter: SW) and CLASSLAWIKI-hr (Croatian Wikipedia, hereinafter: CW) corpora can also be found there. By comparing the linguistic characteristics of our target corpus with the other two corpora, we hope to determine its linguistic identity, i. e. whether SCW is closer to SW or CW or whether it is somewhere in the middle. In Figure 1, we show our hypothesis schematically. Our initial hypothesis is that SCW is somewhere in the middle between SW and CW, perhaps with a tendency towards SW, due to the larger number of its users, less resistance to the use of Serbo-Croatian resources, etc.

1 https://www.wikipedia.org/
2 https://www.clarin.si/kontext/corpora/corplist
Figure 1: Is SCW closer to SW or CW, or is it somewhere in the middle?

We also hope to get answers to some other related questions: Does SCW represent a language that existed in the former Yugoslavia under the name of the Serbo-Croatian language? Is SCW a mixture of characteristics of the Serbian and Croatian varieties? Or is SCW a mixture of Serbian and Croatian texts?

3. Serbo-Croatian vs. Serbian and Croatian
Without the desire (and possibility) to determine precisely whether Serbian and Croatian are two languages, one language with two names, two dialects, two varieties, or two standards, we will present their historical relationship in basic terms.

These two entities lived under the common name of the Serbo-Croatian language in the former Yugoslavia for almost a century and were considered one language. It is an open question how much they mixed, how much they influenced each other, how many linguistic features passed from one entity to another, and how much each of them preserved its identity.

They undoubtedly have the same origin. Before the Slavs immigrated to the Balkans, the Southern Slavs separated from the Eastern and Western Slavs. During historical development, the western linguistic community of the Southern Slavs developed, from which the Slovene and Serbo-Croatian languages emerged. The Serbo-Croatian language consisted of three dialects – Štokavian, Kajkavian, and Chakavian, named according to the interrogative pronoun što/šta:kaj:ča (′what′). Until the 19th century, all three dialects were in use. The foundations of the new standard language were established in the 19th century. After the Illyrian movement and the reform of the language and orthographic system by Vuk Karadžić, the Štokavian dialect (in its ekavian and (i)jekavian variants) was taken as the basis of the standard language.

Even before the break-up of the former Yugoslavia, this language was polycentrically standardized, and the break-up of Yugoslavia practically created four new languages: Serbian, Croatian, Bosnian, and Montenegrin.

4. Related work
Our research is based on two types of sources. On the one hand, there are works related to linguistic identification or the discrimination between related languages, and on the other hand, there are works dealing with the differences between Serbian and Croatian.

4.1. Literature on linguistic identification and the discrimination between related languages
Martins and Silva (2005) start with a well-known n-gram-based algorithm “that measures similarity according to the prevalence of short letter sequences (n-grams)” (767), but they also add that linkage information and the text from hypertext anchors could improve overall results.

Padró and Padró (2004) presented and compared three different statistical methods for language identification: Markov Models, Trigram Frequency Vectors, and N-Gram-Based Text Categorisation (the n-gram approach mentioned above). They concluded that “for texts over 500 characters, all the systems get a precision higher than 95%, and for texts of 5,000 characters the precision is higher than 99% with all systems” (161), but for small texts the Markov Model system has the highest precision. Also, all three systems tend to fail when it comes to the problem of distinguishing similar languages (Catalan and Spanish).

So we come to the paper by Ljubešić et al. (2007), dealing with the language identification problem for Croatian. To identify the Croatian language, the authors have to distinguish it from similar languages – Serbian, Slovenian, or Slovak. They applied the method of most frequent words and combined it with character n-gram models. Finally, to improve the precision of identifying Croatian documents (where the biggest problem was distinguishing them from Serbian documents), the authors made a list of forbidden words for Croatian and Serbian. Forbidden words (or “blacklisted words”) are words that occur often in one language but never in the other. Forbidden words are also used (along with a document classification method) in another article dealing with the problem of discrimination between closely related languages, or more precisely between Bosnian, Croatian and Serbian (Tiedemann and Ljubešić, 2012).

Zampieri and Gebrekidan (2012) also agree that methods for discriminating similar languages or varieties are not “substantially explored”. In their article, they try to define a model for the automatic classification of two varieties of Portuguese: European and Brazilian. They state that these two varieties “are considered to be the same language [although] there are substantial differences between European and Brazilian Portuguese in terms of phonetics, syntax, lexicon, and orthography” (235). Although they recognize the problem with similar entities, they use a character-based model with 4-grams. It is practically a standard character n-gram model, just with larger character n-grams.

This group of works is more mathematically oriented and does not deal with linguistic features as our work does.
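To make this family of methods concrete, a character n-gram identifier of the Cavnar–Trenkle type can be sketched in a few lines. The sketch is ours; n = 4 and the profile size are illustrative choices, not parameters taken from the cited papers:

from collections import Counter

def profile(text, n=4, size=300):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc, ref):
    """Rank-order distance between two profiles; n-grams absent from
    the reference profile get the maximum penalty."""
    rank = {g: i for i, g in enumerate(ref)}
    missing = len(ref)
    return sum(abs(i - rank.get(g, missing)) for i, g in enumerate(doc))

def identify(text, references):
    """references: language name -> profile built from training text."""
    doc = profile(text)
    return min(references, key=lambda lang: out_of_place(doc, references[lang]))

As the papers above note, approaches of this kind work well for distant languages and degrade exactly where our study operates – between closely related standards.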
4.2. Literature on the differences between Serbian and Croatian
As we said at the beginning of this section, the other group of papers deals with the differences between Serbian and Croatian. Among them, we paid special attention to two papers whose methodology was also used for our examination ‒ Ljubešić et al. (2018) and Ljubešić et al. (2019).3 Namely, this group of authors states phonetic, morphological, syntactic, and lexical differences between Serbian and Croatian, which represent variables through which a certain phenomenon is examined.

3 Both papers have the same authors.
In the first paper, it is the spatial distribution of 16 linguistic features, and the question is whether state borders correspond to linguistic boundaries. In the second paper, it is the phenomenon of linguistic accommodation among the speakers of the BCMS4 languages, i. e. the question of whether BCMS speakers adapt their language when they are in contact with speakers of other BCMS languages (do they change their accent or some grammatical constructions, do they use specific lexemes, etc.).

This part also includes works that deal with the differences among the BCMS languages but are more descriptive, i.e. the differences do not represent methodological instruments for research. From Piper (2009) we learn more about the historical, social, political, and cultural circumstances of these two languages, followed by a description of the language differences (537‒552). Branko Tošović and Arno Wonisch are the editors of a series of collections of papers from 2009 to 2013 that also deal with the relationship of the BCMS languages in general (historical, social, political, and cultural perspectives), and then with many individual language problems – adjectival aspect, noun motion, nouns of the nomina agentis type, distribution of future tenses, participial and reflexive passive, etc. (Tošović and Wonisch, 2009; 2010; 2012; 2013). In Ćevriz-Nišić (2009) we can find various phonological, derivational, lexical, and syntactic distinctive features between the Serbian, Croatian, and Bosnian standard languages in the administrative style. The article by Badurina (2004) follows recent changes (late 20th century) in orthography and vocabulary; in Karavdić (2011) 16 syntactic differences are pointed out (apart from the well-known da+present or infinitive): possessive genitive or the adjective with a noun, future 2nd or present tense, kod+accusative or k+dative, etc. In Bekavac et al. (2008) the differences are organized on five levels, from the phonological to the semantic level. The last one is especially interesting because it is rarely mentioned in the literature. The authors cite the lexeme čas, meaning ′one moment′ in Croatian and ′one hour′ in Serbian, and the lexeme persons, translated into Serbian as ′lica′ and into Croatian as ′osobe′, etc.5

We also consulted the most relevant grammars and manuals of the Serbian and Croatian languages, and for certain variables some special papers dealing with them. For more linguistic details on these, but also on all the literature listed in this section, see Section 5.

All papers in this second group, except for the second of the two papers that we highlighted at the beginning of Section 4.2 (Ljubešić et al. (2019)), state the differences between Serbian and Croatian without examining them in a corpus. Ljubešić et al. (2019) do use a corpus, but it is a corpus of shorter texts (Twitter), used for a different purpose – to describe the phenomenon of linguistic accommodation. Also, our choice of variables differs from the variables used in this paper (see the explanation in Section 5.2).

5. Methodology

5.1. Data and metadata
In the Introduction, we defined Wikipedia as a free online encyclopedia. But it is not entirely, nor could it be, the subject of linguistic inquiry. The subject of our research are three special corpora composed of texts from Wikipedia. These three corpora are, as we stated in Section 2: CLASSLAWIKI-sh, CLASSLAWIKI-sr, and CLASSLAWIKI-hr, available at CLARIN.SI, the Slovene national consortium of the European research infrastructure CLARIN. All three corpora are part of the CLASSLA Wikipedia project, which involved generating corpora for seven South Slavic languages: Macedonian, Bulgarian, Serbian, Croatian, Serbo-Croatian, Slovene, and Bosnian. The corpora were generated using Wikipedia dumps that were downloaded on October 17th, 2020.6

Some important metadata for our three corpora are given in Table 1.

Corpus                               Documents   Tokens        Words
CLASSLAWIKI-sh (Serbo-Croatian       453,404     80,669,281    63,541,966
Wikipedia corpus CLASSLAWIKI-sh 1.0)
CLASSLAWIKI-sr (Serbian Wikipedia    639,277     122,530,226   97,258,485
corpus CLASSLAWIKI-sr 1.0)
CLASSLAWIKI-hr (Croatian Wikipedia   205,898     66,484,380    51,719,524
corpus CLASSLAWIKI-hr 1.0)

Table 1: Number of documents, tokens, and words in SCW, SW, and CW.

5.2. Variables of interest
To select the appropriate variables, we reviewed the linguistic differences between Serbian and Croatian that are cited in the literature. As we have already said, we relied on Ljubešić et al. (2018) and Ljubešić et al. (2019) the most, because we followed the methodology applied in these works. Then we reviewed basic grammars and manuals for Serbian, Croatian and Serbo-Croatian: Pešikan et al. (2010), Stevanović (1989), Stanojčić and Popović (2008), Piper and Klajn (2013), Ivić et al. (2004), Mrazović and Vukadinović (2009); Barić et al. (1997). Then we reviewed papers whose main topic was these differences. All these sources are described in Section 4.2. We also used papers that deal with a particular variable as a special problem. These sources are mentioned under the variable in question.
4 Bosnian, Croatian, Montenegrin, and Serbian languages. In the literature dealing with these languages, they are referred to as BCMS languages.
5 Lexeme persons can also be translated into Serbian by ′osobe′; the translation ′lica′ appears in an administrative language.
6 Links to Wikipedia dumps can be found on https://github.com/clarinsi/classla-wikipedia.
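Operationally, most of the variables described in the rest of this section reduce to two routines: counting the members of a manually compiled word list in a lemmatised corpus, and counting full forms with a regular expression (as for da li and je li under variable 8 below). A minimal sketch, with an invented subset of word pairs and an assumed lemma-per-item input:

import re
from collections import Counter

E_IJE_PAIRS = [("cvet", "cvijet"), ("reč", "riječ"), ("uvek", "uvijek")]

def count_pairs(lemmas, pairs=E_IJE_PAIRS):
    """Return (Serbian variant, Croatian variant) counts per pair;
    `lemmas` is assumed to be an iterable of corpus lemmas."""
    freq = Counter(lemmas)
    return {(sr, hr): (freq[sr], freq[hr]) for sr, hr in pairs}

def count_full_forms(text):
    """Count the full forms da li and je li with the regular
    expressions from Ljubešić et al. (2018)."""
    return (len(re.findall(r"\bda li\b", text)),
            len(re.findall(r"\bje li\b", text)))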
First, we had to choose a smaller number of variables. So we tried to make the variables meet the following criteria: linguistic relevance, representing stable differences, easy recognition by the speaker, and easy automatic retrieval. Therefore, we rejected unreliable variables (such as script – Cyrillic or Latin; in addition, the texts in all corpora are in the Latin script), underdeveloped variables, and variables that are impossible to process due to homonymy.

For most variables, we selected words that illustrate a certain phenomenon so we could search the corpus. We chose examples that are well known to us as native speakers and for which we found confirmation in the literature mentioned above.7 It would be better if we could present all those examples in tables, along with their mean values and proportions, but since that would require a lot of space, we decided to just list those words and present the final analysis in Section 6.

Two variables were extracted using regular expressions – the morphosyntactic variable trebati and the lexical variable da li:je li.

In three cases (for the pair of words takođe:također (′also′) among the phonetic variables; for the semantic variable čas (′hour′, ′moment′); and for the pronoun ko:tko) we analyzed a smaller number of examples (80). We did this in cases where something seemed suspicious to us based on the raw numbers (takođe:također, ko:tko) or when we wanted to get a general impression of the use of the lexeme, and a detailed analysis would require separate research (čas).8 More, and better randomized, examples would improve this research.

The selected variables belong to the following levels of linguistic structure: the orthographic, phonetic, derivational morphology, morphosyntactic, syntactic, and semantic levels.

We chose this approach – starting from language features known and described in the literature and then identifying them in the corpora – because we believe that this is the best way of language identification. In addition, we believe that automatic text recognition should be based on theory.

Orthographic variable
1) transliteration:original
When it comes to the orthography of foreign proper names, transliteration is more frequent in Serbian (and it is also the standard), while in Croatian foreign proper names are written in the original: Njujork:New York. Examples of this variable are found in Memić (2009).

Phonetic variables
2) e:ije/je
This variable concerns the Proto-Slavic vowel jat and its different reflexes: je/ije in Croatian and e in Serbian, although the (i)jekavian reflexes (and dialects) also belong to the Serbian standard language.

In the literature, this variable is considered “the most obvious difference between Croatian and Bosnian on one side and Serbian on the other” (Bekavac et al., 2008: 35), one of “the biggest differences between Croatian and Serbian” (Ljubešić and Klubička, 2014: 29), or “one of the features central to defining the dialects” and “the variable whose geographical distribution is expected to be most straightforward” (Ljubešić et al., 2018: 110).

This variable was extracted through a list of words that was created manually (as we have already mentioned). Since the consonant j is a frequent cause of various phonetic alternations, we chose words in which there are no phonetic alternations. Otherwise, we would have to look for more results for the (i)jekavian forms and sum them up: sneg:snijeg, snjeg (′snow′), devojka:djevojka, đevojka (′girl′), etc.

3) rdrop
The variable rdrop refers to the fact that in some words the consonant r is kept at the end of the word in Croatian, while in Serbian it is lost: juče:jučer (′yesterday′). This variable is also illustrated by a manually created list of words.

The nouns veče:večer (′evening′) are regularly cited as an illustration of this difference, but since both nouns have the same declension, we had to exclude them from the search because we cannot deduce from the form what the lemma should be. We kept the words naveče:navečer, predveče:predvečer and uveče:uvečer (′in the evening′), which are derived from the word veče:večer, because they are adverbs and thus have no declension.

Since the grapheme đ also appears as dj, for the words takođe:također (′also′) we searched for both spellings and summed them up (takođe:također, takodje:takodjer).

4) h:k
The variable h:k occurs in words of Greek origin. As early as the Middle Ages, the rule was established in Serbian that the Greek χ was transferred as the Slavic h, while in Croatian k appeared under the influence of Western European languages. We also used a manually created list for this variable, because there are not so many of those words.

Derivational morphology variables
5) ka:ica
The suffixes -ka and -ica are used for deriving feminine nouns of the nomina agentis type. But here the situation is not so simple. First, both suffixes are very productive in both Serbian and Croatian, and we cannot claim that one suffix is Serbian and the other Croatian. So we have in Serbian: glumica, igračica, pevačica etc., and in Croatian: maserka, programerka, novinarka, analitičarka etc. This also applies to other suffixes. So we find in Babić (1999) that the suffixes -ica, -ka, -kinja, -inja are as Croatian as they are Serbian, and differ only in their distribution. We find similar claims in other authors (Dražić and Vojnović, 2010).

Second, “the choice of the suffix also depends on the ending of the masculine noun from which the feminine form is derived” (Ljubešić et al., 2018: 113). Therefore, among many other suffixes, we chose the suffixes -ar and -or in the masculine gender, for which we found confirmation in several sources that they regularly give -ka in Serbian and -ica in Croatian (Dražić and Vojnović, 2010; Ljubešić et al., 2018; Ćorić, 2010). We also manually created a list of those pairs of words.

7 The dictionary Ćirilov (2010) also helped us in this.
8 See more details in those examples.

6) isa, ova:ira
This variable is related to the morphological composition of international verbs: organizovati in Serbian and organizirati in Croatian (′organize′). Petar Skok noticed that difference in the 1950s. According to Skok (1955‒1956), the suffix -isati is related to Belgrade, it is of Greek origin, and it entered Serbian with Turkisms. The suffix -irati is related to Zagreb, it is of Latin origin, and it was received through French and German. The suffix -ovati originates from the Proto-Slavic language. Recent research also confirms this distribution: “It is also noticeable that the distribution of suffixes in certain verbs in Serbian and Croatian is differentiated […] examples of verbs with -ira- are registered in Croatian texts, and with -isa- and -ova- in texts by Serbian authors.” (Ivanić and Perišić, 2018: 188).

This variable is illustrated by a list of examples mostly listed in Tošović (2010), Skok (1955‒1956), and Ivanić and Perišić (2018).

Morphosyntactic variable
7) trebati
In standard Serbian, the modal verb trebati (′need/should′) is used as an impersonal verb and has the complement da+present tense: ja treba da idem, ti treba da ideš, etc.9 In Croatian, this verb is used as a personal verb and has an infinitive as a complement: ja trebam ići, ti trebaš ići, etc. For this variable, we used the regular expression found in Ljubešić et al. (2018).

Lexical variable
8) da li:je li
As we read in Ljubešić et al. (2018), yes/no questions in Serbian are formed with the interrogative expressions da li and je li. The form da li is more common, and the form je li is usually shortened to je l’, jel’, or jel. In Croatian, je li is the standard form.

We have analyzed only the full forms, using the regular expressions also found in Ljubešić et al. (2018): ‘\bda li\b’ and ‘\bje li\b’.

Semantic variable
9) čas (′hour′:′moment′)
Semantic differences are less common in the literature. We have already mentioned the lexeme čas, meaning ′one moment′ in Croatian and ′one hour′ in Serbian, from Bekavac et al. (2008). Since it is a matter of meaning, we had to make our own decisions on a case-by-case basis. So we took the first 80 occurrences of the lexeme čas and determined whether it means ′hour′ or ′moment′.

After describing the variables used, we will only briefly mention one of the very interesting problems we encountered, namely the use of the interrogative pronoun who, which in Serbian has the form ko and in Croatian tko. The first problem is that the forms of ko, in addition to the forms of tko, also received the lemma tko in all three corpora (da je bilo kome rekao – the form kome got the lemma tko instead of ko). Another problem is that the personal interrogative pronoun ko/tko has the same declension as the adjective pronoun koji/tkoji (its shorter form). In this way, many examples that were supposed to get the lemma koji/tkoji got the lemma ko/tko (kamen od koga se obično izrađuje nakit – the form koga got the lemma tko instead of koji). That is why we rejected this feature as a variable, but we analyzed 80 examples with the lemma ko and 80 examples with the lemma tko in each of the three corpora. Then we divided those examples according to the lemmas that they should get: ko, tko, (t)koji. The results we obtained are shown in Table 2.

               CLASSLAWIKI-sr        CLASSLAWIKI-hr        CLASSLAWIKI-sh
               (Serbian Wikipedia)   (Croatian Wikipedia)  (Serbo-Croatian Wikipedia)
Lemma=ko       ko: 49                -                     -
(80 examples)  tko: 0                -                     -
               (t)koji: 29           -                     -
               error: 2              error: 10             error: 32
Lemma=tko      ko: 4                 ko: 9                 ko: 1
(80 examples)  tko: 1                tko: 41               tko: 3
               (t)koji: 71           (t)koji: 24           (t)koji: 71
               error: 4              error: 6              error: 5

Table 2: Lemmatization of the pronoun ko/tko.

6. Analysis
Insight into these three corpora gave us the following data. For the variables we searched using the word lists we made, we got the numbers of lemmas. To obtain representative values and overcome the size inequality of these three corpora, we calculated mean values and proportions. To calculate the proportion, we used the following formula: the proportion of one value of one variable in one corpus is equal to the quotient of the mean value of that variable value in that corpus and the sum of the mean values of both values of that variable in that corpus. For example, the proportion for the value e of the variable e:(i)je in SW = the mean for e in SW / (the mean for e in SW + the mean for (i)je in SW).

To visually represent these relationships, we made the same kind of illustration for each variable. On the left (blue) is what we have defined as a Serbian feature, and on the right (red) what we have defined as a Croatian feature. Then we marked a value for each corpus. We presented the proportions as percentages because it seems easier to read the data from the image in this way. This presentation allowed us to see the data for all three corpora for each variable in the same image, making it easier to compare. The figure also shows whether SCW is closer to SW or CW.

Our first variable is orthographic and concerns the writing of foreign proper names. As we said, transliteration is more frequent in Serbian, while in Croatian foreign proper names are written in the original. To examine this, we took 5 proper names: Njujork:New York, Čikago:Chicago, Dablin:Dublin, Kembridž:Cambridge, Venecija:Venezia.

9 In colloquial language this verb is very often used as a personal verb, but retains the complement da+present tense: ja trebam da idem, ti trebaš da ideš, etc.
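Before turning to the individual variables, the proportion defined above can be written out directly. The counts in the example are invented purely for illustration:

from statistics import mean

def proportion(counts_a, counts_b):
    """Proportion of value A of a variable in one corpus: the mean count
    of A divided by the sum of the mean counts of A and B."""
    m_a, m_b = mean(counts_a), mean(counts_b)
    return m_a / (m_a + m_b)

# Hypothetical per-word counts for e vs. (i)je over a word list in SW:
print(round(proportion([120, 80, 95], [1, 2, 1]), 2))  # 0.99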
As we can see from the mean values and proportions, transliteration is more prevalent in SW (0.74) and original writing in CW (0.80), and SCW is closer to CW in this characteristic, with a proportion of 0.68 in favour of the original writing.

Figure 2: Variable transliteration:original.

The next three variables are phonetic. For the first, e:ije/je, we took 10 words, according to the criteria defined above for this variable: cvet:cvijet (′flower′), reč:riječ (′word′), sveća:svijeća (′candle′), zameniti:zamijeniti (′replace′), uvek:uvijek (′always′), pesma:pjesma (′song′), vetar:vjetar (′wind′), mera:mjera (′measure′), veštica:vještica (′witch′), sesti:sjesti (′sit′). The mean values and proportions show us the following. Although the (i)jekavian dialect also belongs to the Serbian standard, in SW the ekavian reflex is completely dominant (0.99). In CW the (i)jekavian reflex of the Proto-Slavic vowel has the same value (0.99), which is not surprising, because there is only one standard in Croatian. In SCW the ekavian reflex occupies approximately one-third and the (i)jekavian two-thirds (the proportion is 0.30:0.70).

Figure 3: Variable e:ije/je.

The next phonetic variable refers to words that have the consonant r at the end of the word in Croatian, while in Serbian it is lost. We used the following 6 words: juče:jučer (′yesterday′), prekjuče:prekjučer (′the day before yesterday′), naveče:navečer, predveče:predvečer and uveče:uvečer (′in the evening′), takođe:također (′also′). Analysing these words, we came to the following results. Forms without the consonant r at the end of the word have the expected high value in SW (0.99), as do forms with the consonant r at the end of the word in CW (0.99). What we did not expect is an extremely high value for the forms with the consonant r at the end of the word in SCW (0.99). Looking at the raw numbers, we concluded that the frequency of use of the form također in SCW contributed to this. If we exclude this pair of words (takođe:također) from the analysis, the characteristic forms almost retain their values in SW and CW (0.98 and 0.98), but SCW is much more balanced (0.48:0.52 in favour of the forms with the consonant r). We also wanted to make sure that these high values for the word također are not the result of a lemmatization error. We reviewed 80 examples in SCW and found 16 errors (Brown je takođe hvalio film, On takođe uzima učešća...). In Figure 4 we show the values that include the use of the pair of words takođe:također.

Figure 4: Variable rdrop.

The last phonetic variable, h:k, is found in translations of words of Greek origin – h in Serbian and k in Croatian. We used the following 7 words: haos:kaos (′chaos′), harizma:karizma (′charisma′), hemija:kemija (′chemistry′), hirurg:kirurg (′surgeon′), hronika:kronika (′chronicle′), hlor:klor (′chlorine′), hrizantema:krizantema (′chrysanthemum′). For example, we did not find the word harizma in CW at all, nor the word hrizantema in either CW or SCW. This feature is very stable – words with h consistently appear in SW (0.99), and words with k consistently occur in CW (0.99). In SCW the usage is balanced (0.50:0.50).

Figure 5: Variable h:k.

For our first derivational morphology variable, ka:ica, we used 9 words: slikarka:slikarica (′painter′, fem), ministarka:ministrica (′minister′, fem), apotekarka:apotekarica (′pharmacist′, fem), autorka:autorica (′author′, fem), doktorka:doktorica (′doctor′, fem), profesorka:profesorica (′professor′, fem), direktorka:direktorica (′director′, fem), lektorka:lektorica (′language editor′, fem), inspektorka:inspektorica (′inspector′, fem). The data on the distribution of the suffixes -ka and -ica show the following. The suffix -ka has a very high value in SW (0.97), which confirms its consistent use in Serbian texts, just as the suffix -ica has a high value in CW (0.99). In SCW the suffix -ka reaches almost one-third (0.28), and the rest is the suffix -ica (0.72), which makes SCW much closer to CW according to this feature.

Figure 6: Variable ka:ica.

The situation is similar with verb formation. The suffixes -isa and -ova, which are related to Serbian, have a value of 0.99 in SW, the same as the suffix -ira in CW. In SCW, the ratio is 0.39:0.61 in favour of the suffix -ira, which also shows that SCW is closer to CW according to this feature. We used the 10 verbs operisati:operirati (′operate′), fotografisati:fotografirati (′take photos′), reformisati:reformirati (′reform′), regulisati:regulirati (′regulate′), pakovati:pakirati (′pack′), kritikovati:kritizirati (′criticise′), diskutovati:diskutirati (′discuss′), identifikovati:identificirati (′identify′), promovisati:promovirati (′promote′). In SCW we did not find the form pakirati (′pack′), and in CW we did not find the forms fotografirati (′take photos′) and reformirati (′reform′).
Figure 7: Variable isa, ova:ira.

Analysis of the morphosyntactic variable trebati showed that the modal verb trebati (′need/should′) as an impersonal verb with a da+present tense complement has a dominant use in SW (0.96), as does its personal variant with an infinitive complement in CW (0.88). In SCW this verb is used more in the impersonal form, which means that according to this feature SCW is more Serbian than Croatian (0.70:0.30).

Figure 8: Variable trebati (imp:pers).

The lexical variable da li:je li represents the expressions da li and je li used for yes/no questions. In the description of the variable, we said that both expressions are used in Serbian, but that the form da li is the more common one there, while je li is the standard form in Croatian. However, the results show the dominant use of da li in Serbian (0.98),10 while in Croatian the use of these expressions is much more balanced – both values are close to the middle (0.46:0.54 – je li still has a slightly more frequent use). In SCW, da li appears much more often (0.83:0.17), so it is closer to SW in this respect.

10 The explanation for such a high value of da li in relation to je li in SW is that in the Serbian spoken language the full form je li is rarely used. Its shortened variants je l’, jel’, or jel are much more common.

Figure 9: Variable da li:je li.

The semantic variable čas is stable. The lexeme čas is more often used in SW in the meaning of hour (0.90), and in CW in the meaning of moment (0.97). In SCW these meanings stand in the relation 0.63:0.37 in favour of the meaning of hour, and therefore SCW is closer to SW according to this feature.

Figure 10: Variable čas.
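The ratio values reported for each variable (e.g. 0.83:0.17) are simply the raw frequencies of the two competing variants normalised to sum to one within each corpus. A minimal Python sketch of this computation follows; it is not part of the original study, and the corpus counts below are invented for illustration only:

```python
# Normalise the frequencies of two competing variants within one corpus,
# producing value pairs such as 0.83:0.17 (counts below are illustrative).
def variant_ratio(count_a: int, count_b: int) -> tuple[float, float]:
    total = count_a + count_b
    if total == 0:
        return (0.0, 0.0)  # the variable is unattested in this corpus
    return (round(count_a / total, 2), round(count_b / total, 2))

# Hypothetical counts for the pair da li : je li in the three corpora.
counts = {"SW": (4900, 100), "CW": (460, 540), "SCW": (830, 170)}
for corpus, (da_li, je_li) in counts.items():
    print(corpus, variant_ratio(da_li, je_li))  # e.g. SCW -> (0.83, 0.17)
```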
7. Conclusion

In the beginning, we determined that our goal was to determine the linguistic identity of the corpus of texts CLASSLAWIKI-sh, and we assumed that it is midway between the corpus CLASSLAWIKI-sr and the corpus CLASSLAWIKI-hr. But we did not get a single or simple answer. It turned out that according to orthographic and most phonetic and derivational morphology features, SCW is closer to CW than to SW. On the other hand, the morphosyntactic, lexical, and semantic features show that SCW is closer to SW than to CW. This may indicate that SCW contains more Croatian texts, because these, so to speak, basic characteristics are more Croatian. Also, the values in SCW for most variables are closer to the extremes than balanced, so our initial hypothesis is confirmed in only a few cases (for example, variable h:k – 0.50:0.50). The other questions we asked at the beginning are not easy to answer in such a limited study.

To improve this research and get more accurate and precise results, some further variables should be included, some unclear issues should be resolved (e.g. some problems in lemmatization), and some more advanced corpus search techniques should be used (first of all, regular expressions, randomized examples, etc.). As for the variables, there are a number of very interesting features: possessive adjective (in Serbian) / possessive genitive (in Croatian): tetka Marin brat / brat tetke Mare (′Aunt Mary's brother′); the conjunction pošto (′since′) ‒ in Croatian it is used only in a temporal sense, in Serbian also in a causative sense: Pošto je knjiga bila skupa, nisam je kupila (′Since the book was expensive, I didn't buy it′); kod (in Serbian) / k (in Croatian): Doći ću kod tebe. / Doći ću k tebi. (′I will come to you.′); gde (in Serbian) / kamo (in Croatian) for the direction of movement: Gde ideš? / Kamo ideš? (′Where are you going?′), etc.

8. References

Božo Bekavac, Sanja Seljan, and Ivana Simeon. 2008. Corpus-based Comparison of Contemporary Croatian, Serbian and Bosnian. In: Proceedings of the Sixth International Conference Formal Approaches to South Slavic and Balkan Languages, pages 34‒39, Dubrovnik, Croatia.
Božo Ćorić. 2010. Jezičke i/ili varijantske razlike na tvorbenom planu. In: Branko Tošović and Arno Wonisch, eds., Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, Book I/2, pages 41‒50. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2009. Bošnjački pogledi na odnose između bosanskog, hrvatskog i srpskog jezika. Graz and Sarajevo: Institut für Slawistik der Karl-Franzens-Universität Graz and Institut za jezik.
Branko Tošović. 2010. Деривационные различия между сербским, хорватским и бошняцким языками (прелиминариум). In: Branko Tošović and Arno Wonisch, eds., Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, Book I/2, pages 65‒80. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2010. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, I/2. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2012. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, I/4. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2013. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, I/5. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Bruno Martins and Mário J. Silva. 2005. Language Identification in Web Pages. In: Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, pages 764–768, New York, NY, USA.
Eugenija Barić, Mijo Lončarić, Dragica Malić, Slavko Pavešić, Mirko Peti, Vesna Zečević, and Marija Znika. 1997. Hrvatska gramatika. Zagreb: Školska knjiga.
Jasmina Dražić and Jelena Vojinović. 2010. Imenice tipa nomina agentis u srpskom i hrvatskom jeziku (tvorbeni i semantički aspekt). In: Branko Tošović and Arno Wonisch, eds., Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, Book I/2, pages 41‒50. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Jovan Ćirilov. 2010. Hrvatsko-srpski rječnik inačica и Српско-хрватски речник варијаната. Novi Sad: Prometej.
Jörg Tiedemann and Nikola Ljubešić. 2012. Efficient discrimination between closely related languages. In: Proceedings of COLING 2012, pages 2619–2634, Mumbai, India.
Lada Badurina. 2004. Novije promjene u hrvatskome standardnom jeziku. Croatian Studies Review, 3‒4:83‒93.
Marcos Zampieri and Binyam Gebrekidan. 2012. Automatic Identification of Language Varieties: The Case of Portuguese. In: Jeremy Jancsary, ed., Proceedings of KONVENS 2012, pages 233–237, ÖGAI. Main track: poster presentations.
Mihailo Stevanović. 1989. Savremeni srpskohrvatski jezik. Beograd: Naučna knjiga.
Mirela Ivanić and Jelena Perišić. 2018. Derivacija glagola sa osnovama stranog porekla u srpskom jeziku u svetlu (ne)jasne diferencijacije između srpskog i hrvatskog standarda. In: Družbeni in politični procesi v sodobnih slovanskih kulturah, jezikih in literaturah, pages 177‒190.
Mitar Pešikan, Jovan Jerković, and Mato Pižurica. 2010. Pravopis srpskoga jezika. Novi Sad: Matica srpska.
Muntsa Padró and Lluis Padró. 2004. Comparing methods for language identification. Procesamiento del Lenguaje Natural, 33:155‒162.
Nenad Memić. 2009. O prenošenju austrijskih i njemačkih toponima u bosanski, hrvatski i srpski jezik: o problemu egzonima u savremenom jeziku. In: Branko Tošović and Arno Wonisch, eds., Bošnjački pogledi na odnose između bosanskog, hrvatskog i srpskog jezika. Graz and Sarajevo: Institut für Slawistik der Karl-Franzens-Universität Graz and Institut za jezik.
Nikola Ljubešić, Maja Miličević Petrović, and Tanja Samardžić. 2018. Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue. Journal of Linguistic Geography, 6/2:100‒124. Cambridge University Press.
Nikola Ljubešić, Maja Miličević Petrović, and Tanja Samardžić. 2019. Jezična akomodacija na Twitteru: Primjer Srbije. Slavistična revija, 67(1):87‒106.
Nikola Ljubešić, Nives Mikelić, and Damir Boras. 2007. Language identification: how to distinguish similar languages? In: Vesna Lužar-Stiffler and Vesna Hljuz Dobrić, eds., Proceedings of the 29th International Conference on Information Technology Interfaces, pages 541–546. Zagreb: SRCE (University Computing Centre).
Nikola Ljubešić and Filip Klubička. 2014. {bs, hr, sr}WaC – Web corpora of Bosnian, Croatian and Serbian. In: Proceedings of the 9th Web as Corpus Workshop (WaC-9) @ EACL 2014, pages 29–35, Gothenburg, Sweden.
Pavica Mrazović and Zorka Vukadinović. 2009. Gramatika srpskog jezika za strance. Sremski Karlovci, Novi Sad: Izdavačka knjižarnica Zorana Stojanovića.
Pavle Ivić, Ivan Klajn, Mitar Pešikan, and Branislav Brborić. 2004. Srpski jezički priručnik. Beograd: Beogradska knjiga.
Petar Skok. 1955‒1956. O sufiksima -isati, -irati i -ovati. Jezik, 4(2):36‒43.
Predrag Piper. 2009. O prirodi gramatičkih razlika između srpskog i hrvatskog jezika. In: Predrag Piper, ed., Južnoslovenski jezici: gramatičke strukture i funkcije, pages 537‒552. Beograd: Beogradska knjiga.
Predrag Piper and Ivan Klajn. 2013. Normativna gramatika srpskog jezika. Novi Sad: Matica srpska.
Stjepan Babić. 1999. Dva tvorbena normativna problema i njihova rješenja. Jezik, 66(3):104–112. https://docplayer.rs/191032196-Dva-tvorbena-normativna-problema-i-njihova-rješenja-stjepan-babić.html
Vera Ćevriz-Nišić. 2009. Razlikovne crte između srpskog, hrvatskog i bošnjačkog standardnojezičkog izraza. In: Savremena proučavanja jezika i književnosti, Zbornik radova sa I naučnog skupa mladih filologa Srbije I(1), pages 373‒383. Kragujevac: Impres.
Zenaida Karavdić. 2011. Komparativna sintaksa bosanskog, crnogorskog, hrvatskog i srpskog jezika. In: Njegoševi dani 3, Zbornik radova, pages 357‒365. Nikšić: Univerzitet Crne Gore, Filozofski fakultet.
Živojin Stanojčić and Ljubomir Popović. 2008. Gramatika srpskog jezika za gimnazije i srednje škole. Beograd: Zavod za udžbenike.
Ocenjevanje uporabniško dodanih sopomenk v Slovarju sopomenk sodobne
slovenščine – pilotna študija
Magdalena Gapsa*
* Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
magdalena.gapsa@ff.uni-lj.si
Povzetek
V prispevku opisujem prvi korak uporabniške raziskave, znotraj katere bodo različne strokovne skupine ocenjevalcev presojale o relevantnosti določenih uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine. V okviru raziskave želim preveriti, ali se ocene strokovnjakov, kot so lektorji, prevajalci in učitelji, razlikujejo od ocen slovaropiscev ter kako enovite so ocene znotraj posamezne skupine. Osredotočam se na potek in rezultate prvega sklopa ocenjevanja, ki ga je kot testna množica izvedla skupina študentov. Ta korak je služil tudi kot preizkus navodil in izbranih orodij za lažje načrtovanje dela preostalih predvidenih skupin ocenjevalcev. Navajam ugotovitve glede relevantnosti sopomenskega gradiva po presoji študentske skupine, kjer so zlasti zanimive mejne kategorije »pogojno« sprejemljivega gradiva, sledijo identificirane šibke točke zasnovane raziskave ter rešitve, ki bodo vključene v nadaljnji potek ocenjevanja.
Evaluation of User-Added Synonyms in the Thesaurus of Modern Slovene – a Pilot Study
The paper describes the first step of user research in which various expert groups of evaluators will assess the relevance of selected user-added synonyms in the Thesaurus of Modern Slovene. Part of the research is to check whether the evaluations of experts such as proofreaders, translators and teachers differ from those of lexicographers, and how consistent the assessments are within each group. The main focus is on the process and results of the first set of assessments, carried out by a group of students as a test set. This step also served as a test of the instructions and tools chosen, to facilitate the planning of the work of the remaining intended groups of evaluators. The results are then presented in terms of the relevance of the synonymous material as assessed by the group of students, with the borderline categories of "conditionally" acceptable material being of particular interest, followed by the weaknesses identified in the research design and the solutions and improvements that will be incorporated into the further assessment process.
1. Introduction

With the advent of the digital medium, both the needs and the opportunities in linguistics and natural language processing are changing. The opportunities show above all in the possibility of automated (faster, simpler and cheaper) updating of language data and descriptions, greater interconnectivity between data of different kinds, unlimited space for their presentation, the involvement of the wider community in the dictionary-making process,1 etc. In this paper I focus on the latter, i.e. the possibility of a contribution by the wider language community, more precisely the possibility for dictionary users to add synonym material to the Thesaurus of Modern Slovene2 (Arhar Holdt et al., 2018; hereinafter also Sopomenke), which raises the question of a possible change in views on synonymy. On the basis of the user-contributed material it is possible to observe how users perceive and experience synonymy, particularly in relation to lexicographers, who make the final decisions on including synonym material in reference language resources.

1 In the Slovenian context, the topic is touched upon by the monograph Slovar sodobne slovenščine: problemi in rešitve (Gorjanc et al., 2017). The role of users in the dictionary-making process and the ways of collaborating with them are discussed in more detail by e.g. A. Abel and C. Meyer (2013).
2 Thesaurus of Modern Slovene: https://viri.cjvt.si/sopomenke/slv/ (accessed: 6 May 2022).

The paper is based on a research question from my doctoral dissertation, entitled Sopomenskost v Slovarju sopomenk sodobne slovenščine in izbranih različicah wordneta (Synonymy in the Thesaurus of Modern Slovene and in selected versions of wordnet),3 which concerns the contribution of the wider language community to views on synonymy. In the dissertation I hypothesise that the view of the professional and wider language community differs from the view of lexicographers, but that this potentially different view of the language community can contribute substantially to building new or upgrading existing language resources. I will test this hypothesis by analysing the synonymy judgements for a selected set of user-added material, assessed by different expert groups (listed below). I will first compare the judgements within each group and then also across groups.

3 The doctoral dissertation is being prepared within the research programme Language Resources and Technologies for Slovene (programme No. P6-0411), financed in 2019–2023 by the Slovenian Research Agency. It is supervised by Research Fellow Dr Špela Arhar Holdt.

The aim of this paper is to present the findings of the first, test or pilot evaluation of user-added synonyms, carried out by a group of six students of language and linguistics programmes. The evaluation task given to this group had two main purposes: (I) preparing the material for the evaluation of user-added synonyms, testing the model, the tools and the instructions, and making any additions or adaptations to them, and (II) collecting feedback for planning the scope and execution of further evaluation. The research is partly connected to the project Sopomenke in Kolokacije 2.0 – SoKol, Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL,4 financed between 2021 and 2022 by the Ministry of Culture of the Republic of Slovenia. The main goal of the project is the renewal of the Thesaurus of Modern Slovene and the Collocations Dictionary of Modern Slovene. The project provided access to students of linguistics with good knowledge of the Thesaurus
of Modern Slovene and experience with annotating semantically related data.

4 SoKol project website: https://www.cjvt.si/sokol/ (accessed: 3 May 2022).

2. Description of the resource and overview of the field

The Thesaurus of Modern Slovene, published in 2018 by the Centre for Language Resources and Technologies of the University of Ljubljana, is the first instance of a new lexicographic concept, the so-called responsive dictionary (Arhar Holdt et al., 2018). Its main characteristic is that the dictionary constantly responds to changes in the language and to the needs of its users. For the purposes of this paper, the most important characteristic is that users can participate in the process of creating the dictionary: the data change according to the activity and comments of the community, which can also contribute to cleaning out irrelevant or erroneous data.5

5 The other main characteristics of the dictionary are e.g. (a) it is available only in digital form, taking into account the needs, conditions and advantages of that form, and it is never finished, as the data constantly change and adapt to the current state of the language; (b) the dictionary database is built using advanced computational methods, which quickly gives users a large amount of openly and freely available language data that are relevant but not yet cleaned; (c) the synonym data are linked to textual context through collocations, corpus examples and links to the corpus; and (č) the dictionary and its database are freely and openly available under an appropriate licence (cf. Arhar Holdt et al., 2018, p. 404; Čibej and Arhar Holdt, 2019, pp. 339–340).

Crowdsourcing for lexicographic purposes is an established practice. The crowd we want to involve in the evaluation needs no special prior knowledge or education: language users who are not experts in the field are a sufficiently talented, creative and efficient group, able to solve less demanding, more routine tasks, while the experts, once crowdsourcing is in place, can focus on more complex and more analytical tasks (Kosem et al., 2013, p. 46; Čibej et al., 2015, pp. 70–71).

Crowdsourcing can be extremely efficient and reliable: the answers or judgements of the non-expert community hardly differ from the gold standard, i.e. the answers given by lexicographers, as was demonstrated as early as 2008 with the Amazon Mechanical Turk (AMT) platform (Snow et al., 2008, pp. 257–258), especially if a sufficient number of evaluators is ensured (cf. Nicolas et al., 2021). The involvement of the community in the development of the Thesaurus of Modern Slovene rests on this assumption.

Judgements about the (non-)synonymy of words have often been used in digital lexicography in the broad sense, especially in (up)grading and cleaning various wordnets: in the Russian wordnet, for example, evaluators judge (in)correctness and themselves compose and correct synsets (Braslavski et al., 2014), while for the Czech wordnet a unified system was developed for reporting errors discovered by users, who can also propose a correction (Horák and Rambousek, 2018). In Slovenia, evaluators used the crowdsourcing tool sloWCrowd (Tavčar et al., 2012) to judge whether automatically acquired candidates were synonymous and whether they belonged to the intended synset, thereby helping to remove errors from the Slovene wordnet (Fišer et al., 2014). Community judgements, including judgements about the (non-)synonymy of words, are also useful more broadly, e.g. in evaluating the accuracy of word embeddings for related words (cf. Schnabel et al., 2015, pp. 301–303).

This raises the question of whether, as the group of contributors grows, the view of the material that the thesaurus brings (or should bring) also broadens or changes. For the identification of synonymy, lexicographers follow pre-selected (and sometimes fairly strict) linguistic criteria, whereas dictionary users can judge synonymy much more subjectively, namely from the point of view of the "usefulness" or "relevance" of a suggestion for their own work (e.g. suggestions of the type brat – sorojenec (′brother – sibling′), avto – osebno vozilo (′car – personal vehicle′) etc. were mostly classified by the evaluators as relevant, even though a different relation is involved).6 It should be noted that judging the similarity of two words, i.e. their synonymy, is by no means an easy and unambiguous task even for lexicographers: there is no universal definition of synonymy, the concept itself is very broad and closely tied to the context and circumstances of use, and different researchers interpret and describe it differently (see e.g. Snoj, 2019, pp. 13–41; Vidovič Muha, 2013, pp. 172–183; Zorman, 2000, pp. 20–48).

6 I assume that in certain categories of material or decisions common ground will emerge, and in others differences.

Most definitions characterise synonyms as words with identical meaning but different form (Zgusta, 1971, p. 89); the differentiation of words with the same meaning by their stylistic or register value is also emphasised (Toporišič, 1992, p. 294). Two main views prevail in the literature: synonyms are either only words with exactly the same meaning (complete synonymy) or also words whose meaning is very similar (partial synonymy). Complete synonymy is very rare, since it violates the principle of linguistic economy, whereas partial synonymy is frequent (cf. Hock, 1991, p. 283; Snoj et al., 2016, p. 5; Vidovič-Muha, 2013, p. 175; Zgusta, 1971, p. 89); it appears most often with figurative meanings, loanwords, archaisms and expressive vocabulary, and the most synonyms are found with words used precisely in a figurative or collocation-bound meaning (Apresjan, 2000, p. 37). In Slovenia, partial synonymy has been understood as part of stylistics rather than semantics, which is why for a long time, especially in lexicography, mainly complete synonymy was treated; the synonyms in the SSKJ dictionary also have a normative role (directing from the marked towards the unmarked), while with the publication of the Sinonimni slovar slovenskega jezika (Dictionary of Slovene Synonyms, 2016) more attention was also devoted to partial synonymy (Vidovič Muha, 2013, p. 180; Snoj et al., 2016, p. 6). The Thesaurus of Modern Slovene offers a new framework, as it highlights the role and value of context by displaying collocations and linking to corpus examples, while also allowing users to add material that in their opinion is missing from the dictionary. On the basis of almost 1,000 user-added synonym suggestions, I want to reopen the question of how synonymy is understood and check whether this understanding has changed with the emergence and development of digital language resources, especially the responsive dictionary.

3. Planned course and findings of the research

The goal within the doctoral research is for the evaluation to be carried out not only by the students but also by representatives of other groups of participants: lexicographers (as the most specialised experts in the field), translators, proofreaders, teachers of Slovene, and amateur language researchers without linguistic education (as representatives of the wider language community). The groups of interest were determined on the basis of the typology of target groups for user research (see Arhar Holdt, 2015, pp. 142–146), in which dictionary users
are in principle members of (at least) one of the following groups: (I) users who use dictionaries in the educational process (e.g. students and teachers of Slovene),7 (II) users who use dictionaries for professional purposes (e.g. lexicographers, translators, proofreaders and teachers), and (III) users who use dictionaries in leisure activities (e.g. amateur language researchers).

7 Students, especially of language programmes, are at the transition between education and professional use; similarly teachers, who use dictionaries for professional purposes, but whose profession is tied to the educational process.

The typology was also used in a study of users' attitudes towards the new features of the Thesaurus, in which the most strongly represented groups in the results were proofreaders, translators, teachers of Slovene at various levels of education, writers of various kinds of texts (e.g. fiction, professional and academic writing, creative writing, journalism, blogging, etc.) and amateur language researchers (cf. Arhar Holdt, 2020, p. 477). From this we can conclude that these groups are the most interested in the Thesaurus (and in synonym data in general), and that they are also relevant and representative. Since testing the hypothesis also requires the opinion of lexicographers, the group of writers8 is replaced by lexicographers.9 Their answers will be analysed within the group and will at the same time serve as the reference evaluation of the synonym pairs, against which the answers of all the other groups will be compared.

8 In the case of writers it would be hardest to obtain a coherent group covering the various genres listed above; on the other hand, the remaining groups satisfy the need for representatives of the group that uses dictionaries for professional purposes.
9 It has to be borne in mind that, because of their education and specialisation, lexicographers are a very atypical user group for dictionary research (cf. Arhar Holdt, 2015, p. 140), but it is precisely this that serves the purpose of the research here.

On the basis of the findings of the evaluation by groups and of the comparisons of answers between them, I want to obtain an empirical basis concerning users' wishes and expectations, which will, from the applied point of view, serve as the basis for preparing guidelines for the editorial protocols to be used in upgrades of the dictionary. From the scientific point of view, the answers will be the basis for defining synonymy in the light of responsive digital language resources. Besides the aims mentioned above, the first round of the user research, carried out by the students, also served to test the design of the research, to uncover its weak points, and to collect feedback for planning the scope and execution of what follows.

4. Material and method

4.1. Material

The user research is based on part of the data constituting the data sample for the doctoral dissertation, namely a list of 546 nouns that appear in the database of the Thesaurus of Modern Slovene (Krek et al., 2018) and in sloWNet (Fišer, 2015), as well as in the Slovene Lexical Database (Gantar et al., 2013) and in the Comprehensive Slovenian-Hungarian Dictionary, where the nouns are annotated with semantic-type labels (Kosem and Pori, 2021).10

10 The latter two resources are taken into account because in the (remaining) analyses within the doctoral dissertation I also want to consider a corpus-based semantic description of the (potentially) synonymous material.

For this study, the data for the doctoral dissertation were additionally enriched with data on the user-added synonyms in the Thesaurus, based on an internal export of the data from 18 November 2021.11 I thus obtained a list of 307 headwords12 that have at least one user-added synonym, i.e. 976 synonym pairs. Some user entries (68 pairs) contained additional explanations and notes in brackets, most often notes on the markedness of the suggestion, e.g. arheolog – žličkar (šalj., pog.), bonbon – cuker (neknj.), klient – kunt (nar.), preteklost – prtljaga (ekspresivno), stopnica – štenga (nižje pog.). Since I did not want to suggest answers to the evaluators, such notes were removed. There were also 5 recorded cases where users added more contextual explanations rather than qualifiers in the brackets. These cases were included in the final evaluation set unchanged, as I wanted to check the evaluators' reaction to such labels. They are the suggestions: interier – ambient (v zaprtem prostoru), kmet – kmet (šahovska figura), koncentracija – (velika/majhna) vsebnost, priloga – priponka (k e-pismu) and torbica – (torbica) pismo. Removing the notes led to the duplication of some synonym pairs;13 I detected 4 such cases, which were likewise excluded from the evaluation list. In its final form the list comprised 972 pairs.

11 The Thesaurus database available in the CLARIN.SI repository does not contain the user-added synonyms. These are exported from the dictionary interface with a customised script, which means that the user-added data are up to date.
12 I do not distinguish between headwords with initial capital and lower-case letters. In the material for the task there is only one such case, namely zemlja and Zemlja, which is treated here as one headword.
13 The dictionary interface does have a simple safeguard preventing users from re-entering an already added suggestion, but it is based on character recognition and allows the input of both alphanumeric and non-alphanumeric characters, e.g. brackets. When a user adds a note to an existing synonym suggestion, the system recognises it as a new entry. In my sample this happened four times: twice within the entry babica, where the suggestions were nona and nona (lokalno) as well as oma and oma ;), within the entry živina, where živad and živad (star.) were suggested, and within the entry nakup, where kupilo and kupilo (star.) were suggested.

4.2. Instructions for the evaluators

The evaluators received an address with a brief explanation that the evaluation was being carried out within doctoral research and what data I wished to collect. They were asked not to use other language resources and reference works during the evaluation. It was stated that the task consisted of two obligatory parts: a spreadsheet with the synonym pairs, in which they would give their answers and any comments, and a questionnaire, in which they would provide demographic data about themselves and feedback on the evaluation task itself. In case of doubts, the participants could ask additional questions by e-mail.

The main instruction to the evaluators was to answer the question: "Are the two words in the pair synonyms?" They could place each synonym pair into one of four categories, i.e. choose one of four possible answers: DA (′yes′), NE (′no′), POGOJNO DA (′conditionally yes′) and NISEM PREPRIČAN/NE VEM (′not sure/don't know′). The answer DA was intended for cases where they were sure the two words were synonyms; the answer NE for cases where they were sure the words were not synonyms, as well as for obvious mistakes and typos. The answer POGOJNO DA was intended for pairs where the evaluators did consider the words synonyms but at the same time saw limitations or had reservations and doubts, e.g. that the words are synonymous only in a certain meaning or context, or that one or both words are marked, etc. The answer NISEM PREPRIČAN/NE VEM was intended for pairs where they did not know one or both words in the pair or the meaning of one or both words, or where they were not sure or found it hard to give an opinion. For each pair there was the option of adding notes, which were required with the answer POGOJNO DA, desired with NISEM PREPRIČAN/NE VEM and possible with the other answers.
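The constraints of this protocol are easy to make concrete. The following minimal Python sketch is not part of the study's tooling; the function and variable names are ours, and it simply encodes the four allowed answers and the note requirement just described:

```python
# Sketch of the evaluation protocol in section 4.2: four allowed answers,
# with a note required for POGOJNO DA (and merely encouraged for
# NISEM PREPRIČAN/NE VEM, so that case is not flagged as an error here).
ANSWERS = {"DA", "NE", "POGOJNO DA", "NISEM PREPRIČAN/NE VEM"}

def validate(answer: str, note: str = "") -> list[str]:
    problems = []
    if answer not in ANSWERS:
        problems.append(f"unknown answer: {answer!r}")
    if answer == "POGOJNO DA" and not note.strip():
        problems.append("a note is required for POGOJNO DA")
    return problems

print(validate("DA"))                                          # []
print(validate("POGOJNO DA"))                                  # ['a note is required ...']
print(validate("POGOJNO DA", "only in the sense of 'moment'")) # []
```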
Since one of the main goals of the research is to find out what the evaluators understand as relevant synonym material, the instructions were, to avoid suggesting answers, very general. For this reason "synonym" was not defined in any detail, the possible answers contained only a short description without examples, and there were likewise no instructions on where to place borderline cases.

4.3. Evaluation

The synonym pairs were delivered to the evaluators in the form of a table available as a Google spreadsheet.14 The file consisted of two sheets. The first sheet contained a shortened version of the instructions, so that the evaluators always had them at hand, and the second a list of the 972 synonym pairs to be evaluated. The first column of the table holds the sequential number of the pair, the second the headword and the third the suggested synonym, e.g. vonj – vzduh, stigma – brazda, reforma – sprememba, pošta – sporočila, dopust – vakance. The cells in these columns were locked to prevent intentional and unintentional changes to the data. In the fourth column the evaluators chose one of the four answers from a drop-down list (to avoid typos). The last, fifth column was intended for the evaluators' comments and notes; it is also the only column where they could enter data freely. The evaluators accessed the data on the principle of one evaluator – one spreadsheet, so that the answers of the other evaluators would not influence an individual's decisions.

14 A Google spreadsheet does not require any special hardware from the evaluators, and the entered answers are saved continuously, so the task did not have to be completed without interruptions.

4.4. Questionnaire

The evaluators also received a link to an online questionnaire, which was an integral part of the evaluation. The questionnaire was built and made available in the online survey tool 1ka.15 In the first part of the questionnaire the participants answered questions about themselves: age, employment status, education (linguistic or not), the ways in which they engage with language, and the main areas that are most important to them in relation to language. In the second part they answered questions related to the evaluation itself: how much time they needed, whether they had any problems while completing it, whether the instructions were clear and whether they missed anything in them. The questionnaire was accessible without restrictions; the evaluators could view the questions in advance, and their answers were saved continuously, so that they could, for example, first provide the data about themselves and the information about the task later.

15 The online survey tool 1ka: https://www.1ka.si/ (accessed: 5 May 2022).

5. Results

The group of pilot evaluators comprised 6 students. The students were given access to the spreadsheets on 15 February 2022; they could start the evaluation immediately and adapt it to their other obligations. The first student reported having completed the evaluation on 16 February 2022, the last on 8 March 2022. I obtained all the desired answers within three weeks. The data obtained can be divided into two main sets, namely the judgements of the sample of synonym pairs and the answers obtained with the questionnaire.

5.1. Evaluation findings

All the answers given by the evaluators were merged into tables using MS Excel. The first table contained the data on the chosen answer (without notes), which made it possible to check the agreement or unity of the evaluators. The second table recorded the notes provided. These were reviewed manually and assigned to one of the categories that took shape during the review: only in a certain meaning or context; marked; unknown word or word meaning; hypernym or hyponym; explanation; and other (e.g. incorrect spelling, nuances of meaning, other lexical relations, unusual word forms, mismatch of parts of speech, rarity of use, etc.). In cases where the evaluators also specified the type of markedness (e.g. endearing, colloquial, archaic, etc.), these data were preserved as well.16 The numerical data on the notes are presented in Table 1. 914 pairs had at least one of the six note categories assigned, 435 had at least two categories, 75 pairs at least three, and 3 pairs had four note categories assigned. The third table contained the merged data on the agreement or unity of the evaluators and on the already categorised notes.

16 A more detailed analysis of the actual notes and comments given by the evaluators exceeds the scope and purpose of this paper, but it is certainly relevant and interesting, also from the point of view of understanding synonymy, and will therefore be addressed in the future.

Category                            | Number
only in a certain meaning/context   | 406
marked                              | 375
unknown word or word meaning        | 266
hypernym or hyponym                 | 182
explanation                         | 65
other                               | 122
notes in total                      | 1,416

Table 1: Numerical distribution of the note categories.

There was very little complete agreement, where all six evaluators gave the same answer: only 34 pairs within the list of 972, i.e. approximately 3.5 % of the whole set. All six evaluators recognised 17 pairs as undoubtedly synonymous (6 DA answers), 5 pairs as conditionally synonymous (6 POGOJNO DA answers), 5 pairs as undoubtedly non-synonymous (6 NE answers) and 7 pairs as unknown or undeterminable (6 NISEM PREPRIČAN/NE VEM answers). There was considerably more majority agreement between the evaluators, where only one answer deviates. There were 132 such pairs in total, i.e. approximately 13.5 % of the material. In 50 cases 5 evaluators chose the answer DA, in 46 POGOJNO DA, in 19 NE and in 17 NISEM PREPRIČAN/NE VEM.
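The two levels of agreement just defined (complete: all six evaluators identical; majority: exactly one deviating answer) can be tallied mechanically. The following Python sketch is illustrative only and is not the study's actual analysis code; the example vote lists are invented:

```python
from collections import Counter

# A pair shows complete agreement when all six evaluators chose the same
# answer, and majority agreement when exactly one answer deviates (5 vs. 1).
def agreement(votes: list[str]) -> str:
    top, top_count = Counter(votes).most_common(1)[0]
    if top_count == len(votes):
        return f"complete:{top}"
    if top_count == len(votes) - 1:
        return f"majority:{top}"
    return "none"

print(agreement(["DA"] * 6))                                     # complete:DA
print(agreement(["DA"] * 5 + ["NE"]))                            # majority:DA
print(agreement(["DA", "DA", "NE", "NE", "POGOJNO DA", "DA"]))   # none
```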
Altogether there were 166 pairs with high evaluator agreement, i.e. 17 % of the material. 67 pairs (40 %) were placed in the category DA, 51 pairs (31 %) in the category POGOJNO DA, 24 pairs (14.5 %) in the category NE, and the remaining 24 pairs (14.5 %) in the category NISEM PREPRIČAN/NE VEM. The distribution of the judgements into the four intended categories is shown in Table 2.

Answer                  | Complete agreement | Majority agreement | Total
DA                      | 17 pairs           | 50 pairs           | 67 pairs
POGOJNO DA              | 5 pairs            | 46 pairs           | 51 pairs
NE                      | 5 pairs            | 19 pairs           | 24 pairs
NISEM PREPRIČAN/NE VEM  | 7 pairs            | 17 pairs           | 24 pairs
Total                   | 34 pairs           | 132 pairs          | 166 pairs

Table 2: Numerical distribution of the answers.

In the cases where the evaluators agreed (all chose the same answer), a total of 22 notes were given for 15 synonym pairs. In the 132 cases where the evaluators mostly agreed (one answer deviated), a total of 158 notes were given for 109 pairs. Under the category other, the evaluators most often mentioned nuances of meaning, spelling, rarity of use, loanwords, etc. The distribution of the notes by category and the numerical data are shown in Table 3; for better clarity and easier comparison, the order of the categories from Table 1 is retained.

Category                            | Complete agreement | Majority agreement
only in a certain meaning/context   | 3                  | 37
marked                              | 5                  | 55
unknown word or word meaning        | 11                 | 24
hypernym or hyponym                 | 1                  | 20
explanation                         | 0                  | 2
other                               | 2                  | 20
notes in total                      | 22                 | 158
pairs in total                      | 15                 | 109

Table 3: Numerical distribution of the note categories under complete and majority agreement of the answers.

Among the pairs that the participants marked as acceptable (DA), they most often pointed out that the two words are synonymous only in one meaning or context, e.g. dilema – težava, identiteta – osebnost, koncentracija – osredotočenost, privilegij – ugodnost, stigma – zaznamovanost. Notes on the markedness of the vocabulary are also frequent, e.g. beluš – asparagus (citation word), cedilo – cedilka (colloquial), morilec – krvnik (obsolete), pes – kuža (colloquial), strpnost – potrpežljivost (colloquial); likewise that the two words are a hypernym and a hyponym (e.g. avto – osebno vozilo, avtomobil – osebno vozilo, brat – sorojenec, poroka – ženitev, kašelj – pokašljevanje), that they do not know the words or their meanings (modrček – nedrc, oklevanje – obiranje, rit – zadnja plat), that the suggestion is an explanation (jok – pretakanje solz), and other notes (e.g. dež – dežne kaplje: mero-/holonymy; elita – veljaki and elita – pomembneži: mismatch of grammatical number; prerok – profet: unusual form; sestra – sorojenka: rare use). Interestingly, the evaluators perceived the pair brat – sorojenec as a hypernym-hyponym relation, while for the pair sestra – sorojenka one evaluator pointed out its rare use and there were no other comments.

Among the pairs that the participants marked as conditionally acceptable (POGOJNO DA), cases of marked vocabulary appear most frequently, e.g. avto – kripa (pejorative), deček – mulec (pejorative, negative attitude), juha – župca (colloquial, diminutive), krema – maža (colloquial), zadrga – fršlus (colloquial, dialectal). Also frequent are cases where the two words are synonymous only in a certain meaning or context, e.g. izkušnja – dogodivščina, kaos – štala, jesen – starost, posluh – čut, preteklost – prtljaga, and hypernym-hyponym cases, e.g. alkohol – etanol, aorta – arterija, avto – prevozno sredstvo, fotoaparat – digič, priseljenec – tujec. There are also pairs where the evaluators stated that they do not know the words or their meanings, e.g. koder – krauželj, pivo – pirček, rit – prdulja, telovnik – lajbič, and pairs with other notes, e.g. pogum – jajca: part of the phrase missing; policija – murja and rit – guza: foreign word.

Among the pairs that the participants marked as unacceptable (NE), they most often stated that the two words are synonymous only in a certain meaning or context, e.g. ljubezen – življenjski tok, stopnica – terasa, živina – blago. There were also notes on the markedness of the vocabulary, e.g. čarovnica – čudežnica (positive attitude), nedelja – teden (obsolete); that the evaluators do not know the word or its meaning (čik – žvečilni gumi and čik – žvečilka,17 laboratorij – pospeševalnik); that the suggested synonym is of a more explanatory nature (rekreacija – raztezne vaje in vaje za moč); that the pair involves a hypernym and a hyponym (projekcija – podatek); and other notes (davek – dan: incorrect spelling; nedelja – teden: foreign word).

17 In the notes on these two pairs the evaluators also pointed out that the two words are in fact synonymous in a certain meaning or context, but they marked them as non-synonymous, justifying this by saying that for the student generation čik means exclusively a cigarette.
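The shares of acceptable material quoted in the discussion below follow directly from the counts in Table 2. As a quick arithmetic check (a few lines of Python rather than part of the study itself):

```python
# Counts taken from Table 2 of the paper; the 71 % share of DA + POGOJNO DA
# among high-agreement pairs is quoted again in the Discussion section.
complete = {"DA": 17, "POGOJNO DA": 5, "NE": 5, "NISEM PREPRIČAN/NE VEM": 7}
majority = {"DA": 50, "POGOJNO DA": 46, "NE": 19, "NISEM PREPRIČAN/NE VEM": 17}
total = sum(complete.values()) + sum(majority.values())        # 166 pairs
acceptable = sum(complete[a] + majority[a] for a in ("DA", "POGOJNO DA"))
print(total, acceptable, round(100 * acceptable / total))      # 166 118 71
```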
For all the pairs where the evaluators chose the answer NISEM PREPRIČAN/NE VEM, a note was recorded that these are words or meanings the evaluators do not know, e.g. čarovnica – bela žena, civilist – legist, obrok – rata, stranka – kunt, zaliv – olmun. In addition, there were notes that the words are marked, e.g. avto – gare (pejorative), kašelj – brehanje (colloquial), koder – loken, krona – dika (archaic), zdravilo – arcnije (obsolete, archaic); that the two words are synonymous only in one meaning or context (e.g. postava – geštel); that they are a hypernym and a hyponym (torbica – nabočnica); and other notes, e.g. srajca – košilja: incorrect spelling; zdravilo – biofarmacevtik: doubts about the actual use of the word; zdravilo – arcnije: mismatch of grammatical number.

Considering complete and majority agreement together, 118 of the 166 pairs, or 71 %, were placed in the categories DA or POGOJNO DA, while 24 of the 166 pairs, or 14.5 %, were placed in each of the categories NE and NISEM PREPRIČAN/NE VEM. Into the categories DA or POGOJNO DA, i.e. acceptable or relevant material, the evaluators generally placed pairs where they also pointed out that the words are marked, e.g. babica – starejša gospa (colloquial, obsolete, positive attitude), debelost – zašpehanost (pejorative, colloquial, negative attitude), kmet – seljak (pejorative, negative attitude), novinar – pisun (pejorative, negative attitude), steklenica – flaša (colloquial); that the two words are synonymous only in one meaning or context (e.g. blago – capa, izrazoslovje – izrazje, legenda – štorija, rit – zahrbtnež, žarnica – sijalka); and hypernym-hyponym pairs (e.g. kovanec – novčič, nakup – fasunga). This category also included pairs where the evaluators pointed out e.g. nuances of meaning, rarity of use or grammatical mismatches (e.g. the suggested synonym is in the plural), e.g. cedilo – sito, stereotip – predsodek, pes – štirinožni prijatelj. The category NE, i.e. unacceptable or irrelevant material, most often comprised foreign words, incorrectly written words, and pairs of words that could be synonyms only in one meaning or context, although this latter condition, in the case of a majority NE answer, was pointed out by only one evaluator, e.g. davek – dan, nedelja – teden, živina – blago, stopnica – terasa, projekcija – podatek. Into the category NISEM PREPRIČAN/NE VEM, i.e. material requiring additional and detailed review, the evaluators generally placed pairs where they did not know the words or their meanings, or where they believed the words are not used at all or only rarely, e.g. avto – sinhronka, cigareta – španjoleta, fotografija – heliotipija, moka – mlevina, zdravilo – biofarmacevtik, torbica – nabočnica. Under majority agreement, only two pairs received the note that the suggested synonym is of an explanatory nature; of these, one pair (jok – pretakanje solz) was classified as acceptable (majority of DA answers) and the other (rekreacija – raztezne vaje in vaje za moč) as unacceptable (majority of NE answers).

5.2. Data about the evaluators and the task

In the first part of the questionnaire I collected data about the evaluators. All the evaluators in the pilot group belong to the 20–30 age group; the youngest was born in 2001 and the oldest in 1995. As this was a student population, all the evaluators stated that they are studying, and most also stated that language is the central subject of their studies. Only one student stated that language is not the focus of his studies, as he studies philosophy. In the next question I asked in which areas language plays a central role for them; several answers were possible. Three answers were available: that they are interested in language because they deal with it mainly in the educational process, that they use language for professional purposes, or that they engage with language purely as amateurs. All of them indicated that they deal with language in the educational process, and half additionally indicated that they also use language for professional purposes. They then specified up to three areas or activities in which language is at the forefront of their interest, chosen from an offered list of nine options. The most frequent answer was language research or language study (5 answers), followed by proofreading (4), translation (3), teaching Slovene (2), and lecturing on linguistic subjects at the higher or university level and text production (1 answer each). Nobody chose the areas of lexicography or amateur language research, and there were likewise no other answers. In the next question the evaluators had to choose only one of the above areas or activities, the one that is principal or most relevant for them. Three chose language research or language study as their main area, and one each named proofreading, translation and text production.

These were followed by the questions about the task. First the evaluators estimated how many hours the task took them. On average the students needed approximately 6 hours to complete the spreadsheet; the fastest finished it in three hours, the slowest in eleven. All confirmed that the instructions were clearly formulated. Only one student stated that he had problems during the task, namely that he did not know many of the words and therefore found it hard to take a position on their potential synonymy. The evaluators also had the opportunity to express concerns, observations and comments not covered by the instructions. Three evaluators did so, stating that they would have liked the category POGOJNO DA to be better defined, that they were not sure into which group to place hypernyms/hyponyms, word explanations and unestablished foreign words, and that they missed the possibility of checking the synonyms in other resources, which would have allowed them to give better answers, although they at the same time understood why they were not allowed to use them. They had no additional comments.

6. Discussion

Regarding the adequacy of the user-added synonyms, it turned out that there was very little undoubtedly non-synonymous material contributed by the users. The cases where the evaluators' answers agreed completely are a small part of the sample (34 pairs, i.e. approximately 3.5 %), which was, however, to be expected given the extent of the data and the number of evaluators. There were somewhat more cases where the answer of one evaluator deviated (132 pairs, i.e. approximately 13.5 % of the set). In total, then, there are 166 pairs with majority agreement, i.e. 17 % of the set. Within this, the pairs where the deviating answer comes from the opposite pole (e.g. all answers NE and one POGOJNO DA, or all answers DA and one NE) amount to about one third (42 pairs). The evaluators judged the large majority of these pairs as acceptable. Only 24 of the 166 pairs, i.e. 14.5 %, were placed by majority in each of the categories NE and NISEM PREPRIČAN/NE VEM. This shows that the user-added synonym suggestions are generally relevant and constructive, since a total of 118 of the 166 pairs with majority agreement (71 %) were placed in the categories DA and POGOJNO DA, even though these include cases that considerably exceed the traditional linguistic understanding of synonymy. This finding is in line with a study from 2020, in which the analysis of a balanced part of a sample of 1,662 synonyms (at most 10 suggestions per user) showed that around
70 % of the user-added suggestions were constructive and at the same time unmarked, around 20 % constructive and marked, and only a good 6 % non-constructive or malicious (cf. Arhar Holdt and Čibej, 2020, p. 6).

The judgements of a single participating group understandably are not and must not be a sufficient basis for generalisation, but the data on the relevance of the user-added material are nonetheless encouraging. Special attention will have to be paid to the category NISEM PREPRIČAN/NE VEM, since all the potential synonym pairs placed in this category received the note that the evaluators do not know the word or its meaning. This does not mean that such suggestions are irrelevant, but they will require more attention from the editors in the process of updating the Thesaurus, e.g. a more thorough search for corpus examples, the use of additional resources to verify actual use, etc.

From the feedback in the questionnaire and from the correspondence with the students who contacted me during the evaluation, it is evident that before any further evaluation the instructions need to be supplemented and the background and goals of the research explained in more detail. The students received only a very short explanation that the evaluation was being carried out within a doctoral thesis and what the main goal was, without more detailed descriptions and explanations. The instructions did tell them not to use other language resources during the evaluation, but without an explanation of why this was undesirable. From the notes they gave, it is evident that some violated this instruction, as notes of the following type were frequent: v Gigafidi sem zasledil_a (′I found it in Gigafida′), Iz Gigafide je razvidno (′Gigafida shows′), Ne v Franu ne v Gigafidi nisem zasledil_a (′I found it neither in Fran nor in Gigafida′), and some even attached links to other reference works in the table. This is very probably a consequence of the missing justification of why this was undesirable. Many of the questions asked by e-mail were accompanied by the statement that they wanted to solve the task "correctly". A possible explanation is that students are used to answers being graded on a right-wrong basis, while the task description did not explicitly state that there are no right or wrong answers, i.e. that all answers are correct, since we are asking for their opinion. This statement may be less relevant for the other intended groups of evaluators, but it nevertheless appears to be a sensible addition to the description of the task and its purpose.

In the instructions the students were informed that user-added synonyms were being evaluated, but without an explanation of what kinds of suggestions might appear on the list. Questions about what to do with hypernyms and hyponyms, unestablished foreign words or corrupted word forms, and suggestions of a more explanatory nature were frequent. Based on the feedback, an additional description of what kinds of data the evaluators can expect on the list also appears to be a sensible addition. On the other hand, typos and obvious mistakes were singled out as cases considered irrelevant or non-synonymous, but not all evaluators always placed them in the category NE (e.g. the pair Zemlja – e, which is an obvious error, was placed by one of the evaluators in the category NISEM PREPRIČAN/NE VEM). As the evaluators pointed out themselves, they missed a more precise definition of the category POGOJNO DA. In the instructions for the other groups of users it is therefore necessary to specify more precisely that this category includes pairs where the evaluators can say something about the synonymy, but it is not beyond doubt and they would want additional information alongside the synonym pair.

7. Conclusion and next steps

In this paper I have described the design of a study addressing the research question of my doctoral dissertation, namely that the view of the professional and wider language community differs from the view of lexicographers, but that this potentially different view of the language community can be useful and important for the development of resources. I have also presented the course of the evaluation task performed by the students as a test set. With it I wanted to test the prepared instructions and the selected tools, and to determine the time and financial scope and the difficulty of the task, which will help me plan the work and recruit the remaining intended groups of evaluators.

Based on the answers and feedback of the pilot group, the evaluation proved feasible. The two selected tools, a Google spreadsheet for the evaluation of the synonym pairs and the online survey tool 1ka, proved suitable, easy to use, and financially and temporally sustainable.18 The instructions for the evaluators will need to be improved; at the same time it seems sensible to explain to the evaluators in more detail the context of the research and the purpose of the evaluation (obtaining their subjective opinion, not "correct" answers). The problematic points in the instructions have been addressed, and the instructions for the evaluators appropriately reworked and supplemented for the next intended groups.

18 Both tools are namely free to use for evaluators as well as researchers, and require no prior knowledge or additional training, special hardware or additional registration.

It is planned that the same list will be evaluated by 5 more groups of evaluators, namely lexicographers, professional translators, proofreaders, teachers of Slovene, and language enthusiasts without linguistic education. At the time of writing, the recruitment of participants is under way, and the data should be obtained by the summer of 2022. Analyses of the results within the groups will follow, and then comparisons between the groups. Although the first results are not yet suitable for generalisation, they offer good insight into the dilemmas in judging the synonymy of user-added material. It is encouraging that (at least after the first step of the research) a large amount of the user-added material was judged as relevant. If the judgements of the other intended participating groups bring similar results, this finding can be taken into account in the further development of synonym resources for Slovene, in the direction of expansion and enrichment with new data. At the same time, earlier findings are confirmed that user suggestions are to a very large extent constructive and well-intentioned, which is crucial for the functioning and further development of responsive dictionaries.

8. Acknowledgements

This paper was written within the research programme Language Resources and Technologies for Slovene (programme No. P6-0411), co-financed by the Slovenian Research Agency.

9. References

Andrea Abel and Christian M. Meyer. 2013. The dynamics outside the paper: user contributions to online dictionaries. In: Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of the eLex 2013 conference, 17-19 October 2013, Tallinn, Estonia, pp. 179–94. Trojina, Institute for Applied Slovene Studies and Eesti Keele Instituut.
Jurij Apresjan. 2000. Systematic Lexicography (translated by Kevin Windle). Oxford University Press, Oxford.
Špela Arhar Holdt. 2015. Uporabniške raziskave za potrebe slovenskega slovaropisja: prvi korak. In: V. Gorjanc, P. Gantar, I. Kosem and S. Krek, eds., Slovar sodobne slovenščine: problemi in rešitve, pp. 136–49. Znanstvena založba Filozofske fakultete.
Špela Arhar Holdt. 2020. How Users Responded to a Responsive Dictionary: the Case of the Thesaurus of Modern Slovene. Rasprave Instituta za hrvatski jezik i jezikoslovlje, 46(2):465–82. doi:10.31724/rihjj.46.2.1
Špela Arhar Holdt and Jaka Čibej. 2020. Rezultati projekta “Slovar sopomenk sodobne slovenščine: Od skupnosti za skupnost”. In: Zbornik konference Jezikovne tehnologije in digitalna humanistika, 24.–25. september 2020, Ljubljana, Slovenija, pp. 3–9. Inštitut za novejšo zgodovino.
Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Polona Gantar, Vojko Gorjanc, Bojan Klemenc, Iztok Kosem, Simon Krek, Cyprian Laskowski and Marko Robnik-Šikonja. 2018. Thesaurus of modern Slovene: by the community for the community. In: Proceedings of the XVIII EURALEX International Congress, Lexicography in Global Contexts, 17-21 July 2018, Ljubljana, pp. 401–10. Znanstvena založba Filozofske fakultete.
Pavel Braslavski, Dmitry Ustalov and Mikhail Mukhin. 2014. A Spinning Wheel for YARN: User Interface for a Crowdsourced Thesaurus. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Association for Computational Linguistics. doi:10.3115/v1/E14-2026
Jaka Čibej and Špela Arhar Holdt. 2019. Repel the syntruders! A crowdsourcing cleanup of the thesaurus of modern Slovene. In: Electronic lexicography in the 21st century: Smart lexicography. Proceedings of the eLex 2019 conference, 1–3 October 2019, Sintra, Portugal, pp. 338–56. Lexical Computing CZ s.r.o.
Jaka Čibej, Darja Fišer and Iztok Kosem. 2015. The role of crowdsourcing in lexicography. In: Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of eLex 2015 Conference, 11-13 August 2015, Herstmonceux Castle, United Kingdom, pp. 70–83. Trojina, Institute for Applied Slovene Studies and Lexical Computing Ltd.
Darja Fišer, Aleš Tavčar and Tomaž Erjavec. 2014. sloWCrowd: A crowdsourcing tool for lexicographic tasks. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC’14, pp. 3471–75. European Language Resources Association (ELRA).
Darja Fišer. 2015. Semantic lexicon of Slovene sloWNet 3.1. CLARIN.SI research infrastructure repository, http://hdl.handle.net/11356/1026
Polona Gantar, Simon Krek, Iztok Kosem, Mojca Šorli, Polonca Kocjančič, Katja Grabnar, Olga Yerošina, Petra Zaranšek and Nina Drstvenšek. 2013. Leksikalna baza za slovenščino 1.0. CLARIN.SI research infrastructure repository, http://hdl.handle.net/11356/1030
Vojko Gorjanc, Polona Gantar, Iztok Kosem and Simon Krek, eds. 2017. Slovar sodobne slovenščine: problemi in rešitve. Znanstvena založba Filozofske fakultete Univerze v Ljubljani, Ljubljana. doi:10.4312/9789612379759
Hans Henrich Hock. 1991. Principles of Historical Linguistics (2nd, revised and updated edition). Mouton de Gruyter, Berlin, New York.
Aleš Horák and Adam Rambousek. 2018. Wordnet Consistency Checking via Crowdsourcing. In: Proceedings of the XVIII EURALEX International Congress, Lexicography in Global Contexts, 17–21 July 2018, Ljubljana, pp. 1023–29. Znanstvena založba Filozofske fakultete.
Iztok Kosem and Eva Pori. 2021. Slovenske ontologije semantičnih tipov: samostalniki. In: I. Kosem, ed., Kolokacije v slovenščini, pp. 159–202. Znanstvena založba Filozofske fakultete Univerze v Ljubljani, Ljubljana. doi:10.4312/9789610605379
Iztok Kosem, Polona Gantar and Simon Krek. 2013. Automation of lexicographic work: an opportunity for both lexicographers and crowd-sourcing. In: Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of the eLex 2013 conference, 17-19 October 2013, pp. 32–48. Trojina, Institute for Applied Slovene Studies and Eesti Keele Instituut.
Simon Krek, Cyprian Laskowski, Marko Robnik-Šikonja, Iztok Kosem, Špela Arhar Holdt, Polona Gantar, Jaka Čibej, Vojko Gorjanc, Bojan Klemenc and Kaja Dobrovoljc. 2018. Thesaurus of Modern Slovene 1.0. CLARIN.SI research infrastructure repository, http://hdl.handle.net/11356/1166
Lionel Nicolas, Lavinia Aparaschivei, Verena Lyding, Christos Rodosthenous, Federico Sangati, Alexander König and Corina Forascu. 2021. An Experiment on Implicitly Crowdsourcing Expert Knowledge about Romanian Synonyms from Language Learners. In: Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning, pp. 1–14. LiU Electronic Press.
Tobias Schnabel, Igor Labutov, David Mimno and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 298–307. Association for Computational Linguistics. doi:10.18653/v1/D15-1
Jerica Snoj. 2019. Leksikalna sinonimija v Sinonimnem slovarju slovenskega jezika. Založba ZRC, ZRC SAZU, Ljubljana.
Jerica Snoj, Martin Ahlin, Branka Lazar and Zvonka Praznik. 2016. Sinonimni slovar slovenskega jezika. Založba ZRC, ZRC SAZU, Ljubljana.
Rion Snow, Brendan O’Connor, Daniel Jurafsky and Andrew Y. Ng. 2008. Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 25-27 October 2008, Honolulu, Hawaii, USA, pp. 254–63. Omnipress Inc.
Aleš Tavčar, Darja Fišer and Tomaž Erjavec. 2012. sloWCrowd: orodje za popravljanje wordneta z izkoriščanjem moči množic. In: Zbornik Osme konference Jezikovne tehnologije, pp. 197–202. Inštitut Jožef Stefan.
Jože Toporišič. 1992. Enciklopedija slovenskega jezika. Cankarjeva založba, Ljubljana.
Ada Vidovič-Muha. 2013. Slovensko leksikalno pomenoslovje. Znanstvena založba Filozofske fakultete, Ljubljana.
ŠTUDENTSKI PRISPEVKI
315
STUDENT PAPERS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
Ladislav Zgusta. 1971. Manual of Lexicography.
Academia, Publishing House of the Czechoslovak
Academy of Sciences, Praga.
Marina Zorman. 2000. O sinonimiji. Znanstveni inštitut
Filozofske fakultete, Ljubljana.
ŠTUDENTSKI PRISPEVKI
316
STUDENT PAPERS
Konferenca
Conference on
Jezikovne tehnologije in digitalna humanistika
Language Technologies & Digital Humanities
Ljubljana, 2022
Ljubljana, 2022
English-Slovenian Chess Terminology Database
Vili Grdič, Alja Križanec, Kaja Perme, Lea Turšič
Department of Translation, Faculty of Arts, University of Ljubljana
Aškerčeva 2, 1000 Ljubljana
grdic.vili@gmail.com
alja.manja@gmail.com
kaja.perme@gmail.com
lea.tursic@gmail.com
Abstract
In our university Terminology course, we built an English-Slovenian chess terminology database because we wanted to create a reliable bilingual source of chess terminology that includes the Slovenian language. The database is based on the corpus approach. We built an English and a Slovenian corpus and extracted 82 English and 109 Slovenian terms. We divided them into five subfields (tactics, strategy, opening, endgame and other) and added definitions, collocations, usage examples, status information and notes.
1. Introduction

Chess may be seen as merely a leisure activity in the form of a board game, but in reality it is a sports discipline and a varied, interdisciplinary and terminologically highly complex field. It is also the subject of research in numerous disciplines, in both the natural and the social sciences. Slovenian chess terminology can seem very intricate, and it is sometimes difficult to find Slovenian equivalents for the foreign-language terms (especially English ones) that constantly surround chess players owing to the strong influence of the internet. Knowing the terminology is nevertheless essential for correctly describing and discussing chess games. To make it easier for translators and linguists to find Slovenian equivalents, we set out to build a bilingual chess terminology database based on the corpus approach. One of the authors is himself an active chess player, which contributed to the motivation for the study and to its starting points.

Terminology is the discipline that studies the specialised vocabulary of a given professional field, i.e. its terms. It also deals with concepts, the relations between them and their designations in different languages, and one of its main goals is the production of terminological reference works. It can therefore also be characterised as a normative discipline, since by publishing such reference works it prescribes the use of terms and contributes to the process of terminological standardisation (summarised from Vintar, 2017: 17–18).

For collecting terminological data we chose the corpus approach, which has come to dominate terminography in recent years. Most of the materials from which linguistic data are drawn are freely available today. The fastest and easiest way to describe language is therefore to use text-analysis software (e.g. Sketch Engine) and to build one's own corpus, from which word lists can easily be obtained, edited as needed, used for automatic term extraction and analysed further (summarised from Vintar, 2017: 83).

The chess masters Iztok Jelen and Matjaž Mikac helped us greatly with the project, and we were also advised by master Monika Rozman.

2. Aim of the paper

The aim of this paper is to describe the project of building the bilingual terminology database we created as part of the Terminology course. As translators we are aware of how important language resources are today, so we tried to create a useful resource ourselves, based on a modern linguistic approach. We also see the project as a continuation of existing work on Slovenian chess terminology, while at the same time bringing English terms closer to the Slovenian audience. We hope to encourage the further development of Slovenian and multilingual chess reference works.

3. Field overview and related work

Chess is a strategic board game roughly 1,500 years old that is nowadays also classified as a sport and constitutes a varied interdisciplinary field. It is studied in many other disciplines, with current research in, for example, psychology (personality, everyday life and chess, see Krivec, 2021), mathematics (chess and mathematics, see Grosar, 2017), neurology (chess and autism, see Gomes de Sousa, 2021), robotics (a chess robot, see Goldman et al., 2021), computer science (chess and artificial intelligence, see Guid, 2010), sociology (women and gender equality in the chess world, see Vishkin, 2022), pedagogy (the influence of chess on foreign-language learning, see Harazińska and Harazińska, 2017) and others.

According to chess master Matjaž Mikac (VIR 1), chess has several dimensions: it is a game, a science, an art and above all a sport. Some disagree with the last point, but Mikac argues that the often criticised lack of physical activity in chess is not the only criterion by which a discipline counts as a sport. Like other sports, chess has a rich competitive tradition (world championships, the Chess Olympiad, school competitions), and players must be well prepared both physically and mentally (e.g. for games lasting several hours).

Chess has a rich tradition in Slovenia: from the end of the 19th century onwards the nation has recorded exceptional achievements
that rank among the world's best, e.g. the successes of Josip Plahuta, Milan Vidmar, Luka Lenič and Laura Unuk (summarised from Jelen, 2006: 10–12). Chess grandmaster Marko Tratar (2003: 4) notes that chess "/…/ had its place throughout the 20th-century Slovenian press, both in its competitive aspect /…/ and because of its artistic, scientific and pedagogical dimensions."

In recent years (most of all during the coronavirus pandemic of 2020–2021), a number of factors have contributed to the growing popularity and spread of chess worldwide. The series The Queen's Gambit (Frank, 2020; see Jurc, 2020) had a major global impact: through a fictional story it realistically portrays the dismissive attitude towards women in the 20th-century chess world, and all the games and matches are depicted correctly from a chess point of view (Loeb McClain, 2020). Teenagers and young adults, but also others, were strongly influenced by the live-streaming platform Twitch (Johannson, 2021), where some grandmasters and other chess players educate and entertain live audiences of up to several tens of thousands of viewers. Among the biggest are Hikaru Nakamura (the GMHikaru profile), Alexandra and Andrea Botez (BotezLive) and, in the Slovenian context, Laura Unuk, Teja Vidic and Lara Janželj (Checkitas). The four online amateur chess tournaments PogChamps also contributed greatly to the visibility of chess, since popular content streamers from Twitch and YouTube took part in them (Johannson, 2021; see VIR 2). In Matjaž Mikac's view (VIR 1), the effect of these factors on chess in Slovenia was not as strong or obvious, because the nation was already well developed in chess terms.

3.1. Chess terminology

The etymology of some Slovenian and foreign chess terms was studied by the jurist Leonid Pitamic (1950). He notes that most terms derive from Latin, Arabic and Persian and then developed differently in the European languages under the influence of cultural and political developments in Europe from the 12th century onwards. Some terms have fairly similar origins across languages (e.g. šah, from the medieval Latin scacci), while others differ considerably and thus also have different literal meanings (e.g. the terms for the bishop: English bishop, German Läufer 'runner', French fou 'fool', Russian слон 'elephant'). Pitamic further observes that the words šah, šahovnica and ček influenced the vocabulary of law, economics and finance in several European languages (e.g. the modern French word for chessboard, échiquier, which is connected with the supreme court Echiquier in old Normandy; summarised from Pitamic, 1950: 173–204).

Slovenian chess terminology developed under the influence of Serbian, from which the old Slovenian chess masters, such as Milan Vidmar, borrowed and translated terms (see Vidmar, 1946; 1951). Their insights (and part of the theory of the Croatian chess player Vladimir Vuković, see 1978; 1990) were collected by chess master Iztok Jelen in several contributions to the curriculum of the elective chess subject for primary schools (VIR 3; VIR 10; see Jelen, 2004a; 2004b).

Chess terminology is to a certain degree multilingual, as some foreign terms have become established in most languages: from French (en passant, j'adoube), German (Zwischenzug, Fingerfehler, Blitz) and Italian (fianchetto, intermezzo). In the jargon of Slovenian chess players we also find Croatian (pješak/pijun), Serbian (dirigovanje) and Russian terms (пешка 'pawn'), probably a remnant of the times of Yugoslavia and the Soviet Union, when chess carried great (often also political) significance and was frequently reported on in the media (VIR 1).

Numerous studies have already dealt with chess terminology and have shown the complexity of the field and its vocabulary. Adylova (2017: 8) notes that the terminological subfield of chess openings alone has its own structural classification of terms (two-, three-, four- and multi-component opening names), and the same holds for the other chess subfields (middlegame, endgame, tactics etc.). Karayev (2016: 103) describes how some general expressions have entered chess terminology (e.g. to calculate), while others have, through determinologisation, moved from chess into the general language, where they are usually used figuratively (in Slovenian, e.g., imeti nekoga v šahu/matu/patu 'to have someone in check/checkmate/stalemate'). He further notes (2016: 103) that people often associate chess with war and politics, which is why chess terminology is frequently used figuratively in non-chess contexts. He draws on journalism and the language of the press: "Our government, like a pawn, does not move backwards" (Moskovskij Komsomolets, 21 January 2005). He adds that the migration of chess terminology into the general language is nothing unusual, since this is characteristic of sports terminology in general (everyday language uses, e.g., attack, passing the ball, a bullseye). Zhuravleva and Vlavatskaya (2021: 534) likewise note that chess terminology is not limited exclusively to chess (chess-specific expressions include šah 'check', šah mat 'checkmate', pat 'stalemate', fianketo 'fianchetto') but extends across the entire sphere of sport (e.g. victory, defeat, attack, defence, referee).

3.2. Language resources for chess terminology

Almost every (major) English-language chess-playing website, as well as other chess websites, offers guides, glossaries and other resources for learning chess, including lists of terminology with definitions, images and the like (e.g. on chess.com, lichess.org, chess24.com). Chess master Iztok Jelen (VIR 3) agrees that there are many resources for English and that the online ones are easily accessible, but they can differ greatly from one another. He says their reliability is hard to assess: the definitions vary, some very general and others more precise; the author is unknown; sources of information are not cited; and it is not stated how the glossary was compiled (e.g. whether a corpus approach was used). The selection of terms is equally questionable, since some glossaries describe collocations and other phrases as terms (control of the center), others add jargon (cheapo 'an easy trap') and even neologisms (Botez Gambit 'an unintentional queen sacrifice', coined by the chess player Alexandra Botez; VIR 14). Some terms found in online glossaries do not occur in our corpus at all and can be found online only in other glossaries (so they exist only in theory), not in actual use, e.g. knight fork windmill (a subtype of the windmill tactic). For the general user, online resources are usable and precise enough, but for linguistic purposes the glossaries in chess books are more reliable.

We ourselves relied most on the English glossaries in the books Chess For Dummies (Eade, 2016) and Winning Chess Openings (Seirawan, 2016), the latter by the renowned American chess player Yasser Seirawan. Iztok Jelen recommends The Oxford Companion to Chess (Hooper and Whyld, 1992).
For Slovenian, we found two online glossaries, Šahovsko izrazoslovje on the ICP portal (VIR 4) and Šahovsko izrazoslovje on Wikipedia (VIR 5), both of which cover a large range of terminology and are precise enough for the general user. More reliable are Iztok Jelen's contributions to the primary-school curriculum for the elective chess subject (VIR 10; 2004a; 2004b), which contain the rules of the game, extensive theory and Slovenian terms. The author himself also recommends the Slovar slovenskega knjižnega jezika and, for further research, the Russian encyclopedia Šahmaty. Enciklopedičeski slovar (1990) and the Croatian translation of Golombek's Encyclopedia of Chess (Golombek, 1980).

4. Method

The main goal of our project was to build a bilingual chess glossary, i.e. a terminology database, on the basis of an English and a Slovenian text corpus. We opted for the corpus approach. We wanted to investigate the actual use of chess terms in both languages, include the terms most frequently used in practice, and equip the entries with definitions, collocations, usage examples, status information and, where needed, notes. We first built the English terminology database on the basis of the English corpus and then used the Slovenian corpus to add the Slovenian terminological equivalents and the relevant information.

4.1. The corpus approach

The terminology database is designed according to the corpus approach, which means that its linguistic data were obtained from a corpus, which we in turn built and analysed in the Sketch Engine tool. We chose this approach because it is easier to create a corpus of texts and analyse it with computer concordances, and in this way describe the language of a given professional field, than to do so in the old way with card files (Logar and Vintar, 2008: 5). A corpus containing a variety of texts from a given field can, if large enough, serve as a representative sample of the language and give insight into actual usage. Such an approach is not only easier but also more modern, faster and thus more user-friendly (Logar and Vintar, 2008: 14).

Even a basic corpus analysis in Sketch Engine yields a word list, which can then be sorted in various ways for further analysis (alphabetically, by length etc.), together with word frequencies, which are particularly valuable for recognising typical terminological patterns (Vintar, 2017: 84). If the corpus is lemmatised and morphosyntactically annotated, it can be analysed in even greater detail: for example, one can display all the adverbs, adjectives, prepositions etc. that occur next to a given headword and thus establish which collocations are the most frequent (Logar and Vintar, 2008: 5). Corpus-analysis programs are equipped with functions that automatically extract keywords and terms, both single- and multi-word; this yields a set of candidate terms, which the user then only has to review manually, removing the unsuitable ones.

Today the corpus approach is needed not only in building terminological reference works but in building any language reference work that aims to present the current state of the language (Gantar, 2004: 170). Besides the automation of lexicographic procedures, its advantages include information about context and usage and the possibility of filtering out irrelevant information (Gantar, 2004: 177).
4.2. The English and Slovenian corpora

For the purposes of the project we created two corpora, an English and a Slovenian one. The goal in collecting the texts was to achieve the best possible coverage of the terminology, so we divided the expressions into terminological subfields (tactics, strategy, opening, endgame and other) and included roughly the same number of words for each. In collecting the sources we took care to capture both general and specialised chess sources and to cover all five subfields. Through the variety and balanced representation of the texts we aimed to extract relevant terms that would better reflect actual usage and to obtain more accurate frequency data. We did not include sources that contain many (or exclusively) definitions, such as glossaries, nor sources that cover fields other than chess (and hence terms irrelevant to us). The selection of sources is nevertheless limited, since the Slovenian corpus includes only freely available online sources, while the English one additionally includes some books in PDF format.

The Slovenian corpus comprises 139,964 words in 55 texts. All the sources are freely available online. To maximise terminological coverage, most of them are online contributions on chess theory, the rest being general chess articles. We could not include Slovenian chess books, as they are not freely available online, which is why this corpus is considerably smaller than the English one. It contains articles on various topics, 28 of them on the rules of play (e.g. VIR 6), the strategy of individual phases of the game (e.g. VIR 7), the pieces, the history of chess and general topics (e.g. VIR 8; VIR 9). We also included 10 contributions from the ICP portal (e.g. VIR 4) and 17 from the online classroom for chess as an elective subject in primary schools (VIR 10).

The English corpus comprises 869,592 words in 21 texts. Like the Slovenian one, it covers a great deal of theory on the individual phases of the game (opening, middlegame and endgame, e.g. VIR 11), strategy and tactics, as well as the laws of the world chess federation FIDE (VIR 12); this content comes from both online and book sources. The corpus contains 7 longer online articles (e.g. VIR 13), to which we added 14 chess books and manuals in PDF format (e.g. Eade, 2016). Because books are much longer than articles, the English corpus has fewer text entries than the Slovenian one but contains far more words.

5. The terminology database

In extracting, identifying and classifying the terms we ran into several difficulties. Chess master Monika Rozman helped us resolve them.

5.1. Problems with term extraction

The program automatically extracted 1,000 single- and multi-word term candidates. Many of them were not terms, so the lists had to be cleaned. We removed the following from the term lists (the examples are from the English lists):
wrongly recognised words – missing parts of words: agonal, advan, endg; misreadings: parry (instead of Garry (Kasparov))
book headers and footers – dummies (from Chess for Dummies), Dvoretsky (the author), 2010 (the year)
pawn moves (also square labels) – f4, e4, g5
piece moves and move coordinates – Ke6, Ra4, o-o; exd5, cxd4; xf6 (a partial move record)
compound moves and diagonal coordinates – f4-f5, b7-b5 (naming both squares); a2-g8, b1-h7 (the b1-h7 diagonal)
combinations of letters and other expressions – c-pawn, d6-pawn, f-file, e4-square
chess players' names and surnames – Dvoretsky, Karpov, Rubinstein (some openings, variations etc. are named after famous players)
general expressions – USSR, USCF

Table 1: Non-terminological expressions from the English list of extracted term candidates.

We believe the problems with wrongly recognised words arose because some texts were in formats that are difficult to read. Older books also contain stylised print in which, for emphasis or layout reasons, some words are spaced out, e.g. n o r p; these were recognised as multi-word terms even though they carry no meaning. Because of the repeated information in the headers and footers of books, Sketch Engine also extracted titles, chapters, page numbers, players' names and other information irrelevant to us. The most frequently extracted expressions on the list were chess moves and coordinates. This is because chess books largely consist of such information, and the books contributed the most words to the corpus. Apart from the introduction, chess books contain little "running" text of the kind we are used to in articles; they are full of diagrams of positions, on the basis of which the author explains games and teaches openings, strategy, tactics and so on.
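A rough Python sketch shows how candidate lists can be cleaned of chess-move notation of the kinds listed in Table 1; the patterns are illustrative assumptions, not the exact manual filters we applied.

    # Sketch: filter chess-move notation and coordinates of the kinds listed
    # in Table 1 out of a candidate term list. The patterns are illustrative
    # assumptions, not the exact manual filters we applied.
    import re

    NOTATION = re.compile(
        r"""^(
            [a-h][1-8]                   # pawn move or square label: e4, g5
          | [KQRBN]x?[a-h][1-8]          # piece move or capture: Ke6, Ra4
          | [a-h]?x[a-h][1-8]            # (partial) capture: exd5, cxd4, xf6
          | [a-h][1-8]-[a-h][1-8]        # compound move or diagonal: f4-f5
          | [oO0](-[oO0]){1,2}           # castling: o-o, o-o-o
        )$""",
        re.VERBOSE,
    )

    def clean_terms(candidates):
        return [t for t in candidates if not NOTATION.match(t.strip())]

    print(clean_terms(["fork", "e4", "Ke6", "exd5", "xf6", "b1-h7", "o-o",
                       "skewer"]))
    # -> ['fork', 'skewer']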
5.2. Term selection

In classifying the terms into the five subfields (tactics, strategy, opening, endgame and other) we encountered difficulties that are common in terminology. For some single-word terms it was hard to decide whether the word is a general expression or a chess term, e.g. take (Slovenian vzeti), diagonal (poševnica), rank (vrsta). The greatest difficulty was determining whether a multi-word unit is a term in its own right or merely a collocation, e.g. weak pawn (šibek kmet), isolated pawn (osamljeni kmet), center square/central square (sredično polje).

We also paid attention to terms with syntactic, semantic or morphological variants. The expressions to defend, defense and defensive are all used very frequently, and in different contexts (defense can, for instance, be part of an opening name, as in Sicilian Defense 'sicilijanska obramba'). It is difficult to determine whether the different parts of speech are independent terms or variants of a single term.

Some terms describe a phenomenon, piece, move etc. that can be assigned to several subfields. The pawn, for example, matters in the opening, the middlegame and the endgame, and several "special moves" (pawn promotion, en passant capture) are performed with it, so it belongs to several subfields. Some terms could not be assigned to any of the planned subfields at all (črni 'Black', beli 'White'). We partly resolved these difficulties by introducing the subfield other. Monika Rozman and Iztok Jelen were a great help here; the latter also reviewed the entire glossary and provided us with reliable terminological sources.

5.3. Building the terminology database

Once we had selected the terms and assigned them to the appropriate subfields, we set about building the terminology database. We first created a bilingual database in SDL MultiTerm and defined the entry structure in English, in line with the TBX standard. At the entry level we added the chess subfield (opening, endgame, strategy, tactics, other); at the language level, the definition and notes; and at the term level, usage, status (obsolete, colloquial, preferred, standard, variant) and notes. We specified the subfield so as to distinguish more precisely which terms belong to which phase of the game and what a term actually denotes (a move, a piece, a tactic etc.), and the term status in order to flag the unsettled or jargon use of some terms.

We first entered the English terms and added their definitions, modelling them on reliable glossaries from chess books, above all Chess For Dummies (Eade, 2016) and Winning Chess Openings (Seirawan, 2016), so that they would be more comprehensible for our purposes; Monika Rozman helped us here as well. On the basis of the corpus we equipped the entries with additional information (subfield, collocations etc.). Then, with the help of the Slovenian corpus and in consultation with the chess masters, we added the Slovenian terminological equivalents, collocations, usage examples and the like.

The terminology database contains 77 entries, comprising 82 English terms with 77 definitions and some synonyms, and 109 Slovenian equivalents. Every term has an English definition and a subfield label, and the great majority also have collocations, status, usage information and, where needed, notes. Iztok Jelen helped us review and extend the database. We did not add Slovenian definitions, because we could not find enough sources containing definitions for most of our set of Slovenian terms, and rather than write our own we chose to omit them. We intend to fill this gap in the future with the help of the masters and, for example, the Chess Federation of Slovenia.
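The three-level entry structure just described (entry, language and term level) can be illustrated with a small TBX-flavoured sketch generated in Python. The element and attribute names loosely follow the TBX-Basic dialect and the example content is ours, so this is an illustration of the structure, not a dump of the actual MultiTerm export.

    # Illustrative sketch of the three-level entry structure described above
    # (entry -> language -> term), serialised as TBX-flavoured XML. Element
    # and attribute names loosely follow TBX-Basic and are our assumptions,
    # not a dump of the actual MultiTerm export.
    import xml.etree.ElementTree as ET

    entry = ET.Element("termEntry", id="c42")
    ET.SubElement(entry, "descrip", type="subjectField").text = "tactics"

    en = ET.SubElement(entry, "langSet", {"xml:lang": "en"})
    ET.SubElement(en, "descrip", type="definition").text = (
        "A move that attacks two or more enemy pieces at the same time.")
    en_tig = ET.SubElement(en, "tig")
    ET.SubElement(en_tig, "term").text = "fork"
    ET.SubElement(en_tig, "termNote", type="administrativeStatus").text = "standard"

    sl = ET.SubElement(entry, "langSet", {"xml:lang": "sl"})
    sl_tig = ET.SubElement(sl, "tig")
    ET.SubElement(sl_tig, "term").text = "vilice"
    ET.SubElement(sl_tig, "termNote", type="administrativeStatus").text = "standard"

    print(ET.tostring(entry, encoding="unicode"))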
For the opening subfield we entered the terms characteristic of that phase of the game (including the names of openings), e.g. gambit, castling, Spanish Game, Sicilian Defense (Slovenian: gambit, rokada/rošada, španska otvoritev, sicilijanska obramba). For endgames we entered the terms for the possible outcomes of the game, some mating patterns and the names of particular endgames, e.g. checkmate, stalemate, back-rank mate, Lucena position (šah mat, pat, mat na osnovni vrsti, Lucenova pozicija). The strategy subfield covers the terms most often encountered in the middlegame, when the important strategic plans take shape, e.g. position, square, diagonal, file, kingside, zugzwang, tempo (pozicija, polje, poševnica, navpičnica, kraljevo krilo, nujnica, tempo). Under tactics we included the basic tactical patterns fork, skewer, discovered attack, sacrifice (vilice, linijski udar, odkriti udar, žrtev) and the like. We added the subfield other to avoid some of the difficulties in assigning terms to subfields; it covers the chess pieces, special moves, titles and acronyms, e.g. king, queen, promotion, grandmaster, arbiter, chessboard (kralj, dama, pretvorba kmeta, velemojster, sodnik/sodnica, šahovnica).

We also recorded frequent collocations, e.g. kingside attack, strong bishop, lead in development (napad na kraljevem krilu, močni lovec, razvojna prednost). Where the use of a term was ambiguous, we noted whether it is used as a verb, noun or adjective (checkmate, for instance, can be either a verb or a noun in English). For the term fianchetto (fianketo) we also added examples of the correct and incorrect English pronunciation, /ˌfɪənˈkɛtəʊ/ and */ˌfɪənˈtʃɛtəʊ/, and of the correct Slovenian pronunciation, /ˌfɪanˈketo/.

Figure 1: An example of a terminology entry in MultiTerm.

As advocates of open science, we have published the terminology database in the CLARIN.SI repository (Grdič et al., 2022), where it is freely available in TBX format. Although it covers only the most frequent chess terms, we regard it as a contribution to Slovenian chess terminology. Thanks to its terminological equivalents in two languages, it can assist translators and other linguists in writing texts and researching chess terminology. We intend to expand and upgrade the database in the future.

6. Limitations of the project

The database was built from a limited selection of sources. The Slovenian corpus contains only online sources; for good representativeness, some book sources on the various chess subfields would also have to be included. We did take care of this in the English corpus, but it too would need more sources for better representativeness.

To keep the project manageable, the final selection of terms relied on corpus frequency, and we included only the basic terms and some additional information in the database. With larger corpora and the help of more experts and terminologists, the database could be extended not only in the number of terms but also in the range of collocations and usage examples. Our definitions, the way collocations are added and recorded, and the other data should also be reviewed by, for example, a terminologist and a lexicographer, so that the database conforms to established practice in building terminology databases and multilingual glossaries.

7. Conclusion

Using the corpus approach and the two corpora we built, we created an English-Slovenian terminology database into which we entered the 82 most frequently used English chess terms, assigned them 109 Slovenian equivalents and equipped them with definitions, collocations, examples and usage information.

In building the corpora we used both popular articles and specialised material, aiming for better representativeness of the individual subfields. The English corpus also includes book sources, while the Slovenian one was limited to online sources.

The database covers a set of basic terms in both languages. In the future we aim to build larger corpora and extract more terms with collocations and usage examples, add Slovenian definitions and collaborate with more chess experts, so that the database becomes as accurate and correct as possible.

8. References

Z. T. Adylova. 2017. System Chess Nomina of Terminological Field "Debut". Scientific Journal of National Pedagogical Dragomanov University. Series 9. Current Trends in Language Development, 16:5–11. Dragomanov National Pedagogical University, Kyiv.
James Eade. 2016. Chess For Dummies. John Wiley & Sons, New York.
Scott Frank, director. 2020. The Queen's Gambit. Netflix. https://www.netflix.com/si/title/80234304.
Polona Gantar. 2004. Jezikovni viri in terminološki slovarji. In: Terminologija v času globalizacije: zbornik prispevkov s simpozija »Terminologija v času globalizacije, Ljubljana, 5.–6. junij 2003«, pp. 169–178. ZRC SAZU, Ljubljana.
Samuel Goldman, Andrew Kwolek, Kenji Otani, Ian Ross and Jack Zender. 2021. Chess Robot. University of Michigan, Department of Mechanical Engineering. https://deepblue.lib.umich.edu/handle/2027.42/167650.
Harry Golombek. 1980. Šahovska enciklopedija. Prosvjeta, Zagreb. Translation of: Golombek's Encyclopedia of Chess. 1977. Crown Publishers, New York.
Luciano Gomes de Sousa. 2021. Chess and Autism Spectrum Disorder (ASD). Brilliant Mind, 8(4). https://revistabrilliantmind.com.br/index.php/rcmbm/article/view/52.
Vili Grdič, Alja Križanec, Kaja Perme and Lea Turšič. 2022. English-Slovenian Chess Terminology Database 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1680.
Gari Grosar. 2017. Šah in matematika. BA thesis, University of Primorska, Faculty of Education. https://repozitorij.upr.si/IzpisGradiva.php?id=9296&lag=eng.
Matej Guid. 2010. Znanje in preiskovanje pri človeškem in računalniškem reševanju problemov. PhD thesis, University of Ljubljana, Faculty of Computer and Information Science. http://eprints.fri.uni-lj.si/1113/1/Matej__Guid.disertacija.pdf.
Joanna Harazińska and Anna Harazińska. 2017. Chess-play as the effective technique in foreign language training. Applied Researches in Technics, Technologies and Education, 5(3):238–242. https://www.readcube.com/articles/10.15547%2Fartte.2017.03.012.
David Hooper and Kenneth Whyld. 1992. The Oxford Companion to Chess. Second edition. Oxford University Press, Oxford and New York.
Iztok Jelen. 2004a. Splošno-teoretska šahovska izhodišča izbirnega predmeta. Skupnosti SIO, online classroom Šah 7.–9. razred, chapter 6. https://skupnost.sio.si/course/view.php?id=2138.
Iztok Jelen. 2004b. Iz teorije kombinacij. From the personal archive of Matjaž Mikac.
Iztok Jelen. 2006. Šah in primerjalna analiza stanja šaha v Sloveniji. Slovenian Chess Federation. From the personal archive of Matjaž Mikac.
Erik Johannson. 2021. Chess and Twitch: Cultural Convergence Through Digital Platforms. MA thesis, Södertörn University, School of Culture and Education, Media and Communication Studies. https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1563119&dswid=6255.
Ana Jurc. 2020. Damin gambit: kako posneti napeto nadaljevanko o šahu? MMC RTV SLO. https://www.rtvslo.si/kultura/gledamo/damin-gambit-kako-posneti-napeto-nadaljevanko-o-sahu/543529.
Assylkhan Agbayevich Karayev. 2016. Specifics of chess terminology. Science, Technology and Education, 6(24):102–105. LCC Olympus, Moscow.
Jana Krivec. 2021. Improve Your Life by Playing a Game: Learn How to Turn Your Life Activities into Lifelong Skills! Thinkers Publishing, Landegem.
Dylan Loeb McClain. 2020. I'm a Chess Expert. Here's What 'The Queen's Gambit' Gets Right. The New York Times. https://www.nytimes.com/2020/11/03/arts/television/chess-queens-gambit.html.
Nataša Logar and Špela Vintar. 2008. Korpusni pristop k izdelavi terminoloških slovarjev: od besednih seznamov in konkordanc do samodejnega luščenja izrazja. Jezik in slovstvo, 53(5):3–17.
Leonid Pitamic. 1950. Šah v pravnem izrazoslovju. Razprave. [Razred 2], Razred za filološke in literarne vede = Dissertationes. Classis 2, Philologia et litterae / Academia scientiarum et artium Slovenica, 1:173–204. Slovenian Academy of Sciences and Arts, Ljubljana.
Yasser Seirawan. 2016. Winning Chess Openings. Everyman Chess, London.
Šahmaty. Enciklopedičeski slovar. 1990. Sovetskaja enciklopedija, Moscow.
Marko Tratar. 2003. Šah v slovenskem časopisu. BA thesis, University of Ljubljana, Faculty of Social Sciences. http://dk.fdv.uni-lj.si/dela/Tratar-Marko.PDF.
Milan Vidmar. 1946. Razgovori o šahu z začetnikom. Državna založba Slovenije, Ljubljana.
Milan Vidmar. 1951. Pol stoletja ob šahovnici. Državna založba Slovenije, Ljubljana.
Špela Vintar. 2017. Terminologija: terminološka veda in računalniško podprta terminografija. Znanstvena založba Filozofske fakultete, Ljubljana.
Allon Vishkin. 2022. Queen's Gambit Declined: The Gender-Equality Paradox in Chess Participation Across 160 Countries. Psychological Science, 33(2):276–284. https://journals.sagepub.com/doi/10.1177/09567976211034806.
Vladimir Vuković. 1978. Škola kombiniranja. Šahovska naklada, Zagreb.
Vladimir Vuković. 1990. Uvod u šah na osnovi opće šahovske teorije. Šahovska naklada, Zagreb.
Irina Nikolaevna Zhuravleva and Marina Vitalevna Vlavatskaya. 2021. Structural model of chess terms in English. Science, Technology and Education, 2(87):534–539. LCC Olympus, Moscow.

VIR 1 = Interview with Matjaž Mikac, conducted by Vili Grdič. 4 August 2022, Ljubljana.
VIR 2 = Chess.com Launches PogChamps With Top Twitch Streamers. chess.com. https://www.chess.com/news/view/chess-com-pogchamps-twitch-rivals.
VIR 3 = Personal correspondence with Iztok Jelen, by e-mail. 5–10 August 2022.
VIR 4 = The ICP online chess portal. Archived 12 April 2021 on archive.org. https://web.archive.org/web/20210412125215/http://www.icp-si.eu/krozek/index.php?tip=glosar.
VIR 5 = Šahovsko izrazoslovje. Wikipedija. https://sl.wikipedia.org/wiki/%C5%A0ahovsko_izrazoslovje.
VIR 6 = Šahovska pravila. Wikipedija. https://sl.wikipedia.org/wiki/%C5%A0ahovska_pravila.
VIR 7 = Šahovska strategija in taktika. Wikipedija. https://sl.wikipedia.org/wiki/%C5%A0ahovska_strategija_in_taktika.
VIR 8 = Slovenske šahistke v Jugoslaviji. radiostudent.si. https://radiostudent.si/kultura/repetitio/slovenske-%C5%A1ahistke-v-jugoslaviji.
VIR 9 = Mitja Rizvič. 2016. Avtomatsko odkrivanje zanimivih šahovskih problemov. BA thesis, University of Ljubljana, Faculty of Computer and Information Science. https://core.ac.uk/download/pdf/151478793.pdf.
VIR 10 = Curriculum for the elective chess subject, online classroom Šah 7.–9. razred. Skupnosti SIO. https://skupnost.sio.si/course/view.php?id=2138.
VIR 11 = Chess endgame. Wikipedia. https://en.wikipedia.org/wiki/Chess_endgame.
VIR 12 = FIDE laws of chess. International Chess Federation. https://handbook.fide.com/chapter/E012018.
VIR 13 = Chess opening. Wikipedia. https://en.wikipedia.org/wiki/Chess_opening.
VIR 14 = Terms. chess.com. https://www.chess.com/terms.
Speech-level Sentiment Analysis of Parliamentary Debates using Lexicon-based
Approaches
Katja Meden†∗
†Department of Knowledge Technologies, Jožef Stefan Institute
Jamova cesta 39, 1000 Ljubljana
katja.meden@ijs.si
∗Jožef Stefan International Postgraduate School,
Jamova cesta 39, 1000 Ljubljana
Abstract
Sentiment analysis or opinion mining is a widely studied research area in the field of Natural Language Processing (NLP) that involves identifying the polarity (positive, negative or neutral sentiment) of a text, usually shorter and emotionally charged text such as tweets and reviews. Parliamentary debates feature longer paragraphs and the very esoteric speaking style of Members of Parliament (MPs), making them much more complex. The aim of this paper is to explore how, and whether, lexicon-based approaches can handle the extraction of polarity from parliamentary debates, using the sentiment lexicon VADER (Valence Aware Dictionary and sEntiment Reasoner) and the Liu Hu sentiment lexicon. We performed sentiment analysis with both lexicons, together with topic modelling of positive and negative speeches to gain additional insight into the data. Lastly, we measured the performance of both lexicons; both performed poorly. The results showed that while both VADER and Liu Hu were able to identify the general sentiment of some topics correctly (i.e., matching positive/negative keywords to positive/negative topics), most speeches are very polarising in nature, shifting perspectives multiple times. The sentiment lexicons failed to recognise the sentiment in parliamentary speeches that are not extremely expressive, or where a large number of intensity-boosting positive words is used to express negativity. We conclude that lexicon-based approaches (such as VADER and Liu Hu) in their unaltered state do not suffice when dealing with data like parliamentary debates, at least not without modification of the lexicons.
1. Introduction

Sentiment analysis or opinion mining is a widely studied research area in the field of Natural Language Processing (NLP) that encompasses the extraction of thoughts, attitudes and subjectivity from text in order to identify its sentiment polarity (positive, negative or neutral). Sentiment analysis is mostly applied to shorter and emotionally charged text, such as tweets and reviews, though it can be used on other forms of textual data, such as parliamentary debates. Parliamentary debates are in essence transcriptions of spoken language, produced in controlled and regulated circumstances, with rich (sociodemographic) metadata (Erjavec et al., 2022).

Contrary to the social media data usually used for sentiment analysis (tweets and other shorter social media text), parliamentary debates, and thus parliamentary discourse, vary with the political environment and culture, and the texts (or rather, speeches) themselves are longer and delivered by parliamentary representatives in strict(er), procedurally themed language. This alone makes parliamentary debates a more complex object of sentiment analysis than tweets or reviews, where opinions and sentiments are usually expressed much more clearly and in a shorter span of text. The sentiment analysis for this paper was performed on the HanDeSeT parliamentary corpus, which includes 1251 motion-speech units from 129 debates with manually annotated sentiment labels.

The aim of this paper is to explore lexicon-based approaches on parliamentary debates, using the lexical (and rule-based) approach VADER (Valence Aware Dictionary and sEntiment Reasoner) and the Liu Hu sentiment lexicon, to see how (and even whether) lexicon-based methods are able to handle sentiment analysis of longer, more complex textual data such as parliamentary debates. To complement this research question, we performed sentiment analysis with both lexicons, together with topic modelling of the positive and negative sentiment clusters to gain additional insight into the data. Lastly, we measured the performance of both lexicons and examined the reasons for possible misclassifications.

The paper is structured as follows: in Section 2 we present related work on sentiment analysis and the VADER and Liu Hu sentiment lexicons, as well as studies of sentiment in parliamentary debates. In Section 3 we present the chosen methodology, together with the chosen dataset, Hansard Debates with Sentiment Tags (HanDeSeT). Section 4 presents the results of the sentiment analysis with the chosen lexicons and of the topic modelling, as well as their performance. Lastly, in Section 5 we present our conclusions and pointers for future work.

2. Related work

2.1. Sentiment analysis and lexicon-based approaches

There are several methods of performing sentiment analysis, divided into three approaches: supervised, lexicon-based and hybrid (Catelli et al., 2022), each with its own set of advantages and disadvantages.
Lexicon-based approaches use sentiment lexicons to describe the polarity (positive, negative or neutral) of the text. This approach involves the manual construction of lexicons of positive and negative words to be used in the sentiment analysis, and a corpus of text to which the analysis is applied. Its main advantages are that such lexicons are easy to understand and have wide term coverage; its disadvantages lie in the finite number of words in the lexicons (i.e., we cannot cover all words, especially if the text is domain-specific) and in the assignment of a fixed sentiment orientation and score to words: every word in the lexicon is classified as positive or negative with a numeric score, e.g. on a scale from -5 (very negative) to 5 (very positive), with 0 denoting neutrality. In this paper we focus on two specific lexicon (and rule-based) approaches from the Natural Language Toolkit (NLTK): VADER and the Liu Hu sentiment module.

2.2. VADER (Valence Aware Dictionary and sEntiment Reasoner)

VADER is established as a gold-standard sentiment lexicon attuned to microblog-like contexts. It is primarily designed for Twitter and other social media text (as well as editorials and movie and product reviews). The VADER sentiment module is implemented in NLTK (https://www.nltk.org/api/nltk.sentiment.vader.html). The authors' aim was to provide a computational sentiment analysis engine that works well on social media style text yet readily generalises to multiple domains and requires no training data, being constructed from a generalisable, valence-based, human-curated sentiment lexicon (Hutto and Gilbert, 2014). The VADER sentiment lexicon comprises 7,500 lexical features with validated valence scores that indicate both the sentiment polarity (positive/negative) and the sentiment intensity on a scale from -4 to +4. For example, the word okay has a positive valence of 0.9, good is 1.9, and great is 3.1, whereas horrible is -2.5, the frowning emoticon :( is -2.2, and sucks and its slang derivative sux are both -1.5 (Hutto and Gilbert, 2014). The entire VADER lexicon is available at https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt.

In the context of parliamentary debates, VADER has been used in several studies, such as Rohit and Singh (2018), where it was used to extract sentiment polarity because it uses a simple rule-based model for general sentiment analysis and generalises more favourably across contexts than many benchmarks such as LIWC and SentiWordNet.
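For illustration, a minimal sketch of scoring text with NLTK's VADER module follows; this is not the paper's Orange pipeline, only the underlying NLTK module that Orange wraps.

    # Minimal sketch of VADER scoring with NLTK. This is not the paper's
    # Orange pipeline, only the underlying NLTK module that Orange wraps.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time download of the lexicon
    sia = SentimentIntensityAnalyzer()

    for text in ("The food here is good.",
                 "The food here is EXTREMELY GOOD!!!"):
        scores = sia.polarity_scores(text)  # keys: neg, neu, pos, compound
        # the compound score in [-1, 1] is the overall sentiment indicator
        print(f"{scores['compound']:+.4f}  {text}")

The second sentence receives a higher compound score than the first, reflecting VADER's rule-based boosting for capitalisation and punctuation.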
2.3. Liu Hu sentiment module

The Liu Hu sentiment lexicon is a product of the research by Hu and Liu, whose aim was to summarise all the customer reviews of a product. Contrary to traditional summarisation tasks, they mined only reviews in which customers expressed an opinion on the product, trying to determine whether the opinions expressed were positive or negative (Hu and Liu, 2004). The Liu Hu opinion lexicon is publicly available and consists of nearly 6,800 words (2,006 with positive semantic orientation and 4,783 negative; the entire lexicon was available at https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html). The opinion lexicon has evolved over the past decade and is, similarly to VADER, attuned to sentiment expressions in social text and product reviews, though it does not capture sentiment from emoticons or acronyms/initialisms (Hutto and Gilbert, 2014). The Liu Hu sentiment lexicon is implemented in the NLTK library as a Liu Hu sentiment module (the nltk.sentiment.util module, https://www.nltk.org/api/nltk.sentiment.util.html), whose function simply counts the number of positive, negative and neutral words in the sentence and classifies it depending on which polarity is more represented. Words that do not appear in the lexicon are considered neutral (a list of the positive and negative words in the lexicon can be found at https://github.com/woodrad/Twitter-Sentiment-Mining/tree/master/Hu%20and%20Liu%20Sentiment%20Lexicon).
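The word-counting idea can be sketched with the Hu and Liu opinion lexicon shipped with NLTK; this is an illustration only, not Orange's exact implementation of the Liu Hu module, and the normalisation mirrors the score described in Section 3.3.1.

    # Sketch of the word-counting idea described above, using the Hu & Liu
    # opinion lexicon shipped with NLTK. An illustration only, not Orange's
    # exact implementation of the Liu Hu module.
    import re
    import nltk
    from nltk.corpus import opinion_lexicon

    nltk.download("opinion_lexicon")  # one-time download of the lexicon
    POSITIVE = set(opinion_lexicon.positive())
    NEGATIVE = set(opinion_lexicon.negative())

    def liu_hu_score(text: str) -> float:
        # crude tokenization; words absent from the lexicon count as neutral
        tokens = re.findall(r"[a-z']+", text.lower())
        pos = sum(token in POSITIVE for token in tokens)
        neg = sum(token in NEGATIVE for token in tokens)
        # normalised difference in percent, as described in Section 3.3.1
        return 100 * (pos - neg) / max(len(tokens), 1)

    print(liu_hu_score("I would happily vote for this excellent motion."))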
2.4. Parliamentary debates

Recently, parliamentary debates have attracted the interest of researchers from various academic disciplines, especially as an object of linguistic research (Erjavec et al., 2022). Transcriptions are produced by professional stenographers who are familiar with the procedures as well as with the Members of Parliament (Truan and Romary, 2021). Parliamentary discourse is shaped by specific rules and conventions, which are in turn shaped by the socio-historical traditions that influence the organisation and operation of the Parliament. These conventions and traditions extend to language use, e.g. turn-taking or forms of address (Fišer and de Maiti, 2020). Another characteristic of the transcriptions is that officially released records of parliamentary debates are not verbatim and that minute-taking practices also vary across countries and history. The editing process can include the elimination of obvious language or factual errors, dialectal or colloquial expressions, and rude and obscene language. This, combined with the fact that editing guidelines are mostly not publicly available, can hinder research (Truan and Romary, 2021).

The main characteristics of parliamentary discourse in the UK Parliament stem from the composition and operation of the Parliament mentioned above: the UK Parliament consists of two Houses, the House of Commons and the House of Lords, and decisions made in one House have to be approved by the other (Parliament, 2022). House of Commons parliamentary debates consist of three substantial elements (Abercrombie and Batista-Navarro, 2018b): debates are initiated with a motion, a proposal made by an MP. When invited by the Speaker (the presiding officer of the chamber), other MPs may respond to the motion, one or more times. Lastly, the Speaker may call a division, in which MPs vote by physically moving to either the 'Aye' or the 'No' lobby of the chamber. These divisions may be called at any time, but typically occur at the end of the debate.
An example from the corpus shows the structure of the units:

Motion: That there shall be an early parliamentary general election.
Speech: Does my right hon. Friend agree that the Prime Minister, in calling this election, has essentially said that she does not have confidence in her own Government to deliver a Brexit deal for Britain? One way in which she could secure my vote and the votes of my hon. Friends is to table a motion of no confidence in her Government, which I would happily vote for.
Vote: 'Aye' (positive).

3. Methodology

3.1. Dataset

HanDeSeT: Hansard Debates with Sentiment Tags is a corpus of English parliamentary debates from 1997 to 2017, with 1251 motion-speech units taken from 129 separate debates and manually annotated with sentiment scores. The corpus was compiled from the UK Hansard parliamentary corpora. The transcripts are largely verbatim records of the speeches made in both chambers of the UK Parliament, in which repetitions and disfluencies are omitted, while supplementary information such as speaker names (speaker metadata) is added (Abercrombie and Batista-Navarro, 2018b).

The HanDeSeT corpus features 1251 motion-speech units, where each unit comprises a parliamentary speech of up to five utterances and the associated debate motion. As detailed in Abercrombie and Batista-Navarro (2018b), parliamentary debates incorporate "much set, formulaic discourse related to the operational procedures of the chamber", i.e. speech segments used to thank the Speaker or to describe the activities in the chamber.

Each speech-motion unit has several sentiment polarity labels:

• manual speech: manually assigned sentiment label of the speech (0 = negative, 1 = positive)
• manual motion: manually assigned sentiment label of the motion (0 = negative, 1 = positive)
• gov/opp motion: label of the relationship of the MP who proposes the motion to the Government (i.e. whether the MP is in Government or not: 0 = not in Government, 1 = in Government)
• speech vote: a speaker-vote label extracted from the division associated with the corresponding debate (i.e. how the MP voted on the proposed motion: 0 = negative, 1 = positive)

Since our research scope covers only the parliamentary speech and its sentiment, we focus on the manual speech labels.

3.2. Data cleaning and pre-processing

As the extraction of a polarity (or sentiment) score can depend heavily on certain text characteristics, pre-processing the text data can severely impact the performance of lexicon-based modules. As detailed in Hutto and Gilbert (2014), there are five generalisable sentiment intensity characteristics: punctuation (specifically, the exclamation mark "!"), capitalisation (e.g. using all caps in a text), amplifying the intensity of the text with mood-boosting words (e.g. using words like extremely or very), or a combination of all of these (e.g. "The food here is EXTREMELY GOOD!!!"). With this in mind, we pre-processed the text using only tokenisation (keeping the punctuation) and lemmatisation (using the UDPipe Lemmatizer).

3.3. Experiment settings

Most of the work was done in the Orange Data Mining Tool (https://orangedatamining.com/). Both the VADER and the Liu Hu sentiment modules are already incorporated in the Sentiment Analysis widget in Orange.

3.3.1. Sentiment analysis and performance comparison

Sentiment analysis was performed on the speeches with both the VADER and the Liu Hu sentiment modules. VADER outputs several scores: pos, neg, neu and compound. The compound feature is the combined score of all the other features and our main indicator of the sentiment of a text. For Liu Hu, the score is the difference between the sum of positive and the sum of negative words, normalised by the length of the document and multiplied by 100; the final score reflects the percentage of sentiment difference in the document (Demšar et al., 2013). It is important to note that the lexicons were not modified in any way.

Next, we mapped the sentiment scores output by both modules to their respective labels, positive and negative. This was done to match the scores in the gold standard, where each speech is labelled either 0 for negative or 1 for positive (and no neutral label exists). The main problem in this mapping therefore stemmed from speeches and motions with a score of 0 (thus regarded as neutral), which needed to be mapped as either positive or negative. Inspecting the distributions of the positive and negative classes in the dataset showed that the manually assigned speech labels are slightly skewed towards the positive class, with 705 positive speeches (56.4%) against 545 negative ones (43.6%). We therefore decided to map these speeches as positive, in favour of the majority class. After obtaining the labels (positive/negative), the last step was to compare the results of the sentiment analysis to the gold standard with the classification accuracy and F1 score evaluation metrics. For comparison, a majority-class baseline was added.
3.3.2. Descriptive analysis and topic modelling

As stated above, our research aimed not only to evaluate the performance of the two sentiment lexicons but also to investigate the sentiment of the UK parliamentary debates themselves. To this end, we also applied topic modelling to extract additional information on the topics of the analysed parliamentary speeches.
parliamentary speeches. Descriptive analysis of the res-
ized with MDS (Multidimensional scaling)), where the size
ults provided by the VADER and Liu Hu sentiment mod-
of the topic indicates Marginal Topic Probability (i.e. how
ules on parliamentary debates enables insight into the pos-
representative a topic is to a corpus or a cluster). To get the
itive speeches, resemblances and reasons for possible dif-
naming of the topics as accurate as we could, we used sev-
ferences between the results of the lexicons.
eral Orange widgets: t-SNE widget for the 2-D projection
The results of the sentiment analysis are presented with
of the speeches with similar topics, Extract keywords wid-
histogram of sentiment scores of both sentiment lexicons
get to extract 5 most common keywords in those speeches
(compound score for VADER and sentiment score by Liu
and Score documents widget to identify the names of the
Hu) to visualize the distributions of positive and negat-
documents the keywords occur in most often, inferring the
ive scored speeches.
Deriving from this we also per-
topic name from the title and content of the documents.
formed topic modelling on subsets of positive and negative
speeches to identify topics and see if they correspond to the
To facilitate topic modelling, the speeches first needed to be pre-processed: they were transformed to lowercase, tokenized, and lemmatized with the UDPipe Lemmatizer. Lastly, stopwords were filtered out using the list of stopwords provided by NLTK, together with a manually compiled additional list of stopwords7 for the procedural words that are very common in (procedural) parliamentary speech.

7 The additional list of stopwords is available at: https://drive.google.com/file/d/16kH_dV8HlUhctwmmsLn4F9zOkmJyqgg5/view?usp=sharing
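As an illustration, the described pipeline could be sketched as follows in Python with NLTK; the UDPipe-based lemmatizer is abstracted as a callable, and the custom stopwords shown are invented stand-ins for the list in footnote 7:

    import nltk
    from nltk.corpus import stopwords

    # nltk.download("punkt"); nltk.download("stopwords")  # first run only
    CUSTOM_STOPWORDS = {"hon", "gallant", "friend"}  # stand-ins for the footnote-7 list
    ALL_STOPWORDS = set(stopwords.words("english")) | CUSTOM_STOPWORDS

    def preprocess(speech, lemmatize):
        """lemmatize: a callable wrapping a UDPipe model (token -> lemma)."""
        tokens = nltk.word_tokenize(speech.lower())   # lowercase + tokenize
        lemmas = (lemmatize(t) for t in tokens)       # UDPipe-style lemmatization
        # keep alphabetic lemmas that are not NLTK or procedural stopwords
        return [l for l in lemmas if l.isalpha() and l not in ALL_STOPWORDS]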
For topic modelling we used the Latent Dirichlet Allocation (LDA) method to extract the keywords of the speeches and their topics. As LDA does not by itself give the optimal number of topics for a text, the exact number of topics needs to be determined by the model user (Gan and Qi, 2021). We therefore experimented with different numbers of topics in the range from 5 to 11, with the Topic Coherence metric serving as our pointer. This specific range was chosen to ensure a high enough granularity of the keywords in the topics (i.e., no fewer than 5 topics) while at the same time keeping the keywords in the topics coherent. The topic coherence score represents the "degree of semantic similarity between high-scoring words in the topic to help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference" (Stevens et al., 2012). Table 1 shows the fluctuation of the Topic Coherence score under different settings for all chosen subsets (the positive and negative clusters produced by VADER and Liu Hu), with the starred values representing the optimal number of topics for each subset.

Number of topics   VADER positive   VADER negative   Liu Hu positive   Liu Hu negative
 5                 0.281            0.244            0.267             0.252
 6                 0.272            0.256            0.275             0.244
 7                 0.263            0.282            0.264             0.250
 8                 0.268            0.276            0.275             0.260
 9                 0.251            0.260            0.265             0.256
10                 0.265            0.303*           0.276*            0.279*
11                 0.284*           0.270            0.265             0.259

Table 1: Topic Coherence scores of the positive and negative subsets; the starred values mark the optimal number of topics for each subset.
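The paper performed this sweep in Orange; purely as an illustration, an equivalent selection of the number of topics could be scripted with gensim (the library choice and the c_v coherence variant are our assumptions, as the exact settings of the Orange widget are not stated):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    def pick_number_of_topics(docs, k_min=5, k_max=11):
        """docs: pre-processed speeches, each a list of tokens."""
        dictionary = Dictionary(docs)
        corpus = [dictionary.doc2bow(doc) for doc in docs]
        coherence = {}
        for k in range(k_min, k_max + 1):
            lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
            cm = CoherenceModel(model=lda, texts=docs,
                                dictionary=dictionary, coherence="c_v")
            coherence[k] = cm.get_coherence()
        # the k with the highest coherence plays the role of the starred cells in Table 1
        return max(coherence, key=coherence.get), coherence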
4. Results
4.1. Sentiment analysis results
In this section we present the results of the sentiment analysis done with VADER and Liu Hu. Figure 1 compares the distributions of positive and negative speeches identified by the VADER (Figure 1a) and Liu Hu (Figure 1b) sentiment lexicons.
Even at first glance, we can see that the VADER results lean heavily towards the positive class. The compound score ranges from -0.9987 (the score of the most negative speech) to 0.9992 (the score of the most positive speech). Most speeches in the dataset (617 speeches, 49.32%) were classified by VADER as extremely positive, in the range from 0.8 to 1 of the compound score. On the other hand, only 124 speeches (9.91%) were deemed extremely negative, in the range from -0.8 to -1.
Figure 1b represents the results obtained using the Liu Hu sentiment lexicon. While VADER uses a scale from -1 to 1, Liu Hu computes the sentiment score by preserving 0 as the neutral value and deems everything below 0 negative and everything above 0 positive. As can be seen from the figure, the distribution of sentiment in the speeches differs greatly from the VADER results. The most negative speech has a sentiment score of -6.976 and the most positive a score of 8.1967, with most speeches (353 speeches, 28.22%) positioned on the sentiment score spectrum from 0 to 1. Of those, 216 speeches were scored with 0 (neutral speeches).
In its entirety, more than 75% of the speeches were deemed positive by VADER (948 speeches, 75.78%). Similarly, Liu Hu deemed almost 70% of the speeches positive (867 speeches, 69.30%). For the topic modelling, each set was split into a positive and a negative subset:
• VADER subset of positive speeches: 948 speeches (75.78%)
• VADER subset of negative speeches: 303 speeches (24.22%)
• Liu Hu subset of positive speeches: 867 speeches (69.30%)
• Liu Hu subset of negative speeches: 384 speeches (30.70%)
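To make the two scales concrete, here is a minimal sketch of both scorers (NLTK's VADER implementation and a Liu Hu-style count over the Hu–Liu opinion lexicon; note that the fractional Liu Hu scores reported above indicate that the Orange module additionally rescales the raw count, e.g., by speech length):

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer
    from nltk.corpus import opinion_lexicon

    # nltk.download("vader_lexicon"); nltk.download("opinion_lexicon")
    sia = SentimentIntensityAnalyzer()
    POSITIVE = set(opinion_lexicon.positive())
    NEGATIVE = set(opinion_lexicon.negative())

    def vader_compound(speech):
        return sia.polarity_scores(speech)["compound"]   # bounded to [-1, 1]

    def liu_hu_count(tokens):
        # raw positive-minus-negative count: 0 neutral, > 0 positive, < 0 negative
        return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)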
4.2. Topic modelling results
The topics identified with the LDA method are visualized with MDS (multidimensional scaling), where the size of a topic indicates its Marginal Topic Probability (i.e., how representative a topic is of a corpus or a cluster). To make the naming of the topics as accurate as we could, we used several Orange widgets: the t-SNE widget for the 2-D projection of speeches with similar topics, the Extract Keywords widget to extract the 5 most common keywords in those speeches, and the Score Documents widget to identify the names of the documents in which the keywords occur most often, inferring the topic name from the title and content of those documents.
The results are presented in two parts, using MDS to aid in the visualization of the topics and their labels. The first part focuses on a comparison of the topics in both positive clusters, while the second presents the topics and trends identified in the negative clusters.
(a) VADER (compound score)   (b) Liu Hu (sentiment score)
Figure 1: Results of the sentiment analysis and distribution of positive and negative speeches.

(a) VADER   (b) Liu Hu
Figure 2: Comparison of the topics identified in the positive speeches between VADER and Liu Hu.
As can be seen from Figures 2a and 2b, the largest cluster of keywords detected among the positive speeches produced by VADER belongs to the topic House procedures8, which consists of words that are very common throughout the corpora, e.g., member, house, bill, parliament, etc. In the results produced by Liu Hu, the largest topic, Electoral Commission, is relatively similar to House procedures: most of the keywords emphasised above are still present, together with two explicit keywords that define the nature of the topic, election and change. The two topics are also linked together (MDS enables the linking of semantically similar topics), which makes the closeness of the keywords in both topics even clearer. The topic Electoral Commission appears in both positive clusters. In addition to the aforementioned Electoral Commission, topics like EU membership, School funding and NHS funding also appear in both positive subsets.

8 Based on the documents that contain most of the keywords in the topic, its full name corresponds best to The Business of the House, though the name was shortened for easier visualization.
The keywords and topics identified in the negative speeches are shown in Figures 3a and 3b.
With a Marginal Topic Probability score of 0.175, the most common keywords in the VADER negative subset are found in the topic State pension age, followed closely by Armed forces (score of 0.172), Prisons and probation (0.150) and Police Officer Safety. MDS also showed that several topics are very closely related to one another: e.g., the topic Armed forces is closely related to both the House procedures and Terrorism bill topics. Similarly, and not surprisingly, a strong connection is also found between the keywords in State pension (Women) and State pension age (Women). Lastly, a strong similarity is shown between the keywords in Police Officer Safety and Prisons and Probation. In the Liu Hu negative speeches, the most represented topic is State pension (Women) with a Marginal Topic Score of 0.163, followed closely by EU Membership with a score of 0.159 and Homelessness with 0.114. All three topics (or, rather, their keywords) are also connected amongst themselves. For both the VADER and the Liu Hu negatively scored speeches, the most present keywords are found in the topics State pension and State pension age (closely connected topics that share many common keywords). In addition, several other topics can be found in both subsets, e.g. Armed forces, Police Grant and House procedures.
In general, the keywords of the identified topics mostly corresponded to the general sentiment of the topics in their respective subsets, even though in several cases keywords (and topics) appeared in the positive as well as in the negative speeches. This is most likely due to the fact that parliamentary debates usually feature heavy position-taking with regard to a certain motion.
The topics in the negative speeches were harder to identify than those in the positive speeches, mostly due to the larger subset, as well as the fact that the keywords were very fragmented. This can be seen in the negative clusters, where the Marginal Topic Scores of most topics (aside from the two or three very well represented ones) are not high and lie in the lowest score range. While the topics were in general harder to identify, most topics that were strongly present in the speeches had very obvious keywords. On the other hand, the topics in the positive speeches were easier to identify, although there were some exceptions, as some of the keywords (even though many stopwords were removed) were too general to pinpoint by human perception alone.

4.3. VADER and Liu Hu performance evaluation
To evaluate the performance of the sentiment modules we used the following evaluation metrics: classification accuracy and F1 score. A related study (Abercrombie and Batista-Navarro, 2018b) used the dataset to develop a two-step model for the sentiment analysis task: the authors trained an SVM and an MLP to produce a one-step Speech model and a two-step Motion–Speech model, using different features (text only, text and metadata). The results for the one-step Speech model with text-only features (evaluated with 10-fold validation) were added to Table 2 for comparison.

                  Acc (%)   F1 score
VADER             52.0      0.49
Liu Hu            50.0      0.47
Baseline          56.5      0.56
SVM (text only)   66.7      0.718
MLP (text only)   67.3      0.713

Table 2: Performance results of VADER and Liu Hu, accompanied by the baseline and by the SVM and MLP results from the related study.
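For completeness, the evaluation described here is straightforward to reproduce; a sketch with scikit-learn (the F1 averaging mode is our assumption, as the paper does not specify it):

    from collections import Counter
    from sklearn.metrics import accuracy_score, f1_score

    def evaluate(gold, predicted):
        """gold, predicted: lists of 0/1 speech labels (negative/positive)."""
        return accuracy_score(gold, predicted), \
               f1_score(gold, predicted, average="weighted")

    def majority_baseline(gold):
        # predict the most frequent gold label for every speech
        majority = Counter(gold).most_common(1)[0][0]
        return [majority] * len(gold)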
The performance of the VADER and Liu Hu sentiment lexicons is poor, not even surpassing the baseline score. However, if we want to put the results into perspective, we need to consider the nature of parliamentary debates and parliamentary language. The language of parliamentary debates is, as stated previously, complex: the speeches in particular are long and full of visible features of political procedure (such as courtesy naming, e.g., hon. Friend, hon. Lady ...).
The very poor performance scores show that sentiment lexicons (in their current, unmodified state) are not the best methodology for extracting sentiment polarity from parliamentary debates. In comparison, the study detailed in (Abercrombie and Batista-Navarro, 2018a) achieved much better results even when using just the text features (as shown in Table 2).
To investigate the reason for such poor performance, we analysed several speeches in detail. Below is an example and one of the possible explanations for the misclassifications:
"Our national health service is, and always has been, valued and cherished by my constituents who rightly expect an excellent standard of care to be provided free at the point of use when they need treatment. We are all deeply committed to the future of the NHS, but to ensure that it can continue to provide the quality of care that our constituents expect, it cannot stand still. [...] What is certain is that the current model through which health services in Calderdale and Huddersfield are delivered is not sustainable in the long term, and that changes are needed to ensure that we have a local health service that continues to provide excellent care."
The speech contains words that could influence the scoring in a positive way: VADER scored this speech with 0.9992 (making it one of the most positive speeches identified by VADER), while Liu Hu scored it with 1.578. The words in bold are all included in the VADER lexicon with high positive scores; e.g., committed has a score of 1.1, valued of 1.9, cherished of 2.3 and excellent of 2.7. The speech could therefore be perceived as positive, even though it is in reality negative, as it emphasises that the current model of health services is not sustainable in the long term. Similarly, Liu Hu includes the words cherished, quality, free and excellent in its list of positive words, but does not include words like valued or committed (thus making them neutral). According to Liu Hu, the sentiment of this text is therefore still positive, less so than with VADER, but the process and the reason for the misclassification are mostly the same.
(a) VADER   (b) Liu Hu
Figure 3: Comparison of the topics in the negative speeches between VADER and Liu Hu.
5. Conclusions
In this paper we applied sentiment-based approaches (VADER and Liu Hu) to parliamentary data, with the aim of exploring how these two modules handle sentiment detection on longer, less expressive and more formal language than the social media language for which both sentiment modules are optimized. While both VADER and Liu Hu were able to correctly identify the general sentiment of some topics present in the negative and positive clusters (e.g., matching the keywords of the Euthanasia topic to the negative cluster), the speeches themselves are very polarizing in nature. This can most clearly be seen in the fact that some topics were identified in both the positive and the negative clusters: e.g., topics like School funding and NHS funding were identified in both positive and negative speeches, as both can be viewed from different (positive or negative) standpoints.
The most probable reason for the misclassifications is the length of the speeches, as well as the fact that the speeches are not extremely expressive and often use a larger number of positive boosting words to express negativity. The language of parliamentary discourse can be extremely complex, mostly due to the esoteric speaking style and opaque procedural language of Parliament (Abercrombie and Batista-Navarro, 2018b). Distinguishing between the positive and negative polarity of parliamentary debates can be a difficult task even for human annotators, which was shown by the poor inter-annotator agreement score in the first round of annotation of the HanDeSet dataset, detailed in (Abercrombie and Batista-Navarro, 2018a). The same can be said for lexicon-based approaches to sentiment analysis, though despite the poor performance scores the lexicons still gave us some insight into the general sentiment around topics and into the characteristics of parliamentary speech. As can be seen from the poor performance evaluation results, sentiment-based approaches like Liu Hu and VADER alone do not suffice when dealing with such specific text data, at least not in their unmodified state. Better results could possibly have been achieved by modifying the lexicons to incorporate some of the characteristics of parliamentary debates (e.g., adding new words and changing the scoring of existing ones).
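Such a modification is easy to prototype with NLTK's VADER implementation, whose lexicon is exposed as an ordinary dictionary; the words and valence values below are invented purely for illustration:

    from nltk.sentiment import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    # hypothetical domain adjustments: neutralise courtesy vocabulary that
    # carries no real sentiment in Hansard, and add missing domain words
    sia.lexicon.update({
        "friend": 0.0,           # "my hon. Friend" is procedural, not praise
        "cherished": 0.5,        # tone down generic positive boosters
        "unsustainable": -1.5,   # frequent negative signal in policy debates
    })
    print(sia.polarity_scores("The current model is not sustainable.")["compound"])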
6. Acknowledgments
The paper was written in the framework of the research programme P2-0103 (B): Tehnologije znanja (Knowledge Technologies), co-financed by the Slovenian Research Agency (ARRS) from the state budget, and of the Slovenian research infrastructure CLARIN.SI (Common Language Resources and Technology Infrastructure, Slovenia).

7. References
Gavin Abercrombie and Riza Batista-Navarro. 2018a. 'Aye' or 'No'? Speech-level Sentiment Analysis of Hansard UK Parliamentary Debate Transcripts. In: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, and T. Tokunaga, eds., Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Gavin Abercrombie and Riza Theresa Batista-Navarro. 2018b. A Sentiment-labelled Corpus of Hansard Parliamentary Debate Speeches. In: D. Fišer, M. Eskevich, and F. de Jong, eds., Proceedings of the LREC 2018 Workshop ParlaCLARIN, Miyazaki, Japan. European Language Resources Association (ELRA).
Rosario Catelli, Serena Pelosi, and Massimo Esposito. 2022. Lexicon-based vs. BERT-based sentiment analysis: A comparative study in Italian. Electronics, 11(3):374.
Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, and Blaž Zupan. 2013. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 14:2349–2353.
Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx, and Darja Fišer. 2022. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, pages 1–34.
Darja Fišer and Kristina Pahor de Maiti. 2020. Voices of the Parliament. Modern Languages Open.
Jingxian Gan and Yong Qi. 2021. Selection of the Optimal Number of Topics for LDA Topic Model—Taking Patent Policy Analysis as an Example. Entropy, 23(10):1301.
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.
Clayton Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the International AAAI Conference on Web and Social Media, pages 216–225.
UK Parliament. 2022. The two-House system.
Sakala Venkata Krishna Rohit and Navjyoti Singh. 2018. Analysis of speeches in Indian parliamentary debates. arXiv:1808.06834.
Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961.
Naomi Truan and Laurent Romary. 2021. Building, Encoding, and Annotating a Corpus of Parliamentary Debates in XML-TEI: A cross-linguistic account. Journal of the Text Encoding Initiative.
Evaluative Categorisation of Automatically Extracted Pairs of Antonyms
(Evalvacijska kategorizacija strojno izluščenih protipomenskih parov)
Tina Mozetič,* Miha Sever,* Martin Justin,* Jasmina Pegan‡
* Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
tina.mozetic11@gmail.com, mihasever98@gmail.com, martin1123581321@gmail.com
‡ Fakulteta za računalništvo in informatiko, Univerza v Ljubljani
Večna pot 113, 1000 Ljubljana
jp2634@student.uni-lj.si
Abstract
This paper aims to assess the relevance of automatically extracted antonym pairs that are to be included in the expanded Thesaurus of Modern Slovene. The former structuralist conception of antonymy is shifting to a more modern one, based on advanced computational methods, openness, crowdsourcing, and the relevance and usability of data. In this study, we reviewed 2,852 automatically extracted pairs of antonyms. Examples that were not unanimously classified as antonyms or non-antonyms by the evaluators are grouped into 21 categories. For each category, it is determined whether its pairs should be included in the responsive dictionary. The automatic procedure proved successful, as 88% of the extracted pairs can be included in the dictionary. In the future, the categories will also be useful for drawing up guidelines and for developing further methodology for the automatic extraction of antonyms.
1. Introduction
With its 105,473 headwords and 368,117 synonyms, the Thesaurus of Modern Slovene (Slovar sopomenk sodobne slovenščine) is "the most extensive freely accessible automatically generated collection of synonyms for Slovene" (Sopomenke 1.0, 2022). The dictionary works on the principle of a responsive dictionary, which is prepared entirely automatically in the first step. The automatically prepared data are published as soon as linguistic evaluation confirms that they are adequate in principle, i.e. relevant for the community; the dictionary is then developed further in steps and in collaboration between linguists and the wider interested public (Arhar Holdt et al., 2018). In the project Upgrading the Fundamental Dictionary Resources and Databases of CJVT UL (Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL) we will add antonyms to the synonyms, and these antonyms require the same kind of linguistic evaluation of relevance.
The aim of this paper is thus to assess the relevance of automatically extracted antonym pairs for inclusion in the expanded Thesaurus of Modern Slovene. We are primarily interested in which part of the data (1) is suitable for direct inclusion in the dictionary, (2) is not suitable for inclusion, and (3) requires additional consideration. The paper deals in more detail with the third point, where we show that the "problematic" examples can be categorised by the type of problem, so as to determine whether they (a) can be improved automatically, (b) perhaps require an editorial decision, (c) can be improved through the sense division of the entry or through usage labels, (d) can be improved with the help of the community, or (e) can be left in the pool of dictionary material despite a given problem, counting on the users to judge their usefulness themselves.
The problem categories formed in this way will serve as a starting point for further work on the project, which comprises upgrading the extraction methodology, preparing guidelines for the editorial treatment of antonyms, and including the antonyms in the Thesaurus of Modern Slovene. The manually reviewed antonym pairs will be used as a training set for the further extraction of antonyms from the Gigafida 2.0 corpus (Krek et al., 2020). Our analysis will also be very useful for drawing up the guidelines, since we have identified problems for which principled editorial solutions will have to be provided.
In the second section of the paper we thus first present linguistic research on antonymy and the concept of the responsive dictionary. In the third we briefly describe the methods of data acquisition and annotation. In the fourth section we present and analyse the annotation results: we first present the annotators' decisions on the adequacy of the antonym pairs and then describe in more detail each of the problem categories into which the "problematic" examples were placed during annotation. For each category we also report how frequent it is and assess how the identified problem could be addressed. In the concluding section we summarise the main findings of the paper.
2. Overview of the field
Linguistics considers antonymy, alongside synonymy, a fundamental inter-lexeme semantic relation (Stramljič Breznik, 2010; Humar, 2016; Vidovič Muha, 2005, 2021). In contrast to synonyms, antonyms necessarily occur binarily, i.e. in pairs, and are always part of a common conceptual or even semantic field (Vidovič Muha, 2021). In Slovene terminology the terms protipomenka and antonim (and protipomenskost and antonimija) have become established as equivalents, although the Slovenski pravopis 2001 gives precedence to protipomenka (Humar, 2005).
Defining the antonym is relatively straightforward. According to the SSKJ (2014), an antonym is "a word with the opposite meaning in relation to another word", and Toporišič (2001) defines it in the same way. Marjeta Humar (2016) broadens the definition to "designations of concepts with a mono- or polysemous word or phrase, [in which] the semantic components of the concepts (usually one in each of the two) are in an antonymic relation, expressed by two monosemous words, by two monosemous phrases, or by individual senses of two polysemous words or phrases" (22).
In contrast to its definition, the semantic typological classification of antonyms presents a major obstacle: there are as many such classifications as there are scholars who have dealt with them. Linguists themselves are aware of this problem (see Humar, 2016); their main task would be to determine the boundaries of antonymy (Gao and Zheng, 2014), which differ considerably from one scholar to another.
Marjeta Humar (2016) ranks Lyons, Apresjan and Novikov among the pioneering and most important linguistic researchers of antonymy. Lyons identified three types of opposites, each stemming from one of the following characteristics: complementarity, antonymy and converseness. He distinguishes antonymy in the narrow and in the broad sense; in the narrow sense he includes only polar antonymy, which is for him the purest form of antonymy. Apresjan analysed antonyms much more thoroughly and also drew attention to quasi-antonyms, which do not have equally opposite meanings. Novikov, on the other hand, divided antonymy into contrary opposition, as the most frequent form, and complementary and vector opposition. Among quasi-antonyms he placed semantically unequal, disproportionate, asymmetrical, stylistically heterogeneous and temporally different antonyms expressing other oppositions.
In the Slovene context, the classification of A. Vidovič Muha (2005, 2021) has become the most established; she defines antonymy as semantic opposition or complementary contradiction, and takes as the starting point of her typology the influence of antonyms on actant roles within the sentence. Within this framework she divides antonyms into:
- converse (zamenjavne),
- complementary (dopolnjevalne),
- polar (skrajnostne), with a subgroup of gradual (stopnjevalne), and
- vector (usmerjene) antonyms.
Roughly speaking, categorisation is thus based either on equivalent groups of antonyms or on a more-antonymous vs. less-antonymous axis (narrow : broad sense, true antonyms : quasi-antonyms, complete : incomplete, non-sharp : sharp opposition, binary : non-binary opposition, expression of opposition : stylistic device) (Humar, 2016).
The structural division of antonyms is clearer. In the Slovene context it has been studied most, from the word-formation point of view, by Irena Stramljič Breznik (2010), who divides antonyms into same-root (also grammatical or derivational) and different-root (also lexical) antonyms.
The Slovene, and more broadly the former Yugoslav, scholarly space long paid little attention to antonymy (Humar, 2016), which is also reflected in the main Slovene language reference works. The SSKJ marked 87 lexemes with the label ant. (antonym); these fall among qualitative (polar, extreme) antonyms, while it does not record vector or complementary ones (Humar, 2016). Toporišič (1976) mentions antonymy in his grammar only in passing, with the antonymic adjective, and presents it briefly later, in the fourth, revised edition of 2001. Even though antonyms are lexicographically recognised as an important factor in determining the correct senses of words (Toporišič, 2001), Slovene still has no dictionary of antonyms. It does, however, have two dictionaries of synonyms: the Sinonimni slovar slovenskega jezika (SSSJ), published by ZRC SAZU, and the online Slovar sopomenk sodobne slovenščine (SSSS), created under the auspices of the Centre for Language Resources and Technologies (CJVT). The past lexicographic description of Slovene leaned on the structuralist tradition of the SSKJ, which was also followed by the most prominent Slovene researchers of antonymy to date (Jože Toporišič, Ada Vidovič Muha, Irena Stramljič Breznik, Marjeta Humar).
Social changes resulting from digitalisation and the development of information and communication technology have created the need for a completely different lexicographic description of Slovene, on the basis of which new language resources and technologies could be built. With the advent of the internet, lexicography is confronted with ever faster language change. On the one hand it faces the question of how to present dictionary content to language users under these changed circumstances; on the other hand it faces new language practices, which it finds increasingly difficult to capture and describe in real time (Gantar et al., 2016). Modern language users increasingly demand immediate access to dictionary content for the contemporary language, so lexicographic analyses must be carried out ever faster, but with the same quality (Gantar et al., 2016). We are moving from the traditional lexicographic model to a more modern one, in which dictionary content is based on advanced computational methods, openness, crowdsourcing, and the relevance and usability of data.
Thus, on the one hand, the entirely manual approach to data extraction has been replaced by a semi-automatic one, which is not only less time-consuming and less expensive but also provides additional, potentially useful data for deciding on the inclusion of lexemes in the dictionary. The role of the lexicographer does not change: they remain the decision-maker at all levels of deciding on dictionary inclusion; what changes is the way lexical data are obtained and presented (Gantar et al., 2016). A similar extraction principle was used in the preparation of the SSSS. Lexical relations are usually extracted from a database built from several sources; the SSSS is thus based on data extracted from the Gigafida corpus and the Veliki angleško-slovenski slovar OXFORD-DZS (Arhar Holdt et al., 2018). Abroad, the preparation of corpus-based antonym dictionaries has already moved to automatic extraction (Wang et al., 2010; Lobanova et al., 2010; Aldhubayi and Alyahya, 2014).
On the other hand, the SSSS also operates on the concept of a responsive dictionary: an openly accessible collection of relevant but not yet fully cleaned data. The language community participates in building the cleaned database, so the making of the dictionary is never finished, as it is co-created in step with changing linguistic reality. Besides co-creating, language users also evaluate potential headwords with their responses (Arhar Holdt et al., 2018).
Users see the advantages of the concept in its transparency, accessibility, rapid adaptation to the contemporary state of the language, co-creativity, ease of use and the ordering of headwords (Kojc et al., 2018; Kamenšek Kranjc et al., 2018). A modern dictionary of antonyms should follow the same principles.

3. Methodology
3.1. Data acquisition
We compiled the antonym dataset from several sources. The procedure is described in more detail in the BA thesis (Pegan, 2019), with the exception of the final step, the deletion of duplicate records, which was added later. We obtained the bulk of the antonym data from the sloWNet database (Fišer, 2015), and a smaller part (87 pairs) from the SSKJ dictionary, accessible on the Fran dictionary portal.
The sloWNet database is in XML form; a simplified example of a record for a set of synonyms (synset) looks as follows:

<SYNSET>
  <ID>eng-30-00001740-a</ID>
  ...
  <SYNONYM>
    <LITERAL>sposoben</LITERAL>
    <LITERAL>zmožen</LITERAL>
  </SYNONYM>
  ...
  <ILR type="near_antonym">eng-30-00002098-a</ILR>
  ...
</SYNSET>

(The synset above corresponds to the Princeton WordNet adjective able; its Slovene literals are sposoben and zmožen, and the near_antonym link points to the opposite synset eng-30-00002098-a.)
For each synset we looked up its antonymous synset via the 'near_antonym' element. We used all combinations in which one word is in the source synset and the other in the antonymous synset. In this way we obtained 4,514 antonym pairs.
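Purely as an illustration, the extraction step could be scripted as follows; the element names (SYNSET, ID, SYNONYM, LITERAL, ILR with a near_antonym type) follow the simplified record above and are an assumption about the exact sloWNet schema:

    import xml.etree.ElementTree as ET
    from itertools import product

    def slownet_antonym_pairs(path):
        """Pair every literal of a synset with every literal of its near_antonym synsets."""
        root = ET.parse(path).getroot()
        literals = {}   # synset ID -> list of Slovene literals
        links = []      # (synset ID, antonymous synset ID)
        for synset in root.iter("SYNSET"):
            sid = synset.findtext("ID")
            literals[sid] = [lit.text for lit in synset.iter("LITERAL")]
            for ilr in synset.iter("ILR"):
                if ilr.get("type") == "near_antonym":
                    links.append((sid, (ilr.text or "").strip()))
        pairs = set()
        for sid, aid in links:
            for w1, w2 in product(literals.get(sid, []), literals.get(aid, [])):
                pairs.add((w1, w2))   # all cross-synset word combinations
        return pairs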
From the SSKJ we collected all entries that also list antonyms. A simplified example of a record is shown below:

abstrákten
  ...
  ant. konkreten: ...

In total, we extracted 87 antonym pairs from the SSKJ. Because of this small number, we expanded the antonym data by adding pairs of words with the prefixes ne-, proti- and brez-. Examples of pairs obtained in this way are dostopen – nedostopen, ustaven – protiustaven and alkoholen – brezalkoholen. We partly cleaned these data manually, removing nonsensical combinations such as no – brezno, and removed words for which no word embeddings were available within the BA thesis. This gave us 1,340 antonym pairs. In addition, we also considered antonym pairs in which one of the two words is replaced by its synonym, which increased the set to 4,113 antonym pairs. After deleting duplicate records in which the two words are merely swapped, we obtained a set of 2,852 antonym pairs.
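A minimal sketch of the two mechanical steps just described, prefix-based expansion and removal of swapped duplicates (the handling of prefixes is simplified; real Slovene morphology is more involved):

    PREFIXES = ("ne", "proti", "brez")

    def prefix_pairs(words):
        """Candidate pairs word - PREFIX+word, kept only if both forms occur in the word list."""
        vocabulary = set(words)
        return {(w, p + w) for w in vocabulary for p in PREFIXES if p + w in vocabulary}

    def drop_swapped_duplicates(pairs):
        # (a, b) and (b, a) count as the same antonym pair
        return {tuple(sorted(pair)) for pair in pairs}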
3.2. Data annotation
The study included 2,852 antonym pairs. Each of the six reviewers examined all the examples in an individual Google Sheet, assigning each pair one of the options d, g and n. The label d indicates that the two words are antonyms, the label n that they are not, and the label g that the pair is problematic and needs to be examined in more detail. The annotators received no detailed instructions beforehand as to what counts as antonymous and what does not: the purpose of the first step was precisely to identify, on the basis of the material, the problematic areas that could then be analysed in more detail.
During the review we recorded examples and incrementally formed 19 problem categories. We then assigned each problematic pair one main problem and, where applicable, an additional one. We divided the data among ourselves [...]. During the review we added two further categories, (Im)perfective deverbal derivatives and Action and state, as these proved problematic only after a more detailed analysis of all the examples.

4. Results and analysis
After the first round of review, we unanimously confirmed 1,124 (39.4%) pairs as antonymous and only 22 (0.8%) examples as non-antonymous. For the remaining pairs (1,706; 59.8%), at least one reviewer decided differently from the others, so we marked such examples for further analysis. In the second round of review it turned out that some examples were problematic only from a very specific point of view, or that an example had been falsely marked as problematic. Decisions also had to be changed for some already confirmed pairs, as a closer look showed them to be problematic. The category of confirmed antonyms thus grew to 1,207 (42.3%) examples, while there were 48 (1.7%) confirmed non-antonymous pairs. We sent 1,597 (56%) examples into further analysis, as shown in Table 1.

Label             Share
Antonyms          42.3%
Not antonyms      1.7%
Further review    56.0%

Table 1: Results after the second round of annotation.

The further investigation focuses only on the examples (1,597; 56%) that proved problematic after the second round of review. We divided them into 21 categories, shown in Table 2, where for easier understanding each category is illustrated with an example of a word pair that we judged. The table also shows how many times each category appeared as the main and as an additional problem. We assigned a main problem to all 1,597 examples, while we identified an additional problem in 668 (41.83%) of them, which represents 23.46% of the entire material.
Table 2 shows that the most frequent main problem is Rarity and contextual boundedness of senses (31.87%). The categories Negation with the prefixes ne- and brez- (10.58%), Inconsistency between borrowed and nativised forms (10.33%) and Markedness and/or rarity of the word (9.83%) also appear frequently. The categories that appeared least frequently as problematic were Typos (0.31%), Other (0.38%) and Semantically weak verbs (0.44%).

Category | Example | Main problem (n) | % | Additional problem (n) | %
Typos | čistost – nečistot | 5 | 0.31 | / | /
Wrong lemmas | alkoholne – brezalkoholne | 40 | 2.50 | 3 | 0.45
Different part of speech | dopoldne – popoldanski | 16 | 1.00 | / | /
(Im)perfectivity | narasti – zniževati | 87 | 5.45 | 2 | 0.30
(In)definiteness | bližnji – daljen | 11 | 0.69 | / | /
Non-existent word-formation variants | pritrjevanje – zanikanost | 54 | 3.38 | 7 | 1.05
Negation with the prefixes ne-, brez- | občutljivost – nedražljivost | 169 | 10.58 | 201 | 30.09
Inconsistency between borrowed and nativised forms | aktiv – trpnik | 165 | 10.33 | 36 | 5.39
(Im)perfective deverbal derivatives | zmanjšanje – povečanje | 32 | 2.00 | 4 | 0.60
Action and state | brezposelnost – zaposlitev | 18 | 1.13 | 2 | 0.30
Reflexivity | ubogati – upirati (se) | 53 | 3.32 | 17 | 2.54
Semantically weak verbs | manjkati – biti (prisoten) | 7 | 0.44 | 2 | 0.30
Semantically full words | pridobiti – odreči (soglasje) | 15 | 0.94 | 2 | 0.30
Gender as an "antonym" | kralj – kraljica; dolžnica – upnik | 60 | 3.76 | 3 | 0.45
Markedness and/or rarity of the word | ata – mati; nenavadno – često | 157 | 9.83 | 79 | 11.83
Homographs and polysemes | bistrost – motnost | 76 | 4.76 | 20 | 2.99
Rarity and contextual boundedness of senses | bogat – neploden | 509 | 31.87 | 246 | 36.83
Properties that are not antonymous but often used as such | krivulja – premica | 38 | 2.38 | 11 | 1.65
Indirect synonyms | glasen – nem | 40 | 2.50 | 5 | 0.75
Gradation examples | prihodnji – sedanji | 39 | 2.44 | 17 | 2.54
Other | ofenziven – nespotakljiv | 6 | 0.38 | 11 | 1.65

Table 2: The 21 categories and their occurrences as the main and as an additional problem.
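The triage logic of the two annotation rounds can be summarised in a few lines; this is an illustrative sketch, not the authors' actual tooling (the review itself was done in Google Sheets):

    from collections import Counter

    def triage(labels):
        """labels: one pair's labels from all six annotators, each 'd', 'n' or 'g'."""
        counts = Counter(labels)
        if counts["d"] == len(labels):
            return "antonyms"        # unanimously confirmed
        if counts["n"] == len(labels):
            return "not antonyms"    # unanimously rejected
        return "further review"      # any disagreement, or any 'g' label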
4.1. Typos
The category Typos comprises pairs in which at least one of the words is unambiguously mistyped and thus cannot be the same or any other lexeme in any form. Table 2 shows that this category appeared only five times (0.31%) as the main problem and never as an additional one. It is one of the most problematic categories, since misspelled words cannot be included in the dictionary.
Examples: čistost – nečistot, izginti – pojaviti, izvažati – uvžati.

4.2. Wrong lemmas
Wrong lemmas covers examples that may be morphologically matching but are in a non-dictionary form. Table 2 shows that this category appeared in 40 cases (2.50%) as the main and three times (0.45%) as an additional problem. Such examples must be removed from the list of pairs for inclusion in the dictionary, or converted into the proper dictionary form.
Examples: alkoholne – brezalkoholne, dolžna – nedolžna, finančne – nefinančne.

4.3. Different part of speech
The category Different part of speech concerns word pairs whose members belong to different parts of speech (e.g. noun and adjective, adjective and adverb). As the main problem it appeared in 16 pairs (1.00%), and never as a secondary one. In most cases the two words are not antonyms; a dilemma arises only with pairs of the noun–adjective type, since these are most often nominalised adjectives (of the type delavnik – fraj). In such cases the words can be used antonymously, in a suitable context of course. Pairs from this category are removed from the list for inclusion in the dictionary; the exception are pairs of the noun–adjective type, which are reviewed manually and included with the necessary labels.
Examples: dopoldne – popoldanski, znotraj – ven, delavnik – fraj.

4.4. (Im)perfectivity
(Im)perfectivity concerns verb pairs with different verbal aspect: one verb is imperfective and the other perfective. Such pairs were recognised as the primary problem in 87 (5.45%) cases and twice (0.30%) as a secondary one. Clearly, the best antonym for a verb is a verb with the same aspect, but the dilemma remains with verbs that are semantically suitable but differ in aspect. It would (at least at first sight) make sense to remove such pairs.
Examples: napasti – braniti, narasti – zniževati, natovoriti – iztovarjati.

4.5. (In)definiteness
This category comprises adjective pairs in which one adjective is in the definite and the other in the indefinite form. It appeared in 11 (0.69%) cases as the main problem and never as an additional one. Since the problem is largely tied to the characteristics of lemmatisation for Slovene, which lemmatises adjectives to the indefinite form except where this is not possible (the pairs are in principle semantically antonymous), it would make sense to keep such material in the dictionary.
Examples: bližnji – daljen, mesten – podeželski, oddaljen – bližnji.

4.6. Non-existent word-formation variants
These are examples that are semantically suitable, but the problem is that one of the words (or both) does not exist. As the primary problem this category appeared in 54 (3.38%) cases and as a secondary one in seven (1.05%). In our judgement, this category does not belong in the dictionary, as these are words that are not in actual use. Already at the stage of extracting antonym candidates one could add a step that checks each word against a reference corpus and adds a warning for those that do not occur (see the sketch following these subsections).
Examples: pritrjevanje – zanikanost, eleganca – neelegantnost, nelaskav – podrepniški.

4.7. Negation with the prefixes ne-, brez-
In the category Negation with the prefixes ne-, brez- we speak of examples in which at least one of the antonyms is formed as the negation of some expression. These are pairs in which both words are negations of two antonyms, or cases in which a word and the negation of its synonym appear as an antonym pair. As Table 2 shows, this category was recognised as the main problem in 169 (10.58%) cases and as an additional one in 201 (30.09%). Using such pairs in a text might be stylistically problematic, but they are certainly antonymous in certain contexts. We would therefore include the pairs in the dictionary and leave the decision to the user, who knows best the context in which the word occurs.
Examples: nespremenljiv – nestalen, neugoden – škodljiv, koristen – neugoden.

4.8. Inconsistency between borrowed and nativised forms
Here we deal with examples that are antonymous, but one of the words is borrowed and therefore often (differently) marked. It is also interesting to look for the boundary between a "borrowed" expression (ujemanje – inkongruenca) and one that is already established in the language (inteligenten – neumen). Differences can also appear at the level of the spelling of the borrowed word, and not only in its meaning (e.g. software and softver). Table 2 shows that the annotators recognised at least one word as borrowed in 165 (10.33%) cases where this was the main problem and in 36 (5.39%) cases where it was an additional one. Since these are merely borrowed words that have not (yet) become established in the language, it would be good to include them in the responsive dictionary, as users will be able to put them to good use in suitable contexts.
Examples: aktiv – trpnik, politeizem – enoboštvo, skupen – individualen.

4.9. (Im)perfective deverbal derivatives
The category (Im)perfective deverbal derivatives comprises derivatives whose word-formation base shows differences in perfectivity: one word is derived from a perfective and the other from an imperfective verb. The analysis showed 32 (2.00%) cases in which this category was recognised as the primary problem and four (0.60%) in which it was recognised as a secondary one. These examples are similar to those in 4.4, so it would make sense to treat them in the same way, i.e. not to include them in the dictionary.
Examples: zmanjševanje – povečanje, izkrcanje – vkrcavanje, manjšanje – povečevanje.
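Returning to the corpus check suggested in Section 4.6, a minimal sketch of such a step (the frequency dictionary, e.g. one derived from the Gigafida reference corpus, is an assumed input):

    def flag_unattested(pairs, corpus_frequency, min_count=1):
        """corpus_frequency: word -> number of occurrences in a reference corpus."""
        flagged = []
        for w1, w2 in pairs:
            if (corpus_frequency.get(w1, 0) < min_count
                    or corpus_frequency.get(w2, 0) < min_count):
                flagged.append((w1, w2))   # at least one word is unattested: warn the editor
        return flagged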
4.10. Action and state
This category comprises noun pairs that are antonymous, but one word denotes an action or event and the other a state or property. The problem is similar to that of the (im)perfective deverbal derivatives, except that these nouns are not derived from verbs. As Table 2 shows, Action and state appeared as the main problem in 18 (1.13%) cases and as an additional one in 2 (0.30%). Since such pairs involve only a small nuance in meaning and can be antonymous in certain contexts, it is best to include them in the dictionary and let the user judge their usefulness.
Examples: zaposlitev – brezposelnost, degeneracija – razvoj, nedolžnost – zagrešitev.

4.11. Reflexivity
In the category Reflexivity we placed verb pairs that are antonymous but in which at least one of the two (or both) lacks the reflexive pronoun. Without the reflexive pronoun such verbs make no sense or have a different meaning (one that is not antonymous to the proposed antonym). Table 2 shows that the reflexive pronoun was missing as the main problem in 53 (3.32%) pairs and as an additional problem in 17 (2.54%). Since with such verbs the reflexive pronoun is crucial to the meaningfulness of the antonym pair, it must be added. We would therefore remove such examples from the list for inclusion in the dictionary.
Examples: strinjati (se) – prepirati (se), ubogati – upirati (se), udeležiti (se) – zamuditi.

4.12. Semantically weak verbs
With semantically weak verbs we speak of verb pairs in which (at least) one member requires a complement if it is to be regarded as the antonym of the other. The category appeared as the main problem seven times (0.44%) and twice (0.30%) as an additional one. If such examples are to be included in the dictionary, a suitable word or phrase must be added next to the semantically weak verb.
Examples: manjkati – biti (prisoten), biti (statičen/pri miru) – premikati (se), biti (statičen/pri miru) – gibati (se).

4.13. Semantically full words without context
Semantically full words covers pairs in which one member can be used as the antonym of the other only when it is used in a certain context, together with some other word; in other contexts the two words are not in an antonymic relation. The category appeared as the main problem in 15 (0.94%) pairs and as an additional one in 2 (0.30%). It seems that such problems could be included in the dictionary, and the lack of context resolved at the level of collocations, which the Thesaurus of Modern Slovene currently includes for the semantic comparison of two synonyms.
Examples: pridobiti – odreči (soglasje), odpovedati – obdržati (naročnino), napolniti – sprožiti (pištolo).

4.14. Gender as an "antonym"
In the category Gender as an "antonym" two issues arose. First, we dealt with pairs in which the two listed antonyms are expressions used to denote gender. The question is whether, given the desired social sensitivity of the dictionary, it is appropriate at all to define gender as "antonymous" and thereby treat it as something opposite and binary (e.g. moški – ženska). The second issue is evident from examples in which the two nouns were (typically) antonymous but in different grammatical genders (dolžnica – upnik). Gender appeared as the main problem in 60 (3.76%) pairs and as an additional one in 3 (0.45%). If, despite its problematic nature, the category were included in the dictionary, it would make sense to observe users' responses more closely and establish how they rate the usefulness and suitability of such material. Pairs of the type dolžnica – upnik are not suitable for the dictionary, or the material should be placed under the appropriate headword (dolžnica – upnica; dolžnik – upnik).
Examples: moški – ženska, kralj – kraljica, dolžnica – upnik.

4.15. Markedness and/or rarity of the word
The category Markedness and/or rarity of the word contains pairs which are in principle antonymous expressions, but one of them is marked. In some cases the markedness is emotional (fant – punči), in others it involves archaic usage (izjemoma – često), colloquial expressions (delavnik – fraj), or simply expressions that rarely occur in use (debelost – mršavost). As Table 2 shows, this category appeared quite frequently as the main problem, in 157 (9.83%) pairs, and likewise as an additional one (79 pairs, i.e. 11.83%). Since these examples semantically fit the notion of antonymy, it would be best to include them in the dictionary, so that users themselves can judge whether and when they are useful in their context. It would certainly be good to add a dictionary label indicating the markedness such expressions carry.
Examples: brat – sestrica, izredno – vobče, dolgovezen – koncizen.

4.16. Homographs and polysemes
The category Homographs and polysemes includes pairs in which one of the expressions is polysemous. These pairs often also involve a figurative sense of one of the members (hladen – navdušen). True homographs are also problematic, i.e. those that would have separate headwords in the dictionary rather than just several senses (pust – masten). Such pairs appeared as the main problem 76 times (4.76%) and as an additional problem 20 times (2.99%). Such examples of course belong in the dictionary, but it would be necessary to specify with which sense of the word a given word is in the antonymic relation.
Examples: bistrost – motnost, zajedalec – gostitelj, moder – naiven.
4.17. Rarity and contextual boundedness of senses
This category comprises examples that are antonymous only in certain contexts. Usually one of the expressions is more established and used in more contexts, so it is the antonym of the other only in certain cases. Here we also placed examples in which the members of the pair would be in a hypernym/hyponym relation if one of them were negated (as with zdrav – umobolen, where the true antonyms would be zdrav – bolan, while umobolen is only one form of un-health). In the category Rarity and contextual boundedness of senses we also included examples in which one of the expressions was very specific, usually terminological (example: izdelava – delaboracija). We decided not to place terminological expressions in a separate category, as it is difficult to draw the boundary between technical, specific and "pure" terminological expressions. As the broadest category, it appeared as the primary problem in as many as 509 (31.87%) cases and as a secondary one in 246 (36.83%). These examples are included in the responsive dictionary, as the user can choose the most suitable option from a wide range of possibilities.
Examples: bogat – neploden, cena – prednost, domač – nepoznan.

4.18. Properties that are not antonymous but often used as such
This category gathers examples that describe mutually exclusive properties but are not antonyms in the strict sense, even though they are often used as such. These are mainly pairs that we use as antonyms in colloquial contexts, or ones we mistakenly believe to be antonyms. As Table 2 shows, this issue was recognised in 38 (2.38%) cases as the main and in 11 (1.65%) cases as an additional problem. Although such pairs are not strictly speaking antonymous, it would most likely make sense to include them in the dictionary and leave the choice to the user.
Examples: anabolizem – katabolizem, krivulja – premica, nepomemben – znamenit.

4.19. Indirect synonyms
Indirect synonyms covers pairs of the type glasen – nem, which at first sight are antonymous only in rare cases; if, however, one member were replaced by its synonym, we would obtain a much more obvious antonym pair (e.g. glasen – tih). Such pairs appeared as the primary problem 40 times (2.50%) and as a secondary one 5 times (0.75%). Although they are not prototypically antonymous, it might be good to include such pairs in the dictionary as well, since they can be useful to users in certain situations, while monitoring whether users of the responsive dictionary rate such examples with positive or negative votes.
Examples: profit – minus, glasen – nem, kvaren – koristen.

4.20. Gradation examples
In this category we gathered pairs that can be understood as antonymous in a certain context, but where a very obvious gradation appears. The two words can thus be antonyms (prihodnji – sedanji), but usually a more pronounced opposition also exists (prihodnji – pretekli). Here we also included graded adjectives, which are not always necessarily at completely opposite degrees: a pair can thus contain, for example, a comparative and a superlative rather than just two comparatives (example: manjši – največji rather than just manjši – večji). Gradation examples appeared as the main problem in 39 (2.44%) cases and as an additional one in 17 (2.54%). Since they are contextually conditioned, it is good to include them in the responsive dictionary and thus offer the user a wider choice of potential antonyms.
Examples: negativen – nevtralen, dvojen – enojen, maksimalen – majhen.

4.21. Other
Under Other we included examples that did not fit into any of the remaining categories. As Table 2 shows, we placed 6 (0.38%) pairs under Other as the main problem and 11 (1.65%) pairs as an additional one. Such pairs, which occurred very rarely (0.38%), would best be included in the dictionary, leaving the judgement of their usefulness to the user community.
Examples: državljan – tujec, ofenziven – nespotakljiv, zamuditi – zadeti.

5. Conclusion
The analysis shows that the problem categories carry different weight: some issues need to be addressed before the material can be included in the dictionary, while for others the decision on relevance can be left to the user community. In the analysis we found the categories Typos, Wrong lemmas, Different part of speech, (Im)perfectivity, Non-existent word-formation variants, (Im)perfective deverbal derivatives and Reflexivity to be the most problematic, but at the same time they can presumably be at least partly addressed automatically, which we will take into account when developing the further methodology for the automatic acquisition of antonyms (one possible encoding of this split is sketched below). The remaining categories are more context-bound, so they can be included in the dictionary with the decision left to the community.
Although at first sight few antonyms were unambiguously confirmed (less than half), the further analysis shows that the vast majority (88%) of the data can be included in the responsive dictionary. This is precisely where the advantage of the responsive dictionary shows itself: it offers the user the possibility of choosing among a wide range of potential antonyms and of rating them as more or less suitable. It is therefore best to include as much potential material as possible in the dictionary and leave it to the language community to decide what is useful to it and what is not.
With the digitalisation of society, the needs of language users have changed (and grown): they want an ever larger set of data to choose from. A responsive dictionary enables not only this but also the addition of new material and responses to existing material. Dictionaries are thus changing together with society, and with them so are we and our role in their creation.
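One possible, purely illustrative encoding of the split described in the conclusion (the category names follow our Table 2; the two actions summarise Section 5):

    AUTOMATABLE = {
        "Typos", "Wrong lemmas", "Different part of speech", "(Im)perfectivity",
        "Non-existent word-formation variants",
        "(Im)perfective deverbal derivatives", "Reflexivity",
    }

    def editorial_action(category):
        # categories with formal, machine-detectable problems vs. context-bound ones
        return ("fix or filter automatically" if category in AUTOMATABLE
                else "include and leave the decision to the community")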
6. Acknowledgements
The project Upgrading the Fundamental Dictionary Resources and Databases of CJVT UL (Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL) is financed in 2021–22 by the Ministry of Culture of the Republic of Slovenia.
The authors would also like to thank Špela Arhar Holdt for including them in the project and for her help in planning the study and the paper.

7. References
Luluh Aldhubayi and Maha Alyahya. 2014. Automated Arabic Antonym Extraction Using a Corpus Analysis Tool. Journal of Theoretical and Applied Information Technology, 70(3):422–433.
Darja Fišer. 2015. Semantic lexicon of Slovene sloWNet 3.1. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1026.
Polona Gantar, Iztok Kosem, and Simon Krek. 2016. Discovering automated lexicography: the case of the Slovene lexical database. International Journal of Lexicography, 29(2):200–225.
Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Polona Gantar, Vojko Gorjanc, Bojan Klemenc, Iztok Kosem, Simon Krek, Cyprian Laskowski, and Marko Robnik Šikonja. 2018. Thesaurus of Modern Slovene: By the Community for the Community. In: Proceedings of the XVIII EURALEX International Congress, pp. 401–410.
Marjeta Humar. 2005. Protipomenskost v slovenski jezikoslovni literaturi. In: M. Jesenšek, ed., Knjižno in narečno besedoslovje slovenskega jezika, pp. 234–238. Slavistično društvo Maribor, Maribor.
Marjeta Humar. 2016. Protipomenskost v slovenskem knjižnem jeziku: na primeru terminoloških slovarjev. Inštitut za slovenski jezik Frana Ramovša ZRC SAZU, Ljubljana.
Elin Kamenšek Kranjc, Špela Medved, and Kaja Podgoršek. 2018. Primerjava spletnega slovarja Slovar sopomenk sodobne slovenščine in knjižnega Sinonimnega slovarja slovenskega jezika. Liter jezika, 9(12):66–70.
Agnes Kojc, Tamara Rigler, Kaja Sluga, Anika Plešivčnik, and Špela Kovačič. 2018. Slovar sopomenk sodobne slovenščine in Sinonimni slovar slovenskega jezika. Liter jezika, 9(12):62–65.
Simon Krek, Cyprian Laskowski, Marko Robnik Šikonja, Iztok Kosem, Špela Arhar Holdt, Polona Gantar, Jaka Čibej, Vojko Gorjanc, Bojan Klemenc, and Kaja Dobrovoljc. 2018. Thesaurus of Modern Slovene 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1166.
Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem, and Kaja Dobrovoljc. 2020. Gigafida 2.0: the reference corpus of written standard Slovene. In: N. Calzolari, ed., LREC 2020: Twelfth International Conference on Language Resources and Evaluation, pp. 3340–3345. ELRA - European Language Resources Association, Paris. http://www.lrec-conf.org/proceedings/lrec2020/LREC-2020.pdf.
Nikola Ljubešić and Tomaž Erjavec. 2018. Word embeddings CLARIN.SI-embed.sl 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1204.
Anna Lobanova, Tom van der Kleij, and Jennifer Spenader. 2010. Defining Antonymy: A Corpus-based Study of Opposites by Lexico-syntactic Patterns. International Journal of Lexicography, 23(1):19–53.
Ada Vidovič Muha. 2005. Medleksemski pomenski razmerji – sopomenskost in protipomenskost. In: M. Jesenšek, ed., Knjižno in narečno besedoslovje slovenskega jezika, pp. 206–221. Slavistično društvo Maribor, Maribor.
Ada Vidovič Muha. 2021. Slovensko leksikalno pomenoslovje. Prva e-izdaja. Znanstvena založba FFUL, Ljubljana.
Slovar slovenskega knjižnega jezika. Druga, dopolnjena in deloma prenovljena izdaja. 2014. Cankarjeva založba, Ljubljana.
Sopomenke 1.0. O slovarju. Center za jezikovne vire in tehnologije. https://viri.cjvt.si/sopomenke/slv/about.
Irena Stramljič Breznik. 2010. Tvorjenke slovenskega jezika med slovarjem in besedilom. Mednarodna založba Oddelka za slovanske jezike in književnosti FFUM, Maribor.
Jasmina Pegan. 2019. Detekcija antonimov z vektorskimi vložitvami besed. BA thesis. Fakulteta za računalništvo in informatiko Univerze v Ljubljani.
Jože Toporišič. 1976. Slovenska slovnica. Založba »Obzorja«, Maribor.
Jože Toporišič. 2000. Slovenska slovnica. Četrta, prenovljena izdaja. Založba »Obzorja«, Maribor.
Wenbo Wang, Christopher Thomas, and Amit Sheth. 2010. Pattern-Based Synonym and Antonym Extraction. In: ACM SE '10: Proceedings of the 48th Annual Southeast Regional Conference, pp. 1–4. https://dl.acm.org/doi/abs/10.1145/1900008.1900094.
Ilukana – an app for learning the Japanese hiragana and katakana syllabaries using associations
Nina Sangawa Hmeljak,* Anna Sangawa Hmeljak,† Jan Hrastnik‡
* Fakulteta za računalništvo in informatiko, Univerza v Ljubljani
Večna pot 113, 1000 Ljubljana
nina.sangawa@gmail.com
† Akademija za likovno umetnost in oblikovanje, Univerza v Ljubljani
Dolenjska cesta 83, 1000 Ljubljana
anna.sangawa@gmail.com
‡ Fakulteta za matematiko in fiziko, Univerza v Ljubljani
Jamova cesta 21, 1000 Ljubljana
Abstract

We present the concept and implementation of a digital application for Slovene-speaking learners of Japanese, designed as an aid for memorising the Japanese syllabaries hiragana and katakana through associations and interactive learning. Each character of the two Japanese syllabaries is matched with an illustration that contains the shape of the character and at the same time depicts a Slovene word beginning with the syllable the character represents. The application offers both a list of the illustrations and interactive exercises with which users can test their knowledge. It is written with the Flutter software development kit, in the Dart language, so it runs on any operating system. The app is still a prototype; in the future we plan a study of its effectiveness for memorisation, user testing, and refinement of the user interface.
1. Introduction

In this paper we present the construction and design of a digital application for learning the Japanese syllabic scripts hiragana and katakana through associations and interactive learning. The application is intended for Slovene-speaking learners and students of Japanese as an aid for learning the basic characters of the Japanese syllabaries hiragana and katakana. It is based on the principle of association between known and new information: to make the shape and pronunciation of the characters easier to remember, it offers for each character an illustration that suggests the shape of the character and at the same time depicts a Slovene word beginning with that syllable. The application contains a list of the illustrations and interactive games for testing the learned characters, as well as a mini-game for learning the correct stroke order when writing kana. The application is still at the prototype stage; in this paper we present the background of the project, the purpose of the application, its theoretical starting points, similar illustrations and applications for speakers of other languages, the design concept, the technical implementation, the shortcomings identified, and plans for future work.

2. Background of the project

Japanese is a popular language among fans of Japanese manga and anime; in Slovenia it is taught at the Faculty of Arts of the University of Ljubljana and at several private language schools, and many younger learners also study it on their own with the help of the web. A particular challenge in learning Japanese is the writing system, since Japanese does not use the Latin alphabet but three other scripts: hiragana and katakana, the two Japanese syllabaries, and kanji, the characters that originate from China (Hmeljak et al., 2020). Of the three scripts, hiragana and katakana have the fewest characters, each with 46 distinct characters, whereas there are thousands of Chinese characters. Like every Japanese child, foreign learners first learn these two syllabaries. Because learning a new script whose shapes are completely different from the Latin alphabet is difficult, yet indispensable if learners are to start reading Japanese at all, we decided to create an application that can make this learning easier and more fun.

3. Learning with associations – mnemonics

Mnemonics is a learning and memorisation technique in which we try to organise the content we want to learn (i.e. what we hold only in short-term memory) and link it to what we already know (i.e. what we already have in long-term memory) in such a way that it becomes easier to remember. Mnemonics are particularly important in the early phase of language learning, when the learner has to memorise basic vocabulary or a script, whereas at higher levels learners usually have more developed and interconnected knowledge and can use other methods effectively (Oxford, 2016). Examples of mnemonics include various rhymes and phrases that help us remember certain rules, such as the Slovene sentence "Suhi škafec hoče pasti", which encodes the consonants before which the preposition s rather than z is used, or linking the shape of an object to the shape of a letter in a word we want to remember in connection with that object, e.g. "ob prvem krajcu se luna Debeli, ob zadnjem pa Crkuje", where the shapes of the letters D and C resemble the shape of the moon at the first and last quarter.

Mnemonics also include the keyword method, in which a new word to be learned is associated with a similar-sounding word through some link in content (Cohen, 1987; Manalo, 2002).
Several studies show that mnemonics can be useful for learning a wide range of material, such as foreign languages, scientific laws, and so on. These studies found that learners who used mnemonics performed much better than those who did not. Mnemonics also proved effective for learners with specific learning difficulties or after brain injuries. Some studies have even shown that most people spontaneously use mnemonics when memorising (Manalo et al., 2004).

Mnemonics are therefore also useful for learning new languages and scripts. Several studies have likewise demonstrated the effectiveness of mnemonics that helped English speakers learn the Japanese script (Quackenbush et al., 1989; Manalo et al., 2004; Matsunaga, 2003) and the Korean script (Brown, 2012).

There are also several textbooks for learning Chinese characters that use mnemonic methods to memorise and link shape and meaning through associations. Among the first was the series of textbooks by James Heisig (1977; 1987; Heisig and Sienko, 1994), which covers all 2000 standard characters and has been translated into French (1998), Spanish (2001) and German (2005); for English speakers several similar textbooks exist (Banno et al., 2009; Bodnaryk, 2000; McNair, 2000; McNair, 2005 and McCabe, 2012). For Slovene-speaking learners of Japanese no such material exists yet, so we decided to create it.

3.1. Mnemonic images

For learning hiragana through associations there are already several examples of learning with mnemonic images linked to English. The shape of a hiragana character is overlaid with an illustration of an English word that begins with the same syllable as the chosen character (Ogawa, 1990; Rowley, 1995; Koichi, 2014). There are also applications for this purpose, such as Hiragana Memory Hint and Katakana Memory Hint by the Japan Foundation (Japan Foundation, 2015), which offer learning in connection with English, Indonesian and Thai.

Manalo et al. (2004) found that this way of learning hiragana is effective; participants were generally satisfied and felt that it helped them with memorisation and school performance. On the other hand, Matsunaga (2003) finds that learning hiragana with mnemonic images linked to English words was effective for learners of Japanese who were not native speakers of English only in the short term, and only for those who had never before learned a language that does not use the Latin alphabet.

There has been no research yet on the use of mnemonic images for learning kana among speakers of Slovene, but we may assume that for them, too, learning through illustrations that refer to English words is not particularly effective, as Matsunaga (2003) found for speakers of other languages.

4. The Ilukana application

Textbooks and applications for learning hiragana with mnemonic images thus already exist, but not for Slovene speakers; that is, there are none that link the shapes of kana characters to Slovene words. Given that practically all Slovene-speaking learners of Japanese also have a command of English, they could at first sight use the material for English speakers offered, for example, by Ogawa (1990) or Koichi (2014) and shown in Figure 1. However, especially with English, which has a markedly deep orthography, linking the shape of a kana character to the pronunciation of an English word can cause confusion because of interference from the English spelling and because of the variability of English pronunciation itself (British or American English, etc.). For Slovene speakers, who mostly learn English in spoken and written form at the same time, it could be difficult to set aside the written form (e.g. "nun" for the character な na) and connect only the English pronunciation (/nan/) with the Japanese syllable /na/, since the corresponding Slovene word is pronounced /nuna/ (see Figure 1). The same could be said of the character に ni, for which the chosen word is knee, pronounced /nii/: for those who also picture the written form, it is hard to ignore the "k" at the beginning of the word. Phonetically, too, many a syllable in Slovene is closer to the Japanese one than the English one; the syllable /sa/, for instance, is practically identical in Slovene and Japanese, whereas the pronunciation of the English word "saw" is different, and the difference between American and British pronunciation can add further confusion.

For younger learners who may not yet know English, it is probably easier to remember associations with words from their own mother tongue than from English or another foreign language they do not yet master well. We therefore decided to find Slovene words that can help with memorising the characters of Japanese kana, created illustrations for them, and built the Ilukana application with these.

Ilukana is an application for learning the Japanese hiragana and katakana characters. The target audience are Slovene-speaking learners and students. The application offers illustrations that help the user remember the link between the shape of a character and its pronunciation. The user accesses individual characters through a list of both scripts, which are switched by user interaction. The application also includes a game element in the form of a quiz, both for memorising pronunciation and for the correct stroke order when writing. In Japanese, orthography prescribes the stroke order, which affects the shape of the characters, especially the more complex ones. For the character あ, for example, the horizontal stroke is written first, then the vertical one, and finally the curve.

Figure 1: Examples of mnemonic image ideas for the characters に ni, さ sa and な na (top: Ogawa 1990, bottom: Koichi 2014).
4.1. Creating the association images

In the application the characters are combined with illustrations that help the user remember the shape of a character through an association with the content of the illustration. For example, the illustration for the character あ (see Figure 2), pronounced /a/, shows an adrenaline park, which helps the user remember the link between the shape of あ and its pronunciation, the sound /a/, through the word "adrenalin", which begins with the syllable /a/.

To create the associations it was therefore necessary to find, for each hiragana and katakana character, a Slovene word that begins with the same sound as the chosen character and denotes something of a similar shape. There are a few further constraints: the word must denote something that can be drawn (abstract concepts would be harder to turn into illustrations); the illustration must be as unambiguously connected with a single word as possible; and it must not contain too much detail, so that it can be rendered clearly even on a smaller screen. For each character we tried out several ideas and chose the clearest one. Examples of several different ideas for the same character are shown in Figure 3.

Figure 2: Example of a mnemonic image idea: a as in adrenalin.

Figure 3: Examples of mnemonic image ideas: け /ke/ as keramika 'ceramics', kegljanje 'bowling', kebab, kečap 'ketchup', Kekec.

Figure 2 shows the example for the syllable /a/ (あ in hiragana and ア in katakana). Here we managed to find an illustration (adrenalin) of the same word that works for the characters of the same syllable in both syllabaries. For the hiragana character あ we chose this word as the most suitable because its complicated curves resemble a roller coaster, while the shape of the katakana ア can recall a kayak on white water. Both activities are very dynamic and can be linked to the word adrenalin.

Figure 3 shows several different ideas we had for the hiragana character け, pronounced /ke/. From top left they are keramika 'ceramics', kegljanje 'bowling', kebab, kečap 'ketchup' and Kekec. To make it easier to remember that the character consists of two separate parts, the pin next to the bowler and Kekec with his walking stick were the best candidates. But since kegljanje could be mixed up with ten-pin bowling, which is more widespread, and thus confused with the syllable /bo/, written ぼ in hiragana and similar in shape to け /ke/, we chose the illustration of Kekec, which is unambiguous.

Figures 4 and 5 show two more examples, for /ni/ に and /sa/ さ. For /ni/ we chose the expression nilski konj 'hippopotamus', and for /sa/ the sardela 'sardine'.

Figure 4: Ni as in nilski konj.

Figure 5: Sa as in sardele.
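These constraints translate naturally into a small data model. The following is a minimal illustrative sketch; Python is used here only as neutral pseudocode (the app itself is written in Dart, see Section 4.4), and all names and file paths are hypothetical:

from dataclasses import dataclass

@dataclass
class MnemonicEntry:
    kana: str          # the character being learned, e.g. "あ"
    script: str        # "hiragana" or "katakana"
    syllable: str      # romanised reading, e.g. "a", "ke", "ni"
    keyword: str       # Slovene word depicted by the illustration
    illustration: str  # path to the illustration asset

ENTRIES = [
    MnemonicEntry("あ", "hiragana", "a",  "adrenalin",   "img/a_hira.png"),
    MnemonicEntry("け", "hiragana", "ke", "kekec",       "img/ke_hira.png"),
    MnemonicEntry("に", "hiragana", "ni", "nilski konj", "img/ni_hira.png"),
    MnemonicEntry("さ", "hiragana", "sa", "sardela",     "img/sa_hira.png"),
]

def check_entry(e: MnemonicEntry) -> bool:
    # The core constraint from the paper: the Slovene keyword must
    # begin with the syllable that the kana character represents.
    return e.keyword.lower().startswith(e.syllable)

assert all(check_entry(e) for e in ENTRIES)

The remaining constraints (drawability, unambiguity, low detail) cannot be checked mechanically and were judged by hand, as described above.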
4.2. Application design

In designing the application we opted for a minimalist look. The overall visual identity starts from the colours of the Japanese flag, i.e. red and white, with black for text. So that the screen is not too bright and does not strain the eyes, grey is used for the background. All illustrations that are functional parts of the application (the icons for moving forward and back, returning to the start page, etc.) are in the style of traditional Japanese ukiyo-e prints.

When we launch the application, we find ourselves in front of the entrance of a Japanese house. When we touch the door, it opens and we enter the first page, the main menu. The main menu has four buttons. In the background is an illustration based on a well-known print by the Japanese artist Sharaku depicting a kabuki actor. The navigation bar contains three buttons: in the middle a button for returning to the start page (home button) in the shape of a Japanese house, on the left a button for going back to the previous screen (back button) in the shape of a hand in the style of Japanese ukiyo-e prints, and on the right a button for switching between hiragana and katakana characters.

The first two buttons on the start page are intended for learning the characters through association with an illustration. Touching the hiragana or katakana button leads to a list of all the characters of that script. Touching one hiragana or katakana character leads to a page with that character across the full width of the screen. Below the character is a button leading to an animation that shows the correct order in which it is written. Touching the character itself displays the character combined with its illustration; below it, the pronunciation (the syllable the character represents) is written in the Latin alphabet, together with the Slovene word for the concept with which it is associated. The third button on the main menu, named "seznam" ('list'), is simply an overview list which, with the switch button, can be used for reviewing and memorising the shapes of the characters. The fourth button, named "vaje" ('exercises'), takes the user to two exercises for learning the stroke order and the pronunciation of hiragana or katakana.

Figure 6: Start page of the application.

Figure 7: Main menu.

Figure 8: Page for the character あ.

Figure 9: Mnemonic image for あ.
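The exercises behind the "vaje" button can be sketched in the same spirit. The sketch below combines a multiple-choice pronunciation round with the error-weighted repetition planned in the conclusion (Section 5); the weighting scheme is our assumption, not the app's current behaviour, and Python again stands in for the app's Dart code:

import random

# error_counts: how many times the user has missed each character so far.
error_counts = {"あ": 0, "け": 2, "に": 1, "さ": 0}
pronunciations = {"あ": "a", "け": "ke", "に": "ni", "さ": "sa"}

def pick_character() -> str:
    # Characters with more past mistakes get proportionally more weight,
    # so they come up more often (the behaviour planned in the conclusion).
    chars = list(error_counts)
    weights = [1 + error_counts[c] for c in chars]
    return random.choices(chars, weights=weights, k=1)[0]

def quiz_round() -> None:
    char = pick_character()
    correct = pronunciations[char]
    distractors = random.sample(
        [p for p in pronunciations.values() if p != correct], 2)
    options = [correct] + distractors
    random.shuffle(options)
    answer = input(f"How is {char} pronounced? {options} ")
    if answer.strip().lower() == correct:
        print("Correct!")
    else:
        error_counts[char] += 1
        print(f"Wrong, {char} is read /{correct}/.")

quiz_round()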
4.3. Comparison of the Hiragana Memory Hint application with Ilukana

Hiragana Memory Hint is an application that the Japan Foundation offers on the App Store and Google Play (Japan Foundation, 2015). It is intended for learning, or easing the learning of, the hiragana syllabary for an English-speaking audience; versions for speakers of other Asian languages also exist. Our application Ilukana is similar to it in that both use mnemonic images to link a word in the speaker's mother tongue to the shape of a kana character. The main difference is of course that ours is aimed at speakers of Slovene. Both applications have two main functions: browsing and learning the Japanese characters with the help of mnemonic images, and a quiz where the user can practise. Ilukana lets the user choose whether to memorise hiragana or katakana with the mnemonic images, whereas Hiragana Memory Hint offers only hiragana, with a separate application for katakana. Our application also has a longer list of all the syllables: it contains not only the basic 46 hiragana characters, as the English application does, but all the syllables that can be written by adding diacritics: the two small strokes ゛ for voicing (e.g. か /ka/ vs. が /ga/), the small circle ゜ for the sound /p/ (e.g. は /ha/ vs. ぱ /pa/), and the diacritic signs for palatalised consonants (e.g. さ /sa/ vs. しゃ /ša/). For these special syllables our prototype does not yet have illustrations.

Hiragana Memory Hint has more interactivity and game elements. Ilukana has only two games: one practises pronunciation in the form of a multiple-choice quiz, and in the other the user taps the strokes of a character in the correct writing order (kakijun). Hiragana Memory Hint, by contrast, has four different types of quiz: besides reading hiragana, it has a multiple-choice quiz in which the user chooses among several hiragana characters for a given pronunciation written in the Latin alphabet, one in which the user chooses a hiragana character according to a recorded spoken pronunciation, and a quiz for choosing hiragana according to a written pronunciation where the options are characters that resemble each other.

Hiragana Memory Hint uses black-and-white line illustrations on a coloured background, whereas Ilukana uses colour illustrations on a light grey background with black text. Hiragana Memory Hint opted for a minimalist, clear textbook design with a sans-serif Latin typeface for English and a gothic typeface for Japanese, while in Ilukana we chose a mincho typeface for the Japanese characters, so that the user can memorise the written form of the Japanese characters precisely, including the stroke endings following the principles of tome, hane, etc.; the Latin letters in Ilukana are also sans-serif for better legibility. In Hiragana Memory Hint the buttons and characters follow a minimalist approach reminiscent of a textbook or workbook, with an emphasis on round shapes and playful colours, whereas in Ilukana we wanted to use elements of Japanese culture in the style of traditional Japanese prints on the buttons and backgrounds. There is also a difference in the use of colour: Hiragana Memory Hint uses several shades of colour in its buttons, such as green, blue, red, orange, etc., which gives a playful feel, while Ilukana uses a pinkish red as the main shade combined with shades of grey and black, which gives a cleaner, more elegant feel.

4.4. Technical implementation of the application

The application is written with the Flutter software development kit, in the Dart language. We chose this language so that the application would be usable in any environment, since Flutter allows the application to run on the iOS and Android operating systems as well as in any web browser. The biggest challenge in writing the program was to lay out all the objects so that the layout is preserved at all possible screen sizes. We addressed this by testing on different phones and adjusting the relative distances between objects. Because the design was original, we had to implement all the objects, such as the navigation, ourselves. The web version of the application is available at https://sninah.github.io/ilukana/.

5. Conclusion

The application is still under development; it has been tested on a few different phone models, but needs some further work for optimal behaviour on smaller screens. In the future we intend to test and optimise the application's behaviour, among other things by tuning the timing with which characters are shown: in the quiz they currently appear at random, and we want characters on which the user has made more mistakes to appear more often. We also intend to examine the usability and effectiveness of the illustrations for learning hiragana and katakana among Slovene speakers. We have not yet carried out a thorough user study of the application, but we showed the illustrations themselves to a few students of Japanese, who said that they found the illustrations fun and that they helped them memorise. To verify the actual effect on memorisation, an experiment with a control group and testing before learning, immediately after learning and after a longer period would be needed, which we plan to carry out in the future.

6. References

Eri Banno, Yôko Ikeda, Chikako Shinagawa, Kaori Tajima and Kyôko Tokashiki. 2009. Kanji look and learn: 512 kanji with illustrations and mnemonic hints. Tokyo: Japan Times.
Robert P. Bodnaryk. 2000. Kanji Mnemonics: An Instruction Manual for Learning Japanese Characters. Winnipeg, Manitoba: Kanji Mnemonics.
Lucien Brown. 2012. The use of visual/verbal and physical mnemonics in the teaching of Korean Hangul in an authentic L2 classroom context. Writing Systems Research, 4(1):72–90. http://dx.doi.org/10.1080/17586801.2011.635949
Andrew Cohen. 1987. The use of verbal and imagery mnemonics in second-language vocabulary learning. Studies in Second Language Acquisition, 9(1):43–61.
Kazumi Hatasa. 1991. Teaching Japanese syllabary with visual and verbal mnemonics. CALICO Journal, 8(3):69–80. http://www.jstor.org/stable/24156286.
James Heisig. 1986. Remembering the kanji: A complete course on how not to forget the meaning and writing of Japanese characters. Tokyo: Japan Publications Trading Co.
James Heisig. 1987. Remembering the kanji: A systematic guide to reading Japanese characters. Tokyo: Japan Publications Trading Co.
James Heisig and Tanya Sienko. 1994. Writing and reading Japanese characters for upper-level proficiency. Tokyo: Japan Publications Trading Co.
James Heisig, Marc Bernabé and Verònica Calafell. 2001. Kanji para recordar: curso mnemotécnico para el aprendizaje de la escritura y el significado de los caracteres japoneses. Barcelona: Herder Editorial.
James Heisig and Yves Maniette. 1998. Les kanji dans la tête: apprendre à ne pas oublier le sens et l'écriture des caractères japonais. Yves Maniette.
James Heisig and Robert Rauther. 2005. Bedeutung und Schreibweise der japanischen Schriftzeichen. Frankfurt am Main: V. Klostermann.
Kenneth Higbee. 1977. Your Memory: How It Works and How to Improve It. Englewood Cliffs, NJ: Prentice-Hall.
Kristina Hmeljak Sangawa, Hyeonsook Ryu and Mateja Petrovčič. 2020. Zakaj latinica ni dovolj: o izgubi informacij pri latinizaciji vzhodnoazijskih imen v knjižničnih katalogih. Knjižnica, 64(1–2):47–78.
Japan Foundation. 2015. Hiragana Memory Hint. Katakana Memory Hint. English Version. https://minato-jf.jp/Home/JapaneseApplication.
Koichi. 2014. Learn hiragana: The ultimate guide. https://www.tofugu.com/japanese/learn-hiragana/
Emmanuel Manalo. 2002. Uses of mnemonics in educational settings: A brief review of selected research. Psychologia, 45(2):69–79. https://doi.org/10.2117/psysoc.2002.69
Emmanuel Manalo, Satomi Mizutani and Julie Trafford. 2004. Using mnemonics to facilitate learning of Japanese script characters. Japan Association for Language Teaching Journal, 26(1):55–77. http://jalt-publications.org/recentpdf/jj/2004a_JJ.pdf#page=57.
Sachiko Matsunaga 松永幸子. 2003. Effects of Mnemonics on Immediate and Delayed Recalls of Hiragana by Learners of Japanese as a Foreign Language. Japanese-Language Education around the Globe, 13:19–40. https://doi.org/10.20649/00000331.
Glen McCabe. 2012. Learning Japanese Hiragana & Katakana Flash Cards Kit. Tokyo: Charles E. Tuttle.
Bruce McNair. 2005. Kanji Learned Through Phonic-Mnemonics: Learning to Read Japanese Kanji Using the McNair Phonic-Mnemonic System. Kanji Learning Institute.
Bruce McNair. 2016. Read Kanji Read: Read the 2,136 Jooyoo Kanji in Two Months Using Phonic Mnemonics (English Edition). Kanji Learning Institute.
Kunihiko Ogawa. 1990. Kana Can Be Easy. Tokyo: The Japan Times.
Rebecca L. Oxford. 2016. Teaching and researching language learning strategies: Self-regulation in context. London: Routledge.
Hiroko Quackenbush, Kiyomi Chujo, Kazuhiko Nagamoto and Shinichiro Tawata. 1989. 50分ひらがな導入法：連想法と色付きカード法の比較 Teaching how to read hiragana in 50 minutes: A comparison of mnemonics and the use of cards with associated colours. 日本語教育 Journal of Japanese Language Teaching, 69:147–162.
Michael Rowley. 1995. Kana Pict-O-Graphix: Mnemonics for Japanese Hiragana and Katakana. Albany, CA: Stone Bridge Press.
A spam filter for the academic world
Anja Vrečer
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani
Večna pot 113, 1000 Ljubljana
anja.vrecer@gmail.com
Abstract

Academic spam e-mails are unsolicited messages received mainly by professors, researchers and other academics, and ordinary spam filters do not detect them. In this paper we present the construction of an academic spam filter, comparing different message-filtering methods and different text-processing techniques. For the final model we used a neural network combined with word embeddings and connected it to a chosen e-mail client, namely Gmail. We tested the filter with 10-fold cross-validation and achieved accuracy of up to 98%.
1. Introduction

E-mail has recently become one of the most widely used communication applications. Millions of people use it every day, both at work and in their free time (Whittaker et al., 2005). The downside of the universal usefulness of e-mail, however, is the ever-growing quantity of messages we receive, among them many unsolicited ones. Reading through all of them can therefore take an enormous amount of time and energy. Because we want to separate spam from other, useful messages as quickly as possible, many e-mail clients already have built-in spam filters. Such filters, however, do not detect all kinds of spam. In this paper we focus on one such group, namely academic spam.

Professors and other academics constantly receive invitations to publish papers in various journals, to take part in conferences, or offers of open positions. Such offers are often unrelated to the recipient's field of research, or there are simply too many of them. A major problem are invitations to contribute papers to little-known or predatory journals. Academics who agree to publish their paper in such journals risk damage to their careers: these journals publish every paper they receive, thereby invalidating the academic value of the published papers and marking the academic as a co-author of a predatory journal (da Silva et al., 2020). At the same time, a careless recipient may, through such a message, pass personal information to people who profit from it financially (Lin, 2013).

Because the content of academic spam often differs considerably from the spam detected by most ordinary filters, the recipient has to separate useful from useless messages on their own. The research contribution of our work is a tool for filtering academic spam that uses a neural network as its classification model and achieves results comparable to, or even better than, those of some researchers who have tackled a similar problem.

2. Aim of the paper

Existing academic spam filters are for the most part just manually written rules that exclude messages from certain senders or with certain keywords. To work well, such rules have to be updated constantly, since the senders, as well as the content and wording of these messages, change all the time. As part of this research we therefore created an academic spam filter based on a neural network model combined with word embeddings. The initial model is trained on a set of 660 academic spam messages together with 2,551 other messages. The model can also adapt to the user by taking into account the academic spam in the user's own mailbox.

3. Related work

3.1. Academic spam

In this section we describe findings about academic spam, summarised from various authors. In reviewing its characteristics, we also took into account our observations from inspecting the academic spam in our own test collection of messages.

Unsolicited invitations. Exploitative or predatory journals are journals whose main goal is not the dissemination of knowledge or regard for the academic quality of papers, but dishonest profit. They try to trick professors and other academics into collaborating with them by having them pay for the publication of their papers. The main characteristics of these journals (Wahyudi, 2017) are:

• a fee is required to publish a paper,
• the journal is published frequently,
• an above-average number of papers is accepted for publication,
• the processing and review times for papers are unrealistically fast, and
• the quality of the published papers is poor or very uneven.

In 2014 Jeffrey Beall, a librarian at the University of Colorado, compiled two lists: a list of questionable publishers and a list of questionable journals.
He wrote that they exist only to extract money from authors, who must pay to have their papers accepted by the journal (Wahyudi, 2017). Beall's list of predatory journals is the one most often used to identify exploitative journals. Other collections of suspicious journals also exist, for example the Alexa database and the Phish Tank database of fake websites (Dadkhah et al., 2017). There are also databases that help identify genuine, professional journals, such as the Directory of Open Access Journals (Kozak et al., 2016).

Conference invitations are designed in a similar way. In most cases such invitations are entirely unrelated to the recipient's field of research and do not exist to spread knowledge among like-minded academics; their purpose is to advertise the organisers' journals and make money (D. Cobey et al., 2017).

Phishing. In phishing attacks, the web pages to which an e-mail directs the recipient are created so that the recipient will enter personal data such as bank card numbers, passwords and the like (da Silva et al., 2020). These pages are made to resemble the actual pages of real organisations, so the recipient often does not even realise they are forgeries (Dadkhah et al., 2017). Phishing e-mails are thus a subtype of spam in which the sender pretends to be the representative of some other legitimate organisation with the aim of obtaining personal data (Gupta et al., 2018). Messages of this kind are mostly aimed at a particular group of people or a particular organisation.

Another way phishing e-mails work is through self-executing code: clicking a link runs a hidden program and damages the recipient's computer by installing a virus that destroys the recipient's files or steals personal information, passwords and other data from it (da Silva et al., 2020).

3.2. Generic structure of academic spam

Wahyudi (2017) examined the structure of academic spam in detail in his paper, so below we describe the main findings from that and other papers.

The generic structure of an academic spam message consists of a salutation, an announcement, an introduction, a main part and a conclusion. Flattering salutations and titles are often used, such as "distinguished professor" or "you are an expert in this field" (Grey et al., 2016). The salutation may also use the recipient's first and last name. The message often expresses praise and false encouragement and promises rewards or career opportunities (Dadkhah et al., 2017; Soler and Cooper, 2019). The sender often claims to have read the recipient's paper and insists that the message is not spam (da Silva et al., 2020). In the great majority of cases the message is about a general topic unrelated to the recipient (Grey et al., 2016; Moher and Srivastava, 2015). Another property of academic spam is that it demands a reply within an unrealistically short time (Dadkhah et al., 2017). In some cases, if the recipient does not reply to the first message, new ones follow (Grey et al., 2016).

The senders of academic spam also share some common characteristics. Senders repeat themselves or send repetitive messages to several recipients at once (da Silva et al., 2020). Sometimes the e-mail address is concealed, forged, or inconsistent with the signature at the end of the text (Soler and Cooper, 2019). Addresses that are not concealed mostly use the official domain of an institution whose references they steal (Dadkhah et al., 2017). In many cases the location of the sender's headquarters is also misrepresented (Kozak et al., 2016), i.e. the sender writes a different location in the message than the actual location from which it was sent.

The characteristics described are summarised from the findings of various studies. We observed similar characteristics when inspecting the academic spam we used for the training set. Some academic spam filters do take these common properties into account, but these are mostly hand-written rules that must be constantly revised to keep working well. In the following we therefore describe the development of a spam filter based on a classifier that classifies e-mail messages automatically.

4. Development of the spam filter

In this chapter we present the design of the academic spam filter. We first describe the training set of messages and the text-processing techniques we used. We then present a simplified outline of the filter. The chapter concludes with a description of connecting the filter to the chosen e-mail client.

4.1. The training set of e-mail messages

We obtained the training set of e-mail messages from two different sources, since no suitable collection of academic messages covering both academic spam and other academic mail could be found. We used academic spam contributed by professors at the University of Ljubljana and other messages from the web. In total we collected 660 messages labelled as academic spam. The second group of messages, which are not spam, we found online, on the kaggle website (van Lit, 2019). This online collection contains spam and other e-mail, but its messages have no academic content; for the purposes of our system we used only the messages that are not spam. From this collection we obtained 2,551 messages, which we used as training examples of non-spam messages. Because the set of e-mail messages was put together from different sources, the messages had to be converted into a common form suitable for further processing, and the text of each message had to be processed and transformed appropriately. Below we describe how we approached this problem.

We converted each message in the training set into a dictionary with the keys Subject, Sender, Receiver, Date and Body.
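A minimal sketch of this conversion step with Python's standard email library; the dictionary keys follow the paper, while the helper name and file handling are our own assumptions:

import email
from email import policy

def message_to_dict(path: str) -> dict:
    """Parse one e-mail file into the dictionary form used for training."""
    with open(path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)
    # get_body() picks the most suitable text part of a (possibly
    # multipart) message; attachments are deliberately ignored.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "Subject": msg["Subject"] or "",
        "Sender": msg["From"] or "",
        "Receiver": msg["To"] or "",
        "Date": msg["Date"] or "",
        "Body": body,
    }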
Messages in the group that is not spam have the same source and form, so all of them could be converted into a dictionary in the same way. The spam messages, however, came from different sources and therefore had to be converted into dictionaries in different ways depending on the file extension.

The next processing step is converting the message dictionaries into a form suitable for the model. Lai (2007) argues that the most useful part for classifying spam is the message subject, and that the body alone does not classify as well as it does in combination with the subject. We tested this as well and likewise decided to use the combination of subject and body. In addition, Méndez et al. (2006) report that an attachment, when converted to text, adds unnecessary information that is not good for classification, so we did not use message attachments.

Next we describe the processing of the message text. In some cases the subject contained message tags written in square brackets (for example INBOX), so we removed the part of the text in square brackets. Especially in the set of non-spam messages there are many messages that contain other messages (threads of alternating replies), so we had to find such parts and remove them. We did this by removing lines that begin with a certain character or string, such as "To:", "From:", "Wrote:" etc.

We then processed the subject and the body in the same way. First we removed capitalisation and converted the whole message to lower case. We noticed that some spam messages contain characters that look like letters but are in fact other characters, which the program recognises as punctuation. Examples of characters found in our collection are shown in Figure 1. The senders of academic spam evidently wanted to prevent spam filters from recognising certain words that point to academic spam. We replaced the characters we found with the correct letter and added the tag "specialchars" at the end of the text. We searched the text for exclamation marks and replaced them with the tag "exclamationmark", having noticed that a markedly heavy use of exclamation marks can indicate that a message is academic spam. We also searched for e-mail addresses, links and month names and replaced them with the tags "emailwashere", "linkwashere" and "monthwashere", since the structure of an e-mail address or link and the name of a month are not important. Besides this, we removed punctuation and unnecessary words such as conjunctions, pronouns and interrogatives; in English these frequent, uninformative words are called stop words.

To keep hidden the identity of the professors who contributed academic spam for the training set, the recipients' names had to be removed from the messages. We also removed the senders' names, since this information is likewise unnecessary for classification. A first or last name was replaced with the tag receivername for the recipient or sendername for the sender.

Figure 1: Examples of characters from academic spam messages and the corresponding letters that the senders of the academic spam had replaced.

In the way described, we turned the messages from a list of dictionaries into a list of texts, i.e. a corpus. We then saved the messages using a library that serialises objects, turning them into a byte stream. Messages saved in this way cannot be read directly from the file but must be converted back to text for reading.

4.2. Message-processing techniques

Word counts. The simplest message-processing technique is counting the occurrences of words in each message. We turned the text collections into a matrix in which each row represents a message and each column a word. Since there can be very many distinct words across all messages, we limited ourselves to 2000 words. We also experimented with removing words that appear in fewer than three messages, as described by Sakkis and colleagues (Sakkis et al., 2003).

Term frequency-inverse document frequency (TF-IDF). Term frequency simply means the number of occurrences of a word in an individual message. Inverse document frequency represents the informativeness of a word, i.e. whether the word appears frequently or rarely across messages (Hakim et al., 2014). The TF-IDF of a word is computed with equation (1), where t is a term and d a document, i.e. a message. TF(t, d) is the frequency of term t in message d, and IDF(t) is the inverse document frequency of term t. The latter is computed with equation (2), where n is the number of all messages and DF(t) the number of messages in which term t appears at least once. In the denominator of the fraction we add +1 to DF(t) to avoid division by zero.

TF-IDF(t, d) = TF(t, d) · IDF(t)   (1)

IDF(t) = log(n / (DF(t) + 1)) + 1   (2)

The advantage of the TF-IDF technique is that it normalises the influence of words that appear very frequently in the documents and are therefore less informative than words that appear less often.
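As an illustration, the following sketch builds the count matrix with scikit-learn's CountVectorizer, capped at 2000 words as above, and then applies the TF-IDF weighting of equations (1) and (2) by hand, since this IDF variant differs slightly from scikit-learn's default; the toy corpus is hypothetical:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "dear esteemed professor submit your paper exclamationmark",
    "meeting moved to monthwashere see agenda linkwashere",
]

# Count matrix: one row per message, one column per word, capped at
# the 2000 most frequent words as in the paper.
vectorizer = CountVectorizer(max_features=2000)
tf = vectorizer.fit_transform(corpus).toarray().astype(float)

n = tf.shape[0]                        # number of messages
df = np.count_nonzero(tf > 0, axis=0)  # DF(t): messages containing term t

# Equation (2): IDF(t) = log(n / (DF(t) + 1)) + 1
idf = np.log(n / (df + 1)) + 1
# Equation (1): TF-IDF(t, d) = TF(t, d) * IDF(t)
tfidf = tf * idf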
Mutual information. For attribute selection we also tried removing attributes with too little mutual information. The mutual information of two random variables is a non-negative value that expresses the dependence between them (Kraskov et al., 2004). In other words, mutual information measures the amount of information we gain about one variable if we are given another (Witten and Frank, 2000). The larger the mutual information of two variables, the more dependent they are on each other; if it equals zero, the variables are completely independent. The mutual information of two random variables can be computed with equation (3), where I(X; Y) is the mutual information of variables X and Y, H(X) is the entropy of variable X, and H(X|Y) is the conditional entropy of X given Y. Entropy equals the average self-information and represents the degree of uncertainty, i.e. of information. It is computed with formula (4), where the possible outcomes are x_1, …, x_n and P(x_i) is the probability of outcome x_i. Conditional entropy is computed with formula (5).

I(X; Y) = H(X) − H(X|Y)   (3)

H(X) = − Σ_{i=1}^{n} P(x_i) log P(x_i)   (4)

H(X|Y) = − Σ_{x∈X, y∈Y} p(x, y) log (p(x, y) / p(y))   (5)

Word embeddings. A word embedding is a technique for representing words with vectors that preserve the semantic properties of the words: words that are more similar in meaning are closer together in the vector space (Ghannay et al., 2016). Word vectors are built according to which words occur together in a sentence, as this is the easiest way to determine a word's meaning. Building word vectors effectively therefore requires a large training corpus of texts. Since this is often hard to obtain and training the vectors can be quite time-consuming, collections of words and their vectors trained on large text corpora are available online. Examples of such collections of pre-trained vectors are Google's collection and the GloVe (Global Vectors for Word Representation) collection (Pennington, 2014).

Because the GloVe vectors are trained on an enormous corpus and are freely available, we decided to use them in our system. GloVe offers several vector collections of different sources and sizes. We tried the collections out and evaluated their performance, and also experimented with different maximum numbers of words per message and maximum numbers of unique words. In the final system we used 100-dimensional vectors, a limit of 2,000 words per message and a limit of 500,000 distinct words.

4.3. Design of the academic spam filter

We built a software solution consisting of two programs. The first, main program classifies unread messages; the second updates the neural network according to the messages the user has labelled. In building the solution we tried out several classifiers: naive Bayes, random forest, support vector machines, logistic regression and various neural networks. In the final system we used a neural network, since this classification model gave the best test results. As the e-mail client we chose Gmail (Google, 2022).

Figure 2: Outline of the system at the first training of the neural network and at the classification of unread messages.

Figure 2 shows how the system operates when the program for classifying unread messages is run. The program first checks whether a neural network is already saved on disk. If not, the initial training of the neural network is carried out. Training the neural network requires labelled training data, in our case academic spam messages and other e-mail messages. Since the messages can come from different sources, they are converted into a common form and processed so that unnecessary message attributes are removed; this step is marked (1) in the figure. Training the neural network and saving it to disk (2) follow. The neural network is thus ready for classification at the next run, so there is no need to wait for training at every start. The program's next step is reading the unread messages from the mailbox (3). The messages read are processed in the same way as in step (1). The saved neural network then classifies the unread messages. If any of them is classified as academic spam, the program checks whether messages with the label ACADEMIC SPAM already exist in the user's mailbox. If not, the program creates the new label ACADEMIC SPAM for the user and labels the relevant messages (5); if the label already exists, the program simply labels the relevant messages with it. The label then appears on the unread messages in the user's mailbox (6), and at the same time a folder of messages with the label ACADEMIC SPAM is created or extended (7).

Figure 3 shows the second program, which updates the neural network so that it adapts to the user as much as possible. The update only works if the user's mailbox contains messages labelled ACADEMIC SPAM.
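Before turning to the update program in detail, here is a condensed sketch of the main program's load-or-train step under stated assumptions: the GloVe file path is hypothetical, each message is represented here simply by the average of its 100-dimensional word vectors, and scikit-learn's MLPClassifier stands in for the paper's neural network:

import os
import pickle
import numpy as np
from sklearn.neural_network import MLPClassifier

MODEL_PATH = "model.pkl"  # hypothetical file names throughout

def load_glove(path="glove.6B.100d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.array(values, dtype=float)
    return vectors

def embed(text, vectors, dim=100):
    # Average of the 100-dimensional GloVe vectors of known words;
    # the real system keeps word order, this sketch does not.
    words = [w for w in text.split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0) if words else np.zeros(dim)

def load_or_train(train_texts, train_labels, vectors):
    if os.path.exists(MODEL_PATH):          # model already saved on disk?
        with open(MODEL_PATH, "rb") as f:
            return pickle.load(f)
    X = np.vstack([embed(t, vectors) for t in train_texts])
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    model.fit(X, train_labels)               # initial training
    with open(MODEL_PATH, "wb") as f:         # save for the next run
        pickle.dump(model, f)
    return model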
The program first reads the messages labelled ACADEMIC SPAM. It then adds these messages to the saved user messages, or creates a new file with the user's saved academic spam (1). These messages are then used as part of the training set for training the neural network (2). If more than 1000 user messages are saved, the messages are sorted by date of receipt and only the last 1000 are selected. If fewer than 1000 user messages are saved (3), the set of academic spam is topped up with messages from the academic spam database (4). Besides academic spam, the neural network also needs a set of other messages for training; these are obtained from the database of other messages (5). The neural network is then trained on the given training data, and the updated network is saved to disk (6), where it is available for the next classification of unread messages.

Figure 3: Outline of the system when the neural network is updated.

4.4. Connection to the e-mail client

We connected the academic spam filter to the free e-mail service offered by Google, namely Gmail. To connect this web e-mail client to the program we used the Gmail API, an application programming interface based on the REST architecture (RESTful API) (Developers, 2021). REST (representational state transfer) is an architecture for exchanging data between web services in which every resource is accessible through a unique resource identifier, a URL. It is used to access Gmail mailboxes and to send e-mail from a program.

We connected the program to the Gmail API through the Google Cloud computing environment: there we created a new project, enabled the Gmail API in it, and added authorisation and authentication for the program. We used API Keys and OAuth 2.0 Client IDs to enable the Gmail API in the program.

If the connection to the Gmail API succeeds, the program is ready to read the user's unread messages. If there are no unread messages, the message "No messages found." is printed and the program ends. Otherwise, a dictionary with the keys Subject, Sender, Receiver, Date and Body is generated from the data obtained through the Gmail API. The messages in dictionary form then have to be rearranged into a form suitable for the classifier, similarly to what we did for the messages in the training set (see Section 4.1.). Instead of lists of dictionaries we thus obtained a list of processed texts, which we then converted, with the help of the saved vectors, into a list of vectors and turned into a matrix.

The next step is loading the saved classifier and classifying the unread messages. If the classifier marks any of the messages as academic spam, the label-updating part of the program is run. The program first reads, through the Gmail API, all the labels that exist in the user's mailbox and checks whether any of them is ACADEMIC SPAM. If the label already exists, it is added to the messages the classifier has marked as spam; if it does not exist yet, the new label ACADEMIC SPAM is created.

The result of running the program and classifying the unread messages is the label ACADEMIC SPAM, which appears on the relevant messages. Figure 4 shows an example of such a classification in Gmail: the label ACADEMIC SPAM appears before the subject of the messages classified as academic spam, and at the same time the label ACADEMIC SPAM can be seen on the left in the list of all labels, where all the messages that have been labelled as academic spam in the past can be found.

Figure 4: Excerpt from a Gmail mailbox in which two unread messages were classified as academic spam.
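The label handling described above can be sketched with the official google-api-python-client as follows; an authorised creds object from the OAuth 2.0 flow configured in Google Cloud is assumed and not shown:

from googleapiclient.discovery import build

LABEL_NAME = "ACADEMIC SPAM"

def ensure_label(service):
    """Return the id of the ACADEMIC SPAM label, creating it if needed."""
    labels = service.users().labels().list(userId="me").execute()["labels"]
    for label in labels:
        if label["name"] == LABEL_NAME:
            return label["id"]
    created = service.users().labels().create(
        userId="me", body={"name": LABEL_NAME}).execute()
    return created["id"]

def label_unread_spam(creds, classify):
    # classify: a callable returning True for messages judged academic spam.
    service = build("gmail", "v1", credentials=creds)
    resp = service.users().messages().list(
        userId="me", labelIds=["UNREAD"]).execute()
    messages = resp.get("messages", [])
    if not messages:
        print("No messages found.")
        return
    label_id = ensure_label(service)
    for m in messages:
        msg = service.users().messages().get(userId="me", id=m["id"]).execute()
        if classify(msg):
            service.users().messages().modify(
                userId="me", id=m["id"],
                body={"addLabelIds": [label_id]}).execute()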
5.2. Results

Part of the results of testing the different models is shown in Table 1. We tried several text-processing techniques; the table shows the results obtained with TF-IDF, removing words that appear in fewer than three messages and words with mutual information below 0.01, for the first five models, and with word embeddings for the last, neural-network model. We used the entire message set, namely 660 unwanted academic messages and 2,551 other messages. As can be seen, the results are already quite good for these models, since almost all messages in the test set are classified correctly. A sketch of the described feature extraction follows.
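A minimal sketch of this feature extraction, assuming scikit-learn; min_df=3 and the 0.01 threshold follow the text, while the exact mutual-information estimator the authors used is not specified, so sklearn's mutual_info_classif stands in here:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import mutual_info_classif

    # texts: preprocessed message bodies; labels: 1 = unwanted academic mail
    vectorizer = TfidfVectorizer(min_df=3)  # drop words seen in fewer than 3 messages
    X = vectorizer.fit_transform(texts)

    # Keep only features whose mutual information with the label is at least 0.01.
    mi = mutual_info_classif(X, labels)
    keep = np.where(mi >= 0.01)[0]
    X_selected = X[:, keep]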
Table 1: Mean values and standard deviations from testing with 10-kratnim prečnim preverjanjem (10-fold cross-validation).

Model                  | Accuracy        | F1              | AUC
Naive Bayes            | 88.49% ± 2.40%  | 77.29% ± 5.27%  | 0.91 ± 0.02
Random forest          | 98.32% ± 0.32%  | 95.88% ± 0.79%  | 0.96 ± 0.01
SVM                    | 98.65% ± 0.68%  | 96.62% ± 1.79%  | 0.97 ± 0.01
Logistic regression    | 98.82% ± 0.58%  | 97.02% ± 1.55%  | 0.97 ± 0.01
Neural network         | 98.98% ± 0.42%  | 97.49% ± 1.10%  | 0.98 ± 0.01
Neural network (GloVe) | 98.69% ± 0.62%  | 96.79% ± 1.47%  | 0.97 ± 0.01

Table 2: Average performance ranks of the models by AUC.

Naive Bayes | Random forest | SVM  | Logistic regression | Neural network
5           | 2.9           | 2.65 | 2.15                | 2.3
To compare the models we used the Friedman test (Friedman, 1937); a detailed explanation of its use is given by Demšar (2006). We first compared the group of classification models to which the above text-processing techniques were applied, i.e. the first five models in the table. With the Friedman test at α = 0.05 on AUC we checked whether, for any pair of models, one can be said to be significantly better than the other. The average performance ranks of the models by AUC are shown in Table 2. We computed the critical distance CD = 1.93 and compared it with the differences between the models' average ranks, and found that all models are significantly better at classifying unwanted academic messages than naive Bayes. For the other pairs of models this could not be shown with the Friedman test. A sketch of the procedure follows.
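A minimal sketch of this comparison, assuming scipy and a per-fold AUC matrix of shape 10 folds × 5 models; the critical distance is the Nemenyi post-hoc distance described by Demšar (2006), and with q_alpha = 2.728 for five models it reproduces the CD = 1.93 reported above:

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # auc: array of shape (10, 5), AUC of each model on each of the 10 folds
    stat, p = friedmanchisquare(*[auc[:, j] for j in range(auc.shape[1])])
    print(f"Friedman statistic {stat:.3f}, p-value {p:.4f}")

    # Average ranks per model (rank 1 = best AUC within a fold).
    ranks = np.mean([rankdata(-row) for row in auc], axis=0)

    # Nemenyi critical distance for k models and N folds (Demšar, 2006).
    k, N = auc.shape[1], auc.shape[0]
    q_alpha = 2.728  # critical value for k = 5 at alpha = 0.05
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
    print(f"Average ranks: {ranks}, CD = {cd:.2f}")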
Although the results were already quite good with these models, we nevertheless decided to also implement several neural-network models combined with word embeddings. We built several different neural networks and compared them; the last row of Table 1 shows the result of one of them. Although this result is slightly worse than those of the models described above, we still used this neural network with word embeddings in the final system, because this method takes the meanings of words into account and not just their surface form, as the other text-processing methods do.

The results of our testing are somewhat better than those of some researchers who have dealt with similar problems. We found no work in which researchers attempted to classify unwanted academic e-mail specifically, but we can to some extent compare our results with those of ordinary spam filters. Koprinska et al. (2007) achieved, on one of their test sets, an accuracy of 96.03%, a precision of 95.62%, a recall of 95.62% and an F1 score of 94.16% with a random-forest model; for message processing they used a particular feature-selection method, term frequency variance. Their other models were not as successful. Lai (2007) described an evaluation of naive Bayes, k-nearest neighbours, SVM, and TF-IDF combined with SVM; the combination of TF-IDF and SVM turned out to be the most successful, reaching an accuracy of 93.43% in one case.

5.3. Explaining the Classification with the SHAP Algorithm

Understandability and easy explanation of a model are extremely important for interpreting its results and for the possibility of upgrading the model. This is often the reason why some researchers opt for simple (linear) models instead of more complex ones that are hard to understand. Because of the growing amount of data we want to process, however, it is necessary to use the latter as well. For this there are algorithms that help us understand such models and interpret the result of their classification. One such algorithm is SHAP (Lundberg and Lee, 2017).

[Figure 5: The words that most influence the classification result of the neural network. Words with more dots on the right-hand side contributed to the message being classified as unwanted.]
The SHAP algorithm (SHapley Additive exPlanations) explains, for given examples, why the model classified them the way it did. In other words, SHAP tells us how each attribute influences the model's prediction. In the case of message classification we can therefore use it to find out which words influence the classification result the most. Figure 5 shows the output of the SHAP algorithm on one of the constructed neural networks; because of the computational cost of the algorithm we used a smaller subset of the test messages. Read from bottom to top, the graph shows which words are estimated to influence the classification of unwanted messages the most. The word linkwashere, with which we replaced all URL links, apparently indicates most strongly that a message is not unwanted. Words strongly indicating that a message is an unwanted academic message are university, dear, prof, research and submissions. A sketch of the explanation step follows.
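A minimal sketch of this explanation step, assuming the shap package, a trained model with a predict function, and dense versions of the feature matrices from above; KernelExplainer is used here as the model-agnostic variant, since the exact explainer the authors used is not stated:

    import shap

    # background: a small sample of training vectors (dense arrays assumed);
    # X_test_small: the smaller subset of test vectors mentioned in the text
    explainer = shap.KernelExplainer(model.predict, background)
    shap_values = explainer.shap_values(X_test_small)

    # Summary plot: each dot is one message; dots far to the right belong to
    # words that pushed the prediction towards the "unwanted academic mail" class.
    shap.summary_plot(shap_values, X_test_small,
                      feature_names=vectorizer.get_feature_names_out())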
6. Conclusion

Our goal was to build a filter for unwanted academic e-mail that would, among the unread messages in the user's mailbox, find unwanted academic messages as effectively as possible and mark them. To achieve this we had to study the structure and common characteristics of unwanted academic e-mail and examine existing ways of filtering unwanted e-mail. Through testing we determined that the neural-network model is the most effective at filtering unwanted academic e-mail, so we used it in the final system.

We found that very few solutions for filtering unwanted e-mail have the filtering of unwanted academic messages as their central goal. The large majority of these solutions rely only on recognising known senders of unwanted academic messages, but for this kind of filtering to be effective the sender list must be constantly updated. We therefore implemented a system that classifies unread messages as unwanted academic e-mail according to the meaning of the words in the messages. We achieved this with word embeddings combined with a neural-network model. In addition, we built a program that can update the classification model based on the user's own unwanted academic e-mail. In this way the model can adapt to the user's mailbox and mark unwanted academic messages even more accurately.

One of the larger shortcomings of the described solution is the substitution of ordinary spam for academic messages that are not unwanted. Because of personal-data protection we could not use professors' messages, and no collections of such academic messages could be found online. Professors and other academics do also receive such ordinary spam, so these messages are to some extent suitable for the training set. Nevertheless, it should be verified that, due to the lack of legitimate academic messages, the classifier does not simply mark all academic messages as unwanted.

The system could be further improved by taking into account not only the user's unwanted academic messages but also their other messages when updating the model. Moreover, the system currently works well only for English messages, since our training set consisted solely of English messages; a possible improvement would therefore be language identification and adapting the filter to other languages. A user interface could also be added to make the system easier to use.

7. Acknowledgments

I thank prof. dr. Zoran Bosnić for his guidance, advice and mentorship during the research, and the professors of the Faculty of Computer and Information Science, University of Ljubljana, who contributed unwanted academic e-mail messages for the training set.

8. References

Kelly D. Cobey, Miguel de Costa e Silva, Sasha Mazzarello, Carol Stober, Brian Hutton, David Moher and Mark Clemons. 2017. Is this conference for real? Navigating presumed predatory conference invitations. Journal of Oncology Practice, 13(7):410–413.
Jaime A. Teixeira da Silva, Aceil Al-Khatib and Panagiotis Tsigaris. 2020. Spam emails in academia: issues and costs. Scientometrics, 122(2):1171–1188.
Mehdi Dadkhah, Glenn Borchardt and Tomasz Maliszewski. 2017. Fraud in academic publishing: researchers under cyber-attacks. The American Journal of Medicine, 130(1):27–30.
Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30.
Google Developers. 2021. Gmail API overview. https://developers.google.com/gmail/api/guides.
Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701.
Sahar Ghannay, Benoit Favre, Yannick Esteve and Nathalie Camelin. 2016. Word embedding evaluation and combination. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 300–305. European Language Resources Association (ELRA).
Google. 2022. Gmail: Free, private and secure e-mail. https://www.google.com/intl/sl/gmail/about/, accessed 2022-01-08.
Andrew Grey, Mark J. Bolland, Nicola Dalbeth, Greg Gamble and Lynn Sadler. 2016. We read spam a lot: prospective cohort study of unsolicited and unwanted academic invitations. BMJ, 355.
Brij B. Gupta, Nalin A. G. Arachchilage and Kostas E. Psannis. 2018. Defending against phishing attacks: taxonomy of methods, current issues and future directions. Telecommunication Systems, 67(2):247–267.
Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium and Wahyu Muliady. 2014. Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (tf-idf) approach. In: 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), pages 1–4. IEEE.
Irena Koprinska, Josiah Poon, James Clark and Jason Chan. 2007. Learning to classify e-mail. Information Sciences, 177(10):2167–2187.
Marcin Kozak, Olesia Iefremova and James Hartley. 2016. Spamming in scholarly publishing: A case study. Journal of the Association for Information Science and Technology, 67(8):2009–2015.
Alexander Kraskov, Harald Stögbauer and Peter Grassberger. 2004. Estimating mutual information. Physical Review E, 69(6):066138.
Chih-Chin Lai. 2007. An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems, 20(3):249–254.
Songqing Lin. 2013. Why serious academic fraud occurs in China. Learned Publishing, 26(1):24–27.
Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
José Ramon Méndez, Florentino Fdez-Riverola, Fernando Díaz, Eva Lorenzo Iglesias and Juan Manuel Corchado. 2006. A comparative performance study of feature selection methods for the anti-spam filtering domain. In: Industrial Conference on Data Mining, pages 106–120. Springer.
David Moher and Anubhav Srivastava. 2015. You are invited to submit. . . . BMC Medicine, 13(1):1–4.
Jeffrey Pennington. 2014. GloVe: Global vectors for word representation. https://nlp.stanford.edu/projects/glove/, accessed 2022-07-15.
Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos and Panagiotis Stamatopoulos. 2003. A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6(1):49–73.
Josep Soler and Andrew Cooper. 2019. Unexpected emails to submit your work: Spam or legitimate offers? The implications for novice English L2 writers. Publications, 7(1):7.
Wessel van Lit. 2019. Email spam. Kaggle. https://www.kaggle.com/veleon/ham-and-spam-dataset.
Ribut Wahyudi. 2017. The generic structure of the call for papers of predatory journals: A social semiotic perspective. In: Text-based Research and Teaching, pages 117–136. Springer.
Steve Whittaker, Victoria Bellotti and Paul Moody. 2005. Introduction to this special issue on revisiting and reinventing e-mail. Human–Computer Interaction, 20(1-2):1–9.
Ian H. Witten and Eibe Frank. 2000. Data Mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann.
Preparing a Corpus and a Question Answering System for Slovene
Matjaž Zupanič∗, Maj Zirkelbach∗, Uroš Šmajdek∗, Meta Jazbinšek†
∗Faculty of Computer and Information Science, University of Ljubljana
Večna pot 113, SI-1000 Ljubljana
{mz4689, mz5153, us6796}@student.uni-lj.si
†Department of Translation Studies, Faculty of Arts, University of Ljubljana
Aškerčeva cesta 2, SI-1000 Ljubljana
mj6953@student.uni-lj.si
Abstract
Lack of proper training data is one of the key issues when developing natural language processing models for less-resourced languages, such as Slovene. In this paper we discuss machine translation as a solution to this issue, with the focus on question answering (QA). We use the SQuAD 2.0 dataset, which we have translated using the eTranslation machine translator. To improve the reliability of translations, we translate the answers together with the context instead of separately, reducing the rate at which answers were not found in the context from 56% to 7%. For comparison, we also perform manual post-editing of a small subset of the machine translations. We then compare these datasets utilizing various transformer-based QA models and observe the differences between the datasets and different model configurations. The results have shown little distinction between monolingual and larger multilingual models: monolingual SloBERTa scored 64.9% exact matches on the machine translated dataset and 72.6% exact matches on the human translated one, whereas multilingual RemBERT scored 64.2% exact matches on the machine translated dataset and 71.9% exact matches on the human translated one. Additionally, using the machine translated dataset in the evaluation produces notably worse results than the human translated dataset.
Qualitative analysis of the translations has shown that mistakes often occur when the sentences are longer and have more complicated syntax.
1. Introduction

One of the goals of artificial intelligence is to build intelligent systems that can interact with humans and help them. One such task is reading the web and then answering complex questions about any topic from the given content. These question-answering (QA) systems could have a big impact on the way we access information. Furthermore, open-domain question answering is a benchmark task in the development of artificial intelligence, since understanding text and being able to answer questions about it is something that we generally associate with intelligence.

Recently, pre-trained Contextual Embeddings (PCE) models like Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and A Lite BERT (ALBERT) (Lan et al., 2020) have attracted a lot of attention due to their great performance on a wide range of NLP tasks.

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity, where languages have few reference articles, and information asymmetry, where questions reference concepts from other cultures. Due to the sizes of modern corpora, performing human translations is generally infeasible, so we often employ machine translation instead. Machine translation, however, is for the most part incapable of interpreting the nuances of specific languages, such as culturally specific vocabulary or, when comparing English and Slovene, the use of articles and the indication of grammatical number, gender and conjugation endings.

In this work we present a method for constructing a machine translated dataset from SQuAD 2.0 (Rajpurkar et al., 2018) and evaluate its quality using various modern QA models. Additionally, we benchmark its effectiveness by performing manual post-editing on a subset of the translated dataset and comparing the results.

The main contributions of our work are:
• a pipeline for the translation of an English question answering dataset;
• a Slovene monolingual model, SloBERTa, fine-tuned on machine translated data, and three fine-tuned multilingual QA models, M-BERT, XLM-R and CroSloEngual BERT, fine-tuned both on machine translated data alone and on original plus machine translated data; and
• a comparison of human and machine translated data in terms of question answering performance.

In Section 2 we present the related work. In Section 3 we present our dataset and the process of translation and post-editing, and evaluate the quality of the translation. In Section 4 we give a brief overview of the models used in the evaluation. In Section 5 we present the evaluation and discuss the results in Section 6. In Section 7 we present the conclusions and give possible extensions and enhancements for future work.

2. Related work

Early question answering systems, such as LUNAR (Woods and WA, 1977), date back to the 60s and the 70s. They were characterised by a core database and a set of rules, both handwritten by experts of the chosen domain. Over time, with the development of large online text
repositories and increasing computer performance, the focus shifted from such rule-based systems to machine learning and statistical approaches, like Bayesian classifiers and Support Vector Machines. An example of this kind of system that was able to perform question answering in Slovene was presented by Čeh and Ojsteršek (2009).

Another major revolution in the field of question answering, and natural language processing in general, was the advent of deep learning approaches and self-attention. One of the most popular approaches of this kind is BERT (Devlin et al., 2018), a transformer model. Since its introduction it has inspired many other transformer-based models, for instance RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2020), XLM (Lample and Conneau, 2019) and XLNet (Yang et al., 2019).

Such models also have the advantage of being able to recognise multiple languages, giving rise to multilingual models and model variants, such as M-BERT, XLM-R (Conneau et al., 2019), mT5 (Xue et al., 2021) and RemBERT (Chung et al., 2020). Nevertheless, training requires large amounts of data, which many languages lack, leading to varying performance across languages. Multilingual models have also been shown to perform worse than monolingual ones (Martin et al., 2020; Virtanen et al., 2019). Ulčar and Robnik-Šikonja (2020) therefore made an effort to strike a middle ground between the performance of monolingual models and the versatility of multilingual ones by reducing the number of languages in a multilingual model to three: two similar less-resourced languages from the same language family, plus English. This resulted in two trilingual models, FinEst BERT and CroSloEngual BERT (Ulčar and Robnik-Šikonja, 2020).

In 2020, a Slovene monolingual RoBERTa-based model, SloBERTa (Ulčar and Robnik-Šikonja, 2021), was introduced. It was trained on 5 different corpora, totaling 3.41 billion words. The latest version of the model is SloBERTa 2.0, which augments the original model by more than doubling the number of training iterations. The authors evaluated its performance on named-entity recognition, part-of-speech tagging, dependency parsing, sentiment analysis and word analogy, but not on question answering.

While the described advancements of natural language processing models already offer a partial solution for the lack of language-specific training corpora, namely the ability to train the model on a language where large corpora are present (e.g. English), the models still require language-specific fine-tuning, for which a sizable corpus is needed. In our work we present a potential solution: using machine translation to translate smaller corpora into Slovene and using them to fine-tune the models and evaluate the results.

3. Dataset description and methodology

The Stanford Question Answering Dataset (SQuAD 2.0) (Rajpurkar et al., 2018) is a reading comprehension dataset. It is based on a set of Wikipedia articles covering a variety of topics, from historical, pharmaceutical and religious texts to texts about the European Union. The answer to every question in the dataset is a segment of text, or span, from the corresponding reading passage. It consists of over 100,000 question-answer pairs extracted from over 500 articles.

The reason to use SQuAD 2.0 over 1.0 is that it contains twice as much data, including unanswerable questions.

3.1. Machine Translation

To translate the dataset into Slovene we used the eTranslation web service (Commission, 2020). Because the web service is primarily designed to translate webpages and short documents in docx or pdf format, our translation pipeline was designed as follows (a sketch is given after the list):

1. Convert the corpus to HTML format.
2. Split the HTML file into smaller chunks. We found that 4 MB chunks work best, as larger chunks often could not be translated.
3. Send the chunks to the translation service.
4. Use the original corpus file to compose the translated document in the original format.
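A minimal sketch of steps 2 and 3, assuming the corpus has already been rendered to HTML; send_to_etranslation and the file name squad.html are illustrative placeholders, since the actual eTranslation submission format is not reproduced here:

    from pathlib import Path

    CHUNK_BYTES = 4 * 1024 * 1024  # 4 MB chunks translated most reliably

    def split_html(path):
        """Split the HTML corpus into roughly 4 MB chunks on line boundaries."""
        chunk, size = [], 0
        for line in Path(path).read_text(encoding="utf-8").splitlines(keepends=True):
            chunk.append(line)
            size += len(line.encode("utf-8"))
            if size >= CHUNK_BYTES:
                yield "".join(chunk)
                chunk, size = [], 0
        if chunk:
            yield "".join(chunk)

    def send_to_etranslation(html_chunk):
        """Placeholder: submit one chunk to the eTranslation service."""
        raise NotImplementedError

    translated = [send_to_etranslation(c) for c in split_html("squad.html")]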
Since the basic translation yielded quite underwhelming results, we employed two different methods to improve them. The first was to correct the answers by breaking both the answer and the context down into lemmas and searching for the answer's lemma sequence in the context's lemma sequence; to accomplish this, the CLASSLA (CLARIN Knowledge Centre for South Slavic Languages) library (Ljubešić and Dobrovoljc, 2019) was used. If a match was found, we replaced the bad answer with the original text forming the matched lemma sequence in the context. The second method was to embed the answers in the context before translation. A sketch of the lemma-matching step follows.
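A minimal sketch of the lemma-based correction, assuming the classla package with its Slovene models downloaded; the sliding-window match over context lemmas is one plausible reading of the described procedure, and the surface text is recovered by joining matched words with single spaces:

    import classla  # classla.download("sl") must be run once beforehand

    nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma")

    def lemmas(text):
        doc = nlp(text)
        return [w.lemma.lower() for s in doc.sentences for w in s.words]

    def find_answer(context, answer):
        """Return the context substring whose lemmas match the answer lemmas, or None."""
        ctx_doc = nlp(context)
        ctx_words = [w for s in ctx_doc.sentences for w in s.words]
        ctx_lem = [w.lemma.lower() for w in ctx_words]
        ans_lem = lemmas(answer)
        n = len(ans_lem)
        for i in range(len(ctx_lem) - n + 1):
            if ctx_lem[i:i + n] == ans_lem:
                # Approximate the original surface text of the matched span.
                return " ".join(w.text for w in ctx_words[i:i + n])
        return None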
To evaluate the quality of the different translations, we measured how many answers can be found verbatim in their respective contexts, as they cannot be used in QA models otherwise. The results can be seen in Table 1. The resulting numbers of valid questions, compared with the original, are presented in Table 2.

Table 1: Results for basic translation, lemma correction (LC), and context embedded (CE) translation of the SQuAD 2.0 dataset. The percentages represent the share of answers that can be directly found in the respective context.

Basic | LC  | CE  | LC+CE
44%   | 66% | 93% | 94%

Table 2: Number of questions in the original SQuAD 2.0 dataset and in our machine translated dataset. AQ denotes the number of answerable questions, IQ the number of impossible questions.

Dataset        | Subset | AQ     | IQ     | Total
Original       | Train  | 86,821 | 43,498 | 130,319
Original       | Test   | 5,928  | 5,945  | 11,873
Machine Trans. | Train  | 81,884 | 43,498 | 125,382
Machine Trans. | Test   | 5,735  | 5,945  | 11,680
3.2. Post-editing of Machine Translation

Due to limited human resources, post-editing was done on a small number of randomly chosen automatically translated excerpts. The provided excerpts included the original paragraphs or contexts, questions and answers, as well as their machine translations, which were to be corrected by a translation student. This was done in two steps: creating a project in the online translation tool Memsource, with a translation memory in tmx format generated from the machine translations, and revision or post-editing of the segments. Editing was first done on the paragraphs and then on the questions and answers, since the answers had to match the text in the paragraph. The editing was minimal, meaning that the focus was not on stylistic improvement but mostly on correcting grammatical errors, wrong meanings and very unusual syntax, to make the translation comprehensible. As mentioned above, the topics of the original texts are diverse and very technical, covering domains such as religion, history, politics, mathematics and chemistry.

In total, there were 30 manually corrected contexts with 142 accompanying answerable and 143 unanswerable questions. The number of segments of each type and of post-editing changes can be seen in Table 3.

Table 3: Post-editing numerical data. S denotes the number of segments, NS the number of non-corrected segments, CS the number of corrected segments and FS the fraction of corrected segments.

Segment content     | S   | NS  | CS  | FS
Context             | 30  | 0   | 30  | 100%
Answerable question | 142 | 38  | 104 | 73.2%
Answer              | 435 | 225 | 210 | 48.3%
Impossible question | 143 | 43  | 100 | 69.9%
Total number        | 750 | 306 | 444 | 59.2%

3.3. Post-editing Analysis

The numbers in Table 3 are not fully representative, since some corrections of machine-translation mistakes are more severe than others, and some segments needed far more corrections than others. For instance, the corrections, including one of a severe semantic mistake, can be seen in this example:

1. Original: The Northern Chinese were ranked higher and Southern Chinese were ranked lower because southern China withstood and fought to the last before caving in.
2. Machine translation: Severna Kitajci so bili uvrščeni višje in južna Kitajci so bili uvrščeni nižje, ker je južna Kitajska zdržala in se borila do zadnjega pred jamarstvom.
3. Post-edited machine translation: Severni Kitajci so bili uvrščeni višje in južni Kitajci so bili uvrščeni nižje, ker se je južna Kitajska pred predajo upirala in se borila do zadnjega.

Answerable and impossible questions have a similar percentage of corrected segments. This percentage is quite high because the machine translation produced incoherent results. In these segments the post-editing changes are also more notable, because they affect the overall understanding for potential readers. This can be seen in the following examples:

Original
1. Who did Kublai make the ruler of Korea?
2. Who was Al-Banna's assassination a retaliation for the prior assassination of?
3. What plants create most electric power?

Machine translation
1. Kdo je Kublai postal vladar Koreje?
2. Kdo je bil Al-Bannin umor maščevanja zaradi predhodnega umora?
3. Katere rastline ustvarjajo največ električne energije?

Post-edited machine translation
1. Koga je Kublajkan nastavil za vladarja Koreje?
2. Al-Bannov umor je bil maščevanje za čigav predhodni umor?
3. Katere naprave ustvarjajo največ električne energije?

The answer segments have the largest number of non-corrected segments because they are shorter. Nevertheless, the overall percentage of corrected segments is still high if we take into account that the answers represent 58% of all segments. The mistakes in the answers were for the most part already corrected in the contexts. More severe mistakes include semantic mistakes (e.g. 'plants' translated as 'rastline', not 'naprave') and completely wrong answers (e.g. an empty segment instead of 'Fermilab', or 'in' instead of '1,388'). Some frequent mistakes also occurred in the translations of names of movements, books, projects and other named entities (e.g. 'Bricks for Varšava' was left untranslated and was changed to 'Zidaki za Varšavo'). There were some punctuation errors, but the most interesting are the grammatical mistakes, especially when the wrong grammatical case, gender or number is used. Even if these mistakes were corrected in the context, the answers had to be in exactly the same form, so many answers do not sound coherent; this is of course not the case in English, where conjugation does not change the words as much (e.g. the question 'Which part of China had people ranked higher in the class system?' with the answer 'Northern' became 'V katerem delu Kitajske so bili ljudje višje v razrednem sistemu?' with the answer 'Severni', from the context sentence quoted above). On the other hand, some corrected segments were identical even though the sources differed due to the use of articles in English (e.g. 'North Sea' and 'the North Sea' were both translated as 'Severno morje').

It should also be noted that the SQuAD 2.0 database is not entirely reliable. From the batch of 142 randomly sampled test question-and-answer groups, there were 14 occurrences where at least one of the given answers was not correct (e.g. 'Advanced Steam movement' instead of 'pollution' as an answer to 'Along with fuel sources, what concern has contributed to the development of the Advanced Steam movement?').
4. Models

In this section we present each of the five models used in the evaluation.

4.1. XLM-R

XLM-R (XLM-RoBERTa) (Conneau et al., 2019) is a pre-trained cross-lingual language model based on XLM (Lample and Conneau, 2019). The 'RoBERTa' part of the name comes from its training routine, which is the same as for the monolingual RoBERTa model; specifically, the sole training objective is masked language modeling (MLM). There is no next-sentence prediction (as in BERT) or sentence-order prediction (as in ALBERT). XLM-R shows that it is possible to train one model for many languages without sacrificing per-language performance. It is trained on 2.5 TB of CommonCrawl data in 100 languages.

4.2. M-BERT

M-BERT (Multilingual BERT) (Devlin et al., 2018) is, as its name suggests, a pre-trained cross-lingual language model. It is based on BERT (Devlin et al., 2018). The pre-trained model is trained on 104 languages with a large amount of data from Wikipedia, using a masked language modeling (MLM) objective. On Hugging Face, only a base model with 12 hidden transformer layers is available; a large model with 24 hidden transformer layers was not uploaded, so we were not able to test it.

4.3. RemBERT

RemBERT (Chung et al., 2020) is a model pre-trained on 110 languages, using a masked language modeling (MLM) objective. It differs from mBERT in that the input and output embeddings are not tied. Instead, RemBERT uses small input embeddings and larger output embeddings. This makes the model more efficient, since the output embeddings are discarded during fine-tuning.

4.4. SloBERTa

SloBERTa (Ulčar and Robnik-Šikonja, 2021) is a Slovene monolingual large pre-trained masked language model. It is closely related to the French Camembert model, which is similar to the base RoBERTa model but uses a different tokenization model. Since the model requires a large dataset for training, it was trained on 5 combined datasets. It outperformed existing Slovene models.

4.5. CroSloEngual BERT

CroSloEngual BERT is a trilingual model based on BERT and trained for Slovene, Croatian and English. It was trained on 5.9 billion tokens from these languages. For those languages it performs better than multilingual BERT, which is expected, since studies have shown that monolingual models perform better than large multilingual models (Virtanen et al., 2019).

5. Results

This section is divided into two parts. First we evaluate the automatic machine translations, and then we evaluate the performance of the chosen QA models (XLM-R-large, M-BERT-base, CroSloEngual BERT, RemBERT, SloBERTa 2.0). All tests were performed on an i5 10400f system with an RTX 3070 GPU with 8 GB of VRAM; for the larger models we used an RTX 3060 with 12 GB.

To compare the performance between the English, machine translated Slovene and human translated Slovene versions of the SQuAD 2.0 dataset, we used 5 different question answering models: mBERT, XLM-R, RemBERT, SloBERTa 2.0 and CroSloEngual BERT. The evaluation was done in three steps:

1. Performance evaluation of the different models and fine-tuning configurations on the English dataset, as a benchmark for the evaluation of the Slovene results.
2. Performance evaluation of the different models and fine-tuning configurations on the Slovene dataset translated by computer only, to evaluate the quality of machine translation.
3. Performance evaluation of the different models and fine-tuning configurations on the Slovene subset that was translated by a human, and on the same subset both in English and translated by computer, to evaluate the benefits of human translation.

Before the evaluation, we removed all punctuation, leading and trailing white space, and articles from both the ground truth and the prediction, and lower-cased both. The parameters used for fine-tuning are presented in Table 4.

The metrics used for the evaluation match the official SQuAD 2.0 evaluation metrics and were as follows (a sketch of their computation follows the list):

• Exact: the fraction of predictions that match at least one of the correct answers exactly.
• F1: the average overlap between prediction and ground truth, defined as the average of the F1 scores of the individual questions. The F1 score of an individual question is computed as the harmonic mean of precision and recall, where precision is defined as TM/TP and recall as TM/TGT, with TM the number of matching tokens between prediction and ground truth, TP the number of tokens in the prediction and TGT the number of tokens in the ground truth. A token is defined as a word, separated by white space.
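A minimal sketch of these two metrics under the stated definitions; the normalization step (punctuation, articles, case) is simplified here to lower-casing and whitespace tokenization:

    from collections import Counter

    def tokens(text):
        return text.lower().split()

    def exact_match(prediction, answers):
        """True if the prediction matches at least one correct answer exactly."""
        return any(tokens(prediction) == tokens(a) for a in answers)

    def f1_score(prediction, ground_truth):
        pred, gt = tokens(prediction), tokens(ground_truth)
        common = Counter(pred) & Counter(gt)  # multiset of matching tokens
        tm = sum(common.values())             # T_M
        if tm == 0:
            return 0.0
        precision = tm / len(pred)            # T_M / T_P
        recall = tm / len(gt)                 # T_M / T_GT
        return 2 * precision * recall / (precision + recall)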
which is similar to base RoBERTa model, but uses a dif-
The results of the non-translated SQuAD 2.0 and ma-
ferent tokenization model. Since the model requires a large
chine translated dataset can be seen in Table 5. The results
dataset for training, it was trained on 5 combined datasets.
of the human translated subset and its English and com-
It outperformed existing Slovene models.
puter translated counterparts can be seen in Table 6. Addi-
tionally, we provide some examples of correct predictions
4.5.
CroSloEngual BERT
with wrong answers in Table 7 and some of correct answers
It is a trilingual model based on BERT and trained for
with wrong predictions in Table 8.
Slovene, Croatian and English language. It was trained
with 5.9 billion tokens from these languages. For those lan-
Model Name
B
MS
LR
E
guages it performs better than multilingual BERT, which
is expected, since studies showed that monolingual models
XLM-R-large
4
256
1e-5
3
perform better than large multilingual models (Virtanen et
M-BERT-base
8
320
3e-5
3
al., 2019).
CroSloEngual BERT
4
256
1e-5
3
RemBERT
4
256
1e-5
3
5.
Results
SloBERTa 2.0
16
320
3e-5
3
This section is divided into two parts. First we evaluate
Table 4: Parameters used to fine-tune the evaluated models.
automatic machine translations and then we evaluate per-
B denotes the number of batches used during fine-tuning,
formance of choosen QA models (XLM-R-large, M-Bert-
MS the maximum sequence length, LR the learning rate and
base, CroSloEngual BERT, RemBERT, SloBERTa 2.0). All
E the number of epochs.
Table 5: Comparison of the results of various models and their fine-tuning configurations on the English SQuAD 2.0 evaluation dataset and the Slovene machine translated SQuAD 2.0 evaluation dataset. The English dataset only contains the questions present in its Slovene counterpart. The specific parameters used in fine-tuning are presented in Table 4.

                  | Fine-Tuning | Original         | Machine Translation
Model name        | Language    | Exact   | F1     | Exact   | F1
xlmR-large        | Eng         | 81.8%   | 84.9%  | 64.3%   | 72.3%
xlmR-large        | Slo         | 75.0%   | 79.2%  | 65.3%   | 72.4%
xlmR-large        | Eng & Slo   | 74.4%   | 78.5%  | 65.9%   | 73.4%
M-BERT-base       | Eng         | 75.6%   | 78.9%  | 55.4%   | 61.3%
M-BERT-base       | Slo         | 62.4%   | 67.2%  | 60.4%   | 67.0%
M-BERT-base       | Eng & Slo   | 70.7%   | 75.0%  | 60.5%   | 67.3%
CroSloEngual BERT | Eng         | 72.8%   | 76.3%  | 56.3%   | 63.6%
CroSloEngual BERT | Slo         | 63.6%   | 68.2%  | 58.4%   | 65.4%
CroSloEngual BERT | Eng & Slo   | 68.8%   | 73.0%  | 58.1%   | 65.7%
RemBERT           | Eng         | 84.5%   | 87.5%  | 67.1%   | 73.8%
SloBERTa 2.0      | Slo         | 60.6%   | 64.7%  | 66.7%   | 73.9%
Table 6: Comparison of the results of various models and their fine-tuning configurations on the human translated subset of SQuAD 2.0, and on the subsets containing the same questions from the original English dataset and the machine translated dataset. The specific parameters used in fine-tuning are presented in Table 4.

                  | Fine-Tuning | Original        | Machine Translation | Human Translation
Model name        | Language    | Exact  | F1     | Exact  | F1         | Exact  | F1
xlmR-large        | Eng         | 80.0%  | 82.9%  | 61.1%  | 68.5%      | 71.6%  | 75.9%
xlmR-large        | Slo         | 69.1%  | 72.9%  | 61.4%  | 69.1%      | 69.8%  | 74.8%
xlmR-large        | Eng & Slo   | 68.8%  | 73.4%  | 64.6%  | 72.4%      | 70.5%  | 75.7%
M-BERT-base       | Eng         | 71.9%  | 74.9%  | 52.6%  | 57.7%      | 57.5%  | 60.3%
M-BERT-base       | Slo         | 56.1%  | 60.4%  | 58.6%  | 64.5%      | 60.4%  | 66.2%
M-BERT-base       | Eng & Slo   | 64.9%  | 68.8%  | 55.8%  | 61.2%      | 63.5%  | 68.6%
CroSloEngual BERT | Eng         | 73.3%  | 75.5%  | 53.0%  | 60.8%      | 62.1%  | 65.7%
CroSloEngual BERT | Slo         | 59.6%  | 63.1%  | 51.6%  | 58.8%      | 60.7%  | 66.0%
CroSloEngual BERT | Eng & Slo   | 68.1%  | 70.6%  | 58.9%  | 66.3%      | 64.6%  | 71.0%
RemBERT           | Eng         | 84.9%  | 87.2%  | 64.2%  | 71.4%      | 71.9%  | 76.9%
SloBERTa 2.0      | Slo         | 59.3%  | 65.0%  | 64.9%  | 72.2%      | 72.6%  | 78.0%
Table 7: Examples of correct predictions with wrong answers. ENG denotes the English dataset, MT the one translated by a computer and HT the one translated by a human.

# | Dataset | Question                                                        | Answer      | Prediction
1 | ENG     | How many of Warsaw's inhabitants spoke Polish in 1933?          | 833,500     | 833,500
1 | MT      | Koliko prebivalcev Varšave je leta 1933 govorilo poljsko?       | prebivalcev | 833.500
1 | HT      | Koliko prebivalcev Varšave je leta 1933 govorilo poljski jezik? | 833.500     | 833.500
2 | ENG     | Who recorded "Walking in Fresno?"                               | Bob Gallion | Bob Gallion
2 | MT      | Kdo je posnel "Walking in Fresno"?                              | je Bob      | Bob Gallion
2 | HT      | Kdo je posnel »Walking in Fresno«?                              | Bob Gallion | Bob Gallion
Table 8: Examples of correct answers with wrong predictions. ENG denotes the English dataset, MT the one translated by a computer and HT the one translated by a human.

# | Dataset | Question                                                             | Answer        | Prediction
1 | ENG     | Where did Korea border Kublai's territory?                           | northeast     | northeast
1 | MT      | Kje je Koreja mejila na Kublajevo ozemlje?                           | severovzhodno | zahodno
1 | HT      | Kje je Koreja mejila na Kublajkanovo ozemlje?                        | severovzhodno | severovzhodno
2 | ENG     | How many miles, once completed, will the Lewis S. Eaton trail cover? | 22            | 22
2 | MT      | Koliko kilometrov, ko bo končano, bo pokrivalo Lewis S. Eaton?       | 22            | 35
2 | HT      | Koliko kilometrov bo dolga pot Lewisa S. Eatona, ko bo končana?      | 22            | 35
6. Discussion

6.1. Quantitative Analysis

From the results in Table 5 we can see that RemBERT and SloBERTa 2.0 gave the best results on the dataset translated by computer. While the result for SloBERTa was expected, as monolingual models tend to perform better than multilingual ones, RemBERT managed to outperform its multilingual competitors while only being fine-tuned on the English dataset. We would attribute this simply to the better design of the model. Although both models had very similar performance, we would like to point out that RemBERT is a much larger model and was pre-trained on a significantly larger dataset. Similar results were also observed when comparing the results on the smaller subset of questions that were translated by a human, as seen in Table 6.

In Table 6 we can see the models consistently performing better on the human translated data, suggesting that the machine translation provided by the eTranslation web service falls short of providing an adequate set for proper evaluation in Slovene. We can also see that while the models fine-tuned on the machine translated dataset do perform better when evaluated on the machine translated data, this does not hold for evaluations on the human translated data.

We have also observed that fine-tuning the model on the English dataset first, and then on the Slovene one, yields better results for the smaller models, M-BERT-base and CroSloEngual BERT, than fine-tuning on either language alone.

6.2. Qualitative Analysis

While there are many correct predictions of the answers in the machine translated dataset, it is clear that a great number of predictions still do not answer the question correctly. This is because the machine translation of the sentences in the context is not grammatically and stylistically correct and does not convey the right meaning, so the model has more problems finding the answer. The correct predictions are mostly the ones where the answer to the question is short and the words are not conjugated, i.e. numbers and names, although there are some exceptions. The same is true for the human post-edited translation, but the improvement of some answers is already visible from only a few representative examples in Table 7 and Table 8.

7. Conclusion

In this work we presented a machine translated SQuAD 2.0 dataset and evaluated it on the following question answering (QA) models: XLM-R-large, M-BERT-base, RemBERT, CroSloEngual BERT and SloBERTa 2.0. Additionally, we performed human post-editing on a subset of the SQuAD 2.0 translations in order to better ascertain the quality of the machine translations. The results show that using machine translated data for evaluation led to notably worse results compared to data translated by a human. Moreover, we noticed that while multilingual models fine-tuned on machine translated data performed better than ones fine-tuned on English data when answering machine translated questions, the situation was in most cases reversed when answering human translated questions. This leads us to conclude that machine translation, at least as available via the eTranslation service (Commission, 2020), is not particularly suitable for training multilingual models. Of all the models, SloBERTa 2.0 produced the best results on both machine and human translated data, while RemBERT gave comparable results even when only fine-tuned on the English dataset.

The testing procedure could easily be improved by employing stronger hardware. RemBERT could, for example, be fine-tuned on the Slovene dataset, which would allow for a better evaluation of it. Additionally, we were unable to ascertain the optimal parameters for fine-tuning, as performing multiple fine-tunings for each language would be unfeasible. Some restrictions of the project are the limited time for post-editing, a single translator who is not an expert in the topics of the various technical texts, and the method of minimal editing, which can result in mediocre translations. The experiment could be expanded by including a larger subset of human translated or revised data, more datasets, such as Natural Questions (Kwiatkowski et al., 2019), and different machine translation services, such as DeepL.

8. Acknowledgments

We would like to thank our mentors, Slavko Žitnik and Špela Vintar, for providing us with directions, feedback and advice.

9. References

Ines Čeh and Milan Ojsteršek. 2009. Developing a question answering system for the Slovene language. WSEAS Transactions on Information Science and Applications, (9).
Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2020. Rethinking embedding coupling in pre-trained language models. CoRR, abs/2010.12821.
European Commission. 2020. CEF Digital eTranslation. https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv:1911.02116.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy. Association for Computational Linguistics.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD.
Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT. In: International Conference on Text, Speech, and Dialogue, pages 104–111. Springer.
Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model.
Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv:1912.07076.
William A. Woods and WOODS WA. 1977. Lunar rocks in natural English: Explorations in natural language question answering.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.