Zbornik konference Jezikovne tehnologije in digitalna humanistika
Proceedings of the Conference on Language Technologies and Digital Humanities

15.–16. september 2022, Ljubljana, Slovenija
September 15th–16th 2022, Ljubljana, Slovenia

Uredila / Edited by: Darja Fišer, Tomaž Erjavec

ZBORNIK KONFERENCE JEZIKOVNE TEHNOLOGIJE IN DIGITALNA HUMANISTIKA
PROCEEDINGS OF THE CONFERENCE ON LANGUAGE TECHNOLOGIES & DIGITAL HUMANITIES

Uredila / Edited by: Darja Fišer, Tomaž Erjavec
Tehnični uredniki / Technical editors: Jakob Lenardič, Katja Meden, Mihael Ojsteršek
Založil / Published by: Inštitut za novejšo zgodovino / Institute of Contemporary History
Izdal / Issued by: Inštitut za novejšo zgodovino / Institute of Contemporary History
Za založbo / For the publisher: Andrej Pančur, direktor / Director

Ljubljana, 2022
First edition

Spletno mesto konference / Conference website: https://www.sdjt.si/jtdh-2022 / https://www.sdjt.si/jtdh-2022/en
Publikacija je brezplačno dostopna na / Publication is available free of charge at: https://nl.ijs.si/jtdh22/proceedings-sl.html / https://nl.ijs.si/jtdh22/proceedings-en.html

To delo je objavljeno pod licenco Creative Commons Priznanje avtorstva 4.0 Mednarodna.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 121176323
ISBN 978-961-7104-20-2 (PDF)

Predgovor k zborniku konference “Jezikovne tehnologije in digitalna humanistika”

Slovensko društvo za jezikovne tehnologije skupaj z Inštitutom za novejšo zgodovino in Centrom za jezikovne vire in tehnologije Univerze v Ljubljani ter raziskovalnima infrastrukturama CLARIN.SI in DARIAH-SI že četrtič po vrsti prireja konferenco “Jezikovne tehnologije in digitalna humanistika”. Po uspešni programski širitvi konference Jezikovne tehnologije, ki je potekala od leta 1998, na digitalno humanistiko leta 2016 konferenca ohranja povezovalni fokus med disciplinama, hkrati pa si prizadeva postati pomembno srečevališče raziskovalcev v regiji.

Letošnja konferenca je potekala na Fakulteti za družbene vede Univerze v Ljubljani. Ker smo želeli zagotoviti, da bi bila konferenca v čim večji meri dostopna vsem zainteresiranim, smo vabljeni predavanji in vse predstavitve posneli in po zaključku konference objavili na konferenčni spletni strani. Na spletni strani konference je bil že vnaprej objavljen tudi zbornik konference.

Konferenčne vsebine smo razvrstili v tri dni. Prvi dan je bil posvečen predkonferenčnima seminarjema na temo tematskega modeliranja parlamentarnih razprav in raziskovalne infrastrukture CLARIN.SI. Drugi in tretji dan pa so se zvrstile predstavitve vabljenih predavateljev in avtorjev sprejetih prispevkov. Ker je bila zasedba na konferenci mednarodna, smo program izvedli v ločenih slovenskih in angleških sekcijah. Zvrstili sta se tako slovenska kot angleška študentska sekcija, dve slovenski in tri angleške redne sekcije ter angleška in slovenska posterska sekcija, tako za redne kot za študentske prispevke. Ob zaključku konference smo nagradili najboljši študentski prispevek.
V posebni sekciji so bili predstavljeni še dosedanji rezultati projekta Razvoj slovenščine v digitalnem okolju, po konferenci pa je sledil še redni letni občni zbor Slovenskega društva za jezikovne tehnologije.

Na letošnji konferenci sta se predstavila dva vabljena predavatelja ter avtorji 30 rednih prispevkov, 9 razširjenih povzetkov in 12 študentskih prispevkov. Vse prispevke so pregledali trije recenzenti. 20 prispevkov je napisanih v slovenskem, 31 pa v angleškem jeziku. Skupno število vseh avtorjev prispevkov je 120, od katerih je skoraj tretjina tujih (iz Avstralije, Bosne in Hercegovine, Brazilije, Bolgarije, Hrvaške, Finske, Francije, Italije, Luksemburga, Severne Makedonije in Srbije).

Urednika se najlepše zahvaljujeva vsem, ki so prispevali k uspehu konference: vabljenima predavateljema in avtorjem prispevkov za skrbno pripravljene prispevke, predstavitve in plakate, programskemu odboru za natančno recenzentsko delo, organizacijskemu odboru za izvedbo konference, moderatorjem diskusij, tehničnim urednikom za pripravo spletnega zbornika in raziskovalnima infrastrukturama DARIAH-SI in CLARIN.SI ter društvu SDJT za finančno podporo konference.

Ljubljana, september 2022
Darja Fišer in Tomaž Erjavec

Preface to the Proceedings of the Conference “Language Technologies and Digital Humanities”

The Slovenian Language Technologies Society, together with the Institute of Contemporary History, the Centre for Language Resources and Technologies of the University of Ljubljana, and the research infrastructures CLARIN.SI and DARIAH-SI, has organised the 13th Conference on Language Technologies and Digital Humanities. After its successful expansion to Digital Humanities in 2016, the conference retains its focus on the integration of the two disciplines and at the same time aims to position itself as an important meeting hub for fellow researchers in the region. This year’s conference took place at the Faculty of Social Sciences of the University of Ljubljana.
In order to make the conference as accessible as possible to all participants, we made recordings of the invited talks and the presentations. After the conference, we published the recordings on the conference webpage, while the proceedings were made available on the webpage in advance.

The conference took place over the course of three days. On the first day, two pre-conference seminars were organised, one on topic modelling of parliamentary debates and another on the CLARIN.SI research infrastructure. Days two and three were dedicated to two invited talks and presentations of accepted papers. Since the conference was also attended by international scholars, the programme was divided into separate Slovenian and English sessions. There was a Slovenian and an English student session, two Slovenian and three English regular sessions, as well as an English and a Slovenian poster session, both for regular and student contributions. In a special session, the results of the project Development of Slovene in a Digital Environment – Language Resources and Technologies were presented.

This year’s conference saw presentations from two invited speakers and from the authors of 30 regular papers, 9 extended abstracts, and 12 student papers. All the papers were reviewed by three reviewers. 20 papers were written in Slovene and 31 in English. The total number of authors of the accepted papers is 120, almost a third of whom were from abroad (from Australia, Bosnia and Herzegovina, Brazil, Bulgaria, Croatia, Finland, France, Italy, Luxembourg, North Macedonia, and Serbia).
The editors would like to thank everyone who has contributed to the success of this conference: the invited lecturers and the authors of the papers for their carefully prepared papers, presentations, and posters; the programme committee for their detailed reviews; the organising committee for the smooth running of the conference; the discussion moderators; the technical editors for preparing the online proceedings; and the research infrastructures DARIAH-SI and CLARIN.SI as well as the SDJT society for financially supporting the conference.

Ljubljana, September 2022
Darja Fišer and Tomaž Erjavec

Programski odbor / Programme committee

Predsedstvo programskega odbora / Steering committee

Darja Fišer, predsednica / Chair
Filozofska fakulteta, Univerza v Ljubljani in Inštitut za novejšo zgodovino / Faculty of Arts, University of Ljubljana and Institute of Contemporary History

Simon Dobrišek
Fakulteta za elektrotehniko, Univerza v Ljubljani / Faculty of Electrical Engineering, University of Ljubljana

Tomaž Erjavec
Institut “Jožef Stefan” / Jožef Stefan Institute

Andrej Pančur
Inštitut za novejšo zgodovino / Institute of Contemporary History

Matej Klemen, študentska sekcija / student section
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Aleš Žagar, študentska sekcija / student section
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Člani programskega odbora in recenzenti / Programme committee members and reviewers

Špela Arhar Holdt
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Petra Bago
Filozofska fakulteta, Univerza v Zagrebu / Faculty of Arts, University of Zagreb

Vuk Batanović
Fakulteta za elektrotehniko, Univerza v Beogradu / Faculty of Electrical Engineering, University of Belgrade

Zoran Bosnić
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and
Information Science, University of Ljubljana

Narvika Bovcon
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Václav Cvrček
Inštitut češkega narodnega korpusa, Karlova univerza v Pragi / Institute of the Czech National Corpus, Charles University in Prague

Jaka Čibej
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Helena Dobrovoljc
Inštitut za slovenski jezik Frana Ramovša, ZRC SAZU / Fran Ramovš Institute of the Slovenian Language, ZRC SAZU

Kaja Dobrovoljc
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Jerneja Fridl
Znanstvenoraziskovalni center Slovenske akademije znanosti in umetnosti / Research Centre of the Slovenian Academy of Sciences and Arts

Polona Gantar
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Vojko Gorjanc
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Jurij Hadalin
Inštitut za novejšo zgodovino / Institute of Contemporary History

Miran Hladnik
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Ivo Ipšić
Univerza na Reki / University of Rijeka

Mateja Jemec Tomazin
Inštitut za slovenski jezik Frana Ramovša, ZRC SAZU / Fran Ramovš Institute of the Slovenian Language, ZRC SAZU

Alenka Kavčič
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Iztok Kosem
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Simon Krek
Laboratorij za umetno inteligenco, Institut “Jožef Stefan” / Artificial Intelligence Laboratory, Jožef Stefan Institute

Jakob Lenardič
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Nikola Ljubešić
Odsek za tehnologije znanja, Institut “Jožef Stefan” / Department of Knowledge Technologies, Jožef Stefan Institute

Nataša Logar
Fakulteta za družbene vede, Univerza v Ljubljani / Faculty of Social Sciences, University of Ljubljana

Matija Marolt
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Sanda Martinčić-Ipšić
Univerza na Reki / University of Rijeka

Maja Miličević Petrović
Univerza v Bolonji / University of Bologna

Dunja Mladenić
Laboratorij za umetno inteligenco, Institut “Jožef Stefan” / Artificial Intelligence Laboratory, Jožef Stefan Institute

Matija Ogrin
Inštitut za slovensko literaturo in literarne vede ZRC SAZU / Institute of Slovenian Literature and Literary Sciences, ZRC SAZU

Matevž Pesek
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Dan Podjed
Inštitut za slovensko narodopisje ZRC SAZU / Institute of Slovenian Ethnology, ZRC SAZU

Senja Pollak
Odsek za tehnologije znanja, Institut “Jožef Stefan” / Department of Knowledge Technologies, Jožef Stefan Institute

Ajda Pretnar Žagar
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Marko Robnik-Šikonja
Fakulteta za računalništvo in informatiko, Univerza v Ljubljani / Faculty of Computer and Information Science, University of Ljubljana

Tanja Samardžić
Univerza v Zürichu / University of Zurich

Miha Seručnik
Zgodovinski inštitut Milka Kosa ZRC SAZU / Milko Kos Historical Institute, ZRC SAZU

Mirjam Sepesy Maučec
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru / Faculty of Electrical Engineering and Computer Science, University of Maribor

Marko Stabej
Filozofska fakulteta, Univerza v Ljubljani / Faculty of Arts, University of Ljubljana

Branislava Šandrih Todorović
Filološka fakulteta, Univerza v Beogradu / Faculty of Philology, University of Belgrade

Mojca Šorn
Inštitut za novejšo zgodovino / Institute of Contemporary History

Janez Štebe
Fakulteta za družbene vede, Univerza v Ljubljani / Faculty of Social
Sciences, University of Ljubljana

Simon Šuster
Univerza v Melbournu / University of Melbourne

Daniel Vasić
Univerza v Mostarju / University of Mostar

Darinka Verdonik
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru / Faculty of Electrical Engineering and Computer Science, University of Maribor

Andrej Žgank
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru / Faculty of Electrical Engineering and Computer Science, University of Maribor

Jerneja Žganec Gros
Alpineon d.o.o. / Alpineon d.o.o., Slovenia

Branko Žitko
Fakulteta za znanost, Univerza v Splitu / Faculty of Science, University of Split

Organizacijski odbor / Organising committee

Mojca Šorn, predsednica / Chair
Inštitut za novejšo zgodovino / Institute of Contemporary History

Ana Cvek
Inštitut za novejšo zgodovino / Institute of Contemporary History

Kaja Dobrovoljc
Filozofska fakulteta, Univerza v Ljubljani, Institut “Jožef Stefan” / Faculty of Arts, University of Ljubljana, Jožef Stefan Institute

Jerneja Fridl
Znanstvenoraziskovalni center Slovenske akademije znanosti in umetnosti / Research Centre of the Slovenian Academy of Sciences and Arts

Katja Meden
Institut “Jožef Stefan” / Jožef Stefan Institute

Mihael Ojsteršek
Inštitut za novejšo zgodovino / Institute of Contemporary History

Nataša Rozman
Inštitut za novejšo zgodovino / Institute of Contemporary History

Organizatorji / Organizers

URNIK / TIMETABLE

Sreda / Wednesday, 14. 9. 2022
Inštitut za novejšo zgodovino / Institute of Contemporary History

09.00-09.30 Registracija / Registration
09.30-11.00 Orange delavnica 1. del / Orange Tutorial Part 1 - 1. nadstropje, Stavba A / 1st floor, Building A
11.00-11.30 Odmor za kavo / Coffee break
11.30-13.00 Orange delavnica 2. del / Orange Tutorial Part 2 - 1. nadstropje, Stavba A / 1st floor, Building A
13.00-14.30 Kosilo / Lunch
14.30-15.30 CLARIN delavnica 1. del / CLARIN Tutorial Part 1 - 1.
nadstropje, Stavba A / 1st floor, Building A
15.30-16.00 Odmor za kavo / Coffee break
16.00-17.30 CLARIN delavnica 2. del / CLARIN Tutorial Part 2 - 1. nadstropje, Stavba A / 1st floor, Building A
17.30 Neformalno večerno druženje / Informal dinner

Četrtek / Thursday, 15. 9. 2022
Fakulteta za družbene vede / Faculty of Social Sciences

08.30-09.15 Registracija / Registration - 1. nadstropje / 1st floor
09.15-09.30 Otvoritev / Opening - Soba 20 / Room 20
09.30-10.00 Študentska sekcija SLO / Student Session SLO - Soba 20 / Room 20
- David Bordon: Govoriš nevronsko? Kako ljudje razumemo jezik sodobnih strojnih prevajalnikov
- Špela Antloga: Korpusni pristopi za identifikacijo metafore in metonimije: primer metonimije v korpusu g-KOMET
10.00-11.00 Vabljeno predavanje 1 / Keynote 1 - Soba 20 / Room 20
- Eetu Mäkelä (University of Helsinki): Designing computational systems to support humanities and social sciences research [Abstract]
11.00-11.30 Odmor za kavo / Coffee break
11.30-13.00 Sekcija 1 SLO / Oral Session 1 SLO - Soba 20 / Room 20
- Iztok Kosem, Jaka Čibej, Kaja Dobrovoljc and Nikola Ljubešić: Spremljevalni korpus Trendi: metode, vsebina in kategorizacija besedil
- Eva Pori, Jaka Čibej, Tina Munda, Luka Terčon and Špela Arhar Holdt: Lematizacija in oblikoskladenjsko označevanje korpusa SentiCoref
- Kaja Dobrovoljc, Luka Terčon and Nikola Ljubešić: Universal Dependencies za slovenščino: nadgradnja smernic, učnih podatkov in razčlenjevalnega modela
- Darinka Verdonik, Andreja Bizjak, Andrej Žgank and Simon Dobrišek: Metapodatki o posnetkih in govorcih v govornih virih: primer baze Artur
- Gregor Donaj and Mirjam Sepesy Maučec: Primerjava načinov razcepljanja besed v strojnem prevajanju slovenščina-angleščina
- Tomaž Erjavec, Kaja Dobrovoljc, Darja Fišer, Jan Jona Javoršek, Simon Krek, Taja Kuzman, Cyprian Laskowski, Nikola Ljubešić and Katja Meden: Raziskovalna infrastruktura CLARIN.SI
11.30-13.00 Sekcija 1 ANG / Oral Session 1 ENG - Soba 21 / Room 21
- Jakob Lenardič and Kristina Pahor de Maiti: Slovenian Epistemic and Deontic Modals in Socially Unacceptable Discourse Online
- Jure Skubic and Darja Fišer: Parliamentary Discourse Research in History: Literature Review
- Maja Miličević Petrović, Vuk Batanović, Radoslava Trnavac and Borko Kovačević: Cross-Level Semantic Similarity in newswire texts and software code comments: Insights from Serbian data in the AVANTES project
- Ajda Pretnar Žagar, Nikola Đukić and Rajko Muršič: Document enrichment as a tool for automated interview coding
- Nikola Ljubešić and Peter Rupnik: The ParlaSpeech-HR benchmark for speaker profiling in Croatian
- Marta Petrak, Mia Uremović and Bogdanka Pavelin Lešić: Fine-grained human evaluation of NMT applied to literary text: case study of a French-to-Croatian translation
13.00-13.45 Kosilo / Lunch
13.45-14.30 Predstavitev plakatov ANG / Poster Session with coffee ENG - Predprostor predavalnic, prvo nadstropje / Anteroom of the lecture halls, 1st floor
- Jasna Cindrič, Lara Kuhelj, Sara Sever, Živa Simonišek and Miha Šemen: Data Collection and Definition Annotation for Semantic Relation Extraction
- Katja Meden: Speech-level Sentiment Analysis of Parliamentary Debates using Lexicon-based Approaches
- Vladimir Polomac: Serbian Early Printed Books: Towards Generic Model for Automatic Text Recognition using Transkribus
- Branko Žitko, Lucija Bročić, Angelina Gašpar, Ani Grubišić, Daniel Vasić and Ines Šarić-Grgić: Automatic Predicate Sense Disambiguation Using Syntactic and Semantic Features
- Henna Paakki, Faeze Ghorbanpour and Nitin Sawhney: An approach to computational crisis narrative analysis: a case-study of social media discourse interaction with news narratives about Covid-19 vaccinations in India
- Petra Matović and Katarina Radić: A Parallel Corpus of the New Testament: Digital Philology and Teaching the Classical Languages in Croatia
14.30-16.00 Sekcija 2 SLO / Oral Session 2 SLO - Soba 20 / Room 20
- Špela Arhar Holdt, Polona Gantar, Iztok Kosem, Eva Pori, Nataša Logar Berginc, Vojko Gorjanc and Simon Krek: Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine
- Martin Anton Grad and Nataša Hirci: Raba kolokacijskega slovarja sodobne slovenščine pri prevajanju kolokacij
- Tadeja Rozman and Špela Arhar Holdt: Gradnja Korpusa študentskih besedil KOŠ
- Maja Veselič and Dunja Zorman: Uporaba Europeaninega podatkovnega modela (EDM) pri digitalizaciji kulturne dediščine: primer Skuškove zbirke iz Slovenskega etnografskega muzeja v projektu PAGODE-Europeana China
- Matija Marolt, Mark Žakelj, Alenka Kavčič and Matevž Pesek: Poravnava zvočnih posnetkov s transkripcijami narečnega govora in petja
- Janez Križaj, Simon Dobrišek, Aleš Mihelič, Jerneja Žganec Gros: Zadnji napredki pri samodejni slovenski grafemsko-fonemski pretvorbi
14.30-16.00 Sekcija 2 ANG / Oral Session 2 ENG - Soba 21 / Room 21
- Thi Hong Hanh Tran, Matej Martinc, Andraž Repar, Antoine Doucet and Senja Pollak: A Transformer-based Sequence-labeling Approach to the Slovenian Cross-domain Automatic Term Extraction
- Michal Mochtak, Peter Rupnik and Nikola Ljubešić: The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia
- Petra Bago and Virna Karlić: DirKorp: A Croatian corpus of directive speech acts
- Sara Košutar, Dario Karl, Matea Kramarić and Gordana Hržica: Automatic text analysis in language assessment: developing a MultiDis web application
- Boshko Koloski, Senja Pollak and Matej Martinc: What works for Slovenian? A comparative study of different keyword extraction systems
- Andrejka Žejn, Mojca Šorli: Annotation of Named Entities in the May68 Corpus: NEs in modernist literary texts
19.00-21.00 Konferenčna večerja / Conference dinner

Petek / Friday, 16. 9. 2022
Fakulteta za družbene vede / Faculty of Social Sciences

08.30-09.00 Registracija / Registration - 1.
nadstropje / 1st floor
09.00-10.00 Študentska sekcija ANG / Student Session ENG - Soba 20 / Room 20
- Ružica Farmakovski and Natalija Tomić: Serbo-Croatian Wikipedia between Serbian and Croatian Wikipedia
- Meta Jazbinšek, Teja Hadalin, Sara Sever, Erika Stanković and Eva Boneš: Neural translation model specialized in translating English TED Talks into Slovene
- Uroš Šmajdek, Maj Zirkelbach, Matjaž Zupanič and Meta Jazbinšek: Preparing a corpus and a question answering system for Slovene
- Tvrtko Balić: The CCRU as an Attempt of Doing Philosophy in a Digital World
10.00-11.00 Vabljeno predavanje 2 / Keynote 2 - Soba 20 / Room 20
- Benoît Sagot (INRIA): Large-scale language models: challenges and perspective [Abstract]
11.00-11.30 Odmor za kavo / Coffee break
11.30-12.45 Sekcija 3 ANG / Oral Session 3 ENG - Soba 20 / Room 20
- Taja Kuzman, Nikola Ljubešić and Senja Pollak: Assessing Comparability of Genre Datasets via Cross-Lingual and Cross-Dataset Experiments
- Špela Vintar and Andraž Repar: Human evaluation of machine translations by semi-professionals: Lessons learnt
- Aleksandar Petrovski: A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
- Darja Fišer, Tjaša Konovšek and Andrej Pančur: Populist and Non-Populist Discourse in Slovene Parliament (1992 – 2018)
- Petra Bago: Progress of the RETROGRAM Project: Developing a TEI-like Model for Pre-standard Croatian Grammars
12.45-13.30 Kosilo / Lunch
13.30-14.15 Predstavitev plakatov z odmorom za kavo SLO / Poster Session with coffee SLO - Predprostor predavalnic / Anteroom of the lecture halls
- Tina Mozetič, Miha Sever, Martin Justin and Jasmina Pegan: Evalvacijska kategorizacija strojno izluščenih protipomenskih parov
- Nina Sangawa Hmeljak, Anna Sangawa Hmeljak and Jan Hrastnik: Ilukana - aplikacija za učenje japonskih zlogovnih pisav hiragana in katakana s pomočjo asociacij
- Vili Grdič, Kaja Perme, Lea Turšič and Alja Križanec: Šahovska terminološka baza
- Lucija Gril, Simon Dobrišek and Andrej Žgank:
Akustično modeliranje z različnimi osnovnimi enotami za avtomatsko razpoznavanje slovenskega govora
- Saša Babič and Tomaž Erjavec: Izdelava in analiza digitalizirane zbirke paremioloških enot
- Magdalena Gapsa: Ocenjevanje uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine – pilotna študija
14.15-14.30 Podelitev nagrad in zaključek / Awards & Closing - Soba 20 / Room 20
14.30-16.00 Občni zbor SDJT / SDJT Annual Meeting
Razvoj slovenščine v digitalnem okolju – jezikovni viri in tehnologije: Predstavitev vmesnih rezultatov / Development of Slovene in a Digital Environment – Language Resources and Technologies: presentation of intermediate results - Soba 20 / Room 20

Kazalo / Table of Contents

Predgovor ………………………………………………………………………………………… i
Preface ……………………………………………………………………………………………. ii
Programski odbor / Programme committee ……………………………………………………… iii
Člani programskega odbora / Programme committee members ………………………………… iii
Organizacijski odbor / Organising committee …………………………………………………… vi
Organizatorji / Organizers ………………………………………………………………………… vi
Urnik / Timetable …………………………………………………………………………………
vii
Kazalo / Table of Contents ……………………………………………………………………… xv

VABLJENI PRISPEVKI / INVITED TALKS 1
Designing computational systems to support humanities and social sciences research
Eetu Mäkelä 1
Large-scale language models: challenges and perspective
Benoît Sagot 2

PRISPEVKI – PAPERS 3
The impact of a one-session phonetic training on the improvement of non-native speakers’ pronunciation of English
Amaury Flávio Silva 3
Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine
Špela Arhar Holdt, Polona Gantar, Iztok Kosem, Eva Pori, Nataša Logar, Vojko Gorjanc, Simon Krek 10
Izdelava in analiza digitalizirane zbirke paremioloških enot
Saša Babič, Tomaž Erjavec 17
DirKorp: A Croatian Corpus of Directive Speech Acts
Petra Bago, Virna Karlić 23
Universal Dependencies za slovenščino: nadgradnja smernic, učnih podatkov in razčlenjevalnega modela
Kaja Dobrovoljc, Luka Terčon, Nikola Ljubešić 30
Primerjava načinov razcepljanja besed v strojnem prevajanju slovenščina–angleščina
Gregor Donaj, Mirjam Sepesy Maučec 40
Raziskovalna infrastruktura CLARIN.SI
Tomaž Erjavec, Kaja Dobrovoljc, Darja Fišer, Jan Jona Javoršek, Simon Krek, Taja Kuzman, Cyprian Laskowski, Nikola Ljubešić, Katja Meden 47
ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts
Simon Gonzalez 55
Raba Kolokacijskega slovarja sodobne slovenščine pri prevajanju kolokacij
Martin Anton Grad, Nataša Hirci 63
Akustično modeliranje z različnimi osnovnimi enotami za avtomatsko razpoznavanje slovenskega govora
Lucija Gril, Simon Dobrišek, Andrej Žgank 71
What works for Slovenian?
A comparative study of different keyword extraction systems
Boshko Koloski, Senja Pollak, Matej Martinc 78
Spremljevalni korpus Trendi: metode, vsebina in kategorizacija besedil
Iztok Kosem, Jaka Čibej, Kaja Dobrovoljc, Nikola Ljubešić 86
Automatic Text Analysis in Language Assessment: Developing a MultiDis Web Application
Sara Košutar, Dario Karl, Matea Kramarić, Gordana Hržica 93
Assessing Comparability of Genre Datasets via Cross-Lingual and Cross-Dataset Experiments
Taja Kuzman, Nikola Ljubešić, Senja Pollak 100
Slovenian Epistemic and Deontic Modals in Socially Unacceptable Discourse Online
Jakob Lenardič, Kristina Pahor de Maiti 108
The ParlaSpeech-HR benchmark for speaker profiling in Croatian
Nikola Ljubešić, Peter Rupnik 117
Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project
Maja Miličević Petrović, Vuk Batanović, Radoslava Trnavac, Borko Kovačević 124
The ParlaSent-BCS Dataset of Sentiment-annotated Parliamentary Debates from Bosnia and Herzegovina, Croatia, and Serbia
Michal Mochtak, Peter Rupnik, Nikola Ljubešić 132
Fine-grained human evaluation of NMT applied to literary text: case study of a French-to-Croatian translation
Marta Petrak, Mia Uremović, Bogdanka Pavelin Lešić 141
A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
Aleksandar Petrovski 147
Serbian Early Printed Books: Towards Generic Model for Automatic Text Recognition using Transkribus
Vladimir Polomac 154
Lematizacija in oblikoskladenjsko označevanje korpusa SentiCoref
Eva Pori, Jaka Čibej, Tina Munda, Luka Terčon, Špela Arhar Holdt 162
Document Enrichment as a Tool for Automated Interview Coding
Ajda Pretnar Žagar, Nikola Đukić, Rajko Muršič 169
Parliamentary Discourse Research in History: Literature Review
Jure Skubic, Darja Fišer 177
Annotation of Named Entities in the May68 Corpus: NEs in modernist literary texts
Mojca Šorli, Andrejka Žejn 187
A Transformer-based
Sequence-labeling Approach to the Slovenian Cross-domain Automatic Term Extraction
Thi Hong Hanh Tran, Matej Martinc, Andraž Repar, Antoine Doucet, Senja Pollak 196
Metapodatki o posnetkih in govorcih v govornih virih: primer baze Artur
Darinka Verdonik, Andreja Bizjak, Andrej Žgank, Simon Dobrišek 205
Uporaba Europeaninega podatkovnega modela (EDM) pri digitalizaciji kulturne dediščine: primer Skuškove zbirke iz Slovenskega etnografskega muzeja v projektu PAGODE-Europeana China
Maja Veselič, Dunja Zorman 213
Human Evaluation of Machine Translations by Semi-Professionals: Lessons Learnt
Špela Vintar, Andraž Repar 220
Automatic Predicate Sense Disambiguation Using Syntactic and Semantic Features
Branko Žitko, Lucija Bročić, Angelina Gašpar, Ani Grubišić, Daniel Vasić, Ines Šarić-Grgić 227

POVZETKI – ABSTRACTS 235
Progress of the RETROGRAM Project: Developing a TEI-like Model for Croatian Grammar Books before Illyrism
Petra Bago 235
The CCRU as an Attempt of Doing Philosophy in a Digital World
Tvrtko Balić 239
Referencing the Public by Populist and Non-Populist Parties in the Slovene Parliament
Darja Fišer, Tjaša Konovšek, Andrej Pančur 243
Uporaba postopkov strojnega učenja pri samodejni slovenski grafemsko-fonemski pretvorbi
Janez Križaj, Simon Dobrišek, Aleš Mihelič, Jerneja Žganec Gros 248
Poravnava zvočnih posnetkov s transkripcijami narečnega govora in petja
Matija Marolt, Mark Žakelj, Alenka Kavčič, Matevž Pesek 252
A Parallel Corpus of the New Testament: Digital Philology and Teaching the Classical Languages in Croatia
Petra Matović, Katarina Radić 256
Pre-Processing Terms in Bulgarian from Various Social Sciences and Humanities (SSH) Domains: Status and Challenges
Petya Osenova, Kiril Simov, Yura Konstantinova 258
An Approach to Computational Crisis Narrative Analysis: A Case-study of Social Media Narratives Around the COVID-19 Crisis in India
Henna Paakki, Faeze Ghorbanpour, Nitin Sawhney 263
Gradnja Korpusa študentskih besedil KOŠ
Tadeja Rozman,
Špela Arhar Holdt 267

ŠTUDENTSKI PRISPEVKI – STUDENT PAPERS 271
Korpusni pristopi za identifikacijo metafore in metonimije: primer metonimije v korpusu g-KOMET
Špela Antloga 271
Neural Translation Model Specialized in Translating English TED Talks into Slovene
Eva Boneš, Teja Hadalin, Meta Jazbinšek, Sara Sever, Erika Stanković 278
Govoriš nevronsko? Kako ljudje razumemo jezik sodobnih strojnih prevajalnikov
David Bordon 286
Data Collection and Definition Annotation for Semantic Relation Extraction
Jasna Cindrič, Lara Kuhelj, Sara Sever, Živa Simonišek, Miha Šemen 292
Serbo-Croatian Wikipedia Between Serbian and Croatian Wikipedia
Ružica Farmakovski, Natalija Tomić 300
Ocenjevanje uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine – pilotna študija
Magdalena Gapsa 308
Angleško-slovenska šahovska terminološka baza
Vili Grdič, Alja Križanec, Kaja Perme, Lea Turšič 317
Speech-level Sentiment Analysis of Parliamentary Debates using Lexicon-based Approaches
Katja Meden 323
Evalvacijska kategorizacija strojno izluščenih protipomenskih parov
Tina Mozetič, Miha Sever, Martin Justin, Jasmina Pegan 331
Ilukana – aplikacija za učenje japonskih zlogovnih pisav hiragana in katakana s pomočjo asociacij
Nina Sangawa Hmeljak, Anna Sangawa Hmeljak, Jan Hrastnik 339
Filter nezaželene elektronske pošte za akademski svet
Anja Vrečer 345
Preparing a Corpus and a Question Answering System for Slovene
Matjaž Zupanič, Maj Zirkelbach, Uroš Šmajdek, Meta Jazbinšek 353

Konferenca / Conference on
Jezikovne tehnologije in digitalna humanistika / Language Technologies & Digital Humanities
Ljubljana, 2022

Designing computational systems to support humanities and social sciences research

Eetu Mäkelä
University of Helsinki, Finland
P.O. Box 24, 00014
eetu.makela@helsinki.fi

Abstract
From the viewpoint of the humanities and social sciences, collaborations with computer scientists often fail to deliver.
In my research group, we have tried to understand why this is, and what to do about it. In this talk, I will discuss three key elements that we have discovered.

First, datasets in the humanities and social sciences are often not neatly representative of the object of interest. Systems need to provide ways in which to evaluate and counter the biases, confounders and noise in the data.

Second, there is often a large gap between what is in the data and what would be of interest. This gap needs to be bridged using algorithms, but care must be taken that a) what the algorithm produces actually matches the interest and b) its application does not introduce bias of its own (interestingly, the algorithm performance metrics of interest here often differ from those generally used in NLP/computer science).

Third, on a process level, collaboration between researchers from different disciplines is hard due to discrepancies in expectations relating to all facets of research, from research questions through methodology to the publication of results. Projects and systems need to acknowledge this, and be designed to facilitate iterative movement in the right direction.

Bio
Eetu Mäkelä is an associate professor in Human Sciences–Computing Interaction at the University of Helsinki, and a docent (adjunct professor) in computer science at Aalto University. At the Helsinki Centre for Digital Humanities, he leads a research group that seeks to figure out the technological, processual and theoretical underpinnings of successful computational research in the humanities and social sciences. Additionally, he serves as a technological director at the DARIAH-FI infrastructure for computational humanities and is one of three research programme directors in the datafication research initiative of the Helsinki Institute for Social Sciences and Humanities.
For his work, he has received a total of 19 awards, including multiple best paper awards at conferences and in journals, as well as multiple open data and open science awards. He also has a proven track record in creating systems fit for continued use by their audience.

VABLJENI PRISPEVKI / INVITED TALKS

Large-scale language models: challenges and perspective

Benoît Sagot
Inria Paris (équipe ALMAnaCH)
2 rue Simone Iff CS 42112, 75589 Paris Cedex 12, France
benoit.sagot@inria.fr

Abstract
The emergence of large-scale neural language models in Natural Language Processing (NLP) research and applications has improved the state of the art in most NLP tasks. However, training such models requires enormous computational resources and training data. The characteristics of the training data have an impact on the behaviour of the models trained on it, depending for instance on the data's homogeneity and size. In this talk, I will speak about how we developed the large-scale multilingual OSCAR corpus. I will describe the lessons we learned while training the French language model CamemBERT, the first large-scale monolingual model for a language other than English, especially in terms of the influence of the size and heterogeneity of the training corpus. I will also sketch out a few research questions related to biases in large-scale language models, with a focus on the impact of tokenisation and language imbalance, in the context of the BigScience initiative. I will conclude with my thoughts on the future of language models and their impact on NLP and other data processing fields (speech, vision).

Bio
Benoît Sagot, Directeur de Recherches (Senior Researcher) at Inria, is the head of the Inria project-team ALMAnaCH in Paris, France.
A specialist in natural language processing (NLP) and computational linguistics, his research focuses on language modelling, language resource development, machine translation, text simplification, part-of-speech tagging and parsing, computational morphology and, more recently, digital humanities (computational historical linguistics and historical language processing). He has been the PI or co-PI of a number of national and international projects, and is the holder of a chair in the PRAIRIE institute dedicated to research in artificial intelligence. He is also the co-founder of two start-ups where he uses his expertise in NLP and data mining for the automatic analysis of employee survey results.

PRISPEVKI / PAPERS

The impact of a one-session phonetic training on the improvement of non-native speakers' pronunciation of English

Amaury Flávio Silva
Technology College of Jacareí (FATEC Jacareí), São Paulo, Brazil
Rua Faria Lima, 155 – Jardim Santa Maria, Jacareí – SP, Brazil, Zip Code 12328-070
amaury.silva@fatec.sp.gov.br

Abstract
Due to the difficulties L2¹ learners face regarding pronunciation, we conducted an experiment to find out whether the participants of a one-session phonetic training would present any sign of improvement in their speech a week after the session. In order to evaluate their improvement, we checked whether the interword phonetic phenomena resyllabification, blending and hiding could be found in the subjects' speech. Furthermore, intraword-level pronunciation was also investigated. The findings have shown that improvement related to the presence of resyllabification occurred for all the subjects, but improvement in the other phenomena studied happened heterogeneously. The dataset used during the training session was based on a study developed by Silva (2021), in which he studied examples of coarticulatory effects that we also incorporated in our pronunciation instruction session.

1. Introduction
Until the end of the 20th century, there was a limited number of studies regarding pronunciation (Derwing and Munro, 2005). This negligence is attributed to the fact that pronunciation was considered an aspect of language learning that could be naturally acquired through the learning process. However, since 2005 this viewpoint has been changing, inasmuch as several studies, conferences, and articles about L2 pronunciation have started to arise (Thomson and Derwing, 2014).
Despite the fact that the importance of L2 pronunciation has become more evident, there are still L2 students, teachers and researchers who consider pronunciation teaching unnecessary, as they reckon it can be learnt through exposure.
We regard pronunciation instruction as an essential part of the L2 teaching process. Its essential character becomes more evident when L2 learners, in spite of studying the L2 for many years, still struggle to correctly pronounce the L2 sounds, especially the ones that are not part of their L1 inventory systems. Nonetheless, we do not believe that achieving native-like pronunciation is necessary: one's pronunciation being intelligible enough not to cause misunderstandings or hamper the flow of communication is what should be expected.
Owing to our belief that pronunciation instruction should be part and parcel of L2 language learning, we decided to carry out a study that checks the benefits of a one-session pronunciation training for the improvement of the pronunciation of a group of subjects, Brazilian learners of English as a foreign language.
With regard to this one-session training, we hypothesize that there may be some kind of improvement in the subjects' pronunciation, but that more sessions will be necessary to address all the pronunciation problems they may have. Moreover, the less proficient the students are, the higher the number of sessions necessary to help them deal with their pronunciation problems.

¹ We use the term 'L2' to refer to the teaching of English as a foreign and as a second language.

2. Goal of the paper
This paper, whose goal is to investigate the efficacy of a one-session phonetic training in enhancing the participants' performance in pronunciation tasks, also aims to provide a guideline that L2 teachers could use to help their students improve. Furthermore, we hope that researchers could use the methods applied here to carry out new experiments in this area.

3. Theoretical Background
The increasing number of pronunciation-related studies since 2005 reveals the importance that pronunciation instruction has in the L2 learning process. Not only does it allow learners to become more confident when they speak, it also improves speech intelligibility, as it helps to avoid misunderstandings.
Due to the importance of pronunciation, Thomson and Derwing (2014) wrote an article in which they evaluated 75 L2 pronunciation studies, most of which affirm that there was some kind of improvement in the speakers' pronunciation due to the training they took. The authors point out that diverging results take place owing to a few factors such as 'learner individual differences, goals and foci of instruction, type and duration of instructional input and assessment procedures' (p. 1).
Most of the 75 studies focused on the achievement of native-like pronunciation by the learner and consisted in the use of computer-assisted tools. Moreover, the studies aimed at teaching the pronunciation of individual segments instead of teaching suprasegmental features, which would involve, for instance, resyllabification, prosodic boundaries, word stress, intonation, and speech rate. In order to teach the pronunciation of segments, most of the time the learners were engaged in activities that required them to read texts aloud, instead of producing spontaneous speech.
When it comes to the quality of a pronunciation study, Thomson and Derwing (2014) mention a few features it should have. Firstly, they express their belief that pronunciation instruction should focus on 'helping students become more understandable' (p. 2). From this principle, they point out that an ideal pronunciation study should be able to give plenty of information on the subjects, have enough data to carry out statistical analyses, have a control group, and should not be limited to reading-aloud tasks, i.e., it should also include spontaneous speech samples. Finally, it should include delayed assessment to verify the lasting effect of the pronunciation instruction.
With regard to qualitative analyses, they should encompass aspects such as motivation, type of interactions in the L2 and even social influences (Thomson and Derwing, 2014).
The training input of the studies surveyed, which was either classroom instruction or computer-assisted pronunciation training, ranged from the manipulation of segments (Wang, 2002; Lee, 2009) to providing students with speech samples produced by native speakers so that students could listen to them and compare them with their own productions (Gonzales-Bueno, 1997; Guilloteau, 1997; Weinberg and Knoerr, 2003; Lord, 2005; Pearson et al., 2011). The learners' performances were evaluated by human listeners in 79 per cent of the studies; the other 21 per cent were evaluated using acoustic analyses.
The majority of the pronunciation training studies reviewed by Thomson and Derwing (2014) lacked an explicit theoretical background, so that the pronunciation training was solely based on the researchers' own experience. In our training, we considered the research on reduction phenomena led by Silva (2016, 2021), the findings on coarticulation by Browman and Goldstein (1986, 1989), and the work developed by Vroomen and De Gelder (1999) on resyllabification. We will discuss this theoretical background later in this section.
One important aspect that was not clear in the studies was the procedure taken during the training sessions (training input). The lack of clarity in the methodological procedures prevents other teachers and researchers from replicating the steps used in the studies in their own classes or research. Therefore, a detailed methodological procedure is necessary 'for the benefit of other researchers and teachers' (Thomson and Derwing, 2014, p. 11).
The research on pronunciation training by Thomson and Derwing (2014) revealed that most of the participants showed some kind of improvement after the training. Nonetheless, the majority of studies only focused on the instruction of single sounds such as the contrast of /i:/ and /ɪ/. Should the studies cover several segmental and suprasegmental features, more time would be necessary for the learners to present significant improvement.
Another issue that questions the efficacy of the studies is whether or not the assessment used in them would be reflected in improved intelligibility when language is used in real-life contexts. For this issue to be solved, the studies should focus on 'more intelligible, as opposed to less-accented speech … (and) include a variety of assessment tasks' (Thomson and Derwing, 2014, p. 13-14). Furthermore, the authors state that evaluating the efficacy of the studies in a naturalistic fashion would take years, instead of weeks or months.
We believe that any research should depart from a well-established theoretical standpoint. Hence, since in our analyses we focused on the influence adjacent intra- or interword segments have on one another, we turned to the studies developed by Browman and Goldstein (1986, 1989) on coarticulation.
According to Browman and Goldstein (1986, 1989), adjacent segments may be subjected to the phenomena called blending and hiding. Blending occurs when adjacent segments share the same articulator, so that they cannot be produced without disturbance in their constriction location. An example of this phenomenon takes place when the segments [t] and [ð] from the context 'I want that' have to be produced one after the other. In this context, the constriction location of either segment may be disturbed, as they are both characterized by a tongue tip gesture. Thus, the canonical production of the alveolar plosive and the interdental fricative may be realized as an approximant and as a dental fricative, respectively.
Hiding occurs when adjacent segments do not share the same articulator, so that the production of the first segment is overlapped by the production of the second one. Such a phenomenon may occur when the segments [t] and [b] from the context 'I can't buy it' have to be produced one after the other. When this happens, the gesture of mouth closure to produce the bilabial consonant 'hides' the burst that would be caused by the release of the alveolar plosive.
Being aware of how these phenomena work allows speakers to reduce articulatory effort when they speak, as the excursion of the articulators is decreased. The reduction in articulatory effort was studied by Silva (2016, 2021). In his investigations, he noticed that reduction is a strategy commonly used by native speakers, which can be characterized by the replacement of a segment that calls for a high excursion of the articulators by one that does not (low-hierarchy reduction). Reduction can also be characterized by a segment deletion (high-hierarchy reduction).
Another phenomenon that causes a reduction in articulatory effort is resyllabification. It happens when 'consonants are attached to syllables other than those from which they originally came' (Vroomen and De Gelder, 1999, p. 413). An example of this phenomenon is the sentence 'you can evaluate this', in which the consonant /n/ of the word 'can' is coarticulated with the vowel /ɪ/ of the word 'evaluate.' This process contributes to maintaining the speech flow, as the speaker does not need to add a pause between adjacent words.
The analyses carried out in this study, as well as the concepts explained during the training session, were based on the phenomena blending and hiding (Browman and Goldstein, 1986, 1989), reduction (Silva, 2016, 2021), and resyllabification (Vroomen and De Gelder, 1999).

4. Methods
In this section, we will describe details related to the subjects that participated in the study, the research dataset, the acoustic inspection and the training session.

4.1 Subjects
In order to conduct the analysis, we had the participation of four subjects, native speakers of Brazilian Portuguese (three males and one female), who study English as a foreign language. The subject 'English' is part of the Technological course the subjects were taking, and all of them were enrolled in the same class, taking the third semester. It is important to point out that English is offered throughout the duration of the course, six semesters, and, despite the fact that all the students were in the same class, their proficiency level was not the same.
The four participants will be referred to as subjects, 'S,' in this investigation.

4.2 Research dataset
The research dataset, Table 1, is an extract from the program Actors' Studio (season 12, episode 13, released in July 2006) that was sent to the subjects, who had to record it and send it to the trainer before the training session. After the session, they would record it once more and send it to the trainer again so that their improvement could be analyzed. We would like to point out that in our experiment, we asked the subjects to use their own smartphones or computers to record the dataset. This was done as they could not come to college to record it in its sound laboratory due to the restrictions related to the COVID-19 pandemic.
The same text was used in the pre- and post-training phase, as we aimed to analyze whether or not improvement could be observed in the second recording in terms of the group of words we selected that encompass the phenomena described in Tables 2-4.

It's funny, you know, someone comes into your life at a certain time and that's one of the great things that happens on Earth is you're mysteriously guided towards these people that you get to dance with, you know. And I thought "How great is that", he's kind of, like, I don't want to say an angel to her, but he's someone who needs as much as he's prepared to offer, and he has seen a lot of life, and he's not a typical lawyer-type.

Table 1: Research dataset

Using the dataset above, we selected fragments in which the phenomena resyllabification, blending and hiding could take place. Furthermore, we also analyzed the pronunciation of a group of words that the students mispronounced in the pre-training recording.
The phenomenon resyllabification was investigated in 11 contexts, presented in the next table.

Contexts and phonemes involved:
'comes into': /z/ and /ɪ/
'at a certain': /t̬/ and /ə/
'one of': /n/ and /ə/
'on Earth': /n/ and /ɜr/
'and I': /d/ and /aɪ/
'great is': /t̬/ and /ɪ/
'kind of': /d/ and /ə/
'an angel': /n/ and /eɪ/
'as much as': /tʃ/ and /ə/
'seen a': /n/ and /ə/
'lot of life': /t̬/ and /ə/

Table 2: Resyllabification phenomenon

With regard to the phenomena blending and hiding, we analyzed eight contexts, presented in the next table.

Contexts and phonemes involved:
'certain time': /n/ and /t/
'great things': /t/ and /θ/
'guided towards': /d/ and /t/
'get to dance': /t/ and /t/
'prepared to': /d/ and /t/
'typical lawyer': /l/ and /l/
'these people': /z/ and /p/
'I don't want': /t/ and /w/

Table 3: Blending and hiding phenomena

Lastly, when it comes to word-level pronunciation, the words presented in the table below were investigated.

Words and pronunciation errors found:
'someone': phoneme substitution and insertion of a phoneme
'certain': phoneme substitution and word stress
'mysteriously': phoneme substitution and word stress
'towards': phoneme substitution
'thought': phoneme substitution
'offer': phoneme substitution and word stress
'lawyer': phoneme substitution and word stress

Table 4: Word-level pronunciation

4.3 Phonetic inspections
The phonetic inspection was carried out with the free software PRAAT, version 6.0.39, developed by Paul Boersma and David Weenink (2018) at the Institute of Phonetic Sciences of the University of Amsterdam. The inspections were based on the observation of the waveform, the broadband spectrogram, the fundamental frequency and the intensity of the phonemic segments.
PRISPEVKI 5 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 4.4 Training Session phase is provided and one at the end of section 5.2 with all The training took place in a single 50-minute session of the contexts analyzed in the post-training phase is available. an online class. It was recorded so that the subjects could We reckon it is important to point out that the subjects reported that they recorded the dataset several times and revisit it as many times as they wanted in order to review that they sent us the version they judged to be the best. the concepts explained. The training session the subjects participated was 5.1. Pre-training analyses provided by the researcher of this work. At the beginning of the training, which took place after In this section, we will present the analyses that refer to the first recording of the dataset was sent by the subjects, the pre-training recordings. The first one refers to the the original recording of the dataset was played, and the context ‘and I,’ resyllabification. corresponding script was projected on the screen for the subjects to follow it. The recording was played three times. After that, the concept of resyllabification was explained and the first context where such phenomenon occurred according to table 2, ‘comes into’, was presented to the subjects (the orthography along with the recording). The context was played three times. The subjects were asked to pay close attention to the recording of the context as they would have to repeat it afterwards. If they could not repeat it, the trainer would repeat the context himself at least three more times in order Figure 1: Production of ‘and I’ by L1, pre-training to assist the subjects grasp what and how they should say it. 
Through the analysis of the broadband Before moving on to the next context, the original spectrogram and its corresponding waveform above, we recording was played one more time and the subjects were can infer that there was no pause between the production of asked to repeat it. Not until all the subjects were able to the adjacent segments [d] and [ay] so the phenomenon repeat the context intelligibly, would the trainer teach the resyllabification was observed. next context. The procedure described above was followed to teach the other contexts including resyllabification, blending and hiding phenomena. Word-level pronunciation instruction followed the steps related to playing the recording three times before repetition. However, after analyzing the first recording, we reckoned the need to teach word stress and phoneme pronunciation. It is important to point out that we did not use technical terms during the training as our focus was simply on improving their pronunciation. When it comes to the difficulty the subjects presented Figure 2: On Earth – S1, pre-training – Figure shows to pronounce a word or group of words, the trainer noticed pause between the words ‘on’ and “Earth’ that it was necessary to teach the articulation of some phonemes, especially the ones not present in the subjects’ In the production of ‘on Earth,’ figure above, L1inventory system. After the instruction of the articulation there was a pause between the segments [n] and [ɜr] so of such phonemes, improvements could be observed in that the phenomenon resyllabification did not take place. their pronunciation. The subjects, after the training session, had access to the original recording of the dataset and to a version recorded by the trainer, which was produced with a slower speech rate so that it could be helpful to less proficient subjects. These recordings were tools the subjects could use to improve their pronunciation before making the second recording that had to be sent within a week. 
Once all the subjects had sent their recordings, we started the data analysis, whose results are presented in the next section. 5. Data analyses Figure 3: Production of ‘lawyer’ by S2 The analyses in this chapter will feature figures that The figure above, which presents acoustic information, contain the waveform, spectrogram, segmentation, and shows that the subject mispronounced the word lawyer in spelling of a selection of the contexts investigated. that [lɔwər/] was produced instead of [lɔɪər/]. However, at the end of section 5.1, a table with a summary of all the contexts investigated during the pre-training PRISPEVKI 6 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 A summary of the data analyses that refer to the pre-training recordings is presented in the table below. The word or group of words written indicates that the phenomenon in the corresponding column was observed in their production. Pre-training Subje Resyllabifica Blending/Hi Word-level cts tion ding pronunciati (group of (group of on Figure 4: Concatenated productions of ‘on Earth’ by S1. words in words in (mispronoun Post-training left and pre-training right. which the which the ced words) phenomenon phenomena The concatenated productions of ‘on Earth’ by S1, was were presented in the figure above, demonstrate that the observed) observed) phenomenon resyllabification was observed in the post- S1 ‘and I’ ‘guided All the training recording, but not in the pre-training recording. ‘great is’ towards’ words were This fact is confirmed by the absence of pause between the ‘kind of’ ‘typical mispronoun segments [n] and [ɜr] in the post-training phase that did not ‘lot of life’ lawyer’ ced except occur in the pre-training phase as a pause is present in the ‘these ‘someone’ spectrogram. 
people’ and ‘offer’ ‘I don’t want to’ S2 ‘at a certain’ ‘great things’ ‘mysteriousl ‘one of’ ‘guided y’ ‘on Earth’ towards’ ‘thought’ ‘and I’ ‘get to dance’ ‘lawyer’ ‘kind of’ ‘prepared to’ ‘lot of’ ‘typical lawyer’ ‘these people’ S3 ‘comes into’ ‘certain time’ ‘mysteriousl Figure 5: Concatenated productions of the post and pre- ‘at a certain’ ‘get to dance’ y’ training versions of ‘great is’ by S2 ‘one of’ ‘prepared to’ ‘thought’ ‘on Earth’ ‘these ‘lawyer’ As shown in the analysis of the context ‘on Earth,’ ‘and I’ people’ figure 4, in the context ‘great is’ by S2, figure above, the ‘great is’ phenomenon resyllabification was observed in the post- training recording, but not in the pre-training one. S4 All the All the ‘thought’ contexts contexts ‘offer’ except ‘and I’ except ‘lawyer’ ‘certain time’ Table 5: Data analyses concerning the pre-training recordings 5.2. Post-training analysis In this section, we will present the analyses that refer to the post-training recordings. The first one refers to the Figure 6: Production of the word ‘offer’ by S4 context ‘on Earth,’ resyllabification. The analysis of the production of the context ‘offer,’ produced by S4, shows that the word stress was placed on the syllable ‘-fer’ instead of the syllable ‘of-’, which is where the correct stress for the word ‘offer’ should occur. The word stress on the syllable ‘-fer’ can be confirmed not only by the higher duration of the segment [ɛr], but also the higher intensity of this segment in comparison to the segment [ɔ]. What’s more, S4 used the segment [ɛr] instead of /ər/ in the second syllable. PRISPEVKI 7 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 was observed in terms of resyllabification for S4, but they A summary of the data analyses that refer to the post- had already presented excellent performance of this training recordings are presented in the table below. 
The strategy as there was only one context where it was not word or group of words written indicates that the applied. phenomenon in the corresponding column was observed in The presence of the phenomena blending, and hiding their production. was found in the production of S1 in most of the contexts and in all the contexts produced by S4 in the post-training Post-training recording. Such phenomena were noticed in fewer contexts Subje Resyllabifica Blending/Hi Word-level in the production of S2 and in the same number in the cts tion ding pronunciati production of S3 in the post-training recording. (group of (group of on With regard to the last feature analyzed after the post- words in words in (mispronoun training session, word-level pronunciation, no which the which the ced words) improvements were observed in the production of S1, S2 phenomenon phenomena made one more mistake and S3 improved the production of was were the word ‘lawyer’ but mispronounced a word he had observed) observed) produced correctly in the pre-training session, ‘someone’. S1 All ‘certain time’ All the S4 improved the production of the word ‘lawyer’ but the contexts ‘great things’ words were continued mispronouncing the words ‘thought’ and ‘offer’. ‘get to dance’ mispronoun Our findings have revealed different levels of ‘typical ced except improvement in the subjects’ performance so that S1 is the lawyer’ ‘someone’ one who presented the most improvement. S2 and S3’s ‘these and ‘offer’ performances betterment was limited to the presence of the people’ resyllabification phenomenon. S4 is the most proficient ‘I don’t want subject who presented only a few mistakes in the pre- to’ training recording and was able to use the phenomena S2 All the ‘certain time’ ‘someone’ blending and hiding in all the contexts and to improve the contexts ‘great things’ ‘mysteriousl pronunciation of a word after training. 
‘prepared to’ y’ The hypothesis we presented at the beginning of our ‘these ‘thought’ work was confirmed as the subjects’ pronunciation was people’ ‘lawyer’ somehow improved, but more sessions are necessary to address certain pronunciation problems such as word-level S3 ‘comes into’ ‘certain time’ ‘someone’ pronunciation and the phenomena hiding and blending. ‘at a certain’ ‘great things’ ‘mysteriousl In future studies, we could ask the subjects to report on ‘one of’ ‘get to dance’ y’ the time they have dedicated to study and practice the ‘on Earth’ ‘prepared to’ ‘thought’ pronunciation concepts studied during the training session. ‘and I’ Furthermore, we could ask judges to evaluate the students’ ‘great is’ performance before and after the training session to find out ‘kind of’ if a perceptual betterment in their pronunciation was clear, ‘a lot of life’ i.e., if the level of intelligibility was enhanced. S4 All the All the ‘thought’ We believe vehemently that, although the number of contexts contexts ‘offer’ participants was not adequate through a quantitative except ‘and I’ standpoint as our aim was to conduct a qualitative investigation, the study has shown that improvement did Table 6: Data analyses concerning the post-training occur, bringing to light the importance of phonetic recordings instruction. Moreover, we expect that the procedure we used during the training session was clear enough so the 6. Discussion study could be replicated by other researchers. Lastly, we hope to continue our investigation by providing the subjects with more training sessions, evaluate The analyses have shown that the one-session them at least five months after the first training session and phonetic training was useful to help the subjects improve have more participants so we could carry out statistical their pronunciation with regard to the resyllabification analysis. phenomenon. 
Nevertheless, no homogeneous improvement was observed in terms of the remaining phenomena analyzed. 7. Reference The observed improvement in the resyllabification feature in the production of S1 and S2 was characterized by the use of this strategy in the production of all the contexts Paul Boersma, David Weenink. 2018. Praat: doing analyzed in the post-training recording, fact not observed phonetics by computer, version 6.0.39. Available at: in the pre-training one. S3 also demonstrated improvement . Access on: 2 Dec. 2018 in the use of this strategy in that it was used in two more Catherine Browman, Louis Goldstein. 1986. Towards an contexts in the post-training recording. No improvement articulatory phonology. Phonology, v. 3, pages. 219- 252. PRISPEVKI 8 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 ___________. Articulatory gestures as phonological units. English syllable margins, In: Levis J. and LeVelle K. Phonology, v. 6, 1989, pages 201-251. (eds). Proceedings of the 2nd Pronunciation in Second Tracey M. Derwing, M, Murray J. Munro.2005. Second Language Learning and Teaching Conference, Iowa language accent and pronunciation teaching: A research- State University, pages 169-180. based approach. TESOL Quarterly 16/1: pages 71-77. Amaury F. Silva. 2021. Coarticulatory phenomena Manuela Gonzales-Bueno. 1997. The effects of formal analysis in English based on the articulatory phonology. instruction on the acquisition of Spanish stop São Paulo. CBTecLe v.1, n.1. consonants. Contemporary Perspectives on the ____________.2016. Percepção de reduções em inglês Acquisition of Spanish 2: pages 57-75. como L2. Unpublished Ph.D. thesis, PUC-SP. Nancy Clarke Guilloteau. 1997. Modification of phonetic Ron I. Thomson, Tracey M. Derwing. 2014. 
Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine

Špela Arhar Holdt,*‡ Polona Gantar,* Iztok Kosem,* Eva Pori,* Nataša Logar,** Vojko Gorjanc,* Simon Krek*

* Filozofska fakulteta, Univerza v Ljubljani, Aškerčeva 2, 1000 Ljubljana
apolonija.gantar@ff.uni-lj.si, iztok.kosem@ff.uni-lj.si, eva.pori@ff.uni-lj.si, vojko.gorjanc@ff.uni-lj.si, simon.krek@ff.uni-lj.si

** Fakulteta za družbene vede, Univerza v Ljubljani, Kardeljeva ploščad 5, 1000 Ljubljana
natasa.logar@fdv.uni-lj.si

‡ Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, Večna pot 113, 1000 Ljubljana
spela.arharholdt@fri.uni-lj.si

Povzetek
V prispevku predstavljamo rešitve za identifikacijo in označevanje sovražnega ter grobega besedišča v okviru koncepta odzivnega Slovarja sopomenk sodobne slovenščine. Ker gre za prvi tovrstni projekt, so pripravljene rešitve v veliki meri inovativne, umeščene pa v okvir problematike avtomatske strojne izdelave slovarja, njegove odprtosti in vključenosti uporabniške skupnosti. Prispevek prikazuje identifikacijo sovražnega in grobega besedišča ter pripis oznak oziroma opozorilnih ikon z daljšimi pojasnili. Oznake temeljijo na sporočanjskem namenu oziroma učinku, pri čemer je njihovo bistvo informacija o možnih posledicah rabe. Pri označevanju tako kot pri izdelavi celotnega slovarja posvečamo veliko pozornost digitalnemu mediju in vizualizaciji rešitev v njem. Ker je odzivnost eden ključnih konceptov slovarja, se tudi pri rešitvah glede označevanja zavedamo pomembnosti sodelovanja z uporabniško skupnostjo, zato predlagamo še rešitve za sodelovanje s skupnostjo pri dodajanju oznak.
Extremely Offensive and Vulgar Vocabulary in the Responsive Thesaurus of Modern Slovene

In the paper we present the solutions for the identification and annotation of extremely offensive and vulgar vocabulary in the responsive Thesaurus of Modern Slovene. As this is the first project of its kind, the prepared solutions are to a great extent innovative, and have been devised considering the use of automatic methods in dictionary compilation, the open-access nature of the dictionary data, and the inclusion of users in the compilation process. The paper describes the process of identification of extremely offensive and vulgar vocabulary, as well as the attribution of labels and warning icons containing longer explanations. The labels are based on their communicative purpose or effect, and are focused on providing information about the potential consequences of word use. During the processes of labelling and dictionary compilation, considerable attention is paid to the digital medium and related visualisation solutions. As responsiveness is one of the key concepts of the dictionary, a part of preparing the labelling solutions was to design ways of including the user community in labelling.

1. Uvod

Slovar sopomenk sodobne slovenščine (SSSS) je oblikovan po modelu odzivnega slovarja: v prvem koraku je bil pripravljen strojno, nadaljnje urejanje podatkov pa poteka po korakih in v sodelovanju jezikoslovcev ter širše zainteresirane skupnosti (Arhar Holdt et al., 2018: 404). V SSSS lahko slovarski uporabniki ob strojno pripravljeno sopomensko gradivo dodajo lastne predloge sopomenk, za vse sopomenke v slovarju pa je mogoče tudi glasovati in gradivo na tak način (pomagati) potrditi ali zavrniti.1 Vključevanje strojnih postopkov in predlogov uporabniške skupnosti v slovaropisne delotoke odgovarja na potrebe sodobnega časa, kot sta potreba skupnosti po odprto dostopnih jezikovnih podatkih in želja slovarskih uporabnikov po demokratičnem sodelovanju pri razvoju temeljne jezikovne infrastrukture. Na drugi strani pa ima neposredno objavljanje strojnega in uporabniško dodanega (nepregledanega) gradiva lahko tudi neželene posledice, ki jih je treba pri razvoju odzivnega modela predvideti in ustrezno obravnavati. Med prioritetami za razvoj SSSS je tako brez dvoma obravnava besedišča, ki vrednostno poimenuje posamezne družbene skupine in njihove pripadnike. Tako besedišče se trenutno v slovarju (lahko) pojavlja na različnih mestih in na različne načine.

Namen prispevka je predstaviti obseg problematike, ki se pri odzivnem slovarju pomembno razlikuje od tradicionalnih slovaropisnih projektov, in opisati rešitve, ki bodo vključene v prihajajočo nadgradnjo SSSS. Med temi želimo posebej izpostaviti nove načine identifikacije in označevanja sovražnega, grobega ter drugače negativno vrednotenega besedišča, ki presegajo SSSS in so uporabni za različne sodobne jezikovne vire.

1 Slovar v vmesniku je na https://viri.cjvt.si/sopomenke/slv/, slovarska baza pa na repozitoriju CLARIN.SI (Krek et al., 2018). Strojno pripravo slovarja opisujejo Krek et al. (2017), koncept odzivnega slovarja pa Arhar Holdt et al. (2018).

2. Sovražno, grobo, tabuizirano, zaničljivo … v družbi, jeziku in slovarju

Na kratko je mogoče sovražni govor opredeliti kot "aktivno javno spodbujanje antipatije do določene, ponavadi šibke, družbene skupine" (Rebolj, 2008: 13), v daljši in bolj povedni obliki pa kot (Petković in Kogovšek Šalamon, 2007: 23):

ustno ali pisno izražanje diskriminatornih stališč. Z njim širimo, spodbujamo, promoviramo ali opravičujemo rasno sovraštvo, ksenofobijo, homofobijo, antisemitizem, seksizem in druge oblike sovraštva, ki temeljijo na nestrpnosti. Mednje sodi tudi nestrpnost, ki se izraža z agresivnim nacionalizmom in etnocentrizmom, z diskriminacijo in sovražnostjo zoper manjšine, migrante in migrantke.

Žrtve sovražnega govora praviloma niso posamezniki, pač pa ranljive družbene skupine. V osrčju sovražnega govora je prepričanje, da so nekateri ljudje manj vredni, zato je cilj sovražnega govora v razčlovečenju, ponižanju, ustrahovanju in poslabšanju družbenega položaja tistih, proti katerim je naperjen.

Motl in Bajt (2016: 7) ugotavljata, da je sovražni govor deležen precejšnje pozornosti v različnih vedah, od prava, sociologije in komunikologije do psihiatrije in informatike, pridružimo pa jim lahko tudi jezikoslovje – predvsem jezikoslovje, povezano s slovarji. Ameriško slovaropisje (Hughes, 2011: 3. pogl.) je že pred desetletji v svoje vire načrtno vgradilo tudi občutljivost do ranljivih družbenih skupin, pri čemer ni zanemarilo nobenega od delov geselskega članka: razlag, oznak in zgledov rabe (Logar et al., 2020: 104). V manjši meri in pozneje, a vendarle so se opozorila o nujni tovrstni družbeni občutljivosti ter odgovornosti pojavila tudi v slovenskem prostoru (npr. Gorjanc, 2005; Kern, 2015; Logar et al., 2020: 91, 104), a jih kljub temu do sedaj ni polno upošteval še noben slovarski projekt.

Ni pa zgolj sovražni govor tisti, ki ga je treba v slovarjih obravnavati posebej pozorno. Kritično slovaropisje opozarja, da je treba pri slovarskih opisih izrecne (in nove) rešitve iskati pri vseh elementih, ki prinašajo vljudne in nevljudne vidike jezika, tabuiziranost, so usmerjeni v vrednotenje, konotacijo, kulturne aluzije ipd., še posebej pa je treba biti pozoren na nestabilna in spreminjajoča se poimenovanja vseh oblik drugosti (Moon, 2014: 85). Pri tem se sodobno slovaropisje ne more sklicevati na tradicionalne modele jezikovnega opisovanja in delovanja. Nikakor pri tem ni sprejemljivo tradicionalno razmišljanje, da "je slovar metajezikovni odsev dejanske hierarhizirane konceptualizacije sveta" (Vidovič Muha, 2013: 7), kar vodi v razpravljanje o resnicah v okviru slovaropisnega dela – prav nasprotno: slovaropisje mora jasno naslavljati vprašanja, ki so v svojem bistvu ideološka, saj gre za "uravnoteževanje opisa tega, kar prinašajo podatki glede pomena, s tem, na kakšen način 'naj bi bil' v postmoderni vključujoči družbi določen koncept obravnavan in predstavljen" (Moon, 2014: 89). Gre torej za to, da pri slovaropisnem delu končne rešitve preprosto ne morejo biti "samo jezikoslovne; neizogibno morajo biti tudi ideološke" (Moon, 2014: 94). Pomembno je, da se ideološkosti pri slovarskih opisih zavedamo, da odkrito in jasno povemo, da je slovaropisno delo težavno prav zato, ker je tudi ideološko (Gantar, 2015: 399), še posebej pri družbeno občutljivih elementih slovarja.

3. Problemi trenutnega SSSS

SSSS je pripravljen strojno in je trenutno na voljo v prvi, nepregledani različici, v kateri so kot iztočnice in sopomenke navedene leme (brez besednih vrst), pomensko členitev in opis začasno nadomeščajo strojno pripravljene pomenske gruče, slovar pa tudi ne vsebuje slovarskih oznak, razen področnih.

Navedene značilnosti imajo več posledic. Na eni strani se strojno pripravljene iztočnice in sopomenski kandidati pojavljajo brez oznak ali opozoril tudi pri izrazito problematičnih primerih, kot je npr. iztočnica buzi s sopomenkami peder, buzerant, toplovodar, homič, poženščen moški. Na drugi strani je problem potencialno zavajajoča (ne)zastopanost sopomenskega gradiva, npr. vse sopomenke, ki jih najdemo pri iztočnici zmaj – ksantipa, vešča, strupenjača, babura, coprnica, pošast, kričava ženska – so vezane na ženski spol in imajo izrazito negativno konotacijo, čeprav se beseda rabi tudi za moške in (v drugem pomenu) tudi s pozitivno konotacijo.

Tudi kolokacije in zgledi, ki so namenjeni primerjavi rabe dveh sopomenk, so iz referenčnega korpusa izvoženi strojno in so v slovarju brez oznak. Posledica je lahko sopostavitev pomensko neustreznih podatkov, npr. pri primerjavi besed ženska – kura najdemo prekrivne kolokacije [stara, prava, gola] ženska in [stara, prava, gola] kura ali ženska [brez glave, v postelji, na odru] in kura [brez glave, v postelji, na odru]. Korpusni zgledi načeloma pomagajo razdvoumiti problematične primere, vendar niso na voljo za vse primerjane besede, zgledi, ki so na voljo, pa niso izbrani po vsebinskih kriterijih. To je zlasti problematično pri sovražnem besedišču, npr. kolokacije [sovražiti, tepsti, ubiti] pedra ali zgledi tipa In reskiral sem celo, da bi me imel za pedra.

Pri uporabniško predlaganih sopomenkah ločujemo na eni strani zlonamerne vnose, kot je npr. uporabniški vpis aljaz pri iztočnici gej. Za takšne primere bi bilo treba določiti natančno uredniško politiko za sprotno obravnavo na ravni vmesnika. Na drugi strani uporabniki zaznamovano besedišče dodajajo kot dejanski sopomenski predlog, npr. pri iztočnici južnjak, kjer so uporabniki dodali dolg niz predlogov, mdr. jugovič, južni brat, jugič, trenirkar, bosanec, z juga. Uredniška naloga je presoditi, kateri predlogi so relevantni za vključitev v slovarsko bazo (in s katerimi oznakami), že uporabnikom pa omogočiti, da problematično besedišče označijo kot tako, da se torej oznaka v vmesniku prikaže istočasno kot dodana sopomenka.

Besedišče, ki je problematično na ravni same leme, je mogoče označiti že v obstoječi različici slovarja. Primeri, pri katerih je oznaka vezana na posamezen pomen besede ali specifičen kontekst rabe, pa zahtevajo predhodno pomensko členitev ter z njo povezan slovaropisni pregled kolokacij in zgledov rabe.

V projektu Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL bomo uresničili dva cilja: (a) identificirali besedišče, ki je problematično na ravni leme, in ga označili po celotnem slovarju SSSS, ter (b) dodali v slovarski vmesnik možnost, da uporabniki sami označijo svoje predloge. V nadaljevanju natančneje pojasnjujemo, kako.

4. Identifikacija problematičnega besedišča

4.1. Slovaropisna izhodišča in sistem oznak

Prepoznavanje potencialnega, z vidika družbene občutljivosti problematičnega besedišča temelji na slovaropisnih izhodiščih, ki smo jih pred nekaj leti pripravili za slovarske vire na CJVT UL, prvič pa začeli uporabljati pri izdelavi Velikega slovensko-madžarskega slovarja (Kosem et al., 2018a). V izhodišča je vključeno prepoznavanje elementov sovražnega govora (oznaka sovražno), elementov nevljudnosti, žaljivosti (grobo) ter elementov negativnega vrednotenja ali konotacije (izraža negativen odnos). Omenjene oznake sodijo v širši okvir t. i. sporočanjskih oznak,2 ki opredeljujejo izraze ali pomene z vidika njihove rabe v sporočanjskem procesu in v situacijah, v katerih sporočanje poteka. V predlaganem slovaropisnem opisu so sporočanjske oznake namenjene označevanju izrazov, z izbiro katerih govorci dosegamo ali želimo doseči določen učinek pri naslovniku. Ta učinek je lahko povzročen s pozitivnim ali negativnim vrednotenjem, z uporabo v določenem govornem položaju (npr. javnem, nejavnem) ali z namenom izraziti odnos do predmetnosti ali vsebine, ki temelji na določenih družbenih normah, pričakovanjih in odstopanjih od njih.

Ta sistem se od tradicionalnega označevanja besed na podlagi odnosa do knjižne norme, kot ga pozna SSKJ (t. i. stilno-zvrstni in ekspresivni kvalifikatorji), ločuje v kvalificiranju besedišča na podlagi sporočanjskega namena oz. učinka, pri čemer izhodišče kvalificiranja ni v opozarjanju na odstop od knjižne norme, pač pa v informiranju glede možnih posledic rabe. S takim sistemom se želimo izogniti morebitnemu kvalificiranju govorca samega, hkrati pa opozoriti na kontekst potencialno problematične rabe v informativnem smislu. To pomeni, da ne želimo uporabnikov slovarja obveščati samo o možnih učinkih rabe grobega in sovražnega besedišča, pač pa pokazati tudi na okoliščine, v katerih je tako rabo mogoče prepoznati.

V slovarskem sistemu oznak označujemo z oznako sovražno izraze in pomene, ki so diskriminatorni, ksenofobični, rasistični in homofobični, ki so uperjeni proti predstavnikom skupin ali manjšin na podlagi njihove narodnosti, rase ali etničnega porekla, verskega prepričanja, spola, zdravstvenega stanja, spolne usmerjenosti, invalidnosti, gmotnega stanja, izobrazbe, družbenega položaja ter drugih lastnosti in prepričanj. Z oznako sovražno se torej opredeljujemo do vseh izrazov, ki spodbujajo sovraštvo, predsodke ali nestrpnost in s tem lahko predstavljajo – kot je bilo opredeljeno že v razdelku 2 – elemente sovražnega govora.

Na drugi strani z oznako grobo označujemo izraze ali pomene, ki so za naslovnika lahko žaljivi, z vidika družbenih in moralnih norm pa neprimerni. Tipično se nanašajo na človeško ali živalsko telo, spolnost, prehranjevanje in izločanje – zlasti torej na tabuizirano predmetnost.

Tretji sklop predstavlja besedišče, ki izraža neodobravanje, nenaklonjenost, posmehljivost ali kritiko do lastnosti posameznikov, predmetov ali dejanj. Z oznako izraža negativen odnos želimo tako opozoriti na izraze z izrazito negativno konotacijo ali vrednotenjem, ki so lahko za naslovnika žaljivi ali neprijetni.

2 Celotni sistem označevanja, ki ga razvijamo v okviru virov CJVT UL, poleg sporočanjskih oznak, ki jih notranje členimo na vrednotenjske, registrske in stilne, zajema še nabor pragmatičnih, kontekstualnih, področnih, slovničnih, časovnih in trendovskih oznak ter nabor oznak, vezanih na tuja poimenovanja in prevodne ustreznice.

4.2. Ročni pregled gradiva

Potencialno problematično besedišče v SSSS smo identificirali z ročnim pregledom iztočnic in sopomenk v slovarju. Na projektu smo se omejili na slovarske (jedrne in bližnje) sopomenke, saj pregled uporabniških predlogov zahteva dodatne uredniške premisleke in bo zato opravljen kasneje s prilagojeno metodologijo. Zaradi obilja gradiva smo delo organizirali v dva koraka: širši pregled, v katerem smo v grobem ločili potencialno problematično in neproblematično gradivo, nato pa natančnejši pregled problematičnih primerov.

Najprej smo iz slovarske baze izvozili nize sopomenk, urejenih na podlagi pomenskih gruč (Krek et al., 2017), npr. speljati se; izginiti; pobrati se; skidati se; spokati se; spizditi, pri čemer smo odstranili nize, ki so se glede nabora sopomenk podvajali, in tiste, ki so bili podmnožica kakega drugega niza. Na tak način smo pripravili 65.615 nizov različne dolžine: od posameznih sopomenskih parov do zelo dolgih nizov, ki pa so redki: več kot 30 sopomenk vsebuje le 156 nizov, povprečje je 5 sopomenk na niz.

Čeprav strojno pomensko gručenje ni povsem natančno in se razlikuje od slovaropisne pomenske členitve, tovrstna organizacija podatkov dobro naslovi dva pomembna problema: (a) tak pristop bistveno pohitri pregledovanje, kot bo razvidno v nadaljevanju; (b) presojanje je lahko bolj natančno, saj problematičnost posamezne leme nakazujejo ostale besede v nizu, prim. npr. nategniti v nizu raztegniti; dilatirati; iztegniti; nategniti; pogrniti; razgrniti; razmakniti; razpreti; razprostreti; razviti; napeti; zavlačevati z; razpeti; prolongirati in v nizu pokavsati; nategniti; povaljati; porivati; pofukati; pojahati.

Iz množice 65.615 nizov smo najprej umaknili 24.945 nizov (38,0 %), pri katerih sopomenke vsebujejo področne oznake, npr. odbojnik, deflektor, ločilnik, membrana, opna, odbojna pregrada, zvočna stena z oznako elektrika (ker so ti podatki terminološke narave, smo predvidevali zanemarljivo nizko vsebnost problematičnega besedišča in smo jih pustili za hiter pregled ob koncu naloge); ostalo je 496 nizov (0,8 %), ki vsebujejo lastnoimenske samostalnike, npr. Antarktika, antarktično območje, južno polarno območje, in 40.176 (61,2 %) občnoimenskih nizov, vsi relevantni za ročni pregled.

Podatke so pregledovali študentke in študenti jezikoslovnih smeri, in sicer po trije vzporedno. Pregledovanje je potekalo v okolju Google Sheets. Sopomenske nize smo organizirali v vrstice tabele, kjer jim je bilo mogoče pripisati eno od naslednjih odločitev: (1) niz vsebuje sovražno ali grobo besedišče; (2) niz vsebuje besedišče, ki je drugače negativno ali (v določenem pomenu, kontekstu) izraža negativen odnos; (3) z vidika sovražnosti, grobosti, negativnosti je niz neproblematičen. Če so pregledovalci želeli, so lahko opredelili tudi, da je (4) v nizu kako drugače zaznamovano besedišče ali da (5) ne razumejo vseh besed v nizu, lahko pa so vpisali tudi dodaten komentar na svoje odločitve ali podatke.

Kljub ogromni količini podatkov je bila tako oblikovana naloga izvedljiva v relativno kratkem času, saj so študentje lahko odločitev podali takoj, ko so v nizu našli eno samo problematično besedo, natančnejše razmisleke o vrsti zaznamovanosti oz. označevanja posameznih besed pa so prepustili za drugi korak dela s podatki.

4.3. Rezultati ročnega pregleda

Študentske odločitve smo pretvorili v končne odločitve po naslednjem ključu: (1) sovražno/grobo: če je vsaj eden od študentov presodil, da se v nizu pojavlja sovražno ali grobo besedišče; (2) drugače negativno: kombinacije odločitev "druga negativnost" in "neproblematično"; ali (3) neproblematično: če so vsi študenti presodili, da je z vidika sovražnosti, grobosti, negativnosti niz neproblematičen. Rezultate prikazuje Tabela 1.

Kategorija končne odločitve | Število nizov v kategoriji | Delež glede na vse pregledano
Sovražno/grobo | 1.810 | 4,5 %
Drugače negativno | 12.730 | 31,3 %
Neproblematično | 26.132 | 64,3 %
Skupaj | 40.672 | 100,0 %

Tabela 1: Številčna zastopanost in delež nizov glede na končno odločitev glede potencialne problematičnosti.

V Tabeli 2 navajamo nekaj nizov s po tremi sopomenkami, ki so jim študentke in študenti pripisali skladne ali različne odločitve. Kot je razvidno, lahko posamezen niz vsebuje raznoliko zaznamovano besedišče, kot tudi nezaznamovano besedišče. Tabela obenem ponazarja gradivo, ki bo deležno celovite in natančnejše obravnave v zaključnem delu projekta Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL (odločitev 1), podatke, ki so na tak ali drugačen način relevantni za nadaljnje delo (odločitev 2), in gradivo, ki ga z vidika negativne zaznamovanosti ne bomo nadalje obravnavali (odločitev 3).

Niz sopomenk | Študentske in končna odločitev
fukati; porivati; natepavati | 111 -> 1
skozlati; izbruhati; zbruhati | 111 -> 1
pedrski; buzerantski; toplovodarski | 111 -> 1
črnuhinja; zamorka; zamorklja | 111 -> 1
pofukanka; prasica; zajebanka | 111 -> 1
debilen; bebast; duševno zaostal | 121 -> 1
kripelj; pohabljenec; pohabljenka | 211 -> 1
kurnik; pajzelj; temačna luknja | 222 -> 2
bedastoča; glupost; nesmisel | 222 -> 2
eliminirati; likvidirati; usmrtiti | 222 -> 2
izmozgano; izčrpano; mršavo | 223 -> 2
imenski; nazivni; nominalni | 333 -> 3
kopirni papir; indigo; karbon | 333 -> 3
zaustaviti se; izklopiti se; izključiti se | 333 -> 3

Tabela 2: Primeri nizov s študentskimi odločitvami in končno odločitvijo o nadaljnji obravnavi.

V "drugače negativno" so raznorodni primeri, saj so poleg zaznamovanih izrazov in pomenov študentje označevali tudi besedišče, ki poimenuje negativne vsebine in predmetnost. Gre zlasti za poimenovanja agresivnega obnašanja: uničiti, dotolči; nekaterih osebnih lastnosti: pokvarjen, hudoben, ničvreden, grozljiv, grd, apatičnost, pokvarjenost; videza, stanja: neurejenost, razdejanje, zanikrnost itd. V slovarju večina teh besed ne potrebuje oznake. Čeprav besed ne bomo označevali, so seznami tovrstnega besedišča pomemben rezultat ročnega pregleda, saj so koristni za različne druge namene na področju slovaropisja in strojne obdelave jezika, npr. za filtriranje gradiva z negativnim pomenom iz jezikovnih iger ali učnih gradiv, strojno pripisovanje sentimenta ipd.

5. Vizualizacija v vmesniku SSSS 2.0

V slovarskem vmesniku SSSS 2.0 bomo na besedišče, o katerem razpravljamo tu, opozorili s kombinacijo opozorilne ikone in daljšega pojasnila, ki se bo izpisalo ob kliku nanjo. Trenutno rešitev, ki jo bomo po potrebi še nadgradili, kaže Tabela 3. Namenoma smo se odrekli pripisovanju (eno-)besednih oznak, saj bi te pri označevanju (mestoma tudi homonimnih) lem lahko vodile v napačno interpretacijo podatkov. Pri pomensko členjenih geslih bodo oznake seveda pripisane posameznim pomenom, pri pomensko nečlenjenih geslih pa bo kombinacija ikone in pojasnila omogočila, da je problematično besedišče na prvi pogled zelo opazno, pojasnilo pa je lahko daljše in vsebuje informacije o možnem učinku na naslovnika oz. možnih posledicah rabe označene besede.

Oznaka | Pojasnilo
Sovražno | Z uporabo besede lahko izražamo sovražni, nestrpni odnos do posameznika ali družbene skupine.
Grobo | Zaradi družbenih in moralnih norm se marsikateremu uporabniku jezika beseda lahko zdi groba ali neprimerna. Uporaba lahko povzroči nelagodje, razburi ali užali.
Izraža negativen odnos | Beseda lahko ni nevtralna. Z uporabo besede se lahko posmehujemo, izražamo neodobravanje ali kritiko do nekaterih lastnosti posameznikov, predmetov ali dejanj.

Tabela 3: Predvidene ikone in izhodiščna različica pojasnil za označevanje besedišča v SSSS.

V sodelovanju s študenti bomo v 1.810 nizih z odločitvijo (1) določili besede in zveze, ki so relevantne za slovarsko označevanje. Slednje bo potekalo ob upoštevanju pojavljanja oz. rabe identificiranega besedišča v raznovrstnih kontekstih, s čimer bomo željo po pohitritvi postopka prve selekcije ustrezno nadzorovali in obranili pred črno-belim presojanjem primernosti. Za ponazoritev navajamo nekaj primerov, ki so na seznamu za presojo:

● sovražno: črnuh, črnuhinja, zamorklja, hlapčevski črnec, rdečuh, rdečuhinja, beli prasec, bela prasica, lezba, lezbača, peder, buzerant, pička, prasica, kripelj;
● grobo: podjebavati, v kurcu, zdrkati, nabrisati, pokavsati, nategniti v rit, pofafati ga, sranje, poscan, fentati, crkniti, razpizden, sfukan, kurbarija, joški.

V SSSS želimo poleg sovražnega in grobega označiti tudi besedišče, ki izraža negativen odnos. To se najde predvsem v nizih z odločitvijo (2), mestoma pa tudi v (1). Kot problematične so študentje prepoznavali tako izraze (npr. budala, avša, bedast) kot potencialno problematične pomene besed (npr. nataknjen, zabit, nasekati). Prve je mogoče označiti že v trenutni različici slovarja, saj je njihova problematičnost vezana na lemo ne glede na morebitno večpomenskost. V drugem primeru bi oznaka morala biti pripisana pomenu, zato bo označevanje možno šele, ko bo slovar vseboval pomenske členitve. Primeri besedišča, ki ga je mogoče označiti na ravni izraza:

● izraža negativen odnos: trapa, bebav, počasne pameti, lolek, kozlarija, zarukan, špeglarca, luftar, blefer, snobovski, drhal, težakinja, mlatenje prazne slame, avša, otročaj.

Slika 1 kaže oblikovalski predlog vmesnika SSSS 2.0, kakršen je na voljo v času priprave prispevka. Slovarske informacije na sliki so provizorične. Slika ponazarja, kakšna bo vizualizacija pri bližnjih in jedrnih (pri depra in deprimiranost) ter pri uporabniških sopomenkah (pri sopomenka). Razvidne so tudi nekatere druge novosti, npr. delitev uporabniških sopomenk glede na slovarsko verzijo, v kateri so bile predlagane, ter možnost dodajanja slovarskih oznak ob predlagane sopomenke.

Slika 1: Oblikovalski predlog vmesnika SSSS 2.0 (vsebina je provizorična).

6. Uporabniško dodajanje oznak

Nedavno izvedena raziskava o odnosu uporabniške skupnosti do SSSS, v kateri je sodelovalo 671 anketirancev, je pokazala naklonjenost do večine novosti, ki jih prinaša slovar, npr. stalno posodabljanje, strojni postopki, digitalni format, kolokacijski podatki, povezave na korpus, uporabniško vključevanje (Arhar Holdt, 2020: 470). Med problematičnimi značilnostmi sta bili izpostavljeni nezanesljivost (strojno pridobljenih) podatkov in primanjkljaj slovarskih oznak tako pri jedrnih in bližnjih sopomenkah kot pri uporabniško dodanih. To, da ni oznak, je motilo 37 % sodelujočih (ibid.: 472).

V trenutnem slovarskem vmesniku nekateri uporabniki in uporabnice težavo rešujejo tako, da oznako ali kako drugo pojasnilo v oklepaju pripišejo ob svoj sopomenski predlog, npr. babica – nona (lokalno), bojazljivec – pezde (vulg.), Italijanka – makaronarka (slabš.). Kot omenjeno v poglavju 3, pa večina predlaganih sopomenk oznake nima.

Skladno z uporabniškimi željami in potrebami želimo nadgraditi protokol dodajanja sopomenk, da bodo predlagani besedi ali zvezi uporabnice in uporabniki lahko dodali tudi slovarsko oznako oz. oznake. Privzeta izbira bo, da je predlog "brez oznake", ostale možnosti bodo na voljo v spustnem meniju (Slika 1). V različici SSSS 2.0 bodo na klik na voljo oznake sovražno, grobo in izraža negativen odnos, poleg tega pa bomo ponudili okence, v katerega bo mogoče vtipkati morebitno drugo oznako.

Pomen in raba oznak sovražno, grobo in izraža negativen odnos bo razložena in ponazorjena s primeri, s čimer bo lahko dosežena določena stopnja enotnosti uporabniškega označevanja (informacije bodo na voljo na klik, gl. ikono (i) na Sliki 1). Predvideno pa je, da bodo uporabniki oznake mestoma interpretirali in uporabljali drugače, kot bi jih slovaropisci. Vse dodane oznake bodo (skupaj z dodanimi sopomenkami) preverjene in označene sopomenke bodo dragoceno gradivo ne le za dopolnitev odprto dostopne slovarske baze sopomenk, ampak tudi za analize širšega dojemanja označevalnega sistema ter dometa in meja oznak. Prav tako pomemben uvid bodo ponudile ročno vpisane oznake, ki jih bomo analizirali z vidika vsebine in pogostosti ter uporabili izsledke za nadaljnji razvoj slovarja.

7. Sklep in nadaljnje delo

Sodobno slovaropisno delo ima ob zavedanju ideološkosti, vključevanju novih pristopov, uporabi tehnologije, moči množic itd. danes veliko možnosti, da tudi vprašanja označevanja konotacije naslavlja na novo in zanje pripravlja inovativne rešitve (Gorjanc, 2017: 154). V prispevku smo opisali, kako poteka obravnava sovražnega in grobega besedišča v SSSS in katere spremembe so v načrtu za različico 2.0, ki bo objavljena jeseni 2022. Rešitve naslavljajo dve pomembni značilnosti SSSS: njegovo strojno izdelanost in odprtost, da pri razvoju slovarja sodeluje tudi uporabniška skupnost. V novi različici slovarja bodo sovražnemu in grobemu besedišču pripisane slovarske oznake oz. opozorilne ikone s pojasnili o možnih posledicah rabe in dodana bo možnost, da uporabniki pripišejo oznako svojim predlogom sopomenk.

Ker vse težave trenutnega SSSS niso enostavno in hitro rešljive, želimo slovarske uporabnike bolje opozoriti na trenutne omejitve. Čeprav je metodologija priprave SSSS pojasnjena v razdelku O viru, pri samih iztočnicah ni izrecnih opozoril, da je slovar pripravljen strojno, in to na vseh ravneh: sopomenke, kolokacije, korpusni zgledi, kar lahko vodi v napačne interpretacije slovarske vsebine. V naslednji različici SSSS želimo zato uvesti indikator stopnje gesla3 in dodati v predstavitev slovarja opozorila o dometu in posledicah metodologije ter razlago korakov, po katerih se slovar razvija.

Prepoznano sovražno in grobo besedišče bo koristno tudi pri izdelavi drugih virov, kjer se za pomene izbirajo reprezentativne kolokacije in zgledi. Pri izdelavi novih gesel za Kolokacijski slovar sodobne slovenščine (Kosem et al., 2018b) npr. že zdaj pri pripravi podatkov (pred slovaropisno analizo) označujemo kolokacije, ki vsebujejo sovražno in grobo besedišče, pa tudi besedišče, ki izraža negativen odnos. Tako slovaropiske in slovaropisce opozorimo na potencialno problematične kolokacije in posledično pohitrimo delo oz. se izognemo vključevanju problematičnih vsebin. Seznami problematičnega besedišča, ki jih uporabljamo trenutno, so pripravljeni ad hoc iz odprto dostopnih jezikovnih virov in precej krajši od seznamov, ki bodo (lahko) nastali na osnovi predstavljenega dela.

Kot smo poudarili v prispevku, je izražanje negativnega odnosa večkrat vezano na posamezen pomen besede, zato bo velik del naloge izvedljiv šele ob pripravi pomensko členjenih gesel. Pri pomenski členitvi in nadaljnjem označevanju gradiva SSSS bomo uporabili metodologijo, ki jo razvijamo pri izdelavi Velikega slovensko-madžarskega slovarja (Kosem et al., 2018a), in podatke oz. informacije, ki so na voljo v obstoječih odprto dostopnih virih za slovenščino. Preizkus prenosa metodologije bomo izvedli že pod okriljem projekta Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL, kjer je med cilji tudi nadgradnja SSSS z 2.000 pomensko členjenimi gesli, ki bodo imela slovaropisno pregledane in razvrščene sopomenke, kolokacije ter korpusne zglede.

V nadaljnje premisleke glede sovražnega in grobega besedišča znotraj koncepta odzivnega slovarja bi bilo smiselno celoviteje vključiti vidike okoliščin rabe. Zanimivo bi bilo denimo obravnavati zaznavanje in presojanje sovražnosti, grobosti v različnih tipih besedil, npr. medijskih. Ob tem se odpira tudi vprašanje formalnosti in neformalnosti položajev, na katere se ta presoja nanaša: ali posega na vse ravni izražanja ali gre zgolj za formalne, javne položaje in ali je neodvisna od generacijske ali kake druge pripadnosti presojevalca.

3 Po zgledu Kolokacijskega slovarja sodobne slovenščine (KSSS), ki z ikono petstopenjske piramide uporabniku na jasen in ekspliciten način posreduje informacijo o razvoju ter različnih stopnjah izdelanosti slovarskih gesel (Kosem et al., 2018b).

8. Zahvala

Projekt Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL v letih 2021–2022 financira Ministrstvo za kulturo Republike Slovenije. Raziskovalna programa št. P6-0411 (Jezikovni viri in tehnologije za slovenski jezik) in št. P6-0215 (Slovenski jezik – bazične, kontrastivne in aplikativne raziskave) sofinancira Javna agencija za raziskovalno dejavnost Republike Slovenije iz državnega proračuna.

9. Literatura

Špela Arhar Holdt. 2020. How Users Responded to a Responsive Dictionary: The Case of the Thesaurus of Modern Slovene. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 46(2): 465–482. https://doi.org/10.31724/rihjj.46.2.1

Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Apolonija Gantar, Vojko Gorjanc, Bojan Klemenc, Iztok Kosem, Simon Krek, Cyprian Laskowski in Marko Robnik Šikonja. 2018. Thesaurus of Modern Slovene: By the Community for the Community. V: J. Čibej, V. Gorjanc, I. Kosem in S. Krek, ur., Proceedings of the 18th EURALEX International Congress: Lexicography in Global Contexts, str. 401–410. Znanstvena založba Filozofske fakultete, Ljubljana. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/118/211/3000-1

Polona Gantar. 2015. Leksikografski opis slovenščine v digitalnem okolju. Znanstvena založba Filozofske fakultete, Ljubljana. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/download/62/138/2602-1?inline=1

Vojko Gorjanc. 2005. Neposredno in posredno žaljiv govor v jezikovnih priročnikih: diskurz slovarjev slovenskega jezika. Družboslovne razprave, 21(48): 197–209.

Vojko Gorjanc. 2017. Nije rečnik za seljaka. Biblioteka XX vek, Beograd.

Geoffrey Hughes. 2011. Political Correctness: A History of Semantics and Culture. Wiley-Blackwell, MA.

Boris Kern. 2015. Politična korektnost v slovaropisju. V: D. Zuljan Kumar in H. Dobrovoljc, ur., Zbornik prispevkov s simpozija 2013, str. 144–154. Založba Univerze, Nova Gorica.

Iztok Kosem, Júlia Čeh Bálint, Vojko Gorjanc, Anna Kolláth, Attila Kovács, Simon Krek, Sonja Novak-Lukanovič in Jutka Rudaš. 2018a. Osnutek koncepta novega velikega slovensko-madžarskega slovarja. Univerza v Ljubljani, Filozofska fakulteta, Ljubljana. https://www.cjvt.si/komass/wp-content/uploads/sites/17/2020/08/Osnutek-koncepta-VSMS-v1-1.pdf

Iztok Kosem, Simon Krek, Polona Gantar, Špela Arhar Holdt, Jaka Čibej in Cyprian Laskowski. 2018b. Kolokacijski slovar sodobne slovenščine. V: D. Fišer in A. Pančur, ur., Jezikovne tehnologije in digitalna humanistika. Znanstvena založba Filozofske fakultete, Ljubljana. http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Kosem-et-al_Kolokacijski-slovar-sodobne-slovenscine.pdf

Simon Krek, Cyprian Laskowski in Marko Robnik Šikonja. 2017. From translation equivalents to synonyms: creation of a Slovene thesaurus using word co-occurrence network analysis. V: I. Kosem et al., ur., Proceedings of eLex 2017: Lexicography from Scratch, str. 93–109, Leiden, Netherlands. https://elex.link/elex2017/wp-content/uploads/2017/09/paper05.pdf

Simon Krek, Cyprian Laskowski, Marko Robnik Šikonja, Iztok Kosem, Špela Arhar Holdt, Polona Gantar, Jaka Čibej, Vojko Gorjanc, Bojan Klemenc in Kaja Dobrovoljc. 2018. Thesaurus of Modern Slovene 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1166

Nataša Logar, Nina Perger, Vojko Gorjanc, Monika Kalin Golob, Neža Kogovšek Šalamon in Iztok Kosem. 2020. Raba slovarjev v slovenski sodni praksi. Teorija in praksa, 57: 89–108. https://www.fdv.uni-lj.si/docs/default-source/tip/tip_pos_2020_logar_idr.pdf?sfvrsn=0

Rosamund Moon. 2014. Meanings, Ideologies, and Learners' Dictionaries. V: A. Abel et al., ur., Proceedings of the XVI EURALEX International Congress: The User in Focus, str. 85–105. Institute for Specialised Communication and Multilingualism, Bolzano/Bozen. https://euralex.org/elx_proceedings/Euralex2014/euralex_2014_004_p_85.pdf

Andrej Motl in Veronika Bajt. 2016. Sovražni govor v Republiki Sloveniji: Pregled stanja. Mirovni inštitut, Ljubljana. https://dlib.si/stream/URN:NBN:SI:DOC-F2YZP2RB/c117f4c6-8fe9-437d-8c64-5b7987a856b6/PDF

Brankica Petković in Neža Kogovšek Šalamon. 2007. O diskriminaciji: Priročnik za novinarje in novinarke. Mirovni inštitut, Ljubljana. https://www.mirovni-institut.si/wp-content/uploads/2014/08/Prirocnik-o-diskriminaciji-final-all.pdf

Dušan Rebolj. 2008. Uporabnejša opredelitev politične korektnosti. V: S. Autor in R. Kuhar, ur., Politična (ne)korektnost, str. 4–15. Mirovni inštitut, Ljubljana. https://www.mirovni-institut.si/wp-content/uploads/2014/08/nestrpnost-6.pdf

SSKJ. 2014. Slovar slovenskega knjižnega jezika: Uvod. Druga, dopolnjena in deloma prenovljena izdaja. Inštitut za slovenski jezik Frana Ramovša ZRC SAZU, Ljubljana.
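Opisana koraka – odstranjevanje podvojenih nizov in podmnožic pri pripravi gradiva ter pretvorbo treh vzporednih študentskih odločitev v končno kategorijo – je mogoče ponazoriti s kratko skico v Pythonu. Gre zgolj za ilustracijo opisanega postopka: imeni funkcij sta hipotetični, dejanska implementacija projekta v prispevku ni objavljena.

```python
def filtriraj_nize(nizi):
    """Odstrani podvojene nize sopomenk in nize, ki so podmnožica drugega niza."""
    # dict.fromkeys ohrani vrstni red in odstrani dvojnike (nize z istim naborom sopomenk)
    mnozice = list(dict.fromkeys(frozenset(n) for n in nizi))
    # operator '<' na množicah pomeni pravo podmnožico; niz sam sebe ne izloči
    return [sorted(n) for n in mnozice if not any(n < m for m in mnozice)]

def koncna_odlocitev(odlocitve):
    """Tri vzporedne odločitve (1/2/3) pretvori v končno kategorijo po ključu iz 4.3."""
    if 1 in odlocitve:            # vsaj en pregledovalec: sovražno ali grobo besedišče
        return "sovražno/grobo"
    if 2 in odlocitve:            # kombinacija odločitev "druga negativnost" in "neproblematično"
        return "drugače negativno"
    return "neproblematično"      # vsi trije: niz je neproblematičen
```

Na primerih iz Tabele 2 vrne koncna_odlocitev((2, 1, 1)) kategorijo "sovražno/grobo", (2, 2, 3) kategorijo "drugače negativno", (3, 3, 3) pa "neproblematično".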
https://fran.si/130/sskj-slovar-slovenskega-knjiznega-je zika Ada Vidovič Muha. 2013. Moč in nemoč knjižnega jezika. Znanstvena založba Filozofske fakultete, Ljubljana. PRISPEVKI 16 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Izdelava in analiza digitalizirane zbirke paremioloških enot Saša Babič*, Tomaž Erjavec† * Inštitut za slovensko narodopisje ZRC SAZU Novi trg 2, 1000 Ljubljana sasa.babic@zrc-sazu.si † Odsek za tehnologije znanja, Institut »Jožef Stefan« Jamova cesta 39, 1000 Ljubljana tomaz.erjavec@ijs.si Povzetek Članek obravnava digitaliziranje zbirke slovenskih pregovorov Inštituta za slovensko narodopisje ZRC SAZU. Zbirka je nastajala od leta 1947 dalje, digitalizacija pa se je začela v samem začetku 21. stoletja z iniciativo Marije Stanonik. V predstavljenem delu smo izhajali iz Excel razpredelnic paremioloških enot in virov, iz katerih smo najprej izločili neustrezne enote in neuporabljene vire. Nato smo tabeli pretvorili v zapis TEI in pregovore avtomatsko jezikoslovno označili. Tu so bile besede posodobljene, lematizirane, oblikoskladenjsko označene, povedi pa skladenjsko razčlenjene po formalizmu Universal Dependencies. Kanonični zapis TEI smo pretvorili v več izvedenih formatov in zbirko objavili pod odprto licenco na repozitoriju CLARIN.SI, kjer jo je mogoče prevzeti, in na konkordančnikih CLARIN.SI, ki so primerni za jezikoslovne analize zbirke. V članku orišemo tudi način iskanja po zbirki v konkordančnikih, ki omogočajo temeljitejšo etnolingvistično in semiotično raziskavo. Creation and analysis of a digitised collection of Slovenian paremiological units The article discusses the digitization of the collection of Slovenian proverbs from the Institute of Slovenian Ethnography ZRC SAZU. The collection was created from 1947, and its digitization began at the start of the 21st century on the initiative of Marija Stanonik. 
The departure point of the presented work were two Excel spreadsheets with paremiological units and their bibliographical sources, from which we removed inappropriate units and unused sources. The two spreadsheets were then converted to a TEI encoding, and the paremiological units were automatically linguistically annotated: the words were modernised, lemmatised and morphosyntactically annotated, and the sentences syntactically parsed according to the Universal Dependencies formalism. We converted the canonical TEI encoding into several derived formats and published the collection under an open licence on the CLARIN.SI repository, where it can be downloaded, and on the CLARIN.SI concordancers, which allow for linguistic analyses of the collection. The paper also outlines searching the collection in the concordancers, which enables detailed ethnolinguistic and semiotic research.

1. Introduction

Language is a preserver and carrier of culture, through which humanity creates and incorporates reflections on itself (Pitkin, 1972; Bartmiński, 2005; Tolstaja, 2015). Among the most frequently used linguistic forms are proverbs, or paremiological units.

Paremiological units, or proverbs in the broader sense, are one of the shortest genres of verbal folklore; proverbs can be described as relatively fixed sentences classified among short folklore forms. They are often labelled with phrases such as "the wisdom of the people" (Mieder, 1993), "old wisdom" and "the poetry of everyday language" (Matičetov, 1956). In any case, proverbs can be said to be "condensed moral-ethical formulas of a given community; they are a kind of traditional stereotype of its self-awareness and self-identification, a language of everyday culture passed on from generation to generation" (Kržišnik, 2008: 38). Precisely for this reason proverbs are short stereotypes at the sentence level with a figurative or generalising meaning, and are in principle generally known (Grzybek, 2012). Proverbs are cultural texts with great semantic potential (Grzybek, 2015), since they are "completed thoughts" (Mlacek, 1983: 131); however, they differ not only in their text but also in texture and context (Dundes, 1965). Their prosodic features make them easier to remember, which is why today they offer possibilities for further use, for example in advertising, in the contemporary transmission of opinions, in graffiti, or in modifications in various media.

The semiotic complexity of proverbs and the interplay of their syntactic (brevity), pragmatic (transmission across generations) and semantic (stereotypical, general knowledge) dimensions invite the study of proverbs as a cultural sign that preserves the history of a culture or society while at the same time taking on new functions that extend and generate new contexts. Precisely for this reason paremiological units or proverbs are labelled a national treasure, priceless wisdom and the heritage of our ancestors, and it is no surprise that they have been the object of continuous field recording or even dedicated collecting (Arewa and Dundes, 1966; Stanonik, 2015) and of analyses of their use (Meterc, 2021).

The Institute of Slovenian Ethnography ZRC SAZU has systematically built an archive of various genres of verbal folklore, within which a collection of proverbs also took shape. These were recorded on index cards or in thematic archive folders. At the beginning of the 21st century the need arose to digitise the material in order to make working with it easier.

In the project Traditional Paremiological Units in Dialogue with Contemporary Use (2020–2023) we envisaged combining ethnolinguistic approaches and semiotics with the aim of gaining a diachronic insight into society through proverbs. To make the analysis more thorough, an important part of the project is the conversion of the material into a form suitable for computational text analysis. In this paper we describe the preparation and linguistic annotation of the digitised collection of proverbs, which is now available on the CLARIN.SI repository and concordancers, and the use of the digitised collection for the ethnolinguistic study of paremiological units. We conclude with our findings and plans for future work.
2. Preparation of the material

The Institute of Slovenian Ethnography (ISN) ZRC SAZU keeps folklore material in its archive in analogue form, i.e. handwritten, typed or printed on index cards, in archive drawers and cabinets. The drive to digitise folklore material began with the proverbs, for which Marija Stanonik obtained the project Slovenian Proverbs and Sayings as early as 1997–1999 (Stanonik, 1996), in which she began expanding the ISN archive collection of proverbs. With digitisation in mind, she continued in the projects Informatisation of Intangible Heritage for Ethnology and Folkloristics (2005–2008) (Stanonik, 2004) and Slovenian Proverbs as Cultural Heritage: Classification and Redaction of the Corpus (2010–2013) (Stanonik, 2009; Stanonik, 2015). The material was added to the existing collection in computer transcription, at first in Word and later in Excel, which provided the foundation on which we could carry out the conversion into other digital formats.

2.1 Preparation of the material in the spreadsheets

For editing we received two Excel spreadsheets: the first contained 59,543 mostly paremiological units, the second 2,742 sources of these units. The two spreadsheets were linked by a code assigned to each source. On reviewing the material we found that quite a few units did not belong to the paremiological inventory; these we removed manually (riddles, parts of folk songs, greetings, idioms, etc.), and in reviewing the sources we manually removed all those that were not cited with any paremiological unit. In addition, some paremiological units included wider context, which we deleted manually, thus obtaining a unified form of free-standing paremiological units. With weather proverbs the problem arose of explaining the saints' names for days and holidays: in the original records (newspapers, almanacs, notebooks, etc.) these were given as explanations, e.g. Če je na Velike maše dan [15. avgust] lepo vreme, potem bo ozimna pšenica lepa; Če na ta dan [Florijanovo, 4. maj] dež gre, potlej ga celo leto manjka. In the Excel spreadsheet, which represents part of the Institute's archive, we left these recorded in square brackets.

After manual and automatic editing the material thus contained 36,349 relatively uniformly edited paremiological units and 2,515 sources.

The spreadsheet of sources contains, for each source, its identifier, the identifier from the original list of sources, the sequential number of the source (which also groups sources belonging to a superordinate unit), the year of publication (and the year of first publication where this differs), the name of the source (author, title), and a categorisation of the source into 18 categories, e.g. Fiction and literary writing, Museum collections, Periodicals (almanacs and calendars), Oral sources, etc.

The spreadsheet of paremiological units contains the identifier of the unit, its sequential number from the original list of units, the list of source identifiers together with the page on which the unit is mentioned in the source, the diplomatic transcription of the unit (i.e. the unit as it appears in the source), and the critical transcription, which transcribes units recorded in the Bohorič alphabet into the Gaj alphabet. For example, unit PREG-00-00001 has sequential number 1, the source list bib14.1: 202; bib23.1: 51; bib7.1: 524, the diplomatic transcription »Bres muje ſe zhreul ne obuje.« and the critical transcription »Brez muje se čreul ne obuje.«
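The diplomatic-to-critical step just illustrated (»Bres muje ſe zhreul ne obuje.« → »Brez muje se čreul ne obuje.«) can be approximated by rule-based transliteration. The sketch below is illustrative only: the rule set is a simplified assumption, not the project's actual procedure, and real Bohorič orthography needs context-sensitive rules (e.g. »Bres« → »Brez« depends on voicing, which is why the critical transcription was prepared manually).

```python
# Naive Bohorič-alphabet (bohoričica) to Gaj-alphabet (gajica) transliteration.
# The rule list is a simplified illustration, not the project's actual mapping.
RULES = [
    ("zh", "č"),  # bohoričica digraph for č
    ("sh", "š"),  # bohoričica digraph for š (may also stand for ž in context)
    ("ſ", "s"),   # long s
]

def transliterate(text: str) -> str:
    """Apply the replacement rules left to right, digraphs first."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text

print(transliterate("ſe zhreul ne obuje"))  # -> "se čreul ne obuje"
```

Note that such a naive mapping leaves context-dependent cases untouched, which matches the paper's decision to keep a manually produced critical transcription alongside the diplomatic one.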
After the manual editing we merged the Excel documents with OpenRefine (https://openrefine.org/), thereby unifying the correctors' notes and the proverb labelling categories. We also made basic corrections when checking schematised entries (e.g. citation of sources, removal of trailing spaces in individual cells, etc.). The data were then transferred to a working SQLite database (https://www.sqlite.org/), where simple grammatical errors and typos were corrected (capitalisation, double spaces, incorrect punctuation, etc.) and the character inventories used were identified; here the non-standardised renderings of the Dajnko, Metelko, Bohorič and Gaj alphabets deserve mention. Transcription of the proverbs into computer form had begun in the early 21st century, when the available character repertoire was still limited and transcribers resolved their difficulties with improvised character choices. After these basic corrections we went on to search for identical or duplicated units and removed the duplicates, attaching all their sources to a single paremiological unit. At the end of the editing we exported the data to TSV (tab-separated values) format, which was the starting point for building the corpus.

2.2 TEI encoding

In the next step we converted the data from the TSV documents into a representation better suited both to preservation and to further processing, namely XML following the recommendations of the Text Encoding Initiative (TEI Consortium, 2020). The entire collection was formed as a single TEI document (element <TEI>) with a colophon (element <teiHeader>) and a text part (element <text>). The colophon contains bibliographic and other metadata about the collection, such as the taxonomy of source categories; in its source description (<sourceDesc>) it also contains the complete list of sources of the paremiological units; the encoding is illustrated in Figure 1. The text part contains the paremiological units, each with its identifier, its diplomatic and critical transcription, and the list of its sources; the encoding is illustrated in Figure 2.
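The TEI skeleton just described can be sketched programmatically. The outer elements (`TEI`, `teiHeader`, `text`) are the ones named in the text; the per-unit markup (`entry`, `form` and their attributes) is a hypothetical placeholder, since the actual encoding is shown only in the paper's figures.

```python
# Sketch of the collection's TEI skeleton; the per-unit element names are
# illustrative assumptions, not the published encoding.
import xml.etree.ElementTree as ET

tei = ET.Element("TEI", {"xmlns": "http://www.tei-c.org/ns/1.0"})
ET.SubElement(tei, "teiHeader")          # colophon: metadata, source taxonomy
text = ET.SubElement(tei, "text")
body = ET.SubElement(text, "body")

# One paremiological unit with its identifier and both transcriptions.
unit = ET.SubElement(body, "entry", {"xml:id": "PREG-00-00001"})
ET.SubElement(unit, "form", {"type": "diplomatic"}).text = "Bres muje ſe zhreul ne obuje."
ET.SubElement(unit, "form", {"type": "critical"}).text = "Brez muje se čreul ne obuje."

print(ET.tostring(tei, encoding="unicode"))
```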
Figure 1: Example of the sources of paremiological units in the TEI encoding.

Figure 2: Example of the encoding of a paremiological unit in TEI.

Figure 3: Example of the encoding of a linguistically annotated paremiological unit in TEI.

2.3 Word modernisation and further linguistic annotation

A considerable obstacle to using the collection was its archaic Slovene spelling, which hampers searching the proverbs as well as their further analysis. Automatic linguistic annotation of the collection is hampered as well, since annotation tools only work well on contemporary standard Slovene.

To modernise the collection we used the open-source normalisation tool cSMTiser (https://github.com/clarinsi/csmtiser) (Scherrer and Ljubešić, 2016), which is based on statistical machine translation and the Moses toolkit (Koehn, 2010). We trained cSMTiser for modernisation on the manually modernised corpus of Slovene goo300k (Erjavec, 2016), much as we had previously done when modernising the collection of Slovenian novels within the ELTeC corpus (Schöch et al., 2021). With this tool we then normalised the critical transcription; the tool does bring the spelling of words closer to contemporary Slovene, but it also makes mistakes (e.g. it translates »čreul« into »čevlj« instead of »čevelj«).

On the basis of the automatically modernised words we then linguistically annotated the corpus. Here we used the open-source tool CLASSLA (https://github.com/clarinsi/classla) (Ljubešić and Dobrovoljc, 2019), with which we added the following linguistic annotations to the text, e.g. for »čevlja«:

- the morphosyntactic tag following the MULTEXT-East recommendations (Erjavec, 2012), e.g. »Ncmsg« for »Noun Type=common Gender=masculine Number=singular Case=genitive« (there is also an equivalent Slovene tag, here »Somer«, and its expansion into attribute=value pairs);
- the lemma, or base form of the word, here »čevelj«;
- the morphosyntactic features of the Universal Dependencies scheme for Slovene (Dobrovoljc et al., 2017), e.g. »NOUN Case=Gen Gender=Masc Number=Sing«; these tags are similar to the MULTEXT-East ones, but with a differently rendered inventory of attributes and values, and they occasionally differ from them systematically as well;
- the dependency parse of the sentence following the Universal Dependencies scheme.
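The compact MULTEXT-East tag and its attribute=value expansion quoted above are positionally related: each character after the category letter encodes one attribute. The sketch below decodes only the noun positions and values needed for the paper's own example »Ncmsg«; the full specification defines many more categories and values.

```python
# Illustrative (simplified) decoding of a MULTEXT-East MSD tag for nouns:
# position 0 is the category, followed by Type, Gender, Number, Case.
# Only a subset of values is listed; the full specification has many more.
NOUN_POSITIONS = ["Type", "Gender", "Number", "Case"]
VALUES = {
    "Type": {"c": "common", "p": "proper"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "d": "dual", "p": "plural"},
    "Case": {"n": "nominative", "g": "genitive", "d": "dative",
             "a": "accusative", "l": "locative", "i": "instrumental"},
}

def expand_msd(msd: str) -> dict:
    """Expand a compact noun MSD tag into attribute=value pairs."""
    assert msd[0] == "N", "this sketch only handles nouns"
    feats = {"POS": "Noun"}
    for attr, code in zip(NOUN_POSITIONS, msd[1:]):
        feats[attr] = VALUES[attr][code]
    return feats

print(expand_msd("Ncmsg"))
# -> {'POS': 'Noun', 'Type': 'common', 'Gender': 'masculine',
#     'Number': 'singular', 'Case': 'genitive'}
```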
The linguistically annotated version of each paremiological unit was added to the TEI encoding after its textual transcriptions; the format is illustrated in Figure 3. In the version of the corpus containing the modernised and linguistically annotated units, the colophon is also extended with the taxonomy of Universal Dependencies syntactic relations and with a description of the tools used.

2.4 Publication of the collection

We published the collection in two ways. It is available for download from the CLARIN.SI repository (Babič et al., 2022) under the open licence CC BY. Besides the two variants of the collection (without and with linguistically annotated units) in the TEI format, it is also available there in the derived TSV format, i.e. as the two spreadsheets of sources and units, and in the so-called vertical format, which serves as the input format for the CLARIN.SI concordancers. The collection is also mounted on the CLARIN.SI noSketch Engine and KonText concordancers; these two services provide analytical insight into the digitised collection.
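The vertical format mentioned above is the usual concordancer input: one token per line, with tab-separated attribute columns, and structural markup such as sentence tags on their own lines. The sketch below emits one unit with word form, lemma and UD part of speech; the exact attribute columns of the published corpus are an assumption here, not a reproduction of it.

```python
# Sketch: one paremiological unit in concordancer "vertical" format
# (one token per line, tab-separated attributes). The column layout is
# an illustrative assumption, not the published corpus configuration.
tokens = [
    # (word form, lemma, UD part of speech)
    ("Brez", "brez", "ADP"),
    ("muje", "muja", "NOUN"),
    ("se", "se", "PRON"),
    ("še", "še", "PART"),
    ("čevelj", "čevelj", "NOUN"),
    ("ne", "ne", "PART"),
    ("obuje", "obuti", "VERB"),
]

lines = ["<s>"]                                  # sentence-level structure tag
lines += ["\t".join(tok) for tok in tokens]      # one token per line
lines.append("</s>")
print("\n".join(lines))
```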
3. Analysis of the material

The collection contains 36,066 paremiological units recorded in diplomatic transcription (283 exist in critical transcription). Most paremiological units were excerpted from already existing collections of proverbs (10,187 units), in particular from the 1974 collection Pregovori in reki na Slovenskem, edited by Etbin Bojc (4,884 units). It should be borne in mind that Bojc collected many of his paremiological units from still earlier collections (e.g. Kocbek (1887), Kocbek and Šašelj (1934), and older grammars and dictionaries), but his collection counts as the first more modern one. The oldest proverbs in the collection date from 1587, namely from the preface to Jurij Juričič's Postila. The ISN collection of paremiological units contains many units from grammars and dictionaries, which means that these units were recorded as isolated entities, without context. Moreover, such records do not attest to actual familiarity with and use of the paremiological units, as can be assumed when paremiological material is collected in the field or from printed texts in which the author presupposes knowledge of the units and thus the reader's understanding of what is written. This is an important element of research and analysis in folkloristics, as it also reveals the conceptual and ethnolinguistic dimension of folklore material. If we assume good familiarity with a given proverb (e.g. Brez muje se še čevelj ne obuje, 'without effort not even a shoe gets put on'), we can also infer the conceptual background and the ethnolinguistic picture that such material can offer. For such insight we draw not only on the ethnolinguistic approach (combining linguistics and ethnology with an emphasis on the stereotypical representation of a phenomenon) but also on semiotic analysis (the meaning of the sign).

Ethnolinguistics as an independent field gives language a special place in society: cultural meanings take shape in language; in its words, phraseology and even grammar, language conveys images of the world. From this point of view language is the "material of culture", while at the same time it is also a cultural meta-language: together with folklore it counts among the key cultural codes and culturally expressive forms. Language is therefore one of the most important sources for researching folklore and reconstructing its early states; the connection between language and culture is mutual (Tolstaja, 2006), and together they form a sign system. All cultural meanings gather in the semantics of naming with words; the Lublin ethnolinguistic school called these linguistic stereotypes (Bartmiński, 2005), which reflect our attempt to control the world. Analyses of relatively fixed phrases and of words in particular contexts show us a linguistic map of the world with its most important social images and representations.

The rapidly developing field of digital humanities allows researchers to adopt new, radically different research methods and, just as importantly, makes available electronic collections with advanced search capabilities (Rassmusen Neal, 2015). Corpus linguistics and the currently popular "distant reading" methodology (i.e. the use of e-resources) seek to exploit large language samples to gain (quantitative) insight into vocabulary, usage, trends and visualisations in areas of linguistic interest. At the same time, such computational text forms of collections enable more precise and faster qualitative analyses of larger collections: of individual concordance combinations and word environments.

Semiotic analysis for the purposes of ethnolinguistic research (Bartmiński, 2005) into paremiological material proceeds above all at the level of semantics: in words we want to detect both the metaphorical meanings and the stereotypical labels that a word carries and at the same time conveys through metaphor into the wider context, i.e., from a semiotic point of view, what signs are formed within the paremiological unit. A statistical overview of the whole collection reveals, among other things, the most frequently used words, which can also support more general assumptions about social attitudes.

3.1 Ethnolinguistic and semiotic analysis with the concordancers

Although proverbs traditionally belong to the field of paremiology, they are often also the research object of folkloristics, sociology, pedagogy, linguistics, etc. Semiotics, as the study of signs, offers a methodology for exploring the deeper dimensions of the intertwined cultural backgrounds of proverbs (Grzybek, 2014). With its emphasis on the pragmatic (the relation between signifier and signified), syntactic (the formal relations among signs) and semantic dimensions (the relations of signs to the objects to which they can be applied) (Morris, 1938), semiotics makes it possible to observe proverbs with deeper insight into cultural meanings, concepts and world views. The world view in proverbs can in turn be accessed with ethnolinguistic research methods, including diachronic and synchronic approaches.

The most frequent content words in the collection of paremiological units are the following:

- The noun dan ('day') occurs 1,657 times; metaphorically or metonymically it denotes a bounded period of time, whether long (Premislek je boljši kot dan hoda.) or short (Bitke ne dobiš v enem dnevu.), the end of a period (Po večeru se dan pozna.), a specific day (Ni vsak dan praznik. / Pavla dne lepo, leto dobro bo.), or the following of good, i.e. conceptual cyclicity (Za vsako nočjo pride dan.). Its top frequency is no surprise, since this noun is a very common element of weather and farming paremiological units, and it is also extremely frequent in the contemporary general language: in Gigafida 2.0 it is the third most frequent noun (http://hdl.handle.net/11346/QHKH). On the other hand, it is worth noting that its opposite, noč ('night'), occurs only 318 times: as the opposite of day (Ljubezen vidi noč, kjer sije beli dan.), the dark time when nothing can be seen (Ponoči so vse krave črne.), an influential time (Noč ima svojo moč.), a boundary time (Ne hvali dneva pred nočjo.), a bad time (Dan se zjutraj išče, noč pa sama pride.), a label of festive times (velika noč, božična noč), etc.
- The verb biti ('to be') occurs 19,301 times (negated 3,501 times), which is not surprising given that it is one of the most basic verbs; it is also the most frequent verb in the contemporary general language (http://hdl.handle.net/11346/XNRI).
- The adjective dobro ('good') occurs 1,367 times, most often in the positive and least often in the superlative (cf. slabo 'bad', which occurs 301 times, again most often in the positive and least often in the superlative). On the basis of these isolated units one could conclude that at the semantic level proverbs frequently express an evaluation of a state or an action, which, besides conveying a social outlook, also confirms their pedagogical potential.
- The preposition v ('in') is the most frequent preposition in the paremiological units, occurring 4,538 times. From this we may infer that phenomena were originally most often conceptualised within a temporal-spatial frame, even though the meaning of the preposition also extends to expressing purpose, means, relation to a whole, action/state, etc. The same can be observed in the contemporary general language (http://hdl.handle.net/11346/ZYVZ).
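The frequency figures above (dan 1,657×, noč 318×, biti 19,301×, ...) amount to counting lemmas over the annotated corpus. The sketch below shows the principle on toy data standing in for the real lemmatised collection; the counts are illustrative, not the corpus values.

```python
# Sketch: lemma frequency counting over lemmatised paremiological units.
# The three toy units stand in for the 36,000-unit collection.
from collections import Counter

lemmatised_units = [
    ["premislek", "biti", "boljši", "kot", "dan", "hod"],
    ["za", "vsak", "noč", "priti", "dan"],
    ["noč", "imeti", "svoj", "moč"],
]

# Flatten all units and count each lemma.
freq = Counter(lemma for unit in lemmatised_units for lemma in unit)
print(freq.most_common(2))  # 'dan' and 'noč' each appear twice in this sample
```

The concordancers expose the same kind of count through their word-list and frequency modules, over the full collection and with case forms and older spellings folded into the lemma.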
Given that dan, biti, dobro and v top the frequency lists of the paremiological units, it turns out that they correspond closely to usage frequencies in the contemporary general language, even though the material is mostly archival.

For a more precise ethnolinguistic and conceptual insight, analysis by an individual component (e.g. the nouns čevelj 'shoe' or medved 'bear') and its combinations is more suitable; on this basis the semiotic method allows us to interpret conceptual aspects of society. For such analysis, simple search is the most widely applicable: in this collection it retrieves all case forms of the noun searched for, including older spellings; e.g. a search for čevelj (68 units) retrieves all its cases as well as the spellings črevelj, čevl, etc. In more advanced searches it is also possible to track the counts of individual spellings, e.g. črevelj (2), črevlju (1), čevle (3), while the word list also makes it possible to follow older spellings, their sources and their frequency over time, variant uses and possible renewals.

Likewise, searching for all spellings and cases of lisica 'fox' (older form lesica, 7 units) retrieves 93 paremiological units; advanced searches again give the counts of individual forms: lisica (31), lisice (5), lisici (7), lisico (6), lesica (7), etc. A contextual study of attestations across source types yields telling information: grammars and dictionaries cite paremiological units with lisica that are entirely metaphorical and refer to people, whereas almanacs also give units that serve as weather forecasts.

The concordancer also supports searching for a word in combination with another part of speech, e.g. the lemma medved followed by a verb. Many connections can be discovered this way, but, contrary to expectations, the statistics module also reports results from other (preceding or following) proverbs, not only those tied to the individual proverb: for medved it shows 79 matches, of which 66 occur within a single proverb. A quick manual inspection shows that this word most often combines with the verb prodajati 'to sell', and among nouns with koža 'skin', forming the proverb that metaphorically warns against premature praise. The proverb points to a semantic field which, in ethnolinguistic interpretation, is tied to an economic reflection of society, i.e. the selling of the bear's skin, which in its historical context shows its great economic value.

Nevertheless, because of older and dialectal expressions the search engine does not always find all combinations, e.g. in Lep čevelj vidiš, a ne veš, kje me gloje or Kdor stare čevlje flika, pride do zlatnika, where the concordancer did not detect the noun-verb combination.

Variants of an individual proverb are easiest to find by phrase search: e.g. searching for the phrase lastovka ne returns four results: Ena lastovka ne naredi poletja ('one swallow does not make a summer'), Ena lastovka ne naredi pomladi, Ena lastovka ne prinese pomladi, Ena lastovka ne prinese nikoli spomladi. For the verb phrase gre samo enkrat na led the results include both the donkey and the fox (Osel/lisica gre samo enkrat na led), and likewise both the fox and the cat praise their own tail (Vsaka lisica/mačka svoj rep hvali).

With the digitised material and the possibility of more advanced searching, the ethnolinguistic insight into the proverb corpus is more thorough. The mere frequency of individual words in proverbs, or information about the variation of an individual proverb, is an excellent starting point that an analogue archive can hardly provide.

4. Conclusion

Digitisation of folklore material makes its analysis easier and at the same time substantially more precise: search engines can list all the desired units, and comparison of the material is more consistent. The establishment of the digital collection of ISN paremiological units marks a shift in Slovenian folkloristics: the material is more accessible and analytically easier to handle. At the same time this form does not require a (semantic, thematic, functional, alphabetical, etc.) categorisation of the proverbs; instead they are arranged as the smallest completed texts on which the analysis is performed. With respect to the problem of categorisation this is no doubt the most advantageous solution, as categorisation itself often reveals more shortcomings than advantages.

There is certainly room for improvement in the proverb collection. Besides correcting some orthographic errors, the question of variants and of the links between them arises; resolving it would also remove the remaining duplicated proverbs (above all those entered with different punctuation, e.g. one with a comma, another with a semicolon). Since the diplomatic transcription is in places problematic (Gaj, Bohorič and Metelko alphabets, phonetic notation), the question also arises whether a standardised rendering of each proverb, which would have to be checked manually, would be worthwhile. The collection will certainly also be extended with new paremiological units (from older sources as well as from contemporary use). In addition, it would make sense to introduce a division of sources by category, which would show more precisely the presence of paremiological units in each source category and thus also enable comparative analysis (e.g. units in almanacs versus units in grammars).

In building the digital paremiological collection we drew on systems that are well established in linguistics. What remains to be considered is how to digitise verbal folklore that is longer (e.g. stories, prayers) or has specific functions (e.g. riddles, charms).
Ob (npr. zgodbe, molitve) in ima specifične funkcije (npr. ročnem pregledu kaj hitro ugotovimo, da se ta beseda uganke, zagovori). najpogosteje veže z glagolom prodajati. Ob navezavah na 6 http://hdl.handle.net/11346/XNRI 7 http://hdl.handle.net/11346/ZYVZ PRISPEVKI 21 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Zahvala Philipp Koehn. 2010. Statistical Machine Translation. Digitalizirana zbirka paremiolških enot ne bi nastala brez Cambridge University Press. projektnih sodelavcev, še posebej Miha Pečeta: njegov Erika Kržišnik. 2008. Kulturološka interpretacija frazema. občutek za folklorno gradivo in poznavanje računalniškega V: M. Kalin Golob, ur., N. Logar Berginc, ur., in A. sveta sta omogočila hiter potek dela in sprotno reševanje Grizold, ur., Jezikovna prepletanja, str. 149–165, zagat. Fakulteta za družbene vede, Ljubljana. Delo opisano v prispevku je podprl temeljni raziskovalni Nikola Ljubešić in Kaja Dobrovoljc. 2019. What Does projekt »Tradicionalne paremiološke enote v dialogu s Neural Bring? Analysing Improvements in sodobno rabo« (ARRS J6-2579). Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. V: Proceedings of the 5. Literatura 7th Workshop on Balto-Slavic Natural Language Processing, str. 29–34, Association for Computational Ojo Arewa in Alan Dundes. 1966. Proverbs and the Linguistics, doi:10.18653/v1/W19-3704. Ethnography of Speaking Folklore. American Milko Matičetov. 1956. Pregovori in uganke; ljudska Anthropologist, 64: 70–85. proza. Slovenska matica, Ljubljana. Saša Babič, Miha Peče, Tomaž Erjavec, Barbara Ivančič Matej Meterc. 2021. Aktualna raba in pomenska Kutin, Katarina Šrimpf Vendramin, Monika Kropej določljivost 200 pregovorov in sorodnih paremioloških Telban, Nataša Jakop, in Marija Stanonik. 2022. izrazov. Jezikoslovni zapiski 27(1): 45–61. 
DirKorp: A Croatian Corpus of Directive Speech Acts

Petra Bago*, Virna Karlić†
* Department of Information and Communication Sciences
† Department of South Slavic Languages and Literatures
Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, HR-10000 Zagreb
{pbago, vkarlic}@ffzg.hr

Abstract
In this paper we present recent developments on a new version (v2.0) of DirKorp (Korpus direktivnih govornih činova hrvatskoga jezika), a Croatian corpus of directive speech acts developed for the purposes of pragmatic research. The corpus contains 800 elicited speech acts collected via an online questionnaire with role-playing tasks. Respondents were 100 Croatian speakers, all undergraduate or graduate students of the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus has been manually annotated at the speech-act level, with each speech act containing up to 12 features. It contains 12,676 tokens and 1,692 types. The corpus is encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange, developed and maintained by the Text Encoding Initiative Consortium (TEI). We describe the applied pragmatic annotation as well as the structure of the corpus.

1. Introduction
Corpus pragmatics is an interdisciplinary field of study that incorporates linguistic pragmatics and computer science, focusing on the development of natural language corpora in machine-readable form and their application for the purposes of studying pragmatic phenomena in written and spoken language. For a long time, linguists regarded a corpus approach to language as incompatible with pragmatics (Romero-Trillo, 2008: 2). While the corpus approach to studying language implies processing authentic language material with quantitative research methods, pragmatic research is still predominantly qualitative in nature – based on the researcher's introspection, on data obtained by elicitation methods, or on the analysis of authentic linguistic material of small size. The application of corpus analysis to the study of pragmatic phenomena represents a major turnaround in the development of pragmatics, primarily because it allows a systematic analysis of authentic language material of large size, and thus the detection of patterns of language use that "go below the radar" of qualitative analyses (ibid.). In addition, it should be pointed out that the application of new technologies in linguistics, including pragmatics, did not only ensure, facilitate or accelerate numerous research processes, but also opened the door to a new, different way of thinking about language (Leech, 1992).

The application of corpus methods to large pragmatic corpora allows one to carry out systematic, empirically based pragmatic research (Bunt, 2017: 327). While the implementation of corpus research can result in minor adjustments to existing theories on the one hand, it can lead to a rethinking of pragmatic concepts and theoretical frameworks on the other – for example, the development of the theory of dialogue acts (ibid.).

According to Rühlemann and Aijmer (2015), one of the major methodological problems that corpus pragmatics researchers encounter is the disproportionate relationship between pragmatic functions and the language forms by which these functions are expressed. One form can perform multiple pragmatic functions in discourse, while one function can be expressed by different forms, which makes querying a corpus by the pragmatic-function criterion considerably difficult. It is for this reason that corpus pragmatics researchers most often investigate conventional speech acts or functions performed by a limited number of language forms (Jucker, Schreier, and Hundt, 2009: 4).

The aim of this paper is to present the first Croatian corpus of directive speech acts, DirKorp, manually annotated for corpus pragmatic research.

The paper is structured as follows: Section 2 describes selected work related to pragmatic corpora, while the subsequent three sections present the DirKorp corpus. Section 3 gives a description of the developed corpus, Section 4 describes the 12 annotation features, and Section 5 presents the structure of the corpus encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange (TEI Consortium, 2021). Finally, Section 6 contains conclusions and future work.

2. Related Work
The number of large corpora with systematically implemented pragmatic annotation is so far small. Due to the disproportionate relationship between pragmatic functions and the language forms by which these functions are expressed, automatic corpus annotation does not produce satisfactory results. For this reason, only a small number of researchers have engaged in the creation of larger corpora of this sort. Generally, for the purposes of corpus pragmatic research, specialized corpora of smaller size are produced for individual research purposes. In addition, pragmatic research is sometimes carried out on corpora without pragmatic annotation.

An example of a corpus that does not contain pragmatic annotation, but which has been used for pragmatic research, is the Birmingham Blog Corpus1 (Kehoe and Gee, 2007; Kehoe and Gee, 2012). It is a subcorpus of a larger set of corpora developed by the Research and Development Unit for English Studies at Birmingham City University. It consists of blog posts and reader comments, amounting to 500M words of English collected between 2000 and 2010. Automatic POS annotation was performed using the Stanford CoreNLP tools2 and includes lemma annotations and part-of-speech categories3 based on the Universal Dependencies framework4, while documents contain metadata on the publication date. Pragmatic research on speech acts has been conducted on this corpus: for example, Lutzky and Kehoe (2017a; 2017b) used it to analyze apologies as speech acts that contain formulaic expressions, which facilitates querying them in a corpus with available tools. Similarly, we (Karlić and Bago, 2021) conducted research on the pragmatic functions and properties of imperatives using corpora without pragmatic annotation. We used hrWaC and srWaC (Ljubešić and Klubička, 2014), two large web corpora of Croatian and Serbian with morphosyntactic annotation. For the purposes of the analysis, an additional pragmatic annotation of a representative sample of verbs in the imperative form was carried out manually. Other corpora of spoken and written Croatian with no pragmatic annotation have also been used as resources for corpus pragmatic research. For example, Hržica, Košutar, and Posavec (2021) used the Croatian Adult Spoken Language Corpus (HrAL) (Kuvač Kraljević and Hržica, 2016) and the Croatian National Corpus of the written language (HNK) (Tadić, 1996) for the search and analysis of connectors and discourse markers.

According to Bunt (2017), the majority of corpora with pragmatic annotation contain labels for discourse relationships in written texts and for spoken dialogue acts. An example of such a larger corpus is the Penn Discourse Treebank, or PDTB5 (Prasad, Webber, and Lee, 2018), which contains labels for discourse relations, i.e. discourse structure and its semantics. Discourse annotations were added to a subcorpus consisting of texts published in the newspaper Wall Street Journal, amounting to 1M tokens, included in the bigger Penn Treebank (PTB) corpus. Bunt (2017) states that there are corpora of other languages, such as Chinese, Czech, Dutch, German, Hindi and Turkish, developed for the purposes of studying the co-occurrence of discourse labels – emphasizing that these corpora are manually annotated and of modest sizes. Additionally, for each corpus a new schema was developed based on various theoretical starting points.

DialogBank6 (Bunt et al., 2019) is one of the rare dialogue corpora annotated with the ISO 24617-2 standard. It contains already existing dialogue corpora annotated with various schemas. Four corpora are of English: HCRC Map Task (Anderson et al., 1991), Switchboard (Godfrey, Holliman, and McDaniel, 1992), TRAINS (Allen et al., 1995) and DBOX (Petukhova et al., 2014); and four of Dutch: DIAMOND (Geertzen et al., 2004), OVIS7, Dutch Map Task (Caspers, 2000) and Schiphol (Prüst, Minnen, and Beun, 1984). Dialogue act annotation involves segmenting a dialogue into defined grammatical units and augmenting each unit with one or more communicative function labels.

Another example of a corpus with pragmatic annotation is the Engineering Lecture Corpus8 (Alsop and Nesi, 2013; Alsop and Nesi, 2014), which contains 76 transcripts based on hour-long video recordings of engineering lectures held in English at three universities. It is manually annotated for three pragmatic features: humor, storytelling and summary9. Each feature can be augmented with one of the attributes containing additional information that describes the feature in more detail. Further, the corpus contains labels regarding significant breaks, laughter, writing or drawing on the board, etc.

Finally, we present the SPICE-Ireland corpus (Systems of Pragmatic Annotation in the Spoken Component of ICE-Ireland) (Kallen and Kirk, 2012), part of the larger set of corpora ICE-Ireland (International Corpus of English: Ireland Component), containing pragmatic, discourse and prosodic features. The corpus contains various types of private and public, formal and informal dialogues and monologues of a length of about 2,000 words each, amounting to 625K words of spoken English. The pragmatic annotation of speech acts is based on Searle's classification (Searle, 1969; Searle, 1976): representatives, directives, commissives, expressives and declaratives.

To the best of our knowledge, there exist no publicly available corpora of spoken or written Croatian with pragmatic annotation. So far, Croatian linguists have mostly dealt with speech acts from a theoretical perspective, referring primarily to Austin's and Searle's theory (cf. Pupovac, 1991; Ivanetić, 1995; Miščević, 2018; Palašić, 2020). In recent times, however, the amount of research based on qualitative and quantitative analysis of small-sized authentic linguistic materials (from literary texts and advertisements to email messages and political discourse, in Croatian and other languages) has been increasing (cf. e.g. Pišković, 2007; Matić, 2011; Franović and Šnajder, 2012; Šegić, 2019).

In the following sections we present a new version (v2.0) of DirKorp, the first Croatian corpus of directive speech acts.

1 https://www.webcorp.org.uk/wcx/lse/corpora
2 https://stanfordnlp.github.io/CoreNLP/
3 See more about the POS tagset used for the Birmingham Blog Corpus: https://www.webcorp.org.uk/wcx/lse/guide.
4 https://universaldependencies.org/u/pos/index.html
5 https://doi.org/10.35111/qebf-gk47
6 https://dialogbank.uvt.nl/
7 http://www.let.rug.nl/vannoord/Ovis/
8 www.coventry.ac.uk/elc
9 https://www.coventry.ac.uk/research/research-directories/current-projects/2015/engineering-lecture-corpus-elc/annotations-and-mark-ups/

3. Corpus Description
DirKorp (Korpus direktivnih govornih činova hrvatskoga jezika) (Karlić and Bago, 2021) is a Croatian corpus of directive speech acts developed for the purposes of pragmatic research. The corpus contains 800 elicited speech acts collected via an online questionnaire with role-playing tasks, applying the method of simulated communication implemented under pre-set conditions. This method is suitable for researching speech acts because it makes it possible to collect a great number of examples of speech acts with equal propositional content and illocutionary purpose, used in the same controlled situations. The questionnaire included eight closed-type role-playing tasks. Tasks of this type imply recording the speaker's reactions (in this case in writing) to the stimulus, without feedback. In each task, the participants are presented with one textually described hypothetical situation and asked to direct a directive speech act to their interlocutor. Their assignment was to imagine they were in the presented situation and to give the written statement they would use in the described situation. The presented situations are classified into two categories with regard to the relationship between the participants in the communication act: (1) situations involving interlocutors who are not in a familiar relationship; (2) situations involving interlocutors in a familiar relationship. Assignments of the two categories are organized into four pairs, asking respondents to produce a speech act of similar propositional content: "I want you to return something that belongs to me" (for the text of the role-playing tasks, see Example 1, where the interlocutors have (a) an unfamiliar relationship and (b) a familiar relationship); "I want you to answer my inquiry"; "I want you to change something that bothers me"; "I want you to stop behaving inappropriately"10.

Example 1
(a) Upravo si pojeo/la ručak u restoranu. Posluživao te stariji konobar koji se odnosio prema tebi ljubazno i profesionalno. Prilikom plaćanja računa konobar ti vraća 100 kuna manje nego što je trebao. Želiš da ti konobar vrati novac. Zamisli da se konobar nalazi pred tobom i napiši što bi mu točno rekao/la u danoj situaciji (nemoj prepričavati, već iskaz formuliraj kao da se izravno obraćaš sugovorniku).
(Eng. You have just eaten lunch at a restaurant. You were served by an elderly waiter who treated you kindly and professionally. When paying the bill, the waiter gives you back 100 kunas less than he should have. You want the waiter to give you your money back. Imagine the waiter is in front of you and write down exactly what you would say to him in the given situation (do not recount, but formulate the statement as if you were addressing the interlocutor directly).)
(b) Posudio/la si knjigu najboljem prijatelju (ili prijateljici). Rekao ti je da će ti je uskoro vratiti, no nije održao riječ. Sjedite zajedno u kafiću, situacija je opuštena, razgovarate o svakodnevnim stvarima. Želiš mu dati do znanja da ti treba čim prije vratiti knjigu. Zamisli da se tvoj prijatelj nalazi pred tobom i napiši što bi mu točno rekao/la u danoj situaciji (nemoj prepričavati, već iskaz formuliraj kao da se izravno obraćaš sugovorniku).
(Eng. You lent a book to your best friend. (S)he told you (s)he'd give it back to you soon, but (s)he didn't keep her/his word. You are sitting together in a café, the situation is relaxed, you are talking about everyday things. You want to let her/him know you need to get your book back as soon as possible. Imagine your friend is in front of you and write down exactly what you would say to her/him in the given situation (do not recount, but formulate the statement as if you were addressing the interlocutor directly).)

Respondents were 100 Croatian speakers, all undergraduate (63%) or graduate (37%) students of the Faculty of Humanities and Social Sciences, University of Zagreb, aged between 18 and 33. Croatian is the mother tongue of the majority of the respondents (96%). The questionnaire was carried out during December 2020 and January 2021. All respondents participated in the study voluntarily. The questionnaire was conducted anonymously, and the collected language material was used exclusively for scientific purposes.

The elicitation of language production by the role-playing method has its advantages and disadvantages. On the one hand, it enables the collection of a large number of speech acts with the same propositional content and illocutionary purpose. On the other hand, users of the corpus should keep in mind that the language material collected by this method does not reflect the features of actual language use. It rather shows what speakers think they would say and/or do in hypothetical situations.

DirKorp contains 12,676 tokens and 1,692 types11. Since it consists of 800 speech acts, it is a relatively small corpus. However, as the first Croatian corpus with detailed pragmatic annotation, DirKorp can serve as a useful resource for researching speech acts, politeness strategies and other related pragmatic phenomena in the Croatian language. In addition, we hope that it will contribute to the development of larger corpora of the Croatian language with pragmatic annotation, and that it will encourage a wider application of corpus-pragmatic research methods.

We have conducted corpus pragmatic analyses of the collected speech acts in order to investigate the ways and means of expressing directives, and their pragmatic characteristics and functions. For example, we confirmed that indirect directives are more frequent than direct ones, especially among interlocutors who are not in a familiar relationship. Regarding the (un)familiar relationship between interlocutors, we detected that explicit illocutionary force is more frequent in communication between interlocutors in a familiar relationship, while implicit illocutionary force is more frequent in communication between interlocutors in an unfamiliar relationship. Additionally, we have identified that imperative utterances are a more frequent type of direct directive than utterances with a directive performative verb in the 1st person. For more such corpus pragmatic analyses, see Karlić and Bago (2021).

10 Full texts of the role-playing tasks are available in the corpus header.
11 Respondents' answers contain utterances, but also text about what they would do in the given situation. At this moment, we have not analyzed the average length of a response. Generally, we can only state that some speech acts contain only one utterance, while some contain more than one.

4. Corpus Annotation
The collected language material has been manually annotated at the speech-act level by two independent annotators with university graduate degrees in the field of philology. The annotators received oral and written instructions, including illustrative examples for all the features they had to annotate.

The categorization of speech acts and their formal and pragmatic properties was carried out according to the theory of speech acts by Austin (1962), Searle (1969; 1976) and their successors; the politeness theory of Brown and Levinson (1978); and the grammars of the contemporary Croatian and Serbian languages (Silić and Pranjković, 2007; Piper et al., 2005). For more on the individual categories, see Karlić and Bago (2021). In the new version of DirKorp (v2.0), each speech act can contain up to 12 features. The first 8 features were part of corpus version v1.0, while features 9–12 are newly added. For the frequency distribution of all features, see Karlić and Bago (2021).

(1) Respondent ID – This mandatory feature contains information on the identity of the respondent uttering the speech act.

(2) Familiarity / unfamiliarity – This mandatory feature contains information on the category of the proposed situation in which the speech act was uttered. Four situations are labelled 'unfamiliar' (involving interlocutors who are not in a familiar relationship), while the other four situations are labelled 'familiar' (involving interlocutors who are in a familiar relationship).

(3) Utterance type – This mandatory feature contains information on the utterance type with regard to its structural organization. It contains six labels: (a) an imperative utterance, (b) an assertive utterance (a statement), (c) an utterance in the form of a question, (d) an utterance in the form of an ellipsis, (e) a nonverbal signal, (f) a case of avoidance of executing a speech act (see Example 2).

Example 2
(a) E vrati mi onu knjigu koju sam ti posudio. (Eng. Hey, give me back that book I lent you.)
(b) Oprostite, ali mislim da ste mi krivo vratili novce. (Eng. Excuse me, but I think you gave me the wrong change.)
(c) Možete li molim vas zatvoriti prozore? (Eng. Could you please close the windows?)
(d) E, moja knjiga?? (Eng. Hey, my book??)
(e) [Samo bih zavrtjela očima da vide moje neodobravanje, ali ne bih ništa rekla.] (Eng. [I'd just roll my eyes so that they see my disapproval, but I wouldn't say anything.])
(f) [Ne bih ništa rekao.] (Eng. [I wouldn't say anything.])

(4) Directive performative verb in 1st person – This optional feature contains information on the presence of a directive performative verb in the 1st person as part of the speech act, only for assertive utterances and utterances in the form of a question. It contains two labels: (a) yes and (b) no (see Example 3).

Example 3
(a) Oprostite, molim da odete na kraj reda. (Eng. Excuse me, I am imploring you to go to the end of the line.)
(b) Gospođo, morate na kraj reda stati. (Eng. Madam, you must move to the end of the line.)

(5) Illocutionary force – This optional feature contains information on the explicitness or implicitness of the illocutionary force of a speech act. It is only applied to utterances that contain verbal means (an imperative utterance, an assertive utterance, an utterance in the form of a question and in the form of an ellipsis). It contains two labels: (a) explicit and (b) implicit (see Example 4).

Example 4
(a) Daj mi donesi više onu knjigu, treba mi! (Eng. Bring me that book already, I need it!)
(b) Kaj je s onom knjigom koju sam ti posudio? (Eng. What happened to that book I lent you?)

(6) Propositional content – This optional feature contains information on the explicitness or implicitness of the propositional content of a speech act. It is only applied to utterances that contain verbal means (an imperative utterance, an assertive utterance, an utterance in the form of a question and in the form of an ellipsis). It contains two labels: (a) explicit and (b) implicit (see Example 5).

Example 5
(a) Gledaj na cestu, pusti mobitel. (Eng. Look at the road, leave the cell phone.)
(b) Ti hoćeš da poginemo? (Eng. Do you want us to die?)

(7) T/V form – This optional feature contains information on how the respondent addressed the interlocutor, using an informal (T-form) or a formal you (V-form). It is only applied to utterances that contain verbal means (an imperative utterance, an assertive utterance, an utterance in the form of a question and in the form of an ellipsis). It contains three labels: (a) T-form, (b) V-form and (c) impossible to determine (see Example 6).

Example 6
(a) Oprosti, dao si mi manje novca. (Eng. Sorry, youT-form gave me less change.)
(b) Oprostite, mislim da ste mi ipak još dužni 100 kuna. (Eng. Excuse me, I think youV-form still owe me 100 kunas.)
(c) Hmm... još 100 kuna, zar ne? (Eng. Hmm… another 100 kunas, right?)

(8) Exhortative – This optional feature contains information on the presence of an exhortative as part of the speech act. It contains two labels: (a) yes and (b) no (see Example 7).

Example 7
(a) Daj mi više vrati knjigu, treba mi za knjižnicu. (Eng. Bring me back my book already, I need it for the library.)
(b) Jel se sjećaš one knjige koju sam ti posudila? Potrebna mi je. Možeš li mi ju donijeti sutra na faks? (Eng. Do you remember that book I lent you? I need it. Could you bring it to uni tomorrow?)

(9) Request – This optional feature contains information on whether the speech act includes a lexical marker of request. It contains two labels: (a) yes and (b) no (see Example 8).

Example 8
(a) E da, jel bi mi mogao/la vratiti knjigu, molim te? (Eng. Oh yeah, could you bring the book back, please?)
(b) Zaboravio si mi vratiti knjigu, jel se možeš idući put sjetiti? (Eng. You forgot to bring me back the book, can you remember next time?)

(10) Apology – This optional feature contains information on whether the speech act includes a lexical marker of apology. It contains two labels: (a) yes and (b) no (see Example 9).

Example 9
(a) Oprostite, ovdje fali još 100 kuna. (Eng. Excuse me, 100 kunas is missing here.)
(b) Možete li molim vas pritvoriti prozore, hladno mi je? (Eng. Could you please close the windows, I'm cold?)

(11) Gratitude – This optional feature contains information on whether the speech act includes a lexical marker of gratitude. It contains two labels: (a) yes and (b) no (see Example 10).

Example 10
(a) Molim te mi samo javi da znam zbog organizacije hoćeš li doći. Hvala ti! (Eng. Please just let me know whether you're coming so that I know, for the sake of organization. Thank you!)
(b) Heej, jel dolaziš večeras na druženje? Moram ti znati zbog organizacije. xoxo (Eng. Heeey, are you coming tonight to hang out? I need to know because of the organization. xoxo)

(12) Honorific title – This optional feature contains information on whether the speech act includes an honorific title. It contains two labels: (a) yes and (b) no (see Example 11).

Example 11
(a) Gospođo, kraj reda je dolje. (Eng. Madam, the end of the line is back there.)
(b) Oprostite, tamo je kraj reda! (Eng. Excuse me, the end of the line is there!)

5. Corpus Format
DirKorp is encoded according to the TEI P5: Guidelines for Electronic Text Encoding and Interchange, developed and maintained by the Text Encoding Initiative Consortium (TEI) (TEI Consortium, 2021). The TEI document is comprised of a header and the body of the corpus. The content of the elements and attributes is in Croatian. Metadata of the corpus is given in the header (see Figure 1 for an example), including the full text of the eight situations in the questionnaire; a list of questionnaire participants with information on their age, gender, undergraduate or graduate level of study, enrollment in a philological/non-philological/combined study program and mother tongue (see Figure 2 for an example); and a list of revisions of the DirKorp versions. The body of the corpus is composed of one division containing utterances with pragmatic features (see Figure 3 for an example).

Figure 1: An example of a pragmatic feature description – how the respondent addressed the interlocutor (T-form, V-form or impossible to determine): "Govorni čin sadržava obraćanje na ti …", "Govorni čin sadržava obraćanje na Vi …", "Nije moguće odrediti sadržava li govorni čin obraćanje na ti ili Vi …" (atribut se odnosi na tipove iskaza koji uključuju verbalna sredstva [imperativni, tvrdnja, upitni, eliptični]).

Figure 2: An example of participant information: ispitanik/ispitanica, 20 godina, spol Ž, preddiplomski studij Filozofskog fakulteta, nefilološko usmjerenje, materinji jezik hrvatski.

Figure 3: An example of an utterance containing all 12 pragmatic features: Ispričavam se, pardon, fali još sto kuna. Oprostite.

DirKorp is available for download under the CC BY-SA 4.0 license from GitHub in TEI format (https://github.com/pbago/DirKorp).

6. Conclusion and Future Work
We have presented DirKorp, the first Croatian corpus of directive speech acts, containing 800 elicited speech acts collected via an online questionnaire with role-playing tasks, specifically developed for pragmatic research studies. Respondents were 100 Croatian speakers, all students of the Faculty of Humanities and Social Sciences, University of Zagreb. The corpus has been manually annotated at the level of the speech act, with each speech act containing up to 12 features. It contains 12,676 tokens and 1,692 types. The corpus is available for download under the CC BY-SA 4.0 license from GitHub in TEI format.

Further work is planned on the corpus, which includes an evaluation of the developed scheme for annotating directive speech acts, annotation at levels smaller than the speech act, as well as augmentation with additional features such as information on the grammatical mood used in a speech act, information on the presence of a modal verb in the 2nd person as part of a speech act, and information on the various politeness strategies applied in a speech act.

7. Acknowledgements
This paper is generously co-financed by the institutional project of the Faculty of Humanities and Social Sciences "South Slavic languages in use: pragmatic analyses" (principal researcher Virna Karlić). We wish to thank all our annotators.

8. References
James F. Allen, Lenhart K. Schubert, George Ferguson, Peter Heeman, Chung Hee Hwang, Tsuneaki Kato, Marc Light, Nathaniel G. Martin, Bradford W. Miller, Massimo Poesio, and David R. Traum. 1995. The TRAINS Project: A Case Study in Building a Conversational Planning Agent. Journal of Experimental & Theoretical Artificial Intelligence, 7(1):7–48.
Sian Alsop and Hilary Nesi. 2013. Annotating a Corpus of Spoken English: The Engineering Lecture Corpus (ELC). In: Proceedings of GSCP 2012: Speech and Corpora, pages 58–62. Firenze University Press, Florence.
Sian Alsop and Hilary Nesi. 2014. The Pragmatic Annotation of a Corpus of Academic Lectures. In: The International Conference on Language Resources and Evaluation 2014 Proceedings, pages 1560–1563. European Language Resources Association, Reykjavik.
Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, Catherine Sotillo, Henry S. Thompson, and Regina Weinert. 1991. The HCRC Map Task Corpus. Language and Speech, 34(4):351–366.
John L. Austin. 1962. How to Do Things with Words. Clarendon Press, Oxford.
Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Cambridge University Press.
Military Friendship Publish, Beijing, https://www.isca-speech.org/archive/icslp_2000/.
Tin Franović and Jan Šnajder. 2012. Speech Act Based Classification of Email Messages in Croatian Language. In: Proceedings of the Eighth Language Technologies Conference, pages 69–72. Information Society, Ljubljana.
Jeroen Geertzen, Yann Girard, Roser Morante, Ielka van der Sluis, Hans van Dam, Barbara Suijkerbuijk, Rintse van der Werf, and Harry Bunt. 2004. The DIAMOND Project. In: Proceedings of the 8th Workshop on the Semantics and Pragmatics of Dialogue (CATALOG 2004), Barcelona.
John Godfrey, Edward Holliman, and Jane McDaniel. 1992. SWITCHBOARD: Telephone Speech Corpus for Research and Development. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pages 517–520. IEEE Computer Society, San Francisco.
Gordana Hržica, Sara Košutar, and Kristina Posavec. 2021. Konektori i druge diskursne oznake u pisanome i spontanome govorenom jeziku. Fluminensia: časopis za filološka istraživanja, 33(1):25–52.
Nada Ivanetić. 1995. Govorni činovi. FF-press, Zavod za lingvistiku Filozofskoga fakulteta Sveučilišta u Zagrebu, Zagreb.
Andreas H. Jucker, Daniel Schreier, and Marianne Hundt (eds.). 2009. Corpora: Pragmatics and Discourse. Rodopi, Amsterdam.
Jeffrey L. Kallen and John M. Kirk. 2012. SPICE-Ireland: A User's Guide. https://pure.qub.ac.uk/en/publications/spice-ireland-a-users-guide.
Virna Karlić and Petra Bago. 2021. (Računalna) pragmatika: temeljni pojmovi i korpusnopragmatičke analize. FF Press, Zagreb. https://openbooks.ffzg.unizg.hr/index.php/Ffpress/catalog/book/125.
Andrew Kehoe and Matt Gee. 2007. New Corpora from the Web: Making Web Text More 'Text-Like'. In: Studies in Variation, Contacts and Change in English 2. https://varieng.helsinki.fi/series/volumes/02/kehoe_gee/.
Andrew Kehoe and Matt Gee. 2012. Reader Comments as an Aboutness Indicator in Online Texts: Introducing the Birmingham Blog Corpus. In: Studies in Variation, Contacts and Change in English 12. https://varieng.helsinki.fi/series/volumes/12/kehoe_gee/.
Jelena Kuvač Kraljević and Gordana Hržica. 2016. Croatian Adult Spoken Language Corpus (HrAL). Fluminensia: časopis za filološka istraživanja,
28(2):87–102. Harry Bunt. 2017. Computational Pragmatics. In: Oxford Geoffrey N. Leech. 1992. Corpora and Theories of Handbook of Pragmatics, pages 326-345. Oxford Linguistic Performance. In: Directions in Corpus University Press, New York. Linguistics, pages 105–122. De Gruyter, Berlin. Harry Bunt, Volha Petukhova, Andrei Malchanau, Alex Ursula Lutzky and Andrew Kehoe. 2016. Your Blog is Fang, and Kars Wijnhoven. 2019. The DialogBank: (the) Shit: A Corpus Linguistic Approach to the Dialogues with Interoperable Annotations. In: Identification of Swearing in Computer Mediated Language Resources and Evaluation, 53(2):213–249. Communication. International Journal of Corpus Johanneke Caspers. 2000. Melodic Characteristics of Linguistics, 21(2): 165–191. Backchannelsin Dutch Map Task Dialogues. In: Ursula Lutzky and Andrew Kehoe. 2017a. ‘I Apologize Proceedings, 6th International Conference on for My Poor Blogging’: Searching for Apologies in the SpokenLanguage Processing, pages 611–614. China PRISPEVKI 28 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Birmingham Blog Corpus. Corpus Pragmatics, Resources for Adjudicating Meaning in Trademark 1(1):37–56. Disputes. In: The 9th International Corpus Linguistics Ursula Lutzky and Andrew Kehoe. 2017b. ‘Oops, I Didn’t Conference. Birmingham: Birmingham Mean to Be so Flippant’. A Corpus Pragmatic Analysis University.https://www.birmingham.ac.uk/documents/c of Apologies in Blog Data. Journal of Pragmatics, ollege-artslaw/corpus/conference-archives/2017/ 116:27–36. general/paper134.pdf. Nikola Ljubešić and Filip Klubička. 2014. {bs, hr, Rashmi Prasad, Bonnie Webber, and Alan Lee. 2018. sr}WaC-Web Corpora of Bosnian, Croatian and Discourse Annotation in the PDTB: The Serbian. In: Proceedings of the 9th Web as Corpus NextGeneration. 
In: Proceedings of the 14th Joint ACL- Workshop (WaC-9), pages 29–35, Association for ISO Workshop on Interoperable Semantic Annotation, Computational Linguistics , Gothenburg, pages 87–97. Santa Fe: Association for Computational https://aclanthology.org/W14-0405.pdf. Linguistics. https://aclanthology.org/W18-4710.pdf. Daniela Matić. 2011. Govorni činovi u političkome Hub Prüst, Guido Minnen, and Robbert-Jan Beun. 1984. diskursu. PhD thesis. Faculty of Humanities and Social Transcriptie dialooogesperiment juni/juli 1984, Sciences, Zagreb. IPORapport 481. Institute for Perception Research, Nenad Miščević. 2018. Rođenje pragmatike. Orion Art, Eindhoven University of Technology, Eindhoven. Beograd. Milorad Pupovac. 1990. Jezik i djelovanje. Biblioteka Nikolina Palašić. 2020. Pragmalingvistika – lingvistički časopisa Pitanja, Zagreb. pravac ili petlja? Hrvatska sveučilišna naklada, Zagreb. Jesús Romero-Trillo (ed.). 2008. Pragmatics and Corpus Volha Petukhova, Martin Gropp, Dietrich Klakow, Gregor Linguistics: A Mutualistic Entente. De Gruyter, Berlin. Eigner, Mario Topf, Stefan Srb, Petr Motlicek, Blaise Christoph Rühlemann and Karin Aijmer. 2015. Potard, John Dines, Olivier Deroo, Ronny Egeler, Uwe Introduction. Corpus pragmatics: laying the Meinz, Steffen Liersch, and Anna Schmidt. 2014. The foundations. In: Corpus pragmatics, pages 1-28. DBOX Corpus Collection of Spoken Human-Human John R. Searle. 1969. Speech Acts. Cambridge University and Human-Machine Dialogues. In: Proceedings of the Press, Cambridge. Ninth International Conference on Language Resources John R. Searle. 1976. A classification of illocutionary and Evaluation (LREC'14), pages 252–258. European acts. Language in Society, 5:1–23. Language Resources Association, Reykjavik. Josip Silić and Ivo Pranjković. 2007. Gramatika Predrag Piper et al. 2005 = Предраг Пипер, Ивана hrvatskoga jezika za gimnazije i visoka učilista. Školska Антонић, Бранислава Ружић, Срето Танасић, knjiga , Zagreb. 
Људмила Поповић, Бранко Тошовић. 2005. Tea Šegić. 2019. Tata kupi mi auto und Nivea Milk weil Синтакса савременог српског језика. Проста es nichts Besseres für die Hautpflege gibt. Filologija, реченица, Београд: Институт за српски језик САНУ, 73:103–116. Београдска књига, Матица српска. Marko Tadić. 1996. Računalna obradba hrvatskoga i Tatjana Pišković. 2007. Dramski diskurs između nacionalni korpus. Suvremena lingvistika, 41-42:603– pragmalingvistike i feminističke lingvistike. Rasprave: 611. Časopis Instituta za hrvatski jezik i jezikoslovlje, TEI Consortium (ed.). 2021. TEI P5: Guidelines for 33(1):325–341. Electronic Text Encoding and Interchange. TEI Olumide Popoola. 2017. A Dictionary, a Survey and a Consortium. Corpus Walked into a Courtroom...: An Evaluation of PRISPEVKI 29 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Universal Dependencies za slovenščino: nadgradnja smernic, učnih podatkov in razčlenjevalnega modela Kaja Dobrovoljc∗†‡, Luka Terčon†, Nikola Ljubeši懆 ∗Filozofska fakulteta, Univerza v Ljubljani Aškerčeva 2, 1000 Ljubljana kaja.dobrovoljc@ff.uni-lj.si †Fakulteta za računalništvo in informatiko Univerza v Ljubljani Večna pot 113, 1000 Ljubljana luka.tercon@fri.uni-lj.si ‡Institut “Jožef Stefan” Jamova cesta 39, 1000 Ljubljana nikola.ljubesic@ijs.si Povzetek Universal Dependencies (UD) je mednarodno usklajena označevalna shema za medjezikovno primerljivo oblikoslovno in skladenjsko označevanje besedil po načelih odvisnostne slovnice, ki je bila ob več kot 130 drugih svetovnih jezikih uspešno uporabljena tudi za označevanje besedil v slovenščini. 
In this paper, we present the results of recent UD-related activities within the project Development of Slovene in a Digital Environment (Razvoj slovenščine v digitalnem okolju), in which we upgraded the existing infrastructure with a revision and detailed documentation of the UD annotation guidelines for Slovenian, an extension of the SSJ-UD treebank of written Slovenian with new sentences from the ssj500k and ELEXIS-WSD corpora, and the training of a new syntactic parsing model in the CLASSLA-Stanza annotation tool. To support further applications in various areas of Slovenian language processing, we evaluate the new model in detail: in addition to a general evaluation of parsing accuracy, we also report accuracy at the level of individual syntactic relations and the most frequent error types.

1. Introduction

Linguistically annotated corpora, i.e. digitized text collections in which the surface words are accompanied by manually assigned information about their grammatical properties at various levels of linguistic description (Ide and Pustejovsky, 2017), represent one of the fundamental language resources for the development of language-technology tools on the one hand and for corpus-linguistic research on the other. Grammatical properties are typically assigned to texts on the basis of predefined annotation schemes, i.e. annotation systems, which, in addition to the set of possible labels, usually also contain guidelines for assigning them to concrete grammatical phenomena. Since annotation schemes were in the past developed separately for individual languages, grammatical theories or even corpora, their resulting diversity precluded any direct comparison of annotated data or of the computational tools based on them.

As a counterweight to this fragmentation, the Universal Dependencies annotation scheme1 was established in 2013, striving for an internationally, i.e. cross-linguistically harmonized grammatical annotation of texts at the morphological and syntactic levels in order to accelerate the development of multilingual language technologies, cross-lingual machine learning and contrastive linguistic analyses. The UD scheme thus establishes a universal set of categories and guidelines (17 parts of speech, 24 morphosyntactic features, 37 dependency syntactic relations) that enables uniform annotation of similar grammatical phenomena across the world's languages, while also allowing language-specific extensions where necessary. The scheme is based on the principles of dependency grammar, which, compared to phrase-structure approaches, is better suited to languages with free word order and to direct use in various language-technology applications (Jurafsky and Martin, 2021); its theoretical foundations are presented in more detail by De Marneffe et al. (2021).

To date, more than 200 corpora (so-called dependency treebanks) in 130 languages of the world have been manually annotated with the UD scheme. Among them are the universal dependency treebanks of written Slovenian, SSJ (Dobrovoljc et al., 2017), and of spoken Slovenian, SST (Dobrovoljc and Nivre, 2016), which have thereby been directly involved in the development of numerous state-of-the-art tools for multilingual natural language processing (Zeman et al., 2018), as well as in diverse comparative linguistic research (Futrell et al., 2015; Naranjo and Becker, 2018; Chen and Gerdes, 2018).

Given the importance of developing Slovenian resources within such international standardization initiatives, we have substantially upgraded the existing resources and the related infrastructure for annotating Slovenian texts according to the Universal Dependencies scheme within the national project Development of Slovene in a Digital Environment (RSDO),2 which aims to meet the needs for computational products and services in the field of language technologies for Slovenian.

We present the course and results of this activity in the remainder of the paper: after a brief presentation of the initial version of the SSJ-UD corpus before the start of the RSDO project (Section 2), we describe the documentation of the slightly revised UD annotation guidelines for Slovenian (Section 3). We continue with a presentation of the annotation campaign (Section 4), in which more than 5,000 new sentences were manually parsed; together with the somewhat improved original corpus, they form the latest version of the SSJ-UD corpus (Section 5). In the second part of the paper, we describe the training of a predictive model for automatic syntactic parsing based on the new corpus (Section 6), which we then evaluate with an analysis of overall accuracy (Section 7) and an analysis of the most frequent errors (Section 8).

1https://universaldependencies.org/
2https://slovenscina.eu/
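UD treebanks such as SSJ-UD are distributed in the standard CONLL-U format, with one token per line and ten tab-separated fields. As a minimal illustration of how such data can be read (this is not the project's own tooling, and the toy sentence below is invented, not taken from SSJ-UD):

```python
# Minimal CONLL-U reader: each token line has 10 tab-separated fields
# (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
# Simplification: multiword-token ranges and empty nodes are not handled.

FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Return one list of token dicts per sentence."""
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):          # sentence-level comments
            continue
        if not line:                      # blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        tok = dict(zip(FIELDS, line.split("\t")))
        tok["id"], tok["head"] = int(tok["id"]), int(tok["head"])
        tokens.append(tok)
    if tokens:
        sentences.append(tokens)
    return sentences

sample = ("# text = Janez bere knjigo.\n"
          "1\tJanez\tJanez\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
          "2\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_\n"
          "3\tknjigo\tknjiga\tNOUN\t_\t_\t2\tobj\t_\t_\n"
          "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n")

sent = parse_conllu(sample)[0]
root = [t for t in sent if t["head"] == 0][0]
print(root["form"], root["deprel"])  # bere root
```

A token whose HEAD field is 0 is the sentence root; all other tokens point to the ID of their syntactic head.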
2. Origins of the SSJ-UD Corpus

The first version of the universal dependency treebank of written Slovenian, SSJ-UD,3 was created through a semi-automatic conversion of the ssj500k corpus (Krek et al., 2020), a richly annotated reference training corpus for Slovenian that had previously been manually lemmatized, morphosyntactically tagged and syntactically parsed according to the JOS annotation system (Erjavec et al., 2010). While JOS lemmas and morphosyntactic tags are assigned to all tokens of the ssj500k corpus (586,248 tokens, i.e. 27,829 sentences), only slightly less than half of the corpus is syntactically parsed (235,864 tokens, i.e. 11,411 sentences).

The conversion of ssj500k from the JOS annotation scheme to UD (Dobrovoljc et al., 2016; Dobrovoljc et al., 2017) was based on an extensive set of mapping rules covering all three levels of the UD scheme: parts of speech, morphological features and dependency syntactic relations.4 Since the annotation principles of the two systems at the level of morphology are (with a few exceptions) quite similar, the rules mapping to UD parts of speech and morphosyntactic features could be used to convert the entire ssj500k corpus, as well as the Sloleks lexicon based on the same system (Dobrovoljc et al., 2019), with manual disambiguation required only for the part-of-speech categorization of the verb biti 'to be'.5

The syntactically parsed part of ssj500k, on the other hand, was only partially converted to the UD scheme: owing to the robustness of the JOS system in comparison with UD, and despite a detailed system of mapping rules, not all sentences could be fully converted automatically with sufficiently reliable accuracy. The unconverted sentences were mostly those with structures labeled in the JOS system as so-called third-level links (the modra label), such as clausal coordination and juxtaposition, appositions and explanatory structures, particles or non-propositional adverbs, parentheticals and the like.

The original version of the SSJ-UD corpus, first released as part of the UD v1.2 treebank collection in 2015, thus comprised 8,000 sentences, i.e. 140,670 tokens. Despite continuous improvements to the corpus through adaptation to changes in the general annotation guidelines and correction of individual errors, its size remained unchanged until the recent extension presented in Section 4 of this paper.

3. Documenting the UD Guidelines for Slovenian

The general UD guidelines, as documented on the project's umbrella website,6 are, as a continuation of previous standardization initiatives and years of collaborative development, designed to address the syntactic specifics of the widest possible range of languages as concisely as possible. The general guidelines therefore mainly provide prototypical definitions of individual labels, descriptions of the most typical borderline cases, and illustrations from selected languages, while it is the task of treebank authors for individual languages to transfer these general guidelines to their concrete language data. The UD infrastructure allows these principles to be documented for each language as language-specific guidelines on the official website, but this is not obligatory, so the documentation of UD annotation guidelines for individual languages is largely left to the initiative of the data authors.

For Slovenian, only the guidelines for assigning parts of speech and morphosyntactic tags were documented upon the first release of the SSJ-UD corpus; these have since become somewhat outdated with the transition from UD v1 to UD v2 (Nivre et al., 2020), while the guidelines for assigning UD syntactic relations to Slovenian texts, given their extent, were not documented in detail, i.e. they were evident only implicitly from the conversion rules on the one hand and the published corpus on the other.

The first step within the RSDO project was therefore dedicated to an exhaustive documentation of the UD guidelines for Slovenian at all three levels of annotation (parts of speech, morphosyntactic features and syntactic relations) in the form of a manual that explains and illustrates, on Slovenian examples, the use of individual UD labels for annotating Slovenian texts. In addition to describing the original guidelines, we also introduced a few minor changes where the original annotation of the SSJ-UD corpus was inconsistent or inadequate with respect to the universal guidelines. Among these, we can highlight in particular the changes in the treatment of comparative structures (the property as the head of the comparison), emphatic particles (distinguishing between modifiers of nouns on the one hand and of predicates on the other), discourse connectives (distinguished according to their clausal position) and the free morpheme se/si (distinguishing between pronouns in object and expletive roles), which, due to the limitations of the automatic conversion from the JOS system, were originally annotated differently than prescribed by the general UD guidelines.

3In this paper, instead of the official name of the treebank (SSJ), we use the longer acronym SSJ-UD, owing to its similarity to the names of related corpora and projects in Slovenia. 4The rules and scripts for conversion from JOS to UD are available at https://github.com/clarinsi/jos2ud. 5Unlike the JOS system, in which occurrences of the verb biti are always tagged as a verb with the subtype auxiliary regardless of their syntactic role or meaning, the UD system distinguishes already at the part-of-speech level between main verbs (VERB) and auxiliary verbs (AUX), the latter covering verbs functioning as auxiliaries and copulas. 6https://universaldependencies.org/guidelines.html
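The VERB/AUX distinction for biti described in footnote 5 can be illustrated with a small sketch. Note that this is only an illustration of the category distinction, assuming the dependency relations are already known; in the project itself the disambiguation was carried out manually, and the token records below are invented:

```python
# Sketch of the UD VERB/AUX distinction for "biti": in UD, "biti" is AUX
# when it functions as an auxiliary or copula (deprel aux or cop) and
# VERB otherwise. Token records are invented toy data, not from ssj500k.

def upos_for_biti(deprel):
    return "AUX" if deprel in {"aux", "cop"} else "VERB"

tokens = [
    {"form": "je", "lemma": "biti", "deprel": "aux"},   # auxiliary: "je prišel"
    {"form": "je", "lemma": "biti", "deprel": "cop"},   # copula: "je učitelj"
    {"form": "je", "lemma": "biti", "deprel": "root"},  # main verb use
]
print([upos_for_biti(t["deprel"]) for t in tokens])  # ['AUX', 'AUX', 'VERB']
```

In JOS, by contrast, all three occurrences would carry the same auxiliary-verb tag, which is why this is the one place where rule-based conversion alone could not decide.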
The manual of UD guidelines for Slovenian7 contains, in addition to descriptions of individual grammatical categories and the principles of their assignment to Slovenian texts, a section with a more detailed treatment of difficult cases, which was also supplemented during the annotation campaign described in Section 4. The publication of the Slovenian guidelines on the official UD website (in English) is also in preparation, as is an inventory of open issues with initial recommendations for further improvements (in collaboration with the University of Nova Gorica).

4. Extending the SSJ-UD Corpus

The second step of the project was an annotation campaign in which we manually annotated more than 5,000 new sentences from the ssj500k and ELEXIS-WSD corpora, while the annotation of the original version of the SSJ-UD corpus was also somewhat improved. In all three phases, annotation was carried out in the Q-CAT annotation tool (Brank, 2022), which now also supports the standard CONLL-U format, while for comparing annotated files (curation) we used a local installation of the WebAnno tool (Eckart de Castilho et al., 2016) maintained by CLARIN.SI.8 A more detailed analysis of the annotation process is given by Dobrovoljc and Ljubešić (2022); below we present only the most important results.

4.1. Extension with Semi-Converted Sentences from ssj500k

As mentioned in Section 2, some syntactically parsed sentences in ssj500k could not be fully converted to UD labels due to the limitations of the conversion rules. They were therefore not included in the original version of the SSJ-UD treebank, but represented a logical starting point for further extension of the UD data for Slovenian. In the first phase of the extension, annotators manually reviewed these 3,411 semi-converted sentences, i.e. 96,194 tokens, of which 22,377 (23.5%) had no assigned UD syntactic relation. For easier visualization, these were labeled with the relation unknown (Figure 1), and the annotators (two per sentence), in addition to creating new links, also checked the adequacy of the already existing (converted) links.

Figure 1: Example of a semi-converted sentence from ssj500k with missing UD relations (unknown) as displayed in the Q-CAT annotation tool.

Among the tokens that initially had no assigned UD relation, almost half were punctuation (punct), which was expected given the conversion rules, since punctuation was mostly attached to the relevant head only after all other tokens in the sentence had been attached, in particular the sentence root (root, usually the head of the main clause predicate or another hierarchically most prominent element of the sentence), which represents the second most frequent type of unconverted token (12%). These are followed by the relations parataxis (9%) and conj (6%), which are used to link juxtaposed and coordinated clauses, i.e. structures that could not be converted with sufficiently reliable accuracy by rules alone.

4.2. Extension with Sentences from the ELEXIS-WSD Corpus

In the second phase of the extension, the ELEXIS-WSD-SL corpus was also syntactically parsed. This is the Slovenian part of the parallel ELEXIS-WSD corpus (Martelli et al., 2021; Martelli et al., 2022), developed for the purposes of machine word-sense disambiguation, which contains Wikipedia texts translated into several European languages (Schwenk et al., 2021). The Slovenian ELEXIS-WSD corpus contains 2,024 sentences (31,237 tokens) that had previously been manually tokenized, lemmatized and morphosyntactically tagged according to the JOS system; on this basis, we automatically converted the corpus to UD parts of speech and morphosyntactic tags with a conversion script, while occurrences of the verb biti were disambiguated manually.

The corpus annotated in this way was initially parsed with the CLASSLA-Stanza tool (Ljubešić and Dobrovoljc, 2019), and the correctness of the automatically assigned parses was then reviewed by three annotators and a final curator. In this way, 1,534 (4.91%) syntactic relations were manually corrected, dominated by structures with the labels nmod, advmod, obl, conj and punct, which, as we shall see below, corresponds to the parser's most frequent error types in general (Section 8).

4.3. Improving the Annotation of the Original Corpus

In addition to adding newly parsed sentences, we also improved the annotation of the initial version of the SSJ-UD corpus in view of the slight revision of the guidelines (Section 3), the analysis of manual corrections of converted relations (Section 4.1) and other identified inconsistencies. Among the approximately 30 identified types of errors or inconsistencies were, for instance, appositive structures, a high proportion of (unjustified) non-projective links,9 and inconsistent distinctions between juxtaposed and coordinated clauses and between direct and indirect objects, etc. For each of these categories, we used heuristic queries to create subcorpora of sentences with potentially problematic annotations, which the annotators then manually reviewed and corrected in accordance with the guidelines. In this way, 1,670 syntactic labels were corrected in the initial corpus, which nevertheless represents a relatively small part of the entire corpus (1.2%).

7The manual is currently available as a working version and will be officially published at the conclusion of the RSDO project. 8https://www.clarin.si/webanno/. 9A link between word A and word B is projective if word A is also (indirectly) superordinate to all other words between A and B, i.e. there is a path from A to every word between A and B. Visualized graphically, the links in a non-projective tree cross each other. In languages with free word order, such as Slovenian, this is possible but nevertheless rare.
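The projectivity notion defined in footnote 9 can also be checked mechanically: the link from a head to its dependent is projective if the head dominates every word strictly between them. A minimal sketch on an invented toy tree (not the project's actual query tooling), assuming a well-formed, cycle-free tree:

```python
# Projectivity check per footnote 9: heads[i] is the head of token i+1
# (0 = sentence root). An arc head->dep is projective if the head
# dominates every token strictly between dep and head.

def dominates(heads, a, b):
    """True if token a is an ancestor of token b (1-based ids)."""
    while b != 0:
        b = heads[b - 1]
        if b == a:
            return True
    return False

def is_projective(heads):
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue
        lo, hi = min(dep, head), max(dep, head)
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            return False
    return True

print(is_projective([2, 0, 2]))     # True: a simple chain, no crossing arcs
print(is_projective([3, 4, 0, 3]))  # False: arcs 3->1 and 4->2 cross
```

Heuristic queries of this kind make it possible to extract candidate subcorpora of non-projective sentences for manual review, as described above.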
5. The New Version of the SSJ-UD Corpus

In the final step, we merged the initial SSJ-UD corpus with its somewhat improved annotation (Section 4.3) with the new sentences from the ssj500k (Section 4.1) and ELEXIS-WSD (Section 4.2) corpora, thus obtaining a new version of the reference universal dependency treebank of written Slovenian, SSJ-UD,10 which was first released as part of the official UD v2.10 release (Zeman et al., 2022). At the conclusion of the RSDO project, the SSJ-UD treebank will also be integrated into SUK, the new reference training corpus of Slovenian.

5.1. Corpus Composition

As shown in Table 1, the new version contains 5,435 newly parsed sentences (+67.9%) compared to the original, i.e. an almost twofold increase in the number of tokens (+126,427, +89.9%), which today places the SSJ-UD corpus 30th by token count among the 228 UD treebanks in total. With the extension, the SSJ-UD corpus has also become more diverse, as the three subcorpora (original sentences from ssj500k, new sentences from ssj500k, sentences from ELEXIS-WSD) differ from one another both in the type of texts they contain and in their syntactic complexity.

While the ssj500k texts, as a sample of the FidaPLUS corpus (Arhar Holdt, 2007), mainly contain originally Slovenian literary, non-literary and journalistic texts, the ELEXIS-WSD corpus contains translated encyclopedic texts from Wikipedia. On the other hand, the original SSJ-UD and the ELEXIS-WSD corpus are similar in terms of complexity (shorter and syntactically simpler sentences), whereas the new sentences from ssj500k are considerably longer. Finally, from a methodological point of view, it is important to point out that the three subcorpora also differ in the provenance of their UD labels: the labels of the original SSJ-UD are for the most part the result of automatic conversion from the JOS system, the labels of the new ssj500k sentences are a combination of conversion and manual review, while the labels of the ELEXIS-WSD sentences were reviewed manually in full.

Subcorpus | Sentences | Tokens | Avg.
Original SSJ-UD | 8,000 | 140,670 | 17.58
New from ssj500k | 3,411 | 95,194 | 27.91
New from ELEXIS-WSD | 2,024 | 31,233 | 15.43
New SSJ-UD total | 13,435 | 267,097 | 19.88

Table 1: Composition of the new version of the SSJ-UD corpus (from UD v2.10 onward).

5.2. Data Split

Part of publishing a treebank in the official UD collection is its split into training, validation and test sets, which are used as standard in the development and evaluation of predictive models based on these data. Here we followed the principles of the data split in the original version, in which the subsets were divided according to the order of appearance in the corpus. Given that the new ssj500k sentences are evenly dispersed throughout the corpus, we simply appended them to the existing split of the original version, keeping the same ratio (80% training, 10% validation, 10% test), and then added the ELEXIS-WSD sentences to each of the three sets in the same ratio. The composition of the subsets thus reflects the diversity of the new version of the SSJ-UD corpus, as described in Section 5.1, and the representativeness of the test data with respect to the training data ensures a more adequate, less genre-biased evaluation.

6. The Parsing Model

In the second phase of the project, we trained a new predictive model for UD syntactic parsing on the new, substantially larger version of the manually annotated SSJ-UD corpus, using the CLASSLA-Stanza annotation tool (Ljubešić and Dobrovoljc, 2019),11 which is likewise being developed within the RSDO project as the fundamental software tool for annotating Slovenian texts. It is a derivative of the open-source Stanza tool (Qi et al., 2020) that introduces several improvements over the original at the levels of tokenization, morphosyntactic tagging and lemmatization, while its syntactic parser differs from the original (Dozat and Manning, 2016), based on an extended bidirectional long short-term memory (BiLSTM) architecture, mainly in its use of the CLARIN.SI-embed.sl word embeddings (Ljubešić and Erjavec, 2018), trained on 3.5 billion words of Slovenian text.

For both training and evaluating the parsing model, we used manually annotated data at the lower levels of annotation (tokenization, sentence segmentation, morphosyntactic tagging, lemmatization), since at this stage of the parser's development we were primarily interested in the accuracy of the predictive model in isolation, without the influence of the tool's predictive performance at the lower levels.

The construction of the predictive model, its comparison with the model trained on the original version of SSJ-UD, and its evaluation on the individual subcorpora are described in more detail by Dobrovoljc and Ljubešić (2022), who find that the model trained on the new version of the SSJ-UD corpus is substantially improved over the model trained on the original version, owing to the increased volume and diversification of the training data. To shed light on the advantages and shortcomings of using the new parsing model in various language-technology and linguistic applications, and at the same time to identify priorities for its further improvement, in the remainder of the paper we extend these findings with a more detailed evaluation of the model's overall accuracy (Section 7) on the one hand and an analysis of the most frequent error types (Section 8) on the other.

10Although the UD infrastructure allows the publication of any number of treebanks, we deliberately decided to add the new sentences to the existing SSJ-UD treebank rather than publish new UD treebanks for Slovenian, in order to ensure the most effective use of these data in the wider language-technology community, where, for simplicity, models are often developed only on a selected, usually the largest, treebank of a given language. 11https://pypi.org/project/classla/
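The order-preserving split procedure described in Section 5.2 — keeping the original 80/10/10 division and distributing the added sentences across the three sets in the same ratio — can be sketched as follows (the sentence identifiers and counts are invented placeholders, not the actual corpus contents):

```python
# Sketch of the order-preserving 80/10/10 split from Section 5.2:
# sentences go to train/dev/test by order of appearance, and a second
# corpus is then distributed across the three sets in the same ratio.

def split_80_10_10(sentences):
    n = len(sentences)
    n_train, n_dev = int(n * 0.8), int(n * 0.1)
    return (sentences[:n_train],
            sentences[n_train:n_train + n_dev],
            sentences[n_train + n_dev:])

base = [f"ssj.s{i}" for i in range(100)]     # stand-in for SSJ-UD sentences
extra = [f"elexis.s{i}" for i in range(20)]  # stand-in for ELEXIS-WSD

train, dev, test = split_80_10_10(base)
for part, new in zip((train, dev, test), split_80_10_10(extra)):
    part.extend(new)                         # add in the same 80/10/10 ratio

print(len(train), len(dev), len(test))  # 96 12 12
```

Because both corpora are divided in the same proportion, the test set mirrors the overall composition of the merged corpus, which is what keeps the evaluation genre-balanced.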
razdelek) na stoječi drevesnici SSJ-UD, da bi zagotovili kar najbolj učinkovito drugi. izrabo teh podatkov v širši jezikovnotehnološki skupnosti, kjer se zaradi poenostavitve dela modeli pogosto razvijajo zgolj na iz- brani, običajno največji, drevesnici nekega jezika. 11https://pypi.org/project/classla/ PRISPEVKI 33 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 7. Splošna natančnost katerokoli drugo povezavo (npr. ostanki oštevilčenih strani Za kvantitativno evalvacijo splošne natančnosti modela pri digitalizaciji besedil). šmo uporabili standardni protokol, po katerem smo mo- Ceprav se je natančnost označevanja samostalniških pri- del, naučen na učni oz. validacijski množici uporabili za stavčnih določil (appos, 63,40), ’osirotelih’ stavčnih členov razčlenjevanje testne množice, napovedane oznake pa nato v povedih z glagolsko elipso (orphan; 68,24), diskur- primerjali z ročno pripisanimi. Za poročanje o natančnosti znih členkov (discourse; 69,23), stavčnih soredij (para- uporabljamo uveljavljeno metriko LAS (angl. labeled atta- taxis; 70,35) in naštevalnih seznamov (list; 75,86) z novo chment score), ki prikazuje delež pojavnic s pravilno napo- različico korpusa SSJ-UD bistveno izboljšala glede na pr- vedano nadrejeno pojavnico in vrsto njunega skladenjskega votni model (Dobrovoljc in Ljubešić, 2022), te relacije razmerja, pri čemer ta delež povzemamo z oceno F1, ki ostajajo med tistimi z najnižjo natančnostjo, kar je glede na prikazuje harmonično sredino med preciznostjo in prikli- njihovo ohlapnejšo slovnično povezanost s povedkom oz. cem.12 nadrejenimi stavčnimi členi tudi pričakovano. 
Rezultati, predstavljeni v tabeli 7., prikazujejo, da Med drugimi relacijami s podpovprečno natančnostjo razčlenjevalni model dosega splošno natančnost 93,21 LAS označevanja lahko izpostavimo še podredne stavke F1, kar nekoliko poenostavljeno pomeni, da se model v različnih tipov, kot so prislovni (advcl; 75,86), prilast- povprečju na vsakih sto označenih pojavnic zmoti pri manj kovi (acl; 81,73), osebkovi (csubj; 85,53) in predmetni kot sedmih, tj. jim pripiše napačno nadrejeno pojavnico odvisniki (ccomp; 90,67). Poleg nepremih predmetov in/ali vrsto povezave med njima.13 (iobj; 81,66), ki jih je težavno identificirati predvsem Kot prikazujejo rezultati za posamične tipe relacij,14 pa zaradi pomanjkljivosti trenutnih označevalnih smernic,15 ta splošna ocena natančnosti ni reprezentativna za vse vrste modelu precejšen izziv predstavljajo tudi priredja, zlasti skladenjskih struktur, saj je pri napovedovanju nekaterih re- medstavčna (conj; 85,91), samostalniški prilastki (nmod; lacij model bistveno natančnejši kot pri drugih. 87,44) in prislovna določila povedkov, samostalnikov in Med relacijami z najvišjo natančnostjo napovedova- pridevnikov (advmod; 89,95). nja so po pričakovanju funkcijske besede, kot so predlogi (case; 99,17), pomožni glagol biti (aux; 98,93), določilniški 8. Najpogostejše napake zaimki in prislovi (det; 98,79), podredni vezniki (mark; V drugem koraku evalvacije smo analizo zanesljivo- 98,69), ekspletivni zaimki (expl; 96,71) in priredni vezniki sti modela pri razčlenjevanju posameznih tipov relacij do- (cc; 96,27), skratka, pojavnice, ki se pojavljajo v zelo pred- polnili še s podrobnejšo analizo najpogostejših tipov na- vidljivih oblikah in skladenjskih položajih. pak. Tabela 8. 
tako povzema distribucijo napak glede na Poleg navedenih relacij model razmeroma dobro na- to, pri katerem izmed obeh napovedanih podatkov (identifi- tančnost dosega tudi pri napovedovanju nekaterih jedrnih kator nadrejene pojavnice in vrsta skladenjske relacije med skladenjskih struktur, kot so samostalniški predmeti (obj; njima) se je model dejansko zmotil. Za vsak tip napake 95.53) in osebki (nsubj; 95.28), nadpovprečno uspešen pa navajamo tudi pet najpogostejših podtipov glede na rela- je tudi pri identifikaciji korena povedi (root; 96,26), ki je cije, pri katerih se pojavlja, pri čemer štetje prikazujemo običajno jedro povedka glavnega stavka, in veznega gla- združeno za napake v obe smeri (npr. obl-nmod vključuje gola biti (cop; 95,43), ki nastopa v strukturah s povedko- tako napovedovanje obl namesto nmod kot napovedovanje vimi določili. nmod namesto obl). Med relacijami, pri napovedovanju katerih model do- Identificirane pogoste tipe napak znotraj vsake katego- sega najslabše rezulate, pričakovano najdemo ogovore (vo- rije na podlagi ročne analize napačno označenih primerov cative; 0,0), saj se v testni množici pojavi zgolj en primer, opišemo v nadaljevanju, pri čemer podrobneje predstavimo in nedoločene strukture (dep; 54,55), saj se ta oznaka kot predvsem najpogostejše. skrajna možnost uporablja predvsem za povezovanje ob- robnih, iregularnih pojavov, ki jim je nemogoče pripisati 8.1. Napačna napoved nadrejenega elementa 12 Kot prikazuje tabela 8., dobro polovico (52,8 %) pred- Izračuni temeljijo na uradni evalvacijski skripti tekmovanja CoNLL Shared Task 2018 (Zeman et al., 2018), ki smo jo doda- stavljajo napake, pri katerih je model pravilno napovedal tno prilagodili tako, da poleg splošnega izračuna natančnosti vrača skladenjsko vlogo pojavnice (pravilno relacijo oz. oznako), tudi rezultate za posamične skladenjske relacije in druge relevan- zmotil pa se je pri napovedi njenega nadrejenega elementa tne oznake. (jedra oz. izvora relacije). 
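The three-way error breakdown used in Section 8 (wrong head only, wrong label only, or both wrong, with direction-pooled subtypes) can be reproduced with a short sketch. This is an illustration with invented data; the function name and input format are ours, not part of the paper's evaluation script.

```python
from collections import Counter

def error_breakdown(gold, pred):
    """Classify each wrongly parsed token as: wrong head only,
    wrong label only, or both wrong.

    gold, pred: aligned lists of (head, deprel) pairs, one per token.
    Subtypes pool both directions, so ('nmod', 'obl') and
    ('obl', 'nmod') confusions are counted together.
    """
    counts = Counter()
    subtypes = Counter()
    for (gh, gd), (ph, pd) in zip(gold, pred):
        if gh == ph and gd == pd:
            continue  # token parsed correctly
        if gd == pd:
            counts['wrong head'] += 1
        elif gh == ph:
            counts['wrong label'] += 1
        else:
            counts['wrong head and label'] += 1
        # direction-agnostic subtype key, e.g. ('nmod', 'obl')
        subtypes[tuple(sorted((gd, pd)))] += 1
    return counts, subtypes
```

For example, a token with the right label but a wrong head index lands in the 'wrong head' bin, mirroring the first category of Table 3.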
13 This accuracy is in line with the accuracy of the Stanza tool for other languages and treebanks (https://stanfordnlp.github.io/stanza/performance.html) and with the accuracy of other contemporary parsers in general (https://universaldependencies.org/conll18/results.html), although a direct comparison is not meaningful due to the specifics of the evaluation methodology.
14 Table 2 does not include the relation compound, which according to the guidelines is not used for Slovenian. For the relations dislocated, goeswith and reparandum no accuracy figures are available (n/a), as they do not occur in the test set. The accuracy of derived relations, i.e. subtypes (e.g. flat:name, flat:foreign), is reported jointly with the base label (e.g. flat).
15 Due to the complex interplay of morphological, syntactic and semantic distinguishing features of direct and indirect objects, the current UD guidelines recommend that in sentences with only one overt object, that object is labelled as a direct object (obj) regardless of its case or semantic role. This means that even objects in the dative, which typically act as indirect objects, can be labelled as direct objects in the absence of other objects.

Relation    Original description                   Slovene term                               LAS F1
acl         clausal modifier of noun               stavčni prilastki                           81.73
advcl       adverbial clause modifier              prislovni odvisniki                         75.86
advmod      adverbial modifier                     prislovna določila (see note 16)            89.95
amod        adjectival modifier                    pridevniški prilastki                       98.90
appos       appositional modifier                  pristavčna določila                         63.40
aux         auxiliary verb                         pomožni glagoli                             98.93
case        case marking preposition               predlogi                                    99.17
cc          coordinating conjunction               priredni vezniki                            96.27
ccomp       clausal complement                     stavčna dopolnila (predmetni odvisniki)     90.67
conj        conjunct                               priredno zloženi elementi                   85.91
cop         copula verb                            vezni glagoli                               95.43
csubj       clausal subject                        osebkovi odvisniki                          85.53
dep         unspecified dependency                 nedoločena povezava                         54.55
det         determiner                             določilniki                                 98.79
discourse   discourse element                      diskurzni členki                            69.23
dislocated  dislocated element                     dislocirani elementi                          n/a
expl        expletive                              ekspletivne besede                          96.71
fixed       fixed multi-word expression            funkcijske zveze                            93.33
flat        flat multi-word expression             eksocentrične zveze                         92.12
goeswith    disjointed token                       razdruženi deli besed                         n/a
iobj        indirect object                        nepremi predmeti                            81.66
list        list                                   seznami                                     75.86
mark        marker (subordinating conjunction)     podredni vezniki                            98.69
nmod        nominal modifier                       samostalniški prilastki                     87.44
nsubj       nominal subject                        samostalniški osebki                        95.28
nummod      numeric modifier                       številčna določila                          94.23
obj         (direct) object                        premi predmeti                              95.53
obl         oblique nominal (adjunct)              odvisne samostalniške zveze                 91.14
orphan      dependent of missing parent            elementi v eliptičnih strukturah            68.24
parataxis   parataxis                              stavčna soredja                             70.35
punct       punctuation symbol                     ločila                                      93.08
reparandum  overridden disfluency                  samopopravljanja                              n/a
root        root element                           koren povedi                                96.26
vocative    vocative                               ogovori                                      0.00
xcomp       open clausal complement                odprta stavčna dopolnila                    92.87
All relations                                                                                  93.21

Table 2: Accuracy of the new CLASSLA-Stanza model for UD dependency parsing according to the LAS metric.

The most frequent error in head assignment involves the relation punct, which attaches punctuation. These are mostly cases where the heads of other structures in the sentence, to which punctuation is normally attached, are themselves determined incorrectly. Incorrectly attached punctuation is thus mainly a consequence of parsing errors in its superordinate structures, as in the example in Figure 2, in which the parser wrongly interprets the final clause as a coordination of the preceding subordinate clause, and the (incorrectly attached) comma follows suit.

The second frequent group involves the so-called emphatic particles and adverbs, such as tudi, še, le, že etc., which we assign the relation advmod16 and whose placement in Slovenian is relatively free: they can modify the predicate as well as individual clause elements, which can often only be inferred from context or from prosodic emphasis when reading aloud. As the example in Figure 3 shows, the parser often attaches these words to the predicate of the clause instead of to the emphasised noun. This is not surprising, given that this is one of the categories on which the annotators disagreed most often; it was also annotated inconsistently in the original corpus, in which, upon conversion, these tokens were always attached to the predicate regardless of their role.

For the remaining three analysed relations with a frequently misassigned head, i.e. nmod, conj and acl, a similar error occurs: the parser reliably recognises the type of the superordinate structure (e.g. a noun phrase, an adjective phrase or a predicate), but instead of the right structure it selects the nearest suitable phrase to the left as the head, which is not always correct, since the true source of the relation sometimes occurs earlier in the sentence (Figure 4).

16 The relation advmod is used for adverbs functioning as modifiers, which includes both adverbs modifying predicates (what the Slovenian reference grammar calls prislovna določila, e.g. pridem takoj 'I am coming right away') and adverbs modifying adjectival, adverbial or nominal phrases (adverbial attributes, e.g. izjemno prilagodljiv 'extremely adaptable').
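The LAS F1 figures reported in Table 2 follow the definition given in Section 7. A minimal sketch of the computation on gold-aligned tokens, as a simplified stand-in for the official CoNLL 2018 evaluation script rather than the script itself; the input format and relation names below are illustrative:

```python
from collections import Counter

def las_report(gold, pred):
    """Overall LAS and a per-relation F1 breakdown.

    gold, pred: lists of (head, deprel) per token, aligned 1:1
    (i.e. gold tokenisation is assumed, as in this paper's setup).
    A token is correct only if both head and relation label match.
    """
    assert len(gold) == len(pred)
    correct = Counter()                    # both head and label right
    gold_n = Counter(d for _, d in gold)   # gold label counts
    pred_n = Counter(d for _, d in pred)   # predicted label counts
    hits = 0
    for (gh, gd), (ph, pd) in zip(gold, pred):
        if gh == ph and gd == pd:
            hits += 1
            correct[gd] += 1
    overall = 100.0 * hits / len(gold)
    per_rel = {}
    for rel in gold_n | pred_n:
        p = correct[rel] / pred_n[rel] if pred_n[rel] else 0.0
        r = correct[rel] / gold_n[rel] if gold_n[rel] else 0.0
        per_rel[rel] = 100.0 * 2 * p * r / (p + r) if p + r else 0.0
    return overall, per_rel
```

With predicted tokenisation the official script additionally aligns system and gold tokens before scoring; that step is omitted here because the evaluation above uses gold segmentation.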
[Figure 2 (dependency trees omitted): Example of a divergence between the manually (top) and automatically (bottom) assigned head of the punct relation. Example sentence: "Ta svet je zelo podoben Zemlji, na kateri živijo ljudje, vendar z nekaj izjemami."]

[Figure 3 (dependency trees omitted): Example of emphatic particles incorrectly parsed as adverbial modifiers of the predicate (advmod, bottom) rather than of the emphasised element (advmod, top). Example: "ima pa tudi ambicije sodelovati za kreacijo oblek".]

[Figure 4 (dependency trees omitted): Example of a divergence between the manually (top) and automatically (bottom) identified head of a prepositional phrase functioning as a postnominal modifier (nmod). Example: "na turnirju mladih judoistov za pokal Ptuja".]

Error type                      Errors
Incorrect head                     914
  punct-punct                      248
  advmod-advmod                    166
  nmod-nmod                        111
  conj-conj                         99
  acl-acl                           53
Incorrect head and label           517
  obl-nmod                         141
  parataxis-root                    37
  acl-advcl                         22
  root-nsubj                        22
  nsubj-nmod                        19
Incorrect label                    299
  conj-parataxis                    23
  obl-nsubj                         19
  appos-conj                        17
  obl-obj                           13
  iobj-obj                          13
All errors                        1730

Table 3: Distribution of the parsing model's errors by error type.

8.2. Incorrect prediction of both head and relation
Next in frequency are errors where the model was wrong both about the head token and about their dependency relation (29.9 %). The most prominent among them is the confusion of structures labelled obl17 and nmod, which is the third most frequent error (sub)type overall. An analysis of examples shows that these are mostly cases in which a prepositional phrase acting as an adverbial modifier of the predicate (obl) stands immediately after some noun phrase, and the model misinterprets the adverbial as its postnominal modifier, for which the relation nmod is used, as in the example in Figure 5.

Less frequent in this category are errors in determining the main clause in a sequence of two or more paratactically joined clauses, especially with parenthetical clauses or direct speech (parataxis-root); errors in distinguishing adverbial clauses from adnominal clauses, often in combination with the conjunction kot (acl-advcl); confusion of the subject and the predicative complement in structures with the copula biti (root-nsubj); and errors in identifying the subject in sentences where the subject is not overtly expressed (nsubj-nmod).

[Figure 5 (dependency trees omitted): Example of prepositional adverbial modifiers (obl, top) incorrectly parsed as postnominal modifiers (nmod, bottom). Example sentence: "Ta v primeru potrebe po svoji presoji napoti bolnika k specialistu".]

8.3. Incorrect relation prediction
The smallest of the three error categories comprises cases where the parser attached the token to the correct head but assigned the relation a wrong label (17.3 %). Compared to the first two categories, the types here are spread more evenly across relations.

Confusion of the labels conj and parataxis18 occurs mainly in longer sentences in which other structures (e.g. subordinate clauses) intervene between two coordinated clauses, or between the coordinating conjunction and the second conjunct. Nominal adverbial modifiers (which receive the relation obl) are mislabelled as subjects (nsubj) mainly in combination with verbs such as imenovati, praviti etc., where they appear in the nominative (e.g. pravimo jim mikroznaki 'we call them micro-signs').

Among the other types of mislabelled relations, a frequent one is the ambiguity between noun phrases functioning as appositional modifiers (appos) on the one hand and coordinated elements (conj) on the other, especially when the last element of an asyndetic coordination stands at the end of the sentence. There are also errors in distinguishing adverbial modifiers from objects, mainly for noun phrases expressing the temporal or spatial frame of an event (obl-obj), and incorrect identification of direct (obj) versus indirect objects (iobj).

17 The relation obl is used for noun phrases and prepositional phrases functioning as non-core arguments of the predicate. It is also used for non-verbal structures with comparative conjunctions.
18 The relation parataxis is used for clausal juxtapositions of various kinds, i.e. relations between a word (usually the head of the main clause) and other elements that are not in coordination, subordination or any other core grammatical relation with it.

9. Conclusion
In this paper we presented the extension of the SSJ-UD treebank, the reference manually parsed corpus of Slovenian following the cross-lingually harmonised Universal Dependencies scheme, in the course of which, after a light revision and exhaustive documentation of the Slovenian annotation guidelines, we extended the corpus with new sentences and then trained a new parsing model for Slovenian texts on the new training set. A detailed quantitative and qualitative analysis of its accuracy showed that the model achieves relatively good results overall, although for some structures substantially higher reliability of the results can be expected than for others.

Given the international relevance of the UD scheme, the results represent an important contribution to the further development of language technologies for Slovenian both nationally and internationally, since, given the open access and standardised distribution of UD treebanks, the new Slovenian data can be expected to be integrated soon into many other parsing tools and applications built on them (e.g. Nguyen et al. (2021)). Besides models for dependency parsing such as the one presented in this paper, the almost doubled amount of Slovenian training data is also invaluable for the further development of models for lemmatisation and morphological tagging under the UD scheme, which internationally are mostly based solely on the officially released UD treebanks, such as SSJ-UD, rather than on resources developed or distributed in a local context, such as the full ssj500k corpus or the emerging SUK training corpus.

Although the UD scheme was originally established primarily for the needs of language-technology research, numerous influential comparative-linguistic studies also demonstrate its relevance to linguistics, including Slovenian studies, where the methodological potential of syntactically parsed corpora has not yet been fully exploited (Ledinek, 2018). We believe that the exhaustively documented guidelines, the extensive manually annotated corpus, and the systematic evaluation of the accuracy of the model trained on it represent an important contribution to further linguistic research on manually and automatically parsed Slovenian corpora, for which, given the complex structure of such corpora, appropriate analysis infrastructure also needs to be established.

Of course, from the perspective of both language-technology and linguistic use, it makes sense to keep upgrading the presented results in the future, which includes both improving the underlying guidelines and implementing them consistently. Given the methodological differences in the creation of the individual parts of the SSJ-UD corpus presented in this paper and the inconsistencies detected during the qualitative error analysis, consolidating the existing corpus is certainly just as sensible as further enlarging it.

10. Acknowledgements
The work presented was supported by the project Development of Slovene in a Digital Environment, funded by the Ministry of Culture of the Republic of Slovenia and the European Regional Development Fund, and by the research programme Language Resources and Technologies for Slovene (no. P6-0411), funded by the Slovenian Research Agency from the state budget. Thanks are also due to the annotators of the new data (Tina Munda, Ina Poteko, Rebeka Roblek, Luka Terčon, Karolina Zgaga) and to Tomaž Erjavec, Luka Krsnik, Cyprian Laskowski and Mihael Šinkec for technical support.

11. References
Špela Arhar Holdt. 2007. Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo, 52(2).
Janez Brank. 2022. Q-CAT corpus annotation tool 1.3. Slovenian language resource repository CLARIN.SI.
Xinying Chen and Kim Gerdes. 2018. How do Universal Dependencies distinguish language groups. Quantitative Analysis of Dependency Structures, 72:277–294.
Marie-Catherine De Marneffe, Christopher D. Manning, Joakim Nivre and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255–308.
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek. 2016. Pretvorba korpusa ssj500k v univerzalno odvisnostno drevesnico za slovenščino. In: Proceedings of the Conference on Language Technologies and Digital Humanities.
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek. 2017. The Universal Dependencies Treebank for Slovenian. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, BSNLP@EACL 2017, pp. 33–38.
Kaja Dobrovoljc, Tomaž Erjavec and Nikola Ljubešić. 2019. Improving UD processing via satellite resources for morphology. In: Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pp. 24–34, Paris, France, August. Association for Computational Linguistics.
Kaja Dobrovoljc and Nikola Ljubešić. 2022. Extending the SSJ Universal Dependencies treebank for Slovenian: Was it worth it? In: Proceedings of the 16th Linguistic Annotation Workshop (LAW 2022), June.
Kaja Dobrovoljc and Joakim Nivre. 2016. The Universal Dependencies treebank of spoken Slovenian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 1566–1573, Portorož, Slovenia, May. European Language Resources Association (ELRA).
Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank and Chris Biemann. 2016. A web-based tool for the integrated annotation of semantic and syntactic structures. In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pp. 76–84, Osaka, Japan, December. The COLING 2016 Organizing Committee.
Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).
Richard Futrell, Kyle Mahowald and Edward Gibson. 2015. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.
Matías Guzmán Naranjo and Laura Becker. 2018. Quantitative word order typology with UD. In: Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), December 13–14, 2018, Oslo University, Norway, no. 155, pp. 91–104. Linköping University Electronic Press.
Nancy Ide and James Pustejovsky. 2017. Handbook of Linguistic Annotation, volume 1. Springer.
Dan Jurafsky and James H. Martin. 2021. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd Edition Draft. Prentice Hall series in artificial intelligence. Prentice Hall, Pearson Education International.
Simon Krek, Tomaž Erjavec, Kaja Dobrovoljc, Polona Gantar, Špela Arhar Holdt, Jaka Čibej and Janez Brank. 2020. The ssj500k training corpus for Slovene language processing. In: Proceedings of the Conference on Language Technologies and Digital Humanities, pp. 24–33, Ljubljana, Slovenia, September. Institute of Contemporary History.
Nina Ledinek. 2018. Skladenjska analiza slovenščine in slovenski jezikoslovno označeni korpusi. Jezik in slovstvo, 63(2/3).
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 29–34, Florence, Italy, August. Association for Computational Linguistics.
Nikola Ljubešić and Tomaž Erjavec. 2018. Word embeddings CLARIN.SI-embed.sl 1.0. Slovenian language resource repository CLARIN.SI.
Federico Martelli, Roberto Navigli, Simon Krek, Carole Tiberius, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael-J. Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej and Tina Munda. 2021. Designing the ELEXIS parallel sense-annotated dataset in 10 European languages. In: eLex 2021 Proceedings. Lexical Computing CZ.
Federico Martelli, Roberto Navigli, Simon Krek, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Carole Tiberius, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej and Tina Munda. 2022. Parallel sense-annotated corpus ELEXIS-WSD 1.0. Slovenian language resource repository CLARIN.SI.
Minh Van Nguyen, Viet Lai, Amir Pouran Ben Veyseh and Thien Huu Nguyen. 2021. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4034–4043, Marseille, France, May. European Language Resources Association.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.
Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Francisco Guzmán. 2021. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1351–1361, Online, April. Association for Computational Linguistics.
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1–21, Brussels, Belgium, October. Association for Computational Linguistics.
Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg et al. 2022. Universal Dependencies 2.10. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Primerjava načinov razcepljanja besed v strojnem prevajanju slovenščina–angleščina
A Comparison of Word Splitting Methods for Slovene–English Machine Translation

Gregor Donaj, Mirjam Sepesy Maučec
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru
Koroška cesta 46, 2000 Maribor
gregor.donaj@um.si; mirjam.sepesy@um.si

Abstract (translated from the Slovenian povzetek)
With today's GPU technology, neural machine translation systems are limited in vocabulary size, which reduces translation quality. The use of subword units solves the problems of vocabulary size and language coverage. With further technological development, however, the limited vocabulary and the use of subword units are losing significance. In this paper we present various methods for splitting words into subword units with vocabularies of different sizes and compare their use in a machine translation system for the Slovene–English language pair. The comparison also includes a system without word splitting. We present results on translation quality, training and translation speed, and model sizes.

Abstract
Given today's technology for graphical processing units, neural machine translation systems can use only a limited vocabulary, negatively affecting translation quality.
The use of subword units can alleviate the problems of vocabulary size and language coverage. However, with further technological development, the limited vocabulary and the use of subword units are losing significance. This paper presents different word splitting methods with different final vocabulary sizes. We apply these methods to the machine translation task for the Slovene-English language pair and compare them in terms of translation quality, training and translation speed, and model size. We also include a comparison with word-based translation models.

1. Introduction
Machine translation has truly flourished in the last decade, thanks above all to ever larger collections of bilingual corpora and to the availability of ever greater computing power, which enables the training of complex neural networks.

The most intensively researched machine translation approaches today are based on neural networks (Stahlberg, 2020). Three basic architectures have become established: recurrent neural networks (RNN), convolutional neural networks (CNN) and self-attention architectures.

The use of neural networks, however, also brings technical challenges. Due to the computational complexity, the use of graphical processing units (GPUs) is a practical necessity. These have a limited amount of working memory, which means we cannot use arbitrarily large neural networks. The size of a neural network in machine translation depends on the chosen architecture, on the hyperparameter settings of the network, and on the vocabulary size. A limited vocabulary, in turn, means poor coverage of the language's lexicon and, consequently, additional translation errors. The languages between which we translate also pose different challenges and have certain specific properties.

In this paper we test different data-driven methods for splitting words, with which we reduce the vocabulary size. We chose methods that are well known and established, but that are based on quite different optimisation criteria. We apply these methods to the case of machine translation between English and Slovene. We present results in terms of translation quality, training and translation speed, the size of the produced models, and their GPU memory consumption. We also compare all the methods with a word-based model without splitting.

2. Vocabulary units in machine translation systems
The most intuitive choice of vocabulary unit for a translation system is the word, which is also the most commonly chosen basic unit in other language-technology procedures. It brings numerous challenges, however. Sufficient coverage of a language's lexicon requires large vocabularies, a problem that is particularly pronounced for highly inflected languages. Vocabularies that are too small, in turn, result in a high share of unknown (out-of-vocabulary) words, which severely reduces translation quality.

Various alternative vocabulary units have been proposed to manage these problems. The smallest vocabulary unit used has been the letter or character, which has proved to be very robust and less sensitive to noise and to domain differences between the training and test corpora (Heigold et al., 2018; Gupta et al., 2019). Certain adjustments to the neural network architecture are needed, however, since a segment is several times longer than one that uses words as vocabulary units. The consequence is poorer modelling of long-distance dependencies in texts.

Vocabulary units between the character and the word in size have also been tested. Subword units obtained by data-driven splitting, which keeps frequent character sequences as units, have generally proved the most effective, as they largely preserve syntactic and semantic properties (Sennrich et al., 2016; Banerjee and Bhattacharyya, 2018). Since a word can be split in several different ways, a method for splitting regularisation has also been proposed (Kudo, 2018).
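The link between vocabulary size and the share of out-of-vocabulary words mentioned above can be illustrated with a short sketch; the toy corpora and the frequency cut-off below are invented for illustration only.

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, vocab_size):
    """Keep the vocab_size most frequent training words and
    measure the share of test tokens outside that vocabulary."""
    freq = Counter(train_tokens)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    oov = sum(1 for t in test_tokens if t not in vocab)
    return oov / len(test_tokens)
```

For a highly inflected language, each inflected form counts as a separate word type, so the same cut-off leaves a larger share of test tokens uncovered than it would for a morphologically simpler language.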
As a vocabulary unit one could also use the morpheme, a linguistic unit, but this would require grammatical knowledge. We present the results in terms of translation quality, training and translation speed, and the size of the produced models.

2.1. The Byte-Pair Encoding Procedure

BPE (Byte-Pair Encoding) is originally a data compression procedure that works by iteratively replacing the most frequent pair of symbols with a new symbol. Sennrich et al. (2016) adapted this algorithm for word segmentation.

The procedure first initialises a vocabulary containing all letters and other characters (digits, punctuation) that occur in the corpus, together with an end-of-word symbol. The contents of the corpus are treated as a sequence of symbols, which in the first step are just letters and other characters. An iterative procedure follows, in which the most frequent pair of consecutive symbols is found and replaced by a new symbol. These iterative steps are called merges. There are, however, no merges that would include the end-of-word symbol, which prevents words in the final corpus from being merged instead of segmented.

The parameter of the procedure is the number of merges, which directly affects the size of the final vocabulary. The exact size of the final vocabulary equals the sum of the number of merges and the number of characters in the initial vocabulary.

When the model is applied, every word in the corpus is segmented into units from the BPE vocabulary. Since this vocabulary also contains the individual letters, the share of out-of-vocabulary (sub)word units after segmentation is almost guaranteed to be 0. Exceptions are very rare and can occur when the test text contains a letter or character that does not appear in the training corpus.

The authors of (Sennrich et al., 2016) presented an implementation of this algorithm and proposed the option of joint segmentation training (Joint BPE), where the text of both sides of a parallel corpus is used as training material for the segmentation model. This produces one model and two vocabularies, separate for each language of the pair. The number-of-merges setting then corresponds to the total number of merged symbols for both languages. It is not necessary, however, that all merged symbols appear in both languages. Consequently, the two vocabularies are in this case typically smaller than the number of merges.

2.2. Morfessor

The Morfessor program (Creutz and Lagus, 2002) was developed with the aim of segmenting words of complex languages into subword units that roughly correspond to morphemes – the smallest meaning-bearing units of a word. The goal was a data-driven procedure that works for several languages without additional grammatical knowledge, building a vocabulary of language units that is smaller and more general than a vocabulary of words.

The algorithm's working assumption is that words are composed of a sequence of several segments, as is typical of agglutinative languages. Two algorithms were developed: the first is based on the minimum description length principle, the second on the maximum likelihood principle. We used the first.

The goal of the algorithm is to find a vocabulary of subword units that optimises a cost function consisting of two parts: the cost of the source text T and the cost of the vocabulary V. The cost is given by

    C = Cost(T) + Cost(V) = − Σ_{m_i ∈ text} log p(m_i) + Σ_{m_j ∈ vocabulary} k · l(m_j),    (1)

where m are the subword units, l(m_j) is the length of the subword unit m_j (in characters), and k is the number of bits needed to represent one character, which in practice can be set to 5. The probability of an individual subword unit in the text, p(m_i), is computed by maximum likelihood estimation as the ratio between the absolute frequency of that unit and the total number of units in the text.

In our work we used the newer implementation of the program, Morfessor 2.0 (Virpioja et al., 2013). The search algorithm in this implementation finds a set of subword units that optimises the cost function; we can either manually choose the weights of the two components of the cost function or specify the desired vocabulary size.

2.3. The Unigram Model

The last method we examined is word segmentation based on a unigram model (Kudo, 2018). In the unigram model, the probability of a sequence of subword units x is modelled as the product of the probabilities of the individual units of the sequence:

    P(x) = Π_{i=1..M} p(x_i),    (2)

where M is the length of the sequence and p(x_i) is the probability of the i-th unit. All subword units belong to a fixed vocabulary, and the probabilities of all units must sum to 1.

The most probable segmentation x* of the words of an input text X is the one for which

    x* = arg max_{x ∈ S(X)} P(x),    (3)

where S(X) is the set of all possible segmentations of the words of the text X.

The probabilities of the individual subword-unit unigrams can be estimated with the EM (Expectation Maximization) algorithm, while the optimal word segmentation is found with the Viterbi algorithm (Kudo, 2018). An example implementation of the described procedure is the SentencePiece tool (Kudo and Richardson, 2018), which also implements other procedures, including BPE. In this tool we can likewise start from the desired size of the final vocabulary.

2.4. Selected Methods and Tools

For our experiments we selected four word segmentation methods:

• Joint BPE – the BPE procedure with joint training on the parallel corpus, using Rico Sennrich's implementation, called Subword NMT.

• Morfessor – the minimum-description-length procedure, where the weights in the cost function are adjusted according to the desired vocabulary size, using the Morfessor 2.0 implementation.

• SentencePiece – BPE – the BPE procedure with separate training, as implemented in the SentencePiece tool.

• SentencePiece – Unigram – the procedure based on unigram language models, as implemented in the SentencePiece tool.
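The merge loop of the BPE procedure can be sketched in a few lines of Python. This is a toy sketch, not the Subword NMT implementation; following the description above, merges never cross the end-of-word symbol `</w>`:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly replace the most frequent
    adjacent symbol pair with a new, merged symbol."""
    # each word starts as a sequence of characters plus an end-of-word mark
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in corpus.items():
            for a, b in zip(seq, seq[1:]):
                if "</w>" not in (a, b):      # no merges with the end-of-word symbol
                    pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for seq, freq in corpus.items():      # rewrite the corpus with the new symbol
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

On this toy corpus the first two merges join "l"+"o" and then "lo"+"w", so the frequent stem "low" becomes a single vocabulary unit.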
Setting  JBPE (sl)  Morf (sl)  SP-BPE (sl)  SP-Uni (sl)  JBPE (en)  Morf (en)  SP-BPE (en)  SP-Uni (en)
10k       11,384    18,670     17,064      17,814       11,556    18,405     16,909      17,358
15k       16,273    28,251     25,525      26,716       15,739    27,822     25,455      26,177
20k       21,101    37,934     33,561      35,534       19,631    37,664     33,595      34,779
25k       25,883    46,879     41,297      44,175       23,299    46,298     41,395      43,051
30k       30,625    55,438     48,822      52,717       26,760    55,994     48,946      51,204
40k       39,890    73,766     63,478      69,530       33,593    73,960     63,132      66,839
50k       49,063    93,726     77,520      86,111       40,115    90,248     76,515      82,082
60k       58,155   109,989     91,015     102,242       46,404   105,558     89,312      96,924
80k       76,152   143,018    117,134     133,892       58,788   133,679    113,496     125,572
100k      93,938   174,026    142,294     164,877       71,043   159,419    136,198     153,190
120k     111,646   205,658    166,442     195,155       82,972   182,895    157,987     180,256
150k     138,006   238,620    201,334     239,515      101,013   210,425    188,859     218,140

Table 1: Sizes of the produced Slovenian (sl) and English (en) vocabularies (JBPE = Joint BPE, Morf = Morfessor, SP-BPE = SentencePiece – BPE, SP-Uni = SentencePiece – Unigram).

Joint BPE:    države članice bodo pregle-dale sezna-me in od izdaja-te-ljice ...
Morfessor:    držav-e članic-e bodo pregled-a-le seznam-e in od izdajatelj-ice ...
SP – BPE:     države članice bodo pregleda-le sezna-me in od izdaja-telji-ce ...
SP – Unigram: države članice bodo pregleda-le seznam-e in od izdajatelj-ice ...

Figure 1: Example of a text segment from the test set with segmented words.

3. Experimental System

3.1. Corpora

The experiments were carried out on the freely available parallel corpus ParaCrawl (Bañón et al., 2020). The corpus was built by web crawling and automatic alignment. For the English–Slovenian language pair it contains approximately 3.7 million aligned segments, which amounts to 65.5 million words on the English side and 60.9 million words on the Slovenian side.

We split the corpus into three parts: a training corpus, a development corpus and a test corpus. The development corpus is used for validation during the training of the machine translation system, and the test corpus for final testing and evaluation of the results. For each of these two corpora we selected 2,000 random text segments from the source corpus; the remainder was used as the training corpus.

On all parts of the corpus we performed standard preprocessing for machine translation: cleaning, punctuation normalisation, tokenisation and truecasing.¹ The training corpus was also used to train the truecasing model. The final sizes of all preprocessed corpora are given in Table 2.

¹ Determining the correct use of upper- and lower-case letters: the sentence-initial word of each sentence is converted into its most likely lower- or upper-case form, which reduces data sparsity.

Corpus       Number of segments
Training     3,714,473
Development  1,987
Test         1,990
Total        3,718,450

Table 2: Number of text segments in the training, development and test corpora.

3.2. Word Segmentation

For word segmentation we used the tools described in the previous section. The training part of the corpus was used to train the segmentation models, and the trained models were then used to segment all parts of the corpus. This gave us the segmented versions of the corpora.

Since we wanted a range of final vocabulary sizes, we varied the corresponding parameters of the segmentation training tools. The tools, however, use these parameters in different ways, which means that the final vocabulary sizes do not correspond exactly to the parameter values that were set. The target values we set were: 10,000, 15,000, 20,000, 25,000, 30,000, 40,000, 50,000, 60,000, 80,000, 100,000, 120,000 and 150,000. Table 1 shows the exact vocabulary sizes obtained on the Slovenian and English training sets with these settings.

Figure 1 shows an example segment in which the words were segmented with all four procedures. We used a target vocabulary size of 20,000, since at this size word splits are more frequent and more differences can be shown in a single segment. In the figure, the split points are indicated with hyphens.

In the models without word segmentation we used the vocabulary sizes 60,000, 80,000, 100,000, 125,000, 150,000, 200,000, 250,000 and 300,000.

In the next step we built vocabularies for all versions of the segmented training corpora, as well as for the unsegmented word-level training corpus. While in the segmented corpora the vocabularies cover the entire corpus, out-of-vocabulary words appear in the word-level corpus. Table 3 shows the percentages of out-of-vocabulary (OOV) words on the test part of the corpus for both languages. As expected, the percentages are higher on the Slovenian side and fall as the vocabulary grows.

Vocabulary  OOV (en) [%]  OOV (sl) [%]
60k         2.57          6.66
80k         2.07          5.38
100k        1.77          4.44
125k        1.50          3.74
150k        1.30          3.22
200k        1.08          2.53
250k        0.95          2.11
300k        0.85          1.82

Table 3: Percentage of out-of-vocabulary words for word-level vocabularies on the English (en) and Slovenian (sl) test corpora.

[Figure 2 (two panels: English–Slovenian, Slovenian–English; x-axis: vocabulary size, log scale; y-axis: BLEU; curves: Words, Joint BPE, Morfessor, SentencePiece – BPE, SentencePiece – Unigram).]
Figure 2: Translation quality results for all models.

3.3. The Translation System

The translation model is in all cases a neural machine translation system based on the RNN architecture, with a hidden-state dimension of 1024 and an embedding dimension of 512 (the default settings of the Marian NMT tool). Our previous experience on this training set shows that no substantial improvements are achieved with the transformer architecture and self-attention. During training we limited segment lengths to 80 tokens (words and punctuation, or subword units and punctuation), which means that 99.7% of all segments in the training set are covered without segmentation; in the models that use segmentation, between 96.3% and 99.5% of all segments are covered. We did not increase the segment-length limit further, since given this coverage we do not expect any significant further changes in the results.

Training was run for 10 epochs, with the results checked on the development set every 100 model parameter updates. The model that performed best on the development set was then used to evaluate the results on the test set.

For translation we used mini-batches of size 64, while during training a flexible mini-batch size was used, adapted to the working memory of the GPU on which the training was run.

3.4. Tools Used

For preprocessing (cleaning, normalisation, tokenisation and truecasing) and postprocessing (detruecasing and detokenisation) we used scripts from the MOSES package (Koehn et al., 2007). For training the translation systems and for translation we used the Marian NMT tool (Junczys-Dowmunt et al., 2018), which we ran on Nvidia Tesla V100 graphics processing units. For evaluating the results with the BLEU metric we used the SacreBLEU tool (Post, 2018), which performs retokenisation as part of the evaluation and evaluates tokenised texts. The tools for segmenting words into subword units are described in Section 2.

4. Results and Discussion

Since the basic purpose of using subword units was to reduce the vocabulary size and thereby make neural machine translation feasible, we first show an example of results at typical vocabulary sizes. For the word-level vocabulary we chose a size of 60,000 words, which is a commonly used vocabulary size in natural language processing. In Table 4 we compare the translation results of the word-level model and of the Joint BPE model with the same vocabulary size.
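The evaluation uses BLEU (via SacreBLEU) and the character n-gram F-score ChrF (β = 3). A minimal single-order ChrF sketch follows; the full metric of Popović (2015) averages several n-gram orders, and treating whitespace by simply dropping it is a simplification here:

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams (spaces dropped as a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, n=3, beta=3.0):
    """Single-order character n-gram F_beta score; beta = 3 weights
    recall three times as heavily as precision."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())       # clipped n-gram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

score = chrf("the cat sat", "the cat sat")    # identical strings score 1.0
```

Because β = 3 favours recall, omitting reference material is penalised more than adding extra material, which is one reason the metric is considered suitable for morphologically rich target languages.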
Here we can already see the improvement in translation quality from using subword units, as is typically also found in the existing literature, e.g. in (Sennrich et al., 2016). At this point we additionally report the evaluation results obtained with the ChrF metric (β = 3) (Popović, 2015). Although this metric is gaining ground for evaluating translation into morphologically complex languages, we present the remaining results only with the BLEU metric, which is still well established and suffices for the mutual comparison of our models.

Metric        Word-level  Joint BPE
BLEU (en–sl)  38.50       42.87
BLEU (sl–en)  41.62       45.87
ChrF (en–sl)  58.43       63.13
ChrF (sl–en)  60.68       65.76

Table 4: Results of the word-level model and of the Joint BPE model with a vocabulary size of 60,000.

Figure 2 shows the translation quality results as a function of vocabulary size for all systems. In the figures, the vocabulary sizes are shown on a logarithmic scale.

In general we can observe that quality increases as the vocabularies grow, although individual systems deviate from this trend, e.g. the Slovenian-to-English system with a vocabulary of 120,000 words. We also observe that the quality of the word-level models increases faster and, at the largest vocabularies, comes quite close to the quality of the systems with segmentation.

When we compare the systems that use segmentation with each other, the differences are small. Nevertheless, we can observe that for translation from English to Slovenian the best results are mostly obtained with SentencePiece using unigram-based segmentation (SentencePiece – Unigram), and in the opposite direction with Subword NMT using joint BPE training (Joint BPE).

[Figure 3 (two panels: English–Slovenian, Slovenian–English; x-axis: vocabulary size, log scale; y-axis: training speed [segments/sec]; curves: Words, Joint BPE, Morfessor, SentencePiece – BPE, SentencePiece – Unigram).]
Figure 3: Training speed of the translation system for all models.

[Figure 4 (two panels: English–Slovenian, Slovenian–English; x-axis: vocabulary size, log scale; y-axis: translation speed [segments/sec]; same curves as Figure 3).]
Figure 4: Translation speed for all models.

Figure 3 shows the training speed of the translation model, and Figure 4 the translation speed when the model is used. In all cases we measure speed as the number of processed text segments per second, since the number of tokens differs between the systems because of the different segmentations. The number of words per second during training can be obtained by taking into account that the average number of word tokens (words and punctuation) per segment is 18.7 in the English text and 20.2 in the Slovenian text.

We can observe that both speeds decrease as the vocabulary grows. The speed of the word-level models is higher than that of the other models, but here too the difference shrinks at larger vocabularies. We also see that the fastest models are those that use Morfessor to segment the corpus. In these models, however, several points deviate strongly from the trends. We assume that these deviations arose from the random initialisation of some parameters during training, from possible variations in the hardware used, from specific properties of the training software, or from the adjustment of the mini-batch size at different vocabulary sizes.

The reported translation speeds do not include the preprocessing and postprocessing of the text.

[Figure 5 (two panels: model size (en–sl) and GPU memory usage; x-axis: vocabulary size, log scale; y-axis: size/usage [MiB]; curves: Words, Joint BPE, Morfessor, SentencePiece – BPE, SentencePiece – Unigram).]
Figure 5: Size of the produced model for all models, and memory usage on the graphics processing unit.

Figure 5 shows the file sizes of all translation models and, for the word-level models, their GPU memory usage. The file sizes grow almost linearly with the vocabulary size (both axes of the graph are on a logarithmic scale). Each model is accompanied by two files containing the two vocabularies, which are, however, substantially smaller. The sizes shown are for the models translating from English to Slovenian; the models in the opposite direction have comparable sizes.

On the right, the memory usage when the models are used for translation is shown. We can see that the tool has a baseline memory usage, which shows itself in the small changes in usage at small vocabularies; at larger vocabularies, memory usage likewise grows linearly. The memory usage shown is for the word-level models and is the same in both translation directions; the memory usage of the other models is comparable for a given vocabulary size. A mini-batch of size 64 was used when measuring memory usage; during training, memory usage can be higher. It should also be noted that the sizes would be different with other model hyperparameter settings.
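SentencePiece – Unigram, which gave the best English-to-Slovenian results above, segments each word by maximising Equation (3). That arg max can be found with simple dynamic programming over split points; the sketch below uses hand-picked unit probabilities, not the SentencePiece implementation:

```python
import math

def viterbi_segment(word, logp):
    """Most probable segmentation of `word` into subword units
    from the vocabulary `logp` (unit -> log-probability)."""
    n = len(word)
    best = [0.0] + [-math.inf] * n        # best log-prob of word[:i]
    back = [0] * (n + 1)                  # split point that achieves best[i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # follow the back-pointers to recover the segmentation
    units, i = [], n
    while i > 0:
        units.append(word[back[i]:i])
        i = back[i]
    return units[::-1]

# toy vocabulary: hypothetical probabilities, frequent stems/endings rank higher
logp = {u: math.log(p) for u, p in
        {"pregleda": 0.05, "pre": 0.02, "gleda": 0.02,
         "le": 0.04, "a": 0.03}.items()}
units = viterbi_segment("pregledale", logp)
```

With these toy probabilities the word "pregledale" splits into "pregleda" + "le", matching the SentencePiece – Unigram row of Figure 1.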
5. Conclusion

In this article we presented and compared some of the most common data-driven word segmentation methods and their use in a neural machine translation system. Our results show that with word segmentation we still achieve better results than with translation systems without segmentation, even when the vocabularies of the latter are enlarged. The trend, however, indicates that word-level systems could catch up with the segmentation-based models if the vocabulary were enlarged further. Given the current trend of development and growth of GPU memory capacities, it will be possible to train and use such models in the future.

The presented results can serve researchers and users as guidance when choosing the vocabulary size for machine translation systems, if they wish to take into account translation quality, translation speed and model size. The latter can be important because of hardware limitations.

For a better understanding of the usefulness of word segmentation in machine translation, further research is needed. In this paper we limited ourselves to fixed values of the model hyperparameters and carried out only data-driven segmentation procedures. In future work, segmentation methods based on grammatical knowledge, or combinations of complementary methods, could also be studied. Enlarging the training set and increasing the model hyperparameters can also contribute substantially to translation quality in word-level models, although the latter also means a larger model and slower operation. Further research can also include a more detailed analysis of the errors that occur with the different segmentation methods.

6. Acknowledgements

This research was carried out within research programme no. P2-0069, co-funded by the Slovenian Research Agency from the state budget. The authors thank the HPC RIVR consortium (www.hpc-rivr.si) for co-funding the research through the use of the HPC MAISTER system at the University of Maribor (www.um.si). They also thank the authors of the ParaCrawl parallel corpus for making it freely available.

7. References

Tamali Banerjee and Pushpak Bhattacharyya. 2018. Meaningless yet meaningful: Morphology grounded subword-level NMT. In: Proceedings of the Second Workshop on Subword/Character Level Models, pp. 55–60.

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4555–4567. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In: Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pp. 21–30, Philadelphia, Pennsylvania.

Rohit Gupta, Laurent Besacier, Marc Dymetman and Matthias Gallé. 2019. Character-based NMT with transformer. arXiv:1911.04997.

Georg Heigold, Stalin Varanasi, Günter Neumann and Josef van Genabith. 2018. How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? In: Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 68–80, Boston, MA. Association for Machine Translation in the Americas.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In: Proceedings of ACL 2018, System Demonstrations, pp. 116–121, Melbourne, Australia. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71, Brussels, Belgium. Association for Computational Linguistics.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66–75, Melbourne, Australia. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the Tenth Workshop on Statistical Machine Translation, pp. 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Brussels, Belgium. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Felix Stahlberg. 2020. Neural machine translation: A review. Journal of Artificial Intelligence Research, 69:343–418.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University.
V: Proceedings of the 56th Annual Meeting PRISPEVKI 46 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Raziskovalna infrastruktura CLARIN.SI Tomaˇ z Erjavec1, Kaja Dobrovoljc3 , 1, Darja Fiˇ ser4 , 3 , 1, Jan Jona Javorˇ sek1, Simon Krek2 , 1, Taja Kuzman1, Cyprian Laskowski2, Nikola Ljubeˇ si´ c1 , 2, Katja Meden1 1 Institut ≫Jožef Stefan≪ tomaz.erjavec@ijs.si, kaja.dobrovoljc@ijs.si, jan.javorsek@ijs.si, simon.krek@ijs.si, taja.kuzman@ijs.si, nikola.ljubesic@ijs.si, katja.meden@ijs.si 2 Center za jezikovne vire in tehnologije Univerze v Ljubljani cyp@cjvt.si 3 Filozofska fakulteta Univerze v Ljubljani darja.fiser@ff.uni-lj.si 4 Inštitut za novejšo zgodovino Povzetek Prispevek povzame storitve slovenske raziskovalne infrastrukture za jezikovne vire in tehnologije CLARIN.SI, ki je članica evropskega konzorcija raziskovalnih infrastruktur CLARIN ERIC. Najprej obravnavamo vodenje, organizacijo in tehnično infrastrukturo CLARIN.SI, nato pa njene spletne storitve, predvsem repozitorij digitalnih jezikovnih virov in orodij ter konkordančnike. Sledi pregled promocije področja jezikovnih tehnologij in digitalne humanistike v Sloveniji, kar vključuje storitve centra znanja za računalniško obdelavo južnoslovanskih jezikov CLASSLA, financiranje projektov in organizacijo, podporo ali sodelovanje na konferencah in delavnicah. Predstavimo tudi sodelovanje CLARIN.SI s CLARIN ERIC in s sorodnima slovenskima infrastrukturama DARIAH-SI in CESSDA/ADP ter vključenost v slovenske in evropske projekte. The CLARIN.SI Research Infrastructure The paper summarises the services offered by the Slovenian research infrastructure for language resources and technologies CLARIN.SI, which is a member of the European research infrastructure consortium CLARIN ERIC. 
We first present the governance, organisation and technical infrastructure of CLARIN.SI, followed by a description of its web applications with a focus on its repository and concordancers. Next comes an overview of support activities that CLARIN.SI offers to the fields of language technologies and digital humanities in Slovenia, which includes services of the knowledge centre for computational processing of South-Slavic languages CLASSLA, financial support of projects, and organisation or support of conferences and workshops. We also introduce the work of CLARIN.SI within CLARIN ERIC, its cooperation with its sister national infrastructures DARIAH-SI and CESSDA/ADP, and involvement in national and European projects. 1. Uvod 2022). Korist od RI imajo raziskovalci, učitelji in Raziskovalna infrastruktura (RI) CLARIN1 študenti slovenskega jezika ter drugih jezikoslovnih ( smeri, računalniškega jezikoslovja in umetne inteli- ≫Common Language Resources and Technology In- frastructure gence, pa tudi drugi raziskovalci s področja humani- ≪ oz. ≫Infrastruktura za skupne jezikovne vire in tehnologije stike in družboslovja, ki pri svojem delu uporabljajo ≪) zagotavlja digitalne jezikovne vire, orodja in storitve za podporo raziskovalcem jezikovna gradiva. RI nudi podporo tudi slovaropi- na področju humanistike in družboslovja in drugih scem, prevajalcem in podjetjem, ki v svoje produkte področij, ki se ukvarjajo z jezikom (Jong et al., 2018). vključujejo obdelavo slovenskega jezika, nenazadnje CLARIN je bila ena od infrastruktur, ki so bile pa tudi laičnim uporabnikom za namene reševanja predvidene že v prvem načrtu Evropskega strateškega praktičnih vprašanj. foruma za raziskovalne infrastrukture ESFRI (Váradi et al., 2008). 
Ustanovljena je bila leta 2012 in je Slovenska RI CLARIN.SI je bila ustanovljena leta bila ena prvih RI, ki je pridobila status evropske 2014, članica CLARIN ERIC pa je postala leta 2015, za pravne osebe konzorcija raziskovalnih infrastruktur kar je bilo potrebno, da je bil ustanovljen nacionalni ERIC (European Research Infrastructure Consor- konzorcij in da je Vlada Republike Slovenije podpi- tium). CLARIN ERIC ima sedež na Nizozemskem in sala memorandum, s katerim se je zavezala plačevati trenutno združuje RI 22 držav članic in 3 opazovalke. članarino za članstvo Slovenije v CLARIN ERIC. Do Zaposluje vodjo in podporno osebje za koordinacijo in sedaj je bila edina publikacija, ki celostno predstavi centralne tehnične storitve, medtem ko imajo glavno CLARIN.SI, objavljena kmalu po njeni ustanovitvi vlogo pri zagotavljanju storitev nacionalni centri RI. (Erjavec et al., 2014), kjer smo predstavili prve korake Glede na pomen slovenskega jezika za Slovenijo je RI in načrte za nadaljnje delo. Pričujoči prispevek sodelovanje v CLARIN ključnega pomena, saj spod- povzema, kaj je bilo narejenega v minulih osmih letih: buja empirično podprto raziskovanje jezika ter ra- v razdelku 2. predstavimo organizacijsko strukturo in zvoj jezikovnih virov in tehnologij, s čimer lahko slo- upravljanje infrastrukture, v 3. repozitorij jezikovnih venščina v informacijski družbi nastopa enakopravno virov in orodij, v 4. spletne storitve, v 5. podporne de-z drugimi jeziki, tudi mnogo večjih skupnosti (Krek, javnosti, v 6. vpetost CLARIN.SI v domače in evrop- ske projekte in v aktivnosti CLARIN ERIC, v 7. pa 1 podamo zaključke in načrte za nadaljnje delo. https://www.clarin.eu/ PRISPEVKI 47 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 2. 
2. The organisation of CLARIN.SI

The infrastructure is based at the Jožef Stefan Institute (IJS), which also hosts most of its computing equipment and ensures the security, maintenance and uninterrupted operation of the RI's web services. Three organisational units of IJS take part in its management and technical maintenance: the Department of Knowledge Technologies (E8), the Artificial Intelligence Laboratory (E3) and the Centre for Network Infrastructure (CMI).

CLARIN.SI is organised as a consortium without the status of a legal entity and currently has 12 member partners. The consortium brings together all the main institutions in Slovenia that develop or use language resources and technologies, namely:

• Universities: the University of Ljubljana, the University of Maribor, the University of Nova Gorica and the University of Primorska. The University of Ljubljana is the seat of the Centre for Language Resources and Technologies (CJVT), which coordinates work in corpus linguistics and language technologies, and develops and maintains the fundamental digital language resources and language-technology tools for contemporary Slovene.

• Research institutes: ZRC SAZU, the Jožef Stefan Institute (IJS), the Institute of Contemporary History (INZ) and the Science and Research Centre Koper. Within ZRC SAZU, the Fran Ramovš Institute of the Slovenian Language collects language material and uses it to produce the fundamental works of Slovenian linguistics, above all dictionaries. IJS, as the host of the CLARIN.SI research infrastructure, coordinates the work of the infrastructure, maintains and upgrades its repository and services, and develops language resources and tools.

• Societies and institutes: the Slovenian Language Technologies Society (SDJT), which promotes the development of language technologies for Slovene through the conference "Language Technologies and Digital Humanities" (JTDH), and the Trojina Institute for Applied Slovene Studies, with its consulting and support activities and its production of language resources and tools.

• The companies Alpineon and Amebis: the former contributes mainly speech technologies to the CLARIN.SI infrastructure, while the latter develops software in the fields of language technologies and electronic publishing.

Decisions on the management of the RI are taken or confirmed by the CLARIN.SI Steering Committee (UO), in which each partner has one representative and any number of deputies. Communication takes place via the Steering Committee mailing list, which currently has 34 members, and once a year we organise a CLARIN.SI UO meeting, at which we discuss the operation of the RI over the past year and make plans for the next one.

The operation of the CLARIN research infrastructure in Slovenia is thus shaped by the needs and consensus of all the major actors in digital linguistics and language technologies, as well as in the digital humanities and social sciences, since CLARIN.SI cooperates closely with its two sister RIs in Slovenia. These are DARIAH-SI, based at the Institute of Contemporary History (INZ), the national node of the European RI for the digital humanities, and CESSDA/ADP at the Social Science Data Archives of the Faculty of Social Sciences, University of Ljubljana (ADP), the national node of CESSDA ("Consortium of European Social Science Data Archives"), the European RI for the digital social sciences. CLARIN.SI is also one of the founding members of the Slovenian national supercomputing network SLING (https://www.sling.si/) and, through it, a member of the EGI federation of computing and data resources (https://www.egi.eu/) and of the Partnership for Advanced Computing in Europe, PRACE (https://prace-ri.eu/).

CLARIN.SI maintains a bilingual (Slovene, English) website (https://www.clarin.si/) that presents the RI and all of its services. The website also provides contact information, e.g. an e-mail address that users can turn to for help or advice. In addition, it includes password-protected internal pages, accessible to UO members and their deputies, which contain the founding documents, meeting minutes, relevant CLARIN ERIC minutes, etc.

For documenting technical maintenance, CLARIN.SI uses an internal installation of the WordPress platform, on which we document the maintenance procedures for all CLARIN.SI web services, while an installation of the Redmine platform is used for tickets to resolve discovered problems.

Critical CLARIN.SI web services are always also installed on a development server, where every change to the software, to the offered language resources or to the documentation is tested first. The operation of the web services is monitored via the NAGIOS system, and the repository is additionally monitored independently by CLARIN ERIC. In the event of errors, the service administrators are thus notified immediately and can start fixing the problem at once.

3. The repository of language resources

The basic service of CLARIN.SI is the maintenance of a repository of language research data, i.e. language resources such as large and richly annotated text collections (corpora), computational lexicons and models, as well as machine-readable dictionaries and computational tools. The repository's software platform is the open-source CLARIN-DSpace (https://github.com/ufal/clarin-dspace), developed specifically for CLARIN repositories within the Czech CLARIN research infrastructure (now CLARIAH, created by the merger of the Czech CLARIN and DARIAH) at the Institute of Formal and Applied Linguistics of Charles University in Prague. Besides Slovenia and, of course, the Czech Republic, the platform is used by seven other national CLARIN repositories, together representing 40% of all full members of CLARIN ERIC.

PRISPEVKI 48 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022
The CLARIN.SI repository is, along with ADP, the only one in Slovenia accredited with the Core Trust Seal certificate (https://www.coretrustseal.org/), i.e. as a trustworthy data repository. In line with the CLARIN ERIC strategy, the repository implements the FAIR principles (findability, accessibility, interoperability and reusability; https://www.go-fair.org/fair-principles/, https://www.clarin.eu/fair). CLARIN has followed the European open science agenda and the FAIR principles avant la lettre (Jong et al., 2018), namely with the following instruments:

• Academic authentication (AAI), which works on the SSO ("single sign-on") model, distinguishing identity providers (Arnes, universities, other academic institutions) from service providers (in our case, the repository). Users therefore do not need to create an account at CLARIN.SI, but instead log in to the repository with the eduGAIN username and password of their chosen identity provider.

• Persistent identifiers for entries based on the handle system, which assigns each repository entry a persistent URL that, like a DOI, is independent of the specific URL of the resource within the repository, and is therefore robust to changes in the repository's platform or location.

• Integration into international metadata aggregators such as OpenAIRE (https://www.openaire.eu/), Re3data (https://www.re3data.org/) and, since 2022, the European Language Grid. Through CLARIN ERIC, CLARIN.SI was also among the first RIs included in the resource and service offer of the European Open Science Cloud, EOSC (https://eosc-portal.eu/), ever since the EOSC portal was established in 2018. Within the CLARIN RI, metadata records follow the CMDI ("Component MetaData Infrastructure") recommendations (https://www.clarin.eu/content/component-metadata), and metadata export and harvesting are also available in the Dublin Core standard.

• A rich choice of licences, from open ones, such as the Creative Commons licences, to more restricted ones that require prior login to the repository and a digital signature of a resource-use agreement.

• Explicit terms of use, which define the rights and obligations of both the repository administrators and the users.

• Depositing guidelines, which describe the submission procedure, with particular emphasis on the required metadata and its form, since at CLARIN.SI we strive to maintain metadata records that are as complete and uniform as possible.

• Guidelines for encoding deposited data, which list the acceptable data formats and annotation schemes and also include general instructions for preparing high-quality, consistent data. In this respect the CLARIN.SI repository differs from most other CLARIN repositories (Lenardič and Fišer, 2022), which typically offer only a list of acceptable formats without more general guidance on preparing quality data, guidance which can be very useful for authors from the humanities who lack the in-depth computing skills needed to prepare data correctly.

• A list of frequently asked questions with answers, and similar content.

Besides being tailored to the description of language resources, an important quality of the CLARIN.SI repository, in contrast to general self-archiving repositories such as Zenodo, is that it ensures the high quality of the deposited language resources and their metadata: before publication, every entry is carefully reviewed by one of the repository editors, who checks whether it meets the CLARIN.SI criteria. If it does not, the editor rejects the entry with an explanation of its shortcomings, and in pre-agreed cases also helps to correct the resource.

In the eight years since the first deposit, the number of deposited language resources and tools has grown to more than 300, the result of the work of over 700 authors, with many resources representing several years of work. In 2021 the repository recorded around 40,000 views and 4,000 downloads. The most frequently downloaded resources that year were a collection of 751 emojis with automatically assigned sentiment, computed from 70,000 tweets in 13 European languages annotated for sentiment by 83 annotators (Kralj Novak et al., 2015), and BERT-type language models (word embeddings; Devlin et al., 2018) for Slovene (Ulčar and Robnik-Šikonja, 2021), which are useful for many Slovene language processing tasks.

By encouraging the depositing of language resources and helping with their preparation and description, CLARIN.SI has contributed substantially to establishing the concept of open, verifiable, repeatable and responsible science in language research in Slovenia, and has saved numerous language resources produced by Slovenian research projects from disappearing, giving them international visibility and impact.
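The persistent handles and Dublin Core harvesting described above can be illustrated with a short sketch. The record below is a hypothetical, simplified Dublin Core entry, not an actual CLARIN.SI record (real records are richer and follow CMDI); the handle shown is that of the Emoji Sentiment Ranking resource cited in this section.

```python
# Sketch: parsing a flat Dublin Core record of the kind exposed for
# metadata harvesting. The record is an illustrative assumption; the
# handle identifier is the PID that stays stable across platform moves.
import xml.etree.ElementTree as ET

record = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Emoji Sentiment Ranking 1.0</dc:title>
  <dc:identifier>http://hdl.handle.net/11356/1048</dc:identifier>
  <dc:language>slv</dc:language>
</metadata>"""

def parse_dc(xml_text):
    """Return a {field: value} dict, stripping the namespace prefix."""
    root = ET.fromstring(xml_text)
    return {el.tag.split("}")[1]: el.text for el in root}

meta = parse_dc(record)
print(meta["identifier"])  # the handle PID of the entry
```

Resolving such a handle (e.g. via hdl.handle.net) always redirects to the entry's current landing page, which is what makes the citation robust to repository migrations.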
4. Web services

In addition to the repository, CLARIN.SI permanently maintains several web services, the most important of which are concordancers, i.e. tools for corpus analysis. CLARIN.SI offers the KonText concordancer and two variants of the noSketch Engine concordancer (Crystal and Bonito). All three use the same back end, Manatee (Rychlý, 2007), which enables fast queries over richly annotated corpora, but they differ in their front ends. NoSketch Engine is the open-source version of the commercial Sketch Engine concordancer (Kilgarriff et al., 2014; https://www.sketchengine.eu/), while KonText was developed at the Czech National Corpus department of Charles University in Prague (Machálek, 2020). Apart from their appearance, the main differences between the concordancers are that noSketch Engine offers somewhat more functionality than KonText (above all the computation of the keywords of a corpus or subcorpus), whereas KonText supports login via the AAI system (like the repository), which in turn enables personalised display settings, saved query history, etc.

All the CLARIN.SI concordancers offer the same set of corpora, of which there are now over 40, from reference to specialised corpora, as well as spoken and multilingual ones.
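What a concordancer computes for a query can be sketched as a minimal keyword-in-context (KWIC) routine. This is only an illustration over a toy token list; Manatee itself answers such queries from compiled corpus indexes rather than by linear scanning.

```python
# Minimal KWIC sketch: for each occurrence of the query token, report
# its left and right context, the basic display of every concordancer.
def kwic(tokens, query, window=3):
    """Return (left context, keyword, right context) triples."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == query.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

toks = "jezikovne tehnologije in digitalna humanistika in jezikovni viri".split()
for left, kw, right in kwic(toks, "in"):
    print(f"{left:>35} | {kw} | {right}")
```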
Here we highlight the new metaFida corpus, which combines 34 existing corpora and contains a total of 4.5 billion tokens, making it the largest and most diverse corpus of Slovene searchable with the concordancers.

The CLARIN.SI concordancers are used in the teaching of study programmes at several universities, in linguistic studies and various research projects, as well as in translation companies.

The next web service offered by CLARIN.SI is the WebAnno platform for the manual annotation of corpora (Yimam et al., 2013), developed within CLARIN-DE. Within CLARIN.SI we developed a conversion from the TEI corpus encoding to the TSV3 format used by WebAnno, as well as the merging of the source TEI corpus with the manual annotations from the TSV file, which makes it possible to add or change existing annotations in TEI-encoded corpora with annotations manually inserted or corrected on the WebAnno platform (Erjavec et al., 2016; https://gitlab.clarin.si/clarinsi/webanno_tei). Our installation and conversion have so far been used in more than 10 projects, e.g. for the manual annotation of normalised word forms, lemmas and morphosyntactic tags of user-generated content in the project Janes "Linguistic Analysis of Non-Standard Slovene" (Fišer et al., 2020; https://nl.ijs.si/janes/), for the annotation of bilingual terms in the project KAS "Slovenian Scientific Texts: Resources and Description" (Erjavec et al., 2021; https://nl.ijs.si/kas/), and for the annotation of term definitions in texts in the project TermFrame "Terminology and Knowledge Frames across Languages" (Vintar and Martinc, 2022).

For controlled and collaborative maintenance, the Git platform has become very popular, and we likewise use it within CLARIN.SI, not only for software but also for language resources. The most widely used web-accessible Git services, which also include many other functions, such as issue tracking and program execution, are GitHub and GitLab. On GitHub, CLARIN.SI has its own virtual organisation (https://github.com/clarinsi), which now brings together around 60 open-source projects. Unlike GitHub, which exists only as a web service owned by Microsoft, the GitLab platform can also be installed locally, which has the advantage that projects are hosted on local computing equipment and that access to projects can be restricted, which is necessary in some cases, e.g. because of copyright over the texts of a language resource under development. The GitLab installation at CLARIN.SI (https://gitlab.clarin.si/) contains around 20 projects, both public (such as the above-mentioned TEI conversion for WebAnno) and private.

Within the CLASSLA knowledge centre, discussed in the next section, CLARIN.SI also offers the ReLDIanno web service for the linguistic annotation of texts in Slovene, Croatian and Serbian (http://clarin.si/services/web/). The service supports morphosyntactic tagging, lemmatisation, named entity recognition and syntactic parsing, and is accessible both through a web interface and through an API; the results can be displayed on screen, or the annotated text can be downloaded to one's own computer.
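The kind of token-level output produced by an annotation pipeline such as ReLDIanno can be sketched on a simplified CoNLL-U-style fragment. The fragment, its column layout and its tags are illustrative assumptions, not actual service output.

```python
# Sketch: reading forms, lemmas and POS tags from a simplified
# CoNLL-U-style annotation (columns: id, form, lemma, pos). Real
# pipeline output carries more columns (features, dependencies, etc.).
CONLLU = """\
1\tjezikovne\tjezikoven\tADJ
2\ttehnologije\ttehnologija\tNOUN
"""

def read_lemmas(conllu_text):
    """Return (form, lemma, pos) triples from the tab-separated lines."""
    rows = [line.split("\t") for line in conllu_text.splitlines() if line]
    return [(form, lemma, pos) for _, form, lemma, pos in rows]

for form, lemma, pos in read_lemmas(CONLLU):
    print(form, "->", lemma, f"({pos})")
```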
5. Expert support and dissemination

5.1. Knowledge centres

CLARIN.SI is active in promoting and encouraging the development of computational linguistics, not only for Slovene but also for other South Slavic languages, such as Croatian, Serbian, Macedonian and Bulgarian, which has substantially increased the international profile of the RI. Together with the Bulgarian CLARIN research infrastructure CLADA-BG and the Institute of Croatian Language and Linguistics, CLARIN.SI runs the CLARIN knowledge centre for South Slavic languages, CLASSLA, within which it offers expert support for the use of language resources and technologies for South Slavic languages. The knowledge centre supports researchers with documentation on freely available language resources, tools for creating and processing text corpora, and other language technologies. In addition, CLASSLA develops its own language technologies and corpora, addressing the great needs of the South Slavic languages, which are technologically under-resourced. In 2020, for example, as part of a project collecting corpora of Wikipedia texts, the centre created the first linguistically annotated Macedonian corpus, CLASSLAWiki-mk (Ljubešić et al., 2021).

In 2021 CLARIN.SI also became a member of the CLARIN knowledge centre for the processing of computer-mediated communication and user-generated content, CKCMC (https://cmc-corpora.org/ckcmc/), led by Eurac Research, Bolzano.

5.2. Project funding

CLARIN.SI financially supports projects, selected annually through an open call for consortium members, that help realise the CLARIN.SI strategy. This activity has been very well received and has also contributed significantly to interest in the research and development of language resources among young people. Since the initiative started in 2018, 19 projects have been successfully carried out, producing, among other things, the siParl corpus of parliamentary debates of the National Assembly of the Republic of Slovenia (Pančur et al., 2020), the upgraded corpus of academic Slovene KAS 2.0 (Žagar et al., 2022), the spoken corpus Gos Videolectures (Verdonik et al., 2019), the LIST tool for the efficient analysis of Slovene corpora (Krsnik et al., 2019), and other language resources and software. Among others, CLARIN.SI also funded the project "Developing Teaching Materials on the siParl 2.0 Corpus: A Corpus Approach to Researching Parliamentary Discourse" (Fišer and de Maiti, 2021).

5.3. Organisation of events

CLARIN.SI takes part in the organisation of events in computational linguistics and related fields in Slovenia, e.g. the XVIII EURALEX International Congress (Ljubljana, 2018) and the 22nd International Conference on Text, Speech and Dialogue (Ljubljana, 2019), and above all the main conference of the field in Slovenia, "Language Technologies and Digital Humanities", which has a tradition of over 20 years and whose organisation was initiated by the SDJT society. Since 2005, SDJT has organised the occasional JOTA lectures (Jezikovnotehnološki abonma, a language-technology lecture series), of which CLARIN.SI has supported the recording and archiving of 12 lectures on VideoLectures.NET (https://videolectures.net/jota/), with 10,000 views to date.

5.4. Communication and promotion

Last but not least, we regularly present the work of CLARIN.SI and its knowledge centres at workshops and conferences at home and abroad, such as the conference of the European Strategy Forum on Research Infrastructures (ESFRI) and the CLARIN conferences, as well as in lectures within the study programmes of Slovenian universities.

CLARIN.SI also organises workshops on the use of corpora and language technologies for research purposes. For example, we have held workshops (https://www.clarin.si/info/dogodki/) on the use of the noSketch Engine concordancer and of the WebAnno and Git platforms, while the CLASSLA knowledge centre co-organised a workshop on using corpora to analyse regional variation in gender-marked language (https://www.clarin.si/info/k-center/delavnice/).

We also keep the public informed about the activities of the CLARIN.SI consortium partners and its knowledge centres through regularly updated news published on the infrastructure's website, through a mailing list, and through posts from the CLARIN.SI Twitter profile. The work of CLARIN.SI and its CLASSLA knowledge centre has also been highlighted in several CLARIN ERIC "Tour de CLARIN" publications (Fišer et al., 2019).

6. Involvement in projects and infrastructures

CLARIN.SI takes part in national and European projects, thereby ensuring greater utilisation and visibility, and of course an additional inflow of funds for its operation.

6.1. European Cohesion Policy funds

Within the MIZŠ cohesion funds project 2018–2021, the consortium partners IJS, UM and UL upgraded their hardware, enabling faster and more fault-tolerant operation of the CLARIN.SI web services, while the GPU server cluster acquired at the University of Maribor serves for research into deep machine learning for language data processing, e.g. in the field of speech. With these upgrades, CLARIN.SI can provide the Slovenian research community with an excellent research infrastructure, which, among other things, adds to the attractiveness of Slovenian partners in international research and innovation projects and supports the achievement of scientific excellence and internationally outstanding results. Thus, for example, the EU project MaCoCu uses the CLARIN.SI computer cluster for the harvesting and processing of web big data, while in the EU project InTavia the Slovenian Biographical Lexicon is being linguistically annotated with models developed on the GPU cluster. Several large EU projects, such as ELEXIS and EMBEDDIA, have deposited the language resources they developed in the CLARIN.SI repository.

6.2. Involvement in European projects

Among the European projects, we particularly highlight ELEXIS (https://elex.is/), as a new collection, CLARIN.SI ELEXIS, was created in the CLARIN.SI repository for the needs of this project, gathering metadata and links to the web interfaces of 143 digital dictionaries. At the end of the ELEXIS project, we also plan to establish, within CLARIN.SI and IJS, a new CLARIN knowledge centre for digital lexicography.
6.3. Involvement in national projects

We also take part in several national projects. The largest is "Development of Slovene in a Digital Environment" (https://www.cjvt.si/rsdo/), for which CLARIN.SI provides its services for reviewing and depositing the language resources produced in the project, as well as the definition of schemas for the harmonised annotation of Slovene language resources. Also planned is the production of controlled vocabularies for the linguistic annotation of Slovene texts at the levels of morphosyntax, syntax, named entities, semantic roles, etc.

6.4. Cooperation with other RIs

CLARIN.SI cooperates with the Slovenian centres of its sister infrastructures, CESSDA/ADP and DARIAH-SI. In the project "RDA Node Slovenia" (2019–2020), coordinated by ADP (FDV UL), we reviewed and analysed the Slovenian research data repositories (Meden and Erjavec, 2021). With INZ and DARIAH-SI we have cooperated on the standardisation of the encoding and the construction of corpora of parliamentary data.

6.5. Participation in the work of CLARIN ERIC

CLARIN.SI is one of the more active national RIs in CLARIN ERIC. We obtained funding for two smaller projects that included international workshops, in 2016 in Ljubljana and in 2019 in Amersfoort. The latter, organised in cooperation with DARIAH-SI, was dedicated to drawing up recommendations for the standardised encoding of corpora of parliamentary debates, named Parla-CLARIN (https://clarin-eric.github.io/parla-clarin/; Erjavec and Pančur, in press), which has become a popular choice for encoding parliamentary corpora. On this basis, CLARIN.SI gained a key role in two larger "CLARIN Flagship" projects, ParlaMint I (2020–2021) and ParlaMint II (2022–2023). The aim of the ParlaMint projects is to create comparable, interpretable and uniformly encoded corpora of parliamentary debates.
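Parla-CLARIN encodes debates as TEI documents in which each utterance is attributed to a speaker. A minimal sketch of reading such a structure follows; the fragment and its speaker IDs are invented for illustration, and real ParlaMint files carry full TEI headers, segmentation and token-level annotation.

```python
# Sketch: extracting speakers and utterance text from a simplified
# Parla-CLARIN/TEI-style fragment (<u who="..."> elements in the TEI
# namespace). Content and speaker IDs are hypothetical.
import xml.etree.ElementTree as ET

TEI = """<div xmlns="http://www.tei-c.org/ns/1.0" type="debateSection">
  <u who="#JanezNovak"><seg>Spoštovani zbor!</seg></u>
  <u who="#AnaKovac"><seg>Hvala za besedo.</seg></u>
</div>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def utterances(tei_text):
    """Return (speaker reference, utterance text) pairs."""
    root = ET.fromstring(tei_text)
    return [(u.get("who"), "".join(u.itertext()).strip())
            for u in root.findall("tei:u", NS)]

for who, text in utterances(TEI):
    print(who, ":", text)
```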
In the already completed ParlaMint I project, CLARIN.SI led the collection and encoding of 17 corpora of national parliaments (Erjavec et al., 2022), which are openly available in the CLARIN.SI repository as well as through the RI's concordancers. In ParlaMint II, whose aim is to extend and enrich the existing corpora, to add the corpora of new partners, and also to develop teaching materials and examples of good practice in using parliamentary corpora for research in the humanities and social sciences, members of CLARIN.SI lead four of the five work packages (https://www.clarin.eu/parlamint).

Members of the CLARIN.SI UO take part in the work of the CLARIN committees for legal issues (Mateja Jemec Tomazin, ZRC SAZU), for standardisation (Tomaž Erjavec, IJS) and for user involvement (Jakob Lenardič, FF UL), as well as in the annual CLARIN conferences (T. Erjavec chairs the programme committee of the 2022 conference in Prague). J. Lenardič received the CLARIN "Steven Krauwer Award" for young researchers in 2019, among other things for his work (together with Darja Fišer) on establishing the "CLARIN Resource Families" initiative (https://www.clarin.eu/resource-families), while T. Erjavec received the "Steven Krauwer Award for CLARIN Achievements 2021" for his work on the ParlaMint project. In 2021, Darja Fišer and Kristina Pahor de Maiti (FF UL) received the "Teaching with CLARIN Award" for the best teaching material related to the use of CLARIN resources. Kaja Dobrovoljc (FF UL) presented the CLARIN RI at the ESFRI 20th anniversary conference in Paris in 2022 (https://www.esfri.eu/esfri-events/esfri-20years-conference). Darja Fišer was the CLARIN ERIC Director of User Involvement between 2016 and 2020, and in 2023 she is to become the Executive Director of CLARIN ERIC.
7. Conclusions

CLARIN.SI is an exceptionally successfully established infrastructure covering a broad interdisciplinary field, from research in the humanities and social sciences to the development of knowledge systems, knowledge technologies and artificial intelligence. It supports basic and applied research as well as the development of applications, information systems and tools at all levels of technological readiness.

With its exceptional national and regional importance, its promotion of the field, its attraction of young researchers, its links with industry and broad stakeholder involvement, its strong role in introducing the principles of open science, and its highly visible, successful and indeed award-winning role in European and international cooperation through related projects, CLARIN.SI is a model for establishing a successful, state-of-the-art, modern interdisciplinary scientific research and technology infrastructure.

In the next period, in addition to maintaining its existing services, CLARIN.SI will even more intensively promote the reuse of research data, enabling researchers in the humanities and social sciences to increase their productivity and, more importantly, to establish new research directions addressing one or more of the social roles of language. Another important goal is the implementation of the CLARIN ERIC interoperability guidelines (https://www.clarin.eu/content/interoperability), a key precondition for effectively supporting research through the interoperability of tools, resources, metadata and encoding standards, as well as at the organisational level (Jong et al., 2020). At the same time, user support will need to be strengthened, since universities and funding agencies increasingly require researchers in doctoral and research programmes to produce research data management plans and ensure the long-term preservation of their data.

The ESFRI Roadmap 2021 for RIs (https://www.esfri.eu/esfri-roadmap-2021) emphasises the importance of FAIR data; within the CLARIN.SI repository we have already taken several important steps in this area, and we will continue to attend to the FAIR aspects. Thus, in connection with RDA Node Slovenia, a workshop on CTS certification and the FAIR principles for Slovenian research data repositories is planned. The ESFRI Roadmap 2021 also stresses the ever greater presence of big data and the importance of infrastructures storing and processing it appropriately. With the ever larger quantities of available texts, the shift from written to spoken and visual language resources, and increasingly rich text annotation, the field of language resources is also entering the era of big data, as is already evident in the ParlaMint II, RSDO and MaCoCu projects. In the next period, CLARIN.SI will therefore support the use of hardware and software capacities for the storage and, above all, the processing of big data. The Roadmap also highlights the importance of research infrastructures for the capture, storage and processing of data from social media and the web. CLARIN.SI has already paid particular attention to such language resources and will strengthen these activities in the future, not only for Slovene but (within the CLASSLA knowledge centre and the MaCoCu project) also for the other South Slavic languages.

The Roadmap further emphasises the instrumentalisation and accessibility of data and services relevant to individual communities. The CLARIN.SI consortium currently includes 12 members, which covers the large majority of Slovenian stakeholders that either produce or use language resources and technologies, but not all of them. In the next period, CLARIN.SI will strive to expand its consortium, thereby also covering the communities of potential users of the infrastructure that have so far not been involved in its operation. CLARIN.SI likewise plans a study of the needs of individual communities (researchers and lecturers in the humanities and in computational linguistics, lexicographers, translators, people with special needs) and an improvement of its offer in line with the findings.

Among other things, the Roadmap stresses the importance of education, training and support in the use of infrastructures for current and future users. In its first period of existence, CLARIN.SI was severely understaffed, yet we nevertheless carried out a series of workshop events across Slovenia and abroad, above all at various faculties, where we presented the infrastructure to students. In the next period we will approach these activities systematically, with a more proactive approach to lectures and workshops for students as well as for researchers and lecturers, and with the development and promotion of teaching materials.

Finally, the recently adopted Slovenian Research Infrastructure Development Plan 2030 (NRRI 2030; https://www.gov.si/assets/ministrstva/MIZS/Dokumenti/ZNANOST/Novice/NRRI-2030/NRRI-2030_SLO.pdf) is also important for the future of CLARIN.SI: it "plans the continuation and strengthening of activities within the international CLARIN projects" (p. 60), acknowledges the cooperation to date with the RIs DARIAH-SI and CESSDA/ADP, and furthermore envisages links with two new RIs, namely OPERAS (Open Scholarly Communication in the European Research Area for Social Sciences and Humanities; https://www.operas-eu.org), led in Slovenia by ZRC SAZU, and PRACE (Partnership for Advanced Computing in Europe; https://www.prace-ri.eu), led by ARNES.

Acknowledgements

The work presented in this paper was supported by ARRS within the framework of funding for ESFRI research infrastructures, by the Republic of Slovenia, and by the European Union from the European Regional Development Fund within the projects C3330-19-952059 "Development of Research Infrastructure for the International Competitiveness of the Slovenian RRI Space / RI-SI-CLARIN" and OP20.06780 "Development of Slovene in a Digital Environment", as well as by CLARIN ERIC projects.

We also thank our colleagues at CLARIAH-CZ for their help with the upgrades and maintenance of the repository platform, the colleagues of the Czech National Corpus, above all Tomáš Machálek, for their help with the installation of the KonText concordancer, and the colleagues at Lexical Computing, above all Jan Bušta and Tomáš Svoboda, for their help with the installation of the Sketch Engine Crystal concordancer.

8. References

Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805. https://doi.org/10.48550/arXiv.1810.04805.

Tomaž Erjavec, Jan Jona Javoršek and Simon Krek. 2014. Raziskovalna infrastruktura CLARIN.SI. In: Zbornik Devete konference JEZIKOVNE TEHNOLOGIJE, Ljubljana, 9.–10. oktober 2014. Slovensko društvo za jezikovne tehnologije. https://nl.ijs.si/isjt14/proceedings/isjt2014_03.pdf.

Tomaž Erjavec, Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Darja Fišer, Cyprian Laskowski and Katja Zupan. 2016. Annotating CLARIN.SI TEI corpora with WebAnno. In: Proceedings of the CLARIN Annual Conference. https://www.clarin.eu/sites/default/files/erjavec-etal-CLARIN2016_paper_17.pdf.

Tomaž Erjavec, Darja Fišer and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources and Evaluation, 55(2):551–583. https://rdcu.be/b7GrB.

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx and Darja Fišer. 2022. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation. https://doi.org/10.1007/s10579-021-09574-0.

Tomaž Erjavec and Andrej Pančur. In press. The Parla-CLARIN Recommendations for Encoding Corpora of Parliamentary Proceedings. Journal of the Text Encoding Initiative. https://journals.openedition.org/jtei/index.html.

Darja Fišer and Kristina Pahor de Maiti. 2021. "Prvič, sem političarka in ne politik, drugič pa...". Contributions to Contemporary History, 61(1). https://doi.org/10.51663/pnz.61.1.07.

Darja Fišer, Jakob Lenardič, Ilze Auziņa, Nan Bernstein Ratner, Koenraad De Smedt, Kaja Dobrovoljc, Réka Dodé, Rickard Domeij, Helge Dyvik, Tomaž Erjavec, Olga Gerassimenko, Jan Hajič, Michal Křen, Nikola Ljubešić, Brian MacWhinney, Monica Monachini, Beatrice Nava, Costanza Navarretta, Aneta Nedyalkova, Klaus Nielsen, Marin Laak, Susanne Nylund Skog, Lene Offersgaard, Petya Osenova, Valeria Quochi, Sanita Reinsone, Inguna Skadiņa, Kiril Simov, Ondřej Tichý, Noémi Vadász, Tamás Váradi and Kadri Vider. 2019. Tour de CLARIN Volume Two. Zenodo. https://doi.org/10.5281/zenodo.3754164.

Darja Fišer, Nikola Ljubešić and Tomaž Erjavec. 2020. The Janes project: Language resources and tools for Slovene user generated content. Language Resources and Evaluation, 54:223–246. https://rdcu.be/7RX4.
V: Proceedings of Pavel Rychl´y. 2007. Manatee/Bonito - A Modular the Eleventh International Conference on Language Corpus Manager. V: 1st Workshop on Recent Ad- Resources and Evaluation (LREC 2018), Miyazaki, vances in Slavonic Natural Language Processing, str. Japan, May 7–12, 2018. European Language Resour- 65–70, Brno. Masarykova univerzita. ces Association (ELRA). https://aclanthology. Matej Ulčar in Marko Robnik-Šikonja. 2021. Slove- org/L18-1515. nian RoBERTa contextual embeddings model: Slo- Franciska De Jong, Bente Maegaard, Darja Fišer, Die- BERTa 2.0. Slovenian language resource reposi- ter Van Uytvanck in Andreas Witt. 2020. Interope- tory CLARIN.SI. http://hdl.handle.net/11356/ rability in an infrastructure enabling multidiscipli- 1397. nary research: The case of CLARIN. V: Proceedings Tamás Váradi, Steven Krauwer, Peter Wittenburg, of the 12th Language Resources and Evaluation Con- Martin Wynne in Kimmo Koskenniemi. 2008. ference, str. 3406–3413. European Language Resour- CLARIN: Common language resources and tech- ces Association (ELRA). https://aclanthology. nology infrastructure. V: Proceedings of the Si- org/2020.lrec-1.417/. xth International Conference on Language Resour- Adam Kilgarriff, V´ıt Baisa, Jan Bušta, Miloš Ja- ces and Evaluation (LREC’08), Marrakech, Mo- kub´ıček, Vojtěch Kovář, Jan Michelfeit, Pavel rocco, May. European Language Resources As- Rychl´y in V´ıt Suchomel. 2014. The Sketch Engine: sociation (ELRA). http://www.lrec-conf.org/ Ten years on. Lexicography, 1:7–36. proceedings/lrec2008/pdf/317_paper.pdf. Petra Kralj Novak, Jasmina Smailović, Borut Sluban Darinka Verdonik, Tomaž Potočnik, Mirjam Se- in Igor Mozetič. 2015. pesy Maučec, Tomaž Erjavec, Simona Majhenič in Emoji Sentiment Ranking Andrej Žgank. 2019. Spoken corpus Gos VideoLec- 1.0. Slovenian language resource repository CLA- RIN.SI. tures 4.0 (transcription). http://hdl.handle.net/ http://hdl.handle.net/11356/1048. Simon Krek. 2022. 
Delivrable D1.31: Report on 11356/1223. ťhe Slovenian Language. Tehnično poročilo, Eu- Spela Vintar in Matej Martinc. 2022. Framing karsto- ropean Language Equality Project. logy: From definitions to knowledge structures and https:// automatic frame population. Terminology. Interna- european-language-equality.eu/wp-content/ tional Journal of Theoretical and Applied Issues in uploads/2022/03/ELE___Deliverable_D1_31_ Specialized Communication, 28(1):129–156. _Language_Report_Slovenian_.pdf. Seid Muhie Yimam, Iryna Gurevych, Richard Ec- Luka Krsnik, Špela Arhar Holdt, Jaka Čibej, Kaja kart de Castilho in Chris Biemann. 2013. We- Dobrovoljc, Aleksander Ključevšek, Simon Krek in bAnno: A Flexible, Web-based and Visually Su- Marko Robnik-Šikonja. 2019. Corpus extraction pported System for Distributed Annotations. V: tool LIST 1.2. http://hdl.handle.net/11356/ Proceedings of the 51st Annual Meeting of the Asso- 1276. ciation for Computational Linguistics: System De- Jakob Lenardič in Darja Fišer. 2022. CLARIN De- monstrations, str. 1–6, Sofia, Bulgaria, August. As- positing Guidelines: State of Affairs and Proposals sociation for Computational Linguistics. https:// for Improvement. V: Proceedings of the CLARIN aclanthology.org/P13-4001. Annual Conference, Prague, Czech Republic, Octo- Aleš Žagar, Matic Kavaš, Marko Robnik-Šikonja, ber 10–12, 2022. https://www.clarin.eu/event/ Tomaž Erjavec, Darja Fišer, Nikola Ljubešić, Marko 2022/clarin-annual-conference-2022. Ferme, Mladen Borovič, Borko Boškovič, Milan Oj- Nikola Ljubešić, Filip Markoski, Elena Markoska in steršek in Goran Hrovat. 2022. Corpus of academic Tomaž Erjavec. 2021. Comparable corpora of South- Slovene KAS 2.0. http://hdl.handle.net/11356/ Slavic Wikipedias CLASSLA-Wikipedia 1.0. Slo- 1448. venian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1427. Tomáš Machálek. 2020. KonText: Advanced and Fle- xible Corpus Query Interface. 
ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts

Simon Gonzalez
The Australian National University, Canberra, Australian Capital Territory, Australia
u1037706@anu.edu.au

Abstract

Social media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, coming from any part of the world and from natural contexts, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics, and there is currently a wide range of projects which have successfully created corpora from social media. In this paper, we present the development and deployment of a linguistic corpus built from Twitter posts in English, coming from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams. The information is presented through a range of powerful visualisations for users to explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.

1. Introduction

In this current age, the use of social media platforms has permeated all circles of society, from personal communication to government communications. Its impact is hard to overstate. It is considered a form of mass media, but one distinct from other forms such as television and radio, where the information is presented by a specific broadcasting mechanism (Page et al., 2014). In the case of social media, the content can be delivered by anyone, making it more personal and individual than other mass forms. The adoption of this technology in language research has been an organic and necessary process: language research investigates the use of language in society, and since social media is a medium of language, we need to understand how we use language in this digital world.

One framework that has efficiently paved the way for linguists to examine social media language is Computer-Mediated Communication (CMC). This has been defined as a relatively new 'genre' of communication (Herring, 2001), which can involve a diversity of forms connecting the physical and the digital (Boyd and Heer, 2006). One focus of CMC research is the intrinsic characteristics of digital language, e.g. stylistics, use of words, semantics, and other relevant linguistic features. This has been done for various CMC types, including social media.

But describing social media features is not a straightforward task, because social media is not a homogeneous genre. It has a diversity of types depending on the main shareable content (e.g., YouTube for videos, Twitter for texts¹) and main format (e.g., Reddit as a discussion forum, Pinterest for product purchases). One common feature, however, is that all platforms have an interactive component in which users can express ideas, comment, and reply to other people's perspectives. The inherent communicative aspect of this social interaction has strong implications for linguistic research: when we analyse language from social media, we look at how language is used in natural contexts, with concrete communicational purposes. What then distinguishes our approach as language researchers from that of engineers and app developers, for example, is that we are interested in studying how people use technology to communicate and in describing what makes it a distinctive type of language (Page et al., 2014). In this sense, we are interested in identifying language patterns as used in social media platforms, knowing that patterns found in social media are not necessarily representative of language patterns in other contexts. This has been demonstrated empirically by Grieve et al. (2019), who compared Twitter data with traditional survey data and found that some patterns were observed more strongly in the Twitter data than in the survey data. Results like these are evidence that when we deal with social media language, we are examining a way of expression which shares features with other language forms but at the same time has its own distinctive characteristics. This is paramount when new language analysis technologies are developed.

1.1. Twitter and Corpus Linguistics

The combination of language research and social media is a complex endeavor, requiring people working with both to apply the skills necessary for this interdisciplinary undertaking. One area that reflects this complexity and that has efficiently adopted social media is Corpus Linguistics (CL). A strong characteristic of CL is that it is used to collect, store, and facilitate language analysis for large datasets (Szmrecsanyi, 2011; Grieve, 2015). And with the advantage of having more sophisticated tools available, such as in social media research, corpora are becoming larger and larger, with the only constraints being computational power and storage capacity.

Many social media platforms have been widely used for language and linguistic research (cf. Liew and Hassan, 2021; Nagase et al., 2021; Trius and Papka, 2022; Wincana et al., 2022).
Out of these platforms, Twitter stands out due to its worldwide spread and the options it gives researchers for stratifying the demographics of user accounts, including the use of the geo-code and time-stamp information of the posts² (Grieve et al., 2018). It is classified as a microblogging site (Chandra et al., 2021) where the content can be opinions, news, arguments, and other types of sentences (Chaturvedi et al., 2018). Because of its widespread use, it has served in the creation of numerous corpora built from Twitter posts (cf. Dijkstra et al., 2021; Grieve et al., 2019; Tellez et al., 2021).

¹ The type of content of social media platforms is not restricted to only one; this is just an example of the main purpose in specific cases. For instance, YouTube allows users to write comments on videos, and Twitter can embed videos in posts.
² The geo-code information is optional in Twitter, and the user decides whether to show it or not. Other approaches include running algorithms that identify locations based on factors such as time zone and language features, which are used to infer locations.

1.2. Current Project

In this paper, we present the development of a web-based corpus from Twitter posts, named ILiAD: An Interactive Corpus for Linguistic Annotated Data. In relation to our methodological approach, we propose that corpora built from social media help study the patterns of language used in this context and capture their linguistic complexity. By doing this, we can have a better view of the multilayered nature of the corpus.

2. Goal of the Paper

The aim of the corpus is to capture the linguistic complexities of Twitter language, and we chose two types of account users: news agencies and individuals. We explore the differences between their structures and patterns. The language of journalism is characterised by its main purpose: to exert influence on readers and convince them of a specific interpretation (Fer, 2018; Moschonas, 2014). This is achieved by three main stylistic features. The first is language clarity, a feature more strongly associated with journalism than with many other language styles. The second is accuracy: the ability to convey ideas accurately, avoiding ambiguities in interpretation. The final one is simplicity, which aims to convey messages without complex words that may blur the intention of the message (Fer, 2018). The aim therefore is to prepare the corpus for further exploration, querying, and analysis to understand the language used on Twitter.

The analysis can focus on many linguistic parameters, and here we approach it in an integrated way. This gives users the opportunity to explore the corpus from different angles and linguistic perspectives.

3. Methodology

The stages of data collection, data processing, and app deployment were carried out in R (R Core Team, 2021), using shiny (Chang et al., 2021) for the app development. Apps developed in shiny have two main advantages. The first is interactivity: users can interact with the whole corpus across a range of visualisation outputs and tables. The second is reactivity: users modify parameters in the tables and visualisations, and the app changes its outputs based on these inputs. The positive impact on corpus linguistics is invaluable. With these features, a corpus can be used to gain a full understanding of the shape of the data as well as an exploration of its patterns.

3.1. Data Collection

We applied four criteria to identify the Twitter accounts to be included in the corpus. The first criterion was that account users (news agencies and individuals) had to have English as their main language of communication. The second was that accounts had to be active at the moment of extraction, so as to capture tweets that were synchronous and where topics and trends could be shared across accounts. The third criterion was that accounts had to have a large number of tweets, enough to reach over 3,000. This was done to make sure that enough posts were left after the filters (explained below) were applied. The final criterion was to include only those users whose posts were not mainly retweets. This filter aimed to exclude accounts that do not produce content but only retweet posts from other accounts. From this, we identified 29 news agencies and 27 individual accounts. The proportions are shown in Table 1.

User Type      Total Tweets   Percentage
News Agency    84,354         54%
Individual     71,477         46%
Total          155,831        100%

Table 1: Total number of tweets in the corpus and their proportions per account type.

The data extraction was done through an R script developed by the main author. We used the rtweet (Kearney, 2019) package, which allows users to gather Twitter posts via the free Twitter API, giving a total of over 156,000 tweets.

Year    Total Tweets   Percentage
2009    139            0.1%
2010    178            0.1%
2011    497            0.3%
2012    2,230          1.4%
2013    5,097          3.3%
2014    3,625          2.3%
2015    5,159          3.3%
2016    6,745          4.3%
2017    5,508          3.5%
2018    6,301          4.0%
2019    7,847          5.0%
2020    18,742         12.0%
2021    20,697         13.3%
2022    73,066         46.9%
Total   155,831        100%

Table 2: Total number of tweets in the corpus and their proportions per year.

3.2. Data Processing

From the collected data, we applied six filters to make sure that the corpus reflects comparable linguistic data for all account users. The first filter was to exclude tweets that were not in English (n=10,067; 6%). This was done by filtering out tweets which did not have the English (en) tag assigned by Twitter's machine language detection, which is annotated in the tweet's metadata. The second filter was to exclude retweets (n=23,260; 15%), restricting the data to posts that come from the given user and not from other accounts. The third filter was to exclude quote tweets (n=7,142; 5%), i.e. tweets that are retweeted with an added comment from the user. Keeping quote tweets in the data would add repeated tweets to the corpus, as well as patterns and word counts that do not correspond to the specified account. The fourth filter deleted repeated tweets (n=778; 0.5%), targeting cases in which account users write the same content and post it as a separate tweet rather than as a retweet. As with quote tweets, keeping repeated tweets would inflate the content of the corpus and make it unrepresentative. For the fifth filter, we excluded strings that were URL links, which do not have linguistic features³ of interest in this paper (n=1,208; 0.8%). For the sixth and last filter, we first calculated the number of words in each tweet, splitting on white space, and then excluded tweets with fewer than eight words (n=14,125; 9%). This filter targets tweets which do not have linguistic content but only social media features such as hashtags or links.

With these filters, the final data contained 112,690 tweets, a loss of 28% (n=43,919) of the original data exported from the Twitter API.

³ URL links are an important aspect of social media language. However, their analysis is beyond the scope of this paper.
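The six-filter cascade described above can be summarised in code. The following sketch is purely illustrative: the authors' actual pipeline is an R script built on rtweet, and the field names used here (lang, is_retweet, is_quote, text) are assumptions for the example, not the actual Twitter API schema.

```python
# Illustrative sketch of the six-step filtering cascade in Section 3.2.
# The real pipeline is written in R; field names below are assumptions.

def keep_tweet(tweet, seen_texts):
    """Return True if a tweet survives all six filters."""
    text = tweet["text"]
    if tweet["lang"] != "en":        # 1. exclude non-English tweets
        return False
    if tweet["is_retweet"]:          # 2. exclude retweets
        return False
    if tweet["is_quote"]:            # 3. exclude quote tweets
        return False
    if text in seen_texts:           # 4. exclude repeated tweets
        return False
    if text.startswith("http"):      # 5. exclude strings that are URL links
        return False
    if len(text.split()) < 8:        # 6. exclude tweets under eight words
        return False
    seen_texts.add(text)
    return True

def filter_corpus(tweets):
    seen = set()
    return [t for t in tweets if keep_tweet(t, seen)]

raw = [
    {"text": "Senate needs to think and vote on the new bill today",
     "lang": "en", "is_retweet": False, "is_quote": False},
    {"text": "kratko sporocilo", "lang": "sl",
     "is_retweet": False, "is_quote": False},
    {"text": "too short", "lang": "en",
     "is_retweet": False, "is_quote": False},
]
print(len(filter_corpus(raw)))  # → 1: only the first tweet survives
```

The order of the predicates mirrors the order in which the filters are reported in the text; because later filters only see survivors of earlier ones, the reported per-filter counts need not sum to the total loss.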
3.3. Text Processing

After data filtering, we implemented a wide range of Natural Language Processing (NLP) techniques for the data wrangling and analysis. We carried out the text processing using the UDPipe (Straka and Straková, 2017) package as the main tool for the NLP tasks. UDPipe is defined as a single tool which contains a tokenizer, morphological analyzer, part-of-speech tagger, lemmatizer, and dependency parser. It currently offers 77 language models, with some languages having more than one model available. We used the EWT English model available in the package. We selected the text column from the API output and made it the input to the main UDPipe function. The core purpose of the UDPipe package is to create a single-model tool for a given language which can be used to process raw text and convert it to CoNLL-U-formatted text. This format stores tagged information for all words in dependency trees, including morphological and syntactic features (Straka and Straková, 2017). From this format, the UDPipe algorithm creates morphological taggers and dependency parsers. The main taggers are described below.

3.3.1. Tokenization

The tokenization tools are wrapped within a trainable tokenizer based on artificial neural networks, specifically the bidirectional LSTM (Graves and Schmidhuber, 2005) and the gated recurrent unit, GRU (Cho et al., 2014). It works by comparing the words in the input text to the trained tokenizer and does not add any additional knowledge about the language. If a given word, or group of words, is not recognized, the tokenizer tries to reconstruct it by utilizing an additional raw text corpus.

3.3.2. Morphological Analysis

There are three main fields tagged in the data process:

1. Part-of-speech tags
2. Morphological features
3. Lemma or stem

The part-of-speech tagging uses MorphoDiTa (Straková et al., 2014). The tagging process exploits the rich linguistic features of inflective languages with a large number of suffixes, where multiple forms can be related to a single lemma. From this, the tagger estimates common patterns in endings and creates morphological templates from the observed clusters. In Table 3, we show the counts and proportions of the top ten part-of-speech tags in the current corpus, as output by UDPipe.

POS     Corpus Count   Percentage
NOUN    76,795         20.8%
VERB    62,537         16.9%
ADP     39,237         10.6%
PROPN   37,862         10.3%
PRON    37,399         10.1%
DET     31,284         8.5%
PUNCT   30,001         8.1%
ADJ     24,452         6.6%
ADV     16,425         4.4%
AUX     13,171         3.6%

Table 3: Total number of top ten part-of-speech tags in the corpus and their proportions.

3.3.3. Classification Features

UDPipe uses two models that facilitate the tagging process and improve the overall accuracy by employing different classification feature sets. The first is the POS tagger, which disambiguates all available morphological fields in the data. The second model, a lemmatizer, disambiguates the tagged lemmas.

3.3.4. Dependency Parsing

Dependency parsers belong to the family of grammar formalisms called dependency grammars (Jurafsky and Martin, 2021). In these, the syntactic structure of sentences is described in terms of the grammatical relations between the words, shown as directed binary dependencies. All structures start at the root node of the tree, and the components and dependencies are then shown throughout the entire structure. Dependency parse trees can deal very efficiently with languages that are morphologically rich and also have a relatively free word order, for example Spanish, Czech, and English, with varying flexibility. Another important advantage of using dependency parsers is that they allow closer examination of semantic relationships between arguments in a sentence.

Summing up, the features, descriptions, and tagging provided by the UDPipe framework offer invaluable information for the kind of linguistic analysis used in Corpus Linguistics. With these features extracted for all tweets, we have information available at different layers of linguistic analysis: morphological, syntactic, and, through the dependency parses, even semantic.
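The CoNLL-U output described above is straightforward to consume downstream. The following sketch (the project itself works in R with UDPipe) reads a hand-written CoNLL-U fragment into the fields shown in Table 6; the fragment is an illustration, not actual UDPipe output.

```python
# Sketch: reading UDPipe-style CoNLL-U output into (form, upos, feats, dep_rel)
# tuples. The fragment below is hand-written for illustration only.

CONLLU = """\
1\tSenate\tSenate\tPROPN\t_\tNumber=Sing\t2\tnsubj\t_\t_
2\tneeds\tneed\tVERB\t_\tMood=Ind\t0\troot\t_\t_
3\tto\tto\tPART\t_\t_\t4\tmark\t_\t_
4\tthink\tthink\tVERB\t_\tVerbForm=Inf\t2\txcomp\t_\t_
"""

def parse_conllu(block):
    """Extract (form, upos, feats, dep_rel) per token from a CoNLL-U block."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):   # skip blanks and comment lines
            continue
        cols = line.split("\t")
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        rows.append((cols[1], cols[3], cols[5], cols[7]))
    return rows

for form, upos, feats, deprel in parse_conllu(CONLLU):
    print(form, upos, feats, deprel)
```

Keeping the HEAD and DEPREL columns around (columns 7 and 8) is what makes the dependency visualisations in Section 4.1.2 possible without re-parsing the text.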
3.4. Data Filtering

After obtaining the output from the UDPipe package, we proceeded to filter the data to prepare it for the linguistic analysis within the corpus. This filtering process produces two dataset outputs used for different purposes in the corpus: the first is used for calculating n-grams and word frequencies, and the second for showing syntactic dependencies.

3.4.1. Token Filtering

Identifying the right tokens in social media language is a difficult process, and correct practice at this step is crucial to achieve efficient outcomes. This filtering differs from the practice applied to other language media, such as the language of newspapers, television, and academic papers. Following O'Connor et al. (2010), we excluded tokens containing hashtags, URL links, @-replies, strings of punctuation, and emoticons.⁴ Their proportions are shown in Table 4.

Content Excluded   Total Count   Percentage
Emoticons          1,556         0.4%
Hashtags           1,986         0.5%
URL Links          2,857         0.7%
@-replies          3,851         0.9%
Punctuation        30,001        7.3%

Table 4: Total number of social media content excluded and their proportions in the whole corpus.

⁴ Emoticons entail rich linguistic information. However, their analysis is not included in this version of the tool.

3.4.2. Removing Stop Words

Following standard procedures, we removed stop words for calculating n-grams and word frequencies. An important observation is that removing stop words is a compromise for the corpus, since certain word combinations are affected, especially those which appear together with the words in the list. Future versions of this work aim to efficiently implement analyses that consider the role of stop words in the corpus.

We removed stop words by following these steps:

1. First, we selected a list of stop words from the stopwords (Benoit et al., 2021) package in R. We selected the list used for English, which includes 175 words (see Table 5 for the top 15).
2. Next, we filtered out these stop words in the data subset.
3. Finally, we filtered out stop words that are specific to Twitter, including words such as RT, follow, follows, and following. In future versions, we aim to implement a disambiguation algorithm whereby a key word such as follow can be identified as a word used in a social media context (e.g. follow us on Twitter) or in a more traditional one (e.g. follow the road).

Stop word   Total Count   Percentage
the         17,151        4.18%
to          11,543        2.81%
a           9,076         2.21%
be          8,480         2.07%
and         7,844         1.91%
of          7,001         1.71%
I           6,734         1.64%
in          6,429         1.57%
you         5,315         1.3%
have        4,083         1%
that        3,933         0.96%
it          3,803         0.93%
for         3,793         0.92%
on          3,552         0.87%
he          3,442         0.84%

Table 5: Top 15 stop words excluded in the data subset and their proportions in the corpus.

3.4.3. Sentence Structure Filtering

In this filter, we aimed to identify posts which were not linguistic phrases or sentences, thus keeping only those structures that were classified into a sentence category. For each tweet breakdown produced by UDPipe (as shown in Table 6), we looked at the PUNCT classification, where we identified three types of sentences: statements (ending with "."), questions (ending with "?"), and exclamations (ending with "!"). Any unclassified sentence was deleted from the dataset. Deciding to keep only sentences that follow the standard punctuation symbols has a strong impact on a corpus based on Twitter language, since sentences here can follow other rules, e.g. ending with emoticons or with other uses of punctuation symbols, such as !!! or :). However, an important number of sentences follow the most standard use of punctuation symbols, which makes this a reliable representation of the data collected. Finally, for each sentence, we checked whether there was a conjugated verb; sentences with no conjugated verb were deleted from the dataset used for the Syntactic Dependencies section. For this, we created a data subset that contained only sentences and their corresponding classification from the previous step. This was the input for Section 4.1.2.

token    upos    feats          dep_rel
Senate   PROPN   Number=Sing    nsubj
needs    VERB    Mood=Ind       root
to       PART                   mark
think    VERB    VerbForm=Inf   xcomp
and      CCONJ                  cc
vote     VERB    VerbForm=Inf   conj

Table 6: Sample output from UDPipe.

3.5. Calculating N-Grams

Implementing NLP techniques brings more depth to the corpus analysis, since it allows users to explore more areas in the data. In the current version of the app, we use unigram and bigram explorations. The n-grams are calculated using the tidytext (Silge and Robinson, 2016) package. We followed the established approach of deleting English stop words, using the stopwords package. After the filtering, the n-grams were calculated across all the data.
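As a rough illustration of the stop-word removal (Section 3.4.2) followed by the n-gram counting just described, the sketch below mirrors the logic in Python; the actual implementation uses the tidytext and stopwords R packages, and the stop-word set here is a tiny illustrative subset of the 175-word English list.

```python
# Sketch of unigram/bigram counting after stop-word removal (Sections 3.4.2
# and 3.5). The stop-word set is a small illustrative subset, not the real list.

from collections import Counter

STOP_WORDS = {"the", "to", "a", "be", "and", "of", "in", "on", "rt", "follow"}

def tokens_without_stops(text):
    """Lowercase, split on white space, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def ngram_counts(texts, n):
    """Count n-grams of length n over the stop-word-filtered tokens."""
    counts = Counter()
    for text in texts:
        toks = tokens_without_stops(text)
        for i in range(len(toks) - n + 1):
            counts[" ".join(toks[i : i + n])] += 1
    return counts

tweets = ["follow the news today", "big news today on the vote"]
print(ngram_counts(tweets, 1)["news"])        # → 2
print(ngram_counts(tweets, 2)["news today"])  # → 2
```

Note that, as the paper acknowledges, removing stop words before forming bigrams creates new adjacencies ("news today" above spans a removed "the" in the second tweet), which is the compromise discussed in Section 3.4.2.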
3.6. Entity Identification

A second group of NLP techniques implemented is the identification of entities in the corpus, including mentions of people, physical locations, and established organisations. We used the entity (Rinker, 2017) package for this purpose. This package is a wrapper that simplifies and extends the named entity recognition of the NLP (Hornik, 2020) and openNLP (Hornik, 2019) packages. The advantage of this approach is that it lets us detect important information, which is crucial especially in large datasets, that can be captured when identifying entities. It also has a strong impact on our understanding of linguistic features, since entities are related to important elements in sentences, such as nouns and adjectives. By implementing this, the app brings more depth to the corpus analysis, since it allows users to explore the main entities in the corpus.

3.7. Twitter Metrics

The final metrics measured and obtained aim to show information that is relevant when dealing with Twitter data. The motivation is to be able to contextualise the information in the corpus within the overall world of social media. The information presented here is extracted from the Twitter API output, which means that we display two publicly available features. The first is the number of tweets across time. We also include a general summary of the main source locations, by country, of the tweets contributing to the data. Previous studies (cf. Grieve et al., 2019) have demonstrated that the use of geo-coding information is relevant for linguistic studies, but here we only show the country of origin of the tweets, without identifying individuals or linking linguistic features to any demographics.

4. App Infrastructure

The app was developed in RStudio, which has been widely used for corpus linguistics development and related tasks (Abeille and Godard, 2000; Gries, 2009), and the main framework was shiny. Shiny apps allow great interactivity and responsiveness: interactivity allows users to explore visualizations in effective ways, and responsiveness allows users to navigate contents in real time with clicks and dropdown menus. Other libraries that we used for the creation of visuals were ggplot2 (Wickham, 2016) and echarts4r (Coene, 2022). ggplot2 allows a great degree of flexibility when creating figures. This is relevant given the considerable complexity of the linguistic data that we present, as it allows complex ideas to be presented in a digestible way. Another advantage is that it allows users to see data points within the general context as well as to narrow down into more specific analyses, creating seamless navigation of linguistic data in an efficient way. The presentation of the app components is divided into two main sections: the first gives users tools to explore linguistic features, and the second gives information on Twitter metrics. Due to the limitations of the Twitter Terms of Service, the app cannot display the raw tweets in a database format nor give the option to download data. The interactive tool therefore focuses on the presentation of the linguistic features derived from the data.

4.1. Exploring Calculated Features

The linguistic features are the main backbone of the corpus. In this section, there are visualisation options that can be used to gain both a broad understanding of patterns and a deep exploration of linguistic features.

4.1.1. Parts of Speech

This section gives the overall statistics of the words classified into their POS, including distributions and proportions per year and sentence type. The exploration can be done at different levels: the whole corpus or by user type (news agencies or individuals). The input data in this section comes from the Sentence Structure Filtering section (3.4.3).

Figure 1: POS Distributions Tab.

4.1.2. Syntactic Dependencies

This section allows users to explore the syntactic dependencies of all the available sentences. Here we use a combination of the UDPipe output and the textplot (Wijffels et al., 2021) package, which renders the dependencies as in the figure below. Since users can select any of the available sentences, this is a powerful function that can be used to explore syntactic patterns across the corpus and facilitates the understanding of syntactic structures. The input data in this section comes from the Sentence Structure Filtering section (3.4.3).

Figure 2: Syntactic Dependencies Tab.
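The statistics behind the POS distributions tab can be illustrated with a short sketch. The function below is hypothetical (the app computes these figures in R from the UDPipe output); it derives tag proportions of the kind reported in Table 3 from (form, UPOS) pairs.

```python
# Hypothetical sketch of the POS proportion statistics behind Section 4.1.1:
# counting universal POS tags over tagged tokens and reporting each tag's share.

from collections import Counter

def pos_proportions(tagged_tokens):
    """tagged_tokens: iterable of (form, upos) pairs -> {upos: share of total}."""
    counts = Counter(upos for _, upos in tagged_tokens)
    total = sum(counts.values())
    return {tag: round(n / total, 3) for tag, n in counts.items()}

toks = [("Senate", "PROPN"), ("needs", "VERB"), ("to", "PART"), ("think", "VERB")]
print(pos_proportions(toks))  # VERB accounts for half of this toy sample
```

Grouping the input pairs by year or by account type before calling such a function yields exactly the per-year and per-user-type breakdowns the tab exposes.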
Another advantage of this is that it allows users to see data points within the general context as well as being able to narrow Figure 2: Syntactic Dependencies Tab. down into more specific analysis. This creates a seamless navigation of linguistic data in an efficient way. The 4.1.3. Exploring N-Grams presentation of the app components was divided into two N-grams are explored through visualisations, including main sections. The first component gives users tools to connection networks. These networks are developed within explore linguistic features and the second one gives the Network Analysis (NA) approach. The power of this information on Twitter metrics. Due to the limitations on analysis comes from its capability of observing PRISPEVKI 59 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 relationships between components. This technique has selected to observe by account type, giving more been implemented in other fields, such as psychology granularity of exploration. Another timeline visualization is (Jones et al., 2021; Mullarkey at al., 2019), and social applied to N-grams. This has been used to observe lexical network research (Clifton and Webster, 2017; innovations (c.f. Grieve et al., 2019), by looking at N-grams Würschinger, 2021). NA visualizations follow the that increase in terms of frequency across time. This tool assumption that if a relationship is meaningful within the can facilitate this type of analysis. whole network, it will stand out from other relationships by stronger connections than random or weaker relationships. In this analysis, the connections are based on the frequencies which connect n-grams. Here we use the functionality from the visNetwork (Almende, 2021) package. Figure 5: Twitter Timeseries Count Tab. The second visualization implemented is a world map showing the region source information of tweets. 
The purpose is to visualize the main geographical areas from where the tweets come. We use the functionality from Figure 3: N-Grams Visualisation Tab showing a network echarts4r package, which is very efficient at displaying this relationships. type of information, as well as being interactive in an online context. 4.1.4. Exploring Entities We use a different visualization approach for the entities captured in the corpus. We use bar plots and word clouds. The advantage of bar plots is that they show the frequencies in a way that we can see from the most frequent to the least frequent, organized from left (most frequent) to right (least frequent). Word clouds are an easy and user- friendly way to represent frequencies. Here, more frequent words are represented with larger fonts than less frequent words. An example for the organizations mentioned in the corpus is shown in the figure below. At the top, we see the bar plot and at the bottom the word cloud. Figure 6: Twitter Map Tab. 5. Discussion The app presents a wide range of visualizations and analyses from the Twitter corpus. The features capture different linguistic layers, including morphology, syntax, and n-grams. With the inclusion of Twitter metrics, this tool gives all exploration opportunities to understand the whole corpus. R and shiny R have proven to be an efficient combination to develop and deploy the corpus. For the text Figure 4: Named Entities Visualisation Tab. processing tasks, the use of the UDPipe and tidytext packages have been highly effective. The in-built functions 4.2. Twitter Data Metrics have been used and we have created our custom-made functions to complete the tasks done throughout the whole The final section shows relevant Twitter data metrics, process. For visualization tasks, the combination of for which we dedicate two sections. 
The first one is a ggplot2, plotly, visNetwork, and echarts4r has timeline visualization using a combination of the ggplot2 demonstrated efficient to represent complex linguistic package and the plotly (Sievert, 2020) package. This features and relationship analysis. The app can be accessed combination gives ggplot2 plots interactive power. The through the following GitHub repository: timeline displays the number of posts across time, for all https://github.com/simongonzalez/ILiAD. the data available in the corpus. This timeline can also be PRISPEVKI 60 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 6. Conclusion Winston Chang, Joe Cheng, JJ Allaire, Carson Sievert, In this paper, we have presented the development of a Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan linguistic corpus based on the Twitter posts. It has been McPherson, Alan Dipert and Barbara Borges. 2019. designed to be used by a diversity of audiences who are shiny: Web Application Framework for R (Version 1.3.2) interested in exploring linguistic patterns from corpora [R package]. Retrieved from https://CRAN.R- based on social media language. Similar tools have been project.org/package=shiny developed with invaluable contributions to the field of Snigdha Chaturvedi, Shashank Srivastava and Dan Roth. Corpus Linguistics. Our proposal, however, makes stronger 2018. Where have I heard this story before? Identifying integrations with a variety of visualization types that narrative similarity in movie remakes. In: Proceedings of enhance the analysis in a holistic way. The tool also gives the 2018 Conference of the North American Chapter of users interactive and reactive power throughout all the data, the Association for Computational Linguistics: Human which not only offers a corpus to analyse, but a corpus to Language Technologies, Vol. 2, pages 673–678. 
New interact with and query in a more organic way, compared Orleans, Louisiana. Association for Computational to more traditional approaches of presenting corpora. Linguistics. Finally, it has been developed within an open-source Allan Clifton and Gregory D. Webster. 2017. An framework, making it freely available to any user interested introduction to social network analysis for personality in using and even expanding this tool. and social psychologists. Social Psychological and Personality Science, 8(4):442–453. 7. Future Work Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau and Yoshua Bengio. 2014. On the properties of neural In the current version, we have selected a relatively machine translation: Encoder-decoder approaches. small number of users in the corpus, as compared to other CoRR, abs/1409.1259. larger projects with similar goals. This is to allow the John Coene. 2022. echarts4r: Create Interactive Graphs implementation of the interactive capability in the with 'Echarts JavaScript', Version 5. visualization methods, which requires a high level of https://echarts4r.john-coene.com/. computational power. We aim to add more data in future Jelske Dijkstra, Wilbert Heeringa, Lysbeth Jongbloed- versions using more efficient processing algorithms. Faber and Hans Van de Velde. 2021. Using Twitter Data Finally, we see the value of adding linguistic analysis to for the Study of Language Change in Low-Resource emoticons. In a future version, we aim to include analysis Languages. A Panel Study of Relative Pronouns in on emoticons, as a distinctive component of social media Frisian. Frontiers in Artificial Intelligence, 4:644554. language. Simona Fer. 2018. The Language of Journalism: Particularities and Interpretation of Its Coexistence with 8. Acknowledgements Other Languages (February 22, 2018). 
Available at I want to thank the anonymous reviewers of this paper SSRN: https://ssrn.com/abstract=3128134 or for their invaluable comments and insights in the shape and http://dx.doi.org/10.2139/ssrn.3128134 content of the final version. Their generosity and expertise Alex Graves and Jürgen Schmidhuber. 2005. Framewise have improved this paper in innumerable ways and saved phoneme classification with bidirectional LSTM and me from many errors. Those that inevitably remain are other neural network architectures. Neural Networks, entirely my own responsibility. pages 5–6. Stefan Th. Gries. 2009. Quantitative Corpus Linguistics 9. References with R. London and New York: Routledge. Anne Abeillé and Danièle Godard. 2000. French word Jack Grieve1, Chris Montgomery, Andrea Nini, Akira order and lexical weight. In: R. Borsley, ed., The Nature Murakami and Diansheng Guo. 2019. Mapping Lexical and Function of Syntactic Categories, Syntax and Dialect Variation in British English Using Twitter. Semantics, Academic Press, pages 325–358. Frontiers in Artificial Intelligence. 2(11). doi: Benoit Thieurmel. 2021. visNetwork: Network 10.3389/frai.2019.00011. Visualization using 'vis.js' Library. R Package Version Jack Grieve, Andrea Nini and Diansheng Guo. 2018. 2.1.0. Mapping Lexical Innovation on American Social Media. Kenneth Benoit, David Muhr and Kohei Watanabe. 2021. Journal of English Linguistics, Vol. 46, pages 293–319. stopwords: Multilingual Stopword Lists, (Version 2.2) Jack Grieve. 2015. Dialect Variation. In: Douglas Biber and [R package]. Retrieved from Randi Reppen, eds., The Cambridge Handbook of https://github.com/quanteda/stopwords English Corpus Linguistics. Cambridge University Danah Boyd and Jeffrey Heer. 2006. Profiles as Press. Conversation: Networked Identity Performance on Susan C. Herring. 2001. Computer-mediated discourse. In: Friendster. In: Proceedings of the 39th Annual Hawaii D. Schiffrin, D. Tannen and H. 
Hamilton , eds., The International Conference on System Sciences Handbook of Discourse Analysis, (Oxford: Blackwell (HICSS'06), 2006, pages 59c–59c, doi: Publishers), pages 612–634. 10.1109/HICSS.2006.394. Kurt Hornik. 2019. openNLP: Apache OpenNLP Tools Subhadip Chandra, Randrita Sarkar, Sayon Islam, Soham Interface, (Version 0.2-7) [R package]. Nandi, Avishto Banerjee and Krishnendu Chatterjee. Kurt Hornik. 2020. NLP: Natural Language Processing 2021. Sentiment Analysis on Twitter Data: A Infrastructure, (Version 0.2-1) [R package]. Comparative Approach. International Journal of Payton J. Jones, Ruofan Ma and Richard J. McNally. 2021. Computer Science and Mobile Applications, 9(10):01– Bridge Centrality: A Network Approach to 12. Understanding Comorbidity. Multivariate Behavioral PRISPEVKI 61 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Research, 56(2):353–367, DOI: Balti- more, Maryland, June. Association for 10.1080/00273171.2019.1614898 (2021). Computational Linguistics. Daniel Jurafsky and James H. Martin. 2021. Speech and Benedikt Szmrecsanyi. 2011. Corpus-based dialectometry: Language Processing: An introduction to natural a methodological sketch. Corpora, 6(1):45–76. DOI: language processing, computational linguistics, and 10.3366/cor.2011.0004 | preprint pdf speech recognition. All rights reserved. Draft of June Eric S. Tellez, Daniela Moctezuma, Sabino Miranda and December 29, 2021. Mario Graff. 2021. A large scale lexical and semantic Michael W. Kearney. 2019. rtweet: Collecting and analysis of Spanish language variations in Twitter. analyzing Twitter data. Journal of Open Source ArXiv, abs/2110.06128. Software, 4(42), 1829. doi: 10.21105/joss.01829, R Lilia I. Trius and Nataliya V. Papka. 2022. Some Aspects package version 0.7.0, of Online Discourse Manipulation on Social Media (the https://joss.theoj.org/papers/10.21105/joss.01829. 
case of Instagram English Presentational Discourse of Tze Siew Liew and Hanita Hassan. 2021. The search for Pfizer Inc.). Current Issues in Philology and national identity in the discourse analysis of YouTube Pedagogical Linguistics. comments. Journal of Language and Linguistic Studies. Hadley Wickham. 2016. ggplot2: Elegant Graphics for Spiros Moschonas. 2014. The Media On Media-Induced Data Analysis. Springer-Verlag New York. ISBN 978-3- Language Change. In: J. Androutsopoulos, ed., 319-24277-4, https://ggplot2.tidyverse.org. Mediatization and Sociolinguistic Change, pages 395– Jan Wijffels, Sacha Epskamp, Ingo Feinerer and Kurt 426. Berlin, Boston: De Gruyter. Hornik. 2021. textplot: Visualise complex relations in https://doi.org/10.1515/9783110346831.395. texts, (Version 0.2.0) [R package]. Retrieved from Michael C. Mullarkey, Igor Marchetti and Christopher G. https://github.com/bnosac/textplot Beevers. 2019. Using Network Analysis to Identify Gita Wincana, Wahyudi Rahmat and Ricci Gemarni Central Symptoms of Adolescent Depression. Journal of Tatalia. 2022. Linguistic Tendencies of Anorexia Clinical Child and Adolescent Psychology, 48(4):656– Nervosa on Social Media Users Facebook (Pragmatic 668, DOI: 10.1080/15374416.2018.1437735 (2019). Study). Journal of Pragmatics and Discourse Research. Ryotaro Nagase, Takahiro Fukumori and Yoichi Quirin Würschinger. 2021. Social Networks of Lexical Yamashita. 2021. Speech Emotion Recognition with Innovation. Investigating the Social Dynamics of Fusion of Acoustic- and Linguistic-Feature-Based Diffusion of Neologisms on Twitter. Frontiers in Decisions. In: 2021 Asia-Pacific Signal and Information Artificial Intelligence. 4:648583. doi: Processing Association Annual Summit and Conference 10.3389/frai.2021.648583. PMID: 34790894; PMCID: (APSIPA ASC), pages 725–730. PMC8591557. Brendan O'Connor, Michel Krieger and David Ahn. 2010. TweetMotif: Exploratory Search and Topic Summarization for Twitter. ICWSM. 
Ruth Page, David Barton, Carmen Lee, Johann Wolfgang Unger and Michele Zappavigna. 2014. Researching Language and Social Media (1st ed.). Taylor and Francis. Retrieved from https://www.perlego.com/book/1559453/researching- language-and-social-media-pdf (Original work published 2014) R Core Team 2021. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R- project.org/. Tyler Rinker. 2017. entity: Named Entity Recognition, (Version 0.1.0) [R package]. Carson Sievert. 2020. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. ISBN 9781138331457, https://plotly-r.com. Julia Silge and David Robinson. 2016. tidytext: Text Mining and Analysis Using Tidy Data Principles. In: JOSS, 1(3). doi:10.21105/joss.00037, http://dx.doi.org/10.21105/joss.00037. Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics. Jana Straková, Milan Straka and Jan Hajič. 2014. Open- source tools for morphology, lemmatization, pos tagging and named entity recognition. 
In: Proceedings of 52nd An- nual Meeting of the Association for Computational Lin- guistics: System Demonstrations, pages 13–18, PRISPEVKI 62 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Raba Kolokacijskega slovarja sodobne slovenščine pri prevajanju kolokacij Martin Anton Grad,* Nataša Hirci† *,† Oddelek za prevajalstvo, Filozofska fakulteta Univerze v Ljubljani Aškerčeva 2, 1000 Ljubljana martin.grad@ff.uni-lj.si natasa.hirci@ff.uni-lj.si Povzetek Prispevek povzema izsledke raziskave, ki se je ukvarjala z rabo Kolokacijskega slovarja sodobne slovenščine (KSSS) pri prevajanju kolokacij iz angleščine v slovenščino. Dodiplomski študenti Oddelka za prevajalstvo FF UL so prevajali besedilo, v katerem je bilo označenih deset kolokacij. Med procesom prevajanja se je dogajanje na zaslonu snemalo, tako da je bilo po koncu naloge mogoče analizirati tako prevodne rešitve posameznih kolokacij kot tudi prevajalski proces, pri čemer smo se osredotočali zlasti na rabo jezikovnih virov, med katerimi nas je najbolj zanimala raba KSSS. Rezultati so pokazali, da je vključevanje KSSS v pedagoški proces uspešno, saj so vsi študenti, ki so sodelovali v raziskavi, z njim seznanjeni in ga pri svojem prevajalskem delu tudi aktivno uporabljajo. Pri tem so se med študenti pokazale občutne razlike glede tega, kako dobro poznajo napredne funkcije, ki jih KSSS nudi, in posledično kako uspešni so pri iskanju ustreznih kolokacij. Raziskava je pokazala tudi, da raba jezikovnih virov ne vodi nujno do optimalne prevodne rešitve. Using the Collocations Dictionary of Modern Slovene in the Process of Translating Collocations The paper outlines the findings of the study on how the Collocations Dictionary of Modern Slovene (KSSS) is utilized when translating collocations from English into Slovene. 
Undergraduate students of the Department of Translation Studies at the Faculty of Arts in Ljubljana were asked to translate a text with ten selected collocations. During the translation process, their on-screen activities were recorded to allow for the analysis of both the translation solutions and the translation process itself, focusing on the use of language resources, in particular KSSS. The results show that the integration of KSSS into the training process is successful, as all the students participating in the study were familiar with this resource and actively use it in their translation work. However, the study has also exposed significant differences between students in their familiarity with the advanced features of KSSS and, consequently, in their success and efficiency in finding appropriate collocations. The results of the study have also highlighted that the use of language resources does not necessarily lead to an optimal translation solution.

1. Introduction

Collocations are a fascinating but at the same time very elusive linguistic phenomenon. We begin by summarizing the definitions of Gantar et al. (2021), which form the basis for the inclusion of collocations in the Collocations Dictionary of Modern Slovene (KSSS), whose use the present paper describes (cf. Kosem et al., 2018). In defining collocations, the authors highlight the statistical, syntactic and semantic aspects.

Atkins and Rundell define collocations as "recurrent combinations of words in which a particular lexical element (the base) shows a clear tendency to co-occur with another lexical element (the collocator), with a frequency greater than chance" (2008: 302). Gantar et al. additionally point to the problem of collocations whose components do not usually occur together in a text, i.e. other elements intervene between them; they call these "extended collocations" (2021: 19).

Besides the statistical aspect, collocations are also defined by a syntactic aspect, since a hierarchical relationship holds between the collocational elements, in which the base determines the collocator (Hausmann, 1984: 401).

Gantar et al. (2021) present a third aspect, the most important and at the same time the most problematic one: the semantic aspect of collocations. It is closely connected to the statistical aspect, which places collocations between two poles, free word combinations on the one hand and fully fixed multi-word units on the other, which in turn affects semantic change and restrictions on the choice of collocational elements.

In defining collocations, the authors cite two key approaches: a narrower one, which recognizes collocations as an independent type of phraseological unit that is partly or fully (semantically and syntactically) frozen, and a broader one, which also counts as collocations frequent word combinations whose internal combinability is not narrowly restricted or even exclusive, so that the set of co-occurring words can be relatively open (Gantar et al., 2021: 20–21).

Although collocations pose a problem, especially when encountered in a foreign language, where comprehension cannot rely on the syntactic or semantic relations of one's mother tongue, since the two languages involved do not necessarily form a given collocation in the same way, collocation dictionaries are a useful aid for native speakers as well, particularly in domains they are less familiar with. The added value of such language resources for translators is that they display a wide range of collocations in one place and in a transparent way, from which the one that is most appropriate both semantically and contextually can then be selected.

2. Aim of the study and overview of the field

Because of their linguistic and cultural specificity, collocations are extremely interesting from the point of view of translation, and they have been studied both by Slovene (e.g. Gabrovšek, 2014; Jurko, 2014; Sicherl, 2004; Vrbinc, 2006) and international authors (Kwong, 2020; McKeown and Radev, 2000, etc.). From a translation perspective, it matters how translators solve the problems that arise when translating collocations, since only appropriate translation strategies can lead to appropriate translation solutions. When searching for possible translation solutions and confirming translation equivalents, translators can draw on various language resources. One research approach that can shed light on how translators arrive at appropriate solutions and which resources they use are user studies, which naturally differ according to the needs of the target group (cf. Arhar Holdt et al., 2015; Arhar Holdt et al., 2016; Pori et al., 2020; Pori et al., 2021). Some authors (Rozman, 2004; Stabej, 2009; Logar Berginc, 2009; Arhar Holdt, 2015) warned some time ago that user studies were lacking in Slovenia, but the situation has recently been changing: quite a few user studies have been carried out among translators, interpreters, students, teachers of Slovene, proofreaders and linguists, as well as others who work with languages professionally (cf. Čibej et al., 2015; Gorjanc, 2014; Hirci, 2013; Pori et al., 2021; Mikolič, 2015; Šorli and Ledinek, 2017; Arhar Holdt et al., 2017).

The present paper reports a user study concerned with the translation of collocations and the use of language resources in the search for translation solutions. It presents the findings of a study examining the process of translating collocations from English into Slovene among translation students, with an emphasis on the use of KSSS in the translation process, which no research has addressed so far.

Research into the translation process offers valuable information about the strategies needed to translate a given text from the source into the target language. An in-depth overview of the key research questions, the most frequently used technologies and the development trends in this field is provided by Jakobsen (2017) (cf. Hansen, 2009; Hvelplund, 2019; etc.). In Slovenia, translation process research is less well explored (cf. Hirci, 2012).

Students at the Department of Translation Studies of the Faculty of Arts, University of Ljubljana are introduced to KSSS in the first year of their undergraduate studies. Fifteen students took part in the study: eight from the second year (labelled II-1 to II-8 below) and seven from the third year (III-1 to III-7) of the undergraduate programme.

The translation task consisted of a popular-science article¹ and translation instructions. In the 437-word article on an astronomy topic, 10 collocations were marked. Although the entire text was available, since context is crucial when translating collocations, the students only had to translate the sentences containing the marked collocations. Even though marking the collocations under analysis departs from an authentic translation situation, we took this step to keep the analysis consistent, which was practically necessary in view of the small number of participants.

To obtain as much data as possible about the translation process, the work took place on the Zoom platform, where the participants used the screen-sharing option; this made it possible to follow the translation process, and everything on the screen was recorded for later analysis.

The translation task allowed analysis from two different angles. The first is the translation itself, i.e. the individual translation solutions for the collocations and the sentences containing them; the second is the recording of the translation process, especially of the use of language resources, KSSS in particular.

In the first part of the analysis we focused on whether a translation is adequate, i.e. whether the solution is an appropriate collocation acceptable in the target language and whether it also corresponds semantically to the original. For inadequate or only conditionally acceptable solutions, we added a comment for each collocation on the aspect that seemed problematic.

Although the translation task contained 10 collocations (Table 1), space limitations restrict this paper to three of them, which nevertheless offer an interesting insight into the whole range of difficulty, problems, searches for translation solutions, and ways of using KSSS and other language resources.

The recording of the translation work provided insight into how language resources were used. The review of each collocation's translation solution was followed by a review of the screen recording, in which we analysed how the students used language resources to arrive at a translation equivalent: which resources, how efficiently (where this was evident from the recording), and how they concluded that a given solution was the most appropriate.

¹ Link to the article: http://news.bbc.co.uk/2/hi/science/nature/1006305.stm

3. Results

Sentences with the marked collocations:
1   Astronomers say reports that the Earth could be struck by a small asteroid in 2030 are wildly exaggerated.
2, 3   Less than a day after (2) sounding the alert about asteroid 2000SG344, a (3) revised analysis of the space rock's orbit shows it will in fact miss the Earth by about five million kilometres.
4   Some scientists have criticised the way the information was released to the media before it had been thoroughly confirmed.
5   Threat rating*
6   On Friday, the International Astronomical Union issued an alert saying that the object had about a 1-in-500 chance of striking the Earth on 21 September 2030.
7   Were it to strike our planet, the results would be devastating, with an explosion greater than the most powerful nuclear weapon.
8   The new orbit reveals a slight risk of a collision with the Earth about 2071, but it is thought that when the orbit is better known this risk will disappear as well.
9   If it is manmade and did strike Earth, the effects would be very local and limited.
10   Some scientists have criticised the IAU and Nasa for releasing warnings about the asteroid only for those warnings to be rescinded less than a day later.
*subheading

Table 1: Overview of the sentences with marked collocations.

Collocation no. 4 proved the most problematic, as three translation solutions were entirely inadequate and three only partly adequate. We classified as entirely inadequate those solutions that are either syntactically inappropriate in the target language or semantically inappropriate, even where a high-frequency Slovene collocation was used. As only partly adequate we classified collocations that drifted too far semantically from the original in translation, or whose syntactic form is atypical. Below, all translation solutions marked as inadequate or partly adequate are shown, together with the language resources the students used when searching for them.

Collocation no. 4: "[…] the way the information was released […]"
II-1: "skritizirali način, kako je bila informacija […] deljena z mediji" (no resources)
II-5: "skritizirali način izdaje podatkov medijem" (English collocation dictionary ozdic.com)
II-7: "kritizirali način, da so novico objavili" (online English-Slovene dictionary Pons, KSSS)
III-2: "način, ki je bil uporabljen za posredovanje informacij javnosti" (comprehensive English-Slovene dictionary (Amebis ASP32 database viewer), KSSS)
III-3: "način, na katerega so bile informacije sporočene medijem" (Evrokorpus, KSSS)
III-5: "dejstvo, da so mediji objavili informacijo" (Gigafida corpus)

Table 2: Translation solutions and resources used for collocation no. 4.

Collocations nos. 2 and 6 are presented together, since they are very similar both semantically and syntactically, yet the first caused the students considerable difficulty, while the solutions for the second were, with two exceptions, entirely adequate. Below, all translation solutions marked as inadequate or partly adequate are shown, together with the language resources the students used.

Collocation no. 2: "[…] after sounding the alert about […]"; collocation no. 6: "[…] issued an alert saying […]"
II-1: no. 2 "dan po tem [sic] ko so sprožili alarm" (KSSS); no. 6 "izdala opozorilo" (no resources)
II-3: no. 2 "dan po sproženem alarmu" (online English-Slovene dictionary Pons, KSSS); no. 6 "izdala opozorilo" (Google, monolingual online dictionaries Collins and Cambridge, KSSS)
II-6: no. 2 "dan po sprožitvi alarma" (online English-Slovene dictionary Pons, monolingual online dictionary Merriam-Webster, KSSS); no. 6 "izdala opozorilo" (KSSS)
II-8: no. 2 "dan po sprožitvi alarma" (online English-Slovene dictionary Pons, the Fran portal, KSSS, Gigafida corpus); no. 6 "izdala opozorilo" (online English-Slovene dictionary Pons, KSSS, Gigafida corpus)
III-2: no. 2 "dan potem [sic] ko so sprožili alarm" (no resources); no. 6 "sprožila alarm" (Google, the Fran portal, KSSS, Gigafida corpus)
III-3: no. 2 "dan po sproženem alarmu" (Google, monolingual online dictionary Merriam-Webster, KSSS); no. 6 "sprožil alarm" (Evrokorpus, EUR-Lex, KSSS)

Table 3: Translation solutions and resources used for collocations nos. 2 and 6.

4. Discussion

4.1. Collocation no. 4

From a translation standpoint, collocation no. 4 ("release information") proved the most demanding. It should be noted that the original itself is somewhat problematic: the noun phrase "the way", functioning as the direct object of the verb "criticise", is most often used in the sense of "criticising the manner in which /…/". In the analysed example, however, something else is meant: the scientists criticised the fact that this information was passed to the media at all, not the manner in which this was done. The students whose solutions were marked as inadequate or only partly adequate seem to have read the original too literally, without taking into account the broader context that would have allowed the correct interpretation, even though the whole text was available to them.

II-1: Although the collocation of the verb "deliti" ('share') with the noun "informacija" ('information') is established and appropriate in the given context, it rarely occurs in the passive. Given that the student used no language resources, we can only speculate that examples in the active voice might have led him to a different solution.

II-5: The solution "način izdaje podatkov medijem" is problematic in its literalness and also questionable as a collocation, since the verbal noun "izdaja" with a genitive noun mostly denotes the publication of a printed work (a book, a magazine, etc.) or the issuing of shares, money, etc. into circulation. When searching for a solution, the student used the English collocation dictionary ozdic.com with the search strings "information" and "released information", but this was not followed by a correction of the chosen solution.

II-7: The translation "kritizirali način, da so novico objavili" is a semantic error largely based on a literal reading of the original. While "kritizirati način" is unproblematic as a collocation, the clause that follows is more problematic: when the noun "način" follows the verb "kritizirati", it is normally followed by a relative clause, not an object clause. The student used two online language resources: the English-Slovene dictionary Pons to look up equivalents of "release", "some" and "thoroughly", and KSSS to search for collocations with the noun "novica", without using any filters.

III-2: The solution "način, ki je bil uporabljen za posredovanje informacij javnosti" is collocationally appropriate but drifts semantically from the original. The student used two language resources: the comprehensive English-Slovene dictionary (Amebis ASP32 database viewer) for the verb "criticise", and KSSS for the noun "informacija", without using any filters.

III-3: The solution "način, na katerega so bile informacije sporočene medijem" is very similar to case III-2 in terms of its semantic deviation; it is collocationally appropriate, and only the passive form is stylistically questionable. The student used Evrokorpus with the search strings "released information" and "information released", and KSSS with the search string "izdaja podatkov", which returned no hits. Had she instead searched only for the verbal noun "izdaja", she could, by selecting the filter "with nouns/2-genitive", quickly have reached an appropriate solution.

III-5: The solution "dejstvo, da so mediji objavili informacijo" is stylistically and collocationally appropriate, but it too departs semantically from the original, since on this interpretation the media are the object of criticism. The original in fact says that the scientists criticised the fact that the media received this information at all, i.e. they indirectly criticised their professional colleagues, not the media themselves. When searching for a solution, the student used only the Gigafida corpus with the search string "objaviti informacijo".

4.2. Collocations nos. 2 and 6

Regarding the collocation "sound the alert", it should be noted that the semantically similar collocation "sound the alarm" is considerably more frequent in English. This is most likely why, in as many as six cases, the students opted in their translations for the noun "alarm" instead of "opozorilo" ('warning'): in the language resources they either found the equivalent ("alert" → "alarm") or confirmation that the English variant "sound the alarm" is more frequent. From then on, they searched the language resources only for collocations with the noun "alarm". All of them chose the correct but very literal collocation "sprožiti alarm" ('trigger the alarm').

II-1: The student chose the solution "dan po tem ko so sprožili alarm", using KSSS as his resource, where he first searched for collocations of the verb "sprožiti" and then of the noun "alarm"; there he also found the collocation "sprožiti alarm". For collocation no. 6 he chose a different solution, "izdati opozorilo", without using any language resources.

II-3: The student arrived at "dan po sproženem alarmu" via the online English-Slovene dictionary Pons with the search string "sounding", where she found the collocation "to sound the alarm"; this was followed by a KSSS search for "alarm", where she found "sprožiti alarm". For collocation no. 6 she, too, chose a different solution, "izdati opozorilo", which she reached via Google (search string "issue an alert"); she followed a link to the online monolingual dictionary Collins, and since she did not find the string there, she entered the same collocation into the online Cambridge dictionary. Finding no hit there either, she returned to Google. It is not evident from the screen recording why, but without clicking on any of the results she used KSSS with the search string "opozorilo", where, despite using no filter, she found "izdati opozorilo" on the first page of the most frequent collocations. This finding was not, however, followed by a correction of collocation no. 2.

II-6: Student II-6 also settled on a similar solution, "dan po sprožitvi alarma". Her first language resource was the English-Slovene dictionary Pons with the search string "alert", where she found the equivalent "alarm". This was followed by a search in the monolingual Merriam-Webster dictionary for the entry "sound", and then by a KSSS search for collocations of the noun "alarm". There she examined four collocations in context; three hits were for "sprožiti lažni alarm" ('set off a false alarm') and one for "sprožiti požarni alarm" ('set off a fire alarm'), which did not shake her conviction that this was the right solution. Regardless of that decision, for collocation no. 6 she too chose a different solution, "izdati opozorilo", which she reached directly via KSSS with the search string "opozorilo". In this case as well, the finding was not followed by a correction of collocation no. 2.

II-8: The solution "dan po sprožitvi alarma" is identical to student II-6's, and the language resources used in the search were also similar. The first was the English-Slovene dictionary Pons for the search string "alert", followed by a search on the Fran language portal for the entry "alarm". She then searched KSSS for collocations of the noun "alarm", using no filter. This was followed by checking the collocation "sprožiti alarm" in the Gigafida corpus with the search string "alarm", where she dwelt on that collocation among the concordances, but then changed it in her translation solution to "sprožitev alarma". For collocation no. 6 she chose a different solution, the collocation "izdati opozorilo". She first used the English-Slovene dictionary Pons ("issue" and "alert"), followed by a KSSS search for collocations of the noun "opozorilo" (first without filters, then with the filter "with verbs", although she did not persist in reviewing the hits). A search of the Gigafida corpus ("opozorilo") followed, where she found no concordance she considered suitable. She then used KSSS again and, among the results (again without a filter), found her final solution, "izdati opozorilo".

III-2: The student arrived at the grammatically and semantically problematic solution "dan potem [sic] ko so sprožili alarm" without using any language resources. This may also be why she chose the same solution for collocation no. 6, for which, however, she did use quite a few language resources.
Svoje iskanje je PRISPEVKI 67 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 začela v brskalniku Google z iskalnim nizom “izdati Zlasti pri prevodnih strategijah kolokacije št. 4 se rdeč [sic] alarm”, pregledala nekaj zadetkov je pokazalo, da jezikovni viri, kljub temu, da (večinoma časopisnih naslovov) in nadaljevala iskanje prevajalcu lahko pomagajo priti do prevodnih rešitev, na portalu Fran z iskalnim nizom “alarm”. Isto geslo ki so parcialno pravilne, ne morejo vedno preprečiti je nato uporabila še pri KSSS. Sledilo je iskanje po napačne interpretacije izvirnika in posledično korpusu Gigafida (na portalu CJVT) z naprednimi neposrečenih prevodnih rešitev, ki so rezultat funkcijami “okolica” in “levo 1”, “desno 0”. Med združevanja besed, besednih zvez in kolokacij, ki so besednimi vrstami na levi je nato uporabila filter same zase ustrezne, kot celota pa preprosto ne “glagol”, kjer je po pogostosti izstopala kolokacija, ki funkcionirajo. Pri prepoznavanju in reševanju jo je študentka nato izbrala kot prevodno rešitev. tovrstnih situacij zagotovo ključno vlogo odigra tudi Omenjeni primer uporabe naprednih funkcij se zdi dobro znanje materinščine. zelo poveden, saj dokazuje, da zgolj tehnična Izpostaviti je treba tudi dejstvo, da se je od podkovanost pri rabi jezikovnih virov ne zagotavlja študentov pričakovalo, da bodo prevedli povedi, v nujno tudi ustrezne prevodne rešitve. Čeprav omogoča katerih so bile kolokacije vnaprej označene. 
Vendar pa časovno učinkovitost pri iskanju možnih kolokacij, je povsem mogoče predvideti tudi scenarij, kjer bi se mora prevajalec še vedno sprejeti končno odločitev o prevajalec odločil za prevodno rešitev, v kateri izvirna tem, katera izmed ponujenih možnosti predstavlja kolokacija ne bi bila prevedena v obliki kolokacije, a najustreznejšo prevodno rešitev, pri čemer vedno igra bi bila s pomenskega in sobesedilnega vidika rešitev pomembno vlogo tudi prefinjen občutek za jezik. prav tako ustrezna. III-3: Tudi ta študentka se je pri kolokaciji št. 2 Kljub temu, da je KSSS enojezični jezikovni vir, odločila za prevodno rešitev “dan po sproženem se je v procesu prevajanja izkazal za zelo uporabnega. alarmu”. Začela je z brskalnikom Google (iskalni niz Pri tem se zdi, da ne gre toliko za možnost preverjanja, “souding [sic] the alert”), kjer je izbrala povezavo do ali določena besedna kombinacija sploh predstavlja spletnega slovarja Merriam-Webster za “raise/sound kolokacijo, temveč bolj za vpogled v širši nabor the alarm“. Nato je v KSSS iskala kolokacijo za možnih kolokacij, ki jih KSSS ponuja, izmed katerih samostalnik “alarm”, med rezultati opazila kolokacijo nato prevajalec lahko izbere tisto, ki je pomensko in “sprožen alarm” in jo tudi uporabila. Pri iskanju sobesedilno najprimernejša. prevodne rešitve kolokacije št. 6 je najprej uporabila Četudi se študenti prevajalstva že v prvem letniku Evrokorpus (“issue an alert”), od koder je sledila dodiplomskega študija seznanijo s KSSS in ga pri predlagani povezavi na EUR+Lex za isti iskalni niz, prevajalskih nalogah tudi uporabljajo, se je izkazalo, nazadnje pa je v KSSS ponovno iskala kolokacije za da njegovega ustroja ne poznajo vsi enako dobro. samostalnik “alarm”, opazila že znano kolokacijo Posledično se niti ne zavedajo možnosti, ki jih nudi, “sprožiti alarm“ in jo nato tudi uporabila. 
zato je v nekaterih primerih njihovo iskanje ustreznih Na koncu je treba poudariti, da je v predstavljeni kolokacij manj učinkovito, kar v skrajnem primeru uporabniški raziskavi šlo za vodeno nalogo, ki realno lahko privede tudi do tega, da kljub pravilno situacijo prevajanja tovrstnega besedila nekoliko vnesenemu iskalnemu nizu ne najdejo ustrezne popači, zato je pri posploševanju opažanj potrebna kolokacije. previdnost. Ena izmed možnosti, ki jih slovar ponuja, je Razlogi za to so poleg omejenega vzorca, tako z napredno iskanje s pomočjo filtrov, kjer je mogoče vidika števila sodelujočih kot tudi dejstva, da so v izbrati ustrezno kategorijo kolokatorja (npr. raziskavi sodelovali zgolj študenti dveh letnikov samostalnik, pridevnik, glagol itd.), v nekaterih dodiplomskega študija, vsaj trije. Prvi je ta, da so bili primerih pa tudi njegovo podkategorijo (npr. sklon študenti seznanjeni s tem, da se raziskava ukvarja z samostalnika, sklon pridevnika, predlog itd.). Pred rabo jezikovnih virov, kar je morda vplivalo na to, študenti, ki te funkcije niso poznali oz. je niso katere vire so uporabljali in kako. Drugi razlog je, da uporabili, se je v primerih iskanja na podlagi baze, ki so vedeli, da je poudarek naloge na prevajanju tvori kolokacije s številnimi kolokatorji, tako odprl kolokacij, zaradi česar je mogoče, da so se posledično nepregleden seznam kolokacij, deljen glede na na to prvino bolj osredotočali. Tretji razlog pa je ta, da različne kolokacijske vzorce, večina katerih je bila v se je za potrebe kasnejše analize proces prevajanja danem primeru neuporabnih. Študenti so s snemal, zaradi česar se študenti bolj zavedajo vsake pregledovanjem seznama po nepotrebnem izgubljali svoje poteze. Ali je – in če, do kakšne mere – to v dani čas, ki bi ga sicer lahko bolje izkoristili. situaciji vplivalo na sam proces prevajanja in končni izdelek je težko soditi, vsekakor pa je pri interpretaciji 5. 
Zaključek rezultatov in izpeljevanju zaključkov treba imeti v mislih tudi ta vidik. Uporabniška raziskava med študenti prevajalstva o rabi jezikovnih virov pri iskanju prevodnih rešitev PRISPEVKI 68 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 kolokacij ali kolokacijskih parov je postregla s Prototypical a – b. ELOPE: 11(2), str. 7–20. Ljubljana: številnimi zanimivimi izsledki in omogočila vpogled v Slovensko društvo za angleške študije. praktično rabo jezikovnih virov v prevajalskem Polona Gantar, Simon Krek in Iztok Kosem. 2021. procesu. Opredelitev kolokacij v digitalnih slovarskih virih za Obenem se je izkazalo, da so nekateri študenti slovenščino. V: I. Kosem, ur., Kolokacije v tehnično dobro podkovani in vešči uporabe naprednih slovenščini, str. 15–41. Ljubljana: Znanstvena založba funkcij, ki jih posamezni jezikovni viri nudijo, a se to Filozofske fakultete. ne odraža vedno tudi v kakovosti njihovih prevodnih Vojko Gorjanc. 2014. Slovar slovenskega jezika v rešitev. Pri tem ne gre nujno za jezikovno šibkost v digitalni dobi. Irena Grahek in Simona Bergoč, ur., E- maternem jeziku, kot morda bolj za prekomerno zbornik Posveta o novem slovarju slovenskega jezika zaupanje v jezikovni vir, pri čemer ne vzamejo v obzir na Ministrstvu za kulturo. Ljubljana: Ministrstvo za možnih razlik med jezikovno materijo, ki jo prevajajo, kulturo RS. in primeri, ki jih ponujajo jezikovni viri. Gyde Hansen. 2009. Some thoughts about the evaluation Raziskava je tako izpostavila tudi zelo konkretno of translation products in translation process pomanjkljivost, ki bi jo v pedagoškem procesu v research. Copenhagen Studies in Language 38, str. prihodnje veljalo bolje nasloviti. Med raziskavo so se 389–402. Copenhagen: Samfundslitteratur. prav tako odprla številna vprašanja, povezana z rabo Franz Josef Hausmann. 1984. 
Wortschatzlernen ist jezikovnih virov v procesu prevajanja, ki bi jih bilo Kollokationslernen. Zum Lehren und Lernen smiselno nasloviti v prihodnjih raziskavah. französischer Wortverbindungen. V: Praxis des neusprachlichen Unterrichts, 31, str. 395–406. Raziskovalni program št. P6-0215 (Slovenski jezik – bazične, Dortmund: Lensing. kontrastivne in aplikativne raziskave) je sofinancirala Javna Nataša Hirci. 2012. Electronic Reference Resources for agencija za raziskovalno dejavnost Republike Slovenije iz Translators. The Interpreter and Translator Trainer državnega proračuna. (6) 2, str. 219–36. London: Taylor & Francis. Nataša Hirci. 2013. Changing trends in the use of 6. Literatura translation resources: the case of trainee translators in Špela Arhar Holdt. Slovenia. ELOPE 10, str. 149–165. Ljubljana: 2015. Uporabniške raziskave za Slovensko društvo za angleške študije. potrebe slovenskega slovaropisja: prvi koraki. V: V. Kristian Tangsgaard Hvelplund. 2019. Digital resources Gorjanc et al., ur., Slovar sodobne slovenščine: problemi in rešitve in the translation process – attention, cognitive effort , str. 136-148. Ljubljana: Znanstvena založba Filozofske fakultete and processing flow. Perspectives 27 (4), str. 510–24. . Špela Arhar Holdt, Jaka Čibej in Ana Zwitter Vitez. London: Taylor & Francis. Arnt Lykke Jakobsen. 2017. Translation process 2017. Value of language-related questions and research. V John W. Schwieter in Aline Ferreira, ur., comments in digital media for lexicographical user The handbook of translation and cognition, str. 19–49. research. International journal of lexicography, 30(3), Hoboken: Wiley. str. 285–308. Oxford: OUP. Primož Špela Arhar Holdt, Iztok Kosem, in Polona Gantar. 2016. Jurko. 2014. Target language corpus as an encoding tool: collocations in Slovene-English Dictionary user typology: the Slovenian case. V: T. translator training. ELOPE 11 (1), str. 153–64. Margalitadze in G. 
Meladze, ur., Lexicography and Ljubljana: Slovensko društvo za angleške študije linguistic diversity. Proceedings of the XVII Iztok Kosem, Simon Krek, Polona Gantar, Špela Arhar EURALEX International Congress, 6–10 September, Holdt, Jaka Čibej in Cyprian Laskowski. 2018. 2016, str. 179–187. Tbilisi: Ivane Javakhishvili Tbilisi Kolokacijski slovar sodobne slovenščine. V: D. Fišer State University. in Andrej Pančur, ur., Zbornik konference Jezikovne Beryl T. Sue Atkins in Michael Rundell. 2008. The tehnologije in digitalna humanistika / Proceedings of Oxford Guide to Practical Lexicography. New York: the conference on Language Technologies & Digital Oxford University Press. Jaka Čibej, Vojko Gorjanc in Damjan Popič. 2015. Vloga Humanities, 20-21, str. 133-139. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani. jezikovnih vprašanj prevajalcev pri načrtovanju Nataša Logar Berginc. 2009. Slovenski splošni in novega enojezičnega slovarja. V: V. Gorjanc et al., ur., terminološki slovarji: za koga? V Slovar sodobne slovenščine: problemi in rešitve : M. Stabej, ur., , str. Infrastruktura slovenščine in slovenistike. Obdobja 28, 168-181. Ljubljana: Znanstvena založba Filozofske str. 225–231. Ljubljana: Znanstvena založba fakultete. Dušan Gabrovšek. 2014. Extending Binary Collocations: Filozofske fakultete Univerze v Ljubljani. Oi Yee Kwong. 2020. Translating Collocations: The (Lexicographical) Implications of Going beyond the Need for Task-driven Word Associations. V: Proceedings of the Workshop on the Cognitive Aspects PRISPEVKI 69 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 of the Lexicon, str. 112–16. Association for Computational Linguistics. Kathleen R. McKeown in Dragomir R. Radev. 2000. Collocations. V Robert Dale et al., ur., Handbook of Natural Language Processing, str. 1–3. New York: Marcel Dekker. Vesna Mikolič. 2015. 
Slovarski uporabniki – ustvarjalci: ustvarjati v jeziku in z jezikom. V: V. Gorjanc et al., ur., Slovar sodobne slovenščine: problemi in rešitve, str. 182-195. Ljubljana: Znanstvena založba Filozofske fakultete. Eva Pori, Jaka Čibej, Iztok Kosem in Špela Arhar Holdt. 2020. The attitude of dictionary users towards automatically extracted collocation data: A user study. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research 8 (2), str. 168-201. Eva Pori, Iztok Kosem, Jaka Čibej in Špela Arhar Holdt. 2021. Evalvacija uporabniškega vmesnika Kolokacijskega slovarja sodobne slovenščine. V I. Kosem, ur., Kolokacije v slovenščini, str. 235–268. Ljubljana: Znanstvena založba Filozofske fakultete. Tadeja Rozman. 2004. Upoštevanje ciljnih uporabnikov pri izdelavi enojezičnega slovarja za tujce. Jezik in slovstvo 49 (3/4), str. 63-75. Ljubljana: Slavistično društvo Slovenije. Eva Sicherl. 2004. On the Content of Prepositions in Prepositional Collocations. ELOPE 1(1-2), str. 37–46. Ljubljana: Slovensko društvo za angleške študije Marko Stabej. 2009. Slovarji in govorci: kot pes in mačka? Jezik in slovstvo 54 (3−4), str. 115–138. Ljubljana: Slavistično društvo Slovenije. Mojca Šorli in Nina Ledinek. 2017. Language policy in Slovenia: language users’ needs with a special focus on lexicography and translation tools. V: I. Kosem et al., ur., Electronic lexicography in the 21st century: proceedings of eLex 2017 Conference, 19–21 September 2017, Leiden, The Netherlands, str. 377– 394. Brno: Lexical Computing. Marjeta Vrbinc. 2005. Native speakers of Slovene and their translation of collocations from Slovene into English: a Slovene-English empirical study. Erfurt Electronic Studies in English. Erfurt: Institut für Anglistik/Amerikanistik Erfurt. 
Acoustic modeling with different basic units for automatic Slovenian speech recognition

Lucija Gril,* Simon Dobrišek,‡ Andrej Žgank*
* Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru, Koroška cesta 46, 2000 Maribor
lucija.gril@um.si, andrej.zgank@um.si
‡ Fakulteta za elektrotehniko, Univerza v Ljubljani, Tržaška 25, 1000 Ljubljana
simon.dobrisek@fe.uni-lj.si

Abstract

The article presents an automatic speech recognition system for the Slovenian language. We used two different language resources to build the acoustic models, and two different basic acoustic units in the lexicon entries. Testing was carried out on a test set created within the project Development of Slovene in a Digital Environment (RSDO), containing a little under 5 hours of audio recordings. We used the hybrid HMM-DNN approach to build the acoustic models. Two types of neural networks were used, namely TDNN and LSTM. The best WER result was 24.95%, achieved with the TDNN architecture and a grapheme lexicon.

Acoustic modeling with various basic units for Slovenian automatic speech recognition

The article presents the automatic speech recognition system for the Slovenian language. We used two different language sources and lexicons based on two basic acoustic units. The system was tested on a test set containing a little less than 5 hours of sound recordings that was developed by the RSDO project. We used the hybrid HMM-DNN approach to build the acoustic models. Two types of networks were used for the neural networks, namely TDNN and LSTM. The best WER score was 24.95%, achieved with the TDNN architecture and a grapheme lexicon.

1. Introduction
Nowadays, smart environments accompany us at every step: smartphones, tablets, television sets, wristwatches, household appliances, and so on. All these devices aim to offer us a better and simpler user experience. They provide many services, and all of them require careful design of both hardware and software. One such service is automatic speech recognition (ASR). When recognizing speech, one must be aware that the software may work flawlessly, yet many other factors still influence how successfully it performs. One of them can be, for example, a poor microphone that captures a lot of noise and distorts the sound, thereby degrading speech recognition. This in turn also leads to a poorer user experience. Results can likewise deteriorate if the continuous speech recognizer is poorly designed and does not have optimal characteristics. It is therefore important to experimentally evaluate different architectures and designs of automatic speech recognition models.

Developing a speech recognizer requires a large amount of training material for the given language. For languages with many speakers, such material is usually plentiful. For languages with a smaller number of speakers, among which Slovenian can be counted, such resources are insufficient for advanced artificial intelligence methods such as end-to-end learning with convolutional networks. Recently, transfer learning has also come into frequent use, but both of these methods allow less control over the modeling than the hybrid approach used in this article. As a rule, the hybrid approach also achieves somewhat better results than the other two approaches.

An automatic speech recognizer requires speech recordings accompanied by transcription files containing a record of the spoken words. At the same time, we need a text corpus and a lexicon, from which the speech recognizer can learn the characteristics of words and their placement in context. Spoken words can be represented in an automatic speech recognizer with two different acoustic units: phonemes or graphemes. Phonemes are sound units that represent the pronunciation of the sounds in a word. The phonemic transcription of Slovenian pronunciation differs in most cases from the graphemic one. Graphemes and phonemes also differ in the number of basic units: a grapheme is written with one basic unit corresponding to a letter in the word, whereas the same letter can map onto several different phonemes, depending on context, stress and position in the word. A letter can likewise map onto a sequence of two phonemes. With this in mind, recognizer lexicons containing the spoken words can be built in two ways: with phonemes or with graphemes. The choice of lexicon type directly determines the basic acoustic unit of the speech recognizer. The choice of acoustic unit also affects the difficulty and manner of lexicon preparation, the complexity of the acoustic models, and thereby the memory and processing power required for training and running the automatic speech recognizer. Building a phoneme lexicon depends on the language in question; for Slovenian this task is relatively complicated and complex. A lexicon can be built manually, which is usually done by phoneticians or Slovenists, or automatically. With automatic procedures, however, the phonetic transcription of a word may turn out to be wrong, which in later steps is reflected in suboptimal training and speech recognition. Preparing a grapheme lexicon is easier, as the conversion is trivial. Which approach is more suitable also depends on the amount of training data used, since with phoneme lexicons, which are composed of more basic units, greater emphasis is placed on the numerical distribution across categories.
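The contrast between the two lexicon types can be illustrated with a toy example. The phoneme transcription below is an illustrative assumption, not the actual phone inventory used in the project:

```python
# Toy contrast between a grapheme and a phoneme lexicon entry.
# The phoneme symbols below are an illustrative assumption, NOT the
# actual RSDO/Sloleks phone inventory.

def grapheme_entry(word: str) -> list[str]:
    """A grapheme lexicon maps a word to its letters: one unit per letter,
    so the conversion is trivial."""
    return list(word)

# A phoneme lexicon must be looked up in an existing lexicon or produced
# by a grapheme-to-phoneme converter; the same letter may map onto
# different phonemes depending on context, stress and position.
phoneme_lexicon = {
    "govor": ["g", "O", "v", "O", "r"],  # assumed broad transcription
}

print(grapheme_entry("govor"))   # ['g', 'o', 'v', 'o', 'r']
print(phoneme_lexicon["govor"])
```

Note how the grapheme entry falls out of the spelling itself, while the phoneme entry encodes pronunciation decisions (here, the vowel quality) that the spelling does not show.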
Within the project Development of Slovene in a Digital Environment (RSDO, n. d.), a speech database is currently being built in parallel with the development of the first versions of an automatic speech recognizer. We therefore still have a rather limited amount of transcribed Slovenian speech at our disposal, which was the motivation for using the grapheme acoustic unit. It has already been shown in the past, both for Slovenian (Žgank and Kačič, 2006) and for other languages (Killer et al., 2003), that grapheme acoustic units can be a good solution in such cases. The goal of this article is therefore a comparison between phoneme and grapheme basic acoustic units in connection with the currently available speech resources.

In the remainder of the article we first review what has already been done for Slovenian in the area of modeling basic acoustic units for automatic speech recognition. In Section 3 we present the speech and language resources used in designing the experiments. Lexicon construction and rule-based automatic grapheme-to-phoneme conversion are presented in Section 4. In Section 5 we present the acoustic and language modeling of the automatic speech recognizer. The results are presented and discussed in Section 6, followed by the conclusion.

2. Review of related work

Automatic speech recognizers traditionally used phonemes, and their derivatives in the form of context extension, as their default basic acoustic unit. The premise was that automatic speech recognition is a conversion from spoken to written form, which accords with this choice of basic acoustic unit. In 2000, Schillo and colleagues presented the first grapheme-based automatic speech recognizer, which violated this premise by choosing a different basic acoustic unit. For German, the system achieved poorer recognition results than a phoneme-based system, but the trained grapheme models were smaller.

Graphemes as basic acoustic units also became interesting for multilingual and cross-lingual speech recognition (Killer et al., 2003). In such cases languages can be combined without detailed knowledge of the phonetics of the languages involved, as the basis is simply the written letter. The usefulness of this approach is even more pronounced in cross-lingual speech recognition, where only limited speech resources are available in the target language. The success of the method also depends to some extent on the acoustic-phonetic similarity between the languages involved.

The first research on using graphemes as the basic acoustic unit for cross-lingual recognition of Slovenian speech was presented by Žgank and colleagues (2005). This was followed by the use of graphemes for ordinary monolingual automatic speech recognition (Žgank and Kačič, 2006). Graphemes as basic acoustic units have thus become part of the standard choice for Slovenian speech recognition, especially in the domain of daily news broadcasts (Gril et al., 2021). In combination with Slovenian speech recognizers based on HMM acoustic models or on a hybrid HMM/DNN design, with some tens of hours of transcribed speech recordings available for training, they usually achieve better recognition results. We can, however, assume that this difference will diminish once more hours of transcribed Slovenian speech become available. With a growing amount of recordings we obtain more samples per basic unit, which improves the modeling of acoustic characteristics and increases robustness to potential errors that can occur due to automatic grapheme-to-phoneme conversion.

3. Speech and language resources

Speech and language resources are a key component of speech recognizers. For the speech recordings we used the corpora Gos 1.0 (Zwitter Vitez et al., 2013) and Sofes (Dobrišek et al., 2017), and the working version of the test set of the emerging RSDO speech database (the current working version is 2.0, which no longer contains spelling recordings). The Gos and Sofes corpora were used for the training and development sets, while the RSDO test corpus 2.0 was used for evaluating the results. For the lexicons we used the freely available resource Sloleks 2.0 (Dobrovoljc et al., 2019) and the current version of the lexicon created in the RSDO project. As the text corpus we used the freely available resource ccGigafida 1.0 (Logar et al., 2013).

The Gos corpus contains 120 hours of recordings covering various genres, e.g. television programmes, lectures, school lessons, private conversations, etc. All the speech is transcribed in two versions, colloquial and standardized. The recordings cover 1,526 different speakers. The Sofes speech corpus contains 9 hours and 52 minutes of recordings covering the speech of 134 different speakers; the recordings contain queries about flight information in Slovenian. In Sofes the transcriptions are available both in phonetic transcription and in the standardized transcription of speech. The RSDO test set 2.0 comprises 4 hours and 47 minutes of recordings. It differs from version 1.0 in that it no longer contains the spelling recordings, which amount to about 19 minutes of speech. We excluded spelling from the general test set, as its effective recognition requires different approaches. The RSDO test set covers read, public and non-public speech as well as recordings of the National Assembly; 63 different speakers appear in the recordings. The RSDO corpus, too, has two different transcriptions of the speech, produced according to the same guidelines as in the Gos corpus.

Sloleks 2.0 is a lexicon containing around 2,792,000 individual word forms. Each entry contains information about the word (its part of speech and grammatical properties), and all inflected forms of each word are recorded. Slovenian is an inflected language, so there are very many such forms. Version 2.0 also marks the position of stress and the transcription in the International Phonetic Alphabet (IPA). In our case we used Sloleks 2.0 to build the phonetic lexicon of the automatic speech recognizer; such a lexicon requires the words and their pronunciation in phonemes. Using the procedure applied by Ulčar et al. (2019), we converted Sloleks 2.0 into a form suitable for an automatic speech recognizer.

The ccGigafida text corpus contains slightly over 103,000,000 words and is the publicly accessible part of the Gigafida corpus, usable under a Creative Commons licence. The text contains information about the newspaper and magazine sources, years of publication, text types, titles, and the authors of the texts. The corpus is annotated with morphosyntactic descriptions and lemmas. We used ccGigafida for the language modeling of the automatic speech recognizer. To ensure correct processing, we deleted empty lines and multiple spaces from the corpus.
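The corpus cleaning described above (dropping empty lines, collapsing repeated spaces, and stripping punctuation before language modeling) can be sketched roughly as follows; the exact rules used in the project are not documented here, so this is only an approximation:

```python
import re

def normalize_corpus_line(line: str) -> str:
    """Rough sketch of LM text cleaning: replace punctuation with spaces
    and collapse runs of whitespace. The project's exact rules are not
    specified in the article, so this is an approximation."""
    line = re.sub(r"[^\w\s]", " ", line)  # drop punctuation characters
    line = re.sub(r"\s+", " ", line)      # collapse multiple spaces
    return line.strip()

def normalize_corpus(lines):
    """Clean each line and drop lines that end up empty."""
    cleaned = (normalize_corpus_line(line) for line in lines)
    return [line for line in cleaned if line]

print(normalize_corpus(["Dober  dan!", "", "To je -- test."]))
# ['Dober dan', 'To je test']
```

Removing punctuation in this way keeps the text in the plain word-sequence form that n-gram language model toolkits typically expect.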
We also removed punctuation, so that the text conformed to the usual form expected by a speech recognition system.

4. Building the lexicons for the speech recognizer

Building the phonetic lexicons needed for constructing hybrid automatic speech recognition architectures rests both on the use of existing available lexicons, which are usually manually verified and already contain phonetic transcriptions of words, and on the use of automatic grapheme-to-phoneme converters, which are applied to so-called out-of-vocabulary words that the recognizer's language model anticipates but that are not yet included in the existing lexicons.

The construction of the lexicon for the first version of the automatic speech recognizer ("Result R2.2.7: Tool for grapheme-to-phoneme conversion – version 2", 2022), developed within the RSDO project and presented in the following sections, was primarily based on the already mentioned Sloleks 2.0 lexicon and on the manually edited and verified pronunciation dictionary included in the Sofes speech corpus. For every word occurring in the normalized word-level transcriptions of the audio recordings used to build the recognizer's acoustic model, and for every word occurring in the normalized text corpus used to build its language model, we first checked whether the Sloleks 2.0 lexicon or the manually edited Sofes dictionary contained the word in question. If the word was present, its phonetic transcription was simply transferred into the recognizer's lexicon. If the word was contained in neither Sloleks 2.0 nor the Sofes dictionary, its phonetic transcription was obtained using the first version of the automatic grapheme-to-phoneme converter developed within the RSDO project, outlined below. In building the lexicon for the presented speech recognizer, more than 58 percent of all the words intended for the recognizer had to be converted automatically. The correctness of the automatic conversion has not yet been thoroughly checked and evaluated for this first version of the recognizer.

4.1. Rule-based automatic grapheme-to-phoneme conversion

The first version of the automatic grapheme-to-phoneme converter, developed within the RSDO project and used to build the recognizer's lexicon, was based on a set of context-dependent phonetic rules determined on the basis of statistical analyses and of existing linguistic and phonological knowledge of the phonetic characteristics of spoken Slovenian. The context-dependent rules relied above all on the position of stress in the given words.

The position of stress in a word is generally determined by the syllable on which the word carries a dynamically or tonally expressed auditory prominence (Toporišič, 1992). An important characteristic of Slovenian is that stress can fall on the first, last, penultimate or even antepenultimate syllable; moreover, individual words can have more than one stressed position. The position of stress is determined for each word individually and has historically been passed on between generations of speakers of spoken Slovenian through language learning and spoken communication. Despite the varying stress positions, which have also changed through the development of the language and across dialect groups, it is nevertheless possible to define rules that largely determine the position of stress in words (Toporišič, 1991). These rules were for the most part taken into account and used for automatically determining the position of stress in the given words. The rules rely on lists of prefixes, suffixes, word-initial and word-final strings that occur in Slovenian words and characteristically determine the position of stress in a given word. The rules were defined in a similar way as in the development of a system for automatic Slovenian speech synthesis (Gros, 1997). They do not, however, cover all Slovenian words in current use, so on the basis of an additional statistical analysis of stress positions in the most frequent Slovenian words, further rules were defined for determining the most probable stress position in the given words. To some extent this approach can also be interpreted as carrying out machine learning from data.

In the developed converter, the graphemic transcriptions of the input words are converted by applying the rules in order, from left to right. The rules are checked and applied in the given order, so they must be sequenced such that, for each grapheme, the rules describing special conversion cases come first in the list, followed by the more general rules.

The developed grapheme-to-phoneme converter expects input words that are already given in normalized full word form: numbers, numerals, currency units, abbreviations and other special notations must be written out as full words. This was ensured by normalizing the word-level transcriptions of the speech recordings used to build the recognizer's acoustic model, as well as the texts from the text corpus used to build its language model.

Thanks to the automatic determination of stress position, the output set of phoneme variants also made it possible to distinguish between long and short vowels. In building the recognizer's lexicon, however, this distinction was not used, because when building the acoustic models of speech recognizers vowels are usually not divided into short and long, since vowel length has no fundamental meaning-distinguishing role in word recognition (Ulčar, 2019).

5. Architecture of the automatic speech recognizer

Given the available amount of acoustic training material, it made sense to use a hybrid automatic speech recognition architecture, which in such cases is as a rule more effective than E2E approaches. In hybrid automatic speech recognition systems, the architecture can be roughly divided into two parts: the acoustic model and the language model.
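The ordered, left-to-right rule application described above, with special cases listed before general ones, can be sketched as follows. The rules shown are invented placeholders for illustration, not the actual RSDO rule set:

```python
# Minimal sketch of ordered, left-to-right rule application, in the spirit
# of the converter described above. The rules themselves are invented
# placeholders, NOT the actual RSDO rules.
# Each rule: (grapheme, right context, phoneme); '#' marks the word
# boundary, '' matches any context. The first matching rule wins, so
# special cases must precede the general ones for the same grapheme.

RULES = [
    ("l", "#", "w"),  # special case: word-final 'l' -> [w] (assumed)
    ("l", "", "l"),   # general case
    ("v", "#", "w"),  # special case: word-final 'v' -> [w] (assumed)
    ("v", "", "v"),
]

def g2p(word: str) -> list[str]:
    """Convert a word grapheme by grapheme, from left to right, taking
    the first rule whose right context matches."""
    phones = []
    for i, grapheme in enumerate(word):
        right = word[i + 1] if i + 1 < len(word) else "#"
        for rule_g, rule_ctx, phone in RULES:
            if grapheme == rule_g and rule_ctx in ("", right):
                phones.append(phone)
                break
        else:
            phones.append(grapheme)  # fallback: keep the grapheme itself
    return phones

print(g2p("vol"))  # ['v', 'o', 'w'] under the assumed rules
```

Swapping the order of the two rules for the same grapheme would make the general rule shadow the word-final special case, which is exactly why the ordering constraint in the text matters.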
Akustični model naučimo na osnovi zaznavno izrazitost (Toporišič, 1992). Pomembna vzorcev iz zvočnih značilnost slovenskega jezika je, da se mesto naglasa posnetkov in njihovih ustreznih prepisov, jezikovni model pojavlja na prvem, zadnjem, predzadnjem ali tudi pa glede na besedilni korpus. V nadaljevanju članka bomo predpredzadnjem zlogu. Poleg tega pa lahko imajo podrobneje predstavili oba modela, za graditev katerih smo posamezne besede tudi več mest naglasa. Mesto naglasa je uporabili prostodostopno orodje Kaldi (Povey et al., 2011). PRISPEVKI 73 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Za pripravo spremljajočih datotek, ki jih potrebujemo za graditev modela v orodju Kaldi, smo uporabili transkripcije govornih korpusov, ki so zapisane v obliki standardiziranega zapisa govora. 5.1. Akustično modeliranje Za akustično modeliranje smo uporabili govorne baze Gos, Sofes in testno množico projekta RSDO. Zvočni posnetki govornih baz Gos in Sofes so bili v mono formatu in so bili zapisani v 16-bitnem zapisu. Frekvenca vzorčenja je bila 16 kHz. Posnetki testne množice projekta RSDO so imeli frekvenco vzorčenja 44,1 kHz, bitna hitrost in format pa sta bila enaka posnetkom v bazah Gos in Sofes. Orodje Kaldi za svoje delo potrebuje mono zvočne posnetke s frekvenco vzorčenja 16 kHz in 16-bitnim zapisom. Da lahko posnetke v orodju Kaldi procesiramo, moramo posnetke pretvoriti v ustrezni format. S prostodostopnim orodjem SoX smo posnetke pretvorili v mono zvočne posnetke, s frekvenco vzorčenja 16 kHz in 16-bitnim zapisom. Pretvarjanje posnetkov smo vključili v proces priprave potrebnih datotek za procesiranje v orodju Kaldi. Slika 1: Postopek učenja akustičnega modela S tem korakom smo se ognili ročnemu pretvarjanju vseh avtomatskega razpoznavalnika govora. posnetkov. Zvočne posnetke, ki so del govorne baze, smo spremenili v vektorje značilk. 
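The format conversion described above can be sketched as follows. This is a minimal illustration that only builds the SoX command line for one recording; the file names are hypothetical, and the actual pipeline invokes SoX for every recording while preparing the Kaldi input files.

```python
# Build the SoX command that resamples a recording into the mono,
# 16 kHz, 16-bit format expected by Kaldi.

def sox_command(src: str, dst: str, rate: int = 16000,
                bits: int = 16, channels: int = 1) -> str:
    """Return a SoX invocation converting `src` to the target format."""
    # -r: sampling rate, -b: bits per sample, -c: number of channels
    return f"sox {src} -r {rate} -b {bits} -c {channels} {dst}"

# Hypothetical file names, for illustration only.
cmd = sox_command("recording_0001.wav", "recording_0001_16k.wav")
print(cmd)
# sox recording_0001.wav -r 16000 -b 16 -c 1 recording_0001_16k.wav
```

In practice this command string would be passed to the shell (or `subprocess.run`) once per recording, which is what makes the manual conversion of all recordings unnecessary.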
The recordings are first split into windows of 25 ms, which are then transformed to obtain MFCC (mel-frequency cepstral coefficient) features. For further processing we used 12 MFCC coefficients plus the energy, over which we also computed the first and second temporal derivatives; the first derivative yields the delta features and the second the delta-delta features. We then proceeded with acoustic modelling, where we trained new acoustic models and their alignments in several stages. The basis of acoustic modelling in an automatic speech recogniser are hidden Markov models (HMM). Given the input feature vectors, the HMMs estimate the probabilities of hypotheses about the uttered speech. For this we must know the phoneme sequence of every word. These sequences are stored in the phonetic dictionary, where each word is represented by the string of phonemes of the spoken word; a single word may have several pronunciations available, depending on the dictionary used. In HMMs, phonemes are represented by hidden states; the model then computes the observed states with the help of Gaussian distributions, which form the hypotheses about the spoken word.

In the next stage we applied linear discriminant analysis (LDA), with which we find a linear combination of states. LDA takes the feature vectors and builds the HMM states, but with a smaller feature space for all the data. We used LDA in combination with the Maximum Likelihood Linear Transform (MLLT), which simplifies the computation of the Gaussian distributions (Gales, 1999). MLLT takes the LDA features and derives a unique transformation for each speaker; it is the first step towards speaker normalisation, as it minimises the differences between speakers. For LDA and MLLT, the first 13 MFCC features are used, each spliced with 4 preceding windows on the left and 4 following windows on the right, which gives a final feature dimension of 117. LDA then reduces the feature dimension to 40. To further increase recognition accuracy, we used Speaker Adaptive Training (SAT), which computes adaptation parameters for each individual speaker from that speaker's training data (Anastasakos et al., 1996).

We started training the acoustic model with a monophone acoustic model and continued with a triphone acoustic model with delta and delta-delta features (tri1), a triphone acoustic model with LDA and MLLT (tri2b), and finally triphone acoustic models with SAT (tri3b). The training procedure is also shown in the diagram in Figure 1.

In the second part of acoustic model building we move on to deep neural networks. Neural networks are systems in which algorithms mimic the behaviour of neurons in the brain. The system consists of input, hidden and output layers, each made up of one or more neurons. The neurons are connected by relations that can run forward, backward, or both, and the relations carry weights with which the new states are computed. We used two different types of neural networks: time-delay neural networks (TDNN) and long short-term memory networks (LSTM).

TDNNs (Waibel, 1989) are neural networks with several layers; the initial layers learn narrower transformations, while the later ones have a wider temporal context.
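As a quick consistency check, the feature dimensionalities quoted above fit together as follows (this arithmetic is only an illustration, not part of the original pipeline):

```python
# Consistency check of the feature dimensions described in Section 5.1.

n_mfcc = 12              # cepstral coefficients kept per 25 ms window
n_static = n_mfcc + 1    # plus the energy term -> 13 static features

# Adding first (delta) and second (delta-delta) temporal derivatives
# triples the static vector.
n_with_deltas = n_static * 3
print(n_with_deltas)     # 39

# For LDA+MLLT, the 13 static features are spliced with 4 windows of left
# context and 4 windows of right context: 13 * (4 + 1 + 4) = 117.
n_spliced = n_static * (4 + 1 + 4)
print(n_spliced)         # 117

# LDA then projects the spliced vector down to 40 dimensions, matching the
# input-layer dimension of the TDNN and LSTM networks described later.
n_lda = 40
```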
For context modelling it must be ensured that each neural cell receives, in addition to the input value it obtains from the activation function or from the lower layer, also information about the pattern of output values and their context. In the case of a time signal this means that each neural cell receives at its input information about the activation pattern over time from the lower-lying layers.

LSTM networks (Povey, 2018) include a memory cell that retains information over a longer time. The cell has three different gates: an input, an output and a forget gate. The input gate regulates how much data from the previous sample will be stored; the output gate determines how much data will be passed on to the next layer; and the forget gate regulates the rate at which information in the cell is lost. Because they store information, LSTM systems are well suited to working with time signals, as important events may be shifted in time. The LSTM model can also be viewed as an improved recurrent neural network (RNN), since it resolved the vanishing gradient problem (Hochreiter, 1991).

The TDNN architecture consists of an input layer, hidden layers and an output layer. The input layer has dimension 40. The first hidden layer of the TDNN network was a fully connected LDA layer with dimension 40. The LDA layer was followed by 8 fully connected TDNN layers of dimension 512, on which neuron dropout was used. The TDNN layers are followed by two parallel branches of layers, the chain branch and the xent branch, each consisting of two layers. The first layer of each branch is a fully connected ReLU layer of dimension 512 which, like the TDNN layers, uses dropout. The ReLU layers are followed by the output layers. The two branches differ in their loss function: the chain branch uses the log-probability of the correct phoneme or grapheme sequence, while the xent branch uses cross-entropy. The TDNN network thus consists of 10 layers, on which we also used temporal pooling, where information from the desired time windows relative to the input is merged; temporal pooling was used on the LDA layer and on TDNN layers 2, 4, 6, 7 and 8.

The TDNN models were trained for 7 epochs. The initial effective learning rate was set to 0.0001 and the final effective learning rate to 0.00001; the remaining parameter values were kept at their defaults.

Like the TDNN architecture, the LSTM also contains three kinds of layers. The first is the input layer, identical to the input layer of the TDNN architecture; likewise, the first hidden layer of the LSTM architecture is identical to the LDA layer of the TDNN architecture. The next four hidden layers are LSTMP (Long Short-Term Memory Projection) networks of size 1024; an LSTMP is an LSTM network that additionally contains a projection layer. In our configuration the dimension of the projection layer was set to 256. The hidden layers are followed by two branches of output layers, which again differ in their loss function, as in the TDNN architecture.

The LSTM acoustic models for speech recognition were trained for 4 epochs. The remaining values were kept at their defaults, including the initial and final effective learning rates, which were set to 0.001 and 0.0001.

In the next section we present the speech recognition results of the LSTM and TDNN systems. Since the TDNN system achieved better results, part of the experiments was carried out on the TDNN system only.

5.2. Language modelling

As the connecting link between the acoustic and the linguistic space, we used two different types of dictionaries for the automatic speech recogniser. The first type was a phonemic dictionary, in which the words are written with phonemes; in the second type, the spoken words are written with graphemes instead of phonemes. The properties of the dictionaries are presented in Table 1. One of the properties is the out-of-vocabulary (OOV) share, which we compute as:

    OOV = (number of test-set words outside the dictionary / number of all words in the dictionary) · 100    (2)

The dictionaries we used are larger than those used in previous recognisers of informative broadcasts (Gril et al., 2021). The OOV values are very low and can safely be neglected.

    Dictionary     Dictionary type   # of words   OOV [%]
    Sloleks 2.0    phonemic           1,129,144     0.054
    Sloleks 2.0    graphemic            931,848     0.065
    RSDO           phonemic           1,440,070     0.008
    RSDO           graphemic          1,440,070     0.008

Table 1: Properties of the dictionaries used.

The language model of the automatic speech recogniser is trained on a text corpus. Such a model can predict the word that follows, given the preceding words in a sequence. The language model is also capable of contextual ranking: among words with a similar pronunciation, it will choose the one that makes more sense given the context of the previously observed word sequence. We trained the language model with the n-gram count tool, which is part of the SRILM package (Stolcke, 2002). N-grams are, in our case, sequences of n words in a sentence; the n-gram count tool generates n-grams from the text corpus and uses them to estimate the predictive probabilities of the language model. The maximum n-gram order of the model has to be specified; we built a language model containing 1-grams, 2-grams and 3-grams.

6. Automatic speech recognition results

We evaluated the performance of the different versions of the automatic speech recogniser on the P 2.0 test set of the RSDO project. For the evaluation we used the word error rate (WER), computed as the ratio between the number of inserted, deleted and substituted words and the number of words in the reference text:

    WER = ((I + D + S) / N) · 100    (1)

where I is the number of insertions, D the number of deletions and S the number of substitutions, and N denotes the number of all words in the reference text of the test set. The ratio is multiplied by 100, as WER is customarily given as a percentage.

    Architecture   Dictionary    Dictionary type   WER [%]
    LSTM           Sloleks 2.0   phonemic           38.70
    TDNN           Sloleks 2.0   phonemic           27.19
    TDNN           RSDO          phonemic           25.31
    TDNN           Sloleks 2.0   graphemic          26.97
    TDNN           RSDO          graphemic          24.95

Table 2: Speech recognition results with the different combinations of methods and models.

Let us first look at the results obtained when evaluating the two types of acoustic model architectures, presented in Table 2. The LSTM system proved worse, with a WER as much as 11.51 percentage points higher than that of the TDNN system. Based on this result, we chose TDNN as the acoustic model architecture for the subsequent experiments.
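Formulas (1) and (2) above can be sketched in a few lines. The insertion, deletion and substitution counts come from the standard Levenshtein dynamic programme over words; the example sentences and the toy vocabulary are invented for illustration.

```python
# Minimal sketch of the evaluation measures defined above.

def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate, formula (1): (I + D + S) / N * 100."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimal edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                              # i deletions
    for j in range(m + 1):
        dist[0][j] = j                              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub,                   # substitution (or match)
                             dist[i - 1][j] + 1,    # deletion
                             dist[i][j - 1] + 1)    # insertion
    return dist[n][m] / n * 100

def oov_rate(test_words: list[str], vocabulary: set[str]) -> float:
    """OOV share, formula (2): out-of-vocabulary test words relative to the
    dictionary size (as defined above), in percent."""
    unknown = sum(w not in vocabulary for w in test_words)
    return unknown / len(vocabulary) * 100

ref = "danes je lepo vreme".split()
hyp = "danes lepo in vreme".split()
print(wer(ref, hyp))  # 50.0 -> one deletion plus one insertion over 4 words

vocab = {"danes", "je", "lepo", "vreme"}
print(oov_rate(["danes", "snezi"], vocab))  # 25.0
```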
The baseline WER was 27.19 %. Ulčar et al. (2019) achieved a slightly worse result on a similar system, but the results are not directly comparable, as the evaluation was performed on a different test set. A comparison with a previous similar ASR system (Gril et al., 2021) shows a difference in the results: the authors then achieved a WER of 15.17 %, but using different speech resources. In that case the domain of the resources was limited exclusively to television broadcasts, although these can at times, for example when there is background music behind the speech, be quite challenging for automatic speech recognition.

To continue the development of the speech recognition system, we used two different dictionaries: one built on the basis of Sloleks and one prepared within the RSDO project. Table 2 shows that the evaluation result improves by 1.88 percentage points when the dictionary prepared within the RSDO project is used.

In the last step, we also compared automatic speech recognisers in which, for both dictionary types, the phonemic basic acoustic unit was replaced with a graphemic one. For the recogniser based on Sloleks, replacing phonemes with graphemes improved the result by 0.22 percentage points; with the dictionary built within the RSDO project, the replacement improved the WER by 0.36 percentage points. This combination of model and units is at the same time the best speech recognition result achieved in the presented experiments. The result with graphemes is probably better because of the limited amount of training data and hence the limited number of samples per acoustic unit. We may conclude that there were too few samples to recognise the rarer, more specific acoustic units, so recognition with graphemes, which comprise fewer basic acoustic units because they do not distinguish subvariants, worked better. Although the improvement with the graphemic dictionary is not particularly large, this type of dictionary is much simpler to prepare. It also has advantages in deployment, as it takes up somewhat less memory, which is especially important in large-vocabulary continuous speech recognition (LVCSR), where large files quickly become a bottleneck. An additional advantage of graphemic acoustic units is that, in practical use, the dictionary of an automatic speech recogniser can be extended even by a layperson.

7. Conclusion

In this paper we presented a system for Slovenian speech recognition. For the acoustic model we used the hybrid HMM-DNN approach; to predict the hidden HMM states we used two types of neural networks, with time-delay neural networks proving a better approach than long short-term memory networks. For building the dictionary we used two basic acoustic units; in our case, the graphemic models gave better results than the phonemic ones. We used a new test set created within the RSDO project. The best word error rate was 24.95 %, which is comparable with the results of other speech recognition systems. The good recognition result is helped by a large dictionary, larger than in comparable systems, and by the use of graphemes as the basic acoustic unit. Grapheme-based systems make it simpler to build dictionaries and also to extend them, and they have a positive effect in deployment, as such models take up somewhat less memory.

Acknowledgements

We thank the authors of the Gos 1.0 corpus for allowing us to use it for the development of the automatic speech recogniser. The research was partly carried out within the project RSDO – Development of Slovene in a Digital Environment. The operation Development of Slovene in a Digital Environment is co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund, and is carried out under the Operational Programme for the Implementation of the EU Cohesion Policy in the period 2014–2020.

8. References

Tasos Anastasakos, John McDonough, Richard Schwartz and John Makhoul. 1996. A compact model for speaker-adaptive training. In: Proceedings ICSLP, pp. 1137–1140.
Simon Dobrišek, Jerneja Žganec Gros, Janez Žibert, France Mihelič and Nikola Pavešić. 2017. Speech Database of Spoken Flight Information Enquiries SOFES 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1125
Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik and Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1230
Mark J. Gales. 1999. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Speech and Audio Processing, 7(3): 272–281.
Jerneja Gros. 1997. Samodejno tvorjenje govora iz besedil. Doctoral dissertation. Faculty of Electrical Engineering, University of Ljubljana.
Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Available at: https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf (16 May 2022)
Mirjam Killer, Sebastian Stüker and Tanja Schultz. 2003. Grapheme based speech recognition. Interspeech.
Nataša Logar, Tomaž Erjavec, Simon Krek, Miha Grčar and Peter Holozan. 2013. Written corpus ccGigafida 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1035
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motliček, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In: IEEE ASRU 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li and Sanjeev Khudanpur. 2018. A Time-Restricted Self-Attention Layer for ASR. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5874–5878.
RSDO. (n. d.). Available at: https://www.cjvt.si/rsdo/.
Razvoj slovenščine v digitalnem okolju – RSDO: Rezultat R2.2.7: Orodje za grafemsko fonemsko pretvorbo – verzija 2. Project report, 2022.
Christoph Schillo, Gernot A. Fink and Franz Kummert. 2000. Grapheme based speech recognition for large vocabularies. In: Sixth International Conference on Spoken Language Processing.
Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing.
Jože Toporišič. 1992. Enciklopedija slovenskega jezika. Cankarjeva založba, Ljubljana.
Jože Toporišič. 1991. Slovenska slovnica. Založba Obzorja, Maribor.
Matej Ulčar, Simon Dobrišek and Marko Robnik-Šikonja. 2019. Razpoznavanje slovenskega govora z metodami globokih nevronskih mrež. Uporabna informatika, 27(3). Available at: https://uporabna-informatika.si/index.php/ui/article/view/53 (8 November 2021)
Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano and Kevin J. Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3): 328–339.
Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Simon Krek, Marko Stabej and Tomaž Erjavec. 2013. Spoken corpus Gos 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1040
Andrej Žgank, Zdravko Kačič, Frank Diehl, Jožef Juhar, Slavomir Lihan, Klara Vicsi and Gyorgy Szaszak. 2005. Graphemes as Basic Units for Crosslingual Speech Recognition. In: COST278 Final Workshop and ITRW on Applied Spoken Language Interaction in Distributed Environments.
Andrej Žgank and Zdravko Kačič. 2006. Conversion from phoneme based to grapheme based acoustic models for speech recognition. Interspeech.

What works for Slovenian? A comparative study of different keyword extraction systems

Boshko Koloski, Senja Pollak, Matej Martinc
Jožef Stefan Institute, Jožef Stefan International Postgraduate School
Jamova cesta 39, Ljubljana, Slovenia
{boshko.koloski,senja.pollak,matej.martinc}@ijs.si

Abstract
Identifying and retrieving keywords from a given document is one of the fundamental problems of natural language processing. In this paper, we conduct a thorough comparative analysis of several distinct approaches for keyword identification on a new benchmark Slovenian keyword extraction corpus, SentiNews. The first group of methods is based on a supervised methodology, where previously annotated data is required for the models to learn. We evaluate two such approaches, TNT-KID and BERT.
The other paradigm relies on unsupervised approaches, where no previously annotated data for training is needed. We evaluate five different unsupervised approaches, covering three main types of unsupervised systems: statistical, graph-based and embedding-based. The results show that supervised models perform significantly better than unsupervised approaches. By applying the TNT-KID method on the Slovenian corpus for the first time, we also advance the state-of-the-art on the SentiNews corpus.

1. Introduction

Identifying and retrieving keywords from a given document represents one of the crucial tasks for the organization of textual resources. It is employed extensively in media organizations with large daily article production that needs to be categorized in a fast and efficient manner. While some media houses use keywords to link articles and produce networks based on keywords, journalists use keywords to search for news stories related to newly produced articles and also to summarize new articles with a handful of words. Manual categorization and tagging of these articles is a burdensome and time demanding task, therefore the development of algorithms capable of tackling keyword extraction automatically, and thereby allowing journalists to spend more time on more important investigative assignments, has become a necessity.

The approaches for automatic detection of keywords can be divided based on their need for annotated data prior to learning. One paradigm of keyword extraction focuses on extracting keywords without prior training (i.e. unsupervised approaches), while the other focuses on learning to identify keyphrases from an annotated data-set (i.e. supervised approaches). While unsupervised approaches can be easily applied for domains and languages that have low to no amount of labeled data, they nevertheless tend to offer non-competitive performance when compared to supervised approaches (Martinc et al., 2020), since they can not be adapted to the specific language and domain through training. On the other hand, supervised state-of-the-art approaches based on the transformer architecture (Vaswani et al., 2017) have become very effective in solving the task, but they usually require substantial amounts of labeled data, which is hard to obtain for some low-resource domains and languages.

In this research, we focus on one of the low-resource languages, Slovenian, for which not a lot of manually labeled data that could be leveraged for training of keyword extractors is available. We systematically evaluate several distinct strategies for keyword extraction on Slovenian, among them also some which have not been tested before on Slovenian. We show that the employment of the TNT-KID model (Martinc et al., 2020), a model specifically adapted for the monolingual low-resource scenario, leads to an advance in the state-of-the-art on the Slovenian SentiNews keyword extraction benchmark dataset (Bučar, 2017). To summarize, the main contributions of this work include:

- A systematical analysis of a keyword extraction dataset of Slovenian news.
- A thorough comparison of several supervised and unsupervised keyword extraction strategies on the Slovenian data-set. Supervised methods include the monolingual TNT-KID method, which has not been employed for Slovenian before, and an application of the multilingual BERT model (Devlin et al., 2019), same as in Koloski et al. (2022b). We also cover several unsupervised methods in this study, including statistical, graph-based and embedding-based models.
- The advancement of the state-of-the-art on the Slovenian keyword extraction dataset from SentiNews.
- The release of a dockerized pretrained model of the best performing system in terms of F1-score, TNT-KID-Slovene.

The paper is organized in the following manner: Section 2 describes the related work in the field, followed by the description of the data and the exploratory data analysis in Section 3. Section 4 describes the experimental setting considered in this study and in Section 5 we discuss the results. Finally, Section 6 presents the conclusions of the study and proposes further work.

2. Related work

Keyword extraction approaches are either supervised or unsupervised.

2.1. Unsupervised methods

Modern supervised learning approaches are very successful in keyword extraction, but they are data intensive and time consuming. Unsupervised keyword detectors can address both problems and typically require much less computational resources and no training data, but this comes with the price of lower overall performance. Unsupervised methods can be divided into four main categories:

- statistical - methods of this family are based on calculating various text statistics to capture keywords, such as frequency of appearance, position in the text, etc. KPMiner (El-Beltagy and Rafea, 2009) is one of the oldest methods and focuses on the frequency and position of a given keyphrase. After calculating several frequency-based statistics, the method uses post-processing filtering to remove keyphrases that are too rare or that are not positioned within the first k characters of the document. YAKE (Campos et al., 2018) represents one of the latest upgrades of the statistical approaches, and includes the simpler features proposed by KPMiner. The main novelty is that it also considers the relatedness of term candidates to the general document context, their dispersion, and the casing of a specific term candidate.

- graph-based - methods focus on creating graphs from a given document and then exploit graph properties in order to rank words and phrases. In the first, graph creation step, authors usually consider two adjacent words as two adjacent nodes in a graph G. Usually, before the graph-creation step some form of word normalization is performed - either stemming or lemmatisation. Since keyword phrases can consist of multiple words, the methods consider the use of sliding windows to obtain n-grams up to a specific value of n, and use the obtained n-grams as nodes. TextRank (Mihalcea and Tarau, 2004) is one of the first such methods. In the second, keyword ranking step, it leverages Google's PageRank (Page et al., 1999) algorithm to rank the nodes according to their importance within the graph G. While TextRank is a robust method, it does not account for the position of a given term in the document. This was improved in the PositionRank (Florescu and Caragea, 2017) method, which leverages PageRank on one side and the position of a given term on the other. An upgrade to the graph-creation step was introduced in Boudin (2018), where they consider encoding the potential keywords into a multipartite¹ graph structure; the method in addition also considers topic information. Similarly to TextRank, it leverages PageRank (Page et al., 1999) to rank the nodes. RaKUn (Škrlj et al., 2019) is one of the most recent additions to the family of graph-based keyword extractors. The main contribution of this method is that it introduces an intermediate step that constructs meta-nodes from the initial nodes of the graph via aggregation of the existing nodes. After the construction of the meta-graph, it applies the load centrality metric for the term ranking, and also relies on multiple graph redundancy measures.

- embedding-based - methods are gaining traction with the recent introduction of various off-the-shelf pretrained embeddings such as FastText (Bojanowski et al., 2016) or transformer-based (BERT; Devlin et al., 2019) embeddings. Key2Vec (Mahata et al., 2018) represents the pioneer of this type of methods, followed by the EmbedRank (Bennani-Smires et al., 2018) method. The aforementioned methods consider the semantic information captured by the distributed word and sentence embedding representations. KeyBERT (Grootendorst, 2020) is currently the state-of-the-art method of this type. The foundation of this method are pre-trained sentence-BERT (Reimers and Gurevych, 2019) based representations. The method embeds n-grams of a given size and compares them to the embedding of the entire document. The n-grams closely matching the representation of the entire document (i.e. the keywords most representative of the entire document) are retrieved as the keywords that best describe the overall document content. In order to diversify the results, the method also introduces the Max Sum Similarity metric, with which the model selects the candidate phrases with the highest rank that are least similar to each other.

- language model-based - methods use language model derived statistics to extract keywords from text. Tomokiyo and Hurst (2003) considered multiple language models and measured the Kullback-Leibler divergence (Joyce, 2011) for ranking both the phraseness and the informativeness of candidate terms.

¹ Family of graphs where the nodes can be split into multiple disjoint sets.

2.2. Supervised methods

Supervised methods require manually annotated data for training. The methods can be divided into neural and non-neural.

2.2.1. Non-neural

The first methods that proposed a solution in a supervised manner considered keyword extraction as a classification task. The KEA method (Witten et al., 1999) treats each word or phrase as a potential keyword, uses the TF-IDF (Sammut and Webb, 2010) metric and word position for representation, and uses Naive Bayes for classifying a given term as a keyword or not.

2.2.2. Neural

With the recent gain in computing power and the introduction of more modern deep architectures, the field of keyword extraction was taken by storm by neural architectures. The neural approaches can be divided into two groups: those that treat the task as sequence-to-sequence generation and those that model the task as sequence labelling.

Meng et al. (2017) first proposed the idea of keyword extraction as a sequence-to-sequence generation task. In their work they proposed a recurrent generative model with an attention and a copying mechanism (Gu et al., 2016) based on the positional information. An additional strong point of this model is that, due to its generative nature, it is able to find keywords that do not appear in the text.

The first representative of the sequence-labelling methods is the approach by Luan et al. (2017), where the authors consider a bidirectional Long Short-Term Memory (BiLSTM) layer and a conditional random field (CRF) layer for classification. The more recent approaches of this type utilize the transformer architecture (Vaswani et al., 2017) in their models. An upgrade of the approach by Luan et al. (2017) was proposed by Sahrawat et al. (2020), where contextual embeddings generated by BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and GPT-2 (Radford et al., 2019) were fed into the BiLSTM network. Currently, the state-of-the-art model based on the transformer architecture is the one proposed by Martinc et al. (2020). They employ the tactic of not relying on massive language model pretraining, but rather on language model pretraining on much smaller domain-specific corpora. This makes the approach more easily transferable to less resourced domains and languages.

Most keyword recognition studies still focus on English. Nevertheless, several multilingual and cross-lingual studies have been conducted recently, also including low-resource languages. One of them is the study by Koloski et al. (2021), which compared the performance of two supervised transformer-based models, a multilingual BERT with a BiLSTM-CRF classification head (Sahrawat et al., 2020) and TNT-KID, in a multilingual setting with Estonian, Latvian, Croatian and Russian news corpora. The authors also investigated whether combining the results of the supervised models with the results of the unsupervised models can improve the recall of the system. In Koloski et al. (2022b), an extensive study was conducted to compare the performance of supervised zero-shot cross-lingual approaches with unsupervised approaches. The study was conducted for six languages - Slovenian, English, Estonian, Latvian, Croatian, and Russian. The authors show that models fine-tuned to extract keywords on a combination of languages outperform the unsupervised models when evaluated on a new, previously unseen language not included in the training dataset.

3. Data

We conduct our experiments on the Slovenian SentiNews dataset (Bučar, 2017), which was originally used for news sentiment analysis, but nevertheless does contain manually labeled keywords and was therefore identified as suitable for keyword extraction (Koloski et al., 2022a). […] (Bučar et al., 2019), containing more than 200,000 documents. We benchmark all of our models on the same test split that was already used in the study by Koloski et al. (2022b), in order to make our results directly comparable to the ones in the related work.

The documents have a similar structure in all three splits, having on average 370 words (370.10 words in the train split, 366.89 words in the validation split and 377.46 words in the test split) and on average around 15 sentences (15.419 sentences in the train split, 15.203 sentences in the validation split and 15.662 sentences in the test split).

                                            Train      Valid       Test
    Document statistics
      # of documents                         4796       1199       1519
      avg. # of sentences                  15.419    15.2026    15.6622
      avg. # of words                      370.10     366.89     377.46
    Keyword statistics
      # of keywords                         19429       4773       5903
      # of unique keywords                   4414       1854       2049
      # of unique keywords per document    0.9203     1.5462     1.3489
      # of keywords per document           4.0052     4.1643     3.8861
      keywords present in the document    59.91 %    60.54 %    59.95 %
    Keyword composition statistics
      Proportion of 1-word terms          92.77 %    93.17 %    92.68 %
      Proportion of 2-word terms           5.88 %     5.61 %     5.98 %
      Proportion of 3-word terms           0.62 %     0.57 %     0.58 %
      Proportion of >3-word terms          0.74 %     0.65 %     0.76 %

Table 1: Dataset statistics. We conducted three different statistical analyses. The first one was on the document level and considered counting the word and sentence tokens. The second focused on keyword-level statistics, such as the total number of keywords, the number of unique keywords, and the proportion of all versus unique keywords per document. Finally, we explored the composition of keywords, i.e. how many of them were composed of single words, two words, three words, or more words.

There are in total 30,105 keywords in the dataset, with 8,317 of them being unique. On average there are 4 keywords per document in the training split, 4.16 keywords per document in the validation split and 3.8861 keywords per document in the test split. In regards to the unique keywords per split, there are 0.92 unique keywords per document in the training split, 1.55 in the validation split and 1.35 in the test split. Since the keyword extractors used in this study are only able to extract keywords that are present in the data, we also calculated the share of keywords that are present in the document.
In the fore feeding the datasets to the models, they are lowercased. training set, there were 59.91% of the keywords present, in We split the dataset into three different splits: train, valida- the validation set 60.54% and in the testing set 59.95%. tion and test. Finally, we conducted a study on the composition of keywords in which we explored how many words consti- 3.1. Exploratory data analysis tute a specific keyphrase. In all of the splits, more than Next, we preform exploratory data analysis (EDA) on 92% of the keywords contained only a single words, 2-word the given dataset. There are total of 7514 documents, terms represented about 5% of the keywords, while 3 or 4796 (64%) for training, 1199 (16%) for validation and more word terms represented around 3% of all keywords. 1519 (20%) for testing, which makes the dataset rela- The most common keyword was gospodarstvo with 2,350 tively small in comparison to some English keyword ex- occurrences (representing roughly 12% of all keyword oc- traction datasets, such as for example KPTimes (Gallina et curences), followed by ekonomija with 1315 (6.76%) oc- PRISPEVKI 80 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 currences, followed by banka with 147 (0.08%) occur- unigram model outscored the model that considered rences. n-grams of sizes 1 to 3 as keyword candidates for all These keywords suggest that most of the articles come languages, therefore in the final report we show only from the economic and financial domain. In order to ex- the results for the unigram model. plore the structure and content of the dataset in more detail, we do additional network science analysis on the graph of 4.1.3. Graph-based methods 100 most-frequent terms. 
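Such a keyword co-occurrence graph can be assembled with a minimal sketch like the following, assuming each article is represented by its list of gold-standard keywords (the helper name and the toy data are illustrative, not from the paper). Community detection, e.g. with the Louvain algorithm used in the paper, would then be run on the resulting weighted edges.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(articles_keywords):
    """Count co-occurrence links: one undirected edge for every pair of
    keywords that annotate the same article, accumulated over all articles."""
    edges = Counter()
    for keywords in articles_keywords:
        # sort each pair so (a, b) and (b, a) count as the same edge
        for a, b in combinations(sorted(set(keywords)), 2):
            edges[(a, b)] += 1
    return edges

# toy example with three annotated articles
articles = [
    ["gospodarstvo", "banka", "davki"],
    ["gospodarstvo", "banka"],
    ["gospodarstvo", "ekonomija"],
]
edges = cooccurrence_edges(articles)
```

Restricting `articles` to the 100 most frequent keywords before building the edges yields a graph analogous to the one analysed here; the edge weights can be fed directly into a graph library for community detection.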
We construct a graph G_100 in the following manner: we create links among every pair of keywords that accompany a given article in the training split, and we repeat this step for every article in the training split.

We next focus on community detection in the constructed graph. For that purpose, we use the Louvain algorithm (Blondel et al., 2008). The algorithm detects four distinct communities. The first one, colored green, is the most central community, i.e. the community with the highest number of links shared with the three other detected communities; it contains general terms like family, declaration and NKB (a bank). The next one, colored purple, concerns the trend of rising taxes, new laws and the petrochemical industry. The community colored blue represents economic news about the infrastructure and construction industries. The last, yellow community concerns financial help from the government and the European Union, together with unemployment and the slow rise of GDP. The graph and its detected communities are presented in Figure 1.

Figure 1: Visualization of the derived communities of the co-occurrence graph.

4. Methods
In our experiments, we follow the experimental setting proposed in Koloski et al. (2021) and Koloski et al. (2022b). The methods and the hyperparameters used are described below.

4.1. Unsupervised approaches
We evaluate three types of unsupervised keyword extraction methods, statistical, graph-based and embedding-based, described in Section 2. Note that these models were already evaluated on the same corpus in Koloski et al. (2022b).

4.1.1. Statistical methods
• YAKE (Campos et al., 2018): We consider n-grams with n ∈ {1, 2, 3} as potential keywords.
• KPMiner (El-Beltagy and Rafea, 2009): We apply a least allowable seen frequency of 3, while we set the cutoff to 400.

4.1.2. Embedding-based methods
• KeyBERT (Grootendorst, 2020): For document embedding generation we employ sentence-transformers (Reimers and Gurevych, 2019), more specifically the distiluse-base-multilingual-cased-v2 model available in the Huggingface library.² Initially, we tested two different KeyBERT configurations: one with n-grams of size 1 and another with n-grams ranging from 1 to 3, with MMR=false and MaxSum=false. The unigram model outscored the model that considered n-grams of sizes 1 to 3 as keyword candidates for all languages, therefore in the final report we only show the results for the unigram model.

4.1.3. Graph-based methods
• MultipartiteRank (Boudin, 2018): We set the minimum similarity threshold for clustering at 74%.
• RaKUn (Škrlj et al., 2019): We use edit distance for calculating the distance between nodes, remove stopwords (using the stopwords-iso library³), and set a bigram-count threshold of 2 and a distance threshold of 2. An example graph of the RaKUn document representation and its predicted keywords is presented in Figure 2.

We use the PKE (Boudin, 2016) implementations of YAKE, KPMiner and MultipartiteRank, and the official implementations of RaKUn (Škrlj et al., 2019) and of the KeyBERT model (Grootendorst, 2020). For unsupervised models, the number of returned keywords needs to be set in advance. Since we employ F1@10 as the main evaluation measure (see Section 4.3.), we set the number of returned keywords to 10 for all models.

Figure 2: Visualization of one training example as seen by the RaKUn method. The visualization is generated via the Py3Plex (?) library. The top three extracted tokens here are Ljubljana, Prihodki and Zdravil, indicating that the article is about a purchase of medicine.

4.2. Supervised approaches
We test two distinct state-of-the-art transformer-based models, BERT (Devlin et al., 2019) and TNT-KID (Martinc et al., 2020).

4.2.1. BERT sequence labelling
As a strong baseline, we utilize the transformer-based BERT model (Devlin et al., 2019) with a token-classification head consisting of a simple linear layer. In all our supervised approaches, we treat the keyword extraction task as a sequence classification task. We follow the approach proposed in Martinc et al. (2020) and predict binary labels (1 for 'keyword' and 0 for 'not keyword') for all words in the sequence. A sequence of two or more consecutive keyword labels predicted by the model is always interpreted as a multi-word keyword. More specifically, we employ the bert-uncased-multilingual model from the HuggingFace library (Wolf et al., 2019) and fine-tune it on the SentiNews train split using an adaptive learning rate (starting with a learning rate of 3·10⁻⁵), for up to 10 epochs with a batch size of 8. Note that we chose this model since it is the best performing model on the Slovenian SentiNews dataset according to the study by Koloski et al. (2022b).

4.2.2. TNT-KID sequence labelling
As with BERT, we follow the approach proposed in Martinc et al. (2020) and predict binary labels (1 for 'keyword' and 0 for 'not keyword') for all words in the sequence. Again, a sequence of two or more consecutive keyword labels predicted by the model is always interpreted as a multi-word keyword. We first pretrain TNT-KID as an autoregressive language model on a domain-specific news corpus containing 884,407 news articles crawled from the websites of several Slovenian news outlets. The model was pretrained for 10 epochs. After that, the model was fine-tuned on the SentiNews train set for the keyword extraction task.

² https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2
³ https://github.com/stopwords-iso/stopwords-iso
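The binary word-labelling scheme shared by both supervised models can be illustrated with a small sketch (hypothetical helper names; for brevity the encoder only handles single-word keywords, while the decoder shows how runs of consecutive 1-labels become multi-word keywords):

```python
def encode_labels(words, keywords):
    """Label each word 1 if it is a gold-standard keyword, else 0.
    Single-word keywords only, for brevity; the models themselves
    predict one such label per word of the input sequence."""
    kw = {k.lower() for k in keywords}
    return [1 if w.lower() in kw else 0 for w in words]

def decode_keywords(words, labels):
    """Runs of two or more consecutive 1-labels are interpreted as one
    multi-word keyword; isolated 1-labels yield single-word keywords."""
    keywords, current = [], []
    for word, label in zip(words, labels):
        if label == 1:
            current.append(word)
        elif current:
            keywords.append(" ".join(current))
            current = []
    if current:  # flush a keyword that ends the sequence
        keywords.append(" ".join(current))
    return keywords
```

For example, the label sequence [1, 1, 1, 0] over four words decodes into a single three-word keyword, matching the interpretation described above.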
Fine-tuning again ran for up to 10 epochs. The sequence length was set to 256, the embedding size to 512 and the batch size to 8, and we employ the same preprocessing as in the original study (Martinc et al., 2020).

4.3. Evaluation setting
To evaluate the models, we compute F1, Recall and Precision on 10 retrieved words. Formally,

Recall@10 = (# of recommended relevant items @ 10) / (total # of relevant items)

and

Precision@10 = (# of recommended relevant items @ 10) / (# of recommended items @ 10).

We omit documents that have no gold-standard keywords, as well as documents whose keywords do not appear in the text. We do this because we only use approaches that extract words (or multi-word expressions) from the given document and cannot process keywords that do not appear in the text. All approaches are evaluated on the same monolingual test split, which is not used for training the supervised models. Lowercasing and lemmatization are applied during the evaluation to both the gold standard and the extracted keywords (keyphrases).

Table 2: Comparison of the evaluation of the proposed approaches. We report precision@10, recall@10 and F1-score@10, in percent. The scores of the best performing system of a specific type (statistical, embedding-based, graph-based or sequence-labelling based) are written in italics; the scores of the overall best performing model according to each metric are written in bold.

Model | precision@10 | recall@10 | f1-score@10
Statistical
KPMiner | 12.80 | 7.44 | 9.41
YAKE | 5.91 | 12.13 | 7.94
Embedding-based
KeyBert | 12.13 | 12.00 | 11.53
Graph-based
RaKUn | 6.72 | 12.52 | 8.75
MPRU | 3.39 | 6.96 | 4.55
Sequence-labelling
BERT | 29.54 | 47.81 | 32.59
TNT-KID | 38.58 | 42.81 | 40.59
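The Precision@10 and Recall@10 definitions above, together with the derived F1, can be written out as a small helper (a sketch; per-document scores would be averaged over all retained documents, and lowercasing/lemmatization is assumed to happen upstream):

```python
def precision_recall_f1_at_k(predicted, gold, k=10):
    """Precision@k, Recall@k and F1@k for one document.

    `predicted` is a ranked list of extracted keywords and `gold` is the
    set of gold-standard keywords present in the document."""
    top_k = predicted[:k]
    # recommended relevant items @ k
    relevant = [kw for kw in top_k if kw in gold]
    precision = len(relevant) / len(top_k) if top_k else 0.0
    recall = len(relevant) / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Documents with an empty `gold` set are exactly the ones the evaluation omits, since recall would be undefined for them.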
5. Results
In this section we examine the results of the evaluation of the proposed models. We first study the results of the unsupervised methods and then the results of the supervised models.

5.1. Unsupervised methods
In this study we evaluate five different unsupervised methods: two statistical, one embedding-based and two graph-based. Comparing the two statistical methods, KPMiner outscored YAKE in terms of F1-score and precision. The embedding-based KeyBERT method achieved the best results among the unsupervised methods. Of the graph-based methods, RaKUn performed better than the MPRU method, achieving a nearly 100% relative improvement. Table 2 presents the results for all systems and evaluation metrics in detail.

5.2. Supervised methods
We use two different supervised methods based on the sequence-labelling paradigm. The BERT-based model outperforms TNT-KID in terms of recall by about 5 percentage points, achieving the best recall of all models. In terms of precision, TNT-KID outscores the BERT model by 9.04 percentage points and achieves the best precision@10 score, 38.58%. We believe this is due to the extensive language-model pretraining on a large domain-specific Slovenian news corpus and the frequency of common co-occurrence patterns in the data, which TNT-KID has learned to exploit successfully.

The final comparison of the unsupervised and supervised models is presented in Table 2. The TNT-KID model performed best in terms of precision and F1-score, while the BERT model performed best in terms of recall. The supervised models outscored the unsupervised models by a large margin on the given task. The ranking of the models with respect to the various metrics is given in Figure 3.

Figure 3: Comparison of the models' ranking with respect to Precision@10, Recall@10 and F1-score@10.

6. Conclusion and further work
In this study, we compared the performance of supervised and unsupervised keyword extraction methods on a new public benchmark for keyword extraction derived from the Slovenian SentiNews corpus. We compared 8 different models, among them TNT-KID, which had not yet been tested on a Slovenian dataset. The five unsupervised approaches can be further divided into two graph-based, two statistical and one embedding-based approach. The embedding-based KeyBERT method showed performance superior to the other unsupervised methods in terms of F1-score at 10 retrieved keywords.

As for the supervised approaches, we experimented with two transformer-based models, one leveraging multilingual BERT and the other the TNT-KID method, that model keyword extraction as a sequence-labelling task. The TNT-KID approach outperformed the BERT-based approach (and all unsupervised models) in terms of precision and F1-score. These results therefore support the claims of the original study by Martinc et al. (2020) that TNT-KID can be easily adapted for use on less-resourced languages, such as Slovenian, through domain-specific unsupervised language-model pretraining. By employing TNT-KID on the SentiNews dataset, we have advanced the state of the art on the benchmark Slovenian keyword extraction dataset.

For further work, we plan to explore whether the results can be improved by constructing ensembles of keyword extractors. We also plan to test several different data-splitting strategies, in order to study their possible effect on the performance of different models and to establish the best possible split strategy.
We also hypothesize that a possible improvement can be achieved by taking into account the co-occurrence of various pairs of keywords. Finally, in the future we plan to expand our experiments to include the recently introduced monolingual massively pretrained model for Slovenian, SloBERTa (Ulčar and Robnik-Šikonja, 2020). We plan to fine-tune this model for the keyword extraction task and compare it to TNT-KID, to check whether the state of the art can be advanced even further.

7. Availability
The best-performing TNT-KID based model is available as a Docker model at https://gitlab.com/boshko.koloski/tnt_kid_app_slo.

8. Acknowledgements
The authors acknowledge the financial support of the Slovenian Research Agency through the research core funding for the programme Knowledge Technologies (No. P2-0103) and the project Computer-assisted multilingual news discourse analysis with contextual embeddings (CANDAS, J6-2581).

9. References
Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, and Martin Jaggi. 2018. Simple unsupervised keyphrase extraction using sentence embeddings. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 221–229, Brussels, Belgium, October. Association for Computational Linguistics.
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, October.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Florian Boudin. 2016. PKE: An open source Python-based keyphrase extraction toolkit. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 69–73, Osaka, Japan, December.
Florian Boudin. 2018. Unsupervised keyphrase extraction with multipartite graphs. CoRR, abs/1803.08721.
Jože Bučar. 2017. Manually sentiment annotated Slovenian news corpus SentiNews 1.0. Slovenian language resource repository CLARIN.SI.
Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Mário Jorge, Célia Nunes, and Adam Jatowt. 2018. YAKE! Collection-independent automatic keyword extractor. In European Conference on Information Retrieval, pages 806–810. Springer.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
Samhaa R El-Beltagy and Ahmed Rafea. 2009. KP-Miner: A keyphrase extraction system for English and Arabic documents. Information Systems, 34(1):132–144.
Corina Florescu and Cornelia Caragea. 2017. PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1115, Vancouver, Canada, July. Association for Computational Linguistics.
Ygor Gallina, Florian Boudin, and Béatrice Daille. 2019. KPTimes: A large-scale dataset for keyphrase generation on news documents. In Proceedings of the 12th International Conference on Natural Language Generation, pages 130–135.
Maarten Grootendorst. 2020. KeyBERT: Minimal keyword extraction with BERT.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany, August. Association for Computational Linguistics.
James M. Joyce. 2011. Kullback-Leibler divergence. Pages 720–722. Springer Berlin Heidelberg, Berlin, Heidelberg.
Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc. 2021. Extending neural keyword extraction with TF-IDF tagset matching. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, pages 22–29, Online, April. Association for Computational Linguistics.
Boshko Koloski, Matej Martinc, Ilija Tavchioski, Blaž Škrlj, and Senja Pollak. 2022a. Slovenian keyword extraction dataset from SentiNews 1.0. Slovenian language resource repository CLARIN.SI.
Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc. 2022b. Out of thin air: Is zero-shot cross-lingual keyword detection better than unsupervised? In Proceedings of the Language Resources and Evaluation Conference, pages 400–409, Marseille, France, June. European Language Resources Association.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi. 2017. Scientific information extraction with semi-supervised neural tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2641–2651, Copenhagen, Denmark, September. Association for Computational Linguistics.
Debanjan Mahata, John Kuriakose, Rajiv Ratn Shah, and Roger Zimmermann. 2018. Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 634–639, New Orleans, Louisiana, USA, June. Association for Computational Linguistics.
Matej Martinc, Blaž Škrlj, and Senja Pollak. 2020. TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering, pages 1–40.
Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 582–592, Vancouver, Canada, July. Association for Computational Linguistics.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November. Previous number = SIDL-WP-1999-0120.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, November. Association for Computational Linguistics.
Dhruva Sahrawat, Debanjan Mahata, Mayank Kulkarni, Haimin Zhang, Rakesh Gosangi, Amanda Stent, Agniv Sharma, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. 2020. Keyphrase extraction from scholarly articles as sequence labeling using contextualized embeddings. In Proceedings of the European Conference on Information Retrieval (ECIR 2020), pages 328–335, Lisbon, Portugal. Springer.
Claude Sammut and Geoffrey I. Webb, editors. 2010. TF–IDF. Pages 986–987. Springer US, Boston, MA.
Takashi Tomokiyo and Matthew Hurst. 2003. A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, pages 33–40, Sapporo, Japan. Association for Computational Linguistics.
Matej Ulčar and Marko Robnik-Šikonja. 2020. Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008. Curran Associates, Inc.
Blaž Škrlj, Andraž Repar, and Senja Pollak. 2019. RaKUn: Rank-based keyword extraction via unsupervised learning and meta vertex aggregation. CoRR, abs/1907.06458.
Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, DL '99, pages 254–255, Berkeley, California, USA. Association for Computing Machinery.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Spremljevalni korpus Trendi: metode, vsebina in kategorizacija besedil

Iztok Kosem,‡* Jaka Čibej,‡ Kaja Dobrovoljc,‡* Nikola Ljubešič‡
‡ Institut "Jožef Stefan", Jamova cesta 39, 1000 Ljubljana
iztok.kosem@ijs.si, jaka.cibej@ijs.si, kaja.dobrovoljc@ijs.si, nikola.ljubesic@ijs.si
* Filozofska fakulteta, Univerza v Ljubljani, Aškerčeva 2, 1000 Ljubljana

Povzetek
V prispevku opisujemo postopek gradnje korpusa Trendi – prvega spremljevalnega korpusa za slovenščino. Prva različica korpusa, imenovana Trendi 2022-05, vsebuje več kot 565 milijonov pojavnic iz več kot 1,4 milijona besedil.
Namen korpusa je, da tako strokovni kot nestrokovni javnosti ponudi podatke o aktualni jezikovni rabi in omogoči spremljanje pojavljanja novih besed ter upadanja ali naraščanja rabe že obstoječih. Predstavimo metodologijo izdelave in vsebino korpusa ter prve korake pri načrtovani strojni klasifikaciji korpusnih besedil v kategorije (npr. gospodarstvo, okolje), s katerimi bo mogoče v korpusu spremljati jezikovno rabo tudi po tematskih področjih. Predstavimo tudi rezultate ankete, s katero smo preverili uporabniška pričakovanja o jezikovnem viru za spremljanje jezikovne rabe.

The Trendi Monitor Corpus of Slovene: Methods, Content, and Text Categorization
In the paper, we present the compilation of the Trendi corpus, the first monitor corpus of Slovene. The first version of the corpus, named Trendi 2022-05, contains over 565 million tokens coming from more than 1.4 million different texts. The purpose of the corpus is to provide both experts and non-experts with data on contemporary language use and to enable the monitoring of the appearance of new words and of the increase or decrease in the use of existing words. We present the methodology of corpus compilation, its content, and the first steps towards the automatic classification of corpus texts into categories (such as economics and environment), which will enable the monitoring of language use by thematic area. We also describe the results of a survey whose goal was to collect feedback on user expectations of a language monitoring resource.

1. Introduction
Language changes constantly: new words appear, existing words and phrases acquire new meanings, certain words or their meanings fall out of use, and so on. Recently, partly due to the COVID-19 epidemic, which brought a great deal of new terminology, particular attention has been paid to the field of neology, both lexical (new words) and semantic (new meanings).

Changes in language are typically monitored with monitor corpora, which contain the most recent texts in a language. Monitor corpora fill the gap left by reference corpora, whose compilation takes a long time because of the diversity of texts and formats and the sheer volume involved. Given technological progress and the fact that a very large number of texts are now available online, building monitor corpora has become easier: what is published today can be included in a corpus tomorrow.

Despite its rich corpus infrastructure, Slovene has so far lacked a monitor corpus, although various stakeholders have expressed a clear need for one. We set out to address this gap within the project Monitor Corpus and Accompanying Data Resources (SLED),¹ which runs from October 2021 to November 2022 and is co-financed by the Ministry of Culture of the Republic of Slovenia. The goal of the project is not only to build a monitor corpus, but also to prepare the infrastructure for its regular updating.

In this paper, we first give an overview of some of the more important foreign monitor corpora and then present the methodology and content of the Trendi monitor corpus. This is followed by a presentation of the classification of thematic categories that we developed in order to prepare a model for the automatic categorization of texts. In the last part, we present a survey among users about the desired statistical computations from the corpus. In the conclusion, we outline plans for future work.

2. Monitor corpora
Internationally, monitor corpora have been present since the 20th century. One of the first was the Bank of English, first published in 1991. It contains more than 650 million words² and is today included in the 4.5-billion-word COBUILD corpus of the Collins publishing house. The corpus is not freely accessible: besides Collins employees, it can be used by staff and students of the University of Birmingham.

For English, the most important corpus today is NOW (News on the Web; Davies, 2016-), which contains more than 15 billion words from online newspapers and magazines. The corpus covers texts from 2010 onwards. As mentioned on its website,³ it grows by 180-200 million words every month.

An extensive collection of corpora for monitoring language change, covering more than 35 languages in addition to English, are the Timestamped JSI corpora. These corpora contain news items collected by the JSI Newsfeed at the Jožef Stefan Institute (Trampuš and Novak, 2012). Corpora for 18 languages are available in the Sketch Engine tool (Kilgarriff et al., 2004),⁴ where, in addition to the tool's other functions, users also have access to the so-called Trends feature (Herman, 2013), which helps identify trends in word usage. The corpora in Sketch Engine contain texts from 2014 to April 2021 (the time of the last update) and vary in size; the English corpus, for example, contains approximately 60 billion words.

There are quite a few other monitor corpora, but they are often available only for internal use. An example is ONLINE, a dynamic monitor corpus of Czech compiled by the Institute of the Czech National Corpus.⁵ It contains approximately 6.3 billion words and comprises online news, comments (below online news items), and texts from forums and social media (Facebook, Twitter, Instagram).

¹ https://sled.ijs.si/
² Unfortunately, we could not find information on when the corpus was last updated.
³ https://www.english-corpora.org/now/
⁴ https://www.sketchengine.eu/
Korpus ONLINE je razdeljen na dva V tem trenutku to pomeni, da bo Trendi vseboval besedila komplementarna korpusa: ONLINE_NOW in od 2019 naprej. To pomeni, da se ob objavi nove različice ONLINE_ARCHIVE. Prvi je posodobljen vsak dan in korpusa Gigafida (npr. korpus Gigafida 3.0 bo objavljen v pokriva obdobje zadnjega meseca in preteklih šestih sklopu projekta Razvoj slovenščine v digitalnem okolju - mesecev. ONLINE_ARCHIVE pokriva obdobje od RSDO),8 obdobje korpusa Trendi ustrezno prilagodi. februarja 2017 do prvega meseca, ki ga vsebuje Tesna povezanost s korpusom Gigafida tudi pomeni, da ONLINE_NOW. Tako se vsebina zadnjega meseca po bo korpus Trendi predstavljal standardno pisno starosti v korpusu ONLINE_NOW na začetku vsakega slovenščino. Odločitev se nam zdi smiselna tudi zato, ker meseca preseli v ONLINE_ARCHIVE. sta nestandardna oz. govorjena slovenščina pokrita s Obstajajo tudi manjši in bolj specializirani korpusi, kot sta JANES9 in Gos,10 in je torej njun razvoj spremljevalni korpusi, kakršen je npr. korpus Coronavirus predmet ločenih projektov. Navsezadnje pa ne gre pozabiti (Davies, 2019-), ki zajema obdobje od januarja 2020 do na nastajajoči korpus metaFida,11 ki bo združil vse danes in vsebuje več kot 1,4 milijarde besed. V njem so slovenske korpuse. spletne novice v angleščini, vsak dan pa naraste za 3 do 4 Pri pripravi seznama virov za vključitev v korpus milijone besed. Trendi smo izhajali iz seznama slovenskih spletnih virov, Do določene mere vlogo spremljevalnega korpusa ki jih najdemo v servisu JSI Newsfeed. Izdelali smo seznam opravljajo tudi diahroni korpusi, seveda pod pogojem, da vseh virov od leta 2019 do konca 2021, pridobili smo tudi vsebujejo čim novejša besedila. Kot primer lahko podatek o skupnem številu besedil na vir. Nato smo pri navedemo korpus sodobne ameriške angleščine (Corpus of pripravi seznama za korpus Trendi podrobno analizirali Contemporary American English; Davies, 2008-), ki vsakega od 243 virov. 
90 virov smo izključili, ker je šlo za vsebuje besedila od leta 1990 do marca 2020 (zadnja tuje ali slovenske spletne strani z vsebino v tujem jeziku. posodobitev) in obsega več kot milijardo besed. Prednost Nato smo s seznama odstranili še 34 virov, nekatere zato, korpusa je, da je žanrsko uravnotežen, saj vsebuje besedila ker niso vsebovali medijskih novic (blogi, spletne strani iz osmih različnih žanrov (govorjeni jezik, leposlovje, vladnih uradov in podjetij), druge zato, ker je njihova revije, časopise, znanstvena besedila, televizijske in vsebina preveč specializirana (npr. repozitoriji akademskih filmske podnapise, bloge in ostale spletne strani). publikacij so primernejši za korpuse, kot je Korpus Slovenski ekvivalent bi bil korpus Gigafida 2.0 (Krek et al., akademske slovenščine). Ena od strani (preberi.si) je bila s 2019),6 ki obsega 1,13 milijarde besed, vendar pa je v seznama odstranjena zato, ker je agregator novic iz drugih primerjavi s korpusom sodobne ameriške angleščine manj virov. Končni seznam korpusa Trendi tako vsebuje 110 ažuren (vsebuje samo besedila do leta 2018). virov, med tistimi, ki so v obdobju 2019-2021 prispevali Za slovenščino do danes še ni obstajal pravi največ novic, so sta.si (260.080 besedil), rtvslo.si (97.924), spremljevalni korpus. Obstajajo sicer viri, kot je Jezikovni siol.net (69.471), delo.si (65.415), 24ur.com (61.623), sledilnik (Kosem et al., 2021),7 ki že izkorišča dnevnik.si (47.749) in vecer.com (45.548). najsodobnejše podatke o jezikovni rabi, v konkretnem Seznam virov se bo redno posodabljal, saj lahko primeru od JSI Newsfeeda, za izdelavo neke vrste začasnih pričakujemo pojav novih spletnih strani, pa tudi ukinitev korpusov, na katerih se potem izvajajo statistični izračuni. obstoječih. Kot primer lahko navedemo spletno stran Taka ciljna raba je seveda tudi potrebna, vendar pa je necenzurirano.si, ki se je pojavila šele leta 2020 in je že 28. 
namenjena nestrokovni javnosti; po drugi strani strokovna po številu novic (8.494). Dodajanje novih virov v korpus javnost, kot so leksikografi_ke, jezikoslovci_ke, drugi pomeni tudi večje število besed na mesečni ravni in raziskovalci_ke potrebujejo dostop do izvirnih besedil, če posledično večji korpus Trendi. Trenutni okvirni izračuni želijo opravljati še druge analize. kažejo, da se bo Trendi vsak mesec povečal za 10-15 milijonov pojavnic, pri čemer je bil povprečen mesečni 3. Korpus Trendi obseg leta 2019 12,5 milijona pojavnic, leta 2021 pa že 21 Izdelave prvega spremljevalnega korpusa za milijonov pojavnic. slovenščino, ki smo ga poimenovali Trendi, smo se lotili v Zaradi narave korpusa Trendi bodo potrebne redne okviru projekta SLED. Poleg izdelave in rednega posodobitve, ki so zaenkrat predvidene na mesečni ravni, posodabljanja korpusa Trendi ima projekt še dva cilja: kot je praksa pri podobnih tujih korpusih. To se zdi trenutno pripravo na korpusnih podatkih temelječe statistike o realno, upoštevajoč časovno zahtevnost pridobivanja in različnih vidikih rabe besed in izdelavo orodja, ki bo 5 https://korpus.cz/ 9 https://www.clarin.si/kontext/query?corpname=janes 6 https://viri.cjvt.si/gigafida/ 10 http://www.korpus-gos.net/ 7 https://viri.cjvt.si/sledilnik/slv/ 11 https://www.clarin.si/kontext/query?corpname=mfida01 8 https://slovenscina.eu/ PRISPEVKI 87 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 označevanja besedil, pretvorb v potreben format in Po končanem označevanju se v cevovodu opravi še vključevanje korpusa v konkordančnike. pretvorba besedil iz privzetega formata označevalnega orodja (CONNL-U) v TEI XML, ki ga med drugim 3.2. Priprava besedil potrebujemo za statistične izračune s programom LIST (Krsnik et al., 2019). 
V ta proces sta vključena še dva Za pripravo besedil smo pripravili cevovod, ki povezana postopka združevanja besedil: združevanje vključuje pridobivanje besedil, označevanje na različnih besedil po viru na dan (vsakodneven postopek) in ravneh, združevanje po virih in obdobjih ter pretvorbo v združevanje besedil istega vira za cel mesec (enkrat na različne formate. Pridobivanje besedil je zaenkrat vezano mesec, na začetku novega meseca za nazaj). V zadnjem na servis JSI Newsfeed, ki uporablja protokol RSS novic, koraku, ki ga izvajamo enkrat mesečno in ga moramo vendar pa smo sredi priprave lastnega postopka luščenja. pognati ločeno zaradi kombinacije XSLT in skripte Perl, je Za to smo se odločili predvsem zato, ker smo odkrili, da so opravljena še pretvorba mesečnih datotek (razdeljenih po pri mnogih virih potrebne izboljšave pri pridobivanju viru) v format VERT, ki ga uporabljata konkordančnika besedil, npr. poleg besedila so izluščeni še drugi deli strani, KonText (Machálek, 2020) in NoSketch Engine (Rychlý, besedilo ni pridobljeno v celoti ipd. Poleg tega strani včasih 2007). vsebujejo pomembne metapodatke o besedilu, ki trenutno niso del zajema. V novem postopku bomo ročno preverili rezultate pridobivanja besedil z vsakega vira in prilagodili 3.3. Prva različica korpusa Trendi algoritem za vsak vir, kjer se bo izkazala potreba po Prva različica korpusa Trendi, imenovana Trendi 2022- prilagoditvi. 05, je bila objavljena junija 2022 in vsebuje 565.308.991 Nekateri viri, kot so sta.si, delo.si itd. imajo vsebine pojavnic oz. malo več kot 473 milijonov besed. V korpusu zaklenjene oziroma so dostopne samo naročnikom. Pri je 1.436.548 besedil od 48 izdajateljev, pri čemer imajo pridobivanju prek protokola RSS so tako prosto dostopni največje deleže Slovenska tiskovna agencija (337.484; samo povzetki ali prvih nekaj odstavkov, včasih celo samo 23,5 %), Delo d.o.o. (128.164; 8,9 %), Radiotelevizija naslov in podnaslov. 
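The monthly conversion into the VERT ("vertical") format described above can be illustrated with a minimal sketch. The actual pipeline uses XSLT and a Perl script, and the attribute layout of the real corpus configuration may differ; the column choices below (word, lemma, MSD tag) are illustrative only:

```python
# Minimal sketch: convert CONLL-U tagger output into the VERT format
# used by concordancers such as KonText and NoSketch Engine.
# One token per line (word TAB lemma TAB tag), sentences in <s>...</s>.

def conllu_to_vert(conllu: str) -> str:
    out = []
    in_sentence = False
    for line in conllu.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            if in_sentence and not line:  # blank line ends a sentence
                out.append("</s>")
                in_sentence = False
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty token rows
            continue
        if not in_sentence:
            out.append("<s>")
            in_sentence = True
        form, lemma, xpos = cols[1], cols[2], cols[4]
        out.append(f"{form}\t{lemma}\t{xpos}")
    if in_sentence:
        out.append("</s>")
    return "\n".join(out)

sample = """# sent_id = 1
1\tKorpus\tkorpus\tNOUN\tNcmsn\t_\t_\t_\t_\t_
2\traste\trasti\tVERB\tVmpr3s\t_\t_\t_\t_\t_
"""
print(conllu_to_vert(sample))
```

A real conversion would of course carry over document and paragraph structures and metadata attributes as well; this sketch only shows the token-level reshaping.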
To solve the problem of locked content, we joined forces with the team that, within the RSDO project and the preparation of the Gigafida 3.0 corpus, is signing contracts with text providers. The agreement with the providers includes the regular delivery of full texts. The final form of the Trendi pipeline will therefore combine the preparation of texts harvested from the web with texts delivered by the providers in digital form.

Text acquisition also includes deduplication, which is currently limited to the level of the individual source: part of the pipeline checks that a text with the same URL does not recur. We are aware that there is considerable overlap between sources, since they cover the same events; moreover, many sources base numerous news items on sta.si content, which leads to duplicated text at the level of sentences, paragraphs, or even entire articles. Nevertheless, content-level deduplication is not planned for Trendi, because we want to enable users to analyse the content of individual sources and to carry out comparative analyses across sources. Deduplication will, however, most likely be performed when the texts are prepared for the new version of Gigafida, as was the practice with previous versions (Krek et al., 2019).

Next comes the automatic annotation of the texts, for which we use the CLASSLA-Stanza annotation pipeline (Ljubešić and Dobrovoljc, 2019; https://pypi.org/project/classla/), which is being actively developed within the RSDO project as the reference tool for the grammatical annotation of Slovene. The tool is an extension of the open-source tool Stanza (Qi et al., 2020) that addresses the specifics of Slovene in more detail than the original software, especially at the levels of sentence segmentation, tokenisation, and morphosyntactic tagging and lemmatisation according to the JOS system (Erjavec et al., 2010). In addition to these levels, the tool also parses the texts syntactically according to Universal Dependencies (Dobrovoljc et al., 2017) and tags named entities (Zupan et al., 2017) such as names of persons, places and organisations.

3.3. The first version of the Trendi corpus

The first version of the Trendi corpus, named Trendi 2022-05, was published in June 2022 and contains 565,308,991 tokens, or a little over 473 million words. The corpus comprises 1,436,548 texts from 48 publishers, the largest shares belonging to Slovenska tiskovna agencija (337,484 texts; 23.5%), Delo d.o.o. (128,164; 8.9%), Radiotelevizija Slovenija (124,861; 8.7%), Media24 d.o.o. (100,587; 7%), PRO PLUS d.o.o. (86,578; 6%) and TSMedia d.o.o. (83,342; 5.8%).

3.4. Availability of the Trendi corpus

Trendi is freely available for browsing in three CLARIN.SI concordancers: KonText (https://www.clarin.si/kontext/) and two versions of NoSketch Engine (https://www.clarin.si/noske/). KonText and NoSketch Engine share many functionalities (simple and advanced search, etc.), but KonText additionally offers registration, with saved queries and favourite corpora, while NoSketch Engine offers further features such as keyword extraction from corpora, with no registration required. Besides the older version of the user interface (Bonito), NoSketch Engine is now also available at CLARIN.SI in the newer version (Crystal; https://www.clarin.si/ske/), which provides an improved user experience and longer-term maintainability.

An openly accessible version of Trendi will, because of copyright restrictions, be produced with the same method as ccGigafida 1.0 (Logar et al., 2013), i.e. by sampling random paragraphs of individual texts, and will be available in the CLARIN.SI repository. In the repository the corpus will be available both in the TEI and in the CONLL-U format, the latter being the preferred format for tasks that involve further data processing, e.g. machine learning or data extraction.

3.5. Topic categorisation of texts

One of the activities of the SLED project is the development of a tool for the automatic categorisation of texts by topic. To build such a tool, or rather the model behind it, we need two things: a classification scheme and a training set.
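One way such a training set can be assembled is by mapping the categories of individual sources onto the internal classification and emitting labelled lines in the input convention of fasttext's supervised mode (`__label__` prefixes). The mapping entries below are an invented fragment for illustration only:

```python
from typing import Optional

# Sketch: turn categorised portal articles into fasttext-style
# supervised training lines by mapping source-specific categories
# onto internal topic categories. The mapping is a made-up fragment;
# "__label__" is fasttext's supervised input convention.

CATEGORY_MAP = {
    ("sta.si", "Šolstvo"): "izobraževanje",
    ("sta.si", "Državni zbor"): "politika in pravo",
    ("rtvslo.si", "Šport"): "šport",
}

def to_training_line(source: str, source_category: str, text: str) -> Optional[str]:
    """Return a fasttext training line, or None if the category is unmapped."""
    label = CATEGORY_MAP.get((source, source_category))
    if label is None:
        return None  # unmapped source categories are not used for training
    # fasttext labels must not contain spaces
    return f"__label__{label.replace(' ', '-')} {text}"

print(to_training_line("sta.si", "Šolstvo", "Nov šolski koledar ..."))
```

Texts whose source category has no mapping are simply skipped, which keeps the training data restricted to categories that can be assigned with confidence.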
In designing the set of categories we drew on three groups of sources:
● six Slovene news portals: rtvslo.si, delo.si, sta.si, dnevnik.si, 24ur.com and vecer.com;
● the set of topic codes, or categories, of the International Press Telecommunications Council (IPTC; https://cv.iptc.org/newscodes/subjectcode), with which we also wanted to ensure the best possible alignment of our categories with an international standard;
● the categories used in contemporary synchronic and monitor corpora, the most relevant being the Czech corpus SYN2015 (Křen et al., 2016) and the Estonian National Corpus (Koppel and Kallas, in press).

The main guideline in preparing the classification was to devise a relatively small set of categories into which all news on the various portals can be placed; this should also improve the performance of the model. Consequently, when analysing the sources we paid more attention to top-level categories, which was particularly necessary for the IPTC set, which has about 1,400 categories arranged in three levels, with only 17 categories at the top level. The category of sport illustrates why using only top-level categories makes sense: most news portals subdivide it further, but only football and basketball appear on all of them, while other subcategories appear only on some. For instance, dnevnik.si has no winter-sports subpage but does have a separate subpage for news about Luka Dončić; rtvslo.si is the only portal with a Formula 1 subpage; and 24ur.com has separate subpages for the Champions League and Europa League (football) and for combat sports.

Our final classification contains 13 categories:
● umetnost in kultura (arts and culture): texts about culture, the arts, films, books and theatre, as well as reviews, etc.;
● črna kronika (accidents and crime): natural and other disasters, human offences, crime;
● gospodarstvo (economy): texts on the economy, markets, finance, employment, etc.;
● okolje (environment): environmental protection, the planet, energy sources, as well as agricultural topics;
● zdravje (health): people's physical and mental health, medicine, pharmacy, health infrastructure;
● prosti čas (leisure): hobbies, recreation, travel, tourism, pets, home and family, lifestyle;
● politika in pravo (politics and law): international and national news on public administration, legal procedures and social relations, conflicts, wars;
● znanost in tehnologija (science and technology): scientific discoveries, curiosities, technological innovation, information technology, computing;
● družba (society): social issues and relations, equality, discrimination, religion, ethics, etc.;
● šport (sport): sports results and stories from various sports;
● vreme (weather): meteorological forecasts, descriptions of weather phenomena, conditions and processes;
● zabava (entertainment): show business, fashion, style;
● izobraževanje (education): the processes of transferring and acquiring knowledge and skills, all levels of education from kindergarten to university, as well as lifelong learning.

As the comparison in Table 1 shows, there is considerable overlap both with the categories of the news portals and with those of IPTC and the foreign corpora. In some cases, e.g. gospodarstvo, prosti čas, politika and družba, our category subsumes several categories of the other sources; for prosti čas, for example, the Estonian corpus has as many as seven separate categories. The only case where a single category of the foreign sources can be placed into two of ours is the pair umetnost in kultura and zabava. We separated these two categories partly because many Slovene news portals have separate subpages for them, and partly because of the language itself: unlike entertainment content, cultural and artistic content is often considerably more specialised.

While all 17 IPTC categories can be placed into ours, the Czech and Estonian corpora lack certain categories: the Estonian corpus, for example, has no črna kronika, and the Czech corpus lacks okolje, zdravje, znanost in tehnologija and zabava. Neither has a separate weather category, which IPTC does have; we added it because most Slovene news portals have one.

Looking further at the overlap of our categories with the pages and subpages of the six Slovene news portals, the problematic categories are above all politika, družba and izobraževanje. These are legitimate categories, but the portals have no dedicated subpages for them; instead, such news is scattered across other subpages, mostly defined by the geographical origin of the news, e.g. Slovenija, Svet, Lokalno. Whereas the authors of the Czech corpus decided to follow this kind of division in their categories as well (current events, foreign news, domestic news, regional news), we preferred to stick to topic. For the construction of training sets, this means somewhat more manual work, i.e. finding other indicators with which the topic of an article on a given portal can be identified. The exception is the portal sta.si, which already has suitable categories, namely Šolstvo and Družba, and for politics Državni zbor, Evropska unija, Mednarodna politika, Slovenska notranja politika and Slovenska zunanja politika.

We built the training sets by mapping the categories of the various news sources onto our internal categorisation, so that texts from specific categories of concrete sources can be used to train the model. In preparing the training sets we will sample both the amount of data per source and the amount of data per category, thereby ensuring diverse training sets as well as a robust final model.

For modelling we will use the fasttext tool (Joulin et al., 2016) with the CLARIN.SI embeddings (Ljubešić and Erjavec, 2018) and the SloBERTa model (Ulčar and Robnik-Šikonja, 2021). Depending on the difference in results (we expect SloBERTa to perform better, though the difference may not be that noticeable) and on the complexity of the classifiers (fasttext is considerably faster and requires far less memory), we will choose the classifier to be used on new texts.

kategorija | slovenski portali (6) | češki korpus | estonski korpus | IPTC
umetnost in kultura | 5 | culture | culture & entertainment | arts, culture and entertainment
črna kronika | 6 | crime | / | disaster and accident
gospodarstvo | 6 | economy, finance & business; agriculture; construction & real estate | economy | economy, business and finance; labour
okolje | 2 | / | nature & environment | environmental issue
zdravje | 3 | / | health | health
prosti čas | 4 | leisure | beauty; cars; food & drinks; gambling & casinos; home, family & children; pets and animals; travel & tourism; video games | lifestyle and leisure
politika in pravo | 1 | politics | politics & government | politics; crime, law and justice; unrest, conflicts and war
znanost in tehnologija | 5 | / | science, technology & IT | science and technology
družba | 1 | social life | society; religion; sex; women | social issue; religion and belief; human interest
šport | 6 | sports | sports | sport
vreme | 4 | / | / | weather
zabava | 4 | / | culture & entertainment* | arts, culture and entertainment*
izobraževanje | 1 | / | education | education

Table 1: Comparison of the SLED topic categories with the domestic news portals and foreign sources.

3.6. Results of the user survey

Since Trendi is the first corpus of its kind in the Slovene context, we wanted to design it to match user expectations as closely as possible.
We tested these expectations in December 2021 with a user survey, through which we established which data on current language use the research community wants, and in what form (e.g., various lists, such as candidates for neologisms, the words and phrases whose usage is most prominent in a given period (day, week, month), prominent words and phrases by source, etc.).

The survey was created on the 1KA platform and consisted of 9 questions, 5 of them on content and 4 collecting demographic data (gender, age, field of activity). (A more detailed report on the survey is available on the project website: https://sled.ijs.si/wp-content/uploads/2022/02/SLED_anketa_porocilo_2022-2-03_final.pdf.) It was disseminated via the mailing lists of the Slovene linguistic research community (e.g., SlovLit and the mailing list of the Slovenian Language Technologies Society) and on the social network Facebook (on the official page of the Centre for Language Resources and Technologies of the University of Ljubljana and in informal linguistic user groups such as Prevajalci, na pomoč!).

A total of 100 questionnaires were completed in full. The sample consists predominantly of women (82%), with a smaller share of men (18%). By age, it mainly covers the generations between 26 and 55 (80% of all participants), most of them between 26 and 35 (33%) and between 46 and 55 (32%). Most participants are employed in the public sector (61%) or self-employed (20%); only small shares have student status (3%), are employed in companies (6%), retired (4%) or looking for work (5%). As regards field of activity, where participants could select several options, proofreading (60%) and translation (46%) lead, with high shares also for language research as a hobby (38%), professional and scientific writing (34%), linguistic research (32%), and creative writing and blogging (22%). Various categories of language teaching (Slovene as a first language in primary or secondary school, Slovene as a second or foreign language, linguistics courses at tertiary level) together account for 40%. The sample thus suggests that the survey captured a range of areas of linguistic and research activity. Below we present a more detailed analysis of the answers to the content questions.

3.6.1. Usage scenarios and user interest

Participants indicated which data in a tool monitoring current language use would interest them most, rating their degree of interest (1 – does not interest me at all, 5 – interests me very much) for each of 6 proposed usage scenarios, each illustrated with concrete examples. The scenarios included, for example: which words or phrases are most characteristic of one period compared to another (e.g., which words were used much more frequently in February 2020 than in February 2021); in which period a given word or phrase is most frequent (e.g., was the word "tajkun" really most frequent in 2008–2009?); and whether the use of a word or phrase has recently been rising or falling (e.g., is "epidemija" being used more and more, or less and less?).

The results show that respondents find all the proposed scenarios interesting: for every scenario, the categories "interests me" (4) and "interests me very much" (5) together cover between 74% and 88%. The scenario that stands out most by degree of interest is the possibility of comparing the usage trends of two or more words or phrases (e.g., anticepilec vs. proticepilec); respondents are equally interested in whether the use of a given word or phrase has recently been rising or falling.
A good three quarters of respondents (76%) answered that data on current language use would be useful in their work; only 9% would have no use for such data (15% were undecided). The survey results thus confirm that the linguistic community is interested in data on trends in language use and that there is a real need for a language resource that delivers such data promptly and continuously.

3.6.2. Ways of displaying the data

On a scale from 1 (not important at all) to 5 (very important), respondents also rated the proposed ways of displaying the data: graphs of usage trends, tables with numerical data, lists of words or phrases with rising or falling usage, or other. Combining the shares of the categories "important" (4) and "very important" (5) gives 79% for graphs, 64% for tables with numerical data and 87% for lists of words with rising/falling usage. Respondents are thus most interested in simple lists and least in advanced tables with numerical data.

3.6.3. User suggestions

In an open question, respondents could put forward suggestions or additional scenarios of interest regarding current language use. There were 15 additional suggestions, concerning, for example, the interoperability of the tool with other language resources (e.g., integration into the Slovene morphological lexicon Sloleks and into the corpus of written standard Slovene Gigafida), data access (e.g., access to the data via a public API), the comparison of synonymous variants of words or phrases (e.g., oče vs. ata), the inclusion of usage examples, and the monitoring of longer units (e.g., phrasemes). Most of these suggestions go beyond the scope of the SLED project, but they represent important feedback for thinking about the future development of the monitor corpus and the integration of the data extracted from it into other language resources.

4. Conclusion and future work

In this paper we presented the various activities of the SLED project, with an emphasis on Trendi, the emerging monitor corpus of the Slovene language. We described the methodology of its construction, its content, and the forms in which it is available to users. We also presented the classification of topic categories designed for building a model for the automatic topic categorisation of texts. The final part was devoted to the results of the survey on what data on current language use the interested community expects.

In the coming months we will continue publishing monthly versions of the corpus, prepare the first statistical computations, and finish and evaluate the algorithm for the automatic categorisation of texts. Importantly, we have devoted considerable time to establishing automatic procedures for text preparation and computation, which will speed up the updating of the data in the concordancers and in the CLARIN.SI repository. Another key activity is the improvement of the text-acquisition procedure, which will remove certain shortcomings of the current method. Since a close link will be established between the Trendi corpus and the reference corpus Gigafida, every improvement of the procedures will benefit both corpora.

With the Trendi corpus, the Slovene language infrastructure has gained an important resource that will be relevant both to the research community and to the wider public.

5. Acknowledgements

The SLED project (Monitor corpus and accompanying data resources) is funded by the Ministry of Culture of the Republic of Slovenia as part of the public call for (co-)funding projects dedicated to building and updating infrastructure for the Slovene language in the digital environment 2021–2022. The research programmes no. P6-0411 (Language Resources and Technologies for Slovene) and no. P6-0215 (The Slovene Language – Basic, Contrastive and Applied Studies) were co-funded by the Slovenian Research Agency from the state budget.

6. References
Mark Davies. 2008–. The Corpus of Contemporary American English (COCA). https://www.english-corpora.org/coca/.
Mark Davies. 2016–. Corpus of News on the Web (NOW). https://www.english-corpora.org/now/.
Mark Davies. 2019–. The Coronavirus Corpus. https://www.english-corpora.org/corona/.
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek. 2017. The Universal Dependencies Treebank for Slovenian. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, BSNLP@EACL 2017, pp. 33–38.
Tomaž Erjavec, Darja Fišer, Simon Krek and Nina Ledinek. 2010. The JOS Linguistically Tagged Corpus of Slovene. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10).
Ondrej Herman. 2013. Automatic methods for detection of word usage in time. Diploma thesis. Masaryk University, Faculty of Informatics.
Armand Joulin, Edouard Grave, Piotr Bojanowski and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv. https://arxiv.org/abs/1607.01759.
Adam Kilgarriff, Pavel Rychlý, Pavel Smrz and David Tugwell. 2004. The Sketch Engine. In: G. Williams and S. Vessier, eds., Proceedings of the Eleventh EURALEX International Congress, Lorient, France, pp. 105–116. Lorient: Université de Bretagne Sud.
Kristina Koppel and Jelena Kallas. (in press). Eesti keele ühendkorpuste sari 2013–2021: mahukaim eestikeelsete digitekstide kogu. Eesti Rakenduslingvistika Ühingu aastaraamat.
Iztok Kosem, Simon Krek, Polona Gantar, Špela Arhar Holdt and Jaka Čibej. 2021. Language monitor: tracking the use of words in contemporary Slovene. In: I. Kosem, M. Cukr, M. Jakubíček, J. Kallas, S. Krek and C. Tiberius, eds., Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference, 5–7 July 2021, virtual, pp. 514–527. Brno: Lexical Computing CZ, s.r.o. https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_33_pp514-528.pdf.
Simon Krek et al. 2019. Corpus of Written Standard Slovene Gigafida 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1320.
Luka Krsnik et al. 2019. Corpus extraction tool LIST 1.2. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1276.
Michal Křen, Václav Cvrček, Tomáš Čapka, Anna Čermáková, Milena Hnátková, Lucie Chlumská, Tomáš Jelínek, Dominika Kováříková, Vladimír Petkevič, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Truneček, Pavel Vondřička and Adrian Jan Zasina. 2016. SYN2015: Representative Corpus of Contemporary Written Czech. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 2522–2528, Portorož, Slovenia. European Language Resources Association (ELRA).
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pp. 29–34.
Nikola Ljubešić and Tomaž Erjavec. 2018. Word embeddings CLARIN.SI-embed.sl 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1204.
Nataša Logar, Tomaž Erjavec, Simon Krek, Miha Grčar and Peter Holozan. 2013. Written corpus ccGigafida 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1035.
Tomáš Machálek. 2020. KonText: Advanced and Flexible Corpus Query Interface. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 7003–7008. https://www.aclweb.org/anthology/2020.lrec-1.865.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.
Pavel Rychlý. 2007. Manatee/Bonito – A Modular Corpus Manager. In: RASLAN, pp. 65–70.
Mitja Trampuš and Blaž Novak. 2012. The Internals of an Aggregated Web News Feed. In: Proceedings of the 15th Multiconference on Information Society 2012 (IS-2012). http://ailab.ijs.si/dunja/SiKDD2012/Papers/Trampus_Newsfeed.pdf.
Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model. In: Proceedings of the 24th International Multiconference – IS2021 (SiKDD). https://ailab.ijs.si/dunja/SiKDD2021/Papers/Ulcar+Robnik.pdf.
Katja Zupan, Nikola Ljubešić and Tomaž Erjavec. Smernice za označevanje imenskih entitet v slovenskem jeziku. https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1238/SlovenianNER-slv-v1.0.pdf?sequence=7&isAllowed=y.

Automatic Text Analysis in Language Assessment: Developing a MultiDis Web Application

Sara Košutar*, Dario Karl‡, Matea Kramarić*, Gordana Hržica*

* Faculty of Education and Rehabilitation Sciences, University of Zagreb, University Campus Borongaj, Borongajska cesta 83 f, 10 000 Zagreb
sara.kosutar@erf.unizg.hr, matea.kramaric@erf.unizg.hr, gordana.hrzica@erf.unizg.hr
‡ Department of Data Science, InSky Solutions, Medačka ulica 18, 10 000 Zagreb
dario.karl.sl@gmail.com

Abstract
Language sample analysis provides rich information about the language abilities evident in the written or spoken text produced by a speaker in response to a language task. It is generally used to assess the abilities of children during language acquisition, but also the abilities of adult speakers across the lifespan. Its wide range of uses also allows for the assessment of language abilities in educational contexts such as second language acquisition or fluency, and the abilities of bilingual speakers in general, and it is also used for diagnosis in speech and language pathology. Various computer programs have been developed to assist in language sample analysis.
However, these programs have been developed mainly for English and are often not fully open-access, or they do not provide data on population metrics, the history of data uploaded by a user, and/or improvements in basic language measures. The time needed for transcription and the linguistic knowledge required for manual analysis are considered to be the main obstacles to its implementation. The goal of this paper is to present MultiDis, a web-based application intended for the analysis of language samples at the microstructural text level in Croatian. The application is still under development, but the current version fulfils its main purpose: it enables the (semi-)automatic calculation of measures reflecting language productivity, lexical diversity, syntactic complexity, and discourse cohesion in spoken language, and provides users with socio-demographic and linguistic metadata as well as the history of uploaded transcripts. We will present the challenges we have faced in developing the application (e.g., annotation system, text standardisation), future improvements we plan to make to the application (e.g., syntactic parsing, speech-to-text, multilingual analysis), and the possibilities of its use in the wider scientific and professional community.

1. Language sample analysis
Language sample analysis provides rich information about the language abilities in the written or spoken text produced by a speaker in response to a language task, e.g. storytelling, a written essay, the description of a picture, answering questions, etc. It is an ecologically valid means of language assessment that can be used alongside standardised language tests because it provides data that tests cannot. Compared to standardised tests, language sample analysis has greater ecological validity because it reflects the natural everyday situation of language production. Consequently, it allows for a more in-depth analysis of specific morphosyntactic, semantic, and pragmatic features. Due to its lower bias, it has proved more suitable for studying regional variation and dialects than standard questionnaires (e.g., Samardžić and Ljubešić, 2021). Language sample analysis is generally used to assess children's abilities during language acquisition, but also adult speakers' abilities across the lifespan (e.g., Westerveld et al., 2004). Its wide range of uses allows for the assessment of language abilities in educational contexts such as second language acquisition or fluency (e.g., de Clercq and Housen, 2017), the abilities of bilingual speakers in general (e.g., Gagarina et al., 2016), and it is also used for diagnosis in speech and language pathology (e.g., Justice et al., 2006). This type of analysis is widely used in some countries, but in many others scientists and professionals are unaware of its benefits or find it too complex and time-consuming (see Heilmann, 2010; Klatte et al., 2022).

The process of collecting language samples involves several steps. First, a speaker is given a language task, for example telling a story based on a picture, and is recorded while performing this task. The recordings are then transcribed using special codes and divided into smaller units of analysis, e.g., communication units (C-units; see Labov and Waletzky, 1967). Special codes mark different features of the spoken language or deviations (e.g., repetitions, omissions of vowels, use of regionally marked words, morphosyntactic errors, etc.). When written language samples are collected, the speaker responds to the task in writing, but all further steps are the same. Once the transcripts are produced, they can be analysed in various computer programs that enable the (semi-)automatic calculation of different language measures.

Language sample analysis provides information about language abilities at two levels of text structure (Gagarina et al., 2012). The first is the microstructural level, which refers to the internal linguistic organisation and includes text length, vocabulary use, morphosyntax, cohesive devices, etc. At the microstructural level, one can observe, for example, which language structures have emerged during language acquisition or how complex they are in terms of their internal features. The macrostructural analysis allows for assessing the hierarchical organisation of the text (e.g., in storytelling, whether the speaker has expressed a goal, an attempt, an outcome, etc.). At the macrostructural level, one can examine how successfully a speaker connects sentences according to a language task. By examining these elements, one gains insight into the quality of an individual's language when performing a particular language task, but also, indirectly, information on her or his language skills in general.

1.1. Language measures
Different aspects of microstructure correspond to several dimensions, such as productivity, lexical diversity, and syntactic complexity. A set of (semi-)automatic measures has been proposed to assess language abilities at the microstructural level.

Productivity refers to the amount of language (words or utterances) produced (Leadholm and Miller, 1992). Measures of productivity include the total number of C-units or the total number of words (TNW). C-units are often used instead of utterances in spoken language analysis (see MacWhinney, 2000). The basic criteria for dividing a sequence of spoken words into utterances are intonation and pauses. However, transcribers may rate utterances differently against these criteria, which results in lower inter-rater reliability (Stockman, 2010). C-units consist of one or more clauses. A clause is any syntactic unit consisting of at least one predicate. A complex sentence with one or more dependent clauses constitutes one C-unit, while a compound sentence is divided into two or more C-units, depending on the number of independent clauses. Studies have shown that measures of productivity can distinguish children with typical language status from children with developmental language disorders (DLD; Wetherell et al., 2007), bilingual from monolingual children (Hržica and Roch, 2021), and adult speakers according to their language skills (Nippold et al., 2017).

Measures of lexical diversity are used to assess vocabulary abilities: the more diverse the vocabulary produced, the greater the lexical diversity. Measuring lexical diversity is more complex and therefore methodologically challenging. Traditional measures include the number of different words (NDW; Miller, 1981) and the type-token ratio (TTR; Templin, 1957). Types and tokens can easily be calculated automatically, whereas lemmas are more difficult to calculate automatically and require specialized natural language processing; in particular, morphological analyses such as lemmatisation, part-of-speech (POS) tagging, or morphological segmentation. In languages with rich morphology, the lemma-token ratio would be more appropriate, but due to the time-consuming nature of the task this has rarely been done (see Balčiūnienė and Kornev, 2019). Another problem with measures of lexical diversity and measures of productivity is that they are affected by the length of a language sample (Malvern et al., 2004; McCarthy, 2005).

To overcome these limitations, alternative measures have been developed, such as D (Malvern and Richards, 1997) and the moving average type-token ratio (MATTR; Covington and McFall, 2010). The measure D is based on modelling the decrease in TTR with the increasing size of the language sample using mathematical algorithms. MATTR calculates TTR for text windows of a fixed size, e.g., 500 words: the window moves through the text, calculating TTR for words 1–500, 2–501, etc., and at the end of the text all TTRs are averaged to determine the final score. However, it is not yet clear which of these measures provides more reliable results, as the results of validation studies vary (see deBoer, 2014; Fergadiotis et al., 2015). Regardless of methodological limitations, these measures can distinguish the abilities of children and adults with typical language status from those of children or adults with DLD (e.g., Hržica et al., 2019; Kapantzoglou et al., 2019). Measures of lexical diversity have also been found to correlate with standardised vocabulary tests in bilingual children (e.g., Hržica and Roch, 2021).

Syntactic complexity refers to the range of syntactic structures and the degree of sophistication of these structures in language production (Ortega, 2003). It is usually measured by calculating the average length of the C-unit. The length of the C-unit increases when there is a dependent clause or when the syntax within the clause is more complex, for example when the clause is extended by adding attributes, appositions, or adjectives. Measures of syntactic complexity have been shown to distinguish between different groups of speakers, including children with DLD and adults of different ages (e.g., Rice et al., 2010; Nippold et al., 2017). In addition to the average length of syntactic units, other commonly used measures of syntactic complexity include clausal density (i.e., the total number of main and subordinate clauses divided by the total number of C-units) and the mean length of clause (main or subordinate) (e.g., Scott and Stokes, 1995; Norris and Ortega, 2009). Because of the variety of measures and the different methods of calculation, little is known about which measures are appropriate with respect to typological differences between languages, and some of these measures are not always automatic.

In the last decades of the 20th century, various computer programs were developed to support language sample analysis (overview: Pezold et al., 2020), but they are often not user-friendly. More recently, web-based programs have been introduced that allow for the analysis of language use at different linguistic levels (e.g., Coh-Metrix; McNamara et al., 2014). The measures are based on basic calculations (e.g., TTR, MLU), but there are also advanced measures based on language technologies such as the annotation of morphological, syntactic, and semantic features. Such applications are mainly developed for English or other widely spoken languages and are often not fully open-access. There is an increasing awareness of the importance of language sample analysis as a complementary method in language assessment. The time needed for transcription and the linguistic knowledge required for manual analysis are considered to be the main obstacles to its implementation (Pezold et al., 2020). Therefore, the development of a tool for the automatic calculation of language measures could make naturalistic language assessment more feasible.

2. Goal of the paper
The goal of this paper is to present MultiDis, a web-based application intended for the analysis of language samples at the microstructural level in Croatian, which enables the (semi-)automatic calculation of measures reflecting language productivity, lexical diversity, syntactic complexity, and discourse cohesion in spoken and written language. We will present the challenges we have faced in developing the application, the future improvements we plan to make to it, and the possibilities of its use in the wider scientific and professional community.

3. Development of the MultiDis web application
Existing computer-based resources used to analyse children's or adults' language abilities are either developed for English only or do not provide data on population metrics, the history of data uploaded by a user, and/or improvements in basic language measures such as NDW or TTR. Computerized Language Analysis (CLAN; MacWhinney, 2000), for example, is a freely available desktop application whose users are expected to have a high level of language and transcription expertise. Text Inspector (2018), on the other hand, is a web-based application, but it is designed only for the analysis of English texts, and its target users are mainly first or second language acquisition teachers. We aim to develop a web-based application that fosters the analysis of language samples in Croatian. Our target users work at least partly with spoken language (e.g., language diagnostics performed by speech and language pathologists), so the application should support both written and spoken language analysis. The application is currently being developed, and we present the coding system, language resources, data collection and language measures that have been implemented so far.

3.1. Annotation codes
Considering that our target users mostly work with spoken language, there are several codes which can be used to annotate the data. Computer programs for language analysis such as CLAN (MacWhinney, 2000) have an entire system of very specific annotation codes. In the MultiDis web application, a new and simpler system of annotation codes was developed to provide a faster and more organised annotation process. The system of codes was designed to include several categories with individual codes and subsets of codes. The main idea is to have a system of annotation codes that can be changed over time according to the following criteria:
- hierarchical (with categories and subcategories of codes),
- extensible (adding new categories and codes),
- easily customizable (each category has a recognizable first character).

To date, the following categories have been established: phonotactic codes include conversation markers and elements of communication; citation codes indicate references to another utterance within the language sample; phonetic codes indicate pronunciation and other elements specific to spoken language; sociolinguistic codes indicate dialectisms, neologisms, foreign words, etc.; and correction codes indicate errors made at a particular level of linguistic structure – phonological, morphosyntactic and/or lexical. There is also an additional code for corrections: a marker that can be used to exclude a particular segment from the transcript and provide a correct or standardised form that the application will use to standardise any text before moving on to a later stage of language analysis. A full description of the codes is available on the web page of the application: http://www.multidis.com.hr/statistics/.

An example of multiple annotation codes would be the sentence in (1), which would look like (2) in the uploaded transcript. Angle brackets point to a segment that needs to be excluded, and round brackets point to a 'standardised' form of that segment. In addition, the @d code preceding the token ćuko 'dog' marks a dialectism. The application will convert the sentence in (2) into the standardised form, i.e. the sentence in (3), mapping the dialectism and providing this information in the final analysis report.

(1) Dečko i ćuko su ulovili žabicu. 'The boy and the dog caught the frog'
(2) Dečko i <ćuko> (@d pas) su ulovili žabicu. 'The boy and the dog caught the frog'
(3) Dečko i pas su ulovili žabicu. 'The boy and the dog caught the frog'

The annotation system and parsing rules for the transcripts were implemented using common regular expressions (regex) in Python (Van Rossum, 2020). Regular expressions allow the system to recognise specific codes, save the data and convert the language into a standard form, so that existing language resources, such as tokenizers and lemmatizers, achieve a higher hit rate and precision. After annotation and parsing, the application provides a standardised language text on which further language sample analysis is performed.

3.2. Language resources
The next step in the development of the application was the integration of an open-source Python library. We started with Stanza (Qi et al., 2020) to solve the following tasks common in natural language processing:
- lemmatisation,
- POS tagging,
- syntactic parsing (sentence and clause segmentation).

In the early stages of developing the MultiDis web application, one of the main linguistic resources used was Stanza, a Python natural language processing toolkit for human language developed at Stanford University (Qi et al., 2020). Stanza enables quick out-of-the-box processing of multilingual texts. Since we plan to test our use case – based on the analysis of children's spoken language – on multiple languages, Stanza has an advantage over several other natural language processing models, frameworks and neural pipelines, such as Podium (Tutek et al., 2021), CLASSLA (Ljubešić and Dobrovoljc, 2019) or BERTić (Ljubešić and Lauc, 2021). Lemmatisation and POS tagging are fairly accurate (> 85% of cases) and do not interfere with the computation of the currently implemented language measures, though the process of delimiting the boundaries of C-units has been an obstacle that is currently being resolved. We are also exploring other options and planning further analysis and accuracy testing for this task. Since the language samples that the application will analyse are non-literary texts, we also plan to explicitly compare the aforementioned tools on the tasks of lemmatisation, POS tagging and morphosyntactic description (MSD) using our datasets, to improve the application's baseline accuracy in these tasks. The standard for POS tagging is the MULTEXT-East language resources (Erjavec, 2010), version 4, for the Croatian language. In this way, the token ćuko 'dog' is annotated as a dialectism using the annotation codes during transcript parsing, and the standardised form pas 'dog' receives the morphosyntactic tag Ncmsn (common noun, masculine, singular, nominative case).

3.3. Data collection – manual annotation of transcripts with the new coding system
In the next step of developing the MultiDis web application, it was important to test the annotation system and the parsing of the language samples, as the aim was to obtain a standardised text with the data on the participants' socio-demographic and language characteristics, parsed with the appropriate annotation codes and available to the user along with the morphosyntactic data. Before running the analysis, the texts were manually transcribed by students and volunteers within the courses Computer Analysis of Child Language and Volunteering at the Department of Speech and Language Pathology at the Faculty of Education and Rehabilitation Sciences, University of Zagreb. The test transcripts are the result of a storytelling task, mostly Frog, where are you? (Mayer, 1969) and the Multilingual Assessment Instrument for Narratives (MAIN; Gagarina et al., 2012; Gagarina et al., 2019; Hržica and Kuvač Kraljević, 2012, 2020). After the implementation of the annotation codes, these transcripts were successfully standardised and prepared for the final analysis. Any other transcript can be uploaded to the application, and a user receives data only about their own uploaded transcripts and not about the transcripts of other users.

3.4. Automation of language measures
Using the standardised text and the language data provided in the previous step of the analysis, the next task of the MultiDis web application is to provide users with a detailed analysis of language measures. It is important to note that the measures are currently calculated intertextually, but we plan to compare individual results with population results, as well as with baseline data. The application incorporates diverse measures that can be used in language assessment, covering productivity, lexical diversity, syntactic complexity and discourse cohesion. The list of language measures included in the MultiDis web application is given in Table 1.
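The regex-based recognition and standardisation of annotation codes described in Section 3.1 can be illustrated with a minimal Python sketch. This is our own illustrative code, not the application's actual implementation: the pattern and the standardise function are hypothetical, and cover only the exclusion/replacement markup shown in examples (1)–(3).

```python
import re

# Illustrative subset of the annotation scheme from Section 3.1:
# <segment> marks a span to exclude, and (@X form) supplies the
# standardised replacement, where @X is an annotation code (e.g. @d = dialectism).
PATTERN = re.compile(r"<(?P<orig>[^>]+)>\s*\((?P<code>@\w+)\s+(?P<std>[^)]+)\)")

def standardise(transcript: str):
    """Return the standardised text and a log of (code, original, standard) triples."""
    annotations = []
    def repl(match):
        annotations.append((match.group("code"),
                            match.group("orig"),
                            match.group("std")))
        return match.group("std")  # substitute the standardised form
    return PATTERN.sub(repl, transcript), annotations

text = "Dečko i <ćuko> (@d pas) su ulovili žabicu."
std, notes = standardise(text)
print(std)    # Dečko i pas su ulovili žabicu.
print(notes)  # [('@d', 'ćuko', 'pas')]
```

A production system would additionally have to handle the other code categories (phonotactic, citation, phonetic, correction) and malformed or nested markers; the sketch only shows how a regex pass can both normalise the text and preserve the annotation for the final report.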
Category | Measure | Description
Language productivity | Number of communication units (NCU) | The total number of communication units
Language productivity | Total number of words (TNW) | The total number of tokens (repeated tokens are excluded)
Lexical diversity | Number of different words (NDW) | The total number of word forms – types
Lexical diversity | Type-token ratio (TTR) | The total number of types divided by the total number of tokens
Lexical diversity | Index of lexical diversity D* | Based on the VOCD algorithm; calculates the probability of the next token in a sequence based on an arbitrarily chosen n-token sample from the text
Lexical diversity | Moving average type-token ratio (MATTR) | Based on a window length pre-defined by the user, the text is divided into segments and the TTR is calculated for each window; the average TTR of the segments is the MATTR score
Syntactic complexity | Mean length of the communication unit | The total number of words divided by the total number of communication units
Syntactic complexity | Clausal density | The total number of main and subordinate clauses divided by the total number of communication units
Syntactic complexity | Mean length of clause | The total number of tokens divided by the total number of clauses
Discourse cohesion | Ratio of connectives | The total number of connectives divided by the total number of C-units
Discourse cohesion | Ratio of different connectives** | The total number of one type of connective divided by the total number of all other types of connectives in the text

Table 1: List of language measures implemented in the MultiDis web application (* being tested; ** in the process of implementation).
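To make the lexical diversity definitions in Table 1 concrete, the following minimal Python sketch (illustrative only; the function names are hypothetical and whitespace tokenisation is assumed) computes TTR and MATTR. The default window of 10% of the tokens mirrors the application's default window size described in the text.

```python
def ttr(tokens):
    """Type-token ratio: number of types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=None):
    """Moving-average TTR: average TTR over a sliding window.
    Default window = 10% of the tokens (the application's default)."""
    if window is None:
        window = max(1, len(tokens) // 10)
    if window >= len(tokens):
        # Window covers the whole sample: MATTR degenerates to plain TTR.
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

tokens = "the dog chased the cat and the cat ran".split()
print(round(ttr(tokens), 3))       # 0.667  (6 types / 9 tokens)
print(round(mattr(tokens, 4), 3))  # 0.875
```

The degenerate case in the sketch shows why a fixed 500-word window is problematic for short samples: whenever the window is at least as long as the sample, MATTR collapses to TTR, which is what the adjustable 10% default is designed to avoid.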
The process of automatic analysis of the language measures is based on the precise segmentation of C-units and clauses, as well as on the results of tokenisation and lemmatisation. Each simple sentence (e.g., The dog is playing with the frogs), each complex sentence containing a subordinate clause or a parenthetical phrase (e.g., When the dog chased the cat away, the birds were happy), and each clause of a compound sentence was considered one C-unit (e.g., One goat is in the water and the other is grazing grass). Given that we need 100% accuracy on this task, at this stage we are still in the process of developing an automatic way of detecting connectives in the text as well as clause delimiters. Thus, a user still has to manually divide the text into C-units following the above-mentioned criteria before uploading a language sample to the application. This also means that the user can change any automatically parsed C-unit. Collecting a larger amount of data will make it possible to train and apply an appropriate machine learning model to enable the automatic segmentation of C-units and clauses.

At the current stage of development, a user can obtain the results of all available language measures based on C-unit segmentation, as well as the morphosyntactic data and the data provided by the annotation codes. It is important to note that the MATTR measure does not have a fixed window length; instead, the default window size is 10% of the total number of tokens, and the user can manually adjust it. In this way, we have avoided the possibility of MATTR results being identical to TTR results for language samples with fewer than 500 tokens, and we have allowed the user to define the best window size for this measure. Measure D and the number of different connectives are currently being implemented and tested before these results are made available to users. The remaining measures listed in Table 1 have been successfully implemented.

4. Technical specifications of the MultiDis web application
The MultiDis web application is deployed on a Croatian Academic and Research Network (CARNET) server as a monolithic Docker service. All requests are first forwarded to an Nginx service for the static files and only then to the application itself via a Gunicorn service (a Python Web Server Gateway Interface HTTP server). The application and the entire backend logic are written in the Python programming language (Van Rossum, 2020) within the Django web framework. All data is stored in a MySQL database instance on the server. As mentioned earlier, a Stanza PyTorch model (Qi et al., 2020) is run with the application to infer the language data and provide morphosyntactic information. Other open-source libraries and packages used are python-docx, NumPy and Pandas.

The application is designed so that each segment can be improved without compromising our main goals or the user's experience. In this sense, we can also include written language samples, provide new annotation codes and categories for written language, or implement measures that are used only in the analysis of adult language. Lemmatisation and POS tagging can be improved by replacing the existing model with a new, customized, open-source model that can be extended to languages other than Croatian.

5. Future extensions
The MultiDis web application is still under development, but the current version fulfils its main purpose: it allows for the (semi-)automatic analysis of spoken language, and provides users with socio-demographic and linguistic metadata as well as the history of uploaded transcripts. In addition to the implementation of a service for the automatic determination of C-unit and clause boundaries, additional data will be made available to users, such as the analysis of Croatian dialects and reference data for language measures, at least for some populations and some text types. Several other options are also being considered, such as the fully automatic parsing of the original language sample without the manual annotation codes and an experimental speech-to-text service. As the tools and resources used to develop this application are also available for other languages, the application could be scaled for multilingual analysis, preferably in collaboration with other researchers.

6. Conclusion
The MultiDis web application is freely available at http://www.multidis.com.hr/ and can be used by linguists, speech and language pathologists, teachers, etc., to assess the language abilities of both children and adult speakers of Croatian. It can help clinicians and educators in language sample analysis by resolving some of the main obstacles to its use. A simpler coding system fosters transcription, and the future development of speech-to-text could ease this process even further. Automatic lemmatisation and morphological tagging save time and enable a more precise calculation of language measures. The language measures included in the application were selected based on previous research and adequately reflect the different aspects of the participants' language abilities. The MultiDis web application thus supports its users by reducing both the transcription time and the linguistic knowledge required to technically perform the analysis.

7. Acknowledgements
This work was supported by the Croatian Science Foundation under the project Multilevel approach to spoken discourse in language development (UIP-2017-05-6603), by the Arts and Humanities Research Council under the Feast and Famine Project: Confronting Overabundance and Defectivity in Language (AH/T002859/1), and by the COST Action NexusLinguarum – European network for Web-centred linguistic data science (CA18209). Sara Košutar was supported by the Young Researchers' Career Development project – Training of New Doctoral Students. Any opinions, findings, conclusions, or recommendations presented in this manuscript are those of the author(s) and do not necessarily reflect the views of the Croatian Science Foundation.
8. References
Ingrida Balčiūnienė and Aleksandr N. Kornev. 2019. Evaluation of narrative skills in language-impaired children. Advantages of a dynamic approach. In: E. Aguilar-Mediavilla, L. Buil-Legaz, R. López-Penadés, V. A. Sanchez-Azanza and D. Adrover-Roig, eds., Atypical Language Development in Romance Languages, pages 127–414. John Benjamins Publishing Company, Amsterdam and Philadelphia.
Michael A. Covington and Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average Type–Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100.
Bastien de Clercq and Alex Housen. 2017. A Cross-Linguistic Perspective on Syntactic Complexity in L2 Development: Syntactic Elaboration and Diversity. The Modern Language Journal, 101(2):315–334.
Fredrik deBoer. 2014. Evaluating the comparability of two measures of lexical diversity. System, 47:139–145.
Tomaž Erjavec. 2010. MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 2544–2547, Valletta, Malta.
Gerasimos Fergadiotis, Heather Harris Wright and Samuel B. Green. 2015. Psychometric Evaluation of Lexical Diversity Indices: Assessing Length Effects. Journal of Speech, Language, and Hearing Research, 58(3):840–852.
Natalia Gagarina, Daleen Klop, Sari Kunnari, Koula Tantele, Taina Välimaa, Ingrida Balčiūnienė, Ute Bohnacker, and Joel Walters. 2012. MAIN: Multilingual assessment instrument for narratives. ZAS Papers in Linguistics, 56:1–155.
Natalia Gagarina, Daleen Klop, Sari Kunnari, Koula Tantele, Taina Välimaa, Ute Bohnacker, and Joel Walters. 2019. MAIN: Multilingual Assessment Instrument for Narratives – Revised. ZAS Papers in Linguistics, 63:1–21.
Natalia Gagarina, Daleen Klop, Ianthi M. Tsimpli, and Joel Walters. 2016. Narrative abilities in bilingual children. Applied Psycholinguistics, 37(1):11–17.
John J. Heilmann. 2010. Myths and Realities of Language Sample Analysis. Perspectives on Language Learning and Education, 17(1):4–8.
Gordana Hržica, Sara Košutar, and Matea Kramarić. 2019. Rječnička raznolikost pisanih tekstova osoba s razvojnim jezičnim poremećajem [Lexical diversity in written texts of persons with developmental language disorder]. Hrvatska revija za rehabilitacijska istraživanja, 55(2):14–30.
Gordana Hržica and Jelena Kuvač Kraljević. 2012. MAIN – hrvatska inačica: Višejezični instrument za ispitivanje pripovijedanja [MAIN – Croatian version: Multilingual Assessment Instrument for Narratives]. ZAS Papers in Linguistics, 56:201–218.
Gordana Hržica and Jelena Kuvač Kraljević. 2020. The Croatian adaptation of the Multilingual Assessment Instrument for Narratives. ZAS Papers in Linguistics, 64:37–44.
Gordana Hržica and Maja Roch. 2021. Lexical diversity in bilingual speakers of Croatian and Italian. In: S. Armon-Lotem and K. K. Grohmann, eds., LITMUS in Action: Cross comparison studies across Europe, Trends in Language Acquisition Research (TILAR), pages 100–129. John Benjamins Publishing Company, Amsterdam.
Laura M. Justice, Ryan P. Bowles, Joan N. Kaderavek, Teresa A. Ukrainetz, Sarita L. Eisenberg, and Ronald B. Gillam. 2006. The Index of Narrative Microstructure: A Clinical Tool for Analyzing School-Age Children's Narrative Performances. American Journal of Speech-Language Pathology, 15(2):177–191.
Maria Kapantzoglou, Gerasimos Fergadiotis, and Alejandra Auza Buenavides. 2019. Psychometric evaluation of lexical diversity indices in Spanish narrative samples from children with and without developmental language disorder. Journal of Speech, Language, and Hearing Research, 62(1):70–83.
Inge S. Klatte, Vera van Heugten, Rob Zwitserlood, and Ellen Gerrits. 2022. Language Sample Analysis in Clinical Practice: Speech-Language Pathologists' Barriers, Facilitators, and Needs. Language, Speech, and Hearing Services in Schools, 53(1):1–16.
William Labov and Joshua Waletzky. 1967. Narrative analysis: Oral versions of personal experience. In: J. Helm, ed., Essays on the verbal and visual arts, pages 3–38. University of Washington Press, Seattle and London.
Barbara J. Leadholm and Jon F. Miller. 1992. Language sample analysis: The Wisconsin guide. Wisconsin State Department of Public Instruction, Madison.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy. Association for Computational Linguistics.
Nikola Ljubešić and Davor Lauc. 2021. BERTić – The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, Kyiv, Ukraine. Association for Computational Linguistics.
Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk: Transcription format and programs (3rd ed.). Lawrence Erlbaum Associates Publishers, Mahwah, NJ.
David Malvern and Brian Richards. 1997. A new measure of lexical diversity. In: A. Ryan and A. Wray, eds., Evolving models of language, pages 58–71. Multilingual Matters, Clevedon.
David Malvern, Brian Richards, Ngoni Chipere, and Pilar Durán. 2004. Lexical Diversity and Language Development. Quantification and Assessment. Palgrave Macmillan, London.
Mercer Mayer. 1969. Frog, where are you? Dial Press, New York.
Phillip M. McCarthy. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). PhD thesis, University of Memphis.
Danielle S. McNamara, Arthur C. Graesser, Phillip M. McCarthy, and Zhiqiang Cai. 2014. Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press, New York.
Jon F. Miller. 1981. Assessing language production in children: experimental procedures. University Park Press, Baltimore.
Marilyn A. Nippold, Laura M. Vigeland, Megan W. Frantz-Kaspar, and Jeannene M. Ward-Lonergan. 2017. Language Sampling With Adolescents: Building a Normative Database With Fables. American Journal of Speech-Language Pathology, 26(3):908–920.
John M. Norris and Lourdes Ortega. 2009. Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30(4):555–578.
Lourdes Ortega. 2003. Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24(4):492–518.
Mollee J. Pezold, Caitlin M. Imgrund, and Holly L. Storkel. 2020. Using Computer Programs for Language Sample Analysis. Language, Speech, and Hearing Services in Schools, 51(1):103–114.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Stroudsburg, PA. Association for Computational Linguistics.
Mabel L. Rice, Filip Smolik, Denise Perpich, Travis Thompson, Nathan Rytting, and Megan Blossom. 2010. Mean length of utterance levels in 6-month intervals for children 3 to 9 years with and without language impairments. Journal of Speech, Language, and Hearing Research, 53(2):333–349.
Tanja Samardžić and Nikola Ljubešić. 2021. Data Collection and Representation for Similar Languages, Varieties and Dialects. In: M. Zampieri and P. Nakov, eds., Similar Languages, Varieties, and Dialects: A Computational Perspective, Studies in Natural Language Processing, pages 121–137. Cambridge University Press, Cambridge.
Cheryl M. Scott and Sharon L. Stokes. 1995. Measures of syntax in school-age children and adolescents. Language, Speech, and Hearing Services in Schools, 26(4):309–319.
Ida J. Stockman. 2010. Listener reliability in assigning utterance boundaries in children's spontaneous speech. Applied Psycholinguistics, 31(3):363–395.
Mildred C. Templin. 1957. Certain language skills in children; their development and interrelationships. University of Minnesota Press, Minneapolis.
Text Inspector. 2018. Online lexis analysis tool at textinspector.com.
Martin Tutek, Filip Boltužić, Ivan Smoković, Mario Šaško, Silvije Škudar, Domagoj Pluščec, Marin Kačan, Dunja Vesinger, Mate Mijolović, and Jan Šnajder. 2021. Podium: a framework-agnostic NLP preprocessing toolkit. GitHub repository. https://github.com/TakeLab/podium
Guido Van Rossum. 2020. The Python Library Reference, release 3.8.2. Python Software Foundation. https://py.mit.edu/_static/spring21/library.pdf
Marleen F. Westerveld, Gail Gillon, and Jon F. Miller. 2004. Spoken language samples of New Zealand children in conversation and narration. Advances in Speech Language Pathology, 6(4):195–208.
Danielle Wetherell, Nicola Botting, and Gina Conti-Ramsden. 2007. Narrative in adolescent specific language impairment (SLI): a comparison with peers across two different narrative genres. International Journal of Language & Communication Disorders, 42(5):583–605.
Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

Assessing Comparability of Genre Datasets via Cross-Lingual and Cross-Dataset Experiments

Taja Kuzman†∗, Nikola Ljubešić†, Senja Pollak†
†Department of Knowledge Technologies, Jožef Stefan Institute
∗Jožef Stefan International Postgraduate School
taja.kuzman@ijs.si, nikola.ljubesic@ijs.si, senja.pollak@ijs.si

Abstract
This article explores the comparability of an English and a Slovene genre-annotated dataset via monolingual and cross-lingual experiments, performed with two Transformer models. In addition, we analyze whether translating the Slovene dataset into English with a machine translation system improves monolingual and cross-lingual performance. Results show that cross-lingual transfer is possible despite the differences between the datasets in terms of genre schemata and corpora construction methods. Furthermore, the XLM-RoBERTa model provides good results in both settings already when learning on fewer than 1,000 instances. In contrast, the trilingual CroSloEngual BERT model proved to be less suitable for this text classification task. Moreover, the results reveal that although the English dataset is 40 times larger than the Slovene dataset, it provides similar or worse classification results.

1. Introduction
Texts in datasets can be grouped by genres based on their common function, form and the author's purpose (Orlikowski and Yates, 1994). Labeling texts with genres allows for a deeper insight into the composition and quality of a web corpus that was collected with automatic means, more efficient queries in information retrieval tools (Vidulin et al., 2007), as well as improvements of various language technology tasks, such as part-of-speech tagging (Giesbrecht and Evert, 2009) and machine translation (Van der Wees et al., 2018). That is why automatic genre identification (AGI) has been the subject of numerous studies in the computational linguistics and information retrieval fields (e.g., see Egbert et al. (2015) and Sharoff (2018)).

As in other text classification tasks, a large manually annotated dataset is required in AGI in order to train and test a classifier. While there exist some large English genre-annotated datasets, such as the Corpus of Online Registers of English (CORE) (Egbert et al., 2015) with 53,000 texts and the Leeds Web Genre Corpus (Asheghi et al., 2016) with 5,000 texts, for other languages there is either no dataset or mostly a small one, consisting of 1,000 to 2,000 texts, such as the genre-annotated corpora for Russian (Sharoff, 2018), Finnish (Laippala et al., 2019), and Swedish and French (Repo et al., 2021). This means that obtaining the large dataset needed for genre identification in other languages still requires costly and time-consuming annotation campaigns, leaving most languages under-resourced with regard to technologies based on AGI.

However, it might be possible to overcome this obstacle by leveraging cross-lingual transfer, applying models trained on high-resource languages to low-resource languages. Recently, Repo et al. (2021) showed that it is possible to achieve good levels of cross-lingual transfer in AGI experiments. They performed zero-shot cross-lingual automatic genre identification experiments by training multilingual Transformer-based models on the English CORE corpus (Egbert et al., 2015) and testing them on smaller Finnish, Swedish and French datasets. Rönnqvist et al. (2021) extended this research, training the models on a multilingual dataset created from the four corpora, which further improved the results.

These promising results stimulated the creation of genre-annotated datasets for other languages, and for Slovene, the web genre identification corpus GINCO 1.0 (Kuzman et al., 2021) was created. Its genre schema was based on the CORE schema with the possibility of cross-lingual experiments in mind (see Kuzman et al. (2022)). However, a linguistic analysis of the categories (Biber and Egbert, 2018) and a low inter-annotator agreement, reported by Egbert et al. (2015) and Sharoff (2018), revealed some shortcomings of the CORE schema that could impact the reliability of the dataset. Thus, Kuzman et al. (2022) diverged from the original schema when annotating GINCO, striving towards a more reliably annotated dataset. In addition to this, the CORE and GINCO datasets were created following different corpus collection and annotation approaches (see Section 3.1.). Due to these differences, it remained unclear whether the datasets are comparable enough to allow cross-lingual transfer, which would eliminate the need for extensive annotation campaigns for Slovene and other under-resourced languages of interest. This article provides a first insight into this, exploring the comparability of the two datasets through cross-dataset and cross-lingual experiments.

2. Goal of the Paper
This paper analyzes the comparability of two genre-annotated datasets, the Corpus of Online Registers of English (CORE) (Egbert et al., 2015) and the Slovene web genre identification corpus GINCO 1.0 (Kuzman et al., 2021). We perform cross-dataset and cross-lingual automatic genre identification experiments to address the main research question (Q1): Is the CORE dataset comparable enough to the GINCO dataset to provide good cross-lingual transfer, as achieved by Repo et al. (2021), who used comparably encoded Finnish, Swedish and French datasets?
To compare the corpora and to analyze their usefulness for monolingual as well as cross-lingual automatic genre identification, labels from both corpora were first mapped to a joint schema, the GINCORE schema. Then, multilingual pre-trained Transformer-based models were trained on the English CORE dataset with GINCORE labels (EN-GINCORE), the Slovene GINCO dataset with GINCORE labels (SL-GINCORE), and the SL-GINCORE dataset machine translated into English (MT-GINCORE). We conduct 1) monolingual in-dataset AGI experiments, training and testing on the same dataset, and 2) cross-lingual and cross-dataset AGI experiments, training on one dataset and testing on the other. The machine-translated dataset is added to the comparison to explore two additional research questions: Q2) In monolingual in-dataset experiments, do multilingual models, which were pre-trained on more English than Slovene data, perform differently on the Slovene dataset (SL-GINCORE) than on the same Slovene dataset machine-translated into English (MT-GINCORE)? And Q3) In cross-lingual cross-dataset experiments, does translating the training data (MT-GINCORE) into the language of the test data (EN-GINCORE) provide better results than using training and testing data in different languages (SL-GINCORE and EN-GINCORE)?

The experiments were performed with two multilingual Transformer-based pre-trained language models: the massively multilingual XLM-RoBERTa model (Conneau et al., 2020) and the trilingual Croatian-Slovene-English CroSloEngual BERT model (Ulčar and Robnik-Šikonja, 2020). This provides an answer to the fourth research question (Q4): Does CroSloEngual BERT, pre-trained on a smaller number of languages, perform better in the cross-lingual AGI experiments than the massively multilingual XLM-RoBERTa model?

3. Data Preparation
3.1. Original Datasets
In this research, three datasets were used: the Corpus of Online Registers of English (CORE) (Egbert et al., 2015), the Slovene web genre identification corpus GINCO 1.0 (Kuzman et al., 2021), and the GINCO 1.0 corpus machine translated into English.

The CORE corpus consists of web texts that were extracted from the "General" part of the Corpus of Global Web-based English (GloWbE) (Davies and Fuchs, 2015). The GloWbE corpus was collected via Google searches with high-frequency English 3-grams as the queries (Davies and Fuchs, 2015). After obtaining the texts, further cleaning was performed; more specifically, the boilerplate was removed with the Justext tool (Pomikálek, 2011).

The CORE corpus was annotated based on a hierarchical schema which consists of 8 main genre categories, such as Narrative, Opinion and Spoken, and 54 subcategories, e.g., News Report/Blog, Instruction, Travel Blog and Magazine Article. The annotation was single-label, i.e., each annotator, recruited through a crowd-sourcing platform, could assign one main category and one subcategory to a text. However, as each text was annotated by four annotators, a text can have up to four labels. The corpus that we obtained from the authors and used in this research consists of 48,415 texts, labeled with 8 main categories and 47 subcategories. The corpus was further cleaned by removing duplicated texts and texts with more than one assigned label, resulting in 41,502 texts.

The GINCO corpus (Kuzman et al., 2022) consists of a random sample of web texts from two Slovene web corpora: the slWaC 2.0 corpus (Erjavec and Ljubešić, 2014) from 2014 and the MaCoCu-sl 1.0 corpus (Bañón et al., 2022) from 2021. Both web corpora were created by crawling the Slovene top-level domain and some generic domains that are interlinked with the national domain. As in GloWbE, the boilerplate was removed with the Justext tool (Pomikálek, 2011). The GINCO corpus consists of two parts: the "suitable" part, annotated with genres, and the "not suitable" part, consisting of texts not suitable for genre annotation, such as texts in other languages, machine-translated texts, etc. In this research, only the suitable part, consisting of 1,002 texts, was used.

For the annotation, the GINCO schema was used, consisting of 24 labels, e.g., News/Reporting, Opinion/Argumentation and Promotion of a Product. The schema is based on the subcategory level of the CORE schema and on other schemata from previous genre studies. The texts were annotated by two annotators with a background in linguistics. In case of disagreement, final labels were determined at frequent meetings. Multi-label annotation was allowed, i.e., each text could be annotated with up to three classes, which were ordered according to their prevalence in the text as a primary, secondary and tertiary label. However, in these experiments, only the primary labels are used. Each paragraph in the texts is accompanied by metadata (the attribute keep) with information on whether it was manually identified to be a part of the main text and thus useful for the annotation. In this research, paragraphs not deemed to be useful were discarded.

The machine-translated GINCO corpus (MT-GINCO) was created by translating the Slovene GINCO 1.0 into English with the DeepL machine translation system (https://www.deepl.com/translator). The system is stated by its developers to be "3x more accurate" than its closest competitors, i.e., Google Translate, Amazon Translate and Microsoft Translator, based on internal blind tests (DeepL, n.d.). DeepL was confirmed to outperform Google Translate also in an independent study by Yulianto and Supriatnaningsih (2021). The GINCO corpus was translated into British English, as this variety seems to be more frequent than American English in the general part of the GloWbE corpus on which the CORE corpus is based (GloWbE, n.d.). The prevalence of the British variety in the CORE corpus was also confirmed with a lexicon-based American-British-variety classifier (Rupnik et al., 2022), which identified 40% of the texts as British and 25% as American, while the rest contain a mixture of both varieties or no signal for either of them.
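The CORE cleaning step described in this section (removing duplicated texts and texts with more than one assigned label) can be sketched as follows; the record structure and field names are hypothetical, as the actual corpus format may differ:

```python
def clean_core(records):
    """Drop duplicated texts and texts with more than one assigned label."""
    seen_texts = set()
    cleaned = []
    for rec in records:
        if len(set(rec["labels"])) != 1:
            continue  # annotators assigned several different labels
        if rec["text"] in seen_texts:
            continue  # duplicated text
        seen_texts.add(rec["text"])
        cleaned.append(rec)
    return cleaned

# Toy illustration (made-up texts and labels):
corpus = [
    {"text": "Stir the flour into the batter ...", "labels": ["Recipe"]},
    {"text": "Stir the flour into the batter ...", "labels": ["Recipe"]},     # duplicate
    {"text": "The match ended in a draw ...", "labels": ["News", "Opinion"]}, # multi-label
]
print(len(clean_core(corpus)))  # 1
```

Applied to the corpus obtained from the authors, a filter of this kind reduces the 48,415 texts to the 41,502 texts used here.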
3.2. GINCORE Schema
To be able to perform cross-dataset experiments, the CORE and GINCO schemata were mapped to a joint schema – the GINCORE schema. The schemata were mapped based on the descriptions of categories in previous research, on the annotation guidelines for GINCO (https://tajakuzman.github.io/GINCO-Genre-Annotation-Guidelines/), and on the guidelines for CORE, created for the needs of the annotation of Finnish, French and Swedish corpora using the CORE schema in further research (https://turkunlp.org/register-annotation-docs/) (Laippala et al., 2019; Laippala et al., 2020). Furthermore, manual inspection of instances from the GINCO and CORE corpora was performed to analyze to what extent the annotations in the corpora match the guidelines. The basis of the GINCORE schema was the GINCO schema, as it was shown to provide a more reliable annotation than CORE (see Kuzman et al. (2022)). Moreover, it is easier to map the 54 highly granular CORE subcategories to the 24 broader GINCO categories than vice versa. The CORE schema consists of broad main categories and more specific subcategories. As the GINCO schema was based on the subcategories of the CORE schema, the subcategory level was used for the mapping from CORE to GINCORE.

Some of the genre categories in both schemata are identical and can be directly mapped, namely Recipe, Review, Interview and Legal/Regulation. As the GINCO and CORE schemata differ in granularity, broader GINCORE labels were created which efficiently cover categories from both schemata. Some CORE categories were not included in the mapping, because a) these labels proved to be very infrequent and there is no sufficient information about them available, or b) the labels were too broad or problematic for annotators and as a result include instances that are too heterogeneous and cannot be mapped to just one GINCORE label. The resulting GINCORE schema (the full mapping table is available at https://tajakuzman.github.io/GINCO-Genre-Annotation-Guidelines/genre_pages/GINCORE_mapping.html) covers 43 CORE subcategories and all 24 GINCO categories using 20 labels: 15 labels that are present in both corpora, and 5 labels newly introduced by the GINCO schema and thus present only in the GINCO dataset.

Figure 1: The differences between the distributions of GINCORE labels in the GINCO corpora MT-GINCORE and SL-GINCORE (a), and in the EN-GINCORE (CORE corpus) (b).

3.3. GINCORE Datasets
For the purpose of performing cross-dataset experiments, only the GINCORE classes that have more than 5 instances in each of the datasets were used, resulting in a smaller set of 12 GINCORE labels: News, Forum, Opinion/Argumentation, Review, Research Article, Information/Explanation, Promotion, Instruction, Prose, Interview, Legal/Regulation, and Recipe. The texts annotated with other GINCORE labels were not included in the experiments. Thus, the final datasets are slightly smaller:

• the English CORE dataset with 12 GINCORE labels, henceforth referred to as the English GINCORE dataset (EN-GINCORE), consists of 33,918 texts;
• the Slovene GINCO dataset with 12 GINCORE labels, henceforth referred to as the Slovene GINCORE dataset (SL-GINCORE), consists of 810 texts;
• the machine-translated English GINCO dataset with 12 GINCORE labels, henceforth referred to as the Machine-Translated GINCORE dataset (MT-GINCORE), consists of 810 texts.
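The two steps described above — harmonising labels to GINCORE and keeping only the classes with more than 5 instances in each dataset — can be sketched as below. The mapping fragment is illustrative only (the "News Report/Blog" → "News" pair is an assumption; see the full GINCORE mapping table for the actual mapping), while Recipe, Review and Interview are among the directly mapped categories named above:

```python
from collections import Counter

# Illustrative fragment of a CORE-subcategory -> GINCORE mapping.
CORE_TO_GINCORE = {
    "News Report/Blog": "News",   # hypothetical pair, for illustration
    "Recipe": "Recipe",
    "Review": "Review",
    "Interview": "Interview",
}

def to_gincore(core_label):
    """Return the joint GINCORE label, or None if the category was not mapped."""
    return CORE_TO_GINCORE.get(core_label)

def shared_frequent_labels(datasets, min_count=6):
    """Labels with more than 5 instances in *each* of the given datasets."""
    kept = None
    for labels in datasets:
        frequent = {l for l, n in Counter(labels).items() if n >= min_count}
        kept = frequent if kept is None else kept & frequent
    return kept

# Toy label lists standing in for the EN and SL datasets:
en = ["News"] * 10 + ["Recipe"] * 6 + ["Interview"] * 2
sl = ["News"] * 7 + ["Recipe"] * 6 + ["Review"] * 6
print(sorted(shared_frequent_labels([en, sl])))  # ['News', 'Recipe']
```

Intersecting the per-dataset frequency filters in this way is what yields the final set of 12 shared GINCORE labels.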
The text instances were not pre-processed, i.e., each instance is a running text as it was extracted from the original web page, from which the boilerplate and HTML tags were removed. In the GINCO datasets (SL-GINCORE and MT-GINCORE), the texts consist of paragraphs, which is indicated by the <p> tag, while in the CORE dataset (EN-GINCORE), the partitioning into paragraphs is not preserved. In addition to this, the datasets differ significantly in terms of the length of the texts. In the CORE dataset, the median length is 649 words, while the minimum and maximum text lengths are 52 words and 118,278 words respectively. In the GINCO datasets, most texts are significantly shorter, with a median length of 198 words, a minimum length of 12 words and a maximum length of 4,134 words. As the Transformer models used in the experiments can process a maximum instance length of 512 tokens, this means that while the models will in most cases be trained on complete texts from the GINCO datasets, more than half of the texts from the CORE dataset will not be used in their entirety and the models will be trained only on the first part of these instances.

Here, it should also be noted that the CORE dataset and the GINCO datasets are characterized by a different distribution of GINCORE classes. The frequency of some classes, such as Promotion, is significantly different, as can be seen in Figure 1.
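The length statistics quoted above (minimum, median and maximum number of words per text) can be reproduced with a small helper; whitespace tokenisation is an assumption here, so exact counts may differ from the reported ones:

```python
from statistics import median

def length_stats(texts):
    """Return (min, median, max) word counts, with words split on whitespace."""
    lengths = sorted(len(text.split()) for text in texts)
    return lengths[0], median(lengths), lengths[-1]

# Toy illustration:
sample = ["one two three", "one two", "one two three four five"]
print(length_stats(sample))  # (2, 3, 5)
```

Run over the CORE and GINCO texts, a helper of this kind yields the (52, 649, 118,278) and (12, 198, 4,134) figures reported above.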
4. Machine Learning Experiments
4.1. Models
Experiments were performed with Transformer-based pre-trained language models, which were shown to perform well in the automatic genre identification task in a monolingual as well as a cross-lingual setting (Repo et al., 2021). More specifically, two models were used: the base-sized massively multilingual XLM-RoBERTa model (Conneau et al., 2020) and the trilingual Croatian-Slovene-English CroSloEngual BERT model (Ulčar and Robnik-Šikonja, 2020). The XLM-RoBERTa model was chosen because it proved to be the best-performing model in cross-lingual automatic genre identification based on the CORE dataset (Repo et al., 2021), and to be comparable to the Slovene monolingual model SloBERTa (Ulčar and Robnik-Šikonja, 2021) in experiments performed on GINCO (Kuzman et al., 2022). The CroSloEngual BERT model was shown to achieve results comparable to the XLM-RoBERTa model, or to even outperform the latter, in common monolingual and cross-lingual NLP tasks (Ulčar et al., 2021). Thus, it was included in these experiments to explore whether it achieves similar results on the AGI task as well.

4.2. Experimental Setup
The datasets were split into 60:20:20 train, dev and test splits, stratified according to the label distribution. The models were trained on the train split, consisting of 20,350 texts in the case of EN-GINCORE and of 486 texts in the case of SL-GINCORE and MT-GINCORE, and tested on the test split, i.e., 6,784 texts in the case of EN-GINCORE and 162 texts in the case of SL-GINCORE and MT-GINCORE. The dev split, which is of the same size as the test split, was used for hyperparameter optimization. When splitting the datasets, it was assured that the splits of SL-GINCORE and MT-GINCORE contain the same instances, so that they differ only in the language of the content.

The Transformer models are available at the Hugging Face repository and were trained using the Simple Transformers library. To find the optimal number of epochs and the learning rate, a hyperparameter search was performed separately for CroSloEngual BERT and XLM-RoBERTa. The maximum sequence length was set to 512 tokens and the other hyperparameters were set to default values. As the EN-GINCORE dataset is more than 40 times larger than the SL-GINCORE and MT-GINCORE datasets, separate hyperparameter searches for each dataset were performed. The optimum learning rate was revealed to be 10^-5, while the optimum number of epochs varies based on the training dataset and the model: when training on EN-GINCORE, the optimum number of epochs is 9 for XLM-RoBERTa and 6 for CroSloEngual BERT, while when training on SL-GINCORE and MT-GINCORE, it is 60 for XLM-RoBERTa and 90 for CroSloEngual BERT.

We performed monolingual in-dataset experiments and cross-lingual cross-dataset experiments (the code for data preparation and the machine learning experiments is available at https://github.com/TajaKuzman/Cross-Lingual-and-Cross-Dataset-Experiments-with-Genre-Datasets). The monolingual experiments, described in Section 4.3.1., are in-dataset experiments, which means that the models were trained and tested on splits from the same dataset. In contrast to this, in the cross-dataset experiments, presented in Section 4.3.2., the models are trained on one dataset and tested on the other. At the same time, these experiments are cross-lingual, as the original datasets are in different languages. Three runs of each experiment were performed and average results are reported. The models used in the monolingual and cross-lingual setups were evaluated via micro F1 and macro F1 scores to measure instance-level and label-level performance.

4.3. Results
4.3.1. Monolingual In-dataset Experiments
First, the datasets are compared via monolingual in-dataset experiments, where the models were trained and tested on the splits of the same dataset. In addition to this, a dummy classifier which predicts the majority class was implemented as an illustration of the lower bound. The results, presented in Table 1, show that the mapping of the original labels into a joint schema was successful and that it is possible to achieve good results when training Transformer models on the GINCORE datasets. Transformer models are shown to be very effective at this task, achieving micro and macro F1 scores that are higher than the scores of the dummy model by at least 30 points. XLM-RoBERTa, which proved to be the best-performing model, achieved relatively high results, with micro and macro F1 scores ranging between 0.72 and 0.84, even when trained on the two smaller datasets, which consist of fewer than 1,000 instances.

The results show that in a monolingual setting, the massively multilingual XLM-RoBERTa model outperforms the trilingual CroSloEngual BERT model. While Ulčar et al. (2021) showed that the trilingual model is comparable to the XLM-RoBERTa model on NLP tasks which are focused on the classification of words or multiword units, such as named-entity recognition and part-of-speech tagging, these results reveal that CroSloEngual BERT is not as suitable as XLM-RoBERTa for automatic genre identification.

Among all monolingual experiments, the best micro and macro F1 results were achieved when XLM-RoBERTa was trained and tested on the machine-translated MT-GINCO dataset, reaching average micro and macro F1 scores of 0.81 and 0.84 respectively.
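Since every text carries exactly one (primary) label, micro F1 over all classes equals accuracy, while macro F1 averages the per-label F1 scores so that rare labels count as much as frequent ones. A stdlib-only sketch of both metrics (not the evaluation code used in the experiments):

```python
from collections import Counter

def micro_macro_f1(gold, pred):
    """Micro F1 (equal to accuracy in the single-label case) and macro F1."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    per_label = []
    for l in labels:
        prec = tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0
        rec = tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0
        per_label.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    micro = sum(tp.values()) / len(gold)
    macro = sum(per_label) / len(labels)
    return micro, macro

# Toy illustration with two labels:
gold = ["News", "News", "News", "Recipe"]
pred = ["News", "News", "Recipe", "Recipe"]
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))  # 0.75 0.733
```

The gap between the two averages is exactly why the small SL-GINCORE and MT-GINCORE test splits, with several rare labels, produce less stable macro F1 scores.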
Trained on    Tested on     Majority (micro/macro)  XLM-RoBERTa (micro/macro)    CroSloEngual BERT (micro/macro)
SL-GINCORE    SL-GINCORE    0.259 / 0.027           0.782±0.02 / 0.725±0.01      0.738±0.01 / 0.599±0.06
MT-GINCORE    MT-GINCORE    0.259 / 0.027           0.807±0.01 / 0.841±0.03      0.714±0.00 / 0.501±0.05
EN-GINCORE    EN-GINCORE    0.363 / 0.036           0.768±0.00 / 0.715±0.00      0.761±0.00 / 0.706±0.00
SL-GINCORE    EN-GINCORE    0.029 / 0.004           0.639±0.01 / 0.539±0.01      0.547±0.02 / 0.391±0.02
MT-GINCORE    EN-GINCORE    0.029 / 0.004           0.625±0.01 / 0.521±0.01      0.585±0.01 / 0.409±0.01
EN-GINCORE    SL-GINCORE    0.253 / 0.027           0.603±0.02 / 0.575±0.03      0.566±0.02 / 0.510±0.03
EN-GINCORE    MT-GINCORE    0.253 / 0.027           0.630±0.02 / 0.663±0.03      0.630±0.01 / 0.543±0.01

Table 1: Results of monolingual and cross-lingual experiments performed with the XLM-RoBERTa and CroSloEngual BERT models, reported via micro and macro F1 scores (averaged over three runs). As a baseline, the scores of a majority classifier are added. The best scores for each of the two Transformer models in each of the two setups (in-dataset experiments and cross-dataset experiments) are shown in bold.

At the same time, the lowest scores, i.e., micro F1 of 0.71 and macro F1 of 0.50, were obtained on the same dataset in combination with CroSloEngual BERT. Similarly, while XLM-RoBERTa achieved the worst results when trained and tested on EN-GINCORE, CroSloEngual BERT achieved the best results on this dataset.
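The majority-class baseline in Table 1 can be sketched as follows (a stdlib illustration, not the exact implementation used in the experiments):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Always predict the most frequent training label; report micro F1,
    which equals accuracy in this single-label setting."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for g in test_labels if g == majority)
    return majority, hits / len(test_labels)

# Toy illustration with made-up label distributions:
train = ["Promotion"] * 5 + ["News"] * 3 + ["Recipe"] * 2
test = ["Promotion", "News", "Promotion", "Recipe"]
print(majority_baseline(train, test))  # ('Promotion', 0.5)
```

The near-zero baseline scores in the cross-dataset rows of Table 1 presumably reflect that the most frequent label of the training dataset is comparatively rare in the other dataset's test split.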
The difference between the results on the same datasets shows the importance of analyzing the output of multiple models before reaching any conclusion regarding the datasets – if only XLM-RoBERTa were used, one could assume that the EN-GINCORE dataset is less suitable for automatic genre identification experiments. However, after performing experiments with both models, we can see that no dataset consistently provides the best results.

Figure 2: F1 scores per label (averaged over three runs) in in-dataset experiments with MT-GINCORE and EN-GINCORE, performed with CroSloEngual BERT. Labels are ordered according to their frequency in the smaller of the two datasets, MT-GINCORE.

If we compare experiments performed with the same model, we can observe that the largest differences between the datasets are in terms of macro F1 scores, which are calculated on the level of labels. As shown in Figure 2, the biggest differences between the per-label F1 scores occur for labels that are represented by a very small number of instances in the smaller datasets, SL-GINCORE and MT-GINCORE. Half of the labels, i.e., Review, Legal/Regulation, Research Article, Interview, Recipe and Prose, are represented by only 4 instances or fewer in the SL-GINCORE and MT-GINCORE test splits. One should be aware that this means that a correct or incorrect prediction of such a small number of instances per label has a large impact on the macro F1 score. Furthermore, a correct prediction of labels with only one or two instances in the test split might happen due to chance or due to a similarity of texts in the train and test splits. Thus, the F1 scores of these labels are not reliable. As shown in Figure 2, in the three runs, the F1 scores of Interview and Recipe, which are represented by only 1 instance in the SL-GINCORE and MT-GINCORE test sets, were either 0 or 1, which has a large impact on the macro F1. These results also show how important it is to repeat each experiment multiple times, to ascertain the stability and reliability of the results.

If we compare the three datasets based on micro F1 scores, there are small differences between them, i.e., a difference of 4 points between the lowest and highest scores when XLM-RoBERTa was used and a difference of 5 points when CroSloEngual BERT was used. Interestingly, although EN-GINCORE is 40 times larger than SL-GINCORE and MT-GINCORE, it does not provide higher results than the other two datasets when the XLM-RoBERTa model is used for training. Similar results were revealed in previous work (see Repo et al. (2021)), where monolingual experiments were performed with XLM-RoBERTa on the CORE dataset and three smaller genre-annotated datasets, the Finnish FinCORE, French FreCORE and Swedish SweCORE datasets. Although the non-English datasets were annotated with the CORE schema, their annotation procedure and dataset collection methods are more similar to the GINCO approach than to CORE. Their experiments showed that XLM-RoBERTa and other Transformer models perform similarly or better when trained on datasets which consisted of 1,800 to 2,200 instances than when trained on the CORE dataset. We have two hypotheses why this is the case: 1) It might be that, due to the high capacities of Transformer models, their performance on this task plateaus already at a few thousand instances, and contributing bigger datasets does not significantly improve the results. 2) Or this could indicate that the CORE dataset is less suitable for AGI machine learning experiments. The reason for that could be that, as crowd-sourcing was used for the annotation of the dataset, the assigned labels are less reliable and the classes are consequently fuzzier.
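The instability caused by singleton labels can be quantified directly: with 12 labels, a single label whose F1 flips between 0 and 1 across runs (as observed for Interview and Recipe) moves the macro average by 1/12 ≈ 0.083 on its own:

```python
def macro_f1(per_label_scores):
    return sum(per_label_scores) / len(per_label_scores)

# Eleven labels with a fixed score, plus one singleton label whose F1
# is either 0 or 1 depending on a single prediction.
other_labels = [0.8] * 11
swing = macro_f1(other_labels + [1.0]) - macro_f1(other_labels + [0.0])
print(round(swing, 3))  # 0.083
```

An 8-point swing from one prediction is larger than several of the between-dataset differences discussed above, which is why the per-label scores of the rarest classes are not reliable.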
Poor reliability of the dataset was also con- To obtain a deeper insight into the comparability of the firmed by low inter-annotator agreement. The authors of GINCO and CORE corpora, we can compare how the F1 the dataset reported that there was no agreement between at scores per labels change when we test the model on another least three of four annotators on the subcategory of 48.98% corpus versus when we test it on the same dataset. Figure 3 of texts (Egbert et al., 2015). When the schema and ap- shows a comparison between the F1 scores per labels for in- proach was used by Sharoff (2018) on another corpus, he dataset experiments with SL-GINCORE and cross-dataset reported nominal Krippendorff’s alpha of 0.53 on the level experiments from SL-GINCORE to EN-GINCORE, per- of subcategories, which is below the acceptable threshold formed with XLM-RoBERTa. An analysis of these ex- of 0.67, as defined by Krippendorff (2018). In contrast to periments, performed with CroSloEngual BERT, confirmed this, the GINCO dataset was reported to achieve Krippen- that differences between label scores occur when learning dorff’s alpha of 0.71, confirming much higher reliability of with any of the two models, and do not depend on the annotations. model. The same differences in label scores were also ob- served in experiments where MT-GINCORE is used instead 4.3.2. Cross-lingual Cross-dataset Experiments of SL-GINCORE, which indicates that the language of the To assess comparability of the English CORE dataset dataset does not seem to have a large impact on the results and the Slovene GINCO dataset, we performed cross- per labels. lingual cross-dataset experiments by training the Trans- As shown in Figure 3, the F1 scores for News and former models on one dataset and testing them on another. 
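Krippendorff's alpha, the reliability measure cited above, can be computed for nominal data from a coincidence matrix as alpha = 1 - Do/De (observed vs. expected disagreement). The sketch below uses invented toy annotations, not the actual CORE or GINCO annotation data:

```python
# Illustrative sketch of nominal Krippendorff's alpha, computed from a
# coincidence matrix; the toy label lists below are invented examples.
from collections import defaultdict

def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels that all
    annotators assigned to one unit (units with < 2 labels are skipped)."""
    o = defaultdict(float)                  # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for i, c in enumerate(labels):      # every ordered annotator pair
            for j, k in enumerate(labels):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)
    n_c = defaultdict(float)                # label marginals
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v for (c, k), v in o.items() if c != k)   # observed disagreement
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_o / d_e

# Perfect agreement yields alpha = 1; disagreement lowers it.
print(krippendorff_alpha_nominal([["News", "News"], ["Forum", "Forum"]]))  # 1.0
print(krippendorff_alpha_nominal([["News", "News"], ["News", "Forum"]]))   # 0.0
```

Values below the 0.67 threshold mentioned above (such as the 0.53 reported by Sharoff) would indicate that the annotated categories are not reliable enough for drawing tentative conclusions.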
4.3.2. Cross-lingual Cross-dataset Experiments

To assess the comparability of the English CORE dataset and the Slovene GINCO dataset, we performed cross-lingual cross-dataset experiments by training the Transformer models on one dataset and testing them on another. In addition to experimenting with cross-lingual transfer from the Slovene to the English dataset and vice versa, we also explored whether translating the Slovene dataset into English with a machine translation system improves the results of the cross-dataset experiments.

The results, shown in Table 1, reveal that the trilingual CroSloEngual BERT model performs worse than the massively multilingual XLM-RoBERTa model in the cross-lingual experiments, with a difference of 12 points between the highest macro F1 scores obtained by the models and a much smaller difference between the highest micro F1 scores (0.009).

In general, the results obtained in the cross-lingual experiments are significantly lower than the results from the monolingual experiments. If we compare experiments performed with XLM-RoBERTa, there are differences of 13–18 points in micro F1 and 5–32 points in macro F1 between testing the model on the same dataset as it was trained on (monolingual experiments) and on another dataset (cross-lingual experiments). In the case of CroSloEngual BERT, the differences between testing on the same dataset and testing on the other dataset were 13–20 points in micro F1 and 9–20 points in macro F1.

Nevertheless, the XLM-RoBERTa scores, which range between 0.6–0.64 and 0.52–0.66 for micro and macro F1 respectively, are a promising indicator that cross-lingual transfer could be possible in this task for Slovene as well. Furthermore, the results are comparable to the results of the cross-lingual experiments with the CORE corpora reported by Repo et al. (2021). When they trained the XLM-RoBERTa model on the CORE corpus and tested it on the Finnish, Swedish and French datasets annotated with the CORE schema, the micro F1 scores ranged from 0.61 to 0.69. It needs to be noted that they used a large-sized model, which was shown to significantly outperform the base-sized model used by us (Conneau et al., 2020), and that they used 8 labels, while we used 12. Considering this, the results of learning on CORE, mapped to the GINCORE schema, and testing on SL-GINCORE, which reached 0.60 micro F1 with the base-sized XLM-RoBERTa model, are promising, showing that mapping to the GINCORE schema gives results comparable to using the CORE schema.

To obtain a deeper insight into the comparability of the GINCO and CORE corpora, we can compare how the per-label F1 scores change when we test the model on another corpus versus when we test it on the same dataset. Figure 3 shows a comparison between the per-label F1 scores for in-dataset experiments with SL-GINCORE and cross-dataset experiments from SL-GINCORE to EN-GINCORE, performed with XLM-RoBERTa. An analysis of these experiments performed with CroSloEngual BERT confirmed that the differences between label scores occur with either of the two models and do not depend on the model. The same differences in label scores were also observed in experiments where MT-GINCORE was used instead of SL-GINCORE, which indicates that the language of the dataset does not seem to have a large impact on the per-label results.

As shown in Figure 3, the F1 scores for News and Opinion/Argumentation are almost the same in both setups, which shows that, with regard to these genres, the datasets are comparable enough for the model to generalise from one dataset to the other. The F1 scores are significantly lower in the cross-lingual experiments in the case of Promotion, Information/Explanation, Forum and Instruction. For the labels that are under-represented in SL-GINCORE, i.e., the labels to the right of Review in the figure, it is not possible to ascertain whether the differences between the scores indicate that the datasets are not comparable with regard to these labels, or whether the differences occurred due to chance.

Figure 3: Comparison of average F1 scores per label between in-dataset and cross-dataset experiments with XLM-RoBERTa. The models were trained on SL-GINCORE and tested on a) SL-GINCORE (in-dataset experiments) and b) EN-GINCORE (cross-dataset experiments). Labels are ordered according to their frequency in the smaller of the datasets, SL-GINCORE.

As in the in-dataset experiments, the experiments with the two Transformer models show that while one dataset combination seems to achieve the best results with one model, it performs differently with the other model.
These results once again show the importance of using multiple models on multiple datasets in the experiments, to see whether the conclusions obtained from experiments with one model are still supported when using another, similar model, and how the performance of the models depends on the datasets. While the results in terms of micro F1 achieved with XLM-RoBERTa point to the conclusion that transfer from SL-GINCORE to EN-GINCORE achieves better results than the other direction, the macro F1 scores achieved with XLM-RoBERTa, and both F1 scores achieved with CroSloEngual BERT, show the transfer direction from English to Slovene to be better. However, although the EN-GINCORE dataset is 40 times larger than SL-GINCORE, the transfer from EN-GINCORE to SL-GINCORE does not achieve significantly higher results than the transfer in the other direction, when the Slovene dataset is used for training.

In addition to this, the results show that machine-translating the dataset into English can in some cases improve the results of the cross-lingual experiments. In cases where the model was trained on the GINCO datasets, i.e., SL-GINCORE or MT-GINCORE, and tested on the EN-GINCORE dataset, the setup with the machine-translated text achieved slightly lower results than the setup with the original Slovene dataset, SL-GINCORE, in the case of XLM-RoBERTa, and slightly better results in the case of CroSloEngual BERT. However, when the transfer was applied in the other direction, that is, from EN-GINCORE to SL-GINCORE or MT-GINCORE, machine-translating the test instances from Slovene into English resulted in improvements of the macro F1 scores achieved with XLM-RoBERTa, and of both the micro and macro F1 scores obtained with CroSloEngual BERT.

5. Conclusions

Following Repo et al. (2021), who showed that good levels of cross-lingual transfer can be achieved by training Transformer models on a large English genre dataset and applying them to datasets in other languages, the goal of this study was to explore whether it is possible to achieve similar results on the Slovene genre dataset. The results revealed to be promising: despite using a smaller Transformer model and a different schema with more labels than previous work, the results are rather comparable, showing that the English CORE and Slovene GINCO datasets are comparable enough to allow cross-dataset experiments. The XLM-RoBERTa scores, which range between 0.6–0.64 and 0.52–0.66 in terms of micro and macro F1 respectively, are a promising indicator that cross-lingual transfer could be possible in the automatic genre identification task for Slovene as well. Furthermore, the high F1 scores achieved with XLM-RoBERTa in the monolingual experiments show that automatic genre identification is feasible already with a very small dataset, and that using the GINCORE schema on all datasets gives good results. Moreover, despite the fact that the CORE dataset is 40 times larger than the GINCO dataset, it did not provide consistently significantly better results than the GINCO dataset in either of the setups. We plan to analyse this further by exploring what results can be achieved when smaller portions of CORE are used for training, and by extending the GINCO dataset, to see whether this further improves the results.

As the recently developed trilingual Croatian-Slovene-English CroSloEngual BERT model was shown to be comparable to the massively multilingual XLM-RoBERTa model in numerous NLP tasks (see Ulčar et al. (2021)), both models were used in the experiments to analyse their performance on the AGI task. The results of both the monolingual and the cross-lingual experiments showed that, despite achieving high results in other common NLP tasks, CroSloEngual BERT seems to be less suitable than XLM-RoBERTa for automatic genre identification.

To improve the monolingual and cross-lingual results, we also experimented with translating the Slovene GINCO dataset into English, which is the main language on which the Transformer models were pre-trained. In regard to the monolingual experiments, there were no consistent results which would confirm that using an English dataset improves classification. However, when the models were trained on the English EN-GINCORE and tested on MT-GINCORE, i.e., the Slovene dataset machine-translated into English, this led to an improvement of the macro F1 scores achieved with XLM-RoBERTa, and of both the micro and macro F1 scores for CroSloEngual BERT. This means that machine-translating a dataset into the language of another dataset might be beneficial in cross-lingual cross-dataset experiments.

Although the monolingual and cross-lingual experiments showed good results also when the models were trained on SL-GINCORE and MT-GINCORE, which consist of fewer than 1,000 instances, comparisons of the F1 scores reported for each label in different runs and setups showed that some labels are represented by too few instances to provide reliable results. In the future, we plan to extend the GINCO dataset to assure more reliable results and to further improve the classifiers' performance.

In addition to this, recent work by Rönnqvist et al. (2021) showed that multilingual modeling, where the model was trained on CORE datasets in various languages, resulted in significant gains over cross-lingual modeling, where the model was trained solely on the English CORE dataset. As our research revealed that the CORE and GINCO labels can be successfully mapped to a joint schema, we plan in the future to extend the experiments to multilingual modeling by training the model on a combination of all the CORE datasets and the Slovene GINCO dataset.

Acknowledgments

This work has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains.
This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

6. References

Noushin Rezapour Asheghi, Serge Sharoff, and Katja Markert. 2016. Crowdsourcing for web genre annotation. Language Resources and Evaluation, 50(3):603–641.

Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. Slovene web corpus MaCoCu-sl 1.0. Slovenian language resource repository CLARIN.SI.

Douglas Biber and Jesse Egbert. 2018. Register variation online. Cambridge University Press.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.

Mark Davies and Robert Fuchs. 2015. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE). English World-Wide, 36(1):1–28.

DeepL. n.d. Why DeepL? https://www.deepl.com/en/whydeepl.

Jesse Egbert, Douglas Biber, and Mark Davies. 2015. Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology, 66(9):1817–1831.

Tomaž Erjavec and Nikola Ljubešić. 2014. The slWaC 2.0 corpus of the Slovene web. In: T. Erjavec, J. Žganec Gros (eds.), Jezikovne tehnologije: zbornik, 17:50–55.

Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German web as corpus. In: Proceedings of the fifth Web as Corpus workshop, pages 27–35.

GloWbE. n.d. Corpus of Global Web-Based English (GloWbE): Texts. https://www.english-corpora.org/glowbe/.

Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage Publications.

Taja Kuzman, Mojca Brglez, Peter Rupnik, and Nikola Ljubešić. 2021. Slovene web genre identification corpus GINCO 1.0. Slovenian language resource repository CLARIN.SI.

Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2022. The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild. In: Proceedings of the Language Resources and Evaluation Conference, pages 1584–1594, Marseille, France. European Language Resources Association.

Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, and Sampo Pyysalo. 2019. Toward multilingual identification of online registers. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 292–297.

Veronika Laippala, Samuel Rönnqvist, Saara Hellström, Juhani Luotolahti, Liina Repo, Anna Salmela, Valtteri Skantsi, and Sampo Pyysalo. 2020. From web crawl to clean register-annotated corpora. In: Proceedings of the 12th Web as Corpus Workshop, pages 14–22.

Wanda J. Orlikowski and JoAnne Yates. 1994. Genre repertoire: The structuring of communicative practices in organizations. Administrative Science Quarterly, pages 541–574.

Jan Pomikálek. 2011. Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Masaryk University, Faculty of Informatics, Brno, Czech Republic.

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, and Veronika Laippala. 2021. Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers. In: 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021, pages 183–191. Association for Computational Linguistics (ACL).

Samuel Rönnqvist, Valtteri Skantsi, Miika Oinonen, and Veronika Laippala. 2021. Multilingual and zero-shot is closing in on monolingual web register classification. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 157–165.

Peter Rupnik, Taja Kuzman, and Nikola Ljubešić. 2022. American-British-variety Classifier. https://github.com/macocu/American-British-variety-classifier.

Serge Sharoff. 2018. Functional text dimensions for the annotation of web corpora. Corpora, 13(1):65–95.

Matej Ulčar and Marko Robnik-Šikonja. 2020. CroSloEngual BERT 1.1. Slovenian language resource repository CLARIN.SI.

Matej Ulčar and Marko Robnik-Šikonja. 2021. Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0. Slovenian language resource repository CLARIN.SI.

Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, and Marko Robnik-Šikonja. 2021. Evaluation of contextual embeddings on less-resourced languages. arXiv:2107.10614.

Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2018. Evaluation of machine translation performance across multiple genres and languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Vedrana Vidulin, Mitja Luštrek, and Matjaž Gams. 2007. Using genres to improve search engines. In: 1st International Workshop: Towards Genre-Enabled Search Engines: The Impact of Natural Language Processing, pages 45–51.

Ahmad Yulianto and Rina Supriatnaningsih. 2021. Google Translate vs. DeepL: A quantitative evaluation of close-language pair translation (French to English). AJELP: Asian Journal of English Language and Pedagogy, 9(2):109–127.
Slovenian Epistemic and Deontic Modals in Socially Unacceptable Discourse Online

Jakob Lenardič,* Kristina Pahor de Maiti†
*Faculty of Arts, University of Ljubljana — jakob.lenardic@ff.uni-lj.si
†CY Cergy Paris University — kristina.pahor-de-maiti@u-cergy.fr

Abstract

In this paper, we investigate the use of epistemic and deontic modal expressions in Slovenian Facebook comments. Modals are linguistic expressions that can be strategically used to fulfill the face-saving dimension of communication and to linguistically mask discriminatory discourse. We compile a list of modal expressions that have a tendency towards a single modal reading in order to enable robust corpus searches. Using this set of modals, we first show that deontic, but not epistemic, modals are significantly more frequent in socially unacceptable comments. In the qualitative part of the paper, we discuss the use of modals expressing deontic and epistemic necessity from the perspective of discourse pragmatics. We explore how the communicative strategy of face-saving interacts with personal and impersonal syntax in the case of deontic modals, and how hedging and boosting interact with irony in the case of epistemic modals.

1. Introduction

Hate speech and other forms of socially unacceptable discourse have a negative effect on society (Delgado, 2019; Gelber and McNamara, 2016). For instance, calls to action targeting specific demographics on social media have been shown to lead to offline consequences such as real-world violence (Siegel, 2020). Linguistically, socially unacceptable attitudes are often disseminated in a dissimulated form, using pragmatic markers which superficially lessen the strength of intolerant claims or violent calls to action; nevertheless, the discursive markers of such dissimulated discourse are still not well known (Lorenzi-Bailly and Guellouz, 2019), especially outside of English social media.

In this paper, we look at the use of Slovenian modal expressions as key pragmatic contributors to the dissimulation of unacceptable discourse on social media. We first look at how the use of epistemic modals, which convey the speaker's truth commitment, and the use of deontic modals, which convey how the world should or must be according to a set of contextually determined circumstances, differ between unacceptable and acceptable discourse in the case of Slovenian Facebook comments obtained from the FRENK corpus (Ljubešić et al., 2021).

We then turn to a qualitative analysis of modals conveying logical necessity. We discuss how the meaning of deontic necessity, which corresponds to some kind of obligation that needs to be fulfilled by the agent of the modalised proposition, can have a secondary pragmatic meaning that is akin to the face-saving observed with epistemic modals and that arises with syntactically impersonal modals. We then discuss how epistemic modals are used to achieve a face-saving effect, either as hedging or boosting devices or as intensifiers of irony.

The paper is structured as follows. Section 2 presents the semantic and pragmatic properties of epistemic and deontic modals, while Section 3 presents some of the related corpus-linguistic work on modality in socially unacceptable discourse. Section 4 describes the make-up of the FRENK corpus in terms of the subtypes of socially unacceptable discourse and the criteria for the selection of the analysed modals. Section 5 presents the quantitative analysis, wherein epistemic and deontic modals are compared between the acceptable and unacceptable supersets in FRENK. Section 6 presents the qualitative analysis, where certain deontic and epistemic necessity modals are discussed in terms of their pragmatic functions. Section 7 concludes the paper.

2. Theoretical background

2.1. The semantics of epistemic and deontic modals

Modal expressions are semantic operators that interpret a prejacent proposition within the irrealis realm of possibility (Kratzer, 2012). There are two key semantic components to modals – one is the modal force, which corresponds to the logical strength of the modal expression and roughly ranges from possibility via likelihood to necessity, and the other is the type of modality,¹ according to which the evaluation of the possibility is tied to the actual world.²

There are two main types of modality – epistemic on the one hand and root on the other (Coates, 1983; Kratzer, 2012; von Fintel, 2006). Epistemic modals tie the evaluation of the possibility or necessity to the speaker's knowledge about the actual world. For instance, the possibility adverb morda in (1), taken from the FRENK corpus, has the reading which says that there is a possibility that the referents of the indefinite subject nekaj jih ("some of them") will stay in the country. This possibility reading is epistemic, as it conveys that the speaker is not sure whether the possibility of their staying will actually turn out to be the case.

(1) [N]ekaj jih bo morda ostalo v naših krajih. EPISTEMIC
"Some of them will possibly stay in our country."

Root modality, on the other hand, is not tied to the speaker's (un)certainty about the truth of the proposition. Rather, it ascribes the possibility to certain, usually unspecified, facts about the actual world.

¹ For formal semanticists viewing modals as quantifiers over possible worlds (von Fintel, 2006; Kratzer, 2012), there are actually three semantic components – modal force, modal base, and the ordering source; for ease of exposition, we conflate the modal base and ordering source under the simplified modality-type component of meaning.
² The italics in the examples are always our own and are used to highlight the modal under scrutiny.
There are several subtypes of root modality, but the one we are interested in in this paper is the deontic subtype, in which the evaluation of possibility or necessity is tied to some contextually determined authority, such as a set of rules, the law, or even the speaker (Palmer, 2001, 10). An example of a deontic modal is the verb dovoliti in example (2), again taken from FRENK. This verb also denotes possibility in terms of modal force, so the deontic possibility reading roughly translates to they should not be given the possibility (i.e., be allowed) to change our culture.

(2) [S]eveda se jim ne sme dovoliti[,] da bi spremenil naso (sic) kulturo. DEONTIC
"They should not be allowed to change our culture."

Note that a single modal can have different readings in terms of modality type. This is, for instance, the case with the necessity modal morati, where the epistemic reading in (3a) conveys that the speaker is certain (i.e., epistemic necessity) that whomever they are referring to is a bonafide Slovenian. By contrast, the deontic reading in (3b) says that what necessarily needs to be done is preparing for the competition. Such readings are disambiguated contextually.

(3) a. Ta mora biti pravi Slovenec, ni dvoma. EPISTEMIC
"He must be a bonafide Slovenian, no doubt about it."
b. Pripraviti se bodo morali tudi na konkurenco, ki je zdaj še nimajo. DEONTIC
"They will also have to prepare for the competition, which they do not have yet." (Roeder and Hansen, 2006, 163)

2.2. The pragmatics of epistemic and deontic modals

Modality expresses the speaker's subjective attitudes and opinions (Palmer, 2001), which is why the pragmatic aspects of the modalised utterance play an important role in discourse.

Epistemic modals fulfill what Halliday (1970) calls the interpersonal dimension of the utterance. In this sense, epistemic modals show the following three pragmatic uses (Coates, 1987), related to the Politeness Theory of Brown et al. (1987). First, they are used as part of the negative politeness strategy to save the addressee's negative face, when, for instance, the speaker tries to facilitate open discussion by not assuming the addressee's stance on the conversational issue in advance. Second, epistemic modals can be used as an addressee-oriented positive politeness strategy, which involves the preservation of the positive image of the addressee and prevents them from feeling inferior to the speaker. Finally, they are used as part of a speaker-oriented positive politeness strategy, which involves the preservation of the positive image of the speaker by enabling a smooth withdrawal from a statement that could be perceived as a boast, a threat, or similar.

Related to such politeness strategies, modals fulfil the conversational role of so-called hedging or boosting devices (Hyland, 2005). Epistemic modals function as hedges when the speaker uses them to reduce their commitment to the truth of the propositional content – i.e., to signal their hesitation or uncertainty about what is being expressed, which is a type of face-saving strategy in and of itself (Gonzálvez García, 2000; Hyland, 1998). In terms of modal force, it is weak epistemic modals denoting possibility that typically correspond to hedges, though certain necessity modals can also acquire such a function in certain contexts, as we will show in the qualitative analysis.

Strong epistemic modals, which express certainty or a high commitment of the speaker to the truth of the utterance, typically function as boosters and are used by the speaker to convince his or her audience, make his or her utterance argumentatively stronger, close the dialogue to further deliberation (Vukovic, 2014), stress common knowledge and group membership (Hyland, 2005), and so forth. Such boosters can also be used manipulatively to boost a claim that is otherwise controversial or highly particular (Vukovic, 2014).

Deontic modality also fulfils interpersonal roles in communication. Because deontic modals express notions such as obligation and permission, they have to do with negotiating social power between an authority and the discourse participant to whom the permission is granted or the obligation imposed upon (Winter and Gärdenfors, 1995). Deontic statements often involve a power imbalance between interlocutors (which is especially evident when it is not in the interest of the agent to fulfil the obligation), so the use of deontic modals is often paired with other pragmatic devices denoting politeness or face-saving. Politeness is thus "an overarching pragmalinguistic function that can be overtly or covertly marked in deontic and epistemic modal utterances" (Gonzálvez García, 2000, 127).

3. Related work on modality in hate speech

The linguistic and pragmatic characteristics of modality have not yet been extensively explored in the literature on online socially unacceptable discourse. One exception is the work done by Ayuningtias et al. (2021), who analyse YouTube comments related to the 2019 Christchurch mosque shootings. They find that clauses with deontic modals outnumber those with epistemic modals, and that the main discursive strategy of commenters in socially unacceptable comments is to use deontic modals to incite violent action against members of the New Zealand Muslim community.

Other corpus-linguistic studies investigate modal markers from the perspective of stance. Chiluwa (2015), for example, analyses the stance expressed in the Tweets of two radical militant groups, Boko Haram and Al Shabaab. Among other stance-related elements, she investigates the use of hedges (including weak epistemic modals) and boosters (including strong epistemic modals). The results show that boosters are more frequent than hedges, although their overall frequency in the data was low.
According to the author, the low frequency of hedges shows that radicalist discourse does not exhibit the tendency to mitigate commitment, which goes hand in hand with the slightly higher presence of boosters, which are used as a rhetorical strategy to support (possibly unfounded) statements and to influence, radicalise and win over readers by projecting assertiveness.

Another study on stance in this context is by Sindoni (2018), who looks at the verbal and multimodal construction of hate speech in British mainstream media. She analyses epistemic modal operators (among other related devices) in order to uncover the writer's stance and attitude towards the content conveyed in the news item. She finds that modality is strategically used to present the author's opinions as facts, while the opinions of others are reported as hypotheses and assumptions.

4. The FRENK corpus

4.1. Corpus make-up

For this study, we have used FRENK, a 270,000-token corpus of Slovenian Facebook comments of mostly socially unacceptable discourse (Ljubešić et al., 2019). The Facebook comments in the FRENK corpus concern two major topics – migrants, generally in the context of the 2015 European migrant crisis, and the LGBTQ community, mostly in the context of their civil rights – and are manually annotated for several different kinds of discourse.³ The annotations distinguish whether the discourse is aimed at a target's personal background, such as sexual orientation, race, religion, and ethnicity, or at their belonging to a particular group, such as a political party. They also distinguish the type of the discourse itself, which falls into four broad categories, one being acceptable discourse and the others different kinds of socially unacceptable discourse (de Maiti et al., 2019, 38):

• Acceptable discourse
• Socially unacceptable discourse
  – Offensive discourse, which corresponds to abusive, threatening or defamatory speech that is targeted at someone on the basis of their background or group participation.
  – Violent discourse, which contains threats or calls to physical violence and is often punishable by law (Fišer et al., 2017, 49).
  – Inappropriate speech, which contains offensive language but is not directed at anyone in particular.

Subcorpus       Tokens
Acceptable       92,922   34%
Offensive       143,948   53%
Inappropriate     1,471    1%
Violent           8,789    3%
Not relevant     24,572    9%
Σ               271,702  100%

Table 1: The make-up of the FRENK corpus in terms of socially (un)acceptable discourse.

For our study, we have created two subsets of comments: the acceptable subset, containing the comments tagged as acceptable, and the unacceptable subset, containing the comments tagged as offensive, violent or inappropriate. This decision is based on the frequency distributions shown in Table 1. We can observe that the FRENK subcorpora are uneven in size, with the violent and inappropriate sets containing significantly fewer comments than the acceptable and offensive sets. Because violent discourse is generally less frequent than offensive discourse in linguistic corpora,⁴ it is difficult to annotate automatically (Evkoski et al., 2022), so one of the crucial features of FRENK is the fact that the annotations into discourse types were done manually, employing 8 trained annotators per Facebook comment (Ljubešić et al., 2019, 9). Note that about 9% of the Facebook comments are marked as Not relevant, which refers to comments with incorrect topic classification (ibid., 5).

The latest version of the FRENK corpus, version 1.1, which also includes texts in Croatian and English, is available for download from the CLARIN.SI repository (Ljubešić et al., 2021). However, the online version, which is accessible through CLARIN.SI's noSketch Engine concordancer and which we have used for the purposes of this paper,⁵ is not yet available to the public.

³ The annotations are performed on the comment level while also taking into account the features of the entire discussion thread.
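The grouping into the two subsets can be sketched as follows; the comment tuples and the split_subsets helper below are invented for illustration and are not part of the FRENK tooling:

```python
# Sketch of the two-subset split described above: acceptable comments go
# into one subset, the three unacceptable subtypes into the other, and
# "not relevant" comments (incorrect topic classification) are discarded.
UNACCEPTABLE = {"offensive", "violent", "inappropriate"}

def split_subsets(comments):
    """comments: iterable of (text, label) pairs with labels 'acceptable',
    'offensive', 'violent', 'inappropriate' or 'not relevant'."""
    acceptable, unacceptable = [], []
    for text, label in comments:
        if label == "acceptable":
            acceptable.append(text)
        elif label in UNACCEPTABLE:
            unacceptable.append(text)
        # comments labelled 'not relevant' are dropped
    return acceptable, unacceptable

comments = [("primer 1", "acceptable"), ("primer 2", "offensive"),
            ("primer 3", "violent"), ("primer 4", "not relevant")]
acc, unacc = split_subsets(comments)
print(len(acc), len(unacc))  # 1 2
```

Collapsing the three unacceptable subtypes into one subset sidesteps the size imbalance noted above, since the violent and inappropriate sets are too small to analyse separately.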
They also distinguish the pus, which also includes texts in Croatian and English, type of the discourse itself, which falls into 4 broad cate- is available for download from the CLARIN.SI repository gories, one being acceptable discourse and the others dif- (Ljubešić et al., 2021). However, the online version, which ferent kinds of socially unacceptable discourse (de Maiti et is accessible through CLARIN.SI’s noSketch Engine con- al., 2019, 38): cordancer and which we have used for the purposes of this paper,5 is not yet available to the public. • Acceptable discourse • Socially unacceptable discourse 3The annotations are performed on the comment level while also taking into account the features of the entire discussion thread. PRISPEVKI 110 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 4.2. The modals analysed in the study Acceptable Unacceptable Modal AF RF AF RF A/U U/A Table 2 shows that there are 12 modal expressions used verjetno 52 559.6 66 428.0 1.3 0.8 in the study. We have selected the modals using the follow- morda 24 258.3 19 123.2 2.1 0.5 ing two criteria. mogoče 29 312.1 55 356.7 0.9 1.1 The first criterion is the modal’s tendency towards a sin- najbrž 12 129.1 13 84.3 1.5 0.7 gle modal reading. As discussed in Section 2.1., modals zagotovo 3 32.3 13 84.3 0.4 2.6 are in principle ambiguous in terms of their modality type. ziher 8 86.0 15 97.3 0.9 1.1 However, corpus data show that certain modals have an Σ 128 1,377.4 181 1,173.7 1.2 0.9 overwhelming preference for a single reading; for instance, while the modal auxiliary morati can theoretically have Table 3: The distribution of epistemic modals in the both the epistemic and the deontic interpretations (Roeder FRENK corpus; AF stands for absolute frequency and RF and Hansen, 2006, 162–163), as was shown in (3), the epis-for relative frequency, normalised to a million tokens. 
temic reading (3a) is actually extremely rare in attested usage, and in the case of the FRENK corpus completely non- While most modals are syntactically adverbs (e.g., morda, existent.6 Similarly, whenever the adverb naj is used in ziher), some are verbs selecting for finite clausal comple- the indicative rather than conditional mood (glossed with ments, such as dovoliti in (2), verbs selecting for non-finite the subscript IND in Tables 2 and 4), its meaning is always complements, such as morati in (3), and predicative adjec-some shade of the deontic reading (command, wish, etc.). tives (of the syntactic frame It is necessary to) selecting Thus, all the modals in Table 2 are either unambiguously for non-finite complements, such as treba (see the exam- deontic or unambiguously epistemic, so they function as a ples in Section 6.1.). However, such syntactic differences robust set for testing how deontic and epistemic modality have no bearing on the modal interpretation – in all cases, manifests itself in different types of discourse without con- the modals remain sentential operators that take semantic founding examples with unintended interpretations. scope over the proposition denoted by the clause. Second, some lexemes known to convey modal inter- pretations also frequently occur with a superficially similar 5. Quantitative Analysis propositional meaning that, however, is not modal. Such is the adverb itak, as in example (4), also taken from FRENK. Tables 3 and 4 show how the Slovenian modals are distributed between the acceptable and unacceptable subsets (4) Krscanstvo pa itak izvira iz istih krajev kot islam in for the unambiguously epistemic and deontic modals, re- juduizem (sic). spectively. The unacceptable subset brings together the “Of course, Christianity comes from the same place three subtypes – offensive, inappropriate, and violent – in- as Islam and Judaism.” troduced in Section 4.1.. The acceptable and unacceptable This adverb differs from e.g. 
the certainty adverb zago- sets contain 92, 922 and 154, 208 tokens, respectively. tovo in that it does not convey the speaker’s degree of cer- In the epistemic set (Table 3), half of the modals – that tainty,7 but rather simply intensifies whatever he or she is, the possibility modal mogoče and the necessity modals knows to be actually the case (the historical-geographic ziher and zagotovo – are more frequent in the corpus of un- source of Christianity). Because such non-modal readings acceptable discourse, while the remaining 3 modals – that are usually as frequent as the modal meaning in attested is, the possibility modal morda and the logically synony- usage, we have omitted them from our study. mous likelihood modals najbrž and verjetno – are more fre- Lastly, note that in terms of part of speech, the modals quent in the subset of socially acceptable discourse. Over- in Table 2 do not constitute a syntactically homogenous set. all, the six epistemic modals are 1.2 times more frequently used in acceptable discourse than they are in unacceptable 4This is also a result of the EU Code of conduct and terms discourse. of service of social media platforms, according to which content The distribution is reversed in the set of unambiguously deemed illegal due to its hateful character needs to be taken down. deontic modals (Table 4). Here, all modals, save for the 5https://www.clarin.si/noske possibility verb smeti (“to allow”), are more characteris- 6The frequency counts were preformed on lemmas, as this is tic of unacceptable rather than acceptable discourse, with sufficient for distinguishing the part of speech as well; for in- the deontic necessity adjective treba and deontic likelihood stance, the lemma mogoče corresponds to the adverbial forms, adverb naj showing the largest preference for the unac- IND whereas the lemma mogoč corresponds to the adjectival ones; ceptable set. 
Overall, the 6 deontic modals are 1.3 times however, the adjectival form when used predicatively is consis- more frequently used in socially unacceptable discourse tently ambiguous between the non-epistemic and epistemic inter- than they are in acceptable discourse. pretations, see Lenardič and Fišer (2021) for discussion and ex-Statistically, we have tested the overall differences in amples. 7 frequency between the unacceptable and acceptable sets for Zagotovo has the synonym gotovo; we have excluded it from our overview because it is too frequently used in the non-modal both the epistemic (Table 3) and deontic (4) modals using sense, as in (1), which is mostly typical of non-standard Slove-the log-likelihood statistic. This statistic is used to “estab- nian. lish whether the differences [between pairwise frequencies in two corpora with different sizes] are likely to be due to (1) Postrelit in gotovo. chance or are statistically significant” (Brezina, 2018, 83– “Shoot them all – that’s the end of it.” 84). The formula for calculating the log likelihood statistic PRISPEVKI 111 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Acceptable Unacceptable 6. Qualitative analysis Modal AF RF AF RF A/U U/A 6.1. Deontic modals in violent discourse najIND 227 2,442.9 583 3,780.6 0.6 1.5 morati 151 1,625.0 292 1,893.6 0.9 1.2 In Section 5., it was shown that deontic modals are more treba 87 936.3 197 1,277.5 0.7 1.4 typical of unacceptable rather than acceptable discourse, a smeti 41 441.2 60 389.1 1.1 0.9 finding that was shown to be statistically significant. dovoliti 17 183.0 34 220.5 0.8 1.2 To look at the pragmatics of deontic modals and their potrebno 1 10.8 3 19.5 0.6 1.8 discursive role in relation to socially unacceptable dis- Σ 524 5,639.1 1,169 7,580.7 0.74 1.3 course, let’s first recall from Section 4.1. 
that the socially unacceptable discourse in the FRENK corpus is further subTable 4: The distribution of deontic modals in the FRENK divided into several subtypes. Here we focus on two – of- corpus. fensive discourse on the one hand and violent on the other. It turns out that all of the surveyed deontic modals, with the is given in (5), where the observed values O exception of the auxiliary morati, are actually more promi- 1,2 correspond to the absolute frequencies of a modal in the unacceptable nent in violent discourse than in offensive discourse; this and acceptable sets. is shown in Table 5, where for instance treba is almost four times as frequent in the violent-speech subset (RF = 4437.3 (5) 2 × O tokens per million) than it is in the offensive subset (RF 1 × ln O1 + O E 2 × ln O2 1 E2 = 1083.7 tokens per million). It turns out that the overall greater occurrence of epis- What is interesting is that treba and morati are synony- temic modals in the acceptable set (AF = 128 tokens, mous, possibly completely so, in terms of modal logic, as RF = 1, 377.4 tokens/million) than in the unacceptable both entail necessities in terms of modal force and in most set (AF = 181 tokens, RF = 1, 173.7 tokens/million) cases have a deontic reading that has to do with a contex- is statistically insignificant at p < 0.05; log likelihood tually determined obligation.8 However, despite the syn- = 1.902, p = 0.165. By contrast, the greater occurrence onymy, treba is by far more frequent in violent speech than of deontic modals in the unacceptable set (AF = 1, 169 to- it is in offensive, while morati is the only deontic modal kens; RF = 7, 580.7 tokens/million) than in the acceptable that is more prominent in offensive than in violent speech. 
one (AF = 524 tokens; RF = 5, 639.1 tokens/million) is The difference in the distribution of the two synony- statistically significant at the same cut-off point; log likeli- mous modals can be tied to the fact that they vastly differ hood = 32.8, p = 9 × 10−9. in their communicative function, which crucially is observ- Using the online tool Calc (Cvrček, 2021), we have able within the same subset. Put plainly, the chief differ- also calculated the Difference Index (DIN) – an effect-size ence is that treba occurs in considerably more hateful state- metric – for the overall difference between the acceptable ments than morati, even though the statements all qualify and unacceptable deontic sets. The DIN value is −14.687, as violent hate speech rather than offensive speech in that which indicates that the deontic modals’ preference for the some kind of incitement towards violence is expressed in unacceptable set, although statistically significant, is rela- the modalised statement. tively small (Fidler and Cvrček, 2015, 230). In addition, For instance, let’s first consider some typical examples Calc automatically computes the confidence intervals for with treba from the violent subset: the relativised frequencies, which is 5, 639.1 ± 471.4 for the overall acceptable RF and 7, 580.7 ± 426.9 for the un- (6) a. To golazen treba zaplinit, momentalno!!!! acceptable RF at the 0.05 significance level. The fact that “These vermin must be gassed at once!” the intervals do not overlap further confirms that the differ- b. Pederčine je treba peljat nekam in postrelit. ence is not accidental. “Faggots must be taken somewhere and shot.” These findings are related to those in the literature c. Ni treba par tisoč Voltov, dovolj je 220, da ga (see Section 3.) as follows. Just like in Ayuningtias et strese in opozori, da bo čez par metrov stražar al. (2021)’s work on socially unacceptable discourse in s puško. 
YouTube comments, our deontic modals significantly out- “We don’t need a couple of thousand Volts; 220 number epistemic modals in both the acceptable and un- is enough to electrocute them and warn them acceptable sets (e.g., 1, 169 deontic modals vs. 181 epis- that, a couple of metres further on, an armed temic modals under unacceptable). Second, both modals of guard is waiting.” epistemic necessity in Table 3 – that is, zagotovo and ziher (“certainly”) – differ from most of the weaker modals, like morda (“possibly”) and najbrž (“likely”), in that they are 8Note that in negated sentences with treba, negation takes more frequent in unacceptable discourse; this is similar to scope over necessity, which means the interpretation is “it is not the finding by Chiluwa (2015), who shows that strong epis- necessary” rather than “it is necessary not”; a more principled in- temic modals are more frequent than weak ones in the case vestigation into how this interaction affects the pragmatics of the of Tweets by radical militant groups. However and in con- modalised propositions is left for future work, though we note that negation in examples such as (6c) behaves in a similar manner to trast to Chiluwa (2015), our statistically significant finding the so-called metalinguistic negation (Martins, 2020), as the comis not the difference in modal force, but rather the difference menter merely objects to the specific number of Volts, but still in modality type, as discussed above. condones the violent action i.e. the electrocution of migrants. PRISPEVKI 112 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Modal Acceptable Violent Offensive Even when the morati examples convey that it is nec- treba 936.3 4,437.4 1,083.7 essary that some kind of action be taken against e.g. 
mi- potrebno 10.8 568.9 243.1 grants, as in example (8a), the verbs used are such that they dovoliti 183.0 341.3 213.2 no longer convey explicit violent acts, such as postreliti (“to smeti 441.2 682.7 405.7 shoot”), zapliniti (“to gas”), and stresti (“to electrocute”) in morati 1,625.0 1,479.1 1,910.4 the treba examples (6), but express non-violent acts, as in naj 2,442.9 6,371.6 3,647.2 IND the case of the verbal phrase zapreti meje “close the bor- Σ 5,639.2 13,881.0 7,503.3 ders” in (8c). Indeed, the calls to violent action with morati are significantly more tentative, as many of the cases of Table 5: The distribution of deontic modals between the deontic morati are embedded under the conditional mood Offensive and Violent subsets of FRENK; the frequencies clitic bi, which leads to a composite meaning where the de- are relative and normalized to a million tokens. ontic necessity is interpreted as a suggestion rather a direct command, as in examples (8a) and (8c), which also is not The chief linguistic characteristic of the treba examples the case with treba. boils down to lexical choice. The most prominent nomi- To sum up the discussion so far, we have observed nal collocate in the violent subset for the treba examples, that while treba and morati both convey deontic necessity calculated on the basis of the Mutual Information statistic, (roughly an obligation that needs to be met), they are paired is golazen “vermin”, which can be seen in example (6a), up with quite substantially different statements in terms of where migrants are referred to as such. According to Assi- hateful rhetoric in the case of the same type of unacceptable makopoulos et al. (2017, 41) such metaphoric expressions discourse, i.e., violent speech. Further, morati is also the “are an intrinsic part of the Othering process, and central only deontic modal which is less typical of violent speech to identity construction”. In the case of animal metaphors than it is of offensive speech. 
such as MIGRANTS ARE VERMIN, migrants are concep- We suggest that the difference is tied to the way the tually construed and stereotyped as an invasive out-group pragmatics of deontic modals interact with their core syn- that is maximally different from the in-group to which the tactic and semantic properties. As discussed in Section 2.2., speaker considers themselves to belong (ibid.). The other pragmatically deontic modals fulfil the interpersonal func- most prominent nominal collocate is elektrika (“electric- tion in communication. The interpersonal dimension has to ity”); metaphors containing this lexeme or lexemes related do with the fact that the deontic necessity, i.e., obligation, to electricity (volts, to schock, etc.) often have implied is ascribed by the speaker to whoever corresponds to the reference, where the undergoers of the verbal event, i.e., agent of the verbal event in the modalised proposition; con- migrants, are not directly mentioned, as shown in example cretely, in the case of example (8a), the speaker says that it (6c). Curiously, when the targets of violent speech are not is European countries that have the obligation to strike back migrants but members of the LGBT community, instead of against migrants. metaphors like golazen, slurs such as pedri (“faggots”) are The chief difference between the treba (6) and the used, as in example (6b). morati (8) examples, manifested in the discussed lexi- Note that it is not only treba which patterns with such cal differences, lies in this interpersonal pragmatic dimen- charged lexical items; for instance, the adverb naj, which sion, which is crucially influenced by the syntax of the denotes the speaker’s desire in terms of deontic modality, expressions. Treba is an impersonal predicative adjective also frequently occurs with the electricity metaphor, as in which, in contrast to morati, syntactically precludes the use (7). 
of a nominative grammatical subject that would be inter- preted as the agent in the modalised proposition (Rossi and (7) Elektriko v žice spustit. Naj kurbe skuri! Zinken, 2016). Consequently, all the statements in the treba “Electrify the fence wires! May it burn the set of examples are such that the agent has an undefined, whores!” arbitrary reference – for instance, it is unclear who is ex- pected to “gas the vermin” in example (6a). What happens The examples with morati, on the other hand, are sig- pragmatically is that the subject-less syntax of the adjec- nificantly less lexically charged, as shown in (8), and the tive treba allows the speaker to sidestep the ascription of statements framed in a more indirect way. obligation to a specific agent, thus largely obviating what (8) a. Vse Evropske države bi morale bolj grobo is perhaps the core interpersonal aspect of deontic modal- udarit po migrantih. ity. This cannot be really avoided with morati, which is “All European countries should have to more a personal verb that obligatorily selects for a grammatical strictly strike back against migrants.” subject in active clauses – in other words, because of its per- b. Kdo nas zaščitil[,] a moramo mi tud nabavit sonal syntax, morati presents a bigger interpersonal burden pištolo on the speaker, as he or she needs to specifically name the “Who will protect us? Do we also have to buy person or institution that is required to fulfill the obligation. a gun?” Note that, in the violent subset, there is only one exam- ple where morati is used with the verb dobiti (“get”), which c. Evropa bi morala stopiti skupaj hermeticno za- induces a passive-like interpretation (9). Here, the gram- preti meje. matical subject headed by Vsak (“everyone”) is interpreted “Europe should have to come together and her- as the target of the violent action rather that the agent. 
It is metically close the borders.” PRISPEVKI 113 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Modal Acceptable Violent Offensive In offensive comments, ziher is used either as a booster morda 258.3 0.0 169.3 (10) or a hedge (11), a discursive function which the com-mogoče 312.1 113.8 555.8 menter uses as part of the face-saving strategy. Boosting is verjetno 559.6 341.3 451.6 shown in example (10). najbrž 129.1 0.0 90.3 ziher 86.0 113.8 97.3 (10) Begunca? Ekonomske migrante pa picke, ki se ne zagotovo 32.3 113.8 83.4 znajo borit za svoj kos zemlje ZIHER ne!!!!!!! Σ 1,377.4 682.7 1,447.5 “Accepting a refugee? CERTAINLY not accepting economic migrants and cunts who don’t know how Table 6: The distribution of epistemic modals between the to fight for their piece of land!!!!!!!” Acceptable, Violent, and Offensive subsets of FRENK; the frequencies are relative and normalized to a million tokens. In this example, the use of the modal conveys the lex- ical meaning of certainty and thus the full speaker’s truth commitment to the propositional content. By being accom- telling that this is also the only example with morati which panied by excessive exclamatory punctuation, upper case is closer in the use of lexically charged items (i.e., being letters and contemptuous argumentation, the modal prag- “shot in the head” rather than “the closing of borders” in the matically acts as a booster emphasizing the speaker’s com- previous examples) to the treba examples, as this passive- mitment. The face-saving dimension comes about because like construction also precludes the use of an agentive noun the assertiveness conveyed by the modal helps legitimize phrase (unless it is introduced by the Slovenian equivalent the speaker as a member of the in-group that is exclusion- of the by-phrase, but there are no such examples in the cor- ary of migrant out-group. pus). 
(11) [K]r k cerarju nej gredo zihr ma veliko stanovanje (9) [V]sak, ki se približa našim ženskam in otrokom, ... bedaki. mora dobiti metek v čelo. “They better go to the prime minister Cerar, he “Everyone who gets close to our women and chil- surely has a big flat ... assholes.” dren must be shot in the head.” Contrary to the previous example, the modal in (11) In short, the interpersonal structure influences the de- pragmatically hedges the propositional content by invok- gree of hateful rhetoric, in the sense that speakers are more ing the presumed shared knowledge of the in-group, which ready to use degrading metaphors, slurs and violent ver- concerns the size of the prime minister’s home. Here, hedg- bal expressions when they can avoid ascribing the obliga- ing is related to the fact that the modal activates the face- tion to someone specific. We follow Luukka and Markka- saving strategy which protects the speaker from the accusa- nen (1997) by suggesting that impersonality has a similar tion of making an unfounded claim, as the modalised state- hedging effect to epistemic modals, in the sense that the ment, despite entailing certainty, is still weaker than the un- unexpressed agent in impersonals introduces a degree of modalised variant which would otherwise report that the semantic vagueness to the proposition, as does uncertainty speaker holds factual knowledge about the prime minister’s brought about by the epistemic reading. Thus, with treba, apartment. deontic imposition and epistemic face-saving meet in one While the offensive comments predominantly feature and the same lexeme. ziher in such a hedging or boosting role, in the large ma- 6.2. 
Epistemic modals in offensive and acceptable jority of the acceptable comments, the modal conveys an discourse additional figurative meaning – i.e., that of irony, which we also claim is related to face-saving and contributes an ad- Epistemic modals are slightly more frequent in accept- ditional persuasive effect in terms of discourse pragmatics able comments, although the difference is not statistically (Gibbs and Izett, 2005; Attardo, 2000). significant, as was shown in Section 5. In order to explore Example (12) conveys a proposition whose ironic mean-further the possible differences and similarities in the use ing is emphasized by the modal ziher. of epistemic modals between different types of comments, we look at their distribution in three subcorpora, namely in (12) Itak, dejmo vsi lagat, to je ziher prav :) acceptable, offensive and violent comments. The distribu- “Of course, let’s all lie, that’s certainly the right tion is shown in Table 6. We find that epistemic modals thing to do :)” are very infrequent in the violent comments (even unat- tested for morda “possibly” and najbrž “likely” ) in contrast The ironic reading of this example is suggested by the to deontic modals, which are more frequent almost across use of the intensifying adverb itak (“of course”), exagger- the board in the violent set (Table 5). On the other hand, ation by means of the collective reading of the plural pro- the epistemic modals show a similar distribution between noun vsi (“everyone”), the use of the verb in the first-person acceptable and offensive comments in contrast to violent dejmo (“let’s”), and the use of the emoticon. Finally, the comments. face-saving strategy enacted in this example has two di- We now look at the pragmatics of the epistemic ne- mensions. 
The first is the protection of the speaker’s face cessity modal ziher (“certainly”), as it exhibits the most since the irony not only enables the speaker to capitalise comparable frequency between the acceptable and offen- on the use of a sophisticated rhetorical device, but also to sive subcorpora. claim group affiliation by clearly stating the values that the PRISPEVKI 114 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 group has in common. The second aspect is the protection 8. References of the addressee’s face since the irony helps tone down the Stavros Assimakopoulos, Fabienne H. Baider, and Sharon speaker’s criticism – according to Gibbs and Izett (2005), Millar. 2017. Online hate speech in the European Union: ironic criticism is accepted better or in a friendlier way than A discourse-analytic perspective. Springer Na- ture. direct critiques. Salvatore Attardo. 2000. Irony markers and functions: To- wards a goal-oriented theory of irony and its processing. 7. Conclusion Rask, 12:3–20. This paper has presented a corpus investigation of epis- Diah Ikawati Ayuningtias, Oikurema Purwati, and Pratiwi temic and deontic modal expressions in Slovenian Face- Retnaningdyah. 2021. The lexicogrammar of hate speech. book comments in the FRENK corpus. In: Thirteenth Conference on Applied Linguistics We have first proposed a set of Slovenian modals that (CONAPLIN 2020), pages 114–120. Atlantis Press. show an overwhelming tendency towards a single modal Vaclav Brezina. 2018. Statistics in corpus linguistics: A reading. Because of such unambiguity, they constitute a practical guide. Cambridge University Press. robust set that allows for precise quantitative comparisons Penelope Brown, Stephen C Levinson, and Stephen C between different types of discourse without irrelevant con- Levinson. 1987. 
Politeness: Some universals in lan- founding examples and for careful manual analysis of the guage usage, volume 4. Cambridge University Press. corpus examples. Quantitatively, we have shown that de- Innocent Chiluwa. 2015. Radicalist Discourse: A study of the ontic modals are a prominent feature of unacceptable dis- stances of Nigeria’s Boko Haram and Somalia’s al course, and that they are especially prominent in discourse Shabaab on Twitter. Journal of Multicultural Discourses, that concerns incitement to violent action, which is legally 10(2):214–235. prosecutable. Jennifer Coates. 1983. The Semantics of the Modal Auxil- In terms of discourse pragmatics, we have first shown iaries. Croom Helm, London and Canberra. that modals which are completely synonymous both in Jennifer Coates. 1987. Epistemic modality and spo- ken terms of force and modality type can nevertheless pro- discourse. Transactions of the Philological S ociety, foundly differ in the degree of hateful rhetoric in the same 85(1):110–131. type of socially unacceptable discourse. We have shown Václav Cvrček. 2021. Calc 1.03: Corpus Calculator. that what makes a difference in such examples is the pres- Czech National Corpus. https://www.korpus. ence of impersonal syntax, which offers speakers the ability cz/calc/. to linguistically obviate the ascription of the denoted obli- Kristina Pahor de Maiti, Darja Fišer, and Nikola Ljubešić. gation to a particular agent. We have suggested that this 2019. How haters write: Analysis of nonstandard lan- sort of face-saving strategy of ambiguity by way of imper- guage in online hate speech. Social Media Corpora for sonality correlates with the speaker’s tendency to use dehu- the Humanities (CMC-Corpora2019), page 37. manising language, such as slurs or degrading metaphors. Richard Delgado. 2019. Understanding words that wound. In the case of epistemic modals, we have shown that ac- Routledge. 
ceptable and offensive comments, which are highly similar Bojan Evkoski, Andraž Pelicon, Igor Mozetič, Nikola at their surface linguistic level, differ pragmatically in re- Ljubešić, and Petra Kralj Novak. 2022. Retweet com- lation to face-saving; while offensive comments use epis- munities reveal the main sources of hate speech. PloS temic modals as simple hedging or boosting devices, ac- one, 17(3):e0265602. ceptable comments use the modals to convey ironic state- Masako Fidler and Václav Cvrček. 2015. A data-driven ments in which the irony is emphasised by the modal. We analysis of reader viewpoints: Reconstructing the his- have claimed that the irony also contributes to the face- torical reader using keyword analysis. Journal of Slavic saving pragmatics. linguistics, pages 197–239. In future work, we intend to explore how deontic and Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2017. epistemic modals also differ based on topic (migrants on Legal framework, dataset and annotation schema for so- the one hand and the LGBTQ community on the other). cially unacceptable online discourse practices in slovene. We also want to explore if and how the discourse differs In: Proceedings of the first workshop on abusive language if the unacceptable comments are either directed towards online, pages 46–51. a person’s individual background (e.g., race, ethnicity) or Katharine Gelber and Luke McNamara. 2016. Evidencing group affiliation (e.g., political party). the harms of hate speech. Social Identities, 22(3):324– Acknowledgments 341. Raymond W Gibbs and Christin Izett. 2005. Irony as per- The work described in this paper was funded by the suasive communication. Figurative language compre- Slovenian Research Agency research programme P6-0436: hension: Social and cultural influences, pages 131–151. Digital Humanities: resources, tools and methods (2022– Francisco Gonzálvez Garc´ıa. 2000. 
Modulating grammar 2027), the DARIAH-SI research infrastructure, and the na- trough modality: A discourse approach. ELIA, 1, 119- tional research project N6-0099: LiLaH: Linguistic Land- 136. scape of Hate Speech. Michael A.K. Halliday. 1970. Functional diversity in lan- guage as seen from a consideration of modality and PRISPEVKI 115 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 mood in english. Foundations of language, pages 322– 361. Ken Hyland. 1998. Hedging in scientific research articles, volume 54. John Benjamins Publishing Company Ams- terdam. Ken Hyland. 2005. Stance and engagement: A model of interaction in academic discourse. Discourse studies, 7(2):173–192. Angelika Kratzer. 2012. Modals and conditionals: New and revised perspectives, volume 36. Oxford University Press. Jakob Lenardič and Darja Fišer. 2021. Hedging modal adverbs in slovenian academic discourse. Slovenščina 2.0: empirical, applied and interdisciplinary research, 9(1):145–180. Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2019. The frenk datasets of socially unacceptable discourse in slovene and english. In: International conference on text, speech, and dialogue, pages 103–114. Springer. Nikola Ljubešić, Darja Fišer, Tomaž Erjavec, and Ajda Šulc. 2021. Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.1. Slove- nian language resource repository CLARIN.SI. http: //hdl.handle.net/11356/1462. Nolwenn Lorenzi-Bailly and Mariem Guellouz. 2019. Ho- mophobie et discours de haine dissimulée sur twitter: celui qui voulait une poupée pour noël. Semen. Revue de sémio-linguistique des textes et discours, 47. Minna-Riitta Luukka and Raija Markkanen. 1997. Imper- sonalization as a form of hedging. Research in Text The- ory, pages 168–187. Ana Maria Martins. 2020. Metalinguistic negation. In The Oxford Handbook of Negation. Oxford University Press. 
Frank Robert Palmer. 2001. Mood and modality. Cam- bridge University Press. Carolin F. Roeder and Björn Hansen. 2006. Modals in contemporary slovene. Wiener Slavistisches Jahrbuch, 52:153–170. Giovanni Rossi and Jörg Zinken. 2016. Grammar and so- cial agency: The pragmatics of impersonal deontic state- ments. Language, 92(4):e296–e325. Alexandra A Siegel. 2020. Online hate speech. Social me- dia and democracy: The state of the field, prospects for reform, pages 56–88. Maria Grazia Sindoni. 2018. Direct hate speech vs. indi- rect fear speech. A multimodal critical discourse analysis of the sun’s editorial ‘1 in 5 brit muslims’ sympathy for jihadis”. Lingue e Linguaggi, 28:267–292. Kai von Fintel. 2006. Modality and language. In Don- ald M. Borchert, editor, Encyclopedia of Philosophy – Second Edition, pages 20–27. MacMillan Reference USA, Detroit. Milica Vukovic. 2014. Strong epistemic modality in par- liamentary discourse. Open Linguistics, 1(1). Simon Winter and Peter Gärdenfors. 1995. Linguistic modality as expressions of social power. Nordic Journal of Linguistics, 18(2):137–165. PRISPEVKI 116 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 The ParlaSpeech-HR benchmark for speaker profiling in Croatian Nikola Ljubešić,∗† Peter Rupnik∗ ∗Department of Knowledge Technologies Jožef Stefan Institute Jamova cesta 39, SI-1000 Ljubljana nikola.ljubesic@ijs.si peter.rupnik@ijs.si †Faculty of Computer and Information Science University of Ljubljana Večna pot 113, SI-1000 Ljubljana Abstract Recent advances in speech processing have made speech technologies significantly more accessible to the research community. Beyond the most-popular task of automatic speech recognition, classifying speech acts by various criteria has also recently caught interest. 
In this paper we propose a benchmark constructed from a dataset of speeches given in the Croatian parliament, aimed at predicting the following speaker profile features: speaker identity, gender, age, and power position (whether the speaker is in the ruling coalition or the opposition). We evaluate various pre-trained transformer models on our variables of interest, showing that speaker identification and power position prediction seem to rely mostly on language-specific features, while gender and age prediction rely more on generic speech features, available also in models not pre-trained on the target language. We release the benchmark to serve in measuring the strength of upcoming speech models on a lower-resourced language such as Croatian.

1. Introduction

Speech technologies have recently experienced a quantum leap in their development due to the successful application of self-supervised pre-training of transformer models on speech data (Schneider et al., 2019). This significant simplification of the development of speech technologies has increased their uptake considerably (Fan et al., 2020; Pepino et al., 2021; Bartelds et al., 2022), and has also led to the first open dataset for training automatic speech recognition in Croatian (Ljubešić et al., 2022), based on data from the Croatian parliament. Parliamentary data are especially suited for speech experiments, not only because they are in the public domain, but also because they are rich in speaker metadata (Ljubešić et al., 2022).

In this work we present a rather opportunistic benchmark for speaker profiling in Croatian, based on the ParlaSpeech-HR dataset and the available information on its speakers. We define four tasks. In the first task, speaker identification, the goal is to predict which of 50 possible speakers produced a given speech act. In the second task, male and female speakers are to be discriminated between. The third task is focused on discriminating between younger and older speakers, with 49 years of age as the division point between the two age groups. In the fourth task we aim at discriminating speech acts depending on whether they were given by MPs from the ruling coalition or from the opposition.

We compare models pre-trained on the target language (Croatian) with models that were not pre-trained on this language, obtaining insights not only into how well transformer models perform on these tasks, but also into how language-dependent the tasks are. While there have been many approaches to speaker profiling developed before the era of transformers, in this work we limit ourselves to evaluating transformer models only, primarily due to their reported superior performance (Yang et al., 2021).

2. Related work

Various speech benchmarks, including speaker profiling tasks, exist, but mostly for the English language. The SUPERB benchmark (Yang et al., 2021) consists of the tasks of speaker identification – identifying the speaker from a closed set of speakers; speaker verification – a binary task of deciding whether two utterances are spoken by the same speaker; and speaker diarization – predicting who is speaking at each timestamp, where multiple speakers can also speak simultaneously.

Another recent benchmark, XTREME-S (Conneau et al., 2022), is focused on evaluating universal cross-lingual speech representations in many languages, with its speech classification tasks covering spoken language identification among 104 languages and intent classification in the e-banking domain.

A dataset used for speaker identification benchmarking is VoxCeleb (Nagrani et al., 2017), consisting of voice samples of over 1,000 celebrities, obtained by applying facial recognition to YouTube videos.

A well-known dataset used for benchmarking automatic speech recognition systems, but also used for speaker profiling, is the TIMIT dataset (Garofolo et al., 1993), consisting of 630 speakers of 8 dialects of American English. It includes speaker information on gender, age and height (Kalluri et al., 2020).

In this work we do not try to build on top of the existing benchmarks, for two reasons. The main reason is our interest in less-resourced languages, primarily South Slavic languages, for which there is little to no data available. The very recently released ParlaSpeech-HR dataset, on which this benchmark is based, is the first openly available speech dataset for Croatian (Ljubešić et al., 2022). The second reason is the disruptive effect speech transformers have had on the field, drastically lowering the previous level of error (Yang et al., 2021), with significant further improvements expected in the near future. This is why we opt for a new, very opportunistic benchmark on speakers from the Croatian parliament. Besides documenting the highly important data selection decisions, we report first results with the current state-of-the-art technology. Given the high pace of innovation in speech technologies, which is not likely to slow down soon, this benchmark will be highly useful in assessing what new technologies are and will be able to offer to a less-resourced language such as Croatian.

3. Benchmark construction

In this section we present the dataset our benchmark is constructed on, and the data selection protocols for the four variables of interest.

3.1. The dataset

The dataset this benchmark is based on is the ParlaSpeech-HR dataset (Ljubešić et al., 2022), aimed primarily at developing automatic speech recognition systems for Croatian. It consists of 1,816 hours of speech from 309 speakers. For each speaker, metadata on age, gender, party affiliation, role in the parliament, and power status (opposition vs. coalition) is available. More details on the content and construction procedure of the ParlaSpeech-HR dataset can be found in its description paper (Ljubešić et al., 2022).

3.2. Data selection

For each of the four tasks a separate data selection procedure was set up, given the limited data available, but also the different nature of the tasks. While most tasks are binary (gender, age, power status), speaker identification is a 50-class task. Furthermore, while for the three binary tasks the training, development and testing subsets have to consist of different speakers, for the speaker identification task the same speakers have to be present in all three subsets. Finally, in the tasks of age and power status prediction we decided to sample only from male speakers, as there are too few female speakers in the dataset for a reasonable sampling that would not introduce unwanted bias.

Additionally, in each of the four tasks we only selected instances that were at least 8 seconds in duration. While most of the ParlaSpeech-HR dataset consists of such instances (voice activity detection was set up in such a fashion), a small number of instances, mostly coming from the endings of audio files, are shorter than 8 seconds.

We also discarded speakers producing more than 3,000 or fewer than 200 instances. While speakers with a small production might complicate the data selection procedure, as we want each selected speaker to be equally represented in a sample, the most prolific speakers were left out of the sampling procedure due to their very specific roles in the parliament, which quite likely carry particular unwanted biases into their speech production.

In the four following subsections we describe the specific sampling criteria applied for each of our four tasks.

3.2.1. Speaker identification

For the task of speaker identification, 25 speakers per binary gender were sampled. Per speaker, 100 instances were included in the training subset, 10 in the development subset, and 10 in the test subset. Checks were performed to ensure that for no speaker do instances from the same video appear in more than one subset. With this sampling procedure, each of the three subsets consists of the same 50 speakers, the training subset having 5,000 instances, and the development and test subsets 500 instances each.

3.2.2. Gender prediction

For each of the two binary genders, male and female, 25 speakers were selected for the training subset, every speaker being represented with 20 instances. For each of the two genders, 5 speakers not already in the training subset were taken for the development split, and 5 speakers for the test split. Every speaker in the development and test subsets was represented with 200 instances. With this we assured three subsets of distinct speakers, the training subset consisting of 1,000 instances, and the development and test subsets of 2,000 instances each.

3.2.3. Age prediction

Given that there are very few distinct female speakers in the ParlaSpeech-HR dataset, and that controlling for gender in any data split is necessary due to the likely strong signal coming from the gender of the speaker as a potential confounder, after some metadata analyses we decided to set up the age prediction task on male speakers only.

The age distribution of male speakers is rather narrow and approximately normal around the median of 49 years of age. It is far from the uniform and wide distribution that would allow for a more diverse age prediction task, set up as a regression task or a classification task with many categories. This is why we decided to define a binary task, predicting whether a speaker is below or above the median age. For the training portion of the task, 60 speakers were selected, with 20 instances per speaker. For the development and test sets, 20 speakers were selected for each subset, each speaker being represented by 50 instances. While performing the split, additional checks were put in place to ensure that the age distribution in each of the subsets is as close as possible to the distribution in the full dataset. Additional checks were also performed to ensure that no speaker leakage existed between the three subsets. With this data selection, the training subset consists of 1,200 instances, while the development and testing subsets consist of 1,000 instances each. Given that the median was chosen as the classification boundary, the final dataset is balanced regarding the two levels of the age variable.

3.2.4. Power status prediction

We decided to wrap up the benchmark with a quite likely less acoustic and more semantic task. Given that we are currently proposing a shared task on predicting whether a transcript of a speech was given by the ruling coalition or the opposition, we decided to add that task to this benchmark as well, but performed on speech rather than on text transcripts. The ParlaSpeech-HR data come from a single term of the Croatian parliament, which means that the ruling coalition members are mostly from the right side of the political spectrum, while the opposition members are mostly from the left. Disentangling party affiliation or political orientation from power status was thus impossible here, which has to be taken into account while analysing the results.

Similar to the task of age prediction, we again sampled only among male speakers, as the number of female speakers was too low for well-stratified samples. As with age, given the high predictability of gender, we did not want to allow gender to become a confounder of our primary prediction task, which in this case is power status.

We sampled 25 speakers per power status for training, each speaker being represented by 50 instances. For the development and test sets we selected 9 speakers for each subset, again representing each speaker with 50 instances. Additional checks were performed to ensure that there is no speaker leakage between the three subsets. With this, the size of the training subset is 2,500 instances, while the development and test subsets consist of 900 instances each. For simplicity of evaluation, the division of instances regarding the power status variable is balanced, with 50% of instances coming from each side of the political power spectrum.

The benchmark is made available to the public for reproducibility and further benchmarking via the GitHub repo https://github.com/clarinsi/parlaspeech-hr-benchmark/.

4. Experimental setup

In this section we give a short description of the setup of the experiments performed on the newly constructed benchmark.

We perform all our experiments with transformer models (Vaswani et al., 2017) that were pre-trained on spoken data. We use the Transformers library (Wolf et al., 2019) and retrieve pre-trained models from the Hugging Face model repository.

We use the model pre-trained on Croatian that has proven to perform best on the task of automatic speech recognition (ASR) (Ljubešić et al., 2022), namely the Slavic model [1]. We compare the performance of the pre-trained-only model to the model that was additionally fine-tuned on the ASR task (Slavic-asr [2]) to investigate whether fine-tuning the model on the same data, but another task, improves performance.

We also compare the performance of the model pre-trained on Croatian to a model pre-trained on an unrelated language, in our case English (English-asr [3]). We decided to use the English model fine-tuned for ASR, as the non-fine-tuned model [4] was giving random results after fine-tuning on any of our four tasks. This suspiciously bad result is probably to be traced back to a technical issue in the model, rather than to the fact that the model was not fine-tuned on ASR beforehand, as will be seen in the comparison between the performance of the Slavic and Slavic-asr models.

The overview of the models used in our experiments, together with a short description of the type and amount of data the models were pre-trained and fine-tuned on, is given in Table 1. The non-fine-tuned Croatian model was pre-trained on around 99 thousand hours of raw recordings of speeches in various Slavic languages given in the European parliament. The fine-tuned Croatian model was additionally fine-tuned on the ASR task on around 300 hours of the ParlaSpeech-HR dataset. The English model was pre-trained on 53 thousand hours of raw speech material obtained from audio books and was fine-tuned for the ASR task on 960 hours of similar material.

[1] https://huggingface.co/facebook/wav2vec2-large-slavic-voxpopuli-v2
[2] https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr
[3] https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self
[4] https://huggingface.co/facebook/wav2vec2-large

model name | short name | pre-training | ASR fine-tuning
facebook/wav2vec2-large-slavic-voxpopuli-v2 | Slavic | Slavic (99k hours) | -
classla/wav2vec2-large-slavic-parlaspeech-hr | Slavic-asr | Slavic (99k hours) | Croatian (300 hours)
facebook/wav2vec2-large-960h-lv60-self | English-asr | English (53k hours) | English (960 hours)

Table 1: List of models used in our experiments, with the amount and type of pre-training and fine-tuning data.

Regarding hyperparameter optimization, we investigate only the number of epochs required for performance improvements to stall, which is determined by training on the training portion and evaluating on the development portion. For the first two tasks of speaker identification and gender prediction, two epochs were shown to be enough, while for the tasks of age prediction and power status prediction, 15 epochs over the training subset were chosen as optimal.

We evaluate each model on our test subset by reporting both accuracy and macro F1. Given that all our tasks consist of datasets with a balanced distribution of the response variable, the random baseline lies at 0.5 for the binary classification schemas, and at 0.02 for the 50-class speaker identification schema.

For the less challenging tasks of speaker identification and gender prediction, we perform two types of evaluation: on full instances, and on the first 2 seconds of each instance only.

5. Results

5.1. Speaker identification

The results on the speaker identification task are presented in Table 2.

model | clipped | accuracy | macro F1
Slavic | no | 0.998 | 0.998
Slavic | 2 sec | 0.806 | 0.784
Slavic-asr | no | 1.000 | 1.000
Slavic-asr | 2 sec | 1.000 | 1.000
English-asr | no | 0.334 | 0.275
English-asr | 2 sec | 0.106 | 0.048

Table 2: Speaker identification results.

The results show the task to be quite easy for the Slavic and Slavic-asr models applied to full instances. The model fine-tuned on ASR seems to perform slightly better in the full-data scenario, keeping an even score on instances clipped to two seconds, while in that case the non-fine-tuned model experiences a significant drop of 20 points. This result seems to show how important it is for the model to have experienced the exact speakers it is supposed to differentiate between, even on another task such as ASR. We do not believe that transfer has occurred between the ASR task and the speaker identification task directly (the model exploiting what people are saying while deciding on the speaker identity), but rather that the model's parameters were previously adapted to focus better on the peculiarities of the 50 speakers in question.

Interestingly, the English model performs rather badly, with predictions over the full length of each instance (between 8 and 20 seconds) being correct in only 33% of cases. This is still quite far from the random baseline of 2%, but also very far from the stellar performance of the models pre-trained on Croatian. Predicting on only 2 seconds of speech further deteriorates the results to an accuracy of 10%. For the speaker identification task the pre-training language thus seems to be very important, as the model quite likely models the phonetic peculiarities of each speaker, rather than only acoustic features for which any speech transformer should be useful.

To investigate which speakers get confused by the Slavic model when only two seconds are available for prediction, we present the confusion matrix in Figure 1. The matrix shows that speakers of the same gender are confused with each other, e.g. Arsen Bauk, Davor Bernardić and Božo Petrov being confused for Žarko Katić, or Sunčana Glavak and Ljubica Lukačić being misclassified as Ivana Ninčević-Lesandrić.

[Figure 1: Confusion matrix for speaker identification with the Slavic model on instances clipped to two seconds.]

5.2. Gender prediction

The results on the task of gender prediction are presented in Table 3.

model | clipped | accuracy | macro F1
Slavic | no | 0.997 | 0.997
Slavic | 2 sec | 0.989 | 0.989
Slavic-asr | no | 0.985 | 0.985
Slavic-asr | 2 sec | 0.985 | 0.985
English-asr | no | 0.999 | 0.999
English-asr | 2 sec | 0.994 | 0.994

Table 3: Gender prediction results.
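Both metrics used throughout these tables can be recomputed from raw predictions in a few lines. The following sketch is a minimal pure-Python illustration (not taken from the authors' evaluation code); the toy labels mirror the gender task's one-sided confusion pattern, where all errors are male instances predicted as female, so macro F1 dips slightly below accuracy:

```python
def accuracy(gold, pred):
    """Share of instances where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Average of per-class F1 scores, weighting each class equally."""
    labels = set(gold) | set(pred)
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical labels: 2 of 10 male instances misclassified as female.
gold = ["F"] * 10 + ["M"] * 10
pred = ["F"] * 12 + ["M"] * 8
print(round(accuracy(gold, pred), 3))  # 0.9
print(round(macro_f1(gold, pred), 3))  # 0.899
```

On a perfectly balanced test set, as used in this benchmark, a random classifier scores around 0.5 on both metrics, which is why the paper quotes 0.5 (binary tasks) and 0.02 (50 classes) as baselines.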
On this task all three models, regardless of the language they are pre-trained on, achieve very good per-800 formance, the lowest result being accuracy of 98.5%, and F 1000 0 the difference in the length of test instances not having a 600 strong impact. Interestingly, the Slavic-asr model that performed perfectly on the speaker identification task is the one that performs the worse on the gender prediction task. True label 400 Investigating what type of confusion occurs on this task M 21 979 we analyse the output of the Slavic model on 2-second instances. We represent the results via a confusion matrix in 200 Figure 2, showing that male instances are sometimes con- F M fused for female instances, but not vice versa. Investigating Predicted la bel 0 further what speakers are being confused most of the time, it shows that it is a limited number of speakers whose voice Figure 2: Confusion matrix for speaker gender prediction has, at least in some occasions, a higher pitch. of the Slavic model on 2-second test instances. The results on gender prediction show that transformer PRISPEVKI 120 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 model clipped accuracy macro F1 100 Test split a ge distribution Slavic no 0.694 0.690 Misscla ssifica tions Media n a ge Slavic-asr no 0.722 0.722 80 English-asr no 0.678 0.672 Table 4: Age prediction results. 60 400 Frequency 40 350 20 old 290 210 300 0 30 40 50 60 70 250 Age True label 200 young 96 404 Figure 4: Distribution of age in our test subset, along with misclassifications by the Slavic model. 150 old young model clipped accuracy macro F1 Predicted la bel 100 Slavic no 0.590 0.587 Slavic-asr no 0.627 0.626 Figure 3: Confusion matrix for speaker age classification English-asr no 0.549 0.531 by the Slavic model. Table 5: Power status identification. 
models do not rely on language-specific features, but quite ers (68 and 69 years) show to be perfectly performed by the likely on the pitch of a speaker’s voice, with best results model. being reported by the English model, with almost perfect This insight might motivate us to organise the age pre- results even on 2-second test instances. diction task in the future as a classification task into three categories, the middle category, around the median age, be- 5.3. Age prediction ing considered hard, and discarded in the easier setup of the The results on age prediction, guessing whether a classification task. speaker is younger or older than 49 years, which is the me- dian speaker age in the dataset, are given in Table 4. Here 5.4. Power status prediction we do not perform experiments on speech samples clipped The results of our final task, power status prediction, to two seconds as the task is already demanding enough on are given in Table 5. The results show to be, as expected, full-length instances. The Slavic-asr model seems to the lowest of all four tasks defined in this benchmark. perform best, with accuracy of 72%, 50% being a random The Slavic-asr model performs best, with the differ- result. The Slavic and English-asr model seem to ence to the non-finetuned model being 2.7 accuracy points. be suspiciously close in performance, only with a point and The model that was not pre-trained on Croatian achieves a half difference, which shows that the age prediction task a significantly lower result, 5 points lower than any model does not rely on language-specific features, but rather gen- pre-trained on Croatian, showing that for solving this task eral acoustic features. mostly language-specific features are used. To investigate the confusion patterns between the two Which features exactly are actually used is hard to iden- age groups, we plot a confusion matrix of the Slavic tify. The only attempt we perform in this direction is a per- model in Figure3. 
The confusion matrix shows clearly that speaker analysis of correct and incorrect classifications by more frequently older speakers tend to be misclassified as the Slavic-asr model, which we present in Figure 5. younger speakers than vice versa. The results show that people in power seem to be easier Given that we have divided the speakers by age on the to identify than those who are in opposition, as the speak- median point, and that the speaker age is rather normally ers having the lowest percentage of correctly classified in- distributed, we wanted to additionally check whether most stances are mostly from the opposition. The error also of the prediction errors occur on users who are close to the seems to be rather speaker-dependent, with eight of the class boundary. To investigate this, we plot an instance- speakers having accuracy above 80%, and the five worst- level age histogram in Figure 4, encoding the correctly and performing speakers having accuracy below 40%. incorrectly classified instances by the Slavic model with Analysing the five worst-performing speakers, a trend different colour. The histogram shows that most misclas- can be observed, with the two speakers in power being two sifications happen, as expected, close to the median class of the most fine-mannered speakers, while two out of three boundary, with almost all instances of speakers of 50 and speakers from the opposition are rather known for their 51 years of age being misclassified as younger speakers. harsh speech. 
This analysis has also shown that the signal the classifier has caught on is quite likely based on the political orientation rather than power status itself. For modelling power status in speech, the training and evaluation data should consist of data from multiple parliamentary terms, with the same political orientations having speeches given both while in power and while in opposition.

Figure 5: Per-speaker accuracy with the Slavic-asr model on the power status prediction task.

6. Conclusion
In this paper we have presented a benchmark for speaker profiling in Croatian, based on the recordings of the Croatian parliament. We have carefully selected the speakers and instances to be used in the benchmark, paying special attention to any type of bias or confounders that might be included in the tasks.
We have performed initial experiments with transformer models pre-trained on speech, obtaining interesting insights. The task of speaker identification seems to be rather language-dependent, and can be further improved if the model has seen the speakers to be identified before the final fine-tuning process. Gender prediction seems to be the least language-specific, obtaining very good results regardless of the model, quite likely relying simply on the pitch of the speaker. Age prediction, in our case set up as a binary task with the boundary at the age median, proves to be hard, but very feasible on instances that are further away from the classification boundary. The task seems to use language-specific features only to a small extent, but the model that has experienced the same speakers before the final fine-tuning still performs visibly better than the model that has not. Power status prediction is the hardest of all four tasks, and seems to rely on language-specific features, again profiting additionally from experiencing the speakers prior to the final fine-tuning. Analysing the accuracy by speaker shows that the power status model seems to have caught on to the political orientation rather than the language of power itself. For working on modelling that phenomenon, a dataset controlling for political orientation should be constructed, which requires a much wider data range than is currently available.
We are releasing the benchmark definitions, to be coupled with the full ParlaSpeech-HR dataset (Ljubešić et al., 2022), in a GitHub repository.5

5 https://github.com/clarinsi/parlaspeech-hr-benchmark/

Acknowledgements
This work has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu project). This communication reflects only the author's view. The Agency is not responsible for any use that may be made of the information it contains. This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

7. References
Martijn Bartelds, Wietse de Vries, Faraz Sanal, Caitlin Richter, Mark Liberman, and Martijn Wieling. 2022. Neural representations for modeling variation in speech. Journal of Phonetics, 92:101137.
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, et al. 2022. XTREME-S: Evaluating cross-lingual speech representations. arXiv preprint arXiv:2203.10752.
Zhiyun Fan, Meng Li, Shiyu Zhou, and Bo Xu. 2020. Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185.
John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93:27403.
Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy. 2020. Automatic speaker profiling from short duration speech data. Speech Communication, 121:16–28.
Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, and Bojan Evkoski. 2022. ASR training dataset for Croatian ParlaSpeech-HR v1.0. Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1494.
Nikola Ljubešić, Danijel Korzinek, Peter Rupnik, and Ivo-Pavao Jazbec. 2022. ParlaSpeech-HR – a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. In Proceedings of the Third ParlaCLARIN Workshop, Marseille, France.
Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
Leonardo Pepino, Pablo Riera, and Luciana Ferrer. 2021. Emotion recognition from speech using wav2vec 2.0 embeddings. arXiv preprint arXiv:2104.03502.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al. 2021. SUPERB: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051.

Cross-Level Semantic Similarity in Newswire Texts and Software Code Comments: Insights from Serbian Data in the AVANTES Project
Maja Miličević Petrović,* Vuk Batanović,† Radoslava Trnavac,‡ Borko Kovačević‡
* Department of Interpreting and Translation, University of Bologna, Corso della Repubblica 136, 47121 Forlì, maja.milicevic2@unibo.it
† Innovation Center of the School of Electrical Engineering, University of Belgrade, Bulevar kralja Aleksandra 73, 11120 Belgrade, vuk.batanovic@ic.etf.bg.ac.rs
‡ Faculty of Philology, University of Belgrade, Studentski trg 3, 11000 Belgrade, radoslava.trnavac@fil.bg.ac.rs, borko.kovacevic@fil.bg.ac.rs

Abstract
This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development – AVANTES, intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the level of semantic overlap between texts of different lengths, and it also refers to the problem of establishing such a measure automatically. The problem was first formulated about a decade ago, but research on it has been sparse and limited to English. The AVANTES project aims to change this through the study of CLSS in Serbian, focusing on two different text domains – newswire and software code comments – and on two text length combinations – phrase-sentence and sentence-paragraph.
We present and compare two newly created datasets, describing the process of their annotation with fine-grained semantic similarity scores, and outlining a preliminary linguistic analysis. We also give an overview of the ongoing detailed linguistic annotation targeted at detecting the core linguistic indicators of CLSS.

1. Introduction
One of the central meaning-related tasks in Natural Language Processing (NLP) is Semantic Textual Similarity (STS; Agirre et al., 2012). The goal of STS is to establish the extent to which the meanings of two short texts are similar to each other, which is typically encoded as a numerical score on a Likert scale. The similarity scores can subsequently be used in more complex tasks, such as Question Answering (Risch et al., 2021) or Text Summarisation (Mnasri et al., 2017).
In the related task of Cross-Level Semantic Similarity (CLSS) the goal is to contrast texts of non-matching size, such as a phrase and a sentence, or a sentence and a paragraph. CLSS was first formulated as a SemEval shared task by Jurgens et al. (2014), who saw it as a generalisation of STS to items of different lengths. Clearly, the length discrepancy brings an additional level of complexity, as longer texts tend to carry a greater amount of salient information than shorter texts, so CLSS can be understood as aiming to measure how well the meaning of the longer text is summarised in the shorter one.
Previous work on CLSS has generally been sparse and, to the best of our knowledge, focused entirely on English. In addition, there is a large discrepancy between the NLP models, which are based on linguistically opaque text properties, and linguistic analyses of semantic similarity.
The main aim of this paper is to describe the first non-English annotated CLSS datasets, CLSS.news.sr and CLSS.codecomments.sr, developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development – AVANTES. Both datasets comprise phrase-sentence and sentence-paragraph text pairs in Serbian and both are (being) manually annotated for CLSS. After providing some background, we describe the dataset creation and CLSS annotation, outline a preliminary linguistic analysis, and explain how the linguistic properties identified as relevant for recognising different similarity levels are being annotated further, with a view to improving linguistic descriptions of semantic similarity and testing linguistically informed NLP models.

2. Related work
Previous studies of CLSS are few. The NLP task was introduced by Jurgens et al. (2014, 2016), who provided the first annotated datasets for English, composed of text pairs of different lengths (paragraph to sentence, sentence to phrase, phrase to word, and word to sense), in genres including newswire, travel, scientific, review, and others. The initial datasets were re-used in subsequent work on developing and evaluating CLSS methods at different specific levels (e.g., Rekabsaz et al., 2017 for sentence to paragraph), or regardless of text length (e.g., Pilehvar and Navigli, 2015). Among related tasks, Conforti et al. (2018) dealt with the problem of cross-level stance detection, where the stance target is a sentence, and the text to be evaluated is a long document.
In Serbian, previous work on semantic similarity has been relatively limited. Batanović et al. (2011) and Furlan et al. (2013) introduced paraphrase.sr, a corpus of Serbian newswire texts manually annotated with binary similarity judgments; they also used it to train and evaluate several paraphrase identification approaches. Batanović et al. (2018) extended this dataset with fine-grained similarity scores, using the resulting STS.news.sr corpus to compare several automatic models. Finally, Batanović (2020) showed that multilingual pre-trained models such as multilingual BERT (Devlin et al., 2019) outperform all traditional methods, while Batanović (2021) obtained even better results using BERT's counterpart for Serbian and other closely related languages, BERTić (Ljubešić and Lauc, 2021).
In terms of linguistic analysis, semantic similarity is not systematically defined and described, and the contributing phenomena tend to be explored in isolation from each other (e.g., synonymy in lexical semantics, diathesis alternations in morphosyntax). A somewhat more integrated approach is found with regard to the neighbouring notion of paraphrase, intended as a relation of (near-)equivalence of meaning between phrases and/or sentences (Mel'čuk, 2012: 46), i.e. as an instance of high semantic similarity (albeit a non-symmetrical one). According to Milićević (2007), paraphrases can be of different types based on the nature of the information that underlies equivalence (linguistic vs. extra-linguistic), the level of linguistic representation involved (morphology, lexicon, semantics, syntax), and the depth of the relation. A detailed typology of the changes involved in paraphrase has been proposed by Vila Rigat (2013) and Vila et al. (2014) in view of the NLP task of automatic paraphrase detection. This typology combines several criteria and multiple levels of granularity into a taxonomy that will be presented in more detail in Section 4.2, as the basis for our linguistic analysis of CLSS.

3. Datasets and CLSS annotation
The corpora of phrase-sentence and sentence-paragraph text pairs presented in this paper are developed within the AVANTES project. The aim of this project is to support the analysis of correspondences between blocks of source code, written in a programming language, with an analysis of the level of semantic similarity between their respective documentation comments, written in a natural language (English or Serbian), with the goal of detecting code similarity and clones. A CLSS setup is highly appropriate for the textual similarity task due to arbitrary comment length, which can range from single words to phrases, sentences and entire paragraphs. Since the language used in comments is known to diverge from the standard language, for instance in being syntactically incomplete (Zemankova and Eastman, 1980), we add to our study setup CLSS in the standard language, choosing newswire texts as its representative.
In the context of the project, comparative analyses are planned both between text domains and between languages. For this reason, it was important to establish a common methodology for the creation and annotation of datasets. Since the only pre-existing CLSS dataset was the SemEval one for English, we adopted the approach of Jurgens et al. (2014) as a (partial) model for our work. We retained their five-point similarity scale, with scores ranging from 0 to 4, as well as their definitions for each score: 0 – unrelated, 1 – slightly related, 2 – somewhat related but not similar, 3 – somewhat similar, 4 – very similar. However, we altered the method of text pair construction. Namely, while Jurgens et al. (2014) provided annotators with a longer text and asked them to generate a shorter one with a designated similarity score in mind, we pre-prepared numerous text samples of different lengths (phrases, sentences, and paragraphs), and asked the annotators to combine these texts into phrase-sentence and sentence-paragraph pairs, aiming for a balanced score distribution for the pairs they construct. The main motivation for this choice was that the generation of texts by annotators would have been very difficult to implement in the domain of source code comments, given the highly technical and often project-specific terminology encountered in them. At the same time, our approach prevented a potential paraphrasing bias that the annotators could inadvertently introduce.

3.1. CLSS.news.sr
The initial texts for the CLSS.news.sr dataset were obtained from the Serbian news aggregator website naslovi.net. This website provides a headline and an introductory paragraph for each news report; a subhead is frequently included too. We treated the headlines as source material for phrases, subheads as source material for sentences, and introductory paragraphs as source material for paragraphs for our corpus, exploiting the journalistic convention that the beginning sections of an article commonly provide a summary of its content; our approach was the same one used in the construction of multiple other newswire STS and paraphrasing corpora (Dolan et al., 2004). Since news items are commonly reported differently by different media outlets, cross-linking the texts of different reports allowed for the creation of text pairs with varying degrees of semantic similarity. Close to 18,000 news reports, published between June and August 2021, were scraped using the scrapy Python library,1 to ensure the annotators had a sufficient quantity of raw text available for creating adequate pairs. To ensure comparability with the SemEval dataset, our target dataset size was 1,000 phrase-sentence and 1,000 sentence-paragraph pairs.
The construction of the 2,000 text pairs was divided between five annotators, who were either trained linguists or had previous experience with text annotation for the closely related STS task. Even though they received text samples pre-classified based on length, they were instructed to evaluate whether an item in a certain category really was a phrase, a sentence, or a paragraph, and were allowed to change the categorisation. Paragraphs were defined as text containing a minimum of two sentences (where only complete sentences were to be taken into account). A sentence had to contain at least one finite verb form, whereas a phrase was not allowed to contain finite verbs (non-finite forms such as infinitives and participles were allowed, as were deverbal nouns).
The annotators were provided with the similarity score definitions and SemEval examples to help them interpret each score. Since these examples proved insufficient to ensure high annotation consistency, the outputs were calibrated by having all annotators create a smaller set of five to six representative pairs for each similarity score and each length pairing. These pairs were reviewed by project researchers and feedback was provided regarding any issues encountered. The following step was the compilation of a detailed set of examples, three per similarity score and length pairing, using the agreed-upon representative pairs from all annotators. This set, the score definitions and general instructions became an integral part of the final annotation guidelines for our task, available in the dataset repository in Serbian (original) and English (translation).2 A subset of examples is shown in Table 1.
The annotators were subsequently asked to construct a total of 200 pairs for each text length combination, trying to include both pairs clearly corresponding to a specific score, and less clear-cut ones. The resulting 2,000 cross-level text pairs were labelled with semantic similarity scores by all five annotators, using the STSAnno tool (Batanović et al., 2018). The final score for each pair was calculated by averaging the scores of all individual annotators. Obtaining multiple parallel annotations and averaging them out was chosen instead of relying on an adjudicated double annotation (used for the SemEval dataset), in order to minimise individual annotators' biases.
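The scoring procedure just described, averaging five parallel annotations per pair and checking each annotator against the mean of the others, can be sketched in a few lines. A minimal sketch; the function names are ours, and only Pearson's r is shown (the paper additionally reports Krippendorff's alpha and Spearman's rho):

```python
from statistics import mean

def final_scores(annotations):
    """Average parallel annotations into one score per pair.

    annotations: list of per-annotator score lists of equal length
    (one score per text pair, on the 0-4 scale used in the paper).
    """
    return [mean(scores) for scores in zip(*annotations)]

def leave_one_out_pearson(annotations):
    """For each annotator, correlate their scores with the mean of the others'."""
    def pearson(x, y):
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)
    results = []
    for i, own in enumerate(annotations):
        others = [s for j, s in enumerate(annotations) if j != i]
        others_mean = [mean(col) for col in zip(*others)]
        results.append(pearson(own, others_mean))
    return results
```

In practice one would use scipy.stats or a dedicated agreement library instead of the hand-rolled correlation; the sketch only illustrates the leave-one-out scheme.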
In addition, while Jurgens et al. (2014) allowed finer-grained score distinctions using multiples of 0.25, in our setup with five annotators this was not necessary.

1 http://scrapy.org/
2 http://vukbatanovic.github.io/CLSS.news.sr/

Score 4:
Veliki požar na železničkoj stanici u Londonu (A large fire at a London railway station)
Veliki požar izbio je danas na metro stanici u centralnom delu Londona. (A large fire broke out today at an underground station in central London.)
Score 3:
Novi nacionalni praznik: Džuntint (A new national holiday: Juneteenth)
Američki Kongres usvojio je predlog zakona prema kojem je 19. jun proglašen praznikom u znak sećanja na kraj ropstva i odlazak poslednjih robova 1865. godine u državi Teksas. (The American Congress passed a draft law declaring 19 June a holiday to commemorate the end of slavery and the liberation of the last slaves in 1865 in the state of Texas.)
Score 2:
Veliki problem za Portugal (A major problem for Portugal)
Loše vesti stižu za Portugal pred start Evropskog prvenstva. (Bad news arrives for Portugal just before the start of the European Championship.)
Score 1:
Svađa pred svadbu (A pre-wedding argument)
Mirko Šijan i Bojana Rodić uskoro očekuju svoje prvo dete, a uveliko se sprema i njihova svadba. (Mirko Šijan and Bojana Rodić are expecting their first child soon, and their wedding is being prepared.)
Score 0:
Otvaranje silosa u Zrenjaninu (A silo opening in Zrenjanin)
Maja Žeželj, voditeljka, ispričala je kako je svojevremeno jedva izvukla živu glavu. (Maja Žeželj, TV presenter, told the story of how some time ago she nearly died.)

Table 1: Guideline examples of phrase-sentence pairs in the newswire dataset for each similarity score.

The final CLSS.news.sr dataset comprises 30 thousand tokens in the phrase-sentence subset, and 86 thousand tokens in the sentence-paragraph subset. The average sentence length is ~22 tokens in the sentence-paragraph pairs and ~23 tokens in the phrase-sentence ones. The average phrase length is ~6 tokens, while the average paragraph length is ~64 tokens. The average similarity scores are close to the scale's mean value of 2: 1.91 in the sentence-paragraph subset, and 1.96 in the phrase-sentence subset. The distribution of different scores is fairly uniform, especially for the phrase-sentence pairs; the peaks include a marked one around 0, and a less evident one around 3.
The annotation (self-)agreement levels are very high. For the phrase-sentence subset, the agreement between each annotator and the mean of the other annotators' scores yields a Krippendorff's alpha coefficient of α = 0.929, while the Pearson and the Spearman correlation coefficients are equal, r = ρ = 0.938. In the case of sentence-paragraph pairs these values are α = 0.922, r = 0.937 and ρ = 0.934. More details and a comparison with the English SemEval dataset are reported in Batanović and Miličević Petrović (2022).

3.2. CLSS.codecomments.sr
A particularly innovative part of the work conducted in the AVANTES project is the creation of a corpus of software code comments, to be made publicly available for download and use in testing NLP models once the annotation of semantic similarity is completed. The sources that the code comment dataset was drawn from include public repositories such as GitHub, student projects, coursework and teaching materials from various computing courses at the School of Electrical Engineering of the University of Belgrade and other academic institutions in Serbia, as well as software projects developed at the Computing Center of the School of Electrical Engineering. In order to prevent our work from being focused on the specificities of a single programming language or programming paradigm, we opted to collect comments from eight programming languages: C, C++, C#, Java, JavaScript/TypeScript, MATLAB, Python, and SQL.
We focused on manually pre-selecting only those code comments that describe the functionality of particular sections of code, ranging from individual code lines, to methods and functions, to classes and entire modules. To do so, we relied on a newly designed taxonomy for differentiating between types of code comments (Kostić et al., 2022), which includes the following code comment categories: Code, Functional-Inline, Functional-Method, Functional-Module, General, IDE, Notice and ToDo. The initial data collection and pre-selection were performed by master's degree students at the School of Electrical Engineering of the University of Belgrade, as part of their course project for the Natural Language Processing course. In total, after all duplicate entries were removed, 9,395 code comments belonging to the Functional categories were identified. These include 6,455 Functional-Inline comments, which describe the functionality of individual code lines or code passages, 1,829 Functional-Method comments, which address the functionality of functions and class methods, and 1,111 Functional-Module comments, which are related to the functionality of entire code modules and classes.
In order to construct text pairs, the comments were first roughly divided into candidates for phrases, sentences, and paragraphs on the basis of a set of heuristics. Using whitespace tokenisation, we treated all texts with up to six tokens as candidates for phrases. All texts containing more than six tokens, but limited to a single sentence, were treated as candidates for sentences, while those with more than one sentence were considered paragraph candidates. The number of sentences was determined using a regular expression that treated question marks, exclamation marks, and periods outside of URLs and decimal numbers as sentence boundaries. Using this procedure, the text set was divided into 4,880 phrase candidates, 3,592 sentence candidates, and 923 paragraph candidates.
Due to the high domain specificity of code comments, we entrusted the creation of CLSS pairs to two experienced programmers. They used the provided candidate texts to form the pairs, but were instructed to carefully evaluate whether each sample truly belonged to its automatically assigned length grouping. Such an evaluation was necessary because complete standard sentences and paragraphs were rarely encountered in the data. Instead, we found that despite having a sentence-like function in the comment, many texts are not true sentences in the linguistic sense – they do not follow any punctuation rules and they lack a predicate, or possess it only implicitly (e.g., @author Tim 2 or Naziv komponente 'Component name' within a paragraph item).
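The length-based pre-classification described above can be approximated in a few lines. A minimal sketch; the exact regular expression used by the authors is not published, so the pattern below is our assumption, as is the function name:

```python
import re

# Our approximation of the boundary rule: count . ! ? as sentence boundaries,
# ignoring periods inside URLs and decimal numbers.
_URL = re.compile(r'\S+://\S+')
_BOUNDARY = re.compile(r'(?<!\d)[.!?](?!\d)')

def length_category(text, max_phrase_tokens=6):
    """Roughly classify a comment as a phrase, sentence, or paragraph candidate."""
    tokens = text.split()                      # whitespace tokenisation
    no_urls = _URL.sub(' ', text)              # drop URLs before boundary counting
    sentences = len(_BOUNDARY.findall(no_urls)) or 1
    if len(tokens) <= max_phrase_tokens:
        return "phrase"
    return "sentence" if sentences == 1 else "paragraph"
```

A rule of this kind produced the 4,880/3,592/923 split reported above; our sketch is only an approximation and would not reproduce those numbers exactly.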
Similarly, paragraphs in the code comment domain are often separated into units not via standard punctuation, but rather by using visual boundaries, such as moving to a new line in the source file, or (repeatedly) using special characters (e.g., * or ###). Limiting our text selection to a rigid definition of sentences and paragraphs would thus not only have reduced the size of the dataset, but it would also have led to the exclusion of numerous domain-specific phenomena, significantly impacting our linguistic analyses of code comments. We therefore decided to count as paragraphs texts consisting of at least two clearly identifiable units, even if those units were not true sentences. Similarly, we expanded the sentence set with texts containing an implicit predicate, as well as with those containing subordinate clauses without a main clause (e.g., relative clauses such as: Metode koje se odnose na simulaciju procesa 'Methods that refer to process simulation').
This allowed us to construct a code comment dataset of the same size as CLSS.news.sr. The CLSS.codecomments.sr dataset therefore includes 1,000 phrase-sentence pairs, comprising 14 thousand tokens, and 1,000 sentence-paragraph pairs, comprising 39 thousand tokens. The average sentence length is ~10 tokens in both the sentence-paragraph and the phrase-sentence pairs. The average phrase length is ~3 tokens, while the average paragraph length is ~29 tokens. Overall, the code comments are approximately half the length of the newswire text items.
Although our initial aim was again to construct a dataset balanced across the range of similarity scores, this proved to be impossible with our selection of source texts, since they pertained to a wide range of programming projects with different purposes and implemented using diverse programming paradigms and languages. This made the construction of pairs with high similarity scores very problematic. We therefore abandoned the goal of obtaining a balanced score distribution, but still instructed the programmers to compile as many highly similar pairs as possible with the given source content. Each programmer was tasked with the construction and scoring of 500 pairs of each length.
The similarity scoring of the text pairs was performed on the basis of guidelines similar to the ones used in the newswire domain, but with a new set of three examples per score and length pairing, drawn from the code comment domain; a subset of phrase-sentence pair examples is shown in Table 2. After the code comment text pairs were constructed, they were forwarded to the same annotators who worked on the CLSS.news.sr dataset, in order to obtain multiple parallel annotations. Since this work is still in progress, our linguistic analyses of CLSS.codecomments.sr in this paper will be based on the individual similarity scores assigned by the two programmers who constructed the text pairs.

Score 4:
Računanje površine pravougaonika (Calculating the area of a rectangle)
Površina pravougaonika po formuli je a * b (The area of a rectangle according to the formula is a * b)
Score 3:
POMOCNA FUNKCIJA (AUXILIARY FUNCTION)
Fajl koji pruza pomocne funkcije (A file that provides auxiliary functions)
Score 2:
ubrzano kretanje (accelerated movement)
Zelimo da se ogranicimo od mogucnosti da se ubrzano krece. (We want to limit the possibility of accelerated movement.)
Score 1:
Update dokumenta (Document update)
Ovaj program formira html dokument (This program forms an html document)
Score 0:
izracunavanje faktorijela (calculating the factorial)
Azurira rotaciju kamere preko pomeraja misa (Updates the camera rotation via mouse movement)

Table 2: Guideline examples of phrase-sentence pairs in the code comment dataset for each similarity score.

4. Linguistic analysis
The NLP algorithms used in the automatic treatment of semantic similarity rely on different types of information, including linguistic features. While state-of-the-art models such as multilingual BERT and BERTić reach performances that correlate highly with human scores, with coefficients r, ρ > 0.9 for CLSS on Serbian newswire texts (Batanović and Miličević Petrović, 2022), they lack linguistic transparency and are of limited help in understanding the relative contributions of different levels of language structure and different specific features. Since one of the aims of the AVANTES project is to combine NLP with linguistic knowledge, we conduct two types of linguistic analyses on the datasets. A preliminary qualitative analysis is performed to gain initial insight into the data and help decide on the specifics of the detailed annotation of semantic similarity indicators (to be followed by a quantitative analysis of the annotated datasets).

4.1. A qualitative overview
A qualitative linguistic analysis was performed on a random sample of ten text pairs per score, for both CLSS.news.sr and CLSS.codecomments.sr, and for both phrase-sentence and sentence-paragraph pairs. In the case of newswire texts, items that received the same score from all annotators were selected; an approach focused on clear-cut cases was deemed useful as a first step in the analysis given its goals of verifying both the linguistic relevance of the similarity scores and the taxonomy for more detailed linguistic annotation. For comments, the initial scores assigned by the programmers were used for selection. The analysis consisted in a comparison of information content between the pairs' components, as well as a study of vocabulary overlaps (or lack thereof). Its goal was to get an initial grasp of the data and help define a taxonomy on which to base a more elaborate analysis.
For both corpora and both types of comparisons, the pairs marked 4 are characterised by the occurrence of the same distinctive vocabulary items: personal names and/or numbers (newswire), or specialised terms (comments). The form is often not identical, but the items involved are clearly relatable on morphological grounds (e.g., they are inflectional forms of the same noun, as in Kragujevcu.LOC – Kragujevca.GEN 'Kragujevac', parametre.ACC – parametrima.INS 'parameters', or a noun and a denominal adjective, as in Vlasotincu.N – vlasotinačkom.ADJ '(of) Vlasotince').3 The shared numbers are mostly large and either quite specific or used in a collocation (e.g., 100.620, or 3.000 dinara '3000 dinars'). Overlaps in common lexical words are also frequently based on morphologically related rather than identical forms (e.g., stiglo.PAST.PART – stići.INF 'arrive', novozaraženih 'newly infected' – novih slučajeva zaraze 'new cases of infection', filtriranje 'filtering' – filtar 'filter'). A number of synonyms are found (potvrda – sertifikat 'certificate', promenljiva – varijabla 'variable'), sometimes involving a Serbian and an English word (mreža – grid 'grid'), and sometimes within different collocations based on the same term (e.g., toplotni talas – talas vrućina 'heat wave', zoom levela – stepena zoom-a 'zoom level'). Overall, most lexical words from the smaller unit are present in the larger one, which also contains other elements that describe the situation in more detail, but without adding entirely new topics (u Londonu 'in London' – u centralnom delu Londona 'in central London'; funkcija sa parametrima 'a function with parameters' – funkcija koja nije f(void), vec prima parametre 'a function that is not f(void), but accepts parameters').
Score 3 items are distinguished by similar properties in terms of shared lexis and especially personal names and […] rekurzivni poziv 'if it has children then we do a recursive call'). The predicate of the sentence item is typically not related to the head noun of the phrase item. The pairs marked 1 and 0 contain barely any overlapping personal names or specialised terms. Score 1 items do share some common lexical words, but synonyms, near-synonyms, and terms from the same wider semantic field are more present than words that are identical or morphologically closely related (e.g., tragedija 'tragedy' – nesreća 'accident', pljuskovi 'showers' – kiša 'rain'). Items marked 0 typically do not share any lexical words.
When it comes to differences between the two corpora, in CLSS.news.sr it is often the case that the relatedness of lexical items in the pair is based on real-world knowledge (largely about something happening at the time of writing) rather than on linguistic information (e.g., vakcinacija 'vaccination' – virus korona 'corona virus', Tokio 'Tokyo' – Olimpijske igre 'Olympic games'), especially in items assigned a score below 3. CLSS.codecomments.sr, on the other hand, is characterised by various non-standard features, such as inconsistent spelling (popup vs. pop-up), missing diacritics (cita for čita 'reads'), inflectional endings on English words inconsistently spelt with/without a dash (zoom-a, workspace-u vs. levela), non-standard abbreviations (f-ja for funkcija 'function'), or phonetic transcription of English terms (eksepšn 'exception').4

4.2. Linguistic annotation
Using the preliminary analysis outlined above and the existing paraphrase typologies (primarily Vila Rigat, 2013; Vila et al., 2014; also Milićević, 2007; Mel'čuk, 2012), we propose a taxonomy of semantic similarity types and indicators, shown and illustrated in Table 3; most examples are taken directly or adapted from our corpora (examples for two clear indicators are omitted to save space). The initial focus is on the nature of the information that similarity is based on, and a core distinction is made between linguistic, quasi-linguistic and extralinguistic similarity types.
This is at the specialised terms, but with entirely new information in the same time one of the main points of divergence between longer item, and/or partly different information in the our approach and the one by Vila Rigat (2013) and Vila et components of the pair, leading to a less marked overall al. (2014), who acknowledge the existence of non-linguistic vocabulary overlap (e.g., Neuralna mreza ‘neural network’ paraphrase, but do not include it in their core typology; we – vanila neuralna mreza koja se obucava pomocu genetskog rely on Milićević (2007) and Mel’čuk (2012) for these types. algoritma ‘vanilla neural network which is trained via a Another difference with respect to previous work is that our genetic algorithm’). Near-synonyms appear to be more taxonomy makes reference to similarity indicators, while common in score 3 pairs ( reč ‘word’ – termin ‘term’, nov changes are invoked in previous work, due to paraphrase ugovor ‘new contract’ – produžetak saradnje ‘extension of being perceived as involving a source and a target item. collaboration’). In both score 4 and score 3 items, the head Linguistic similarity is based on language-internal noun of the phrase tends to appear as the subject or the information at the word/lexical unit level (i.e., the morpho- object of the sentence predicate, or it is a deverbal noun that lexicon), the level of structural organisation, and the level corresponds to the predicate ( unos.N – unosi.V ‘input’). of meaning (i.e., semantics). The first two types have two The predicate is typically the same in sentence-paragraph subtypes each: morphology- and lexicon-based and syntax- pairs, with additional predicates in the paragraph item. and discourse-based indicators respectively; the indicator Among less similar pairs, those marked 2 are somewhat types and subtypes thus follow the classical organisation in mixed, as they either contain different personal names/ formal levels of linguistic analysis. 
Finally, the indicator specialised terms and similar common vocabulary, or vice names in the last column of Table 3 denote specific versa ( Tropski pakao u Beogradu ‘tropical hell in Belgrade’ mechanisms through which semantic similarity is – I sutra će u Novom Sadu biti veoma toplo ‘It will again established. Following Vila et al. (2014), our assumption is be very warm in Novi Sad tomorrow’; prekid rekurzije that the indicators reveal what triggers semantic similarity ‘interruption of recursion’ – ako ima decu onda idemo at the micro level. In other words, unlike the similarity 3 Abbreviations used: LOC – locative; GEN – genitive; ACC – 4 accusative; INS – instrumental; ADJ – adjective; N – noun, Many of the features found in code comments are shared with PAST.PART – past participle; INF – infinitive; V – verb. computer-mediated communication in Serbian (see Miličević Petrović et al., 2017). PRISPEVKI 128 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 scores assigned to pairs of items as wholes (i.e., to entire Looking more closely at the indicator subtypes, phrases, sentences, or paragraphs), the linguistic taxonomy morphology-based indicators concern the morphological targets individual phenomena that cumulatively contribute form of words, capturing complete equivalence, as well as to the overall score, where such individual elements are not inflectional and derivational relations, i.e. different forms mutually exclusive and several can be co-present. 
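The vocabulary-overlap part of the qualitative comparison in Section 4.1 can be illustrated with a simple token-level measure. The sketch below is ours, not part of the project's tooling (the `lexical_overlap` helper, the tokenisation, and the example pair are illustrative assumptions):

```python
# Illustrative token-level overlap measure, in the spirit of the
# qualitative vocabulary-overlap comparison in Section 4.1.
# The helper name, tokenisation and example pair are our own
# assumptions, not part of the AVANTES tooling.

def lexical_overlap(a, b):
    """Jaccard overlap between the lowercased token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# A score-4-style pair shares its distinctive vocabulary:
phrase = "povrsina pravougaonika"
sentence = "povrsina pravougaonika po formuli je a * b"
print(lexical_overlap(phrase, sentence))  # → 0.25
```

A surface measure of this kind deliberately ignores the morphologically related but non-identical forms discussed above (e.g., parametre – parametrima), which would require lemmatisation to be captured; this is precisely why the qualitative analysis goes beyond identical tokens.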
Similarity type / Indicator type / Indicator subtype / Indicator (example):

Linguistic
  Morpholexicon-based
    Morphology-based
      - Identical (požar – požar ‘fire’)
      - Inflectional (parametre.ACC – parametrima.INS ‘parameters’)
      - Derivational (Vlasotincu.N – vlasotinačkom.ADJ ‘(of) Vlasotince’)
    Lexicon-based
      - Spelling and format (pop-up – popup)
      - Synthetic/analytic (novozaraženih ‘newly infected’ – novih slučajeva zaraze ‘new cases of infection’)
      - Same polarity
        -- Synonymy (potvrda – sertifikat ‘certificate’)
        -- Near-synonymy (reč ‘word’ – termin ‘term’)
        -- Hyponymy (škoda ‘Škoda’ – automobil ‘car’)
        -- Meronymy (Vašington ‘Washington’ – SAD ‘USA’)
      - Opposite polarity (izgubio ‘lost’ – nije uspeo da pobedi ‘failed to win’)
      - Converse (pogibija dva pešaka ‘death of two pedestrians’ – usmrtio pešake ‘killed the pedestrians’)
  Structure-based
    Syntax-based
      - Diathesis alternations (opljačkali su stan ‘robbed the flat’ – stan je opljačkan ‘the flat was robbed’)
      - Coordination changes
      - Subordination and nesting changes
    Discourse-based
      - Punctuation (Potpis dana - Aleksandar Kolarov! ‘Signature of the day - Aleksandar Kolarov!’ – Aleksandar Kolarov potpisao novi ugovor ‘Aleksandar Kolarov signed a new contract’)
      - Direct/indirect style (Bilčik ocenjuje da vežbe ne pomažu ‘Bilčík states that the military exercises do not help’ – Bilčik ukazuje da vesti o vežbi “nisu od pomoći” ‘Bilčík points out that the news of a military exercise “is not helpful”’)
      - Sentence modality (maske više nisu obavezne? ‘masks no longer compulsory?’ – neće biti obavezne zaštitne maske ‘protective masks will not be compulsory’)
  Semantics-based (Tropski pakao ‘tropical hell’ – biti veoma toplo ‘be very warm’)
  Miscellaneous
    - Change of order (klasa singleton – Singleton patern ‘singleton class/pattern’)
    - Addition/deletion (funkcija za sortiranje ‘sorting function’ – metoda koja sortira uzetu matricu ‘the method that sorts the given matrix’)
Quasi-linguistic
  Pragmatic (Scattered showers are very likely – Bring your umbrella; Mel’čuk, 2012: 60)
Extralinguistic
  Situational (Besplatno kroz Severnu Makedoniju od danas ‘Free travel through North Macedonia from today’ – Novina od 15. juna ‘New rules from 15 June’)
  Encyclopaedic (Italija ‘Italy (the team)’ – ekipa sa Apenina ‘the team from the Apennine Mountains’)
  Logical (Još pola dinara za veknu hleba ‘Half a dinar more for a loaf of bread’ – Cena hleba visa za 20% ‘The price of bread higher by 20%’; Milićević, 2007: 145)

Table 3: Overview of the taxonomy of semantic similarity (the examples are drawn from CLSS.news.sr/CLSS.codecomments.sr, or from the literature).

The identical indicator is not present under the morphology heading in Vila Rigat (2013) and Vila et al. (2014), who categorise it as a “paraphrase extreme”, a special type in their taxonomy capturing longer chunks of text; we add it based on the preliminary analysis presented in Section 4.1, which revealed that identical individual words are common in highly similar items in CLSS. Additional information that could prove useful concerns parts of speech, the distinction between personal and common nouns, as well as information on general vs. specialised vocabulary. Given that the identification of specialised terminology would require work that goes beyond the scope of the current project, we are still evaluating the possibility of including it in the analysis.

Lexicon-based indicators are somewhat more varied, ranging from different spellings of the same words, to synthetic and analytic expressions of the same meaning, and to lexical semantic relations in the narrow sense. Same polarity items constitute the most complex group of lexical relations, comprising synonymy as a similarity relation par excellence, near-synonymy, hyponymy (the relationship between superordinate/more general and subordinate/more specific lexical items), and meronymy (a part-whole relation). Opposite polarity relations are based on antonym pairs with opposite comparative words, or with one of the components negated. Finally, a converse relation captures complementary actions whose arguments are inversed.

Syntax-based indicators capture those relations that imply a syntactic reorganisation in the sentence; they can be found within single sentences, or in the way multiple sentences are connected. Specific cases include instances of diathesis alternations (such as the active/passive alternation), coordination (where coordinated units are present in one member of the pair, but not in the other), and subordination or nesting (where subordinate/nested elements are present in only one item). The second subtype of structural changes, discourse-based indicators, do not affect the sentential arguments, but are instead related to elements such as punctuation and formatting (beyond single lexical units), affirmative vs. interrogative sentence modality, and direct vs. indirect speech.

The semantics-based subtype is also distinguished by going beyond the level of individual lexical items, as it concerns phrase/sentence-level meaning. No subtypes or specific indicators are singled out, as this level of analysis refers generally to the distribution of semantic content across lexical units, and it can involve multiple and varied formal changes that lead to different lexicalisations of the same meaning units. The boundaries between semantics-based and lexicon-based similarity indicators are not always clear-cut, but it is generally the case that lexicon-based indicators concern individual words or multiword units, while semantics-based similarity relies on multiple lexical items.

The last type of linguistic indicators is classified as miscellaneous, given that it captures phenomena that do concern the linguistic structure of items, but do not clearly belong to a single level of linguistic analysis. Change of order and addition/deletion are found here as specific indicator types, the former involving units with the same content expressed using different word orders, and the latter based on added or omitted information. Both indicators concern at least syntax and discourse; given the cross-level setup, the latter is particularly important for our datasets.

Beyond the linguistic structure, the quasi-linguistic domain captures inference-based similarity that relies on pragmatic information. The core linguistic meanings and the extralinguistic referents are different in this case, but the meaning of one element in the pair can still be inferred from the meaning of the other. Given the nature of our texts, this type of similarity is expected to be infrequent, and we have so far not identified any examples; however, we leave this category in our taxonomy to possibly be applied in the annotation phase. The extralinguistic domain also entails inequality of linguistic meaning, but it involves information equivalence between two texts, i.e. reference to the same real-world situation. It requires knowledge external to language for similarity to be recognised; this knowledge can be situational (containing elements such as today or here), encyclopaedic (involving general knowledge), or logical (requiring calculations or other similar operations). Based on the initial analyses of our datasets, this is a common type of similarity, especially in newswire texts.

Keeping the above definitions in mind, the outlined taxonomy will be applied to the CLSS.news.sr and CLSS.codecomments.sr corpora. Detailed guidelines are currently being developed, and the texts (initially from CLSS.news.sr) are being prepared for word/segment-level annotation with semantic similarity indicators, within the identified pairs. The annotation will be performed by the project researchers, first as a double procedure on a smaller sample, and then individually once a satisfactory level of agreement is reached. The initial phase will at the same time enable us to verify the appropriateness of the taxonomy, and adapt it should the need arise. The annotated datasets will be used for empirically validating the taxonomy, for gaining a better understanding of the linguistic factors that carry the most weight in cross-level semantic similarity in different text genres, and for learning how this kind of information can be taken into account in NLP models. Based on previous work on paraphrase and a preliminary exploration of our data at text level (with entire pairs marked for indicator presence/absence), morphological indicators, addition/deletion and same polarity items are expected to be particularly prominent.
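For the planned word/segment-level annotation, a taxonomy such as the one in Table 3 can be encoded as a simple nested structure against which labels are validated. The sketch below is a hypothetical encoding of part of the taxonomy (the dictionary layout and the `is_valid_indicator` helper are our own assumptions, not the project's annotation tooling; the intermediate morpholexicon-/structure-based grouping is flattened for brevity):

```python
# Hypothetical encoding of (part of) the Table 3 taxonomy as a nested
# dictionary: similarity type -> indicator (sub)type -> named indicators.
# This is our own sketch, not the project's annotation tooling.
TAXONOMY = {
    "linguistic": {
        "morphology-based": ["identical", "inflectional", "derivational"],
        "lexicon-based": ["spelling and format", "synthetic/analytic",
                          "synonymy", "near-synonymy", "hyponymy",
                          "meronymy", "opposite polarity", "converse"],
        "syntax-based": ["diathesis alternations", "coordination changes",
                         "subordination and nesting changes"],
        "discourse-based": ["punctuation", "direct/indirect style",
                            "sentence modality"],
        "semantics-based": [],   # no named sub-indicators
        "miscellaneous": ["change of order", "addition/deletion"],
    },
    "quasi-linguistic": {"pragmatic": []},
    "extralinguistic": {"situational": [], "encyclopaedic": [], "logical": []},
}

def is_valid_indicator(sim_type, ind_type, indicator=None):
    """Check that an annotation label exists in the taxonomy."""
    indicators = TAXONOMY.get(sim_type, {}).get(ind_type)
    if indicators is None:
        return False
    # Types without named indicators (e.g. semantics-based) take no label.
    return indicator in indicators if indicators else indicator is None

print(is_valid_indicator("linguistic", "morphology-based", "inflectional"))  # → True
```

A validator of this kind could guard the annotation files against labels that fall outside the taxonomy, while leaving the taxonomy itself easy to adapt should the need arise.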
5. Concluding remarks

In this paper, we have described the first non-English CLSS corpora, CLSS.news.sr and CLSS.codecomments.sr. The focus was on the methodology used to construct and annotate the data, as well as on their initial linguistic analysis. We believe these two datasets to be an important resource for Cross-Level Semantic Similarity research, not only in virtue of representing a new language, but also due to introducing an underexplored text genre (source code comments), and due to dedicating substantial attention to the linguistic properties of the datasets.

Our planned next steps are to complete the CLSS annotation of code comments, implement the proposed linguistic taxonomy of semantic similarity in the annotation of both datasets, conduct a more extensive linguistic analysis based on the annotated data, and examine the impact of linguistic traits on the performances of automatic CLSS models. Another goal is to compare the results to those obtained on similar datasets for English, using the SemEval dataset for newswire, and our own dataset (which is currently being created) for source code comments.

6. Acknowledgements

The AVANTES project (Advancing Novel Textual Similarity-based Solutions in Software Development) is supported by the Science Fund of the Republic of Serbia, grant no. 6526093, within the “Program for Development of Projects in the Field of Artificial Intelligence”. The authors would like to thank Jelica Cincović and Dušan Stojković for constructing the code comment text pairs, as well as Bojan Jakovljević, Lazar Milić, Marija Lazarević, Ognjen Krešić, and Vanja Miljković for annotating the corpora with semantic similarity scores.

7. References

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 385–393, Montreal, Canada. Association for Computational Linguistics.
Vuk Batanović. 2020. A Methodology for Solving Semantic Tasks in the Processing of Short Texts Written in Natural Languages with Limited Resources. Ph.D. thesis, University of Belgrade.
Vuk Batanović. 2021. Semantic similarity and sentiment analysis of short texts in Serbian. In: Proceedings of the 29th Telecommunications forum (TELFOR 2021), Belgrade, Serbia. IEEE.
Vuk Batanović and Maja Miličević Petrović. 2022. Cross-Level Semantic Similarity for Serbian Newswire Texts. In: Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. European Language Resources Association.
Vuk Batanović, Miloš Cvetanović, and Boško Nikolić. 2018. Fine-grained Semantic Textual Similarity for Serbian. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1370–78, Miyazaki, Japan. European Language Resources Association.
Vuk Batanović, Bojan Furlan, and Boško Nikolić. 2011. A software system for determining the semantic similarity of short texts in Serbian. In: Proceedings of the 19th Telecommunications forum (TELFOR 2011), pages 1249–52, Belgrade, Serbia. IEEE.
Costanza Conforti, Mohammad Taher Pilehvar, and Nigel Collier. 2018. Towards automatic fake news detection: Cross-level stance detection in news articles. In: Proceedings of the First Workshop on Fact Extraction and VERification, pages 40–49, Brussels, Belgium. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT 2019, pages 4171–86, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora. In: Proceedings of the 20th International Conference on Computational Linguistics, pages 350–56, Geneva, Switzerland. Association for Computational Linguistics.
Bojan Furlan, Vuk Batanović, and Boško Nikolić. 2013. Semantic similarity of short texts in languages with a deficient natural language processing support. Decision Support Systems, 55(3):710–19.
David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 Task 3: Cross-Level Semantic Similarity. In: Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014), pages 17–26, Dublin, Ireland. Association for Computational Linguistics.
David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Cross Level Semantic Similarity: An Evaluation Framework for Universal Measures of Similarity. Language Resources and Evaluation, 50(1):5–33.
Marija Kostić, Aleksa Srbljanović, Vuk Batanović, and Boško Nikolić. 2022. Code Comment Classification Taxonomies. In: Proceedings of the Ninth IcETRAN Conference, Novi Pazar, Serbia.
Nikola Ljubešić and Davor Lauc. 2021. BERTić – The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2021), pages 37–42, Kiev, Ukraine. Association for Computational Linguistics.
Igor A. Mel’čuk. 2012. Semantics. From Meaning to Text. John Benjamins, Amsterdam.
Maja Miličević Petrović, Nikola Ljubešić, and Darja Fišer. 2017. Nestandardno zapisivanje srpskog jezika na Tviteru: mnogo buke oko malo odstupanja? Anali Filološkog fakulteta, 29(2):111–36.
Jasmina Milićević. 2007. La paraphrase. Peter Lang, Bern.
Maâli Mnasri, Gaël de Chalendar, and Olivier Ferret. 2017. Taking into account Inter-sentence Similarity for Update Summarization. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 204–209, Taipei, Taiwan. Association for Computational Linguistics.
Mohammad Taher Pilehvar and Roberto Navigli. 2015. From senses to texts: An all-in-one graph-based approach for measuring semantic similarity. Artificial Intelligence, 228:95–128.
Navid Rekabsaz, Ralf Bierig, Mihai Lupu, and Allan Hanbury. 2017. Toward optimized multimodal concept indexing. In: N. Nguyen, R. Kowalczyk, A. Pinto, and J. Cardoso, eds., Transactions on Computational Collective Intelligence XXVI, pages 144–61, Cham. Springer International Publishing.
Julian Risch, Timo Möller, Julian Gutsch, and Malte Pietsch. 2021. Semantic answer similarity for evaluating question answering models. In: Proceedings of the Third Workshop on Machine Reading for Question Answering, pages 149–57, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Marta Vila Rigat. 2013. Paraphrase Scope and Typology. A Data-Driven Approach from Computational Linguistics. Ph.D. thesis, University of Barcelona.
Marta Vila, M. Antonia Martí, and Horacio Rodríguez. 2014. Is this a paraphrase? What kind? Paraphrase boundaries and typology. Open Journal of Modern Linguistics, 4:205–18.
Marie Zemankova and Caroline M. Eastman. 1980. Comparative lexical analysis of FORTRAN code, code comments and English text. In: Proceedings of the 18th Annual Southeast Regional Conference, pages 193–97, Tallahassee, Florida, USA. Association for Computing Machinery.

The ParlaSent-BCS Dataset of Sentiment-annotated Parliamentary Debates from Bosnia and Herzegovina, Croatia, and Serbia

Michal Mochtak,∗ Peter Rupnik,† Nikola Ljubešić†‡

∗Institute of Political Science, University of Luxembourg, 2 avenue de l’Université, L-4366 Esch-sur-Alzette
michal.mochtak@uni.lu
†Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana
peter.rupnik@ijs.si, nikola.ljubesic@ijs.si
‡Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana

Abstract

Expression of sentiment in parliamentary debates is deemed to be significantly different from that on social media or in product reviews.
This paper adds to an emerging body of research on parliamentary debates with a dataset of sentences annotated for detection of sentiment polarity in political discourse using sentence-level data. We sample the sentences for annotation from the proceedings of three Southeast European parliaments: Croatia, Bosnia and Herzegovina, and Serbia. A six-level annotation schema is applied to the data with the aim of training a classification model for the detection of sentiment in parliamentary proceedings. Krippendorff’s alpha measuring the inter-annotator agreement ranges from 0.6 for the six-level annotation schema to 0.75 for the three-level schema and 0.83 for the two-level schema. Our initial experiments on the dataset show that transformer models perform significantly better than those using a simpler architecture. Furthermore, regardless of the similarity of the three languages, we observe differences in performance across different languages. Performing parliament-specific training and evaluation shows that the main reason for the differing performance between parliaments seems to be the different complexity of the automatic classification task, which is not observable in annotator performance. Language distance does not seem to play any role in either annotator or automatic classification performance. We release the dataset and the best-performing models under permissive licences.

1. Introduction

Emotions and sentiment in political discourse are deemed as crucial and influential as the substantive policies promoted by the elected representatives (Young and Soroka, 2012). Since the golden era of research on propaganda (Lasswell, 1927; Shils and Janowitz, 1948), a number of scholars have demonstrated the growing role of emotions in affective polarization in politics, with negative consequences for the stability of democratic institutions and social cohesion (Garrett et al., 2014; Iyengar et al., 2019; Mason, 2015). With the booming popularity of online media, sentiment analysis has become an indispensable tool for understanding the positions of viewers, customers, but also voters (Soler et al., 2012). It has allowed all sorts of entrepreneurs to know their target audience like never before (Ceron et al., 2019). Experts on political communication argue that the way we receive information and how we process it plays an important role in political decision-making, shaping our judgment with strategic consequences both on the level of legislators and the masses (Liu and Lei, 2018). Emotions and sentiment simply do play an important role in political arenas, and politicians have been (ab)using them for decades.

Although there is a general agreement among political scientists that sentiment analysis represents a critical component for understanding political communication in general (Young and Soroka, 2012; Flores, 2017; Tumasjan et al., 2010), empirical applications outside the English-speaking world are still rare (Rauh, 2018; Mohammad, 2021). This is especially the case for studies analyzing political discourse in low-resourced languages, where the lack of out-of-the-box tools creates a huge barrier for social scientists to do such research in the first place (Proksch et al., 2019; Mochtak et al., 2020; Rauh, 2018). The paper, therefore, aims to contribute to the stream of applied research on sentiment analysis in political discourse in low-resourced languages. The goal is to present a new annotated dataset compiled for machine-learning applications focused on the detection of sentiment polarity in the political discourse of three Southeast European (SEE) countries: Bosnia and Herzegovina, Croatia, and Serbia. We further use the dataset to train different classification models for sentiment analysis, applying different schemas and settings to demonstrate the benefits and limitations of the dataset and the trained models. We release the dataset and the best-performing models under permissive licenses to facilitate further research and more empirically oriented projects. In general, the paper, the dataset, and the models contribute to an emerging community of research outputs on parliamentary debates with a focus on sentence-level sentiment annotation, with future downstream applications in mind.
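The inter-annotator agreement figures quoted in the abstract (Krippendorff's alpha between 0.6 and 0.83, depending on the schema) can be computed from raw labels with a short routine. A minimal sketch for nominal data follows; it is our own implementation, not the one used for the reported figures:

```python
# Minimal Krippendorff's alpha for nominal labels (our own sketch,
# not the implementation used for the figures reported in the paper).
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """`units` holds one list of labels per annotated sentence,
    containing the labels assigned by that sentence's annotators."""
    # Coincidence counts of ordered label pairs within each unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # single-rated units carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    n_c = Counter()
    for (a, _), w in coincidences.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in coincidences.items() if a != b)
    d_e = sum(n_c[a] * n_c[b] for a, b in permutations(n_c, 2)) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e

# Two annotators agreeing perfectly on three sentences:
print(krippendorff_alpha_nominal([["pos", "pos"],
                                  ["neg", "neg"],
                                  ["pos", "pos"]]))  # → 1.0
```

Collapsing six-level labels to three or two levels before computing alpha mirrors the 0.6 → 0.75 → 0.83 progression reported above.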
…important role in political arenas and politicians have been (ab)using them for decades.

We release the dataset and the best-performing models under permissive licenses to facilitate further research and more empirically oriented projects. In general, the paper, the dataset, and the models contribute to an emerging community of research outputs on parliamentary debates, with a focus on sentence-level sentiment annotation and with future downstream applications in mind.

PRISPEVKI / PAPERS — Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

2. Dataset construction

2.1. Focus on sentences

The dataset we compile and then use for training different classification models focuses on sentence-level data and utilizes a sentence-centric approach for capturing sentiment polarity. The strategy goes against the tradition in mainstream research applications in the social sciences, which focus either on longer pieces of text (e.g. utterances, "speech segments", or whole documents (Bansal et al., 2008; Thomas et al., 2006)) or on coherent messages of a shorter nature (e.g. tweets (Tumasjan et al., 2010; Flores, 2017)). That approach, however, creates certain limitations when it comes to political debates in national parliaments, where speeches range from very short comments counting only a handful of sentences to long monologues of thousands of words. Moreover, as a longer text may contain a multitude of sentiments, any annotation attempt must generalize over them, introducing a complex coder bias which is embedded in any subsequent analysis. The sentence-centric approach attempts to refocus the attention on individual sentences capturing attitudes, emotions, and sentiment positions, using them as lower-level indices of sentiment polarity in a more complex political narrative. Although sentences cannot capture complex meanings the way paragraphs or whole documents do, they usually carry coherent ideas with a relevant sentiment affinity. This approach stems from a tradition of content analysis in political science which focuses both on political messages and on their role in political discourse in general (Burst et al., 2022; Hutter et al., 2016; Koopmans and Statham, 2006).

Unlike most of the literature, which approaches sentiment analysis in political discourse as a proxy for position-taking stances or as a scaling indicator (Abercrombie and Batista-Navarro, 2020b; Glavaš et al., 2017; Proksch et al., 2019), the general sentence-level classifier we aim for in this paper has a more holistic (and narrower) aim. Rather than focusing on a specific policy or issue area, the task is to assign a correct sentiment category to sentence-level data in political discourse with the highest possible accuracy. Only when a well-performing model exists can a downstream task be discussed. We believe this is a much more versatile approach which opens a wide range of possibilities for understanding the context of political concepts as well as their role in political discourse. Furthermore, sentences as lower semantic units can be aggregated to the level of paragraphs or whole documents, which is often impossible the other way around (document → sentences). Although sentences as the basic level of analysis are less common in social science research when it comes to computational methods (Abercrombie and Batista-Navarro, 2020b), practical applications in other areas exist, covering topics such as the validation of sentiment dictionaries (Rauh, 2018), ethos mining (Duthie and Budzynska, 2018), opinion mining (Naderi and Hirst, 2016), or the detection of sentiment-carrying sentences (Onyimadu et al., 2013).

2.2. Background data

In order to compile a dataset of political sentiment for manual annotation and then use it for training classification models for real-world applications, we sampled sentences from three corpora of parliamentary proceedings in the region of former Yugoslavia: Bosnia and Herzegovina (Mochtak et al., 2022c),1 Croatia (Mochtak et al., 2022a),2 and Serbia (Mochtak et al., 2022b).3 The Bosnian corpus contains speeches collected on the federal level from the official website of the Parliamentary Assembly of Bosnia and Herzegovina (Parlamentarna skupština BiH, 2020). Both chambers are included: the House of Representatives (Predstavnički dom / Zastupnički dom) and the House of Peoples (Dom naroda). The corpus covers the period from 1998 to 2018 (2nd–7th term) and counts 127,713 speeches. The Croatian corpus of parliamentary debates covers debates in the Croatian parliament (Sabor) from 2003 to 2020 (5th–9th term) and counts 481,508 speeches (Hrvatski sabor, 2020). Finally, the Serbian corpus contains 321,103 speeches from the National Assembly of Serbia (Skupština) over the period of 1997 to 2020 (4th–11th term) (Otvoreni Parlament, 2020).

1 https://doi.org/10.5281/zenodo.6517697
2 https://doi.org/10.5281/zenodo.6521372
3 https://doi.org/10.5281/zenodo.6521648

2.3. Data sampling

Each speech was processed using the CLASSLA-Stanza tool (Ljubešić and Dobrovoljc, 2019), with the tokenizers available for Croatian and Serbian, in order to extract individual sentences as the basic unit of our analysis. In the next step, we kept only sentences delivered by actual speakers, excluding the moderators of the parliamentary sessions. All sentences were then merged into one meta dataset. As we want to sample what can be understood as "average sentences", we further subset the sentence meta corpus to only those sentences whose number of tokens falls between the first and third frequency quartile (i.e. within the interquartile range) of the original corpus (∼3.8M sentences). Having the set of "average sentences", we took the Croatian gold-standard sentiment lexicon created by Glavaš et al. (2012), translated it to Serbian with a rule-based Croatian-Serbian translator (Klubička et al., 2016), combined both lexicons, extracted the unique entries with a single sentiment affinity, and used them as seed words for sampling sentences for manual annotation. The final pool of seed words contains 381 positive and 239 negative words (neutral words are excluded). These seed words are used for stratified random sampling, which gives us 867 sentences with negative seed word(s), 867 sentences with positive seed word(s), and 866 sentences with neither positive nor negative seed words (supposedly having neutral sentiment). We sample 2,600 sentences in total for manual annotation. The only stratum we use is the size of the original corpora (i.e. the number of sentences per corpus). With this we sample 1,388 sentences from the Croatian parliament, 1,059 sentences from the Serbian parliament, and 153 sentences from the Bosnian parliament.

Table 1: Distribution of the three-class labels in the whole dataset, as well as across each of the three parliaments.

parliament   positive   neutral   negative
all          470        772       1358
HR           261        433       694
BS           27         42        84
SR           182        297       580

2.4. Annotation schema

The annotation schema for labelling sentence-level data was adopted from Batanović et al. (2020), who propose a six-item scale for the annotation of sentiment polarity in short texts. The schema was originally developed for and applied to SentiComments.SR, a corpus of movie comments in Serbian, and is particularly suitable for low-resourced languages.
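The two-stage sampling procedure described in Section 2.3 (an interquartile-range filter for "average sentences", followed by seed-word-driven stratified sampling) can be sketched as follows. This is an illustrative reconstruction, not the authors' actual pipeline: the toy lexicons, the quota, and the rough quartile computation are assumptions made for the sake of the example.

```python
import random

def iqr_filter(sentences):
    """Keep only 'average sentences': token counts within the rough IQR."""
    lengths = sorted(len(s.split()) for s in sentences)
    q1 = lengths[len(lengths) // 4]          # rough first quartile
    q3 = lengths[(3 * len(lengths)) // 4]    # rough third quartile
    return [s for s in sentences if q1 <= len(s.split()) <= q3]

def seed_stratum(sentence, positive, negative):
    """Assign a sentence to a sampling stratum based on seed-word hits."""
    tokens = set(sentence.lower().split())
    has_pos, has_neg = bool(tokens & positive), bool(tokens & negative)
    if has_neg and not has_pos:
        return "negative"
    if has_pos and not has_neg:
        return "positive"
    if not has_pos and not has_neg:
        return "none"
    return "mixed"  # sentences hitting both lexicons are skipped here

def sample_for_annotation(sentences, positive, negative, quota, seed=42):
    """Draw up to `quota` sentences from each of the three strata."""
    rng = random.Random(seed)
    strata = {"positive": [], "negative": [], "none": []}
    for s in iqr_filter(sentences):
        label = seed_stratum(s, positive, negative)
        if label in strata:
            strata[label].append(s)
    return {k: rng.sample(v, min(quota, len(v))) for k, v in strata.items()}
```

In the paper's setting the quotas are 867 negative-seed, 867 positive-seed, and 866 no-seed sentences, additionally stratified by corpus size, which the sketch above does not model.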
The annotation schema contains six sentiment labels (Batanović et al., 2020: 6):

• +1 (Positive in our dataset) for sentences that are entirely or predominantly positive
• –1 (Negative in our dataset) for sentences that are entirely or predominantly negative
• +M (M_Positive in our dataset) for sentences that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the positive sentiment in a strict binary classification
• –M (M_Negative in our dataset) for sentences that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the negative sentiment in a strict binary classification
• +NS (P_Neutral in our dataset) for sentences that only contain non-sentiment-related statements, but still lean more towards the positive sentiment in a strict binary classification
• –NS (N_Neutral in our dataset) for sentences that only contain non-sentiment-related statements, but still lean more towards the negative sentiment in a strict binary classification

The different naming convention we have applied in our dataset serves primarily practical purposes: the three-way classification is obtained by taking into consideration only the second part of the string (if an underscore is present). Additionally, we also follow the original schema, which allowed marking text deemed sarcastic with the code "sarcasm". The benefit of the whole annotation logic is that it was designed with versatility in mind, allowing the sentiment label set to be reduced in subsequent processing if needed. This includes various reductions concerning polarity categorization, subjective/objective categorization, a change of the number of categories, or sarcasm detection. This is important for the various empirical tests we perform in the following sections.

2.5. Data annotation

Data were annotated in two waves, with 1,300 instances annotated in each. Annotation was done via a custom online app. The first batch of 1,300 sentences was annotated by two annotators, both native speakers of Croatian, while the second batch was annotated by only one of them. The inter-annotator agreement (IAA) measured using Krippendorff's alpha in the first round was 0.599 for the full six-item annotation schema, 0.745 for the three-item annotation schema (positive/negative/neutral), and 0.829 for the two-item annotation schema focused on the detection of negative sentiment only (negative/other). The particular focus on negative sentiment in the test setting is inspired by a stream of research in political communication which argues that negative emotions appear to be particularly prominent in the forming of the human psyche and its role in politics (Young and Soroka, 2012). More specifically, political psychologists have found that negative political information has a more profound effect on attitudes than positive information, as it is easier to recall and is more useful in heuristic cognitive processing for simpler tasks (Baumeister et al., 2001; Utych, 2018).

Before the second annotator moved on to annotate the second batch of instances, hard disagreements, i.e. disagreements pointing at a different three-class sentiment, where +NS and –NS are considered neutral, were resolved together by both annotators through a reconciliation procedure.

The final distribution of the three-class labels in the whole dataset, as well as across each of the three parliaments, is given in Table 1. The presented distributions show that, regardless of the lexicon-based sampling, the negative class is still by far the most pervasive category, which might be even more the case in a randomly sampled dataset, something we leave for future work.

2.6. Dataset encoding

The final dataset, available through the CLARIN.SI repository, contains the following metadata:

• the sentence that is annotated
• the country of origin of the sentence
• the annotation round (first, second)
• the annotation of annotator1 with one of the labels from the annotation schema presented in Section 2.4.
• the annotation of annotator2 following the same annotation schema
• the annotation given during the reconciliation of hard disagreements
• the three-way label (positive, negative, neutral), where the +NS and –NS labels are mapped to the neutral class
• the id of the document the sentence comes from
• the sentence id of the sentence
• the date the speech was given
• the name, party, gender, and birth year of the speaker
• the split (train, dev, or test) the instance has been assigned to (described in more detail in Section 3.1.)

The final dataset is organized in the JSONL format (each line in the file being a JSON entry) and is available under the CC-BY-SA 4.0 license.4

4 http://hdl.handle.net/11356/1585

3. Experiments

3.1. Data splits

For performing the current and future experiments, the dataset was split into train, development, and test subsets. The development subset consists of 150 instances, while the test subset consists of 300 instances, both using instances from the first annotation round, where two annotations per instance and hard-disagreement reconciliations are available. The training data consist of the remainder of the data from the first annotation round and all instances from the second annotation round, summing to 2,150 instances.

While splitting the data, stratification was performed on the variables of three-way sentiment, country, and party. With this we can be reasonably sure that no specific strong bias regarding sentiment, country, or political party is present in any of the three subsets.

3.2. Experimental setup

In our experiments we investigate the following questions: (1) how well can different technologies learn our three-way classification task, (2) what is the difference in performance depending on which parliament the model is trained or tested on, and (3) is the annotation quality of the best-performing technology high enough to be useful for data enrichment and analysis.

We investigate the first question by comparing the results of the following classifiers: fastText (Joulin et al., 2016) with pre-trained CLARIN.SI word embeddings (Ljubešić, 2018), the multilingual transformer model XLM-RoBERTa (Conneau et al., 2019),5 the transformer model pre-trained on Croatian, Slovenian and English, cseBERT (Ulčar and Robnik-Šikonja, 2020),6 and the transformer model pre-trained on Croatian, Bosnian, Montenegrin and Serbian, BERTić (Ljubešić and Lauc, 2021).7 Our expectation is for the last model to perform best, given that it was pre-trained on the most data from the three languages. However, this assumption has to be checked, given that for some tasks even models pre-trained on many languages obtain performance that is comparable to otherwise superior models pre-trained on one or a few languages (Kuzman et al., 2022).

5 https://huggingface.co/xlm-roberta-base
6 https://huggingface.co/EMBEDDIA/crosloengual-bert

While comparing the different classification techniques, each model was optimized for the number-of-epochs hyperparameter on the development data, while all other hyperparameters were kept at their defaults. For training the transformers, the simpletransformers library8 was used.

The second question, on parliament specificity, we answer by training separate models on Croatian sentences only and on Serbian sentences only, evaluating each model both on Croatian and on Serbian test sentences. We further evaluate the model trained on all training instances separately on the instances coming from each of the three parliaments.

For our third question, on the usefulness of the model for data analysis, we report confusion matrices to inform potential downstream users of the model's per-category performance.

4. Results

4.1. Classifier comparison

We report the results of our text classification technology comparison in Table 2.

Table 2: Results of the comparison of various text classification technologies. We report the macro-F1 mean and standard deviation over 6 runs with the model-specific optimal number of training epochs. The distributions of the results of the two best-performing models are compared with the Mann-Whitney U test (** p < 0.01).

model                              macro F1
classla/bcms-bertic                0.7941 ± 0.0101**
EMBEDDIA/crosloengual-bert         0.7709 ± 0.0113
xlm-roberta-base                   0.7184 ± 0.0139
fasttext + CLARIN.SI embeddings    0.6312 ± 0.0043

The results show that the transformer models are by far more capable than the fastText technology relying on static embeddings only. Of the three transformer models, the multilingual XLM-RoBERTa model shows a large gap in performance to the two best-performing models. Comparing the cseBERT and the BERTić model, the latter comes out on top with a moderate improvement of 1.5 points in macro-F1. The difference in the results of the two models is statistically significant according to the Mann-Whitney U test (Mann and Whitney, 1947), with a p-value of 0.0053.

4.2. Parliament dependence

We next investigate the dependence of the results on which parliament the training and the testing data come from. Our initial assumption was that the results depend on whether the training and the testing data come from the same or from a different parliament, with same-parliament results being higher. We also investigate how the model trained on all data performs on parliament-specific test data.
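The significance test used in the classifier comparison above can be reproduced without external dependencies. The sketch below computes the Mann-Whitney U statistic with a normal approximation for the two-sided p-value; the paper's p-value of 0.0053 presumably comes from a standard statistical package, and with only six runs per model an exact test would be preferable, so treat the approximate p-value as indicative only.

```python
from statistics import NormalDist

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p), where U is the statistic for the first sample.
    Ties receive average ranks; no tie correction is applied, and
    both samples are assumed non-empty.
    """
    pooled = sorted((v, grp) for grp, sample in enumerate((xs, ys)) for v in sample)
    values = [v for v, _ in pooled]
    # Assign 1-based ranks, averaging the ranks of tied values.
    ranks = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r1 = sum(ranks[k] for k, (_, grp) in enumerate(pooled) if grp == 0)
    n1, n2 = len(xs), len(ys)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u1 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p
```

For the comparison in Table 2, `xs` and `ys` would be the six macro-F1 scores of the BERTić and cseBERT runs, respectively.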
7 https://huggingface.co/classla/bcms-bertic
8 https://simpletransformers.ai

4.2.1. Impact of training data

We perform this analysis on all three transformer models from Section 4.1., hoping to obtain a deeper understanding of the parliament dependence of our task. We train and test on data from the Croatian and the Serbian parliament only, as the Bosnian parliament's data are not large enough to enable model training.

In Table 3 we report the results grouped by model and by training and testing parliament. To our surprise, the strongest factor turns out to be not whether the training and testing data come from the same parliament, but what testing data are used, regardless of the training data. This trend is observed regardless of the model used.

Table 3: Comparison of the three transformer models when trained and tested on data from the Croatian or Serbian parliament. The average macro-F1 and standard deviation over 6 runs are reported.

XLM-RoBERTa
train \ test    HR                 SR
HR              0.7296 ± 0.0251    0.6128 ± 0.0341
SR              0.7323 ± 0.0282    0.6487 ± 0.0203

cseBERT
train \ test    HR                 SR
HR              0.7748 ± 0.0174    0.7146 ± 0.0175
SR              0.7762 ± 0.0114    0.6989 ± 0.0275

BERTić
train \ test    HR                 SR
HR              0.8147 ± 0.0083    0.7249 ± 0.0105
SR              0.7953 ± 0.0207    0.7130 ± 0.0278

The results show that the Serbian test data seem to be harder to classify, regardless of what training data are used, with a difference of ∼9 points in macro-F1 for the BERTić and the XLM-RoBERTa models. The difference is smaller for the cseBERT model, ∼7 points, but it still shows the same trend as the two other models.
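Macro-F1, the metric reported throughout these comparisons, is the unweighted mean of the per-class F1 scores, so the frequent negative class does not dominate the score. A dependency-free sketch (the authors' evaluation presumably relies on a standard library implementation):

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never hit
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally, a model that ignores the neutral class is penalised even if negative sentences dominate the test set.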
We have additionally explored the possibility of a complexity bias of the Serbian test data in comparison to the Serbian training data by performing different data splits, but the results obtained were very similar to those presented here. Serbian data seem to be harder to classify in general, which is observed when performing inference over Serbian data. Training over Serbian data still results in a model comparably strong to that based on Croatian training data. It is important to note that the Croatian data subset is 30% larger than the Serbian one.

To test whether the complexity of the Serbian data goes back to challenges during data annotation, or whether it is rather the models that struggle with inference over Serbian data, we calculated the Krippendorff IAA on the data from each parliament separately. The agreement calculation over the ternary classification schema resulted in an IAA of 0.69 for the Bosnian data, 0.733 for the Croatian data, and 0.77 for the Serbian data. This insight proved that the annotators themselves did not struggle with the Serbian data, as these had the highest IAA. We also tested whether there is excessive sarcasm in the Serbian data, which might affect the model's performance. The dataset contains two sarcastic instances from the parliament of Bosnia and Herzegovina and 16 each for Croatia and Serbia, which means sarcasm can hardly explain the overall lower performance on the Serbian test data. Lastly, we checked the type-token ratio (TTR) on samples of Croatian and Serbian sentences to estimate the lexical richness of each subset, as a higher lexical richness of Serbian (via a higher type-token ratio) could possibly explain the lower results obtained on the Serbian test data. By calculating the type-token ratio on 100 tokens selected from random sentences, and repeating the process 100 times in a bootstrapping manner, we obtained a result of 0.833 for Serbian and 0.839 for Croatian. This result shows the Croatian part of the dataset to be just slightly more lexically rich (83.9 different tokens among 100 tokens on average) than the Serbian one (83.3 different tokens among 100 tokens), which does not explain the difference in performance of the various classifiers on the Serbian data.

The complexity of the Serbian data that can be observed in the evaluation is due to some effect that we did not manage to identify at this point, but that will have to be taken into consideration in future work on this dataset.

4.2.2. Impact of testing data

In the next set of experiments, we compare the performance of BERTić classifiers trained on all training data, but evaluated on all and on per-parliament testing data. Beyond this, we train models over the ternary schema that we have used until now (positive vs. neutral vs. negative), but also over the binary schema (negative vs. rest), given our special interest in identifying negative sentences, as already discussed in Section 2.5.

We report results on test data from each of the three parliaments, including the Bosnian one, which, however, contains only 18 testing instances, so these results have to be taken with caution.

Table 4: Average macro-F1 and standard deviation over 6 runs of the BERTić model, trained on all training data and evaluated on varying testing data.

test    ternary            binary
all     0.7941 ± 0.0101    0.8999 ± 0.0120
HR      0.8260 ± 0.0186    0.9221 ± 0.0153
BS      0.7578 ± 0.0679    0.9071 ± 0.0525
SR      0.7385 ± 0.0170    0.8660 ± 0.0150

The results presented in Table 4 show again that the Serbian data seem to be the hardest to classify even when all training data are used. The Bosnian results are somewhat close to the Serbian ones, but caution is required here due to the very small test set. This level of necessary caution regarding the Bosnian test data is also visible from the five times higher standard deviation in comparison to the results of the two other parliaments. The Croatian data seem to be the easiest to classify, with an absolute difference of 9 points between the performance on the Serbian and the Croatian test data. Regarding the binary classification results, these are, as expected, higher than those of the ternary classification schema, with a macro-F1 of 0.9 when all data are used. The relationship between the specific parliaments is very similar to that observed using the ternary schema.

Figure 1: Row-normalised and raw-count confusion matrix of the BERTić results on the ternary schema.

Figure 2: Row-normalised and raw-count confusion matrix of the BERTić results on the binary schema.

4.3. Per-category analysis

Our final set of experiments investigates the per-category performance on both the ternary and the binary classification schema. We present the confusion matrices on the ternary schema, one row-normalized, the other with raw counts, in Figure 1. As anticipated, the classifier works best on the negative class, with 88% of the negative instances properly classified as negative. Second by performance is the positive class, with 81% of the positive instances labelled as such, while among the neutral instances 3 out of 4 are correctly classified. Most of the confusion between classes occurs, as expected, between the neutral class and either of the two remaining classes.

The binary confusion matrices, presented in Figure 2, show a rather balanced performance on both categories. On each of the categories the recall is around 0.9, with a similar precision given the symmetry of the confusions.

When comparing the output of the ternary and the binary model, the ternary model's output mapped to the binary schema performs slightly worse than the binary model, meaning that practitioners should apply the binary model if they are interested just in distinguishing between negative and other sentences.

Although any direct comparisons are hard to make, the few existing studies which performed text classification on sentence-level data report much worse results. Rauh (2018) found that when three annotators and three sentiment dictionaries were compared on a ternary classification task (positive/negative/neutral), they agreed in only one quarter of the 1,500 sentences. Using heuristic classifiers based on statistical and syntactic clues, Onyimadu et al. (2013) found that, on average, only 43% of the sentences were correctly annotated for their sentiment affinity. The results of our experiments are therefore certainly promising. Especially when it comes to the classification of negative sentences, the model has a 1-in-10 sentence error rate, which is almost on par with the quality of annotation performed by human coders.

5. Conclusion

The paper introduces a sentence-level dataset of parliamentary proceedings, manually annotated for sentiment via a six-level schema. Good inter-annotator agreement is reported, and the first results on the automation of the task are very promising, with a macro-F1 of ∼0.8 on the ternary schema and ∼0.9 on the binary schema. A difference in performance across the three parliaments is observed, but it is visible only during inference, Serbian data being harder to make predictions on, while for modelling all parliaments seem to be similarly useful. One limitation of our work is the following: our testing data have been sampled in the same way as the whole dataset, with a bias towards mid-length sentences and sentences containing sentiment words. Future work should consider preparing a sample of random sentences, or, even better, of consecutive sentences, so that the potential issue of the lack of a wider context during manual data annotation is successfully mitigated as well.

In general, the reported results have several promising implications for applied research in political science. First of all, they allow a more fine-grained analysis of political concepts and their context. A good example is a combination of the KWIC approach with sentiment analysis, with a focus on examining the tone of a message in political discourse. This is interesting for both qualitatively and quantitatively oriented scholars. Especially the possibility of extracting a numeric assessment from the classification model (e.g. a class probability) is particularly promising for all sorts of hypothesis-testing statistical models. Moreover, sentence-level analysis can be combined with the findings of various information and discourse theories for studying political discourse focused on rhetoric and narratives (e.g. the beginning and the end of a speech being more relevant than what comes in the middle). Apart from concept-driven analysis, the classification model can be used for various research problems ranging from policy position-taking to ideology detection or general scaling tasks (Abercrombie and Batista-Navarro, 2020a; Glavaš et al., 2017; Proksch et al., 2019). Although each of these tasks requires proper testing, the performance of the trained models for such applications is undoubtedly promising.

As a part of our future work, we plan to test the usefulness of the predictions on a set of downstream tasks. The goal is to analyze the data from all three parliaments (Bosnia and Herzegovina, Croatia, and Serbia) in a series of tests focused on the replication of results from the existing research using mostly English data. Given the results we obtained, we aim to continue our research using the setup with the model trained on cross-country data. Furthermore, the three corpora we have used in this paper will be extended as a part of the ParlaMint II project.

We make the ternary and binary BERTić models trained on all available training data available via the HuggingFace repository9 10 and make the dataset available through the CLARIN.SI repository (Mochtak et al., 2022d).

9 https://huggingface.co/classla/bcms-bertic-parlasent-bcs-ter
10 https://huggingface.co/classla/bcms-bertic-parlasent-bcs-bi

Acknowledgements

This work has received funding from the European Union's Connecting Europe Facility 2014–2020 – CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu project). This communication reflects only the authors' view. The Agency is not responsible for any use that may be made of the information it contains.

This work was also funded by the Slovenian Research Agency within the Slovenian-Flemish bilateral basic research project "Linguistic landscape of hate speech on social media" (N06-0099 and FWO-G070619N, 2019–2023) and the research programme "Language resources and technologies for Slovene" (P6-0411).

6. References

Gavin Abercrombie and Riza Batista-Navarro. 2020a. ParlVote: A corpus for sentiment analysis of political debates. In: Proceedings of the 12th Language Resources and Evaluation Conference, pages 5073–5078, Marseille, France. European Language Resources Association.

Gavin Abercrombie and Riza Batista-Navarro. 2020b. Sentiment and position-taking analysis of parliamentary debates: A systematic literature review. Journal of Computational Social Science, 3(1):245–270.

Mohit Bansal, Claire Cardie, and Lillian Lee. 2008. The power of negative thinking: Exploiting label disagreement in the Min-cut classification framework. In: Coling 2008: Companion volume: Posters, pages 15–18, Manchester, UK. Coling 2008 Organizing Committee.

Vuk Batanović, Miloš Cvetanović, and Boško Nikolić. 2020. A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts. PLOS ONE, 15(11):e0242050.

Roy F. Baumeister, Ellen Bratslavsky, Catrin Finkenauer, and Kathleen D. Vohs. 2001. Bad is stronger than good. Review of General Psychology, 5(4):323–370.

Tobias Burst, Werner Krause, Pola Lehmann, Jirka Lewandowski, Theres Matthieß, Nicolas Merz, Sven Regel, and Lisa Zehnter. 2022. Manifesto corpus.

Andrea Ceron, Luigi Curini, and Stefano M. Iacus. 2019. Politics and Big Data: Nowcasting and Forecasting Elections with Social Media. Routledge, Abingdon, New York.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale.

Rory Duthie and Katarzyna Budzynska. 2018. A deep modular RNN approach for ethos mining. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, pages 4041–4047. AAAI Press.

René D. Flores. 2017. Do anti-immigrant laws shape public sentiment? A study of Arizona's SB 1070 using Twitter data. American Journal of Sociology, 123(2):333–384.

R. Kelly Garrett, Shira Dvir Gvirsman, Benjamin K. Johnson, Yariv Tsfati, Rachel Neo, and Aysenur Dal. 2014. Implications of pro- and counterattitudinal information exposure for affective polarization. Human Communication Research, 40(3):309–332.

Goran Glavaš, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. Semi-supervised acquisition of Croatian sentiment lexicon. In: International Conference on Text, Speech and Dialogue, pages 166–173. Springer.

Goran Glavaš, Federico Nanni, and Simone Paolo Ponzetto. 2017. Unsupervised cross-lingual scaling of political texts. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 688–693, Valencia, Spain. Association for Computational Linguistics.

Hrvatski sabor. 2020. eDoc. http://edoc.sabor.hr/.

Swen Hutter, Edgar Grande, and Hanspeter Kriesi. 2016. Politicising Europe: Integration and Mass Politics. Cambridge University Press, Cambridge.

Shanto Iyengar, Yphtach Lelkes, Matthew Levendusky, Neil Malhotra, and Sean J. Westwood. 2019. The origins and consequences of affective polarization in the United States. Annual Review of Political Science, 22(1):129–146.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Filip Klubička, Gema Ramírez-Sánchez, and Nikola Ljubešić. 2016. Collaborative development of a rule-based machine translator between Croatian and Serbian. In: Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 361–367.

Ruud Koopmans and Paul Statham. 2006. Political claims analysis: Integrating protest event and political discourse approaches. Mobilization: An International Quarterly, 4(2):203–221.

Taja Kuzman, Peter Rupnik, and Nikola Ljubešić. 2022. The GINCO training dataset for web genre identification of documents out in the wild. ArXiv, abs/2201.03857.

Harold Dwight Lasswell. 1927. Propaganda Technique in the World War. Peter Smith, New York.

Dilin Liu and Lei Lei. 2018. The appeal to political sentiment: An analysis of Donald Trump's and Hillary Clinton's speech themes and discourse strategies in the 2016 US presidential election. Discourse, Context & Media, 25:143–152.

Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy. Association for Computational Linguistics.

Nikola Ljubešić and Davor Lauc. 2021. BERTić – the transformer language model for Bosnian, Croatian, Montenegrin and Serbian. In: Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, Kyiv, Ukraine. Association for Computational Linguistics.

Nikola Ljubešić. 2018. Word embeddings CLARIN.SI-embed.hr 1.0. Slovenian language resource repository CLARIN.SI.

Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60.

Lilliana Mason. 2015. "I disrespectfully agree": The differential effects of partisan sorting on social and issue polarization. American Journal of Political Science, 59(1):128–145.

Michal Mochtak, Josip Glaurdić, and Christophe Lesschaeve. 2020. Talking war: Representation, veterans and ideology in post-war parliamentary debates. Government and Opposition, 57(1):148–170.

Michal Mochtak, Josip Glaurdić, and Christophe Lesschaeve. 2022a. CROCorp: Corpus of Parliamentary Debates in Croatia (v1.1.1). https://doi.org/10.5281/zenodo.6521372.

Michal Mochtak, Josip Glaurdić, and Christophe Lesschaeve. 2022b. SRBCorp: Corpus of Parliamentary Debates in Serbia (v1.1.1). https://doi.org/10.5281/zenodo.6521648.

Michal Mochtak, Josip Glaurdić, Christophe Lesschaeve, and Ensar Muharemović. 2022c. BiHCorp: Corpus of Parliamentary Debates in Bosnia and Herzegovina (v1.1.1). https://doi.org/10.5281/zenodo.6517697.

Michal Mochtak, Peter Rupnik, and Nikola Ljubešić. 2022d. The sentiment corpus of parliamentary debates ParlaSent-BCS v1.0. Slovenian language resource repository CLARIN.SI.

Saif M. Mohammad. 2021. Sentiment analysis: Automatically detecting valence, emotions, and other affectual states from text. https://arxiv.org/abs/2005.11882.

Nona Naderi and Graeme Hirst. 2016. Argumentation mining in parliamentary discourse. In: Matteo Baldoni, Cristina Baroglio, Floris Bex, Floriana Grasso, Nancy Green, Mohammad-Reza Namazi-Rad, Masayuki Numao, and Merlin Teodosia Suarez, editors, Principles and Practice of Multi-Agent Systems, pages 16–25, Cham. Springer.

Obinna Onyimadu, Keiichi Nakata, Tony Wilson, David Macken, and Kecheng Liu. 2013. Towards sentiment analysis on parliamentary debates in Hansard. In: Revised Selected Papers of the Third Joint International Conference on Semantic Technology – Volume 8388, JIST 2013, pages 48–50, Berlin, Heidelberg. Springer-Verlag.

Otvoreni Parlament. 2020. Početna. https://otvoreniparlament.rs/.

Parlamentarna skupština BiH. 2020. Sjednice. https://www.parlament.ba/?lang=bs.

Sven-Oliver Proksch, Will Lowe, Jens Wäckerle, and Stuart Soroka. 2019. Multilingual sentiment analysis: A new approach to measuring conflict in legislative speeches. Legislative Studies Quarterly, 44(1):97–131.

Christian Rauh. 2018. Validating a sentiment dictionary for German political language—a workbench note. Journal of Information Technology & Politics, 15(4):319–343.

Edward A. Shils and Morris Janowitz. 1948. Cohesion and disintegration in the Wehrmacht in World War II. Public Opinion Quarterly, 12(2):315.

Juan M. Soler, Fernando Cuartero, and Manuel Roblizo. 2012. Twitter as a tool for predicting elections results. In: 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1194–1200.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 327–335, Sydney. Association for Computational Linguistics.

Andranik Tumasjan, Timm Sprenger, Philipp Sandner, and Isabell Welpe. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. Proceedings of the International AAAI Conference on Web and Social Media, 4(1).

Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In: P. Sojka, I. Kopeček, K. Pala, and A. Horák, editors, Text, Speech, and Dialogue TSD 2020, volume 12284 of Lecture Notes in Computer Science. Springer.

Stephen M. Utych. 2018. Negative affective language in politics. American Politics Research, 46(1):77–102.

Lori Young and Stuart Soroka. 2012. Affective news: The automated coding of sentiment in political texts. Political Communication, 29(2):205–231.
Fine-grained human evaluation of NMT applied to literary text: case study of a French-to-Croatian translation

Marta Petrak,* Mia Uremović,* Bogdanka Pavelin Lešić*
* Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, 10000 Zagreb
mpetrak@ffzg.hr, uremovic.mia@gmail.com, bpavelin@ffzg.hr

Abstract

Even though neural machine translation (NMT) has demonstrated phenomenal results and has been shown to be more successful than previous MT systems, there are not many works dealing with its application to literary text. This results from the fact that literary texts are deemed to be more complex than others because they involve specific elements such as idiomatic expressions, metaphor, an author's individual style, etc. Regardless of this fact, there is a growing body of research dealing with NMT applied to literary texts, and this case study is part of it. The goal of the present paper is to conduct an in-depth, fine-grained evaluation of a novel translated by Google Translate (GT) in order to reach detailed insights into NMT performance on literary text. In addition, the paper aims to cover, for the first time to the best of our knowledge, the French-Croatian language combination.

1. Introduction

Numerous studies have demonstrated that neural machine translation (NMT) outperforms previous MT systems (e.g. Bentivogli et al., 2016; Burchardt et al., 2017; Klubička et al., 2018; Hansen, 2021). This has been demonstrated for a number of text types, among which literary texts are the least represented due to their specificities such as lexical richness and metaphorical and idiomatic elements (e.g. Toral and Way, 2018). Literary translation is also usually considered to be more complex than technical translation because it includes elements such as the writer's individual style (Hadley, 2020).

Due to these facts, literary texts are still perceived to be "the greatest challenge for MT" (Toral and Way, 2018). Some more pessimistic authors even claim that "there is no prospect of machines being useful at (assisting with) the translation of [literary texts]" (Toral and Way, 2018). While machine translation followed by a post-editing phase is a widespread practice generally speaking, it has not yet become a permanent fixture in literary translation (Besacier, 2014).

In spite of this fact, there has been a growing interest in applying MT to literature, which can be seen, for example, in the fact that a workshop on computational linguistics for literature has been organised by ACL since 2012.1 Moreover, the French-speaking world has seen the creation of an observatory for MT (Observatoire de la traduction automatique) by the ATLAS2 association in December 2018 to follow the development of MT applied to literary text.3

Even though studies that analyse the application of MT to literary text are less numerous than those applying MT to other types of text, they are not inexistent. Hansen's (2021) paper brings a detailed and up-to-date overview of the works dealing with MT of literary texts. The first literary text translated by MT was done by Besacier (2014), and it comprised an essay translated from English to French. A number of languages have already been covered by various studies of MT applied to literary text, among which Slavic (e.g. Slovene, Kuzman et al., 2019), Romance (e.g. Catalan, Toral and Way, 2018; French, Besacier, 2014; Hansen, 2021), Germanic (English, in a number of papers; German, Matusov, 2019), and Celtic (Scottish Gaelic and Irish, Ó Murchú, 2019), etc.

2. Goal of the paper

The goal of this case study is to go beyond the overall performance of NMT on literary text and to provide an extensive, in-depth human analysis of its results. In order to do so, we will, firstly, produce a MT of a French novel and, secondly, compare that translation with a human translation of the same text. The human translation will be done by a student of translation from French into Croatian as part of her Master's thesis, and the analysis will be carried out by two human evaluators, the student and an experienced professional translator.

In addition to providing an in-depth analysis of the translation of a literary text done by MT, our case study is the first one to pair, to the best of our knowledge, a large Romance language, French, with Croatian,4 a smaller-scale language rich in morphology.

The rest of the paper is structured as follows: in Section 3 we describe the methodology used. Section 4 is the central part of the paper, as it sums up the results of our analysis combined with a number of specific examples from the corpus. In Section 5 we bring some concluding remarks and recommend some further steps.

1 Cf. e.g. https://aclanthology.org/events/clfl-2020/.
2 ATLAS stands for Association pour la promotion de la traduction littéraire (Association for the Promotion of Literary Translation), https://www.atlas-citl.org/.
3 https://www.atlas-citl.org/lobservatoire-de-la-traduction-automatique/
4 Croatian is the official language of the Republic of Croatia and of the EU, but is also spoken in Bosnia and Herzegovina, Montenegro, etc. It has approximately 5.6 million native speakers worldwide. Cf. https://www.european-language-grid.eu/ncc/ncc-croatia/.

3. Methodology

In order to conduct our analysis, we have chosen a novel, which is "arguably the most popular type of literary text" (Toral and Way, 2018). Our corpus comprises the first eight chapters of the novel La traduction est une histoire d'amour (Translation is a Love Affair) written by Jacques Poulin, a contemporary Canadian author. It totals 8,347 words. The original text, written in French, is first translated by GT, and subsequently by a human translator. The MT is analysed in detail by two evaluators, after which the two translations are compared.

Hansen (2021) argues that the evaluation of texts produced by MT still remains a major obstacle. More precisely, while BLEU (Papineni et al., 2002) is the most widely used automatic metric, it has to be taken with caution in the case of literary texts (ibid.). Papineni et al. (2002) argue that human evaluations of MT are "extensive" and therefore usually more fine-grained than automatic ones, but the authors also point to their expensiveness.

In our case study, we present a quantitative and qualitative analysis of errors. We base our methodology on the one developed by Pavlović (2016). Pavlović (ibid.) also argues that in the literature there is not a single classification of translation errors that all authors would agree upon, so she makes her own classification based upon extant ones by a number of previous authors and some specificities of the corpus. Her study (2016) included only non-literary texts — newspaper reports, public opinion reports and EU legal documents (opinions and decisions) — a total of 3,406 words. Still, Pavlović's (2016) methodology was developed with the goal of comparing MT done by GT and human translation, and it takes into account some specificities of the Croatian language such as a rather free word order, abundance of inflection and morphological complexity. It should be emphasized that Pavlović's (2016) study was conducted before GT used NMT for Croatian, which is available today5 and is the technology used for the analysis presented in this case study.

The analysis of errors conducted for this paper follows that given by Pavlović (2016), with only minor alterations. For example, the sub-category (D.c), 'numbers', is not present in the machine translation of the chosen text and is hence not part of this analysis.

5 Cf. https://translate.google.com/intl/hr/about/languages/.

4. Results and analysis

4.1. Fine-grained human evaluation

Our analysis has demonstrated that GT has provided a very satisfactory translation generally speaking, and some of its solutions were even better than the ones provided by the human translation in the cases where there was a possible choice between a general word and its more suitable or literary synonym.

Below we first bring a table with a general presentation of errors found in the MT.

Error category    %
Morphosyntax      55.3
Lexicon           32.1
Spelling          7
Other             5.6

Table 1: Classification of general error types produced by MT.

Table 1 demonstrates that morphosyntactic errors visibly make the most frequent error type in our corpus, i.e. more than half of the total number of errors. These are followed by errors in lexical choice. In Table 2 (below) we bring a detailed list of error types found in our corpus.

Error type                                      %
C.a. congruence                                 39.3
B.a. lexical choice                             18.8
C.c. word order / order of phrase constituents  10.9
B.c. idiomatic expressions                      7.5
B.b. term or title                              5.8
C.b. verbal forms / tenses                      5.2
A.a. punctuation                                4.5
A.b. capital letters                            2
D.a. not translated                             2
D.b. omissions                                  1.9
D.d. format, etc.                               1.6
A.c. other spelling errors                      0.5
D.c. numbers                                    0

Table 2: Detailed breakdown of error types found in the corpus.

4.1.1. Morphosyntactic errors

According to our analysis, the most common errors made by GT are morphosyntactic errors, more specifically congruence errors, representing 39.3%. This type of error most frequently has to do with grammatical gender.
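The percentage figures in Tables 1 and 2 are simple relative frequencies over all annotated errors. As an aside, such a breakdown can be tallied in a few lines; the labels below are hypothetical placeholders, not the study's actual annotation data:

```python
from collections import Counter

def error_breakdown(labels):
    """Turn a list of per-error category labels into a percentage
    breakdown over all errors, as in Tables 1 and 2."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# Hypothetical annotations; the real analysis covers 738 errors.
annotations = ["C.a"] * 39 + ["B.a"] * 19 + ["C.c"] * 11 + ["A.a"] * 31
print(error_breakdown(annotations))
```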
Here is an example:

original:           La meilleure traductrice du Québec
GT:                 Najbolji prevodilac u Quebecu
human translation:  Najbolja prevoditeljica u Québecu

Table 3: Example of congruence error.

In the above example, traductrice 'female translator' is translated by GT as prevoditelj 'male translator' even though both French and Croatian are marked for gender, and even though there is a ready-made solution in Croatian, prevoditeljica 'female translator'. The problem here is probably the fact that GT uses English as a sort of pivot or intermediate language (e.g. Ljubas, 2018) when translating between French and Croatian,6 two languages that do not share as large a corpus of texts as they each do with English individually.

This is a frequent error produced by GT in the corpus, i.e. not marking whatever has to do with the narrator, who is a woman, as female, but leaving masculine nouns, adjectives etc., which we also attribute to translating via English: e.g. Je raccrochai is translated as Spustio (masc.) sam slušalicu instead of Spustila (fem.) sam slušalicu / Poklopila (fem.) sam.

In other words, it can be said generally that our analysis has demonstrated that GT had no problems, for example, with the rich Croatian nominal case system and general subject-verb or noun-adjective agreement. This is in line with findings from the literature that neural systems have been found to make fewer morphological, lexical and word-order errors (e.g. Burchardt, 2017). What was a problem, however, in the category of morphosyntactic errors is recognizing the narrator as a female, and consequently translating all her attributes and making all the agreements in the feminine gender. This is a feature of the text that extends beyond sentence level and permeates the entire discourse of the novel. In some French sentences, this difference between masculine and feminine gender cannot be seen, for example in the present tense or in the past tense (passé composé) formed with the auxiliary verb to have (avoir). In Croatian, the same goes for the present tense, but the past tense always shows agreement with the subject in gender. The large number of errors in this category undoubtedly stems from the use of English as a pivot language.

4.1.2. Lexical errors

The next most represented category are lexical errors (32.1%), listed in the table below.

original: Eh bien, c'était le portrait tout craché de ma mère. | GT: Pa, to je bila pljuvačka slika moje majke. | human translation: E pa to je pljunuti portret moje majke.
original: Les ouaouarons, affolés, … | GT: Uplašeni bikovi žabe … | human translation: Žabe su se preneražene …
original: Je suis sur la route parce que ma maîtresse ne peut plus s'occuper de moi, (…) | GT: Na putu sam jer se moja ljubavnica više ne može brinuti o meni | human translation: Na ulici sam jer se moja vlasnica više ne može brinuti o meni, (…)
original: Ma mère et ma grand-mère reposaient derrière l'église … | GT: Moja majka i baka odmarale su se iza crkve … | human translation: Moja majka i baka bile su pokopane iza crkve …
original: … dans l'herbe jonchée de feuilles mortes. | GT: … u travi posutoj mrtvim lišćem. | human translation: … travi prekrivenoj suhim lišćem.
original: J'étais très heureuse, presque sur un nuage, (…) | GT: Bio sam vrlo sretan, skoro na devetom oblaku, (…) | human translation: Bila sam sretna, gotovo u sedmom nebu, (…)
original: Les maudites algues… | GT: Proklete morske alge… | human translation: Proklete alge…

Table 4: Examples of lexical choice errors.

Errors in this category concern the following: 1) single-word polysemy, 2) idiomatic expressions, 3) calques from English.

With respect to single-word polysemy, GT has, for instance, erroneously translated maîtresse 'owner' (of a cat) as 'lover'. It also translated reposaient 'rested' as odmarale su se 'were having a rest' instead of bile su pokopane, which is used in the context of the dead buried in a graveyard. Furthermore, it translated algues as morske alge 'sea algae', which is an incorrect specification stemming from the fact that algae are usually related to the sea; the algae in the story, however, come from a pond.

As for idiomatic expressions (7% of total errors), GT rendered le portrait craché 'spitting image' as *pljuvačka slika instead of pljunuti portret. It clearly calqued the expression être sur un nuage 'be on cloud nine' on English and translated it as *biti na devetom oblaku, which does not exist in Croatian and should be translated as na sedmom nebu 'lit. on seventh sky'. The noun phrase feuilles mortes is literally translated as *mrtvo lišće instead of suho lišće 'lit. dry leaves', etc.

There are several instances of calquing from English, such as in the example of ouaouarons, animals known in English as American bullfrogs, which are literally translated as bikovi žabe 'bulls-frogs', and for which we would suggest the translation žabe due to the fact that the particular species is irrelevant to the plot.

4.1.3. Other errors

In the category of capital letters, GT had difficulties rendering street names, which appeared in the text several times. Examples such as 609, rue Richelieu were rendered by GT as 609, ulica Richelieu, where all the individual elements are correctly translated, but the street name as a whole should be written as Ulica Richelieu 609, which is the conventional way of writing street names in Croatian.

Another interesting error concerns proper names. Let us cite two examples: Marine and Chaloupe. Marine, the name of the main character and narrator, is sometimes translated by GT as marinac 'Marine, i.e. member of an elite US fighting corps'. In addition to the same form, the English word is always capitalised, so that could be another reason for such a translation. Chaloupe, on the other hand, is the name of the cat that appears several times in the text. It is derived from the common noun chaloupe denoting a type of boat. GT translated the noun as čamac 'boat', making it a common noun and even leaving out the capital letter.

Bentivogli et al. (2016) and Toral and Sánchez-Cartagena (2017) found that NMT improves notably on reordering and inflection compared to PBMT. In the case of Poulin's novel translated and analysed in this paper, there were generally very few problems with inflection, and word / constituent order represented only 10% of all the errors. What our analysis seems to point to is the fact that using English as a pivot language is the source of a large number of errors, and that using language-pair specific corpora could arguably give better results in translating between two languages of which neither is English. This would also probably have a positive effect on the translation of culturally specific elements such as spelling and writing of toponyms (e.g. street names). Furthermore, our analysis also demonstrates that more improvement should be done in the detection and translation of polysemy and idiomatic expressions.

6 This has been claimed generally as a feature of GT when translating between any pair of languages. A Google spokesperson has admitted that Google Translate uses English for "bridging" between languages with fewer resources. See https://algorithmwatch.org/en/google-translate-gender-bias/; cf. https://www.circuitmagazine.org/chroniques-126/sur-le-vif-126/google-uses-english-as-a-pivot-language.

4.2. BLEU evaluation

The overall cumulative BLEU score for the literary text analysed in our case study was 5.49, which would suggest very poor MT quality. In addition to the fine-grained human evaluation,
a BLEU score was also calculated using the interactive BLEU score evaluator7 available via the Tilde platform. The BLEU score is based on the correspondence of the MT output and the reference human translation. As a reference, BLEU scores of 30 to 40 are considered to be "understandable to good translations", while those of 40 to 50 are "high quality translations".8 Here is the breakdown of the BLEU score:

Type        1-gram  2-gram  3-gram  4-gram
Individual  21.92   5.86    2.79    2.54
Cumulative  21.92   11.33   7.10    5.49

Table 5: Results of automatic BLEU evaluation.

In other available case studies dealing with MT of a literary text, BLEU scores show significant variation. In the case of a translation of a literary essay from English into French (Besacier and Schwartz, 2015), the BLEU score was around 30. In another case study dealing with English literary texts translated into Slovene, BLEU scores varied from 1.73 to 30 depending on the texts on which the MT model was trained (Kuzman et al., 2019). Toral and Way (2018) obtained BLEU scores of around 30 for English-to-Catalan translations of 12 English novels by PBSMT and NMT systems, where NMT outperformed PBSMT.

Unlike the results obtained by Kuzman et al. (2019) in their study of a literary translation from English into Slovene, a language genetically very close to Croatian, where "there were no sentences that would not need postediting", in our case study there were a number of sentences entirely correctly rendered by GT, i.e. that would be publication ready.

In any case, it should be borne in mind that the BLEU automatic evaluation metric was calculated with respect to a single human translation, and that it cannot represent the "real quality" of MT output. In that sense, Hansen (2022) notes, for instance, that two MT models used in his case study had a similar BLEU score in spite of the fact that the first one produced correctly translated words in incomprehensible sentences, while the second one generated correct sentences with words that did not semantically correspond to the lexical field of the translated literary text. This is one of the reasons why we would not entirely agree that the translation provided by GT analysed in this paper is irrelevant or "useless", as it would be classified due to its BLEU score inferior to 10 (cf. footnote n° 8).

In addition, it should be noted that some authors claim that the morphological richness of the Croatian language could raise problems for BLEU evaluation due to the fact that each Croatian noun has approximately 10 different word forms, which are considered by BLEU to be 10 different words, and not 10 different word forms of a single lemma (cf. Seljan et al., 2012). This could result in lower BLEU scores.

7 https://www.letsmt.eu/Bleu.aspx.
8 https://cloud.google.com/translate/automl/docs/evaluate

5. Conclusion

This case study is a contribution to a growing number of papers dealing with applying (N)MT to literary text, which has been thought of until only recently as a domain that could not be translated by MT. Various authors have, however, demonstrated the usefulness of using MT in literary translation. Some (e.g. Besacier and Schwartz, 2015) even argue that MT of literary text may be of interest for all participants of the translation chain, from editors through readers to authors and translators.

Our analysis has demonstrated that there was a total of 738 errors in the text produced by GT, largely falling into two groups: morphosyntactic (around 55%) and lexical choice (around 32%) errors. While the morphosyntactic errors largely concerned errors in congruence, stemming probably from the usage of English as a pivot language between French and Croatian, the lexical choice errors had mostly to do with polysemy, idiomatic expressions and calques.

Let us now compare our results with those from other existing works on MT of literary texts involving either of the two languages from this case study, Croatian or French. Hansen (2022), who analysed English-to-French translations of fantasy books, observed that, generally speaking, the MT output was rather literal and produced mostly lexical errors, as well as errors related to determiners and syntax. While Hansen (ibid.) does not provide further details, we can generally say that in our French-to-Croatian literary translation morphosyntactic errors were some 20% more present than lexical errors, which is different from what he found in the English-French language pair. Furthermore, Hansen (ibid.) was surprised to note that the specific vocabulary related to the fantasy series in question was respected almost entirely, which is probably due to the training of the MT model on texts written by the same author. This is one of the reasons why Hansen (2022) suggests that personalized MT systems should be introduced in literary translation for translating specific authors' styles.

In another paper, involving Slovene, a language closely related to Croatian, and analysing the translation of literary texts from English, Kuzman et al. (2019) observe that "error analysis (…) revealed various punctuation errors, wrong translations of prepositions and conjunctions, inappropriate shifts in verb mood, wrong noun forms and co-reference changes". The authors emphasize the presence of numerous semantic errors, "especially in connection with idioms and ambiguous words". In this case, more detailed data is also lacking, but we can generally conclude that this study also differs from ours in that semantic errors are definitely not the leading error type in our French-to-Croatian translation. Interestingly, Kuzman et al. (2019) also found that GNMT assigned the wrong gender to the main character, just as happened in our case, as mentioned in 4.1.1.

We can conclude that in the French-to-Croatian GT of the novel analysed in this text, morphosyntactic errors (55.3%) are the most represented ones, followed by various lexical errors (32.1%). These results are somewhat different from what was observed in earlier extant studies dealing with MT of literary texts from English to French and English to Slovene.

Even though the BLEU score was only 5.49, indicating very poor translation quality which should be deemed useless, we believe that the GT output would be useful to some extent to translators translating Poulin's novel from scratch. Further analyses should be made, however, in order to establish whether GT trained on French and Croatian corpora would produce better results than GT that uses English as a pivot. Furthermore, it should also be studied how much post-editing effort is needed to correct GT errors in comparison to translation from scratch in the French-to-Croatian language combination.

6. References

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. In: J. Siu, K. Duh and X. Carreras, eds., Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267, Association for Computational Linguistics, Austin, Texas.

Laurent Besacier and Lane Schwartz. 2015. Automated Translation of a Literary Work: A Pilot Study. In: A. Feldman, A. Kazantseva, S. Szpakowicz and C. Koolen, eds., Proceedings of the Fourth Workshop on Computational Linguistics for Literature, pages 114–122.
Association for Computational Linguistics, Denver, provide further details, we can generally say that in our Colorado. PRISPEVKI 145 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Laurent Besacier. 2014. Traduction automatisée Sandra Ljubas. 2018. Prijelaz sa statističkog na d’une œuvre littéraire : une étude pilote. In: P. Blanche, neuronski model: usporedba strojnih prijevoda sa F. Béchet and B. Bigi, eds., Actes du 21ème Traitement švedskoga na hrvatski jezik. Hieronymus, 5:72–79. Automatique des Langues Naturelles, pages 389-94. https://www.bib.irb.hr/978980 Association pour le Traitement Automatique des Evgeny Matusov. 2019. The Challenges of Using Langues, Marseille. https://hal.inria.fr/hal-01003944 Neural Machine Translation for Literature. In: J. Hadley, Marija Brkić, Sanja Seljan and Maja Matetić. 2011. M. Popović, H. Afli, and A. Way, eds., Proceedings of Machine Translation Evaluation for Croatian-English the Qualities of Literary Machine Translation, Machine and English-Croatian Language Pairs. In: B. Sharp, M. Translation, pages 10-19, European Association for Zock, M. Carl, A. L. Jakobsen, eds., Proceedings of the Machine Translation, Dublin. 8th International NLPCS Workshop: Human-Machine https://aclanthology.org/W19-7302.pdf Interaction in Translation, pages 93-104. Copenhagen: Eoin P. Ó Murchú. 2019. Using Intergaelic to pre- Copenhagen Business School. translate and subsequently post-edit a sci-fi novel from Aljoscha Burchardt, Vivien Macketanz, Jon Dehdari, Scottish Gaelic to Irish. In: J. Hadley, M. Popović, H. Georg Heigold, Jan-Thorsten Peter, and Philip Williams. Afli & A. Way, eds., Proceedings of the Qualities of 2017. A Linguistic Evaluation of Rule- Based, Phrase- Literary Machine Translation, pages 20–25, European Based, and Neural MT Engines. The Prague Bulletin Association for Machine Translation, Dublin. 
of Mathematical Linguistics, 108:159–170. https://aclanthology.org/W19-7303 Margot Fonteyne, Arda Tezcan, and Lieve Macken. Kishore Papineni, Salim Roukos, Todd Ward and 2020. Wei- Jing Zhu. 2016. Bleu: a Method for Automatic Literary Machine Translation under the Magnifying Evaluation of Machine Translation. In: P. Isabelle, E. Glass: Assessing the Quality of an NMT-Translated Charniak and D. Lin, eds., Proceedings of the 40th Detective Novel on Document Level. In: N. Calzolari, F. Annual Meeting of the Association for Computational Béchet, P. Blache, K. Choukri, C. Cieri, T. Delreck, S. Linguistics, pages 311–318. Association for Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, computational linguistics, Philadelphia, Pennsylvania, A. Moreno, J. Odijk and S. Piperidis, eds., USA. doi.org/10.3115/1073083.1073135 Proceedings of the 12th Conference on Language Nataša Pavlović. 2017. Strojno i konvencionalno Resources and Evaluation, pages 3790–3798, European prevođenje s engleskoga na hrvatski: usporedba Language Resources Association, Marseille. pogrešaka. In: D. Stolac and A. Vlastelić , eds., Jezik kao James Hadley. 2020. Traduction automatique en predmet proučavanja i jezik kao predmet poučavanja, littérature : l’ordinateur va-t-il nous voler notre travail. pages 279–295, Srednja Europa, Zagreb. Contrepoint, 4:14–18. https://www.ceatl.eu/wp- Jacques Poulin. 2006. La traduction est une content/uploads/2020/12/Contrepoint_2020_04_articl histoire d’amour. Leméac/Actes Sud, Montreal. e_04.pdf Sanja Seljan, Marija Brkić and Tomislav Vičić. 2012. Damien Hansen. 2022. La traduction littéraire BLEU Evaluation of Machine-Translated English- automatique : Adapter la machine à la traduction Croatian Legislation. In: N. Calzolari, K. Choukri, T. humaine individualisée. https://hal.archives- Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. ouvertes.fr/hal-03583562/document Moreno, J. Odijk, S. Piperidis, eds., Proceedings of the Damien Hansen. 2021. 
A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia

Aleksandar Petrovski
Faculty of Informatics, International Slavic University
Marshal Tito 77, Sv.
Nikole, North Macedonia
aleksandar.petrovski@msu.edu.mk

Abstract

This paper describes the creation of a bilingual English-Ukrainian lexicon of named entities, with Wikipedia as a source. The proposed methodology provides a cheap opportunity to build multilingual lexicons without expertise in the target languages. The extracted named entity pairs have been classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). This has been achieved using Wikipedia metadata. Using the presented methodology, a huge lexicon has been created, consisting of 624,168 pairs. The classification quality has been checked manually on 1,000 randomly selected named entities. The results obtained are 97% for precision and 90% for recall.

1. Introduction

The term named entity (NE) refers to expressions describing real-world objects, like persons, locations, and organizations. It was first introduced to the Natural Language Processing (NLP) community at the end of the 20th century. Named entities are often denoted by proper names. They can be abstract or have a physical existence. Some other expressions, describing money, percentages, time, and date, might also be considered named entities. Examples of named entities include: United States of America, Paris, Google, Mercedes Benz, Microsoft Windows, or anything else that can be named.

The role of named entities has become more and more important in NLP. Their information is crucial in information extraction. As recent systems mostly rely on machine learning techniques, their performance depends on the size and quality of the given training data. This data is expensive and cumbersome to create, because experts usually annotate corpora manually to achieve high-quality data. As a result, these data sets often lack coverage, are not up to date, and are not available in many languages. To overcome this problem, semi-automatic methods for resource construction from other available sources were deployed. One of these sources is Wikipedia.

The method presented here has been used to build a Python application which extracts the English-Ukrainian pairs from Wikipedia and classifies them using the English Wikipedia category system. Since both English and Ukrainian are among the languages with the most articles on Wikipedia, the result is a huge lexicon.

The goal of this paper is to present a method of extracting multilingual lexicons of classified named entities from Wikipedia. The method has been implemented to build a huge English-Ukrainian lexicon of named entities.

2. Related work

Building multilingual lexicons from Wikipedia has been a subject of research for more than 10 years. Schönhofen et al. (2007) exploited Wikipedia hyperlinkage for query term disambiguation. Tyers and Pienaar (2008) described a simple, fast, and computationally inexpensive method for extracting bilingual dictionary entries from Wikipedia (using the interwiki link system) and assessed the performance of this method on four language pairs. Yu and Tsujii (2009) proposed a method using the interlanguage links in Wikipedia to build an English-Chinese lexicon. Knopp (2010) showed how to use the Wikipedia category system to classify named entities. Bøhn and Nørvåg (2010) described how to use Wikipedia contents to automatically generate a lexicon of named entities and synonyms that all refer to the same entity. Hálek et al. (2011) attempted to improve the machine translation of named entities from English by using Wikipedia. Ivanova (2012) evaluated a bilingual bidirectional English-Russian dictionary created from Wikipedia article titles. Higashinaka et al. (2012) aimed to create a lexicon of 200 extended named entity (ENE) types, which could enable fine-grained information extraction. Oussalah and Mohamed (2014) demonstrated how to use infoboxes in order to identify and extract named entities from Wikipedia.

3. Wikipedia

Wikipedia is a free online encyclopedia, created and maintained as an open collaborative project by a network of volunteer editors using a wiki-based editing system. Hosted and supported by the Wikimedia Foundation since its start in 2001, the site has grown in both popularity and size. At the time of writing this paper (March 2022), Wikipedia contained over 58 million articles in 323 languages; its English version has over 6 million articles. The richness of its information and texts continuously makes it an object of special research interest in the NLP (Natural Language Processing) community. Attracting approximately 6 billion visits per month (Statista, 2021), it is the largest and most popular general reference work on the World Wide Web.

3.1. Wikipedia as a source

Even though Wikipedia is not made and maintained by linguists, metadata about articles, for instance translations, disambiguations, or categorizations, is accessible. Its structural features, size, and multilingual availability provide a reasonable base for deriving specialized resources, like multilingual lexicons (Bøhn and Nørvag, 2010). Researchers have found that around 74% of Wikipedia pages describe named entities (Nothman et al., 2008), a clear indication of Wikipedia's high coverage of named entities. Each Wikipedia article associated with a named entity is identified by its title, which is itself a named entity. That is a perfect opportunity to build parallel lexicons of named entities between languages.

Wikipedia is a very cheap source of multilingual lexicons of named entities. Its database dump can be freely downloaded in SQL and XML formats. But, taking into account the fact that Wikipedia articles have been written by millions of contributors, a question arises: what is the quality of these lexicons, and how reliable are they for use, e.g., in machine translation?

3.2. English and Ukrainian Wikipedias

The English Wikipedia is the English-language edition of the Wikipedia online encyclopedia. English is the first language in which Wikipedia was written. It was started on 15 January 2001 (Wikimedia Foundation, 2022b), but versions of Wikipedia in other languages were quickly developed. Among these versions, there is one in the Ukrainian language. The Ukrainian Wikipedia (Wikimedia Foundation, 2022c), written in the Cyrillic alphabet, was initiated in 2004.

A list of all Wikipedias is published regularly on the Internet, along with several parameters for each language (Wikimedia Foundation, 2022a). Four parameters are important: the number of articles; the total number of pages (articles, user pages, images, talk pages, project pages, categories, and templates); the number of active users (registered users who performed at least one change in the last thirty days); and the depth (a rough indicator of the quality of a Wikipedia, which shows how often articles are updated).

As shown in Table 1, as of 26 March 2022, the English Wikipedia contains 6,473,638 articles and 55,472,454 pages. There are 127,722 active users. The depth value is 1,110. It is by far the largest edition of Wikipedia. The Ukrainian Wikipedia contains 1,144,596 articles and 3,992,549 pages. There are 2,702 active users. The depth value is 54. It is the 17th largest edition of Wikipedia by number of articles.

Parameter                 en            uk
Number of articles        6,473,638     1,144,596
Total number of pages     55,472,454    3,992,549
Number of active users    127,722       2,702
Depth                     1,110         54

Table 1: Parameters of the English and Ukrainian Wikipedias.

4. Method

The flowchart presented in Figure 1 shows the process used for building the lexicon.

Figure 1: The process flowchart.

1. Extract title pairs with English as the first language

For building multilingual lexicons, two tables from the database are necessary: the table of pages and the table of interlanguage links. The page table is the "core of the wiki". It contains titles and other essential metadata for the different Wikipedia namespaces. The interlanguage links table contains links between pages in different languages. Using these two tables, it is an easy programming task to create huge bilingual dictionaries without any language expertise.

2. Filter out irrelevant title pairs

The title pairs extracted in the previous step contain a lot of noise. This step deals with it. First, the algorithm removes all titles that do not belong to the main, template, or category namespaces. Second, there are titles containing words or word stems that increase the noise and should be filtered out: the page table contains many entries that could not be part of any lexicon, like user names, nicknames, template names, etc. There are also titles containing exclusively digits or blanks, which should be removed too.

3. Classify the remaining title pairs using the English Wikipedia category system

In order to classify the extracted named entities, one additional table from the database is required: the table of category links. The task of classifying named entities by means of category links is more complex. Wikipedia articles are generally members of categories. A category may have subcategories, each subcategory its own subcategories, etc. The problem is that the graph can be cyclic, which may cause the algorithm to go into an endless loop.

Various authors propose different classes for named entities. Here, there are five: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC. Each named entity belongs to at least one of these classes. The classes comprise:

• ORGANIZATION: political organizations, companies, schools, rock bands, sport teams
• PERSON: humans, gods, saints, fictional characters
• LOCATION: geographical terms, fictional places, cosmic terms
• PRODUCT: industrial products, software products, weapons, artworks, documents, concepts, standards, laws, formats, anthems, algorithms, journals, coats of arms, platforms, websites
• MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, battles, competitions, diseases, programs, sets of locations, awards, musical genres, missions, artistic directions, sets of organizations, networks

4. Filter out title pairs classified as non-named entities

Most Wikipedia titles are named entities, but not all of them. For example, certain natural terms, like biological species and substances, which are very common on Wikipedia, are not included in the lexicon.

5. Convert the resulting data into CSV and XML formats

The lexicon comes in two formats: CSV and XML. The first row in the CSV file is a title row, and tab is used as the field separator. The column titles are: en, uk, PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC. All other rows contain the data: the English name, the Ukrainian name, and five binary digits. These digits denote the classes the named entity belongs to. For example, according to Figure 2, the named entity Odessa belongs to the class LOCATION, since the LOCATION column contains 1. All other class columns contain 0s.

Figure 2: A lexicon entry in CSV format.

The structure of the XML file is similar. An equivalent of the entry from Figure 2 is shown in Figure 3. The column names en and uk from the CSV file are now names of elements, and class denotes the classification.

Figure 3: A lexicon entry in XML format.

5. Results

The method presented in the previous chapter has been used to build a Python application which extracts title pairs independently of the languages involved. This application was applied to the Wikipedia database to extract the English-Ukrainian pairs of named entities. The result of the extraction after the first two steps from Figure 1 was 687,799 pairs. After filtering out non-named entities, 624,168 pairs remained. One part of the lexicon is presented in Figure 4.

Figure 4: A part of the lexicon.

In realizing steps 2-3 of Figure 1, which refer to the noise reduction and classification of named entities, the experience of creating a parallel lexicon of named entities from English to South Slavic languages (Slovenian, Croatian, Serbian, Bosnian, Macedonian, and Bulgarian) (Petrovski, 2019) was of great benefit. That lexicon contains 26,155 entries, and its steps 2-3 were done manually. The same methodology has been used to create a multilingual English-Hebrew-Yiddish-Ladino lexicon of named entities; a tool that can be used to search it can be found on the Internet (Petrovski, 2021).

The distribution of classes is presented in Table 2.

Class          Number
PERSON         142,850
ORGANIZATION   39,348
LOCATION       237,229
PRODUCT        56,952
MISC           159,952
Total          636,331

Table 2: Distribution of classes.

Class          en     uk
PERSON         93%    92%
ORGANIZATION   87%    78%
LOCATION       49%    42%
PRODUCT        80%    76%
MISC           92%    89%
All            75%    70%

Table 4: Percentage of multiword NEs per class.
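The tab-separated layout described in step 5 of Section 4 is straightforward to consume programmatically. The following is a minimal sketch of reading such entries into (English, Ukrainian, classes) triples; the sample rows mirror the Odessa and Kherson State University examples from the paper, while the helper name `read_lexicon` and the exact header spelling are illustrative assumptions based on the paper's description of the format:

```python
import csv
import io

# A two-row sample in the described layout: a tab-separated title row
# followed by entries with five binary class columns (1 = member of class).
SAMPLE = (
    "en\tuk\tPERSON\tORGANIZATION\tLOCATION\tPRODUCT\tMISC\n"
    "Odessa\tОдеса\t0\t0\t1\t0\t0\n"
    "Kherson State University\tХерсонський державний університет\t0\t1\t1\t0\t0\n"
)

CLASSES = ("PERSON", "ORGANIZATION", "LOCATION", "PRODUCT", "MISC")

def read_lexicon(fileobj):
    """Yield (english, ukrainian, set_of_classes) from the tab-separated lexicon."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    for row in reader:
        classes = {name for name in CLASSES if row[name] == "1"}
        yield row["en"], row["uk"], classes

entries = list(read_lexicon(io.StringIO(SAMPLE)))
```

Note that an entry may carry more than one 1, as with Kherson State University above; this is exactly why the class total (636,331) exceeds the number of entries (624,168).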
The total number of classes, 636,331, is slightly higher than the number of entries, since some named entities may belong to more than one class. The lexicon entry presented in Figure 5 is such an example: Kherson State University is classified as both ORGANIZATION (the university as an educational organization) and LOCATION (the building where the organization is located).

Figure 5: A lexicon entry belonging to two classes.

It is expected that most Wikipedia titles are multiwords, i.e. that they contain either a space or a hyphen. Table 3 shows the number of multiword NEs per class in the lexicon for both English and Ukrainian.

Class          en         uk
PERSON         132,219    131,354
ORGANIZATION   34,114     30,509
LOCATION       116,974    99,399
PRODUCT        45,781     43,378
MISC           146,498    141,665
Total          475,586    446,305

Table 3: Number of multiword NEs per class.

Table 4 shows the percentage of multiword NEs per class. It can be seen that the percentage of multiwords is higher in the English than in the Ukrainian Wikipedia. This is most noticeable in the classes ORGANIZATION and LOCATION. Some examples from the lexicon where there is a multiword in English and a single word in Ukrainian are given in Table 5 for the class ORGANIZATION and in Table 6 for the class LOCATION.

en                        uk
Malkiya Club              Малкія
Dnipro Kherson            Дніпро
Sharjah FC                Шарджа
Shin Bet                  Шабак
Newtown A.F.C.            Ньютаун
The Day After Tomorrow    Післязавтра

Table 5: Examples of multiwords in English and single words in Ukrainian, class ORGANIZATION.

Contributors to the English Wikipedia add words to the base title which define it in more detail, or it is simply a matter of adding a definite article, e.g. Sacramento, California - Сакраменто, Malkiya Club - Малкія, The Acropolis - Акрополіс.

en                        uk
Malmö Airport             Мальме
Shintoku, Hokkaido        Сінтоку
Amarillo, Texas           Амарилло
Sacramento, California    Сакраменто
The Dakota                Дакота
The Acropolis             Акрополіс

Table 6: Examples of multiwords in English and single words in Ukrainian, class LOCATION.

6. Evaluation of classification

To evaluate the classification, two common metrics from information retrieval have been used: precision and recall. Precision refers to the percentage of assigned classes that are correct. Recall, on the other hand, refers to the percentage of all relevant classes correctly assigned by the algorithm. An alternative to having two measures is the F-measure, which combines precision and recall into a single performance measure. This metric is known as the F1-score, which is simply the harmonic mean of precision and recall.

In order to evaluate the classification, a random sample containing 1,000 entries has been extracted from the lexicon. The entries from the sample have been classified manually and then compared to the classification performed by the algorithm. The results are presented in Table 7.

Class          Precision    Recall    F1-score
PERSON         99%          97%       98%
ORGANIZATION   94%          87%       90%
LOCATION       98%          92%       95%
PRODUCT        96%          83%       89%
MISC           96%          83%       89%
All            97%          90%       93%

Table 7: The results of the classification check.

The precision of the classification is between 94% for ORGANIZATION and 99% for PERSON. The recall is slightly lower, from 83% for PRODUCT and MISC to 97% for PERSON. The overall results are 97% for precision and 90% for recall.

The higher values of precision show that the classification algorithm was tuned to classify the named entities correctly, rather than to extract more named entities for the lexicon.

7. Conclusion

Using the methodology presented in this paper, an English-Ukrainian lexicon of named entities has been created. Its size is 624,168 pairs. The named entities have been classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The quality of the classification has been assessed: 97% for precision and 90% for recall.

The lexicon is available at (Petrovski, 2022) under the CC-BY-NC-4.0 license (free for non-commercial use).

Lexicons like the one presented in this paper can be used in machine translation (MT). Most statistical MT systems do not deal explicitly with named entities, simply relying on the model to select the correct translation, i.e., often mistranslating them as generic nouns. It is also possible that, when not identified, named entities may be left out of the output translation, which also has implications for the readability of the text. Because most NEs are rare in texts, statistical MT systems are not capable of producing quality translations for them. Another problem with MT systems is that failure to recognize NEs often harms the morpho-syntactic and lexical context outside of the NEs themselves. If named entities are not immediately identified, certain morphological features of adjacent and syntactically related words, as well as word order, may be incorrect. It can be concluded that the identification of named entities in the source text is the first task of machine translators (Hálek et al., 2011). However, developers of commercial MT systems often do not pay enough attention to the correct automatic identification of certain types of NEs, e.g. names of organizations. This is partly due to the greater complexity of this problem (the set of proper nouns is open and very dynamic), and partly due to a lack of time and other development resources. One solution to this problem is using a parallel lexicon of named entities. If the lexicon contains a translation of the named entity, the translation quality will probably be good.

The European Commission called for language data in Ukrainian to/from all EU languages to train automatic translation systems (European Commission, 2022; European Union's Horizon 2020 Research and Innovation Programme, 2020), supporting refugees and helpers in the Ukraine crisis. This lexicon was sent to the ELRC (European Language Resource Coordination) Secretariat as a response.

8. References

Christian Bøhn and Kjetil Nørvag. 2010. Extracting Named Entities and Synonyms from Wikipedia. In Proceedings of the International Conference on Advanced Information Networking and Applications, pages 1300–1307.
European Commission. 2022. Digital Europe Programme Language Technologies. https://language-tools.ec.europa.eu/.
European Union's Horizon 2020 Research and Innovation Programme. 2020. Bergamot Translations. https://translatelocally.com/web/.
Ryuichiro Higashinaka, Kugatsu Sadamitsu, Kuniko Saito, Toshiro Makino, and Yoshihiro Matsuo. 2012. Creating an Extended Named Entity Dictionary from Wikipedia. In Proceedings of COLING 2012: Technical Papers, pages 1163–1178.
Ondrej Hálek, Rudolf Rosa, Aleš Tamchyna, and Ondrej Bojar. 2011. Named Entities from Wikipedia for Machine Translation. In Proceedings of the Conference on Theory and Practice of Information Technologies, pages 23–30.
Angelina Ivanova. 2012. Evaluation of a Bilingual Dictionary Extracted from Wikipedia. In Computer Science.
Johannes Knopp. 2010. Classification of Named Entities in a Large Multilingual Resource Using the Wikipedia Category System. Master's thesis, University of Heidelberg, Heidelberg, Germany.
Joel Nothman, James Curran, and Tara Murphy. 2008. Transforming Wikipedia into Named Entity Training Data. In Proceedings of the Australian Language Technology Workshop.
Mourad Oussalah and Muhidin Mohamed. 2014. Identifying and Extracting Named Entities from Wikipedia Database Using Entity Infoboxes. International Journal of Advanced Computer Science and Applications, 5:164–169.
Aleksandar Petrovski. 2019. EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages. http://catalogue.elra.info/en-us/repository/browse/ELRA-M0051/.
Aleksandar Petrovski. 2021. Jewish Lexicons of Named Entities. https://www.jewishlex.org/.
Aleksandar Petrovski. 2022. A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia. https://catalogue.elra.info/en-us/repository/browse/ELRA-M0104/.
Péter Schönhofen, András Benczúr, Istvan Biro, and Károly Csalogány. 2007. Cross-Language Retrieval with Wikipedia. In Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, volume 5152, pages 72–79.
Statista. 2021. Worldwide visits to Wikipedia.org from January to June 2021. https://www.statista.com/statistics/1259907/wikipedia-website-traffic/.
Francis M. Tyers and Jacques A. Pienaar. 2008. Extracting Bilingual Word Pairs from Wikipedia. In Proceedings of the SALTMIL Workshop at the Language Resources and Evaluation Conference, LREC 2008.
Wikimedia Foundation. 2022a. List of Wikipedias – Meta. https://meta.wikimedia.org/wiki/List_of_Wikipedias.
Wikimedia Foundation. 2022b. Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/wiki/Main_Page.
Wikimedia Foundation. 2022c. Wikipedia, the Free Encyclopedia. https://uk.wikipedia.org/wiki/Main_Page.
Kun Yu and Jun'ichi Tsujii. 2009. Bilingual Dictionary Extraction from Wikipedia. In Machine Translation Summit, volume 12.
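As an illustration of step 1 of Section 4 above, the title-pair extraction is essentially a join between Wikipedia's page table and its interlanguage-links table. A minimal sketch of that join follows, using an in-memory SQLite stand-in for the two dump tables; the toy rows, the helper name `extract_title_pairs`, and the reduced column set are illustrative assumptions (the real MediaWiki `page` and `langlinks` tables carry more columns):

```python
import sqlite3

# Toy stand-ins for the MediaWiki `page` and `langlinks` dump tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT);
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
    (1, 0, "Odessa"),
    (2, 0, "Kherson_State_University"),
    (3, 2, "Example_user_page"),   # User namespace: noise, filtered out in step 2
])
conn.executemany("INSERT INTO langlinks VALUES (?, ?, ?)", [
    (1, "uk", "Одеса"),
    (2, "uk", "Херсонський державний університет"),
    (3, "uk", "Приклад"),
])

def extract_title_pairs(conn, lang="uk", namespaces=(0, 10, 14)):
    """Join page titles with their interlanguage links (step 1) and keep
    only the wanted namespaces (the namespace filter of step 2)."""
    placeholders = ",".join("?" * len(namespaces))
    rows = conn.execute(
        f"""SELECT page_title, ll_title FROM page
            JOIN langlinks ON page_id = ll_from
            WHERE ll_lang = ? AND page_namespace IN ({placeholders})""",
        (lang, *namespaces),
    )
    # Dump titles use underscores instead of spaces.
    return [(en.replace("_", " "), target) for en, target in rows]

pairs = extract_title_pairs(conn)
```

The namespace IDs 0, 10, and 14 are MediaWiki's identifiers for the main, template, and category namespaces mentioned in step 2; no knowledge of the target language is needed at any point, which is what makes the approach so cheap.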
PRISPEVKI 153 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Serbian Early Printed Books: Towards Generic Model for Automatic Text Recognition using Transkribus Vladimir Polomac* * Serbian Language Department, Faculty of Philology and Arts, University of Kragujevac Jovana Cvijića bb, 34 000 Kragujevac, Serbia v.polomac@filum.kg.ac.rs Abstract The paper describes the process of creating and evaluating a new version of the generic model for automatic text recognition of Serbian Church Slavonic printed books within the Transkribus software platform, based on the principles of artificial intelligence and machine learning. The generic model Dionisio 2.0. was created on the materials of Serbian Church Slavonic books from various printing houses of the 15th and 16th centuries (Cetinje, Venice, Goražde, Mileševa, Gračanica, Belgrade and Mrkša’s Church), and, during the evaluation of its performance, it was noticed that CER was about 2–3%. The Dionisio 2.0. model will be publicly available to all users of the Transkribus software platform in the near future. The Dionisio 1.0. model structure is shown in Table 1, 1. Introduction and its performance is displayed in Table 2. The research on creating a model for automatic text recognition of the Serbian Church Slavonic printed books Book Word count from Venice using a software platform Transkribus,1 Prayer Book (1538–1540) 39,889 presented in Polomac (2022), represents the starting point Psalter (1519–1520) 10,132 for this paper. 
This paper describes the process of Miscellany for Travellers (1536) 10,618 transcription and creation of a specific model2 for Festal Menaion (1538) 10,732 automatic text recognition of Prayer Book (Euchologion) printed between 1538 and 1540 in the printing house of Miscellany for Travellers (1547) 10,006 Božidar Vuković,3 as well as the process of creating a Hieratikon (Liturgikon) (1554) 10,196 generic model4 for automatic text recognition of other Total 91,573 books printed in Venice in the printing house of Božidar Vuković and his son Vićenco.5 The most important result Table 1: Dionisio 1.0. Structure and the Amount of of this paper is the creation of the first version of the model Training Data. Dionisio 1.0. (named after an Italian pseudonym for Božidar Vuković – Dionisio della Vechia) representing the Word Number CER7 on CER on first publicly available resource for automatic reading of count of epochs6 Train set Validation set Serbian Church Slavonic manuscripts and printed books within the Transkribus software platform (cf. 86,347 100 1.66% 2.09% https://readcoop.eu/model/dionisio-1-0/). Table 2: Dionisio 1.0 Performance. 1 Transkribus (https://readcoop.eu/transkribus) represents an aimed at the Serbian Orthodox Church and its flock under open-access software platform for automatic text recognition and Ottoman rule, yet the motives of his printing business were not retrieval developed as part of the READ project at the University only patriotic and religious, but also mercantile and financial of Innsbruck. More details about the technological background (Lazić, 2020b). and operating system cf. Mühlberger et al. (2019). 4 Unlike a specific model that is trained to recognize a single 2 The functionality of the Transkribus platform is particularly manuscript or printed book, a generic model contains material manifested in the potential to train one’s own automatic text from different manuscripts or printed books. 
More details on the recognition model, irrespective of the language or script used in possibilities and pitfalls of training generic models can be found the manuscript. The training of the automatic recognition model in Rabus (2019b). represents an instance of machine learning based on neural 5 After the death of Božidar Vuković, Vićenco Vuković had networks in which during the learning process the model reprinted several of his father's editions until 1561, and later compares the manuscript photographs and corresponding letters, rented his equipment to other Venetian printers. For more details words and lines of the text in the diplomatic edition. For more about his life and work see also Pešikan (1994). details see Mühlberger et al. (2019) and Rabus (2019a). 6 The term epoch in machine learning stands for ”one complete 3 Božidar Vuković was a Serbian merchant from Zeta (Podgorica presentation of the data set to be learned to a learning machine“ and the area surrounding Lake Skadar). After his arrival at Venice (Burlacu and Rabus, 2021). (in 1516 at the latest) he acculturated his Serbian name to the new 7 The Character Error Rate (CER) is calculated by comparing the environment by creating a Latin ( Dionisius a Vetula) and an automatically generated text and the manually corrected version. Italian pseudonym ( Dionisio della Vecchia) from his Serbian See for more details in Transkribus Glossary name and the toponym of Starčeva Gorica (at Lake Skadar), https://readcoop.eu/glossary/character-error-rate-cer/. indicating his origin (Lazić, 2018). Books from his printery were PRISPEVKI 154 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 In the continuation of the research, we aimed at examining the performance of the Dionisio 1.0. 
model on Serbian Church Slavonic books created in other printing houses: firstly in the Venetian printing houses established after the closing of Božidar and Vićenco Vuković's printing house, and then in the other old Serbian printing houses of the 15th and 16th centuries (Cetinje, Goražde, Mrkša's Church, Belgrade, Mileševa and Gračanica), thus ultimately offering a generic model for the automatic text recognition of Serbian Church Slavonic printed books as a whole.

2. Applying the Dionisio 1.0. Model on Books from Other Venetian Printing Houses

In the first experiment, we tested the performance of the Dionisio 1.0. model on several Serbian Church Slavonic books printed in Venice after the closing of Božidar and Vićenco Vuković's printing house: Lenten Triodion was printed in 1561 by Stefan of Scutari in Camillo Zanetti's printing house, Prayer Book (Miscellany for Travellers) was printed in 1566 by Jakov of Kamena Reka, Prayer Book (Euchologion) was created in 1570 in the printing house of Jerolim Zagurović, and Psalter with Appendices was printed in 1638 in the printing house of Bartol Ginammi (Pešikan, 1994). The starting hypothesis of the current experiment was that the model trained on the materials of Serbian Church Slavonic books from the printing house of Božidar and Vićenco Vuković would be useful for the automatic text recognition of other Venetian editions printed using their printing equipment.

The statistical results of the experiment are shown in the following table.

Book                               CER
Lenten Triodion (1561)             9.41%
Miscellany for Travellers (1566)   11.63%
Prayer Book (Euchologion) (1570)   13.67%
Psalter with Appendices (1638)     16.04%

Table 3: Application of the Dionisio 1.0. model on publications from other Venetian printing houses.

The unexpectedly high CER does not necessarily indicate poor performance of the Dionisio 1.0. model. The largest number of errors in text recognition results from the fact that in these books accent marks are used differently than in the books from the printing house of Božidar and Vićenco Vuković, which were used to train the Dionisio 1.0. model. This is especially evident in Prayer Book (Euchologion) from the printing house of Jerolim Zagurović (1570) and Psalter with Appendices from the printing house of Bartol Ginammi (1638), in which only the spiritus lenis with an oxia over the initial vowel grapheme was used. To illustrate this claim, we use a comparative presentation of a photograph of a part of sheet 2b of Prayer Book (Euchologion) (1570) and the text automatically read using the Dionisio 1.0. model.

Figure 1: The Automatically Read Text of a Segment of Sheet 2b Prayer Book (Euchologion) from 1570.

The greatest number of errors in text recognition refers to cases in which the model outputs accent marks in accordance with the material on which it was trained, although in the text of Prayer Book (Euchologion) these marks were not used: instead of щедротами 1/2, твоѥго 2, ними 2, бл҃свень ѥси 3, животворещимь 3/4, дх҃омь 4, присноива 4, мои 5, твоимь 5, наѳаномь 6, своихь 7, прѣгрѣшенихь 7, ѥмоу 7, подасть 8, манасѵно 8, покаꙗнїе 8, the model outputs щедро́тами 1/2, твоѥ҆го 2, ни́ми 2, бл҃све́нь ѥ҆си 3, жи́вотво́рещи̏мь 3/4, дх҃о́мь 4, при́снои́ва 4, моѝ 5, твои҆мь 5, наѳа́номь 6, свои҆хь 7, прѣгрѣше́нихь 7, ѥ҆моу 7, пода́сть 8, ма́насѵно 8, покаꙗ҆нїе 8. Along with the accent marks, the model incorrectly reads a pajerak mark in two examples only: instead of ѥдинороднаго 2, покаꙗвшоу 6 there is the incorrect ѥ҆ди́норо́д на̏го 2, покаꙗ҆в шоу 6. In one example, instead of an oxia there is an incorrect double circumflex: instead of бл҃гы́мь 3 there is the incorrect бл҃гы̏мь 3.

The same problem is exhibited by the comparative presentation of the photograph of a part of sheet 5b of Psalter with Appendices (1638) and the automatically read text.

Figure 2: The Automatically Read Text of a Part of Sheet 5b Psalter with Appendices from 1638.
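The CER values reported in Table 3 compare the automatically generated transcript with a manually corrected reference (cf. footnote 7). As an illustration only (Transkribus computes the metric internally), such a character-level error rate can be sketched as edit distance divided by reference length:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance: the minimum number of character
    # insertions, deletions and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(automatic: str, reference: str) -> float:
    # Character Error Rate: edit distance normalized by the length
    # of the manually corrected reference transcript.
    return levenshtein(automatic, reference) / len(reference)
```

On this definition, a CER of 9.41% means that roughly one character in eleven needs an edit to match the corrected transcript.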
Here, too, the largest number of errors refers to cases in which the Dionisio 1.0. model outputs accent marks according to the patterns of their use in the Venetian books that served for its training, although in the text of Ginammi's Psalter with Appendices these marks were not used. Thus, instead of вьзвахь 4, оуслиша 4, правди 4/5, моѥ 5, скрьбїи 5, распространиль 5, ме 5, ѥсїи 6, оущедри 6, ѹслиши 6, мою 7, сн҃ове 7, до колѣ 7, тешкосрьдїи 7, вьскоую 8, любыте 8, соуѥтннаа 8, льжоу 9, оувѣдите 9, ꙗко 9, оудивїи 9, the model incorrectly outputs вьзва́хь 4, оу῎сли́ша 4, пра́в дѝ 4/5, моѥ҅ 5, скрь́бїѝ 5, распростра́ниль 5, ме́- 5, ѥ҆сїи 6, оу῎ще́дри 6, ѹ῎сли́ши 6, мою̀ 7, сн҃о́ве 7, до ко́лѣ 7, те́шкосрь́дїи 7, вьскоую̀ 8, лю́быте 8, соуѥ҆тннаа 8, ль́жоу 9, оу῎вѣ́дите 9, ꙗ῎ко 9, оу῎ди́вїи 9. Here, as well, the other types of errors are confirmed by isolated examples: pajerak mark: instead of правди 1/2 there is the incorrect пра́в дѝ 1/2; space between words: instead of ме 5 the incorrect ме́- 5; initials: instead of Вьнѥгд҃а 4 the incorrect ьнѥгд҃а 4; incorrect accent recognition: instead of и῎ 6 there is the incorrect и҆ 6.

The given examples of the most common errors show that, despite the high percentage of incorrectly recognized characters, after automatic post-correction of the transcripts, which would include the removal of accent marks using the Search/Replace chosen chars in transcript option, the Dionisio 1.0. model can also be very efficient in recognizing Serbian Church Slavonic books created in the printing houses of Jerolim Zagurović and Bartol Ginammi during the 16th and 17th centuries.

The greatest number of errors in the automatic recognition of the text of Lenten Triodon (1561) by Stefan of Scutari and Prayer Book (Miscellany for Travellers) (1566) by Jakov of Kamena Reka also refers to the recognition of accent marks. However, what distinguishes these books from those from the printing houses of Jerolim Zagurović and Bartol Ginammi is that accent marks are actually used, yet in different positions compared to the books from the printing house of Božidar and Vićenco Vuković on which the Dionisio 1.0. model was trained. To illustrate this claim, we will first use a comparative presentation of a part of sheet 3a of Lenten Triodon (1561) by Stefan of Scutari and the text automatically read using the Dionisio 1.0. model.

Figure 3: The Automatically Read Text of a Part of Sheet 3a Lenten Triodon from 1561.

Errors in accent mark recognition: instead of валеща 1, соуѥ́тною 2, съкроуше́ннѣи҅ 2, срⷣцоу 2/3, тебѣ̀ 3, ѡ҆цѣ́- 3, ѻ῎ставлѥнїе 5, пѐщь сьтвори̏ 8, хал- 9/10, the model incorrectly outputs ва́леща̏ 1, соуѥ҆тною̀ 1, съкроу́ше́н нѣи҅ 2, срцоу́ 2/3, те́бѣ̀ 3, ѡ῎цѣ́- 3, ѻ῎ста́в лѥнїе 5, пе́щь́сь тво́ри̏ 8, ха́л - 9/10. Errors in recognizing spaces between words are also of high frequency: instead of да́- 1, цѹ зовоу́щоу 3, пѣⷭ з҃ 7, ꙋбо пѐщь сьтвори̏ 8, а῎г- 8, ст҃ыимъ дѣ́темь 9, the model incorrectly outputs да́ 1, цѹ́зовоу́щоу 3, пѣз҃ 7, ꙋбопе́щь́сь тво́ри̏ 8, а῎г 8, ст҃ыимъдѣ́темь 9. In a smaller number of examples, errors in recognizing the pajerak mark, superscript letters and the titlo mark can be found: instead of съкроуше́ннѣи҅ 2, ѻ῎ставлѥнїе 5, хал- 9/10, срⷣ- 2, 7, the model incorrectly outputs съкроу́ше́н нѣи҅ 2, ѻ῎ста́в лѥнїе 5, ха́л - 9/10, ср- 2, пѣ 7.

A comparative presentation of a part of sheet 7a of Prayer Book (Miscellany for Travellers) from 1566 and the text automatically read using the Dionisio 1.0. model displays similar errors.

Figure 4: The Automatically Read Text of a Part of Sheet 7a Prayer Book (Miscellany for Travellers) from 1566.

Errors in recognizing accents: instead of небо 1, землꙗ 1, похва́лите ю̏ 1, ѿчьствїа 1/2, і҆езыкь̏ 2, весе́лит 2, трьжьствꙋѥ῎ть 3, неплѡди 3, раждаѥ῎- 3, питател ницꙋ 4, жизны 4, на́шеѥ῎ 4, и 5, мꙋченикь 5, кондакь 6, the model incorrectly outputs не бо̀ 1, землꙗ̀ 1, похва́литею̀ 1, ѿчь́ствїа 1/2, і҆е҆зы́кь̏ 2, весе́лит 2, трь́жьствꙋѥ҆ть 3, неплѡ́ди 3, раждаѥ҆ 3, пи́тател ницꙋ 4, жи́зны 4, на́шеѥ 4, и῎ 5, мꙋче́никь 5, воѥ҆дакь 6.
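Much of the CER in these books thus comes from accent marks alone; as noted above, they can be removed in an automatic post-correction step (in Transkribus via the Search/Replace chosen chars in transcript option). Outside the platform, the same normalization can be sketched with Unicode decomposition. This is a minimal illustration, not the described workflow: in particular, the choice to keep the titlo (U+0483, an abbreviation mark) while dropping all other combining marks is our assumption.

```python
import unicodedata

# Combining marks to preserve during post-correction (assumed set):
# the titlo (U+0483) carries abbreviation information, unlike accents.
KEEP = frozenset("\u0483")

def strip_accents(text: str, keep: frozenset = KEEP) -> str:
    # NFD decomposition turns accented letters into base letter plus
    # combining marks; drop every combining mark (category Mn) that is
    # not explicitly kept, then recompose with NFC.
    decomposed = unicodedata.normalize("NFD", text)
    cleaned = "".join(ch for ch in decomposed
                      if unicodedata.category(ch) != "Mn" or ch in keep)
    return unicodedata.normalize("NFC", cleaned)

# e.g. strip_accents("щедро́тами") == "щедротами"
```

A transcript normalized this way can then be compared against an equally normalized reference, so that the error rate reflects letter and word-division errors rather than accentuation conventions.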
A certain number of errors is connected to recognizing spaces between words: instead of небо 1, похва́лите ю̏ 1, раждаѥ῎- 3, свѣт мѹ 5, the model incorrectly outputs не бо̀ 1, похва́литею̀ 1, раждаѥ҆ 3, свѣтмѹ 5. Several errors in recognizing letters may perhaps be related to the poor quality of the photograph: instead of сі 5, кондакь 6, the model incorrectly outputs сь 5, воѥ҆дакь 6.

The illustrated examples of the most frequent errors in Lenten Triodon (1561) and Prayer Book (Miscellany for Travellers) (1566) show that the Dionisio 1.0. model can be used for obtaining transcripts that can, after appropriate manual correction, serve for creating specific models for the automatic text recognition of these two books.

3. Applying the Dionisio 1.0. Model on Books from Other Serbian Printing Houses of the 15th and 16th Centuries

In the second experiment, the performance of the Dionisio 1.0. model was tested on selected books from other printing houses of the 15th and 16th centuries (Cetinje, Goražde, Gračanica, Mileševa, Belgrade and Mrkša's Church). During the research, we started from the hypothesis that the model trained on the material of books from the Venetian printing house of the Vukovićs would be useful for books from other printing houses, since Serbian early printed books do not display as many orthographic variations as medieval manuscripts do. The results of the experiment are shown in the following table.
Book (Printing House, Year)                   CER
Octoechos, mode 1–4 (Cetinje, 1495)           8.24%
Psalter with Appendices (Goražde, 1519)       6.44%
Octoechos, mode 5–8 (Gračanica, 1539)         11.11%
Prayer Book (Euchologion) (Mileševa, 1546)    5.43%
Tetraevangelion (Belgrade, 1552)              11.28%
Tetraevangelion (Mrkša's Church, 1562)        12.06%

Table 4: Application of the Dionisio 1.0. model on publications from other printing houses of the 15th and 16th centuries.

Based on the previous table, it can be concluded that the Dionisio 1.0. model achieved the best results in the automatic recognition of the text of Prayer Book (Euchologion) (1546) from the printing house of the Mileševa monastery and Psalter with Appendices (1519) from the Goražde printing house. These results can be explained by the fact that Prayer Book (Euchologion) (1546) had been printed in Mileševa with the same typographic characters as Psalter with Appendices (1521) from Božidar Vuković's printing house, as well as by the fact that Psalter with Appendices (1519) was printed in Goražde using typographic equipment imported from Venice (Lazić, 2020a).8

To illustrate the efficiency of the Dionisio 1.0. model, we may first use the comparative presentation of the photograph of a part of sheet 5b of Prayer Book (Euchologion) (1546) from the printing house of the Mileševa monastery and the automatically read text in Figure 5. In this book, as well, the greatest number of errors refers to the recognition of accent marks: instead of и҆ме́ни 4, и҆сти́ныи 4, ѥ҆ди́норо́днааго 4/5, ст҃аго 5, и ме 7, сподобльшаго 7, ѡ῎ноу̀ 10, the Dionisio 1.0. model incorrectly outputs и҆ме́нѝ 4, и῎сти́ныѝ 4, ѥ῎ди́норо́днаа̀го 4/5, ст҃а́го 5, и῎ ме́ 7, сподо́бльшаго 7, ѡ῎ноу 10. Other errors are fewer in number and relate to recognizing initials, spaces between words and the pajerak mark: instead of Ѡ и҆ме́ни 4, подь 9 и дрѣ́внꙋю̀ 10, the model incorrectly reads и҆ме́нѝ 4, по дь 9 и дрѣ́в нꙋю̀ 10.

Figure 5: The Automatically Read Text of a Part of Sheet 5b Prayer Book (Euchologion) from 1546.

Similar errors are indicated by the comparative illustration of the photograph of a part of sheet 35a of Psalter with Appendices (1519) from the Goražde printing house and the text automatically read using the Dionisio 1.0. model.

Figure 6: The Automatically Read Text of a Part of Sheet 35a Psalter with Appendices from 1519.

8 Scholars likewise claim that Psalter with Appendices (1519) and Prayer Book (Euchologion) (1544) from the Goražde printing house could have been printed in Venice as well, which corresponds to the widespread practice of the time of placing a counterfeit place of printing in the colophons of editions (Lazić, 2020a).

The previous illustration demonstrates how the Dionisio 1.0. model makes its most frequent errors while recognizing accent marks: instead of пра́веднїи 14, подо́баѥть 15, по́хвала 15, исповѣ́даите се 15/16, ѱа́лтѝри 16/17, ѥ҆моу́ 17, добрѣ̀ 18, ѥ҆го 19, млⷭ 19, сꙋдь 19, the model incorrectly outputs пра́веднїѝ 14, подо́баѥ҆ть 15, похва́ла 15, и῎сповѣ́д аи҆те се 15/16, ѱа́л тѝрѝ 16/17, ѥ҆моу 17, до́брѣ̀ 18, ѥ҆го̀ 19, млⷭты́ню̀ 19, сꙋдь 19. The other errors pertain to recognizing spaces between words, the pajerak mark and initials: instead of пра́выи- 14, ѱа́лтѝ- 16, десе́тостроу́ннѣ 17, the model incorrectly reads пра́выи 14, ѱа́л тѝ 16, десе́то строу́ннѣ 17; instead of исповѣ́даите се 15/16, ѱа́лтѝ- 16 there is the incorrect и῎сповѣ́д аи҆те се 15/16, ѱа́л тѝ 16; instead of Рауⷣи́те се 14 there is the incorrect ПРауⷣи́те се 14. There is merely one example of an incorrectly recognized letter: instead of вьсклица́ни 19 the model incorrectly reads вь свлица́ни 19.

The Dionisio 1.0. model also shows a similar performance during the automatic recognition of the text of the oldest printed Serbian Church Slavonic book – Octoechos, mode 1–4 (1495) from the Cetinje printing house. The percentage of unrecognized characters is somewhat higher than in the previous two books due to poor photo quality and issues with recognizing certain letters and punctuation marks. To illustrate the efficiency of the model, we will use a comparative presentation of a part of sheet 33b and the automatically read text in the following figure.

Figure 7: The Automatically Read Text of a Part of Sheet 33b Octoechos, mode 1–4 from 1495.

In this book, too, the largest number of errors in the automatic text recognition occurs with accent marks: instead of е҆сьмь 8, и῎спль́нь 9, на́ 9, мою 10, твои 11, бѣсѡ́вска̏го 11/12, и҆зба́ви 12, ꙗ҆ко 13, сьзданїе 13, неи҆зрече́нною 14, ни́щетою 14/15, зе́млѥѝ 15, ꙗ҆ко 15, вьзми 16, и҆ 16, the Dionisio 1.0. model incorrectly reads е҆сь́мь 8, и҆спль̑нь 9, на 9, мою̀ 10, твоѝ 11, бѣсѡ́ в ска̀го 11/12, и῎зба́ви 12, ꙗ῎ко 13, сьзда́нїе 13, неи῎зрече́нною 14, ни́ще̏тою̀ 14/15, зе́млѥѝ 15, ꙗ῎ко 15, вь́зми 16, и῎ 16. The issues with recognizing spaces between words and the pajerak mark can be illustrated by the following examples: instead of ѡ῎боу́рева- 8, наде́ждоу 10, бѣсѡ́- 11 there is the incorrect ѡ῎боу́рева 8, на де́ждоу 10, бѣсѡ́ 11; instead of бѣсѡ́вска̏го 11/12 there is the incorrect бѣсѡ́ в ска̀го 11/12. In this book, as we have already mentioned, the Dionisio 1.0. model likewise incorrectly recognizes certain letters and punctuation marks: instead of ѿ 8, ѕы́ждитель 13, мл҃срдь 16 there is the incorrect ѡ῎ 8, бы́ждитель 13, мл҃содь 16; instead of невиди́мыихь, 11, и῎зба́ви :·12 there is the incorrect невиди́мыихь · ⁘ 11 и ῎зба́ви :·12.

In the rest of the books listed in Table 4 (Octoechos, mode 5–8 (1539) from Gračanica, Tetraevangelion (1552) from Belgrade and Tetraevangelion (1562) from Mrkša's Church), the CER is slightly higher, around 11–12%. The categories in which the Dionisio 1.0. model outputs errors are mostly the same in all three books, so we will only take a comparative presentation of a part of sheet 27b of Octoechos, mode 5–8 (1539) from Gračanica and the automatically read text as an illustration.

Figure 8: The Automatically Read Text of a Part of Sheet 27b Octoechos, mode 5–8 from 1539.

The greatest number of errors is related to the recognition of accent marks: instead of бо́лѣзны 1, и҆ 1, 2, 5, 6, 8, мои 2, трьпишѝ 2, поно́сноѐ 2, чь вькоу́шаѐши 3, ѿѐмлѥ 3, прободе́нїемь 4, ꙗ῎звы 4/5, ꙗ҆ко 5, и҆сцѣлꙗе 5, вьспѣ́ваѐмь 5, твое 6, сла́вное хотѣ́нїе 6, покла́нꙗю́ще се 6, и҆миже 7, своѐмꙋ ми́- 7, ве́лїю 8, the Dionisio 1.0. model incorrectly outputs бо́лѣзны̏ 1, и῎ 1, 2, 5, 6, 8, моѝ 2, трь́пишѝ 2, поно́сное 2, чьвькоу́шае҆ши 3, ѡ῎е҆млѥ̏ 3, прободе́нїе҆мь 4, ꙗ҆звы 4/5, ꙗ῎ко 5, и҆сцѣ́лꙗѐ 5, вьспѣ́вае҆мь 5, твоѐ 6, сла́вное҆хотѣ́нїе 6, покла́нꙗю҆ще се 6, и῎ ми́же 7, своѐ м ми́- 7, ве́лїю 8. Recognizing spaces between words is problematic in a multitude of cases: instead of жль- 2, чь вькоу́шаѐши 3, сла́вное хотѣ́нїе 6, чьте́мь 6, и҆миже 7, своѐмꙋ ми́- 7, роу́коположе́нїе 8, the model incorrectly outputs жлⷭв 2, чьвькоу́шае҆ши 3, сла́вное҆хотѣ́нїе 6, чь те́мь 6, и῎ ми́же 7, своѐ мꙋми́- 7, роу́ко положе́нїе 8. The other errors pertain to the recognition of superscript letters and the pajerak mark, as well as regular letters in a few examples: instead of жль- 2, 8, the model outputs жлⷭв 2, млтⷣь 8; instead of тѣ́мже 5 there is the incorrect тѣ́м же 5; instead of поноше́нїа 1, жль- 2, ѿѐмлѥ 3, the model reads попоше́нїа 1, жлⷭ 2, ѡ῎е҆млѥ̏ 3.

The quantitative and qualitative analysis conducted in this chapter demonstrates that the Dionisio 1.0. model recognizes the text of Serbian Church Slavonic books created in other printing houses of the 15th and 16th centuries with varying degrees of success. The quantitative analysis shows that the lowest CER was recorded in the books from the Mileševa and Goražde printing houses, which is expected considering that these books were printed using typographic printing equipment from Venice. An acceptable CER was noted during the recognition of Octoechos, mode 1–4 (1494) from the Cetinje printing house, while the percentage exhibited in the books from the other printing houses (Belgrade, Gračanica, Mrkša's Church) underscores the need for training a new version of the generic model with improved performance. The qualitative analysis showed that the Dionisio 1.0. model usually makes errors when recognizing accent marks, but also when recognizing spaces between words. The errors in recognizing superscript letters, the pajerak mark, initials and regular letters are far less common.

4. Creation and Evaluation of the Generic Model Dionisio 2.0.

When creating a new version of the model, we started from the transcripts of the Serbian Church Slavonic books listed in Table 4, obtained using the Dionisio 1.0. model. By means of manual correction of the transcripts, the Ground Truth9 data was obtained for training the generic model Dionisio 2.0. In accordance with our findings on the interdependence of model success and the amount of training data (Polomac, 2022), as well as similar findings for Church Slavonic books from the Berlin State Library (Neumann, 2021), the goal was set to provide a critical mass of at least 10,000 words for each printed book in order to train the generic model Dionisio 2.0. While training the generic model Dionisio 2.0. we used the Ground Truth data prepared for the Dionisio 1.0. model (see Table 1), as well as the new Ground Truth data from the Serbian Church Slavonic books printed in other printing houses of the 15th and 16th centuries, listed in the following table.

9 The term Ground Truth data in machine learning refers to completely accurate data used to train the model. In our case, these would be exact transcripts of the digital photographs of the manuscript. For more details on this term, see the Transkribus Glossary at https://readcoop.eu/glossary/ground-truth/.

Book (Printing House, Year)                   Word count
Octoechos, mode 1–4 (Cetinje, 1495)           15,667
Psalter with Appendices (Goražde, 1519)       16,445
Octoechos, mode 5–8 (Gračanica, 1539)         15,179
Prayer Book (Euchologion) (Mileševa, 1546)    15,003
Tetraevangelion (Belgrade, 1552)              15,333
Tetraevangelion (Mrkša's Church, 1562)        15,733

Table 5: The Dionisio 2.0. model – Ground Truth data from other printing houses of the 15th and 16th centuries.

The performance of the generic model Dionisio 2.0. is shown in the following table.

Word count    Number of epochs    CER on Train set    CER on Validation set
176,481       200                 2.03%               2.44%

Table 6: Performance of the generic model Dionisio 2.0.

In order to compare the performance of the two models, we tested them on ten sheets each from Psalter with Appendices (1495) from the Cetinje printing house and Hieraticon (1521) from the Goražde printing house, two Serbian Church Slavonic books that did not form part of the material for training the model. The results of the experiments are shown in the following table.

Book (Printing House, Year)                Dionisio 1.0. CER    Dionisio 2.0. CER
Psalter with Appendices (Cetinje, 1495)    5.71%                1.50%
Hieraticon (Goražde, 1521)                 9.38%                4.61%

Table 7: Comparing the Performance of the Two Models on Books from the Cetinje and Goražde Printing Houses.

As can clearly be seen from the previous table, the Dionisio 2.0. model displays significantly better results compared to the Dionisio 1.0. model. To illustrate the exceptional efficiency of the Dionisio 2.0. model, we provide a comparative presentation of a part of sheet 3b of Psalter with Appendices (1495) from the Cetinje printing house and the automatically read text in Figure 9.

As we can see in the figure, the Dionisio 2.0. model errs only in a few examples in which the spiritus lenis and the perispomena are insufficiently clearly differentiated: instead of ю̑нїи 8, пою̑ть 9, и҆сти́ною̑ 10, нака́зоую̑ть 11, the model incorrectly outputs ю҆нїи 8, пою҆ть 9, и҆сти́ною҆ 10, нака́зоую҆ть 11. There is a single example of the model mixing the spiritus lenis and the oxia: instead of ѿи҆де 13 there is the incorrect ѿи́де 13. The space between words was incorrect in one example only: instead of мнѣ́ти 9 there is the incorrect мнѣ́ ти 9. In the other examples on the shown part of sheet 3b, the Dionisio 2.0. model regularly recognizes letters, spaces between words, the titlo and accent marks.

The illustration in Figure 10 points to the fact that the Dionisio 2.0. model makes errors almost exclusively during accent mark recognition. Thus, instead of рабѣ̀ 1, бж҃їе҅мь 1, мⷬе 1, и҆гоу́меноу 2, и́ 3, на́шеи 4, и҆х 4, призовы̀ 4/5, твое҅ 5x2, приѡ҆бще́нїе 5, бл҃госрь́дїе҅ 5/6, вьсѐбл҃гы̏ 6, the model incorrectly reads ра́бѣ̀ 1, бж҃їе҆мь 1, мⷬе́ 1, и῎гоу́меноу 2, и҆ 3, на́шеѝ 4, и῎х 4, призовы 4/5, твоѐ 5x2, приѡ῎бще́нїе 5, бл҃госрь́дїѐ 5/6, вьсе бл҃гы̏ 6.
Along with the aforementioned errors, there are a few examples of incorrect recognition of spaces between words: instead of ѻ҆ бра́тїи 2, сь слꙋ- 2, дїа́конѣ- 3, вьсѐбл҃гы̏ 6, the model reads ѻ҆бра́тїи 2, сьслꙋ- 2, дїа́конѣ 3, вьсе бл҃гы̏ 6.

Figure 9: The Automatically Read Text of a Part of Sheet 3b Psalter with Appendices (1495).

The exceptional efficiency of the Dionisio 2.0. model in recognizing Psalter with Appendices (1495) from the Cetinje printing house, especially compared to Hieraticon (1521) from the Goražde printing house, has resulted from the fact that there are no superscript letters in Psalter with Appendices (1495), while accent marks are given in expected positions. On the other hand, superscript letters, as well as accent marks frequently found in unexpected positions, are present in Hieraticon (1521) from the Goražde printing house, which definitely affects the somewhat less favourable CER for this book. To illustrate the aforementioned, we shall use the comparative presentation of a part of sheet 9b and the automatically read text in the following figure.

Figure 10: The Automatically Read Text of a Part of Sheet 9b Hieraticon (1521).

5. Concluding Remarks

The research showed how the Transkribus software platform, based on the principles of machine learning and artificial intelligence, can be used to create efficient models for the automatic text recognition of Serbian Church Slavonic printed books from the end of the 15th to the middle of the 17th century. Having in mind the limitations of the Dionisio 1.0. model in the automatic recognition of the text of Serbian Church Slavonic books printed outside Venice, the paper describes the process of creating the generic model Dionisio 2.0., capable of recognizing Serbian Church Slavonic printed books as a whole. The generic model Dionisio 2.0. was trained on the material of Serbian Church Slavonic books printed in various Serbian printing houses of the 15th and 16th centuries: Cetinje, Venice, Goražde, Gračanica, Mileševa, Belgrade and Mrkša's Church. The quantitative analysis of the performance of this model showed that it can be used to automatically obtain transcripts with a minimal percentage of incorrectly recognized characters (about 2–3%). Most frequently, the CER depends on the quality of the photographs of the book, the frequency of use of accent marks and superscripts, as well as the correct use of accent marks in the appropriate positions. Using the Dionisio 2.0. model, transcripts of Serbian Church Slavonic printed books can be obtained automatically and, after being edited by a competent philologist, used for further philological and linguistic research, primarily for creating searchable digital editions of books, as well as electronic corpora, thus creating opportunities for diachronic research into Serbian early modern literacy on a large quantity of data. In the near future, the generic model Dionisio 2.0. will become publicly available to all users of the Transkribus software platform, which will enable further improvement of its performance and could ultimately lead to the creation of a generic model for the automatic text recognition of Church Slavonic printed books as a whole.

6. Acknowledgment

The research conducted in the paper was financed by the Ministry of Education, Science and Technological Development of the Republic of Serbia, contract no. 451-03-68/2022-14/200198, as well as by the German Academic Exchange Service (DAAD) within the project Automatic Text Recognition of Serbian Medieval Manuscripts and Early Printed Books: Problems and Perspectives.

7. References

Constanţa Burlacu and Achim Rabus. 2021. Digitising (Romanian) Cyrillic using Transkribus: new perspectives. Diacronia, 14:1–9.
Miroslav Lazić. 2018. Od Božidara Vukovića do Dionizija dela Vekije: identitet i pseudonim u kulturi ranog modernog doba. In: Anatolij A. Turilov et al., eds., Scala Paradisi, pages 165–185, SANU, Beograd.
Miroslav Lazić. 2020a. Inkunabule i paleotipi: srpskoslovenske štampane knjige od kraja 15. do sredine 17. veka. In: Vladislav Puzović and Vladan Tatalović, eds., Osam vekova autokefalije Srpske pravoslavne crkve, Vol. 2, pages 325–344. Sveti arhijerejski sinod Srpske pravoslavne crkve–Pravoslavni bogoslovski fakultet, Beograd.
Miroslav Lazić. 2020b. Between an Imaginary and Historical Figure: Božidar Vuković's Professional Identity. Ricerche Slavistiche, 43:141–156.
Vladimir Neumann. 2021. Deep Mining of the Collection of Old Prints Kirchenslavica Digital. Scripta & e-Scripta, 21:207–216.
Vladimir Polomac. 2022. Serbian Early Printed Books from Venice: Creating Models for Automatic Text Recognition using Transkribus. Scripta & e-Scripta, 22 [in print].
Günther Mühlberger, L. Seaward, M. Terras, S. Oliveira Ares, V. Bosch, M. Bryan, S. Colluto, H. Déjean, M. Diem, S. Fiel, B. Gatos, A. Greinoecker, T. Grüning, G. Hackl, V. Haukkovaara, G. Heyer, L. Hirvonen, T. Hodel, M. Jokinen, P. Kahle, M. Kallio, F. Kaplan, F. Kleber, R. Labahn, M. Lang, S. Laube, G. Leifert, G. Louloudis, R. McNicholl, J. Meunier, J. Michael, E. Mühlbauer, N. Philipp, I. Pratikakis, J. Puigcerver Pérez, H. Putz, G. Retsinas, V. Romero, R. Sablatnig, J. Sánchez, P. Schofield, G. Sfikas, C. Sieber, N. Stamatopoulos, T. Strauss, T. Terbul, A. Toselli, B. Ulreich, M. Villegas, E. Vidal, J. Walcher, M. Wiedermann, H. Wurster, and K. Zagoris. 2019. Transforming scholarship in the archives through handwritten text recognition. Journal of Documentation, 75(5):954–976.
Mitar Pešikan. 1994. Leksikon srpskoslovenskog štamparstva. In: Mitar Pešikan et al., eds., Pet vekova srpskog štamparstva 1494–1994: razdoblje srpskoslovenske štampe XV–XVII, pages 71–218, Narodna biblioteka Srbije–Matica srpska, Beograd.
Achim Rabus. 2019a. Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach using Transkribus. Scripta & e-Scripta, 19:9–32.
Achim Rabus. 2019b.
Training Generic Models for Handwritten Text Recognition using Transkribus: Opportunities and Pitfalls. In: Proceedings of the Dark Archives Conference, Oxford, 2019, in print.

Lematizacija in oblikoskladenjsko označevanje korpusa SentiCoref

Eva Pori,* Jaka Čibej,* Tina Munda,† Luka Terčon,† Špela Arhar Holdt*†

* Filozofska fakulteta, Univerza v Ljubljani, Aškerčeva 2, 1000 Ljubljana
eva.pori@ff.uni-lj.si; jaka.cibej@ff.uni-lj.si; spela.arharholdt@ff.uni-lj.si
† Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, Večna pot 113, 1000 Ljubljana
tina.munda@fri.uni-lj.si; luka.tercon@fri.uni-lj.si

Povzetek
The paper presents the process and the results of the manual review of lemmas and MULTEXT-East v6 morphosyntactic tags in the SentiCoref corpus, which will, within the Development of Slovene in a Digital Environment (RSDO) project, be included in the new training corpus for Slovene (currently ssj500k). We describe the workflows of the annotation campaign, one of the most extensive of this type in Slovenia, the annotation dilemmas that revealed certain gaps in the reference annotation guidelines, as well as the solutions and results that we developed during the work and that can be used in the future.

Lemmatization and Morphosyntactic Annotation in the SentiCoref Corpus
The paper presents the process and the results of manual lemma and MULTEXT-East v6 morphosyntactic tag annotation in the SentiCoref corpus, which is planned to be included in the new Slovene training corpus (currently known as ssj500k) as part of the "Development of Slovene in a Digital Environment" project.
The paper describes the workflows of the annotation campaign, which was among the most extensive campaigns of this type in Slovenia, the annotation dilemmas that revealed gaps in previous versions of the annotation guidelines, as well as the resulting solutions that will be useful in future annotation campaigns.

1. Introduction

Between 2020 and 2023, the applied project Development of Slovene in a Digital Environment (RSDO) is being carried out with the support of the Ministry of Culture of the Republic of Slovenia and the European Regional Development Fund.1 Among the project goals is an infrastructure for the continuous construction of Slovene corpora: workflows for the ongoing collection of texts, an annotation pipeline and documentation for annotation at various linguistic levels, as well as several new tools for manual annotation and the review of corpus data. As fundamental language resources for the development of the pipeline for the automatic annotation of contemporary Slovene, the upgrade also covers the Sloleks lexicon of word forms (Dobrovoljc et al., 2019) and the ssj500k training corpus (Krek et al., 2020), to which the present paper is also connected.

Version 2.3 of the ssj500k training corpus (Krek et al., 2021) contains 27,829 sentences, i.e. 500,295 word tokens, annotated at levels ranging from sentence segmentation, tokenization, lemmatization, morphology and morphosyntax, through dependency syntax, named entities and multi-word lexemes, to semantic roles. As is characteristic of training corpora, the linguistic annotations are manually reviewed, which provides the reliability needed for the supervised learning of automatic procedures. The results are also influenced by the size and representativeness of the material, which is why the main goal of the upgrade is enlarging the training corpus to 1,000,000 word tokens. Within the project, a limited number of newly annotated sentences will be prepared for the higher, more complex levels of annotation, while the basic levels will be manually reviewed for all new material.

In this paper, we present the annotation campaign in which we manually reviewed and corrected the tokenization, segmentation, lemmas and morphosyntactic tags of the MULTEXT-East system (Erjavec, 2012) in the SentiCoref 1.0 corpus (Žitnik, 2019), which represents approximately 76% of the planned enlargement of the training corpus.2 SentiCoref contains texts from news portals into which coreference and named-entity annotations have been manually entered, and it answers the need to include in the training corpus material that enables the annotation of linguistic features across sentence boundaries (Arhar Holdt and Čibej, 2021).

The purpose of this paper is to describe the work and the results, and especially the annotation dilemmas at the levels of lemmas and morphosyntax, which revealed certain gaps in the reference annotation guidelines (Holozan et al., 2008), as well as the solutions that we developed during the work and that can be used for comparable tasks in the future. At the conclusion of the RSDO project, the new training corpus will be made openly available on the CLARIN.SI repository together with the upgraded annotation guidelines.

1 The website presenting the project goals and participating partners: https://slovenscina.eu/.
2 For the remaining 24%, diverse text sets are planned, which will provide (a) foundations for semantic annotation, e.g. the Slovene version of the parallel corpus Elexis-WSD (Martelli et al., 2022), (b) selected under-represented text types, e.g. tweets, which represent user-generated online content, and (c) ambiguous word forms that are rare in use: homographic pronouns, dual forms, etc. (Arhar Holdt and Čibej, 2021: 49–50).

2. Previous and Related Work

The ssj500k training corpus has been developed for more than a decade as a reference resource for the supervised learning of automatic linguistic annotation of contemporary Slovene written texts (Krek et al., 2020). Various taggers have so far been trained on this corpus, e.g. Obeliks (Grčar et al., 2012), ReLDI (Ljubešić and Erjavec, 2016), the neural tagger developed by Belej (2018), and CLASSLA StanfordNLP (Ljubešić and Dobrovoljc, 2019), which is being further developed within the RSDO project.

The beginnings of the training corpus go back to the MULTEXT-East project, which stimulated the development of a system for the morphosyntactic annotation of (among other languages) Slovene (Dimitrova et al., 1998). The tagset was revised and upgraded under the auspices of the Linguistic Annotation of Slovene (JOS) project, in which the jos100k corpus was created (Erjavec and Krek, 2008). Subsequently, in the Communication in Slovene project, an additional 400,000 words were reviewed, and reference guidelines for the annotation of lemmas and morphosyntax according to the JOS system, i.e. MULTEXT-East v4, were prepared (Holozan et al., 2008). The current version of the corpus contains tags of the MULTEXT-East v6 system, which at the system level comprises 1,900 possible tags carrying information on the part of speech and various lexico-grammatical features, such as gender, case, number and properhood for nouns.3

SentiCoref 1.0 (Žitnik, 2019) is a corpus of 837 texts, i.e. approximately 433,000 tokens, sampled from the SentiNews 1.0 corpus (Bučar, 2017). Although SentiCoref 1.0 does not directly contain the same sentiment annotations as SentiNews 1.0, the two corpora can be interlinked. SentiCoref 1.0 also contains annotations of named entities (persons, organizations and locations) and of coreferences to named entities, with coreference chains that mark the sentiment for each entity. SentiCoref 1.0 is openly available under the CC BY 4.0 licence on the CLARIN.SI repository, in the tabular TSV3 format supported by the annotation tool INCEpTION (Klie et al., 2018), the successor of WebAnno (Eckart de Castilho et al., 2014).

3. Preparation for Annotation

3.1. Data Preparation

SentiCoref 1.0 is tokenized, but contains no lemmas or morphosyntactic tags. As far as the division into tokens is concerned, SentiCoref 1.0 was not designed with potential additional levels of linguistic annotation in mind, so in some cases it deviates from the tokenization rules currently used for corpus annotation in Slovenia (the classla tagger4 and the Obeliks tokenizer5 included in it), e.g. in the splitting of abbreviations ("STA-jev" > "STA", "-", "jev") and numerals ("2,356" > "2", ",", "356"). Cases were also corrected in which a period at the end of a sentence was in fact part of an abbreviation ("d.o.o", "." > "d.o.o.").

On the corrected and properly segmented corpus, we annotated lemmas and morphosyntactic tags with the CLASSLA StanfordNLP tagger v0.0.11.6

3.2. Preparation of the Annotation Guidelines

In reviewing the tags, we followed the JOS guidelines for morphosyntactic annotation (Holozan et al., 2008), which include the set of morphosyntactic (MSD) tags, general principles of lemmatization, and more detailed definitions of individual annotation categories and subcategories, illustrated with annotated corpus examples. We prepared the guidelines in Google Docs, so that we could supplement them on the basis of the ongoing treatment of key annotation dilemmas and the repeated review and evaluation of problematic cases. It is primarily these aspects of the guidelines that we focus on in what follows.

4. Review of the Annotations

4.1. Scope and Workflows of the Annotation Campaign

The manual review of the automatically annotated material was carried out in Google Sheets. The data from the 837 texts were prepared in as many files. Each file contained metadata and the information relevant for the review: the token form, the lemma, the automatically assigned morphosyntactic tag (with the option of selecting a correction from a drop-down list of all existing tags, which facilitated correction and reduced the possibility of typos), and a cell for an optional reviewer comment.

The data were reviewed by 24 students of linguistics, divided into 3 groups. Each of these groups
Prav tako delitev v SentiCorefu 1.0 ne študentov je pregledovala iste datoteke; namen tega, da vsebuje podatkov o presledkih. Pred strojnim vsako pojavnico pregledajo 3 študenti, je bil doseči večjo oblikoskladenjskim označevanjem in ročnim zanesljivost odločitev. Vsakemu izmed 8 pregledovalcev v popravljanjem oblikoskladenjskih oznak je bilo treba skupini je bila dodeljena besedna vrsta oz. več besednih najprej popraviti tokenizacijo (vzporedno z njo tudi vrst, pri čemer je dodelitev potekala na osnovi preferenc strojno lematizacijo) ter razdeliti besedilo na povedi študentov, predhodno ugotovljenih v anketi. Glede na (stavčna segmentacija). Za pregledovanje smo korpus težavnost označevanja ter pogostost vsake besedne vrste v pripravili v tabelaričnem formatu v okolju Google korpusu sta samostalnik pregledovala dva študenta; Preglednice (ang. Google Sheets), saj INCEpTION ne glagol, pridevnik in zaimek po en študent; za izbiro podpira spreminjanja tokenizacije. Tokenizacija je bila v oznake preprostejše besedne vrste pa smo združili v celoti popravljena ročno, stavčna segmentacija pa je bila skupine, pri čemer je en študent pregledoval po eno najprej strojno pripisana (na podlagi ločil), nato pa ročno skupino: prislov in členek; predlog in veznik; števnik, pregledana in potrjena. okrajšava, medmet in “neuvščeno”. Pred pričetkom Pri pregledovanju segmentacije je bilo 17.095 strojno pregledovanja so bile pregledovalcem predstavljene pripisanih koncev povedi ročno potrjenih kot ustreznih (z smernice (gl. 3.2) in demonstracija postopka v obliki ujemanjem treh pregledovalcev in potrditvijo končnega videa. Pregledovanje je potekalo v dveh fazah. razsojevalca oz. kuratorja). 2.528 koncev povedi so pregledovalci pripisali ročno: pri 2.151 koncih so se 4.1.1. Pregledovanje strinjali vsi pregledovalci (in kurator), pri 275 po dva, pri Uvodni teden pregledovanja je bil namenjen 156 pa je konec povedi označil le en pregledovalec. 
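Opisano razsojanje med tremi vzporednimi pregledovalci in končnim kuratorjem lahko ponazorimo s preprosto skico: popolno ujemanje potrdimo samodejno, razhajanja pa usmerimo v kuracijo. Imena funkcij in oblika podatkov so ilustrativna predpostavka, ne dejansko orodje projekta.

```python
from collections import Counter

def razsodi(odlocitve):
    """Za vsako enoto presteje glasove treh pregledovalcev: popolno
    ujemanje potrdimo samodejno, razhajanja gredo v kuracijo."""
    izid = []
    for enota, glasovi in odlocitve:
        oznaka, podpora = Counter(glasovi).most_common(1)[0]
        status = "potrjeno" if podpora == len(glasovi) else "v kuracijo"
        izid.append((enota, oznaka, status))
    return izid

primer = [
    ("infrastruktura", ["Sozei", "Sozei", "Sozei"]),  # ujemanje 3/3
    ("Delo", ["Slmei", "Somei", "Slmei"]),            # razhajanje -> kurator
]
print(razsodi(primer))
# [('infrastruktura', 'Sozei', 'potrjeno'), ('Delo', 'Slmei', 'v kuracijo')]
```

Skica prikazuje zgolj logiko večinskega glasovanja; v kampanji je končno odločitev pri razhajanjih vedno sprejel kurator.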
2.992 poglobljeni seznanitvi s smernicami in razreševanju koncev povedi je bilo potrjenih kot neustreznih; od tega potencialnih nejasnosti, zato je bilo vsakemu jih je bilo 1.409 označenih avtomatsko, 940 ročno s pregledovalcu dodeljenih le 5 datotek. Število datotek se popolnim ujemanjem med tremi pregledovalci, 167 ročno je postopoma zviševalo do 20 tedensko, hkrati pa smo z ujemanjem dveh pregledovalcev, 476 ročno z oznako le okretnejšim ali bolj časovno razpoložljivim enega pregledovalca. Pri večini primerov, v katerih je pregledovalcem omogočili večji obseg dela razsojevalec zavrnil odločitve pregledovalcev, gre za (individualizirani pristop). Analiza (ne)ujemanja med popravke tokenizacije in lem, ko so pregledovalci npr. kot tremi vzporednimi pregledovalci je predstavljala izhodišče za 2. fazo – kuracijo. 3 Označevalni sistem je opisan na spletni strani: http://nl.ijs.si/ME/V6/msd/. 4 https://github.com/clarinsi/classla 5 https://github.com/clarinsi/obeliks 6 https://pypi.org/project/classla/0.0.11/ PRISPEVKI 163 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 4.1.2. Kuracija izlastnoimenskih svojilnih pridevnikov, ki se v rabi pišejo Posamezne odločitve pregledovalcev smo uredili v z malo ali pa prehajajo v zapis z malo, ker niso v pomenu enotno tabelo, da so bile ob pojavnicah prikazane vse 3 prave svojine ( Parkinsonova bolezen > lema: odločitve, pri čemer so bile posebej označene tiste oznake, parkinsonov). pri katerih je med pregledovalci prišlo do razhajanja. Pri lematizaciji izlastnoimenskih pridevnikov v Naloga kuratorjev je bila pregledati prav te pojavnice in stvarnih lastnih imenih je pregledovalce zmedla različna jim pripisati končno oznako. 7 kuratorjev je bilo izbranih obravnava primerov v korpusu ssj500k ( Delova dopisnica iz vrst pregledovalcev, po eden za vsako besedno vrsto oz. > lema: Delov vs. 
Magov novinar > lema: magov), zato je skupino besednih vrst. Označevalna kampanja se je bilo treba ta del pravila, ki v izhodiščnih smernicah ni bil zaključila v 12 tednih, od katerih so bili štirje namenjeni pojasnjen, posebej razložiti. samo kuraciji. 4.2.3. Tuja stvarna lastna imena 4.2. Označevalne dileme Ker načelo prekrivnosti z občnoimenskimi Ob kuraciji smo identificirali dve vrsti označevalnih samostalniki (gl. 4.2.1) velja primarno za slovenske težav: (a) primeri, pri katerih so bile označevalne smernice samostalnike, se je pogosto pojavljalo vprašanje, katere jasne, a pri delu niso bile dosledno upoštevane in (b) besede obravnavati kot slovenske (prevzete besede, ki se primeri, ki so se pokazali kot zahtevnejši: slabše pregibajo s slovensko morfologijo, vedno umeščamo med predstavljeni v smernicah in mestoma tudi nedosledno slovenske, če potrditve za pregibanje v rabi ni, pa se je obravnavani v obstoječem ssj500k 2.3.7 treba odločiti na podlagi drugih kriterijev). Dileme so se Težave prvega tipa smo analizirali, odpravili nanašale predvsem na: (a) prevzete besede, ki pogosto nekonsistentnosti in jih označili v skladu z označevalnimi nastopajo kot deli tujejezičnih imen sicer slovenskih smernicami. Nekaj več informacij o tipičnih tovrstnih podjetij (tip leasing, holding) ter (b) ostale občnoimenske težavah povzemamo v poglavju 4.3. Posebno pozornost pa besede v tujejezičnih zvezah, ki so prekrivne s smo posvetili drugi skupini težav, ki smo jih identificirali slovenskimi občnoimenskimi samostalniki, pri čemer pa kot bolj kompleksne in zahtevne, saj so njihove rešitve pogosto ne izpolnjujejo kriterija pomenske prekrivnosti zahtevale premislek o odprtih vprašanjih na ravni (tip trans, global). lematizacije in oblikoskladnje (tudi) v korpusu ssj500k in Podrobneje smo obravnavali tudi skupino stvarnih posledično nadgradnjo označevalnih smernic. V lastnih imen tipa Zagrebačka banka, Večernji list. 
Ker gre nadaljevanju predstavimo te težave, v poglavju 5 pa za imena v hrvaščini, ki zaradi sorodnosti s slovenščino predlagane spremembe smernic. mestoma prinašajo besedje, enako slovenskemu, so bili pregledovalci v dilemi, ali tako pridevnik kot samostalnik 4.2.1. Občnoimenska prekrivnost v stvarnih lastnih označiti kot slovensko besedo in pri tem pridevniku imenih pripisati v slovenščini neobstoječo lemo, ali (vsaj) Pregledovalcem je težave povzročalo pravilo, da je v pridevnik umestiti med tujejezično besedišče. stvarnih lastnih imenih, kjer je lastnoimenski samostalnik prekriven z občnoimenskim samostalnikom, tako lema kot 4.2.4. Ločevanje pridevnikov od prislovov oblikoskladenjska oznaka občnoimenska. Iz tega sledi, da Odločitve pregledovalcev so se pogosto razhajale pri je lematizacija slovenskih imen podjetij, časopisov, revij, primerih, ki so izkazovali tipično povedkovnodoločilno knjig, tudi televizijskih oddaj, serij ali filmov ipd. z malo rabo pridevnikov oz. obravnavo pridevniških oblik, ki so začetnico: npr. podjetje Iskra [iskra, Sozei]; časnik Delo se prekrivale z osnovno prislovno obliko. Smernice so že [delo, Sosei]. Na iskanje prekrivnosti, ki zaradi pomenske vsebovale splošno navodilo o označevanju pridevnikov, ki oddaljenosti občnoimenske "ustreznice" pogosto ni lahko nastopajo v prilastkovi ali povedkovi rabi ( Sledil je enoznačno (gl. tudi 4.2.3), je bilo treba večkrat opozoriti, prelomni korak > pridevnik kot levi prilastek; uradno še saj je bilo pregledovalcem bolj intuitivno ohraniti zapis ni rehabilitiran > pridevnik kot povedkovo določilo), pa leme z veliko začetnico. Opozarjati jih je bilo treba tudi, tudi pravilo za ločevanje pridevnikov od prislovov v da načelo prekrivnosti dogovorno velja samo pri primeru pridevniškega niza ( uradno prečiščeno besedilo > samostalnikih (stranka Zares [Zares, Slmei]). Manj težav prislov). 
Niso pa naslovile razlike med pridevniško in smo zaznali pri pregledovanju tistih primerov stvarnih prislovno lemo pri posameznih zahtevnejših primerih (npr. lastnih imen, ki niso imela prekrivne leme z občnim smotrno, potrebno, mogoče, možno v primerih kot npr. bi samostalnikom in smo jih lematizirali z veliko začetnico bilo smotrno, da bi [...]), ki so se tudi v korpusu ssj500k (podjetje Mercator [Mercator, Slmei]). pokazali kot nekonsistentno označeni: pogosto smo zasledili prislovno lemo namesto dogovorno ustrezne 4.2.2. Izlastnoimenski svojilni pridevniki pridevniške leme. Neskladja so predstavljala izhodišče za Del pravila, da pri svojilnih pridevnikih, ki izvirajo iz nadaljnje analize, ki so vključevale ponovni pregled vseh osebnih ali zemljepisnih lastnih imen, ohranjamo lemo z primerov oz. zgledov (v korpusu SentiCoref) s veliko začetnico ( Aškerčeva ulica > lema: Aškerčev), je prekrivnimi pridevniškimi in prislovnimi oblikami ter bil jasen, več dilem je bilo pri pregledovanju tistih oblikovanje dopolnjenega pravila za pripisovanje pridevniških in prislovnih lem. 7 Smernice Holozan et al. (2008) predstavljajo v slovenskem 4.2.5. Nesklonljivi prilastki (tip bruto, solo) prostoru sprejet in široko apliciran označevalni standard, zato Pregledovalci so imeli težave z razumevanjem smo jim sledili v največji možni meri. Tudi dopolnitev smernic, navodila v izhodiščnih smernicah, da tiste primere tipa ki smo jo pripravili na projektu RSDO, ostaja v zastavljenih bruto, solo (npr. solo uspeh, rast bruto zadolževanja, info konceptualnih okvirih. Morebitne korenitejše spremembe točka), ki so sklonljivi, obravnavamo kot samostalnike, označevalnega sistema, kjer izstopa predvsem vprašanje tiste, ki niso, pa kot pridevnike. Predvsem v navodilu ni lematiziranja (pravopisno, ne pa tudi oblikoslovno) različnih jasno, kako preverjati (ne)sklonljivost in kaj je vodilo za samostalnikov in tudi drugih besednih vrst z veliko ali malo odločitev (sistemska možnost, gradivo). 
začetnico, zahtevajo širši premislek, ki ga nakažemo v pogl. 6. PRISPEVKI 164 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 4.2.6. Prislovne zveze (tip na novo) kategorije, ki so povzročale največ težav (podjetja in Težave so bile tudi z obravnavo t. i. prislovnih zvez oz. časnike), npr. O tem, da so bile v Iskri [iskra, Somem] označevanjem nepredložnega dela teh zvez. Smernice potrebne spremembe, so čivkali že vrabci na veji. ; Večino posredno nakazujejo, naj označevanje teži k pridevniškim hrane kupimo v Mercatorju [Mercator, Slmem] ali lemam ( na drobno > lema: droben), se je pa recimo pri Intersparu [Interspar, Slmem] . ; Kot smo poročali v primeru v živo pokazalo, da so bili v korpusu ssj500k vsi prejšnji številki Mladine [mladina, Sozer] . takšni primeri označeni kot prislovni ( v živo > lema: živo). II. Izlastnoimenski svojilni pridevniki: v smernice Na osnovi tega neskladja smo naredili podrobnejšo smo dodali pravila za rabo velike in male začetnice s analizo in odkrili več primerov neenotnega označevanja primeri: enakovrstnih primerov. (a) Pridevniki iz osebnih in zemljepisnih lastnih imen: načeloma ohranjamo lemo z veliko začetnico, tiste 4.3. Pregledani podatki primere, ki se v rabi pišejo z malo ali so na Analiza popravkov po koncu pregledovanja in prehodu v zapis z malo, ker niso v pomenu prave kuriranja kaže, da se delež vnesenih popravkov sklada s svojine, pa lematiziramo z malo, npr. Celjska pričakovanim deležem napak pri avtomatskem občina je prejšnji teden objavila razpis za najem označevanju slovenskih besedil z označevalnikom vile v Aškerčevi [Aškerčev, Psnzem] 7 v Celju. ; CLASSLA StanfordNLP (Ljubešič in Dobrovoljc 2019: Gre za zdravilo za zdravljenje parkinsonove 31–32). Na ravni lematizacije je bilo skupaj popravljenih [parkinsonov, Psnzer] bolezni. 
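Preverjanje (ne)sklonljivosti na podlagi potrditev v dejanski rabi lahko ponazorimo s skico, ki sklonljivost preverja ob hipotetičnem seznamu v referenčnem korpusu izpričanih oblik; seznam oblik in imena funkcij so izmišljeni za ilustracijo in niso dejanski podatki iz Gigafide.

```python
# Hipotetičen izsek oblik, izpričanih v referenčnem korpusu (ilustracija).
IZPRICANE_OBLIKE = {
    "leasing": {"leasing", "leasinga", "leasingu"},  # pregibanje izpričano
    "bruto": {"bruto"},                              # samo osnovna oblika
}

def predlagaj_obravnavo(lema):
    """Če korpus izkazuje pregibane oblike, primer obravnavamo kot
    samostalnik, sicer dosledno kot (nesklonljivi) pridevnik."""
    oblike = IZPRICANE_OBLIKE.get(lema, {lema})
    return "samostalnik" if len(oblike) > 1 else "pridevnik"

print(predlagaj_obravnavo("leasing"))  # samostalnik
print(predlagaj_obravnavo("bruto"))    # pridevnik
```

Skica tako operacionalizira gradivni kriterij (potrditev pregibanja v rabi) namesto zgolj sistemske možnosti pregibanja.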
5.588 lem, kar je približno 1,3 % vseh pojavnic v korpusu, (b) Pridevniki iz stvarnih lastnih imen: dodatno smo kar se sklada s približno 98-odstotno natančnostjo opredelili načelo lematizacije primerov tipa Delova lematizacije. Na ravni oblikoskladenjskih oznak je bilo dopisnica > lema: Delov in Magov novinar > lema: skupaj 12.586 popravkov, kar pomeni 2,9 % vseh oznak v magov. Pri primerih, kjer je bila prekrivnost korpusu (ob skoraj 97-odstotni natančnosti sistemsko sicer možna, vendar v dejanski rabi oblikoskladenjskega označevanja). neizkazana, smo ohranili veliko začetnico, npr. S Pri popravkih lem so bili med najpogostejšimi tega stališča je polemika z Mladininim [Mladinin, lastnoimenskimi samostalniki, ki so prekrivni z Psnmeo] doktorjem sociologije že skorajda na robu občnoimenskimi (npr. Luka Koper > lema: luka), smiselnega (občni samostalnik mladina sicer okrajšave, sestavljene iz ene ali dveh črk (npr. dr. > lema: obstaja, vendar je svojilni pridevnik v rabi izredno dr.), pa tudi besede s prekrivnimi oblikami v redek, tj. ima eno samo pojavitev v referenčnem oblikoskladenjski paradigmi (npr. delo in del). Pri korpusu Gigafida 2.0). Nasprotno je v primerih, ki popravkih oblikoskladenjskih oznak je šlo povečini za izkazujejo pogostejšo rabo svojilnega pridevnika, ločevanje med občnimi in lastnoimenskimi samostalniki npr. vsi pa občudujejo njegovi operi Jevgenij (tip Leasing – leasing; 1538 popravkov oz. 12 %; v Onjegin in Pikova [pikov, Psnzei] dama. obratni smeri iz občnoimenskega v lastno je bilo (c) Pridevniki na -ski, -ški kot del zemljepisnih lastnih popravkov manj: 235 oz. 1,8 %), med moškim in ženskim imen: lematiziramo jih z malo začetnico, pri čemer spolom (825 popravkov oz. 6,6 %; pri tem gre npr. za je treba posebej izpostaviti razliko v odnosu do imena določenih strank, kot je Desus) ter med prekrivnimi primerov tipa Kranjska, Štajerska ipd. 
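Navedene deleže popravkov iz poglavja 4.3 je mogoče preveriti z enostavnim izračunom; števila so povzeta iz besedila prispevka (približno 433.000 pojavnic, 5.588 popravkov lem, 12.586 popravkov oblikoskladenjskih oznak).

```python
# Števila iz prispevka (pogl. 4.3).
POJAVNIC = 433_000

def delez_popravkov(st_popravkov, celota=POJAVNIC):
    """Vrne delež popravkov v odstotkih, zaokrožen na eno decimalko."""
    return round(100 * st_popravkov / celota, 1)

print(delez_popravkov(5_588))   # 1.3 (lematizacija)
print(delez_popravkov(12_586))  # 2.9 (oblikoskladenjske oznake)
```

Izračunana deleža se skladata z v prispevku navedeno približno 98-odstotno natančnostjo lematizacije in skoraj 97-odstotno natančnostjo oblikoskladenjskega označevanja.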
Pri imenih oblikami v imenovalniku, tožilniku in rodilniku (skupaj regij gre za samostalnike in jih lematiziramo z 1.617 popravkov oz. 12,8 % pri samostalnikih; npr. neživi veliko: V Vinski kleti Goriška [goriški, Ppnsmi] samostalniki moškega spola: odbor, posel v imenovalniku Brda zadovoljni s poslovanjem v minulem letu; in tožilniku). Na ravni besednih vrst je šlo največkrat za Črnivec je poleg prelaza Volovjek najsevernejši težje ločevanje med prekrivnimi prislovi in prirednimi cestni prehod, ki povezuje Kranjsko [Kranjska, vezniki (npr. tako; 130 popravkov oz. 1,1 %), med Slzet] in Štajersko [Štajerska, Slzet] . lastnoimenskimi samostalniki in neuvrščenimi (č) Splošni pridevniki kot del zemljepisnih lastnih tujejezičnimi izrazi (npr. Amnesty International; 118 imen: lematiziramo jih z malo (tip nov, spodnji), če popravkov oz. 1,0 %) ter med členki in prirednimi vezniki v splošni rabi ne obstajajo, pa ohranimo veliko (npr. sicer, niti, ne; 97 popravkov oz. 0,7 %). Ker je bila začetnico, npr. Britanija, Avstralija in Nova [nov, količina popravkov relativno majhna, bi se bilo v Ppnzei] Zelandija; Mlekarna Celeia iz Arje [Arji, prihodnjih označevalnih kampanjah morda smiselno Ppnzer] vasi je namreč edina domača mlekarna v osredotočiti le na najpogostejše pričakovane napake. Kot večinski lasti zadrug. vodilo lahko pri tem služijo v tem poglavju naštete III. Tuja stvarna lastna imena: po posvetu s širšo najpogostejše dileme in težave. projektno ekipo smo se odločili, da bomo oblikovno prekrivne občnoimenske samostalnike "iskali" tudi v tujejezičnih večbesednih stvarnih lastnih imenih. Pri tem 5. Nadgradnja označevalnih smernic je treba upoštevali predvsem dve merili: pregibanje v rabi Na podlagi analize najpogostejših označevalskih dilem in prevzetost (prisotnost v referenčnih priročnikih, npr. 
in pregleda označevalnih odločitev v korpusu ssj500k smo Hypo Leasing [leasing, Somei], Infond Holding [holding, pripravili rešitve glede (nadaljnjega) pregledovanja in Somei]), ne pa nujno tudi merilo pomenske prekrivnosti – dopolnitve smernic za problematične kategorije, naštete v v nekaterih primerih lahko ima tuja beseda v stvarnem poglavju 4.2. Nadgrajene smernice bodo objavljene ob lastnem imenu podoben pomen, kot ga ima (prekrivna) koncu projekta RSDO. slovenska beseda, v nekaterih pa ne. Primere, ki so bili I. Občnoimenska prekrivnost v stvarnih lastnih oblikovno prekrivni, pomensko pa ne, smo zbrali na imenih: splošno načelo, da stvarna imena, prekrivna z posebnem seznamu in po analizi sprejeli odločitev, da jih občnim samostalnikom, označujemo kot občni vse obravnavamo kot občne samostalnike, npr. Trade samostalnik in lematiziramo z malo začetnico, ostala, ki Trans [trans, Somei] Invest, Prevent Global [global, prekrivnosti ne izkazujejo, pa z veliko začetnico, smo Somei]. dopolnili s konkretnimi zgledi rabe. Izbrali smo V smernice smo dodali odločitev glede označevanja slovenščini sorodnih tujih primerov (tip Večernji list): pri PRISPEVKI 165 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 samostalnikih upoštevamo načelo prekrivnosti s Pri razlikovanju obravnave zemljepisnih/osebnih imen ter slovenskim občnoimenskim besediščem, pridevnike stvarnih imen se v sistem še dodatno vpletajo načela, ki so obravnavamo kot tuje besedišče, pri katerem ostane lema bolj kot na oblikoskladnjo vezane na (s semantiko in enaka besedni obliki, npr. Jutarnji [Jutarnji, Nj] list [list, trenutnim pravopisom povezano) metajezikovno Somei], Zagrebačka [Zagrebačka, Nj] banka [banka, klasifikacijo referentov. Zdi se, da na ravni označevanja Sozei]. lem in oblikoskladnje sprejemamo odločitve, ki bi sodile IV. 
Ločevanje pridevnikov od prislovov: pri na raven jezikovnega opisa in predpisa, ob čemer se opredelitvi razlike med pridevnikom in prislovom v vlogi opiramo na jezikovne vire, kjer prav te odločitve pogosto povedkovega določila smo v smernicah izpostavili še niso sprejete. skladenjski vidik. Opredelitev razlike, da je beseda v vlogi Ker težave pripisovanja občno- oz. lastnoimenskosti povedkovega določila prislov, če je iz stavka izpustljiva, samostalnikov prednjačijo v veliki sliki vseh opravljenih pridevnik pa, če je nepogrešljiva (obvezna), smo popravkov, obenem pa identifikacijo lastnoimenskih zvez podkrepili s primeri, npr. O tem ni *(mogoče) [mogoč, v zadnjih letih uspešno opravljamo pri označevanju Ppnsei] sklepati. ; (Mogoče) [mogoče, Rsn] ste ga imenskih entitet, bi kazalo ponovno razmisliti o dodani vznemirili. vrednosti te kategorije na ravni oblikoskladnje. Če se V. Nesklonljivi prilastki (tip bruto, solo): pri izkaže, da je kljub vsemu koristna, bi se določene težave obravnavi nesklonljivih prilastkov se je smiselno opreti na dalo odpraviti z radikalnejšim posegom v smernice, npr. z preverjanje njihove sklonljivosti v dejanski rabi. odpovedjo iskanja prekrivnih občnih in lastnih Oblikovali smo pravilo, da če najdemo potrditev v samostalnikov in sledenju rabi, kakršna se v besedilih referenčnem korpusu, da se določen primer lahko pregiba pojavlja. Enako velja za obravnavo tujega besedišča, ki ga kot samostalnik, potem to opcijo upoštevamo, če pa po trenutnem sistemu med slovenske samostalnike potrditve ne najdemo, primer dosledno obravnavamo kot umeščamo precej popustljivo in obenem nedosledno. S pridevnik: so se do konca leta povprečne neto [neto, širjenjem označevanja na besedilne vrste, kjer je Ppnmein] plače realno povečale za okoli 33 odstotkov. tujejezičnih elementov več in v slovenščino prehajajo po VI. 
Prislovne zveze (tip na novo): v smernice smo manj predvidljivih vzorcih, bi bilo smiselno opredeliti dodali eksplicitno pravilo, da primere tega tipa jasen namen ločevanja po jezikih in oblikovati dosledne in obravnavamo kot zveze predloga in pridevnika. Na pripisljive kriterije zanj. Problem bi bilo dobro nasloviti primerih, ki so pregledovalcem predstavljali največ težav, celovito in podati rešitve za vse relevantne označevalne smo ponazorili, da obravnavamo nepredložni del zveze ravni, ne le lematizacijo in oblikoskladnjo. torej kot pridevnik in ne kot prislov, npr. Če bi se na [na, Druga večja skupina označevalnih težav je bila vezana Dt] hitro [hiter, Ppnset] ozrl, bi videl, da ga zasledujejo. na enakopisne oblike, pogosto pridevnike in prislove, pa tudi nekatere slovnične besedne vrste. Tudi tu je opaziti, 6. Zaključek in nadaljnje delo da se v smernicah pojavljajo semantični (ne le oblikoslovni in skladenjski) kriteriji za presojanje, kar pa Pregledovanje osnovnih označevalnih nivojev korpusa se je izkazalo za manj pereče od (po novem vsaj deloma SentiCoref predstavlja eno najobsežnejših kampanj te naslovljenih, ne pa povsem odpravljenih) dilem glede vrste v našem prostoru in – ob kampanji, ki se je uporabe referenčnih jezikovnih virov, npr. za osredotočala na gradivo računalniško posredovane opredeljevanje sklonljivosti. Pri tej skupini težav je komunikacije Janes (Čibej et al., 2018) – tudi eno prvih ključna ugotovitev, da označevanje tudi v ssj500k ni priložnosti za ponovitev dela z uporabo metodologije, ki potekalo povsem usklajeno, zato smo ob delu pripravili se je vzpostavila pri pripravi izhodiščne različice učnega seznam težav, ki bi jih bilo v prihodnosti smiselno korpusa. preveriti in ustrezno urediti za nazaj. Po opravljeni kuraciji, končni kontroli kvalitete Pri vsem pa je treba upoštevati, da je strojno označenega in statističnem pregledu dilem in popravkov je pripisovanje lem in oblikoskladenjskih oznak za mogoče potegniti nekaj zaključkov. 
Pomembno je, da so slovenščino že doseglo raven, ko bi bilo celovite ročne se pomanjkljivosti označevalnih smernic kazale zlasti pri preglede smotrno nadomestiti z delnimi, za katere pa bi temah, povezanih z označevanjem lastnih imen bilo treba razviti (referenčne in dokumentirane) postopke (samostalnikov, izlastnoimenskih pridevnikov), še zlasti za avtomatsko ali polavtomatsko identifikacijo pri odločitvah, ki so povezane s presojanjem, ali je problematičnih mest. Spoznanja, ki jih navajamo v določena beseda slovenska ali tujejezična. Ker korpus prispevku, so lahko izhodišče za takšno nadaljevanje. SentiCoref vsebuje atipično visoko število raznovrstnih Pregledani in popravljeni SentiCoref bo v nadaljevanju lastnih imen (tako je bil namreč zgrajen), smo pogosto projekta RSDO umeščen ob ostale besedilne množice, ki srečevali težave, ki so bile pri pripravi ssj500k redkejše in bodo sestavljale povečani učni korpus za slovenščino. V za smernice manj relevantne. prihodnje bomo v celotnem učnem korpusu izvedli še Obstoj kategorije lastnoimenskosti na ravni serijo polavtomatskih popravkov (npr. ali so enobesedni oblikoskladnje in posledično lematiziranje ob iskanju vezniki, kot je "zato", vedno ustrezno označeni kot prekrivnosti med občno- in lastnoimenskimi entitetami vezniki), s čimer bomo poskrbeli, da bodo enake dileme v odpira konceptualne težave, ki bi jih kazalo v ponovno celotnem učnem korpusu razrešene konsistentno. Na premisliti. Prva je, da je označevalno kategorijo najti samo podoben način bomo učni korpus primerjali tudi s pri samostalnikih, prekrivnost (po nekako drugačni logiki) Slovenskim oblikoslovnim leksikonom Sloleks iščemo tudi pri pridevnikih, ne pa pri ostalih besednih (Dobrovoljc et al., 2019), da npr. preverimo, ali se vrstah. Težava je tudi, da pri odločitvah glede zapisa leme glagolski vid glagolov v učnem korpusu ujema s z veliko ali malo začetnico na raven oblikoskladenjskega Sloleksom. 
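Napovedane polavtomatske kontrole (npr. ali je "zato" vedno ustrezno označen kot veznik) si lahko predstavljamo kot preprost pregled korpusa, ki izpiše besedne oblike z več različnimi pripisanimi oznakami; zapis podatkov v skici je poenostavljen in zgolj ilustrativen, ne dejanski format učnega korpusa.

```python
from collections import defaultdict

# Poenostavljen (ilustrativen) zapis: številka, oblika, lema, oznaka.
MINI_KORPUS = """\
1\tzato\tzato\tVp
2\tzato\tzato\tL
3\tin\tin\tVp
"""

def nedosledne_oblike(vrstice):
    """Vrne oblike, ki se v korpusu pojavijo z več različnimi oznakami."""
    oznake = defaultdict(set)
    for vrstica in vrstice.splitlines():
        if vrstica.strip():
            _, oblika, _, oznaka = vrstica.split("\t")
            oznake[oblika.lower()].add(oznaka)
    return {o: sorted(z) for o, z in oznake.items() if len(z) > 1}

print(nedosledne_oblike(MINI_KORPUS))  # {'zato': ['L', 'Vp']}
```

Tak seznam potencialno nedoslednih mest bi lahko služil kot vhod za delne ročne preglede, omenjene v zaključku, namesto celovitega ponovnega pregleda korpusa.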
V okviru projekta RSDO je istočasno z označevanja prenašamo vprašanja, ki se dotikajo nadgradnjo učnega korpusa potekala tudi nadgradnja pravopisa (oz. pravopisov, če upoštevamo, da se vse Sloleksa, zato smo nalogo prestavili na poznejši termin. dileme preslikavajo in potencirajo pri srečevanju s Učni korpus bo skupaj z nadgrajenimi označevalnimi tujejezičnimi elementi), pri čemer sistem sledi smernicami in ostalo dokumentacijo ob zaključku projekta predpostavki, da avtorji besedil pravopisu vedno sledijo. javnosti odprto na voljo na repozitoriju CLARIN.SI. PRISPEVKI 166 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 7. Zahvala flexible, web-based annotation tool for CLARIN. V: Projekt Razvoj slovenščine v digitalnem okolju Proceedings of the CLARIN Annual Conference (CAC) (RSDO) sofinancirata Republika Slovenija in Evropska 2014, Soesterberg, Nizozemska. unija iz Evropskega sklada za regionalni razvoj. Operacija https://www.clarin.eu/sites/default/files/cac2014_submi se izvaja v okviru Operativnega programa za izvajanje ssion_6_0.pdf. evropske kohezijske politike v obdobju 2014–2020. Tomaž Erjavec in Simon Krek. 2008. The JOS Raziskovalna programa št. P6-0411 (Jezikovni viri in morphosyntactically tagged corpus of Slovene. V: tehnologije za slovenski jezik) in št. P6-0215 ( Slovenski Proceedings. 6th International Conference on jezik - bazične, kontrastivne in aplikativne raziskave) je Language Resources and Evaluation (LREC 2008), str. sofinancirala Javna agencija za raziskovalno dejavnost 322–327, Marakeš, Maroko. European Language Republike Slovenije iz državnega proračuna. Avtorice in Resources Association (ELRA). avtorji se sodelujočim v označevalni kampanji iskreno http://www.lrec-conf.org/proceedings/lrec2008/pdf/89_ zahvaljujemo za vse delo, prav tako pa tudi recenzentoma paper.pdf. za relevantne in konstruktivne komentarje. Tomaž Erjavec. 2012. 
PRISPEVKI 168 PAPERS Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

Document Enrichment as a Tool for Automated Interview Coding

Ajda Pretnar Žagar,∗ Nikola Đukić,∗ Rajko Muršič†

∗Laboratory for Bioinformatics, Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana
ajda.pretnar@fri.uni-lj.si, nd1776@student.uni-lj.si
†Department of Ethnology and Cultural Anthropology, Faculty of Arts, University of Ljubljana, Zavetiška ulica 5, SI-1000 Ljubljana
rajko.mursic@ff.uni-lj.si

Abstract

While widely used in social sciences and the humanities, qualitative data coding remains a predominantly manual task. With the proliferation of semantic analysis techniques, such as keyword extraction and ontology enrichment, researchers could use existing taxonomies and systematics to automatically label text passages with semantic labels. We propose and test an analytical pipeline for automated interview coding in anthropology, using two existing taxonomies, the Outline of Cultural Materials and the ETSEO systematics. We show it is possible to quickly, efficiently and automatically annotate text passages with meaningful labels using current state-of-the-art semantic analysis techniques.

1. Introduction

Qualitative data coding is a well-established procedure in social sciences, particularly in sociology, cultural studies, oral history, and biographic studies. The technique is gaining ground in anthropology, where interview transcriptions abound. Ethnographic text coding can become a serious research technique, using existing ethnographic systematics, categories, vocabularies, and codes. Data coding facilitates the analysis of themes and close reading of the interview segments on each theme, which is one of the main analytical techniques of ethnographies in anthropology, be they computer-assisted or manual.

Computer-assisted qualitative data analysis (CAQDAS) is used to determine topics of interview segments, where the topics are not discrete but can overlap. The coder would normally define a codebook with the topics, then go over the text and label passages with corresponding tags. In the end, the coder can review selected topical passages, define topic co-occurrence, and extract a subset of documents on a specific topic.

Manual labelling can take a long time and requires a somewhat experienced coder to handle the tagging. However, we can construct an automatic pipeline for segment tagging due to the rapid development of natural language processing tools and language resources. The pipeline is built on the recent developments in ontology enrichment, which uses pre-defined ontologies (or taxonomies). Documents are preprocessed, and then the resulting tokens, typically words, are compared by similarity to tokens from the ontology. A simple approach is based on the TF-IDF¹ transform. In contrast, the current state of the art uses graph models (e.g., YAKE) and word embeddings (Godec et al., 2021) for determining concept similarity.

¹ TF-IDF is a document vectorisation technique which uses word counts to describe documents, weighing them based on overall word frequency.

Qualitative data coding is often based on grounded theory (Strauss and Corbin, 1997). The theory, which is more of an analytical approach, focuses on codes emerging from the data (Holmes and Castañeda, 2014) rather than imposing them. Coding can also stem from a linguistic paradigm, especially semantic approaches, where text would be labelled based on the occurrence of words in it. The first approach still requires human input, while the second is based on unsupervised machine learning. Thus, having a general ethnographic taxonomy or classification scheme enables researchers to inductively elicit prevalent topics from the data rather than devising elaborate codebooks in advance. Our contribution is applying semantic annotation and ontology mapping to interview transcripts.

Semantic enrichment of documents means assigning conceptually relevant terms to documents or document segments. The procedure can include automatic keyword extraction, which identifies relevant keywords in the text (Bougouin et al., 2013; Campos et al., 2020), or relating existing lists of terms to texts (Massri et al., 2019). The latter can be either unsupervised or supervised. Unsupervised refers to the terms being scored by their similarity to the text, with (multiple) terms assigned to each document if their similarity to the document is above a certain threshold. Supervised means the terms are used for document classification, where a document is assigned the most probable term.

In what follows, we propose a technique using unsupervised ontology enrichment to automatically label text segments with the corresponding topic labels. Automatic segment labelling uses existing (anthropological) taxonomies to label interview segments and thus assist researchers in navigating interview transcripts. The proposed technique doesn't apply only to anthropology – it could be used in any text analysis research. We use anthropology as a use case since the use of computer-assisted techniques is still somewhat rare in this discipline.

Finally, a short note on terminology. The term "ontology" is used in computer science to describe a structured hierarchical list of terms (Gruber, 1995), while in social sciences and the humanities, it means a branch of philosophy studying concepts of existence. In this paper, we use the term ontology in the former sense, sometimes referring to it as a taxonomy for clarity.

2. Interview transcripts

Interview transcripts are specific since they contain questions from the interviewer and answers from the interviewee. The transcripts are usually structured, with names or abbreviations denoting the speaker. If the interview is (semi-)structured, questions between different interviews will be very similar, if not identical. Moreover, interviewing a person often requires the interviewer to ask for clarification, affirm the interpretation of the answer or simply confirm (s)he understood what the interviewee said. Hence including questions in the analysis is often not a good approach.

Delineating between questions and answers depends on the structure of the digital document. A dedicated parser would consider new lines as segment delineations and names, pseudonyms, or initials as speaker identifiers. Ideally, the parser would consider the continuation of a reply, even when it was interrupted by the interviewer. But without proper co-reference resolution for the given language (Žitnik and Bajec, 2018), it is difficult to determine such conceptual segments.

3. Related work

Back in 1983, Podolefsky and McCarty had an interesting idea: how about using computers to help us navigate numerous ethnographic notes and transcripts? Those were the days when most anthropologists stored their data on physical paper. Navigating such texts apparently required duplicating pages to store them under various categories. Nowadays, this is no longer necessary. Ethnographic data is often multimodal and predominantly stored digitally. It includes images, videos, and audio recordings along with the text. When navigating digital text data, one can easily use the "find" function to look for different text segments, while similar techniques exist for navigating other data types.

Nevertheless, organising interview data is not an easy task, and there are ways computers can help. Podolefsky and McCarty (1983) proposed developing coding categories for marking text passages. This is the precursor to modern qualitative data analysis software, such as NVivo, Atlas.ti, or MaxQDA. These, too, require a predetermined set of categories used for labelling the data.

Modern computer-assisted qualitative data analysis (CAQDAS) approaches don't require using punched cards with per-page summaries to navigate the text, as was the case in earlier times. They can quickly retrieve segments tagged with the specified tag. MacMillan and Phillip (2010) use a semi-anthropological approach to better gauge the connection between venison price and cull effort. They conduct in-depth interviews with stalkers, people employed by the British estates that hunt wild game, and analyse the interviews with NVivo. They use the qualitative data from the interviews to corroborate the quantitative findings – deer hunting is deeply rooted in tradition and seen as a sport rather than economic activity.

Researchers studying sensorially-charged biographic experiences in Turku, Brighton, and Ljubljana defined the main categories with a larger list of subcategories. Coding only the translated transcripts and using Atlas.ti, they extracted similarly charged testimonies related to different sensations, for example, sounds (Venäläinen et al., 2020) or smells (Bajič, 2020).

Most commonly, CAQDAS is used in discourse analysis. Hitchner et al. (2016) analyse discourses on bio-energy to elicit key metaphors used to create common imaginaries. Using this approach, they were able to identify three discursive units that guide the bio-energy narrative. Cuerrier et al. (2015) identified 134 categories referring to climate change in 46 interviews conducted with the Inuit population in Nunavik. Next, they created ordinal and binary matrices describing the change in quantity and the presence or absence of topics. They used various statistical approaches to determine whether different communities of Nunavik differ in terms of knowledge of climate change. Both papers retrieve popular taxonomies created by the people under study.

Discourse analysis is also prominent in Schafer (2011), who uses Atlas.ti to analyse over 30 in-depth interviews with secular funeral officiants called "funeral celebrants" in New Zealand. The author identified key conceptual categories in funeral celebrant ethnographies, specifically the narratives on connection, identity, and personalisation of funeral practices.

CAQDAS can also be used to retrieve relevant text passages. Yilmaz et al. (2019) conducted 30 interviews with highly educated Turkish-Belgian women to determine the factors affecting their marriage choices. They stem from grounded theory and use predetermined codes for the first round of coding, then refine and enhance their codebook later. With iterative codebook improvements, they determined women's decisions and the driving factors behind them, for example, the structural and general constraints in marriage choices.

Conversely, Wehi et al. (2018) do not use CAQDAS software but instead observe raw word frequencies in Māori oral tradition. They collected ancestral sayings called whakataukī and identified references to animal extinctions in the data.

It is interesting to note that many contributions using quali-quantitative text analysis were published in the Human Ecology journal, which testifies to the (still) marginal use of these methods in anthropology. Ideally, we will see many more journals willing to publish such research and more researchers ready to use these tools in practice. Longan (2015) expresses the sentiment to perfection: "There is room for innovation in the creation of technological aids to facilitate mesoscale qualitative online research that lies between massive data sets and small qualitative studies. Though the major qualitative software suites have improved over time, much of the process is still tedious and requires hours of snorkelling and coding by hand." First, he explicitly points to the neither-big-nor-small issue of many contemporary anthropological studies. Even organising just thirty interview transcripts can be complicated, let alone a hundred records. Yet one hundred records can hardly be described as "big data" requiring "big tools". There's a need for a mid-level tool to help organise the data in a time-efficient way. Second, he points to the issue of coding by hand, which takes time and effort from the researcher. Third, he identifies an opportunity for technological innovation for qualitative data analysis that surpasses modern qualitative analysis software.

Previously, ontology enrichment for labelling text passages was used predominantly in biology and medicine (Bifis et al., 2021). In social sciences and the humanities, automated segment labelling was expressed as more of a wish than a reality (Hoxtell, 2019). In contrast to CAQDAS, ontology enrichment provides a way to automatically label large amounts of text in a short period of time. At the same time, it enables relating interview transcripts to existing domain-specific ontologies. Our contribution showcases automated interview segment labelling with existing ontologies, thus providing a practical example of how machine learning can support ethnographic analysis.

We propose an approach using ontology enrichment from computer science to help organise and structure interview transcripts, fieldwork notes, and archive data. The three-fold example described below is a prototype for machine-assisted data coding, which uses standard anthropological taxonomies, such as the Outline of Cultural Materials (Bernard, 1994, p. 519-528), or more local and specific ethnographic taxonomies, related to the European ethnology studies of the so-called folk or traditional culture (Kremenšek et al., 1976), to label text passages.

4. Ontologies as codebooks

Instead of pre-defining codebooks for manual coding, we propose to use existing anthropological taxonomies to automatically label the data. One such well-established taxonomy, which we call an "ontology" in text mining, is the Outline of Cultural Materials. Human Relations Area Files is a non-profit research organisation whose aim is to foster cross-cultural research (Melvin, 1997). One of its key achievements is the establishment of several databases that contain previous cross-cultural research. The database entries, such as ethnographic reports, are indexed using the Outline of Cultural Materials (OCM), an ethnographic subject classification system developed by Murdock and colleagues (Murdock et al., 1969; Ford, 1971).

The taxonomy is designed in a decimal classification system, similar to the librarian Universal Decimal Classification. Its main categories start with Orientation (10), Bibliography (11) and Methodology (12), and end with Socialisation (86), Education (87) and Adolescence, Adulthood, and Old Age (88). The categories are still very general, so more specific categories must be coded additionally.

Ethnographic systematics (ETSEO) is derived from continental ethnographic practices, mostly interested in the traditional culture of the European peasantry. Its taxonomy is hierarchically extensive, starting with the essentially defined material, spiritual and social culture categories. Since the taxonomy was designed for museum archives, the most detailed is the field of material culture, subdivided on as many levels as necessary, and the taxonomy in general fits folk taxonomy and practices. Spiritual culture is further divided into general categories comprising folklore, ritual practices, and art-related activities. Less detailed is the so-called "social culture" field containing festivities in a calendar year, celebrations of life events, and communal activities, practices, and rules. This system is much more detailed but, at the same time, only partly decimally classified and only somewhat comparable to the OCM taxonomy. It was designed for classical archive work and is now only partially accepted as a digitised taxonomy.

OCM's main aim was to facilitate searching the large database of ethnographic entries and organise basic information on ethnic and social groups. Hence it is easy to extend the idea of an ethnographic classification system to a codebook – each entry represents a concept relevant to describing a culture. One could use the well-defined system with descriptions of categories to automatically tag text passages with relevant ethnographic concepts. For example, if a passage describes using outdoor toilets, the corresponding codes should be "744 Public Health and Sanitation", "515 Personal Hygiene", "336 Plumbing", and "312 Water Supply". Besides the already existing taxonomies for ethnographic materials (OCM and ETSEO), it is useful to produce native or folk taxonomies as "a description of how people divide up domains of culture, and how pieces of a domain are connected" (Bernard, 1994, p. 386). Automated accurate tagging would enable quickly retrieving relevant parts of the text on the one hand and observing dominant topics and their inter-relatedness on the other.

5. Document enrichment

Analysis of interview transcripts would normally include labelling documents or interview segments with corresponding codes, identifying topics/codes, observing their frequencies in the corpus, and retrieving interview segments for a given topic/code. We show how to perform these tasks in the visual-programming data mining tool Orange (Demšar et al., 2013). The workflow for replicating the analysis (Figure 1) is available online (Pretnar Žagar, 2022b), along with a Slovenian translation of the OCM ontology (Pretnar Žagar, 2022a). The corresponding data are not publicly available due to privacy issues.

Figure 1: Workflow for ontology enrichment and extracting interview topics from the annotated visualisation.

5.1. Data and preprocessing

To demonstrate how contemporary ontology enrichment and semantic analysis approaches can be used in anthropology, we are using interview transcripts from twenty interviews on smart buildings (Pretnar and Podjed, 2019). The interviews are in colloquial Slovenian and describe the experiences and struggles of faculty staff with a smart building. Each interview is segmented into questions and answers. Each answer represents an utterance and constitutes a single document in the final corpus, resulting in 1126 data instances. The metadata includes the question, the interviewee, and the interview date.

Tokens are constructed by passing the text through the CLASSLA pipeline for non-standard Slovenian. Then, lemmas and POS tags are retrieved, and only nouns and verbs are kept for the analysis. Tokens are used to compute document embeddings, a mean-aggregation of word embeddings based on fastText models (Bojanowski et al., 2017). We tried simple lowercasing, Lemmagen lemmatization (Juršič et al., 2010) and stopword removal for preprocessing, but the results were not as informative (they mostly contained generic verbs, such as to have and to go, discourse particles and fillers). Moreover, while SBERT embeddings generally perform better due to their context-parsing abilities, they produced worse results in the t-SNE visualisation. Specifically, fastText identified a group of segments with short, unspecific replies (e.g., "Yes.", "Uh-huh."), while SBERT did not.

5.2. Identifying topics

Generally, the researchers will know which topics the corpus covers because often, they will be its creators. In the case of interviews, the researcher is likely also the interviewer who guided the interview based on research questions. However, ethnographic narratives often take unexpected turns or focus on unforeseen details, which the researcher can uncover by coding the data and iteratively refining the codebook. Alternatively, one can use document maps, where segments with semantically similar content will lie close together.

To semantically represent the content of interview segments, we pass them to document embedding. The procedure takes the words (tokens) identified in preprocessing and finds their vector representation. The representation models the meaning of the words in a way that relates "king" to "prince" and "queen" to "princess". Once the embedding of each word is retrieved, words from the document are aggregated into the mean document vector. This numeric representation is used to plot a t-SNE document map, where segments with similar content lie close to each other². But a bare map is not very informative on its own. Hence, we added Gaussian mixture models to identify groups of segments and retrieve their characteristic words (Figure 2). The procedure identified segments referring to air quality (green cluster), lighting (magenta cluster), room descriptions (yellow cluster), and so on.

² In t-SNE, we selected a larger group of segments for annotation. There was a smaller group of 121 segments representing short replies, such as "yes", "no", and "I don't know".

Figure 2: t-SNE document map with annotated semantic groups.

5.3. Exploring topical segments

Ontologies can be used to enrich interview segments by measuring how similar given ontology terms are to each segment. Automatic identification of segments helps researchers quickly identify relevant parts of the interview.

For example, we can look for "delo" (orig. 350 equipment and maintenance of buildings) and its child terms from the OCM ontology in the corpus (Figure 3). Selected terms from the ontology are used for semantic annotation of interview segments.

Figure 3: Selecting a part of the OCM ontology referring to work ("delo") and work-related terms.

Semantic annotation scores each segment by how similar its sentences are to the input terms, using SBERT embeddings (Reimers and Gurevych, 2019). SBERT was used because it specialises in sentence embeddings and considers word context. Ideally, this procedure identifies passages talking about work-related topics, including breaks, employment, paychecks, and work relations. One can sort the results either by the overall segment score, an aggregate of all sentence scores, or by matches, which counts how many input words appear in the segment.

Here, we show the latter option, namely displaying the segments with the most matches. We have selected all the segments matching any of the input terms and highlighted them (Figure 4). Ontology enrichment successfully identified segments discussing the office environment, research work, work routine, schedules, weekend work, etc.

Figure 4: Annotating text segments with a part of the ontology referring to work ("delo").

5.4. Assigning terms to segments

The final goal of any automated coding system would be to return a corpus with assigned codes. We prototyped a procedure that uses the above technique of semantic scoring to identify the code with the highest score for each segment. We decided on a 0.6 cosine similarity threshold for a code to be assigned, which means some segments were left without a corresponding code. After loading the corpus, we remove all the interview segments without any codes. We retain 252 segments with codes and observe their frequencies. The results are somewhat promising but with some obvious errors (Figure 5).

Figure 5: Top 10 codes identified in the corpus. While some are plain wrong, most are quite accurate and useful.

The most frequent code is "luč" (light), which is indeed a very prominent topic in the corpus. Then the results get a little strange. The next two topics are "svaki in svakinje" (brothers- and sisters-in-law) and "tipi porok" (marriage types), which are not among the interview topics. The errors are likely caused by the multilingual SBERT model used for word embedding, which sometimes cannot distinguish between South Slavic languages. For example, it considers the Slovenian slang term "ratal" (succeeded) to be related to "war" based on its similarity to the Serbian "rat" (war).

However, there are some quite relevant topics among the top ten codes, for example, "toplota" (warmth), "podnebje" (climate), "dnevi počitka in dela prosti dnevi" (rest days and holidays), "stranska poslopja" (outbuildings), and "bivališča" (dwellings). Clicking on a label, for example, "toplota" (warmth), outputs text segments discussing the interviewees' attitude to temperature regulation. With a few steps, the researcher can identify and extract interview segments discussing a specific topic and read them to better understand the context of these segments and which subtopics the respondents deem relevant. For example, the texts on temperature regulation mostly refer to difficulties with adjusting office temperature.

The system could be improved with specifically developed language resources for non-standard Slovenian. Nevertheless, even in its current imperfect form, it can be a useful tool for semi-automated coding, where the researcher can manually adjust the suspicious or incorrect codes.

5.5. Comparison to the ETSEO taxonomy

While the OCM taxonomy is widely recognised in the anthropological community, the ETSEO taxonomy is strictly regional. The project Ethnological Topography of Slovenian Ethnic Territory (ETSEO) was launched in 1971 by a large group of Slovenian ethnologists led by Slavko Kremenšek. The project entailed the development of questions based on ethnological systematics, ethnographies of Slovenian towns and cities (18 in total), and detailed ethnographies on specific topics. The taxonomy is a result of the first part of the project, namely the questions and detailed ethnological systematics. The ETSEO questions were published between 1976 and 1977 in twelve books, including the introductory volume with reports of ethnological institutions (Kremenšek et al., 1976) and eleven volumes of topical presentations and suggested questionnaires. The series served as a theoretical and practical guide for ethnographic fieldwork (Ravnik, 1996).

The ETSEO taxonomy contains 53 areas of ethnographic interest. Still, it lacks an explicit hierarchy, although it follows the classical division of ethnographic material for the so-called folk culture: material (volumes I to V), social (volumes VI to VIII) and spiritual (volumes IX to XI). A rough hierarchy could be formed from the eleven books in which these questions were published, but the books lack hypernyms. Hence, we use it as a flat taxonomy. There are fewer relevant areas to choose from than in the OCM. However, looking for "tehnično znanje" (technical knowledge) returns relevant interview passages (Figure 6).

Figure 6: Matches for the ETSEO entry "technical knowledge".

The ETSEO taxonomy is less useful than the OCM taxonomy. This is due to the somewhat outdated nature of the questions, which were based on the main foci of Slovenian ethnology and were less relevant for anthropology. They are missing some key contemporary areas of anthropology, namely media, urban areas, internet communities, and migration. Nevertheless, the taxonomy could be extremely useful for older ethnographic texts and, with some updates, even for contemporary materials.

6. Conclusion

Anthropology can greatly benefit from the recent developments in text analysis. Ontology enrichment, along with other data exploration and visualisation methods, is a useful tool providing an overview of the collected data.

In a time when anthropologists are using larger corpora (Culhane and Elliott, 2016), when data is created online for many different purposes (Wang, 2012), and when anthropologists use online platforms to store raw ethnographic multimedia data (Przybylski, 2021), it is of utmost importance to store and later archive data meaningfully, using relevant classification and coding systems. It is even more important in archival work, which is no longer just an additional part of anthropological research, supplementing ethnographic fieldwork, but is becoming highly relevant for digital aspects of our lives.

Updating taxonomic systems is an urgent task for anthropologists. However, using existing taxonomies to explore and visualise data already benefits the analytic process, especially in re-studies and comparative research. Classical anthropological coding of ethnographic material is no longer possible, so automated coding is the first step to expanding the range of anthropological data analysis. However, in the absence of specialised word embedding models for Slovenian (SBERT is currently multilingual and conflates South Slavic languages), the approach does not yet achieve the accuracy of a human annotator.

While automated coding, particularly for languages with fewer language resources, still has a long way to go before it is comparable to human input, it facilitates data exploration and extracting general topics from the text. Ontology enrichment tools support the iterative analytical process of ethnography. They provide a starting point for forming new research questions, enhancing existing ones, and can be easily repeated on new data.

Several improvements could be made to automated coding for the Slovenian language:

• Developing a Slovenian-only sentence transformer to be used in semantic search.
• Re-writing transcripts in standard Slovenian or further improving CLASSLA to handle slang terms and non-standard Slovenian.
• Implementing co-reference resolution for Slovenian to resolve issues with indirect references in text, further clarifying the exact content of the document.

While these improvements would greatly enhance coding capabilities for Slovenian, they are, for the most part, available for larger languages, thus already enabling similar research.
7. Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency research programme P6-0436 "Digital Humanities: resources, tools and methods" (2022–2027) and by the DARIAH-SI research infrastructure.

8. References

Blaž Bajič. 2020. Nose-talgia, or, olfactory remembering of the past and the present in a city in change. Ethnologia Balkanica, 22:61–75.

H Russell Bernard. 1994. Research Methods in Anthropology: Qualitative and Quantitative Approaches. Sage Publications, Thousand Oaks, London, New Delhi.

Aristeidis Bifis, Maria Trigka, Sofia Dedegkika, Panagiota Goula, Constantinos Constantinopoulos, and Dimitrios Kosmopoulos. 2021. A hierarchical ontology for dialogue acts in psychiatric interviews. In The 14th PErvasive Technologies Related to Assistive Environments Conference (PETRA 2021), pages 330–337, New York, NY, USA. Association for Computing Machinery.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Adrien Bougouin, Florian Boudin, and Béatrice Daille. 2013. TopicRank: Graph-based topic ranking for keyphrase extraction. In International Joint Conference on Natural Language Processing (IJCNLP), pages 543–551.

Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289.

Alain Cuerrier, Nicolas D Brunet, José Gérin-Lajoie, Ashleigh Downing, and Esther Lévesque. 2015. The study of Inuit knowledge of climate change in Nunavik, Quebec: A mixed methods approach. Human Ecology, 43(3):379–394.

Dara Culhane and Denielle Elliott. 2016. A Different Kind of Ethnography: Imaginative Practices and Creative Methodologies. University of Toronto Press, North York, Ontario, Canada.

Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, et al. 2013. Orange: Data mining toolbox in Python. The Journal of Machine Learning Research, 14(1):2349–2353.

Clellan S Ford. 1971. The development of the Outline of Cultural Materials. Behavior Science Notes, 6(3):173–185.

Primož Godec, Nikola Đukić, Ajda Pretnar, Vesna Tanko, Lan Žagar, and Blaž Zupan. 2021. Explainable point-based document visualizations. arXiv preprint arXiv:2110.00462.

Thomas R Gruber. 1995. Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5–6):907–928.

Sarah Hitchner, John Schelhas, and J Peter Brosius. 2016. Snake oil, silver buckshot, and people who hate us: Metaphors and conventional discourses of wood-based bioenergy in the rural southeastern United States. Human Organization, 75(3):204–217.

Seth M Holmes and Heide Castañeda. 2014. Ethnographic research in migration and health. In Migration and Health: A Research Methods Handbook, pages 265–277.

Annette Hoxtell. 2019. Automation of qualitative content analysis: A proposal. In Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, volume 20.

Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. LemmaGen: Multilingual lemmatisation with induced ripple-down rules. Journal of Universal Computer Science, 16(9):1190–1214.

Slavko Kremenšek, Vilko Novak, and Valens Vodušek. 1976. Etnološka topografija slovenskega etničnega ozemlja. Uvod. Poročila. Raziskovalna skupnost slovenskih etnologov, Ljubljana.

Michael W Longan. 2015. Cybergeography IRL. Cultural Geographies, Special Issue: New Methods in Cultural Geography, 22(2):217–229.

Douglas Craig MacMillan and Sharon Phillip. 2010. Can economic incentives resolve conservation conflict: The case of wild deer management and habitat conservation in the Scottish Highlands. Human Ecology, 38(4):485–493.

M Besher Massri, Sara Brezec, Erik Novak, and Klemen Kenda. 2019. Semantic enrichment and analysis of legal domain documents. Artificial Intelligence, page 2.

George Peter Murdock, Clellan S. Ford, Alfred E. Hudson, Raymond Kennedy, Leo W. Simmons, and John W. M. Whiting. 1969. Outline of Cultural Materials. Human Relations Area Files, New Haven.

Aaron Podolefsky and Christopher McCarty. 1983. Topical sorting: A technique for computer assisted qualitative data analysis. American Anthropologist, 85(4):886–890.

Ajda Pretnar and Dan Podjed. 2019. Data mining workspace sensors: A new approach to anthropology. Prispevki za novejšo zgodovino, 59(1):179–196.

Ajda Pretnar Žagar. 2022a. OCM ontology – Slovenian. Figshare. https://doi.org/10.6084/m9.figshare.19844107.v1.

Ajda Pretnar Žagar. 2022b. OCM ontology enrichment. Figshare. https://doi.org/10.6084/m9.figshare.19787065.v1.

Liz Przybylski. 2021. Hybrid Ethnography: Online, Offline, and in Between. Sage Publications, Los Angeles; London; New Delhi; Singapore; Washington DC; Melbourne.

Mojca Ravnik. 1996. Način življenja Slovencev v 20. stoletju. Traditiones, 25:403–406.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Cyril Schafer. 2011. Celebrant ceremonies: Life-centered funerals in Aotearoa/New Zealand. Journal of Ritual Studies, 25(1):1–13.

Anselm Strauss and Juliet M Corbin. 1997. Grounded Theory in Practice. Sage.

Juhana Venäläinen, Sonja Pöllänen, and Rajko Muršič. 2020. The street. In The Bloomsbury Handbook of the Anthropology of Sound.

Tricia Wang. 2012. The tools we use: Gahhhh, where is the killer qualitative analysis app? http://ethnographymatters.net/blog/2012/09/04/the-tools-we-use-gahhhh-where-is-the-killer-qualitative-analysis-app/.

Priscilla M Wehi, Murray P Cox, Tom Roa, and Hēmi Whaanga. 2018. Human perceptions of megafaunal extinction events revealed by linguistic analysis of indigenous oral traditions. Human Ecology, 46(4):461–470.

Sinem Yilmaz, Bart Van de Putte, and Peter AJ Stevens. 2019. The paradox of choice: Partner choices among highly educated Turkish Belgian women. DiGeSt. Journal of Diversity and Gender Studies, 6(1):5–24.

Slavko Žitnik and Marko Bajec. 2018. Odkrivanje koreferenčnosti v slovenskem jeziku na označenih besedilih iz coref149. Slovenščina 2.0: Empirical, Applied and Interdisciplinary Research, 6(1):37–67.

Parliamentary Discourse Research in History: Literature Review

Jure Skubic¨, Darja Fišer*¨
¨Institute of Contemporary History, Privoz 11, 1000 Ljubljana, Slovenia
*Faculty of Arts, University of Ljubljana, Aškerčeva cesta 2, 1000 Ljubljana, Slovenia
jure.skubic@inz.si, darja.fiser@ff.uni-lj.si

Abstract

Historical research of parliamentary discourse focuses not only on the origins but especially on the development of parliamentary discourse. It is predominantly based on textual data analysis, employing various methodological frameworks. In this literature review, we provide an overview of these methods and present the commonalities and differences between the approaches established in history and corpus-driven approaches. This allows for a better understanding of the historical analysis of parliamentary discourse and highlights the importance of the ParlaMint project and of the integration of parliamentary corpora into historical research.
1. Introduction

Parliamentary discourse is a salient research topic in both humanities and social science disciplines, such as sociology, political science, sociolinguistics, and history. Historical research in particular is highly interested in studying not only the origins but also the development of parliamentary discourse. History often focuses on researching parliamentary debates and, as Ihalainen (2021) observes, in historical research parliamentary debates can be approached analytically as nexuses of past political discourses, which means that they can be viewed as "meeting places" where various political discourses have intersected in a certain time and space.

This literature review is one in a series of literature reviews conducted in the context of the ParlaMint project (Erjavec et al., 2022); a similar review has been compiled for sociological research (Skubic and Fišer, 2022). The ParlaMint project develops comparable corpora of parliamentary proceedings from more than 20 European countries, accompanied by literature overviews, showcases and tutorials which will hopefully help maximize the use of these corpora in the different disciplinary communities interested in analyzing parliamentary debates. This literature review summarizes historical research of parliamentary debates and the most popular research methods employed. It needs to be explicitly noted, however, that despite the obvious usefulness of the ParlaMint corpora, researchers ought to also consider other qualitative and quantitative data and information in order to come to objective and unbiased conclusions. Also, in this review we focus mostly on written parliamentary records, since the main interest of the ParlaMint project is in written parliamentary sources. However, the importance of other sources, such as surveys, records of election results, territorial records, etc., must be recognized as well, since they present an important part of historical research.

The review is structured as follows. In the first part, we describe the selection procedure of the relevant articles and briefly enumerate the methods they employ. This allows for a better understanding of the methods most frequently employed in the historical analysis of parliamentary discourse. In the second part, we summarize the articles we identified in terms of 1) the main aim and topic of the research, 2) the methods used, 3) the data collection methods, and 4) a short discussion of the possible improvements and/or problems of the research. We conclude the review with a discussion of how historical research could benefit from corpus data and corpus research methods.

2. Literature Selection and Methods

As Torou et al. (2009) show, the main objective of history is to recreate the past by researching and analyzing existing records and their interconnectedness. It is through this process that historians employ their academic knowledge, rely on experience, and decide on the relevant information and the appropriate sources from which this information is extracted. Especially in political history, it is uncommon for historians to rely on only one type of source; they rather draw on various so-called primary and secondary sources. The former are most commonly gathered from historical archives, since they include documents or artefacts created by the participants in an event or by its witnesses, whereas the latter include oral sources, newspapers, memoirs, visual representations, practices, etc. This means that an important factor in historical research is understanding the nature of information as well as the research methodologies and models historians use while conducting research (ibid.).

Although the variety of issues and approaches in political history is large, an emerging and quite narrow focus of political history is on analyzing the history of parliamentary discourse and political debates. Ihalainen and Saarinen (2019) show that political history frequently builds its research on textual data (documents, diaries, texts), although sometimes the exact textual methods used are not explicated. They also note that when conducting textual analysis, historians often draw on selected methodological tools from methods which are otherwise common in the humanities and social sciences, especially in qualitative sociological research, such as (critical) discourse analysis and content analysis. In addition to those and to other fields which include the study of history (memory studies, conceptual history, etc.), researchers sometimes opt for a mixed-methods approach, corpus-assisted discourse studies or text mining.

2.1 Selection of Articles

The reviewed articles were carefully selected from among hundreds of sources which focus on parliamentary debates by considering several important research criteria. We identified the following scholarly search engines to look for the articles:

• Taylor and Francis Online (https://www.tandfonline.com),
• SAGE Journals (https://journals.sagepub.com),
• Wiley Online Library (https://onlinelibrary.wiley.com),
• Semantic Scholar (https://www.semanticscholar.org),
• MUSE Project (https://muse.jhu.edu),
• JSTOR (https://www.jstor.org),
• Elsevier (https://www.elsevier.com), and
• Google Scholar (https://scholar.google.com).

We applied the following filters in order to identify the relevant articles:

• Publication period: 2012–2022,
• Discipline: History,
• Article ranking: 'most relevant' and 'most cited',
• Relevant journals: sometimes we needed to apply an additional filter where we selected relevant historical journals.

Using these filters, the most prominent historical journals were identified, such as Parliamentary History, Historical Research, Memory Studies, Contributions to the History of Concepts and Historical Social Research, although articles included in this review were also published elsewhere. All articles whose titles were considered potentially relevant were skimmed; we specifically analyzed the abstract, methodology and analysis sections to confirm the relevance of the articles. A high number of articles was discarded either because of a lack of methodological explanation or because the analysis did not focus on parliamentary data. In this review we wanted to include only those articles which dealt specifically with parliamentary records and/or legislative documents, and the majority of the selected research conformed to this criterion. Some of the articles, however, also included other sources, which emphasizes the fact that historians use a variety of sources when researching parliamentary discourse. This also shows that although parliamentary records can present one of the primary sources for historical research (and projects such as ParlaMint would be helpful in providing relevant data), historians still often opt for a broader research perspective and combine parliamentary records with other, complementary sources of data in their research.

2.2 Overview of Methods

A total of 27 articles were initially determined to be relevant for our literature review and are listed in a Google spreadsheet¹. We then retained only those that clearly described the method and the data used, taking into account only the papers which primarily used parliamentary records as a source. This resulted in 11 articles, which were then submitted to a more detailed analysis. Since the research questions were so heterogeneous, we did not group the articles thematically.

¹ https://docs.google.com/spreadsheets/d/13mF_X3OB9CKtdfsUFDLPJZJ44VcxZ1uv9OzAE2E_E-I/edit#gid=1690588464

We reviewed predominantly articles which focused on historical research of parliamentary discourse and political communication. Out of the 11 reviewed articles, 3 employed the methodological framework of Discourse Analysis, 2 employed Content Analysis, 1 opted for the method of Memory Studies, 2 used a mixed-methods approach, 2 employed the framework of Conceptual History (Begriffsgeschichte), and 1 employed the method of Topic Modelling.

3. Reviewed Research and Employed Methods

In this part of the literature review, we give a detailed account of the historical research that analyzes parliamentary discourse and political communication, as well as of the methods employed. We provide a short description of each methodological framework and show why it is important for historical research. Then, we give an overview of the studies which employed this method.

3.1 Conceptual History

Conceptual History (Begriffsgeschichte) is a strand of historical studies which deals with historical semantics and the evolution of paradigmatic ideas and value systems over time. It was first defined by Koselleck in 1997, who shows (as cited by Litte, 2016) that the major aim of conceptual history is to uncover the logic and semantics of the concepts that have been used to describe historical events and processes, in addition to being interested in the historical evolution of some concepts over time. Ihalainen and Saarinen (2019) note that Conceptual History, when combined with Political History, mostly focuses on past human interaction and communication, and understands discourses as central interlinked elements of political processes, events, and action.

Interest in the field of Conceptual History was quite high in 20th-century Germany, especially in historical research on World War II. Later, the field became prominent in political history for the analysis of political communication and events. As shown by Litte (2016), conceptual history has three main tasks: firstly, to identify the concepts with which history can be characterized; then, to locate those concepts in the context of political or social discourses; and finally, to critically evaluate those concepts for their usefulness for historical analysis.

3.1.1 Debates on Democracy in Sweden

Research problem: Friberg's (2012) article aims to explore the concepts of democracy that were used in Sweden, and especially focuses on how these concepts were used by the Social Democrats (SDP) during the interwar years, when the party was establishing itself and its political agenda. It examines the Swedish parliamentary rhetoric about democracy after the full suffrage reform.

Research method: The author employed German Begriffsgeschichte (the conceptual history approach) as introduced by Koselleck, and the theory of ideologies (Freeden, 1998, as cited by Friberg, 2012). According to Friberg, these two methods complement each other, since conceptual history emphasizes how the socio-political context influences the changing meaning of a concept, whereas the theory of ideologies finds the meaning of a concept dependent on its morphological structure.

Data collection: The main source of data for this article were the debates in the Swedish Parliament during the interwar years. In addition, the author used other governmental materials, such as reports from different committees. Both sources were only available as hardcopies, but they provided coherent source materials. The debates which were analyzed were chosen according to two important criteria. First, the debates needed to be explicit discussions in Parliament and needed to focus on the concept of democracy in the interwar years. Second, the debates had to be related to a topic that a political party (in this case the Social Democrats) claimed was connected to democracy in a certain way. The debates which conformed to the first criterion were identified through the subject index of the governmental records, whereas the debates which needed to observe the second criterion were recovered through an extended analysis of materials such as party manifestos, newspaper articles and records from the congress. This was necessary to get a sense of what the SDP claimed to be connected to democracy and then compare those records with the parliamentary records. In addition, the author analyzed articles from the Social Democratic journal Tiden, which throughout the 20th century was one of the most important Social Democratic newspapers for conducting internal debates. The analysis of these articles added to the reliability of the conceptual analysis.

Discussion: One of the problems with data collection was that all the records were accessible only as hardcopies and not electronically. Although the author gives no information about this, we can assume that the documents needed to be thoroughly read and notes taken. Also, parliamentary records do not exactly depict the actual debates, since the process from an actual debate to a printed record used to be rather complicated and long. This results in sometimes significant differences between the actual speech and the written text. This long process of editing, changing, and adapting the actual text to be suitable for a printed version results in the data not objectively depicting what was said during the debate.

3.1.2 Debates on Immunity in France and Romania

Research problem: In this article, Negoita (2015) analyzes the concept of parliamentary immunity. His main goal is to identify not only the historical premises but also the linguistic, political, and legal instruments that played a part in the conceptualization of parliamentary immunity in two countries, France and Romania. This article, therefore, although historical in nature, employs an interdisciplinary perspective when studying parliamentary discourse and investigates the concept of the word "immunity" as used in parliamentary discourse.

Research method: The author employs the methodological framework of Conceptual History and makes a comparative analysis of the two aforementioned countries. We could therefore understand this method as comparative conceptual analysis.

Data collection: The data was collected from a variety of sources, which were mostly not parliamentary ones. For French, dictionaries (Le Grand Robert de la langue française, Dictionnaire de l'Académie française, etc.), scientific works which focused on the history of French parliamentarism (Histoire de France or Les caractères ou les mœurs de ce siècle), as well as various political documents and the French Constitution were used. Romanian data was likewise gathered from dictionaries (Dicționar al instituțiilor feudale din Țările Române, for example), as well as from various historical documents and different versions of democratic Constitutions. What all the documents had in common was that, although they were not strictly records of parliamentary debates, they did focus on parliamentary and political language and discourse.

Discussion: This research is slightly different from the others in this review, since it does not draw directly on parliamentary records. The analysis successfully shows how historical analyses frequently draw on sources other than explicitly parliamentary data.

3.2 (Collective) Memory Analysis

Memory analysis combines intellectual strands from various domains, such as history, sociology, anthropology, education, etc. Since this is an emerging field of research, its qualitative and quantitative methodological tools are not yet fully developed. Instead, researchers who conduct memory analysis usually borrow methodological tools from other social sciences and adapt them for their own purposes. These methods frequently include content and (critical) discourse analysis.

The main aim of memory analysis is the study of the forms and functions of representing the past. Data collection includes a careful examination of primary historical sources and archival studies, as well as secondary sources such as case studies, interviews, surveys, and eyewitness reports. Once the data is collected, the aforementioned methodological tools are employed to thoroughly analyze it. Memory analysis frequently also includes research on collective memories and narratives. Collective memory, as defined by Hogwood (2013), is a concept used across disciplines to refer to the ways the past is "perceived, shaped, and constructed"; its main aim is to extract useful data from collective conversations, the sharing of ideas, and media. This then leads to a synthesis of voices and the formation of a common information thread among peers.

One of the major methodological problems inside memory analysis is that when researchers conduct research, they usually use whatever evidence is readily available, without digging deeper into the event and researching it more thoroughly. This points to the fact that even though memory analysis is a useful field of historical analysis, researchers must be attentive and employ other approaches with which they can confirm and legitimate the findings of memory analysis.

3.2.1 The Nation in Parliamentary Discourse on Immigration

Research problem: De Saint-Laurent (2014) focuses on exploring the meaning that is attributed to the national group. The aim of her article is to analyze collective memories (which she calls narratives) and show what meaning they give to the nation, how this meaning is produced, and how the stories told by different groups relate to one another.

Research method: She employs a qualitative analysis of collective narratives of the past. In connection with memory analysis, she employs dialogism as a methodological tool, since the analysis of dialogic overtones helps reconstruct the social processes through which the discourse is produced.

Data collection: It needs to be noted that this article is an analysis of the meaning given to the concept of the nation in the French parliamentary debates over a bill on Immigration and Integration. The data used consisted of the official transcripts of fifteen parliamentary sessions which took place between May 2 and May 17, 2006. In addition, the author also included the vote session which took place on June 30, 2006. All documents are made available to the public through the official parliamentary website. In addition to using the general reactions of the Assembly, the author used transcripts of the participants' interventions and interruptions from the entire sessions. Once the author determined the datasets, she began with relevant data selection, which happened in three stages. In the first stage, the author identified those excerpts which were relevant for the study of the role of collective memory. She did this with the help of the Nvivo² software (QSR International Pty Ltd., 2020). In this stage the author also extracted relevant references by carefully reading through all the debates and employing a keyword search, which contributed to pinpointing the indirect references. The second stage was the coding stage, where firstly thematic coding took place to map out the relevant historical periods, and secondly the groups to which the speakers belonged were coded into two categories: political party and outside the political spectrum. In the third stage, the fragmented excerpts and data were used to reconstruct past narratives, which were then thoroughly analyzed.

² https://www.qsrinternational.com/nvivo-qualitative-data-analysis-software/home

Discussion: This paper is not only historical, since the author herself notes that it also adopts a "socio-cultural psychological perspective on memory" (ibid.). She also notes that because of the reconstructive aspect of her analysis, she checked the narratives against certain complementary sources (research in French newspapers, blogs, websites, etc.). This made the research much more reliable.

3.3 Discourse Studies

Van Dijk (2018) uses the term discourse studies to refer to a field of research which includes various qualitative and quantitative methods and different genres, such as news reports or parliamentary debates. This field emerged in the 1960s and is very prominent in the humanities and especially the social sciences. The field of Discourse Studies includes various methods, such as Discourse Analysis (DA), Critical Discourse Analysis (CDA) and Political Discourse Analysis (PDA). All three were detected as salient in this literature review.

Discourse Analysis (DA) is one of the most frequently used methods in those social science disciplines where the focus is frequently on the study of language and text. In historical research, Discourse Analysis is sometimes referred to as the Discourse-Historical Approach (DHA), and its main defining feature is in acknowledging the historical context and attempting to integrate this knowledge, together with the background of social and political fields, into the research. DHA focuses on studying the display of power through language and conceptualizes history through a theorized lens of critique. This method shares various common features with Critical Discourse Analysis (CDA) and provides a clear description of how to integrate historical context into critical discourse analysis, highlighting the importance of historicity for understanding the continuities of discourses (Achugar, 2017). Sometimes DA, when used to analyze political discourse, is referred to as Political Discourse Analysis (PDA) (Dunmire, 2012).

Critical Discourse Analysis (CDA) examines the means by which political power is manifested or abused through discourse structures and practices (Dunmire, 2012). Achugar (2017) shows that since the past has become an area of focus for CDA, this method has become a salient one in historical research. One of its major aims is to provide an explanation of the power differences in contemporary society by researching past events and their context.

3.3.1 British Parliament and Foreign Policy in the 20th Century

Research problem: Ihalainen and Matikainen (2016) investigated the parliamentarization of foreign policy in the British Parliament throughout the 20th century. They argue that over the course of the 20th century parliaments in general gained more power in discussing foreign policy, and that in the British Parliament in particular this parliamentarization of foreign policy debates was highly noticeable.

Research method: They combine an analysis of policy documents with a more discourse-oriented analysis of parliamentary debates. Their research method is more discourse-oriented than traditional diplomatic history, since they do not focus only on policy documents but also consider the discourse of the parliamentary debates of that time.

Data collection: The authors utilized a wide variety of primary sources, with the Hansard constituting the starting point of their analysis. They also used parliamentary papers such as committee reports, as well as relevant sources created by other political actors: the Foreign Office, other relevant government departments, voluntary associations, the media, etc. Their data therefore consists of parliamentary debates on the one hand and archival documents, public debates, and interviews on the other. They argue that the use of such a wide range of data was necessary to grasp the multi-sided nature of the policy discourse and to ensure that the data was vast enough to provide a complete picture of how the parliamentarization of foreign policy debates occurred. The parliamentary records database was available electronically, which allowed the authors to use full-text searches to locate sources for a contextual analysis of parliamentary debates. They wanted to locate potentially interesting debates and analyze them using the aforementioned historical methods.

Discussion: The authors do not provide a detailed account of how the data was collected and give no information about how the documents other than the electronically accessible Hansard were obtained. They do, however, clearly show that in order to conduct thorough historical research, a variety of sources needs to be studied, and that focusing only on parliamentary debates is not enough.

3.3.2 British Lobbying and Parliamentary Discourse

Research problem: McGrath (2018) focuses his research on lobbying, which he sees as a significant component of modern politics in Britain. In his article, he provides a detailed explanation of the scale and significance of lobbying and studies how lobbying in Britain was discussed not only by parliamentarians but also by journalists.

Research method: The author utilized keyword search on several digitized archives, which helped him gather extracts from parliamentary debates and newspaper articles. He blended both qualitative and quantitative readings of the texts, which leads us to assume that some kind of discourse analysis method was employed.

Data collection: The author draws on parliamentary debates as well as three other databases, which together comprise 51 newspaper titles between 1800 and 1950. The data was available in electronic archives and already in written form, so no transcription was needed. The unit of analysis is an individual newspaper article or parliamentary speech. The database consisted of four online archives: 1) Hansard (1803–1950), 2) the British Library (1800–1900), 3) The Times (1800–1950), and 4) The Guardian (1800–1950). To gather the source material, the author employed a three-step process. Firstly, each archive was searched using a range of keywords associated with lobbying, which produced roughly 1,691 items. Secondly, each item was printed, carefully read, and sorted according to the descriptor he was searching for. Some of the data was already discarded here, since it did not correspond to the search parameters (e.g., it did not relate to governmental bodies, the material covered lobbying in countries other than Britain, etc.). In the third stage, the items which were not removed were put into chronological order and all duplicates were removed. This resulted in 689 items being determined as suitable for analysis. Once all the unique items were collected, the individual items were examined and coded. To acquire the appropriate data, McGrath employed a five-stage process to transform qualitative material into quantitative data, although not all stages needed to be applied: he gathered the material but did not need to transcribe it, as it had already been made available in textual form; the data was then unitized, then categorized on the basis of the actual data and relevant theory; and finally each unit was separately coded.

Discussion: The author never explicitly mentions discourse analysis as his research method. But since he talks about conducting a qualitative analysis of discourse from parliamentary records and newspaper articles, we can assume that he employed a discourse analysis approach.

3.3.3 Nationalism and Political Discourse in Scotland

Research problem: The research conducted by Whigham (2019) critically examines the narratives which emerged from party political discourse after the Scottish independence referendum in 2014. The aim of the research is to analyze the past discourse on nationalism in Scotland and to critically reflect on narratives about the Scottish nation's past.

Research method: The author employs the methodological approach called political discourse analysis (PDA), which was introduced and thoroughly explained by Fairclough and Fairclough (2012). According to Whigham, this method was used since it provides an "original methodological contribution to the study of Scottish nationalism".

Data collection: The author focused on the parliamentary discourse of the largest political parties in Scotland, namely the pro-independence Scottish National Party (SNP) on the one hand, and the Scottish Labour Party as well as the Scottish Conservative and Unionist Party on the other. The database consisted of election manifestos and policy documents related specifically to the independence referendum. Because of the wide range of potentially useful data, Whigham focused primarily on political manifestos and constitutional policy documents. This also allowed for a more detailed analysis of only the crucial information about each party's position on the Scottish constitutional debates. The author used the Nvivo qualitative data analysis software package (QSR International Pty Ltd., 2020), which helped him code the content of each of the data sources according to the themes that emerged. This was followed by a coding process which categorized low-level codes into higher-level discursive forms. This sample allowed for a reflection on and thorough analysis of political discourse.

Discussion: It needs to be noted that this application of the Nvivo software is an exemplary one and is not frequently observed in historical research. Also, at times the article reads as a sociological one, and we believe that it could just as well be classified as such, since the author is also a sociologist. However, a more thorough description of the methodological framework would be appreciated.

3.4 Content Analysis

Content Analysis (CA) primarily focuses on studying and analyzing society and social life by examining the content of visual and textual media: texts, images, and other media products. Mihailescu (2019) understands it as a research technique for making replicable and valid […]
He sourced the inferences from data to their contexts which is particularly PRISPEVKI 181 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 useful in humanities and social sciences. It is a that although this is a historical research, it does have methodological approach which can help in development certain methodological features of the sociological of the deductive and inductive capacities, which are research, since sociology also frequently employs content extremely important in historical research. In addition, it is analysis as the main methodological framework. highly useful in historical research where researchers are Data collection: The author draws on parliamentary analyzing data with large amounts of text and where speeches as well as newspaper articles in order to research meaningful information need to be extracted from the the emergence and development of the Irish political historical documents. clientelism and its critique. This empirical material was Since CA frequently intertwines both qualitative and deliberately chosen since it is made continuously available quantitative approaches, it sometimes comes close to a throughout the decades and is easy to access. She gathered mixed methods. CA can sometimes be mistaken for the parliamentary data from the official website of the Irish Discourse Analysis since the two methods are very similar. parliament and media data form online archives of the Although both are interested in providing the context of an respective newspapers. She opted for data from two of the event, the main difference between the two is that CA most frequently read Irish quality papers, namely the Irish focuses on the content of the text, whereas DA focuses on Independent and the Irish Times. The first step of her data the language that is used in text and context. 
collection consisted of keyword search of the words 3.4.1 Constructing the Child in Need of State “clientelism” and “brokerage” in both parliamentary and Protection newspaper records. After realizing that the two words had not been used until the 1980s, she identified other Research problem: In this article, Smith (2016) potentially relevant terms based on their emergence in explores the development of the discourse surrounding items referent to the two keywords. This produced several children in need of a state protection in Ireland. She mostly other keywords such as “stroke politics”, “gombeen focuses on the discourse produced by legislators and politics”, etc., which the author used to find relevant data. government ministers who are ultimately responsible for The period of her analysis runs up to 2012 and starts in the child services. early 1940s. She notes that in the case of parliamentary Research method: The author employs content records, the unit of her analysis is the contribution of the analysis of various bills as well as parliamentary debates. member of the House of Deputies or the Senate; this can She defines it as a textual analysis, but we regard it as a either be a speech or a short intervention. In the case of content analysis since she focuses mainly on the content on newspaper articles, the unit of her analysis is an article bills and debates. itself. The respective units were then coded according to Data collection: The author focuses on a specific their focus and since some units included several views of timeframe in Irish history, namely between 1922 (the the matter, they were coded in more than one category. formation of the Irish Free State) and 1991 (the adoption of Then those articles and debates which specifically focused current legislative framework for children welfare). The on the link between politicians and voters were selected for data consists of debates from both houses of Irish a more detailed interpretation. 
parliament – the House of Deputies and the Senate. In Discussion: This article falls under the category of addition, Smith also focused on the official reports which historical social research and employs methodological informed these debates. In one part of her research, she approaches which are frequent in both historical and focused on parliamentary debates on the Children Bills of sociological research. She gives a detailed account of the 1928 and 1940 and Cussen Report from 1936. In the second method and data she used and where this data was taken part, she conducts the analysis of the Kennedy Report from, which is not always the case in historical research. (1970), the Final Report of the Task Force on Child Care As seen in some of the previously reviewed articles, the Services (1981) as well as the parliamentary debates on the author combined parliamentary and newspaper data so as Child Care Bill of 1988. to address the concept of clientelism in as much detail as Discussion: The author dedicates only one paragraph to possible. explicating where the data was taken from in addition to only briefly mentioning the method she used. We consider 3.5 Mixed Methods this to be one of the major shortcomings in this article since Shorten and Smith (2017) understand the mixed it would be useful to know how the textual analysis was methods approach as drawing on the strengths of both performed, what the author focused on and why, as well as qualitative and quantitative methods, which results in what was her motivation for focusing on those specific bills showing a more complete picture of a research problem. It and debates. is a highly complementary approach, which means that the 3.4.2 Clientelism in Irish Politics results produced by one of the methods, can be elaborated and clarified with the findings from the other method. 
This Research problem: The main aim of this article is to means that triangulation of one set of results influences and research the emergence and development of discourse enhances the validity of inferences. In addition, the which revolves around the concept of clientelism in Irish combination of different methodologies, approaches, and politics. Kusche (2017) focuses on the analysis of the various fields of research adds to the validity of the research relationship between Irish deputies and voters, which has and eliminates the possibility of research bias. Thies (2002) been perceived as particularly clientelist. shows that as in many other disciplines (sociology for Research method: Kusche identifies the main method example), investigator bias as well as unwarranted of her research as qualitative content analysis. She shows selectivity of the use of historical source materials are the PRISPEVKI 182 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 main problems of qualitative historical research which 3.5.2 Political Discourse of Israeli PMs between emphasizes the importance of the selection of the 2001 and 2009 appropriate methodological approach. Research problem: The aim of this article by Gavriely- Corpus-Assisted Discourse Studies (CADS) combine Nuri (2013) is to look critically at the uses of collective qualitative Discourse Analysis with the predominantly memories in Israeli politics. Collective memories are of quantitative corpus-based approach. The main aim of the great significance to the case of Israel due to their historical CADS is to facilitate understanding from the linguistic background and this article analyzes how collective perspective as well as from that of humanities and social memories are used within the corpus of speeches of Israeli science. As Partington (2012) shows, this approach uses Prime Ministers. 
corpus techniques to investigate a particular political or Research method: The author employed a institutional discourse type and to uncover and analyze methodological approach which incorporated both, Critical obvious patterns of language or aspects of linguistic Discourse Analysis (CDA) and corpus linguistics. Because interaction. of the combination of corpus linguistics and discourse 3.5.1 Scottish Political Rhetoric in Invasion of Iraq analysis, we regarded this article as using the approach called Corpus-Assisted Discourse Studies (CADS). Research problem: Elcheroth and Reicher (2014) Data collection: The data used for this study consisted conduct a systematic analysis of the Scottish debate over of speeches of Israeli Prime Ministers, over a period of 9 the invasion of Iraq in 2003. The aim of their article is to years (between 2001 and 2009), which were delivered in show, on the one hand the development of the debates in the Israeli Parliament (Knesset). The author conducted a Scottish Parliament and conduct the analysis of computerized search in the speech archive that includes parliamentary discourse of anti-war Scottish separatist addresses of the PMs and constructed a corpus, which was parties, and on the other to examine how the conflict was then used as a database. The corpus included speeches by construed as either for or against national interest. the two selected Prime Ministers, namely Ariel Sharon Research method: The authors employ a mixed- (2001 – 2005) and Ehud Olmert (2006 – 2009). Her methods approach and used thematic coding. This on the computerized search revealed 274 instances of the word one hand produced structured inventories of arguments “memory”, which was determined as the keyword to which served as the grid for qualitative analysis, and on the identify relevant speeches. All those references were then other, it produced a database which was then used for carefully studied and read in order to determine the context. 
content analysis. This resulted in identifying 103 references of the phrase Data collection: The data for the analysis consist of all “collective memory” which were distributed among 64 the contributions to four Scottish parliamentary debates speeches. In this count, the author also included synonyms referring to the Gulf War. A total of 106 interventions such as “national memory”, “public’s memory”, “people’s which occurred between January 2003 and June 2004 was memories”, etc. Once the data was broadly selected, the used as a dataset. It needs to be noted that during the time author performed a two-stage analysis to determine the of 2003 Gulf War, there was also the campaign for election actual topics of the speeches. In the first stage, the context to the Scottish Parliament which meant that the election in which national events evoked the mention of collective debate was definitely influenced by the war debate. Each memory was analyzed. In the second stage, specific content individual intervention was separately coded to extract the included in the mentions was studied. information such as which debate the speech was taken Discussion: Although the author mentions the Cultural from, what was the party membership of the speaker, what approach to the CDA, she gives no detailed account on how was the overall moral argument and so on. Special this approach differs from traditional CDA or what its emphasis was put on the two parliamentary debates which benefits are. One of the possible justifications for occurred right before the invasion of Iraq (January and employing cultural approach is the study of cultural context March 2003) as well as on first two substantial of the PMs’ speeches and their cultural significance. We parliamentary debates that took place after the invasion also found that in the article there is no explicit elaboration (November 2003 and June 2004). 
The transcripts of these as to why this particular methodological framework was debates were all published in the official records of the selected and how it contributes to the overall analysis. parliament, and they constituted the “corpus” data for their further analysis. When determining relevant data, all the 3.6 Digital History and Topic Modelling transcriptions were read several times and coded for those We have shown that historians gather their data mostly interventions that included arguments that were from historical archives and feel “much more confident thematically fitting for the analysis. The two pre-invasion when using traditional sources in printed format, since they debates produced 68 relevant interventions whereas the two believe to have better access to the historical data required post-invasion debates produced the remaining 38. for their research” (Torou, 2009). Guldi (2019) believes Discussion: This article consists of two separate that digital methods can help researchers land material for studies. The first study is the analysis of parliamentary historical synthesis that “builds upon the insights of speeches, whereas in the second part, the authors turn from foregoing historians while potentially illuminating new elite discourse to popular understanding of the war. The directions for further research”. Some authors (Piersma et second part draws on the data from Scottish Social Attitude al., 2014) regard these methods as the Digital Approach or (SSA) survey and since it does not focus on the Digital History, the main function of which is to enable parliamentary discourse, only the first study was of interest historians use advanced search engines in order to explore for us. large quantities of data. 
PRISPEVKI 183 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Topic modelling is capable of scanning a large set of digitized sources of primary and secondary documents are documents within which it detects word and phrase patterns used in historical research, with digitized sources and automatically clusters them into groups according to (transcriptions, text documents, and corpora) becoming their meaning. As Guldi (2019) shows, topic modelling has increasingly more popular. This makes historical research been effectively used in history to identify patterns of community a potentially important user group of the historical interest in academic sources and has in ParlaMint corpora. combination with discourse studies proven to be useful for This literature review shows how the majority of historical analysis. researchers of political history collect data on their own, Today, several software packages exist which can be using techniques and methods which are often time- used in a pre-existing database of digitalized texts. This, consuming and demand a lot of manual work. Such work and the fact that digital methods, such as text mining and could be made much more research-friendly and efficient topic modelling are becoming increasingly used in if historical parliamentary corpora were developed, historical research of parliamentary discourse, underlines annotated, and documented. They would present a database the importance of digitizing historical parliamentary of collected parliamentary records of the past and would be records, and not only enable but also encourage historians a useful source of historical parliamentary records which to start using them as one of the primary sources of data for would be an invaluable extension of the ParlaMint project. their research. 
Our first aim should therefore be to provide historians 3.6.1 Topic Modelling and Historical Change with tutorials, workshops and showcases on how to use corpora, corpora data, and the main corpus-analytical Research problem: The aim of Guldi’s (2019) article techniques. Rich and user-friendly documentation on how is to research the parliamentary discourse on 19th century the ParlaMint data is gathered, processed, and annotated is British empire infrastructure projects, such as the drainage to be made available to the historians in addition to offering of the River Shannon in 1860, as well as parliamentary quick user manuals which would show the basic use of argument of the telegraph connection between England and concordancers for historians to learn how to effectively use India. corpora. Research method: The author uses dynamic topic Then, we should encourage them to develop and use modelling which allowed her to generalize about the their own corpora and datasets for the historical periods discourse on a diachronic dataset, observing trends in they are interested in using the same encoding standards. In different time periods. this endeavor, we agree with Kytö (2010) that compilers of Data collection: The data for her research consisted of the data should document their compilation decisions in parliamentary debates in the British parliament in 19th clear terms in user guides, corpus manuals, and training century, gathered from Hansard, the official database of all materials which need to accompany the release versions of UK parliamentary debates. The author focused on several the corpora, since it would be impossible for end-users to topics connected to the infrastructure and employed find information about the background of the texts which approximately the same data collection and analysis in all are included in the historical corpora without them. of them. 
The entire Hansard database was subjected to topic The ParlaMint community should also focus on the modeling, resulting in a set of words used by MPs most implementation of the tagging of the digital repository indicative of their discussions a certain topic. The author contents with complete and structured metadata. Some experimented with using on the one hand debate as a historians (Torou et al. 2009) note that the information document and then also a speech as a document. In which is typically used in research queries by historians addition, she also experimented with degrees of granularity (such as the author, topic of the item, date of creation, the for analysis, asking the computer to return either 4, 10, 100, period to which the content refers, etc.) should be available 500 or 1000 topics. She obtained most informative results as metadata. The availability and reliability of metadata is with 500 topics as the search returned fairly specific words extremely important since historians often rely on the which were interesting for further analysis. additional data and information about a certain historical Discussion: Guldi shows how topic modelling can be source. implemented into research and analysis of historical data. Marjanen (personal communication, 2022) points out Topic modelling is becoming increasingly popular in that historians researching parliamentary discourse are historical research and is frequently used not only on highly interested in the use of rhetoric, the uses of voice national but also international level (e.g., when researching and practices of negotiation and debate. One of the key debates in the European parliament). It is important to note interests for them is identifying who talked, which makes that topic modelling must always be complemented by an the availability of any metadata about the MPs of vital objective analysis and critical skills of the researcher when importance. 
According to Marjanen, there are also some interpreting the results of topic modelling. historians who together with traditional sources, use audio and video recordings from parliament to study non-verbal 4. Discussion and Conclusion elements in parliamentary discourse. He points out that This literature review shows the most common methods with digitized sources, keyword search has made material and approaches that (political) historians use in their much more accessible though many historians are often research of parliamentary discourse as well as tries to interested in something broader than keyword search. They understand what kind of data and information historians are focus on the entirety of speeches or discourse related to a looking for and which sources they use. We can confirm certain topic since keyword search often does not produce observations from Torou et al. (2009) that either printed or enough relevant results. Historians are used to finding these PRISPEVKI 184 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 “discourses” on their own but if the process of searching Immigration. Critical Practice Studies, 15(3): 22-53. for relevant sources was made easier for them, it would http://dx.doi.org/10.7146/ocps.v15i3.19860 definitely be welcomed. Patricia L. Dunmire. 2012. Political discourse analysis: An increasing availability of the digitized sources Exploring the language of Politics and the Politics of appears to be setting an interesting trend. In addition to language. Language and Linguistics Compass, 6: 735- more and more sources and documents becoming digitized 751. and made available through electronic libraries, various Guy Elcheroth and Steve Reicher. 2014. ‘Not our war, not digital research tools and approaches are becoming our country’: Contents and contexts of Scottish political available, making historical research often very digital. 
rhetoric and popular understandings during the invasion Therefore, political historians nowadays already employ of Iraq. British Journal of Social Psychology, 53: 112- digital approaches and tools to analyze parliamentary data 133. https://doi.org/10.1111/bjso.12020 and these approaches allow them to gather and analyze data Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, et al. in a faster, more efficient, and less time-consuming 2022. The ParlaMint corpora of parliamentary manner. However, the development of parliamentary proceedings. Lang Resources & Evaluation. historical corpora could potentially reshape the entire https://doi.org/10.1007/s10579-021-09574-0 process of historical research and offer new understanding Anna Friberg. 2012. Democracy in the Plural? The of the parliamentary data. As Blaxill (2013) shows, the Concepts of Democracy in Swedish Parliamentary combined approaches of close manual analysis and Debates during the Interwar Years. Contributions to the selective quantification simplify the research as well as History of Concept, 7(1): 12-35. facilitate numerical comparison and contextualization. http://dx.doi.org/10.3167/choc.2012.070102 The argument we want to put forward with this Dalia Gavriely-Nuri. 2013. Collective memory as a literature review is not that current qualitative historical metaphor: The case of speeches by Israeli prime research of political debates and parliamentary discourse ministers 2001–2009. Memory Studies, 7(1): 46-60. should be completely replaced by more quantitative https://doi.org/10.1177%2F1750698013497953 corpus-assisted approaches, but rather that corpora could Jo Guldi. 2019. Parliament's Debates about Infrastructure: be effectively used alongside the traditional qualitative An Exercise in Using Dynamic Topic Models to historical analysis. We treat corpora as potentially powerful Synthesize Historical Change. Technology and Culture, tools which would not only simplify data collection and 60(1): 1-33. 
https://doi.org/10.1353/tech.2019.0000 generate relevant results much more effortlessly, but also Patricia Hogwood. 2013. Selective memory: challenging effectively reduce and minimize potential research bias that the past in post-GDR society. In: Saunders, A. and might be present in the analysis of historical data. Pinfold, D. (Ed.) Remembering and Rethinking the GDR, This review also shows the need for more systematic, pages 34-48. Palgrave Macmiallan, London. transparent, and replicable quantitative and qualitative Pasi Ihalainen and Satu Matikainen. 2016. The British analysis, which makes corpus-assisted approaches ideally Parliament and Foreign Policy in the 20th Century: suited for historical research of parliamentary discourse. Towards Increasing Parliamentarisation?. The immediate usefulness of the ParlaMint corpora is also Parliamentary History, 35(1): 1-14. clearly confirmed by this review and it emphasizes the need https://doi.org/10.1111/1750-0206.12180 for further enrichment and the addition of the historical data Pasi Ihalainen, and Taina Saarinen. 2019. Integrating a to the current ParlaMint database. Nexus: The History of Political Discourse and Language Policy Research. Rethinking History: 1-20. 5. Acknowledgements https://doi.org/10.1080/13642529.2019.1638587 The work described in this paper was funded by the Pasi Ihalainen. 2021. Parliaments as Meeting Places for Slovenian Research Agency research programme P6-0436: Political Concepts. Centre for Intellectual History, Digital Humanities: resources, tools, and methods (2022- University of Oxford. 2027), the Social Sciences & Humanities Open Cloud https://intellectualhistory.web.ox.ac.uk/article/parliame (SSHOC) project (https://www.sshopencloud.eu/), the nts-as-meeting-places-for-political-concepts CLARIN ERIC ParlaMint project Isabel Kusche. 2017. 
The Accusation of Clientelism: On (https://www.clarin.eu/parlamint) and the DARIAH-SI the Interplay between Social Science, Mass Media and research infrastructure. Politics in the Critique of Irish Democracy. Historical Social Research, 42(3): 172-195. 6. References https://www.jstor.org/stable/44425367 Marja Kytö. 2010. Corpora and historical linguistics. Mariana Achugar. 2017. Critical discourse analysis and Revista Brasileira de Linguística Aplicada, 11(2): 417- history. In J. Flowerdew and J.E. Richardson (Ed.) The 457. http://dx.doi.org/10.1590/S1984- Routledge Handbook of Critical Discourse Studies, Vol. 63982011000200007 1, pages 298-311. Routledge, London. Daniel Litte. 2016. What is "conceptual history"?. Luke Blaxill. 2013. Quantifying the language of British Understanding society. politics, 1880–1910. Historical Research, 86(232): 313- https://understandingsociety.blogspot.com/2016/10/wha 341. https://doi.org/10.1111/1468-2281.12011 t-is-conceptual-history.html Constance De Saint Laurent. 2014. “I would rather be Jani Marjanen. Personal communication. By Jure Skubic, hanged than agree with you!”: Collective Memory and 27 May 2022. the Definition of the Nation in Parliamentary Debates on PRISPEVKI 185 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Conor McGrath. 2018. British Lobbying in Newspaper and Parliamentary Discourse, 1800–1950. Parliamentary History, 37(2): 226-249. https://doi.org/10.1111/1750- 0206.12363 Mimi Mihailescu. 2019. Content analysis: a digital method. http://dx.doi.org/10.13140/RG.2.2.21296.61441 Ciprian Negoita. 2015. Immunity: A Conceptual Analysis for France and Romania. Contributions to the History of Concepts, 10(1): 89-109. http://dx.doi.org/10.3167/choc.2015.100105 Alan Partington. 2012. Corpus Analysis of Political Language. In: C.A. Chapelle (Ed.) The Encyclopedia of Applied Linguistics. Blackwell Publishing Ltd. 
Hinke Piersma, Ismee Tames, Lars Buitinck, Johan van Doornik and Marteen Marx. 2014. War in Parliament: What a Digital Approach Can Add to the Study of Parliamentary History. Digital Humanities Quarterly, 8(1): 1-18. http://www.digitalhumanities.org/dhq/vol/8/1/000176/0 00176.html Allison Shorten and Joanna Smith. 2017. Mixed methods research: expanding the evidence base. Evidence-based nursing 20: 74-75. https://ebn.bmj.com/content/20/3/74.info Karen Smith. 2016. Constructing the Child in Need of State Protection: Continuity and Change in Irish Political Discourse, 1922–1991. The Journal of the History of Childhood and Youth, 9(2): 309-323. https://doi.org/10.1353/hcy.2016.0042 Jure Skubic and Darja Fišer. 2022. Parliamentary Discourse Research in Sociology: Literature Review. Accepted for publication in Proceedings of the ParlaCLARIN III workshop at LREC2022, pages 91- 100, Marseille, France. Cameron G. Thies. 2002. A Pragmatic Guide to Qualitative Historical Analysis in the Study of International Relations. International Studies Perspectives, 3(4): 351- 372. https://www.jstor.org/stable/44218229 Elena Torou, Akrivi Katifori, Costas Vassilakis, Georgios Lepouras and Constantin Halatsis. 2009. Capturing the historical research methodology: an experimental approach. In Proceedings of International Conference of Education, Research and Innovation, Madrid, 2009. Madrid, Spain. Teun Van Dijk. 2018. Discourse and Migration. In Qualitative Research in European Migration Studies, edited by Ricard Zapata-Barrero and Evren Yalaz, 227- 247. Springer Open. https://link.springer.com/book/10.1007/978-3-319- 76861-8 Stuart Whigham. 2019. Nationalism, party political discourse and Scottish independence: comparing discursive visions of Scotland's constitutional status. Nations and Nationalism, 25(4): 1212-1237. 
Annotation of Named Entities in the May68 Corpus: NEs in modernist literary texts

Mojca Šorli,* Andrejka Žejn†
* ZRC SAZU, Institute of Slovenian Literature and Literary Studies, Novi trg 2, SI-1000 Ljubljana, mojca.sorli@zrc-sazu.si
† ZRC SAZU, Institute of Slovenian Literature and Literary Studies, Novi trg 2, SI-1000 Ljubljana, andrejka.zejn@zrc-sazu.si

Abstract
In this paper we present the process of manual semantic annotation of a corpus of modernist literary texts. An extended set of annotations is proposed with respect to the established NER systems and practices of related projects, i.e. several categories of proper names, foreign language elements and bibliographic citations. We focus on the annotation challenges concerning the names of literary characters seen in transition from common nouns to proper names, and give examples of the results of preliminary analyses of the corpus.

1. Introduction
The starting point of the digital humanist literary project presented here is a corpus of literary texts that was created according to special criteria defined for the purposes of this research. In view of the significance for DH of controlling a large number of texts and their vertical reading, where patterns become visible that cannot be detected with the naked eye or traditional close reading, the corpus size is often seen as a key factor. At the same time, large text volumes require automation of corpus processing for quantitative analysis, involving different levels of (linguistic) annotation in the first phase and allowing additional levels of semantic annotation in later phases that enrich the text with metadata. In the approach presented here, however, the annotation task is performed on a small, specialized corpus that is easier to control and allows for manual annotation. The identified and manually annotated Named Entities are distinguished based on semantic criteria, so we consider this an example of semantic annotation.

Linguistically annotated corpora have long been a standard tool for linguistic research. Named Entity Recognition (hereafter NER) and analysis has also long been relevant in the social sciences and sociology (Ketschik, 2020), from where the method, like several others, has been transferred via linguistics to literary studies, where named entities are most closely associated with literary character research. A more comprehensive picture of the way characters are named in literature, beyond the automatic recognition of Named Entities (hereafter NEs), can be obtained by manually annotating these entities in literary texts, by analyzing the annotation process, and finally by analyzing the data obtained from the annotated corpus itself.

2. The Goal of the paper
In this paper we report on an attempt to identify and annotate three groups of NEs in the "Corpus of 1968 Slovenian literature Maj68 2.0" (short name May68 Corpus),1 a corpus of Slovenian modernist literary texts from the late 1960s to the early 1970s, discussing these groups from the point of view of three different sources of representation problems that are independent but interrelated: ambiguity, variation, and uncertainty. As pointed out in Beck et al. (2020), representational problems in linguistic annotation arise from five different sources (ibid., 61): (i) Ambiguity is an inherent property of the data. (ii) Variation is also part of the data and can occur, for example, in different documents. (iii) Uncertainty is caused by a lack of knowledge or information on the part of the annotator. (iv) Errors may be found in the annotations. (v) Bias is a property of the entire annotation system. We list a number of relevant annotated categories, their specific character, and representational problems associated with them. Our choices are discussed when any of the first three listed sources of representation problems apply.

Together with the theoretical concept, the selection of annotation material, and the definition of guidelines for the annotation process (Pagel et al., 2020), the annotation scheme presented here is a model of extended annotation of NEs in modernist periodicals that can be applied in certain segments to other corpora of literary texts. We focus both on the identified inaccuracies and on the benefits of manual annotation of selected groups of NEs in our specialized corpus of literary texts. In the concluding part, we present the preliminary results of an analysis performed on the annotated corpus.

Following the automatic preprocessing (i.e., POS tagging and lemmatization) of the May68 Corpus, further manual annotation was performed to capture more complex linguistic (semantic) phenomena and to provide a more sophisticated annotation model for proper names given the recurring representational problems: at this first stage, a model for identifying and annotating the selected NEs was put in place, with a second stage of the project envisaged, in which the texts will be annotated for the use of metaphor. Here we will focus on some open challenges in the annotation of NEs, in particular problems related to the functional aspects of the annotated elements. We discuss the practical treatment of proper names for the purposes of linguistic and stylistic research, in the hope of improving the reliability of research results and also of NLP models.

1 http://hdl.handle.net/11356/1491

3. Automated and manual annotation of corpora
In the context of language technologies, universal concepts and tools for automatic corpus annotation have been developed to some extent, especially for individual language groups, while language-specific concepts and tools are also needed. Established levels of automatic tagging for Slovenian, initially based on lexicographic and linguistic projects, include tokenization and related segmentation into sentences, normalization, morphosyntactic tagging, lemmatization, and syntactic parsing (Erjavec et al., 2015). NEs pose a challenge for automatic extraction of information due to their semantic and functional complexities. For Slovenian, the main tool used is StanfordNER, which assigns lexical units to predefined categories (Ljubešić et al., 2012): personal names, geographical names and common proper nouns. The state of the art of the existing NER tools for Slovenian has not been the focus of this research, but a preliminary review of the tools, as well as of the function of NEs in the texts, has shown their limited applicability to the specialized literary corpus that we set out to investigate.

3.1. NER-systems for corpora of literary texts
For literary texts, narratology in particular has developed various typologies of protagonists, heroes, or major and minor characters in texts, ways of characterizing them, and strategies for recognizing them. Since the advent of digital tools, researchers have had to find a way to translate the definitions formed by literary scholars into computer-readable data (Krautter et al., 2018). While there are no specific NER-systems for annotating literary texts, even though literary texts have a high variation of NEs compared to normal non-fiction texts (Stanković et al., 2019), "universal" systems are often used. However, automatic annotation tends to overlook certain segments of NEs in literary texts (Vala et al., 2015). Attempts are made to overcome these limitations by additional automatic tagging, or to expand the set of annotated entities by manual tagging, often of referential expressions, i.e., linguistic expressions that refer to a specific entity in the text world, where the entities and their references must be interconnected (entity grounding). References and connections themselves can only be inferred from knowledge of the context (Ketschik, 2020; Papay and Padó, 2020), so in the early stages of research, manual annotation of the corpus is usually required to improve the automatic process.

3.2. Background and related work
Compiling lists of NEs, especially for categories of proper names, represents only the basis for the identification of character names and is as yet insufficient for relevant literary analyses, so these lists must be dealt with by multidimensional approaches that shed additional light on proper names in light of the special features of literary texts. Empirical analyses of protagonists in the literature can, at the most basic level, for example, study the characteristics of names, their typicality, archaic character, or "unusualness" for a particular society (cf. Calvo Tello, 2021), or compare the usage and functions of proper names, exploring to what extent they are genre-related (e.g. children's literature, cf. van Dalen-Oskam, 2022). Empirical analysis of the ratio between female and male characters in a corpus of English literature up to the mid-20th century (cf. Nagaraj and Kejriwal, 2022), for example, showed the quantitative predominance of male characters over female characters. More complex research also deals with characterization analysis, identifying relationships between main and secondary characters, examining the relationship between active and passive character presence, and distinguishing between "actively present" characters and characters from other fictional worlds (Krautter et al., 2018; Brooke et al., 2016; Ketschik, 2020). One of the more established approaches is the application of social network analysis, a method from empirical sociology that builds on the relationships between NEs. The analysis of social networks in literature (cf. de Does, 2017) is closely related to quantitative approaches to the study of direct and reported speech, or narrator speech and character speech, in storytelling and drama, where NEs are an essential component of a broader context (cf. Burrows, 2004; Moretti, 2011; Elson et al., 2010; Papay and Padó, 2020). Digitally supported analysis of the broader picture of characters also draws on concepts derived from Bakhtin's concept of the chronotope, such as Text World Theory, a cognitive-linguistic concept of a unity of characters, time and space, or the concept of situation (Krautter et al., 2018; Mikhalkova et al., 2019).

4. Model annotation schemes
In designing the model for manual annotation of the May68 Corpus, we relied on familiarity with the texts contained in the corpus and on several other well-known models of manual annotation for similar projects, three of which are presented below.

4.1. COST Action ("Distant reading" project)
The Distant Reading project for the annotation of the multilingual ELTeC corpus (https://www.distant-reading.net/eltec/),2 based on the European novel, provides the following distinct categories: "demonyms (DEMO), professions and titles (ROLE), works of art (WORK), person names (PERS), places (LOC), events (EVENT), organizations (ORG)" (for a brief description of the categories cf. Frontini, 2020). The selection of these categories was partly motivated by the existing possibilities of automated NER, which brings with it certain limitations (Stanković et al., 2019). The project also points out the importance of "cultural references, role models and cosmopolitanism", and these can only be answered "if references to works of art, authors, folklore and periodical publications are detected", which is why in our corpus of modernist literary texts we introduced a BIBLIO group to incorporate references to authors, but covered the other listed types of references with the "other" group (NAME / XXX). In the May68 Corpus, however, we focus for now on proper names.

2 The Distant Reading for European Literary History COST Action (CA16204) started in 2017 with the goal of using computational methods of analysis for large collections of literary texts. It is based on the compilation and analysis of a multilingual open-source collection named the European Literary Text Collection (ELTeC).

4.2. CLARIN.SI
The annotation scheme adopted largely follows the guidelines provided for Slovenian in the past (e.g. Štajner et al., 2013), perhaps closest in its granularity to the Janes-NER guidelines (CLARIN.SI) as described by Zupan et al. (2017), except for the derived adjectives (DERIV-PER) type, which is given an independent status there, unlike in the May68 Corpus, where it is subsumed under the PER-LIT and PER-REAL subtypes.3 In addition, we decided in the case of the May68 Corpus to conceptualize combinations of nouns denoting professions, functions or titles and personal names as units, therefore labelling the entire strings as literary personal name (PER-LIT) or real personal name (PER-REAL).

3 Overall and in the same fashion, in the May68 Corpus we also favour larger lexical units.

4.3. Annotation schemes for Czech language
Annotation of NEs in Czech corpora is implemented according to more complex models, as described in Sevščíková et al. (2007). Our three-level NE taxonomy is, nonetheless, somewhat less fine-grained. Furthermore, unlike the Czech model, ours does not include numbers, such as in addresses, zip codes, or phone numbers, specific number usages and quantitative expressions – entities typically included in NER.

5. May68 Corpus of Slovenian modernist literary texts – corpus description
The Maj68 Corpus is a result of a project on the literature of the avant-garde and modernism in the period of the worldwide student movement, whose activities are also reflected in the transformation of literature. The student journals Tribuna and Problemi, from which the texts for the corpus were selected, played an important role in the theoretical and literary-artistic innovations of the Slovenian student movement. The Maj68 Corpus 1.0 contains 1,521 texts by 198 known authors published between 1964 and 1972 in the Slovenian periodicals Tribuna, Problemi and Problemi.Literatura. The Maj68 Corpus 2.0 version, which has been further edited and corrected (metadata), contains 647 additional texts from Tribuna and Problemi.

The compilation of the corpus began with an extensive bibliographic inventory of texts in the selected publications, which have been digitized and are publicly available on dLib. On the basis of these lists, the original texts of Slovenian authors were converted from .pdf to .docx format and, in a second phase, linked to metadata in Excel spreadsheets. Finally, the corpus was automatically tagged (see Juvan et al. 2021 for more details on the procedure). The texts contain complete bibliographic data and are classified by text and language type, degree of presence of non-standard Slovenian, foreign languages, modernism, and visual elements. Author details, i.e., gender and year of birth, are included with the texts. The presence of visual elements is also marked in the corpus; 48 texts consist only of visual elements, i.e. they do not contain standard text.

Automatic linguistic annotation includes lemmas, morpho-syntactic descriptions from MULTEXT-East, and morphological features and syntactic annotations from Universal Dependencies. As shown here, manually tagged NEs for persons, geographical locations, organizations, and various names, (foreign) linguistic variations and registers, and cited authors (sources) are additionally marked.

The following sections and subsections introduce the types and categories of NEs, including the dilemmas encountered in the process of annotation and the practical reasons for annotation. From here on, and with a somewhat narrower notion of NER, we speak of categories of "proper names (personal and place names)" rather than "named entities" for the purposes of this paper.

5.1. Annotation procedure and categories
The annotation was implemented using the WebAnno tool (Eckart de Castilho et al., 2016). To simplify the technical aspect, the whole corpus was divided into 1529 sections of five sentences each, on average 380 chunks per section. WebAnno allows annotation of one sentence at a time, which was a disadvantage for longer instances of text marked by the use of foreign language(s). Each annotation round was curated by two curators.4 However, reiterative annotation was not foreseen, since the primary goal at this stage was not to improve automatic annotation, but to manually annotate the specialized corpus for optimal corpus analysis and stylistic studies.

4 The texts were annotated by A. Jarc, L. Mandić, and K. Žvanut in accordance with the annotation scheme designed by the authors of this paper, who also curated all of the annotations.

There is no universally accepted taxonomy of NEs, except for some coarse-grained categories (people, places, organizations). Since we are interested in a semantically oriented annotation and prefer more informative (fine-grained) categories, we opted for a three-level NE classification, as shown in Table 1 (cf. Sevščíková et al., 2007). The first level in our annotation model corresponds to the three basic groups: 1. Proper names, 2. Foreign language and register variations, and 3. Cited authors. These groups are labelled 1. NAME, 2. FOREIGN and 3. BIBLIO respectively, with the first two further subdivided. The second and third levels provide a more detailed semantic classification.

The NAME group includes the following types and subtypes:
- Person (PER), including the person-derived adjective, is subdivided into fictional literary characters (PER-LIT), characters referring to real, i.e., existing and historical or mythological, persons or beings (PER-REAL), literary characters bearing a descriptive name (PER-DES), and members of national and social groups (PER-GROUP).
- Geographical location (GEO) is divided into locations in Slovenia (GEO-SI), in former Yugoslavia (GEO-YU), in Europe (GEO-EU), and others (GEO-ZZ).
- Organizations and institutions (ORG).
- Miscellaneous (XXX).

A group labelled FOREIGN is used to annotate foreign language: Serbo-Croatian (SBH), English (EN), French (FR), Italian (IT), Latin (LA), and German (GE), or register variation (DIALECT, INFORMAL, SLANG) in the corpus.
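The three-level classification described above (the groups NAME, FOREIGN and BIBLIO with their types and subtypes) can be sketched as a simple nested mapping. This is an illustrative sketch only, not project code: the structure and the helper `is_valid_label` are our own hypothetical constructs, with label names taken from the scheme as listed in this section and in Table 1.

```python
# Minimal sketch (assumed, not the project's actual code): the three-level
# May68 annotation scheme as a nested mapping. Label names follow the
# scheme described in the text; an empty set means "no subtypes".
SCHEME = {
    "NAME": {
        "PER": {"PER-REAL", "PER-LIT", "PER-DES", "PER-GRP"},
        "GEO": {"GEO-SI", "GEO-YU", "GEO-EU", "GEO-ZZ"},
        "ORG": set(),
        "XXX": set(),
    },
    "FOREIGN": {
        # Language and register codes as given in Table 1.
        code: set()
        for code in ("HBS", "EN", "DE", "FR", "IT", "LA", "XX",
                     "DIALECT", "VERNACULAR", "SLANG")
    },
    "BIBLIO": {},  # no further subdivision
}

def is_valid_label(group, type_=None, subtype=None):
    """Return True if the (group, type, subtype) path exists in the scheme."""
    if group not in SCHEME:
        return False
    if type_ is None:
        return subtype is None
    types = SCHEME[group]
    if type_ not in types:
        return False
    return subtype is None or subtype in types[type_]

print(is_valid_label("NAME", "PER", "PER-DES"))  # True
print(is_valid_label("NAME", "GEO", "GEO-XX"))   # False
print(is_valid_label("BIBLIO"))                  # True
```

A structure of this kind could, for instance, be used to reject malformed labels before converting annotations to TEI, since the set of valid paths is closed.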
Once the annotation process was completed, the labels in WebAnno were converted to TEI encoding.5 Following the conversion, all proper names (personal names, place names, names of organizations, and real names) are thus labelled with <name>, divided into types with the @person, @geo, @misc, @personGrp, and @org attributes, with three subtypes for literary characters (@literary, @descriptive, @real) and four for geographical names (@SI, @EU, @ZZ and @YU). Units of text in foreign languages and non-standard Slovenian were labelled as <foreign> with the corresponding attributes according to TEI encoding.

5 The annotation task was carried out in collaboration with T. Erjavec (technical aspects and data conversion).

NAME group:
  PER / PER-REAL – Real: characters referring to real, i.e. existing and historical or mythological, persons or beings (web sources, Wikipedia, etc.), e.g. Greta Garbo.
  PER / PER-LIT – Literary: fictional literary characters, e.g. Ančika, Zobec.
  PER / PER-DES – Descriptive: literary characters that carry a descriptive name (e.g. dolgolasec, Eng. the long-haired guy).
  PER / PER-GRP – Group: members of national and social groups, e.g. Kranjci, Slovenec, Američan.
  GEO / GEO-SI – Slovenia, e.g. Ljubljana.
  GEO / GEO-YU – Former Yugoslavia (except for Slovenia), e.g. Zagreb.
  GEO / GEO-EU – Europe, e.g. Frankfurt.
  GEO / GEO-ZZ – Other, e.g. Peking.
  ORG – Names of organizations and institutions, e.g. Klub nepismenih, Slovenska matica, Državna varnost.
  XXX – Common proper nouns, including titles of books and other art works, artefacts, etc., e.g. Rdeča kapica, Empire State Building.
FOREIGN group:
  HBS – Serbo-Croatian; EN – English; DE – German; FR – French; IT – Italian; LA – Latin; XX – Other; DIALECT – Dialect; VERNACULAR – Vernacular; SLANG – Slang.
BIBLIO group:
  BIBLIO – Quoted authors (sources).

Table 1: The main categories of the May68 annotation scheme (WebAnno).

5.1.1. Person
The PERSON (PER) type is divided into PER-LIT, PER-REAL, PER-DES and PER-GRP. While the first three are categorized as subtypes of the same type, PER-GRP is defined as an independent type. The most important subdivision of the type (within the NAME group) is that between real, e.g., historical or real-life, persons appearing in the text, and fictional characters, each of which, however, is further specified according to semantic criteria. Subcategories include names of people and pets, nicknames, pseudonyms, and members of national and social groups.

PER-REAL denotes both real, i.e. existing, persons and historical or mythological figures that are basically identifiable in encyclopaedic sources such as online lexicons of proper names, Wikipedia and the like. URL is an additional attribute of the NAME group and is given as a relevant source of information, e.g., a website, for a group of people appearing in the literary text. The assignment of a URL depends on context or extra-linguistic knowledge; if a person can be assumed to be part of common (cultural) knowledge (Descartes, Nietzsche), we do not enrich the corpus with encyclopaedic data.

All standard personal proper names are labelled as NAME and assigned to one of the closed subtypes. The label PER-GRP with no subtype is assigned to members of a particular social group, most often a nationality (Slovenec) or regional identity (Kranjci, Štajerci; Novakovi), but also smaller social groups defined on the basis of occupational or other criteria.

Of the categories introduced specifically for the purposes of the May68 Corpus, NAME / PER-DES proved, as expected, to be the most challenging subcategory (see 6.1.1).

Given their statistical importance in the context of NER, the same annotation rules apply here as for characters in plays when they do not require special treatment with respect to their function. The labelling of proper names in plays depends on the status and/or function of the proper name. Names of individual characters that merely announce an individual character's speech, his/her lines of dialogue, have not been annotated, while names in descriptions of their physical actions or behaviour are treated as ordinary proper names on the model of "sb does sth" etc. (Pandolfo se ogleduje v zrcalu / Pandolfo looks at himself in the mirror). Below is an example of a dialogue showing the distinction between the two and a third subtype (the names in bold are labelled as PER-LIT, PER-DES and PER-REAL respectively):

BARRÈRE: Potemtakem moramo danes z njim obračunati. (Tallien odide)
(Davidu): Si pripravljen s Krepostnim umreti?
DAVID: V smrt?
BARRÈRE: Se nisi maločas naglas pridušil?
DAVID: Čudovit črtež sem zamislil. Kako dviga Sokrates čašo strupa k ustom. Naš dobri prijatelj je tako presunljivo govoril.

Adjectives derived from personal proper nouns are annotated as the corresponding proper nouns. Their derived character is revealed by morpho-syntactic tagging.

5.1.2. Geographical location (GEO)
Place names are labelled as NAME with one of the following closed-set subtypes: SI, YU, EU, ZZ, depending on whether the location is in Slovenia, in the former Yugoslav republics, in the rest of Europe, or outside all of these areas. As with personal names, a distinction is made between real and fictitious geographical names (Indija vs. Eldorado). Annotators decide whether a place is real or fictitious (such as street names in a fictitious city) based on context and common knowledge. Places typically include continents, countries, regions, cities, towns, and natural geographical objects, as well as streets, squares, and neighbourhoods, and functional infrastructure such as churches, airports, and local cultural and natural sites. Place names used metaphorically, e.g. Eden, are categorized as "other" and assigned the label NAME / GEO-ZZ – the same label as is used for place names outside the European territory.

At this stage, we have not paid special attention to the treatment of proper names (personification) used metaphorically, such as: Jadra so pogorela, Delfi molčijo … [The sails have burnt down, and Delfi stays silent …]. This type of analysis is planned for the later stages of annotation (which will include the annotation of metaphors).

Adjectives derived from place names, e.g. African, European, were included in the annotation by analogy with geographical names and divided into the same subtypes (SLO, YU, EU, ZZ).

5.1.3. Organizations and common proper nouns
As with geographical names, there are no subgroups for the two groups of so-called common proper names and names of organizations. Capitalization is an obvious but not a necessary condition for this classification. Thus, no distinction is made here between real and fictitious; what matters is that the name be recognized as "common proper" in the literary context of the text.

Organizations and institutions subsume names of museums and other cultural institutions, as well as political and civic organizations. Organizations are labelled as ORG and usually include businesses, institutes, media, and cultural and educational institutions. However, we have treated restaurants, music groups, and other "entertainment" establishments as "miscellaneous" rather than as organizations.

Miscellaneous is a category reserved mainly for common proper nouns, as explained above, such as titles of books and other works of art, artefacts, films, documents, brand names, commercial products, and events, including place names such as mythological places, place names used metaphorically, etc. These NEs are labelled as XXX.

For many common nouns, one can observe a transition to the category of proper names, which seems to exist as a continuum. For example, the word krčma (Eng. inn, pub) assumes the function of a proper noun referring exclusively to a particular unit/object, in this case "inn". The word is then labelled as NAME / XXX.

5.1.4. BIBLIO
BIBLIO is typically used for literary works cited or mentioned in the literary texts. It contains text passages that refer to literary works or other bibliographic units, and is annotated for authors, not titles or citations, e.g.:

The 'potamus can never reach
The mango on the mango tree
(T. S. Eliot: The Hippopotamus)

5.1.5. Language and register
In the case of language and register variation, we use the FOREIGN group, which subsumes (foreign) language and register variation (see Table 1). This group is not directly relevant to this paper.

6. Dilemmas of annotation in the framework of representational problems
A number of dilemmas are discussed here in terms of the three categories – ambiguity, variation, and uncertainty – as detailed, for example, in Beck et al. (2020), who outline the main representational problems in linguistic annotation (we disregard the two additional categories addressed in the model: error and bias). The interpretation of the listed categories is tailored to the nature of our data, and the problems are assigned to the listed categories accordingly. The annotation process is consistently guided by the identified function of the annotated elements. The three dilemmas are described below.

6.1. Ambiguity
In principle, ambiguity occurs whenever a unit admits several interpretations. Ambiguities between form and meaning occur in natural language at the phonological, morpho-syntactic, lexical, or pragmatic levels and are a major source of representational problems (Beck et al., 2020).

6.1.1. Transition from personal proper names to "common proper nouns"
The most striking example of ambiguity is the transition from common nouns to those that function as personal names. This is a pervasive and rather complex representational problem. The dilemma concerns the category NAME / PER-DES, i.e., descriptive names of literary characters, especially in relation to the category NAME / PER-LIT, which refers to standard proper names that are recognizable as such because of their form and conventional properties (e.g., capitalization). This group includes examples where common nouns optionally combine with proper names to refer to individual characters, like "inšpektor (Kos)" [inspector (Kos)] or "veteran" [the veteran], including capitalized adjectival derivatives, such as "Brezposelni" [The jobless one], functioning as personal names, etc.

However, capitalization is not a necessary condition for the NAME / PER-DES designation, especially in a corpus of modernist texts that frequently employ modernist and/or idiosyncratic conventions, with orthographic rules applied to proper names or descriptive linguistic units that typically eschew capitalization (e.g., "fant" [the boy], "starka" [the old woman]). A key feature of proper names, as it turns out, is "descriptive continuity," which shows that there is no clear boundary between what can be considered a standard proper name (traditionally subsumed under onomastics) and what can be understood as an instance of text that performs the function of a proper name but does not, strictly speaking, qualify as such.

The assignment of a noun to NAME / PER-DES is decided primarily on the basis of context. Often, a lexical unit (word or phrase) is used to describe a particular property of the character to which a proper noun initially refers, and is then gradually but clearly transformed into a (descriptive) unit that functions as a proper name (whether capitalized or not), such as "Rdečelasi" [The red-haired one]. The descriptive name is annotated only when the transition is complete, which must be evident from the broader context. The quantitative criterion (in longer texts) is a minimum of three occurrences of the same designation, as below:

Videl je same znane obraze — inšpektorja Kosa, vratarja Žorža, kurirja Enorokega, Žana, nekoliko v ozadju pa je stal bledi Novinec [the (pale) new guy], …

Other examples include dolgolasec [the long-haired guy], mladenič [young man], mojster [the master] and debelušček [the fatty], which typically correspond to phrases introduced with a definite article in English. In principle, PER-DES is not limited to a maximum number of components, but the likelihood that a lengthy description, such as Zagledal je na tleh sedečega fanta upadlih lic in kuštravih las [He saw a boy with skinny cheeks and messy hair sitting on the floor], should appear at least three times in the text(s) is minimal. Even if descriptive units tend to recur, they normally vary in at least one of their elements.

Capitalization itself does not preclude a lexical unit from being labelled PER-DES, as with Mož brez imena [the Nameless Man]. Appellatives, nicknames, and pseudonyms are labelled as ordinary personal proper names (NAME / PER-LIT), except for those expressing description, such as Dolgi Džon [John the Longish].

6.1.2. Nesting
Another example of ambiguity concerns nesting, which often creates additional annotation problems. Instead of a potential two- (or three-)level nesting model, single-level nesting is used throughout, taking as the basic annotated unit the largest possible lexical unit, typically a geographical name or the name of an organization composed of one or more proper names: in the case of Državna založba Slovenije [National Publishing House of Slovenia], the entire unit is labelled as an organization (ORG), and the proper name Slovenije is not nested and labelled on its own as a place name (Slovenija); the same goes for Društvo novinarjev Slovenije [Journalists' Association of Slovenia], Prešernova družba [Prešeren's Society Publishing], Direkcija za prehrano Beograd [Belgrade Food Agency]; or, e.g., Fani is NOT nested in gospodična Fani peče domači kruh …, but treated as a single-level personal proper name. A general dilemma often arises here as to whether a term should be referred to as a proper name or as a common noun.

6.2. Variation
In variation, the same content or value is expressed by multiple, interchangeable variants (Lüdeling, 2017). Variation can be due to extra-linguistic factors, such as the time period, genre, author/speaker of the text, or linguistic conventions. Like ambiguity, variation is an inherent part of natural language and thus of corpus data. Indirectly related to variation is the case of ambiguity described above in 6.1.1. The descriptive name is not necessarily used exclusively for one and the same literary character; on the contrary, it usually alternates with the character's actual proper name. Alternation in the mention of literary characters is very common; in fact, it is the rule. Some personal proper names (including their descriptive variants) occur as variants preceded by an attributive noun (always the same), usually referring to their professional or social status (e.g., Inspector Kos). When this type of designation is used consistently, we refer to the entire lexical unit as NAME / PER-LIT, but when the attributive noun (Inspector) becomes an independent descriptive variant, we refer to it as NAME / PER-DES.

Descriptive terms NAME / PER-DES may consist of one or more words; they may be a combination of "object nouns" and standard proper names (inšpektor Kos) or of two or more "common nouns" (kurir Enoroki), regardless of their capitalization, as long as they function as personal proper names when referring to or naming characters. The same character may be referred to by three, four, or more variants – in our case: inspector Kos, inspector, or Kos. Also treated as single variants are lexical units denoting proper names whose capitalization varies, e.g., Ministrstvo za kulturo Republike Slovenije vs. ministrstvo za kulturo (Ministry of Culture) and Zveza borcev vs. zveza borcev (Association of Freedom Fighters).

We are aware that when variants are expressed as a single interpretation, the property of variation as a whole is lost. However, a semantic annotation based on the function of linguistic elements is less prone to structural diversity than, for example, spelling variations in historical texts that reflect dialectal and/or temporal differences (cf. Beck et al., 2020), which is why, apart from our own specific research goals, we did not choose to preserve (proper name) variations.

6.3. Uncertainty
Uncertainty arises whenever there are multiple possible interpretations of data, but the relevant or reliable knowledge to make an informed decision about interpretation is not available (see Bonnea et al., 2014, in Beck et al. 2020). Most examples involve the inability to distinguish between the subtypes PER-REAL and PER-LIT in texts that do not provide sufficient clues to the "origin" of the character, although this seems to be rather rare. In such cases, manual annotation provides the opportunity for discussion and collective decision, which we see as an advantage, since cases where the uncertainty (or ambiguity) cannot be resolved are reduced to the absolute minimum, for example:

Maruška – [PER-REAL, author's wife]
Milenko, Andraž, Marko, David – [PER-REAL, members of the OHO Slovenian art group: Milenko Matanović, Andraž Šalamun, Marko Pogačnik, David Nez – established on the basis of extra-textual knowledge].

7. Preliminary results
Apart from the problems encountered in the annotation itself, the preliminary research results of the annotated corpus can also contribute to the study of characters in a selected corpus of literary texts.

Shakespeare, Mozart); 2. Mythological figures (Cain, Poseidon, Ishtar); 3. Characters from other works of Slovenian and world literature (Pegam, Lambergar, Servant Jernej, Charlie Brown, Odysseus, Pinocchio); the last two groups are represented, on the one hand, by characters from the contemporary world of the authors, such as real-life celebrities (Tomaž Terček, Andraž Šalamun, Milenko Matanovič, Brigitte Bardot, Gérard Philipe, Giorgio
Based on the query and the Albertazzi, Sylvie Vartan) and, on the other hand, by results in NoSketchEngine, Figure 1 shows the quantitative characters from the authors’ immediate (family) relationship between three subtypes of the type PERSON environment (Ana, Maruška). (literary names, descriptive names, and names of characters The results show the least consistency for the from the non-literary world). It can be seen that the majority descriptive name subtype with the lowest degree of are literary names (PER-LIT, 68 per cent) whose intersubjectivity, especially with respect to the relationship predominance was to be expected - followed quantitatively between the transition from common noun to proper name by descriptive names (PER-DES, 18 per cent, and then by and the aptronyms or nominative determinism, which names of characters from the non-literary world (PER- Barthes considers a kind of “economic” characterization REAL, 14 per cent). (Lahn and Maister, 2016). The relatively high presence of this subtype suggests a modernist blurring of the boundary between fiction and reality, which is reinforced by postmodernism. PER-DES 18 % 7.2. Relationship between male and female PER-REAL 14 % characters PER-LIT The second graph (cf. Figure 2) shows the quantitative 68 % ratio between male and female characters as they occur in the May68 Corpus (based on the number of tokens). 90% 80% 82 % 82 % 70% 76 % 72 % 60% 50% 56 % Figure 1: The ratio between the subtypes literary, descriptive 40% 50 %50 % 44 % and real of the PERSON type. 30% 20% 28 % 24 % 10% 18 % 18 % 7.1. Categories of descriptive names and real 0% names AUTHORS AUTHORS AUTHORS AUTHORS AUTHORS AUTHORS (MEN) (WOMEN) (MEN) (WOMEN) (MEN) (WOMEN) Using the lists of the three types of personal names, we PER-LIT PER-DES PER-REAL can create an approximate typology of character names MALE CHARACTERS FEMALE CHARACTERS according to the given typologies and evaluate the consistency of labelling. 
Because of their special characteristics, we limit ourselves to the subtypes Figure 2: The quantitative relationship between male and female descriptive and real, leaving aside the subtype literary, characters in the May68 Corpus. which includes mostly “ordinary” personal names. Descriptive names are most often occupational (e.g., The results confirm findings from other research (cf. chief, inspector, captain, mayor; foreman, waitress, Nagaraj and Kejriwal, 2022) that the proportion of male secretary, lab assistant); second are names expressing characters is significantly higher than that of women. physical characteristics (e.g., one-armed, long-haired, “the We supplement this account by comparing male and one with the moustache” the handicapped), followed by female characters by author gender, which gives a very names describing character (e.g., bully, beast, monster), disproportionate picture: Metadata analysis has shown the beast, bloodthirsty),family relations (e.g., aunt, uncle, predominance of male authorship in the corpus (81 per godmother), generational affiliation (e.g., old man, young cent) - only 7 per cent of authors are women, and there are man), while longer descriptive lexical strings are rarer (man no data for the remaining 12 per cent (Juvan, et al., 2021). with no name, brother in Christ, the long-haired one). If we start from the gender of the authors when Among the names for women, forms that formally express analyzing the occurrence of male and female characters, we possession but function as gendered common proper names find (see Figure 3) that in the works by men, male are frequent in Slovenian (e.g. Tomaž’s (one), the manager’s wife). This is statistically almost as characters outnumber female characters by 44 per cent in significant the subcategory literary names, while this difference is as feminine names for occupations. much smaller in the works by women (12 per cent). 
In the As can be seen from the annotated corpus, we identify category descriptive names, this ratio is difficult to assess five subcategories and include them in the subtype for real due to the low occurrence among women authors, but a persons: 1. Real persons from social (Brutus, Lenin, Kidrič) and cultural history (Prešeren, Heidegger, Descartes, large difference between female and male characters in men authors goes in favour of the latter. PRISPEVKI 193 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Annotation Workshop, pages 60–73, Barcelona, Spain, December 12, 2020. Julian Brooke, Timothy Baldwin, and Adam Hammond. 82 % 18 % 2016. Bootstrapped Text-level Named Entity PER-REAL Recognition for Literature. In: Proceedings of the 54th Annual Meeting of the Association for Computational PER-DES 76 % 24 % Linguistics, pages 344–350, Berlin, Germany, August 7– 12. PER-LIT 82 % 18 % John Burrows. 2004. Textual analysis. In: S. Schreibman, Ray Siemens, and John Unsworth, eds., A Companion to 0 2000 4000 6000 8000 Digital Humanities. Blackwell, Oxford. José Calvo Tello. 2021. The Novel in the Spanish Silver MALE CHARACTERS FEMALE CHARACTERS Age: A Digital Analysis of Genre Using Machine Learning. Bielefeld University Press, Bielefeld. Karine van Dalen-Oskam. 2022. Distant Dreaming About European Literary History. Evening keynote at the Figure 3: Male and female characters according to the gender of Distant Reading Closing Conference. authors. https://www.distant-reading.net/events/conference- programme/ In the subcategory real, there is no significant difference Jesse de Does, Katrien Depuydt, Karina van Dalen-Oskam, in terms of author gender, which is probably due to the and Maarten Marx. 2017. Namescape: Named Entity actual and undisputed presence of men and women in social Recognition from a Literary Perspective. In: J. Odijk, and cultural history. and A. 
van Hessen, eds., CLARIN in the Low Countries, pages 361–70. Ubiquity Press. 8. Conclusions and open challenges https://www.ubiquitypress.com/site/chapters/10.5334/b The main goal of our annotation task was to provide an bi.30/download/1046/. adequate representation of a specific set of semantic data Richard Eckart de Castilho, Éva Mújdricza-Maydt, Seid (=Named Entities) and to fully exploit the potential of this Muhie Yimam, Silvana Hartmann, Iryna Gurevych, type of corpus linguistic data in the context of future Anette Frank, and Chris Biemann. 2016. A web-based literary and linguistic analyses. To this end, we tool for the integrated annotation of semantic and implemented a three-level annotation process. We syntactic structures. In: Proceedings of the Workshop on conclude on the basis of high variation in referential Language Technology Resources and Tools for Digital expressions that in potential future projects an additional Humanities (LT4DH), pages 76–84. Osaka, Japan. The step should be linking the different names of the same COLING 2016 Organizing Committee. character. David Elson, Nicholas Dames, and Kathleen McKeown. In the present work, we sought to identify and interpret 2010. Extracting Social Networks from Literary Fiction. different types of representational problems based on the In: Proceedings of the 48th Annual Meeting of the model proposed by Beck et al. (2020) in order to improve Association for Computational Linguistics, pages 138– our understanding of the linguistic and extra-linguistic 147, Uppsala, Sweden. Association for Computational properties of the texts in a (literary) corpus. It is hoped that Linguistics. this will lead to a more nuanced understanding of the Tomaž Erjavec, Peter Holozan, and Nikola Ljubešić. 2015. challenges of NER, and that this in turn may inform future Jezikovne tehnologije in zapis korpusa. In: V. Gorjanc, resources in ways that are more appropriate to the data they P. Gantar, I. Kosem and S. 
Krek, eds., Slovar sodobne represent. slovenščine: problemi in rešitve, pages 262–76. In the next phases of annotation, we plan to improve the Znanstvena založba Filozofske fakultete, Ljubljana. segments that have the lowest level of consistency and Francesca Frontini, Carmen Brando, Joanna Byszuk, Ioana agreement among annotators, such as common nouns that Galleron, Diana Santos, and Ranka Stanković. 2020. perform the referential function of proper names, Named Entity Recognition for Distant Reading in seemingly operating as a representational continuum. ELTeC. In: CLARIN Annual Conference 2020, Oct 2020, We have yet to work out the best approach to fully str. 37–41, Virtual Event, France. incorporate the various instances of PER-DES in the Marko Juvan, Andrejka Žejn, Mojca Šorli, Lucija Mandić, annotation scheme, but these are certainly worth Andrej Tomažin, Andraž Jež, Varja Balžalorsky Antić, considering as a special (sub)category of the NAME group. and Tomaž Erjavec. 2022 . Corpus of 1968 Slovenian literature Maj68 2.0, ZRC SAZU. 9. Acknowledgements http://hdl.handle.net/11356/1430 Marko Juvan, Mojca Šorli, and Andrejka Žejn. 2021. ARRS (Slovenian Research Agency) J6-9384 “Maj 68 v literaturi in teoriji (May '68 in Literature and Theory)” Interpretiranje literature v zmanjšanem merilu: »Oddaljeno branje«  korpusa  »dolgega leta 1968«. Jezik in slovstvo, 66(4):55–76. Nora Ketschik, André Blessing, Sandra Murr, Maximilian 10. References Overbeck, and Axel Pichler. 2020. Interdisziplinäre Christin Beck, Hannah Booth, Mennatallah El-Assady, and Annotation von Entitätenreferenzen. Von Miriam Butt. 2020. Representation Problems in fachspezifischen Fragestellungen zur einheitlichen Linguistic Annotations: Ambiguity, Variation, methodischen Umsetzung. In: N. Reiter, A. Pichler, and Uncertainty, Error and Bias. In: The 14th Linguistic J. Kuhn, eds., Reflektierte Algorithmische Textanalyse. 
PRISPEVKI 194 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Interdisziplinäre(s) Arbeiten in der CRETA-Werkstatt, In: Proceedings of the Eighth Language Technologies pages 203–36, Berlin. Conference, October 8th-12th, 2012, Ljubljana, Benjamin Krautter, Janis Pagel, Nils Reiter, and Marcus Slovenia: proceedings of the 15th International Willand. 2018. In: T. Weitin, ed., Eponymous Heroes Multiconference Information Society - IS 2012, volume and Protagonists – Character Classification in C, pages 191–96, Ljubljana, Institut Jožef Stefan. German-Language Dramas. LitLab. Pamphlet # 7. Katja Zupan, Nikola Ljubešić, and Tomaž Erjavec. 2017. Silke Lahn, and Jan Christoph Meister. 2016. Einführung Annotation guidelines for Slovenian named entities: in die Erzähltextanalyse. Stuttgart, Metzler. Janes-NER. Technical report, Jožef Stefan Institute, Nikola Ljubešić, Marija Stupar, and Tereza Jurič. 2012. September. Building Named Entity Recognition Models For https://www.clarin.si/repository/xmlui/bitstream/handle Croatian And Slovene. In: T. Erjavec, and J. Žganec /11356/1123/SlovenianNER-eng-v1.1.pdf. Gros, eds., Proceedings of the Eighth Language Technologies Conference, October 8th-12th, 2012, Ljubljana, Slovenia: proceedings of the 15th International Multiconference Information Society - IS 2012, volume C, pages 129–34. Ljubljana, Institut Jožef Stefan. Anke Lüdeling. 2017. Variationistische Korpusstudien. In: M. Konopka, and A. Wöllstein, eds., Grammatische Variation. Empirische Zugänge und theoretische Modellierung. IDS Jahrbuch 2016, pages 129– 144. de Gruyter, Berlin. Elena V. Mikhalkova, Timofei Protasov, Anastasiia Drozdova, Anastasiia Bashmakova, and Polina Gavin. 2019. Towards annotation of text worlds in a literary work. In: Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue” , pages 101–10. 
Issue 18, Supplementary Volume 18. Franco Moretti . 2011. Network Theory , Plot Analysis . New Left Review, 68:80–102. Akarsh Nagaraj, and Mayank Kejriwal. 2022. Robust Quantification of Gender Disparity in Pre-Modern English Literature using Natural Language Processing. arXiv:2204.05872v1 [cs.CY] 12 Apr 2022. Sean Papay, and Sebastian Padó. 2020. RiQuA: A Corpus of Rich Quotation Annotation for English Literary Text. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 835–841, Marseille, France. European Language Resources Association. Janis Pagel, Nils Reiter, Ina Rösiger, and Sarah Schulz. 2020. Annotation als flexibel einsetzbare Methode. In: N. Reiter, A. Pichler, and J. Kuhn, eds., Reflektierte Algorithmische Textanalyse. Interdisziplinäre(s) Arbeiten in der CRETA-Werkstatt, pages 125 – 142. Berlin. Ranka Stanković, Diana Santos, Francesca Frontini, Tomaž Erjavec, and Carmen Brando. 2019. Named Entity Recognition for Distant Reading in Several Languages. In: G. Pálko, ed., DH_Budapest_2019. Budapest, ELTE. http://elte-dh.hu/dh_budapest_2019-abstract-booklet/ Magda Ševčíková, Zdeněk Žabokrtský, and Oldřich Krůza. 2007. Named Entities in Czech: Annotating Data and Developing NE Tagger. In: V. Matoušek, P. Mautner eds., Text, Speech and Dialogue: 10th International Conference, TSD 2007, Pilsen, Czech Republic, September 3–7, 2007. Proceedings. Berlin – Heidelberg, Springer-Verlag. https://ufal.mff.cuni.cz/~zabokrtsky/publications/papers /tsd07-namedent.pdf Tadej Štajner, Tomaž Erjavec, and Simon Krek. 2013. Razpoznavanje imenskih entitet v slovenskem besedilu. 
A Transformer-based Sequence-labeling Approach to the Slovenian Cross-domain Automatic Term Extraction

Thi Hong Hanh Tran∗†, Matej Martinc†, Andraž Repar†, Antoine Doucet‡, Senja Pollak†
∗ Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia
† Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
‡ University of La Rochelle, 23 Av. Albert Einstein, La Rochelle, France

Abstract

Automatic term extraction (ATE) is a popular research task that eases the time and effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we treat terminology extraction as a sequence-labeling task and experiment with the Transformer-based model XLM-RoBERTa to evaluate the performance of multilingual pretrained language models in the cross-domain sequence-labeling setting. The experiments are conducted on the RSDO5 corpus, a Slovenian dataset containing texts from four domains: Biomechanics, Chemistry, Veterinary, and Linguistics. We show that our approach outperforms the Slovene state-of-the-art approach, achieving significant improvements in F1-score of up to 40 percentage points. This indicates that applying multilingual pretrained language models to ATE in less-resourced European languages is a promising direction for further development. Our code is publicly available at https://github.com/honghanhh/sdjt-ate.

1. Introduction

Terms are single- or multi-word expressions denoting concepts from specific subject fields, whose meaning may differ from that of the same set of words in other contexts or in everyday language. They represent units of knowledge in a specific field of expertise, and term extraction is useful for several terminographical tasks performed by linguists (e.g., the construction of specialized term dictionaries). Most of these tasks are time- and labor-demanding, so several automatic term extraction approaches have recently been proposed to speed up the process.

Term extraction can also support and improve several complex downstream natural language processing (NLP) tasks. The broad range of downstream NLP tasks that term extraction could benefit includes, for example, glossary construction (Maldonado and Lewis, 2016), topic detection (El-Kishky et al., 2014), machine translation (Wolf et al., 2011), text summarization (Litvak and Last, 2008), information retrieval (Lingpeng et al., 2005), ontology engineering and learning (Biemann and Mehler, 2014), business intelligence retrieval (Saggion et al., 2007; Palomino et al., 2013), knowledge visualization (Blei and Lafferty, 2009), specialized dictionary creation (Le Serrec et al., 2010), sentiment analysis (Pavlopoulos and Androutsopoulos, 2014), and cold-start knowledge base population (Ellis et al., 2015), to cite a few. In the attempt to ease the time and effort needed to manually identify terms in domain-specific corpora, automatic term extraction (ATE), also known as automatic term recognition (Kageura and Umino, 1996) or automatic term detection (Castellví et al., 2001), thus became an essential NLP task.

However, despite the importance of term extraction and the research attention paid to the task, identifying the correct terms remains a notoriously challenging problem with the following as-yet unsolved hurdles. First, despite several different definitions of what constitutes a term, the explicit distinction between terms and common words is in many cases still unclear. In addition, the characteristics of specific terms can vary significantly across domains and languages. Furthermore, gold standard term lists and manually labeled domain-specific corpora for training and evaluating ATE approaches are generally scarce for less-resourced languages, including Slovenian, due to the large amount of work required to construct these resources.

Deep neural approaches to ATE have been proposed only recently, and their evaluation on less-resourced languages has not yet been sufficiently explored and remains a research gap worth investigating. Inspired by the success of Transformer-based models in ATE on the ACTER dataset of the recent TermEval 2020 competition (Hazem et al., 2020; Lang et al., 2021), we propose to exploit and explore the performance of the XLM-RoBERTa pretrained language model (Conneau et al., 2019), addressing ATE as a sequence-labeling task. Sequence-labeling approaches have been successfully applied to a range of NLP tasks, including named entity recognition (Lample et al., 2016; Tran et al., 2021) and keyword extraction (Martinc et al., 2021; Koloski et al., 2022). The experiments are conducted in the cross-domain setting on the RSDO5 corpus¹ (Jemec Tomazin et al., 2021a) containing Slovenian texts from four domains (Biomechanics, Chemistry, Veterinary, and Linguistics).

¹ http://hdl.handle.net/11356/1470

The main contributions of this paper can be summarized in the following points:

• We systematically evaluate the performance of the Transformer-based pretrained model XLM-RoBERTa on the term extraction task, formulated as supervised cross-domain sequence labeling on the RSDO5 dataset containing texts from four different domains.

• We demonstrate that the proposed cross-domain approach surpasses the performance of the current state of the art (Ljubešić et al., 2019) for all the combinations of training and testing domains we experimented with, thereby establishing a new state-of-the-art (SOTA) method for ATE on a Slovenian corpus.

This paper is organized as follows: Section 2 presents related work in the field of term extraction. Next, we introduce our methodology in Section 3 and the experimental details in Section 4. The results, together with further error analysis, are discussed in Sections 5 and 6, before we conclude and present future work in Section 7.

2. Related Work

The history of ATE has its beginnings in the 1990s, with research done by Damerau (1990), Ananiadou (1994), Justeson and Katz (1995), Kageura and Umino (1996), and Frantzi et al. (1998). ATE systems usually employ the following two-step procedure: (1) extracting a list of candidate terms; and (2) determining which of these candidate terms are correct using supervised or unsupervised approaches. Recently, neural approaches have also been proposed.

Traditionally, approaches were strongly based on linguistic knowledge and the distinctive linguistic aspects of terms in order to extract possible candidates. Several NLP tools, such as tokenization, lemmatization, stemming, chunking, PoS tagging, full syntactic parsing, etc., are employed in this approach to obtain linguistic profiles of term candidates. As this is a heavily language-dependent approach, the better the quality of the pre-processing tools (e.g., FLAIR (Akbik et al., 2019), Stanza (Qi et al., 2020)), the better the quality of the linguistic ATE methods.

Meanwhile, several studies preferred a statistical approach or combined linguistic and statistical approaches. Some of the measures used include termhood (Vintar, 2010), unithood (Daille et al., 1994), and the C-value (Frantzi et al., 1998). Many current systems still apply some variation of this approach, most commonly in hybrid systems combining linguistic and statistical information (Repar et al., 2019; Meyers et al., 2018; Drouin, 2003; Macken et al., 2013; Šajatović et al., 2019; Kessler et al., 2019, to cite a few).

Recently, advances in embeddings and deep neural networks have also influenced the term extraction field. Several embeddings have been investigated for term extraction, for example, uni-gram term representations constructed from a combination of local and global vectors (Amjadian et al., 2016), non-contextual word embeddings (Wang et al., 2016; Khan et al., 2016; Zhang et al., 2017), contextual word embeddings (Kucza et al., 2018), and the combination of both representations (Gao and Yuan, 2019).

In the recent ATE challenge TermEval 2020 (Rigouts Terryn et al., 2020), the use of language models became very important. The winning approach on the Dutch corpus used pretrained GloVe word embeddings fed into a bi-directional LSTM-based neural architecture. Meanwhile, the winning approach on the English corpus (Hazem et al., 2020) relied on the extraction of all possible n-gram combinations, which are fed into a BERT binary classifier that determines, for each n-gram inside a sentence, whether it is a term or not. Besides BERT, several other variants of Transformer-based models have also been investigated; for example, RoBERTa and CamemBERT were used in the TermEval 2020 challenge (Hazem et al., 2020). Another recent method is the HAMLET system (Rigouts Terryn et al., 2021), a hybrid adaptable machine learning approach that combines linguistic and statistical clues to detect terms and is also evaluated on the TermEval data.

Meanwhile, Conneau et al. (2019) and Lang et al. (2021) take advantage of XLM-RoBERTa (XLM-R) to compare three different approaches: a binary sequence classifier, a sequence classifier, and a token classifier employing the sequence-labeling approach (also researched by Kucza et al. (2018)), as we do here. Finally, Lang et al. (2021) propose to use mBART (Liu et al., 2020), a multilingual encoder-decoder model based on denoising pre-training, which generates sequences of comma-separated terms from the input sentences.

The Annotated Corpora for Term Extraction Research (ACTER) dataset was released for the TermEval competition as a collection of four domain-specific corpora (Corruption, Wind energy, Equitation, and Heart failure) in three languages (English, French, and Dutch). However, when it comes to ATE for less-resourced languages, there is still a lack of gold standard corpora and limited use of neural methods. In recent years, the Slovene KAS corpus was compiled (Erjavec et al., 2021), and most recently the RSDO corpus that we use in our study (Jemec Tomazin et al., 2021b). For Slovenian, on which we focus in our study, the current SOTA was proposed by Ljubešić et al. (2019): it extracts initial candidate terms using the CollTerm tool (Pinnis et al., 2019), a rule-based system employing a complex language-specific set of term patterns (e.g., POS tag, …) from the Slovenian SketchEngine module (Fišer et al., 2016), followed by a machine learning classification approach with features representing statistical term extraction measures. Another recent approach, by Repar et al. (2019), focuses on term extraction and alignment, where the main novelty is in using an evolutionary algorithm for the alignment of terms. On the other hand, deep neural approaches have not yet been explored for Slovenian. Another problem very specific to less-resourced languages is that open-sourced code is often not available for most current benchmark systems, hindering their reproducibility (for Slovenian, only the code by Ljubešić et al. (2019) is available).

3. Methodology

We consider ATE a sequence-labeling task in which the model returns a label for each token in a text sequence. We use the B-I-O labeling mechanism (Rigouts Terryn et al., 2021; Lang et al., 2021), where B stands for the beginning word of a term, I for a word inside a term, and O for a word that is not part of a term. The terms from a gold standard list are first mapped to the tokens in the raw text, and each word in the text sequence is annotated with one of the three labels (see the examples in Figure 1). The model is first trained to predict a label for each token in the input text sequence (i.e., we model the task as token classification) and is then applied to unseen text (test data). Finally, the candidate term list for the test data is composed from the tokens or token sequences labeled as terms.

Figure 1: An example of the B-I-O mechanism on a text sequence from the Slovenian corpus.

Figure 2: The overall architecture.

We experiment with XLM-RoBERTa² (Conneau et al., 2019), a Transformer-based model pre-trained on 2.5 TB of filtered CommonCrawl data covering 100 languages. With the proliferation of non-English models
(e.g., CamemBERT for French, Finnish BERT, German BERT, etc.), XLM-RoBERTa, the multilingual version of RoBERTa (Liu et al., 2019), stands out as a generic cross-lingual sentence encoder that achieves benchmark performance on multiple downstream NLP tasks, including ATE for richly resourced languages such as English (Rigouts Terryn et al., 2020). Due to this well-documented SOTA performance on several related tasks, we opted to employ XLM-RoBERTa in a monolingual setting on our low-resourced Slovenian corpus. The overall architecture of our approach is presented in Figure 2.

In our experiments, we use a multilingual pre-trained language model in order to leverage the general knowledge the model obtained during pretraining on a huge multilingual corpus. First, we divide the dataset into train-validation-test splits. We also investigate the effectiveness of cross-domain learning, where the main idea is to test the transfer of knowledge from one domain to another and thus to evaluate both the capability of the model to extract terms in new, unseen domains and its ability to learn the relations between terms across domains, given the assumption that they appear in terminologically marked contexts. Therefore, we fine-tune the model on two domains (e.g., Biomechanics and Chemistry) as the train split, validate on a third domain (e.g., Veterinary) as the validation split, and test on the fourth domain, which does not appear in the train set (e.g., Linguistics). The train split is used for fine-tuning the pre-trained language model. The validation split is applied to prevent over-fitting during the fine-tuning phase. Finally, the test split, which is not used during training, serves for the evaluation of the method.

The model is fine-tuned on the training set to predict, for each word in a word sequence, the probability that it is part of a term (B, I) or not (O). To do so, an additional token classification head containing a feed-forward layer with a softmax activation is added on top of the model.

4. Experimental Setup

Here, we describe the dataset, the experimental details, and the metrics that we apply for the evaluation.

4.1. Dataset

The experiments are conducted on version 1.1 of the Slovenian RSDO5 corpus (Jemec Tomazin et al., 2021a); Slovenian is a less-resourced Slavic language with rich morphology. As part of the RSDO national project, the RSDO5 corpus was manually compiled and annotated and contains 12 documents with altogether about 250,000 words from the fields of Biomechanics (bim), Chemistry (kem), Veterinary (vet), and Linguistics (ling). The data were collected from diverse sources, including Ph.D. theses (3), a Ph.D.-thesis-based scientific book (1), graduate-level textbooks (4), and journal articles (4) published between 2000 and 2019. Apart from the manually annotated terms, RSDO5 is also annotated with Universal Dependency tags (e.g., tags annotating tokens, sentences, lemmas, morphological features, etc.). However, in our research, we only leverage the original text with the term labels, and we consider all terms, without distinguishing between in-domain and out-of-domain terms.

In Table 1, we report the number of documents, tokens, and unique terms across domains.

² https://huggingface.co/xlm-roberta-base
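The mapping of gold-standard terms onto token-level B-I-O labels described in Section 3 can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes whitespace tokenization and a greedy longest-match strategy, and the example sentence is invented for the sketch (only the term "izokinetični navor" echoes the corpus; "dinamometer" is a hypothetical gold term).

```python
def bio_labels(tokens, gold_terms):
    """Annotate each token with B (term-initial), I (term-internal) or O.

    Greedy longest-match: multiword terms are tried before their nested
    sub-terms (e.g. "sunek navora" before "navor")."""
    terms = sorted((t.lower().split() for t in gold_terms), key=len, reverse=True)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for term in terms:
            n = len(term)
            if [tok.lower() for tok in tokens[i:i + n]] == term:
                labels[i] = "B"
                labels[i + 1:i + n] = ["I"] * (n - 1)
                i += n
                break
        else:
            i += 1
    return labels

# Invented example sentence; "dinamometrom" is inflected, so the base
# form "dinamometer" does not match -- mapping gold terms onto inflected
# Slovenian text in practice needs lemmatization rather than string equality.
print(bio_labels("Izokinetični navor merimo z dinamometrom".split(),
                 ["izokinetični navor", "dinamometer"]))
# -> ['B', 'I', 'O', 'O', 'O']
```

The longest-match preference is one simple way to deal with nested terms, where a shorter gold term occurs inside a longer one; the unmatched inflected form above shows why the real mapping step is more involved than string comparison.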
PRISPEVKI 198 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022

Table 1: Number of documents, tokens, and unique terms per domain in the Slovenian RSDO5 dataset (all documents are in Slovenian).

Domain               # Docs   # Tokens   # Terms
Biomechanics (bim)      3      61,344     2,319
Chemistry (kem)         3      65,012     2,409
Veterinary (vet)        3      75,182     4,748
Linguistics (ling)      3     109,050     4,601

Table 2: Label distribution and the proportion of term tokens per domain in the Slovenian RSDO5 dataset.

Domain                  B       I        O     % Term
Biomechanics (bim)   7,070   6,835   47,439    22.67
Chemistry (kem)      7,614   4,486   52,912    18.61
Veterinary (vet)    10,953   6,261   57,968    22.90
Linguistics (ling)  12,348   6,079   90,623    16.89

Given the same number of collected documents for each domain, the documents from the Linguistics and Veterinary domains are longer (i.e., have more tokens) and also contain more terms than those from Biomechanics and Chemistry. In addition, Figure 3 presents the frequency of terms of different lengths per domain. Veterinary, Chemistry, and Linguistics share a similar term-length distribution, with most terms consisting of one to three words and only a few (fewer than three) terms longer than seven words. An example of such a long term found in the corpus is "kaznivo dejanje zoper življenje, telo in premoženje", which means a crime against life, body, and property. Meanwhile, the Biomechanics distribution has a longer right tail, containing several terms with more than three words.

Figure 3: The frequencies of terms of specific lengths per domain in the Slovenian dataset.

Furthermore, the corpus contains several nested terms, i.e., terms that also appear within larger terms; vice versa, a multiword term may contain shorter terms. For example, in the Biomechanics domain, the term "navor" (torque) appears in terms such as "sunek navora" (torque shock), "zunanji sunek navora" (external torque shock), and "izokinetični navor" (isokinetic torque), to mention a few. This makes the labeling harder, as the classifier needs to infer from the context whether a specific term is part of a longer term.

4.2. Implementation Details

We experiment with several combinations of training, validation, and testing data, where two domains are used for training, a third for validation, and the fourth for testing (i.e., we train 12 models covering all possible domain combinations). We consider term extraction as a sequence-labeling, or token classification, task with a B-I-O annotation scheme. Table 2 presents the distribution across label types and the proportion of (B) and (I) labels in the total number of tokens per domain. On average, the tokens annotated as terms (or parts of terms) represent only about one-fifth of all tokens in the corpus, which means that there is a significant imbalance between (B) and (I) tokens on the one hand and tokens labeled as non-terms (O) on the other.

We employ the XLM-RoBERTa token classification model and its "fast" XLM-RoBERTa tokenizer from the Huggingface library.³ We fine-tune the model for up to 20 epochs, depending on model convergence (i.e., we also employ an early-stopping regime), with a learning rate of 2e-05, a training and evaluation batch size of 32, and a sequence length of 512 tokens, since this hyperparameter configuration performed best on the validation set. The documents are split into sentences; sentences containing more than 512 tokens are truncated, while sentences with fewer than 512 tokens are padded with a special <PAD> token at the end. During fine-tuning, the model is evaluated on the validation set after each training epoch, and the best-performing model is applied to the test set.

The model predicts, for each word in a word sequence, whether it is a part of a term (B, I) or not (O). The sequences identified as terms are extracted from the text and collected into a set of all predicted candidate terms. A post-processing step that lowercases all candidate terms is applied before we compare the derived candidate list with the gold standard using the evaluation metrics discussed in Section 4.3.

³ https://huggingface.co/models

4.3. Evaluation Metrics

We perform a global evaluation of our term extraction system by comparing the list of candidate terms extracted over the whole test set with the manually annotated gold standard of the test set, using Precision, Recall, and F1-score. Precision refers to the percentage of the extracted terms that are correct, while Recall indicates the percentage of all correct terms that are extracted. Low Precision means a lot of noise in the extraction, whereas low Recall indicates many missed terms. The F1-score summarizes overall performance as the harmonic mean of Precision and Recall. These evaluation metrics have also been used in related work, including the TermEval 2020 shared task (Hazem et al., 2020; Rigouts Terryn et al., 2020; Lang et al., 2021).

5. Results

Table 3 presents the results achieved by the multilingual XLM-RoBERTa pre-trained language model on the Slovenian RSDO5 dataset. Note that the results in the table are grouped according to the model's test domain for easier comparison between different settings. Our cross-domain approach proves to have relatively consistent performance across all the combinations, achieving a Precision of more than 62%, a Recall of no less than 55%, and an F1-score above 61%. The model performs slightly better for the Linguistics and Veterinary domains than for Biomechanics and Chemistry. The differences in the number and length of terms per domain pointed out in Section 4.1 might be among the factors that contribute to this behavior. In addition, a significant performance boost can be observed for the Linguistics domain when the model is trained on the Chemistry and Veterinary domains, and for the Veterinary domain when the model is trained on Biomechanics and Linguistics. In these two settings, the model achieves an F1-score of more than 68%.

Table 3: Term extraction evaluation in a cross-domain setting on the Slovenian RSDO5 dataset.

Training                 Validation   Testing   Precision   Recall   F1-score
bim + kem                vet          ling        69.55      64.05     66.69
bim + vet                kem          ling        69.48      73.66     71.51
kem + vet                bim          ling        66.20      72.38     69.15
Ljubešić et al. (2019)   –            ling        52.20      25.40     34.10
bim + kem                ling         vet         71.06      66.72     68.82
bim + ling               kem          vet         72.66      65.59     68.94
ling + kem               bim          vet         69.30      68.07     68.68
Ljubešić et al. (2019)   –            vet         66.90      19.30     29.90
bim + vet                ling         kem         68.67      55.13     61.16
bim + ling               vet          kem         70.14      60.27     64.83
ling + vet               bim          kem         70.23      59.24     64.27
Ljubešić et al. (2019)   –            kem         47.80      31.40     37.80
vet + kem                ling         bim         63.51      66.80     65.11
vet + ling               kem          bim         62.25      65.20     63.69
ling + kem               vet          bim         62.35      63.99     63.16
Ljubešić et al. (2019)   –            bim         53.80      24.80     33.90

We also present results for the current SOTA approach of Ljubešić et al. (2019), obtained by reproducing their methodology on the same RSDO5 dataset. In general, our approach outperforms the approach proposed by Ljubešić et al. (2019) by a large margin in all domains and according to all evaluation metrics. The margin is especially large when it comes to Recall. With the training process applied to the RSDO5 corpus, the approach of Ljubešić et al. (2019) yields a low F1-score due to the high imbalance between its Precision and Recall. This is most likely because their methods rely heavily on frequency and are thus not suitable for discovering low-frequency terms, of which there are many in the RSDO5 corpus. In their own experiments, Ljubešić et al. (2019) discard all term candidates with a frequency below 3, which is why their results on their corpus are higher than on RSDO5.

Overall, we achieve results roughly twice as high as the approach proposed by Ljubešić et al. (2019) in terms of F1-score for all test domains. The results demonstrate the predictive power of contextual information in language models such as XLM-RoBERTa over a machine learning approach with features representing statistical term extraction measures, as in Ljubešić et al. (2019).

6. Error Analysis

In this section, we analyze the predictions of XLM-RoBERTa on the RSDO5 corpus to get a better understanding of the model's performance and to discover possible avenues for future work. First, we analyze the predictive power of our approach for terms of different lengths by calculating Precision and Recall separately for terms of length k ∈ {1, 2, 3, 4, ≥5}. The number of predicted candidate terms, the number of ground-truth terms, the number of correct predictions (TPs), Precision, and Recall for each term length k and each test domain are presented in Tables 4, 5, 6, and 7. Note that these statistics are collected for the train-validation-test combinations that perform best on each domain according to the F1-score.

Table 4: Performance in Precision and Recall per term length in the Linguistics domain.

k     #Predictions   #Ground truth   #TPs   Precision   Recall
1         2,078          1,728       1,300     62.56     75.23
2         2,631          2,404       1,858     70.62     77.29
3           322            360         191     59.32     53.06
4            57             80          31     54.39     38.75
≥5           12             29           9     75.00     31.03

Table 5: Performance in Precision and Recall per term length in the Veterinary domain.

k     #Predictions   #Ground truth   #TPs   Precision   Recall
1         2,159          2,067       1,472     68.18     71.21
2         2,062          2,103       1,448     70.22     68.85
3           314            446         182     57.96     40.81
4            28             77          10     35.71     12.99
≥5            3             55           2     66.67      3.64

Table 6: Performance in Precision and Recall per term length in the Chemistry domain.

k     #Predictions   #Ground truth   #TPs   Precision   Recall
1           943            890         580     61.51     65.17
2         1,073          1,202         768     71.58     63.89
3           164            260          93     56.71     35.77
4            26             46          11     42.31     23.91
≥5            3             11           0      0.00      0.00

Table 7: Performance in Precision and Recall per term length in the Biomechanics domain.

k     #Predictions   #Ground truth   #TPs   Precision   Recall
1         1,079            718         522     48.38     72.70
2         1,153          1,172         822     71.29     70.14
3           223            286         124     55.61     43.36
4            26             59          11     42.31     18.64
≥5           11             84           5     45.45      5.95

Results across Tables 4 to 7 show that our models are good at predicting short terms of up to three words in all four domains. The best model applied to the Linguistics test domain also shows competitive performance in predicting longer terms, achieving 75.00% Precision and a decent 31.03% Recall for terms with at least five words. Despite the relatively high Precision achieved by the models on long terms in the Veterinary and Biomechanics test domains, the Recall is quite low, most likely due to the small number of longer terms in the datasets on which the models are trained. When it comes to predictions in the Chemistry domain, there are no correct predictions of terms consisting of five or more words.

In addition, as the corpus contains many nested terms, a very common mistake of the model is to predict a shorter term nested in the correct term of the gold standard (Pattern 1). Vice versa, the model sometimes generates incorrect predictions containing the correct nested terms (Pattern 2). Furthermore, in some cases, the model produces a single prediction made out of two consecutive terms (Pattern 3). We report some examples of these incorrect patterns in Table 8, where the first column refers to the pattern type, the second one to our predicted candidate term, and the last one presents the true term(s) from the gold standard. The presented candidate terms are extracted from the final list of predicted terms for the Linguistics test domain.

Table 8: Examples of unlemmatised predictions in the Linguistics test domain.

Pattern 1:
  prediction: "klasična analogna telefonska zveza" (classic analog telephone connection)
  gold:       "klasična analogna telefonska zveza pot" (classic analog telephone connection path)
  prediction: "končnica neprve slovarske oblike" (suffix of non-first dictionary form)
  gold:       "končnica" (suffix)
  …
Pattern 2:
  prediction: "brezžično slušalk v ušesu" (wireless in-ear headphones)
  gold:       "brezžično slušalk" (wireless headphones)
  prediction: "elektromehanska uporaba električne energije" (electromechanical use of electrical energy)
  gold:       "električne energije" (electrical energy)
  …
Pattern 3:
  prediction: "batne parne stroje za pogon" (reciprocating steam engines for propulsion)
  gold:       "batne parne stroje" (piston steam engines), "pogon" (propulsion)
  prediction: "elektrarna na atomski pogon" (nuclear power plant)
  gold:       "elektrarna" (power plant), "atomski pogon" (nuclear propulsion)
  prediction: "besedilnim tipom strokovnega jezika" (text type of professional language)
  gold:       "besedilnim tipom" (text type), "strokovnega jezika" (professional language)
  prediction: "eksperimentalno modeliranje dinamičnih sistemov" (experimental modeling of dynamic systems)
  gold:       "eksperimentalno modeliranje" (experimental modeling), "dinamičnih sistemov" (dynamic systems)
  …

7. Conclusion

In summary, we investigated the performance of the multilingual Transformer-based language model XLM-RoBERTa on the monolingual cross-domain sequence-labeling term extraction task. The experiments were conducted on the representative Slovenian RSDO5 corpus, which contains texts from four specific domains, namely Biomechanics, Chemistry, Veterinary, and Linguistics. Our cross-domain sequence-labeling approach with XLM-RoBERTa showed consistent performance across all the combinations of training, validation, and test sets, achieving up to 72.66% Precision, up to 73.66% Recall, and up to 71.51% F1-score. The model performed slightly better in extracting terms from the Linguistics and Veterinary domains than from Biomechanics and Chemistry. Moreover, our approach outperformed the current state of the art for the Slovenian language (Ljubešić et al., 2019) by a large margin according to all three evaluation metrics, in some cases achieving three times higher Recall and roughly two times higher F1-score. Consequently, our approach is the new SOTA approach on the RSDO5 dataset.

However, we believe that there remains room for improvement in the field of supervised term extraction. In the future, we would like to pre-train the model on an intermediate task (e.g., machine translation) resembling term extraction before fine-tuning it on the target downstream task, in order to boost the extraction performance. In addition, we will also investigate the performance of the models in the zero-shot cross-lingual setting, the multilingual setting, and the combination of both, in comparison with our current monolingual setting. Lastly, we suggest integrating active learning into our current approach to improve the output of the automated method by dynamic adaptation after human feedback. By learning with humans in the loop, we aim at getting the most information with the least amount of term labels. We will also evaluate the contribution of active learning to reducing the annotation effort and determine the robustness of the incremental active learning framework across different languages and domains.

8. Acknowledgements

The work was partially supported by the Slovenian Research Agency (ARRS) core research programme Knowledge Technologies (P2-0103) and the project TermFrame (J6-9372), as well as by the Ministry of Culture of the Republic of Slovenia through the project Development of Slovene in Digital Environment (RSDO). The first author was partly funded by Region Nouvelle-Aquitaine. This work has also been supported by the TERMITRAD (2020-2019-8510010) project funded by the Nouvelle-Aquitaine Region, France.

9. References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.
Ehsan Amjadian, Diana Inkpen, Tahereh Paribakht, and Farahnaz Faez. 2016. Local-Global Vectors to Improve Unigram Terminology Extraction. In Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016), pages 2–11.
Sophia Ananiadou. 1994. A methodology for automatic term recognition. In COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics.
Chris Biemann and Alexander Mehler. 2014. Text mining: From ontology learning to automated text processing applications. Springer.
David M Blei and John D Lafferty. 2009. Visualizing topics with multi-word expressions. arXiv preprint arXiv:0907.1013.
M Teresa Cabré Castellví, Rosa Estopà Bagot, and Jordi Vivaldi Palatresi. 2001. Automatic term detection: A review of current systems. Recent advances in computational terminology, 2:53–88.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Béatrice Daille, Éric Gaussier, and Jean-Marc Langé. 1994. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics.
Fred J Damerau. 1990. Evaluating computer-generated domain-oriented vocabularies. Information processing & management, 26(6):791–801.
Patrick Drouin. 2003. Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1):99–115.
Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare Voss, and Jiawei Han. 2014. Scalable topical phrase mining from text corpora. arXiv preprint arXiv:1406.6312.
Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC.
Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources and Evaluation, 55(2):551–583.
Darja Fišer, Vit Suchomel, and Miloš Jakubíček. 2016. Terminology extraction for academic Slovene using Sketch Engine. In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016, pages 135–141.
Katerina T Frantzi, Sophia Ananiadou, and Junichi Tsujii. 1998. The C-value/NC-value method of automatic recognition for multi-word terms. In International conference on theory and practice of digital libraries, pages 585–604. Springer.
Yuze Gao and Yu Yuan. 2019. Feature-less End-to-end Nested Term extraction. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 607–616. Springer.
Amir Hazem, Mérieme Bouhandi, Florian Boudin, and Béatrice Daille. 2020. TermEval 2020: TALN-LS2N System for Automatic Term Extraction. In Proceedings of the 6th International Workshop on Computational Terminology, pages 95–100.
Mateja Jemec Tomazin, Mitja Trojar, Simon Atelšek, Tanja Fajfar, Tomaž Erjavec, and Mojca Žagar Karer. 2021a. Corpus of term-annotated texts RSDO5 1.1. Slovenian language resource repository CLARIN.SI.
Mateja Jemec Tomazin, Mitja Trojar, Mojca Žagar, Simon Atelšek, Tanja Fajfar, and Tomaž Erjavec. 2021b. Corpus of term-annotated texts RSDO5 1.0.
John S Justeson and Slava M Katz. 1995. Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text. Natural language engineering, 1(1):9–27.
Kyo Kageura and Bin Umino. 1996. Methods of Automatic Term Recognition: A Review. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 3(2):259–289.
Rémy Kessler, Nicolas Béchet, and Giuseppe Berio. 2019. Extraction of terminology in the field of construction. In 2019 First International Conference on Digital Data Processing (DDP), pages 22–26. IEEE.
Muhammad Tahir Khan, Yukun Ma, and Jung-jae Kim. 2016. Term Ranker: A Graph-Based Re-Ranking Approach. In FLAIRS Conference, pages 310–315.
Boshko Koloski, Senja Pollak, Blaž Škrlj, and Matej Martinc. 2022. Out of thin air: Is zero-shot cross-lingual keyword detection better than unsupervised? arXiv preprint arXiv:2202.06650.
Maren Kucza, Jan Niehues, Thomas Zenkel, Alex Waibel, and Sebastian Stüker. 2018. Term Extraction via Neural Sequence Labeling: A Comparative Evaluation of Strategies Using Recurrent Neural Networks. In INTERSPEECH, pages 2072–2076.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270.
Christian Lang, Lennart Wachowiak, Barbara Heinisch, and Dagmar Gromann. 2021. Transforming term extraction: Transformer-based approaches to multilingual term extraction across domains. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3607–3620.
Annaïch Le Serrec, Marie-Claude L'Homme, Patrick Drouin, and Olivier Kraif. 2010. Automating the compilation of specialized dictionaries: Use and analysis of term extraction and lexical alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 16(1):77–106.
Yang Lingpeng, Ji Donghong, Zhou Guodong, and Nie Yu. 2005. Improving retrieval effectiveness by using key terms in top retrieved documents. In European Conference on Information Retrieval, pages 169–184. Springer.
Marina Litvak and Mark Last. 2008. Graph-based keyword extraction for single-document summarization. In Coling 2008: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pages 17–24.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2019. KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning. In International Conference on Text, Speech, and Dialogue, pages 115–126. Springer.
Lieve Macken, Els Lefever, and Véronique Hoste. 2013. TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 19(1):1–30.
Alfredo Maldonado and David Lewis. 2016. Self-tuning ongoing terminology extraction retrained on terminology validation decisions. In Proceedings of The 12th International Conference on Terminology and Knowledge Engineering, pages 91–100.
Matej Martinc, Blaž Škrlj, and Senja Pollak. 2021. TNT-KID: Transformer-based neural tagger for keyword identification. Natural Language Engineering, pages 1–40.
Adam L Meyers, Yifan He, Zachary Glass, John Ortega, Shasha Liao, Angus Grieve-Smith, Ralph Grishman, and Olga Babko-Malaya. 2018. The Termolator: Terminology Recognition Based on Chunking, Statistical and Search-Based Scores. Frontiers in Research Metrics and Analytics, 3:19.
Marco A Palomino, Tim Taylor, and Richard Owen. 2013. Evaluating business intelligence gathering techniques for horizon scanning applications. In Mexican International Conference on Artificial Intelligence, pages 350–361. Springer.
John Pavlopoulos and Ion Androutsopoulos. 2014. Aspect term extraction for sentiment analysis: New datasets, new evaluation measures and an improved unsupervised method. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 44–52.
Mārcis Pinnis, Nikola Ljubešić, Dan Ştefănescu, Inguna Skadiņa, Marko Tadić, Tatjana Gornostaja, Špela Vintar, and Darja Fišer. 2019. Extracting data from comparable corpora. In Using Comparable Corpora for Under-Resourced Areas of Machine Translation, pages 89–139. Springer.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.
Andraž Repar, Vid Podpečan, Anže Vavpetič, Nada Lavrač, and Senja Pollak. 2019. TermEnsembler: An Ensemble Learning Approach to Bilingual Term Extraction and Alignment. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 25(1):93–120.
Ayla Rigouts Terryn, Véronique Hoste, Patrick Drouin, and Els Lefever. 2020. TermEval 2020: Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset. In 6th International Workshop on Computational Terminology (COMPUTERM 2020), pages 85–94. European Language Resources Association (ELRA).
Ayla Rigouts Terryn, Véronique Hoste, and Els Lefever. 2021. HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology. Terminology.
Horacio Saggion, Adam Funk, Diana Maynard, and Kalina Bontcheva. 2007. Ontology-based information extraction for business intelligence. In The Semantic Web, pages 843–856. Springer.
Antonio Šajatović, Maja Buljan, Jan Šnajder, and Bojana Dalbelo Bašić. 2019. Evaluating automatic term extraction methods on individual documents. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), pages 149–154.
Thi Hong Hanh Tran, Antoine Doucet, Nicolas Sidere, Jose G Moreno, and Senja Pollak. 2021. Named entity recognition architecture combining contextual and global features. In Towards Open and Trustworthy Digital Societies: 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, page 264. Springer Nature.
Špela Vintar. 2010. Bilingual Term Recognition Revisited: The Bag-of-equivalents Term Alignment Approach and its Evaluation. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 16(2):141–158.
Rui Wang, Wei Liu, and Chris McDonald. 2016. Featureless Domain-Specific Term Extraction with Minimal Labelled Data. In Proceedings of the Australasian Language Technology Association Workshop 2016, pages 103–112.
Petra Wolf, Ulrike Bernardi, Christian Federmann, and Sabine Hunsicker. 2011. From statistical term extraction to hybrid machine translation. In Proceedings of the 15th Annual conference of the European Association for Machine Translation.
Ziqi Zhang, Jie Gao, and Fabio Ciravegna. 2017. SemReRank: Incorporating Semantic Relatedness to Improve Automatic Term Extraction Using Personalized PageRank. arXiv preprint arXiv:1711.03373.

Metapodatki o posnetkih in govorcih v govornih virih: primer baze Artur

Darinka Verdonik,* Andreja Bizjak,* Andrej Žgank,* Simon Dobrišek†

* Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru
Koroška 46, 2000 Maribor
darinka.verdonik@um.si, andreja.bizjak1@um.si, andrej.zgank@um.si
† Fakulteta za elektrotehniko, Univerza v Ljubljani
Tržaška 25, 1000 Ljubljana
simon.dobrisek@fe.uni-lj.si

Povzetek
Ob združevanju različnih govornih jezikovnih virov se pojavljajo težave, ki izhajajo iz vsebinske nezdružljivosti zabeleženih metapodatkov o posnetkih, govorcih oz. govoru nasploh (npr. tip govora, vrsta govornega dogodka, lokacija in čas snemanja, spol, izobrazba, regija govorca).
Ti metapodatki se zajemajo po eni strani zato, da omogočajo preverjanje uravnoteženosti govornega vira glede na različne govorce in govorne situacije, po drugi strani pa zato, da omogočajo razvrščanje govornih podatkov v kategorije, potrebne bodisi za jezikoslovne analize bodisi za učenje algoritmov razpoznavanja govora ipd. Najpogostejše razlike med zabeleženimi metapodatki o posnetkih in govorcih v obstoječih prosto dostopnih govornih virih za slovenščino so v kategorizacijah vrste govora in lokacije snemanja oziroma v kategorizacijah in oznakah regije govorca. Različne kategorije se pojavljajo tudi v zvezi s starostnimi in izobrazbenimi skupinami govorcev. Veliko vrst metapodatkov se pojavlja samo v posameznih virih, v drugih pa ne. Prispevek poleg pregleda razlik podaja tudi predloge za njihovo premostitev.

Metadata on recordings and speakers in spoken language resources: The case of the Artur database

When merging data from different spoken language resources, problems arise due to the incompatibility of metadata on recordings, speakers, or speech in general (e.g., information about the speech type or speech event, the time and place of the recording, or the gender, education and region of the speaker). These metadata are captured on the one hand to ensure the balance of speech samples across different speakers and speech situations, and on the other hand to enable the classification of speech data into the categories needed either for linguistic analysis or for training speech recognition algorithms. The most common differences in metadata on recordings and speakers in the existing freely available speech resources for Slovene relate to the categorizations of the type of speech and the location of the recording, as well as to the categorizations and designations of the speaker's region. Different categories also emerge in relation to the age and educational groups of speakers. Many types of metadata are recorded only in particular resources.
In addition to reviewing these differences, we also give some suggestions on how to overcome them.

1. Introduction

Spoken language resources are important both for the development of linguistics and a comprehensive understanding of language, and for the development of speech technologies such as speech recognition and speech synthesis. In addition to the recordings and transcriptions of speech, they usually also contain a smaller or larger amount of information about where, when and how the recordings were made, and about speaker properties such as gender, age, education, etc. Although the Text Encoding Initiative (TEI) includes standardization recommendations for the domain of speech, the substantive decisions about which categories of such data to capture, and in how much detail to describe them, depend strongly on the type of material and the purpose of the speech resource. Thus, when merging resources created in different periods, with partly different goals and covering different types of speech, problems arise from the substantive incompatibility of the documented data on recordings, speakers, or speech in general.

With the aim of reducing such problems in the future, this paper reviews which kinds of data have been needed in different disciplines, with an emphasis on existing Slovene speech resources (Sections 2 and 3); presents the structure of these data in more detail on the example of the Artur speech database, currently the newest, largest and most heterogeneous speech resource for Slovene (Section 4); and highlights the types of data with the largest substantive discrepancies, together with a proposal for their harmonization (Section 5).

2. Metadata on recordings and speakers in speech corpora

The GOS corpus was one of the first larger projects aimed at providing a more extensive speech resource for research on the Slovene language. It was released in 2011, comprising about 112 hours of recordings, and followed the corpus-linguistic efforts of the time to complement reference written corpora with reference speech corpora (e.g., Burnard, 2007; Allwood et al., 2000; Oostdijk et al., 2002; Pořízka, 2009). Its purpose was thus primarily to provide data on spoken Slovene for lexicographic, grammatical and other linguistic research, for the teaching of Slovene, for professional speakers and writers, and for the wider interested public. It contained as representative a set of different speech situations as possible, with the goal of capturing sample instances of different speech situations and different spoken discourses, a demographically representative sample of speakers of Slovene, and those speech situations in which language users most frequently participate, productively or only receptively (Verdonik and Zwitter Vitez, 2011: 17).

In addition to transcriptions, GOS was supplemented with the recordings themselves and with numerous metadata on recordings and speakers (by which corpus users can also filter search results). The practice in other, foreign speech corpora is similar.

PRISPEVKI / PAPERS — Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

Typical data on the recorded situation are the date, location, type of interaction, context, topic, participants, duration, recording equipment used, source, etc.
Data on the participants typically include an identification code, age, gender, nationality or first language, region or dialect, and occupation, and possibly also the place of birth, current location, other languages, etc. (Zemljarič Miklavčič, 2008; Cresti and Moneglia, 2005; Ehmer and Martinez, 2014; Love et al., 2017).

In the GOS corpus, the metadata on recordings included (Verdonik and Zwitter Vitez, 2011):
- data on the material provider, i.e., the source of the recording,
- data on the type of speech, the institutional setting, the speech event, a free description of the speech situation, and the number of active participants in the speech event,
- data on the time and place of the recording, with the place specified both by the name of the locality and by its assignment to a wider (vehicle-registration) area.

The data on speakers comprised:
- gender,
- age, divided into 7 categories,
- education, divided into 4 categories,
- the speaker's region, defined in terms of vehicle-registration areas, with the possibility of specifying several regions if the speaker had lived for more than a year in at least two different regions (e.g., because of studies, work, etc.),
- the speaker's first language.

In 2016–2019, the GOS corpus was followed, in several releases, by the smaller Gos Videolectures speech database (Verdonik, 2018), which, in contrast to GOS, covers domain-restricted material: public lectures available through the Videolectures.net portal. In its latest, fourth version it comprises a total of 22 hours of recordings of public lectures, balanced across the thematic areas of social sciences, humanities, medicine, engineering, and natural sciences/mathematics. The speakers likewise evenly represent both genders, older and younger speakers, and roughly defined regions of Slovenia.

The metadata on recordings and speakers followed the scheme established in the GOS corpus, but due to limited access to information they were not recorded with the same precision. Whereas the age of speakers in GOS was divided into 7 categories, in Gos Videolectures there are only 2, determined mostly on the basis of visual impression rather than direct, accurate information.
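The mismatch between the seven GOS age categories and the two of Gos Videolectures is a typical harmonization problem: the finer scheme can only be merged with the coarser one by collapsing categories. A minimal Python sketch, using the five Artur age groups quoted later in the paper and an illustrative two-way split; the boundary between the groups is our assumption, not taken from either resource:

```python
# Fine-grained labels follow the Artur age scheme; the coarse split mirrors
# Gos Videolectures' rough "younger"/"older" division. The exact boundary
# (between 30-49 and 50-59) is chosen here purely for illustration.
FINE_TO_COARSE = {
    "12-17": "younger",
    "18-29": "younger",
    "30-49": "younger",
    "50-59": "older",
    "60+": "older",
}

def coarsen_age(fine_label):
    """Collapse a fine-grained age label into a two-way split so that
    resources with different age schemes can be compared."""
    return FINE_TO_COARSE[fine_label]
```

The same pattern (a lookup table from the finer onto the coarser scheme) applies to any pair of category systems where one refines the other.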
There were likewise no direct data on the speaker's region; instead, auditory impressions of the characteristics of the speech were recorded under this heading. Some data were not specified at all, either because they were not available (education) or because they were not relevant (the first language, which is Slovene for all speakers). Since the Gos Videolectures database was also intended for the development of speech recognition technology, a need emerged for recording the quality of the recording as well, which was added merely as the transcriber's subjective judgement based on auditory impression.

3. Metadata on recordings and speakers in speech databases for speech recognition

From the perspective of the development of speech technologies, i.e., speech recognizers, the main reason for collecting data on speakers and recordings is to ensure that the speech database representatively covers all salient speech characteristics that vary across speakers and speech circumstances. Relevant, therefore, is any piece of information about a speaker or a recording that can carry information about the speech characteristics of the speaker or about the circumstances of speaking that can be assumed to influence the acoustic and linguistic characteristics of the recorded speech. With computational signal processing methods, various speech features can be extracted from speech signals, and these are assumed to be hierarchically organized, reflecting both low-level anatomical features of the human vocal apparatus and higher-level dialogue and semantic features.

For the development of automatic speech recognizers it therefore makes sense to retain, from the whole set of metadata, primarily those that can contribute to better acoustic and language modelling of speech. In the development of speech databases for speech recognition, metadata were in the past key information on the basis of which an adequate representation of all categories of speakers and speech, as foreseen in the specifications, was pursued. The main purpose of collecting these metadata was to make the speech database reflect as realistically as possible the circumstances and scenarios of the possible uses of automatic speech recognizers (Kolář and Švec, 2008). Such an approach is particularly important for speech databases comprising from a few tens to several hundreds of hours of speech or speakers.

The rapid technological development of information and communication systems has made it possible to collect and process ever larger amounts of data. At the same time, the available computing power of modern computers has increased dramatically, above all with the development of very powerful graphics processing units (GPUs), which efficiently execute the numerically demanding algorithms of so-called deep learning (Gondi and Pratap, 2021). One consequence of this progress is that, for languages with large numbers of speakers, extensive speech databases comprising more than 10,000 hours of speech recordings have begun to be collected. These are typically speech databases obtained from very diverse sources, such as various media, online platforms, audiobooks, etc. Because of the large size of such databases, the collected recordings are often not annotated and transcribed manually. Unsupervised or semi-supervised approaches, which do not require manually produced annotations and transcriptions of the recordings, are then used for training speech recognizers (Hershey et al., 2017). In most cases of very large speech databases, consistent balancing of the recordings on the basis of metadata thus becomes of secondary importance. Given the very diverse possible sources and ways of collecting recordings, it is often not even possible to obtain the relevant metadata. And in cases where metadata are available but hard to balance in the database, Robert Mercer's famous 1985 dictum comes to the fore: there is no better data than more data.

The new, metadata-unbalanced approaches to building speech databases have received additional support from deep learning procedures, where methods for automatically augmenting and enriching the training data are used ever more frequently. The original speech recordings can thus be modified into various simulated forms using modern digital signal processing methods. Basic approaches of this kind include, for example, speeding up or slowing down the speech in the original recordings. With respect to the metadata usually considered in the development of speech recognizers, more complex approaches have also been developed in which various recording conditions are simulated (e.g., channel characteristics, noise level, codecs, background sounds, room acoustics, etc.). With such approaches, the set of original speech recordings can be effectively supplemented and the shortage of certain kinds of recordings balanced out (Karafiát et al., 2017).
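Speed perturbation, mentioned above as a basic augmentation approach, can be illustrated in a few lines of Python. This is a bare linear-interpolation resampler for demonstration only, not the signal-processing pipeline of any actual ASR toolkit:

```python
def speed_perturb(samples, factor):
    """Resample a waveform (list of floats) by `factor` via linear interpolation.

    factor > 1.0 speeds the speech up (fewer output samples),
    factor < 1.0 slows it down (more output samples)."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor          # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out
```

Applying `speed_perturb` with factors such as 0.9 and 1.1 to every recording is the common way to triple a training set while also shifting speaker pitch slightly.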
In pursuing the basic goal that the speech database should reflect the possible circumstances and usage scenarios of speech recognizers as well as possible, it makes sense to set certain priorities in considering metadata and their balance. For the development of a general-purpose automatic speech recognizer, it is advisable to take into account above all the following metadata:
- Speaker label: uniquely identifies all recordings of the same speaker in the database. This enables the efficient application of methods for adapting the recognizer model to individual speakers (e.g., the MLLR, SAT and iVector methods, etc.) (Povey et al., 2008; Cardinal et al., 2015), which can contribute to a considerable improvement in the accuracy of automatic speech recognition.
- First language: automatic speech recognition for a given language is usually considerably less successful for speakers for whom that language is not their first. In developing a general-purpose recognizer, their speech is therefore usually excluded from the training procedure, and special adaptations of the general recognizer to such speakers are performed afterwards.
- Dialect group (Draxler and Kleiner, 2017): this metadatum is especially important in cases of spontaneous private speech. For markedly dialectal speech, various approaches for adapting the recognizer to the speakers' dialects can be applied, which can to some extent remedy the degradation of the results.
- Acoustic recording conditions (Zhang et al., 2018): these can substantially affect the reliability of automatic speech recognition. Their influence can partly also be simulated, or removed with procedures for robust processing and enhancement of the quality of speech signals.
- Speaker gender and age: for a general-purpose recognizer, the balance of speakers along these two categories is important when building the acoustic model of speech. Adaptation of the recognizer to the speaker's gender and age is rarely performed, since on-the-fly adaptation of the recognizer model to the individual speaker is used instead; but this information can be used in developing and testing such methods to determine their dependence on these two metadata.

If the metadata presented above are not available in a given speech database, they can also be determined automatically afterwards, with a certain degree of reliability, using various procedures for the automatic recognition of speech patterns, such as biometric recognition and speaker clustering, or recognition of the speaker's first language. Such automatically derived metadata can of course contain errors, which must be taken into account when they are used.
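As a small illustration of why a unique speaker label matters for adaptation methods, the following sketch groups recordings by speaker so that per-speaker adaptation sets can be assembled; it assumes IDs in the Artur layout (Artur-<type>-<speaker>-<recording>-avd) described in Section 4:

```python
from collections import defaultdict

def group_by_speaker(recording_ids):
    """Group Artur-style recording IDs by their speaker label.

    The speaker label is the third dash-separated field of the ID
    (e.g., G5134, or Gvecg for multi-speaker public recordings)."""
    groups = defaultdict(list)
    for rid in recording_ids:
        speaker = rid.split("-")[2]
        groups[speaker].append(rid)
    return dict(groups)
```

Each resulting group can then feed a speaker-adaptation pass (MLLR, SAT, iVector extraction) without any further metadata lookup.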
4. Metadata on recordings and speakers in the Artur speech database

In 2020 the national project Razvoj slovenščine v digitalnem okolju (Development of Slovene in the Digital Environment, https://www.slovenscina.eu/) began; it was co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund, within the Operational Programme for the Implementation of the EU Cohesion Policy 2014–2020. The project was carried out by a consortium of 12 partners, of which 6 were public research institutions and 6 companies. It addressed several strands of language technologies, among them speech technologies, where much attention was devoted to building a 1000-hour speech database for the development of speech recognition. The lack of a sufficiently large, freely available speech database adapted to the requirements of speech recognition had proved to be the central obstacle to the development of speech recognition for Slovene. The University of Maribor (FERI), the University of Ljubljana (FE and FRI), ZRC SAZU, Alpineon and STA participated in building the database. It comprises 4 large strands of different types of speech: speech read from written prompts (500 hours), public speech (public events, media, etc. – 200 hours), parliamentary speech (the National Assembly of the Republic of Slovenia – 200 hours), and private speech (field recordings of freely spoken monologues and dialogues).

In the Artur database, the data on recordings and speakers are organized as a TSV file and as an XML record following the TEI standard. Compared to previous speech resources for Slovene, they include above all a very detailed inventory of the technical properties of the recordings (e.g., data on the properties of the source recordings and on the technical equipment used for recording) and of all circumstances that could affect these properties (from the size of the recording room and the presence of overlapping speech to the use of face masks by speakers, which was frequent during the COVID-19 epidemic).

The final list of metadata on recordings in the Artur speech database is as follows:

I. Identification data and categorization of recordings:
- Recording ID: composed of the name of the database (Artur), the speech type (read – B, public – J, private – N, parliamentary – P), the four-digit speaker identification number (Gxxxx), the six-digit recording identification number (Pxxxxxx), and the file type (-avd). For recordings of public speech, in which a larger number of speakers usually appear, the four-digit speaker identification number is replaced by the label Gvecg (meaning several speakers). Example recording ID: Artur-N-G5134-P600134-avd.
- Type of speech event: indicates whether the recording contains public, private, parliamentary or read speech (Žganec Gros and Vesnicer, 2020).
- Descriptions of speech events, i.e., topics: In parliamentary speech, the speech event is always labelled as a session of the National Assembly. In public speech, the speech events are specified as round tables, interviews, addresses at events, press conferences, etc., or as an online event for recordings made remotely. In read speech, the speech events are described as the reading of previously prepared written prompts, or as two different types of spelling: a selected set of abbreviations was spelled out by the speakers with added vowels (e.g., ef a ku), and predetermined pairs of first and last names with added schwas (e.g., jə o nə a sə). If a speaker spelled in an unforeseen way, the topic is labelled as spelling with vowels (or schwas) with deviation (e.g., ef fa ku); if the speaker also added or commented on something while reading, it is labelled as spelling with vowels (or schwas) with commentary. In private speech, the speech events are labelled as free dialogue between two interlocutors and free monologue – in the latter, the speaker freely describes various things, for instance their favourite film. For the needs of developing specialized recognizers in the project Razvoj slovenščine v digitalnem okolju, the Artur database additionally defines speech events recorded according to previously prepared scenarios from two domains: describing faces and smart-home control.

II. Data on the recording circumstances:
- The recording date is written in the form "month year" (e.g., April 2021).
- The municipality of the recording is based on the list of municipalities in the Republic of Slovenia at the time of recording (2020–2022).
- The recording venue specifies more precisely where the speech event was recorded: for example, in an apartment or office, in a studio or mobile recording studio, in a hall, in parliament, or outdoors.
- The room size is divided into three categories: up to 20 m2, from 20 to 80 m2, and over 80 m2.
- Noise presence indicates whether background noise, such as rustling, humming, traffic noise, the sound of a fan, etc., occasionally occurs on the recording. If, in the personal judgement of the validator of the recordings, noise occurs to an excessive degree, the recording is assigned to the group of excluded recordings.
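The recording-ID layout described under I can be validated and decomposed mechanically. A possible Python sketch follows; the regular expression simply restates the documented fields and is not part of the Artur distribution itself:

```python
import re

# Artur recording IDs: Artur-<type>-<speaker>-<recording>-avd,
# where <speaker> is Gxxxx (four digits) or Gvecg (several speakers).
ID_RE = re.compile(
    r"^Artur-(?P<speech_type>[BJNP])-(?P<speaker>G(?:\d{4}|vecg))"
    r"-(?P<recording>P\d{6})-avd$"
)

SPEECH_TYPES = {"B": "read", "J": "public", "N": "private", "P": "parliamentary"}

def parse_recording_id(rid):
    """Split an Artur recording ID into its documented fields."""
    m = ID_RE.match(rid)
    if m is None:
        raise ValueError(f"not a valid Artur recording ID: {rid}")
    d = m.groupdict()
    d["speech_type"] = SPEECH_TYPES[d["speech_type"]]
    d["multi_speaker"] = d["speaker"] == "Gvecg"
    return d
```

Such a parser makes the ID usable as a compact metadata record of its own: the speech type and the speaker grouping can be recovered from the filename alone.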
- Crosstalk occasionally occurs in 2-channel recordings of private speech, when a spontaneous conversation between two interlocutors is recorded with two separate microphones. The presence of crosstalk is marked if, in such a recording, the speech of the speaker from the other channel is frequently and clearly audible.
- Frequent overlapping speech is recorded for private speech, when the recorded private conversation involves two interlocutors who frequently speak at the same time.
- The information on whether the speaker wears a mask was topical during the COVID-19 epidemic, when many public events took place with face masks in use. This significantly affects the acoustic characteristics of the recording. The occasional recordings of this kind that were included in the Artur database are therefore labelled accordingly.

III. Data on the format of the source recordings:
- The most frequent formats of the source recordings are WAV, MP3 and M4A.
- Although all recordings in the Artur database are converted into the uniform format WAV, 44.1 kHz, PCM, 16-bit, mono, individual recordings obtained from external sources were originally recorded in other formats. Where the information was available, the source format of the recordings was documented with respect to the sampling frequency, bit rate and bit depth.

IV. Data on the equipment used for recording:
- The most frequently used recording devices for the recordings in the Artur database are laptop or desktop computers, portable recorders, smartphones, cameras and dictaphones.
- The data on the technical properties of the recording equipment comprise: a description of the device (e.g., MacBook Pro, Asus Vivobook, Zoom H4n, Zoom H1n), the name of the operating system (e.g., iOS 14.2.1, Windows 10), information on any audio mixer (e.g., Focusrite Scarlett 2i2 3rd Gen), on an adapter and its model (e.g., Yamaha Audiogram 6), the type of microphone (e.g., desktop, built-in or studio microphone), the microphone model (e.g., Samson Q2U), and the recording software (e.g., Adobe Audition 12, Audacity 2.3.2, Premiere Pro 14.0, Zoom, MS Teams).

V. Data on the source of the recordings:
- The source of a recording can be an own recording, made by the Artur team specifically for this database – this holds for all recordings of read and private speech. In the case of parliamentary and public speech, the material is archival or other material obtained from various providers: the National Assembly of the Republic of Slovenia, STA, Arnes, ZRC SAZU, the University of Maribor, SDJT, Radio Štajerski Val and others.
- For public speech, a web link to the video is often available for the recordings as well.

Many metadata on the recordings were often not available. This holds especially for recordings that were not made by the team itself but obtained from other sources, i.e., for public and parliamentary speech. Recordings were included in the database even if some of their metadata were missing, since, especially for public speech, one cannot expect existing recordings to be documented with metadata in as much detail as is possible when recording deliberately for inclusion in a speech database.

The final list of metadata on speakers in the Artur speech database is as follows:

I. Identification and sociodemographic metadata:
- The speaker ID comprises the name of the database (Artur), the label of the type of speech event (B, J, N, P) and a predetermined four-digit speaker identification number (Gxxxx). Example speaker ID: Artur-N-G5097.
- Gender (male, female, other) is the minimal determinable metadatum on speakers, even when the speech was not recorded as an own source and the speakers did not provide their sociodemographic data themselves.
- Education is divided into 9 categories: primary school – not completed; primary school – completed; short-term vocational education; secondary vocational education; general and technical secondary education (gimnazije, SSI and PTI); short-cycle higher education, professional and university programmes (1st Bologna cycle); professional Master's (2nd Bologna cycle); Master of Science (pre-Bologna); Doctor of Science.
- The age metadatum is divided into the groups: 12–17, 18–29, 30–49, 50–59 and 60+ years.

II. Metadata on the speaker's region:
- The municipality of permanent residence covers both the municipalities of the Republic of Slovenia and permanent residence abroad.
- The most comprehensive possible demographic balance of the speakers of read and private speech is also maintained with respect to the statistical region of their permanent residence.
- The metadatum on the municipality of residence in childhood covers the diachronic aspect of possible dialectal influences on the speaker's speech.
- First language. Besides speakers whose first language is Slovene, the Artur database also includes, to a smaller extent, speakers whose first language is Croatian, Serbian, Macedonian, Bosnian, Russian, Hungarian, etc. This datum is filled in only for speakers from whom it was obtained directly, and for public speakers only if it can be inferred with high probability that their first language is Slovene.
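Since the Artur metadata are distributed both as a TSV file and as TEI XML, a speaker record can be rendered as a TEI-style person element. The sketch below uses generic TEI element names (person with xml:id, sex, age, education, langKnown); the exact tags used in the actual Artur TEI headers may differ:

```python
import xml.etree.ElementTree as ET

def speaker_to_tei(meta):
    """Render one speaker-metadata record as a TEI-style <person> element.

    Element and attribute names follow general TEI conventions; they are
    illustrative, not a reproduction of the Artur header schema."""
    person = ET.Element("person", {"xml:id": meta["id"]})
    ET.SubElement(person, "sex").text = meta["gender"]
    ET.SubElement(person, "age").text = meta["age_group"]
    ET.SubElement(person, "education").text = meta["education"]
    ET.SubElement(person, "langKnown").text = meta["first_language"]
    return person

record = {
    "id": "Artur-N-G5097",
    "gender": "female",
    "age_group": "30-49",
    "education": "doctorate",
    "first_language": "Slovene",
}
xml_string = ET.tostring(speaker_to_tei(record), encoding="unicode")
```

Keeping the TSV and the TEI view generated from one record, as here, avoids the two copies drifting apart.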
- Speech characteristics refer to the social varieties of the language and were determined by the transcriber of the standardized transcription or by the validator of the recordings. They are meant as an aid in possibly adapting speech recognition models to regional characteristics, and can likewise help in analyses of the varieties of spoken Slovene. They are not meant as a precise expert classification of the variety of a speaker's speech on a given recording. Since the detailed theory of social varieties for Slovene (Toporišič, 2000) is difficult to apply unambiguously and robustly to empirical material, it was simplified into three basic categories: standard language, colloquial language and dialect. Given the circumstances of speech, it was assumed that in public and parliamentary speech either standard or colloquial language occurs, where we counted as colloquial language the situation in which systematic phonological phenomena characteristic of non-standard varieties were frequently present in the speaker's speech. Conversely, speech with a recognizably regionally coloured melody was also labelled as standard language if, at the same time, a conscious larger shift away from the speaker's everyday colloquial language towards the standard was evident – this holds especially for speakers from the periphery of Slovenia or other non-central parts of Slovenia. Differences in pronunciation were also detected in read speech, which, due to the circumstances (the reading of pre-written sentences), is hard to divide into standard and colloquial language; for read speech, the labels standard pronunciation and non-standard pronunciation were therefore used. Above all in private speech, the label dialect can also be present. Whenever it was selected, a label for the type of dialect is added, determined on the basis of the metadatum on the speaker's municipality of residence in childhood.
- The last label refers to noticeable pronunciation difficulties. In individual speakers, certain peculiarities occur, connected for example with the pronunciation of the sounds r, l, and the like.

The metadata listed above will be presented in the Artur database with Slovene designations as well as with their English translations.

5. Discrepancies in the metadata on recordings and speakers

Speech corpora created for the needs of linguistic research and speech databases prepared for the purposes of speech recognition are, as a rule, very similar speech resources. It therefore makes sense to seek synergies and to use at least part of the material for both purposes (Žgank et al., 2014). Thus the Gos Videolectures database was already built with its use for speech recognition in mind (Verdonik, 2018), yet in its metadata it still fairly consistently followed the scheme established in the GOS corpus. In the project Razvoj slovenščine v digitalnem okolju, too, a suitable part of the large volume of recordings for the Artur speech database was selected for extending the GOS speech corpus. In doing so, however, quite a few discrepancies arose, to a large extent precisely in connection with the metadata on speakers and recordings; they are mostly a consequence of more precise documentation of the data or of the specific purpose of the database, but they cause problems when materials are merged. Which kinds of metadata most frequently attract divergent decisions?

5.1. Metadata on recordings

There are different categorizations of recorded speech, since these are as a rule made on the basis of what a given speech resource contains. GOS thus distinguished four types of discourse: public informative-educational, public entertainment, non-public non-private, and non-public private. If we compare this with the categorization in the Artur database, we see that the category parliamentary speech appears there in addition, while public entertainment, which practically does not occur in Artur, is missing; the whole of public speech can instead be classified as public informative-educational. There is likewise no non-public non-private category, which refers to various administrative, service, commercial and other similar non-private speech situations in everyday life. Present, however, is read speech, which refers to a very specific speech situation created for the purpose of recording material for the Artur database, in which speakers read previously prepared sentences one by one.

Besides the umbrella categorization of the recordings into a small number of top-level categories, both the GOS corpus and the Artur database use more detailed specifications of the recorded speech with respect to the speech event. More than 20 types of speech events are recorded in the GOS corpus, and likewise in the Artur database, where, however, about half of them serve to specify material that is highly specific to the needs of speech recognizers (spelling, domain-specific recognizers for the smart home and for describing faces). The specification of the type of speech event is extremely important, since it also makes subsequent recategorization of the collected material possible when different sources are merged; it is therefore probably one of the most essential metadata on the types of recordings for any speech resource, more important than the broader umbrella categorization, which can also be changed later on the basis of sorting the information on the types of speech events, or partly also on the basis of information on the source.

Obligatory metadata on recordings in speech resources are the time and location of the recording. While for time the discrepancies can only lie in the greater or lesser precision of the recorded time, for the specification of the location differences arise as to which units we rely on. In the GOS corpus this metadatum was specified in two ways: as the locality, i.e., with the name of the town or village, which, however, is not accessible through the online concordancer in order to protect the identity of the speakers; and as the region of the recording, which can be defined very differently – in GOS it was labelled on the basis of the vehicle-registration areas. In the Artur database, the location metadatum is recorded as the municipality of the recording. In the Slovene context, given the large number and fragmentation of municipalities, recording the location through the municipality seems an appropriate compromise.
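The comparison of the two umbrella categorizations suggests that recategorization between them can be expressed as a simple lookup table. The sketch below encodes it in Python; the assignment of parliamentary speech is our assumption, while read speech deliberately has no GOS counterpart:

```python
# Illustrative harmonization table based on the comparison in Section 5.1.
# Artur's top-level speech types are mapped onto GOS discourse types where
# the text indicates a correspondence; "parliamentary" -> informative-
# educational is an assumption, and "read" has no GOS counterpart.
ARTUR_TO_GOS = {
    "public": "public informative-educational",
    "parliamentary": "public informative-educational",
    "private": "non-public private",
    "read": None,
}

def recategorize(artur_type):
    """Return the closest GOS discourse type for an Artur speech type."""
    if artur_type not in ARTUR_TO_GOS:
        raise KeyError(f"unknown Artur speech type: {artur_type}")
    return ARTUR_TO_GOS[artur_type]
```

When such a table is insufficient, the more detailed speech-event labels discussed above are exactly what allows a finer, per-event recategorization.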
In the Slovene rural environment, stating the exact locality by village name can reveal the identity of speakers, while units larger than the municipality (e.g., the administrative unit, the vehicle-registration area or the statistical region) are no longer precise enough, nor consistent with the dialectal diversity that is proverbially large in Slovenia.

The source metadatum carries information about the original holder of the copyright. As for written texts, it holds for spoken texts too that their producers are at the same time authors with copyright2 over the texts, and there are often contractual commitments that this information will be properly stated in the language resource. With regard to copyright and source attribution for speech recordings, we encounter four kinds of situations: (1) If the recording was made in the field for the purposes of the speech resource and captures authentic speech in everyday situations, the speakers as a rule transfer their rights to the holder of the project in which the speech resource is being created. The practice is that in such cases the source is labelled as a field/own recording. (2) If the recording was broadcast on radio or television, the copyright holders are often the media houses, which are consequently stated as the source. For online sources as well (e.g., recordings on Youtube3), the copyright must often be settled with the holder(s) and the source properly stated in the metadata. For online events organized and published by an institution (e.g., online conferences, workshops, seminars), the copyright must often be settled with the direct producers of these texts. Here the question arises how the source metadatum is best defined: as the individual(s) who ceded the rights and appear on the recording, or as the institution that organized and published the online event. In the Artur database, the second option was chosen for such recordings. (3) Certain internet sources already have their copyright settled in a way that permits further use, namely under the terms of one of the Creative Commons licences. The two larger such sources of recordings for Slovene at the time of collecting material for the Artur database were the portals Videolectures.net and Arnes Video. In such cases the name of the portal is simply stated as the source in the existing databases for Slovene. (4) Certain spoken texts are not protected by copyright. According to Article 9 of the Slovenian Copyright and Related Rights Act (ZASP), such are "official texts from the legislative, administrative and judicial domains". Although there is as yet no case law or doctrine on this point, spoken texts produced in the National Assembly of the Republic of Slovenia within legislative procedures can, among others, be counted as such. In this case, the National Assembly of the Republic of Slovenia is simply stated as the source in the Artur database, where such recordings occur.

Other kinds of metadata on recordings, as presented in Sections 2, 3 and 4, appear in certain speech resources and not in others, depending on the specific purpose of the resource. When speech resources are merged, they can either be omitted or remain undefined if they were not recorded and are not available.

5.2. Metadata on speakers

Although the metadata on speakers are less diverse than the metadata on recordings, differences in how we define them occur in practically all categories except gender.

The most demanding question is connected with the need to record the various regional influences on an individual's speech. Two points are problematic in this respect:

1. The specification of regional influences on a speaker's speech is not necessarily unambiguous. In the 2014 addition to the spoken part of the BNC (British National Corpus), for example, which captured only everyday conversations, the speakers were left to describe their dialect in their own words, and these descriptions were then mapped onto the scheme of statistical territorial units of Great Britain (Love et al., 2017). In Slovene speech resources, too, the practice has taken hold of recording the speakers' region through geopolitical rather than geolinguistic categories. The reason is presumably that reliable geolinguistic categorizations can only be made by experts, and only subsequently, on the basis of the collected data. In the GOS corpus, the categories for the speakers' region were thus defined on the basis of the vehicle-registration areas, of which there are 11 in Slovenia in total, to which categories were added for Slovenes across the borders (Austria, Italy, Hungary) and for speakers whose first language is not Slovene (abroad). Such a division is extremely loose and imprecise compared to the Slovene dialectal diversity. The very concept of "regional affiliation", reduced to the registration label on a car, seems inadequate, although it has the property of robustness, which is very useful for fieldwork. In the Artur database, a more precise, unambiguous, simple and less contentious definition of the metadatum carrying information about the speakers' region was therefore sought. Since we have already identified the name of the locality, especially in rural environments, as problematic because of the potential disclosure of the speaker's identity, the municipality was chosen as the basic unit. At the time of collecting the recordings for the Artur database, Slovenia was divided into 212 municipalities. An advantage of this category is also that municipalities can be simply and unambiguously mapped onto larger geopolitical units – the 12 statistical regions of Slovenia, as defined at the time of the creation of the database by the Statistical Office of the Republic of Slovenia.
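The advantage claimed for the municipality – that it maps unambiguously onto the 12 statistical regions – amounts in code to a plain table lookup. The sketch lists only a few of the 212 municipalities by way of example:

```python
# Illustrative excerpt of a municipality-to-statistical-region table;
# a full version would enumerate all 212 municipalities.
MUNICIPALITY_TO_REGION = {
    "Maribor": "Podravska",
    "Ljubljana": "Osrednjeslovenska",
    "Koper": "Obalno-kraška",
}

def statistical_region(municipality):
    """Map a municipality onto its statistical region, returning None
    when the municipality is unknown (e.g., residence abroad)."""
    return MUNICIPALITY_TO_REGION.get(municipality)
```

Because the mapping is many-to-one and total over Slovenian municipalities, storing only the municipality loses no information about the statistical region.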
PRISPEVKI 210 PAPERS
Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

This yields a rather complex picture: alongside speakers with a single region we also get quite a few speakers with very different combinations of regions, with no individual combination covering many speakers. For the latter it is probably most sensible, in the end, to record a single joint category, "various regional influences", as is done in the C-ORAL-ROM corpus (Cresti and Moneglia, 2005). In the Artur database, the treatment of geographical mobility over time was simplified into two kinds of metadata: the first refers to the municipality of residence in childhood, the second to the municipality of the current permanent residence. A good deal of information about a person's possible further mobility is thereby lost; such information would matter for a detailed analysis of an individual speaker's speech, but it is questionable how relevant it is for (quantitative) corpus analysis or for possible adaptation of a speech recogniser to speakers by region.

For a certain share of speakers, Slovene is not their first language. This, too, is very important information for a speech resource in which such speakers appear. Non-native speakers of Slovene were not excluded from either the GOS corpus or the Artur database; on the contrary, they were deliberately included. The metadata on a speaker's first language is therefore essential in both resources.

Neither the metadata on a speaker's geographical affiliation nor that on their first language, however, reveals what a speaker's speech in a given resource is actually like in terms of the social varieties of the language. That can only be established afterwards, primarily through auditory analysis of the speech. It is thus not metadata recorded in the field but a subsequent interpretation of the speech data. In the GOS corpus this was not done, whereas for the Artur database such labelling was requested for the needs of speech recognisers.

Speaker education and age are metadata through which we primarily ensure an appropriate demographic spread of the speakers captured in a speech resource. For recordings of public speech they are mostly not even available, so for a large share of the recordings in the GOS corpus and the Artur database this metadata is missing. Where they are available, the age and education groups are defined somewhat differently and at different levels of detail, which makes merging the resources more difficult. In our view, the minimal age categories are teenagers (roughly up to 19), retirees (roughly over 60), and everyone in between. For education, GOS uses a four-level and Artur a nine-level classification; in our view, the minimal division is into at least two groups, according to whether a person completed their education after secondary school or continued studying. Greater detail in speaker metadata would probably be of interest mainly for sociolinguistic research, so due caution is needed against over-hasty generalisation into very coarse categories.

6. Conclusion

In this paper we discussed the metadata on recordings and speakers typically used in spoken language resources. We focused on the existing freely available speech resources of the corpus type for Slovene, i.e. the reference speech corpus GOS, the Gos Videolectures database, and the speech database under construction within the project Development of Slovene in a Digital Environment, Artur. The speech data from these three databases represent a source of data for extending the reference speech corpus GOS, and in doing so difficulties with merging become apparent which, among other things,4 also stem from differences in how metadata on recordings and speakers is inventoried and categorised.

In the future we would like to see greater homogenisation of the metadata on recordings and speakers, especially for the key metadata that is essential both for monitoring the balance of the material and for categorising the speech data. For recordings, such key metadata are: (1) a description of the speech event, which must be sufficiently detailed and can be understood in terms of speech situations sharing a larger number of contextual properties, including the type of location, the type of relationship between producers and addressees, and the purpose and channel of communication; (2) the time and location of the recording, where it is especially important that the location be sufficiently detailed, e.g. the name of the town or municipality where the recording takes place; (3) the source of the recording, which is important for the correct handling of copyright and can also help when sorting speech data by type; for subsequent access to video content, a link to the video recording, if one exists, is almost indispensable; (4) always useful and desirable, though perhaps less essential, is all available information about the recording equipment and the technical properties of the recording. For speakers, the key metadata concern: (1) identification, (2) sex, (3) age, (4) first language and (5) region(s), where the last must be defined in sufficient detail (e.g. at the level of town or municipality) and take at least roughly into account the diachronic aspect. Metadata on (6) level of education is also frequently present, whereas recording metadata on occupation, social class or affiliation with various socio-cultural groups has so far not been the practice in Slovene speech resources.

Among other things, this article also presents a detailed description of the metadata on recordings and speakers in the Artur speech database. When assigning the metadata labels it became apparent that not in all categories are the entries unambiguous and easy to determine. Among the labels referring to speakers, the greatest challenge was the category of speech characteristics, as the annotator was often faced with the dilemma of whether the language was still standard or colloquial, or whether it was colloquial or dialectal. As described in Section 4, systematic phonological phenomena characteristic of non-standard varieties were the decisive criterion for labelling speech as colloquial; conversely, a speaker's noticeable effort to use the standard language, even though regionally coloured melody could still be detected in their speech, was decisive for the label standard language. If speech was labelled as dialectal, we relied on the municipality of childhood residence to determine the exact dialect. The remaining speaker metadata was either obtained directly from the speakers or left undetermined. The exception is the speaker's sex, which we determined from the recording even when no direct information was available. For public speakers for whom we had no direct information but could infer with high probability that their first language was Slovene, this information could also be added. There were also many challenges in obtaining metadata about the recordings: when no information from the field is available, it is extremely difficult to infer the type and size of the recording room or to identify the date and municipality of the recorded event, and impossible to provide an exact technical inventory of the recording equipment.

4 There are, it should be noted, also certain differences in the rules for transcribing speech. In this article we focus only on the metadata on recordings and speakers.
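The key metadata listed in the conclusion can be summarised in a small record structure. The sketch below is illustrative only; the field names are our English glosses and do not reproduce the actual GOS or Artur schemas:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of the key recording metadata discussed above.
@dataclass
class RecordingMetadata:
    speech_event: str                # description of the speech event
    time: str                        # time of recording
    location: str                    # town or municipality of recording
    source: str                      # needed for correct copyright handling
    video_url: Optional[str] = None  # link to the video, if one exists
    equipment: Optional[str] = None  # recording equipment, when known

# Illustrative sketch of the key speaker metadata discussed above.
@dataclass
class SpeakerMetadata:
    speaker_id: str
    sex: str
    age_group: str                   # e.g. "up to 19", "in between", "over 60"
    first_language: str
    regions: list = field(default_factory=list)  # municipality-level region(s)
    education: Optional[str] = None  # frequently present, but optional
```

Making the equipment and education fields optional mirrors the practice described in the paper: such metadata is entered only when it is actually known.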
In the Artur database, this metadata was entered only when it was known.

The Artur database is primarily intended for the development of speech recognition models, but with its exceptionally detailed inventory of metadata it can serve as a starting point for possible upgrades or for the development of similar speech resources in the future. After its completion, from November 2022 onwards, it will be freely available through the CLARIN.SI repository under a Creative Commons licence.

7. References

Jens Allwood, Maria Björnberg, Leif Grönqvist, Elisabeth Ahlsen and Cajsa Ottesjö. 2000. The spoken language corpus at the Linguistics Department, Göteborg University. Forum Qualitative Social Research, 1(3).
Lou Burnard, ed. 2007. Reference Guide for the British National Corpus (XML Edition). http://www.natcorp.ox.ac.uk/XMLedition/URG/.
Patrick Cardinal, Najim Dehak, Yu Zhang and James Glass. 2015. Speaker adaptation using the i-vector technique for bottleneck features. In: Proceedings of Interspeech 2015, pp. 2867–2871.
Emanuela Cresti and Massimo Moneglia, eds. 2005. C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. John Benjamins Publishing Company, Amsterdam/Philadelphia.
Christoph Draxler and Stefan Kleiner. 2017. A cross-database comparison of two large German speech databases. In: Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK, 10–15 August 2015. International Phonetic Association.
Oliver Ehmer and Camille Martinez. 2014. Creating a multimodal corpus of spoken world French. In: Şükriye Ruhi, Michael Haugh, Thomas Schmidt and Kai Wörner, eds., Best Practices for Spoken Corpora in Linguistic Research, pp. 142–161. Cambridge Scholars Publishing, Newcastle upon Tyne.
Santosh Gondi and Vineel Pratap. 2021. Performance evaluation of offline speech recognition on edge devices. Electronics, 10, 2697. MDPI, Basel, Switzerland.
John R. Hershey, Jonathan Le Roux, Shinji Watanabe, Scott Wisdom, Zhuo Chen and Yusuf Isik. 2017. Novel deep architectures in speech processing. In: New Era for Robust Speech Recognition, pp. 135–164. Springer.
Martin Karafiát, Karel Veselý, Kateřina Žmolíková, Marc Delcroix, Shinji Watanabe, Lukáš Burget, Jan "Honza" Černocký and Igor Szőke. 2017. Training data augmentation and data selection. In: New Era for Robust Speech Recognition, pp. 245–260. Springer.
Jáchym Kolář and Jan Švec. 2008. Structural metadata annotation of speech corpora: Comparing broadcast news and broadcast conversations. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. ELRA.
Robbie Love, Claire Dembry, Andrew Hardie, Vaclav Brezina and Tony McEnery. 2017. The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3):319–344.
Nelleke Oostdijk, Wim Goedertier, Frank Van Eynde, Lou Boves, Jean-Pierre Martens, Michael Moortgat and Harald Baayen. 2002. Experiences from the Spoken Dutch Corpus project. In: M. González Rodriguez and C. Paz Suárez Araujo, eds., Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), pp. 340–347. Las Palmas, Canary Islands. ELRA.
Petr Pořízka. 2009. Olomouc Corpus of Spoken Czech: Characterization and main features of the project. Linguistik online, 38(2). http://www.linguistik-online.de/38_09/porizka.html.
Daniel Povey, Hong-Kwang J. Kuo and Hagen Soltau. 2008. Fast speaker adaptive training for speech recognition. In: Proceedings of Interspeech 2008, pp. 1245–1248.
Jože Toporišič. 2000. Slovenska slovnica. Založba Obzorja, Maribor.
Darinka Verdonik. 2018. Korpus in baza Gos Videolectures. In: Darja Fišer and Andrej Pančur, eds., Zbornik konference Jezikovne tehnologije in digitalna humanistika, pp. 265–268. Znanstvena založba Filozofske fakultete, Ljubljana.
Darinka Verdonik and Ana Zwitter Vitez. 2011. Slovenski govorni korpus Gos. Trojina, zavod za uporabno slovenistiko, Ljubljana.
Jana Zemljarič Miklavčič. 2008. Govorni korpusi. Znanstvena založba Filozofske fakultete, Ljubljana.
Zixing Zhang, Jürgen Geiger, Jouni Pohjalainen, Amr El-Desoky Mousa, Wenyu Jin and Björn Schuller. 2018. Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Transactions on Intelligent Systems and Technology (TIST), 9(5):1–28.
Jerneja Žganec Gros and Boštjan Vesnicer. 2020. Izbor fonetično uravnoteženih besedilnih predlog za bazo branega govora. In: Tanja Mirtič and Marko Snoj, eds., Razprave II. razreda SAZU: 1. slovenski pravorečni posvet, pp. 111–119. Slovenska akademija znanosti in umetnosti, Ljubljana.
Andrej Žgank, Ana Zwitter Vitez and Darinka Verdonik. 2014. The Slovene BNSI Broadcast News database and reference speech corpus GOS: Towards the uniform guidelines for future work. In: Nicoletta Calzolari et al., eds., Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 2644–2647. Reykjavik, Iceland. ELRA.
Application of the Europeana Data Model (EDM) in the Digitisation of Cultural Heritage: The Example of the Slovene Ethnographic Museum's Skušek Collection in the PAGODE-Europeana China Project

Maja Veselič,* Dunja Zorman†
Oddelek za azijske študije, Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
*maja.veselic@ff.uni-lj.si, †dunja.zorman@ff.uni-lj.si

Abstract
This paper first introduces the data model of the East Asian Collections in Slovenia database. It then details how it was adjusted to the Europeana Data Model for the purpose of publishing more than 900 objects of Chinese cultural heritage, mostly photographs, in Europeana. It recounts the steps taken in creating the adjusted model and the procedure of data preparation for the import using the MINT tool. It concludes with a reflection on the positive impacts of this experience on the work with the original database.

1. Introduction

In recent years, states and supranational organisations have been encouraging the institutions that keep and safeguard cultural heritage (galleries, libraries, archives and museums) to accelerate the digitisation of cultural heritage. This is meant not only to protect and preserve cultural heritage and to ease access to tangible and intangible heritage for research, educational or amateur purposes, but also to stimulate economic growth by promoting creativity and innovation, for example in tourism through new digital and tourist products, or as a source of ideas and inspiration in the so-called creative industries.1

Yet for digitised heritage to be truly accessible and usable, that is, for a user's search to return as many hits as possible that match the search parameters as precisely as possible, for the user to reach relevant hits even when the data is recorded in a language other than the language of the search, and for the hits to be further filterable by various parameters, digitised objects must be equipped with metadata of the highest possible quality. Such metadata greatly eases the selection of material for the specific needs of concrete users, which among other things contributes to better curation (e.g. in the form of digital galleries and exhibitions) and easier visualisation of content.

In this paper we present our experience of adjusting the data model of the East Asian Collections in Slovenia (VAZ) database to the Europeana Data Model (EDM) when importing selected digitised objects from the Skušek Collection into the European digital library Europeana. This concerns part of a collection of predominantly Chinese objects, brought from Beijing to Ljubljana in 1920 by the naval officer Ivan Skušek Jr and today kept by the Slovene Ethnographic Museum. These objects were digitised and published in Europeana within the project PAGODE-Europeana China (2020–2021, hereafter PAGODE),2 while the analysis of the objects and the creation of descriptive (content) data took place within the project group East Asian Collections in Slovenia (2018–2021, VAZ),3 whose database and portal of the same name are also the source location of the digital photographs of the objects.

For someone encountering metadata and databases for the first time, without technical expertise, facing the EDM and the processing of data for import and publication in Europeana is daunting. Through a reflection on our own mistakes and eventual success, we wish to encourage those who hesitate to publish their material in Europeana.

2. The Europeana Data Model (EDM)

The European digital library Europeana, financed by the European Union, ranks among the most important collections of digital cultural heritage in Europe.
Today Europeana holds material from more than 4,000 individual institutions, comprising several tens of millions of image and text files, nearly a million audio and video recordings, and more than 8,000 3D models.4 It should be stressed that the library does not store digitised cultural heritage objects on its own servers;5 they are accessible through links to various institutional and national repositories and databases. Europeana merely displays the digitised objects and publishes their (meta)data. Nor is Europeana in direct contact with individual institutions; it obtains the data from aggregators, which collect, review and, before the import into Europeana, enrich the data supplied by various cultural and heritage institutions or organisations (so-called content providers). Many, though not all, aggregators are organised as intermediaries at the national level. For Slovenia this role is performed by the National and University Library in Ljubljana.6

In the case of Europeana, a particular challenge is therefore posed by the sheer number of institutions publishing their content there and by the diversity of the ways of organising (meta)data that they have shaped through their institutional histories and practices. Some metadata standards are admittedly widespread, for example LIDO, developed by the International Council of Museums (ICOM) and used by numerous museums. But material comes into Europeana from different kinds of cultural institutions and in different types, and the library moreover constitutes a multilingual environment. Europeana therefore developed its own metadata model, integrating into it elements of previously established standards, notably ORE, DCMI, SKOS and CRM.

The EDM divides metadata into three core classes: (1) metadata tied to the cultural heritage object that is digitised (edm:ProvidedCHO), e.g. when and where the object was made, who created it, what its dimensions are; (2) metadata tied to the web resource or resources associated with the object (edm:WebResource), e.g. the format of the web resource, who contributed it, what the copyright status is; and (3) metadata connected with the aggregation, i.e. the mechanism that joins the two classes above and represents the import of the data into Europeana, e.g. which aggregator contributes the data and where it is displayed (ore:Aggregation) (Europeana, 2017).

In addition, the EDM provides contextual classes: metadata about an agent (edm:Agent), a place (edm:Place), a time span (edm:TimeSpan), a concept (skos:Concept) and a licence (cc:Licence). Among agent data, for example, we record the people an object has encountered in its lifetime or who are connected with it in any way; among place data, the places where it has ever been located (Europeana, 2017).

Regarding metadata quality, Europeana further emphasises two things: multilinguality and the use of controlled vocabularies. Europeana displays collections and objects in twenty-four European languages. To this end, the model must include information about the language in which the values, i.e. the data in individual fields, are recorded. In addition, for the widest possible linguistic coverage, extensive use of identifiers from linked open data and controlled vocabularies, such as Wikidata, the Getty Arts & Architecture Thesaurus (AAT) or the geographical database GeoNames, is desirable. Metadata tied to identifiers is then displayed not only in the language of the search but also in the other European languages included in the controlled vocabularies or linked open databases. The identifiers also serve the further semantic enrichment of the metadata. This is excellent for the end user, as it increases the number of keywords through which a user can find a given object in the Europeana browser.

The richness and structuredness of the data thus affect how findable the objects are. At Europeana, the various paths to objects are called discovery scenarios,7 and four basic modes of findability are distinguished: by time or time span, by themes and types, by agents, and by locations.

To encourage providers to publish the richest and best-structured metadata possible, Europeana has in recent years developed a three-tier metadata quality standard, with each tier enabling a particular user experience. Tier A allows only basic searching for specific objects, Tier B enables the exploration of content, and Tier C constitutes a knowledge platform, supporting numerous cross-cutting searches, among other things by specific themes, motifs, colours and other properties of a cultural heritage object (Europeana, 2019b). Although in the PAGODE project we did not commit to a particular metadata tier, most partners strove to reach Tier C.8

3. The PAGODE - Europeana China Project

The international project PAGODE - Europeana China (PAGODE),9 which ran for 18 months (2020–2021) and was led by the Italian Ministry of Economic Development, brought together public and private institutions from the fields of research, cultural heritage and cultural management with the aim of enriching, highlighting and further illuminating Chinese cultural heritage in Europeana. Six partners and 15 associate partners took part. The main goal was to add 10,000 newly digitised objects of Chinese cultural heritage to Europeana, to automatically enrich the metadata of 20,000 existing objects, and to add metadata to a further 2,000 objects through manual annotation in a crowdsourcing campaign. The second central goal was to present Chinese heritage to Europeana's users through curated content: galleries, blogs, exhibitions and a dedicated hub for Chinese heritage.10

Figure 1: Curated content on Chinese cultural heritage in Europeana under a common thematic entry point.

In the PAGODE project, aggregation for most content providers was carried out by the partner and accredited aggregator Photoconsortium,11 which specialises in photographic content in Europeana. Besides the fact that many of the participating content providers contributed precisely photographic material to Europeana, this form of aggregation allowed better oversight of the fulfilment of the ambitious targets for new entries and automatic enrichment.

The Department of Asian Studies of the Faculty of Arts, University of Ljubljana, also took part as a partner, more precisely three members of the national research project East Asian Collections in Slovenia (2018–2021).12 Our task was to establish the semantic scheme of the project, which was to guide both the selection of new objects (defining what Chinese heritage in Europe actually is) and the enrichment of existing objects (keywords defining Chinese heritage). Although we had no experience with Europeana, and the VAZ project itself had only just begun, we were drawn to the invitation above all because it promised access to additional funds for digitising objects we intended to include in the VAZ digital database.

The largest collection of Chinese objects in Slovenia is the Skušek Collection at the Slovene Ethnographic Museum (SEM), which comprises more than 500 objects of smaller and larger dimensions, including furniture, decorative screens, porcelain, textiles, paintings, statuettes, smoking paraphernalia, coins, books and photographs. The objects were brought to Ljubljana in 1920 by Ivan Skušek Jr (1877–1947), a naval officer who found himself interned in Beijing during the First World War, together with his future wife, the Japanese Kondō Kawase Tsuneko (1893–1963), who had been living in China and was later baptised Marija Skušek. Skušek intended to establish a museum of Chinese culture at home, but financial difficulties prevented this. After her husband's death, Marija Skušek bequeathed the collection to the state, and most of the objects ended up in the Slovene Ethnographic Museum. Only a few are on display in the permanent exhibition, and until recently many of them had not even been properly inventoried.13

At our suggestion, SEM joined the PAGODE project as an associate partner, preparing for this purpose digital photographs of around 200 coins as well as scans of two printed albums, published in Japan, of architectural photographs and sketches of imperial Beijing; two painted albums with scenes of Chinese punishment and of the everyday life of wealthy women and children; and an album with 450 mounted photographs of Beijing and its surroundings. The descriptive data on the objects was prepared within the VAZ project, while the adjustment of the data scheme used in the VAZ database for the import into Europeana was prepared by the two authors. In the remainder of the paper we therefore first present the data scheme developed in the VAZ project, and then its adjustment for the import into Europeana.

4. The Data Scheme of the VAZ Database

The VAZ project is a national research project that formalised several years of efforts towards a systematic inventory and scholarly treatment of East Asian collections and objects in various Slovenian museums (Vampelj Suhadolnik, 2019). The joint database and portal, the central outcomes of the project work, constitute a kind of digital counterpart of the museums of East Asian arts and cultures found in many capitals and large cities across Europe, North America and elsewhere.

1 https://digital-strategy.ec.europa.eu/en/news/commission-proposes-common-european-data-space-cultural-heritage.
2 https://photoconsortium.net/pagode/.
3 https://vazcollections.si/.
4 https://www.europeana.eu/en/about-us.
5 The term object does not denote only a material object: Europeana presents objects of both tangible and intangible heritage as well as born-digital objects.
6 http://www.agregator.si/.
7 The metadata categories required for each scenario are presented in Charles, Isaac and Hill (2015).
8 The Europeana Publishing Framework also distinguishes levels of content quality (from 1 to 4), measuring the quality of the digital record (photograph, audio recording, etc.) and the possibilities for its reuse with respect to copyright (Europeana, 2019a).
9 The project was funded by the European Commission under the Connecting Europe Facility.
10 https://www.europeana.eu/en/chinese-heritage. The hub remains the central collection point for curated content on Chinese heritage in Europeana after the end of the PAGODE project.
As the initiator and lead partner of the project, the Department of Asian Studies of the Faculty of Arts, University of Ljubljana, is committed to the permanent preservation, supplementation and updating of the database and portal after the end of the project as well, naturally to the extent that future financial capacities and work obligations allow.14

One of our central goals has always been for the database of photographs and data to be publicly accessible and easy to use, since the great majority of the objects have not been exhibited for decades, and only a few of them have received thorough analysis, as Slovenian museum institutions lack staff with the appropriate specialised knowledge.15 Within the project we dealt with the already mentioned Skušek Collection from the SEM, the Alma Karlin Collection and the Collection of Objects from Asia and South America from the Celje Regional Museum, the East Asian pieces in the ceramics collection of the National Museum, and the album of East Asian postcards of the naval chaplain Ivan Koršič kept by the Piran Maritime Museum.

In designing the data scheme, we first consulted the curators of the collections concerned and some of their museum colleagues. All the participating institutions use the Galis program, which its designers developed in collaboration with domestic and foreign expert institutions, including Europeana. The scheme of data that can be recorded for an individual object is extremely extensive, but in practice curators fill in only a few basic categories, the selection being influenced both by the types of objects in their care and by their entirely individual habits and ambitions. The foreign experts we worked with, museum curators as well as academic researchers specialising in various aspects of East Asian art, likewise advised us to develop the data scheme according to the objects actually found in Slovenian museum collections. These are in fact extremely varied, comprising ceramics and porcelain, statues, furniture, textiles, fans, numismatics, photographs and postcards, paintings and woodcuts, weapons, architectural models and various objects of everyday use. After a basic review of the selected collections, the research group divided the object types among ourselves according to use,16 and each of us then reviewed, for the assigned object type, the data schemes of various renowned museums and archives available online.

11 https://www.photoconsortium.net/.
12 The project, with the full title East Asian Collections in Slovenia: The Integration of the Slovenian Space into the Global Exchange of Objects and Ideas with East Asia (2018–2021) (no. J7-9429), was funded by the Slovenian Research Agency (ARRS). Besides the authors of this paper, Nataša Vampelj Suhadolnik also took part in the PAGODE project.
13 The journey of the collection from China to the SEM is described by Berdajs (2021) and Motoh (2021), and the reasons for its inadequate treatment in the museum by Vampelj Suhadolnik (2021).
14 The newly acquired national research project Orphaned Objects: The Treatment of East Asian Objects outside Organised Collecting Practices in the Slovenian Space (2021–2024) (ARRS, no. J6-3133) provides funds for further work and technical improvements.
15 Within the VAZ project, the analysis was carried out mostly by sinologists, japanologists and a koreanist, with the support of the responsible curators. In addition, we organised several workshops at which foreign experts also examined selected objects or groups of objects.
In doing so, we were of course limited to institutions that had already digitised parts of their collections and offered them for public viewing, and to those kinds of data they considered relevant for visitors and therefore displayed on their pages.

After several rounds of consultation and of widening and narrowing the set of data elements, we formed the scheme below, dividing the metadata into those that will be visible to visitors of the portal and those we collect for our research analyses and administration. Data not displayed on the portal is written in italics. Data used on the portal as display filters is marked with an asterisk.

At present our database takes the form of an Excel table with separate sheets for the object types, but because of its ramification it is unfriendly for data entry and hard to survey. We are in the process of a technical overhaul of the database, so that in future its entry point will be a website and its interface will take the form of a data card. In doing so we will add a few new categories of administrative data, e.g. the authors of the record about an object.

Administrative data:
• Sequential number (the object's inventory number in our database, marked with letters for the type and the sequential number of the entry)
• Photograph (names of the photo files showing the object)
• Copyright
• Data on the entry process (we record whether an entry is completed, proofread and transferred to the portal)

Object location:
• Collection/album*
• Museum*

Provenance and treatment of the object:
• Current owner
• Time of acquisition
• Mode of acquisition
• Past owners and periods of ownership
• Condition of the object, treatment, damage
• Exhibition history
• Publications in the media
• Original inventory numbers

Origin of the object:
• Century*
• Period* (dynastic periods)
• Region*
• Place of manufacture
• Author (original and transcription)
• Workshop/factory/publisher (original and transcription)
• Dating (emperors)

Object description:
• Name of the object
• Use*
• Secondary use*
• Material*
• Secondary material*
• Textual description
• Description of material
• Production technique
• Dimensions
• Inscription, content (original, transcription, translation)
• Signature(s) (original, transcription)
• Censor (original, transcription, translation)
• Stamp (original, transcription, translation)
• Date and place of correspondence
• Addressee
• Sender
• Number of parts/pieces
• Name in the source language (original, transcription)
• Motif
• Format
• Binding technique
• Calligraphy style
• Location of the inscription on the object
• Location of the signature on the object
• Location of the censor's signature on the object
• Location of the stamp on the object

Table 1: The VAZ data scheme.

In shaping the VAZ data scheme we were thus guided by two questions: which kinds of data are, or may become, useful for our research, and which kinds of data may be of interest to other users, whether experts or amateurs.
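As a hypothetical illustration of how the portal's asterisked filter fields work (the field names below are our English glosses of Table 1, not the database's actual column names), a record and a filter check could look like this:

```python
# Hypothetical sketch of one VAZ-style record with English glosses of a few
# fields from Table 1; the real database is an Excel table with one sheet
# per object type, and its actual column names differ.
record = {
    "sequential_number": "FOT-001",           # type letters + entry number
    "name": "Photograph of the Drum Tower",
    "use": "postcards and photographs",       # portal filter field (*)
    "material": "paper",                      # portal filter field (*)
    "museum": "Slovene Ethnographic Museum",  # portal filter field (*)
    "century": "20th",                        # portal filter field (*)
    "visible_on_portal": True,
}

def matches(rec, **filters):
    """Check whether a record matches all the given filter values."""
    return all(rec.get(k) == v for k, v in filters.items())
```

The fields marked with an asterisk in Table 1 are exactly those a portal visitor can combine in such filter queries.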
Čeprav smo imeli javnost projekta VAZ, nismo poznali praks digitalne humanistike. nenehno v mislih, pa vse do konca projekta PAGODE Preveč osredotočeni na predmete kot muzejske predmete nismo veliko razmišljali o načinih priprave kuriranih po eni ter njihov vzhodnoazijski izvor na drugi strani, vsebin, še manj pa o pomenu metapodatkov v tem procesu. nismo našli poti do tistih institucij in strokovnjakov in 16 Glede na pogostnost v zbirkah smo naredili naslednjo dodatki, orožje in vojaška oprema, pahljače, pohištvo in notranja tipologijo po rabi (po abecednem vrstem redu): arhitektura in oprema, posodje in pribor, predmeti za osebno nego, pripomočki modeli, glasbila in gledališki predmeti, igre in igrače, kipi, za kajenje in uživanje substanc, razglednice in fotografije, knjige in tiskani materiali, numizmatika, oblačila, obutev in religijski predmeti, slike in grafike ter drugo. PRISPEVKI 216 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 strokovnjakinj v našem prostoru, ki se ukvarjajo z je vrednost zapisana. Poleg tega smo morale uporabiti tudi digitalizacijo kulturne dediščine, digitalnimi arhivi in vsaj štiri različne elemente iz dveh različnih scenarijev za digitalno humanistiko. Poleg tega podjetje, ki skrbi za odkrivanje ter vsaj dva kontekstualna razreda z ustreznimi tehnično podporo našega portala, nima izkušenj z povezavami na odprte podatke oziroma nadzorovane razvijanjem podatkovnih baz, smo pa z njimi v preteklosti besednjake. dobro sodelovali. Za začetek snovanja podatkovnega modela, smo definirale naslednja izhodišča: 5. 
➢ we take the original database of the VAZ project as the basis;
➢ we identify as many data fields as possible in the source database that can be translated into EDM;
➢ we add the administrative metadata required by EDM;
➢ to the greatest possible extent, we add identifiers from controlled vocabularies to the metadata;
➢ with the metadata we also cover the diversity of Europeana's end users.

5. Adapting the VAZ Data Model for Import into Europeana

All the digital cultural heritage objects in the VAZ project that we intended to publish in Europeana belonged to the image type, as they were digital photographic reproductions of selected objects from the Skušek Collection. In the PAGODE project we committed ourselves to reaching the higher levels of Europeana's content quality tiers, which mark objects with a high potential for use in education, on open platforms and in the creative industries (cf. Europeana 2019a). The photographs or scans therefore had to meet two requirements: (1) their size could not be smaller than 1200 x 1200 pixels, and (2) they had to be freely accessible or usable under a Creative Commons Attribution-ShareAlike licence (CC BY-SA).

Our source VAZ database contains 46 categories. Of these, we included 23 as elements in the various EDM classes. The omitted data were mostly those intended for object types that were not included in PAGODE. In VAZ, for example, we collect data intended for postcard research, such as the addressee and the sender; since we did not import postcards into Europeana, we excluded these categories of data when preparing the Europeana model.

For our purposes we used all three core classes of EDM. First, in each of them we identified the minimal required metadata elements (and noted down their standardised properties). These are (1) the form of the digital surrogate (edm:type), (2) the data custodian (edm:dataProvider), (3) the name of the national aggregator or other institution that enabled the flow of data into Europeana (edm:provider), in our case Photoconsortium, and (4) the usage rights (edm:rights). Immediately afterwards we added to the model the contextual classes for an agent (edm:Agent), a time span (edm:TimeSpan) and a concept (skos:Concept). We then included all the recommended metadata elements that coincided with individual categories from the VAZ database, such as the description of the object (dc:description) and its dimensions (dcterms:extent). Next came the inclusion of recommended metadata elements that are not in the source database but would enable a broad spectrum of uses. Here we got stuck: it turned out that in the time available we would not manage to fill the model with the missing data so as to satisfy all the identified Europeana users. Although we had originally wanted to include data for the creative sector as well, it was precisely for the elements intended for it (motifs, patterns, colours) that we lacked the most data, so we abandoned this part of the scheme. The entries were, however, automatically enriched with colour data during the import into Europeana, so that an object from the Skušek Collection can today also be searched and filtered by this criterion.

Slika 2 / Figure 2: Display of object data on the VAZ portal.

When considering which EDM data categories to fill, we were likewise guided by our ambition to achieve a high level of metadata quality, which Europeana assesses with respect to multilinguality, the use of discovery scenarios and the contextual classes. Since we wanted to address a broad spectrum of end users, from experts to cultural heritage enthusiasts and representatives of the creative sector, we set the requirements of tier C as our benchmark. In practice this means that a user could find the coins from the Skušek Collection with the general query "Chinese coins", or with a very detailed, expert query for a specific coin type, "Daoguang tongbao", in characters »道光通寶«. For representatives of the creative sector, on the other hand, metadata enabling search by motifs, patterns and colours are particularly useful.

Finally, we added identifiers from controlled vocabularies to 19 elements.
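To make the mapping concrete, a single mapped record can be pictured as a reduced EDM-flavoured XML document. The sketch below uses Python's standard xml.etree.ElementTree; the namespace URIs are the published EDM and Dublin Core ones, but the record is deliberately simplified: in full EDM the aggregation-level properties (edm:dataProvider, edm:provider, edm:rights) attach to an ore:Aggregation class rather than to edm:ProvidedCHO, and the values shown are invented.

```python
# A deliberately reduced EDM-style record for one digitised object.
# In full EDM, aggregation-level properties live on an ore:Aggregation
# class; here everything is flattened onto one element for brevity.
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "edm": "http://www.europeana.eu/schemas/edm/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

def q(prefix, tag):
    """Return the fully qualified {uri}tag name that ElementTree expects."""
    return f"{{{NS[prefix]}}}{tag}"

root = ET.Element(q("rdf", "RDF"))
cho = ET.SubElement(root, q("edm", "ProvidedCHO"))

# The four mandatory properties identified above:
ET.SubElement(cho, q("edm", "type")).text = "IMAGE"
ET.SubElement(cho, q("edm", "dataProvider")).text = "VAZ project"
ET.SubElement(cho, q("edm", "provider")).text = "Photoconsortium"
ET.SubElement(cho, q("edm", "rights")).text = "http://creativecommons.org/licenses/by-sa/4.0/"

# Recommended properties with direct VAZ counterparts (invented values):
ET.SubElement(cho, q("dc", "title")).text = "Daoguang tongbao coin"
ET.SubElement(cho, q("dc", "description")).text = "Bronze cash coin from the Skušek Collection."
ET.SubElement(cho, q("dcterms", "extent")).text = "d = 2.2 cm"

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```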
For reasons of language accessibility, we additionally translated two elements into English, namely the object name (dc:title) and the intermediate provider (edm:intermediateProvider). In the end, the data model for the import contained 67 elements.

The target metadata quality tier required that at least 75 percent of the data in the data elements that Europeana ranks as most relevant for discovery also carry metadata about the language or languages in which the value is written. In addition, we had to use at least four different elements from two different discovery scenarios, and at least two contextual classes with appropriate links to open data or controlled vocabularies.

Once the model was complete, we set about acquiring the missing metadata, most of which were identifiers. This part of the process went quickly. Because of one of our central tasks in the PAGODE project, the preparation of a semantic scheme for the automatic and mass manual annotation of objects of Chinese cultural heritage already in Europeana, we had several months earlier compiled a list of slightly under 1000 keywords with the corresponding identifiers from the Getty AAT and from Wikidata.17 To achieve this, we had previously reviewed at least three times as many keywords in these controlled vocabularies, so we had a very good overview of what each vocabulary offers and what we could use in our data model.

Besides the identifiers, we also had to enter into the adapted scheme the unique identifiers of the photographs published on the VAZ website, since Europeana displays digitised objects directly from the servers of the institutions or organisations. Finally, we added the metadata that link individual parts into a complete set (photographs into a photo album, inserted sheets into printed or painted albums); as values we entered the unique identifier of the photograph of the object published on the website that was next in the sequence (edm:isNextInSequence).

With all this work we ultimately reached the desired tier C of metadata quality, which guarantees precise browsing and enables Europeana to function as a knowledge platform.

6. Reflection

Work on the PAGODE project, both the preparation of the semantic scheme and the adaptation of the VAZ metadata model to EDM, was an extremely valuable experience for the participating researchers, through which we could evaluate and then improve our work on the VAZ project as well. As experts in East Asian studies, we try to obtain the most exhaustive data possible for the objects we digitise in the VAZ project and organise them in a relatively fine-grained data scheme. In adapting our scheme to the Europeana Data Model, and especially in filling that scheme with concrete data, we therefore had a much easier task than other content providers participating in PAGODE who lacked such specialised knowledge. Once we understood the definitions of the individual EDM elements, we quickly found suitable counterparts for the VAZ data categories; what was missing from the VAZ scheme, naturally, were the data tied to the web resource and the aggregation. For the needs of the PAGODE project we added to the VAZ scheme a category for the copyright of the photographs, since this information must be displayed both on the Europeana page and on the source page.18 In EDM we enriched the data from the VAZ database above all with links to open data and controlled vocabularies, but for now we do not intend to include these in the VAZ database itself, since our priority is extending the database with new entries.
In the end we used the MINT platform for the mapping, which the project partners use regularly. The platform enables metadata mapping and the filling of Europeana with new content without prior programming knowledge of the XML data structure that technically underpins the aggregation. MINT lets you connect the elements of your own data set with the EDM elements. The data model can be imported in several ways, among others as a CSV file, which is what we did. The mapping is then configured through a user-friendly interface, and the program converts it into the XML form that enables the final ingestion of the content into Europeana.

The MINT tool, which we used for the technical processing of the data for import into Europeana, thus required no programming skills, so we could carry out this part of the work ourselves. The drawback of this way of publishing data is that it is a one-time import, so the data will not be updated in step with the VAZ database. From the Europeana page of an individual object in the Skušek Collection, a user will still be able to follow a link to the VAZ page and see the latest version there, but the data in Europeana will not be refreshed until we carry out a new import. Had we known this at the outset, we would certainly have considered importing through the national aggregator, although given the time pressure and the limited funds we might in the end have opted for the simpler aggregation via MINT anyway. We must stress that the data in Europeana will not be wrong or of poor quality; they will merely be slightly less rich than in the VAZ database, which we will keep extending with new research findings.

Slika 3 / Figure 3: Display of the metadata in Europeana.
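The CSV route can be sketched in a few lines: the records are flattened into one row per object, with one column per metadata element, and the resulting file is uploaded to MINT, where the column-to-EDM mapping is configured. A minimal sketch (the column names and values are illustrative, not the actual VAZ export format):

```python
# Flatten object records into a CSV file of the kind that can be
# uploaded to MINT for mapping onto EDM elements.
# Column names and values are illustrative, not the real VAZ export.
import csv
import io

records = [
    {"title": "Daoguang tongbao coin", "description": "Bronze cash coin.",
     "extent": "d = 2.2 cm", "rights": "CC BY-SA 4.0"},
    {"title": "Fan with calligraphy", "description": "Folding fan.",
     "extent": "24 x 40 cm", "rights": "CC BY-SA 4.0"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "description", "extent", "rights"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```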
In addition to the mapping itself, we manually assigned to every metadata element in MINT the language in which it is written, and into the contextual class for the agent (edm:Agent) we manually entered the agents' names in different languages (e.g. Marija Skušek / Kondō-Kawase Tsuneko / 近藤常子).

Through our participation in the PAGODE project we have also become more ambitious about curating and visualising the content on the VAZ portal. While mulling over ideas on how to present our findings in more accessible and more attractive ways, the two of us concluded that it would be better if our metadata scheme were even more fine-grained, and if the elements that we currently record together were split further. For postcards, for example, we enter the place and the date of the postmark in the same field, although separating them would be better for further processing. The same holds for provenance, where the current and past owners are listed together. In addition, we were impressed by the EDM contextual classes, which, especially for material with rich data on places, persons and time, would make it easier for us to analyse and display the itineraries, networks and life stories of the objects.

17 We also created around 80 of the Wikidata entries ourselves for the needs of the PAGODE project.
18 Our original plan was for all the photographs in the VAZ database to be freely available for non-commercial use, but the participating museums jointly decided that they wished to retain copyright. The SEM, too, changed the rights only for the photographs of the objects that were added to Europeana.

7. References

Tina Berdajs. 2021. Retracing the Footsteps: Analysis of the Skušek Collection. Asian Studies, 9(3): 141–166. https://doi.org/10.4312/as.2021.9.3.141-166.
Valentine Charles, Antoine Isaac and Timothy Hill, eds. 2015. Discovery - User scenarios and their metadata requirements. https://pro.europeana.eu/files/Europeana_Professional/EuropeanaTech/EuropeanaTech_WG/DataQualityCommittee/DQC_DiscoveryUserScenarios_v3.pdf.
Europeana. 2017. Europeana Data Model – Mapping Guidelines v2.4. https://pro.europeana.eu/files/Europeana_Professional/Share_your_data/Technical_requirements/EDM_Documentation/EDM_Mapping_Guidelines_v2.4_102017.pdf.
Europeana. 2019a. Europeana Publishing Framework: Content. https://pro.europeana.eu/files/Europeana_Professional/Publications/Publishing_Framework/Europeana_publishing_framework_content.pdf.
Europeana. 2019b. Europeana Publishing Framework: Metadata. https://pro.europeana.eu/files/Europeana_Professional/Publications/Publishing_Framework/Europeana_publishing_framework_metadata_v-0-8.pdf.
Helena Motoh. 2021. Lived-in Museum. Asian Studies, 9(3): 119–140.
Nataša Vampelj Suhadolnik. 2019. Zbirateljska kultura in vzhodnoazijske zbirke v Sloveniji. In: Andrej Bekeš, Jana S. Rošker and Zlatko Šabič, eds., Procesi in odnosi v Vzhodni Aziji: zbornik EARL, 93–137. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani. https://doi.org/10.4312/9789610602699.
Nataša Vampelj Suhadolnik. 2021. Between Ethnology and Cultural History: Where to Place East Asian Objects in Slovenian Museums? Asian Studies, 9(3): 85–116. https://doi.org/10.4312/as.2021.9.3.85-116.

Human Evaluation of Machine Translations by Semi-Professionals: Lessons Learnt

Špela Vintar*, Andraž Repar†

* Department of Translation, Faculty of Arts, University of Ljubljana, Aškerčeva 2, SI-1000 Ljubljana, spela.vintar@ff.uni-lj.si
† Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, andraz.repar@ijs.si

Abstract
We report on two experiments in human evaluation of machine translations, one using Fluency/Adequacy scoring and the other using error annotation combined with post-editing.
In both cases the evaluators were students of translation at the Master's level, who received instructions on how to perform the evaluation but had previously had little or no experience with the evaluation of translation quality. The human evaluation was performed in the context of developing and testing different MT models within the Development of Slovene in a Digital Environment (DSDE) project. Our results show that Fluency/Adequacy scoring is more efficient and reliable than error annotation, and a comparison of both methods shows low correlation.

1. Introduction

The design and evolution of a new machine translation system is invariably linked with regular quality assessments, using both automatic methods commonly known as metrics and human evaluations of the MT system's outputs. The context of this experiment is the development of a neural MT system for the English-Slovene language pair within the DSDE project, which involved work packages dedicated to data collection, implementation and testing of different NMT architectures, and MT evaluation.

Throughout the project, different versions of the DSDE NMT system were regularly evaluated automatically using the BLEU metric, while later versions were also evaluated with a comprehensive set of scores available on the SloBench 1.0 evaluation platform. In parallel to the automatic evaluations we performed a set of human evaluations with several aims in mind: to validate the automatic scores with manual assessments, to gain insight into the performance of the system under development, but also to compare two human evaluation scenarios in terms of efficiency and reliability.

The manual evaluations of the DSDE MT engine were performed by students of MA Translation at the Department of Translation Studies, Faculty of Arts, University of Ljubljana. We refer to advanced students of translation as semi-professionals because of their high proficiency in both languages and their understanding of translation as a complex cognitive activity with many alternative solutions for each source text. On the other hand, their experience with translating is for the most part limited to the study environment, and they have received little or no formal training in post-editing or translation assessment.

Manual evaluation was performed using two common evaluation frameworks: the Adequacy/Fluency score and the MQM-DQF error annotation combined with post-editing.

The paper first presents the rationale for selecting the methodologies by referring to related work, then describes the MT system and its development within the DSDE project. We then present the evaluation setups and provide summaries of the results. In addition to the quantitative results, for the error annotation and post-editing task we also give a brief summary of the most frequent observations. We conclude by discussing the findings from the perspective of translation quality assessment in MT development.

2. Related work

Evaluation of MT is a crucial part of the development and improvement of MT systems, and it is traditionally divided into automatic evaluation using metrics such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and TER (Snover et al., 2006), and human or manual evaluation. Automatic evaluation is usually performed by comparing the candidate machine translated text to a reference translation produced by a human professional, whereby the comparison can be rather superficial and word-based, as with BLEU, or more linguistically informed, as with METEOR. The obvious advantage of automatic metrics is that they can be computed on the fly, requiring no human effort, but their rate of correlation with human judgements remains a constant concern. Particularly since the emergence of NMT, some authors have shown that the reliability of metrics as indicators of translation quality may be faltering (Shterionov et al., 2018), or that metrics alone cannot adequately reflect the variety of linguistic issues which may affect quality. Manual evaluation therefore remains an integral part of MT quality evaluation and is annually included in the WMT shared task (Bojar et al., 2016).

Over time, many methods of human MT evaluation have evolved. Adequacy/Fluency scoring was first adopted by the ARPA MT research program (White et al., 1994) as a standard methodology for scoring translation segments on a discrete 5- or 7-level scale. The adequacy evaluation is performed either by professional translators who are presented with the original and the machine translated segment and judge the degree to which the information from the original can be found in the MT output, or by monolingual speakers who are presented with the MT and a reference translation. For the fluency evaluation, no reference translation or original is provided, and the evaluators determine whether the translation "reads" like good language, sounds natural and adheres to the grammatical conventions of the language.
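Scores collected on such discrete scales are typically averaged over segments and annotators to yield document-level adequacy and fluency figures. A minimal sketch on a 0 to 3 scale (as in Table 1; all score values below are invented for illustration):

```python
# Average discrete 0-3 Adequacy/Fluency segment scores over segments
# and annotators to obtain document-level scores.
# All score values below are invented for illustration.
from statistics import mean

# annotator -> one (adequacy, fluency) pair per segment
doc_scores = {
    "annotator_1": [(3, 2), (2, 2), (3, 3), (1, 1)],
    "annotator_2": [(3, 3), (2, 1), (3, 3), (2, 1)],
}

all_pairs = [pair for pairs in doc_scores.values() for pair in pairs]
adequacy = mean(a for a, _ in all_pairs)
fluency = mean(f for _, f in all_pairs)
print(f"adequacy={adequacy:.2f} fluency={fluency:.2f}")
```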
Other manual evaluation methods include task-based evaluation (Doyon et al., 1999), post-editing with targeted human annotation, also known as HTER (Snover et al., 2006), and error analysis using various error typologies. The most comprehensive translation error typology to date is the Multidimensional Quality Metrics (MQM) developed in the QT-Launchpad1 project. The MQM guidelines provide a fine-grained hierarchy of quality issues and a mechanism for applying them to different evaluation scenarios; however, the entire tagset is too complex to be used in a concrete evaluation task. Originating from the needs of the language industry, the TAUS Dynamic Quality Framework (DQF) proposed an error typology which was harmonized with the MQM model in 2015 and is today integrated into most commercial translation tools (Marheinecke, 2016).

The annotation of translation errors can be part of Linguistic Quality Assurance (LQA) in professional translation environments, in order to monitor quality at the corporate, project or individual level. However, for the task of manual MT evaluation, MQM and related methods are notoriously poor in inter-annotator agreement scores (Lommel et al., 2014). Some authors believe that pre-annotation training can significantly reduce disagreements, but the task apparently remains highly subjective.

Despite the labour intensity and low inter-annotator agreement, error annotation is still frequently employed in human MT evaluation because of the significance and depth of insight into translation issues it may provide. As Klubička et al. (2017) point out, Slavic languages are rich in inflection, case and gender agreement, and they have rather free word order compared to English. The motivation for using error analysis in MT evaluation is to see, in the process of developing and improving a new MT system, whether the particular grammatical issues occurring with Slavic languages are adequately addressed, resulting in overall quality improvement.

In line with related work we opted for two of the most commonly used manual evaluation methods: the Fluency/Adequacy score and the TAUS DQF-MQM metric, which was further adapted for the DSDE project.

3. The DSDE MT system

The main goal of the machine translation work package in the DSDE project is to improve on the state-of-the-art model for the Slovene/English and English/Slovene language pairs developed within the TraMOOC project (Sennrich et al., 2017). To this end, various neural machine translation frameworks were evaluated, such as MarianNMT (Junczys-Dowmunt et al., 2018), fairseq (Ott et al., 2019) and NeMo (Kuchaiev et al., 2019). The same dataset, consisting of publicly available parallel data as well as data collected within the DSDE project2, was used to train the models on the selected frameworks.

4. Evaluation setup

Both types of manual evaluation were performed by students of MA Translation at the Department of Translation, University of Ljubljana. The translation environment of choice was memoQ, a tool which allows the project manager to select or define an LQA scheme with the fluency/adequacy scoring or the error categories respectively. The annotator performs the evaluation, error annotation and post-editing in a typical two-column setting, with the segmented original on the left-hand side and the machine translated segments already inserted into the target text on the right-hand side via pre-translation. Annotators receive an outbound memoQ package, which ensures that the source text, the raw MT and the evaluation/error annotation scheme are available and activated with no further setup, and the evaluated, post-edited and annotated texts can be returned to the project manager (in our case the experiment designer) as inbound return packages.

Five different source texts were used from the domains of chemistry, karst, economy, law and general news. The texts were of comparable length (~500 words) and consisted either of an entire text or a meaningful portion thereof. With the exception of the general news text dealing with the US elections, all domain-specific texts were highly specialized, with complex syntax and many terminological expressions.

For the fluency/adequacy scoring, both language pairs were evaluated by a group of five students over a period of several months. Each document was evaluated by two students. Once a new model was available, memoQ packages were sent to the students, who performed the evaluation in their home environment. Note that for the adequacy/fluency evaluation no post-editing took place; the students only had to score each translated segment on a scale of 0 to 3 (see Table 1).

Score  Adequacy  Fluency
0      None      Incomprehensible
1      Little    Disfluent
2      Much      Good
3      All       Flawless

Table 1: Adequacy/Fluency scoring.

For the error annotation, only the English-Slovene language pair was evaluated, with English as the original and Slovene as the target. Fifteen students participated, so that post-editing and error analysis were in the end performed by three students for each text. The experiment took place during a regular face-to-face seminar session in the presence of the lecturer. Students were using standard PCs with memoQ 9.5 running under Translator Pro licenses.

Students were requested to perform a full post-edit of the machine translated text and at the same time annotate each error using the preloaded TAUS/DSDE error typology. The latter proved somewhat wearisome, since the annotation of each single error involves opening a separate dialog box, selecting the category and resuming work, whereby the typical commands used during "normal" translation must be avoided (e.g. Ctrl + Enter to confirm the segment). This invariably slows down the post-editing process and presumably affects the natural cognitive flow during post-editing.

1 https://www.qt21.eu/launchpad/index.html
2 Data collected within the DSDE project will be made available under a CC BY-SA 4.0 license at the end of the project.

5. Results

5.1. Fluency/Adequacy scoring

In addition to the baseline model, five models were evaluated using the Adequacy/Fluency methodology: three versions trained using the MarianNMT framework, one using the fairseq framework and one using the NeMo framework. We also performed one round of evaluation of the eTranslation system developed by the European Commission.

Figure 1. Adequacy and Fluency scores across five domains and two language pairs.

The initial models (marian and fairseq) performed badly and did not exceed the scores of the baseline model in the DSDE project, but additional iterations performed better. The overall best performance was exhibited by the NeMo model, with best or close to best scores in all five domains. The latest version of the Marian model (marian-v5) also performed well in some domains (e.g. legal), less well in others. When comparing the DSDE models with eTranslation, we can observe that the NeMo model offers competitive performance across all five domains (with the possible exception of the news domain for the Slovene/English language pair).

5.2. Error annotation with post-editing

The error annotation with post-editing was performed in order to gain insight into the translation issues most affecting MT quality, but also to assess the efficiency and reliability of this methodology when used with semi-professional translators. The evaluation took place in November 2021 using the output of the marian-v5 model.

Category      Subcategory      Severity 1: Critical  Severity 2: Major  Severity 3: Minor
Accuracy      Category total                     56                 68                 37
              Addition                            1                  2                  3
              Mistranslation                     50                 63                 30
              Omission                            5                  3                  4
Language      Category total                      3                 26                 57
              Grammar                             3                 18                 37
              Spelling                            0                  8                 20
Style         Category total                     13                 18                 80
              Awkward                             6                 15                 55
              Inconsistent                        7                  3                 25
Terminology   Category total                      4                 16                 14
Total                                            76                128                188

Table 2: Total errors by category.

Figure 2: Errors by severity.

As shown in Table 2, the highest number of errors was marked in the Accuracy category, followed by Style, Language and Terminology. Given that four out of five texts were specialized, the low count of terminology errors is perhaps surprising, but can be attributed to the fact that annotators frequently choose the Accuracy->Mistranslation category for errors related to specialized lexis. Minor is the most frequently selected severity level, with a majority of stylistic errors. Accuracy is also the source of the most critical errors which, in the opinion of the annotators, completely change the meaning of the text.

5.2.1. Errors by text

On average, students annotated ~30 errors per text, or 1.2 errors per segment. The differences in the number of errors between texts are small, with a maximum of 102 errors for the legal text (the sum over all three annotators) and a minimum of 90 for the text on karst.

Category      Subcategory      Chemistry  Economy  Karst  Legal  News
Accuracy      Category total          40       39     58     19    49
              Addition                10        1      0      0     1
              Mistranslation          30       36     56     13    44
              Omission                 0        2      2      6     4
Language      Category total          30       14     16     26    18
              Grammar                 19       13     15     11    14
              Spelling                11        1      1     15     4
Style         Category total          19       36     13     45    25
              Awkward                 19       27     13     22    22
              Inconsistent             0        9      0     23     3
Terminology   Category total           6        5      3     12     9
Total                                 95       94     90    102   101

Table 3: Errors by text.

Chemistry: There is considerable variation in the number of errors marked by each annotator: 40 / 26 / 29. In all three annotations the most frequent error types are Accuracy and Language, followed by Style and Terminology. Only one annotator found 2 critical errors; the majority of errors were marked as minor.

Economy: The number of errors marked by each annotator varies: 29 / 30 / 35. Similar to the other texts, the highest number of errors was attributed to Accuracy->Mistranslation, followed by Style and Language, with only 5 terminology errors.

Karst: The three annotators diverged in the numbers of errors marked: 21 / 31 / 38. Contrary to the other texts, here the majority of errors were found to be major or even critical, with only 22 errors categorized as minor. Given that the text was highly specialized, it is again surprising that the Terminology category was not selected more often.
Legal: For the legal text, variation and non-agreement between annotators is at its highest: they marked 21 / 54 / 27 errors each, and even more interesting is the distribution of errors among severity levels. For the most prolific annotator, only 4 errors were found to be critical, but for the annotator who spotted 21 errors, 11 were categorized as critical. The third annotator, on the other hand, found no critical errors.

News: The numbers of errors marked by each annotator were 28 / 33 / 40 respectively, with 12 / 10 / 6 critical errors. Despite the fact that this text was the least specialized of the five, annotators marked 9 errors as terminological, and the overall majority of errors were those pertaining to accuracy (49).

5.2.2. Analysing students' edits

Some texts were highly specialized and rich in terminology; the students, however, often perceive such errors as minor and categorize terminology errors under Accuracy. In the karst text, for example, the original contains the term "precipitation", which is translated as "padavine". None of the annotators identified this as a critical error: in geology, precipitation is not a weather phenomenon but a type of sedimentation process, and the correct translation would read "precipitacija" or "usedanje". The word "test" in the original is most likely a typo and remains untranslated, while the translation of "algal crusts" into "drogovi" is another critical error.

In nature, many types of CaCO3 precipitation are linked to living organisms: test, shells, skeletons, stromatolites, algal crusts, etc.
V naravi so številne padavine CaCO3 povezane z živimi organizmi: test, oklepi, skeleti, stromatoliti, drogovi itd.

The students' edits are sometimes unnecessary or even wrong, as in the case of the correctly translated word "adduction" -> "addukcija", corrected into "adukcija" in one case and into "uporaba" in another.

Inconsistent translations are another common issue in machine translation. Thus, in the economy text, "expenditure" is translated as "stroški", "izdatki" and "poraba", and "plant" as "rastlina" and "naprava". A trained and alert post-editor would spot such inconsistencies and make sure they are consolidated in the final version; the students, however, focus on single segments and overlook such unwanted variation.

Easier to spot are untranslated words, such as "speleothem" in both the karst original and the Slovene MT. All three annotators spotted the error and opted for "kapnik" in their edits, but the correction is inadequate, because "kapnik" is a hyponym of "speleothem" and a better translation would be "speleotem" or "siga". Two annotators marked the error as Critical and one as Major.

It seems that students of translation are much more sensitive to grammatical errors than to terminological ones, as the example below, containing the correct phrase but in the wrong case, was marked as a Major error by all three annotators.

Zaradi velike moči odpornosti proti svetlobi in trajnosti derivatov benzimidazolov se pogosto uporabljajo za proizvodnjo akvarele in elektrofotografskih razvijalnih toner.
Zaradi velike moči odpornosti proti svetlobi in

In many cases the annotators agree on the error itself, or on the portion of text which should be corrected, but categorize the error differently. A major error was unanimously marked by all three annotators in the economy text, where the original "To repress these troubles" was machine translated as "Za ponoven tisk te težave". Corrections ranged from "spoprijemanje s težavo" and "zmanjšanje teh težav" to "blaženje teh težav", but the error was categorized as Accuracy->Addition, Accuracy->Mistranslation and Terminology respectively.

Disagreement in categories was frequent also in the non-specialized text, a news article reporting Trump's attempts to postpone elections. The MT version contains a fluent but inaccurate rendering of "November's presidential elections to be postponed", where the MT engine proposed "je predlagal predsedniške volitve v novembru". This is certainly a critical accuracy error, which should be categorized as an omission, since the postponement is missing in the target. Indeed all three annotators identify the error as critical, but one categorized it as a mistranslation and the other two as an omission. Another severe mistranslation occurs in segment 4, where the MT reverses the meaning of "There is little evidence..." to "Ni malo dokazov..."; again all three annotators agree on the severity level but not on the category.

5.3. Comparing both evaluation methods

While the Fluency/Adequacy evaluation method gives little insight into the specific issues that may have been improved or aggravated from one MT model to another, it seems relatively consistent in the scoring of different models across domains. If we compare the Fluency/Adequacy scores obtained for each text translated by the marian-v5 model with the results of the error annotation, the correlation is low. According to the former, the most adequate and fluent translation was that of the legal text, and the least that of the karst text. According to the number of annotated errors and edits, karst was the best and legal the worst. (The number of errors in Figure 3 is normalized to allow for better visual comparison.)

Figure 3: Comparing fluency, adequacy and number of errors per text.

6.
Conclusion trajnosti derivatov benzimidazolov se pogosto We presented the results of human evaluation of MT uporabljajo za proizvodnjo akvarelnih in using two well-known methodologies. The elektrofotografskih razvijalnih tonerjev. Fluency/Adequacy evaluation is relatively efficient and PRISPEVKI 224 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 fast, and the results are a useful indicator of the quality of quality assessment and postediting, both of which are tasks different MT models. In general, the scores show high frequently encountered in professional translation. correlation with automatic metrics3, with Nemo models achieving the highest automatic evaluation scores, 7. Acknowledgments followed by the Marian models and the baseline model, The project Development of Slovene in a Digital which is similar to what can be observed from the Environment (Slovene: Razvoj slovenščine v digitalnem Adequacy/Fluency data. To measure the reliability of the okolju, RSDO) is co-financed by the Republic of Slovenia Adequacy/Fluency ratings, we calculated the Cohen's and the European Union under the European Regional kappa coefficient4 for each document evaluated by a pair of Development Fund. The operation is carried out under the evaluators. As somewhat expected, the agreement is fairly Operational Programme for the Implementation of the EU low with most of the values falling between 0.20 and 0.50. Cohesion Policy 2014–2020. The fact that the evaluation was performed by students does The authors thank the students of MA Translation at the not seem to significantly affect the results. Faculty of Arts, University of Ljubljana, for their On the other hand, the evaluation through error participation in the task. annotation and post-editing requires a much higher level of effort, linguistic and extra-linguistic competence. Since 8. 
References each text was annotated by three students, a comparison of their decisions provides a valuable insight into the Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An difficulty and subjectivity of the task. Agreement is low for automatic metric for MT evaluation with improved all the parameters under observation: the number of errors correlation with human judgments. In: Proceedings of marked, their categorization and their severity levels. the ACL workshop on intrinsic and extrinsic evaluation Moreover, there is little correlation between the number of measures for machine translation and/or summarization, marked errors, their severity and the true quality of the pages 65–72. Association for Computational Linguistics. machine translation. For the text which was the most Ondrej Bojar, Christian Federmann, Barry Haddow, specialized (Karst), contained a high number of un- or Philipp Koehn, Matt Post, and Lucia Specia. 2016. Ten mistranslated terms and received the lowest years of WMT evaluation campaigns: Lessons learnt. Fluency/Adequacy score, the number of marked errors was In: Proceedings of the LREC 2016 Workshop the lowest of all. Student annotators with little or no expert Translation Evaluation–From Fragmented Tools and knowledge of the domain will therefore find it difficult to Data Sets to an Integrated Ecosystem, pages 27–34. correctly identify terminology errors, assess their severity Jennifer Doyon, Kathryn B. Taylor, and John S. White. or post-edit the text to a more accurate version. 1999. Task-based evaluation for machine translation. Conversely, possibly owing to the fact that students of In: Proceedings of Machine Translation Summit VII, translation are still in the process of acquiring their pages 574–578. language competence and are constantly reminded of the Filip Klubička, Antonio Toral, and Víctor M. Sánchez- grammatical aspect of the texts they produce, their Cartagena. 2017. 
Fine-grained human evaluation of sensitivity to fluency-related issues is high, hence linguistic neural versus phrase-based machine translation. The and stylistic errors are still often perceived as major. This Prague Bulletin of Mathematical Linguistics 108, no. 1 might explain why the two texts which were most (2017), pages 121–132. accessible and easy to understand received the highest Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii number of marked errors. Hrinchuk, Ryan Leary, Boris Ginsburg, and Jonathan M. In retrospect, the postediting and error annotation task Cohen. 2019. Nemo: a toolkit for building AI was too difficult for advanced students of translation and applications using neural modules. arXiv:1909.09577. failed to provide meaningful insights into MT quality, for Arle Lommel, Hans Uszkoreit, and Aljoscha Burchardt. several reasons: Firstly, the texts were too specialized and 2014. Multidimensional quality metrics (MQM): A difficult to understand for non-experts. While students were framework for declaring and describing translation free to use all available resources, some of the quality metrics. Revista Tradumàtica: tecnologies de la terminological expressions would require extensive traducció 12, pages 455–463. research to resolve and the students lacked the time, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz motivation or skill to perform such research. Secondly, to Dwojak, Hieu Hoang, Kenneth Heafield, Tom ensure higher agreement in the severity and category of Neckermann, Frank Seide, Ulrich Germann, Alham Fikri errors, students should have received training, a test run and Aji, Nikolay Bogoychev, André F. T. Martins, and much more comprehensive annotation guidelines with Alexandra Birch. 2018. Marian: Fast Neural Machine English-Slovene examples. Finally, the annotation Translation in C++. 
In: Proceedings of ACL 2018, environment in MemoQ with the rather fine-grained System Demonstrations, pages 116–121, Melbourne, MQM/DSDE error typology is cumbersome and Australia. Association for Computational Linguistics. unintuitive, which probably affected the results. Katrin Marheinecke. 2016. Can Quality Metrics Become We nevertheless believe that the experiments were the Drivers of Machine Translation Uptake? An Industry valuable both for researchers and annotators. As Perspective. Translation Evaluation: From Fragmented researchers in MT development and evaluation we have Tools and Data Sets to an Integrated Ecosystem, pages gained experience which will allow us to better design 71–76. evaluation runs, select texts and train annotators, and the Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, student annotators have been subjected to translation Sam Gross, Nathan Ng, David Grangier, and Michael 3 Automatic metric scores can be found at https://slobench.cjvt.si/ 4 Using the cohen_kappa_score function from the sklearn Python library. PRISPEVKI 225 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv:1904.01038. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318. Rico Sennrich, Antonio Valerio Miceli Barone, Joss Moorkens, Sheila Castilho, Andy Way, Federico Gaspari, Valia Kordoni, Markus Egg, Maja Popovic, Yota Georgakopoulou, Maria Gialama, Menno van Zaanen. 2017. TraMOOC—translation for massive open online courses: recent developments in machine translation. In: 20th Annual Conference of the European Association for Machine Translation, EAMT. 
Dimitar Shterionov, Riccardo Superbo, Pat Nagle, Laura Casanellas, Tony O’dowd, and Andy Way. 2018. Human versus automatic quality evaluation of NMT and PBSMT. Machine Translation 32, no. 3, pages 217–235. Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231. John S. White, Theresa A. O’Connell, and Francis E. O’Mara. 1994. The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proceedings of the First Conference of the Association for Machine Translation in the Americas. PRISPEVKI 226 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Automatic Predicate Sense Disambiguation Using Syntactic and Semantic Features Branko Žitko,∗ Lucija Bročić,∗ Angelina Gašpar,† Ani Grubišić,∗ Daniel Vasić,‡ Ines Šarić-Grgić∗ ∗Faculty of Science University of Split Ru ¯ dera Boškovića 33, 21000 Split, Croatia branko.zitko@pmfst.hr, lucija.brocic@pmfst.hr, ani.grubisic@pmfst.hr, ines.saric@pmfst.hr †Catholic Faculty of Theology University of Split Ulica Zrinsko Frankopanska 19, 21000 Split, Croatia angelina.gaspar@kbf-st.hr ‡Faculty of Science and Education University of Mostar Poljička cesta 35, Mostar, Bosnia and Herzegovina daniel.vasic@fpmoz.sum.ba Abstract This paper focuses on Predicate Sense Disambiguation (PSD) based on PropBank guidelines. Different approaches to this task have been proposed, from purely supervised or knowledge-based, to recently hybrid approaches that have shown promising results. We introduce one of the hybrid approaches - a PSD pipeline based on both supervised models and handcrafted rules. 
To train three supervised POS, DEP and POS DEP models we used syntactic features (lemma, part-of-speech tag, dependency parse) and semantic features (semantic role labels). These features enable per-token classification, which to be applied to unseen words, requires handcrafted rules to make predictions specifically for nouns in light verb constructions, unseen verbs and unseen phrasal verbs. Experiments were done on newly-developed dataset and the results show a token-level accuracy of 96% for the proposed PSD pipeline. 1. Introduction Depending on the sense of a word ’walk’, the sense of the whole predicate changes. One of the main tasks of Natural Language Processing Another important role of PSD is the one it plays in (NLP) is precisely understanding the meaning of the word Semantic Role Labelling (SRL). The process of semantic and its specific usage in a sentence, task known as Word role labelling typically consists of predicate identification Sense Disambiguation (WSD). In this paper, we focus on and its sense disambiguation, followed by identification of predicate sense disambiguation, i.e. the correct meaning semantic roles and finally their labelling. The state-of-the- of a predicate in a given sentence. A predicate combines art BERT models like AllenNLP’s models (Gardner et al., with a subject to form a sentence, expressing some situ- 2018) or InVeRo (Conia et al., 2020) perform all mentioned ation, event or state. Predicates are often single or com- subtasks except for predicate sense disambiguation which pound verbs, consisting of various part-of-speech (prepo- is missing. Ideally, the tool would use predicate senses to sitions, adverbs, nouns, auxiliaries, etc.). Hence, the pre- label semantic roles. 
However, we lack the tool for PSD, so cise understanding of the meaning of a sentence lies in the we use the opposite technique – attempting to predict role- correct disambiguation of different types of words, not just set IDs from already annotated semantic role labels. An- verbs. For example, the term light verb (LV) refers to a other shortcoming of mentioned state-of-the-art models is verb that gets its main semantic content from the noun that that they only label verbs as predicates, and as we have follows rather than the verb itself. Thus, the construction seen, it is necessary to label words of different part-of- consisting of such a verb and noun is called Light Verb speech in addition to verbs. Regarding the sentence "I take Construction (LVC). In the sentence “I take a walk in the a walk in the park.", state-of-the-art models identify word park.”, ‘take a walk’ is the LVC in which the noun ’walk’ ’take’ as a predicate, whereas they ignore the word ’walk’. describes an action. It is non-compositional and its lexical- The need for such a PSD tool arises during the question syntactic structure is not flexible. This example illustrates generation task in intelligent tutoring system (Grubišić et that word sense disambiguation can make Predicate Sense al., 2020) our research team is working on. Disambiguation (PSD) more accurate, since splitting up the LVC and disambiguating the senses of its components in- In this work, we describe our PSD pipeline, depicted dividually neglects the semantic unity of the construction in Figure 1, as well as the process it takes to create it. and fails to represent its single meaning. Namely, ’walk’ The approach we take is the combination of the super- can have a meaning of moving forward, one foot in front vised PSD trained with the Stochastic Gradient Descent of the other, but it can also be a term specific for baseball. 
method (Kiefer and Wolfowitz, 1952) and the knowledge PRISPEVKI 227 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 verb approach (Chen and Eugenio, 2010). We implement the latter technique, where we train each classifier to dis- ambiguate senses of only one word. Purely data-driven WSD is a straightforward approach when dealing with the comprehensive data. However, we find Supervised WSD approach that exploits relations between tokens more ap- pealing. In that approach, some examples of improving the sense prediction might be by using contextual embeddings learned from Neural Language Model (Loureiro and Jorge, 2019), or by utilizing WordNet relations to create try-again mechanism to predict sense for ambiguous words (Wang and Wang, 2020). Figure 1: Our PSD Pipeline. On the other hand, knowledge-based WSD often im- plements various graph algorithms to extract from to- used to handcraft rules to compensate for the shortcomings kens and sentences their syntactic, semantic or any other of the data. We train supervised classifiers for each word features. These features are essential for modelling the to disambiguate senses based on extracted syntactic and Lexical Knowledge Base that algorithms use to predict semantic features, which play a significant role in many senses. Although there are some high-scoring methods NLP tasks (e.g. text summarization, question generation, (Wang and Wang, 2020; Scozzafava et al., 2020) based etc.). As for the syntactic features we use spaCy (Honnibal on this approach, knowledge-based WSD systems still per- et al., 2020) annotated fine-grained POS (part-of-speech) form worse than supervised ones. However, lately there tags and dependency tags. 
We employ the AllenNLP’s have been a few promising hybrid approaches that com- BERT-based model (Gardner et al., 2018) to retrieve shal- bine supervised and knowledge-based ones, as mentioned low semantics, represented by SRL labels. Thus, the pro- in the survey (Bevilacqua et al., 2021). Moreover, their posed PSD pipeline consists of Machine Learned Classifi- high scores indicate that the hybrid approaches are cur- cation (MLC) pipeline, based on Machine Learned Model rently the best solution to WSD (Barba et al., 2021). Be- (MLM), and Rule-Based Classification (RBC) pipeline, sides the research done on WSD, there has also been some based on Rule-Based Model (RBM) including handcrafted work concentrated specifically on Verb Sense Disambigua- rules for LVC, unseen verbs (verbs that don’t occur in the tion (VSD). As verbal multiword expressions are semanti- OntoNotes dataset used for training the MLMs) and un- cally complex lexical items, there have been experiments seen phrasal verbs (phrasal verbs that don’t occur in the to inspect the effect of the selection of semantic features in OntoNotes dataset used for training the MLMs). We pro- VSD. Research works like ours (Dang and Palmer, 2005; vide source code1 with the spaCy integration of the pro- Dligach and Palmer, 2008; Ye and Baldwin, 2006) used posed PSD pipeline. SRL annotation, which is a distinctive characteristic of a Section 2 provides related works, which suggest that predicate, to get better sense prediction. the WSD, which entails PSD, is a current problem encoun- tered in various popular NLP tasks. Section 3 describes the 3. Data Manipulation and Analysis dataset used for training PSD models and the modifications To build a good PSD model combining a supervised done to it. 
Section 4 describes the proposed PSD pipeline, PSD approach and handcrafted rules, we need good data providing detailed information on the training and evalua- for the former and clear guidelines for sense disambigua- tion of the models. Section 5 provides the conclusion of tion for the latter. this paper and discussion about the given work. 3.1. OntoNotes Data 2. Related Work We use an English corpus from the OntoNotes project Word Sense Disambiguation and Predicate Sense Dis- as the train and test data for the supervised component of ambiguation are appealing NLP tasks for researchers in the the model. The English dataset of the OntoNotes Release field. Thus, they are the subject of many research activi- 5.0 (Weischedel et al., 2013) consists of 13109 annotated ties, summarized in the up-to-date survey of recent trends documents organized as .onf files, arranged into seven di- in WSD (Bevilacqua et al., 2021). Among the various ap- rectories that correspond to files’ sources. It is impor- proaches to WSD, most popular are knowledge-based ap- tant to train the model on the content of assorted genres proaches, which often implement graph algorithms, and su- and types, therefore, OntoNotes was picked as it has the pervised approaches, which lately utilize neural networks. following seven categories: Broadcast Conversation (tran- Supervised WSD formulates the given task as classifi- scripts of talk shows from channels such as BBC, CNN and cation task. Hence, it requires precisely labelled training MSBNC), Broadcast News (news data collected from var- data to learn the relationship between word annotations and ious news sources, such as ABC, NBC, CNN and Voice senses. 
In contrast to a single classifier approach (Kawa- of America), Magazine (Sinorama Magazine), Newswire hara and Palmer, 2014), where one classifier is trained to (data from sources such as Wall Street Journal newswire), make predictions for every word sense, there is also a per- Pivotal Corpus (biblical texts from the Old Testament and the New Testament), Telephone Conversation (conversa- 1https://github.com/lucijabrocic/PSD-pipeline tional speech texts) and Web data (English web texts and PRISPEVKI 228 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 web text translated from Arabic and Chinese to English). The syntactic annotation of the sentences in the corpus followed the Penn TreeBank scheme and the predicate- argument structure followed the Proposition Bank (Prop- Bank) annotation (Palmer et al., 2005). The OntoNotes English corpus consists of 143709 annotated sentences, most of which but not all have comprehensive annotation. Namely, some web texts selected to improve sense cover- age were just tokenized and not even treebanked. There- fore, the corpus needed some refinement before further usage. The scripts (Bonial et al., 2014) provided by the Proposition Bank project enabled the conversion of origi- nal PropBank annotations (found in the OntoNotes project) to the new unified PropBank annotations. The files thus ob- tained were further modified by custom user-defined meth- ods written for this work. Those methods mostly changed the aesthetics of the files, such as converting SRL anno- tation to utilize BIO notation and converting tree parses into dependency parse annotation. Finally, after the refine- ment and modifications, our corpus contains 7212 text files (137811 sentences), which follow the original OntoNotes directory structure based on files’ sources. 3.2. 
The English PropBank As already mentioned, the used data follows the En- glish PropBank (Palmer et al., 2005) sense disambiguation guidelines. This research aims to predict the sense ID, also known as a frameset or roleset ID, for each word of any complex predicate structure in a sentence. The English PropBank consists of 7311 .xml files called frame files, specifying the predicate-argument structure. Figure 2: The syntactically and semantically annotated sen- One frame file, or frameset, consists of one predicate tence "I take a walk in the park." enters MLC pipeline, lemma or multiple different ones, and contains the infor- which annotates the predicate sense for verb "take" as mation about roleset IDs that disambiguate various mean- take.01. The annotated sentence then proceeds to the RBC ings of a predicate. Since diverse forms of a predicate can pipeline, which annotates the predicate sense for noun be under the same roleset ID, PropBank aliases can help to "walk" as walk.01. distinguish the correct sense from the wrong one. As our work required the English PropBank annotation informa- Figure 2 illustrates the PSD pipeline with an example tion, we organized all the information for 10687 rolesets input sentence, annotated with syntactic and semantic fea- (and 7311 framesets) into easily loadable .json file. tures. First the MLC pipeline extracts these features from No matter how large, representative, and carefully de- the sentence and feeds them to the trained classifiers used signed, no corpus can exhibit the same characteristics as a to obtain predicate senses. Then, RBC pipeline takes the natural language. Having this in mind, we check the cover- syntactically and semantically annotated sentence with pre- age of rolesets and framesets in the OntoNotes corpus. The dicted predicate senses. 
RBC pipeline applies handcrafted analysis shows that the modified files miss 4922 rolesets rules to the sentence to improve the prediction of predicates and 3104 framesets, i.e. they cover 53.94% of rolesets and in light verb constructions, unseen verbs and unseen phrasal 57.54% of framesets that occur in the English PropBank. verbs. As a result of the proposed pipeline processing, each Even though the frequency of using missing framesets and token in the sentence has a roleset attribute that stores the rolesets might be low, the objective is to include as many result. framesets and rolesets as possible to increase the overall coverage. To achieve this objective, we add the handcrafted 4.1. Training the Models rules, explained more thoroughly in subsection 4.3. We have 7212 OntoNotes files available to make the best use of while training our models. We first apply a typi- 4. The Proposed PSD Pipeline cal supervised learning approach - splitting the dataset into This section describes the training process of three PSD the train and test sets and then performing the training and models (POS, DEP and POS DEP) and their evaluation. We evaluation. The train-test split given in the PropBank (Bo- train each model by employing two approaches. In the first nial et al., 2014) resulted in 80% of the files (and sentences) approach, we split the dataset into train and test sets, while in train set and 20% in the test set. in the second one, we use entire dataset for training. Table 1 shows that many framesets and roleset IDs oc- PRISPEVKI 229 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 No. of No. of No. of No. of nally tokens "in", "the" and "park" that are ARGM-LOC. 
files sentences framesets roleset IDs Therefore, the featureset for token "take" is text, take, Train set 5832 111104 3996 5455 lemma, take, ARGM-PRR, 〈NN, dobj〉, and for to- Test set 1380 26707 2692 3609 ken "walk" text, walk, lemma, walk, ARG0, 〈PRP, Corpus 7212 137811 4208 5766 nsubj〉, ARGM-LVB, 〈VBP, ROOT〉, ARGM-LOC, 〈IN, prep〉, 〈DT, det〉, 〈NN, pobj〉. Table 1: Corpus composition. Then we vectorize extracted features and feed them into the classifiers. Dealing with PSD, we face a multiclass clas- sification problem with more than 10000 classes. Instead of cured in both train and test set. Out of 2692 framesets iden- a single classifier, a common solution to a problem like this tified in the test set, 212 framesets did not appear in the is training multiple binary classifiers, one for each class of train set. Likewise, out of 3609 roleset IDs detected in the the original problem. In the NLP-like domains, however, test set, 311 of them failed to appear in the train set. it is more suitable to use multiple classifiers which pre- dict a constricted number of classes (Even-Zohar and Roth, 2001). Therefore, in this research, multiple multiclass clas- sifiers perform the classification task, with one classifier for each frame file. Hence, the number of classifiers auguments to 7311, and each has to learn the nuances between roleset IDs within the same frame file. The model itself is essen- tially a collection of such classifiers. Regarding the choice of classifier, we want to build a simple and fast model for this PSD task. Since the con- Figure 3: The models’ training pipeline. text we need is already assigned to a token through context- aware models (spaCy, AllenNLP), with some feature engi- Figure 3 illustrates the training process. First, the syn- neering we can utilize generated annotations (lemma, POS, tactically and semantically annotated sentence is loaded dependency, SRL) as features for our model. Hence, we and forwarded to feature extraction. 
did not take a neural approach, but we decided on a linear classifier where learning is based on multinomial logistic regression with SGD optimization.

During the feature engineering and extraction phase, the most relevant token-level annotations for developing the models are selected. Those annotations are the token text, its modified lemma that matched the English PropBank frameset, the part-of-speech (POS) tag, the dependency parse and the semantic role labels (SRL). Research (Dang and Palmer, 2005; Dligach and Palmer, 2008) shows that predicate sense disambiguation can improve semantic role labelling. Ideally, word sense disambiguation would solve the problem of identifying the correct sense of a polysemous word based on context. However, the lack of a comprehensive repository of senses and of a tool for PSD prompted us to use the opposite technique: attempting to predict roleset IDs from already annotated semantic role labels. As for the POS and dependency annotation, previous studies show that the performance of the SRL task heavily depends on the performance of the dependency parsing (Mohammadshahi and Henderson, 2021) and POS tagging (Wilks and Stevenson, 1997) subtasks.

We train three models and name them according to the features they use: POS, DEP and POS DEP. All three models utilize the token text and lemma, but differ in the other annotation(s) used: (i) the POS model utilizes the relation between SRL and the fine-grained POS tag, (ii) the DEP model utilizes the relation between SRL and the dependency tag, and (iii) the POS DEP model utilizes the relation between SRL, the fine-grained POS tag and the dependency tag. In this research, we train and evaluate the three models in parallel.

To be more specific, Figure 2 presents the featuresets of the tokens "take" and "walk" used when employing the POS DEP model. The token "take" has only one SRL argument: the token "walk", which is ARGM-PRR. On the other hand, the token "walk" has three SRL arguments: the token "I", which is ARG0, the token "take", which is ARGM-LVB, and fi-

PRISPEVKI 230 PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022

4.2. The evaluation of models' accuracy and performance

We evaluate our models on the OntoNotes test set containing 26,707 sentences. Those sentences contain in total 504,891 tokens, of which 75,621 (or 14.98%) are predicate tokens and 429,270 (or 85.02%) are non-predicate tokens. The average sentence contains 18.90 tokens, of which 2.83 are predicate tokens and 16.07 are non-predicate tokens. We measure the accuracy of the three PSD models on this OntoNotes test set with three different metrics:

• the token-level accuracy (TLA) metric measures the number of (predicate and non-predicate) tokens the model predicted correctly (correct roleset ID or no prediction, depending on whether the token is a part of a predicate or not);

• the sentence-level accuracy (SLA) metric measures the number of sentences the model predicted completely correctly (all the tokens);

• the predicate-level accuracy (PLA) metric measures the number of predicate tokens the model predicted correctly.

Besides accuracy, we also use the predicate prediction coverage (PPC) metric, which represents the ratio of predicted predicate tokens to total predicate tokens (whether they are predicted correctly or not). When evaluating AllenNLP's BERT model on the OntoNotes test set, we can obtain a measure similar to PPC. Looking at the ratio between the predicate tokens in the OntoNotes test set for which AllenNLP annotates the SRL arguments and all predicate tokens in the OntoNotes test set, we get a result of 88.02%. It is important to note that the remaining 11.98% are nouns for which AllenNLP's BERT model cannot annotate SRL labels. This coverage metric for AllenNLP puts into perspective the PPC measure of our models, given in Table 2.

           TLA (%)  SLA (%)  PLA (%)  PPC (%)
POS         98.50    76.91    90.01    97.49
DEP         98.71    79.74    91.37    97.82
POS DEP     98.73    80.04    91.54    97.97

Table 2: Evaluation of the models.

Table 2 shows that the evaluation results on accuracy are similar for the three models, even though the POS DEP model is the most accurate and obtained the highest PPC score. As explained in Subsection 4.1., the models encounter some framesets and roleset IDs only in the test set. After the initial training and evaluation phase, we therefore further train the models on all 7212 modified OntoNotes files, assuming their performance would improve. To distinguish which results correspond to which model, we use two terms: the OntoNotes-split model, which denotes a model trained on the OntoNotes train set and evaluated on the OntoNotes test set, and the OntoNotes-whole model, which denotes a model trained on all of the 7212 OntoNotes files. The results given so far are for OntoNotes-split models.

4.3. PSD Pipeline

Even when trained on all available data, our PSD models cover only 53.94% of rolesets and 57.54% of framesets in the English PropBank. Therefore, we handcraft rules to improve the predictive abilities of the models.

Figure 4 presents the Machine Learned Classification (MLC) component of the PSD pipeline, which uses the ML model to make a predicate sense prediction.

Figure 4: The MLC component of the PSD Pipeline.

The semantic features obtained from AllenNLP's BERT model are added to spaCy objects (Token, Span, Doc) via the custom SRL pipe. One thing to note is that we slightly modify both the spaCy pipeline and AllenNLP's BERT model. We improve spaCy's lemmatizer to better lemmatize gerunds and contracted verbs. The modifications made to AllenNLP's BERT model allow the presence of nouns in a predicate and the adjustment of SRL labels for LVCs to the English PropBank guidelines.

Next, syntactic and semantic features are extracted in the same way as described in the training phase (Subsection 4.1.). The prediction can be done using one of the three previously mentioned OntoNotes-whole models (POS, DEP, POS DEP), and each model is essentially a collection of classifiers, each of which corresponds to a Penn PropBank frameset. The output of the MLC component is a sentence where the predicate tokens are annotated with the sense predicted via the classifiers.

Figure 5 illustrates further processing of annotated sentences in the Rule-Based Classification (RBC) component based on the Rule-Based Model, including handcrafted rules for LVCs, unseen verbs and unseen phrasal verbs to improve prediction.

Figure 5: The RBC component of the PSD Pipeline.

Essentially, a sentence with classifier-predicted PSD annotation is forwarded to the RBC pipeline component, which first handles the sense disambiguation of nouns in LVCs. The RBC component uses the modified SRL labels to find both parts of an LVC and searches the PropBank aliases to find the corresponding one. The pipeline component explores aliases labeled as nouns only if there are no aliases tagged as light verbs. This way the PropBank aliases help in finding the correct sense IDs.

The next step includes the sense disambiguation of unseen verbs. The RBC component searches for PropBank aliases tagged as verbs, attempting to find the potential sense (roleset ID) of verbs that do not occur in the training set.

In the last step, the pipeline component performs the sense disambiguation of two-word phrasal verbs. Phrasal verbs are easy to predict correctly using the rules. The RBC pipeline first checks if a verb has a dependent particle (e.g. a preposition or an adverb) and searches the PropBank aliases tagged as verbs to find a corresponding sense (roleset ID).

The RBC pipeline makes a prediction in each step only if the observed token (i) has SRL labels (the AllenNLP model identified the token as a predicate), (ii) is not a modal verb (there is no sense disambiguation of modals) and (iii) has no prediction yet (the goal is to supplement the classifiers, not to overwrite their predictions).
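The gating conditions and the LVC alias search can be sketched as follows. This is a minimal illustration over plain dictionaries: the field names (`lemma`, `srl`, `roleset`) and the alias inventory are hypothetical stand-ins for the project's spaCy tokens and PropBank frame files, not the authors' implementation.

```python
# Hypothetical sketch of the RBC gating and the LVC alias search.
# Plain dicts stand in for spaCy tokens and PropBank frame files;
# the field names (lemma, srl, roleset) are assumptions, not the
# project's actual API.

MODALS = {"can", "could", "may", "might", "must", "shall", "should",
          "will", "would"}

def rbc_applies(token):
    """RBC acts only on tokens that (i) carry SRL labels, (ii) are not
    modal verbs and (iii) have no classifier prediction yet."""
    return (bool(token.get("srl"))
            and token["lemma"] not in MODALS
            and token.get("roleset") is None)

def lvc_noun_sense(token, aliases):
    """Step 1: disambiguate a noun inside a light verb construction.
    Noun aliases are explored only when no light-verb aliases exist."""
    candidates = aliases.get(token["lemma"], [])
    hits = ([a for a in candidates if a["pos"] == "light-verb"]
            or [a for a in candidates if a["pos"] == "noun"])
    if not hits:
        # no matching roleset: lemma + ".00", as in picnic -> picnic.00
        return token["lemma"] + ".00"
    ids = sorted(h["roleset"] for h in hits)
    # one match: take it; several: lowest roleset ID plus the "X" flag
    return ids[0] if len(ids) == 1 else ids[0] + "X"
```

For a noun such as "plea" with two candidate rolesets, the sketch returns the lowest roleset ID with the "X" flag, mirroring the resolution rule the paper describes for multiple alias matches.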
In the training phase, we use the OntoNotes annotation of sentences for feature extraction. However, when using the PSD pipeline "in the wild" on arbitrary sentences, spaCy's English RoBERTa-based transformer processing pipeline uses the raw input to retrieve the syntactic features, and AllenNLP's BERT model is used to obtain the semantic features.

Moreover, we introduce new annotations in the three steps of the RBC pipeline. For a better understanding, the examples in Table 5 illustrate predictions of the PSD pipeline components using the POS DEP model and their possible outcomes. The search for PropBank aliases can result in no roleset ID match, exactly one roleset ID match, or multiple roleset ID matches. Table 5 shows how each pipeline component resolves the roleset ID depending on the number of roleset matches found.

When there is no corresponding roleset ID for the token, the actions of the RBC pipeline differ based on the predicate construction. If the token is a part of an LVC (e.g. picnic - None), the RBC pipeline predicts the sense as the lemma of the token followed by ".00" (picnic - picnic.00). If the token is an unseen verb (e.g. overwrite - None) or a part of an unseen phrasal verb (e.g. clue - None), the sense remains unchanged (None).

If there is only one roleset ID match, the components of the RBC pipeline choose that roleset ID.

If there are multiple roleset ID matches, the components of the RBC pipeline choose the roleset ID with the lowest number, followed by the flag "X". This annotation indicates that a unique prediction is still not achievable.

Finally, our PSD pipeline incorporates the final sense prediction into spaCy's processing pipeline, into a custom roleset attribute.

5. Experimental Results and Discussion

This section provides the results obtained on the gold standard dataset, together with a discussion and suggestions for further work.

5.1. The Evaluation of the Model Performance on the Gold Standard Dataset

As all three OntoNotes-split models perform similarly well, we further assess the accuracy of the OntoNotes-whole POS DEP model on a fresh set of sentences that represents our gold standard. The new dataset consists of 664 manually annotated sentences with syntactic (lemmatization, part-of-speech and dependency tags) and semantic (SRL) labels, as well as the predicate sense IDs which our model predicts. Table 3 gives statistics for the dataset in terms of tokens, words and predicates. Tokens include both words and non-word parts of a sentence, e.g. punctuation. Expressed as a percentage, 18.46% of the tokens in the gold dataset are predicates.

                      per sentence
            total   mean    std    min  max
token        6853  10.320  4.770    2   65
word         5971   8.992  3.430    1   48
predicate    1265   1.905  0.890    0   12

Table 3: Gold dataset statistics.

The first step of the evaluation process includes predicate sense prediction using the input sentence and the needed annotations obtained through the system pipeline (the spaCy transformer model and AllenNLP's BERT model). In the second step, as some system annotations are erroneous (namely, wrong lemmatization and SRL labels), we use the gold standard annotations to check if there is any difference in prediction.

                TLA (%)        SLA (%)        PLA (%)        PPC (%)
                X      no X    X      no X    X      no X    X & no X
pipeline        96.19  92.20   69.28  69.43   87.11  87.17   98.05
gold standard   97.63  97.67   78.01  78.46   89.75  89.94   100.00

Table 4: Evaluation of the POS DEP model on the gold standard dataset.

The evaluation results in Table 4 show that the OntoNotes-whole POS DEP model predicts better if fed with human-made annotations rather than with system-generated annotations. The most significant difference is in the sentence-level accuracy, resulting from the higher token-level and predicate-level accuracies.

To put the PPC measure given in Table 4 in perspective, we evaluate AllenNLP's BERT model on the gold standard dataset and obtain a measure similar to PPC. Looking at the ratio between the predicate tokens in the dataset for which AllenNLP annotates the SRL arguments and all predicate tokens in the dataset, we get a result of 97.61%. When using system-generated annotations, our OntoNotes-whole POS DEP model relies on AllenNLP for discovering the predicates it needs to predict senses for. A deeper analysis shows that certain errors in spaCy's system-generated annotations (namely the lemmas) lower the original AllenNLP coverage of 97.61%. However, the modifications made to AllenNLP's BERT model that allow the presence of nouns in a predicate have increased our predicate coverage to 98.05%, in the end improving on the original AllenNLP coverage of 97.61%.

The POS DEP model returns rolesets with the "X" flag when it cannot decide between multiple different senses. To fully evaluate the model's performance, we calculated the four metrics on the predictions with the "X" flag removed (no X). The slight increase in scores indicates that the roleset with the lowest ID number was often the right one.

5.2. Discussion and Further Work

We have shown our approach to predicate sense disambiguation utilizing POS, dependency and SRL annotations, and along the way presented a contrastive analysis of the coverage of the predicate senses in the OntoNotes corpus and the English PropBank. The integration of the PSD pipeline into spaCy makes its usage straightforward: custom SRL and roleset components are simply added to the spaCy processing pipeline.

Another feature of the proposed PSD pipeline is its Machine Learned Models (MLMs). Each model consists of per-token classifiers, which implies some effort required to combine their outputs. However, predicate sense prediction is fast, since the pipeline only employs the classifiers corresponding to the framesets found in the sentence.
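The per-frameset organisation of these models can be sketched as follows. The real classifiers are multinomial logistic regression models trained with SGD, so the trivial memorising classifier below is only a dependency-free stand-in, and all names are hypothetical.

```python
# Sketch of the per-frameset classifier collection described above.
# The real models use multinomial logistic regression with SGD; a
# trivial memorising classifier stands in so the sketch is stdlib-only.
from collections import Counter, defaultdict

class TinyClassifier:
    """Maps a feature tuple to the most frequent roleset seen in training."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, samples):  # samples: [(features, roleset_id), ...]
        for feats, roleset in samples:
            self.counts[feats][roleset] += 1

    def predict(self, feats):
        seen = self.counts.get(feats)
        return seen.most_common(1)[0][0] if seen else None

class PSDModel:
    """One small classifier per PropBank frameset. At prediction time,
    only the classifiers for framesets present in the sentence are run;
    a guideline change in one frame file means retraining just one
    classifier, since retraining simply replaces that dict entry."""
    def __init__(self):
        self.classifiers = {}

    def train_frameset(self, frameset, samples):
        clf = TinyClassifier()
        clf.fit(samples)
        self.classifiers[frameset] = clf  # replaces only this entry

    def predict(self, frameset, feats):
        clf = self.classifiers.get(frameset)
        return clf.predict(feats) if clf else None
```

The design choice this illustrates is the one argued above: combining many small per-frameset classifiers costs some bookkeeping, but it localises both prediction work and retraining.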
Moreover, changing a single classifier is simple: if there is a change in the annotation guidelines within one frame file, only one smaller classifier requires retraining. We have also presented the different accuracy and prediction metrics used in the evaluation of the models' performance.

The scores in Table 4 suggest our PSD pipeline obtains satisfactory results; however, there is still room for improvement. More specifically, in our further work we plan to enhance the Rule-Based Classification (RBC) component, particularly the sense disambiguation of unseen words with multiple rolesets based on their part-of-speech tags. The PSD pipeline currently only chooses the roleset with the lowest roleset ID and adds the flag "X". We assume we can achieve better results if we create a more complex rule, such as one that utilizes the PropBank guidelines on roleset sense IDs and their corresponding arguments in the predicate-argument structure. Since there is a large number of missing rolesets and framesets (46.06% and 42.46%, respectively), that will be no easy task, and a more in-depth analysis is necessary to figure out what mistakes the model makes and how to fix them.

We build our Rule-Based Models (RBMs) on three categories of words: nouns in Light Verb Constructions (LVCs), unseen verbs and unseen phrasal verbs. Perhaps the categories could be further disambiguated and thus enable a better RBM. Another change that might be beneficial for improving the results is the selection of more features during the feature extraction phase. For a certain predicate, we use only the POS and dependency tags of its arguments, but the accuracy might improve if we considered the text of the argument token as well.

Finally, the downstream task this PSD pipeline is created for is the question generation task in our intelligent tutoring system. Disambiguating predicate senses and capturing information about their arguments and characteristics will be useful when deciding on the appropriate wh-word in a question.

Roleset ID prediction doesn't exist:
  LVC:                  Sentence: Let's have a picnic in the park.
                        MLC prediction: have – have.01; picnic – None
                        MLC + RBC prediction: have – have.01; picnic – picnic.00
  Unseen verbs:         Sentence: It will overwrite the files on your hard drive.
                        MLC prediction: overwrite – None
                        MLC + RBC prediction: overwrite – None
  Unseen phrasal verbs: Sentence: She'll clue you in on the latest news.
                        MLC prediction: clue – None
                        MLC + RBC prediction: clue – None

Unique roleset IDs exist:
  LVC:                  Sentence: He is having an affair.
                        MLC prediction: is – be.03; having – have.01; affair – None
                        MLC + RBC prediction: is – be.03; having – have.01; affair – affair.01
  Unseen verbs:         Sentence: Some people annotate as they read.
                        MLC prediction: annotate – None; read – read.01
                        MLC + RBC prediction: annotate – annotate.01; read – read.01
  Unseen phrasal verbs: Sentence: The cat scrunched up to sleep.
                        MLC prediction: scrunched – None
                        MLC + RBC prediction: scrunched – scrunch_up.01

Multiple roleset IDs exist:
  LVC:                  Sentence: We are making a plea to all companies.
                        MLC prediction: are – be.03; making – make.01; plea – None
                        MLC + RBC prediction: are – be.03; making – make.01; plea – plead.01X
  Unseen verbs:         Sentence: John frowned when he heard the news.
                        MLC prediction: frowned – None; heard – hear.01
                        MLC + RBC prediction: frowned – frown.01X; heard – hear.01
  Unseen phrasal verbs: Sentence: They sluice the streets down every morning.
                        MLC prediction: sluice – None
                        MLC + RBC prediction: sluice – sluice_down.01X

Table 5: Examples for the PSD pipeline.

Acknowledgements

The presented results are the outcome of the research project "Enhancing Adaptive Courseware based on Natural Language Processing (AC&NL Tutor)" undertaken with the support of the United States Office of Naval Research Grant (N00014-20-1-2066).

6. References

Edoardo Barba, Tommaso Pasini, and Roberto Navigli. 2021. ESC: Redesigning WSD with extractive sense comprehension. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4661–4672, Online. Association for Computational Linguistics.

Michele Bevilacqua, Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. Recent trends in word sense disambiguation: A survey. In: Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4330–4338. International Joint Conferences on Artificial Intelligence Organization.

Claire Bonial, Julia Bonn, Kathryn Conger, Jena D. Hwang, and Martha Palmer. 2014. PropBank: Semantics of new predicate types. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3013–3019, Reykjavik, Iceland. European Language Resources Association (ELRA).

Lin Chen and Barbara Di Eugenio. 2010. A maximum entropy approach to disambiguating VerbNet classes. In: Proceedings of Verb 2010, 2nd Interdisciplinary Workshop on Verbs, The Identification and Representation of Verb Features.

Simone Conia, Fabrizio Brignone, Davide Zanfardino, and Roberto Navigli. 2020. InVeRo: Making semantic role labeling accessible with intelligible verbs and roles. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 77–84, Online. Association for Computational Linguistics.

Hoa Trang Dang and Martha Palmer. 2005. The role of semantic roles in disambiguating verb senses. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 42–49, Ann Arbor, Michigan. Association for Computational Linguistics.

Dmitriy Dligach and Martha Palmer. 2008. Novel semantic features for verb sense disambiguation. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short '08, pages 29–32, USA. Association for Computational Linguistics.

Yair Even-Zohar and Dan Roth. 2001. A sequential model for multi-class classification. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In: Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Ani Grubišić, Slavomir Stankov, Branko Žitko, Ines Šarić-Grgić, Angelina Gašpar, Suzana Tomaš, Emil Brajković, and Daniel Vasić. 2020. Declarative knowledge extraction in the AC&NL Tutor. In: Robert A. Sottilare and Jessica Schwarz, editors, Adaptive Instructional Systems, pages 293–310, Cham. Springer International Publishing.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength natural language processing in Python.

Daisuke Kawahara and Martha Palmer. 2014. Single classifier approach for verb sense disambiguation based on generalized features. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4210–4213, Reykjavik, Iceland. European Language Resources Association (ELRA).

Jack Kiefer and Jacob Wolfowitz. 1952. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466.

Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.

Alireza Mohammadshahi and James Henderson. 2021. Syntax-aware graph-to-graph transformer for semantic role labelling.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Federico Scozzafava, Marco Maru, Fabrizio Brignone, Giovanni Torrisi, and Roberto Navigli. 2020. Personalized PageRank with syntagmatic information for multilingual word sense disambiguation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37–46, Online. Association for Computational Linguistics.

Ming Wang and Yinglin Wang. 2020. A synset relation-enhanced framework with a try-again mechanism for word sense disambiguation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6229–6240, Online. Association for Computational Linguistics.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0.

Yorick Wilks and Mark Stevenson. 1997. The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering, 4.

Patrick Ye and Timothy Baldwin. 2006. Verb sense disambiguation using selectional preferences extracted with a state-of-the-art semantic role labeler. In: Proceedings of the Australasian Language Technology Workshop 2006, pages 139–148, Sydney, Australia.

Progress of the RETROGRAM Project: Developing a TEI-like Model for Croatian Grammar Books before Illyrism

Petra Bago
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Ivana Lučića 3, HR-10000 Zagreb
pbago@ffzg.hr

1. Background

RETROGRAM1 (Retro-digitization and Interpretation of Croatian Grammar Books before Illyrism) is a 4-year research project that started in November 2019, co-funded by the Croatian Science Foundation (IP-2018-01-3585) and the Institute of Croatian Language and Linguistics. It is a linguistic heritage project that focuses on the digitization and interpretation of pre-Illyrian Croatian grammar books, with the aim of serving as a repository of such works in the future as well as offering a model and developing processes for future similar research on the digitization of Croatian grammars. So far, no digitization projects have included Croatian grammar books from the pre-Illyrian period of the Croatian language, i.e. before the establishment of the common standard language2 and orthography (Horvat and Kramarić, 2021).

The Croatian language comprises a common standard language as well as three dialects: Čakavian, Kajkavian and Štokavian. The standardization of the Croatian literary language and the orthography based on the Štokavian dialect variant began in the 17th century. The process was finalized in the 19th century, during the time of the Croatian National Revival or the Illyrian movement (i.e. Illyrism). The main goals of the movement regarding language were to introduce a common literary language and a spelling reform, as well as to introduce the Štokavian dialect as the common linguistic standard in order to strengthen the national cultural identity.
The grammars described in this article thus belong to the pre-Illyrian period of the Croatian language, and contain Croatian literary languages that precede the modern Croatian common standard language. The first grammar books were written within the religious orders of the Jesuits and Franciscans, and were used to teach the Croatian or Latin language to the Franciscan and Jesuit youth (Horvat and Kramarić, 2021).

The main goal of the project is to create a web portal of pre-Illyrian Croatian grammar books, which will include facsimiles of selected grammar books with basic bibliographic and processing information, a transcription or translation, and an index of historical grammar and linguistic terminology. The portal will be equipped with thematic search possibilities on the morphological level. The user will be able to browse the grammar book facsimiles, read the transcribed or translated text, and search it by predetermined parameters (which will allow searching conjugation and declension paradigms). Links to the facsimiles will enable comprehensive research on the orthographic and traductological aspects of the selected texts. The open-access portal will be developed and made available to scholars and the general public.

The main objective of the project is to intensify research activities and the interpretation of the Croatian pre-Illyrian grammars within the scope of modern linguistic disciplines (e.g. the cognitive approach), to complete existing knowledge about the morphological development of the Croatian language, its normative descriptions, and the development of linguistic terminology in the pre-Illyrian period. Conclusions on the formation of the Croatian language grammar model will also be based on the analysis of the Latin language grammar structure. A contrastive analysis of Latin and Croatian grammar meta-text and terminology will lead to conclusions about the influence of the Latin language description on Croatian linguistic concepts in the pre-Illyrian period.
More on the project can be found in Horvat (2020) and Horvat and Kramarić (2021).

2. Dataset

RETROGRAM has selected eight Croatian grammar books for the digitization and enrichment process, spanning from the early 17th until the early 19th century. The grammar books cover two dialects (Štokavian and Kajkavian) of pre-Illyrian Croatian, before there was an agreement on the common standard language and orthography. Even though not all of them are grammars of the Croatian language, all contain Croatian as a metalanguage and/or Croatian examples of morphological paradigms. The texts are transcriptions or translations of the originals in MS Word format, as all have been published as reference books by philologists from the project's research group.

1 https://retrogram.jezik.hr/
2 By "common standard language" we mean a standard language covering the entire Croatian-speaking area.

POVZETKI 235 ABSTRACTS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022

The selected transcriptions or translations of grammar books used for the development of the annotation model are based on the following works:

 Bartol Kašić, Institutionum linguae Illyricae libri duo, Rome, 1604 (Kašić, 2002),
 Jakov Mikalja, Gramatika talijanska ukratko ili kratak nauk za naučiti latinski jezik, Loreto, 1649 (Mikalja, Horvat, and Gabrić-Bagarić, 2008),
 Ardelio Della Bella, Istruzioni grammaticali della lingua illirica, Venice, 1728 (Della Bella, Sironić-Bonefačić, and Gabrić-Bagarić, 2006),
 Blaž Tadijanović, Svašta po malo iliti kratko složenje imena, riči u ilirski i njemački jezik, Magdeburg, 1761 (Horvat and Ramadanović, 2012),
 Marijan Lanosović, Uvod u latinsko riči slaganje s nikima nimačkog jezika biližkama za korist slovinskih mladića složen, Osijek, 1776 (Perić Gavrančić, 2020),
 Ignacije Szentmártony, Einleitung zur kroatischen Sprachlehre für Deutsche, Varaždin, 1783 (Szentmártony, 2014),
 Josip
Voltić, Grammatica illirica, Vienna, 1803 (Voltić, 2016),
 Francesco M. Appendini, Grammatica della lingua Illirica, Dubrovnik, 1808 (Appendini and Lovrić Jović, 2022).

3. Data Annotation Model

The eight selected Croatian grammar books are the basis for the development of the annotation model, which is based on the TEI Guidelines (TEI Consortium, 2021b). The model addresses two annotation tasks: 1) the annotation of historical grammar and linguistic terminology, and 2) the annotation of morphological paradigms. The annotation tasks will be performed manually by experts working on the project. The decision was made to keep the original text intact, with any enrichment done through elements and attributes. Each grammar book is a TEI document comprised of a header and the body of the grammar text. The header contains metadata relevant to the project and to the particular grammar book, such as a list of all annotated grammatical terms. The body of the TEI document contains all the grammar text with the grammatical terminology and morphological paradigm annotations.

3.1. Grammatical Terminology Model

One of the aims of the RETROGRAM project is to facilitate research into historical grammar and linguistic terminology via the web portal. We composed an index of contemporary Croatian terms to be used for the normalization of the terminology. These terms are also used in the morphological paradigms annotation task. We have identified 87 terms related to the inflected parts-of-speech. The list of terms is encoded in the TEI header. In Example 1 we present the encoding of the term "noun" (imenica in Croatian) in the index to be used in the annotation model. The example is extracted from Mikalja's grammar book.

imenica ...

Example 1: Encoding of the term "noun" (imenica in Croatian) in the index of Mikalja's grammar.

To annotate the term in the grammar text, we use the element <term>3 that is, according to the TEI Guidelines, used to encode a technical term.
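As an illustration only, here is a hypothetical sketch of what such a `<term>` annotation might look like, built with Python's standard library. The `ref` attribute and its two pointer values (index entries for noun and adjective) are assumptions, not the project's actual schema.

```python
# Hypothetical reconstruction of a TEI <term> annotation, built with
# the standard library. The @ref attribute and its two pointer values
# (the index entries for noun and adjective) are assumptions, not the
# RETROGRAM project's actual schema.
import xml.etree.ElementTree as ET

term = ET.Element("term", attrib={"ref": "#imenica #pridjev"})
term.text = "IMENA"  # Mikalja's historical term covering nouns and adjectives

xml = ET.tostring(term, encoding="unicode")
```

The two pointer values reflect the fact, described above, that Mikalja's historical term IMENA covers both nouns and adjectives, hence two attribute values in the annotation.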
In Example 2 you can find the encoding of the historical grammar term IMENA that Mikalja used to describe both nouns and adjectives, hence the two attribute values. The model developed for annotating grammar terminology adheres to the TEI Guidelines.

3 https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-term.html
OD IMENA

Example 2: Encoding of the term "noun" (imenica in Croatian) in the grammar text of Mikalja's grammar.

3.2. Morphological Paradigms Model

For the development of the morphological paradigms model, we analyzed the following inflected parts-of-speech: nouns, pronouns, adjectives, numbers and verbs. In the TEI Guidelines there is no specific module for encoding grammar texts. However, we have decided to customize the dictionary module (TEI Consortium, 2021a), since it already contains elements that group the morphosyntactic information of a lexical item. Interestingly, we were not the only ones with this idea, as Toma Tasovac and Laurent Romary addressed the issue as part of the TEI Lex-0 initiative4.

Often the morphological paradigms are presented in a table format. For the purposes of the RETROGRAM project, we decided to disregard the presentation mode of the paradigm and to encode only the implicit information contained in the tables. To encode one lexical item in a paradigm, we use the element <form>5, which usually "groups all the information on the written and spoken form of one headword" in a dictionary. According to the TEI Guidelines, the element is allowed to be contained by elements grouping information on one or more entries. We violate the guidelines by allowing this element to occur in a paragraph. Except for this violation of the guidelines regarding where the element can occur, all other child elements adhere to the TEI documentation, albeit encoding information not on a headword but on a lexical unit of a morphological paradigm. We have defined mandatory and optional information for each inflected part-of-speech to be annotated as part of the RETROGRAM project, and developed a customized TEI schema. In Example 3 an encoding of two cases of the noun vojnik (soldier in English) as part of the paradigm is presented.
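The `<form>`-based encoding just described can be sketched hypothetically as follows. The child elements (`<orth>`, `<gramGrp>`, `<case>`, `<number>`) exist in the TEI dictionary module, but their exact use here is an assumption rather than the RETROGRAM schema.

```python
# Hypothetical sketch of one paradigm cell grouped under <form>. The
# child elements (<orth>, <gramGrp>, <case>, <number>) exist in the TEI
# dictionary module, but their exact use in the RETROGRAM schema is an
# assumption here.
import xml.etree.ElementTree as ET

form = ET.Element("form")
ET.SubElement(form, "orth").text = "vojnik"  # the written form
gram = ET.SubElement(form, "gramGrp")        # grammatical information
ET.SubElement(gram, "case").text = "nominative"
ET.SubElement(gram, "number").text = "singular"

xml = ET.tostring(form, encoding="unicode")
```

The sketch follows the decision described above: the table layout of the paradigm is discarded, and only the implicit grammatical information (case, number) of each lexical unit is encoded.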
Kad ga imenujemo, rečemo vojnik
il soldato
Kad se pita čigovo je, rečemo
vojníka
del soldato
...

Example 3: Encoding of two cases of the noun vojnik as a segment of a morphological paradigm in Mikalja's grammar.

4 https://github.com/DARIAH-ERIC/lexicalresources/tree/master/Resources/grammars-in-TEI
5 https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-form.html

4. Future Plans and Conclusion

We are currently conducting the manual annotation tasks based on the two models. Once the annotation tasks are complete, the next step is to create a web portal where all the enriched grammar texts will be open and freely available with various search options.

In this extended abstract we present the progress of RETROGRAM, a linguistic heritage project that focuses on the digitization and interpretation of pre-Illyrian Croatian grammar books, with the aim of serving as a repository of digital Croatian grammars as well as offering a model and developing processes for the digitization of such works. Analyzing eight grammar texts published from the 17th until the 19th century, we developed two models: 1) a model for the annotation of historical grammar and linguistic terminology, and 2) a model for the annotation of morphological paradigms. We composed a taxonomy consisting of 87 terms to be used in both models. To implement the models, we consulted the TEI Guidelines, the de facto standard in the digital humanities. Our first model adheres to the guidelines. However, our second model is a TEI-like model that we developed based on the dictionary module of the same guidelines. We hope that the morphological paradigm model will serve as a basis for the development of a TEI module for grammars, a module that is presently missing but could be incorporated into the TEI infrastructure by expanding the dictionary module.

5.
Acknowledgements

RETROGRAM is generously co-financed by the Croatian Science Foundation under the program "Research Projects" with grant agreement IP-2018-01-3585 and by the Institute of Croatian Language and Linguistics. We wish to thank all our research associates, as well as Toma Tasovac, for their feedback and help.

6. References

Francesco Maria Appendini and Ivana Lovrić Jović. 2022. Appendinijeva Gramatika ilirskoga jezika: Jezična studija s prijevodom i transkripcijom uz faksimil. Institut za hrvatski jezik i jezikoslovlje, Nacionalna i sveučilišna knjižnica u Zagrebu, Zagreb.

Ardelio Della Bella, Nives Sironić-Bonefačić, and Darija Gabrić-Bagarić. 2006. Istruzioni grammaticali della lingua illirica, 1728: Gramatičke pouke o ilirskome jeziku. Institut za hrvatski jezik i jezikoslovlje, Zagreb.

TEI Consortium (ed.). 2021a. 9 Dictionaries. In: TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.3.0. TEI Consortium. https://tei-c.org/release/doc/tei-p5-doc/en/html/DI.html.

TEI Consortium (eds.). 2021b. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.3.0. TEI Consortium. http://www.tei-c.org/Guidelines/P5/.

Marijana Horvat. 2020. Istraživanje povijesti hrvatskoga jezika u digitalno doba. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 46(2):635–643.

Marijana Horvat and Martina Kramarić. 2021. Retro-Digitization of Croatian Pre-Standard Grammars. Athens Journal of Philology, 8(4):297–310.

Marijana Horvat and Ermina Ramadanović. 2012. Jezikoslovni priručnik Blaža Tadijanovića Svašta po malo iliti kratko složenje imena, riči u ilirski i njemački jezik (1761.). Institut za hrvatski jezik i jezikoslovlje, Zagreb.

Bartol Kašić. 2002. Institutiones linguae Illyricae / Osnove ilirskoga jezika. Institut za hrvatski jezik i jezikoslovlje, Zagreb.

Jakov Mikalja, Marijana Horvat, and Darija Gabrić-Bagarić. 2008. Gramatika talijanska ukratko ili kratak nauk za naučiti latinski jezik.
Institut za hrvatski jezik i jezikoslovlje, Zagreb. Sanja Perić Gavrančić. 2020. Latinska gramatika i hrvatski jezik Marijana Lanosovića: Povijesnojezična studija i transkripcija izvornika. Institut za hrvatski jezik i jezikoslovlje, Zagreb. Ignacije Szentmártony. 2014. Uvod u nauk o horvatskome jeziku. Institut za hrvatski jezik i jezikoslovlje, Zagreb. Josip Voltić. 2016. Grammatica Illirica/Ilirska gramatika. Reprint of the first edition (1803). Institut za hrvatski jezik i jezikoslovlje, Zagreb. The CCRU as an Attempt at Doing Philosophy in a Digital World Tvrtko Balić Faculty of Humanities and Social Sciences, University of Zagreb Ivana Lučića 3, 10000, Zagreb tvrtko.balic@gmail.com 1. Introduction The consequences brought about by the Internet have been immense. The resulting chaos affected society at large, and while the natural sciences could enjoy the greater availability of information, the social sciences and humanities found themselves in a new world with new problems. A new environment was created for communities to function in, and this environment was ready to be studied. But it was also an area affected by those sciences, a breeding ground for theories. The tendency of theories to affect the realities they study has been accelerated by the emergence of the Internet. It is not clear how to act in such an environment. In 1995, an experimental cultural-theorist collective called the Cybernetic Culture Research Unit (CCRU) was formed at Warwick University, England. 2. Goal of the paper The goal of the paper is to examine the problems presented by the Internet and to look at the Cybernetic Culture Research Unit as an example of theorists (specifically those in the field of philosophy) adapting to the new medium. 3.
Influences The main influences on the CCRU were the French postmodernists, and the CCRU is itself a postmodern project. Sadie Plant, the feminist lecturer writing a book on “The Situationist International in a Postmodern Age”, and Nick Land, the eccentric professor teaching a course on “Current French Philosophy”, took these influences and carried them to new levels of eccentricity. 3.1. Lyotard Jean-François Lyotard was the first to introduce the term “postmodern” into a philosophical context. According to him, the availability of knowledge is what causes the transition from the modern to a postmodern condition. The organization of knowledge is what serves to justify power in the modern world. As knowledge becomes more available, the power of old actors such as nation-states withers, new actors emerge and the nature of society changes profoundly. Scientific knowledge is not easily accessible, and to bring it closer to people for the purpose of legitimation, whether of itself or of some political, economic, cultural or other system, it takes the form of a narrative. That is when the problem of a conflict of narratives emerges, but generally one narrative dominates over the others and becomes a metanarrative, a story which offers an explanation for the world and justifies a certain social order. The fundamental feature of postmodernism according to Lyotard is the decay and disappearance of metanarratives. He was enthusiastic about postmodernism and wanted to fragment and break down society in order for experimentation in the social field to yield improvements. The CCRU presents the Internet as a fertile ground for Lyotard's theories. It should be clear why: it makes both knowledge and the power of expression even more accessible. 3.2. Derrida Jacques Derrida is most reflected in the CCRU's work through its style. What is reflected are his hopes for philosophical writing. He was critical of the seriousness of the philosophical canon, which in his day he saw as dominated by Hegelian thought.
Derrida celebrates poetry, laughter and ecstasy, which he sees as neglected. He sees two developed forms of writing: serious philosophy on the one hand and playful literature on the other. He opposes what he calls logocentrism, the perceived domination of the ideal of the spoken word and the criticism of writing as found in literature, a stance stemming all the way from antiquity, with Socrates and Plato standing out as important critics of writing. “This-major-writing will be called writing because it exceeds the logos (of meaning, lordship, presence etc.). Within this writing – the one sought by Bataille – the same concepts, apparently unchanged in themselves, will be subject to a mutation of meaning, or rather will be struck by (even though they are apparently indifferent), the loss of sense toward which they slide, thereby ruining themselves immeasurably.” (Derrida, 1990) Derrida aims to destroy boundaries between philosophy and literature. With the CCRU the boundaries get lost in the creation of a brand-new writing style, theory-fiction. Theory-fiction could be considered a genre of its own, a surreal combination of cyberpunk and Gothic horror. Writings in this style are ambiguous and even their literary meaning is hard to distinguish, yet they are filled with philosophical ideas waiting to be deciphered. One could imagine this making Derrida proud or jealous. 3.3. Deleuze and Guattari As far as theoretical influences are concerned, the French pair who coauthored many works, the philosopher Gilles Deleuze and the psychoanalyst and political activist Félix Guattari, are probably the most influential. The CCRU writings aim for what Deleuze and Guattari called schizoanalysis. It is an alternative to what is typically understood as rational thinking and more in line with the spirit of the time.
It embraces the kind of thinking associated with schizophrenics and with people with cluster A personality disorders. The similarity between a philosopher and a schizophrenic is that both rely on abstractions and on finding connections between wildly different phenomena. Schizoanalysis takes this connection and runs with it before letting it run loose. Thinking becomes chaotic yet orderly in its own way, within its own logic. Everything becomes rhizomatic. “Let us summarize the principal characteristics of a rhizome: unlike trees or their roots, the rhizome connects any point to any other point, and its traits are not necessarily linked to traits of the same nature; it brings into play very different regimes of signs, and even nonsign states. The rhizome is reducible neither to the One nor the multiple.” (Deleuze and Guattari, 2005) A pair of terms of special note are deterritorialization and reterritorialization, the first referring to the process by which social relations are altered, mutated or destroyed, and the second referring to the process by which new relations emerge. The CCRU was revolutionary in its accelerationist embrace of social change, which meant celebrating deterritorialization, whether for its own sake, motivated by a libertarian desire for freedom, or for the sake of better alternatives emerging, maybe even new trees and new metanarratives. 3.4. Baudrillard The last major figure influencing the CCRU was the French sociologist, philosopher and cultural theorist Jean Baudrillard. His key concepts are simulation, simulacra and hyperreality. Simulation is a process by which reality is replaced with its representation; what is left are simulacra. Baudrillard describes three orders of simulacra, all stemming from the original traditional symbolic order. “In the first case, the image is a good appearance - representation is of the sacramental order.
In the second, it is an evil appearance - it is of the order of maleficence. In the third, it plays at being an appearance - it is of the order of sorcery. In the fourth, it is no longer of the order of appearances, but of simulation.” (Baudrillard, 1994) This fourth case, the third order of simulacra, is the pure simulacrum, something only ever referencing itself without any authentic reality behind it. This is how Baudrillard conceived of the postmodern world. For him the history of modernity is the history of the disappearance of the real. However, what is left isn't the unreal or the false; it is the hyperreal. Baudrillard's writing is full of references to magic when speaking of traditional societies, and to new technologies, virtual reality, explosions of information, machines conquering humanity etc. when talking about contemporary societies. This is very much thematically relevant to the CCRU. They weren't the only ones fascinated with Baudrillard; he was so influential that The Matrix is full of references to his work. However, as opposed to what is depicted in The Matrix, in the hyperreal world there is no real to refer to, there is no exiting the simulation, no escaping the code. But for the CCRU there is hope in the Internet, that from the “Desert of the Real” something new will emerge. Baudrillard is pessimistic about the changes he observes and only brings up possible solutions to problems in order to refute them, but in the CCRU there is an amor fati present, even if not optimism. 4. Playful and dangerous writing group From the name Cybernetic Culture Research Unit and basic knowledge of what it is about, one might expect two things: cyberpunk and philosophy. Instead, what one finds is a surreal, drug-fueled collection of writing about Lovecraftian demons, numerology, ghost lemurs of Madagascar preserving the memories of psychic amphibians… And the things one might expect are so enigmatic as to be distorted beyond recognition.
Two things become clear: one is the role of drugs in the CCRU, and the other is that it wasn't really a philosophical or information-science research group at all but primarily a literary club. The people involved mostly had a background in philosophy and went on to independent careers, some writing in a more psychedelic style that reveals their history in the CCRU and some being more “normal” and understandable. This isn't to say that there is no philosophy to be found in the CCRU, but much of it consists of motifs and sources of inspiration arising from the chaos of collective storytelling and the authors' common interests and influences. One important concept related to the CCRU is hyperstition. Hyperstitions are fictions that make themselves real, like the way the concept of space travel brought space travel into reality. This explains the importance of the artistic style for some members. All ideas can be understood as hyperstitions using humans as hosts that bring them into existence. The CCRU often wrote about a fictionalized version of itself. This can be understood as a sort of magic. 5. Prominent figures and their insights From this literary group emerged strains of thought ranging from the far-right Nick Land to the far-left Mark Fisher and the cyberfeminist Sadie Plant. 5.1. Sadie Plant and cyberfeminism Plant offers a unique blend of postmodern feminism and hopes typical of the 90s and visible in films like Hackers. According to her, the transformative power of the Internet lies in the fact that it offers a space without physical bodies. Furthermore, computer technology and programming are inherently feminine and therefore benefit women. Finally, women are treated as machines and because of this share a connection with them: the emancipation of machines will bring about the emancipation of women.
In some respects, Plant proved prophetic. The Internet greatly improved the visibility of marginalized groups and made the general public more compassionate toward them. In other respects, not so much: the Internet allows all kinds of opinions to prosper, and that certainly includes sexist ones. But in any case, she certainly offers food for thought about how gender identities are formed and expressed. 5.2. Mark Fisher and blogging Fisher is most famous for writing about how hard it is for people to imagine an alternative and how capitalism is capable of co-opting resistance and creating fake opposition. However, one subject where he was surprisingly optimistic was blogging. Fisher reflected on how doing serious philosophical work (for instance writing a PhD) can be difficult and depressing, whereas writing a blog is more relaxing: by being less serious it can trick people into doing serious philosophy, and it also offers an interactivity that hasn't been seen since the days of the Greek agora. The new digital agoras have since also been assimilated into the existing system. In a way there is a contradiction in Fisher's writing, but the glimmer of hope he saw is important. If it is forgotten, we are not due a better fate than Fisher, who killed himself due to depression. “I started blogging as a way of getting back into writing after the traumatic experience of doing a PhD. PhD work bullies one into the idea that you can’t say anything about any subject until you’ve read every possible authority on it. But blogging seemed a more informal space, without that kind of pressure. Blogging was a way of tricking myself back into doing serious writing. I was able to con myself, thinking, ‘it doesn’t matter, it’s only a blog post, it’s not an academic paper’. But now I take the blog rather more seriously than writing academic papers.” (Fisher, 2018) 5.3. Nick Land and neo-reaction For better or worse, the member of the CCRU who is most prominent today is Nick Land.
One of the ideas he developed was conceiving of capitalism as an artificial intelligence, but while other authors may hope for this AI to update its software and produce something new, Land seems content to accept that there is no alternative. Land continues to either inspire interpretations of new phenomena on the Internet or offer new interpretations himself. A significant example of the former is the influence of a combination of the younger Land's idea of hyperstition and the older Land's right-wing political attitudes on the creation of the online theory of meme magick, the idea that Internet memes can influence reality and that this is how Donald Trump won the 2016 US presidential election in a supernatural way. A significant example of the latter is Land's philosophy of Bitcoin, which is not only economic but metaphysical as well, using Bitcoin to explain the logical law of identity and to reaffirm the Kantian understanding of space and time. 6. Concluding remarks The CCRU is relevant because today the Internet is so ingrained in our lives that we no longer even notice it, just as fish don't notice the water they are in. It can prove useful to look at the time when this technology was new, and if the future did turn out disappointing, maybe we should examine yesterday's speculation about today to remind ourselves what could have been. Sometimes parts of these writings prove oddly prophetic, and in that case it is good to appreciate what we have or to look at it with new eyes. And even when they seem wrong, they represent a valiant attempt at doing something new. 7. References Brent Adkins. 2015. Deleuze and Guattari's A Thousand Plateaus: A Critical Introduction and Guide. Edinburgh University Press, Edinburgh. Jean Baudrillard. 1994. Simulacra and Simulation.
University of Michigan Press, Ann Arbor. Ccru. 2015. Ccru: Writings 1997–2003. Time Spiral Press. Mark Fisher and Matt Colquhoun. 2020. Acid Communism. Pattern Books. Mark Fisher. 2009. Capitalist Realism: Is There No Alternative? John Hunt Publishing. Mark Fisher. 2018. K-Punk: The Collected and Unpublished Writings of Mark Fisher (2004–2016). Repeater. Gilles Deleuze and Félix Guattari. 2005. A Thousand Plateaus. University of Minnesota Press, Minneapolis. Jacques Derrida. 1990. Writing and Difference. Routledge, London. Nick Land. 2011. Fanged Noumena: Collected Writings 1987–2007. MIT Press. Jean-François Lyotard. 2015. Libidinal Economy. Bloomsbury Publishing, London. Jean-François Lyotard. 2005. Postmoderno stanje: Izvještaj o znanju. Ibis-grafika, Zagreb. Jean-François Lyotard. 1991. The Inhuman: Reflections on Time. Stanford University Press. Sadie Plant. 1997. Zeros and Ones: Digital Women and the New Technoculture. Fourth Estate, London. Referencing the Public by Populist and Non-Populist Parties in the Slovene Parliament Darja Fišer*+, Tjaša Konovšek*, Andrej Pančur* *Institute of Contemporary History Privoz 11, SI-1000 Ljubljana darja.fiser@inz.si tjasa.konovsek@inz.si andrej.pancur@inz.si +Faculty of Arts, University of Ljubljana Aškerčeva 2, SI-1000 Ljubljana 1. Introduction In the last two decades, political reality in many democratic countries in Europe as well as around the globe has witnessed an increase in active populist political parties and a rise in their popularity among citizens. Parallel to the spread of populism, political science and sociological analyses note a clear difference between the discourses of members of populist and non-populist parties, especially when using social and other media.
However, less is known about the relationship between populist and non-populist discourses in the speeches of members of parliament (MPs) in political systems of parliamentary democracy, in which parliaments are the central representative, legislative, and controlling state institutions. This contribution aims to suggest a model for such an analysis. The proposed analysis is built around two key concepts. First, we use the concept of life-world to acknowledge the existence of a specific reality of MPs in which their speech is made. Second, we draw on the existing typology of populist and non-populist parties created by political scientists and sociologists to see how MPs from two different groups of political parties, i.e. populist and non-populist, construct their view of the public. The goal of the analysis is to detect any differences between populist and non-populist discourse observed through the lens of their references to the general public. 2. Approach and methodology To further investigate the connection between the speech of MPs, their image of the public, and their populist or non-populist origin, we combine the cultural history of parliamentarianism with corpus linguistics. From a historical perspective, we draw on recent developments in political history, focusing on the cultural side of the history of parliamentarism (Aerts, 2019; Gjuričová and Zahradníček, 2018; Gašparič, 2012; Schulz and Wirsching, 2012; Ihalainen et al., 2016). For this purpose, we use the concept of life-world (or Lebenswelt). The concept originated in philosophy (Husserl, 1962; Habermas, 2007) and has since been used in historiography to emphasize the circumstances in which parliamentarianism is experienced, focusing on MPs as historical actors (Gjuričová et al., 2014).
The approach brings to the fore research questions about MPs' perceptions, education, and expectations; their political socialization, prior experiences, and everyday life; and the influence of collective opinions, public images, and the media on their work. In this paper, we focus on one aspect of MPs' life-world, namely their relationship to their counterpart, the public, through the words they choose to use, which, in turn, reveals a part of their self-understanding. In the framework of life-world, we further distinguish between populist and non-populist parties on two axes. First, based on the contents of political parties, we draw on existing research to determine which Slovenian political parties qualify as populist. Second, on the temporal axis, we acknowledge the break of 2004 as the year that witnessed the active beginnings of modern populism in the Slovene political space (Fink Hafner, 2019; Frank and Šori, 2015; Fabijan and Ribać, 2021; Campani and Pajnik, 2017; Šori, 2015; Hadalin, 2020; Hadalin, 2021; Lovec, 2019; Pajnik, 2019). We take into account the difference between modern populist parties, as they emerged in the last decade and a half, and their immediate precursors, which have existed since the early 1990s. The analysis therefore counts the Slovenian Democratic Party (SDS) and its predecessor, the Social Democratic Party of Slovenia (SDSS), New Slovenia (NSi) and the Slovenian National Party (Slovenska nacionalna stranka, SNS) as populist parties, while all others are classified as non-populist. 3. Analysis The analysis is based on the Slovenian parliamentary corpus (1990–2018) siParl 2.0 (Pančur et al., 2020). We take into account the time span from 1992, when the first term of the Slovenian parliament started, until 2018, when the seventh term ended.
The time frame thus includes some important events that affected the development of Slovenian political parties and their governing style, such as Slovenia's accession to the European Union in 2004 (Gašparič, 2012), the global financial crisis in 2007 and 2008, and the migrant crisis in 2015 (Moffitt, 2014). Using the typology advocated by sociologists and political scientists (see Section 2), we created subcorpora of populist and non-populist political parties for each parliamentary term, resulting in a total of 14 subcorpora. The subcorpora ranged from just under a million tokens in Term1 to 12 million tokens in Term7 for populist parties, and from 7 million tokens in Term1 to just under 15 million tokens in Term7 for non-populist parties. The next step presented a challenge, as there are no pre-existing wordlists of references to the general public that we could rely on. We therefore generated frequency lists of nouns for each subcorpus and manually selected those that refer to the public in the broadest sense (e.g. person, citizen, inhabitant) from the 1,000 most frequent nouns in each subcorpus. We only took into account the nouns that can only refer to people (groups or individuals), disregarding those that can also be used for institutions (e.g. association) or objects (e.g. school). We also checked their usage via concordance search and discarded the expressions that could potentially be used for the general public but in this specific corpus predominantly refer to the MPs, the government or their staff (e.g. proposer). As can be seen in Table 1, this yielded a total of 86 unique nouns with a total absolute frequency of 359,320 and a relative frequency of 7,322.53 per million tokens for the populist parties, and a total absolute frequency of 524,195 and a relative frequency of 6,788.74 per million tokens for their non-populist counterparts. Most (69) of the nouns are shared between both party groups (e.g. human), in addition to 10 that are unique for the populist MPs (e.g.
Croat) and 7 that are specific to non-populist MPs (e.g. stakeholder).

                     POPULIST1-7            NON-POPULIST1-7
#tokens              49,070,504             77,215,381
#lemmas              76                     74

LEMMA                     AF        RF          AF        RF    P:N ratio
Populist-only:
Hrvat                  1,341     27.33           0      0.00    /
žena                     397      8.09           0      0.00    /
Avstrijec                318      6.48           0      0.00    /
diplomant                300      6.11           0      0.00    /
storilec                 232      4.73           0      0.00    /
volilec                  161      3.28           0      0.00    /
delojemalec               36      0.73           0      0.00    /
neslovenec                31      0.63           0      0.00    /
svojec                    27      0.55           0      0.00    /
delavka                    0      0.00           0      0.00    /
Non-populist-only:
deležnik                   0      0.00       1,784     23.10    /
prejemnik                  0      0.00       1,191     15.42    /
najemnik                   0      0.00         983     12.73    /
dolžnik                    0      0.00         752      9.74    /
vajenec                    0      0.00         444      5.75    /
kadilec                    0      0.00         290      3.76    /
krajan                     0      0.00         172      2.23    /
Joint:
oče                      929     18.93         329      4.26    4.44
obrtnik                1,187     24.19         540      6.99    3.46
davkoplačevalec        4,762     97.04       2,178     28.21    3.44
migrant                2,627     53.54       1,255     16.25    3.29
vlagatelj                426      8.68         260      3.37    2.58
podjetnik              3,880     79.07       2,671     34.59    2.29
moški                    827     16.85         619      8.02    2.10
ljudstvo               3,089     62.95       2,376     30.77    2.05
Italijan                 272      5.54         216      2.80    1.98
Slovenka               1,432     29.18       1,143     14.80    1.97
pacient                1,619     32.99       1,452     18.80    1.75
zamejstvo              1,067     21.74         966     12.51    1.74
kmet                   6,839    139.37       6,739     87.28    1.60
prijatelj              1,024     20.87       1,012     13.11    1.59
naročnik                 517     10.54         516      6.68    1.58
Slovenec              10,103    205.89      11,090    143.62    1.43
dijak                  2,403     48.97       2,670     34.58    1.42
kupec                  1,216     24.78       1,357     17.57    1.41
državljan             21,570    439.57      24,828    321.54    1.37
priča                  4,061     82.76       4,701     60.88    1.36
državljanka            6,902    140.65       8,372    108.42    1.30
narod                  4,952    100.92       6,035     78.16    1.29
žrtev                  3,945     80.39       4,810     62.29    1.29
sosed                    738     15.04         928     12.02    1.25
človek                68,517  1,396.30      86,824  1,124.44    1.24
Rom                      627     12.78         808     10.46    1.22
bolnik                 1,279     26.06       1,717     22.24    1.17
prosilec                 343      6.99         468      6.06    1.15
javnost               16,248    331.12      22,367    289.67    1.14
starš                  5,732    116.81       7,893    102.22    1.14
oseba                 16,836    343.10      23,762    307.74    1.11
subjekt                3,406     69.41       4,866     63.02    1.10
družina               11,120    226.61      16,298    211.07    1.07
otrok                 18,205    371.00      26,762    346.59    1.07
gost                     966     19.69       1,438     18.62    1.06
begunec                1,247     25.41       1,879     24.33    1.04
mladina                1,384     28.20       2,101     27.21    1.04
delničar                 444      9.05         684      8.86    1.02
tujec                  3,169     64.58       4,908     63.56    1.02
zavarovanec              896     18.26       1,394     18.05    1.01
volivec                3,478     70.88       5,544     71.80    0.99
lastnik                8,031    163.66      12,814    165.95    0.99
mati                     320      6.52         512      6.63    0.98
družba                23,431    477.50      38,532    499.02    0.96
študent                4,973    101.34       8,202    106.22    0.95
posameznik             7,367    150.13      12,307    159.39    0.94
zavezanec              2,437     49.66       4,096     53.05    0.94
uporabnik              3,441     70.12       5,866     75.97    0.92
nosilec                2,211     45.06       3,812     49.37    0.91
občan                  1,558     31.75       2,688     34.81    0.91
prebivalec             5,318    108.37       9,404    121.79    0.89
partner                4,580     93.34       8,312    107.65    0.87
potrošnik              1,657     33.77       3,060     39.63    0.85
generacija             2,279     46.44       4,215     54.59    0.85
delavec               10,768    219.44      20,055    259.73    0.84
invalid                3,032     61.79       5,760     74.60    0.83
prebivalstvo           2,727     55.57       5,452     70.61    0.79
manjšina               2,742     55.88       5,518     71.46    0.78
učenec                 1,437     29.28       3,071     39.77    0.74
ženska                 2,941     59.93       6,517     84.40    0.71
upokojenec             3,547     72.28       8,097    104.86    0.69
skupnost              16,208    330.30      38,163    494.24    0.67
pripadnik              1,375     28.02       3,238     41.93    0.67
upravičenec            1,673     34.09       4,523     58.58    0.58
upnik                    566     11.53       1,725     22.34    0.52
podpisnik                465      9.48       1,460     18.91    0.50
udeleženec               500     10.19       1,685     21.82    0.47
porabnik                 129      2.63         540      6.99    0.38
populacija               480      9.78       2,179     28.22    0.35
Total                359,320  7,322.53     524,195  6,788.74    1.08

Table 1: List of specific and joint public-related words identified in the subcorpora of populist and non-populist speeches with their absolute and relative frequencies as well as the usage ratio.

The list of populist-specific nouns contains words describing people according to their background (e.g. Austrian, non-Slovenian), family role (e.g. relative, wife) and employment status (e.g. female worker, employee). Non-populist-specific nouns contain expressions which describe the role or status of a person in an administrative or legal procedure (e.g.
stakeholder, recipient), business transaction (e.g. tenant, debtor), origin (e.g. local), education (e.g. apprentice) or health status (e.g. smoker). Among the joint nouns, father, craftsman, taxpayer and migrant are used three times more frequently by populist MPs, whereas beneficiary, participant, consumer and population are used more than twice as frequently by non-populist MPs. Insurance holder, voter and owner are used nearly identically by both groups of MPs. This might reflect a difference between the populist and non-populist parties and the focus of their political base: while the former usually rally voters from rural areas, the latter are traditionally more successful in urban areas.

                            T1         T2          T3         T4          T5          T6         T7          Total
Populist #tokens            950,851    4,917,224   7,291,606  8,607,268   8,598,006   6,622,380  12,083,169  49,070,504
Populist "public" AF        6,204      27,738      49,606     68,971      57,041      48,881     100,879     359,320
Populist "public" RF        6,525      5,641       6,803      8,013       6,634       7,381      8,349       7,323
Non-populist #tokens        7,323,569  11,387,486  8,838,299  14,394,700  11,452,223  8,869,712  14,949,392  77,215,381
Non-populist "public" AF    48,446     58,100      52,118     91,254      84,878      67,310     122,089     524,195
Non-populist "public" RF    6,615      5,102       5,897      6,339       7,411       7,589      8,167       6,789
P-value                     0.3059     2.54E-43    6.61E-116  0           8.25E-94    2.81E-03   2.01E-07    1.41E-269
Chi2 test                   1.0482     190.4453    523.7064   2181.3538   422.1633    21.9444    27.0286     1230.5394
Statistical significance    NO         YES         YES        YES         YES         YES        YES         YES

Table 2: Absolute and relative frequency of public-related words as used by populist and non-populist MPs per parliamentary term and statistical significance tests.

(Figure: line chart of relative frequencies, y-axis 0–9,000, x-axis T1–T7, with series Populist, Non-populist and Combined.) Figure 1: Relative frequency of nouns referring to the public in speeches of MPs from populist and non-populist political parties in the Slovene parliament 1992–2018, by parliamentary term.
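The frequency-list step behind Table 1 and the per-million relative frequencies can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: it assumes a lemmatized, POS-tagged subcorpus is available as (lemma, pos) pairs, and the input format and tag name here are hypothetical stand-ins for the siParl annotations.

```python
from collections import Counter

def noun_frequency_list(tokens, top_n=1000):
    # Count lemma frequencies for nouns only; the top_n list is the
    # starting point for the manual selection of public-related nouns.
    counts = Counter(lemma for lemma, pos in tokens if pos == "NOUN")
    return counts.most_common(top_n)

def rf_per_million(abs_freq, corpus_size):
    # Relative frequency per million tokens, the unit used in Table 1.
    return abs_freq / corpus_size * 1_000_000

# Toy input; the real subcorpora range from ~1M to ~15M tokens.
toy = [("človek", "NOUN"), ("biti", "VERB"),
       ("človek", "NOUN"), ("državljan", "NOUN")]
print(noun_frequency_list(toy))  # [('človek', 2), ('državljan', 1)]

# Sanity check against Table 1: populist total, 359,320 of 49,070,504 tokens.
print(round(rf_per_million(359_320, 49_070_504), 2))  # 7322.53
```

The automated step only produces candidates; as described above, the final word list still required manual filtering and concordance checks.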
As can be seen from Table 2 and Figure 1, we observe a steady general upward trend in the use of nouns describing the public in both populist and non-populist parties over time. For all terms combined, populist MPs refer to the public statistically significantly more frequently than their non-populist counterparts (P-value 1.41E-269, Chi2 test 1230.5394)1, which confirms our main hypothesis. For all the MPs combined, the only, and quite substantial, drop in the frequency of references to the public can be observed from Term1 to Term2, which could be attributed to the early stages of the formation of the Slovenian political space. Especially in Term1, the MPs had to face many questions about establishing the workings of the new parliament itself. It took time before a new normality of parliamentary work was established and the MPs began to address the public more. While the early Slovene political transition exhibited a general consensus about the need to strengthen parliamentary democracy, the time after that has been much less clear, which could account for the increase in references to the public by the MPs, since they had to search for new contents of policy-making. 1 https://www.korpus.cz/calc/ As for individual terms, populist MPs refer to the public statistically significantly more often in Terms2–4 and 7, with Term4 as the biggest outlier, while the opposite is true of Terms5–6, with Term5 as the biggest outlier. In Term1, non-populist MPs use more public-denominating expressions, but the difference is not statistically significant. Terms2–3 can be interpreted as the period of formation of populist parties (1992–2004), with Term4 being the first parliamentary term working with a populist (SDS-led) government.
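The significance figures in Table 2 (computed in the study with the korpus.cz calculator cited in the footnote) can also be reproduced directly. A minimal sketch, assuming the standard Pearson chi-square on a 2×2 contingency table of raw token counts, with no external libraries:

```python
def chi2_two_corpora(hits_a, size_a, hits_b, size_b):
    """Pearson chi-square (1 df, no continuity correction) comparing the
    frequency of a word list in two corpora via a 2x2 contingency table:
    rows = corpus A / corpus B, columns = public-related hits / other tokens."""
    n = size_a + size_b
    total_hits = hits_a + hits_b
    chi2 = 0.0
    for observed, row_total, col_total in [
        (hits_a, size_a, total_hits),
        (size_a - hits_a, size_a, n - total_hits),
        (hits_b, size_b, total_hits),
        (size_b - hits_b, size_b, n - total_hits),
    ]:
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Combined counts for all terms, from the Total column of Table 2:
# populist 359,320 hits in 49,070,504 tokens; non-populist 524,195 in 77,215,381.
chi2 = chi2_two_corpora(359_320, 49_070_504, 524_195, 77_215_381)
print(round(chi2, 2))  # matches the Total chi-square in Table 2 (1230.5394)
```

The same function applied per term reproduces the per-term test statistics, from which the P-values (1 degree of freedom) follow.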
In turn, Term7 (2014–2018) could suggest a second wave of growing populist power in the face of the crisis of the non-populist parties. In Terms5–6, when references to the general public prevailed in what sociologists and political scientists refer to as the non-populist discourse, the Slovenian political space witnessed the emergence of numerous new political parties, many of which entered the parliament, which influenced the relation between populist and non-populist discourse. Due to the safeguards in parliamentary procedures which ensure equal opportunity of participation for opposition MPs regardless of their number, the speeches of MPs might also be influenced by the existence of populist- and non-populist-led governments and by the strength of the populist and non-populist parties in the parliament at the time. While party strength is usually measured by the number of seats taken in the parliament, there are many more factors that influence it and make the correlation between the number of seats, coalition and opposition roles, and party strength challenging (Sartori, 2005; Krašovec, 2000). 4. Discussion While the results do confirm our initial hypothesis that populist parties refer to the public more, the difference between the two blocs appears to be smaller than the current findings of studies in sociology and political science suggest. Where research from these two fields mainly focuses on the speech of members of populist parties in (selected) television interviews, on social media, and in other, less rigid environments, this contribution took into account all the speeches of MPs in the Slovenian parliament, which is a highly institutionalized and regulated environment that probably allows for less differentiation between MPs of different political orientations.
Our results show that the shared life-world of MPs, marked by their common experience, social forms, norms, and a shared dialogue in plenary sessions, provides an environment with a strong unifying factor. Although there is little doubt that political parties themselves decisively differ from one another, the power of the institution, its rigidity and specificity, as well as MPs' awareness of the target audience and reach of their speeches, proved to be decisive factors in MPs' speech when speaking about the public. According to political scientists and historians, the political space in Slovenia has been increasingly polarized since 1992. Again, our results show a somewhat more nuanced picture: while a growing difference between populist and non-populist discourse can be observed in Terms2–4, the gap narrows in Terms5–7. This challenges the dominant narrative about the Slovenian political space. The record-high frequency of references to the public by populist MPs in Term4 coincides with SDS winning the 2004 election for the first time after 1992, which happened immediately after the party went through its populist transformation in 2003. In Term5, SDS witnessed a backlash, with the non-populist coalition prevailing, while one of the populist parties, the NSi, did not even reach the parliamentary threshold. The general public as well as the media frequently refer to several of the more recent parties, such as Levica, as populist as well. While these parties do exhibit a certain populist appeal, their content, attitudes towards experts and state institutions, as well as their actions in the parliament place them in the non-populist spectrum, with Levica gravitating more towards democratic socialism (Toplišek, 2019) than towards populism as defined by Mudde (2005, 2007), which was the theoretical framework of this study.
Another methodological issue is temporality: the modern populist shift is a phenomenon belonging to the 21st century; thus, the decade after 1992, included in our analysis, requires a separate interpretation and can only be understood as a preface to the later populist shift (Fuentes, 2020). 5. Acknowledgments The work described in this paper was funded by the Slovenian Research Agency research programme P6-0436: Digital Humanities: resources, tools, and methods (2022–2027) and No. P6-0281: Political History, the CLARIN ERIC ParlaMint project (https://www.clarin.eu/parlamint) and the DARIAH-SI research infrastructure. 6. Bibliography Adéla Gjuričová and Tomáš Zahradníček. 2018. Návrat parlamentu. Česi a Slováci ve Federálním shromáždění. Argo. Adéla Gjuričová, Andreas Schulz, Luboš Velek, and Andreas Wirsching, eds. 2014. Lebenswelten von Abgeordneten in Europa 1860–1990. Droste Verlag. Alen Toplišek. 2019. Between populism and socialism: Slovenia’s Left party. In: Giorgos Katsambekis and Alexandros Kioupkiolis, eds. The Populist Radical Left in Europe. Routledge, Taylor & Francis Group. Alenka Krašovec. 2000. Moč v političnih strankah: odnosi med parlamentarnimi in centralnimi deli političnih strank. Fakulteta za družbene vede. Ana Frank and Iztok Šori. 2015. Normalizacija rasizma z jezikom demokracije: primer Slovenske demokratske stranke. Časopis za kritiko znanosti, 43(260):89–103. Andreas Schulz and Andreas Wirsching, eds. 2012. Parlamentarische Kulturen in Europa. Das Parlament als Kommunikationsraum. Droste Verlag. Benjamin Moffitt. 2015. How to Perform Crisis: A Model for Understanding the Key Role of Crisis in Contemporary Populism. Government and Opposition, 50(2):189–217. Cas Mudde, ed. 2005. Racist Extremism in Central and Eastern Europe. Routledge. Cas Mudde. 2007.
Populist radical right parties in Europe. Cambridge University Press. Danica Fink Hafner. 2019. Populizem. Fakulteta za družbene vede, Založba FDV. Edmund Husserl. 1962. Die Krisis der europäischen Wissenschaften und die transzendentale Phänomenologie: eine Einleitung und die phänomenologische Philosophie. M. Nijhoff. Emanuela Fabijan and Marko Ribać. 2021. Politični in medijski populizem v televizijskem političnem intervjuju. Social Science Forum, 37(98):43-68. Giovanna Campani and Mojca Pajnik. 2017. Populism in historical perspectives. In: Gabriella Lazaridis and Giovanna Campani, eds. Understanding the populist shift: othering in a Europe in crisis, pages 13–30. Routledge, Taylor & Francis Group. Giovanni Sartori. 2005. Parties and party systems: a framework for analysis. ECPR. Iztok Šori. 2015. Za narodov blagor: skrajno desni populizem v diskurzu stranke Nova Slovenija. Časopis za kritiko znanosti, 43(260):104–117. Juan Francisco Fuentes. 2020. Populism. Contributions to the History of Concepts, 15(1):47–68. Jure Gašparič. 2012. Državni zbor 1992–2012: o slovenskem parlamentarizmu. Inštitut za novejšo zgodovino. Jürgen Habermas. 2007. The Theory of Communicative Action. Vol. 2, Lifeworld and system: a critique of functionalist reason. Polity Press. Jurij Hadalin. 2020. Straight Talk. The Slovenian National Party's Programme Orientations and Activities. Contributions to Contemporary History, 60(2). https://doi.org/10.51663/pnz.60.2.10. Jurij Hadalin. 2021. What Would Henrik Tuma Say? From The Social Democratic Party of Slovenia to the Slovenian Democratic Party. Contributions to Contemporary History, 61(3). https://doi.org/10.51663/pnz.61.3.10. Marko Lovec, ed. 2019. Populism and attitudes towards the EU in Central Europe. Ljubljana: Faculty of Social Sciences. Mojca Pajnik. 2019. Media Populism on the Example of Right-Wing Political Parties’ Communication in Slovenia. Problems of Post-Communism, 66(1):21–32. 
Andrej Pančur, Tomaž Erjavec, Mihael Ojsteršek, Mojca Šorn, and Neja Blaj Hribar. 2020. Slovenian parliamentary corpus (1990–2018) siParl 2.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1300. Pasi Ihalainen, Cornelia Ilie, and Kari Palonen, eds. 2016. Parliament and Parliamentarism. A Comparative History of a European Concept. Berghahn. Remieg Aerts, ed. 2019. The ideal of parliament in Europe since 1800. Palgrave Macmillan. Uporaba postopkov strojnega učenja pri samodejni slovenski grafemsko-fonemski pretvorbi Janez Križaj*, Simon Dobrišek*, Aleš Mihelič†, Jerneja Žganec Gros† *Laboratorij za strojno inteligenco, Fakulteta za elektrotehniko, Univerza v Ljubljani, Tržaška cesta 25, 1000 Ljubljana, Slovenija janez.krizaj@fe.uni-lj.si, simon.dobrisek@fe.uni-lj.si †Alpineon razvoj in raziskave, d. o. o., Ulica Iga Grudna 15, 1000 Ljubljana, Slovenija jerneja.gros@alpineon.si, ales.mihelic@alpineon.si 1 Uvod Grafemsko-fonemska pretvorba se nanaša na pretvarjanje izvirno črkovno zapisanih besed danega jezika v njihove fonemske zapise oziroma predstavitve. Nabor osnovnih grafemskih enot, ki se jih razume kot osnovne enote pisave in se jih upošteva pri črkovnih zapisih besed, navadno določa pravopis danega jezika, in enako velja tudi za slovenski jezik (SAZU, 1990). Osnovnim grafemskim enotam pravimo tudi grafemi, njihovim vidno zaznavnim različnim pisnim simbolnim predstavitvam, kot so velike in male črke, pa pravimo alografi. Nabor fonemov je na drugi strani določen predvsem na osnovi glasoslovnega pomensko razločevalnega slušnega kriterija.
Grafemi in fonemi so kot osnovne enote do določene mere sicer povezani, a se pri pretvorbi grafemov v foneme lahko tudi več zaporednih črk v zapisani besedi preslika v posamezne foneme. Pretvarjanje grafemskih zapisov besed v njihove fonemske zapise tudi ne temelji samo na nekem manjšem številu osnovnih pravil in pri slovenskem govorjenem jeziku obstaja veliko izjem, ki se ne podrejajo osnovnim pravilom (Toporišič, 2000). Pri razvoju jezikovnih tehnologij se postopki samodejnega računalniškega pretvarjanja grafemskih zapisov besed v njihove fonemske zapise uporabljajo tako pri izgradnji samodejnih razpoznavalnikov govora kot tudi pri sistemih za tvorjenje umetnega govora (Žganec Gros et al., 2016). V okviru razvojnega in raziskovalnega projekta Razvoj slovenščine v digitalnem okolju (RSDO, 2020) smo izvedli in ovrednotili več različnih uveljavljenih postopkov samodejne grafemsko-fonemske pretvorbe, ki so bili uporabljeni za tovrstno pretvarjanje zapisov slovenskih besed. Preizkusili in ovrednotili smo tri izbrane postopke samodejne grafemsko-fonemske pretvorbe, ki so se uveljavili v zadnjih nekaj letih in so na kratko opisani v nadaljevanju. Za preizkus in ovrednotenje izbranih postopkov smo uporabili množico besed iz slovenskega leksikona Sloleks 2.0 (Dobrovoljc et al., 2019). Množico besed smo na različne načine razdelili na učno in testno množico, ki smo ju nato uporabili za strojno učenje in preizkus izbranih samodejnih grafemsko-fonemskih pretvornikov. 2 Obravnavani postopki V literaturi je predstavljenih mnogo različnih postopkov za samodejno grafemsko-fonemsko pretvorbo zapisov besed. Starejši postopki praviloma izvajajo pretvorbo na podlagi predhodno definiranih slovničnih pravil (Black et al., 1998). Pomanjkljivost teh postopkov je predvsem v dolgotrajnem ročnem oblikovanju pravil, ki zahtevajo znanje s področja jezikoslovja in glasoslovja in morajo vključevati tudi seznam izjem z različnimi posebnostmi pri izgovorjavah besed. 
Pri kasneje predlaganih postopkih se je uveljavila pretvorba z modeli skupnih zaporedij (Bisani in Ney, 2008), ki s poravnavo grafemskega zaporedja s fonemskim zaporedjem tvorijo posebne skupne enote, imenovane grafoni. Za modeliranje grafonskih zaporedij nato uporabljajo jezikovne modele n-gramov, udejanjene v obliki uteženega končnega pretvornika (angl. weighted finite-state transducer), ki omogočajo predvidevanja grafemsko-fonemske pretvorbe za besede, ki niso bile del učne množice. Avtorji Novak et al. (2015) so razvoj grafemsko-fonemskega pretvornika osnovali na modelih uteženih končnih pretvornikov in predlagali postopek grafemsko-fonemske pretvorbe, ki temelji na prilagojeni metodi maksimizacije upanja za poravnavo niza grafemov z nizom fonemov in več dekodirnih postopkov, med njimi tudi jezikovni model, ki temelji na modelih rekurenčnih nevronskih omrežij (angl. recurrent neural networks). Yolchuyeva et al. (2019) so dosegli visoko uspešnost grafemsko-fonemske pretvorbe z uporabo globokega modela, ki je poznan pod imenom transformer. Ti modeli imajo zgradbo vrste kodirnik-dekodirnik z dodanim mehanizmom pozornosti, ki pomaga pri strojnem učenju soodvisnosti med učnimi pari nizov grafemov in fonemov, kar se odraža tako v hitrejšem strojnem učenju kot tudi pri bolj zanesljivi pretvorbi preizkusnih nizov grafemov v ustrezne nize fonemov. 3 Kvantitativno ovrednotenje Pri kvantitativnem ovrednotenju obravnavanih postopkov grafemsko-fonemskih pretvorb smo uporabili njihove izvedbe v prosto dostopnih računalniških programskih knjižnicah. Postopek, predlagan v (Bisani in Ney, 2008), smo udejanjili s programskim orodjem Sequitur1, postopek avtorjev Novak et al. (2015) je implementiran z orodjem Phonetisaurus2, za evalvacijo metode avtorjev Yolchuyeva et al.
(2019) pa smo uporabili programsko orodje Deep Phonemizer3. Pri tvorjenju in preizkušanju vseh obravnavanih modelov ter izvajanju postopkov njihovega strojnega učenja smo uporabili ročno validirani del slovenskega leksikona Sloleks 2.0 (Dobrovoljc et al., 2019), ki poleg posameznih besed vsebuje tudi informacijo o njihovih osnovnih besednih oblikah oziroma lemah ter tudi njihove fonemske oziroma fonetične prepise. Validirani del leksikona Sloleks 2.0, ki smo ga uporabili za naše eksperimente, tako vsebuje 646.994 posameznih besed oziroma 62.729 besednih lem. Pri preizkušanju smo opazili, da so rezultati precej odvisni od tega, kako se množico razpoložljivih grafemsko-fonemsko pretvorjenih besed razdeli na učni in testni del. Pri preizkusih smo zato izvedli dve različni razdelitvi množice vseh besed v učno množico, ki je vsebovala 90 % besed iz slovarja, in testno množico, ki je vsebovala preostalih 10 % besed. Pri naključni razdelitvi, v nadaljevanju označeni z oznako “RandomSplit”, smo razdelitev izvedli povsem naključno z uporabo sistemskega naključnega generatorja. Pri razdelitvi, ki je temeljila na razvrščanju besed v učno oziroma testno množico glede na njihove leme, pa smo poskrbeli, da se v testni množici ne pojavljajo besede, ki se od besed v učni množici razlikujejo le po končnicah. To namreč pogosto velja za besede z istimi lemami. Poleg tega smo poskrbeli, da se leme besed v testni množici razlikujejo za vsaj tri črke glede na njim najbolj podobne leme v učni množici besed. Ta razdelitev je v nadaljevanju označena z oznako “LemmaSplit”. Pri izvajanju poskusov smo ugotovili, da je rezultat po pričakovanjih tudi precej odvisen od upoštevanega nabora fonemskih enot pri grafemsko-fonemskih pretvorbah. Pri gradnji samodejnih razpoznavalnikov govora se tako navadno ne ločuje med dolgimi in kratkimi samoglasniki oziroma med naglašenimi in nenaglašenimi samoglasniki.
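Opisani razdelitvi lahko ponazorimo z ilustrativno skico (to ni izvirna koda projekta; imeni funkcij sta izmišljeni, dodatni pogoj o razliki vsaj treh črk med lemami je izpuščen):

```python
import random

def random_split(words, test_ratio=0.1, seed=0):
    """RandomSplit: povsem naključna delitev besed v razmerju 90/10."""
    words = list(words)
    random.Random(seed).shuffle(words)
    n_test = int(len(words) * test_ratio)
    return words[n_test:], words[:n_test]          # (učna, testna)

def lemma_split(entries, test_ratio=0.1, seed=0):
    """LemmaSplit: vse oblike iste leme pristanejo v isti množici,
    zato se testne besede od učnih ne razlikujejo le po končnicah."""
    by_lemma = {}
    for word, lemma in entries:                    # entries: pari (beseda, lema)
        by_lemma.setdefault(lemma, []).append(word)
    lemmas = sorted(by_lemma)
    random.Random(seed).shuffle(lemmas)
    n_test = int(len(lemmas) * test_ratio)
    test_lemmas = set(lemmas[:n_test])
    train = [w for l in lemmas[n_test:] for w in by_lemma[l]]
    test = [w for l in test_lemmas for w in by_lemma[l]]
    return train, test
```

Skica pokaže bistveno razliko: pri RandomSplit se delijo posamezne besede, pri LemmaSplit pa cele leme z vsemi pregibnimi oblikami.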
To ločevanje pri razpoznavalnikih govora namreč ni pomembno po pomensko razločevalnem kriteriju določanja fonemskih enot. To ločevanje pa je pomembno pri gradnji sistemov za tvorjenje umetnega govora, kjer so prozodične značilnosti umetnega govora odvisne od informacije o naglašenih in nenaglašenih samoglasnikih v besedah. V skladu s temi predpostavkami smo učno in testno množico dodatno razdelili na dva različna načina, glede na to, katere osnovne fonemske enote se je upoštevalo. V nadaljevanju tako oznaka ASR označuje razdelitev, ki je bila primerna za samodejne razpoznavalnike govora in temelji na upoštevanju samo 34 osnovnih fonemskih enot oziroma fonemskih različic. Oznaka TTS pa označuje razdelitev, ki je primerna za sisteme za samodejno tvorjenje umetnega govora in temelji na upoštevanju 39 osnovnih fonemskih enot. Povečanje števila fonemskih enot je posledica upoštevanja ločevanja med dolgimi in kratkimi oziroma naglašenimi in nenaglašenimi samoglasniki. V nadaljevanju predstavljeni rezultati so potrdili predvidevanja, da je pri slovenskem jeziku najteže samodejno napovedovati naglasno mesto v besedah oziroma naglašene samoglasnike. Pri naglaševanju slovenskih besed je namreč zelo veliko izjem, ki se ne podrejajo nekemu bolj splošnemu manjšemu naboru osnovnih pravil naglaševanja besed. Rezultati uspešnosti samodejnih grafemsko-fonemskih pretvorb so v nadaljevanju podani v obliki odstotnega deleža napačno pretvorjenih besed (angl. word error rate, WER) in deleža napačno pretvorjenih fonemskih enot (angl. phoneme error rate, PER). Kot je razvidno iz tabele, so se glede na različne delitve množice besed in upoštevanja ločevanja med naglašenimi in nenaglašenimi samoglasniki pri rezultatih dejansko potrdila predvidevanja.
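Meri WER in PER lahko ponazorimo z ilustrativno skico (ne gre za kodo iz prispevka): WER šteje besede, katerih fonemski prepis ni v celoti pravilen, PER pa napačne fonemske enote prek Levenshteinove razdalje.

```python
def edit_distance(ref, hyp):
    """Najmanjše število vstavljanj, brisanj in zamenjav."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # brisanje
                            curr[j - 1] + 1,          # vstavljanje
                            prev[j - 1] + (r != h)))  # zamenjava
        prev = curr
    return prev[-1]

def wer(pairs):
    """pairs: seznam parov (referenčni, napovedani fonemski prepis besede).
    Beseda šteje za napačno, če prepisa nista povsem enaka."""
    return sum(r != h for r, h in pairs) / len(pairs)

def per(pairs):
    """Delež napačnih fonemskih enot prek razdalje urejanja."""
    total_err = sum(edit_distance(r, h) for r, h in pairs)
    total_len = sum(len(r) for r, _ in pairs)
    return total_err / total_len
```

Skica tudi pokaže, zakaj je WER vedno vsaj tolikšen kot PER: ena sama napačna fonemska enota že pokvari celo besedo.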
Pri naključni razdelitvi so tako rezultati bistveno boljši kot pri razdelitvi po lemah, saj se pri naključni razdelitvi v testni množici lahko pojavljajo besede, ki se od najbolj podobnih besed v učni množici razlikujejo samo po končnici ali predponi. Rezultati pri večjem naboru osnovnih fonemskih enot, ki vključuje ločevanje med dolgimi in kratkimi samoglasniki (oznaka TTS), pa so prav tako po pričakovanjih precej slabši kot pri manjšem naboru, ki tega ločevanja ne upošteva (oznaka ASR). To potrjuje druge že obstoječe ugotovitve, da je pri slovenskem jeziku dejansko težko samodejno napovedovati naglasno mesto v besedah (Žganec Gros et al., 2016).

1 https://github.com/sequitur-g2p/sequitur-g2p
2 https://github.com/AdolfVonKleist/Phonetisaurus
3 https://github.com/as-ideas/DeepPhonemizer

Orodje                                     Slovar           WER [%]  PER [%]
Sequitur (Bisani in Ney, 2008)             ASR_RandomSplit     16,5      1,9
                                           ASR_LemmaSplit      25,4      2,9
                                           TTS_RandomSplit     17,3      2,2
                                           TTS_LemmaSplit      50,2      7,4
Phonetisaurus (Novak et al., 2015)         ASR_RandomSplit      1,0      0,1
                                           ASR_LemmaSplit      14,1      1,6
                                           TTS_RandomSplit      2,0      0,3
                                           TTS_LemmaSplit      29,1      4,1
Deep Phonemizer (Yolchuyeva et al., 2019)  ASR_RandomSplit      1,1      0,1
                                           ASR_LemmaSplit       8,6      0,9
                                           TTS_RandomSplit      1,7      0,3
                                           TTS_LemmaSplit      16,1      2,6

Tabela 1: Uspešnost grafemsko-fonemske pretvorbe obravnavanih postopkov.

4 Zaključek V prispevku so predstavljeni rezultati izvedb in preizkusov različnih samodejnih grafemsko-fonemskih pretvornikov za slovenski jezik. Glede na ugotovitve lahko uporabniki tovrstnih pretvornikov za izgradnjo samodejnih razpoznavalnikov govora pričakujejo približno 91% pravilno pretvorbo besed, ki niso vključene v obstoječe slovenske leksikone.
Pri izgradnji sistemov za tvorjenje umetnega govora, pri katerih je pomembno pravilno določanje naglasnega mesta, pa lahko pričakujejo samo približno 84% pravilno pretvorbo. Zahvala Predstavljeno delo je bilo delno financirano s strani Ministrstva za kulturo in Evropskega sklada za regionalni razvoj v okviru projekta RSDO (Razvoj slovenščine v digitalnem okolju), s strani Javne agencije za raziskovalno dejavnost Republike Slovenije v okviru aplikativnega raziskovalnega projekta L7-9406 OptiLEX in s strani ARRS v okviru raziskovalnega programa Metrologija in biometrični sistemi (P2-0250). Literatura Maximilian Bisani in Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451. Alan W. Black, Kevin Lenzo in Vincent Pagel. 1998. Issues in Building General Letter to Sound Rules. V: Zbornik 3rd ESCA Workshop on Speech Synthesis, str. 77–80. Kaja Dobrovoljc, Simon Krek, Peter Holozan, Tomaž Erjavec, Miro Romih, Špela Arhar Holdt, Jaka Čibej, Luka Krsnik in Marko Robnik-Šikonja. 2019. Morphological lexicon Sloleks 2.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1230. Josef R. Novak, Nobuaki Minematsu in Keikichi Hirose. 2015. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 22(6):907–938. RSDO - Razvoj slovenščine v digitalnem okolju. 2020. https://www.slovenscina.eu/. SAZU - Slovenska akademija znanosti in umetnosti. 1990. Slovenski pravopis 1: Pravila. Državna založba Slovenije, Ljubljana. Jože Toporišič. 2000. Slovenska slovnica. Založba Obzorja, Maribor. Sevinj Yolchuyeva, Géza Németh in Bálint Gyires-Tóth. 2019. Transformer Based Grapheme-to-Phoneme Conversion. V: Zbornik konf. Interspeech 2019, str.
2095–2099, Gradec, Avstrija. Jerneja Žganec Gros, Boštjan Vesnicer, Simon Rozman, Peter Holozan in Tomaž Šef. 2016. Sintetizator govora za slovenščino eBralec. V: Zbornik konf. Jezikovne tehnologije in digitalna humanistika, str. 180–185, Ljubljana, Slovenija. Poravnava zvočnih posnetkov s transkripcijami narečnega govora in petja Matija Marolt, Mark Žakelj, Alenka Kavčič, Matevž Pesek Fakulteta za računalništvo in informatiko, Univerza v Ljubljani Večna pot 113, 1000 Ljubljana matija.marolt@fri.uni-lj.si 1 Uvod V povzetku predstavljamo sistem za poravnavo zvočnih posnetkov slovenskega govora s pripadajočimi transkripcijami na nivoju besed. Pri razvoju sistema nas je še posebej zanimala njegova uporabnost pri poravnavi narečnega govora in petja, saj avtomatska razpoznava govora v tovrstnih posnetkih deluje nezanesljivo, z veliko napakami. Natančna avtomatska poravnava posnetkov in transkripcij nam tako lahko pomaga pri analizi narečnih korpusov in pripravi novih anotiranih podatkov za učenje razpoznavalnikov. V povzetku predstavimo sistem za poravnavo in primerjamo kvaliteto poravnave nenarečnih in narečnih govorcev. Analiziramo tudi kvaliteto poravnave narečnega petja z uporabo sistema, ki je učen zgolj na govoru. Ker se petje lahko zelo razlikuje od govora (dodatna spremljava, večglasno petje, dolgi toni, ...), se v nalogi osredotočimo zgolj na enoglasno petje brez spremljave, ki je še najbolj podobno govoru.
2 Sistem za poravnavo Sistem za poravnavo posnetkov in transkripcij je sestavljen iz treh glavnih komponent: • segmentacija posnetka, s čimer razdelimo celoten posnetek na več krajših delov, hkrati pa odstranimo šum in tišino; • razpoznava govora, s čimer iz avdio signala pridobimo približno tekstovno transkripcijo; • poravnava, s čimer vsaki besedi v originalnem besedilu določimo mesto v pridobljeni transkripciji in s tem tudi čas pojavitve. 2.1 Segmentacija posnetka Segmentacija je osnovana na Googlovem algoritmu WebRTC VAD1, ki je hiter, robusten in v praksi pogosto uporabljen. S tem algoritmom lahko klasificiramo posamezen časovni okvir kot govor ali ozadje. Algoritem robustne segmentacije je povzet po izvorni kodi, uporabljeni v sistemu DeepSpeech (Hilleman et al., 2018). WebRTC VAD ima nastavljiv parameter aggressiveness, ki lahko zasede vrednosti med 0 in 3. Parameter smo nastavili na vrednost 2, tako smo dobili dovolj kratke segmente, da proces dekodiranja pri razpoznavi govora ni trajal predolgo. 2.2 Razpoznava govora Razpoznava govora je implementirana v dveh delih: 1) uporaba globokega akustičnega modela za pridobitev verjetnosti posameznih znakov za vsak časovni okvir in 2) dekodiranje izhoda modela za pridobitev končne transkripcije. Podatki za učenje akustičnega modela so bili pridobljeni iz različnih virov: Gos (Zwitter et al., 2013), Gos VideoLectures (Videolectures, 2019), CommonVoice2, SiTEDx (Žgank et al., 2016), Sofes (Dobrišek et al., 2017) in narečni govor s portala narecja.si3. Akustični model je implementiran z uporabo ogrodja Nvidia NeMo, uporabili smo globoki model QuartzNet_15x5 (Kriman et al., 2019). Uporabili smo ga, ker lahko z njim kljub relativno majhnemu številu parametrov (18,9 milijona) še vedno dobimo dokaj dobro natančnost razpoznave, primerljivo z večjimi modeli (več kot 100 milijonov parametrov).
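Združevanje okvirjev, ki jih VAD iz razdelka 2.1 označi kot govor, v strnjene segmente lahko skiciramo takole (ilustrativna skica, ne izvirna koda; odločitve VAD so tu podane kar kot seznam logičnih vrednosti):

```python
def speech_segments(is_speech, frame_ms=30):
    """Vrne seznam (začetek_ms, konec_ms) za vsak strnjen niz govornih okvirjev."""
    segments, start = [], None
    for i, speech in enumerate(is_speech):
        if speech and start is None:
            start = i                                  # začetek govornega niza
        elif not speech and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:                              # govor do konca posnetka
        segments.append((start * frame_ms, len(is_speech) * frame_ms))
    return segments
```

V praksi bi odločitve po okvirjih vrnil klasifikator WebRTC VAD, tu pa skica pokaže le logiko razreza na segmente.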
Primerjali smo dva modela: QuartzNet_15x5, učen zgolj na slovenskih podatkih, in QuartzNet_15x5, predučen na angleških podatkih, nato pa dodatno učen še na slovenskih podatkih. S slednjim modelom smo preverili kvaliteto prenosa znanja iz tujega jezika v slovenščino. Za pridobitev transkripcij smo primerjali tri različne metode dekodiranja CTC: 1) požrešna metoda največjih verjetnosti (greedy), kjer za vsak časovni korak v CTC izberemo najbolj verjeten znak, nato združimo sosednje ponovitve; 2) iskanje v snopu z besednim jezikovnim modelom (word) in 3) iskanje v snopu z znakovnim jezikovnim modelom (char). Za jezikovni model smo uporabili n-gramski jezikovni model KenLM (Heafield, 2011). Ker se model uporablja zgolj med dekodiranjem CTC za posamezen primer poravnave, smo za gradnjo modela uporabili kar originalno besedilo posameznega primera. Tako dobimo model, ki ni posplošen za slovenski jezik, temveč je prilagojen posamezni poravnavi. Testi so pokazali, da red jezikovnega modela ne vpliva bistveno na rezultat, na koncu smo uporabili model četrtega reda.

1 WebRTC Google repository: https://chromium.googlesource.com/external/webrtc/+/branch-heads/43/webrtc/common_audio/vad
2 Mozilla Common Voice: https://commonvoice.mozilla.org/sl/datasets
3 https://narecja.si/

2.3 Poravnava in iterativno združevanje S pomočjo razpoznavalnika govora iz posnetka pridobimo približno transkripcijo govora. Le-to moramo v zadnjem koraku poravnati z originalnim besedilom posnetka. Za osnovno poravnavo uporabimo algoritem, povzet po orodju DeepSpeech. Izkaže se, da z uporabo tega algoritma ne zagotovimo poravnave vseh besed originalnega besedila. Krajše besede pogosto nimajo dovolj konteksta ali pa so slabo transkribirane. Da zagotovimo poravnavo vseh besed, smo razvili algoritem iterativnega združevanja besed.
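Požrešno dekodiranje CTC iz razdelka 2.2 lahko ponazorimo z ilustrativno skico (ne gre za kodo iz ogrodja NeMo; abeceda in verjetnosti so izmišljene):

```python
from itertools import groupby

BLANK = "_"  # prazni simbol CTC

def ctc_greedy_decode(frame_probs, alphabet):
    """frame_probs: porazdelitev verjetnosti po znakih za vsak časovni okvir;
    alphabet: znaki v istem vrstnem redu, z BLANK na svojem mestu."""
    # 1) za vsak okvir izberemo najverjetnejši znak
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # 2) združimo sosednje ponovitve
    collapsed = [ch for ch, _ in groupby(best)]
    # 3) odstranimo prazni simbol
    return "".join(ch for ch in collapsed if ch != BLANK)
```

Skica pokaže, zakaj je metoda greedy hitra: ne potrebuje niti iskanja v snopu niti jezikovnega modela.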
Glavna ideja algoritma je naslednja: besede, ki niso poravnane, združimo s sosednjo besedo v besedilu (odstranimo presledek in tvorimo enoten niz znakov). Osnovni algoritem poravnave ponovno poženemo, tokrat z modificiranim seznamom besed. Ta dva koraka ponavljamo, dokler niso vse besede (oziroma skupki besed) poravnani, nato lahko vsaki besedi originalnega besedila pripišemo začetni in končni čas glede na približno transkripcijo. 3 Evalvacija Natančnost sistema smo ovrednotili na testni množici s primerjavo z ročno izdelanimi poravnavami. Za oceno kvalitete poravnave uporabljamo tri mere: povprečje (MAE) in standardni odklon (STD) absolutnih napak začetnih časov besed ter delež absolutnih napak, manjših od 0,5 sekunde (< 0,5s). 3.1 Testna množica Testno množico sestavlja 26 primerov: 7 primerov nenarečnega govora, 13 primerov narečnega govora in 6 primerov narečnega enoglasnega petja brez spremljave. Najkrajši posnetek je dolg 21 sekund, najdaljši 219, povprečna dolžina posnetkov je 89 sekund. Primeri so pridobljeni iz naslednjih virov: Slovenske ljudske pesmi V (Kaučič et al., 2007), portal narecja.si, terenski posnetki GNI ZRC SAZU. Pravilne poravnave so bile narejene ročno z orodjem Praat.

Tip posnetka      Število besed  Dolžina (min)
narečni govor              2428           18,7
nenarečni govor            1394           11,0
narečno petje               508            8,7
skupaj                     4330           38,4

Tabela 1: Testna množica.

3.2 Primerjava modelov in metod dekodiranja Primerjali smo osnovni akustični model (base), ki je grajen zgolj na slovenskih podatkih, ter model, ki je učen na angleških podatkih, nato pa doučen na slovenskih (transfer). Ob tem smo primerjali tri metode dekodiranja: požrešna metoda (greedy), iskanje v snopu z jezikovnim modelom na nivoju znakov (char), iskanje v snopu z jezikovnim modelom na nivoju besed (word). Primerjavo smo opravili za vsak tip testnih podatkov posebej. Rezultati so podani v Tabeli 2.
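Tri opisane mere lahko izračunamo z ilustrativno skico (ne gre za evalvacijsko kodo prispevka; časi v primeru so izmišljeni):

```python
from statistics import mean, pstdev

def alignment_metrics(ref_starts, hyp_starts, threshold=0.5):
    """ref_starts: ročno določeni začetni časi besed (v sekundah),
    hyp_starts: samodejno napovedani časi istih besed."""
    errors = [abs(r - h) for r, h in zip(ref_starts, hyp_starts)]
    return {
        "MAE": mean(errors),                                   # povprečna absolutna napaka
        "STD": pstdev(errors),                                 # standardni odklon napak
        "<0.5s": sum(e < threshold for e in errors) / len(errors),  # delež napak pod pragom
    }

# Izmišljen primer s štirimi besedami
m = alignment_metrics([0.0, 1.2, 3.5, 6.0], [0.1, 1.1, 3.9, 6.0])
```

Skica ustreza opisu mer v razdelku 3: vse tri se računajo nad absolutnimi napakami začetnih časov besed.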
Iz tabele je razvidno, da pri nenarečnem govoru ne glede na metodo uporaba modela transfer prinese manjšo povprečno napako. Razlika je sicer majhna (0,06 do 0,07 sekunde), vendar je približno enaka za različne metode. Pri uporabi požrešne metode ima transfer sicer večji standardni odklon in manjši delež napak pod 0,5s, vendar je razlika minimalna. Različne metode dajejo zelo podobne rezultate. Kombinacija modela transfer in metode word da najboljši rezultat s povprečno napako 0,12s, standardnim odklonom 0,10s in 99,4% deležem napak pod 0,5s. Tudi v primeru narečnega govora uporaba modela transfer izboljša rezultate. Razlika v povprečnih napakah je majhna (0,04 do 0,09 sekunde), vendar je med akustičnima modeloma opazna razlika tudi v standardnem odklonu in deležu napak, manjših od 0,5s. Z uporabo modela transfer so rezultati za različne metode poravnave zelo podobni, pri čemer se metoda word izkaže za najbolj robustno, saj ima najmanjšo napako in standardni odklon pri obeh modelih. Pri modelu transfer ima metoda greedy sicer nekoliko večji delež napak pod 0,5s, vendar je razlika majhna (0,4%). Kombinacija modela transfer in metode word da najboljši rezultat s povprečno napako 0,14s, standardnim odklonom 0,24s in 97,3% deležem napak pod 0,5s. V primerjavi z najboljšim rezultatom nenarečnega govora se povprečna napaka poveča za 0,02s, standardna deviacija za 0,13s, delež napak pod 0,5s se zmanjša za 2,1%. Razlika ni velika in je približno podobna za ostale kombinacije metod in modelov.
Tip testnih podatkov  Metoda  Model     MAE   STD   < 0,5s
Nenarečni govor       greedy  base      0,20  0,13  99,1%
                              transfer  0,14  0,15  98,5%
                      char    base      0,21  0,09  99,0%
                              transfer  0,14  0,10  98,9%
                      word    base      0,19  0,10  98,6%
                              transfer  0,12  0,11  99,4%
Narečni govor         greedy  base      0,22  0,39  94,9%
                              transfer  0,15  0,27  97,7%
                      char    base      0,21  0,32  95,7%
                              transfer  0,15  0,28  97,1%
                      word    base      0,18  0,28  97,2%
                              transfer  0,14  0,24  97,3%
Narečno petje         greedy  base      0,59  0,82  70,2%
                              transfer  1,28  2,49  63,9%
                      char    base      0,82  1,66  66,7%
                              transfer  0,44  0,41  73,4%
                      word    base      0,48  0,58  73,4%
                              transfer  0,37  0,30  79,9%

Tabela 2: Rezultati.

Pri narečnem petju je napaka poravnave opazno večja. Pri metodah word in char akustični model transfer deluje bolje. Z metodo char je povprečna napaka prepolovljena, standardni odklon je štirikrat manjši, delež napak pod 0,5s se izboljša za 6,7%. Z metodo word je povprečna napaka za 0,11s manjša, standardni odklon za 0,28s, delež napak pod 0,5s se izboljša za 6,5%. Pri metodi greedy je boljši model base, kar je edini tak primer v rezultatih. Rezultati različnih metod dekodiranja med seboj niso podobni. Pri obeh modelih metoda word bistveno izboljša rezultat. Kombinacija modela transfer in metode word da najboljši rezultat s povprečno napako 0,37s, standardnim odklonom 0,30s in 79,9% deležem napak pod 0,5s. V primerjavi z najboljšim rezultatom nenarečnega govora se povprečna absolutna napaka poveča za 0,25s, standardna deviacija za 0,19s in delež napak pod 0,5s se zmanjša za 19,5%. Razlika je velika in je vidna tudi pri ostalih kombinacijah metod in modelov. Povprečna absolutna napaka se poveča za faktor vsaj 2,5, standardni odklon za faktor vsaj 2,7 in delež napak pod 0,5s se zmanjša za vsaj 19,5%. 3.3 Ugotovitve Kvaliteta poravnav na nenarečnem govoru se izkaže za dobro in je primerljiva s podobno delujočimi sistemi, npr. (Malfrère et al., 2003). Tudi pri narečnem govoru je kvaliteta poravnav dobra.
Napaka je nekoliko večja kot pri nenarečnem govoru, kar je pričakovano, saj je večina učnih podatkov za akustični model nenarečnih. V splošnem ocenjujemo, da sistem dobro deluje na slovenskem govoru in je zato uporaben za večino aplikacij. Vredno je omeniti, da v primeru kratkih posnetkov in popolnih transkripcij za učenje akustičnih modelov obstajajo potencialno boljše tehnike poravnave (Brognaux in Drugman, 2015). Kvaliteta poravnav enoglasnega petja brez spremljave je v primerjavi z govorom opazno slabša, kar smo tudi pričakovali, saj je v splošnem poravnava petja in besedila težji problem. V primerjavi z nenarečnim govorom je povprečna napaka približno trikrat večja in veliko več je napak večjih od pol sekunde. Povprečna napaka je sicer primerljiva s podobno delujočim sistemom za poravnavo petja (Stoller et al., 2019), vendar naši testni podatki ne vključujejo večglasnega petja ali petja s spremljavo, zato ta primerjava ne pove veliko. Domnevamo, da bi se kvaliteta poravnav bistveno izboljšala, če bi učna množica akustičnega modela vsebovala petje. V veliki večini primerov se akustični model transfer izkaže bolje od modela base. Edini obraten primer je v primeru petja in metode greedy, kjer model base doseže boljši rezultat, vendar ker ta kombinacija metode in modela ne da najboljšega rezultata pri petju, ni bistvena za oceno kakovosti. Na podlagi rezultatov potrjujemo domnevo, da prenos znanja z modelom transfer pozitivno vpliva na kvaliteto poravnave tako pri govoru kot pri petju. Čeprav je v primeru govora najboljša metoda za dekodiranje word, ostali dve metodi nimata bistveno večjih napak. V primeru nenarečnega govora z modelom transfer je povprečna napaka z metodo word manjša za 0,02s, v primeru narečnega govora pa za 0,01s.
V aplikacijah, kjer zelo natančna poravnava govora ni ključna, je pa pomemben čas računanja, je bolj smiselno uporabiti metodo greedy, saj le-ta ne zahteva iskanja v snopu ter uporabe jezikovnega modela in je zato bistveno hitrejša. Pri petju da metoda greedy bistveno slabše rezultate od metode word, zato je smiselno uporabiti slednjo. Zahvala Raziskave, opisane v prispevku, so bile opravljene v okviru temeljnega raziskovalnega projekta »Misliti folkloro: folkloristične, etnološke in računske perspektive in pristopi k narečju« (J7-9426, 2018–2022), programske skupine »Digitalna humanistika: viri, orodja in metode« (P6-0436, 2022–2027), oba financira ARRS, in raziskovalne infrastrukture DARIAH-SI. Literatura Sandrine Brognaux in Thomas Drugman. HMM-based speech segmentation: Improvements of fully automatic approaches. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24:1–1, 01 2015. Simon Dobrišek, Jerneja Žganec Gros, Janez Žibert, France Mihelič in Nikola Pavešić. Speech database of spoken flight information enquiries SOFES 1.0, 2017. Slovenian language resource repository CLARIN.SI. Kenneth Heafield. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, July 2011. Association for Computational Linguistics. Ryan Hilleman, Tilman Kamp in Tobias Bjornsson. DSAlign. https://github.com/mozilla/DSAlign, 2018. Marjetka Golež Kaučič, Marija Klobčar, Zmaga Kumer, Urša Šivic in Marko Terseglav. Slovenske ljudske pesmi V. 2007. Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li in Yang Zhang. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions, 2019. F. Malfrère, O. Deroo, T. Dutoit in C. Ris. Phonetic alignment: speech synthesis-based vs. viterbi-based. Speech Communication, 40(4):503–515, 2003. Daniel Stoller, Simon Durand in Sebastian Ewert.
End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model, 2019.
VideoLectures.NET. Spoken corpus Gos VideoLectures 4.0 (audio), 2019. Slovenian language resource repository CLARIN.SI.
Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Simon Krek, Marko Stabej, and Tomaž Erjavec. Spoken corpus Gos 1.0, 2013. Slovenian language resource repository CLARIN.SI.
Andrej Žgank, Mirjam Sepesy Maučec, and Darinka Verdonik. The SI TEDx-UM speech database: a new Slovenian spoken language resource. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4670–4673, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).

POVZETKI 255 ABSTRACTS
Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

A Parallel Corpus of the New Testament: Digital Philology and Teaching the Classical Languages in Croatia
Petra Matović,* Katarina Radić†
* Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lučića 3, 10000 Zagreb, Croatia, pmatovic@ffzg.hr
† Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lučića 3, 10000 Zagreb, Croatia, katarina.radic1@gmail.com

1. Introduction
Corpus linguistics has been one of the liveliest disciplines in Croatian linguistics, and parallel corpora have been established by Croatian scholars since the 1960s (Tadić 1997, 2001; Simeon 2002). These corpora normally include Croatian and another living language, while corpora consisting of texts in Croatian and at least one of the so-called "dead" languages remain underrepresented, although corpora including languages such as Ancient Greek, Latin, Sanskrit, Arabic, Persian and Akkadian can be found on the World Wide Web (The Alpheios Project 2019; Palladino et al., 2021).
The Department of Classical Philology at the University of Zagreb can already boast one of the earliest online (monolingual) corpora of Latin texts, the CroALa database, built and curated by Neven Jovanović (CroALa, 2014). In the last few years the said department has been steadily building small parallel corpora, and this paper aims to describe one of them, the Greek-Croatian parallel corpus of the New Testament, currently in the making, and furthermore to discuss its educational uses in teaching Ancient Greek.

2. Goal of the paper
Building parallel corpora has been garnering more and more attention in the field of classical philology. The Department of Classical Philology at the University of Zagreb has been building smaller corpora, both as part of several small-scale projects led by Neven Jovanović and in courses on the Greek and Latin languages (e.g. Soldo and Šoštarić 2019). Since 2021, several professors and students at the department have been working on a project titled "A Linguistic Analysis of Selected Early Christian Writings", led by Petra Matović. Within the scope of the project we have started building a parallel corpus of the New Testament, so far comprising the Gospel of Mark and a part of the Apocalypse. The texts are aligned using the Alpheios tool for text alignment in the Perseids environment (The Alpheios Project, 2019; The Perseids Project, 2017). Alpheios enables the user to align words or word combinations in the source text with the corresponding parts of its translation (The Alpheios Project, 2019). In this poster we firstly aim to explain the principles of alignment we followed while building the corpus and, secondly, to discuss some peculiarities of aligning Ancient Greek with Croatian. Finally, we aim to look at the corpus from an educational point of view and discuss its possible uses in teaching Ancient Greek today. Text alignment was done by four students (Mateo Cader, Ružarijo Lukas, Katarina Radić, Luka Šop) and supervised by Petra Matović.
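Although the Alpheios data format itself is not described here, the word-level alignments it produces can be pictured as pairs of source and target units, where a unit is a word or a word combination. The sketch below is illustrative only (it is not the actual Alpheios model), using example pairs of the kind discussed in this abstract:

```python
# A minimal sketch (not the actual Alpheios format) of word-level
# alignment pairs between Greek source units and Croatian target units.
# Units may be single words or word combinations.
alignment = [
    (["ὁ", "Ναζαρηνός"], ["Nazarećanin"]),   # article aligned with its noun
    (["ἀκούσας"], ["kad", "je", "čuo"]),     # participle aligned with a clause
]

def aligned_pairs(pairs):
    """Return (source, target) units joined as strings."""
    return [(" ".join(src), " ".join(tgt)) for src, tgt in pairs]

pairs = aligned_pairs(alignment)
```

The point of the representation is that alignment is many-to-many at the unit level, which is exactly where the rules below become necessary.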
The editions of the texts were Nestle-Aland 28 (Greek New Testament) and the so-called Zagreb Bible (https://biblija.ks.hr/). Initially, the main principle of alignment was to align units (words or word combinations) in the Greek text with their Croatian counterparts; these units had to be as small as possible. Full stops and commas were aligned, too. After the initial period it became clear that additional rules were necessary. While the students did not struggle with the meaning of the Greek text, they were sometimes unsure how to align the Greek with the Croatian. These uncertainties typically arose in the following situations, due to specific linguistic features of the two languages:
- the use of the article (exists in Greek, but not in Croatian: ὁ Ναζαρηνός = Nazarećanin, Mark 10,47)
- commas, which can be aligned with conjunctions
- participles (extensively used in Greek, not common in Croatian: ἀκούσας = kad je čuo, Mark 10,47)
- particles (Greek is rich in particles, while Croatian often lacks equivalents: the particle δέ is translated as "ali" in Mark 13,5, but left untranslated in Mark 13,13)
- features of Hellenistic Greek (the New Testament was written in this later variety of the Greek language, which often differs from the Classical, 5th-century BC Attic dialect mainly taught in schools and universities; one of these features is the preterite form ἤμην διδάσκων = „naučavah", Mark 14,49).
There was also one unexpected problem: students often struggled with aligning prepositions. For example, in Mark 1,6: ἐνδεδυμένος τρίχας καμήλου καὶ ζώνην δερματίνην περὶ τὴν ὀσφὺν, the preposition "s" was left unaligned in the Croatian translation ("odjeven u devinu dlaku, s kožnatim pojasom oko bokova"). Consequently, the following set of rules was formed:
1. The article is aligned together with the corresponding noun, unless translated separately.
2. Conjunctions should be aligned either with conjunctions, particles or punctuation.
3. Punctuation should be aligned whenever possible.
4. Participles should be aligned with the corresponding word combination, even if it is an entire sentence.
5. If something is left out in the translation, the Greek original is left unaligned and vice versa, for example the verb "to be".
6. Prepositions should never be left unaligned. Whenever possible, they should be aligned with a corresponding Greek preposition. When a preposition is added in Croatian, it should be aligned, together with its noun, with the corresponding noun in Greek.
The work done on this corpus highlights several problems in teaching not only Ancient Greek but also Croatian. Students are unsure of the uses of certain parts of speech, usually those that do not have an equivalent in their mother tongue. They are also unaware of the nature of the comma, which can connect (or divide) two words just like a conjunction. Prepositions are often an obstacle because their meaning can be incorporated into a nominal form in Greek and does not have to be expressed separately. These problems probably arise because the school curriculum for Croatian differs from the curricula for Greek and Latin: the curricula for the classical languages pay more attention to grammar, while Croatian has to include both language and literature. Hopefully, projects like this one can highlight specific problems that can then be resolved either by adapting the school curricula or the teaching of classical languages at university level.

3. References
The Alpheios Project. 2019. https://alpheios.net/.
CroALa (Croatiae Auctores Latini). 2014. http://croala.ffzg.unizg.hr.
Chiara Palladino, Maryam Foradi, and Tariq Yousef. 2021. Translation Alignment for Historical Language Learning: a Case Study.
Digital Humanities Quarterly, 15(3). https://www.proquest.com/openview/e048d32e8e991c67282c3fbda5c1f0d4/1?pq-origsite=gscholar&cbl=5124193.
The Perseids Project. 2017. https://www.perseids.org/.
Ivana Simeon. 2002. Paralelni korpusi i višejezični rječnici. Filologija, 38-39:209–215.
Petar Soldo and Petra Šoštarić. 2018. Treebanking Lucian in Arethusa: Experiences, Problems and Considerations. Studia UBB Digitalia, 63(2):7–18.
Marko Tadić. 1998. Raspon, opseg i sastav korpusa suvremenoga hrvatskoga jezika. Filologija, 30-31:337–347.
Marko Tadić. 2001. Procedures in Building the Croatian-English Parallel Corpus. International Journal of Corpus Linguistics, Special issue, pages 1–17.
Novum Testamentum Graece, Nestle-Aland 28. https://www.academic-bible.com/en/online-bibles/novum-testamentum-graece-na-28/read-the-bible-text/bibel/text/lesen/stelle/51/10001/19999/ch/418f354347a79b322324823db62504dc/.
The Zagreb Bible. https://biblija.ks.hr/.

Pre-Processing Terms in Bulgarian from Various Social Sciences and Humanities (SSH) Domains: Status and Challenges
Petya Osenova*, Kiril Simov*, Yura Konstantinova†
* Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Acad. G. Bonchev bl. 2, 1113 Sofia, {petya, kivs}@bultreebank.org
† Institute of Balkan Studies and Centre of Thracology, Bulgarian Academy of Sciences, Moskovska St 45, 1000 Sofia, yura.konstantinova@balkanstudies.bg

1. Introduction
A great number of focused initiatives, projects and conferences tackle in depth various topics related to terminology construction, understanding, processing and usage. We mention only a small selection of them here, as initiatives rather than as individual publications.
These are, among others, the ENeL COST Action on e-Lexicography,1 related activities in the NexusLinguarum COST Action,2 related activities in the ELEXIS project,3 and the globaLEX organization. There is also ongoing work on providing language technology support to colleagues in SSH within CLARIN-ERIC and DARIAH.4,5 Within the CLaDA-BG infrastructure,6 which combines the goals of CLARIN and DARIAH in Bulgaria, there are two types of partners – technological partners and colleagues from SSH. The latter are historians, ethnographers, specialists in the deeds and lives of Cyril and Methodius, and museum and library workers. This combination of complementary partners allows us to construct the necessary resources and to immediately verify their utility for the SSH partners. For the task of creating the Bulgarian-centric Knowledge Graph (BGKG) within CLaDA-BG (Simov and Osenova, 2020), we requested data from our SSH partners in order to perform linguistic pre-processing and to support the creation of terminological dictionaries covering the SSH subdomains. The size of the corpus is nearly half a million tokens – 484,815. About 5,000 words and phrases, with nearly 26,000 annotated usages in the corpus, were selected for pre-processing and for the creation of entries for the terminological dictionaries. Among these, 542 candidate phrases were rejected as false positives; 328 of them were rejected outright, either because they were named entities or because they were free compositional phrases. Our SSH colleagues thus only need to check and validate the pre-processed data, which considerably eases their work. The data consists of selected texts from various sources: scientific texts authored by our SSH colleagues and related to Bulgarian history and society; Linked Open Data such as Wikipedia; available textbooks, specialised dictionaries, etc.
Here we give a brief outline of our pre-processing strategy for handling data-driven terminology in these domains.

2. The Task Overview
The workflow discussed here concerns the SSH data (publications, autobiographies, archive documents, newspaper articles from past periods, descriptions of artefacts, etc.) that were collected from the partners and annotated within the INCEpTION platform7 with named entities, events and roles. While annotating the texts linguistically, the annotators were additionally asked to mark candidate terms with the label term. This task was set up in view of the subsequent creation of specialised terminological dictionaries in each participating SSH domain – history, ethnography, biographical studies, etc. The annotators were instructed to treat as candidate terms the keywords that are specific to the domain. Later on, these candidate terms were extracted and transferred to a large Excel table in Google Drive. The table consists of three main areas: a) the candidate term, b) the term in its context of occurrence, and c) the source, which delimits the domain of usage. Figure 1 presents an excerpt from the Excel table.

Figure 1: An example from the Excel table.

In the first row the following information is given: the term as it occurred in the text (riding-the horse, 'the riding horse'), the text excerpt with the term placed between the symbols @@@, and the name of the source text. In the second row the following information is given: the normalised term (riding horse), the definition (a horse that is used for riding) and the domain – ethnography.

1 https://www.cost.eu/actions/IS1305/
2 https://nexuslinguarum.eu/
3 https://elex.is/
4 https://www.dariah.eu/
5 https://www.clarin.eu/
6 https://clada-bg.eu/en/
7 https://inception-project.github.io/
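Rows of this shape can be generated mechanically once the candidate terms are extracted. The helper below is hypothetical (only the @@@ marking convention and the three-column layout come from the text; the function name and sample data are ours):

```python
# Hypothetical helper producing a row like those in the Excel table:
# the candidate term, its context with the term wrapped in @@@ markers,
# and the name of the source text.
def make_row(sentence, term, source):
    context = sentence.replace(term, f"@@@{term}@@@", 1)
    return {"term": term, "context": context, "source": source}

row = make_row("The riding horse stood by the gate.", "riding horse",
               "ethnography_text_01")
```

Marking the term in context this way keeps the occurrence recoverable even after the surrounding text is normalised.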
All the one-word terms received initial definitions from the digitised version of the Explanatory dictionary of Bulgarian (Popov et al., 1994). This step was performed automatically through a rule-based matching method. First, the word forms in the texts were lemmatized with our in-house inflectional Bulgarian dictionary. Then the coinciding lemmas in the dictionary and in the texts were matched. Terms with more than one meaning automatically received all the possible definitions. Afterwards, these candidate terms were processed manually by the team that had previously worked on the event and role annotations. The core team engaged in the terminology pre-processing consisted of four members, a subset of the whole annotation team of eight people. The tasks related to the terminology processing were organized as follows: one person (outside the four working colleagues) performed the automatic construction of the table and the assignment of the existing definitions and sources. Initially the candidate terms were assigned to workers in alphabetical order, i.e. each colleague was responsible for the candidate terms beginning with certain letters. However, after some letters had been completed, a decision was taken to proceed by domain source instead. This approach allowed us to observe the terms in their domain contexts and interrelations. The terms were then checked once more in their alphabetical order. The workflow was generally divided into two phases that respect the competences of the experts: in Phase 1 the corpus linguists (who were also annotators) pre-processed the candidate terms, while in Phase 2 the specialists in the SSH areas check and validate these terms against their own area.

3. The Workflow
The respective annotated data, including the annotated candidate terms, was uploaded in advance.
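The automatic definition-assignment step (lemmatize, then match coinciding lemmas) can be sketched as follows. The toy lemma table and definition dictionary below are illustrative stand-ins for the in-house inflectional dictionary and the digitised Explanatory dictionary, which are not reproduced here:

```python
# Toy stand-ins for the in-house inflectional dictionary (lemmatization)
# and the digitised Explanatory dictionary (definitions).
lemma_of = {"horses": "horse", "horse": "horse", "votes": "vote"}
definitions = {
    "horse": ["a large four-legged animal used for riding"],
    "vote": ["a formal expression of choice",
             "the right to make such a choice"],
}

def assign_definitions(tokens):
    """Attach all definitions whose lemma coincides with a token's lemma."""
    assigned = {}
    for tok in tokens:
        lemma = lemma_of.get(tok.lower())
        if lemma in definitions:
            assigned[lemma] = definitions[lemma]  # ambiguous lemmas keep all senses
    return assigned

defs = assign_definitions(["Horses", "votes"])
```

As in the described pipeline, ambiguous lemmas carry all candidate senses forward, leaving disambiguation to the human annotator.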
The workflow consisted of the following steps:

3.1 Deciding which candidate terms are true terms
Here the main task of the corpus linguists was to separate the obvious non-terms (common words and expressions) from the specialised terms. Sometimes the boundaries were not very clear, especially with respect to multiword expressions (MWEs) and nested terms. See more about this issue in Section 3.3 below. The annotators had three options to select from: a sure term, a maybe term and a non-term.

3.2 Checking the availability of the definition and its relevance
If there was a definition, the annotator had to accept it as it was, reject it, or modify it. If there was no definition, the annotator had to create one. When the term was a single word, the task was to check the definition that came from the Explanatory dictionary of Bulgarian. In cases of lexical ambiguity the annotator had to select the correct definition among the available ones, or again to provide their own if no appropriate one was present. The selected definition was then marked as the right one. Note that the other definitions were not deleted, for the sake of completeness and future addition into BTB-WordNet.

3.3 Handling multiword expressions
Here most terms consisted of a head noun and a pre-positioned modifier, for example демокрация (democracy) and пряка демокрация (direct democracy). The problems could go in two directions: whether to accept a MWE as a domain term or not, and how to provide a definition for it, since one is usually not available in the consulted sources. We decided to be inclusive in accepting what counted as a term: all the expressions considered specific to the domain were approved.
The annotator could also add definitions for the parts of compositional MWEs. For example, невалиден глас (invalid vote) can have a definition as a phrase, while its two elements невалиден (invalid) and глас (vote) might also be added below with their own definitions.

3.4 Re-checking the domain/genre
This step relies on the domain/genre classification that had already been used: an initial pre-defined schema was adopted, which was further expanded and hierarchized in the course of the work. At the moment the list of applied domains amounts to 76 categories (for example, architecture with a subdomain of construction; geography with a subdomain of geology; philosophy with subdomains of ethics, rhetoric and logic) and the list of registers to 15 (for example, dialectal, metaphorical, colloquial, etc.). The initial schema came from the classifications used in the Explanatory dictionary and had 36 domains and 4 registers. At the beginning, we tried to keep the terms in separate, non-overlapping groups: history, ethnography, etc. These areas, however, are highly interdisciplinary and intersect with each other, so this approach was abandoned at a very early stage of our work. As a result, one and the same term can be placed in more than one domain, with the same or a different meaning.

Two other tasks were part of the workflow, although with a lower priority:

3.5 Adding other senses of the lemma of the term, and
3.6 Adding examples for these additional senses.

The idea behind tasks 3.5 and 3.6 was to reach better coverage in other language resources such as BTB-WordNet (Osenova and Simov, 2018) and to compile a sense corpus per lemma and usage. The result of this preparatory work was a classification of the roughly 5,000 initially selected candidate terms and keywords with respect to the hierarchy of domains. This allows the further processing to be done by different experts in the corresponding domains.
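An illustrative fragment of such a hierarchized schema, using only the domain examples named in the text (the real schema has 76 domains and 15 registers, which are not reproduced here):

```python
# Illustrative fragment of the expanded domain schema; domain and
# subdomain names follow the examples given in the text.
domains = {
    "architecture": ["construction"],
    "geography": ["geology"],
    "philosophy": ["ethics", "rhetoric", "logic"],
}

def category_count(schema):
    """Count domains plus their subdomains."""
    return sum(1 + len(subs) for subs in schema.values())
```

A flat dictionary of domain-to-subdomain lists is enough here because the hierarchy described is only two levels deep.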
Their tasks are the following:

Final sorting of lexical items into true terms and keywords. As mentioned above, the examples annotated within the domain documents were classified into two main categories: general lexica and compositional phrases, on the one hand, and terms, on the other. The second group sometimes contains keywords that happen to be true terms in the respective domain. Thus, the first task for the experts was to sort out the true terms.

Addition of missing terms. Despite the wide range of documents selected for annotation, they do not contain all the relevant terms of a domain. For example, of the full set of genres of Old Bulgarian literature, only three were identified within the annotated documents. The terms for the remaining genres were added by the experts in Old Bulgarian literature. By completing the missing slots, we expect each domain to end up with a relatively complete list of terms.

Extension of the definitions. For the original list of candidate terms we also had to add definitions from online sources or construct our own. Since the pre-processing group included linguists but not experts in the domains, the definitions were often not complete and/or precise enough. The domain experts therefore extended the definitions with encyclopaedic information. In some cases appropriate images were also added. The resulting encyclopaedic entries were cross-linked on the basis of the included terms. Here is an example of such an entry from the area of architecture:

АЖУР техника при резбарското, златарското, плетаческото и други изкуства, при която между декоративните елементи има отвори

Figure 2: One example from the terminology lexicon in the area of architecture.
On the left, the term openwork and its definition (ornamental work such as embroidery or latticework having a pattern of openings) are given, and on the right there is an image illustrating it. Links to other terms are represented by italicising the corresponding words/phrases in the definition. This example is only illustrative: the actual entries may contain longer texts, references to relevant literature, more images and links to external resources. The resulting terminological lexicons are further processed by the team working on the Bulgarian BulTreeBank WordNet (BTB-WN). This work has been done in cooperation with the domain experts. Such an alignment of the terminological lexicons and the wordnet allows the joint usage of both lexical resources for the main use cases – explanation of the specific knowledge in the domains and indexing of various types of domain documents. Figure 3 depicts a part of the hierarchy of Bulgarian folk units of measurement. They are linked by a hyponymy relation to the concept of Bulgarian folk units and to the concept of linear units.

Figure 3: A graphical view of the Bulgarian folk units of measurement. Each term is classified in two ways: as a unit of measurement for distance (linear units) and as belonging to the domain of Bulgarian folk units. The hierarchy of terms can interleave with synsets that are not terms in the domain. The mappings to synsets in the English WordNet are given as identifiers (IDs) in the lower part of the graphical representation of each Bulgarian synset. Measures such as педя (span), пръст (finger), лакът (elbow), etc. are shown.
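The double classification in Figure 3 can be sketched as a tiny hyponymy map. The labels below are the glosses from the text, not actual BTB-WN synset identifiers:

```python
# Toy hyponymy map mirroring Figure 3: each folk measure is a hyponym
# of both "Bulgarian folk units" and "linear units".
hypernyms = {
    "педя (span)":    ["Bulgarian folk units", "linear units"],
    "пръст (finger)": ["Bulgarian folk units", "linear units"],
    "лакът (elbow)":  ["Bulgarian folk units", "linear units"],
}

def is_folk_linear_unit(term):
    """True if the term carries both hypernym links described in the text."""
    h = hypernyms.get(term, [])
    return "Bulgarian folk units" in h and "linear units" in h
```

Allowing a synset more than one hypernym is what lets a single term be classified both by what it measures and by its cultural domain.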
Our idea is for BTB-WN to be the main resource within CLaDA-BG for representing lexical data related to general language and terminology, and for it to be aligned with the ontologies on which the BGKG is constructed.8 In this way we hope to provide access to these data for different types of users, with different levels of knowledge about the domains, different goals in mind, etc. In addition to the standard wordnet relations (hypernymy, meronymy, etc.), we envisage other semantic relations that represent various aspects of knowledge within the corresponding domains. We will thereby ensure the representation of encyclopaedic information and facilitate the representation of Named Entities (NEs) classified with respect to the corresponding concepts. This approach relies on specially created templates based on the domain relations as well as their domain and range restrictions. We have already defined about 20 such templates for the main classes of NEs, such as geopolitical entities, historical events (wars, uprisings, etc.), artefacts (icons, stamps, etc.), political parties and regimes, and so on.

4. Conclusions
In this extended abstract we described the main steps followed in the creation of terminological lexicons in a bottom-up approach, starting from real texts in SSH domains. After the domain texts were annotated with named entities, events, roles and candidate terms, a concordance of the candidate terms from the different documents was produced, in which they were grouped together and linguistically processed. As a result, each term had a representation in its basic form, listings of related words for MWEs, and the existing potential

8 This approach is similar to the lexeme assignment in Wikidata.
https://www.wikidata.org/wiki/Wikidata:Lexicographical_data/Documentation

senses from different sources (available locally to the annotators and on the web). The appropriate senses for the given context were selected or created. The result was then further processed by the domain experts in order to make the definitions more precise and complete, and missing terms were added. The terminological lexicons were then aligned with BTB-WN so that they can be used for navigation, for the annotation of more documents (manually or automatically), and for establishing links to the necessary ontologies. The main challenges can be divided into technical and theoretical ones. Among the former we can mention insufficient context; the lack of sufficient sources for terms related to earlier historical periods; and deciding how best to approach the task (alphabetically or by source). Among the latter we can outline: the difficulty of differentiating between a term and a non-term; selecting the most informative definition when too many are found in the sources; finding and/or constructing a definition, with the help of other available resources, when it is missing or wrong; handling close definitions of the same lemma; constructing definitions for multiword terms; and handling the multi-domain inclusion of terms.

5. References
Petya Osenova and Kiril Simov. 2018. The data-driven Bulgarian WordNet: BTBWN. Cognitive Studies | Études cognitives, 18. https://doi.org/10.11649/cs.1713. (Freely available at: https://ispan.waw.pl/journals/index.php/cs-ec/article/view/cs.1713/4458)
Kiril Simov and Petya Osenova. 2020. Integrated Language and Knowledge Resources for CLaDA-BG.
In: Selected Papers from the CLARIN Annual Conference 2019, Linköping Electronic Conference Proceedings 172. LiU Electronic Press, 2020.
D. Popov, L. Andreychin, L. Georgiev, St. Ilchev, N. Kostov, Iv. Lekov, St. Stoykov, and Tsv. Todorov. 1994. Bulgarian Explanatory Dictionary. Sofia: Nauka i izkustvo. (In Bulgarian)

An Approach to Computational Crisis Narrative Analysis: A Case Study of Social Media Narratives Around the COVID-19 Crisis in India
Henna Paakki*, Faeze Ghorbanpour*, Nitin Sawhney*
* Department of Computer Science, Aalto University, P.O. Box 15400, FI-00076 AALTO, Espoo, Finland, henna.paakki@aalto.fi

1. Introduction
Societal crises create an empty narrative space and a need for explanation about the crisis, the related risks and the required mitigation actions (Sellnow et al., 2019). Crises are socially constructed through discourses and have the potential to change social structures and perceptions (Walby, 2015). Crisis narratives also play an important role in attributing blame and structuring crisis responses and recovery plans (Walby, 2015, p. 14). The role of social media has grown significantly, both as a forum for seeking information about crises and as a site of discursive sense-making. People use discourses and narratives related to a crisis to construct the world socially and epistemologically and to explain the impending crisis (Joffe, 2003; Bednarek et al., 2022), which makes it important for authorities, experts and crisis regulators to understand the various discourses around a crisis.
This paper examines the possibilities of analyzing social media discourses with a novel computational approach: a discourse act classifier based on zero-shot learning (Yin et al., 2019) is used to categorize discourse types into narrative function groups (Labov, 1972). Such tools can complement other means of inquiry and support crisis preparedness. Our empirical case study examines discourses around the COVID-19 pandemic in the context of English-language social media in India. This abstract describes an ongoing research project.

2. Goal of the paper
As crisis discourses on social media comprise large amounts of data, computational methods are needed that can support close reading. Although some methods have been developed for computational discourse and narrative analysis (Piper et al., 2021), this line of research needs more tools. Lakoff and Narayanan (2010) have proposed approaching computational narrative analysis through the structural building blocks of narratives, which have been outlined in linguistics and the social sciences (Labov, 1972; Labov and Waletzky, 1967; van Dijk, 1976). Such rules can aid computational models. Narratives encase human motivations, goals, emotions, actions, events and outcomes, elements that have been considered essential for computational models to understand (Lakoff and Narayanan, 2010). We posit that sense-making in a crisis is action (Joffe, 2003), formulated at the surface level as discursive actions (Edwards and Potter, 1993; Schegloff, 2007). Thus, to capture social media narratives, we explore the validity of using a widely used and well-established theory of narrative functions from linguistics (Labov, 1972; Labov and Waletzky, 1967) to categorize social media comments based on their functions. These functions have already been used to computationally analyze more traditional narratives such as personal histories or short stories (see, e.g., Li et al., 2017).
We explore the possibilities of further extending their use to analyzing changes in social media discourses around crises. Many narrative theories agree that a sequence of events forming a narrative whole includes: (1) an orientation to the story or situation (identifying the time, place, persons and situation of the narrative); (2) some type of complication or disruption (the core event that creates tension in the narrative); (3) an evaluation (clarification of why or how the events are important); and finally (4) a resolution (how the story ends or how the core problematic event is resolved) (Lakoff and Narayanan, 2010; Labov, 1972; Labov and Waletzky, 1967; Todorov, 1971; van Dijk, 1976). Conflict in communication is central to the narrative space surrounding a crisis and needs to be managed for successful crisis mitigation (Sellnow et al., 2019). Central to crisis discourses are critical events with transformative power: they mobilize discourses and transform perspectives on the crisis through conflict (Jørgensen and Phillips, 2002). We might thus expect crisis narratives to involve a significant complication phase that needs to be followed by a resolution phase. We maintain that by analyzing the functional categories of orientation, complication, evaluation, and resolution, it is possible to understand shifts in perspectives on the ongoing crisis, shifts that contribute to the narrativization of the crisis. Furthermore, we expect that it is possible to identify points of discursive struggle within crisis discourses, points where critical understandings of the crisis are negotiated in order to achieve a consensus or to legitimize a selected narrative (Jørgensen and Phillips, 2002; Sellnow et al., 2019). This is central to understanding how a consensus on crisis resolution is achieved. We seek to investigate the validity and utility of computationally categorizing social media crisis discourses based on their functions.
We ask:
1. Can narrative functions be applied to analyzing online crisis discourses using a computational model? Are these functions operationalizable through discursive actions?
2. Do social media comments grouped by their actions correspond well enough to the functions of orientation, complication, evaluation, and resolution?
3. Using these function-based groupings, is it possible to find patterns of narrativization in online crisis discourses? Do comments have different functions at different points in time during the crisis?
3. Data sources and sampling
Crisis news reporting has a significant impact on citizen perspectives on the crisis (Kasperson et al., 1988). We are thus interested in the relationship between the evolution of crisis news discourses and how citizen discourses develop during a long-lasting crisis. Crisis news videos on YouTube news channels and their comments offer an opportunity to investigate this interaction over time. We examine viewer comments on crisis news videos on the English-language NDTV news YouTube channel during the COVID-19 crisis in India, in conjunction with news reports and contextual insights on the pandemic. The data were collected using a scraper and the YouTube API. They cover the beginning of the crisis (1/2020–8/2020), the acute vaccination phase of the crisis (02/2021–08/2021), and a later prolonged phase of the crisis (11/2021–02/2022). Channel selection criteria included that the channel should be among the most followed English news providers in the country and one of the most trusted (Newman et al., 2021), that it allows viewer comments, has a wide viewership, and is politically as close to the centre as possible.
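The comment collection step could look roughly as follows; this is a minimal sketch against the public YouTube Data API v3 commentThreads endpoint, not the authors' actual scraper. The video ID and API key are placeholders, and pagination is only indicated in a comment:

```python
# Sketch: building a request to the YouTube Data API v3 commentThreads
# endpoint, as one might when collecting news-video comments.
# VIDEO_ID and API_KEY below are placeholders, not from the study.
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3/commentThreads"

def build_comment_request(video_id, api_key, page_token=None):
    """Return the URL for one page of top-level comments on a video."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,          # API maximum per page
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return API_BASE + "?" + urlencode(params)

# A real collector would GET this URL, read items[].snippet.topLevelComment,
# and follow nextPageToken until the comment list is exhausted.
url = build_comment_request("VIDEO_ID", "API_KEY")
```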
The Indian context is of interest because trust in news has been reported to be low there (Newman et al., 2021), and because Global South perspectives have not been sufficiently represented in research.
4. Methods
Our approach is mixed: we utilize computational modeling to analyze a large set of data and achieve reproducibility and quantifiability, but we also employ qualitative close reading. We posit that the narrative functions theory can be operationalized through the pragmatic items of discursive actions, as these are often used to analyze accountability, agency, position, and intention in conversation (Edwards and Potter, 1993; Schegloff, 2007). We expect that in our social media data, informing statements mostly function to orient to the crisis and to express beliefs; questions, accusations, and challenges most often express a complication or problematize some aspect of the crisis; evaluations and appreciations mostly attempt to elaborate on and evaluate the situation; and requests and proposals aim at a resolution of some aspect of the crisis (Couper-Kuhlen and Selting, 2017; Turowetz and Maynard, 2010). Thus, we argue that what a comment does can be used to infer what function it has within the larger crisis narrative. The selection of actions is based on frameworks of core actions in social interaction (Clark and Schaefer, 1989), actions found relevant across different contexts (Stivers et al., 2010) and in computer-mediated communication (Paakki et al., 2021). We manually annotated a set of 438 social media crisis news comments with actions. First, two annotators independently annotated the same set of comments, then compared and negotiated their annotations and resolved all conflicts, analyzing especially the difficult cases. The annotators then resumed annotation work, and finally inter-annotator agreement was calculated using Krippendorff's alpha. We achieved a score of 0.75, which indicates a good degree of agreement.
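The mapping from discursive actions to narrative function groups described above can be sketched as a simple lookup; the action label strings below are illustrative (the study's exact label set is not reproduced here), and the example comments are invented:

```python
# Hypothetical action labels mapped to the four narrative function
# groups, following the correspondence described in the text:
# informing statements orient; questions, accusations, and challenges
# complicate; evaluations and appreciations evaluate; requests and
# proposals aim at resolution.
ACTION_TO_FUNCTION = {
    "statement": "orientation",
    "question": "complication",
    "accusation": "complication",
    "challenge": "complication",
    "evaluation": "evaluation",
    "appreciation": "evaluation",
    "request": "resolution",
    "proposal": "resolution",
}

def group_by_function(labeled_comments):
    """Sort (comment, action) pairs into narrative function groups."""
    groups = {f: [] for f in
              ("orientation", "complication", "evaluation", "resolution")}
    for text, action in labeled_comments:
        groups[ACTION_TO_FUNCTION[action]].append(text)
    return groups

groups = group_by_function([
    ("Cases are rising in Delhi.", "statement"),
    ("Why were we not warned?", "question"),
    ("Please open more vaccination centres.", "request"),
])
```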
Using the hand-labelled comments, we trained a classifier using few-shot learning (Yan et al., 2018), achieving an F1 score of 0.50. We also ran a zero-shot NLI classifier (Yin et al., 2019), which at present achieves better results (F1 0.61) and was thus used for labeling all comments. The labeling followed carefully prepared annotation guidelines based on descriptions of actions in the literature (e.g. Couper-Kuhlen and Selting, 2017; Schegloff, 2007). The full action annotation scheme comprised 13 classes, following research on which actions are relevant and common in computer-mediated communication (Paakki et al., 2021). It included responsive actions (e.g. apology, acceptance) that were not assigned to the function groups. At this stage, we concentrate on the 8 actions described above. We then sorted comments into groups based on their action label using a Python script. We proceeded to validate our approach by (1) qualitatively analyzing the functions (per Labov, 1972) of a set of hand-labeled comments from the period of 17–25 August 2021 (125 comments excluding duplicates) based on their content, comparing this analysis to our action-based computational classification, and (2) using time-series analysis to investigate the emergence of function groups at different times during the crisis. We calculated a threshold to identify significant peaks in function group values (1.5 × SD above the group mean). We expected that if the narrative functions were applicable to analyzing social media crisis discourses, there should be significant changes in which function groups are most common in crisis comments at different times.
5. Results
Our validation step 1 shows that the computational classification of comments gives results similar to our manual analysis.
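The peak criterion used in the time-series step, a group's value counting as a significant peak when it exceeds the group mean by 1.5 standard deviations, can be sketched as follows; only the threshold rule is from the text, while the helper name and sample series are illustrative:

```python
# Flag time points where a function group's value exceeds
# mean + 1.5 * SD of that group's series.
from statistics import mean, pstdev

def significant_peaks(series, factor=1.5):
    """Return indices of values above mean + factor * population SD."""
    threshold = mean(series) + factor * pstdev(series)
    return [i for i, v in enumerate(series) if v > threshold]

# Toy series of weekly complication-group counts: only index 3
# (value 10) exceeds the threshold of 2.5 + 1.5 * 3.35 ≈ 7.53.
peaks = significant_peaks([1, 1, 1, 10, 1, 1])
```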
The chosen time period involved an especially high number of complication actions in both the manually annotated set and the computationally annotated set. These mostly relate to criticism of or mistrust in authorities and the COVID-19 vaccine, comments about negative symptoms from the vaccine, and confusion about whom to trust and what to do, but also include some arguments supporting the authorities. The qualitative analysis of the functions of comments corresponds sufficiently well to the action-based computational categorization: most statements and announcements also had an orientation function in our qualitative analysis; most questions, accusations, and challenges served a complication function; evaluations and appreciations corresponded well with the evaluation function; and requests and proposals mostly aimed at some type of resolution. However, 10% of comments did not fall into the function group assumed from their action type. In some cases actions had a different function than expected: informing statements sometimes provided a complication in the few cases where negative effects from vaccines were described, evaluations sometimes had an orientation function, and some long comments involved more than one significant function. Secondly, our preliminary results from the time-series analysis show significant changes in which functions crisis news comments have at different points on the crisis timeline. Within the NDTV crisis news comments, during the early phase of the corona crisis there are more significant peaks in orientation- or resolution-oriented discourses. During the acute mid-phase of the crisis, the frequency of comments with a complication function is significantly higher. In the last phase, functions become dispersed, i.e. none of the function groups rises above the threshold.
The time-series analysis is still a work in progress, but the results so far show that the crisis narrative reaches its most conflictual point in the acute mid-phase of the crisis, when COVID-19 vaccinations had become relevant.
6. Conclusion
Our results so far show that Labovian narrative theory is to some extent applicable to analyzing crisis discourses on social media. The applied model allows us to analyze how the functions of discourses shift along the crisis timeline and to identify significant points of discursive struggle. The operationalization of functions through actions appears to work sufficiently well, as it provides a justifiable and pragmatic frame for annotation, rooted in a well-researched field. Based on our results, the action-based categorization has some limitations that need consideration, as the actions used do not always correspond to the expected function. However, the narrative function categories are highly abstract and thus difficult to classify as such, as we found in earlier experiments, so for a computational model we consider an action-based labeling scheme the more pragmatic approach. Social media discourses did not exactly follow the Labovian narrative structure in our empirical case: although complication-oriented discourses occurred during the second phase in line with the narrative theory, the early phase already involved significant crisis resolution discourses. The dataset for our third phase of the crisis should be extended in later research to gain further insight into whether discourses related to some function group might emerge as significant. Further research also needs to investigate whether similar patterns of narrativization can be found in different cultural contexts and crises, and whether social media discourses follow their own pattern of narrative structure as compared to Labov's theory (1972).
Our few-shot classification also needs more work to achieve higher accuracy in action classification. Action classification for social media comments is not an easy task, for example because comments often involve several actions, and because deciding what action a comment represents sometimes requires interpretation that is hard to define clearly for each case in annotation guidelines. This research advances the growing line of computational narrative analysis methods, elaborating on the possibilities for using narrative functions to understand the narrativization of crisis discourses. We argue that such tools are needed to support other means of research into crisis communication, for a multi-sided understanding of perspectives on crisis and social media engagement. Further, as social media is a site used to influence public opinion and to spread disinformation, the various discursive conflicts taking place in this arena are essential for crisis communicators both to understand and to manage.
7. References
Leiming Yan, Yuhui Zheng, and Jie Cao. 2018. Few-shot learning for short text classification. Multimedia Tools and Applications, 77(22):29799–29810.
Monika Bednarek, Andrew Ross, Olga Boichak, Y.J. Doran, Georgia Carr, Eduardo Altmann, and Tristram Alexander. 2022. Winning the discursive struggle? The impact of a significant environmental crisis event on dominant climate discourses on Twitter. Discourse, Context & Media, 45:100564.
Herbert Clark and Edward F. Schaefer. 1989. Contributing to Discourse. Cognitive Science, 13(2):259–294.
Derek Edwards and Jonathan Potter. 1993. Language and causation: A discursive action model of description and attribution. Psychological Review, 100(1):23–41.
Elizabeth Couper-Kuhlen and Margret Selting. 2017. Interactional Linguistics: Studying Language in Social Interaction. Cambridge University Press.
Kishaloy Halder, Alan Akbik, Josip Krapac, and Roland Vollgraf. 2020. Task-Aware Representation of Sentences for Generic Text Classification. In: Proceedings of the 28th International Conference on Computational Linguistics, pages 3202–3213, Barcelona, Spain. International Committee on Computational Linguistics.
Hélène Joffe. 2003. Risk: From Perception to Social Representation. British Journal of Social Psychology, 42(1):55–73.
Marianne Jørgensen and Louise Phillips. 2002. Discourse Analysis as Theory and Method. Sage, London.
Robert Kasperson, Ortwin Renn, Paul Slovic, Halina Brown, Jacque Emel, Robert Goble, Jeanne Kasperson, and Samuel Ratick. 1988. The social amplification of risk: A conceptual framework. Risk Analysis, 8(2):177–187.
William Labov. 1972. Language in the Inner City. University of Pennsylvania Press, Philadelphia.
William Labov and Joshua Waletzky. 1967. Narrative analysis: Oral versions of personal experience. In: J. Helms, ed., Essays in the Verbal and Visual Arts, pages 12–44. University of Washington Press, Seattle.
George Lakoff and Srini Narayanan. 2010. Toward a computational model of narrative. In: 2010 AAAI Fall Symposium Series, pages 21–28, Menlo Park, California. https://www.aaai.org/ocs/index.php/FSS/FSS10/paper/view/2323
Nic Newman, Richard Fletcher, Anne Schulz, Simge Andi, Craig Robertson, and Rasmus Nielsen. 2021. Reuters Institute Digital News Report 2021. Reuters Institute for the Study of Journalism, Oxford.
Henna Paakki, Heidi Vepsäläinen, and Antti Salovaara. 2021. Disruptive online communication: How asymmetric trolling-like response strategies steer conversation off the track. Computer Supported Cooperative Work, 30(3):425–461.
Andrew Piper, Richard So, and David Bamman. 2021. Narrative Theory for Computational Narrative Understanding.
In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic. doi:10.18653/v1/2021.emnlp-main.26
Emanuel Schegloff. 2007. Sequence Organization in Interaction: A Primer in Conversation Analysis. Cambridge University Press, Cambridge; New York.
Timothy Sellnow, Deanna Sellnow, Emily Helsel, Jason Martin, and Jason Parker. 2019. Risk and crisis communication narratives in response to rapidly emerging diseases. Journal of Risk Research, 22(7):897–908.
Tanya Stivers, Nick Enfield, and Stephen Levinson. 2010. Question–Response Sequences in Conversation Across Ten Languages: An Introduction. Journal of Pragmatics, 42(10):2615–2619.
Tzvetan Todorov. 1971. The Two Principles of Narrative. Diacritics, 1(1):37–44.
Teun A. van Dijk. 1976. Philosophy of action and theory of narrative. Poetics, 5(4):287–338.
Jason Turowetz and Douglas Maynard. 2010. Morality in the social interactional and discursive world of everyday life. In: S. Hitlin and S. Vaisey, eds., Handbook of the Sociology of Morality, pages 503–526. Springer, New York.
Sylvia Walby. 2015. Crisis. Polity Press, Cambridge.
Gradnja Korpusa študentskih besedil KOŠ (Building the KOŠ Corpus of Student Texts)
Tadeja Rozman,* Špela Arhar Holdt‡
* Fakulteta za upravo, Univerza v Ljubljani, Gosarjeva ulica 5, 1000 Ljubljana, tadeja.rozman@fu.uni-lj.si
‡ Filozofska fakulteta, Univerza v Ljubljani, Aškerčeva ulica 2, 1000 Ljubljana; Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, Večna pot 113, 1000 Ljubljana, spela.arharholdt@ff.uni-lj.si
1 Introduction
Corpora of authentic texts produced by the school-age population are, both internationally and in Slovenia, an important source of information about the language competence of people who are still developing that competence during their education; at the same time, they are an indicator of linguistic and didactic practices in educational settings. These resources are therefore important for language didactics and for the preparation of user-oriented language reference works and materials, as well as for the development of various language technology tools. While corpus linguistics worldwide devotes more attention to the development and analysis of corpora of foreign language acquisition,1 in Slovenia we also have, modeled on such corpora, the Šolar corpus of school written products (Rozman et al., 2012) and its extended version Šolar 2.0 (Kosem et al., 2016). It contains texts written in class in the third cycle of primary school and in secondary schools; part of the corpus also includes authentic teacher corrections, categorized by type of language problem with a hierarchically designed annotation system (Arhar Holdt et al., 2018). Slovene is thus one of the few languages with such data for the first language, but only for a limited school population. Within the project Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti (Empirical Foundations for Digitally Supported Development of Written Language Competence; ARRS, J7-3159),2 we are therefore preparing an extension of the corpus with student texts, initially in the form of a pilot corpus of student texts.
2 Purpose of the corpus
The construction of the KOŠ Corpus of Student Texts is primarily aimed at obtaining empirical data on the written language competence of the student population, as well as at analytical insight into how professional writing develops. Students are expected to have acquired foundational language skills (normative, textual, pragmatic) by the end of secondary school, while at university this knowledge should be upgraded through the acquisition of terminology and the stylistic characteristics of specialized texts. At least in non-linguistic study programmes, where language education is not a goal in itself but good language skills are merely a foundation for successful professional work, further development of language competence in principle proceeds alongside the acquisition of domain knowledge: through the reception of specialized works and through writing, e.g., seminar papers, essays, and research reports, preparing oral presentations, participating in professional debates, and the like. In doing so, students are expected to become aware of the processes of comprehension and writing, to attend to the comprehensibility and acceptability of texts and the use of specialized vocabulary, and, where needed, to remedy orthographic and grammatical deficiencies. However, we educators observe large differences in students' language competences, and subject professors can deal with language problems only to a limited extent. Educators' approaches to raising awareness of language choices also appear to vary, not only because of differing language knowledge but also because of different views on the usefulness of such feedback, on written academic practices, and the like, as well as because of a lack of didactic guidelines. The need to develop communicative competence in Slovene for professional purposes was already recognized during the preparation of the Resolution on the National Programme for Language Policy 2014–2018,3 and the language-planning goals for higher education and science set out at that time have not changed substantially in the current Resolution on the National Programme for Language Policy 2021–2025.4
The document thus stipulates that at the level of professional higher education the learning of Slovene for specific purposes must be made possible, and that, on the basis of research and analyses of professional and scientific writing at the higher education level, a syllabus for professional and scientific writing should be developed for an introductory course in the first year of first-cycle programmes. On the basis of these provisions, already present in the previous resolution, the KAS corpus of academic Slovene, a corpus of bachelor's, master's, and doctoral theses (Erjavec et al., 2021) published on the national Open Science portal,5 was built in 2019 with the aim of obtaining empirical data on professional and scientific writing. The corpus thus collects students' specialized texts produced at the completion of each stage of higher education and university study; these texts are supervised and to a large extent also proofread, so the corpus is only partly usable for analyzing the written language competence of the student population and the development of professional writing. In the long run, the KOŠ corpus could close the gap in corpus data between Šolar and KAS and offer a basis for research into which foundational skills need to be (better) addressed at earlier levels of education and which at the tertiary level, where the development of written language competence continues at more complex textual levels.
1 For more on corpora of foreign language acquisition and the construction of a corpus of Slovene as a foreign language, see e.g. Stritar Kučuk (2020).
2 https://www.cjvt.si/prop/
3 https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina/2013-01-2475?sop=2013-01-2475
4 https://www.uradni-list.si/glasilo-uradni-list-rs/vsebina/2021-01-1999?sop=2021-01-1999
A broader picture of this developmental arc would allow literacy education to be better directed toward its final goal: empowered and independent (though, in line with contemporary practice, technologically and data-supported) writing of different types of texts for different communicative purposes, which is also important for successful professional work.
3 Corpus design
The project envisages the preparation of a pilot corpus, to be published as a dataset in the CLARIN.SI repository. Corpus construction is taking place in the academic year 2021/22 and is expected to be completed in autumn 2022. Texts are being collected following the methodology used for the Šolar corpus, which includes: the legal arrangement of open access to the results (preparation and signing of rights-transfer agreements and usage permissions), the recording of all relevant metadata (programme, year of study, field of study, text type, possible multiple authorship, and, where several versions of the same text are submitted, labels for the original and revised versions), at least partial inclusion of professors' language corrections, storage in a compatible format, and automatic annotation. Language corrections will be recorded with the Svala tool (Wirén et al., 2019), which allows a clear side-by-side view of the source and corrected text, pseudonymization of those parts of the text that could reveal authorship or other sensitive personal information, and the annotation and content categorization of language corrections. Within the project Razvoj slovenščine v digitalnem okolju (Development of Slovene in a Digital Environment),6 the tool was adapted for work with the Slovene corpora KOST (Stritar Kučuk, 2020) and Šolar, and as such it supports annotation with the Šolar label set (Arhar Holdt et al., 2018). We will use these labels for the KOŠ corpus as well (see Figure 1), though we anticipate that the annotation scheme will need to be partly adapted for student texts.
Indeed, it is to be expected (and the material collected so far confirms this) that, owing to the genre specifics of student texts, which are reviewed by professors who are not linguists, corrections rarely take the form of concrete suggestions of correct language choices; rather, professors' comments tend to alert students to language errors only in general terms and to focus more on the stylistics of specialized texts, appropriate use of terminology, citation, comprehensibility of writing, argumentation, and the like. All corpus texts (with and without annotated corrections) will then be automatically annotated at the levels of sentence segmentation, tokenization, lemmatization, morphosyntax, syntax, and named entities with the CLASSLA StanfordNLP pipeline (Ljubešić and Dobrovoljc, 2019), which at the time of writing is also being further developed within the aforementioned project. We are collecting texts in first-cycle study programmes at two faculties; potentially relevant for inclusion in the corpus are all texts that students submitted to teachers during the course of study at the faculty and that are not handwritten. We therefore collect texts via teachers, since this gives greater certainty that we receive authentic texts actually written in the study environment: presumably mostly seminar papers, essays, reports, summaries of research articles, longer (essay-type) answers to questions, and perhaps also outlines and drafts of bachelor's theses. Texts connected with the preparation of final theses are very valuable for assessing the ability to produce a longer specialized text at the end of education, also because of the insight they give into supervisors' comments and corrections; at present, however, their inclusion in the corpus appears problematic from the standpoint of anonymization, since final theses are as a rule freely published online and easily linkable to drafts, authors, and supervisors.
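The automatic annotation levels listed above could be configured in CLASSLA roughly as follows; this is a minimal sketch, assuming the standard CLASSLA/Stanza processor identifiers (sentence segmentation is handled by the "tokenize" processor), and the sample sentence is illustrative:

```python
# Sketch: configuring the CLASSLA annotation pipeline for tokenization
# and sentence segmentation, morphosyntax, lemmatization, dependency
# syntax, and named entities. Processor names are assumed to follow
# the standard CLASSLA/Stanza identifiers.
PROCESSORS = ["tokenize", "pos", "lemma", "depparse", "ner"]

def build_pipeline_config(lang="sl"):
    """Keyword arguments one would pass to classla.Pipeline."""
    return {"lang": lang, "processors": ",".join(PROCESSORS)}

config = build_pipeline_config()

# Actual use requires `pip install classla` and a model download:
#   import classla
#   classla.download("sl")
#   nlp = classla.Pipeline(**config)
#   doc = nlp("Besedila zbiramo prek učiteljev.")
```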
5 https://openscience.si/
6 https://www.slovenscina.eu/
Figure 1: Testing the correction-annotation methodology in a test version of the localized Svala tool.
4 Next steps
Within the project Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti we aim to deliver a pilot corpus of 200,000 tokens and, alongside its construction, to assess the transferability of the Šolar methodology to students' written production and to draw up specifications for the further development of a corpus of student writing, i.e., to define the desired size, the structure with respect to regional representation, type and field of education, and the typology of corrections. In this context we are also preparing a short survey questionnaire for university teachers, with which we wish to obtain additional data on their practices of giving feedback to students, so as to design the collection and recording of this material as effectively as possible. The material collected so far suggests, as expected, that these practices are quite diverse and differ in many respects from the feedback of Slovene language teachers recorded in the Šolar corpus. Within the project, the collected corpus material will be used for pilot quantitative and qualitative linguistic analyses of student writing. The analyses will focus on typical writing problems and on patterns of pointing out linguistically inappropriate or less appropriate formulations, which includes giving feedback by entering a solution, descriptive recommendations, graphically indicating the location of the problem, or other possible approaches. We will compare the results with frequency-ordered lists of language difficulties in the Šolar corpus.
The findings are expected to already outline the development of written language competence at the transition from secondary to university education, possible deficits in foundational language knowledge, and how the learning process can best be supported with empirical data.
Acknowledgments
The project Empirična podlaga za digitalno podprt razvoj pisne jezikovne zmožnosti (J7-3159) and the programme Jezikovni viri in tehnologije za slovenski jezik (P6-0411) are co-financed by the Slovenian Research Agency from the state budget.
References
Špela Arhar Holdt, Polona Lavrič, Rebeka Roblek, and Teja Goli. 2018. Kategorizacija učiteljskih popravkov: Smernice za označevanje korpusa Šolar 2.0, v1.0. Deliverable of the project Nadgradnja korpusa Šolar. https://solar.trojina.si/wp-content/uploads/2022/05/Smernice-za-oznacevanje-korpusa-Solar-2.0-v1.0.pdf
Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources & Evaluation, 55:551–583. https://doi.org/10.1007/s10579-020-09506-4
Iztok Kosem, Tadeja Rozman, Špela Arhar Holdt, Polonca Kocjančič, and Cyprian Adam Laskowski. 2016. Šolar 2.0: nadgradnja korpusa šolskih pisnih izdelkov. In: Zbornik konference Jezikovne tehnologije in digitalna humanistika, pages 95–100. Znanstvena založba Filozofske fakultete, Ljubljana. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Kosem-et-al_Solar-2-0-nadgradnja-korpusa-solskih-pisnih-izdelkov.pdf
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, BSNLP@ACL 2019, pages 29–34. https://aclanthology.org/W19-3704.pdf
Tadeja Rozman, Mojca Stritar Kučuk, and Iztok Kosem. 2012.
Šolar – korpus šolskih pisnih izdelkov. In: T. Rozman, ed., I. Krapš Vodopivec, M. Stritar, and I. Kosem: Empirični pogled na pouk slovenskega jezika, pages 15–35. Trojina, zavod za uporabno slovenistiko, Ljubljana.
Mojca Stritar Kučuk. 2020. Modul Leto plus – prvi korak do korpusa slovenščine kot tujega jezika. In: Zbornik konference Jezikovne tehnologije in digitalna humanistika 2020, pages 131–135. Inštitut za novejšo zgodovino, Ljubljana. http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_StritarKucuk_Modul-Leto-plus%e2%80%93prvi-korak-do-korpusa-slovenscine-kot-tujega-jezika.pdf
Mats Wirén, Arild Matsson, Dan Rosén, and Elena Volodina. 2019. SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora. In: Selected Papers from the CLARIN Annual Conference 2018, Pisa, 8–10 October 2018, pages 227–239. https://ep.liu.se/en/conference-article.aspx?series=ecp&issue=159&Article_No=23
Korpusni pristopi za identifikacijo metafore in metonimije: primer metonimije v korpusu g-KOMET
Špela Antloga
Fakulteta za elektrotehniko, računalništvo in informatiko, Univerza v Mariboru, Koroška cesta 46, 2000 Maribor, s.antloga@um.si
Abstract
Recognition of the value and pervasiveness of metaphorical and metonymic expressions in language has, over the last twenty years, led to increased interest in the systematic identification and extraction of such figurative expressions in corpora of individual languages. Expressions involving the conceptual mappings that take part in metaphorical and metonymic processes are difficult to extract from corpora that are not specifically annotated for the purposes of figurative language research.
In this paper I present the most common methods for extracting metaphorical and metonymic expressions from language corpora and, using the example of the g-KOMET corpus, which is manually annotated for metaphorical and metonymic expressions in spoken Slovene, illustrate an attempt at systematizing metonymic transfers.
Corpus approaches to metaphor and metonymy identification: The case of metonymy in g-KOMET
Recognizing the value of metaphorical and metonymic expressions in language has in the last two decades led to increased interest in the systematic identification and extraction of figurative expressions in various language corpora. Expressions in which conceptual mappings that participate in metaphorical and metonymic processes take place are difficult to extract from a corpus that is not specifically annotated for the purposes of figurative language research. We describe prevailing methods of searching for metaphorical and metonymic expressions in language corpora. Using g-KOMET, the manually annotated corpus of metaphorical and metonymic expressions in spoken Slovene, we try to systematize some of the most frequently annotated metonymic mappings.
1. Introduction
Language and thought are closely connected. Our thinking is so complex that we are not always able to express everything "directly" with language, which is why we use various linguistic-cognitive procedures to make sense of the world, among them metaphors and metonymies. There is little corpus research on metaphor and metonymy, or on other forms of figurative language, in Slovene. Although in the last decade corpus methods for researching Slovene have become an established empirical paradigm in linguistics, above all in areas connected with lexicology, grammar, and language use, the field of figurative language, which at the theoretical level gained momentum with the rise of the theory of conceptual metaphor and metonymy (Lakoff and Johnson, 1980; Lakoff and Turner, 1989; Lakoff, 1993), lags somewhat behind this trend (Bedkowska-Kopczyk, 2016; Antloga, 2020c). One possible reason is the lack of a unified and successful method for the systematic identification of metaphorical and metonymic expressions in existing corpora that are not specifically annotated for conceptual mappings. Consequently, for the systematic analysis of conceptual structures in language, linguists have resorted to building corpora with annotated potential metaphorical and metonymic expressions; building such corpora, however, is time-consuming and requires extensive adaptation of annotation schemes to the target language of research. This paper describes the various more or less established methods for identifying metaphorical and metonymic expressions in existing (general) text corpora, with all their advantages and drawbacks. As one of the resources for the systematic analysis of metaphorical and metonymic expressions in spoken Slovene, the paper presents the g-KOMET corpus, created within the CLARIN 2021 call. Using the g-KOMET corpus, an attempt at systematizing and classifying the most frequently annotated metonymic transfers in spoken Slovene will be presented.
2. Defining metaphor and metonymy in cognitive linguistics
One of the key findings of the contemporary view of metaphor and metonymy is that we do not use metaphors and metonymies merely for linguistic communication, but that we also think in them. In this spirit, conceptual metaphor theory is interested above all in the ways concepts are mentally organized, through which people make sense of the reality that surrounds them and the society in which they live (Bratož, 2010). For linguists, such questions were initially a particular challenge, since they require looking beyond the boundaries of linguistics to other disciplines, such as psychology, neuroscience, philosophy, and other fields, and thus presuppose an interdisciplinary way of working. At the end of the 1970s, the so-called cognitive turn thus took place, which moved metaphor and metonymy from the linguistic to the conceptual, mental level. Metaphor and metonymy began to be treated as conceptual mechanisms through which knowledge of concrete phenomena and experiences is projected onto numerous abstract domains. For example, we typically conceptualize time as space, emotions as natural forces, and organizations as organisms or machines (Bratož, ibid.).
ŠTUDENTSKI PRISPEVKI / STUDENT PAPERS
2.1. Metaphor
According to the contemporary definition, then, metaphors are not only a linguistic expression; we also think in them. Among the various theoretical approaches devoted to the study of metaphor, one of the most prominent today in research on metaphor and metaphorical expressions is the theory of conceptual metaphor developed by George Lakoff and his colleagues (Lakoff and Johnson, 1980; Lakoff and Turner, 1989). According to this theoretical model, metaphors are an essential element of human cognition and a means that allows us to understand and experience one experiential field or domain through (in terms of) another. The transfer proceeds via so-called cross-domain mappings between the source domain, which is usually more concrete, and the target domain, which is more abstract.
Table 1: Distinguishing between metaphor and metonymy; adapted from Feyaerts (2012). (Only one row of the table survives extraction: the nature of the conceptual relation, namely similarity for metaphor and a (logical) connection for metonymy.)
2.2. Metonymy
Traditional rhetoric treated metonymy primarily as a rhetorical figure, that is, it thought of it as a linguistic phenomenon, as an object of figurative language (Radden and Kövecses, 1999). Aristotle, too, did not fully recognize the characteristics of metonymy and regarded it as a subtype of metaphor (Bernjak and Fabčič, 2018). A similar definition of metonymy can also be found in contemporary dictionaries, e.g. in the Dictionary of Standard Slovene (Slovar slovenskega knjižnega jezika).1 Jakobson (1956) emphasized the inherence of metonymy in language and highlighted the notion of contiguity as the fundamental principle of metonymy. Cognitive linguists draw on these and similar views and extend the phenomenon of metonymy to a conceptual-semantic mechanism that enables the structuring of language and thought, thus acting as a central instrument in the process of conceptualization. Lakoff and Johnson (1980: 46–52) define metonymy at the level of conceptualization as
3. Methods for extracting metaphorical (and metonymic) expressions from corpora
In connection with metaphor and metonymy, the lack of a suitable methodology makes problematic above all the (systematic) identification and extraction of relevant data from a general language corpus. The conceptual mappings that take part in metaphorical and metonymic processes are not directly tied to particular linguistic forms and are difficult to extract from corpora that are not specifically annotated for the purposes of figurative language research. Through a combination of automatic and manual data extraction from general corpora, the following methods for identifying metaphorical (and metonymic) expressions have taken shape in other languages (Stefanowitsch, 2006):
Manual extraction of metaphorical words from a corpus became established out of the need for a (more) systematic analysis of conceptual metaphor and metonymy: reading a text in the corpus was followed by systematically writing out metaphorical and metonymic expressions (Semino and Masci, 1996). The work was, of course, time-consuming and limited in scope, and above all it failed to exploit the amount of data in the corpus, but it was certainly more systematic than relying on sporadic examples or on examples that did not arise from actual language use.
Kljub temu so kognitivistom očitali pojmovno operacijo ali kognitivni proces, v katerem eno, subjektivnost, neempiričnost in nekonsistentnost pri izhodiščno entiteto uporabimo zato, da nam omogoča prepoznavanju (iskanju) in razlagi konceptualnih metafor mentalni dostop do druge, ciljne entitete znotraj določene in metonimij (npr. Tummers et al., 2005; Wasow in Arnold, pojmovne domene. Torej metonimijo obravnavata kot 2005). pojmovno-pomenski mehanizem, ki strukturira ne samo Metaforični in metonimični izrazi so v izhodiščni jezik, ampak tudi naše mišljenje. Če pri metafori prihaja do domeni preslikave vedno povezani z neprenesenimi preslikave z enega konceptualnega področja na drugo, (nefigurativnimi) leksikalnimi enotami. Zato je bila kot metonimija vključuje samo eno domeno, saj do preslikave odziv na kritike naslednja stopnja korpusnega pristopa k med dvema elementoma prihaja v okviru ene same domene. figurativnemu jeziku iskanje izhodiščne domene po Lakoff in Johnson (1980) poudarjata, da je tako kot ključnih besedah oziroma identifikacija metafor na metafora tudi metonimija konceptualne narave in da gre za podlagi potencialnih izhodiščnih domen (pomensko polje, fenomen, ki igra osrednjo vlogo pri strukturiranju našega za katerega se predpostavlja oziroma je bilo že ugotovljeno, védenja o svetu. Kövecses (2002) pravi, da je metonimija da sodeluje pri metaforičnih preslikavah, kot so na primer kognitivni proces, v katerem do določene konceptualne srce, ogenj, boj, potovanje ipd.). Iskanje lahko poteka preko entitete (cilja) pridemo s pomočjo druge konceptualne posameznih besed v konceptualni strukturi ali preko entitete (sredstva). Z drugimi besedami, ena konceptualna skupine besed, ki so pomensko povezane (na primer ogenj, entiteta je referenčna točka, ki omogoča mentalni dostop do plamen, vročina, pogoreti, zgoreti, plamteti, vzplamteti druge konceptualne entitete. ipd.). 
Z ročnim pregledovanjem rezultatov je bila določena Bolj shematično primerjavo konceptualne metafore in potencialna metaforičnost izraza in nato ciljna domena metonimije predstavlja Tabela 1. metaforične preslikave (npr. LJUBEZEN, JEZA ipd.). Postopoma so se začeli izoblikovati seznami ključnih besed metafora metonimija izhodiščnih domen za identifikacijo metafor v posameznih funkcija sklepanje na referencialnost jezikih. Jezikoslovci so nato na podlagi seznamov konceptualnega podlagi raziskovali metafore v različnih jezikih, kontekstih in razmerja podobnosti diskurzih (Hanks, 2004; Koller, 2006). Postopna uveljavitev identifikacije metaforičnih in metonimičnih izrazov v korpusih z iskanjem po ključnih 1 »metonimija-e ž lit. besedna figura, za katero je značilno poimenovanje določenega pojma z izrazom za kak drug predmetno, količinsko povezan pojem«. ŠTUDENTSKI PRISPEVKI 272 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 besedah izhodiščne domene je vodila k zanimanju za obliki indirektne, direktne in implicitne metaforične raziskovanje figurativnega jezika v konkretnejših, bolj besede2 v štirih besedilnih tipih (časopisna besedila, specifičnih domenah, npr. v političnem diskurzu, v strokovna besedila, literarna besedila in konverzacijska ekonomiji, športu ipd. V teh primerih pristop, usmerjen v besedila) za angleški jezik3 je leta 2012 razvila skupina izhodiščno domeno, ni bil učinkovit, saj bi zahteval raziskovalcev, ki se je poimenovala Praglejazz. Ob tem je predhodno poznavanje vira preslikave (izhodiščne razvila postopek za ugotavljanje metaforičnih besed v domene), ki bi lahko bil potencialno najden v ciljni domeni. besedilu, poimenovan MIPVU (Steen et al., 2010), da bi Zato se je uveljavila metoda iskanja ciljne domene s omogočila objektivnejšo, natančnejšo in bolj sistematično seznamom ključnih besed izhodiščnih domen. 
Za (jezikoslovno) analizo metaforičnih izrazov v različnih učinkovito identifikacijo metaforičnih in metonimičnih besedilih. Temeljno izhodišče za označevanje metaforičnih izrazov s ključnimi besedami ciljne domene je potrebna besed pri tem postopku je ugotavljanje razmerja med velika količina reprezentativnih in enotematskih besedil, ki osnovnim in kontekstualnim pomenom besede. Pri tem je so povezana z iskano ciljno domeno. To je relativno treba za vsako leksikalno enoto ugotoviti, ali se njen enostavno pri »konkretnih« ciljnih domenah, kot so zgoraj konkretni kontekstualni pomen razlikuje od njenega naštete POLITIKA, EKONOMIJA, ŠPORT, težje pa bi osnovnega pomena. Postopek je s prilagoditvami bilo iskanje metaforičnih in metonimičnih izrazov s značilnostim posameznih jezikov sprožil zanimanje za ciljnimi domenami, kot so na primer ČUSTVOVANJE, identifikacijo metaforičnih izrazov in metafor v češčini UMSKA AKTIVNOST, ZAZNAVANJE ipd. (nekaj (Pavlas et al., 2018), litovščini (Urbonaitė, 2016), rešitev ponuja Tissari, 2003). Drugi problem, povezan s madžarščini (Babarzy in Bencze, 2010), poljščini (Risinski tovrstno identifikacijo metafor v korpusu, pa je, da bi in Mahula, 2015), srbščini (Bogetić, 2019) ter za izdelavo identificirali le tiste izhodiščne domene, ki so povezane z korpusov metafor v ruščini (Badryzlova in Lyashevskaya, izrazi, katerih pogostnost je v ciljni domeni tako visoka, da 2017), hrvaščini (Despot et al., 2019) in kitajščini (Lu in so se uvrstili na seznam ključnih besed ciljnih domen. Wang, 2017). Eden od poskusov oblikovanja korpusa Analiza metaforičnih prenosov torej ne bo celovita in metafor v slovenščini, ki bi omogočal jezikoslovno analizo sistematična. 
metaforičnih izrazov in metafor v različnih besedilih ter Z združitvijo obeh predhodno navedenih metod se je ponujal možnost za prepoznavanje kulturnospecifičnega uveljavila metoda iskanja stavkov, ki vsebujejo ključne pomena metafor, je korpus metafor KOMET 1.0 (Antloga, besede tako izhodiščne kot ciljne domene, predvsem v 2020a) in njegovo nadaljevanje z dodanimi transkripcijami obliki avtomatskega luščenja metaforičnih izrazov. Kljub govorjenega jezika korpus g-KOMET (Antloga in Donaj, temu metoda še vedno zahteva poglobljen ročni pregled 2022). izluščenih podatkov zaradi možnih enakopisnic ali neprenesenega pomena obeh izrazov v stavku. Problem je 4. Korpus g-KOMET tudi, da je za tako iskanje potreben zelo izčrpen seznam Korpus g-KOMET4 (korpus metaforičnih in besed z obeh domen, saj je sicer iskanje nepopolno. Poleg metonimičnih izrazov v govorjenem jeziku) je nadgradnja tega je ta metoda bolj uporabna za raziskovanje že poznanih pisnega korpusa metaforičnih izrazov in metafor KOMET konceptualnih struktur, metafor in metonimij, manj pa za 1.0 s transkripcijami (po)govora v obsegu 52.529 besed. sistematično identifikacijo (novih oziroma vseh) Nadgradnja vključuje tudi definiranje in ročno dodajanje konceptualnih struktur. novih oznak v primerjavi s korpusom KOMET 1.0, in sicer Nekaj poskusov identifikacije metaforičnih izrazov je oznak za idiome in metonimije. Besedilo za korpus je bilo potekalo tudi s t. i. kazalniki metaforičnosti, to so izluščeno iz korpusa GOS. Glede na želeno velikost našega metajezikovni izrazi, ki napovedujejo oziroma signalizirajo korpusa smo iz vsake datoteke korpusa GOS izbrali 5 % metaforično rabo. Goatly (1997) kot metaforične besedila. Pri tem smo naključno izbrali začetno izjavo5 signalizatorje navaja izraze, kot so govora in dodajali zaporedne izjave govora, dokler nismo metaphorically/figuratively speaking (metaforično/ dosegli želene velikosti. 
Če smo velikost dosegli sredi figurativno rečeno, v prenesenem pomenu), so to speak izjave, smo dodali tudi vse preostale besede v njej. S tem (tako rekoč/če tako rečem), intenzifikatorje literally smo dosegli končno velikost korpusa 52.529 besed z enako (dobesedno), actually (pravzaprav) ali celo ortografska uravnovešenostjo besedila, kot je prisotna v korpusu GOS. znamenja, kot so narekovaji, poševni tisk ipd. S to metodo Korpus torej vključuje uravnotežen nabor transkripcij lahko sicer izluščimo relativno malo metaforičnih izrazov, informativnega, izobraževalnega, razvedrilnega, zasebnega vendar lahko po drugi strani opazujemo jezikovne (telefonski pogovor, osebni stik) in nezasebnega (telefonski okoliščine, ko je metaforična raba v besedilu namerno (ali pogovor, osebni stik) diskurza. Če je bila beseda zapisana nenamerno) eksplicitno signalizirana (Skorczynska in tako v pogovorni kot normalizirani obliki, smo prevzeli Ahrens, 2015). normalizirano obliko. Pri tem se nekatere pogovorne Ena od zadnjih uveljavljenih metod je besede zapišejo kot dve besedi v normalizirani obliki, npr. iskanje po korpusu, označenem s konceptualn »nemo« v »ne bomo«. Pri izluščanju besedila smo imi preslikavami. Prvi korpus, označen s konceptualnimi preslikavami v odstranili časovne oznake in oznake za menjavo govornih 2 Ne gre za označevanje metafor, ampak besed, ki se 4 Projekt izdelave korpusa je bil financiran v okviru potencialno lahko realizirajo kot metafore. projekta CLARIN.si 2021. Korpus je dostopen na naslovu 3 Gl. http://hdl.handle.net/11356/1293. http://www.vismet.org/metcor/search/showPage.php?page 5,6 Izjavo in govorno vlogo razumemo, kot sta =start. opredeljeni v specifikacijah za transkribiranje GOS, gl. Zwitter Vitez et al., 2009. 
ŠTUDENTSKI PRISPEVKI 273 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 vlog,6 saj začetki in konci izluščenega dela besedila niso hkrati začetki in konci govornih vlog. Ohranili pa smo druge oznake, npr. smeh, hrup, in prekinjene besede oz. Tabela 2: Označeni figurativni elementi v korpusu g- napačne začetke. Za označevanje je bilo uporabljeno orodje KOMET. Q-CAT (Brank, 2019). 4.1. Označevanje metaforičnih besed 5. Analiza in klasifikacija označenih (oznake MRWi, MRWd, MFlag in metonimičnih izrazov v korpusu g- WIDLI) KOMET Označevanje metaforičnih besed je temeljilo na Čeprav sta bili od samih začetkov kognitivnega postopku za identifikacijo metafor MIPVU (Steen et al., jezikoslovja predmet zanimanja kognitivne semantike tako 2010),7 ki omogoča sistematično identifikacijo jezikovne metafora kot metonimija, je bila pozornost vseskozi metafore. Identificirani so bili jezikovni izrazi, ki imajo usmerjena zlasti na metaforo. Še danes je raziskovanje potencial, da jih ljudje realiziramo kot metafore. Za vsako metonimije v primerjavi z metaforo zelo marginalno, leksikalno enoto v besedilu je bil določen njen osnovni čeprav številni jezikoslovci prepoznavajo ključni pomen pomen (po SSKJ) in njen pomen v kontekstu. Če se je metonimije v vsakdanjem jeziku in poudarjajo raznovrstne kontekstualni pomen razlikoval od osnovnega pomena te metonimične relacije kot načine organizacije konceptualne besede, je bila beseda označena kot metaforična beseda strukture (Bratož, 2010). V korpusu g-KOMET je bilo (MRW). Označenim metaforičnim besedam je bila nato označenih 744 metonimičnih izrazov, ki jim je bila dodana pripisana informacija o tem, ali gre za (1) indirektno ena od 54 oznak za različne metonimične prenose. metaforo (MRWi), (2) direktno metaforo (MRWd) ali (3) mejni primer (WIDLI). 
Označeni so bili tudi (4) Tip metonimičnega Odstotek glede na vse metaforični signalizatorji (MFlag).8 Korpus je označevala prenosa označene metonimične ena oseba. izraze v korpusu g- KOMET 4.2. Označevanje stalnih besednih zvez splošno za specifično 16,8 % (oznaka idiom) institucija za osebo 9,7 % Označene so bile večbesedne enote, katerih pomen je (skupino) različen od pomena posameznih sestavin večbesedne enote. del za celoto 7,1 % Vsaj ena sestavina v označeni stalni besedni zvezi je bila rezultat dejanja za 6,4 % torej rabljena metaforično. dejanje ime za delo 6,3 % 4.3. Uvrščanje v pomensko polje lastnost za osebo 6 % metaforičnega prenosa (oznaka frame) smer za cilj 5,6 % Označeni metaforični izrazi in stalne besedne zveze so celota za del 3,6 % bili uvrščeni v pomenska polja, ki funkcionirajo kot sistem predmet za aktivnost 3,6 % kategorij, ki so strukturirane glede na določen kontekst, ki kraj za osebo 3,5 % jih motivira. Pomensko polje omogoča, da znotraj določene (skupino) pomenske kategorije (npr. naravni pojavi, čas, prostorska last za aktivnost 2,1 % orientacija, družina, premikanje itd.) poiščemo metaforične del telesa za osebo 1,6 % izraze, ki so lahko potencialno uresničitev neke (skupino) konceptualne strukture. V korpusu g-KOMET je bilo sredstvo dejanja za 1,3 % označenim metaforičnim besedam in stalnim besednim rezultat dejanja zvezam določenih 65 pomenskih polj. ideologija za osebo 1,3 % (skupino) 4.4. Označevanje metonimij dejanje za rezultat 1,2 % Če se pri metaforah dogaja preslikava z enega dejanja izkušenjskega področja na neko drugo izkušenjsko stavba za institucijo 1,2 % področje, se pri metonimijah preslikava dogaja znotraj podjetje za delavca 1,2 % enega področja, pri čemer ugotavljamo razmerje med (skupino) obema entitetama preslikave. Ugotovljenim metonimičnim kraj za dogodek 1,2 % izrazom je bilo določenih 45 tipov metonimične preslikave. 
Tabela 3: Najpogostejši označeni metonimični prenosi Označeni elementi Število označenih besed v odstotkih glede na vse označene metonimične izraze v (odstotek); ⅀ = 52.529 besed korpusu g-KOMET. metaforične besede 728 (1,38 %) idiomi 256 (0,49 %) Namesto tradicionalne opredelitve tipov metonimije metonimije 744 (1,42 %) glede na metonimični prenos (gl. zgoraj) navajam še pomenska polja 65 alternativno, vsebinsko delitev metonimije, kot izhaja iz 8 Za podrobnejšo razlago metodoloških izhodišč za 7 Metaphor Identification Procedure Vrije Universiteit definiranje označevalne sheme glej Antloga 2020a. (MIPVU). ŠTUDENTSKI PRISPEVKI 274 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 označenega korpusa g-KOMET. Delitev izhaja iz dejavnosti, prostora dejavnosti ipd. na osebo, ki opravlja to predpostavke, da lahko metonimije kategoriziramo glede dejavnost. Razdelimo jih lahko v naslednje podkategorije: na vrsto pojmovne vsebine, do katere se dostopa preko metonimije. Konceptualna metonimija je tako klasificirana OSEBA ZA AKTIVNOST glede na to, katero konceptualno vsebino aktivira v (…) zadnjič gledal nogometaše (…) metonimičnem prenosu. Navedeni so zgolj tipi metonimije, OSEBA, VKLJUČENA V AKTIVNOST, NAMESTO ki so najpogostejši v korpusu g-KOMET. Za smiselne AKTIVNOSTI zaključke o vlogi metonimije v govorjenem jeziku/konverzaciji bi bila nujna primerjava z zastopanostjo OSEBA ZA TEORIJO in vlogo metonimije tudi v negovorjenih besedilih. (…) vsi citirajo Žižka (…) PREDSTAVNIK TEORETIČNEGA PRISTOPA 5.1. Metonimija STVAR ZA X NAMESTO IZHODIŠČ TEGA PRISTOPA Metonimije STVAR ZA X so metonimije, katerih cilj (predvideni referent) je OSEBA ZA LOKACIJO STVAR, do katere se dostopa s pomočjo referenčne vsebine, ki je z njo povezana v istem (…) pa pri zdravniku sto let čakala (…) idealiziranem kognitivnem modelu. 
Metonimije STVAR ZA OSEBA, KI OPRAVLJA DEJAVNOST, NAMESTO X lahko razdelimo v podkategorije glede na konceptualno PROSTORA, KJER SE OPRAVLJA DEJAVNOST izhodišče metonimičnega prenosa: 5.4. Metonimija LOKACIJA ZA X STVAR ZA STVAR Pri metonimijah LOKACIJA ZA X je LOKACIJA Metonimični prenos omogoča neposredni mentalni uporabljena za priklic ene ali več entitet, ki so na tej dostop do stvari preko neke druge stvari ali njene vloge lokaciji. Ker sta lokacija in to, kar se nahaja na lokaciji, v ali funkcije v situaciji, ki jo ta stvar opravlja. nekakšni prostorski relaciji, bi lahko tovrstne metonimije opredelili tudi kot DEL ZA CELOTO. Metonimije (…), da kozica vre 20 do 25 minut (…) → LOKACIJA ZA X lahko razdelimo v podkategorije: POSODA (kozica) NAMESTO VSEBINE (vode v kozici) LOKACIJA ZA DOGODEK STVAR ZA ČLOVEKA (SKUPINO) (…) to mi je ostalo od Otočca (…) (…) so samo še bobni igrali (…) KRAJ, KJER JE POTEKAL DOGODEK, NAMESTO INŠTRUMENT NAMESTO GLASBENIKA, KI IGRA TA DOGODKA INŠTRUMENT LOKACIJA ZA INSTITUCIJO STVAR ZA LASTNOST (…) se zmenijo na Čufarjevi (…) (…) vidi mercedesa ko se pogleda v ogledalo (…) IME ULICE NAMESTO STAVBE NA TEJ ULICI AVTO NAMESTO VRLINE/POMANJKLJIVOSTI LOKACIJA ZA STVAR STVAR ZA DODODEK (…) da sem kar McDonald’s prinesla domov (…) (…) na rdeči preprogi (…) znova zablestela (…) RESTAVRACIJA NAMESTO JEDI V RESTACRACIJI SVAR NA DOGODKU NAMESTO CELOTNEGA DOGODKA LOKACIJA ZA OSEBO (SKUPINO) (…) gostilna pa vse čisto tiho (…) 5.2. Metonimija LASTNOST ZA X PROSTOR, KJER SE ZADRUŽUJE OSEBA (SKUPINA), Pri metonimijah LASTNOST ZA X je za cilj (predvideni NAMESTO OSEBE (SKUPINE) V TEM PROSOTRU referent je LASTNOST) prenosa pomembno, da je posameznik ali skupina znotraj kategorije »idealnih Metonimije lahko opazujemo tudi glede na vidik, ki članov« določa izhodišče/sredstvo (vehikel) metonimičnega te kategorije, kar je pogojeno z bližino posameznika ali skupine idealu, ki ga postavlja standardni prenosa. 
Pogled izhaja iz predpostavke kognitivnega referent, npr. stereotipna lastnost (ki nadomesti preostale jezikoslovja, da ima konceptualna metonimija izkustvene lastnosti), vidna lastnost (ki nadomesti čustvene lastnosti) in spoznavne temelje, njene jezikovne uresničitve pa so samo ena od možnih oblik, skozi katere se izraža. Zato ipd. Glede na konceptualno izhodišče metonimičnega prenosa v korpusu g-KOMET jih lahko razdelimo v dve kognitivizem uporabi pojem idealiziranih kognitivnih skupini: modelov (IKM), ki predstavljajo abstrakcijo človekovih izkustev. Delujejo kot abstrahirane sheme, ki delno zajemajo naše vedenje o svetu. Z LASTNOST ZA SKUPINO a kognitivne pristope je (…) taka mesta ki so (…) pa črni primarno vprašanje, zakaj izberemo prav določeno so tam (…) ČRNA BARVA NAMESTO TEMNOPOLTIH LJUDI konceptualno entiteto za metonimični izraz, in ne neke druge. Na tej podlagi (razširjeno po Radden in Kövecses, 1999) lahko označene metonimične izraze opazujemo tudi: LASTNOST ZA OSEBO (…) najlepša danes (..) - z vidika povezave med pogostnostjo IZGLED OSEBE NAMESTO OSEBE metonimičnega prenosa in človekovim izkustvom (npr. metonimični prenosi v korpusu 5.3. Metonimija OSEBA ZA X g-KOMET splošno za specifično (125) : Metonimije OSEBA ZA X so pogoste metonimije, pri specifično za splošno (3); konkretno za abstraktno katerih prihaja do prenosa človekove dejavnosti, rezultatov (7) : abstraktno za konkretno (3); definirano za ŠTUDENTSKI PRISPEVKI 275 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 nedefinirano (2) : nedefinirano za definirano (0)). Špela Antloga in Gregor Donaj. 2022. Korpus g-KOMET. Zaradi lažjega razumevanja je bolj verjetno, da Slovenian language resource repository CLARIN.SI. bodo metonimični prenosi potekali s splošnega na http://hdl.handle.net/11356/1490. specifično, s konkretnega na abstraktno ipd. Yulia Badryzlova in Olga Lyashevskaya. 
2017. Metaphor Shifts in Constructions: the Russian Metaphor Corpus. Povezanost z visoko pogostnostjo označenih V: Computational construction grammar and natural tovrstnih metonimičnih prenosov v korpusu je ena language understanding: Papers from the 2017 AAAI od bistvenih (najpogostejših) funkcij metonimije, Spring Symposium. The AAAI Press. tj. referencialna funkcija, ki je nekakšna bližnjica Anna Babarczy in Idiko Bencze. 2010. The automatic za označevanje kompleksnega in abstraktnega identification of conceptual metaphors in Hungarian pojava z enostavnejšim, konkretnejšim in texts: A corpus-based analysis. V: LREC 2010 Workshop razumljivejšim pojavom (izrazom); on Methods for the Automatic Acquisition of Language Resources: Proceedings, str. 31–36. - z vidika povezave med pogostnostjo Elizabeta Bernjak in Melanija Fabčič. 2018. Metonimija metonimičnega prenosa in kulturno preferenco kot konceptualni in jezikovni pomen. Anali PAZU HD (v korpusu g-KOMET lahko opazujemo različne 4/1-2: 11–23. Združenje Pomurska akademsko kulturnospecifične metonimične prenose lastnost znanstvena unija. za osebo (45), lastnost za stvar (9), lastnost za Agnieszka Bedkowska-Kopczyk. 2016. Začutiti in občutiti: institucijo (1), posameznik za skupino (4), kognitivna analiza pomensko-skladenjskih lastnosti ideologija za človeka (skupino) (11), ustanova za dveh predponskih tvorjenk iz glagola čutiti. V: E. človeka (skupino) Kržišnik in M (72) ipd.). Te pojmovne sheme . Hladnik, ur., Toporišičeva obdobja, str. združujejo posamezne elemente, povezane z 41–48. Ljubljana: Znanstvena založba Filozofske fakultete. našim kulturnospecifičnim védenjem o svetu, Ksnenija Bogetić. 2019. Linguistic metaphor identification družbi, konvencijah in običajih. V konkretni in Serbian. V: S. Nacey in T. Krennmayr, ur., MIPVU in jezikovni situaciji pogosto kontekst in izkustvo Multiple Languages, str. 203–226. Amsterdam: John določata, kateri segment enciklopedičnega Benjamins. 
védenja se bo profiliral kot pomemben in se Janez Brank. 2019. Q-CAT Corpus Annotation Tool. jezikovno realiziral. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1262. 6. Nadaljnje delo Silva Bratož. 2010. Metafore našega časa. Fakulteta za management, Koper. Janez Brank. 2019. Q-CAT Corpus Annotation Tool, Identifikacija in analiza metonimičnih in metaforičnih Slovenian language resource repository CLARIN.SI, izrazov s korpusnega vidika imata v slovenščini pred sabo ISSN 2820-4042, http://hdl.handle.net/11356/1262. še dolgo pot. Čeprav so nekatere metaforične preslikave in Kristina Despot, Mirjana Tonković, Mario Brdar, Benedikt metonimični prenosi univerzalni oziroma prisotni v več Perak, Ana Ostroški Anić, Bruno Nahod in Ivan Pandžić. jezikih, je unikatna njihova frekvenca pojavljanja v 2019. MetaNet.HR: Croatian Metaphor Repository. V: posameznih jezikih, njihova realizacija in vpetost v Metaphor and Metonymy in the Digital Age. Theory and kulturnospecifične elemente jezikovnega prostora. Za Methods for Building Repositories of Figurative nadaljnjo analizo metaforičnih izrazov v slovenskem jeziku Language, str. 123–146. Amsterdam: John Benjamins. bo zanimiva primerjava korpusa KOMET 1.0, v katerem so Kurt Feyaerts. 2012. Refining the Inheritance Hypothesis: označene metaforične besede v zapisanem jeziku, in Interaction between metaphoric and metonymie korpusa g-KOMET, ki vsebuje govorjena besedila v obliki hierarchies. V: A. Barcelona, ur., Metaphor and transkripcij. Ker so bile v korpus g-KOMET dodane tudi Metonymy at the Crossroads: A Cognitive Perspective, oznake za metonimične prenose, je eden od naslednjih str. 59–78. Berlin: De Gruyter Mouton. ciljev tudi sistematična analiza metonimije v govorjenem Raymond W. Gibbs. 1999. Researching Metaphor. V: jeziku. Researching and applying metaphor, str. 29–47. Cambridge: Cambridge University Press. Stefan Gries in Anatol Stefanowitsch. 2004. Extending 7. 
Literatura collostructional analysis: A corpus-based perspective on 'alternations'. International Journal of Corpus Špela Antloga. 2020a. Korpus metafor KOMET 1.0. Linguistics 9/1: 97–129. Slovenian language resource repository CLARIN.SI. Adrew Goatly. 1997. The Language of Metaphors. London http://hdl.handle.net/11356/1293. & New York: Routledge. Špela Antloga. 2020b. Korpus metafor KOMET 1.0. V: Patrick Hanks. 2004. The syntamatics of metaphor and Jezikovne tehnologije in digitalna humanistika idiom. International Journal of Lexicography 17/3: 245– [elektronski vir]: zbornik konference: 24.–25. september 274. 2020, str. 176–170. Ljubljana: Inštitut za novejšo Roman Jakobson. 1956. The Metaphoric and Metonymic zgodovino. Poles. V: Metaphor and Metonymy in Comparison and Špela Antloga. 2020c. Vloga metafor in metaforičnih Contrast, str. 41–47. Berlin/New York: Mouton de izrazov v medijskem diskurzu: analiza konceptualizacije Gruyter. boja. V: J. Vogel, ur., Slovenščina – diskurzi, zvrsti in Veronika Koller. 2006. Of critical importance: Using jeziki med identiteto in funkcijo, str. 27–34. Ljubljana: electronic text corpora to study metaphor in business Znanstvena založba Filozofske fakultete. media discourse. V: A. Stefanowitsch in S. Gries, ur., ŠTUDENTSKI PRISPEVKI 276 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Corpus-Based Approaches to Metaphor and Metonymy, Jose Tummers, Kris Heylen in Dirk Geeraerts. 2005. str. 237–266. Berlin: De Gruyter Mouton. Usage-based approaches in Cognitive Linguistics: A Zoltan Kövecses. 2002. Metaphor: A practical technical state of the art. Corpus Linguistics and Introduction. Oxford/New York: Oxford University Linguistic Theory 1(2): 225–261. Press. Justina Urbonaitė. 2016. Metaphor identification procedure George Lakoff. 1993. The contemporary theory of MIPVU: an attempt to apply it to Lithuanian. 
Taikomoji metaphor. V: Andrew Ortony, ur., Metaphor and kalbotyra [Applied Linguistics] 7: 1–25. thought, str. 202–251. Cambridge: Cambridge Thomas Wasow in Jennifer Arnold. 2005. Intuitions in University Press. linguistic argumentation. Lingua 115: 1481–1496. George Lakoff in Mark Johnson. 1980. Metaphors We Live Beatrice Warren. 2002. An alternative account of the By. University of Chicago Press. interpretation of referential metonymy and metaphor. V: George Lakoff in Mark Turner. 1989. More than Cool R. Dirven in R. Pörings, ur., Metaphor and Metonymy in Reason: A Field Guide to Poetic Metaphor. The Comparison and Contrast, str. 113–133. Berlin: De University of Chicago Press. Gruyter Mouton. Xiaofei Lu in Ben Pin Yun Wang. 2017. Towards a Ana Zwitter Vitez, Jana Zemljarič Miklavčič, Marko metaphor-annotated corpus of Mandarin Chinese. Stabej in Simon Krek. 2009. Načela transkribiranja in Language Resources and Evaluation 51/3: 663–694. označevanja posnetkov v referenčnem govornem Klaus-Uwe Panther in Günter Radden. 1999. The korpusu slovenščine. V: M. Stabej, ur., Infrastruktura potentiality for actuality metonymy in English and slovenščine in slovenistike, str. 437–442. Ljubljana: Hungarian V: K. U. Panther in G. Radden, ur., Metonymy Znanstvena založba Filozofske fakultete. in Language and Thought, str. 333–357. Amsterdam: John Benjamins. Dalibor Pavlas, Ondřej Vrabeľ in Jiří Kozmér. 2018. Applying MIPVU Metaphor Identification Procedure on Czech. V: Proceedings of the Workshop on Annotation in Digital Humanities co-located with ESSLLI 2018, str. 41–46. Sofia, Bulgaria. Pragglejaz Group. 2007. MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol 22 (1): 1–39. Günter Radden in Zoltan Kövecses. 1999. Toward a theory of metonymy. V: K.-U. Panther in G. Radden, ur., Metonymy in language and thought, str. 17–60. Amsterdam: John Benjamins. Maciej Rosiński in Joanna Marhula. 2015. MIPVU in Polish: On Translating the Method. 
RaAM Seminar 2015. Elena Semino in Michela Masci. 1996. Politics is football: metaphor in the discourse of Silvio Berlusconi in Italy. Discourse and Society 7/2: 243–269. Elena Semino. 2017. Corpus linguistics and metaphor. V: The Cambridge Handbook of Cognitive Linguistics, str. 463–476. Cambridge: Cambridge University Press. Hanna Skorczynska in Kathleen Ahrens. 2015. A corpus- based study of metaphor signaling variation in three genres. Text & Talk. An Interdisciplinary Journal of Language Discourse Communication Studies 35(3): 359–381. Slovar slovenskega knjižnega jezika, druga, dopolnjena in deloma prenovljena izdaja. www.fran.si. David Stallar. 1993. Two Kinds Of Metonymy. V: 31st Annual Meeting of the Association for Computational Linguistics, str. 87–94. Association for Computational Linguistics: Columbus, Ohio. Gerard J. Steen, Aletta G. Dorst, Berenike J. Herrmann, Anna A. Kall, Tina Krennmayr in Tryntje Pasma. 2010. A method for linguistic metaphor identification. From MIP to MIPVU. Amsterdam: John Benjamins. Anatol Stefanowitsch. 2006. Corpus-based approaches to metaphor and metonymy. V: A. Stefanowitsch in S. Th. Gries, ur., Corpus-Based Approaches to Metaphor and Metonymy, str. 1–17. Berlin: De Gruyter Mouton. Elen Tissari. 2003. LOVEscapes: Changes in Prototypical Senses and Cognitive Metaphors Since 1500. Societe Neophilologique. 
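Postopek 5-odstotnega vzorčenja iz datotek korpusa GOS, opisan v 4. razdelku (naključna začetna izjava, zaporedno dodajanje izjav do ciljne velikosti, izjava, sredi katere je cilj dosežen, pa se vključi v celoti), je mogoče orisati z naslednjo skico. Ime funkcije in predstavitev izjav kot seznamov besed sta hipotetična, namenjena zgolj ponazoritvi, in nista del dejanskega postopka izdelave korpusa:

```python
import random

def izberi_vzorec(izjave, delez=0.05, zacetek=None, seed=0):
    """Skica vzorcenja iz ene datoteke: izjave so seznami besed.
    Izbere naključno začetno izjavo, nato dodaja zaporedne izjave,
    dokler skupno število besed ne doseže deleža vseh besed; izjava,
    sredi katere je cilj dosežen, se vključi v celoti."""
    cilj = round(sum(len(i) for i in izjave) * delez)  # ciljno število besed
    if zacetek is None:
        zacetek = random.Random(seed).randrange(len(izjave))
    vzorec, besed = [], 0
    for izjava in izjave[zacetek:]:
        if besed >= cilj:
            break
        vzorec.append(izjava)  # vedno cela izjava, tudi če s tem cilj presežemo
        besed += len(izjava)
    return vzorec
```

Skica namenoma ne obravnava robnega primera, ko naključni začetek leži preblizu konca datoteke, da bi bil cilj sploh dosegljiv; kako je bilo to rešeno pri dejanski izdelavi korpusa, iz opisa ni razvidno.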
ŠTUDENTSKI PRISPEVKI / STUDENT PAPERS
Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

Neural Translation Model Specialized in Translating English TED Talks into Slovene

Eva Boneš∗, Teja Hadalin†, Meta Jazbinšek†, Sara Sever†, Erika Stanković∗
∗ Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana
{eb1690,es6317}@student.uni-lj.si
† Faculty of Arts, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana
{th3112,mj6953,ss6483}@student.uni-lj.si

Abstract

In this paper, we present our work on a neural translation model specialized in translating English TED Talks into Slovene. The aim is to provide transcriptions of the speeches in Slovene to make them available to a wider audience, possibly with the option of automatic subtitling. First, we trained a transformer model on general data, a collection of corpora from the Opus site, and then fine-tuned it on a specific domain, a corpus of TED Talks. To assess the functionality of the model, we evaluated the pretrained, general, and domain versions of the model. We evaluated the translations with automatic metrics and with manual methods – the adequacy/fluency criterion and end-user feedback. The analysis of the results showed that our translation model did not produce the expected results and cannot be used to translate speeches in real life. However, in the TED Talks addressing more everyday issues and using simple vocabulary, the translations successfully conveyed the main message of the speech. Further research should consider improvements such as including more specialized data covering only one specific topic.

1. Introduction

In this paper, we trained a transformer model from scratch on a large general corpus, which we then fine-tuned on a corpus consisting of TED Talks in order to make a model specialized for the translation of transcribed speeches. We also found a pretrained model to serve as a baseline against which we could compare our translation models. We then automatically and manually evaluated all three models on the validation datasets constructed from TED Talks. Finally, we evaluated the general translation model on the validation dataset constructed from the large general corpus.

In Section 3, we first describe the data we used. In the subsequent section, we describe the methods for both training and evaluating the models. In Sections 5 and 6, we present the results and discuss them.

1.1. Goal of the paper

The main goal of this project is to provide a useful and effective tool for translating and subtitling speeches from English to Slovene, and in this way to grant access to a wide range of talks and other speeches to the Slovene-speaking audience. This paper focuses on translating TED Talks, a form of learning and entertainment that has gained popularity in recent years. Since TED Talks are currently subtitled by volunteer translators, automatic subtitling would facilitate this process. Machine translation (MT) has been researched since the 1950s, but only recently, with the rise of deep learning, did it prove to be solvable, although the possibility of achieving fully automatic machine translation of high quality is still being questioned. This project was our attempt at machine translation of spoken language, which, if efficient, could also be used for automatic subtitling in general.

2. Related work

There are three main approaches to solving the MT problem, each with its own advantages and shortcomings. Rule-based machine translation (RBMT) is the oldest of the three and requires expert knowledge of both the source and the target language in order to develop syntactic, semantic, and morphological rules. Another approach, which gained popularity in the 1990s, uses statistical models based on the analysis of bilingual text corpora. The idea behind statistical machine translation (SMT), as proposed by Brown et al. (1990), is that, given a sentence in the target language, we seek the original sentence from which the translator produced it. Today, as in many fields of computer science, the current state-of-the-art approaches to machine translation are based on neural networks. The biggest challenge when building a successful English-to-Slovene (or vice versa) automatic translator is obtaining a sufficiently large bilingual corpus. As with all deep learning approaches, a large, high-quality dataset is crucial for the success of the model. To deal with this exact problem, many approaches to pre-training a network on monolingual data (which can be obtained easily) have been proposed.

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) uses two strategies to deal with the problem, namely masked language modeling (MLM) and next sentence prediction (NSP). By using these two strategies, models can exploit bigger datasets and gain more context-awareness.

In 2020, mRASP (Lin et al., 2021) was introduced. Its authors built a pretrained NMT model that can be fine-tuned for any language pair. They used 197M sentence pairs, which is considerably more than we could obtain for English-Slovene translations alone.
Although these methods have proven successful, one of the largest currently available databases of pretrained translation models was trained using just a standard transformer model and still achieved great results. The Tatoeba Translation Challenge (Tiedemann, 2020) aims to provide data and tools for creating state-of-the-art translation models. Its focus is on low-resource languages, to push their coverage and translation quality. It currently includes data for 2,963 language pairs covering 555 languages. Along with the data, pretrained translation models for multiple languages were also released and are regularly updated.

3. Dataset

3.1. General translation model

The datasets for the general translation model are the eight biggest corpora from the Opus site (https://opus.nlpl.eu (Tiedemann, 2012)) for the Slovene-English language pair. The corpora were chosen based on the quantity of the data, so that the general translation model would contain a large amount of diverse information. A brief look at the contents of each one shows that some datasets are of higher quality and more reliable because of the source of the original texts and their translations. Examples are the corpora from European institutions, such as Europarl, a parallel corpus extracted from the proceedings of the European Parliament from 1996–2011, and the DGT corpus, a collection of translation memories from the European Commission's Directorate-General for Translation. The other corpora are collections of translations from various Internet sources, which makes them less reliable; however, they are still very valuable because they ensure a large quantity of data. These include the CCAligned corpus, consisting of parallel or comparable web-document pairs in 137 languages aligned with English, the MultiCCAligned v1 multi-parallel corpus, the OpenSubtitles corpus, compiled from an extensive database of movie and TV subtitles, the Tilde MODEL corpus, consisting of over 10M segments of multilingual open data published on the META-SHARE repository, WikiMatrix v1, a parallel corpus from Wikimedia compiled by Facebook Research, the Wikimedia v20210402 corpus, and the XLEnt v1 corpus, created by mining CCAligned, CCMatrix, and WikiMatrix parallel sentences. The exact size of each one, complete with the number of tokens, links, sentence pairs, and words, is noted in Table 1.

3.2. Domain translation model

Our domain translation model is specialized in translating TED Talks. For the domain-specific training, we opted for the two TED Talk corpora accessible on the Opus website – the TED2013 and TED2020 corpora. The included texts are mainly transcripts of speeches on various topics and their Slovene translations. Both datasets add up to 1.8 million words (MOSES format) and 2.1 million tokens, which is enough to form a well-rounded base for machine learning. For more information about the domain-specific corpora, see Table 2.

We expanded the datasets by manually aligning 15 TED Talks from 2018 and 2019 that are available on the TED website (https://www.ted.com/talks).

4. Methods

4.1. Pretrained model

As a baseline for evaluating our models, we used an already trained model, available on HuggingFace (Tiedemann, 2020). It is a transformer-based multilingual model that includes all the South Slavic languages. The framework provides both a South Slavic to English model and an English to South Slavic model. On the Tatoeba test dataset for Slovene, the English to South Slavic (en-zls) model achieved a BLEU score of 18.0 and a chr-F score of 0.350.

The model in question was trained using MarianNMT (Junczys-Dowmunt et al., 2018). The authors applied a common setup with 6 self-attentive layers in both the encoder and the decoder network, using 8 attention heads in each layer. SentencePiece (Kudo and Richardson, 2018) was used for segmentation into subword units.

The translation model can be loaded through the transformers library in Python; for translation into Slovene, the Slovene language label (>>slv<<) must be added at the beginning of each sentence.

4.2. Training from scratch

Several frameworks exist for natural language processing tasks, each with its own advantages and shortcomings. One of them is fairseq (Ott et al., 2019) – a sequence modeling toolkit written in PyTorch for training models for translation, summarization, and other tasks. It provides different neural network architectures, namely convolutional neural networks (CNN), Long Short-Term Memory (LSTM) networks, and Transformer (self-attention) networks. The architectures can be configured to specific needs, and many implementations for different tasks have been proposed since fairseq's introduction in 2019. In addition to different architectures, the authors also provide pretrained models and preprocessed test sets for different tasks, but sadly none of them covers Slovene.

For training our model from scratch, we decided to use an extension of fairseq (stevezheng23, 2020) that adds data augmentation methods. We trained our general model on the corpus described in Subsection 3.1.

4.2.1. Preprocessing

Before training the model, we had to preprocess the data. The datasets were already formatted as raw text with one sentence per line, with lines aligned between the English and Slovene datasets. We first normalized the punctuation, removed non-printing characters, and tokenized both corpora with the Moses tokenizer (Koehn et al., 2007).
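As a rough illustration of this cleaning step, the following sketch normalizes a few punctuation variants, strips non-printing characters, and tokenizes a line. It is a simplified stand-in for the Moses scripts (normalize-punctuation and tokenizer, also available via the sacremoses Python port) that were actually used, not a reimplementation of them; real Moses applies many more language-specific rules.

```python
import re
import unicodedata

def clean_and_tokenize(line: str) -> list[str]:
    """Simplified Moses-style cleaning: normalize punctuation,
    drop non-printing characters, split punctuation off words."""
    # Normalize a few typographic punctuation variants to ASCII.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    line = line.replace("\u2018", "'").replace("\u2019", "'")
    line = line.replace("\u2013", "-").replace("\u2014", "-")
    # Remove non-printing (control and format) characters.
    line = "".join(ch for ch in line if unicodedata.category(ch)[0] != "C")
    # Split common punctuation marks into separate tokens.
    line = re.sub(r'([.,!?;:"()])', r" \1 ", line)
    return line.split()

print(clean_and_tokenize("\u201cJust breathe,\u201d she said."))
# ['"', 'Just', 'breathe', ',', '"', 'she', 'said', '.']
```
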
CORPUS                 Tokens     Links    Sentence pairs (MOSES)   Words (MOSES)
Europarl.en-sl         31.5 M     0.6 M    624,803                  27.56 M
CCAligned.en-sl        131.3 M    4.4 M    4,366,555                110.08 M
DGT.en-sl              215.8 M    5.2 M    5,125,455                162.58 M
MultiCCAligned.en-sl   5.6 G      4.4 M    4,366,542                110.01 M
OpenSubtitles.en-sl    178.0 M    2.0 M    19,641,477               213.00 M
TildeMODEL.en-sl       2305.4 M   21.1 M   2,048,216                79.90 M
WikiMatrix.en-sl       1.1 G      0.9 M    318,028                  11.99 M
wikimedia.en-sl        350.6 M    31.8 K   31,756                   1.50 M
XLEnt.en-sl            200.7 M    0.9 M    861,509                  4.53 M

Table 1: Size of datasets for the general translation model.

CORPUS    Tokens    Links    Sentence pairs (MOSES)   Words (MOSES)
TED2013   0.5 M     15.2 k   14,960                   0.45 M
TED2020   1.6 M     43.9 k   44,340                   1.35 M
Extras    23,005    /        983                      /

Table 2: Size of datasets for the domain translation model.

We removed all the sentences that were too short (2 tokens or fewer) or too long (250 tokens or more), as well as the pairs where the ratio of lengths was too big, because there is a good chance that such sentence pairs are not translated properly.
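The filtering rules just described can be sketched as a small predicate over tokenized sentence pairs. The length bounds follow the text above; the exact length-ratio threshold is not reported in the paper, so the value of 2.0 below is an assumption for illustration only.

```python
def keep_pair(src_tokens, tgt_tokens,
              min_len=3, max_len=249, max_ratio=2.0):
    """Decide whether a tokenized sentence pair survives filtering.

    Pairs with 2 tokens or fewer, or 250 tokens or more, on either
    side are dropped (as in the paper); max_ratio is an assumed value.
    """
    for tokens in (src_tokens, tgt_tokens):
        if not (min_len <= len(tokens) <= max_len):
            return False
    ratio = (max(len(src_tokens), len(tgt_tokens))
             / min(len(src_tokens), len(tgt_tokens)))
    return ratio <= max_ratio

pairs = [
    (["So", "then", "?"], ["Torej", "?"]),                 # target too short
    (["Hello", "world", "!"], ["Pozdravljen", "svet", "!"]),
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))  # 1: only the second pair survives
```
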
We then applied Byte Pair Encoding (BPE) (Sennrich et al., 2016) to the dataset. The algorithm learns the most frequent subwords to compress the data and thus induces tokens that can help recognize less frequent and unknown words.

With this preprocessed data, we then built the vocabularies that we used for training and binarized the training data. The cleaned and preprocessed training data has ≈ 16M sentences, with ≈ 345M tokens in English and ≈ 341M in Slovene. Both vocabularies have around 45,000 types. In the end, we split the data into a training and a validation set.

4.2.2. Training

We trained a transformer (Vaswani et al., 2017) model with 5 encoder and 5 decoder layers in the fairseq framework. We used the Adam optimizer, an inverse square root learning rate scheduler with an initial learning rate of 7e−4, and dropout. We also used the proposed augmentation with a cut-off augmentation schema that randomly masks words and in this way produces more training data and a more robust translator.

We trained our model for 8 epochs with the mentioned initial learning rate, after which the minimum loss scale (0.0001) was reached, meaning that our loss was probably exploding. We tried training one more epoch with a lower initial learning rate and obtained an even worse performance, with the minimum loss scale reached again. That is why we decided to stop the training at 8 epochs. Results of all the epochs are shown in Section 5.

4.3. Fine-tuning on TED Talks

We preprocessed the TED data in the same way as the general data, except that this time we used the same dictionary as before and did not build a new one. Less than 0.1% of tokens in the training and validation sets were replaced with unknown tokens, so our original dictionary was evidently large enough. We used the best performing epoch of our general translation model (according to the loss on our validation set) for fine-tuning on our domain data. We trained three different models with three slightly different configurations – one with the same augmentation parameters as the general model, one with increased masking probability and decreased dropout and initial learning rate, and one without augmentation. We trained all of the models for 100 epochs, and we present the results of the best epoch for each of them.

4.4. Evaluation

In order to test the performance of the pretrained and general translation models and the fine-tuned translation model for TED Talks, we had to evaluate the translations. The automatic evaluation was carried out on two validation sets. First, the general translation model was evaluated on a subset of the general data, which was split off in the preprocessing step (hereinafter referred to as the general validation set). All three models were evaluated on a subset of the domain data (hereinafter referred to as the domain validation set). The manual evaluation was only performed on a subset of the domain validation set, as described in Subsection 4.4.2.

4.4.1. Automatic evaluation

Since the manual evaluation of translations is very time-consuming, it is very difficult to evaluate a sufficient number of sentences this way. In cases like this, automatic evaluation metrics are often used.
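As an example of such a metric, WER (word error rate), one of the scores we report later, is the word-level edit distance between the hypothesis and the reference, normalized by the reference length; a minimal sketch under that standard definition:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming (Levenshtein) edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("so what is our gut good for",
                      "so what is the gut good"))
# 2 edits (1 substitution + 1 deletion) / 7 reference words
```
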
Natural language is quite subjective; hence, a perfect measure does not exist, but by evaluating our results with different techniques, we were able to assess the performance of our translation model and compare it with other models. We used the automatic metrics most often employed in NLP tasks – namely BLEU, chr-F, GLEU, METEOR, NIST, and WER.

4.4.2. Manual evaluation

The translations were also evaluated manually, namely by the fluency-adequacy criterion first described by Church (1993). For this part of the evaluation, the Excel format was used. We extracted 6 paragraphs containing 10 consecutive segments from each speech to ensure that the context was clear. Three evaluators (the translators from our group) were assigned 20 segments each. To determine the adequacy of the translation, the evaluator marks how much of the meaning expressed in the source text is also expressed in the target translation. To determine the fluency of the translation, the evaluator marks whether the translation is grammatically well-formed, contains correct spelling, is intuitively acceptable, and can be sensibly interpreted by a native speaker. To test adequacy, the evaluator compares both the source text and the translation, whereas in fluency evaluation the focus is merely on the translation. The evaluators provided scores on a scale from 1 to 4. We chose this evaluation technique because it clearly and simply summarizes and presents the quality of the translations. Since we evaluated three different translation models (pretrained, general, and domain), we had to evaluate the same segments of text three times. Evaluating one text multiple times by the same person is not recommended; therefore, the translations were exchanged between the three evaluators at the beginning of the evaluation of each translation model.

4.4.3. End-user comprehensibility questionnaire

Finally, we evaluated the domain machine-translated texts from the end-user's point of view. Evaluators who were not familiar with the content of this project were given the translated texts from the domain model and a questionnaire formed by the translation team of this project. The objective of this questionnaire was to examine whether the end-users understand the information given in the translation, i.e., it tested the functionality of the text. The questionnaire was given to nine persons, each evaluating 20 segments from two different speeches; the segments were identical to those used in the manual evaluation. In the end, we obtained three evaluations for each text (6 speeches altogether). The questionnaire included the following questions:

1. How comprehensible is the text?
2. To what degree does the text seem like it was produced by a native speaker of Slovene?
3. How would you grade the text as a whole?
4. What is the main message of the text?
5. What do you consider the most problematic part of the text?

For the first and second questions, the end-users answered on a scale from 1 to 4, with 1 meaning 'not at all' and 4 meaning 'very much'. The third answer also had to be a score from 1 to 4. The fourth question had to be answered with one sentence, and for the fifth question, the end-users chose between the following answers: 'unknown words', 'too little context', 'wrong syntax', and 'other'. We chose this evaluation technique because it shows whether the translation is, in fact, functional and useful to the end-user.

5. Results

For the training of our models, we used the Slovenian national supercomputing network, which provides access to cluster-based computing capacities. We used the Arnes cluster, which is equipped with 48 NVIDIA Tesla V100S PCIe 32GB graphics cards. When training on two of them, one epoch took approximately 4 hours for the general translation model and one minute for fine-tuning on the TED data.

5.1. Automatic evaluation results

In Table 5, we present the quantitative results of the automatic evaluation for the pretrained, general, and domain models.

5.2. Manual evaluation results

Along with the automatic evaluation metrics, we also performed a manual evaluation, which provided valuable human insight into the final product and a better understanding of the typology of the mistakes that occurred in the translations. Each validation set was assessed by two evaluators at all three stages of the model development. The results presented in Table 3 represent the average values of the fluency and adequacy scores for the pretrained, general, and domain models, respectively.

MODEL        Fluency   Adequacy
Pretrained   2.99      3.09
General      2.83      2.9
Domain       2.71      2.9

Table 3: Manual evaluation results on the TED validation set.

5.3. End-user comprehensibility questionnaire results

We received feedback from the end-users based on the questionnaire for the texts from the domain translation model. The average scores of the answers that could be interpreted numerically are presented in Table 4. According to the answers to the question 'What is the main message of the text?', the users for the most part understood the text to the degree where they could sufficiently summarize the content. The most frequent answer to the last question ('What do you consider the most problematic part of the text?') was 'wrong syntax', followed by 'lack of context' and 'unknown words'. The participants also pointed out that the general structure of the text was rather confusing.
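For reference, the chr-F scores used in our automatic evaluation are character n-gram F-scores (Popović's chrF, with recall weighted by beta = 2 and n-grams of length 1–6). The following is a simplified sketch of that idea; library implementations such as sacrebleu's chrF differ in details like whitespace handling and per-segment averaging.

```python
from collections import Counter

def chr_f(reference: str, hypothesis: str,
          max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chr-F: character n-gram precision/recall averaged
    over n = 1..max_n, combined into an F-score with recall weight beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(reference[i:i + n]
                             for i in range(len(reference) - n + 1))
        hyp_ngrams = Counter(hypothesis[i:i + n]
                             for i in range(len(hypothesis) - n + 1))
        overlap = sum((ref_ngrams & hyp_ngrams).values())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
        recalls.append(overlap / max(sum(ref_ngrams.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(round(chr_f("prstni odtisi", "prstni odtisi"), 2))  # identical strings score 1.0
```
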
Text   Question 1   Question 2   Question 3
1      1.33         1            1
2      2            1.33         1.33
3      3            2            2.33
4      1.66         1            1.33
5      2            1.66         1.66
6      2.33         1.66         2
All    2.05         1.44         1.61

Table 4: End-user feedback results from the questionnaire, with average scores on a scale from 1 to 4.

Dataset   Metric   Pretrained   Ep. 1   Ep. 2   Ep. 3   Ep. 4   Ep. 5   Ep. 6   Ep. 7   Ep. 8   Conf. 1   Conf. 2   Conf. 3
General   BLEU     -            0.387   0.398   0.405   0.409   0.411   0.417   0.417   0.420   -         -         -
General   chr-F    -            0.606   0.616   0.619   0.624   0.625   0.629   0.629   0.629   -         -         -
General   GLEU     -            0.391   0.401   0.407   0.411   0.413   0.417   0.417   0.420   -         -         -
General   METEOR   -            0.545   0.556   0.560   0.565   0.566   0.569   0.569   0.571   -         -         -
General   NIST     -            8.752   8.922   8.987   9.063   9.096   9.144   9.114   9.177   -         -         -
General   WER      -            0.518   0.508   0.503   0.501   0.496   0.497   0.498   0.494   -         -         -
Domain    BLEU     0.192        0.155   0.167   0.168   0.171   0.175   0.175   0.168   0.179   0.182     0.173     0.114
Domain    chr-F    0.514        0.487   0.496   0.495   0.497   0.500   0.498   0.500   0.505   0.503     0.497     0.440
Domain    GLEU     0.230        0.201   0.211   0.212   0.214   0.217   0.218   0.213   0.222   0.224     0.216     0.167
Domain    METEOR   0.420        0.398   0.407   0.409   0.409   0.414   0.412   0.416   0.420   0.426     0.416     0.346
Domain    NIST     5.481        4.877   5.067   5.105   5.132   5.151   5.179   5.074   5.230   5.344     5.209     4.228
Domain    WER      0.659        0.711   0.696   0.694   0.690   0.689   0.689   0.698   0.685   0.667     0.680     0.756

Table 5: Evaluation scores for all models and all validation datasets (Ep. = epoch of the general model; Conf. = domain fine-tuning configuration). The best scores for each dataset and each metric are shown in bold. If the best score was achieved by the pretrained model, the second best score is shown in bold italic to showcase our best score.

6. Discussion

Looking at the results in Table 5, we can first see that on the general validation set, the final epoch of our general model performs best according to most metrics. This is expected, as the general validation set is comprised of texts from the corpora that we used for training, so our model may be overfitted on this dataset.

Connected to this, all of the results on the domain validation set are considerably worse than on the general dataset. We attribute this to the fact that the domain validation set is truly different from the main training data. As to why the pretrained model performs better than our fine-tuned model in most respects, we assume that our domain data is not specific enough. Therefore, we could not really fine-tune our model to any specific styles or words, nor were we able to do that in the validation set. The pretrained model also performs better because it was trained on a larger dataset than the one our domain model was fine-tuned on – the TED corpus is relatively small, even though we included some additional texts.

Similarly, the results of the manual evaluation showed that the pretrained model produced the most fluent translations, with an average score of 2.99 out of 4. This model also achieved the highest score on the adequacy criterion. A closer look at the results of the other two models shows that both faced similar difficulties in translating phrasal verbs, terminology, word order, and other lexical structures. The manual evaluation results are relatively low: the general and the domain model received an average of less than 3 points in both fluency and adequacy. The following examples show the discrepancies between the pretrained model and the other two models on the syntactic, semantic, and morphological levels:

Original: So then, what is our gut good for?
Pretrained: Torej, za kaj je naš občutek dober?
General: Torej, kaj je naš črevo dobro za?
Domain: Kaj je torej naš črevesje dobro?

Original: And I was not only heartbroken, but I was kind of embarrassed that I couldn't rebound from what other people seemed to recover from so regularly.
Pretrained: Ne samo, da me je zlomilo srce, ampak me je bilo sram, da se nisem mogel odvrniti od tega, kar so si drugi ljudje zdelo, da si je opomoglo tako redno.
General: In nisem bil samo zlom srca, ampak sem bil neprijetno, da se nisem mogel odvrniti od tega, kar se je zdelo, da se drugi ljudje tako redno opomorejo.
Domain: In nisem bil le srčni utrip, ampak sem bil neprijetno, da nisem mogel vrniti od tega, kar se je zdelo, da se drugi ljudje tako redno opomorejo.

However, a quick analysis of the evaluation rates showed that the lowest ratings for the domain model appeared in segments with specialized vocabulary, for example: "Ampak ko gre za res velike stvari, kot bo naša kariera ali kdo se bo poročil, zakaj bi morali domnevati, da so naše intuicije bolje kalibrirane za te kot počasne, pravilne analize?" vs. the original: "But when it comes to the really big stuff, like what's our career path going to be or who should we marry, why should we assume that our intuitions are better calibrated for these than slow, proper analysis?", and in segments with a higher register, for example, the eloquent text on immigrants: "Ta vprašanja so protipriseljenska in nativistična v svojem jedru, zgrajena okoli neke vrste hierarhične delitve notranjih in zunanjih oseb, nas in njih, v katerih smo pomembni le in ne." vs. the original: "These questions are anti-immigrant and nativist at their core, built around a kind of hierarchical division of insiders and outsiders, us and them, in which only we matter, and they don't.". In both cases, the rating was never lower than 2.8. The highest rated segments (with scores above 3) included short and simple sentences with everyday vocabulary, such as "In rekla mi je: Samo dihajte." or "Na srečo kriminalci podcenjujejo moč prstnih odtisov.". Based on the evaluation results, it appears that our domain model would be more valuable for translating general texts with a neutral style and vocabulary.

The group members who evaluated these segments had been participating in this project from the very beginning, so it was crucial to also obtain a more objective assessment of our models. Looking at the results in Table 4, the feedback gathered through the questionnaire revealed that overall, the end-users found the texts relatively comprehensible, but not at all like texts produced by a native speaker of Slovene. For the first two questions, for which the answers were chosen on a scale from 1–4 (1 = 'not at all', 2 = 'little', 3 = 'good', 4 = 'very much'), only two texts received a score lower than 2 in terms of comprehensibility. When grading the texts, the highest average score for a specific text was 2.33, while the lowest was 1. This variation occurs because not all of the chosen texts were equally complex.

For the highest graded text, we received similar responses to the question asking what the main message of the text was: Opisovanje prstnih odtisov. / Puščanje prstnih odtisov. / Prstni odtisi poleg vizualne sledi pustijo tudi sled na molekularnem nivoju. Only two out of eighteen answers stated that the message was not clear and that the end-users could not summarize the main message, namely in texts 1 and 5. The fact that the end-users were in almost all cases able to summarize the main message in one sentence shows that comprehension of the text was still possible despite a large number of significant mistakes (wrong syntax, unknown words, lack of context, changing genders, etc.).

The following examples, segments from texts 2, 3, and 6, which were also scored above average in the manual evaluation, support this claim:

Original: And you need something else as well: you have to be willing to let go, to accept that it's over.
Domain: Potrebujete tudi nekaj drugega: biti morate pripravljeni pustiti, da sprejmete, da je konec.

Original: I'm talking about an entire world of information hiding in a small, often invisible thing.
Domain: Govorim o celotnem svetu informacij, ki se skrivajo v majhni, pogosto nevidni stvari.

Original: Five years ago, I stood on the TED stage, and I spoke about my work.
Domain: Pred petimi leti sem stal na odru TED in govoril o svojem delu.
Unfortunately, the final version of the machine translator did not meet our expectations regarding the quality of the translations. Some of the major flaws that appeared in the translations were wrong syntax, untranslated words, incomprehensible grammatical structures, wrong use of terminology, and wrong translations of polysemes. While we expected the machine translator to be inappropriate for translating complex sentences, we were surprised that it did not perform well even when translating basic grammatical structures. Here are some examples:

Original: So then, what is our gut good for?
Domain: Kaj je torej naš črevesje dobro?

Original: I later found out that when the gate was opened on a garden, a wild stag stampeded along the path and ran straight into me.
Domain: Kasneje sem ugotovil, da ko so vrata odprta na vrtu, je divji stag žigosanih po poti in tekel naravnost v mene.

Original: And for two years, we tried to sort ourselves out, and then for five and on and off for 10.
Domain: Dve leti smo se poskušali razvrstiti, nato pa pet let in več.

The reasons for the poor performance of the machine translations could be numerous. It is possible that we did not collect enough data or that the chosen data was not the most suitable for this project. We estimate that the factor that affected the final results the most is the wide range of different topics covered in TED Talks. This means that our domain translation model did not focus on just one domain and, essentially, there was not enough specific data from which it could train. What is more, the initial data consisted of transcriptions of English spoken discourse and their Slovene translations in the form of subtitles. It is important to keep in mind that neither spoken discourse nor subtitles have the characteristics typical of standard text types. Finally, not all of the chosen texts were equally complex, and they had different syntactic, morphological, and lexical features. Therefore, some of the texts in the data were essentially too difficult to translate.

7. Conclusion

The main purpose of this project was to develop a tool that would automatically provide Slovene transcriptions or subtitles for English TED Talks. Our domain translation model provides translations that convey the main message of the texts, is based on an appropriate methodology, and was built with all the necessary tools. Moreover, the results of the automatic metrics showed that it is comparable to other neural machine translation models. On the other hand, the lack of a uniform training dataset resulted in poor and incomprehensible translations. However, we believe that acknowledging all of the discussed shortcomings in future research could significantly improve the development of speech-to-text and translation technologies for Slovene language users. Neural machine translation is still relatively new and will develop further in the following years because it is useful for translators and the general public. Our project contributed to the advancement of the field and could provide valuable information for similar work in the future.

Acknowledgments

We would like to thank our mentors, Slavko Žitnik, Špela Vintar, and Mojca Brglez, for helping us with the project. We would also like to thank the nine evaluators who provided end-user feedback by filling out our questionnaire.

We would also like to thank SLING for giving us access to powerful graphics cards to successfully finish our training, as we would still be training our general model without them. Special thanks to Barbara Krašovec from Arnes support, who helped us with our numerous problems when trying to connect to their cluster.

8. References

Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.

Kenneth Church. 1993. Good applications for crummy machine translation. Machine Translation, 8:239–258.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, Melbourne, Australia.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR, abs/1808.06226.

Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, and Lei Li. 2021. Pre-training multilingual neural machine translation by leveraging alignment information.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

stevezheng23. 2020. fairseq extension.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS.
In: Proceedings of the Eight Interna- tional Conference on Language Resources and Evalua- tion (LREC’12), Istanbul, Turkey, may. European Lan- guage Resources Association (ELRA). Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic data sets for low resource and multilingual MT. In: Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online, November. As- sociation for Computational Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ŠTUDENTSKI PRISPEVKI 285 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Govoriš nevronsko? Kako ljudje razumemo jezik sodobnih strojnih prevajalnikov David Bordon Oddelek za prevajalstvo, Filozofska fakulteta, Univerza v Ljubljani Aškerčeva 2, 1000 Ljubljana david.bordon@ff.uni-lj.si Povzetek Namen prispevka je predstaviti raziskavo preverjanja razumljivosti nerevidiranih strojno prevedenih spletnih besedil. Primarni udeleženci v raziskavi so bili splošni bralci in ne izurjeni prevajalci ali popravljalci strojnih prevodov. Gre za prvo tovrstno raziskavo, ki je bila izvedena za slovenski jezik. Cilj raziskave je bil preveriti, v kolikšni meri so nerevidirani strojni prevodi razumljivi splošnemu bralstvu, pri čemer sem se posvetil tudi vplivu besedilnega in slikovnega konteksta. Preverjal sem prevode prevajalnikov Google Translate in eTranslation. Raziskava je bila izvedena z anketo, v kateri so udeleženci odgovarjali na vprašanja, ki so preverjala razumevanje spremljajočega besedilnega segmenta, v katerem je bila napaka. Rezultati nudijo vpogled v trenutno stopnjo razvoja strojnih prevajalnikov, ne z vidika storilnosti pri njihovem popravljanju, ampak z vidika, koliko jih razume ciljno bralstvo. Do you Speak Neuralese? 
Do you Speak Neuralese?

Abstract
The aim of this paper is to present a study of the comprehensibility of unedited machine-translated web texts. The primary participants in the study were general readers, not trained translators or post-editors, and it is the first study of its kind to be conducted for the Slovene language. The aim of the study was to examine the extent to which unedited machine translations are comprehensible to general readers, with particular attention to the influence of textual and pictorial context. The translations were obtained from Google Translate and eTranslation. The study was conducted by means of a questionnaire in which participants answered questions that tested their understanding of a text segment that included an error. The results provide insight into the current state of development of machine translation engines, not from the point of view of PEMT (post-editing of machine translation), but from the point of view of how well machine translations are understood by the target readership.

1. Introduction
This article presents a study of the comprehensibility of machine-translated web texts among readers who do not know that they are reading machine translations. I used randomly selected English web texts and obtained their Slovene translations with the neural machine translation engines Google Translate and eTranslation. The translations were not revised, as I wanted to replicate the circumstances in which they would actually be encountered: on the web, where, thanks to their (for some users) sufficiently high quality and unbeatable price (they are free), they are increasingly common, as are the translation plug-ins built into modern browsers and applications.

The question of comprehensibility in this form has become topical only recently. Older, statistical machine translation models are grammatically inconsistent and linguistically clumsy, whereas modern neural engines produce fluent texts that are harder to distinguish from human translations; at the same time, even professional revisers now find it harder to determine where the engine has made an error (Donaj and Sepesy Maučec, 2018). These errors arise above all from difficulties in disambiguating polysemous words and in translating words that are not present in the data on which the engine was trained (Thi-Vinh et al., 2019: 207; Koehn and Knowles, 2017: 28, 31–33; Sennrich et al., 2016: 3). Despite occasional mistranslated words, however, people can infer meaning from the surrounding text. In testing comprehensibility, I therefore always included context, since in reality readers never encounter isolated words but complete texts; and because I focus on the web environment, I supplemented the textual context with pictorial context, an inherent property of the modern web.

2. Aim of the article
The aim of the article is to present a rough estimate of the comprehensibility of the output of NMT (neural machine translation) systems at a time when such texts are ever more common on the web, with a particular interest in how pictorial material in the textual context affects the results. No study of this kind has yet been carried out for Slovene.

2.1 Related work
There is relatively little research on how random general readers understand unedited machine translations, since industry is more interested in analyses of post-editing productivity, and far more studies therefore focus exclusively on professional translators.

How widespread the practice of post-editing machine translation has become can be seen in the best-practice guides published on the blogs of major language service providers such as memoQ (Lelner, 2022), Crowdin (Voroniak, 2022) and Memsource (Zdarek, 2020).

At Ghent University, a study of the comprehension of newly invented words and noun phrases was carried out within the ArisToCAT project (Macken et al., 2019). The examples, translated from English into Dutch with Google Translate and DeepL, were presented either in isolation or in sentence context, and participants had no access to the source text.
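Tallies like the ArisToCAT figures discussed here (share of wrong answers overall vs. per presentation condition) reduce to grouping answer records by condition. A minimal Python sketch with invented toy data (not the studies' actual records):

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, is_correct) pairs -> {category: % correct}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cat, ok in records:
        totals[cat] += 1          # count every answer in its category
        hits[cat] += int(ok)      # count only correct answers
    return {cat: round(100 * hits[cat] / totals[cat], 1) for cat in totals}

# Toy data: hypothetical answers for words shown in isolation vs. in context.
toy = [("isolated", False), ("isolated", False), ("isolated", True),
       ("in_context", True)]
print(accuracy_by_category(toy))
```

The same grouping works for any of the breakdowns reported later (by engine, error type, or context type).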
On average, 60% of the answers were wrong; the results were better when the example was presented in sentence context. Within the same project, an analysis of reading comprehension was also carried out, comparing human translation on the one hand with unedited machine translation on the other. Human translations were rated better in terms of clarity of information, while the difference in final comprehension was smaller (Macken and Ghyselen, 2018).

Castilho and Guerberof Arenas (2018) carried out a comparative reading-comprehension analysis of a statistical and a neural machine translation model against the human-written English original. Given the limited sample (6 participants) and the inconsistency of the results, their final finding – that NMT systems show the best results, occasionally even better than the English original – remains inconclusive.

Martindale and Carpuat (2018) examined readers' responses to the fluency and adequacy of neural machine translations, while also measuring how much readers trusted the information in the text. They found that readers are greatly bothered by translations that are not fluent, whereas a much smaller share of the readership takes issue with the accuracy of the information itself. These findings are confirmed by Popović (2020): in her experiment, readers accepted completely wrong information in 30% of cases because of misleading fluency, and a further 25% of examples were almost completely (mis)understood.

It is worth mentioning that more experimental translation methods have recently begun to appear, characterised by taking multimedia context – e.g. audio or visual – into account. Lala and Specia (2018) developed a model of multimodal lexical translation aimed at translating ambiguous polysemous words with the help of pictorial context. Sulubacak et al. (2020) surveyed related research, useful datasets and research methods in multimodal machine translation involving translation with audio, image and video. Among more recent studies, Liu (2021) offers a neural visual–text encoding and decoding model. We can expect this field to develop even faster in the future, above all thanks to technological progress in other areas (image recognition, speech synthesis, automatic subtitling, etc.).

3. Method
The study was designed around a questionnaire containing examples of four types of errors in Slovene machine translations of general English web texts. I tested Google Translate and eTranslation, each represented by 12 questions. Special attention was paid to pictorial material accompanying the text.

3.1 Text selection
I collected texts according to the likelihood that readers could actually encounter them on the web. An analysis of the translation market showed that larger translation agencies completely dominate the sectors that offer the most profit and at the same time require human revision (technology, healthcare, law, finance, etc.) (European Commission, 2020). In less profitable sectors, where human revision is not as essential, there is a greater likelihood that unedited machine translations will be published. A review of the market shares of the search engines used in Slovenia showed that 96% of all internet users use Google.¹ Based on the most frequently searched terms,² I excluded websites with no translation potential (social networks, web portals in Slovene, Slovene media). This led to the final selection of text domains: online shopping, tourism, electronics, multimedia and video games, luxury services, fashion, and personal health (exercise and nutrition).

¹ https://gs.statcounter.com/search-engine-market-share/all/slovenia
² https://ahrefs.com/keyword-generator

3.2 Translating the texts
When testing the machine translation engines, it turned out that Google Translate offers different translations depending on how the text is submitted for processing. If the text is translated in the interface's dialog box, or a web page is translated as a whole in the browser, the results are better than those obtained with the document-translation function. Of the four specialised domains offered by eTranslation, the engine for general texts (General Text) gave the best results. I used the best possible translations: the General Text domain in eTranslation, and the dialog box in Google Translate.

Original: "Keep Warm Feature – Maintains Food Temperature. Keeps foods like vegetables, soups, hors d'oeuvres, gravies, sauces and desserts warm and delicious in the oven until they're ready to serve."
Translation from the input field / automatic page translation: »Naj bo toplo funkcijo - Mikrovalovna pečica ohranja hrano, kot so zelenjava, juhe, jedi, graviža, omake in sladice, topla in okusna v pečici, dokler niso pripravljene za postrežbo.«
Translation obtained with the »translate document« function: »Naj bo topla - mikrovalovna ohranja živila, kot so zelenjava, juhe, d'oeuvres, gravies, omake in sladice toplo in okusno v pečice, dokler oni propravljeni, da služijo.«

Table 1: Differences in translations depending on the processing method; Google Translate.

Translation by the »General Text« model of eTranslation: »Ohraniti toplo funkcijo - Microwave ohranja hrano, kot so zelenjava, juhe, predjed d'oeuvres, omake, omake in sladice tople in okusne v pečici, dokler niso pripravljeni za postrežbo.«

Table 2: Translation of the same segment; eTranslation.
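The differences between engine outputs shown in Tables 1 and 2 can be quantified crudely, for example with token-set Jaccard similarity between two translation variants. This helper is purely illustrative and was not part of the study's method:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two translation variants (0..1)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

# Short toy fragments from two hypothetical engine outputs:
print(jaccard("naj bo toplo", "naj bo topla"))  # shares 2 of 4 distinct tokens
```

A low score flags segment pairs where two processing modes diverge most, which is where a manual comparison like the one in the tables is most informative.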
3.3 Error categorisation
I analysed the translations and defined four categories of the most frequent errors that are not tied to the language system or norm:
▪ Untranslated word: a word appears in the translation in the same form as in the original. I allowed for changes to initial or final morphemes where the engine merely reshaped the word.³
▪ Disambiguation error: the denotative meaning of a polysemous word or phrase does not match the meaning in the original.
▪ Severe semantic error: an error that hinders comprehension of the whole text.
▪ Invented word: the engine invents a new word that at first glance looks Slovene but does not belong to the Slovene lexicon – so-called "Neuralese" (nevronščina).

³ For example, the engine translated bezel (the rim of a screen) as »bezela«.

3.4 Context
Depending on the inherent properties of their online occurrence, I added context to the selected texts. The context could be of several kinds:
▪ textual only,
▪ textual and pictorial, where the picture does not affect comprehension,
▪ textual and pictorial, where the picture affects comprehension,
▪ a choice of one among several suggested pictures according to what the text says.
I included pictorial context for texts that were accompanied online by photographs; in some cases these were merely a visual addition, while in others correct understanding of the text depended on recognising the right visual element. In my study I never presented words in isolation, as was done, for instance, by Macken et al. (2019), since those are not realistic circumstances – errors in published machine translations will always be part of some text. I did not correct the texts; they were presented to respondents with all their grammatical and semantic errors, just as they would be found in the wild.

Figure 1 (Slika 1): Example of a question; choice with explanation.

3.5 Questionnaire design, answer formats and participants
I created the survey on the Google Forms platform, which supports the display of images and offers a good interface for reviewing and exporting results. Importantly, I did not reveal to respondents that they would be reading machine-translated texts; I only mentioned that they would "read several short texts written in somewhat clumsy Slovene."
The answer types were limited by the functionality of Google Forms and did not follow any principled method; I set them subjectively according to the content of the example and the type of error. This is the least reliable variable in the method, since the formulation of a question could suggest the correct answer. My main interest was whether there would be a larger deviation depending on the answer type – for example, whether open-ended answers, where respondents type their answer into an empty input field, would be substantially worse than those where they choose among four suggested answers. This would allow me to check the consistency of correctness, or deviations, by answer type.
I shared the questionnaire on the social networks Facebook and Instagram and asked acquaintances to forward it to their relatives and friends, if possible older ones. I did not collect demographic data, which is arguably one of the study's shortcomings. Given the relatively small sample and a possible echo-chamber effect, the study should certainly be extended and repeated on a more random and, above all, larger sample; but since the response-collection period coincided with the first Covid-19 lockdown, I had no other choice.
I received 120 responses to the questionnaire.

4. Results
I present the results by the following parameters:
▪ overall comprehension,
▪ comprehension by engine,
▪ comprehension by error type,
▪ comprehension by context type,
▪ comprehension by answer type.

4.1 Overall comprehension
The questionnaire comprised 24 questions; with 120 responses, there were 2,880 answers in total, of which 1,697 (58.9%) were correct. A longer breakdown is available in the full study (Bordon, 2021).

4.2 Comprehension by engine
Answers to questions based on Google Translate were correct in 51.3% of cases, i.e. 739 of 1,440 answers. eTranslation showed better results, with a correct-answer rate of 66.6%.

4.3 Comprehension by error type
Four error types were included in the questionnaire. The correct-answer rates by error type were:
▪ invented word: 48.5%,
▪ untranslated word: 64.8%,
▪ wrongly disambiguated polysemous word: 65.9%,
▪ severe semantic error: 56.3%.

Figure 2 (Slika 2): Diagrams 1–4, correct vs. wrong answers by error type, in %.

4.4 Comprehension by context
The shares of correct answers by context type were:
▪ textual only: 60.4%,
▪ textual and pictorial, picture does not affect comprehension: 44%,
▪ textual and pictorial, picture affects comprehension: 69.8%,
▪ choice of one among several suggested pictures according to the text: 64.2%.

4.5 Comprehension by answer type
In this segment I present the results by answer-selection mode. The primary purpose of this analysis is to check consistency, i.e. possible deviations – for example, whether open-ended answers, typed into an empty input field, are substantially worse than those where four suggested answers are offered and one is chosen.
▪ open-ended answer (input field): 36.3%,
▪ closed answer (A, B, C or D): 60.8%,
▪ choice with explanation (A or B, and why?): 68.3%.
The poorer result for closed-type answers should be taken with a grain of salt, since there were only four examples with this answer type. Determining the correctness of an answer is itself harder in such cases, and I was personally a strict assessor, marking as wrong every answer that was not completely correct.

4.6 The translator group
The only demographic datum I collected was whether the respondent works in translation; 24 of the 120 participants answered affirmatively. For these respondents I analysed the answers by error type and compared them with the non-translators. Overall, their results were 6% better (63.7%); by category:
▪ invented word: 53.5% (+6.3%),
▪ untranslated word: 65.6% (+1%),
▪ polysemous-word disambiguation: 70.8% (+6.7%),
▪ semantic error: 63.9% (+9.6%).
I did not collect any other demographic data, which is one of the study's weaknesses. Had the data matched my assumption that they were not relevant, I would not have included them; as it stands, I simply have no data on which to base that decision.

Chart 1 (Graf 1): Results of the translator group versus the rest, by error type.
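For illustration, the "untranslated word" category defined in Section 3.3 could be flagged automatically by looking for target-side tokens that also occur verbatim in the source. This is a hypothetical sketch, not the manual procedure used in the study:

```python
def untranslated_candidates(source, target, min_len=4):
    """Flag target tokens that appear verbatim in the source sentence.

    Crude heuristic for the 'untranslated word' error category:
    shared tokens of some minimum length are suspicious, while
    title-cased tokens (likely proper nouns) are skipped.
    """
    src = {t.strip('.,;:!?').lower() for t in source.split()}
    flagged = []
    for tok in target.split():
        clean = tok.strip('.,;:!?')
        if clean.istitle():                       # likely a proper noun
            continue
        if len(clean) >= min_len and clean.lower() in src:
            flagged.append(clean)
    return flagged

# One of the MT outputs analysed earlier left "stag" untranslated:
print(untranslated_candidates("a wild stag stampeded along the path",
                              "divji stag je tekel po poti"))
```

The morpheme-reshaping cases allowed for in Section 3.3 (e.g. »bezela« from bezel) would slip past this exact-match check; a fuzzier comparison would be needed for those.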
Razprava pri strojnih prevodih v realnih okoliščinah, torej na spletu, Pri pregledu rezultatov sem ugotovil, da povprečna z vsem pomožnim gradivom, igra pomembno vlogo. stopnja razumevanja znaša 59 %. Od vseh 2880 odgovorov Udeleženci, ki se sicer ukvarjajo s prevajanjem, so na je bilo 1697 pravilnih. splošno odgovarjali boljše od povprečja. Njihov delež Na tej točki je potrebno izpostaviti primer št. 6, ki je bil uspešnosti je bil največji v kategoriji hujša pomenska nasploh najslabše razumljen in je znižal povprečje napaka (+ 9,6 %), kar kaže na to, da zaradi »poklicne rezultatov v vseh kategorijah, v katerih se je nahajal. Daljša deformacije« bolj učinkovito razumejo kontekst. razlaga z razčlembo je na voljo v celotni raziskavi (Bordon, 2021). 6. Zaključek V članku sem predstavil raziskavo o razumljivost nerevidiranih strojno prevedenih spletnih besedil pri Izvirnik Prevod končnih uporabnikih, ki niso bili posebej obveščeni, da One winner will receive prebirajo strojne prevode. Razumevanje besedilnih segmentov, ki so vključevali štiri različne tipe napak, ki En zmagovalec bo prejel the GeForce RTX 2080 grafično kartico GeForce nastanejo pri strojnem prevajanju NMT-sistemov, sem Ti Cyberpunk 2077 preverjal z anketo. Ta je vsebovala strojne prevode splošnih RTX 2080 Ti Cyberpunk Edition graphics card. besedil, ki sem jih prevedel s prevajalnikoma Google 2077 Edition. Translate in eTranslation. Besedila so bila nerevidirana, Entering the giveaway is vsebovala so napake, ki so bile predstavljene v več Vstop v predavanje je easy: različnih kontekstih, bodisi s slikovnim gradivom bodisi enostaven: Sign in to the forums or brez. 1. Prijavite se na forume ali create a forum account. Rezultati so pokazali, da je splošna stopnja ustvarite forumski račun . Comment on this thread razumevanja 59 %, pri čemer se je izkazalo, da so prevodi 2. 
Komentirajte to temo (WITHOUT QUOTING eTranslationa nasploh razumljivejši od prevodov (BREZ CITIRANJA TE THIS POST) and tell us Googlovega prevajalnika. Število pravilnih odgovorov je POSTAJE) in nam povejte, what you want to do most bilo najvišje v kategoriji razdvoumljanja večpomenskih kaj želite narediti najbolj v in Cyberpunk 2077. besed, kar nakazuje na to, da ljudje lažje razumemo pomen Cyberpunku 2077. Sign your username in strojnih prevodov, če nam je dan kontekst. Pri tem je bilo 3. Za potrditev vpisa vpišite our giveaway widget to najbolj učinkovito slikovno gradivo, s katerim so si lahko svoje uporabniško ime v naš confirm your entry. udeleženci v raziskavi pomagali razjasniti pomen pripomoček za oddajo. določenega besedilnega segmenta. Druga najuspešnejša HOW TO ENTER: To kategorija je bila razumevanje neprevedenih besed, kar KAKO VSTOPITI: Če želite enter, submit your entry pomeni, da je bilo znanje angleškega jezika med udeleženci vstopiti, vnesite mednopni during the Sweepstakes na visokem nivoju. vložek in sledite navodilom Period and follow the Po analizi se je izkazalo, da je bil nekoliko za vstop v nagradne igrače. problematičen način izbire directions to enter the odgovorov, saj sem anketirancem naključno vnaprej določil Sweepstakes. , na kakšen način bodo odgovarjali. Odgovori odprtega tipa so kazali slabše rezultate kot izbirni odgovori in odgovori zaprtega tipa, Tabela 3: Primer št. 6; »Mednopni vložek.« toda zaradi majhnega števila vprašanj je težko izpeljati kakšen razumen zaključek. Podobno velja za samo metodo eTranslation je bil v povprečju za 15 % boljši od odgovarjanja na anketo, ki je bila pogojena pandemičnemu prevajalnika Google Translate, v katerem je bil omenjen času. Za bolj relevantne rezultate bi bilo potrebno izvajati primer. Nasploh pa je eTranslation kazal boljše rezultate. test razumljivosti v živo, na razpravljalen način. 
Enako Najboljši rezultati glede na tip napake so bili vezani na velja tudi za vzorec sodelujočih – večji in bolj raznolik razdvoumljanje besednega pomena (65,9 %), kar kaže, da vzorec bi dal jasnejše rezultate. znamo ljudje nasploh dobro razbrati pomen iz sobesedila, V bodoče bi bilo zanimivo raziskati, če se razumevanje na drugem mestu pa so bile neprevedene besede (64,8 %), nerevidiranih strojno prevedenih besedil izboljšuje skupaj kar lahko pripišemo dobremu znanju angleščine med z nadgradnjami strojnih prevajalnikov, hkrati pa bi se lahko udeleženci v anketi. osredotočil še na avtomatsko generirana besedila in jezik Rezultati so bili slabši, ko je prevajalnik napravil hujšo spletnih robotov. pomensko napako, ki je oteževala razumevanje celotnega Menim, da bo v prihodnje nekoliko manj raziskav segmenta (56,3 %), daleč najslabše rezultate pa je bilo moč storilnosti pri popravljanju strojnih prevodov in veliko več opaziti v kategoriji izmišljena beseda (48, 5 %), v kateri je raziskav, ki bodo vezane na razumljivost strojno sicer bil prej omenjeni primer št. 6. prevedenih ali avtomatsko generiranih besedil v praktičnih Glede na tip konteksta so bili najboljši rezultati pri situacijah. Končni bralec se vedno bolj pogosto srečuje s primerih, kjer je slika vplivala na razumevanje (69,8 %) in takimi besedili, lahko pa pričakujemo, da bo zaradi še kjer so udeleženci morali izbrati sliko, na katero se je dodatnih izboljšav strojnih prevajalnikov, novih metod in nanašalo besedilo (64,2 %). Rezultati so bili nekoliko slabši razširjenosti prakse tovrstnih potencialnih stikov med stroji v izključno tekstovnem kontekstu (60,4 %), najslabši in bralci brez vmesnega posega človeškega popravljalca rezultati pa so bili v kategoriji, kjer je bila besedilu vedno več. priložena slika, ki ne vpliva na razumevanje oz. potencialno zmede udeleženca (44 %) – v tej kategoriji je bil tudi primer št. 6. 
Izkazalo se je, da slikovni kontekst, ki lahko potencialno vpliva na razumevanje besedilnega segmenta, ŠTUDENTSKI PRISPEVKI 290 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 7. Literatura 2022. David Bordon. 2021. Razumevati nevronščino: Kako si https://www.clinjournal.org/clinj/article/view/93. ljudje razlagamo jezik strojnih prevajalnikov. Magistrsko Marianna J. Martindale in Marine Carpuat. 2018. Fluency delo. Univerza v Ljubljani. Dostop 30. 5. 2022. Over Adequacy: A Pilot Study in Measuring User Trust https://repozitorij.uni- lj.si/IzpisGradiva.php?id=125328. in Imperfect MT. Dostop 30. 5. 2022. Sheila Castilho in Ana Guerberof Arenas. 2018. Reading https://arxiv.org/abs/1802.06041. Comprehension of Machine Translation Output: What Maja Popović. 2020. Relations between Makes for a Better Read?. V: Juan Antonio Perez-Ortiz, comprehensibility and adequacy errors in machine Felipe Sanchez-Martinez, Miquel Espla-Gomis, Maja translation output. V: Raquel Fernández in Tal Popovič, Celia Rico, Andre Martins, Joachim Van den Linzen. ur., Proceedings of the 24th Conference Bogaert in Mikel L. Forcada, ur., Proceedings of the 21st on Computational Natural Language Learning Annual Conference of the European Association for (CoNLL 2020), str. 256–264. Dostop 30. 5. Machine Translation, str. 79–88, Alacant, Španija. 2022. https://aclanthology.org/2020.conll-1.19.pdf. Dostop 30. 5. 2022. http://doras.dcu.ie/23071/. Rico Sennrich, Barry Haddow in Alexandra Birch. 2016. Gregor Donaj in Mirjam Sepesy Maučec. 2018. Prehod iz Neural Machine Translation of Rare Words with statističnega strojnega prevajanja na prevajanje z Subword Units. Dostop 30. 5. 2022. nevronskimi omrežji za jezikovni par slovenščina- https://arxiv.org/abs/1508.07909. angleščina. 
V: Zbornik konference Jezikovne tehnologije Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, in digitalna humanistika 2018, str. 62–68, Ljubljana. Aku Rouhe, Desmond Elliott, Lucia Specia in Dostop 30. 5. 2022. http://www.sdjt.si/wp/wp- Jörg Tiedemann. 2020. Multimodal machine content/uploads/2018/09/JTDH-2018_Donaj-et- translation through visuals and speech. Dostop 30. 5. al_Prehod-iz-statisticnega-strojnega-prevajanja-na- 2022. https://arxiv.org/abs/1911.12798. prevajanje-z-nevronskimi-omrezji-za-jezikovni-par- Ngo Thi-Vinh, Thanh-Le Ha, Phuong-Thai Nguyen in Le- slovenscina-anglescina.pdf. Minh Nguyen. 2019. Overcoming the Rare Word Evropska komisija, 2020 European Language Industry Problem for Low-Resource Language Pairs in Neural Survey 2020 Before & After Covid-19. Dostop 30. 5. Machine Translation. V: Proceedings of the 6th 2022. Workshop on Asian Translation, str. 207–214. https://ec.europa.eu/info/sites/default/files/2019_langua Association for Computational Linguistics. Hong Kong, ge_industry_survey_report.pdf. Kitajska. Dostop 30. 5. 2022. Philipp Koehn in Rebecca Knowles. 2017. Six challenges https://arxiv.org/abs/1910.03467. for neural machine translation. V: Proceedings of the Diana Voroniak. Post-Editing of Machine Translation: First Workshop on Neural Machine Translation, str. 28– Best Practices. Dostop 30. 5. 2022. 39. Association for Computational Linguistics, https://blog.crowdin.com/2022/03/30/mt-post-editing/. Vancouver, Canada. . Dostop 30. 5. 2022. Dan Zdarek. Machine Translation Post-editing Best https://arxiv.org/pdf/1706.03872.pdf. Practices. Dostop 30. 5. 2022. Chiraag Lala in Lucia Specia. 2018. Multimodal Lexical https://www.memsource.com/blog/post-editing- Translation. V: Proceedings of the 11th international machine-translation-best-practices/. conference on language resources and evaluation (LREC), str. 3810–3817. Miyazaki, Japonska: European Language Resources Association (ELRA). Dostop 30. 5. 2022. 
Data Collection and Definition Annotation for Semantic Relation Extraction

Jasna Cindrič, Lara Kuhelj, Sara Sever, Živa Simonišek, Miha Šemen
Department of Translation, Faculty of Arts, University of Ljubljana
Aškerčeva cesta 2, SI-1000 Ljubljana
jasna.cindric@gmail.com, larakuhelj@gmail.com, seversara@gmail.com, ziva.sim@gmail.com, miha.semen@gmail.com

Abstract

This paper presents the process of data collection, definition extraction and annotation for the purpose of semantic relation extraction based on English and Slovene texts related to geology, glaciology, and geomorphology. Automatic semantic relation extraction is an important task in NLP; its potential applications include information retrieval, information extraction, text summarization, machine translation, and question answering.
This approach was based on the TermFrame project. The texts for the corpora were collected manually, while definitions were identified through targeted queries in Sketch Engine and then semantically annotated using the WebAnno tool. Our research showed some significant differences between the languages, which resulted in some difficulties during the annotation process.

1. Introduction

This paper describes the process of definition extraction, annotation and curation based on corpora created for a research project carried out by Master's students as part of the module Corpora and Localisation at the Department of Translation Studies, Faculty of Arts (University of Ljubljana). Translation students collaborated with their peers from the Faculty of Computer and Information Science (University of Ljubljana) on a project focusing on the automatic extraction of semantic relations, which required the creation of an English and a Slovene corpus and the provision of an additional data set annotated for semantic relations. We describe the process of corpus building and the identification and extraction of definitions, followed by the annotation and curation using the WebAnno annotation tool. Finally, the paper presents the results and obstacles and discusses possible further work and research.

Corpus-based automatic semantic relation extraction has become one of the main topics in corpus linguistics. Domain-specific annotated corpora are the basis for the design of many NLP systems for relation extraction (Thanopoulos et al., 2000) and are considered knowledge sources on natural language use. It is imperative to obtain corpora large enough to provide a sufficient number of instances of relation pairs for extraction (Huang et al., 2015). This is especially true for Slovene, a language with complex morphology and free word order, which currently lacks readily available large domain-specific corpora (Pollak et al., 2012).

The layout of the project relied heavily on a similar dataset, TermFrame (https://termframe.ff.uni-lj.si/) – a trilingual knowledge base that contains karst terminology in English, Slovene and Croatian. The knowledge base was developed on the basis of the frame-based approach to terminology (Pollak et al., 2019; Vintar et al., 2021; Vintar and Stepišnik, 2020; Vintar et al., 2019; Vrtovec et al., 2019), a cognitive approach to terminology that considers context, language and culture and focuses on specialised texts (Faber and Medina-Rull, 2017). Frame-based terminology is mainly used for the creation of multimodal specialised knowledge bases, where "frames" are used as a "representation that integrates various ways of combining semantic generalisations about one category or a group of categories" (Faber, 2015). Additionally, "templates" are used as a representation of parts of one category and cover the cultural component (Faber, 2015).

Following the process of the TermFrame project, the team began by compiling an English and a Slovene domain-specific corpus, then extracting definitions and annotating them using the WebAnno tool (Castilho et al., 2016). This paper describes these steps in detail, followed by an analysis of the annotated definitions. It also highlights the obstacles the team faced during the conversion of texts and the annotation process.

The main goal of the project was to create an English and a Slovene corpus covering the fields of geomorphology, glaciology and geology, which would serve as a basis for definition extraction, annotation and curation.

2. Building the corpora

2.1. Text collection

For the purposes of our research, the linguist team compiled two corpora, one Slovene and one English. The entire project lasted for approximately one month.

The first step was to search for texts in both languages covering the predefined topics, namely geology, glaciology, and geomorphology. These areas were chosen because they are semantically related to the domain of karstology but had not yet been used in the TermFrame database. More specifically, texts from domains neighbouring karstology were assumed to contain the same semantic relations, so that our data set would be fully compatible with the existing ones.

The linguist team was particularly interested in collecting scientific texts (scientific papers, articles, books, doctoral and master's theses). Many of these texts can be found through the Digital Library of Slovenia (https://www.dlib.si), through the Co-operative Online Bibliographic System & Services – COBISS (https://www.cobiss.si), and through ResearchGate (https://www.researchgate.net/), a social networking site for scientists and researchers. Ultimately, our team proposed 32 Slovene texts and 26 English texts as candidates. The proposed titles were validated by a domain expert and assessed as relevant.

The next step was to ensure that the texts were in a format that could be read by Sketch Engine (https://www.sketchengine.eu/), which proved difficult in some cases. Fortunately, most of the texts on dLib.si are available in TXT and PDF format. As a result, the team was able to access the texts in the appropriate format using Notepad. Texts that suited the topic but could not be accessed in the correct format were omitted. Document conversion and text cleaning proved cumbersome (see Section 2.2). The team had one week to prepare the texts according to this process.

2.2. Creating the corpora

After collecting a sufficient number of documents and successfully converting them into the appropriate formats, the team proceeded to create the corpora. As all team members had full access to Sketch Engine, we decided this would be the most efficient and straightforward tool for corpus creation and subsequent querying. Table 1 provides an overview of both corpora.

            English     Slovene
Tokens      1,588,085   493,107
Words       1,284,564   358,731
Sentences   52,147      18,373
Documents   26          32

Table 1: Data on the English and Slovene corpus.

As can be seen from Table 1, the Slovene corpus was significantly smaller. This was due to the fact that longer Slovene texts were harder to find, which was to be expected, considering there are not as many Slovene sources as there are English ones.

As previously mentioned, arguably the most important challenge the team faced occurred after selecting the texts for the Slovene corpus. As most of them were PDF files, the team had to ensure they were searchable before converting them into text (TXT) files. Due to some language-specific characters, particularly diacritics such as č, š, and ž, most of the widely available online converters failed to produce satisfactory results. After a few unsuccessful attempts, we managed to convert them with Notepad++, but we still had to review the files and manually correct some errors before adding the documents to the corpus. Since the final texts were corrected manually, human errors – such as the unintentional inclusion of elements like tables of contents, English abstracts and reference lists in the final version of the corpus – caused some difficulties when searching for potential definitions. Ultimately, it was impossible to rely entirely on conversion tools – this seemingly undemanding step required additional time and attention.

3. Definition extraction

In order to obtain the sentences containing definienda, definitors and genera, we had one week to extract the definitions from the corpora using targeted queries in Sketch Engine. Searching for typical definition-like sentences can be done by searching for specific words or phrases and by CQL queries.

To some extent, the structure of definitions can be predicted. Typical definition structures in Slovene include "X je Y", "Y imenujemo (tudi) X", "izraz X pomeni Y", "izraz X označuje Y", "med Y štejemo (tudi) X", etc., while typical definition structures in English include "X means …", "X is a Y", "X is a kind of …", "The term X is …" or "X is defined as". In this context, X is typically a hyponym and Y a hypernym. Sketch Engine allows searching for such definitions in multiple ways. One method is to use a simple Sketch Engine query and search for words or phrases that are often included in definitions, such as "imenujemo" or "izraz" in Slovene and "is a" or "is a term used to describe" in English. We were able to identify multiple definitions using this method, for example "Tip kraškega površja, kjer je prevladujoča oblika vrtače, imenujemo vrtačasti kras."

Another method is to use a CQL query in Sketch Engine and check for definitions with advanced filtering commands such as [tag="S.*"][word="je"][tag="S.*"] in Slovene or [tag="NN"][word="is"][word="a"]?[tag="N.*"] in English. This command combines a search for a specific part of speech (S.* – noun) with a search for a specific word (je). An example of a definition identified using the CQL query in Slovene is "Uvala je večja kraška globel skledaste oblike z neravnim dnom in sklenjenim višjim obodom." Another example in English is "A coral reef is a ridge or mound built of the skeletal remains of generations of coral animals, upon which grow living coral polyps."

Since not all definitions fit these typical structures, we used another strategy. We checked the keywords suggested by Sketch Engine and searched for them with a simple query. In this way, we were able to identify various definitions which could not be found otherwise. An example of such a definition is "Slovenska kraška terminologija navaja, da je vrtača: depresijska oblika okroglaste oblike, navadno globoka več metrov in je bolj široka kot globoka."

In addition to these strategies, the English team also utilised a glossary from the English corpus and extracted some of the definitions from there. By combining all of these strategies, we were able to identify definition candidates suitable for annotation. The selected definitions were then verified by a terminology specialist. Some of the definitions were judged to be unsuitable, either due to their wording or for semantic reasons. After discarding the inadequate definitions, we retained 100 definitions from the Slovene corpus and 104 definitions from the English corpus. All of them were then uploaded to WebAnno (https://www.clarin.si/webanno/login.html) to be manually annotated.

4. Definition annotation

The definitions were annotated using WebAnno – a web-based annotation tool, which allowed for a faster collaborative annotation process as well as a comparative evaluation of the annotations (Castilho et al., 2016). The annotation process took approximately ten days. Altogether, the team annotated 100 Slovene and 104 English definitions, whereby four layers of information were considered. The layers were introduced to the linguist team by the course instructor and were, in turn, selected because they had already been used in the TermFrame project (Vintar and Stepišnik, 2020). We believed that relying on the same categories that had already been adapted to karstology – a domain closely related to the ones chosen for this research – would ensure a straightforward annotation process with little to no ambiguity. Furthermore, the resulting data set would be fully compatible with the existing one in the TermFrame project. The layers of information include:

1. Semantic category: This layer covers the main semantic categories: A. Landform (A.1 Surface Landform, A.2 Underground Landform, A.3 Hydrological Landform or A.4 Other), B. Process (B.1 Movement, B.2 Loss, B.3 Addition or B.4 Transformation), C. Geome, D. Element/Entity/Property (D.1 Abiotic, D.2 Biotic, D.3 Property and D.3.1 Geolocation) and E. Instrument/Method (E.1 Instrument or E.2 Method). The semantic category was assigned primarily to the definiendum and the genus. The semantic categories are presented in Figure 1.

Figure 1: Semantic categories (Vintar and Stepišnik, 2021).

2. Definition element: Here, the term in question was marked as DEFINIENDUM, its hypernym or superordinate term as GENUS, the defining phrase (the phrase between the DEFINIENDUM and the GENUS, e.g. the phrase "is a") as DEFINITOR, and any of its hyponyms or subordinate terms as SPECIES.

3. Semantic relation: A set of 15 relations was used for annotating different features of the defined term: AFFECTS, HAS_ATTRIBUTE, HAS_CAUSE, CONTAINS, COMPOSITION_MEDIUM, DEFINED_AS, HAS_FORM, HAS_FUNCTION, HAS_LOCATION, MEASURES, HAS_POSITION, HAS_RESULT, HAS_SIZE, STUDIES and OCCURS_IN_TIME.

4. Relation definitor: This layer is associated with semantic relations and marks words or phrases that precede particular semantic relations (e.g. "in the ocean").

WebAnno also offers an additional layer for the canonical form, which is used to capture the full form of a term when it appears in an elliptic construction. The canonical form layer was mostly used when annotating definitions in the Slovene corpus. One of the reasons for this is that ellipses are more common in Slovene. Another reason is that the predicate and the pronoun "se" are often separated by other words.
Figure 2: Use of the canonical form layer for pairing the words "uporablja" and "se" to show that they form a single unit.

As seen in Figure 2, which shows an example of the use of the canonical form layer in the Slovene corpus, the predicate "se uporablja" consists of two words that act as a definitor. Hence, the team used the canonical form layer to pair the two words together.

For the purpose of this project, three students annotated the English definitions, while two students annotated the Slovene ones. Afterwards, in the process of curation, both teams jointly reviewed the annotated definitions with the course instructor's assistance. We observed that the annotation of definition elements (definiendum, genus and definitor) was the most straightforward, although the annotators' solutions still varied in some cases (see Figure 3). On the other hand, the annotation of semantic categories, semantic relations and relation definitors proved to be more dubious, since the annotations often differed from one another. When variations occurred, the team managed to resolve such dilemmas through discussion.

Figure 3: Curation process in WebAnno.

As Figure 3 shows, all three students who annotated the English definitions chose "tephra" as the definiendum. Two students annotated the phrase "is a term covering" as the definitor, and one student annotated only "is a term". The word "material" was determined to be a genus by two students, whereas one student extended the genus and annotated "pyroclastic material" – "pyroclastic" was later annotated as COMPOSITION_MEDIUM.

5. Analysis

After annotating all of the extracted definitions, the linguist team took a closer look at the results. Each English definition had one definiendum, giving a total of 104 definienda, while the Slovene definitions had one or more definienda, 113 in total. The most common definitor in English was "is a", followed by "are", and in Slovene "imenujemo" and "je". One or more genera were found in all English definitions, 112 in total, while not all Slovene definitions had a genus.

Figures 4 and 5 show the distribution of semantic categories for the annotated terms in Slovene and English. In total, 183 English and 334 Slovene terms were assigned categories. The most frequent category in English was D.1 Abiotic, followed by A.1 Surface Landform. Similarly, A.1 Surface Landform was the most frequent category in Slovene, followed by D.1 Abiotic.

Figure 4: Semantic categories in the Slovene corpus.
Figure 5: Semantic categories in the English corpus.

Figures 6 and 7 show the distribution of semantic relations for Slovene and English. A total of 186 relations were marked in English and 156 in Slovene. The most common relations in English were HAS_CAUSE (morphogenesis) and HAS_LOCATION (spatial distribution), while the two most common relations in Slovene were HAS_FORM (morphography) and HAS_LOCATION (spatial distribution).

Figure 6: Number of semantic relations in the Slovene corpus.
Figure 7: Number of semantic relations in the English corpus.

5.1. Annotation difficulties

During the annotation and curation process, the team encountered some complex cases, in particular when reviewing the Slovene definitions, which required further discussion and careful attention. While annotating the definition elements proved fairly straightforward, semantic relations posed some challenges. The analysis showed ambiguities in 37 out of 65 sentences in the Slovene corpus. We have divided the ambiguities into the following categories.

5.1.1. Phrases that could be placed in multiple categories

The most recurring ambiguity concerned phrases that could be classified into a number of categories, while others were difficult to associate with any of the possible labels. In many cases, the team had to determine how the annotators would deal with these ambiguous words and establish agreement on a consistent annotation strategy. For example, the phrase "kraški izviri" in Figure 8 could semantically be understood as a hydrological form, a surface form, an underground form or an abiotic element. Likewise, the word "obala" in Figure 9 can be understood as a hydrological form, a surface form, an abiotic element or a geome.

Figure 8: Example of an ambiguous annotation.
Figure 9: Example of an ambiguous annotation.

Although the word "kras" is most likely understood as a geome, depending on the context it can also be understood as karstology, the study of karst. In line with the decision to annotate "geomorphology" as a method, "kras" could therefore also be annotated as a method, as shown in Figure 10.

Figure 10: Example of an ambiguous annotation.

Another example was "gravitacija" (see Figure 11). It was extremely difficult to annotate a word denoting such a complex concept. In discussions with the course instructor, the team decided to annotate it as a method, as the names of the studies had to be annotated in the same way. However, it should be noted that the word could also be annotated according to other criteria.

Figure 11: Example of an ambiguous annotation.

5.1.2. HAS_FORM

In a handful of cases when annotating the Slovene definitions, it became clear that the semantic relation HAS_FORM manifests itself in different ways, as shown in Figures 12, 13 and 14. Since HAS_FORM relations are more abstract and harder to grasp, their annotation proved to be more difficult and required double-checking.

5.1.3. Annotation of genus

Sentences in the English corpus also posed some challenges; however, there were significantly fewer of them compared to their Slovene counterparts. Before the annotation process, it was decided not to choose long phrases for the genus but preferably just one word; e.g. "unloading of mountains" could be considered for the genus as a whole, but the team annotated only the word "unloading" as the genus. It was expected that the genus and the definiendum would share the same semantic category, since the genus is a hypernym or superordinate term, but this was not the case for all definitions. For example, the definiendum "aquifer" was annotated as A.3 Hydrological Landform, but the genus "body of rock" was annotated as D.1 Abiotic in the same definition. This is because a "body of rock" is not necessarily a hydrological form and can also be found on the surface. Another example is the definiendum "weathering", which was annotated as B.4 Transformation, while the genus "process" was annotated as B. Process. The reason for this is that "process" is a hypernym of "transformation".
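The genus-definiendum expectation described in 5.1.3 can be checked mechanically. The sketch below is our own simplification, not the project's procedure: the helper name is invented, and comparing only the top-level category letter (A-E) is an assumption; the tuples reproduce the two counter-examples discussed in the text.

```python
# Hypothetical check of the expectation from Section 5.1.3 that genus and
# definiendum share a semantic category. Comparing only the top-level
# letter (A-E) is an invented simplification.
definitions = [
    ("aquifer", "A.3 Hydrological Landform", "body of rock", "D.1 Abiotic"),
    ("weathering", "B.4 Transformation", "process", "B. Process"),
]

def top_level(category: str) -> str:
    # "A.3 Hydrological Landform" -> "A"
    return category.split(".")[0]

for definiendum, d_cat, genus, g_cat in definitions:
    agree = top_level(d_cat) == top_level(g_cat)
    print(f"{definiendum}: {d_cat} vs {g_cat} -> agree={agree}")
```

On this coarse comparison, "aquifer" vs "body of rock" disagree, whereas "weathering" vs "process" still agree at the top level, matching the paper's observation that the mismatch can occur at different levels of granularity.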
Conclusion as widespread world languages such as English. Further research could examine how definitions in both This article describes the process of corpus creation and languages manifest themselves in different contexts and definition annotation for semantic relation extraction. domains. When building corpora, linguists had to pay close attention Large data collections serve as a basis for the to both the format and nature of the texts. The conversion development of tools for automatic semantic relation of Slovene data proved to be quite challenging and required extraction. Semantic relation extraction can be used to a great deal of attention to detail. It might be useful to create different computer applications that can make develop a conversion tool specifically for language-specific domain-specific knowledge more accessible, not only to characters, such as diacritics, to facilitate the study of data experts but to the general public as well. The corpora that originating from languages, namely Slovene. were built during this project can be used for future creation Definition extraction, on the other hand, did not pose of specialised knowledge bases on geology, any significant challenge. geomorphology and glaciology. In contrast, definition annotation followed by the curation entailed a great deal of debate and additional 7. References research. Since the team consisted only of linguists/translation students lacking domain-specific Richard Eckart de Castilho, Chries Biemann, Iryna terminological knowledge, it was sometimes difficult to Gurevych, and Seid Muhie Yimam. 2014. WebAnno: a comment on the nature of the extracted terms. For any flexible, web-based annotation tool for CLARIN. In: similar research endeavours, it could be useful to seek Proceedings of the CLARIN Annual Conference (CAC) expert’s input so as to facilitate the annotation process and 2014, pages 4505–4512, Soesterberg, Netherlands. prompt better results. 
Overall, definition elements were Pamela Faber. 2015. Frames as a framework for easier to identify and annotate than relation definitors and Terminology. In: H. Kockaert and F. Steurs, (eds.) semantic categories and relations. The result of this work is Handbook of Terminology, Vol. 1, pages 14–33. John a dataset with multi-layer semantic annotations in English Benjamins, Amsterdam/Philadelphia. and Slovene which can be used for future relation Pamela Faber and Laura Medina-Rull. 2017. Written in the extraction experiments. It complements the TermFrame Wind: Cultural Variation in Terminology. In: M. Gryviel dataset and will be added to the Clarin.si repository. (ed.) Cognitive Approaches to Specialist Languages, The paper also draws attention to the differences pages 419–442. Cambridge Scholars, Newcastle upon between the two languages. English seems to favour shorter Tyne. and more concise definitions, such as “is a” or “are”, while Chu-Ren Huang, Jia-Fei Hong, Wei-Yun Ma, and Petr Slovene tends to introduce longer structures, namely Šimon. 2015. From Corpus to Grammar: Automatic “imenujemo” and “se uporablja”, and sometimes shorter Extraction of Grammatical Relations from Annotated ones, such as “je”. Corpus. In T’sou & Kwong (eds.) Journal of Chinese This research provides insight into the various Linguistics Monograph Series, Vol. 25, pages 192–221. language-specific barriers that arise when studying smaller Chinese University of Hong Kong Press, Hong Kong. ŠTUDENTSKI PRISPEVKI 298 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Senja Pollak, Andraž Repar, Matej Martinc, and Vid Podpečan. 2019. Karst exploration: extracting terms and definitions from karst domain corpus. In: Proceedings of eLex 2019, pages 934–956. Lexical Computing CZ, s.r.o., Brno Senja Pollak, Anže Vavpetič, Janez Kranjc, Nada Lavrač, and Špela Vintar. 2012. 
NLP workflow for on- line definition extraction from English and Slovene text corpora. In: J. Jancsary (ed.) Proceedings of KONVENS 2012 (Main track: oral presentations), Vol. 5, pages 53–60. ÖGAI, Vienna. Aristomenis Thanopoulos, Nikos Fakotakis, and Georg Kokkinakis. 2000. Automatic Extraction of Semantic Relations from Specialized Corpora. In: Coling 2000, 18th International Conference on Computational Linguistics, Vol. 1, pages 836–842. Universität des Saarlandes, Saarbrücken. Špela Vintar, Vid Podpečan, and Vid Ribič. 2021. Frame-based terminography: a multi-modal knowledge base for karstology. In: Proceedings of eLex 2021, pages 164–176. Lexical Computing CZ, s.r.o., Brno. Špela Vintar, Amanda Saksida, Uroš Stepišnik, and Katarina Vrtovec. 2019 Modelling specialised knowledge with conceptual frames: the TermFrame approach to a structured visual domain representation. In: Proceedings of eLex 2019, pages 305–318. Lexical Computing CZ, s.r.o., Brno. Špela Vintar and Uroš Stepišnik. 2020. TermFrame: A Systematic Approach to Karst Terminology. In: Dela, Vol. 54, pages 149–167. Znanstvena založba Filozofske fakultete Univerze v Ljubljani, Ljubljana. https://doi.org/10.4312/dela.54.149-167. Katarina Vrtovec, Špela Vintar, Amanda Saksida, and Uroš Stepišnik. 2019. TermFrame: Knowledge frames in Karstology. In: Proceedings of ToTh 2019, pages 109–126. 
Presses Universitaires Savoie Mont Blanc, Chambéry.

Serbo-Croatian Wikipedia Between Serbian and Croatian Wikipedia

Ružica Farmakovski*, Natalija Tomić**
*Faculty of Philology, University of Belgrade, Studentski trg 3, 11 000 Belgrade, ruzicamarinkovic12@gmail.com
**Faculty of Philology, University of Belgrade, Studentski trg 3, 11 000 Belgrade, ntomic801@gmail.com

Abstract

In this paper, we try to establish the linguistic identity of the CLASSLAWIKI-sh corpus of texts (Serbo-Croatian Wikipedia) by comparing it with the CLASSLAWIKI-sr (Serbian Wikipedia) and CLASSLAWIKI-hr (Croatian Wikipedia) corpora, all of which are available at CLARIN.SI, the Slovene national consortium of the European research infrastructure CLARIN; i.e., we are trying to determine whether it is closer to the Serbian or the Croatian language standard. For this comparison, we used as variables the distinguishing features between Serbian and Croatian described in grammars and manuals of the Serbo-Croatian, Serbian and Croatian languages. We came to the conclusion that according to the basic characteristics (orthographic, most phonetic, and derivational morphology features), CLASSLAWIKI-sh is closer to CLASSLAWIKI-hr, while according to morphosyntactic, lexical, and semantic features it is closer to CLASSLAWIKI-sr.

1. Introduction

Wikipedia is a free online encyclopedia launched in 2001 by a community of volunteers. It is available in 326 languages and has more than 302,906 active editors and more than 101,868,334 registered users (https://www.wikipedia.org/). Its specificity is its editing system: it is open to its audience for writing and contributing different content. One of the languages with considerable content is Serbo-Croatian, a language that has not officially existed since the split of the former Yugoslavia.

In recent decades, linguistic research has increasingly been conducted on materials and data from the Internet. They are available to everyone, free and easy to use, and there are plenty of them. This makes them suitable for linguistic research as well. Wikipedia, along with Twitter and other similar sources, offers plenty of materials and data, but to use them at all, we need to know their true identity. That is how the phenomenon of linguistic identification (and automatic linguistic identification) is becoming increasingly important.

In this sense, discriminating between related languages, considered "as a sub-task in automatic language identification" (Tiedemann and Ljubešić, 2012: 2620), is also gaining more and more attention from researchers. But this is not an easy task, especially when it comes to related languages. Since they have a common origin, they share many grammatical features and lexemes, so it is often very difficult to distinguish between them. Therefore, for many researchers, this task is a special challenge, i.e. "both necessity and a challenge" (Ljubešić and Klubička, 2014: 32).

We hope that our research, which is more linguistically oriented, will provide some useful linguistic data for automatic text recognition research. We also hope to show how important it is to choose the right and reliable features as variables for this type of corpus-based research. For example, we had to drop one of the most important and stable features, a feature that is cited everywhere in the literature (ko : tko), because it poses a problem for corpus lemmatization (Section 5.2).

Our paper consists of 7 sections. In Section 2, we describe the goal and present the initial hypothesis. In Section 3, we present the genetic and historical relationship between the Serbian and Croatian standards. In Section 4, we describe the two types of related work that we used: on the one hand, works related to linguistic identification or the discrimination between related languages, and on the other hand, works dealing with the differences between Serbian and Croatian. Section 5 deals with the methodology, where we list and describe the variables we used, and in Section 6, we present the data we have obtained from the corpus and their analysis. In Section 7, we present the conclusion and some suggestions for further research. Finally, in Section 8, we list the literature that we used and cited in the paper.

2. Goal of the paper

In this paper, our goal is to determine the linguistic identity of the CLASSLAWIKI-sh corpus of texts (Serbo-Croatian Wikipedia, hereinafter: SCW), which is available at CLARIN.SI, the Slovene national consortium of the European research infrastructure CLARIN (https://www.clarin.si/kontext/corpora/corplist). The CLASSLAWIKI-sr (Serbian Wikipedia, hereinafter: SW) and CLASSLAWIKI-hr (Croatian Wikipedia, hereinafter: CW) corpora can also be found there. By comparing the linguistic characteristics of our target corpus with the other two corpora, we hope to determine its linguistic identity, i.e. whether SCW is closer to SW or to CW, or whether it is somewhere in the middle. In Figure 1, we show our hypothesis schematically. Our initial hypothesis is that SCW is somewhere in the middle between SW and CW, perhaps with a tendency towards SW, due to the larger number of its users, less resistance to the use of Serbo-Croatian resources, etc.

Figure 1: Is SCW closer to SW or CW, or is it somewhere in the middle?

We also hope to get answers to some other related questions: Does SCW represent the language that existed in the former Yugoslavia under the name of Serbo-Croatian? Is SCW a mixture of characteristics of the Serbian and Croatian varieties? Or is SCW a mixture of Serbian and Croatian texts?

3. Serbo-Croatian vs. Serbian and Croatian

Without the desire (and possibility) to determine precisely whether Serbian and Croatian are two languages, one language with two names, two dialects, two varieties, or two standards, we will present their historical relationship in basic terms. These two entities lived under the common name of the Serbo-Croatian language in the former Yugoslavia for almost a century and were considered one language.

(767), but they also add that linkage information and the text from hypertext anchors could improve overall results. Padró and Padró (2004) presented and compared three different statistical methods for language identification: Markov Models, Trigram Frequency Vectors, and Gram-Based Text Categorisation (mentioned as n-gram above). They concluded that "for texts over 500 characters, all the systems get a precision higher than 95%, and for texts of 5,000 characters the precision is higher than 99% with all systems" (161), but for small texts the Markov Model system has the highest precision. Also, all three systems tend to fail when it comes to the problem of distinguishing similar languages (Catalan and Spanish). So we come to the paper of Ljubešić et al. (2007), which deals with the language identification problem for Croatian. To identify Croatian, the authors have to distinguish it from similar languages – Serbian, Slovenian, or Slovak. They applied the method of most frequent words and combined it with character n-gram models. Finally, to improve the precision of identifying Croatian documents (where the biggest problem was distinguishing them from Serbian
It is documents), the authors made a list of forbidden words for an open question of how much they mixed, how much Croatian and Serbian. Forbidden words (or “blacklisted they influenced each other and how many linguistic words”') are words that occur often in one language but features passed from one entity to another, and how much never in the other language. Forbidden words (or each of them preserved their identity. blacklisted words) are also used (along with a document They undoubtedly have the same origin. Before the classification method) in another article dealing with the Slavs immigrated to the Balkans, the Southern Slavs problem of discrimination between closely related separated from Eastern and Western Slavs. During languages, or more precisely between Bosnian, Croatian historical development, the western linguistic community and Serbian (Tiedemann and Ljubešić, 2012). of the Southern Slavs developed, from which the Slovene Zampieri and Gebrekidan (2012) also agree that and Serbo-Croatian languages developed. The Serbo- methods for discrimination similar languages or varieties Croatian language consisted of three dialects – Štokavian, are not “substantially explored”. In their article, they try to Kajkavian, and Chakavian, according to the interrogative define a model for the automatic classification of two pronoun: što/šta:kaj:ča (′what′). Until the 19th century, all varieties of Portuguese: European and Brazilian. They three dialects were in use. The foundations of the new state that these two varieties “are considered to be the standard language were established in the 19th century. same language [although] there are substantial differences After the Illyrian movement and the reform of the language and orthographic system by Vuk Karadžić, the between European and Brazilian Portuguese in terms of Štokavian dialect (ekavian and (i)jekavian variant) was phonetics, syntax, lexicon, and orthography” (235). 
Although they recognize the problem with similar entities, taken as the basis of the standard language. they use the character-based model using 4-grams. It is Even before the break-up of the former Yugoslavia, practically a standard character n-gram model, just with this language was polycentrically standardized, and the larger character n-grams. break-up of Yugoslavia practically created four new This group of works is more mathematically oriented languages: Serbian, Croatian, Bosnian, and Montenegrin. and does not deal with linguistic features like our work. 4. Related work 4.2. Literature on the differences between Our research is based on two types of sources. On the Serbian and Croatian one hand, there are works related to linguistic As we said at the beginning of this section, another identification or the discrimination between related group of papers is dealing with the differences between languages, and on the other hand, there are works dealing Serbian and Croatian. Among them, we paid special with the differences between Serbian and Croatian. attention to two papers, whose methodology was also used 4.1. Literature on linguistic identification and for our examination ‒ Ljubešić et al. (2018) and Ljubešić the discrimination between related et al. (2019).3 Namely, this group of authors states languages phonetic, morphological, syntactic, and lexical differences Martins and Silva (2005) start with a well-known n- between Serbian and Croatian, which represent variables gram-based algorithm “that measures similarity according to the prevalence of short letter sequences ( n-gram)” 3 Both papers have the same authors. ŠTUDENTSKI PRISPEVKI 301 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 through which a certain phenomenon is examined. In the to describe the phenomenon of linguistic accommodation. 
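The identification methods surveyed above combine character n-gram statistics with "forbidden word" blacklists. As a rough, minimal sketch of how the two ideas fit together (the profile building, the similarity score, and the toy blacklists below are our own illustrative assumptions, not the cited authors' implementations):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams (the 'short letter sequences' used by n-gram methods)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def ngram_similarity(doc: str, profile: Counter, n: int = 3) -> float:
    """Crude containment score: share of the document's n-grams found in the profile."""
    doc_grams = char_ngrams(doc, n)
    shared = sum(min(count, profile[gram]) for gram, count in doc_grams.items())
    return shared / max(1, sum(doc_grams.values()))

# Toy blacklists: words frequent in one language but (ideally) never in the other.
# A blacklist hit vetoes that label, in the spirit of Ljubešić et al. (2007).
FORBIDDEN = {
    "hr": {"takođe", "hemija", "juče"},    # Serbian-only words forbid the label "hr"
    "sr": {"također", "kemija", "jučer"},  # Croatian-only words forbid the label "sr"
}

def classify(doc: str, profiles: dict[str, Counter]) -> str:
    """Pick the best-scoring n-gram profile among labels not vetoed by a blacklist."""
    tokens = set(doc.lower().split())
    candidates = {
        lang: ngram_similarity(doc, profile)
        for lang, profile in profiles.items()
        if not (tokens & FORBIDDEN.get(lang, set()))
    }
    return max(candidates, key=candidates.get)
```

With profiles built from small seed texts, a sentence containing the Serbian-only word takođe can never be labelled "hr", however similar its n-grams are: the blacklist veto resolves exactly the cases where n-gram profiles of closely related languages are too close to discriminate.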
In the first paper, it is the spatial distribution of 16 linguistic features, and the question is "do state borders correspond to linguistic boundaries". In the second paper, it is the phenomenon of linguistic accommodation among the speakers of the BCMS4 languages, i.e. the question of whether BCMS speakers adapt their language when they are in contact with speakers of other BCMS languages (do they change their accent, some grammar constructions, do they use specific lexemes, etc.).

4 Bosnian, Croatian, Montenegrin, and Serbian languages. In the literature dealing with these languages, they are referred to as the BCMS languages.

This part also includes works that deal with differences among the BCMS languages but are more descriptive, i.e. the differences do not represent methodological instruments for research. From Piper (2009) we learn more about the historical, social, political, and cultural circumstances of these two languages, followed by a description of the language differences (537‒552). Branko Tošović and Arno Wonisch are the editors of a series of collections of papers from 2009 to 2013 that also deal with the relationship of the BCMS languages in general (historical, social, political, and cultural perspectives), and then with many individual language problems – adjectival aspect, noun motion, nouns of the nomina agentis type, distribution of future tenses, participial and reflexive passive, etc. (Tošović and Wonisch, 2009; 2010; 2012; 2013). In Ćevriz-Nišić (2009) we can find various phonological, derivational, lexical, and syntactic distinctive features between the Serbian, Croatian, and Bosnian standard languages in the administrative style. The article by Badurina (2004) follows recent changes (late 20th century) in orthography and vocabulary; in Karavdić (2011) 16 syntactic differences are pointed out (apart from the well-known da+present or infinitive): possessive genitive vs. adjective with noun, future II vs. present tense, kod+accusative vs. k+dative, etc. In Bekavac et al. (2008) the differences are organized on five levels, from the phonological to the semantic level. The last one is especially interesting because it is rarely mentioned in the literature: the authors cite the lexeme čas, meaning 'one moment' in Croatian and 'one hour' in Serbian, the lexeme persons, translated in Serbian by 'lica' and in Croatian by 'osobe', etc.5

5 The lexeme persons can also be translated into Serbian by 'osobe'; the translation 'lica' appears in the administrative language.

As we have already said, we used Ljubešić et al. (2018) and Ljubešić et al. (2019) the most because we followed the methodology applied in these works. For more linguistic details of these, and of all the listed literature units in this section, see Section 5. All papers in this second group, except for the second of the two papers that we highlighted at the beginning of Section 4.2 (Ljubešić et al., 2019), state the differences between Serbian and Croatian without examining them in a corpus. Ljubešić et al. (2019) do use a corpus, but it consists of shorter texts (Twitter) and serves a different purpose – to describe the phenomenon of linguistic accommodation. Also, our choice of variables differs from the variables used in that paper (see the explanation in Section 5.2).

5. Methodology
5.1. Data and metadata
In the Introduction, we defined Wikipedia as a free online encyclopedia. But it is not entirely, nor could it be, the subject of linguistic inquiry. The subject of our research are three special corpora composed of texts from Wikipedia. These three corpora are, as we stated in Section 2, CLASSLAWIKI-sh, CLASSLAWIKI-sr, and CLASSLAWIKI-hr, available at CLARIN.SI, the Slovene national consortium of the European research infrastructure CLARIN. All three corpora are part of the project CLASSLA Wikipedia, which involved generating corpora for seven South Slavic languages: Macedonian, Bulgarian, Serbian, Croatian, Serbo-Croatian, Slovene, and Bosnian. The corpora were generated using Wikipedia dumps that were downloaded on October 17th, 2020.6

6 Links to the Wikipedia dumps can be found at https://github.com/clarinsi/classla-wikipedia

Some important metadata for our three corpora are given in Table 1.

Corpus                                Documents   Tokens        Words
CLASSLAWIKI-sh 1.0 (Serbo-Croatian)   453,404     80,669,281    63,541,966
CLASSLAWIKI-sr 1.0 (Serbian)          639,277     122,530,226   97,258,485
CLASSLAWIKI-hr 1.0 (Croatian)         205,898     66,484,380    51,719,524

Table 1: Number of documents, tokens, and words in SCW, SW, and CW.

5.2. Variables of interest
To select the appropriate variables, we reviewed the linguistic differences between Serbian and Croatian that are cited in the literature. We also consulted the most relevant grammars and manuals and, for certain variables, some special papers dealing with them. We reviewed basic grammars and manuals for Serbian, Croatian, and Serbo-Croatian: Pešikan et al. (2010), Stevanović (1989), Stanojčić and Popović (2008), Piper and Klajn (2013), Ivić et al. (2004), Mrazović and Vukadinović (2009), and Barić et al. (1997). Then we reviewed papers whose main topic was these differences. All these sources are described in Section 4.2. We also used papers that deal with a particular variable as a special problem; these sources are mentioned with the variable in question.

First, we had to choose a smaller number of variables.
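Whatever variables are chosen, their raw counts must first be made comparable across corpora of very different sizes (Table 1). A minimal sketch of such a normalisation step (the scaling to per-million frequencies and the example counts are our own illustration, not necessarily the authors' exact procedure):

```python
# Token counts from Table 1; used to scale raw counts so that corpora of
# different sizes can be compared on the same footing.
CORPUS_TOKENS = {
    "sh": 80_669_281,   # CLASSLAWIKI-sh (Serbo-Croatian Wikipedia)
    "sr": 122_530_226,  # CLASSLAWIKI-sr (Serbian Wikipedia)
    "hr": 66_484_380,   # CLASSLAWIKI-hr (Croatian Wikipedia)
}

def per_million(raw_count: int, corpus: str) -> float:
    """Relative frequency per million tokens in the given corpus."""
    return raw_count * 1_000_000 / CORPUS_TOKENS[corpus]

# Hypothetical raw hits for one word pair in the sh corpus: scaled this way,
# they can be set against per-million values from the sr and hr corpora.
freq_a = per_million(1_210, "sh")
freq_b = per_million(1_330, "sh")
```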
So we tried to make the variables meet the following criteria: linguistic relevance, representation of stable differences, easy recognition by the speaker, and easy automatic retrieval. Therefore, we rejected unreliable variables (such as script – Cyrillic or Latin; in addition, the texts in all corpora are in the Latin script), underdeveloped variables, and variables that are impossible to process due to homonymy.

For most variables, we selected words that illustrate a certain phenomenon so that we could search the corpus. We chose examples that are well known to us as native speakers and for which we found confirmation in the literature mentioned above.7 It would be better if we could present all those examples in tables, along with their mean values and proportions, but since that would require a lot of space, we decided to just list those words and present the final analysis in Section 6.

7 The dictionary Ćirilov (2010) also helped us in this.

Two variables were extracted using regular expressions – the morphosyntactic variable trebati and the lexical variable da li:je li.

In three cases (for the pair of words takođe:također ('also') among the phonetic variables; for the semantic variable čas ('hour', 'moment'); and for the pronoun ko:tko) we analyzed a smaller number of examples (80). We did this in cases where something seemed suspicious to us based on the raw numbers (takođe:također, ko:tko) or when we wanted to get a general impression of the use of the lexeme, while a detailed analysis would require separate research (čas).8 More examples and better-randomized examples would improve this research.

8 See more details in those examples.

The selected variables belong to the following levels of linguistic structure: orthographic, phonetic, derivational morphology, morphosyntactic, syntactic, and semantic. We chose this approach, starting from language features known and described in the literature and then identifying them in the corpora, because we believe that this is the best way of language identification. In addition, we believe that automatic text recognition should be based on theory.

Orthographic variable
1) transliteration:original
When it comes to the orthography of foreign proper names, transliteration is more frequent in Serbian (where it is also the standard), while in Croatian foreign proper names are written in the original: Njujork:New York. Examples of this variable are found in Memić (2009).

Phonetic variables
2) e:ije/je
This variable concerns the Proto-Slavic vowel jat and its different reflexes: je/ije in Croatian and e in Serbian, although the (i)jekavian reflexes (and dialects) also belong to the Serbian standard language. In the literature, this variable is considered "the most obvious difference between Croatian and Bosnian on one side and Serbian on the other" (Bekavac et al., 2008: 35), one of "the biggest differences between Croatian and Serbian" (Ljubešić and Klubička, 2014: 29), and "one of the features central to defining the dialects" as well as "the variable whose geographical distribution is expected to be the most straightforward" (Ljubešić et al., 2018: 110). This variable was extracted through a list of words that was created manually (as we have already mentioned). Since the consonant j is a frequent cause of various phonetic alternations, we chose words in which there are no phonetic alternations. Otherwise, we would have to look for more results for the (i)jekavian forms and sum them up: sneg:snijeg, snjeg ('snow'), devojka:djevojka, đevojka ('girl'), etc.

3) rdrop
The variable rdrop refers to the fact that in some words the consonant r is kept at the end of the word in Croatian, while in Serbian it is lost: juče:jučer ('yesterday'). This variable is also illustrated by a manually created list of words. The nouns veče:večer ('evening') are regularly cited as an illustration of this difference, but since both nouns have the same declension, we had to exclude them from the search, because we cannot deduce from the form what the lemma should be. We kept the words naveče:navečer, predveče:predvečer, and uveče:uvečer ('in the evening'), which are derived from veče:večer, because they are adverbs and thus have no declension. Since the grapheme đ also appears as dj, for the words takođe:također ('also') we searched for both spellings and summed them up (takođe:također, takodje:takodjer).

4) h:k
The variable h:k occurs in words of Greek origin. As early as the Middle Ages, the rule was established in Serbian that Greek χ was transferred as Slavic h, while in Croatian k appeared under the influence of Western European languages. We also used a manually created list for this variable, because there are not so many of those words.

Derivational morphology variables
5) ka:ica
The suffixes -ka and -ica are used for deriving feminine nouns of the nomina agentis type. But here the situation is not so simple. First, both suffixes are very productive in both Serbian and Croatian, and we cannot claim that one suffix is Serbian and the other Croatian. So we have in Serbian glumica, igračica, pevačica, etc., and in Croatian maserka, programerka, novinarka, analitičarka, etc. This also applies to other suffixes: we find in Babić (1999) that the suffixes -ica, -ka, -kinja, and -inja are as Croatian as they are Serbian and differ only in distribution. We find a similar claim in other authors (Dražić and Vojnović, 2010). Second, "the choice of the suffix also depends on the ending of the masculine noun from which the feminine form is derived" (Ljubešić et al., 2018: 113). Therefore, among many other suffixes, we chose the suffixes -ar and -or in the masculine gender, for which we found confirmation in several sources that they regularly give -ka in Serbian and -ica in Croatian (Dražić and Vojnović, 2010; Ljubešić et al., 2018; Ćorić, 2010). We also manually created a list of those pairs of words.

6) isa, ova:ira
This variable is related to the morphological composition of international verbs: organizovati in Serbian and organizirati in Croatian ('organize'). Petar Skok noticed this difference in the 1950s. According to Skok (1955‒1956), the suffix -isati is related to Belgrade, is of Greek origin, and entered Serbian with Turkisms. The suffix -irati is related to Zagreb, is of Latin origin, and was received through French and German. The suffix -ovati originates from the Proto-Slavic language. Recent research also confirms this distribution: "It is also noticeable that the distribution of suffixes in certain verbs in Serbian and Croatian is differentiated […] examples of verbs with -ira- are registered in Croatian texts, and with -isa- and -ova- in texts by Serbian authors." (Ivanić and Perišić, 2018: 188). This variable is illustrated by a list of examples mostly listed in Tošović (2010), Skok (1955‒1956), and Ivanić and Perišić (2018).

Morphosyntactic variable
7) trebati
In standard Serbian, the modal verb trebati ('need/should') is used as an impersonal verb and has the complement da+present tense: ja treba da idem, ti treba da ideš, etc.9 In Croatian, this verb is used as a personal verb and has an infinitive as its complement: ja trebam ići, ti trebaš ići, etc. For this variable, we used the regular expression found in Ljubešić et al. (2018).

Lexical variable
8) da li:je li
As we read in Ljubešić et al. (2018), yes/no questions in Serbian are formed with the interrogative expressions da li and je li. The form da li is more common, and the form je li is usually shortened to je l', jel', or jel. In Croatian, je li is the standard form. We analyzed only the full forms, using regular expressions also found in Ljubešić et al. (2018): '\bda li\b' and '\bje li\b'.

Semantic variable
9) čas ('hour':'moment')
Semantic differences are less common in the literature. We have already mentioned the lexeme čas, meaning 'one moment' in Croatian and 'one hour' in Serbian (Bekavac et al., 2008). Since it is a matter of meaning, we had to make our own decisions on a case-by-case basis. So we took the first 80 occurrences of the lexeme čas and determined whether it means 'hour' or 'moment'.

After describing the variables used, we will only briefly mention one of the very interesting problems we encountered: the use of the interrogative pronoun who, which in Serbian has the form ko and in Croatian tko. The first problem is that the forms of ko, in addition to the forms of tko, also received the lemma tko in all three corpora (da je bilo kome rekao – the form kome got the lemma tko instead of ko). Another problem is that the personal interrogative pronoun ko/tko has the same declension as the shorter forms of the adjectival pronoun koji. In this way, many examples that were supposed to get the lemma koji got the lemma ko/tko (kamen od koga se obično izrađuje nakit – the form koga got the lemma tko instead of koji). That is why we rejected this feature as a variable, but we analyzed 80 examples with the lemma ko and 80 examples with the lemma tko in each of the three corpora. Then we divided those examples according to the lemmas they should have received: ko, tko, (t)koji. The results we obtained are shown in Table 2.

                     CLASSLAWIKI-sr   CLASSLAWIKI-hr   CLASSLAWIKI-sh
                     (Serbian W.)     (Croatian W.)    (Serbo-Croatian W.)
Lemma=ko             ko: 49           -                -
(80 examples)        tko: 0           -                -
                     (t)koji: 29      -                -
                     error: 2         error: 10        error: 32
Lemma=tko            ko: 4            ko: 9            ko: 1
(80 examples)        tko: 1           tko: 41          tko: 3
                     (t)koji: 71      (t)koji: 24      (t)koji: 71
                     error: 4         error: 6         error: 5

Table 2: Lemmatization of the pronoun ko/tko.

6. Analysis
Insight into these three corpora gave us the following data. For the variables we searched using the word lists we made, we obtained the number of lemmas. To obtain representative values and overcome the size inequality of the three corpora, we calculated mean values and proportions. To calculate a proportion, we used the following formula: the proportion of one value of a variable in one corpus is equal to the quotient of the mean of that value in that corpus and the sum of the means of both values of that variable in that corpus. For example, the proportion for the value e of the variable e:(i)je in SW = the mean for e in SW / (the mean for e in SW + the mean for (i)je in SW).

To visually represent these relationships, we made the same kind of illustration for each variable. On the left (blue) is what we have defined as a Serbian feature, and on the right (red) what we have defined as a Croatian feature. Then we marked a value for each corpus. We presented the proportions as percentages because it seems easier to read the data from the image in this way. This presentation allows us to see the data for all three corpora for each variable in the same image, making it easier to compare. The figure also shows whether SCW is closer to SW or CW.
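The extraction-plus-proportion step described above can be sketched as follows; the regular expressions are the ones quoted in Section 5.2, while the sample string and the resulting counts are purely illustrative, not corpus data:

```python
import re

def proportion(mean_a: float, mean_b: float) -> float:
    """Proportion of value A of a variable in one corpus:
    mean(A) / (mean(A) + mean(B)), as in the formula above."""
    return mean_a / (mean_a + mean_b)

# Toy text standing in for corpus concordance lines.
sample = "Da li dolaziš? Je li to tačno? Da li znaš?"

# The regular expressions quoted in Section 5.2 for the variable da li:je li.
da_li = len(re.findall(r"\bda li\b", sample, flags=re.IGNORECASE))
je_li = len(re.findall(r"\bje li\b", sample, flags=re.IGNORECASE))

p_da_li = proportion(da_li, je_li)  # here 2 / (2 + 1)
```

The same proportion function applies to every variable: e.g., feeding it the mean frequencies of e-forms and (i)je-forms in one corpus yields the percentages plotted in Figures 2–10.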
Our first variable is orthographic and concerns the writing of foreign proper names. As we said, transliteration is more frequent in Serbian, while in Croatian foreign proper names are written in the original. To examine this, we took five proper names: Njujork:New York, Čikago:Chicago, Dablin:Dublin, Kembridž:Cambridge, Venecija:Venezia. As we can see from the mean values and proportions, transliteration is more prevalent in SW (0.74), original writing in CW (0.80), and SCW is closer to CW in this characteristic, with a proportion of 0.68 in favour of original writing.

Figure 2: Variable transliteration:original.

The next three variables are phonetic. For the first, e:ije/je, we took 10 words, according to the criteria defined above for this variable: cvet:cvijet ('flower'), reč:riječ ('word'), sveća:svijeća ('candle'), zameniti:zamijeniti ('replace'), uvek:uvijek ('always'), pesma:pjesma ('song'), vetar:vjetar ('wind'), mera:mjera ('measure'), veštica:vještica ('witch'), sesti:sjesti ('sit'). The mean values and proportions show the following. Although the (i)jekavian dialect also belongs to the Serbian standard, in SW the ekavian reflex is completely dominant (0.99). In CW the (i)jekavian reflex of the Proto-Slavic vowel has the same value (0.99), which is not surprising, because there is only one standard in Croatian. In SCW the ekavian reflex occupies approximately one-third and the (i)jekavian two-thirds (the proportion is 0.30:0.70).

Figure 3: Variable e:ije/je.

The next phonetic variable refers to words that have the consonant r at the end of the word in Croatian, while in Serbian it is lost. We used the following six words: juče:jučer ('yesterday'), prekjuče:prekjučer ('the day before yesterday'), naveče:navečer, predveče:predvečer, uveče:uvečer ('in the evening'), takođe:također ('also'). Analysing these words, we came to the following results. Forms without the final consonant r have the expected high value in SW (0.99), as do forms with the final r in CW (0.99). What we did not expect is the extremely high value of the forms with the final r in SCW (0.99). Looking at the raw numbers, we concluded that the frequency of use of the form također in SCW contributed to this. If we exclude this pair of words (takođe:također) from the analysis, the characteristic forms almost retain their values in SW and CW (0.98 and 0.98), but SCW is much more balanced (0.48:0.52 in favour of forms with the final consonant r). We also wanted to make sure that these high values for the word također are not the result of a lemmatization error. We reviewed 80 examples in SCW and found 16 errors (Brown je takođe hvalio film, On takođe uzima učešća...). In Figure 4 we show the values that include the use of the pair takođe:također.

Figure 4: Variable rdrop.

The last phonetic variable, h:k, is found in translations of words of Greek origin – h in Serbian and k in Croatian. We used the following seven words: haos:kaos ('chaos'), harizma:karizma ('charisma'), hemija:kemija ('chemistry'), hirurg:kirurg ('surgeon'), hronika:kronika ('chronicle'), hlor:klor ('chlorine'), hrizantema:krizantema ('chrysanthemum'). For example, we did not find the word harizma in CW at all, nor the word hrizantema in CW or SCW. This feature is very stable – words with h consistently appear in SW (0.99), and words with k consistently occur in CW (0.99). In SCW the usage is balanced (0.50:0.50).

Figure 5: Variable h:k.

For our first derivational morphology variable, ka:ica, we used nine words: slikarka:slikarica ('painter', fem.), ministarka:ministrica ('minister', fem.), apotekarka:apotekarica ('pharmacist', fem.), autorka:autorica ('author', fem.), doktorka:doktorica ('doctor', fem.), profesorka:profesorica ('professor', fem.), direktorka:direktorica ('director', fem.), lektorka:lektorica ('language editor', fem.), inspektorka:inspektorica ('inspector', fem.). The data on the distribution of the suffixes -ka and -ica show the following. The suffix -ka has a very high value in SW (0.97), which confirms its consistent use in Serbian texts, just as the suffix -ica has a high value in CW (0.99). In SCW the suffix -ka reaches almost one-third (0.28), and the rest goes to the suffix -ica (0.72), which makes SCW much closer to CW according to this feature.

Figure 6: Variable ka:ica.

The situation is similar with verb formation. The suffixes -isa and -ova, which are related to Serbian, have a value of 0.99 in SW, the same as the suffix -ira in CW. In SCW, the ratio is 0.39:0.61 in favour of the suffix -ira, which also shows that SCW is closer to CW according to this feature. We used 10 verbs: operisati:operirati ('operate'), fotografisati:fotografirati ('take photos'), reformisati:reformirati ('reform'), regulisati:regulirati ('regulate'), pakovati:pakirati ('pack'), kritikovati:kritizirati ('criticise'), diskutovati:diskutirati ('discuss'), identifikovati:identificirati ('identify'), promovisati:promovirati ('promote'). In SCW we did not find the form pakirati ('pack'), and in CW we did not find the forms fotografirati ('take photos') and reformirati ('reform').

Figure 7: Variable isa, ova:ira.

Analysis of the morphosyntactic variable trebati showed that the modal verb trebati ('need/should') as an impersonal verb with the complement da+present tense has a dominant use in SW (0.96), as does its personal variant with an infinitive complement in CW (0.88). In SCW this verb is used more in the impersonal form, which means that according to this feature SCW is more Serbian than Croatian (0.70:0.30).

9 In colloquial language this verb is very often used as a personal verb, but retains the complement da+present tense: ja trebam da idem, ti trebaš da ideš, etc.

Figure 8: Variable trebati (imp:pers).

The lexical variable da li:je li represents the expressions da li and je li used for yes/no questions. In the description of this variable, we said that both expressions are used in Serbian, but that the form da li is more common, and that je li is the standard form in Croatian. The results show the dominant use of da li in Serbian (0.98),10 while in Croatian the use of these expressions is much more balanced – both values are close to the middle (0.46:0.54, with je li still slightly more frequent). In SCW, da li appears much more often (0.83:0.17), so it is closer to SW in this respect.

10 The explanation for such a high value of da li in relation to je li in SW is that in the Serbian spoken language the full form je li is rarely used. Its shortened variants je l', jel', or jel are much more common.

Figure 9: Variable da li:je li.

The semantic variable čas is stable. The lexeme čas is more often used in SW in the meaning of 'hour' (0.90), and in CW in the meaning of 'moment' (0.97). In SCW these meanings stand in the relation 0.63:0.37 in favour of the meaning 'hour', and therefore SCW is closer to SW according to this feature.

Figure 10: Variable čas.

7. Conclusion
At the beginning, we stated that our goal was to determine the linguistic identity of the corpus of texts CLASSLAWIKI-sh, and we assumed that it is midway between the corpora CLASSLAWIKI-sr and CLASSLAWIKI-hr. But we did not get a single or simple answer. It turned out that according to orthography and most phonetic and derivational morphology features, SCW is closer to CW than to SW. On the other hand, the morphosyntactic, lexical, and semantic features show that SCW is closer to SW than to CW. This may indicate that SCW contains more Croatian texts, because these basic characteristics, so to speak, are more Croatian. Also, the values in SCW for most variables are closer to the extremes than balanced, so our initial hypothesis is confirmed in only a few cases (for example, the variable h:k – 0.50:0.50). The other questions we asked at the beginning are not easy to answer in such a limited study.

To improve this research and get more accurate and precise results, some variables should be added, some unclear issues should be resolved (some problems in lemmatization), and some more advanced corpus search techniques should be used (first of all, regular expressions, randomized examples, etc.). As for the variables, there are a number of very interesting features: the possessive adjective (in Serbian) vs. the possessive genitive (in Croatian): tetka Marin brat / brat tetke Mare ('Aunt Mary's brother'); the conjunction pošto ('since'), which in Croatian is used only in a temporal sense and in Serbian also in a causative sense: Pošto je knjiga bila skupa, nisam je kupila ('Since the book was expensive, I didn't buy it'); kod (in Serbian) vs. k (in Croatian): Doći ću kod tebe. / Doći ću k tebi. ('I will come to you.'); gde (in Serbian) vs. kamo (in Croatian) for the direction of movement: Gde ideš? / Kamo ideš? ('Where are you going?'), etc.

8. References
Božo Bekavac, Sanja Seljan, and Ivana Simeon. 2008. Corpus-based Comparison of Contemporary Croatian, Serbian and Bosnian. In: Proceedings of the Sixth International Conference Formal Approaches to South Slavic and Balkan Languages, pages 34‒39, Dubrovnik, Croatia.
Božo Ćorić. 2010. Jezičke i/ili varijantske razlike na tvorbenom planu. In: Branko Tošović and Arno Wonisch, eds., Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, Book I/2, pages 41‒50. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2009. Bošnjački pogledi na odnose između bosanskog, hrvatskog i srpskog jezika. Graz and Sarajevo: Institut für Slawistik der Karl-Franzens-Universität Graz and Institut za jezik.
Branko Tošović. 2010. Деривационные различия между сербским, хорватским и бошняцким языкам (прелиминариум). In: Branko Tošović and Arno Wonisch, eds., Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, Book I/2, pages 65‒80. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2010. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, I/2. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2012. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, I/4. Graz and Belgrade: Institut für Slawistik der Karl-Franzens-Universität Graz and Beogradska knjiga.
Branko Tošović and Arno Wonisch, eds. 2013. Srpski pogledi na odnose između srpskog, hrvatskog i bošnjačkog jezika, I/5.
Nenad Memić. 2009. O prenošenju austrijskih i njemačkih toponima u bosanski, hrvatski i srpski jezik: o problemu egzonima u savremenom jeziku. In: Branko Tošović and Arno Wonisch, eds., Bošnjački pogledi na odnose između bosanskog, hrvatskog i srpskog jezika. Graz and Sarajevo: Institut für Slawistik der Karl-Franzens-Universität Graz and Institut za jezik.
Nikola Ljubešić, Maja Miličević Petrović, and Tanja Samardžić. 2018. Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue. Journal of Linguistic Geography, 6/2:100‒124,
Graz and Belgrade: Institut für Cambridge University Press. Slawistik der Karl-Franzens-Universität Graz and Nikola Ljubešić, Maja Miličević Petrović, and Tanja Beogradska knjiga. Samardžić. 2019. Jezična akomodacija na Twitteru: Bruno Martins and Mário J. Silva. 2005. Language Primjer Srbije. Slavistična revija, 67(1):87‒106. Identification in Web Pages. In: Proceedings of the Nikola Ljubešić, Nives Mikelić, and Damir Boras. 2007. 2005 ACM symposium on Applied computing, SAC Language identification: how to distinguish similar ’05, pages 764–768, New York, NY, USA. languages? In: Vesna Lužar-Stiffler, and Vesna Hljuz Eugenija Barić, Mijo Lončarić, Dragica Malić, Slavko Dobrić, eds., Proceedings of the 29th International Pavešić, Mirko Peti, Vesna Zečević, and Marija Conference on Information Technology Interfaces, Zninka. 1997. Hrvatska gramatika. Zagreb: Školska pages 541–546, Zagreb: SRCE. knjiga. Nikola Ljubešić and Filip Klubička. 2014. {bs, hr, Jasmina Dražić and Jelena Vojinović. 2010. Imenice tipa sr}WaC – Web corpora of Bosnian, Croatian and nomina agentis u srpskom i hrvatskom jeziku (tvorbeni Serbian. In: Proceeding of the 9th Web as Corpus i semantički aspekt). In: Branko Tošović and Arno Workshop (WaC-9) @ EACL 2014, pages 29–35, Wonisch, eds., Srpski pogledi na odnose između Gothenburg, Sweden. srpskog, hrvatskog i bošnjačkog jezika, Book I/2, Pavica Mrazović and Zorka Vukadinović. 2009. pages 41‒50. Graz and Belgrade: Institut für Slawistik Gramatika srpskog jezika za strance. Sremski der Karl-Franzens-Universität Graz and Beogradska Karlovci, Novi Sad: Izdavačka knjižarnica Zorana knjiga. Stojanovića. Jovan Ćirilov. 2010. Hrvatsko-srpski rječnik inačica и Pavle Ivić, Ivan Klajn, Mitar Pešikan, and Branislav Српско-хрватски речник варијаната. Novi Brborić. 2004. Srpski jezički priručnik. Beograd: Sad:Prometej. Beogradska knjiga. Jörg Tiedemann and Nikola Ljubešić. 2012. Efficient Petar Skok. 1955‒1956. O sufiksima -isati, -irati i -ovati . 
discrimination between closely related languages. In: Jezik, 4(2):36‒43. Proceedings of COLING 2012, pages 2619–2634, Predrag Piper. 2009. O prirodi gramatičkih razlika između Mumbai, India. srpskog i hrvatskog jezika. In: Predrag Piper, ed., Lada Badurina. 2004. Novije promjene u hrvatskome Južnoslovenski jezici: gramatičke strukture i funkcije, standardnom jeziku. Croatian Studies Review, 3‒4:83‒ pages 537‒552. Beograd: Beogradska knjiga. 93 Predrag Piper and Ivan Klajn. 2013. Normativna Marcos Zampieri and Binyam Gebrekidan. 2012. gramatika srpskog jezika. Novi Sad: Matica srpska. Automatic Identification of Language Varieties: The Stjepan Babić. 1999. Dva tvorbena normativna problema i Case of Portuguese. In: Jeremy Jancsary, ed., njihova rješenja. Jezik, 66(3):104–112. Proceedings of KONVENS 2012, pages 233–237, https://docplayer.rs/191032196-Dva-tvorbena- ÖGAI. Main track: poster presentations. normativna-problema-i-njihova-rješenja-stjepan- Mihailo Stevanović. 1989. Savremeni srpskohrvatski jezik. babić.html Beograd: Naučna knjiga. Vera Ćevriz-Nišić. 2009. Razlikovne crte između srpskog, Mirela Ivanić and Jelena Perišić. 2018. Derivacija glagola hrvatskog i bošnjačkog standardnojezičkog izraza. In: sa osnovama stranog porekla u srpskom jeziku u svetlu Savremena prоučavanja jezika i književnоsti, Zbоrnik (ne)jasne diferencijacije između srpskog i hrvatskog radоva sa I naučnоg skupa mladih filоlоga Srbije I standarda. In: Družbeni in politični procesi v sodobnih (1), pages 373‒383, Kragujevac: Impres. slovanskih kulturah, jezikih in literaturah, pages 177‒ Zenaida Karavdić. 2011. Komparativna sintaksa 190. bosanskog, crnogorskog, hrvatskog i srpskog jezika. Mitar Pešikan, Jovan Jerković, and Mato Pižurica. 2010. In: Njegoševi dani 3, Zbornik radova, 357‒365, Pravopis srpskoga jezika. Novi Sad: Matica srpska. Nikšić: Univerzitet Crne Gore, Filozofski fakultet. Muntsa Padró and Lluis Padró. 2004. Comparing methods Živojin Stanojčić and Ljubomir Popović. 2008. 
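Throughout the paper, each variable is reported as a pair of values that sum to 1 per corpus (e.g. 0.39:0.61 for -isa/-ova vs. -ira in SCW). The normalisation can be sketched as below; the function name and the counts are hypothetical, since the paper does not publish its raw frequencies.

```python
# Minimal sketch of the ratio computation used for every variable in the
# paper: the corpus frequencies of the Serbian-type and Croatian-type
# variants are normalised so that they sum to 1. Counts are invented.

def variant_ratio(freq_sr_variant: int, freq_hr_variant: int) -> tuple[float, float]:
    """Return the (Serbian-variant, Croatian-variant) shares, summing to 1."""
    total = freq_sr_variant + freq_hr_variant
    if total == 0:
        raise ValueError("variable unattested in this corpus")
    return (freq_sr_variant / total, freq_hr_variant / total)

# e.g. the -isa/-ova vs. -ira suffix pair in one corpus (hypothetical counts)
sr_share, hr_share = variant_ratio(39, 61)
print(f"{sr_share:.2f}:{hr_share:.2f}")  # 0.39:0.61
```

A value close to one extreme then indicates proximity to SW or CW for that feature, while values near 0.50:0.50 correspond to the balanced case the initial hypothesis predicted.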
Ocenjevanje uporabniško dodanih sopomenk v Slovarju sopomenk sodobne slovenščine – pilotna študija
Evaluation of User-Added Synonyms in the Thesaurus of Modern Slovene – a Pilot Study

Magdalena Gapsa
Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
magdalena.gapsa@ff.uni-lj.si

Abstract
The paper describes the first step of a user study in which various expert groups of evaluators will assess the relevance of selected user-added synonyms in the Thesaurus of Modern Slovene.
Part of the research is to check whether the evaluations of experts such as proofreaders, translators and teachers differ from those of lexicographers, and how consistent the assessments are within each group. The main focus is on the process and results of the first set of assessments, carried out by a group of students as a test set. This step also served as a test of the instructions and tools chosen, to facilitate the planning of the work of the remaining intended groups of evaluators. The results are then presented in terms of the relevance of the synonymous material as assessed by the group of students, with the borderline categories of "conditionally" acceptable material being of particular interest, followed by the weaknesses identified in the research design and the solutions and improvements that will be incorporated into the further assessment process.

1. Introduction
With the advent of the digital medium, both the needs and the opportunities in linguistics and natural language processing are changing. These manifest above all as the possibility of automated (faster, simpler and cheaper) updating of language data and descriptions, greater interoperability between different types of data, unlimited space for their presentation, and the involvement of the wider community in the dictionary-making process,[1] etc. This paper focuses on the last of these, i.e. the possibility of contributions by the wider language community, more specifically the possibility for dictionary users to add synonym material to the Thesaurus of Modern Slovene[2] (Arhar Holdt et al., 2018; hereafter also Sopomenke), which raises the question of a possible shift in views on synonymy. User-contributed material makes it possible to observe how users perceive synonymy, especially in relation to lexicographers, who make the final decisions on including synonym material in reference language resources.

The paper is based on a research question from the doctoral dissertation Synonymy in the Thesaurus of Modern Slovene and Selected Wordnet Versions,[3] concerning the contribution of the wider language community to views on synonymy. In the dissertation I hypothesise that the view of the expert and wider language community differs from that of lexicographers, and that this potentially different view can contribute substantially to building new language resources and upgrading existing ones. I will test this hypothesis by analysing synonymy judgements of a selected set of user-added material, assessed by various expert groups (listed below). I will first compare the judgements within each group and then across groups. The aim of this paper is to present the findings of the first, pilot evaluation of user-added synonyms, carried out by a group of six students of language-related programmes. The evaluation task given to this group had two main purposes: (I) preparing the material for evaluating user-added synonyms and testing the model, tools and instructions, with any necessary additions or adaptations; and (II) collecting feedback for planning the scope and execution of further evaluation rounds.

The research is partly connected with the project Sopomenke in Kolokacije 2.0 – SoKol (Upgrade of the Fundamental Dictionary Resources and Databases of CJVT UL),[4] funded by the Ministry of Culture of the Republic of Slovenia in 2021–2022. The project's main goal is the renewal of the Thesaurus of Modern Slovene and the Collocations Dictionary of Modern Slovene. The project provided access to linguistics students with good knowledge of the Thesaurus of Modern Slovene and experience in annotating semantically related data.

1 In Slovenia, the topic is addressed by the monograph Slovar sodobne slovenščine: problemi in rešitve (Gorjanc et al., 2017). The role of users in the dictionary-making process and ways of collaborating with them are discussed in more detail by e.g. A. Abel and C. Meyer (2013).
2 Thesaurus of Modern Slovene: https://viri.cjvt.si/sopomenke/slv/ (accessed 6. 5. 2022).
3 The doctoral dissertation is being written within the research programme Language Resources and Technologies for Slovene (programme no. P6-0411), funded in 2019–2023 by the Slovenian Research Agency, and is supervised by Dr Špela Arhar Holdt.
4 SoKol project website: https://www.cjvt.si/sokol/ (accessed 3. 5. 2022).

2. Description of the resource and overview of the field
The Thesaurus of Modern Slovene, published in 2018 by the Centre for Language Resources and Technologies of the University of Ljubljana, is the first example of a new lexicographic concept, the so-called responsive dictionary (Arhar Holdt et al., 2018). Its main characteristic is that the dictionary constantly responds to changes in the language and to the needs of its users. For the purposes of this paper, the most important feature is that users can participate in the process of creating the dictionary: the data change according to the activity and comments of the community, which can also help clean out irrelevant or erroneous data.[5]

Crowdsourcing for lexicographic purposes is a well-known practice. The crowd to be included in an evaluation needs no special prior knowledge or education, since language users who are not experts in the field are a sufficiently talented, creative and efficient group, capable of solving less demanding or more routine tasks, while the experts, once crowdsourcing is introduced, can focus on more complex and more analytical tasks (Kosem et al., 2013, p. 46; Čibej et al., 2015, pp. 70–71). Crowdsourcing can be extremely efficient and reliable – the answers and judgements of the non-expert community hardly differ from the gold standard, i.e. the answers given by lexicographers, as was demonstrated as early as 2008 with the Amazon Mechanical Turk (AMT) platform (Snow et al., 2008, pp. 257–258), especially if a sufficient number of evaluators is ensured (cf. Nicolas et al., 2021). The involvement of the community in the development of the Thesaurus of Modern Slovene rests on this assumption.

Judging the (non-)synonymy of words has often been used in digital lexicography in the broad sense, especially in building, upgrading and cleaning various wordnets: e.g. the Russian wordnet, where evaluators judge (in)correctness and themselves compose and correct synsets (Braslavski et al., 2014), or the Czech wordnet, for which a unified system was developed for reporting errors discovered by users, who can also propose a correction (Horák and Rambousek, 2018). In Slovenia, evaluators used the crowdsourcing tool sloWCrowd (Tavčar et al., 2012) to judge whether automatically obtained suggestions were synonymous and whether they belonged to the intended synset, thus helping to remove errors from the Slovene wordnet (Fišer et al., 2014).

It should be noted that judging the similarity or synonymy of two words is by no means an easy and unambiguous task even for lexicographers, since there is no universal definition of synonymy; the concept itself is very broad and closely tied to the context and circumstances of use, and different researchers interpret and describe it differently (see e.g. Snoj, 2019, pp. 13–41; Vidovič Muha, 2013, pp. 172–183; Zorman, 2000, pp. 20–48). Most definitions define synonyms as words that have identical meaning but different form (Zgusta, 1971, p. 89); the distinction between words with the same meaning according to their stylistic or register value is also emphasised (Toporišič, 1992, p. 294). Two main views prevail in the literature: synonyms are either only words with exactly the same meaning (full synonymy) or words whose meaning is very similar (partial synonymy). Full synonymy is very rare, as it violates the principle of linguistic economy, while partial synonymy is frequent (cf. Hock, 1991, p. 283; Snoj et al., 2016, p. 5; Vidovič Muha, 2013, p. 175; Zgusta, 1971, p. 89); it most often appears with figurative meanings, loanwords, archaisms and expressive vocabulary, and the most synonyms are found for words used precisely in a figurative or collocation-related meaning (Apresjan, 2000, p. 37). In Slovenia, partial synonymy has been understood as part of stylistics rather than semantics, so for a long time, especially in lexicography, mainly full synonymy was treated; synonyms in the SSKJ dictionary also have a normative role (directing users from the marked towards the unmarked), while with the publication of the Dictionary of Slovene Synonyms (Sinonimni slovar slovenskega jezika, 2016) more attention was also devoted to partial synonymy (Vidovič Muha, 2013, p. 180; Snoj et al., 2016, p. 6).

The Thesaurus of Modern Slovene offers a new framework, as it highlights the role and value of context by displaying collocations and linking to corpus examples, while also giving users the possibility to add material that, in their opinion, is missing from the dictionary. On the basis of almost 1,000 user-added synonym suggestions, I wish to reopen the question of how synonymy is understood and to check whether this understanding has changed with the emergence and development of digital language resources, especially the responsive dictionary.
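The crowd-versus-gold-standard comparison cited above (Snow et al., 2008) can be sketched as follows. The pairs, votes and gold labels below are invented toy data, not the study's; the answer labels follow the Slovene categories used later in the paper.

```python
from collections import Counter

# Hypothetical sketch: per-pair crowd judgements are reduced to a majority
# vote and compared against a lexicographers' gold standard.

crowd = {
    ("brat", "sorojenec"): ["DA", "DA", "POGOJNO DA", "DA", "DA", "DA"],
    ("bonbon", "cuker"): ["POGOJNO DA", "POGOJNO DA", "DA", "POGOJNO DA", "NE", "POGOJNO DA"],
}
gold = {("brat", "sorojenec"): "DA", ("bonbon", "cuker"): "POGOJNO DA"}

def majority(votes):
    """Most frequent label among the votes (ties broken by first occurrence)."""
    label, _ = Counter(votes).most_common(1)[0]
    return label

agreement = sum(majority(v) == gold[pair] for pair, v in crowd.items()) / len(crowd)
print(f"crowd vs. gold agreement: {agreement:.0%}")
```

With enough evaluators per item, aggregated votes of this kind are what the cited work found to approach expert answers.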
Community judgements, including judgements of the (non-)synonymy of words, are also useful more broadly, e.g. in evaluating the accuracy of word embeddings for related words (cf. Schnabel et al., 2015, pp. 301–303). This raises the question of whether, as the group of contributors grows, the very view of the material that a thesaurus should contain also broadens or changes. For the identification of synonymy, lexicographers follow pre-selected (sometimes rather strict) linguistic criteria, whereas dictionary users can judge synonymy much more subjectively, namely from the point of view of the "usefulness" or "relevance" of a suggestion for their work (e.g. suggestions of the type brat – sorojenec ('brother – sibling') or avto – osebno vozilo ('car – personal vehicle') were mostly classified as relevant by the evaluators, although a different relation is involved).[6]

3. Planned course and findings of the research
Within the doctoral research, the goal is that, in addition to the students, the evaluation is also carried out by representatives of other groups of participants: lexicographers (as the most specialised experts in the field), translators, proofreaders, teachers of Slovene, and amateur language enthusiasts without linguistic education (as representatives of the wider language community). The groups of interest were determined on the basis of the typology of target groups in user research (see Arhar Holdt, 2015, pp. 142–146), according to which dictionary users in principle belong to (at least) one of the following groups: (I) users who use dictionaries in the process of education (e.g. students and teachers of Slovene),[7] (II) users who use dictionaries for professional purposes (e.g. lexicographers, translators, proofreaders and teachers), and (III) users who use dictionaries for leisure activities (e.g. amateur language enthusiasts).

The typology was also used in a survey of users' attitudes towards new features in Sopomenke, where the most represented groups in the results were proofreaders, translators, teachers of Slovene at various levels of education, writers of various types of texts (e.g. fiction, professional and academic writing, creative writing, journalism, blogging, etc.) and amateur language enthusiasts (cf. Arhar Holdt, 2020, p. 477).

5 The other main characteristics of the dictionary are, e.g.: (a) it is available only in digital form, taking into account the needs, conditions and advantages of this medium, and it is never finished, as the data constantly change and adapt to the current state of the language; (b) the dictionary database is built using advanced computational methods, which quickly offers users a large amount of openly and freely accessible language data that are relevant but not yet cleaned; (c) the synonym data are linked to textual context through collocations, corpus examples and links to the corpus; and (č) the dictionary and its database are freely and openly accessible under an appropriate licence (cf. Arhar Holdt et al., 2018, p. 404; Čibej and Arhar Holdt, 2019, pp. 339–340).
6 I assume that in certain categories of material and decisions common ground will emerge, and in others differences.
From this we can conclude that these groups are the most interested in Sopomenke (and in synonym data in general), while also being relevant and representative. Since testing the hypothesis also requires the opinion of lexicographers, the group of writers[8] is replaced by lexicographers.[9] Their answers will be analysed within the group and will at the same time serve as a reference evaluation of the synonym pairs, against which the answers of all other groups will be compared.

On the basis of the evaluation results by group and the comparisons of answers between groups, I wish to obtain an empirical basis concerning users' wishes and expectations, which will, from an applied perspective, serve as a basis for preparing guidelines for the editorial protocols to be used in future upgrades of the dictionary. From a scientific perspective, the answers will form a basis for defining synonymy in the light of responsive digital language resources. Besides the aforementioned goals, the first round of the user study, carried out by the students, also served to test the research design, to uncover weak points, and to collect feedback for planning the scope and execution of further evaluation.

7 Students, especially of language programmes, are at the transition between education and professional use; similarly, teachers use dictionaries for professional purposes, but their profession is tied to the educational process.
8 In the case of writers, it would be hardest to obtain a coherent group covering the various genres listed above; on the other hand, the remaining groups satisfy the need for representatives of the group that uses dictionaries for professional purposes.
9 It should be borne in mind that, owing to their education and specialisation, lexicographers are a very atypical user group for dictionary research (cf. Arhar Holdt, 2015, p. 140), but precisely this serves the purpose of the research here.

4. Material and method
4.1. Material
The user study is based on part of the data constituting the data sample for the doctoral dissertation, namely a list of 546 nouns that appear in the database of the Thesaurus of Modern Slovene (Krek et al., 2018) and in sloWNet (Fišer, 2015), as well as in the Slovene Lexical Database (Gantar et al., 2013) and in the Comprehensive Slovene–Hungarian Dictionary, where the nouns are labelled with semantic types (Kosem and Pori, 2021).[10]

For this study, the dissertation data were additionally enriched with data on user-added synonyms in Sopomenke, based on an internal export of the data from 18. 11. 2021.[11] This yielded a list of 307[12] headwords with at least one user-added synonym, i.e. 976 synonym pairs. Some user entries (68 pairs) contained additional explanations and notes in parentheses, most often notes on the markedness of the suggestion, e.g. arheolog – žličkar (šalj., pog.), bonbon – cuker (neknj.), klient – kunt (nar.), preteklost – prtljaga (ekspresivno), stopnica – štenga (nižje pog.). Since I did not want to suggest answers to the evaluators, such notes were removed. There were also 5 recorded cases where users added contextual explanations rather than qualifiers in parentheses. These cases were included in the final evaluation set without changes, as I wanted to test the evaluators' reaction to such labels. These are the suggestions: interier – ambient (v zaprtem prostoru), kmet – kmet (šahovska figura), koncentracija – (velika/majhna) vsebnost, priloga – priponka (k e-pismu) and torbica – (torbica) pismo. Removing the notes resulted in duplicated synonym pairs;[13] I detected 4 such cases, which were likewise excluded from the evaluation list. The final list comprised 972 pairs.

10 The latter two resources are taken into account because, in the (remaining) analyses within the doctoral dissertation, I also wish to consider a corpus-based semantic description of the (potentially) synonymous material.
11 The Thesaurus database, which is available in the CLARIN.SI repository, does not contain the user-added synonyms. These are exported from the dictionary interface with a custom script, which ensures that the user-added data are up to date.
12 I do not distinguish between headwords with initial capital and lower-case letters. In the task material there is only one such example, namely zemlja and Zemlja, treated here as a single headword.
13 The dictionary interface does have a simple safeguard preventing users from re-entering an already added suggestion, but it is based on character recognition and allows the entry of both alphanumeric and non-alphanumeric characters, e.g. parentheses. When a user adds a note to an existing synonym suggestion, the system recognises it as a new entry. In my sample this happened four times: twice within the entry babica, where the suggestions were nona and nona (lokalno) as well as oma and oma ;), within the entry živina, where živad and živad (star.) were suggested, and within the entry nakup, where kupilo and kupilo (star.) were suggested.

4.2. Instructions to the evaluators
The evaluators received an introductory note briefly explaining that the evaluation was being carried out as part of doctoral research and what data I wished to collect. They were asked not to use other language resources and reference works during the evaluation. It was stated that the task consisted of two obligatory parts: a spreadsheet with synonym pairs, in which they would give their answers and any comments, and a questionnaire, in which they would provide demographic information about themselves and feedback on the evaluation task itself. In case of doubt, the participants could ask additional questions by e-mail.

The main instruction to the evaluators was to answer the question: "Are the two words in the pair synonyms?" Each synonym pair could be placed into one of four categories, i.e. one of four possible answers: YES (DA), NO (NE), CONDITIONALLY YES (POGOJNO DA) and NOT SURE/DON'T KNOW (NISEM PREPRIČAN/NE VEM). The answer YES was intended for cases where they were sure the two words were synonyms; the answer NO for cases where they were sure the words were not synonyms, and for obvious errors and typos. The answer CONDITIONALLY YES was intended for pairs where the evaluators did consider the words synonyms but at the same time saw limitations or had reservations and doubts, e.g. that the words are synonymous only in a certain meaning or context, or that one or both words are marked, etc. The answer NOT SURE/DON'T KNOW was intended for pairs where they did not know one or both words in the pair, or the meaning of one or both words, or were unsure or found it difficult to give an opinion. For each pair it was possible to add a note; notes were required with the answer CONDITIONALLY YES, desired with NOT SURE/DON'T KNOW, and optional with the other answers.

Since one of the main goals of the study is to examine what evaluators understand as relevant synonym material, the instructions were kept very general in order to avoid suggesting answers. For this reason, "synonym" was not defined more precisely; the possible answers contained only a short description, without examples. Nor were there any instructions on where to place borderline cases.
in a particular sense or context, one or both words are marked, etc. The answer NISEM PREPRIČAN/NE VEM ('not sure/don't know') was intended for pairs where the annotators did not know one or both words in the pair or the meaning of one or both words, or where they were unsure or found it difficult to give an opinion. For each pair it was possible to add notes; these were required with the answer POGOJNO DA ('conditionally yes'), encouraged with NISEM PREPRIČAN/NE VEM, and optional with the other answers. Since one of the main goals of the study is to examine what annotators understand as relevant synonym material, the instructions were, to avoid suggesting answers, kept very general. "Synonym" was therefore not defined more precisely, and the possible answers contained only a short description, without examples. There were likewise no instructions on where to place borderline cases.

4.3. Evaluation

The synonym pairs were delivered to the annotators as a table available as a Google Sheet.14 The file consisted of two sheets. The first contained an abridged version of the instructions, so that the annotators always had them at hand; the second contained the list of 972 synonym pairs to be evaluated. The first column of the table holds the sequential number of the pair, the second the headword, and the third the proposed synonym, e.g. vonj – vzduh, stigma – brazda, reforma – sprememba, pošta – sporočila, dopust – vakance. The cells in these columns were locked to prevent intentional and unintentional changes to the data. In the fourth column the annotators chose one of the four answers from a drop-down list (to avoid typos). The last, fifth column was reserved for the annotators' comments and notes; this was the only column where they could enter data freely. The annotators accessed the data on the principle of one annotator – one spreadsheet, so that the answers of other annotators would not influence an individual's decisions.

4.4. Questionnaire

The annotators also received a link to an online questionnaire, which was an integral part of the evaluation. The questionnaire was created in, and accessible through, the online survey tool 1ka.15 In the first part, the participants answered questions about themselves: age, employment status, education (linguistic or not), the ways in which they engage with language, and the main language-related areas most important to them. In the second part, they answered questions about the evaluation itself: how much time they needed, whether they had any difficulties, whether the instructions were clear, and whether anything was missing from them. The questionnaire was accessible without restrictions, the annotators could preview the questions in advance, and their answers were saved continuously, so that they could, for example, first provide information about themselves and only later the information about the task.

5. Results

The first student reported completing the evaluation on 16 February 2022, the last on 8 March 2022. I obtained all the required answers within three weeks. The collected data can be divided into two main sets: the ratings of the sample of synonym pairs, and the answers obtained through the questionnaire.

5.1. Evaluation results

All the answers given by the annotators were merged into tables using MS Excel. The first table contained the data on the chosen answer (without notes), which made it possible to check the agreement, i.e. uniformity, of the annotators. The second table recorded the notes that were given. These were reviewed manually and assigned to one of the categories that emerged during the review: only in a particular sense or context; marked; unknown word or word meaning; hyper- or hyponym; explanation; and other (e.g. incorrect spelling, shades of meaning, other lexical relations, unusual word forms, part-of-speech mismatch, rarity of use, etc.). In cases where the annotators also specified the type of markedness (e.g. affectionate, colloquial, obsolete, etc.), this information was preserved as well.16 The numerical data on the notes are presented in Table 1: 914 pairs had at least one of the six note categories assigned, 435 at least two categories, 75 pairs at least three, and 3 pairs had four note categories assigned. The third table combined the data on annotator agreement with the already categorized notes.

Category                               Count
only in a particular sense/context     406
marked                                 375
unknown word or word meaning           266
hyper- or hyponym                      182
explanation                            65
other                                  122
notes in total                         1,416

Table 1: Numerical distribution of note categories.

14 A Google Sheet requires no special hardware from the annotators, and the entered answers are saved continuously, so the task did not have to be completed without interruption.
15 Online survey tool 1ka: https://www.1ka.si/ (accessed 5 May 2022).
16 A more detailed analysis of the actual notes and comments given by the annotators exceeds the scope and purpose of this paper, but it is certainly relevant and interesting, also from the perspective of understanding synonymy, and will therefore be addressed in the future.

ŠTUDENTSKI PRISPEVKI / STUDENT PAPERS — Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

There was very little complete agreement, where all six annotators gave the same answer: only 34 pairs within the list of 972, i.e. roughly 3.5% of the whole set. All six annotators recognized 17 pairs as unquestionably synonymous (6 DA answers), 5 pairs as conditionally synonymous (6 POGOJNO DA answers), 5 pairs as unquestionably non-synonymous (6 NE answers), and 7 pairs as unknown or indeterminable (6 NISEM PREPRIČAN/NE VEM answers). Majority agreement, where only one answer deviates, was considerably more frequent: 132 such pairs in total, i.e. roughly 13.5% of the material. In 50 cases five annotators chose the answer DA, in 46 POGOJNO DA, in 19 NE, and in 17 NISEM PREPRIČAN/NE VEM.
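The complete vs. majority agreement tallies described above can be sketched in a few lines of code. This is a minimal illustration with hypothetical data — the study itself compiled these counts in MS Excel — and the function names are our own:

```python
from collections import Counter

# The four answer labels used in the evaluation (kept in Slovenian).
ANSWERS = ("DA", "POGOJNO DA", "NE", "NISEM PREPRIČAN/NE VEM")

def agreement_type(ratings):
    """Classify one pair's six ratings: 'complete' if all annotators gave
    the same answer, 'majority' if exactly one answer deviates, else 'mixed'."""
    top_count = Counter(ratings).most_common(1)[0][1]
    if top_count == len(ratings):
        return "complete"
    if top_count == len(ratings) - 1:
        return "majority"
    return "mixed"

def majority_answer(ratings):
    """The most frequent answer among one pair's ratings."""
    return Counter(ratings).most_common(1)[0][0]
```

Running `agreement_type` over all 972 pairs and tallying the results would yield the counts reported here (34 pairs with complete agreement, 132 with majority agreement).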
In total there were 166 pairs with high annotator agreement, i.e. 17% of the material. 67 pairs (40%) were placed in the category DA, 51 pairs (31%) in the category POGOJNO DA, 24 pairs (14.5%) in the category NE, and the remaining 24 pairs (14.5%) in the category NISEM PREPRIČAN/NE VEM. The distribution of ratings across the four categories is shown in Table 2.

Answer                    Complete agreement   Majority agreement   Total
DA                        17 pairs             50 pairs             67 pairs
POGOJNO DA                5 pairs              46 pairs             51 pairs
NE                        5 pairs              19 pairs             24 pairs
NISEM PREPRIČAN/NE VEM    7 pairs              17 pairs             24 pairs
Total                     34 pairs             132 pairs            166 pairs

Table 2: Numerical distribution of answers.

In the cases where the annotators agreed completely (all chose the same answer), a total of 22 notes were given for 15 synonym pairs. In the 132 cases where the annotators mostly agreed (one answer deviated), a total of 158 notes were given for 109 pairs. For notes in the category "other", the annotators most often mentioned shades of meaning, spelling or written form, rarity of use, loanwords, etc. The distribution of notes by category and the numerical data are shown in Table 3; for clarity and easier comparison, the order of categories from Table 1 is retained.

Category                               Complete agreement   Majority agreement
only in a particular sense/context     3                    37
marked                                 5                    55
unknown word or word meaning           11                   24
hyper- or hyponym                      1                    20
explanation                            0                    2
other                                  2                    20
notes in total                         22                   158
pairs in total                         15                   109

Table 3: Numerical distribution of note categories under complete and majority agreement of answers.

Among the pairs that the participants marked as acceptable (DA), they most often pointed out that the two words are synonymous only in one sense or context, e.g. dilema – težava, identiteta – osebnost, koncentracija – osredotočenost, privilegij – ugodnost, stigma – zaznamovanost. Notes on marked vocabulary are also frequent, e.g. beluš – asparagus (citation word), cedilo – cedilka (colloquial), morilec – krvnik (obsolete), pes – kuža (colloquial), strpnost – potrpežljivost (colloquial); that the two words form a hyper-/hyponym pair (e.g. avto – osebno vozilo, avtomobil – osebno vozilo, brat – sorojenec, poroka – ženitev, kašelj – pokašljevanje); that they do not know the words or word meanings (modrček – nedrc, oklevanje – obiranje, rit – zadnja plat); that the proposal is an explanation (jok – pretakanje solz); and other notes (e.g. dež – dežne kaplje: mero-/holonymy; elita – veljaki and elita – pomembneži: mismatch in grammatical number; prerok – profet: unusual form; sestra – sorojenka: rare use). Interestingly, the annotators perceived the pair brat – sorojenec as a hyper-/hyponym relation, whereas for the pair sestra – sorojenka only one annotator pointed out rarity of use, with no other comments.

Among the pairs that the participants marked as conditionally acceptable (POGOJNO DA), cases of marked vocabulary appear most often, e.g. avto – kripa (derogatory), deček – mulec (derogatory, negative attitude), juha – župca (colloquial, diminutive), krema – maža (colloquial), zadrga – fršlus (colloquial, dialectal). Also frequent are cases where the two words are synonymous only in a particular sense or context, e.g. izkušnja – dogodivščina, kaos – štala, jesen – starost, posluh – čut, preteklost – prtljaga, and hyper-/hyponym pairs, e.g. alkohol – etanol, aorta – arterija, avto – prevozno sredstvo, fotoaparat – digič, priseljenec – tujec. There are also pairs where the annotators stated that they do not know the words or word meanings, e.g. koder – krauželj, pivo – pirček, rit – prdulja, telovnik – lajbič, and pairs with other notes, e.g. pogum – jajca: part of the phrase is missing; policija – murja and rit – guza: foreign word.

Among the pairs that the participants marked as unacceptable (NE), they most often stated that the two words are synonymous only in a particular sense or context, e.g. ljubezen – življenjski tok, stopnica – terasa, živina – blago. There were also notes on marked vocabulary, e.g. čarovnica – čudežnica (positive attitude), nedelja – teden (obsolete); that the annotators do not know the word or word meaning (čik – žvečilni gumi and čik – žvečilka,17 laboratorij – pospeševalnik); that the proposed synonym is of a more explanatory nature (rekreacija – raztezne vaje in vaje za moč); that the pair is a hyper-/hyponym pair (projekcija – podatek); and other notes (davek – dan: incorrect spelling; nedelja – teden: foreign word).

For all pairs where the annotators chose the answer NISEM PREPRIČAN/NE VEM, a note was recorded stating that these are words or meanings the annotators do not know.

17 In the notes on these two pairs, the annotators also pointed out that the words are synonymous in a particular sense or context, but they marked them as non-synonymous, arguing that for the student generation čik means exclusively a cigarette.
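The assignment of note categories to pairs and the per-category counts reported in Tables 1 and 3 can be sketched as a simple tally. The pairs and category assignments below are invented for illustration only, not the study's actual data:

```python
from collections import Counter

# Hypothetical note-category assignments; one pair may carry several
# categories, as in the study (914 pairs had at least one, 435 at least two).
notes_per_pair = {
    "vonj – vzduh": ["samo v določenem pomenu/kontekstu"],
    "stigma – brazda": ["zaznamovano", "neznana beseda ali pomen besede"],
    "dopust – vakance": ["zaznamovano"],
}

# Count how often each category was assigned, over all pairs.
category_counts = Counter(
    category for categories in notes_per_pair.values() for category in categories
)
total_notes = sum(category_counts.values())
```

Summing the counter's values gives the total number of notes, as in the last rows of Tables 1 and 3.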
Examples include čarovnica – bela žena, civilist – legist, obrok – rata, stranka – kunt and zaliv – olmun. In addition, there were notes that the words are marked, e.g. avto – gare (derogatory), kašelj – brehanje (colloquial), koder – loken, krona – dika (archaic), zdravilo – arcnije (obsolete, archaic); that the two words are synonymous only in one sense or context (e.g. postava – geštel); that they form a hyper-/hyponym pair (torbica – nabočnica); and other notes, e.g. srajca – košilja: incorrect spelling; zdravilo – biofarmacevtik: doubts about the actual use of the word; zdravilo – arcnije: mismatch in grammatical number.

Considering complete and majority agreement together, a total of 118 of the 166 pairs, or 71%, were placed in the categories DA or POGOJNO DA, while 24 of the 166 pairs, or 14.5%, were placed in each of the categories NE and NISEM PREPRIČAN/NE VEM. Into the categories DA or POGOJNO DA, i.e. acceptable or relevant material, the annotators generally placed pairs where they also pointed out that the words are marked, e.g. babica – starejša gospa (colloquial, obsolete, positive attitude), debelost – zašpehanost (derogatory, colloquial, negative attitude), kmet – seljak (derogatory, negative attitude), novinar – pisun (derogatory, negative attitude), steklenica – flaša (colloquial); that the two words are synonymous only in one sense or context (e.g. blago – capa, izrazoslovje – izrazje, legenda – štorija, rit – zahrbtnež, žarnica – sijalka); and hyper-/hyponym pairs (e.g. kovanec – novčič, nakup – fasunga). These categories also received pairs where the annotators pointed out, for example, shades of meaning, rarity of use or grammatical mismatches (e.g. the proposed synonym is in the plural), e.g. cedilo – sito, stereotip – predsodek, pes – štirinožni prijatelj. The category NE, i.e. unacceptable or irrelevant material, most often received foreign words, incorrectly spelled words, and word pairs that could be synonymous only in one sense or context — although in the case of a majority NE answer only one annotator pointed out this condition — e.g. davek – dan, nedelja – teden, živina – blago, stopnica – terasa, projekcija – podatek. Into the category NISEM PREPRIČAN/NE VEM, i.e. material requiring additional, detailed review, the annotators generally placed pairs where they did not know the words or meanings, or where they believed the words are not used at all or only rarely, e.g. avto – sinhronka, cigareta – španjoleta, fotografija – heliotipija, moka – mlevina, zdravilo – biofarmacevtik, torbica – nabočnica. Under majority agreement, only two pairs received a note that the proposed synonym is of an explanatory nature; of these, one pair (jok – pretakanje solz) was judged acceptable (a majority of DA answers) and the other (rekreacija – raztezne vaje in vaje za moč) unacceptable (a majority of NE answers).

5.2. Data on the annotators and the task

In the first part of the questionnaire I collected data about the annotators. All the annotators in the pilot group belong to the 20–30 age group; the youngest was born in 2001 and the oldest in 1995. Since this was a student population, all the annotators stated that they are studying, and most also stated that language is the central subject of their studies. Only one student indicated that language is not at the forefront of his studies, because he studies philosophy. In the next question I asked in which areas language plays a central role for them; multiple answers were possible. Three answers were available: that language interests them because they engage with it mainly in the educational process, that they use language professionally, or that they engage with language purely as a hobby. All indicated that they engage with language in the educational process, and half additionally indicated that they also use language professionally. They then specified up to three areas or activities in which language is at the forefront of their interest, chosen from an offered list of nine options. The most frequent answer was language research or study (5 answers), followed by proofreading (4), translation (3), teaching Slovenian (2), and lecturing on linguistic subjects at the higher or university level and text production (1 answer each). Nobody chose lexicography or amateur language research, and there were no other answers either. In the next question the annotators had to choose only one of the previously listed areas or activities as their main or most relevant one. Three chose language research or study as their main area, and one each named proofreading, translation and text production.

Questions about the task followed. First the annotators estimated how many hours the task took them. On average, the students needed about 6 hours to fill in the spreadsheet; the fastest finished in three hours and the slowest in eleven. All confirmed that the instructions were clearly formulated. Only one student reported having difficulties while working on the task, namely that he did not know many of the words and therefore found it hard to take a position on potential synonymy. The annotators also had the opportunity to express concerns, observations and comments not covered by the instructions. Three annotators did so, stating that they would have liked the category POGOJNO DA to be better defined, that they were not sure into which group to place hyper-/hyponyms, word explanations and unestablished foreign words, and that they missed the possibility of checking the synonyms in other resources, which would have allowed them to give better answers, although they also understood why they were not allowed to use them. They had no additional comments.

6. Discussion

Regarding the adequacy of user-added synonyms, it turned out that there was very little unquestionably non-synonymous material contributed by users. Cases where the annotators' answers agreed completely form a small part of the sample (34 pairs, roughly 3.5%), which was, however, to be expected given the amount of data and the number of annotators. Somewhat more numerous were cases where one annotator's answer deviated (132 pairs, roughly 13.5% of the set). In total, then, there are 166 pairs with majority agreement, or 17% of the set. Within this, pairs where the deviating answer comes from the opposite pole (e.g. all answers NE and one POGOJNO DA, or all answers DA and one NE) amount to roughly one third (42 pairs). The great majority of these pairs were judged acceptable by the annotators. Only 24 of the 166 pairs, or 14.5%, were placed by majority into the category NE or NISEM PREPRIČAN/NE VEM. This indicates that user-added synonym proposals are generally relevant and constructive: 118 of the 166 pairs with majority-agreeing answers (71%) were placed in the categories DA and POGOJNO DA, even though these include cases that substantially exceed the traditional linguistic understanding of synonymy. This finding is in line with a study from 2020, in which the analysis of a balanced part of a sample of 1,662 synonyms (at most 10 proposals per user) showed that around 70% of user-added proposals are constructive and at the same time unmarked, around 20% constructive and marked, and only a good 6% non-constructive or malicious (cf. Arhar Holdt and Čibej, 2020, p. 6). The ratings of a single participating group understandably are not and must not be a sufficient basis for generalization, but the data on the relevance of user-added material are nonetheless encouraging. Special attention will have to be paid to the category NISEM PREPRIČAN/NE VEM, since all the potential synonym pairs placed in this category received a note that the annotators do not know the word or its meaning. This does not mean that such proposals are irrelevant, but in the process of updating the Thesaurus (Sopomenke) they will demand more attention from the editors, e.g. a more thorough search for corpus examples, the use of additional resources to verify actual use, etc.

From the questionnaire feedback and from correspondence with the students who contacted me during the evaluation, it is evident that before further evaluation the instructions must be supplemented and the background and goals of the study explained in more detail. The students received only a very short explanation that the evaluation was being carried out as part of a doctoral thesis, and what its main goal was, without more detailed descriptions and explanations. The instructions did tell them not to use other language resources during the evaluation, but without an explanation of why this was undesirable. From the notes they gave, it is evident that some violated this instruction, as notes of the following type were frequent: "I found it in Gigafida", "It is evident from Gigafida", "I found it neither in Fran nor in Gigafida"; some even attached links to other reference works in the table. This is very likely a consequence of the missing justification of why this was undesirable. Many questions asked by e-mail were accompanied by the statement that they wanted to solve the task "correctly". A possible explanation is that students are used to answers being graded on a right–wrong basis, while the task description did not explicitly state that there are no right or wrong answers, or rather that all answers are correct, since we are asking for their opinion. This statement may be less relevant for the other planned groups of annotators, but it nevertheless appears to be a sensible addition to the description of the task and its purpose.

In the instructions the students were told that they would be evaluating user-added synonyms, but without an explanation of what kinds of proposals can appear on the list. Questions about what to do with hyper- and hyponyms, unestablished foreign words or corrupted forms, and proposals of a more explanatory nature were frequent. Based on the feedback, an additional description of what kinds of data the annotators can expect on the list also appears to be a sensible addition. On the other hand, typos and obvious errors were highlighted as examples considered irrelevant, i.e. non-synonymous, yet not all annotators always placed them in the category NE (e.g. the pair Zemlja – e, an obvious error, was placed by one of the annotators in the category NISEM PREPRIČAN/NE VEM). As the annotators themselves pointed out, they missed a more precise definition of the category POGOJNO DA. In the instructions for the remaining groups of users, it is therefore necessary to specify more precisely that this category covers pairs where the annotators can say something about the synonymy, but it is not beyond doubt, and they would like additional information alongside the synonym pair.

7. Conclusion and next steps

In this paper I have described the design of a study intended to address the research question of my doctoral dissertation, namely whether the view of the professional and wider language community differs from the view of lexicographers, and whether this potentially different view of the language community can be useful and important for the development of resources. I have also presented the course of the evaluation task, which was performed by students as a test group. With this I wanted to test the prepared instructions and the chosen tools, and to determine the task's time and financial scope and its difficulty, which will help me plan the work and recruit the remaining planned groups of annotators. Based on the answers and feedback from the pilot group, the evaluation proved feasible. The chosen tools, a Google Sheet for the evaluation of the synonym pairs and the online survey tool 1ka, proved suitable, easy to use, and financially and temporally sustainable.18 The instructions for the annotators will need to be improved, and it also seems sensible to explain to the annotators in more detail the context of the study and the purpose of the evaluation (obtaining their subjective opinion, not "correct" answers). The problematic points in the instructions have been addressed, and the instructions for the annotators have been appropriately reworked and supplemented for the next planned groups.

The plan is for the same list to be evaluated by 5 more groups of annotators: lexicographers, professional translators, proofreaders, teachers of Slovenian, and language enthusiasts without linguistic education. At the time of writing, the recruitment of participants is under way, and the data should be obtained by the summer of 2022. Analyses of the results within groups will follow, and then comparative analyses between groups. Although the first results are not yet suitable for generalization, they offer good insight into the dilemmas involved in judging the synonymy of user-added material. It is encouraging that (at least after the first step of the study) much of the user-added material was rated as relevant. Should the ratings of the other planned participating groups bring similar results, this finding can be taken into account in the further development of synonym resources for Slovenian, in the direction of expansion and enrichment with new data. At the same time, earlier findings are confirmed that user proposals are to a very large extent constructive and well-intentioned, which is crucial for the functioning and further development of responsive dictionaries.

8. Acknowledgements

This paper was written within the research programme Language Resources and Technologies for the Slovene Language (programme No. P6-0411), co-financed by the Slovenian Research Agency.
18 Both tools are free to use for annotators and researchers alike and require no prior knowledge or additional training, no special hardware, and no additional registration.

9. References

Andrea Abel and Christian M. Meyer. 2013. The dynamics outside the paper: user contributions to online dictionaries. In: Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of the eLex 2013 conference, 17–19 October 2013, Tallinn, Estonia, pp. 179–194. Trojina, Institute for Applied Slovene Studies and Eesti Keele Instituut.
Jurij Apresjan. 2000. Systematic Lexicography (translated by Kevin Windle). Oxford University Press, Oxford.
Špela Arhar Holdt. 2015. Uporabniške raziskave za potrebe slovenskega slovaropisja: prvi korak. In: V. Gorjanc, P. Gantar, I. Kosem and S. Krek, eds., Slovar sodobne slovenščine: problemi in rešitve, pp. 136–149. Znanstvena založba Filozofske fakultete.
Špela Arhar Holdt. 2020. How Users Responded to a Responsive Dictionary: the Case of the Thesaurus of Modern Slovene. Rasprave Instituta za hrvatski jezik i jezikoslovlje, 46(2): 465–482. doi:10.31724/rihjj.46.2.1
Špela Arhar Holdt and Jaka Čibej. 2020. Rezultati projekta "Slovar sopomenk sodobne slovenščine: Od skupnosti za skupnost". In: Zbornik konference Jezikovne tehnologije in digitalna humanistika, 24–25 September 2020, Ljubljana, Slovenija, pp. 3–9. Inštitut za novejšo zgodovino.
Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Polona Gantar, Vojko Gorjanc, Bojan Klemenc, Iztok Kosem, Simon Krek, Cyprian Laskowski and Marko Robnik-Šikonja. 2018. Thesaurus of modern Slovene: by the community for the community. In: Proceedings of the XVIII EURALEX International Congress, Lexicography in Global Contexts, 17–21 July 2018, Ljubljana, pp. 401–410. Znanstvena založba Filozofske fakultete.
Pavel Braslavski, Dmitry Ustalov and Mikhail Mukhin. 2014. A Spinning Wheel for YARN: User Interface for a Crowdsourced Thesaurus. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 101–104. Association for Computational Linguistics. doi:10.3115/v1/E14-2026
Jaka Čibej and Špela Arhar Holdt. 2019. Repel the syntruders! A crowdsourcing cleanup of the thesaurus of modern Slovene. In: Electronic lexicography in the 21st century: Smart lexicography. Proceedings of the eLex 2019 conference, 1–3 October 2019, Sintra, Portugal, pp. 338–356. Lexical Computing CZ s.r.o.
Jaka Čibej, Darja Fišer and Iztok Kosem. 2015. The role of crowdsourcing in lexicography. In: Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11–13 August 2015, Herstmonceux Castle, United Kingdom, pp. 70–83. Trojina, Institute for Applied Slovene Studies and Lexical Computing Ltd.
Darja Fišer, Aleš Tavčar and Tomaž Erjavec. 2014. sloWCrowd: A crowdsourcing tool for lexicographic tasks. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC'14, pp. 3471–3475. European Language Resources Association (ELRA).
Darja Fišer. 2015. Semantic lexicon of Slovene sloWNet 3.1. Repository of the CLARIN.SI research infrastructure, http://hdl.handle.net/11356/1026
Polona Gantar, Simon Krek, Iztok Kosem, Mojca Šorli, Polonca Kocjančič, Katja Grabnar, Olga Yerošina, Petra Zaranšek and Nina Drstvenšek. 2013. Leksikalna baza za slovenščino 1.0. Repository of the CLARIN.SI research infrastructure, http://hdl.handle.net/11356/1030
Vojko Gorjanc, Polona Gantar, Iztok Kosem and Simon Krek, eds. 2017. Slovar sodobne slovenščine: problemi in rešitve. Znanstvena založba Filozofske fakultete Univerze v Ljubljani, Ljubljana. doi:10.4312/9789612379759
Hans Henrich Hock. 1991. Principles of Historical Linguistics (2nd, revised and expanded edition). Mouton de Gruyter, Berlin, New York.
Aleš Horák and Adam Rambousek. 2018. Wordnet Consistency Checking via Crowdsourcing. In: Proceedings of the XVIII EURALEX International Congress, Lexicography in Global Contexts, 17–21 July 2018, Ljubljana, pp. 1023–1029. Znanstvena založba Filozofske fakultete.
Iztok Kosem and Eva Pori. 2021. Slovenske ontologije semantičnih tipov: samostalniki. In: I. Kosem, ed., Kolokacije v slovenščini, pp. 159–202. Znanstvena založba Filozofske fakultete Univerze v Ljubljani, Ljubljana. doi:10.4312/9789610605379
Iztok Kosem, Polona Gantar and Simon Krek. 2013. Automation of lexicographic work: an opportunity for both lexicographers and crowd-sourcing. In: Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of the eLex 2013 conference, 17–19 October 2013, pp. 32–48. Trojina, Institute for Applied Slovene Studies and Eesti Keele Instituut.
Simon Krek, Cyprian Laskowski, Marko Robnik-Šikonja, Iztok Kosem, Špela Arhar Holdt, Polona Gantar, Jaka Čibej, Vojko Gorjanc, Bojan Klemenc and Kaja Dobrovoljc. 2018. Thesaurus of Modern Slovene 1.0. Repository of the CLARIN.SI research infrastructure, http://hdl.handle.net/11356/1166
Lionel Nicolas, Lavinia Aparaschivei, Verena Lyding, Christos Rodosthenous, Federico Sangati, Alexander König and Corina Forascu. 2021. An Experiment on Implicitly Crowdsourcing Expert Knowledge about Romanian Synonyms from Language Learners. In: Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning, pp. 1–14. LiU Electronic Press.
Tobias Schnabel, Igor Labutov, David Mimno and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 298–307. Association for Computational Linguistics. doi:10.18653/v1/D15-1
Jerica Snoj. 2019. Leksikalna sinonimija v Sinonimnem slovarju slovenskega jezika. Založba ZRC, ZRC SAZU, Ljubljana.
Jerica Snoj, Martin Ahlin, Branka Lazar and Zvonka Praznik. 2016. Sinonimni slovar slovenskega jezika. Založba ZRC, ZRC SAZU, Ljubljana.
Rion Snow, Brendan O'Connor, Daniel Jurafsky and Andrew Y. Ng. 2008. Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 25–27 October 2008, Honolulu, Hawaii, USA, pp. 254–263. Omnipress Inc.
Aleš Tavčar, Darja Fišer and Tomaž Erjavec. 2012. sloWCrowd: orodje za popravljanje wordneta z izkoriščanjem moči množic. In: Zbornik Osme konference Jezikovne tehnologije, pp. 197–202. Inštitut Jožef Stefan.
Jože Toporišič. 1992. Enciklopedija slovenskega jezika. Cankarjeva založba, Ljubljana.
Ada Vidovič-Muha. 2013. Slovensko leksikalno pomenoslovje. Znanstvena založba Filozofske fakultete, Ljubljana.
Ladislav Zgusta. 1971. Manual of Lexicography. Academia, Publishing House of the Czechoslovak Academy of Sciences, Prague.
Marina Zorman. 2000. O sinonimiji. Znanstveni inštitut Filozofske fakultete, Ljubljana.
Angleško-slovenska šahovska terminološka baza
(English-Slovenian Chess Terminology Database)

Vili Grdič, Alja Križanec, Kaja Perme, Lea Turšič
Oddelek za prevajalstvo, Filozofska fakulteta, Univerza v Ljubljani
Aškerčeva 2, 1000 Ljubljana
grdic.vili@gmail.com, alja.manja@gmail.com, kaja.perme@gmail.com, lea.tursic@gmail.com

Povzetek / Abstract
In our university Terminology course, we built an English-Slovenian chess terminology database because we wanted to create a reliable bilingual source of chess terminology that includes the Slovenian language. The database is based on the corpus approach. We built an English and a Slovenian corpus and extracted 82 English and 109 Slovenian terms. We divided them into five subfields (tactics, strategy, opening, endgame and other) and added definitions, collocations, usage examples, status information and notes.

1. Introduction

Chess can be seen merely as a leisure activity in the form of a board game, but it is in fact a sports discipline and a richly interdisciplinary, terminologically very complex field. It is also the subject of research in numerous disciplines, both in the natural and the social sciences. Slovenian chess terminology can seem very complicated, and it is sometimes difficult to find Slovenian equivalents of the foreign-language terms (especially English ones) that constantly surround chess players owing to the strong influence of the internet. Knowing this terminology is, however, crucial for correctly expressing and describing chess games. To make it easier for translators and linguists to find Slovenian equivalents, we set out to build a bilingual chess terminology database based on the corpus approach. One of the authors of this paper is himself an active chess player, which contributed to the motivation for the study and to its starting points regarding content.

Terminology is the discipline that studies the specialized vocabulary of a given professional field, i.e. terms. Its subject also includes concepts, the relations between them, and their designations in different languages, and one of its main goals is the production of terminological reference works. It can thus also be characterized as a normative discipline, since by publishing such reference works it prescribes the use of terms and contributes to the process of terminological standardization (summarized from Vintar, 2017: 17–18).

For collecting terminological data we chose the corpus approach, which has dominated terminography in recent years. Most of the materials from which we draw linguistic data are freely accessible today. The fastest and easiest way to describe language is therefore with software that supports text analysis (e.g. Sketch Engine) and the compilation of one's own corpus, from which we can easily obtain word lists that we then edit as needed, automatically extract terms, and analyze further (summarized from Vintar, 2017: 83).

The chess masters Iztok Jelen and Matjaž Mikac helped us greatly with the project, and the women's master Monika Rozman also advised us.

2. Aim of the article

The aim of this article is to describe the project of building a bilingual terminology database created in the Terminology course. As translators we are aware that language resources are extremely important today, so we too tried to create a useful resource based on a modern linguistic approach. We also see the project as a continuation of the work already done on Slovenian chess terminology, while at the same time bringing English terms closer to the Slovenian community. We wish to encourage the further construction of Slovenian and multilingual chess reference works.

3. Outline of the field and related research

Chess is a roughly 1,500-year-old strategic board game that is nowadays also classified as a sport and is a richly interdisciplinary field. It is addressed by numerous other disciplines with ongoing research, for example psychology (the connection between personality, everyday life and chess, see Krivec, 2021), mathematics (chess and mathematics, see Grosar, 2017), neurology (chess and autism, see Gomes de Sousa, 2021), robotics (a chess robot, see Goldman et al., 2021), computer science (chess and artificial intelligence, see Guid, 2010), sociology (women and gender equality in the chess world, see Vishkin, 2022), pedagogy (the influence of chess on foreign-language learning, see Harazińska and Harazińska, 2017), and so on.

According to chess master Matjaž Mikac (VIR 1), chess has several dimensions: it is a game, it belongs to science and art, and above all it is a sport. Some disagree with the last point, but Mikac argues that the often criticized low level of physical activity in chess is not the only criterion by which a discipline counts as a sport. Like other sports disciplines, chess has a rich competitive tradition (world championships, the Chess Olympiad, school competitions), and players must be well prepared physically and mentally for games (e.g. for games lasting several hours).

Chess has a rich tradition in Slovenia; from the end of the 19th century to the present we have, as a nation, recorded exceptional achievements that rank among the world's best.
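The corpus approach described above — compiling a domain corpus and automatically extracting term candidates — can be illustrated with a simple frequency comparison against a reference corpus. The actual project used Sketch Engine; the scoring below is a generic keyness-style sketch over toy data, not Sketch Engine's own method:

```python
from collections import Counter

def rank_term_candidates(domain_tokens, reference_tokens, smoothing=1.0):
    """Rank words by how much more frequent they are in the domain corpus
    than in the reference corpus (a simple smoothed frequency ratio)."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    scores = {}
    for word, freq in dom.items():
        rel_dom = (freq + smoothing) / (n_dom + smoothing)
        rel_ref = (ref[word] + smoothing) / (n_ref + smoothing)
        scores[word] = rel_dom / rel_ref
    return sorted(scores, key=scores.get, reverse=True)
```

Words like gambit or fianchetto would score high in a chess corpus and low in a general reference corpus, making them term candidates for manual review.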
the successes of Josip Plahuta, Milan Vidmar, Luka Lenič and Laura Unuk (summarised from Jelen, 2006: 10–12). Grandmaster Marko Tratar (2003: 4) notes that chess "had its place throughout the 20th-century Slovenian press, both in its competitive aspect [...] and because of its artistic, scientific and pedagogical dimensions."

In recent years (above all during the coronavirus pandemic of 2020–2021), a number of factors have contributed to the greater popularity and spread of chess around the world. The series The Queen's Gambit (Frank, 2020; see Jurc, 2020) had a major global impact: through a fictional story it realistically portrays the contemptuous attitude towards women in the 20th-century chess world, and all the games and matches are also depicted correctly from a chess point of view (Loeb McClain, 2020). Teenagers and young adults, as well as others, have been strongly influenced by Twitch, a platform for live-streaming various content (Johannson, 2021). On it, grandmasters and other chess players teach and entertain live audiences of up to several tens of thousands of viewers. Among the biggest are Hikaru Nakamura (the profile GMHikaru), Alexandra and Andrea Botez (BotezLive) and, in the Slovenian context, Laura Unuk, Teja Vidic and Lara Janželj (Checkitas). The four online amateur chess tournaments PogChamps also contributed greatly to the viewership of chess, since they featured popular streamers from Twitch and YouTube (Johannson, 2021; see VIR 2). In the opinion of Matjaž Mikac (VIR 1), the effect of these factors on chess in Slovenia was not as strong or as obvious, since the nation was already well developed in chess terms.

3.1. Chess Terminology

The etymology of some Slovenian and foreign chess terms was studied by the jurist Leonid Pitamic (1950). He notes that most terms derive from Latin, Arabic and Persian and that, under the influence of the cultural and political developments in Europe from the 12th century onwards, they evolved differently in the European languages. Some terms have a fairly similar origin in several languages (e.g. šah from the medieval Latin scacci), while others differ considerably and therefore also have a different literal meaning (e.g. the terms for the bishop: English bishop, German Läufer 'runner', French fou 'fool', Russian слон 'elephant'). Pitamic further notes that the words šah, šahovnica and ček influenced certain words in the fields of law, economics and finance in several European languages (e.g. the present-day French word for the chessboard, échiquier, which is connected with the highest court, the Echiquier, in old Normandy; summarised from Pitamic, 1950: 173–204).

Slovenian chess terminology developed under the influence of Serbian, from which the old Slovenian chess masters, such as Milan Vidmar, borrowed and translated (see Vidmar, 1946; 1951). Their findings (and part of the theory of the Croatian chess player Vladimir Vuković, see 1978; 1990) were collected by chess master Iztok Jelen in several contributions to the curriculum of the elective chess subject for primary schools (VIR 3; VIR 10; see Jelen, 2004a; 2004b).

Chess terminology is to a lesser extent multilingual, since certain foreign terms from French (en passant, j'adoube), German (Zwischenzug, Fingerfehler, Blitz) and Italian (fianchetto, intermezzo) have established themselves in most languages. In the jargon of Slovenian chess players we also find Croatian (pješak/pijun), Serbian (dirigovanje) and Russian terms (пешка 'pawn'), which is probably a remnant of the times of Yugoslavia and the Soviet Union, when chess had great (often also political) significance and was frequently covered in the media (VIR 1).

Chess terminology has already been the subject of numerous studies, which have shown the complexity of the field and of its vocabulary. Adylova (2017: 8) notes that the terminological subfield of chess openings alone has its own structural classification of terms (two-, three-, four- and multi-component names of openings), and the same holds for the other chess subfields (the middlegame, the endgame, tactics, etc.). Karayev (2016: 103) describes how some general expressions have passed into chess terminology (e.g. to calculate), while others have, through determinologisation, passed from chess into the general language, where they are usually used figuratively (in Slovenian, e.g., imeti nekoga v šahu/matu/patu 'to have someone in check/checkmate/stalemate'). He goes on to note (2016: 103) that people often associate chess with war and politics, which is why chess terminology is frequently used figuratively in non-chess contexts as well. The author draws on journalism and its publicistic language: "Our government, like a pawn, does not move backwards" (Moskovskij Komsomolets, 21 January 2005). He adds that the migration of chess terminology into the general language is nothing unusual, since exactly this is characteristic of sports terminology in general (in everyday language we use, e.g., attack, passing the ball, bullseye, etc.). Zhuravleva and Vlavatskaya (2021: 534) likewise note that chess terminology is not limited exclusively to chess (chess-specific expressions are, e.g., check, checkmate, stalemate, fianchetto) but extends across the entire sphere of sport (e.g. victory, defeat, attack, defence, referee).

3.2. Language Resources for Chess Terminology

Almost every (larger) English-language website for playing chess, as well as other chess websites, has guides, glossaries and other resources for learning chess, where we also find lists of terminology with definitions, pictures and the like (e.g. on chess.com, lichess.org, chess24.com). Chess master Iztok Jelen (VIR 3) agrees that there are many resources for English and that the online ones are easily accessible, but they can differ greatly from one another. He says it is difficult to establish their credibility: the definitions can vary, some being very general and others more precise; the author is unknown; sources of information are not cited; and it is not explained how the glossary was produced (e.g. whether with a corpus approach). The selection of terms is just as questionable, since some glossaries describe collocations and other phrases as terms (control of the center 'nadzor/varovanje središča'), others add jargonisms (cheapo 'an easy trap') and even neologisms (Botez Gambit 'an unintentional queen sacrifice', coined by the chess player Alexandra Botez; VIR 14). Some terms found in online glossaries do not appear in our corpus at all and can be found online only in other glossaries — they thus exist only in theory and not in actual use, e.g. knight fork windmill (a subtype of the windmill tactic). For the general user, online resources are useful and precise enough; for linguistic purposes, however, glossaries from chess books are more reliable.

We ourselves relied most on the English glossaries from the books Chess For Dummies (Eade, 2016) and Winning Chess Openings (Seirawan, 2016) by the renowned American chess player Yasser Seirawan. Iztok Jelen recommends The Oxford Companion to Chess (Hooper and Whyld, 1992).

For Slovenian, among online resources we found the glossaries Šahovsko izrazoslovje on the ICP portal (VIR 4) and Šahovsko izrazoslovje on Wikipedia (VIR 5), which have a large inventory of terminology and are precise enough for the general user. More reliable are the contributions to the primary-school curriculum for the elective chess subject by Iztok Jelen (VIR 10; 2004a; 2004b), which contain the rules of the game, extensive theory and Slovenian terms.
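One practical way to vet such glossaries, sketched below under the assumption of plain-text inputs, is to check whether each glossary entry actually occurs in a corpus: entries with zero hits (like knight fork windmill above) are candidates for terms that "exist only in theory". The function names and sample data here are illustrative and not part of the project's actual pipeline.

```python
import re

def term_frequencies(corpus_text, terms):
    """Count case-insensitive, word-boundary-delimited occurrences of each term."""
    text = corpus_text.lower()
    freqs = {}
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        freqs[term] = len(re.findall(pattern, text))
    return freqs

def ghost_terms(corpus_text, glossary):
    """Glossary entries that never occur in the corpus."""
    return [t for t, n in term_frequencies(corpus_text, glossary).items() if n == 0]
```

On a toy corpus, `ghost_terms("The knight fork is a common tactic. A windmill can win material.", ["knight fork", "windmill", "knight fork windmill"])` flags only the third entry.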
The author himself also recommends the Dictionary of Standard Slovenian (Slovar slovenskega knjižnega jezika) and, as an aid to further research, the Russian encyclopedia Šahmaty. Enciklopedičeski slovar (1990) and the Croatian translation of Golombek's Encyclopedia of Chess (Golombek, 1980).

4. Method

The main goal of our project was to produce a bilingual chess glossary or terminology database built on the basis of an English and a Slovenian corpus of texts. We opted for the corpus approach. We wanted to investigate the actual use of chess terms in both languages, to include in the database the terms most frequent in use, and to equip the entries with definitions, collocations, examples of use, status information and any necessary notes. On the basis of the English corpus we built an English terminology database, and then, using the Slovenian corpus, we added Slovenian terminological equivalents and equipped them with the relevant information.

4.1. The Corpus Approach

The terminology database is designed according to the corpus approach, which means that its linguistic data were obtained from a corpus, which we built and analysed in the Sketch Engine tool. We chose this approach because it is easier to create a corpus of texts and analyse it with the help of computer concordances, and thus describe the language of a given professional field, than to do so the old way with paper slips (Logar and Vintar, 2008: 5). A corpus that gathers a variety of texts from a field can, at a sufficient size, serve as a representative sample of the language and offer insight into actual language use. Such an approach is not only easier but also more modern, faster and thus more user-friendly (Logar and Vintar, 2008: 14). Even a basic corpus analysis in Sketch Engine yields a word list, which can then be sorted at will for further analysis (alphabetically, by length, etc.), together with information on word frequency, which is particularly welcome when identifying typical terminological patterns (Vintar, 2017: 84). If the corpus is lemmatised and morphosyntactically annotated, it can be analysed in even more detail: for example, we can display all the adverbs, adjectives, prepositions and so on that occur alongside a given headword and thus establish which collocations are the most frequent (Logar and Vintar, 2008: 5). Corpus-analysis programs are equipped with functions that automatically extract keywords and terms, both single- and multi-word. This yields a set of candidate terms, which the user then only has to review manually, removing the unsuitable ones.

Today the corpus approach is necessary not only for building terminological reference works but also for building any language reference work that aims to present the current state of the language (Gantar, 2004: 170). Besides the automation of lexicographic procedures, its advantages include information about context and use and the possibility of discarding irrelevant information (Gantar, 2004: 177).

4.2. The English and Slovenian Corpora

For the purposes of the project we created two corpora, an English and a Slovenian one. The goal in collecting the texts was to achieve the best possible coverage of the terminology, so we divided the expressions by terminological subfield (tactics, strategy, the opening, the endgame and other) and included roughly the same number of words for each. In collecting the sources we took care to capture both general and specialised chess sources and to cover all five subfields. Through the diversity and balanced representation of the texts we aimed to extract relevant terms that would better reflect actual use and to obtain more precise data on frequency of use. We did not include sources that contain many (or exclusively) definitions, such as glossaries, nor sources that cover, besides chess, other fields (and hence terms) irrelevant to us. The set of sources is nevertheless limited, since in the Slovenian corpus we included only freely accessible online sources, and in the English one, besides these, also some books in PDF format.

The Slovenian corpus comprises 139,964 words and consists of 55 texts. All the sources are freely accessible online. In order to capture as much terminology as possible, most of them are online contributions on chess theory, and the rest are general chess articles. We could not include Slovenian books on chess, as they are not freely accessible online, which is why this corpus is considerably smaller than the English one. The corpus comprises articles on a variety of topics, 28 of which deal with the rules of play (e.g. VIR 6), the strategy of individual phases of the game (e.g. VIR 7), the pieces, the history of chess and general topics (e.g. VIR 8; VIR 9). We also included 10 contributions from the ICP portal (e.g. VIR 4) and 17 from the online classroom for chess as an elective subject in primary schools (VIR 10).

The English corpus comprises 869,592 words and consists of 21 texts. Like the Slovenian corpus, it covers a great deal of theory on the individual phases of the game (the opening, middlegame and endgame, e.g. VIR 11), on strategy and tactics, as well as the laws of chess of the world chess federation FIDE (VIR 12); this content appears in online and book sources alike. The corpus contains 7 longer online articles (e.g. VIR 13), and we also added 14 chess books or manuals in PDF format (e.g. Eade, 2016). Since books are much longer than articles, the English corpus has fewer text entries than the Slovenian one but comprises far more words.
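The frequency lists described above can be imitated with a minimal sketch: count token frequencies and filter out chess-notation noise (moves, squares, castling, letter–square compounds) with a few regular expressions. The patterns below are our own illustrative assumptions, not a reimplementation of Sketch Engine's term extraction, and a manual review of the surviving candidates is still required.

```python
import re
from collections import Counter

# Illustrative noise patterns for chess notation (assumed, not exhaustive):
NOISE_PATTERNS = [
    r"^[kqrbna-h]?x?[a-h][1-8]$",   # e4, g5, Ke6, Ra4, exd5, cxd4, xf6
    r"^o-o(-o)?$",                  # castling: o-o, o-o-o
    r"^[a-h][1-8]-[a-h][1-8]$",     # square pairs and diagonals: f4-f5, b1-h7
    r"^[a-h][1-8]?-[a-z]+$",        # letter compounds: c-pawn, f-file, e4-square
]

def is_noise(token):
    """True for tokens that look like chess notation rather than terms."""
    t = token.lower()
    return any(re.fullmatch(p, t) for p in NOISE_PATTERNS)

def candidate_terms(tokens, top_n=10):
    """Frequency-ranked single-word candidates with notation noise removed."""
    counts = Counter(t.lower() for t in tokens if not is_noise(t))
    return counts.most_common(top_n)
```

For example, from the tokens ["pawn", "e4", "Ke6", "exd5", "o-o", "f4-f5", "c-pawn", "pawn", "endgame"], only "pawn" and "endgame" survive the filter.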
5. The Terminology Database

In extracting, identifying and classifying the terms we ran into a number of difficulties. Chess master Monika Rozman helped us with this.

5.1. Difficulties in Term Extraction

The program automatically extracted 1,000 single- and multi-word terms. Among them were many expressions that were not terms, so we had to clean the lists. From the term lists we removed the following (the examples are from the English lists):

- missing parts of words: agonal, advan, endg
- incorrectly recognised words: parry (instead of Garry (Kasparov)), dummies (from Chess For Dummies)
- book headers and footers: Dvoretsky (an author), 2010 (a year)
- pawn moves (also square labels): f4, e4, g5
- piece moves and move coordinates: Ke6, Ra4, o-o; exd5, cxd4, xf6 (a partial record of a move)
- compound moves and coordinates (naming both squares): f4-f5, b7-b5
- diagonals: a2-g8, b1-h7 (the b1-h7 diagonal)
- combinations of letters and other expressions: c-pawn, d6-pawn, f-file, e4-square
- names and surnames of chess players: Dvoretsky, Karpov, Rubinstein (some openings, variations, etc. are named after famous players)
- general expressions: USSR, USCF

Table 1: Non-terminological expressions from the English list of extracted terms.

We believe that the problems with incorrectly recognised words arose because some texts were in a format that is harder to read. Older books also contain stylised print, in which some words are spaced out for emphasis or layout reasons, e.g. n o r p; these were recognised as multi-word terms even though they carry no meaning. Because of the repetition of information in book headers and footers, the Sketch Engine tool also extracted titles, chapter headings, page numbers, players' names and the like, which are irrelevant information for us.

The most frequently extracted expressions on the list were chess moves and coordinates. This is because chess books largely consist of such information, and the books contributed the most words to the corpus. Apart from the introduction, chess books contain little "concrete" running text of the kind we are used to in articles. They are full of diagrams of positions, on the basis of which the author explains the games and with whose help we learn chess openings, strategy, tactics and so on.

5.2. Selecting the Terms

In classifying the terms into the five subfields (tactics, strategy, the opening, the endgame and other) we encountered difficulties that are common in terminology. For some single-word terms it was hard to decide whether the word is a general expression or a chess term, e.g. take 'vzeti', diagonal 'poševnica', rank 'vrsta'. We had the most trouble establishing whether a multi-word unit is a term in its own right or merely a collocation, e.g. weak pawn 'šibek kmet', isolated pawn 'osamljeni kmet', center square/central square 'sredično polje'.

We also paid attention to terms with syntactic, semantic or morphological variants. The expressions to defend, defense and defensive 'braniti, obramba, obrambni' are all used very frequently, and in different contexts as well (e.g. defense can be part of the name of an opening: Sicilian Defense 'sicilijanska obramba'). It is hard to determine whether the different parts of speech represent independent terms or merely variants of a single term.

Some terms describe a phenomenon, piece, move, etc. that can be assigned to several subfields. The pawn, for example, is important in the opening, the middlegame and also the endgame; several "special moves" can be carried out with it (pawn promotion, capturing en passant), so it belongs to several subfields. Some terms could not be assigned to any of the envisaged subfields at all (črni 'Black', beli 'White'). We partly resolved these difficulties by introducing the subfield other. Monika Rozman and Iztok Jelen were a great help to us here; the latter also reviewed the entire glossary and provided us with reliable terminological sources.

5.3. Building the Terminology Database

Once we had selected the terms and assigned them to the appropriate subfields, we set about building the terminology database. We first created a bilingual database in the SDL MultiTerm program and defined the structure of an entry in English, in accordance with the TBX standard. At the entry level we added the chess subfield (opening, endgame, strategy, tactics, other), at the language level the definition and notes, and at the level of the term itself its usage, status (obsolete, colloquial, preferred, standard, variant) and notes. We decided to record the subfield so that we can distinguish more precisely which terms belong to which phase of the game and what a term actually denotes (whether a move, a piece, a tactic, etc.), and the status of a term in order to flag the unsettled or jargon use of some terms.

We first entered the English terms and added definitions to them. We wrote these following the model of the reliable glossaries from chess books, mostly Chess For Dummies (Eade, 2016) and Winning Chess Openings (Seirawan, 2016), so that they would be more comprehensible for our purposes; Monika Rozman helped us here as well. On the basis of the corpus we equipped the entries with further information (subfield, collocations, etc.). Then, with the help of the Slovenian corpus and in consultation with the chess masters, we entered the Slovenian terminological equivalents, collocations, examples of use and so on.

The terminology database contains 77 entries, comprising 82 English terms with 77 definitions and some synonyms, together with 109 Slovenian equivalents. Each term has an English definition and a subfield label, and the great majority also have collocations, status, usage information and, where necessary, notes. Iztok Jelen helped us review and expand the database. We did not add Slovenian definitions, since we could not find enough sources containing definitions for the majority of our set of Slovenian terms; rather than write our own definitions, we preferred to leave them out. We hope to fill this gap in the future with the help of the masters and, for instance, the Chess Federation of Slovenia.

For the subfield of the opening we entered the terms characteristic of that phase of the game (including the names of openings), e.g. gambit, castling, Spanish Game, Sicilian Defense 'gambit, rokada/rošada, španska otvoritev, sicilijanska obramba'. For the endgame we entered the terms for the possible outcomes of the game, some mating patterns and the names of particular endgames, e.g. checkmate, stalemate, back-rank mate, Lucena position 'šah mat, pat, mat na osnovni vrsti, Lucenova pozicija'. The subfield of strategy comprises the terms we most often meet in the middlegame, when the important strategic plans are formed, e.g. position, square, diagonal, file, kingside, zugzwang, tempo 'pozicija, polje, poševnica, navpičnica, kraljevo krilo, nujnica, tempo'. For tactics we included the basic tactical patterns fork, skewer, discovered attack, sacrifice 'vilice, linijski udar, odkriti udar, žrtev', etc. We also added the subfield other in order to avoid some of the difficulties in assigning terms to subfields; here we capture the chess pieces, special moves, titles and acronyms, e.g. king, queen, promotion, grandmaster, arbiter, chessboard 'kralj, dama, pretvorba kmeta, velemojster, sodnik/sodnica, šahovnica'.

We also recorded frequent collocations, e.g. kingside attack, strong bishop, lead in development 'napad na kraljevem krilu, močni lovec, razvojna prednost'. Where the use of a term was ambiguous, we also noted whether it is used as a verb, noun or adjective (e.g. checkmate 'šah mat' can be a verb or a noun in English). For the term fianchetto 'fianketo' we also added an example of the correct and incorrect English pronunciation, /ˌfɪənˈkɛtəʊ/ vs. */ˌfɪənˈtʃɛtəʊ/, and of the correct Slovenian pronunciation, /ˌfɪanˈketo/.

6. Limitations of the Project

The database was built from a limited set of sources. The Slovenian corpus contains only online sources; for good representativeness it would be necessary to include some book sources on the various chess subfields as well. We did take care of this in the English corpus, but it too would need more sources for better representativeness.

To keep the project within manageable bounds, we relied on corpus frequency in the final selection of terms and included in the database only the basic terms and some additional information. With larger corpora and with the help of more experts and terminologists, the database could be enriched not only in the number of terms but also in its range of collocations and examples of use.
Our definitions, our way of adding and recording collocations, and the other data should additionally be reviewed by a terminologist and a lexicographer, so that the database would conform to the established ways of building a terminology database or a multilingual glossary.

7. Conclusion

On the basis of the corpus approach and the two corpora we built, we created an English–Slovenian terminology database into which we entered the 82 most frequently used English chess terms, assigned them 109 Slovenian equivalents, and equipped them with definitions, collocations, examples and usage information. In building the corpora we used both popular articles and specialised material, striving for better representativeness of the individual subfields. In the English corpus we also included book sources, while the Slovenian one was limited to online sources. The database covers a set of basic terms in both languages. We aim to create larger corpora and extract more terms with collocations and examples of use, to add Slovenian definitions, and to collaborate with more chess experts, so that the database will be as precise and correct as possible in the future.

8. References

Z. T. Adylova. 2017. System Chess Nomina of Terminological Field "Debut". Scientific Journal of National Pedagogical Dragomanov University. Series 9. Current Trends in Language Development, 16:5–11. Dragomanov National Pedagogical University, Kyiv.
James Eade. 2016. Chess For Dummies. John Wiley & Sons, New York.
Scott Frank, director. 2020. The Queen's Gambit. Netflix. https://www.netflix.com/si/title/80234304.
Polona Gantar. 2004. Jezikovni viri in terminološki slovarji. In: Terminologija v času globalizacije: zbornik prispevkov s simpozija »Terminologija v času globalizacije, Ljubljana, 5.–6. junij 2003«, pp. 169–178. ZRC SAZU, Ljubljana.
Figure 1 (Slika 1): An example of a terminological entry in MultiTerm.

We are advocates of open science, and we have therefore published the terminology database in the CLARIN.SI repository (Grdič et al., 2022), where it is freely available in TBX format. Although it covers only the most frequent chess terms, we nevertheless regard it as a contribution to Slovenian chess terminology. Thanks to its terminological equivalents in two languages, it can help translators and other linguists in writing texts and in researching chess terminology. We intend to expand and upgrade the database in the future.

Samuel Goldman, Andrew Kwolek, Kenji Otani, Ian Ross and Jack Zender. 2021. Chess Robot. University of Michigan, Department of Mechanical Engineering. https://deepblue.lib.umich.edu/handle/2027.42/167650.
Harry Golombek. 1980. Šahovska enciklopedija. Prosvjeta, Zagreb. Translation of: Golombek's Encyclopedia of Chess. 1977. Crown Publishers, New York.
Luciano Gomes de Sousa. 2021. Chess and Autism Spectrum Disorder (ASD). Brilliant Mind, 8(4). https://revistabrilliantmind.com.br/index.php/rcmbm/article/view/52.
Vili Grdič, Alja Križanec, Kaja Perme and Lea Turšič. 2022. English-Slovenian Chess Terminology Database 1.0. Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1680.
Gari Grosar. 2017. Šah in matematika. BA thesis, University of Primorska, Faculty of Education. https://repozitorij.upr.si/IzpisGradiva.php?id=9296&lag=eng.
Matej Guid. 2010. Znanje in preiskovanje pri človeškem in računalniškem reševanju problemov. PhD thesis, University of Ljubljana, Faculty of Computer and Information Science. http://eprints.fri.uni-lj.si/1113/1/Matej__Guid.disertacija.pdf.
Joanna Harazińska and Anna Harazińska. 2017. Chess-play as the effective technique in foreign language training. Applied Researches in Technics, Technologies and Education, 5(3):238–242. https://www.readcube.com/articles/10.15547%2Fartte.2017.03.012.
David Hooper and Kenneth Whyld. 1992. The Oxford Companion to Chess. Second edition. Oxford University Press, Oxford and New York.
Iztok Jelen. 2004a. Splošno-teoretska šahovska izhodišča izbirnega predmeta. Skupnosti SIO, online classroom Šah 7.–9. razred, chapter 6. https://skupnost.sio.si/course/view.php?id=2138.
Iztok Jelen. 2004b. Iz teorije kombinacij. From the personal archive of Matjaž Mikac.
Iztok Jelen. 2006. Šah in primerjalna analiza stanja šaha v Sloveniji. Slovenian Chess Federation. From the personal archive of Matjaž Mikac.
Erik Johannson. 2021. Chess and Twitch: Cultural Convergence Through Digital Platforms. MA thesis, Södertörn University, School of Culture and Education, Media and Communication Studies. https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1563119&dswid=6255.
Ana Jurc. 2020. Damin gambit: kako posneti napeto nadaljevanko o šahu? MMC RTV SLO. https://www.rtvslo.si/kultura/gledamo/damin-gambit-kako-posneti-napeto-nadaljevanko-o-sahu/543529.
Assylkhan Agbayevich Karayev. 2016. Specifics of chess terminology. Science, Technology and Education, 6(24):102–105. LCC Olympus, Moscow.
Jana Krivec. 2021. Improve Your Life by Playing a Game: Learn How to Turn Your Life Activities into Lifelong Skills! Thinkers Publishing, Landegem.
Dylan Loeb McClain. 2020. I'm a Chess Expert. Here's What 'The Queen's Gambit' Gets Right. The New York Times. https://www.nytimes.com/2020/11/03/arts/television/chess-queens-gambit.html.
Nataša Logar and Špela Vintar. 2008. Korpusni pristop k izdelavi terminoloških slovarjev: od besednih seznamov in konkordanc do samodejnega luščenja izrazja. Jezik in slovstvo, 53(5):3–17.
Leonid Pitamic. 1950. Šah v pravnem izrazoslovju. Razprave. [Razred 2], Razred za filološke in literarne vede = Dissertationes. Classis 2, Philologia et litterae, 1:173–204. Slovenian Academy of Sciences and Arts, Ljubljana.
Yasser Seirawan. 2016. Winning Chess Openings. Everyman Chess, London.
Šahmaty. Enciklopedičeski slovar. 1990. Sovetskaja enciklopedija, Moscow.
Marko Tratar. 2003. Šah v slovenskem časopisu. BA thesis, University of Ljubljana, Faculty of Social Sciences. http://dk.fdv.uni-lj.si/dela/Tratar-Marko.PDF.
Milan Vidmar. 1946. Razgovori o šahu z začetnikom. Državna založba Slovenije, Ljubljana.
Milan Vidmar. 1951. Pol stoletja ob šahovnici. Državna založba Slovenije, Ljubljana.
Špela Vintar. 2017. Terminologija: terminološka veda in računalniško podprta terminografija. Znanstvena založba Filozofske fakultete, Ljubljana.
Allon Vishkin. 2022. Queen's Gambit Declined: The Gender-Equality Paradox in Chess Participation Across 160 Countries. Psychological Science, 33(2):276–284. https://journals.sagepub.com/doi/10.1177/09567976211034806.
Vladimir Vuković. 1978. Škola kombiniranja. Šahovska naklada, Zagreb.
Vladimir Vuković. 1990. Uvod u šah na osnovi opće šahovske teorije. Šahovska naklada, Zagreb.
Irina Nikolaevna Zhuravleva and Marina Vitalevna Vlavatskaya. 2021. Structural model of chess terms in English. Science, Technology and Education, 2(87):534–539. LCC Olympus, Moscow.

VIR 1 = Interview with Matjaž Mikac, conducted by Vili Grdič, 4 August 2022, Ljubljana.
VIR 2 = Chess.com Launches PogChamps With Top Twitch Streamers. chess.com. https://www.chess.com/news/view/chess-com-pogchamps-twitch-rivals.
VIR 3 = Personal correspondence with Iztok Jelen, by e-mail, 5–10 August 2022.
VIR 4 = The ICP online chess portal. Archived 12 April 2021 at archive.org. https://web.archive.org/web/20210412125215/http://www.icp-si.eu/krozek/index.php?tip=glosar.
VIR 5 = Šahovsko izrazoslovje. Wikipedija. https://sl.wikipedia.org/wiki/%C5%A0ahovsko_izrazoslovje.
VIR 6 = Šahovska pravila. Wikipedija. https://sl.wikipedia.org/wiki/%C5%A0ahovska_pravila.
VIR 7 = Šahovska strategija in taktika. Wikipedija. https://sl.wikipedia.org/wiki/%C5%A0ahovska_strategija_in_taktika.
VIR 8 = Slovenske šahistke v Jugoslaviji. radiostudent.si. https://radiostudent.si/kultura/repetitio/slovenske-%C5%A1ahistke-v-jugoslaviji.
VIR 9 = Mitja Rizvič. 2016. Avtomatsko odkrivanje zanimivih šahovskih problemov. BA thesis, University of Ljubljana, Faculty of Computer and Information Science. https://core.ac.uk/download/pdf/151478793.pdf.
VIR 10 = Učni načrt za izbirni predmet šaha, online classroom Šah 7.–9. razred. Skupnosti SIO. https://skupnost.sio.si/course/view.php?id=2138.
VIR 11 = Chess endgame. Wikipedia. https://en.wikipedia.org/wiki/Chess_endgame.
VIR 12 = FIDE Laws of Chess. International Chess Federation. https://handbook.fide.com/chapter/E012018.
VIR 13 = Chess opening. Wikipedia. https://en.wikipedia.org/wiki/Chess_opening.
VIR 14 = Terms. chess.com. https://www.chess.com/terms.

Speech-level Sentiment Analysis of Parliamentary Debates using Lexicon-based Approaches

Katja Meden†∗
†Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana
∗Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana
katja.meden@ijs.si

Abstract
Sentiment analysis or opinion mining is a widely studied research area in the field of Natural Language Processing (NLP) that involves the identification of the polarity (positive, negative or neutral sentiment) of a text, usually done on shorter and emotionally charged text, such as tweets and reviews.
Parliamentary debates feature longer paragraphs and a very esoteric speaking style of Members of Parliament (MPs), which makes them much more complex. The aim of this paper is to explore whether and how lexicon-based approaches can handle the extraction of polarity from parliamentary debates, using the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment lexicon and the Liu Hu sentiment lexicon. We performed sentiment analysis with both lexicons, together with topic modelling of positive and negative speeches to gain additional insight into the data. Lastly, we measured the performance of both lexicons; both performed poorly. The results showed that while both VADER and Liu Hu were able to correctly identify the general sentiment of some topics (i.e., matching positive/negative keywords to positive/negative topics), most speeches are very polarising in nature, shifting perspective multiple times. The sentiment lexicons failed to recognise the sentiment of parliamentary speeches that are not extremely expressive or in which a larger number of intensity-boosting positive words is used to express negativity. We conclude that lexicon-based approaches (such as VADER and Liu Hu) in their unaltered state do not suffice when dealing with data such as parliamentary debates, at least not without modification of the lexicons.

1. Introduction

Sentiment analysis or opinion mining is a widely studied research area in the field of Natural Language Processing (NLP) that encompasses the extraction of thoughts, attitudes and subjectivity of a text in order to identify its sentiment polarity (positive, negative or neutral sentiment). Sentiment analysis is mostly used on shorter and emotionally charged text, such as tweets and reviews, though it can be used on other forms of textual data, such as parliamentary debates. Parliamentary debates are in essence transcriptions of spoken language, produced in controlled and regulated circumstances, with rich (sociodemographic) metadata (Erjavec et al., 2022).

Contrary to the social media data usually used for sentiment analysis (tweets and other shorter social-media texts), parliamentary debates and thus parliamentary discourse vary across political environments and cultures, and the text (or rather, the speeches) is longer and produced by parliamentary representatives in strict(er) procedural-themed language. This alone makes parliamentary debates a more complex object of sentiment analysis than tweets or reviews, where opinions and sentiments are usually expressed much more clearly and within a shorter span of text. The sentiment analysis for this paper was implemented on the HanDeSeT parliamentary corpus, which includes 1251 motion-speech units from 129 debates with manually annotated sentiment labels.

The aim of this paper is to explore lexicon-based approaches on parliamentary debates, using the lexicon- (and rule-)based VADER (Valence Aware Dictionary and sEntiment Reasoner) and the Liu Hu sentiment lexicon, to see how (and even if) such methods are able to handle sentiment analysis of longer, more complex textual data such as parliamentary debates. To complement this research question, we performed sentiment analysis with both lexicons, together with topic modelling of the positive and negative sentiment clusters to gain additional insight into the data. Lastly, we measured the performance of both lexicons and examined the reasons for possible misclassifications.

The paper is structured as follows: in Section 2 we present related work on sentiment analysis and the VADER and Liu Hu sentiment lexicons, as well as studies on sentiment in parliamentary debates. In Section 3 we present the chosen methodology, together with a presentation of the chosen dataset, Hansard Debates with Sentiment Tags (HanDeSeT). Section 4 presents the results of the sentiment analysis with the chosen lexicons, the topic modelling results, and their performance. Lastly, in Section 5 we present our conclusions and pointers for future work.

2. Related work

2.1. Sentiment analysis and lexicon-based approaches

There are several methods of applying sentiment analysis, which can be divided into three approaches: supervised, lexicon-based and hybrid (Catelli et al., 2022), each with its own set of advantages and disadvantages.

Lexicon-based approaches use sentiment lexicons to describe the polarity (positive, negative or neutral) of the text. This approach involves the manual construction of lexicons of positive and negative words to be used in the sentiment analysis, and a corpus of text to which the analysis is applied. The main advantages of this approach are that it is easy to understand and has wide term coverage, while its disadvantages lie in the finite number of words in the lexicons (i.e., we cannot cover all words, especially if the text is domain-specific) and in the assignment of a fixed sentiment orientation and score to words: every word in the lexicon is classified as positive or negative with a numeric score, e.g., on a scale from -5 (very negative) to 5 (very positive), with 0 denoting neutrality. For this paper, we focus on two specific lexicon- (and rule-)based approaches from the Natural Language Toolkit (NLTK): VADER and the Liu Hu sentiment module.

2.2. VADER (Valence Aware Dictionary and sEntiment Reasoner)

VADER is established as a gold-standard sentiment lexicon attuned to microblog-like contexts. It is primarily designed for Twitter and other social media text (as well as editorials and movie and product reviews). The VADER sentiment module is implemented in NLTK (https://www.nltk.org/api/nltk.sentiment.vader.html). The aim of its authors was to provide a computational sentiment analysis engine that works well on social-media-style text yet readily generalises to multiple domains and requires no training data, being constructed from a generalisable, valence-based, human-curated sentiment lexicon (Hutto and Gilbert, 2014). The VADER sentiment lexicon comprises 7,500 lexical features with validated valence scores that indicate both the sentiment polarity (positive/negative) and the sentiment intensity on a scale from -4 to +4. For example, the word okay has a positive valence of 0.9, good is 1.9, and great is 3.1, whereas horrible is -2.5, the frowning emoticon :( is -2.2, and sucks and its slang derivative sux are both -1.5 (Hutto and Gilbert, 2014). The entire VADER lexicon is available at https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt.

In the context of parliamentary debates, VADER has been used in several studies, such as Rohit and Singh (2018), where it was used to extract sentiment polarity, as it uses a simple rule-based model for general sentiment analysis and generalises more favourably across contexts than many benchmarks such as LIWC and SentiWordNet.

2.3. Liu Hu sentiment module

The Liu Hu sentiment lexicon is a product of the research of Hu and Liu, whose aim was to summarise the customer reviews of a product. Contrary to traditional summarisation tasks, they mined only reviews in which customers expressed an opinion on the product, trying to determine whether the opinions expressed were positive or negative (Hu and Liu, 2004). The Liu Hu opinion lexicon is publicly available (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) and consists of nearly 6,800 words (2,006 with positive and 4,783 with negative semantic orientation). The opinion lexicon has evolved over the past decade and is, similarly to VADER, attuned to sentiment expressions in social text and product reviews, though it does not capture sentiment from emoticons or acronyms/initialisms (Hutto and Gilbert, 2014). The Liu Hu sentiment lexicon is implemented in the NLTK library as the Liu Hu sentiment module (in nltk.sentiment.util; https://www.nltk.org/api/nltk.sentiment.util.html), whose function simply counts the number of positive, negative and neutral words in the sentence and classifies it according to which polarity is more strongly represented. Words that do not appear in the lexicon are considered neutral. The lists of positive and negative words in the lexicon are available at https://github.com/woodrad/Twitter-Sentiment-Mining/tree/master/Hu%20and%20Liu%20Sentiment%20Lexicon.

2.4. Parliamentary debates

Recently, parliamentary debates have raised the interest of researchers from various academic disciplines, especially as an object of linguistic research (Erjavec et al., 2022). Transcriptions are made by professional stenographers, familiar with the procedures as well as with the Members of Parliament (Truan and Romary, 2021). Parliamentary discourse is shaped by specific rules and conventions, which are in turn shaped by the socio-historical traditions that influence the organisation and operation of the Parliament. These conventions and traditions extend to language use, e.g., turn-taking or forms of address (Fišer and Pahor de Maiti, 2020). Another characteristic of the transcriptions is that the officially released records of parliamentary debates are not verbatim, and that minute-taking varies across countries and over time. The editing process can include the elimination of obvious language or factual errors, dialectal or colloquial expressions, and rude and obscene language. This, combined with the fact that editing guidelines are mostly not publicly available, can hinder research (Truan and Romary, 2021).

The main characteristics of parliamentary discourse in the UK Parliament stem from the composition and operation of the Parliament: the UK Parliament consists of two Houses, the House of Commons and the House of Lords, where the decisions made in one House have to be approved by the other (UK Parliament, 2022). House of Commons parliamentary debates consist of three substantial elements (Abercrombie and Batista-Navarro, 2018b): debates are initiated with a motion, a proposal made by an MP. When invited by the Speaker (the presiding officer of the chamber), other MPs may respond to the motion, one or more times. Lastly, the Speaker may call a division, in which MPs vote by physically moving to either the 'Aye' or 'No' lobby of the chamber. These divisions may be called at any time, but typically occur at the end of the debate.
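The two scoring schemes described in Sections 2.2 and 2.3 can be illustrated with a small, self-contained Python sketch. The miniature lexicons below contain only the example words quoted above; both functions are toy re-implementations written for this illustration, not the actual NLTK modules (the normalisation constant alpha = 15 is taken, as an assumption, from the vaderSentiment reference implementation).

```python
import math

# Toy valence lexicon; the scores are the examples quoted from
# Hutto and Gilbert (2014). The full VADER lexicon has ~7,500 entries.
VALENCE = {"okay": 0.9, "good": 1.9, "great": 3.1,
           "horrible": -2.5, ":(": -2.2, "sucks": -1.5}

def vader_like_compound(tokens, alpha=15):
    """Sum token valences, then squash into [-1, 1] with a
    VADER-style normalisation: s / sqrt(s^2 + alpha)."""
    s = sum(VALENCE.get(t.lower(), 0.0) for t in tokens)
    return s / math.sqrt(s * s + alpha)

# Toy positive/negative word sets in the spirit of the Liu Hu opinion lexicon.
POS = {"good", "great", "excellent"}
NEG = {"horrible", "bad"}

def liu_hu_like_score(tokens):
    """Difference between positive and negative counts, normalised by
    document length and multiplied by 100 (the Orange-style score);
    words absent from both lists count as neutral."""
    pos = sum(t.lower() in POS for t in tokens)
    neg = sum(t.lower() in NEG for t in tokens)
    return 100 * (pos - neg) / len(tokens)

tokens = "the debate was good , even great , not horrible".split()
print(round(vader_like_compound(tokens), 4))  # valences 1.9 + 3.1 - 2.5 = 2.5, squashed
print(round(liu_hu_like_score(tokens), 2))    # (2 - 1) / 10 tokens * 100
```

Note that neither toy function applies VADER's rule-based intensity heuristics (punctuation, capitalisation, boosters); this is only the lexicon-lookup core that both approaches share.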
An example from the corpus shows the structure of the units:

Motion: That there shall be an early parliamentary general election.

Speech: Does my right hon. Friend agree that the Prime Minister, in calling this election, has essentially said that she does not have confidence in her own Government to deliver a Brexit deal for Britain? One way in which she could secure my vote and the votes of my hon. Friends is to table a motion of no confidence in her Government, which I would happily vote for.

Vote: 'Aye' (positive).

3. Methodology

3.1. Dataset

HanDeSeT: Hansard Debates with Sentiment Tags is a corpus of English parliamentary debates from 1997 to 2017, with 1251 motion-speech units taken from 129 separate debates and manually annotated with sentiment scores. The corpus was compiled from the UK Hansard parliamentary corpora. The transcripts are largely verbatim records of the speeches made in both chambers of the UK Parliament, in which repetitions and disfluencies are omitted, while supplementary information such as speaker names (speaker metadata) is added (Abercrombie and Batista-Navarro, 2018b).

Each of the 1251 motion-speech units comprises a parliamentary speech of up to five utterances and the associated debate motion. As detailed in Abercrombie and Batista-Navarro (2018b), parliamentary debates incorporate "much set, formulaic discourse related to the operational procedures of the chamber", i.e. speech segments used to thank the Speaker or to describe the activities in the chamber.

Each speech-motion unit has several sentiment polarity labels:

• manual speech: manually assigned sentiment label of the speech (0 = negative, 1 = positive)
• manual motion: manually assigned sentiment label of the motion (0 = negative, 1 = positive)
• gov/opp motion: label encoding the relationship of the MP who proposes the motion to the Government (0 = not in Government, 1 = in Government)
• speech vote: a speaker-vote label extracted from the division associated with the corresponding debate, i.e. how the MP voted on the proposed motion (0 = negative, 1 = positive)

Since our research covers only the parliamentary speeches and their sentiment, we focus on the manual speech labels.

3.2. Data cleaning and pre-processing

As the extraction of a polarity (or sentiment) score can heavily depend on certain text characteristics, pre-processing can severely impact the performance of lexicon-based modules. As detailed in Hutto and Gilbert (2014), there are five generalisable sentiment intensity characteristics: punctuation (specifically, the exclamation mark "!"), capitalisation (e.g., using all caps), amplifying the intensity of the text with mood-booster words (e.g., using words like extremely or very), or a combination of these characteristics (e.g., "The food here is EXTREMELY GOOD!!!"). In view of this, we pre-processed the text using only tokenisation (keeping the punctuation) and lemmatisation (with the UDPipe lemmatiser).

3.3. Experiment settings

Most of the work was done in the Orange Data Mining Tool (https://orangedatamining.com/). Both the VADER and the Liu Hu sentiment module are already incorporated in Orange's Sentiment Analysis widget.

3.3.1. Sentiment analysis and performance comparison

Sentiment analysis was performed on the speeches with both the VADER and the Liu Hu sentiment module. VADER outputs several scores: pos, neg, neu and compound. The compound feature is the combined score of all the other features and our main indicator of the sentiment of a text. For Liu Hu, the score is the difference between the sum of positive and the sum of negative words, normalised by the length of the document and multiplied by 100; the final score reflects the percentage of sentiment difference in the document (Demšar et al., 2013). It is important to note that the lexicons were not modified in any way.

Next, we mapped the sentiment scores output by both modules to the labels positive and negative, in order to match the gold standard, where each speech is labelled with either 0 (negative) or 1 (positive) and no neutral label exists. The main problem of this mapping were therefore the speeches and motions with a score of exactly 0, regarded as neutral, which had to be mapped either to positive or to negative. Inspecting the distribution of the positive and negative classes in the dataset (presented in Table 1) shows that the manually assigned sentiment labels for the speeches are slightly skewed towards the positive class, with 705 positive speeches (56.4%) and 545 negative speeches (43.6%); we therefore decided to map these speeches to positive, in favour of the majority class. After obtaining the labels, the last step was to compare the results of the sentiment analysis to the gold standard (our test dataset) using the classification accuracy and F1 evaluation metrics. A majority-class baseline was added for comparison.

3.3.2. Descriptive analysis and topic modelling

As stated previously, our research aims not only to evaluate the performance of the two sentiment lexicons but also to investigate the sentiment of UK parliamentary debates. To this end, we also applied topic modelling to extract additional information on the topics of the analysed parliamentary speeches. Descriptive analysis of the results provided by the VADER and Liu Hu sentiment modules gives insight into the positive speeches and into the resemblances and possible differences between the results of the two lexicons.

The results of the sentiment analysis are presented as histograms of the sentiment scores of both lexicons (the compound score for VADER and the sentiment score for Liu Hu) to visualise the distributions of positively and negatively scored speeches. Building on this, we performed topic modelling on the subsets of positive and negative speeches to identify topics and to see whether they correspond to the general sentiment of the subset their keywords belong to. To facilitate topic modelling, the speeches first had to be pre-processed: transformed to lowercase, tokenised and lemmatised with the UDPipe lemmatiser. Lastly, stopwords were filtered out using the stopword list provided by NLTK, together with a manually compiled additional list of stopwords for the procedural words that are very common in (procedural) parliamentary speech (available at https://drive.google.com/file/d/16kH_dV8HlUhctwmmsLn4F9zOkmJyqgg5/view?usp=sharing).

For topic modelling we used the Latent Dirichlet Allocation (LDA) method to extract the keywords of the speeches and their topics. As LDA does not itself determine the optimal number of topics, the exact number has to be set by the model user (Gan and Qi, 2021). We therefore experimented with different numbers of topics in the range from 5 to 11, with the Topic Coherence metric serving as our guide. This range was chosen to allow high enough granularity of the keywords in the topics (i.e., no fewer than 5 topics) while keeping the keywords in the topics coherent. The topic coherence score represents the "degree of semantic similarity between high-scoring words in the topic to help distinguish between topics that are semantically interpretable and topics that are artifacts of statistical inference" (Stevens et al., 2012). Table 1 shows the fluctuation of the Topic Coherence score in the different settings for all chosen subsets (the positive and negative clusters produced by VADER and Liu Hu), with the numbers in bold representing the optimal number of topics for the subset.

Number of topics | VADER positive | VADER negative | Liu Hu positive | Liu Hu negative
 5               | 0.281          | 0.244          | 0.267           | 0.252
 6               | 0.272          | 0.256          | 0.275           | 0.244
 7               | 0.263          | 0.282          | 0.264           | 0.250
 8               | 0.268          | 0.276          | 0.275           | 0.260
 9               | 0.251          | 0.260          | 0.265           | 0.256
10               | 0.265          | 0.303          | 0.276           | 0.279
11               | 0.284          | 0.270          | 0.265           | 0.259

Table 1: Topic Coherence scores of the positive and negative subsets and their optimal number of topics.

The topics identified with the LDA method are visualised with MDS (multidimensional scaling), where the size of a topic indicates its Marginal Topic Probability (i.e., how representative a topic is of a corpus or a cluster). To name the topics as accurately as possible, we used several Orange widgets: the t-SNE widget for a 2-D projection of speeches with similar topics, the Extract Keywords widget to extract the 5 most common keywords in those speeches, and the Score Documents widget to identify the documents in which the keywords occur most often, inferring the topic name from the title and content of those documents.

4. Results

4.1. Sentiment analysis results

In this section we present the results of the sentiment analysis with VADER and Liu Hu. Figure 1 compares the distributions of positive and negative speeches identified by the VADER (Figure 1a) and Liu Hu (Figure 1b) sentiment lexicons.

Even at first glance, it can be seen that the VADER results lean heavily towards the positive class. The compound score ranges from -0.9987 (the score of the most negative speech) to 0.9992 (the most positive speech). Most speeches in the dataset (617 speeches, 49.32%) were classified by VADER as extremely positive, with compound scores between 0.8 and 1; on the other hand, only 124 speeches (9.91%) were deemed extremely negative, in the range from -0.8 to -1.

Figure 1b presents the results obtained with the Liu Hu sentiment lexicon. While VADER uses a scale from -1 to 1, Liu Hu keeps 0 as the neutral value and deems everything below 0 negative and everything above 0 positive. As can be seen from the figure, the distribution of sentiment in the speeches differs greatly from the VADER results: the most negative speech has a sentiment score of -6.976 and the most positive a score of 8.1967, with most speeches (353 speeches, 28.22%) positioned between 0 and 1. Of those, 216 speeches were scored exactly 0 (neutral speeches).

In total, more than 75% of the speeches were deemed positive by VADER (948 speeches, 75.78%); similarly, Liu Hu deemed almost 70% of the speeches positive (867 speeches, 69.30%). For topic modelling, each set was split into a positive and a negative subset:

• VADER subset of positive speeches: 948 speeches (75.78%)
• VADER subset of negative speeches: 303 speeches (24.22%)
• Liu Hu subset of positive speeches: 867 speeches (69.30%)
• Liu Hu subset of negative speeches: 384 speeches (30.70%)
The first sharing part focuses on comparison of the topics in both positive ŠTUDENTSKI PRISPEVKI 326 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 (a) VADER (compound score) (b) Liu Hu (sentiment score) Figure 1: Results of the sentiment analysis and distribution of positive and negative speeches. (a) VADER (b) Liu Hu Figure 2: Comparison of topics, identified in the positive speeches between VADER and Liu Hu. clusters, while the second one presents identified topics and throughout the corpora, e.g., member, house, bill, parlia- trends in the negative clusters. ment, etc. In Liu Hu produced results, the largest topic is As it can be seen from Figure 2a and 2b, the largest relatively similar to the House procedures, that being Elect- clusters of keywords detected among the positive speeches, oral Commission, where most keywords, emphasised above produced by VADER, belong to the topic House proced- are still present, with two explicit keywords that define the ures8, where the topic consists of very common words nature of the topic - election and change. Both topics are also linked together (MDS enables linking of semantically 8 similar topics together), which makes the closeness of the Full name of the documents, that contain most of the keywords in the topic corresponds best to The Business of the keywords in both topics even more clear. Topic Electoral House, thought the name of the topic was shortened for easier Commission appears in both positive clusters. In addition to visualization. 
the aforementioned Electoral Commission, topics like EU ŠTUDENTSKI PRISPEVKI 327 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 membership, School funding and NHS funding also appear a 10-fold validation) were added to the Table 2 for compar- in both positive speeches. ison. The keywords and topics, identified in the negative speeches are shown in Figure 3a and 3b. Acc(%) F1 score With the Marginal Topic Probability score of 0.175, VADER 52.0 0.49 the most common keywords in the VADER negative sub- Liu Hu 50.0 0.47 set are found in topic State pension age, followed closely Baseline 56.5 0.56 by Armed forces (score of 0.172), Prisons and probation SVM (text only) 66.7 0.718 (0.150) and Police Officer Safety. MDS also showed that MLP (text only) 67.3 0.713 several topics are also very closely related to one another, e.g., Topic Armed forces is closely related to both House Table 2: Performance results with VADER and Liu Hu, ac- procedures and Terrorism bill topics. Similarly, although companied with the baseline and results for SVM and MLP not surprising, a strong connection is also found between from the related study. keywords in State pension (Women) and State pension age (Women). Lastly, strong similarity is shown between keywords in Police Officer Safety and Prisons and Proba- The performance of the VADER and Liu Hu sentiment tion. In the Liu Hu negative speeches, the most repres- lexicons is poor, not even surpassing the baseline score. ented topic is State pension (Women) with the Marginal However, if we want to put the results in a perspective, we Topic Score of 0.163, followed closely by EU Member- need to consider the nature of parliamentary debates and ship with the score of 0.159 and Homelessness with 0.114. parliamentary language. 
The language of parliamentary de- All three topics (or, rather, their keywords) are also con- bates is, as we stated previously, complex - the speeches nected amongst themselves. For both VADER and Liu Hu especially are longer and full of visible political procedure negatively scored speeches, the keywords most present in characteristics (such as courtesy naming, e.g., hon. Friend, them are found in topic on state pension and state pen- hon. Lady ...). sion age (very connected topics that share many common Very poor performance scores show that sentiment lex- keywords). In addition to that, several other topics can be icons (in their current, unmodified state) are not the best found in both subsets, e.g. Armed forces, Police Grant and methodology when it comes to extracting sentiment polar- House procedures. ity in parliamentary debates. In comparison, study, detailed In general, the keywords of the topics identified mostly in (Abercrombie and Batista-Navarro, 2018a) achieved corresponded to the general sentiment of the topics in their much greater results even by using just the text features (as respective subsets. Even though, in several cases, keywords shown in Table 2). (and topics) appeared both in the positive as well as in To research the reason for such poor performance, we the negative speeches. This is most likely due to the fact analysed several speeches in detail. Below is an example that parliamentary debates usually feature heavy position- and one of the possible explanations for misclassifications: taking in regard to a certain motion. "Our national health service is, and always has been, The topics in the negative speeches were harder to valued and cherished by my constituents who rightly ex- identify in comparison to the positive speeches - this is pect an excellent standard of care to be provided free at mostly due to the larger subset, as well as the fact that the the point of use when they need treatment. 
We are all deeply keywords were very fragmented. This can be seen in the committed to the future of the NHS, but to ensure that it can positive clusters, where the Marginal Topic Score of most continue to provide the quality of care that our constituents topics (aside from the two or three very well represented expect, it cannot stand still. [...] What is certain is that ones) are not high and are in lowest score range. While the current model through which health services in Calder- in general the topics were harder to identify, most topics dale and Huddersfield are delivered is not sustainable in that were strongly present in the speeches had very obvi- the long term, and that changes are needed to ensure that ous keywords. On the other hand, topics in the positive we have a local health service that continues to provide ex- speeches were easier to identify, although, there were some cellent care." exceptions, as some of the keywords (even though many The speech itself contains words that could influence stopwords were removed) were too general to pinpoint with the scoring in a positive way - VADER scored this speech human perception alone. with 0.9992 (making it one of the most positive speeches identified by VADER), while Liu Hu scored it with 1.578 . 4.3. VADER and Liu Hu performance evaluation Words in bold are all included in the VADER lexicon with To evaluate the performance of the sentiment modules high positive scores; e.g., committed has a score 1.1, valued we used the following evaluation metrics: classification ac- of 1.9, cherished of 2.3 and excellent of 2.7. Therefore, the curacy and F1 score. 
Similarly, a related research (Aber- speech could have been perceived as positive, even though crombie and Batista-Navarro, 2018b) used the dataset to the entire speech is in reality negative, as it emphasises that develop a 2-step model for sentiment analysis task - they the current model of health services is not long-term sus- trained SVM and MLP to produce a one-step Speech model tainable. Similarly, Liu Hu includes words cherished, qual- and a two-step Motion-Speech model, using different fea- ity, free and excellent in the list of positive words, but it tures (text only, text and metadata). The results for the one- does not include words like valued or committed (and thus step Speech model with text-only features (evaluated with making them neutral). The sentiment of this text is, accord- ŠTUDENTSKI PRISPEVKI 328 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 (b) Liu Hu (a) VADER Figure 3: Comparison of topics in negative speeches between VADER and Liu Hu. ing to Liu Hu, still positive - less than with VADER, but the it can be seen from the poor performance evaluation results, process and reason for misclassification is mostly the same. sentiment-based approaches like Liu Hu and VADER alone do not suffice when dealing with such a specific text data, 5. Conclusions at least not in their unmodified state. Better results could In this paper we used sentiment based approaches have possibly been acquired by modifying the lexicons to (VADER and Liu Hu) on the base of parliamentary data incorporate some of the characteristics of parliamentary de- with the aim to explore how these two modules handle bates (e.g., adding new words and changing the scoring of sentiment detection on longer, less expressive and more existing ones). formal language to that of the (usually) used social me- dia language (for which both sentiment modules are op- 6. 
…timized for). While both VADER and Liu Hu were able to correctly identify the general sentiment of some topics present in the negative and positive clusters (e.g., matching keywords in the Euthanasia topic to the negative cluster), the speeches themselves are very polarizing in nature. This can most clearly be seen in the fact that some topics were identified in both positive and negative clusters; topics like School funding and NHS funding, for example, were identified in both positive and negative speeches, as both can be viewed from different (positive or negative) standpoints.

The most probable reason for misclassifications is the length of the speeches, as well as the fact that the speeches are not extremely expressive, or contain a larger number of positive boosting words used to express negativity. The language of parliamentary discourse can be extremely complex, mostly due to the esoteric speaking style and opaque procedural language of Parliament (Abercrombie and Batista-Navarro, 2018b). Distinguishing between the positive and negative polarity of parliamentary debates can be a difficult task even for human annotators, as shown by the poor inter-annotator agreement score in the first round of annotation of the HanDeSet dataset, detailed in Abercrombie and Batista-Navarro (2018a). The same can be said for lexicon-based approaches to sentiment analysis: despite the poor performance scores, the lexicons still gave us some insight into the general sentiment around topics and into the characteristics of parliamentary speech.

Acknowledgments

The paper was written in the framework of the research programme P2-0103 (B): Tehnologije znanja (Knowledge Technologies), co-financed by the Slovenian Research Agency (ARRS) from the state budget, and of the Slovenian research infrastructure CLARIN.SI (Common Language Resources and Technology Infrastructure, Slovenia).

7. References

Gavin Abercrombie and Riza Batista-Navarro. 2018a. 'Aye' or 'No'? Speech-level Sentiment Analysis of Hansard UK Parliamentary Debate Transcripts. In N. Calzolari et al., editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Gavin Abercrombie and Riza Theresa Batista-Navarro. 2018b. A Sentiment-labelled Corpus of Hansard Parliamentary Debate Speeches. In D. Fišer, M. Eskevich, and F. de Jong, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018 - ParlaMint II Workshop), Miyazaki, Japan. European Language Resources Association (ELRA).

Rosario Catelli, Serena Pelosi, and Massimo Esposito. 2022. Lexicon-based vs. BERT-based sentiment analysis: A comparative study in Italian. Electronics, 11(3):374.

Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, and Blaž Zupan. 2013. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research, 14:2349–2353.

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D. de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx, and Darja Fišer. 2022. The ParlaMint corpora of parliamentary proceedings. Language Resources and Evaluation, pages 1–34.

Darja Fišer and Kristina Pahor de Maiti. 2020. Voices of the Parliament. Modern Languages Open.

Jingxian Gan and Yong Qi. 2021. Selection of the Optimal Number of Topics for LDA Topic Model—Taking Patent Policy Analysis as an Example. Entropy, 23(10):1301.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Clayton Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, pages 216–225.

UK Parliament. 2022. The two-House system.

Sakala Venkata Krishna Rohit and Navjyoti Singh. 2018. Analysis of speeches in Indian parliamentary debates. arXiv:1808.06834.

Keith Stevens, Philip Kegelmeyer, David Andrzejewski, and David Buttler. 2012. Exploring topic coherence over many models and many topics. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 952–961.

Naomi Truan and Laurent Romary. 2021. Building, Encoding, and Annotating a Corpus of Parliamentary Debates in XML-TEI: A Cross-Linguistic Account. Journal of the Text Encoding Initiative.

ŠTUDENTSKI PRISPEVKI / STUDENT PAPERS, Konferenca Jezikovne tehnologije in digitalna humanistika / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022
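The lexicon-based scoring evaluated in the paper above boils down to counting opinion words, as in the Liu Hu lexicon, optionally with simple negation handling and length normalisation in the spirit of VADER's compound score. A minimal sketch follows; the tiny word lists are illustrative stand-ins, not the actual Liu Hu or VADER lexicons:

```python
# Minimal lexicon-based polarity scorer; the word sets below are invented
# stand-ins for a real opinion lexicon such as the Liu Hu word lists.
POSITIVE = {"welcome", "support", "good", "benefit", "improve"}
NEGATIVE = {"inadequate", "oppose", "bad", "cut", "fail"}
NEGATORS = {"not", "no", "never"}  # naive one-token negation handling

def polarity(tokens):
    score = 0
    for i, tok in enumerate(tokens):
        hit = 1 if tok in POSITIVE else (-1 if tok in NEGATIVE else 0)
        # flip polarity when the previous token is a negator ("not good")
        if hit and i > 0 and tokens[i - 1] in NEGATORS:
            hit = -hit
        score += hit
    # normalise by utterance length so that long speeches do not dominate
    return score / max(len(tokens), 1)

speech = "i welcome this bill but the funding is inadequate".split()
print(polarity(speech))  # one positive and one negative hit cancel: 0.0
```

The zero score for a speech that is both welcoming and critical illustrates exactly the failure mode described above: polarizing parliamentary speeches with mixed positive and negative cues tend to wash out under pure lexicon counting.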
Evaluative Categorisation of Automatically Extracted Pairs of Antonyms
(Evalvacijska kategorizacija strojno izluščenih protipomenskih parov)

Tina Mozetič,* Miha Sever,* Martin Justin,* Jasmina Pegan‡

* Faculty of Arts, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana
tina.mozetic11@gmail.com, mihasever98@gmail.com, martin1123581321@gmail.com
‡ Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana
jp2634@student.uni-lj.si

Abstract

This paper aims to assess the relevance of automatically extracted antonym pairs that are to be included in the expanded Thesaurus of Modern Slovene. The former structuralist conception of antonymy is shifting to a more modern one that is based on advanced computational methods, openness, crowdsourcing, relevance, and data usability. In this study, we reviewed 2852 automatically extracted pairs of antonyms. Examples that were not unanimously classified as antonyms or non-antonyms by the evaluators are grouped into 21 categories. For each category, it is determined whether its pairs should be included in the responsive dictionary. The process proved to be successful, as 88% of the extracted pairs could be included in the dictionary. The categories will also be useful in the future for the creation of guidelines and the development of further methodologies for the automatic extraction of antonyms.

1. Introduction

With 105,473 headwords and 368,117 synonyms, the Thesaurus of Modern Slovene (Slovar sopomenk sodobne slovenščine) is "the most extensive freely accessible automatically generated collection of synonyms for Slovene" (Sopomenke 1.0, 2022). The thesaurus operates on the principle of a responsive dictionary, which is in the first step prepared entirely automatically. The automatically prepared data are published as soon as linguistic evaluation confirms their general adequacy and relevance for the community; the dictionary then develops further in steps, in collaboration between linguists and the wider interested public (Arhar Holdt et al., 2018). In the project Upgrading Fundamental Dictionary Resources and Databases of CJVT UL (Nadgradnja temeljnih slovarskih virov in podatkovnih baz CJVT UL), antonyms will be added to the synonyms, and it is for these antonyms that such a linguistic evaluation of relevance must be carried out.

The goal of this paper is thus to assess the relevance of automatically extracted antonym pairs for inclusion in the expanded Thesaurus of Modern Slovene. We are primarily interested in which part of the data is (1) suitable for direct inclusion in the dictionary, (2) not suitable for inclusion, and (3) in need of additional consideration. In this paper we deal in more detail with the third point, showing that the "problematic" examples can be categorised by the type of problem, which makes it possible to determine whether they can (a) be improved automatically, (b) require an editorial decision, (c) be improved through sense division of the entry or with qualifiers, (d) be improved with the help of the community, or (e) be left in the dictionary material despite a given problem, counting on users to judge their usefulness themselves.

The problem categories formed in this way will serve as a starting point for further work on the project, which includes upgrading the extraction methodology, preparing guidelines for the editorial treatment of antonyms, and including antonyms in the Thesaurus of Modern Slovene. The manually reviewed antonym pairs will be used as a training set for further extraction of antonyms from the Gigafida 2.0 corpus (Krek et al., 2020). Our analysis will also be very useful in drawing up the guidelines, since we have identified problems for which principled editorial solutions will have to be provided.

In the second section of the paper we first present linguistic research on antonymy and the concept of the responsive dictionary. In the third we briefly describe the methods of data acquisition and annotation. In the fourth section we present and analyse the annotation results: we first present the evaluators' decisions on the adequacy of the antonym pairs, and then describe in more detail each of the problem categories into which the "problematic" examples were placed during annotation. For each category we also report its frequency and assess how the identified problem could be addressed. In the concluding section we summarise the main findings of the paper.

2. Background

Linguistics considers antonymy, alongside synonymy, a fundamental inter-lexeme semantic relation (Stramljič Breznik, 2010; Humar, 2016; Vidovič Muha, 2005, 2021). Unlike synonyms, antonyms necessarily occur binarily, i.e. in pairs, and are always part of a shared conceptual or even semantic field (Vidovič Muha, 2021).
In Slovene terminology, the terms protipomenka and antonim (and likewise protipomenskost and antonimija) have become established as equivalents, although the Slovenski pravopis 2001 gives preference to protipomenka (Humar, 2005). Defining an antonym is relatively straightforward: according to the SSKJ (2014), an antonym is "a word with the opposite meaning in relation to another word", and Toporišič (2001) defines it in the same way. Marjeta Humar (2016) broadens the definition to "designations of concepts with a mono- or polysemous word or phrase, [in which] the semantic components of the concepts (usually one in each of the two) are in an antonymous relation, expressed by two monosemous words, by two monosemous phrases, or by individual senses of two polysemous words or phrases" (p. 22).

In contrast to definition, the semantic typological classification of antonyms presents a major obstacle: there are as many such classifications as there are scholars who have dealt with them. Linguists themselves are aware of the problem (see Humar, 2016); their main task would be to determine the boundaries of antonymy (Gao and Zheng, 2014), which differ considerably from one scholar to another.

Marjeta Humar (2016) counts Lyons, Apresjan and Novikov among the pioneering and most important linguistic researchers of antonymy. Lyons identified three types of antonyms, each derived from one of the following properties: complementarity, antonymy, and conversion. He further distinguishes antonymy in the narrow and in the broad sense; the narrow sense includes only polar antonymy, which is for him the purest form of antonymy. Apresjan analysed antonyms much more thoroughly, and also drew attention to quasi-antonyms, which do not have exactly opposite meanings. Novikov, on the other hand, divided antonymy into contrary opposition (the most frequent form), complementary opposition, and vector opposition. Among quasi-antonyms he included semantically unequal, disproportionate, asymmetrical, stylistically heterogeneous, and temporally different antonyms that express other oppositions.

In the Slovene context, the classification of A. Vidovič Muha (2005, 2021) has become the most established; she defines antonymy as semantic opposition or complementary contradiction, and takes as the starting point of her typology the influence of antonyms on actant roles within the clause. Within this framework she divides antonyms into:
- converse (zamenjavne oz. konverzivne),
- complementary (dopolnjevalne oz. komplementarne),
- polar (skrajnostne oz. polarne), with the subgroup of gradable (stopnjevalne oz. gradualne), and
- vector (usmerjene oz. vektorske) antonyms.

Roughly speaking, categorisations are thus based either on equal-ranked groups of antonyms or on an axis from more to less antonymous (narrow vs. broad sense, true antonyms vs. quasi-antonyms, complete vs. incomplete, non-sharp vs. sharp opposition, binary vs. non-binary opposition, expression of opposition vs. stylistic device) (Humar, 2016).

The structural division of antonyms is clearer. In Slovenia it was studied most extensively, from the word-formation perspective, by Irena Stramljič Breznik (2010), who divides antonyms into same-root (also grammatical or derivational) and different-root (also lexical) antonyms.

The Slovene, and more broadly the former Yugoslav, linguistic space long paid little attention to antonymy (Humar, 2016), which is also reflected in the main Slovene language reference works. The SSKJ marked 87 lexemes with the qualifier ant. (antonym), all belonging to qualitative (polar, extreme) antonyms, while it records no vector or complementary ones (Humar, 2016). Toporišič (1976) mentions antonymy only in passing in his grammar, in connection with antonymous adjectives, and presents it briefly later, in the fourth, revised edition of 2001. Although antonyms are lexicographically recognised as an important factor in determining the correct senses of words (Toporišič, 2001), there is still no dictionary of antonyms for Slovene. There are, however, two dictionaries of synonyms: the Sinonimni slovar slovenskega jezika (SSSJ), published by ZRC SAZU, and the online Thesaurus of Modern Slovene (Slovar sopomenk sodobne slovenščine, SSSS), created under the auspices of the Centre for Language Resources and Technologies (CJVT).

Past lexicographic description of Slovene leaned on the structuralist tradition of the SSKJ, which was also followed by the most prominent Slovene researchers of antonymy to date (Jože Toporišič, Ada Vidovič Muha, Irena Stramljič Breznik, Marjeta Humar).

Social changes resulting from digitalisation and the development of information and communication technology have created the need for a completely different lexicographic description of Slovene, on which new language resources and technologies could be built. With the advent of the internet, lexicography faces ever faster language change: on the one hand, the question of how to present dictionary content to language users under these changed conditions, and on the other, new language practices that it finds increasingly difficult to capture and describe in real time (Gantar et al., 2016). Modern language users increasingly demand immediate access to dictionary descriptions of contemporary language, so lexicographic analyses must be carried out ever faster, but with undiminished quality (Gantar et al., 2016). We are moving from the traditional lexicographic model to a modern one, in which dictionary content is based on advanced computational methods, openness, crowdsourcing, and the relevance and usability of data.

Thus, on the one hand, the fully manual approach to data extraction has been replaced by a semi-automatic one, which is not only less time-consuming and costly, but also provides additional, potentially useful data for deciding on the inclusion of lexemes in a dictionary. The lexicographer's role does not change, since the lexicographer remains the decision-maker at all levels of decisions about dictionary inclusion; what changes is the way lexeme data are obtained and presented (Gantar et al., 2016). A similar extraction principle was used in the preparation of the SSSS. Lexeme relations are usually extracted from a base of several sources; the SSSS is thus based on data extracted from the Gigafida corpus and the Oxford-DZS Comprehensive English-Slovenian Dictionary (Arhar Holdt et al., 2018). Abroad, the preparation of corpus-based antonym dictionaries has already moved to automatic extraction (Wang et al., 2010; Lobanova et al., 2010; Aldhubayi and Alyahya, 2014).

On the other hand, the SSSS also operates on the concept of a responsive dictionary: an openly accessible collection of relevant, but not yet fully cleaned data. The language community participates in building the cleaned database, so the compilation of the dictionary is never finished; it is co-created in step with changing linguistic reality. Besides co-creating, language users also evaluate potential headwords with their feedback (Arhar Holdt et al., 2018). Users see the advantages of this concept in its transparency, accessibility, rapid adaptation to the current state of the language, co-creativity, ease of use, and the way headwords are ranked (Kojc et al., 2018; Kamenšek Krajnc et al., 2018). A modern dictionary of antonyms should follow the same approach.

3. Methodology

3.1. Data Acquisition

We compiled the antonym dataset from several sources. The procedure is described in more detail in the BSc thesis (Pegan, 2019), with the exception of the final step, the deletion of repeated records, which was added later. The bulk of the antonym data was obtained from the sloWNet database (Fišer, 2015), and a smaller portion (87 pairs) through lookups in the SSKJ dictionary, accessible on the Fran dictionary portal.

The sloWNet database is in XML format. A synset record (shown here simplified, with the XML markup omitted) contains the synset identifier, its literals, and links to related synsets:

    eng-30-00001740-a
      ...
      able
      ...
      sposoben, zmožen
      ...
      near_antonym: eng-30-00002098-a

For each synset we looked up its antonymous synset via the 'near_antonym' element. We used all combinations in which one word comes from the source synset and the other from the antonymous synset; in this way we obtained 4,514 antonym pairs.

From the SSKJ we collected all entries that also list antonyms. A simplified record (again with markup omitted) looks as follows:

    abstrákten
      ...
      ant.
      konkreten: ...

In total we extracted 87 antonym pairs from the SSKJ. Because of this small number, we expanded the antonym data by adding pairs of words with the prefixes ne-, proti- and brez-; examples of pairs obtained in this way are dostopen – nedostopen, ustaven – protiustaven and alkoholen – brezalkoholen. We partly cleaned these data manually of nonsensical combinations such as no – brezno, and removed words for which no word embeddings were available within the BSc thesis. This gave us 1,340 antonym pairs. In addition, we also included antonym pairs in which one of the two words is replaced by its synonym, which enlarged the set to 4,113 antonym pairs. After deleting repeated records in which the two words are merely swapped, we obtained a set of 2,852 antonym pairs.

3.2. Data Annotation

The study included 2,852 antonym pairs. Each of the six reviewers examined all the examples in an individual Google Sheet, assigning each pair one of the labels d, g and n: d marks the pair as antonymous, n indicates that the two words are not antonyms, and g means that the pair is problematic and needs closer examination. The annotators received no detailed instructions beforehand about what counts as antonymous and what does not, since the purpose of the first step was precisely to identify, on the basis of the material, the problematic areas that could then be analysed in more detail. While reviewing, we recorded examples and incrementally formed 19 problem categories. We then assigned each problematic pair one main and, where applicable, one additional problem, dividing the data among ourselves for this analysis. During the review we added two further categories, (Im)perfective deverbal derivatives ((Ne)dovršne glagolske tvorjenke) and Action and state (Dejanje in stanje), since these proved problematic only after a more detailed analysis of all the examples.

4. Results and Analysis

After the first round of review, 1,124 (39.4%) pairs were unanimously confirmed as antonymous and only 22 (0.8%) examples as non-antonymous. For the remaining pairs (1,706; 59.8%), at least one reviewer decided differently from the others, so these examples were marked for further analysis. In the second round of review it turned out that some examples were problematic only in a very specific respect, or had been falsely marked as problematic. The decision also had to be changed for some already confirmed pairs, which proved problematic on closer inspection. The category of confirmed antonyms thus grew to 1,207 (42.3%) examples, while 48 (1.7%) pairs were confirmed as non-antonymous. We sent 1,597 (56%) examples into further analysis, as shown in Table 1.

    Label           | Share
    Antonyms        | 42.3 %
    Not antonyms    |  1.7 %
    Further review  | 56.0 %

Table 1: Results after the second round of annotation.

Further research will focus only on the examples (1,597; 56%) that proved problematic after the second round of review. We divided them into 21 categories, shown in Table 2, where for easier orientation each category is illustrated with an example of a word pair that we judged. Table 2 also shows how many times each category appeared as the main and as the additional problem. A main problem was assigned to all 1,597 examples, while an additional problem was identified for 668 (41.83%) of them, representing 23.46% of the entire material.

Table 2 shows that the most frequent main problem is Rarity and contextual dependence of senses (Redkost in kontekstualna vezanost pomenov; 31.87%). The categories Negation with the prefixes ne- and brez- (10.58%), Inconsistency between borrowed and nativised terms (10.33%) and Markedness and/or rarity of a word (9.83%) are also frequent. The categories that appeared least often as problematic are Typos (0.31%), Other (0.38%) and Semantically weak verbs (0.44%).
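The candidate-generation and deduplication steps of Section 3.1 can be sketched as follows. The toy synset entries and helper names are illustrative assumptions, not the project's actual code; a real run would read the sloWNet XML and its 'near_antonym' links:

```python
from itertools import product

# Toy stand-ins for sloWNet synsets: id -> (Slovene literals, near_antonym ids).
SYNSETS = {
    "eng-30-00001740-a": ({"sposoben", "zmožen"}, ["eng-30-00002098-a"]),
    "eng-30-00002098-a": ({"nesposoben", "nezmožen"}, []),  # invented literals
}

def candidate_pairs(synsets):
    """All cross-synset word combinations linked by a near_antonym relation."""
    pairs = []
    for literals, antonym_ids in synsets.values():
        for ant_id in antonym_ids:
            ant_literals = synsets[ant_id][0]
            pairs.extend(product(sorted(literals), sorted(ant_literals)))
    return pairs

def dedupe_swapped(pairs):
    """Drop repeats that differ only in word order, as in the final step."""
    seen, unique = set(), []
    for a, b in pairs:
        key = frozenset((a, b))
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique

pairs = dedupe_swapped(candidate_pairs(SYNSETS))
# 2 literals x 2 literals from the near_antonym link: len(pairs) == 4
```

Using a `frozenset` as the deduplication key is what makes (a, b) and (b, a) collapse into one record, mirroring the reduction from 4,113 to 2,852 pairs described above.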
Category | Example | Main problem: n (%) | Additional problem: n (%)
Typos (Zatipki) | čistost – nečistot | 5 (0.31) | /
Wrong lemmas (Napačne leme) | alkoholne – brezalkoholne | 40 (2.50) | 3 (0.45)
Different part of speech (Različna besedna vrsta) | dopoldne – popoldanski | 16 (1.00) | /
(Im)perfectivity ((Ne)dovršnost) | narasti – zniževati | 87 (5.45) | 2 (0.30)
(In)definiteness ((Ne)določnost) | bližnji – daljen | 11 (0.69) | /
Non-existent derivational variants (Neobstoječe besedotvorne različice) | pritrjevanje – zanikanost | 54 (3.38) | 7 (1.05)
Negation with the prefix ne-, brez- (Zanikanost s predpono ne-, brez-) | občutljivost – nedražljivost | 169 (10.58) | 201 (30.09)
Borrowed vs. nativised inconsistency (Nedoslednost na ravni prevzeto – podomačeno) | aktiv – trpnik | 165 (10.33) | 36 (5.39)
(Im)perfective deverbal derivatives ((Ne)dovršne izglagolske tvorjenke) | zmanjšanje – povečanje | 32 (2.00) | 4 (0.60)
Action and state (Dejanje in stanje) | brezposelnost – zaposlitev | 18 (1.13) | 2 (0.30)
Reflexivity (Povratnost) | ubogati – upirati (se) | 53 (3.32) | 17 (2.54)
Semantically weak verbs (Pomensko šibki glagoli) | manjkati – biti (prisoten) | 7 (0.44) | 2 (0.30)
Semantically full words (Pomensko polne besede) | pridobiti – odreči (soglasje) | 15 (0.94) | 2 (0.30)
Gender as "antonym" (Spol kot »protipomenka«) | kralj – kraljica; dolžnica – upnik | 60 (3.76) | 3 (0.45)
Markedness and/or rarity of a word (Zaznamovanost in/ali redkost besede) | ata – mati; nenavadno – često | 157 (9.83) | 79 (11.83)
Homographs and polysemes (Enakopisnice in večpomenke) | bistrost – motnost | 76 (4.76) | 20 (2.99)
Rarity and contextual dependence of senses (Redkost in kontekstualna vezanost pomenov) | bogat – neploden | 509 (31.87) | 246 (36.83)
Properties that are not antonymous but are often used as such (Lastnosti, ki si niso protipomenske, a se pogosto tako uporabljajo) | krivulja – premica | 38 (2.38) | 11 (1.65)
Indirect synonyms (Posredne sopomenke) | glasen – nem | 40 (2.50) | 5 (0.75)
Gradable examples (Stopenjski primeri) | prihodnji – sedanji | 39 (2.44) | 17 (2.54)
Other (Drugo) | ofenziven – nespotakljiv | 6 (0.38) | 11 (1.65)

Table 2: The 21 categories and their occurrences as the main and the additional problem.

4.1. Typos (Zatipki)

The Typos category comprises pairs in which at least one of the words is unambiguously mistyped, i.e. the form cannot be the same or any other lexeme in any inflected form.
As Table 2 shows, this category appeared only five times (0.31%) as the main problem and never as an additional one. It is nevertheless one of the most problematic categories, since misspelled words cannot be included in the dictionary.
Examples: čistost – nečistot, izginti – pojaviti, izvažati – uvžati.

4.2. Wrong Lemmas (Napačne leme)

Under Wrong lemmas fall examples that may be morphologically matching but are in a non-dictionary form. Table 2 shows that this category appeared in 40 examples (2.50%) as the main and three times (0.45%) as the additional problem. Such examples must be removed from the list of pairs for inclusion in the dictionary, or converted into the proper dictionary form.
Examples: alkoholne – brezalkoholne, dolžna – nedolžna, finančne – nefinančne.

4.3. Different Part of Speech (Različna besedna vrsta)

In this category, the two members of a pair belong to different parts of speech (e.g. noun and adjective, adjective and adverb). It appeared as the main problem for 16 pairs (1.00%), and never as a secondary one. In most examples the two words are not antonyms; a dilemma arises only with noun–adjective pairs, which mostly involve nominalised adjectives (of the type delavnik – fraj). In such cases the words can be used antonymously, given an appropriate context. Pairs from this category are removed from the list for inclusion in the dictionary; the exception are noun–adjective pairs, which are reviewed manually and included with the necessary labels.
Examples: dopoldne – popoldanski, znotraj – ven, delavnik – fraj.

4.4. (Im)perfectivity ((Ne)dovršnost)

Here we are dealing with verb pairs of different verbal aspect: one verb is in the imperfective and the other in the perfective form. Such pairs were recognised as the primary problem in 87 (5.45%) examples and twice (0.30%) as the secondary one. Clearly, the best antonym for a verb is a verb with the same aspect, but the dilemma remains for verbs that are semantically suitable yet differ in aspect. Such pairs should, at least at first sight, be removed.
Examples: napasti – braniti, narasti – zniževati, natovoriti – iztovarjati.

4.5. (In)definiteness ((Ne)določnost)

This category contains adjective pairs in which one adjective is in the definite and the other in the indefinite form. It appeared as the main problem in 11 (0.69%) examples, and never as an additional one. Since the problem is largely tied to the characteristics of lemmatisation for Slovene, which lemmatises adjectives into the indefinite form except where this is not possible (and the pairs are in principle semantically antonymous), it would make sense to keep such material in the dictionary.
Examples: bližnji – daljen, mesten – podeželski, oddaljen – bližnji.

4.6. Non-existent Derivational Variants (Neobstoječe besedotvorne različice)

These are examples that are semantically suitable, but the problem arises because one (or both) of the words does not exist. This category appeared as the primary problem in 54 (3.38%) examples and as the secondary one in seven (1.05%). In our judgement, this category does not belong in the dictionary, as it involves words that are not in real use. Already during the extraction of antonym candidates, a step could be added that checks each word against a reference corpus and attaches a warning to those examples that do not occur.
Examples: pritrjevanje – zanikanost, eleganca – neelegantnost, nelaskav – podrepniški.

4.7. Negation with the Prefix ne-, brez- (Zanikanost s predpono ne-, brez-)

In this category we are dealing with examples in which at least one of the antonyms is formed as the negation of some expression. These are pairs in which both words are negations of two antonyms, or examples in which a word and the negation of its synonym appear as an antonym pair. As Table 2 shows, this category was recognised as the main problem in 169 (10.58%) examples and as the additional one in 201 (30.09%). The use of such pairs in a text might be stylistically problematic, but they are certainly antonymous in certain contexts. We would therefore include the pairs in the dictionary and leave the decision to the user, who knows best the context in which the word occurs.
Examples: nespremenljiv – nestalen, neugoden – škodljiv, koristen – neugoden.

4.8. Inconsistency between Borrowed and Nativised Terms (Nedoslednost na ravni prevzeto – podomačeno)

Here we deal with examples that are antonymous, but one of the words is borrowed and therefore often (differently) marked. It is also interesting to look for the boundary between a "borrowed" expression (ujemanje – inkongruenca) and one that is already established in the language (inteligenten – neumen). Differences can also appear at the level of the spelling of the borrowed word, not only in its meaning (e.g. software and softver). Table 2 shows that the annotators recognised at least one word as borrowed in 165 (10.33%) examples where this was the main problem, and in 36 (5.39%) examples where it was the additional one. Since these are merely borrowed words that have not (yet) become established in the language, it would be good to include them in the responsive dictionary, as the user will be able to make good use of them in suitable contexts.
Examples: aktiv – trpnik, politeizem – enoboštvo, skupen – individualen.

4.9. (Im)perfective Deverbal Derivatives ((Ne)dovršne glagolske tvorjenke)

This category comprises derivatives whose word-formation base shows differences in perfectivity: one word is derived from a perfective and the other from an imperfective verb. The analysis showed 32 examples (2.00%) in which this category was recognised as the primary problem and four (0.60%) in which it was recognised as the secondary one. These examples are similar to those in 4.4, so it would make sense to treat them in the same way, i.e. not to include them in the dictionary.
Examples: zmanjševanje – povečanje, izkrcanje – vkrcavanje, manjšanje – povečevanje.
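The unanimity rule behind the annotation rounds in Sections 3.2 and 4 (a pair counts as confirmed only when all six annotators agree) can be sketched as a small triage function; the label lists below are invented for illustration:

```python
def triage(labels):
    """Route a pair based on six annotators' d/g/n labels, as in Section 4:
    unanimous d confirms antonyms, unanimous n confirms non-antonyms, and
    any g label or disagreement sends the pair to second-round analysis."""
    if all(label == "d" for label in labels):
        return "antonyms"
    if all(label == "n" for label in labels):
        return "non-antonyms"
    return "further-analysis"

# Invented annotations for two pairs, purely for illustration.
annotations = {
    ("dober", "slab"): ["d"] * 6,
    ("bogat", "neploden"): ["d", "g", "d", "n", "g", "d"],
}
results = {pair: triage(labels) for pair, labels in annotations.items()}
# ("dober", "slab") -> "antonyms"; ("bogat", "neploden") -> "further-analysis"
```

Under this rule a single dissenting or hesitant (g) label is enough to defer a pair, which is why only 39.4% of the pairs were confirmed in the first round despite 88% ultimately being judged usable.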
4.10. Action and State (Dejanje in stanje)

This category comprises noun pairs that are antonymous, but one word denotes an action or event and the other a state or property. The problem is similar to that of the (Im)perfective deverbal derivatives, except that these are nouns that are not derived from verbs. As Table 2 shows, the Action and state category appeared as the main problem in 18 (1.13%) examples and as the additional one in 2 (0.30%). Since such pairs involve only a small nuance of meaning and can be antonymous in certain contexts, it is best to include them in the dictionary and allow users to judge their usefulness themselves.
Examples: zaposlitev – brezposelnost, degeneracija – razvoj, nedolžnost – zagrešitev.

4.11. Reflexivity (Povratnost)

In the Reflexivity category we placed verb pairs that are antonymous, but at least one of the two verbs (or both) lacks the reflexive pronoun. Without the reflexive pronoun such verbs make no sense, or have a different meaning (one that is not antonymous to the proposed antonym). Table 2 shows that the reflexive pronoun was missing as the main problem in 53 (3.32%) pairs and as the additional one in 17 (2.54%). Since for such verbs the reflexive pronoun is crucial for the antonym pair to make sense, it must be added; we would therefore remove such examples from the list for inclusion in the dictionary.
Examples: strinjati (se) – prepirati (se), ubogati – upirati (se), udeležiti (se) – zamuditi.

4.12. Semantically Weak Verbs (Pomensko šibki glagoli)

With semantically weak verbs we are dealing with verb pairs in which at least one member requires a complement if it is to be considered the antonym of the other. The category appeared seven times (0.44%) as the main problem and twice (0.30%) as the additional one. If such examples are to be included in the dictionary, a suitable word or phrase must be added alongside the semantically weak verb.
Examples: manjkati – biti (prisoten), biti (statičen/pri miru) – premikati (se), biti (statičen/pri miru) – gibati (se).

4.13. Semantically Full Words without Context (Pomensko polne besede brez konteksta)

Under Semantically full words fall pairs in which one member can be used as the antonym of the other only when it appears in a particular context together with some other word; in other contexts the two words are not in an antonymous relation. This category appeared as the main problem for 15 (0.94%) pairs and as the additional one for 2 (0.30%). It seems that such examples could be included in the dictionary, with the missing context handled at the level of collocations, which the Thesaurus of Modern Slovene currently includes for the semantic comparison of two synonyms.
Examples: pridobiti – odreči (soglasje), odpovedati – obdržati (naročnino), napolniti – sprožiti (pištolo).

4.14. Gender as "Antonym" (Spol kot »protipomenka«)

Two issues appeared in this category. First, we dealt with pairs in which the proposed antonyms are expressions used to designate genders. The question is whether, bearing in mind the desired social sensitivity of the dictionary, it is appropriate to define gender as "antonymous" at all and thus treat it as something oppositional and binary (e.g. moški – ženska). The second issue is evident from examples in which the two nouns were (typically) antonymous, but in different grammatical genders (dolžnica – upnik). Gender appeared as the main problem in 60 (3.76%) pairs and as the additional one in 3 (0.45%). If, despite its problematic nature, this category were included in the dictionary, it would make sense to monitor user feedback closely and establish how users judge the usefulness and suitability of such material. Pairs of the type dolžnica – upnik are not suitable for the dictionary as they stand; the material would have to be placed under the appropriate headword (dolžnica – upnica; dolžnik – upnik).
Examples: moški – ženska, kralj – kraljica, dolžnica – upnik.

4.15. Markedness and/or Rarity of a Word (Zaznamovanost in/ali redkost besede)

In this category we find pairs in which the two expressions are in principle antonymous, but one of them is marked. In some cases this is emotional markedness (fant – punči), in others archaic use (izjemoma – često), colloquial expressions (delavnik – fraj), or simply expressions that rarely occur in use (debelost – mršavost). As Table 2 shows, this category appeared as the main problem quite often, in 157 (9.83%) pairs, and likewise as the additional problem (in 79 pairs, i.e. 11.83%). Since these examples semantically fit the notion of antonymy, it would be best to include them in the dictionary and let the user judge whether, or when, they are usable in their context. It would certainly be good to add a dictionary label marking the markedness that such expressions carry.
Examples: brat – sestrica, izredno – vobče, dolgovezen – koncizen.

4.16. Homographs and Polysemes (Enakopisnice in večpomenke)

This category includes pairs in which one of the expressions is polysemous. These pairs often also involve a figurative sense of one of the members (hladen – navdušen). True homographs are also problematic, i.e. those that would have separate headwords in the dictionary rather than merely several senses (pust – masten). Such pairs appeared 76 times (4.76%) as the main problem and 20 times (2.99%) as the additional one. These examples certainly belong in the dictionary, but it would be necessary to specify with which sense of the word a given word is in an antonymous relation.
Examples: bistrost – motnost, zajedalec – gostitelj, moder – naiven.

4.17. Rarity and Contextual Dependence of Examples (Redkost in kontekstualna vezanost primerov)

This category comprises examples that are antonymous only in certain contexts. Usually one of the expressions is more established and used in more contexts, so it is an antonym of the other only in certain cases. Also placed here were examples whose members would be in a hypernym/hyponym relation if one of them were negated (as with zdrav – umobolen, where the true antonyms would be zdrav – bolan, while umobolen is only one form of ill health). We also included examples in which one of the expressions was very specific, usually terminological (example: izdelava – delaboracija). We decided not to place terminological expressions in a separate category, since it is difficult to determine the boundary between technical, specific, and "pure" terminological expressions. As this is the broadest category, it appeared as the primary problem in as many as 509 (31.87%) examples and as the secondary one in 246 (36.83%). These examples are included in the responsive dictionary, since the user can choose the most suitable option from a wide range of possibilities.
Examples: bogat – neploden, cena – prednost, domač – nepoznan.

4.18. Properties That Are Not Antonymous but Are Often Used as Such (Lastnosti, ki niso protipomenske, a se pogosto tako uporabljajo)

This category gathers examples that do describe mutually exclusive properties, but in the strict sense are not antonyms, although they are often used as such. These are mainly pairs used as antonyms in colloquial contexts, or pairs mistakenly believed to be antonymous. As Table 2 shows, this issue was recognised in 38 (2.38%) examples as the main and in 11 (1.65%) examples as the additional problem. Although such pairs are not antonymous strictly speaking, it would most likely make sense to include them in the dictionary and leave the choice to the user.
Examples: anabolizem – katabolizem, krivulja – premica, nepomemben – znamenit.

4.19. Indirect Synonyms (Posredne sopomenke)

Under Indirect synonyms fall pairs of the type glasen – nem, which at first sight are antonymous only in rare cases; if, however, one member were replaced by its synonym, a much more obvious antonym pair would result (e.g. glasen – tih). Such pairs appeared 40 times (2.50%) as the primary problem and 5 times (0.75%) as the secondary one. Although they are not …

… it would be sensible to include them in the dictionary and leave the judgement of their usefulness to the user community. Examples: državljan – tujec, ofenziven – nespotakljiv, zamuditi – zadeti.

5. Conclusion

The analysis shows that the problem categories carry different weight: some issues would have to be addressed before the material can be included in the dictionary, while for others the decision on relevance can be left to the user community. In the analysis we found that the categories Typos, Wrong lemmas, Different part of speech, (Im)perfectivity, Non-existent derivational variants, (Im)perfective deverbal derivatives and Reflexivity are the most problematic, but at the same time they can presumably be at least partly resolved automatically, which we will take into account in developing the further methodology for the automatic acquisition of antonyms. The remaining categories are more context-dependent, so they can be included in the dictionary and the decision left to the community.

Although at first sight few antonyms were unambiguously confirmed (less than half), further analysis shows that the vast majority (88%) of the data can be included in the responsive dictionary. It is precisely here that the advantage of the responsive dictionary shows itself: it offers the user the possibility of choosing among a wide range of potential antonyms and rating them as more or less suitable. It is therefore best to include as much potential material in the dictionary as possible and to leave it to the language community to decide what is useful for it and what is not. With the digitalisation of society, the needs of language users have changed (and grown); they want an ever larger set of data to choose from.
Odzivni slovar jim prototipsko protipomenski, bi bilo tudi takšne pare morda ne omogoči zgolj tega, ampak tudi dodajanje novega dobro vključiti v slovar, saj uporabniku lahko koristijo v gradiva in odzivanje na že obstoječe. Skupaj z družbo se določenih situacijah, obenem pa spremljati, ali bodo tako spreminjajo slovarji, z njimi pa tudi mi in naša vloga uporabniki v odzivnem slovarju tovrstne primere pri njihovem ustvarjanju. ocenjevali s pozitivnimi ali negativnimi glasovi. Primeri: profit – minus, glasen – nem, kvaren – koristen. 6. Zahvala Projekt Nadgradnja temeljnih slovarskih virov in 4.20. Stopenjski primeri podatkovnih baz CJVT UL v letih 2021–22 financira V to kategorijo smo zbrali pare, ki jih sicer lahko Ministrstvo za kulturo Republike Slovenije. razumemo kot protipomenske v določenem kontekstu, a se Avtorji in avtorice bi se radi zahvalili tudi Špeli Arhar pojavlja zelo očitna stopnjevanost. Besedi torej sta lahko Holdt za vključitev v projekt in pomoč pri načrtovanju protipomenki ( prihodnji – sedanji), a običajno obstaja še raziskave in prispevka. neko bolj izrazito nasprotje ( prihodnji – pretekli).Sem smo vključili tudi stopnjevane pridevnike, ki pa niso vedno nujno na popolnoma nasprotni stopnji. Tako imamo lahko 7. Literatura v paru npr. primernik in presežnik in ne le dva primernika Luluh Aldhubayi in Maha Alyahya. 2014. Automated (primer: manjši – največji in ne le manjši – večji). Arabic Antonym Extraction Using a Corpus Analysis Stopenjski primeri so se kot glavni problem pojavili v 39 Tool. Journal of Theoretical and Applied Information (2,44 %) primerih in v 17 (2,54 %) primerih kot dodatni Technology, 70(3):422–433. problem. Ker so kontekstualno pogojeni, jih je dobro Darja Fišer. 2015. Semantic lexicon of Slovene sloWNet vključiti v odzivni slovar in tako uporabniku omogočiti 3.1. Slovenian language resource repository CLARIN.SI. širšo izbiro potencialnih protipomenk. http://hdl.handle.net/11356/1026. 
Primeri: negativen – nevtralen, dvojen – enojen, Polona Gantar, Iztok Kosem in Simon Krek. 2016. maksimalen – majhen. Discovering automated lexicography: the case of the slovene lexical database. International Journal of 4.21. Drugo Lexicography, 29(2):200–225. Pod Drugo smo vključili primere, ki niso sodili v Špela Arhar Holdt, Jaka Čibelj, Kaja Dobrovoljc, Polona nobeno izmed ostalih kategorij. Kot je razvidno iz Tabele Gantar, Vojko Gorjanc, Bojan Klemenc, Iztok Kosem, 2, smo 6 (0,38 %) parov vključili pod Drugo kot glavni Simon Krek, Cyprian Laskowski in Marko Robnik problem in 11 (1,65 %) parov kot dodatni problem. Takšne Šikonja. 2018. Thesaurus of Modern Slovene: By the pare, ki so se pojavili zelo poredko (0,38 %), bi bilo Community for the Community. V: Thesaurus of Modern Slovene: By the Community for the Community. ŠTUDENTSKI PRISPEVKI 337 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Proceedings of the XVIII EURALEX International Wenbo Wang, Christopher Thomas in Amit Sheth. 2010. Congress, str. 401–410. Pattern-Based Synonym and Antonym Extraction. ACM Marjeta Humar. 2005. Protipomenskost v slovenski SE '10: Proceedings of the 48th Annual Southeast jezikoslovni literaturi. V: M. Jesenček, ur., Knjižno in Regional Conference: 1–4. narečno besedoslovje slovenskega jezika, str. 234–238, https://dl.acm.org/doi/abs/10.1145/1900008.1900094. Slavistično društvo Maribor, Maribor. Marjeta Humar. 2016. Protipomenskost v slovenskem knjižnem jeziku: na primeru terminoloških slovarjev. Inštitut za slovenski jezik Frana Ramovša ZRC SAZU, Ljubljana. Elin Kamenšek Kranjc, Špela Medved in Kaja Podgoršek. 2018. Primerjava spletnega slovarja Slovar sopomenk sodobne slovenščine in knjižnega Sinonimnega slovarja slovenskega jezika. Liter jezika, 9(12):66–70. Agnes Kojc, Tamara Rigler, Kaja Sluga, Anika Plešivčnik in Špela Kovačič. 2018. 
Slovar sopomenk sodobne slovenščine in Sinonimni slovar slovenskega jezika. Liter jezika, 9(12):62–65. Simon Krek, Cyprian Laskowski, Marko Robnik Šikonja, Iztok Kosem, Špela Arhar Holdt, Polona Gantar, Jaka Čibej, Vojko Gorjanc, Bojan Klemec in Kaja Dobrovoljc. 2018. Thesaurus of Modern Slovene 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1166. Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem in Kaja Dobrovoljc. 2020. Gigafida 2.0: the reference corpus of written standard Slovene. V: N. Calzolari, ur., LREC 2020: Twelfth International Conference on Language Resources and Evaluation, str. 3340–3345. ELRA - European Language Resources Association, Paris. http://www.lrec- conf.org/proceedings/lrec2020/LREC-2020.pdf. Nikola Ljubešić in Tomaž Erjavec. 2018. Word embeddings CLARIN.SI-embed.sl 1.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1204. Anna Lobanova, Tom van der Kleij in Jennifer Spenader. 2010. Defining Antonymy: A Corpus-based Study of Opposites by Lexico-syntactic Patterns. International Journal of Lexicography, 23(1):19–53. Ada Vidovič Muha. 2005. Medleksemski pomenski razmerji – sopomenskost in protipomenskost. V: M. Jesenšek, ur., Knjižno in narečno besedoslovje slovenskega jezika, str. 206–221. Slavistično društvo Maribor, Maribor. Ada Vidovič Muha. 2021. Slovensko leksikalno pomenoslovje. Prva e-izdaja. Znanstvena založba FFUL, Ljubljana. Slovar slovenskega knjižnega jezika. Druga, dopolnjena in deloma prenovljena izdaja. 2014. Cankarjeva založba, Ljubljana. Sopomenke 1.0. O slovarju. Center za jezikovne vire in tehnologije. https://viri.cjvt.si/sopomenke/slv/about. Irena Breznik Stramljič. 2010. Tvorjenke slovenskega jezika med slovarjem in besedilom. Mednarodna založba Oddelka za slovanske jezike in književnosti FFUM, Maribor. Jasmina Pegan. 2019. Detekcija antonimov z vektorskimi vložitvami besed. 
Diplomsko delo. Fakulteta za računalništvo in informatiko Univerze v Ljubljani.
Jože Toporišič. 1976. Slovenska slovnica. Založba »Obzorja«, Maribor.
Jože Toporišič. 2000. Slovenska slovnica. Četrta, prenovljena izdaja. Založba »Obzorja«, Maribor.

Ilukana – aplikacija za učenje japonskih zlogovnih pisav hiragana in katakana s pomočjo asociacij

Nina Sangawa Hmeljak,* Anna Sangawa Hmeljak,† Jan Hrastnik‡
* Fakulteta za računalništvo in informatiko, Univerza v Ljubljani, Večna pot 113, 1000 Ljubljana, nina.sangawa@gmail.com
† Akademija za likovno umetnost in oblikovanje, Univerza v Ljubljani, Dolenjska cesta 83, 1000 Ljubljana, anna.sangawa@gmail.com
‡ Fakulteta za matematiko in fiziko, Univerza v Ljubljani, Jamova cesta 21, 1000 Ljubljana

Povzetek
Prispevek predstavlja zasnovo in oblikovanje digitalne aplikacije za slovensko govoreče učence oz. študente japonščine kot pomoč pri pomnjenju japonskih zlogovnih pisav hiragana in katakana s pomočjo asociacij in interaktivnega učenja. Vsak znak dveh japonskih pisav je opremljen z ilustracijo, ki vsebuje obliko tega znaka in obenem ponazarja slovensko besedo, ki se začne s tem zlogom. Aplikacija nudi tako seznam ilustracij kot tudi interaktivne vaje, s katerimi uporabnik lahko preverja svoje znanje. Aplikacija je napisana s paketom za razvoj programske opreme Flutter, v jeziku Dart, tako da deluje v poljubnem operacijskem sistemu. Je še v fazi prototipa, v bodoče načrtujemo raziskavo o učinkovitosti pri pomnjenju, testiranje med uporabniki in dodelavo uporabniškega vmesnika.
Ilukana – an app for learning the Japanese hiragana and katakana syllabaries using associations We present the concept and implementation of a digital application for Slovene-speaking learners of Japanese, as an aid to remembering the Japanese syllabaries hiragana and katakana using associations and interactive learning. Each letter of the Japanese syllabaries is matched with an illustration containing the letter itself and representing a Slovene word beginning with the syllable represented by the letter. The application includes a list of illustrations and interactive exercises. It is written using Flutter, in Dart, and can therefore be used in any operating system. The app is a prototype, research on its effectiveness, user testing and interface upgrades are planned. medtem ko je kitajskih pismenk na tisoče. Tako kot vsak 1. Uvod japonski otrok se tudi tuji učenci najprej naučijo teh dveh V prispevku predstavljamo izgradnjo in oblikovanje zlogovnic. Ker je učenje nove pisave, ki ima popolnoma digitalne aplikacije za učenje japonskih zlogovnih pisav drugačne oblike kot latinica, težavno, a tudi nujno potrebno, hiragana in katakana s pomočjo asociacij in interaktivnega da lahko učenci sploh začnejo brati v japonščini, smo se učenja. Aplikacija je namenjena slovensko govorečim odločili ustvariti aplikacijo, s katero je lahko učenje lažje in učencem ali študentom japonščine kot pomoč pri učenju bolj zabavno. osnovnih znakov japonskih zlogovnih pisav hiragana in katakana. Osnovana je na principu asociacije med znanimi 3. Učenje z asociacijami – mnemotehnika in novimi informacijami: za lažje pomnjenje oblike in Mnemotehnika oz. mnemonika je tehnika učenja oz. izgovora znakov japonskih zlogovnic ponuja za vsak znak pomnjenja, pri kateri skušamo vsebino, ki se jo želimo ilustracijo, ki nakazuje obliko tega znaka in obenem naučiti (tj. to, kar imamo samo v kratkoročnem spominu), ponazarja slovensko besedo, ki se začne s tem zlogom. V urediti in povezati z že znanim (tj. 
s tem, kar že imamo v aplikaciji so seznam ilustracij in interaktivne igrice za dolgoročnem spominu) na tak način, da si jo lažje preverjanje naučenih znakov in tudi mini-igrica za učenje zapomnimo. Pomembne so zlasti v začetni fazi učenja pravilnega vrstnega reda potez pri pisanju kane. Aplikacija jezika, ko si mora učenec zapomniti osnovno besedišče ali je še v fazi prototipa, v prispevku predstavljamo ozadje pisavo, medtem ko na višjem nivoju učenja jezika imajo projekta, namen aplikacije, teoretična izhodišča, podobne učenci običajno bolj razvito in povezano znanje in lahko ilustracije in aplikacije za govorce drugih jezikov, učinkovito uporabljajo druge metode (Oxford, 2016). oblikovalski koncept, tehnično implementacijo, Primeri mnemotehnike so razne rime in besedne zveze, s ugotovljene pomanjkljivosti in načrte za bodoče delo. katerimi si lažje zapomnimo določena pravila, kot npr. stavek “Suhi škafec hoče pasti”, s katerim si zapomnimo 2. Ozadje projekta soglasnike, pred katerimi uporabimo predlog s in ne z, ali Japonščina je priljubljen jezik med ljubitelji japonskih povezovanje oblike predmeta z obliko črke v besedi, ki si mang in animejev, v Sloveniji se poučuje na Filozofski jo hočemo zapomniti v povezavi s predmetom, npr. “ob fakulteti Univerze v Ljubljani, v več privatnih jezikovnih prvem krajcu se luna Debeli, ob zadnjem pa Crkuje”, kjer šolah, mnogi mlajši se japonščine učijo tudi sami s pomočjo oblika črk D in C spominja na obliko lune ob prvem in ob spleta. Pri učenju japonščine je posebej zahtevno učenje zadnjem krajcu. pisave, saj japonščina ne uporablja latinice ampak tri druge Med mnemotehnike spada tudi metoda ključne besede, pisave, hiragano in katakano, ki sta japonski zlogovni po kateri asociiramo novo besedo, ki se jo želimo naučiti, z pisavi, ter kanji oz. pismenke, ki izvirajo iz Kitajske besedo, ki podobno zveni, s pomočjo neke vsebinske (Hmeljak et al., 2020). Od treh pisav imata hiragana in povezave (Cohen, 1987; Manalo, 2002). 
Več raziskav kaže, katakana še najmanj znakov, vsaka ima 46 različnih znakov, da je mnemotehnika lahko uporabna za učenje širokega ŠTUDENTSKI PRISPEVKI 339 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 spektra snovi, kot je učenje tujih jezikov, znanstvenih Manalo et al. (2004) so ugotovili, da je tak način učenja zakonitosti, itd. Omenjene raziskave so pokazale, da so se hiragane učinkovit, udeleženci so bili na splošno zadovoljni, tisti, ki so se učili z uporabo mnemotehnike, veliko bolje bili so mnenja, da jim je pomagalo pri pomnjenju in odrezali kot tisti, ki se niso. Poleg tega je bila šolskem uspehu. Po drugi strani Matsunaga (2003) mnemotehnika učinkovita tudi pri učenju oseb s ugotavlja, da je učenje hiragane z mnemoničnimi slikami, specifičnimi učnimi težavami ali po možganskih povezanimi z angleškimi besedami, bilo učinkovito pri poškodbah. Nekatere raziskave so pokazale celo, da večina učencih japonščine, ki niso bili naravni govorci angleščine, ljudi spontano uporablja mnemotehniko pri učenju na le na kratek rok in to le za tiste, ki se nikoli prej niso učili pamet (Manalo et al., 2004). jezika, ki ne uporablja latinice. Mnemotehnika je torej uporabna tudi pri učenju novih O rabi mnemoničnih slik za učenje kane med govorci jezikov in pisav. Več raziskav je pokazalo tudi učinkovitost slovenščine še ni bilo raziskav, a lahko domnevamo, da tudi mnemotehnik, ki so angleškim govorcem pomagale pri zanje učenje preko ilustracij, ki se nanašajo na angleške učenju japonske pisave (Quackenbush et al., 1989; Manalo besede, ni posebej učinkovito, kot ugotavlja Matsunaga et al., 2004; Matsunaga, 2003) in korejske pisave (Brown, (2003) za govorce drugih jezikov. 2012). Obstaja tudi več učbenikov za učenje kitajskih pismenk, 4. 
Aplikacija Ilukana ki se poslužujejo mnemotehničnih metod za pomnjenje in Učbeniki in aplikacije za učenje hiragane z povezovanje oblike in pomena s pomočjo asociacij. Med mnemoničnimi slikami torej že obstajajo, vendar ne za prvimi je serija učbenikov Jamesa Heisiga (1977; 1987; slovensko govoreče oz. ni takih, ki bi povezale oblike Heisig in Sienko, 1994), ki pokriva vseh 2000 standardnih znakov kane s slovenskimi besedami. Glede na to, da pismenk in je bila prevedena v francoščino (1998), praktično vsi slovensko govoreči učenci japonščine že španščino (2001) in nemščino (2005), za angleško obvladajo tudi angleščino, bi lahko na prvi pogled govoreče pa obstaja še več podobnih učbenikov (Banno et uporabljali gradivo za angleško govoreče, kot ga ponuja npr. al. 2009; Bodnaryk, 2000; McNair, 2000; McNair, 2005 in Ogawa (1990) ali Koichi (2014) in je prikazano v sliki 1. McCabe, 2012). Za slovensko govoreče učence japonščine Toda zlasti pri angleščini, ki ima izrazito globok pravopis, pa še ni takega gradiva, zato smo se odločili, da ga lahko pri povezovanju oblike kane z izgovarjavo angleške ustvarimo. besede pride do zmede zaradi interference zapisa angleške besede in tudi zaradi variabilnosti izgovora same 3.1. Mnemonične slike angleščine (britanska ali ameriška angleščina ipd.). Za Za učenje hiragane z asociacijami obstaja že več slovenske govorce, ki se angleščine večinoma učijo primerov za učenje s pomočjo mnemoničnih slik, ki so istočasno v govorni in pisni obliki, bi lahko bilo težko povezane z angleščino. Oblika enega znaka zlogovnice odmisliti pisno obliko (npr. “nun” za znak な na) in hiragane se prekrije z ilustracijo angleške besede, ki se povezati samo izgovarjavo besede v angleščini (/nan/) z začne z enakim zlogom kot izbrani znak hiragane (Ogawa, zlogom /na/ v japonščini, saj se sorodna beseda v 1990; Rowley, 1995; Koichi, 2014). Obstajajo tudi že slovenščini izgovori /nuna/ (glej sliko 1). Podobno bi lahko aplikacije za ta namen, kot je npr. 
Hiragana Memory Hint tudi rekli za znak に ni, za katerega je izbrana beseda knee, in Katakana Memory Hint Japonske fundacije (Japan ki se izgovarja /nii/, vendar za tiste, ki si hkrati Foundation 2015), ki ponuja učenje v povezavi z predstavljajo tudi pisno obliko, je težko odmisliti »k«, ki je angleščino, indonezijščino in tajščino. na začetku besede. Tudi fonetično je marsikateri zlog v slovenščini bliže japonskemu kot angleški, tako je npr. zlog /sa/ praktično enak v slovenščini in japonščini, medtem ko je izgovorjava angleške besede “saw” drugačna, še dodatno pa lahko zmede razlika med ameriškim in britanskim izgovorom. Za mlajše učence, ki morda še ne obvladajo angleščine, pa je verjetno lažje pomniti asociacije z besedami iz lastnega maternega jezika kot iz angleščine ali drugega tujega jezika, ki ga še ne obvladajo dobro. Zato smo se odločili poiskati slovenske besede, ki lahko pomagajo pri pomnjenju znakov japonske kane, zanje ustvarili ilustracije in s temi zgradili aplikacijo Ilukana. Ilukana je aplikacija za učenje japonskih znakov hiragana in katakana. Ciljna publika so slovensko govoreči učenci ali študenti. Aplikacija ponuja ilustracije, s pomočjo katerih si uporabnik lažje zapomni povezavo med obliko znaka in njegovo izgovarjavo. Uporabnik dostopa do posameznih znakov preko seznama obeh pisav, ki se izmenjujeta z interakcijo uporabnika. Aplikacija vključuje tudi element igre v obliki kviza, prav tako za pomnjenje izgovorjave, kot tudi pravilnega vrstnega reda potez pri zapisovanju. Pri japonščini namreč pravopis določa vrstni red potez, ki vpliva na obliko znakov, zlasti bolj Slika1. Primeri idej mnemoničnih slik za znake に ni, さ kompleksnih. Pri znaku あ se na primer najprej zapiše vodoravna črta, nato navpična in na koncu še krivulja. sa in な na (zgoraj Ogawa 1990, spodaj Koichi 2014). 
ŠTUDENTSKI PRISPEVKI 340 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 4.1. Ustvarjanje asociacijskih slik ekranu. Za vsak znak smo preizkusili več idej in se odločili V aplikaciji so znaki kombinirani z ilustracijami, ki za najbolj jasno. uporabniku pomagajo, da si zapomni obliko znaka z Primeri nekaj različnih idej za isti znak so prikazani v asociacijo na vsebino ilustracije. Tako na primer ilustracija sliki 3. za znak あ (glej sliko 2), ki se izgovori /a/, prikazuje adrenalinski park, kar uporabniku pomaga, da si preko besede “adrenalin”, ki se začne z zlogom /a/, zapomni povezavo med obliko znaka あ in njegovo izgovarjavo, tj. glasom /a/. Za ustvarjanje asociacij je bilo torej potrebno za vsak znak hiragane in katakane najti slovensko besedo, ki se začne z enakim zvokom kot izbrana hiragana in ki predstavlja nekaj, kar je podobne oblike. Pri tem je še nekaj omejitev: beseda mora pomeniti nekaj, kar je mogoče izrisati (abstraktne pojme bi težje spremenili v ilustracije), obenem pa mora biti ilustracija kolikor mogoče enoumno povezana z eno samo besedo, poleg tega ne sme imeti preveč detajlov, da se lahko jasno izriše tudi na manjšem Slika2. Primeri idej mnemoničnih slik: a kot adrenalin. Slika3. Primeri idej mnemoničnih slik: け/ke/ kot keramika, keglanje, kebab, kečap, Kekec. Na sliki 2 je prikazan primer za zlog /a/ (あ v hiragani in ア v katakani). Tu smo uspeli najti primer ilustracije (adrenalin) za isto besedo, ki se prekriva z znakoma za isti zlog v obeh zlogovnicah. Za znak hiragane あ smo kot najprimernejšo izbrali to besedo, ker komplicirani ovinki spominjajo na vlakec smrti. Pri katakani ア pa oblika lahko spominja na kajak na divjih vodah. Obe aktivnosti sta zelo dinamični in ju lahko povežemo z besedo adrenalin. Na sliki 3 je prikazanih več različnih idej, ki smo jih imeli za znak hiragane け, ki se izgovori /ke/. 
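Postopek izbire, opisan v razdelku 4.1 (poiskati slovensko besedo, ki se začne z istim zlogom kot znak kane in jo je mogoče izrisati), lahko v prvem koraku skiciramo kot preprost filter kandidatk. Seznam besed in sama logika spodaj sta zgolj ilustrativna predpostavka, ne dejanski delovni postopek avtorjev:

```python
# Illustrative sketch of the keyword-shortlisting step from section 4.1:
# keep only words that begin with the target syllable.
# The word list is a made-up example, not the authors' actual data.

def candidates(words, syllable):
    """Return the words that start with the given syllable."""
    syllable = syllable.lower()
    return [w for w in words if w.lower().startswith(syllable)]

wordlist = ["keramika", "kegljanje", "kebab", "kečap", "Kekec",
            "sardela", "nuna", "adrenalin"]
print(candidates(wordlist, "ke"))
# the /ke/ candidates considered for the hiragana け in the paper
```

Ostala merila iz razdelka 4.1 (izrisljivost, enoumnost, malo detajlov) v skici niso zajeta, ker zahtevajo človeško presojo.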
Po vrsti od leve zgoraj so keramika, kegljanje, kebab, kečap in Kekec. Slika4. Ni kot nilski konj. Da bi si lažje zapomnili, da je znak sestavljen iz dveh ločenih delov, sta bila kegelj ob kegljaču in Kekec s pohodno palico najboljša kandidata. Toda kegljanje bi lahko zamenjali z bowlingom, ki je bolj razširjen, in bi tako lahko zamešali z zlogom /bo/, ki se v hiragani zapiše ぼ in je oblikovno podoben znaku け oz. /ke/, zato smo izbrali ilustracijo Kekca, ki ni dvoumen. Sliki 4 in 5 prikazujeta še primer za /ni/ に in /sa/ さ. Za /ni/ smo izbrali izraz nilski konj, za /sa/ pa sardelo. 4.2. Oblikovanje aplikacije Pri oblikovanju aplikacije smo se odločili za Slika5. Sa kot sardele. minimalistični izgled. Za celotno podobo so za izhodišče ŠTUDENTSKI PRISPEVKI 341 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 uporabljene barve japonske zastave, tj. rdeča in bela, ter črna za besedilo. Zato da ekran ni presvetel in ne draži oči, je za ozadje uporabljena siva barva. Vse ilustracije, ki so funkcionalni del aplikacije (ikona za premikanje naprej in nazaj, vračanje na vstopno stran ipd.), so v slogu tradicionalnih japonskih grafik ukiyo-e. Ko prižgemo aplikacijo, se znajdemo pred vhodom japonske hiše. Ko se dotaknemo vrat, ki se nam odprejo, vstopimo na prvo stran, ki je glavni meni. Na glavnem meniju imamo štiri gumbe. V ozadju je ilustracija, ki je povzeta po znani grafiki japonskega slikarja Sharakuja, ki upodablja igralca gledališča kabuki. V navigacijski vrstici so trije gumbi. Na sredini je gumb za vračanje na vstopno stran ( home button) v obliki japonske hiše, na levi je gumb za premikanje na prejšnji ekran ( back button) v obliki roke v slogu japonskih grafik ukiyo-e, na desni pa gumb, ki nam omogoča preklapljanje med znaki hiragane in katakane. Prva dva gumba na vstopni strani sta namenjena učenju pismenk s pomočjo asociacije z ilustracijo. 
Če se dotaknemo gumba hiragana ali katakana, nas to pripelje na Slika7. Glavni meni. seznam vseh znakov (pismenk) te pisave. Ko se dotaknemo enega znaka hiragane ali katakane, nas to pripelje na stran s to pismenko čez celo širino ekrana. Pod pismenko je gumb, ki nas pripelje do gibljive slike, ki pokaže pravilni vrstni red, po katerem se zapiše. Ko se dotaknemo pismenke same, pa se nam prikaže pismenka v kombinaciji z ilustracijo. Pod pismenko je v latinici napisana izgovarjava (zlog, ki ga pismenka zapisuje) ter slovenska beseda za pojem, s katerim ga asociiramo. Tretji gumb na glavnem meniju z imenom “seznam” je le pregledni seznam, ki ga lahko z gumbom za preklapljanje uporabimo za pregledovanje in pomnjenje oblik pismenk. Četrti gumb z imenom “vaje” pripelje uporabnika do dveh vaj, kjer se lahko nauči vrstni red pisanja in izgovorjavo hiragane ali katakane. Slika8. Stran za črko あ. Slika6. Začetna stran aplikacije. Slika9. Mnemonična slika za あ. ŠTUDENTSKI PRISPEVKI 342 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 4.3. Primerjava aplikacije Hiragana Memory 4.4. Tehnična implementacija aplikacije Hint z aplikacijo Ilukana Aplikacija je napisana s paketom za razvoj programske Aplikacija Hiragana Memory Hint je aplikacija, ki jo opreme Flutter, v jeziku Dart. Ta jezik smo izbrali zato, da Japonska fundacija ponuja na App Store in Google Play bo aplikacija uporabna v vsakem okolju, saj nam Flutter (Japan Foundation 2015). Namenjena je učenju zlogovne omogoča, da aplikacija deluje tako v operacijskih sistemih pisave hiragana oz. olajšanje tega za angleško govorečo iOS in Android kot v poljubnem spletnem brskalniku. populacijo. Obstajajo tudi različice za govorce drugih Pri pisanju programa je bil največji izziv oblikovati vse azijskih jezikov. 
Naša aplikacija Ilukana ji je podobna v tem, da obe uporabljata mnemonične slike za povezovanje besede v maternem jeziku govorca z obliko kane. Glavna razlika pa je seveda to, da je naša namenjena govorcem slovenščine. Obe aplikaciji imata dve glavni funkciji: pregledovanje in učenje japonskih znakov s pomočjo mnemonskih slik ter kviz, kjer lahko uporabnik vadi. Ilukana nudi uporabniku izbiro, ali si z mnemoničnimi slikami želi zapomniti hiragano ali katakano, medtem ko ima Hiragana Memory Hint na voljo le hiragano, saj za katakano obstaja ločena aplikacija. Naša aplikacija ima tudi daljši seznam vseh zlogov, saj ne vsebuje le osnovnih 46 znakov hiragane kot aplikacija v angleščini, ampak vse možne zloge, ki jih lahko napišemo z dodajanjem diakritičnih znakov: črtici ゛ za zvenečnost (npr. か /ka/ oz. が /ga/), krogec ゜ za glas /p/ (npr. は /ha/ oz. ぱ /pa/) in diakritične znake za mehčanje soglasnikov (npr. さ /sa/ oz. しゃ /ša/). Za te posebne zloge še nimamo ilustracij v našem prototipu.

Pri pisanju programa je bil največji izziv oblikovati vse objekte tako, da se oblika ohrani pri vseh možnih velikostih ekranov. To smo reševali s testiranjem na različnih telefonih in popravljanjem relativne razdalje med objekti. Ker je bil dizajn originalen, smo morali posebej implementirati vse objekte, kot je navigacija. Spletna verzija aplikacije je dostopna na naslovu https://sninah.github.io/ilukana/.

5. Zaključek
Aplikacija je še v fazi razvoja, testirana je bila na nekaj različnih modelih telefonov, a za optimalno delovanje na manjših zaslonih potrebuje še nekaj dodelave. V bodoče nameravamo testirati in optimizirati delovanje aplikacije, med drugim tudi z optimizacijo tempiranja prikazovanja znakov, ki se zdaj pri kvizu prikazujejo naključno, tako da se bodo večkrat pojavljali tisti, pri katerih je uporabnik naredil več napak. Nameravamo tudi preveriti uporabnost in učinkovitost ilustracij pri učenju hiragane in katakane med slovenskimi govorci.
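Diakritična znaka iz razdelka 4.3 (゛ za zvenečnost in ゜ za glas /p/) obstajata v Unicodu tudi kot kombinirajoča znaka, zato je mogoče razširjeni seznam zlogov programsko izpeljati iz 46 osnovnih znakov. Spodnja skica v Pythonu je zgolj ilustrativna (Ilukana sama je napisana v Dartu):

```python
import unicodedata

DAKUTEN = "\u3099"     # combining form of ゛ (voicing: /k/ -> /g/ etc.)
HANDAKUTEN = "\u309A"  # combining form of ゜ (/h/ -> /p/)

def add_mark(kana: str, mark: str) -> str:
    """Compose a base kana with a combining diacritic into the
    precomposed character, e.g. か + ゛ -> が (NFC normalization)."""
    return unicodedata.normalize("NFC", kana + mark)

print(add_mark("か", DAKUTEN))    # /ka/ -> /ga/
print(add_mark("は", HANDAKUTEN)) # /ha/ -> /pa/
```

Mehčani zlogi, kot je しゃ /ša/, so zapisani kot navadno zaporedje osnovnega znaka in malega ゃ, zato tam korak sestavljanja ni potreben.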
Temeljite uporabniške študije aplikacije še nismo izvedli, same ilustracije pa smo pokazali nekaj študentom japonščine, ki so povedali, da so jim bile ilustracije zabavne in da so pomagale pri pomnjenju. Da bi preverili dejansko učinkovitost pri pomnjenju, bi bilo potrebno izvesti eksperiment s kontrolno skupino in preverjanjem znanja pred učenjem, takoj po učenju in čez daljši čas, kar načrtujemo izvesti v prihodnje.

Aplikacija Hiragana Memory Hint ima več interaktivnosti in elementov igre. Ilukana ima le dve igrici: ena je za vajo izgovorjave in je v obliki kviza z več možnimi odgovori, pri drugi pa uporabnik pritisne na črte znaka po pravilnem vrstnem redu pisanja (kakijun). Hiragana Memory Hint pa ima štiri različne tipe kvizov: poleg branja hiragane ima še kviz z več odgovori, kjer uporabnik izbere med več znaki hiragane za dano izgovorjavo, napisano v latinici, kviz, kjer uporabnik izbira znak hiragane glede na izgovorjavo, posneto v zvočni obliki, in kviz za izbiro hiragane glede na napisano izgovorjavo, kjer so na izbiro znaki, ki so si podobni.

Aplikacija Hiragana Memory Hint uporablja črno-bele linearne ilustracije z barvnim ozadjem, medtem ko Ilukana uporablja barvne ilustracije s svetlo sivim ozadjem in črnim besedilom. Pri aplikaciji Hiragana Memory Hint so izbrali minimalističen, jasen učbeniški dizajn s sans serifno latinico za angleščino in gothic črkovno vrsto v japonščini, medtem ko smo se pri Ilukani odločili za mincho črkovno vrsto pri japonskih pismenkah, zato da si lahko uporabnik natančno zapomni pisano obliko japonskih znakov, vključno z zaključevanjem potez po principih tome, hane itd.; črke v latinici pa so tudi v Ilukani sans serifne zaradi boljše čitljivosti. Pri aplikaciji Hiragana Memory Hint so za oblikovanje gumbov in znakov izbrali minimalistični pristop, ki spominja na učbenik ali delovni zvezek s poudarkom na okrogli obliki in igrivih barvah, pri Ilukani pa smo želeli uporabiti elemente japonske kulture v slogu tradicionalnih japonskih grafik na gumbih in ozadjih. Razlika je tudi v uporabi barve: Hiragana Memory Hint uporablja več odtenkov barve v gumbih, kot so zelena, modra, rdeča, oranžna itd., kar daje igriv občutek, medtem ko je pri Ilukani uporabljena rožnata rdeča kot glavni odtenek v kombinaciji s sivimi odtenki ter črno, kar daje bolj izčiščen, eleganten občutek.

6. Literatura
Eri Banno, Yôko Ikeda, Chikako Shinagawa, Kaori Tajima in Kyôko Tokashiki. 2009. Kanji look and learn: 512 kanji with illustrations and mnemonic hints. Tokyo: Japan Times.
Robert P. Bodnaryk. 2000. Kanji Mnemonics: An Instruction Manual for Learning Japanese Characters. Winnipeg, Manitoba: Kanji Mnemonics.
Lucien Brown. 2012. The use of visual/verbal and physical mnemonics in the teaching of Korean Hangul in an authentic L2 classroom context. Writing Systems Research, 4(1):72–90. http://dx.doi.org/10.1080/17586801.2011.635949.
Andrew Cohen. 1987. The use of verbal and imagery mnemonics in second-language vocabulary learning. Studies in Second Language Acquisition, 9(1):43–61.
Kazumi Hatasa. 1991. Teaching Japanese syllabary with visual and verbal mnemonics. CALICO Journal, 8(3):69–80. http://www.jstor.org/stable/24156286.
James Heisig. 1986. Remembering the kanji: A complete course on how not to forget the meaning and writing of Japanese characters. Tokyo: Japan Publications Trading Co.
James Heisig. 1987. Remembering the kanji: A systematic guide to reading Japanese characters. Tokyo: Japan Publications Trading Co.
James Heisig in Tanya Sienko. 1994. Writing and reading Japanese characters for upper-level proficiency. Tokyo: Japan Publications Trading Co.
James Heisig, Marc Bernabé in Verònica Calafell. 2001. Kanji para recordar: curso mnemotécnico para el aprendizaje de la escritura y el significado de los caracteres japoneses. Barcelona: Herder Editorial.
James Heisig in Yves Maniette. 1998. Les kanji dans la tête: apprendre à ne pas oublier le sens et l'écriture des caractères japonais. Yves Maniette. James Heisig in Robert Rauther. 2005. Bedeutung und Schreibweise der japanischen Schriftzeichen. Frankfurt am Main: V. Klostermann. Kenneth Higbee. 1977. Your Memory: How It Works and How to Improve It. Englewood Cliffs, NJ: Prentice-Hall. Kristina Hmeljak Sangawa, Hyeonsook Ryu in Mateja Petrovčič. 2020. Zakaj latinica ni dovolj: o izgubi informacij pri latinizaciji vzhodnoazijskih imen v knjižničnih katalogih. Knjižnica, 64(1–2):47–78. Japan Foundation. 2015. Hiragana Memory Hint. Katakana Memory Hint. English Version. https://minato- jf.jp/Home/JapaneseApplication. Koichi. 2014. Learn hiragana: The ultimate guide. https://www.tofugu.com/japanese/learn-hiragana/ Emmanuel Manalo. 2002. Uses of mnemonics in educational settings: A brief review of selected research. Psychologia, 45(2):69–79. https://doi.org/10.2117/psysoc.2002.69 Emmanuel Manalo, Satomi Mizutani in Julie Trafford. 2004. Using mnemonics to facilitate learning of Japanese script characters. Japan Association for Language Teaching Journal, 26(1):55–77. http://jalt- publications.org/recentpdf/jj/2004a_JJ.pdf#page=57. Sachiko Matsunaga 松 永 幸 子 . 2003. Effects of Mnemonics on Immediate and Delayed Recalls of Hiragana by Learners of Japanese as a Foreign Language. Japanese-Language Education around the Globe, 13: 19–40. https://doi.org/10.20649/00000331. Glen McCabe. 2012. Learning Japanese Hiragana & Katakana Flash Cards Kit. Tokyo: Charles E. Tuttle. Bruce McNair. 2005. Kanji Learned Through Phonic- Mnemonics: Learning to Read Japanese Kanji Using the McNair Phonic-Mnemonic System. Kanji Learning Institute. Bruce McNair. 2016. Read Kanji Read: Read the 2,136 Jooyoo Kanji in Two Months Using Phonic Mnemonics (English Edition). Kanji Learning Institute. Kunihiko Ogawa. 1990. Kana Can Be Easy. Tokyo: The Japan Times. Rebecca L. Oxford. 2016. 
Teaching and researching language learning strategies: Self-regulation in context. London: Routledge. Hiroko Quackenbush, Kiyomi Chujo, Kazuhiko Nagamoto in Shinichiro Tawata. 1989. 50 分ひらがな導入法:連 想法と色付きカード法の比較 Teaching how to read hiragana in 50 minutes: A comparison of mnemonics and the use of cards with associated colours. 日本語教育 Journal of Japanese Language Teaching, 69: 147–162. Michael Rowley. 1995. Kana Pict-O-Graphix: Mnemonics for Japanese Hiragana and Katakana. Albany, CA: Stone Bridge Press. ŠTUDENTSKI PRISPEVKI 344 STUDENT PAPERS Konferenca Conference on Jezikovne tehnologije in digitalna humanistika Language Technologies & Digital Humanities Ljubljana, 2022 Ljubljana, 2022 Filter nezaželene elektronske pošte za akademski svet Anja Vrečer Fakulteta za računalništvo in informatiko, Univerza v Ljubljani Večna pot 113, 1000 Ljubljana anja.vrecer@gmail.com Povzetek Nezaželena akademska elektronska sporočila so nezaželena sporočila, ki jih prejemajo predvsem profesorji, raziskovalci in drugi akademiki, in jih navadni filtri nezaželene elektronske pošte ne zaznavajo. V prispevku predstavimo izdelavo filtra nezaželene akademske elektronske pošte, pri čemer smo naredili primerjavo različnih metod filtriranja sporočil in različnih tehnik obdelave besedila. Za končni model smo uporabili nevronsko mrežo v kombinaciji z vektorskimi vložitvami besed ter ga povezali z izbranim odjemalcem elektronske pošte, in sicer z Gmailom. Filter smo testirali z 10-kratnim prečnim preverjanjem in dosegli tudi do 98% točnost. 1. Uvod 2. Namen članka Obstoječi filtri nezaželene akademske elektronske poš- Elektronska pošta je v zadnjem času postala ena naj- te so v veliki večini samo “ročno” napisana pravila, ki iz- bolj uporabljenih aplikacij za komunikacijo. Vsakodnevno ključujejo sporočila določenih prejemnikov ali z določeni- jo uporablja na milijone ljudi, tako v službi kot v prostem mi ključnimi besedami. Takšna pravila pa je za uspešno de- času (Whittaker et al., 2005). 
Slabost vsesplošne upo- lovanje potrebno stalno posodabljati, saj se pošiljatelji, pa rabnosti elektronske pošte pa je vse večja količina elek- tudi vsebina oziroma besede v teh sporočil, ves čas spremi- tronskih sporočil, ki jih prejemamo. Med njimi je tudi njajo. Zato smo v sklopu raziskave ustvarili filter nezažele- veliko nezaželenih elektronskih sporočil. Prebiranje vseh ne akademske elektronske pošte, ki temelji na modelu ne- sporočil nam zato včasih vzame ogromno časa in energije. vronske mreže v kombinaciji z vektorskimi vložitvami be- Ker želimo čim hitreje ločiti nezaželena sporočila od dru- sed. Začetni model je naučen na množici 660 nezaželenih gih, uporabnih sporočil, imajo mnogi poštni odjemalci že akademskih elektronskih sporočil, skupaj z 2.551 drugih vgrajene filtre nezaželene elektronske pošte. Vendar pa sporočil. Model se lahko tudi prilagodi uporabniku, tako takšni filtri ne zaznajo vseh vrst nezaželene elektronske da upošteva uporabnikova nezaželena akademska elektron- pošte. V prispevku se osredotočimo na eno takšnih skupin ska sporočila v njegovem elektronskem nabiralniku. nezaželene elektronske pošte, in sicer na nezaželeno aka- demsko elektronsko pošto. 3. Sorodna dela Profesorji in drugi akademiki v svoj elektronski nabiral- nik stalno dobivajo vabila k objavljanju člankov v različnih 3.1. Nezaželena akademska elektronska pošta revijah, k sodelovanju na konferencah ali ponudbe odpr- V tem razdelku opišemo ugotovitve o nezaželeni aka- tih delovnih mest. Takšne ponudbe se velikokrat ne na- demski elektronski pošti, povzete po različnih avtorjih. Pri vezujejo na prejemnikovo področje raziskovanja ali pa je pregledu značilnosti smo upoštevali tudi ugotovitve pri pre- takšnih ponudb preprosto preveč. Velik problem predstav- gledu nezaželene akademske elektronske pošte iz naše te- ljajo vabila k prispevanju člankov za manj znane ali pre- stne zbirke sporočil. datorske revije. Akademiki, ki se strinjajo z objavo svojega Nezaželena vabila. 
Unsolicited invitations. Exploitative or predatory journals are journals whose main goal is not the dissemination of knowledge or the academic quality of the published articles, but dishonest profit. They try to trick professors and other academics into collaborating with them by paying to have their articles published. The main properties of these journals (Wahyudi, 2017) are:

• publication of an article requires payment,
• the journal is published frequently,
• an above-average number of articles is accepted for publication,
• article processing and review times are unrealistically fast, and
• the quality of the published articles is poor or very uneven.

In 2014, Jeffrey Beall, a librarian from the University of Colorado, compiled two lists: a list of questionable publishers and a list of questionable journals. He wrote that they exist solely to extract money from authors, who must pay to have their articles accepted (Wahyudi, 2017). Beall's list of predatory journals is most often used to identify exploitative journals. Other collections of suspicious journals also exist, such as the Alexa database and the Phish Tank database of fake websites (Dadkhah et al., 2017). There are also databases that help identify genuine, professional journals, such as the Directory of Open Access Journals (Kozak et al., 2016).

Conference invitations are designed in a similar way. In most cases such invitations have nothing to do with the recipient's field of research and do not exist to spread knowledge among like-minded academics; their purpose is to advertise the organisers' journals and to make money (D. Cobey et al., 2017).

Deception. In deception or phishing attacks, the websites to which an e-mail message points are created so that the recipient enters personal data into them, such as bank card numbers, passwords and the like (da Silva et al., 2020). These websites are made to resemble the actual pages of real organisations, so the recipient often does not even realise they are fakes (Dadkhah et al., 2017). Phishing e-mails are thus a subtype of spam in which the sender pretends to represent some other, legitimate organisation with the aim of obtaining personal data (Gupta et al., 2018). Messages of this kind are mostly aimed at a specific group of people or a specific organisation.

Another way deceptive e-mails work is through self-executing code: clicking a link runs a hidden program that damages the recipient's computer by installing a virus that destroys the recipient's files or steals personal information, passwords and other data (da Silva et al., 2020).

3.2. Generic structure of academic spam messages
Wahyudi (2017) examined the structure of academic spam in detail, so below we summarise the main findings from that and other articles.

The generic structure of an academic spam message consists of a salutation, an announcement, an introduction, a main part and a conclusion. Flattering salutations and titles are often used, such as "distinguished professor" or "you are an expert in this field" (Grey et al., 2016). The recipient's first and last name may also appear in the salutation. The message often expresses praise and false encouragement and promises rewards or career opportunities (Dadkhah et al., 2017; Soler and Cooper, 2019). The sender often claims to have read the recipient's article and insists that the message is not spam (da Silva et al., 2020). In the great majority of cases the message is about a general topic unrelated to the recipient (Grey et al., 2016; Moher and Srivastava, 2015). Another property of academic spam is that it demands a reply within an unrealistically short time (Dadkhah et al., 2017). In some cases, if the recipient does not reply to the first message, new ones follow (Grey et al., 2016).

The senders of academic spam also share some common characteristics. Senders repeat themselves or send repetitive messages to many recipients at once (da Silva et al., 2020). Sometimes the e-mail address is concealed, forged, or inconsistent with the signature at the end of the text (Soler and Cooper, 2019). Addresses that are not concealed mostly carry the official domain of an institution whose credentials they steal (Dadkhah et al., 2017). In many cases the sender's location is also misrepresented (Kozak et al., 2016): the sender writes a location in the message that differs from the actual location from which the message was sent.

The characteristics described above are summarised from the findings of various studies. We observed similar characteristics when reviewing the academic spam used for our training set. Some academic spam filters do take these common properties into account, but they are mostly hand-written rules that must be constantly revised to keep working well. In the following we therefore describe the development of a spam filter based on a classifier that labels e-mail messages automatically.

4. Development of the spam filter
In this chapter we present the design of the academic spam filter. We first describe the training set of messages and the text-processing techniques we used, then present a simplified design of the filter. The chapter concludes with a description of how the filter is connected to the chosen e-mail client.

4.1. The training set of e-mail messages
We obtained the training set from two different sources, as no suitable collection of academic messages covering both academic spam and other academic mail could be found. We used academic spam contributed by professors at the University of Ljubljana and other messages from the web. In total we collected 660 messages labelled as academic spam. The second group, messages that are not spam, we found online on the Kaggle website (van Lit, 2019). That collection contains both spam and other e-mail, but its messages have no academic content; for our system we used only the messages that are not spam. From this collection we obtained 2,551 messages, which we used as training examples of non-spam. Because the resulting set combined messages from different sources, the messages had to be converted into a common form suitable for further processing, and the text of each message had to be processed and transformed appropriately. Below we describe how we approached this problem.

We converted each message in the training set into a dictionary with the keys Subject, Sender, Receiver, Date and Body. The messages in the non-spam group share a single source and form, so all of them could be converted into dictionaries in the same way. The spam messages, however, came from different sources and had to be converted in different ways depending on the file extension.

The next step is converting the message dictionaries into a form suitable for the model. Lai (2007) argues that the most useful part for spam classification is the message subject, and that the body alone does not classify as well as the body combined with the subject. We tested this and decided to use the combination of subject and body as well. In addition, Méndez et al. (2006) report that an attachment converted into text adds unnecessary information that harms classification, so we did not use message attachments.

Next, the message text was processed. In some cases the subject contained message tags in square brackets (for example, INBOX), so we removed the bracketed parts from the text. Especially in the non-spam set, many messages contain other messages (threads of alternating replies). We therefore had to find and remove such parts, which we did by deleting lines beginning with particular characters or strings, such as "To:", "From:", "Wrote:", etc.

The subject and body were then processed in the same way. First we removed capitalisation and converted the whole message to lower case. We noticed that some spam messages contain characters that look like letters but are in fact other characters, which the program detects as punctuation; examples found in our collection are shown in Figure 1. The senders of academic spam evidently used them to prevent spam filters from recognising certain words that point to academic spam. We replaced such characters with the proper letter and appended the tag "specialchars" to the end of the text.

Figure 1: Examples of characters from academic spam messages and the corresponding letters that the spam senders replaced.

We searched the text for exclamation marks and replaced them with the tag "exclamationmark", having noticed that a markedly heavy use of exclamation marks can indicate that a message is academic spam. We also searched for e-mail addresses, links and month names and replaced them with the tags "emailwashere", "linkwashere" and "monthwashere", since the structure of an address or link and the name of a month are not important. In addition, we removed punctuation and uninformative words such as conjunctions, pronouns and question words; in English these frequent, uninformative words are called stop words.

To keep hidden the identity of the professors who contributed academic spam for the training set, the recipients' names had to be removed from the messages. We also removed sender names, as this information is likewise unnecessary for classification. A first or last name was replaced with the tag receivername for the recipient or sendername for the sender.

In this way we turned the messages from a list of dictionaries into a list of texts, i.e. a corpus. We then saved the messages using a library that serialises the object, turning it into a byte stream. Messages stored this way cannot be read directly from the file, but must be converted back into text for reading.
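The cleaning steps above can be sketched as a single normalisation function. This is a minimal illustration, not the paper's implementation: the lookalike-character table is only an example (the actual mapping is in Figure 1), the month list assumes English text, and stop-word removal is omitted.

```python
import re

# Month names for the "monthwashere" tag (assumes an English-language corpus).
MONTHS = r"(january|february|march|april|may|june|july|august|september|october|november|december)"

# Illustrative Latin-lookalike characters; the paper's actual table is in Figure 1.
LOOKALIKES = {"а": "a", "е": "e", "о": "o"}  # Cyrillic lookalikes as examples

def preprocess(subject, body):
    """Normalise one message into the token string fed to the vectoriser."""
    text = f"{subject} {body}"
    text = re.sub(r"\[[^\]]*\]", " ", text)  # drop bracketed tags such as [INBOX]
    # drop quoted-reply lines ("To:", "From:", "Wrote:", ...)
    lines = [ln for ln in text.splitlines()
             if not ln.lstrip().startswith(("To:", "From:", "Wrote:"))]
    text = " ".join(lines).lower()
    had_lookalike = any(ch in text for ch in LOOKALIKES)
    for ch, repl in LOOKALIKES.items():          # replace lookalikes with real letters
        text = text.replace(ch, repl)
    text = re.sub(r"\S+@\S+", " emailwashere ", text)      # e-mail addresses
    text = re.sub(r"https?://\S+", " linkwashere ", text)  # URLs
    text = re.sub(rf"\b{MONTHS}\b", " monthwashere ", text)
    text = text.replace("!", " exclamationmark ")
    if had_lookalike:
        text += " specialchars"                  # flag that lookalikes were present
    return re.sub(r"\s+", " ", text).strip()
```

The function returns a flat, lower-cased string in which all the structure-bearing parts of a message have been replaced by the tags described above.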
4.2. Message-processing techniques
Word counts. The simplest processing technique is counting the occurrences of words in each message. We turned the text collections into a matrix in which each row represents a message and each column a word. Since the total number of words across all messages can be very large, we limited it to 2,000 words. We also experimented with removing words that appear in fewer than three messages, as described by Sakkis and colleagues (Sakkis et al., 2003).

Term frequency-inverse document frequency (TF-IDF). Term frequency simply means the number of occurrences of a word in an individual message. Inverse document frequency represents the informativeness of a word, i.e. whether the word appears frequently or rarely across messages (Hakim et al., 2014). The TF-IDF of a word is computed with equation (1), where t is a term and d a document, i.e. a message: TF(t, d) is the frequency of term t in message d, and IDF(t) is the inverse document frequency of t in the documents. The latter is computed with equation (2), where n is the number of all messages and DF(t) is the number of messages in which term t appears at least once. We add +1 to DF(t) in the denominator of the fraction to avoid division by zero.

TF-IDF(t, d) = TF(t, d) · IDF(t)    (1)

IDF(t) = log( n / (DF(t) + 1) ) + 1    (2)

The advantage of TF-IDF is that it normalises the influence of words that appear very often in the documents and are therefore less informative than words that appear more rarely.
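Equations (1) and (2) can be computed directly from raw counts. The sketch below follows the formulas as written; the logarithm base is not stated in the paper, so the natural logarithm is assumed here.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """TF-IDF(t, d) = TF(t, d) * IDF(t), with
    IDF(t) = log(n / (DF(t) + 1)) + 1, as in equations (1) and (2).
    `doc` is one tokenised message, `docs` the list of all tokenised messages.
    The +1 in the denominator avoids division by zero for unseen terms."""
    tf = Counter(doc)[term]                    # raw count of t in d
    df = sum(1 for d in docs if term in d)     # number of messages containing t
    idf = math.log(len(docs) / (df + 1)) + 1   # natural log assumed
    return tf * idf
```

For a term that appears in exactly half of a two-message corpus, DF(t) + 1 equals n, so IDF(t) reduces to 1 and TF-IDF equals the raw term frequency.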
Mutual information. For attribute selection we also tried removing attributes with too little mutual information. The mutual information of two random variables is a non-negative value that expresses the dependence between the two variables (Kraskov et al., 2004). In other words, mutual information measures the amount of information we gain about one variable if another variable is given (Witten and Frank, 2000). The greater the mutual information of two variables, the more dependent they are on each other; if the mutual information equals zero, the variables are completely independent. The mutual information of two random variables can be computed with equation (3), where I(X; Y) is the mutual information of variables X and Y, H(X) is the entropy of variable X, and H(X|Y) is the conditional entropy of X given Y. Entropy equals the average self-information and represents the degree of uncertainty, i.e. information; it is computed with formula (4), where the possible outcomes are x1, ..., xn and P(xi) is the probability of outcome xi. The conditional entropy is computed with formula (5).

I(X; Y) = H(X) − H(X|Y)    (3)

H(X) = − Σ_{i=1..n} P(xi) log P(xi)    (4)

H(X|Y) = − Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / p(y) )    (5)
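Equations (3)–(5) can be evaluated directly from joint counts over two paired discrete samples. A small sketch follows; the log base is an arbitrary choice here (base 2), since the paper does not fix one.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y) over two paired discrete samples,
    following equations (3)-(5)."""
    n = len(xs)
    px = Counter(xs)                      # marginal counts of X
    py = Counter(ys)                      # marginal counts of Y
    pxy = Counter(zip(xs, ys))            # joint counts of (X, Y)
    # H(X) = -sum_x P(x) log P(x)
    h_x = -sum((c / n) * math.log2(c / n) for c in px.values())
    # H(X|Y) = -sum_{x,y} p(x,y) log( p(x,y) / p(y) )
    h_x_given_y = -sum((c / n) * math.log2((c / n) / (py[y] / n))
                       for (x, y), c in pxy.items())
    return h_x - h_x_given_y
```

When X is fully determined by Y the conditional entropy vanishes and I(X;Y) = H(X); when the two are independent, I(X;Y) = 0, matching the description above.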
Word embeddings (word-to-vector embedding). Word embedding is a technique for representing words with vectors that preserve the semantic properties of the words: words that are more similar in meaning lie closer together in the vector space (Ghannay et al., 2016). Word vectors are built according to which words co-occur in a sentence, since this is the easiest way to establish a word's meaning. Building good word vectors therefore requires a large training corpus, which is often hard to obtain, and training the vectors can be quite time-consuming. For this reason, databases of words and their vectors, trained on large text collections, are available online; examples of such collections of pre-trained vectors are Google's collection and GloVe (Global Vectors for Word Representation) (Pennington, 2014).

Since the GloVe vectors are trained on a huge text collection and are freely available, we decided to use them in our system. Several vector collections from different sources and of different sizes are available; we tried them out and evaluated their performance. We also experimented with different maximum numbers of words per message and different maximum numbers of unique words. In the final system we used 100-dimensional vectors, a limit of 2,000 words per message and a limit of 500,000 distinct words.
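Loading pre-trained vectors and turning a variable-length message into one fixed-size vector can be sketched as follows. The file name is an assumption (GloVe is distributed as plain text, one word plus its vector per line), and mean pooling is one common, simple choice; the paper does not state how its network consumes the per-token vectors.

```python
def load_glove(path, max_words=500_000):
    """Read the GloVe text format: each line holds a word followed by its
    vector components. The 500,000-word cap mirrors the limit chosen in
    the paper; the file itself (e.g. glove.6B.100d.txt) is an assumption."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_words:
                break
            word, *vals = line.rstrip().split(" ")
            vectors[word] = [float(v) for v in vals]
    return vectors

def message_vector(tokens, vectors, dim=100):
    """One fixed-size representation per message: the mean of the word
    vectors of its known tokens (unknown tokens are skipped)."""
    hits = [vectors[t] for t in tokens if t in vectors]
    if not hits:
        return [0.0] * dim        # no known word: all-zero vector
    return [sum(col) / len(hits) for col in zip(*hits)]
```

Stacking the per-message vectors row by row yields the matrix that is fed to the classifier.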
4.3. Design of the academic spam filter
We built a software solution consisting of two programs. The first, main program classifies unread messages; the second program updates the neural network according to the messages the user has labelled. While building the solution we tried out several classifiers, namely naive Bayes, random forest, support vector machines, logistic regression and various neural networks. In the final system we used a neural network, since it achieved the best test results. As the e-mail client we chose Gmail (Google, 2022).

Figure 2: Design of the system during the first training of the neural network and during classification of unread messages.

Figure 2 shows how the system operates when the program for classifying unread messages is run. The program first checks whether a neural network is already saved to disk. If not, the initial training of the neural network is performed. Training requires labelled data, in our case academic spam messages and other e-mail messages. Since the messages may come from different sources, they are converted into a common form and processed so that unnecessary message attributes are removed; this step is marked (1) in the figure. Training the network and saving it to disk follow (2); the network is thus ready for classification at the next run, without waiting for training every time. The next step of the program is reading the unread messages from the mailbox (3). The read messages are processed in the same way as in step (1), and the saved neural network then classifies them. If any unread message is classified as academic spam, the program checks whether messages with the label ACADEMIC SPAM already exist in the user's mailbox. If not, the program creates a new ACADEMIC SPAM label for the user and labels the relevant messages (5); if the label already exists, the program simply applies it to the relevant messages. The label then appears on the unread messages in the user's mailbox (6), and a folder of messages with the ACADEMIC SPAM label is created or extended (7).

Figure 3 shows the second program, which updates the neural network so that it adapts to the user as much as possible. The update only works if the user's mailbox contains messages labelled ACADEMIC SPAM. The program first reads the messages carrying that label, then adds them to the saved user messages, or creates a new file of the user's saved academic spam (1). These messages are then used as part of the training set for the neural network (2). If more than 1,000 user messages are saved, the messages are sorted by receive date and only the most recent 1,000 are kept; if fewer than 1,000 are saved (3), the set of academic spam is topped up with messages from the base academic spam corpus (4). Besides academic spam, the network also needs a set of other messages for training; these are obtained from the base corpus of other messages (5). The network is then trained on the given data and the updated network is saved to disk (6), where it is available for the next classification of unread messages.

Figure 3: Design of the system during an update of the neural network.
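The selection of training spam for the update run can be sketched as a small helper. How many base-corpus messages are added when the user has fewer than 1,000 labelled messages is not specified in the paper; filling up to the limit is an assumption made here.

```python
def select_user_spam(user_msgs, base_msgs, limit=1000):
    """Pick the spam training set for the update run: if the user has more
    than `limit` labelled messages, keep only the `limit` most recent ones
    by receive date; otherwise top the set up from the base spam corpus.
    Each message is a dict with at least a "Date" key (sortable)."""
    msgs = sorted(user_msgs, key=lambda m: m["Date"])
    if len(msgs) > limit:
        return msgs[-limit:]                     # newest `limit` messages
    return msgs + base_msgs[: limit - len(msgs)] # top up from the base corpus
```

The result is then combined with the base corpus of non-spam messages before retraining, as described above.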
4.4. Connection to the e-mail client
We connected the academic spam filter to the free e-mail service offered by Google, namely Gmail. To connect this web e-mail client to the program we used the Gmail API, an application programming interface based on the REST architecture (RESTful API) (Developers, 2021). REST (representational state transfer) is an architecture for exchanging data between web services in which every resource is accessible through a unique URL resource identifier. It is used to access Gmail mailboxes and to send e-mail messages from a program.

We connected the program to the Gmail API through the Google Cloud computing environment, where we created a new project, enabled the Gmail API in it, and added authorisation and authentication for the program. We used API Keys and OAuth 2.0 Client IDs to enable the Gmail API in the program.

If the connection to the Gmail API succeeds, the program is ready to read the user's unread messages. If there are no unread messages, the message "No messages found." is printed and the program exits. Otherwise, a dictionary with the keys Subject, Sender, Receiver, Date and Body is generated from the data obtained through the Gmail API. The messages in dictionary form must then be rearranged into a form suitable for the classifier, much as was done for the messages in the training set (see Section 4.1). Instead of lists of dictionaries we thus obtained a list of processed texts, which we converted, using the saved vectors, into a list of vectors and then into a matrix.

The next step is loading the saved classifier and classifying the unread messages. If the classifier marks any of the messages as academic spam, the label-update part of the program runs. The program first reads, through the Gmail API, all the labels that exist in the user's mailbox and checks whether any of them is ACADEMIC SPAM. If the label already exists, it is added to the messages the classifier marked as spam; if it does not yet exist, a new ACADEMIC SPAM label is created.

Figure 4: Excerpt from a Gmail mailbox in which two unread messages were classified as academic spam.

The result of running the program and classifying the unread messages is the ACADEMIC SPAM label, which appears on the relevant messages. Figure 4 shows an example of such a classification in Gmail: the ACADEMIC SPAM label appears in front of the subject of the messages classified as academic spam. At the same time, the ACADEMIC SPAM label can be seen in the list of all labels on the left-hand side, under which all messages labelled as academic spam in the past can be found.
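The label lookup described above can be separated into a small pure helper; the surrounding Gmail API calls are indicated in the docstring only, since they require authenticated network access. This is a sketch of the flow, not the paper's code.

```python
def ensure_label(labels, name="ACADEMIC SPAM"):
    """Given the label list returned by the Gmail API `users.labels.list`
    method (service.users().labels().list(userId="me")), return the id of
    `name` if it is present, else None. When None is returned, the caller
    creates the label with `users.labels.create` and then applies it to the
    classified messages with `users.messages.modify`
    (body={"addLabelIds": [label_id]})."""
    for lab in labels:
        if lab.get("name") == name:
            return lab["id"]
    return None
```

Keeping the decision logic free of API calls makes it easy to test without a mailbox.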
5. Testing and results
In this final chapter we present how the tested classification models and message-processing techniques were evaluated. We compare the results and show the output of the SHAP algorithm, which finds the words that contributed most to the classification of academic spam.

5.1. Testing procedure
To measure performance we used 10-fold cross-validation. In this method, the training set is split into 10 roughly equally sized subsets, and the testing is repeated 10 times for each model. In each iteration, one of the subsets is taken as the test set and the remaining subsets are merged into the training set. This checks a model's performance more accurately than randomly selecting 10% of the examples and testing only once, since in 10-fold cross-validation every example in the set is used exactly once for testing. At the end we can therefore compute the mean and standard deviation of the results over all repetitions of the testing and obtain more realistic results. In addition, the repeated testing allowed us to use statistical tests to compare the models.
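The procedure just described can be sketched generically; the fold assignment below is a simple round-robin split, and `train_and_test` stands for any routine that fits a model on the training part and returns a score on the test part (both names are illustrative).

```python
def cross_validate(data, labels, train_and_test, k=10):
    """k-fold cross-validation: split the data into k roughly equal folds;
    in each round one fold is the test set and the remaining folds form the
    training set, so every example is tested exactly once. Returns the list
    of per-fold scores, from which a mean and standard deviation can be
    reported."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin fold split
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train = [(data[i], labels[i]) for i in range(n) if i not in test_set]
        test = [(data[i], labels[i]) for i in test_idx]
        scores.append(train_and_test(train, test))
    return scores
```

With k = 10 and n examples, each fold holds about n/10 examples, and the union of the folds covers the whole set exactly once.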
Uporabili smo celotno množico sporočil, in si- ktronskih sporočil. Koprinska et al. (2007) so na eni iz- cer 660 nezaželenih akademskih sporočil in 2.551 drugih med testnih množic z modelom naključnega gozda dosegli sporočil. Kot lahko vidimo, so rezultati že pri teh mo- točnost 96.03%, natančnost 95.62%, priklic 95.62% in F1 delih precej dobri, saj so pravilno klasificirana skoraj vsa mero 94.16%. Za obdelavo sporočil so uporabili posebno sporočila iz testne množice sporočil. metodo izbire atributov, in sicer varianco frekvence têrma (angl. term frequency variance). Ostali modeli v njihovem primeru niso bili tako uspešni. Lai (2007) je v članku opisal Tabela 1: Povprečne vrednosti in standardna deviacija te- preizkus modelov Naivnega Bayesa, k-najbližjih sosedov, stiranja z 10-kratnim prečnim preverjanjem. SVM in kombinacijo TF-IDF z metodo SVM. Za najbolj Model Točnost F1 AUC uspešen model se je izkazala kombinacija TF-IDF z me- Naivni todo SVM. S to metodo so v enem primeru dosegli točnost 88.49% ± 2.40% 77.29% ± 5.27% 0.91 ± 0.02 Bayes 93.43%. Naključni 98.32% ± 0.32% 95.88% ± 0.79% 0.96 ± 0.01 gozd SVM 98.65% ± 0.68% 96.62% ± 1.79% 0.97 ± 0.01 5.3. Razlaga klasifikacije z algoritmom SHAP Logistična 98.82% ± 0.58% 97.02% ± 1.55% 0.97 ± 0.01 regresija Razumljivost in enostavna razlaga modela sta izjemno Nevronska 98.98% ± 0.42% 97.49% ± 1.10% 0.98 ± 0.01 pomembni za interpretacijo rezultatov in možnost nadgra- mreža Nevronska dnje modela. To je velikokrat razlog, da se nekateri razisko- mreža 98.69% ± 0.62% 96.79% ± 1.47% 0.97 ± 0.01 valci odločijo za uporabo enostavnih (linearnih) modelov z GloVe namesto kompleksnejših, ki jih je težko razumeti. Vendar pa je zaradi naraščajoče količine podatkov, ki jih želimo obdelati, nujno, da uporabljamo tudi slednje. Za to obsta- jajo algoritmi, ki nam jih pomagajo razumeti in interpreti- Tabela 2: Povprečni rangi uspešnosti modelov glede na vre- rati rezultat njihove klasifikacije. Eden takšnih algoritmov dnost AUC. 
To compare the models we used the Friedman test (Friedman, 1937); a detailed account of its use is given by Demšar (2006). We first compared the group of classification models to which the above-mentioned text-processing techniques were applied, i.e. the first five models in the table. Using the Friedman test at α = 0.05 on AUC, we checked whether for any pair of models one of them can be said to be significantly better than the other. The average performance ranks of the models by AUC are shown in Table 2. We computed the critical distance CD = 1.93 and compared it with the differences between the models' average ranks, and found that all models are significantly better at classifying unwanted academic messages than naive Bayes. For the other pairs of models the Friedman test could not establish such a difference.

Table 2: Average performance ranks of the models by AUC value.

Naive Bayes | Random forest | SVM  | Logistic regression | Neural network
5           | 2.9           | 2.65 | 2.15                | 2.3

Although the results of these models were already quite good, we nevertheless decided to implement several neural-network models combined with word embeddings. We built several different neural networks and compared them with one another. The last row of Table 1 shows the result of one of these networks. Although its result is slightly worse than those of the models described above, we still used this neural network with word embeddings in the final system, because this method takes into account the meanings of words and not only their surface form, unlike the other text-processing methods.
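The rank-and-critical-distance comparison can be sketched as follows (a toy illustration, not the authors' code; ties are ignored for simplicity, and we assume the ten folds act as the N = 10 repeated measurements, which reproduces the CD = 1.93 reported above with the Nemenyi q-value 2.728 for five models at α = 0.05):

```python
import math

def average_ranks(scores_per_run):
    """scores_per_run: one list of per-model scores (higher is better)
    per evaluation run. Returns each model's average rank (1 = best).
    Ties are not handled, which the real test would need to do."""
    k = len(scores_per_run[0])
    totals = [0.0] * k
    for scores in scores_per_run:
        order = sorted(range(k), key=lambda i: scores[i], reverse=True)
        for rank0, model in enumerate(order):
            totals[model] += rank0 + 1
    return [t / len(scores_per_run) for t in totals]

def critical_distance(k, n, q_alpha=2.728):
    """Nemenyi critical distance for k models over n runs;
    q_alpha = 2.728 is the alpha = 0.05 value for k = 5 (Demšar, 2006)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

Two models differ significantly when their average ranks differ by more than the critical distance; with k = 5 and n = 10 the function gives roughly 1.93.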
The results of our testing are somewhat better than those of some researchers who have worked on similar problems. We did not find any cases in which researchers tried to classify unwanted academic e-mail specifically, but we can still to some extent compare our results with those of ordinary spam filters. Koprinska et al. (2007) achieved, with a random forest model on one of their test sets, an accuracy of 96.03%, a precision of 95.62%, a recall of 95.62% and an F1 score of 94.16%. For message preprocessing they used a special feature-selection method, term frequency variance. The other models in their case were not as successful. Lai (2007) described an evaluation of naive Bayes, k-nearest neighbours, SVM and a combination of TF-IDF with SVM. The combination of TF-IDF with SVM proved the most successful; with this method an accuracy of 93.43% was achieved in one case.

5.3. Explaining the Classification with the SHAP Algorithm

Understandability and easy explanation of a model are extremely important for interpreting the results and for the possibility of improving the model. This is often the reason why some researchers opt for simple (linear) models instead of more complex ones that are hard to understand. However, because of the growing amount of data we want to process, it is necessary to use the latter as well, and there are algorithms that help us understand them and interpret the results of their classification. One such algorithm is SHAP (Lundberg and Lee, 2017).

The SHAP algorithm (SHapley Additive exPlanations) explains, for given examples, why the model classified them the way it did. In other words, SHAP tells us how each attribute influences the model's prediction. In the case of message classification we can therefore use it to find out which words most influence the classification result. Figure 5 shows the output of SHAP for one of the constructed neural networks. Because of the computational cost of the algorithm, we used a smaller subset of the test messages. The plot in the figure shows, from bottom to top, which words are estimated to influence the classification of unwanted messages the most. The word linkwashere, with which we replaced all URL links, is evidently the strongest indicator that a message is not unwanted. Words that strongly indicate that a message is unwanted academic e-mail are university, dear, prof (abbreviation for professor), research and submissions.

Figure 5: Words with the greatest influence on the classification result of the neural network. Words with more dots on their right-hand side contributed to messages being classified as unwanted.
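SHAP efficiently approximates classical Shapley values; for a tiny model they can be computed exactly by enumerating coalitions, which makes the idea behind the per-word attributions concrete (a toy sketch, not the SHAP library; exponential in the number of features):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at input x.

    Features outside a coalition are replaced by `baseline` values.
    phi[i] is feature i's average marginal contribution over all
    coalitions; SHAP estimates the same quantities without the
    exponential enumeration done here.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for coal in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in coal or j == i else baseline[j]
                          for j in range(n)]
                without = [x[j] if j in coal else baseline[j]
                           for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without))
    return phi
```

By the efficiency property, the attributions sum to f(x) minus f(baseline), so each word's value is its share of the shift away from the model's baseline prediction.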
6. Conclusion

Our goal was to build a filter for unwanted academic e-mail that would, as effectively as possible, find and mark unwanted academic messages among the unread messages in the user's mailbox. To achieve this we had to study the structure and common characteristics of unwanted academic e-mail and survey existing ways of filtering spam. Through testing we determined that the neural-network model is the most effective at filtering unwanted academic e-mail, so we used it in the final system.

We found that very few spam-filtering solutions exist whose central goal is filtering unwanted academic messages. The large majority of these solutions only recognise known senders of unwanted academic messages, but for this kind of filtering to remain effective, the sender list must be updated constantly. We therefore implemented a system that classifies unread messages as unwanted academic e-mail according to the meaning of the words in the messages. We achieved this with word embeddings combined with a neural-network model. In addition, we built a program that can update the classification model based on the user's own unwanted academic e-mail. In this way the model can adapt to the user's mailbox and mark unwanted academic messages even more accurately.

One of the major shortcomings of the described solution is that legitimate academic messages had to be substituted with ordinary spam. Because of personal-data protection we could not use professors' messages, and no collections of such academic messages could be found online. Professors and other academics do also receive such ordinary spam, so these messages are to some extent suitable for the training set. Nevertheless, it would be necessary to verify that the classifier, due to the lack of legitimate academic messages, does not simply mark all academic messages as unwanted.

The system could be further improved by taking into account not only the user's unwanted academic messages but also other messages when updating the model. In addition, the system currently works well only for English messages, since our training set consisted solely of English messages. A possible improvement would therefore be language detection and adaptation of the filter to other languages. A user interface could also be added to make the system easier to use.

7. Acknowledgements

I thank Prof. Dr. Zoran Bosnić for his guidance, advice and mentorship during the research, and the professors of the Faculty of Computer and Information Science, University of Ljubljana, who contributed unwanted academic e-mail messages for the training set.

8. References

Kelly D. Cobey, Miguel de Costa e Silva, Sasha Mazzarello, Carol Stober, Brian Hutton, David Moher and Mark Clemons. 2017. Is this conference for real? Navigating presumed predatory conference invitations. Journal of Oncology Practice, 13(7):410–413.

Jaime A. Teixeira da Silva, Aceil Al-Khatib and Panagiotis Tsigaris. 2020. Spam emails in academia: issues and costs. Scientometrics, 122(2):1171–1188.

Mehdi Dadkhah, Glenn Borchardt and Tomasz Maliszewski. 2017. Fraud in academic publishing: researchers under cyber-attacks. The American Journal of Medicine, 130(1):27–30.

Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30.

Google Developers. 2021. Gmail API overview. https://developers.google.com/gmail/api/guides.

Milton Friedman. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701.

Sahar Ghannay, Benoit Favre, Yannick Estève and Nathalie Camelin. 2016. Word embedding evaluation and combination. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 300–305. European Language Resources Association (ELRA).

Google. 2022. Gmail: Brezplačna, zasebna in varna e-pošta [Gmail: Free, private and secure e-mail]. https://www.google.com/intl/sl/gmail/about/, accessed 2022-01-08.

Andrew Grey, Mark J. Bolland, Nicola Dalbeth, Greg Gamble and Lynn Sadler. 2016. We read spam a lot: prospective cohort study of unsolicited and unwanted academic invitations. BMJ, 355.

Brij B. Gupta, Nalin A. G. Arachchilage and Kostas E. Psannis. 2018. Defending against phishing attacks: taxonomy of methods, current issues and future directions. Telecommunication Systems, 67(2):247–267.

Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium and Wahyu Muliady. 2014. Automated document classification for news article in Bahasa Indonesia based on term frequency inverse document frequency (TF-IDF) approach. In: 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 1–4. IEEE.

Irena Koprinska, Josiah Poon, James Clark and Jason Chan. 2007. Learning to classify e-mail. Information Sciences, 177(10):2167–2187.

Marcin Kozak, Olesia Iefremova and James Hartley. 2016. Spamming in scholarly publishing: A case study. Journal of the Association for Information Science and Technology, 67(8):2009–2015.

Alexander Kraskov, Harald Stögbauer and Peter Grassberger. 2004. Estimating mutual information. Physical Review E, 69(6):066138.

Chih-Chin Lai. 2007. An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems, 20(3):249–254.

Songqing Lin. 2013. Why serious academic fraud occurs in China. Learned Publishing, 26(1):24–27.

Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.

José Ramon Méndez, Florentino Fdez-Riverola, Fernando Díaz, Eva Lorenzo Iglesias and Juan Manuel Corchado. 2006. A comparative performance study of feature selection methods for the anti-spam filtering domain. In: Industrial Conference on Data Mining, pp. 106–120. Springer.

David Moher and Anubhav Srivastava. 2015. You are invited to submit… BMC Medicine, 13(1):1–4.

Jeffrey Pennington. 2014. GloVe: Global vectors for word representation. https://nlp.stanford.edu/projects/glove/, accessed 2022-07-15.

Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos and Panagiotis Stamatopoulos. 2003. A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6(1):49–73.

Josep Soler and Andrew Cooper. 2019. Unexpected emails to submit your work: Spam or legitimate offers? The implications for novice English L2 writers. Publications, 7(1):7.

Wessel van Lit. 2019. Email spam Kaggle. https://www.kaggle.com/veleon/ham-and-spam-dataset.

Ribut Wahyudi. 2017. The generic structure of the call for papers of predatory journals: A social semiotic perspective. In: Text-based Research and Teaching, pp. 117–136. Springer.

Steve Whittaker, Victoria Bellotti and Paul Moody. 2005. Introduction to this special issue on revisiting and reinventing e-mail. Human–Computer Interaction, 20(1-2):1–9.

Ian H. Witten and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

Preparing a Corpus and a Question Answering System for Slovene

Matjaž Zupanič∗, Maj Zirkelbach∗, Uroš Šmajdek∗, Meta Jazbinšek†
∗Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana
{mz4689, mz5153, us6796}@student.uni-lj.si
†Department of Translation Studies, Faculty of Arts, University of Ljubljana, Aškerčeva cesta 2, SI-1000 Ljubljana
mj6953@student.uni-lj.si

Abstract

Lack of proper training data is one of the key issues when developing natural language processing models for less-resourced languages, such as Slovene. In this paper we discuss machine translation as a solution to this issue, with the focus on question answering (QA).
We use the SQuAD 2.0 dataset, which we have translated using the eTranslation machine translator. To improve the reliability of the translations, we translate the answers together with the context instead of separately, reducing the rate at which answers were not found in the context from 56% to 7%. For comparison, we also perform manual post-editing of a small subset of the machine translations. We then compare these datasets using various transformer-based QA models and observe the differences between the datasets and different model configurations. The results show little distinction between monolingual and larger multilingual models: the monolingual SloBERTa scored 64.9% exact matches on the machine translated dataset and 72.6% on the human translated one, whereas the multilingual RemBERT scored 64.2% exact matches on the machine translated dataset and 71.9% on the human translated one. Additionally, using the machine translated dataset in the evaluation produces notably worse results than the human translated dataset. Qualitative analysis of the translations has shown that mistakes often occur when the sentences are longer and have more complicated syntax.

1. Introduction

One of the goals of artificial intelligence is to build intelligent systems that would be able to interact with humans and help them. One such task is reading the web and then answering complex questions about any topic in the given content. Such question-answering (QA) systems could have a big impact on the way we access information.
Furthermore, open-domain question answering is a benchmark task in the development of artificial intelligence, since understanding text and being able to answer questions about it is something that we generally associate with intelligence.

Recently, pre-trained contextual embedding (PCE) models like Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and A Lite BERT (ALBERT) (Lan et al., 2020) have attracted a lot of attention due to their great performance on a wide range of NLP tasks.

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity, where languages have few reference articles, and information asymmetry, where questions reference concepts from other cultures. Due to the sizes of modern corpora, performing human translations is generally infeasible, so we often employ machine translations instead. Machine translation, however, is for the most part incapable of interpreting the nuances of specific languages, such as culturally specific vocabulary or, when comparing English and Slovene, the use of articles, the indication of grammatical number or gender, and conjugation endings.

In this work we present a method for constructing a machine translated dataset from SQuAD 2.0 (Rajpurkar et al., 2018) and evaluate its quality using various modern QA models. Additionally, we benchmark its effectiveness by performing manual post-editing on a subset of the translated dataset and comparing the results. The main contributions of our work are:

• a pipeline for the translation of an English question answering dataset;
• a Slovene monolingual model, SloBERTa, fine-tuned on machine translated data, and three fine-tuned multilingual QA models, M-BERT, XLM-R and CroSloEngual BERT, fine-tuned on machine translated data as well as on combined original and machine translated data; and
• a comparison of human and machine translated data in terms of question answering performance.

In Section 2 we present the related work. In Section 3 we present our dataset, the process of translation and post-editing, and evaluate the quality of the translation. In Section 4 we give a brief overview of the models used in the evaluation. In Section 5 we present the evaluation and discuss the results in Section 6. In Section 7 we present the conclusions and give possible extensions and enhancements for future work.
2. Related work

Early question answering systems, such as LUNAR (Woods and WA, 1977), date back to the 60's and the 70's. They were characterised by a core database and a set of rules, both handwritten by experts in the chosen domain. Over time, with the development of large online text repositories and increasing computer performance, the focus shifted from such rule-based systems to machine learning and statistical approaches, like Bayesian classifiers and support vector machines. An example of this kind of system that was able to perform question answering in Slovene was presented by Čeh and Ojsteršek (2009).

Another major revolution in the field of question answering, and in natural language processing in general, was the advent of deep learning approaches and self-attention. One of the most popular approaches of this kind is BERT (Devlin et al., 2018), a transformer model introduced in 2019. Since then it has inspired many other transformer-based models, for instance RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), T5 (Raffel et al., 2020), XLM and XLNet (Yang et al., 2019).

Such models also have the advantage of being able to recognise multiple languages, giving rise to multilingual models and model variants, such as M-BERT, XLM-R (Conneau et al., 2019), mT5 (Xue et al., 2021) and RemBERT (Chung et al., 2020). Nevertheless, the training requires large amounts of training data, which many languages lack, leading to varying performance between different languages. Multilingual models have also been shown to perform worse than monolingual ones (Martin et al., 2020; Virtanen et al., 2019). Ulčar and Robnik-Šikonja (2020) therefore made an effort to strike a middle ground between the performance of monolingual models and the versatility of multilingual ones by reducing the number of languages in a multilingual model to three: two similar less-resourced languages from the same language family, and English. This resulted in two trilingual models, FinEst BERT and CroSloEngual BERT (Ulčar and Robnik-Šikonja, 2020).

In 2020, a Slovene monolingual RoBERTa-based model, SloBERTa (Ulčar and Robnik-Šikonja, 2021), was introduced. It was trained on 5 different corpora, totaling 3.41 billion words. The latest version of the model is SloBERTa 2.0, which augments the original model by more than doubling the number of training iterations. The authors evaluated its performance on named-entity recognition, part-of-speech tagging, dependency parsing, sentiment analysis and word analogy, but not on question answering.

While the described advancements of natural language processing models already offer a partial solution to the lack of language-specific training corpora, namely the ability to train the model on a language for which large corpora exist (e.g. English), the models still require language-specific fine-tuning, for which a sizable corpus is needed. In our work we present a potential solution: using machine translation to translate smaller corpora into Slovene and using them to fine-tune the models and evaluate the results.

3. Dataset description and methodology

The Stanford Question Answering Dataset (SQuAD 2.0) (Rajpurkar et al., 2018) is a reading comprehension dataset. It is based on a set of Wikipedia articles covering a variety of topics, from historical, pharmaceutical and religious texts to texts about the European Union. The answer to every answerable question in the dataset is a segment of text, or span, from the corresponding reading passage. The dataset consists of over 100,000 question-answer pairs extracted from over 500 articles. The reason to use SQuAD 2.0 over 1.0 is that it contains twice as much data and includes unanswerable questions.

3.1. Machine Translation

To translate the dataset into Slovenian we used the eTranslation web service (Commission, 2020). Because the web service is primarily designed to translate webpages and short documents in docx or pdf format, our translation pipeline was designed as follows:

1. Convert the corpus into html format.
2. Split the html file into smaller chunks. We found that 4 MB chunks work best, as larger chunks often could not be translated.
3. Send the chunks to the translation service.
4. Use the original corpus file to compose the translated document in the original format.

Since the basic translation yielded quite underwhelming results, we employed two different methods to improve them. The first was to correct the answers by breaking both the answer and the context down into lemmas and searching for the answer's sequence of lemmas in the context's sequence of lemmas. To accomplish this, the CLASSLA (CLARIN Knowledge Centre for South Slavic languages) library (Ljubešić and Dobrovoljc, 2019) was used. If a match was found, we replaced the bad answer with the original text forming the lemma sequence in the context. The second method was to embed the answers in the context before translation.

To evaluate the quality of the different translations, we measured how many answers can be found directly in their respective context, as they cannot be used in QA models otherwise. The results can be seen in Table 1. The resulting numbers of valid questions, compared with the original, are presented in Table 2.

Table 1: Results for basic translation, lemma correction (LC), and context-embedded (CE) translation of the SQuAD 2.0 dataset. The percentages represent the share of answers that can be found directly in the respective context.

Basic | LC  | CE  | LC+CE
44%   | 66% | 93% | 94%

Table 2: Number of questions in the original SQuAD 2.0 dataset and our machine translated dataset. AQ denotes the number of answerable questions, IQ the number of impossible questions.

Dataset        | Subset | AQ     | IQ     | Total
Original       | Train  | 86,821 | 43,498 | 130,319
Original       | Test   | 5,928  | 5,945  | 11,873
Machine Trans. | Train  | 81,884 | 43,498 | 125,382
Machine Trans. | Test   | 5,735  | 5,945  | 11,680
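The lemma-correction step can be sketched as follows. In the authors' pipeline the lemmatised token sequences would come from CLASSLA; here they are plain lists, and the function name is ours:

```python
def find_lemma_match(answer_lemmas, context_tokens, context_lemmas):
    """Search for the answer's lemma sequence inside the context's
    lemma sequence. On a hit, return the surface-form span from the
    context, which is what replaces the unusable translated answer.
    Returns None when the answer cannot be located.
    """
    m = len(answer_lemmas)
    for start in range(len(context_lemmas) - m + 1):
        if context_lemmas[start:start + m] == answer_lemmas:
            return " ".join(context_tokens[start:start + m])
    return None
```

Matching on lemmas rather than surface forms is what tolerates the case, gender and number changes that Slovene inflection introduces between a separately translated answer and its context.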
3.2. Post-editing of the Machine Translation

Due to limited human resources, post-editing was done on a small number of randomly chosen automatically translated excerpts. The provided excerpts included the original paragraphs or contexts, questions and answers, as well as their machine translations, which were to be corrected by a translation student. This was done in two steps: creating a project in the online translation tool Memsource, with a translation memory in tmx format generated from the machine translations, and revision or post-editing of the segments. Editing was first done on the paragraphs and then on the questions and answers, since the answers had to match the text in the paragraph. The editing was minimal, which means that the focus was not on stylistic improvement but mostly on correcting grammatical errors, wrong meanings and very unusual syntax, to make the translation comprehensible. As mentioned above, the topics of the original texts are diverse and very technical, covering different domains such as religion, history, politics, mathematics and chemistry.

In total, there were 30 manually corrected contexts with accompanying 142 answerable and 143 unanswerable questions. The numbers of the different segment types and of the post-editing changes can be seen in Table 3.

Table 3: Post-editing numerical data. S denotes the number of segments, NS the number of non-corrected segments, CS the number of corrected segments and FS the fraction of corrected segments.

Segment content     | S   | NS  | CS  | FS
Context             | 30  | 0   | 30  | 100%
Answerable question | 142 | 38  | 104 | 73.2%
Answer              | 435 | 225 | 210 | 48.3%
Impossible question | 143 | 43  | 100 | 69.9%
Total number        | 750 | 306 | 444 | 59.2%

3.3. Post-editing Analysis

The numbers seen in Table 3 are not fully representative, since some corrections of machine-translation mistakes are more severe than others, and some segments contain a much greater number of corrections than others. For instance, the corrections, including one of a severe semantic mistake, can be seen in this example:

1. Original: The Northern Chinese were ranked higher and Southern Chinese were ranked lower because southern China withstood and fought to the last before caving in.
2. Machine translation: Severna Kitajci so bili uvrščeni višje in južna Kitajci so bili uvrščeni nižje, ker je južna Kitajska zdržala in se borila do zadnjega pred jamarstvom.
3. Post-edited machine translation: Severni Kitajci so bili uvrščeni višje in južni Kitajci so bili uvrščeni nižje, ker se je južna Kitajska pred predajo upirala in se borila do zadnjega.

Answerable and impossible questions have a similar percentage of segments with corrections. This percentage is quite high because the machine translation produced incoherent results. In these segments the changes made in post-editing are also more notable, because they affect the overall understanding for potential readers. This can be seen in the following examples:

Original:
1. Who did Kublai make the ruler of Korea?
2. Who was Al-Banna's assassination a retaliation for the prior assassination of?
3. What plants create most electric power?

Machine translation:
1. Kdo je Kublai postal vladar Koreje?
2. Kdo je bil Al-Bannin umor maščevanja zaradi predhodnega umora?
3. Katere rastline ustvarjajo največ električne energije?

Post-edited machine translation:
1. Koga je Kublajkan nastavil za vladarja Koreje?
2. Al-Bannov umor je bil maščevanje za čigav predhodni umor?
3. Katere naprave ustvarjajo največ električne energije?

The segments with answers have the largest number of non-corrected segments because they are shorter. Nevertheless, the percentage of corrected questions is still high if we take into account that the answers represent 58% of all segments. The mistakes in the answers were for the most part already corrected in the contexts. More severe mistakes include semantic mistakes (e.g. 'plants' translated as 'rastline' rather than 'naprave') and completely wrong answers (e.g. an empty segment instead of 'Fermilab', or 'in' instead of '1,388'). Some frequent mistakes also occurred in the translations of the names of movements, books, projects or other names (e.g. 'Bricks for Varšava' was left untranslated and was changed to 'Zidaki za Varšavo'). There were some punctuation errors, but the most interesting are the grammatical mistakes, especially when the wrong grammatical case, gender or number is used. Even if these mistakes were corrected in the context, the answers had to be in the exact same form, so many answers do not sound coherent, which is of course not the case for English, where conjugation does not change the words as much (e.g. 'Which part of China had people ranked higher in the class system?' / 'Northern' versus 'V katerem delu Kitajske so bili ljudje višje v razrednem sistemu?' / 'Severni', from the example sentence in the context mentioned above). On the other hand, some corrected segments were identical even though the source was different, due to the use of articles in English (e.g. 'North Sea' and 'the North Sea' were both translated as 'Severno morje').

It should also be noted that the SQuAD 2.0 database is not entirely reliable. In the batch of 142 randomly sampled test question and answer groups, there were 14 occurrences where at least one of the given answers was not correct (e.g. 'Advanced Steam movement' instead of 'pollution' as an answer to 'Along with fuel sources, what concern has contributed to the development of the Advanced Steam movement?').
4. Models

In this section we present each of the five models that were used in the evaluation.

4.1. XLM-R

XLM-R (XLM-RoBERTa) (Conneau et al., 2019) is a pre-trained cross-lingual language model based on XLM (Lample and Conneau, 2019). The 'RoBERTa' part of the name comes from its training routine, which is the same as for the monolingual RoBERTa model; specifically, the sole training objective is masked language modeling (MLM). There is no next-sentence prediction (as in BERT) or sentence-order prediction (as in ALBERT). XLM-R shows that it is possible to train one model for many languages without sacrificing per-language performance. It is trained on 2.5 TB of CommonCrawl data in 100 languages.

4.2. M-BERT

M-BERT (Multilingual BERT) (Devlin et al., 2018) is, as its name suggests, a pre-trained cross-lingual language model. It is based on BERT (Devlin et al., 2018). The pre-trained model is trained on 104 languages with a large amount of data from Wikipedia, using a masked language modeling (MLM) objective. On Hugging Face, only a base model with 12 hidden transformer layers is available; a large model with 24 hidden transformer layers was not uploaded, so we were not able to test it.

4.3. RemBERT

RemBERT (Chung et al., 2020) is a model pre-trained on 110 languages, using a masked language modeling (MLM) objective. Its difference from M-BERT is that the input and output embeddings are not tied. Instead, RemBERT uses small input embeddings and larger output embeddings. This makes the model more efficient, since the output embeddings are discarded during fine-tuning.

4.4. SloBERTa

SloBERTa (Ulčar and Robnik-Šikonja, 2021) is a Slovene monolingual large pre-trained masked language model. It is closely related to the French CamemBERT model, which is similar to the base RoBERTa model but uses a different tokenization model. Since the model requires a large dataset for training, it was trained on 5 combined datasets. It outperformed the existing Slovene models.

4.5. CroSloEngual BERT

This is a trilingual model based on BERT, trained for Slovene, Croatian and English. It was trained on 5.9 billion tokens from these languages. For those languages it performs better than multilingual BERT, which is expected, since studies have shown that monolingual models perform better than large multilingual models (Virtanen et al., 2019).

5. Results

This section is divided into two parts. First we evaluate the automatic machine translations, and then the performance of the chosen QA models (XLM-R-large, M-BERT-base, CroSloEngual BERT, RemBERT, SloBERTa 2.0). All tests were performed on an i5 10400F system with an RTX 3070 GPU with 8 GB of VRAM. For the larger models we used an RTX 3060 with 12 GB.

To compare the performance between the English, machine translated Slovene and human translated Slovene versions of the SQuAD 2.0 dataset, we used 5 different question answering models: M-BERT, XLM-R, RemBERT, SloBERTa 2.0 and CroSloEngual BERT. The evaluation was done in three steps:

1. Performance evaluation of the different models and fine-tuning configurations on the English dataset, as a benchmark for the evaluation of the Slovene results.
2. Performance evaluation of the different models and fine-tuning configurations on the Slovene dataset translated by computer only, to evaluate the quality of machine translation.
3. Performance evaluation of the different models and fine-tuning configurations on the Slovene subset that was translated by a human, and on the same subset both in English and translated by computer, to evaluate the benefits of human translation.

Before the evaluation, we removed all punctuation, leading and trailing white space, and articles from both the ground truth and the prediction. Both were also lower-cased. The parameters used for fine-tuning are presented in Table 4.

Table 4: Parameters used to fine-tune the evaluated models. B denotes the batch size used during fine-tuning, MS the maximum sequence length, LR the learning rate and E the number of epochs.

Model Name        | B  | MS  | LR   | E
XLM-R-large       | 4  | 256 | 1e-5 | 3
M-BERT-base       | 8  | 320 | 3e-5 | 3
CroSloEngual BERT | 4  | 256 | 1e-5 | 3
RemBERT           | 4  | 256 | 1e-5 | 3
SloBERTa 2.0      | 16 | 320 | 3e-5 | 3

The metrics used for the evaluation match the official ones for SQuAD 2.0 and were as follows:

• Exact: the fraction of predictions that matched at least one of the correct answers exactly.
• F1: the average overlap between prediction and ground truth, defined as the average of the F1 scores of the individual questions. The F1 score of an individual question is computed as the harmonic mean of precision and recall, where precision is defined as TM/TP and recall as TM/TGT, with TM the number of matching tokens between prediction and ground truth, TP the number of tokens in the prediction and TGT the number of tokens in the ground truth. A token is defined as a word, separated by white space.

The results on the non-translated SQuAD 2.0 dataset and the machine translated dataset can be seen in Table 5. The results on the human translated subset and its English and computer translated counterparts can be seen in Table 6. Additionally, we provide some examples of correct predictions with wrong answers in Table 7, and some examples of correct answers with wrong predictions in Table 8.
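The normalization and the two metrics defined above can be sketched as follows (a minimal re-implementation in the spirit of the official SQuAD 2.0 evaluation, not the authors' exact code):

```python
import re
import string

def normalize(text):
    """Lower-case, strip punctuation and extra white space, and drop
    English articles, mirroring the preprocessing described above."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    """1.0 if the normalized prediction equals any normalized answer."""
    return float(any(normalize(prediction) == normalize(g)
                     for g in ground_truths))

def qa_f1(prediction, ground_truth):
    """Token-level F1: harmonic mean of precision (TM/TP) and
    recall (TM/TGT), with tokens split on white space."""
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    if not pred or not gold:
        # Both empty (e.g. an impossible question) counts as a match.
        return float(pred == gold)
    common = 0
    gold_left = gold.copy()
    for tok in pred:             # count matching tokens (TM)
        if tok in gold_left:
            gold_left.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)   # TM / TP
    recall = common / len(gold)      # TM / TGT
    return 2 * precision * recall / (precision + recall)
```

For a question with several reference answers, the per-question F1 would be the maximum of `qa_f1` over them, and the corpus scores in Tables 5 and 6 are averages over all questions.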
ŠTUDENTSKI PRISPEVKI 356 STUDENT PAPERS

Konferenca Jezikovne tehnologije in digitalna humanistika, Ljubljana, 2022 / Conference on Language Technologies & Digital Humanities, Ljubljana, 2022

Model name         Fine-tuning   Original        Machine translation
                   language      Exact   F1      Exact   F1
xlmR-large         Eng           81.8%   84.9%   64.3%   72.3%
xlmR-large         Slo           75.0%   79.2%   65.3%   72.4%
xlmR-large         Eng & Slo     74.4%   78.5%   65.9%   73.4%
M-BERT-base        Eng           75.6%   78.9%   55.4%   61.3%
M-BERT-base        Slo           62.4%   67.2%   60.4%   67.0%
M-BERT-base        Eng & Slo     70.7%   75.0%   60.5%   67.3%
CroSloEngual BERT  Eng           72.8%   76.3%   56.3%   63.6%
CroSloEngual BERT  Slo           63.6%   68.2%   58.4%   65.4%
CroSloEngual BERT  Eng & Slo     68.8%   73.0%   58.1%   65.7%
RemBERT            Eng           84.5%   87.5%   67.1%   73.8%
SloBERTa 2.0       Slo           60.6%   64.7%   66.7%   73.9%

Table 5: Comparison of the results of various models and their fine-tuning configurations on the English SQuAD 2.0 evaluation dataset and on the Slovene machine-translated SQuAD 2.0 evaluation dataset. The English dataset only contains the questions present in its Slovene counterpart. Specific parameters used in fine-tuning are presented in Table 4.
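One way to read Table 5 is as the average degradation caused by evaluating on machine-translated data. The sketch below is our own illustration, with the Exact/F1 percentages copied from the rows of Table 5 that were fine-tuned on English only:

```python
# (Exact, F1) pairs from Table 5, models fine-tuned on English only:
# first tuple = original English evaluation, second = machine-translated one.
table5_eng = {
    "xlmR-large":        ((81.8, 84.9), (64.3, 72.3)),
    "M-BERT-base":       ((75.6, 78.9), (55.4, 61.3)),
    "CroSloEngual BERT": ((72.8, 76.3), (56.3, 63.6)),
    "RemBERT":           ((84.5, 87.5), (67.1, 73.8)),
}

def mean_drop(rows):
    """Average (Exact, F1) drop, in points, from original to MT evaluation."""
    exact_drops = [orig[0] - mt[0] for orig, mt in rows.values()]
    f1_drops = [orig[1] - mt[1] for orig, mt in rows.values()]
    n = len(rows)
    return sum(exact_drops) / n, sum(f1_drops) / n

exact_drop, f1_drop = mean_drop(table5_eng)  # roughly 17.9 and 14.2 points
```

The English-fine-tuned models thus lose on average roughly 18 Exact and 14 F1 points when evaluated on the machine translation, consistent with the paper's finding that machine-translated data yields notably worse evaluation results.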
Model name         Fine-tuning   Original        Machine translation   Human translation
                   language      Exact   F1      Exact   F1            Exact   F1
xlmR-large         Eng           80.0%   82.9%   61.1%   68.5%         71.6%   75.9%
xlmR-large         Slo           69.1%   72.9%   61.4%   69.1%         69.8%   74.8%
xlmR-large         Eng & Slo     68.8%   73.4%   64.6%   72.4%         70.5%   75.7%
M-BERT-base        Eng           71.9%   74.9%   52.6%   57.7%         57.5%   60.3%
M-BERT-base        Slo           56.1%   60.4%   58.6%   64.5%         60.4%   66.2%
M-BERT-base        Eng & Slo     64.9%   68.8%   55.8%   61.2%         63.5%   68.6%
CroSloEngual BERT  Eng           73.3%   75.5%   53.0%   60.8%         62.1%   65.7%
CroSloEngual BERT  Slo           59.6%   63.1%   51.6%   58.8%         60.7%   66.0%
CroSloEngual BERT  Eng & Slo     68.1%   70.6%   58.9%   66.3%         64.6%   71.0%
RemBERT            Eng           84.9%   87.2%   64.2%   71.4%         71.9%   76.9%
SloBERTa 2.0       Slo           59.3%   65.0%   64.9%   72.2%         72.6%   78.0%

Table 6: Comparison of the results of various models and their fine-tuning configurations on the human-translated subset of SQuAD 2.0 and on the subsets containing the same questions from the original English dataset and from the machine-translated dataset. Specific parameters used in fine-tuning are presented in Table 4.

#  Dataset  Question                                                         Answer       Prediction
1  ENG      How many of Warsaw's inhabitants spoke Polish in 1933?           833,500      833,500
   MT       Koliko prebivalcev Varšave je leta 1933 govorilo poljsko?        prebivalcev  833.500
   HT       Koliko prebivalcev Varšave je leta 1933 govorilo poljski jezik?  833.500      833.500
2  ENG      Who recorded "Walking in Fresno?"                                Bob Gallion  Bob Gallion
   MT       Kdo je posnel „Walking in Fresno?“                               je Bob       Bob Gallion
   HT       Kdo je posnel ≫Walking in Fresno≪?                               Bob Gallion  Bob Gallion

Table 7: Examples of correct predictions with wrong answers. ENG denotes the English dataset, MT the one translated by a computer and HT the one translated by a human.

#  Dataset  Question                                                              Answer         Prediction
1  ENG      Where did Korea border Kublai's territory?                            northeast      northeast
   MT       Kje je Koreja mejila na Kublajevo ozemlje?                            severovzhodno  zahodno
   HT       Kje je Koreja mejila na Kublajkanovo ozemlje?                         severovzhodno  severovzhodno
2  ENG      How many miles, once completed, will the Lewis S. Eaton trail cover?  22             22
   MT       Koliko kilometrov, ko bo končano, bo pokrivalo Lewis S. Eaton?        22             35
   HT       Koliko kilometrov bo dolga pot Lewisa S. Eatona, ko bo končana?       22             35

Table 8: Examples of correct answers with wrong predictions. ENG denotes the English dataset, MT the one translated by a computer and HT the one translated by a human.

6. Discussion
6.1. Quantitative Analysis
From the results in Table 5, we can see that RemBERT and SloBERTa 2.0 gave the best results on the dataset translated by a computer. While the result for SloBERTa was expected, as monolingual models tend to perform better than multilingual ones, RemBERT managed to outperform its multilingual competitors while only being fine-tuned on the English dataset. We would attribute this simply to the better design of the model. Although both models had very similar performance, we would like to point out that RemBERT is a much larger model and was pre-trained on a significantly larger dataset. Similar results were also observed when comparing the results on the smaller subset of questions that were translated by a human, as seen in Table 6.

In Table 6 we can see that the models consistently perform better on the human-translated data, suggesting that the machine translation provided by the eTranslation web service falls short of providing an adequate set for proper evaluation in Slovene. We can also see that while the models fine-tuned on the machine-translated dataset do perform better when evaluated on the machine-translated data, this does not hold true for evaluations on the human-translated data.

We have also observed that fine-tuning the model on the English dataset first, and then on the Slovene one, yields better results for the smaller models, M-BERT-base and CroSloEngual BERT, compared to fine-tuning on either language alone.

6.2. Qualitative Analysis
While there are many correct predictions of the answers in the machine-translated dataset, it is clear that a great number of predictions still do not answer the question correctly. This is because the machine translation of the sentences in the context is not grammatically and stylistically correct and does not convey the right meaning, and thus the model has more problems finding the answer. The correct predictions are mostly the ones where the answer to the question is short and the words are not inflected, i.e. numbers and names, even though there are some exceptions. The same is true for the human post-edited translation, but the improvement of some answers is already visible from only a few representative examples in Table 7 and Table 8.

7. Conclusion
In this work we present a machine-translated SQuAD 2.0 dataset and evaluate it on the following question answering (QA) models: XLM-R-large, M-BERT-base, RemBERT, CroSloEngual BERT and SloBERTa 2.0. Additionally, we also perform human post-editing on a subset of the SQuAD 2.0 translations in order to better ascertain the quality of the machine translations. The results show that using machine-translated data for evaluation led to notably worse results than using data translated by a human. Moreover, we noticed that while multilingual models fine-tuned on machine-translated data performed better than those fine-tuned on English data when given the task of answering the machine-translated questions, the situation was in most cases reversed when given the task of answering human-translated questions. This leads us to conclude that machine translation, at least the one available via the eTranslation (European Commission, 2020) service, is not particularly suitable for training multilingual models. Of all the models, SloBERTa 2.0 produced the best results on both machine- and human-translated data, while RemBERT gave comparable results even when only fine-tuned on the English dataset.

The testing procedure could easily be improved by employing stronger hardware. RemBERT could, for example, be fine-tuned on the Slovene dataset, which would allow for its better evaluation. Additionally, we were unable to ascertain the optimal parameters for fine-tuning, as performing multiple fine-tunings for each language would be unfeasible. Some restrictions of the project are the limited time for post-editing, a single translator who is not an expert in the topics of the various technical texts, and the method of minimal editing, which can result in mediocre translations. The experiment could be expanded by including a larger subset of human-translated or revised data, more datasets, such as Natural Questions (Kwiatkowski et al., 2019), and different machine translation services, such as DeepL.

8. Acknowledgments
We would like to thank our mentors, Slavko Žitnik and Špela Vintar, for providing us with directions, feedback and advice.

9. References
Ines Čeh and Milan Ojsteršek. 2009. Developing a question answering system for the Slovene language. WSEAS Transactions on Information Science and Applications, (9).
Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2020. Rethinking embedding coupling in pre-trained language models. CoRR, abs/2010.12821.
European Commission. 2020. CEF Digital eTranslation. https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv:1911.02116.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Nikola Ljubešić and Kaja Dobrovoljc. 2019. What does neural bring? Analysing improvements in morphosyntactic annotation and lemmatisation of Slovenian, Croatian and Serbian. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 29–34, Florence, Italy. Association for Computational Linguistics.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD.
Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT. In Proceedings of the International Conference on Text, Speech, and Dialogue, pages 104–111. Springer.
Matej Ulčar and Marko Robnik-Šikonja. 2021. SloBERTa: Slovene monolingual large pretrained masked language model.
Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv:1912.07076.
William A. Woods. 1977. Lunar rocks in natural English: Explorations in natural language question answering.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.