European Language Equality
A Strategic Agenda for Digital Language Equality

Cognitive Technologies

Editor-in-Chief
Daniel Sonntag, German Research Center for AI, DFKI, Saarbrücken, Saarland, Germany

Titles in this series now included in the Thomson Reuters Book Citation Index and Scopus!

The Cognitive Technologies (CT) series is committed to the timely publishing of high-quality manuscripts that promote the development of cognitive technologies and systems on the basis of artificial intelligence, image processing and understanding, natural language processing, machine learning and human-computer interaction.

It brings together the latest developments in all areas of this multidisciplinary topic, ranging from theories and algorithms to various important applications. The intended readership includes research students and researchers in computer science, computer engineering, cognitive science, electrical engineering, data science and related fields seeking a convenient way to track the latest findings on the foundations, methodologies and key applications of cognitive technologies.

The series provides a publishing and communication platform for all cognitive technologies topics, including but not limited to these most recent examples:

• Interactive machine learning, interactive deep learning, machine teaching
• Explainability (XAI), transparency, robustness of AI and trustworthy AI
• Knowledge representation, automated reasoning, multiagent systems
• Common sense modelling, context-based interpretation, hybrid cognitive technologies
• Human-centered design, socio-technical systems, human-robot interaction, cognitive robotics
• Learning with small datasets, never-ending learning, metacognition and introspection
• Intelligent decision support systems, prediction systems and warning systems
• Special transfer topics such as CT for computational sustainability, CT in business applications and CT in mobile robotic systems

The series includes monographs, introductory and advanced textbooks, state-of-the-art collections, and handbooks. In addition, it supports publishing in Open Access mode.

Editors
Georg Rehm, German Research Centre for Artificial Intelligence, Berlin, Germany
Andy Way, ADAPT Centre, Dublin City University, Dublin, Ireland

The European Language Equality project has received funding from the European Union under grant agreement no. LC-01641480 – 101018166.

ISSN 1611-2482
ISSN 2197-6635 (electronic)
Cognitive Technologies
ISBN 978-3-031-28818-0
ISBN 978-3-031-28819-7 (eBook)
https://doi.org/10.1007/978-3-031-28819-7

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc.
in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

Europe is a mosaic of languages, cultures and peoples. The promotion and encouragement of this diversity is the reflection of our will to keep having diverse societies that live together in the use of different languages.

These languages are more than a communication tool. They are factors of identity, vectors of culture, and ways of understanding and explaining the world. Each language, regardless of its status and number of speakers, is a treasure that has been created and polished over generations. And while the means and ambition must be put in place to promote and preserve all languages, those that are in a situation of greater weakness must be the object of special attention.

The preservation of multilingualism as an expression of Europe’s intrinsic diversity is therefore a political commitment that today faces significant challenges. We have built a world with very powerful uniformizing tendencies, inertias that make it increasingly difficult to protect the treasure of cultural and linguistic diversity. So much so that one language disappears every two weeks, and up to 90% of existing languages could be gone by the turn of the century.
In this sense, digital tools, although possessing many virtues, can also generate a clear concern when it comes to their impacts on linguistic diversity and equality. It has never been so fast and so easy to communicate and inform, and never have the temptation and incentives to end up doing it in just a handful of languages – the most powerful and influential – been so great. If the audience is the world, and if what counts is getting more followers, then the temptation to stop using our own languages is enormous. In this sense, preserving linguistic equality in the digital age must be an objective assumed by all EU institutions. And part of the solution can come precisely from the tools that the digital world can offer to us.

The European Parliament has long expressed its concern about the future of multilingualism in the digital age. In a landmark document, our Parliament adopted a 2018 resolution on achieving language equality in the digital age, whose rapporteur was my Welsh colleague and former MEP Jill Evans.

Building up on that report, the Panel for the Future of Science and Technology (STOA), of which I am a proud member, held a seminar in late 2022 titled “Towards full digital language equality in a multilingual European Union”. This event presented the conclusions of the European Language Equality (ELE) project, which analysed over 80 languages to develop a roadmap towards achieving full digital language equality in Europe by 2030.

It is tempting to think that multilingualism begins and ends with the languages that have a guaranteed official status; in the case of the EU, the 24 languages that appear in the treaties as the official languages of the Union. But in the EU alone there are at least 60 other languages that also deserve to be preserved and encouraged, despite the fact that they do not have official status. That is why we must welcome initiatives like the ELE project, and work together towards a Union in which all languages, especially minority ones, enjoy the same rights.
As a native Catalan speaker, who is very much aware of the pressure that technological and digital trends exert especially on lesser-used languages, and committed to the protection and promotion of these languages, I am very honoured to introduce this book, European Language Equality: A Strategic Agenda for Digital Language Equality, and I would like to thank and congratulate all who contributed in the ELE project and in the writing of these pages. Projects and publications like these draw the right path towards a more inclusive and diverse Union.

Brussels, January 2023
Jordi Solé

Preface

The origins of this book date back to 2010. Back then, under the umbrella of the EU Network of Excellence META-NET, we started preparing the White Paper Series Europe’s Languages in the Digital Age (published in 2012)¹ and the Strategic Research Agenda for Multilingual Europe 2020 (SRA, published in 2013), the first document of its kind for the European Language Technology (LT) field in a community-driven process.² The META-NET White Paper Series revealed, among others, that, back then, 21 European languages were threatened by what we called digital language extinction. As a direct response to this danger, the META-NET SRA provided suggestions as to how to bring about a change and how to increase the collaboration with the entire European LT community on a number of priority research themes towards the common goal of what is now known as digital language equality in Europe.

Especially the notion of digital language extinction but also our strategic recommendations generated a certain amount of attention. Back in 2013 and 2014, colleagues from META-NET were involved in dozens of television, radio and print interviews and there have also been several follow-up publications and EU projects as well as official questions raised in the European Parliament (EP). These eventually led to a number of workshops held in the EP and to a study commissioned by the EP’s Science and Technology Options Assessment (STOA) unit.
The STOA study³ (2018) eventually paved the way for the report Language Equality in the Digital Age⁴ jointly prepared by the EP’s Committees on Culture and Education (CULT) and on Industry, Research and Energy (ITRE). These recommendations, informally known as the Jill Evans report, were adopted by the EP in a landslide vote in September 2018. Among other recommendations, this report suggested to the European Commission to “establish a large-scale, long-term coordinated funding programme for research, development and innovation in the field of language technologies, at European, national and regional levels, tailored specifically to Europe’s needs and demands”. The European Language Equality (ELE) proposal⁵ and eventual project, described in the present volume, represented our direct response to this recommendation.

1 http://www.meta-net.eu/whitepapers
2 http://www.meta-net.eu/sra
3 https://www.europarl.europa.eu/stoa/en/document/EPRS_STU(2017)598621
4 https://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.html

It was a pleasure to lead the ELE project and to collaborate with such a strong and dedicated team consisting of 52 partner organisations covering all European countries, academia and industry as well as all major pan-European initiatives. Like many other projects around the globe, ELE was also affected by the SARS-CoV-2 pandemic but, fortunately, our initial plans and project proposal had already been prepared during the pandemic, which meant that we were able to tailor the project to the new normal. Nevertheless, everybody involved was happy to eventually be able to attend our joint META-FORUM event, which took place in June 2022 with about 100 participants in the conference centre in Brussels and hundreds more participating remotely. After what felt like an endless succession of virtual meetings, for many of us, this was the first opportunity to meet face-to-face.
This book describes the results produced during the project’s runtime; additional details are available in more than 60 project reports.⁶ We would like to express our gratitude towards the consortium for its hard and dedicated work towards our goal of developing the Strategic Research, Innovation and Implementation Agenda and Roadmap for Achieving Full Digital Language Equality in Europe by 2030.⁷ We would also like to thank all ELE colleagues wholeheartedly for the chapters they contributed, without which this book would not have been possible. Additionally, we would like to thank all initiatives ELE collaborated with, especially the European Language Grid⁸ project, the results of which have also been documented in a book in the same series. Finally, we would like to thank Jordi Solé for supporting and chairing the workshop Towards full digital language equality in a multilingual European Union, held on 8 Nov. 2022 in the EP, and for contributing the foreword.

This volume covers the results achieved during the project’s first iteration (January 2021 until June 2022). Immediately after the end of the first project, the initiative continued with the project ELE 2, which will end in June 2023. We sincerely hope that the whole ELE initiative will serve its purpose, which is to help bring about digital language equality in Europe by 2030. This book provides an analysis of the current state of play (Part I) and our recommendations for the future situation in 2030 (Part II). Proper support for the implementation of these plans would mean a quantum leap for Europe’s multilingual landscape with concomitant benefits for all its citizens, regardless of the language they prefer to communicate in.

Berlin and Dublin, April 2023
Georg Rehm
Andy Way

Acknowledgements The European Language Equality project has received funding from the European Union under the grant agreement no. LC-01641480 – 101018166 (ELE).
5 https://www.european-language-equality.eu
6 https://www.european-language-equality.eu/deliverables/
7 https://www.european-language-equality.eu/agenda/
8 https://www.european-language-grid.eu

Contents

1 European Language Equality: Introduction
Georg Rehm and Andy Way
  1 Overview and Context
  2 The European Language Equality Project
  3 Beyond the ELE Project
  4 Summary of this Book
    4.1 Part I: European Language Equality – Status Quo in 2022
    4.2 Part II: European Language Equality – The Future Situation in 2030 and beyond
  References

Part I European Language Equality: Status Quo in 2022

2 State-of-the-Art in Language Technology and Language-centric Artificial Intelligence
Rodrigo Agerri, Eneko Agirre, Itziar Aldabe, Nora Aranberri, Jose Maria Arriola, Aitziber Atutxa, Gorka Azkune, Jon Ander Campos, Arantza Casillas, Ainara Estarrona, Aritz Farwell, Iakes Goenaga, Josu Goikoetxea, Koldo Gojenola, Inma Hernáez, Mikel Iruskieta, Gorka Labaka, Oier Lopez de Lacalle, Eva Navas, Maite Oronoz, Arantxa Otegi, Alicia Pérez, Olatz Perez de Vinaspre, German Rigau, Ander Salaberria, Jon Sanchez, Ibon Saratxaga, and Aitor Soroa
  1 Introduction
  2 Language Technology: Historical Overview
    2.1 A Brief History
    2.2 The Deep Learning Era
  3 Neural Language Models
  4 Research Areas
    4.1 Language Resources
    4.2 Text Analysis
    4.3 Speech Processing
    4.4 Machine Translation
    4.5 Information Extraction and Information Retrieval
    4.6 Natural Language Generation and Summarisation
    4.7 Human-Computer Interaction
  5 Language Technology beyond Language
  6 Conclusions
  References

3 Digital Language Equality: Definition, Metric, Dashboard
Federico Gaspari, Annika Grützner-Zahn, Georg Rehm, Owen Gallagher, Maria Giagkou, Stelios Piperidis, and Andy Way
  1 Introduction and Background
  2 Related Work
  3 Digital Language Equality: Key Principles and Definition
  4 Implementing the Digital Language Equality Metric
  5 Technological Factors
    5.1 Weights and Scores
    5.2 Configuration of the Technological Factors
    5.3 Computing the Technological Scores
    5.4 Technological DLE Scores of Europe’s Languages
    5.5 Open Issues and Challenges
  6 Contextual Factors
    6.1 Computing the Contextual Scores
    6.2 Experts Consultation
    6.3 Contextual DLE Scores of Europe’s Languages
    6.4 Open Issues and Challenges
  7 Digital Language Equality Dashboard
  8 Conclusions and Future Work
  References
  Appendix

4 European Language Technology in 2022
Maria Giagkou, Teresa Lynn, Jane Dunne, Stelios Piperidis, and Georg Rehm
  1 Introduction
  2 How Do Europe’s Languages Compare?
    2.1 Source of Evidence and Methodology
    2.2 Results and Findings
  3 The Voice of the Community
    3.1 Developers of Language Technologies
    3.2 Users of Language Technologies
    3.3 European Citizens as Consumers of Language Technologies
  4 Conclusions
  References

5 Language Report Basque
Kepa Sarasola, Itziar Aldabe, Arantza Diaz de Ilarraza, Ainara Estarrona, Aritz Farwell, Inma Hernáez, and Eva Navas
  1 The Basque Language
  2 Technologies and Resources for Basque
  3 Recommendations and Next Steps
  References

6 Language Report Bosnian
Tarik Ćušić
  1 The Bosnian Language
  2 Technologies and Resources for Bosnian
  3 Recommendations and Next Steps
  References

7 Language Report Bulgarian
Svetla Koeva
  1 The Bulgarian Language
  2 Technologies and Resources for Bulgarian
  3 Recommendations and Next Steps
  References

8 Language Report Catalan
Maite Melero, Blanca Calvo, Mar Rodríguez, and Marta Villegas
  1 The Catalan Language
  2 Technologies and Resources for Catalan
  3 Recommendations and Next Steps
  References

9 Language Report Croatian
Marko Tadić
  1 The Croatian Language
  2 Technologies and Resources for Croatian
  3 Recommendations and Next Steps
  References

10 Language Report Czech
Jaroslava Hlaváčová
  1 The Czech Language
  2 Technologies and Resources for Czech
  3 Recommendations and Next Steps
  References

11 Language Report Danish
Bolette Sandford Pedersen, Sussi Olsen, and Lina Henriksen
  1 The Danish Language
  2 Technologies and Resources for Danish
  3 Recommendations and Next Steps
  References

12 Language Report Dutch
Frieda Steurs, Vincent Vandeghinste, and Walter Daelemans
  1 The Dutch Language
  2 Technologies and Resources for Dutch
  3 Recommendations and Next Steps
  References

13 Language Report English
Diana Maynard, Joanna Wright, Mark A. Greenwood, and Kalina Bontcheva
  1 The English Language
  2 Technologies and Resources for English
  3 Recommendations and Next Steps
  References

14 Language Report Estonian
Kadri Muischnek
  1 The Estonian Language
  2 Technologies and Resources for Estonian
  3 Recommendations and Next Steps
  References

15 Language Report Finnish
Krister Lindén and Wilhelmina Dyster
  1 The Finnish Language
  2 Technologies and Resources for Finnish
  3 Recommendations and Next Steps
  References

16 Language Report French
Gilles Adda, Ioana Vasilescu, and François Yvon
  1 The French Language
  2 Technologies and Resources for French
  3 Recommendations and Next Steps
  References

17 Language Report Galician
José Manuel Ramírez Sánchez, Laura Docío Fernández, and Carmen García-Mateo
  1 The Galician Language
  2 Technologies and Resources for Galician
  3 Recommendations and Next Steps
  References

18 Language Report German
Stefanie Hegele, Barbara Heinisch, Antonia Popp, Katrin Marheinecke, Annette Rios, Dagmar Gromann, Martin Volk, and Georg Rehm
  1 The German Language
  2 Technologies and Resources for German
  3 Recommendations and Next Steps
  References

19 Language Report Greek
Maria Gavriilidou, Maria Giagkou, Dora Loizidou, and Stelios Piperidis
  1 The Greek Language
  2 Technologies and Resources for Greek
  3 Recommendations and Next Steps
  References

20 Language Report Hungarian
Kinga Jelencsik-Mátyus, Enikő Héja, Zsófia Varga, and Tamás Váradi
  1 The Hungarian Language
  2 Technologies and Resources for Hungarian
  3 Recommendations and Next Steps
  References

21 Language Report Icelandic
Eiríkur Rögnvaldsson
  1 The Icelandic Language
  2 Technologies and Resources for Icelandic
  3 Recommendations and Next Steps
  References

22 Language Report Irish
Teresa Lynn
  1 The Irish Language
  2 Technologies and Resources for Irish
  3 Recommendations and Next Steps
  References

23 Language Report Italian
Bernardo Magnini, Alberto Lavelli, and Manuela Speranza
  1 The Italian Language
  2 Technologies and Resources for Italian
  3 Recommendations and Next Steps
  References

24 Language Report Latvian
Inguna Skadiņa, Ilze Auziņa, Baiba Valkovska, and Normunds Grūzītis
  1 The Latvian Language
  2 Technologies and Resources for Latvian
  3 Recommendations and Next Steps
  References

25 Language Report Lithuanian
Anželika Gaidienė and Aurelija Tamulionienė
  1 The Lithuanian Language
  2 Technologies and Resources for Lithuanian
  3 Recommendations and Next Steps
  References

26 Language Report Luxembourgish
Dimitra Anastasiou
  1 The Luxembourgish Language
  2 Technologies and Resources for Luxembourgish
  3 Recommendations and Next Steps
  References

27 Language Report Maltese
Michael Rosner and Claudia Borg
  1 The Maltese Language
  2 Technologies and Resources for Maltese
  3 Recommendations and Next Steps
  References

28 Language Report Norwegian
Kristine Eide, Andre Kåsen, and Ingerid Løyning Dale
  1 The Norwegian Language
  2 Technologies and Resources for Norwegian
  3 Recommendations and Next Steps
  References

29 Language Report Polish
Maciej Ogrodniczuk, Piotr Pęzik, Marek Łaziński, and Marcin Miłkowski
  1 The Polish Language
  2 Technologies and Resources for Polish
  3 Recommendations and Next Steps
  References

30 Language Report Portuguese
António Branco, Sara Grilo, and João Silva
  1 The Portuguese Language
  2 Technologies and Resources for Portuguese
  3 Recommendations and Next Steps
  References

31 Language Report Romanian
Vasile Păiș and Dan Tufiș
  1 The Romanian Language
  2 Technologies and Resources for Romanian
  3 Recommendations and Next Steps
  References

32 Language Report Serbian
Cvetana Krstev and Ranka Stanković
  1 The Serbian Language
  2 Technologies and Resources for Serbian
  3 Recommendations and Next Steps
  References

33 Language Report Slovak
Radovan Garabík
  1 The Slovak Language
  2 Technologies and Resources for Slovak
  3 Recommendations and Next Steps
  References

34 Language Report Slovenian
Simon Krek
  1 The Slovenian Language
  2 Technologies and Resources for Slovenian
  3 Recommendations and Next Steps
  References

35 Language Report Spanish
Maite Melero, Pablo Peñarrubia, David Cabestany, Blanca Calvo, Mar Rodríguez, and Marta Villegas
  1 The Spanish Language
  2 Technologies and Resources for Spanish
  3 Recommendations and Next Steps
  References

36 Language Report Swedish
Lars Borin, Rickard Domeij, Jens Edlund, and Markus Forsberg
  1 The Swedish Language
  2 Technologies and Resources for Swedish
  3 Recommendations and Next Steps
  References

37 Language Report Welsh
Delyth Prys and Gareth Watkins
  1 The Welsh Language
  2 Technologies and Resources for Welsh
  3 Recommendations and Next Steps
  References

Part II European Language Equality: The Future Situation in 2030 and beyond

38 Consulting the Community: How to Reach Digital Language Equality in Europe by 2030?
Jan Hajič, Maria Giagkou, Stelios Piperidis, Georg Rehm, and Natalia Resende
  1 Introduction
  2 Methodology
  3 The Perspective of European LT Developers
    3.1 Stakeholders
    3.2 Instruments
  4 The Perspective of European LT Users
    4.1 Stakeholders
    4.2 Instruments
  5 The Perspective of Europe’s Citizens
  6 Predicting Language Technology in 2030: Deep Dives
  7 Collecting Additional Input and Feedback
    7.1 Conferences and Workshops
    7.2 Project Website
    7.3 Social Media
  8 Summary and Conclusions
  References

39 Results of the Forward-looking Community-wide Consultation
Emma Daly, Jane Dunne, Federico Gaspari, Teresa Lynn, Natalia Resende, Andy Way, Maria Giagkou, Stelios Piperidis, Tereza Vojtěchová, Jan Hajič, Annika Grützner-Zahn, Stefanie Hegele, Katrin Marheinecke, and Georg Rehm
  1 Introduction
  2 The Perspective of European LT Developers
    2.1 Respondents’ Profiles
    2.2 Language Coverage
    2.3 Predictions for the Future
  3 The Perspective of European LT Users
    3.1 Respondents’ Profiles
    3.2 Language Coverage
    3.3 Predictions for the Future
  4 The Perspective of Europe’s Citizens as Consumers of LTs
    4.1 Respondents’ Profiles
    4.2 Language Coverage
    4.3 Predictions for the Future
  5 Summary and Conclusions
  References

40 Deep Dive Machine Translation
Inguna Skadina, Andrejs Vasiljevs, Marcis Pinnis, Aivars Berzinš, Nora Aranberri, Joachim Van den Bogaert, Sally O’Connor, Mercedes García-Martínez, Iakes Goenaga, Jan Hajič, Manuel Herranz, Christian Lieske, Martin Popel, Maja Popović, Sheila Castilho, Federico Gaspari, Rudolf Rosa, Riccardo Superbo, and Andy Way
  1 Introduction
    1.1 Scope of this Deep Dive
    1.2 Main Components
  2 State-of-the-Art and Main Gaps
    2.1 State-of-the-Art
    2.2 Main Gaps
  3 The Future of the Area
    3.1 Contribution to Digital Language Equality
    3.2 Breakthroughs Needed
    3.3 Technology Visions and Development Goals
    3.4 Towards Deep Natural Language Understanding
  4 Summary and Conclusions
  References

41 Deep Dive Speech Technology
Marcin Skowron, Gerhard Backfried, Eva Navas, Aivars Berzinš, Joachim Van den Bogaert, Franciska de Jong, Andrea DeMarco, Inma Hernáez, Marek Kováč, Peter Polák, Johan Rohdin, Michael Rosner, Jon Sanchez, Ibon Saratxaga, and Petr Schwarz
  1 Introduction
    1.1 Scope of this Deep Dive
    1.2 Main Components
  2 State-of-the-Art and Main Gaps
    2.1 State-of-the-Art
    2.2 Main Gaps
  3 The Future of the Area
    3.1 Contribution to Digital Language Equality
    3.2 Breakthroughs Needed
    3.3 Technology Visions and Development Goals
    3.4 Towards Deep Natural Language Understanding
  4 Summary and Conclusions
  References

42 Deep Dive Text Analytics and Natural Language Understanding
Jose Manuel Gómez-Pérez, Andrés García-Silva, Cristian Berrio, German Rigau, Aitor Soroa, Christian Lieske, Johannes Hoffart, Felix Sasaki, Daniel Dahlmeier, Inguna Skadina, Aivars Berzinš, Andrejs Vasiljevs, and Teresa Lynn
  1 Introduction
    1.1 Scope of this Deep Dive
    1.2 Main Components
  2 State-of-the-Art and Main Gaps
    2.1 State-of-the-Art
    2.2 Main Gaps
  3 The Future of the Area
    3.1 Contribution to Digital Language Equality
    3.2 Breakthroughs Needed
    3.3 Technology Visions and Development Goals
    3.4 Towards Deep Natural Language Understanding
  4 Summary and Conclusions
  References

43 Deep Dive Data and Knowledge
Martin Kaltenböck, Artem Revenko, Khalid Choukri, Svetla Boytcheva, Christian Lieske, Teresa Lynn, German Rigau, Maria Heuschkel, Aritz Farwell, Gareth Jones, Itziar Aldabe, Ainara Estarrona, Katrin Marheinecke, Stelios Piperidis, Victoria Arranz, Vincent Vandeghinste, and Claudia Borg
  1 Introduction
    1.1 Scope of this Deep Dive
    1.2 Main Components
  2 State-of-the-Art and Main Gaps
    2.1 State-of-the-Art
    2.2 Main Gaps
  3 The Future of the Area
    3.1 Contribution to Digital Language Equality
    3.2 Breakthroughs Needed
    3.3 Technology Visions and Development Goals
    3.4 Towards Deep Natural Language Understanding
  4 Summary and Conclusions
  References

44 Strategic Plans and Projects in Language Technology and Artificial Intelligence
Itziar Aldabe, Aritz Farwell, German Rigau, Georg Rehm, and Andy Way
  1 Introduction
  2 International Reports on Language Technology
    2.1 Reports from International Organisations
    2.2 Reports from the United States
    2.3 Reports from the European Union
  3 Major Language Technology Initiatives in Europe
    3.1 European Initiatives
    3.2 National and Regional Initiatives
  4 SWOT Analysis
    4.1 Strengths
    4.2 Weaknesses
    4.3 Opportunities
    4.4 Threats
  5 Conclusions
  References

45 Strategic Research, Innovation and Implementation Agenda for Digital Language Equality in Europe by 2030
Georg Rehm and Andy Way
  1 Executive Summary
  2 Multilingual Europe and Digital Language Equality
  3 What is Language Technology and How Can it Help?
  4 A Shared European Programme for Language Technology and Digital Language Equality in Europe by 2030
    4.1 Policy Recommendations
    4.2 Governance Model
    4.3 Technology and Data Recommendations
    4.4 Infrastructure Recommendations
    4.5 Research Recommendations
    4.6 Implementation Recommendations
  5 Roadmap towards Digital Language Equality in Europe
    5.1 Main Components
    5.2 Actions, Budget, Timeline, Collaborations
  6 Concluding Remarks
  References

List of Contributors

Gilles Adda, Université Paris-Saclay, CNRS, LISN, France, gilles.adda@limsi.fr
Rodrigo Agerri, University of the Basque Country, Spain, rodrigo.agerri@ehu.eus
Eneko Agirre, University of the Basque Country, Spain, e.agirre@ehu.eus
Itziar Aldabe, University of the Basque Country, Spain, itziar.aldabe@ehu.eus
Dimitra Anastasiou, Luxembourg Institute of Science and Technology, Luxembourg, dimitra.anastasiou@list.lu
Nora Aranberri, University of the Basque Country, Spain, nora.aranberri@ehu.eus
Victoria Arranz, Evaluations and Language Resources Distribution Agency, France, arranz@elda.org
Jose Maria Arriola, University of the Basque Country, Spain, josemaria.arriola@ehu.eus
Aitziber Atutxa, University of the Basque Country, Spain, aitziber.atutxa@ehu.eus
Ilze Auzina, Institute of Mathematics and Computer Science, University of Latvia, Latvia, ilze.auzina@lumii.lv
Gorka Azkune, University of the Basque Country, Spain, gorka.azkune@ehu.eus
Gerhard Backfried, HENSOLDT Analytics GmbH, Austria, gerhard.backfried@hensoldt.net
Cristian Berrio, Expert.AI, Spain, cberrio@expert.ai
Aivars Berzinš, Tilde, Latvia, aivars.berzins@tilde.com
Joachim Van den Bogaert, CrossLang, Belgium, joachim.van.den.bogaert@crosslang.com
Kalina Bontcheva, University of Sheffield, United Kingdom, k.bontcheva@sheffield.ac.uk
Claudia Borg, University of Malta, Malta, claudia.borg@um.edu.mt
Lars Borin, University of Gothenburg, Sweden, lars.borin@svenska.gu.se
Svetla Boytcheva, Ontotext, Bulgaria, svetla.boytcheva@ontotext.com
António Branco, University of Lisbon, Portugal, antonio.branco@di.fc.ul.pt
David Cabestany, Barcelona Supercomputing Center, Spain, david.cabestany@bsc.es
Blanca Calvo, Barcelona Supercomputing Center, Spain, blanca.calvo@bsc.es
Jon Ander Campos, University of the Basque Country, Spain, jonander.campos@ehu.eus
Arantza Casillas, University of the Basque Country, Spain, arantza.casillas@ehu.eus
Sheila Castilho, Dublin City University, ADAPT Centre, Ireland, sheila.castilho@adaptcentre.ie
Khalid Choukri, Evaluations and Language Resources Distribution Agency, France, choukri@elda.org
Tarik Ćušić, University of Sarajevo, Bosnia and Herzegovina, tarik.cusic@izj.unsa.ba
Walter Daelemans, University of Antwerp, Belgium, walter.daelemans@uantwerpen.be
Daniel Dahlmeier, SAP SE, Germany, daniel.dahlmeier@sap.com
Ingerid Loyning Dale, The National Library of Norway, Norway, ingerid.dale@nb.no
Emma Daly, Dublin City University, ADAPT Centre, Ireland, emma.daly@adaptcentre.ie
Andrea DeMarco, University of Malta, Malta, andrea.demarco@um.edu.mt
Arantza Díaz de Ilarraza, University of the Basque Country, Spain, a.diazdeilarraza@ehu.eus
Laura Docío Fernández, University of Vigo, Spain, ldocio@gts.uvigo.es
Rickard Domeij, Institute of Languages and Folklore, Sweden, rickard.domeij@isof.se
Jane Dunne, Dublin City University, ADAPT Centre, Ireland, jane.dunne@adaptcentre.ie
Wilhelmina Dyster, University of Helsinki, Finland, wilhelmina.dyster@helsinki.fi
Jens Edlund, KTH Royal Institute of Technology, Sweden, edlund@speech.kth.se
Kristine Eide, The Language Council of Norway, Norway, kristine.eide@sprakradet.no
Ainara Estarrona, University of the Basque Country, Spain, ainara.estarrona@ehu.eus
Aritz Farwell, University of the Basque Country, Spain, aritz.farwell@ehu.eus
Markus Forsberg, University of Gothenburg, Sweden, markus.forsberg@gu.se
Anželika Gaidiene, Institute of the Lithuanian Language, Lithuania, anzelika.gaidiene@lki.lt
Owen Gallagher, Dublin City University, ADAPT Centre, Ireland, owen.gallagher@adaptcentre.ie
Radovan Garabík, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Slovakia, radovan.garabik@kassiopeia.juls.savba.sk
Mercedes García-Martínez, Pangeanic, Spain, m.garcia@pangeanic.com
Carmen García-Mateo, University of Vigo, Spain, carmen.garcia@uvigo.es
Andrés García-Silva, Expert.AI, Spain, agarcia@expert.ai
Federico Gaspari, Dublin City University, ADAPT Centre, Ireland, federico.gaspari@adaptcentre.ie
Maria Gavriilidou, Institute for Language and Speech Processing, R.C. “Athena”, Greece, maria@athenarc.gr
Maria Giagkou, Institute for Language and Speech Processing, R.C. “Athena”, Greece, mgiagkou@athenarc.gr
Iakes Goenaga, University of the Basque Country, Spain, iakes.goenaga@ehu.eus
Josu Goikoetxea, University of the Basque Country, Spain, josu.goikoetxea@ehu.eus
Koldo Gojenola, University of the Basque Country, Spain, koldo.gojenola@ehu.eus
Jose Manuel Gómez-Pérez, Expert.AI, Spain, jmgomez@expert.ai
Mark A. Greenwood, University of Sheffield, United Kingdom, m.greenwood@sheffield.ac.uk
Sara Grilo, University of Lisbon, Portugal, sara.grilo@di.fc.ul.pt
Dagmar Gromann, University of Vienna, Austria, dagmar.gromann@univie.ac.at
Annika Grützner-Zahn, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany, annika.gruetzner-zahn@dfki.de
Normunds Gruzitis, Institute of Mathematics and Computer Science, University of Latvia, Latvia, normunds.gruzitis@lumii.lv
Jan Hajič, Charles University, Czech Republic, hajic@ufal.mff.cuni.cz
Stefanie Hegele, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany, stefanie.hegele@dfki.de
Barbara Heinisch, University of Vienna, Austria, barbara.heinisch@univie.ac.at
Enikő Héja, Research Centre for Linguistics, Hungary, heja.eniko@nytud.hu
Lina Henriksen, University of Copenhagen, Denmark, linah@hum.ku.dk
Inma Hernáez, University of the Basque Country, Spain, inma.hernaez@ehu.eus
Manuel Herranz, Pangeanic, Spain, m.herranz@pangeanic.com
Maria Heuschkel, Wikimedia Deutschland, Germany, maria.heuschkel@wikimedia.de
Jaroslava Hlaváčová, Charles University, Czech Republic, hlavacova@ufal.mff.cuni.cz
Johannes Hoffart, SAP SE, Germany, johannes.hoffart@sap.com
Mikel Iruskieta, University of the Basque Country, Spain, mikel.iruskieta@ehu.eus
Kinga Jelencsik-Mátyus, Research Centre for Linguistics, Hungary, jelencsik-matyus.kinga@nytud.hu
Gareth Jones, Bangor University, United Kingdom, g.jones@bangor.ac.uk
Franciska de Jong, CLARIN ERIC, The Netherlands, franciska@clarin.eu
Martin Kaltenböck, Semantic Web Company, Austria, martin.kaltenboeck@semantic-web.com
Andre Kasen, The National Library of Norway, Norway, andre.kasen@nb.no
Svetla Koeva, Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences, Bulgaria, svetla@dcl.bas.bg
Marek Kováč, Phonexia, Czech Republic, kovac@phonexia.com
Simon Krek, Jožef Stefan Institute, Slovenia, simon.krek@ijs.si
Cvetana Krstev, University of Belgrade, Serbia, cvetana@matf.bg.ac.rs
Gorka Labaka, University of the Basque Country, Spain, gorka.labaka@ehu.eus
Alberto Lavelli, Fondazione Bruno Kessler, Italy, lavelli@fbk.eu
Marek Łaziński, University of Warsaw, Poland, m.lazinski@uw.edu.pl
Christian Lieske, SAP SE, Germany, christian.lieske@sap.com
Krister Lindén, University of Helsinki, Finland, krister.linden@helsinki.fi
Dora Loizidou, University of Cyprus, Cyprus, loizidou.dora@ucy.ac.cy
Oier Lopez de Lacalle, University of the Basque Country, Spain, oier.lopezdelacalle@ehu.eus
Teresa Lynn, Dublin City University, ADAPT Centre, Ireland, teresa.lynn@adaptcentre.ie
Bernardo Magnini, Fondazione Bruno Kessler, Italy, magnini@fbk.eu
Katrin Marheinecke, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany, katrin.marheinecke@dfki.de
Diana Maynard, University of Sheffield, United Kingdom, d.maynard@sheffield.ac.uk
Maite Melero, Barcelona Supercomputing Center, Spain, maite.melero@bsc.es
Marcin Miłkowski, Institute of Philosophy and Sociology, Polish Academy of Sciences, Poland, mmilkows@ifispan.edu.pl
Kadri Muischnek, University of Tartu, Estonia, kadri.muischnek@ut.ee
Eva Navas, University of the Basque Country, Spain, eva.navas@ehu.eus
Sally O’Connor, KantanMT, Ireland, sallyoc@kantanai.io
Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences, Poland, maciej.ogrodniczuk@ipipan.waw.pl
Sussi Olsen, University of Copenhagen, Denmark, saolsen@hum.ku.dk
Maite Oronoz, University of the Basque Country, Spain, maite.oronoz@ehu.eus
Arantxa Otegi, University of the Basque Country, Spain, arantza.otegi@ehu.eus
Vasile Păiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Romania, vasile@racai.ro
Pablo Penarrubia, Barcelona Supercomputing Center, Spain, pablo.penarrubia@bsc.es
Alicia Pérez, University of the Basque Country, Spain, alicia.perez@ehu.eus
Olatz Perez de Vinaspre, University of the Basque Country, Spain, olatz.perezdevinaspre@ehu.eus
Piotr Pęzik, University of Łódź, Poland, piotr.pezik@uni.lodz.pl
Marcis Pinnis, Tilde, Latvia, marcis.pinnis@tilde.com
Stelios Piperidis, Institute for Language and Speech Processing, R.C. “Athena”, Greece, spip@athenarc.gr
Peter Polák, Charles University, Czech Republic, polak@ufal.mff.cuni.cz
Martin Popel, Charles University, Czech Republic, popel@ufal.mff.cuni.cz
Maja Popović, Dublin City University, ADAPT Centre, Ireland, maja.popovic@adaptcentre.ie
Antonia Popp, University of Zurich, Switzerland, popp@cl.uzh.ch
Delyth Prys, Bangor University, United Kingdom, d.prys@bangor.ac.uk
José Manuel Ramírez Sánchez, University of Vigo, Spain, jmramirez@gts.uvigo.es
Georg Rehm, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Germany, georg.rehm@dfki.de
Natalia Resende, Dublin City University, ADAPT Centre, Ireland, natalia.resende@adaptcentre.ie
Artem Revenko, Semantic Web Company, Austria, artem.revenko@semantic-web.com
German Rigau, University of the Basque Country, Spain, german.rigau@ehu.eus
Annette Rios, University of Zurich, Switzerland, rios@cl.uzh.ch
Mar Rodríguez, Barcelona Supercomputing Center, Spain, mar.rodriguez@bsc.es
Eiríkur Rögnvaldsson, Árni Magnússon Institute for Icelandic Studies, Iceland, eirikur@hi.is
Johan Rohdin, Phonexia, Czech Republic, rohdin@phonexia.com
Rudolf Rosa, Charles University, Czech Republic, rosa@ufal.mff.cuni.cz
Michael Rosner, University of Malta, Malta, mike.rosner@um.edu.mt
Ander Salaberria, University of the Basque Country, Spain, ander.salaberria@ehu.eus
Jon Sanchez, University of the Basque Country, Spain, jon.sanchez@ehu.eus
Bolette Sandford Pedersen, University of Copenhagen, Denmark, bspedersen@hum.ku.dk
Kepa Sarasola, University of the Basque Country, Spain, kepa.sarasola@ehu.eus
Ibon Saratxaga, University of the Basque Country, Spain, ibon.saratxaga@ehu.eus
Felix Sasaki, SAP SE, Germany, felix.sasaki@sap.com
Petr Schwarz, Phonexia, Czech Republic, schwarz@phonexia.com
Joao Silva, University of Lisbon, Portugal, joao.silva@di.fc.ul.pt
Inguna Skadina, Institute of Mathematics and Computer Science, University of Latvia and Tilde, Latvia, inguna.skadina@tilde.com
Marcin Skowron, HENSOLDT Analytics GmbH, Austria, marcin.skowron@hensoldt.net
Aitor Soroa, University of the Basque Country, Spain, a.soroa@ehu.eus
Manuela Speranza, Fondazione Bruno Kessler, Italy, manspera@fbk.eu
Ranka Stanković, University of Belgrade, Serbia, ranka@rgf.bg.ac.rs
Frieda Steurs, Dutch Language Institute, The Netherlands, frieda.steurs@ivdnt.org
Riccardo Superbo, KantanMT, Ireland, riccardos@kantanai.io
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, Croatia, marko.tadic@ffzg.hr
Aurelija Tamulioniene, Institute of the Lithuanian Language, Lithuania, aurelija.tamulioniene@lki.lt
Dan Tufiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy, Romania, tufis@racai.ro
Baiba Valkovska, Institute of Mathematics and Computer Science, University of Latvia, Latvia, baiba.valkovska@lumii.lv
Vincent Vandeghinste, Dutch Language Institute, The Netherlands, vincent.vandeghinste@ivdnt.org
Tamás Váradi, Research Centre for Linguistics, Hungary, varadi.tamas@nytud.hu
Zsófia Varga, Research Centre for Linguistics, Hungary, varga.zsofia@nytud.hu
Ioana Vasilescu, Université Paris-Saclay, CNRS, LISN, France, ioana.vasilescu@limsi.fr
Andrejs Vasiljevs, Tilde, Latvia, andrejs.vasiljevs@tilde.com
Marta Villegas, Barcelona Supercomputing Center, Spain, marta.villegas@bsc.es
Tereza Vojtěchová, Charles University, Czech Republic, vojtechova@ufal.mff.cuni.cz
Martin Volk, University of Zurich, Switzerland, volk@cl.uzh.ch
Gareth Watkins, Bangor University, United Kingdom, g.watkins@bangor.ac.uk
Andy Way, Dublin City University, ADAPT Centre, Ireland, andy.way@adaptcentre.ie
Joanna Wright, University of Sheffield, United Kingdom, j.wright@sheffield.ac.uk
François Yvon, Université Paris-Saclay, CNRS, LISN, France, francois.yvon@limsi.fr

Acronyms

ABSA  Aspect-based Sentiment Analysis
ACL  Association for Computational Linguistics
ADRA  AI, Data and Robotics Association
AI  Artificial Intelligence
ALPAC  Automatic Language Processing Advisory Committee
API  Application Programming Interface
ASR  Automatic Speech Recognition
BART  Bidirectional Auto-Regressive Transformers
BDVA  Big Data Value Association
BERT  Bidirectional Encoder Representations from Transformers
BLARK  Basic Language Resource Kit
BLEU  Bilingual Evaluation Understudy
CA  Conversational Agent
CEF  Connecting Europe Facility
CF  Contextual Factor
CL  Computational Linguistics
CLAIRE  Confederation of Laboratories for AI Research in Europe
CLARIN  Common Language Resources and Technology Infrastructure
CMS  Content Management System
CNN  Convolutional Neural Network
CPAI  Coordinated Plan on Artificial Intelligence
CRACKER  Cracking the Language Barrier
CSA  Coordination and Support Action
CULT  Committee on Culture and Education
DAIRO  Data, AI and Robotics
DARIAH  Digital Research Infrastructure for the Arts and Humanities
DCAT  Data Catalogue Vocabulary
DG CNECT  Directorate-General for Communications Networks, Content and Technology (European Commission)
DGA
Data Governance Act
DH  Digital Humanities
DIN  Deutsches Institut für Normung (German Institute for Standardisation)
DL  Deep Learning
DLE  Digital Language Equality
DNN  Deep Neural Networks
DPP  Data Protection and Privacy
DSM  Digital Single Market
E2E  End-to-End System
EC  European Commission
ECPAI  EU Coordinated Plan on Artificial Intelligence
ECRML  European Charter for Regional or Minority Languages
ECSPM  European Civil Society Platform for Multilingualism
EEA  European Economic Area
EFNIL  European Federation of National Institutions for Language
EL  Entity Linking
ELE  European Language Equality EU Project
ELEN  European Language Equality Network
ELEXIS  European Lexicographic Infrastructure
ELG  European Language Grid
ELIS/EUATC  European Language Industry Survey
ELISE  European Learning and Intelligent Systems Excellence
ELITR  European Live Translator
ELLIS  European Laboratory for Learning and Intelligent Systems
ELM  European Language Monitor
ELRA  European Language Resources Association
ELRC  European Language Resource Coordination
ELT  European Language Technology
EMM  Enterprise Metadata Management
EOSC  European Open Science Cloud
EP  European Parliament
ESF  European Social Fund
ESFRI  European Strategy Forum on Research Infrastructures
EU  European Union
EUDAT CDI  EUDAT Collaborative Data Infrastructure
EVALITA  Evaluation of NLP and Speech Tools for Italian
FAIR  Findable, Accessible, Interoperable, Reusable Principles
GAN  Generative Adversarial Networks
GDP  Gross Domestic Product
GDPR  General Data Protection Regulation
GMM  Gaussian Mixture Model
GPT  Generative Pre-trained Transformer
GPU  Graphics Processing Unit
H2H  Human-to-Human Communication
H2M  Human-to-Machine Communication
HCI  Human-Computer Interaction
HLT  Human Language Technology
HMM  Hidden Markov Models
HPC  High Performance Computing
HTML  Hypertext Markup Language
IA  Innovation Action
ICT  Information and Communication Technology
IDSA  International Data Spaces Association
IE  Information Extraction
IEEE  Institute of Electrical and Electronics Engineers
IPR  Intellectual Property Rights
IR  Information Retrieval
ISCA  International Speech Communication Association
ISO  International Organization for Standardization
JSON  JavaScript Object Notation
KG  Knowledge Graph
LDC  Linguistic Data Consortium
LDS  Language Data Space
LEAM  Large European AI Models
LIBER  Association of European Research Libraries
LID  Language Identification
LLM  Large Language Model
LM  Language Model
LOD  Linked Open Data
LR  Language Resource
LRT  Language Resources and Technologies
LSC  Catalan Sign Language
LSE  Spanish Sign Language
LT  Language Technology
LTPI  Language Technology Programme for Icelandic
MAPA  Multilingual Anonymisation for Public Administrations
MARCELL  Multilingual Resources for CEF.AT in the Legal Domain
META  Multilingual Europe Technology Alliance
META-NET  A Network of Excellence forging META
MFF  Multiannual Financial Framework
ML  Machine Learning
MLLM  Multilingual Language Model
MMT  Multimodal Machine Translation
MNMT  Multilingual Neural Machine Translation
MT  Machine Translation
MWE  Multiword Expression
NE  Named Entity
NED  Named Entity Disambiguation
NEM  New European Media Initiative
NER  Named Entity Recognition
NFDI  Nationale Forschungsdateninfrastruktur (German National Research Data Infrastructure)
NLG  Natural Language Generation
NLP  Natural Language Processing
NLTP  National Language Technology Platform
NLU  Natural Language Understanding
NMT  Neural Machine Translation
NN  Neural Network
NTEU  Neural Translation for the European Union
OCR  Optical Character Recognition
OECD  Organisation for Economic Co-operation and Development
OIE  Open Information Extraction
PII  Personal Identifiable Information
POS  Part-of-Speech
PRINCIPLE  Providing Resources in Irish, Norwegian, Croatian and Icelandic for Purposes of Language Engineering
QA  Question Answering
RDA  Research Data Alliance
RE  Relation Extraction
RI  Research Infrastructures
RIA  Research and Innovation Action
RML  Regional and Minority Languages
RNN  Recurrent Neural Network
ROI  Return On Investment
SER  Speech Emotion Recognition
SID  Speaker Identification
SME  Small and Medium-size Enterprises
SR  Speaker Recognition
SRA  Strategic Research Agenda
SRIA  Strategic Research, Innovation and Implementation Agenda
SRL  Semantic Role Labelling
SSH  Social Sciences and Humanities
SSHOC  Social Sciences and Humanities Open Cloud
SSL  Swedish Sign Language
ST  Speech Technology
STOA  Science and Technology Options Assessment
TA  Text Analysis
TF  Technological Factor
TLD  Top-Level Domain
TM  Text Mining
TTS  Text-to-Speech Synthesis
UD  Universal Dependencies
UN  United Nations
UNESCO  UN Educational, Scientific and Cultural Organization
VQA  Visual Question Answering
WER  Word Error Rate
WP  Work Package
WSD  Word Sense Disambiguation

Chapter 1
European Language Equality: Introduction

Georg Rehm and Andy Way

Abstract This chapter provides an introduction to the EU-funded project European Language Equality (ELE). It motivates the project by taking a general look at multilingualism, especially with regard to the political equality of all languages in Europe. Since 2010, several projects and initiatives have developed the notion of utilising sophisticated language technologies to unlock and enable multilingualism technologically. However, despite a landmark resolution that was adopted by the European Parliament in 2018, no significant progress has been made. Together with the whole European LT community, and making use of a concerted community consultation process, the ELE project produced strategic recommendations that specify how to bring about full digital language equality in Europe and reach the scientific goal of Deep Natural Language Understanding by 2030, not only addressing but eventually solving the problem of digital inequality of Europe’s languages.

1 Overview and Context

In Europe’s multilingual setup, all 24 official EU languages are granted equal status by the EU Charter and the Treaty on EU. Furthermore, the EU is home to over 60 regional and minority languages which have been protected and promoted under the European Charter for Regional or Minority Languages (ECRML) treaty since 1992, in addition to various sign languages and the languages of immigrants as well as trade partners.
Additionally, the Charter of Fundamental Rights of the EU under Article 21 states that, “[a]ny discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited.”

Georg Rehm, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de
Andy Way, Dublin City University, ADAPT Centre, Ireland, andy.way@adaptcentre.ie

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_1

Unfortunately, language barriers still hamper cross-lingual communication and the free flow of knowledge and thought across language communities and continue to be unbreachable in many situations. While multilingualism is one of the key cultural cornerstones of Europe and signifies part of what it means to be and to feel European, no EU policy has been proposed to address the problem of language barriers.

Artificial Intelligence (AI), Natural Language Processing (NLP), Natural Language Understanding (NLU), Language Technologies (LTs), and Speech Technologies (STs) have the potential to enable multilingualism technologically but, as the META-NET White Paper Series Europe’s Languages in the Digital Age (Rehm and Uszkoreit 2012) found in 2012, our languages suffer from an extreme imbalance in terms of technological support. English is very well supported through technologies, tools, datasets and corpora, for example, but languages such as Maltese, Estonian or Icelandic have hardly any support at all. In fact, the 2012 study assessed at least 21 European languages to be in danger of digital extinction. If, as mentioned above, all European languages are supposed to be on an equal footing in general, technologically, they clearly are not (Kornai 2013).

After the findings of the META-NET study and a set of follow-up projects, studies and recommendations (e.g., Rehm and Uszkoreit 2013; STOA 2018), the joint CULT/ITRE report Language Equality in the Digital Age (European Parliament 2018) was eventually passed with an overwhelming majority by the European Parliament on 11 September 2018. It concerns the improvement of the institutional framework for LT policies at the EU level, EU research and education policies to improve the future of LTs in Europe, and the extension of the benefits of LTs for both private companies and public bodies. The resolution also recognises that there is an imbalance in terms of technology support of Europe’s languages, that there has been a substantial amount of progress in research and technology development and that a large-scale, long-term funding programme should be established to ensure full technology support for all of Europe’s languages. The goal is to enable multilingualism technologically since “the EU and its institutions have a duty to enhance, promote and uphold linguistic diversity in Europe” (European Parliament 2018).

While the resolution was an important milestone for the idea of enabling Europe’s multilingualism technologically and bringing every language in Europe to the same level of technology support, there has been no concrete follow-up action along the lines laid out in the resolution, i.e., to set up “a large-scale, long-term coordinated funding programme for research, development and innovation in the field of language technologies, at European, national and regional levels, tailored specifically to Europe’s needs and demands”. In the meantime, however, many highly influential breakthroughs in the area of language-centric AI have been achieved, mostly by large enterprises in the US and Asia, especially approaches and technologies concerning large language models (LLMs such as BERT or ChatGPT).¹

¹ ChatGPT was released in Nov. 2022, https://chat-gpt.org. Most chapters of this book were written by mid-2022, which is why they do not reflect the widespread impact and subsequent recognition of this novel application.

Due to a lack of action over the last five to seven years, Europe has mostly been playing “second fiddle” in the area of language-centric AI and Language Technologies. Driven by the “European Strategy for data”, the EU is currently concentrating on setting up a number of sectorial data spaces to enable and support the data economy and to boost its digital sovereignty.² These, fortunately, also include a dedicated language data space with a focus on stakeholders from industry. But, simply put, language is much more than data. In addition to the complex and long-term activity of constructing the aforementioned data spaces, the EU also invests in AI-related actions that include language, albeit with limited budgets. However, much more needs to be done to properly address the challenge of Europe’s multilingualism with meaningful and long-lasting solutions.

With a consortium of 52 partners, the EU project European Language Equality (ELE; Jan.
2021 – June 2022) and its follow-up project ELE 2 (July 2022 – June 2023) developed, through a large-scale, community-driven process, a Strategic Research, Innovation and Implementation Agenda for Digital Language Equality in Europe by 2030 to address this major issue by means of a coordinated, pan-European research, development and innovation programme.³ This book is the definitive documentation of the EU project ELE. It describes the current situation of technology support for Europe’s languages and our overall recommendations of what more needs to be done to achieve Digital Language Equality (DLE) in Europe by 2030.

2 The European Language Equality Project

The original proposal for the EU project “European Language Equality” was prepared by a consortium of 52 partners⁴ (see Figure 1) and submitted on 29 July 2020, responding to the European Commission call topic PPPA-LANGEQ-2020 (“Developing a strategic research, innovation and implementation agenda and a roadmap for achieving full digital language equality in Europe by 2030”).⁵ The ELE project started in January 2021 and finished in June 2022. Immediately after the end of the first ELE project, the one-year ELE 2 project began with a reduced consortium of seven partners, continuing some of the work strands of the first project.

Developing a strategic agenda and roadmap for achieving full DLE in Europe by 2030 involves many stakeholders, which is why the process of preparing the different parts of the strategic agenda and roadmap – the key objective and result of the project – was carried out together with all 52 partners of the consortium and the wider European LT community. We concentrated on two distinct but related aspects: 1. describing the current state of play (as of 2021/2022) of LT support for the languages under investigation; and 2. strategic and technological forecasting, i.e., estimating and envisioning the future situation ca. 2030. Furthermore, we distinguished between two main stakeholder groups: 1.
LT developers (industry and research) and 2. LT users and consumers. Both groups were represented in ELE with several networks, initiatives and associations who produced one report each, highlighting their own individual needs, wishes and demands towards DLE. The project’s industry partners produced four in-depth reports compiling the needs, wishes and visions of the European LT industry. We also organised a larger number of surveys (inspired by Rehm and Hegele 2018) and consultations with stakeholders not directly represented in the consortium.

2 https://digital-strategy.ec.europa.eu/en/policies/strategy-data
3 https://european-language-equality.eu
4 https://european-language-equality.eu/consortium/
5 https://ec.europa.eu/research/participants/data/ref/other_eu_prog/other/pppa/wp-call/call-fiche_pppa-langeq-2020_en.pdf

Fig. 1 Members of the ELE consortium at META-FORUM 2022 in Brussels (9 June 2022)

With the development of the strategic agenda, the project followed two complementary goals. 1. The socio-political goal was the preparation of a strategic agenda explaining how Europe can bring about full digital language equality by 2030. This objective and the need for a corresponding large-scale, long-term programme have been recognised already by the EU (European Parliament 2018). 2. Additionally, the strategic agenda and the eventual large-scale, long-term funding programme are also meant to pursue a scientific goal, i.e., reaching Deep Natural Language Understanding by 2030. As briefly mentioned, Europe is currently lagging behind the breakthroughs achieved on other continents, which is why the dedicated large-scale, long-term funding programme we envision can and must achieve both objectives: develop resources and technologies to fully unlock and benefit from multilingualism technologically and also put Europe back into the pole position in the area of LT, NLP and language-centric AI research.

Operationally, the project was structured into five work packages (see Figure 2).
In WP1, “European Language Equality: Status Quo in 2020/2021”, a definition of the concept of DLE was prepared and the current state-of-the-art in the research area of LT and language-centric AI was documented in a report. The heart of WP1 was the preparation of more than 30 language reports, each documenting one European language and the level of technology support it had as of 2022. While WP1 examined the status quo, WP2, “European Language Equality: The Future Situation in 2030”, looked into the future. Through a complex community consultation process, we collected and analysed the demands, needs, ideas and wishes of European LT developers (industry and research), European LT users and consumers as well as European citizens. Four technical deep dives took a detailed look at the four main areas of LT (Machine Translation, Speech, Text Analytics and Data). The results of WP1 and WP2 were fed to WP3, “Development of the Strategic Agenda and Roadmap”, in which the overall strategic agenda was developed based on the collected findings of WP1 and WP2, including an additional feedback loop with the wider community. WP4, “Communication – Dissemination – Exploitation – Sustainability”, organised a number of events, including META-FORUM 2022⁶ in Brussels (see Figure 1) and a workshop in the European Parliament.⁷ WP4 also set up and managed our social media channels and a newsletter under the umbrella brand “European Language Technology”.⁸ WP5 took care of managing the large consortium of 52 partners. Figure 3 shows the overall timeline of the project.

Our methodology was, thus, based on a number of stakeholder-specific surveys as well as collaborative document preparation that also involved technology forecasting. Both approaches were complemented through the collection of additional input and feedback through various online channels.
The two main stakeholder groups (LT developers and LT users/consumers) differ in one substantial way: while the group of commercial or academic LT developers is, in a certain way, closed and well represented through relevant organisations, networks and initiatives in the ELE consortium, the group of LT users is an open set of stakeholders that is only partially represented in our consortium. Both stakeholder groups have been addressed with targeted and stakeholder-specific surveys.

The ELE project resulted in around 70 deliverables, of which the public ones are available online.⁹ In addition, a number of reports were prepared pro bono by collaborators who supported the goals of the project, including language reports on Bosnian, Serbian, West Frisian, the Nordic minority languages and Europe’s sign languages. All reports are available on the ELE website.

6 https://www.european-language-grid.eu/events/meta-forum-2022
7 https://www.europarl.europa.eu/stoa/en/events/details/towards-full-digital-language-equality-i/20220711WKS04301
8 The social media channels and the newsletter were organised in close collaboration with ELE’s sister project European Language Grid (ELG, Rehm 2023).
9 https://www.european-language-equality.eu/deliverables

Fig. 2 Work packages and tasks of the ELE project

3 Beyond the ELE Project

While forecasting the future of the field of LT and language-centric AI is surely an enormous challenge, we can confidently predict that even greater advances will be achieved in all LT research areas and domains in the near future (Rehm et al. 2022). However, despite claims of human parity in many LT tasks, Deep Natural Language Understanding, the main scientific goal of the ELE Programme, is still an open research problem far from being solved since all current approaches have severe limitations (Bender et al. 2021).
Interestingly, the application of zero-shot to few-shot transfer learning with multilingual pre-trained language models and self-supervised systems opens up the way to leverage LT for less-developed languages. For the first time, a single multilingual model recently outperformed the best specially trained bilingual models on news translations, i.e., one multilingual model provided the best translations for both low- and high-resource languages, indicating that the multilingual approach appears to be the future of MT (Tran et al. 2021). However, the development of these new systems would not be possible without sufficient resources (experts, data, compute facilities, etc.), including the creation of carefully designed and constructed evaluation benchmarks and annotated datasets for every language and domain of application.

Unfortunately, as of now, there is no equality in terms of tool, resource and application availability across languages and domains. Although LT has the potential to overcome the linguistic divide in the digital sphere, most languages are neglected for various reasons, including an absence of institutional engagement from decision-makers and policy stakeholders, limited commercial interest and insufficient resources. For instance, Joshi et al. (2020) and Blasi et al. (2022) look at the relation between the types of languages, resources and their representation in NLP conferences over time. As expected, but also disappointingly, only a very small number of the over 6,000 languages of the world are represented in the rapidly evolving field of LT. A growing concern is that due to unequal access to digital resources and financial support, only a small group of large enterprises and elite universities are in a position to lead further development in this area (Ahmed and Wahed 2020).

To unleash the full potential of LT in Europe and ensure that no users of these technologies are disadvantaged in the digital sphere simply due to the language they speak, we argue that there is a pressing need to facilitate long-term progress towards multilingual, efficient, accurate, explainable, ethical, fair and unbiased language understanding and communication. In short, we must ensure DLE in all areas of society, from government to business to citizens.

Fig. 3 Overall timeline of the ELE project

4 Summary of this Book

This book is structured into two main parts. Part I examines the current state of play of technology support for Europe’s languages. Part II outlines the future situation in 2030 and beyond, as specified through the community consulting and forecasting process of the ELE project. Below we include short summaries of the two parts.

4.1 Part I: European Language Equality – Status Quo in 2022

Part I concentrates on the current situation as of 2022. First, Chapter 2 examines the state-of-the-art in LT, NLP and language-centric AI. It provides the technical foundation of all subsequent chapters. Chapter 3 defines the DLE metric, developed within the project, with its technological (Gaspari et al. 2022) and contextual factors (Grützner-Zahn and Rehm 2022). This chapter also describes the interactive DLE dashboard, which was implemented as an additional component of the European Language Grid cloud platform (ELG, Rehm 2023). Assuming that the ELG catalogue of resources, tools and services contains, at any given point in time, a representative picture of the technology support of Europe’s languages, the dashboard can be used to visualise the overall situation in different ways, including comparisons of multiple languages along various dimensions. Chapter 4 summarises the findings and provides an answer to the question of how Europe’s languages compare technologically ca. 2022. The chapter describes the methodology of basing the computation of the DLE scores on the contents of the ELG repository, which has been substantially expanded by the ELE project with more than 6,000 additional resources, and highlights the current situation using a number of graphs.
Chapters 5 to 37 contain extended high-level summaries of the 33 language reports produced by the ELE project. These reports can be conceptualised as updates, ten years on, of the META-NET White Papers (Rehm and Uszkoreit 2012), especially as many of them were written by the original authors.

4.2 Part II: European Language Equality – The Future Situation in 2030 and beyond

Part II outlines the future situation in 2030 and beyond, making use of the collected and synthesised results of the community consultation process. First, Chapter 38 describes the community consultation process on a general level, primarily with regard to the different surveys used in the project vis-à-vis European LT developers, European LT users and consumers as well as European citizens. The chapter also summarises the approach regarding the four technology deep dives as well as the dissemination and feedback collection activities in the project. Chapter 39 summarises the results of the three main surveys. The following four chapters highlight the main findings of the four technology deep dives on the four main areas of LT research and development: Machine Translation (Chapter 40), Speech Technologies (Chapter 41), Text Analytics (Chapter 42) as well as Data and Knowledge (Chapter 43). The penultimate Chapter 44 presents the strategic plans and projects in LT and AI from an international, European and national perspective. It contextualises the strategic recommendations of the project. Finally, Chapter 45 provides an extended summary of the stand-alone document of the Strategic Research, Innovation and Implementation Agenda and Roadmap the ELE project has developed.¹⁰ On the whole, the present book can be conceptualised as the collective findings and recommendations of the ELE project, and as such it reflects years of work based on the distilled input and collaboration of hundreds of experts and stakeholders from across the European LT and language-centric AI community.

10 https://european-language-equality.eu/agenda/

References

Ahmed, Nur and Muntasir Wahed (2020).
“The De-democratization of AI: Deep Learning and the Compute Divide in Artificial Intelligence Research”. In: CoRR abs/2010.15581. https://arxiv.org/abs/2010.15581.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Virtual Event Canada, pp. 610–623.

Blasi, Damian, Antonios Anastasopoulos, and Graham Neubig (2022). “Systematic Inequalities in Language Technology Performance across the World’s Languages”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, pp. 5486–5505. DOI: 10.18653/v1/2022.acl-long.376. https://aclanthology.org/2022.acl-long.376.

European Parliament (2018). Language Equality in the Digital Age. European Parliament resolution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI)). http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.

Gaspari, Federico, Owen Gallagher, Georg Rehm, Maria Giagkou, Stelios Piperidis, Jane Dunne, and Andy Way (2022). “Introducing the Digital Language Equality Metric: Technological Factors”. In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begona Altuna, Aritz Farwell, and German Rigau. Marseille, France, pp. 1–12. http://www.lrec-conf.org/proceedings/lrec2022/workshops/TDLE/pdf/2022.tdle-1.1.pdf.

Grützner-Zahn, Annika and Georg Rehm (2022). “Introducing the Digital Language Equality Metric: Contextual Factors”. In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begona Altuna, Aritz Farwell, and German Rigau. Marseille, France, pp. 13–26. http://www.lrec-conf.org/proceedings/lrec2022/workshops/TDLE/pdf/2022.tdle-1.2.pdf.
Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury (2020). “The State and Fate of Linguistic Diversity and Inclusion in the NLP World”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Online: Association for Computational Linguistics, pp. 6282–6293. DOI: 10.18653/v1/2020.acl-main.560. https://aclanthology.org/2020.acl-main.560.

Kornai, Andras (2013). “Digital Language Death”. In: PLoS ONE 8.10. DOI: 10.1371/journal.pone.0077056. https://doi.org/10.1371/journal.pone.0077056.

Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.

Rehm, Georg, Federico Gaspari, German Rigau, Maria Giagkou, Stelios Piperidis, Annika Grützner-Zahn, Natalia Resende, Jan Hajic, and Andy Way (2022). “The European Language Equality Project: Enabling digital language equality for all European languages by 2030”. In: The Role of National Language Institutions in the Digital Age – Contributions to the EFNIL Conference 2021 in Cavtat. Ed. by Željko Jozić and Sabine Kirchmeier. Budapest, Hungary: Nyelvtudományi Kutatóközpont, Hungarian Research Centre for Linguistics, pp. 17–47.

Rehm, Georg and Stefanie Hegele (2018). “Language Technology for Multilingual Europe: An Analysis of a Large-Scale Survey regarding Challenges, Demands, Gaps and Needs”. In: Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018). Ed. by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA, pp. 3282–3289. https://aclanthology.org/L18-1519.pdf.

Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Languages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.
Rehm, Georg and Hans Uszkoreit, eds. (2013). The META-NET Strategic Research Agenda for Multilingual Europe 2020. Heidelberg etc.: Springer. http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf.

STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2. https://data.europa.eu/doi/10.2861/136527.

Tran, Chau, Shruti Bhosale, James Cross, Philipp Koehn, Sergey Edunov, and Angela Fan (2021). “Facebook AI’s WMT21 News Translation Task Submission”. In: Proceedings of the Sixth Conference on Machine Translation (WMT 2021). Online: Association for Computational Linguistics, pp. 205–215. https://aclanthology.org/2021.wmt-1.19.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Part I European Language Equality: Status Quo in 2022

Chapter 2 State-of-the-Art in Language Technology and Language-centric Artificial Intelligence

Rodrigo Agerri, Eneko Agirre, Itziar Aldabe, Nora Aranberri, Jose Maria Arriola, Aitziber Atutxa, Gorka Azkune, Jon Ander Campos, Arantza Casillas, Ainara Estarrona, Aritz Farwell, Iakes Goenaga, Josu Goikoetxea, Koldo Gojenola, Inma Hernáez, Mikel Iruskieta, Gorka Labaka, Oier Lopez de Lacalle, Eva Navas, Maite Oronoz, Arantxa Otegi, Alicia Pérez, Olatz Perez de Vinaspre, German Rigau, Ander Salaberria, Jon Sanchez, Ibon Saratxaga, and Aitor Soroa

Abstract This chapter surveys the field of Language Technology (LT) and language-centric AI by assembling a comprehensive state-of-the-art of basic and applied research in the area. It sketches all recent advances in AI, including the most recent deep learning neural technologies. The chapter brings to light not only where language-centric AI as a whole stands, but also where the required resources should be allocated to place European LT at the forefront of the AI revolution. We identify key research areas and gaps that need to be addressed to ensure LT can overcome the current inequalities.¹

1 Introduction

Interest in the computational processing of human languages led to the establishment of specialised fields known as Computational Linguistics (CL), Natural Language Processing (NLP) and Language Technology (LT).
CL is more informed by linguistics and NLP by computer science, while LT is a more neutral term. In practice, these communities work closely together, sharing the same publishing venues and conferences, combining methods and approaches inspired by both, and together making up language-centric AI. In this chapter we treat them interchangeably.

Rodrigo Agerri · Eneko Agirre · Itziar Aldabe · Nora Aranberri · Jose Maria Arriola · Aitziber Atutxa · Gorka Azkune · Jon Ander Campos · Arantza Casillas · Ainara Estarrona · Aritz Farwell · Iakes Goenaga · Josu Goikoetxea · Koldo Gojenola · Inma Hernáez · Mikel Iruskieta · Gorka Labaka · Oier Lopez de Lacalle · Eva Navas · Maite Oronoz · Arantxa Otegi · Alicia Pérez · Olatz Perez de Vinaspre · German Rigau · Ander Salaberria · Jon Sanchez · Ibon Saratxaga · Aitor Soroa
University of the Basque Country, Spain, rodrigo.agerri@ehu.eus, e.agirre@ehu.eus, itziar.aldabe@ehu.eus, nora.aranberri@ehu.eus, josemaria.arriola@ehu.eus, aitziber.atutxa@ehu.eus, gorka.azkune@ehu.eus, jonander.campos@ehu.eus, arantza.casillas@ehu.eus, ainara.estarrona@ehu.eus, aritz.farwell@ehu.eus, iakes.goenaga@ehu.eus, josu.goikoetxea@ehu.eus, koldo.gojenola@ehu.eus, inma.hernaez@ehu.eus, mikel.iruskieta@ehu.eus, gorka.labaka@ehu.eus, oier.lopezdelacalle@ehu.eus, eva.navas@ehu.eus, maite.oronoz@ehu.eus, arantza.otegi@ehu.eus, alicia.perez@ehu.eus, olatz.perezdevinaspre@ehu.eus, german.rigau@ehu.eus, ander.salaberria@ehu.eus, jon.sanchez@ehu.eus, ibon.saratxaga@ehu.eus, a.soroa@ehu.eus

1 This chapter is an abridged version of Agerri et al. (2021).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_2

Over the years, LT has developed different methods to make the information contained in written and spoken language explicit or to generate or synthesise written or spoken language. Despite the inherent difficulties in many of the tasks performed, current LT support allows many advanced applications which were unthinkable only a few years ago.
LT is present in our daily lives, for example, through search engines, recommendation systems, virtual assistants, chatbots, text editors, text predictors, automatic translation systems, automatic subtitling, automatic summarisation and inclusive technology. Its recent accelerated development promises even more encouraging and exciting results in the near future.

This state-of-the-art in LT and language-centric AI begins with a brief historical account in Section 2 on the development of the field from its inception through the current deep learning era. The following three sections cover neural language models (Section 3), research areas (Section 4) and LT beyond language (Section 5). They offer a survey that maps today’s LT and language-centric AI landscape. Finally, a discussion and various conclusions are outlined in Section 6.

2 Language Technology: Historical Overview

2.1 A Brief History

The 1950s mark the beginning of Language Technology as a discipline. In the middle of the 20th century, Alan Turing proposed his famous test, which defines a criterion to determine whether a machine can be considered intelligent (Turing 1950). A few years later, Noam Chomsky laid the foundations to formalise, specify and automate linguistic rules with his generative grammar (Chomsky 1957). For a long period of time, the horizon defined by Turing and the instrument provided by Chomsky influenced the majority of NLP research.

The early years of LT were closely linked to Machine Translation (MT), a well-defined task that was also relevant from a political and strategic point of view. In the 1950s it was believed that a high-quality automatic translator would be available soon. By the mid-1960s, however, the Automatic Language Processing Advisory Committee (ALPAC) report revealed the true difficulty of the task and NLP in general. The following two decades were heavily influenced by Chomsky’s ideas, with increasingly complex systems of handwritten rules. At the end of the 1980s, a revolution began which irreversibly changed the field of NLP.
This change was driven mainly by four factors: 1. the clear definition of individual NLP tasks and corresponding rigorous evaluation methods; 2. the availability of relatively large amounts of data; 3. machines that could process these large amounts of data; and 4. the gradual introduction of more robust approaches based on statistical methods and machine learning (ML), which would pave the way for subsequent major developments.

Since the 1990s, NLP has moved forward with new resources, tools and applications. An effort was made to create wide-coverage linguistic resources, such as annotated corpora, thesauri, etc., of which WordNet (Miller 1992) is one of the main results. Data-driven systems displaced rule-based systems, leading to the almost ubiquitous presence of ML components in NLP systems. In the 2010s we observed a radical technological shift in NLP. Collobert et al. (2011) presented a multilayer neural network (NN) adjusted by backpropagation that solved various sequential labelling problems. Word embeddings gained particular relevance due to their role in the incorporation of pre-trained external knowledge into neural architectures (Mikolov et al. 2013). Large volumes of unannotated texts, together with progress in self-supervised ML and the rise of high-performance hardware (Graphics Processing Units, GPU), enabled highly effective deep learning systems to be developed across a range of application areas. These and other breakthroughs helped launch today’s Deep Learning Era.

2.2 The Deep Learning Era

Today, LT is moving away from a methodology in which solutions are implemented as a pipeline of multiple modules and towards architectures based on complex neural networks trained on vast amounts of data. Four research trends are converging: 1. mature deep neural network technology, 2. large amounts of multilingual data, 3. increased High Performance Computing (HPC) power, and 4. the application of simple but effective self-learning approaches (Devlin et al. 2019; Yinhan Liu et al. 2020).
These advancements have produced a new state-of-the-art through systems that are claimed to obtain human-level performance in laboratory benchmarks on difficult language understanding tasks. As a result, various large IT enterprises have started deploying large language models (LLMs) in production.

Despite their notable capabilities, however, LLMs have certain drawbacks that will require interdisciplinary collaboration and research to resolve. First, we have no clear understanding of how they work, when they fail, or what emergent properties they present. Indeed, some authors call these models “foundation models” to underscore their critically central yet incomplete character (Bommasani et al. 2021). Second, the systems are very sensitive to phrasing and typos, are not robust enough, and perform inconsistently (Ribeiro et al. 2019). Third, these models are expensive to train, which means that only a limited number of organisations can currently afford their development (Ahmed and Wahed 2020). Fourth, large NLP datasets used to train these models have been ‘filtered’ in ways that remove content associated with targeted minorities (Dodge et al. 2021). In addition, LLMs can sometimes produce unpredictable and factually inaccurate text or even recreate private information. Finally, computing large pre-trained models comes with a substantial carbon footprint (Strubell et al. 2019).

The implications of LLMs may extend to questions of language-centred AI sovereignty. Given the impact of LT in everyone’s daily lives, many LT practitioners are particularly concerned by the need for digital language equality (DLE) across all aspects of our societies. As expected, only a small number of the world’s more than 6,000 languages are represented in the rapidly evolving LT field. This disproportionate representation is further exacerbated by systematic inequalities in LT across the world’s languages (Joshi et al. 2020).
Interestingly, the application of zero-shot to few-shot transfer learning with multilingual pre-trained language models, prompt learning and self-supervised systems opens a path to leverage LT for less-developed languages. However, the development of these new LT systems will require resources along with carefully designed evaluation benchmarks and annotated datasets for every language and domain of application.

Forecasting the future of LT and language-centric AI is a challenge. It is, nevertheless, safe to assume that many more advances will be achieved utilising pre-trained language models and that they will substantially impact society. Future users are likely to discover novel applications and wield them positively or negatively. In either case, as Bender et al. (2021) argue, it is important to understand the current limitations of LLMs, which they refer to as “stochastic parrots”. Focusing on state-of-the-art results exclusively with the help of leaderboards, without encouraging deeper understanding of the mechanisms by which they are attained, can give rise to misleading conclusions. These, in turn, may direct resources away from efforts that would facilitate long-term progress towards multilingual, efficient, accurate, explainable, ethical and unbiased language understanding and communication.

3 Neural Language Models

LT is undergoing a paradigm shift with the rise of neural language models that are trained on broad data at scale and are adaptable to a wide range of monolingual and multilingual downstream tasks (Devlin et al. 2019; Yinhan Liu et al. 2020). These models are based on standard self-supervised deep learning and transfer learning, but their scale results in emergent and surprising capabilities. One of the advantages is their ability to alleviate the feature engineering problem by using low-dimensional and dense vectors (distributed representation) to implicitly represent the language examples (Collobert et al. 2011).
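To illustrate the contrast between sparse symbolic features and dense distributed representations, the following minimal Python sketch compares one-hot vectors with a handful of dense vectors using cosine similarity. The vocabulary, the three-dimensional vectors and the function names are assumptions invented for this illustration; they are not taken from Collobert et al. (2011) or from any trained model.

```python
import math

# Toy vocabulary; the words and all vector values are invented for illustration.
VOCAB = ["king", "queen", "apple"]


def one_hot(word):
    """Sparse representation: one dimension per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


# Hypothetical dense embeddings (not produced by any real model).
DENSE = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# One-hot vectors of distinct words are always orthogonal.
sparse_sim = cosine(one_hot("king"), one_hot("queen"))
# Dense vectors can encode graded similarity between words.
dense_related = cosine(DENSE["king"], DENSE["queen"])
dense_unrelated = cosine(DENSE["king"], DENSE["apple"])

print(sparse_sim, dense_related, dense_unrelated)
```

With one-hot vectors every pair of distinct words has similarity zero, whereas dense vectors can place related words close together, which is one reason learned representations reduce the need for hand-crafted features.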
In self-supervised learning, the language model is derived automatically from large volumes of unannotated language data (text or voice). There has been considerable progress in self-supervised learning since word embeddings associated words with context-independent vectors.

With transfer learning, the learning process starts from patterns that have been learned when solving a different problem, i.e., leveraging previous learning to avoid starting from scratch. Within deep learning, pre-training is the dominant approach to transfer learning: the objective is to pre-train a deep Transformer model on large amounts of data and then reuse this pre-trained language model by fine-tuning it on small amounts of (usually annotated) task-specific data. Recent work has shown that pre-trained language models can robustly perform tasks in a few-shot or even zero-shot fashion when given an adequate task description in their natural language prompt (Brown et al. 2020). Unlike traditional supervised learning, which trains a model to take in an input and predict an output, prompt-based learning or in-context learning is based on exploiting pre-trained language models to solve a task using text directly. This framework is very promising since some NLP tasks can be solved in a fully unsupervised fashion by providing a pre-trained language model with task descriptions in natural language (Raffel et al. 2020). Surprisingly, fine-tuning pre-trained language models on a collection of tasks described via instructions (or prompts) substantially boosts zero-shot performance on unseen tasks (Wei et al. 2021).

Multilingual Large Language Models (MLLMs) such as mBERT (Devlin et al. 2019), XLM-R (Conneau et al. 2020), mBART (Yinhan Liu et al. 2020), mT5 (Xue et al. 2021), etc. have emerged as viable options for bringing the power of pre-training to a large number of languages. For example, mBERT is pre-trained on Wikipedia corpora in 104 languages.
mBERT can generalise cross-lingual knowledge in zero-shot scenarios. This indicates that even with the same structure as BERT, using multilingual data can enable the model to learn cross-lingual representations. The surprisingly good performance of MLLMs in cross-lingual transfer as well as bilingual tasks suggests that these language models are learning universal patterns (Doddapaneni et al. 2021). Thus, one of the main motivations of training MLLMs is to enable transfer from high-resource languages to low-resource languages.

New types of processing pipelines and toolkits have arisen in recent years due to the fast-growing collection of efficient tools. Libraries that are built with NN components are increasingly common, including pre-trained models that perform multilingual NLP tasks. Neural language models are adaptable to a wide spectrum of monolingual and multilingual tasks. These models are currently often considered black boxes, in that their inner mechanisms are not clearly understood. Nonetheless, Transformer architectures may present an opportunity to offer advances to the broader LT community if certain obstacles can be successfully overcome. One is the question of the resources needed to design the best-performing neural language models, currently done almost exclusively at large IT companies. Another is the problem of stereotypes, prejudices and personal information within the corpora used to train the models.

The predominance of English as the default language in NLP can be successfully addressed if there is sufficient will and coordination. The continued consolidation of large infrastructures will help determine how this is accomplished in the near future. Their successful implementation would mark a crucial first step towards the development, proliferation and management of language resources for all European languages. This capability would, in turn, enable Europe’s languages to enjoy full and equal access to digital language technology.
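The prompt-based (in-context) learning described in this section can be sketched without any model at all, since the essential step is to express the task as text. The minimal Python sketch below assembles zero-shot and few-shot prompts. The function name `build_prompt`, the Input/Output template and the example sentences are hypothetical choices made for this illustration, and the call to an actual pre-trained language model is deliberately left out.

```python
def build_prompt(task_description, examples, query):
    """Assemble a plain-text prompt: task description, optional demonstrations, query.

    The 'Input:'/'Output:' template is an assumption for illustration only;
    real systems use many different prompt formats.
    """
    lines = [task_description.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the language model would continue from here
    return "\n".join(lines)


TASK = "Classify the sentiment of the input as positive or negative."
QUERY = "The translation was fluent and accurate."

# Zero-shot: the model only sees the task description and the query.
zero_shot = build_prompt(TASK, [], QUERY)

# Few-shot: a small number of labelled demonstrations are placed in the prompt.
few_shot = build_prompt(
    TASK,
    [
        ("I loved this book.", "positive"),
        ("The audio quality was terrible.", "negative"),
    ],
    QUERY,
)

print(few_shot)
```

In both cases the model parameters stay unchanged; only the text given to the model differs, which is what distinguishes this setting from fine-tuning on task-specific data.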
4 Research Areas

Section 4 introduces some of the more prominent research areas in the field: Language Resources (Section 4.1), Text Analysis (Section 4.2), Speech Processing (Section 4.3), Machine Translation (Section 4.4), Information Extraction and Retrieval (Section 4.5), NLG and Summarisation (Section 4.6) as well as HCI (Section 4.7).

4.1 Language Resources

The term Language Resource (LR) refers to a set of speech or written data and descriptions in machine readable form. These are utilised for building, improving or evaluating text- and speech-based algorithms or systems. They also serve as resources for the software localisation and language services industries, language studies, digital publishing, international transactions, subject-area specialists and end users. Although no widely standardised typology of LRs exists, they are usually classified as: 1. Data (i.e., corpora and lexical/conceptual resources); 2. Tools/Services (i.e., linguistic annotations; tools for creating annotations; search and retrieval applications; applications for automatic annotation) and 3. Metadata and vocabularies (i.e., vocabularies or repositories of linguistic terminology; language metadata). In this section we will focus on the first two categories.

A main objective of the LR community is the development of infrastructures and platforms for presenting and disseminating LRs. There are numerous repositories in which resources for each language are documented. Among the major European catalogues are European Language Grid (ELG, Rehm 2023),² ELRC-SHARE,³ European Language Resources Association (ELRA),⁴ Common Language Resources and Technology Infrastructure (CLARIN)⁵ and META-SHARE.⁶ The Linguistic Data Consortium,⁷ which operates outside of Europe, should also be highlighted. In addition, there are several relevant multilingual public domain initiatives.
Among these are the Common Voice Project,8 designed to encourage the development of ASR systems; the M-AILABS Speech Dataset,9 for text-to-speech synthesis; the Ryerson Audio-Visual Database of Emotional Speech and Song,10 for research on emotional multimedia content; and LibriVox,11 an audiobook repository that can be used in different research fields and applications.

2 https://www.european-language-grid.eu
3 http://www.elrc-share.eu
4 http://catalogue.elra.info
5 https://www.clarin.eu/content/language-resources
6 http://www.meta-share.org
7 https://catalog.ldc.upenn.edu
8 https://commonvoice.mozilla.org
9 https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
10 https://zenodo.org/record/1188976

A cursory glance at these repositories not only gives us an idea of the amount of resources available for Europe’s languages, but also reveals the clear inequality between official and minority languages. Moreover, although the four European languages with the most resources are English, French, German and Spanish, English is far ahead of the rest, with more than twice as many resources as the next language (see Figure 1, p. 50). At the same time, the languages without official status trail significantly behind in terms of LR development, demonstrating the critical impact that official status has on the extent of available resources.

4.2 Text Analysis

Text Analysis (TA) aims to extract relevant information from large amounts of unstructured text in order to enable data-driven approaches to manage textual content. In other words, its purpose is to create structured data out of unstructured text content by identifying entities, facts and relationships that are buried in the textual data. TA employs a variety of methodologies to process text.
It is crucial for establishing “who did what, where and when,” a technology that has proven to be key for applications such as Information Extraction, Question Answering, Summarisation and nearly every linguistic processing task involving semantic interpretation, including Opinion Mining and Aspect-based Sentiment Analysis (ABSA).

The best results for TA tasks are generally obtained by means of supervised, corpus-based approaches. In most cases, manually annotating text for every single specific need is extremely time-consuming and not affordable in terms of human resources and economic costs. To make the problem more manageable, TA is addressed in several tasks that are typically performed in order to preprocess the text to extract relevant information. The most common tasks currently available in state-of-the-art NLP tools and pipelines include Part-of-Speech (POS) tagging, Lemmatisation, Word Sense Disambiguation (WSD), Named Entity Recognition (NER), Named Entity Disambiguation (NED) or Entity Linking (EL), Parsing, Coreference Resolution, Semantic Role Labelling (SRL), Temporal Processing, ABSA and, more recently, Open Information Extraction (OIE).

Today, all these tasks are addressed in an end-to-end manner, i.e., even for a traditionally complex task such as Coreference Resolution (Pradhan et al. 2012), current state-of-the-art systems are based on an approach in which no extra linguistic annotations are required. These systems typically employ LLMs. Similarly, most state-of-the-art TA toolkits, including AllenNLP and Trankit, among others (Gardner et al. 2018; M. V. Nguyen et al. 2021), use a highly multilingual end-to-end approach. Avoiding intermediate tasks has helped to mitigate the common cascading errors problem that was pervasive in more traditional TA pipelines. As a consequence, the appearance of end-to-end systems has helped bring about a significant jump in performance across every TA task.

11 https://librivox.org
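The cascading errors problem mentioned in Section 4.2 can be made concrete with a small back-of-the-envelope calculation. The sketch below is only an illustration under strong simplifying assumptions: each pipeline stage is taken to be correct only if all earlier stages were correct, stage errors are treated as independent, and the per-stage accuracies are hypothetical values chosen for the example, not measurements from any system discussed in this chapter.

```python
def pipeline_accuracy(stage_accuracies):
    """Probability that every stage of a cascaded pipeline is correct,
    assuming independent errors (a simplification for illustration)."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Hypothetical per-stage accuracies for a traditional TA pipeline.
stages = {"pos_tagging": 0.97, "parsing": 0.92, "ner": 0.90, "coreference": 0.80}

overall = pipeline_accuracy(stages.values())
print(f"Accuracy of the weakest stage: {min(stages.values()):.2f}")
print(f"Accuracy of the whole cascade: {overall:.3f}")
```

Under these assumptions the cascade as a whole is noticeably less reliable than its weakest stage, which is one way to read the motivation for avoiding intermediate tasks.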
4.3 Speech Processing

Speech processing aims at allowing humans to communicate with digital devices through voice. This entails developing machines that understand and generate not only oral messages, but also all the additional information that we can extract from the voice, like who is speaking, their age, their personality, their mood, etc. Some of the main areas in speech technology are text-to-speech synthesis (TTS), automatic speech recognition (ASR) and speaker recognition (SR).

TTS attempts to produce the oral signal that corresponds to an input text with an intelligibility, naturalness and quality similar to a natural speech signal. Statistical parametric speech synthesis techniques generate speech by means of statistical models trained to learn the relation between linguistic labels derived from text and acoustic parameters extracted from speech by means of a vocoder. HMM (Hidden Markov Models) and more recently DNN (Deep Neural Networks) have been used as statistical frameworks. Various architectures have been tested, such as feed-forward networks (Qian et al. 2014), recurrent networks (Y. Fan et al. 2014) and WaveNet (Oord et al. 2016). Among the criteria used for training, the most common is minimum generation error (Z. Wu and King 2016), although recently new methods based on Generative Adversarial Networks (GAN, Saito et al. 2017) have been proposed with excellent results in terms of naturalness of the produced voice.

ASR, producing a transcription from a speech signal, has been long sought after. The intrinsic difficulty of the task has required a step-by-step effort, with increasingly ambitious objectives. Only in the last two decades has this technology jumped from the laboratory to production. The first commercial systems were based on statistical models, i.e., HMMs (Juang and Rabiner 2005; Gales and Young 2008).
While this technology was the standard during the first decade of the century, in the 2010s, the increase in computing power and the ever-growing availability of training data allowed for the introduction of DNN techniques for ASR.

More recently, end-to-end or fully differentiable architectures have appeared that aim to simplify a training process that is capable of exploiting the available data. In these systems, a DNN maps the acoustic signal in the input directly to the textual output. Thus, the neural network models the acoustic information, the time evolution and some linguistic information, learning everything jointly. New architectures, in the form of Transformers (Gulati et al. 2020; Xie Chen et al. 2021) and teacher-student schemes (Z. Zhang et al. 2020; Jing Liu et al. 2021), have been applied to ASR with great success. Recently, Whisper, a Transformer sequence-to-sequence model trained on very large amounts of data that can perform several tasks such as multilingual ASR, translation and language identification, has been developed by OpenAI (Radford et al. 2022), showing the potential of weakly supervised systems.

A similar evolution has taken place in the area of SR. Part of the widespread emergence of biometric identification techniques, exemplified by the now commonplace ability to unlock a smartphone with a fingerprint or an iris, speaker recognition involves the automatic identification of people based on their voice. Nowadays, the classical systems have been outperformed by end-to-end neural network based systems, which are being improved using widespread databases (Nagrani et al. 2017) and enforcing research (Nagrani et al. 2020), obtaining better recognition rates by means of new network architectures and techniques (Safari et al. 2020; H. Zhang et al. 2020; R. Wang et al. 2022).

4.4 Machine Translation

Machine Translation (MT) is the automatic translation from one natural language into another. Since its first implementation (Weaver 1955) it has remained a key application in LT/NLP.
While a number of approaches and architectures have been proposed and tested over the years, Neural MT (NMT) has become the most popular paradigm for MT development both within the research community (Vaswani et al. 2018; Yinhan Liu et al. 2020; Zhu et al. 2020; Sun et al. 2022) and for large-scale production systems (Y. Wu et al. 2016). This is due to the good results achieved by NMT systems, which attain state-of-the-art results for many language pairs (Akhbardeh et al. 2021; Adelani et al. 2022; Min 2023). NMT systems use distributed representations of the languages involved, which enables end-to-end training of systems. If we compare them with classical statistical MT models (Koehn et al. 2003), we see that they do not require word aligners, translation rule extractors, and other feature extractors; the embed – encode – attend – decode paradigm is the most common NMT approach (Vaswani et al. 2017; You et al. 2020; Dione et al. 2022).

Thanks to current advances in NMT it is common to find systems that can easily incorporate multiple languages simultaneously. We refer to these types of systems as Multilingual NMT (MNMT) systems. The principal goal of an MNMT system is to translate between as many languages as possible by optimising the linguistic resources available. MNMT models (Aharoni et al. 2019; B. Zhang et al. 2020; Emezue and Dossou 2022; Siddhant et al. 2022) are interesting for several reasons. On the one hand, they can address translations among all the languages involved within a single model, which significantly reduces training time and facilitates deployment of production systems. On the other hand, by reducing operational costs, multilingual models achieve better results than bilingual models for low- and zero-resource language pairs: training is performed jointly and this generates a positive transfer of knowledge from high(er)-resource languages (Aharoni et al. 2019; Arivazhagan et al. 2019). This phenomenon is known as translation knowledge transfer or transfer learning (Zoph et al. 2016; T. Q. Nguyen and Chiang 2017; Hujon et al. 2023).
For instance, A. Fan et al. (2021) have created several MNMT models by building a large-scale many-to-many dataset for 100 languages. They significantly reduce the complexity of this task, employing automatic building of parallel corpora (Artetxe and Schwenk 2019; Schwenk et al. 2021) with a novel data mining strategy that exploits language similarity in order to avoid mining all directions. The method allows for direct translation between 100 languages without using English as a pivot and it performs as well as bilingual models on many competitive benchmarks. Additionally, they take advantage of backtranslation to improve the quality of their model on zero-shot and low-resource language pairs.

4.5 Information Extraction and Information Retrieval

Deep learning has had a tremendous impact on Information Retrieval (IR) and Information Extraction (IE). The goal of IR is to meet the information needs of users by providing them with documents or text snippets that contain answers to their queries. IR is a mature technology that enabled the development of search engines. The area has been dominated by classic methods based on vector space models that use manually created sparse representations such as TF-IDF or BM25 (Robertson and Zaragoza 2009), but recent approaches that depend on dense vectors and deep learning have shown promising results (Karpukhin et al. 2020; Izacard and Grave 2021). Dense representations are often combined with Question Answering (QA) to develop systems that are able to directly answer specific questions posed by users, either by pointing at text snippets that answer the questions (Karpukhin et al. 2020; Izacard and Grave 2021) or by generating the appropriate answers themselves (P. Lewis et al. 2021).

IE aims to extract structured information from text.
Typically, IE systems recognise the main events described in a text, as well as the entities that participate in those events. Modern techniques mostly focus on two challenges: learning textual semantic representations for events in event extraction (both at sentence and document level) and acquiring or augmenting labeled instances for model training (K. Liu et al. 2020). Regarding the former, early approaches relied on manually coded lexical, syntactic and kernel-based features (Ahn 2006). With the development of deep learning, however, researchers have employed neural networks, including CNNs (Y. Chen et al. 2015), RNNs (T. H. Nguyen and Grishman 2016) and Transformers (Yang et al. 2019). Data augmentation has been typically performed by using methods such as distant supervision or employing data from other languages to improve IE on the target language, which is especially useful when the target language is under-resourced. Deep learning techniques utilised in NMT (Jian Liu et al. 2018) and pre-trained multilingual LLMs (Jian Liu et al. 2019) have also helped in this task.

Another important task within IE is Relation Extraction (RE), whose goal is to predict the semantic relationship between two entities, if any. The best results on RE are obtained by fine-tuning LLMs, which are supplied with a classification head. One of the most pressing problems in RE is the scarcity of manually annotated examples in real-world applications, particularly when there is a domain and language shift. In recent years, new methods have emerged that only require few-shot or zero-shot examples. Prompt-based learning, e.g., uses task and label verbalisations that can be designed manually or learned automatically (Schick and Schütze 2021) as an alternative to fine-tuning. In these methods, the inputs are augmented with prompts and the LM objective is used in learning and inference. This paradigm shift has allowed IE tasks to be framed as a QA problem (Sulem et al. 2022) or as a constrained text generation problem (S. Li et al. 2021) using prompts, questions or templates.
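To illustrate the sparse lexical scoring that Section 4.5 attributes to classic IR, the following sketch ranks a handful of documents against a query with one common variant of the BM25 formula. It is a minimal illustration rather than the formulation of any particular search engine: the whitespace tokeniser, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF term and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(doc) for doc in tokenised) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in tokenised:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenised:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturation with document length normalisation.
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection, invented for the example.
docs = [
    "machine translation between european languages",
    "speech recognition for low resource languages",
    "neural machine translation with transformers",
]
scores = bm25_scores("speech recognition", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because the score depends only on exact term overlap, documents that share no word with the query receive a score of zero, which is the limitation that dense representations aim to address.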
4.6 Natural Language Generation and Summarisation

Natural Language Generation (NLG) has become one of the most important and challenging tasks in NLP (Gehrmann et al. 2021). NLG automatically generates understandable texts, typically using a non-linguistic or textual representation of information as input (Reiter and Dale 1997; Gatt and Krahmer 2018; Junyi Li et al. 2021a). Applications that generate new texts from existing text include MT from one language to another (see Section 4.4), fusion and summarisation, simplification, text correction, paraphrase generation, question generation, etc. With the recent resurgence of deep learning, new ways to solve text generation tasks based on different neural architectures have arisen (Junyi Li et al. 2021b). One advantage of these neural models is that they enable end-to-end learning of semantic mappings from input to output in text generation. Existing datasets for most supervised text generation tasks are small (except MT). Therefore, researchers have proposed various methods to solve text generation tasks based on LLMs. Transformer models such as T5 (Raffel et al. 2020) and BART (M. Lewis et al. 2020) or a single Transformer decoder block such as GPT (Brown et al. 2020) are currently standard architectures for generating high quality text.

Due to the rapid growth of information generated daily online (Gambhir and Gupta 2017), there is a growing need for automatic summarisation techniques that produce short texts from one or more sources efficiently and precisely. Several extractive approaches have been developed for automatic summary generation that implement a number of machine learning and optimisation techniques (J. Xu and Durrett 2019). Abstractive methods are more complex as they require NLU capabilities. Abstractive summarisation produces an abstract with words and phrases that are based on concepts that occur in the source document (Du et al. 2021). Both approaches can now be modeled using Transformers (Yang Liu and Lapata 2019).
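The difference between extractive and abstractive summarisation can be illustrated with a deliberately simple extractive baseline. The sketch below selects the sentences whose content words are most frequent in the document and returns them unchanged in their original order; it is a classic frequency heuristic chosen for brevity, not one of the machine learning methods cited above, and the stop-word list and sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "that", "it"}


def content_words(text):
    """Lower-cased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarise(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average document frequency of the sentence's content words.
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


text = (
    "Language models need data. "
    "Language data is scarce for many languages. "
    "The weather was pleasant yesterday."
)
print(summarise(text, n_sentences=1))
```

Every sentence in the output is copied verbatim from the input, which is what makes the method extractive; an abstractive system would instead generate new wording.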
4.7 Human-Computer Interaction

The demand for technologies that enable users to interact with machines at any time utilising text and speech has grown, motivating the use of dialogue systems. Such systems allow the user to converse with computers using natural language and include Siri, Google Assistant, Amazon Alexa, and ChatGPT, among others. Dialogue systems can be divided into three groups: task-oriented systems, conversational agents (also known as chatbots) and interactive QA systems.

The distinguishing features of task-oriented dialogue systems are that they are designed to perform a concrete task in a specific domain and that their dialogue flow is defined and structured beforehand. For example, such systems are used to book a table at a restaurant, call someone or check the weather forecast. The classical implementation of this type of system follows a pipeline architecture based on three modules: the NLU module, the dialogue manager and the NLG module. While classical dialogue systems trained and evaluated these modules separately, more recent systems rely on end-to-end trainable architectures based on neural networks (Bordes et al. 2017; Hosseini-Asl et al. 2020).

Conversational agents enable engaging open-domain conversations, often by emulating the personality of a human (S. Zhang et al. 2018). The Alexa prize,12 for instance, focused on building agents that could hold a human in conversation as long as possible. These kinds of agents are typically trained on conversations mined from social media using end-to-end neural architectures (Roller et al. 2021).

Interactive QA systems try to respond to user questions by extracting answers from either documents (Rajpurkar et al. 2018) or knowledge bases (T. Yu et al. 2018). In order to be able to have meaningful interactions, interactive QA systems have a simple dialogue management procedure taking previous questions and answers into account (Choi et al. 2018). The core technology is commonly based on LLMs (Qiu et al. 2020) where some mechanism is included to add context representation (Vakulenko et al. 2021).
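The classical three-module pipeline described in Section 4.7 can be sketched in a few lines. The example below is a toy, rule-based illustration of the restaurant booking scenario: the function names, the intent and slot labels, and the regular expressions are all hypothetical choices made for this sketch, and a real system would replace each module with a trained component.

```python
import re


def nlu(utterance):
    """NLU module: map an utterance to an intent and slot values (toy rules)."""
    text = utterance.lower()
    slots = {}
    people = re.search(r"for (\d+)", text)
    if people:
        slots["people"] = int(people.group(1))
    time = re.search(r"at (\d{1,2}\s?(?:am|pm))", text)
    if time:
        slots["time"] = time.group(1)
    if "book" in text and "table" in text:
        intent = "book_table"
    elif slots:
        intent = "inform"
    else:
        intent = "unknown"
    return {"intent": intent, "slots": slots}


def dialogue_manager(state, frame):
    """Dialogue manager: update the state and choose the next system action."""
    if frame["intent"] == "unknown":
        return {"act": "fallback"}
    state.update(frame["slots"])
    # The dialogue flow is fixed beforehand: ask for each missing slot in order.
    for slot in ("people", "time"):
        if slot not in state:
            return {"act": "request", "slot": slot}
    return {"act": "confirm", **state}


def nlg(action):
    """NLG module: turn a system action into text using templates."""
    if action["act"] == "request":
        prompts = {"people": "For how many people?", "time": "At what time?"}
        return prompts[action["slot"]]
    if action["act"] == "confirm":
        return f"Booking a table for {action['people']} at {action['time']}."
    return "Sorry, I did not understand."


def respond(state, utterance):
    """Run one turn through the NLU, dialogue manager and NLG modules."""
    return nlg(dialogue_manager(state, nlu(utterance)))


state = {}
first = respond(state, "I want to book a table for 4")
second = respond(state, "at 8 pm")
print(first)
print(second)
```

Each module here can be tested and replaced in isolation, which mirrors how classical systems trained and evaluated the modules separately; end-to-end architectures give up this separation in exchange for joint training.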
12 https://developer.amazon.com/alexaprize

5 Language Technology beyond Language

Knowledge about our surrounding world is required to properly understand natural language utterances (Bender and Koller 2020). That knowledge is known as world knowledge and many authors argue that it is a key ingredient to achieve human-level NLU (Storks et al. 2019). One of the ways to acquire this knowledge is to explore the visual world together with the textual world (Elu et al. 2021). CNNs have been the standard architecture for generating representations for images (LeCun and Bengio 1995) during the last decade. Recently, self-attention-based Transformer models (Vaswani et al. 2017) have emerged as an alternative architecture, leading to exciting progress on a number of vision tasks (Khan et al. 2021). Compared to previous approaches, Transformers allow multiple modalities to be processed (e.g., images, videos, text and speech) using similar processing blocks and demonstrate excellent scalability properties. Encoder-decoder models in particular have been gaining traction recently due to their versatility in solving different generative tasks (Junnan Li et al. 2022; Xi Chen et al. 2022).

Regarding downstream tasks, caption generation is a typical visio-linguistic task, where a textual description of an image must be generated. The first approaches to solve this problem combined CNNs with RNNs in an encoder-decoder architecture (Vinyals et al. 2015). Further improvements were achieved when attention was included (K. Xu et al. 2015) and some researchers have proposed utilising object-based attention instead of spatial attention (Anderson et al. 2018). Although it is not currently clear which attention mechanism is better, the quality of the text generated by these models is high as measured by metrics such as BLEU (Papineni et al.
2002) and METEOR (Banerjee and Lavie 2005).

Visual generation, in contrast to caption generation, requires an image to be generated from a textual description. One of this task’s most significant challenges is to develop automatic metrics to evaluate the quality of the generated images and their coherence with the input text. The first effective approaches were based on Generative Adversarial Networks (Goodfellow et al. 2014) and Variational Autoencoders (Kingma and Welling 2013). Cho et al. (2020) demonstrate that multimodal Transformers can also generate impressive images from textual input. Nevertheless, novel advancements in diffusion models (Sohl-Dickstein et al. 2015; Ho et al. 2020) have defined the current state-of-the-art in image generation (Ramesh et al. 2022). These models learn to iteratively reconstruct noisy images and, recently, their size and computational cost have been reduced as diffusion can now be applied in a reduced latent space instead of an image’s pixel space (Rombach et al. 2022).

Another typical task is Visual Question Answering (VQA), where given an image and a question about the contents of that image, the right textual answer must be found. There are many VQA datasets in the literature (Antol et al. 2015; Johnson et al. 2017). Some demand leveraging external knowledge to infer an answer and, thus, they are known as knowledge-based VQA tasks (P. Wang et al. 2017a,b; Marino et al. 2019). These VQA tasks demand skills to understand the content of an image and how it is referred to in the textual question, as well as reasoning capabilities to infer the correct answer. Multimodal Transformers, such as OFA (P. Wang et al. 2022) and PaLI (Xi Chen et al. 2022), define the state-of-the-art in several of these tasks.

Visual Referring Expressions are one of the multimodal tasks that may be considered an extension of a text-only NLP task, i.e., referring expressions (Krahmer and Deemter 2012) in NLG systems. Its objective is to ground a natural language expression to objects in a visual input. There are several approaches to solve this task (Golland et al. 2010; Kazemzadeh et al. 2014).
The most recent ones use attention mechanisms to merge both modalities (L. Yu et al. 2018) or are based on multimodal Transformers (Ding et al. 2022).

A natural extension of textual entailment, Visual Entailment is an inference task for predicting whether an image semantically entails a text. Vu et al. (2018) initially proposed a visually-grounded version of the textual entailment task, where an image is augmented to include a textual premise and hypothesis. However, Xie et al. (2019) propose visual entailment, where the premise is an image and the hypothesis is textual. As an alternative to entailment, there are other grounding tasks that classify whether an image and its caption match (Suhr et al. 2018; F. Liu et al. 2022) or tasks that measure the similarity between sentences with visual cues, such as vSTS (Lopez de Lacalle et al. 2020).

Multimodal MT (MMT) seeks to translate natural language sentences that describe visual content in a source language into a target language by taking the visual content as an additional input to the source language sentences (Elliott et al. 2017; Barrault et al. 2018). Different approaches have been proposed to handle MMT, although attention models that associate textual and visual elements with multimodal attention mechanisms are the most common (Huang et al. 2016; Calixto et al. 2017).

6 Conclusions

Language tools and resources have increased and improved since the end of the last century, a process further catalysed by the advent of deep learning and LLMs over the past decade. Indeed, we find ourselves today in the midst of a significant paradigm shift in LT and language-centric AI. This revolution has brought noteworthy advances to the field along with the promise of substantial breakthroughs in the coming years. However, this transformative technology poses problems, from a research advancement, environmental, and ethical perspective. Furthermore, it has also laid bare the acute digital inequality that exists between languages.
In fact, as emphasised in this chapter, many sophisticated NLP systems are unintentionally exacerbating this imbalance due to their reliance on vast quantities of data derived mostly from English-language sources. Other languages lag far behind English in terms of digital presence and even the latter would benefit from greater support. Moreover, the striking asymmetry between official and non-official European languages with respect to available digital resources is concerning. The unfortunate truth is that DLE in Europe is failing to keep pace with the newfound and rapidly evolving changes in LT.

One need look no further than what is happening today across the diverse topography of state-of-the-art LT and language-centric AI for confirmation of the current linguistic unevenness. The paradox at the heart of LT’s recent advances is evident in almost every LT discipline. Our ability to reproduce ever better synthetic voices has improved sharply for well-resourced languages, but dependence on large volumes of high-quality recordings effectively undermines attempts to do the same for low-resource languages. Multilingual NMT systems return demonstrably improved results for low- and zero-resource language pairs, but insufficient model capacity continues to haunt transfer learning because large multilingual datasets are required, forcing researchers to rely on English as the best resourced language.

Nonetheless, we believe this time of technological transition represents an opportunity to achieve full DLE in Europe. There are ample reasons for optimism. Recent research in the field has considered the implementation of cross-lingual transfer learning and multilingual language models for low-resource languages, an example of how the state-of-the-art in LT could benefit from better digital support for low-resource languages.

Forecasting the future of LT and language-centric AI is a challenge.
Just a few years ago, nobody would have predicted the recent breakthroughs that have resulted in systems able to deal with unseen tasks or maintain natural conversations. It is, however, safe to predict that even more advances will be achieved in all LT research areas and domains in the near future. Despite claims of human parity in many LT tasks, Natural Language Understanding is still an open research problem far from being solved since all current approaches have severe limitations. Interestingly, the application of zero-shot to few-shot transfer learning with multilingual LLMs and self-supervised systems opens up the way to leverage LT for less developed languages. However, the development of these new LT systems would not be possible without sufficient resources (experts, data, HPC facilities, etc.) as well as the creation of carefully designed and constructed evaluation benchmarks and annotated datasets for every language and domain of application. Focusing on state-of-the-art results exclusively with the help of leaderboards without encouraging deeper understanding of the mechanisms by which they are achieved can generate misleading conclusions, and direct resources away from efforts that would facilitate long-term progress towards multilingual, efficient, accurate, explainable, ethical and unbiased language understanding and communication, to create transparent digital language equality in Europe in all aspects of society, from government to business to citizen.

References

Adelani, David, Md Mahfuz Ibn Alam, Antonios Anastasopoulos, Akshita Bhagia, Marta R. Costa-jussa, Jesse Dodge, Fahim Faisal, Christian Federmann, Natalia Fedorova, Francisco Guzmán, Sergey Koshelev, Jean Maillard, Vukosi Marivate, Jonathan Mbuya, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk, and Guillaume Wenzek (2022). “Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages”. In: Proceedings of the Seventh Conference on Machine Translation (WMT).
Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics, pp. 773–800. https://aclanthology.org/2022.wmt-1.72.
Agerri, Rodrigo, Eneko Agirre, Itziar Aldabe, Nora Aranberri, Jose Maria Arriola, Aitziber Atutxa, Gorka Azkune, Arantza Casillas, Ainara Estarrona, Aritz Farwell, Iakes Goenaga, Josu Goikoetxea, Koldo Gojenola, Inma Hernaez, Mikel Iruskieta, Gorka Labaka, Oier Lopez de Lacalle, Eva Navas, Maite Oronoz, Arantxa Otegi, Alicia Pérez, Olatz Perez de Vinaspre, German Rigau, Jon Sanchez, Ibon Saratxaga, and Aitor Soroa (2021). Deliverable D1.2 Report on the State of the Art in Language Technology and Language-centric AI. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/LT-state-of-the-art.pdf.
Aharoni, Roee, Melvin Johnson, and Orhan Firat (2019). “Massively Multilingual Neural Machine Translation”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 3874–3884. DOI: 10.18653/v1/N19-1388. https://aclanthology.org/N19-1388.
Ahmed, Nur and Muntasir Wahed (2020). “The De-democratization of AI: Deep Learning and the Compute Divide in Artificial Intelligence Research”. In: CoRR abs/2010.15581. https://arxiv.org/abs/2010.15581.
Ahn, David (2006). “The stages of event extraction”. In: Proceedings of the Workshop on Annotating and Reasoning about Time and Events. Sydney, Australia: Association for Computational Linguistics, pp. 1–8. https://aclanthology.org/W06-0901.
Akhbardeh, Farhad, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R.
Costa-jussa, Cristina Espana-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri (2021). “Findings of the 2021 Conference on Machine Translation (WMT21)”. In: Proceedings of the Sixth Conference on Machine Translation. Online: Association for Computational Linguistics, pp. 1–88. https://aclanthology.org/2021.wmt-1.1.
Anderson, Peter, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang (2018). “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering”. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, pp. 6077–6086. DOI: 10.1109/CVPR.2018.00636. http://openaccess.thecvf.com/content_cvpr_2018/html/Anderson_Bottom-Up_and_Top-Down_CVPR_2018_paper.html.
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh (2015). “VQA: Visual Question Answering”. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, pp. 2425–2433. DOI: 10.1109/ICCV.2015.279. https://doi.org/10.1109/ICCV.2015.279.
Arivazhagan, Naveen, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey (2019). “The missing ingredient in zero-shot neural machine translation”. In: arXiv preprint arXiv:1903.07091. https://arxiv.org/abs/1903.07091.
Artetxe, Mikel and Holger Schwenk (2019). “Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond”.
In: Transactions of the Association for Computational Linguistics 7, pp. 597–610. DOI: 10.1162/tacl_a_00288. https://aclanthology.org/Q19-1038.
Banerjee, Satanjeev and Alon Lavie (2005). “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments”. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: Association for Computational Linguistics, pp. 65–72. https://aclanthology.org/W05-0909.
Barrault, Loïc, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank (2018). “Findings of the Third Shared Task on Multimodal Machine Translation”. In: Proceedings of the Third Conference on Machine Translation: Shared Task Papers. Belgium, Brussels: Association for Computational Linguistics, pp. 304–323. DOI: 10.18653/v1/W18-6402. https://aclanthology.org/W18-6402.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Virtual Event Canada, pp. 610–623.
Bender, Emily M. and Alexander Koller (2020). “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 5185–5198. https://aclanthology.org/2020.acl-main.463.
Bommasani, Rishi et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv: 2108.07258 [cs.LG]. https://arxiv.org/abs/2108.07258.
Bordes, Antoine, Y-Lan Boureau, and Jason Weston (2017). “Learning End-to-End Goal-Oriented Dialog”. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=S1Bb3D5gg.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei (2020). “Language Models are Few-Shot Learners”. In: Advances in neural information processing systems 33, pp. 1877–1901.
Calixto, Iacer, Qun Liu, and Nick Campbell (2017). “Doubly-Attentive Decoder for Multi-modal Neural Machine Translation”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, pp. 1913–1924. DOI: 10.18653/v1/P17-1175. https://aclanthology.org/P17-1175.
Chen, Xi, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut (2022). “Pali: A jointly-scaled multilingual language-image model”. In: arXiv preprint arXiv:2209.06794.
Chen, Xie, Yu Wu, Zhenghao Wang, Shujie Liu, and Jinyu Li (2021). “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset”. In: ICASSP. IEEE, pp. 5904–5908.
Chen, Yubo, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao (2015). “Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks”.
In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Con­ference on Natural Language Processing (Volume 1: Long Papers).Beijing,China:Association for Computational Linguistics, pp. 167–176. DOI: 10.3115/v1/P15-1017. https://aclanthology .org/P15-1017. Cho, Jaemin, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi (2020). “X-LXMERT:Paint,CaptionandAnswerQuestionswithMulti-ModalTransformers”.In: Pro­ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 8785–8805. DOI: 10.186 53/v1/2020.emnlp-main.707. https://aclanthology.org/2020.emnlp-main.707. Choi,Eunsol,HeHe,MohitIyyer,MarkYatskar,Wen-tauYih,YejinChoi,PercyLiang,andLuke Zettlemoyer (2018). “QuAC: Question Answering in Context”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Asso­ciation for Computational Linguistics, pp. 2174–2184. DOI: 10.18653/v1/D18-1241. https://a clanthology.org/D18-1241. Chomsky, Noam (1957). Syntactic structures. The Hague:Mouton. Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa (2011). “Natural Language Processing (Almost) from Scratch”. In: Journal of Machine Learning Research 12, pp. 2493–2537. Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, FranciscoGuzmán,EdouardGrave,MyleOtt,LukeZettlemoyer,andVeselinStoyanov(2020). “UnsupervisedCross-lingualRepresentationLearningatScale”.In:Proceedings of the 58th An­nual Meeting of the Association for Computational Linguistics. Online: Association for Com-putationalLinguistics,pp.8440–8451.DOI: 10.18653/v1/2020.acl-main.747.https://aclanthol ogy.org/2020.acl-main.747. Devlin,Jacob,Ming-WeiChang,KentonLee,andKristinaToutanova(2019).“BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding”. 
In: NAACL Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186. DOI: 10.18653/v1/N1 9-1423.https://aclanthology.org/N19-1423. Ding, Henghui, Chang Liu, Suchen Wang, and Xudong Jiang (2022). “VLT: Vision-Language TransformerandQueryGenerationforReferringSegmentation”.In:IEEE Transactions on Pat­tern Analysis and Machine Intelligence. Dione,CheikhMBamba,AllaLo,ElhadjiMamadouNguer,andSileyeBa(2022).“Low-resource NeuralMachineTranslation:BenchmarkingState-of-the-artTransformerforWolof<->French”. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 6654– 6661. Doddapaneni, Sumanth, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra (2021). “A primer on pretrained multilingual language models”. In: arXiv preprint arXiv:2107.00676. https://arxiv.org/abs/2107.00676. Dodge, Jesse, Maarten Sap, Ana Marasoviæ, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner (2021). “Documenting Large Webtext Corpora: A Case Studyon theColossal CleanCrawled Corpus”. In: arXiv preprint arXiv:2104.08758. Du,Zhengxiao,YujieQian,XiaoLiu,MingDing,JiezhongQiu,ZhilinYang,andJieTang(2021). “All NLP Tasks Are Generation Tasks: A General Pretraining Framework”. In: arXiv preprint arXiv:2103.10360. https://arxiv.org/abs/2103.10360. Elliott, Desmond, Stella Frank, Loi¨c Barrault, Fethi Bougares, and Lucia Specia (2017). “Find­ings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description”.In: Proceedings of the Second Conference on Machine Translation.Copenhagen, Denmark: Association for Computational Linguistics, pp. 215–233.DOI: 10.18653/v1/W17-4 718.https://aclanthology.org/W17-4718. 
Elu, Aitzol, Gorka Azkune, Oier Lopez de Lacalle, Ignacio Arganda-Carreras, Aitor Soroa, and Eneko Agirre (2021). “Inferring spatial relations from textual descriptions of images”. In: Pat­tern Recognition 113,p. 107847. Emezue,ChrisCandBonaventureFPDossou(2022).“MMTAfrica:Multilingualmachinetransla­tionforAfricanlanguages”.In: arXiv preprint arXiv:2204.04306. Fan,Angela,ShrutiBhosale,HolgerSchwenk,ZhiyiMa,AhmedEl-Kishky,SiddharthGoyal,Man-deep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, SergeyEdunov, EdouardGrave, MichaelAuli, and Armand Joulin (2021). “Beyond english-centric multilingual machine translation”. In: Journal of Machine Learning Research 22.107, pp. 1–48. Fan, Yuchen, Yao Qian, Feng-Long Xie, and Frank K Soong (2014). “TTS synthesis with bidirec­tional LSTM based recurrent neural networks”. In: Fifteenth annual conference of the interna­tional speech communication association. Gales, Mark and Steve Young (2008). The application of hidden Markov models in speech recog­nition. Now PublishersInc. Gambhir, Mahak and Vishal Gupta (2017). “Recent automatic text summarization techniques: a survey”. In: Artificial Intelligence Review 47.1, pp. 1–66. Gardner,Matt,JoelGrus,MarkNeumann,OyvindTafjord,PradeepDasigi,NelsonF.Liu,Matthew Peters,MichaelSchmitz,andLukeZettlemoyer(2018).“AllenNLP:ADeepSemanticNatural Language Processing Platform”. In: Proceedings of Workshop for NLP Open Source Software (NLP-OSS). Melbourne, Australia: Association for Computational Linguistics, pp. 1–6. DOI: 10.18653/v1/W18-2501. https://aclanthology.org/W18-2501. Gatt, Albert and Emiel Krahmer (2018). “Survey of the state of the art in natural language gener­ation: Core tasks, applications and evaluation”. In: Journal of Artificial Intelligence Research 61, pp. 65–170. 
Gehrmann, Sebastian, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondøej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan,MounicaMaddela,KhyatiMahajan,SaadMahamood,BodhisattwaPrasadMajumder, PedroHenriqueMartins,AngelinaMcMillan-Major,SimonMille,EmielvanMiltenburg,Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, SashankSanthanam,JoaoSedoc,ThibaultSellam,SamiraShaikh,AnastasiaShimorina,Marco AntonioSobrevillaCabezudo,HendrikStrobelt,NishantSubramani,WeiXu,DiyiYang,Akhila Yerukola, and Jiawei Zhou (2021). “The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics”. In: Proceedings of the 1st Workshop on Natural Language Genera­tion, Evaluation, and Metrics (GEM 2021).Online:AssociationforComputationalLinguistics, pp. 96–120. DOI: 10.18653/v1/2021.gem-1.10. https://aclanthology.org/2021.gem-1.10. Golland, Dave, Percy Liang, and Dan Klein (2010). “A Game-Theoretic Approach to Generating SpatialDescriptions”.In:Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA: Association for Computational Linguistics, pp. 410– 419.https://aclanthology.org/D10-1040. Goodfellow,IanJ.,JeanPouget-Abadie,MehdiMirza,BingXu,DavidWarde-Farley,SherjilOzair, Aaron C. Courville, and Yoshua Bengio (2014). “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Pro­cessing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada.Ed.byZoubinGhahra­mani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. 
Weinberger, pp. 2672– 2680. https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3­ Abstract.html. Gulati,Anmol,JamesQin,Chung-ChengChiu,NikiParmar,YuZhang,JiahuiYu,WeiHan,Shibo Wang, ZhengdongZhang, YonghuiWu,and Ruoming Pang(2020). “Conformer: Convolution-augmentedTransformerfor SpeechRecognition”. In: Interspeech, pp.5036–5040. Ho,Jonathan,AjayJain,andPieterAbbeel(2020).“Denoisingdiffusionprobabilisticmodels”.In: Advances in Neural Information Processing Systems 33,pp.6840–6851. Hosseini-Asl,Ehsan,BryanMcCann,Chien-ShengWu,SemihYavuz,andRichardSocher(2020). “Asimplelanguagemodelfortask-orienteddialogue”.In:Advances in Neural Information Pro­cessing Systems 33,pp. 20179–20191. Huang,Po-Yao,FrederickLiu,Sz-RungShiang,JeanOh,andChrisDyer(2016).“Attention-based MultimodalNeuralMachineTranslation”.In: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Berlin, Germany: Association for Computational Linguistics, pp. 639–645. DOI: 10.18653/v1/W16-2360.https://aclanthology.org/W16-2360. Hujon, Aiusha V, Thoudam Doren Singh, and Khwairakpam Amitab (2023). “Transfer Learning Based NeuralMachine TranslationofEnglish-Khasi on Low-Resource Settings”.In: Procedia Computer Science 218,pp. 1–8. Izacard, Gautier and Edouard Grave (2021). “Distilling Knowledge from Reader to Retriever for Question Answering”. In: International Conference on Learning Representations. https://open review.net/forum?id=NTEz-6wysdb. Johnson,Justin,BharathHariharan,LaurensvanderMaaten,LiFei-Fei,C.LawrenceZitnick,and RossB.Girshick(2017).“CLEVR:ADiagnosticDatasetforCompositionalLanguageandEle­mentaryVisualReasoning”.In:2017 IEEE Conference on Computer Vision and Pattern Recog­nition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, pp. 1988– 1997. DOI: 10.1109/CVPR.2017.215.https://doi.org/10.1109/CVPR.2017.215. Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury (2020). 
“The Stateand Fate of LinguisticDiversityandInclusionin theNLP World”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.Online:Associationfor ComputationalLinguistics, pp. 6282–6293. https://aclanthology.org/2020.acl-main.560. Juang, Biing-Hwang and Lawrence R Rabiner (2005). “Automatic speech recognition–a brief his-toryofthetechnologydevelopment”.In: Georgia Institute of Technology. Atlanta Rutgers Uni­versity and the University of California. Santa Barbara 1, p.67. Karpukhin,Vladimir,BarlasOguz,Sewon Min, Patrick Lewis, LedellWu, Sergey Edunov, Danqi Chen,andWen-tau Yih(2020).“DensePassage Retrieval forOpen-DomainQuestionAnswer-ing”.In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro­cessing (EMNLP). Online: Association for Computational Linguistics, pp. 6769–6781. DOI: 10.18653/v1/2020.emnlp-main.550.https://aclanthology.org/2020.emnlp-main.550. Kazemzadeh, Sahar, Vicente Ordonez, Mark Matten,and Tamara Berg(2014). “ReferItGame: Re­ferring to Objects in Photographs of Natural Scenes”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for ComputationalLinguistics, pp. 787–798. DOI: 10.3115/v1/D14-1086.https://aclanthology .org/D14-1086. Khan,Salman,MuzammalNaseer,MunawarHayat,SyedWaqasZamir,FahadShahbazKhan,and Mubarak Shah (2021). Transformers in Vision: A Survey. arXiv: 2101.01169 [cs.CV]. https: //arxiv.org/abs/2101.01169. Kingma,DiederikPandMaxWelling(2013).“Auto-encodingvariationalbayes”.In:arXiv preprint arXiv:1312.6114. Koehn, Philipp, Franz J. Och, and Daniel Marcu (2003). “Statistical Phrase-Based Translation”. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 127–133. https://aclanthology.o rg/N03-1017. Krahmer, Emiel and Kees van Deemter (2012). 
“Computational Generation of Referring Expres­sions: A Survey”. In: Computational Linguistics 38.1, pp. 173–218. DOI: 10.1162/COLI_a_0 0088.https://aclanthology.org/J12-1006. LeCun, Yann and Yoshua Bengio (1995). “Convolutional networks for images, speech, and time series”.In: The handbook of brain theory and neural networks 3361.10,p. 1995. Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer (2020). “BART: Denoising Sequence-to-Se­quence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.On­line: Association for Computational Linguistics, pp. 7871–7880. DOI: 10.18653/v1/2020.acl­ main.703. https://aclanthology.org/2020.acl-main.703. Lewis,Patrick,YuxiangWu,LinqingLiu,PasqualeMinervini,HeinrichKüttler,AleksandraPiktus, PontusStenetorp,andSebastianRiedel(2021).PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them. arXiv: 2102.07033 [cs.CL]. Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi (2022). “Blip: Bootstrapping language-imagepre-trainingforunifiedvision-languageunderstandingandgeneration”.In:International Conference on Machine Learning. PMLR,pp.12888–12900. Li, Junyi, Tianyi Tang, Gaole He, Jinhao Jiang, Xiaoxuan Hu, Puzhao Xie, Zhipeng Chen, Zhuo­hao Yu, Wayne Xin Zhao, and Ji-Rong Wen (2021a). “TextBox: A Unified, Modularized, and Extensible Framework for Text Generation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations.Online:AssociationforComputational Linguistics, pp.30–39.DOI: 10.18653/v1/2021.acl-demo.4.https://aclanthology.org/2021.acl­ demo.4. Li, Junyi, Tianyi Tang,Wayne Xin Zhao, and Ji-Rong Wen (2021b). 
“Pretrained Language Model forTextGeneration:ASurvey”.In:Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. Ed. by Zhi-Hua Zhou. Survey Track. International Joint ConferencesonArtificialIntelligenceOrganization,pp.4492–4499.DOI: 10.24963/ijcai.2021 /612. https://doi.org/10.24963/ijcai.2021/612. Li,Sha,Heng Ji,and Jiawei Han (2021).“Document-LevelEvent ArgumentExtraction by Condi­tional Generation”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Asso­ciation for Computational Linguistics, pp. 894–908. https://aclanthology.org/2021.naacl-main .69. Liu,Fangyu,GuyEmerson,andNigelCollier(2022).“Visualspatialreasoning”.In:arXiv preprint arXiv:2205.00363. Liu, Jian, Yubo Chen, Kang Liu, and Jun Zhao (2018). “Event Detection via Gated Multilingual Attention Mechanism”. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. Ed. by Sheila A. McIlraith and Kilian Q. Weinberger. AAAI Press, pp. 4865–4872. https://www.aaai.org/ocs/index.php/AAAI/AAAI18 /paper/view/16371. Liu,Jian,YuboChen,KangLiu,andJunZhao(2019).“NeuralCross-LingualEventDetectionwith MinimalParallelResources”.In:Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).HongKong,China:AssociationforComputationalLinguistics, pp. 738–748. DOI: 10.18653/v1/D19-1068. https://aclanthology.org/D19-1068. Liu,Jing,RupakVigneshSwaminathan,SreeHariKrishnanParthasarathi,ChunchuanLyu,Athana­sios Mouchtaris, and Siegfried Kunzmann (2021). 
“Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models”. In: Proc. International Conference on Text, Speech and Dialogue (TSD). Liu, Kang, Yubo Chen, Jian Liu, Xinyu Zuo, and Jun Zhao (2020). “Extracting Events and Their Relations from Texts: A Survey on Recent Research Progress and Challenges”. In: AI Open 1, pp. 22–39. ISSN:2666-6510. DOI: https://doi.org/10.1016/j.aiopen.2021.02.004. https://www .sciencedirect.com/science/article/pii/S266665102100005X. Liu,YangandMirellaLapata(2019).“TextSummarizationwithPretrainedEncoders”.In:Proceed­ings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).Hong Kong, China: Association for Computational Linguistics, pp. 3730–3740. DOI: 10.18653/v1 /D19-1387.https://aclanthology.org/D19-1387. Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis,andLukeZettlemoyer(2020).“MultilingualDenoisingPre-trainingforNeuralMachine Translation”.In: Transactions of the Association for Computational Linguistics 8,pp.726–742. DOI: 10.1162/tacl_a_00343. https://aclanthology.org/2020.tacl-1.47. Lopez de Lacalle, Oier, Ander Salaberria, Aitor Soroa, Gorka Azkune, and Eneko Agirre (2020). “Evaluating Multimodal Representations on Visual Semantic Textual Similarity”. In: Proceed­ings of the Twenty-third European Conference on Artificial Intelligence, ECAI 2020, June 8-12, 2020, Santiago Compostela, Spain. Marino,Kenneth,MohammadRastegari,AliFarhadi,andRoozbehMottaghi(2019).“OK-VQA:A VisualQuestionAnsweringBenchmarkRequiringExternalKnowledge”.In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, pp. 3195–3204. DOI: 10.1109/CVPR.2019.00331. 
http://openaccess.thecvf.com/content%5C_CVPR%5C_2019/html/Marino%5C_OK-VQA%5 C_A%5C_Visual%5C_Question%5C_Answering%5C_Benchmark%5C_Requiring%5C_Ext ernal%5C_Knowledge%5C_CVPR%5C_2019%5C_paper.html. Mikolov, Tomás, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean (2013). “Dis­tributed Representations of Words and Phrases and their Compositionality”. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States.Ed.byChristopherJ.C.Burges,LéonBottou,ZoubinGhahramani,and Kilian Q. Weinberger, pp. 3111–3119. https://proceedings.neurips.cc/paper/2013/hash/9aa42b 31882ec039965f3c4923ce901b-Abstract.html. Miller, George A. (1992). “WordNet: A Lexical Database for English”. In: Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992. https://aclanthology.org/H92-1116. Min,Zeping(2023).“AttentionLink:AnEfficientAttention-BasedLowResourceMachineTrans­lationArchitecture”. In: arXiv preprint arXiv:2302.00340. Nagrani,Arsha,JoonSonChung,JaesungHuh,AndrewBrown,ErnestoCoto,WeidiXie,Mitchell McLaren,DouglasAReynolds,andAndrewZisserman(2020).“Voxsrc2020:Thesecondvox-celeb speaker recognition challenge”. In: arXiv preprint arXiv:2012.06867. https://arxiv.org/a bs/2012.06867. Nagrani, Arsha, Joon Son Chung, and Andrew Zisserman (2017). “VoxCeleb: A Large-Scale Speaker Identification Dataset”. In: Interspeech,pp.2616–2620. Nguyen, Minh Van, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen (2021). “Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Pro­cessing”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. ACL, pp. 80–90. DOI: 10.18653/v1/2 021.eacl-demos.10.https://aclanthology.org/2021.eacl-demos.10. 
Nguyen,ThienHuuandRalphGrishman(2016).“ModelingSkip-GramsforEventDetectionwith Convolutional Neural Networks”. In: Proceedings of the 2016 Conference on Empirical Meth­ods in Natural Language Processing.Austin,Texas:AssociationforComputationalLinguistics, pp. 886–891. DOI: 10.18653/v1/D16-1085. https://aclanthology.org/D16-1085. Nguyen, Toan Q. and David Chiang (2017). “Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation”. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Taipei,Taiwan: Asian Federationof Natural Language Processing, pp. 296–301. https://aclanthology.org/I17-2050. Oord,Aaronvanden,SanderDieleman,HeigaZen,KarenSimonyan,OriolVinyals,AlexGraves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu (2016). “Wavenet: A generative model for rawaudio”. In: arXiv preprint arXiv:1609.03499.https://arxiv.org/abs/1609.03499. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). “Bleu: a Method for Au-tomaticEvaluationofMachineTranslation”.In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, pp. 311–318. DOI: 10.3115/1073083.1073135. https://aclantholog y.org/P02-1040. Pradhan,Sameer,AlessandroMoschitti,NianwenXue,OlgaUryupina,andYuchenZhang(2012). “CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes”. In: Proceedings of CoNLL, pp.1–40. https://www.aclweb.org/anthology/W12-4501. Qian,Yao,YuchenFan,WenpingHu,andFrankKSoong(2014).“Onthetrainingaspectsofdeep neural network (DNN)forparametric TTS synthesis”. In: ICASSP. IEEE, pp. 3829–3833. Qiu, Xipeng, Tianxiang Sun, YigeXu, YunfanShao, Ning Dai, andXuanjing Huang (2020). “Pre-trained models for natural language processing: A survey”. In: Science China Technological Sciences 63.10, pp.1872–1897. 
Radford,Alec,JongWookKim,TaoXu,GregBrockman,ChristineMcLeavey,andIlyaSutskever (2022).“Robustspeechrecognitionvialarge-scaleweaksupervision”.In:arXiv preprint arXiv:­2212.04356. Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu (2020). “Exploring the limits of transfer learning with a unified text-to-text transformer”. In: Journal of Machine Learning Research 21.1, pp. 5485– 5551. Rajpurkar, Pranav, Robin Jia, and Percy Liang (2018). “Know What YouDon’t Know: Unanswer-ableQuestionsforSQuAD”.In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for ComputationalLinguistics, pp. 784–789. https://aclanthology.org/P18-2124. Ramesh,Aditya,PrafullaDhariwal,AlexNichol,CaseyChu,andMarkChen(2022).“Hierarchical text-conditionalimage generation withclip latents”. In: arXiv preprint arXiv:2204.06125. Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multi­lingual Europe. Cognitive Technologies. Cham,Switzerland: Springer. Reiter, Ehud and Robert Dale (1997). “Building applied natural language generation systems”. In: Natural Language Engineering 3.1, pp.57–87. Ribeiro,MarcoTulio,CarlosGuestrin,andSameerSingh(2019).“AreRedRosesRed?Evaluating Consistency of Question-Answering Models”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 6174–6184. DOI: 10.18653/v1/P19-1621. https://aclanthology.org/P19-1621. Robertson, Stephen and Hugo Zaragoza (2009). “The Probabilistic Relevance Framework: BM25 andBeyond”.In: Found. Trends Inf. Retr. 3.4,pp.333–389.ISSN:1554-0669.DOI: 10.1561/1 500000019. https://doi.org/10.1561/1500000019. Roller,Stephen,EmilyDinan,NamanGoyal,DaJu,MaryWilliamson,YinhanLiu,JingXu,Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston (2021). 
“Recipes for Building an Open-Domain Chatbot”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Compu­tational Linguistics, pp. 300–325. DOI: 10.18653/v1/2021.eacl-main.24. https://aclanthology .org/2021.eacl-main.24. Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer (2022). “High-resolutionimagesynthesiswithlatentdiffusionmodels”.In:Proc. of the IEEE/CVF Con­ference on Computer Vision and Pattern Recognition, pp. 10684–10695. Safari, Pooyan, Miquel India, and Javier Hernando (2020). “Self-attention encoding and pooling for speaker recognition”. In: Interspeech, pp.941–945. Saito, Yuki, Shinnosuke Takamichi, and HiroshiSaruwatari (2017).“Statistical parametricspeech synthesis incorporating generative adversarial networks”. In: IEEE/ACM Transactions on Au­dio, Speech, and Language Processing 26.1, pp. 84–96. Schick,TimoandHinrichSchütze(2021).“It’sNotJustSizeThatMatters:SmallLanguageModels Are Also Few-Shot Learners”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.On­line: Association for ComputationalLinguistics, pp. 2339–2352. DOI: 10.18653/v1/2021.naac l-main.185. https://aclanthology.org/2021.naacl-main.185. Schwenk, Holger, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and An-gelaFan (2021).“CCMatrix:Mining Billions of High-Quality Parallel Sentences on the Web”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 6490–6500. DOI: 10.18653/v 1/2021.acl-long.507. https://aclanthology.org/2021.acl-long.507. Siddhant, Aditya, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen, Isaac Caswell, and Xavier Garcia (2022). 
“Towards the Next 1000 Languages in Multilingual Machine Translation: Ex-ploringtheSynergyBetweenSupervisedandSelf-SupervisedLearning”.In:arXiv: 2201.0311 0 [cs.CL]. Sohl-Dickstein, Jascha, EricWeiss, Niru Maheswaranathan, and Surya Ganguli (2015). “Deep un­supervised learning using nonequilibrium thermodynamics”. In: International Conference on Machine Learning. PMLR, pp. 2256–2265. Storks, Shane, Qiaozi Gao, and Joyce Y Chai (2019). “Commonsense reasoning for natural lan­guage understanding: A survey of benchmarks, resources, and approaches”. In: arXiv preprint arXiv:1904.01172, pp.1–60. Strubell, Emma, Ananya Ganesh, and Andrew McCallum (2019). “Energy and Policy Considera­tions for Deep Learning in NLP”. In: Proceedings of the 57th Annual Meeting of the Associa­tion for Computational Linguistics.Florence,Italy:AssociationforComputationalLinguistics, pp. 3645–3650. DOI: 10.18653/v1/P19-1355. https://aclanthology.org/P19-1355. Suhr, Alane, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi (2018). “A corpus for reasoning about natural language grounded in photographs”. In: arXiv preprint arXiv:1811.00491. Sulem,Elior,JamaalHay,andDanRoth(2022).“Yes,NoorIDK:TheChallengeofUnanswerable Yes/NoQuestions”.In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.Seattle,United States: Association for Computational Linguistics, pp. 1075–1085. https://aclanthology.org/20 22.naacl-main.79. Sun, Zewei, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Shujian Huang, Jiajun Chen, and Lei Li (2022). “Rethinking document-level neural machine translation”. In: Findings of the Associa­tion for Computational Linguistics: ACL 2022, pp. 3537–3548. Turing,AlanM.(1950).“ComputingMachineryandIntelligence”.In:Mind LIX.236,pp.433–460. ISSN: 0026-4423. eprint: https://academic.oup.com/mind/article-pdf/LIX/236/433/30123314 /lix-236-433.pdf. https://doi.org/10.1093/mind/LIX.236.433. 
Vakulenko, Svitlana, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha (2021). “Question rewriting for conversational question answering”. In: Proceedings of the 14th ACM interna­tional conference on web search and data mining,pp.355–363. Vaswani,Ashish,SamyBengio,EugeneBrevdo,FrancoisChollet,AidanGomez,StephanGouws, LlionJones,£ukaszKaiser,NalKalchbrenner,NikiParmar,RyanSepassi,NoamShazeer,and Jakob Uszkoreit (2018). “Tensor2Tensor for Neural Machine Translation”. In: Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track). Boston, MA: Association for Machine Translation in the Americas, pp. 193– 199.https://aclanthology.org/W18-1819. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, £ukasz Kaiser,and Illia Polosukhin (2017). “Attentionis all you need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan (2015). “Show and tell: A neuralimagecaptiongenerator”.In: IEEE Conference on Computer Vision and Pattern Recog­nition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. IEEE Computer Society, pp. 3156– 3164.DOI: 10.1109/CVPR.2015.7298935. https://doi.org/10.1109/CVPR.2015.7298935. Vu,HoaTrong,ClaudioGreco,AliiaErofeeva,SomayehJafaritazehjan,GuidoLinders,MarcTanti, AlbertoTestoni,RaffaellaBernardi,andAlbertGatt(2018).“GroundedTextualEntailment”.In: Proceedings of the 27th International Conference on Computational Linguistics.SantaFe,New Mexico, USA: Association for Computational Linguistics, pp. 2354–2368. https://aclantholog y.org/C18-1199. Wang,Peng,QiWu,ChunhuaShen,AnthonyR.Dick,andAntonvandenHengel(2017a).“Explicit Knowledge-based Reasoning for Visual Question Answering”. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Aus­tralia, August 19-25, 2017. Ed. by Carles Sierra. 
Chapter 3
Digital Language Equality: Definition, Metric, Dashboard

Federico Gaspari, Annika Grützner-Zahn, Georg Rehm, Owen Gallagher, Maria Giagkou, Stelios Piperidis, and Andy Way

Abstract This chapter presents the concept of Digital Language Equality (DLE) that was at the heart of the European Language Equality (ELE) initiative, and describes the DLE Metric, which includes technological factors (TFs) and contextual factors (CFs): the former concern the availability of Language Resources and Technologies (LRTs) for the languages of Europe, based on the data included in the European Language Grid (ELG) catalogue, while the latter reflect the broader socio-economic contexts and ecosystems of the languages, as these determine the potential for LRT development. The chapter discusses related work, presents the DLE definition and describes how it was implemented through the DLE Metric, explaining how the TFs and CFs were quantified. The resulting scores of the DLE Metric for Europe’s languages can be visualised and compared through the interactive DLE dashboard, to monitor the progress towards DLE in Europe.¹

1 Introduction and Background

The META-NET White Paper Series (Rehm and Uszkoreit 2012) showed the clear imbalance in terms of technology support for 31 European languages as of 2012 (see Chapter 1).
Beyond the official European and national languages, more than 60 regional and minority languages (RMLs) are protected by the European Charter for Regional or Minority Languages and the Charter of Fundamental Rights of the EU. Against this background, the EU-funded project European Language Equality (ELE) has addressed the issue of Digital Language Equality (DLE) in Europe, with the intention of tackling the imbalances across Europe’s languages, which have widened even further in the meantime, as explained in Chapter 4. ELE’s contribution to advancing DLE in Europe hinges on a systematically developed and inclusive all-encompassing strategic research, innovation and implementation agenda (SRIA) and a related roadmap to drive forward much-needed efforts in this direction (see Chapter 45). The present chapter describes the notion of DLE and the associated metric that are at the heart of these plans, and presents the DLE dashboard that visualises the digital support of each European language, so as to monitor the overall progress towards DLE in Europe, also in a comparative fashion across languages.

Federico Gaspari · Owen Gallagher · Andy Way
Dublin City University, ADAPT Centre, Ireland, federico.gaspari@adaptcentre.ie, owen.gallagher@adaptcentre.ie, andy.way@adaptcentre.ie

Annika Grützner-Zahn · Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, annika.gruetzner-zahn@dfki.de, georg.rehm@dfki.de

Maria Giagkou · Stelios Piperidis
R. C. “Athena”, Greece, mgiagkou@athenarc.gr, spip@athenarc.gr

¹ This chapter is based on Gaspari et al. (2021, 2022a,b), Giagkou et al. (2022), and Grützner-Zahn and Rehm (2022).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_3
Despite the persisting imbalances, Europe has come a long way in recognising and promoting languages as fundamental rights of its people and essential components of its unique combined cultural heritage, and this awareness is reflected in research and policy advancements of the last two decades. Krauwer (2003) represented one of the earliest calls for action towards the development of Language Resources and Technologies (LRTs), in particular for under-resourced languages. In the following years, several projects and initiatives contributed to the progress of Europe’s languages in terms of technological and digital support; some of the main efforts in this area that laid the foundation for subsequent substantial progress were, e.g., Euromatrix (Eisele et al. 2008), iTranslate4.eu (Yvon and Hansen 2010), FLaReNet (Soria et al. 2012) and CLARIN (Hinrichs and Krauwer 2014). Additionally, META-NET, an EU Network of Excellence forging the Multilingual Europe Technology Alliance, was established and a group of projects (T4ME, CESAR, METANET4U, META-NORD) promoted and supported the development of Language Technologies (LTs) for all European languages (Rehm and Uszkoreit 2012, 2013; Rehm et al. 2016). The EU project CRACKER (Cracking the Language Barrier, 2015-2017) continued the work of META-NET, concentrating on additional strategy development and community building (Rehm et al. 2020). The most recent EU-funded projects continuing efforts in this area were European Language Grid (ELG, Rehm 2023b) and European Language Equality (ELE, Rehm et al. 2022), which collaborated closely, leading to the development of the DLE Metric and the DLE dashboard presented in this chapter.

2 Related Work

While our work on DLE focused specifically on the languages of Europe, it is located in a broader context of related recent efforts with a wider remit, which are briefly reviewed here to pinpoint issues of interest for the subsequent presentation of the definition of DLE, its metric and the dashboard. Joshi et al.
(2020) investigate the relation between the languages of the world and the resources available for them as well as their coverage in Natural Language Processing (NLP) conferences, providing evidence for the severe disparity that exists across languages in terms of technological support and attention paid by academic, scientific and corporate players. In a similar vein, Blasi et al. (2022, p. 5486) argue that the substantial progress brought about by the generally improved performance of NLP methods “has been restricted to a minuscule subset of the world’s approx. 6,500 languages”, and present a framework for gauging the global utility of LTs in relation to demand, based on the analysis of a sample of over 60,000 papers published at major NLP conferences. This study also shows convincing evidence for the striking inequality in the development of LTs across the world’s languages. While this severe disparity is partly in favour of a few, mostly European, languages, on the whole, the vast majority of the languages spoken in Europe are at a disadvantage.

Simons et al. (2022) develop an automated method to evaluate the level of technological support for languages across the world. Scraping the names of the supported languages from the websites of over 140 tools selected to represent a good level of technological support, they propose an explainable model for quantifying and monitoring digital language support on a global scale. Khanuja et al. (2022) propose an approach to evaluate NLP technologies across the three dimensions of inclusivity, equity and accessibility as a way to quantify the diversity of the users they can serve, with a particular focus on equity as a largely neglected issue. Their proposal consists of addressing existing gaps in LRT provision in relation to societal wealth inequality. Khanuja et al.
(2022) lament in particular the very limited diversity of current NLP systems for Indian languages, and to remedy this unsatisfactory situation they demonstrate the value of region-specific choices when building models and creating datasets, also proposing an innovative approach to optimise resource allocation for fine-tuning. They also discuss the steps that can be taken to reduce the biases in LRTs for Indian languages and call upon the community to consider their evaluation paradigm in the interest of enriching the linguistic diversity of NLP applications.

Acknowledging that LTs are becoming increasingly ubiquitous, Faisal et al. (2022) look into the efforts to expand the language coverage of NLP applications. Since a key factor determining the quality of the latest NLP systems is data availability, they study the geographical representativeness of language datasets to assess the extent to which they match the needs of the members of the respective language communities, with a thorough analysis of the striking inequalities. Bromham et al. (2021) examine the effects of a range of demographic and socio-economic aspects on the use and status of the languages of the world, and conclude that language diversity is under threat across the globe, including in industrialised and economically advanced regions. This study finds that half of the languages under investigation faced serious risks of extinction, potentially within a generation, if not imminently. This is certainly an extremely sombre situation to face up to, which calls for a large-scale mobilisation of all possible efforts by all interested parties to avoid such a daunting prospect, particularly in Europe, where multilingualism is recognised as an important part of diversity. Establishing a working definition of DLE, devising a metric to measure the situation of each European language with respect to DLE and implementing an interactive dashboard to monitor progress in this direction are vital elements of this large-scale endeavour.
3 Digital Language Equality: Key Principles and Definition

The DLE Metric and the DLE dashboard can be used to measure, visualise and compare the position of Europe’s languages with respect to DLE on the basis of up-to-date and carefully chosen quantitative indicators. In this context, language equality does not mean sameness on all counts, regardless of the respective environments of the languages; in fact, the different historical developments and current situations of the very diverse languages under consideration are duly taken into account, along with their specific features, different needs and realities of their communities, e.g., in terms of number of speakers, ranges of use, etc., which vary significantly. It would be naive and unrealistic in practice to disregard these facts, and to set out to erase the differences that exist between languages, which are vital reflections of the relevant communities of speakers and key components of Europe’s shared cultural heritage. This is also a core value of multilingualism in Europe, where all languages are regarded as inherent components of the cultural and social fabric that connects European citizens in their diversity.

In addition, the notion of DLE stays well clear of any judgement of the political, social and cultural status or value of the languages, insofar as they collectively contribute to a multilingual Europe that should be supported and promoted. Alongside the fundamental concept of equality, we also recognise the importance of the notion of equity, meaning that for some European languages, and for some of their needs, a targeted effort is necessary to advance the cause of equality. For example, the availability of, and access to, certain resources and services (e.g., to revitalise a language, or to promote education through that language) may be very important for some of Europe’s languages, but by and large these are not pressing issues, for instance, for most official national languages.
With this in mind, the definition of DLE and the implementation of the DLE Metric discussed below are intended to accurately capture the needs and expectations of the various European languages, and especially the shortfalls with respect to being adequately served in terms of resources, tools and technological services in the digital age, so as to support the large-scale efforts to achieve DLE, also through data analytics and visualisation in the DLE dashboard.

The definition of DLE drew inspiration, among others, from the META-NET White Paper Series (Rehm and Uszkoreit 2012) and from the BLARK concept (Basic Language Resource Kit, Krauwer 2003), which have been instrumental in assessing the level of technological support for specific languages, and in particular in identifying those that lag behind in the digital age and in encouraging the targeted interventions required to fill the gaps in LT support. These starting points were further elaborated by the ELE consortium in collaboration with its vast networks of contacts and partnerships, also in light of the latest developments in LRTs and in language-centric AI techniques and of the evolution of the relevant institutional, academic, industrial and business landscape that has grown and diversified considerably in the last two decades, as discussed in other chapters of this book. Following a systematic and inclusive consultation effort in the ELE consortium, the following consensus was achieved (Gaspari et al. 2021, p. 4).

    Digital Language Equality (DLE) is the state of affairs in which all languages have the technological support and situational context necessary for them to continue to exist and to prosper as living languages in the digital age.

This definition was applied to 89 European languages in the project: all 24 official EU languages, 11 additional official national languages and 54 RMLs.
This definition, in turn, provided the conceptual basis to design and implement a metric to enable the quantification of the level of technological support of each European language with descriptive, diagnostic and predictive value to promote DLE in practice. This approach allows for comparisons across languages, tracking their progress towards the ultimate collective goal of DLE in Europe, as well as the prioritisation of interventions to meet any needs, especially to fill identified gaps, focusing on realistic and feasible targets, as part of the implementation of the all-encompassing SRIA and related roadmap devised by ELE to drive the advancement towards DLE, as described in detail in Chapter 45.

4 Implementing the Digital Language Equality Metric

Based on the definition of DLE, we describe the associated metric as follows (Gaspari et al. 2021, p. 4):

    The Digital Language Equality (DLE) Metric is a measure that reflects the digital readiness of a language and its contribution to the state of technology-enabled multilingualism, tracking its progress towards the goal of DLE.

The DLE Metric is computed for each European language on the basis of a range of quantifiers, grouped into technological factors (TFs, that correspond to the available resources, tools and services, Gaspari et al. 2022a) and situational contextual factors (CFs, that reflect the broad socio-economic ecosystem of each language, which determines the potential for technology and resource development, Grützner-Zahn and Rehm 2022).

The setup and formulation of the metric are modular and flexible, i.e., they consist of well-defined separate and independent, but tightly integrated quantifiers. In particular, the TFs were devised so as to be compatible with the metadata schema adopted by the European Language Grid cloud platform² (Labropoulou et al. 2020; Piperidis et al. 2023).
The ELG cloud platform bundles together datasets, corpora, functional software, repositories and applications to benefit European society, industry, academia and administration, and provides a convenient single access point to LRTs for Europe’s languages (Rehm 2023a).

² https://www.european-language-grid.eu

In addition, the definition of DLE and its associated metric have been designed to be transparent and intuitive for linguists, LT experts and developers, language activists, advocates of language rights, industrial players, policy-makers and European citizens at large, to encourage the widest possible uptake and buy-in to the cause of DLE across Europe. In establishing the DLE definition and its associated metric, an effort was made to found them on solid, widely agreed principles, while also striking a balance between a methodologically sound and theoretically convincing approach, and a transparent formulation. The rationale behind this approach was that the DLE definition and its metric should be easily understood and able to inform future language and LT-related policies at the local, regional, national and European levels in order to guide and prioritise future efforts in the creation, development and improvement of LRTs according to the SRIA and roadmap (see Chapter 45), with the ultimate goal of achieving DLE in Europe by 2030.

Through data analytics and visualisation methods in the DLE dashboard (see Section 7), European languages facing similar challenges in terms of LT provision can be grouped together, and requirements can be formulated to support them in remedying the existing gaps and advancing towards full DLE. A crucial feature of the DLE Metric is its dynamic nature, i.e., the fact that its scores can be updated and monitored over time, at regular intervals or whenever one wishes to check the progress or the status of one or more European languages.
This is why the DLE Metric is a valuable tool to achieve DLE for all European languages, and a key element of the sustainable evidence-based SRIA and of the roadmap guiding future interventions promoting LTs and language-centric AI across Europe.

5 Technological Factors

In order to objectively quantify the level of technological support for each of Europe’s languages, a number of TFs were considered. The following description presents their main categories, illustrating the breadth and diversity of the LRTs that they capture through the ELG catalogue (Rehm 2023a; Piperidis et al. 2023; Labropoulou et al. 2020). In that regard, we assume that the ELG catalogue, with its more than 13,000 LRTs at the time of writing, provides a representative picture of the state of play of technology support of Europe’s languages.

The first category of TFs is based on the availability of LRs, i.e., corpora, datasets or collections of text documents, text segments, audio transcripts, audio and video recordings, etc., monolingual or bi-/multilingual, raw or annotated. This category also encompasses language models and computational grammars and resources organised on the basis of lexical or conceptual entries (lexical items, terms, concepts, etc.) with their supplementary information (e.g., grammatical, semantic, statistical information, etc.), such as lexica, gazetteers, ontologies, term lists, thesauri, etc.

The resulting technological DLE score for each European language is a reflection of the LRTs available in the ELG catalogue for that language. While the number of available LRs is an essential aspect of a language’s digital readiness, the specific types and features of these LRs are equally important, insofar as they indicate how well a language is supported in the different LT areas. To capture such aspects in the DLE Metric, in addition to raw counts of available LRs, the following LR features have also been taken into account and attributed specific weights in the scoring mechanism (see Table 1, p.
66, in the Appendix):

• resource type
• resource subclass
• linguality type
• media type covered or supported
• annotation type (where relevant)
• domain covered (where relevant)
• conditions of use

The second category of TFs is based on the availability of tools and services offered via the web or running in the cloud, but also downloadable tools, source code, etc. This category encompasses, for example, NLP tools (morphological analysers, part-of-speech taggers, lemmatisers, parsers, etc.); authoring tools (e.g., spelling, grammar and style checkers); services for information retrieval, extraction, and mining, text and speech analytics, machine translation, natural language understanding and generation, speech technologies, conversational systems, etc. The features of tools and services that are considered and assigned weights in the scoring system of the DLE Metric (see Table 2, p. 67) are as follows:

• language (in)dependent
• type of input processed
• type of output provided
• type of function
• domain covered (where relevant)
• conditions of use

5.1 Weights and Scores

The weights given to the feature values of the LRTs quantify their contribution to the DLE score with regard to the relevant TFs. The scoring system (see Tables 1 and 2) is based on the assumption that for any language some features of LRTs contribute more effectively to achieving DLE than others. Higher weights are assigned to feature values related to 1. more complex LRTs, e.g., tools that process or support more than one modality, 2. more expensive and labour-intensive datasets or tools, e.g., in terms of the effort required to build them, 3. more open or freely available datasets and tools, and 4. additional envisaged applications that could be supported.

One guiding consideration in developing the DLE Metric, and especially in assigning the weights of the features and their values for the TFs, is to make the fewest possible assumptions about the (preferred or supposedly ideal) use-cases and actual application scenarios that may be most relevant to users.
These can vary widely for all languages on the basis of a number of factors impossible to establish a priori. We therefore refrained from predetermining particular preferred end-uses when implementing the full specification of the DLE Metric, as doing so would risk making it unsuitable for some end-users and applications. Here we briefly review some of the key features of the TFs, focusing on those that can have several values.

For instance, a feature of LRs that can receive several values is that of Annotation Type, where applicable. In the implementation of the DLE Metric, we assign each annotation type a constant, very small weight, also based on the fact that some LRs can possess several annotation types in combination. A similar consideration applies to the Domain feature (again, where relevant), which has many possible values both for LRs and for tools and services: in these cases, the weights assigned to Domain values are fixed and relatively small, again considering that multiple domains can be combined in a single LR, tool or service. In addition to Domain, another feature that appears both in LRs and tools and services is Conditions of use: the weights proposed for this feature of the TFs are identical for the corresponding values of Conditions of use across datasets and tools and services. In the case of (much) more restrictive licensing terms, lower weights are assigned than to liberal use conditions, so they contribute (much) less to the partial technological DLE score for the LRT in question, and therefore to the overall technological DLE score for the specific language.

5.2 Configuration of the Technological Factors

Before coming up with the final implementation of the weighting and scoring system for the TFs (see Tables 1 and 2), we experimented with a range of different setups. We used the contents of the ELG catalogue as of early 2022, which at that time contained about 11,500 records, out of which about 75% were datasets and resources (corpora, lexical resources, models, grammars) and the rest were tools and services.
These records contained multiple levels of metadata granularity. The ELG repository had been populated with LRTs following extensive efforts by a wide range of language experts and reflected the input of this community of experts, mobilised in ELE, to ensure comprehensive coverage, which is why we considered the ELG catalogue representative with regard to the existence of LRTs for Europe’s languages, so it was used as the empirical basis for the computation of the technological DLE scores.

The ELG catalogue includes metadata for LRs and LTs. In ELG, each resource and tool/service has several features and associated values, based on the schemes presented in Tables 1 and 2. Each feature was initially assigned a tentative weight to calculate preliminary technological DLE scores of each language, comparing the resulting scores of a number of alternative preliminary setups. During this fine-tuning of the weights, we considered especially where each language stood in relation to the others and how their relative positioning changed as a result of assigning different weights to the various feature values. This was an efficient and effective method to gradually refine the setup of the TFs and propose the implementation of the weights in the scoring mechanism that was eventually adopted (see Tables 1 and 2).

The experiments showed that the global picture of the technological DLE scores for the languages of Europe tended not to change dramatically as the weights assigned to the feature values were manipulated. We experimented both with very moderate and narrow ranges of weights, and with more extreme and differentiated weighting schemes.
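One way to make such a comparison of weighting schemes concrete is to compute a rank correlation between the language rankings that two schemes produce. The sketch below uses Spearman's rank correlation, which is our choice for illustration and not necessarily the procedure followed in the experiments; the function names, the placeholder languages A to D and all score values are hypothetical.

```python
def rank(scores):
    """Map each language to its rank (1 = highest score); assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {lang: pos for pos, lang in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation between two score tables without ties."""
    ranks_a, ranks_b = rank(scores_a), rank(scores_b)
    n = len(ranks_a)
    d_squared = sum((ranks_a[lang] - ranks_b[lang]) ** 2 for lang in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for four placeholder languages under two weighting schemes.
# These are NOT real DLE scores; they only illustrate the comparison.
narrow_weights = {"A": 900.0, "B": 450.0, "C": 430.0, "D": 40.0}
extreme_weights = {"A": 2100.0, "B": 980.0, "C": 1010.0, "D": 55.0}

rho = spearman_rho(narrow_weights, extreme_weights)
print(round(rho, 2))  # 0.8: B and C swap places, A and D keep their ranks
```

A value close to 1 indicates that the relative positioning of the languages is largely unaffected by the change of weights, while the absolute scores may still differ considerably between schemes.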
Since, ultimately, any changes were applied across the board to all LRTs included in the ELG catalogue for all languages, any resulting changes propagated proportionally to the entire set of languages, thus making any dramatic changes rather unlikely, unless one deliberately rewarded (i.e., gamed) features known to disproportionately affect one or more particular languages. It is clear that this would have been a biased and unfair manipulation of the DLE Metric, and was therefore avoided, as we wanted the relevant scores to be a fair and bias-free representation of the status of all European languages with respect to DLE.

These preliminary experiments carried out in early 2022 to finalise the setup of the TFs for the DLE Metric demonstrated that the overall distribution of the languages tended to be relatively stable. This was due partly to the sheer number of features and possible feature values that make up the TFs. As a result, even if one changed the weights, with the exception of minor and local fluctuations, three main phenomena were generally observed while testing the DLE Metric and its TF scores.

1. The overall positioning of the languages remained largely stable, with a handful of languages standing out with the highest technological DLE scores (English leading by far, typically over German, Spanish and French, with the second language having roughly half the technological DLE score of English), the many minimally supported languages still displaying extremely low technological DLE scores, and a large group of similarly supported languages in the middle.
2. Clusters of languages with similar LT support according to intuition and expert opinion remained ranked closely together, regardless of the adjustments made to specific weights for individual features and their values.
3. Even when two similarly supported languages changed relative positions (i.e., language A overtook language B in terms of technological DLE score) as a result of adjusting the weights assigned to specific features and their values, their absolute technological DLE scores still remained very close, and the changes in ranking tended not to affect other neighbouring languages on either side in a noticeable manner.

During the preliminary testing that eventually led to the final setup of the TFs in the DLE Metric presented in Tables 1 and 2, we performed focused checks on pairs or small sets of languages spoken by comparable communities and used in nearby areas or similar circumstances, and whose relative status in terms of LT support is well known to the experts. These focused checks involved, e.g., Basque and Galician, Irish with respect to Welsh, and the dozen local languages of Italy (also with respect to Italian itself), etc. Overall, the general stability and consistency demonstrated by the technological DLE scores across different setups of weight assignments for the various features and their possible values for TFs provided evidence of the validity of the DLE Metric as an effective tool to guide developments and track progress towards full DLE for all of Europe’s languages. In essence, the setup eventually selected (Tables 1 and 2) ensures that the DLE Metric optimally captures the real situation of all of Europe’s languages in the digital age, tracking the progress towards DLE.

5.3 Computing the Technological Scores

Based on the above, the steps to calculate the technological DLE score which is part of the DLE Metric are as follows:

1. Each LRT in the ELG catalogue obtains a score ($\mathrm{Score}_{\mathrm{LRT}}$), which is equal to the sum of the weights of its relevant features (see Tables 1 and 2 for the weights and associated values). Specifically for features Annotation Type and Domain, instead of simply adding the respective weight, the weight is multiplied by the number of unique feature values the LR in question has (see Section 5.1).
   Example: Suppose an LRT in the ELG catalogue ($\mathrm{LRT}_1$) has the following features: corpus, annotated, monolingual, with three different annotation types (morphology, syntax, semantics), with text as media type, covering one domain (e.g., finance), with condition of use research use allowed. Then, using the weights as specified in Table 1, $\mathrm{LRT}_1$ is assigned the following score:

   $$\mathrm{Score}_{\mathrm{LRT}_1} = 5 + 1 + 2.5 + (3 \times 0.25) + 1 + (1 \times 0.3) + 3.5 = 14.05$$

2. To compute the technological DLE score for language X ($\mathrm{TechDLE}_{\mathrm{Lang}X}$) we sum up the $\mathrm{Score}_{\mathrm{LRT}}$ of all LRTs that support language X ($\mathrm{LRT}_1, \mathrm{LRT}_2, \ldots, \mathrm{LRT}_N$), i.e.,

   $$\mathrm{TechDLE}_{\mathrm{Lang}X} = \sum_{i=1}^{N} \mathrm{Score}_{\mathrm{LRT}_i}$$

Similarly, any tool or service included in the ELG catalogue receives a partial score with the same procedure, on the basis of the weights presented in Table 2. As the ELG catalogue organically grows over time, the resulting technological DLE scores are constantly updated for all European languages. These scores can be visualised through the DLE dashboard (see Section 7), providing an up-to-date and consistent (i.e., comparable) measurement of the level of LT support and provision that each language of Europe has available, also showing where the status is not ideal or not at the level one might expect.

5.4 Technological DLE Scores of Europe’s Languages

Figure 1 shows the technological DLE scores for all of Europe’s languages as of late February 2023, obtained on the basis of the final weighting and scoring mechanism described in the previous sections.

Not surprisingly, based on the TFs of the DLE Metric, at the time of writing in early 2023, English is still by far the best-resourced language of Europe, leading the way over German and Spanish, which follow with very similar technological DLE scores that are roughly half that of English. French has a marginally lower score, which places it in fourth position.
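Returning to the computation in Section 5.3, its two steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the ELE implementation: the class `LRT`, the functions `score_lrt` and `tech_dle`, and the feature keys are hypothetical names, and the assignment of the numeric weights to individual features is inferred from the order of the terms in the worked example rather than read from Table 1.

```python
from dataclasses import dataclass

# Illustrative weights only. The numbers are those appearing in the worked
# example; which feature each number belongs to is ASSUMED from the order of
# the terms. The authoritative weights are given in Tables 1 and 2.
FEATURE_WEIGHTS = {
    "resource_type:corpus": 5.0,
    "resource_subclass:annotated": 1.0,
    "linguality:monolingual": 2.5,
    "media_type:text": 1.0,
    "conditions_of_use:research_use_allowed": 3.5,
}
ANNOTATION_TYPE_WEIGHT = 0.25  # multiplied by the number of unique annotation types
DOMAIN_WEIGHT = 0.3            # multiplied by the number of unique domains


@dataclass(frozen=True)
class LRT:
    """Hypothetical, simplified view of one catalogue record."""
    languages: frozenset
    features: tuple
    annotation_types: frozenset = frozenset()
    domains: frozenset = frozenset()


def score_lrt(lrt):
    """Step 1: sum the feature weights, scaling Annotation Type and Domain."""
    base = sum(FEATURE_WEIGHTS[name] for name in lrt.features)
    return (base
            + ANNOTATION_TYPE_WEIGHT * len(lrt.annotation_types)
            + DOMAIN_WEIGHT * len(lrt.domains))


def tech_dle(language, catalogue):
    """Step 2: sum the scores of all LRTs that support the given language."""
    return sum(score_lrt(lrt) for lrt in catalogue if language in lrt.languages)


lrt1 = LRT(
    languages=frozenset({"X"}),
    features=tuple(FEATURE_WEIGHTS),
    annotation_types=frozenset({"morphology", "syntax", "semantics"}),
    domains=frozenset({"finance"}),
)
print(round(score_lrt(lrt1), 2))  # 14.05, matching the worked example
```

Calling `tech_dle("X", catalogue)` on a list of such records then yields the language-level score; a language with no supporting records receives 0.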
Italian, Finnish and Portuguese follow at some distance, and it is interesting to note that the next cluster of languages that are spoken by sizeable communities in Europe (e.g., Polish, Dutch, Swedish), still in the top ten of the overall list of languages, have a technological DLE score that is roughly six times lower than that of English. This is a stark reminder, based on evidence provided by the ELG catalogue and measured through the DLE Metric, of the persisting imbalances in the overall digital support of Europe’s languages, showing that urgent decisive action is needed to achieve DLE (Chapter 4 provides a more detailed cross-language comparison).

5.5 Open Issues and Challenges

The technological DLE scores based on the TFs do not take into account the size of the LRs or the quality of the LRTs included in ELG. While these are important features, there exists a large variety of size units for LRs, and the way of measuring data size is not standardised, especially for new types of LRs such as language models. Regarding the quality of tools and services in particular, while some information on the Technology Readiness Level³ scale is available in ELG, the large number of null values makes it difficult to take this aspect into account consistently. These are shortcomings that can be revisited in subsequent efforts, with a view to overcoming these limitations and further improving the overall accuracy and granularity of the technological DLE scores going forward.

As far as datasets are concerned, in particular, there could be benefits in setting a minimum size criterion to include LRs such as corpora or grammars in the computation of the technological DLE score, e.g., to avoid using very small resources that cannot be realistically applied in actual technology development scenarios. However, it is difficult to establish arbitrarily what this minimum size threshold should be, also in recognition of the specifics of the languages of Europe. As a result, the decision was made not to set any minimum size requirement for LRs.
The thinking behind this choice was that relatively small datasets are common in less-resourced languages, for particular domains, etc., and there is the possibility to merge small datasets to create bigger ones that would, in fact, be useful, for instance in domain adaptation for MT, to mention but one example. More broadly, by proposing the DLE Metric we intend to foster a culture of valuing all and any LRTs, especially for less-resourced languages, judiciously balancing the importance given to the size, quantity, diversity and quality of the LRTs, being mindful that several of Europe’s languages are in dire need of support.

3 https://en.wikipedia.org/wiki/Technology_readiness_level

Fig. 1 Technological Digital Language Equality scores as of late February 2023

6 Contextual Factors

While the technological scores based on the TFs represent the technological support of a language, they do not reflect the overall socio-political environment of a language. There are other factors that influence how a language thrives in the digital age, such as political will, funding, being the object of research projects, economic interest, etc. The importance of creating a picture that reflects this environment of a language community was recently also considered by other researchers. Several data-driven studies analyse the relationship between the technical support of a language and non-technological factors (see Section 2).

Related approaches attempt to measure the influence of non-technological factors on the development of LRTs considering often only individual factors in the realm of economy (usually the Gross Domestic Product, GDP), research (e.g., number of publications in specific conferences) and the size of the language community. In the DLE Metric, the Contextual Factors (CFs) are defined as the “general conditions and situations of the broader context” of a language community (Gaspari et al. 2021, p. 7). This definition includes factors from all areas of life assuming that those have an influence on the development and use of LRTs.
Economy: Factors in this area reflect the general and the LRT-specific part of the economy. The overall welfare of the language community and the size of the potential market are important factors for companies to invest in the development of LRTs for a language.

Education: The language and digital literacy level of a language community influences the use of a language online and on digital devices. Additionally, to be able to develop LRTs, researchers with both technical skills and linguistic knowledge of the respective languages are needed.

Funding: Investment in research and innovation in the area of LT is necessary for basic and applied research on which technology development is based.

Industry: Companies, both well-established and startups, are important drivers of the development and distribution of LT applications, tools and services.

Law: The legal framework can hinder progress or steer developments in certain directions.

Media: The creation and distribution of news, newspapers, magazines, films, etc. in a language constitutes, on the one hand, a potentially large dataset for the development of LRTs, and on the other hand, demonstrates the willingness to make content accessible to the language community.

Online: The online representation of a language community indicates that active community members are willing and determined to use the language in the digital world. Additionally, the availability of online data in the respective language gives researchers or developers the opportunity to create LRs.

Policy: Strategic plans and agendas at local, regional and national levels indicate the political will to support a topic and the direction in which policy-makers intend to lead society in the future.

Public Administration: Public authorities represent the state to its citizens. The inclusion and support of languages spoken in the country or region by public authorities enables participation and utilisation within society.
Research & Development & Innovation: Innovations depend on basic and applied research and on the development of products that are ready for the market. This requires a minimum of research positions in relevant institutions and supporting infrastructure.

Society: The social attitude towards a language has a great influence on how much investment, effort and time are put into the preservation of a language by the language community and by the state.

Technology: The technological infrastructure reflects the possibility for a language community to access and take part in the digital world.

6.1 Computing the Contextual Scores

6.1.1 Data Sources and Collection

Initially, 72 potential contextual factors were identified through the collection of factors considered relevant in publications such as, among others, the STOA study (STOA 2018), the META-NET White Paper Series (Rehm and Uszkoreit 2012) and EFNIL’s European Language Monitor (ELM);⁴ we also consulted with the 52 ELE project partners. The 72 tentative CFs were clustered into 12 areas (see above) representing different aspects of a language’s context (Gaspari et al. 2021).

To be measurable, each factor had to be quantified with an indicator, which depended on the existence and accessibility of corresponding data. First, different data sources were collected including, among others, EUROSTAT,⁵ ELM, Ethnologue⁶ and various reports and articles. Second, possible indicators for each factor were considered and matched with the available data. GDP, for example, was considered to be a suitable indicator for the factor “economic size”.
Eventually, 27 of the 72 initial factors had to be excluded due to missing data. This affected especially factors from the areas “research & development & innovation”, “society” and “policy”. Data about policies is essentially too broad and reflects rather coarsely whether policies exist or not. For instance, the factor “presence of local, regional or national strategic plans, agendas, committees working on the language, LT, NLP, etc.” was quantified on data indicating whether a national agenda with regard to AI and LTs exists. Considering also local and regional plans and the existence and maybe also number and size of committees would require much more detailed data. The factors excluded from the class “research & development & innovation” covered mainly figures about the LT research environment, while broader numbers about the research situation of the whole country were indeed available. Tables 4–15 in the Appendix show all factors from the preliminary definition (Gaspari et al. 2021, 2022b), their class and the indicator they were quantified with. Overall, 46 factors were quantified with at least one appropriate indicator, and some with two indicators representing different perspectives like total numbers and numbers per capita.

The data was collected in late 2021. Many sources provided their data as spreadsheets, while some data was published as HTML documents. The data for 15 indicators had to be collected manually from reports and articles. We attempt to update the contextual factors on an annual basis. Preliminary tests indicate that updating the contextual DLE scores for all EU languages takes up to two weeks of work by one member of staff who is familiar with the structure and nature of the CFs.

4 http://www.efnil.org/projects/elm
5 https://ec.europa.eu/eurostat
6 https://www.ethnologue.com

6.1.2 Data Processing

The collected CF data was very heterogeneous: it had different formats, was based on country or language community level, included differing languages or countries and consisted of different data types.
Data preparation took several steps, including data format standardisation, harmonising language names based on Glottolog (Hammarström et al. 2021) and data merging. Some sources provided plain text from which a score had to be manually determined. Features mentioned in the text, e.g., regarding the existence of a national LT policy, were quantified with a number and this number was assigned to countries or language communities. If the text included more than one feature, the numbers were added up, e.g., if a country published several policies covering the topic AI and LTs. Table 3 (p. 68) shows a list of the indicators transformed from plain text.

The DLE Metric processes data on a per-language basis. Thus, data collected on the country level had to be converted to the language level. In total, the factors were quantified with three different types of data, namely absolute numbers, proportional numbers, and scores. Total numbers were split proportionally, using the percentage of speakers of the language per country. The percentages were calculated through population size and number of speakers. Due to some gaps and old records, experts from the ELE consortium were asked to provide missing or more up-to-date and reliable data. The figures for Alsatian, Faroese, Gallo, Icelandic, Macedonian and the Saami languages were corrected accordingly.

Languages often taught as a second language (English, German, French, Spanish) were only included in the mapping if the language had an official status in the country. For example, the figures for English consist of the figures of the UK, Ireland and Malta (in other European countries, English does not have official status). If the language was an official national language in at least one country, only language communities with more than one percent were included to simplify the mapping. Total numbers per capita of a language community, proportional numbers, and scores were applied to the language communities without adjustment.
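The proportional split of country-level totals described above can be illustrated with a minimal sketch. The country, its language communities and all figures are hypothetical, and the function name `split_country_total` is an assumption made for this illustration only.

```python
def split_country_total(total, speakers, population):
    """Assign a country-level total to one language community
    in proportion to its share of the country's population."""
    return total * speakers / population


# Hypothetical country with two language communities.
population = 10_000_000
country_total = 500.0  # an absolute number reported at country level
speakers = {"Language A": 9_000_000, "Language B": 1_000_000}

per_language = {
    language: split_country_total(country_total, n, population)
    for language, n in speakers.items()
}
print(per_language)  # {'Language A': 450.0, 'Language B': 50.0}
```

Proportional numbers, scores and per capita figures would, by contrast, be copied to each language community unchanged.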
If a language was spoken in more than one country, total numbers were added up, while proportional numbers, scores and total numbers per capita were calculated through the average; the different sizes of the language communities were partly taken into account, hence, the data values of bigger language communities were weighted double for the calculation of the average. However, a more complex inclusion of the size of the language community would result in more fine-grained figures, which would probably affect the contextual DLE scores to some extent.

6.1.3 Calculation of the Contextual Digital Language Equality Score

The data referring to each language community was converted into contextual DLE scores, which indicate the extent to which a language has a context that supports the possibility of evolving digitally or not. Without the political will, funding, innovation and economic interest in the respective region, the probability of achieving DLE is low. Given the underlying complexity, in order for the contextual scores to be easily conceptualised and comparable across languages, a relative score between 0 and 1 was assigned to each language, with 0 representing a context with no potential for the development of LT, and 1 representing the best potential. To keep this part of the DLE Metric as transparent as possible, we decided to base the calculation on an average of the factors. Therefore, the intermediate goal was to calculate a score between 0 and 1 for each factor. The language with the lowest value for the respective factor was attributed 0, while the language with the highest value received 1. The following steps were conducted to calculate the contextual DLE score for each European language:

1. Calculation of the range: range = highest value − lowest value;
2. Percentage weighting of a language within the range: $\frac{(\text{value} - \text{minimum}) \times 100}{\text{range}}$;
3. The result is a relative value: to obtain a score between 0 and 1 the result is divided by 100;
4. Apply steps 1–3 for all languages and factors;
5. Calculate the average of all factors per language;
6. Weighting of the scores with the three chosen factors of
   a. number of speakers,
   b. scores based on the language status, and
   c. whether the language is an official EU language or not.

The three weighting factors were considered to be particularly relevant for the context to develop LRTs due to the influence of the number of speakers on the investment by large companies and its official status in the EU on the amount of funding. The weighting included two steps: 1. calculating the average of the overall scores, the scores for the number of speakers and the legal status and 2. adding 0.07 to the score for each official EU language. The second step was separated from the average calculation, because the indicator consisted of two values, 1 if it is an official EU language and 0 if it is not. The average calculation would result in an excessively strong boost for all official EU languages. Hence, with the data for the contextual factors available at the end of 2021, English already had a score of around 0.7–0.8 without the boost. Smaller values for EU languages would have penalised English, which would not have represented reality.

We created five different versions of the possible configurations of the CFs to conduct a thorough comparative evaluation. The factors were classified based on a number of overall properties, i.e., if a data point can be updated automatically or if the data is considered high quality (see Tables 4–15). Data quality was chosen to avoid bias in the overall result caused by extreme maximum and minimum values. For example, for the quantification of the factor “number of podcasts”, several platforms were found which could have provided numbers of podcasts in different European languages, but because of different target audiences, the values were highly skewed to the languages spoken by those target audiences.
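The normalisation and weighting steps of Section 6.1.3 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names, the language labels and all input values are hypothetical, and the scores for the number of speakers and for the language status are assumed to be already scaled to the range 0 to 1. Only the 0.07 boost and the order of the steps are taken from the text.

```python
from statistics import mean

EU_BOOST = 0.07  # added to the score of each official EU language


def min_max(values):
    """Steps 1-3: rescale one factor so that the lowest value maps to 0
    and the highest to 1."""
    lowest, highest = min(values.values()), max(values.values())
    value_range = highest - lowest  # step 1
    if value_range == 0:
        # Assumption of this sketch: a constant factor carries no information.
        return {language: 0.0 for language in values}
    return {
        language: ((value - lowest) * 100 / value_range) / 100  # steps 2 and 3
        for language, value in values.items()
    }


def contextual_scores(factors, speakers_score, status_score, official_eu):
    normalised = {name: min_max(vals) for name, vals in factors.items()}  # step 4
    scores = {}
    for language in speakers_score:
        overall = mean(normalised[name][language] for name in normalised)  # step 5
        # Step 6, part 1: average with the speaker and status scores.
        weighted = mean([overall, speakers_score[language], status_score[language]])
        # Step 6, part 2: fixed boost for official EU languages.
        if official_eu[language]:
            weighted += EU_BOOST
        scores[language] = weighted
    return scores


# Hypothetical input data for three languages and two factors.
factors = {
    "factor 1": {"A": 100.0, "B": 50.0, "C": 0.0},
    "factor 2": {"A": 10.0, "B": 10.0, "C": 0.0},
}
scores = contextual_scores(
    factors,
    speakers_score={"A": 1.0, "B": 0.5, "C": 0.0},
    status_score={"A": 0.4, "B": 1.0, "C": 0.0},
    official_eu={"A": True, "B": False, "C": False},
)
print({language: round(score, 2) for language, score in scores.items()})
```

Keeping the EU boost outside the average, as in the text, means it shifts a score by a fixed amount instead of contributing a full third of the weighted value.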
Factors which were quantified with data reflecting no big differences between languages were also excluded by the quality criterion, e.g., the literacy level of all countries varied between 98 and 99 percent, i.e., hardly at all. To be able to update the metric on a regular basis without much manual effort after the end of the ELE project, the possibility of collecting the data fully automatically was picked as the other main criterion.

Based on these criteria, the following CF configurations were examined:

1. Factors with available data: 46 factors
2. Factors that can be updated automatically: 34 factors
3. Factors with good or high data quality: 26 factors
4. Factors that can be updated automatically and that also have good or high data quality: 21 factors
5. A set of manually curated factors using four criteria: automatically updatable, good/high data quality, a maximum of two factors per class, balance between data types: 12 factors (Table 16 shows the factors included in this configuration)

Including fewer factors in the metric increased the risk of omitting an important factor. On the other hand, including fewer factors also reduced the risk of distorting the metric with more data.

6.2 Experts Consultation

Considering that appropriate baselines do not exist, we validated the five different results through the consultation of experts. Individual contextual scores can be interpreted by comparing them to the scores of other languages.

The panel consisted of ELE consortium partners. We selected the members based on their expertise and experience in the areas of LT, Computational Linguistics and Linguistics. Moreover, the experts represented different European countries and were very familiar with the background of their countries and languages spoken there. We reached out to 37 of the 52 ELE partner organisations. They received the results of the five configurations of the metric and were asked to provide assessments regarding the languages they knew, to explain how they would have expected the results to be, and to indicate the most appropriate configuration.
In total, 18 partners provided assessments. The feedback consisted of overall ratings of the five configurations as well as detailed comments regarding individual languages. As a consequence, most answers related to official EU languages. RMLs for which feedback was received are spoken in the UK, Spain, Italy and the Nordic countries. We received feedback on 56 of the 89 languages.

In general, using all factors was evaluated as risky due to the possible distortion of results caused by data of bad quality. The results of configuration 1 were considered unexpected, with high scores for languages such as Emilian, Gallo and Franco-Provençal, probably caused by distorted data. The second configuration was criticised, too, except for positive comments on the automatic nature of the metric. The results were less distorted but evaluated as worse compared to configurations 3–5. The results of configurations 4 and 5 were similar. Focusing on quality data improved the results significantly. With fewer factors, configuration 5 provided results similar to those of configuration 4. Configuration 5 was assessed positively regarding the transparency of fewer factors and the possibility to balance the classes.

Overall, the results of the fifth configuration were assessed to represent the context of the language communities in the most adequate way, while there is still room for improvement for a few languages. Table 17 (p. 73) provides more details.

Several suggestions for improvements were made. Since only pan-European data sources were taken into account for reasons of consistency and comparability, one recommendation concerned extending the data through relevant national and regional sources. One expert pointed out that the context of European languages spoken in countries outside of Europe was excluded, and these missing statistics on the development of LRTs would greatly impact the overall scores, e.g., Portuguese in Brazil.
Another suggestion referred to missing factors, such as the vitality status of a language, which is particularly important for RMLs, or a factor representing the competition of a national language with English as the other official national language, which often still dominates daily life, e.g., in Ireland, and prevents more widespread use of the other national language in these areas. Another idea was to replace the official EU status as a weighting factor with the country’s membership in the European Economic Area (EEA), since these countries also have access to European research funds.

Suggestions were also made regarding the presentation of the results. Language communities having particularly complex political backgrounds are most likely to be misrepresented by a simple calculation based on country-specific data, and should be highlighted and presented with the limits of solely data-driven work for such cases. It was also suggested that languages without a writing system should be emphasised as special cases for the development of LRTs.

Some feedback expressed reservations about the whole approach. A few reviewers pointed out that a single methodology should not be used to take into account the different complex contexts and realities of Europe’s language communities. For example, languages like Maltese, Irish and the other Celtic languages, which scored better than expected according to our experts, are of note here. The relative prosperity of the United Kingdom, even though it is no longer an EU Member State, seems to boost the RMLs spoken in the UK, although in reality these RMLs are strongly dominated by English. The same applies to Ireland, which has a strong economy, a large ICT sector and significant investments in (English) AI and LT research and development, but a very low level of support for Irish LT.

Another point of criticism was the inclusion of data not applied on a per capita basis. As a result, despite having relatively good support, some small language communities were unable to achieve a high score.
The size of the language community has an impact on the economic interest, investment, number of researchers, etc. for the language, but for small language communities that have already invested a lot in their language and infrastructure, some of the scores obtained may appear too low compared to the expectations of the experts. These criticisms can be debated at length, especially in the interest of finding effective solutions to the identified issues, but are very difficult to avoid altogether with such a quantitative approach as the one that is required to define and measure the CFs as part of the DLE Metric.

These first stable results for the CF calculation were improved based on a more fine-grained data mapping from country to language community level and the feedback of the experts. The aggregation of data points from different countries for languages spoken in several countries, e.g., French, was based on the average with a boost for the data points collected from the countries in which the language has an official national status. This process was replaced by the calculation of a weighted average based on the number of speakers of the language communities, which reflects the distribution of the language communities better and prevents distortion through too small or too big language communities. In addition, the boost for EU Member States was changed to a boost for countries in the EEA, the vitality status was added as a penalty for declining languages, and those competing with English as the other dominant official national language were also penalised. The results of this adaptation decreased the number of languages that eventually achieved an excessively high contextual score.

6.3 Contextual DLE Scores of Europe’s Languages

In all examined configurations, the top third is dominated by the official EU languages, while the RMLs are part of the long tail to the right. Official national languages which are not official EU languages are ranked between the official EU languages and the RMLs.
Figure 2 shows the final results after the adaptation. As expected, English has the best context for the development of LRTs by far. It is followed by German and French. Italian and Spanish are shown in positions 4 and 5. The position of Spanish after Italian is caused by the inclusion of data from European countries only. If data had been included from countries outside of Europe, Spanish, Portuguese, French and English would have had much higher scores. After the five leading languages, variations between the different configurations can be seen. Swedish, Dutch, Danish, Polish, Croatian, Hungarian and Greek are ranked in the upper half of the official EU languages. The official EU languages with the lowest scores are Latvian, Lithuanian, Bulgarian, Romanian, Maltese and Irish, which joined this group after the last adjustment.

Fig. 2 Contextual Digital Language Equality scores as of late February 2023

Among the group of official national languages which are not official EU languages, Norwegian, Icelandic and Serbian are the top performers, achieving contextual DLE scores in line with the middle- and lower-scoring official EU languages, while Manx⁷ is presented as a downward outlier. Languages such as Norwegian, Luxembourgish, Faroese and Icelandic achieve better scores than Albanian, Turkish, Macedonian and Bosnian.

The RMLs are led by languages spoken in the more Northern countries like some Saami languages, Western Frisian and Welsh or languages spoken by quite big language communities like Catalan. A total of 23 RMLs achieve contextual DLE scores equal to or lower than 0.05 in the final results, while 30 of the languages obtain scores between 0.06 and 0.1. Kildin Saami and Griko are the languages with the lowest scores.

7 Manx and Jèrriais have been assigned to the group of national languages without being an official EU language, as both languages are recognised as official languages of Jersey and the Isle of Man. Neither island is part of the United Kingdom; both are Crown Dependencies. Therefore, the two languages can be considered either official national languages or RMLs.

6.4 Open Issues and Challenges

The contextual DLE scores calculated have some limitations (see Section 6.2). First, expanding the dataset to include regional or national sources would result in 1. a higher number of factors, 2. improved data quality, as the gaps in individual indicators may be filled, 3. quantification of more factors with more than one indicator, to reflect different perspectives, and 4. a more complex mapping to language communities based on regional data resulting in a significant impact on RMLs.

Second, the data cleaning procedure can be improved. One possibility would be to replace outliers, i.e., values lying outside twice the standard deviation, by the respective maximum or minimum values of the data series. Data gaps could be filled using data from previous years and skewed data could be corrected using a square root transformation. These processing steps could decrease the impact of distorted data.

An improvement of the mapping from country level to language level could represent regional or urban-rural divides more accurately, especially for larger countries. In particular, the missing mapping of proportional data, scores and total numbers per capita has a major impact on the resulting contextual DLE scores. Here, regional data could help calculate the average deviation of individual regions or language communities from other proportional data and to transfer this deviation to proportional data only found on the national level, and similarly for the total figures per capita.

Romaine (2017, p. 49) stresses the importance of an “on-going monitoring of individual communities” for a reliable evaluation of the situation regarding language diversity, which was taken into account with the inclusion of the criterion of automatic updatability of the factors. One problem concerns the eventual interdependencies of the values: the scores of all languages may change if new values for some language communities are added, even if the situation of another language community itself has not changed.
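This interdependency follows directly from the relative rescaling of Section 6.1.3 and can be illustrated with a minimal sketch. The language labels and values are hypothetical, and `min_max` is a helper written for this illustration only.

```python
def min_max(values):
    """Rescale values so that the lowest maps to 0 and the highest to 1."""
    lowest, highest = min(values.values()), max(values.values())
    return {k: (v - lowest) / (highest - lowest) for k, v in values.items()}


# Hypothetical raw values of one factor for three languages.
before = min_max({"A": 10.0, "B": 40.0, "C": 100.0})
# A fourth language with a higher value is added; A, B and C are unchanged.
after = min_max({"A": 10.0, "B": 40.0, "C": 100.0, "D": 190.0})

print(round(before["B"], 3), round(after["B"], 3))  # 0.333 0.167
```

The raw value of language B is identical in both runs, yet its relative score drops from one third to one sixth once language D extends the range.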
A temporal dimension could be added to mitigate this.

7 Digital Language Equality Dashboard

In order to provide a precise and easy-to-use tool for presenting and monitoring the TFs and CFs that contribute to the DLE Metric, we designed and implemented a web-based dashboard as part of the European Language Grid.⁸ It is available at:

https://live.european-language-grid.eu/catalogue/dashboard

The dashboard shows the contents of the ELG database as interactive visuals dynamically created by user queries, thus providing constantly up-to-date and consistent (i.e., comparable) measurements of the level of LT support and provision across all of Europe’s languages (Figure 3). The dashboard provides the figures, statistics and graphs, as appropriate, for:

• the TFs and CFs of the DLE Metric, calculated according to the detailed technical description presented above;
• LRTs hosted in the ELG catalogue, which constitute the source/base data for the TFs that are at the basis of the technological DLE score.

Architecturally, the DLE dashboard consists of two layers: the database of the ELG catalogue and the frontend. The ELG database contents are indexed and saved in JSON. Each user query retrieves the respective results from JSON and exposes them to the frontend. While the TFs are calculated dynamically (see Section 5.3) and they reflect the status of the ELG catalogue’s database at the time of accessing the dashboard, in the current implementation the CFs are calculated offline, stored in a separate file and exposed to the respective tab of the dashboard’s frontend.

8 https://www.european-language-grid.eu

Fig. 3 DLE dashboard showing the technological (top) and contextual DLE scores (bottom)

8 Conclusions and Future Work

This chapter has introduced the definition of DLE adopted in ELE and has described the DLE Metric, explaining the roles and setups of the complementary TFs and CFs and how the scores are computed. By providing an empirically-grounded and realistic quantification of the level of technological support of the languages of Europe, the DLE Metric is intended to contribute to future efforts to level up the digital support of all of Europe’s languages, most notably with the implementation of the evidence-based SRIA and roadmap that will drive future efforts in equipping all European languages with the LRTs needed to achieve full DLE (see Chapter 45). The DLE Metric provides a transparent means to track and monitor the actual progress in this direction, as the technological and contextual DLE scores can be visualised through the DLE dashboard.

The overview of the TFs and CFs is accompanied by discussions of the scoring and weighting mechanisms adopted for the computation of the technological and contextual DLE scores, following extensive testing and expert consultations comparing alternative setups. The chapter explains the overall design of the features and their values with the scores and weighting mechanisms that contribute to the DLE Metric scores, based on data included in the ELG catalogue and the factors eventually selected to represent the specific ecosystems of the languages and their communities. As a result of this, the notion of DLE and its associated metric introduced in this chapter represent valuable tools on which to base future efforts to measure and improve the readiness of Europe’s languages for the digital age, also taking into account the situational contexts in which the various languages are used via the CFs.

Thanks to the descriptive, diagnostic and predictive value of the DLE Metric, the community now has a solid and verifiable means of pursuing and evaluating much-needed developments in the interest of all languages of Europe and their speakers.
The DLE Metric is relevant to a wide range of stakeholders at local, regional, national and European levels who are committed to preventing the extinction of European languages under threat and who are interested in promoting their prosperity for the future. Such stakeholders include decision- and policy-makers, industry leaders, researchers, developers, and citizens across Europe who will drive forward future developments in the fields of LT and language-centric AI in the interest of DLE.

References

Blasi, Damian, Antonios Anastasopoulos, and Graham Neubig (2022). “Systematic Inequalities in Language Technology Performance across the World’s Languages”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, pp. 5486–5505. DOI: 10.18653/v1/2022.acl-long.376. https://aclanthology.org/2022.acl-long.376.

Bromham, Lindell, Russell Dinnage, Hedvig Skirgard, Andrew Ritchie, Marcel Cardillo, Felicity Meakins, Simon Greenhill, and Xia Hua (2021). “Global predictors of language endangerment and the future of linguistic diversity”. In: Nature Ecology & Evolution 6, pp. 163–173. https://doi.org/10.1038/s41559-021-01604-y.

Eisele, Andreas, Christian Federmann, Hans Uszkoreit, Herve Saint-Amand, Martin Kay, Michael Jellinghaus, Sabine Hunsicker, Teresa Herrmann, and Yu Chen (2008). “Hybrid machine translation architectures within and beyond the EuroMatrix project”. In: Proceedings of the 12th Annual conference of the European Association for Machine Translation, EAMT 2008, Hamburg, Germany, September 22-23, 2008. Ed. by John Hutchins, Walther Hahn, and Bente Maegaard. European Association for Machine Translation, pp. 27–34. https://aclanthology.org/2008.eamt-1.6/.
Faisal,Fahim,YinkaiWang,andAntoniosAnastasopoulos(2022).“DatasetGeography:Mapping Language DatatoLanguageUsers”.In: Proceedings of the 60th Annual Meeting of the Associ­ation for Computational Linguistics (Volume 1: Long Papers).Dublin,Ireland:Associationfor Computational Linguistics, pp. 3381–3411. DOI: 10.18653/v1/2022.acl-long.239. https://acla nthology.org/2022.acl-long.239. Gaspari, Federico, Owen Gallagher, Georg Rehm, Maria Giagkou, Stelios Piperidis, Jane Dunne, andAndyWay(2022a).“IntroducingtheDigitalLanguageEqualityMetric:TechnologicalFac-tors”. In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begona Altuna, Aritz Farwell, and German Rigau. Marseille, France, pp. 1–12. http://www.lrec-conf.org/proceedings/lrec2022/workshop s/TDLE/pdf/2022.tdle-1.1.pdf. Gaspari,Federico, AnnikaGrützner-Zahn,Georg Rehm,OwenGallagher,Maria Giagkou,Stelios Piperidis, and Andy Way (2022b). Deliverable D1.3 Digital Language Equality (full specifi­cation). European Language Equality (ELE); EU project no. LC-01641480 – 101018166 ELE. https://european-language-equality.eu/reports/DLE-definition.pdf. Gaspari, Federico, Andy Way, Jane Dunne, Georg Rehm, Stelios Piperidis, and Maria Giagkou (2021). Deliverable D1.1 Digital Language Equality (preliminary definition). European Lan­guageEquality(ELE); EU project no. LC-01641480 – 101018166. https://european-language­ equality.eu/reports/DLE-preliminary-definition.pdf. Giagkou, Maria, Penny Labropoulou, Stelios Piperidis, Miltos Deligiannis, Athanasia Kolovou, andLeonVoukoutis(2022). Deliverable D1.37 Database and Dashboard.EuropeanLanguage Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equali ty.eu/reports/DLE-dashboard.pdf. Grützner-Zahn,AnnikaandGeorgRehm(2022).“IntroducingtheDigitalLanguageEqualityMet­ric: Contextual Factors”. 
In: Proceedings of the Workshop Towards Digital Language Equality (TDLE 2022; co-located with LREC 2022). Ed. by Itziar Aldabe, Begona Altuna, Aritz Farwell, and German Rigau. Marseille, France, pp. 13–26. http://www.lrec-conf.org/proceedings/lrec2022/workshops/TDLE/pdf/2022.tdle-1.2.pdf.

Hammarström, Harald, Robert Forkel, Martin Haspelmath, and Sebastian Bank (2021). Glottolog 4.5. Leipzig: Max Planck Institute for Evolutionary Anthropology. https://doi.org/10.5281/zenodo.5772642.

Hinrichs, Erhard and Steven Krauwer (2014). “The CLARIN Research Infrastructure: Resources and Tools for eHumanities Scholars”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 1525–1531. http://www.lrec-conf.org/proceedings/lrec2014/pdf/415_Paper.pdf.

Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury (2020). “The State and Fate of Linguistic Diversity and Inclusion in the NLP World”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Online: Association for Computational Linguistics, pp. 6282–6293. DOI: 10.18653/v1/2020.acl-main.560. https://aclanthology.org/2020.acl-main.560.

Khanuja, Simran, Sebastian Ruder, and Partha Talukdar (2022). Evaluating Inclusivity, Equity, and Accessibility of NLP Technology: A Case Study for Indian Languages. DOI: 10.48550/ARXIV.2205.12676. https://arxiv.org/abs/2205.12676.

Krauwer, Steven (2003). “The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap”. In: Proceedings of the International Workshop Speech and Computer (SPECOM 2003). Moscow, Russia.
Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis, Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Arranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3421–3430. https://www.aclweb.org/anthology/2020.lrec-1.420/.

Piperidis, Stelios, Penny Labropoulou, Dimitris Galanis, Miltos Deligiannis, and Georg Rehm (2023). “The European Language Grid Platform: Basic Concepts”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cham: Springer, pp. 13–36. DOI: 10.1007/978-3-031-17258-8_2. https://doi.org/10.1007/978-3-031-17258-8_2.

Rehm, Georg, ed. (2023a). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.

Rehm, Georg (2023b). “European Language Grid: Introduction”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 1–10.

Rehm, Georg, Federico Gaspari, German Rigau, Maria Giagkou, Stelios Piperidis, Annika Grützner-Zahn, Natalia Resende, Jan Hajic, and Andy Way (2022). “The European Language Equality Project: Enabling digital language equality for all European languages by 2030”. In: The Role of National Language Institutions in the Digital Age – Contributions to the EFNIL Conference 2021 in Cavtat. Ed. by Željko Jozić and Sabine Kirchmeier.
Budapest, Hungary: Nyelvtudományi Kutatóközpont, Hungarian Research Centre for Linguistics, pp. 17–47.

Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Khalid Choukri, Andrejs Vasiljevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriute, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás Váradi, Kadri Vider, Andy Way, and François Yvon (2020). “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325. https://www.aclweb.org/anthology/2020.lrec-1.407/.

Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Languages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.

Rehm, Georg and Hans Uszkoreit, eds. (2013). The META-NET Strategic Research Agenda for Multilingual Europe 2020. Heidelberg etc.: Springer. http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf.
Rehm, Georg, Hans Uszkoreit, Sophia Ananiadou, Núria Bel, Audrone Bielevičiene, Lars Borin, António Branco, Gerhard Budin, Nicoletta Calzolari, Walter Daelemans, Radovan Garabík, Marko Grobelnik, Carmen García-Mateo, Josef van Genabith, Jan Hajič, Inma Hernáez, John Judge, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Joseph Mariani, John McNaught, Maite Melero, Monica Monachini, Asunción Moreno, Jan Odijk, Maciej Ogrodniczuk, Piotr Pęzik, Stelios Piperidis, Adam Przepiórkowski, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadina, Koenraad De Smedt, Marko Tadić, Paul Thompson, Dan Tufiș, Tamás Váradi, Andrejs Vasiljevs, Kadri Vider, and Jolanta Zabarskaite (2016). “The Strategic Impact of META-NET on the Regional, National and International Level”. In: Language Resources and Evaluation 50.2, pp. 351–374. DOI: 10.1007/s10579-015-9333-4. http://link.springer.com/article/10.1007/s10579-015-9333-4.

Romaine, Suzanne (2017). “Language Endangerment and Language Death”. In: The Routledge Handbook of Ecolinguistics. Abingdon, Oxfordshire: Routledge, pp. 40–55. DOI: 10.4324/9781315687391.ch3. https://www.routledgehandbooks.com/doi/10.4324/9781315687391.ch3.

Simons, Gary F., Abbey L. Thomas, and Chad K. White (2022). Assessing Digital Language Support on a Global Scale. DOI: 10.48550/ARXIV.2209.13515. https://arxiv.org/abs/2209.13515.

Soria, Claudia, Núria Bel, Khalid Choukri, Joseph Mariani, Monica Monachini, Jan Odijk, Stelios Piperidis, Valeria Quochi, and Nicoletta Calzolari (2012). “The FLaReNet Strategic Language Resource Agenda”. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012. Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis. European Language Resources Association (ELRA), pp. 1379–1386.
http://www.lrec-conf.org/proceedings/lrec2012/summaries/777.html.

STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2. https://data.europa.eu/doi/10.2861/136527.

Yvon, François and Viggo Hansen (2010). “iTranslate4.eu: Internet translators for all European languages”. In: Proceedings of the 14th Annual conference of the European Association for Machine Translation, EAMT 2010, Saint Raphaël, France, May 27-28, 2010. European Association for Machine Translation. https://aclanthology.org/2010.eamt-1.41/.

Appendix

| Feature | Value | Weight |
|---|---|---|
| Resource Type | Corpus | 5 |
| | Lexical conceptual resource | 1.5 |
| | Language description | 3.5 |
| Subclass | Raw corpus | 0.1 |
| | Annotated corpus | 2.5 |
| | Computational lexicon | 2 |
| | Morphological lexicon | 3 |
| | Terminological resource | 3.5 |
| | Wordnet | 4 |
| | Framenet | 4 |
| | Model | 5 |
| | Each of the others (there are 15 more) | 0.5 |
| Linguality Type | Multilingual | 5 |
| | Bilingual | 2 |
| | Monolingual | 1 |
| Media Type | Text | 1 |
| | Image | 3 |
| | Video | 5 |
| | Audio | 2.5 |
| | Numerical text | 1.75 |
| Annotation Type | Each of these – can be combined in a single LR | 0.25 |
| Domain | Each of these – can be combined in a single LR | 0.3 |
| Conditions of Use | Other specific restrictions | 0.5 |
| | Commercial uses not allowed | 1 |
| | No conditions | 5 |
| | Derivatives not allowed | 1.5 |
| | Redistribution not allowed | 2 |
| | Research use allowed | 3.5 |

Table 1 Weights assigned to the technological factors of the DLE Metric for language resources

| Feature | Value | Weight |
|---|---|---|
| Language Independent | False | 5 |
| | True | 1 |
| Input Type | Input text | 2 |
| | Input audio | 5 |
| | Input image | 7.5 |
| | Input video | 10 |
| | Input numerical text | 2.5 |
| Output Type | Output text | 2 |
| | Output audio | 5 |
| | Output video | 10 |
| | Output image | 7.5 |
| | Output numerical text | 2.5 |
| Function Type | Text processing | 3 |
| | Speech processing | 10 |
| | Information extraction and information retrieval | 7.5 |
| | Translation technologies | 12 |
| | Human-computer interaction | 15 |
| | Natural language generation | 20 |
| | Support operation | 1 |
| | Image/video processing | 13 |
| | Other | 1 |
| | Unspecified | 1 |
| Domain | Each of these – can be combined in a single tool | 0.5 |
| Conditions of Use | Unspecified | 0 |
| | Other specific restrictions | 0.5 |
| | No conditions | 5 |
| | Commercial uses not allowed | 1 |
| | Derivatives not allowed | 1.5 |
| | Redistribution not allowed | 2 |
| | Research use allowed | 3.5 |

Table 2 Weights assigned to the technological factors of the DLE Metric for tools and services

| Factor | Merging of the Scores | Conversion from Text to Scores |
|---|---|---|
| Public funding available for LTs | Adding up scores for each country | 1 for regional funding; 1 for national funding; 1 for intranational funding; 1 for ESIF; 1 for EUREKA; 1 for EUROSTAT |
| Legal status and legal protection | Adding up scores per language | 10 for statutory national language; 10 for de facto national working language; 2 for statutory provincial language; 2 for statutory provincial working language; 1 for recognised language |
| Publicly available media outcomes | Adding up two scores: one score for language transfer practices for cinema works screened and one for television works broadcast | 2 for dub; 1.5 for voiceover; 1.5 for sub and dub; 1 for sub |
| | Adding up scores + division by the number of answers | Broadcast in original language: 5 for mostly/always, 2.5 for sometimes; Broadcast with dubbing: 4 for mostly/always, 2 for sometimes; Broadcast in original language with voice-over: 3 for mostly/always, 1.5 for sometimes; Dual-channel sound: 2 for mostly/always, 1 for sometimes; Broadcast with subtitles: 1 for mostly/always, 0.5 for sometimes |
| Presence of local, regional or national strategic plans | One of the scores per country | 1 for no plan/strategy; 2 for a plan without mentioning LT; 3 for a plan mentioning LT; 4 for a plan mentioning LT and minority and regional languages |
| Political activity | Adding up scores per country | 1 score for each document; 1 score for each document mentioning LT; 2 for each document exclusively about LT; 1 for a document covering a specific language; 2 for each document published 2020/2021; 1 for each document published 2019/2018 |

Table 3 Contextual factors: Conversion from plain text into scores

In Tables 4 to 15, an indicator marked * is automatically updateable and an indicator marked ** provides good quality data.

ECONOMY

| Factor | Indicator |
|---|---|
| Size of the economy | Annual GDP; GDP per capita* ** |
| Size of the LT/NLP market | LT market in million Euro |
| Size of the language service, translating or interpreting market | Number of organisations from the industry in the ELG catalogue* ** |
| Size of the IT/ICT sector | Perc. of the ICT sector in the GDP* **; ICT service exports in balance of payment* ** |
| Investment instruments into AI/LT | GDE on R&D in relevant areas* |
| Regional/national LT market | No indicator found |
| Average socio-economic status | Annual net earnings, 1.0 FTE worker* **; Life expectancy at age 60** |

Table 4 Contextual factors: Proposed factors for class “Economy”

EDUCATION

| Factor | Indicator |
|---|---|
| Higher Education Institutions operating in the language | No indicator found |
| Higher education in the language | No indicator found |
| Academic positions in relevant areas | Head count of R&D personnel |
| Academic programmes in relevant areas | No indicator found |
| Literacy level | Literacy rate* |
| Students in language/LT/NLP curricula | Total no. of students in relevant areas* ** |
| Equity in education | Proportional tertiary educ. attainment* ** |
| Inclusion in education | Percentage of foreigners attaining tertiary education* ** |

Table 5 Contextual factors: Proposed factors for class “Education”

FUNDING

| Factor | Indicator |
|---|---|
| Funding available for LT research projects | No. of projects funded in relevant areas*; Score from the national funding programmes |
| Venture capital available | Venture capital amounts in Euro |
| Public funding for interoperable platforms | Number of platforms** |

Table 6 Contextual factors: Proposed factors for class “Funding”

INDUSTRY

| Factor | Indicator |
|---|---|
| Companies developing LTs | No. of enterprises in the ICT area* ** |
| Start-ups per year | Percentage of “Enterprise births”** |
| Start-ups in LT/AI | Number of AI start-ups* ** |

Table 7 Contextual factors: Proposed factors for class “Industry”

LAW

| Factor | Indicator |
|---|---|
| Copyright legislation and regulations | No indicator found |
| Legal status and legal protection | Scores out of the legal status* ** |

Table 8 Contextual factors: Proposed factors for class “Law”

MEDIA

| Factor | Indicator |
|---|---|
| Subtitled or dubbed visual media | Scores out of language transfer practices*; Scores out of answers about broadcast practices |
| Transcribed podcasts | Number of entries in the CBA* |

Table 9 Contextual factors: Proposed factors for class “Media”

ONLINE

| Factor | Indicator |
|---|---|
| Digital libraries | Percentage of contribution to Europeana |
| Impact of language barriers on e-commerce | Percentage of population buying cross-border** |
| Digital literacy | No indicator found |
| Wikipedia pages | Number of articles in Wikipedia* ** |
| Websites exclusively in the language | No indicator found |
| Websites in the language (not exclusively) | Perc. of websites in the languages* ** |
| Web pages | No indicator |
| Ranking of websites delivering content | 12 selected websites supporting the languages |
| Labels and lemmas in knowledge bases | Number of lexemes in Wikipedia* ** |
| Language support gaps | Language matrix of supported features* |
| Impact on E-commerce websites | T-Index* |

Table 10 Contextual factors: Proposed factors for class “Online”

POLICY

| Factor | Indicator |
|---|---|
| Presence of strategic plans, agendas, etc. | Scores out of a list of the published national AI strategies; Scores from questionnaire about strategies |
| Promotion of the LR ecosystem | No indicator found |
| Consideration of bodies for the LR citation | No indicator found |
| Promotion of cooperation | No indicator found |
| Public and community support for resource production best practices | No indicator found |
| Policies regarding BLARKs | No indicator found |
| Political activity | Scores out of the list of documents |

Table 11 Contextual factors: Proposed factors for class “Policy”

PUBLIC ADMINISTRATION

| Factor | Indicator |
|---|---|
| Languages of public institutions | No. of constitutions written in the language |
| Available public services in the language | Percentage of a maximum score about digital public services**; Score for digital public services** |

Table 12 Contextual factors: Proposed factors for class “Public administration”

RESEARCH & DEVELOPMENT & INNOVATION

| Factor | Indicator |
|---|---|
| Innovation capacity | Innovation Index* ** |
| Research groups in LT | Number of research organisations |
| Research groups/companies predominantly working on the respective language | No indicator found |
| Research staff involved in LT | No indicator found |
| Suitably qualified Research staff in LT | No indicator found |
| Capacity for talent retention in LT | No indicator found |
| State of play of NLP/AI | No indicator found |
| Scientists working in LT/on the language | Number of researchers in relevant areas* |
| Researchers whose work benefits from LRs and LTs | No indicator found |
| Overall research support staff | Head count of research support staff* ** |
| Scientific associations or general scientific and technology ecosystem | No indicator found |
| Papers about LT and/or the language | Number of papers about LT**; Number of papers about the language* ** |

Table 13 Contextual factors: Proposed factors for class “Research & Development & Innovation”

SOCIETY

| Factor | Indicator |
|---|---|
| Importance of the language | No indicator found |
| Fully proficient (literate) speakers | Number of L1 speakers* |
| Digital skills | Perc. of individuals with basic digital skills* ** |
| Size of language community | Total number of speakers* ** |
| Population not speaking the official language(s) | No indicator found |
| Official or recognized languages | Total no. of languages with official status*; Number of bordering languages |
| Community languages | Number of community languages* |
| Time resources of the language community | No indicator found |
| Society stakeholders for the language | No indicator found |
| Speakers’ attitudes towards the language | Total number of participants wanting to acquire the language |
| Involvement of indigenous peoples | No indicator found |
| Sensitivity to barriers | No indicator found |
| Usage of social media or networks | Total number of social media users* **; Percentage of social media users* ** |

Table 14 Contextual factors: Proposed factors for class “Society”

TECHNOLOGY

| Factor | Indicator |
|---|---|
| Open-source technologies of LTs | No indicator found |
| Access to computer, smartphone etc. | Perc. of households with a computer* ** |
| Digital connectivity and internet access | Perc. of households with broadband* ** |

Table 15 Contextual factors: Proposed factors for class “Technology”

| Class | Factor |
|---|---|
| Economy | Size of economy; Size of the ICT sector |
| Education | Students in LT/language; Inclusion in education |
| Industry | Companies developing LTs |
| Law | Legal status and legal protection |
| Online | Wikipedia pages |
| R & D & I | Innovation capacity; Number of papers |
| Society | Size of language community; Usage of social media |
| Technology | Digital connectivity, internet access |

Table 16 Contextual factors included in the final configuration (configuration 5)

| Assessment | Languages | Count |
|---|---|---|
| Appropriate | English, Dutch, Danish, Polish, Greek, Finnish, Estonian, Slovene, Slovak, Lithuanian, Serbian, Basque, Catalan, Galician, Asturian, Aragonese, Welsh, Griko, Lombard, Ligurian, Venetian, Southern Italian, Friulian, Piemontese, Ladin | 25 |
| Ranked too high | Irish, Italian, Swedish, Hungarian, Croatian, Maltese, Faroese, Scottish Gaelic, Cornish, Manx, Saami (Southern), Saami (Pite), Saami (Lule), Saami (Skolt), Saami (Inari), Sardinian, Romagnol | 17 |
| Ranked too low | Norwegian, Spanish, Portuguese, Czech, Romanian, Bulgarian, Icelandic, Emilian, Sicilian | 9 |
| Contrary Opinion | French, German, Saami (Northern), Latvian | 4 |

Table 17 Contextual factors: Assessment of the languages in the final configuration (configuration 5) by the panel of experts

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 4
European Language Technology in 2022/2023

Maria Giagkou, Teresa Lynn, Jane Dunne, Stelios Piperidis, and Georg Rehm

Abstract This chapter presents the results of an extensive empirical investigation of the digital readiness of European languages, and provides a snapshot of the support they are offered through technology as of 2022. The degree of digital readiness was assessed on the basis of the availability of language resources and technologies for each language under investigation, and a cross-language comparison was performed. As a complementary approach, the perspectives and opinions of LT users, developers and regular citizens were gathered in order to fully understand the EU’s LT landscape. Both the objective empirical findings and the voice of the community clearly indicate that there is an extreme imbalance across languages when it comes to the individual levels of technological support. Although the LT field as a whole has demonstrated remarkable progress during the last decade, this progress is not equally evidenced across all languages, posing, more acutely than ever before, a threat of digital extinction for many of Europe’s lesser supported languages.¹

1 Introduction

More than ten years ago, the study “Europe’s Languages in the Digital Age” concluded that most European languages are under threat in the digital age.
The study, prepared by more than 200 experts and documented in 32 volumes of the META-NET White Paper Series (Rehm and Uszkoreit 2012), assessed Language Technology (LT) support for each language in four different areas: automatic translation, speech interaction, text analysis and the availability of language resources (LRs). The results were alarming: most of the 32 European languages investigated were evaluated as severely under-resourced and some almost completely neglected.

Maria Giagkou · Stelios Piperidis
R.C. “Athena”, Greece, mgiagkou@athenarc.gr, spip@athenarc.gr
Teresa Lynn · Jane Dunne
Dublin City University, ADAPT Centre, Ireland, teresa.lynn@adaptcentre.ie, jane.dunne@adaptcentre.ie
Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de

1 This chapter includes findings from Way et al. (2022) and makes use of the general sections written by the ELE consortium for the language reports (Giagkou et al. 2022).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_4

During the ten years since the publication of the META-NET White Papers, the LT field as a whole has seen remarkable progress. In particular, the advent of data-driven approaches such as deep learning and neural networks, together with the considerable increase in the number and quality of LRs for a number of languages, has yielded previously unforeseeable results. However, is this remarkable progress equally evidenced across all languages, or is the gap between “big” and “small” languages documented in 2012 still present in 2022/2023?

The question of whether languages can be considered digitally equal has become increasingly relevant in recent years, with a growing number of studies attempting to quantify digital readiness and compare languages in this respect.
Methods have varied, with some assessing the level of technology support based on mentions of a language at NLP publication venues or language resource catalogues (e.g., Blasi et al. 2022; Joshi et al. 2020; Ranathunga and Silva 2022) or on websites describing LT tools and services (e.g., Simons et al. 2022). However, the overall conclusion is always the same: from a technological perspective, there is a striking imbalance across languages in terms of support, and it is clear that not all languages benefit equally and fairly from the overall progress in LT.

In the ELE project, we took an empirical approach to quantifying the digital readiness of a language and providing an evidence-based grounding on which languages can be compared. We started by applying the Digital Language Equality (DLE) Metric (see Chapter 3) to examine both the current state of technology support and the potential for short- and mid-term development of LT (Section 2). We continued with a quantitative investigation of the various perspectives and dimensions of current technological support, as reflected in the Language Resources and Technologies (LRTs) collection of the European Language Grid (ELG, Rehm 2023). The results of this empirical assessment were then supplemented by surveys and consultations with a broad representation of LT developers and LT users and consumers, who provided feedback and insight as to their experiences with LTs for EU languages (Section 3). Furthermore and most importantly, we focused on a large number of European languages and provided updates of the META-NET White Papers in the form of the ELE Language Reports (Giagkou et al. 2022), condensed versions of which are presented in Chapters 5–37. It is only through such a holistic examination that a clear picture of the current status and future prospects of DLE can be gained.

2 How Do Europe’s Languages Compare?

In this section, we first describe our source of evidence and methodology (Section 2.1), followed by a presentation of our findings (Section 2.2).
2.1 Source of Evidence and Methodology

To compare the level of technology support across languages, we considered the language technology tools and resources in the catalogue of the European Language Grid (Rehm 2023; Piperidis et al. 2023; Labropoulou et al. 2020). The comparative evaluation was performed on various dimensions.

• The current state of technology support, as indicated by the availability of tools and services² broadly categorised into a number of core LT application areas:
  – Text processing (e.g., part-of-speech tagging, syntactic parsing)
  – Information extraction and retrieval (e.g., search and information mining)
  – Translation technologies (e.g., machine translation, computer-aided translation)
  – Natural language generation (NLG, e.g., text summarisation, simplification)
  – Speech processing (e.g., speech synthesis, speech recognition)
  – Image/video processing
  – Human-computer interaction (HCI, e.g., tools for conversational systems)
• The potential for short- and mid-term development of LTs, insofar as this potential can be approximated by the current availability of resources that can be used as training or evaluation data. The availability of data was investigated with regard to a small number of basic types of resources:
  – Text corpora
  – Parallel corpora
  – Multimodal corpora (incl. speech, image, video)
  – Language models
  – Lexical resources (incl. dictionaries, wordnets, ontologies, etc.)

We measured the LT support for 87 national, regional and minority European languages with regard to each of the dimensions mentioned above based on their respective coverage in the ELG catalogue. For the types of resources and application areas, the respective percentage of resources that support a specific language over the total number of resources of the same type was calculated, as well as their average. Subsequently, each language was assigned to one band per resource type and per application area and to an overall band, on a four-point scale, inspired by the scale used in the META-NET White Paper Series, as follows:

1. Weak or no support: the language is present (as content, input or output language) in <3% of the ELG resources of the same type
2. Fragmentary support: the language is present in ≥3% and <10% of the ELG resources of the same type
3. Moderate support: the language is present in ≥10% and <30% of the ELG resources of the same type
4. Good support: the language is present in ≥30% of the ELG resources of the same type

2 Tools tagged as “language independent” without mentioning any specific language are not taken into account. Such tools can certainly be applied to a number of languages, either as readily applicable or following fine-tuning, adaptation, training on language-specific data etc., yet their exact language coverage or readiness is difficult to ascertain.

The thresholds for defining the four bands (i.e., 3%, 10% and 30%) were informed by an exploratory k-means 4-cluster analysis based on all data per application and resource type, in order to investigate the boundaries of naturally occurring clusters in the data. The boundaries of the clusters were then used to define the bands per application area and resource type. The overall level of support for a language was calculated based on the average coverage of all dimensions investigated.

The ELG platform harvests several major LR/LT repositories³ and, on top of that, more than 6,000 additional LRTs were identified and documented by language informants in the ELE consortium. These records contain multiple levels of metadata granularity as part of their descriptions. At the time of investigation, the ELG catalogue comprised more than 11,500 metadata records, encompassing both data and tools/services, covering almost all European languages, both official and regional as well as minority ones.
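The coverage and banding procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ELE project's own code: the function names and the toy values are hypothetical and introduced here only for demonstration. Only the thresholds (3%, 10%, 30%), the averaging over dimensions and the use of a four-cluster k-means come from the text. The k-means routine is a plain one-dimensional Lloyd's algorithm that stands in for whichever implementation was actually used.

```python
from statistics import mean

# Lower bounds (percent coverage) of the four bands, taken from the text.
BANDS = [
    (30.0, "good support"),
    (10.0, "moderate support"),
    (3.0, "fragmentary support"),
    (0.0, "weak or no support"),
]


def coverage(supporting: int, total: int) -> float:
    """Percentage of catalogue entries of one type that support a language."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * supporting / total


def band(percent: float) -> str:
    """Map a coverage percentage to one of the four support bands."""
    for lower_bound, label in BANDS:
        if percent >= lower_bound:
            return label
    raise ValueError("coverage cannot be negative")


def overall_band(per_dimension: dict) -> str:
    """Overall band from the average coverage across all dimensions."""
    return band(mean(per_dimension.values()))


def kmeans_1d(values, k=4, iterations=100):
    """Plain Lloyd's algorithm on one-dimensional data; returns sorted centroids."""
    data = sorted(values)
    n = len(data)
    if n < k:
        raise ValueError("need at least k values")
    # Deterministic start (an assumption): spread centroids over the sorted data.
    centroids = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def cluster_boundaries(centroids):
    """Midpoints between neighbouring centroids, usable as candidate thresholds."""
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
```

With hypothetical counts, `overall_band({"text corpora": coverage(12, 200), "models": coverage(3, 150)})` averages 6% and 2% to 4% and therefore falls into the fragmentary band. Applying `cluster_boundaries(kmeans_1d(...))` to the observed coverage values of one resource type gives three candidate cut-off points; in the study these were explored per application area and resource type before the common thresholds were fixed.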
It should be noted that due to the evolving nature of this extensive catalogue and differing approaches taken in documenting records, certain categories of metadata captured are not yet at the level of consistency required to carry out a reliable cross-lingual comparison at a granular level. For example, information provided on corpus size, annotation type, licensing type, size unit type, and so on, still varies across records for many languages, while numerous gaps exist for others. As the ELG catalogue is continuously growing, the comprehensiveness, accuracy and level of detail of the records are expected to improve over time.

For the purposes of a high-level comparison, the results presented here are based on relative counts of entries in the ELG for the varying types of data resources and tools/services for each language. As such, the positioning of each language into a specific level of technology support is subject to change as it reflects a snapshot of the available resources at the time of investigation.

That said, we consider the current status of the ELG catalogue and the higher-level findings below representative with regard to the current existence of LT resources for Europe’s languages.

3 At the time, ELG harvested ELRC-SHARE, LINDAT/CLARIAH-CZ, CLARIN.SI, CLARIN-PL and the datasets section of Hugging Face (Labropoulou et al. 2023).

2.2 Results and Findings

As discussed above, our analysis takes into account a number of dimensions for data and tools/services. Table 1 reports the detailed results per language per dimension investigated and the classification of each language into an overall level of support.

The best supported language is, as expected, English, the only language that is classified in the good support group. French, German and Spanish form a group of languages with moderate support.
Although they are similar to English in some dimensions (e.g., German in terms of available speech technologies and Spanish in terms of available models), overall they have not yet reached the coverage that English has according to the ELG catalogue. All other official EU languages are clustered in the fragmentary support group, with the exception of Irish and Maltese, which have only weak or no support. Of the remaining languages, (co-)official at the national or regional level in at least one European country and other minority and lesser spoken languages,⁴ Norwegian and Catalan belong to the group of languages with fragmentary support. Basque, Galician, Icelandic and Welsh are borderline cases; while they are grouped in the fragmentary support level, they barely pass the threshold of the lowest level. All other languages are supported by technology either weakly or not at all. Figure 1 visualises these findings.

Looking into particular dimensions of data availability, it is evident that an abundance of training data for developing LTs is available only for a few languages with high commercial interest. For the majority of European languages, this is not the case and only corpora which are minuscule in comparison to English are available. When investigating the current availability of some of the data types mentioned in the previous paragraph, as represented in the resources hosted in ELG in January 2023,⁵ it is apparent that even the best-supported languages in this dimension, Spanish and English, are still only moderately covered (Figure 2). With respect to multimodal data, all languages with the exception of English are weakly covered, with some, e.g., Maltese and Luxembourgish, severely underrepresented (Figure 3).

Although the data gaps per language are different, some data types are particularly sparse across many languages.
These include: large language models, both monolingual and multilingual; multimodal data, especially speech in conversational settings (dialogues) from speakers of different ages, genders and linguistic/dialectal backgrounds, but also video corpora for sign languages; domain-specific data (e.g., medical, legal or media among many others of interest); data for language use on social media; semantic resources (e.g., semantic annotations and knowledge bases); data for language pathologies; benchmarks, i.e., well-designed gold-standard corpora for evaluating LT systems or fine-tuning language models.

4 In addition to the languages listed in Table 1, ELE also investigated Alsatian, Aragonese, Arberesh, Aromanian, Asturian, Breton, Cimbrian, Continental Southern Italian (Neapolitan), Cornish, Eastern Frisian, Emilian, Franco-Provencal (Arpitan), Friulian, Gallo, Griko, Inari Sami, Karelian, Kashubian, Ladin, Latgalian, Ligurian, Lombard, Lower Sorbian, Lule Sami, Mocheno, Northern Frisian, Northern Sami, Picard, Piedmontese, Pite Sami, Romagnol, Romany, Rusyn, Sardinian, Scottish Gaelic, Sicilian, Skolt Sami, Southern Sami, Tatar, Tornedalian Finnish, Venetian, Voro, Walser and Yiddish. The scores for all of these languages are very low, placing all of them in the weak or no support group.

5 The DLE dashboard enables more fine-grained comparisons. It dynamically visualises the contents of the ELG catalogue and offers an up-to-date snapshot of the current availability of LRTs (see Chapter 3): https://live.european-language-grid.eu/catalogue/dashboard.

Table 1 State of technology support, in 2022, for selected European languages with regard to core Language Technology areas and data types as well as overall level of support (light yellow: weak/no support; yellow: fragmentary support; light green: moderate support; green: good support)
Columns, Tools and Services: Text Processing; Speech Processing; Image/Video Processing; Information Extraction and IR; Human-Computer Interaction; Translation Technologies; Natural Language Generation
Columns, Language Resources: Text Corpora; Multimodal Corpora; Parallel Corpora; Models; Lexical Resources
Final column: Overall
Rows, EU official languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish
Rows, (co-)official languages, national level: Albanian, Bosnian, Icelandic, Luxembourgish, Macedonian, Norwegian, Serbian
Rows, (co-)official languages, regional level: Basque, Catalan, Faroese, Frisian (Western), Galician, Jerriais, Low German, Manx, Mirandese, Occitan, Sorbian (Upper)
Final row: All other languages

Fig. 1 Overall state of technology support for selected European languages (2022)

Fig. 2 Number of language models available in the catalogue of the European Language Grid for the EU official languages and for some indicative non-EU official ones (as of January 2023)

Similarly to data, the identified gaps for technologies are very diverse across languages. While overall LTs for English are numerous and at the state-of-the-art level, a number of very small minoritised languages lack even basic tools such as spell checkers. In the worst case, they are not even supported by operating systems. Nevertheless, there seems to be a generalised consensus that, when it comes to languages for which at least a minimum level of technological support has been achieved, the technologies most urgently needed include: discourse processing, bias detection and anonymisation, conversational systems and question-answering in the wider context of HCI, NLG (with summarisation mentioned frequently) and Natural Language Understanding (NLU), e.g., even English and German are currently supported by less than 100 HCI or NLG systems on ELG, while some languages like Bosnian and Norwegian Nynorsk are not supported at all (Figures 4 and 5).

Fig. 3 Number of multimodal datasets (i.e., media type: audio, video or image) available in the catalogue of the European Language Grid for the EU official languages and for some indicative non-EU official ones (as of January 2023)

Fig. 4 Number of Human-Computer Interaction systems described in the catalogue of the European Language Grid for the EU official languages and for some indicative non-EU official ones (as of January 2023)

Fig. 5 Number of Natural Language Generation systems described in the catalogue of the European Language Grid for the EU official languages and for some indicative non-EU official ones (as of January 2023)

The results of this analysis are only informative of the relative positioning of languages, but not of the technological progress achieved by a specific language. The LT field as a whole has significantly progressed in the last ten years and remarkable progress has been achieved for specific languages in terms of quantity, quality and coverage of LRTs. It is at the same time undebatable that the technology requirements for a language to be considered digitally supported by today’s standards have changed significantly in the last ten years (e.g., the prevalent use of virtual assistants, chatbots, improved text analytics capabilities, etc.). Nevertheless, the imbalance in distribution across languages which was documented in the META-NET White Papers in 2012 still exists, and the huge distance between the best supported languages and the minimally supported ones was still evidenced in 2022.
It is exactly this distance that needs to be ideally eliminated, or at least reduced, in order to move towards DLE and avert the risks of digital language extinction.

It should be noted that this analysis does not include a fifth level, excellent support, for the grouping of languages, in addition to the four levels described in Section 2.1. Currently, no European language, not even English, is optimally supported by technology, i.e., the goal of Deep Natural Language Understanding has not been reached yet for any language. Although recently there have been many breakthroughs in AI, Computer Vision, Machine Learning and LT, we are still far from the grand challenge of highly accurate deep language understanding, which is able to seamlessly integrate modalities, situational and linguistic context, general knowledge, meaning, reasoning, emotion, irony, sarcasm, humour, culture, explain itself on request, and be effected as required on the fly and at scale. A language can only be considered excellently supported by technology if and when the goal of Deep Natural Language Understanding has been reached.

3 The Voice of the Community

The findings in Section 2 are extremely valuable in terms of highlighting the status quo across Europe with respect to LT support. However, facts and figures alone cannot paint the full picture. The perspectives and opinions of LT users, developers and the average citizen were also required in order to fully understand the EU’s LT landscape. As a project from the community for the community, the ELE consortium wanted to ensure that as many voices as possible were heard and taken as input for the ELE strategic agenda and roadmap.

A broad spectrum of stakeholders was consulted to achieve this wider insight into the levels of LT support across European languages (also see Chapter 38, p. 229ff.).
We distinguish between three main stakeholder groups: LT developers (industry and research), LT users (commercial and academic users) and EU citizens, i.e., the general public who use and consume LTs in everyday personal and professional settings, often without even realising it. Each group is diverse, some including many subgroups, representing a variety of sectors and domains. For the latter, we looked at the interesting subdivisions of commercial and academic users as well as EU citizens. The first two groups are represented in the ELE consortium with several networks, initiatives and associations, representing the views of their constituencies, highlighting their wishes, demands and needs towards full DLE in Europe.

Further insight was gained from a number of online surveys and expert interviews targeting LT developers, users and consumers. The surveys investigated language coverage, evaluated the current situation of LT in Europe and encouraged participants to share their predictions and visions for the future. In this section, we look, in particular, at the evaluation of the current situation to see how these opinions compare to the empirical results presented in Section 2 and also in Chapter 39 (p. 245ff.).

3.1 Developers of Language Technologies

European LT developers are a diverse group of stakeholders, comprising academic and industrial entities in the field of LT. Beyond research, they develop pre-commercial prototypes, algorithms, applications and systems. An initial grouping is, thus, LT industry and LT research (also see Rehm et al. 2023, 2020). This section focuses on their view about the situation as of 2022, while Section 3 in Chapter 38 presents their forward-looking predictions going towards 2030.
In addition to the horizontal grouping into research and industry, a vertical categorisation can be performed with regard to the multi- and interdisciplinary nature of LT. LT is in the intersection of Linguistics and Computational Linguistics, Computer Science and AI, while at the same time encompassing methods and findings from Cognitive Science and Psychology, Mathematics, Statistics, Philosophy and other fields. As a result, the ELE stakeholder group of LT developers were identified not only within the strict limits of LT per se, but also in the neighbouring disciplines of AI and Digital Humanities/Social Science and Humanities (DH/SSH).

Europe has a long-standing research, development and innovation tradition in LT with over 800 centres performing excellent, highly visible and internationally recognised research on all European and many non-European languages. In terms of companies, the European LT industry was estimated to comprise 435 companies (LT-Innovate 2016) or 473 LT vendors in the EU26 plus Iceland and Norway in 2017 (Vasiljevs et al. 2019). In January 2023, the ELG catalogue comprised more than 800 commercial entities including integrators and a certain number of user companies.

In order to disseminate the survey widely, we mobilised existing European networks, associations, initiatives and projects. Some of the well-established and long-standing pan-European LT networks were represented in the ELE consortium and they constituted the core ELE LT developers stakeholders groups (i.e., CLAIRE, CLARIN, LT-Innovate, META-NET and ELG). The ELE partners that represented these initiatives not only contributed their views to the project but also facilitated access to and elicitation of the views of their constituency and members. In particular, they coordinated the distribution of the survey to their members, conducted interviews and focused consultation meetings, where needed and appropriate, and consolidated their feedback (Thönnissen 2022; Eskevich and Jong 2022; Rufener and Wacker 2022; Hajič et al.
2022; Hegele et al. 2022).

The survey encompassed 45 questions in total. A respondent was presented with 32 (minimum) to 45 (maximum) questions, including “if other” questions. In all, 35 questions were mandatory and 27 were closed questions (single or multiple choice). The survey was structured into four main parts: Part A. Respondents’ profiling, Part B. Language coverage, Part C. Evaluation of current situation, and Part D. Predictions and visions for the future (see also Chapter 38, p. 229ff., and Chapter 39, p. 245ff.). For assessing the current situation from the perspective of LT developers, we focus on the findings based on responses to Parts B and C of the survey.

The LT developers survey was filled in by 321 different respondents who represent 223 different organisations (Way et al. 2022). 73% of the organisations are research or academic institutions and 22% are private companies. In 5% of responses the “Other” value was indicated as the type of organisation and this has been further specified as freelancer/private practitioner or currently unemployed, government agency, not-for-profit organisation, etc.

Of note here is the response to the question “What languages does your organisation conduct research in and/or for what languages do you offer services, software, resources, models etc.?”. Figure 6 shows the languages supported by survey respondents’ organisations. All official EU languages are covered as well as other state official, regional and/or co-official European languages. The five most frequently mentioned languages are, yet again, English, German, Spanish, French and Italian.

In order to evaluate the current situation and to further grasp the main challenges and obstacles the European LT community faces, the survey participants were asked to indicate their level of agreement with a set of potential obstacles (Figure 7).
As part of a free text question, respondents were also given the opportunity to elaborate on the obstacles and challenges indicated in the questions and/or add any other obstacle/challenge not previously listed.

Fig. 6 LT developers survey – languages supported by the respondents’ organisations in their research and development activities

With respect to questions about the status quo of the languages, most of the participants agreed or strongly agreed that the importance of multilinguality in the European landscape does not always receive adequate recognition, and the smaller languages appear not to be attractive enough for industry and investors (74% agreed or strongly agreed on this point). This was backed up by comments relating to how industrial players can find a commercial interest in pre-competitive investments for “larger” languages, while this will rarely be the case for “smaller” ones. It was suggested that in that situation, the role of additional investors for the development of LTs for “smaller” languages should be played by bodies either at national or EU level. Moreover, it was noted that it is very often the case that small languages can rely on public funding only, which however is considered insufficient. For this reason, it was argued that public investments for small languages are necessary on a larger scale to really make them available to the wider community. It was also observed that the cost of developing LTs for a language is usually constant, regardless of the number of speakers of that language. Furthermore, for languages with larger numbers of speakers, it can often be easier to collect LRs: for instance, the larger the number of speakers, the more online content is produced, which in turn can be collected and provide the raw language data necessary for the development of LRTs.

Fig. 7 LT developers survey – challenges the European LT community currently faces, according to LT developers
It was reported that this situation was even worse for non-standard languages: local dialects, non-standard written language on social media platforms, non-standard language for speech recognition, and non-standard language as used by migrants or citizens with a migration background. There is hardly ever funding available for creating LRs for non-standard varieties. There is equally little incentive for researchers to publish their work on small languages, resulting in the dominance of the English language in scientific literature.

3.2 Users of Language Technologies

Commercial users were those respondents representing companies in the sector of Information and Communication Technologies (ICTs) and eCommerce (e.g., Megabyte Ltd, A Capela group, Telecats), energy (e.g., Shell, Menai Science Park Ltd) and business services (e.g., Spencer Stuart, Inuits, Projectus grupa). They also included respondents from the following groups: self-employed language professionals (e.g., translators); professionals working on different economic sectors (e.g., banking, health); independent professionals/consultants; professionals working in public administration; media and publishing professionals.

Academic users included researchers, data scientists, university professors, language teachers, lecturers, and Master’s and PhD students. Some non-governmental organisations (NGOs) were also represented in the survey, such as Federal Lezghin National and Cultural Autonomy, and representatives of public administration, such as National Youth Service (Ministry of Education, Children and Youth, Luxembourg), Hungarian National Research, Development and Innovation Office and the Government of the Balearic Islands. In addition, Wikipedia partners collected responses from representatives of the various Wikipedia projects, such as Wikimedia Community User Group Malta, Wikimedia Hungary, Wikimedia UK, and Wikimedia Community Ireland, to name a few. The full list of stakeholders of the LT users and consumers survey is presented in Way et al. (2022).
Six well-known European initiatives disseminated the survey within their networks and produced one report each, based on their respective constituencies. These include the European Federation of National Institutions for Language (EFNIL, Kirchmeier 2022), the European Language Equality Network (ELEN, Hicks 2022), the European Civil Society Platform for Multilingualism (ECSPM, Gísladóttir 2022), the New European Media initiative (NEM, Hrasnica 2022), the Association of European Research Libraries (LIBER, Blake 2022) and Wikipedia (Heuschkel 2022).

The survey obtained a total of 246 responses. The results show that contributions came from a diverse range of economic sectors and professional activities, but most of the respondents worked in the education and research sector with 130 responses (53%) out of 246, that is, most respondents were researchers, university professors, assistant professors, lecturers or held other academic positions. The survey was also filled out by representatives of NGOs, large enterprises, SMEs, government departments and independent contractors and consultants in diverse economic sectors. The 15 (6%) respondents who selected the option “other” represented non-governmental bodies, non-profit organisations, public sector organisations, social organisations and independent government departments.

Of relevance to assessing the current situation, we note here the responses to the question “In general terms, how do you evaluate the performance of the tools you use for the official European language(s) you work with”. Responses were captured through a 4-point Likert scale (where 1 indicated very poor support, 2 poor support, 3 good support and 4 excellent support). The list of LTs evaluated can be seen in Way et al. (2022). Figure 8 shows the average score for each of the European languages evaluated. The results show striking differences in technological support between European languages. Unsurprisingly, English is very well supported with a mean score of 3.4, while the group formed by German, French and Spanish follows with a mean score between 2.4 and 2.5.
All other European languages were considered to have either poor support (mean scores ranging from 1 to 1.3), very poor support or no support at all with scores below 1.

Fig. 8 LT users survey – level of technological support: average scores for the European language(s) that respondents work with

3.3 European Citizens as Consumers of Language Technologies

In addition to the consultation with stakeholders that represent communities of users and consumers, a survey targeting European citizens was carried out to make sure that their voices also play a decisive role in the pursuit of full DLE in Europe. This consultation with a larger and more diverse cohort of consumers allowed us to obtain a more accurate picture of the current scenario in terms of LT support across European languages and have a more representative basis for a technological and scientific forecasting on how LTs can be deployed and applied in Europe by 2030.

The citizens’ survey was launched in January 2022 and closed on 01 May 2022. It was made available in 35 languages and disseminated across 28 countries.6 For each country we created a standalone survey so that respondents only saw the version in the language of the country in which they were based. For countries with more than one official language, we created a standalone version of the survey in each language spoken in the country, e.g., four surveys were set up in Spain (in Spanish, Catalan, Galician and Basque). This approach allowed us to specifically target regions where we were more likely to find communities of respondents that were speakers of that language. More details on this survey and the community consultation methodology are presented in Chapter 38 (p. 229ff.).

In total, 21,108 complete responses were collected. However, as the collection of survey responses through commercial online services is known to present issues that can render results unreliable (Lawlor et al. 2021), closer inspection revealed a number of flags indicating unreliable responses.
These responses were filtered from the dataset, and as such, a final 20,586 responses were analysed.

6 While ELE investigated about 90 European languages, we only produced translated versions for those languages for which native speaker post-editing was available. The 35 languages covered by the survey represent the support offered through the ELE consortium members.

Respondents answered profiling questions and were asked to list all of the languages they speak. Of particular interest in our examination of the current situation is the response to question 6 “Please rate all the types of software applications, apps, tools or devices you use for your language(s)”.

Fig. 9 EU citizens survey – responses to question 6: Please rate all the types of software applications, apps, tools or devices you use for your language(s). Tools you do not use for your language(s) do not need to be rated. Note that purple indicates the median and blue the mode.

The list of eight tools presented was: Search apps (e.g., Google, Bing); personal assistant apps (e.g., Siri, Alexa); proofreading apps (e.g., spelling and grammar checkers, autocorrect); translation apps (e.g., Google Translate, DeepL); automatic subtitling (e.g., news report, YouTube); language learning apps (e.g., Babbel, Rosetta Stone); chatbots (e.g., for customer support) and screen readers. The aim of this question was to understand the perception of the average EU citizen and LT user of the quality of the tools that they use for each language they speak.

The ratings were based on a 5-point Likert scale, i.e., respondents had the option of rating 1-star (poor) through to 5-stars (excellent) for each of the eight tools presented, and for each language they had selected in the previous question. In the interest of space, Figure 9 presents only the languages for which language reports were produced (see Chapters 5–37) and only shows responses from the perspective of each language, as opposed to each tool.
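The per-language aggregation reported in Figure 9 can be illustrated with a small sketch. It assumes, hypothetically, that each response for one language is stored as a mapping from a tool label to a 1–5 star rating, that the ratings of all tools are pooled per language, and that a tool the respondent did not rate is counted as zero, in line with the description of the figure. The function name, the tool labels and the sample data are illustrative only and are not taken from the ELE survey data or code.

```python
from statistics import median, multimode

# Illustrative labels for the eight tool types listed in question 6.
TOOLS = [
    "search", "personal_assistant", "proofreading", "translation",
    "subtitling", "language_learning", "chatbot", "screen_reader",
]


def aggregate_ratings(responses, tools=TOOLS):
    """Return (median, modes) of the pooled ratings for one language.

    responses: list of dicts mapping a tool label to a 1-5 star rating.
    A tool missing from a response is scored 0 (assumed penalty rule).
    """
    scores = [response.get(tool, 0) for response in responses for tool in tools]
    # multimode returns every most frequent value, so ties stay visible.
    return median(scores), multimode(scores)


# Hypothetical responses for a single language, restricted to three tools.
sample = [
    {"search": 5, "translation": 4},
    {"search": 5, "translation": 4, "proofreading": 5},
    {"search": 5},
]
print(aggregate_ratings(sample, tools=["search", "translation", "proofreading"]))
# prints (4, [5])
```

In this toy example four of the nine pooled scores are 5 and three are the zero penalty, so the mode is 5 while the median drops to 4. Under the stated assumptions, a language with many unrated tools would be pulled towards a low median in the same way.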
Due to the large size of the dataset and the varying proportion of responses for each language, the figures presented here are based on the calculation of the median score (purple) and the mode (blue). Tools that were not available or used by a respondent did not receive a score. In these instances, the tool was assigned a rating of zero, as a penalty for lesser-used tools across all languages. This explains the low scores for languages such as Serbian, Luxembourgish and Icelandic, which either have very few available or low-rated existing LTs.

To some degree, the results reflect the trend presented for the technological DLE scores of the relevant languages (see Chapter 3) in terms of the quantification of the technological factors of the DLE Metric. The difference between the median score for English and the next well-resourced languages is not as stark, however. This could be explained by the fact that the ratings of the tools are bound to an upper limit of five and as a result, the scores are “flatter” and closer to each other. On the other hand, we can see that the mode score reveals that tools for English, French, Spanish and Italian received more frequent higher ratings. Nevertheless, the results provide a clear insight into the average European user’s perception of the quality of LT support for their languages.

4 Conclusions

We examined around 90 European languages with the goal of creating a snapshot of their digital readiness in 2022. We made use of the inventory of LRTs in the European Language Grid and assessed the technological readiness of each language based on the availability of LRTs. From this, we carried out a cross-language comparison on this empirical basis, as well as an analysis of feedback from developers and users of LTs across Europe, including input from over 20,000 EU citizens.

The status as analysed in 2022 is very clear: there is an extreme imbalance across languages when it comes to the individual levels of technological support. While the META-NET White Paper Series reported a similar imbalance ten years ago, what is surprising is the little comparative change seen across the board since then.
The same trend of acute digital inequality continues, and worse still, the gap between English and the rest of the EU languages is getting wider. Even though some of the widely spoken languages in Europe and beyond (Spanish, German and French) have demonstrated considerable progress and are among the top performers, their distance from English is intolerable. Moreover, a striking asymmetry is evidenced between official and non-official EU or EEA languages.

Our results reiterate that digital language inequality poses a direct threat to Europe’s linguistic and cultural diversity. Europe has become or is about to become a continent where digital diglossia is the de facto context for many EU citizens, with the exception of English native speakers. When going about their online lives, EU citizens too often find it more efficient or even absolutely necessary to rely on other, more widely supported languages (predominantly English) for certain services and information because this gives them greater access to high-quality and reliable content to a broader audience, and allows them to use more advanced technologies. This is true particularly for the younger generations, thus increasing the generational language gap and bringing lesser-resourced languages ever closer to digital extinction.

References

Blake, Oliver (2022). Deliverable D2.10 Report from LIBER. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-LIBER.pdf.

Blasi, Damian, Antonios Anastasopoulos, and Graham Neubig (2022). “Systematic Inequalities in Language Technology Performance across the World’s Languages”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, pp. 5486–5505. DOI: 10.18653/v1/2022.acl-long.376. https://aclanthology.org/2022.acl-long.376.

Eskevich, Maria and Franciska de Jong (2022). Deliverable D2.3 Report from CLARIN. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-CLARIN.pdf.

Giagkou, Maria, Stelios Piperidis, Georg Rehm, and Jane Dunne, eds. (2022). ELE Language Report Series. Project deliverables; European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/deliverables/.

Gísladóttir, Guðrún (2022). Deliverable D2.7 Report from ECSPM. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ECSPM.pdf.

Hajič, Jan, Tea Vojtěchová, and Maria Giagkou (2022). Deliverable D2.5 Report from META-NET. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-META-NET.pdf.

Hegele, Stefanie, Katrin Marheinecke, and Georg Rehm (2022). Deliverable D2.6 Report from ELG. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ELG.pdf.

Heuschkel, Maria (2022). Deliverable D2.12 Report from Wikipedia. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-Wikipedia.pdf.

Hicks, Davyth (2022). Deliverable D2.9 Report from ELEN. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ELEN.pdf.

Hrasnica, Halid (2022). Deliverable D2.11 Report from NEM. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-NEM.pdf.

LT-Innovate (2016). The LT-Innovate Innovation Agenda. http://www.lt-innovate.org/sites/default/files/2904-LTi_Innovation_Agenda.pdf.

Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury (2020). “The State and Fate of Linguistic Diversity and Inclusion in the NLP World”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020). Online: Association for Computational Linguistics, pp. 6282–6293. DOI: 10.18653/v1/2020.acl-main.560. https://aclanthology.org/2020.acl-main.560.

Kirchmeier, Sabine (2022). Deliverable D2.8 Report from EFNIL. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-EFNIL.pdf.

Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis, Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Arranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3421–3430. https://www.aclweb.org/anthology/2020.lrec-1.420/.

Labropoulou, Penny, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis, Maria Giagkou, Ondřej Košarko, Jan Hajič, and Georg Rehm (2023). “Interoperable Metadata Bridges to the wider Language Technology Ecosystem”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 107–127.

Lawlor, Jennifer, Carl Thomas, Andrew T Guhin, Kendra Kenyon, Matthew D Lerner, UCAS Consortium, and Amy Drahota (2021). “Suspicious and fraudulent online survey participation: Introducing the REAL framework”. In: Methodological Innovations 14.3. DOI: 10.1177/20597991211050467. https://doi.org/10.1177/20597991211050467.

Piperidis, Stelios, Penny Labropoulou, Dimitris Galanis, Miltos Deligiannis, and Georg Rehm (2023). “The European Language Grid Platform: Basic Concepts”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cham: Springer, pp. 13–36. DOI: 10.1007/978-3-031-17258-8_2. https://doi.org/10.1007/978-3-031-17258-8_2.

Ranathunga, Surangika and Nisansa de Silva (2022). “Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World”. In: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online only: Association for Computational Linguistics, pp. 823–848. https://aclanthology.org/2022.aacl-main.62.

Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.

Rehm, Georg, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou (2023). “Language Technology Companies, Research Organisations and Projects”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 171–185.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Khalid Choukri, Andrejs Vasiljevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriute, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás Váradi, Kadri Vider, Andy Way, and François Yvon (2020). “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325. https://www.aclweb.org/anthology/2020.lrec-1.407/.

Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Languages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.

Rufener, Andrew and Philippe Wacker (2022). Deliverable D2.4 Report from LT-innovate. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-LTInnovate.pdf.

Simons, Gary F., Abbey L. L. Thomas, and Chad K. K. White (2022). “Assessing Digital Language Support on a Global Scale”. In: Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022). Gyeongju, Republic of Korea: International Committee on Computational Linguistics, pp. 4299–4305. https://aclanthology.org/2022.coling-1.379.
Thönnissen,Marlies(2022). Deliverable D2.2 Report from CLAIRE.EuropeanLanguageEquality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/rep orts/consultation-CLAIRE.pdf. Vasiljevs,Andrejs,KhalidChoukri,LucMeertens,andStefaniaAguzzi(2019). Final study report on CEF Automated Translation value proposition in the context of the European LT market/e­cosystem. DOI: 10.2759/142151. https://op.europa.eu/de/publication-detail/-/publication/8494 e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en. Way, Andy, Georg Rehm, Jane Dunne, Jan Hajiè, Teresa Lynn, Maria Giagkou, Natalia Resende, Tereza Vojtìchová, Stelios Piperidis, Andrejs Vasiljevs, Aivars Berzins, Gerhard Backfried, Marcin Skowron, Jose Manuel Gomez-Perez, Andres Garcia-Silva, Martin Kaltenböck, and Artem Revenko (2022). Deliverable D2.17 Report on all external consultations and surveys. European Language Equality (ELE); EU project no. LC-01641480 – 101018166.https://europ ean-language-equality.eu/reports/external-consultations.pdf. Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation,distributionandreproductioninanymediumorformat,aslongasyougiveappropriate credittotheoriginalauthor(s)andthesource,providealinktotheCreativeCommons licenseand indicateif changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutoryregulationorexceedsthepermitteduse,youwillneedtoobtainpermissiondirectlyfrom thecopyrightholder. 
Chapter 5
Language Report Basque

Kepa Sarasola, Itziar Aldabe, Arantza Díaz de Ilarraza, Ainara Estarrona, Aritz Farwell, Inma Hernáez, and Eva Navas

Abstract Since 1968 Basque has been immersed in a process of revitalisation that has faced formidable obstacles. Nonetheless, significant progress has been made in numerous areas. The Language Technology community widely accepts the standardised language and constructs efficacious LT tools. After thirty years of collaborative work, research has resulted in state-of-the-art technology and robust, broad-coverage NLP for Basque. However, a dramatic difference remains between Basque and other European languages in terms of both the maturity of research and the state of readiness with respect to language technology solutions.

1 The Basque Language

Basque is spoken by 28.4% (751,500) of Basques in a territory that spans part of northern Spain and southern France. Of these, 93.2% reside on the Spanish side and the remaining 6.8% in the French region. The Basque Autonomous Community in Spain has established Basque as a co-official language. The Chartered Community of Navarre grants co-official status to Basque only in northern Navarre. Basque has no official status in the French Basque Country. The same is true for the European Union, which limited the status of official European languages to state languages.

As a non-Indo-European language isolate, Basque has a grammar that differs considerably from that of the surrounding languages, though it has borrowed up to 40% of its vocabulary from Romance languages and uses the Latin script. The five main spoken dialects are noticeably distinct from one another and it was not until 1968 that the Royal Academy of the Basque Language unified Basque. Since then, it has been immersed in a process of revitalisation that has faced formidable obstacles. Nevertheless, significant progress in numerous areas has fostered the necessary sociolinguistic conditions for the successful development and dissemination of LT.

Kepa Sarasola · Itziar Aldabe · Arantza Díaz de Ilarraza · Ainara Estarrona · Aritz Farwell · Inma Hernáez · Eva Navas
University of the Basque Country, Spain, kepa.sarasola@ehu.eus, itziar.aldabe@ehu.eus, a.diazdeilarraza@ehu.eus, ainara.estarrona@ehu.eus, aritz.farwell@ehu.eus, inma.hernaez@ehu.eus, eva.navas@ehu.eus
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_5

This positive course of events, bolstered by years of collaborative work, has resulted in state-of-the-art technology and robust, broad-coverage NLP for Basque (Hernáez et al. 2012). Still, a dramatic difference remains between Basque and other European languages in terms of research maturity and readiness with respect to solutions (Sarasola et al. 2022).

Data collected by the Basque Institute of Statistics (EUSTAT) shows that 85% of people aged 15+ in the Basque Autonomous Community (1,603,000 individuals) used the internet between June and September 2021. According to the PuntuEUS Observatory, which measures the presence of Basque on the internet, there are currently 12,470 websites with the Basque language code (.eus) as the top-level domain. In 2020, the percentage of websites with content in Basque was 84.4%.

2 Technologies and Resources for Basque

The LT support of Basque is reflected in the European Language Grid (ELG). Half of the resources are corpora, while the rest includes lexical and conceptual resources, grammars and models. Basque language models in ELG may be divided into monolingual and multilingual. Among the former is BERTeus, a Basque language model pre-trained on crawled newspaper articles and the Basque Wikipedia. The latter include IXAmBERT, a multilingual pre-trained language model for English, Spanish and Basque.

Most Basque monolingual corpora are annotated at some linguistic level. The largest, the ETC corpus and the Lexical Observatory Corpus, contain between 48 and 355 million words. The EPEC corpus contains 300,000 words of standard written text, manually tagged at different grammatical levels.
Bi- or multilingual corpora, the majority of Basque corpora in ELG, are composed of comparable or parallel data. HAC, a cross-lingual corpus for Basque, Spanish, French and English, and the Basque-Spanish EiTB corpus of aligned comparable sentences contain 629,916 and 564,625 translation units, respectively. In comparison to text corpora, resources that include other modalities are relatively few. However, several databases for ASR, TTS and speech-to-speech translation (S2ST) have been built over the last decade. Large public datasets for high quality speech synthesis for Basque are not available for commercial use, but smaller datasets developed at the UPV/EHU are on hand for research. S2ST, a new research area that requires bilingual data, has made inroads with respect to Basque: there is a bilingual Basque-Spanish dataset containing over eight years of Basque parliamentary sessions.

Lexical resources outweigh conceptual ones, followed by dictionaries, thesauri, terminological resources, ontologies and wordnets. The Egungo Euskararen Hiztegia (Contemporary Basque Dictionary) and the Orotariko Euskal Hiztegia (General Basque Dictionary) count among the most important dictionaries. Additionally, there are euLex and the Euskararen Datubase Lexikala and three variants of WordNet (EusWordNet, Multilingual Central Repository 3.0, SLIGalnet).

Basque tools and services in ELG span a range of applications, but none are listed for information extraction and retrieval, language generation and summarisation or human-computer interaction. Instead, most may be classified as spellcheckers or fall under text analysis, speech processing, and translation technologies. There are three spellcheckers of note, while pipelines for sentence segmentation, tokenisation, PoS tagging, lemmatisation and dependency parsing may be constructed with UDPipe, ixaKat or IXA-pipes. Other types of linguistic processing are also available, ranging from word sense disambiguation and lexical similarity to RST parsers.

There are two major TTS engines that read texts with high quality synthetic voices in Basque or Spanish. Just one Basque company offers a speech recognition service. Google’s Cloud Speech-to-Text is available for Basque, but only with the default and the command and search models. There are no additional enhanced models as there are for English, French or Spanish, no option for using Google’s Cloud TTS, and Amazon does not include Basque in their TTS or ASR services. Besides Google Translate, there are four locally developed neural systems that provide high quality translation between specific language pairs.

Although most basic LT tools are available, a significant gap remains between Basque and other languages in terms of data. This difference is also observed in speech resources and domain-specific data. If we wish to fine-tune models for better performance, domain-specific corpora are required. These examples underline the endemic digital inequality that exists in LT, although one bright spot for languages with few resources, such as Basque, is that pre-trained mono- and multilingual models have proven quite useful in NLP tasks, even when based on far smaller corpora. As a final note, it is worth mentioning that most Basque resources have been produced by research groups at the University of the Basque Country and other public entities. Regrettably, resources produced by companies involved in publicly funded projects are not always open-sourced and greater pressure must be applied to ensure they are.

3 Recommendations and Next Steps

While Basque’s digital condition may not be endangered, it does remain vulnerable. More work must be done to deepen its integration into social network applications, expand its use in business and employment services, and extend its reach into entertainment products.
Moreover, there are significant gaps in the availability of language data and tools that must be addressed so that research may be improved and better commercial applications developed. The more obvious lacunae include a lack of sufficient multimodal corpora, public datasets, and advanced language models. While it is true that pre-trained mono- and multilingual models are employed to great effect in a variety of NLP tasks, a dearth of domain-specific data in Basque continues to hinder the ability to fine-tune models. This is an area that not only requires attention with respect to Basque, but also underscores the chasm in LT between the most utilised languages, such as English, and those with far fewer digital resources. It is as understandable as it is troublesome that a high percentage of Basque speakers meet with obstacles in their online lives, too often finding it easier or necessary to rely on other, more widely available languages for certain services and information. This prima facie case of linguistic inequality, not limited to Basque, does not bode well for the future of Europe’s cultural heritage.

Fortunately, a remedy may yet be found if action is taken now. Basque’s digital health would benefit from bolder and nimbler LT strategies at the European, national and regional levels. In this context, the Spanish Government has approved the New Language Economy PERTE with the purpose of reinforcing the value of official languages in the digital transformation process. Out of a €1.1 billion budget, at least €30 million will be earmarked for supporting projects in co-official languages. Similarly, the Basque Government has launched GAITU, an action plan that aims to integrate Basque into LT between 2021 and 2024. Finally, the opportunity to take a role in the CLARIN infrastructure would also result in the creation and maintenance of resources. These types of actions should guarantee that data and resources will be made publicly accessible whenever possible because the amount of available data will determine the quality of prospective applications.
Licences that provide fewer restrictions on content creation should be more widespread so that greater amounts of linguistic data may be collected. Infrastructures and trained personnel are required to manage the influx of data and curate it for research and development. At one level, taking these steps will help ensure that LT continues to adapt to Basque’s digital needs and keep pace with advances at the global level. At another, such a strategy would impart greater visibility to LT and reinforce its vital role in enabling Basque to thrive in today’s rapidly evolving socio-digital space.

References

Hernáez, Inmaculada, Eva Navas, Igor Odriozola, Kepa Sarasola, Arantza Diaz de Ilarraza, Igor Leturia, Araceli Diaz de Lezana, Benat Oihartzabal, and Jasone Salaberria (2012). Euskara Aro Digitalean – The Basque Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/basque.

Sarasola, Kepa, Itziar Aldabe, Arantza Diaz de Ilarraza, Ainara Estarrona, Aritz Farwell, Inma Hernaez, and Eva Navas (2022). Deliverable D1.4 Report on the Basque Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-basque.pdf.

Chapter 6
Language Report Bosnian

Tarik Ćušić

Abstract Objectively, there are no language technologies for the Bosnian language, nor any initiatives for its digitalisation. Therefore, it is necessary to take initial steps towards technological support for the Bosnian language, in order to prevent its digital extinction. In Bosnia and Herzegovina, no programmes aimed at the research and development of language technology products have been initiated. The Bosnian language is present in the digital sphere more or less as much as it is included in foreign, multilingual tools and resources, which are mostly related to Machine Translation (Google Translate and others).

1 The Bosnian Language

The Bosnian language belongs to the West-South Slavic subgroup of the Slavic branch of the great Indo-European linguistic family. Bosnian has about 2.5 million native speakers in Europe. It is an official language of Bosnia and Herzegovina, along with Croatian and Serbian, and is spoken there by 1.87 million people, or 53% of the population. Bosnian is the native language of Bosniaks in Bosnia and Herzegovina, but also of members of other ethnic groups. Outside of Bosnia and Herzegovina, Bosnian is one of the official languages in Montenegro. Bosnian is also an officially recognised minority language in Croatia, Serbia, North Macedonia and Kosovo. In Western Europe and North America, Bosnian is used by about 150,000 people, and by 100,000 to 200,000 people in Turkey.

There is no single language law in Bosnia and Herzegovina that regulates the issue of official language use. However, Bosnian (along with Croatian and Serbian) is listed as one of the official languages in laws and regulations on primary education, secondary education and higher education.

Tarik Ćušić
University of Sarajevo, Bosnia and Herzegovina, tarik.cusic@izj.unsa.ba
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_6

Two writing systems are used in the Bosnian language: Latin and Cyrillic. Both Latin and Cyrillic have 30 letters each; Latin has 27 monographs and three digraphs (dž, lj, nj), and Cyrillic has 30 monographs. In the past, the Bosnian language was also recorded with Glagolitic, Bosnian Cyrillic (Bosančica) and Arebica.

According to the morphological classification, the Bosnian language belongs to the group of synthetic languages of the inflectional type: it has a large number of inflections, i.e., different grammatical forms of words; it is characterised by the frequent merging of different morphemes, by a multitude of changes within individual forms and at the boundaries of morphemes, etc.

The Bosnian language belongs to the group of languages marked by the syntactic structure of SVO: Subject–Verb–Object, e.g., Mahir sluša rok [Mahir listens to rock.]. There are three types of word order in the Bosnian language: basic word order (grammatical-semantic), actualised word order (contextually conditioned) and obligatory word order (prosodically conditioned) (Jahić et al. 2000, pp. 465–473).

In January 2021, 3.27 million people lived in Bosnia and Herzegovina (49.2% of them in urban areas): the total number of mobile connections was 3.73 million, which is 113.9% of the total population; there were 2.32 million internet users (71% of the population) and 1.8 million active social media users (55% of the population).¹ There are more than 25,000 .ba domains registered.² The languages of websites under the .ba domain are mostly Bosnian, Croatian and Serbian, while some websites, due to their character and purpose, are bilingual: Bosnian–English, Croatian–English, Serbian–English and the like.
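
The correspondence between the two scripts described in this section (27 single letters and three digraphs in Latin against 30 single letters in Cyrillic) can be illustrated with a short transliteration routine. The following is a minimal sketch in Python and not the converter offered by any existing platform: the function name latin_to_cyrillic and the two table names are hypothetical, the mapping is the standard one-to-one correspondence with the modern Cyrillic alphabet, and the historical scripts (Glagolitic, Bosančica, Arebica) are not covered. It also assumes that every occurrence of dž, lj or nj is a digraph, which is wrong for a small number of words such as nadživjeti or injekcija, where the two letters are pronounced separately.

```python
import unicodedata

# Hypothetical lookup tables for illustration: 3 digraphs + 27 single letters = 30.
DIGRAPHS = {"dž": "џ", "lj": "љ", "nj": "њ"}
MONOGRAPHS = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def latin_to_cyrillic(text: str) -> str:
    """Greedy transliteration: try a digraph first, then a single letter."""
    # Normalise so that letters like "ž" are single precomposed code points.
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            # Keep capitalisation: "Lj" and "LJ" both become the capital letter.
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = MONOGRAPHS.get(ch.lower())
        if cyr is None:
            out.append(ch)  # spaces, digits, punctuation pass through unchanged
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(latin_to_cyrillic("Mahir sluša rok"))  # Махир слуша рок
```

A production converter would need an exception list for the words in which the letter pairs are not digraphs; the greedy lookup is used here only because it keeps the 30-to-30 letter correspondence easy to see.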

¹ https://datareportal.com
² https://www.domaintools.com

2 Technologies and Resources for Bosnian

Very few resources (i.e., corpora, language models or lexica) are available for Bosnian to date. In fact, Bosnian lacks a reference monolingual corpus that would be a valuable asset for both linguistic research and LT development.

With regard to bi- or multilingual corpora, although they are rare, Bosnian is included as part of some corpora. Examples are the SETimes corpus, a parallel corpus in ten languages with its Bosnian part consisting of 2.2 million words, and the Oslo Corpus of Bosnian Texts, a 1.5 million word corpus consisting of different genres of texts published in the 1990s. The Bosnian part of the CC-100 corpus comprises 14 million tokens (Conneau et al. 2020).

In a relatively recent project aiming at compiling Web corpora of Bosnian (bsWaC) (Ljubešić and Klubička 2014), 8,388 seed URLs for Bosnian were obtained via the Google Search API queried with bigrams of mid-frequency terms obtained from corpora built with focused crawls of newspaper sites. Each TLD was crawled for 21 days with 16 cores used for document processing. The web corpus of the Bosnian language comprises 722 million tokens (Ljubešić and Klubička 2016).

With respect to available language technologies, Bosnian is supported in a number of machine translation systems, mainly commercial ones, like Apptek, Tradukka and iTranslate. Google Translate also supports Bosnian.

CroNER is a tool for recognising and classifying named entities in natural language texts in Croatian. CroNER recognises nine different classes of named entities. Although developed for Croatian, CroNER can successfully be applied to texts in closely related languages such as the Bosnian language.

A relatively recent (2017) mobile application for The orthography of the Bosnian language (Halilović 1996) can be used to learn the spelling of the Bosnian language and certain grammar rules. The mobile application allows you to search for words or book chapters within this “orthography”.
This medium also allows for more flexibility than a book: you can consult the “orthography” at almost any time, on the tram, in a cafe, during a walk. The aim was to bring the book closer to the younger generation and to promote the use of technology in education.

The Language Institute of the University of Sarajevo has developed a digital platform for the Bosnian language, e-bosanski.³ Its goal is to offer language material about Bosnian in an online format. The material currently available is the Bosnian Dictionary of Accent Variations – Sound (Online) and Converter of Alphabets.

The Dictionary of Accent Doublets is a dictionary entry in the Bosnian Accent Manual (with a sound accent book) by a group of authors: Jasmin Hodžić, Aida Kršo and Haris Ćatović.⁴ The corpus of audio recordings is designed to help users acquire competence in accentuation, especially for practising general mutual accent differences in individual accents, regardless of the realised examples in everyday speech or in the Bosnian accent norm. It contains over 1,000 accent doublets selected from over 7,000 examples that make up the already excerpted material for a future study on the sources of Bosnian accentuation. Practically, this means that sound recordings for different accent variations of the same words are hosted on this platform. The Sounded Dictionary of Names is a separate part of the dictionary appendix of the future study of the Prosodem variant of personal names by the author Jasmin Hodžić. Currently, 111 names with accent variations are provided, i.e., recordings of different accent variations of the same names. The platform also encompasses the Accent Reader⁵ and Accent Exercises.⁶ The Accent Reader provides material from a hundred accented and sounded literary texts. The texts are related to everyday Bosnian life and tradition. Videos with the pronunciation of all vowels under different accents in the Bosnian language are available, including short-descending, short-ascending, long-descending and long-ascending accents.

The platform additionally provides a Converter of Alphabets, i.e., a converter from the Latin alphabet to Glagolitic, Bosnian Cyrillic (Bosančica) and Arebica.

The Language Institute of the University of Sarajevo plans to create a large historical online dictionary of the Bosnian language that will include language material from the Middle Ages (inscriptions and charters), aljamiado texts, texts from oral literature and so-called Krajina letters. The online dictionary will provide word search functionalities, retrieving the context of the word (sentence, verse, document) from the original work.

³ https://www.e-bosanski.ba
⁴ https://www.e-bosanski.ba/rad/
⁵ https://www.youtube.com/playlist?list=PL230XGW7TwJoq3ZNvg7IF7VpcsieCLW-n
⁶ https://www.youtube.com/playlist?list=PL230XGW7TwJo2MgihumhTIX52_QxFBQrT

3 Recommendations and Next Steps

As is evident from the analysis above, there are no large monolingual corpora that are representative of the modern use of the Bosnian language or suitable for the development of large language models (Ćušić 2022). Therefore, it is necessary to start from scratch. Current data is not sufficient in either the general or specific domains. At the national level, the Council of Ministers of Bosnia and Herzegovina is a public body that could pass the necessary acts to support the development of LT for the Bosnian language, but it is unlikely that this will happen, because language is a sensitive issue in Bosnia.

References

Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov (2020). “Unsupervised Cross-lingual Representation Learning at Scale”. In: Proc. of the 58th Annual Meeting of the Assoc. for Computational Linguistics. Ed. by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault. ACL, pp. 8440–8451. DOI: 10.18653/v1/2020.acl-main.747.

Ćušić, Tarik (2022). Deliverable D1.36 Report on the Bosnian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166.
https://european-language-equality.eu/reports/language-report-bosnian.pdf.

Halilović, Senahid (1996). Pravopis bosanskoga jezika. Preporod.

Jahić, Dževad, Senahid Halilović, and Ismail Palić (2000). Gramatika bosanskoga jezika. Dom štampe.

Ljubešić, Nikola and Filip Klubička (2014). “{bs,hr,sr}WaC – Web Corpora of Bosnian, Croatian and Serbian”. In: Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35.

Ljubešić, Nikola and Filip Klubička (2016). Bosnian web corpus bsWaC 1.1. Jožef Stefan Institute, Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1062.

Chapter 7
Language Report Bulgarian

Svetla Koeva

Abstract This chapter reports on the current status of technology support for Bulgarian and highlights certain gaps. The analysis is based on the services and resources available in the European Language Grid in early 2022.
While the LT field as a whole has significantly progressed in the last ten years, we conclude that there is still a yawning technological gap between English and Bulgarian, and even between German, French, Italian, Spanish and Bulgarian. It is exactly this distance that needs to be eliminated, or at least reduced, in order to move towards Digital Language Equality for Bulgarian.

1 The Bulgarian Language

Bulgarian is the official language of the Republic of Bulgaria. It is spoken by over eight million native speakers. According to an assessment by the National Statistical Institute for the 2021 census, the population of Bulgaria is about 6,500,000. A report by the World Bank states that about 1.7 million Bulgarians lived abroad in 2020.

The official alphabet is Cyrillic. Bulgarian was the first Slavic language to have its own writing system, which dates from the 9th century. Bulgarian belongs to the family of South Slavic languages and forms part of the Balkan linguistic union. Bulgarian exhibits a number of specific characteristics that contribute to the richness of the language but can also be a challenge for natural language processing (NLP), e.g., a rather flexible word order, combined with the lack of morphological distinction for nominal cases and regular subject omission.

The Bulgarian constitution states that Bulgarian is the official language in the Republic of Bulgaria. All education and teaching provided as part of the current state curriculum, from preschool to university, is in Bulgarian. The Institute for Bulgarian Language of the Bulgarian Academy of Sciences is the official institution that monitors changes in the Bulgarian language, determines literary norms and reflects these changes in both orthography and grammar.

Svetla Koeva
Institute for Bulgarian Language Prof. Lyubomir Andreychin, BAS, Bulgaria, svetla@dcl.bas.bg
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_7

According to W3Techs, Bulgarian accounts for just 0.1% of the language content on the web (as of November 2021). The number of Bulgarian internet users in 2020 had increased by 31% in comparison to 2007, and 46% of the total population already uses the internet. Bulgarian Wikipedia, as an important source of data for NLP, is considerably smaller than the largest Wikipedias.

Bulgaria’s membership in the EU, together with the ideas of unity and diversity, and globalisation while preserving national identity, provides a real opportunity for the equal use of Bulgarian together with the other major European languages.

2 Technologies and Resources for Bulgarian

Language Technology (LT) provides solutions for the following main application areas: Text Analysis; Speech Processing; Machine Translation; Information Extraction and Information Retrieval; Natural Language Generation; and Human-Computer Interaction. This study on LT for Bulgarian is based mainly on the European Language Grid as of February 2022 (Koeva and Stefanova 2022).

Technological developments in recent years have enabled the processing of huge amounts of language data, and allowed the application of complex models and algorithms, which will lead to significant progress (including for Bulgarian). Bulgarian is present in several monolingual and multilingual corpora. Some of the multilingual corpora are sentence-aligned, which allows for cross-lingual research. However, large multilingual corpora are usually created automatically from the internet (often from Wikipedia). Annotated corpora with manually validated or manually assigned linguistic information are smaller in number and volume. There are very few examples of multimodal corpora.
Among the multilingual annotated corpora where Bulgarian is present, there are two relatively large collections: Universal Dependencies treebank v2.8.1, and the annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1), both freely available. There is an expanding collection of datasets and models for Bulgarian at Hugging Face.

Bulgarian is relatively well-resourced when it comes to dictionaries and thesauri. Most dictionaries have been developed at the Institute for Bulgarian Language, but due to copyright restrictions, some of them only offer single user queries or access for research purposes. Parts of the Bulgarian WordNet are available for download, extended with semantic classes, new semantic relations and semantic frames.

There are several NLP libraries providing sets of linguistic annotations for Bulgarian (tokenisation, sentence splitting, paragraph detection, lemmatisation, named entity recognition, dependency parsing, etc.). A number of libraries provide deep learning techniques and knowledge graphs, and report good levels of accuracy and speed (e.g., Spark NLP). Recently, two NLP pipelines (including a tokeniser, a sentence splitter, a tagger, a lemmatiser and a dependency parser) have become available: UDPipe and NLP-Cube, trained for languages with UD Treebanks, including Bulgarian. Generally, text analysis still dominates LTs for Bulgarian, while multimodal input data (such as simultaneous text, images, audio and video) is rarely processed.

The quality of speech technology for Bulgarian is not yet satisfactory. There are still no accessible and reliable speech-to-text systems for Bulgarian, especially ones working in real time.
Excluding the automatic translation offered by multinational enterprises, there are other available MT systems from and into Bulgarian with different types of access. The assessment of the quality of existing MT services, the number of language pairs, and the coverage of thematic domains still marks MT technologies for Bulgarian as underdeveloped. Recently, there have been serious advances in research on information extraction for Bulgarian: event extraction, sentiment analysis, fake news detection, fact-checking.

There is no dedicated funding or infrastructure for Bulgarian LTs. Many of the achievements and advancements in the development of language data and tools for Bulgarian have been the result of short-term funded projects and PhD theses.

A number of Bulgarian LT companies are very successful, for example, Ontotext, operating in the field of semantic technologies with its product GraphDB.

When we compare the two studies (Blagoeva et al. 2012; Koeva and Stefanova 2022), we can see that there has been progress in LRs and LTs for Bulgarian, but this is also true for other European languages. Furthermore, in a comparative analysis in 2012, Bulgarian was ranked 15th in terms of technological support, while it is ranked 21st in 2022. Nowadays, technological progress is rapid, and we should consider language models such as GPT-3 and its successors for Bulgarian and the other European languages, which will necessitate significant investments.

3 Recommendations and Next Steps

Many commonly used AI technologies are still not available for Bulgarian (Human-Computer Interaction, multimodal processing, etc.), while for others, even where the technology has made advances, there are no available applications (summarisation, question answering, etc.). Progress is typically made abroad and Bulgarian is part of some multilingual systems for MT and speech analysis.
There is already a need for open real-time MT services from and to Bulgarian combining text and speech, taking into account context, communicative purposes and different environments. Thus, speech and text technologies for Bulgarian have to be combined with technologies for other modalities: real-time image and video processing working simultaneously in multilingual environments. Natural language understanding and generation of Bulgarian have to become part of multilingual and multimodal processing.

Digital Bulgarian needs large-scale, long-term support, harmonised with the support for all European languages. The sporadic funding of various tasks and particular languages should be replaced by common goals and objectives for all European languages, which, if provided with the necessary funding, will lead to vast improvements. Efforts cannot be focused only on Bulgarian or on any single language, because multilingual and multimodal resources and technologies are currently needed.

A BLARK-like (Basic Language Resources Kit) minimum set of LRs and LTs for all European languages should be developed and maintained, taking into account that the minimum requirements change rapidly. In 2022, this set should contain large integrated models for as many applications as possible: real-time, multimodal, cross-domain and multilingual LRs and LTs; and a variety of domain-specific datasets.

Convenient and well-regulated access to data is essential for the development of new products, applications and services. To achieve a significant advance over the current situation, an increase of available (open and copyright-free) data for Bulgarian and other European languages is needed, as is an improvement in the legal conditions for (re)using data at the European level.

There is a need for dedicated education and training programmes in the field of LT and AI, as it has proven difficult to source researchers, linguists or engineers with the right combination of skills (e.g., Bulgarian language, computer science, linguistics).
To avoid duplication of effort and to promote data sharing, the European hubs and repositories intended for ready-to-use datasets, models, tools and services, such as ELG, need to be strengthened and reinforced. This will increase the overall language support and ensure the sustainability of LT solutions.

To conclude, although a number of technologies and resources for Bulgarian exist, there are far fewer technologies and resources for Bulgarian than for English as well as for some other European languages. Our vision is high-quality LT for all European languages that supports political and economic unity through cultural diversity.

References

Blagoeva, Diana, Svetla Koeva, and Vladko Murdarov (2012). Българският език в дигиталната епоха – The Bulgarian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/bulgarian.

Koeva, Svetla and Valentina Stefanova (2022). Deliverable D1.5 Report on the Bulgarian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-bulgarian.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 8
Language Report Catalan

Maite Melero, Blanca Calvo, Mar Rodríguez, and Marta Villegas
Barcelona Supercomputing Center, Spain, maite.melero@bsc.es, blanca.calvo@bsc.es, mar.rodriguez@bsc.es, marta.villegas@bsc.es

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_8

Abstract Despite its vulnerable position as a minoritised language, the presence of Catalan in the digital sphere is relatively strong, thanks to an active online community with a high technological profile. Technological support for Catalan is slowly growing, following the recent advances in AI and increased awareness of the value of language data and technologies among public and private bodies. However, more effort is needed to promote the creation of open-source solutions and resources so as to lower the investment barrier for companies to build technology for Catalan.

1 The Catalan Language

Catalan is a Romance language, closely related to Occitan, spoken in four European states – Andorra, Spain, France and Italy – where it shares space with three big languages (Spanish, French and Italian). Andorra is the only state where Catalan is the national language. In Spain, it is mainly spoken in Catalonia, Valencia, and the Balearic Islands, where it is official together with Spanish. In Valencia, the traditional denomination of the language is Valencian. Catalan is also spoken in Alghero (Sardinia) and in the south of France. The total number of habitual speakers of Catalan is estimated to be about 4.5 million.

Despite its vulnerable position as a minoritised language, the presence of Catalan in the digital sphere is relatively strong. A good example is the Catalan Wikipedia, which ranks 20th globally in terms of number of articles. The use of Catalan in websites that offer their services in Catalonia (and the rest of the Catalan-speaking territories) has been steadily growing from an estimate of 38.75% in 2002 to the current estimate of 66.03%. The digital presence of the language is uneven across sectors. Only 30.3% of the 480 most popular brands in Catalonia have their website translated into Catalan, but close to 100% of universities, NGOs and culture-related Catalan organisations have their website in Catalan. In contrast, few public organisations at the Spanish level, and none at the European level, offer a Catalan version of their website. As for social media and streaming platforms, popular sites such as Instagram, Netflix, Spotify, HBO, LinkedIn or TikTok do not offer localised Catalan versions. In spite of this, Catalan web users are considerably active online: Catalan is the 10th EU language (and 19th in the world) in terms of number of tweets, 9th of the EU (and 17th in the world) in terms of number of users who tweet in this language and 5th of the world in number of tweets per user. In the last ten years, grassroots social-media initiatives have emerged, such as Valençúbers, Canal Malaia or Creators.tv. These efforts have given visibility to more than 500 Catalan content creators on various channels, such as YouTube, Instagram, Twitter, TikTok or Twitch and have generated millions of views.

The presence of Catalan in technological products is slowly growing but very unevenly. Large technology corporations tend to consider Spain as a single language market, and consequently rarely include Catalan in innovative and popular AI applications, such as voice assistants, although they do include it among the languages offered by some of their cloud services (e.g., Google Translate and Google Cloud STT and TTS, Amazon Lex and other AWS services, etc.).

2 Technologies and Resources for Catalan

From the mid-nineties, machine translation between Catalan and Spanish began to be used intensively by press editors aiming at producing bilingual publications. Among the products developed during those years, FreeLing, a text analysis tool, and the AnCora corpus still stand out (Moreno et al. 2012).
The Corpus textual informatitzat de la llengua catalana (CTILC), manually annotated with lemma and morphological information, was collected by the Institute for Catalan Studies (IEC) during the same period, while the Academy of the Valencian Language collected the Corpus Informatitzat del Valencia and the Corpus Toponímic Valencia, reflecting the Valencian subvariant. Another relevant institution created during the first years of digitisation is TERMCAT, a public entity entrusted with the creation of terminological resources and the standardisation of neologisms.

Current LTs rely heavily on the use of large language models trained on large corpora (Melero et al. 2022). The recent CaText is the largest web corpus in Catalan with acceptable quality, while AnCora remains the largest and most complete annotated corpus. There is a noticeable lack of specialised annotated corpora in Catalan for a variety of domains, genres and tasks, both for fine-tuning and evaluation purposes. Luckily this trend is starting to turn, and a series of datasets annotated for text classification, question answering, summarisation, textual similarity, and named entity recognition, among others, are being created in the framework of the AINA project. AINA has also released monolingual language models trained on CaText.

One of the most popular and widely used LTs is machine translation (MT). To train MT models, bilingual parallel data is needed. Most of the largest bilingual corpora are between Catalan and Spanish, although many are not publicly available. Both the OPUS initiative and the Paracrawl project offer multilingual models also containing Catalan texts. Several online platforms offer translation services for Catalan, like Google Translate and MS Bing, although one of the best rated ones, DeepL, does not yet include Catalan.
In addition, some open-source initiatives have built downloadable translation models that include Catalan, such as rule-based Apertium (to and from most Romance languages) and neural-MT Softcatala (to and from some European languages). More work is needed in MT involving Catalan to improve existing models and add more languages, like Chinese, Russian or Arabic. This would have major impacts, e.g., on e-commerce, the integration of migrants, and the international diffusion of Catalan audiovisual productions.

Speech recognition and synthesis are trained on audio datasets. The Mozilla Common Voice project has been very successful among the Catalan community, having produced over 1,300 hours of recorded speech. Another important resource is ParlamentParla, an open-source speech corpus consisting of around 611 hours of parliamentary speeches. Aside from that, we find smaller transcribed audio corpora for specific purposes (e.g., prosody, clinical, social and geographical variation). Local companies, like Verbio, offer customised solutions involving STT and TTS technologies in Catalan, such as automatic subtitling for Catalan television. Catotron is a recent open-source TTS tool for Catalan developed by CollectivaT using deep learning models trained on an audio corpus provided by Catalan television.

Catalan Sign Language (LSC) is used by more than 25,000 people in Catalonia. There is an ongoing project to collect an LSC corpus carried out by the IEC and the Pompeu Fabra University. The current amount of data is still insufficient to develop translators and other technology related to LSC, thus more efforts should be devoted to this sensitive area.

In Catalonia, the AI strategy (Catalonia.ai) is led by the Department of Digital Policies, which has recently approved the AINA project to promote the development of technological applications in Catalan, in collaboration with the Barcelona Supercomputing Center. AINA has already started to produce concrete results (see above).
There is a sizeable number of research groups focused on NLP or speech technologies in universities and research centres across Catalonia and Valencia. There is also a vibrant ecosystem of small and medium local enterprises providing language services and developing intelligent applications, some of them offering Catalan, although less often than desired due to the initial investment barrier. Among the relevant stakeholders it is worth mentioning the role played by Softcatala since the beginning of digitisation. Softcatala is a non-profit association whose basic aim is to promote the use of Catalan in computer science, the internet and new technologies. Since their origins, in 1998, they have contributed to open-source software localisation and have developed free tools, such as spell-checkers, translation models, synonym dictionaries and multilingual dictionaries.

3 Recommendations and Next Steps

The recent advances in AI-powered LTs have resulted in an increased awareness, among Catalan society and political bodies, of the importance of LT and language data. However, public administrations still host very large volumes of non-confidential data that is suitable for developing cutting-edge technology but remains unexploited and inaccessible. We feel that due to this increased awareness, this is beginning to change. We expect that the European directives on the reuse of public information will soon be fully implemented in the Catalan administration, and open access to language data, which is recognised as essential for the development of new applications and services in Catalan, will become standard. Given the particularities of the Catalan market, supporting open-source solutions would decrease dependence on large corporations for developing cutting-edge technology for Catalan. Moreover, having access to open-source solutions and resources will allow small and medium-sized companies (and potentially also large ones) to develop applications in Catalan without having to face the initial investment barrier.
A significant way of stimulating the market and driving the demand for technology in Catalan is to increase the innovation capacity of Catalan public services by incorporating technological solutions that include Catalan. This will eventually benefit the citizens down the line as well. Finally, the creation of an independent Centre of Excellence dedicated to Catalan LTs would be a way of (1) increasing the visibility and sustainability of infrastructures and resources, both existing ones and those soon to be created by current projects; (2) offering more educational and training LT programmes in Catalonia to increase the number of trained experts; (3) facilitating technology transfer between academia and industry; and (4) boosting a growing economic sector, while guaranteeing the position of Catalan in the digital challenge.

References

Melero, Maite, Blanca C. Figueras, Mar Rodríguez, and Marta Villegas (2022). Deliverable D1.6 Report on the Catalan Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-catalan.pdf.

Moreno, Asunción, Núria Bel, Eva Revilla, Emília Garcia, and Sisco Vallverdú (2012). La llengua catalana a l’era digital – The Catalan Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/catalan.
Chapter 9
Language Report Croatian

Marko Tadić
University of Zagreb, Croatia, marko.tadic@ffzg.hr

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_9

Abstract This chapter presents a summary of the Language Report on Croatian (Tadić 2022), covering general features of the language and the level of technological support it has received since the previous report (Tadić et al. 2012). The chapter includes information about the typological and structural features of Croatian, its status and usage in the digital sphere and its support through Language Technologies.

1 The Croatian Language

The Croatian language belongs to the West-South Slavic subgroup of the Balto-Slavic branch of the Indo-European linguistic family. It is the native language of over 5 million speakers. Croatian consists of the dialects and standard national language of the Croats, and is the official language of just under 4 million people in Croatia. Along with Bosnian and Serbian, it is one of the three official languages in Bosnia and Herzegovina, where it is spoken by about 400,000 people. Croatian is also spoken by national minorities in Croatia as well as by autochthonous Croatian minorities in Serbia, Montenegro, Slovenia, Hungary, Austria, Slovakia and Italy. Croatian is also used abroad. The largest Croatian diaspora is located in Germany, followed by the US, Canada and Australia. In 2013 Croatian became the 24th official EU language. According to the 2011 census, 90.42% of the country’s inhabitants are ethnic Croats, with Croatian the native language for 95.6%. Croatian is the main language used and taught in schools. The literacy ratio in Croatia is 99.2%. Croatian was written with three scripts (Glagolitic, Cyrillic, Latin), and the Latin script became dominant in the 16th century. It was standardised after 1835, when the Croatian Latin alphabet settled on its modern-day form.

The phoneme inventory of the Croatian standard language consists of 6 vowels and 25 consonants. Croatian differentiates ten parts of speech, five of which inflect (nouns, adjectives, numerals (partially), pronouns, verbs) and four do not (prepositions, conjunctions, particles, exclamations), while some adverbs inflect only in comparison. The grammatical categories marked on the majority of declinable words are gender (3 values), number (2 values), and case (7 cases). Definiteness is marked on adjectives and animacy in the accusative singular form of masculine nouns and adjectives. Verbs use categories of manner (5 values), person (3 values), number (2 values), voice (2 values) and tense (7 values). The verbs biti (‘to be’) and htjeti (‘will’) are auxiliary. Verbs also have an elaborate aspectual system (imperfective and perfective with additional subvalues such as inchoativity, iterativity, partitivity etc.) and they can also be reflexive. Adjectives and adverbs can inflect for comparison (3 values). Croatian is characterised by an SVO syntactic pattern and relatively free word order. Double negation is required. The agreement of components in gender, number and case is typical.

The Croatian Web Archive catalogues and stores web resources: portals, websites of institutions, associations, events, scientific projects, books, journals, etc. from 1998. The Croatian Wikipedia has 211,970 articles (31 May 2022) and is ranked 47th. Croatian is prevalently used on major social media. Croatian appears in Google Translate (since 2008) and Bing Translator as a source or target language. Most social media offer translations of posts in/from Croatian, while popular open-source software as well as systems and interfaces by Apple, Google and Microsoft are localised.
2 Technologies and Resources for Croatian

In the last decade, the development of Croatian LT advanced primarily because Croatia joined the EU in 2013. The position of the 24th official EU language resulted in the inclusion of Croatian in large multilingual NLP campaigns, and it started to be researched by non-Croatian NLP experts, too. Although in some areas a number of fundamental resources are not yet available for Croatian, progress has been made in LR collection, text analytics, language models, computer assisted language learning and machine translation (MT), but speech processing is still seriously underdeveloped. A number of EU and nationally funded projects were running mostly in academic institutions. Fundamental tools for lemmatisation, PoS tagging, NER and syntactic analysis have been provided, but there are no robust and reliable industrial systems. In the area of NLU, there is a newer version of Croatian Wordnet (v2.1) and in 2016, a layer of semantic roles was added to the Croatian Dependency Treebank thus providing basic LRs for semantic processing at lexical and clausal levels.

After the release of the Croatian National Corpus v3 in 2013, there were significant advances in large corpus collection, e.g., hrWaC v2.1, ParlaMint-HR 2.1, MARCELL Croatian Legislative Corpus, etc. A number of smaller specialised corpora appeared: for processing social media, for sentiment analysis, for investigation of speech disorders, or language learning.

Available bilingual data are either stand-alone parallel corpora, e.g., hrenWaC 2.0, bi-texts in the domain of tourism, or the MARCELL Croatian-English Parallel Corpus of Legislative Texts, or the results of data collection campaigns, e.g., ParaCrawl, Bible translations, and parallel corpora collected from public institutions.
Croatian became a language of interest in multilingual initiatives and shared tasks: Universal Dependencies (UD), C4Corpus, Deltacorpus, EU Patents, EU EAC TM, JRC DGT TMs, ParlaMint corpora and Comparable Wikipedias of South Slavic Languages, OSCAR, SETimes, TED talks, OPUS, W2C, and WikiMatrix.

The largest freely available lexical resources are inflectional lexicons: Croatian Morphological Lexicon (HML) and hrLEX v1.3. There is only one general language dictionary freely available for online search to its fullest extent: Hrvatski jezični portal accessing the lexicographical base of a publisher. Access to other lexica is limited through a proprietary app by another publisher. Other larger lexica are specialised like the Croatian Old Dictionaries Portal, Dictionary of Neologisms, or dictionaries compiled by the Institute of Croatian Language and Linguistics covering spelling, phrasemes, valencies, collocations and Croatian Terminology Portal. Special types of lexica are Croatian Derivative Lexicon, CroDeriv and DerivBase.HR, that represent the first steps of processing at the level of derivative morphology and both are connected with the Universal Derivations.

The development of NooJ grammar models accelerated because it was present in teaching at undergraduate and postgraduate levels of linguistics and information sciences. Recently, after the introduction of language model approaches, a similar model was built for Croatian but usually in combination with other languages, such as CroSloEngual BERT, BERTić or ELMo embeddings models.

The best pipeline for processing Croatian is developed within the UD initiative (UDPipe) and it found its way into the GATE, WebLicht and ELG platforms. The UD data served also to produce the Croatian segment in UDify. Apart from the Croatian Language Processing Pipeline developed in 2013, there is the CLASSLA fork for the Stanford Stanza pipeline for processing South Slavic languages. Also, at the lexical and event semantics level, two popular online services feature Croatian, among other languages: Wikifier and Event Registry.
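Pipelines built on Universal Dependencies data, such as UDPipe, typically write their analyses in the CoNLL-U format, which uses ten tab-separated fields per token. The following is a minimal sketch of a reader for that format, not a replacement for any of the tools named above. The sample sentence and its annotations are hand-written for illustration and were not produced by any tool, and the names `Token`, `parse_feats`, `parse_conllu` and `SAMPLE` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    """A subset of the ten CoNLL-U fields, enough for simple inspection."""
    id: int
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(raw: str) -> Dict[str, str]:
    """Turn 'Case=Acc|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into a list of sentences (lists of tokens)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. '# text = ...'
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 fields, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        head = int(cols[6]) if cols[6].isdigit() else -1
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                             parse_feats(cols[5]), head, cols[7]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative annotation of "Ana čita knjigu." ("Ana reads a book.")
SAMPLE = "\n".join([
    "# text = Ana čita knjigu.",
    "\t".join(["1", "Ana", "Ana", "PROPN", "_",
               "Case=Nom|Gender=Fem|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "čita", "čitati", "VERB", "_",
               "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
               "0", "root", "_", "_"]),
    "\t".join(["3", "knjigu", "knjiga", "NOUN", "_",
               "Case=Acc|Gender=Fem|Number=Sing", "2", "obj", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
]) + "\n"

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token.id, token.form, token.lemma, token.upos, token.head, token.deprel)
```

Keeping the morphological features as a dictionary makes it easy to query the case, gender and number values that a richly inflected language such as Croatian marks on most declinable words.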
In BabelNet, Croatian is well represented and ranked 41st with almost 3 million synsets.

Support for Croatian as a source and target language in MT systems was provided as early as in MT@EC, followed by CEF AT and eTranslation. The introduction of the NMT paradigm increased the translation quality, as shown in the CEF project EU Council Presidency Translator, developed for the Croatian EU Presidency in 2020. The system outperformed Google Translate in the hr→en and en→hr directions by several BLEU points and in 2020 it translated more than 60 million tokens.

From 2015 to 2016 within the ESF-funded project HR4EU, a Portal for Learning Croatian as a Foreign Language was produced.

Despite some attempts at the Universities of Zagreb and Rijeka, speech technology is the most underdeveloped area for Croatian; no free industry-level applications exist. The commercial players have started to offer speech modules for Croatian. The Collins Multilingual databases WordBank and PhraseBank have included Croatian since 2016, while the GlobalPhone Croatian Pronunciation Dictionary has been available since 2013. TalkBank is the final offering in this limited set of speech data for Croatian. Support for Android devices is provided at the level of the operating system, but it does not exist in the iOS environment.

A nationally funded programme for LT ran from 2007 to 2012. It disseminated LT research from the Faculty of Humanities and Social Sciences, University of Zagreb to a number of other institutions in Croatia. The Croatian LT Society has a mission to unofficially coordinate LT activities in Croatia. The dominant role regarding further development of Croatian LT in the last decade was played by the EU through its FP7, ICT-PSP, H2020 and CEF programmes, funding the involvement of several Croatian research teams where expertise persists to this day, but R&D rarely involves industry. Croatia joining CLARIN ERIC provided additional impetus. At the national level, several projects were funded through the Croatian Research Council.
3 Recommendations and Next Steps

There is a lot of currently inaccessible data that could make an impact on the future of Croatian LT but is still not recognised as language data, e.g., texts produced by public administrations, aligned audio and subtitles archived by the national broadcaster, and the Croatian Scientific Journals Portal with open access. The long-term plan is to secure the presence of Croatian NLP modules in the major NLP platforms such as spaCy, FreeLing, NLP Cube, TextRazor, Cloud Natural Language, Apache OpenNLP, etc., in order to secure the wider usage of LT for Croatian and its digital language equality with other languages.

References

Tadić, Marko (2022). Deliverable D1.7 Report on the Croatian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-croatian.pdf.

Tadić, Marko, Dunja Brozović-Rončević, and Amir Kapetanović (2012). Hrvatski Jezik u Digitalnom Dobu – The Croatian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/croatian.
Chapter 10
Language Report Czech

Jaroslava Hlaváčová

Abstract This chapter provides basic data about Language Technology for the Czech language. After a brief introduction with general facts about the language (history, linguistic features, writing system, dialects), we touch upon Czech in the digital sphere. The main achievements in the field of NLP are presented: important datasets (corpora, treebanks, lexicons etc.) and tools (morphological analyzers, taggers, automatic translators, voice recognisers and generators, keyword extractors etc.).

1 The Czech Language

Czech, one of the West Slavonic languages, has about 10 million speakers, most of whom live in the Czech Republic (Czechia). In other parts of the world, there are about 200,000 speakers. Czech is the official language in Czechia, and since May 2004 it has been one of the administrative languages of the EU. It is used in administrative, judicial and other official proceedings (see Bojar et al. 2012, for more details).

The Czech language has several varieties, especially in its spoken form. Literary (standard) Czech is a prestige variety, which is taught in schools and strongly preferred in official texts and the media. In everyday communication, most people prefer other varieties of Czech. The most widespread one is common Czech, based on the Central Bohemia dialects. In Moravia and Silesia, dialects such as Hanak, Lach, and Czecho-Moravian are still used actively. Common Czech and these dialects differ from the literary variety, especially in morphology, and to a lesser extent in terms of the lexicon and pronunciation. Other differences are marginal.

In writing, initially, the medieval Latin alphabet was used and for sounds not present in Latin, digraphs were used. In the early 15th century, the religious reformer Jan Hus replaced the digraphs with single letters with diacritics (“háček” for the palatal/palatalised consonants – č, ď, ň, ř, š, ť, ž; “čárka” for long vowels – á, é, í, ó, ú, ý). The only digraph surviving in modern Czech is ch. Long u might have a ring, ů, coming from the chain of changes ó > uo > ů.
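For text processing, the letters with háček, čárka and ring mentioned above matter because Unicode can represent each of them either as a single precomposed code point or as a base letter followed by a combining mark. The following is a minimal sketch using Python's standard `unicodedata` module to show that decomposition; the helper name `describe` and the constant `LETTERS` are hypothetical and only serve this illustration.

```python
import unicodedata
from typing import Tuple

# Letters with háček, letters with čárka, and u with a ring.
LETTERS = "čďňřšťž" + "áéíóúý" + "ů"


def describe(letter: str) -> Tuple[str, str]:
    """Return the base letter and the Unicode name of its combining mark."""
    decomposed = unicodedata.normalize("NFD", letter)  # base + combining mark
    base, mark = decomposed[0], decomposed[1]
    return base, unicodedata.name(mark)


if __name__ == "__main__":
    for letter in LETTERS:
        base, mark_name = describe(letter)
        print(letter, "=", base, "+", mark_name)
```

Normalising all input to one form (for example NFC) before tokenisation or dictionary lookup avoids treating two encodings of the same Czech word as different strings.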
Jaroslava Hlaváčová
Charles University, Czech Republic, hlavacova@ufal.mff.cuni.cz

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_10

The Czech Republic has .cz as the top-level internet domain. It came into effect in January 1993 after the split of the former Czechoslovakia, which had the domain .cs. As of 21 October 2021, 1,412,102 websites with the top-level domain .cz were registered. There were 9.66 million internet users in Czechia in January 2022 (https://datareportal.com/reports/digital-2022-czechia). This number increased by 120,000 (+1.3%) between 2021 and 2022. Internet penetration in Czechia stood at 90.0% in January 2022. There were 8.05 million social media users in Czechia in January 2022 (about 75% of the total population). The number increased by 660,000 (+8.9%) between 2021 and 2022.

2 Technologies and Resources for Czech

There are several groups in Czech universities working on all areas of NLP (Hlavacova 2022). These include, in particular, Charles University in Prague, University of West Bohemia, Czech Technical University, Technical University of Liberec, Masaryk University in Brno, Brno University of Technology and Palacký University in Olomouc. Apart from academia, many companies develop LT, usually (but not always) with a narrower focus. The LINDAT/CLARIAH-CZ Research Infrastructure for Language Technologies brings together all the achievements in one place, which makes them easily accessible to the wider public.

The main sources of contemporary Czech data are the corpora of the series SYN (Hnátková et al. 2014). SYN2000, SYN2005, SYN2010, SYN2015 and SYN2020 are balanced (representative) corpora of written Czech, morphologically annotated, around 100 million tokens each. SYN2006PUB, SYN2009PUB and SYN2013PUB are corpora of contemporary Czech newspapers and magazines containing 300, 700 and 935 million words (MW), respectively. All of the SYN corpora are joined together into a single corpus, the last version being SYN v10 (Křen et al. 2021), the corpus of contemporary written (printed) Czech. It contains 5.9 billion words (GW).

The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0) is a richly annotated and genre-diversified resource (Hajič et al. 2020a). It consolidates the existing PDT corpora of Czech data, annotated using the standard PDT scheme.

Bilingual data is represented mainly by Czech-English corpora. The 4th release of the Czech-English corpus CzEng 1.0 (Bojar et al. 2011) contains 15 million parallel sentences from seven different types of sources automatically annotated at the surface and deep layers of syntactic representation.

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets.

The main tools for NLP are UDPipe (Straka 2020), a trainable pipeline for segmentation, tokenisation, POS tagging, morphological analysis, lemmatisation and dependency parsing of raw texts, and MorphoDiTa: Morphological Dictionary and Tagger (Straka and Straková 2015). MorphoDiTa performs morphological analysis, morphological generation, tagging and tokenisation and is distributed as a standalone tool or as a library, along with trained linguistic models.

The best-performing tool for Czech-English translation is the deep-learning system CUBBITT (Popel et al. 2021).
In a context-aware blind evaluation by hu­man judges, CUBBITT significantly outperformed professional-agency English-to­Czechnewstranslationinpreservingtextmeaning(translationadequacy).Whilehu-man translation is still rated as more fluent, CUBBITT is shown to be substantially morefluentthanpreviousstate-of-the-artsystems.MostparticipantsofaTranslation Turing test struggle todistinguish CUBBITTtranslations from human translations. The work on speech recognition and indexing for digitised oral history archives MALACH(Holocaustsurvivors’testimony,archiveoftheInstitutefortheStudyof Totalitarian Regimes)2 continues and new tools are beingdeveloped. The Alquist Dialogue System3 is the social bot developed by a team of students from the Czech Technical University in Prague. Alquist is an advanced Conversa­tional AI bot carrying out entertaining and engaging conversations with humans on populartopicssuchasmovies,sports,news,etc.In2017and2018,itgainedsecond placein the Alexa Prize contestsin acompetition withover 100 academic teams. ThebasiclexiconisMorfFlex(Hajiè etal. 2020b), themorphologicaldictionary ofCzech, with full inflectionalinformationfor every wordform, encoded in aposi­tional tag. Wordforms are organised into paradigms according to their formal mor­phological behaviour. They are identified by aunique lemma. 3 RecommendationsandNextSteps TheNationalArtificialIntelligenceStrategy(2019-2030)oftheCzechRepublicwas releasedin2019bytheMinistryofIndustryandTrade.ItmentionsNLPamongthe disciplinesrelatedtohuman-machineinteraction,i.e.,asoneoftheprominentfields to be supported. At the same time, AICzechia,4 a national initiative for cooperation between Czech stakeholders in the field of AI, was established. In terms of NLP applications, it wants to target traditional areas such as defence/security, media and government, but also new domains such as social networks, smart homes and busi­ness support. 
It will maintain and expand activities in international organisations in the field (META-NET, CLARIN ERIC, LTInnovate, BDVA, ISCA, ACL, IEEE, ELRA and LDC). These documents indicate that AI, including NLP, will continue to be supported.

2 https://ufal.mff.cuni.cz/malach/en
3 http://alquistai.com
4 https://www.aiczechia.cz

References

Bojar, Ondřej, Silvie Cinková, Jan Hajič, Barbora Hladká, Vladislav Kuboň, Jiří Mírovský, Jarmila Panevová, Nino Peterek, Johanka Spoustová, and Zdeněk Žabokrtský (2012). Čeština v digitálním věku – The Czech Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/czech.

Bojar, Ondřej, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna (2011). Czech-English Parallel Corpus 1.0 (CzEng 1.0). LINDAT/CLARIAH-CZ, Charles University. ÚFAL MFF UK. http://hdl.handle.net/11234/1-1458.

Hajič, Jan, Eduard Bejček, Alevtina Bémová, Eva Buráňová, Eva Fučíková, Eva Hajičová, Jiří Havelka, Jaroslava Hlaváčová, Petr Homola, Pavel Ircing, Jiří Kárník, Václava Kettnerová, Natalia Klyueva, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, David Mareček, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Michal Novák, Petr Pajas, Jarmila Panevová, Nino Peterek, Lucie Poláková, Martin Popel, Jan Popelka, Jan Romportl, Magdaléna Rysová, Jiří Semecký, Petr Sgall, Johanka Spoustová, Milan Straka, Pavel Straňák, Pavlína Synková, Magda Ševčíková, Jana Šindlerová, Jan Štěpánek, Barbora Štěpánková, Josef Toman, Zdeňka Urešová, Barbora Vidová Hladká, Daniel Zeman, Šárka Zikánová, and Zdeněk Žabokrtský (2020a). Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0). LINDAT/CLARIAH-CZ, Charles University. Czech Republic. http://hdl.handle.net/11234/1-3185.

Hajič, Jan, Jaroslava Hlaváčová, Marie Mikulová, Milan Straka, and Barbora Štěpánková (2020b). MorfFlex CZ 2.0. LINDAT/CLARIAH-CZ, Charles University.
http://hdl.handle.net/11234/1-3186.

Hlavacova, Jaroslava (2022). Deliverable D1.8 Report on the Czech Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-czech.pdf.

Hnátková, Milena, Michal Křen, Pavel Procházka, and Hana Skoumalová (2014). “The SYN-series corpora of written Czech”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC). Reykjavík, Iceland, pp. 160–164.

Křen, Michal, Václav Cvrček, Jan Henyš, Milena Hnátková, Tomáš Jelínek, Jan Kocek, Dominika Kováříková, Jan Křivan, Jiří Milička, Vladimír Petkevič, Pavel Procházka, Hana Skoumalová, Jana Šindlerová, and Michal Škrabal (2021). SYN v9: large corpus of written Czech. LINDAT/CLARIAH-CZ, Charles University. http://hdl.handle.net/11234/1-4635.

Popel, Martin, Markéta Tomková, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský (2021). CUBBITT Translation Models (en-cs) (v1.0). LINDAT/CLARIAH-CZ, Charles University. http://hdl.handle.net/11234/1-3733.

Straka, Milan (2020). UDPipe 2. Prague, Czech Republic: ÚFAL MFF UK.

Straka, Milan and Jana Straková (2015). MorphoDiTa. ÚFAL MFF UK.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 11
Language Report Danish

Bolette Sandford Pedersen, Sussi Olsen, and Lina Henriksen

Abstract This chapter summarises the current level of language technologies (LT) and resources for Danish (Pedersen et al. 2022). Even if Danish LTs are now being used in many areas of society, their quality still needs to be improved in order to make them more useful and inclusive for the majority of the population. To this end, the development of large, high-quality language resources and data sets still proves to be a bottleneck. We report, however, on an increased awareness of sharing and reusing language resources and data sets across public institutions, academia and industry. New, large governmental initiatives within the area of AI and LT have been initiated which support this development.

1 The Danish Language

Danish is a Mainland Scandinavian language and the official language of Denmark, which has about 5.831 million inhabitants. Danish phonology distinguishes itself from that of several of its neighbouring languages by exhibiting a very large number of vowels and by having glottal stop as a meaning-differentiating feature (e.g., ‘!hund’ (‘dog’) vs ‘hun’ (‘she’)). Furthermore, phonetic reductions are very common, a fact which complicates Danish speech technology since, to give just one example, word boundaries become very hard to identify.

In the written language, the fact that compounds are spelled as one word (as in other Germanic languages) complicates the development of language tools. Furthermore, compounds are generated dynamically and so are only partially accounted for in dictionaries. The very extensive use of particles with semi-lexicalised meanings poses a challenge to LT systems. The constructions often occur discontinuously in spoken and written Danish, a fact which tends to require large amounts of language data in order for them to be well represented in the corresponding language models.
Bolette Sandford Pedersen · Sussi Olsen · Lina Henriksen
University of Copenhagen, Denmark, bspedersen@hum.ku.dk, saolsen@hum.ku.dk, linah@hum.ku.dk

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_11

The influence of the English language on Danish language users is increasing. Loanwords and fixed phrases do not influence the language system as such, but the syntax is influenced in some cases. For instance, some Danish verbs change their valency pattern because of the influence from English, as is the case for ‘at gro’ (‘to grow’), which is now beginning to occur as transitive, as in ‘kan man gro trøfler i Danmark?’ (‘can you grow truffles in Denmark?’). In addition, the placement of adverbials tends to be increasingly influenced by English.

2 Technologies and Resources for Danish

In recent years a number of repositories for Danish LT have been established. The following overview is primarily based on these, including the Danish CLARIN platform, the repository of The Danish Agency for Digitisation, sprogteknologi.dk, the DaNLP list and the DaCy repository (for references, see Pedersen et al. 2022, 2012).

Large Danish text corpora have typically been collected by institutions that develop dictionaries. These host very large balanced corpora today, but due to intellectual property rights they are not entirely open source and ready to use for industry. For research and non-commercial purposes, the DK-CLARIN Reference Corpus of General Danish (45 million words) has been available for a decade at the CLARIN-DK repository. Recently, the Danish GigaWord initiative has launched a freely available billion-word corpus of Danish texts assembled by a group of researchers.

In recent years several statistical and neural language models for Danish have been developed, based primarily on the above-mentioned corpora. Schneidermann et al. (2020) report on six different neural models with different correlations with a hand-crafted similarity data set.
Recently, a number of contextualised, pre-trained models have also been developed for Danish. The ScandEval benchmark evaluates these and other models. Overall, recent models enable improved language processing for Danish with, e.g., a better grasp of the variation of word meaning in running text. Here, diversity in the training data is becoming more relevant, since a lack of it can result in biases with respect to gender, ethnicity, regional origin etc.

Parallel text corpora are primarily used to build statistical models for machine translation. These models are highly dependent on really large amounts of text data within all domains. The number of parallel corpora including Danish has increased somewhat over the last few years, especially corpora where the other language is English. In recent years the EU initiative European Language Resource Coordination (ELRC) has helped increase awareness of the value of parallel corpora, in collaboration with three nationally located anchor points.

Large public speech corpora are generally in short supply for Danish, a fact which complicates the development of speech technologies. However, a few such resources exist at a medium scale, namely the Danish NST ASR Database at the Norwegian Språkbanken, compiled originally by the company Nordisk Sprogteknologi, DanPASS, and the Danish Parliament Speech Corpora. The production of a large, transcribed and time-encoded speech corpus is foreseen as part of the government’s new AI initiative, launched in 2022.

The Danish Universal Dependencies Treebank (UD-DDT), which has annotations for dependency structure, part-of-speech and named entities, constitutes a basic resource in terms of syntax. The STO lexicon also contains syntactic information such as valency information.
Lexical semantic resources of various kinds are also available for Danish. The Danish wordnet, DanNet, is the largest semantic resource with around 70,000 concepts. More specific resources are framenets and Danish sentiment lexicons, various lists of person names, addresses, place names, and some dialect dictionaries. A joint computational lexical project, Central Word Register for Danish (COR), combines several of these resources in one joint resource.

Preprocessing tools for Danish such as lemmatisers, part-of-speech taggers, named entity recognisers, and parsers have existed for several years and are continuously upgraded, partly based on the above-mentioned resources. Even if there is still room for improvement, these tools generally achieve high accuracy and are integrated today into most advanced systems.

Integrated LT services include speech technology, machine translation, and abstracting systems. Dictus ApS and Omilon are examples of Danish companies that deliver dictation solutions to citizens and organisations such as the Danish Parliament, the healthcare system, schools, Danish TV stations and many more. Speech technology is also used in some chatbots and virtual assistants, and examples of services working for Danish are Siri and Google Assistant. Their performance, however, still leaves room for improvement. Open-source packages for developing speech recognition for Danish are generally scarce. An example is the open-source Python package DanSpeech (now Alvenir) from the Technical University of Denmark.

Currently, most public institutions outsource their translation tasks to private companies, and this trend is rising. Machine translation is applied in almost all fields of translation, and the quality is improving. Recent benchmarking reports for Danish-English and English-Danish show acceptable BLEU scores over 0.70 (depending on the domain) for Google Translate, eTranslation, and DeepL. However, translation quality decreases dramatically when Danish is used in combination with other languages.
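As a rough illustration of how a BLEU score on the 0 to 1 scale quoted above is obtained, the following sketch implements a simplified, single-reference, sentence-level variant without smoothing. It is not the procedure used in the benchmarking reports mentioned in the text, which presumably relied on corpus-level BLEU and standard tooling; the function names and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-1 scale (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision gives 0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "can you grow truffles in Denmark"            # hypothetical
    candidate = "can you grow truffles in Denmark today"      # hypothetical
    print(round(sentence_bleu(candidate, reference), 3))
```

A score of 1.0 is reached only when the candidate reproduces the reference exactly; a single extra or missing word already lowers all four n-gram precisions, which is why scores are only comparable when computed with the same tokenisation, references and BLEU variant.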
Other services include technologies such as anonymisation, sentiment analysis, automatic abstracting, summarisation, fake news detectors etc., of which only a few currently exist off-the-shelf for Danish. Areas such as opinion mining and sentiment analysis are growth areas since many companies and institutions in Denmark feel an increasing need to monitor opinions and sentiments on the web.

3 Recommendations and Next Steps

Several factors play a role in how fast and how well a language community like the Danish one adapts to new technological advances. Even if Denmark is one of the most digitised countries in the world, its relatively small size – both as a language community and commercial market, together with our high proficiency in English – seem to have delayed the investments and developments in Danish language processing and LT. The specific characteristics of Danish may also play a role. However, we see renewed interest in LT at all levels of Danish society. New stakeholders are emerging day by day together with the increasing tendency of introducing language-centric AI in nearly all aspects of society; recent tentative counts indicate that more than 70 Danish SMEs have entered the LT scene. With this development comes more focus and better understanding of the challenges of language processing and of why a continuous upgrade of Danish language resources is indispensable. This increased acknowledgement and tendency towards sharing resources across fields is seen in academia, industry and public administration and will definitely boost LT for Danish in the coming years. New governmental investments in AI and LT are supporting industry and research in this development. All this being acknowledged, the need for continuous coordinated efforts based in public institutions, industrial settings as well as research still remains. One precondition for supporting this effort is to ensure that sufficient highly qualified staff are educated in NLP. It is recommended that NLP study programmes are sufficiently supported and prioritised at higher educational and ministry levels.
Governmental focus on industry, research, and education is indispensable if we are to ensure that Danish stays on track to being a digitally fully functional language, also in future language-centric AI solutions.

References

Pedersen, Bolette Sandford, Sussi Olsen, and Lina Henriksen (2022). Deliverable D1.9 Report on the Danish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-danish.pdf.

Pedersen, Bolette Sandford, Jürgen Wedekind, Steen Bohm-Andersen, Peter Juel Henrichsen, Sanne Hoffensetz-Andresen, Sabine Kirchmeier-Andersen, Jens Otto Kjærum, Louise Bie Larsen, Bente Maegaard, Sanni Nimb, Jens-Erik Rasmussen, Peter Revsbech, and Hanne Erdman Thomsen (2012). Det danske sprog i den digitale tidsalder – The Danish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/danish.

Schneidermann, Nina, Rasmus Stig Hvingelby, and Bolette Sandford Pedersen (2020). “Towards a Gold Standard for Evaluating Danish Word Embeddings”. In: 12th International Conference on Language Resources and Evaluation, Conference Proceedings (LREC 2020). Ed. by Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, pp. 4754–4763.

Chapter 12
Language Report Dutch

Frieda Steurs, Vincent Vandeghinste, and Walter Daelemans

Abstract This chapter provides a new state of affairs (Steurs et al. 2022) with regard to language technology (LT) for Dutch (after Odijk 2012). LT for Dutch is highly developed, and the Netherlands and Flanders have a strong and cooperative LT community. A lot of digital data is freely available through CLARIN and the Dutch Language Institute (INT). However, data and software have to be updated continuously, and there is a need for a new overarching programme to support research initiatives.

1 The Dutch Language

Dutch is a West-Germanic language spoken by about 25 million people as a first language in the Netherlands and Belgium and by 5 million people as a second language (Steurs 2021). It is a close relative of both German and English and shares with German the survival of two to three grammatical genders, as well as the use of modal particles, final-obstruent devoicing, and similar word order. The vocabulary is mostly Germanic and incorporates slightly more Romance loans than German but far fewer than English. Some characteristics are challenging for computational processing, such as a relatively free word order with differences between main and subordinate clauses, and productive compounding. Separable verb prefixes can occur far from the verb, and the meaning of a separable verb is often non-compositional. Written Dutch is a monocentric standardised language, with lexical and pronunciation variety between the Netherlands and Flanders. In contrast to its written uniformity, Dutch lacks a unique prestige dialect and has a large dialectal continuum consisting of 28 main dialects.
Dutch is used by 1.3% of all websites and is the 12th most used language in terms of number of websites. The Dutch Wikipedia is the sixth-largest Wikipedia edition. Dutch is often used in social media, which leads to new linguistic trends and sublanguages, for which corpora are required to allow investigation.

Frieda Steurs · Vincent Vandeghinste
Dutch Language Inst., The Netherlands, frieda.steurs@ivdnt.org, vincent.vandeghinste@ivdnt.org

Walter Daelemans
University of Antwerp, Belgium, walter.daelemans@uantwerpen.be

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_12

2 Technologies and Resources for Dutch

The Dutch Language Institute (INT) keeps a detailed list of tools and resources for Dutch at K-Dutch,1 many of which are downloadable.2 The Language Machines website also makes plenty of LT tools available as web services.3

SoNaR (Oostdijk et al. 2013) is a reference corpus containing different text genres. Parallel data is available through the Dutch Parallel Corpus (Paulussen et al. 2013) and through OPUS. The Spoken Dutch Corpus (Oostdijk et al. 2002) contains 900 hours (9 million words) of speech and is manually transcribed and linguistically annotated. A new large up-to-date corpus for spoken Dutch containing more recent language and more variants is in high demand. Notwithstanding the popularity of social media, it is difficult to collect and share such data due to restrictions in the EU’s GDPR, and only a limited part of SoNaR contains this register.

GiGaNT is a computational lexicon with a historical and a modern component. Open Dutch WordNet (Postma et al. 2016) is a freely available Dutch lexical semantic database. A more contemporary version with better coverage would be desirable.

HuggingFace, a hub for pre-trained language models (LMs), lists 112 pre-trained LMs for Dutch, some of which can perform generation. Word2vec embeddings are available from Tulkens et al. (2016).
Nevertheless, there is still demand for very large-scale LMs for Dutch, and for LMs on certain domains and registers.

In terms of text analysis, Frog (van den Bosch et al. 2007) provides lemmas, morphology, PoS tagging, named entities, chunking, and dependency information. Alpino (van Noord 2006) provides deep linguistic dependency parsing. Pattern (De Smedt and Daelemans 2012) and LeTs (Van de Kauter et al. 2013) are multilingual tools for text analysis, including Dutch. SpaCy, Stanza, Weblicht and UDPipe contain Dutch models. Dutch NER is available in OpenNLP.

Text-to-speech and speech recognition (ASR) are commercially available, often in two language variants, and also for research purposes (both variants).

Dutch is present in most commercial online translation services, which provide a limited amount of translation for free. eTranslation from the European Commission provides unlimited translation, including from and to Dutch.

There is currently no joint Flanders-Netherlands overarching programme for the further development of tools and resources for Dutch. The LT community in the Netherlands and Flanders would be very much in favour of setting up a follow-up programme to the STEVIN programme (Spyns and D’Halleweyn 2013), a joint programme to provide the essentials for Dutch language technology (2004-2011).

1 https://kdutch.ivdnt.org
2 https://taalmaterialen.ivdnt.org
3 https://webservices.cls.ru.nl

The Nederlandse AI Coalitie (Dutch AI Coalition) lists the use case Nederlandse AI voor het Nederlands (Dutch AI for Dutch). The Nederlandstalige Spraak Coalitie (Dutch Speech Coalition) stimulates development of speech technology for Dutch. NOTaS (Dutch Organisation for Language and Speech Technology) joins the various players in the field in ensuring that the Dutch LT industry leads the way in technological developments. The Flanders AI programme supports research in NLP, especially on Conversational Agents for Dutch.
Computational Linguistics in the Netherlands (CLIN), a yearly conference, is a meeting point for LT researchers in the Netherlands and Flanders. The CLIN Journal provides an international forum for open access publication of high-quality scholarly articles in all areas of LT, with special attention to Dutch. Belgium NLP Meetup is a meeting group for anyone interested in Natural Language Processing. CLARIN is a European research infrastructure in which the Netherlands and Belgium participate. The Dutch portal pages CLAPOP list resources created in CLARIN NL and CLARIAH NL. The CLARIN portal at INT provides access to CLARIN tools from the Netherlands and Flanders. In both regions, CLARIN is part of CLARIAH, in which it joins forces with DARIAH, an infrastructure for the arts and humanities.

3 Recommendations and Next Steps

Dutch is not in bad shape digitally. Plenty of data sets and tools are available, and the uptake of Dutch in major NLP applications seems assured. Many of the open tools rely on open data sets, often created in the STEVIN programme. Both the Dutch language and NLP technology have changed in the meantime, thus making a new effort at least of the size of the STEVIN programme necessary. It is important to allow tools to learn from recent language use. It is paramount that a new programme is set up in which researchers from the Netherlands and Flanders, and perhaps also beyond, cooperate in the design and construction of corpora that document recent language, be it in written, spoken, or microblog form.

LT is already embedded in our everyday lives, and we may be using it without realising, when checking for spelling errors, using search engines or calling the bank to perform a transaction. It is an important ingredient of applications that cut across various domains. In the health domain, LT contributes for instance to the automatic recognition and classification of medical terms or to the diagnosis of speech and cognitive disorders.
It is more and more integrated in educational settings and applications, for instance for educational content mining, for automatic assessment of free text answers, for feedback to learners and teachers, or for evaluation of pronunciation in a foreign language. In the legal domain, LT proves an indispensable component for tasks ranging from the search, classification and codification of huge legal databases to legal question answering and prediction of court decisions. If Dutch wants to remain a part of this strong LT-driven society, we need new investments in research projects.

References

De Smedt, Tom and Walter Daelemans (2012). “Pattern for Python”. In: Journal of Machine Learning Research 13, pp. 2031–2035.

Odijk, Jan (2012). Het Nederlands in het Digitale Tijdperk – The Dutch Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/dutch.

Oostdijk, Nelleke, Wim Goedertier, Frank Van Eynde, Louis Boves, Jean-Pierre Martens, Michael Moortgat, and R. Harald Baayen (2002). “Experiences from the Spoken Dutch Corpus Project”. In: Proceedings of LREC 2002. European Language Resources Association (ELRA).

Oostdijk, Nelleke, Martin Reynaert, Véronique Hoste, and Ineke Schuurman (2013). “The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch”. In: Essential Speech and Language Technology for Dutch: Results by the STEVIN programme. Ed. by Peter Spyns and Jan Odijk. Berlin, Heidelberg: Springer, pp. 219–247. DOI: 10.1007/978-3-642-30910-6_13.

Paulussen, Hans, Lieve Macken, Willy Vandeweghe, and Piet Desmet (2013). “Dutch Parallel Corpus: A Balanced Parallel Corpus for Dutch-English and Dutch-French”. In: Essential Speech and Language Technology for Dutch: Results by the STEVIN programme. Ed. by Peter Spyns and Jan Odijk. Berlin, Heidelberg: Springer, pp. 185–199. DOI: 10.1007/978-3-642-30910-6_11.

Postma, Marten, Emiel van Miltenburg, Roxane Segers, Anneleen Schoen, and Piek Vossen (2016). “Open Dutch WordNet”. In: Proc.
of the 8th Global Wordnet Conference. Bucharest, Romania.

Spyns, Peter and Elisabeth D’Halleweyn (2013). “The STEVIN Programme: Result of 5 Years Cross-border HLT for Dutch Policy Preparation”. In: Essential Speech and Language Technology for Dutch. Ed. by Peter Spyns and Jan Odijk. Berlin, Heidelberg: Springer, pp. 21–39.

Steurs, Frieda (2021). “Nederlands een grote taal? Een kwestie van meten”. In: Neerlandica Wratislaviensia, pp. 17–29.

Steurs, Frieda, Vincent Vandeghinste, and Walter Daelemans (2022). Deliverable D1.10 Report on the Dutch Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-dutch.pdf.

Tulkens, Stéphan, Chris Emmery, and Walter Daelemans (2016). “Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource”. In: Proceedings of LREC 2016. European Language Resources Association (ELRA).

Van de Kauter, Marjan, Geert Coorman, Els Lefever, Bart Desmet, Lieve Macken, and Véronique Hoste (2013). “LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit”. In: Computational Linguistics in the Netherlands Journal 3, pp. 103–120. https://clinjournal.org/clinj/article/view/28.

van den Bosch, Antal, Gertjan Busser, Sander Canisius, and Walter Daelemans (2007). “An efficient memory-based morphosyntactic tagger and parser for Dutch”. In: Computational linguistics in the Netherlands. Ed. by P. Dirix, I. Schuurman, V. Vandeghinste, and F. van Eynde. LOT, pp. 191–206.

van Noord, Gertjan (2006). “At Last Parsing Is Now Operational”. In: TALN Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Conférences invitées. Leuven, Belgique: ATALA, pp. 20–42. https://aclanthology.org/2006.jeptalnrecital-invite.2.
Chapter 13
Language Report English

Diana Maynard, Joanna Wright, Mark A. Greenwood, and Kalina Bontcheva

Abstract This chapter focuses on the status of the English language, primarily acting as a benchmark for the level of technological support that other European languages could receive (see Maynard et al. 2022; Ananiadou et al. 2012). It is rather unlikely that any other European language will ever reach this level: support for English continues to develop, so it serves as a moving goalpost. Nevertheless, it provides a good criterion for relative assessment. While the inequalities in the amount of technological support available for English compared with other European languages may act as a deterrent for working on the latter, the support for English nevertheless serves as a useful mechanism for applying cross-lingual transfer methods in order to build language models and generate labelled data for lower-resource languages.

1 The English Language

English is a truly international language, due in no small part to the worldwide influence of the British Empire since the 17th century, and later to the influence of the United States.
It has become the primary language of international discourse and is the lingua franca in many professional contexts, as well as in a number of regions with diverse native languages. English is the most spoken language in the world, with an estimated 1.36 billion total speakers. English is also the most widely taught foreign language in the world. There are almost three times as many people who speak English as a second language compared to native speakers, with a total of 360 million first language speakers and around one billion second language speakers.

English is an Indo-European language and shares a number of features with other Germanic languages. It uses the Latin alphabet with a left-to-right writing system, and has the ISO 639-1 code en. It is classed as a pluricentric language, meaning that it has no single standard codified form but rather several interacting ones, typically set by or corresponding to different countries (e.g., US vs. British English).

Diana Maynard · Joanna Wright · Mark A. Greenwood · Kalina Bontcheva
University of Sheffield, United Kingdom, d.maynard@sheffield.ac.uk, j.wright@sheffield.ac.uk, m.greenwood@sheffield.ac.uk, k.bontcheva@sheffield.ac.uk

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_13

English is the most commonly used language online, representing about 60.4% of the top 10 million websites.1 As of 31 March 2020, the internet was estimated to have around 1.186 billion English speaking users (25.9% of all internet users around the world).2 In terms of internet penetration, out of the 1.531 billion English speakers estimated for 2021 according to Internet World Stats, 77.5% of them are internet users. The number of English-speaking users has enjoyed a relatively modest growth rate of 742.9% in the last 20 years, compared with Arabic at 9,348%.
2 Technologies and Resources for English

While there has been an increasing interest in developing data and tools for multilingual language processing in the last 20 years, as witnessed by the topics of long-standing shared tasks such as CoNLL, English nevertheless continues to be overwhelmingly dominant in every aspect of language processing. This is partially a result of the dominance of the use and status of English in the digital sphere and as an international language, but it also reflects a circular problem related to the availability of existing low-level language processing tools and training data, which provide an easy starting point for further development.

Thousands of corpora are freely available for English. The majority of these are covered by a Creative Commons licence, although they may come with restrictions (e.g., attribution or no commercial use). Some are covered by shared task participation agreements, implying that they are freely available at least to task participants. A number of corpora are released under licences controlled by ELRA and thus only available to ELRA members. The LDC grows by around 30 to 35 new corpora each year, and while these do not all include English, it does mean that new resources with contemporary language use appear with reasonable regularity.

Hundreds of monolingual lexical/conceptual resources are available, most of which are domain-specific, including a few ontologies. It is likely that a huge number of additional freely available resources exist beyond those listed in the main language resource catalogues such as ELRA and LDC. The same is true for bilingual resources that include English. Additionally, a number of multimodal resources exist (where text is one of the forms), mostly concerned with pronunciation.

English is generally very well served by spelling and grammar-checking tools. Most operating systems have built-in spell-checking tools, for example, aspell and hunspell on Linux.
Most programming languages have at least one spell-checking library. Similarly, there are many summarization systems available as open source or commercially, including Hugging Face Transformers. Text-to-speech (TTS) systems are also well supported with a number of open source and commercial models.

There are several major infrastructures or toolkits for language processing available, including GATE, Stanford CoreNLP, Stanford Stanza, NLTK, spaCy, Hugging Face Transformers, and OpenNLP, which all contain a variety of processing tools which can be used individually or as a collection. All of these support at least tokenisation, sentence splitting, PoS tagging, and named entity extraction. Some support many more tools such as sentiment analysis, or have specific support for domains such as medicine. Overall, there are thousands of models available, especially for text summarization, translation, TTS and various kinds of classification.

1 https://www.visualcapitalist.com/the-most-used-languages-on-the-internet/
2 https://www.internetworldstats.com/stats7.htm

For low-level processing tasks, such as tokenisation, sentence splitting and PoS tagging, there are a few standalone tools and services contained in the ELG platform, but many more are provided as part of standard APIs. In general, tools for tokenisation and sentence splitting for European languages are more or less language-independent. PoS tagging is also a reasonably well-solved problem for English.

In terms of Information Extraction, there are dozens of NER systems for English, of which roughly half are domain-specific, with domains/genres including biomedical, Twitter, dendrochronology, environment, chemistry and politics. This is also an area which has seen many ML models released. Tools which fall broadly into the Information Retrieval (IR) category cover a wide range of tasks, including question answering. A number of these are cross-lingual.
Many systems enable search in a specified language but can return results in other languages, including English. There are a number of commercial IR engines available, both for generic and specialised tasks. Concerning Machine Translation, there are hundreds of tools, of which a large number contain English as either input or output. The most common pairing (regardless of direction) is English/German.

In terms of LT providers, we have identified 53 major industrial organisations in the UK, including players such as BBC News Labs, the JISC, and Oxford University Press, and 246 research groups or organisations based at 94 different universities. These research groups are split between various faculties and departments, comprising mostly Computer Science and Language departments, but also others such as Medicine, Architecture, Life Sciences and Education, Creative Industries, and Maths. In Ireland there are also extensive LT industry bodies and research centres (e.g., Apple, Accenture, Google, SoapBox Labs, AYLIEN, and CeADAR), whose primary focus is on supporting the English-speaking rather than Irish-speaking population.

3 Recommendations and Next Steps

English is extremely well supported by LT, which is unsurprising given its status in the digital world. Almost every tool and infrastructure or toolkit is first developed to handle English before being applied to other languages. Similarly, an enormous amount of data is available for English. These two factors have a circular effect: due to the amount of data available, training and testing new tools is much easier for English than other languages, and this leads to new models, tools, and resources being developed. The frequency with which English is used for online communication also provides a wealth of data from which to create new corpora, and the availability of a wide range of tools also makes it easier to annotate these with linguistic information. As tools improve, the accuracy and usefulness of pre-annotated corpora also improve, thereby making further tool development easier.
On the one hand, this is an excellent situation for those working on English data, and given the widespread use of English in the digital world, the usefulness of new tools is clear. On the other hand, this can be a double-edged sword for the development of LTs and LRs for other languages. The availability of data, tools and resources for English has fed the enormous success of neural models for developing LT applications, but the lack of data for other languages means that such deep learning models trained on English are not directly applicable. Recently, however, advances have been made in the development of cross-lingual transfer learning in order to build NLP models for a low-resource target language by leveraging labelled data from languages such as English with a high level of resources, or via a staged process whereby training data from English feeds the development of languages with moderate resources, which may have greater similarity to low-resource languages and can feed a further transfer process. Additionally, multilingual transfer settings enable training data in multiple source languages to be leveraged to further boost performance of low-resource languages. On the negative side, almost all languages are inevitably playing “catch-up” compared with English, and as can be seen from our survey, the differences in LTs and LRs available for European languages are striking. It is hard even to grasp a sense of how much is available for English, since resources are so disparate, and the figures reported in the collections of ELG, ELRA and other repositories are only the tip of the iceberg.

References

Ananiadou, Sophia, John McNaught, and Paul Thompson (2012). The English Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/english.
Maynard, Diana, Joanna Wright, Mark A. Greenwood, and Kalina Bontcheva (2022).
Deliverable D1.11 Report on the English Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-english.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 14
Language Report Estonian
Kadri Muischnek

Abstract This chapter gives a brief overview of Estonian LT tools and resources (Muischnek 2022; Liin et al. 2012). The Estonian language has only around one million speakers and so the market for Estonian LT products is also a small one. In general, the current situation of Estonian LT is acceptable for a small language, but far from perfect. The main force driving the development of Estonian LT has been the public sector and so the resources and tools developed by publicly funded projects are mainly open source. Nonetheless, during the last decade, the private sector has also engaged in creating tools and solutions for Estonian.

1 The Estonian Language

Unlike most languages spoken in Europe, Estonian is not an Indo-European language, but belongs to the Balto-Finnic group of the Finno-Ugric languages. Typologically, Estonian represents a transitional form from an agglutinating to a fusional language.
The characteristic features of Estonian include the accent on the first syllable, a high frequency of vowels as opposed to consonants, three different lengths of vowels and consonants, the lack of grammatical gender and articles, and a basic vocabulary different from that of the Indo-European languages.

Estonian has a rich morphological system: nominals inflect for case and number, and verbs for person, number, tense, mood and voice. Compounding is relatively free and productive in Estonian and derivation is another productive device for forming new lexical items. The word order of Estonian is rather free and mostly governed by information structure. The most important rule is V2: the verb occupies the second position in the clause (Erelt 2003).

Estonian is the official language of the Republic of Estonia and it is used in all spheres of life although there are some concerns regarding the use of Estonian in science and higher education. It is written using a supplemented Latin alphabet; in addition to ASCII characters, it also includes the letters Ä, Ö, Ü, Õ, Š and Ž.

Kadri Muischnek
University of Tartu, Estonia, kadri.muischnek@ut.ee

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_14

The Estonian population has good access to the internet and digital services: 92% of households have an internet connection and many services are available online.1

2 Technologies and Resources for Estonian

Large monolingual Estonian corpora are collected regularly; the most recent one, Estonian National Corpus 2021, contains ca. 2.4 billion tokens.2 Estonian is included in the multilingual resources of the EU languages and we have at least a minimum necessary amount of audio resources.

Annotated data is expensive to create and, thus, scarce. Notable examples are the Estonian UD treebanks3 and the Embeddia dataset for hate speech detection.4

Lexical-conceptual resources are mostly lexicons, machine-readable dictionaries and terminological databases.
An important resource is the Estonian Wordnet.5

There is only one full-coverage computational grammar for Estonian: Constraint Grammar (CG).6 Several large language models trained exclusively on Estonian data7 8 9 10 and also multilingual ones11 12 have been created.

1 https://andmed.stat.ee/en/stat/majandus
2 https://doi.org/10.15155/3-00-0000-0000-0000-08D17L
3 https://universaldependencies.org
4 http://embeddia.eu/outputs/
5 https://www.cl.ut.ee/ressursid/teksaurus/
6 https://github.com/EstSyntax/EstCG
7 https://huggingface.co/tartuNLP/EstBERT
8 https://huggingface.co/EMBEDDIA/est-roberta
9 https://www.clarin.si/repository/xmlui/handle/11356/1277
10 https://huggingface.co/tartuNLP/gpt-4-est-large
11 https://huggingface.co/xlm-roberta-base
12 https://huggingface.co/EMBEDDIA/finest-bert
13 https://github.com/Filosoft/vabamorf
14 https://github.com/EstSyntax/EstCG
15 https://stanfordnlp.github.io/stanza
16 https://github.com/EstSyntax/EstSpaCy
17 https://lindat.mff.cuni.cz/services/udpipe/
18 https://github.com/estnltk/estnltk

Text analysis tools cover sentence segmentation, tokenisation, morphological analysis and dependency parsing for the standard written language. As soon as the text deviates from the standard, the quality of the analysis decreases. The most basic tool for analyzing morphologically complex Estonian is a morphological analyzer.13 For parsing Estonian one can use CG14 or several dependency parsing models15 16 17 trained on the Estonian UD treebanks. The EstNLTK Python library18 contains a pipeline starting from tokenisation and ending with syntactic analysis and information extraction (NER etc). The TEXTA Toolkit19 provides resources for text analytics and enables document classification, terminology extraction and topic detection.
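The dependency parsing models mentioned above are trained on Universal Dependencies treebanks, which are distributed in the tab-separated CoNLL-U format. As a rough illustration of what such annotated data looks like, the sketch below reads one hand-written Estonian sentence using only the Python standard library. The sample sentence, its annotation, and the names `Token` and `parse_conllu` are illustrative assumptions; they are not taken from the Estonian UD treebanks or from EstNLTK, and multiword-token and empty-node lines are simply skipped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # 0 marks the root of the sentence
    deprel: str


# Hand-written sample ("I am reading a book."), annotated for illustration only.
# CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
ROWS = [
    ["1", "Ma", "mina", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "loen", "lugema", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "raamatut", "raamat", "NOUN", "_", "_", "2", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Ma loen raamatut.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token objects."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():           # a blank line closes the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):       # comment or metadata line
            continue
        cols = line.split("\t")
        # Keep only plain word lines; skip ranges such as "1-2" and nodes such as "1.1".
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    if current:                        # input may lack a final blank line
        sentences.append(current)
    return sentences


sentence = parse_conllu(SAMPLE)[0]
root = next(tok for tok in sentence if tok.head == 0)
print(root.form, root.lemma, root.upos)
for tok in sentence:
    print(tok.id, tok.form, tok.deprel, "->", tok.head)
```

A reader of this kind is enough to inspect treebank files or parser output; for real work, the libraries cited in the text provide their own loaders and should be preferred.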
TalTech’s speech recognition system20 provides speech recognition and other services, e.g., automated subtitling.21 There are also several models for speech synthesis,22 including a neural network-based one.23

Estonian is featured in Google Translate, Microsoft Translator and the EU’s translation tool eTranslation. However, independent MT services are important for the government sector, so the central translation platform project was initiated.

In terms of Information Extraction, there are several NER models, as part of EstNLTK24 or on top of BERT,25 and also resources for time expression extraction.26

Existing virtual assistant solutions (Alexa, Siri, etc.) provide little value for Estonian as they do not understand the language. On the other hand, simple “Estonian-speaking” chatbots are widely used on the websites of companies and institutions to provide help for common problems.

The need for LT support has been acknowledged by Estonian government agencies and policy-makers. Since 2006 there has been a series of National Programmes for Language Technology, with the current one in force until the year 2027.27

A new national AI strategy28 (2022–23) has been published recently. The Estonian Language Development Plan29 states the development of LT as a priority. The national research infrastructures relating to LT in Estonia are the Center of Estonian Language Resources30 and the Competence Center for Natural Language Processing.31 Estonia is a member of CLARIN, ELRC, and ELG.
3 Recommendations and Next Steps

In terms of gaps, Estonian lacks both annotated data and tools for certain tasks and, as annotating data is a time- and workforce-consuming process, this can be seen as an obstacle. Furthermore, Estonian lacks parallel data that pairs Estonian with languages other than English and results from direct translation between those language pairs. Bigger and/or special multimodal corpora are needed, e.g., containing children’s or seniors’ speech, accented speech etc. We also need more audio data for natural and noisy communication situations: spontaneous conversations, spontaneous meetings etc. Estonian lacks annotated resources containing non-normative language varieties, such as the written language variants used on social media or specialised languages used by professionals (healthcare, legal sphere etc.). Computational semantics for Estonian is under-resourced; we lack resources and tools for semantic role labeling, coreference resolution, relation extraction and event extraction, and also for polarity detection. There is a need for text simplification, summarisation and paraphrasing tools and resources. In the field of discourse modelling and pragmatics, good and useful theoretical, application-oriented research has been carried out, but that has yet to be put into practice. There is also a growing popularity of Digital Humanities and a need to process older written variants of Estonian.

19 https://github.com/texta-tk/texta
20 https://tekstiks.ee
21 https://github.com/alumae/kiirkirjutaja
22 http://www.eki.ee/heli/index.php
23 https://neurokone.ee
24 https://github.com/estnltk/estnltk
25 https://github.com/TartuNLP/bert-ner-service
26 https://github.com/soras/Ajavt
27 https://www.hm.ee/sites/default/files/documents/2022-10/estonian_language_technology_2018-2027.pdf
28 https://e-estonia.com/wp-content/uploads/factsheet-ai-strategy-feb2023.pdf
29 https://www.hm.ee/en/ministry/ministry/strategic-planning-2021-2035
30 https://www.keeleressursid.ee/en/
31 https://portaal.eki.ee
Despite various national and international programmes, initiatives and strategies, there is still a lack of continuity in funding, as research funding in Estonia is entirely project-based, which is not sufficient to address the gaps in research and technology support for the Estonian language.

References

Erelt, Mati (2003). Estonian Language. Linguistica Uralica Supplementary Series. Tallinn: Estonian Academy Publishers.
Liin, Krista, Kadri Muischnek, Kaili Müürisep, and Kadri Vider (2012). Eesti keel digiajastul – The Estonian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/estonian.
Muischnek, Kadri (2022). Deliverable D1.12 Report on the Estonian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-estonian.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 15
Language Report Finnish
Krister Lindén and Wilhelmina Dyster

Abstract During the last ten years, digitalisation has changed the way we interact with the world, creating an increasing demand for language-based AI services. In the field of language technology, the Finnish language is still only moderately equipped with products, technologies and resources. The situation has improved in recent years, but support for automated translation still leaves ample room for improvement, and the general support for spoken language is modest in industry applications, although some recent research results are encouraging. We take stock of the existing resources for Finnish and try to identify some remaining gaps.

1 The Finnish Language

Finnish is the native language of about 4.9 million people in Finland and the second language of 0.5 million Finns (see Koskenniemi et al. 2012; Lindén and Dyster 2022). Finnish is spoken in several European countries as well as the United States and Australia. Finnish is an official language in the European Union. The Finnish constitutional law and language law define Finnish and Swedish as the national languages of Finland. Moreover, Finnish is an official minority language in Sweden.

The Finnish literary language has a relatively short history. It has been used in religious literature and the church since the 16th century. Laws have been written in Finnish since the 18th century. Swedish was used in administration, education and literature until the 19th century, when the foundation of contemporary Finnish was laid and Finnish became a sovereign language in all societal activity.

Dialects are divided into two categories: the Western and the Eastern dialects. The difference is mostly in the pronunciation and word forms (meijän, männä in the East, meirän, mennä in the West) and partly in the vocabulary (vasta in the East, vihta in the West).
The differences are clear, and speakers from different areas can be identified by their intonation. However, the differences are minor enough to allow speakers of different dialects to understand each other.

Krister Lindén · Wilhelmina Dyster
University of Helsinki, Finland, krister.linden@helsinki.fi, wilhelmina.dyster@helsinki.fi

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_15

Finnish is used widely and actively on the internet and social media. Almost all Finnish households (96%) have access to the internet. Traficom, the Finnish Transport and Communications Agency, reported in November 2020 that the total number of registered .fi domains had reached 500,000.

2 Technologies and Resources for Finnish

The development of Finnish language data and tools has progressed steadily over the past 30 years. Since 1995, the Language Bank of Finland1 and since 2015 CLARIN and FIN-CLARIN have offered a wide variety of text and speech corpora and tools. Today, a large number of fundamental tools and datasets are available for Finnish. Below we present some relevant resources in the different domains of LT.2

There are several large monolingual corpora with contemporary textual and spoken language as well as some multilingual corpora. Overall, general domain data seems to be prevalent, e.g., data collected from discussion forums or using web crawls. In addition, news texts, legislative texts and parliamentary speech are well-represented domains. The Language Bank of Finland has the expertise to handle sensitive data, but health domain corpora, for example, are still scarce.

The Institute for the Languages of Finland has comprehensive collections of lexical corpora. The Helsinki Term Bank for the Arts and Sciences (HTB)3 is a multidisciplinary project aiming to gather a permanent terminological database for all fields of research in Finland. The Comprehensive Grammar of Finnish4 was published in 2004 by the Finnish Literature Society.
FinBERT5 is a version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group. FinBERT is pre-trained with 1 million steps on over 3 billion tokens of Finnish text drawn from news, online discussion, and web crawls.

Important software packages are: 1. The Helsinki Finite-State Transducer (HFST)6 can be used to implement morphological analysers. 2. The Turku Neural Parser Pipeline developed by TurkuNLP7 is an open source dependency parsing pipeline. 3. The Aalto University Automatic Speech Recognition System (Aalto-ASR)8 provides functionalities for ASR from audio files and for automatic forced alignment of text and speech. 4. OPUS-MT9 focuses on the development of free resources and tools for machine translation, with currently over 1,000 pre-trained neural MT models. 5. Finto AI10 is a service for automated subject indexing, which can be used to suggest subjects for texts in Finnish, Swedish and English. 6. Wavelet-based embedding models for speech synthesis for Finnish have been developed at the University of Helsinki.

1 https://kielipankki.fi
2 META-SHARE Finland contains additional resources, see https://metashare.csc.fi.
3 https://tieteentermipankki.fi
4 https://kaino.kotus.fi/visk/
5 https://github.com/TurkuNLP/FinBERT
6 https://hfst.github.io
7 http://turkunlp.org/Turku-neural-parser-pipeline/
8 http://urn.fi/urn:nbn:fi:lb-2021082323
9 https://github.com/Helsinki-NLP/Opus-MT

The Language Bank of Finland supports academic research and provides some support for the industrial use of academic resources which are also available for commercial use. CSC (IT Center for Science) is tasked with providing one of the three EuroHPC supercomputers, LUMI. The whole system is designed with AI, machine learning and data analytics in mind. LUMI’s first pilot phase was concluded by the end of 2021, and LUMI will reach its full capacity in 2022.11

Generally, the Finnish market is extremely active in the AI field.
According to the State of AI in Finland report by FAIA (2020), “there are over 1,250 companies that use different AI applications, of which roughly 750 have developed their own technology.” A rapidly growing startup ecosystem boosts AI/LT development.

3 Recommendations and Next Steps

In November 2019, VAKE (currently the Climate Fund) published a report (Jauhiainen et al. 2019), specifying the next phase of the language-centric AI development programme and identifying topics in need of intervention. In November 2020, Finland launched an updated national AI strategy. The AI 4.0 Programme promotes the use of AI and other digital technologies in companies, with a special focus on SMEs. In the first interim report,12 published in April 2021, the programme presented a vision for the future of the Finnish manufacturing industry, stating that by 2030 the Finnish manufacturing industry will be clean, efficient and digital. As stated in the report, seamless collaboration between high-speed telecommunications networks, cloud computing and AI is central to the digital transformation.

According to the VAKE report, we need the availability and accessibility of components for processing speech with open licences to create prototypes or develop methods into full-scale production versions in the hands of companies. To this end, collaboration between stakeholders is needed: an ecosystem with a forum or a platform where different-level actors can come together to exchange experiences and seek new projects and collaboration opportunities.

Currently, 1. there are some multi-modal resources, but still no advanced discourse processing tools for Finnish; 2. several research projects are working on advanced information retrieval (IR) and data mining for Finnish; 3.
the legal situation has become clearer with the General Data Protection Regulation (GDPR), but we are still waiting for Finland to fully implement the Digital Single Market Directive (DSM); 4. we have some specific corpora of high quality, but the commercial sector in Finland still needs large, up-to-date resources for product development targeted at everyday users and technologies to collect specialised data sets; 5. work on semantics has still not led to significant applications, but this is explored in the context of advanced research projects on IR and extraction; 6. in speech technology, the biggest recent leaps forward have been made using neural network technology. This has also led to some improvements for the commercial sector offering speech-based services, and speech and video corpora are no longer considered hard to collect with the advent of mobile phones and teleconferencing.

10 https://ai.finto.fi
11 https://www.lumi-supercomputer.eu
12 http://urn.fi/URN:ISBN:978-952-327-643-7

Speech corpora and especially resources for spontaneous speech recognition and various genres of speech synthesis are currently being developed. The need for extensive and varied text materials can to some extent be met for research purposes through corpus collections of publicly produced language material when properly considering GDPR and the DSM directive. This will enable the creation of language models. However, we still need a variety of specialised datasets for domain-specific purposes to adapt open-source or proprietary software components. Developing dedicated components from scratch requires giga-scale datasets which may be difficult to compile for small language communities and in specialised domains. This points to a need for a general-purpose language-centric AI which can leverage cross-language and cross-domain resources and benefit from adaptation to local language varieties and specialised domains with small or medium-sized datasets.

References

Jauhiainen, Tommi, Mietta Lennes, and Terhi Marttila (2019). Suomenkielisen tekoälyn kehittämisohjelma – esiselvitys. http://hdl.handle.net/10138/319478.
Koskenniemi, Kimmo, Krister Lindén, Lauri Carlson, Martti Vainio, Antti Arppe, Mietta Lennes, Hanna Westerlund, Mirka Hyvärinen, Imre Bartis, Pirkko Nuolijärvi, and Aino Piehl (2012). Suomen kieli digitaalisella aikakaudella – The Finnish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/finnish.
Lindén, Krister and Wilhelmina Dyster (2022). Deliverable D1.13 Report on the Finnish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-finnish.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 16
Language Report French
Gilles Adda, Ioana Vasilescu, and François Yvon

Abstract This chapter presents a survey of the current state of technologies for the automatic processing of the French language. It is based on a thorough analysis of existing tools and resources for French, and also provides an accurate presentation of the domain and its main stakeholders (Adda et al. 2022). The chapter documents the presence of French on the internet and describes in broad terms the existing technologies for the French language.
It also spells out general conclusions and formulates recommendations for progress towards deep language understanding for French.

1 The French Language

French is typologically a Romance language, closely related to other languages whose origin is Latin (e.g., Italian, Spanish, Portuguese, Romanian). French inherited Gaulish features from the Celtic dialects spoken by ethnic groups that previously populated the territory conquered by the Romans, and was later influenced by Germanic dialects as a consequence of the invasions that marked the fall of the Roman Empire. Modern French uses the Latin alphabet and has retained many Latin linguistic features. For instance, French is a nominative-accusative and article-based language (SVO) that has greatly simplified the nominal and verbal declensions. French developed a large vocalic system including 12 oral and 4 nasal vowels.

With 128 million “native and real speakers” worldwide and an estimate of close to 300 million speakers overall (Collectif 2019), French appears only as the 16th most spoken native language, but as the 6th most spoken language in the world, after English, Chinese Mandarin, Spanish, Hindi and Russian. French is an official language in close to 30 countries, most notably in Europe (France: 65m speakers, Belgium: 7m speakers, Switzerland: 3m speakers, and Luxembourg), Africa, Canada and Haiti. In Europe, it is estimated that 129 million people speak French, making it the 3rd most spoken second language, after English and German. French-speaking countries together constitute La Francophonie, with the Organisation Internationale de la Francophonie coordinating policies between 88 associated states and entities.

Gilles Adda · Ioana Vasilescu · François Yvon
Université Paris-Saclay, CNRS, LISN, France, gilles.adda@limsi.fr, ioana.vasilescu@limsi.fr, francois.yvon@limsi.fr

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_16
Collectif (2019) notes that in 2018 French occupied fourth place on the internet behind English, Chinese and Spanish, with a comfortable lead over the next set of languages. Pimienta (2022) observes that although French remains in fourth place on the internet in 2022, the gap to the following languages has considerably narrowed. The presence of French on the internet derives from its role as an international language: French is an official language of the EU and one of the three working languages of the European Commission. French is also a working language at the Organisation for Economic Co-operation and Development, and at the United Nations. French is also one of the three official languages of the European Patent Office and one of the four working languages of the African Union.

2 Technologies and Resources for French

Looking at the technology landscape for French, most state-of-the-art tools and applications rely almost exclusively on generic machine learning technologies, a major change with respect to the previous survey (Mariani et al. 2012): the most important ingredients for system building are data and, to a lesser extent, compute resources. We will, therefore, focus on the most critical language resources and give a general overview of the various technologies derived from them.

Large-scale, general purpose lexica for French associating lemmas or word forms to morpho-syntactic information are widely available. There is no official French National Corpus that would contain a representative subset of the language, balanced across periods, genres and domains, as may exist for other languages. However, sizable corpora (up to billions of tokens) of mixed genres are accessible and searchable.

The CommonCrawl project aggregates Web data that is orders of magnitude larger than these resources, and it is updated on a regular basis. Using French subsets of CommonCrawl, it has been possible to train large language models (LMs): FlauBERT uses a corpus of 12B running words, while CamemBERT uses 22B words from OSCAR.
Other large LMs for French are available for research and commercial use; they help to boost the state-of-the-art for multiple NLP tasks.

Large-scale annotated (segmented in sentences, speakers and turns, transcribed) speech databases, containing thousands of hours of recordings, are available for several genres. Such resources have enabled advanced technologies for French (transcription, synthesis, NLU). However, the collection of large sets of recordings remains a pressing issue to widen the applicability of these technologies, an objective addressed by Mozilla’s Common Voice1 or the VoiceLab project.2

1 https://commonvoice.mozilla.org/fr
2 http://www.levoicelab.org

Basic NLP tools were already well covered in 2012 and they have benefited from improvements in machine learning. Open source industrial strength tokenizers, lemmatizers and POS taggers for French are available. We note, however, that no recent systematic performance comparisons exist for these tasks; most of these tools process “generic” French and too little exists for specific sublanguages.

Now that Machine Translation has become fully neural, the availability of MT systems for French mostly depends on the availability of parallel corpora. Good resources exist for French, especially when matched with an English translation.

As for most social science and humanities domains, the digital revolution has created new avenues for language analysis. Such methodological changes are also happening for French and impact all linguistic domains, with the creation of corpora, tools and methods. Regarding corpora, both written and spoken varieties of French are well covered, although for historical reasons written sources are more common.
Owing to its role as an international language and the comparatively large size and advanced development of French-speaking markets, French is relatively well covered by international LT services: French-English was one of the earliest translation pairs on the Web, and French versions of Siri, Amazon Echo and Google Home have been available for years. The development of LTs for French far exceeds the activity observed in France or other French-speaking countries.

Institutional support for LTs is mostly provided by the ANR (the French National Research Agency), albeit with a lack of continuous funding; large variability in funding over the years is not favourable to planning. The French research community is nonetheless active, with a dozen significant academic clusters all over France, as well as in Belgium, Canada and Switzerland, covering the full spectrum of NLP. This research has greatly benefited from the development of the Jean Zay platform, an open high-performance computing infrastructure tailored to AI applications.

3 Recommendations and Next Steps

Many open-domain French corpora are the result of uncoordinated initiatives and consequently only partially cover the needs of domain-specific applications. This state of affairs results in 1. a lack of visibility of tools and data that are only known to restricted sub-communities, and 2. a waste of resources, as existing datasets are underused, or even duplicated, when other pressing needs remain unsatisfied. A first recommendation is thus to institutionalise clearer policies for the archiving of LRs for French when they are produced by public research projects.

A second recommendation, aimed at increasing the diversity and size of existing corpora, is to open the large datasets produced by public administration and institutions (e.g., in health, culture, media, justice or education), which are hard to access. Policies are needed to amplify the actions of the European CEF/ELRC programme to incentivize the development of open repositories with clear access rules.
Applications that involve social network data (e.g., opinion mining, fake news and hate speech detection) require specific actions, as they are often associated with delicate legal issues (related to proprietary rights or personal information) that limit their dissemination and exploitation. To reduce the dependency on current data policies of content holders, a third recommendation would be to secure access to sensitive data for research purposes and to facilitate the dissemination of publicly produced databases and models (e.g., using privacy-preserving techniques).

Recommendation four is the definition of a strategic roadmap for identifying, building, curating, annotating and securing resources for language varieties or domains that are critical for research, industry or the administration in each French-speaking country, based on a precise analysis of the gaps in the existing datasets (some were alluded to above). This roadmap should also identify cases where resources can be transferred from English through MT.

Recommendation five aims to ensure, through recurrent funding, that evaluation campaigns specifically targeting French for a large number of applications are organized on a regular basis and widely advertised, so that systems are evaluated under real world conditions, so as to document their biases, defects and harmful impacts.

The final recommendation is to increase the support for research on themes such as fair and explainable deep learning for large language models, deep language analysis algorithms and technologies, multimodal resources for the study of language acquisition through interactions and grounding, and the study of pathological language processing. This multidisciplinary research should involve all relevant communities.

References

Adda, Gilles, Annelies Braffort, Ioana Vasilescu, and François Yvon (2022). Deliverable D1.14 Report on the French Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-french.pdf.
Collectif (2019).
La langue française dans le monde. OIF/Gallimard.
Mariani, Joseph, Patrick Paroubek, Gil Francopoulo, Aurélien Max, François Yvon, and Pierre Zweigenbaum (2012). La langue française à l’Ère du numérique – The French Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/french.
Pimienta, Daniel (2022). “La place du français sur Internet”. In: La langue française dans le monde 2022. OIF/Gallimard, pp. 26–27. https://www.francophonie.org/sites/default/files/2022-03/Synthese_La_langue_francaise_dans_le_monde_2022.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 17
Language Report Galician

José Manuel Ramírez Sánchez, Laura Docío Fernández, and Carmen García-Mateo

Abstract This chapter reports on the current state of Language Technology (LT) for Galician.
The main conclusion is that there are a limited number of resources, products, and technologies for the Galician language, with text-based technologies and services being more mature than those based on speech processing. We start with general facts about Galician, followed by a high-level qualitative description of the LT situation for Galician, and conclude with recommendations for bridging the gap between LT for Galician and LT for Spanish and the other co-official languages of Spain.

1 The Galician Language

Galician is part of the Romance family of languages, closely related to Portuguese, and it is one of the co-official languages of Spain. The linguistic rights of Galician speakers are guaranteed and regulated under the Linguistic Normalisation Act, especially those related to administration, education, and media. Galician has about 1,926,000 speakers. There are still large Galician-speaking communities outside Spain (mainly in Europe and America). Their total size is unknown due to the variety and complexity of these communities.

The online presence of Galician is limited, with less than 0.1% of websites using it.¹ Nevertheless, some initiatives try to increase the presence of Galician on the web (PuntoGal² and Galipedia³ are good examples). The official survey Enquisa estrutural a fogares. Coñecemento e uso do galego shows a generally low internet penetration and use by European standards, but between the ages of 15 and 44, the numbers are very similar to other European regions.⁴

José Manuel Ramírez Sánchez · Laura Docío Fernández · Carmen García-Mateo
Univ. of Vigo, Spain, jmramirez@gts.uvigo.es, ldocio@gts.uvigo.es, carmen.garcia@uvigo.es

¹ https://w3techs.com/technologies/details/cl-gl-
² https://dominio.gal
³ https://meta.wikimedia.org/wiki/List_of_Wikipedias
⁴ http://www.ige.gal/estatico/html/gl/OperacionsEstruturais/PDF/Resumo_resultados_EEF_Galego_2018.pdf

© The Author(s) 2023 G. Rehm, A.
Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_17

A substantial amount of digital content in the Galician language is generated by public institutions of the Autonomous Community of Galicia. In the last few years, the number of products and services developed with the aim of incorporating Galician into the digital society has increased considerably. The web portal of the Real Academia Galega and the Xunta de Galicia translator are noteworthy examples. Although some large enterprises (Microsoft, Apple, Google, Meta) offer a few products with support for Galician, many others do not (TikTok, Twitch, Adobe). Moreover, there is a total lack of support for Galician in the virtual assistants market, where none of the popular solutions allows interaction in Galician.

2 Technologies and Resources for Galician

The 2012 META-NET White Paper on Galician (García-Mateo and Arza 2012) was moderately optimistic about the state of LT support for the language. Ten years later, the LT status for Galician has changed somewhat (Sánchez and García-Mateo 2022). In our analysis, we noticed an increase in the resources and corpora created between 2018 and 2021 (67.7% of those indexed). However, tools and services developed in the same period have not increased to the same degree (37.3% of those indexed). There is a significant imbalance in the distribution of resources and corpora by technologies. Data in text format are the most common (more than 90%), whereas corpora for other technologies are very few (5% are multi-modal, and almost 2% are audio only).

Most of the resources come from three origins: non-Galician universities and research centres, Galician public institutions, and non-Galician private companies or public institutions. It is important to note that most of the resources, services, and tools created by non-Galician entities tend to belong to multilingual projects or products that include Galician as one of several languages. In contrast, most of the resources, services, and tools created by Galician entities tend to focus on Galician, offering high-quality results.
Regarding the accessibility and use of resources for Galician, most of them have been developed by open source projects, research centres, or universities under GNU/GPL licences. Around 20% of the indexed items are not available for commercial purposes, and more than 10% of resources are under a proprietary licence.

The situation of Galician in terms of data and resources is optimistic for most of the technologies that process and use text. However, regarding multimedia data, there is an enormous gap. In that sense, speech processing technologies seem less mature than technologies based on text processing. For Galician, key results regarding technologies and resources include:

• There are large reference text databases in modern and historical Galician with a balanced mix of various domains (economics, technology, or the legal field) (Piñeiro 2019; García-Mateo et al. 2014).
• There are some databases annotated with syntactic, semantic, or discursive information. However, the number and size of these resources decrease as more complex linguistic and semantic information is needed.
• Parallel databases with millions of tokens exist between Galician and other languages such as Spanish, Portuguese, and English (OPUS⁵ is a good example). These databases have been used to develop machine translation systems in production and education environments for Portuguese or Spanish.
• A relevant model to highlight is Bertinho (Vilares Calvo et al. 2021), a monolingual BERT model for Galician. Bertinho implements state-of-the-art technology, and it is possible to use it in many NLP tasks. However, its developers state that Bertinho does not reach the size or performance of other monolingual versions, such as BETO for Spanish.
• Available multimedia resources are relatively limited, with little domain variability, and are usually recordings of readings. The acoustic quality is excellent though.
• Another gap is related to human-computer interaction, where the necessary tools and resources to put together chatbots, virtual assistants, and similar systems are poor or outdated.
Spain has national plans for both Artificial Intelligence (AI, Gobierno de España 2020a) and LT (specifically for NLP, Gobierno de España 2020b). These plans focus more on the potential, opportunities, and needs of Spanish LT, putting less emphasis on co-official languages such as Galician. Two national associations bring together the community of researchers on issues related to LT: Sociedad Española de Procesamiento del Lenguaje Natural, with a focus on NLP, and the Red Temática en Tecnologías del Habla, with its focus on speech processing.

The Autonomous Community of Galicia has its own strategy for AI.⁶ This document describes the current environment of AI in Galicia and provides a roadmap for public investments and developments until 2030. There is also an initiative called Proxecto Nós, a regional LT plan for Galician focused on digital challenges and promoted by the Galician regional government. Furthermore, there are many more projects related to LT in the Galician university environment, both from a linguistic and a technological point of view. Another interesting fact is that, of the companies in the Galician ICT industrial environment that use AI, only 21% are focused on cognitive assistants and just 12% on NLP. The Galician LT industry is very small, but a very active environment of spin-offs and public programmes exists, dedicated to transferring knowledge from universities to the market.

3 Recommendations and Next Steps

The main goal of LT for Galician is to reach the level of other co-official languages of Spain, such as Catalan or Basque.
In this sense, increasing the use of LT in Galician public services and institutions could be a necessary line of action to support and stimulate research and development of new resources and better tools. Galician institutions are the producers of high-quality resources and tools for Galician. However, there is a lack of standardisation and dissemination of these products. An office that centralises and standardises all the LT resources and tools created for Galician could be a significant contribution to unifying all efforts.

⁵ https://opus.nlpl.eu
⁶ https://amtega.xunta.gal/sites/w_amtega/files/20210608_estrategia_ia_gl.pdf

Support for open source solutions (data and software) would be a good long-term strategy for small-market languages. These solutions allow the development and research of new technologies without having to face an initial investment barrier. Furthermore, an open-source policy encourages the creation of strong communities and guarantees some technological sovereignty with respect to the interests of global markets and multinational corporations.

References

García-Mateo, Carmen and Montserrat Arza (2012). O idioma galego na era dixital – The Galician Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/galician.
García-Mateo, Carmen, Antonio Cardenal López, Xosé Luis Regueira, Elisa Fernández Rei, Marta Martinez, Roberto Seara, Rocío Varela, and Noemí Basanta (2014). “CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis”. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 2653–2657.
Gobierno de España (2020a). Estrategia Nacional de Inteligencia Artificial 2020. https://portal.mineco.gob.es/RecursosNoticia/mineco/prensa/noticias/2020/201202_np_ENIAv.pdf.
Gobierno de España (2020b). Estrategia Procesamiento del Lenguaje Natural 2020. https://drive.google.com/file/d/1eXlFdRNTmOx4sm3FQ439Z8zaeNqEFGiK/view.
Piñeiro, Centro Ramón (2019).
Corpus de Referencia do Galego Actual (CORGA) [3.2]. http://corpus.cirp.gal/corga/.
Sánchez, José Manuel Ramírez and Carmen García-Mateo (2022). Deliverable D1.15 Report on the Galician Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-galician.pdf.
Vilares Calvo, David, Marcos Garcia González, and Carlos Gómez Rodríguez (2021). “Bertinho: Galician BERT representations”. In: Procesamiento del lenguaje natural, pp. 13–26.

Chapter 18
Language Report German

Stefanie Hegele, Barbara Heinisch, Antonia Popp, Katrin Marheinecke, Annette Rios, Dagmar Gromann, Martin Volk, and Georg Rehm

Abstract German is the second most widely spoken language in the EU. The last decade has seen strongly perceptible language change, trending towards the simplification of the grammatical system, a rapidly growing number of anglicisms, a decreasing prevalence of dialects, and an increase in socio-political debates on matters such as language policies and gender-neutral language.
Many technologies and resources for German are available, which is also due to numerous well-established research institutions and a thriving Language Technology (LT) and Artificial Intelligence (AI) industry. For German to hold its ground in the digital sphere, it is important that incentives for research, digital education and also concrete opportunities for marketing and deploying LT applications are put at the forefront of future AI strategies.

1 The German Language

With more than 150 million native and non-native speakers (Eberhard et al. 2021), German is the second most widely spoken language in the European Union. Germany, Austria and Switzerland form the DACH region, which is not only home to the three (codified) standard varieties of the German language, but also boasts a wealth of regiolects and dialects. Perceptible language change in German has been omnipresent for decades, leaving the language community to decide what becomes the norm. According to three reports on the state of the German language,¹ published in the years 2013-2021 by the Union of the German Academies of Sciences and Humanities, changes lean heavily towards the simplification of the grammatical system. There has also been a huge expansion in vocabulary. Over the last decades, many Anglicisms have been introduced into the language that either replace existing German words or fill vocabulary gaps. Dialects have been increasingly displaced.

Stefanie Hegele · Katrin Marheinecke · Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, stefanie.hegele@dfki.de, katrin.marheinecke@dfki.de, georg.rehm@dfki.de
Barbara Heinisch · Dagmar Gromann
University of Vienna, Austria, barbara.heinisch@univie.ac.at, dagmar.gromann@univie.ac.at
Antonia Popp · Annette Rios · Martin Volk
University of Zurich, Switzerland, popp@cl.uzh.ch, rios@cl.uzh.ch, volk@cl.uzh.ch

¹ https://www.akademienunion.de/publikationen/sammelbaende

© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_18
German uses grammatical gender. However, nouns that refer to social gender are often biased towards the male form. Proponents of gender-inclusive language argue that German needs a grammar that explicitly includes women and non-binary people, making all people feel equally addressed.

Public debates about language policy positions are becoming more frequent and also more heated. They attract a great deal of media attention in Germany. The New Right tries to use the topic of language in a targeted manner and to instrumentalise it in terms of national identity (Lobin 2021).

There are a number of non-governmental, publicly funded organisations that promote the study of German and encourage international cultural exchange, such as the Goethe Institute, the Society for the German Language, or the Institute for the German Language.²

Regarding language education, the PISA study has continued to confirm the strong correlation between socio-economic background and educational success.³ Fears that the increased use of social media and emojis would worsen young people’s writing skills cannot be confirmed. Instead, the emergence of new written forms should be noted (Beißwenger and Pappert 2020; Storrer 2014).

German is currently the second most studied foreign language in the EU, and it is also gaining importance in Africa and Asia.

German has a widespread online presence and the fourth largest Wikipedia. Internet use continues to rise. According to the European Statistical Office (Eurostat), in both Germany and Austria more than 85% of people are regular internet users and close to 70% have basic or above basic digital skills.

2 Technologies and Resources for German

German has many linguistic characteristics and particularities, such as relatively free word order and fairly long nested sentences (Eroms et al. 2003), that pose challenges for Natural Language Processing tasks.
Nevertheless, German is well supported by Language Technology (LT) applications and resources compared to most other European languages. A number of large-scale resources and state-of-the-art technologies have been produced for Standard German. However, dialect-specific resources currently account for only a small percentage.

There exist a large number of German corpora of different sizes, ranging from a few hundred sentences up to millions. The sources are most often newspaper texts or texts collected from the web and social media. Various terminological resources, lexica, dictionaries or word lists have also been developed for German. Annotations cover a large spectrum of syntactic, semantic, and discourse structure markup. The most frequent corpus domains include health, news, politics and social media. Currently, there are only a few language models publicly available for German.

² https://www.goethe.de, https://gfds.de, https://www.ids-mannheim.de
³ https://www.bmbf.de/bmbf/shareddocs/pressemitteilungen/de/pisa-2018-deutschland-stabil-ueber-oecd-durchschnitt.html

In addition, there are numerous free multilingual resources available online for German, e.g., the LEO dictionary. Widely used MT systems include DeepL and Google Translate, which cover translation from German into dozens of languages. EUROPEANA functions like a multimedia portal and digital library with content from different sources.⁴ By the end of 2015, Germany, Austria and Switzerland had contributed around 16% to the more than 24 million objects.

Hundreds of tools, both open source and commercial, that work either exclusively for German or for multiple languages including German have been developed. The vast majority process text input. Even though speech technology has already been successfully integrated into many everyday applications, from spoken dialogue systems and voice-based interfaces to mobile phones and car navigation systems, audio is only supported by a small fraction of tools, and image and video by even fewer.

Research over the last decade and the deployment and integration of LT components into end-to-end processing pipelines have led to the design of high-quality software, with many tools supporting more than one function. The most frequent tasks supported by the current collection of German tools include text and data analytics, information extraction, named entity recognition, information retrieval and speech recognition. Tools developed by universities and research centres are typically available to all users free of charge.

The research community in Germany, Austria and Switzerland has been growing rapidly over the last decade. Numerous universities offer study programmes focused on Language Technology, NLP, Computational Linguistics and closely related disciplines. Recent breakthroughs in AI have not only led to cutting-edge technology developed by big companies, but have also inspired numerous startups and SMEs in the field. Current funding programmes, even though mostly targeted towards AI, have also helped to improve research in the field in general, and have also supported a number of research projects working on German in particular. While overall AI strategies vary in the German-speaking regions, the situation for LT/NLP research and development in Germany is, all aspects considered, rather good. The German government aims to invest about 3 billion Euros until 2025 to implement its AI strategy, including the creation of new AI centres, new funding programmes, new professorships, new international collaborations (e.g., with France) and a new national roadmap for AI standardisation.

3 Recommendations and Next Steps

The scope of resources and range of tools are still limited when compared to English, and they are not yet good or ample enough to develop the kind of technologies required to support a truly multilingual knowledge society. High quality data sets and large language models represent a major step forward in AI.

⁴ https://www.europeana.eu/de
Our empirical results show that German is still partially lagging behind in this area (Hegele et al. 2022; Burchardt et al. 2012). There are also gaps in the areas of speech and text processing. In addition, existing technologies do not cover the many different varieties of regional languages and dialects that exist in Germany, Austria and Switzerland. Furthermore, many resources are not available due to copyright reasons, confidentiality, (national) security reasons etc.

While German is among the three best supported European languages (next to Spanish and French), the gap towards English is indeed significant. Without a substantial and timely intervention by the European Union, for many European languages this gap will continue to increase, endangering their digital existence.

References

Beißwenger, Michael and Steffen Pappert (2020). “Sprachverfall durch Emojis? Eine pragmalinguistische Perspektive auf den Beitrag von Bildzeichen zur digitalen Kommunikationskultur”. In: Aptum. Zeitschrift für Sprachkritik und Sprachkultur 16, pp. 32–50.
Burchardt, Aljoscha, Markus Egg, Kathrin Eichler, Brigitte Krenn, Jörn Kreutel, Annette Leßmöllmann, Georg Rehm, Manfred Stede, Hans Uszkoreit, and Martin Volk (2012). Die Deutsche Sprache im digitalen Zeitalter – The German Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/german.
Eberhard, David M., Gary F. Simons, and Charles D. Fennig (2021). Ethnologue: Languages of the World. Dallas, Texas. http://www.ethnologue.com.
Eroms, Hans-Werner, Gerhard Stickel, and Gisela Zifonun (2003). Schriften des Instituts für Deutsche Sprache.
Hegele, Stefanie, Barbara Heinisch, Antonia Popp, Katrin Marheinecke, Annette Rios, Dagmar Gromann, Martin Volk, and Georg Rehm (2022). Deliverable D1.16 Report on the German Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-german.pdf.
Lobin, Henning (2021). Sprachkampf – Wie die Neue Rechte die deutsche Sprache instrumentalisiert. Berlin: Dudenverlag.
Storrer, Angelika (2014). “Sprachverfall durch internetbasierte Kommunikation?” In: Sprachverfall? Ed. by Albrecht Plewnia and Andreas Witt. Berlin: De Gruyter, pp. 171–196.

Chapter 19
Language Report Greek

Maria Gavriilidou, Maria Giagkou, Dora Loizidou, and Stelios Piperidis

Abstract Technological support for Greek, one of Europe’s lesser spoken languages, has progressed in the past decade, while LRTs have both increased in volume and improved in quality and coverage. Despite this progress, when compared to the ‘big languages’, Greek is obviously disadvantaged. Prominent among the challenges is the fact that LT is not included in the language policies or AI strategies of Greece and Cyprus, i.e., the significance of language-centric AI is still not officially recognised. Lack of continuity in research and development funding is an additional factor hampering progress. A Europe-wide coordinated initiative focused on overcoming the differences in language technology readiness for European languages, coupled with national targeted actions, is considered necessary.
1 The Greek Language

Greek is the official language of Greece, one of the two official languages of Cyprus and, since 1981, one of the official languages of the European Union. It is spoken as a mother tongue by about 95% of the 10.7 million inhabitants of Greece, by around 840,000 Greek Cypriots, and by approximately 5 million people of Greek origin worldwide.

Greek is a heavily inflectional language, and has an extensive set of derivational affixes. As regards syntax, it presents a free word order, the neutral order being Verb-Subject-Object or Subject-Verb-Object. The Greek writing system has been the Greek alphabet for most of its history. The Modern Greek alphabet consists of 24 letters. The official orthography of Modern Greek is the simplified monotonic (single-stress) system, which utilises only the stress mark and the diaeresis.

Maria Gavriilidou · Maria Giagkou · Stelios Piperidis
R.C. “Athena”, Greece, maria@athenarc.gr, mgiagkou@athenarc.gr, spip@athenarc.gr
Dora Loizidou
University of Cyprus, Cyprus, loizidou.dora@ucy.ac.cy

© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_19

2 Technologies and Resources for Greek

In the last decade, language resources have both increased in volume and improved in quality and variety (Gavriilidou et al. 2022, 2012). Resources and basic NLP tools are provided by academia, research centres and private companies as outputs of various endeavours (research projects conducted by academic institutions, funded by EU or national funds, commercial projects or self-funded) and are made available under various licensing conditions (freely distributed, only for research etc.).

Contemporary written language is represented in three main general domain monolingual text corpora: the Hellenic National Corpus developed by ILSP, the corpora of the Centre for the Greek Language, and the Corpus of Greek texts of the University of Athens. Nonetheless, the size of available corpora does not suffice for valid synchronic linguistic research, and cannot guarantee the development of language models.
Multiple bi-/multilingual text corpora which include Greek, developed mostly automatically by leveraging web crawling techniques, have been extensively used for the development and training of MT systems.

Multimodal resources have been developed sporadically, with most systematic efforts concentrated on sign language corpora and lexica. Recent attempts to construct multimodal language resources for speech pathology applications are also noteworthy. With regard to lexical resources, the presence of Greek in various international bi-/multilingual resources (e.g., IATE, WordNet, ConceptNet etc.) is encouraging. Finally, Greek features in some multilingual and/or monolingual language models; recently, three BERT models have been developed for Greek.

Existing basic NLP tools have been improved by adopting deep-learning methodologies and neural networks. The existing pipelines include tools for various types of annotation, i.e., sentence splitting, tokenization, POS tagging, lemmatisation, chunking, and dependency parsing. All pipelines are available for use through the ELG and CLARIN:EL infrastructures. Tools for more advanced tasks such as monolingual information extraction, event detection and named entity recognition have also improved over the last few years, by being trained on new datasets and applied to a variety of domains. Other applications, such as anonymisation, natural language generation and sentiment analysis, can be found at different levels of robustness and completeness. Concerning multilingual text processing, MT systems such as eTranslation, Google Translate and DeepL have significantly improved their coverage of Greek, while a number of MT systems have also been developed by smaller companies in Greece and other EU Member States, and by academic and research organisations. Speech processing has seen important progress: dictation systems for Greek with domain-specific implementations and high-calibre speech synthesis technologies have been made available by commercial providers. Several Greek-speaking digital assistants are also currently available.
Most available LRTs described above are relevant only for Standard Modern Greek. Dialectal varieties of Greek, such as Cypriot Greek, used mainly in oral speech and in specific written speech types (e.g., in poetry and literature), are not equally supported by technology. As Cypriot Greek is distinguished from Standard Modern Greek on several linguistic levels of analysis, it is often the case that existing LTs trained on Standard Modern Greek data fail to appropriately process Cypriot Greek. At the same time, LRTs developed specifically for Cypriot Greek are sparse. These are mainly general-use lexical resources (dictionaries, glossaries, wordlists). In order to protect this dialectal variety of Modern Greek, as well as the heritage and culture of its speakers, LT research should specifically treat Cypriot Greek.

Public research and academic organisations in Greece and Cyprus play a major role in developing LT, mainly through their participation in national and EU-funded projects in the fields of LT and AI, despite the fact that in the last ten years, there has been no funding programme specifically supporting LT in Greece. Participation in large-scale infrastructures, initiatives and projects, such as CLARIN:EL, ELRC and ELG, has not only boosted R&D in Greek LT, but has also facilitated sharing and reuse of LRTs. As far as the LT industry is concerned, Greek is part of the portfolios of several multinational commercial providers, while it is also supported by a small but active LT industry in Greece and Cyprus, consisting mainly of SMEs and providing various LT-related services, indicatively: AI, LT (event detection, basic NLP, lexical resources and terminologies), MT and Localisation, Speech Processing (mainly recognition), and Data Science/Big Data Analytics.

3 Recommendations and Next Steps

Despite the progress of Greek LT during the past decade, when comparing Greek to the ‘big languages’, the abysmal difference in terms of quantity, size and quality of LRTs is evident.
Efforts in the coming years should be concentrated on the further development of large-scale monolingual corpora that can be used for training large language models. Semantically annotated datasets, semantic lexica and knowledge bases, and datasets that can be used for anonymisation, simplification, summarisation, text levelling and question answering systems should also be prioritised. Speech and multimodal data are scarcely available, limiting the potential for the development of conversational agents, among others. Greek is dramatically deprived particularly when it comes to conversational data or speech in informal settings that is generated by speakers of different ages, genders and linguistic/dialectal backgrounds. The transition to ubiquitous human-computer interaction in Greek, supported by state-of-the-art research results in NLU and NLG is, unfortunately, still far away.

Further challenges posing impediments to the development of LT for Greek include: 1. Scarcity of data: as Greece and Cyprus are small countries, the production of digital language data is limited; 2. Lack of experience in the use of LT: the deployment of digital tools and methods in many disciplines, including life sciences and humanities, has only recently been introduced. Researchers/professionals in these domains still need to be convinced about its benefits; 3. Issues related to IPR or GDPR render resource owners hesitant about sharing their datasets. Non-explicit, unclear terms of use and distribution restrict sharing, use and repurposing of digital texts and language processing tools. The majority of resources pose restrictions on the types of uses they allow, thus discouraging prospective users, hampering new research and development and leading to repetition in resource creation.
One of the main reasons for the disadvantaged position of Greek is that LT is not included in the language policy of Greece and Cyprus, i.e., the significance of language-centric AI has not been recognised yet. While sporadic efforts, self-funded or partially supported within IT or AI programmes, have yielded results, they are not adequate to boost Greek LT to a state-of-the-art level, nor to help Greece keep pace with developments worldwide. Lack of continuity in R&D funding has been experienced for many years, with short-term projects alternating with periods of drought. While it is important that infrastructural initiatives for LT have been thriving in Greece, their future funding is not secured and their sustainability may be at stake.

A strategy for keeping Greek up to pace with LT developments and ensuring Greek thrives in the digital sphere should foresee: 1. maintenance, extension and sustainability of LT-related infrastructures; 2. national and/or European coordinated actions for ensuring access to open high-performance compute infrastructure; 3. coordinated actions for the development of large-scale LRs ready to power large language models; 4. targeted actions to fill the observed gaps in speech and multimodal data; 5. measures ensuring that the importance of LT and language-centric AI is recognised and included in national policies and strategies; 6. coordinated actions to further enhance digital literacy in the research communities and society as a whole; 7. coordinated actions to promote the culture of data sharing, including open source software, involving all stakeholders, the public sector, research and industry.

References

Gavriilidou, Maria, Maria Giagkou, Dora Loizidou, and Stelios Piperidis (2022). Deliverable D1.17 Report on the Greek Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-greek.pdf.

Gavriilidou, Maria, Maria Koutsombogera, Anastasios Patrikakos, and Stelios Piperidis (2012). Η Ελληνική Γλώσσα στην Ψηφιακή Εποχή
– The Greek Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/greek.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 20
Language Report Hungarian

Kinga Jelencsik-Mátyus, Enikő Héja, Zsófia Varga, and Tamás Váradi

Abstract The revolutionary expansion of language technologies (LT) in the last decade and the emergence of neural networks has heavily impacted LT. This is reflected in the development of Hungarian NLP as well, as numerous high-quality LMs, tools and datasets have been created. However, new, huge datasets are still needed to train LMs. Due to being a lesser resourced Uralic language with a smaller number of speakers, Hungarian LT has to face challenges often different from those of large Indo-European languages like English. Here we present a snapshot of this important period in the development of Hungarian LT, with special attention to language resources, and we outline some of the possible next steps.

1 The Hungarian Language

Hungarian, spoken by 13-14 million people globally, is the official language of Hungary and a few Hungarian-majority regions and municipalities in Serbia and Slovenia.
9.8 million speakers live in Hungary and a further 2.5 million speakers use Hungarian as a recognised minority language in neighbouring countries that once belonged to Hungary. An additional 1 million Hungarian speakers live scattered around the globe. There are slight differences across these language variants.

Hungarian belongs to the Finno-Ugric branch of the Uralic language family (Simon et al. 2012). Its linguistic relatives include Finnish and Estonian, with a total number of speakers below 7 million combined. This has implications for Hungarian Language Technology (LT), which cannot draw much support from the technological development of its Uralic relatives. Developers of Hungarian LT face problems such as the extensive case system and agglutination in the language, as nominals inflect for number, case, and person, and verbs inflect for person, number, tense, and mood in both definite and indefinite conjugation paradigms. The Hungarian case system – with around 20 cases (Thomason 2005) – is particularly complex compared to Indo-European languages. The Hungarian language is written using an extended version of the Latin script, the 44-letter Hungarian alphabet.

Most of the Hungarian-specific LT resources are developed either in Hungary or as part of large, multilingual Pan-European initiatives. The language variant these resources represent is almost exclusively standard Hungarian. Even in the case of corpora, most of the material that creators include comes from within Hungary, with only some exceptions (e.g., Hungarian National Corpus 2).

Kinga Jelencsik-Mátyus · Enikő Héja · Zsófia Varga · Tamás Váradi
Research Centre for Linguistics, Hungary, jelencsik-matyus.kinga@nytud.hu, heja.eniko@nytud.hu, varga.zsofia@nytud.hu, varadi.tamas@nytud.hu

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_20
2 Technologies and Resources for Hungarian

In recent years, the number of application areas of Hungarian LT has greatly increased, and several good quality Hungarian language models, tools, corpora and lexical resources have been created. Huge developments can be seen in the field of AI as well. Below we give a snapshot of Hungarian NLP in this period of swift changes, with a special emphasis on language resources (Jelencsik-Mátyus et al. 2022).

Most monolingual corpora available for Hungarian were not built specifically for LT; however, there is huge improvement in this area. Nowadays, monolingual corpora for Hungarian not only include collections of curated data (see the Hungarian National Corpus 2.0), but also datasets compiled by web crawling (e.g., Webcorpus 2.0). New resources are now built with higher levels of annotation, and often with the purpose of serving as test and training data. For example, HuLu (Hungarian Language Understanding Evaluation Benchmark Kit) can be used primarily for the evaluation and analysis of natural language understanding (NLU) systems, and it aims to be the Hungarian version of the GLUE and SuperGLUE benchmarks. At the same time, multilingual textual data containing Hungarian are abundant with almost 250 datasets, as Hungarian is often included in large EU and non-EU projects alongside dozens of other languages. While multilingual corpora vary across being comparable or parallel, general or domain-specific, there are very few domain-specific monolingual Hungarian corpora, especially from the legal domain. However, datasets an order of magnitude larger are needed to build effective language models. Several corpora to support building LMs are now under construction.

The number of multimodal corpora for Hungarian is quite low, with the most common form being an audio dataset backed with transcripts. Importantly, there are no publicly available domain-specific multimodal datasets of considerable size in Hungarian, so R&D projects need to compile their own resources to train and evaluate speech processing systems.
As BERT has become a standard in NLP, a number of BERT models have been trained for Hungarian (see HuBERT, HILBERT, emBERT). Besides BERT, models with other architectures are being adapted to Hungarian; a couple of experimental models were developed by the HILANCO consortium.

Solutions for the most common tasks in text analysis are available in state-of-the-art NLP tools and pipelines for Hungarian (see UDPipe, HuSpaCy, e-magyar and Magyarlánc). To cover higher levels of text analysis, industrial stakeholders developed some cutting-edge text analysis toolkits, e.g., Neticle’s media monitoring system. However, rapidly expanding demands pose an ever-growing number of challenges for Hungarian LT developers.

There are numerous multilingual speech processing tools covering Hungarian, but only a few Hungarian-specific applications are available. As the DNN approach has become prominent in both TTS and ASR research and development, new challenges have been identified, although there are some high-quality applications for Hungarian. There is a lack of computational and speech resources, i.e., competitive GPU grids and high-variability natural speech recordings, which hinders the development of TTS and ASR solutions. As for commercial applications, see, for instance, Clementine’s Clemvoice that provides services including speech processing, or SpeechTex specialising in TTS for the legal domain.

Neural machine translation (NMT) has become the leading paradigm for MT at large, and for Hungarian as well. A state-of-the-art NMT system is implemented by the Hungarian Research Centre for Linguistics. To carry out high-performance NMT, however, having high quality parallel language data both from general and specific domains is essential. The Hungarian provider Globalese does this by enabling human translators to train the company’s NMT engines based on their own parallel data.
Although there are some commercial solutions covering Hungarian (e.g., IntelliDockers engines or SAS), we are not aware of any summarisation tool developed for Hungarian but, as a first step towards such a tool, initial extractive and abstractive summarisation tools were built based on Hungarian-specific Transformer models. A GPT-2 model (with news and poem generators) was also built for Hungarian.

Chatbots and simple task-based systems are increasingly used, but systems that can carry out more open-ended conversations in Hungarian are not yet available.

In the last years, several solutions have been created for information retrieval. Recently, vector space models have been trained with a searchable online interface. Text classification, tag recommendation, topic modelling and sentiment analysis tools have been built to support Hungarian health services and the press.

Following the growth of AI in several fields, numerous national programmes and umbrella organisations were founded recently. The two most prominent organisations in Hungary are the Artificial Intelligence National Laboratory and the Artificial Intelligence Coalition. Their goals include facilitating cooperation and communication between research centres, universities, and industrial AI developers; and, eventually, strengthening the position of Hungarian AI internationally.

3 Recommendations and Next Steps

The emergence of neural technologies has massively reshaped how language data is used in a uniform way in most subfields of NLP. As we have seen in examples ranging from speech processing to summarisation and machine translation, although plenty of monolingual and multilingual corpora were compiled in the past years, there is an ever-growing need for novel datasets for fine-tuning, testing and benchmarking. Due to their importance, the automatic generation of such resources should be considered as well.

Thanks to the efforts made over the last decade, there are now multiple toolchains performing good-quality linguistic analysis.
At the same time, more intricate tasks are still left to be covered, e.g., processing solutions for social media texts should also be expanded. Human-computer interaction is a field that appears to be of utmost importance, but complex conversational agents are not yet available for Hungarian.

There is still ample room for strengthening cooperation between R&D and industry, and their links with the public sector market (e.g., public administration). The future of R&D of Hungarian LT and AI is primarily dependent on various funding agents, as the LT-connected market in itself is currently unable to provide sufficient financial background. Finally, due to the complexity of LT-related knowledge, the need for good-quality and well-organised LT education should be addressed in the long run.

References

Jelencsik-Mátyus, Kinga, Enikő Héja, Zsófia Varga, Tamás Váradi, László János Laki, and Győző Yang Zijian (2022). Deliverable D1.18 Report on the Hungarian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-hungarian.pdf.

Simon, Eszter, Piroska Lendvai, Géza Németh, Gábor Olaszy, and Klára Vicsi (2012). A magyar nyelv a digitális korban – The Hungarian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/hungarian.

Thomason, Sarah G. (2005). “Typological and theoretical aspects of Hungarian in contact with other languages”. In: Hungarian Language Contact outside Hungary. Amsterdam, Philadelphia: John Benjamins, pp. 11–28.
Chapter 21
Language Report Icelandic

Eiríkur Rögnvaldsson

Abstract In 2019, the Icelandic Government launched a three-year Language Technology Programme for Icelandic (LTPI). Within this programme, a number of language resources and tools have been built from scratch and several pre-existing resources and tools have been enhanced and improved. This programme is now finished and the situation for Icelandic with respect to language technology has improved considerably. In spite of this, Icelandic still remains a low-resourced language compared to most official European languages.

1 The Icelandic Language

Icelandic is a North Germanic language with its roots in Old Norse. It is the only official language of Iceland apart from Icelandic Sign Language. Even though it is only spoken by around 350,000 people in Iceland and by several tens of thousands of Icelanders living abroad, it is not considered endangered according to UNESCO’s Language Vitality Scales¹ or EGIDS.² The language community is very homogeneous, and dialectal variation is negligible.

Icelandic is a morphologically rich language; nouns, pronouns, adjectives and verbs are inflected for several grammatical features. The language is fusional, such that a single ending usually stands for more than one morphological category. Typologically, Icelandic is an SVO (subject-verb-object) language with a strong V2 rule that requires the verb to appear in the second (or first) position of the sentence. However, because of the rich inflectional system, word order is relatively free.
The Icelandic alphabet is based on the Latin alphabet with a number of additions, especially vowel symbols with an acute accent, áéíóúýÁÉÍÓÚÝ, and the vowel symbols æÆ and öÖ which are also used in a number of other languages. Furthermore, Icelandic employs two more eccentric symbols: ðÐ (eth, not to be confused with “d with a stroke”, đ) which is also used in Faroese, and þÞ (thorn) which is not used in any other language.

Iceland has the highest percentage of internet users in Europe. In 2020, 98% of Icelandic households had internet access.³ In the same year, 68,344 websites had .is as the top level domain.⁴ Icelandic is sufficiently represented on the internet, with a number of media websites and an Icelandic Wikipedia, for instance, but most people also frequently visit news sites in English, access various types of information in English, etc. Even though Icelandic is the main language used on social media, English is also prominent.

Eiríkur Rögnvaldsson
Árni Magnússon Institute for Icelandic Studies, Iceland, eirikur@hi.is

¹ https://ich.unesco.org/doc/src/00120-EN.pdf
² https://www.ethnosproject.org/expanded-graded-intergenerational-disruption-scale/

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_21

2 Technologies and Resources for Icelandic

The Icelandic Government launched the Language Technology Programme for Icelandic (LTPI) in September 2019. The self-owned foundation Almannarómur⁵ was entrusted with the role of conducting the programme. Almannarómur, in turn, commissioned the SÍM Consortium,⁶ comprising members from academia, NGOs and the private sector, to carry out the research and development work in this project. Researchers, developers and LT users are well represented in the Consortium.
Most of the existing resources and tools for Icelandic are direct or indirect outputs of this programme. Almost all of these resources and tools are stored in the CLARIN-IS repository.⁷ They can be downloaded for free, most of them under standard open licences, and used in any kind of application.

The Icelandic Gigaword Corpus (IGC) is a monolingual corpus comprising almost 2.7 billion tokens of different genres. Most of the texts are from 2001-2022. A few parsed corpora exist, most of them having been automatically parsed. GreynirCorpus contains 10 million sentences from news sources which have been parsed into full constituency trees. The Icelandic Contemporary Corpus is a constituency parsed corpus built by using an Icelandic model of the Berkeley Neural Parser and containing 30 million clauses from the IGC. A number of small specialised corpora have also been developed.

There exist a number of bilingual English-Icelandic corpora. Most of them are domain-specific corpora from ELRC and are not aligned. However, a few general purpose aligned corpora exist, the most important being ParIce with 5.3 million translation units. Much larger bilingual corpora are needed, especially between Icelandic and English but also between Icelandic and other languages such as Polish.

³ https://www.statista.com/statistics/185663/internet-usage-at-home-european-countries/
⁴ https://www.isnic.is/is/tolur
⁵ https://almannaromur.is/en
⁶ https://icelandic-lt.gitlab.io
⁷ https://repository.clarin.is

A few audio corpora exist. The most important one is Talrómur which consists of 122,417 short audio clips of eight different speakers reading short sentences, amounting to 12,780 minutes in total. A large crowdsourcing project, Samrómur, is now ongoing. In May 2022, a total of 2.85 million sentences from 28,000 speakers had been recorded, 247,800 minutes in all. No video corpora have been built for Icelandic.
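To put the two audio corpus sizes on a common scale, the reported totals can be converted to hours and to an average duration per recording. The following is a minimal sketch that uses only the figures quoted in the text; the names `CORPORA` and `summarise` are illustrative, and treating each recorded Samrómur sentence as one item is an assumption.

```python
# Figures quoted in the text; the names used here are illustrative only.
CORPORA = {
    "Talrómur": {"items": 122_417, "minutes": 12_780},
    # Assumption: each recorded Samrómur sentence counts as one item.
    "Samrómur (May 2022)": {"items": 2_850_000, "minutes": 247_800},
}


def summarise(items: int, minutes: int) -> tuple[float, float]:
    """Return (total hours, average seconds per item)."""
    hours = minutes / 60
    avg_seconds = minutes * 60 / items
    return hours, avg_seconds


if __name__ == "__main__":
    for name, stats in CORPORA.items():
        hours, avg_seconds = summarise(stats["items"], stats["minutes"])
        print(f"{name}: {hours:.0f} hours, {avg_seconds:.1f} s per item on average")
```

Under these assumptions, Talrómur amounts to 213 hours and the Samrómur recordings to 4,130 hours, with an average recording length of roughly six and five seconds respectively.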
The Database of Modern Icelandic Inflection (DMII) is supposed to contain the inflectional paradigms of the whole vocabulary of Icelandic. The current version has a vocabulary of about 305,000 lemmas, and 6.2 million inflectional forms. The DMII Core is an extract of DMII and contains the core vocabulary of Modern Icelandic, around 58,000 entries. The monolingual Dictionary of Contemporary Icelandic has 56,000 entries and is constantly being updated. Sound files with recordings of all the headwords in the dictionary are also available.

The company Miðeind, a member of SÍM, has been developing a translation system between English and Icelandic using neural networks. Although still under development, it already gives very promising results. The pilot version is offered as a web-based service.⁸ Miðeind is also developing AI models and some of them are already available, such as GreynirTranslate (mBART25 NMT), general domain IS-EN and EN-IS translation models based on a multilingual BART model.

There exist a number of tools for analysing Icelandic text. Among them are two packages that each include various tools. IceNLP is a package which contains a tokeniser, part-of-speech tagger, lemmatiser, and shallow parser. Greynir is a more recent package that can parse text into constituency trees, find lemmas, inflect noun phrases, assign part-of-speech tags and more.

A number of tools for speech processing are currently being developed within the LTPI, among them a new speech recogniser and a speech synthesiser, but these are not yet publicly available although prototypes have been publicly demonstrated.

Embla is the first voice assistant app for the Icelandic language, available both for iOS and Android. It combines a speech recogniser, a speech synthesiser and the Greynir tool which it uses to search for answers to questions that the user poses. Greynir extracts information from Icelandic text which allows natural language querying of that information and facilitates natural language understanding.
In the national AI strategy from April 2021, the importance of developing LT resources and tools for Icelandic is explicitly mentioned.⁹ In the policy statement of the new Government that took office in November 2021,¹⁰ it is explicitly stated that the strategic R&D LT programme will be prolonged throughout the current election period, until 2025.

⁸ https://velthyding.is
⁹ https://www.stjornarradid.is/gogn/rit-og-skyrslur/stakt-rit/2021/04/29/Stefna-Islands-um-gervigreind/
¹⁰ https://www.stjornarradid.is/library/05-Rikisstjorn/Agreement2021.pdf

3 Recommendations and Next Steps

Ten years ago, the status of Icelandic LT was rather poor (Rögnvaldsson et al. 2012), but the LTPI has revolutionised the situation (Rögnvaldsson 2022). The forming of the SÍM Consortium has led to a very fruitful cooperation among all stakeholders. Researchers who used to work individually on small projects now work together on implementing projects on a much bigger scale. The number of researchers and students involved in LT has multiplied and new startup companies have emerged.

The LTPI has delivered high-quality applications that hopefully contribute to the digital vitality of Icelandic. But even so, Icelandic still lacks a number of important resources now that the LTPI is finished. Among them are spoken language corpora; parallel corpora (Icelandic and other languages than English, such as Polish and the Scandinavian languages); corpora for different purposes (sentiment analysis, question answering, summarisation); annotated multimodal corpora; and term lists. Furthermore, Icelandic lacks tools for sentiment analysis, summarisation, question answering, natural language understanding and generation, dialogue management, disambiguation, text and speech translation, automatic subtitling, advanced speech synthesis (intonation, empathy) and specialised grammar checking.
In order for these resources and tools to be developed, the continuation of the LTPI must be secured. It is also of vital importance that Icelandic is compatible with products of the large international IT companies. A delegation of LT specialists led by the President of Iceland and the Minister of Culture recently visited Amazon, Apple, META and Microsoft in order to convince them to include Icelandic in their products, offering them access to all deliverables of the LTPI. A large-scale European cooperation would also be a welcome assistance in preparing Icelandic for the future.

References

Rögnvaldsson, Eiríkur (2022). Deliverable D1.19 Report on the Icelandic Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-icelandic.pdf.

Rögnvaldsson, Eiríkur, Kristín M. Jóhannsdóttir, Sigrún Helgadóttir, and Steinþór Steingrímsson (2012). Íslensk tunga á stafrænni öld – The Icelandic Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/icelandic.
Chapter 22
Language Report Irish

Teresa Lynn

Abstract Language technology (LT) underpins many applications that enable our digitally enhanced lives (virtual assistants, search engines, translation tools, spellcheckers, language learning tools etc.). However, these advances do not benefit all Irish citizens equally. Due to a lack of sufficient LTs for Irish, Irish speakers regularly need to revert to using English. Such a language shift plays a major role in the risk of digital extinction, i.e., an eventual decline in language use due to lack of technological support. This chapter highlights work carried out on Irish LT, and the gaps and challenges that still need to be addressed (Lynn 2022).

1 The Irish Language

Irish is the first official and national language of the Republic of Ireland, with English as the second official language. Irish Sign Language has had official legal recognition since 2017. Figures from the 2016 census report that 39.8% (1.7 million) of the population can speak Irish, while only 1.5% (roughly 73,000) speak Irish on a daily basis outside the education system. Irish is also recognised as a minority language in Northern Ireland and has been an official language of the European Union since 2007 (and full working language of the EU since 2022).

Irish has three main dialects. However, there is no spoken standard variety, which has implications for speech technology development. The written form was standardised in 1958 with the publication of An Caighdeán Oifigiúil (The Official Standard). Irish has rich morphology and a verb-subject-object (VSO) word order, which can pose challenges for applications such as alignment tools and machine translation (MT) when paired with English (SVO). Its inflectional nature (suffixation, initial mutation, etc.) leads to sparsity in Irish datasets, which impacts data-driven LT.

There are dispersed ‘Gaeltacht’ regions across Ireland where Irish is spoken daily as a first language. However, English is becoming increasingly used in these regions, partially due to its monopolising digital presence. Outside Gaeltacht regions, Irish
Outside Gaeltacht regions, Irish TeresaLynn Dublin CityUniversity, ADAPT Centre,Ireland, teresa.lynn@adaptcentre.ie © The Author(s) 2023 G. Rehm,A. Way (eds.), European Language Equality, CognitiveTechnologies, https://doi.org/10.1007/978-3-031-28819-7_22 is also spoken in some urban areas. Irish is a compulsory core subject at primary and secondarylevel,andthenumber ofIrish-medium pre-schools,primary and sec­ondary schoolsis growing inboth the Republic ofIrelandand Northern Ireland. The Official Languages Act (2003) has the objective of ensuring the improved provisionofpublicservicesthroughtheIrishlanguage.Inaddition,the20YearStrat­egy for the Irish Language (2010-2030) and the accompanying Action Plan for the Irish Language (2018-2022) recognise the State’s commitment to the language’s re­vival. The National AI Strategy for Ireland focuses on English language-based AI. The recently publishedDigital Plan for Irish outlinesurgent needs in LT. Mainstream media produces much valuable audio and text-based Irish content. Irish language content is only found across roughly 1,500 (0.5%) of Ireland-based .ie domains, with low numbers of businesses localising their websites to Irish. The useofIrishinsocialmediaisprevalentamongusersacrossthemainplatforms.How­ever, there is still minimal support for Irish. Google Translate and Bing Microsoft Translator still prove unreliable within particular domain settings, and much con­troversyhasarisenaroundthefrequentmisuseofunverifiedautomatedtranslations. Facebook does not yet provide the option to translate Irish language posts. Google Search and Gmail interfaceswere localisedby volunteer translators. 2 TechnologiesandResourcesforIrish This summary is based on the European Language Grid (ELG). Some progress has been made in text analytics, MT, and speech technologies, mainly thanks to data collection and corpus creation from short term academic projects, funded by EU-projects and national funds, or self-funded. 
However, it should be noted that the ELG figures for Irish resources are inflated in some cases due to 1. the inclusion of version updates of some multilingual datasets like Universal Dependencies and ParaCrawl, 2. large multilingual datasets of which only a small proportion represents Irish, and 3. Irish web-crawled data made available through overlapping projects.

Irish is still very much a low-resourced language, with few changes in terms of LT support since Judge et al. (2012). The lack of data resources, skill-sets and dedicated funding has left a gap for many fundamental technologies. While there are extensive LT industry bodies and research centres in Ireland, little attention has been given to Irish LT. Irish-language related projects are mostly funded through The Department of the Gaeltacht’s Irish Language Support Schemes and Foras na Gaeilge.

The two largest monolingual Irish corpora (New Corpus for Ireland – Irish, NCII, and Gaois Corpus of Contemporary Irish) are both restricted in terms of access due to copyright. To address this, the development of the open-source National Corpus of Ireland is underway, where resources such as word-frequency and n-gram lists, as well as language models will be made available.

Some NLP-task specific corpora have been produced as part of PhD research (e.g., POS-tagged corpora, treebanks, MWE-tagged corpora, spoken corpora). There is a considerable lack of Irish monolingual corpora for specific domains (e.g., legal, medical, education etc.). The Irish Wikipedia (An Vicipéid) dataset was used in the development of Multilingual BERT and the Irish gaBERT language model.

The availability of bilingual texts for the purposes of English-Irish MT increased largely due to Ireland’s involvement in the European Language Resource Coordination (ELRC) project, and other EU funded initiatives. The majority of this data is available to download from Ireland’s National Relay Station: eStór under the EU Open Data Directive. As such, both statistical and neural English-Irish MT engines have been built at DCU through PhD research.
Irish is included amongst the languages supported by the European Commission’s eTranslation platform. Google, Bing, and the IRIS MT system are all free general-purpose Irish MT systems.

An XFST Finite State suite of tools includes an Irish tokeniser, lemmatiser, morphological analyser, POS-tagger, a constraint grammar and a chunking tool. Dependency parsing models are available through UDPipe and Stanza. There is only one open-source spell-checker (GaelSpell) and one open-source grammar checker (An Gramadóir).

Steady progress has been made in speech synthesis for the three main dialects. Applications have been developed to make these voices available to the public (e.g., in accessibility aids and computer assisted language learning, CALL). Live recordings and crowdsourced recordings of predominantly native speakers using the online facility Míle Glór are being collected and processed for the development of the first ASR system. The Mozilla Common Voice project has also collected a small dataset of Irish speech (both native and non-native speakers) through crowdsourcing efforts.

Irish is relatively well-resourced when it comes to electronic dictionaries, terminology databases, thesauri, gazetteers and glossaries. Owing to copyright restrictions, most dictionary developments (funded by Foras na Gaeilge) offer only single-user queries or data access for research purposes. The National Morphology Database and accompanying computational grammar library (Gramadán) are open-source. The National Terminology Database is used by the general public, students, freelance translators and translators at EU institutions. The Pota Focal site hosts a dictionary, glossary, verb valency dictionary and thesaurus, the latter of which is powered by Líonra Séimeantach na Gaeilge (LSG), an Irish Wordnet.

In terms of Natural Language Processing, the GaelTech project (2017-2023) at DCU focuses on POS-tagging, syntactic parsing, language modelling and the processing of user-generated content, code-switching and multiword expressions.
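The remark in Section 1 that initial mutation leads to sparsity in Irish datasets can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function name `demutate`, the prefix tables and the toy word list are hypothetical and not taken from any of the tools named in this chapter. It covers only word-initial lenition and eclipsis of consonants, and it is deliberately naive: it would wrongly alter genuine words that happen to begin with the same letters (for example the preposition "chun"), so a real system would need a lexicon or a morphological analyser.

```python
# Naive illustration of how initial mutations multiply surface forms.
# Assumption: only consonant lenition and eclipsis are handled.

# Eclipsis prefixes mapped to the underlying initial consonant.
# "bhf" is listed first so it is matched before the lenition check.
ECLIPSIS = [
    ("bhf", "f"), ("mb", "b"), ("gc", "c"), ("nd", "d"),
    ("ng", "g"), ("bp", "p"), ("dt", "t"),
]

# Consonants that can be lenited (written with a following "h").
LENITABLE = "bcdfgmpst"


def demutate(token: str) -> str:
    """Strip a word-initial eclipsis or lenition marker, if one is present."""
    word = token.lower()
    for prefix, base in ECLIPSIS:
        if word.startswith(prefix):
            return base + word[len(prefix):]
    if len(word) > 2 and word[0] in LENITABLE and word[1] == "h":
        return word[0] + word[2:]
    return word


# Toy data: three nouns, each in radical, lenited and eclipsed form.
tokens = ["bád", "bhád", "mbád", "cat", "chat", "gcat", "fear", "fhear", "bhfear"]

surface_types = set(tokens)
normalised_types = {demutate(t) for t in tokens}
print(len(surface_types), len(normalised_types))  # prints: 9 3
```

In this toy sample, nine distinct surface forms collapse to three underlying words, which is the kind of vocabulary inflation that makes frequency estimates sparse in small Irish corpora.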
3 Recommendations and Next Steps

Many commonly used and necessary technologies are still not available for Irish: there has been relatively little progress in ASR, and no research or system development for Automatic Subtitling, Information Retrieval and Extraction, Natural Language Generation, Semantic Role Labelling, Named Entity Recognition, Sentiment Analysis, Question-Answering, Virtual Agents, Adaptive Learning or Anonymisation. The following highlights some strategies to address this.

1. Change of focus: A shift in focus (away from the development of dictionaries and terminologies for language learning or translation) is required to recognise LT as an equally important axis for continued language use.

2. Untapped potential: Language data is broadly unknown and undervalued amongst Irish citizens and across the public sector. If collected and applied appropriately, this data could make a huge impact on the future of Irish LT. For example: development of ASR and automatic subtitling systems through data from the archives of the national broadcasters (RTÉ, TG4); a named entity recogniser using the national placenames, biographies databases, and the Database of Irish-language Surnames; CALL systems using language learning corpora.

3. Need for dedicated LT programmes: Due to the lack of dedicated education and training programmes in this field, it has proven difficult to source researchers, linguists or engineers with the right combination of skills (e.g., Irish language, computer science, linguistics) in previous LT projects.

4. Long-term strategy: There is a clear need for: a strategy for safeguarding Irish in a digital age; support for dedicated LT education and training; investments in data collection and annotation; development of production-ready LT tools.

5. Open-source culture: Many high quality resources available for Irish are under copyright protection, rendering them unusable for general purposes. Where possible, all data and tools developed for Irish should be open-source, ensuring that access is widened to others that have the skills or resources to develop them further.

6. Corporate social responsibility: While Ireland is a major European hub for technological innovation in AI and NLP industries, this investment only serves the English-speaking population of Ireland. As part of a corporate social responsibility policy, support for the Irish language requires much more serious consideration.

References

Judge, John, Ailbhe Ní Chasaide, Rose Ní Dhubhda, Kevin P. Scannell, and Elaine Uí Dhonnchadha (2012). An Ghaeilge sa Ré Dhigiteach – The Irish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/irish.

Lynn, Teresa (2022). Deliverable D1.20 Report on the Irish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-irish.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 23
Language Report Italian

Bernardo Magnini, Alberto Lavelli, and Manuela Speranza

Abstract In the last few years, three important factors have influenced the Italian Language Technology (LT) community: 1. in 2015, the foundation of the Associazione Italiana di Linguistica Computazionale (Italian Association for Computational Linguistics, AILC); 2. the organisation of CLiC-it, the annual Italian Conference on Computational Linguistics; 3. the organisation of the EVALITA (Evaluation of NLP and Speech Tools for Italian) evaluation campaigns. This situation is producing a widespread expansion of interest in LT for Italian in academia and industry.

1 The Italian Language

Italian is an official language in Italy (where other languages are co-official within certain regions), San Marino and the Vatican City State and it is one of the official languages in Switzerland. It has official minority status in Slovenia and Croatia and formerly had official status in Albania, Malta, Monaco, Montenegro and Greece. It used to be an official language in the former colonial areas of Italian East Africa and Italian North Africa. It is among the minority languages of Bosnia and Herzegovina and Romania, although it is not protected in these countries. Italian is also spoken by very large immigrant and expatriate communities in the Americas and Australia. Italian is a major European language, being one of the official languages of the European Union, the Organisation for Security and Co-operation in Europe and one of the working languages of the Council of Europe.
Italian is the native language of around 15% of the EU population (European Commission 2012), thus the second most widely spoken language after German (Keating 2020), and has 61.8 million first language speakers according to WorldInfo.¹ Around 56 million native speakers of Italian reside in Italy; it has been estimated that more than 200,000 further first language speakers of Italian reside in Switzerland, Belgium, France, Germany, and the United Kingdom, and smaller groups of speakers reside in Croatia, Luxembourg, Malta, Romania and Slovenia. Italian is in fourteenth place in the ranking of the most used languages on the internet, as W3Techs estimates it to be used by 0.7% of the top 10 million websites.²

Bernardo Magnini · Alberto Lavelli · Manuela Speranza
Fondazione Bruno Kessler, Italy, magnini@fbk.eu, lavelli@fbk.eu, manspera@fbk.eu

¹ https://www.worlddata.info/languages/italian.php

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_23

Italian belongs to the Romance branch of the Indo-European language family. Its writing system is close to being a phonemic orthography and almost all native Italian words end with vowels. Italian grammar is typical of the grammar of Romance languages in general.
Cases exist for pronouns but not for nouns and there are two genders (masculine and feminine). Nouns, adjectives, and articles inflect for gender and number. Subject pronouns are usually dropped, their presence implied by verbal inflections. There are numerous contractions of prepositions with subsequent articles and numerous productive suffixes (e.g., for diminutive and augmentative). Many native speakers of Italian residing in Italy are native bilingual speakers of Italian and one of the Italian dialects (which may differ significantly from Italian).³

The Digital Report, a survey conducted in 2020 by “We Are Social” in collaboration with Hootsuite, with the aim of collecting data on the use of the internet and social platforms both at the global and the local level (Starri 2021), reports that in Italy over 1 million people connected to the internet for the first time in 2020 (a 2.2% increase), for a total of over 50 million internet users.

2 Technologies and Resources for Italian

A considerable part of the publicly available language resources for Italian have been produced in the EVALITA evaluation campaigns.⁴ In the context of EVALITA, 62 tasks (with the availability of corresponding annotated data) have been organised in total. These tasks range from lemmatisation to sentiment analysis, covering both written texts and speech tools.

The last LT funding programme in Italy dates back to 1999–2001. Since then, there has been no specific programme, nor is one foreseen in the near future. Italian, as one of the bigger EU languages, is better equipped than other languages, but further research is needed before truly effective language technology solutions will be ready for everyday use, and to avoid lagging behind the much better resourced English language. There is no national research infrastructure dedicated to LTs in Italy.
Resources for the Italian language (annotated and unannotated corpora, benchmarks, tools for several tasks) are, however, available either through websites of single research institutions, or through shared infrastructures at the European level, including the CLARIN repository and the European Language Grid.

Despite the lack of national funding programmes, the Italian community is rather active at the international level. Italy hosted EACL 2006 (11th Conference of the European Chapter of the Association for Computational Linguistics) in Trento and ACL 2019 (57th Annual Meeting of the Association for Computational Linguistics) in Florence. Italian researchers have been chairing several LT conferences, including various editions of the Language Resources and Evaluation Conference (LREC), ACL 2021 (Programme Chair) and ACL 2022 (General Chair).

² https://w3techs.com/technologies/overview/content_language
³ https://www.istat.it/it/files/2017/12/Report_Uso-italiano_dialetti_altrelingue_2015.pdf
⁴ https://www.evalita.it

In the last few years, a series of initiatives have been taking place in the Italian NLP community. In 2007, the first edition of EVALITA (Evaluation of NLP and Speech Tools for Italian) was held. The general objective of EVALITA is to promote the development of language and speech technologies for the Italian language, providing a shared framework where different systems and approaches can be evaluated in a consistent manner. As a side-effect of the evaluation campaign, both training and test data are available to the scientific community as benchmarks for future improvements (Magnini et al. 2022). The first EVALITA edition was followed by six additional successful editions, the last in 2020.

Following the strong interest raised by EVALITA, the Associazione Italiana di Linguistica Computazionale⁵ (Italian Association for Computational Linguistics, AILC) was founded in 2015, with the goal of establishing common ground for the Italian LT community.
A second relevant initiative on LT in Italy is CLiC-it, the annual Italian Conference on Computational Linguistics.⁶ The first edition of CLiC-it was held in Pisa in 2014. CLiC-it has become the most important forum for computational linguistics in Italy, and has achieved the important goal of stimulating the production of high-quality research and resources for the Italian language.

Another relevant initiative concerning Italian LTs is the work carried out by the European Language Resource Coordination (ELRC).⁷ One of the aims of the Italian ELRC is to mobilise public sector bodies to share their high-quality translated data. Additionally, many of the EVALITA resources and technologies (Patti et al. 2023) have been made available through the European Language Grid (Rehm 2023).⁸

Finally, it is worth mentioning the Lectures on Computational Linguistics, an AILC initiative targeting students (both Master’s and PhD) and aiming at providing core competence in the LT field.⁹

The Italian academic LT community is relatively well distributed over the whole Italian territory, both in university departments in human sciences (e.g., linguistics, digital humanities, cognitive sciences) and in departments in computer science. In addition, there are departments of the National Research Council (CNR) and local research institutions, which are very active in the field of computational linguistics and NLP. As for industrial providers, in Italy there are more than one hundred companies that can be considered active developers in the LT field.

More details about technologies and resources for Italian can be found in Magnini et al. (2022) and the META-NET White Paper on Italian (Calzolari et al. 2012).
⁵ https://www.ai-lc.it
⁶ https://www.ai-lc.it/en/conferences/clic-it/
⁷ https://lr-coordination.eu
⁸ https://www.european-language-grid.eu
⁹ https://www.ai-lc.it/lectures/

3 Recommendations and Next Steps

Given this rather favourable context (new neural approaches and strong community initiatives), we are seeing a widespread expansion of interest in Language Technology for Italian, both in academia and in industry. At the same time, the LT bar is continuously moving upwards, which requires adequate efforts and investments. These are particularly needed in areas such as dialogue systems, where Italian is still lacking sufficient language resources, and in application domains such as biomedicine, where progress is still limited.

References

Calzolari, Nicoletta, Bernardo Magnini, Claudia Soria, and Manuela Speranza (2012). La Lingua Italiana nell’Era Digitale – The Italian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/italian.

European Commission (2012). Europeans and their Languages. https://europa.eu/eurobarometer/surveys/detail/1049.

Keating, Dave (2020). “Despite Brexit, English Remains The EU’s Most Spoken Language By Far”. In: Forbes. https://www.forbes.com/sites/davekeating/2020/02/06/despite-brexit-english-remains-the-eus-most-spoken-language-by-far/.

Magnini, Bernardo, Alberto Lavelli, and Manuela Speranza (2022). Deliverable D1.21 Report on the Italian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-italian.pdf.

Patti, Viviana, Valerio Basile, Andrea Bolioli, Alessio Bosca, Cristina Bosco, Michael Fell, and Rossella Varvara (2023). “Italian EVALITA Benchmark Linguistic Resources, NLP Services and Tools”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 295–300.
Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.

Starri, Matteo (2021). Digital 2021 – I dati italiani. https://wearesocial.com/it/blog/2021/02/digital-2021-i-dati-italiani/.

Chapter 24
Language Report Latvian

Inguna Skadina, Ilze Auzina, Baiba Valkovska, and Normunds Gruzitis

Abstract Ten years ago, when META-NET conducted a study on Language Technology support for Europe’s languages, Latvian was assessed as a language with little or no support (Skadina et al. 2012). During the last decade, progress has been made in the development of language resources and tools for Latvian, particularly with respect to advanced datasets and language models, machine translation solutions, speech technologies, and technologies for natural language understanding and human-computer interaction. This chapter provides a summary of the current state of the Latvian language, the only official language of Latvia, in the digital environment and highlights the most important activities in the language technology field.
1 The Latvian Language

Latvian is the official language of the Republic of Latvia. There are about 1.5 million native speakers, 1.38 million of whom live in Latvia. By the end of 2017, Latvian was the mother tongue of 60.8% of the country’s resident population. Latvian is spoken as a second language by around 0.5 million people of other ethnicities. Latvian has three dialects: the Central, Livonic, and High Latvian dialects.

Latvian orthography is based on the phono-morphological principle. Latvian punctuation is based on the grammatical punctuation principle. Latvian orthography almost fully corresponds to the pronunciation. Present-day Latvian orthography is based on the Latin script. The Latvian standard alphabet consists of 33 letters, including letters with diacritical marks.

Standard Latvian has 26 consonant phonemes, 12 vowels (six short and six long), and 10 diphthongs. Vowel length is phonemic and plays an important role in distinguishing the lexical and grammatical meaning of words. Most Latvian words are stressed on the first syllable. Syllables with long vowels, diphthongs, and diphthongical combinations of vowel and sonorant in the centre are subject to certain intonation patterns. In a few areas, three patterns of tone or intonation are distinguished: level (also drawling, even) tone, falling tone, and broken tone.

Inguna Skadina · Ilze Auzina · Baiba Valkovska · Normunds Gruzitis
University of Latvia, Latvia, inguna.skadina@tilde.com, ilze.auzina@lumii.lv, baiba.valkovska@lumii.lv, normunds.gruzitis@lumii.lv

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_24

From a language typology perspective, Latvian has a classic Indo-European (Baltic) system.
However, for regional and historical reasons, Latvian grammar also displays some features more similar to those found in Finno-Ugric languages (Kalnaca and Lokmane 2021). Latvian is a fusional, mainly suffixing language with a rich system of forms and word formation. A distinction is made between inflected and non-inflected word classes. Nouns inflect for number and case, adjectives inflect for case, number, gender and definiteness, and verbs may inflect for tense, mood, voice and person (Nau 1998). Word order is relatively free, i.e., pragmatically governed, but the basic word order is subject-verb-object (SVO).

2 Technologies and Resources for Latvian

Research and development activities in Latvia are being supported through different EU and national finance instruments and are usually organised around short-term projects. The lack of a dedicated LT programme, however, leads to fragmentation of research and development activities and complicates the development of larger resources and long-term cooperation between institutions. Progress and key achievements are regularly reported through the Baltic HLT conferences and other events (Skadina 2019 and Skadina et al. 2022 provide recent overviews).

Most open-access monolingual text corpora for Latvian are listed on Korpuss.lv (Saulite et al. 2022). Modern Latvian is primarily represented through the Balanced Corpus of Modern Latvian (LVK2018, Dargis et al. 2020). A balanced subset of LVK2018 includes several annotation layers: named entities, co-references, Universal Dependencies (UD), FrameNet and PropBank annotations, as well as Abstract Meaning Representation (AMR) (Gruzitis et al. 2018). Many parallel corpora are openly accessible from OPUS, ELG and ELRC-SHARE. Bilingual and multilingual corpora are also stored on Korpuss.lv and the Tilde Data Library.¹ Domain-specific parallel corpora for the development of domain-specific MT engines are lacking.

The first Latvian speech corpus was created in 2012/2013.
It contains 100 hours of transcribed speech. However, access is limited, and currently the only open-access Latvian speech corpora are very small. Multimodal corpora are still not available for Latvian, although the development of a sign language corpus is ongoing in the State Research Programme “Letonika”.

Tezaurs.lv is the largest open lexical dataset and online dictionary for Latvian (Spektors et al. 2016). It is regularly updated, and currently contains more than 380k single- and multi-word entries, compiled from 300+ sources. A Latvian WordNet is being created as an extension to Tezaurs.lv. Different lexicons (mostly bilingual) are available from the Letonika.lv portal, including dictionaries for widely used language pairs, as well as dictionaries of the languages of the Baltic countries.

¹ https://tilde.com/products-and-services/data-library

Various text analysis tools such as tokenisers and sentence splitters, morphological analysers and taggers, spelling and grammar checkers, syntactic and semantic parsers, named entity recognisers, and text classifiers are available for Latvian. Open-source components are integrated into a Latvian NLP pipeline as a service.²

Regarding natural language understanding and generation, experiments with the interlingual UD, FrameNet, AMR, BERT, GPT, etc. models for Latvian demonstrate the potential of combining machine learning and knowledge-based approaches.

With respect to machine translation (MT), the situation has changed a lot since 2012. Besides MT solutions provided by global companies, the company Tilde provides customised MT solutions for complex, highly inflected languages, particularly smaller European languages. MT systems developed by Tilde have been recognised among the best systems for four consecutive years (2017-2020) at WMT international news translation shared tasks (Pinnis et al. 2019). These results allowed Tilde together with partners to develop the EU Council Presidency Translator which has been used already in eight countries (Pinnis et al. 2020).
Several speech recognition and synthesis systems have been developed for Latvian by Tilde, the national news agency LETA, and the University of Latvia. Several virtual assistants can communicate in Latvian, e.g., Hugo.lv (Skadins et al. 2020) lists more than 10 virtual assistants for different public services.

Latvia is a member of CLARIN (Skadina et al. 2020) and focuses on Latvian and Latgalian resources and tools. CLARIN-LV participates in the CLARIN Knowledge Center for Systems and Frameworks for Morphologically Rich Languages.

² http://nlp.ailab.lv

3 Recommendations and Next Steps

Today, Latvian has a rather stable position in the digital world. However, the situation could change dramatically if efforts and investments in LT are not increased in R&D and language policy. Strong national and European support is necessary for further Latvian research and development activities, including dedicated long-term LT programmes that provide equal support for both research and industrial activities. To narrow the digital divide, there is a pressing need for novel techniques that would bring less-resourced languages to a level comparable to the state-of-the-art results for resource-rich languages. Moreover, close synchronisation between national and international activities is necessary.

References

Dargis, Roberts, Kristine Levane-Petrova, and Ilmars Poikans (2020). “Lessons Learned from Creating a Balanced Corpus from Online Data”. In: Human Language Technologies – The Baltic Perspective. Vol. 328. IOS Press, pp. 127–134. DOI: 10.3233/FAIA200614.

Gruzitis, Normunds, Lauma Pretkalnina, Baiba Saulite, Laura Rituma, Gunta Nespore-Berzkalne, Arturs Znotins, and Peteris Paikens (2018). “Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU”. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), pp. 4506–4513.

Kalnaca, Andra and Ilze Lokmane (2021). Latvian Grammar. University of Latvia.

Nau, Nicole (1998). Latvian. Vol. 217. Lincom Europa.
Pinnis, Marcis, Toms Bergmanis, Kristine Metuzale, Valters Šics, Arturs Vasilevskis, and Andrejs Vasiljevs (2020). “A Tale of Eight Countries or the EU Council Presidency Translator in Retrospect”. In: Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2). Association for Machine Translation in the Americas, pp. 525–537.

Pinnis, Marcis, Rihards Krislauks, and Matiss Rikters (2019). “Tilde’s Machine Translation Systems for WMT 2019”. In: Proceedings of the 4th Conference on Machine Translation (Volume 2). Florence, Italy: ACL, pp. 327–334. DOI: 10.18653/v1/W19-5335.

Saulite, Baiba, Roberts Dargis, Normunds Gruzitis, Ilze Auzina, Kristine Levane-Petrova, Lauma Pretkalnina, Laura Rituma, Peteris Paikens, Arturs Znotinš, Laine Strankale, Kristine Pokratniece, Ilmars Poikans, Guntis Barzdinš, Inguna Skadina, Anda Baklane, and Valdis Saulespurens (2022). “Latvian National Corpora Collection – Korpuss.lv”. In: Proceedings of the 13th LREC Conference.

Skadina, Inguna (2019). “Some Highlights of Human Language Technology in Baltic Countries”. In: Databases and Information Systems. Vol. 315. IOS Press, pp. 18–30. DOI: 10.3233/978-1-61499-941-6-18.

Skadina, Inguna, Ilze Auzina, Normunds Gruzitis, and Arturs Znotinš (2020). “Clarin in Latvia: From the preparatory phase to the construction phase and operation”. In: Proceedings of the 5th Conference on Digital Humanities in the Nordic Countries, pp. 342–350.

Skadina, Inguna, Ilze Auzina, Baiba Valkovska, and Normunds Gruzitis (2022). Deliverable D1.22 Report on the Latvian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-latvian.pdf.

Skadina, Inguna, Andrejs Veisbergs, Andrejs Vasiljevs, Tatjana Gornostaja, Iveta Keiša, and Alda Rudzite (2012). Latviešu valoda digitalaja laikmeta – The Latvian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/latvian.

Skadins, Raivis, Marcis Pinnis, Arturs Vasilevskis, Andrejs Vasiljevs, Valters Sics, Roberts Rozis, and Andis Lagzdins (2020). “Language Technology Platform for Public Administration”. In: Human Language Technologies – The Baltic Perspective. Ed. by Utka Andrius, Vaicenoniene Jurgita, Kovalevskaite Jolantai, and Kalinauskaite Danguole. Vol. 328. FAIA. IOS Press, pp. 182–190.

Spektors, Andrejs, Ilze Auzina, Roberts Dargis, Normunds Gruzitis, Peteris Paikens, Lauma Pretkalnina, Laura Rituma, and Baiba Saulite (2016). “Tezaurs.lv: the largest open lexical database for Latvian”. In: Proceedings of LREC 2016.
Chapter 25
Language Report Lithuanian

Anželika Gaidiene and Aurelija Tamulioniene

Abstract Significant progress has been made in adapting the Lithuanian language to the digital environment. A number of digital language resources and basic language analysis tools, as well as complex online language services and the Lithuanian language ontology have been developed, while a number of computer programs and tools have been localised. Computer applications relevant to society are being Lithuanianised, and the standardisation of computer terms is being carried out. Lithuanian researchers actively participate in the cooperation and mobility activities of international associations, and a core of Lithuanian specialists working in the field of IT application, and developing innovative work in this field, has been formed. Lithuania also strives for all citizens to have full access to digital solutions, which adds importance to the policy of adapting them for those living with disabilities.

1 The Lithuanian Language

Lithuanian is a Baltic language from the Indo-European family. Lithuanian and Latvian are the two surviving Baltic languages. Since 2004, Lithuanian has been one of the official languages of the European Union. Lithuanian is the state language of the Republic of Lithuania and is enshrined in the Constitution as such. The use of Lithuanian in public life is regulated by the State Law on the Lithuanian Language (1995). According to data from 2012, there were about 3.6 million Lithuanian speakers. In terms of number of speakers, Lithuanian ranks 144th in the world.

Lithuanian is the most conservative of the living Indo-European languages, and it has best preserved many of its archaic features. From a typological point of view, Lithuanian has many unique features, including abundant forms of variation, the synthesis of tonal and dynamic stress, and the diverse order of words reflecting the complex syntactic level of discourse communication. Standard Lithuanian was formed at the beginning of the 20th century on the basis of one of the Aukštaitian dialects.
It is characterised by a great variety of regional variants; the two main dialects are Aukštaitian and Samogitian (Vaišnienė and Zabarskaitė 2012). The Lithuanian alphabet was formed in the 16th to 20th centuries on the basis of the Latin alphabet, to which nasal vowels (ą, ę, į, ų) and letters with diacritics (č, š, ž, ė, ū) were added. The current Lithuanian alphabet has 32 letters (12 vowels and 20 consonants), plus 3 letter combinations (ch, dz, dž). The grammatical structure is of an inflectional type; the vocabulary is the most variable level of the language. Some words disappear and are replaced by new ones. In the current Lithuanian language, there is a pronounced abundance of terms in various fields. The vocabulary of the Lithuanian language consists of old words inherited from the Proto-Indo-European language, borrowings, and new words based on inherited words and borrowings.

Anželika Gaidienė · Aurelija Tamulionienė
Institute of the Lithuanian Language, Lithuania, anzelika.gaidiene@lki.lt, aurelija.tamulioniene@lki.lt

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_25

According to 2021 data, in the 16 to 74 age group, almost 87% of the Lithuanian population uses the internet; this figure is as high as 100% in the 16 to 24 age group, and 55.2% in the 65 to 74 age group. In 2021, about 225,000 .lt domains were registered, of which more than 2,000 contain distinctive Lithuanian letters (ė, ž, etc.). In addition, Lithuania remains among the leaders in fibre-optic internet service. In Lithuania, the coverage of the fibre-optic network is 46.8%.

2 Technologies and Resources for Lithuanian

The level and advancement of language technologies in Lithuania can first and foremost be appraised by the degree of achievement of the goals rooted in the 2014–2020 guidelines (State Commission of the Lithuanian Language 2014) for the expansion of the Lithuanian language in information technologies.
Notably, those goals have been achieved with a great deal of success, yet some follow-up actions are needed, depending on the progress of the rapidly shifting language technologies on the global market and amidst society (Gaidienė and Tamulionienė 2022).

Lithuania continues to create and develop general resources needed for the purposes of building language technologies and devising their applications. There are a number of monolingual and bilingual corpora. The largest corpus of the Lithuanian language is the Corpus of the Contemporary Lithuanian Language. There are also several morphologically and syntactically annotated corpora (Morphologically Annotated Corpus, MATAS; Syntactically Annotated Treebank, ALKSNIS). There are also a number of parallel corpora (e.g., the LILA corpus). Most corpora are open access. Nonetheless, considering the demand for language data, corpus data needs to be augmented and new corpora (especially multilingual parallel data) should be developed to reflect as many areas of language use as possible.

Lithuania continues to develop digital dictionaries and databases. Users have free online access to the latest Dictionary of the Standard Lithuanian Language as well as other dictionaries, such as Dictionary of the Modern Lithuanian Language, Dictionary of the Lithuanian Language and the ongoing Database of Lithuanian Neologisms. Other resources such as the Dictionary of Synonyms, the Dictionary of Antonyms, and various bilingual dictionaries are also freely accessible online. Despite this abundance of digital dictionaries, considering the demands of language technologies and of the public, the dictionaries of synonyms, antonyms, and phraseology need to be updated, and the dictionaries of pronunciation and combinability (among others) digitalised.

Semantic networks and ontologies in Lithuania are few in number. There is the General Ontology of the Lithuanian Language, the open-access ontology of Lithuanian medical terms Snomed CT, and the service E-terms (Ontologies).
There are several Lithuanian wordnets that can be developed further, e.g., LitWorNet. However, the available ontologies and wordnets are inadequate and need to be expanded.

The ALPMAVIS machine translation system is freely available. The company Tilde offers MT systems based on the latest neural networks for free. Continued development of MT systems would require more bilingual parallel corpora as well as specialised text data to ensure better quality translation output.

The available Lithuanian Speech-to-text Transcription Service covers different domains: administrative, legal, medical, and standard colloquial. There are also services where speech recognition technology is used to voice-control computers, such as Browser (browsing voice control), Controller (computer voice control), and so on. Some of the services available in Lithuania feature speech synthesis technology, including Pronouncer, the Lithuanian Speech Synthesiser for the Blind, and so on. The Lithuanian language needs more annotated speech databases, which calls for concerted efforts to build speech databases for different fields, dialects, age groups, and sound environments (among others), and to make them available to the public.

The various projects that have been implemented in Lithuania have produced key open-source tools for the basic analysis of digital texts in the Lithuanian language, such as a segmenter, a lemmatiser, a morphological analyser, a part of speech tagger, and so on (State Commission of the Lithuanian Language 2020). In natural language generation, Lithuania is only taking its first steps.

3 Recommendations and Next Steps

Since 2012, significant progress has been made in developing various digital language resources and tools/services in Lithuania. Although Lithuanian is a language with a small number of speakers, it is progressing rapidly in the area of LT. As for digital resources and tools/services, there are still areas requiring further advances.
Though a number of Lithuanian language resources are already available, considering the demands of LTs and of the public, new monolingual and bilingual dictionaries as well as various lexicons still need to be developed or updated. Ontologies, wordnets and corpora have to be enlarged and expanded; multilingual parallel corpora required for MT need to be developed, etc. Concerning terminology, additional and updated compendia of terms are needed; the structure and technological solutions of the databases of terms vary, making it more difficult to utilise data for other technological solutions; there is also a shortage of open terminological data. Lithuanian is in need of digital grammars, annotated speech databases and other resources that would accelerate the progress of language technologies.

Lithuania requires an increase in digital language resources – corpora with texts and recordings, the development of LTs and the creation of public services based on them – so that no group of society or region experiences the digital divide and foreign languages can be integrated more easily into Lithuanian society. The Guidelines for the Development of the Lithuanian Language in the Digital Environment and the Progress of Language Technologies for 2021–2027 map out the essential tasks or challenges of Lithuanian LTs, what should be done in Lithuania in the near future, and in which directions to work: 1. To increase the competence of specialists working in the field of language technologies and to improve the ability of society as a whole to use the opportunities provided by language technologies. 2. To accumulate and enrich open, reliable, high-quality, reusable digital language resources and other digital language datasets. 3. To develop the language technology infrastructure, the application of language technologies in the public sector and public services, to create and improve publicly available information technology solutions and tools.
References

Gaidienė, Anželika and Aurelija Tamulionienė (2022). Deliverable D1.23 Report on the Lithuanian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-lithuanian.pdf.

State Commission of the Lithuanian Language (2014). The Guidelines for the Development of the Lithuanian Language Language Technologies for 2014–2020. http://www.vlkk.lt/kalbos-politika/lietuviu-kalbos-pletros-informacinese-technologijose-gaires/lietuviu-kalbos-pletros-informacinese-technologijose-2014-2020-m-gaires.

State Commission of the Lithuanian Language (2020). The Guidelines for the Development of the Lithuanian Language in the Digital Environment and the Progress of Language Technologies for 2021–2027. https://www.e-tar.lt/portal/lt/legalAct/71152ab00eee11ebb74de75171d26d52.

Vaišnienė, Daiva and Jolanta Zabarskaitė (2012). Lietuvių kalba skaitmeniniame amžiuje – The Lithuanian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/lithuanian.
Chapter 26
Language Report Luxembourgish

Dimitra Anastasiou

Abstract The Grand Duchy of Luxembourg is a small and multilingual country. The national language is Luxembourgish, and the legislative language is French. French, German and Luxembourgish are the three administrative and judicial languages. There are about 650,000 inhabitants and the majority of Luxembourgers speak four languages. As of March 2021, there were 59,000 Wikipedia articles written in Luxembourgish. Luxembourgish is very under-resourced when it comes to data resources and tools. This chapter provides a brief overview of the current level of support that Luxembourgish receives through technology (Anastasiou 2022).

1 The Luxembourgish Language

Luxembourg is a very small, but highly multilingual country. At various times, it was part of different European empires. Today, Luxembourg is the third European capital, along with Brussels and Strasbourg. It has the honour of hosting many of the EU’s important institutions, including the Publications Office of the EU, the Directorate-General for Translation, and the Translation Centre for the Bodies of the EU.

As for the population of Luxembourg, the Statistics portal of the Grand Duchy of Luxembourg (STATEC) published a demographic atlas in 2019. According to this atlas, between 1981 and 2018, the Luxembourgish population increased by about 65%, from 364,597 to 602,005. There are 12 officially declared towns and 102 municipalities. Luxembourg City has the highest percentage of foreigners with 70.8%.

The languages spoken vary depending on the social situation or region. The regions with the highest density of Luxembourgish speakers include the north (85%) and the east (81%) of the country. According to STATEC, three out of four residents work in a multilingual environment and 25% of the population has to speak four or more languages at work. French is the most spoken language at work (78%), followed by English (51%) and Luxembourgish (48%).
Luxembourgish is the most widely spoken language at home (53%), followed by French (32%) and Portuguese (19%). It should be noted that Luxembourgish is not an official language of the EU.

Dimitra Anastasiou
Luxembourg Institute of Science and Technology, Luxembourg, dimitra.anastasiou@list.lu

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_26

The Luxembourgish language is a Moselle-Franconian dialect, which was historically the main spoken language in Luxembourg up to the 19th century. On 24 February 1984, a law was enacted which made Luxembourgish an officially recognised language. In September 2018, the law was amended to add German Sign Language as an official language of Luxembourg. According to the provisions of the Languages Law of 1984, French, German or Luxembourgish may be used in administrative and judicial matters. Citizens can interact with the administration in any of these three languages, and officials must attempt ‘as far as possible’ to respond in the language used by the applicant. Legislative documents are written in French, and an important consequence of this on the judicial level is that only the text in French is deemed authentic for all levels of public administration.

In terms of vocabulary, Luxembourgish has a substantial number of loan words from French and German, but its morpho-syntax follows Germanic patterns (Gilles and Trouvain 2013). With the exception of the alveolo-palatal fricatives and the approximant [w], the consonant inventory of Luxembourgish is quite similar to Standard German. In addition, Luxembourgish has a set of eight diphthongs, which is considerably larger than for Standard German which has just one (Gilles and Trouvain 2013). Gilles (2019) examines the complex language situation of Luxembourg.
There is an officially recognised system with regards to the orthography of Luxembourgish, called “OLO” (ofizjel lezebuurjer ortografi); it can be found at the Zenter fir d’Lëtzebuerger Sprooch (ZLS)/Centre for the Luxembourgish Language.¹

2 Technologies and Resources for Luxembourgish

Luxembourgish-specific tools include a grapheme-to-phoneme converter based on 30,000 manually phonetically transcribed words, two spellcheckers, a PoS tagger (including a tokenizer and lemmatizer), a sentence splitter (Sirajzade and Schommer 2019), and a mobile application called Schnëssen.² This crowdsourcing app collects data on the present-day language situation of Luxembourgish; users can participate in a large set of audio recording tasks and in sociolinguistic surveys. A recently published tool, LëtzRead,³ is a free browser extension to integrate Luxembourgish-learning just by browsing the web (displaying certain words in Luxembourgish). Moreover, the library spaCy for advanced NLP has been trained for Luxembourgish, and the text-to-speech (TTS) tool MaryTTS has also been extended to support Luxembourgish.

¹ https://portal.education.lu/zls/ORTHOGRAFIE
² https://infolux.uni.lu/schnessen/
³ https://www.letzread.com

Luxembourgish data resources are mainly monolingual corpora, but there is also a Luxembourgish COVID glossary as well as an orthography trainer. The biggest text corpus in Luxembourgish contains 170 million words from a wide range of genres (Parliamentary debates, literature, transcripts of conversations, and media texts including articles from news outlets like RTL.lu, radio 100,7, eldoradio, and social media). All texts are annotated and orthographically normalised. This corpus is owned by the University of Luxembourg and is for internal use only. Many lexical Luxembourgish-specific resources, including corpora, dictionaries, material for phonetics, applications, etc.
are available at Infolux,⁴ which is the research portal about Luxembourgish developed and maintained by the Institute of the Luxembourgish Language and Literature at the University of Luxembourg. Another important resource is the Luxembourgish Online Dictionary (LOD),⁵ managed by the ZLS, a multilingual dictionary with 30,000 entries, in which Luxembourgish words are translated into German, French, English and Portuguese and illustrated by examples.

Among the recent research projects related to language technology (LT) including Luxembourgish are ENRICH4ALL, STRIPS, and Lingscape. ENRICH4ALL (E-goverNment [RI] CHatbot for ALL)⁶ is a CEF-funded project (06/21-05/23) coordinated by the Luxembourg Institute of Science and Technology, and its objective is to have a multilingual chatbot through integrating the CEF AT core service platform eTranslation into existing AI-based chatbot technology. The chatbot service will be deployed in public administration in Luxembourg, Denmark, and Romania. STRIPS⁷ was a three-year project (02/18-01/21), funded by the University of Luxembourg, that aimed to develop a semantic search toolbox for the retrieval of similar patterns in documents written in Luxembourgish. Lingscape⁸ is a mobile application researching linguistic landscapes all over the world by collecting photos of signs and lettering on an interactive map.

3 Recommendations and Next Steps

Digitalisation plays a big role in the government of Luxembourg; the Ministry for Digitalisation was created on 11th December 2018. Luxembourg’s national AI Vision initiative underlines the country’s unique ability to become a living lab of real-world AI applications.

Mainly because of the lack of underlying data resources, there are gaps in many aspects of Luxembourgish Language Technology. What is currently missing are available bilingual corpora, e.g., Luxembourgish – English, German, French.
The availability of such data sets would facilitate the development of many LT applications, such as named entity recognition, machine translation, virtual agents, recommender systems, etc. All of these applications are mainly statistically-based, so typically require a large amount of manually annotated training data. Regarding language models which can be used for natural language understanding and generation, the multilingual BERT covers many languages, including Luxembourgish; however, a BERT model trained specifically on large Luxembourgish data would yield better results. Another important aspect is that written Luxembourgish is not well standardised; while both German and French are intensively taught in schools, Luxembourgish, although the first language of around 60% of the population, forms part of the school curriculum only rudimentarily. This has an impact on the correctness of Luxembourgish in the development of LT applications. It is noteworthy that Luxembourgish has become more important in secondary schools with changes incorporated for the academic school years 2021-2022 and 2022-2023.

⁴ https://infolux.uni.lu
⁵ https://www.lod.lu
⁶ https://www.enrich4all.eu
⁷ https://acc.uni.lu/Research/strips/
⁸ https://lingscape.uni.lu

Luxembourg needs united forces for efficient collaboration. Since most people are multilingual, various stakeholders do not see the need to invest either time or budget in creating or sharing Luxembourgish resources. The EU, the government, research institutions, and language service providers have to work together to achieve the desired results. Important action points to improve the Luxembourgish LT landscape are: 1. reaching a status that Luxembourgish can be used in many administrative procedures; 2. raising awareness among various stakeholders in public and private sectors about the impact of Luxembourgish data; 3. advancing the standardisation, use and study of Luxembourgish, and 4. dedicated and collaborative national and EU funding programmes for both basic and applied research on Luxembourgish.
References

Anastasiou, Dimitra (2022). Deliverable D1.24 Report on the Luxembourgish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-luxembourgish.pdf.

Gilles, Peter (2019). “39. Komplexe Überdachung II: Luxemburg. Die Genese Einer Neuen Nationalsprache”. In: Deutsch. De Gruyter Mouton, pp. 1039–1060.

Gilles, Peter and Jürgen Trouvain (2013). “Luxembourgish”. In: Journal of the International Phonetic Association 43.1, pp. 67–74. DOI: 10.1017/S0025100312000278.

Sirajzade, Joshgun and Christoph Schommer (2019). “The LuNa Open Toolbox for the Luxembourgish Language”. In: Advances in Data Mining, Applications and Theoretical Aspects, Poster Proceedings 2019. Ibai publishing.

Chapter 27
Language Report Maltese

Michael Rosner and Claudia Borg

Abstract This chapter is a highly abbreviated version of an update (Rosner and C. Borg 2022) to the META-NET White Paper on Maltese (Rosner and Joachimsen 2012).
Like its predecessor, the update forms part of a series for all European Languages. Section 1 provides a brief description of the language, its national status, its general typology as a language, and its current usage in the digital sphere. Section 2 gives an overview of technologies and resources that are currently available. Finally, Section 3 frames the main shortcomings of Maltese language technology in terms of fragmentation, and offers some recommendations on how that might be reduced.

1 The Maltese Language

Maltese (il-Malti) is an official EU language and the national language of the Maltese archipelago. 97% of the Maltese population (ca. 400,000 people) consider it their mother tongue. It is also spoken by communities in Australia, Canada, the USA and the UK. Maltese is derived from late medieval Sicilian Arabic with Romance superstrata, and is often referred to as a mixed language due to the large number of loan words from Italian, English and French. It shares characteristics with other Semitic languages, making use of root-and-template morphology whereby various forms of the same lexeme are formed by interdigitating vowels between a fixed sequence of root consonants. The main distinguishing characteristics of Maltese are free word order, mixed morphology, an aspect-based temporal system, and the lack of a morphological infinitive. Unlike other Semitic languages, the Maltese alphabet is based on the Latin one with the addition of some letters with diacritic marks and digraphs (ċ, għ, ż, ġ, ħ). It contains 24 consonants and 6 vowels. According to Fabri (2011), the writing systems used for Maltese were somewhat ad hoc before 1920, but a degree of consistency among writers and in publications became a reality in the 1950s.

Within the digital sphere, there have always been several Maltese language newspapers. The broadcast media (radio and TV) are almost exclusively in Maltese. Since the previous report, there has been a general decline in hard-copy newspaper readership, as all the media are now available online and the majority of readers prefer the online version. Various online-only news websites have appeared, one of which (Newsbook) operates bilingually. The full Maltese character set is now universally used. Social media are extremely popular (97% of the population according to a 2021 survey). Facebook remains the most accessed, but there is a trend of increased usage of Instagram and YouTube. Unlike other EU countries, Twitter usage in Malta is remarkably low. The Maltese Wikipedia currently ranks at 204/325 (for comparison, English, Portuguese, Irish, Icelandic, Romansch rank at 1, 18, 93, 95, and 213, respectively). It contains nearly 4 million words distributed over 4,400 content pages (cf. 6.5 million pages for English). This compares to about 3,000 pages in 2011; there are ca. 19,000 registered users with only about 40 active users (making changes every 30 days or less). YouTube gives rise to localised content in many other countries but the local website still operates predominantly in English. In general, there tends to be a gap between social media content creators and non-creators. However, a renowned online page which has successfully bucked this trend is Kelma Kelma, which started in 2013 as a Facebook page and gathers many interesting original contributions by locals about the Maltese language. The top-level country domain for Malta, .mt, is administered by the Malta Internet Foundation and currently has ca. 17,000 domain names and subdomains, more than three times the figure in 2010.

Michael Rosner · Claudia Borg
University of Malta, Malta, mike.rosner@um.edu.mt, claudia.borg@um.edu.mt

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_27

2 Technologies and Resources for Maltese

Rosner and Joachimsen (2012) describe the main enablers and contributions to Maltese Language Technology up to ca. 2011. 2012 marked the public release of the MSE speech synthesiser (M. Borg et al.
2014), whilst Gatt and colleagues began revamping the University’s MLRS resource server (Rosner 2008; Gatt and Čéplö 2013) to include semi-automated data collection; a tagger; Korpus Malti v3.0 (2016), containing ca. 250 million annotated tokens; pattern-based search facilities; CLEM, a 1 million token Corpus of Learner English in Malta; Ġabra, an Open Lexicon for Maltese; and a Dictionary of Maltese Sign Language.

Most available corpora are monolingual written text. A few are spoken, and fewer still are multimodal such as MAMCO (Paggio et al. 2018). Many monolingual corpora form part of unannotated multilingual collections. Others are by-products of projects and annotated for MWE identification (PARSEME) or POS tagging (MLRS), anonymisation (MAPA), morphological analysis (UniMorph), NER (WikiAnn) etc. Bilingual/multilingual resources include the Laws of Malta, the Government Gazette, and the Acquis Communautaire.

Regarding tools and services, besides low-level text preprocessing for tokenisation, sentence and paragraph splitting and POS-tagging, the Ġabra dictionary has evolved into the online Dizzjunarju tal-Malti app. Machine translation for Maltese has improved not only through the availability of free tools like Google Translate, but also as a result of DGT’s eTranslation platform, whose increased takeup by public administration officials followed a series of workshops organised through ELRC. Much recent effort has been focused on dependency parsing and ASR. There is now a 2000-sentence Universal Dependency Treebank for Maltese which has supported experiments (Zammit et al. 2019) aimed at delivering a prototype dependency parser in 2022. Similarly, for speech technology, the locally funded MASRI project has delivered a fully annotated speech corpus (Hernandez Mena et al. 2020). Most resources mentioned above are freely available through MLRS and also EU platforms.
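As a rough illustration of the root-and-template morphology described in Section 1, which any morphological analyser for Maltese has to model, the following minimal sketch interdigitates a consonantal root with vowel templates. The function name `interdigitate` and the digit-based template notation are assumptions made for this example only; they are not taken from MLRS, UniMorph or any other resource named in this chapter. The forms derived from the root k-t-b are commonly cited textbook examples.

```python
def interdigitate(root, template):
    """Fill a template with root consonants.

    root:     sequence of root consonants, e.g. ("k", "t", "b"). Each item is
              a string, so a digraph such as "għ" can fill a single slot.
    template: string in which the digits 1-9 mark root-consonant slots
              (hypothetical notation); every other character is copied as-is.
    """
    out = []
    for ch in template:
        if ch in "123456789":
            # A digit n selects the n-th root consonant (1-based).
            out.append(root[int(ch) - 1])
        else:
            # Vowels and affix material come from the template itself.
            out.append(ch)
    return "".join(out)


root = ("k", "t", "b")  # root associated with writing
templates = {
    "1i2e3": "he wrote",
    "12ie3": "book",
    "1i22ie3": "writer",
    "mi12u3": "written",
}
for template, gloss in templates.items():
    print(interdigitate(root, template), "-", gloss)
```

The sketch covers only the templatic part of the lexicon. A realistic analyser would also need to handle affixation, irregular roots, and the concatenative morphology of Romance and English loans, which is what the term mixed morphology refers to.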
Currently, the main drivers for the evolution of future Maltese LT are targeted national initiatives, against a mixed background of projects at EU level. At the national level, the National AI Strategy (2019) focuses on the creation of an AI ecosystem infrastructure including tools to enable Maltese Language AI solutions, with funds earmarked for Maltese LT resources. The Malta Digital Innovation Authority (MDIA) is committed to supporting Maltese LT tools which will focus on morphological analysis, dependency parsing, named entity recognition and POS tagging. In 2019, the Government also committed funds to the development of a spell checker. However, there is no information with respect to the progress of this important initiative. Meanwhile, at the EU level, Maltese participation in a wide range of projects, actions and initiatives including ELE, ELG, ELRC, DARIAH, LCT, LT-Bridge, MAPA, Nexus Linguarum, and NLTP has ensured a level of Maltese presence on the European scene and also produced some specialised resources and tools.

3 Recommendations and Next Steps

Maltese LT is indeed alive, but manifests an important weakness: it is highly fragmented, in different ways: 1. between national efforts (small-scale, Maltese-focused) and international ones (large-scale, language-independent); 2. across resources/tools which are not necessarily compatible with each other; and 3. between users and developers of LTs (which reduces the perceived relevance of the technologies developed). To address these, further investigation of techniques like transfer learning is required, as seen, for example, in the MAPA project where general language models were successfully used for Maltese NER. Issue 2
can be reduced by insisting that such resources inhabit a framework which includes the necessary protocols to ensure interoperability, as seen in European infrastructures like ELG and NLTP; the latter, funded under CEF, aims to build a National Language Platform for Maltese integrating eTranslation services developed by the European Commission with fine-tuned local translation memories, and providing a central point for collecting different LT services together. Issue 3 is in part the result of insufficient involvement of the IT industry in LT. Although the IT industry is a major component of the local economy, the number of technical LT providers is very low. LT has a crucial role to play as a natural bridge linking IT, AI, communication and multilinguality. More needs to be done to support that role by encouraging participation in ELG by local IT players, among others. In 2016, the IT subcommittee of the Council for the Maltese Language had recognised the need for the long-term curation of resources, recommending the creation of a central repository, and efforts to involve more stakeholders concerning the availability and importance of resources. Some progress towards the realisation of these recommendations has been made but the effort needs a substantial and sustained coordinated investment across the different sectors involved.

References

Borg, Mark, Keith Bugeja, Colin Vella, Gordon Mangion, and Carmel Gafa (2014). “Preparation of a Free-Running Text Corpus for Maltese Concatenative Speech Synthesis”. In: Perspectives on Maltese Linguistics, Studia Typologica 14. Ed. by Albert Borg, Sandro Caruana, and Alexandra Vella, pp. 297–318.

Fabri, Ray (2011). “Maltese”. In: The Languages of the 25. Revue belge de Philologie et d’Histoire: RBPH. Ed. by Christian Delcourt and Piet van Sterkenburg. Amsterdam, Philadelphia: John Benjamins, pp. 17–28.

Gatt, Albert and Slavomír Čéplö (2013). “Digital corpora and other electronic resources for Maltese”. In: Proceedings of Corpus Linguistics. Ed. by Andrew Hardie and Robbie Love. University of Lancaster, UCREL.
Hernandez Mena, Carlos Daniel, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, and Ian Padovani (2020). “MASRI-HEADSET: A Maltese Corpus for Speech Recognition”. In: Proceedings of LREC 2020. Marseille, France: ELRA, pp. 6381–6388.

Paggio, Patrizia, Luke Galea, and Alexandra Vella (2018). Prosodic and gestural marking of complement fronting in Maltese. DOI: 10.5281/zenodo.1181805.

Rosner, Mike (2008). “Electronic Language Resources for Maltese”. In: Proceedings of Bremen Workshop on Maltese Linguistics. Springer.

Rosner, Mike and Claudia Borg (2022). Deliverable D1.25 Report on the Maltese Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-maltese.pdf.

Rosner, Mike and Jan Joachimsen (2012). Il-Lingwa Maltija Fl-Era Digitali – The Maltese Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/maltese.

Zammit, Andrei, Slavomír Čéplö, Lonneke van der Plas, and Claudia Borg (2019). A Dependency Parser for Maltese: Comparing the impact of transfer learning from Romance and Semitic Languages.
Chapter 28
Language Report Norwegian

Kristine Eide, Andre Kåsen, and Ingerid Løyning Dale

Abstract The use of Language Technology (LT) has greatly increased in Norway in recent years, as have the linguistic resources needed to make it work. In the past 10 years, Norwegian has adopted new or improved versions of machine translation, speech technology, chatbots and digital assistants, and machine learning has improved. Nevertheless, LT for both written standards of the Norwegian language – the majority Bokmål and minority Nynorsk – is nowhere near the same level as that of major European languages such as English, German, French and Spanish.

1 The Norwegian Language

Norwegian is a North Germanic, verb second, SVO language, spoken by about five million people in Norway, with some additional speakers in the Norwegian diaspora in the US and South America. Norway is a highly digitalised society.

There is great dialectal variation in Norway, and dialects have much higher prestige than in the other Scandinavian countries. Unlike other official European languages, there is no official standard for spoken Norwegian. People tend to speak their own dialect, and expect to be understood. This dialectal variation as well as the pitch accent found in most dialects present the biggest challenges for Norwegian speech technology. While there is no official standard for the spoken language, there are two official written Norwegian languages, Bokmål and Nynorsk. The minority language, Nynorsk, has about 500,000 speakers. All public bodies at state level must be able to correspond with citizens in both written standards, and even though the linguistic differences between Bokmål and Nynorsk are rather small, most types of language technology, such as machine translation, chatbots, spellcheckers, speech-to-text and text-to-speech, need separate tools for each language. Both standards reflect dialectal variation and allow for large formal morphological as well as orthographic variation.
With this variation, in combination with highly productive compounding, one single word can have a relatively high number of different spellings, which is a challenge for language technology (Smedt et al. 2012a,b).

Kristine Eide
The Language Council of Norway, Norway, kristine.eide@sprakradet.no
Andre Kåsen · Ingerid Løyning Dale
The National Library of Norway, Norway, andre.kasen@nb.no, ingerid.dale@nb.no
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_28

2 Technologies and Resources for Norwegian

The overall accessibility of Language Resources (LRs) for Bokmål is fairly good (Eide et al. 2022). Size and contemporaneity are in place for unstructured and semi-structured data. With good linguistic insight, one can build several specialised applications and services from openly available resources. In contrast, most types of LRs and LTs are either scarce or lacking for Nynorsk, although both speech and text data have been added to the largest, open repository for language data (Språkbanken) in recent years. Domain-specific data is severely limited for both Bokmål and Nynorsk. This is also true for the spoken language with all its dialectal variation.

Awareness of the differences between Nynorsk and Bokmål is low outside Norway’s borders. Norwegian can often be found in large, multilingual LR collections, and is available as a language choice also on large online platforms. However, both nationally and internationally developed tools and services cater first and foremost to the Bokmål written standard, or the Eastern Norwegian spoken dialect.

Speech technology development is challenged by the dialectal variation, in addition to the two orthographic standards that often allow for spelling variations. There are pronunciation lexicons which cover Bokmål and Nynorsk orthographic forms, and dialectal variation in pronunciation transcriptions is under development for both.
Some speech corpora with dialectal variation and a mix of read and spontaneous speech exist; some have transcriptions in both standards. These corpora have proven useful in improving speech recognition scores, but they are either not large enough, or somewhat lacking in domain, style, societal or situational variation to train a robust general purpose speech recognition system. Until recently, speech processing tools have been almost non-existent for Nynorsk. Those that are deemed usable, for either of the written standards, are in general proprietary and not freely available.

The largest text corpus is the Norwegian Colossal Corpus (NCC), which comprises a majority of all Norwegian published works (digitised using OCR), in addition to several other corpora, including Wikipedia, legislation, newspapers, books, web content, etc. The more recently published texts are still copyright-restricted, which limits the availability of the full corpus. The NCC has texts in both written languages, but the Nynorsk proportion is significantly smaller (5-10%). To remedy the scarcity of Nynorsk text data, the Language Bank at the National Library harvests available legal documents from municipalities where Nynorsk is the main language.

There are three large language models (NorELMo, NorBERT, and Notram) for Norwegian, which have been trained on (parts of) the NCC. These models can be fine-tuned with annotated corpora to develop task-specific tools. The language models’ embeddings are significantly less robust for Nynorsk than for Bokmål, again due to the disproportionate distribution of the languages in the training data.

Norway does not have access to the same amount of parallel data from the European institutions as the EU Member States. Even so, the ELRC initiative, in which Norway participates, has contributed to a growing awareness of the reusability of translations. Public administrations have contributed significant collections of Bokmål-English parallel data.
Valuable translation memories for developing MT systems from English to Bokmål have also come out of EU-funded research projects, e.g., PRINCIPLE. There are very few translation memories between Nynorsk and English, but it is possible to use Bokmål as a pivot language when developing MT technology for English-Nynorsk. The most prominent Nynorsk-Bokmål corpus is the manually corrected output of the Nynorsk press agency Nynorsk Pressekontor’s Apertium-based pipeline. Due to the similarities between Nynorsk and Bokmål, MT between the two written standards yields fairly good results.

The most important lexical resource for Norwegian is Norsk ordbank (the Norwegian Word Bank), a lexical database for Norwegian Bokmål and Nynorsk reflecting the official standard orthography as defined in the Norwegian dictionaries Bokmålsordboka and Nynorskordboka. Both are freely available for download and use in LT. While some domain-specific termbases exist for Bokmål, very few terms appear in their Nynorsk parallel, for instance in the national terminology portal Termportalen.

While there is no research programme in Norway aimed specifically at LT, several projects are in the process of filling some of the identified gaps in Norwegian LT and LRs. All major universities in Norway conduct research on LT and/or AI. Among the running projects, NorwAI aims at developing LTs for Scandinavian languages, including conversational search in natural language. SCRIBE seeks to develop an advanced speech-to-text transcription system for spontaneous speech. SANT (Sentiment Analysis for Norwegian Text) is to create open LRs for sentiment analysis for Norwegian. The public broadcasting corporation NRK and two private media groups contribute to the project. The Målfrid project collects all available digital texts from the public sector in Norway. An effort like this will ensure the availability of unstructured text data of a more recent date. CLEANUP aims to develop tools and techniques to automatically anonymise unstructured text data from an array of domains.
The project Universal Natural Language Understanding builds upon the UD standard for syntactic treebanks. The goal of the project is to convert the syntactic representation to a machine-readable semantic representation.

3 Recommendations and Next Steps

Even though the increase in data availability from 2018 to 2021 has been substantial, awareness of what language data is, what it can be used for and how it should be shared needs to be raised in all sectors. Due to the lack of Nynorsk data and modern LTs’ preference for big data, it must be a priority for decision makers to strengthen LT for the lesser-used language to avoid weakening its equal status. Public sectors must take on their new responsibility as required in the new language act and ensure parallel versions of Bokmål and Nynorsk LT in public procurement.

While there are certain synergies when developing parallel LT for both languages, there is also a need for parallel development of basic resources. The creation of missing tools and LRs must continue. There is a need for more text data for Nynorsk, more domain-specific data, and lexical/terminological resources, in particular for Nynorsk, as well as speech data that cover dialects and Nynorsk in addition to tools for semantic parsing. As for the quality of Norwegian LT, no overarching assessment has been made of the improvement we assume has taken place. In particular, downstream (user-driven) quality assessment of Norwegian Nynorsk and Bokmål LT is needed, to compare their quality, as well as dialect understanding.

Political action is necessary to open up international platforms to include the possibility of introducing LTs for smaller languages such as Norwegian Nynorsk, even when the large platforms themselves do not offer LT for these smaller languages.
There must be sufficient funding for research and development for Bokmål and Nynorsk LT, and the extra cost of developing parallel versions of Bokmål and Nynorsk LT should be considered when funding future research programmes. A dedicated programme for LT should be considered. Participation in international research projects and programmes that focus on LT should be encouraged.

References

Eide, Kristine, Andre Kåsen, and Ingerid Løyning Dale (2022). Deliverable D1.26 Report on the Norwegian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-norwegian.pdf.
Smedt, Koenraad De, Gunn Inger Lyse, Anje Müller Gjesdal, and Gyri S. Losnegaard (2012a). Norsk i den digitale tidsalderen (bokmålsversjon) – The Norwegian Language in the Digital Age (Bokmål Version). META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/norwegian-bokmaal.
Smedt, Koenraad De, Gunn Inger Lyse, Anje Müller Gjesdal, and Gyri S. Losnegaard (2012b). Norsk i den digitale tidsalderen (nynorskversjon) – The Norwegian Language in the Digital Age (Nynorsk Version). META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/norwegian-nynorsk.

Chapter 29
Language Report Polish

Maciej Ogrodniczuk, Piotr Pęzik, Marek Łaziński, and Marcin Miłkowski

Abstract The quality of language technology (LT) for Polish has greatly improved recently, influenced by three independent trends. The first one is Poland-specific and concerns the increase in national funding of both scientific and R&D projects, resulting in the construction of The National Corpus of Polish and the development of the CLARIN-PL and DARIAH-PL infrastructures. Two other trends are global: the development of language resources (LRs) and tools by private companies and, of course, the deep learning revolution which has led to enormous improvements in the state-of-the-art in all fields of language processing.

1 The Polish Language

Polish is a Slavic language of the Lechitic group, written in Latin script. It is the most spoken West Slavic language in the world. It is the official language of the Republic of Poland and, since 2004, the sixth largest official language of the European Union. It is spoken by 10% of EU citizens: about 40 million native speakers and 10 million second language speakers worldwide. In Poland, it is the common spoken and written language and the native language of the vast majority of the population.

Polish exhibits some specific characteristics (Pisarek 2007), which contribute to the richness of the language but present a challenge for computational processing. Word order is relatively free, which is used mostly to stress the importance of information rather than simply following grammatical rules.
Polish is relatively morphologically rich, which means that for roughly 180,000 base forms of words, almost 4 million inflected word forms exist. The inflection paradigms are complex, and even their exact number is a matter of dispute, as single exceptions might even be thought to create a new paradigm. Even native speakers have problems with properly inflecting many words, and most speakers of Polish as a second language never completely master the complexities of the inflectional system. Polish syntax is similar to that of its neighbouring Slavic languages, with a tendency towards analytic constructions seen in gender marking, forms of address and the use of infinitive and impersonal constructions.

Maciej Ogrodniczuk
Inst. of Comp. Science, Polish Academy of Sciences, Poland, maciej.ogrodniczuk@ipipan.waw.pl
Piotr Pęzik
University of Łódź, Poland, piotr.pezik@uni.lodz.pl
Marek Łaziński
University of Warsaw, Poland, m.lazinski@uw.edu.pl
Marcin Miłkowski
Inst. of Philosophy and Sociology, Polish Academy of Sciences, Poland, mmilkows@ifispan.edu.pl
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_29

Polish is currently highly influenced by English, one of the biggest sources of neologisms and calques, in particular in science and technology. The number of words loaned from English into Polish is, however, much lower than in Dutch or German because of the problem with inflecting some words as well as differences in pronunciation systems. Other recent changes are the appearance of more direct forms of address and simplification of the traditional inflection patterns.

2 Technologies and Resources for Polish

The level of technology support for Polish is similar to that of many other official EU languages, with several available resources¹ and basic text processing tools obtaining satisfactory accuracy scores.² The current landscape of Polish language processing has been shaped by the following developments (see Ogrodniczuk et al. 2022; Miłkowski 2012):

1. The construction of the National Corpus of Polish³ (NKJP; Przepiórkowski et al. 2012), a reference corpus containing over 1.5 billion words sampled from diverse sources such as classical literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived online texts, balanced with respect to gender, age and regional distribution of samples. The availability of the corpus, and particularly its manually annotated 1-million word sub-corpus, available under a CC-BY-licence, has boosted both research in the humanities and the development of many NLP tools. Since the completion of the NKJP in 2011, other reference corpora have been used to represent recent developments in Polish. The most significant examples are the MoncoPL monitoring corpus (Pęzik 2020) and the Corpus of the 2010s.⁴
2. The development of the CLARIN-PL⁵ and DARIAH-PL⁶ infrastructures led to the development of many resources and tools such as Słowosieć, the Polish WordNet⁷ (Dziob et al. 2019), Korpusomat, a corpus creation tool⁸ (Kieraś and Kobyliński 2021), COMBO, a neural tagger, lemmatiser and dependency parser⁹ (Klimaszewski and Wróblewska 2021), or SpokesPL, a search engine for Polish conversational data.¹⁰
3. External funding in the form of grants, both European (Horizon 2020, Connecting Europe Facility) and national, distributed by the National Science Centre and National Centre for Research and Development, has allowed many research institutions and companies to increase the budgets of research projects by an order of magnitude, and thus react to commercial demands for speech recognition or dialogue systems. As a result, their NLP products are characterised by state-of-the-art performance.
4. The PolEval evaluation campaign for NLP tools for Polish¹¹ started in 2017 as a practical exercise intended to advance the state-of-the-art with a series of tasks in which submitted tools compete against one another. This contest has brought the NLP community together and resulted in the development, enhancement and public release of reference datasets for tasks such as sentiment analysis, speech recognition and machine translation.
5. The latest Transformer models (HerBERT¹² and plT5¹³) trained by researchers from the company Allegro and the Institute of Computer Science of the Polish Academy of Sciences, based on several large corpora of Polish, including NKJP. Making these models freely available for the community has facilitated enormous progress.
6. Increased accessibility of multimodal spoken corpora and speech databases such as a large annotated corpus of phone-based customer support dialogues,¹⁴ which boosts the development of goal-oriented chatbots and helps Polish ASR engines to be on par with solutions by global service providers.

Nonetheless, many complex and labour-intensive resources such as audio-video corpora and corpora with discourse structure and semantic annotations are practically unavailable.

¹ http://clip.ipipan.waw.pl/LRT
² http://clip.ipipan.waw.pl/benchmarks
³ http://nkjp.pl
⁴ http://korpus-dekady.ipipan.waw.pl
⁵ https://clarin-pl.eu, http://clarin.biz
⁶ https://dariah.pl, https://lab.dariah.pl
⁷ http://plwordnet.pwr.wroc.pl/wordnet/

3 Recommendations and Next Steps

The national Polish AI strategy (Council of Ministers 2020) mentions the development of LT as a short-term goal, supported by national grants for projects related to Polish language processing based on world-leading algorithms.
Notably, the document mentions the importance of language data: the need for the elimination of legal barriers to the exploration of language text corpora under copyright protection and awarding projects that make architecture, trained models and training datasets available for common use. This assumption is in line with findings from the Polish NLP community as well as international trends. What needs to be added to this plan is access to common (national or European) computing power to boost the development and optimization of standard language models and secure stable funding for crucial LRs such as the National Corpus of Polish or the Great Dictionary of Polish.

⁸ https://korpusomat.pl
⁹ https://github.com/360er0/COMBO
¹⁰ http://spokes.clarin-pl.eu
¹¹ http://poleval.pl
¹² https://huggingface.co/allegro/herbert-large-cased
¹³ https://huggingface.co/allegro/plt5-large
¹⁴ http://pelcra.pl/new/diabiz

However, there is also a new dimension of this plan, created by the Russian invasion of Ukraine. With 3 million Ukrainian refugees in Poland in 2022, bilingual public administration has become an important new role for the Polish LT community, and is boosting the development of bilingual Polish-Ukrainian resources and tools. On the European level, this new situation calls for the embracing of Ukrainian as one of the languages officially supported by the EU.

References

Council of Ministers (2020). Polityka dla rozwoju sztucznej inteligencji w Polsce od roku 2020 – The Policy for the development of AI in Poland from 2020. https://www.gov.pl/web/ai/polityka-dla-rozwoju-sztucznej-inteligencji-w-polsce-od-roku-2020.
Dziob, Agnieszka, Maciej Piasecki, and Ewa Rudnicka (2019). “plWordNet 4.1 – a Linguistically Motivated, Corpus-based Bilingual Resource”. In: Proceedings of the 10th Global Wordnet Conference. Global Wordnet Association, pp. 353–362.
Kieraś, Witold and Łukasz Kobyliński (2021). “Korpusomat – stan obecny i przyszłość projektu”. In: Język Polski CI.2, pp. 49–58. DOI: 10.31286/JP.101.2.4.
Klimaszewski, Mateusz and Alina Wróblewska (2021).
“COMBO: State-of-the-Art Morphosyntactic Analysis”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. ACL, pp. 50–62.
Miłkowski, Marcin (2012). Język polski w erze cyfrowej – The Polish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/polish.
Ogrodniczuk, Maciej, Piotr Pęzik, Marek Łaziński, and Marcin Miłkowski (2022). Deliverable D1.27 Report on the Polish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-polish.pdf.
Pęzik, Piotr (2020). “Budowa i zastosowania korpusu monitorującego MoncoPL”. In: Forum Lingwistyczne 7, pp. 133–150. DOI: 10.31261/FL.2020.07.11.
Pisarek, Walery (2007). The Polish Language. Warsaw: The Council for the Polish Language.
Przepiórkowski, Adam, Mirosław Bańko, Rafał L. Górski, and Barbara Lewandowska-Tomaszczyk, eds. (2012). Narodowy Korpus Języka Polskiego. Warsaw: PWN. http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf.
Chapter 30
Language Report Portuguese

António Branco, Sara Grilo, and João Silva

Abstract This chapter provides an analysis of the level of technological preparation of the Portuguese language for the digital age, as well as the actions necessary for the consolidation of Portuguese as a language of international communication with global projection.

1 The Portuguese Language

Portuguese is the fifth most spoken language in the world, with around 280 million speakers (Instituto Camões 2021), of which 250 million are native speakers, spread over four continents: Africa, America, Asia and Europe. It is the official language of Angola, Brazil, Cape Verde, East Timor, Guinea-Bissau, Macau, Mozambique, Portugal, S. Tomé and Príncipe, and Equatorial Guinea. All variants of Portuguese across the different continents are mutually intelligible. Portuguese is an official language of the European Union, the Mercosul and the African Union. With the advancement of literacy in the African countries and in East Timor, Portuguese is confirming its growth potential in terms of the number of speakers. This chapter is partly based on Branco et al. (2022) and Branco et al. (2012).

Portuguese has a strong presence in social networks. For instance, a study of 100 million tweets reveals that Portuguese is the sixth most spoken language on Twitter, after English, Japanese, Spanish, Korean and Arabic.¹

Portuguese is a Romance language, with most of its lexicon being derived from Latin. To a speaker not knowing Portuguese, the European variant of this language may often sound like a sequence of consonants. This is due to the fact that, unlike in the other Romance languages, Portuguese unstressed vowels are often weakened or not pronounced at all.
This vowel weakening is a late change in European Portuguese and it did not affect the variety spoken in Brazil, which in this respect is closer to the Portuguese which was spoken some centuries ago.

António Branco · Sara Grilo · João Silva
University of Lisbon, Portugal, antonio.branco@di.fc.ul.pt, sara.grilo@di.fc.ul.pt, joao.silva@di.fc.ul.pt
¹ https://www.vicinitas.io/blog/twitter-social-media-strategy-2018-research-100-million-tweets
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_30

The basic word order in Portuguese is subject-verb-object (SVO) (ele leu o livro / he read the book). Portuguese is a null subject language, where the subject of the sentence may not be realised by a phonetically overt expression (_ li o livro / [I] read the book). The inflection paradigm in Portuguese is very rich, especially in verbs. A verb with a regular inflection paradigm will have different markers for aspect, tense, mood, person, number or polarity, culminating in more than 160 different inflected verb forms, encompassing both simple and complex ones.

The advent of the digital age is a major challenge for the Portuguese language and its speakers. The scientific study and technological development of the Portuguese language, making it fit for the digital age, is thus an endeavour of utmost importance in order to ensure that its speakers can participate in the information society.

2 Technologies and Resources for Portuguese

The activity in Language Technology (LT) for the Portuguese language can be traced back to projects, programmes and initiatives carried out in the last decades. One of the first important programmes in this area was EUROTRA, an ambitious Machine Translation project established and funded by the European Commission from the late 1970s until 1994. The participation of Portugal in this project since 1986 was undertaken by ILTEC, specifically created for this purpose and involving mostly researchers from the Universities of Lisbon and Porto.
Another key European project was LE-PAROLE, developed in the late 1990s, with the participation of CLUL and INESC-ID. Its main achievement was the building of corpora and lexicons according to integrated models of composition and materials description. Part of this corpus was enriched and enlarged in the national project TagShare, conducted at the University of Lisbon, in the Department of Informatics (NLX Group) and in the Center of Linguistics (CLUL), in 2005. This project enabled the development of a set of language resources and software tools to support the computational processing of Portuguese. The result was a 1 million word corpus linguistically annotated and fully verified by experts, the CINTIL corpus, and a whole range of processing tools for tokenisation, morphosyntactic category (POS) tagging, inflection analysis, lemmatisation, multiword lexeme recognition, named entity recognition, etc., in the LX-* collection.

On the basis of these tools and resources, top-quality, manually verified treebanks, with syntactic and semantic grammatical analysis, and the companion computational grammar and parsers, have also been developed for the CINTIL-* and LX-* collections, in the national project SemanticShare at the Department of Informatics (NLX Group) of the University of Lisbon. The Corpus de Extractos de Textos Electrónicos MCT/Público (CETEMPúblico), released in 2000, in turn, is a corpus of about 180 million words from excerpts of a Portuguese daily newspaper.

In the field of speech processing, it is worth noting the TECNOVOZ project, which started in 2006. This project was directed by INESC-ID and one of its major goals was to foster technology transfer to the business sector, having as partners companies like the public television RTP.

On the industry side, an important contribution to fostering an LT industry in Portugal was the establishment of the international Microsoft Language Development Center, near Lisbon, which lasted from 2005 to 2015. More recently, the two US-based startups DefinedCrowd and Unbabel have a significant presence in Portugal.
In Brazil, relevant efforts in LT for Portuguese have also been undertaken. To mention just a few illustrative examples, in the early 1990s, under the DIRECT project, the Bank of Portuguese was created at the Pontifical Catholic University of São Paulo. Since its inception, the Bank of Portuguese has been a source of data for corpus-based studies for several projects.

Also worth mentioning is the Summ-it corpus, built to support the study of summarisation along with the phenomena of anaphoric and rhetorical relations in Portuguese. This resource was developed under the PLN-BR project, by the Núcleo Interinstitucional de Lingüística Computacional (NILC), driven by the University of São Paulo and gathering researchers from seven other Brazilian institutions.

Alongside these programmes and projects both in Brazil and in Portugal, it is worth underlining PROPOR as the key focal initiative of the research community working on Portuguese. PROPOR is the major international scientific conference devoted to the computational processing of Portuguese. The location of this biennial conference has been alternating between the two countries since 1993.

A landmark in the landscape of language technology for Portuguese is the white paper The Portuguese Language in the Digital Age (Branco et al. 2012), produced in the scope of the European META-NET initiative.

As an outcome of the European CEF project ELRI, the Repository for Translation Resources (eTradução)² is available; it has been maintained since 2019 by AMA, the government agency for the digital transformation of the Portuguese public administration. Several of its data sets are also distributed through ELRC-SHARE.

The major AI initiative specifically addressing the field of LT is the implementation (2017-2021) and operation (from 2021 onwards) of the PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language.³

3 Recommendations and Next Steps

The development of technologies for Portuguese has progressed over the past decade.
However, given that progress in LT has accelerated, the level of competitive technological preparation of Portuguese for the digital age has not changed significantly over this period when taking the best prepared language, English, as a reference.

² https://etraducao.gov.pt/pt-pt/
³ https://portulanclarin.net

Some progress has been made in the area of text analytics and machine translation, thanks to further data collection and corpus creation through a number of initiatives funded by EU projects and national entities. Fundamental building blocks such as syntactic analysis tools have progressed significantly, but the underlying datasets still need to be enlarged to build more robust, reliable and application-ready systems.

There are still a large number of fundamental tools and datasets not yet available for Portuguese. While steps have been made towards speech corpus development, there is still no state-of-the-art automatic speech recognition system available for Portuguese as open-source software.

From a natural language understanding perspective, there is a lack of semantic-based datasets and tools. Critically, there is a severe lack of freely available large language models, also known as foundation models, based on deep language learning with artificial neural network technology. Such language models to support deep neural processing, including the development of large multimodal language models involving Portuguese, are thus very much needed, especially those openly available to be used in research and in innovation.

The above considerations on the availability of data and tools for Portuguese clearly indicate the urgent need to direct substantially more funding and efforts to the preparation of Portuguese for the digital age. The scientific study and technological development of the Portuguese language is a crucial endeavour for its promotion, in order to ensure that its speakers can participate in the information society.

References

Branco, António, Sara Grilo, and João Silva (2022). Deliverable D1.28 Report on the Portuguese Language. European Language Equality (ELE); EU project no.
LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-portuguese.pdf.
Branco, António, Amália Mendes, Sílvia Pereira, Paulo Henriques, Thomas Pellegrini, Hugo Meinedo, Isabel Trancoso, Paulo Quaresma, Vera Lúcia Strube de Lima, and Fernanda Bacelar (2012). A língua portuguesa na era digital – The Portuguese Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/portuguese.
Instituto Camões (2021). Português no Mundo. https://pt.institutocamoes-praga.cz/centro-de-lingua-portuguesa-instituto-camoes/portugues-no-mundo/.

Chapter 31
Language Report Romanian

Vasile Păiș and Dan Tufiș

Abstract Since the previous META-NET report, there have been significant improvements (e.g., creation of a large Romanian national corpus, steady progress in written language technologies, LT, construction of a national LT portal for the Romanian language etc.), but things are far from what they should be.
Support for LT and AI through national programmes is still modest, although there are signs of a more active involvement of policy makers in the strategic planning and funding programmes in this domain. Continued research is required to produce large language models able to capture the characteristics of the Romanian language. Large language resources need to be created so that AI systems are able to learn from them.

1 The Romanian Language

The Romanian language, which is an official language of the EU, is also the official language of Romania. It is spoken by 19.4 million people in Romania and by about 3.5 million people in Moldova, where it is unofficially known as a Moldavian language. Speakers of Romanian in other European countries (Albania, Bulgaria, Croatia, Greece, Hungary, North Macedonia, Serbia, Ukraine and others) and communities of immigrants in Australia, Canada, Israel, Latin America, Turkey, USA and Asian countries total around 4 million Romanian native speakers.¹

Romanian is an official language in the Autonomous Province of Vojvodina in Serbia. It is one of the languages spoken in the autonomous Mount Athos in Greece and a recognised minority language in Ukraine (Trandabăț et al. 2012). Romanian has four dialects: Daco–Romanian, Aromanian (about 500,000 speakers in Albania, Bulgaria, Greece and North Macedonia), Istro–Romanian (15,000 speakers in two small areas in the Istrian Peninsula, Croatia) and Megleno–Romanian (about 5,000 speakers in Greece and North Macedonia).

Vasile Păiș · Dan Tufiș
Romanian Academy, Romania, vasile@racai.ro, tufis@racai.ro
¹ https://en.wikipedia.org/wiki/Romanian_diaspora
© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_31

The Romanian alphabet is based on the Latin script with five additional letters using diacritics (Ă, Â, Î, Ș, Ț
and ă, â, î, ș, ț). Many digital texts are written without diacritics. Quotation marks are a low double mark on the left and a high double mark on the right („ and ”, respectively). However, especially in digital texts, the ASCII quotation mark character may be encountered. Dialogues are introduced using quotation dashes ( – ). The Oxford comma, used in certain English language documents, is considered incorrect in the Romanian language. In titles, only the first letter of the first word is capitalised, with the rest of the title making use of regular sentence capitalisation. Names of months and days, as well as adjectives derived from proper names, are not capitalised, e.g., februarie (February), vineri (Friday), italian (Italian).

2 Technologies and Resources for Romanian

The availability of language-specific data has a direct impact on the quality of language-specific or cross-language tools. The availability of large pre-trained multilingual models that include representations for the Romanian language, such as XLM-RoBERTa or mBERT, somewhat alleviates the problem of constructing compute-intensive contextual word representations. Nevertheless, monolingual representations such as RoBERT (Masala et al. 2020), DistilRoBERT (Avram et al. 2022), and ALR-BERT have led to increased performance of monolingual tools (Tufiș 2022). Static representations, such as CoRoLa-based word embeddings (Păiș and Tufiș 2018), are still used due to their lower compute requirements (Păiș and Tufiș 2022).

Word representations form only the basis of advanced language tools. In addition to language models, task-specific corpora are required to train and evaluate the tools. The vast majority of Romanian resources are multilingual, with some being bilingual, and only a few monolingual corpora exist. The available Romanian corpora amount to around 10% of those available for English. Available speech corpora with Romanian audio represent 5% of available English resources and about 50% when compared to neighbouring EU countries.

In spite of the reduced number of available language resources, applications for different NLP tasks exist for Romanian.
These include lemmatisation, part-of-speech tagging, dependency parsing, named entity recognition, syllabification, speech recognition, text-to-speech, machine translation, punctuation restoration, terminology annotation, and text classification. The number of identified tools represents only 15% of the tools available for English.

Even if, in general, all LT fields are covered, certain fields are less developed or considered for the Romanian language by researchers and developers: language generation, dialogue management, multimodal corpus building, and social media aspects (including micro-blogging, social networks, and meme interpretation). Speech processing is much less mature than LT for written text, both in terms of corpora and instruments. Even though there has been much work on processing general Romanian language, more focus is needed for creating domain-specific resources and tools (especially for the biomedical, legal, economy and social media domains).

The Representative Corpus of Contemporary Romanian Language (CoRoLa)² (Tufiș et al. 2019) was created by the Romanian Academy as the largest IPR-cleared reference corpus of written and spoken Romanian. Texts cover four domains (arts and culture, science, society, nature), reflecting six styles (imaginative, journalistic, scientific, legal, administrative, memoirs) and different document types.

One of the largest Romanian speech corpora is RSC (Georgescu et al. 2020), containing 100 hours of audio files. The multilingual speech corpus VoxPopuli contains 83 hours of Romanian language speech. The speech component of the CoRoLa corpus (comprised of multiple smaller corpora together with additional audio files specifically obtained for inclusion in CoRoLa) totals 103 hours aligned with the text.

A number of Romanian LTs, covering different fields of research, are available within the RELATE³ (Păiș et al. 2020) portal. The platform covers results derived from more than six national and international research projects.
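The observation in Section 1 that many digital Romanian texts are written without diacritics has a practical consequence for any of the tools listed above: the same word may arrive in several spellings. The following is only a minimal sketch of one way to cope with this at lookup time. It assumes (the report does not say so) that legacy cedilla letters (ş, ţ) should be mapped to the standard comma-below letters (ș, ț) and that a diacritic-free key is acceptable for matching. All function names are hypothetical and are not taken from RELATE or any other tool mentioned in this chapter.

```python
import unicodedata

# Legacy cedilla code points mapped to the standard Romanian comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # Ş -> Ș
    "\u015f": "\u0219",  # ş -> ș
    "\u0162": "\u021a",  # Ţ -> Ț
    "\u0163": "\u021b",  # ţ -> ț
})


def normalise_romanian(text: str) -> str:
    """Return NFC text with cedilla variants replaced by comma-below letters."""
    return unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)


def strip_diacritics(text: str) -> str:
    """Return a diacritic-free key, e.g. for matching text typed without diacritics."""
    decomposed = unicodedata.normalize("NFD", normalise_romanian(text))
    # Drop combining marks (breve, circumflex, comma below) and keep base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_word_ignoring_diacritics(a: str, b: str) -> bool:
    """Compare two words case-insensitively after removing diacritics."""
    return strip_diacritics(a).casefold() == strip_diacritics(b).casefold()
```

Folding diacritics in this way makes words that differ only in their diacritics collapse to the same key, so the sketch is suitable for tolerant lookup only; restoring diacritics is a separate task that needs a language model or lexicon.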
3 Recommendations and Next Steps

Task-specific Romanian corpora (including multi-modal) are needed to enable new and complex language processing operations. In turn, these must lead to the development of new tools, finally working towards digital language equality. This requires dedicated long-term support at the national, regional and European levels. Furthermore, AI research should follow a human-centered approach. Biased or potentially harmful data in resources should be detected and addressed. This, together with following lawful and ethical principles, as well as robust implementations, should enable building Trustworthy AI (TAI)⁴ applications for the Romanian language.

AI is an area of strategic importance and a key driver of economic development, providing solutions to many societal challenges. In this context, many EU countries prepared national plans for AI (e.g., the Spanish National AI Strategy⁵ or the French AI for Humanity⁶). In Romania, however, there is currently no such national plan for AI. A strategy for AI⁷ has been proposed recently within the RePatriot⁸ project, but it was not adopted at national level. Furthermore, the strategy is not very concrete: it centres mostly on which Romanian sectors would benefit most from AI, and which steps are important for the process of developing Romanian AI initiatives, but it does not include any plans about how to accomplish these actions.

² https://corola.racai.ro
³ https://relate.racai.ro
⁴ https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1
⁵ https://www.lamoncloa.gob.es/presidente/actividades/Documents/2020/021220-ENIA.pdf
⁶ https://www.aiforhumanity.fr/en/
⁷ https://www.slideshare.net/MonicaIon1/strategy-romania-in-the-era-of-artificial-intelligence-rblrepatriot
⁸ https://repatriot.ro

References

Avram, Andrei-Marius, Darius Catrina, Dumitru-Clementin Cercel, Mihai Dascalu, Traian Rebedea, Vasile Pais, and Dan Tufis (2022).
“Distilling the Knowledge of Romanian BERTs Using Multiple Teachers”. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille: European Language Resources Association, pp. 374–384. https://aclanthology.org/2022.lrec-1.39.

Georgescu, Alexandru-Lucian, Horia Cucu, Andi Buzo, and Corneliu Burileanu (2020). “RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition”. In: Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille: European Language Resources Association, pp. 6606–6612. https://aclanthology.org/2020.lrec-1.814.

Masala, Mihai, Stefan Ruseti, and Mihai Dascalu (2020). “RoBERT – A Romanian BERT Model”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain: International Committee on Computational Linguistics, pp. 6626–6637. DOI: 10.18653/v1/2020.coling-main.581. https://aclanthology.org/2020.coling-main.581.

Păiș, Vasile, Radu Ion, and Dan Tufiș (2020). “A Processing Platform Relating Data and Tools for Romanian Language”. In: Proceedings of the 1st International Workshop on Language Technology Platforms. Marseille: European Language Resources Association, pp. 81–88. https://aclanthology.org/2020.iwltp-1.13.

Păiș, Vasile and Dan Tufiș (2018). “Computing distributed representations of words using the CoRoLa corpus”. In: Proceedings of the Romanian Academy Series A 19.2, pp. 185–191.

Păiș, Vasile and Dan Tufiș (2022). Deliverable D1.29 Report on the Romanian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-romanian.pdf.

Trandabăț, Diana, Elena Irimia, Verginica Barbu Mititelu, Dan Cristea, and Dan Tufiș (2012). Limba română în era digitală – The Romanian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/romanian.
Tufiș, Dan (2022). “Romanian Language Technology – a view from an academic perspective”. In: International Journal of Computers Communications & Control 17.1. DOI: 10.15837/ijccc.2022.1.4641.

Tufiș, Dan, Verginica Barbu Mititelu, Elena Irimia, Vasile Păiș, Radu Ion, Nils Diewald, Maria Mitrofan, and Mihaela Onofrei (2019). “Little Strokes Fell Great Oaks. Creating CoRoLa, The Reference Corpus of Contemporary Romanian”. In: Revue roumaine de linguistique 64.3, pp. 227–240.

Chapter 32
Language Report Serbian

Cvetana Krstev and Ranka Stanković

Abstract Standard Serbian is the national language of Serbs and the official language in the Republic of Serbia. Although statistics show that the population of Serbia is well equipped to use IT, and although some important language resources and tools have been developed for Serbian, the language still lags significantly behind most European languages in terms of Language Technology (LT). This shows that a stable, dedicated and long-term investment in the development of LT for Serbian through national and international scientific and development projects is needed.
1 The Serbian Language

Standard Serbian is the national language of Serbs and the official language in the Republic of Serbia. Formed on the basis of Ekavian and Ijekavian Neo-Štokavian South Slavic Dialects, its form was determined by the reformer of the written language of Serbs Vuk Karadžić, who also reformed both the Cyrillic alphabet and orthography. In the 20th century, in the federal state of Yugoslavia, this language was officially encompassed by Serbo-Croatian, a name that implied a linguistic unity with Croats (and later with other nations whose languages were based on Neo-Štokavian dialects). In the 1990s, in Serbia the name Serbo-Croatian was replaced by the name Serbian. The Constitution of the Republic of Serbia from 2006 stipulates: “The Serbian language and the Cyrillic alphabet shall be in official use in the Republic of Serbia”. However, the Latin alphabet is also in widespread use.

According to the 2011 census data published by the Statistical Office of the Republic of Serbia, the population of Serbia is 7,186,862, and Serbian is the mother tongue of 88.1% of the population. To this number, one should add the ethnic Serb population in other parts of former Yugoslavia (a number not easy to determine). The Serbian diaspora lives primarily in a number of countries of Central and Western Europe, in the US, Canada and Australia, and their knowledge of Serbian is mainly determined by the generation of immigrants they belong to.

Cvetana Krstev · Ranka Stanković, University of Belgrade, Serbia, cvetana@matf.bg.ac.rs, ranka@rgf.bg.ac.rs
© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_32

The Statistical Office also collects data about the use of ICT in Serbia each year (Kovačević and Rajčević 2021).
According to their data for 2021, published on 22 October 2021, the percentage of citizens between 16 and 74 years of age that used a computer regularly was 74.8%, while the internet was used regularly by 81.2% of citizens. Additionally, 76.7% of households possessed a computer in 2021, while 81.5% of all households had an internet connection. The internet was used for private purposes mostly for communicating with others, reading online news and magazines, and using social media. As for e-government, this study showed that 40% of internet users used online services instead of personally visiting public institutions and administrative bodies.

2 Technologies and Resources for Serbian

The variety of corpora as well as their availability have improved significantly in the last 10 years (Vitas et al. 2012; Krstev and Stanković 2022). Two corpora of contemporary Serbian are available online. The first, published in 2013 (SrpKor2013), contains more than 120 million words, while the second, published in 2021 (SrpKor2021), contains more than 600 million words. Both are annotated with part-of-speech and lemmas and contain a variety of text types, with literary text being particularly well represented. Along with the general corpus SrpKor2021, several large collections of domain texts were prepared that can be used within the same platform. Additionally, many text collections exist that contain data obtained from various news portals or by web crawling, some of which are represented as raw text, others are annotated with POS and lemmas, while a few are fully morphologically and/or NE-annotated. Some collections were prepared for a special purpose, such as sentiment analysis, text similarity and text paraphrasing analysis.

There are several bilingual, sentence-aligned corpora that include Serbian as one of the languages, with the other being English, French or German; texts are from various domains, including a large portion of literary texts. The digital library Bibliša supports online search of these corpora.
Besides, there are numerous multilingual collections that include Serbian, with the majority of them being comparable.

By far the most comprehensive of the many lexical resources for Serbian is Serbian Morphological Dictionaries (SrpMD, Krstev 2008), covering both simple and multi-word units, general lexica, proper names, and domain-specific lexica. It covers morphological descriptions and, to a certain extent, semantics, usage, pronunciation, etymology, domains, derivational relations, etc., and it is being continuously updated. These dictionaries are open for search through the platform Leximirka at the site of JeRTeh,¹ while its largest part with full morphological description and restricted additional information is made public. There are also several monolingual and bilingual inflectional lexicons based on MULTEXT-East.

¹ The Association of Language resources and tools, http://jerteh.rs

Significant results have been achieved in the development of terminology resources for Serbian including simple- and multi-word terms from a wide range of domains. Part of these resources are bilingual (Serbian/English) or multilingual, and some of them can be searched on the platform Termi at JeRTeh. Several special purpose mono-, bi- and multilingual lexical resources have been built that include Serbian, primarily for sentiment analysis and hate-speech detection.

The Serbian WordNet, aligned to the Princeton WordNet 3.0, and SentiWordNet, is still underdeveloped. Formal domain ontologies for Serbian are rare.

There are a few language models and grammars for Serbian including Dict2Vec, an embedding model adapted for Serbian using the Serbo-Croatian Wikipedia and Wiktionary synonym pairs, and BERTić, a Transformer model pre-trained on eight billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin domains. As for MT, we can only mention rudimentary attempts done in the scope of scientific research and products created by big technology enterprises.
Several taggers and/or lemmatisers for Serbian have been developed based on TreeTagger, spaCy, NLTK and others. Many of them are part of NLP suites that cover various tasks. Numerous local grammars (e.g., for compound verb forms, nominal phrases etc.) have been developed for Serbian texts using the Unitex/Gramlab corpus processing suite and SrpMD. Parsing of Serbian is possible online using Universal Dependency and CLASSLA pipelines. The first NER system was the rule- and lexicon-based system SrpNER that tags fine-grained entities. It was used to produce training data for NER systems developed using various ML methods and tools. A web service was developed for the morphological and semantic query expansion that was incorporated into several online applications, such as the Bibliša digital library.

A substantial breakthrough in the area of speech processing was made by the AlphaNum company, a spin-off of the University of Novi Sad. They offer a large variety of commercial products and services: speech technologies, voice assistants, products for the disabled, etc.

The document Strategy for the Development of Artificial Intelligence in the Republic of Serbia for the period 2020-2025 was adopted by the Government in 2019. As a result, the Institute for Artificial Intelligence was founded, with NLP as one of its research areas. However, there is still no LT-related funding in Serbia.

The strongest NLP/LT group consists of researchers from the University of Belgrade and JeRTeh. They started to work more than 40 years ago under the guidance of Prof. Duško Vitas, and they have produced by far the most resources and tools. The strongest group for speech technologies comes from the University of Novi Sad. In recent years, new NLP/LT research groups affiliated with different universities and research centres have emerged. Outside academia there are few LT providers.

3 Recommendations and Next Steps

According to recent statistical data provided by official authorities, Serbian citizens are equipped to live in the digital world and are ready to use LT.
However, this overview of LT for Serbian shows that some resources for Serbian are rich and diverse, while some types of resources are still rare, and some practically do not exist. This analysis of the availability of resources, tools and services shows that Serbian is only weakly or fragmentarily supported. It also reveals that although languages close to Serbian (geographically, historically and by the number of speakers) such as Bulgarian, Slovene and Croatian lag behind English, they have better LT support than Serbian. The policies adopted in these countries to promote and support LT can serve as a guideline on how to improve LT for Serbian.

Despite the valuable achievements documented here, Serbian is still a disadvantaged language, with the risk that in a few years Serbian speakers will not benefit from the AI/LT revolution. To prevent this from happening, there is a need for more dedicated LT funding, on both the national and international level. This is especially important bearing in mind that in the past, as well as today, researchers working on NLP/LT for Serbian are mostly affiliated with state universities, which require stable and adequate levels of funding. At the international level, Serbian and other weakly supported languages would benefit from more knowledge transfer projects that would not merely aim at mirroring existing solutions for English, but rather support the production of adequate resources and tools for endangered languages.

References

Kovačević, Miladin, Vladimir Šutić, and Uroš Rajčević (2021). Употреба информационо-комуникационих технологија у Републици Србији, 2021. (The use of ICT in the Republic of Serbia in 2021). https://publikacije.stat.gov.rs/G2021/Pdf/G202116016.pdf.

Krstev, Cvetana (2008). Processing of Serbian – Automata, Texts and Electronic Dictionaries. Belgrade: University of Belgrade, Faculty of Philology.

Krstev, Cvetana and Ranka Stanković (2022). Deliverable D1.35 Report on the Serbian Language. European Language Equality (ELE); EU project no.
LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-serbian.pdf.

Vitas, Duško, Ljubomir Popović, Cvetana Krstev, Ivan Obradović, Gordana Pavlović-Lažetić, and Mladen Stanojević (2012). Српски језик у дигиталном добу – The Serbian Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/serbian.

Chapter 33
Language Report Slovak

Radovan Garabík

Abstract For Slovak, all the fundamental NLP building blocks for basic applications exist, but they are often of lesser quality and lower accuracy than those of other languages. The availability of free and open tools and data is rather low, with most of the resources proprietary. Compared to neighbouring languages of similar levels of NLP development (Czech, Polish, Hungarian), Slovak is positioned toward the lower end of this group. Slovak language support by “big players” in the LT industry is comparable to other European languages with similar size; speech recognition and synthesis work acceptably while machine translation between Slovak and English is almost good enough to be used by professionals as a source for post-editing.
Spell checkers, LT-assisted mobile phone input, OCR and lemmatised fulltext search are taken for granted, although their quality is significantly lacking compared to bigger European languages.

1 The Slovak Language

Slovak is the official language in the Slovak Republic. Since May 2004 it has also been one of the administrative languages of the European Union. According to the 2021 census data, out of 5.4 million inhabitants of Slovakia, 4.7 million people have Slovak as their mother tongue.¹ Other estimates (perhaps overly optimistic) claim that Slovak is spoken by more than one million emigrants in the United States, about 300,000 people in the Czech Republic, and smaller groups in Hungary, Romania, Serbia, Croatia, Bulgaria, Poland and other countries. A fact which is not well known is that there exists another written variant of (Eastern) Slovak, using Cyrillic script. This variant is used around Ruski Krstur in Serbia by a few thousand speakers, but thanks to historical religious circumstances it is generally considered a dialect of the Rusyn language, not Slovak. As such, it is almost completely ignored in all aspects concerning Slovak linguistics.

Radovan Garabík, Slovak Academy of Sciences, Slovakia, radovan.garabik@kassiopeia.juls.savba.sk
¹ Corrected for the inhabitants with an unidentified mother tongue.
© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_33

As a typical Slavic language Slovak is moderately inflected with a complex morphology and relatively flexible word order. It has three or four² genders, two grammatical numbers, three tenses and prominent aspectual pairs. It belongs (together with Polish, Czech, Lower and Upper Sorbian) to the West branch of Slavic languages. In the 16th to 18th centuries, Czech was used as the cultural language in Slovakia, together with several types of cultural Slovak, and the modern standard of the language dates to the second half of the 19th century.
Slovak is generally considered to be mutually intelligible with Czech, with some caveats regarding different inflection of pronouns, some lexical and terminological differences and differences in verb conjugations. Czech enjoys a unique sociolinguistic status in Slovakia; the population is widely exposed to the Czech language in media (TV, movies, internet, and literature). As a result, Czech is widely understood in Slovakia above the level of natural mutual intelligibility. Note that the opposite – exposure of Czech Republic inhabitants to the Slovak language – is only marginal. Despite this, the visible influence of Czech on Slovak is limited to some lexical items and syntactical constructions, often regarded as “incorrect”.

The language is written using the Latin alphabet with additional diacritical marks, marking palatalisation of consonants, postalveolars, and phonemic length of vowels and consonants. The Slovak alphabet has the distinction of having the greatest number of characters (43, or 46 including digraphs) among European languages.

On the web, Slovak is a sharply localised language, closely interwoven with the .sk top-level domain (TLD). Distribution (as of 2021) of the most frequent top-level domains of web pages in the Slovak language from the Araneum Slovacum VI Maximum Beta web corpus (Benko 2014) shows that 76.6% of documents in Slovak are from the .sk TLD; 8.8% from the .com TLD, 3.8% from .cz, 2.9% from .eu, 2.0% from .net and the rest from other, less frequent domains.

2 Technologies and Resources for Slovak

Slovak language NLP and LT³ lag behind that of neighbouring languages of similar status (i.e., Czech, Polish and Hungarian). Predominantly developed in academic environments (Šimková et al. 2012), Slovak language technologies used to be mostly limited to lemmatisation and morphosyntactic analysis, with some limited industry interest in other tools (e.g., NER).
The situation has somewhat changed in recent years, with industry more interested in deep learning models. Nevertheless, the availability of huge language corpora and lexical resources for Slovak is comparable to similar languages (Aldabe et al. 2022).

The main institution tasked with compiling and curating big, representative corpora is the Slovak National Corpus (SNK)⁴ department of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. SNK was also active in developing basic digital language resources of the contemporary language, as well as parallel corpora, spoken, dialect and historical corpora and lexicographical databases (Garabík 2010), and in digitalisation of linguistic research in Slovakia.

² Masculine is sometimes analysed as two genders: masculine animate and masculine inanimate.
³ See, for example, https://github.com/essential-data/nlp-sk-interesting-links
⁴ https://korpus.sk

Corpora compiled at SNK have formed an indispensable part of linguistic research in Slovakia for a number of years, together with the ARANEA family of huge web corpora for more than 20 languages (Benko 2014).⁵ Currently, the main Slovak language corpus, prim-10.0, contains about 1.7 billion words.⁶ The web corpus Araneum Slovacum VI Beta contains about 4.4 billion words. In the NLP and LT industry, companies usually use web corpora collected in-house.

Official Slovak translations of various EU texts (such as Acquis communautaire, EU parliament proceedings, Official Journal of the EU etc.) make up the bulk of the available parallel corpora that are unrestricted by copyright and suitable for MT-related tasks.
All building blocks of basic NLP processing for Slovak are covered: lemmatisation (since Slovak is a moderately inflected language, lemmatisation is often indispensable for any subsequent language processing), and morphological analysis, including POS tagging and syntactic parsing. Spell checkers, LT-assisted mobile phone input, OCR, and lemmatised fulltext search are hidden parts of the technological background that is already taken for granted, although their quality and accuracy are lacking compared to bigger European languages. In recent years, deep learning language models appeared on the Slovak NLP scene, often adopted from comparable work for other languages (Pikuliak et al. 2021).

Recently, chatbots have noticeably penetrated many areas of human-computer interaction, as the first line of contact in customer support, and although primarily used in English-speaking countries, they are now used in other countries as well, including Slovakia, where chatbots (in written communication mostly) are used by many companies. However, since poorer accuracy of Slovak analysis leads to mixed results and the chatbots are deployed at least partly for public relations reasons, quite often these are just menu-driven FAQs (or an expert system in disguise) camouflaged by an animated head or similar graphical element, without deeper NLP processing.

⁵ http://aranea.juls.savba.sk/aranea_about/
⁶ https://korpus.sk/prim-10-0/

3 Recommendations and Next Steps

In Slovakia, academic research and industry dealing with NLP and LT function rather separately. The academic sphere often reacts rather slowly to real demands, and instead often explores tasks with little immediate business application; the industry is mostly interested in specific tools and generally does not do NLP-related research, although there are a few companies which are active in applied NLP research.

Since many resources are not reusable due to copyright issues, clarification (i.e., opening) of the licensing of many existing datasets would be helpful for further NLP
Many resources remain atthe“proof ofconcept” stageanddedicated effortisneededtobringthemuptoproperlevelsofusability. Thisisalsoconnected withtheissueofsustainabilityofexistingresources,manyofwhichweredeveloped asaresultofspecificresearchgrants,andoncethefinancingstopped,theresources werebasicallyabandoned andno new development is takingplace. The Action Plan for the digital transformation of Slovakia for 2019-2022 (AP 2019) describes a centralised coordinated approach and cooperation between aca­demic and commercial sectors in NLP. It is written only in general terms, without specific steps to be taken; the lack of computational linguists in Slovakia is not ad­dressed (e.g., by promoting university education). The change of government after parliamentary elections in February 2020 and the COVID-19 pandemic have led to the NLP section of the Action Plannot having beenactedupon at all. References Aldabe,Itziar,GeorgRehm,GermanRigau,andAndyWay(2022).Deliverable D3.1 Report on ex­isting strategic documents and projects in LT/AI (second revision).EuropeanLanguageEquality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/rep orts/LT-strategic-documents-v3.pdf. AP(2019). Action plan for the digital transformation of Slovakia for 2019 – 2022. https://www.m irri.gov.sk/wp-content/uploads/2019/10/AP-DT-English-Version-FINAL.pdf. Benko, Vladimír (2014). “Aranea: Yet another family of (comparable) web corpora”. In: Interna­tional Conference on Text, Speech, and Dialogue. Springer,pp. 247–256. Garabík,Radovan(2010).“SlovakNationalCorpustoolsandresources”.In:Proceedings of the 5th Workshop on Intelligent and Knowledge oriented Technologies.InstituteofInformatics,Slovak Academyof Sciences, pp. 2–7. Pikuliak, Matúš, Marián Šimko, and Mária Bieliková (2021). “Cross-lingual learning for text pro-cessing:Asurvey”.In: Expert Systems with Applications 165,p.113765.DOI: 10.1016/j.eswa .2020.113765. 
Šimková, Mária, Radovan Garabík, Katarína Gajdošová, Michal Laclavík, Slavomír Ondrejovič, Jozef Juhár, Ján Genči, Karol Furdík, Helena Ivoríková, and Jozef Ivanecký (2012). Slovenský jazyk v digitálnom veku – The Slovak Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/slovak.

Chapter 34
Language Report Slovenian

Simon Krek

Abstract Around 2.5 million people around the world speak or understand Slovene, with the vast majority of them living in the Republic of Slovenia where it is the official language. The constitution grants the right to use their mother tongue to Italian and Hungarian minorities in certain municipalities. In terms of Language Technology, the Slovene CLARIN.SI consortium plays the key role in the community; all major Slovene institutions involved in the development of LT resources, tools and services are members of the consortium.
In contrast, the number of private companies in Slovenia specialising in LT for Slovene remains low, and most of the LT products come either from the (Slovene) academic sphere via national or EU funding, or from the big international IT companies that cover a large number of languages.

1 The Slovenian Language

Slovene is a member of the South Slavic language family and is spoken mainly in Slovenia and the neighbouring areas in Italy, Austria, Hungary and Croatia. In the national census of 2002, the last one that recorded the number of native speakers of different languages, 87.8% of the population – of a total of just under 2 million at the time – declared Slovene to be their mother tongue, with another 3.3% claiming that they use Slovene as the language of their everyday communication at home, which amounts to 91.1% of the population using Slovene as their first language. This number puts Slovenia in the group of EU states with the most homogeneous linguistic situation. Among other linguistic groups, native speakers of languages of the former Yugoslavia were the largest in 2002, with 3.3% of them using a combination of Slovene and their mother tongue for everyday communication, and another 1% using only their mother tongue: Bosnian, Croatian, Serbian or Montenegrin. Other smaller communities included speakers of Albanian, Macedonian and Romani.

Simon Krek, Jožef Stefan Institute, Slovenia, simon.krek@ijs.si
© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_34

Slovene is the official language in the Republic of Slovenia. The constitution grants the right to use their mother tongue to the two minorities declaring that “in
According to legislation in Slovenia, all education and teaching provided as part of the current state curriculum, from preschool through to university level, must be in Slovene. In preschool, primary and secondary education, Italian is used in the schools of the Italian minority community, while Hungarian and Slovene are used in bilingual schools where the Hungarian minority is found. Special arrangements exist for children whose mother tongue is not Slovene, for the education of Roma children, children of foreign citizens and children of people without citizenship.

2 Technologies and Resources for Slovenian

Useful places to discover Slovene corpora are the CLARIN.SI NoSketch Engine¹ and KonText² concordancers.³ At the time of writing, there are 76 corpora of varying sizes containing Slovene data in the repository, and 59 corpora in the concordancers. Most of them are available for download under open licences. The more important families of corpora cover general written standard language (Gigafida), Slovene Web and social media (slWaC, Janes), academic discourse (KAS), parliamentary transcriptions (siParl, ParlaMint), Slovene Wikipedia (CLASSLAWiki-sl), historical texts (IMP), literature (MAKS, ELTeC-slv), specialised domains (KoRP, DSI, Konji, etc.), and school essays (Šolar, SBSJ). There are also various manually annotated training and evaluation corpora available (ssj500k, etc.).

The GOS (GOvorjena Slovenščina, Spoken Slovene) family of corpora contains transcriptions of spoken Slovene. The original GOS includes about 120 hours of transcripts from various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc.
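To make concrete what the concordancers mentioned at the start of this section do, the following is a minimal keyword-in-context (KWIC) sketch in Python. It is not the NoSketch Engine or KonText implementation or API; the function name `kwic`, the `width` parameter and the sample sentence are illustrative assumptions only.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for each match of keyword.

    Hypothetical helper for illustration; real concordancers work on
    indexed, annotated corpora rather than on raw strings.
    """
    # \w matches Unicode letters in Python 3, so š, č and ž stay inside words.
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Made-up sample text, not taken from any of the corpora named above.
sample = "Slovenščina je uradni jezik v Sloveniji. Jezik je živ."
for left, hit, right in kwic(sample, "jezik", width=2):
    print(f"{left:>20} [{hit}] {right}")
```

Matching is case-insensitive and purely on surface forms; querying by lemma or part-of-speech tag, as annotated corpora allow, would need the annotation layers that this sketch leaves out.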
In terms of parallel data, Slovene has benefited from its status as one of the official EU languages since 2004 and is included in the standard multilingual parallel data sets produced either by EU institutions (JRC-Acquis, DGT-Acquis, DCEP, DGT-TM, EAC-TM, ECDC-TM, JRC-Names) or by EU-funded or other projects (INTERA, WIT3, ParaCrawl, CommonCrawl, OpenSubtitles etc.), which are available either from OPUS or from repositories such as ELG. Two TM corpora produced by the Secretariat-General of the Slovene government were made available in the context of the ELRC project and are uploaded in the ELRC-SHARE repository.

There are 82 lexical/conceptual resources with Slovene data in the CLARIN.SI repository available under open access licences. Those that deserve special mention due to their size or importance are: Sloleks, a morphological lexicon containing around 100,000 of the most frequent Slovene lemmas, their inflected or derivative word forms (2.7M) and the corresponding grammatical description; sloWNet, the Slovene WordNet developed in the expand approach, which contains the complete Princeton WordNet 3.0 and over 70,000 Slovene literals; the Dictionary of the Slovenian Normative Guide, a normative orthographic dictionary of Slovene standard language, which contains 140,266 lemmas and sublemmas in 92,617 entries; and the Thesaurus of Modern Slovene, an automatically created thesaurus built from Slovene data available in a comprehensive English–Slovene dictionary, a monolingual dictionary, and a corpus, which contains 105,473 entries and 368,117 synonym pairs.

1 https://clarin.si/noske/
2 https://clarin.si/kontext/corpora/corplist
3 https://clarin.si/info/about/

In terms of language models, the most recent one is the Slovene RoBERTa model. The corpora used for training the model contain 3.47 billion tokens in total.
The subword vocabulary contains 32,000 tokens.⁴ Multilingual models are also available, e.g., a trilingual BERT model, trained on Croatian, Slovene, and English data.⁵

The standard and most accurate text processing tool for Slovene is the CLASSLA fork of the Stanza pipeline.⁶ It supports processing of both standard and non-standard Slovene at the level of tokenisation and sentence segmentation, part-of-speech tagging, lemmatisation, dependency parsing and named entity recognition.

There are some Slovene LT companies that develop speech-to-text and text-to-speech tools.⁷ Slovene is also available in speech technology services offered by large enterprises such as Microsoft and Google, as well as by other companies specialising in speech technology.⁸ These solutions have also found their way into some specialised devices covering many languages.⁹ At the University of Ljubljana, a system has been developed for automatically translating lectures from Slovene to other languages in real time, in the context of the OnlineNotes project.¹⁰

Machine translation services for Slovene are available through more or less the same stakeholders: some Slovene LT companies,¹¹ the large enterprises such as Microsoft and Google, and some other international companies specialising in machine translation technology or general translation services.¹² As an official EU language, Slovene is included in the eTranslation service offered by the European Commission.

The biggest investment in LT for Slovene is the Development of Slovene in Digital Environment project financed by the Slovene Ministry of Culture between 2020 and 2023.¹³ The project will significantly upgrade existing LT resources, tools and services, or produce many of those that do not exist yet.
The results of the project are expected to be published on the CLARIN.SI and GitHub repositories in November 2022 and February 2023.

4 http://hdl.handle.net/11356/1397
5 http://hdl.handle.net/11356/1330
6 https://github.com/clarinsi/classla, https://pypi.org/project/classla/
7 Amebis, Alpineon: eBralec, https://ebralec.si; Vitasis: Truebar, https://vitasis.si
8 NEWTON Technologies, https://www.newtontech.net; Sonix: https://sonix.ai
9 Pocketalk: https://europe.pocketalk.com/languages-countries/
10 https://www.cjvt.si/en/infrastructure-support/tolmac/
11 Vitasis: Truebar, https://vitasis.si; Aikwit, https://aikwit.com; Taia, https://taia.io
12 DeepL Translate, https://www.deepl.com; Pangeanic, https://pangeanic.com/languages/slovenian-translation-services/, etc.
13 Razvoj slovenščine v digitalnem okolju (RSDO): https://www.slovenscina.eu

3 Recommendations and Next Steps

In general, one can conclude that 1. the support for Slovene is comparable with other languages with a similar status (Krek 2022, 2012), 2. there is a general awareness in governmental bodies that LT for Slovene should be supported in the future, 3. the LT community is growing, also through new educational initiatives such as the MA study of Digital Linguistics (Faculty of Arts, University of Ljubljana), and 4. there is infrastructural support, mainly through the CLARIN.SI infrastructure at the Jožef Stefan Institute, which also covers all other stakeholders through the CLARIN.SI consortium. However, more effort is needed in the future to bring the existing support closer to that available for other (official EU) languages.

References

Krek, Simon (2012). Slovenski jezik v digitalni dobi – The Slovene Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/slovene.
Krek, Simon (2022). Deliverable D1.31 Report on the Slovenian Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166.
https://european-language-equality.eu/reports/language-report-slovenian.pdf.

Chapter 35
Language Report Spanish
Maite Melero, Pablo Peñarrubia, David Cabestany, Blanca Calvo, Mar Rodríguez, and Marta Villegas

Abstract Spanish, one of the most spoken languages in the world, is not threatened by globalisation in the way other languages are and is well-supported by big technological companies, albeit still a long way from English. The number of available language resources (text, and to a lesser extent speech) in Spanish is quite large, but there is still a lack of high-quality, well-curated, annotated resources, available under open-access conditions. Initiatives at the national level, such as the Plan de Impulso de las Tecnologías del Lenguaje, have already started to address this gap.

1 The Spanish Language

The Spanish language, also known as Castilian, is the most spoken Romance language and the fourth most spoken language in the world. Spanish is the official language of Spain, where it originated as an evolution of Vulgar Latin, but most Spanish speakers are in the Americas. It is spoken natively by about 473 million people across 21 countries, where it shares territory with a multitude of languages.
Spanish is the third most used language on the internet¹ and this use is steadily growing due to the progressive incorporation of Latin American users. Its growth potential is still very high due to the limited access still seen in some Spanish-speaking countries (the average internet penetration in the Americas is only 67% vs. 92.6% in Spain). Currently, Spanish ranks second on the most popular social networks (Facebook, Instagram, Twitter) and streaming platforms (Netflix, YouTube). YouTube, in particular, has now become one of the main dissemination channels for popular culture in Spanish. It has made consumers of audiovisual products in Spanish much less confined to their geographical area of reference, favouring an unprecedented transfer of linguistic phenomena between the different varieties of Spanish. In contrast, the Spanish Wikipedia ranks only ninth in the number of articles, behind not only some big languages like German and French, but also much smaller ones like Swedish and Dutch. With regard to AI applications that use Spanish, most of the solutions offered by big companies (Google, Amazon, Facebook, Apple, Microsoft) have a Spanish version. Some of them even offer support to dialectal varieties, like Mexican Spanish or peninsular Spanish. Most of these products offer less functionality than their English counterparts, and the quality is lower but keeps improving with each release.

Maite Melero · Pablo Peñarrubia · David Cabestany · Blanca Calvo · Mar Rodríguez · Marta Villegas
Barcelona Supercomputing Center, Spain, maite.melero@bsc.es, pablo.penarrubia@bsc.es, david.cabestany@bsc.es, blanca.calvo@bsc.es, mar.rodriguez@bsc.es, marta.villegas@bsc.es

1 https://cvc.cervantes.es/lengua/espanol_lengua_viva/pdf/espanol_lengua_viva_2021.pdf

© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_35
2 Technologies and Resources for Spanish

The Spanish language extends over a very large geographical area and, consequently, many research centres across this area are devoting efforts to developing resources and tools for Spanish, although Spain still leads these efforts. As a global language with hundreds of millions of speakers, the number of unannotated resources (text, and to a lesser extent speech) in Spanish is quite large. However, although good progress has been made since the last survey (Melero et al. 2012), there is still a lack of high-quality, well-curated, annotated and open-access resources.

There are over 20 textual corpora exceeding 100 million words in Spanish, with half of them reaching a billion words, such as the Now Corpus,² or the BNE Corpus (Melero et al. 2022). Most of these are automatically cleaned and tagged web corpora, but some come from well-edited sources such as newspapers, scientific journals, collections of published books, or Wikipedia. In some cases, they can be queried but not downloaded, like Codicach³ or CORPES.⁴ Additionally, it should be noted that only half of the Spanish corpora contain linguistic annotations. The most common annotations are morpho-syntactic tags, like part-of-speech and lemma. The number of corpora in Spanish for the different domains varies greatly. Thus, while a sizeable number of corpora on legal and administrative language can be found, other domains are under-represented.
Spanish also appears in many multilingual corpora, together with European languages or with the three other major languages in Spain (Catalan, Basque, Galician). In contrast, there is a lack of parallel corpora with other minority languages of Spain, such as Asturian, Aragonese, Mirandese and Romani, and very few with indigenous languages of the Americas, such as Nahuatl, Guarani, Quechua or Aymara. There is also a lack of bilingual corpora with languages of migrants. As for Spanish Sign Language (LSE), it is estimated that there are more than 100,000 signers of LSE, 20–30% of whom use it as their second language. At least three LSE corpora as well as lexicons and learning resources have been documented.

2 https://www.corpusdelespanol.org/now/
3 http://sadowsky.cl/codicach.html
4 https://www.rae.es/banco-de-datos/corpes-xxi
5 https://github.com/PlanTL-GOB-ES/lm-spanish

In the last couple of years, several large language models (LLMs) have been trained for Spanish. RoBERTa-bne and BETO are the most popular BERT-based ones; GPT2-2-bne is the only generative LLM to date.⁵ Even though applications based on LLMs tend to be trained end-to-end, limiting the relevance of typical NLP low-level tasks, such as word tokenisation, segmentation, part-of-speech tagging, parsing, etc., those tasks remain important components of many applications. There are a number of toolkits and packages that gather and maintain these tools, like FreeLing, spaCy, UDPipe, LIMA and Connexor, all including Spanish. There are also numerous tools for common end-user tasks in Spanish, such as spell checkers, grammar checkers, style checkers, etc., which can be integrated into most content management systems. Other tools deal with stylometry, plagiarism, information extraction, sentiment analysis, automatic transcription, etc.
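As an illustration of the low-level tasks mentioned above (word tokenisation and sentence segmentation), the following is a minimal rule-based sketch in Python using only the standard library. It does not reproduce how FreeLing, spaCy, UDPipe or any other toolkit named here works; the function names `tokenise` and `split_sentences` and the sample text are assumptions made for illustration.

```python
import re

# A token is either a run of word characters (accented letters and ñ
# included) or a single punctuation mark, so ¿ and ¡ become own tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenise(text):
    """Split text into word and punctuation tokens (naive sketch)."""
    return TOKEN_RE.findall(text)

def split_sentences(text):
    """Break after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Made-up sample text.
sample = "¿Dónde está la biblioteca? Está cerca. ¡Gracias!"
for sentence in split_sentences(sample):
    print(tokenise(sentence))
```

Rules this simple fail on abbreviations such as "Sr." and on numbers with decimal points, which is one reason these tasks remain components in their own right rather than solved preliminaries.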
Spanish is also well served by popular machine translation platforms, such as Google Translate, DeepL or Bing. In addition, Apertium⁶ has built downloadable translation models to translate from Spanish into other languages of Spain (Catalan, Basque, Galician), and eTranslation,⁷ the EC service provided to public administrations and SMEs, offers neural-based translation between all official European languages, including Spanish. Speech recognition and synthesis are behind some of the most iconic AI applications, such as virtual assistants and dialogue agents. There are close to a hundred speech technology tools documented for Spanish, including text-to-speech (TTS), automatic speech recognition (ASR), and speaker recognition (SR).

Public research centres and universities play an important role in developing language technologies for Spanish. They are responsible for creating and distributing many of the tools and resources mentioned above. In Spain, the Plan de Impulso de las Tecnologías del Lenguaje⁸ plays a central role in promoting the development of language resources for Spanish, but also for the other official languages of Spain. The Plan is supported by the Secretary of State for Digitalisation and Artificial Intelligence, and through its collaboration with the Text Mining Unit in the Barcelona Supercomputing Center, it has produced several relevant assets in the biomedical text mining domain, machine translation, and LLMs.⁹ Another project, Spanish Language and Artificial Intelligence (LEIA),¹⁰ is also currently underway between the Real Academia Española de la Lengua, the institution entrusted with the stability of the Spanish language, and the big enterprises (Microsoft, Amazon, Google, Twitter, Facebook) with the objective of ensuring high quality coverage of the Spanish language by their AI products. Aside from the big companies in the technology industry, there are many SMEs developing solutions in Spanish.
The top services offered include customised chatbots, machine translation systems, speech technologies, spell checkers and specialised tools for linguistic information extraction and management. Finally, mention should be made of the Spanish Society for Natural Language Processing (SEPLN),¹¹ a non-profit organisation supported by research groups and the NLP industry, created back in 1983 to promote teaching, research and development of Spanish NLP, and to organise an annual conference, regularly attended by a number of research groups and companies working in the field.

6 https://www.apertium.org
7 https://ec.europa.eu/digital-building-blocks/wikis/display/CEFDIGITAL/eTranslation
8 https://plantl.mineco.gob.es
9 https://github.com/PlanTL-GOB-ES/lm-spanish
10 https://www.rae.es/noticia/que-es-leia
11 http://www.sepln.org/en/sepln

3 Recommendations and Next Steps

Despite its privileged position as a global language, more effort needs to be devoted for Spanish to realise its full technological potential. Spanish is included in many multilingual projects and is well-supported by large industrial corporations and projects, although the gap in the number and quality of resources and tools compared to English is still quite large. There are many resources documented for Spanish, but there is still a lack of high-quality, well-curated, annotated and open-access resources. Moreover, much more should be done to identify untapped data silos in the public administration, both textual and speech, and facilitate their exploitation, following the European directives on the reuse of public sector information. National initiatives such as the Plan de Impulso de las Tecnologías del Lenguaje need a more sustained effort, capable of 1. filling the gaps in the available resources, 2. ensuring well-regulated access to language data, 3. increasing the innovation capacity of Spanish public services through Language Technologies, 4. promoting research in Spanish NLP and translation technologies and, finally, 5. helping bring research solutions to the market, and to the public.
References

Melero, Maite, Toni Badia, and Asunción Moreno (2012). La lengua española en la era digital – The Spanish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/spanish.
Melero, Maite, Pablo Peñarrubia, David Cabestany, Blanca C. Figueras, Mar Rodríguez, and Marta Villegas (2022). Deliverable D1.32 Report on the Spanish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-spanish.pdf.

Chapter 36
Language Report Swedish
Lars Borin, Rickard Domeij, Jens Edlund, and Markus Forsberg

Abstract Swedish speech and language technology (LT) research goes back over 70 years. This has paid off: there is a national research infrastructure, as well as significant research projects, and Swedish is well-endowed with language resources (LRs) and tools. However, there are gaps that need to be filled, especially high-quality gold-standard LRs required by the most recent deep-learning methods.
In the future, we would like to see closer collaborations and communication between the “traditional” LT research community and the burgeoning AI field, the establishment of dedicated academic LT training programmes, and national funding for LT research.

Lars Borin · Markus Forsberg
University of Gothenburg, Sweden, lars.borin@svenska.gu.se, markus.forsberg@svenska.gu.se
Rickard Domeij
Institute of Languages and Folklore, Sweden, rickard.domeij@isof.se
Jens Edlund
KTH Royal Institute of Technology, Sweden, edlund@speech.kth.se

© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_36

1 The Swedish Language

Swedish is the main language of Sweden and also a constitutional official language of Finland. There are about 10 million native speakers of Swedish, the vast majority of whom are Swedish citizens (Parkvall 2019), and an estimated additional 3 million second-language speakers. Swedish is spoken at all levels of government and education in Sweden and to some extent in Finland. Its vitality is strengthened by its closeness to the languages spoken in Norway and Denmark: speakers of Swedish, Norwegian and Danish are able to communicate with relative ease (Haugen and Borin 2018). These languages have around 20 million native speakers in total.

Swedish is written using a modified Latin script with a 29-letter alphabet (the 26-letter Latin alphabet is extended with the vowel characters å, ä, ö). The writing system is in the mid-range of orthographic transparency. It is a relatively normal Germanic (and European) language. Its most “exotic” aspects are found in the domain of phonology, such as: a phonemic pitch accent system; an unusually large vowel system, including front rounded vowels (where the high vowels display a notable two degrees of rounding); and rather liberal phonotactics with CCC onsets and CCCC codas.
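One practical consequence of the 29-letter alphabet described above is that å, ä and ö are letters in their own right, sorted after z, which plain code-point sorting does not respect. The following is a minimal Python sketch of an alphabet-aware sort key; the names `SWEDISH_ALPHABET` and `swedish_key` and the word list are illustrative assumptions, and a production system would more likely rely on a locale-aware collation library.

```python
import string
import unicodedata

# 26 basic Latin letters followed by å, ä, ö (written as escapes so the
# string is independent of how this file's text happens to be normalised).
SWEDISH_ALPHABET = string.ascii_lowercase + "\u00e5\u00e4\u00f6"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Sort key that places å, ä, ö after z; other characters sort last."""
    normalised = unicodedata.normalize("NFC", word.lower())
    return [RANK.get(ch, len(RANK)) for ch in normalised]

# Made-up word list for illustration.
words = ["öl", "zebra", "äpple", "apa", "ål"]
print(sorted(words, key=swedish_key))
```

The sketch ignores finer points of dictionary ordering, such as the treatment of accented letters in loanwords, and only demonstrates the position of the three extra vowels.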
Structurally, Swedish generally follows the patterns typical of Germanic languages, including V2 word order, rich nominal compounding (orthographically written without spaces), and a propensity for forming lexicalised particle (or phrasal) verbs, which appear in speech and text as discontinuous multiword expressions. Among more unusual traits we find a third-person reflexive possessive (i.e., a special possessive form used only if the possessor is co-referential with the subject), and Swedish stands out in relation to its Germanic relatives through the recent introduction (and wide adoption) of a consciously coined gender-neutral third-person singular personal pronoun (hen, ‘he/she’).

Approx. 95% of the Swedish population use the internet at least once a week. In 2020, 86% of Swedish households were connected to 100 Mb or faster fibre optic and 90% of the population used a smartphone. Over the last five years, the .se country top-level domain together with the popular .nu domain have had around 2 million registered domain names. Swedish web pages are overwhelmingly produced in Swedish, often with a parallel English version. The majority of mainstream software such as operating systems, word processors, etc., are localised to Swedish.

2 Technologies and Resources for Swedish

There is a wealth of monolingual text corpora with automatic linguistic annotations available for Swedish, comprising billions of tokens in a variety of genres and text types (Borin et al. 2022, 2012). In contrast, there is a notable lack of gold-standard text corpora, in particular corpora that reflect the present-day language and text genres. Notably, there is currently an ongoing national collaboration with the aim of creating a Swedish natural language understanding benchmark like the English (Super)GLUE,¹ called SuperLim.²

There are few publicly available collections of transcribed speech, and there is also a distinct lack of publicly available large multimodal corpora specifically designed or curated for speech technology (ST) and/or LT purposes.
Hence, a number of initiatives aim to record and make available speech corpora. Notably, an ASR corpus is being created with 100 speakers recorded in a studio setting, as well as recordings for a male and a female TTS voice, and the Finnish Language Bank is recording Finnish Swedish voices donated by the public. Furthermore, the lack of freely available recordings of real-world speech is an inhibiting factor for ST development beyond relatively simple and controlled applications and domains. While the availability of unannotated audio and video recordings on the internet is greater than ever before, the legality and circumstances under which the use of such data is permissible are unfortunately especially unclear when speech is involved.

1 https://gluebenchmark.com, https://super.gluebenchmark.com
2 https://spraakbanken.gu.se/en/resources/superlim

The Sign Language Research Unit at Stockholm University provides access to a Swedish Sign Language (SSL) corpus, with close to 200k annotated tokens.³

LR compilation and LT for written Swedish started in the 1960s largely motivated by lexicographic considerations. For this reason, Swedish is well-equipped with high-quality lexical and conceptual resources.⁴ A notable lacuna in this context is a Swedish wordnet, which is still pending.

For text processing, grammar-based LT has now largely yielded ground to deep neural machine learning approaches. Drawing on its vast text holdings, the National Library of Sweden has taken a leading role in training large neural language models (LLMs) for Swedish.⁵ For Swedish speech processing, several acoustic models for Kaldi and wav2vec are available.
Notable Swedish tools for speech include Wavesurfer⁶ and the Snack Sound Toolkit.⁷

There is academic research as well as commercial initiatives on several LT component technologies for Swedish, such as tools for text and speech processing, machine translation, computer-aided translation, spoken dialogue systems, language generation and text summarisation, while information retrieval and information extraction for Swedish are primarily being developed by commercial companies, e.g., as parts of proprietary business intelligence and intranet search applications. Notable is the work at Stockholm University on developing LT tools for (transcribed) SSL.

There is no dedicated national LT research funding programme, but several projects have recently been funded. The Wallenberg AI, Autonomous Systems and Software Program supports projects that benefit LT, such as the building of Swedish LLMs and improved ST algorithms. Outside academia there is great interest in LT and language-centric AI from commercial enterprises and public agencies; Sweden has a modest but thriving spectrum of companies offering various LT and AI solutions. Within academia, the research infrastructure Nationella språkbanken⁸ ‘the Swedish Language Bank’ – funded jointly by the Swedish Research Council and ten universities and cultural heritage institutions – collects, develops, manages and distributes LTs and LRs for research, notably including resources and tools for historical stages of Swedish, where we do not expect commercial initiatives to materialise. Nationella språkbanken also coordinates the Swedish membership in CLARIN ERIC.
3 https://www.ling.su.se/teckensprakskorpus, http://sts-korpus.su.se
4 https://spraakbanken.gu.se/en/research/themes/swedish-framenet-plus-plus
5 https://huggingface.co/KBLab
6 https://sourceforge.net/projects/wavesurfer/
7 https://en.wikipedia.org/wiki/Snack_Sound_Toolkit
8 https://www.sprakbanken.se

3 Recommendations and Next Steps

For most of its long history, Swedish academic LT has been pursued by a well-balanced and mutually complementary mix of researchers from computer science and linguistics (engineering and phonetics in the case of ST). However, recent years have seen a clear shift towards LT researcher teams having a strong or pure computer science background, with an accompanying lack of awareness of many important linguistic aspects of LT research problems.

The Swedish academic LT expertise represents seventy years of accumulated knowledge and experience, which should not be allowed to go to waste. In the short term, the best way of ensuring this is to focus on further LR development for Swedish. Well-designed gold-standard corpora for fine-tuning LLMs and evaluating LT systems require exactly this kind of expertise for their construction, not least in order to avoid pitfalls such as models making undesirable biased predictions that risk perpetuating gender roles or leading to unfair treatment of minority groups. In the medium term, we should aspire to understand current LLMs – which typically come across as black boxes – in order to be able to exploit already existing linguistic knowledge (e.g., information about words collected in a lexical or conceptual resource) when training LLMs, which potentially will reduce their training data requirements, thus putting state-of-the-art LT tools in reach of lower-resourced languages.

This calls for the establishment of closer collaborations and communication between the “traditional” LT research community and the new AI field, e.g., through dedicated LT training opportunities and earmarked funding for LT research.

References

Borin, Lars, Martha D. Brandt, Jens Edlund, Jonas Lindh, and Mikael Parkvall (2012).
Svenska språket i den digitala tidsåldern – The Swedish Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/swedish.
Borin, Lars, Rickard Domeij, Jens Edlund, and Markus Forsberg (2022). Deliverable D1.33 Report on the Swedish Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-swedish.pdf.
Haugen, Einar and Lars Borin (2018). “Danish, Norwegian and Swedish”. In: The World’s Major Languages. Ed. by Bernard Comrie. 3rd ed. London: Routledge, pp. 127–150.
Parkvall, Mikael (2019). Den nya mångfalden. Stockholm: Makadam.

Chapter 37
Language Report Welsh
Delyth Prys and Gareth Watkins

Abstract In this chapter, based on Prys et al. (2022), an update to the META-NET White Paper (Evas 2014), we present Language Technology (LT) for the Welsh language, providing an overview of the status of Welsh in Wales and a summary of the Welsh writing system and typology. We describe key tools and our recommendations for Welsh LT and associated resource development.

1 The Welsh Language

Welsh is mainly spoken in Wales, together with a small population in Argentina.
A minoritised language (Prys 2006), Welsh is considered “vulnerable” (Moseley 2010). Welsh has official status in Wales (National Assembly for Wales 2011). The 2011 census reported that there were 562,000 Welsh speakers in Wales (19% of the population). The Welsh Government aim to almost double that figure by 2050 and recognise that technology is key to this ambition (Welsh Gov. 2017).

The Welsh alphabet contains 29 letters, including eight digraphs (e.g., ch) and the letter j borrowed from English to represent the borrowed /dʒ/ consonant phoneme. V, x and z are not used in Welsh, but are included with the alphabet for computer use as they often appear in named entities such as foreign place names. Welsh belongs to the insular Celtic branch of Indo-European languages. It is verb initial, following a VSO order. It has consonant mutations at the beginning of words. Accented characters are common over vowels. Welsh has a continuum of other registers, with colloquial or informal registers differing markedly from the standard written form. It has many local dialects, with the main difference being between those of north and south Wales. Welsh has two methods of verb formation, utilising concise forms or periphrastic forms, using auxiliary verbs. Guidelines to the latest version of the modern Welsh orthography, first standardised in 1928, were published in 1987 (Prys 2006). In 2021 a new Welsh Orthography Panel was established by the Welsh Government, which aims to resolve minor inconsistencies in the orthography.

Delyth Prys · Gareth Watkins
Bangor University, United Kingdom, d.prys@bangor.ac.uk, g.watkins@bangor.ac.uk

© The Author(s) 2023 G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_37

2 Technologies and Resources for Welsh

According to Cunliffe et al. (2021), “on the Digital Language Vitality Scale […], Welsh is ‘Developing’, arguably tending towards ‘Vital’ in some aspects”. 90% of the 2019/2020 National Survey for Wales’ respondents used the internet (Welsh Gov. 2021).
However, English is the dominant online language among Welsh speakers (Welsh Gov. 2015). A lack of language tools for Welsh and inequality or lack of equivalence to English language provision exacerbates the problem.

The major paper dictionaries have been digitised and made available online, and ongoing lexical work now occurs natively in a digital environment. In contrast to traditional descriptive dictionaries, terminology work in Welsh is concept based, held in databases, and published in many formats. These resources have been re-used in lexicons for various purposes, including spelling and grammar checkers.

Monolingual, bilingual and multilingual text corpora, as well as speech corpora, mainly in the standard or neutral language register, have been curated. The Language Technologies Unit at Bangor University holds the largest collection of corpora, at over 700 million tokens, including the Cysill Ar-lein Monitor Corpus (Prys et al. 2016). The CorCenCC (Knight et al. 2020) corpus is the largest annotated, balanced general corpus to date, with 11 million tokens. Crowdsourcing has been successfully used to gather large speech corpora of recorded prompts, currently using Mozilla Common Voice. Recordings of voice talents, collected specifically for building synthetic voices, have been released under the CC0 licence. Intellectual Property and licensing issues are of utmost concern when assessing the suitability of these corpora for use and reuse and can hamper their open distribution.

In terms of speech technology, a Welsh personal assistant (Jones 2020) has been developed, as has the first Welsh speech-to-text transcriber. Synthetic voices have been created for Welsh using older diphone technology, with newer, more natural sounding unit selection voices becoming available under open licences. A voice banking initiative, Lleisiwr, a joint venture between Bangor University and NHS Wales, has been created for bilingual Welsh/English speakers about to lose their speech capabilities, and is one of the most innovative services established to date.
Acoustic and language models for Welsh are being developed. Some of these are part of multilingual sets, which are of variable quality compared to those developed specifically for Welsh. A Welsh part-of-speech tagging model has been developed for spaCy, unlocking the potential to perform many other NLP tasks on Welsh texts. Welsh has NLP tools for text analysis, anonymisation, and information extraction.

In terms of translation, a commercial Welsh–English translation system exists and MT for Welsh is offered by some major companies such as Google and Microsoft. Moses has been used to develop SMT for Welsh. Newer neural net engines are being used, and the first domain-specific MT engine for health has been launched. Welsh/English translation memories can be shared on the Open Translation Memories site, emulating the ELRI project. An overview of these LT tools and resources may be found on the Welsh National Language Technologies Portal (Prys and Jones 2018).

While the UK LT industry is mostly focused on the English language, Welsh language LT provision is mainly driven forward by the higher education sector. Wales has vibrant creative technology, media and translation sectors which make use of the government-funded open source LT created by universities. The main hub for LT research in Wales is Bangor University, notably its Language Technologies Unit. Relevant research is also undertaken at the universities of Cardiff, Swansea and South Wales. Efforts have also been made to improve the teaching of digital technologies in schools and universities.

The current Welsh Government’s Welsh language strategy states that “We must ensure that high-quality Welsh language technology becomes available […] to support education, workplaces and social use of Welsh” (Welsh Gov. 2017). This was further elaborated in the Government’s Welsh Language Technology Action Plan (Welsh Gov. 2018). After years of small-scale and fragmented initiatives, the publication of this plan provides a coherent, planned way forward for the development of Welsh LT resources, tools and services.
3 Recommendations and Next Steps

There has been much progress in Welsh LT in recent years, but further work needs to be done if the Welsh language is to thrive in the digital world. While FAQ generation is used for the Welsh language, the development of more sophisticated chatbot systems would further benefit Welsh speakers. There is no published research on Welsh language knowledge graphs, nor on what they have to offer to Welsh. Limited research has been conducted on Welsh language sentiment analysis. A key new area for development is bilingual models to aid minoritised languages where users constantly have to switch between their own language and the majority language or code-switch within the minoritised language. Promising work has been done for Welsh in developing a bilingual model for text-to-speech. Similar work for speech recognition is underway, where pre-trained multilingual acoustic models can provide useful crosslingual speech representations that can be fine-tuned for effective bilingual Welsh and English speech recognition. There are many other bilingual situations where a similar approach could be explored.

In order to fill these gaps Welsh needs to be able to join in large-scale multinational and multilingual research and development programmes of the type previously reserved for official EU languages. Also, in common with other minoritised languages, Welsh needs a space within the European community where special attention can be paid to up-resourcing these languages and up-skilling their communities. Minoritised European languages often also belong to the economic periphery in Europe, and using LT for economic regeneration in those areas would have a positive effect on their economic, social and linguistic well-being.
It is often more attractive to court new and exciting project ideas. Funding opportunities are often prejudiced in favour of such ventures, but attention also needs to be paid to maintaining, improving, consolidating and further developing existing tools and resources. At the same time minoritised languages need to take full advantage of any emerging innovations, playing their full part in the LT developments for Europe.

References

Cunliffe, Daniel, Andreas Vlachidis, Daniel Williams, and Douglas Tudhope (2021). “Natural language processing for under-resourced languages: Developing a Welsh natural language toolkit”. In: Computer Speech & Language 72.
Evas, Jeremy (2014). Y Gymraeg yn yr Oes Ddigidol – The Welsh Language in the Digital Age. META-NET White Paper Series: Europe’s Languages in the Digital Age. Heidelberg etc.: Springer. http://www.meta-net.eu/whitepapers/volumes/welsh.
Jones, Dewi Bryn (2020). “Macsen: A Voice Assistant for Speakers of a Lesser Resourced Language”. In: Proceedings of the 1st SLTU-CCURL workshop. Marseille, France: European Language Resources Association (ELRA).
Knight, Dawn, Steve Morris, Tess Fitzpatrick, Paul Rayson, Irena Spasić, and Enlli Môn Thomas (2020). The National Corpus of Contemporary Welsh: Project Report; Y Corpws Cenedlaethol Cymraeg Cyfoes: Adroddiad y Prosiect. https://corcencc.org/wp-content/uploads/2020/06/CorCenCC-report_2020_en.pdf.
Moseley, Christopher (2010). Atlas of the World’s Languages in Danger. Paris: UNESCO.
National Assembly for Wales (2011). Welsh Language (Wales) Measure 2011. https://www.legislation.gov.uk/mwa/2011/1/contents/enacted.
Prys, Delyth (2006). “Setting the Standards: Ten Years of Welsh Terminology Work”. In: Terminology, Computing and Translation. Ed. by Pius ten Hacken. Tübingen: Narr.
Prys, Delyth and Dewi Bryn Jones (2018). “National Language Technologies Portals for LRLs: A Case Study”. In: Human Language Technology. Challenges for Computer Science and Linguistics. Cham: Springer.
Prys, Delyth, Gruffudd Prys, and Dewi Bryn Jones (2016).
“Cysill Ar-lein: A Corpus of Written Contemporary Welsh Compiled from an On-line Spelling and Grammar Checker”. In: Proceedings of LREC 2016. Portorož, Slovenia: European Language Resources Association (ELRA).
Prys, Delyth, Gareth Watkins, and Stefano Ghazzali (2022). Deliverable D1.34 Report on the Welsh Language. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/language-report-welsh.pdf.
Welsh Gov. (2015). Welsh language use in Wales, 2013–15. https://www.gov.wales/sites/default/files/statistics-and-research/2018-12/160301-welsh-language-use-in-wales-2013-15-en.pdf.
Welsh Gov. (2017). Cymraeg 2050: A million Welsh speakers. https://www.gov.wales/sites/default/files/publications/2018-12/cymraeg-2050-welsh-language-strategy.pdf.
Welsh Gov. (2018). Welsh language technology action plan. https://www.gov.wales/sites/default/files/publications/2018-12/welsh-language-technology-and-digital-media-action-plan.pdf.
Welsh Gov. (2021). Internet skills and online public sector services (National Survey for Wales): April 2019 to March 2020. https://www.gov.wales/internet-skills-and-online-public-sector-services-national-survey-wales-april-2019-march-2020-html.

Part II
European Language Equality: The Future Situation in 2030 and beyond

Chapter 38
Consulting the Community: How to Reach Digital Language Equality in Europe by 2030?
Jan Hajič, Maria Giagkou, Stelios Piperidis, Georg Rehm, and Natalia Resende

Abstract This chapter describes the community consultation process carried out in the European Language Equality (ELE) project concerning the future situation in 2030. Due to its central status for the future-looking activities within the project, this chapter introduces the second part of the present book. We gathered, analysed and structured the views, visions, demands, needs and gaps of European Language Technology (LT) developers, both industry and academia, and European LT users and consumers. Additionally, based on these collected findings and other evidence, we attempted to derive a thorough description of the steps to take to reach Digital Language Equality (DLE) in Europe by the year 2030 and, moreover, what the field of LT will look like in Europe in about ten years from now.1

1 Introduction

The goal of WP2, “European Language Equality – The Future Situation in 2030”, of the European Language Equality (ELE) project was the collection of a vast amount of input for the Strategic Research, Innovation and Implementation Agenda (SRIA) and Roadmap and the production of several reports by a broad and diverse spectrum of stakeholders – from research through industry to users – about their views, visions, demands, needs and gaps related to LT, language-centric AI and DLE, while at the same time anticipating the expected developments over the next ten years. The activities in the project put a special focus upon ways and means of achieving DLE by 2030 through the development, implementation and use of LT, in order to make Europeans of all regions and origins truly equal when accessing and interacting with education, business, governments and public services in their own language. A large part of the information eventually integrated into the SRIA was collected through carefully designed surveys distributed to researchers, developers, innovators and users and their communities as well as through reports produced by a number of ELE consortium partners. This chapter describes the overall methodology of the community consultation approach applied in the project and the various reports produced. The collected findings, presented in the subsequent chapters, have been used as input for the development of the SRIA (see especially Chapter 45 and the other chapters of Part II of the present book).

Section 2 provides a description of the overall methodology. The following three sections specify how the consortium conducted consultations with the European LT developers (Section 3), European LT users (Section 4) and European citizens (Section 5). Section 6 describes the preparation process of the four technology deep dives (included in this book in Chapters 40 to 43). Section 7 explains the instruments used for the collection of additional input and feedback. Section 8 concludes the chapter.

Jan Hajič
Charles University, Czech Republic, hajic@ufal.mff.cuni.cz
Maria Giagkou · Stelios Piperidis
R.C. “Athena”, Greece, mgiagkou@athenarc.gr, spip@athenarc.gr
Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de
Natalia Resende
Dublin City University, ADAPT Centre, Ireland, natalia.resende@adaptcentre.ie

1 This chapter is an abridged version of Hajič et al. (2021).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_38

2 Methodology

Our primary objective in the ELE project was the preparation of the Strategic Research, Innovation and Implementation Agenda and Roadmap for achieving full DLE in Europe by 2030 (see Chapter 45).
Since the overarching goal of achieving DLE involved a large number of stakeholders, the process of preparing, discussing and finalising the different parts of the strategic agenda and roadmap was carried out by all 52 partners of the consortium and the wider European LT community, which we involved via the consortium partners’ networks and connections.

The project made use of the support of the consolidated European LT research and industry community – brought together through previous projects such as META-NET and CRACKER – and produced a convincing, sustainable and evidence-based agenda and roadmap. Only with the input and feedback from experts working in different areas of our core field of Computational Linguistics and LT, and also on the borders to other fields such as Cognitive Science, AI, Machine Learning, Data Science and Knowledge Technologies, could the agenda and roadmap be prepared in a way that was goal-oriented, all-encompassing, realistic, supported and overall meaningful. Only with the inclusion of representatives from various companies active in the field did the involvement of industry make sense in the grander scheme of things, especially regarding the inclusion of their needs and goals. The same holds for the non-industrial but important stakeholders such as users and consumers, in areas such as Digital Humanities/Social Science and Humanities (DH/SSH) research, policymaking, normative language policy (including policy for minority languages), education and others.

At the most abstract level, our main approach was twofold: we distinguished between input for the agenda and roadmap generated within the consortium, and input generated by organisations not participating as partners in the ELE project (through surveys, interviews, external consultation meetings, etc.).
When putting the consortium together, we opted for a large number of partners that cover many relevant areas that needed to be taken into account for the development of the strategic agenda. The consortium-internal and consortium-external stakeholders’ input and feedback was systematically collected, structured and included in the agenda and roadmap development process, resulting in an all-encompassing, coherent and convincing strategic roadmap with agreed-upon research questions and research goals, realistic timing, and a meaningful plan.

To come up with suggestions and recommendations on how to achieve full DLE in Europe by 2030, we distinguished between two main stakeholder groups: 1. LT developers (industry and academia) and 2. LT users and consumers. Both groups were represented in ELE with several networks, initiatives and associations that produced one report each, together with their respective constituencies, highlighting their own individual views, needs, wishes, demands and contributions towards DLE. The industry partners of the ELE consortium generated, in various tandem groups, four technology deep dives to provide, similarly, the views, needs, wishes, demands and contributions of the European LT industry, structured into 1. Machine Translation (see Chapter 40), 2. Speech (see Chapter 41), 3. Text Analytics (see Chapter 42) and 4. Data and Knowledge (see Chapter 43). We also carried out additional surveys and consultation meetings as well as interviews with stakeholders who were not represented in the consortium.

The methodology applied was based on a number of stakeholder-specific surveys (inspired by Rehm and Hegele 2018) as well as collaborative document preparation that also involved technology forecasting. Both approaches were complemented with the collection of additional input and feedback through various online channels (see Figure 3 in Chapter 1 on page 7).
As Table 1 illustrates, the two main targeted stakeholder groups differ in one substantial way: while the group of commercial or academic LT developers was, in a certain way, closed and well represented through relevant organisations, networks and initiatives in the ELE consortium, the group of LT users is an open set of stakeholders that was only partially represented through relevant organisations, networks and initiatives in the consortium. Both stakeholder groups were addressed with targeted and stakeholder-specific surveys that were distributed to the relevant stakeholders through the responsible ELE partners. In addition, we communicated with additional stakeholders, primarily through interviews.

3 The Perspective of European Language Technology Developers

One mission-critical aspect when it came to consulting the community was the collection of views, demands, needs, ideas and visions with regard to the wider topic of DLE from the community of European LT developers and also the highly diverse group of European LT users. This section describes the process for engaging with LT developers (supply side) while Section 4 describes how we collaborated with LT users (demand side) with regard to their visions for 2030; as such, these sections are follow-ups that cover the forward-looking projections of the same stakeholder groups whose views as of 2022 are presented in Chapter 4 (Section 3, p. 84ff.).

Task 2.1 The perspective of European LT developers (industry and research)
Stakeholder group: European LT developers (industry and academia). Closed set that is well represented through relevant organisations, networks and initiatives in the ELE consortium.
• Instruments: Surveys, interviews
• Approach further detailed in Section 3 of this chapter
• Results reported in Thönnissen (2022), Eskevich and Jong (2022), Rufener and Wacker (2022), Hajič et al. (2022), Hegele et al. (2022)

Task 2.2 The perspective of European LT users and consumers
Stakeholder group: All potential European LT users. Open set that is only partially represented through relevant organisations, networks and initiatives in the ELE consortium.
• Instruments: Surveys, interviews
• Approach further detailed in Section 4 of this chapter
• Results reported in Gísladóttir (2022), Kirchmeier (2022), Hicks (2022), Blake (2022), Hrasnica (2022), Heuschkel (2022)

Task 2.3 Science – Technology – Society: Language Technology in 2030
Stakeholder group: Prominent companies of the European LT developer landscape, all represented in the ELE consortium. Closed set.
• Instrument: Collaboratively created technology deep dives
• Approach further detailed in Section 6 of this chapter
• Results reported in deliverables Bērziņš et al. (2022), Backfried et al. (2022), Gomez-Perez et al. (2022), Kaltenböck et al. (2022)

Table 1 Stakeholder groups and instruments relevant for the three tasks in WP2

We analysed the views of European LT developers and providers, i.e., representatives both from industry and academia, to investigate their ideas, demands, visions and predictions with regard to DLE going towards 2030. We explored the factors that drive their development plans and investments (e.g., market demand, number of speakers, available funds etc.)
and the perceived obstacles that should be overcome to achieve DLE. The main instrument for collecting the LT developers’ views was a set of surveys, which were distributed through the established research and industry networks of the ELE consortium to their members. In addition, the survey was forwarded to other pan-European initiatives, thus covering the widest possible range from generic AI to media- and language-related infrastructures. The data collection activity was supplemented by focused meetings and interviews with targeted informants who were selected based on either the quality of their input to the survey or their prominence in and impact on the European LT landscape. The collected feedback of the European LT developers was augmented with additional input produced by the networks, analysed and consolidated in five reports (see Table 1).

3.1 Stakeholders

The European LT developers are a diverse group of stakeholders, comprising academic and industrial researchers in the field of LT/NLP. In addition to conducting research, the members of this group also develop pre-commercial prototypes, algorithms, applications and systems. They can also be innovators and entrepreneurs who productise and commercialise LTs to address, among others, the needs for digital content analysis and generation as well as for pertinent content transformation and dissemination. An initial grouping is, thus, LT research (academia) and LT industry (also see Chapter 4).

Europe has a long-standing tradition in LT with over 800 centres (Rehm et al. 2023a, 2020) performing excellent, highly visible and internationally recognised research on almost all European and also many non-European languages. The European LT industry has been estimated to comprise 473 LT vendors in the EU26 plus Iceland and Norway in 2017 (Vasiljevs et al. 2019). The ELG catalogue comprises more than 800 commercial entities, also including integrators and a certain number of user companies (Rehm et al. 2021, 2023a).
While LT is at the intersection of Linguistics and Computational Linguistics, Computer Science and Artificial Intelligence, we also take relevant neighbouring fields into account, especially Digital Humanities/Social Science and Humanities (DH/SSH).

With the aim of informing the ELE SRIA with the opinions, views and demands of the widest possible group of these stakeholders, we mobilised existing European networks, associations, initiatives and projects. Some of the well-established and long-standing pan-European LT networks were represented in the ELE consortium (Table 2). The ELE partners that represented these initiatives contributed their views to the project and also facilitated access to and elicitation of the views of their constituency and members with regard to how DLE can be achieved by 2030. They coordinated the distribution of a questionnaire to their members, conducted interviews and focused consultation meetings, where needed and appropriate (see Section 3.2 and Table 1). While these stakeholders already represented a significant part of the European LT community, we engaged additional initiatives in the consultation process (see Hajič et al. 2021, for further details).

Initiative: META-NET
Description: The META-NET Network of Excellence consists of 60 research centres in 34 European countries. It develops the technical foundations of a multilingual, inclusive and innovative European society, supporting all European languages.
Stakeholder group: European LT community (especially research)

Initiative: ELG
Description: The European Language Grid (ELG) project developed a cloud platform and marketplace for the whole European LT community. The shared platform includes language resources, datasets and services to benefit European society and industry. It addresses the fragmentation of the European LT landscape.
Stakeholder group: European LT community

Initiative: LT-Innovate
Description: LT-Innovate is the European LT industry association with more than 200 members. It supports its members by promoting the industry as a whole in the most promising target markets.
Stakeholder group: European LT industry

Initiative: CLARIN
Description: The European Research Infrastructure for Language Resources and Technology consists of more than 20 national consortia, which themselves consist of multiple partners. CLARIN makes language resources available to researchers and students from all disciplines, especially in the humanities and social sciences, through single sign-on access.
Stakeholder group: European DH, NLP, SSH community

Initiative: CLAIRE
Description: The Confederation of Laboratories for AI Research in Europe has 394 members in 36 countries. CLAIRE seeks to strengthen European excellence in AI research and innovation across all of AI, for all of Europe, with a human-centred focus. It is now supported by nine EU Member State governments.
Stakeholder group: European AI community

Table 2 LT developer communities represented in the ELE consortium who shared their views in dedicated reports

3.2 Instruments

To collect and analyse the LT developers’ views, demands, visions and predictions, we adopted an inclusive and participatory approach, through which every voice was enabled to find its way into the SRIA. We reached out to as many representatives of the LT community as possible and elicited their educated views in a structured, yet flexible, way. Two main instruments were used to collect the views of the European LT developers: surveys (Section 3.2.1) as well as interviews and focused consultation meetings (Section 3.2.2).

3.2.1 Survey

The LT developer survey attempted to elicit views in a structured way that lent itself to the efficient analysis, consolidation and integration of the feedback in the respective project reports, which, in turn, were fed into the SRIA (Chapter 45).
Driven by the envisaged topics that the final SRIA was intended to cover, the survey encompassed closed and open-ended questions to inquire about the LT developers’ future predictions and visions. The overall structure of this online survey is described in Chapter 4 (Section 3, p. 84ff.), and the forward-looking questions, in particular, were gathered in a specific part, as follows:

• Predictions and visions for the future: This part of the stakeholder survey was forward-looking and investigated ideas, predictions and wishes of the LT community about how the LT field as a whole will be able to equally support all European languages by 2030, i.e.,
  – policies or instruments that could contribute to speeding up the effective deployment of LT in Europe equally for all languages;
  – prediction of future opportunities for LT in basic and applied research (scientific vision) and in innovation and industry;
  – expectations with regard to the challenges a large-scale, long-term ELE programme can address by 2030.

3.2.2 Interviews and focused consultation meetings

To supplement the survey responses and to collect more detailed feedback, where appropriate, we conducted interviews and consultation meetings with targeted informants who were selected based on either the quality of their input to the survey or their prominence in and impact on the European LT landscape. Operationally, the selection of stakeholders to be interviewed was based on the following criteria.

1. The respondent had partially filled in the survey and some essential input was missing in order to have a more complete understanding of his/her views; or
2. No member of a network or association (see Section 3.1) had filled in the survey.

In the first case, we asked for a short and focused meeting with the respondent to elicit the missing information. In the second case, when a network or association that was considered a stakeholder for ELE was not represented, we identified key persons and conducted an interview. The key details of the respondents are described in Chapter 4 (Section 3, p.
84ff.), while the results and findings of the survey and consultations with LT developers concerning the future situation in 2030 are discussed in Chapter 39 (Section 2, p. 246ff.); their views have been taken on board in the ELE SRIA (Chapter 45).

4 The Perspective of European Language Technology Users

This section describes our approach to gathering the voices of the highly heterogeneous and diverse group of European LT users and consumers as the final “beneficiaries” of LT with regard to the necessary and desired developments supporting DLE for all European languages by 2030. This activity required engagement with individuals, representative public bodies and government units, organisations and businesses, including SMEs as well as larger companies, that use LT. We also explored the factors that can promote language equality in the users’ and consumers’ view, especially with regard to encouraging the uptake of missing or poor LTs that can solve real communication problems for the members of all European language communities. Special attention was paid to the speakers of lesser-served languages, particularly those that face digital extinction or neglect, eliciting from the LT users of such language communities indications of necessary or desirable developments that are expected to put their own languages on an equal footing with the dominant ones by 2030. A complementary focus considered the perceived obstacles that hinder full DLE, so that effective remedial action can be promptly taken. We followed the same approach as for the supply side (Section 3), i.e., based on surveys and structured templates several reports have been produced by the ELE consortium members who represented relevant stakeholder groups.

4.1 Stakeholders

LT users and consumers comprise a broad group of stakeholders from a wide variety of domains and sectors. We reached out to representatives from public administration (public bodies and government units), organisations and businesses, including SMEs as well as larger companies, that currently use and benefit from LT, as well as individuals.
Six stakeholders are represented in the ELE consortium with a special focus on speakers of lesser-served languages, particularly those that face digital extinction or neglect (see Table 3).

In addition to the reports produced by these six core representative bodies and ELE partners, other relevant external stakeholders were consulted as well. The inclusion of additional groups ensured the widest possible coverage and promoted our inclusive approach to build a comprehensive, accurate and all-encompassing SRIA and roadmap towards achieving full DLE in Europe by 2030 (presented in Chapter 45).

4.2 Instruments

In a similar way as described in Section 3.2 for the stakeholder class of LT developers, surveys and focused consultation meetings were used to collect and analyse the perspective of European LT users, i.e., their views, ideas, demands, future visions and predictions with regard to DLE. Our goal was to consult with as many representatives of this stakeholder class as possible to collect their opinions in a structured, yet unconstrained, way.

Initiative: ECSPM
Description: The European Civil Society Platform for Multilingualism is an alliance for the languages spoken in Europe (national/official, minority, regional and autochthonous, as well as the languages of immigrant communities). It includes networks of more than 200 European associations, societies and organisations that view multilingualism as an asset for European economic, social and cultural development, as well as a facilitator for intellectual and personal growth. It is a fervent voice of Europe’s civil society promoting languages, language policies and research on multilingualism.
Stakeholder group: European Platform for Multilingualism

Initiative: EFNIL
Description: The European Federation of National Institutions of Language is a pan-European organisation that was founded in 2003. EFNIL has 41 members from 27 countries and provides a forum for these institutions to exchange information about their work and to gather and publish information about language use and language policy within the EU.
Stakeholder group: European National Languages

Initiative: ELEN
Description: The European Language Equality Network is an international NGO for the protection and promotion of European lesser-used languages gathering 166 member organisations representing 46 languages in 23 European states. Founded in 2012, it represents the voice of grassroots European RML civil society.
Stakeholder group: European Regional, Minority and Endangered Languages

Initiative: LIBER
Description: The Association of European Research Libraries is Europe’s principal association of research libraries, consisting of nearly 450 national, university and other libraries from more than 40 countries. LIBER helps European research libraries to ensure the preservation of European cultural heritage, to improve access to collections, and to provide more efficient information services. Enabling Open Science is a major priority, as is promoting innovative scholarly communication, fostering digital skills and services, and engaging with world-class e-infrastructures.
Stakeholder group: European Research Libraries

Initiative: NEM
Description: New European Media is the leading European Network for Media and Creative Industries with the mission to foster the impact of interactive technologies on the future of new media through interaction between media, content, creative industries, social media, broadcasting and telecom sectors as well as consumer electronics, represented by more than 1,000 members. The application of the newest technologies in respect to equal access to media for all is one of its higher priorities.
Stakeholder group: European New Media Community

Initiative: Wikipedia
Description: Wikimedia Deutschland is an independent, charitable membership-based non-profit organisation that serves as the German chapter of the global Wikimedia movement. With more than 140 employees it is the oldest and largest of about 40 independent chapters.
Stakeholder group: European Free Knowledge Community

Table 3 LT users and consumers represented in the ELE consortium who shared their views in dedicated reports

4.2.1 Survey

Similarly to the survey for LT developers, feedback from the LT users and consumers was collected in a structured way that lends itself to the efficient analysis, consolidation and integration of the feedback into the ELE SRIA (see Table 1). Driven by the envisaged topics that the final strategic agenda would cover, this survey encompassed closed and open-ended questions to understand the LT users’ and consumers’ future predictions and visions with regard to DLE in Europe. The survey had four parts and encompassed 63 questions in total. Some of the questions depended on previous answers. As a result, a respondent was presented with 30 (minimum) to 63 (maximum) questions, including the “if other” questions. If a respondent was presented with the maximum set of questions, 46 questions were mandatory, and 33 of them were closed (single or multiple choice). In particular, beyond the preliminary sections covering demographic information and the language(s) for which the respondents used LRTs, the last part of the questionnaire is of interest here, as it focused on the forward-looking opinions of the LT users going towards full DLE in Europe by 2030:

• Predictions and visions for the future: This part of the online survey for LT users investigated ideas, predictions and wishes about how DLE can be achieved in Europe by 2030, i.e.,
  – policies or instruments that could contribute to speeding up the effective deployment of LTs in Europe equally for all languages;
  – expectations with regard to the challenges that a large-scale, long-term ELE programme can address by 2030.

The survey was circulated through the networks and associations described in Section 4.1 (also see Table 3) and through additional channels (see Section 7).
It was set up as an online form for easy distribution as well as analysis of responses.

4.2.2 Interviews and focused consultation meetings

To complement the survey responses of the six LT user and consumer stakeholder groups represented in the ELE consortium, we conducted consultation meetings with targeted informants. The approach was similar to the one described in Section 3.2.2.

5 The Perspective of Europe’s Citizens

In addition to the consultation with the more focused stakeholder groups (Sections 3 and 4), a large-scale, online and multilingual survey targeting Europe’s citizens was carried out with the aim of taking into account their opinions, individual needs, wishes and general demands as well as to make sure that their voices play a decisive role in the pursuit of full DLE. This consultation with a larger and more diverse cohort of LT consumers allowed us to obtain an accurate picture of the current scenario in terms of LT support across European languages and have a more representative basis for a technological and scientific forecasting on how LTs can be deployed and applied in Europe by 2030 to the benefit of all European citizens.

Different survey platforms were tested to choose the most suitable one for our needs. After setting up the survey in the platform of our choice, it was disseminated in 28 European countries and in 38 European languages from January 2022 to 01 May 2022. The survey included a total of 11 questions: four were single-choice, six were multiple-choice and one was open-ended, allowing respondents to include any comments or feedback they had. These 11 questions could be answered in about five minutes via computers or mobile devices. More details concerning the translation of the online survey into several languages and its careful and well-balanced distribution are given in Chapter 4 (Section 3, p. 84ff.).
After a few initial survey items that aimed at understanding the respondents’ profiles and language backgrounds, their level of familiarity with terms from the field of LTs was checked through a multiple-choice question that asked them to select the terms (e.g., “Information Retrieval”, “Natural Language Processing”, “Natural Language Understanding”) that they were familiar with or could immediately recognise. The questions of particular interest here were the final two about the future of LTs in Europe, which “requested respondents to indicate the tools they would like to use in the future if not currently available in their languages and also to rate the top three advantages of improving LTs for all languages”.

6 Predicting Language Technology in 2030: Technology Deep Dives

The ELE project also attempted to assess and predict, in a dedicated forward-looking task, what the field of LT will look like in 2030. To this end, we collected, analysed and consolidated the views of European LT industrial and academic stakeholders on anticipated future technological progress, innovations and impact on society in the coming decade, with a special emphasis on technologies, resources, approaches, coverage and performance needed to achieve DLE by 2030.

The task was set up to seek agreement among these stakeholders in terms of pinpointing novel or significantly extended or adapted technologies that would ultimately enable or contribute to DLE, and consequently help bring about true digital equality in European society. To achieve these goals, such new technologies would have to take into account the state-of-the-art in various LT and AI areas, including the reasons why current technologies do not perform equally well for all languages (e.g., due to lack of data, poor-quality data, language properties, knowledge collectively and indirectly acquired for only some languages in the past, etc.) as well as the reasons for biased results in some areas. Focusing on possible methods, technologies and processes for bringing all European languages on par both technologically and in consumer applications, there was a unifying theme, namely, to discover and explore ways to convert the unique challenges of a diverse European multilingual society into opportunities and technologies, processes and services superior to those developed in the context of largely homogeneous linguistic societies.

We also took a fresh look at deployment, i.e., how LTs would be made available to the different stakeholders and end-users, from machines to household appliances to mobile devices and perhaps even “invisible” devices. To achieve these goals, structured document templates and also surveys oriented towards technological development and technology forecasting along the aforementioned lines (see Sections 3 and 4, respectively) were used by both industrial and academic stakeholders, and then assembled into four project reports, reflecting the major technology areas (Machine Translation, Speech Technologies, Text Analytics, Data and Knowledge).

Four ELE partners were selected to lead the development of these technology deep dives, which are presented in abridged form in Chapter 40 (p. 263ff.) on Machine Translation, Chapter 41 (p. 289ff.) on Speech Technologies, Chapter 42 (p. 313ff.) on Text Analysis, and Chapter 43 (p. 337ff.) on Data, Knowledge and Language Resources. They collaborated closely with other ELE partners who also work in the respective fields. The four authoring teams made use of existing scientific publications, reports and foresight studies as well as science and technology predictions.
In this way, the respective groups of experts developed a consolidated opinion with regard to the direction in which the relevant field is moving or should be moving, what the current gaps and roadblocks as well as the industry’s needs from research are, and what they can contribute to DLE.²

² This approach was inspired by the methodology followed in META-NET, in which “vision groups” worked on similar documents (“vision papers”, see, for example, Rehm and Uszkoreit 2013).

7 Collecting Additional Input and Feedback

Complementing the instruments described above, we set up additional ways of collecting input for the emerging SRIA. We wanted to enable all stakeholders to communicate with ELE easily so that their opinions and ideas could be integrated into our recommendations. Over several months throughout 2022, the emerging ELE results were disseminated through various channels (e.g., website, publications, presentations, social media, etc.), and we solicited input by actively asking stakeholders for feedback, or by actively listening, especially on social media, to identify additional opinions regarding our topic (see Rehm et al. 2023b).

7.1 Conferences and Workshops

ELE results were presented and discussed at many different conferences and workshops. One example was the presentation of the pre-final ELE recommendations at META-FORUM 2022 in June 2022, which resulted in a valuable discussion with the audience in terms of, among others, additional aspects to take into account.³ The final recommendations were presented at the STOA workshop “Towards full digital language equality in a multilingual European Union” held at the European Parliament in November 2022.⁴

7.2 Project Website

An interactive contact form was implemented on the ELE website through which interested stakeholders could – and still can – communicate with the ELE team.⁵ We also distributed all reports through the website to enable others to provide feedback.⁶

7.3 Social Media

Social media activities in ELE concentrated on LinkedIn and Twitter.
We used LinkedIn⁷ to address professional stakeholders including LT developers and users. In contrast, while Twitter⁸ was primarily used for reaching out to European citizens, it was also used by many stakeholders for professional communication purposes. The social media activities of the ELE project were planned and executed in close collaboration with the ELG project. To be able to disseminate news about both activities through these joint channels, we subsumed the two initiatives under the title “European Language Technology” (ELT, for more details see Rehm et al. 2023b). A biweekly newsletter with updates and highlights from the ELE SRIA was circulated to a large and diverse audience of around 4000 recipients, also inviting input and feedback.⁹

³ https://www.european-language-grid.eu/events/meta-forum-2022
⁴ https://www.europarl.europa.eu/stoa/en/events/details/towards-full-digital-language-equality-i/20220711WKS04301
⁵ https://european-language-equality.eu/contact/
⁶ https://european-language-equality.eu/deliverables/
⁷ https://www.linkedin.com/company/european-language-technology/
⁸ https://twitter.com/EuroLangTech
⁹ https://www.european-language-technology.eu

8 Summary and Conclusions

This chapter describes the consultation process carried out under the umbrella of WP2, “European Language Equality – The Future Situation in 2030”, in the ELE project. It is meant to be a brief summary that illustrates the guidelines as well as instructions specified with regard to the implementation of our internal processes and instruments applied by all actively involved partners, especially with regard to reaching out to and gathering feedback and input from European LT developers and European LT users and consumers, but also with regard to technology forecasting through the four technological deep dives.
These activities had an important role within the ELE project: they defined all aspects of the future situation with regard to DLE by 2030. Due to this important, mission-critical role in the project, all involved stakeholders were made aware of the different aspects and dimensions the project needed to provide input for when it came to assembling the final recommendations for the SRIA.

The main findings of the consultation process briefly summarised in the present chapter are presented in the subsequent chapters. Chapter 39 presents the results of the different surveys. Abridged versions of the four technology deep dives are presented in Chapters 40 (Machine Translation), 41 (Speech Technologies), 42 (Text Analytics) and 43 (Data and Knowledge Technologies). Finally, a compact but comprehensive summary of the ELE SRIA and Roadmap is presented in Chapter 45.

References

Backfried, Gerhard, Marcin Skowron, Eva Navas, Aivars Bērziņš, Joachim Van den Bogaert, Franciska de Jong, Andrea DeMarco, Inma Hernaez, Marek Kováč, Peter Polák, Johan Rohdin, Michael Rosner, Jon Sanchez, Ibon Saratxaga, and Petr Schwarz (2022). Deliverable D2.14 Technology Deep Dive – Speech Technologies. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/speech-deep-dive.pdf.

Bērziņš, Aivars, Mārcis Pinnis, Inguna Skadiņa, Andrejs Vasiljevs, Nora Aranberri, Joachim Van den Bogaert, Sally O’Connor, Mercedes García–Martínez, Iakes Goenaga, Jan Hajič, Manuel Herranz, Christian Lieske, Martin Popel, Maja Popović, Sheila Castilho, Federico Gaspari, Rudolf Rosa, Riccardo Superbo, and Andy Way (2022). Deliverable D2.13 Technology Deep Dive – Machine Translation. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/MT-deep-dive.pdf.

Blake, Oliver (2022). Deliverable D2.10 Report from LIBER. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-LIBER.pdf.
Eskevich, Maria and Franciska de Jong (2022). Deliverable D2.3 Report from CLARIN. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-CLARIN.pdf.

Gísladóttir, Guðrún (2022). Deliverable D2.7 Report from ECSPM. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ECSPM.pdf.

Gomez-Perez, Jose Manuel, Andres Garcia-Silva, Cristian Berrio, German Rigau, Aitor Soroa, Christian Lieske, Johannes Hoffart, Felix Sasaki, Daniel Dahlmeier, Inguna Skadiņa, Aivars Bērziņš, Andrejs Vasiljevs, and Teresa Lynn (2022). Deliverable D2.15 Technology Deep Dive – Text Analytics, Text and Data Mining, NLU. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/text-analytics-deep-dive.pdf.

Hajič, Jan, Maria Giagkou, Stelios Piperidis, Georg Rehm, and Natalia Resende (2021). Deliverable D2.1 Specification of the consultation process. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-process.pdf.

Hajič, Jan, Tea Vojtěchová, and Maria Giagkou (2022). Deliverable D2.5 Report from META-NET. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-META-NET.pdf.

Hegele, Stefanie, Katrin Marheinecke, and Georg Rehm (2022). Deliverable D2.6 Report from ELG. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ELG.pdf.

Heuschkel, Maria (2022). Deliverable D2.12 Report from Wikipedia. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-Wikipedia.pdf.

Hicks, Davyth (2022). Deliverable D2.9 Report from ELEN. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ELEN.pdf.

Hrasnica, Halid (2022). Deliverable D2.11 Report from NEM. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-NEM.pdf.

Kaltenböck, Martin, Artem Revenko, Khalid Choukri, Svetla Boytcheva, Christian Lieske, Teresa Lynn, German Rigau, Maria Heuschkel, Aritz Farwell, Gareth Jones, Itziar Aldabe, Ainara Estarrona, Katrin Marheinecke, Stelios Piperidis, Victoria Arranz, Vincent Vandeghinste, and Claudia Borg (2022). Deliverable D2.16 Technology Deep Dive – Data, Language Resources, Knowledge Graphs. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/data-knowledge-deep-dive.pdf.

Kirchmeier, Sabine (2022). Deliverable D2.8 Report from EFNIL. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-EFNIL.pdf.

Rehm, Georg and Stefanie Hegele (2018). “Language Technology for Multilingual Europe: An Analysis of a Large-Scale Survey regarding Challenges, Demands, Gaps and Needs”. In: Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018). Ed. by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA, pp. 3282–3289. https://aclanthology.org/L18-1519.pdf.

Rehm, Georg, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou (2023a). “Language Technology Companies, Research Organisations and Projects”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 171–185.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Khalid Choukri, Andrejs Vasiljevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriute, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadiņa, Marko Tadić, Dan Tufiș, Tamás Váradi, Kadri Vider, Andy Way, and François Yvon (2020). “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325. https://www.aclweb.org/anthology/2020.lrec-1.407/.

Rehm, Georg, Katrin Marheinecke, and Jens-Peter Kückens (2023b). “European Language Technology Landscape: Communication and Collaborations”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 189–204.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiljevs, Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Galanis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid Choukri, Valérie Mapelli, Mickaël Rigault, Julija Melnika, Miro Janosik, Katja Prinz, Andres Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. https://www.aclweb.org/anthology/2021.eacl-demos.26.pdf.

Rehm, Georg and Hans Uszkoreit, eds. (2013). The META-NET Strategic Research Agenda for Multilingual Europe 2020. Heidelberg etc.: Springer. http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf.

Rufener, Andrew and Philippe Wacker (2022). Deliverable D2.4 Report from LT-innovate. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-LTInnovate.pdf.

Thönnissen, Marlies (2022). Deliverable D2.2 Report from CLAIRE. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-CLAIRE.pdf.

Vasiljevs, Andrejs, Khalid Choukri, Luc Meertens, and Stefania Aguzzi (2019). Final study report on CEF Automated Translation value proposition in the context of the European LT market/ecosystem. DOI: 10.2759/142151. A study prepared for the European Commission, DG Communications Networks, Content & Technology by Crosslang, Tilde, ELDA, IDC.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 39
Results of the Forward-looking Community-wide Consultation

Emma Daly, Jane Dunne, Federico Gaspari, Teresa Lynn, Natalia Resende, Andy Way, Maria Giagkou, Stelios Piperidis, Tereza Vojtěchová, Jan Hajič, Annika Grützner-Zahn, Stefanie Hegele, Katrin Marheinecke, and Georg Rehm

Abstract Within the ELE project three complementary online surveys were designed and implemented to consult the Language Technology (LT) community with regard to the current state of play and the future situation in about 2030 in terms of Digital Language Equality (DLE). While Chapters 4 and 38 provide a general overview of the community consultation methodology and the results with regard to the current situation as of 2022, this chapter summarises the results concerning the future situation in 2030. All of these results have been taken into account for the specification of the project’s Strategic Research, Innovation and Implementation Agenda (SRIA) and Roadmap for Achieving Full DLE in Europe by 2030.¹

1 Introduction

Within ELE three complementary online surveys were designed and implemented in order to consult the Language Technology (LT) community with regard to the current state of play and the future situation in about 2030 in terms of Digital Language Equality (DLE).
While Chapter 38 provides a general overview of the community consultation process and methodology and Chapter 4 in Part I gives a brief account of the results with regard to the current situation in 2022/2023, the present chapter summarises our results concerning the future situation. All of these results have been taken into account for the specification of the project’s strategic recommendations (see Chapter 45).

Section 2 summarises the future-looking results with regard to the stakeholder group of European LT developers, introduced in Chapters 4 and 38, whereas Section 3 reports the findings with regard to the stakeholder group of European LT users and consumers. Section 4 describes the findings of the survey in which we reached out to Europe’s citizens to gauge their expectations and desires in terms of DLE by 2030 (see Chapter 4, Section 3, p. 84ff., and Chapter 38, Section 3, p. 231ff.). Section 5 concludes the chapter.

Emma Daly · Jane Dunne · Federico Gaspari · Teresa Lynn · Natalia Resende · Andy Way
Dublin City University, ADAPT Centre, Ireland, emma.daly@adaptcentre.ie, jane.dunne@adaptcentre.ie, federico.gaspari@adaptcentre.ie, teresa.lynn@adaptcentre.ie, natalia.resende@adaptcentre.ie, andy.way@adaptcentre.ie

Maria Giagkou · Stelios Piperidis
R. C. “Athena”, Greece, mgiagkou@athenarc.gr, spip@athenarc.gr

Tereza Vojtěchová · Jan Hajič
Charles University, Czech Republic, vojtechova@ufal.mff.cuni.cz, hajic@ufal.mff.cuni.cz

Annika Grützner-Zahn · Stefanie Hegele · Katrin Marheinecke · Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, annika.gruetzner-zahn@dfki.de, stefanie.hegele@dfki.de, katrin.marheinecke@dfki.de, georg.rehm@dfki.de

¹ This chapter summarises results reported in Way et al. (2022a) and Way et al. (2022b).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_39
2 The Perspective of European Language Technology Developers

The survey targeting LT developers and researchers generated a large number of responses between June and October 2021, representing more than 200 different organisations and more than 30 countries. The survey investigated topics like language coverage and evaluation of the current situation but also predictions and visions for the future. Detailed breakdowns of the results can be found in various ELE project reports (Thönnissen 2022; Eskevich and Jong 2022; Rufener and Wacker 2022; Hajič et al. 2022; Hegele et al. 2022). In addition to the survey, expert interviews with selected representatives from initiatives such as, among others, ELG and META-NET were conducted. The interviewees shared details on their work and related challenges, elaborating on how to do justice to all European languages, ways to position European LT on a global level and the key challenges towards establishing a long-term European LT programme.

2.1 Respondents’ Profiles

One major goal of this survey was to bring the European LT community together and to reach a wide and demographically distributed audience. In total, the LT developers survey was filled in by 321 different respondents who represent 223 different organisations: 73% of the organisations were research or academic institutions (63% universities, 10% research centres) and 22% were companies (17% SMEs, 5% large enterprises). In 5% of responses the type “other” was indicated, i.e., freelancer, private practitioner, government agency, not-for-profit organisation, etc.

The headquarters of these organisations are located in 32 different countries, covering all EU member states and other European countries, such as the UK, Switzerland, Serbia, etc., but also other global regions, e.g., Brazil, the US and Israel. Most responses were contributed from Spain, Germany, Greece, the Czech Republic, and the Netherlands.
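The organisation-type shares reported in Section 2.1 can be cross-checked with a few lines of arithmetic. The following sketch is illustrative only: the percentages and the figure of 223 organisations come from the text, whereas the dictionary layout, the function names and, in particular, the assumption that every percentage refers to the 223 organisations are ours. The derived counts are therefore rough approximations, not reported survey results.

```python
# Shares (in %) as reported in Section 2.1, grouped by organisation category.
ORG_SHARES = {
    "research or academic": {"universities": 63, "research centres": 10},
    "companies": {"SMEs": 17, "large enterprises": 5},
    "other": {"other": 5},
}
N_ORGANISATIONS = 223  # organisations represented in the LT developers survey


def category_totals(shares):
    """Sum the sub-type percentages of each category."""
    return {cat: sum(sub.values()) for cat, sub in shares.items()}


def approx_counts(totals, n):
    """Approximate counts, ASSUMING each percentage refers to n organisations."""
    return {cat: round(pct * n / 100) for cat, pct in totals.items()}


totals = category_totals(ORG_SHARES)
print(totals)                                   # category shares in %
print(sum(totals.values()))                     # should add up to 100
print(approx_counts(totals, N_ORGANISATIONS))   # rough, assumption-based counts
```

The sub-type shares add up to the reported category shares (73% and 22%), and the three categories together account for all responses.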
The respondents cover a wide spectrum of the targeted groups of stakeholders, as apparent from the range of networks, associations and relevant projects ongoing at the time the survey was circulated. The most established research networks in LT/AI, i.e., META-NET, CLARIN and CLAIRE are well represented in the survey responses with about 40 to 90 respondents each. ELG, ELE’s sister project, is represented with more than 50 participants. Other related projects and networks focusing on LT or on neighbouring fields, such as AI4EU, ELISE, ELEXIS, and Nexus Linguarum are represented with around 10 to 25 survey respondents each (Table 1). Additional networks, associations and projects indicated by the respondents include ELRC, ELRA, ACL, EAMT, DARIAH and others.

Initiative           Responses   Interviews
CLAIRE                      37            3
CLARIN                      90            4
ELG                         54           20
LT-Innovate                 18           29
META-NET                    61            5
AI4EU                       16            –
BDVA                        12            –
DIH4AI                       1            –
ELEXIS                      19            –
ELISE                        4            –
HumanE AI                   11            –
Nexus Linguarum             25            –
TAILOR                       9            –
Other                       31            –
None of the above          115            –

Table 1 LT developers survey – survey responses and interviews collected through the participating initiatives

The respondents were mainly active in the following areas: 1. Basic natural language processing services (POS tagging, parsing, named entity recognition etc.), 2. Text analytics and mining, information extraction, text classification, and 3. Language resources (LRs): data production, data aggregation (Figure 1).

[Figure 1 not reproduced.]
Fig. 1 LT areas in which the respondents conduct research or develop tools and services

The technologies, products or services offered by the respondents’ organisations are used in various domains, a finding that demonstrates the applicability of LT in practically all economic sectors. The top three domains indicated by the respondents were 1. Information and communication technologies (ICTs), 2. Digital humanities (DH), arts, culture and other services and 3. Education.

2.2 Language Coverage

The respondents listed a wide range of languages they actively include in their research and development work and for which they offer services, software, resources, models etc. All official EU languages are covered as well as other state official, regional or co-official European languages (see Figure 6 in Chapter 4, p. 86). The five most frequently mentioned languages are English, German, Spanish, French and Italian. A total of 80 respondents indicated “other” languages they support in their products or research: languages spoken in the Middle East and Asia, with Arabic, Chinese, Japanese, Russian and Turkish being the five most frequently mentioned ones. Sign languages were also mentioned.

To get an idea about the focus of future work, the respondents were asked about the languages their organisation does not yet support, but plans to support in the next three years. Apart from some of the big languages, the respondents’ future plans additionally include some regional and minority languages (RMLs), such as Basque, Catalan, Breton, Mirandese, Romani or Aromanian. Sign languages were mentioned five times, and it is worth noting the presence of regional and dialectal varieties in the respondents’ future plans, e.g., Pontic Greek or Spanish varieties.

When considering the top three drivers for the decision to support additional languages (Table 2), the most frequently selected factor is research interest (212 mentions), followed by the availability of LRs (144) and market interest or demand (138). As expected, the prioritisation of these factors is different when the type of organisation the respondent represents is taken into account. For industry (including large enterprises and SMEs) market interest or demand by users or consumers plays a pivotal role, while the availability of LRs follows at a distance. For research organisations and SMEs, more than big organisations, funding and investment opportunities are also to be considered.
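The difference in prioritisation between respondent groups can be made explicit by ranking the drivers per column of Table 2. The following is a minimal sketch: the counts are transcribed from Table 2, while the data layout and the function names (`totals`, `ranked`) are our own illustrative choices and not part of the survey tooling.

```python
# Counts transcribed from Table 2: (research organisation, industry, other).
DRIVERS = {
    "Research or scientific interest": (196, 12, 4),
    "Availability of language resources": (108, 29, 7),
    "Market interest or demand": (65, 66, 7),
    "Available funding or investment": (107, 18, 3),
    "Availability of human experts": (60, 12, 3),
    "Availability of technologies or tools": (44, 18, 5),
    "Other": (69, 14, 4),
}
GROUPS = ("research organisation", "industry", "other")


def totals(drivers):
    """Row totals across all respondent groups."""
    return {name: sum(counts) for name, counts in drivers.items()}


def ranked(drivers, group=None):
    """Driver names sorted by mentions, overall or for a single group."""
    if group is None:
        key = lambda name: sum(drivers[name])
    else:
        col = GROUPS.index(group)
        key = lambda name: drivers[name][col]
    return sorted(drivers, key=key, reverse=True)


print(ranked(DRIVERS)[:3])               # top three drivers overall
print(ranked(DRIVERS, "industry")[:2])   # top two drivers for industry
```

Ranking by the industry column places market interest or demand (66 mentions) ahead of the availability of language resources (29 mentions), which matches the description above, and the row sums reproduce the Total column of Table 2.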
In terms of “other” reasons, these were often specified with an appeal for equality and the need for preserving all languages in the digital age, as for instance in the following answers: “Need for equality”, “Ensure language rights in the digital economy, services, applications”, “Supporting under-represented language communities to work towards the knowledge equity goals”.

Drivers                                  Research organisation   Industry   Other   Total
Research or scientific interest                            196         12       4     212
Availability of language resources                         108         29       7     144
Market interest or demand                                   65         66       7     138
Available funding or investment                            107         18       3     128
Availability of human experts                               60         12       3      75
Availability of technologies or tools                       44         18       5      67
Other                                                       69         14       4      87

Table 2 LT developers survey – the top drivers for the decision to support additional languages

2.3 Predictions for the Future

We were also interested in the respondents’ views on the measures and instruments that are deemed effective as well as the key challenges that a future large-scale ELE programme should address. The participants had the option to rate a number of policies and instruments as either very effective, effective, slightly effective or not effective at all. In addition, respondents were given the opportunity to elaborate on other policies or instruments, which they consider effective in speeding up the development and deployment of LT in Europe equally for all languages. The responses were provided as free text.

A critical aspect of the respondents’ visions for DLE, as brought up in multiple answers, is the availability of resources. By 2030 all European languages should have developed the critical mass of resources needed for developing LTs. These include not only raw data, but also large multilingual language models. The issue of data availability is often mentioned in relation to the legal framework for sharing them.
Large amounts of data for all languages are expected not only to be available by 2030, but also available for free or at a reasonable cost for research and commercial purposes. Standardised training and evaluation data for all languages are deemed critical. In parallel, according to the survey respondents, LT developers will be working towards automated procedures for the construction, annotation and curation of language data, as well as to address the issue of data bias. Such achievements, combined with continuous work on improving transfer learning methods, are expected to contribute to a situation in which all languages, including small, minority and regional ones, enjoy technology support and a level of presence and use in the digital sphere that will ensure their preservation and prosperity.

A shared scientific goal of the LT community is the achievement of Deep Natural Language Understanding by 2030, brought up in numerous responses with various phrasings: “hybrid intelligence”, “cognitive AI”, “symbolic AI”, etc. Nonetheless, all these mentions converge on the description of a future status of LTs where the leap from superficial language processing to language understanding has been achieved and seamless human-like interaction, viable discourse interpretation and ubiquitous natural language interfaces are a reality for all Europeans in their own language.

With respect to measures and instruments that can be employed to help achieve these goals and realise the visions, the respondents evaluated the effectiveness of a set of proposed measures. A long-term programme of ten or more years can potentially lead to groundbreaking research and subsequently to the desired leap from simple language processing to deep language understanding according to almost all respondents (average score 4.2 on a five-point Likert scale with 5: very effective and 1: not effective at all). Continuous investment in existing research infrastructures (RIs) that support LT was considered equally effective (average score 4.2).
Among other things, access to data and tools via distributed RIs is argued to allow for optimising both the storage space and processing power, as well as for comparing the LTs in terms of their computational footprint.

At the technological level, investing in the development of new scientific methodologies for the transfer or adaptation of resources or technologies to other domains and languages is considered an effective measure to boost the digital readiness of less supported languages (average score 4.0). Given the importance of a strong foundation in basic research, it does not come as a surprise that a large majority of over 86% of respondents welcome an increase in the availability of qualified LT personnel and incentives for talent retention. This also included reinforcing training and education initiatives, including undergraduate and Master’s programmes.

A number of elaborate answers focused on funding instruments as leverage to help Europe achieve global excellence and leadership in LT. Funding and investments should concentrate not only on the applied (computational) aspects of LT but also on basic research in linguistics and computational linguistics. Support of LR creation and sharing is an issue in many responses. With respect to the beneficiaries of funding, a number of respondents and interviewees expressed the opinion that incentives should be provided to language communities that strive to preserve their cultural and linguistic identities, especially with regard to enhancing a language’s presence on the internet. Businesses and industry-research collaborations are noted as an additional target group.

In this context, some respondents perceive the role of national centres of excellence in LT as critically important. Such centres could collect and boost the voices of local players at a national level and increase industry visibility nationally and at the European level.
Apart from designing the national research agendas in LT, they should be responsible for the collection, curation, sharing and standardisation of language data, and for following and implementing the European Data Strategy.

Regulatory aspects pertinent to the LT field, in the form of regulations, recommendations or guidelines, have additionally been highlighted. These include, e.g., the adoption of the FAIR principles in Europe, a revised legislative framework for facilitating the use of language data and the application of data mining techniques for both research and commercial purposes, guidelines for procurement beneficiaries and for public bodies to release their funded or public data, recommendations for big technology companies to open up their platforms for the lesser spoken languages and for the public and private sectors equally to provide multilingual websites. It could also be beneficial to impose content accessibility regulations, e.g., for multimedia subtitling, readability, dubbing, etc.

The research community is often criticised for its bias towards publications on a small number of the world’s languages. Raising awareness of equality issues in international LT fora and incentivising Open Access journals and conferences dedicated to less supported languages are among the suggested measures.

Raising awareness of the importance of LT for digital interactions and of the role of training young LT professionals is mentioned in numerous responses. Finally, the social dimensions of DLE have been emphasised by respondents who argued that linguistic and social diversity go hand in hand: the more diverse our society is, the more there is an actual need for multilingual resources and technologies. Thus, large-scale policies against racism and discrimination are considered essential. In parallel, engaging minoritised language communities and supporting community building is argued to benefit the LT field, as it will increase demand for and the impact of LT.
European LT should foster and support multilingualism while strictly adhering to European values such as privacy by design, transferability, fairness, diversity and openness, transparency and accountability, public wealth, individual rights and collective purposes. Europe’s strengths lie in catering for multilingual solutions covering all the European languages and serving all citizens of Europe. By supporting its linguistic diversity, Europe can achieve digital self-determination and sovereignty.

3 The Perspective of European Language Technology Users

For LT users, a similar survey was set up (see Chapter 4, Section 3, p. 84ff., and Chapter 38, Section 4, p. 235ff.) and generated almost 250 responses. Similarly to the LT developers survey, numerous additional interviews were conducted for more in-depth insights.

The survey brought together diverse groups of stakeholders including representatives of communities of LT users, academic and commercial stakeholders, language professionals (e.g., translators, lecturers and professors in the fields of linguistics and computational linguistics) and stakeholders from different economic sectors (e.g., banking, health, public administration, language services). The survey was disseminated mainly via email by the relevant ELE partners, namely, ELEN, LIBER, ECSPM, NEM, EFNIL and Wikipedia as well as through social networks. Table 3 shows the breakdown of responses collected through the survey.
3.1 Respondents’ Profiles

Responses came from a diverse range of sectors and professional activities; most of the respondents work in the education and research sector with 130 responses (53%) out of 246, that is, most respondents were researchers, university professors, assistant professors, lecturers or held other academic positions. The survey was also filled out by representatives of non-governmental organisations (NGOs), large enterprises, SMEs, government departments and independent contractors and consultants in diverse economic sectors. The 15 (6%) respondents who selected the option “other” represented non-governmental bodies, non-profit organisations, public sector organisations, social organisations and independent government departments (see Figure 2).

Initiative                    Responses   Interviews
ECSPM                         10          2
EFNIL                         28          6
ELEN                          7           19
LIBER                         29          3
NEM                           29          6
Wikipedia                     22          3
Other (e.g., social media)    121         –
Total                         246         39

Table 3 LT users survey – survey responses and interviews collected through the participating initiatives

Fig. 2 LT users survey – types of sectors and professional activities

Contributions to the survey came from all over Europe and, due to social media sharing, some responses were provided by people based outside European countries such as the US, the Democratic Republic of Congo and the Russian Federation. In Europe, the most represented countries were Croatia (33 responses), Spain (23 responses), the UK (23 responses), Ireland (17 responses), Germany (16 responses) and France (14 responses).

3.2 Language Coverage

A total of 74% of the respondents indicated that they work with English, which is the dominant language followed by a well-balanced group of languages composed of German (31%), French (31%) and Spanish (30%). At the other end of the spectrum, many other European languages (e.g., Welsh, Catalan, Basque, Luxembourgish, Galician) are under-represented as few respondents (between one and three) indicated they work with them.
Respondents who selected “other” mentioned that they work with Basque, Catalan, Macedonian, Luxembourgish, Moldovan, Welsh and Galician. Among the non-European languages, respondents mentioned Japanese, Chinese (or Mandarin) and Russian. Figure 3 shows the breakdown of European languages the respondents work with in absolute numbers.

Fig. 3 LT users survey – European languages respondents work with (based on a set of 246 responses)

In relation to the languages respondents intend to include in their workflow, 50 respondents (20%) indicated that they plan to include English, German, Spanish and French. The survey shows, again, the predominance of English over all other languages, followed by German, Spanish and French. Other official EU languages such as Italian, Portuguese and Greek were mentioned by only a few respondents (two or three), while some minority, regional and lesser-used languages such as Breton, Catalan and Faroese were mentioned by only one respondent each. These findings suggest a worrying scenario, where, in a multilingual and multicultural Europe, most minority, regional and lesser-used languages are disregarded either for not being commercially interesting or simply for lack of institutional investment.

3.3 Predictions for the Future

With regard to their predictions for the future, the range of opinions was very broad.
In general, most respondents (68%) are confident that in the next ten years, there will be higher-quality tools for all European languages including minority, regional, and lesser-used languages and that there will also be a wider range of tools for all European languages (83%). However, fewer respondents (46%) believe that LTs will help to prevent linguistic loss, although 65% think that LTs can help to prevent RMLs from disappearing. Most respondents (64%) also agree that LTs can increase individuals’ exposure to these languages and 60% believe that LTs can increase engagement with social, leisure and work activities in their own languages. Among other benefits mentioned in the open questions, respondents think that LTs can improve medical interactions between patients and clinicians and improve medical documentation. One respondent highlighted that LTs can help with the preservation of cultural heritage and improve its visibility. Another respondent pointed out that LTs can improve online and print publishing in minority, regional, and lesser-used languages, including academic publications and works of fiction.

The survey also looked into the respondents’ ideas for the future of LT. They had the chance to indicate LT-based applications that they would like to see but that are not currently available for the languages they work with. There were several interesting responses.
In general, we can see respondents wish for higher-quality tools for certain languages such as “better parsing of Danish than currently available” or the availability of tools that do not yet exist for some languages but exist for others such as “speech recognition for Welsh”, “speech recognition for Catalan”, “free spell check for Irish”, “more reliable speech recognition, information extraction, summarisation, semantic parsing and semantic search for Greek”, “a good Georgian-English Translator” and “better MT for Croatian”. Other respondents indicated that they would like to see some of the existing tools and technologies available in more languages, for instance, “Text-To-Speech for low resource languages” or “more accurate speech2text, decent text summarization, GPT2 for Finnish”.

Some ideas for new (currently non-existent) LTs were also provided. For instance, “case-sensitive tools or the creation of a tool that might provide more context, or warn the user if the same word means something completely different depending on the context. A tool that would be sensitive to connotative meanings” or “tools for collecting lexical data and speed up the process of dictionary building”.

We can conclude that the most important finding of this survey is the respondents’ concern regarding the differences in technological support between European languages, specifically the poor technological support of minority, regional and lesser-used languages. The differences in support are mainly reflected in differences in the quality and performance of tools between the languages as well as in the availability of tools for a small group of low-resource languages, while these same tools do not exist for many other European languages. In order to achieve full DLE as a crucial step to maintain linguistic diversity, the survey shows the necessity for action and an implementation agenda with the objective of fostering and supporting a multilingual and linguistically inclusive Europe that brings solutions to all European citizens.
4 The Perspective of Europe’s Citizens as Consumers of LTs

The ELE project has made an effort to ensure that all voices were heard and taken into account in the preparation of the SRIA. With the support of social media campaigns and an agency specialising in survey dissemination, we were able to reach thousands of EU citizens to hear their thoughts on how well they feel their languages are digitally supported. The European Citizen survey included a total of 11 questions: six multiple-choice questions, four single-choice questions and one open-ended question which allowed respondents to include any comments or feedback they had. The survey was designed to take less than five minutes to fill in (see Chapter 4, Section 3, p. 84ff., and Chapter 38, Section 4, p. 235ff.). It was translated into 35 languages. To ensure the reliability of the survey data captured, a number of data cleaning steps were taken to remove responses that were deemed noisy or at risk of skewing the survey results. We analysed a total number of 20,586 valid responses, the largest public survey ever conducted to date among European citizens concerning LRTs.

4.1 Respondents’ Profiles

We collected (anonymous) demographic information from respondents with the objective to ensure our sample was representative enough of the population for generalisation purposes. We asked respondents to state their level of education, age group and country of residence. We collected responses from 28 countries, and Figure 4 shows the breakdown of contributions per country.

The demographic of the respondents is as follows: 27% of the respondents were between 25-34 years old. A total of 23% accounted for both the 18-24 and 35-44 age brackets. The rest of the respondents were 45+ years old; 1% of the respondents preferred not to say. In terms of education, 35% of the respondents had reached high school level, 23% held a Bachelor’s Degree, 17% held a Master’s Degree, with the rest reporting vocational training (11%), only some high school completion (7%) and holding a PhD (5%); 2% declined to say.
4.2 Language Coverage

We asked respondents to select the languages they use both socially and professionally. Overall, results show that many respondents use their native language in addition to English even if they are not based in English-speaking countries. Therefore, we once again see a dominance of English over all other languages. Following English, German and French also appear as languages frequently used in non-German or non-French speaking countries. Figure 5 illustrates the comparison of the most represented languages in the survey.

Fig. 4 European citizens survey – number of responses collected

4.3 Predictions for the Future

The following discussion concentrates on the forward-looking questions of the EU citizens survey and the responses concerning anticipated or hoped for future developments with regard to the development and consolidation of LTs for Europe’s languages. In one question we asked the respondents “What would be the top 3 advantages of improving apps and tools for all languages? Please select the three most important advantages in your opinion.” The purpose of this question was to assess respondents’ views on the benefits of LTs. Notably, as seen from Figure 6, LTs are regarded as key to enhancing multilingual societies from a linguistic diversity perspective. Of seemingly less importance to the average citizen is the economic advantage that arises from LT support.
With regard to the question “What holds you back from using some of these apps or tools in your languages?”, based on the answers received, it is reasonable to assume that if the reported barriers that are currently holding users back from using apps or tools in their languages were removed, and tools more adequately supported, then there would be more uptake in the number of people using language tools in their own preferred language (see Figure 7). It was somewhat surprising that the top response was “I don’t need to use any apps or tools for this language”, which might suggest that the poor support for some languages may condition users into believing that technologies do not apply to some chronically underserved languages. This may apply in particular to users who also speak a dominant language that is well supported by tools and apps, in addition to one that is scarcely supported.

Fig. 5 European citizens survey – most represented languages

In other words, these responses suggest that there is a real risk that some users have become so accustomed to using apps in or for better supported languages that they no longer see the need for similar apps to be developed and made available in or for their own language; at the same time, this disappointing perception may stabilise a situation where users default to using apps and tools in an additional language that is better supported, also due to their overall superior quality. Another popular response was “Issues with the quality of the available apps or tools”, indicating that people will not use an app or tool if they perceive its quality to be insufficient or inadequate. This suggests that once the quality of the tools is improved to a sufficient standard, more people would be inclined to use the app or tool in their language in the future.

Concerning the query “Please select the tools that you currently do not use but would like to use in the future.”, one tool that people are calling for in particular among those to be made available for their languages is automatic subtitling (Figure 8).
Having this available for more languages would improve communication and accessibility of multimedia content for an ever-increasing range of European citizens (e.g., disabled people, elderly users, etc.). Relevant examples include automatic subtitles being made available to those who are hearing-impaired, so they can watch videos and read subtitles in their own language. Translation apps are also in very high demand, which is not particularly surprising. However, even for those language-pairs that are serviced by MT, we need to be vigilant as many of the freely available translation tools are not owned or resourced by EU companies. Screen readers are another tool that is quite popular, with obvious relevance to visually impaired people. If screen readers were available in more languages, accessibility would be substantially increased for several language communities across Europe.

Fig. 6 Responses to the question “What would be the top 3 advantages of improving apps and tools for all languages?” in the EU citizen survey

Fig. 7 Responses to the question “What holds you back from using some of these apps or tools in your languages?” in the EU citizen survey

Fig. 8 Responses to the question “Please select the tools that you currently do not use but would like to use in the future.” in the EU citizen survey

Finally, in the analysis of the responses to the survey, a number of interesting comments made by ordinary EU citizens were found in the section that elicited more general reactions at the end of the questionnaire. In particular, the very last question of the survey asked the participants to enter any comments they had about the survey or LTs in general. Here follows a selection of the most insightful comments that we feel encapsulate some of the most relevant opinions on the matter.
• “No language is inferior to others. All languages are worthy of survival as long as there is at least one person who speaks that language.”
• “Usually I google things in English because more information is available in English.”
• “It is extremely important to have more language technology tools for the national minority languages in Sweden. It is a rights issue to access everything from speech synthesis, machine translation, language apps, proofing programs, etc. At the moment, there are no opportunities for this for Roma, Meänkieli and to some extent for Sami and Yiddish.”
• “It would be great to have a little more guidance on what ordinary people (without great technological resources such as universities and companies) can do to ‘feed’ or develop those technological resources for our minority languages.”

These comments clearly indicate that some European citizens are eager to have more LT tools and apps made available to them in their language in the future, as this is related to the role that individual speakers and their communities can play going forward in the digital age in the interest of equality. At the moment many people seem to be resorting to using search apps and personal assistants particularly in English or other well-resourced languages, as they are currently unavailable in their own language or are not perceived to perform equally well. This suggests that if required LTs were developed and made available as tools or apps, people would use them in their own language rather than English; at the very least they would have a choice, depending on the type of tasks that they need to perform in different circumstances (e.g., for professional purposes as opposed to personal or social reasons, with colleagues, within the family or with circles of friends and acquaintances, etc.).
The survey also revealed that some European citizens want to see technology for their languages improved and maintained, and some are willing to get involved themselves, as shown by the comment asking what the ordinary citizen can do to help the development of these much-needed technologies. Overall, citizens are concerned about the technological status of their language, and are willing to help to ensure that their language is technologically well supported in the future for the digital age, especially if otherwise there is a threat of extinction. We were particularly pleased at respondents’ willingness to take ownership of these issues, and act not only as users of tools but also as developers. We take this as a strong endorsement of the ELE project, and further evidence of the need for the ELE programme to be fully funded throughout Europe to ensure DLE for all Europeans, as reflected in the ELE SRIA.

5 Summary and Conclusions

The surveys and expert interviews discussed here targeted LT developers, users and the EU citizens. We investigated language coverage and encouraged participants to share their predictions and visions for the future of LTs in Europe with respect to achieving full DLE. The results show that there is still a huge gap between the LT support for English and all other European languages, with dramatic differences in several cases. Even though there is an increased interest in bridging this gap and in expanding technological support to more languages, limited funding, demand and obstacles with regard to available resources make it a challenging endeavour. While basic research is still urgently needed, the last decade has seen progress on a larger scale than could have been imagined ten years ago. Many experts highlight European excellence, also on a global level, and consider leadership in LT and language-centric AI to be possible if the necessary conditions are created by political decision-makers.
The LT developers survey addressed the European LT community, reaching a wide and demographically distributed audience. It was answered by 321 respondents who represent 223 organisations in 32 countries. The respondents were recruited by the research networks, i.e., META-NET, CLARIN and CLAIRE, projects like ELG and other related initiatives focusing on LT or neighbouring fields, such as ELISE, ELEXIS, and NexusLinguarum. Additional networks, associations and projects represented by the respondents include ELRC, ELRA, ACL, EAMT, DARIAH and others. The areas in which the respondents are active covered the full range of LT. The languages they focus on have a skewed distribution that reflects current imbalances in the field in Europe as well as elsewhere, with English first by a large margin, followed by the big official EU languages. The two main concerns expressed were the insufficient support for basic research in NLP and LT and the fierce competition of non-EU companies with the market disruption they cause. The survey answers to the open-ended questions and views of the interviewed experts brought a host of opinions and suggestions in several important directions, in particular: the higher and even elementary education area, research funding, legal and regulatory obstacles, biases and privacy issues of various types, commercialisation difficulties and ways of supporting such efforts, the need to coordinate efforts between national centres of excellence vs. pan-European ones, etc.

The LT users and consumers survey brought together academic and commercial stakeholders, language professionals and stakeholders from different sectors. It was disseminated by the relevant ELE partners, i.e., ELEN, LIBER, ECSPM, NEM, EFNIL and Wikipedia who promoted the survey targeting representatives of organisations and communities of users and consumers.
Based on the results, it can be concluded that the most important finding is the respondents’ concern regarding the differences in technological support between Europe’s languages, specifically the poor technological support of minority, regional and lesser-used languages. The differences in support are mainly reflected in differences in the quality and performance of tools between the languages as well as in the availability of tools for a small group of languages, while these same tools do not exist for many other European languages. To achieve full DLE as a step to maintain and promote linguistic diversity, the survey shows the necessity for action and calls for an implementation agenda with the objective of fostering and supporting a multilingual and linguistically inclusive Europe that brings solutions to all European citizens that are relevant in the digital age.

An additional survey was carried out targeting EU citizens with the aim of taking into account their opinions, individual needs, wishes, general demands and, importantly, to make sure that their voices play a decisive role in the pursuit of full DLE supported by LT. The survey was disseminated in 28 countries with the help of a service provider. Additional dissemination was carried out with the help of ELE partners who promoted the survey on social media, within their networks and through the ELE project website. While structured very differently than the stakeholder group surveys, there are several similarities not only in terms of the scope of the analysis, but also of the key results that were obtained: languages other than English are poorly supported (with only a few exceptions) – something evident even from the distribution of languages that the respondents considered in their responses. These answers show that raising awareness for the LT potential in Europe on a political and institutional level is more important now than ever before.
The European LT community is in a position where change is needed in order to compete with innovative systems and tools built elsewhere. On a political level, this involves more commitment from the European institutions as well as those of the Member States.

References

Eskevich, Maria and Franciska de Jong (2022). Deliverable D2.3 Report from CLARIN. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-CLARIN.pdf.

Hajič, Jan, Tea Vojtěchová, and Maria Giagkou (2022). Deliverable D2.5 Report from META-NET. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-META-NET.pdf.

Hegele, Stefanie, Katrin Marheinecke, and Georg Rehm (2022). Deliverable D2.6 Report from ELG. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-ELG.pdf.

Rufener, Andrew and Philippe Wacker (2022). Deliverable D2.4 Report from LT-innovate. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-LTInnovate.pdf.

Thönnissen, Marlies (2022). Deliverable D2.2 Report from CLAIRE. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/consultation-CLAIRE.pdf.

Way, Andy, Georg Rehm, Jane Dunne, Maria Giagkou, Jose Manuel Gomez-Perez, Jan Hajič, Stefanie Hegele, Martin Kaltenböck, Teresa Lynn, Katrin Marheinecke, Natalia Resende, Inguna Skadiņa, Marcin Skowron, Tereza Vojtěchová, and Annika Grützner-Zahn (2022a). Deliverable D2.18 Report on the state of Language Technology in 2030. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/LT-in-2030.pdf.
Way, Andy, Georg Rehm, Jane Dunne, Jan Hajič, Teresa Lynn, Maria Giagkou, Natalia Resende, Tereza Vojtěchová, Stelios Piperidis, Andrejs Vasiļjevs, Aivars Bērziņš, Gerhard Backfried, Marcin Skowron, Jose Manuel Gomez-Perez, Andres Garcia-Silva, Martin Kaltenböck, and Artem Revenko (2022b). Deliverable D2.17 Report on all external consultations and surveys. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/external-consultations.pdf.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 40
Deep Dive Machine Translation

Inguna Skadiņa, Andrejs Vasiļjevs, Mārcis Pinnis, Aivars Bērziņš, Nora Aranberri, Joachim Van den Bogaert, Sally O’Connor, Mercedes García-Martínez, Iakes Goenaga, Jan Hajič, Manuel Herranz, Christian Lieske, Martin Popel, Maja Popović, Sheila Castilho, Federico Gaspari, Rudolf Rosa, Riccardo Superbo, and Andy Way

Abstract Machine Translation (MT) is one of the oldest language technologies, having been researched for more than 70 years. However, it is only during the last decade that it has been widely accepted by the general public, to the point where in many cases it has become an indispensable tool for the global community, supporting communication between nations and lowering language barriers.
Still, there remain major gaps in the technology that need addressing before it can be successfully applied in under-resourced settings, can understand context and use world knowledge. This chapter provides an overview of the current state-of-the-art in the field of MT, offers technical and scientific forecasting for 2030, and provides recommendations for the advancement of MT as a critical technology if the goal of digital language equality in Europe is to be achieved.¹

Inguna Skadiņa · Andrejs Vasiļjevs · Aivars Bērziņš · Mārcis Pinnis
Tilde, Latvia, inguna.skadina@tilde.com, andrejs.vasiljevs@tilde.com, aivars.berzins@tilde.com, marcis.pinnis@tilde.com

Nora Aranberri · Iakes Goenaga
University of the Basque Country, Spain, nora.aranberri@ehu.eus, iakes.goenaga@ehu.eus

Joachim Van den Bogaert
CrossLang, Belgium, joachim.van.den.bogaert@crosslang.com

Sally O’Connor · Riccardo Superbo
KantanMT, Ireland, sallyoc@kantanai.io, riccardos@kantanai.io

Mercedes García-Martínez · Manuel Herranz
PANGEANIC, Spain, m.garcia@pangeanic.com, m.herranz@pangeanic.com

Jan Hajič · Martin Popel · Rudolf Rosa
Charles University, Czech Republic, hajic@ufal.mff.cuni.cz, popel@ufal.mff.cuni.cz, rosa@ufal.mff.cuni.cz

Christian Lieske
SAP SE, Germany, christian.lieske@sap.com

Maja Popović · Sheila Castilho · Federico Gaspari · Andy Way
Dublin City University, ADAPT Centre, Ireland, maja.popovic@adaptcentre.ie, sheila.castilho@adaptcentre.ie, federico.gaspari@adaptcentre.ie, andy.way@adaptcentre.ie

¹ This chapter is an abridged version of Bērziņš et al. (2022).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_40

1 Introduction

Machine translation (MT) was one of the first application areas of natural language processing (NLP). Starting from the first attempts to apply dictionary-based approaches right up to modern neural network-based systems, MT has aimed to provide automatic translation from one natural language into another.
Today, MT has become an important asset for multilingual Europe, allowing citizens, governments and businesses to communicate in their native languages, breaking down language barriers and supporting the implementation of the European digital single market. For example, the eTranslation automated translation tool,² developed by the European Commission, and its various adoptions (e.g., EU Council Presidency Translator, Pinnis et al. 2021)³ provide reasonably good MT service in 24 EU official languages for governments, the public sector and SMEs.⁴ However, MT support and the quality of its output still differ from language to language, and from domain to domain. In particular, MT quality drops significantly when translation concerns less-resourced languages, speech or terminology-rich domains with limited available data.

1.1 Scope of this Deep Dive

In 2012, the META-NET White Paper series (Rehm and Uszkoreit 2012) presented a thorough analysis of Language Technology (LT) support for 31 European languages. According to this study, good MT support existed only for English and moderate support only for two widely spoken languages (French and Spanish), leaving the remaining 28 European languages in clusters of fragmented, weak or no support.

This chapter focuses on the MT landscape a decade after the publication of the META-NET White Papers. We analyse progress in MT, identify the main gaps and outline visions, the breakthroughs needed and development goals towards Digital Language Equality (DLE) and Deep Natural Language Understanding (NLU) by 2030. We look at the current services and technologies offered by MT providers in the European market. The dominance of global companies in the free online translation market and the risks for Europeans caused by this dependence are among the key topics discussed in this chapter, especially to identify solutions going forward.

The main gaps are identified for four dimensions of MT: data, technology, approaches and legislation.
We focus not only on data availability and usability and the need for less-resourced technologies, but also discuss limitations related to multimodal MT. While MT technologies today are available for most European languages, many of these languages are less attractive from a business point of view, and consequently they are not so well equipped with MT tools. Throughout the chapter, language coverage is addressed as a key dimension for DLE. We also discuss legal and ethical aspects related to the development, production and use of MT systems and services. We analyse IPR and GDPR restrictions and the ‘fair use’ principle from the developer’s perspective, and privacy and security issues from the user’s perspective. Finally, all these aspects are taken into consideration from the perspective of their impact on society, with a focus on Europe. The chapter provides a series of recommendations on how to address the current limitations of MT technologies and how to contribute to DLE as a crucial goal for Europe and its citizens.

2 https://webgate.ec.europa.eu/etranslation/public/welcome.html
3 https://www.eu2020.de/eu2020-en/presidency/uebersetzungstool/2361002
4 As of February 2022, eTranslation was used by 108 projects – 87 projects reusing eTranslation and 21 projects committed to analysing or reusing eTranslation.

1.2 Main Components

While different MT types (e.g., rule-based, example-based, statistical, hierarchical) have been investigated, in this subsection we will focus only on the recent development of Neural MT (NMT), based on an overview by Popel (2018).
We present the main MT components of the general NMT architecture and the currently most popular example: Transformer (Vaswani et al. 2017). There are many other components related to MT, which are not described here, e.g., automatic speech recognition5 and speech synthesis, which are needed in the speech-to-speech translation pipeline; cross-lingual information retrieval; multilingual summarisation; integration into production systems and multilingual websites using suitable metadata formats.6

In NMT, each input sentence is first tokenised into a sequence of tokens. The most popular approach today is to split words into subword units (subwords, which need not be actual words of the language or even morphemes). For example, the German word Forschungsinstituten (‘research institutes’) may be encoded with three subwords: Forsch + ungsinstitu + ten_. There are several algorithms for training subword models (e.g., Sennrich et al. 2016b). NMT based on subwords shows better results than early approaches based on words and recent approaches based on characters (Libovický et al. 2022). Each token is represented as a real-valued vector, called a (subword/word) embedding. Most NMT systems initialise embeddings randomly and train them jointly with the whole translation model, but pre-trained (contextual) embeddings may be used as well, especially in low-resource settings.

NMT systems are based on an encoder-decoder architecture. The encoder maps the input sequence to a vector of hidden states (sometimes called continuous representation or sentence embedding). The decoder maps the hidden states into the output sequence (of target-language tokens). Each hidden state usually corresponds to one position (token) in the input sequence, so in general, the vector of hidden states has a variable length. Early NMT systems (Sutskever et al. 2014) used only the last hidden vector as an input for the decoder. Thus, the training was forced to encode all the information about the input sentence into a fixed-length vector. Bahdanau et al.
(2015) introduced an encoder-decoder attention mechanism, where the decoder has access to all of the encoder’s hidden states. This way, when generating each output token, the decoder can attend to different parts of the input sentence. The encoder-decoder attention mechanism circumvents the fixed-length sentence-representation restriction and improves translation quality, especially on longer sentences.

5 See, for example, the reports of the ELITR project at https://elitr.eu.
6 https://www.w3.org/TR/mlw-metadata-us-impl

The process of translating sentences (at test time) with a trained NMT model is usually called inference. Most NMT systems use auto-regressive inference. This means that the output sentence is generated token by token and after each token is generated, its embedding is used as input for generating the next token. Decoding finishes once the decoder generates a special end-of-sentence token.

The advantage of NMT systems is that all their components can be trained in an end-to-end fashion, unlike earlier data-driven approaches, where most components had to be trained separately. NMT is usually trained using backpropagation, optimising the cross-entropy loss of the decoder’s final softmax layer, which predicts output token probabilities; there are also NMT systems optimising sentence-level metrics (e.g., BLEU, Papineni et al. 2002, or simulated human feedback) with reinforcement learning techniques (e.g., Nguyen et al. 2017). NMT usually uses teacher-forcing: when generating the next word during training, it uses the previous word from the reference translation as the input instead of using the previously predicted word.

The Transformer architecture follows the general encoder-decoder architecture, but unlike earlier recurrent networks it uses self-attention and feed-forward layers in both the encoder and decoder. This allows training and partially also the decoding process to be sped up thanks to better use of parallelisation.
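As a minimal sketch of the encoder-decoder attention described above, the following Python snippet computes the attention weights for a single decoder step as a softmax over scores between the decoder state and each encoder hidden state, and then forms a context vector as the weighted sum of those states. Scaled dot-product scoring is an assumption made here for brevity (Bahdanau et al. 2015 score with a small feed-forward network instead), and the function names and toy vectors are purely illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for a single decoding step.

    Assumption: dot-product scoring, scaled by sqrt(dimension).
    """
    dim = len(decoder_state)
    scores = [
        sum(q * k for q, k in zip(decoder_state, h)) / math.sqrt(dim)
        for h in encoder_states
    ]
    weights = softmax(scores)
    # The context vector is the weighted sum of all encoder hidden states.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three source positions, two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

Because one weight is produced per encoder state, the same function handles source sentences of any length, which is how attention sidesteps the fixed-length bottleneck of the earliest NMT systems.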
Self-attention is based on a compatibility function which assigns a weight to each pair of tokens, more precisely, to their vector representation on each layer. Transformer uses multi-head self-attention, so multiple versions (heads) of the self-attention function are trained for each layer. Figure 1 shows an example of visualisation for different heads.

Fig. 1 Visualisation of self-attention in a Transformer model trained on English → German translation (adapted from Vaswani et al. 2017). Each head is visualised in a different colour and edge weight is indicated by thickness. Each of the figures shows another attention head in encoder layer 5 (out of 6). The words in the left column in each of the three visualisations represent vectors corresponding to these words on the input to the fifth layer of the encoder. The right-most figure shows two attention heads, but focusing only on the word ‘its’ and illustrating coreference resolution.

2 State-of-the-Art and Main Gaps

2.1 State-of-the-Art

Deep learning techniques have given a major boost to the area. The application of neural networks to MT has opened the path to developing a universal engine whose ultimate goal is a single model to translate between any arbitrary language pair. The effects of different advanced approaches for multilingual MT models have been investigated by Yang et al. (2021), for example. They first explore how to leverage the large-scale language models created from the publicly available DeltaLM-Large multilingual pre-trained encoder-decoder model (Ma et al. 2021) to initialise the model. For efficient training, they apply progressive learning (e.g., Zhang et al. 2020) to create a deep model from a shallow one. Additionally, they implement multiple rounds of back-translation (e.g., Dou et al. 2020) for data augmentation purposes. While the results are very promising, they reflect a worrying trend: when English is involved in the translation process either as a source or target language, the BLEU scores are rather high. However, the results worsen considerably when translation in language pairs without English is considered.
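The BLEU scores mentioned above are built from n-gram overlap between system output and a reference translation. The following is a minimal sentence-level sketch in Python, assuming a single reference, whitespace tokenisation, n-grams up to length 2 and no smoothing. The function names are illustrative only, and published scores are normally computed at corpus level with n-grams up to length 4 using standard tooling, so the numbers produced here are not comparable with reported results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as often as it occurs in the reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def simple_bleu(hypothesis, reference, max_n=2):
    """Geometric mean of clipped precisions times the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty punishes hypotheses shorter than the reference.
    penalty = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return penalty * math.exp(log_mean)


reference = "the cat is on the mat"
exact = simple_bleu("the cat is on the mat", reference)
partial = simple_bleu("the cat sat on the mat", reference)
```

In this toy case the exact match scores 1.0, while the hypothesis with one wrong word keeps 5 of 6 unigrams and 3 of 5 bigrams and therefore scores lower. The metric only rewards surface overlap with the reference, which is one reason it is treated as a rough proxy for translation quality.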
If we turn to the goal of achieving language equality, one of the most interesting approaches is unsupervised MT (e.g., Artetxe et al. 2018) where no bilingual parallel data is needed to train a fully working system. In recent years, this approach has slowly been catching up with the translation quality obtained by supervised systems. For instance, Han et al. (2021) build a state-of-the-art unsupervised NMT system derived from a generative pre-trained language model. Their method is a concatenation of three steps: few-shot amplification, distillation and back-translation (Sennrich et al. 2016a). They first use the zero-shot translation ability of a large pre-trained language model (GPT-3) to generate translations for a small set of unlabeled sentences. In the next step they amplify these zero-shot translations by using them as few-shot demonstrations for sampling a larger synthetic dataset, which is then distilled into a smaller model via fine-tuning to obtain a new state-of-the-art in unsupervised translation on the WMT14 English-French benchmark. While still restricted to a well-resourced language pair, learning outcomes are promising for lower-resource pairs.

Within the industrial context, a look at providers’ solutions gives a clear overview of the strengths of each company, as well as the issues that remain relevant regarding the successful implementation of the technology. A key aspect that most companies emphasise is the capacity for domain adaptation. This allows for engines that learn from domain-specific texts, avoiding the noise that expressions from other fields might introduce in the learning process (e.g., Pangeanic, RWS, Tilde, Welocalize). Further customisation is also highly valued, most frequently by refining their own generic or domain-specific engine with a customer’s own data (e.g., Across, Language I/O, Tilde). Alternatively, do-it-yourself MT opportunities are provided where customers build their own system from scratch using just their own data.
The text type involved is also distinctive across companies, with some pushing for real-time adaptive MT for email and chat (e.g., Language I/O), while others emphasise multimodality. When a higher level of accuracy and/or cultural adaptation is required, MT is coupled with post-editing, which is implemented with functionalities directed at professional translators or crowd-sourcing platforms (e.g., Lengoo, Unbabel).

Apart from the quality of the technology itself, seamless integration within existing localisation workflows is paramount for its successful adoption, as well as scalability (e.g., KantanMT, Lilt, Tilde), open-source technology (e.g., Pangeanic, Apertium) and speech MT (e.g., Papercup, Tilde). Additionally, privacy and security are of huge interest as texts often include sensitive product or customer information. The lack of understanding of how MT works and the unclear legal rights, obligations and consequences of misuse cause clients to seek secure solutions (e.g., Across, Language Weaver, Pangeanic, Tilde).

There are numerous European companies providing MT tools and services, each with their own strengths and limitations. However, it is tech giants such as Amazon, Facebook, Google and Microsoft who set the standards and best practices for LT development and provision. Most such companies are headquartered outside Europe and so have business and societal objectives that do not always align with European needs and goals. The dominance of those global companies exposes Europe’s lack of market power, which results in increasing market disparities.

The absence of a clear roadmap and support for LT at the European level results in a disjointed European market with disparate support for the language communities of Europe. Such a roadmap is crucially important now that MT is playing a key role in communication activities across the globe. As a result, the demand for translated content has reached an all-time high, and seems set to rise further for the foreseeable future.
Nowadays, there are countless online MT sites for general use that offer access to MT either from companies that make the systems freely available with some usage restrictions (Amazon, Google, Microsoft, DeepL, and Tilde among others) or from public bodies that facilitate their custom-based MT capabilities (the European Commission and the Basque and Latvian governments, among others, Skadins et al. 2020). People use these tools to translate a very diverse range of texts. While access is fast and straightforward, they do present privacy risks and cultural bias. To this day, the legal boundaries of text ownership and use are not fully regulated across Europe. Also, the array of languages available is increasing, but it is the major languages that benefit from the advances first and foremost, with small and minority languages often suffering from uneven and generally low quality.

MT has been available to the video game localisation industry for years without much success given the need for highly creative and culturally adapted options, often with constraints dictated, for example, by available on-screen space. For current online collaborative games, in-game dialogue has become critical, as has the need for instant translation between multiple languages. This has motivated some game developers to explore the potential of MT in their localisation processes.

Medical translation is highly sensitive and requires the utmost precision. Given the serious consequences of mistranslations, MT has been largely absent from this area. However, it is time to push for MT accuracy and consistency, and accept nothing short of high-quality translation (Haddow et al. 2021). MT could prove of great assistance not only for written text but also in doctor-patient communication. While medical interpreters remain the go-to specialists, often their services are not available.
To facilitate this type of communication, systems that can specifically tackle the local languages and those of the immigrants are essential. There are now a number of success stories that demonstrate the utility of MT in this field. For example, in 2020 SDL made their MT system available to all engaged in COVID-19 medical research;7 NAVER LABS Europe released an MT model for COVID-19 research;8 and, to make emergency and crisis-related content available in as many languages as possible, Translators without Borders and several academic and industry partners prepared COVID-19 materials for training MT models for nearly 90 languages.9

Public Administration – Making legal and administrative documents available in at least the official languages of Europe is an obligation of national governments. Given the intricacies of the texts, MT is not yet central in the translation process. However, several initiatives such as ELRC10, ELRI11 and ELG12 (Rehm et al. 2023) have curated and shared LRs that can improve MT services. Along the same lines, the availability of high-quality NMT at different levels of public bodies, Member States and public administrations has been put forward as a key priority for the European Commission, particularly for under-resourced EU languages (see, e.g., the projects NTEU and iADAATPA, Bié et al. 2020; Castilho et al. 2019). An excellent example of the use of MT by EU Council Presidency staff members and public administration translators is demonstrated by the eight EU Council presidencies that used the EU Council Presidency Translator (Metuzale et al. 2020). The challenge is the provision of this type of service not only for the 24 official languages, but for all languages in Europe, promoting citizen equality and European cohesion, which are key to a stable and unified view in the region.

To increase customers’ understanding of a product and to build trust, global content on an eCommerce website should be translated into the target customer’s language.
eCommerce companies require a mix of technical, highly accurate yet informal, creative, and culturally aware translations. While that can be challenging for MT, there are many companies (e.g., Lionbridge, Protranslating, Simultrans, Smartling) that can help online business owners to make their content multilingual, with multiple plugins compatible with common Content Management Systems (CMS) and eCommerce solutions in the market (WordPress, Drupal, Joomla, Magento and WooCommerce).

This short review shows that the current shortcomings of MT technology and areas where effort should concentrate revolve around aspects that help increase trust through increased accuracy, as well as through high cultural adaptation and creativity. It is high time MT quality and suitability are accounted for not only by means of usage-agnostic metrics, but also by customer experience measurements. It is clear that a scenario where all citizens feel equal, with the same quality of language access to resources, services and commerce, will considerably boost European cohesion.

7 https://www.biospace.com/article/releases/sdl-offers-machine-translation-free-of-charge-to-health-science-professionals-
8 https://europe.naverlabs.com/blog/a-machine-translation-model-for-covid-19-research
9 https://tico-19.github.io
10 https://www.lr-coordination.eu
11 http://www.elri-project.eu
12 https://www.european-language-grid.eu

2.2 Main Gaps

Data Availability and Data Quality – As stated in the EU Charter and the Treaty on the EU, all 24 official EU languages are granted equal status. However, the META-NET White Paper Series found that 21 of the 30 European languages investigated were at risk of digital extinction. In addition to the official languages, there are over 60 regional and minority languages, as well as migrant languages and sign languages, spoken by 40 to 50 million people. The negative consequences of this lack of resources are twofold: 1. Europeans are not receiving the digital resources they are entitled to; and 2.
there is a lack of language data to train MT engines to mitigate this problem. The Open Data Directive (2019/1024/EU) does not recognise language data as a high-value data category. This means that it may not be clear what language data exists for at-risk languages, or how data can be used for MT/LT development. Moreover, availability does not guarantee usability. To be considered usable, language data must meet certain criteria. For instance, to train high-performance NMT systems, bilingual data needs to be clean and correctly aligned.

Domain-specific Data – NMT systems benefit from exposure to a wide variety of data, including style and content variety. Likewise, while domain specificity is important to tune an engine towards a particular field or subfield, expanding the domain coverage usually brings benefits to the training of an NMT system. This means that domain availability is almost as relevant as language availability. While categories such as legal, financial, and technical are usually well covered in terms of availability and suitability for a number of languages and language pairs, more specific or uncommon domains may not have comparable amounts of training data available. Moreover, there is generally a disparity between publicly available and proprietary bilingual corpora. As a result, there is a gap in the availability of domain-specific language data both in official and minority languages, which could lead to the centralisation of some specialised fields over others, excluding speakers of less supported languages in the long term.

The Compute Divide – With the paradigm shift to NMT, MT has become increasingly computationally intensive. Access to hardware, experts, and involvement in research has also shifted in such a way that elite universities and larger enterprises have an advantage due to their relative ease of access to compute power. According to the ELE analysis on strategic documents and projects (see Chapter 44, p.
361ff.), there is a lack of necessary resources (experts, High Performance Computing, capabilities, etc.) in Europe compared to large US and Chinese IT corporations that lead the development of new LT systems. Furthermore, there is an uneven distribution of resources, including scientists, experts, computing facilities, and companies, across countries, regions and languages in Europe (cf. Rehm et al. 2023).

Multimodal MT – MT is commonly thought of as translating text to text, but multimodal MT is also possible, although it is still in its early stages. Fields in which further technological innovation would increase potential use-cases for MT include image recognition, speech synthesis and automatic speech recognition. Image-to-text translation makes use of Optical Character Recognition (OCR) to isolate text in images. This technology is quite effective, and nowadays smartphone and tablet users can generally avail of image translation services free of charge. However, OCR software is not as widespread as standard text-to-text translation. Multiple factors affect OCR accuracy, including coloured or decorative backgrounds, blurred texts, non-Latin alphabets, larger or smaller letters, look-alike characters, and handwritten text, all or any of which may result in nonsensical translations. Combining OCR with text prediction may improve the accuracy of this technology. Audiovisual media is playing an increasingly central role in our lives thanks to AI-powered virtual assistants and online streaming services. For this reason, the ever-growing demand for translation of audiovisual content has sparked interest in the development of MT-centric text-to-speech and speech-to-text applications. Moreover, the need for accessible content in the form of subtitles and audio descriptions for those who are visually impaired, deaf, or hard of hearing has the potential to drive innovation in MT. The Strategic Research Agenda developed by New European Media13 provides a number of recommendations related to MT, including 1.
streamlining the circulation of audiovisual (or video) programs through MT, while humans focus on the quality of work, for example; 2. encouraging synergies and convergence between subtitling and the development of multilingualism or the integration of foreign migrants, for example; 3. developing AI tools for automatic translation from speech to subtitles, and text to/from sign language; and 4. developing AI tools for robust automatic translation of subtitles. Training high-performance MT systems to translate subtitles is particularly challenging. Rigid copyright laws in Europe forbid the use of translations of copyrighted movies and audiovisual material, despite the fact that this may constitute fair use. Compared to technical language, subtitles are often more creative and idiomatic in nature, increasing the difficulty of translation and the need for high volumes of good-quality training data.

13 https://nem-initiative.org

Different Types of End Users – The language industry is often faced with pressure to provide discounts when using MT under the premise that MT boosts productivity, allowing linguists to post-edit more words per hour than if they were to translate from scratch. While the advent of MT has allowed translators and linguists to spend less time on repetitive content, productivity gains still depend on several other factors, including the quality of the MT output and the complexity of the content or domain. The pricing pressure often arises from a lack of consideration of these extra factors which make post-editing a more complex task than it initially appears.
Providing industry with the resources to better communicate these factors could be a step towards relieving pricing pressure. Furthermore, LT has changed the role of the translator.14 There tends to be a generational divide in attitudes towards the adoption of MT in translation workflows among linguists, with some older linguists fearing that MT threatens their job security. Younger linguists tend to have more positive dispositions due to proper training in such technologies being included in their higher education courses. However, linguists play an important role in the assessment and continuous improvement of MT engines, because there is no universal way to automatically evaluate MT quality. Therefore, while the role of traditional translators might have changed, demand for linguists has remained high alongside the developments of MT. At the other end of the spectrum, the hype about the advancements of AI and MT might lead people with low levels of expertise to think that MT is infallible (for clear demonstrations that the ‘human parity’ claims were less than watertight, see Läubli et al. 2018; Toral et al. 2018). The wide availability of MT applications coupled with the sometimes deceptive fluency of NMT output may lead users to avail of MT uncritically, without always understanding its pitfalls. Another step in this direction includes educational publications, which address the technical foundations of machine learning as used in MT as well as the ethical, societal, and professional implications of its use (Kenny 2022).

Automated Evaluation of MT – Automated metrics are a cost-effective way of assessing the quality of MT output. Research in the field focuses heavily on developing metrics that are able to show higher and higher correlations with human judgement. As a result, different metrics are presented at conferences around the world every year. Despite (or as a result of) their abundance, there is still a lack of agreement among the MT community on a single metric which can be used universally to assess the quality of MT engines prior to deployment.
Adopting a single metric as a standard would possibly allow for a widespread benchmarking of MT across Europe.

Bilingual Evaluation Understudy (BLEU, Papineni et al. 2002), for example, has enjoyed perhaps the broadest use in the MT industry, despite its known shortcomings with regard to neural MT. Many other metrics have been developed since BLEU, and while they all have their pros and cons, the widespread use of BLEU has proven that metrics can serve a purpose without being scientifically infallible.

Licensing – Translation memory and terminology data is often licensed for non-commercial use only. When commercial licences do exist, their prices are often prohibitively high. This acts as a major barrier to SMEs developing MT applications, especially when there is a limited amount of data available.

Copyright – Copyright laws pose a further barrier in Europe. While copyright law is subject to fair-use exceptions in countries such as the US, European law is far less flexible, and severely restricts the use of parts of copyright works for purposes such as data mining. If lawmakers could agree that using aligned translations of copyrighted data constitutes fair use, insofar as it in no way impairs the value of the materials and does not curtail the profits reasonably expected by the owner, LT stakeholders could avail of this high-quality language data for the immediate benefit of European language communities.

14 We use the word linguist to refer to language professionals who translate, post-edit, and evaluate LT among other tasks.

Legislative and Adoption Gaps – Despite the widespread celebration of multilingualism in the EU, there is no common policy addressing language barriers as of yet. We now provide a few examples of scenarios where multilingualism acts as a barrier to people in times of crisis. It is fair to say that current legislation does not account for these scenarios, resulting in critical gaps in services for communities in the EU.
Adopting MT in these areas could mitigate the difficulty sometimes caused by language barriers, strengthening the position of multilingualism as a facet of European identity. 1. the COVID-19 pandemic has shown the need for rapid dissemination of information and guidelines in times of crisis. To give one example, in Ireland, the provision of multilingual information was seen to be slow and reactive, with even the provision of information in Irish and Irish Sign Language being slow in the early stages. The first recommendation made (O’Brien et al. 2021) is for state departments to implement a coordinated approach to the provision of translated content in crises; 2. the requirement for all translations of personal documents to be stamped by a sworn translator can increase the stress on civilians, adding costs and waiting times. The repetitive nature of documents like these, together with their standardised terminology, makes them particularly well suited to MT; 3. just as the Audiovisual Media Services Directive boosted demand for text-to-speech and speech-to-text technologies, there could be an increase in the demand for MT if policies necessitating the translation of certain audiovisual material into all 24 official languages were introduced. While EU law requires that the product descriptions of goods sold within the EU be translated into the Member State’s official language, as of yet there are no such regulations regarding product descriptions for cross-border eCommerce; 4. there is a gap in publicly available MT services which cater specifically to the needs of people in Europe. Users can globally avail of free-of-charge MT services but the multinationals who provide the services could withdraw or start charging for them at any time. Moreover, they do not cater specifically to the needs of European citizens.

Training NMT engines is resource intensive and has a heavy carbon footprint. One area where the law is perhaps too relaxed is in relation to carbon emissions in the field of AI research and development.
Researchers have warned of the marginal performance gains associated with expensive compute time and non-trivial carbon emissions. Strubell et al. (2019) recommend that time spent retraining should be reported for NLP learning models and that researchers should prioritise developing efficient models and hardware. The EU has the opportunity to be a pioneer in training and developing green LT by following and enforcing these recommendations.

3 The Future of the Area

In this last section, we will examine the contribution of MT to DLE (Section 3.1), briefly sketch the main breakthroughs needed (Section 3.2), discuss our main technology development goals and visions (Section 3.3) and describe the next steps towards Deep NLU (Section 3.4).

3.1 Contribution to Digital Language Equality

Nowadays, due to globalisation, MT is essential for the development of society. People can access MT allowing for the democratisation of information in many languages. MT directly impacts the economy and cultural exchange between countries. In various scenarios, human translators cannot meet the huge demand for translations in a short time and at low cost. In such cases, MT is much faster and may require less effort to post-edit than translating from scratch.

Massive amounts of parallel data are required to build solid MT systems. Parallel data creation is costly in terms of time and resources. We contend that work done for or by public administrations might offer a solution in this regard. The NEC TM project,15 for example, calculated in its market study that European public administrations spend about 300 million Euros p.a. on translation contracts with language vendors. This parallel data is mostly not requested back by institutions, many of which operate in low-resource languages, but it should be made publicly available. Data availability directly affects the availability and quality of MT, as well as the contribution it can make to DLE and the wider society.
These data pipelines can improve local (national) technology, raise awareness of the fact that citizens are also data producers, and improve and increase the availability and quality of MT. For example, in the case of Catalan, having co-official status (in three Spanish regions) kickstarted a series of administrative decisions that facilitated the creation of more and more parallel data, which has been utilised by local MT companies. Societies that care about data sovereignty and establish language data policies can facilitate the growth of LT companies, which in turn can positively impact those societies.

15 https://www.nec-tm.eu

Uses of MT are very varied, from customer reviews on travel sites to legal document translation for public administrations. None of those uses and the business intelligence that can be derived from them can happen without translation. MT not only works for equality on dispute resolution or as a source of information for insights at scale irrespective of the source, but also enables businesses to build on those services, impacting the society they belong to. We cannot separate the use and availability of the technology from its societal impact.

The ubiquity of MT services is an indisputable fact of current European digital societies. It is now embedded in many services as a real-time high-quality commodity. The ELE consortium has identified several day-to-day uses which illustrate how MT is used in very different spheres, including: 1. civil servants verify the national legislation of other EU Member States by machine-translating it; 2. citizens communicate via MT when visiting other countries; 3. the general public use MT to understand social media conversations; 4. students machine-translate research papers; 5. eCommerce websites offer products online to consumers in multiple languages; and 6. public administrations translate documentation for information exchange.
All these use-cases generate massive amounts of online data that is not reused by EU businesses and research groups. Worse still, it can happen that it is generated for the benefit of the (non-European) free online tools providers to make their technology more accurate. Access to massive amounts of data that is freely available and provided by general users has scaled a lot of MT research, whilst it has provided little in terms of open-source, generally available resources.

Whilst the majority of the talent in NLP and AI has been European, large-scale developments are foreign to the EU or the result of private sponsorship. Heavy investment in MT research at universities over the years has created the know-how and technical knowledge which has only rarely been exploited commercially (e.g., KantanMT, Iconic). The question for Europeans remains on the privacy of the data used and how this data is transmitted. The MT landscape is dominated by large non-European players and technology companies. DeepL is the only significant EU-based provider, being sponsored by a German initiative born as a result of parallel text data collection over many years (Linguee). Most European MT companies remain fairly small and have much less impact (visibility) on society beyond professional-level usage. The EU’s own service (eTranslation) is available for free to public administrations and it also opened its services to SMEs in 2021.

A good example of increasing concerns comes from Switzerland, where DeepL and Google Translate were recently banned at Swiss Post as external tools amid concerns of privacy and data exploitation (access was later reopened, though). Swiss Post declared that its staff should only use its own MT technology, so no private data or data belonging to the organisation would be sent to third parties.16 GDPR has the potential to change things as privacy concerns become relevant to institutions and enterprises, with EU projects such as MAPA17 providing accurate, open-source anonymisation for public administrations.
It remains to be seen how this potential is exploited so that MT and general NLP solutions permeate and help create a more data-based Europe, based on intelligent solutions with the citizen at its core.

16 https://slator.com/swiss-post-bans-deepl-backs-down-after-staff-uproar/
17 https://mapa-project.eu

3.2 Breakthroughs Needed

According to a competitiveness analysis ordered by the European Commission, the position of the European MT market, as compared to that of North America and Asia, is excellent for research and innovation, while it lags behind in terms of investment, infrastructure and industry implementation (Vasiljevs et al. 2019). At the same time, the study highlights that the market is fragmented, which causes serious issues for the level of intensity at which LT research can be conducted. While in North America and Asia resources can be allocated to only a limited number of languages, in Europe, resources must be distributed across a multitude of official and unofficial EU languages. As a result, the scale at which European research can be conducted is limited. Considering the massive infrastructure that is required to train very large state-of-the-art MT/LT systems, Europe starts with a systemic handicap. Looking forward to 2030, we expect the movement towards more efficient and real-time translation to continue. Europe’s strong foundation in research and innovation can compensate for the disadvantage European organisations have with respect to infrastructure, provided that a concerted effort is undertaken in researching the development of new hardware platforms and AI training paradigms.

For Europe, a breakthrough in these fields is needed to remain on par with the rest of the world. Breakthroughs in the development of hardware platforms and training paradigms are also warranted by several EU policies.
Through the European Green Deal¹⁸ and the Horizon Europe Work Programme (European Commission 2021), the European Commission has committed to making “Europe the world’s first climate-neutral continent by 2050”, i.e., the economy must be transformed with the aim of climate neutrality. More efficient AI infrastructure can help in reducing the amounts of energy that are required for data storage and algorithm training. If we want MT to become ubiquitous, especially in embedded devices, the hardware on which it runs must be scaled down and the models that run on it must be adapted accordingly. Such adaptation must occur with a minimal loss of quality, while increasing translation speed and reducing power consumption. To achieve this, a breakthrough in MT hardware and software codesign is required; both need to be developed in cooperation to ensure that the capabilities of the hardware are aligned with the needs of MT training and inference.

An equally fundamental breakthrough is needed in the understanding of how our current algorithms work. Many NLP systems today are based on large pre-trained language models which have demonstrated outstanding results on different tasks. However, a boost in performance comes with a cost in efficiency and interpretability, which “is a major concern in modern Artificial Intelligence and NLP research, as black-box models undermine users’ trust in new technologies” (Fomicheva et al. 2021). The EU Coordinated Plan on Artificial Intelligence (ECPAI, European Commission 2018) recognises this problem and advocates the need for trustworthy AI, mainly from the perspective of the end-user, but interpretability and explainability of AI models are also of great importance for the scientific community. If researchers wish to improve their algorithms, they must gain a deeper understanding of what causes models to behave the way they do, in order to prevent models from performing poorly or from acting in a gender- or culturally-biased manner.
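One simple, model-agnostic way to probe a black-box system of the kind discussed above is leave-one-out (occlusion) attribution: remove one input token at a time and record how much the score of the system changes. The following Python sketch illustrates the idea only. The names `leave_one_out` and `toy_score` and the word weights are assumptions made for this illustration; `toy_score` merely stands in for a real model and is not taken from any system described in this chapter.

```python
from typing import Callable, List, Tuple

# Hypothetical word weights; a stand-in for the behaviour of an opaque model.
TOY_WEIGHTS = {"good": 1.0, "bad": -1.0}


def toy_score(tokens: List[str]) -> float:
    """Toy 'black box': sums hypothetical per-word weights (illustration only)."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)


def leave_one_out(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Attribute the score to tokens by deleting each token in turn.

    The attribution of a token is score(all tokens) minus score(tokens
    without that token); a large absolute value means the token mattered.
    """
    base = score(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        attributions.append((tok, base - score(reduced)))
    return attributions


if __name__ == "__main__":
    for token, weight in leave_one_out("the ending is bad".split(), toy_score):
        print(f"{token:>8}  {weight:+.1f}")
```

With a real system, `score` would wrap a call to the model, for instance a quality estimation score for a translation. Occlusion of this kind needs one extra model call per token and ignores interactions between tokens, so it is a starting point for analysis rather than a full explanation.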
The ECPAI correctly states that “[f]urther developments in AI require a well-functioning data ecosystem built on trust, data availability and infrastructure”, but it underestimates the effect that one of its cornerstones has had on data collection in the field. According to the plan, “[GDPR] is the anchor of trust in the single market for data. It has established a new global standard with a strong focus on the rights of individuals, reflecting European values, and is an important element of ensuring trust in AI. […] The Commission would like to encourage the European Data Protection Board to develop guidelines on the issue of the processing of personal data in the context of research. This will facilitate the development of large cross-country research datasets that can be used for AI.” (European Commission 2018).

18 https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52019DC0640

Unfortunately, GDPR has had an adverse effect on a large part of the European LT industry. Stakeholders in data management, publication and collection have come to incorrectly assume that all data is personal by default, as an overly cautious measure to comply with GDPR. This is especially true for human language data, since it has no fixed schema indicating when personal details may occur. As a result, expensive legal counsel and tools for anonymisation are applied in situations where they could be avoided or are not necessary at all. In addition, non-European AI companies have been able to operate without GDPR restrictions, which has given them a considerable competitive advantage over EU companies.

Although the ECPAI has foreseen a framework for the free flow of non-personal data in the European Union (European Union 2018b), including the creation of common European data spaces in a number of areas, and a proposal for a directive on the reuse of public sector information (European Union 2018a), the process of obtaining linguistic data that has been created using public funding is currently far too cumbersome and pull-oriented.
The data resulting from public procurement procedures has a tendency to remain locked up in privately-owned data silos, while the research community and LT industry must go to great lengths to identify and reconstruct the public part of this data using NLP tools (see, for example, Koehn 2005). A crucial breakthrough could be achieved if existing policy frameworks were adapted to make it mandatory for Member States to make all data in natural language-related workflows publicly available. It is the LT industry’s mission to reconstruct human thought processes in an automated way. Human operations on linguistic data such as translation, revision and correction of translations, summarisation, etc. can provide the necessary data points to train AI algorithms to achieve this mission. A policy-inspired push model would be greatly beneficial for the development of all related research domains. As a first step, public service administrators should be made aware of the value of their human workflows. As a second step, the IP resulting from public service workflows should be publicly released by default. Finally, workflow data should be made discoverable in a publication/subscription manner, so it can be easily picked up by interested parties.

Although MT has taken a big leap forward with the advent of neural systems, some types of translation remain very difficult. If we want MT to become pervasive for problematic text types (spreadsheets with tabular data, metadata fields, etc.), the problem of context modelling needs to be addressed.
For textual translation, incorporating ontological information may help. Continued development on multilingual lexical resources will be required for this. For multimodal settings, extra-lingual context must be incorporated to improve results. Context modelling is not only required to deal with short sentences or phrases, but also to obtain more cohesive translation across larger volumes of text. NMT systems have improved over SMT, but have not yet succeeded in efficiently incorporating basic grammatical relations between sentences and paragraphs. Since the majority of human language is produced outside of written texts, extra-lingual cues are often required to decode a message adequately and to translate it correctly. To enable better modelling of multimodal environments, we not only need research into how modalities can enrich one another, but also into how training and test sets can be constructed to achieve better modelling.

In terms of the development of data, two important breakthroughs which must be achieved are 1. the creation of new data sets, and reiteration over existing data sets; and 2. policy support for public data reuse. Ideally, new data annotation efforts should build upon existing work. For example, for document-level NMT this can be done with limited effort, as demonstrated in the WMT19 campaign (Barrault et al. 2019). For video and audio content, it will most definitely require more work, but with existing NLP technology it is not unthinkable that EU Parliament sessions could be semi-automatically linked with related video and audio content to create an annotated corpus that can be used for both building new NMT systems and analysing the contribution of multimodal features towards translation quality.

There are various other fields and areas in which further breakthroughs are needed, some of which are novel methods for document-level MT (with a focus on coherent translations of whole texts and documents), the integration of visual and audio features into MT approaches and engines as well as improved explainability (see Berzinš et al.
2022). Another field is quantum computing, where more research is needed on how MT, and NLP in general, can be reframed as a quantum computing problem. Current work is still laying the foundation for future developments, because the hardware needed is not available yet. But it is important to note that the first theoretical steps towards reformulating MT and NLP as quantum computing problems have already been made.

3.3 Technology Visions and Development Goals

The strategy of building huge MT models by collecting all available data coming from many different domains (and also languages in current multilingual systems) should be complemented by developing smaller models, too. These small(er) models should be trained using the largest possible set of available information, helping under-resourced languages and domains by appealing to knowledge from higher-resourced ones. One of the current problems is that if this results in a single huge model, most practitioners cannot run the model owing to hardware constraints, so smaller models adapted to particular language pairs and domains need to be made available. This would have several benefits: such models would be easy to integrate and use on any device, provide high-quality translations for all domains and languages, and also be greener by requiring fewer computational resources.

The future publicly available MT systems should be less dependent on large companies, especially those which are not European. The risk is that what is freely available now could (easily) be taken away if those companies – none of them MT companies per se, note – find a way to increase revenue in other directions, so that they deprecate their MT offerings, as has happened with other services provided by these large corporations.
Another challenge of the current systems is represented by various biases in the models, such as gender, racial and ethnic bias (Vanmassenhove et al. 2019). Such biases replicate regrettable patterns of socio-economic domination that are conveyed through language, since these biases are present in the training data and are then amplified by models which tend to choose more frequent patterns and discard rare ones. In the future, ethical and fair MT should not further propagate notions of inequality, but rather foster an inclusive society based on acceptance and respect.

More and more NMT systems are being developed which go beyond the single sentence level (e.g., Lopes et al. 2020), using a variety of different approaches: taking into account source- or target-language context, or both. Another interesting avenue being pursued is that different context spans have been investigated, ranging from a single preceding sentence to the entire ‘document’. While this might be straightforward for news articles and user reviews, the situation is different for literary texts or movie subtitles, to name but two. Future systems should be able to identify which sentences benefit from the availability of context, and then find that context. This task is far from trivial because relevant information can be found in different places, sometimes even beyond the given text, such as the topic of the text, the gender of the writer/speaker, or even general world knowledge.

Such external information can go beyond text data and include images, videos, tables, etc. by developing multimodal MT systems (Yao and Wan 2020). Such systems currently include image information to help in the translation of image captions. Future systems should combine sources of information which go beyond this, so that an image of a product can help disambiguate words in the description or review of the said product, for example. Multimodal models should also include sign language translation, which currently relies mainly on computer vision methods. Sign language MT should use models based on both images and natural language.
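The context spans mentioned above, from a single preceding sentence up to the whole document, can be made concrete with a small sketch. One common and simple way to give a sentence-level system access to source-side context is to concatenate the k preceding sentences to the current one. The Python sketch below shows only this input preparation step. The function name `with_source_context` and the separator token `<sep>` are assumptions made for illustration, and the sketch is not presented as the method of any system cited in this chapter.

```python
from typing import List

SEP = "<sep>"  # hypothetical separator token; real systems define their own


def with_source_context(sentences: List[str], k: int) -> List[str]:
    """Prefix each sentence with up to k preceding sentences of its document.

    k = 0 reproduces plain sentence-level input; k >= len(sentences) - 1
    gives the last sentence the whole preceding document as context.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    inputs = []
    for i, current in enumerate(sentences):
        context = sentences[max(0, i - k):i]  # at most k earlier sentences
        inputs.append(f" {SEP} ".join(context + [current]))
    return inputs


if __name__ == "__main__":
    document = ["She bought a bat.", "It flew away.", "Nobody saw it again."]
    for line in with_source_context(document, k=1):
        print(line)
```

In this toy document the second sentence is what disambiguates "bat", which is the kind of cross-sentence information a context-aware system can use. Deciding which sentences actually benefit from context, and how far back to look, is left open here, as it is in the text above.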
Training data, crucial to building models, should receive more attention. Currently, the majority of MT systems are trained on large amounts of data covering only a small number of languages, language pairs and domains. While progress in MT is mainly measured under high-resource conditions, the majority of domains and languages, including many of those spoken in Europe, are under- or low-resourced. Future systems should be able to cover all European languages as well as language pairs (not always including English or some other higher-resourced language), and be trained on many different domains and genres. For this to work for all – and not only for big companies and leading research teams – the availability and quality of training data should be increased. Attention should also be given to languages where there is no written tradition, in which case spoken-language data needs to be sourced. While techniques such as multilingual models, unsupervised MT, synthetic data, and transfer learning are all helping, if there is not enough good-quality data for a language (pair), then such methods will not reach the goal of high-quality MT, in which case novel methods and research breakthroughs will be needed in this direction.

The test sets used for assessing MT systems should receive more attention, too. Currently, a large number of research publications use news articles coming from shared tasks. Researchers test their systems on these texts and report improved automatic scores. However, some of the human translations in these test sets used as references for automatic scores are of poor quality (Toral et al. 2018).
The shared task organisers cannot be blamed for this situation, as they do the best that they can with the limited budgets that they have. Still, these human translations should be thoroughly examined in order to discard the inappropriate ones and keep only the good ones for long-term testing. Note that in light of the comparison between MT outputs and human translations carried out in recent years where claims of “human parity” have been investigated, the quality of human translations used in MT evaluation has to be high (Toral et al. 2018; Läubli et al. 2018).

In addition, other test sets coming from different genres and domains need to be more widely used. A vast number of systems are currently tested only on a limited set of domains, news being the predominant one, while many genres and domains are as yet hardly covered by current research, such as user-generated content (which itself is not a homogeneous genre), despite having great potential for future growth. In the long run, we strongly contend that MT systems should be tested on a large number of different domains and genres, and for an ever-increasing range of languages in order to help facilitate DLE. In this regard, the rise of NMT and its increasing quality have led to more and more challenge test sets (or test suites). These specified test sets enable better understanding of certain (linguistic) aspects which cannot be properly assessed in standard ‘natural’ test sets. The development and creation of such test sets necessitate a large amount of human expertise, time and effort. In the future, they should be easy and fast to create for any language pair.

As for the evaluation process itself, automatic metrics remain invaluable tools for the rapid development and comparison of MT systems. They have been developed and improved constantly, with more and more metrics coming onstream each year. However, a number of challenges remain.
Perhaps the most significant is that the community still relies to a large extent on BLEU, despite there being a large body of research pointing out its drawbacks. Future systems should be evaluated by new metrics which represent better approximations of human judgments and also ideally abandon the dependence on human reference translations, which is a serious limitation. Recently, more and more metrics based on neural networks and/or word representations have emerged which show better correlation with human judgment and do not require reference translations. However, these metrics have another limitation: they require labelled training data which, as we have pointed out, are available only for a limited number of language pairs and domains. Future automatic metrics should be equally valid without such constraints. In addition, all future automatic metrics should be able to evaluate MT output taking the context into account in order to be more reliable (Läubli et al. 2018; Castilho 2021).

Manual evaluation of translation quality, despite its disadvantages (time- and resource-intensive, as well as being subjective), remains the gold standard, both for evaluating MT systems and for developing suitable automatic metrics. That being said, the design of experiments and the standard method of reporting the results are far from perfect. Different papers use the same quality criterion name with different definitions, or the same definition with different names. Furthermore, many papers do not use any particular criterion, asking the evaluators only to assess “how good” the output is. We assert that any idea of a single standard general unspecified notion of quality should be abandoned, and factors like the context in which MT is to be used together with appropriate quality aspects should be considered, as pointed out by Way (2013) and Mason (2019). These aspects might include adequacy/accuracy, readability/comprehension, appropriate register, correct terminology, or adequately fulfilling a particular task.
Consequently, metrics should be created with such criteria designed in from the outset, and not only to provide a general unspecified score which is meaningless to most people.

Furthermore, recent research has found that readers tend to fully trust fluent translations as well as comprehensible translations even if they contain severe adequacy errors which change the actual content and deliver completely different information (Popović 2020; Martindale et al. 2021). Therefore, future automatic metrics should provide confidence indicators for translations in order to inform users about the level of trust they should have in the MT output they are reading.

Allowing users to interact naturally with machines via speech has the potential to greatly transform, enhance and empower work, leisure and social experiences. The increasing quality of MT and the expanding preference (especially among younger users) for voice-based interaction with devices point to more and more applications for speech-to-text and speech-to-speech translation. This means, of course, not only that spoken language input should become more and more a topic of close attention, but also that more data of exactly the right type needs to be available. By 2030, it is likely that the Automatic Speech Recognition-MT-Speech Synthesis pipeline will have been replaced by more direct approaches which model spoken language translation as an end-to-end process (Gangi et al. 2019), but clearly more work needs to be done in this regard.

Sign language translation should be widely available for many domains to break down language barriers for deaf and hearing-impaired users so that they can access information like the rest of society. For this to be done properly, sign language translation needs to include language features in addition to image features. In addition, it should not only be translated from/into text but also from/into speech.

It is more and more the case that MT is being used for expanding other NLP tasks (e.g., text classification, topic modelling, sentiment analysis) to multiple languages.
Usually, full translation is carried out and then the labels for the original source language together with the translations are used for training classifiers in the new target language. However, for such tasks, where the translated text is not used directly, quality criteria might be rather different, and full translation might not be necessary. Extracting different representations from various layers could be even better suited for certain tasks, so this option should be made easily available in future MT systems.

3.4 Towards Deep Natural Language Understanding

Applying a purpose- and communication-oriented view on MT allows us to discuss the extent to which MT needs (deep) NLU, since it helps to put the prevailing MT-related metrics – not related to purpose and communication aspects – in perspective. Accordingly, claims related to MT reaching parity with human translations are misleading since the metrics to measure this via reference translation data are too limited to address whether the intended communication has fulfilled its purpose when this is related to reader impression and style.

With a view on communication success, it becomes obvious that MT – core technology, evaluation methodologies, metrics and data for training and evaluation – needs NLP that goes beyond traditional capabilities such as detection of terms, keywords, labels, entities, relations, and sentiments. These capabilities – often referred to as ‘deep’ NLU – will be aware of context and able to consider annotations/metadata. Context and annotation awareness will allow MT to generate texts that are faithful to the intended communication (input view), take translation purpose/specifications/requirements into account (sender view), and show consideration of the reader/listener (output/consumer view).

Only MT with deep NLU will, for example, be able to efficiently support a human-to-human or human-to-machine conversation that exhibits qualities like being contextualised, adaptive, personalised, and knowledge-rich.
The following ingredients currently seem to emerge as important elements for next-generation MT (based on Deep NLU): 1. existing standards related to annotations; 2. the FAIR data principles as backbones of investment protection, and ‘responsible MT’; 3. experts like translators, domain specialists, modellers, data scientists for curation; 4. more open, standardised, flexible and robust technologies for all dimensions of data management; and 5. large, multilingual translation models that are safe to use and can easily be adapted for resource-sparse computing environments, to specific tasks and domains, and for low-resource languages.

4 Summary and Conclusions

Nowadays MT is widely used by the general public, public sector and government agencies, SMEs, LSPs and many other industries. This will continue to grow, covering new application areas to support Europe’s digital single market as well as DLE. Looking forward to 2030, we expect the movement towards deep NLU to enable efficient, real-time translation to support human-to-human or human-to-machine communication.

Despite the widespread celebration of multilingualism in the EU, there is no common policy addressing language barriers. So far, the absence of a clear roadmap and support for LT at European level has led to an incohesive, fragmented European market with disparate language support for the language communities of Europe. We hope that the ELE SRIA (Chapter 45) will have positive effects in this regard.

There is also a gap in publicly available MT services which cater specifically to the needs of people in Europe. Users around the world avail of free-of-charge MT services provided by global companies. The risk is that what is freely available now could (easily) be taken away if those companies find a way to increase revenue in other directions. The future publicly available MT systems should not depend on non-European multinationals.
With the help of neural networks, MT has recently improved significantly in its quality, consistency and productivity. However, in many cases the focus of new technologies is still on well-resourced languages, limiting diversity and reinforcing existing disparities. Furthermore, explainable and interpretable machine learning is attracting more and more attention, and a fundamental breakthrough is needed in the understanding of how current MT algorithms work.

The increasing quality of MT and the expanding preference for voice-based interaction point to applications for speech-to-speech translation and multimodal MT in order to break the language barrier for human communication.

Publicly available multilingual data should include a greater diversity of domains and languages, so that building high-quality MT systems becomes an option for all. Collection of usable language data is particularly important. If lawmakers could agree that using aligned translations of copyrighted data constitutes fair use, LT stakeholders could immediately avail of this high-quality language data. There is also a disparity between publicly available and proprietary bilingual data. A crucial breakthrough could be achieved if policy frameworks make it mandatory for Member States to make all data in natural language-related workflows publicly available.

Increased attention should be paid to the human judgments used for tailoring the automatic metrics, as well as to manual evaluation in general. There is also a lack of necessary resources (experts, HPC capabilities, etc.) compared to large US and Chinese IT corporations. There is also an uneven distribution of resources across countries, regions and languages.

Finally, the hardware on which MT runs must be scaled down. By ensuring that the capabilities of the hardware are aligned with the needs of MT training and inference models, smaller models would be easy to integrate and use on any device and also be greener by requiring fewer resources. The EU has the opportunity to be a pioneer in green LT by developing efficient models and hardware.
At the level of policies/instruments, much more synchronisation of activities between national and international bodies is necessary. A desirable approach for the efficient and homogeneous implementation of policies towards DLE would be more equal support for all EU languages, including equal involvement of national research communities.

References

Artetxe, Mikel, Gorka Labaka, and Eneko Agirre (2018). “Unsupervised Statistical Machine Translation”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Ed. by Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii. Association for Computational Linguistics, pp. 3632–3642. https://doi.org/10.18653/v1/d18-1399.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural Machine Translation by Jointly Learning to Align and Translate”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1409.0473.
Barrault, Loïc, Ondrej Bojar, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri (2019). “Findings of the 2019 Conference on Machine Translation (WMT19)”. In: Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1. Ed. by Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno-Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana L. Neves, Matt Post, Marco Turchi, and Karin Verspoor. Association for Computational Linguistics, pp. 1–61. https://doi.org/10.18653/v1/w19-5301.
Berzinš, Aivars, Marcis Pinnis, Inguna Skadina, Andrejs Vasiljevs, Nora Aranberri, Joachim Van den Bogaert, Sally O’Connor, Mercedes García-Martínez, Iakes Goenaga, Jan Hajič, Manuel Herranz, Christian Lieske, Martin Popel, Maja Popović, Sheila Castilho, Federico Gaspari, Rudolf Rosa, Riccardo Superbo, and Andy Way (2022). Deliverable D2.13 Technology Deep Dive – Machine Translation. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/MT-deep-dive.pdf.
Bié, Laurent, Aleix Cerda-i-Cucó, Hans Degroote, Amando Estela, Mercedes García-Martínez, Manuel Herranz, Alejandro Kohan, Maite Melero, Tony O’Dowd, Sinéad O’Gorman, Marcis Pinnis, Roberts Rozis, Riccardo Superbo, and Arturs Vasilevskis (2020). “Neural Translation for the European Union (NTEU) Project”. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation. Lisboa, Portugal: European Association for Machine Translation, pp. 477–478. https://aclanthology.org/2020.eamt-1.60.
Castilho, Sheila (2021). “Towards Document-Level Human MT Evaluation: On the Issues of Annotator Agreement, Effort and Misevaluation”. In: Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval). Online: Association for Computational Linguistics, pp. 34–45. https://www.aclweb.org/anthology/2021.humeval-1.4.
Castilho, Sheila, Natália Resende, Federico Gaspari, Andy Way, Tony O’Dowd, Marek Mazur, Manuel Herranz, Alex Helle, Gema Ramírez-Sánchez, Victor Sánchez-Cartagena, Marcis Pinnis, and Valters Šics (2019). “Large-scale Machine Translation Evaluation of the iADAATPA Project”. In: Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks. Dublin, Ireland: European Association for Machine Translation, pp. 179–185.
Dou, Zi-Yi, Antonios Anastasopoulos, and Graham Neubig (2020). “Dynamic Data Selection and Weighting for Iterative Back-Translation”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020. Ed. by Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu. Association for Computational Linguistics, pp. 5894–5904. https://doi.org/10.18653/v1/2020.emnlp-main.475.
European Commission (2018). Coordinated Plan on Artificial Intelligence. COM(2018) 795 final. https://digital-strategy.ec.europa.eu/en/policies/plan-ai.
European Commission (2021). Horizon Europe Work Programme 2021-2022. European Commission Decision C(2021)4200 of 15 June 2021. https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/how-to-participate/reference-documents.
European Union (2018a). Proposal for a Directive of the European Parliament and of the Council on the re-use of public sector information (recast), COM(2018) 234 final.
European Union (2018b). Regulation (EU) 2018/1807 of the European Parliament and of the Council of 14 November 2018 on a framework for the free flow of non-personal data in the European Union.
Fomicheva, Marina, Piyawat Lertvittayakumjorn, Wei Zhao, Steffen Eger, and Yang Gao (2021). “The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results”. In: Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems. Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 165–178.
Gangi, Mattia Antonino Di, Matteo Negri, Roldano Cattoni, Roberto Dessì, and Marco Turchi (2019). “Enhancing Transformer for End-to-end Speech-to-Text Translation”. In: Proceedings of Machine Translation Summit XVII Volume 1: Research Track, MTSummit 2019, Dublin, Ireland, August 19-23, 2019. Ed. by Mikel L. Forcada, Andy Way, Barry Haddow, and Rico Sennrich. European Association for Machine Translation, pp. 21–31. https://aclanthology.org/W19-6603/.
Haddow, Barry, Alexandra Birch, and Kenneth Heafield (2021). “Machine Translation in Healthcare”. In: The Routledge Handbook of Translation and Health. Routledge, pp. 108–129.
Han, Jesse Michael, Igor Babuschkin, Harrison Edwards, Arvind Neelakantan, Tao Xu, Stanislas Polu, Alex Ray, Pranav Shyam, Aditya Ramesh, Alec Radford, and Ilya Sutskever (2021). “Unsupervised Neural Machine Translation with Generative Language Models Only”. In: CoRR abs/2110.05448. https://arxiv.org/abs/2110.05448.
Kenny, Dorothy, ed. (2022). MultiTraiNMT: Machine Translation for Multilingual Citizens. In preparation. Berlin: Language Science Press.
Koehn, Philipp (2005). “Europarl: A Parallel Corpus for Statistical Machine Translation”. In: Proceedings of Machine Translation Summit X: Papers, MTSummit 2005, Phuket, Thailand, September 13-15, 2005, pp. 79–86. https://aclanthology.org/2005.mtsummit-papers.11.
Läubli, Samuel, Rico Sennrich, and Martin Volk (2018). “Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation”. In: Proceedings of EMNLP. Brussels, Belgium, pp. 4791–4796.
Libovický, Jindřich, Helmut Schmid, and Alexander Fraser (2022). “Why don’t people use character-level machine translation?” In: Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics, pp. 2470–2485.
Lopes, António V., M. Amin Farajian, Rachel Bawden, Michael Zhang, and André F. T. Martins (2020). “Document-level Neural MT: A Systematic Comparison”. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, EAMT 2020, Lisboa, Portugal, November 3-5, 2020. Ed. by Mikel L. Forcada, André Martins, Helena Moniz, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof Arenas, Mary Nurminen, Lena Marg, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra Escartín, and Isabel Trancoso. European Association for Machine Translation, pp. 225–234. https://aclanthology.org/2020.eamt-1.24/.
Ma, Shuming, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei (2021). “DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual En­coders”. In: CoRR abs/2106.13736.https://arxiv.org/abs/2106.13736. Martindale, Marianna, Kevin Duh, and Marine Carpuat (2021). “Machine Translation Believabil­ity”. In: Proceedings of the First Workshop on Bridging Human – Computer Interaction and Natural Language Processing. Online: Association for Computational Linguistics, pp. 88–95. https://aclanthology.org/2021.hcinlp-1.14. Mason, Sarah Bawa (2019). “Joss Moorkens, Sheila Castilho, Federico Gaspari, Stephen Doherty (eds): Translation quality assessment: from principles topractice – MachineTranslation: Tech­nologies and Applications, Volume 1, Springer International Publishing, Heidelberg & Berlin, xii + 287 pp, ISBN 978-3-319-91240-0 (hardcover), 978-3-030-08206-2 (paperback), 978-3­319-91241-7(eBook)”. In: Machine Translation 33.3, pp. 269–277. https://doi.org/10.1007/s1 0590-019-09241-w. Metuzale,Kristine, Alexandra Soska, andMarcisPinnis(2020). “ATaleof Eight Countries orthe EU Council Presidency Translator in Retrospect”. In: Proceedings of the 14th Conference of the Association for Machine Translation in the Americas, AMTA 2020 -Volume 2: User Pa­pers, Virtual, October, 2020.Ed.byJaniceCampbell,DmitriyGenzel,BenHuyck,andPatricia O’Neill-Brown.Association forMachine TranslationintheAmericas,pp.525–546. https://acl anthology.org/2020.amta-user.25/. Nguyen, Khanh, Hal Daumé III, and Jordan Boyd-Graber (2017). “Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Den-mark:AssociationforComputational Linguistics, pp. 1464–1474. https://www.aclweb.org/ant hology/D17-1153. 
O’Brien, Sharon,Patrick Cadwell, andAlicia Zajdel(2021). Communicating COVID-19: Transla­tion and Trust in Ireland’s Response to the Pandemic. Tech. rep. School of Applied Language and Intercultural Studies, Dublin City University. https://www.dcu.ie/sites/default/files/inline­ files/covid_report_compressed.pdf. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). “Bleu: a Method for Au­tomatic Evaluation of Machine Translation”. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, pp. 311–318. https://aclanthology.org/P02-1040/. Pinnis, Marcis, Stephan Busemann, Arturs Vasilevskis, and Josef van Genabith (2021). “The Ger-manEU Council PresidencyTranslator”.In: KI – Künstliche Intelligenz. Popel,Martin(2018).“MachineTranslationUsingSyntacticAnalysis”.PhDthesis.Praha,Czechia: MFF UK. Popoviæ,Maja(2020).“Relationsbetweencomprehensibilityandadequacyerrorsinmachinetrans­lation output”. In: Proceedings of the 24th Conference on Computational Natural Language Learning. Online:Associationfor ComputationalLinguistics,pp. 256–264. Rehm, Georg, Katrin Marheinecke, Rémi Calizzano, and Penny Labropoulou (2023). “Language TechnologyCompanies,ResearchOrganisationsandProjects”.In:European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. CognitiveTech­nologies.Cham,Switzerland: Springer,pp. 171–185. Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Lan­guages in the Digital Age. 32 volumes on 31 European languages. Heidelbergetc.: Springer. Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016a). “Improving Neural Machine Trans-lationModelswithMonolingualData”.In:Proceedings of the 54th Annual Meeting of the Asso­ciation for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics. https://doi.org/10.18653/v1/p16-1 009. 
Sennrich,Rico,BarryHaddow,andAlexandraBirch(2016b).“NeuralMachineTranslationofRare WordswithSubwordUnits”.In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.The AssociationforComputerLinguistics. https://doi.org/10.18653/v1/p16-1162. Skadins, Raivis, Marcis Pinnis, Arturs Vasilevskis, Andrejs Vasiljevs, Valters Sics, Roberts Rozis, and Andis Lagzdins (2020). “Language Technology Platform for Public Administration”. In: Human Language Technologies – The Baltic Perspective.Ed.byUtkaAndrius,VaicenonieneJu­rgita,KovalevskaiteJolantai,andKalinauskaiteDanguole.Vol.328.FAIA.IOSPress,pp.182– 190. Strubell, Emma, Ananya Ganesh, and Andrew McCallum (2019). “Energy and Policy Considera­tions for Deep Learning in NLP”. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers. Ed. by Anna Korhonen, David R. Traum, and Lluis Marquez. Association for ´ComputationalLinguistics, pp. 3645–3650. https://doi.org/10.18653/v1/p19-1355. Sutskever,Ilya,OriolVinyals,andQuocVLe(2014).“Sequencetosequencelearningwithneural networks”.In: Advances in neural information processing systems,pp.3104–3112. Toral, Antonio, Sheila Castilho, Ke Hu, and Andy Way (2018). “Attaining the Unattainable? Re­assessing Claims of Human Parity in Neural Machine Translation”. In: Proceedings of WMT. Brussels,Belgium, pp. 113–123. Vanmassenhove, Eva, Dimitar Sht. Shterionov, and Andy Way (2019). “Lost in Translation: Loss andDecayofLinguisticRichnessin MachineTranslation”. In: Proceedings of Machine Trans­lation Summit XVII Volume 1: Research Track, MTSummit 2019, Dublin, Ireland, August 19-23, 2019. Ed. by Mikel L. Forcada, Andy Way, Barry Haddow, and Rico Sennrich. European As-sociationforMachine Translation,pp. 222–232. https://aclanthology.org/W19-6622/. 
Vasiljevs,Andrejs,KhalidChoukri,LucMeertens,andStefaniaAguzzi(2019). Final study report on CEF Automated Translation value proposition in the context of the European LT market/e­cosystem. DOI: 10.2759/142151. https://op.europa.eu/de/publication-detail/-/publication/8494 e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, £ukasz Kaiser,and Illia Polosukhin (2017). “Attentionis all you need”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Way, Andy (2013). “Traditional and Emerging Use-Cases for Machine Translation”. In: Proceed­ings of Translating and the Computer. Vol. 35. London. Yang,Jian,ShumingMa,HaoyangHuang,DongdongZhang,LiDong,ShaohanHuang,Alexandre Muzio,SakshamSinghal,HanyHassan,XiaSong,andFuruWei(2021).“MultilingualMachine TranslationSystemsfromMicrosoftforWMT21SharedTask”.In:Proceedings of the Sixth Con­ference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021. Ed. by Loi¨c Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, AlexanderFraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno-Yepes, Philipp Koehn, Tom Kocmi, André Martins, Makoto Morishita, and Christof Monz. Association for ComputationalLinguistics, pp. 446–455. https://aclanthology.org/2021.wmt-1.54. Yao,ShaoweiandXiaojunWan(2020).“MultimodalTransformerforMultimodalMachineTrans­lation”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Lin­guistics. Online: Association forComputationalLinguistics,pp. 4346–4350. https://aclantholo gy.org/2020.acl-main.400. Zhang, Biao, Philip Williams, Ivan Titov, and Rico Sennrich (2020). “Improving Massively Mul­tilingual Neural Machine Translation and Zero-Shot Translation”. 
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.Ed.byDanJurafsky,JoyceChai, Natalie Schluter,andJoelR.Tetreault.Associationfor ComputationalLinguistics, pp. 1628–1639. https://doi.org/10.18653/v1/2020.acl-main.148. Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation,distributionandreproductioninanymediumorformat,aslongasyougiveappropriate credittotheoriginalauthor(s)andthesource,providealinktotheCreativeCommons licenseand indicateif changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutoryregulationorexceedsthepermitteduse,youwillneedtoobtainpermissiondirectlyfrom thecopyrightholder. Chapter 41 Deep Dive Speech Technology Marcin Skowron, GerhardBackfried, EvaNavas,Aivars Berzinš, Joachim Van den Bogaert, Franciska deJong, Andrea DeMarco, Inma Hernáez,Marek Kováè, Peter Polák, JohanRohdin,MichaelRosner, Jon Sanchez, Ibon Saratxaga, andPetr Schwarz Abstract This chapter provides an in-depth account of current research activities and applications in the field of Speech Technology (ST). 
It discusses technical, scientific, commercial and societal aspects in various ST sub-fields and relates ST to the wider areas of Natural Language Processing and Artificial Intelligence. Furthermore, it outlines breakthroughs needed, main technology visions and provides an outlook towards 2030 as well as a broad view of how ST may fit into and contribute to a wider vision of Deep Natural Language Understanding and Digital Language Equality in Europe. The chapter integrates the views of several companies and institutions involved in research and commercial application of ST.¹

Marcin Skowron · Gerhard Backfried
HENSOLDT Analytics GmbH, Austria, marcin.skowron@hensoldt.net, gerhard.backfried@hensoldt.net

Marek Kováč · Johan Rohdin · Petr Schwarz
Phonexia, Czech Republic, kovac@phonexia.com, rohdin@phonexia.com, schwarz@phonexia.com

Eva Navas · Inma Hernáez · Jon Sanchez · Ibon Saratxaga
University of the Basque Country, Spain, eva.navas@ehu.eus, inma.hernaez@ehu.eus, jon.sanchez@ehu.eus, ibon.saratxaga@ehu.eus

Aivars Berzinš
Tilde, Latvia, aivars.berzins@tilde.com

Joachim Van den Bogaert
CROSSLANG, Belgium, joachim.van.den.bogaert@crosslang.com

Franciska de Jong
CLARIN ERIC, The Netherlands, franciska@clarin.eu

Andrea DeMarco · Michael Rosner
University of Malta, Malta, andrea.demarco@um.edu.mt, mike.rosner@um.edu.mt

Peter Polák
Charles University, Czech Republic, polak@ufal.mff.cuni.cz

¹ This chapter is an abridged version of Backfried et al. (2022).

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_41

1 Introduction

Speech – as the most natural manner for humans to interact with computers – has always attracted enormous interest. Speech Technology (ST) has been a focus of research and commercial activities over the past decades. From humble beginnings in the 1950s, the field has come a long way to current state-of-the-art approaches.
Stimulated by a shift towards statistical methods, the 1980s witnessed an era of Hidden-Markov-Models (HMM), Gaussian-Mixture-Models (GMM) and word-based n-gram models combined into speech recognition engines employing ever more refined data structures and search algorithms (Jelinek 1998). The availability of data to train these systems was limited to only a few languages, often driven by security and commercial interest. Even then, work on neural networks (NN) was already being carried out and viewed by many as the most promising approach. However, it was not until later (2000s) that the availability of training data paired with advances in algorithms and computing power finally began to unleash the full potential of NN-based ST. Especially over the past couple of decades, ST has evolved dramatically and become omnipresent in many areas of human-machine interaction. Embedded into the wider fields of Artificial Intelligence (AI) and Natural Language Processing (NLP), the expansion and scope of ST and its applications have accelerated further and gained considerable momentum. Recently, these trends were complemented by a paradigm shift related to the rise of language models (Bommasani et al. 2021), such as BERT (Devlin et al. 2019) or GPT-3 (Brown et al. 2020): models trained on a broad scale, adaptable via fine-tuning and able to perform very well on a wide range of tasks. Substantial advances in algorithms and high-performance hardware have led to massively increased adoption and further technological improvements. With speech and natural language forming fundamental pillars of human communication, ST may now even be perceived as “speech-centric AI”.

With the emergence of intelligent assistants, ST has become ubiquitous, yet many ST systems can only cope with restricted domains and can be used only with the most widely spoken languages. For languages with a low number of speakers, ST systems are still all but absent or severely limited in their scope. Recent advances in Machine Learning (ML) and ST have begun to enable the creation of models also for such less well-resourced languages.
However, these approaches are generally more complex, expensive and less suitable for wide adoption. While recently presented results indicate that novel approaches could indeed be applied to address some of the challenges related to low-resourced languages, the scope of their application and inherent limitations are still the subject of ongoing research (Lai et al. 2021).

STs have been investigated and researched in their own right. However, their full potential often only becomes evident when combined with further technologies forming intelligent systems capable of complex interaction, encompassing a diverse set of contexts and spanning multiple modalities. To the casual user, individual components then become blurred and almost invisible with one overall application acting as the partner within an activity which may otherwise be carried out together with a fellow human being. In this setting, the aggregation of technologies goes beyond narrow and highly specialised systems towards combined and complex systems, providing a notion of a more general and broader kind of intelligence. Speech and language, as the most natural vehicles for humans to communicate with machines, thus become the gatekeepers to and core of a broader kind of AI.

1.1 Scope of this Deep Dive

The scope of this deep dive encompasses a wide range of STs including language identification, speaker recognition, automatic speech recognition, technologies addressing paralinguistic phenomena as well as text-to-speech. It gathers and synthesises the perspectives of European research and industry stakeholders on the current state of affairs, identifies several main gaps affecting the field, outlines a number of breakthroughs required and presents the technological vision and development goals for the next years. In line with the other deep dives in this book, we adopt a multidimensional approach where both market/commercial as well as research perspectives are considered and concentrate on the following aspects: technologies, models, data, applications and the impact of ST on society.
The tendency for the combination of technologies into more powerful systems, encompassing several individual technologies and models, has become apparent and is reflected throughout this chapter.

1.2 Main Components

STs encompass technologies on the recognition as well as production side of speech. They comprise a wide spectrum of sub-fields such as automatic speech recognition (ASR), the identification of language or dialects, speaker recognition/identification (SR/SID), the detection of age and gender, emotions, paralinguistic traits and the production of synthesised speech (often called text-to-speech).

2 State-of-the-Art and Main Gaps

2.1 State-of-the-Art

Traditional ASR systems consist of components for audio pre-processing, an acoustic model, a pronunciation model as well as a language model defined over units of a lexicon. Within a search algorithm, these elements are combined to produce the most likely transcript given the input audio. In this scheme, models generally are of a generative nature and optimised individually. Since the early 2000s, these components have progressively been replaced with deep neural networks (DNNs). This change was made possible by advances in algorithms and models as well as the massive increase in available training data and computing power (GPUs). As a result, word error rates (WERs) could be reduced considerably in many domains and languages. However, the performance of ASR systems still varies dramatically depending on the domain and language, with low-resource languages still exhibiting WERs resembling those of English many years ago.

For applications in practice (“ASR in the wild”), hybrid systems combining elements such as HMMs and DNNs still dominate the state of play. As such, they can still be regarded as state-of-the-art outside of research labs. Toolkits like Kaldi provide a sound basis for the development of systems for research as well as commercial environments. Novel approaches in the area of self-supervised learning, e.g., Wav2Vec 2.0 by Facebook (Baevski et al.
2020), focus on leveraging vast amounts of unlabelled data. Latent representations of audio are produced representing speech sounds similar to (sub-)phonemes which are then fed into a Transformer network. This approach has been shown to outperform other typical paths of semi-supervised methods, while also being conceptually simpler to implement and execute. The possibility to employ smaller amounts of labelled data as well as being able to train multilingual models provide strong arguments for such approaches.

Typically, ASR outputs unstructured and normalised text without punctuation marks. This is not problematic in use-cases where the user input is short and concise, e.g., when asking a question to a virtual assistant. However, when generating transcripts for longer speech, it is crucial to restore punctuation to improve readability and provide structure to the transcript. Moreover, punctuation is relevant for further downstream tasks such as named-entity recognition (NER), part-of-speech (POS) tagging and machine translation (MT). Recognition errors introduced by ASR may lead to cascaded errors in these tasks, e.g., for MT (Ruiz et al. 2019).

State-of-the-art SR systems use neural networks to extract a representation (embedding) for the speaker in an utterance. The input to the network typically consists of features extracted from frames of 20–30 ms, although there are also ongoing efforts to take the raw waveform as an input. Embeddings are then compared in order to decide whether they are from the same person or not. Typical NN architectures for embedding extraction are TDNN, ResNet, or LSTM. The standard choice of backend is a generative model: Probabilistic Linear Discriminant Analysis (PLDA). Recently, using cosine similarity plus an affine transform has proven to yield competitive performance.
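The cosine-plus-affine scoring mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the method of any particular system: it assumes the affine transform is applied to the cosine score as a scale and an offset (the text does not specify where the transform is applied), and the function names, the toy embeddings and the parameter values are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two speaker embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verification_score(enrol: Sequence[float], test: Sequence[float],
                       scale: float = 1.0, offset: float = 0.0) -> float:
    """Affine-transformed cosine score: scale * cos(enrol, test) + offset.

    `scale` and `offset` are hypothetical calibration parameters; in a real
    system they would be estimated on held-out verification trials.
    """
    return scale * cosine_similarity(enrol, test) + offset


def same_speaker(enrol: Sequence[float], test: Sequence[float],
                 threshold: float = 0.0,
                 scale: float = 1.0, offset: float = 0.0) -> bool:
    """Accept the trial if the calibrated score reaches the threshold."""
    return verification_score(enrol, test, scale, offset) >= threshold


if __name__ == "__main__":
    enrolment = [0.9, 0.1, 0.4]   # toy embedding of the enrolled speaker
    candidate = [0.8, 0.2, 0.5]   # toy embedding of the test utterance
    print(verification_score(enrolment, candidate, scale=10.0, offset=-5.0))
    print(same_speaker(enrolment, candidate, scale=10.0, offset=-5.0))
```

In this sketch the embeddings are taken as given; in practice they would come from an extractor network such as those listed above.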
An advantage of generative backends is that scoring with different numbers of enrolment utterances becomes trivial. In addition to variations of the embedding extractor architecture, many recent research efforts have focused on the training objective. If the task at hand is verification, the most intuitive manner would be to train the extractor for this task. However, in practice, it often works better to train the extractor for classification. That is, for a training utterance the network should classify who among the speakers in the training set speaks in the utterance.

State-of-the-art language identification (LID) systems are based on DNNs ingesting sequences of frame-level features as input, processing them and applying a pooling mechanism to obtain an utterance level representation which is eventually classified. During training, this whole chain is performed in an end-to-end (E2E) fashion. In testing, either the trained DNN is used directly for classification or the utterance level representations can be extracted and used in a simple backend for classification, e.g., a Gaussian linear classifier.

In the field of Speech Emotion Recognition (SER), a wide range of methods have been used to extract emotions from signals. Similar to other ST domains, Deep Learning is rapidly becoming the method of choice and several E2E models have been proposed (Tang et al. 2018). Unlike ASR, these have not yet become part of our everyday lives. To achieve this goal, SER systems require more accurately labelled data to improve training accuracy, more powerful hardware to speed up processing, and more powerful algorithms to improve recognition rates. In addition, further insights from fields such as psychology or neurology may be required. Detecting the cognitive states and reactions of a user is a step towards designing proactive systems capable of adapting to the user’s needs, preferences and abilities. As in other related ST-fields, the detection of personality traits, mood disorders, signs of depression and other medical conditions has found its application in recent years.
Techniques based on automatic processing of the voice signal have been used for language and cognitive assessments. These approaches provide the means for quantifying signal properties relevant for the detection of specific pathologies. With the development of automatic methods that facilitate the ongoing monitoring of the large population suffering from Alzheimer’s disease, a number of industry applications aimed at the detection of neurodegenerative disorders have been introduced.

Neural networks have greatly impacted the speech synthesis field by improving the quality and naturalness of synthetic voices compared to traditional systems and by enabling training in an E2E fashion. While traditional multi-stage pipelines are complex and require extensive domain expertise, E2E systems reduce the complexity by generating the audio directly from the input text without requiring separate models. E2E text-to-speech (TTS) systems have shown excellent results in terms of audio quality and naturalness. However, they usually suffer from low training efficiency, requiring large sets for training. Full E2E architectures have been proposed, e.g., FastSpeech 2 (Ren et al. 2021). These systems produce spectrograms from text by applying an encoder-decoder architecture that produces a latent representation of the input text (or phonetic transcription) which is subsequently transformed into spectrograms. These systems provide outstanding results in terms of the quality and naturalness of the generated voices but require large amounts of high-quality recordings to be trained properly. Efforts are being made to deploy these systems for low-resource languages by improving data efficiency, applying transfer learning or training multilingual models. Other areas of intense research activity are style transfer, controllable and expressive voice generation, new efficient neural vocoders and speaker adaptation with a reduced amount of data. Regarding expressive speech synthesis, Global Style Tokens (Wang et al.
2018) represent one of the most common approaches. It consists of a reference encoder, encoding the speech Mel-spectrogram, and a style token layer, learning different prosodic aspects in a set of trainable embeddings. The reference embedding is compared with each style token with the help of a sequence-to-sequence multi-head attention module, forming a weighted sum of the style tokens called “style embedding”. This style embedding is then concatenated to the text encoder output, thus conditioning the Mel-spectrogram synthesis on both text and encoded prosody of the speech. Other popular methods include Flowtron (Valle et al. 2021), Mellotron (Valle et al. 2020), and Ctrl-P. Developing high-quality synthetic voices with DNN-based techniques requires large amounts of high-quality recordings from a single speaker. This requirement is often difficult to fulfil, especially for minority languages and dialectal speech. The generation of new synthetic voices is also hindered by this extensive data requirement. Efforts are being made to share data among languages and speakers in order to train the common aspects more robustly. Multi-speaker and multi-language modelling is a common strategy in DNN-based TTS synthesis to achieve improved voice quality with a reduced amount of data from a single speaker. However, the quality of these voices is not yet comparable to those obtained with large databases.

2.2 Main Gaps

While ST has found its way into a series of application fields, various important issues have not been addressed thoroughly and remain active areas of research. In the following, we review the main gaps and present them in the context of global and regional business activities, requirements related to the availability of qualified personnel, privacy and trust concerns, as well as technical and end-user perspectives.

Effects of scale – A trend towards increasingly complex E2E systems can be observed in all areas of ST.
Due to the extreme demand on resources, e.g., data, compute, energy, or infrastructure, the construction of such models is limited to a handful of actors. The activities to make pre-trained language models available for transfer learning and fine-tuning and to allow others to also participate in major advances are certainly beneficial. However, the extent of this transfer and level of control in the hands of a few institutions poses a risk to other actors, to the market and potentially even to innovation in the sector as a whole. Compared to the US and China, European players are at a stark disadvantage concerning resources, i.e., data, technology and funding. Academic institutions risk lagging behind industrial research due to a lack of resources and may have to rely on national initiatives to keep up.

Trained personnel and expertise – A further gap, concerning all areas of speech processing, can be identified in the scarcity of trained personnel and expertise as well as the risk of losing emerging talent to innovative power-players outside of Europe (with possibilities and employment conditions which generally cannot be matched by European players). Even in light of the democratisation of technology and auto-ML, allowing a much broader audience to create models and deploy these for use, respective educational programmes in speech (and NLP/LT) technologies form the foundation for future European success in these areas and may hinder it if not appropriately established and strengthened.

Privacy and trust – Data leaks and scandals in recent years have spurred the interest of individuals as well as of policy-makers. Concerns have arisen regarding trust, privacy, intrusion, eavesdropping, or the hidden collection and use of data. These concerns have been recognised by many actors but are only addressed to a very limited extent, as they often counteract commercial interests.
Technical perspectives – The focus of many ST fields on rather constrained conditions has left gaps in more diverse settings such as: processing of distant speech; noisy environments; accented speech, non-native speech, dialectal speech, code-switching, spontaneous, unplanned speech, emotional speech and connected aspects concerning sentiments expressed; the integration of ST into collaborative environments, multiple, simultaneous speakers engaged in vivid discussions; as well as the integration of paralinguistic aspects and technologies.

Group settings, multiple-user scenarios – While most research focuses on a single user’s interactions, STs embodied in virtual assistants are becoming increasingly popular in social spaces. This highlights a gap in our understanding of the opportunities and constraints unique to multiple user scenarios. These include detecting whether users are addressing the system or other participants, speaker diarisation, aspects of social dynamics, and finding interaction barriers. Due to these factors, the usefulness of voice interfaces in group settings is still restricted.

Interdisciplinary research work (Digital Humanities and Social Sciences and the Humanities, SSH) – While the connection to the field of digital humanities and computational social sciences is not firmly established yet, it could be beneficial to set up collaborative links with a range of disciplines and domains working with spoken data. In particular, the insights and requirements stemming from the needs for transcription workflows and audio mining tools of communities producing and (re)using oral history data and interview recordings may help identify gaps in language resources for model training and domain adaptation (Draxler et al. 2020). It could be beneficial to identify imbalances in language-specific support for the recognition, annotation and retrieval of the types of structured conversational speech that are used in interview settings in SSH and beyond (Pessanha and Salah 2022).
Challenges related to an increased modelling power – The increase in modelling power and performance achieved over the last years also comes with some drawbacks and challenges. These include a need for even more data and, correspondingly, a lack of interest in and work on the creation of new paradigms using less data. Current approaches include shallow and deep fusion, but the question of how to optimally combine language models (LMs) and DNN structures has still not been addressed comprehensively. Models requiring the complete input sequence for processing do not match well with requirements to perform causal processing. Several attempts to enable causal processing are being explored, among them the use of neural transducers running processing at regular intervals. The extent of context may also incur additional processing costs which need to be balanced and mitigated.

Models: interoperability and transparency – Models are not transparent and thus hard to interpret. This is partly due to the fact that previously individual components have been combined into single models. The complex process of hyper-parameter tuning is often too resource-intensive and thus has not been addressed in many instances. Elements of input/output like byte-pair-encodings (BPE) have been suggested but these contradict the idea of genuine E2E processing. Integration of several components into one model prompts the question of whether further downstream technologies will also become part of such integrated models. The combination in turn raises questions about the interpretability and transparency of such systems.

Explainability and transparency for critical methods and technologies – While in the last decade, ST research has achieved improvements in terms of performance, progress in terms of understanding of the architectures used and of the nature of the data and task has been limited.
This is partly due to the fact that the NNs used in modern systems are harder to understand than the generative models of previous generation systems. It is also due to a lack of interest from the industry and funding agencies to support this type of research. Students are also generally inclined to work on topics that mainly aim at improving performance since this increases their chances of obtaining a well-paid job in the industry after graduation.

End-users’ perspective – STs have made a leap in becoming adopted in many settings for commercially attractive languages. Especially the proliferation of intelligent Voice Assistants (VAs) has made speech a common mode of interaction. However, several issues limiting the further adoption and widespread use of ST remain: these include problems in accurately recognising accented speech, a lack of trust in VAs to execute more complex or sensitive tasks, and concerns related to privacy and data collection. This issue is further exacerbated by the fact that systems often operate in the cloud rather than on-premise. Many VAs may already be utilised in languages other than English, but coverage and supported functionality vary greatly. The gaps in support create barriers for users whose primary language is not fully catered for, or supported only to a limited extent, forcing them to communicate in a non-native language or risk being excluded from using the ever more popular systems and services. This way, non-native users are pushed to develop different strategies and modes of interaction, including a reduced level of language production and more frequent use of visual feedback.

Data: availability, diversity – The main challenge related to data concerns its availability, i.e., adequate datasets for low-resource languages of an appropriate amount and quality.
Various efforts aim to mitigate this fact by focusing on transfer learning and fine-tuning of models. However, whereas this approach is certainly beneficial, it generally does not yield models of equal performance as for languages equipped with large amounts of training data. The lack of data for low-resource languages effectively excludes certain approaches from being applied.

Data: diversity of voices – Some public databases available to train DNN-based TTS systems are only useful for building monolingual neutral voices for a number of major languages. The availability of open data free of restrictions such as copyright and limitations due to GDPR regulations in the remaining major languages and all minority languages would allow the development of TTS systems for these languages too. Databases with more expressive and spontaneous recordings are needed to build TTS systems suitable for more emotion-demanding applications like audiobook reading, movie dubbing and HCI. The vast majority of datasets correspond to adult voices and there is a lack of data to generate child and elderly voices. As the voice is an important component of our identity, more diverse datasets are needed to generate personalised voices that can suit any user.

Accuracy: reaching usable thresholds for applications – The single most frequently mentioned hindering factor for the broad adoption of ST is one that has been mentioned for the past 40 years, namely accuracy. The perceived accuracy and its exact meaning have changed dramatically: from individual words being misrecognised to intentions that are not correctly interpreted in complex situations. For example, WER as an evaluation measure has had its merits in measuring progress in ASR (and still does).
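As a concrete illustration of the measure, WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below assumes plain whitespace tokenisation and no text normalisation; the function name and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference. The measure also weights every word equally, which is one reason it says little about whether the intent of an utterance was preserved.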
However, more comprehensive approaches to measuring the impact of ASR performance on downstream tasks and actual deployments may require novel measures. WER alone clearly does not provide the full picture when it comes to the perceived performance and usability of complete systems comprising several kinds of STs and LTs. Regarding TTS, insufficient accuracy manifests as a lack of naturalness and robustness of the synthesised speech. Different approaches have been taken, some of them focused on designing robust attention mechanisms, others including alignment information at the input, or substituting the attention mechanism with networks that can predict the estimated duration of the input phonemes. However, the problem has not been solved completely yet and keeps hindering the practical application of TTS systems in many instances. For SR, technologies have already reached acceptable performance for many applications. However, this does not mean that there is no need or opportunity for further research. All applications of SR would benefit from better core performance and increased robustness to different acoustic conditions and other variables occurring in real-world speech data.

Dialectal speech and multilingual training – Most ST systems process speech only in the main variety of a language. To date, little attention has been devoted to dialectal speech. Certain STs can be used in languages different from the one(s) they were originally designed for. However, the performance of such systems typically deteriorates. Some progress has been made to make systems more language-independent (e.g., multilingual training, adversarial adaptation), but there is still ample room for improvement. The effectiveness of such approaches for languages that differ substantially from those used in training has not been investigated thoroughly and warrants further work.
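For reference, the WER measure discussed under accuracy above is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The following Python sketch is illustrative only: the function name word_error_rate and the example sentences are assumptions made here and do not come from any toolkit referred to in this chapter, and text normalisation (casing, punctuation) is deliberately left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; real evaluations usually
    normalise casing and punctuation before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "turn on the light"
print(word_error_rate(reference, "turn off the light"))     # 0.25, intent reversed
print(word_error_rate(reference, "turn on the the light"))  # 0.25, intent preserved
```

Both outputs receive the same score although only the first one changes what the user asked for, which is one way to see why WER on its own may say little about the usability of a complete system.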
3 The Future of the Area

3.1 Contribution to Digital Language Equality

Purely technological systems do not exist – they are always embedded in a social context and should thus always be viewed as socio-technical systems. The applications of ST have diverse and multifaceted impacts on several key aspects for societies. Technologies reaching performance levels resembling those of humans may in many aspects lead to a humanisation of technology, ascribing human attributes to system behaviour. Patterns of human-to-human (H2H) interaction may be applied to human-to-machine (H2M) interaction, leading to heightened expectations and potentially to subsequent disillusion.

Digital language inequality – The unbalanced availability and quality of ST resources strongly impact the performance of systems for different groups of languages. For languages supported to a lesser extent, performance and accuracy are typically significantly lower compared to resource-rich languages. In extreme cases, selected functionalities or support for such languages may not be available at all. In addition, language varieties, dialects or accents may not be supported or only supported on very limited levels. STs are thus neither accessible nor available to everyone on an equal level. The lack of commercial interest in the long tail of “small languages” translates to a significantly slower pace of ST improvements and commercial adoption for the latter group. For native speakers of these languages, these imbalances lead to wider usage of the better-supported major languages, such as English. Motivating speakers to use these major languages more frequently creates a new set of challenges related to handling accented and non-native speech. Compared to the level of service and the support provided for native speakers, this results in lower performance, weakened experience and reduced usability, rendering ST less useful or even useless in the extreme case.
Energy consumption and sustainability – The growing energy consumption required for the ever-expanding amount of data being processed and the tendency towards continuously more complex ST models have become evident since the race for the largest models has been going on. Due to the extreme demand on resources, the generic construction of complex AI, NLP and ST systems is typically limited to a few actors. Surging interest in sustainability may cause actors to reconsider the massive increase in energy consumption that currently often accompanies progress in ST. An opportunity (and marketing advantage) may arise from directing efforts towards the creation of high-performance/low energy-consumption ST, exploring the capacities of E2E or novel direct speech-to-speech systems to lower the energy consumption by avoiding a separate, cascading training of sub-systems.

Labour market – A further economic aspect concerns the impact of ST on automation and as a consequence on the job market as a whole. As technologies such as chatbots are being adopted in pursuit of efficiency, they also perform an increasing number of tasks previously reserved for humans. ST and AI thus blur the boundary between humans and technology, leading to shifts in jobs and even entire industries. Clearly, a message of cooperation and support rather than of rivalry and replacement needs to be communicated and acted upon.

Politics and democracy – It has been pointed out that language strongly influences the manner in which we think and argue about political issues. Language causes mental frames to be activated and form our portfolio of ideas. Politicians and influencers have long discovered these mechanisms and are applying them actively to push their respective agendas. Having this central and immediate effect on cognitive mechanisms, linguistic plurality also forms the basis of cognitive plurality and as such plays a fundamental role in securing diverse and democratic values.
Limitation to a few individual languages – such as may happen due to limited digital support for certain languages – impoverishes and reduces this variety, the flexibility and spectrum for expression of thoughts and (political) ideas.

Biases and ethical issues – Several ST systems have been shown to be less accurate for female speakers than for males. This is not because women are underrepresented in the training data but more likely due to the properties of female and male voices. Various ethnic groups may be underrepresented in datasets and, consequently, systems become less accurate for them. It should also be noted here that being in a group for which a system performs worse can be either an advantage or a disadvantage depending on the application and the type of error the system tends to commit more often (false positives or false negatives). Another ethical concern pertaining to ST is due to possible privacy breaches through mass surveillance. TTS systems have reached a quality level and degree of similarity with the voice of humans that could be used to generate deep-fake voices or voices of deceased persons. Despite this scope for misuse, most of the possible applications of high-quality voices are positive, and people with speech disorders, visual impairment and other disabilities could greatly benefit from them. However, deep-fakes could also be employed for illegal activities such as committing fraud or discrediting people. New regulations and the development of ad hoc legislation are critical to mitigating this pernicious effect. Tools able to detect speech deep-fakes need to be produced, and anti-spoofing techniques that discriminate synthesised from natural speech must be developed in close collaboration with teams working in ST.

Users with special needs – While ASR systems achieve great accuracy on standard speech, they perform poorly on disordered speech and other atypical speech patterns.
Personalisation of ASR models, a commonly applied solution to this problem, is usually performed on servers, posing problems related to data privacy and data transfer. While on-device personalisation of ASR has recently shown promising results in a home automation domain for users with disordered speech (Tomanek et al. 2021), more research is required to increase performance for these groups of users and provide support for open conversations. TTS is considered an assistive technology and as such, it may contribute to the integration of individuals with visual impairments or learning disabilities. With robust TTS systems, these people could enjoy the same advantages as any person without a disability. TTS also facilitates equal access to education and supports foreigners who may struggle with the language. ST can contribute to the integration of immigrants by making it easier to learn local languages and can help people with literacy issues and pre-literate children to access content presented in written form. ST may also prove helpful in times of aging populations with degrading eyesight. Integrated into virtual assistants, STs are able to provide support to elderly people, assisting them with reminders of appointments and medication needs, providing access to online information, improving their ability to live by themselves and strengthening their autonomy. Another particular benefit of TTS relates to orally impaired people. Voice is an essential component of our identity that we usually take for granted. However, losing it can affect how others perceive us and our own sense of who we are. TTS technology is able to provide a voice for those who have lost their own via personalisation suiting the characteristics desired by each user.
Privacy and trust – As technologies are entering the homes and offices of users on a broad scale, an enhanced level of attention to privacy concerns, ethics and policy is essential. Policymakers, policy watchdogs, the media and consumers alike need to assume the role of gatekeepers. Trust is viewed as the main currency and key to the adoption and acceptance of technologies. Scandals and opaque behaviour on the part of ST providers may have detrimental effects. Whenever ST is linked to a person’s identity and used for access control or authorisation, the issue of trust becomes especially important. For example, STs are used to authorise access to resources such as a bank account or building. In surveillance applications, ST is used for detecting and identifying criminals. In forensics, SR is used for comparing a voice recording from a crime scene with the voice of a suspect or a victim. For voice assistants, SR can be essential to make sure that certain requests are fulfilled only if made by the owner of the respective device or commodity. All of the above applications rely on high-performance and trusted ST, and can benefit tremendously in commercial terms if applied within these contexts. Many applications of ST store audio in the cloud. It is essential to secure guarantees regarding how data is used or will be used in the future by cloud service providers (the risk of leaking always remains). In the long run, the question will be whether any possible breaches, leaks or scandals involving ST will erode trust to a level that users will no longer volunteer to provide their data. Of course, the distrust will be weighed against the convenience of using certain devices and platforms whose terms of use may simply require the user to do so. Opting out may not always be a realistic option.
Unlawful surveillance – A further area of concern is the extent of unlawful surveillance by governments, state agencies or corporations, infringing citizens’ rights and liberties, adversely affecting public discourse and democratic values, and influencing the political powers (Stahl 2016). The concerns comprise privacy invasion, accountability of intelligence and security services, and the (non-)conformity of mass surveillance activities with fundamental rights (Garrido 2021). Their effects on the social fabric of nations can only be considered and analysed jointly with the rapidly extending technological capacities and the pervasiveness of devices able to capture, process and transmit relevant data. Regardless of the form of government, the growing extent of mass surveillance and especially its unlawful application may lead to the erosion of public trust in governments and state agencies (Westerlund et al. 2021).

3.2 Breakthroughs Needed

In the context of Digital Language Equality (DLE), the main challenges are linked to the inferior support and resources available for less common languages, and a need for improving the performance and capabilities of ST for these languages. The proliferation of ST, including areas with a high potential impact on individuals and large groups of users, also has to be considered in a wider context of policies governing ST and relevant fields and calls for major breakthroughs in terms of explainability for the critical methods and technologies. Policies and governance concerning the use of ST and data – in particular personal data – need to be kept up to date and on par with rapidly developing technologies and applications. In order to democratise STs and to strengthen their position within LT and AI, the base of users should be widened. An increase in educational programmes, including in general AI, ML, NLP, and inter-disciplinary projects, is necessary for the continuous training of experts in these fields able to draw upon expertise in voice technologies but at the same time also in domain-specific fields, thus forming the links between them.
Training paradigms – For approaches requiring large amounts of annotated data, strategies and frameworks for joint (potentially distributed) data collection, improved annotation, and joint provision are needed. This not only concerns the collection but equally the storage and provision of such resources. A lack of commercial interest needs to be offset by public efforts to jump-start and boost work on low-resource languages to limit the threat of digital language extinction. From the perspective of data augmentation, the generation and use of synthetic data may provide a complementary avenue in the creation or extension of datasets. Efficient use of transfer learning and fine-tuning, as well as work on algorithms and methodologies that use less data or provide more robust models with lower amounts of data, present promising alternatives to relieve the lack-of-data challenge. For specific fields of ST, improved use of unlabelled data in an unsupervised or semi-supervised manner (pre-training, self-supervised training) provides further possibilities (Lai et al. 2021). For several technologies, making better use of the hierarchical structure and relatedness of languages may be beneficial. Methods like one-shot learning or few-shot learning likewise provide promising approaches.

Access to and discoverability of training data – The need for large amounts of data severely limits the possibilities for small companies and niche players to compete and be able to develop their own solutions. A plethora of licensing agreements poses further obstacles to accessing datasets and resources. Simplification and harmonisation of these mechanisms would be highly beneficial. In the larger context of open data sharing and bringing digital technology to businesses, citizens and public administrations, these issues connect with the EU’s Digital Europe Programme.

Support for low-resourced languages – To provide first-rate ST in any language, additional high-quality datasets are essential.
Creating a wide set may not be feasible in general, but could be achieved at least for several major European languages. New techniques for transfer learning and model adaptation from systems trained for resource-rich languages to systems able to function in languages with more reduced quantities of available data should enable the development of cutting-edge ST systems also for these languages. New architectures allowing the combination of resources from several languages in such a way that their commonalities are learned in a more robust way (by cross-lingual knowledge-sharing) and methods for the creation of multilingual or language-agnostic models which can be applied to a number of different languages are of utmost importance.

Confluence and context information integration – A tendency towards confluence – the combination of technologies and inclusion of a larger context – can be observed and also be assumed to play a more pronounced role in the future. The increased presence of conversational interfaces, a proliferation of chatbots combining ASR, NLP and TTS with an ever-increasing presence of AI in general, has modified not only the technical and commercial landscape but also the expectations of users, which have been accelerated by increased time spent in home-office setups and virtual meetings. More powerful tools and greater capabilities also prompt the integration of upstream technologies such as summarisation or sentiment analysis with voice technologies. Speech synthesis is bound to become as emotional and persuasive as the human voice itself. Automatic translation may be used to bridge language barriers. Technologies will need to be integrated in a manner that seamlessly allows for feedback loops and adaptation.
Models need to be dynamic and methods allowing for dynamic adaptation – learning and unlearning certain features – will need to be developed to account for flexible and continuously changing conditions. Areas of linguistics such as pragmatics or paralinguistics will need to be considered and integrated to a much higher extent to allow for more natural and human-like interaction. Adding emotions and affect into the recipes for HCI, recognising intent and taking into account a broad variety of contexts hold the potential to turn these interactions into truly human-like experiences. The components related to emotional understanding and empathy are especially relevant for systems functioning in social domains, such as healthcare, education, and customer service.

Explainability, transparency and privacy concerns – Trust in STs and in the use of data obtained by interacting with them may become a decisive factor in the adoption of technologies and success of individual market players. An increased interest in the transparency of data use and system functionality can be observed across the board in many areas of ML and AI. A fundamental question to be answered by providers will be where processing is performed and to what extent and purpose data is used to modify models. One end of the spectrum of processing is large, anonymous data-centres spread around the globe, the other is formed by strictly local processing on personal devices. On-premise solutions provided by companies or institutions form an intermediate setting. In all of these setups, the balance between capabilities and the requirements to achieve these capabilities will need to be determined and balanced against ethical concerns and personal and privacy-preserving arguments. The extent and amount of end-user control will be a crucial factor. Approaches like privacy-by-design accompanied by high ethical and legal standards may be determining factors in enabling trust, fostering adoption and leading to economic success.

Performance, robustness and evaluation paradigms – Driven by various national and international evaluations, standard performance measures have been defined on standard test sets.
Current measures like the standard WER only take certain performance aspects into account and may need to be reconsidered, extended or complemented. Measures of the robustness and generalisability of ST components and models, as well as standard evaluation sets for multiple languages and evaluation sets allowing the parallel evaluation of several technologies (all on the same dataset), should be devised. The topics of ageing and recency of data for evaluation sets need to be taken into consideration. In general, evaluation (as well as training) datasets should be viewed more as work in progress than static artefacts. Extension to further languages and language varieties, dialects and speaking conditions likewise should receive further attention to ensure broad availability and adoption. Another needed innovation is a method for objectively measuring TTS results; such systems are currently assessed by means of subjective evaluations which are time-consuming and laborious.

Outreach – communities, non-experts – Recent years have witnessed an increase in interest in the democratisation of AI. The widespread application of ML and the well-known fact that experts in ML and AI have become scarce resources have led to the desire to empower a wider set of individuals to participate in the creation and use of these technologies. Toolkits and do-it-yourself modelling form part of the trend to democratise voice technologies. Approaches like Auto-ML aim to provide access to ML also for non-experts and align with strategies to allow a wider audience to participate in the process. As LTs are aggregated and applied to more complex settings, inter-disciplinary research and activities, for instance from fields in the social sciences, are becoming more relevant and synergies become apparent. Programmes and funding schemes to actively engage these communities and foster inter-disciplinary research would further boost developments.
Alignments with EU policies and policy breakthroughs needed – Copyright legislation is more restrictive in Europe than in other economic regions and countries. For example, utilising closed captions from TV broadcasts or subtitles from a copyrighted film to train and evaluate ST models could enable access to high-quality language data if lawmakers could agree that training of models on copyrighted data constitutes fair use, as long as it does not diminish the value of the assets or reduce the profits reasonably expected by the owner. The pace of ST development in Europe could be further increased by introducing changes that enable the re-use of existing data, while at the same time ensuring that the value held by copyright owners is not impaired. GDPR introduced a new global standard that places an emphasis on individual rights and reflects European values, and as such contributes to building trust in AI. At the same time, GDPR has had a negative impact on the majority of Europe’s LT business and research activities (Smal et al. 2020). Furthermore, non-European AI firms have been able to operate free of GDPR constraints since then, giving them an economic advantage. One of the required breakthroughs thus relates to ensuring that while individual rights are protected, the extent of this protection – in particular, in practical settings and day-to-day operations – does not go beyond the intended scope. Automatic, efficient and free anonymisation tools are required for all European languages.

3.3 Technology Visions and Development Goals

ST: the interface of the future – In many settings, voice provides the most natural way to interact with devices and appliances. The coming years will witness an increased advance in voice technologies to the point that interacting with automated systems will be virtually indistinguishable from communication with human beings in many cases. Interfaces predominantly relying on typing, clicking and swiping will gradually transform into multimodal, or fully virtual interfaces including voice, shifting the task of adaptation from human users to computer systems.
Compared to the other modalities currently dominating the HCI landscape, communication will encompass richer kinds of (linguistic and paralinguistic) information, including gender, age, emotional or cognitive state, health conditions or speaker-specific traits, allowing for more sophisticated and accurate speaker identification, modelling, adaptation and personalisation. These factors and their integration into HCI – as beneficial and powerful as they may be – also give rise to privacy and ethical concerns. They prompt questions of control, user understanding and intent when it comes to sharing information and the extent to which different kinds of information are transmitted and used in the future. Ensuing risks and the potential impact need to be carefully met and balanced with measures to increase security and trust through technical means as well as policy and legislative measures. Striking this balance will affect the adoption of a wide range of devices and services: from VAs in homes and phones, navigation and control systems in cars to cooperative office and work environments and systems supporting a wide range of business and leisure activities.

User and application contexts – A trend towards the integration of richer context is to be expected, regardless of the sub-field of voice processing. This concerns individual technologies and their combination. For TTS, to have a truly interactive experience when dealing with our devices, the integration of context will play a major role. To give just one example, the correct way to pronounce a message should be inferred from the context or the previous steps of a dialogue. Technologies will need to be sensitive to the user’s character, state, mood and needs and adapt themselves accordingly. Potentially, they will also need to take into account other participants’ states in case of group activities such as business meetings. Topics of pragmatics will be reflected by all technologies. Rather than individual communication turns, complete conversations with history and context will be the norm.
Addressing existing technological gaps – Continued efforts towards better understanding and modelling human speech perception might result in sophisticated ASR addressing several of the limitations and gaps identified in current approaches. Improved handling of audio conditions currently perceived as difficult (e.g., multiple simultaneous speakers in noisy environments speaking spontaneously and highly emotionally in a mix of languages) will be possible thanks to such advances. A wider deployment and further popularisation of ST will require solutions that offer high robustness, low latency, efficient customisation and the ability to provide support that is as equal as possible for a diverse set of speakers.

ST integration – An intimate relation of ASR, SID and TTS with downstream Natural Language Understanding (NLU) technologies is needed to allow the correct interpretation of the input. A combination of technologies to interact in multimodal ways (including visuals) and the efficient combination of inter-linked models will be able to guarantee the best experience possible. The successful combination will result in enhanced ease and naturalness of use, hiding individual components and allowing systems to be perceived as assistants using natural language much in the way that human assistants would.

Multimodal models – Recently introduced NN architectures support encoding and decoding schemes of various modalities, e.g., Perceiver IO (Jaegle et al. 2021). Despite being task-agnostic, the model provides competitive results on modalities such as language, vision, multimodal data, and point clouds. In the near future, this type of architecture is expected to be used in a range of applications where multimodal content needs to be jointly analysed. Furthermore, a future line of work that can easily be envisaged is the training of a single, shared NN encoder on several modalities at the same time, and only using modality-specific pre- and post-processors.
Development pace – The pace of development in voice-based technologies is driven by general advances in ML and associated hardware as well as domain-specific advances in speech perception and production. The former can be expected to accelerate even more due to general interest in ML and AI from a wide portfolio of domains. Advances in transfer learning, reinforcement learning, fine-tuning, the use of pre-trained models and components as well as the arrival of platforms such as Hugging Face have created additional momentum. The extension of GPU capabilities can likewise be expected to continue at a fast pace.

Training and evaluation – Further improvements introduced in the process of creation and distribution of ever-growing, ever more coherent and diverse datasets can be expected. These will include large, multilingual, multi-domain and multimodal datasets, which will become de facto standard sets for training and evaluation. We will witness an increase in labelling efficiency and a wider adoption of continuous learning, self-adaptation and self-modification paradigms. While datasets will continue to grow, the quality and amount of data of high- versus low-resourced languages are unlikely to converge in the short term. The development of more complex and multifaceted datasets calls for more comprehensive evaluation and quality criteria: a shift that would change the focus from an individual technology to an end-user assessment of an experience while conducting a specific task in a non-laboratory environment and within specific operational and personalised contexts.

Infrastructure, hardware – Extrapolating from the current trends, a further rapid increase in the capacities of ST-related hardware and infrastructure can be foreseen (faster communication networks, higher bandwidths). ST solutions will become further popularised in the context of the Internet of Things (IoT), and a new set of voice-enabled devices will be available to users in work, leisure and commercial settings.
These developments create additional challenges related to load and scalability of the underlying infrastructure, hardware and networks. Moving computation to edge devices will also continue to be a trend in the near future.

Privacy, accountability and regulations – The future development of ST and the wider LT field will be strongly influenced by the regulations governing the collection, storage, transmission, and use of personal data. In the context of European AI companies and research institutes, the pace of development appears to be particularly influenced by current regulation schemes. Lawmakers’ decisions will thus have to consider the wide and profound impact of their regulations: on the protection of citizens’ personal data and privacy on the one hand, and on the wider field of AI technologies and the comparative advantages and disadvantages vis-à-vis other geopolitical regions on the other. Extrapolating from current regulations concerning user privacy, and differences in data collection and use, it seems probable that the divide between the EU and non-EU countries will continue to grow. It is unlikely that a consensus or standardisation between competing regions will be found. With the growing presence of ST and AI in general, concerns about hidden flaws, shortcomings and baked-in biases of such systems are gaining momentum. Whereas citizens and academia may work towards enhancing transparency and mechanisms that may be able to avoid certain phenomena, the industry may work towards obfuscation and hindrance of these mechanisms. A sequence of scandals and growing interest in issues of ethics and privacy have led to an increased awareness in society of this issue. Trust in technology is a key ingredient for the adoption of technologies by a large portion of the population. Transparency in how privacy is integrated into technologies is a crucial ingredient to earning trust. Privacy-by-design beyond mere statements may become a decisive factor for technology uptake and market success.
Disclosure of the use of AI/ST – Due to the ever more human-like nature of ST, the use of AI technologies should be disclosed at the earliest stage possible for all transactions and applications. Making users aware of what they interact with can be regarded as a fundamental step in the creation of more transparency. This will not prevent humans from attributing personhood to machines or hinder human-like communication, but it will place an ethical and transparent frame around such settings.

Audits of algorithms and models – Auditors will have to be independent for this to make sense and not open the door to even more secretive and evasive behaviour by companies. Federal agencies or boards may be required to preside over such activities. Standard test sets and tests may have to be created and applied.

Impact assessments of the introduction of such technologies – The concept of measuring impact and potential harm is firmly established in fields such as the environment. Similarly, algorithmic impact assessments need to cover a broad range of factors, with ST and NLP focusing on language- and language use-related aspects.

Public repositories of incidents where AI/NLP caused harm – Public repositories and ways to report problematic uses of AI would allow the identification of repeat offenders and action in case of recurring problems. Furthermore, making such cases known publicly may serve as an incentive to correct or prevent them.

Effects on society, workplace – The discussion about which jobs or areas within domains are likely candidates to be replaced by AI carries over to the domain of speech processing – as well as to NLP in general – as they form a core element of AI. Issues concerning automation and job replacement and the ensuing policy-making and social ramifications thus also directly concern ST and their perception.
Pervasiveness – A further spread and ubiquitous presence of voice-based technologies, and wider deployment of ST across a multitude of services and devices due to a reduction in size and integration into wearable and virtual environments can be expected. This may also concern other people in the vicinity of such deployments who may be involved indirectly by someone else’s use of ST.

Future applications – ST in combination with other NLP and AI technologies will pave the way for intelligent applications with human-like capabilities and the potential for disruptive innovation in various sectors. Intelligent assistants and chatbots currently provide the leading paths towards general and broad adoption. Future applications will be expected to understand a user’s intents over sequences of interactions, completely eliminating perceived boundaries between individual technologies. STs are already being used by multiple industries to increase self-service functionalities, reduce average handling time, increase availability and reduce employee costs.

Personalised Voices – Voices for TTS will be generated for any language and be fully customisable. In the same way as we can now personalise avatars in video games, we will be able to set every aspect of the synthetic voice to suit the characteristics we prefer for each situation. Moreover, TTS technology will extend, and speech will be generated not only from text but also from other input information that could be more convenient for some users who do not have easy access to text or for some situations (e.g., requiring privacy). Multi-modal systems will allow the generation of speech from lip-reading, articulatory data acquired by diverse technologies such as electromyography, permanent magnet articulography and other silent speech interfaces, and even cerebral activity with brain-computer interfaces.
Ambient intelligence – Viewing ST as a means for intelligent interaction, integrating nuanced and fine-grained context and input from multiple modalities can be expected to lead to more human-like systems where the perception of individual components will blur into an overall experience for end-users. Such combinations may be a step towards a broader kind of AI as opposed to the narrow, highly-specialised versions in use today.

3.4 Towards Deep Natural Language Understanding

In many instances, the most natural manner for humans to interact with machines is through voice, for issuing commands or queries as well as generating responses and statements. Certain types of scenarios (e.g., limiting the interaction to small, hand-held devices) may call for voice-only interaction, whereas others (e.g., allowing for feedback via large screens, augmented- or virtual-reality environments) may favour multimedia settings, permitting the flow of information across different modalities in parallel. Other scenarios may ask for communication completely without the use of audio, in particular when considering special needs and inclusive communication.

STs play a role in the ingestion of information, by acting as a kind of sensor conveying linguistic as well as paralinguistic inputs and converting them into structured information. Equally, their use concerns the output of information in auditive form (speech, but also non-speech, e.g., confirmations) to communicate with human users. Both directions of the flow of information apply to HCI as well as H2H interaction in the case of groups of human users interacting with each other or with computers, e.g., during meetings with intelligent assistants for transcription, translation and summarisation. STs thus form an intermediate interface layer between humans and machines. Inbound (auditive) information is captured and enriched by ST before being passed on to downstream NLU processing.
Outbound information is enriched, transformed and eventually realised as audio based on content, structure and metadata provided by semantic components. The semantics and interpretation of utterances as well as the generation of appropriate responses based on a logical representation and state of a conversation fully reside within the scope and components of NLU and technologies such as dialogue managers (to carry out conversations) or knowledge graphs (networks for semantic representations). As such, STs provide essential contributions to the functioning of NLU in the input and output directions but they do not perform any semantic processing (understanding) themselves.

Visual cues such as gestures or manual articulation (sign language) may replace the audio element of ST when operating in noisy environments or involving hearing-impaired or deaf people. Visual processing technologies assume the roles of ST in these cases. The combination of modalities is also possible and may be appropriate or imperative depending on the actual context, such as working environments requiring hands-free operation. The contribution of ST towards achieving deep NLU may thus lie in the improvement and extension of the individual technologies (both from an accuracy and from a language- and domain-coverage perspective), their integration into E2E systems allowing for joint operation and optimisation, including different kinds of knowledge sources and their flexible and dynamic configuration depending on the state and context of an application or user. Approaches including the combination of several modalities for input and output may likewise prove beneficial in the context of achieving deep NLU. In many cases, the real power of NLU will become clear when it is part of a complex system functioning as a human-like counterpart in communication: exhibiting context, history and elements of general intelligence. However, it may also come about that NLU is overshadowed by the cognitive downstream processing and eventually perceived as a mere commodity.
The element of admiration and awe on the part of the user will then concern the complete system performance, with NLU itself disappearing in importance as a small part of a much larger and more complex intelligent system.

4 Summary and Conclusions

The substantial advances made in the field of STs over the past decades hold the potential for disruptive innovation in many areas and application domains. Combined with the progress of related fields, they provide the basis for the broad adoption of speech and voice as the primary modality for interacting with computer systems as part of larger and more complex systems modelling human-like communication and interaction. This chapter outlined several research fields and business domains that provide promising areas for the use of ST and their inclusion into larger solutions yielding more natural means of communication. Several issues and challenges have been identified which need to be resolved to make this promise materialise. Below we summarise the key elements identified and provide recommendations for possible future actions. All these strands of progress can aid in supporting the overarching goal of achieving DLE in Europe by providing services made possible by these technologies to larger multilingual audiences at similar levels of scope and performance.

Training data is still a key factor as long as supervised paradigms prevail. Accessibility is often limited, or even locked, with individual actors amassing massive amounts of data, effectively creating monopolies for certain markets. Licences and regulation as well as interoperability and compatibility of data resources and providers remain obstacles that need to be overcome. Methods not relying on vast amounts of data are an active area of research.

Even though the range of languages supported by ST has increased dramatically over the past decades, English still holds a privileged position.
The creation of resources for further languages and dialects (some may only be spoken) is ongoing; the investigation of phenomena that are only present in other language families is also an active area of research. The creation of multilingual or language-agnostic models provides further avenues for improvement.

A trend towards integrating E2E models into one combined overall model can be observed. Training takes place in a single framework rather than individually, capitalising on joint factors. Considerable progress in performance has been made through this approach, which can be expected to continue. The integration of semantic components such as NLU or knowledge graphs into these frameworks may provide additional elements required for intelligent interaction.

In current applications, different components operate in an independent and isolated manner. The dynamic inclusion and integration of context would allow STs to operate on a significantly higher level of accuracy, eliminating errors and narrowing down alternatives. Various ways for the fusion of information have been investigated but have not effectively come to fruition. Parallel systems for multiparty conversations and multimodal approaches may provide ways forward.

STs primarily address the voice modality for interacting with computers. Combining STs with multimodal inputs and outputs may provide a basis for next-generation HCI. The inclusion of gestures, facial expression, emotions or haptics, and the generation of multimodal outputs reflecting these elements may result in a richer and more natural user experience and lead to wider adoption and acceptance of ST.

Although established measures allow quantification of progress in ST, they may only tell part of the story when it comes to real-world applications and the combination with downstream processing. In many fields of ST, performance has reached (near-)human levels under controlled conditions with progress being significant in theory but often only marginal when translated into reality.
A shift towards increasing robustness and generality of results may prove beneficial at this stage.

Recent progress and an abundance of ST in chatbots may evoke expectations of ST being a mere commodity and raise unrealistic expectations on the part of users. STs perform considerably worse when applied to conditions unlike those for which they were originally created. Accordingly, adaptation and customisation to special domains provide opportunities for specialists. Expectation management and open communication from the ST community about the possibilities but also the limitations may help set expectations to realistic and practical levels.

Fairness and biases of models, and ethical issues relating to their use, have been receiving increased attention. Methods for detecting biases and de-biasing need to be improved and are expected to become a more active area of development. Furthermore, access to ST for people with disabilities and impairments needs to be extended.

Triggered by an increased interest in the fairness of AI systems (e.g., assessments of job applications, prison-parole, credits), applications continue to be subjected to scrutiny. Users demand explanations on the capabilities and functioning of ST. Results are questioned, with some application areas demanding audits of models and algorithms. Technical issues need to be addressed and accompanied on the policy-making and legislative levels. Standardisation of evaluations and publication of results may function as motivating factors for providers to address these issues more thoroughly.
With the current and near-future state of ST, many businesses, political parties and ideological movements may develop conversational agents as ubiquitous representatives to convey their agenda and sway public opinion to get support for their cause. Situations where the agents’ identity is known or hidden should be clearly distinguished. Cases where a company or party is represented by a single conversational agent, or by hundreds or even thousands to create a representation of mass support, should be marked.

Scandals, data leaks and an increase in cyber-crime have brought issues of security and privacy to the fore. Devices are ever more pervasive, taking ST into people’s offices and homes. IoT and wearables further accelerate this trend. Users are becoming increasingly wary of the risks and undesired effects related to the introduction of ST. Clandestine ways of data collection and eavesdropping infringing privacy are rightly exposed and castigated by the media. Actors risk suffering dire consequences if they do not respond and put corrective measures into place. The balance between convenience and privacy will remain a fluid one to be negotiated repeatedly and on multiple levels.

The legislation governing the acquisition, storage, transmission, and use of personal data has a significant impact on the future of ST and the wider LT area. Extrapolating from current trends, the gap between the regulations used in different regions will continue to widen. As AI technologies play a critical role in creating competitive advantages across a wide range of human activities, it is unlikely that competing countries and regions will be able to reach a broad, far-reaching agreement, resulting in one standardised set of regulations.
Lawmakers’ decisions will thus have to consider a wide and profound impact of their regulations, on the protection of citizens’ personal data and privacy on the one hand, and on the pace of development in the broader field of AI technologies on the other: research, development and application and the comparative advantages and disadvantages vis-a-vis other regions and global centres of AI technology development.

As technologies need to be accepted by society in order to be adopted, advancements as described in this chapter are not exclusively technical ones, but need to be accompanied by progress from the humanities. Multi-disciplinary approaches, as demonstrated by the rise of the digital humanities, may prove advantageous also in these scenarios. As systems become natural companions, the fields of psychology, neuroscience and philosophy bring new aspects and visions to the agenda and inspire novel approaches. Fear and anxieties generated by overly aggressive marketing, science-fiction and disinformation need to be met with prudent transparency, adequate management of expectations and accompanying policy measures. An inclusive approach akin to making ST (and AI) visible, transparent and understandable to a larger public – a kind of AI-literacy in the sense of media-literacy – may be a strong supporting topic for all the above-mentioned domains.

People have always tended to humanise machines. Powerful systems formed by the combination and integration of technologies and components described above may effectively be attributed human-like qualities and personhood by their users. Ethical aspects of such interaction must be addressed in parallel with technological progress. Transparency (e.g., chatbots introducing themselves as machines) and openness are among the key factors to be considered when leaving users a freedom of choice rather than imposing technology on them. This certainly reaches far beyond ST and concerns AI in general.
References

Backfried, Gerhard, Marcin Skowron, Eva Navas, Aivars Bērziņš, Joachim Van den Bogaert, Franciska de Jong, Andrea DeMarco, Inma Hernaez, Marek Kováč, Peter Polák, Johan Rohdin, Michael Rosner, Jon Sanchez, Ibon Saratxaga, and Petr Schwarz (2022). Deliverable D2.14 Technology Deep Dive – Speech Technologies. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/speech-deep-dive.pdf.

Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli (2020). “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”. In: NIPS’20: Proc. of the 34th Int. Conf. on Neural Information Processing Systems, pp. 12449–12460.

Bommasani, Rishi et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv: 2108.07258 [cs.LG]. https://arxiv.org/abs/2108.07258.

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei (2020). “Language Models are Few-Shot Learners”. In: Advances in Neural Information Processing Systems 33, pp. 1877–1901.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: NAACL Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. https://aclanthology.org/N19-1423.
Draxler, Christoph, Henk van den Heuvel, Arjan van Hessen, Silvia Calamai, and Louise Corti (2020). “A CLARIN Transcription Portal for Interview Data”. In: Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, pp. 3353–3359. https://aclanthology.org/2020.lrec-1.411.

Garrido, Miguelángel Verde (2021). “Why a Militantly Democratic Lack of Trust in State Surveillance can Enable Better and More Democratic Security”. In: Trust and Transparency in an Age of Surveillance. Routledge, pp. 221–240.

Jaegle, Andrew, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and Joao Carreira (2021). “Perceiver IO: A General Architecture for Structured Inputs & Outputs”. In: arXiv preprint arXiv:2107.14795.

Jelinek, Frederick (1998). Statistical Methods for Speech Recognition. Cambridge: MIT Press.

Lai, Cheng-I Jeff, Yang Zhang, Alexander H Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, and Jim Glass (2021). “PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition”. In: Advances in Neural Information Processing Systems 34, pp. 21256–21272.

Pessanha, Francisca and Almila Akdag Salah (2022). “A Computational Look at Oral History Archives”. In: Journal on Computing and Cultural Heritage 15.1.

Ren, Yi, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu (2021). “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.

Ruiz, Nicholas, Mattia Antonino Di Gangi, Nicola Bertoldi, and Marcello Federico (2019). “Assessing the Tolerance of Neural Machine Translation Systems Against Speech Recognition Errors”. In: CoRR abs/1904.10997. arXiv: 1904.10997. http://arxiv.org/abs/1904.10997.
Smal, Lilli, Andrea Lösch, Josef van Genabith, Maria Giagkou, Thierry Declerck, and Stephan Busemann (2020). “Language Data Sharing in European Public Services – Overcoming Obstacles and Creating Sustainable Data Sharing Infrastructures”. In: Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis. European Language Resources Association, pp. 3443–3448. https://aclanthology.org/2020.lrec-1.422/.

Stahl, Titus (2016). “Indiscriminate Mass Surveillance and the Public Sphere”. In: Ethics and Information Technology 18.1, pp. 33–39.

Tang, Dengke, Junlin Zeng, and Ming Li (2018). “An End-to-End Deep Learning Framework for Speech Emotion Recognition of Atypical Individuals”. In: Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018. ISCA, pp. 162–166. https://doi.org/10.21437/Interspeech.2018-2581.

Tomanek, Katrin, Françoise Beaufays, Julie Cattiau, Angad Chandorkar, and Khe Chai Sim (2021). “On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech”. In: arXiv preprint arXiv:2106.10259.

Valle, Rafael, Jason Li, Ryan Prenger, and Bryan Catanzaro (2020). “Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens”. In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, pp. 6189–6193.

Valle, Rafael, Kevin J. Shih, Ryan Prenger, and Bryan Catanzaro (2021). “Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
Wang, Yuxuan, Daisy Stanton, Yu Zhang, R. J. Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A. Saurous (2018). “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis”. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Ed. by Jennifer G. Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research, pp. 5167–5176. http://proceedings.mlr.press/v80/wang18h.html.

Westerlund, Mika, Diane A Isabelle, and Seppo Leminen (2021). “The Acceptance of Digital Surveillance in an Age of Big Data”. In: Technology Innovation Management Review 11.3.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 42
Deep Dive Text Analytics and Natural Language Understanding

Jose Manuel Gómez-Pérez, Andrés García-Silva, Cristian Berrio, German Rigau, Aitor Soroa, Christian Lieske, Johannes Hoffart, Felix Sasaki, Daniel Dahlmeier, Inguna Skadiņa, Aivars Bērziņš, Andrejs Vasiļjevs, and Teresa Lynn

Abstract In this chapter, we present a comprehensive overview of text analytics and Natural Language Understanding (NLU) from the perspective of digital language equality (DLE) in Europe.
We focus on the research that is currently being undertaken in foundational methods and techniques related to these technologies as well as on the gaps that need to be addressed in order to offer improved text analytics and NLU support across languages. Our analysis includes eight recommendations that address central topics for text analytics and NLU, e.g., the role of language equality for social good, the balance between commercial interests and equal opportunities for society, and incentives to language equality, as well as key technologies like language models and the availability of cross-lingual, cross-modal, and cross-sector datasets and benchmarks.¹

Jose Manuel Gómez-Pérez · Andrés García-Silva · Cristian Berrio
Expert.AI, Spain, jmgomez@expert.ai, agarcia@expert.ai, cberrio@expert.ai

German Rigau · Aitor Soroa
University of the Basque Country, Spain, german.rigau@ehu.eus, a.soroa@ehu.eus

Christian Lieske · Johannes Hoffart · Felix Sasaki · Daniel Dahlmeier
SAP SE, Germany, christian.lieske@sap.com, johannes.hoffart@sap.com, felix.sasaki@sap.com, daniel.dahlmeier@sap.com

Inguna Skadiņa · Aivars Bērziņš · Andrejs Vasiļjevs
Tilde, Latvia, inguna.skadina@tilde.com, aivars.berzins@tilde.com, andrejs.vasiljevs@tilde.com

Teresa Lynn
Dublin City University, ADAPT Centre, Ireland, teresa.lynn@adaptcentre.ie

¹ This chapter is an abridged version of Gomez-Perez et al. (2022).

© The Author(s) 2023. G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_42

1 Introduction

Text analytics tools have been in the market for a long time and have proven useful for extracting meaningful information and insights from documents, web pages and social media feeds, among other text sources. Text analysis processes are designed to gain knowledge and support strategic decision-making that leverages the information contained in the text. Typically, such a process starts by extracting relevant data from text that is later used in analytics engines to derive additional insights.
Nowadays text analysts have a wide range of accurate features available to help recognise and explore patterns when interacting with large document collections.

Text analysis is an interdisciplinary enterprise involving computer science techniques from machine learning, information retrieval, and particularly natural language processing (NLP). NLP is concerned with the interactions between computers and human (natural) languages, and, in particular, with programming computers to fruitfully process large natural language corpora. Challenges in NLP frequently involve natural language understanding (NLU), natural language generation, connecting language and machine perception, dialogue systems, and their combination.

Recent breakthroughs in deep learning have resulted in impressive progress in NLP. Neural language models like BERT and GPT-3 are able to infer linguistic knowledge from large collections of text that can then be transferred to deal effectively with NLP tasks without requiring too much additional effort. Neural language models have had a positive impact on key tasks of text analytics and NLU, such as syntactic and semantic analysis, entity recognition and relation extraction, text classification, sentiment analysis, machine reading comprehension, text generation, conversational AI, summarisation, and translation, among others.

The success of machine and deep learning has caused a noticeable shift from knowledge-based and human-engineered methods to data-driven architectures in text processing. The text analytics industry has embraced this technology and hybrid tools are emerging nowadays, combining or replacing robust rule-based systems that used to be the norm in the market with machine learning methods. Nevertheless, despite all the hype about data-driven approaches to text processing and particularly Transformer-based language models like BERT (Devlin et al.
2019), which might lead non-experts to think that everything is already solved in text analysis and NLU, many gaps still need to be addressed to make state-of-the-art language technologies (LTs) fully operational and benefit all European languages. Especially relevant is the fact that data-driven approaches require very large amounts of data for training.

Language models have lessened the requirement of labelled data to address downstream tasks, but the need for such data has not disappeared. Beyond general purpose datasets, labelled data is scarce, labour-intensive and thus expensive to produce. Access to labelled data is one of the major hurdles in leveraging data-driven approaches in business applications, and is especially problematic for under-resourced languages for which such data does not exist in sufficient quantities, and there is little interest from technology providers to produce it. Moreover, neural language models work as black boxes that are hard to interpret. This lack of transparency makes it difficult to build trust between human users and system decisions. Lack of explanatory capability is a major obstacle to bringing such technology in domains where regulation demands systems which can justify every decision they make. Furthermore, language models pose ethical challenges including gender and racial biases that are learned from biases present in the data the models are trained on, thus perpetuating social stereotypes.

While the progress made in the last years is undeniably impressive, we are still far from having perfect text analytics and NLU tools that provide appropriate coverage for all European languages, particularly for minority and regional languages. Thus, one of the main goals of this chapter is to outline how the European text analytics industry and research community can address the shortcomings by building on the strengths of current text analytics and NLU tools. We call for human-centric text analysis where people’s knowledge, emotions and needs are put at the centre of the design and learning process of the next generation of tools.
Other topics in the research agenda are hybrid approaches combining existing rule-based and data-driven systems, multilingualism in text analytics, multimodal analysis of information, and a new generation of benchmarks.

1.1 Scope of this Deep Dive

To better understand how text analytics and NLU technologies are currently being made available to end users, stakeholders and society, we adopt a multidimensional approach where both a market and research perspective are considered, as well as the key domains and applications related to text analytics and NLU. We look at the current service and tool offerings of the main text analytics and NLU providers in the European market. This analysis also includes recent findings in related research areas, such as NLP/NLU, machine learning, and information retrieval, where language understanding tasks that not long ago were the subject of study in research laboratories are now part of the text analytics market. This is a result of recent breakthroughs in deep learning, structured knowledge graphs and their applications.

Conventional text analytics services available in the market include syntactic analysis, extractive summarisation, key phrase extraction, entity detection and linking, relation extraction, sentiment analysis, extraction of personal identifiable information, language detection, text classification, categorisation, and topic modelling, to name but a few. Also, conversational AI services and tools, including chatbots and virtual agents, are frequently offered under the umbrella of text analytics. More recent additions to the text analytics catalogue are machine reading comprehension services based on tasks such as extractive question answering, which are usually marketed as part of both virtual agents and intelligent search engines to provide exact answers to user questions.

In addition to general-purpose text analytics, we also consider specific domains where such technologies are particularly important.
For example, there is a significant number of specific text analytics tools focused on health, including functionalities such as extraction of medical entities, clinical attributes, and relations, as well as entity linking against medical vocabularies. Other use-cases for text analytics tools include customer and employee experience, brand management, recruiting, or contract analysis. An exhaustive account of each sector and use-case, and their relevance for text analytics, is outside the scope of this chapter.

Text analytics tools and services are available for widely spoken languages or otherwise strategic languages where the market is big enough for companies to make a profit. Unfortunately, other languages may be less attractive from a business point of view and consequently they are not equally covered by the current text analytics tools. This chapter addresses language coverage as another key dimension for the analysis of text analytics and NLU tools when considering DLE.

We include recent research breakthroughs associated with the text analytics services mentioned above. Many applications of text analytics can be effectively solved using classical machine learning algorithms, like support vector machines, logistic regression or conditional random fields, as well as rule-based systems, especially when there is little or no training data available. However, more sophisticated approaches are needed as we transition towards scenarios involving a deeper understanding of text in order to solve increasingly complex tasks like abstractive summarisation, reading comprehension, recognising textual entailment, or stance detection. Therefore, this chapter puts a special emphasis on deep learning architectures, like Transformer language models, and their extensions.
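As an illustration of the rule-based end of this spectrum, the following minimal Python sketch extracts two kinds of identifiers from text using hand-written patterns, without any training data. The patterns, the labels `EMAIL` and `URL`, and the function name `extract_entities` are assumptions made for this example only; they do not correspond to any provider's actual rule set, and production rules for personal data are considerably more elaborate.

```python
import re

# Hypothetical, deliberately simple rules for illustration only;
# real rule sets for personal data are far more thorough.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s,;]+"),
}


def extract_entities(text):
    """Return (label, matched_text, start, end) tuples in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = "Contact jane.doe@example.org or see https://example.org/info for details."
for label, value, start, end in extract_entities(sample):
    print(label, value, start, end)
```

Such rules are transparent and need no labelled data, which is why they remain attractive for low-resource settings, but each new language or entity type requires additional manual engineering.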
Of particular interest for language equality are different means to deal with data scarcity for low-resource languages. Self-supervised, weakly supervised, semi-supervised, or distantly supervised algorithms reduce the overall dependence on labelled data, but even with such approaches, there is a need for both sufficient labelled data to evaluate system performance and typically much larger collections of unlabelled data to support data-hungry machine learning techniques. Also in this direction, we include a discussion on hybrid approaches where knowledge graphs and deep learning are used jointly in an effort to produce more robust, generalisable, and explainable tools. Another important area of research that we cover deals with leveraging other modalities of information in addition to text.

All such aspects are considered from the perspective of their combined impact on society. We provide recommendations to address the current limitations of text analytics and NLU technologies in the interest of promoting DLE in Europe.

1.2 Main Components

The goal of text analytics is to discover novel and interesting information in documents and text collections that can be, among others, useful for further analysis or strategic decision-making. Text analytics tools support a wide range of functionalities to process, leverage and curate texts. Most of these functionalities can be broadly categorised into syntactic analysis, information extraction (e.g., key phrases, entities, relations, and personal identifiable information), text classification, sentiment and emotion analysis, and conversational AI functionalities. Recently, question answering, a functionality that requires machine-reading comprehension, has made the transition from research labs to production systems.

The challenges involved in NLP and NLU have different levels of complexity, and as a result, the solution to each of the many challenges is at a different level of progress.
For example, natural language generation is one such challenge, where recent advances like GPT-3 are heralded as a key enabler for a new generation of language applications.² Therefore, in addition to functionalities that are already available in the market, there are others which the research community is currently working on. Some advanced functionalities involve reasoning, such as multi-hop question answering where systems need to gather information from various parts of the text to answer a question, and textual entailment, where the goal is to determine whether a hypothesis is true, false, or undetermined given a premise. Moreover, with the advent of generative models like GPT-3, new opportunities have arisen to address hard problems involving text generation, e.g., abstractive text summarisation, where the system generates a summary of a text rather than extracting relevant excerpts, or data to text generation, where the goal is to generate text descriptions from data contained in tables or JSON documents.

Recently, commercial text analytics providers have started supporting the customisation of functionalities, e.g., users can define classes, entity and relation types, or sentiment scores. This is possible thanks to supervised machine learning making use of user-generated examples. The user only provides examples while the text analytics tool handles all the complexity of the machine learning process. Thus, end users do not need a background in ML to customise their own services. However, some basic knowledge is required to understand how the trained models are evaluated and how to generate a balanced set of examples.
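The customisation workflow described above, in which a user supplies labelled examples and the tool trains a model from them, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's actual implementation: it uses a small multinomial naive Bayes classifier with add-one smoothing, and the example texts, the class names `billing` and `support`, and the function names `train` and `predict` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split on whitespace (a simplification)."""
    return text.lower().split()


def train(examples):
    """Build a naive Bayes model from user-supplied (text, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical user-provided examples, two per class to keep the set balanced.
examples = [
    ("invoice overdue payment reminder", "billing"),
    ("refund for duplicate payment", "billing"),
    ("cannot log in to my account", "support"),
    ("app crashes when I log in", "support"),
]
model = train(examples)
print(Counter(label for _, label in examples))  # quick check of class balance
print(predict(model, "payment refund"))
```

Counting the examples per class, as in the second-to-last line, is the simplest way to check the balance of the example set; evaluating the trained model would additionally require labelled examples held out from training.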
The most common customisable text analytics services are classification and entity extraction, but providers typically offer support for sentiment analysis and relation extraction, too. To customise a text classifier, users need to provide examples of text labeled with classes; for entity extraction the text is labeled with entity types, for relation extraction relations between entities are indicated, and for sentiment analysis documents are labeled with a sentiment score.

To study the language support of existing text analytics technologies and NLU tools, we look in two main directions: 1. the catalogue of services of global technology providers, which provides us with a notion of what is being currently made available and marketed to the public; and 2. European initiatives that offer repositories of language resources and tools (LRTs), like the European Language Grid (ELG, Rehm 2023). At the time of writing, the ELG catalogue holds more than 11,500 metadata records (Labropoulou et al. 2020), including both data and tools/services, covering almost all European languages.³ The ELG platform was populated with more than 6,000 additional language resources identified by language informants in the ELE consortium and harvests major EU LRT repositories such as CLARIN⁴ and ELRC-SHARE.⁵ The observations and figures included in this chapter have been extracted from ELG, which aims at concentrating all available resources, tools and services and making them available in a single platform. Our goal with this chapter is not to provide an exhaustive account, for which such figures could be complemented with additional information from other European infrastructures like the ones mentioned above, but rather to provide an up-to-date indication of the support that each European (and non-European) language enjoys.

2 https://openai.com/blog/gpt-3-apps/
3 https://www.european-language-grid.eu
4 https://www.clarin.eu
5 https://elrc-share.eu
For commercial text analytics services, we draw on reports from key players in market intelligence such as the Gartner Magic Quadrant for Insight Engines⁶ and the Forrester Wave: AI-Based Text Analytics Platforms 2020.⁷ A mandatory requirement for providers to be included in this study is for service documentation to be publicly available. We study services and languages supported by Azure Text Analytics, IBM Watson, Expert.ai and SAS Visual Text Analytics. In addition, we include other recognised providers, like Amazon Comprehend and Google Natural Language API. To simplify the analysis of the language support we use the following groups:

• A – Official EU Languages (24): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish
• B – Other European languages; languages from EU candidate countries and Free Trade Partners (11): Albanian, Basque, Catalan, Galician, Icelandic, Norwegian, Scottish Gaelic, Welsh, Serbian, Turkish, Ukrainian
• C – Languages spoken by immigrants in Europe; languages of important trade and political partners (18): Afrikaans, Arabic, Berber, Cebuano, Chinese, Hebrew, Hindi/Urdu, Indonesian, Japanese, Korean, Kurdish, Latin, Malay, Pashto, Persian (Farsi), Russian, Tamil, Vietnamese

A summary of our findings follows. A small set of services offered by global text analytics providers, including entity extraction, key phrase extraction, and syntactic analysis, covers more than 80% of the official EU languages in category A. Nevertheless, the support of the languages in category A provided by the rest of the services is poorer, ranging from 20% to 45%. The situation of other European languages in category B is actually the worst: the language support of the functional services is scarce or non-existent. Languages in category C also have low coverage across all functional services.
In contrast, custom entity extraction has almost perfect support of the languages across all categories. However, custom classification, custom sentiment analysis, and custom relation extraction have a language coverage similar to off-the-shelf text analytics services, covering less than half of the languages in categories A and C, and barely any language at all in category B.

According to the ELG catalogue, syntactic analysis services (language identification, tokenization, etc.) are available for nearly all languages in category A. However, the language support of such services drops to 63% of languages in category B, and 72% in category C. Named entity recognition has moderate support across all language categories, reaching 66% for category A, 54% for category B and 61% for category C. From there, language support for text analytics services such as keyword extraction, sentiment analysis, summarisation, and entity linking is poor or non-existent in every language category.

6 https://www.gartner.com/en/documents/3999454
7 https://www.forrester.com/report/The-Forrester-Wave-AIBased-Text-Analytics-Platforms-Document-Focused-Q2-2020/RES159887

Our analysis shows that official EU languages are covered by a subset of text analytics services including syntactic analysis, key phrase extraction, and entity extraction. However, only a small fraction of category A languages are supported by the remaining services. For other European languages in category B, global players offer scarce support or none at all, and for languages in category C support is also low. In ELG the picture changes a little for category B languages since the number of supported languages increases for some of the functional services. However, overall support of languages in categories B and C is still low, which suggests that global players plan their offerings based on the volume of the potential market for each language.
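The coverage percentages reported in this section are the share of languages in a group for which a given service is available. As a minimal sketch of that arithmetic, the following Python snippet uses the three language groups defined above. The set of supported languages is invented purely for illustration and does not describe any real provider or the ELG catalogue; the names `GROUPS`, `coverage` and `example_supported` are hypothetical.

```python
# Language groups A, B and C as defined in this section.
GROUPS = {
    "A": ["Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
          "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
          "Irish", "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
          "Portuguese", "Romanian", "Slovak", "Slovenian", "Spanish", "Swedish"],
    "B": ["Albanian", "Basque", "Catalan", "Galician", "Icelandic", "Norwegian",
          "Scottish Gaelic", "Welsh", "Serbian", "Turkish", "Ukrainian"],
    "C": ["Afrikaans", "Arabic", "Berber", "Cebuano", "Chinese", "Hebrew",
          "Hindi/Urdu", "Indonesian", "Japanese", "Korean", "Kurdish", "Latin",
          "Malay", "Pashto", "Persian (Farsi)", "Russian", "Tamil", "Vietnamese"],
}


def coverage(supported, group):
    """Return the percentage of languages in `group` that are in `supported`."""
    languages = GROUPS[group]
    hits = sum(1 for language in languages if language in supported)
    return 100.0 * hits / len(languages)


# ASSUMPTION: made-up support list for an imaginary service, not real data.
example_supported = {
    "English", "French", "German", "Italian", "Spanish", "Portuguese", "Dutch",
    "Turkish", "Arabic", "Chinese", "Japanese", "Korean",
}

for name in GROUPS:
    print(f"Category {name}: {coverage(example_supported, name):.1f}%")
```

Computing the share per group rather than over all 53 languages keeps the large category A from masking gaps in the smaller categories B and C.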
2 State-of-the-Art and Main Gaps

2.1 State-of-the-Art

LRTs have increased and improved since the end of the 1990s, a process further catalysed by the advent of deep learning and neural networks and lately with large pre-trained language models. Today, NLP practitioners find themselves in the midst of a paradigm shift. This revolution has brought noteworthy advances to the field. However, this transformative technology poses problems from a research advancement, environmental, and ethical perspective. Furthermore, it has also laid bare the acute digital inequality that exists between languages. Many sophisticated NLP systems are unintentionally exacerbating this imbalance due to their reliance on vast quantities of data derived mostly from English-language sources. Other languages lag far behind in terms of digital presence. Moreover, the striking asymmetry between official and non-official European languages with respect to available digital resources is worrisome.

Unfortunately, European DLE is failing to keep pace with these rapidly evolving changes. Neural language models and related techniques are key to NLP progress and so being able to build them for target languages with the same quality as English is key if language equality is to be achieved. Now is the moment to seek balance between European languages in the digital realm. There are ample reasons for optimism. Although there is more work that can and must be done, Europe’s leading LRT repositories, platforms, libraries, models and benchmarks have begun to make inroads. Interestingly, the application of zero-shot to few-shot transfer learning with multilingual pre-trained language models and self-supervised systems opens up the way to leverage NLP for less developed languages.

We are moving from a methodology in which a pipeline of multiple modules was the typical way to implement NLP solutions, to architectures based on complex neural networks trained with vast amounts of data. This rapid progress in NLP has been possible because of different factors: 1. mature deep learning technology; 2.
large amounts of data (including multilingual text data); 3. increase in HPC (GPUs); 4. application of simple but effective self-learning and transfer learning approaches using Transformers. The NLP community is currently engaged in a paradigm shift with the production and exploitation of large, pre-trained Transformer-based language models (Han et al. 2021; Min et al. 2021).

2.2 Main Gaps

We focus on six main areas related to text analytics and NLU that have an impact on digital language equality: data, legal aspects, NLU limitations, benchmarking, conformance, and domain experts’ tooling.

Data – The availability of suitable data for training and evaluating NLP tools is crucial. Unfortunately, current language data for text analytics suffers from several shortcomings. Labelling data can be a lengthy operation that requires skilled domain expertise, which is costly and hard to find. Data and language coverage is a concerning issue as the majority of datasets that are relevant to Europe are general-purpose datasets based on major languages such as English, German, Spanish and French. However, under the EU Digital Europe Programme, new common Data Spaces, including a Language Data Space, will be created. Quality is also important: content should be reliable (misinformation-free), balanced (no bias) and clean (non-toxic/hate-speech). Machine learning models are notoriously sensitive to bias and noise within datasets. Thus, there is a clear need for reliable bias and toxicity detection tools.

Legal aspects – Since text can often include personal data, data protection and privacy (DPP) policies can put limits on the type of data that can be made available for text analytics. GDPR, the EU’s General Data Protection Regulation, while important for EU citizens’ protection, significantly hampers language data sourcing and reuse for machine learning-based tools in Europe.
The principles of DPP and legal provisions such as GDPR stipulate that data should only be used for a priori defined narrow purposes and that these purposes must be made transparent to the data subject upfront. This proves problematic when dealing with induced models or datasets from web sources that have been reused without website owners’ or individuals’ consent. European-based researchers and LT developers cannot, therefore, use, share, modify or build upon many of these datasets, which sets DPP-compliant players in this field at a competitive disadvantage.

NLU limitations – Most of today’s text analytics solutions are language-specific. Challenges arise in many contexts (business, personal, governmental), where the multilingual requirements of customers and users from across Europe and around the globe need to be served. As we have seen, data availability is already a general problem, but when it comes to lesser-spoken languages with lower amounts of digital content, such scarcity is compounded. Similarly, key pieces of contextual information such as the author, intended audience, societal factors and the purpose of communication also need to be considered. As such, there is much scope for improving contextualised and personalised analytics. One growing area of research is multimodal NLP, which aims to capture these contextual features to make better judgements or predictions. One priority for many businesses and organisations is to build trust and confidence in AI models. As a result, there has been a notable increase in attention given to the area of explainable AI. In cases where decisions are made based on AI model prediction, it is important that businesses can assess these models’ level of accuracy, fairness and transparency. Finally, further exploration is required into extensibility methods to include domain-specific knowledge (e.g., when large corpora are not available), allowing LT providers to easily build custom extensions for machine learning-based systems.
Benchmarking – In language technology (and NLU in particular), a wide range of benchmarking frameworks exists depending on the task at hand. Evaluation metrics also vary depending on the task, ranging from reporting on precision, recall and F1 scores for classification tasks, to exact matching or, say, SacreBLEU⁸ scores for dialogue systems. Current NLU benchmarks include widely adopted ones like GLUE and SuperGLUE.⁹ In terms of the nature of datasets used in benchmarking, realistic data is lacking. Therefore, the increasing trend for creating (often general purpose) synthetic data proves to be problematic. Some evaluation datasets are also often criticised in academic shared tasks, where they are sometimes referred to as ‘toy’ examples that are not applicable to real-world problems. There is a clear need for an increase in diversity, relevance and suitability of annotated test data.

Conformance – A dimension related to standards concerns conformance, namely “the fulfillment of specified requirements by a product, process, or service.”¹⁰ While such requirements are not so crucial for academic research, they are highly relevant to enterprise language technology development as they assure quality standards for consumers. Accordingly, requirement statements are needed for any text analytics artefact. For entity detection, this requirement statement could, for example, mention that a conformant application must be able to detect any of the entity types of the Common Locale Data Repository¹¹ in Spanish and Portuguese.¹² In particular, in the context of regulated industries, certification may need to be considered.

Domain experts tooling – Today, most work in LT based on ML requires expert-level skills in tools related to data management, data science and NLP. This creates bottlenecks since it does not allow domain experts (e.g., experts in finance) to become actively involved without extensive tool training or understanding of the underlying technology.
This setup causes overhead and delays since work between tool experts and domain experts needs to be coordinated. What is lacking as a way to address this is the availability of consumer-grade, highly usable, low code or no code tools for domain experts. Ideally, such tools should be developed in collaboration with usability specialists, to allow domain experts to play a more active role in the development of solutions for application scenarios they are familiar with.

8 https://huggingface.co/metrics/sacrebleu
9 https://gluebenchmark.com, https://super.gluebenchmark.com
10 https://www.w3.org/TR/qaframe-spec/#specifying-conformance
11 https://cldr.unicode.org/index/downloads
12 https://www.w3.org/TR/its20/#conformance and http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html#Conformance for sample conformance clauses.

3 The Future of the Area

3.1 Contribution to Digital Language Equality

Today, text analytics tools can help societies and individuals in various ways by supporting tasks that involve the discovery of information (facts, rules, relationships) in text. There are widely-used and indispensable applications available to businesses, consumers, citizens and governments that cover a wide range of usage scenarios, starting from recommendation and sentiment analysis tools to intelligent virtual assistants, business intelligence tools, predictive analytics, fraud management, risk management, and cybercrime prevention. Text analytics tools are also widely used in online and social media data analysis of use to both businesses and governments.

Currently, however, all of these advances and digital innovations are really only supporting major well-resourced languages (i.e., English, French, German, Spanish). Adapting these technologies to support other languages across Europe is not a trivial task of simply localising software or connecting existing technology to local databases or information sources.
Languages differ significantly in many ways, not just in words but also in inflectional nature (e.g., plural forms of nouns or tenses of verbs), sentence structure (word order), idiomatic uses, semantic variability, and so on. To that end, applications need to be built upon systems that understand the underlying patterns in each language that requires support. As today’s NLP techniques are increasingly data-driven, this means that sufficient amounts of data need to be made available in order to adapt technologies to these languages. However, even here, it may not be as simple as plugging in new datasets to existing technologies; because languages and domains can differ so significantly, various types of parameter tuning, system adaptation or hybrid implementation may also be required to achieve robust and reliable technologies in new languages and scenarios.

Text analytics and NLU can play a major role in overcoming current language and technology barriers that prevent the flow and accessibility of information and knowledge across Europe. From an economic perspective, this language barrier has an impact on the Digital Single Market (European Parliament 2018). Europe’s Single Market seeks to guarantee the free movement of goods, capital, services, and people. The role of technology in this is key as countries seek to ensure continued access to this single market, including product information, national and local policies, education information, trade information, financial information, and so on. Such information needs to be accessible to all EU citizens. Text analytics tools (together with machine translation solutions and other cross- and multi-lingual solutions) are key for accessing this information and knowledge across Europe.

The META-NET White Paper Series (Rehm and Uszkoreit 2012) reported on an analysis of LRTs available for EU languages. The results showed that with respect to text analytics, good support only applied to English, and moderate support to five widely spoken languages: Dutch, French, German, Italian and Spanish.
This meant that the other 24 (out of 30) European languages in this study were clustered under fragmented as well as weak or no support. Today, all 24 official EU languages benefit from basic tools: tokenizers, lemmatizers, morphological analysers, part-of-speech tagging tools, and syntactic parsers. While the quality, reliability or robustness of these tools vary across languages, their existence represents a step in the right direction. In contrast, more sophisticated tools and services (e.g., summarisation tools) are available only for a small number of languages.

Some of the main reasons that prevent sophisticated text analytics techniques from being available for many EU languages (Rehm et al. 2020) are lack of data and data sparsity (especially for morphologically rich languages) for training and testing text analytics technologies, and the complexity of technology adaptation in low-resource settings. For instance, in the case of dialogue systems and chatbots, analysis of available datasets for dialogue modelling clearly demonstrates a gap for less-resourced languages (Serban et al. 2018; Leonova 2020).

Gartner (2021) forecasts worldwide AI software revenue to total $62.5 billion in 2022, an increase of 21.3% from 2021. Intelligent, AI-based, virtual assistants are already in demand in the digital market and their use in the workplace is growing. Gartner (2020) predicts that by 2025, 50% of knowledge workers will use a virtual assistant on a daily basis, up from 2% in 2019. For the public sector and businesses, this provides an opportunity to use intelligent virtual assistant technology to take care of more repetitive and auxiliary business processes. Gartner (2019) predicts that decision support/augmentation will be the largest area of AI by 2030, accounting for 44% of business value, with agents representing 24%.
For countries with lesser-spoken languages, these predictions only hold if technology exists to support them, of course. If not, an economic divide will emerge, as countries with sufficient language technologies will gain (further) advantage.

3.2 Breakthroughs Needed

Various global enterprises from the US and Asia have started deploying large pre-trained neural language models in production. However, despite their impressive capabilities, large language models raise severe concerns. Currently, we have no clear understanding of how they work, when they fail, and which emergent properties they present. As argued by Bender et al. (2021), it is important to understand the limitations of language models, which they call “stochastic parrots”, and put their success in perspective. There are also worrying shortcomings in the text corpora used to train these Anglo-centric models, ranging from a lack of representation of low-resource languages, to harmful stereotypes, and to the inclusion of personal information. Moreover, these models are costly to train and develop, both financially and environmentally. This also means that only a limited number of organisations with abundant resources in terms of funding, computing capabilities, NLP experts and corpora can currently afford to develop them (Ahmed and Wahed 2020).

To tackle these questions, much more critical interdisciplinary collaboration and research are needed. In Europe there is a lack of necessary resources (experts, data, computing facilities, etc.) compared to large US and Chinese IT enterprises that lead the development of these new systems. In particular, the computing divide between large firms and non-elite universities increases concerns around bias and fairness within this technology breakthrough, and presents an obstacle towards democratising NLP. In fact, in the EU there is an uneven distribution of resources (funding, open data, language resources, scientists, experts, computing facilities, IT companies, etc.) by country, region and language.
We note with concern a tendency to focus on state-of-the-art results exclusively with the help of leaderboards, without encouraging a deeper understanding of the mechanisms by which they are achieved. We believe that such short-term goals can generate misleading conclusions and direct resources away from important efforts that facilitate long-term progress towards efficient, accurate, explainable, ethical and unbiased multilingual language understanding. Progress in these fields will help achieve DLE in Europe in all aspects of society, from government to businesses to the citizens themselves. Next, we focus on some of these key technical areas.

Recent work has shown that pre-trained language models can robustly perform NLP tasks in a few-shot or even in zero-shot fashion when given an adequate task description in their natural language prompt (Brown et al. 2020; Ding et al. 2022). Prompting is a technique that involves adding a piece of text (prompt) to the input examples to “encourage” a language model to bring to the surface the implicit knowledge the user is interested in, i.e., guiding the language model to perform the task at hand. Surprisingly, fine-tuning pre-trained language models on a collection of tasks described via instructions (or prompts) substantially boosts zero-shot performance on unseen tasks (Wei et al. 2021; Sanh et al. 2022; Tafjord and Clark 2021). The application of zero-shot to few-shot transfer learning with multilingual pre-trained language models, prompt learning, and self-supervised systems opens up opportunities for less developed languages in NLP.

Integrating common sense knowledge and reasoning in NLP systems has traditionally been seen as a nearly impossible goal. Now, research interest has sharply increased with the emergence of new benchmarks and language models (Mostafazadeh et al. 2016; Talmor et al. 2019; Sakaguchi et al. 2021; Ma et al. 2021; Lourie et al.
2021). This renewed interest in common sense is encouraged by both the great empirical strengths and limitations of large-scale pre-trained neural language models. This motivates new, relatively under-explored research avenues in common sense knowledge and reasoning. Combining large language models with symbolic approaches (knowledge bases, knowledge graphs), which are often used in large enterprises because they can be easily edited by human experts, is a non-trivial challenge. It is worth investigating ways to leverage structured and unstructured information sources and to enhance contextual representations with structured, human-curated knowledge (Peters et al. 2019; Colon-Hernandez et al. 2021; Lu et al. 2021). Despite perhaps overly optimistic claims of human parity in many tasks, Natural Language Understanding is still an open research problem far from being solved since all current approaches have severe limitations. Language is grounded in our physical world, as well as in our societal and cultural context. Knowledge about it is required to properly understand natural language (Bender and Koller 2020).

While NLP systems based on deep learning obtain remarkable results on many tasks, the output provided by NLP models, particularly those models that generate text, is still far from perfect. For example, the textual snippets generated by advanced language models such as GPT and successors are formed by syntactically correct sentences that seem to talk about a particular topic. However, there is often a lack of coherence among them, and humans still need to monitor and adapt the output of such systems. There is a growing body of research on human-in-the-loop NLP frameworks, where model developers continuously integrate human feedback into the model deployment workflow. These feedback loops cultivate a human-AI partnership that enhances model accuracy and robustness and builds users’ trust in NLP systems (Z. J. Wang et al. 2021).
In the foreseeable future we expect more such interactions, as AI and NLP become embedded in everyday work processes.

While the NLP community is fully committed to the open-source culture, the aspect of reproducibility has been less of a concern, although the topic is becoming a central one in NLP. Nowadays the majority of scientific articles are accompanied by the source code and data required to reproduce the experiments. Leaderboards such as NLP-progress,¹³ Allen Institute of AI leaderboard,¹⁴ Papers with Code,¹⁵ or Kaggle¹⁶ encourage participation and facilitate evaluation across many different tasks and datasets. As a result, the NLP community has considerably increased access to publicly available and easily accessible models and datasets. This sharing-focused culture fosters opportunities for the community to inspect the work of others, iterate, advance upon, and broaden access to the technology, which will in turn strengthen the collective skill sets and knowledge. Open-source libraries such as Transformers¹⁷ may open up these advances to a wider LT community.
This library consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of models (Wolf et al. 2020a). Following up on the success of the Hugging Face platform (Wolf et al. 2020b), the BigScience project took inspiration from scientific creation schemes such as CERN and the LHC, in which open scientific collaborations facilitate the creation of large-scale artefacts that are useful for the entire research community.¹⁸

3.3 Technology Visions and Development Goals

In this section, we provide an overview of the main technological visions for NLP and NLU, which will contribute to achieving DLE in Europe by 2030. We have identified developments for increasing the language support of such technologies, putting users’ needs at the centre of any breakthroughs involving language technologies, the integration with other modalities of information in addition to text, the hybridisation of symbolic AI and neural systems, and the need for a new benchmarking approach.

13 http://nlpprogress.com
14 https://leaderboard.allenai.org
15 https://paperswithcode.com/area/natural-language-processing
16 https://www.kaggle.com/datasets?tags=13204-NLP
17 https://huggingface.co
18 https://bigscience.huggingface.co

Language support beyond widely spoken languages, including minority and under-resourced languages, is still a pending issue in text analytics and NLU. The investment of LT providers in such languages is most probably inhibited by their lower profitability compared to mainstream languages, considering the number of potential users. Nevertheless, the current trend in LT relying on neural language models and research on unsupervised and zero-shot learning opens up new possibilities to increase the coverage of minority and under-resourced languages in the text analytics industry. Language models have shown promising results in zero-shot settings in a wide range of tasks (Radford et al. 2019; Brown et al. 2020; Gao et al. 2021).
This is primarily due to the fact that language models learn to perform tasks from patterns occurring in text, eliminating or reducing to a great extent the need for additional labeled data which is a scarce resource for many languages.

Despite their dominance in current NLP pipelines, language models have mainly been addressed as a one-size-fits-all approach, offering almost no customisation beyond the data used to fine-tune (Devlin et al. 2019) or prompt (Brown et al. 2020) models for downstream tasks. Current research focused on unsupervised and zero-shot learning (Gao et al. 2021) delves into this issue since users have little to say in the learning process. Moreover, the data-driven approach and race for accuracy have yielded opaque tools that are hard to interpret, and biased tools that perpetuate social stereotypes related to gender, race and ethnicity in text collections. The lack of transparency makes it difficult to build trust between users and system predictions, having negative consequences for technology adoption. Biased tools have a direct impact on society, especially for marginalised populations (Sheng et al. 2021).

We advocate for a next generation of language tools that care about end user needs and expectations, making them part of the design and learning process. These tools will be human-aware, encompass human emotions, and be trustworthy, avoid bias, offer explanations, and respect user privacy. Moreover, human intelligence will be used together with machine learning techniques to produce better LRTs. Human feedback will be a guide in the learning process, informing the machine as to what users want or do not want. Reinforcement learning from human feedback is a promising research avenue (Stiennon et al. 2020; Li et al. 2016) to use human intelligence to improve NLP tools. Also, interactivity with domain experts and users (e.g., Shapira et al. 2021) is a key area for further advances beyond the usual supervised paradigm.
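To make the zero-shot and few-shot prompting mentioned in this section more concrete, the sketch below assembles a plain-text prompt from a task description, optional labeled demonstrations, and a query. It is only an illustration under stated assumptions: the `Text:`/`Label:` template and the function name `build_prompt` are hypothetical, the example sentences are invented, and no language model is actually called.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt: task description, optional demonstrations, then the query.

    An empty `demonstrations` list yields a zero-shot prompt; a handful of
    (text, label) pairs yields a few-shot prompt.
    """
    lines = [task_description.strip(), ""]
    for text, label in demonstrations:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    # The prompt ends with an open "Label:" slot for the model to complete.
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


task = "Classify the sentiment of the text as positive or negative."

# Zero-shot: only the task description and the query.
zero_shot = build_prompt(task, [], "Der Service war ausgezeichnet.")

# Few-shot: two invented demonstrations precede the query.
few_shot = build_prompt(
    task,
    [("The room was clean and quiet.", "positive"),
     ("La comida estaba fría.", "negative")],
    "Der Service war ausgezeichnet.",
)

print(few_shot)
```

Because the demonstrations are ordinary text, the same template can mix languages, which is one reason this style of prompting is attractive where labeled data for a given language is scarce.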
As practitioners come to realise the inevitable limitations of purely end-to-end deep learning approaches, which increase in the case of under-represented languages (both in terms of available language models and suitable training corpora), the transition to hybrid approaches involving different ways of combining neural and symbolic approaches becomes an alternative that appears more and more tangible. Therefore, it is important that we exhaustively discuss the components necessary to build such systems, how they need to interact, and how we should evaluate the resulting systems using appropriate benchmarks. The field of neurosymbolic approaches will be increasingly important in order to ensure the integration of existing knowledge bases within our models, as already shown by approaches like KnowBert (Peters et al. 2019) and K-Adapter (R. Wang et al. 2021), not only to make NLU models aware of the entities contained in a knowledge base and the relations between them from a general point of view, as provided by resources like Wikipedia or Wikidata, but also when it comes to quickly incorporating existing resources from vertical domains and custom organisations into our models in a fast, scalable way. Some, e.g., Sheth et al. (2017) and Shoham (2015), argue that knowledge graphs can enhance both expressivity and reasoning power in machine learning architectures. Others (Gómez-Pérez et al. 2020) propose a working methodology¹⁹ for solving NLP problems that naturally integrate symbolic approaches based on structured knowledge with neural approaches. These are the first practical steps in this direction. Many more are needed, particularly in a multilingual and language equality scenario.
Different modalities can be combined to provide complementary information that may be redundant but can help to convey information more effectively (Palanque and Paterno 2000). For example, multimodal analysis has allowed machines for the first time ever to pass a test from middle school science curricula in which answering the questions required the model to understand both language and diagrams (Gomez-Perez and Ortega 2020). This convergence across modalities requires synergies from AI research fields that until now have been conducted individually such as NLP, automatic speech recognition and computer vision. Deep learning techniques will play an important role in multimodal analysis. Recently, Transformer architectures (Devlin et al. 2019), initially proposed for NLP, have been used for image processing (Dosovitskiy et al. 2021) and cross-modal information processing including images and text (Hu and Singh 2021). Other approaches based on contrastive language-image pre-training, like CLIP (Radford et al. 2021), emphasise the relevance of zero and few-shot scenarios. CLIP shows that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets by leveraging information from text. Unfortunately, such text is in English only, showing how language inequality also impacts language-vision tasks.

Benchmarking aligns research with development, engineering with marketing, and competitors across the industry in pursuit of a clear objective. However, for many NLU tasks evaluation is currently unreliable and biased, with plenty of systems scoring so highly on standard benchmarks that little room is left for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon independent and identically distributed benchmarks in favour of adversarially constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only serves to obscure the abilities that we want our benchmarks to measure.
Adversarial data collection, understood as the process in which a human workforce interacts with a model in real time, attempts to produce examples that elicit incorrect predictions, but does not meaningfully address the causes of model failures, as shown, for instance, by Kaushik et al. (2021) for question answering.

¹⁹ Methods, resources and technology on Hybrid NLP, https://github.com/expertailab/HybridNLP

Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and ways in which they handle social bias. This is even more important when we expand our view to the multilingual landscape, such as the European multilingual reality. Furthermore, much more emphasis will need to be given to typical realistic settings (Church et al. 2021), in which large amounts of training data for the target task are not available, like few-shot and transfer learning. Moreover, while measuring performance on held-out data is a useful indicator, held-out datasets are often not comprehensive, and contain the same biases as the training data, as illustrated by Rajpurkar et al. (2018) inter alia. Recht et al. (2019) also showed that this can lead to overestimating real-world performance. Approaches like Ribeiro et al. (2020) advocate for a methodology that breaks down potential capability failures into specific behaviours, introducing different test types, such as prediction invariance in the presence of certain perturbations and performance on a set of sanity checks inspired by software engineering. Two requirements must be compulsory for such benchmarks: On the one hand, they will need to cover a representative sample of the key sectors in the European economy, including among others finance, health, tourism, manufacturing, and the corresponding added value chains.
On the other hand, such benchmarks need to be multilingual by design and cover each economic sector for each of the European languages, guaranteeing language equality regardless of the size of the market associated with each language.

3.4 Towards Deep Natural Language Understanding

Much has been said about the impact of intelligent systems on our lives. Today’s large amounts of available data, produced at an increasing pace and in heterogeneous formats and modalities, have stimulated the development of means that extend human cognitive and decision-making capabilities, alleviating such burdens and assisting our drivers, doctors, teachers and scientists. In scientific disciplines like biomedical sciences, some like Kitano (2016) even propose a new grand challenge for this kind of system: to develop an AI that can make major scientific discoveries that are eventually worthy of a Nobel Prize. This suggests the time is ripe for a shared partnership with machines, where humans can benefit from augmented reasoning and information management capabilities. Through such a partnership, we foresee a virtuous circle of data collection, active learning, and interactive feedback, which will result in adaptive, ever-learning systems.

We have already seen signs of such a partnership, e.g., in the application of generative models like GPT-3 to produce text given a prompt, with applications in different business sectors. Based on these developments, some suggest²⁰ that the future of AI lies in the development of systems that allow maintaining a conversation with a computer. This scenario should go beyond current and past chatbots, able to copy form without understanding meaning but nevertheless capable of creating a dialogue with the user.

²⁰ https://www.theverge.com/22734662/ai-language-artificial-intelligence-future-models-gpt-3-limitations-bias
However, this often seems to be missing from AI systems like facial recognition algorithms, which are imposed upon us, or self-driving cars, where the public becomes the test subjects in a potentially dangerous experiment. Language will require advances in knowledge representation, true understanding of meaning and pragmatics, and the ability of models to explain and interpret their predictions in ways that humans can understand and relate to.

The AI community and particularly the areas related to text understanding also need to address issues like fairness in ways that tangibly and directly benefit disadvantaged and misrepresented populations. We have spent large amounts of effort discussing fairness and transparency in our algorithms. At the algorithmic level, fairness has to do with the absence of bias in the models that for example in NLU are used to address tasks that may range from the evaluation of mortgage applications or insurance policies to medical examinations and career recommendations. If algorithms are biased, so are their predictions, in which case inequalities would be perpetuated as AI technologies are deployed more and more in society.

This is essential work. The lack of resources in a specific language to train an NLU model in that language can be seen as another source of discrimination. A very visual example in a related domain has to do with the use of a smartphone navigation app in a wheelchair, only to encounter a stairway along the route. Even the best navigation apps pose major challenges and risks if users cannot customise suggested routes in order to avoid insurmountable obstacles. Similarly, the lack of availability of service functionalities in all languages will have an unwanted effect in the respective populations. Accessibility, education, homelessness, human trafficking, misinformation, and health among others are all areas where AI and text understanding can have a really positive impact on people’s quality of life. So far, we have only started to scratch the surface.
4 Summary and Conclusions

We finish this chapter with a list of recommendations and guidelines that address central topics for text analytics and NLU. Among others, we emphasise the role of language equality for social good, the balance between commercial interests and equal opportunities for society, and incentives to help bring about language equality. We also focus on key technologies like neural language models and the availability of multilingual, cross-sectorial datasets and benchmarks.

1. Language equality in text analytics is a transformative and integrative force for social good that can stimulate development in such important aspects for our societies as access to health, public administration services for everyone, better education and more business opportunities. These will contribute to more developed societies, which in turn will encourage progress and prosperity, creating new markets for text analytics and other areas related to AI and LT across Europe. However, this is not yet a common scenario for all European languages. The question we should ask ourselves is: what is the alternative? What will the social cost be if the required policies do not effectively reach all European languages by 2030?
2. The balance between legitimate commercial interests and equal access to opportunities is fragile when it comes to DLE in text analytics. We have shown how global providers tend to concentrate their offerings and investment in more widespread languages, neglecting a long tail of languages with smaller populations. In contrast, European initiatives such as ELG (Rehm 2023) provide a more equitable coverage. Two reflections emerge. First, it is a European priority to ensure that all European languages are properly covered. Therefore, European companies and also European research organisations in the text analytics space should benefit from incentives that allow them to focus on such languages.
Such incentives should naturally come from a thriving market demanding these services in Europe, but also in other forms, like – for companies – tax breaks associated with language services for less represented languages or – for research organisations – specific regional or national funding that can only be used for developing tools or resources for the national or regional language. Second, to create traction this effort should involve European technology providers but also consumers of such services at the different levels of the European public administration and large European companies.
3. Possible incentives to language equality in text analytics and NLU are not just financial. Acknowledging that we are working on a particular language offers the opportunity to stress that research is language-specific. Conversely, neglecting to state that a particular piece of research worked on, say, English language data gives a false veneer of language independence (Bender 2011). Incentives need to be provided for text analysis research to cover all European languages.
4. Neural language models are a cornerstone of most NLU and text analytics pipelines now, and this will continue in the next few years. However, current methods to create such models are hardware-intensive, require vast amounts of text data, and the training comes at the cost of high energy consumption and a large carbon footprint. Because of this, most of the language models available nowadays (like BERT, RoBERTa, T5, GPT-3, etc.) have been trained on general-purpose documents collected from the internet and freely available resources, which hinders their application in vertical domains, requiring additional pre-training on relevant data that is not easy to find.
5. Data is key. Without sufficient amounts of good-quality data, language models and text analytics solutions based on ML approaches cannot be trained.
However, suitable data and particularly multilingual text is hard to find and expensive to annotate in order to enable subsequent fine-tuning of pre-trained language models on tasks like classification, sentiment analysis, etc. While much progress has been made in creating large-scale labeled data sets for the major languages, it is not yet feasible, especially from a business-driven perspective, to do this for all European languages, let alone the literally thousands of languages spoken on the planet. As suggested in the previous item, there is little or no doubt that enough general-purpose data can be collected in the different European languages that will suffice to pre-train language models for each of our languages following self-supervised approaches. The problem comes in satisfying the needs of domain- and task-specific data to adapt such models to solve real-life problems in each of the different business sectors and languages.
6. Data tends to be locked in regulatory and corporate silos. Research and solutions for LTs that address problems of business and social relevance are underdeveloped. A major reason is that enterprise data is not available to researchers in academia. As enterprise data is by nature confidential and companies need to respect data protection regulations, the barriers for making data available are high. The idea to create data spaces through which companies can make data available under certain terms still needs to crystallise into a dynamic ecosystem that can be compared to generally available text analytics and NLU datasets and models. To address this bottleneck, further collaboration is required between industry, academia and European institutions that facilitates the creation of multilingual text data spaces across the different strategic business sectors. This effort would benefit from an improved balance between European regulations like GDPR and the use of data for research purposes.
Currently, companies abiding by GDPR face restrictions and demands that impose some burdens. To be competitive, European companies may need to use neural language models built by third parties in the US or China that are not subject to such regulations.
7. Benchmarking is inadequate and needs to be fixed and updated. For many NLU tasks evaluation is currently unreliable and biased, with plenty of systems scoring so highly on standard benchmarks that little room is left for better systems to demonstrate their improvements. The recent trend to abandon traditional, independent and identically distributed benchmarks in favour of adversarially constructed, out-of-distribution test sets means that current models will perform poorly, and ultimately only obscures the abilities that we want our benchmarks to measure. Restoring a healthy evaluation ecosystem, particularly one involving a vision for DLE, will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias. However, if we want to make well-grounded progress it is crucial that improved benchmarking considers not only technical but also ethical and societal issues. Benchmark design needs to fit realistic data compositions, rather than synthetic ones within our comfort zone. Addressing this shortage of real-life benchmarks will require significant collaboration between European industry and academia.
8. Text does not live in isolation. Information is cross-modal. Text is rarely found in isolation in real life. Addressing many of the market and societal challenges towards DLE will benefit from taking into account cross-modal scenarios to leverage additional sources of free supervision. Recent advances like OpenAI’s CLIP and Meta’s Data2Vec²¹ seem promising. However, perhaps not surprisingly, all such models are currently available in English only.
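Recommendation 7 asks for evaluation that goes beyond a single held-out score. Purely as an illustration, the following Python sketch shows the kind of invariance test advocated by Ribeiro et al. (2020): a perturbation that should not change the label is applied to each input, and the share of changed predictions is reported. The classifier `toy_sentiment`, the perturbations `swap_city` and `add_typo`, and the example sentences are hypothetical stand-ins introduced here; they are not part of CheckList or of any benchmark discussed in this chapter.

```python
from typing import Callable, List

Model = Callable[[str], str]
Perturbation = Callable[[str], str]

POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for an NLU model: counts sentiment keywords."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(model: Model, texts: List[str], perturb: Perturbation) -> List[str]:
    """Return the inputs whose prediction changes under the perturbation."""
    return [text for text in texts if model(text) != model(perturb(text))]


def failure_rate(model: Model, texts: List[str], perturb: Perturbation) -> float:
    """Share of inputs that violate the invariance expectation."""
    if not texts:
        return 0.0
    return len(invariance_failures(model, texts, perturb)) / len(texts)


def swap_city(text: str) -> str:
    """Label-preserving perturbation: replace one city name with another."""
    return text.replace("Dublin", "Riga")


def add_typo(text: str) -> str:
    """Label-preserving perturbation: introduce a common transposition typo."""
    return text.replace("great", "graet")


# Hypothetical test sentences from different sectors (health, tourism, finance).
TEXTS = [
    "The clinic in Dublin was great.",
    "The hotel in Dublin was terrible.",
    "The bank in Dublin was closed.",
]

if __name__ == "__main__":
    print(f"{failure_rate(toy_sentiment, TEXTS, swap_city):.2f}")  # prints 0.00
    print(f"{failure_rate(toy_sentiment, TEXTS, add_typo):.2f}")   # prints 0.33
```

The toy model is stable when the city changes but fails on one of three inputs once a typo is introduced, which a single accuracy figure would not reveal. In a multilingual benchmark of the kind recommended above, one would presumably repeat such tests per language and per sector, with perturbations suited to each language, so that failure rates can be compared across languages.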
²¹ https://ai.facebook.com/research/data2vec-a-general-framework-for-self-supervised-learning-in-speech-vision-and-language

Finally, we would like to emphasise two points that are particularly critical to ensure DLE in Europe. First, neural language models and related techniques are at the core of sustaining progress in LT in modern NLP. Therefore, being able to build language models for target languages with the same quality as English is key for language equality. Second, multilingual data is the key element to train such models in the target languages. We should not assume that large amounts of publicly available corpora of good quality can be readily obtained for all European languages, but rather the contrary. The effort to ensure that all languages have large amounts of publicly available corpora of good quality, taking into account fairness issues, should be at the centre of any future efforts striving for DLE.

References

Ahmed, Nur and Muntasir Wahed (2020). “The De-democratization of AI: Deep Learning and the Compute Divide in Artificial Intelligence Research”. In: CoRR abs/2010.15581. https://arxiv.org/abs/2010.15581.
Bender, Emily M. (2011). “On Achieving and Evaluating Language-Independence in NLP”. In: Linguistic Issues in Language Technology 6.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Virtual Event Canada, pp. 610–623.
Bender, Emily M. and Alexander Koller (2020). “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 5185–5198. https://aclanthology.org/2020.acl-main.463.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei (2020). “Language Models are Few-Shot Learners”. In: Advances in Neural Information Processing Systems 33, pp. 1877–1901.
Church, Kenneth, Mark Liberman, and Valia Kordoni (2021). “Benchmarking: Past, Present and Future”. In: Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future. Online: Association for Computational Linguistics, pp. 1–7. DOI: 10.18653/v1/2021.bppf-1.1.
Colon-Hernandez, Pedro, Catherine Havasi, Jason Alonso, Matthew Huggins, and Cynthia Breazeal (2021). “Combining Pre-Trained Language Models and Structured Knowledge”. In: arXiv preprint arXiv:2101.12294.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: NAACL Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. https://aclanthology.org/N19-1423.
Ding, Ning, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun (2022). “OpenPrompt: An Open-source Framework for Prompt-learning”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Dublin, Ireland: Association for Computational Linguistics, pp. 105–113. https://aclanthology.org/2022.acl-demo.10.
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs.CV].
European Parliament (2018). Language Equality in the Digital Age. European Parliament resolution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI)). http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.
Gao, Tianyu, Adam Fisch, and Danqi Chen (2021). “Making Pre-trained Language Models Better Few-shot Learners”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 3816–3830. https://aclanthology.org/2021.acl-long.295.
Gómez-Pérez, José Manuel, Ronald Denaux, and Andrés Garcia-Silva (2020). A Practical Guide to Hybrid Natural Language Processing - Combining Neural Models and Knowledge Graphs for NLP. Springer. DOI: 10.1007/978-3-030-44830-1.
Gomez-Perez, Jose Manuel, Andres Garcia-Silva, Cristian Berrio, German Rigau, Aitor Soroa, Christian Lieske, Johannes Hoffart, Felix Sasaki, Daniel Dahlmeier, Inguna Skadina, Aivars Berzinš, Andrejs Vasiljevs, and Teresa Lynn (2022). Deliverable D2.15 Technology Deep Dive – Text Analytics, Text and Data Mining, NLU. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/text-analytics-deep-dive.pdf.
Gomez-Perez, Jose Manuel and Raúl Ortega (2020). “ISAAQ – Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 5469–5479.
https://aclanthology.org/2020.emnlp-main.441.
Han, Xu, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, and Jun Zhu (2021). “Pre-Trained Models: Past, Present and Future”. In: AI Open 2, pp. 225–250. https://www.sciencedirect.com/science/article/pii/S2666651021000231.
Hu, Ronghang and Amanpreet Singh (2021). “Transformer is all you need: Multimodal multitask learning with a unified transformer”. In: arXiv preprint arXiv:2102.10772 2.
Kaushik, Divyansh, Douwe Kiela, Zachary C. Lipton, and Wen-tau Yih (2021). “On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 6618–6633. https://aclanthology.org/2021.acl-long.517.
Kitano, Hiroaki (2016). “Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery”. In: AI Magazine 37, pp. 39–49. DOI: 10.1609/aimag.v37i1.2642.
Labropoulou, Penny, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis, Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Arranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez Pérez, and Andres Garcia-Silva (2020). “Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3421–3430.
https://www.aclweb.org/anthology/2020.lrec-1.420/.
Leonova, Viktorija (2020). “Review of Non-English Corpora Annotated for Emotion Classification in Text”. In: Databases and Information Systems – 14th International Baltic Conference, DB&IS 2020, Tallinn, Estonia, June 16-19, 2020, Proceedings.
Li, Jiwei, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao (2016). “Deep Reinforcement Learning for Dialogue Generation”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, pp. 1192–1202. https://aclanthology.org/D16-1127.
Lourie, Nicholas, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi (2021). “UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark”. In: Proceedings of the AAAI Conference on Artificial Intelligence 35.15, pp. 13480–13488.
Lu, Yinquan, Haonan Lu, Guirong Fu, and Qun Liu (2021). “KELM: Knowledge Enhanced Pre-Trained Language Representations with Message Passing on Hierarchical Relational Graphs”. In: arXiv preprint arXiv:2109.04223.
Ma, Kaixin, Filip Ilievski, Jonathan Francis, Yonatan Bisk, Eric Nyberg, and Alessandro Oltramari (2021). “Knowledge-Driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering”. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021. AAAI Press, pp. 13507–13515. https://ojs.aaai.org/index.php/AAAI/article/view/17593.
Min, Bonan, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heinz, and Dan Roth (2021). “Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey”. In: arXiv preprint arXiv:2111.01243.
Mostafazadeh, Nasrin, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen (2016). “A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, pp. 839–849. https://aclanthology.org/N16-1098.
Palanque, Philippe and Fabio Paterno, eds. (2000). Interactive Systems: Design, Specification, and Verification, 7th International Workshop DSV-IS, Limerick, Ireland, June 5-6, 2000, Proceedings. DOI: 10.1109/ICSE.2000.870518.
Peters, Matthew E., Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith (2019). “Knowledge Enhanced Contextual Word Representations”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 43–54. https://aclanthology.org/D19-1005.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever (2021). “Learning Transferable Visual Models From Natural Language Supervision”. In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. PMLR, pp. 8748–8763.
Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever (2019). Language Models are Unsupervised Multitask Learners. Tech. rep. OpenAI.
Rajpurkar, Pranav, Robin Jia, and Percy Liang (2018). “Know What You Don’t Know: Unanswerable Questions for SQuAD”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 784–789.
https://aclanthology.org/P18-2124.
Recht, Benjamin, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar (2019). “Do ImageNet Classifiers Generalize to ImageNet?” In: Proceedings of the 36th International Conference on Machine Learning. Long Beach. https://proceedings.mlr.press/v97/recht19a/recht19a.pdf.
Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Khalid Choukri, Andrejs Vasiljevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriute, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás Váradi, Kadri Vider, Andy Way, and François Yvon (2020). “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325. https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Languages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.
Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh (2020).
“Beyond Accuracy: Behavioral Testing of NLP Models with CheckList”. In: Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 4902–4912. https://www.aclweb.org/anthology/2020.acl-main.442.
Sakaguchi, Keisuke, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi (2021). “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”. In: Communications of the ACM 64.9, pp. 99–106. https://doi.org/10.1145/3474381.
Sanh, Victor, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng-Xin Yong, Harshit Pandey, Michael Mckenna, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush (2022). “Multitask Prompted Training Enables Zero-Shot Task Generalization”. In: ICLR 2022 – Tenth International Conference on Learning Representations. Online. https://hal.inria.fr/hal-03540072.
Serban, Iulian, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau (2018). “A Survey of Available Corpora for Building Data-Driven Dialogue Systems”. In: https://arxiv.org/abs/1512.05742.
Shapira, Ori, Ramakanth Pasunuru, Hadar Ronen, Mohit Bansal, Yael Amsterdamer, and Ido Dagan (2021). “Extending Multi-Document Summarization Evaluation to the Interactive Setting”. In: Proceedings of the 2021 North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online, pp. 657–677. DOI: 10.18653/v1/2021.naacl-main.54.
Sheng, Emily, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng (2021). “Societal Biases in Language Generation: Progress and Challenges”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 4275–4293. https://aclanthology.org/2021.acl-long.330.
Sheth, Amit, Sujan Perera, Sanjaya Wijeratne, and Krishnaprasad Thirunarayan (2017). “Knowledge Will Propel Machine Understanding of Content: Extrapolating from Current Examples”. In: Proceedings of the International Conference on Web Intelligence. Leipzig, Germany: ACM, pp. 1–9. DOI: 10.1145/3106426.3109448.
Shoham, Yoav (2015). “Why Knowledge Representation Matters”. In: Communications of the ACM 59.1, pp. 47–49. DOI: 10.1145/2803170.
Stiennon, Nisan, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano (2020). “Learning to Summarize with Human Feedback”. In: Advances in Neural Information Processing Systems 33, pp. 3008–3021.
Tafjord, Oyvind and Peter Clark (2021). “General-Purpose Question-Answering with Macaw”. In: ArXiv abs/2109.02593.
Talmor, Alon, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant (2019). “CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4149–4158. https://aclanthology.org/N19-1421.
Wang, Ruize, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou (2021). “K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters”.
In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, pp. 1405–1418. DOI: 10.18653/v1/2021.findings-acl.121. https://aclanthology.org/2021.findings-acl.121.
Wang, Zijie J., Dongjin Choi, Shenyu Xu, and Diyi Yang (2021). “Putting Humans in the Natural Language Processing Loop: A Survey”. In: Proceedings of the First Workshop on Bridging Human – Computer Interaction and Natural Language Processing. Online: Association for Computational Linguistics, pp. 47–52. https://aclanthology.org/2021.hcinlp-1.8.
Wei, Jason, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le (2021). “Finetuned Language Models Are Zero-Shot Learners”. In: arXiv preprint arXiv:2109.01652. arXiv: 2109.01652 [cs.CL]. https://arxiv.org/abs/2109.01652.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush (2020a). “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, pp. 38–45. DOI: 10.18653/v1/2020.emnlp-demos.6. https://aclanthology.org/2020.emnlp-demos.6.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush (2020b). “Transformers: State-of-the-art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. ACL, pp. 38–45.
Chapter 43
Deep Dive Data and Knowledge

Martin Kaltenböck, Artem Revenko, Khalid Choukri, Svetla Boytcheva, Christian Lieske, Teresa Lynn, German Rigau, Maria Heuschkel, Aritz Farwell, Gareth Jones, Itziar Aldabe, Ainara Estarrona, Katrin Marheinecke, Stelios Piperidis, Victoria Arranz, Vincent Vandeghinste, and Claudia Borg

Martin Kaltenböck · Artem Revenko
Semantic Web Company, Austria, martin.kaltenboeck@semantic-web.com, artem.revenko@semantic-web.com

Khalid Choukri · Victoria Arranz
Evaluations and Language Resources Distribution Agency, France, choukri@elda.org, arranz@elda.org

Svetla Boytcheva
Ontotext, Bulgaria, svetla.boytcheva@ontotext.com

Christian Lieske
SAP SE, Germany, christian.lieske@sap.com

Teresa Lynn
Dublin City University, ADAPT Centre, Ireland, teresa.lynn@adaptcentre.ie

German Rigau · Aritz Farwell · Itziar Aldabe · Ainara Estarrona
University of the Basque Country, Spain, german.rigau@ehu.eus, aritz.farwell@ehu.eus, itziar.aldabe@ehu.eus, ainara.estarrona@ehu.eus

Maria Heuschkel
Wikimedia Deutschland, Germany, maria.heuschkel@wikimedia.de

Gareth Jones
Bangor University, United Kingdom, g.jones@bangor.ac.uk

Katrin Marheinecke
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, katrin.marheinecke@dfki.de

Stelios Piperidis
R.C. “Athena”, Greece, spip@athenarc.gr

Vincent Vandeghinste
Dutch Language Institute, The Netherlands, vincent.vandeghinste@ivdnt.org

Claudia Borg
University of Malta, Malta, claudia.borg@um.edu.mt

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_43

Abstract This deep dive on data, knowledge graphs (KGs) and language resources (LRs) is the last of the four technology deep dives, as data and the related models are the basis for technologies and solutions in the area of Language Technology (LT) for European digital language equality (DLE). This chapter focuses on the data and LRs required to achieve full DLE in Europe by 2030. The main components identified – data, KGs, LRs – are explained, and used to analyse the state-of-the-art as well as identify gaps. All of these components need to be tackled in the future, for the widest range of languages possible, from official EU languages to dialects to non-EU languages used in Europe. For all these languages, efficient data collection and sustainable data provision need to be facilitated with fair conditions and costs. Specific technologies, methodologies and tools have been identified to enable the implementation of the vision of DLE by 2030.
In addition, data-related business models and data-governance models are discussed, as they are considered a prerequisite for a working data economy that stimulates a vibrant LT landscape that can bring about European DLE.¹

1 Introduction

Digital language equality (DLE) as well as the European data economy rely on the availability, the interoperability and the form of (unstructured, semi-structured, structured) data as a basis for further innovation and improved technological development, especially for trustworthy AI “made in Europe” and powerful language technology (LT) that respects and reflects European values. Data spaces,² data sharing and exchange platforms³ and marketplaces are enablers, key to unleashing the potential of such data. However, data sharing and interoperability are still in their infancy. The diffusion of platforms for data sharing and availability of interoperable datasets is one of the key success factors which may help to drive the European data economy and industrial transformation.

The European Digital Single Market strategy that was adopted on 6 May 2015⁴ has been built on three pillars: access, environment, and economy & society. The latter aims at maximising the growth potential of the digital economy, inspired by the 2018 Commission Communication “Towards a common European data space”,⁵ which provides guidance on B2B data sharing, bringing together data as a key source of innovation and growth from different sectors, countries and disciplines, into a common data space. Overall, the EU has specified its ambition⁶ to become the world’s most secure and trustable data hub.

This chapter provides insights into: 1. the main components of this deep dive, 2. the current state-of-the-art, 3. the main gaps identified in the field, 4. its contribution to DLE and the impact on society, 5. an analysis of the main breakthroughs needed in the area of data, language resources (LRs), and knowledge graphs (KGs), 6. the main technology visions and development goals identified to help achieve deep natural language understanding (NLU), all closed by 7. a summary and conclusions section.

¹ This chapter is an abridged version of Kaltenböck et al. (2022).
² Next-generation data acquisition and processing platforms as exemplified, among others, by the BDVA reference model: https://bdva.eu/sites/default/files/BDVA_SRIA_v4_Ed1.1.pdf.
³ Data sharing and exchange platforms, through which data is commercialised using open data, monetised data and trusted data sharing mechanisms.
⁴ https://ec.europa.eu/commission/presscorner/detail/en/IP_15_4919
⁵ https://ec.europa.eu/transparency/regdoc/rep/1/2018/EN/COM-2018-232-F1-EN-MAIN-PART-1.PDF
⁶ https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52020DC0066

1.1 Scope of this Deep Dive

This deep dive covers a relatively wide range of technologies in the area of LT, including machine translation (MT, Chapter 40), speech technologies (Chapter 41), text analytics and NLU (Chapter 42) as well as content management and knowledge management systems, text generation, and language learning systems, as data and LRs are the backbone for all these technologies as well as many more.

In addition, the area of KGs plays an important role in this deep dive as KGs provide powerful mechanisms and principles to interlink and enrich data in a high-quality manner. KGs can build a powerful and relatively easy to maintain network of interlinked data – including and combining structured, semi-structured and unstructured data – that can be seen as a crucial element of the data infrastructure required to develop future LT solutions, which require not only a single underlying dataset but in addition a wide range of meaningful and contextualised data. Furthermore, the integrated data models inside of KGs (taxonomies, vocabularies and ontologies) allow the training of algorithms for LT solutions with higher precision requiring smaller amounts of training data.

The topic of metadata and data in this chapter is always related to LT, language understanding and DLE in Europe.
Accordingly, metadata and data in this respect concern (mostly, but not exclusively) LRs, (annotated) corpora, translation memories, dictionaries and lexicographic resources, as well as other LRs and relevant data that is required for powerful multilingual LT. Such data and metadata constitute a strong enabler of AI and machine learning (ML), methodologies that have enabled innovative approaches and advances in the field of LT (Elliot et al. 2021).

In addition to these principal components, a number of related methodologies and tools are currently on the rise, and these form part of the technological vision for 2030 in this deep dive. The subject of data-related business models is tackled throughout the chapter, as functioning, sustainable data-related business models are a prerequisite for a thriving data economy and ecosystem that in turn stimulates and fosters those data-related components listed above, to enable a working LT landscape that can deliver European DLE.

1.2 Main Components

The main components of our analysis related to data, LRs, and KGs include: 1. availability of data and metadata, 2. accessibility of data, 3. quality of data, 4. data interoperability, 5. licensing and data-related regulations, 6. data and ethics, and 7. data literacy. At the same time, the following related concepts, methodologies and tools also need to be considered: 8. data infrastructures, data spaces and data markets; 9. data at scale; 10. KGs; 11. semantic AI (statistical and symbolic AI in combination); and 12. innovative data and metadata management tools.

These main components always include structured data, semi-structured data and/or unstructured data, which can apply to different modalities, e.g., written, spoken, signs, etc. In addition, as for other technology areas, the data for LT may be available as raw data and/or curated data, at varying levels of quality.
With the rise of AI, the importance of large language models (such as BERT⁷ or GPT-3⁸), and comprehensive and multilingual KGs – all based on a broad range of domains and/or languages – is continuously increasing. For all LRs and data types there is the requirement for domain-specificity, so that domain- and industry-specific applications can be developed where specialised language and terminology are realised, e.g., in industries such as health, pharmaceuticals or finance. Let us now examine each of these aspects in detail:

Availability of data and metadata – As data and metadata form the backbone of any LT, the availability of data and metadata is the overall basis to enable such technologies and services. Availability therefore impinges on data collections, data types available, and how to find and explore such data.

Accessibility of data – The accessibility of data is crucial; it is also reflected in the FAIR principles (Wilkinson et al. 2016), initially advocated for research data management and stewardship in order to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. Since 2007, accessibility has also been one of the initial eight key principles of open (government) data.⁹

Quality of data – When data is available and accessible, users often consider additional attributes and components, one being quality of data. As the value of data is based on its fit for certain use-cases and business cases, data quality is a crucial issue reflecting and impacting the respective data value. Dimensions to measure data quality often include – but are not limited to – completeness, validity, timeliness, consistency, and integrity (Sebastian-Coleman 2012). Reliability is also an important factor of data quality, although it is hard to measure. When all things are considered, the quality of an LT application is often based largely on the quality of the underlying data used to train the system.
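To make two of the quality dimensions named above more concrete, the following is a minimal sketch of how completeness and validity might be measured for a small parallel corpus. The record layout, the field names and the set of language codes are assumptions made for illustration only; they are not taken from the chapter or from Sebastian-Coleman (2012).

```python
# Hypothetical record layout for one entry of a parallel corpus (an assumption,
# not taken from the chapter).
REQUIRED_FIELDS = ("source_text", "target_text", "source_lang", "target_lang")

# Illustrative subset of language codes; a real check would use a full registry.
KNOWN_LANGS = {"en", "de", "es", "ga", "mt"}


def completeness(records):
    """Share of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(1 for r in records if all(r.get(f) for f in REQUIRED_FIELDS))
    return complete / len(records)


def validity(records):
    """Share of records whose two language codes both appear in KNOWN_LANGS."""
    if not records:
        return 0.0
    valid = sum(
        1
        for r in records
        if r.get("source_lang") in KNOWN_LANGS and r.get("target_lang") in KNOWN_LANGS
    )
    return valid / len(records)


if __name__ == "__main__":
    sample = [
        {"source_text": "Good morning", "target_text": "Guten Morgen",
         "source_lang": "en", "target_lang": "de"},
        {"source_text": "Thank you", "target_text": "",
         "source_lang": "en", "target_lang": "mt"},
    ]
    print("completeness:", completeness(sample))
    print("validity:", validity(sample))
```

Timeliness, consistency and integrity would need further information (timestamps, cross-record constraints, checksums) that this sketch deliberately leaves out.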
Data interoperability – Data interoperability is defined as¹⁰ “addresses[ing] the ability of systems and services that create, exchange and consume data to have clear, shared expectations for the contents, context and meaning of that data.” Interoperability ensures the seamless interplay of different LT systems regarding both APIs and data exchange. Not unexpectedly, it is often connected with and facilitated by the specification and adoption of related standards in the field.

Licensing and data-related regulations – Relevant data often comes from different owners and publishers, such as companies, public administrations or citizens, with different licences. Accordingly, proper licence clearing is a crucial task for all data-related activities in LT. The licences on data that are usually specified by data owners/publishers need to be taken into account as an important component, as well as the applicable laws and regulations around data, such as those concerning data privacy, security, processing and protection of personal identifiable information (PII), as laid out, for instance, in the General Data Protection Regulation (GDPR). National and regional as well as international regulations and policies around data use and re-use should also be taken into account.

Data and ethics – The rise of AI and ML has led to an increase in both data collection and processing, so the issue of data and ethics has become more and more important. It is closely connected to data-related regulations. Language, by its very nature, can be ambiguous and the associated interpretations can easily represent and expose bias. Accordingly, ethics plays a crucial role regarding the use of data in LTs and impacts equality in general, including language equality.

⁷ https://en.wikipedia.org/wiki/BERT_(language_model)
⁸ https://en.wikipedia.org/wiki/GPT-3
⁹ https://opengovdata.org
¹⁰ https://datainteroperability.org
Data literacy – Gartner Research¹¹ defines data literacy as “the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied, and the ability to describe the use-case, application and resulting value.”

Data infrastructures, data spaces, data markets – The ideas behind data spaces and data markets follow the intentions underpinning data catalogues established in the course of the open data movement since the early 2000s to allow the sharing, exchange and trading of data. Data spaces and markets enable the availability of and allow accessibility to high-quality data, which follow standards (thus providing data interoperability) accompanied by clear licensing conditions. The Gaia-X¹² initiative defines a “data space” as “refer[ring] to a type of data relationship between trusted partners, each of whom apply the same high standards and rules to the storage and sharing of their data. However, of key importance to the concept of a data space is that data are not stored centrally but at source and are therefore only shared (via semantic interoperability) when necessary. A data space is the sum of all its participants – which may be data providers, users and intermediaries. Data spaces can be nested and overlapping, so that a data provider, for example, can participate in several data spaces all at once. Data sovereignty and trust are essential for the working of data spaces and the relationships between participants.”

Data at scale – Practical LT solutions require high-quality data at scale, for a broad range of domains and in various languages, with clear licences and fair conditions attached. Data infrastructures, data spaces and data markets provide powerful means to discover, evaluate and access relevant data as well as related data-driven services that are required for LT solutions.

¹¹ https://www.gartner.com/smarterwithgartner/a-data-and-analytics-leaders-guide-to-data-literacy
¹² https://gaia-x.eu/what-is-gaia-x/
Knowledge Graphs – A Knowledge Graph is a knowledge base that uses a graph-structured data model or topology to integrate data. KGs are used to store interlinked descriptions of entities – objects, events, situations or concepts – while also encoding the semantics underlying the terminology used.¹³ Since the development of the Semantic Web, KGs have often been associated with Linked Open Data (LOD) projects, focusing on the connections between concepts and entities (Soylu et al. 2020; Auer et al. 2018). They are prominently associated with and used by search engines such as Google or Bing; knowledge-engines and personal assistants such as Wolfram Alpha, Apple’s Siri, and Amazon Alexa; and social networks such as LinkedIn and Facebook. LT solutions require not only targeted datasets but also high-quality, interlinked, meaningful and contextualised data that can easily be used, quickly expanded and efficiently maintained with reasonable effort. KGs provide these characteristics and contribute to the data and knowledge backbone for LT.

Semantic AI – Modern approaches tend to combine statistical AI (ML) and symbolic AI (models like ontologies, knowledge bases for common sense knowledge, and cultural resources, among others). In October 2020, Agarwal defined semantic AI¹⁴ as “provid[ing] a framework to perform end to end complex tasks automatically. It uses many different machine learning and logic-based approaches, and also utilizes the background knowledge often stored in knowledge graphs.”

Innovative data and metadata management tools – Innovative data and metadata management tools are tools that enable the availability and accessibility of high-quality data and data interoperability (using relevant standards), that provide powerful data governance mechanisms (following relevant regulations), that enable mechanisms for the assessment of ethics in data, and that allow improvements in data literacy.
In addition, such tools should support (perhaps in combination) secure data sharing mechanisms (data spaces), provide strong capability for interlinking data, support meaning and context (KGs) and provide semantic AI capability.

¹³ https://ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph
¹⁴ https://medium.com/@dr.puneet.a/what-is-semantic-ai-is-it-a-step-towards-strong-ai-5f0355be3597

2 State-of-the-Art and Main Gaps

2.1 State-of-the-Art

From the start of the open data movement in 2007 with its eight principles of open government data, the requirements of industry data as well as organisation-based data-sharing and collaboration have found their feet and culminated in the next era of data sharing: data catalogues and data portals, as well as, more recently, data spaces and data markets. In the area of LT, data availability, accessibility, aggregation, sharing and reuse have received attention since the early 1990s, with associations and organisations providing LR catalogues, like the European Language Resources Association¹⁵ or the Linguistic Data Consortium.¹⁶ Since the early 2010s, several research and innovation projects have contributed to the field including FLaReNet and META-NET with META-SHARE¹⁷ (Piperidis 2012). They provided recommendations, specifications and implementations of platforms promoting and facilitating data discovery, sharing and reuse. At the same time, CLARIN¹⁸ (Hinrichs and Krauwer 2014) has been established as a research infrastructure providing access to digital language data for scholars in the social sciences and humanities, and beyond.
CLARIN is associated with the EUDAT Collaborative Data Infrastructure (EUDAT CDI),¹⁹ and contributes to the European Open Science Cloud (EOSC)²⁰ with the EOSC-related project Social Sciences and Humanities Open Cloud (SSHOC)²¹ and its data market for social sciences and humanities.²²

Another example of research, development and infrastructure activities supported by the implementation of the Public Sector Information Directive²³ is the ELRC-SHARE repository²⁴ (Piperidis et al. 2018) that is used for documenting, storing, browsing and accessing LRs that are collected through the European Language Resource Coordination²⁵ initiative (Lösch et al. 2018) and considered useful for feeding the CEF Automated Translation (CEF.AT) platform.

In 2022, the European Language Grid (ELG)²⁶ (Rehm et al. 2020a; Rehm 2023) released the ELG platform providing access to LT resources and services from all over Europe, enabling users to try out the services or use the ELG APIs. ELG built bridges to a wide range of language data platforms including the European AI on Demand Platform (Labropoulou et al. 2023).

Turning to the LT industry, there are products like the TAUS Marketplace,²⁷ as well as APIs for lexicographical information or Natural Language Processing (NLP) APIs giving access to services from part-of-speech tagging and dependency parsing to MT, summarisation and question answering. Finally, there are active industry associations and networks like LT-Innovate²⁸ or BDVA/DAIRO²⁹ that support the idea of data collection, provision and sharing to support better LT in the future.

¹⁵ http://www.elra.info
¹⁶ https://www.ldc.upenn.edu
¹⁷ http://www.meta-share.org
¹⁸ https://www.clarin.eu
¹⁹ https://www.eudat.eu
²⁰ https://eosc-portal.eu
²¹ https://sshopencloud.eu
²² https://marketplace.sshopencloud.eu
²³ https://digital-strategy.ec.europa.eu/en/policies/public-sector-information-directive
²⁴ https://elrc-share.eu
²⁵ https://lr-coordination.eu
²⁶ https://www.european-language-grid.eu
²⁷ https://datamarketplace.taus.net
Most if not all of the above platforms and initiatives have now endorsed the FAIR principles, adopting them as a de facto standard. In this context, data interoperability has been an important factor, related (mostly but not exclusively) to efficient data use and processing, as well as data exchange and sharing. There are dozens of standards regarding data in place worldwide, set up by several standardisation bodies in a range of industry domains. This diversity of data-related standards reinforces the problem as there is relatively little mapping between such standards and approaches. In the context of ELG and with regard to the wider area of AI/LT platform interoperability, initial attempts have been made at cross-platform search and discovery of resources and services, on the one hand, and composition of cross-platform service workflows, on the other (Rehm et al. 2020b).

Since the open data and data sharing movement began, every digital asset has needed to be accompanied by a clear and dedicated licence. While this issue has become more and more important, there are quite a lot of possible licences to choose from, inevitably reinforcing legal interoperability problems. While there are multiple commercial licensing options not centrally registered, a good source for open licences is the Open Definition of the Open Knowledge Foundation.³⁰

Several data regulations and directives have been developed by the European Union over the last decade. They are an important foundation of the data economy, as well as the realisation of a working, sustainable data infrastructure across Europe.
Some of the most important ones include, among others: GDPR,³¹ European Strategy for Data,³² European Data Governance (Data Governance Act),³³ EU Open Data Strategy and PSI Directive,³⁴ European Approach to Artificial Intelligence, including the EC AI Strategy,³⁵ Digital Single Market Strategy for Europe,³⁶ and Digital Education Action Plan.³⁷ As far as LT for DLE in Europe is concerned, all of these regulations have a clear impact. In terms of this deep dive, the Data Governance Act has a strong implication for data, LRs and KGs, as it lays the groundwork for the development of common data spaces in strategic sectors.

²⁸ https://www.lt-innovate.org
²⁹ https://www.bdva.eu
³⁰ https://opendefinition.org/guide/data/
³¹ https://eur-lex.europa.eu/eli/reg/2016/679/oj
³² https://ec.europa.eu/info/sites/default/files/communication-european-strategy-data-19feb2020_en.pdf
³³ https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52020PC0767
³⁴ https://digital-strategy.ec.europa.eu/en/policies/open-data
³⁵ https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence
³⁶ https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A52015DC0192
³⁷ https://ec.europa.eu/education/education-in-the-eu/digital-education-action-plan_en

Setting technical issues to one side, data and ethics is a topic in which regulators and standards (such as those mentioned above) play a crucial role. After many years’ discussion about data and ethics but also about AI and ethics, a standard has been published: IEEE P7000 Engineering Methodologies for Ethical Life-cycle Concerns Working Group. It establishes a process model by which engineers and technologists can address ethical considerations throughout the various stages of system initiation, analysis and design.
Expected process requirements include management and engineering views of new IT product development, computer ethics and IT system design, value-sensitive design, and stakeholder involvement in ethical IT system design.³⁸

Data literacy is an underlying component of digital dexterity: an employee’s ability and desire to use existing and emerging technology to drive better business outcomes. The European Union supports data literacy and beyond in the Digital Education Action Plan,³⁹ and globally programmes like the World Bank’s Data Use and Literacy Programme⁴⁰ support the awareness, education and implementation of data literacy. Nevertheless, compared to the data and data-related technologies available, data literacy lags far behind and needs more action and effort.

The idea of a KG follows the basic principles of the semantic web and linked data. For LTs, the KG principles have great potential for modelling common-sense knowledge and domain-specific knowledge, as well as provisioning rich context and meaning in monolingual, bilingual, multilingual and cross-lingual applications. KGs are often assembled from numerous sources, and as a result, can be highly diverse in terms of structure and granularity.

KGs aim to serve as an ever-evolving shared substrate of knowledge within an organisation or community (Noy and McGuinness 2001). We distinguish two types of KGs: open KGs and enterprise KGs. Open KGs are published online, making their content accessible for the public good. Enterprise KGs are internal to a company and applied to commercial use-cases. Applications based on KGs include search, recommender systems, personal agents, advertising, business analytics, risk assessment, and automation. Useful further reading includes Blumauer and Nagy (2020), Abu-Salih (2021), Colon-Hernandez et al. (2021), Ji et al. (2022), and Li et al. (2021).
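The basic idea of a KG as interlinked entity descriptions with multilingual labels can be illustrated with a small sketch. The class below is a hypothetical in-memory toy, not an implementation of any particular KG platform or RDF library; the `ex:` identifiers and the `TinyKG` name are invented for this example.

```python
from collections import defaultdict


class TinyKG:
    """A toy knowledge graph: a set of (subject, predicate, object) triples
    plus per-entity labels keyed by language code. Illustrative only."""

    def __init__(self):
        self.triples = set()
        self.labels = defaultdict(dict)  # entity -> {language code: label}

    def add(self, subject, predicate, obj):
        """Store one statement linking two entities."""
        self.triples.add((subject, predicate, obj))

    def add_label(self, entity, lang, label):
        """Attach a human-readable label in a given language."""
        self.labels[entity][lang] = label

    def objects(self, subject, predicate):
        """All objects linked from `subject` via `predicate`, sorted."""
        return sorted(o for (s, p, o) in self.triples if s == subject and p == predicate)

    def label(self, entity, lang, fallback="en"):
        """Label in `lang`, else in the fallback language, else the raw identifier."""
        names = self.labels.get(entity, {})
        return names.get(lang) or names.get(fallback) or entity


if __name__ == "__main__":
    kg = TinyKG()
    kg.add("ex:Aspirin", "ex:treats", "ex:Headache")
    kg.add_label("ex:Headache", "en", "headache")
    kg.add_label("ex:Headache", "de", "Kopfschmerz")
    kg.add_label("ex:Headache", "es", "dolor de cabeza")
    for target in kg.objects("ex:Aspirin", "ex:treats"):
        print(kg.label(target, "de"), "/", kg.label(target, "ga"))
```

The fallback in `label` shows in miniature why multilingual coverage matters: a language without its own label silently receives the English one.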
The technological leaps in LT and AI in the past few years and the widely recognised importance of data and knowledge resources for their accomplishment have called for new concepts and instruments in the area of data technologies, and naturally also in AI and LT. In Europe, data spaces are a (relatively) new concept and solution to stimulate the data economy by providing secure and trustworthy mechanisms and platforms for data sharing and data trading. The European Commission lists a number of data spaces in its Data Strategy of February 2020,⁴¹ which is strongly interconnected with the EU Data Governance Act.⁴² EU Member States have supported research on data spaces in recent years, for example Gaia-X⁴³ and the International Data Spaces initiative (Germany), which led to the establishment of the International Data Spaces Association (IDSA) and the publication of several standards and recommendations in the field (IDS Information Model or the Reference Architecture Model),⁴⁴ or the Data Market Austria (DMA)⁴⁵ prototype for a public marketplace for data trading. In January 2023, the European Commission launched the Common European Language Data Space, which focuses on the discoverability, sharing and trading of language data and models, covering all EU languages and aiming to support a wide range of LT applications in different modalities, domains and contexts.

³⁸ https://sagroups.ieee.org/7000/
³⁹ https://ec.europa.eu/education/education-in-the-eu/digital-education-action-plan_en
⁴⁰ https://www.worldbank.org/en/programs/data-use-and-literacy-program
⁴¹ https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/european-data-strategy
⁴² https://digital-strategy.ec.europa.eu/en/policies/data-governance-act
⁴³ https://www.data-infrastructure.eu/GAIAX/Navigation/EN/Home/home.html
2.2 Main Gaps

The following observations have been formulated, collected and further analysed together with researchers and practitioners in the field and reflect our joint understanding of the current gaps in the components of this deep dive.

There is untapped potential when it comes to data available in archives as well as old data files. There is a real need for open AI models in LT that are provided to interested parties with open licences. Not only are ready-to-use models required, but the raw data also needs to be made available in order for developers to create their own models. Annotated corpora are often available mainly in English, and it is often the case that they are not available in other languages, let alone all those required for different technologies and applications. The ELG dashboard⁴⁶ offers a visual overview of the current standing of Europe’s languages (and beyond) with respect to available language data, tools and services. Through such availability counts the dashboard approximates the technological readiness of each language (see Chapter 3). There is an urgent need for monolingual, bilingual and multilingual domain-specific corpora. Such data can only rarely be found via available resources, mostly because it simply is not there, but also because of incorrect or missing documentation of data and metadata. Manually annotated data is lacking; although the quality of automated and semi-automated annotations is increasing, manual annotation by human experts in a certain field is still the best means of acquiring high-quality data.

Overall, open LRs are missing. Domain-specific LRs are required to be available for scientific purposes with open licences. If the FAIR principles were systematically applied, this would be a huge benefit where data and metadata are concerned, but they are not really being rolled out properly.
Although Europe has benefited from a strong open data movement for about 15 years now, there is still a gap in the provision of clearly specified licences for data. At the moment, benchmark approaches are not harmonised or standardised, and benchmarks on domain-specific vocabularies and annotated data and corpora are often missing. Metadata provides only very limited data provenance. Overall data quality is weak, and so it often happens that use-cases cannot be realised as specified because labelled data is not available for the use-case at hand. Missing policies around data and metadata management, which should be part of a data governance model, often result in low data and metadata quality. There are increasingly many data silos in place that are neither connected nor interoperable, and there are more and more data infrastructures available that are simply not interoperable either, as the harmonisation of relevant standards in the field is missing. This is a clear problem and gap in the combination of research data (e.g., via EOSC)⁴⁷ and industry data (e.g., via industry data markets) as well as data from public administration or government data catalogues and portals (e.g., the European Open Data Portal).⁴⁸ More and clearer directives and regulations in the field should be developed to overcome these gaps in relation to data, LRs, and KGs. The effect of regulations on data-related topics should be evaluated continuously and regulations and directives adapted for identified gaps and changing environments. For example, GDPR has a strong effect on data collection.

Guidelines and policies are not available for each language in order to achieve DLE in Europe. Data for non-EU languages and beyond are not sufficiently in place, and so services for such languages cannot be developed with sufficient quality for them to be useful. National crowd-sourcing platforms that facilitate data collection for low-resource languages are not available, hampering DLE in Europe.

⁴⁴ https://internationaldataspaces.org/use/reference-architecture/
⁴⁵ https://datamarket.at
⁴⁶ https://live.european-language-grid.eu/catalogue/dashboard
There is a strong need for education that can deliver improved understanding of better data management processes in science, academia, as well as in business and industry. This should lead to better understanding of the value of data, and so improve data management principles and techniques. There is a need to inform educational bodies of the importance of sharing data; for example, if more learner corpora were made available, this would lead to improved computer-assisted language learning and adaptive educational technologies. More senior staff and experts in AI need to work on data-related topics and deliver AI and deep learning mechanisms.

As an overall gap, the level of digitisation differs strongly across Europe. Data catalogues and portals often provide metadata only, with links to the listed data that is provided by the data publishers and data owners themselves, with only a small number also providing the data itself. The resulting issues and gaps relate to 1. the availability of and access to the data itself, as information in catalogues as to whether such data continues to be provided by publishers and owners is insufficient; 2. lack of interoperability in metadata but mainly in the data itself. The metadata often provides interoperability (e.g., by using the same catalogue software CKAN),⁴⁹ and at least in Europe (but also beyond) we are making use of the de facto metadata standard for open data and data portals DCAT-AP (Data Catalogue Vocabulary (DCAT) expanded for Application Profiles);⁵⁰ and 3. a fragmentation of data catalogues and data portals.
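The first of these gaps, catalogues that list metadata and a link but neither host the data nor state a licence, can be made measurable with a simple audit. The sketch below assumes a flattened, hypothetical record layout whose keys borrow the DCAT property names `dct:license`, `dcat:accessURL` and `dcat:downloadURL`; real DCAT-AP records are RDF graphs in which these properties sit on a distribution, so this is a simplification for illustration, and treating a record without a download URL as "link only" is a heuristic of this example.

```python
def audit_catalogue(records):
    """Count records that lack a licence and records that offer only an
    access link (e.g. a landing page) without a direct download URL."""
    report = {"total": 0, "missing_licence": 0, "link_only": 0}
    for rec in records:
        report["total"] += 1
        if not rec.get("dct:license"):
            report["missing_licence"] += 1
        if rec.get("dcat:accessURL") and not rec.get("dcat:downloadURL"):
            report["link_only"] += 1
    return report


if __name__ == "__main__":
    # Invented example records; the URLs use the reserved example.org domain.
    catalogue = [
        {"dct:title": "Parliamentary debates corpus",
         "dct:license": "https://creativecommons.org/licenses/by/4.0/",
         "dcat:accessURL": "https://example.org/debates",
         "dcat:downloadURL": "https://example.org/debates.zip"},
        {"dct:title": "Regional dialect recordings",
         "dcat:accessURL": "https://example.org/dialects"},
    ]
    print(audit_catalogue(catalogue))
```

Such counts say nothing about whether a linked resource is still online; checking that would need periodic link validation, which is left out here.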
47 https://eosc-portal.eu
48 https://data.europa.eu
49 https://ckan.org
50 https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/dcat-application-profile-data-portals-europe
Regarding data spaces and data markets, the TRUSTS project (Trusted Secure Data Sharing Space)51 has carried out a study52 on the definition and analysis of the EU and worldwide data market trends and industrial needs for growth, that includes a section on data market challenges, which includes a good summary of the gaps and challenges in this area (Figure 1). All these gaps and issues can only be addressed by working business models in the area of data sharing and trading in a working and successful data economy. IDSA published a relevant report in May 2021.53
For a KG to become useful for a downstream application there is a need for it to contain a certain amount of application- and domain-specific knowledge. Often, openly available resources are not suitable for a particular task, so to reduce the entry barriers there is a need to be able to generate a suitable ontology or schema for said task and then to populate the schema with instances.
Currently, a KG is mainly developed based on textual and numerical data as an input format, with other formats like video and audio only very rarely taken into account. Working LT in the required languages using mechanisms like speech-to-text could support the creation of KGs.
Finally, there is a gap in the availability of comprehensive KGs. While there are some common-knowledge KGs even freely available (DBpedia, or Yago being just two examples), there is a clear lack of bigger KGs in specific domains and industries, that can act as a kind of foundation model, but also as training input for AI algorithms. Even if such specific graphs were available, there is a clear gap in the availability of multilingual domain-specific KGs that can be used for LT applications.
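The two steps mentioned above, generating a schema for a task and then populating it with instances, can be sketched in a few lines. The following is a toy illustration under stated assumptions: the class TinyKG, the health-domain schema (Disease, Symptom, hasSymptom) and the instances are all hypothetical and do not correspond to any existing KG or library.

```python
from collections import defaultdict

class TinyKG:
    """A toy knowledge graph: a small schema plus instances with multilingual labels."""

    def __init__(self):
        self.classes = set()
        self.properties = {}             # property name -> (domain class, range class)
        self.types = {}                  # instance id -> class name
        self.triples = set()             # (subject, property, object)
        self.labels = defaultdict(dict)  # instance id -> {language code: label}

    # Step 1: define the schema.
    def add_class(self, name):
        self.classes.add(name)

    def add_property(self, name, domain, range_):
        if domain not in self.classes or range_ not in self.classes:
            raise ValueError("domain and range must be declared classes")
        self.properties[name] = (domain, range_)

    # Step 2: populate the schema with instances and facts.
    def add_instance(self, node_id, cls, labels):
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.types[node_id] = cls
        self.labels[node_id].update(labels)

    def add_fact(self, subject, prop, obj):
        domain, range_ = self.properties[prop]
        if self.types.get(subject) != domain or self.types.get(obj) != range_:
            raise ValueError("fact violates the schema")
        self.triples.add((subject, prop, obj))

    def label(self, node_id, lang, fallback="en"):
        """Return the label in the requested language, else the fallback language."""
        labels = self.labels[node_id]
        return labels.get(lang, labels.get(fallback))

kg = TinyKG()
kg.add_class("Disease")
kg.add_class("Symptom")
kg.add_property("hasSymptom", "Disease", "Symptom")

kg.add_instance("influenza", "Disease",
                {"en": "influenza", "de": "Grippe", "es": "gripe"})
kg.add_instance("fever", "Symptom", {"en": "fever", "de": "Fieber"})
kg.add_fact("influenza", "hasSymptom", "fever")

print(kg.label("influenza", "de"))  # label available in German
print(kg.label("fever", "es"))      # no Spanish label, falls back to English
```

The fallback in the last call illustrates the multilingual gap in miniature: where a label is missing for a language, an application can only fall back to another language or show nothing.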
51 https://www.trusts-data.eu
52 https://www.trusts-data.eu/wp-content/uploads/2021/07/D2.1-Definition-and-analysis-of-the-EU-and-worldwide-data-market-trends-....pdf
53 https://internationaldataspaces.org/the-ecosystem-effect-of-business-models-driven-by-data-sovereignty/
Regarding the gaps in semantic AI, we see that the fields of statistical and symbolic AI are still not fully combined; the two fields often exist in isolation beside one another and so cannot provide their full potential to the solution of a problem. This is largely the case overall in the machine learning and semantic web communities, but it is also the case in areas like LT, or domains like health or energy.
Finally, the following gaps regarding innovative data and metadata management tools have been identified: 1. The need for user-friendly, flexible, open-source corpus annotation tools that can easily be used by linguists as well as domain experts in-house and with fair costs and conditions. 2. The need for user-friendly visualisation tools in order to be able to understand the content of datasets at hand quickly and properly without the need for significant efforts in data integration and data wrangling. 3. Better detection techniques for harmful content are required to avoid bias, and identify and filter toxic content, or fake news and fake data, etc. In a time where AI and ML are being used more and more, even small portions of toxic data and content can influence an algorithm during training and so need to be identified and filtered out. 4. Better techniques for corpus filtering are required regarding domain filtering, noise cleaning (see above, also) as well as the filtering and removal of bias. 5. A clear lack of preservation technologies and tools has been identified; these are required to ensure that lesser-spoken languages can be archived for the long term (e.g., those that are available on tape only) and made available as data that is easy to use, including the provision of proper data documentation in the form of rich metadata.
6. Intelligent data analytics of small content nuggets is needed, as, at the moment, often only huge corpora are being analysed by the available technologies and tools, but there is an increasing trend towards smarter data analytics that can be applied on ever smaller datasets, including for instance to just one paragraph or section, rather than the whole text. 7. Add-on business models are needed as gaps have been identified in the area of business models around data creation and provision, and so the development of tools and technologies is often limited to small experiments in funded research projects. Having clearly defined and successfully working business models in place would improve the industrial development in the field and stimulate the availability of the required innovative data and metadata management tools.
3 The Future of the Area
3.1 Contribution to Digital Language Equality
The major issue is the lack of available relevant and required data and LRs, as well as KGs in all European languages, official or not. At the moment sufficient data is available mainly in English, and to a lesser degree in German, French and Spanish. However, even in these languages data gaps exist that hinder LT development.
Looking further into this area it is easy to identify an even greater gap in the availability of data regarding dialects of European languages as well as regional languages. Dialects and regional languages exist, they are actively used and form part of a country’s or a region’s identity and culture. Language diversity is so strong that sometimes in a small region several different languages or dialects are used. In addition, there is very little data available for sign languages, which is a clear issue for the inclusion of those with disabilities, as well as there being little to no respective data available for non-EU languages that are widely spoken in the EU, like Turkish or Arabic, for example.
DLE is a fundamental aspect of a functioning European society, in which diversity and inclusion are valued in every single EU Member State and across Europe with its colourful regional cultures and identities. The lack of DLE in Europe carries the risk of dividing society as it fosters misunderstanding, and may even support the promotion of toxic content, fake news, or lead to wrong interpretations of regional policies and regulations or the misinterpretation of research results in times of crisis.
We have identified the following three approaches: 1. Digital Language Equality Strategy: more funding and support by regional and national governments and the European Union to support the development of DLE in Europe for years to come for EU languages as well as regional languages and dialects (and for non-EU languages, too), aided by a data and LR matrix that shows which data and LRs should be available when and for which languages (see Chapter 45); 2. Crowdsourcing and citizen science: the creation of the required data needs the support of native speakers as well as linguists with the respective language experience and skills, and the support by data experts providing guidance with regard to the creation of useful and high-quality data; and 3. Data-related business models: these are required in the field to foster data creation and acquisition for minor languages and dialects by industry and the private sector.
In addition, and to allow DLE for certain domains like health, for instance, there is a strong need for the continuous development and maintenance of monolingual, bilingual and multilingual domain-specific vocabularies and KGs, to enable the multilingual and cross-lingual development of innovative domain-specific applications that provide value to the economy and society as a whole.
3.2 Breakthroughs Needed
Based on the identified components, the state-of-the-art analysis and the gap analysis, the areas of data infrastructures, data spaces and data markets are major issues where future technology visions and breakthroughs are needed in the field, as this area provides the overall umbrella for the availability and accessibility of the required data for powerful LTs that can help bring about DLE in Europe.
The main breakthroughs needed in terms of data infrastructures, data spaces and data markets include: 1. designing working architectures and ensuring effective workflows for compliant data provision and consumption; 2. developing specifications and building blocks to enable data and metadata interoperability; 3. developing and deploying technologies that embed data sovereignty and build trust among data providers and consumers; 4. developing specifications and building blocks that enable data value creation including data publishing and discovery mechanisms as well as accounting and billing; and 5. specifying and developing data governance models with clear roles, rules and policies for all stakeholders.
A recent study by the European Commission (Cattaneo et al. 2020) examines trends in data markets. The study measures the value of a data market, i.e., “the marketplace where digital data is exchanged as products or services as a result of the elaboration of raw data”, and the value of the data economy, i.e., “[by] measur[ing] the overall impacts of the Data Market on the economy as a whole”. The study compares the value of the data market and data economy from 2018 to 2019. It also projects the facts and figures for the year 2025 based on three scenarios.
Growth in data markets and the data economy brings with it several implications. According to the European Commission,54 the total number of data professionals (i.e., those who deal with data endeavours as their primary task) will continue to rise consistently. Many opportunities will open in data-related jobs, and more knowledge workers are needed.
Despite these positive trends, there is still a potential lack of supply of data professionals in high-growth scenarios. Companies taking a role as data providers and data buyers will also grow in number and market share.
KGs and semantic AI combined and provided as part of a data infrastructure can bring clear value, and should be part of any data infrastructure in the future. Gartner Research states that from 2021 onwards, graphs will form the foundation of modern data analytics with the capabilities to enhance and improve user collaboration, ML models and explainable AI. Although graph technologies are not new to data analytics, there has been a shift in thinking about them as organisations identify an increasing number of use-cases where they could play an important role. In fact, as many as 50% of Gartner client inquiries around the topic of AI involve a discussion around the use of graph technology.55 In 2020, it was estimated that by 2023, graph technologies would facilitate rapid contextualisation for decision making in 30% of organisations worldwide.56
The main breakthroughs needed in the area of KGs and semantic AI include: 1. developing KG principles and technology from the current status of a “rising star” to a natural part of any data infrastructure and any data-related organisational infrastructure; 2. fostering the development of multilingual KGs under fair conditions and costs for use and re-use; 3. fostering the development of domain-specific KGs under fair conditions and costs for use and re-use; 4. KGs need a higher level of automation in their creation and maintenance, and more consideration needs to be given to the format of data beside textual data, such as audio and video; 5. a high level of deep and continuous learning will enable KGs to maintain themselves over time regarding new domain-specific and language-specific terminology. This means that new terms will be identified, analysed and inserted into the graph in the correct position, as well as being applied to the applications used by the KG; 6.
bringing together the two main AI communities of statistical AI and symbolic AI to work together on future semantic AI approaches; and 7. developing the areas of responsible AI and explainable AI by making use of semantic AI in multilingual environments to provide AI-based applications that deliver the correct results with benefits for research, industry and society.
54 https://op.europa.eu/s/vbSA
55 https://www.gartner.com/smarterwithgartner/gartner-top-10-data-and-analytics-trends-for-20
56 https://info.tigergraph.com/gartner-graph-steps-onto-the-main-stage-of-data-and-analytics
The global enterprise metadata management market is forecast to grow at a rate of 20.3% from USD 7.45 Billion in 2019 to USD 27.24 Billion by 2027. Enterprise metadata management (EMM) provides the control and clarity needed to manage the change that often accompanies a complex enterprise data ecosystem. EMM and the various pieces of management software created for it provide administration for data integration, and allow users to inspect the metadata’s links and roles.57
The main breakthroughs needed in the area of innovative data and metadata management tools include: 1. the development of tools that can be easily integrated with data infrastructures, data spaces and data markets; 2. the development of technologies and tools that can identify and remove bias, toxic content and fake data from data and content; 3. the provision of tools in the field of semantic AI, thus combining statistical and symbolic AI, that provide out-of-the-box responsible and explainable AI capability; 4. the development of a landscape where models and algorithms based on semantic AI can be created, ultimately with smaller amounts of data; 5. tools for data and metadata management that work not only in major languages like English but which can be easily adapted with low cost to smaller languages or dialects; 6. tools that allow deeper modelling of cultural aspects, gender aspects, etc. to avoid bias in data; 7. tools that are able to combine input from various types of data like text, images, audio and video but also gestures; and 8.
tools along the whole data life cycle for all languages and all relevant use-cases are required to ensure powerful LT which can help enable DLE.
3.3 Technology Visions and Development Goals
We identified several technology visions and development goals for the area of data, LRs and KGs regarding DLE as a result of a comprehensive list of use-cases in the field, highlighting the related requirements. The majority of use-cases for LT involve human-to-machine and human-to-human communication and interaction via digital tools. To a large extent, these can be categorised using the concepts of conversational AI and platforms and insight engines that are covered by the other deep dive chapters in this book.
In summary, the following excerpts represent identified data and technology development goals:
• LRs (speech, text) for official EU languages as well as for other European and non-European languages, for languages of minorities and dialects;
• pre-trained and fine-tuned language models for general and vertical domains for at least all EU-24 languages;
• speech models addressing at least the EU-24 languages;
• NLP pipelines of tokenisers, taggers, parsers etc., which require labelled linguistic datasets (e.g., treebanks) and evaluation sets;
• interfaces and content should be available in all languages via the web, i.e., the information available on a specific object, person or event provides the same amount of information in all languages;
• knowledge and content available in the form of audio files should be available in all languages so that it can be easily consumed;
• appropriate data required to train and develop monolingual, bilingual and multilingual models that cover the type of knowledge (domain-specific) and the type of language required for MT, (multi)document summarisation and speech-to-text technologies;
• efficient APIs required to integrate organisation-specific data and systems with social media platforms;
• pseudonymised or anonymised data for all EU languages, as well as domain-specific annotated corpora;
• data and models which address gender bias or minority bias etc.;
• data
and technologies for identifying and ideally also removing toxic content, hate speech, fake news;
• comprehensive multilingual ontologies in vertical domains;
• KGs for common concepts, event descriptions for daily activities, and patterns for frequent questions;
• text-to-speech resources for common vocabularies and terminologies, as well as computer vision technologies for sign languages;
• data and technologies for modelling culture-specific phenomena;
• better designed crowdsourcing platforms to enable more citizen science efforts towards building speech and language systems.
57 https://www.reportsanddata.com/report-detail/enterprise-metadata-management-market
Some of these points are already available and in use in different data infrastructures. Beyond investing in the design and development of the missing parts, it is the integrated combination of all of them that could, from a technology perspective, be the main breakthrough and technology vision for the future management of metadata and data, as well as of LRs, that can act as the backbone for powerful LTs to realise DLE in Europe. Existing LT data infrastructure providers, such as ELG, ELRC-SHARE, CLARIN, META-SHARE, and ELRA as well as industrial and national initiatives can provide the seeds for a kind of federated data infrastructure, i.e., a data space that enables seamless and trusted interactions between data providers and data consumers, and enables cross-fertilisation by means of interoperability, aided among others by semantic KG technologies.
Interoperability challenges can be broadly classified in four different layers:
• technical interoperability, enabling technical components (i.e., data space connectors) to communicate with each other;
• semantic interoperability, ensuring that attributes and policies have the same meaning;
• organisational interoperability, ensuring that the different (business) procedures and operations are compatible;
• legal interoperability, ensuring that contractual statements are legally equivalent.
Different federation architectures can be designed for building data spaces ranging from architectures with some central components (e.g., a data space catalogue) to fully decentralised ones. Whatever the architectural choice, data spaces will promote data sovereignty, enhance data exchange and trading, and enable the creation of value from data. The Language Data Space, coupled with the data space-inherent data integration capabilities, and developments in machine learning, deep learning, transfer learning and federated learning is expected to help fill in the gaps.
Of key importance in the development of language data spaces is the compliance of data and operations with the rules, regulations and values of the European Union. LTs themselves are expected to play a crucial role in ensuring such compliance. Privacy preservation technologies, such as data anonymisation technologies and ethics compliance (through bias detection technologies, say), will be important tools in the hands of data providers, data consumers and data space operators. By its nature, the Language Data Space is conceived of as one of the horizontal data spaces in the data space ecosystem designed by the European Commission. In addition to the intra-data space interoperability, the Language Data Space will have to ensure interoperability with vertical data spaces (e.g., health, manufacturing, skills, mobility, etc.), enabling cross-fertilisation, data discovery, exchange and trading at the inter-data space level.
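Of the four layers listed above, semantic interoperability lends itself most readily to a small illustration. The sketch below assumes two hypothetical catalogues that describe the same resource with different attribute names and maps both onto shared Dublin Core Terms properties; the function harmonise, the records and the mappings are invented for this example and do not describe the behaviour of any existing data space connector.

```python
def harmonise(record, mapping):
    """Translate a catalogue-specific record into shared vocabulary terms.

    Returns the translated record and the sorted list of attributes for
    which no mapping exists (these remain non-interoperable).
    """
    shared, unmapped = {}, []
    for key, value in record.items():
        if key in mapping:
            shared[mapping[key]] = value
        else:
            unmapped.append(key)
    return shared, sorted(unmapped)

# Two hypothetical catalogue entries for the same language resource.
record_a = {"name": "Corpus X", "lang": "mt", "licence": "CC-BY-4.0"}
record_b = {"title": "Corpus X", "language": "mt",
            "rights": "CC-BY-4.0", "internal_id": 17}

# Hypothetical mappings from local attribute names to shared terms.
mapping_a = {"name": "dct:title", "lang": "dct:language",
             "licence": "dct:license"}
mapping_b = {"title": "dct:title", "language": "dct:language",
             "rights": "dct:license"}

shared_a, unmapped_a = harmonise(record_a, mapping_a)
shared_b, unmapped_b = harmonise(record_b, mapping_b)

print(shared_a == shared_b)  # both entries agree once expressed in shared terms
print(unmapped_b)            # attributes that still lack a shared meaning
```

Agreeing on attribute names covers only the semantic layer; the technical, organisational and legal layers need their own mechanisms.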
Zooming out of the data spaces discourse and moving to technology visions regarding data access and sharing in general, one of the top-10 data and analytics technology trends identified by Gartner Research is the notion of a Semantic Data Fabric.58 Although the notion was already identified in 2019, they predicted that the first real-world implementations would not be available before 2023. According to Gartner Research,59 a data fabric enables frictionless access and sharing of data in a distributed data environment. It enables a single consistent data management framework, which allows seamless data access and processing by design across otherwise siloed storage. In the coming years, bespoke data fabric designs will be deployed primarily as a static infrastructure, forcing organisations into a new wave of costs to completely redesign their infrastructures for more dynamic data-mesh approaches. A data fabric must have the ability to collect and analyse all forms of metadata, and analyse and convert passive metadata to active metadata. It must have the ability to create a KG that can operationalise the data fabric design, and enable users to enrich data models with semantics. Extreme levels of distribution, scale and diversity of data assets add complexity to data integration rendering necessary a strong data integration backbone to enable versatile data sharing.
58 https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo
59 https://www.gartner.com/en/documents/3978267/data-fabrics-add-augmented-intelligence-to-modernize-you
3.4 Towards Deep Natural Language Understanding
Several areas of this deep dive on data, LRs, and KGs have already provided an overview of the state-of-the-art, a gap analysis and an outlook towards deep NLU.
The way to help achieve deep NLU is once again by enabling the previously listed components for data, i.e., availability and accessibility of data and metadata; quality of data; interoperability; licences and data-related regulations; data and ethics; and data literacy. Related to these components, where data and metadata are concerned, data infrastructures, data spaces and data markets, integrating KGs, semantic AI and innovative data and metadata management tools need to be built. Furthermore, the following areas are of great importance: the ability to model emotions and culture-specific phenomena to facilitate cross-cultural understanding; the availability of world- and situation-specific knowledge in as many languages as possible; and of course tools that allow the modelling as well as the continuous learning of such attributes need to be built.
Continuous adaptation of LRs in all languages via automated and handcrafted mechanisms is key for deep NLU, to ensure new concepts and terminology are immediately taken into account and provided in monolingual, bilingual, and multilingual formats to ensure that new topics (like the COVID-19 pandemic) can be handled properly, but also so that the impact can be fully understood by a broad population to avoid bias, for example. Issues in digital language inequality will clearly support the division of societies, which needs to be avoided at all cost given the precarious times we live in, and the global nature of the problems we all face.
4 Summary and Conclusions
Data, LRs, and KGs form the basis and backbone for LTs. We identified the following main components: availability and accessibility of data and metadata; quality of data; data interoperability; licensing and data-related regulations; data and ethics; and data literacy.
All of these need to be tackled in the future to allow data collection and provision with fair conditions and costs for all relevant stakeholders to help bring about DLE in Europe. Related to these components, where data and metadata are concerned, we identified the following technology concepts, methodologies and tools, that are currently on the rise and that are also part of our technology vision for 2030: data infrastructures, data spaces and data markets; KGs; semantic AI; and innovative data and metadata management tools.
As an add-on component, we tackled the topic of data-related business models, as we identified the importance of sustainable data-related business models as a prerequisite for a working data economy and ecosystem that stimulate and foster the above-listed data-related components in a well-functioning LT landscape.
Besides technology, interoperability and data-related aspects, there must be a strong focus on applying all these mechanisms and methodologies to the widest range of languages possible, at least to the EU-24 languages but also regional and minority languages and also local dialects, as well as to non-European languages that are widespread across Europe. Without such data and LRs in place, DLE simply cannot be achieved.
To fill the identified gaps in data, LRs and KGs, we recommend a future path for Europe towards comprehensive and interlinked data infrastructures, which provide interoperability out-of-the-box by following harmonised and well-tested standards, regarding 1. (semantic) data interoperability as well as 2. services and 3. innovative data and metadata management tools available in all phases of the data life cycle.
Metadata, data, data-driven tools and services need to be easily integratable into these data infrastructures, without today’s huge efforts in data cleaning and integration, or service and tool integration. This future technology vision of integrated and interoperable data spaces follows the approach of federated architectures interlinking data providers and consumer spaces in a trusted framework.
Existing data platforms and infrastructures as well as newly developed ones should be integrated where appropriate and possible.
In such a federated ecosystem, data regarding a domain or language can easily be identified, used, re-used, and evaluated for specific use-cases. Data-driven services can be delivered to meet an end-user’s requirements. Crowdsourcing and citizen science mechanisms will allow human-machine interaction to foster data acquisition, cleaning and enrichment (e.g., annotation, classification, quality validation and repair, domain-specific model creation, etc.). Raw data can be loaded into available tools to build models for specific use-cases, but also existing algorithms, models or vocabularies will be available for easy loading and re-use to avoid unnecessary energy consumption/computing power to deliver energy-efficient data management.
A high level of importance needs to be placed on privacy protection (related to personal identifiable information, PII, and beyond) and the avoidance of bias (e.g., on gender), and the respective privacy preservation and ethics compliance technologies should be available to all stakeholders.
Data infrastructures require working and sustainable business models that promote data sovereignty, enable data trading, sharing and collaboration. Policies and sustainable data governance models around data creation, data provision and data sharing will be needed. Targeted publicly funded programmes and activities in the area of data literacy are needed from early education onward, to ensure that sufficient human resources in the field are available in the future.
In addition, we need to invest in the collection and development of data and LRs that are relevant for LT to ensure the availability of sufficient data in all EU languages. We make recommendations in three areas: 1. targeted national and European funding along a matrix of relevant resources and languages, combined with 2. more measures in the fields of crowdsourcing and citizen science, and 3. the development of functioning data-related business models, all of which are of critical importance (see Chapter 45).
Europe has a number of difficult problems to solve if DLE is to be achieved, including 1. the specifics of the European language space with EU official languages, a broad range of dialects and regional languages, as well as a high number of non-EU languages in use by a growing number of citizens across the continent, 2. the European societal characteristics with a rich variety and diversity in culture and society, and 3. the overall challenging requirements of the continuous digitisation in a more and more globalised world, and the related critical need for an efficient, working (language) data infrastructure, that provides a rich, easy-to-use and sustainable backbone for European LT. Despite these challenges, there is a huge potential to become a world leader in LT and a role model for DLE if they can be overcome.
The availability of high-quality data, LRs and KGs in as many languages as possible, that are easily accessible with fair conditions and costs in a clearly specified legal environment providing transparent rules and regulations, has clear benefits and brings with it a competitive advantage for all stakeholders. For the European research community to foster innovations in the field, for the European industry to successfully compete in a global market, and for the benefit of European citizens and society, data, LRs, and KGs are crucial if European DLE is to be achieved.
References
Abu-Salih, Bilal (2021). “Domain-Specific Knowledge Graphs: A Survey”. In: Journal of Network and Computer Applications 185, p. 103076.
Auer, Sören, Viktor Kovtun, Manuel Prinz, Anna Kasprzik, Markus Stocker, and Maria Esther Vidal (2018). “Towards a Knowledge Graph for Science”. In: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics. WIMS ’18. Novi Sad, Serbia: Association for Computing Machinery. https://doi.org/10.1145/3227609.3227689.
Blumauer, Andreas and Helmut Nagy (2020). Knowledge Graph Cookbook: Recipes for Knowledge Graphs that work. https://www.poolparty.biz/the-knowledge-graph-cookbook/.
Cattaneo, Gabriella, Giorgio Micheletti, Mike Glennon, Carla La Croce, and Chrysoula Mitta (2020). The European Data Market Monitoring Tool: Key Facts & Figures, First Policy Conclusions, Data Landscape and Quantified Stories: D2.9 Final Study Report. Publications Office. DOI: 10.2759/72084.
Colon-Hernandez, Pedro, Catherine Havasi, Jason Alonso, Matthew Huggins, and Cynthia Breazeal (2021). “Combining Pre-Trained Language Models and Structured Knowledge”. In: arXiv preprint arXiv:2101.12294.
Elliot, Bern, Anthony Mullen, Adrian Lee, and Stephen Emmott (2021). Gartner Research: Hype Cycle for Natural Language Technologies.
Hinrichs, Erhard and Steven Krauwer (2014). “The CLARIN Research Infrastructure: Resources and Tools for eHumanities Scholars”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland: ELRA, pp. 1525–1531. http://www.lrec-conf.org/proceedings/lrec2014/pdf/415_Paper.pdf.
Ji, Shaoxiong, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S. Yu (2022). “A Survey on Knowledge Graphs: Representation, Acquisition, and Applications”. In: IEEE Transactions on Neural Networks and Learning Systems 33.2, pp. 494–514. DOI: 10.1109/TNNLS.2021.3070843.
Kaltenböck, Martin, Artem Revenko, Khalid Choukri, Svetla Boytcheva, Christian Lieske, Teresa Lynn, German Rigau, Maria Heuschkel, Aritz Farwell, Gareth Jones, Itziar Aldabe, Ainara Estarrona, Katrin Marheinecke, Stelios Piperidis, Victoria Arranz, Vincent Vandeghinste, and Claudia Borg (2022). Deliverable D2.16 Technology Deep Dive – Data, Language Resources, Knowledge Graphs. European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/data-knowledge-deep-dive.pdf.
Labropoulou, Penny, Stelios Piperidis, Miltos Deligiannis, Leon Voukoutis, Maria Giagkou, Ondřej Košarko, Jan Hajič, and Georg Rehm (2023). “Interoperable Metadata Bridges to the wider Language Technology Ecosystem”. In: European Language Grid: A Language Technology Platform for Multilingual Europe. Ed. by Georg Rehm. Cognitive Technologies. Cham, Switzerland: Springer, pp. 107–127.
Li, Xinyu, Mengtao Lyu, Zuoxu Wang, Chun-Hsien Chen, and Pai Zheng (2021). “Exploiting Knowledge Graphs in Industrial Products and Services: A Survey of Key Aspects, Challenges, and Future Perspectives”. In: Computers in Industry 129, p. 103449.
Lösch, Andrea, Valerie Mapelli, Stelios Piperidis, Andrejs Vasiljevs, Lilli Smal, Thierry Declerck, Eileen Schnur, Khalid Choukri, and Josef van Genabith (2018). “European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management”. In: Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), pp. 1339–1343.
Noy, Natalya and Deborah L. McGuinness (2001). Ontology Development 101: A Guide to Creating Your First Ontology. Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880. Stanford Knowledge Systems Laboratory. http://www.ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness-abstract.html.
Piperidis, Stelios (2012). “The META-SHARE Language Resources Sharing Infrastructure: Principles, Challenges, Solutions”. In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Istanbul, Turkey: ELRA.
Piperidis, Stelios, Penny Labropoulou, Miltos Deligiannis, and Maria Giagkou (2018). “Managing Public Sector Data for Multilingual Applications Development”. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Ed. by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA. http://www.lrec-conf.org/proceedings/lrec2018/pdf/648.pdf.
Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.
Rehm, Georg, Maria Berger, Ela Elsholz, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Stelios Piperidis, Miltos Deligiannis, Dimitris Galanis, Katerina Gkirtzou, Penny Labropoulou, Kalina Bontcheva, David Jones, Ian Roberts, Jan Hajic, Jana Hamrlová, Lukáš Kačena, Khalid Choukri, Victoria Arranz, Andrejs Vasiljevs, Orians Anvari, Andis Lagzdiņš, Julija Melnika, Gerhard Backfried, Erinç Dikici, Miroslav Janosik, Katja Prinz, Christoph Prinz, Severin Stampler, Dorothea Thomas-Aniola, José Manuel Gómez Pérez, Andres Garcia Silva, Christian Berrío, Ulrich Germann, Steve Renals, and Ondrej Klejch (2020a). “European Language Grid: An Overview”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed.
by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3359–3373. https://www.aclweb.org/anthology/2020.lrec-1.413/.
Rehm, Georg, Dimitrios Galanis, Penny Labropoulou, Stelios Piperidis, Martin Welß, Ricardo Usbeck, Joachim Köhler, Miltos Deligiannis, Katerina Gkirtzou, Johannes Fischer, Christian Chiarcos, Nils Feldhus, Julián Moreno-Schneider, Florian Kintzel, Elena Montiel, Víctor Rodríguez Doncel, John P. McCrae, David Laqua, Irina Patricia Theile, Christian Dittmar, Kalina Bontcheva, Ian Roberts, Andrejs Vasiljevs, and Andis Lagzdiņš (2020b). “Towards an Interoperable Ecosystem of AI and LT Platforms: A Roadmap for the Implementation of Different Levels of Interoperability”. In: Proc. of the 1st Int. Workshop on Language Technology Platforms (IWLTP 2020, co-located with LREC 2020). Ed. by Georg Rehm, Kalina Bontcheva, Khalid Choukri, Jan Hajič, Stelios Piperidis, and Andrejs Vasiljevs. Marseille, France, pp. 96–107. https://www.aclweb.org/anthology/2020.iwltp-1.15.pdf.
Sebastian-Coleman, Laura (2012). Measuring Data Quality for Ongoing Improvement: a Data Quality Assessment Framework. Newnes.
Soylu, Ahmet, Oscar Corcho, Brian Elvesater, Carlos Badenes-Olmedo, Francisco Yedro Martínez, Matej Kovacic, Matej Posinkovic, Ian Makgill, Chris Taggart, Elena Simperl, Till C. Lech, and Dumitru Roman (2020). “Enhancing Public Procurement in the European Union Through Constructing and Exploiting an Integrated Knowledge Graph”. In: The Semantic Web – ISWC 2020 – 19th International Semantic Web Conference, 2020, Proceedings. LNCS. Germany: Springer, pp. 430–446.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Merce Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T.
Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C. ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons (2016). “The FAIR Guiding Principles for Scientific Data Management and Stewardship”. In: Scientific Data 3. DOI: 10.1038/sdata.2016.18. http://www.nature.com/articles/sdata201618.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 44
Strategic Plans and Projects in Language Technology and Artificial Intelligence

Itziar Aldabe, Aritz Farwell, German Rigau, Georg Rehm, and Andy Way

Abstract This chapter on existing strategic plans and projects in Language Technology and Artificial Intelligence is based on an analysis of around 200 documents and is divided into three sections.
The first provides a synopsis of international and European reports on Language Technology. The second constitutes a review of existing European Strategic Research Agendas, initiatives, and national plans related to Language Technology. The third contains a SWOT analysis designed to identify the factors that will need to be addressed to help solve the challenge of digital language inequality in Europe. Among the principal conclusions presented is the contention that our continent requires sophisticated multilingual, cross-lingual and monolingual LT for all European languages: LT for Europe that is made in Europe.1

Itziar Aldabe · Aritz Farwell · German Rigau
University of the Basque Country, Spain, itziar.aldabe@ehu.eus, aritz.farwell@ehu.eus, german.rigau@ehu.eus
Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de
Andy Way
Dublin City University, ADAPT Centre, Ireland, andy.way@adaptcentre.ie

1 This chapter is an abridged version of Aldabe et al. (2022).
2 http://europa.eu/abc/symbols/motto/index_en.htm

© The Author(s) 2023
G. Rehm, A. Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_44

1 Introduction

In varietate concordia (united in diversity) is the official Latin motto of the European Union (EU), adopted in 2000. According to the European Commission, “the motto means that, via the EU, Europeans are united in working together for peace and prosperity, and that the many different cultures, traditions and languages in Europe are a positive asset for the continent” [emphasis added].2 All 24 official EU languages are granted equal status by the EU Charter and the Treaty on the EU. The EU is also home to over 60 regional and minority languages, which have been protected and promoted under the European Charter for Regional or Minority Languages (ECRML) since 1992,3 in addition to migrant languages and various sign languages, spoken by some 50 million people.
The Charter of Fundamental Rights of the EU states under Article 21 that “any discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited” [emphasis added].4

Multilingualism is a cultural cornerstone of Europe and signifies part of what it means to be and to feel European. However, not only do language barriers still hamper cross-lingual communication and the free flow of knowledge and thought across languages, a dilemma for which no common EU policy has been proposed, but many languages themselves are also endangered or on the edge of extinction (even more so from a digital perspective). This is illustrated in the UNESCO Atlas of the World’s Languages in Danger (Moseley 2010),5 where a map of Europe shows threatened languages, including black flags that correspond to already extinct languages.

Without a concerted effort to prevent the further deterioration of Europe’s linguistic ecosystem, this current snapshot is likely to worsen. And while it may well be that no silver bullet exists to remedy the situation, one approach offers a means to provide immediate support and address the issue of linguistic barriers: Language Technology (LT) and language-centric Artificial Intelligence (AI).
Because natural language is at the heart of human intelligence, it is and must be at the heart of our efforts to develop AI technologies.6 By the same token, sophisticated and effective AI-powered tools are impossible without mastery of language.7 This is why language and LT represent the next great frontier in AI.8 Already arguably the hottest field in AI, LT also represents one of its fastest growing application areas.9 In fact, several recent international reports place LT, together with vision and robotics, among the three core application areas within AI. Its rise to prominence is due to the various methods LT has developed over the years to make the information contained in written and spoken language explicit or to generate written and spoken language. For this reason, it has become the nerve centre of the software that processes unstructured information and exploits the vast amount of data contained in text, audio and video files, including those from the web and social media. Despite the inherent difficulties in many of the tasks performed, current LT support allows for many advanced applications which would have been unthinkable only a few years ago. Among these may be counted speech recognition, speech synthesis, text analytics and machine translation (MT), used by hundreds of millions of people on a daily basis.10 It is now common to utilise search engines, recommender systems, virtual assistants, chatbots, text editors, text predictors, MT systems, automatic subtitling, automatic summaries, and inclusive technology, all made possible thanks to LT. The field’s rapid development promises even more encouraging results in the near future and its increasing social relevance has been highlighted in national and regional AI and LT strategies both inside and outside of Europe, as well as in prioritised strategic areas for research, development and innovation (R&D&I).

3 https://en.m.wikipedia.org/wiki/European_Charter_for_Regional_or_Minority_Languages
4 https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:12012P/TXT
5 http://www.unesco.org/languages-atlas/
6 https://hbr.org/2022/04/the-power-of-natural-language-processing
7 https://www.nytimes.com/2022/04/15/magazine/ai-language.html
8 https://www.forbes.com/sites/robtoews/2022/02/13/language-is-the-next-great-frontier-in-ai
9 https://analyticsindiamag.com/is-nlp-innovating-faster-than-other-domains-of-ai/
10 https://www.nimdzi.com/nimdzi-language-technology-atlas-2020/

Fig. 1 Language Technology as a multidisciplinary field

With this in mind, it should not be forgotten that LT is also multidisciplinary in nature, combining knowledge in computer science (and specifically in AI), mathematics, linguistics and psychology, among others. Figure 1 depicts some of the most important disciplines involved in LT. This uniqueness must be weighed in any public or private AI initiative that includes LT, especially given that funding for LT start-ups is booming and only the proper application of LT will allow the enormous volumes of multilingual written and spoken data in sectors as diverse as health, justice, education, or finance to be adequately processed and understood.11

Early-stage funding in 2021 amounted to just over USD 1 billion for companies offering solutions that make significant use of natural language processing (NLP), providing a picture of what funders think is innovative.12 This belief is only reinforced by technology advances such as ChatGPT, whose creator, OpenAI, projects USD 1 billion in revenue by 2024.13 Similarly, reports from analysts and consulting firms forecast enormous growth in the global LT market based on the explosion of applications observed in recent years and the expected exponential growth in unstructured digital data.
For instance, according to an industry report from 2019,14 the global NLP market size is expected to grow from USD 10.2 billion in 2019 to USD 26.4 billion by 2024, at a compound annual growth rate (CAGR) of 21% during the forecast period 2019-2024.15 A report from 2021 predicts that the global NLP market will grow from USD 20.98 billion in 2021 to USD 127.26 billion in 2028, at a CAGR of 29.4% over the forecast period.16 NLP in Europe will witness market growth of 19.7% CAGR and is expected to reach USD 35.1 billion by 2026.17 As a final example, according to GlobeNewswire, the global NLP market is estimated to reach an expected value of USD 341.7 billion by 2030, growing at a CAGR of 27.6% during the forecast period.18 These numbers indicate that the return on investment (ROI) will be massive, so it is imperative that Europe is at the heart of this growth in future.

11 https://www.forbes.com/sites/robtoews/2022/03/27/a-wave-of-billion-dollar-language-ai-startups-is-coming
12 https://towardsdatascience.com/nlp-how-to-spend-a-billion-dollars-e0dcdf82ea9f
13 https://www.reuters.com/business/chatgpt-owner-openai-projects-1-billion-revenue-by-2024-sources-2022-12-15/

The attention paid to AI and LT in the social, political, and economic spheres reflects the significance of the technology for today’s world. This chapter on the existing strategic plans and projects in LT and AI touches on all three of these areas. It is based on an analysis of close to 200 documents (Aldabe et al. 2022) and is divided into three sections. Section 2 provides a synopsis of international and European reports on LT. In addition to trends in innovation, many of these discuss the socioeconomic and political impact of AI and LT from a policy perspective. Section 3 constitutes a review of the existing European Strategic Research Agendas (SRAs), initiatives, and national plans related to LT. A main focus of these is the question of multilingualism and equal technological support for Europe’s languages through the application of LT.
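As an aside on the market forecasts quoted earlier in this Introduction, the growth rates can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The Python sketch below is an illustration only and is not taken from any of the cited reports: the function name `cagr` is hypothetical, and the year spans (five years for 2019-2024, seven years for 2021-2028) are assumptions read off the text.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.21 means 21% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Forecasts quoted in the Introduction, in USD billions.
# The year spans are an assumption about how each report counts its period.
forecasts = {
    "2019 report (2019-2024)": (10.2, 26.4, 5),
    "2021 report (2021-2028)": (20.98, 127.26, 7),
}

for label, (start, end, years) in forecasts.items():
    # Print the growth rate implied by the start and end market sizes.
    print(f"{label}: implied CAGR = {cagr(start, end, years):.1%}")
```

Under these assumptions, the implied rates come out at roughly 21% and 29% per year, in line with the figures the reports state.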
Section 4 contains a SWOT analysis of the strategic documents and projects, which is designed to identify the factors that will need to be addressed to help solve the pressing issue of digital language inequality in Europe.

2 International Reports on Language Technology

AI capabilities are rapidly evolving, and AI has become one of the 21st century’s most transformative technologies.19 The growing interest in AI at a global political, scientific and social level has led several international organisations to draft a number of reports and initiatives in recent years. These often focus on the socioeconomic impact of AI technologies and applications with respect to policy.

14 https://www.businesswire.com/news/home/20191230005197/en/Global-Natural-Language-Processing-NLP-Market-Size
15 https://www.analyticsinsight.net/potentials-of-nlp-techniques-industry-implementation-and-global-market-outline/
16 https://www.analyticsinsight.net/the-global-nlp-market-is-predicted-to-reach-us127-26-billion-by-2028/
17 https://www.analyticsinsight.net/nlp-in-europe-is-expected-to-reach-us35-1-billion-by-2026/
18 https://www.globenewswire.com/en/news-release/2022/09/29/2525379/0/en/Natural-Language-Processing-NLP-Market-Worth-USD-341-7-Billion-with-a-27-6-CAGR-by-2030-Report-by-Market-Research-Future-MRFR.html
19 https://www.holoniq.com/notes/50-national-ai-strategies-the-2020-ai-strategy-landscape/

2.1 Reports from International Organisations

The Organisation for Economic Co-operation and Development (OECD),20 a frequent contributor to this discourse, has helped coordinate dialogue on the subject at international fora (notably the G7, G20, EU and UN), offered practical advice to governments on how to actualise AI policy, and stressed the potential that digital technologies demonstrate in responding to societal challenges.21 Its 2021 report, State of the implementation of the OECD AI Principles: Insights from national AI policies, identifies challenges and best practices for the implementation of the five policy
recommendations to national governments contained in its OECD AI Principles. These are: 1. invest in AI R&D; 2. foster a digital ecosystem for AI; 3. shape an enabling policy environment for AI; 4. build human capacity and prepare for labour market transformation; and 5. foster international co-operation for trustworthy AI. The report comes on the heels of the OECD’s The Digitalisation of Science, Technology and Innovation, which emphasises that cutting-edge NLP techniques are opening new analytical possibilities. Among those listed is the ability to recognise victims of sexual exploitation on the internet based on facial detection and social network analysis (Chui et al. 2018). Advances such as this have caught the attention of researchers and policy makers in various countries, who have begun to experiment with NLP to track emerging research topics and technologies. As the report underscores, policy makers use these results to formulate science and innovation policy initiatives, support investments in R&D&I, and evaluate public programmes.22

20 https://www.oecd.org
21 See, e.g., Artificial Intelligence in Society (https://doi.org/10.1787/eedfee77-en); State of the implementation of the OECD AI Principles: Insights from national AI policies (https://doi.org/10.1787/1cd40c44-e); The Digitalisation of Science, Technology and Innovation (https://doi.org/10.1787/b9e4a2c0-en).
22 To help policy makers, regulators, legislators and others characterise AI systems deployed in specific contexts, the OECD has developed a user-friendly tool to evaluate AI and LT systems from a policy perspective (https://www.oecd.org/publications/oecd-framework-for-the-classification-of-ai-systems-cb6d9eca-en.htm).

Similar policy guidance and assessments appear elsewhere.23 The Inter-American Development Bank24 (IDB), for instance, suggests constructing a shared understanding of AI in order to take better advantage of its opportunities and applications while simultaneously coming to grips with its risks.25 The World Economic Forum,26 which provides a framework for governments that wish to develop national AI strategies, assists those responsible for crafting policy in how to ask pertinent questions, follow best practices, identify and involve stakeholders, and create a set of outcome indicators.27 UNESCO28 extends these considerations to the educational sphere, recommending that governments and other stakeholders, in accordance with their legislation and public policies, respond to education-related opportunities and challenges presented by AI. The Beijing Consensus on Artificial Intelligence and Education, an outcome document issued by UNESCO in 2019, stresses the multidisciplinary nature of AI and urges readers to consider the role of AI tools in teaching and learning, highlighting their effectiveness in aiding students with learning impairments or who study in a language other than their mother tongue.29 In the area of library science, Responsible Operations: Data Science, Machine Learning, and AI in Libraries, a position paper from OCLC,30 notes that structural inequalities are perpetuated by data-driven policies (Padilla 2020) and sets an agenda for tackling the positive and negative impacts of data science, machine learning, and AI on libraries.31

23 See, e.g., the Helsinki Initiative on Multilingualism in Scholarly Communication (https://www.helsinki-initiative.org/en).
24 https://www.iadb.org
25 https://publications.iadb.org/en/artificial-intelligence-for-social-good-in-latin-america-and-the-caribbean-the-regional-landscape-and-12-country-snapshots
26 https://www.weforum.org

Finally, in early 2022, based on the report Facilitating the implementation of the European Charter for Regional or Minority Languages through artificial intelligence, first published in 202032 and updated in 2022,33 the Committee of Experts of the European Charter for Regional or Minority Languages of the Council of Europe (CoE) adopted a statement on the promotion of regional or minority languages through AI.34 The
Committee of Experts encourages states to promote the inclusion of regional or minority languages in research and study on AI with a view to supporting the development of relevant applications, as well as to establish, in cooperation with the users of such languages and the private sector, a structured approach to the use of AI applications in the different fields covered by the Charter.

The attention paid to AI and LT in policy reports reflects the social, political, and economic importance that the technology has garnered in today’s world; the same holds true for organisations that trace trends in innovation. In its report, Technology Trends 2019: Artificial Intelligence,35 the World Intellectual Property Organization36 found that 50% of all AI patents have been published in just the past five years, a striking illustration of how rapidly innovation is advancing in the field. The report, which classifies AI technology trends into techniques, functional applications, and application fields, furthermore points to LT as one of AI’s most significant functional applications, attributing over a quarter of all AI-related patents to NLP and speech processing. The number is unsurprising given the current levels of excitement associated with NLP within AI, where the rising star is turning many heads. A case in point is the State of AI Report for 2021,37 issued by UK AI investors with an eye toward stimulating informed conversation on AI and its implications going forward. The report, which considers research, talent, industry, and politics, discusses the emergence of large language models and notes that the latest generation are unlocking new NLP use-cases. Indeed, the arrival of Transformers as a general purpose architecture for machine learning (ML) has been a revelation, beating the state-of-the-art in domains as disparate as computer vision and protein structure prediction.

27 https://www3.weforum.org/docs/WEF_National_AI_Strategy.pdf
28 https://en.unesco.org
29 https://unesdoc.unesco.org/ark:/48223/pf0000368303 See also UNESCO’s Artificial Intelligence in Education: Challenges and Opportunities for Sustainable Development, a 2019 report which, among other breakthroughs, noted a Chinese AI system that is able to correct student essays as a milestone in LT for education (https://unesdoc.unesco.org/ark:/48223/pf0000366994).
30 https://www.oclc.org/en/about.html
31 https://doi.org/10.25333/xk7z-9g97
32 https://rm.coe.int/cahai-2020-23-final-eng-feasibility-study-/1680a0c6da
33 https://rm.coe.int/min-lang-2022-4-ai-and-ecrml-en/1680a657c5
34 https://rm.coe.int/declaration-ai-en/1680a657ff
35 https://www.wipo.int/publications/en/details.jsp?id=4386
36 https://www.wipo.int

2.2 Reports from the United States

Reports from the US tell an analogous story to their international counterparts. In its 2021 and 2022 AI Index Reports,38 for example, the Institute for Human-Centered AI (HAI) at Stanford University reviews the growth of research papers and conferences over time and by region, tracks AI accuracy on several benchmarks, focuses on trends in jobs and investment, and examines various national AI strategies. The reports also devote space to data and analysis concerning AI with respect to education, diversity, and ethics. Key takeaways include the observation that 65% of the new PhDs in the US chose jobs in industry over academia compared to 44% the previous year, that there is still little data available on the ethical challenges surrounding AI, and that the AI workforce remains predominantly male.
The 2022 report also highlights that while current language models are setting records on technical benchmarks, they are also increasingly reflecting biases from their training data. These findings are accompanied by HAI’s Global Vibrancy Tool,39 which measures performance on various economic, inclusion, and R&D factors across several countries. The tool can create an overall index for the full list of 26 countries and it is of note that none of the top ten is an EU member state. The worrisome nature of the latter data point is compounded in an examination of the global balance and flow of top AI scientists provided by the Paulson Institute’s MacroPolo think tank in its Global AI Talent Tracker report.40 According to this analysis, the US lead in AI is built on attracting international talent, with more than two-thirds of the top-tier AI researchers working in the US having received undergraduate degrees in other countries. Although 18% of the top-tier AI researchers are European, only 10% of them work in Europe.

37 https://www.stateof.ai
38 https://aiindex.stanford.edu/report/
39 https://aiindex.stanford.edu/vibrancy/
40 https://macropolo.org/digital-projects/the-global-ai-talent-tracker/

These final details should sound alarm bells in Europe. As demonstrated in Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence, released by the AI100 project in 2021, remarkable progress has been made in AI over the past five years and we may anticipate that its effects will ripple out for many years to come.
Prepared by a panel of experts from around the globe, the report makes clear that the ability of computers to perform sophisticated language- and image-processing tasks has advanced significantly and that more investment of time and resources is required to meet the challenges posed by AI’s rapidly evolving technologies. On the one hand, this includes greater government involvement in the areas of regulation and digital education. In an AI-enabled world, citizens young and old must be literate in these new digital technologies. On the other, this means addressing fears that AI technologies will contribute to unemployment in some sectors. A Blumberg Capital survey of 1,000 American adults found that about half are concerned that AI threatens their livelihood. Indeed, despite the fact that 72% agreed that AI would help remove tedious tasks and free up time to concentrate on more creative ones, 81% were reluctant to surrender these tasks to an algorithm for fear of being supplanted.41 As the authors of Gathering Strength, Gathering Storms indicate, AI is leaving the laboratory and entering our lives, having a “real-world impact on people, institutions, and culture.”42

This perspective is shared by the National Security Commission on Artificial Intelligence. In addition to raising concerns that the United States risks falling behind China and other countries in the AI race, its recent 750-page report encourages the federal government to step up investment in the area.43 Specifically, the commission calls for a modest down payment of USD 40 billion, along with hundreds of billions more in the coming years to galvanise future breakthroughs and help democratise AI research. Moreover, the report provides policy makers with a guide to ensure the US is prepared to defend against AI threats, promote AI innovation, and make responsible use of AI for national security. It is also worth mentioning that the report lists Natural Language Understanding as one of the six uses for deployed AI today.
This view, which coincides with the general consensus on LT expressed above, is further reinforced by the Future Today Institute44 in its 2021 Tech Trends Report on AI.45 The group not only identifies NLP as an area that is experiencing high interest, investment, and growth, but also forecasts that NLP algorithms will do more in the future, including, for example, aiding in the interpretation of genetic changes in viruses.

41 https://blumbergcapital.com/ai-in-2019/
42 https://ai100.stanford.edu
43 https://www.nscai.gov/2021-final-report
44 https://futuretodayinstitute.com
45 https://2021techtrends.com/AI-Trends

2.3 Reports from the European Union

Reports from the EU paint an equally upbeat picture about present and future expectations regarding science and technology. A recent Eurobarometer survey on European citizens’ knowledge of and attitudes towards science and technology shows that 86% believe their overall influence is positive.46 EU citizens expect a range of technologies currently under development, including AI (61%), to improve their way of life over the next 20 years. The case for AI and LT is further laid out by various European institutions in reports and policy initiatives that highlight their extensive impact on society and what must be done to shepherd this influence. These include, among others, European Artificial Intelligence (AI) leadership, the path for an integrated vision;47 Strategy on AI;48 Ethics Guidelines for Trustworthy AI;49 Liability for AI and other emerging technologies;50 On Artificial Intelligence: A European approach to excellence and trust;51 and Coordinated Plan on AI.52 All agree that AI is an area of strategic importance, a key driver of economic development, and a means to provide solutions to many societal challenges. As such, they concur that the socioeconomic, legal and ethical impact of AI must be carefully measured.
For instance, the Joint Research Centre (JRC) Science for Policy report, The Changing Nature of Work and Skills in the Digital Age,53 observes that employment opportunities related to the development and maintenance of AI technologies and Big Data infrastructures are expected to grow, whereas the jobs most vulnerable to automation appear to be those that require relatively low levels of formal education, do not involve complex social interaction, or demand routine manual tasks. This range is a reminder that digital technologies may not only create or destroy some lines of work, but also fundamentally change what people do on the job and how they do it.

The European Commission’s new Coordinated Plan on AI, which affirms that NLP is one of the most rapidly advancing fields in AI, is designed to address such potential turbulence.54 The 2021 plan, in conjunction with the first-ever legal framework for AI,55 is intended to guarantee the safety and rights of people and businesses, while strengthening AI uptake, investment and innovation across the EU. It is also seen as the EU’s next step in fostering global leadership in trustworthy AI, deemed necessary if European AI is to be globally competitive while respecting European values.
This is of particular concern given that the EC’s 2021 Strategic Foresight Report, The EU’s capacity and freedom to act,56 stresses that the EU’s capabilities in AI, Big Data and Robotics lag behind those of the world’s leaders, the US and China. To strengthen digital sovereignty and European AI, the report encourages stakeholders to promote values via the financing, development and production of next generation tech.

46 https://europa.eu/eurobarometer/surveys/detail/2237
47 https://www.europarl.europa.eu/thinktank/en/document/IPOL_STU(2018)626074
48 https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence#Building-Trust-in-Human-Centric-Artificial-Intelligence
49 https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai
50 https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=63199
51 https://commission.europa.eu/publications/white-paper-artificial-intelligence-european-approach-excellence-and-trust_en
52 https://ec.europa.eu/digital-single-market/en/news/coordinated-plan-artificial-intelligence
53 https://publications.jrc.ec.europa.eu/repository/handle/JRC117505
54 https://digital-strategy.ec.europa.eu/en/library/new-coordinated-plan-artificial-intelligence
55 https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence
56 https://ec.europa.eu/info/strategy/strategic-planning/strategic-foresight/2021-strategic-foresight-report_en

One important area of focus must be high-value data, a key factor in improving performance and building robust AI models. The EC wants to ensure legal clarity in AI-based applications, especially regarding data.
Its proposed regulation on data governance will help by boosting data sharing across sectors and member states, while the General Data Protection Regulation (GDPR) is a major step towards building trust.57 The member states also recently agreed to a negotiating mandate on a proposal for a Data Governance Act (DGA).58 The DGA is part of a wider policy to give the EU a competitive edge in the increasingly data-driven economy. The aim is to promote the availability of data that can be utilised to power applications and advanced solutions in AI, personalised medicine, green mobility, smart manufacturing and numerous other areas. While these regulations support the privacy and rights of European citizens, it should be pointed out that significant barriers to the access and re-use of language resources remain, especially with regard to competition with countries that have adopted the “fair use” doctrine, such as the US, Japan or Korea.

Research infrastructures play a role in this regard, including the Common Language Resources and Technology Infrastructure (CLARIN), an ESFRI Landmark and ERIC which offers access to language resources (LRs) and LTs for researchers in the humanities and social sciences.59 Not every EU Member State is officially affiliated with it, and some participate only as observers (Belgium joined CLARIN in 2021 and Spain will join in 2023). Additionally, because research funding agencies provide unbalanced resources to the different Member States, European languages are not equally supported by CLARIN (de Jong et al. 2020). This problem has received more attention in the EU project European Language Grid (ELG), which started in 2019 and concluded in June 2022. The ELG cloud platform contains more than 14,000 running services and resources for all European languages (Rehm et al. 2021; Rehm 2023).60

Experience with infrastructures such as these has demonstrated that the EU’s approach to data infrastructures must be crafted with Big Data technology and LT in mind.
The ESFRI roadmap includes a “Landscape Analysis” that provides an advanced analysis of the scientific needs and existing research infrastructure gaps as well as directions for strategic investments in the future that would help maintain Europe’s leadership in the global context. According to its findings, research infrastructures in LT are indispensable in breaking new ground because they represent a core aspect of Big Data technology due to the volume and variety of data generated by the accumulation of unstructured text. And as the main task in AI’s communication domain, NLP encompasses applications such as text generation, text mining, text classification, MT and speech recognition. Put differently, LT’s ability to analyse, understand and generate information expressed in natural language is crucial for improving human-computer interaction. This view is confirmed by AI Watch, the EC’s knowledge service responsible for monitoring the development, uptake and impact of AI, in three recent reports, Defining Artificial Intelligence, Artificial Intelligence in public services and AI Watch, road to the adoption of Artificial Intelligence by the public sector.61 By way of example, the latter identified and employed 230 cases of AI usage in public services in order to extract emerging trends in AI, revealing that well over half of the cases are closely related to LT.

57 https://eur-lex.europa.eu/eli/reg/2016/679/oj
58 https://www.consilium.europa.eu/en/press/press-releases/2021/10/01/eu-looks-to-make-data-sharing-easier-council-agrees-position-on-data-governance-act/
59 https://www.clarin.eu
60 https://www.european-language-grid.eu
Relatedly, the EC’s Directorate-General for Communications Networks, Content and Technology (DG CNECT), in collaboration with the Directorate-General for Internal Market, Industry, Entrepreneurship and SMEs (DG GROW), opened a consultation in 2021 that examined use-cases for website translation at small and medium-sized enterprises (SMEs) and surveyed multilingual websites in an effort to analyse language barriers across EU Member States.62 The inquiry identified specific market needs that could be addressed through public solutions, such as eTranslation,63 and by European language service providers. Of the over 1,000 SMEs that responded, 75% expressed interest in participating in the EC’s subsequent pilot programme to make their website automatically multilingual. When the European Language Industry Survey (ELIS)64 – then known as the EUATC survey – was run for the first time in 2013, MT was still primarily seen as a threat and a challenge. Only a few language companies saw it as an opportunity. Today, 65% of language company respondents see the improved quality of neural MT as an opportunity rather than a threat. According to the 2022 survey, 58% of those companies have implemented the technology and an additional 20% are planning to do so. This potential willingness to incorporate LT and AI corresponds with a separate study conducted by Eurostat65 in 2020. It found that 7% of EU enterprises with at least ten employees used AI applications, 2% utilised ML to analyse big data internally, and 1% evaluated big data internally with the help of LT. Moreover, 2% provided a chat service, where a chatbot or virtual agent generated natural language replies to customers.

3 Major Language Technology Initiatives in Europe

First, we take a closer look at European initiatives (Section 3.1) and then examine national and also regional initiatives (Section 3.2).
61 https://knowledge4policy.ec.europa.eu/ai-watch_en; https://publications.jrc.ec.europa.eu/repository/handle/JRC118163; https://publications.jrc.ec.europa.eu/repository/handle/JRC120399; https://joinup.ec.europa.eu/collection/innovative-public-services/news/ai-watch-road-adoption-artificial-intelligence
62 https://digital-strategy.ec.europa.eu/en/library/report-sme-survey-multilingual-websites
63 https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation; https://ec.europa.eu/education/knowledge-centre-interpretation/eu-initiatives-language-technologies_en
64 https://elis-survey.org
65 https://ec.europa.eu/eurostat/

3.1 European Initiatives

The European Parliament recently emphasised that “multilingualism presents one of the greatest assets of cultural diversity in Europe and, at the same time, [is] one of the most significant challenges for the creation of a truly integrated EU” (European Parliament 2018). The belief is reflected in the EU’s promotion of multilingualism, which falls within the scope of a variety of EU policy areas. While many of the multifaceted efforts to support Europe’s languages are bearing fruit, still greater attention must be paid to removing barriers to intercultural and inter-linguistic dialogue as a means to stimulate mutual understanding. One means to achieve this is through language technology. Nonetheless, although official EU languages are granted equal status politically, they are far from equally supported from a technological perspective (see, e.g., Rehm and Uszkoreit 2012; Rehm et al. 2014; Rehm and Hegele 2018; Rehm et al. 2020b, as well as the chapters in Part I of this book).

Several strategic documents have contributed to the European debate on this subject in the past decade, including The FLaReNet Strategic Language Resource Agenda (Soria et al. 2014), META-NET Strategic Research Agenda for Multilingual Europe 2020 (Rehm and Uszkoreit 2013; Rehm et al.
2016), Language Technologies for Multilingual Europe: Towards a Human Language Project (Rehm 2017), and the STOA report, Language Equality in the digital age: Towards a Human Language Project (STOA 2018). The latter helped pave the way for the preparation of the European Parliament’s joint ITRE/CULT resolution, Language equality in the digital age (European Parliament 2018),66 adopted in a plenary meeting in September 2018 with an overwhelming majority of 592 votes in favour, 45 against and 44 abstentions.

Approval of the resolution by such a wide margin demonstrates the importance and relevance of the issue. It includes more than 40 recommendations, structured into four sections: “Improving the institutional framework for language technology policies at EU level”, “Recommendations for EU research policies”, “Education policies to improve the future of language technologies in Europe” and “Language technologies: benefits for both private companies and public bodies”. Among the most salient items are the following (emphases added; some items abbreviated):

• The report “recommends that in order to raise the profile of language technologies in Europe, the Commission should allocate the area of ‘multilingualism and language technology’ to the portfolio of a Commissioner; considers that the Commissioner responsible should be tasked with promoting linguistic diversity and equality at EU level, given the importance of linguistic diversity for the future of Europe;” (item 14)
• “suggests ensuring comprehensive EU-level legal protection for the 60 regional and minority languages, recognition of the collective rights of national and linguistic minorities in the digital world, and mother-tongue teaching for speakers of official and non-official languages of the EU;” (item 15)
• “calls on the Member States to develop comprehensive language-related policies and to allocate resources and use appropriate tools in order to promote and facilitate linguistic diversity and multilingualism in the digital sphere; stresses the shared responsibility of
the EU and the Member States and in developing databases and translation technologies for all EU languages, including languages that are less widely spoken; calls for coordination between research and industry with a common objective of enhancing the digital possibilities for language translation and with open access to the data required for technological advancement;” (item 17)
• “calls on the Commission to establish a large-scale, long-term coordinated funding programme for research, development and innovation in the field of language technologies, at European, national and regional levels, tailored specifically to Europe’s needs and demands; emphasises that the programme should seek to tackle deep natural language understanding and increase efficiency by sharing knowledge, infrastructures and resources, with a view to developing innovative technologies and services, in order to achieve the next scientific breakthrough in this area and help to reduce the technology gap between European languages; stresses that this should be done with the participation of research centres, academic, enterprises […] and other relevant stakeholders;” (item 25)
• “believes that […] European education policies should be aimed at retaining talent in Europe, should analyse the current educational needs related to language technology […] and, based on this, provide guidelines for the implementation of cohesive joint action at European level […] including the language-centric artificial intelligence industry;” (item 34)
• “points to the need to promote the ever-greater participation of women in the field of European studies on language technologies, as a decisive factor in the development of research and innovation” (item 36)

66 https://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.html

To these recommendations may be added the remarks made by EC Commissioner Corina Crețu in her closing statement at the hearing on the resolution:

Ensuring appropriate technological support for all European languages will […] create jobs, growth and
opportunities in the DSM [(Digital Single Market)]. It will enhance the quality of public services, and reinforce a stronger sense of unity and belonging throughout Europe. […] [U]nder the next Multiannual Financial Framework (MFF), we will need to reinforce funding, research and education actions. […] [O]vercoming language barriers in the digital environment is essential for an inclusive society, a vibrant DSM and for unity in diversity.

Crețu’s statement is in line with previous public appeals voiced in 2016 by former EC Vice President Andrus Ansip and in 2017 by Director General Roberto Viola (DG CNECT) for the need to strengthen multilingualism through technologies.67

67 See How multilingual is Europe’s Digital Single Market? (https://ec.europa.eu/commission/commissioners/2014-2019/ansip/blog/how-multilingual-europes-digital-single-market_en); Multilingualism in the Digital Age: a barrier or an opportunity (https://ec.europa.eu/digital-single-market/en/blog/multilingualism-digital-age-barrier-or-opportunity).

More recently, the EP’s CULT Committee adopted a resolution on AI in the cultural, creative and educational sector in which multilingual and linguistic diversity is also taken into account.68 Regarding the latter, the resolution calls for: 1. AI technologies to be regulated and trained in order to ensure non-discrimination, gender equality, pluralism, as well as cultural and linguistic diversity; 2. specific indicators to measure diversity in order to promote European ventures and prevent algorithm-based recommendations that negatively affect the EU’s cultural and linguistic diversity; and 3. an ethical framework for the use of AI technologies in EU media that guarantees access to culturally and linguistically diverse content. Such a framework would also address the misuse of AI to disseminate fake news and disinformation.69 The resolution goes hand in hand with a study commissioned by the EC that explores the possibilities of applying AI technologies in ten domains that also belong to the cultural, creative and educational sector.
The study aims to inspire creative entrepreneurs as well as policy-makers with concrete use cases and recommendations for the application of AI,70 focusing partly on language-centric AI (NLP, NLU, speech technologies). The resolution also reflects the conclusions of the Education, Youth, Culture and Sport Council held on 4-5 April 2022, which called for the development of an ambitious digital policy for language technologies, translation, and lifelong language learning and teaching. This objective fits with the EU’s desire to take advantage of new technologies to foster multilingualism, which it hopes will facilitate access to culture and nurture cultural exchange.71

A key commonality in these documents and initiatives is the idea that LT must be made in Europe for Europe. This approach will not only strengthen Europe’s place at the pole position of research excellence, but also contribute to future European cross-border and cross-language communication, economic growth and social stability. The past few years have witnessed a flurry of white papers and SRAs offering roadmaps and recommendations for how best to attain the goal. In 2019, the European Language Resource Coordination (ELRC) white paper, Sustainable Language Data Sharing to Support Language Equality in Multilingual Europe. Why Language Data Matters, underscored that the main challenge is a lack of appreciation for the value of language data.72 To help overcome this perception, the group issued several recommendations aimed at the European and national policy level, including:

• Updating the Open Data Directive (2019/1024/EU) so that it references language data as a high-value data category.73
• Conducting a study on language data to identify and quantify the value of language data for citizens, public administrations and businesses.
• Updating national policies (e.g., Open Data policies, digital agenda or strategies for AI) to explicitly support the sharing of language data and LT.
• Including obligatory (language) data management plans in all relevant national funding policies and calls for proposals if not yet included.
• Conducting national surveys to assess translation practices in public administrations at all levels.

68 https://www.europarl.europa.eu/news/en/press-room/20210311IPR99709/ai-technologies-must-prevent-discrimination-and-protect-diversity; https://oeil.secure.europarl.europa.eu/oeil/popups/summary.do?id=1663438&t=e&l=en
69 https://op.europa.eu/en/publication-detail/-/publication/b8722bec-81be-11e9-9f05-01aa75ed71a1
70 https://digital-strategy.ec.europa.eu/en/library/study-opportunities-and-challenges-artificial-intelligence-ai-technologies-cultural-and-creative
71 https://www.consilium.europa.eu/en/meetings/eycs/2022/04/04-05/
72 https://lr-coordination.eu/sites/default/files/Documents/ELRCWhitePaper.pdf
73 https://digital-strategy.ec.europa.eu/en/policies/legislation-open-data

These steps will contribute to the development of an inclusive European digital society, a task for which European LT is essential. However, still others are required. The Report on the Joint Stakeholder Consultation on Research and Innovation in Web Accessibility and Language Technologies, for instance, highlights that greater work must be done to develop systems capable of adapting and personalising digital content according to individual needs, particularly in terms of accessibility and language.74 Research into sign languages represents one avenue that merits greater attention, given that sign languages are increasingly becoming recognised as official national languages. Another relevant avenue is the accessibility of information in multimodal contexts with respect to formatting and the understanding of content.
Fortunately, it is evident that the EU is not blind to LT’s crucial role in building Europe’s digital society and has already begun to dedicate funding and launch initiatives to advance LT and AI. Research, industry, and the public sector have benefitted from these actions. Two prominent examples include the Horizon 2020 Programme and the Connecting Europe Facility (CEF).75 LT was embedded in the former within research and innovation in the field of information technologies, content technologies, multilingual internet and AI. Through the latter, MT tools (eTranslation) and tools for the management of thesauri and glossaries have been developed (VocBench).76

There is, however, much left to be done. The Final study report on CEF Automated Translation value proposition in the context of the European LT market/ecosystem provides an analysis of the EU’s LT market (including Norway and Iceland) and the adoption of LT by public administrations, both at the EU and national levels.77 The report underscores that EU industry is fragmented and that many small players struggle to compete with the global giants that dominate the market. It further notes that European businesses and the public sector have become dependent on these non-European global companies, which have massive amounts of data at their disposal due to both copyright disparities between the EU (explicit permission required by European entities) and the US (fair use copyright exception), as well as intensive use of their popular systems.

74 https://ec.europa.eu/digital-single-market/en/news/report-joint-stakeholder-consultation-research-and-innovation-web-accessibility-and-language-0.
See also the New European Media’s SRIA: https://nem-initiative.org/wp-content/uploads/2020/06/nem-strategic-research-and-innovation-agenda-2020.pdf?x98588
75 https://ec.europa.eu/programmes/horizon2020/en/h2020-section/information-and-communication-technologies; https://ec.europa.eu/digital-single-market/en/connecting-europe-facility; https://ec.europa.eu/digital-single-market/en/language-technologies
76 https://ec.europa.eu/isa2/solutions/vocbench3_en
77 https://op.europa.eu/en/publication-detail/-/publication/8494e56d-ef0b-11e9-a32c-01aa75ed71a1/language-en/format-PDF/source-106906783

Nonetheless, the dependency on American or Chinese systems and the torrent of data flowing out of Europe mask areas in which European initiatives may make real the ideal of LT made in Europe for Europe. Several large international tech companies, by way of example, provide MT services free of charge. EU industry, by contrast, is experienced in navigating through Europe’s many languages and European MT developers have successfully deployed services for the public sector through the support of EU-funded programmes. LT made for Europe means harnessing this know-how to support MT for all its languages and create domain-specific and application-specific MT while simultaneously being attentive to security and privacy issues. Moreover, as stated in My Europe. My language: With language technologies made in the EU,78 LT offers opportunities to reduce language barriers across Europe and in the DSM at the intersection of Big Data, AI and HPC.
Indeed, the European High Performance Computing Joint Undertaking79 (EuroHPC JU), a joint initiative between the EU, European countries and private partners, is developing a world-class supercomputing ecosystem in Europe.80 The Language Data Space EU project, a platform and marketplace for the collection, creation, sharing and re-use of multilingual and multimodal language data, was launched in January 2023.81

The EC has also established public-private partnerships (PPPs) in the area of AI.82 As detailed by Curry et al. (2021), the Big Data Value PPP, created by the EC and the BDVA in 2014, represented a substantial collective effort on the part of the European data community to formulate a set of technical research priorities for Big Data. According to the report, Europe’s multilingualism presents a particular challenge when it comes to data:

Large amounts of data are being made available in a variety of formats ranging from unstructured to semi-structured to structured formats […] A great deal of this data is created or converted and further processed as text. Algorithms or machines are not able to process the data sources due to the lack of explicit semantics. In Europe, text-based data resources occur in many different languages, since customers and citizens create content in their local language. This multilingualism of data sources means that it is often impossible to align them using existing tools because they are generally available only in the English language.
Thus, the seamless aligning of data sources for data analysis or business intelligence applications is hindered by the lack of language support and gaps in the availability of appropriate resources.83

78 https://digital-strategy.ec.europa.eu/en/library/my-europe-my-language-language-technologies-made-eu-brochure
79 https://eurohpc-ju.europa.eu
80 https://digital-strategy.ec.europa.eu/en/activities/work-programmes-digital
81 https://digital-strategy.ec.europa.eu/en/funding/language-data-space-call-tenders
82 https://adr-association.eu
83 https://elements-of-big-data-value.eu/research-priorities-for-big-data-value

The Big Data Value PPP’s successor, the AI, Data and Robotics Partnership (formed in 2020 along with BDVA,84 euRobotics,85 ELLIS,86 CLAIRE,87 and EurAI88) expanded on this issue and zeroed in on NLP’s importance in its Strategic Research, Innovation and Deployment Agenda:89 “Natural Language Processing has particular resonance within Europe’s multi-lingual landscape and offers the potential to harmonise human interaction.” Unfortunately, although the PPP includes LT experts, research groups and companies via some of the groups involved, currently no European LT association or network is represented in the PPP.
The initiative, however, complements the Coordinated Plan on Artificial Intelligence (CPAI) proposed by the European Commission for the period 2021-2027. The plan, which considers AI an area of strategic importance and aims to propel Europe to the forefront in terms of developing and exploiting AI technologies, calls for the EU to provide a minimum one billion euro annual investment in Horizon Europe and Digital Europe, although the objective is to reach twenty billion euros a year between public and private investments.90 The focus is on four key areas: increasing investment in AI; the availability of data; the promotion of talent; and ensuring security, ethics and trust in AI. Success in these domains leans on the belief that member states must develop and coordinate their own national AI strategies, of which an analysis and comparison is provided in the report AI Watch: National strategies on Artificial Intelligence: A European perspective in 2019.91

3.2 National and Regional Initiatives

The perspective that the EU Member States should be responsible for their individual AI strategies stems partly from the observation that each country or region is best placed to address its own particular needs. The response by European countries to the CPAI has been largely positive and the number of states with an AI strategy (29 out of 30; only Croatia has no official strategy as of yet) demonstrates its success. Moreover, it is in the national plans that currently exist where many of the initiatives concerning LT and language-centric AI reside, although this is not to say that dedicated LT programmes are widespread in Europe. And in comparison to non-EU national AI initiatives, Europe’s member states lag behind when LT is taken into account.
Since Canada published the world’s first national AI strategy in 2017, more than 30 other countries and regions have published similar documents as of December 2020.92 Several non-EU nations merit brief consideration here due to the explicit inclusion of NLP in their plans. China’s AI strategy, one of the most comprehensive in the world, singles out NLU technology as a decisive area to promote university AI curricula and in its pursuit of AI talent (Zhang et al. 2021). The UK, which emphasises a strong partnership between business, academia, and government, created a pilot programme for under-18-year-olds to encourage careers in the AI sector, explicitly mentioning NLP. India’s approach to AI considers the multilingual reality of the country a means to achieve technological leadership in AI and cites the development of an advanced NLP infrastructure for its languages as a stepping stone in that direction.93 Finally, the US emphasises the crucial role LT plays in AI and NLU appears as one of the six “Uses for Deployed AI Today” in the National Security Commission on Artificial Intelligence’s Final Report, published in 2021.94

84 https://www.bdva.eu
85 https://www.eu-robotics.net
86 https://ellis.eu
87 https://claire-ai.org
88 https://eurai.org
89 https://adr-association.eu/wp-content/uploads/2020/09/AI-Data-Robotics-Partnership-SRIDA-V3.0-1.pdf
90 https://knowledge4policy.ec.europa.eu/ai-watch/coordinated-action-plan-ai_en
91 https://ec.europa.eu/jrc/en/publication/ai-watch-national-strategies-artificial-intelligence-european-perspective-2019

In Europe, only a handful of dedicated national programmes funded projects related to LT before 2018.95 Instead, financial support for the development of LT was generally provided through generic R&D&I calls in most member states. The Spanish case is one of those notable exceptions.
The Spanish government has recently announced a new strategic plan for economic recovery and transformation (PERTE) called “The New Economics of Language.”96 The PERTE is presented as an opportunity to take advantage of the potential of Spanish and co-official languages for economic growth and international competitiveness in areas such as AI, translation, learning, cultural dissemination, audiovisual production, research and science. It has a budget of 1.1 billion euros in public funds and aims to mobilise another billion in private investment. Additionally, following the lines of the Spanish Plan for the Advancement of LT,97 several regional governments have also launched LT initiatives, including AINA (Catalonia),98 Nós (Galicia)99 and GAITU (the Basque Country).100

92 https://aiindex.stanford.edu/report/
93 AI in India: A Policy Agenda. The report also highlights natural language voice recognition as a way to account for the diversity in languages and digital skills in the Indian context and recommends the creation of annotated data sets for their languages to add incremental value to existing services ranging from e-commerce to agriculture.
94 https://www.nscai.gov/2021-final-report. See also the American AI Initiative.
95 Spanish Plan for the Advancement of Language Technology: https://plantl.mineco.gob.es/tecnologias-lenguaje/actividades/estudios/Paginas/tecnologias-del-lenguaje-en-Europa.aspx
96 https://planderecuperacion.gob.es/como-acceder-a-los-fondos/pertes/perte-nueva-economia-de-la-lengua
97 https://plantl.mineco.gob.es/Paginas/index.aspx
98 https://politiquesdigitals.gencat.cat/ca/tic/aina-el-projecte-per-garantir-el-catala-en-lera-digital/
99 https://www.xunta.gal/hemeroteca/-/nova/134792/xunta-usc-ponen-marcha-lsquo-proxecto-nosrsquo-que-permitira-incorporar-galego
100 https://www.irekia.euskadi.eus/es/news/76846-gobierno-vasco-presentado-gaitu-plan-accion-las-tecnologias-lengua-2021-2024-cual-tiene-objetivo-integrar-euskera-las-tecnologias-linguisticas

Table 1 The Language Technology funding situation in Europe (2019/2021), extracted from Rehm et al. (2020b) and updated with the newest AI strategies (D: Draft)
Columns: LT-related funding (None at all / Some / Dedicated LT funding programme); Artificial Intelligence (AI strategy / LT funding through AI)
Austria . .
Belgium . D .
Bulgaria . .
Croatia .
Cyprus .
Czechia . .
Denmark . . .
Estonia . . .
Finland . .
France . . .
Germany . . .
Greece . D
Hungary . .
Iceland . .
Ireland . .
Italy . .
Latvia . .
Lithuania . .
Luxembourg . .
Malta . . .
Netherlands . .
Norway . .
Poland . .
Portugal . .
Romania . D
Serbia . .
Slovakia . .
Slovenia . .
Spain . .
Sweden . .

At the European level, LT received better support through calls in various programmes: FP7, Horizon 2020, CEF Telecom, CIP ICT-PSP, EUREKA and EUROSTARS, among others. However, in most of these, funding for LT projects was gradually reduced as well. If these findings are compared to those presented by Rehm et al. (2020b), we observe a slight increase in the number of language-centric AI initiatives over the next couple of years (see Table 1 and Figure 2).101 It is noteworthy that only 12 European countries out of the 30 studied explicitly consider LT within their national policy initiatives.
This is significant because the successful development of the next generation of innovative AI technology relies on setting aside funding exclusively for LT. The same holds true for European countries that hope to incorporate LT-based AI applications, such as interactive dialogue systems and personal virtual assistants, into public services.102

101 According to Rehm et al. (2020b), only four of the 30 surveyed countries do not have some level of LT funding. Four countries have programmes dedicated to LT (Denmark, Estonia, Iceland, Spain), six provide funding for LT-related topics through AI (Belgium, Denmark, Estonia, France, Germany, Malta) and two (Ireland, Latvia) do not have LT programmes, but rather a language strategy defined by their governments. See also Rehm et al. (2016, 2020a, 2021).
102 https://digital-strategy.ec.europa.eu/en/news/new-report-looks-ai-national-strategies-progress-and-future-steps

Fig. 2 The Language Technology funding situation in Europe

4 SWOT Analysis

This section summarises, in the form of a SWOT analysis, the most relevant findings of the reports, documents and initiatives that were reviewed for this chapter. It attempts to identify the most significant favourable and unfavourable factors that must be addressed to make digital language equality a reality in Europe by 2030.

4.1 Strengths

• Emergence of powerful new deep learning techniques, tools that are revolutionising LT.
• Important basic LT has been developed and applications that are used on a daily basis by hundreds of millions of users for speech recognition, speech synthesis, text analytics and MT are available.
• Existence of multiple national and European LT research networks, associations, communities and other relevant stakeholders whose objective is to promote all kinds of activities related to research, development, education and industry in the field of LT, both nationally and internationally.
• Existence of unique, valuable and potentially extremely useful data resources that can be exploited by current LT.
An enormous amount of data is expressed in human language.
• Increasing number of companies in LT and good level of readiness for the implementation of LT in production environments.
• LT contributes to the development of inclusive digital societies, and is critical for responding to social challenges (accessibility, transparency, equity).

4.2 Weaknesses

• Deep learning LT and large pre-trained language models have shortcomings. Language models have limited real-world knowledge, can generate biased and factually incorrect text, may contain personal information, etc. They are also expensive to train and have a heavy carbon footprint. It is important to understand the limitations of large pre-trained language models and put their success in context.
• The LT markets are currently dominated by large non-EU actors, which do not address the specific needs of a multilingual Europe; Europe remains far behind due to market fragmentation, insufficient funding and legal barriers, thus hindering online commerce and communication. Europe does not fully exploit its enormous potential in LT.
• LT currently only plays a rather subordinate role in the political agenda and public debate of the EU and most of its Member States.
• There is a general misconception and over-hyping of actual AI and LT capabilities. AI is often perceived in a polarised fashion as either “magical” technology that can solve any problem or as a threat to jobs and workers, who will be replaced by machines.
• No EU policy has been proposed to address the problem of language barriers.
• GDPR/Copyright is a major barrier to the access and re-use of language resources, in competition with countries that adopt the “fair use” doctrine.
• The Open Data Directive (2019/1024/EU) does not include language data as a “high-value data category”. Most of the data require extensive IPR clearing (to address Copyright and GDPR).
• There is a lack of adequate LT policies and sustainability plans at the European and national levels to properly support European languages through LT. Only
Only four of the 30 European countries studied have a dedicated LT national pro-gramme, onlysix have includedLT funding through national AI strategies. • ThereisscarceandlimitedLTsupportfornon-officialEUlanguages. • No European LT association is represented in the new Data, AI and Robotics public-private partnership. • Thereisalackofnecessaryresources(experts,HPCcapabilities,etc.)compared to large US and Chinese enterprises that lead the development of new LT sys­tems. In particular, the “computing divide” between large firms and non-elite universities increases concerns around bias and fairness within AI technology, and presents an obstacle towards democratising AI. • Compared to English, there are far fewer LT resources and tools including lan­guage resources, annotated corpora, pre-trained language models, benchmark datasets, softwarelibraries, etc. • There is an uneven distribution of resources (funding, open data, language re-sources,scientists,experts,computingfacilities,ITcompanies,etc.)bycountry, regionand language. • There is a weak open data sharing culture for many public stakeholders and SMEs. • TheinvestmentinAIdoesnotreflecttherealimportanceofLT. • ThereisafragmentedEuropeanmarketwithanextremelylargeandvariedbase ofabout1,000SMEcompaniesthatdevelopLT.Smalltomediumnationaltech­nologycompanieshavelittlecapitalandinvestmentinLTcapabilities.Themar­kets are small forlow-resource language speakers. • In many countries, there are weak links between academia and industry and insufficient effective mechanisms forknowledgetransfer. • There is weak internationalisation of R&D&I and innovation. 4.3 Opportunities • Many new powerful monolingual, multilingual and cross-lingual deep learning LT capabilities are available. • LTiskeyfortherealisationandsupportofEuropeanmultilingualism. 
• LT is used in practically all everyday digital products and services, since most use language to some extent, especially all internet-related products such as search engines, social networks and e-commerce services.
• LT can impact on sectors of fundamental importance to the well-being of all European citizens, such as health, administration, justice, education, culture, tourism, etc.
• LT offers effective solutions to facilitate monolingual and multilingual communication, including for the deaf and hard of hearing, the blind and visually impaired and those with language-related disabilities or impairments.
• LT is one of the most important AI application areas with a fast growing economic impact. Enormous growth is expected in the global LT market based on the explosion of applications observed in recent years and the expected exponential growth in unstructured digital data.
• Europe can play a leading economic role with its neighbouring countries through good partnerships based on the use of LT customised to other languages.
• Growing trend for the LT market and industry in Europe regarding the exploitation of digital resources and data of linguistic interest. Digitisation is one of the key means to generate new economic growth.
• Consolidation of a competitive LT industry that harnesses the potential of research and academia both in educating well-trained LT professionals and in transferring research results to industry and public administrations.
• Increasing awareness about the possibilities of AI and LT and the necessity to invest and coordinate efforts.
• Substantial breakthroughs and fast development of LT offer new opportunities for digital communication; current multilingual and cross-lingual deep learning LT allows for the creation of new multilingual pre-trained language models and systems that can leverage and balance LT across all European languages.
• Ensure openness of infrastructures for data and technologies.

4.4 Threats
• In comparison to 2012, the results of the European Language Equality project in 2022 show that the gap between English and all other languages appears to be getting bigger instead of smaller.
• Development of non-explainable techniques and deep learning models without any commonsense or up-to-date knowledge, with social biases, containing personal and private data, with a heavy impact on carbon footprint, etc.
• AI is a broad area, which overshadows and dwarfs the importance, benefits and contributions of LT, especially in Europe.
• Loss of LT skills and human capital trained in Europe due to the lack of sufficient research, transfer and funding opportunities.
• Inability to retain in, or attract to, the EU researchers and workers skilled in LT and AI.
• Growing development of the sector in US and China that will eventually penetrate the European application market, limiting the Digital Language Equality opportunities as described in this report.
• The complexity of copyright, GDPR, Open Data directives etc. makes access to language resources too costly, unclear and risky.
• Fear of many jobs becoming redundant due to the deployment of AI-powered technologies.

5 Conclusions
Europe’s multilingual nature is also one of the main obstacles to a truly connected, cross-lingual communication and information space. Moreover, while language diversity is at the core of European identity, many of our languages are in danger of digital extinction because they are not sufficiently supported through Language Technologies (Moseley 2010; Rehm and Uszkoreit 2012; STOA 2018; European Parliament 2018).103 Sophisticated multilingual, cross-lingual and monolingual LT for all European languages would future-proof our languages as cornerstones of our cultural heritage and richness. In recent years, European research in LT has faced increased competition from other continents, especially with respect to breakthroughs in AI. These scientific advancements have led to global commercial successes, from which the respective regions benefit especially.
As a consequence, many European scientists, including young high-potential researchers, are leaving Europe to continue their work abroad. Europe must invest in retaining and attracting these researchers. Our continent is in need of powerful LT made in Europe for all European citizens, tailored to our unique cultures, societies and economic requirements so that a linguistically fragmented Europe may become a truly unified and inclusive one. This ambitious but worthy effort involves supporting its rich and diverse linguistic cultural heritage, from broadly spoken languages to minority and regional languages, as well as the languages of immigrants and important trade partners, benefiting European citizens, European industry and European society.

103 http://www.unesco.org/languages-atlas/index.php?hl=en&page=atlasmap

References
Aldabe, Itziar, Georg Rehm, German Rigau, and Andy Way (2022). Deliverable D3.1 Report on existing strategic documents and projects in LT/AI (second revision). European Language Equality (ELE); EU project no. LC-01641480 – 101018166. https://european-language-equality.eu/reports/LT-strategic-documents-v3.pdf.
Chui, Michael, Martin Harryson, James Manyika, Roger Roberts, Rita Chung, Ashley van Heteren, and Pieter Nel (2018). “Notes from the AI frontier: Applying AI for social good”. In: McKinsey Global Institute.
Curry, Edward, Andreas Metzger, Sonja Zillner, Jean-Christophe Pazzaglia, and Ana García Robles, eds. (2021). The Elements of Big Data Value: Foundations of the Research and Innovation Ecosystem. Cham: Springer.
de Jong, Franciska, Bente Maegaard, Darja Fišer, Dieter van Uytvanck, and Andreas Witt (2020). “Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN”. In: Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, pp. 3406–3413. https://aclanthology.org/2020.lrec-1.417.
European Parliament (2018). Language Equality in the Digital Age.
European Parliament resolution of 11 September 2018 on Language Equality in the Digital Age (2018/2028(INI)). http://www.europarl.europa.eu/doceo/document/TA-8-2018-0332_EN.pdf.
Moseley, Christopher (2010). Atlas of the World’s Languages in Danger. http://www.unesco.org/culture/en/endangeredlanguages/atlas.
Padilla, Thomas (2020). “Responsible Operations: Data Science, Machine Learning, and AI in Libraries”. In: American Archivist 83, pp. 483–487.
Rehm, Georg, ed. (2017). Language Technologies for Multilingual Europe: Towards a Human Language Project. Strategic Research and Innovation Agenda. CRACKER and Cracking the Language Barrier federation. http://cracker-project.eu/sria/.
Rehm, Georg, ed. (2023). European Language Grid: A Language Technology Platform for Multilingual Europe. Cognitive Technologies. Cham, Switzerland: Springer.
Rehm, Georg, Maria Berger, Ela Elsholz, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Stelios Piperidis, Miltos Deligiannis, Dimitris Galanis, Katerina Gkirtzou, Penny Labropoulou, Kalina Bontcheva, David Jones, Ian Roberts, Jan Hajic, Jana Hamrlová, Lukáš Kačena, Khalid Choukri, Victoria Arranz, Andrejs Vasiljevs, Orians Anvari, Andis Lagzdiņš, Julija Melnika, Gerhard Backfried, Erinç Dikici, Miroslav Janosik, Katja Prinz, Christoph Prinz, Severin Stampler, Dorothea Thomas-Aniola, José Manuel Gómez Pérez, Andres Garcia Silva, Christian Berrío, Ulrich Germann, Steve Renals, and Ondrej Klejch (2020a). “European Language Grid: An Overview”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3359–3373. https://www.aclweb.org/anthology/2020.lrec-1.413/.
Rehm, Georg and Stefanie Hegele (2018).
“Language Technology for Multilingual Europe: An Analysis of a Large-Scale Survey regarding Challenges, Demands, Gaps and Needs”. In: Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018). Ed. by Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga. Miyazaki, Japan: ELRA, pp. 3282–3289. https://aclanthology.org/L18-1519.pdf.
Rehm, Georg, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Khalid Choukri, Andrejs Vasiljevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriute, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás Váradi, Kadri Vider, Andy Way, and François Yvon (2020b). “The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Christopher Cieri, Khalid Choukri, Thierry Declerck, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Marseille, France: ELRA, pp. 3315–3325. https://www.aclweb.org/anthology/2020.lrec-1.407/.
Rehm, Georg, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Victoria Arranz, Andrejs Vasiljevs, Gerhard Backfried, José Manuel Gómez Pérez, Ulrich Germann, Rémi Calizzano, Nils Feldhus, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Julian Moreno-Schneider, Dimitris Galanis, Penny Labropoulou, Miltos Deligiannis, Katerina Gkirtzou, Athanasia Kolovou, Dimitris Gkoumas, Leon Voukoutis, Ian Roberts, Jana Hamrlová, Dusan Varis, Lukáš Kačena, Khalid Choukri, Valérie Mapelli, Mickaël Rigault, Julija Melnika, Miro Janosik, Katja Prinz, Andres Garcia-Silva, Cristian Berrio, Ondrej Klejch, and Steve Renals (2021). “European Language Grid: A Joint Platform for the European Language Technology Community”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (EACL 2021). Kyiv, Ukraine: ACL, pp. 221–230. https://www.aclweb.org/anthology/2021.eacl-demos.26.pdf.
Rehm, Georg and Hans Uszkoreit, eds. (2012). META-NET White Paper Series: Europe’s Languages in the Digital Age. 32 volumes on 31 European languages. Heidelberg etc.: Springer.
Rehm, Georg and Hans Uszkoreit, eds. (2013). The META-NET Strategic Research Agenda for Multilingual Europe 2020. Heidelberg etc.: Springer. http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf.
Rehm, Georg, Hans Uszkoreit, Sophia Ananiadou, Núria Bel, Audrone Bielevičiene, Lars Borin, António Branco, Gerhard Budin, Nicoletta Calzolari, Walter Daelemans, Radovan Garabík, Marko Grobelnik, Carmen García-Mateo, Josef van Genabith, Jan Hajič, Inma Hernáez, John Judge, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Joseph Mariani, John McNaught, Maite Melero, Monica Monachini, Asunción Moreno, Jan Odijk, Maciej Ogrodniczuk, Piotr Pęzik, Stelios Piperidis, Adam Przepiórkowski, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadina, Koenraad De Smedt, Marko Tadić, Paul Thompson, Dan Tufiș, Tamás Váradi, Andrejs Vasiljevs, Kadri Vider, and Jolanta Zabarskaite (2016). “The Strategic Impact of META-NET on the Regional, National and International Level”. In: Language Resources and Evaluation 50.2, pp. 351–374. DOI: 10.1007/s10579-015-9333-4. http://link.springer.com/article/10.1007/s10579-015-9333-4.
Rehm, Georg, Hans Uszkoreit, Ido Dagan, Vartkes Goetcherian, Mehmet Ugur Dogan, Coskun Mermer, Tamás Váradi, Sabine Kirchmeier-Andersen, Gerhard Stickel, Meirion Prys Jones, Stefan Oeter, and Sigve Gramstad (2014). “An Update and Extension of the META-NET Study “Europe’s Languages in the Digital Age””. In: Proceedings of the Workshop on Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era (CCURL 2014). Ed. by Laurette Pretorius, Claudia Soria, and Paola Baroni. Reykjavik, Iceland, pp. 30–37. http://georg-re.hm/pdf/CCURL-2014-META-NET.pdf.
Soria, Claudia, Nicoletta Calzolari, Monica Monachini, Valeria Quochi, Núria Bel, Khalid Choukri, Joseph Mariani, Jan Odijk, and Stelios Piperidis (2014). “The language resource Strategic Agenda: the FLaReNet synthesis of community recommendations”. In: Language Resources and Evaluation 48, pp. 753–775. https://doi.org/10.1007/s10579-014-9279-y.
STOA (2018). Language equality in the digital age – Towards a Human Language Project. STOA study (PE 598.621), IP/G/STOA/FWC/2013-001/Lot4/C2.
https://data.europa.eu/doi/10.2861/136527.
Zhang, Daniel, Saurabh Mishra, Erik Brynjolfsson, John Etchemendy, Deep Ganguli, Barbara Grosz, Terah Lyons, James Manyika, Juan Carlos Niebles, Michael Sellitto, Yoav Shoham, Jack Clark, and Raymond Perrault (2021). “The AI Index 2021 Annual Report”. DOI: 10.48550/ARXIV.2103.06312. https://arxiv.org/abs/2103.06312.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 45
Strategic Research, Innovation and Implementation Agenda for Digital Language Equality in Europe by 2030
Georg Rehm and Andy Way

Abstract This chapter presents the ELE Programme (ELE Consortium 2022). Reacting to the landmark resolution (European Parliament 2018), its vision is to achieve digital language equality in Europe by 2030. The programme was prepared jointly with many stakeholders from the European Language Technology, Natural Language Processing, Computational Linguistics and language-centric AI communities, as well as with representatives of relevant initiatives and associations, and language communities. Europe still suffers from strong inequalities in terms of technology support of its languages.
English is still by far the language with the best technological support, followed by a cluster of three languages (German, Spanish, French) that already have only half the technological support of English. More than half of the around 90 languages surveyed have either weak or no technological support at all. The ELE Programme is foreseen to be a shared, long-term funding programme tailored to Europe’s needs, demands and values. For the EU we foresee the role of providing resources for coordinating the programme, for providing shared infrastructures, for maintaining the scientific goals and programme principles, etc. The participating countries have the role of providing resources for the development of technologies and datasets for their own languages. Key goals are to reduce the technology gap between English and all other European languages and to address the lack of available language data. The ELE Programme tackles the following overarching themes: Language Modelling, Data and Knowledge, Machine Translation, Text Understanding and Speech. These interconnected themes focus upon the socio-political goal of establishing DLE in Europe and on the scientific goal of Deep Natural Language Understanding, both by 2030.1

Georg Rehm
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Germany, georg.rehm@dfki.de
Andy Way
Dublin City University, ADAPT Centre, Ireland, andy.way@adaptcentre.ie
on behalf of the whole European Language Equality consortium and all contributors.
European Language Equality EU Project, coordinator@european-language-equality.eu

1 This chapter is a revised version of the ELE Strategic, Research, Innovation and Implementation Agenda for Digital Language Equality (ELE Consortium 2022), which is also available online: https://european-language-equality.eu/agenda/.

© The Author(s) 2023 G. Rehm, A.
Way (eds.), European Language Equality, Cognitive Technologies, https://doi.org/10.1007/978-3-031-28819-7_45

1 Executive Summary
The overall vision of the ELE Programme is to achieve complete digital language equality (DLE) in Europe by 2030. The programme was prepared jointly with many relevant stakeholders from the European Language Technology (LT), Natural Language Processing (NLP), Computational Linguistics and language-centric Artificial Intelligence (AI) communities, as well as with representatives of relevant initiatives and associations, and language communities. The ELE Programme responds to the call “to establish a large-scale, long-term coordinated funding programme for research, development and innovation in the field of language technologies, at European, national and regional levels, tailored specifically to Europe’s needs and demands”, as specified by the European Parliament Resolution Language equality in the digital age (European Parliament 2018). The results of the ELE project show that English is still by far the language with the best and most thorough technological support, followed by a cluster of three languages (German, Spanish, French) that have only half the technological support of English. After yet another gap, the long tail of languages with fragmentary support starts with Finnish, Italian and Portuguese. More than half of the around 90 languages surveyed have either weak or no technological support at all. In comparison to previous results from 2012 (Rehm and Uszkoreit 2012), the gap between English and the other languages appears to be getting bigger instead of smaller.
With the exceptions of English, German, French and Spanish, all languages we investigated exist in socio-political and economic ecosystems that do not incentivise, encourage or foster the development of technologies for these languages. While all 30 European countries we surveyed have put in place national AI strategies, almost all of these national strategies seem to have either ignored or left out the topic of languages and language-centric AI.2

2 Despite our original findings, in the interim, Spain has funded the 1.1B€ PERTE New Economy of Language programme to “maximize the value of Spanish and co-official languages in the new digital economy and artificial intelligence”, see https://planderecuperacion.gob.es/como-acceder-a-los-fondos/pertes/perte-nueva-economia-de-la-lengua. Accordingly, rather than be seen as a laggard in this space, Spain now represents what could and should be done to support European languages and associated technology, and the PERTE programme stands as a template for other nations to adapt to their particular situation.

The ELE Programme is foreseen to be a shared, long-term, coordinated and collaborative LT funding programme tailored to Europe’s needs, demands and values, including multilingualism and language equality in general. For the EU we foresee the role of providing resources for coordinating the programme, for providing shared infrastructures, for maintaining the scientific goals and programme principles, etc. The participating countries have the role of providing resources for the development of technologies and datasets for their own languages. Key goals are to reduce the technology gap between English and all other European languages and to address the lack of available language data: this is true for all European languages except English. The ELE Programme focuses upon openness: open source, open access and open standards as well as interoperability and standardisation. A key emphasis is on the creation of large open access language models for all European languages, including the creation of datasets and multilingual models, symbolic knowledge, models that include discourse capabilities as well as grounding and other sophisticated features currently out of reach for existing state-of-the-art technologies. The ELE Programme is expected to have a runtime of nine years. In addition to overall coordination, the ELE Programme tackles the following overarching themes: Language Modelling, Data and Knowledge, Machine Translation, Text Understanding and Speech. These interconnected themes focus upon the socio-political goal of achieving DLE in Europe and on the scientific goal of Deep Natural Language Understanding, both by 2030. The ELE Programme strengthens and makes optimal use of infrastructures, data spaces and services provided by other European initiatives.

The global NLP market is estimated to reach 341.7B$ by 2030. In contrast, the modest investment needed to implement the ELE Programme will not only bring about DLE in Europe but it will also move European research and industry in this field into a dominant position for years to come.

2 Multilingual Europe and Digital Language Equality
Languages are the most common and versatile way for humans to convey and access information. We use language, our most natural means of communication, to encode, store, transmit, share and manipulate information. We use language in everyday life to interact with others and our environment and as social glue, to express and to explain ourselves, to convince, agree with and rebut others. Our laws and constitutions are written in language. We use it in science, commerce, teaching and passing on knowledge to the next generations, for pleasure, creativity and aesthetic enjoyment in puns, jokes and literature. History and culture are recorded, interpreted and enjoyed through language. Our languages are a core part of our identities.
Human languages are incredibly complex: a single word (phrase, sentence, text) can have many meanings, a single meaning can be expressed by many different words (but meaning depends on linguistic and situational context), we can use language literally and metaphorically, language and knowledge are highly intertwined, we do not articulate important parts of a message if these parts are presumed shared knowledge by the community (this includes situational knowledge), important parts of meaning reside in what can be inferred from what has been said. At the same time, language changes. New words are invented, some old ones are dropped, even the structure (syntax and morphology) of languages and the meaning of words change over time. These aspects make human languages fundamentally different from the formal languages of mathematics, logic and computer science. This is also what makes human languages so efficient, elegant, flexible and enjoyable. Finally, there are many human languages (6,000+), not even counting regional and dialectal variants. All these aspects are at the core of human languages and they make it hard for computers to “fully understand” human language and to “properly” process human language in the context of “full and deep understanding”.

Languages are at the heart of every aspect of life and their role is crucial to the future of European countries, citizens, businesses, and of the European Union as a whole. Full Digital Language Equality (see Chapter 3) in Europe can deliver an impact in the following four high-priority areas.

Digital Language Equality will have a positive and unprecedented impact on all European languages. We must ensure that no European languages remain under-resourced (see Chapter 4 for an overview and Chapters 5 to 37 for in-depth analyses), but that they are equipped with the same high level of technological support already enjoyed by very few of them (Chapter 2).
This, in itself, will deliver a major impact on all European citizens and businesses: supporting all languages in the interest of equality and fairness empowers and brings advantages to their speakers, while reflecting the democratic and inclusive spirit of the EU.

Digital Language Equality will make a contribution to establishing a fair, inclusive and sustainable multilingual Digital Single Market: this will be achieved by helping to future-proof all European languages through digital technologies, and especially preventing the threat of digital extinction for those that suffer from weak support. By fostering a more inclusive and cooperative business and social environment, companies and citizens will benefit from sharing knowledge, digital services and products on an equal footing, overcoming the fragmentation that is caused by many European languages lagging behind, which severely penalises their speakers as well as regional and local communities. Action in this vital area is particularly urgent due to the increasing range of economic, educational and social opportunities that are afforded online and delivered remotely, from e-commerce to online shopping, to web-based recruitment services, online teaching programmes and professional training courses, among others.

Digital Language Equality will help research in Europe, mobilising and leveraging their full potential to start reclaiming scientific and industrial leadership from US-based and Asian competitors, particularly large tech enterprises as well as academic institutions and research centres, that pose fierce competition in several fields. The ELE Programme will instigate regional, national and EU-wide collaboration among scientists from academia and industry covering a broad range of disciplines, ensuring the mix of competencies that is required to deliver substantial and lasting impact at the forefront of scientific and technological progress.

Digital Language Equality will act as a multiplier of opportunities. It will help to aggregate the players that are required to unlock the full potential of an EU-wide effort to exchange and share widely-agreed methodologies, resources and technologies with a focus on promoting the digital equality of European languages: this will benefit the use and promotion of all European languages, encouraging in particular those that have traditionally lagged behind.

3 What is Language Technology and How Can it Help?
Language Technology (LT) is concerned with studying and developing systems capable of processing human languages. Over the years, the field has developed different methods to make explicit the information contained in written and spoken language – and increasingly for other modalities such as sign language, for example – or to generate or synthesise written or spoken language (see Chapter 2 for more details). Despite the inherent difficulty of many of the tasks performed, current LT support allows many advanced applications which would have been unthinkable only a few years ago. LT is present in our daily lives, for example, through search engines, recommendation systems, virtual assistants, chatbots, authoring assistants, text predictors, automatic translation systems, automatic subtitling, automatic summarisation tools, etc. Its rapid development in recent years predicts even more encouraging and also exciting results in the near future. LT is providing solutions for the following main application areas: Machine Translation, Speech Processing, Text Analysis, Information Extraction and Information Retrieval, Natural Language Generation, Human-Computer Interaction (see Chapter 2 as well as Chapters 40 to 43 for in-depth analyses of the state-of-the-art).

4 A Shared European Programme for Language Technology and Digital Language Equality in Europe by 2030
Fully in line with the recommendations of the European Parliament resolution Language equality in the digital age (European Parliament 2018), our recommendations, as analysed in the chapters of the present book can be summarised as follows.

The vision described in this book is fully compatible with current EU policy, needs and demands; in fact, they are mission-critical. Missing investment in the underdeveloped areas of LT and language-centric AI will result in the digital extinction of languages, i.e., only global languages spoken by large numbers of speakers, including, crucially, outside the EU, will prevail and the global LT/NLP market will continue to be dominated by the US and China, while the European LT community will be pushed aside even further.

The main concept of the ELE Programme is a collaboration between the EU, and in particular the European Commission, and all participating countries and regions since funding and further investment are needed on all levels. Funding on the level of the EU should enable overarching coordination and EU-wide technological infrastructure. It should cover the topics which require pan-European coordination such as shared tasks, protocols, multilingual dataset creation based on the same principles in line with European values and priorities, etc. Coordination on the European level is needed because language communities are still too fragmented and mostly too small. Further effort should be invested into adequate policy-making, distributed research infrastructures and technological platforms like ELG (Rehm 2023) and the Common European Language Data Space, with flexible access to sufficient High Performance Computing (HPC) facilities. Additionally, national and regional funding should complement the European funding with regard to language-specific research and development.
The main gaps to be filled in these respects and the most important anticipated developments are described, among others, in the language reports (see Chapters 5 to 37).

This section summarises our main recommendations for this shared programme (more detailed recommendations are contained in the previous chapters of this book). First, we outline the possible cornerstones for suitable policy and infrastructure recommendations, as well as ideas for the realisation of a governance model. Second, we revise the technology and data recommendations suggested by the ELE consortium (derived from Chapters 39 to 44), which are closely related to those discussed in the Language equality in the digital age resolution (European Parliament 2018).

Further, in terms of our research recommendations, the ELE consortium together with the wider LT community has developed a clear vision for the different areas of LT. We see an urgent need to refocus and massively strengthen European LT/NLP research through a large-scale initiative as a shared, collaborative pan-European effort between the EU and those countries and regions that participate in the initiative, i.e., the ELE Programme. This endeavour should include the participation of research centres, academia, companies (particularly SMEs and startups), and other relevant stakeholders. As LT is aggregated and applied to more complex settings, interdisciplinary research and activities are becoming more relevant in order to further boost developments and allow synergies to become apparent. To achieve Deep Natural Language Understanding, we need to finance and investigate fields such as cognitive, neural and symbolic AI further.

The ELE Programme should boost pan-European long-term basic research as well as knowledge and technology transfer between research labs and industry.
Frequently mentioned areas and tasks for basic and applied research where further investigation is needed include, among others, systematic language data collection (text, dialogue, vision, sign language and other forms of interactions), speech analysis, AI, human-computer interaction, machine learning, robotics, natural language understanding and processing tasks such as machine reading, text analysis, machine translation, chatbots, virtual assistants and summarisation.

4.1 Policy Recommendations
• Reinforce European leadership in LT by establishing the ELE Programme as a large-scale, long-term coordinated funding programme for research, development, innovation and education with the societal goal of achieving Digital Language Equality in Europe and the scientific goal of Deep Natural Language Understanding, both by 2030.
• Ensure comprehensive EU-level legal protection for the more than 60 regional and minority languages spoken in Europe.
• Empower recognition of the collective rights of national and linguistic minorities in the digital world (including sign languages).
• Encourage mother-tongue teaching for speakers of official and non-official languages of the EU.
• Safeguard sufficient funding to support new technological approaches, based on increased computational power and better access to sizeable amounts of data.
• Develop specific initiatives within current funding schemes, especially Horizon Europe and Digital Europe (including the Recovery Plan for Europe), to boost long-term basic research as well as knowledge and technology transfer between countries and regions, and between academia and industry.
• Support the coordination between research and industry to enhance the digital possibilities for LT and Open Access to language data.
• Define and develop a minimum set of language resources and capacities that all European languages should possess (see Krauwer 2003).
• Develop common policy actions and protocols for language data sharing by public administrations at all levels.
Language data should be included as a high-value data category in the Open Data Directive (2019/1024/EU).
• Enable and empower European SMEs and startups to easily access and use LT in order to grow their businesses independent of language barriers, also thanks to e-commerce and online marketplaces.
• Create the necessary appealing conditions to attract and retain qualified and diverse international LT personnel in Europe.
• Encourage all EU-funded projects to have a language diversity plan and to include direct or associated partners from a less-widely spoken language.
• Empower and encourage administrations at all levels to improve access to online services and information in different languages.
• Create a European network of centres of excellence in LT to increase industry visibility and to design national research agendas.
• Implement and maintain over the long term an overall EU-wide policy framework to achieve European LT sovereignty.
• Facilitate EU Member States’ acquisition of LT for their local industries without depending on non-European technology providers.

4.2 Governance Model
• Structure the ELE Programme as a shared, collaborative and coordinated programme between the EU and all countries and regions that participate.
• Allocate the area of multilingualism, linguistic diversity and language technology to the portfolio of an EU Commissioner.
• Set up a large lobby for EU regional and minority languages.
• Create a pan-European network of research centres to facilitate the coordination and also implementation of the ELE Programme at all levels.
• Promote a distributed centre for linguistic diversity that will strengthen awareness of the importance of lesser-used, regional and minority languages.
• Design and apply new forms of research funding and organisation to ease the transition from application-oriented basic research to commercially-focused technology development.
4.3 Technology and Data Recommendations

• Develop large open-source language models that work for all European languages, optimised in terms of compute time and cost.
• Address the lack of available data and define the minimum amount of language resources and capabilities that all European languages should possess.
• Add more focus on systematic and comprehensive language data collection (text, dialogue, multimodal) and exploit automatic data generation (synthetic data), crowd-sourcing and translation of high-quality data.
• Develop new methodologies for transfer and adaptation of resources and technologies to other domains and languages.
• Develop high-performance applications (in terms of speed and quality) for all languages that respect safety, security and privacy.
• Ensure efficient adaptations to applications in terms of language, domain, efficiency, power consumption, ease of maintenance, and quality assurance.
• Develop methods to overcome the unequal data availability, by focusing on, e.g., annotation transfer, multilingual models preserving quality, few-shot or zero-shot learning.
• Unleash the power of monolingual and multilingual public sector data, data from broadcasters, social media, publishers, etc.
• Enforce open ecosystems, open standards and interoperability (including Open Source and Open Access).
• Focus on research on bias for strengthening inclusiveness and accessibility, to respect and promote European values and principles.
• Focus upon Green LT with a small compute and carbon footprint (e.g., model compression).
• Foster publicly available resources that facilitate innovation and research for both commercial and non-commercial actors.
• Construct a multilingual LT benchmark, a European “SuperGLUE”-style (Wang et al. 2019) shared benchmark, that tracks progress.
• Define the minimum language resources that all European languages should possess in order to prevent digital extinction.
4.4 Infrastructure Recommendations

• Strengthen existing and create new research infrastructures and LT platforms that support research and development activities, including collaboration, knowledge sharing, and Open Access to data, tools and technologies.
• Fill the identified gaps in data, language resources and knowledge graphs and create a future path for Europe towards comprehensive and interlinked data infrastructures.
• Develop clear and robust protocols to ensure flexible access to sufficient GPU-based HPC infrastructure and robust protocols to process sensitive data.
• Ensure sufficient operational capacity, especially for Large Language Models (LLMs) and flexible access to GPU-based HPC facilities.
• Follow the idea of a Semantic Data Fabric including rich semantics for the development of an integrated and interoperable data infrastructure.

4.5 Research Recommendations

4.5.1 Recommendations for all Research Areas

• Gather and make available the critical mass of resources in terms of data, HPC facilities, and expertise from pan-European LT research labs and centres, with support from the EC as well as national and regional administrations.
• Create sufficient multilingual and multimodal data of quality (responsible, legal, diverse, unbiased, ethical, representative, etc.), in all European languages and domains (media, health, legal, education, etc.).
• Provide flexible access to HPC facilities for LT research and industry. HPC facilities should provide clear and robust protocols to process sensitive data.
• Develop better benchmarks and datasets (ethical, responsible, legal, etc.) for all languages, domains, tasks and modalities.
• Combine interactive LT (conversational AI) with text, knowledge, and multimedia technologies for a new generation of applications that can address the deeper questions of communication, common sense and reasoning.
• Encourage trustworthy, unbiased, inclusive, non-discriminatory LT/AI, making interpretability and explainability of AI models a priority.
• Develop further the areas of responsible AI by combining statistical and symbolic AI in multilingual environments to provide AI-based applications that deliver accurate results and benefits for research, industry, and society.
• Focus on methods and learning architectures to overcome the highly unequal data availability, such as annotation transfer, synthetic data and their proper use in machine learning, multilingual models preserving quality and coverage and few-shot or zero-shot learning.
• Focus on Green LT and investigate new efficient methods to extend, reuse and adapt existing pre-trained language models or develop new ones with much reduced carbon footprint.
• Develop language- and culture-specific technologies that cover more linguistic phenomena and text types, focusing on accessibility, through sign language, avatar technology, etc.
• Provide transparency of AI models with regard to accuracy and fairness.
• Reframe LT/NLP as a quantum computing problem.

4.5.2 Machine Translation

• Develop near-real-time MT across all modalities (speech, text, signs, etc.) and adaptive MT, where the system learns from interaction with users.
• Move towards context-aware methodologies that go beyond text data and include images, videos, tables, etc. by developing multimodal MT systems.
• Develop low-resource MT by deepening research on projection and structural organisation of embeddings to comprehend how structurally different languages and their respective embedding spaces can be mapped to one another.

4.5.3 Speech Processing

• Enhance speech resources and create acoustic models to cover all European languages, including non-standard varieties and dialects.
• Improve the handling of audio conditions currently perceived as difficult (e.g., multiple simultaneous speakers in noisy environments speaking spontaneously and highly emotionally in a mix of languages).
• Develop high-quality, natural synthetic voices, allowing users to obtain content in the language of their choice.
• Improve context modelling to handle the translation of speech models across larger volumes of text.
• Support research in the direction of combining speech, NLU and NLP with other modalities, such as image and vision.
• Address privacy and security threats in areas of speech synthesis, voice cloning and speaker recognition.

4.5.4 Text Analytics and Natural Language Understanding

• Create large Open-Access language models for all European languages (for fine-tuning and downstream tasks), datasets (for training and testing), multilingual models, models that include symbolic knowledge and discourse features.
• Increase the adoption of approaches based on self-supervised, zero-shot, and few-shot learning.
• Support research in NLU which integrates speech, NLP, and contextual information as well as additional modes of perception.
• Strengthen basic research in neurosymbolic approaches to NLP/NLU, including grounding and the use of human-understandable databases and sources.
• Strengthen progress in reinforcement-based learning, novel dialogue management strategies, and situation-aware natural language generation.
• Strengthen interdisciplinary research and enable better modelling of multimodal environments.

4.6 Implementation Recommendations

• Structure the ELE Programme into three phases of similar duration.
• Facilitate discussions between the EU, the European Commission in particular, and all participating countries to define the goals and the financial setup.
• Encourage participating countries to invest in the development of large language models, data sets, technologies, and tools for their own languages.
• Encourage the EU to establish legislation to promote participation.
• Encourage the EU to invest in the pan-European coordination of all language-specific projects and initiatives, support mechanisms, infrastructures, data procedures, cross-cutting projects, etc. and provide flex funds for bootstrapping poorly supported languages.
• Structure the ELE Programme into six themes covering: Language Modelling, Data and Knowledge, Machine Translation, Text Understanding, Speech, and Infrastructure, and support each theme by coordination actions (CSAs), research actions (RIAs) as well as actions for innovation and deployment (IAs).

5 Roadmap towards Digital Language Equality in Europe

5.1 Main Components

Language Technologies have the potential to overcome the linguistic divide in the digital age. However, we need to define actions, tools, processes and actors that need to be involved. The ELE SRIA includes a roadmap with concrete steps for the implementation that carry tangible and measurable outputs.

The main scientific goal of the ELE Programme is Deep Natural Language Understanding in Europe by 2030. Efficiency will be increased by sharing knowledge, infrastructures and resources, with a view to developing innovative technologies and services, in order to achieve the next scientific breakthrough in this area and help reduce the technology gap between Europe’s languages with the collaboration of research centres, academic experts, industry and other relevant stakeholders. Crucially, the long-term ELE Programme will involve significantly intensified coordination between the participating countries and languages.

Fig. 1 The six main themes of the ELE Programme

The main societal and economic goal of the ELE Programme is Digital Language Equality in Europe in 2030. The focus is on language equality and the provisioning of technologies, services and resources outside the often-preferred languages to achieve and maintain long-term technological sovereignty in this crucial application area. For regional, minority and lesser spoken languages, we need to find a (technological) way to consider Deep Natural Language Understanding within a common approach, to create synergies and increase efficiency of the solutions and their design and development.
To narrow the digital divide, there is a pressing urgency for novel techniques that would bring less-resourced languages to a level comparable to state-of-the-art results for resource-rich languages. This includes the leveraging of multimodal and multilingual resources to support the development of applications for languages and varieties with scarce resources.

This roadmap towards Digital Language Equality in Europe by 2030 provides a path and the means to ensure that the two goals outlined above are met. To tackle this challenge, the ELE Programme combines the following six themes (see Figure 1).

Language Modelling: This theme includes research, development and deployment activities regarding LLMs, especially multilingual and multimodal, generative LLMs that include text, speech, image, video, etc. Time and resources need to be invested for experiments, developing novel approaches, shared tasks, etc. For novel research approaches we need to combine national projects and data sets with international consortia. With regard to innovation and deployment, LLMs will be applied in industrial sectors and use-cases.

Data and Knowledge: The Data and Knowledge theme is focused on the collection, production, annotation, curation, quality assessment, standardisation, etc. of text data, spoken data, video data, and other multimodal data, primarily with regard to their application as data for pre-training different sorts of LLMs.

Machine Translation: The MT theme is focused on improving the automated translation from one natural language into another (including sign languages and other modalities). While Europe has a strong foundation in this field, research needs to combine novel, groundbreaking approaches with results of the Data and Knowledge as well as Language Modelling themes (see above). The results need to be applied in different industrial sectors and use-cases. Deployment needs to be fast, agile and driven by excellent teams.
Text Understanding: The Text Understanding theme aims to improve the identification and labelling of information regarding all levels of linguistic analysis underlying any natural language text (or other modalities). This requires exploring new strands of research and building on synergies with the other themes. An equally important aspect is applicability in industry.

Speech: The Speech theme addresses one big challenge of the European LT community, i.e., the shift from text to speech and multimodal processing (including research towards grounding). While progress in the area of speech applications has been made in the last decade, we also need novel research paradigms. This theme will benefit from the themes Data and Knowledge as well as Language Modelling. The development of relevant industry applications is another goal.

Infrastructure: The Infrastructure theme involves the extension, maintenance and interoperability of platforms such as European Language Grid (ELG) and Language Data Space (LDS). ELG has the potential of functioning as one of the primary platforms to support the activities of the ELE Programme. Moreover, ELG can be further developed into the focal point for best practices and the development of bridges to other relevant platforms. New features and functionalities need to be implemented for a higher adaptability. Other important factors are the provisioning of GPUs and standardisation.

5.2 Actions, Budget, Timeline, Collaborations

The Language equality in the digital age resolution (European Parliament 2018) strongly encourages the “establish[ment of] a large-scale, long-term coordinated funding programme for research, development and innovation in the field of language technologies, […] tailored specifically to Europe’s needs and demands”.
As a direct response, the ELE project (Rehm et al. 2022a) has developed the DLE Metric (Gaspari et al. 2022a; Grützner-Zahn and Rehm 2022) as a measure to assess and track the advancement towards DLE in Europe empirically (Chapter 3) and, in parallel, an outline of necessary actions. These have been informed by 66 project reports³ that comprise more than 2400 pages with condensed findings, summarised in the form of the present book. A total of 92 languages have been taken into account. We have included voices from research, industry and civil society. In terms of research on Europe’s languages, we prepared over 30 reports on the situation of individual languages (Chapters 5 to 37, Chapter 4 contains an overview analysis). In addition, we collected input through various surveys and more than 60 expert interviews (Chapters 4, 38 and 39). To cover the industry angle, our industry partners produced four technical deep dives and collected feedback in a number of surveys for further information (Chapters 40 to 43). Civil society was represented by the European citizen survey with about 20,000 responses (Chapters 4, 38 and 39).

The ELE Programme has a foreseen runtime of nine years, divided into three phases of three years each. Implementing the ELE Programme will significantly improve the state-of-the-art of LT and NLP and language-centric AI research (Chapter 2), create DLE in Europe and put Europe back into the global pole position of research and industrial applications of this type of technology (Chapter 44).

5.2.1 Actions

We foresee different types of projects, implemented using the different EC project types: coordination actions (CSAs), research actions (RIAs) as well as actions for innovation and deployment (IAs), see Table 1.
Coordination and Support Actions (CSAs) are needed to support research activities and policies (networking, exchange, access to research infrastructures, conferences, etc.). The ELE Programme envisages three CSAs for the overall programme coordination. These include, among others, the maintenance of the ELE principles, quality assurance approaches, shared tasks, etc. Additional CSAs are needed for the themes Data and Knowledge as well as Language Modelling as these are fundamental for all other themes as well. Another CSA is needed for supporting and further developing shared infrastructures.

3 See Gaspari et al. (2021), Agerri et al. (2021), Gaspari et al. (2022b), Sarasola et al. (2022), Koeva and Stefanova (2022), Melero et al. (2022a), Tadić (2022), Hlavacova (2022), Pedersen et al. (2022), Steurs et al. (2022), Maynard et al. (2022), Muischnek (2022), Lindén and Dyster (2022), Adda et al. (2022), Sánchez and García-Mateo (2022), Hegele et al. (2022a), Gavriilidou et al. (2022), Jelencsik-Mátyus et al. (2022), Rögnvaldsson (2022), Lynn (2022), Magnini et al. (2022), Skadina et al. (2022), Gaidiene and Tamulioniene (2022), Anastasiou (2022), Rosner and Borg (2022), Eide et al. (2022), Ogrodniczuk et al. (2022), Branco et al. (2022), Păiș and Tufiș (2022), Garabík (2022), Krek (2022), Melero et al. (2022b), Borin et al. (2022), Prys et al. (2022), Krstev and Stanković (2022), Ćušić (2022), Giagkou et al. (2022), Moshagen et al. (2022), Robinson-Jones and Scarse (2022), Hajič et al. (2021), Thönnissen (2022), Eskevich and Jong (2022), Rufener and Wacker (2022), Hajič et al. (2022), Hegele et al. (2022b), Gísladóttir (2022), Kirchmeier (2022), Hicks (2022), Blake (2022), Hrasnica (2022), Heuschkel (2022), Berzinš et al. (2022), Backfried et al. (2022), Gomez-Perez et al. (2022), Kaltenböck et al. (2022), Way et al. (2022b), Way et al. (2022a), Aldabe et al. (2022b), Aldabe et al. (2022a), ELE Consortium (2022), Hegele et al. (2021a), Hegele et al. (2021b), Rehm et al. (2022b), Marheinecke et al. (2022) and Rehm et al. (2022c).
Research and Innovation Actions (RIAs) are collaborative projects funding research activities that allow the exploration of new technologies, new methods, new products, or improvements of existing ones. Research is the fundamental prerequisite for DLE. Over the last decade, the community has developed a clear vision of the work needed in the different areas of LT. To achieve Deep NLU, we need to invest in and further research the areas of language modelling, machine translation, text understanding and speech.

Innovation Actions (IAs) consist of activities directly aiming at producing improved products, processes or services. They may include prototyping, testing, demonstrating, piloting, large-scale product validation and market replication.

Action                                                   Type  Number
ELE Programme – overall coordination                     CSA   3
Theme Data and Knowledge – coordination                  CSA   3
Theme Language Modelling – coordination                  CSA   3
Theme Language Modelling – research                      RIA   15
Theme Language Modelling – innovation and deployment     IA    15
Theme Machine Translation – research                     RIA   12
Theme Machine Translation – innovation and deployment    IA    12
Theme Text Understanding – research                      RIA   12
Theme Text Understanding – innovation and deployment     IA    12
Theme Speech – research                                  RIA   12
Theme Speech – innovation and deployment                 IA    12
Theme Infrastructure – support                           CSA   3

Table 1 Different types and number of projects foreseen in the ELE Programme

5.2.2 Budget

As a shared programme between the EU and the participating countries, the final financial setup needs to be discussed between all involved parties. For the EU part of the budget, we suggest the breakdown shown in Table 2. In addition to these investments, which relate to the overarching coordination, research and innovation projects, the participating countries and regions are expected to invest in their languages themselves, while the languages with fragmentary, weak or no technical support can request funding from the European Union (flexible funds, see below).
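The indicative EU budget can be cross-checked with simple arithmetic: multiplying the project counts in Table 1 by the per-project amounts listed in Table 4 should reproduce the theme totals in Table 2. The following is a minimal sketch of that check, not part of the ELE deliverables. The variable and function names are illustrative, and the 20M€ per overall-coordination project is an assumption derived from the 60M€ total in Table 2 divided by the three CSAs in Table 1.

```python
# Indicative EU contribution in millions of euros (M EUR).
# Each entry: (theme, action, type, number of projects, budget per project).
ACTIONS = [
    # 20 per project is derived here (assumption): 60 M EUR in Table 2 / 3 CSAs.
    ("ELE Programme", "overall coordination", "CSA", 3, 20),
    ("Data and Knowledge", "coordination", "CSA", 3, 15),
    ("Language Modelling", "coordination", "CSA", 3, 15),
    ("Language Modelling", "research", "RIA", 15, 5),
    ("Language Modelling", "innovation and deployment", "IA", 15, 5),
    ("Machine Translation", "research", "RIA", 12, 5),
    ("Machine Translation", "innovation and deployment", "IA", 12, 5),
    ("Text Understanding", "research", "RIA", 12, 5),
    ("Text Understanding", "innovation and deployment", "IA", 12, 5),
    ("Speech", "research", "RIA", 12, 5),
    ("Speech", "innovation and deployment", "IA", 12, 5),
    ("Infrastructure", "support", "CSA", 3, 10),
]
FLEXIBLE_FUNDS = 150  # flexible funds for poorly supported languages


def budget_by_theme(actions):
    """Sum count * per-project budget for every theme."""
    totals = {}
    for theme, _action, _type, count, each in actions:
        totals[theme] = totals.get(theme, 0) + count * each
    return totals


totals = budget_by_theme(ACTIONS)
programme_sum = sum(totals.values())
grand_total = programme_sum + FLEXIBLE_FUNDS

for theme, amount in totals.items():
    print(f"{theme:<22}{amount:>5} M EUR")
print(f"{'Sum':<22}{programme_sum:>5} M EUR")
print(f"{'Total incl. flexible':<22}{grand_total:>5} M EUR")
```

Under these figures, the Language Modelling theme comes to 195M€ (45 + 75 + 75), the theme-related sum to 690M€, and the total including flexible funds to 840M€, matching the amounts stated in this section.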
In addition to the sum of 690M€ for the actions implementing the theme-related projects of the ELE Programme, we envisage investing an additional 150M€ as flexible funds for languages with fragmentary, weak or no technical support since we anticipate that a number of participating countries will require complementary funding from the EU. A more detailed breakdown of the different themes with their associated project types and runtime is shown in Table 4.

ELE Programme (overall coordination)     60M€
Theme Data and Knowledge                 45M€
Theme Language Modelling                195M€
Theme Machine Translation               120M€
Theme Text Understanding                120M€
Theme Speech                            120M€
Theme Infrastructure                     30M€
Sum                                     690M€
Flexible funds                          150M€
Total                                   840M€

Table 2 Budget breakdown of the ELE Programme (EU contribution only; numbers are indicative)

The complementary national/regional investments required on the individual language level are difficult to predict. We group the languages into three clusters (see Table 3) and provide indicative investments, which relate to the whole duration of the ELE Programme. Other factors (e.g., number of speakers, etc.) can be taken into account to arrive at more precise numbers.

Languages with weak or no support        40-50M€ each
Languages with fragmentary support       30-40M€ each
Languages with moderate support          20-30M€ each

Table 3 Indicative investments required by language, provided by the participating countries

This language-specific funding is foreseen to be provided by the participating countries. However, the EU should help bootstrap the development of technologies for languages that are not doing well digitally, using the suggested flexible funds.

5.2.3 Timeline

The ELE Programme is foreseen to have a runtime of nine years, divided into three phases of three years each (Table 4). The CSA and RIA projects are expected to run for three years each while the IA projects have a runtime of two years so that they can focus on the innovation and deployment aspects.

Phase 1: 2024-2026. Phase 1 lays a strong foundation for the overall ELE Programme.
All projects start in Phase 1, except for the Innovation Actions.

Phase 2: 2027-2029. Phase 2 drives forward all projects of all types while continuing the Coordination Actions.

Phase 3: 2030-2032. Phase 3 continues the Coordination Actions and finishes off all projects in 2032.

Action                                                   Type  Num.  Each   Sum
ELE Programme – overall coordination                     CSA   3     20M€   60M€
Theme Data and Knowledge – coordination                  CSA   3     15M€   45M€
Theme Language Modelling – coordination                  CSA   3     15M€   45M€
Theme Language Modelling – research                      RIA   15    5M€    75M€
Theme Language Modelling – innovation and deployment     IA    15    5M€    75M€
Theme Machine Translation – research                     RIA   12    5M€    60M€
Theme Machine Translation – innovation and deployment    IA    12    5M€    60M€
Theme Text Understanding – research                      RIA   12    5M€    60M€
Theme Text Understanding – innovation and deployment     IA    12    5M€    60M€
Theme Speech – research                                  RIA   12    5M€    60M€
Theme Speech – innovation and deployment                 IA    12    5M€    60M€
Theme Infrastructure – support                           CSA   3     10M€   30M€
Sum                                                                         690M€
Flexible funds for languages with fragmentary, weak or no technological support  150M€
Total                                                                       840M€

Table 4 Project types, timeline and indicative budget breakdown of the ELE Programme (EU)

5.2.4 Collaborations with Related Initiatives

The ELE Programme complements related initiatives, projects and organisations and it will make use of the services, resources and infrastructures provided by these initiatives. We can group these different stakeholders into several broader categories:

Data spaces and data infrastructures: Various EU/EC Data Spaces including the Common European Language Data Space (LDS), Media Data Space and others; Big Data Value Association (BDVA) and Data, AI and Robotics (DAIRO);⁴ Gaia-X;⁵ International Data Spaces Association (IDSA);⁶ etc.
Research and research data infrastructures: European Open Science Cloud (EOSC);⁷ German National Research Data Infrastructure (NFDI);⁸ CLARIN ERIC;⁹ Research Data Alliance (RDA);¹⁰ etc.
Various AI initiatives: ADRA;¹¹ CLAIRE;¹² LEAM;¹³ HumanE-AI;¹⁴ OpenGPT-X;¹⁵ etc.
AI on Demand Platform: AI-on-Demand Platform;¹⁶ European Language Grid (ELG);¹⁷ etc.
High performance computing: EuroHPC Joint Undertaking;¹⁸ etc.
Standardisation: World Wide Web Consortium (W3C);¹⁹ DIN;²⁰ etc.

4 https://www.bdva.eu, https://www.bdva.eu/DAIRO
5 https://gaia-x.eu
6 https://internationaldataspaces.org
7 https://eosc.eu
8 https://www.nfdi.de
9 https://www.clarin.eu
10 https://www.rd-alliance.org
11 https://adr-association.eu
12 https://claire-ai.org
13 https://leam.ai
14 https://www.humane-ai.eu
15 https://opengpt-x.de
16 https://www.ai4europe.eu
17 https://www.european-language-grid.eu
18 https://eurohpc-ju.europa.eu
19 https://www.w3.org
20 https://www.din.de

6 Concluding Remarks

Large-scale studies such as the META-NET White Paper Series (Rehm and Uszkoreit 2012), the STOA study (STOA 2018) and the ELE language reports (see Chapter 4 for an overview and Chapters 5 to 37 for in-depth analyses) have shown that many languages are in danger of digital extinction because they are not sufficiently supported through Language Technologies. Digital Language Equality is the state of affairs in which all languages have the technological support and situational context necessary for them to continue to exist and to prosper as living languages in the digital age (Chapter 3). In alignment with what the Language Technology community has promoted for more than a decade, the European Parliament adopted a resolution on Language equality in the digital age that suggested initiating a large-scale European LT research, development and innovation programme and to intensify research and funding to achieve Deep Natural Language Understanding and also Digital Language Equality (European Parliament 2018).

Languages are at the heart of every aspect of life. Understanding language is key for building intelligent systems. Over the coming years, AI is expected to transform every industry and society as a whole. There are trends and megatrends that bear closely on digital technologies.
Among others, these include accelerating hyperconnectivity, shifts in the nature of work, increasing digitalisation, new modes of learning, expanding consumerism, novel approaches to politics and governance, changes in healthcare, etc. LT and NLP are, by now, considered important driving forces. Language Technology will play a deciding role in how these unfold.

Language tools and resources have increased and improved since the end of the last century, a process further catalysed by the advent of deep learning and neural networks over the past decade. We find ourselves today in the midst of a significant paradigm shift in LT and language-centric AI. This revolution has brought noteworthy advances to the field along with the promise of substantial breakthroughs in the coming years. However, this transformative technology poses problems from a research advancement, environmental, and ethical perspective. Furthermore, it has also laid bare the acute digital inequality that exists between languages. In fact, many sophisticated NLP systems are unintentionally exacerbating this imbalance due to their reliance on vast quantities of data derived mostly from English-language sources. Other languages lag far behind English in terms of digital presence and even the latter would benefit from greater support. Moreover, the striking asymmetry between official and non-official European languages with respect to available digital resources is very worrisome. The unfortunate truth is that European Language Technology is failing to keep pace with the newfound and rapidly evolving changes in the field.

One need look no further than what is happening today across the diverse topography of state-of-the-art LT and language-centric AI for confirmation of the current linguistic unevenness. The paradox at the heart of recent LT advances is evident in almost every LT discipline.
Our ability to reproduce ever better synthetic voices has improved sharply for well-resourced languages, but dependence on large volumes of high-quality recordings effectively undermines attempts to do the same for low-resource languages. Multilingual NMT systems return demonstrably improved results for low- and zero-resource language pairs, but insufficient model capacity continues to haunt transfer learning because large multilingual datasets are required, forcing researchers to rely on English as the best-resourced language. A similar language discrepancy is also found in several of the domain sectors: medical corpora, models and knowledge bases suffer from this disparity, as do users of under-resourced languages in education, where access to language-related tools is limited for most smaller language communities.

However, this time of transition also represents an opportunity to right the ship. Now is the moment to seek balance between European languages in the digital realm. There are ample reasons for optimism. Although there is more work that can and must be done, Europe’s leading language resource repositories, platforms, libraries, models and benchmarks have begun to make inroads in this regard.

Over the last decade, the community has developed a clear vision of the work needed in the different areas of LT. The ELE project has devised an outline of necessary actions in the form of concrete recommendations. The ELE Programme, specified in the form of the SRIA and roadmap presented in this chapter, will serve as the blueprint for achieving DLE in Europe. While the political and societal goal is reaching full Digital Language Equality across all European languages (and, at the same time, preventing digital extinction of many of our languages in Europe), the scientific goal envisioned to be reached by 2030 is Deep Natural Language Understanding.
Deep Natural Language Understanding is still an open research problem far from being solved since all current approaches have severe limitations. The development of new LT systems would not be possible without sufficient resources (data, experts, compute facilities, etc.). Creation of carefully designed evaluation benchmarks and annotated datasets for every language and domain of application is needed to foster technological progress, while encouraging deeper understanding of the mechanisms by which they are achieved. All these efforts will then lead to long-term progress towards multilingual, efficient, accurate, explainable, ethical and unbiased language understanding and communication, to create transparent digital language equality in Europe in all aspects of society, from government to businesses to the citizens.

We foresee an ELE Programme of nine years (2024-2032). This period will be divided into three phases of three years each, combining coordination actions (CSAs), research actions (RIAs) as well as actions for innovation and deployment (IAs). The whole community, meaning all relevant scientific and industrial stakeholders from all Member States and Associated Countries, needs to be involved. The ELE Programme will tackle the following central themes: Language Modelling, Data and Knowledge, Machine Translation, Text Understanding, and Speech.

As a shared programme between the EU and the participating countries, we suggest an EU budget of 690M€, plus 150M€ of flexible funds to help bootstrap the development of technologies for languages with fragmentary, weak or no technical support. This will be supplemented by national and regional funding.

The ELE Programme is meant to develop into the focal point in which all coordinated developments come together. In this regard, the European Institutions and national as well as regional governments and language institutes must be involved in creating resources, tools and technologies for their own languages.
It is exactly the large scale of the effort that will accelerate the developments and advance the state-of-the-art that will make it possible to join forces that have so far never been joined. This will make it possible to address all European and other relevant lan­guages, all cultures with their particular background and framing of the world, all relevantdomains,andallstakeholdersbymeansofasubstantialnumberofusecases. Weareconvincedthatthisinitiative,builtaroundacoordinatedgiantpoolofshared datasets,openevaluations,opencompetitions,sharedtasks,standardisationefforts, etc. in the literal sense of Open Science, will have a much-needed, substantial and lasting impactin terms of interoperability, development costs, quality and, thus, up-takeofthetrulygame-changingtechnologiesdevelopedintheELEProgramme.Re­searchinEuropemustfocusoncreatingthenewparadigmofLanguageTechnology, fully harnessing the power of current and emerging AI methods that are based on vastdatasetsandknowledgebases.Withaconcertedeffortandsignificantfunding, digitallanguage equality will be achieved, for thebenefit of all Europeans. References Adda, Gilles, Annelies Braffort, Ioana Vasilescu, and François Yvon (2022). Deliverable D1.14 Report on the French Language. European Language Equality (ELE); EU project no. LC­01641480 – 101018166. https://european-language-equality.eu/reports/language-report-fr ench.pdf. Agerri,Rodrigo,EnekoAgirre,ItziarAldabe,NoraAranberri,JoseMariaArriola,AitziberAtutxa, GorkaAzkune,ArantzaCasillas,AinaraEstarrona,AritzFarwell,IakesGoenaga,JosuGoikoet-xea,KoldoGojenola,InmaHernaez,MikelIruskieta,GorkaLabaka,OierLopezdeLacalle,Eva Navas,MaiteOronoz,ArantxaOtegi,AliciaPérez,OlatzPerezdeVinaspre,GermanRigau,Jon Sanchez, Ibon Saratxaga, and Aitor Soroa (2021). Deliverable D1.2 Report on the State of the Art in Language Technology and Language-centric AI.EuropeanLanguageEquality(ELE);EU project no. LC-01641480 –101018166. 
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.