Rhymes and Syntax: A Morpho-Syntactic Analysis of Czech Poetry

Silvie Cinková, Petr Plecháč, Martin Popel

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské nám. 25, 118 00 Praha 1, Czechia; Czech Academy of Sciences, Institute of Czech Literature of the CAS, Na Florenci 3/1420, 110 00 Praha 1, Czechia
https://orcid.org/0000-0003-4526-3915
cinkova@ufal.mff.cuni.cz

Czech Academy of Sciences, Institute of Czech Literature of the CAS, Na Florenci 3/1420, 110 00 Praha 1, Czechia
https://orcid.org/0000-0002-1003-4541
plechac@ucl.cas.cz

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské nám. 25, 118 00 Praha 1, Czechia
https://orcid.org/0000-0002-3628-8419
popel@ufal.mff.cuni.cz

A linguistically informed distant reading presupposes an adequate performance of Natural Language Processing tools. This article describes our evaluation of the UDPipe parser on a manually annotated sample of nineteenth-century Czech poetry in the following steps: (1) creation of a documented data set for this domain (poetry, nineteenth century, Czech); (2) domain-specific annotation decisions; (3) error analysis. The sample consisted of 29 randomly selected poems which were first automatically tagged and parsed with the UDPipe parser and then manually checked word by word. The following features were checked: word segmentation (chunking), lemmatization, part of speech assignment, assignment of more fine-grained morphological details, the position in the syntactic dependency tree (selection of the syntactic parent), as well as the label of the syntactic relation between the word and its parent. The findings were analyzed. The most typical parser errors are associated with complex noun phrases that contain other noun(s) as modifier(s), especially when these occur in a poetry-specific word order, that is, preposed to the governing noun.
On the other hand, neither archaic orthography nor neologisms posed substantial issues.

Keywords: Czech poetry / distant reading / text corpora / Universal Dependencies / natural language processing / treebanks

Primerjalna književnost (Ljubljana) 47.2 (2024)
PKn, letnik 47, št. 2, Ljubljana, avgust 2024

Introduction

Some text-mining use cases benefit from reaching beyond the bag-of-words approach to the extraction of lexical or grammatical patterns.1 This is made possible by automatic morphological tagging and syntactic parsing, wherever such a tool is available for the given language and achieves adequate performance within the given domain. Most parsers are run with language models that have been trained on contemporary non-fiction, and their performance is likely to decrease to the extent that input texts deviate from those models’ domains.

The largest Czech language model used by UDPipe, the best-performing Czech parser (Straka et al.), was trained on Czech daily press of the 1990s (Hajič). At first glance, the main differences between this domain and that of nineteenth- and twentieth-century Czech poetry have to do with vocabulary, orthography, and word order. However, the effect of these differences on parser performance is not predictable. Parser performance can only be measured, and the most typical errors detected, by manually annotating a random sample and comparing it with the automatic output. While this work is time-consuming, the domain-specific annotated data could be added to the original model to increase performance on this new domain in the future, although this goal may turn out to require several iterations of additional annotation. In our experiments, we use the largest model, czech-pdt-ud-2.12-230717, and a smaller model based on fiction, czech-fictree-ud-2.12-230717.
Data

The data set comprises 29 randomly selected Czech poems from PoeTree (Plecháč et al.; Plecháč and Kolár), with a total of 6,643 tokens and 2,687 types (unique words). Most of the poems were written at the turn of the nineteenth century. About half of the represented poets belong to the Czech high-school literary canon. Most poems are rhymed. Figure 1 shows the publication date of each poem along with its author’s lifespan.

1 The work on this article has been supported by the Czech Science Foundation grant European Poetry: Distant Reading 23-07727S. We have also been using data, tools, and services provided by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz), supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).

Figure 1: The PoeTree Czech sample: authors’ life spans and poems’ publication years.

Diachronic and stylistic language perspective

The oldest poem (1805) was written during (and in the language of) the Czech National Revival, and is therefore quite different from the later poems. Most of the nineteenth-century poems are written in somewhat modern Czech, that is, in the Czech language as it was re-established after more than a century of Germanized education and at an advanced stage of efforts to integrate the norms of a written Czech no longer in use with the spoken vernacular of the time, which was naturally perceived as low standard. The twentieth-century poems can be considered representative of (a very marked stylistic register of) truly modern Czech. The entire nineteenth century saw competing progressive as well as regressive normative trends, with the variation in poetry further augmented by a rapid increase in poetic experimentation and manneristic personal style distinctions (Šlosar).
Habitual modes of linguistic periodization, as a consequence, are not very helpful in the case of this poetry sample. Despite all this variation, we can still track several recurring differences between contemporary Czech prose and the language observed in this sample. This section lists a few of the resulting annotation decisions.

Spelling

Spelling variation can be found in both word stems and morphemes. In order to enable searching across different diachronic layers without altering words, we preserved the token forms while normalizing lemmas, wherever possible, to contemporary spelling variants. For instance, rather than transcribing nervosníma as nervózními, we lemmatized using the current term nervózní. Whenever the modern equivalent was not instantly apparent or the word had undergone more substantial morphological changes, such as s křeku (z keříku ‘from a bush’) or junoše (jinoch ‘lad’) in the 1805 poem, we left the lemmas intact.

A prominent feature of Czech word formation, and one that presents a particularly difficult and longstanding obstacle for language processing, is compound function words. The compounding of prepositions with other parts of speech, especially nouns and adverbs, produces adverbs, particles, conjunctions, and prepositions that are written at times as discrete words and at others as prefixes, according to numerous rules with numerous exceptions (Osolsobě). This poses challenges and spelling issues for Natural Language Processing and, indeed, even for educated native speakers (Žižková). Many of these words can be found in the basic vocabulary, such as na shledanou ‘good bye’ and zpočátku ‘initially.’ Throughout the nineteenth century, little attention was paid to graphical word boundaries in general, although partial and mutually contradictory recommendations existed in grammar books. This had various consequences.
For instance, the first generation of revivalists treated compound function words with complex rules depending mostly on the word formation type of the noun or adjective that followed (Dobrovský), while a later generation of grammarians tended to decompose them into discrete words (Kampelík). The unmanageable spelling variation in the compound function words in our sample hampered lemma normalization. Whenever a compound function word consisted of two tokens, it was annotated as a syntactic relation between two words.

Punctuation and sentence splitting

The sample displays certain punctuation peculiarities: some poems combine the usual punctuation principle (syntactic segmentation) with the highlighting of rhetorical pauses (see Examples 1 and 2 below) in the manner of classic public speakers’ speech notes (Kosek and Pleskalová).

(1) Ale ty oči! oči smilnící!
    (But those eyes! those fornicating eyes!)2

(2) Deset let už o tom píše, pan Vejr v Švandě Dudákovi.
    (For a decade he has been writing about this, Mr. Vejr in Schwanda the Bagpiper.)

Even where the punctuation follows the syntactic (rather than rhetorical) principle, that is, where it separates clauses (as well as conjuncts and appositions) and thus marks clause boundaries fairly well, sentence boundaries remain blurry. This applies both to poems without enjambement, where clause boundaries do not tend to cross verse boundaries (typically trochaic verses, such as the extract from K. H. Mácha’s Prolog k pouti Krkonošské, transl. Prologue to the Riesengebirge Pilgrimage, in Example 3 below), and to poems with long-winded clauses (often verses in prose, such as J. Karásek ze Lvovic’s Nad obrazem Marie Magdaleny v hradčanské Loretě, transl. Over the Painting of Mary Magdalene in the Hradschin Loretta, in Example 4).
The short paratactic clauses evoke a swift narration pace in which sentence boundaries do not play a role, while the syntactically long-winded clauses evoke an agitated stream of consciousness, which nevertheless unfolds within a solid syntactic scaffolding.

(3) Víc a více světnice se plní,
    Hovor hlučí, kouř se z dýmek vlní;
    Při stropu ho plamínku zář zlatí.

    (Gradually the room is getting crowded,
    The talk is loud, smoke is curling from pipes;
    The glow of small flames gilds the ceiling.)

(4) V starobných ambitech, kde ztuhlá světic ctnost
    V škrobených límcích španělských se vztyčuje,
    Kde marně Šebestián sličný
    Kypící nahotu,
    Drážděné šípem genitálie ukazuje,
    Aby sváděl ctnostně odvrácené zraky,
    Ty potměšilou vlnu ňader,
    Tak měkce tajících,
    Teď svůdně rozléváš,
    Své tělo zase obnažuješ,
    Tolika tknuté milenci,
    A dráždíš pletí tolika ústy zlíbanou
    A klínem, tolika vášněmi rozrývaným,
    A očima samice, očima smilnýma.

    (In the age-old ambits, where the stiff virtue of the saint rises
    In starched Spanish collars,
    Where in vain Sebestian, handsome,
    Vibrant nudity,
    His arrow-teased genitals puts on display,
    To tempt the virtuously averted eyes,
    You luscious wave of breasts,
    So softly melting, now seductively spill,
    Your body is exposed again, touched by so many lovers,
    And you tease with your skin kissed by so many mouths
    And a lap, torn by so many passions,
    And with the eyes of a female,
    With fornicating eyes.)

2 All translations of poem samples are by Peter Gaffney.

Our strategy for annotating sentence segmentation was to make the sentences as coherent as possible, that is, to keep the number of stand-alone clauses with major ellipses to a minimum.
Lexical perspective

Concerning vocabulary, some patterns of differences between PoeTree and the relevant Czech treebanks (UD_Czech-PDT and UD_Czech-FicTree, henceforth PDT and FicTree) were predictable, namely archaic word forms (jest ‘is’; kdys ‘long ago’), archaic words (junoše ‘young lad’), Latin words (Ave; absolvo), and neologisms (čaroskvělý ‘miraculously magnificent’).

From the quantitative perspective, the overlap between case-insensitive types (unique word forms) in the PoeTree sample and the training data sets of PDT and FicTree is approximately 59% and 47% respectively, excluding proper nouns, punctuation, and symbols. That means the UDPipe parser has never seen about one half of the words that occur in PoeTree, using either model. To allow for more qualified guesses about domain-adaptation requirements, we extracted a frequency list of all PoeTree types missing in PDT and a frequency list of all PoeTree types missing in FicTree. We compared the distributions of these types, as well as the parts of speech they belong to. In both groups, the top-ranking PoeTree-specific types belong to the following parts of speech as defined by the Universal Dependencies (UD) tagging scheme: nouns, verbs, adjectives, adverbs, determiners, and pronouns. Even though the lower-ranking parts of speech ranked differently, there was no statistically significant difference between their distributions (Fisher’s exact test for count data, p-value = 1).

In the next step, we extracted the symmetric difference of both lists (PoeTree-specific types that were missing either in PDT or in FicTree but not in both), corresponding to 21% of their union (PoeTree-specific types missing either in PDT or FicTree or in both). Of a total of 557 such PoeTree-specific types, 136 were missing from PDT and 421 from FicTree. 370 of these types occurred only once in PoeTree (Figure 2).
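The set operations behind this comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts used for the article: the function names and the toy vocabularies are hypothetical, and the filtering of proper nouns, punctuation, and symbols is assumed to have happened beforehand.

```python
def type_set(tokens):
    """Case-insensitive set of word types (unique word forms)."""
    return {token.lower() for token in tokens}


def coverage(sample_types, reference_types):
    """Share of sample types that also occur in the reference vocabulary."""
    return len(sample_types & reference_types) / len(sample_types)


def compare_missing(sample_types, ref_a, ref_b):
    """Compare the sample types missing from two reference vocabularies."""
    missing_a = sample_types - ref_a
    missing_b = sample_types - ref_b
    union = missing_a | missing_b          # missing in A, in B, or in both
    symmetric = missing_a ^ missing_b      # missing in exactly one of them
    return {
        "missing_in_both": missing_a & missing_b,
        "only_missing_in_a": missing_a - missing_b,
        "only_missing_in_b": missing_b - missing_a,
        "symmetric_share_of_union": len(symmetric) / len(union) if union else 0.0,
    }


# Toy vocabularies (hypothetical, for illustration only).
poetry = type_set(["Ty", "tobě", "jest", "ku", "kdys", "oko"])
news = type_set(["jest", "ku", "oko"])
fiction = type_set(["ty", "tobě", "oko"])

result = compare_missing(poetry, news, fiction)
```

In this toy case, `coverage(poetry, news)` is 0.5, the only type missing from both references is `kdys`, and the symmetric difference makes up four of the five types in the union.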
At this point we resorted to qualitative analysis.

Figure 2: In which reference corpus are these PoeTree-specific words missing?

The list of missing types contained 20 types that occurred at least four times in PoeTree. Only one of them was missing in PDT: tobě (‘you’ in the dative singular). Among those missing in FicTree were jest, the archaic third-person singular form of být ‘to be,’ and ku, the vocalized form of the preposition k ‘towards.’ Other words with a minimum frequency of 2 (down to Rank 87) were mostly missing in FicTree, probably because FicTree is smaller than PDT (166K vs. 1M tokens). They did not seem to follow any interesting lexical or morphological pattern that would help distinguish FicTree from PoeTree.

By contrast, a particularly striking pattern emerges when PoeTree is compared to PDT. Here, PDT appears to have a bias against the second person singular: of 44 verb types in PoeTree that were specifically missing in PDT, six were in second-person singular form, as opposed to only two of 143 verb types missing in FicTree. Even more strikingly, of the nine PoeTree-specific pronouns and determiners, five were second-person singular words and all were absent in PDT. They even turned out to be among the most frequent PoeTree types, which is not typical for pronouns in a pro-drop language.

Indeed, a search through the entire PDT suggested a noticeable difference in the frequency of the second person singular in PDT and in PoeTree: it detected only 45 occurrences of the singular ty ‘you’ (compared to 77 in PoeTree), nine occurrences of the singular tvůj ‘your’ (also nine in PoeTree), and 79 occurrences of the verb být ‘to be’ in the second person singular (compared to 30 in PoeTree). (It should be noted that the conjugated být acts as an auxiliary verb in the past and imperfective future tenses.)
It also detected 261 verbs in the present tense or imperative in the second person singular (101 in PoeTree), of which 95 in PDT were the fixed expression viz (‘see,’ as in ‘cross reference’ or ‘cf.’).

Finally, we listed types missing in both PDT and FicTree. Among the most frequent types (four to six occurrences) were the archaic forms kdys (kdysi ‘long ago’), přec (přece ‘yet’ or ‘nevertheless’), by (aby, a polysemous subordinator), chcem (chceme ‘we want’), and jich (jejich ‘their’). The most frequent universal part of speech (UPOS) among the hapaxes was noun, followed by verb and adjective (420, 297, and 261 occurrences respectively). Many of them were rare words or neologisms, and those belonging to common vocabulary were often either in archaic or otherwise marked forms (or second person), forming no other apparent pattern.

Figure 3: Distribution of PoeTree-specific words missing in both PDT and FicTree.

Syntactic perspective

While we did not make any a priori decisions concerning syntactic dependencies, we did make assumptions about how word order was likely to diverge from what is usual in modern prose or non-fiction treebanks. The observed differences are described in detail in Section 7.

The annotation process

We pre-processed the sample with what was at the time the most recent version of the largest Czech language model, czech-pdt-ud-2.12-220711 (Straka), employed in the UDPipe parser (Straka et al.; Straka and Straková). One annotator edited the automatic annotation node by node to come as close as possible to a manual annotation made from scratch. The resulting data set was published under the title UD_Czech-Poetry in Release 2.13 of the UD corpora in the LINDAT-CLARIAH repository (http://hdl.handle.net/11234/1-5287).
Evaluation results

We evaluated UDPipe’s performance on the sample by comparing the manual annotation with the output of the UDPipe models based on PDT and FicTree, and then drilled into more detail using several analytical scripts in Udapi (Popel et al.). We also carried out a manual error analysis.

Figure 4 presents the performance of UDPipe-PDT and UDPipe-FicTree operationalized by ten standard metrics (Kübler et al. 79–86; Zeman et al., “CoNLL”). Their values are measured as Precision (percentage of instances predicted by the parser that are correct), Recall (percentage of instances of gold annotation correctly predicted by the parser), and F1 (harmonic mean of Precision and Recall). They are plotted as points of three different shapes. The first six metrics are self-explanatory, with AllTags showing the performance on morphological tagging (disregarding syntactic dependencies). The metrics UAS, LAS, MLAS, and BLEX consider each token in relation to its parent. UAS (Unlabeled Attachment Score) concentrates purely on the tree topology, which means that it only observes whether the given token is governed by the right parent. LAS (Labeled Attachment Score) considers the dependency label of the given token as well (that is, the relation to its parent). MLAS (Morphology-Aware Labeled Attachment Score) adds UPOS and Features to the considerations. BLEX (Bilexical Dependency Score) combines content-word relations with lemmatization (but not with tags or features). The plot also shows the performance of the respective models (F1 Score) on their regular test data sets as colored bars.

The performance values of both models on their regular test data sets are well above 95%. On PoeTree, the performance is generally worse, by the largest margin in Sentences and MLAS.
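To make the attachment metrics concrete, the following Python sketch computes UAS, LAS, and the Precision/Recall/F1 triple on a toy sentence. It is a simplified illustration, not the official evaluation script: it assumes that gold and automatic annotation share the same tokenization (the official UD evaluation also aligns differing tokenizations), and all function names and example values are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two token-aligned lists of (head, deprel) pairs.

    Assumption: both annotations have identical tokenization, so the
    i-th gold token corresponds to the i-th predicted token.
    """
    if len(gold) != len(predicted):
        raise ValueError("annotations must be aligned token by token")
    pairs = list(zip(gold, predicted))
    uas_hits = sum(g[0] == p[0] for g, p in pairs)   # correct parent only
    las_hits = sum(g == p for g, p in pairs)         # correct parent and label
    return uas_hits / len(pairs), las_hits / len(pairs)


def precision_recall_f1(correct, n_predicted, n_gold):
    """Precision, Recall and their harmonic mean (F1) from raw counts."""
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy sentence with four tokens: (index of parent, dependency label).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
auto = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "obl")]

uas, las = attachment_scores(gold, auto)
p, r, f1 = precision_recall_f1(correct=3, n_predicted=4, n_gold=5)
```

Here three of the four tokens have the right parent (UAS 0.75), but only two also carry the right label (LAS 0.5). Precision and Recall only differ when the parser predicts a different number of units than the gold annotation contains, as happens with tokens, words, and sentences.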
UDPipe-PDT and UDPipe-FicTree perform very similarly on PoeTree, with UDPipe-PDT scoring slightly better than UDPipe-FicTree in general and even substantially better in Sentences, UFeats, and MLAS. Therefore, UDPipe-PDT appears to be the parser of first choice for PoeTree, and we limit manual error analysis in the next sections to the output of UDPipe-PDT.

Error analysis

Figure 4 reveals that the lowest scoring metric is Sentences, that is, the recognition of sentence boundaries. This is neither surprising, given our a priori observations on punctuation and sentence splitting, nor particularly interesting, since sentence boundaries in poetry are often disputable even to a human annotator. For further error analysis, we have therefore aligned the manual and automatic annotation word by word and re-segmented the automatic annotation into matching chunks of text. In this setup, we counted and classified the mismatches between manual and automatic annotation.

The most frequent error is the choice of parent (546, that is, ca. 8% of tokens), of which 395 are not combined with any labeling error. This also corresponds to Figure 4, where the second lowest scoring metric is MLAS, the combination of tree topology (choice of parent) and the syntactic and morphological labels of the given token. It also confirms that topological errors cannot be blamed on sentence splitting alone.

Of the 50 most frequent errors listed by the official UD evaluation script (Straka and Popel), 26 are dependency-labeling (deprel) errors, 13 are tokenization errors in thus far unseen contracted forms with unstable orthography, 11 are feature-labeling errors (UFeats), four are part-of-speech errors, and four are lemma errors. The lemma errors revolve around the so-called canonical number for the base form in pronouns (e.g., náš ‘our’ lemmatized as náš ‘our’ vs. můj ‘my’) and reveal the general need for permanent data harmonization against the ultimate morphological lexicon (Hajič et al., “MorfFlex”).
The most frequent UPOS error (24 occurrences, or 0.4%) concerned the blurry distinction between adverbs and particles (also suggesting inconsistencies in the manual annotation of different corpora), and the similarly blurry distinction between coordinating conjunctions used within a single sentence and those used across sentence boundaries (to be marked as sentential adverbs).

We also found that the most frequent feature errors were not really errors but innovations encouraged by the Czech UD coordinators: to date, homonymous word forms have not been disambiguated in the PDT and FicTree data sets (e.g., Gender=Fem,Neut), unlike in the PoeTree sample (Gender=Fem or Gender=Neut). These two approaches differ in the extent of contextualization. While the earlier approach deliberately relied on as little context as possible, the more recent developments in machine learning are likely to master context-based morphological disambiguation even across sentence boundaries. A prominent example of this change might be the disambiguation of active verb participles (used to form the past tense): the neuter plural is homonymous with the feminine singular, and Czech is a pro-drop language, which means that the coreferential antecedent of the dropped subject must often be tracked back across sentence boundaries.

Figure 4: Model evaluation.

Since the aforementioned errors are not entirely errors, occur only rarely, or can be automatically corrected in the model training data, after which they are likely to be predicted correctly, what remains is tree topology and dependency labeling (syntactic parsing). Focusing on dependencies also makes sense, given that syntactic dependencies represent one of the advantages of extracting information from treebanks compared to carrying out linear searches.
In the context of an information-extraction use case, we find it appropriate to emphasize errors in phenomena that are likely to hamper the extraction of relevant patterns (such as convoluted attributive structures) over errors that may be frequent but do not necessarily affect rule-based extraction of noun modifiers or predicates and their arguments and adjuncts. Such largely irrelevant errors may involve punctuation, coordination vs. parataxis mismatches, or inconsistent labeling of prepositional noun modifiers, such as we find with nmod (nominal modifier) vs. obl (oblique nominal).

Most prominent parsing errors

Labeling confusion as weighted degree centrality in a network of labels

The LAS results are best explained as a directed network graph (Figure 5) of dependency labels (deprels), with emphasis on their weighted degree centrality. Each edge connects a source node (human-assigned deprel) with a target node (deprel automatically assigned by UDPipe) on the same token, with the frequency of the given deprel combination in the same source-target direction increasing the edge weight. The number of outgoing edges along with their weights constitutes the weighted out-degree centrality of each deprel.

In this scheme, deprels can have out-degree centrality only when used in the gold annotation, whereas they only have in-degree centrality when they are used in the automatic parsing. Hence, the top-ranking deprels listed in Table 1 and highlighted in Figure 5 are gold-annotation deprels on tokens that UDPipe both mislabeled and attached to the wrong parent. Each node in this graph represents one syntactic label. The nodes are connected with directed edges (arrows). Each arrow starts at a label in the gold annotation and points to its mismatched label in the automatic annotation. Dotted edges connect the top 20% of gold-annotation labels with the most frequent mismatches (totaled across all mismatched labels).
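The weighted degree centralities described above can be derived directly from a list of label pairs. The Python sketch below is an illustration under assumptions, not the code behind Figure 5: it expects pairs of (gold deprel, automatic deprel) that have already been restricted to tokens with an incorrect parent, and the function names and toy pairs are hypothetical.

```python
from collections import Counter


def confusion_edges(label_pairs):
    """Count directed edges gold_deprel -> predicted_deprel for mismatches.

    Assumption: label_pairs holds one (gold, predicted) pair per token and
    has been pre-filtered to tokens attached to the wrong parent.
    """
    return Counter((gold, pred) for gold, pred in label_pairs if gold != pred)


def weighted_out_degree(edges):
    """Sum of outgoing edge weights per gold label."""
    totals = Counter()
    for (gold, _pred), weight in edges.items():
        totals[gold] += weight
    return totals


def weighted_in_degree(edges):
    """Sum of incoming edge weights per automatically assigned label."""
    totals = Counter()
    for (_gold, pred), weight in edges.items():
        totals[pred] += weight
    return totals


# Toy label pairs (hypothetical counts, for illustration only).
pairs = [
    ("nmod", "obl"), ("nmod", "obl"), ("nmod", "nsubj"),
    ("obl", "nmod"), ("root", "conj"), ("conj", "conj"),
]

edges = confusion_edges(pairs)
out_degree = weighted_out_degree(edges)
in_degree = weighted_in_degree(edges)
```

In the toy data, `nmod` has the highest weighted out-degree (3), meaning it is the gold label most often replaced by something else, while `obl` has the highest weighted in-degree (2), meaning it is the label the parser most often assigns by mistake. The matching pair `("conj", "conj")` contributes no edge.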
The thick highlighted source-target edges in Figure 5 convey which deprels are frequently confused in both directions, such as, for instance, nmod and obl (ranking 1 and 2 in Table 1). Both denote a noun or noun group, possibly introduced by a preposition. This modifier is called nmod (nominal modifier) when modifying a noun, such as John in letter to John, but obl (oblique nominal) when modifying a verb, such as in write to John. In a vague context such as write a letter to John, the modifier John can be attached to either, while in give the letter from/by Mary to John, we would rather attach Mary to the noun letter as nmod than to the verb give as obl.

It does not come as a surprise that conj (conjunct) and root are strongly interconnected in both directions: in complex sentences with several clauses, the parser easily fails to identify the main predicate. Quite symptomatically, root is also connected with advcl (adverbial clause modifier, a subordinate clause), parataxis (coordination of two main clauses without a conjunction), and orphan (clause with an elided predicate). The strong associations of conj with obl, nsubj (nominal subject), and, to a lesser extent, obj (direct object), indicate misrecognized coordinations of nouns.

The second strongest association with root is nsubj, which can be easily accounted for by the fact that the UD scheme prefers content words as parents of function words (e.g., nouns govern prepositions), while at the same time regarding copula verbs as auxiliary words. In copula predicates, therefore, the root is the predicate noun (Figure 6), which may be confused in turn with the subject (nsubj).

Ultimately, the strongest confusion emerges between the aforementioned nmod and obl.
Since our statistic considers only labeling mismatches on incorrectly attached nodes, we can generally assume that nmod cases in the automatic sample are governed by nouns (since the parser has learned that nmod only modifies nouns) and obl cases are governed by verbs. This implies that a fraction of nouns in attributive positions or positions of verb arguments or adjuncts will be systematically lost when searching the poetry data with syntactic corpus queries. It is worthwhile to investigate whether this confusion occurs randomly or according to a pattern.

deprel   weighted out-degree centrality   parent UPOS NOUN   parent UPOS VERB
nmod     68                               +                  -
obl      47                               -                  +
root     43                               -                  -
nsubj    32                               -                  +
conj     31                               +                  +
advcl    29                               -                  +
amod     28                               +                  -

Table 1: Highest weighted out-degree centrality. The last two columns illustrate the possible distribution of the deprels among nouns and verbs as parent tokens.

Figure 5: The most prominent labeling errors in a network graph of tokens with the wrong parent as well as the wrong dependency label.

Figure 6: Copula predicate. The predicate noun and hence the sentence root is sick, while she is the subject (Marneffe et al.).

Labeling errors on nouns in attributive position

The high weighted out-degree centrality of nmod means that UDPipe kept assigning other labels to nodes that should have been nmod. It hence makes sense to examine errors from the perspective of gold data, that is, to concentrate on nouns and their noun attributes.

When concentrating on nouns and their attributes, we get the following picture: the PoeTree sample contains 501 cases of attributive nouns (i.e., nmod). Of those, 169 (34%) were attached to an incorrect parent. Of 117 attributive nouns in a prepositional case, 49 (41%) were misrecognized. Of 384 attributive nouns in a direct case, 120 (31%) were misrecognized.
However, when the case was genitive and the attribute noun was preposed, as many as 51 of 52 cases (98%) were misrecognized.

Comparing that with adjectival attributes, we observed only 117 incorrect parent attachments out of 877 (13%). When the adjective preceded the noun, 63 of 575 attributes (11%) were incorrectly attached; when it followed the noun, the number was 54 of 302 (17%). Parser performance decreased as the distance between tokens increased. However, the data was too sparse for this trend to be statistically significant (for a token distance of 3 or more, there were 19 errors in 25 cases if postposed, and 6 in 36 if preposed).

Preposed genitive attributes

The analysis has shown that the parser failed most dramatically with preposed genitive attributes, apparently because it had hardly ever encountered them in the training data.

In current Czech prose, it is not uncommon for noun attributes to take the form of another noun in the genitive case. The noun in the genitive often denotes the agent or patient of an event (hledání odpovědi ‘the search for an answer’), the owner or bearer (planeta opic ‘planet of the apes’), or a quantified mass or set (pytel brambor ‘sack of potatoes’). Nevertheless, in all these cases, the genitive follows the head noun. In the entire PDT, there are only two kinds of cases of a preposed genitive attribute. One is the lexicalized expression svého druhu ‘of sorts’; the other concerns attributive nouns modified by a cardinal numeral, which in Czech requires the genitive of the governing noun (Figure 7). In this last case, one could argue that the word order is slightly marked, emphasizing the amount, whereas in the unmarked order the genitive noun follows the head noun (see Section 8.1).

In poetry, on the other hand, the preposed genitive attribute is a legitimate structure, given its 10% proportion of all attributive nouns in our sample.
This alone, given the structure’s 98% error rate, calls for a domain adaptation of the language model to poetry.

Figure 7: Preposed noun genitive in current Czech. The numbers 4 and 2 mean accusative and genitive.

Even with current Czech, the parser gets confused (Figure 7), regarding both poplatek ‘fee’ and korun ‘crowns’ as verb arguments, namely as two direct objects (obj), which the annotation scheme does not allow (a clause can have only one subject and one direct object). Errors of this kind appear systematically in the poetry data (Figure 8).

Figure 8: Preposed genitive attribute in poetry.

As Figures 9, 10, and 11 show, virtually any clause chunk can land between the preposed genitive attribute and its head noun, resulting in additional parsing errors in the vicinity. This is why UDPipe again mistakes the genitive attribute, in Figure 10, for a second subject alongside its head noun (the true subject). In Figure 9, UDPipe (bottom line) has not recognized any syntactic dependency relation between the genitive (Madonna’s) and the head noun (face). The same applies to Figure 11, where it has missed the relation between hair (in the genitive) and flood.

Figure 9: Discontinuous attributive sequence.

Figure 10: A whole clause between attributive genitive and its head noun.

Figure 11: Very disrupted parsing.

Comparison of parsing errors in PoeTree and PDT

Qualitative findings from the previous section suggest that the specific constraints on prosody and meter may require poetic texts to allow longer distances between tokens and their modifiers (edge lengths, measured in tokens), as well as specific word order patterns.
This section investigates the edge lengths of several frequent syntactic dependencies, the order of their members, and the performance of the UDPipe-PDT model on two data sets: the PoeTree sample and the PDT-test set (on which the performance of UDPipe-PDT was measured).

Performance on preposed genitive attributes

As Figure 12 shows, most preposed genitive attributes occur immediately before the head noun, within a maximum distance (edge length) of -6 in PDT-test and -5 in PoeTree. The red bars in the blue-red pairs are lower than the blue ones in both PDT-test and PoeTree, but the difference is smaller in PDT than in PoeTree, which implies higher recall in PDT than in PoeTree. At the same time, orphaned red bars occur to the right of zero in both data sets. These are precision errors, and they are markedly fewer in the PDT-test sample.

By and large, the distributions of edge lengths are almost identical. At this point we should note that in PDT, unlike PoeTree, the preposed genitives are the product of grammatical congruence with a genitive-requiring cardinal numeral (the nouns involved denote containers, substances, currencies, or metric units; see Section 7.3). It therefore comes as no surprise that UDPipe processes them much better in PDT than in PoeTree. Even when we disregard the lexical patterns and look at raw frequencies, the preposed genitives are clearly overrepresented in poetic texts compared to PDT: the PDT-test set is almost 27 times larger than the PoeTree sample, yet it contains only approximately twice as many preposed genitives.

Figure 12: UDPipe-PDT’s performance on the preposed genitive attribute in PoeTree and PDT-test.
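The bookkeeping behind a comparison of this kind can be illustrated with a minimal sketch. It is not the authors' evaluation code: the token dictionaries, the helper names (`make_sentence`, `genitive_attribute_edges`, `edge_length`, `compare`) and the five-token toy sentence are hypothetical and serve only to show the principle. The sign convention (child position minus head position, so that preposed attributes get negative values) is an assumption chosen to match the negative edge lengths reported above.

```python
from collections import Counter

KEYS = ("id", "form", "upos", "case", "head", "deprel")

def make_sentence(rows):
    """Turn (id, form, upos, case, head, deprel) tuples into token dicts."""
    return [dict(zip(KEYS, row)) for row in rows]

def genitive_attribute_edges(sentence):
    """Collect (child_id, head_id) for genitive nouns attached as nmod to a noun."""
    by_id = {tok["id"]: tok for tok in sentence}
    edges = set()
    for tok in sentence:
        head = by_id.get(tok["head"])  # None for the root (head 0)
        if (head is not None and head["upos"] == "NOUN"
                and tok["upos"] == "NOUN" and tok["case"] == "Gen"
                and tok["deprel"] == "nmod"):
            edges.add((tok["id"], tok["head"]))
    return edges

def edge_length(edge):
    """Signed distance in tokens; negative means the attribute precedes its head."""
    child_id, head_id = edge
    return child_id - head_id

def compare(gold_sentence, predicted_sentence):
    """Histogram gold, correctly predicted and spurious edges by edge length."""
    gold = genitive_attribute_edges(gold_sentence)
    predicted = genitive_attribute_edges(predicted_sentence)
    correct = gold & predicted
    return {
        "gold_hist": Counter(edge_length(e) for e in gold),
        "correct_hist": Counter(edge_length(e) for e in correct),
        "spurious_hist": Counter(edge_length(e) for e in predicted - gold),
        "recall": len(correct) / len(gold) if gold else None,
        "precision": len(correct) / len(predicted) if predicted else None,
    }

# Invented toy sentence (not taken from PoeTree or PDT).
gold = make_sentence([
    (1, "vlasů", "NOUN", "Gen", 2, "nmod"),    # preposed genitive attribute
    (2, "záplava", "NOUN", "Nom", 3, "nsubj"),
    (3, "skryla", "VERB", None, 0, "root"),
    (4, "pytel", "NOUN", "Acc", 3, "obj"),
    (5, "brambor", "NOUN", "Gen", 4, "nmod"),  # postposed genitive attribute
])
predicted = make_sentence([
    (1, "vlasů", "NOUN", "Gen", 3, "obj"),     # misattached to the verb
    (2, "záplava", "NOUN", "Nom", 3, "nsubj"),
    (3, "skryla", "VERB", None, 0, "root"),
    (4, "pytel", "NOUN", "Acc", 3, "obj"),
    (5, "brambor", "NOUN", "Gen", 4, "nmod"),
])

result = compare(gold, predicted)
print(result["recall"], result["precision"])  # 0.5 1.0
```

In this toy case the postposed genitive is found and the preposed one is lost, so the error shows up as a recall loss at edge length -1, while precision is unaffected because the parser proposes no spurious genitive attribute.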
Performance on any noun attribute

Generally speaking, noun attributes preceding nouns are overrepresented in poetic texts. UDPipe has a precision issue with postposed noun attributes in PoeTree and a recall issue with preposed noun attributes (slightly more than half of the approximately 50 items immediately before head nouns would be genitives).

Figure 13: UDPipe-PDT’s performance on noun attributes in PoeTree and PDT-test.

Performance on adjective attributes

Adjective attributes appear to be relatively more frequent in PoeTree than in PDT, and UDPipe-PDT processes them well, although its overall performance on PoeTree is slightly lower than on PDT-test, with errors in both precision and recall, both to the left and to the right of the governing noun.

Figure 14: UDPipe-PDT’s performance on adjective attributes in PoeTree and PDT-test.

Performance on clause subjects

The distribution of edge lengths for the clause subject is apparently identical in both data sets. On PDT-test, UDPipe-PDT tends to produce precision errors, while on PoeTree both error types occur. Interestingly, the overall performance appears slightly higher on PoeTree.

Figure 15: UDPipe-PDT’s performance on subjects in PoeTree and PDT-test.

Performance on direct objects

Direct objects apparently occur more often immediately before their governing verb in PoeTree than in PDT-test. UDPipe-PDT performs slightly worse on PoeTree than on PDT-test in all positions, but the difference is not dramatic.

Figure 16: UDPipe-PDT’s performance on direct objects.
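Because the two data sets differ greatly in size, statements such as "apparently identical" or "relatively more frequent" concern relative rather than raw frequencies. The following sketch shows one possible way to make such a comparison explicit; it is an illustration under our own assumptions, not a procedure used in the study. The function names are hypothetical, the two lists of edge lengths are invented, and the total variation distance is merely one convenient summary of how far two distributions are apart.

```python
from collections import Counter

def relative_frequencies(edge_lengths):
    """Map each signed edge length to its share of all edges in the sample."""
    counts = Counter(edge_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def total_variation_distance(p, q):
    """0.0 for identical distributions, 1.0 for distributions with no overlap."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def share_preposed(edge_lengths):
    """Share of dependents that precede their head (negative edge length)."""
    return sum(1 for d in edge_lengths if d < 0) / len(edge_lengths)

# Invented edge lengths for a small and a larger sample (not real data).
small_sample = [-1, -1, 1, 2]
large_sample = [-1, -1, 1, 1, 1, 1, 2, 2]

distance = total_variation_distance(
    relative_frequencies(small_sample),
    relative_frequencies(large_sample),
)
print(distance)                      # 0.25
print(share_preposed(small_sample))  # 0.5
print(share_preposed(large_sample))  # 0.25
```

Normalizing before comparing means that a sample twice the size with the same shape yields a distance of zero, which is the sense in which distributions from samples of very different sizes can be called identical.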
Performance on prepositional objects and adverbials

Distributions are similar for prepositional objects and adverbials, with one interesting observation: objects immediately following the verb are rather rare, especially in PDT-test, and UDPipe-PDT has a severe precision problem (too many false positives) on both data sets, across positions.

Figure 17: UDPipe-PDT’s performance on nouns with prepositions modifying verbs.

Discussion and conclusion

We have evaluated the performance of the UDPipe parser with the largest Czech model, based on the Prague Dependency Treebank (Hajič et al., “Prague”) converted to Universal Dependencies (Zeman et al., “Universal”), and performed a semi-manual error analysis focused on the parts of speech and dependency relations that are most likely to occur in corpus queries used to extract information from texts in text-mining or distant reading research tasks.

Czech poetry makes ample use of the free word order that is a feature of the Czech language. Hence, PoeTree contains structures that do not normally occur in current Czech prose, and UDPipe-PDT fails to parse them correctly because it has not encountered them in the training data. These structures are not random but recurrent. It is therefore important, first, to identify and tackle them as parsing issues and, second, to provide manually annotated data to the UDPipe model training pipeline to improve UDPipe’s performance on poetry.

WORKS CITED

Dobrovský, Josef. Ausführliches Lehrgebäude der Böhmischen Sprache, zur gründlichen Erlernung derselben für Deutsche, zur vollkommenern Kenntniß für Böhmen. Prague, Johann Herrl, 1809.

Hajič, Jan. “Complex Corpus Annotation: The Prague Dependency Treebank.” Jazykovedný ústav L. Štúra, SAV, 2004, https://ufal.mff.cuni.cz/pdt2.0/publications/Hajic2004.pdf. Accessed 24 Jan. 2024.

Hajič, Jan, et al. “MorfFlex CZ 2.0.” LINDAT/CLARIAH-CZ, 2020, http://hdl.handle.net/11234/1-3186. Accessed 24 Jan. 2024.

Hajič, Jan, et al. “Prague Dependency Treebank 2.0.” Linguistic Data Consortium, 2006, https://ufal.mff.cuni.cz/pdt2.0/. Accessed 24 Jan. 2024.

Kampelík, František Cyril. Čechoslovan, čili národní jazyk v Čechách, Na Moravě, ve Slezku a Slovensku. Prague, Jan Hostivít Pospíšil, 1842.

Kosek, Pavel, and Jana Pleskalová. “Spřežkový Pravopis.” CzechEncy–Nový encyklopedický slovník češtiny, edited by Petr Karlík et al., Brno, Masarykova univerzita, 2017, https://www.czechency.org/slovnik/SPŘEŽKOVÝ PRAVOPIS. Accessed 24 Jan. 2024.

Kübler, Sandra, et al. Dependency Parsing. Springer, 2009.

Marneffe, Marie-Catherine de, et al. “Syntax: General Principles–The Status of Function Words.” Universal Dependencies Guidelines, 2017, https://universaldependencies.org/u/overview/syntax.html#the-status-of-function-words. Accessed 24 Jan. 2024.

Osolsobě, Klára. Česká morfologie a korpusy. Prague, Karolinum, 2014.

Plecháč, Petr, and Robert Kolár. “The Corpus of Czech Verse.” Studia Metrica et Poetica, vol. 2, no. 1, 2015, pp. 107–118, https://doi.org/10.12697/smp.2015.2.1.05. Accessed 24 Jan. 2024.

Plecháč, Petr, et al. PoeTree. Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian and Spanish. 0.0.1. Zenodo, 2023, https://zenodo.org/records/10008459. Accessed 24 Jan. 2024.

Popel, Martin, et al. “Udapi: Universal API for Universal Dependencies.” Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies, edited by Marie-Catherine de Marneffe et al., Northern European Association for Language Technology, 2017, pp. 96–101.

Straka, Milan. “Universal Dependencies 2.12 Models for UDPipe 2.” LINDAT/CLARIAH-CZ, 2023, http://hdl.handle.net/11234/1-5200. Accessed 24 Jan. 2024.

Straka, Milan, and Martin Popel. “Eval.Py. 1.2.” GitHub, 2023, https://github.com/UniversalDependencies/tools/blob/master/eval.py. Accessed 24 Jan. 2024.

Straka, Milan, and Jana Straková. “UDPipe 2.” LINDAT/CLARIAH-CZ, 2022, http://hdl.handle.net/11234/1-4816. Accessed 24 Jan. 2024.

Straka, Milan, et al. “UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing.” Proceedings of the Tenth International Conference on Language Resources and Evaluation, edited by Nicoletta Calzolari et al., European Language Resources Association, Paris, 2016, pp. 4290–4297, https://aclanthology.org/L16-1680. Accessed 24 Jan. 2024.

Zeman, Daniel, et al. “CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.” Proceedings of the CoNLL 2018 Shared Task, edited by Daniel Zeman and Jan Hajič, Kerrville (TX), The Association for Computational Linguistics, 2018, pp. 1–21, http://www.aclweb.org/anthology/K18-2001. Accessed 24 Jan. 2024.

Zeman, Daniel, et al. “Universal Dependencies 2.12.” LINDAT/CLARIAH-CZ, 2023, http://hdl.handle.net/11234/1-5150. Accessed 24 Jan. 2024.

Žižková, Hana. “Compound Adverbs as an Issue in Machine Analysis of Czech Language.” Journal of Linguistics / Jazykovedný časopis, vol. 68, no. 2, 2017, pp. 396–403, https://doi.org/10.1515/jazcas-2017-0049. Accessed 24 Jan. 2024.

Rhymes and Syntax: A Morpho-Syntactic Analysis of Czech Poetry

Keywords: Czech poetry / distant reading / text corpora / Universal Dependencies / natural language processing / dependency treebanks

Distant reading that takes linguistic insights into account presupposes adequate performance of natural language processing tools. The article presents an evaluation of the UDPipe parser on a manually annotated sample of nineteenth-century Czech poetry in the following steps: (1) creation of a documented data set for this domain (poetry, nineteenth century, Czech); (2) domain-specific annotation decisions; (3) error analysis. The sample comprised 29 randomly selected poems, which were first automatically tagged and parsed with the UDPipe parser, after which the annotations were manually checked word by word. The following features were checked: word segmentation (chunking), lemmatization, assignment of part-of-speech tags, assignment of more fine-grained morphological tags, assignment of the position in the syntactic tree (selection of the parent), and the label of the syntactic relation between the word and its parent. We analyzed the findings; the most frequent parser errors are associated with complex noun phrases that contain other nouns as modifiers, especially when these occur in a poetry-specific word order, for example as a modifier of the head noun. On the other hand, neither archaic orthography nor neologisms posed substantial problems.

1.01 Izvirni znanstveni članek / Original scientific article
UDK 821.162.3.09-1"18":004
DOI: https://doi.org/10.3986/pkn.v47.i2.04