THE NEW CHINESE CORPUS OF LITERARY TEXTS LITCHI

Mateja PETROVČIČ
University of Ljubljana, Slovenia
mateja.petrovcic@ff.uni-lj.si

Radovan GARABÍK
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia
garabik@kassiopeia.juls.savba.sk

Ľuboš GAJDOŠ
Faculty of Arts of the Comenius University in Bratislava, Slovakia
lubos.gajdos@uniba.sk

Abstract

The aim of the article is to introduce the corpus of Chinese literary texts and to describe the process and design principles behind the corpus construction. The authors explain the reasoning behind the chosen structure and annotation of the corpus, and further discuss the possibilities the corpus opens for linguistic research and language learning. The article provides several examples of how the corpus can be used at various levels of language research.

Keywords: Chinese; corpus linguistics; building and using corpora; literary texts; Litchi

Povzetek

Namen članka je predstaviti korpus kitajskih literarnih besedil ter opisati procese in principe njegove izdelave. Sledi utemeljitev za izbrano strukturo korpusa in obrazložitev uporabljenega sistema označevanja. V nadaljevanju prispevek predstavi možnosti uporabe korpusa za jezikoslovne raziskave in učenje oziroma poučevanje jezika. Različne zahtevnostne stopnje smo avtorji tudi ponazorili s številnimi primeri.

Ključne besede: kitajščina; korpusno jezikoslovje; izdelava in uporaba korpusov; literarna besedila; Litchi

1 Introduction

Corpus linguistics is a well-established and indispensable part of linguistic research. Large monolingual corpora are used most prominently in both scientific research and practical applications, notably in lexicography, language education, and natural language processing, and as a valuable data source in machine learning and data mining / data science.
There is a general lack of corpora covering different text types and genres, although it is indisputable that texts have many different functions in social life, which result in corresponding differences in form and substance (Council of Europe, 2011, p. 93). Awareness of these features is mentioned several times in the Common European Framework of Reference for Languages, which advises users to consider which text types the learner will need, be equipped, or be required to deal with receptively, productively, interactively, and in mediation (Council of Europe, 2011, p. 96). Therefore, specialized corpora are needed in L2 acquisition as complementary learning resources.

Among freely available Chinese corpora, the BCC corpus (http://bcc.blcu.edu.cn) is an exception in this respect, since it distinguishes the following subcategories of texts: literature (wénxué 文学), press (bàokān 报刊), multi-domain[1] (duōlǐngyù 多领域), Weibo (Wēibó 微博), science and technology (kējì 科技), and ancient Chinese (gǔ Hànyǔ 古汉语). However, even though the BCC corpus enables users to go beyond simple search, its functions are very limited compared to CQL-supported corpora.[2] The authors therefore concluded that an additional corpus of Chinese literary texts with advanced functionality, named Litchi, would be a useful addition to the freely available corpora of the Chinese language.

[1] In the present version, this part is called duōlǐngyù 多领域 (multi-domain), and in the previous versions it was called zōnghé 综合 (comprehensive). As stated on the website, this section includes texts from newspapers, literature, Weibo, and science and technology. These contents are independent and do not intersect with other sections of the BCC corpus. The goal of this part was to build a "balanced" corpus.
[2] See retrievable examples (jiǎnsuǒshì shìlì 检索式示例) at http://bcc.blcu.edu.cn/help for details.
[3] See more in Gajdoš, Garabík and Benická (2016).
[4] See more at http://158.195.113.63/run.cgi/corp_info?corpname=zh-law.
Perhaps surprisingly, there are only a few academic institutions in Europe that build and use corpora of the Chinese language, even though Chinese as a second language and Chinese exolinguistic research are steadily gaining in popularity. Comenius University in Bratislava (Slovakia) is one of the institutions working on Chinese corpus linguistics and corpus creation. The first corpus (the web corpus Hanku)[3] was created in 2016, followed by the corpus of legal Chinese in 2018.[4]

Although the usefulness of language corpora is indisputable, we nevertheless sometimes encounter questions of a practical nature. Why is there a strong need for such corpora and how can they be effectively exploited? We firmly believe that further research on the registers of modern Chinese is indispensable for answering such questions. Sadly, the current utilization of available corpus linguistic resources in scientific research in this area still does not reach satisfactory levels in many respects.

2 Availability

The Litchi corpus is available via the website of Comenius University,[5] using the NoSketch Engine web-based corpus manager (Rychlý, 2007; Kilgarriff et al., 2014). Construction of the corpus began in autumn 2019, and the corpus structure is modelled after our previous implementations to keep user access compatible across different corpora. The main parameters of the corpus are summarized in Table 1.

[5] Available at https://fphil.uniba.sk/katedry-a-odborne-pracoviska/katedra-vychodoazijskych-studii/cinsky-jazykovy-korpus/litchi/.

Table 1: Parameters of the corpus Litchi

Parameters | Status | Notes
Type | synchronic | literary texts from the Internet
Language of interface | Slovak, English, Chinese, others | explanatory notes in English
Size (May 2020) | 92 613 119 | in tokens
Tokenization | into words (cí 词) | automatic statistical word segmentation
POS annotation | yes | Penn Chinese Treebank tagset
Bibliographic annotation | yes | title, author's name, alternative author's name(s), authors' geographical origin, gender, date of birth indication
Style and genre annotation | no |
Phonetic annotation | yes | Hànyǔ pīnyīn: tones marked by diacritics; tones marked by numerals
Syntactic annotation | yes | Penn Chinese Treebank compatible dependency annotation
Statistic tools | yes | absolute frequency, relative frequency, average reduced frequency
Save results directly from the interface | yes | in text or XML format
KWIC | yes | KWIC or sentence view
Collocations search | yes | many collocation measures
Advanced search options | yes | Boolean operators (conjunction, disjunction, negation); regular expressions at the character, word, pinyin, and metadata level; full CQL etc.
Sorting by | yes | multi-level sorting hierarchy; left, right, node, references etc.
Availability | registration required | free to use for registered users, registration not restricted

3 Corpus compilation

The Litchi corpus is compiled from freely available literary works in Chinese published on the Internet. The source texts are stored in the GB18030 character encoding (a national standard of the People's Republic of China) (Lunde, 2009, p. 105). However, all subsequent processing and annotation are performed in the UTF-8 encoding[6] to ensure maximum compatibility of processing tools, corpus manager and user access.

[6] In practice, we can treat GB18030 as an alternative ASCII-extending encoding of the Unicode character repertoire (on par with UTF-8).
[7] See more at https://lxneng.com/posts/70.

3.1 Cleanup

Text cleanup consists of removing unwanted characters, collapsing whitespace to a single ordinary space, replacing control characters with a space, and unifying line endings.
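The conversion and cleanup steps described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' actual pipeline: the function names decode_source and clean_text are hypothetical, and treating Unicode format characters (category Cf, such as the byte order mark) as the "unwanted characters" is an assumption made for the example.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Apply the four cleanup steps to an already decoded string."""
    # 1. Remove unwanted characters. ASSUMPTION: "unwanted" is taken to mean
    #    invisible format characters (Unicode category Cf), e.g. U+FEFF.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # 2. Unify line endings to a single "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3. Replace control characters (category Cc) with a space, keeping "\n".
    text = "".join(
        " " if unicodedata.category(ch) == "Cc" and ch != "\n" else ch
        for ch in text
    )
    # 4. Collapse runs of whitespace (including the ideographic space U+3000)
    #    to one ordinary space, line by line, and trim each line.
    lines = [re.sub(r"[^\S\n]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)


def decode_source(raw: bytes) -> str:
    """Decode a GB18030 source file and clean it; the result is a Python
    string that can then be written out as UTF-8."""
    return clean_text(raw.decode("gb18030"))
```

A cleaned document could then be stored with `open(path, "w", encoding="utf-8")`. The order of the steps matters slightly: line endings are unified before control characters are replaced, so that a carriage return is not turned into a stray space.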
Tokenization is performed by ZPar (Zhang & Clark, 2011). Tokens correspond to Chinese words (cí 词), but numbers, punctuation and other symbols are also tokens (with the exception of white space and control characters). Tones are written using standard diacritics; for users lacking the means to enter Hànyǔ pīnyīn diacritics, the digits 1 to 5 can be used instead (with 5 denoting the neutral tone). The transcription into Hànyǔ pīnyīn was performed by the xpinyin package.[7]

3.2 Corpus structure

Since the Litchi corpus was compiled with the teaching of Chinese as a foreign language in mind, the annotation has been designed to facilitate queries by inexperienced Chinese speakers (e.g. students of the language).[8]

[8] For details see Section 4.

3.3 Positional attributes

Positional attributes describe token-level annotation. The basic unit of the corpus is a token, which usually corresponds to a word, although punctuation characters and numerals are also separate tokens. Given the specifics of written Chinese, tokens in the Litchi corpus are equal to Chinese words (cí 词); tokenization (word segmentation) in Chinese is a nontrivial task and a certain number of errors is to be expected. Each token can be assigned several attributes that further describe or specify the token and its grammatical or lexical features.

The following positional attributes are used in the Litchi corpus: word, lemma, tag, pinyin, npinyin, head, deprel.

The fundamental attribute word is the basic unit of the corpus (token). It holds the original form of the word (cí 词) as written in Hànzì in the text. Example: 斯洛文尼亚 (Slovenia).

We repurposed the default attribute lemma to be an all-encompassing default query type. It is a combination of a word written in Hànzì, each individual character (zì 字)
of a word in Hànzì, the Hànyǔ pīnyīn transcription of the word using both diacritics and numerals to mark tones, a transcription with the tones omitted, as well as the union of the transcriptions of the individual characters (zì 字) of the word. We aim for inclusiveness: if a user enters a single syllable (in either Hànzì or one of the two Hànyǔ pīnyīn transcriptions, or even in Hànyǔ pīnyīn without tones), the corpus manager will search for all the words containing the syllable. For example, the word Sīluòwénníyà 斯洛文尼亚 will be assigned the “lemma” luo|luo4|luò|ni|ni2|ní|si|si1|si1luo4wen2ni2ya4|siluowenniya|sī|sīluòwénníyà|wen|wen2|wén|ya|ya4|yà|亚|尼|文|斯|斯洛文尼亚|洛.

The attribute tag is the part-of-speech tag, a short string of uppercase ASCII characters (typically two or three) denoting the part of speech of the word. For example, the word 斯洛文尼亚 will most likely be tagged as NR (i.e. Proper Noun).

The pinyin attribute is the transcription of the word using the Hànyǔ pīnyīn method, with tones indicated by diacritics. The transcription is in lowercase. For example, 斯洛文尼亚 will be transcribed as sīluòwénníyà. Characters with multiple readings are assigned only the first reading (in some collation order).

The npinyin attribute is again the Hànyǔ pīnyīn transcription, but this time the tones are indicated by the numerals 1 to 5 (5 stands for the neutral tone). For example, 斯洛文尼亚 will be transcribed as si1luo4wen2ni2ya4.

Tokens in each sentence are numbered (counting from zero), and the attribute head gives the number of the token to which the current word is syntactically related.

The attribute deprel marks the syntactic relation of the word (node) to the governing word (node).

Figure 1: Example of the use of the attribute deprel (NMOD functionally corresponds to an attributive); searching for nominal modifiers of the word háizi 孩子

3.4 Structures

The corpus possesses a hierarchical structure: the so-called structures describe information about the grouping of tokens, or intra-token information.
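To make the composite lemma attribute of Section 3.3 more concrete, the following Python sketch assembles such a value from a word and its per-character readings. It is an illustration only and not the authors' implementation: the names TONE_MARKS, mark_tone and build_lemma are hypothetical, the readings are passed in by hand rather than obtained from xpinyin, and joining the variants in sorted order with "|" is an assumption about how the value is serialized.

```python
# Tone-marked forms of each vowel for tones 1-4 (tone 5 is unmarked).
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def mark_tone(syllable: str, tone: int) -> str:
    """Convert a toneless syllable plus tone number (1-5) to diacritic form."""
    if tone == 5:
        return syllable
    # Standard placement rule: "a" or "e" takes the mark; in "ou" the "o"
    # takes it; otherwise the last vowel of the syllable does.
    if "a" in syllable:
        target = "a"
    elif "e" in syllable:
        target = "e"
    elif "ou" in syllable:
        target = "o"
    else:
        target = next((ch for ch in reversed(syllable) if ch in TONE_MARKS), None)
    if target is None:  # no vowel to carry the mark
        return syllable
    return syllable.replace(target, TONE_MARKS[target][tone - 1], 1)


def build_lemma(word: str, readings: list[tuple[str, int]]) -> str:
    """Combine all query variants of a word into one '|'-separated value.

    `readings` holds one (toneless syllable, tone number) pair per character.
    """
    numbered = [f"{syl}{tone}" for syl, tone in readings]
    marked = [mark_tone(syl, tone) for syl, tone in readings]
    plain = [syl for syl, _ in readings]
    variants = set(word)            # each individual character
    variants.add(word)              # the whole word in Hànzì
    variants.update(numbered, marked, plain)   # per-syllable transcriptions
    variants.update(
        {"".join(numbered), "".join(marked), "".join(plain)}  # whole-word forms
    )
    return "|".join(sorted(variants))


lemma = build_lemma(
    "斯洛文尼亚",
    [("si", 1), ("luo", 4), ("wen", 2), ("ni", 2), ("ya", 4)],
)
```

Because every variant is an alternative in the same attribute, a query for any one of them (for instance toneless "luo") matches the token, which is the inclusive behaviour the lemma attribute is meant to provide.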
The corpus can thus be seen as a stream of tokens, interrupted by special marks denoting the start of a structure, the end of a structure, or a structure between two tokens. The Litchi corpus uses the following structures (compatible with de facto standards in written language corpora):

<doc> stands for one document, which is a logically and conceptually separate standalone unit, typically a book, a short story etc. The structure contains several attributes providing annotation of the document (metadata).

<p> marks paragraphs, units conveying a coarse-grained segmentation of the text; paragraphs are inferred from the structure of the text itself, without resorting to linguistic information.

<s> marks sentences, segmented according to the heuristic-statistical model of the ZPar segmentation.

The structure <g/>, often used in other corpora to mark that there was no whitespace between tokens, is not used, since spaces in written Chinese are mostly irrelevant (and not used).

3.5 Document annotation

Each document has a certain set of metadata (document annotation) that are kept in the compiled corpus; they can be queried directly, or the results can be filtered by them.

doc.title is the name of the document (e.g. book title), written in Hànzì. The Litchi corpus includes 1312 different literary works.

doc.author is the name of the author (the pen name, if different from the real name), written in Hànzì.

doc.alter_name comprises alternative author names (either the real name or other pen names). If the author is of non-Chinese origin, this string includes the name (or multiple name variants) either in the original language or in a well-known transcription.

The motivation for this labeling was to maintain the original title-author pair. For example, according to WorldCat, the work 妄谈与疯话 (Wàng tán yǔ fēnghuà) is written by the author 六六 (Liu Liu) (see Figure 2).

Figure 2: The results of the query “妄谈与疯话” in WorldCat

However, Liu Liu is a pen name of an author whose real name is Zhang Xin 张辛, as stated elsewhere on the Internet, for example on Xiabook.com (https://www.shutxt.com/writer/61/). Similarly, the author of the work entitled 尼尔斯骑鹅旅行记 (Ní'ěrsī qí é lǚxíngjì) is 西尔玛·拉格洛夫 (Xī’ěrmǎ Lāgéluòfū) in the doc.author field, and Selma Lagerlöf in the doc.alter_name field.

All the bibliographic records in Litchi (including origin, gender, and age range) have been checked manually to verify and complement the metadata.
As a result, 539 different authors are included in the corpus.[9] Some tasks were quite intriguing, for example the author “B·N·崔可夫” (N B Cuīkěfū), which stands for “Vasily Ivanovich Chuikov”. In the original Chinese text, “B” is a letter of the Cyrillic script and stands for the letter “V” in the Latin script. However, instead of “И” (i.e. the letter “I” in the Latin script), the mirror form “N” has been used. The Chinese version is therefore a mixture of “Vasily Ivanovich Chuikov” and “Василий Иванович Чуйков”.

[9] The BCC corpus includes works from 469 different authors.
[10] See http://wap.yuexinet.net/view.php?aid=38

doc.authors_origin provides information on the authors' origin in the geographical or linguistic sense (e.g. China, Korea, Japan, etc.), enabling the user to distinguish originally Chinese texts from works translated into Chinese. The value is a two-letter abbreviation of the region.

doc.gender is the gender of the author; we use the value self-described by the author or the gender the author is commonly considered to be. This is not strictly a binary-valued item: currently, there are three values present in the corpus annotation: M for male, F for female, N/A for unknown gender.

doc.born_in provides information related to the authors' age, in 15-year intervals. Authors born in previous centuries have only the century of their birth recorded here (written in English, with the numeric part at the beginning).

The value “N/A” has been assigned to all bibliographic data for which no clear and unambiguous statements were available in the authors' online profiles. For example, if a person's brief presentation avoided the use of third-person personal pronouns and used neutral expressions, such as bǐzhě 笔者 (writer), zuòzhě 作者 (author), qí 其 (his/her), běnrén 本人 (I/me), it was not possible to assign a clear gender value. Similarly, if the author states that they were born “in a small village in Yue nan” (.......), this does not necessarily mean “in South Guangdong”.
Unlike the expressions Eastern/Northern/Western Guangdong (粤东/粤北/粤西), the notion Yuè nán 粤南 (lit. Southern Guangdong) does not seem to refer to any real geographical place.[10] If the description was further obscured by vague expressions such as “graduated from the BA studies at a certain university” (.......), this further justified the use of the “N/A” value. The final proportion of known and unknown information for the fields doc.gender, doc.authors_origin and doc.born_in is presented in Figure 3.

Figure 3: Proportion of known/unknown information in the authors' data (panels: authors' origin, authors' gender, authors' age)

4 Usage in linguistics research

The corpus manager is a very powerful tool in the hands of an experienced user. Corpora were originally aimed at scientific research in linguistics and related fields; over the last decades, with the ever-increasing importance of corpus linguistics, their usage has converged into a subfield of descriptive linguistics with its own terminology, approaches, good practices and established rules. Nevertheless, the learning curve is not prohibitively steep: the corpus manager can be used even by completely casual users if the corpus is prepared adequately and sane defaults are provided. For pedagogical reasons, we arbitrarily divide corpus usage into three levels:

• basic
• advanced
• expert

Needless to say, this division is based on our experience and the dividing lines between the levels are not strictly delineated.

4.1 Basic use

At the basic level, a user may search for a word and view the results as KWIC (Key Word In Context). This is the most basic way of retrieving a concordance (the context of the keyword), and it is very useful for students of foreign languages and for translators. This usage usually does not require any additional instructions: users just type the word and get a readable list of occurrences. For Chinese language corpora, the situation is somewhat complicated by the need to enter Hànzì characters.
The plethora of input methods is a thing of the past, and the prevalent input method (in a non-professional setting) is now based on toneless Hànyǔ pīnyīn transcription, with a language model selecting (and ordering) the most probable Hànzì characters and the user picking the appropriate one. While easy for native or fluent speakers, this can be challenging for students or for less literate or less proficient non-native speakers. The specific tokenization also matters: users have to be familiar with our chosen segmentation into words (cí 词). This is the basic motivation behind our lemma attribute: by default, users can query the corpus by a single character (zì 字) or a word (cí 词), and either can be written in Hànzì, in Hànyǔ pīnyīn with standard diacritics, in Hànyǔ pīnyīn with tones indicated by numbers, or in toneless Hànyǔ pīnyīn. Thus users facing technical obstacles to typing Hànzì or diacritics, as well as users less proficient in written Chinese, can still benefit from the corpus by entering the search term in an intuitive way and still getting (a superset of) the relevant results. This is obviously very important in teaching Chinese as a foreign language.

In addition to searching for given words or characters, one of the nontrivial results we can obtain from huge corpora is collocation analysis using various collocation measures. By default, the logDice measure is selected, as it has been empirically found to provide the best results for lexicographic purposes (and, by extension, for almost any other purpose as well) (see Figure 4). The NoSketch Engine UI makes it very easy to search for collocation candidates in the corpus.

Figure 4: The Collocations candidates parameters selection UI

The collocation candidates for e.g. the token gōngzuò 工作 (work) are presented in Figure 5.

Figure 5: Top 10 collocation candidates of the word gōngzuò 工作, word range -5 to 5 (i.e.
up to five tokens in both directions)

The results of this simple query show that this word is frequently found in phrases such as gōngzuò rényuán 工作人员 (staff member); it takes the verb zuòhǎo 做好 (to do/to finish), as in zuòhǎo gōngzuò 做好工作 (to do a job well), zuòhǎo zìjǐ de gōngzuò 做好自己的工作 (to do one's own work), zuòhǎo yuǎnjiào gōngzuò 做好远教工作 (to do distance teaching work); as a noun, it takes the classifier xiàng 项; and it is, as expected, used together with the conjunction hé 和 (with), e.g. to work with somebody, etc.

4.2 Advanced use

At the advanced level, it is possible to search for combinations of a few words conforming to a specified condition (e.g. usage of negation words (Gajdoš, 2019), concrete word order, part-of-speech tags, syntactic role, Boolean operators etc.) by using CQL expressions. In this example, we search for the most frequent attributives of the noun gōngzuò 工作 (work). The CQL query for this task is (meet [tag="VA|NN|JJ|M"] 1:[word="工作" & tag="NN"] 1 2). See Figure 6.

Figure 6: The results of the query (meet [tag="VA|NN|JJ|M"] 1:[word="工作" & tag="NN"] 1 2)

In the next step, the results of the previous query may be ordered by Node forms in the Frequency menu to get a list of the most frequent attributives (Figure 7).

Figure 7: Top 10 most frequent attributives of the noun gōngzuò 工作 (work)

The results reveal that the most frequent noun phrases with the head noun gōngzuò 工作 (work) include guǎnlǐ gōngzuò 管理工作 (management work), jiàoyù gōngzuò 教育工作 (educational work), xuānchuán gōngzuò 宣传工作 (promotional work), jiànshè gōngzuò 建设工作 (construction work), and others. Its most frequent measure words are xiàng 项, gè 个 and fèn 份. In our opinion, this level of usage is suitable for most cases, in language pedagogy as well as in linguistics research.

4.3 Expert use

The expert level is an extension of the previous one and is often used in linguistics research. The corpus manager offers an arbitrary combination of POS tags, word order, context filters (e.g.
MEET, WITHIN), conditions on bibliographic annotation etc. For example, with the bibliographic annotation in the Litchi corpus, it is possible to search for a concrete grammatical phenomenon in the works of one author (doc.author) or in the works of all female authors (doc.gender). The following figure demonstrates how conditions can be combined (search only in texts by authors not from Mainland China; find all “regular” verbs (VV) with an “aspect” marker (AS) followed again by the same verb). CQL query: (1:[tag="VV"] 2:[tag="AS"] 3:[tag="VV"] within ) & 1.word=3.word

Figure 8: The combination of conditions in CQL

Or, to continue with the example of gōngzuò 工作 (work), it can be observed that male authors tend to write more about work than female authors. Moreover, their focus seems to be on different aspects of work, as roughly indicated in the data. The most frequent collocation candidates in works by male authors are zhǔchí gōngzuò 主持工作 (take charge of the work), zhèngzhì gōngzuò 政治工作 (political work) and cānjiā gōngzuò 参加工作 (participate in work), whereas the most frequent collocation candidates in works by female authors include shǒutóu gōngzuò 手头工作 (work at hand), zhǎo gōngzuò 找工作 (to look for a job) and zhǎodào gōngzuò 找到工作 (to find a job).[11]

[11] For more reliable results, more thorough research should be conducted.

Figure 9: Comparison of frequency and collocation candidates for the word gōngzuò 工作 (work) in relation to authors' gender

The data also show that there are more works written by men, but this does not influence the relative frequency of the selected word.

[Figure 10 chart: number of words per gender; categories M, F, N/A; shares 52,12%, 33,53%, 14,34%]

Figure 10: Distribution of the gender annotation value, as a percentage of the number of words (词)
in the corpus

The following two tables demonstrate the use of the corpora to identify keywords that are more relevant in one corpus than in the second (reference) corpus, using the Simple maths method (Kilgarriff, 2009), i.e. the words whose relative frequency is much higher in one corpus. We focus on rare words.

Table 2: Comparison of most relevant keywords in the zh-law corpus, as compared against zh-lit as the reference corpus

word | zh-law Freq | zh-law Freq/mill | zh-lit Freq | zh-lit Freq/mill | Score
.. | 33949 | 4712.4 | 197 | 2.4 | 1378.1
... | 11302 | 1568.8 | 27 | 0.3 | 1178.8
.. | 9974 | 1384.5 | 15 | 0.2 | 1169.9
.. | 57301 | 7953.8 | 530 | 6.5 | 1059.1
.. | 6761 | 938.5 | 7 | 0.1 | 865.1
.. | 50465 | 7004.9 | 583 | 7.2 | 858.3
.. | 16099 | 2234.6 | 161 | 2.0 | 750.7
.. | 5486 | 761.5 | 3 | 0.0 | 735.4
... | 5970 | 828.7 | 12 | 0.1 | 723.1
... | 5643 | 783.3 | 11 | 0.1 | 690.9
.. | 8449 | 1172.8 | 59 | 0.7 | 680.5
... | 8772 | 1217.6 | 71 | 0.9 | 650.9
.. | 8703 | 1208.0 | 83 | 1.0 | 598.6
... | 6212 | 862.3 | 43 | 0.5 | 564.9
.. | 4655 | 646.1 | 14 | 0.2 | 552.2

Table 3: Comparison of most relevant keywords in the zh-lit corpus, as compared against zh-law as the reference corpus

word | zh-lit Freq | zh-lit Freq/mill | zh-law Freq | zh-law Freq/mill | Score
. | 200,276 | 2460.4 | 4 | 0.6 | 1582.7
. | 266,923 | 3279.2 | 16 | 2.2 | 1018.4
.. | 219,236 | 2693.4 | 12 | 1.7 | 1010.8
. | 720,026 | 8845.7 | 65 | 9.0 | 882.7
. | 80,800 | 992.6 | 1 | 0.1 | 872.5
. | 149,508 | 1836.7 | 16 | 2.2 | 570.6
.. | 48,539 | 596.3 | 2 | 0.3 | 467.5
. | 124,025 | 1523.7 | 18 | 2.5 | 435.8
. | 208,813 | 2565.3 | 39 | 5.4 | 400.1
. | 41,029 | 504.0 | 2 | 0.3 | 395.3
. | 48,497 | 595.8 | 4 | 0.6 | 383.7
... | 39,129 | 480.7 | 2 | 0.3 | 377.0
. | 32,804 | 403.0 | 1 | 0.1 | 354.8
.. | 29,531 | 362.8 | 1 | 0.1 | 319.5
.. | 69,307 | 851.5 | 13 | 1.8 | 304.0

Last but not least, the Litchi corpus mainly reflects the language use of speakers born in recent decades, as shown in Figure 11. Therefore, this corpus is also appropriate for studies focusing on specific features of the most recent language use.

Figure 11: Distribution of author birth dates, by the number of words (词) in the corpus

5 Conclusion

The Litchi corpus is the third corpus in a family of Chinese language corpora used at Comenius University.
It adds a corpus of a different language variety and register to the existing corpora of Chinese (texts of laws, web corpus), while keeping a compatible structure and annotations. The corpus manager offers the possibility of quantitative and qualitative analysis of various Chinese language registers through comparison of the three corpora, but it can also be used to compare Chinese language usage in different situations or contexts (e.g. between translated and original texts, or different expressions used by authors depending on their gender, historical period etc.). The corpus is accessible through a web interface upon registration and aims to be a valuable resource for teachers and students of Chinese as a foreign language, as well as for linguistic research.

To conclude, Litchi is a unique corpus in many respects. It provides rich bibliographic, phonological, morphological and syntactic annotation and thus offers a wide range of possibilities for linguistics research, e.g. in lexicography/lexicology, morphology, syntax and, to some extent, also sociolinguistics.

Acknowledgments

This work was partially supported by the Slovenian Research Agency under the research program Asian languages and cultures (P6-0243).

References

Council of Europe. (2011). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge, U.K.: Cambridge University Press. https://rm.coe.int/1680459f97

Gajdoš, Ľ. (2019). Retrieving Linguistic Information from a Corpus on the Example of Negation in Chinese. Acta Linguistica Asiatica, 9(2), 103–115. https://doi.org/10.4312/ala.9.2.103-115

Gajdoš, Ľ., Garabík, R., & Benická, J. (2016). The New Chinese Webcorpus Hanku – Origin, Parameters, Usage. Studia Orientalia Slovaca, 15(1), 53–65.

Kilgarriff, A. et al. (2014). The Sketch Engine: Ten Years on. Lexicography, 1(1), 7–36.

Kilgarriff, A. (2009). Simple maths for keywords. In M. Mahlberg, V. González-Díaz & C.
Smith (Eds.), Proceedings of Corpus Linguistics Conference CL2009. University of Liverpool, UK.

Lunde, K. (2009). CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing (2nd ed.). O'Reilly Media.

Rychlý, P. (2007). Manatee/Bonito – A Modular Corpus Manager. In P. Sojka & A. Horák (Eds.), RASLAN 2007 (pp. 65–70). Brno: Masaryk University.

Zhang, Y., & Clark, S. (2011). Syntactic Processing Using the Generalized Perceptron and Beam Search. Computational Linguistics, 37(1), 105–151.