Retrieving Linguistic Information from a Corpus on the Example of Negation in Chinese

Luboš GAJDOŠ
Comenius University in Bratislava, Slovakia
lubos.gajdos@uniba.sk

Abstract

The paper deals with corpus analysis of negation in Chinese, namely the negatives bù 不 and méi/méiyǒu 没/没有. The adverbs BU and MEI are two of the most frequent negatives in Chinese. The aim of this study is to present statistical data together with linguistic analysis. The results provide empirical evidence of a discrepancy between "authentic" language data and linguistic prescription, with practical implications for second-language acquisition. The findings inter alia suggest a new approach to verb categorisation.

Keywords: Chinese language; corpus linguistics; quantitative description; negation; potential complements

Povzetek (Slovenian abstract)

The article deals with a corpus analysis of negation in Chinese, in which the author focuses on the adverbs bù 不 and méi/méiyǒu 没/没有, the most frequent negatives in modern Chinese. The aim of the contribution is to present statistical data in connection with linguistic analysis. The results of the study bring empirical evidence of a mismatch between language use and language norms; these findings can also be applied in teaching Chinese as a foreign language and in reconsidering the approach to the categorisation of verbs.

Ključne besede (keywords): Chinese; corpus linguistics; quantitative description; negation; potential complements

Acta Linguistica Asiatica, 9(2), 2019. ISSN: 2232-3317, http://revije.ff.uni-lj.si/ala/
DOI: 10.4312/ala.9.2.103-115

1 Introduction

Generally speaking, there are a number of negatives in modern Chinese.¹ In this article only two negative adverbs, namely bù 不 and méi/méiyǒu 没/没有² (hereafter referred to as BU and MEI), are discussed. The Hanku corpus is used³ as the primary source of language material and statistical data.
As the intention is mainly to use a corpus-driven⁴ approach to the study of negation, previous linguistic research on this topic is left aside. Let us start with some basic queries:⁵

    [tag="AD" & word="不"]
    [tag="VV|AD" & word="没|没有"]⁶

The results are 7371142 (9897.85 per million)⁷ and 686352 (921.62 per million)⁸ respectively. These numbers only tell us that the token BU is approximately 10 times more frequent than MEI. The difference is even more pronounced in certain varieties of Chinese, e.g. in the corpus of legal Chinese BU occurs 45254 times (6281.56 per million) and MEI 720 times (99.94 per million).

Let us take a closer look at the tokens that collocate with these negatives. The following queries should return collocates at position 1 on the right side:⁹

    [tag="AD" & word="不"][]
    [tag="AD" & word="没|没有"][]

¹ For details see e.g. Liu (2004, pp. 253-258).
² For the sake of simplicity, both negatives mei and meiyou are treated as two forms of one negative, namely MEI. On the other hand, their collocative partners may differ because of e.g. prosodic factors.
³ See more in Gajdoš, Garabík and Benická (2016, pp. 53-65).
⁴ See more in Baker, Hardie and McEnery (2006, p. 49).
⁵ In this article, the Corpus Query Language (hereafter CQL) is used to search for collocations. With CQL, complex criteria can be set to find one or many tokens. The criteria for each token must be enclosed in a pair of square brackets [ ], e.g. [attribute="value"]. See more at https://www.sketchengine.eu/documentation/cql-basics/
⁶ As several tags (e.g. VV = verbs, VE = YOU as the main verb, AD = adverbs) are assigned to the tokens mei and meiyou, it is rather difficult to determine the frequency of the negative MEI accurately. Thus, I only use the tags VV and AD in this article. For more details on the tagset see Fei (2000, pp. 4-35).
⁷ Unless stated otherwise, frequencies are given in absolute occurrence in the Hanku corpus.
⁸ The occurrence of mei 没 is 342190 (459.49 per million) and that of meiyou 没有 is 344162 (462.14 per million).
⁹ The queries match the following patterns: adverb bu + any token, or adverb mei/meiyou + any token. As it is rather difficult to identify collocates at positions further to the right using only POS tags, this topic is left for future research. See e.g. Gajdoš (2018).

The results are summarised in the tables below.¹⁰

Table 1: The most frequent POS at the position 1 (Corpus: web-zh)

# Query: word, [tag="AD" & word="不"][]          (left columns)
# Query: word, [tag="AD|VV" & word="没|没有"][]   (right columns)

No. of results: 7371142        No. of results: 686352

tag        frequency           tag        frequency
AD VV      5092363             VV VV      326984
AD VA      827140              AD VV      193781
AD VC      695511              VV AD      35762
AD AD      444055              VV P       30138
AD P       155025              AD AD      16246
AD PU      25581               VV AS      15136
AD AS      19730               VV PU      9902
AD JJ      19525               VV CD      5771
AD PN      13121               VV PN      5350
AD NN      10044               VV VA      4975
AD BA      9276                VV BA      4870
AD VE      8798                VV NN      4449
AD LB      8553                VV DT      4223
AD CD      7535                AD VA      3511
AD SB      6815                AD P       3488
AD NR      5930                VV SB      2795
AD DT      4226                VV DEC     2660
AD DEC     3170                VV LB      1914
AD DEV     2380                AD SB      1541
AD SP      1987                VV DER     1460
AD LC      1744                AD BA      1334
AD M       1709                VV JJ      1209
AD MSP     1545                VV NR      1106
AD NT      1097                VV NT      959
AD OD      1032                AD CD      861
AD CC      1023                AD PN      691
AD CS      1021                AD VC      663
AD DEG     941                 VV VC      526
AD DER     118                 VV OD      499
AD ETC     88                  VV LC      380
AD FW      45                  AD VE      354
AD IJ      14                  AD LB      338

¹⁰ The results are calculated using the NoSketch Engine UI - Node tags.

The table indicates different collocability for the negatives BU and MEI, e.g. the negative BU exhibits a strong preference for copulas (here VC).¹¹ For practical reasons, only the POS tags with a frequency above 1% of each group (set in bold in the original layout) are included in the analysis. The PU tag is also excluded from further analysis as it stands for punctuation. Table 2 shows the 10 most frequent collocates for each negative.
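The 1% cut-off described above can be sketched in a few lines of Python. This is only an illustration under one assumption: "1% of each group" is read here as 1% of the number of results of each query, applied to each row of Table 1. The function name and the data layout are hypothetical; the frequencies are copied from Table 1 (only the top rows are listed, as all remaining rows lie below the cut-off). Under this reading the sketch yields the same tag sets that appear in the queries of Table 2.

```python
# Totals and rows copied from Table 1: (tag of the negative, tag of the collocate, frequency).
BU_TOTAL = 7371142
MEI_TOTAL = 686352

BU_ROWS = [
    ("AD", "VV", 5092363), ("AD", "VA", 827140), ("AD", "VC", 695511),
    ("AD", "AD", 444055), ("AD", "P", 155025), ("AD", "PU", 25581),
    ("AD", "AS", 19730), ("AD", "JJ", 19525),
]
MEI_ROWS = [
    ("VV", "VV", 326984), ("AD", "VV", 193781), ("VV", "AD", 35762),
    ("VV", "P", 30138), ("AD", "AD", 16246), ("VV", "AS", 15136),
    ("VV", "PU", 9902), ("VV", "CD", 5771), ("VV", "PN", 5350),
]


def tags_above_cutoff(rows, total, cutoff=0.01, excluded=("PU",)):
    """Return collocate tags whose row frequency exceeds cutoff * total.

    Assumption: the threshold is applied per row, relative to the number
    of results of the query; punctuation (PU) is left out, as in the text.
    """
    kept = []
    for _negative_tag, collocate_tag, frequency in rows:
        if frequency <= cutoff * total:
            continue
        if collocate_tag in excluded or collocate_tag in kept:
            continue
        kept.append(collocate_tag)
    return kept


bu_tags = tags_above_cutoff(BU_ROWS, BU_TOTAL)
mei_tags = tags_above_cutoff(MEI_ROWS, MEI_TOTAL)
print("BU :", "|".join(bu_tags))
print("MEI:", "|".join(mei_tags))
```

The joined strings can be pasted into the tag attribute of a CQL query; whether the cut-off was originally computed in exactly this way is not stated in the text.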
The results are calculated using the NoSketch Engine UI - Node forms.12 Table 2: The most frequent tokens at the position 1 (# Corpus: web-zh) # Query: word, (meet [tag="VV|VA| VC|AD|P"]2:[tag="AD" & word="*"]-1 -1) # Query: word, (meet [tag="VV| AD|P|AS"]2:[tag="AD | VV" & word="S | >St"]-1 -1) * mm No. of results: 7214094 No. of results: 621555 word frequency word frequency £ 693223 M 56260 m 377529 * 14539 311133 m 13999 M 288756 7 12223 227692 M 9843 125328 £ 9249 125212 U 9215 124252 rn 9141 rn 91523 *3\ 7189 m 85127 6744 At first sight, it is surprising that the collocation MEI+ neng m is the third most frequent, despite the fact that most grammars and textbooks deny this possibility.13 Similar findings may provide the impetus for further research which would take greater 11 The co-occurrence of MEI+VC is caused by the misspelling of the character shi in most cases, e.g. mei shi instead of mei shi 12 To find collocative partners of both negatives, the operator meet is used. That means that the corpus is search for the following patterns: adverb BU + verbs (VV) or adverb BU + adjectives (VA) or adverb BU + copulas (VC) or adverb BU + adverbs (AD) or adverb BU + prepositions (P). See more at https://www.sketchengine.eu/documentation/cql-meet-union/ 13 There are some exceptions, e.g. Svarny and Uher (2014, p. 48) describe this phenomenon and Liu (2004, p. 257) also suggests this possibility, however, they do not further elaborate this point. Retrieving Linguistic Information from a Corpus on the Example 107 account of actual language use. The "new" grammars or textbooks should be then based on such research. After searching the first hundred examples manually, it turns out that the cooccurrence of some tokens with BU is higher than one would expect based on the frequency of affirmation, e.g. 
liâo 7 (70691), zhù ft (65400), qï M (40086) etc., furthermore, these verbs typically serve as so-called complements.14 That means that only tentative conclusion may be drawn from this evidence, nevertheless, it should play a role when comparing the overall frequency of both negatives. I discuss this topic further in the chapter Potential complements. 2 Potential complements Let us return to the examples that have been mentioned in Chapter 1 and analyse them. (1) m Tä He 'He cannot do this.' & ^/AD bàn bù/AD manage BU-neg, 7/vv W liäo/VV cï shi compl. this matter (2) U ^/AD ft/VV UT jieke zhongyu ren bu/AD zhu/VV shuo le Jack finally endure BU-neg. compl. speak LE 'Jack finally couldn't help saying it.' (3) m ^/AD Ê/VV 7 zhè bèizi fangzi mai bù/AD qï/VV le this life house buy BU-neg. compl. LE '(One) cannot afford to buy a house for the entire life.' (4) ^/AD U/VV duiyu shuangfang bu neng bu/AD shuo/ VV for to both sides not able to BU-neg. speak 'regarding (things that) both sides cannot but speak' Randomly selected samples suggest that many examples may be considered as so-called potential complements with the "morphological" structure VV + BU + VV while 14 See e.g. Yip (2009, pp. 234-241). 108 Lubos GAJDOS the first morpheme (verb) is not equal to the third. The following query meets this condition:15 (meet (meet 1:[tag="W"][tag="AD" & word="*"]-1 -1) 2:[tag="W|VA"]-2 -2) & 1.word!=2.word The examples below show that the regular expression does not always match the desired pattern and therefore must be modified. (5) I^AW^-, & * M/VV? Nage ren ne, jiao Li YT, zh! bu zhTdao That person is Li Yi know BU-neg. know 'Do you know that that person is called Li Yi?' (6) m m * mww Nin neng bu nenggou zai juti de gen women jiang yTxia? You able to BU-neg. able to tell us more specifically 'Can you tell us this again more specifically?' (7) m * n/vv YTding chengdu shang, bu neng bu shuo/VV to a certain extent, not able BU-neg. 
speak 'To a certain extent, one cannot but speak.' (8) n * ^/VV YTge shuo bu yao one say BU-neg. want 'One says no.' (9) n * Mrw xfc Zhexie ren zhao bu dao/VV gongzuo these people find BU-neg. compl. work 'These people cannot find work.' There is a point worth noting here as well - auxiliary verbs (e.g. modal verbs) must be removed from the search pattern. As there is no dedicated tag for modal or auxiliary 15 This regular expression matches the following pattern: verb2/adjective2 + BU + verbl and verb1 * verb2, the verbl is KWIC (Key Word in Context). Retrieving Linguistic Information from a Corpus on the Example 109 verbs (except VE, VC), each of the verbs must be enumerated in the query with the attribute "word".16 A double negative must be excluded too. The refined query is:17 (meet (meet (meet 1:[tag="VV"& word!= "g | f£" "& word="(?i).{1,2}"] [tag="AD" & word="^"]-1 -1) 2:[tag="VV|VA"& word!= "g| M"]-2 -2)[word!="^"]-3 -3) & 1.word!=2.word The following table shows the result. The overall frequency is 828224 (1112.13 per million). Table 3: The most frequent potential complements - the negative form (# Corpus: web-zh) # Query: word,(meet (meet (meet 1:[tag="VV"& word!= | | M& | ^ | MM | ^^ | ^ | MS | ^mmmmmnmmmti^mmmMinmmim" & word="(?i).{1,2}"][tag="AD" & word="^"]-1 -1) 2:[tag="VV | VA"& word!= | | | ^ | MM | 2)[word!="^"]-3 -3) & 1.word!=2.word word Frequency 111457 T liäo 65697 59793 & 36774 # dé 33857 & 31377 ± 29568 22326 a 19412 m 12632 ^ zhao 10366 10114 f 9877 16 The query above contains only two of these verbs, the others are present here, e.g. ^ | etc. The limit for the length of the tokens is set to 1 or 2 by the expression: "word="(?i).{1,2}". 17 The regular expression means that the corpus is searched for the following pattern: token (not BU) + verb2 (not g nor + adverb BU + mono- or disyllabic verbl (which is not g nor and verbl * verb2. Only the verbl is KWIC in the concordance and other tokens are used as contextual filters. 
See more at https://www.sketchengine.eu/documentation/cql-meet-union/

The affirmative form can be retrieved by the same query with only a minor modification:¹⁸

    (meet (meet (meet 1:[tag="VV" & word!="g|M" & word="(?i).{1,2}"] [tag="DER"] -1 -1) 2:[tag="VV|VA" & word!="g|M"] -2 -2) [word!="不"] -3 -3) & 1.word!=2.word

The total frequency of 167822 (225.35 per million) clearly shows that the affirmative form is far less frequent. This confirms the assumption made in the literature.¹⁹ The following list contains a sample of the most frequent verbs: ± 11598, M 10084, glj 9769, ft 7614, ft 7607, ft^ 3736, % 2977 etc.

If we return to the calculation of the overall frequency of BU, the frequency of the negative form of potential complements (1112.13 per million) should be subtracted from the total frequency (9897.85 per million), which gives 8785.72 per million. Needless to say, these are only approximate numbers and further research is required.

3 Verb collocates

The first chapter discusses the collocability of the negatives BU and MEI. In this chapter, I explore this topic further. When comparing the total frequency of BU vs. MEI, some considerations should be taken into account, e.g. some verbs/adjectives collocate with BU only, some registers use MEI only to a limited extent, etc.

After saving the results as a text file (from the NoSketch Engine UI), I test the two lists²⁰ for duplicates²¹ and calculate the average value of co-occurrence. When the two lists are compared in a spreadsheet program, many tokens in the MEI list are marked as having no counterpart in the BU list. This may be surprising at first, since one would expect only tokens from the BU list to lack a counterpart. The explanation is rather simple: (1) most of these tokens have a disyllabic morphological structure (V+X) and cannot be paired by the spreadsheet program with their monosyllabic counterpart in the BU list, or (2) the frequency of the BU counterpart is below the lowest frequency in the sample (see footnote 20).

¹⁸ There is a dedicated tag for the de-marker, i.e. DER.
¹⁹ See e.g. Liu (2004, p. 583).
²⁰ Each list contains the 1000 most frequent verbs that collocate with BU and MEI.
²¹ This might be done in MS Excel, LibreOffice Calc or any spreadsheet program.

Table 4: The 10 most frequent verbs collocating with BU and MEI (Corpus: web-zh)

# Query: word,(meet [tag="VV"] 2:[tag="AD" & word="不"] -1 -1)                                   (left columns)
# Query: word,(meet 1:[tag="VV" & word="(?i).{1,2}"] [tag="AD|VV" & word="没|没有"] -1 -1)       (right columns)

word       frequency           word       frequency
m          377442              M          56260
A          311129              *          14539
M          288756              能         13999
           227691              m          9843
           125328              u          9215
^          124875              m          9141
m          122930              *m         7189
rn         91458               m          6086
m          85108               £          5602
           78854               *          5371

The results indicate that:
• Of the 1000 most frequent tokens (verbs) in the BU list, 619 also collocate with MEI, and of the 100 most frequent tokens, 69 do; the rest, e.g. £n, ^T, M, IS, M etc., co-occur with BU only;
• Of the 100 most frequent tokens (verbs) in the MEI list, a few collocate preferably with MEI, e.g. Mi'J, MS etc.;²²
• The lower the frequency of a token in the BU list, the less frequently it collocates with both negatives;
• Generally speaking, the co-occurrence of MEI with a given verb is about 2.5 times less frequent than that of BU with the same verb; however, the statistical data reveal great disparities between tokens (see Table 5). That is to say, verbs on the left side of the table collocate almost always with the negative BU, whereas verbs on the right side collocate almost exclusively with the negative MEI.
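The list comparison described in this chapter does not require a spreadsheet; a minimal Python sketch is given here. It rests on several assumptions that are not stated in the text: the two exported lists are already loaded as dictionaries mapping a verb to its frequency, the ratio is taken as freq(BU + verb) / freq(MEI + verb), and the verbs and counts in the example are purely hypothetical placeholders, not corpus data.

```python
def collocation_ratios(bu_counts, mei_counts):
    """Map each verb found in both lists to freq(BU + verb) / freq(MEI + verb).

    Assumption: a ratio well above 1 is read as a preference for BU,
    a ratio well below 1 as a preference for MEI.
    """
    shared = bu_counts.keys() & mei_counts.keys()
    return {verb: bu_counts[verb] / mei_counts[verb] for verb in shared}


# Hypothetical placeholder verbs and counts, for illustration only.
bu_counts = {"verb_a": 9000, "verb_b": 500, "verb_c": 40, "verb_d": 1200}
mei_counts = {"verb_a": 10, "verb_b": 250, "verb_c": 400, "verb_e": 75}

ratios = collocation_ratios(bu_counts, mei_counts)

# Tokens without a counterpart in the other list (the "unpaired" tokens).
only_bu = sorted(bu_counts.keys() - mei_counts.keys())
only_mei = sorted(mei_counts.keys() - bu_counts.keys())

# Print shared verbs from the strongest BU preference to the strongest MEI preference.
for verb, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{verb}\t{ratio:.3f}")
print("BU only :", only_bu)
print("MEI only:", only_mei)
```

Exact string matching has the same limitation as the spreadsheet comparison: a disyllabic token in one list is not paired with its monosyllabic counterpart in the other.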
²² This may be seen from the following comparison: the query [word="没|没有" & tag="VV|AD"][word=";^il"] has a frequency of 4542 (6.10 per million), whereas the query [word="不" & tag="AD"][word=";il"] has a frequency of 62 (0.08 per million).

Table 5: Collocability of verbs (Corpus: web-zh)

Preference for BU
    word:   M A ft « n Mi 7
    ratio:  1511.8  858.4  781.7  408.7  325.2  307.7  301.4  278.5  236.3  235.4

Preference for MEI
    word:   Sa mv MA mn sa mm
    ratio:  0.005  0.182  0.201  0.277  0.290  0.314  0.323  0.340  0.342  0.385

4 Adjective and adverb collocates

This chapter focuses on the collocability of adjectives and adverbs; the same search methods are used. As for adjectives, a brief look at the statistical data (827140 or 1110.67 per million vs. 8486 or 11.39 per million; see Table 6) demonstrates that adjectives (almost) always collocate with the negative BU. The exceptions may be considered phrases.

Table 6: Collocability of adjectives (Corpus: web-zh)

# Query: word,(meet [tag="VA"] 2:[tag="AD" & word="不"] -1 -1)              (left columns)
# Query: word,(meet [tag="VA"] 2:[tag="AD|VV" & word="没|没有"] -1 -1)      (right columns)

No. of results: 827140         No. of results: 8486

word       frequency           word       frequency
»          71999                          1978
           58946                          884
£          46448                          812
%%         34917               £          374
           28451                          329
x          23685                          257
M          18118               »          251
l          14844               mm         205
           14831                          195
*          13191               »»         180

The situation with regard to adverbs is a little different. While the results indicate a strong tendency towards the negative BU, both negatives may be used.

Table 7: Collocability of adverbs (Corpus: web-zh)

# Query: word,(meet [tag="AD"] 2:[tag="AD" & word="不"] -1 -1)              (left columns)
# Query: word,(meet [tag="AD"] 2:[tag="AD|VV" & word="没|没有"] -1 -1)      (right columns)

No. of results: 444055         No. of results: 52008

word       frequency           word       frequency
S          76724                          6480
X          39336               S          4637
           22521                          4062
M          19298                          2989
H          17114               £          2855
Èâ H       13088                          2316
           12286                          1889
£          10410               *          1541
£          8623                X          988
           8621                           920

5 Conclusion

To begin with, the statistical data given in this study should only be taken as exhibiting a general tendency and not as a fully accurate description of "real" language. It should also be pointed out that this paper only examines the occurrence of negatives at the first position to the left of collocates. In this respect, new methods should be devised for solving the issues addressed here, e.g. the problem of POS annotation and its error rate, which may significantly affect the statistical data, or the problem of distinguishing the negative MEI from the verb you 有 (with the tag VE), etc.

This leads us to the questions of how to interpret the results in light of these points and what valuable results this study brings. Firstly, when comparing the results for both negatives, it seems that some verbs described as "auxiliary" or "modal" tend to collocate with the negative MEI more often than language prescription states. On the other hand, the empirical data support the claim that adjectives only collocate with the negative BU. As for adverbs, there is still a strong preference for BU, but because I do not consider adverbs "true" collocates of negatives (rather parts of a bigger structure), this question should be explored in future research.

Let us now move on to the negative MEI. There are many verbs that collocate preferably with MEI rather than with BU. A closer look at the results reveals that their morphological structure is disyllabic and the left morpheme is often a so-called "resultative complement" (jieguo buyu 结果补语). This finding may imply that the category of verbal aspect and tense²³ deserves closer attention.
That means that if MEI is regarded as a past-time marker, these verbs are commonly used in the past tense, while the present tense (with BU) may describe the situation as a condition or as future tense. A similar phenomenon is also observed in some Slavic languages, where the present and preterite of perfective verbs fulfil these functions too (e.g. compare the present perfective form "urobím" vs. the past perfective form "urobil" in Slovak). This suggests that these verbs in Chinese might be treated as perfective. In order to fully explore this topic, the marker le 了, as a counterpart to the negative MEI, should be included in a comparative analysis. A very detailed, corpus-based study of this subject, Operator Le in Chinese by Petrovčič (2009), is worth noting here.

To conclude, the article shows how to use a corpus when searching for evidence of some language phenomena. As for negation in Chinese, the paper only suggests a different approach to this subject and additional research is needed.

²³ See also Petrovčič (2017, pp. 108-109).

References

Baker, P., Hardie, A., & McEnery, T. (2006). A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.
Fei, X. (2000). The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0). Available at https://www.cs.brandeis.edu/~clp/ctb/posguide.3rd.ch.pdf
Gajdoš, L., Garabík, R., & Benická, J. (2016). The New Chinese Webcorpus Hanku - Origin, Parameters, Usage. Studia Orientalia Slovaca, 15(1), 53-65.
Gajdoš, L. (2018). Korpusová analýza adverbiále v právnej čínštine [Corpus Analysis of adverbial complements in legal Chinese]. In T. Guldanová (Ed.), Kontexty súdneho prekladu VII (pp. 27-39). Bratislava: Univerzita Komenského.
Liu, Y. et al. (2004). Practical Chinese Grammar [^^RinÜ]. Beijing: Shangwu yinshuguan.
Petrovčič, M. (2009). Operator Le in Chinese. Saarbrücken: VDM Verlag.
Petrovčič, M. (2017). Traditional and Contemporary Approaches to Chinese Particles. In S. Bračič & M. Petrovčič (Eds.), Partikeln überall: Deutsch - Slowenisch - Chinesisch (pp. 103-122). Ljubljana: Znanstvena založba Filozofske fakultete.
Sketch Engine. Available at https://www.sketchengine.eu
Švarný, O., & Uher, D. (2014). Prozodická gramatika čínštiny [Prosodic Grammar of Chinese Language]. Olomouc: Univerzita Palackého.
Yip, P., & Rimmington, D. (2009). Basic Chinese. Abingdon: Routledge.

Appendix: The Hanku tagset²⁴

Tag    English                                          Example
AD     adverb                                           fa
AS     aspect particle                                  m
BA     preposition BA in ba-construction                把
CC     coordinating conjunction                         M
CD     cardinal number
CS     subordinating conjunction                        M
DEC    markers - nominalizer
DEG    genitive marker                                  的
DER    resultative DE
DEV    manner DE                                        地
DT     determiner                                       a
ETC    et cetera
FW     foreign word                                     ISBN
IJ     interjection
JJ     other noun-modifier                              £
LB     preposition BEI in long bei-construction         被
LC     localizer                                        ±
M      measure word
MSP    other particle                                   m
NN     noun
NR     proper noun                                      ^in
NT     temporal noun
OD     ordinal number                                   IH
ON     onomatopoeia
P      preposition
PN     pronoun                                          m
PU     punctuation                                      。
SB     preposition BEI in short bei-construction        被
SP     sentence-final particle                          了
VA     predicative adjective                            *
VC     copula
VE     verb YOU as the main verb
VV     verb

²⁴ For details see Fei (2000, pp. 4-35).