University of Ljubljana, Faculty of Arts

Acta Linguistica Asiatica
Volume 2, Number 2, October 2012
Lexicography of Japanese as a Second/Foreign Language

Editors: Andrej Bekeš, Mateja Petrovčič
Issue Editor: Kristina Hmeljak Sangawa
Editorial Board: Bi Yanli (China), Cao Hongquan (China), Luka Culiberg (Slovenia), Tamara Ditrich (Slovenia), Kristina Hmeljak Sangawa (Slovenia), Ichimiya Yufuko (Japan), Terry Andrew Joyce (Japan), Jens Karlsson (Sweden), Lee Yong (Korea), Arun Prakash Mishra (India), Nagisa Moritoki Škof (Slovenia), Nishina Kikuko (Japan), Sawada Hiroko (Japan), Chikako Shigemori Bučar (Slovenia), Irena Srdanovic (Japan).

© University of Ljubljana, Faculty of Arts, 2012. All rights reserved.
Published by: Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts)
Issued by: Department of Asian and African Studies
For the publisher: Andrej Černe, the Dean of the Faculty of Arts
The journal is licensed under a Creative Commons Attribution 3.0 Unported (CC BY 3.0) licence.
Journal's web page: http://revije.ff.uni-lj.si/ala/
The journal is published in the scope of Open Journal Systems.
ISSN: 2232-3317
Abstracting and Indexing Services: COBISS, Directory of Open Access Journals, Open J-Gate and Google Scholar.
Publication is free of charge.
Address: University of Ljubljana, Faculty of Arts, Department of Asian and African Studies, Aškerčeva 2, SI-1000 Ljubljana, Slovenia
E-mail: mateja.petrovcic@ff.uni-lj.si

Table of Contents

Foreword

RESEARCH ARTICLES
Kokugo Dictionaries as Tools for Learners: Problems and Potential
Tom GALLY
Towards the Lexicographic Description of the Grammatical Behaviour of Japanese Loanwords: A Case Study
Toshinobu MOGI

RESEARCH ARTICLES (PROJECT REPORTS)
Compilation of Japanese Basic Verb Usage Handbook for JFL Learners: A Project Report
Prashant PARDESHI, Shingo IMAI, Kazuyuki KIRYU, Sangmok LEE, Shiro AKASEGAWA and Yasunari IMAMURA
ITADICT Project and Japanese Language Learning
Marcella MARIOTTI, Alessandro MANTELLI
Automatic Addition of Stylistic Information in a Japanese Dictionary
Raoul BLIN
The Construction of a Database to Support the Compilation of Japanese Learners' Dictionaries
Yuriko SUNAKAWA, Jaeho LEE, Mari TAKAHARA

Foreword

It is my pleasure to introduce this thematic issue dedicated to the lexicography of Japanese as a second or foreign language, the first thematic issue in Acta Linguistica Asiatica since its inception. Japanese has an outstandingly long and rich lexicographical tradition, but until the end of the twentieth century there were relatively few dictionaries of Japanese targeted at learners of Japanese as a foreign or second language.
With the growth of Japanese language teaching and learning around the world, the rapid development of very large scale linguistic resources and language processing technologies for Japanese, a new generation of aggregated, collectively developed or crowd-sourced resources evolving in the context of the social web, a shift from static paper to constantly developing electronic resources, the spread of internet access on hand-held devices, and new approaches to the use of language reference resources stemming from these developments, dictionaries and other reference resources for learners, teachers and users of Japanese as a foreign/second language are being developed and used in new ways in different user communities. However, information about such developments often does not reach researchers, lexicographers, dictionary users and language teachers in other user communities or research spheres. This special issue aims to contribute to the spread of such information by presenting some recent developments in this growing field. We received a very lively response to our call for papers, so not all of the papers selected for publication could fit into this issue; some of them will be included in the December issue of ALA, which will also be dedicated to Japanese lexicography. The first round of papers included in this issue presents a varied cross-section of current JFL lexicographical work and research. All papers in this issue point out the relative scarcity of appropriate reference works for learners of Japanese as a foreign language, especially when compared to lexicographical resources for Japanese native speakers, and each of the endeavours presented here confronts this lack with its own original approach.
Reflecting the paradigm shift in Japanese language research, where corpus research is again playing a central role, most papers presented here take advantage of the bounty of newly available corpora and web data, most prominent among which is the Balanced Corpus of Contemporary Written Japanese developed by the National Institute for Japanese Language and Linguistics in Tokyo, and which is used by Mogi, Pardeshi et al. and Sunakawa et al. in their lexicographical research and projects, while Blin taps data for his research from the web, another increasingly important linguistic resource. The first two papers offer two perspectives on existing Japanese dictionaries. Tom Gally in his paper Kokugo Dictionaries as Tools for Learners: Problems and Potential points out the drawbacks of currently available Japanese dictionaries from the perspective of learners of Japanese as a foreign language, but at the same time offers a very detailed and convincing explanation of the merits of monolingual Japanese dictionaries for native speakers (kokugo dictionaries), such as their comprehensiveness, detailedness and quantity of contextual information, when compared to bilingual dictionaries, which make them a potentially useful resource even for an audience they are not targeting - foreign language learners. His detailed explanation of possible uses and potential hurdles and pitfalls learners may encounter in using them, is not only accurate and informative, but also of immediate practical value for language teachers and lexicographers. Toshinobu Mogi, in his paper Towards the Lexicographic Description of the Grammatical Behaviour of Japanese Loanwords: A Case Study, investigates the lexicographic description of loanwords in Japanese reference works and notes how information offered by currently available dictionaries, especially regarding the grammatical aspects of loanword use, is not sufficient for learners of Japanese as a foreign language. 
After pointing out the deficiencies of current dictionary descriptions and noting how dictionaries' sense divisions do not reflect the frequency of different senses in actual use, as attested in a large-scale representative general corpus of Japanese, he uses a fascinatingly detailed analysis of the behaviour of a Japanese loanword verb to describe a corpus-based method of lexical description, based on the correspondence between usage forms and senses, which could be used for the compilation of Japanese learners' dictionaries meant for the reception and production of Japanese. The second part of this special issue is composed of four reports on particular aspects of ongoing lexicographical work targeted at learners of Japanese as a foreign language. Prashant Pardeshi, Shingo Imai, Kazuyuki Kiryu, Sangmok Lee, Shiro Akasegawa and Yasunari Imamura, in their paper Compilation of Japanese Basic Verb Usage Handbook for JFL Learners: A Project Report, point out, like other authors in this issue, the lack of a detailed and pedagogically sound lexicographical description of Japanese basic vocabulary for foreign learners, and propose a corpus-based on-line system which incorporates insights from cognitive grammar, contrastive studies and second language acquisition research to solve this problem. They present their current implementation of such a system, which includes audiovisual material and translations into Chinese, Korean and Marathi. The system also uses natural language processing techniques to support lexicographers who need to process daunting amounts of corpus data in order to produce detailed lexical descriptions based on actual use. The next article, by Marcella Maria Mariotti and Alessandro Mantelli, ITADICT Project and Japanese Language Learning, focuses on the learner's perspective.
They present a collaborative project in which Italian learners of Japanese compiled an on-line Japanese-Italian dictionary using a purposely developed on-line dictionary editing system, under the supervision of a small group of teachers. One practical and obvious outcome of the project is a Japanese-Italian freely accessible lexical database, but the authors also highlight the pedagogical value of such an approach, which stimulates students' motivation for learning, hones their ICT skills, makes them more aware of the structure and usability of existing lexicographic and language learning resources, and helps them learn to cooperate on a shared task and exchange peer support. The third project report by Raoul Blin, Automatic Addition of Genre Information in a Japanese Dictionary, focuses on the labelling of lexical genre, an aspect of word usage which is not satisfactorily presented in current Japanese dictionaries, despite its importance for foreign language learners when using dictionaries for production tasks. The article describes a procedure for automatic labelling of genre by means of a statistical analysis of internet-derived genre-specific corpora. The automatisation of the process simplifies its later reiteration, thus making it possible to observe lexical genre development over time. The final paper in this issue is a report on The Construction of a Database to Support the Compilation of Japanese Learners' Dictionaries, by Yuriko Sunakawa, Jae-ho Lee and Mari Takahara. Motivated by the lack of Japanese bilingual learners' dictionaries for speakers of most languages in the world, the authors engaged in the development of a database of detailed corpus-based descriptions of the vocabulary needed by learners of Japanese from intermediate to advanced level. 
By freely offering online the basic data needed for bilingual dictionary compilation, they are building the basis from which editors in under-resourced language areas will be able to compile richer and more up-to-date contents even with limited human and financial resources. This project is certainly going to greatly contribute to the solution of existing problems in Japanese learners' lexicography.

Kristina Hmeljak Sangawa

RESEARCH ARTICLES

Kokugo Dictionaries as Tools for Learners: Problems and Potential

Tom GALLY
The University of Tokyo
cwpgally@mail.ecc.u-tokyo.ac.jp

Abstract
For second-language learners, monolingual dictionaries can be useful tools because they often provide more detailed explanations of meanings and more extensive vocabulary coverage than bilingual dictionaries do. While learners of English have access to many monolingual dictionaries designed specifically to meet their needs, learners of Japanese must make do with Kokugo dictionaries, that is, monolingual dictionaries intended for native Japanese speakers. This paper, after briefly describing Kokugo dictionaries in general, analyzes a typical entry from such a dictionary to illustrate the advantages and challenges of the use of Kokugo dictionaries by learners of Japanese.

Keywords
monolingual; Japanese; kokugo; dictionary; learners

Izvleček
Enojezični slovarji so lahko koristno orodje pri učenju tujega jezika, saj pogosto ponujajo bolj podrobne razlage pomenov in pokrivajo bolj obsežno besedišče kot pa dvojezični slovarji. Medtem ko imajo učenci angleščine kot tujega jezika na voljo veliko enojezičnih slovarjev, ki so bili izdelani prav za njihove potrebe, pa morajo učenci japonščine uporabljati enojezične slovarje japonščine, imenovane Kokugo, t.j. slovarje, ki so namenjeni govorcem japonščine kot maternega jezika.
Pričujoči članek - po kratki splošni predstavitvi slovarjev Kokugo - skozi analizo slovarskega članka iz takega slovarja oriše prednosti in izzive rabe slovarjev Kokugo za učenje japonščine kot tujega jezika. Ključne besede enojezični; japonščina; kokugo; slovar; učenci Acta Linguistica Asiatica, Vol. 2, No. 2, 2012. ISSN: 2232-3317 http://revije.ff.uni-lj.si/ala/ 1. Introduction In the past few decades, learners of English as a second language have benefited from the publication and rapid development of many monolingual dictionaries designed specifically to meet their needs. These dictionaries, which include Oxford Advanced Learner's Dictionary, Collins COBUILD Advanced Learner's English Dictionary, and similar volumes from Cambridge, Longman, Macmillan, and Merriam-Webster, have incorporated many learner-friendly features, including a controlled defining vocabulary, greater attention to collocations and idioms both as headwords and in definitions and examples, extensive use of corpora for meaning explication and example selection, and new macro- and microstructure designs. (For more information on these dictionaries, see Cowie, 1999, and Bejoint, 2010, pp. 163-200.) The rapid innovations in these dictionaries have been driven not only by advances in lexicography and corpus linguistics but also by the huge global market for English-learning materials, making learner's dictionaries, despite the large investment necessary for their creation, a potentially lucrative source of income for publishers. Learners of Japanese, however, have not been nearly as fortunate, as there are no monolingual dictionaries of Japanese currently available that meet the needs of intermediate and advanced learners.1 Learners fluent in English, Chinese, or Korean, which have reasonably good bilingual dictionaries with Japanese, might not suffer significantly from this lack, but speakers of most other languages are at a severe disadvantage when trying to learn Japanese. 
Furthermore, at least in the case of English, most of the bilingual dictionaries used by learners of Japanese were in fact written for native Japanese speakers and thus lack many features needed by second-language learners, including explanatory definitions for difficult-to-translate headwords, verb-conjugation categories and other grammatical information, and usage notes. Perhaps the greatest drawback of bilingual dictionaries published for fluent Japanese speakers, when considered from the learner's perspective, is the omission of headwords that Japanese users are not likely to seek when using a bilingual dictionary into another language, including slang, dialect, archaisms, variants, and proper names.

Some of these drawbacks of bilingual dictionaries of Japanese can be overcome through the use of a type of dictionary often overlooked in Japanese-language education: monolingual dictionaries of Japanese aimed at native speakers of the language. Usually called kokugo jisho [国語辞書] or kokugo jiten [国語辞典], these dictionaries are readily available in both paper and digital versions from commercial Japanese publishers, and they have many advantages over bilingual dictionaries: their definitions are often explanations of the headword's meaning, rather than mere synonyms; they indicate conjugation categories of verbs; and, while their inclusion of slang and other nonstandard language is sometimes limited, they do contain a wider range of vocabulary than most bilingual dictionaries. Because these dictionaries were written for native speakers of Japanese, however, they present significant hurdles to learners, particularly in the comprehensibility of their definitions and examples. This paper therefore examines the typical features of these monolingual Japanese dictionaries, called Kokugo dictionaries here, and discusses the advantages and disadvantages of those features for people learning Japanese as a second language.

1 The monolingual Dictionary of Basic Japanese Usage for Foreigners was published by Japan's Agency for Cultural Affairs in 1971. This dictionary incorporated features useful for learners, including explanatory definitions written in relatively simple language, many example sentences, and full conjugation information for verbs and adjectives. Although a second edition appeared in 1975 and a third with about 4,500 headwords in 1990, the dictionary is no longer in print, let alone available in digital form. Two companion volumes, listed in library catalogs but not consulted for this study, Dictionary of Chinese Characters for Foreigners [外国人のための漢字辞典 Gaikokujin no Tame no Kanji Jiten] (1966) and A Specialized Scientific Dictionary for the Foreigners: Physical Science [外国人のための専門用語辞典：自然科学系 Gaikokujin no Tame no Senmon Yogo Jiten: Shizen Kagaku Kei] (1966), are also out of print.

2. Contemporary Kokugo Dictionaries

A wide range of Kokugo dictionaries are currently available for native speakers, from the 14-volume Nihon Kokugo Daijiten, a comprehensive historical dictionary of the language from the earliest recorded times to the present, to small, inexpensive dictionaries sold in 100-yen stores that are intended mainly to provide the meanings and orthography of "hard" or often misunderstood words.2 Although dictionaries all along this spectrum can be used profitably by learners, this paper will concentrate on two categories of Kokugo dictionaries that are likely to be most useful: midsized dictionaries that focus on contemporary general vocabulary, and comprehensive dictionaries that also include historical vocabulary and encyclopedic entries.

Among the many midsized dictionaries aimed at general users are Shinmeikai Kokugo Jiten, Iwanami Kokugo Jiten, Sanseido Kokugo Jiten, and Meikyo Kokugo Jiten.
These dictionaries typically claim to have about 70,000 entries, and their paper editions have between about 1400 and 1900 pages, including front and back matter. Their headwords, senses, and examples primarily reflect the modern Japanese language, and they contain few encyclopedic entries. (The second edition of Meikyo Kokugo Jiten, for example, contains brief entries for Nihon "Japan" and Chugoku "Chugoku region; China" but none for Tokyo or Amerika.)

2 Okimori, Kurashima, Kato, & Makino (1996) contains a full list and descriptions of Kokugo and other dictionaries published in Japan up through the mid-1990s. Some of the many books in Japanese about the history and characteristics of Kokugo dictionaries are Kurashima (1997), Ishiyama (2007), and Kurashima (2010).

One-volume comprehensive dictionaries are printed in a larger format and contain more pages, usually around 3000, and claim to have around 230,000 entries. Three comprehensive dictionaries that, as of 2012, have been updated recently are Kojien, Daijirin, and Daijisen. (Many similar dictionaries have been published in the past century, but most are no longer being updated.) In addition to the contemporary vocabulary covered by the midsized dictionaries, the comprehensive dictionaries also contain archaic headwords and senses, and citations are often taken from classical or canonical literary works. They also contain many proper names and technical words that are missing from the midsized dictionaries. Perhaps the most important difference among these dictionaries for learners is the sense order: while Kojien orders the multiple senses of a headword with the earliest or most basic meanings first, Daijirin and Daijisen give the most common contemporary meanings first.

As of 2012, all of the dictionaries named above are available in paper form.
Most are also available in digital formats, which might include cd- and/or dvd-roms, portable electronic dictionaries, free and/or subscription-based Web sites, and smartphone, tablet, and personal computer applications. Data on dictionary sales in Japan are held closely by publishers, but anecdotal evidence, including observations of the dictionaries used by university students and the space allocated to paper dictionaries in bookstores, suggests that the era of paper dictionaries is coming to an end. While digital versions do offer some distinct advantages to students, including faster lookup times, intra- and inter-dictionary links, and, on some devices, handwritten input, the actual content of digital Kokugo dictionaries is so far largely identical to that of their paper versions. For this reason, and because the rapid progress of digital and network technology makes it difficult to predict how Kokugo dictionaries might be delivered to users in coming years, this paper will focus only on the content of dictionary entries, not their medium of presentation. 3. A Typical Entry in a Midsized Dictionary To see the advantages and challenges of the use of Kokugo dictionaries by learners of Japanese, let us examine in detail an entry for a word that an intermediate or advanced learner might want to look up in a dictionary: the verb satoru. This word was chosen because one of its two main senses is used in general contexts in the contemporary language while the other is limited to a particular cultural domain. The entry for satoru from the second print edition of the midsized Meikyo Kokugo Jiten (2010) appears below. This is followed by a detailed explanation of the entry's components and the implications of each component for a learner accessing the entry. In the explanations, the romanization of each component is given in italics for the convenience of readers. 
[Entry: さと・る 【悟る（覚る）】, Meikyo Kokugo Jiten, second edition (2010)]

3.1 Headword

The headword is listed in kana order based on the pronunciation of its unmarked imperfective form, さとる satoru, not by its usual orthographic representations (悟る or, less commonly, 覚る). Thus the preceding word in the paper dictionary is さとり satori and the following word is サドル sadoru. For learners using paper Kokugo dictionaries, this pronunciation-based listing can be frustrating, as often a word one wishes to look up appears in a text at least partly in kanji, rather than entirely in kana; if one does not know the reading of the kanji, one cannot find it easily in a Kokugo dictionary.3 This problem is usually alleviated with electronic dictionaries, which, depending on the hardware and software, allow kanji-containing words to be looked up using cut-and-paste, stylus or finger input, optical character recognition, or selection of kanji components (multiradical lookup).

In Meikyo, the boundary between the verb stem さと sato- and suffix る -ru is indicated by a nakaguro, or black dot (・); the same symbol is used in this dictionary to separate the stems and suffixes of adjectives. Meikyo also uses a hyphen (‐) to separate the parts of compound words; the word サドンデス sadondesu "sudden death", for example, appears as a headword as サドン‐デス. Neither the black dot nor the hyphen would appear in those words in a regular text. These markers, which are normally omitted from bilingual dictionaries, can provide useful clues to learners about the morphemic structure and etymology of headwords.

One of the challenges for learners using most dictionaries of Japanese, including all currently available Kokugo dictionaries, is that words can be looked up only by their canonical, unmarked form.
If a reader encounters a conjugated form of the verb satoru, such as the potential satoreru or the negative passive participle satorerarenakute, and wants to find the meaning of the word in a dictionary, he or she must be able to deduce that the plain imperfective form is satoru. A fairly high level of grammatical knowledge is therefore necessary before a learner can use such dictionaries effectively.4

3 Some printed Kokugo dictionaries, including Iwanami Kokugo Jiten and Daijirin, have indexes of kanji and "hard-to-read" kanji combinations (jukugo), but those indexes exclude many word forms that learners would need to look up. A complete kanji and kanji-compound index to the second edition of the comprehensive dictionary Daijirin was published in 1997 as a separate volume (Kanjibiki Gyakuhiki Daijirin), but its bulk makes it unwieldy for casual use.

4 An exception is Jim Breen's WWWJDIC, a free online Japanese-English dictionary. Searches for most conjugated or declined forms of words lead to the standard headword forms.

3.2 Orthography

Because this verb can be written not only in kana but also with kanji, the two usual kanji representations follow the headword in brackets: 悟る（覚る）. The lack of any marking or further bracketing of the first version, 悟る, indicates that this is a standard written form of the verb. The second version, 覚る, is both enclosed in curved parentheses, indicating that it is a nonstandard form, and marked with a symbol indicating that, while the kanji 覚 appears on the Joyo Kanji (常用漢字) list of characters designated by the government for everyday use, sato- is not an officially designated reading for that character. Other symbols are used in this dictionary to indicate when kanji do not appear on the Joyo list at all, when a reading is in an annex to the Joyo Kanji list, and when a combination of characters has a special reading.
This detailed information about the status of different written forms of words can be useful to learners for at least two reasons. When a reader learns from a dictionary that the written form of a word he or she has encountered in a text is nonstandard, the reader can often infer something about the text's provenance: it might predate the government's postwar orthographic standards, it might not have been subjected to the rigorous editing applied to newspapers and some other publications, or it might reflect the author's individual preferences or literary sensibility. The orthographic labeling also helps the learner decide what form to use when writing in Japanese; a person composing a university report or a job application letter, for example, might decide to use the standard forms even if he or she prefers the nonstandard forms.

3.3 Part-of-Speech Information

The next item in the entry, 他五, consists of two abbreviations of verbal categories. The character 他 indicates that the headword is a transitive verb (他動詞 tadoshi), while the character 五 shows that it follows the godan conjugation pattern. For other headwords, this information might be 名, for 名詞 (meishi, "noun"); 形, for 形容詞 (keiyoshi, "adjective"); 形動, for 形容動詞 (keiyo doshi, "adjectival verb"); 代, for 代名詞 (daimeishi, "pronoun"); etc. This grammatical information, especially about verb categories, is usually omitted from bilingual dictionaries aimed at native speakers of Japanese. Learners opening a Kokugo dictionary for the first time, however, are likely to be confused by these abbreviations, as they might refer to grammatical categories that the learners know by very different names. Godan conjugation verbs, for example, are often called "consonant-stem" verbs in textbooks of Japanese written in English, and understanding the term godan and similar expressions requires familiarity with Japanese grammar as it has been taught in Japanese schools.
Kokugo dictionaries also often indicate the categories of verbs for the literary language (文語 bungo), which many students do not need to learn. In order to get the most out of this section of Kokugo dictionary entries, therefore, students would have to make a conscious effort to learn the abbreviations and their meanings in the context of traditional Japanese school grammar.

3.4 Definitions

This entry for satoru has two senses, marked with the numbers ① and ②. Within each sense is a definition followed by an example or two. The definition of the first sense is 物の本質や意味などを（直観的に）はっきりと理解する。また、隠されていたことなどをはっきりと認識する。 Mono no honshitsu ya imi nado o (chokkanteki ni) hakkiri to rikai suru. Mata, kakusarete ita koto nado o hakkiri to ninshiki suru. This might be translated as "To understand clearly (intuitively) the essence, meaning, etc. of something. Or, to recognize clearly something that is hidden." The definition of the second sense is [...]

[to unlock, to deactivate] Chuukosha de, rimittaa ga katto sareteiru kuruma wa urareteiru mono na no deshouka? "In the case of used cars, are cars with deactivated speed limiters being sold?" (0C06_00923)

From a semantic point of view, sense [2] "to cut hair" could be merged into sense [1], since it coincides with the sense "to shorten long thin things". However, since examples categorised as sense [2] contain mediative expressions (expressions of semantic indirectness), as will be explained later, these examples were put in a separate sense. Sense [3] is generally presented in dictionaries as "to delete/abridge a part of a text or sum of money", but in fact it may express the deletion of something in its completeness, as it can collocate with expressions such as zenbu (全部 "all") or zengaku (全額 "the whole sum") as in example (7b). However, these cases may also be interpreted as "considering a larger unit, making the larger unit smaller".
It should be noted that most dictionaries present the use of katto-suru as a specialised term in sports, such as senses 3. and 4. in the dictionary description quoted at (4), but that examples of this use such as (8b) and (9a) were actually very rare in BCCWJ.

4.2 Syntactical characteristics

This section explores the syntactical characteristics of the verb katto-suru, beginning with co-occurrence patterns, for each of the senses presented in 4.1.

4.2.1 Co-occurrence patterns

Let us first consider the case particles and adverbs occurring in sentences where the predicate is katto-suru. Table 3 presents data for all patterns occurring in at least 5 examples.

Table 3: Co-occurrence patterns of katto-suru

                             Case particles                                      Adverbial expressions
Sense         No. of    o         de        de          kara       Adverbs    Adverbs
              examples  (object)  (instr.)  (location)  (source)   of result  of quantity
[1] cut         98        66         9         -           1          32          8
[2] cut hair    33        11         3         6           -          11          -
[3] reduce     103        52         -         -           5           4         16
[4] block       14        12         2         -           -           -          -
[5] other        4         2         -         -           -           -          -
Total          252       143        14         6           6          47         24

The direct object of the transitive verb katto-suru marked by particle o, including cases where the object is topicalised and marked by particle wa, appears in the same sentence in 143 of 252 cases (56.7%). There are 54 further cases when the patient, which is usually marked by particle o, appears as the subject of a passive sentence marked by particle ga, and 10 cases where the patient is the head of a noun-modifying clause and therefore not accompanied by any particle. An accurate count of the examples where the patient (usually accompanied by particle o) is really absent therefore yields 45 examples altogether (17.9%).
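The arithmetic behind these percentages can be retraced with a few lines of Python. This is only an illustrative sketch: the counts are the ones quoted in the text, while the variable names and the helper `share` are hypothetical and not part of the original study.

```python
# Counts quoted in the text for katto-suru in BCCWJ.
# Variable names and the helper below are illustrative assumptions.
total_examples = 252
object_marked = 143         # patient marked by o (or topicalised with wa)
passive_subject = 54        # patient is the ga-marked subject of a passive
noun_modifying_head = 10    # patient is the head of a noun-modifying clause


def share(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Examples in which the patient is not expressed at all.
patient_absent = (
    total_examples - object_marked - passive_subject - noun_modifying_head
)

print(share(object_marked, total_examples))    # 56.7
print(patient_absent)                          # 45
print(share(patient_absent, total_examples))   # 17.9
```

Subtracting the passive and noun-modifying cases first matters here: counting every sentence without an o-marked object as "objectless" would overstate the figure as 109 examples rather than 45.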
Since 20 of these are examples of sense [2], we can conclude that one of the characteristics of sense [2] is that it tends not to co-occur with the verb's direct object, although it is sometimes difficult to decide whether an example such as (10) should be considered to be a case of a transitive sentence where the object (kami o "hair o") is omitted, or a case of an intransitive sentence. In this analysis, such sentences were considered to be cases of transitive sentences.

(10) 私はいつも美容院でカットしている。 Watashi wa itsumo biyouin de katto-shite iru. ("I always have [my hair] cut at the hairdresser's." / "I always have a cut at the hairdresser's.")

Other particles that characteristically co-occur are instrumental particle de with sense [1] (e.g. hasami de katto-suru "cut with scissors"), locative particle de with sense [2] (e.g. biyouin de katto-suru "to cut / have a haircut at the hairdresser's"), and source particle kara with sense [3] (e.g. chingin kara katto-suru "to cut from wages"). If we now consider adverbial expressions, we find that senses [1] and [2] are often accompanied by adverbs which express the result of cutting, such as "... o mijikaku ("short") / hitokuchidai ni ("to a mouthful") / suki na katachi ni ("to one's preferred shape") katto-suru", while sense [3] is often accompanied by adverbs of quantity or degree, such as "... o sukoshi ("a little") / ichibu ("in part") / zengaku ("completely, for the whole sum") katto-suru". On the basis of this analysis of co-occurrences, table 4 summarises typical patterns for each sense, taking into account patterns which occurred in approximately 10% or more examples.
Table 4: Senses and patterns of katto-suru

| Sense | Subsense | Arguments | Patterns |
|---|---|---|---|
| [1] cut | to process / peel / separate | [food] o ([tool] de) | o case (+ instrumental de case) (+ expression of result) |
| | to shorten / make smaller | [thin and long / thin object] o ([tool] de) | |
| [2] cut hair | to cut and arrange hair | ([hair] o) ([place] de) | (o case) (+ locative de case) (+ expression of result) |
| [3] reduce | to delete / abridge | [images / text / items] o | o case (+ expression of quantity / degree) |
| | to reduce the quantity | [money / quantity] o | |
| [4] block | block, obstruct | [ultraviolet rays / light] o ([tool] de) | o case (+ instrumental de case) |
| | take, intercept | [a ball / a pass] o ([bodypart] de) | |

As seen above, by analysing corpus data it is possible to present detailed information regarding the cases, arguments and adverbs which tend to co-occur with a particular verb, and the semantic category of nouns which tend to appear in these arguments. In the case of katto-suru, it was shown that all senses appear in transitive uses of the verb, but that each sense appears in its own particular pattern.

4.2.2 Sentence-final forms

Let us now turn to the characteristic sentence-final forms, such as voice markers and auxiliary verbs, which appear after the verb katto-suru. Table 5 presents forms which appeared in at least 5 examples.

Table 5: Sentence-final forms of katto-suru

| Sense | No. of examples | -rare- (passive) | -te iru (progressive / resultative / habitual) | -te shimau (perfect) | -te morau (benefactive) | -te iku (future continuation) | -you (volitive) |
|---|---|---|---|---|---|---|---|
| [1] cut | 98 | 9 | 4 | 1 | | 2 | |
| [2] cut hair | 33 | 2 | 8 | | 6 | 1 | |
| [3] reduce | 103 | 39 | 16 | 6 | 1 | 4 | 5 |
| [4] block | 14 | 2 | | 1 | | | |
| [5] other | 4 | 2 | 2 | | | | |
| Total | 252 | 54 | 30 | 8 | 7 | 7 | 5 |

A very conspicuous point is that sense [3] tends to occur in the passive form much more than the other senses (72.2% of all passive sentences pertain to this sense).
This tendency reflects the fact that the situation of "reducing / cutting down surplus parts of a text, image or sum of money" is depicted from the point of view of the (possibly unwitting) receiver more often than is the case for the other senses. This is also corroborated by the fact that as many as 6 examples of -te shimau, an auxiliary verb expressing regret, are used in this sense in the passive form. Conversely, the volitive form -you, which is used when viewing the act from the opposite point of view, such as that of the administration or management, is used very rarely and almost exclusively in parliamentary proceedings and economic texts.

Sense [2], on the other hand, often occurs with benefactive verbs (-te morau "receive" or -te kureru "give (to me)"), expressing gratitude for the received action. Katto-suru used in sense [2] exhibits the characteristic of semantic indirectness (mediativeness) (kaizaisei 介在性; Sato, 2005), in the sense that it can be used in examples such as (10) even when it was a hairdresser or someone else (and not the subject of the sentence) who actually cut the hair, at the subject's request. The syntactic construction with an auxiliary benefactive verb could therefore be considered a syntactic reference to the agent who acted upon request.

If we now consider the form -te iru, we find as many as 22 examples where the form expresses the resulting state of an action (16 of these are in the passive form), while most of the remaining examples belong to sense [2]: 6 examples of repetitive action or habit and 1 example of an action in progress. The tendency of sense [2] to appear in forms expressing repetition can be seen as stemming from the fact that "cutting one's hair" is something done habitually. Finally, all examples of the form -te iku, regardless of the verb sense, express gradual development or progression of the action described and do not exhibit any particularity regarding sense.
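The distribution of sentence-final forms across senses can be checked in the same way. The sketch below is an illustration only: the key names are hypothetical labels for the columns of Table 5, the placement of each count in a column is inferred from the column totals, and blank cells are assumed to be zero. It distinguishes two different percentages: the share of a form's examples that fall under a given sense (the 72.2% quoted for the passive) and the rate at which a sense appears in that form.

```python
# Counts transcribed from Table 5; blank cells are treated as zero (an assumption
# consistent with the column totals).
TABLE5 = {
    "[1] cut":      {"n": 98,  "passive": 9,  "te_iru": 4,  "te_shimau": 1, "te_iku": 2},
    "[2] cut hair": {"n": 33,  "passive": 2,  "te_iru": 8,  "te_morau": 6, "te_iku": 1},
    "[3] reduce":   {"n": 103, "passive": 39, "te_iru": 16, "te_shimau": 6,
                     "te_morau": 1, "te_iku": 4, "you": 5},
    "[4] block":    {"n": 14,  "passive": 2,  "te_shimau": 1},
    "[5] other":    {"n": 4,   "passive": 2,  "te_iru": 2},
}


def column_total(table, form):
    """Total number of examples of a sentence-final form across all senses."""
    return sum(row.get(form, 0) for row in table.values())


def share_of_form(table, form):
    """Percentage of all examples of `form` that belong to each sense."""
    total = column_total(table, form)
    return {s: round(100 * row.get(form, 0) / total, 1) for s, row in table.items()}


def rate_within_sense(table, form):
    """Percentage of each sense's own examples that appear in `form`."""
    return {s: round(100 * row.get(form, 0) / row["n"], 1) for s, row in table.items()}


print(column_total(TABLE5, "passive"))
print(share_of_form(TABLE5, "passive"))
print(rate_within_sense(TABLE5, "passive"))
print(share_of_form(TABLE5, "te_morau"))
```

Reporting both percentages side by side makes clear that the prominence of sense [3] in the passive is not merely a consequence of its being the most frequent sense.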
4.3 Analysis

As seen in the above analysis, usage examples of katto-suru taken from the BCCWJ corpus can show not only the characteristics of nouns which typically co-occur as objects, but also the characteristics of co-occurring expressions and predicate forms which are typical for each sense. For example, with regard to sense [2] "cut one's hair", the following syntactical characteristics have been observed.

(11)
a. There are cases in which the object with particle o is not expressed, since the object "hair" is taken for granted.
b. There is sometimes an argument with locative case particle de, such as "biyouin de" ("at the hairdresser's").
c. The verb is often accompanied by expressions of result, such as "mijikaku" ("short").
d. The verb is often in the form -te iru, expressing repetitive action or habit.
e. It can appear in sentences expressing semantic indirectness (mediativeness), in which reference to the agent is made by means of auxiliary benefactive verbs.

The fact that we can observe such a correspondence between the senses of a verb and its syntactical characteristics means that we can predict, to a certain extent, the meaning and subsense of a word from the form of its sentence and, vice versa, the form of a sentence from the meaning (or subsense) of a word. In the context of teaching Japanese as a foreign language, providing learners with a description of a verb which includes not only its lexical meaning but also its patterns of usage, as presented in Table 4, would help them in both receptive and productive tasks. It would give them the information they need to tell, for example, which sense of katto-suru is meant in a sentence they read, judging from the co-occurring words and patterns, or to predict which words and patterns can be used with katto-suru when they want to use this verb in a particular sense.

5. Conclusion

The results of the case study presented in this paper indicate the importance of analysing loanwords from a grammatical perspective, investigating their behaviour within sentences and not only their meaning. Judging from the many differences in the description of loanwords in monolingual Japanese dictionaries, as presented in Table 1, it is clear that much is still unknown regarding loanwords in contemporary Japanese. The first task that awaits us is to build up a comprehensive description of basic loanwords on the basis of corpus data. As has been shown in the present paper, the results of such a description would be very useful for the teaching of Japanese as a foreign language. At the same time, while preparing detailed analyses of individual words, a descriptive lexicographical framework aimed at foreign learners needs to be developed, indicating what information needs to be included in a description of grammatical patterns. Moreover, even a corpus of the size of BCCWJ may sometimes not offer enough examples for fine-grained distinctions of meaning. Solving this problem is another task that awaits us.

References

Himeno, M. (2004). Kenkyusha Nihongo Hyōgen Katsuyō Jiten. Kenkyūsha.
Ishino, H. (1996). Jiten ni okeru Gairaigo no Gogi Kijutsu. In Gengogakurin 1995-1996. Tokyo: Sanseidō. 273-286.
Kim, E. (2011). 20-seiki Kōhan no Shinbun Goi ni okeru Gairaigo no Kihongoka. (Shift of the Loanwords to Basic Words in the Japanese Newspaper Vocabulary in the Second Half of the 20th Century.) Handai Nihongo Kenkyu, Monograph 3. Toyonaka: Osaka University.
Kitahara, Y. (2002). Meikyō Kokugo Jiten. Tokyo: Taisyūkan.
Koizumi, T., Funaki, M., Honda, K., Nitta, Y., & Tsukamoto, H. (1989). Nihongo Kihon Dōshi Yōhō Jiten. Tokyo: Taisyūkan.
Nakayama, E., Jinnouchi, M., Kiryu, R., & Miyake, N. (2008). Nihongo kyōiku ni okeru "Katakanago Kyōiku" no Atsukawarekata. (Teaching Japanese as a Foreign Language: Katakana and its Implementation in the Syllabus.) Nihongo Kyōiku, 138, 83-91.
Nishio, M., Iwabuchi, E., & Mizutani, S. (2009). Iwanami Kokugo Jiten. 7th ed. Tokyo: Iwanami Shoten.
Sanseidō Henshūjo. (2010). Concise Katakanago Jiten. 4th ed. Tokyo: Sanseidō.
Sasaki, M. (2001). Yoku Tsukau Katakanago. Tokyo: Alc.
Sato, T. (2005). Jidōshi-bun to Tadōshi-bun no Imiron. Tokyo: Kasamashoin.
Sawada, T. (1993). Nihongo Kyōiku no tameno Kihon Gairaigo ni tsuite. (Loanwords Usage in Japanese: The Fundamental Points for the Japanese Language Teaching.) Bulletin of Nara University of Education. Cultural and Social Science, 42(1), 225-239.

RESEARCH ARTICLES (PROJECT REPORTS)

Compilation of Japanese Basic Verb Usage Handbook for JFL Learners: A Project Report

Prashant PARDESHI, National Institute for Japanese Language and Linguistics (NINJAL), prashantpardeshi@gmail.com
Shingo IMAI, Tsukuba University
Kazuyuki KIRYU, Mimasaka University
Sangmok LEE, Kyushu University
Shiro AKASEGAWA, Lago Institute of Language
Yasunari IMAMURA, National Institute for Japanese Language and Linguistics (NINJAL)

Abstract

In this article we introduce a collaborative research project entitled "Nihongogakushuushayou kihondoushi youhouhandbook no sakusei (Compilation of Japanese Basic Verb Usage Handbook for Japanese as Foreign Language (JFL) Learners)" carried out at the National Institute for Japanese Language and Linguistics (NINJAL) and report on the progress of its research product, namely, a prototype of a basic verb usage handbook (referred to as "handbook" below). The handbook differs in many ways from the conventional printed or electronic dictionaries available at present. First, the handbook is compiled online and will be made freely available on the internet. Second, the handbook is corpus-based: the contents of each entry are written taking into consideration the actual use of the headword in the BCCWJ corpus.
Also, it contains illustrative examples of particular meanings culled from the BCCWJ corpus as well as examples coined by the entry writers. Third, the framework used in the description of semantic issues (polysemy network, cognitive mechanisms underlying semantic extensions, semantic relationships among various meanings, etc.) is cognitive grammar, which adopts a prototype approach. Fourth, it includes audio-visual contents (such as audio files, animations and video clips) for effective understanding, acquisition and retention of the various meanings of a polysemous verb. Fifth, the handbook is bilingual (Japanese-Chinese, Japanese-Korean and Japanese-Marathi) and incorporates insights from contrastive studies and second language acquisition. The handbook is an attempt to share cutting-edge research insights from various branches of linguistics with Japanese language pedagogy. It is hoped that the handbook will prove useful for JFL learners as well as Japanese language teachers across the globe.

Keywords

basic verbs; corpus-based; cognitive grammar; audio-visual contents; bilingual dictionary; multilingual dictionary

Acta Linguistica Asiatica, Vol. 2, No. 2, 2012. ISSN: 2232-3317 http://revije.ff.uni-lj.si/ala/

1. Introduction

Verbs as predicators are one of the crucial components determining the skeleton of a sentence, which serves as a basic unit of communication. To improve their communication skills in Japanese, it is imperative for JFL (Japanese as a foreign language) learners to master, in a systematic way, the various usages of basic verbs used frequently in day-to-day communication. At the National Institute for Japanese Language and Linguistics (NINJAL), a collaborative research project entitled "Nihongogakushuushayou kihondoushi youhouhandbook no sakusei (Compilation of Japanese Basic Verb Usage Handbook for Japanese as Foreign Language (JFL) Learners)" is being carried out (project leader: Prashant Pardeshi; timeline: October 2009-September 2012).
The aim of the project is to develop a prototype for the compilation of a handbook of usage of basic verbs in Japanese frequently used in day-to-day conversation by integrating state-of-the-art insights from various related fields such as Cognitive Linguistics, Corpus Linguistics, Japanese Linguistics, Japanese Language Pedagogy, Contrastive Linguistics, and Linguistic Typology. The envisaged end product is a set of small-scale bilingual handbooks (Japanese-Chinese, Japanese-Korean and Japanese-Marathi), compiled according to the prototype developed in the project. We believe that such a bilingual handbook of usage of Japanese basic verbs would be of great help to JFL learners in their effort to acquire the Japanese language systematically and efficiently. The handbooks under compilation differ from existing dictionaries in various respects, such as compilation policy, scope and contents of description, and the writing and editing process. In this article we report on the progress of the project and the salient features of its envisaged research output, namely, a prototype of a bilingual Japanese basic verb usage handbook (referred to as "handbook" below).

The structure of this article is as follows. In section 2 we provide an outline of the handbook project and an overview of the salient features of the handbook under preparation. Against this backdrop, in section 3 we exemplify the organization of each entry with the help of a concrete example, the verb hashiru "to run", and describe the (tentative) methodology of description. One of the salient features of the handbook is that it is corpus-based. In section 4, we describe the tools/interfaces developed for retrieving the information necessary for writing an entry from corpora of correct Japanese usage and of errors made by JFL learners. Further, the compilation and editing work of the handbook is carried out online using a web-based editing tool.
In section 5, we describe the multilingual editing tool developed in this project. This tool allows us to transcend the barriers of space and time. Furthermore, we are developing audio-visual contents in order to foster understanding of the various meanings of polysemous verbs. In section 6 we introduce those contents. Finally, in section 7 we discuss future prospects.

2. Overview of the handbook project and salient features of the handbook

2.1 Overview of the handbook project

We believe that, in order to master communication skills in Japanese, it is necessary to learn polysemous basic verbs systematically, including features such as their semantic behaviour (semantic extensions of a verb and the interrelationship among its various meanings, related words such as synonyms and antonyms, proverbs/idioms involving the verb in question, etc.), grammatical/syntactic behaviour (voice and polarity bias, aspectual and modal characteristics, co-occurrence restrictions, modifiers/adverbial elements, ungrammatical/unnatural usages, etc.), argument structure (case frame), and genre/register bias. Further, it is also necessary to know where and how the Japanese language (target language: L2) is similar to or different from the user's mother tongue (source language: L1). In view of this, the aim of the project is to develop a prototype for the compilation of a handbook of usage of Japanese basic verbs for the JFL learner by integrating state-of-the-art insights from various related fields of theoretical and applied linguistics. At present, 58 scholars from various parts of the globe are participating in this project. Of these 58 scholars, 42 are native speakers of Japanese, while 16 are non-native researchers who have worked on the Japanese language for a long time (see footnote 1). Since the primary goal of the project is qualitative, viz.
developing a prototype of a bilingual basic verb usage handbook, we decided to restrict the quantity (number) of verbs and focus on highly polysemous basic verbs which pose a great challenge for JFL learners. Concretely speaking, we focus on the following 11 verbs: verbs of spatial motion (vertical motion: agaru "go/move up", ageru "cause to go/move up", sagaru "go/move down", sageru "cause to go/move down", and horizontal motion: hashiru "run"), and verbs of temporary or permanent transfer of possession (ageru "to give something to someone as a present/gift", morau "to receive something from someone as a present/gift", uru "to sell", kau "to buy", kasu "to lend" and kariru "to borrow"). All of these verbs are highly polysemous: for example, in our handbook there are 19 meanings/senses for agaru "go/move up", 22 for ageru "cause to go/move up", and 11 for hashiru "run". In section 3, we describe the policy and method of description of an entry through the example of the entry for hashiru "run" in our handbook.

2.2 Salient features of the handbook

The handbook under preparation is in electronic online form, and its target users are envisioned to be advanced JFL learners and native as well as non-native teachers of Japanese. In addition to dictionary-like usage for looking up the meanings of a verb and examples illustrating them, the handbook also serves as a reference grammar containing many salient features, such as explanations of the cognitive mechanisms underlying semantic extensions, notes on grammatical and non-grammatical usages, pragmatic or context-related explanations, tips from the contrastive perspective (comparison with the L1 of the JFL learner), "real" examples from the corpus, visual contents such as image schemas (static, abstract line drawings as well as concrete animations and video clips), and audio contents such as accent patterns and sound files for all illustrative examples.
Further, the descriptions and "coined" examples are all based on the actual use of the verb as "objectively" gleaned from the corpus data. Of all these salient features, two can be considered "discriminatory" ones that set the present handbook apart from the bilingual dictionaries available at present: (i) a corpus-based approach, drawing on a corpus of "correct use" by Japanese native speakers and one of "erroneous use" by JFL learners, in addition to the intuitions of scholars, for the composition of entries; and (ii) the incorporation of insights from cognitive linguistics and contrastive linguistics. For the corpus of "correct use" by Japanese native speakers we used the BCCWJ corpus (Maekawa, 2012) developed by the National Institute for Japanese Language and Linguistics (NINJAL) and developed an interface called NINJAL-LWP for BCCWJ (NLB) to cull the information necessary for writing an entry. For "erroneous uses" by JFL learners we used the data from Teramura (1990) and developed an interface to retrieve relevant information from it (see section 4 for details). The prototype of the handbook under preparation incorporates examples from the BCCWJ corpus culled with the help of NLB and thus offers coined and real examples side by side (see the tentative design in Figure 1). To incorporate the insights of cognitive linguistics, we have included visual contents such as image schemas (static, abstract line drawings as well as concrete animations and video clips) and audio contents such as accent patterns and sound files for all illustrative examples, taking full advantage of the web-based nature of the handbook.

Footnote 1: For further details, visit the project homepage: http://www.ninjal.ac.jp/research/project/b/youhoujiten/.
As for incorporating insights of contrastive linguistics, in addition to grammatical similarities and differences between Japanese and the JFL learner's native language, we have provided extra-grammatical information such as notes on pragmatics and cultural factors. The handbook is compiled and edited using a web-based editing tool connecting scholars in Japan, China and India. Such a handbook differs in many respects from contemporary bilingual dictionaries, and therefore we purposely call it a bilingual handbook. In the following sections the most prominent features of the handbook are discussed.

3. The organization of an entry and the (tentative) methodology of description

3.1 Organization of an entry

The organization of an entry/headword is explained below with the help of the concrete example of the verb hashiru (to run). Following this, the methodology of description is discussed. However, it should be borne in mind that the statements pertaining to the methodology of description are tentative and subject to change. (Only the English glosses of the entry are reproduced here.)

[Accent] LHL
[Conjugation] hasir-, Group I
[List of senses/meanings]
1. (a person or an animal moves quickly ahead (by quickly moving its legs alternately))
2. (a vehicle moves fast)
3. (transportation operates)
4. (to move to the destination hurriedly)
5. (to run around for some purpose)
6. (to run away, to flee from one's own side and join another side)
7. (incline towards an undesired trend)
8. (take a quick look)
9. (sudden appearance [and disappearance] of a feeling or phenomenon)
10. (extension or continuation of a road or a river or a crack etc. in a particular direction)
11. (to work, to achieve results)

The details of sense 1 are described below. Owing to space restrictions, other senses are not discussed here.

[Sense/meaning] a person or an animal moves quickly ahead (by quickly moving its legs alternately).
[Orthographical form] 走(はし)る
[Transitivity] Intransitive
[Image]
[Construction frame]
• Core: NOM runs
• Optional elements/adjuncts: (source) kara, (goal) made, (location 1/position) wo, (distance 1) wo, (location 2) de, (instrument) de, (speed) de, (distance 2), (purpose) de, (time 1) de, (manner), (time 2)
[Collocations] (English glosses only)
• ga: (1) person: I, Mr./Mrs./Ms. X, he, child, player; (2) animal: horse, cat, mouse
• (source/starting point) kara: (1) building: station, house; (2) place: Tokyo, Hakone, from the location of a person or an object
• (location 1/position) wo: (1) place: park, in-house, school ground, beach, along something, walkway, mountain trail/pass, course, corridor, on or above the water surface, in the dark, in the warm sun; (2) position: in front of one's eyes, ahead, top, way ahead in the forward direction
• (distance 1) wo: marathon, 42.195 km, half marathon, long distance, short distance
• (location 2) de: park, indoor, school ground, beach
• (instrument) de: jogging shoes, bare foot
• (speed) de: with full speed, 50 km/hour
• (distance 2): 100 meters, 50 meters
• (purpose) de: national tournament, Olympic, race
• (time 1) de: one hour, 100 meters in 11 sec
• (manner): slowly, fast, as fast as one's legs can/could carry one, fiercely, breathlessly, feebly, zippingly
• (time 2): one hour, 10 minutes
[Wrong collocations] (manner): (inappropriate/incorrect) slowly
[Examples: coined examples]
• ((I) slowly ran 10 km at the university wearing new shoes.)
• (The dog runs across the park from the other side to here.)
• (To run from Tokyo to Hakone in the ekiden race.)
• (To run to the station along the boulevard street.)
• ((I) ran slowly around my house for 20 minutes.)
[Examples: from corpus, not translated into the target language] (three Japanese examples, with sources dated 2004, 2005 (Yahoo!) and 2000)
[Information on errors pertaining to specific use]
[Idioms/Proverbs]
[Semantic network] (diagram linking senses 1 to 11)
[Related words (word family)]
• Synonyms
• Near-synonyms

3.2 The methodology of description: the content and the intent

(Accent) In the case of accent, H stands for high and L stands for low pitch accent. However, for conveying accent information the audio medium is more effective than the visual one, and we provide audio files to convey accent in addition to the visual representation.

(Conjugation) The stem of the verb and its conjugation pattern are provided. As for the conjugation pattern, the classification widely used in Japanese language education (Group I, II and III) is adopted.

(List of meanings/senses) The basic meaning is presented first and derived meanings follow as distinct senses. The basic meaning is also known as the central meaning, and in a polysemous word it is considered the most basic sense/meaning. This meaning is more concrete and more frequent, and corresponds to what is known as the prototypical sense. The order of senses/meanings in the list is decided taking into consideration the semantic closeness or remoteness of the sense in question to the central meaning. However, since this relationship is basically not linear, there is some inevitable arbitrariness in determining the order of meanings/senses.
A semantic network diagram (described below) is also presented in order to show relationships among meanings graphically.

(Meaning/Sense) The meaning/sense is explained in simple, easily understood terms. Some key words are intentionally used in order to make clear the relationships among the explanations. Such a strategy also helps to foster the understanding of connections in the semantic network. In addition, the explanation is devised in such a way that the semantic congruence between the constructional meaning suggested by the construction frame (discussed later) and the core arguments and adjuncts is easier to comprehend.

(Orthographic representation) The orthographic representation in Kanji (Chinese) characters is provided with kana reading.

(Transitivity) The transitivity of the verb in question is given. Depending on the meaning/sense, the transitivity may vary. However, the transitivity given here is that of the basic/central meaning.

(Image) Providing a pictorial image of the meaning/sense facilitates understanding of the meaning/sense in question. Images play an important role especially in the derived/extended meanings/senses. Images are modeled on the image schemas proposed in the theory of cognitive linguistics. However, we adopted more concrete images than theoretical image schemas: unlike in image schemas, emphasis is given to ease of understanding rather than theoretical precision. For the image, still pictures, animations and video clips are used (see section 6 for details).

(Construction frame) The construction frame is shown in the form of a two-tier structure: obligatory core arguments and optional adjuncts. However, as shown below, in some cases the distinction between the two is difficult to draw. For example, the verb kaku "to write" is a two-place predicate taking two core arguments; however, in a construction meaning "write to (someone)" it behaves like a three-place predicate.
In such cases, in the construction grammar approach (cf. Goldberg, 1995), a construction containing three arguments is assumed. One faces a dilemma on the issue of whether the three-place construction should be incorporated in the description of a dictionary entry for the verb kaku "write". This is because, if one adopts the construction-centered explanation, one needs to include extremely eccentric constructions as well, which dramatically increases the length of the description. Even if one adopts such a description policy, the issue of deciding whether the ni-marked phrase should be treated as an argument or as an adjunct remains unsolved. Viewed from the meaning/sense of the verb it is an adjunct, while viewed from the point of view of the construction it is an argument. At present this issue is left to the decision of the entry writer and editor; however, by referring to frequency counts, it can be resolved to a certain extent.

(Collocations) Collocations are shown for both arguments and adjuncts. This is because collocations differ from one sense to another as well as from one case particle to another. Collocations are ordered by collocation frequency, obtained using the BCCWJ corpus browsing tool NINJAL-LWP for BCCWJ (NLB for short). A statistical index expressing the strength of a collocation, the Mutual Information (MI) score, is also available; however, the MI score tends to favour expressions with a high degree of idiomaticity, so we decided to use raw frequency as the criterion for listing collocations. Arranging collocations by the raw frequency obtained from NLB ensures objectivity and authenticity. On the other hand, owing to the limited size of the corpus (65 million words in NLB, 100 million words in BCCWJ), there is no guarantee that all the collocations that need to be listed in the dictionary are retrieved without omission.
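The difference between the two ranking criteria can be illustrated with a small sketch. It assumes the common pointwise definition of the MI score used in corpus linguistics, log2 of (pair frequency × corpus size) / (node frequency × collocate frequency); the exact formula implemented in NLB is not given in this section. All counts and collocate names below are hypothetical and serve only to show why a rare but strongly associated collocate rises to the top under MI while a frequent, ordinary collocate leads under raw frequency.

```python
import math


def mi_score(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (base 2) of a node word and a collocate."""
    return math.log2(pair_freq * corpus_size / (node_freq * collocate_freq))


# Hypothetical counts for illustration only (not taken from NLB or BCCWJ).
CORPUS_SIZE = 1_000_000
NODE_FREQ = 1_000  # occurrences of the verb itself
CANDIDATES = {
    # collocate: (co-occurrence frequency with the verb, overall collocate frequency)
    "common_noun": (200, 50_000),  # frequent everywhere, often with the verb
    "idiom_noun": (20, 40),        # rare, but half of its uses are with the verb
}

# Ranking by raw co-occurrence frequency (the criterion adopted in the handbook).
by_frequency = sorted(CANDIDATES, key=lambda c: CANDIDATES[c][0], reverse=True)

# Ranking by MI score (favours idiomatic, tightly bound combinations).
by_mi = sorted(
    CANDIDATES,
    key=lambda c: mi_score(CANDIDATES[c][0], NODE_FREQ, CANDIDATES[c][1], CORPUS_SIZE),
    reverse=True,
)

print(by_frequency)
print(by_mi)
```

With these toy figures the two criteria produce opposite orderings, which is the behaviour that motivates using raw frequency for collocation lists and reserving MI for judging idiomaticity.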
Therefore, some collocations which do not appear in NLB but which are thought to be necessary for learners are added. This measure depends to a large extent on the experience of the editor. In future, if the size of the corpus is increased, the selection of collocations on the basis of the frequency criterion is expected to become easier. For this purpose, the Tsukuba WEB Corpus (TWC), projected to contain ten times as many entries as BCCWJ, is under preparation.

(Wrong collocations) Here collocations which are prone to lead to wrong usage are described.

(Examples: coined examples) For each meaning/sense we provided more than 3 coined examples. In order to avoid examples ending only in the dictionary form (plain style, non-past), we have made a deliberate attempt to coin examples involving variation of tense, aspect, voice, modality, etc. Such a move also helps to enhance the naturalness of the examples. Quite often we have used complex sentences as well.

(Examples: from corpus) We have provided examples culled from the BCCWJ corpus as well. The purpose of providing examples from the corpus is to offer examples that are natural in the context of the situation in question. On the other hand, there is the criticism that such examples are difficult for non-natives to comprehend. The same observation was made during the compilation of this handbook: it was pointed out that real examples from a corpus are hard to comprehend unless one has sufficient knowledge of the socio-cultural background. It also became clear that translating such examples into another language is a major obstacle. Especially considering the typically high-context communication (Hall, 1976) nature of Japanese, it is easy to imagine that the problems of real examples are much graver than in English.
Whether to stick to real examples only or to allow coined examples is, from the point of view of second language education, a complex issue with no satisfactory solution. At present, to take advantage of the merits of both, we have decided to include natural examples as well as tailored examples. However, since translating natural examples is an extremely difficult task, we have decided not to translate the corpus examples.

(Information on wrong usage: in the case of specific meanings) Mistakes that learners often tend to make are described under this heading. For information on wrong usage by JFL learners, various databases, including the Teramura database (http://teramuradb.ninjal.ac.jp/), are used. However, since these corpora were developed individually, each of them is rather small and it is difficult to deduce general patterns of mistakes from them. Under such circumstances we have to rely heavily on the teaching experience of the editor. The following are examples of learners' corpora:

Spoken language corpora:
• taiwa taishou deetabeesu, seikatsu taishou deetabeesu
• nihongo gakushuusha kaiwa deetabeesu
• nihongo gakushuusha kaiwa sutoratejiideeta
• KY koopasu
• tagutsuki KY koopasu to kensaku tsuuru
• BTS ni yoru tagengo hanashikotoba koopasu
• intabyuu keishiki ni yoru nihongo kaiwa deetabeesu (Uemura koopasu)

Written language corpora:
• Teramura goyou reishuu deetabeesu
• nihongo gakushuusha gengo koopasu
• sakubun taiyaku DB
• shizengengoshori no gijutsu wo riyou shita tagutsuki gakushuusha sakubun koopasu
• nihon/kannkoku.taiwan no daigakusei ni yoru nihongoikenbun deetabeesu
• JLPTUFS sakubun koopasu

In addition to the above list, there are many corpora which are either not made public or are accessible to only a few individuals.
For the effective use of intellectual resources, it is desirable that an organization like NINJAL take the lead in developing a platform like CHILDES (Child Language Data Exchange System), which allows data to be accumulated on a common platform.

(Grammar) Here we show the behavior of the verb with respect to grammatical categories like aspect, voice, tense etc. No conclusion has yet been reached on whether to include categories like the direct passive, the indirect passive, the imperative form, and other sentence-final expressions. Further, whether to base grammaticality judgments for such categories on the intuitions of individuals or on corpus frequency has also not yet been decided. For making grammaticality judgments (especially the subtle ones, shown by a triangle sign) on the basis of corpus frequency, the size of the BCCWJ corpus does not seem to be sufficient.

(Compounds) Compound words are too numerous for it to be practical to include all of them. One therefore has to select the candidates to be listed on the basis of either intuition or corpus frequency. We would like to make use of the corpus for this, and at present we are using frequency as the criterion for listing compound words.

(Idioms and proverbs) Idioms and proverbs consist of elements which are tightly bound together, and the meaning of the whole cannot be guessed from the combination of the meanings of the parts. In other words, semantic transparency is low in idioms and proverbs. However, transparency is a gradient concept, and the decision whether an expression counts as a collocation or as an idiom or proverb is bound to be arbitrary. One yardstick for this decision can be the MI (Mutual Information) score: the higher the degree of idiomaticity, the greater the MI score (see section 4.1.2).

(Semantic network) The relationships among meanings/senses are visually shown with the help of a radial category network diagram.
The basic or central meaning is the one that is known in cognitive linguistics as the prototypical meaning. Derivations from it are arranged so that they can be understood intuitively. These semantic derivations are themselves products of linguistic research, and many cognitive linguists are involved in this project. However, there is no guarantee that the semantic derivations are determined on the basis of a single meaning. Also, the sequence of diachronic change and the synchronic relationships often do not match. In view of these considerations, while insights from cognitive linguistics form the basis of the description, changes have often been made in favour of intuitive understanding. There are places where accuracy of description from the point of view of cognitive linguistics conflicts with intuitive understanding. In such cases we have given priority to educational considerations such as ease of understanding for teachers and learners. As for the network, showing just the connections is not enough; the strength of each connection should also be shown. We are thinking of showing the strength or weakness of the connections visually, in terms of the thickness of the line or the distance between the senses, so as to foster understanding in a visual and intuitive way.

(Related words (word family)) At present, we have listed synonyms and words with almost the same meaning as related words. Listing antonyms is also under consideration. We are thinking of presenting the word family in the form of a radial category network, if possible.

4. Developing tools for corpora of correct usage and wrong usage

One of the important policies we adopted in creating this handbook is to make good use of available corpora.
To compile a corpus-based handbook or dictionary, tools which enable dictionary writers to use corpora adequately and efficiently in the process of dictionary making are indispensable. In this project we chose the Balanced Corpus of Contemporary Written Japanese (BCCWJ) as a corpus of correct use by Japanese natives and the Gaikokujin gakushuusha no nihongo goyoureishuu (Collection of errors of JFL learners, 1990), compiled by Hideo Teramura and his colleagues, as a corpus of wrong usages of JFL learners. We developed search tools for each of these corpora. In the following two subsections, we describe the features and functions of both tools.

4.1 NINJAL-LWP for BCCWJ (NLB)

NINJAL-LWP for BCCWJ (NLB, http://nlb.ninjal.ac.jp) is an online search tool for the BCCWJ, jointly developed by the National Institute of Japanese Language and Linguistics (NINJAL) and Lago Institute of Language (LIL). The basic unit of this system is LagoWordProfiler (LWP), which LIL has developed for dictionary writing and editing. LWP has been successfully utilized in several English-Japanese and Japanese-English dictionary projects.
Figure 1: The headword window of NLB

BCCWJ is the first balanced corpus of the Japanese language, developed by NINJAL, and its final version was made public at the end of 2011. It is a large corpus of more than 100 million words, comparable in size to the British National Corpus. The main component of the corpus consists of random samples from books, newspapers and magazines, drawn using rigorous statistical methods to establish representativeness. Nine additional sub-corpora are provided for special purposes, including web text, which shows different usage patterns from those of print media text (Maekawa, 2012).

4.1.1 Lexical profiling

The most important feature of NLB is its introduction of the lexical profiling methodology. Lexical profiling is now a standard method for making corpus-based dictionaries because it satisfies the requirements for using corpora in dictionary making. In the earliest corpus lexicography, the standard tool was a concordancer. In the COBUILD Project, which made extensive use of corpora for the first time, the writing staff wrote headword entries by analyzing concordance lines from a concordancer (Sinclair, 1987). Concordance lines enable the dictionary writer to analyze individual words in real context. However, the larger the number of lines, the more difficult it is to grasp the whole variety of linguistic phenomena. To solve this difficulty, lexicographers realized the importance of summarizing linguistic phenomena comprehensively by use of abstraction (lemmatization, POS tagging, and chunking) and statistical measures (the MI score, the T score, etc.). In this process, lexical profiling gradually developed as a new approach (Church et al., 1991). At the end of the 1990s, a practical lexical profiling tool called Word Sketch appeared (Kilgarriff & Rundell, 2002).
This software was first used for compiling the Macmillan English Dictionary for Advanced Learners, and then developed into the integrated system Sketch Engine, which is now used in many dictionary projects.

Lexical profiling has two important requirements. The first is comprehensiveness. Linguistic research, in general, focuses on a particular linguistic behavior and examines individual instances carefully and thoroughly. Dictionary making, on the other hand, requires examining each headword's overall behavior: a dictionary writer needs to grasp a headword's behavior as comprehensively as possible. When implementing a search tool, the choice of which patterns to extract and how to classify the extracted patterns is vital to ensuring comprehensiveness. The second requirement is time efficiency, which is essential in dictionary making. The number of headwords in a dictionary ranges from several thousand to one hundred thousand. To make the best use of a corpus when writing a large number of headwords, an environment that enables dictionary writers to use a corpus efficiently is indispensable. Key factors in realizing this environment include search speed and the user interface.

4.1.2 Lexical profiling in NLB

So how does NLB satisfy the requirements of lexical profiling? As to comprehensiveness, NLB deals with the orthographical variety of the Japanese language. Japanese is usually written in three types of characters: hiragana, katakana and kanji. This means that a word could be written in at least three ways. The noun hito, which means a person, can be written as ひと in hiragana, ヒト in katakana, or 人 in kanji, with different connotations. In the case of compound verbs, things are complicated by the fact that some verbs have two or more kanji candidates with slightly different meanings. The compound verb 取り上げる (toriageru), which means pick up or adopt, can also be written as 採り上げる.
Including variation in kana suffixes, more than ten orthographical forms are possible for toriageru. From the point of view of comprehensiveness, it is in many cases more appropriate to group two or more orthographical variants under the most typical orthographical form than to give each form headword status. NLB deals with this issue by incorporating the idea of a representative orthographical form. In the previous example of toriageru, the more than ten orthographical forms are all grouped under the representative form, which constitutes a headword entry. Figure 2 shows the frequency distribution of the orthographical forms of toriageru in BCCWJ.

Figure 2: Orthographical forms for toriageru

In order to maximize time efficiency, NLB has a user interface that allows the user to examine grammatical patterns, collocations, and examples from the corpus in the same window (see Figure 1). In Sketch Engine, which we mentioned earlier, a screen transition occurs every time the user looks for examples of a collocation. A user interface with frequent screen transitions is problematic from the point of view of time efficiency. With the recent spread of large screen displays, it is not as difficult as before to introduce a user interface with a minimum of screen transitions. Although user interfaces for corpus search tools have not been given much attention until recently, their importance is expected to increase as the size of corpora increases and more sophisticated search functions are implemented. Search speed is another important factor closely related to time efficiency. By optimizing the structure of the database, NLB shows the results for collocations and examples almost instantly.

Another important feature of NLB is its function to sort collocations by raw frequency and by other statistical measures such as the MI score and the logDice score. Figure 3 shows collocations of N を買う (N wo kau, to buy N).
In the upper part of the figure, collocations are ordered by raw frequency, and in the lower part, by MI score. The MI score has a tendency to be unreliably high among low-frequency collocations. To avoid this reliability issue, NLB provides a filter function to remove low-frequency collocations. In the lower part of Figure 3, collocations with fewer than five instances are excluded from the list. You can see that idiomatic expressions like 顰蹙を買う (upset someone), 歓心を買う (seek someone's favor) and 失笑を買う (make someone laugh at you) are at the top of the list. Sorting collocations by multiple statistical measures is an extremely useful function.

Figure 3: Collocations of N wo kau

NLB also facilitates creating examples with dictionary-making-oriented functionality. On the example panel (the right-most panel of Figure 1), examples for a collocation are shown in ascending order of their character counts. This helps the dictionary writer to use corpus examples for reference easily and effectively. Each corpus example is color-coded according to the sub-corpus it belongs to, which enables the writer to see quickly where each example comes from. In addition, the writer can examine the context of a corpus example just by clicking its source information label.
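The two orderings of collocations described above, and the low-frequency filter, can be illustrated with a minimal sketch. It assumes the standard pointwise definition of the MI score, log2(f(x,y) · N / (f(x) · f(y))); the text does not state the exact formula NLB uses, and all names and counts below are invented for illustration only.

```python
import math
from typing import List, NamedTuple


class Collocation(NamedTuple):
    noun: str        # the noun N in the pattern "N wo V"
    pair_freq: int   # frequency of the noun together with the verb
    noun_freq: int   # overall frequency of the noun in the corpus


def mi_score(pair_freq: int, noun_freq: int, verb_freq: int, corpus_size: int) -> float:
    """Pointwise mutual information (assumed formula, not taken from NLB)."""
    return math.log2(pair_freq * corpus_size / (noun_freq * verb_freq))


def rank(collocations: List[Collocation], verb_freq: int, corpus_size: int,
         by: str = "freq", min_freq: int = 1) -> List[Collocation]:
    """Sort collocations by raw frequency or MI, dropping those below min_freq."""
    kept = [c for c in collocations if c.pair_freq >= min_freq]
    if by == "freq":
        return sorted(kept, key=lambda c: c.pair_freq, reverse=True)
    if by == "mi":
        return sorted(
            kept,
            key=lambda c: mi_score(c.pair_freq, c.noun_freq, verb_freq, corpus_size),
            reverse=True,
        )
    raise ValueError(f"unknown sort key: {by}")


# Invented toy counts: a frequent literal object, an idiom, and a one-off pairing.
TOY = [
    Collocation("hon", 400, 50_000),
    Collocation("hinshuku", 40, 50),
    Collocation("rare-noun", 1, 1),
]
VERB_FREQ, CORPUS_SIZE = 10_000, 1_000_000

by_frequency = rank(TOY, VERB_FREQ, CORPUS_SIZE, by="freq")
by_mi_unfiltered = rank(TOY, VERB_FREQ, CORPUS_SIZE, by="mi")
by_mi_filtered = rank(TOY, VERB_FREQ, CORPUS_SIZE, by="mi", min_freq=5)
```

With these toy counts, the unfiltered MI ranking is headed by the single-occurrence pairing, which is the reliability problem the filter addresses; once pairs with fewer than five instances are removed, the idiomatic collocation moves to the top, while the raw-frequency ranking favours the frequent literal object.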
As we have seen, NLB provides an ideal environment for Japanese dictionary making by dealing with the wide variety of orthographical forms in Japanese and by offering a user-friendly interface.

4.2 The Teramura Wrong Usage Database

Gaikokujin gakushuusha no nihongo goyoureishuu (Collection of errors of JFL learners) is a report compiled by Teramura Hideo and his team and issued in 1990, after they collected and classified misuse samples from compositions written by overseas students from 24 countries. The misuse samples total 6,300, with misuse labels attached to the misuse positions. Other information includes the learner's nationality and the composition type. The online version of this report, the Teramura Wrong Usage Database, provides a search function. The user can search misuse examples by combining conditions (type of misuse, learner's nationality, composition type, etc.). Figure 4 shows the "search from misuse type" function. Misuse types are shown in a tree structure, which effectively informs the user of how many misuse instances there are for each type under any combination of nationalities and composition types.

Figure 4: Teramura Wrong Usage Database

Most conventional Japanese dictionaries for native speakers and foreign learners, including ones with a learning or teaching purpose, show only correct usages; very few show wrong usages. This tool enables us to include wrong usage information that is useful for learners, such as wrong collocations, in a definition entry.

5. Crossing the barriers of space and time: An online multi-lingual editing tool

Compiling a dictionary requires a lot of time and human resources. Usually there is an editor-in-chief who directs the lexicographers in charge of writing up entries. The editor-in-chief proofreads the entries that the lexicographers have written, and corresponds with them as often as necessary. Proofreading may be done by different proofreaders, with the editor-in-chief managing the editorial activity.
This process usually takes a long time, and is not ideal if time for the compilation is limited. Another drawback of this traditional system is that lexicographers usually have no chance to examine the entries that other lexicographers write. To overcome these problems, we have developed a web-based editing system so that the editors, lexicographers and proofreaders can all access the entry data for editing, reviewing and proofreading. In developing the current online editor system, we have fully exploited our experience in compiling A Dictionary of Basic Verbs in Japanese for Marathi, the outcome of the project by Pardeshi et al. (2007). Under a limited budget, we made use of free applications to achieve our goal: a wiki system to store the entry data in XML format. A wiki is a system for collaborative online editing and has a repository system, under which all older versions of wiki pages are stored. By comparing the current version with one of the older versions, editors can tell what has been changed, deleted and/or added in the latest version. In this new system, we take advantage of the repository feature of the wiki. In the current system, the lexicographers write entries in Japanese first. The Japanese entries are then translated into four foreign languages (Marathi, Korean, Chinese and English) by translators. At this stage, additional information related to cultural and linguistic differences between Japanese and the target language is added. The following sub-sections give a brief outline of the online editorial system.

5.1 An outline of the online editorial system

5.1.1 Some features of the online editor

The online editor developed for this project has the following features:
• Data are input in a textbox area in the editor and stored in an XML structure.
• The data input in the editor are instantly reflected in a preview function, which shows how they look in HTML format.
• Using the Yahoo API, furigana, the phonetic transcription of kanji, can be assigned in a format that can be converted into other formats like HTML.
• The lexicographers can read online the entries written by the others and post comments, which are shared by all editors.

5.1.2 Online editor as a plug-in of Dokuwiki

The editor is not a standalone application but is developed as a plug-in for Dokuwiki. Dokuwiki is a Unicode-based wiki application and does not require a binary database system like SQL because data pages are saved in text files. Each entry is organized in an XML format and stored as a Dokuwiki page. Since the file is a text file, it can be used directly as an XML file for data processing. The lexicographers first log in to the Dokuwiki homepage, as in Figure 5.

Figure 5: The homepage of the editorial system on Dokuwiki

5.1.3 Starting the online editor

After logging in, lexicographers choose the language, and then select one of the entries in the list to edit it. The wiki page shows the XML data of the entry, but the data are not edited there directly; instead, the lexicographers start the plug-in online editor. On starting up, the editor retrieves the XML data from the wiki page. The entry data are presented in an Explorer-style view, with a tree structure displayed on the left pane and each sub-item displayed on the right pane, as in Figure 6.

Figure 6: A full view of the online editor

Figure 7 shows the view when one of the items is selected and its editing area is displayed on the right pane.

Figure 7: The editing pane for Collocation 01 is open on the right pane

5.1.4 The preview function

The editor has a preview function. There are two types of preview: the entire view of the entry and the partial view of an item of the entry. The preview is generated via XSLT as an HTML page. An image of the full-scaled preview in Marathi is shown in Figure 8, and an image of the partial view is shown in Figure 9.
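The full and partial previews described in this sub-section can be illustrated with a minimal sketch. The project generates its preview with XSLT; since the Python standard library has no XSLT processor, the sketch below substitutes a hand-written transformation using xml.etree.ElementTree. The element and attribute names (entry, sense, gloss, collocation) and the sample data are hypothetical and are not the project's actual schema.

```python
import xml.etree.ElementTree as ET
from html import escape
from typing import Optional

# Hypothetical entry data; the real XML schema is not given in the text.
ENTRY_XML = """\
<entry headword="ageru">
  <sense id="01">
    <gloss lang="en">to raise</gloss>
    <collocation>te wo ageru</collocation>
  </sense>
  <sense id="02">
    <gloss lang="en">to give</gloss>
  </sense>
</entry>
"""


def preview(xml_text: str, sense_id: Optional[str] = None) -> str:
    """Render an entry as HTML: the whole entry, or one sense if sense_id is given."""
    entry = ET.fromstring(xml_text)
    parts = [f"<h1>{escape(entry.get('headword', ''))}</h1>"]
    for sense in entry.findall("sense"):
        if sense_id is not None and sense.get("id") != sense_id:
            continue  # partial preview: skip every other sense
        parts.append(f"<h2>Sense {escape(sense.get('id', ''))}</h2>")
        for gloss in sense.findall("gloss"):
            lang = escape(gloss.get("lang", ""))
            parts.append(f'<p lang="{lang}">{escape(gloss.text or "")}</p>')
        collocations = sense.findall("collocation")
        if collocations:
            items = "".join(f"<li>{escape(c.text or '')}</li>" for c in collocations)
            parts.append(f"<ul>{items}</ul>")
    return "\n".join(parts)


full_preview = preview(ENTRY_XML)
partial_preview = preview(ENTRY_XML, sense_id="01")
```

Because the stored entry is plain XML text, the same data could in principle be fed to a different transformation to produce another layout, which is the separation between data and presentation that the preview function relies on.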
Since it is a bilingual version, both the Japanese data and the respective Marathi data are shown. In the bilingual version, as shown in Figure 8, an additional piece of information from a contrastive point of view is also provided when necessary. This information will not be included in the Japanese version.

Figure 8: The full-scaled preview of the Marathi translation of ageru

The layout design of the preview in Figures 8 and 9 is not final; it is a temporary layout used for convenience. The final layout design will be developed separately and applied to generate the final product from the same XML data.

<1> <2> <3> (<4>); <5> <6> <7> <8> <9> <10> <11> <12>

Each entry includes <1> the entry number, the usual <2> reading and <3> writing of the word, <4> the list of its homographs (see section 4.2), <5> the French translation, <6-11> its frequencies in six sub-corpora/genres (see section 3.2) and <12> the total number of occurrences. All the data used to calculate the frequencies are provided in a freely distributed database (JaLexBD).
Furthermore, the dictionary provides a summary description of the sub-corpora/genres: size, list of the most frequent words, comparison of the frequencies of some morphological structures, etc.

2.2 The distributions

For each word, the dictionary provides the frequencies of the word alone and with affixes, in each of the sub-corpora. Due to lack of space, the frequencies for each distribution of each word are not detailed in the paper version, but all the numbers of occurrences are provided in the JaLexBD database (http://rkappa.fr/lexic/JaLexBD/index.JaLexBD.php). We chose constructions with affixes as the most linguistically interesting ones, especially those which are less morphologically and syntactically ambiguous (see section 4.2). For example, we avoided counting strings of concatenated nouns (without particles), since these can be extremely difficult to handle in the automatic analysis of a text. The constructions were counted as follows:

2.2.1 Word alone

The word appears without any affix.

2.2.2 Words with the adjectivising suffix -teki

Any noun with the -teki suffix is transformed into a -na adjective. N+teki means roughly "which has the property of N". For example, when -teki is attached to the noun gainen (概念, "concept"), it forms the word gainenteki (概念的), which means "conceptual".

2.2.3 Words with the nominalising suffix -sei (性)

The suffix -sei (性) can be placed after a noun or an adjective. N/adj+sei is a (grammatical) noun and means roughly "the property of being N/adj". For example, ensyoo (炎症) means "inflammation", while ensyoo + sei (炎症性) means "of an inflammatory nature, type or origin". Some of the constructions suffixed by sei may be lexicalized, such as 経済性 (keizai + sei, "economic efficiency, economic performance, economy").

2.2.4 Words with the plural suffixes -ra (等/ら) or -tati (達/たち)

The suffixes -ra (等/ら) and -tati (達/たち) express the plural. For example, the common noun gakusei (学生, "student") is indefinite.
Depending on the context, it can be interpreted as being either singular or plural. The construction gakusei+ra (学生等/学生ら), however, can only be interpreted as being plural. In contemporary Japanese, the two suffixes are both used for human nouns, like gakusei (学生). They can also be used with a humanised entity, such as neko+tati (猫たち, "the cats"). However, this construction is generally limited to children's language. The difference between the two suffixes may lie in the register of the language. For example, -ra is known to be more formal than -tati. Thus, comparing the frequency of plural suffixes may provide an interesting indication of text genres. The process of counting plural suffixes is based on the hiragana transcription, i.e. ら (ra) and たち (tati). This restriction is justified by the fact that the Chinese character 等, which can be used to write the suffix -ra, is ambiguous, since 等 can also be used to write nado ("etc."). In order to prevent possible errors, we did not count the occurrences of 等 but only the transcriptions in hiragana. In order to unify the counting procedure, we also restricted the counting of -tati to the hiragana transcription. This restriction certainly comes at the expense of -tati, since -tati is written more frequently in Chinese characters. Thus, the results in the DFJC certainly underestimate the number of occurrences of -tati, although we cannot say to what extent.

3. Corpora and genres

The corpus is divided into seven sub-corpora. Each sub-corpus has specific characteristic(s) that we will refer to as "genre". These characteristics mainly stem from their source. For example, "journalistic genre" (i.e. "journalistic corpus") will refer to the collection of texts retrieved from newspaper websites. The details of the other sub-corpora are outlined below. We consider that such a "genre" definition is explicit enough and does not require the laborious evaluation process described in the introduction.
The description may be supplemented with other characteristics, but they will not have been used to build the corpora and define the genres. We can distinguish between two types of text: monologues and dialogues. A "dialogue" refers to a text which provides an answer to a question, or which is constructed to be answered by someone other than the author. A "monologue" refers to a text which is not constructed to be followed by an answer, and which does not provide an answer itself. We also distinguish between reviewed and non-reviewed texts. The assumption is that the variety of morpho-syntactical structures and vocabulary is wider in non-reviewed texts. When reviewing is part of the production process, it is expected that the author and reviewer will agree to some (at least implicit) conventions about acceptable (or authorized) language. The author must produce a text corresponding to this agreement. If not, the reviewer will correct the text according to these constraints. In principle, there are no such limitations in non-reviewed texts. If the set of authors is limited, the variety of structures and vocabulary is limited to the skills of those authors. Indeed, the variety is expected to be poorer in comparison with corpora produced by an unlimited set of authors. This is why we distinguished between textual corpora produced by a limited and an unlimited set of authors. We should also take the writing time into account. We assume that the variety of vocabulary and morpho-syntactic structure is greater when writers have unlimited time to write. Table 2 below presents a summary of those characteristics for all the corpora. In addition, it shows the size of the sub-corpora and the frequency of their updates.

Table 2: Characteristics of the corpora used (Conventions: "+" yes for all the texts of the corpus; "-" no for all the texts of the corpus; "+/-" depending on the text of the corpus)

Type       Corpus               No. of sentences   Frequency of updates   Reviewing   Only one theme for the corpus   Limited set of authors   Limited set of readers
monologue  White papers         105 520            one year               +           -                               +                        +
monologue  Daijirin Dictionary  176 809            partial, long-term     +           -                               +                        -
monologue  Newspapers           167 819            1 day                  +           -                               +                        -
monologue  Legal texts          34 688             1 year (partial)       +           +                               +                        +
dialogue   Q&A govn.            54 901             1 year (full)          +           -                               +                        +
dialogue   Q&A misc.            136 946            1 day (partial)        +/-         -                               -                        -
dialogue   Chats                23 547             1 day (full)           +/-         -                               -                        -

3.1 Selection criteria for the corpora

We applied the criteria below to select the sub-corpora/genres.

3.1.1 Representativeness of a subcorpus with respect to its genre

In order to obtain better representativeness, we built the sub-corpora as follows. Our strategy differs in many respects from that of the well-known BCCWJ corpus (Maruyama, 2009). Whenever possible, we used the complete collection of texts from a source rather than a sample. For example, we collected all the White Papers of 2009, 2010 and 2011, whereas the BCCWJ contains only samples of collections. For the same reason, and also unlike the BCCWJ, all the texts of the sub-corpora are complete. We did not use any samples. The corpus is strictly limited to written language. Transcriptions of spoken language, such as the Minutes of the Diet (included in the BCCWJ), are excluded.

3.1.2 Development of the corpus

The DFJC, published in 2012, is the first step of a project to observe the development of genres over time. As such, it was necessary to choose genres/corpora that would change over time. To date, statistical studies of the Japanese language in Japan have been designed for static corpora. Those corpora were created once and for all and no updates were planned. Ever since the first statistical study on a large corpus of written Japanese was conducted by the National Institute for Japanese Language and Literature in the 1950s (Yamazaki, 2006), no corpus has been updated or compared to previous versions to observe its development over time.
To break away from this method, all the corpora used for the DFJC are textual collections that will be updated within a few years (or within a few months in some cases). The corpora of newspapers, chats, questions to the government, miscellaneous Q&A and White Papers will be entirely renewed in one year. Some of the sub-corpora, such as commercial dictionaries and fundamental legal texts (Constitution, etc.), should not change for a long time, but as long as they are distributed unchanged (and excepting cases where they are distributed explicitly as historical texts), they should be understandable. Thus, even if such texts have not been renewed for a while, the language used represents the current language during the period of their distribution. The sub-corpora for the DFJC consist of texts produced mostly between 2008 and 2011, which represents a span of 4 years. We had first planned to re-compile the texts annually, but the size of the corpora was not sufficient to obtain good representativeness. Consequently, we estimate that a span of 3 to 5 years would be a good compromise.

3.1.3 Accessibility

In order to build the corpora, we also had to make do with limited financial and human resources. Scanning or manually retyping texts, as was done for the BCCWJ, was out of the question. The solution was therefore to use the Internet. To this end, the corpora/genres selected were taken from collections of texts accessible on the Internet. However, we did not have enough technical (and financial) resources to build a corpus as large as the one described by Kawahara and Kurohashi (2006), which contains 470 million sentences. Even though the texts are easy to access on the Web, we had to limit the selection to a relatively small set of genres/corpora (710,000 sentences).

3.2 Detailed presentation of the sub-corpora

3.2.1 White papers

This is a collection of White Papers published in 2009, 2010 and 2011.
Due to their thematic variety, this corpus cannot be considered representative of one discipline. However, we assume that the production conditions are homogeneous: the texts are written by a restricted set of authors (specialists). We also assume that these texts are reviewed.

3.2.2 Daijirin dictionary
To our knowledge, the language of dictionary texts has not yet been studied as a genre, even though it is certainly distinctive and subject to many editorial conventions. This corpus has not yet been completed. From the online Daijirin dictionary (Matsumura, 2006; http://dic.yahoo.co.jp), we used only the pages corresponding to the lemmas of the DFJC. As such, this corpus contains only 16,000 entries, even though Daijirin contains 230,000 entries in total. Furthermore, Daijirin cannot be considered representative of all Japanese dictionaries. This corpus is therefore poorly representative and has been used as a test corpus only. In the future, the whole of the Daijirin and other online dictionaries will be used.

3.2.3 Newspapers
This corpus contains the online versions of three newspapers, covering the editions from April through December 2011. The newspapers are Asahi (www.asahi.com), Nippon Keizai (www.nikkei.com) and Nikkan Kougyou (www.nikkan.co.jp). The first two are widely distributed and hold a significant place among newspapers. Furthermore, newspapers are very important in the everyday lives of Japanese people (almost all Japanese households subscribe to a newspaper). We plan to incorporate more newspapers in the future.

3.2.4 Legal texts
The legal corpus is divided into two parts. The first part is the compilation of all official legal texts produced in 2008, 2009 and 2010 (law, 法律, hooritsu; Cabinet Office Ordinance, 内閣府令, naikakuhurei; decree, 命令, meirei).
The second part is the compilation of six legal codes (Constitution, 憲法, kenpoo; civil law, 民法, minpoo; commercial law, 商法, shoohoo; criminal law, 刑法, keihoo; civil procedure, 民事訴訟法, minzi soshoohoo; and criminal procedure, 刑事訴訟法, keizi soshoohoo). The second part will not be updated, whereas the first one is renewed every year.

3.2.5 Written questions submitted to the government
A compilation of all the written questions (from the Diet) submitted to the government between 2008 and 2010.

3.2.6 Miscellaneous questions and answers
A compilation of web pages taken from oshiete.goo.ne.jp. This site is equivalent to the website Chiebukuro used for the BCCWJ. Each page contains an open question and possibly one or more answers. In some cases, there is no answer. Questions can address any subject matter.

3.2.7 Chats
This corpus is made up of pages from different chat websites. Due to financial constraints, this corpus is small. The procedure for collecting the pages included in this corpus will be changed in order to obtain more data. The fundamental difference between this corpus and the corpus of miscellaneous questions and answers is the temporal constraint on production. In miscellaneous question-answer dialogues, there is no (implicit) constraint on the time interval between the question and the answer. In chats, on the other hand, the response is usually immediate.

4. Tools and analytical method
This section presents the software used, the problems encountered, their solutions and their impact on results.

4.1 Software
The number of occurrences of the words in the corpora was counted using the free software SAGACE version 4.2.0 2 (Blin, 2012b). This tool is designed for searching patterns; it does not analyse entire sentences. Patterns are defined as strings of words (or characters). Each element of a pattern can be a specific word or any word of a pre-defined category.
In the latter case, the category must be listed in the lexicon associated with SAGACE. SAGACE can only be run from the command line. To launch a query, the required pattern and the search parameters are described in a request form (a simple text file) interpreted by SAGACE.

We chose this software mainly for its ease of use. Unlike symbolic parsers, it does not require the development of a grammar, which takes time to maintain; it is sufficient to create a lexicon listing the words of the various categories. This is important, since maintaining and modifying a rule-based grammar can be a complex operation. Furthermore, there is currently no free and open grammar for Japanese. SAGACE also differs from statistical parsers (such as MeCab 3) in that it does not require training and manual evaluation. To obtain good results with such tools, training and manual evaluation must be performed for each genre of text, which is very costly.

2 http://crlao.ehess.fr/japonais-coreen/corpus/sagace/sagace.html

Furthermore, SAGACE is autonomous and does not require any other software. It provides all the functions needed for the analysis: a request interface, a search engine and a results interface. In addition, untagged plain text is sufficient as input, so no pre-analysis is required.

As mentioned above, the advantage of SAGACE is that it is easy to use. The drawback is that errors of analysis are (perhaps) more frequent than with other parsers. To limit the risk of errors, we searched only for less ambiguous patterns. As a result, not all occurrences were counted, and the frequencies indicated in the dictionary are slightly lower than the real frequencies. We are unable to assess the size of the difference.
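The idea of a pattern as a string of elements, each being either a specific word or any word of a lexicon category, and searched directly in untagged text, can be illustrated with a short Python sketch. This is not SAGACE's implementation or its request syntax: the toy lexicon, the "cat:" notation as parsed here, and all function names are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon: category name -> set of surface forms.
LEXICON = {
    "noun": {"安全", "可能", "必要"},
    "particle": {"の", "は", "が", "を", "に"},
    "punctuation": {"、", "。"},
}


def candidates(element):
    """Surface forms allowed by one pattern element.

    'cat:a|b' means any word listed under category a or b;
    anything else is treated as a literal string.
    """
    if element.startswith("cat:"):
        forms = set()
        for name in element[4:].split("|"):
            forms |= LEXICON[name.strip()]
        return forms
    return {element}


def match_at(text, pos, pattern):
    """Match all pattern elements contiguously, starting at pos.

    Returns the list of matched strings, or None if there is no match.
    Longer candidate forms are tried first.
    """
    if not pattern:
        return []
    for form in sorted(candidates(pattern[0]), key=len, reverse=True):
        if text.startswith(form, pos):
            rest = match_at(text, pos + len(form), pattern[1:])
            if rest is not None:
                return [form] + rest
    return None


def count_pattern(sentences, pattern, counted_index):
    """Count the word matched by pattern[counted_index] over all sentences."""
    counts = Counter()
    for sentence in sentences:
        for pos in range(len(sentence)):
            matched = match_at(sentence, pos, pattern)
            if matched is not None:
                counts[matched[counted_index]] += 1
    return counts


if __name__ == "__main__":
    pattern = ["cat:particle|punctuation", "cat:noun", "性",
               "cat:particle|punctuation"]
    sentences = ["道路の安全性を確認する。", "その可能性は低い。"]
    print(count_pattern(sentences, pattern, counted_index=1))
```

Because the sketch only counts occurrences that fit the whole pattern, including the context elements on either side, it undercounts in the same way the text describes: occurrences in other contexts are simply not seen.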
4.2 Difficulties of analysis and solutions
The automatic analysis of Japanese faces a few difficulties that are well known in the field of Natural Language Processing. In this section, we discuss how they have been solved (or not) using SAGACE, and what impact each solution had on the results.

4.2.1 Lemmatisation
Words are not separated graphically in written Japanese. Even if the parser analyses the entire sentence, morpho-syntactic errors may occur. Only a semantic and pragmatic parser could prevent such errors, but no such tool currently exists. To limit the risk of errors with SAGACE, we first applied the traditional "longest match method". Secondly, we restricted the searched patterns to those with a low (though not zero) risk of ambiguity. In general, the searched pattern includes the words immediately before and after the target structure. For example, when searching for occurrences of a noun suffixed by 性 (sei), we used a pattern including a particle or punctuation mark on the left, and a particle, punctuation mark or copula on the right:

{particle | punctuation} noun 性 {particle | punctuation | copula}

This is the description of the pattern in the request form:

>0 cat:particle | punctuation | XX // 1
=0 cat:LEXEME /-affich:trait:lemme /-count // 2
=0 性 // 3
=0 cat:particle | punctuation | copula // 4

These lines are interpreted as follows: (1) the first element of the pattern can be anywhere (">0") in the sentence. It is either a particle, a punctuation mark, or the mark indicating the beginning of a sentence. Formally, the descriptions of pattern elements are formulas written in a language close to propositional logic, and they are interpreted in much the same way.
For example, the description of the first element can be formally interpreted as follows: the element is any word belonging to the category ("cat:") defined as the union ("|"; disjunction) of three basic categories (propositional constants): the category of particles ("particle"), the category of punctuation ("punctuation") and the category "XX" (a singleton containing the mark indicating the beginning of a sentence). All the basic categories are listed in the lexicon associated with SAGACE. Using this description language, it is possible to "create" new categories merely by combining basic categories listed in the lexicon, without modifying the lexicon itself. (2) The second element is contiguous ("=0") with the preceding one. It is the word to be counted ("/-count"); it is any word of the category named LEXEME in the lexicon used by SAGACE. (3) The third element is contiguous ("=0") with the previous one. It is "性". (4) The fourth element is contiguous ("=0") with the previous one. It is either a particle, a punctuation mark or the copula. A more detailed description of the pattern syntax is available in the manual and online tutorials of SAGACE. All requests are provided in the DFJC.

4.2.2 Homography
Some words have the same graphic form but a different reading, and perhaps a slightly different meaning. For example, two homographic words written as 魚 are almost synonymous but have different readings, uo and sakana. To know which reading is intended in the corpus, a semantic (including pragmatic) analysis must be performed, but no such analysis tool is currently available. A statistical analyser may solve the problem, but its results are not absolutely certain and the tool requires training. For a great number of regular common nouns, there is a homographic proper noun. For example, mori (森, "forest") and hayasi (林, "wood") are also used as last names. These last names are very common.
Fortunately, in the corpora that have been used, they frequently appear with specific affixes, such as honorific suffixes (さん, san, "miss, mister") for human proper nouns. As such constructions do not match the pattern we use, most of them have been excluded from the counting operation. Despite these precautions, it is possible that some occurrences of proper nouns have been counted as common nouns. Thus, the frequencies of the entries which can also be used as proper nouns may be slightly higher than the real frequencies. We assume that this is a minor problem with no significant impact on the frequencies. In the DFJC, we tagged all the entries which can also occur as a proper noun. To this end, we used a list of 320,000 proper nouns, extracted from the mecab-naist-dic and some other resources. The list contains most Japanese personal names, place names and company names. It does not include Chinese proper nouns. This list is not very long, but it should suffice for the purpose.

4.2.3 Multiple transcriptions
All Japanese words can be transcribed in various ways, by combining three character sets: hiragana, katakana and Chinese characters. Standard dictionaries provide the standard transcription. In fact, there is no "official" or "academic" standard. The so-called standard transcription could be defined as "the one which most closely resembles government prescriptions, among the most used transcriptions". For example, the word mósikomi ("application") is lexicalized as 申し込み or 申(し)込み, depending on the dictionary. The parentheses indicate that the character can be omitted in writing (but is still pronounced). Some dictionaries used for Natural Language Processing provide all the most common transcriptions and treat each of them as an entry. For example, mecab-naist-jdic 4 provides five transcriptions/entries for the word mósikomi ("application"), including 申し込み. Such lexicalisation has many flaws: it is very redundant and not exhaustive.
The multiplicity of transcriptions is not a problem per se, since the author's choice of one transcription among many can help to characterise a written style. Rather, it is an editorial problem when publishing a paper dictionary: listing all the transcriptions takes up a lot of space, even though many of them have such a low frequency that they are insignificant. The DFJC provides only the most common transcriptions. For some words, the frequency is the sum of the frequencies of two transcriptions. For example, when a noun contains the so-called honorific prefix o, we do not separate the transcription of the prefix in kana from its transcription in Chinese characters. Thus, the number of occurrences of the entry otearai ("toilet") is the sum of the occurrences of the two transcriptions お手洗い and 御手洗い. Variations of transcription do not only arise from combining different writing systems. Some words have two or more transcriptions in Chinese characters. In most of these cases, two transcriptions exist: an "academic" transcription and a "popular" transcription. For example, the "academic" transcription of tamago ("egg") is 卵, and the popular transcription is 玉子. In the DFJC, the two (or more) transcriptions are clearly separated and constitute independent entries.

4.2.4 Homography and homophony
For some words, there are other words that are both homophonic and homographic. This is more common with monosyllabic (one-kana) words. It can also occur with the kana transcription of a word. For example, "tooth" and "blade" are homophonic: both are read ha (は). They are usually written in Chinese characters (歯 and 刃, respectively). However, when they are written in kana in a corpus, a semantic analysis is needed to determine which word is meant. For the DFJC, we chose to count only dictionary lemmas, which are mainly in Chinese characters. We assume that existing entries in hiragana do not have such homophonic and homographic equivalents.

5.
Conclusion and outlook
In this paper, we explained the process we have implemented to automatically characterise the genre(s) of 16,000 words in a Japanese-French dictionary. We plan to repeat this work regularly, about every three or four years, using the same process. There is room for improvement, and we wish to improve at least two points. Firstly, the number of entries will be increased; in particular, we will add verbal nouns. Secondly, as explained above, some corpora must be changed: the dictionary corpus will be supplemented and we need a more reliable source for chats. We also plan to conduct the same study on inflected words, such as verbs and adjectives. Although SAGACE is not well suited to handling inflected words, a large-scale test (Blin, 2012c) showed that the same method can be applied with the same tool. A more detailed (and manual) assessment of the results is required. If the results are good enough, we will apply the process to locutions, including locutions with inflected words.

References
Blin, R. (2012a). Dictionnaire de fréquence du japonais contemporain - 16 000 noms (Youfeng.). Paris.
Blin, R. (2012b). SAGACE v4.2.0. CNRS. Retrieved from http://crlao.ehess.fr/japonais-coreen/corpus/sagace/manuel/Manuel.pdf
Blin, R. (2012c). Fréquences des verbes japonais dans un corpus de grande taille. Blin. Retrieved from http://rkappa.fr/sagace/tutoriel/sagace4-2/data/ListeDesFrequencesDesVerbesJaponais.pdf
Kawahara, D., & Kurohashi, S. (2006). Case frame compilation from the web using high-performance computing. Presented at the 5th International Conference on Language Resources and Evaluation.
Maruyama, T. (2009). "Gendai nihongo kakikotoba keikin koopasu" monitaa kaihatu deeta (2009nendoban) sanpuringu houhou ni tuite [About the method of sampling in the "Balanced Corpus of Contemporary Written Japanese" (v.2009)]. National Institute for Japanese Language and Linguistics.
Matsumura, A. (2006). Daijirin Second Edition.
Tokyo: Sanseido.
Yamazaki, M. (2006). Kokuritu kokugo kenkyuuzyo no goi tyousa no rekisi to kadai [Thematics and history of the lexical surveys of the National Institute of Japanese Language]. 12th Workshop "Thematics and history of the lexical surveys of the National Institute of Japanese Language" (pp. 168-186). Tokyo University.

The Construction of a Database to Support the Compilation of Japanese Learners' Dictionaries

Yuriko SUNAKAWA
University of Tsukuba
sunakawa@sakura.cc.tsukuba.ac.jp

Jae-ho LEE
University of Tsukuba
jhlee.n@gmail.com

Mari TAKAHARA
University of Tsukuba
takahara.mari.ge@u.tsukuba.ac.jp

Abstract
The number of Japanese language learners outside Japan, especially advanced learners, is increasing yearly. From the intermediate level onwards, learners could profit from bilingual Japanese learners' dictionaries in their native language, but in most linguistic areas of the world only very simple dictionaries for beginners and for tourists are available. Our project therefore aims to support the compilation of Japanese learners' dictionaries for intermediate and advanced learners by building a database of the contents needed when editing such a dictionary, and offering it online. This four-year project runs from 2011 to 2014. To select the words to be included in the database, two surveys were conducted: a survey of the vocabulary used in textbooks of Japanese as a foreign language, and a quantitative survey of the targeted area of the Japanese language in a large-scale corpus. On this basis, a general list of basic vocabulary for Japanese language instruction was created. At present, usage examples are being compiled on the basis of this vocabulary list, and a database system is being developed. A prototype of a database search interface and download system has been completed.
The database is going to include various types of information considered useful for learners, such as grammar, phonetics, synonyms, collocations, stylistics, learners' errors etc. These are presently being studied in detail, with a view to making the database public in 2014.

Keywords
Japanese language learners' dictionary, lexicography, dictionary editing support, bilingual dictionary, database, basic vocabulary for Japanese language instruction

Acta Linguistica Asiatica, Vol. 2, No. 2, 2012. ISSN: 2232-3317 http://revije.ff.uni-lj.si/ala/

Izvleček (Slovenian abstract)
The number of learners and students of Japanese outside Japan, especially at higher levels, is growing year by year. From the intermediate level onwards, bilingual learners' dictionaries that include the user's mother tongue are useful for learning, but for most of the world's languages only very simple dictionaries for beginners or for tourists exist. The aim of this project is therefore to compile a database of the data needed in a learners' dictionary of Japanese and to offer it online, in order to support the editing of Japanese learners' dictionaries for the intermediate and advanced levels. The project will last 4 years, from 2011 to 2014. So far, two surveys have been carried out as a basis for selecting the vocabulary of the database: an analysis of the vocabulary of textbooks of Japanese as a foreign language, and a quantitative survey of the target language area in a large corpus. On this basis, a general-purpose list of basic Japanese vocabulary was drawn up. Usage examples for these words are currently being edited and a system for editing and publishing the data is being developed; a prototype of a web interface for searching the database and downloading data from it has been completed. The database is planned to include information expected to be useful to learners, such as information on grammar, phonetics, synonyms, collocations, style and culture. Work is proceeding with the aim of making the database public in 2014.
Ključne besede (Slovenian keywords)
Japanese language learners' dictionary, lexicography, lexicographic support, bilingual dictionary, database, basic vocabulary for learning Japanese

1. Introduction
In 2009 there were more than 3,650,000 Japanese language learners outside Japan: a 28.7-fold increase in 30 years.1 The number of learners taking the Japanese-Language Proficiency Test is also increasing, especially at the advanced levels. In 2009, the number of test takers at the advanced levels (levels 1 and 2) had increased 6.4-fold since 1999, and their share of the total number of test takers had risen from 55 % to 76 % in 10 years.2

A useful tool for Japanese language learning is a bilingual learners' dictionary that includes the learners' mother tongue and is developed on the basis of the characteristics of that language. Particularly from the intermediate level onwards, students have more opportunities to read and write on their own, and therefore need a learners' dictionary which satisfies the needs of both receptive and productive tasks. However, except in countries like China and Korea, where there are many learners of Japanese, the majority of learners around the world are provided only with simple dictionaries for beginners or for tourists. The development of dictionaries requires enormous financial and human resources.
For example, in the production of a Japanese language dictionary for native speakers in which one of the present authors was involved, a strong team of experienced dictionary writers and editors, together with the editorial board of a publishing company, spent nearly 10 years of trial and error before completion. In the field of Japanese language learning around the world, which is a very small market compared with the market for Japanese dictionaries for native speakers, the financing and manpower needed to compile a dictionary from scratch are simply not available.

1 http://www.jpf.go.jp/j/japanese/survey/result/index.html (July 21st, 2012) Kaigai no nihongo kyoiku no genjo: nihongo kyoiku kikan chosa 2009-nen gaiyo ("The present situation of Japanese language education abroad: Research on institutions with Japanese language education; 2009 summary")
2 Numbers are calculated by the authors based on the statistical data obtained on the site 'Changes in the number of candidates for the Japanese Language Proficiency Test' http://www.jlpt.jp/statistics/index.html (July 21st, 2012)

However, the appearance of a powerful medium, the Internet, has greatly changed the scene. Publishing a dictionary in paper form through a publisher involves a considerable investment of money and time, and its distribution in different countries may face problems due to differing publishing and marketing conditions. If it is published on the web, on the other hand, almost no extra cost is incurred and only two problems need to be solved: the creation of the contents needed for the dictionary, and the development of a system that can be used by learners. The problem of distributing learners' dictionaries has largely been solved by the Internet, and conditions are becoming ripe to offer a dictionary free of charge anywhere in the world.
The problems that remain to be solved are the creation of dictionary contents and the development of a system for making the contents available to users. The present project aims at building an electronic database with the contents necessary for a Japanese learners' dictionary, and offering this database to all areas of the world over the internet. Dictionary editors of individual areas may make use of any information in this database for further processing, or add new information particular to their area and eventually make their own web dictionary to be published free of charge or at a low price. One existing web dictionary should be mentioned here: the multilingual Reading Tutor Web Dictionary (http://chuta.jp/ Kawamura et al., 2012). This dictionary was developed as a dictionary tool for Reading Tutor, a reading support system for Japanese language learners. Presently it includes 20 languages and this number is expected to increase. The Reading Tutor Web Dictionary has been an ambitious try to broaden the possibility of a bilingual dictionary in many different languages. However, since it is based on a preset monolingual Japanese dictionary, it is difficult for editors in different linguistic areas to freely reshape it and edit their own bilingual dictionary. In order to develop a dictionary which is useful for intermediate and advanced learners, the editors should be able to work on a unique dictionary for learners of their own linguistic area, taking into consideration contrastive research on Japanese and the learners' native language. The main novelty of our approach lies in the fact that the "database for Japanese learners' dictionary editing support" is not aimed at producing a dictionary, but rather at offering the general information on word usage, with appropriate usage examples, which is considered to be necessary to foreign learners of any language background. 
In this sense, this project is a wholly new attempt at creating the necessary environment for bilingual dictionary compilation for learners of any mother tongue. Our project team, based on the conditions described above, is set to build a database with all necessary information for editing Japanese learners' dictionaries, and support editors of bilingual Japanese learners' dictionaries around the world. The project is supported by a Japanese government grant-in-aid for scientific research ("Basic research A") and is running from April 2011 for 4 years up to 2014, under the name "Research for the formulation of basic grounds for the construction of a general database for the development of Japanese language learners' dictionaries". The following sections present a general description of the project. 2. Organisation of the project team The present project has two teams, a construction team which builds the database to support the editing of Japanese learners' dictionaries, and a research cooperation team which supports the activities of the first team. The database construction team has 30 members. Besides the leader, Yuriko Sunakawa, there are 11 research members, 18 affiliated researchers, and one part-time researcher. Members are divided into groups, including a Japanese language research group, a corpora research group etc., and investigate methods for including word-usage information into the data base, or for the use of corpus studies in dictionary description, while also being involved in the construction of the database itself. Within the Japanese language research group, there are sections for research on (1) collocation, (2) synonyms, (3) grammar information, (4) cultural information and (5) phonetic information. The corpus research group includes sections for (1) corpus information, (2) basic vocabulary, (3) learner corpora and (4) language processing. Each section is engaged in research in its own area. 
The team of collaborating researchers counts 47 members, including many who reside outside of Japan. Collaborating researchers in Japan are involved in English lexicography, corpus linguistics, Japanese language research, research on foreign languages such as French, English or German, Japanese language teaching research etc. All of them are engaged in research which can contribute to Japanese learners' lexicography from different points of view, and share their research findings with all other members through oral and written presentations. Collaborating researchers outside Japan are involved in research on Japanese lexicography, corpus compilation, Japanese language and language education research, and while sharing the results of their research with other members of the project like domestic cooperating researchers, they also conduct surveys and investigations needed for the construction of the database, such as surveys on the needs of Japanese language learners outside Japan, contribute to the compilation of learners' corpora, investigate learners' errors, etc. 3. Data base to support editorial work of Japanese language dictionary The development of dictionaries requires a detailed description of Japanese language use based on actual research results of contrastive studies and linguistic research. Since the present project aims at supporting lexicographic work aimed at intermediate and advanced learners of Japanese, we are building a database containing the following information: a) headword usage information (information on meaning, grammar, phonetics, synonymy, collocation, style, culture, corpus-based frequency etc.); b) example sentences based on typical usage examples for each subsense, edited at an appropriate level for intermediate and advanced learners; c) information on frequent errors by Japanese language learners. 
This information is going to be published with a Creative Commons license, thus enabling dictionary editors anywhere in the world to freely access our database, be it for a profit or nonprofit undertaking, to process the information according to their own area's needs and eventually develop bilingual learners' dictionaries for speakers of their own native language. In order to build the above-mentioned database, our work plan within the research period is the following: a) selection of basic vocabulary needed by Japanese language learners; b) research aimed at including word usage information on basic Japanese vocabulary into the database, making use of existing Japanese language corpora; c) research in error analysis in order to include error information into the database, making use of existing learners' corpora; d) editing of usage examples which are appropriate for intermediate and advanced learners, on the basis of typical usage examples extracted from existing Japanese corpora, for each subsense of each headword; e) development of a system for organising word usage information, and of a corpus search tool aimed at editing word usage information; f) development of a system to make the database public, and suitable tools for users. 4. Making the vocabulary list As a first step for creating the database, we constructed a list of lexemes to be included and described in the database. In the field of teaching Japanese as a foreign language, the vocabulary list of the old version of the Japanese Language Proficiency Test: Test Content Specifications (hereinafter "old JLPT list") is well known and is still being widely used as a basic source of data for educational yardsticks, teaching material development, vocabulary research etc. However, in the present project we developed our own basic Japanese instructional vocabulary list instead of using the above mentioned JLPT list, due to the following reasons. 1. 
The "old JLPT list" was created more than 30 years ago and does not reflect recent vocabulary changes.
2. Out of concern for learners abroad, it does not include culturally-bound terms.
3. Its scale of difficulty was set up for test compilation and not for language education.

First of all, concerning point 1 above, the "old JLPT list" was compiled manually in the 1980s and, although twice revised, it has not changed much since then and does not reflect recent changes in Japanese vocabulary (cf. Oshio et al., 2007). Specifically, loanwords are poorly represented, and vocabulary which can enliven expression, such as onomatopoeia, is largely missing.

Figure 1: Vocabulary distribution in the "old JLPT list" (legend: native Japanese words, words of mixed origin, words of Chinese origin, other loanwords)

Figure 1 shows the distribution of words in the "old JLPT list". Vocabulary for level 4 includes as much as 50% native Japanese words, but as the level rises, so does the ratio of words of Chinese origin. Most problematic is the ratio of loanwords. As can be seen in Figure 1, loanwords make up less than 10 % of each level. Such a small proportion of loanwords does not correspond to actual language use in contemporary Japanese society and needs to be revised. Onomatopoeic words are also poorly represented in the "old JLPT list", which includes only a few, such as nikoniko, pikapika, furafura and wakuwaku.

Concerning point 2, the intentional exclusion of culturally-bound terms, including names of food, animals and plants, out of concern for test-takers abroad is problematic. This choice is based on the understanding that the Japanese Language Proficiency Test is meant to test language ability and not cultural knowledge.
Considering that the list was compiled for the purposes of this kind of test, the policy of the "old JLPT list" is in itself very reasonable, but it is decidedly removed from the reality of Japanese language education, in which Japanese society and cultural matters are part of the curriculum.

Lastly, with regard to point 3, the "old JLPT list" is aimed at the evaluation of Japanese language ability, nothing more and nothing less. It is not intended for the development of teaching materials and dictionaries, and problems will inevitably occur if it is used for these purposes. The test-making perspective diverges from the perspective of language education in many respects, particularly with regard to the setting of a difficulty scale (levels of vocabulary items). The difficulty scale in the test is set from the perspective of "levels of Japanese which may be assumed to be known to students" and not from the perspective of educational goals, that is, "levels of Japanese one would like students of a certain level to know".

Taking into account the three problems described above, we conducted a survey of the vocabulary of Japanese language textbooks, and quantitative research on the target language area in a large-scale corpus. On the basis of this research, we compiled a general-purpose list of basic vocabulary for Japanese language education (hereinafter "instructional vocabulary list"). The main aims of the "instructional vocabulary list" are:
(1) to make a vocabulary list for Japanese language education including authentic vocabulary items;
(2) to label vocabulary items according to their various characteristics, so that the vocabulary list will be useful for dictionary development as well as for various needs in classroom situations;
(3) to create a vocabulary list which various users in and outside Japan may share through the web.
To accomplish these aims, we have done the following. In order to realise (1), we conducted a vocabulary survey making use of corpus data and natural language processing technology. In order to realise (2), we added information about the degree of difficulty of each vocabulary item, based on the subjective judgement of Japanese language teachers, and decided to add semantic information according to the "categorised vocabulary list" (Bunrui goihyou). In order to realise (3), we decided to distribute the electronic data in CSV format, which can be used with proprietary spreadsheet software (such as Microsoft's Excel®) as well as with plain text editors.

4.1 Compilation procedure

The "Instructional vocabulary list" was compiled in the following four steps.
1. Vocabulary extraction: vocabulary was extracted from morphologically analysed corpus data.
2. Manual editing: noise and boilerplate were manually removed.
3. Subjective assessment: the difficulty level of the extracted vocabulary was subjectively assessed by five teachers of Japanese.
4. Index construction: each vocabulary item was tagged with semantic information and frequency data obtained from the corpus.
The following sections present each step in detail.

4.1.1 Vocabulary extraction

As a first step towards the compilation of the "Instructional vocabulary list", we morphologically analysed all texts and then extracted content words (excluding particles and auxiliary verbs) from the "Japanese textbook corpus" and from the "Yahoo! Chiebukuro" and "Books" parts of the 2009 public data of the "Balanced Corpus of Contemporary Written Japanese" (http://www.tokuteicorpus.jp/). We then calculated the frequency of all content words and compiled a list of all words appearing more than 5 times.

The "Japanese textbook corpus" mentioned above is a corpus of texts extracted from 100 Japanese language textbooks. It was compiled for research purposes by the present authors and is not publicly available.
It includes major Japanese language textbooks used in Japan and abroad, in a balanced proportion of textbooks from beginning to advanced level.

The "Balanced Corpus of Contemporary Written Japanese" is a balanced corpus of the Japanese written language developed by the National Institute for Japanese Language and Linguistics, but since the complete corpus was not yet publicly available when our project started, we used the monitor data version published in 2009. The "Books" section of this data amounted to 40,000,000 words, and was considered sufficient for the compilation of our vocabulary list.

Morphological analysis was conducted using MeCab (Kudo, 2011) and UniDic (Den et al., 2007). When extracting vocabulary, we used not only the short morphological unit tan-tan'i (短単位), but also morpheme N-grams³, combining multiple morphemes into longer units, as exemplified below.

1. Examples of 2-grams: aien-ka (愛煙家 "habitual smoker"), aisu-koohii (アイスコーヒー "iced coffee"), ai-tsugu (相次ぐ "come in succession"), aite-kata (相手方 "other party"), ao-shingou (青信号 "green traffic light")
2. Examples of 3-grams: ami-no-me (網の目 "net mesh"), iku-tsu-ka (幾つか "a few"), i-kko-date (一戸建て "detached house"), ichi-do-ni (一度に "all at once"), ichi-nin-mae (一人前 "a portion for one person; a grownup"), itsu-de-mo (いつでも "anytime"), itsu-made-mo (いつまでも "forever"), ima-ni-mo (今にも "at any moment"), ima-hito-tsu (今一つ "not quite"), unten-menkyo-shou (運転免許証 "driver's license"), o-kyaku-san (お客さん "guest"), o-jii-san (おじいさん "grandfather"), o-jii-chan (おじいちゃん "grandpa")

Examples in (1) are 2-grams, i.e. sequences of two morphemes. For example, aien-ka (愛煙家 "habitual smoker") is a word composed of aien (愛煙 "love of smoking") and -ka (家 "person"), aite-kata (相手方 "other party") is a word composed of aite (相手 "partner") and kata (方 "person"), etc. Examples in (2) are 3-grams, i.e. sequences of three morphemes, such as ami-no-me (網の目 "net mesh"), which is composed of ami (網 "net"), no (の "of") and me (目 "mesh, grain").
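The N-gram extraction and frequency threshold described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is taken to be sentences already segmented into morphemes (the article used MeCab with UniDic for that step, which is not reproduced here), the function names are hypothetical, and applying the "more than 5 times" cut-off to N-grams as well as to single words is an assumption rather than something the text specifies.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def morpheme_ngrams(morphemes: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive morphemes in one analysed sentence."""
    return [tuple(morphemes[i:i + n]) for i in range(len(morphemes) - n + 1)]


def count_units(sentences: Iterable[Sequence[str]], max_n: int = 3) -> Counter:
    """Count all 1-grams up to max_n-grams over pre-segmented sentences."""
    counts: Counter = Counter()
    for morphemes in sentences:
        for n in range(1, max_n + 1):
            counts.update(morpheme_ngrams(morphemes, n))
    return counts


def frequent_units(counts: Counter, more_than: int = 5) -> Dict[str, int]:
    """Keep units occurring more than `more_than` times, joined with hyphens."""
    return {"-".join(gram): c for gram, c in counts.items() if c > more_than}


# Toy input (assumed): romanised morphemes standing in for analyser output.
toy_corpus = [["ami", "no", "me"]] * 6 + [["ao", "shingou"]] * 2
candidates = frequent_units(count_units(toy_corpus))
```

In this toy corpus the 3-gram ami-no-me occurs six times and is kept as a candidate, while ao-shingou occurs only twice and is dropped. Candidates produced this way would still need the manual noise check that the article describes.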
The extracted N-grams were manually checked and cleaned of noise, resulting in a list of 18,010 lexical units.

³ N-grams are a model of language proposed in the field of natural language processing, consisting of strings of N elements, which can be characters or morphemes: a morpheme 3-gram is composed of three consecutive morphemes, a 4-gram of four, etc.

4.1.2 Subjective assessment

If the list is to be used in the context of Japanese language teaching, a difficulty scale needs to be designed, and lexical units must be labelled according to this scale as words to be learned at a certain level. Vocabulary cannot be categorised purely mechanically; subjective labels by teachers of Japanese, based on their experience and intuition, must also be included. At the same time, subjective judgement is not necessarily based on scientific evidence, and it is therefore difficult to handle such an index when building a database which must be consistent and systematic. In our project we therefore asked five teachers of Japanese with ten or more years of teaching experience to judge the difficulty of the words independently of one another, collected all responses, processed them statistically and labelled all lexical elements by degree of difficulty.

Raters were asked to classify the list of 18,010 words obtained as described in 4.1.1, dividing it into six categories: beginning - 1st part, beginning - 2nd part, intermediate - 1st part, intermediate - 2nd part, advanced - 1st part and advanced - 2nd part. The raters were instructed to judge the level of word difficulty from the perspective of classroom instruction, as the level at which words should be introduced during classroom learning. The average rating for each word was computed, and the word list divided into six levels. The final decision of word level was taken in two rounds.
During the first round, we first computed the average level score of all five raters, and then the κ value measuring the agreement of each rater's scores with the average score. When the agreement between a rater and the average score was less than 0.5, we excluded that rater's scores and recomputed the average of the remaining raters' scores, taking that as the final score. We were thus able to exclude those scores which were markedly different from the rest. The final results of this procedure are presented in Table 1.

Table 1: Results of subjective assessment

| Vocabulary level | Number of vocabulary items | Examples |
|---|---|---|
| 1. Beginning - 1st part | 426 | oyasumi (おやすみ "good night"), tonari (隣 "neighbour"), petto (ペット "pet"), onegaishimasu (お願いします "please"), ohayougozaimasu (おはようございます "good morning"), watashi (私 "I, me"), warui (悪い "bad"), otearai (お手洗い "toilet"), otousan (お父さん "father") |
| 2. Beginning - 2nd part | 800 | ryouri (料理 "food"), ryokou (旅行 "travel"), reizouko (冷蔵庫 "refrigerator"), resutoran (レストラン "restaurant"), remon (レモン "lemon"), wakai (若い "young"), wasureru (忘れる "forget"), gokurousama (ご苦労様 "thank you for your work"), irasshaimase (いらっしゃいませ "welcome"), annai (案内 "introduction, guidance") |
| 3. Intermediate - 1st part | 2,323 | ikebana (生け花 "ikebana"), iken (意見 "opinion"), ikou (以降 "from ... onwards"), ikooru (イコール "equal to"), iremono (入れ物 "container"), ironna (色んな "various"), iwa (岩 "rock"), iwau (祝う "celebrate, congratulate"), ugokasu (動かす "move"), usotsuki (うそつき "liar"), uchuujin (宇宙人 "creature from outer space") |
| 4. Intermediate - 2nd part | 6,482 | iryou (医療 "health care"), iryou (衣料 "clothing"), irui (衣類 "clothing"), irogami (色紙 "colored paper"), iwaigoto (祝い事 "celebration"), iwakan (違和感 "sense of incongruity"), insutorakutaa (インストラクター "instructor"), ushinau (失う "lose"), ushirosugata (後ろ姿 "view from behind"), uttae (訴え "lawsuit"), kakudo (角度 "angle") |
| 5. Advanced - 1st part | 6,401 | kakudan (格段 "remarkable"), kakuchou (拡張 "extension"), kakutei (確定 "decision"), kakutou (格闘 "fight"), gattai (合体 "union"), gatchiri (がっちり "solidly"), kabuseru (かぶせる "cover"), kafusoku (過不足 "too much or too little"), kabunushi (株主 "shareholder, stockholder"), kabegami (壁紙 "wallpaper"), kahogo (過保護 "overprotective"), kankakuki (感覚器 "sensory organ") |
| 6. Advanced - 2nd part | 1,578 | kanten (寒天 "agar-agar"), kannushi (神主 "Shinto priest"), kampa (カンパ "fund-raising campaign, contribution"), kampan (甲板 "deck"), gyouten (仰天 "astonishment"), kyokushou (極小 "infinitesimal"), kirifuki (霧吹き "sprayer"), guzuru (ぐずる "grumble"), kusemono (くせ者 "cunning person; fishy thing"), kuchidutae (口伝え "oral tradition"), kuppuku (屈服 "surrender"), kumikyoku (組曲 "suite") |
| Total | 18,010 | |

The results of a comparison between the vocabulary included in our "instructional vocabulary list" and the vocabulary of the "old JLPT list" are presented in Table 2.

Table 2: Old JLPT and "Instructional vocabulary list" comparison (rows: levels of the Instructional Vocabulary List; columns: levels of the old JLPT vocabulary list)

| Instructional Vocabulary List level | Level 1 | Level 2 | Level 3 | Level 4 | Not included | Total |
|---|---|---|---|---|---|---|
| 1. Beginning - 1st part | 0 | 4 | 7 | 375 | 40 | 426 |
| 2. Beginning - 2nd part | 6 | 79 | 208 | 341 | 166 | 800 |
| 3. Intermediate - 1st part | 94 | 921 | 410 | 105 | 793 | 2,323 |
| 4. Intermediate - 2nd part | 884 | 1,944 | 93 | 37 | 3,524 | 6,482 |
| 5. Advanced - 1st part | 1,290 | 449 | 13 | 0 | 4,649 | 6,401 |
| 6. Advanced - 2nd part | 118 | 32 | 0 | 0 | 1,428 | 1,578 |
| Total | 2,392 | 3,429 | 731 | 858 | 10,600 | 18,010 |

The lexical units marked as "Not included" in Table 2 are words which are part of the "instructional vocabulary list" but not included in the "old JLPT list"; they amount to 10,600 lexical units.

When comparing the "instructional vocabulary list" with the "old JLPT list", loanwords appear to be a particularly problematic area. For example, words such as jazu (ジャズ "jazz"), kameraman (カメラマン "cameraman") and tisshu (ティッシュ "tissue") are categorised as Level 1 (the most difficult) in the "old JLPT list", while in our "instructional vocabulary list" they are set in level Beginning - 2.
On the other hand, words such as rekoodo (レコード "(audio) record"), firumu (フィルム "film") and haadodisuku (ハードディスク "hard disc"), which are categorised as Level 4 (the easiest) in the "old JLPT list", are set in level Intermediate - 2 in our "instructional vocabulary list". These differences are likely to reflect the changes in word usage which have occurred since the 1980s, when the "old JLPT list" was compiled.

4.1.3 Index construction

The "instructional vocabulary list" is now being turned into a database by adding the following indexes to each lexical item.
1. Vocabulary ID
2. Standard written form
3. Readings
4. Vocabulary difficulty level
5. Part of speech
6. Type of word by origin
7. Old Japanese Language Proficiency Test level
8. Meaning classification
9. Accent information

Index 1 is a unique number for the lexical item. Index 2 was prepared in accordance with the dictionary Gendai kokugo hyouki jiten ("Dictionary of modern language written forms"). Index 3 is the reading of the standard written form, and index 4 is one of the six difficulty levels determined by subjective assessment as described above. Index 5 complies with the part-of-speech divisions of UniDic. Index 6 is also based on UniDic's labels, and indicates whether the word is a native word, a loan from Chinese, a loan from other languages, a word of mixed origin, or a fixed expression. Index 7 is the level in the "old JLPT list", index 8 is a semantic label which complies with the categorisation of NINJAL's Bunrui goihyou ("Table of vocabulary by semantic categories"), and index 9 indicates the accent pattern of the word. Table 3 shows a concrete example of a few indexed lexical items.
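As a rough illustration of how the nine indexes listed above might be stored in the CSV format the project chose for distribution, the following sketch models one lexical item as a record and writes it out as a CSV row. The field names, the `LexicalItem` class and all sample values are hypothetical and invented for the example (only the semantic category path for ringo is taken from the article); the project's actual column layout is not given in the text.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class LexicalItem:
    """One row of the list; field names are illustrative, not the project's."""
    vocabulary_id: int      # 1. unique number
    written_form: str       # 2. standard written form
    reading: str            # 3. reading of the written form
    difficulty_level: str   # 4. one of the six levels
    part_of_speech: str     # 5. UniDic part-of-speech label
    word_origin: str        # 6. native, Chinese loan, other loan, mixed, fixed
    old_jlpt_level: str     # 7. level in the old JLPT list
    meaning_class: str      # 8. Bunrui goihyou category
    accent: str             # 9. accent pattern


def to_csv(items: List[LexicalItem]) -> str:
    """Serialise items as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(LexicalItem)])
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))
    return buffer.getvalue()


def from_csv(text: str) -> List[LexicalItem]:
    """Parse CSV text produced by to_csv back into records."""
    rows = csv.DictReader(io.StringIO(text))
    return [LexicalItem(**{**row, "vocabulary_id": int(row["vocabulary_id"])})
            for row in rows]


# Sample values are invented for illustration only.
sample = LexicalItem(1, "りんご", "りんご", "Beginning - 2nd part", "noun",
                     "native", "not included", "noun > nature > flora > trees", "0")
csv_text = to_csv([sample])
```

A flat layout of this kind can be opened directly in a spreadsheet or a text editor, which is the reason the article gives for choosing CSV.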
Table 3: Sample of the Instructional Vocabulary List
[Table contents could not be recovered from the source text.]

The next step, based on these indexes, will be the compilation of definitions for dictionary making and the writing of usage examples.

5. Current progress

Currently, we are creating a database and developing a system aimed at dictionary compilation, on the basis of the "instructional vocabulary list" in Table 2. In particular, we are now in the process of compiling and editing usage examples on the basis of sense definitions. Definitions are compiled with reference to the database of basic words with familiarity indexes by word sense (Amano & Kobayashi, 2008) and the data being developed by Kawamura Yoshiko et al. within the system Reading Tutor. Specifically, we are using the data in The Reading Tutor Web Dictionary (Kawamura et al., 2012), which includes 8,000 lemmas with word ID, example ID, headword, reading, note, part of speech and sense, and 27,000 examples for particular subsenses. Conversely, the word usage data and usage examples being developed within our project are going to be included in The Reading Tutor Web Dictionary, and work in both projects is being carried out in close cooperation.

Usage examples are being edited by external collaborators, who were asked to write three original examples and select three corpus examples for each word sense. Original examples are to be written using only vocabulary not beyond the difficulty level of the headword, and we are developing special software to support example compilation.

The system development group has developed a prototype system to search the database online and download data, as shown in Figure 2.
Figure 2: Prototype of a dictionary search system

When a user inputs a headword in the search box and launches the search, items which completely or partially match the search string are shown on the interface. Some lexical items are linked to pictures. By clicking on the button marked Gogi o hyouji (語義を表示 "Show meaning"), the user can see definitions and examples for the headword hana (花 "flower"), as shown in Figure 3.

Figure 3: Display of word sense and examples

The list of partial match results includes words such as hanabi (花火 "fireworks"), kaki (花器 "flower vase"), hanazakari (花盛り "full bloom"), hanataba (花束 "flower bouquet"), kadan (花壇 "flower bed"), hanabatake (花畑 "flower field") and kabin (花瓶 "flower vase"), which begin with the character 花 (hana or ka, "flower"), and words such as kaika (開花 "blossoming"), nanohana (菜の花 "rape blossoms"), ikebana (生け花 "ikebana"), senkouhanabi (線香花火 "toy fireworks, sparkler") and kusabana (草花 "flowering plant"), which include this character. As can be seen from these examples, partial match results are headwords which contain the characters of the search string, not the word hana.

Scrolling down the page, one can see lexical items which are semantically related to the headword, in the section labelled kanrengo (関連語 "related words"). For example, a search for the word ringo (りんご "apple") produces the results shown in Figure 4.

Figure 4: Words related to ringo ("apple")

Words listed as "related words" under the headword ringo ("apple") in this figure include other words for fruits and plants, such as aamondo (アーモンド "almond"), appuru (アップル "apple"), abokado (アボカド "avocado"), ichou (いちょう "ginkgo"), ume (梅 "Japanese apricot"), etc. The system is based on the Bunrui goihyou ("Table of vocabulary by semantic categories") mentioned above. The word ringo, for example, is categorised in Bunrui goihyou as "noun > nature > flora > trees", and all words pertaining to the same category are extracted by the system and displayed as related words.

In recently developed search systems, the user can perform complex searches and choose between complete match (searching for words which match the search string in its entirety) and partial match (searching also for words which only partially match the search string), between initial partial match (for words beginning with the search string) and final partial match (for words ending with the search string), or search by pronunciation, by written form, etc. Non-expert users, however, may be confused by overly detailed search options. We therefore decided to offer a simple system where the user only enters a search keyword and clicks once, and the system then displays both complete and partial matches.
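The single-box search behaviour described above (complete and partial matches returned together, plus related words drawn from the same semantic category) can be sketched as follows. This is a minimal illustration and not the project's implementation: the `Entry` class, the function names and the category labels are hypothetical placeholders, and matching is done on the written form only, leaving aside lookups by reading.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    written_form: str
    category: str  # placeholder label, not an actual Bunrui goihyou code


def search(entries: List[Entry], query: str) -> Dict[str, List[Entry]]:
    """Return complete matches and partial (substring) matches in one call."""
    complete = [e for e in entries if e.written_form == query]
    partial = [e for e in entries
               if query in e.written_form and e.written_form != query]
    return {"complete": complete, "partial": partial}


def related_words(entries: List[Entry], headword: str) -> List[str]:
    """List other headwords sharing a semantic category with the headword."""
    categories = {e.category for e in entries if e.written_form == headword}
    return [e.written_form for e in entries
            if e.category in categories and e.written_form != headword]


# Tiny invented sample; the category names are assumptions for the example.
ENTRIES = [
    Entry("花", "plants-general"),
    Entry("花火", "events"),
    Entry("開花", "growth"),
    Entry("りんご", "trees"),
    Entry("梅", "trees"),
    Entry("いちょう", "trees"),
]

results = search(ENTRIES, "花")
```

Here a query for 花 yields 花 itself as the complete match and 花火 and 開花 as partial matches, because they contain the query character, while `related_words(ENTRIES, "りんご")` returns the other entries filed under the same placeholder category.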
As for the written form of the search string, in order to search for words of Chinese origin, the search string must be input in Chinese characters, while loanwords from other languages are displayed only if searched for in their standard written form, in katakana, but words of native origin are displayed both when the search string is input in Chinese characters and when it is in hiragana. Native Japanese homophones or words which are written with different Chinese characters depending on the sense in which they are used, can thus be obtained by inserting one single search string in hiragana. For example, if the search string kiru (#5) is searched for, the system will display information for both kiru (^ 5 "wear") and kiru (^5 "cut"). Partial match searches are useful for examining compound nouns and verbs, since by inserting a verb or part of it, one can search for all compound verbs containing it. For example, a search for the hiragana string kakeru (^^5) produces a complete display of all compound verbs containing it, such as headwords oikakeru 5 "chase, pursue"), oshikakeru (^ "throng to, crash in, barge in"), koshikakeru "to sit"), shikakeru "start; prepare; challenge") etc. The user can check the meaning of unknown words by clicking on the button Gogi o hyouji (bhS ^ "Show meaning"), as explained above, obtaining sense definitions and examples as shown in Figure 3 and 5. Figure 5: Display of compound verb senses and examples The search function "Related words", on the other hand, is useful for investigating synonyms and antonyms. A search for the word sawayaka ( ê ^^^ "fresh, pleasant"), for example, yields a result list including synonyms such as kokoroyoi (f£ A "pleasant"), sugasugashii (^^^^ LA "fresh"), soukai (^ "fresh, refreshing"), kokochiyoi iA "kokochiyoi"), etc., and antonyms such as uttoushii ( 5 LA "gloomy, disagreeable"), fukai"unpleasant, disagreeable") etc. 6. 
Further stages and development plan As mentioned above, the database is going to include not only semantic information and usage examples, but also other pieces of information that are useful for users, i.e. information on phonetics, synonymy, collocation, stylistics, culture and errors. At present, each team is working on how to describe these items and in what form to upload them on the database. The progress and results of these teams will be shared by all members of the project by holding research meetings. The time plan for the coming 3 years is the following: Year 2012 • work on the basic design of the database • start with description of basic word usage • set up the environment for data processing • public release of a part of the data (vocabulary list for Japanese language learning) • start publishing information on the project's homepage • compilation of usage examples Year 2013 • construction of a corpus retrieval system • partial release of the data (the system and the corpus tools) Year 2014 • completion of the corpus • release of the final set of data with usage examples • workshops to popularise the database and its use By the end of 2012, we will start advertising on our homepage and make the prototype of our database public. These will be improved in 2013 by adopting users' feedback. By the end of 2014, the last year of the project, the database will be completed. After completion, results of our project will be made public through workshops, targeting particularly users outside Japan in order to encourage practical use of the database as a resource for developing dictionaries for learners of Japanese. We plan to continue with our project according to the time line as described above. (This study is subsidised by the Japan Society for the Promotion of Science, Grants-in-aid No. 23242026.) References Amano, S. [^SJ^Bg], Kobayashi, T. (ed.) (2008). 
Kihongo deetabeesu - gogibetsu tango shinmitsudo [Ä^M'rD — ^^D — ^ -H^^MS^S] ("Database of basic words - Word familiarity index for single subsenses'). ToDkyoD: Gakken Den, Y. [fö^Bf], Ogiso, T. [//^f^if ], Ogura, H. [JS^f], Yamada, A. [^ HM], Minematsu, N. [^if^], Uchimoto, K. [ftÄü*] & Koiso, H. [/J^fö^] (2007). Koopasu nihongogaku no tame no gengo shigen: keitaisokaisekiyou denshika jisho no kaihatsu to sono ouyou 0 : ^I^ÄWffl®^ t ^("The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics"), Nihongokagaku [ 0 ("Japanese linguistics") 22: 101-122. Kawamura, Y. et al. (2012). The Reading Tutor Web Dictionary - Chuta no web jisho [^a ^ web S¥#]. Retrieved from: http://chuta.jp/ Kudo, T. (2011). MeCab: yet another part-of-speech and morphological analyzer. Retrieved from http://mecab.sourceforge.net/ Lee, (2011). Nihongo nöryoku shiken no chösen: Atarashii nihongo nöryoku shiken wo rei ni (Kokusai köryü kikin jigyö repöto 14) [ 0 0 ^ M-fMi^W^SII^Ä^*^ h 14)] ("The challenge of the Japanese Language Proficiency Test: The case of the new Japanese Language proficiency test" [Japan Foundation Project Report 14]), Nihongogaku [ 0 ^M^] ("Japanese Language ") 30(1), 95-107. National Institute for Japanese Language and Linguistics [SÄSMW^^] (ed.) (2004). Bunrui goihyö zouhokaiteiban [^MMs^il'M^pT,®] ("Table of vocabulary by semantic categories"). Tokyo: Dainihontosho Oshio, K. Akimoto, M.[^Ä^Bf], Takeda, A.[tHii], Abe, Takanashi, M.[^^fl], Yanagisawa, Y.[fPif£?Bg], Iwamoto, R.[S^|-], & Ishige, ] (2008). Atarashii nihongo nöryoku shiken no tame no goi-hyö sakusei ni mukete ("Towards a new vocabulary list for the new Japanese Language Proficiency Test"), Kokusai köryü kikin nihongokyöiku kiyö [SI^MÄ^ 0 ^M^WIH^] ("Japan Foundation Japanese Language Journal"), 4,71-86.