Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities
27–28 September 2016
Faculty of Arts, University of Ljubljana
Ljubljana, Slovenia

Editors: Darja Fišer, Michael Beißwenger
Technical editors: Jaka Čibej, Katja Zupan
Published by: Znanstvena založba Filozofske fakultete Univerze v Ljubljani / Ljubljana University Press, Faculty of Arts
Issued by: Department of Translation Studies
For the publisher: Branka Kalenić Ramšak, the dean of the Faculty of Arts
Ljubljana, 2016
First edition

Conference web site: http://nl.ijs.si/janes/cmc-corpora2016/
Publication is available at: http://nl.ijs.si/janes/wp-content/uploads/2016/09/CMC-conference-proceedings-2016.pdf

The publication was supported by the Slovenian Research Agency within the national basic research project “Resources, Tools and Methods for the Research of Nonstandard Internet Slovene” (J6-6842, 2014-2017).
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.
Publication is free of charge.

Cataloguing-in-publication (CIP) record prepared by the National and University Library in Ljubljana (Narodna in univerzitetna knjižnica)
COBISS.SI-ID=286436096
ISBN 978-961-237-859-2 (pdf)

Preface

This volume presents the proceedings of the 4th edition of the Conference on CMC and Social Media Corpora for the Humanities (cmc-corpora2016), which was held on 27–28 September at the University of Ljubljana, Slovenia. The conference series (http://cmc-corpora.org/) is dedicated to the collection, organization, annotation, processing, analysis and sharing of data and corpora from computer-mediated communication (CMC) and social media genres for research purposes.
The genres of interest to the cmc-corpora conference community include e-mail, chats, forums, newsgroups, blogs, news comments, wiki discussions, SMS and mobile messaging applications (WhatsApp, etc.), interactions on social network sites (Facebook, Twitter, etc.), on YouTube and in multimodal online environments. The conference brings together research questions from linguistics, philology, communication sciences, media and social sciences with methods, tools and infrastructures from the fields of corpus and computational linguistics, natural language processing, text technology and digital humanities. The focus of the conferences is on
• language-centered research using computational methods and tools for the empirical analysis of CMC and social media phenomena,
• approaches towards automatic processing and annotation of CMC and social media data,
• corpus-linguistic research on collecting, processing, representing and providing CMC and social media corpora on the basis of standards in the field of digital humanities.

Previous conferences have been held in Dortmund/Germany (2013 and 2014) and in Rennes/France (2015). Besides keynote talks by two invited speakers, Dawn Knight from Cardiff University (UK) and Petra Kralj Novak from the Jožef Stefan Institute (Slovenia), the 4th cmc-corpora conference featured 17 papers, 4 posters and 1 student paper written by 40 authors and co-authors from 24 research institutions in 11 countries, addressing key issues and current trends in the research field on data from 8 different languages. We thank all colleagues who have contributed to the conference and to this volume with their papers, talks and posters, and as members of the scientific committee.
We hope that the results of the conference will mark another step towards a lively exchange of approaches, expertise, resources, tools and best practices between researchers and existing networks in the field and pave the way for future standards in building and using CMC and social media corpora for research in the humanities.

Darja Fišer, University of Ljubljana, Slovenia
Michael Beißwenger, University of Duisburg-Essen, Germany
Co-chairs of the Scientific Committee
Ljubljana and Essen, September 2016

Coordinating Committee
Michael Beißwenger (UDE, Germany)
Ciara R. Wigham (ICAR, France)
Thierry Chanier (LRL, France)

Scientific Committee
Co-chairs
Darja Fišer (UL, Slovenia)
Michael Beißwenger (UDE, Germany)
Members
Thierry Chanier (LRL, France)
Isabela Chiari (SAPIENZA, Italy)
Tomaž Erjavec (JSI, Slovenia)
Axel Herold (BBAW, Germany)
Gudrun Ledegen (UR2, France)
Nikola Ljubešić (UZ, Croatia)
Julien Longhi (UCP, France)
Harald Lüngen (IDS, Germany)
Maja Miličević (UB, Serbia)
Céline Poudat (UN, France)
Egon W. Stemle (EURAC, Italy)
Ciara R. Wigham (ICAR, France)

Organizing Committee
Chair
Darja Fišer (UL)
Members
Simon Krek (JSI)
Jaka Čibej (UL)
Katja Zupan (JSI)

Table of Contents

Preface
Committees
Organizers

INVITED TALKS
Constructing E-Language Corpora: a focus on CorCenCC (The National Corpus of Contemporary Welsh)
    Dawn Knight
Sentiment of Emojis
    Petra Kralj Novak

REGULAR TALKS
Syntactic Annotation of Slovene CMC: First Steps
    Špela Arhar Holdt, Darja Fišer, Tomaž Erjavec, Simon Krek
(Best) Practices for Annotating and Representing CMC and Social Media Corpora in CLARIN-D
    Michael Beißwenger, Eric Ehrhard, Axel Herold, Harald Lüngen, Angelika Storrer
Grammatical Frequencies and Gender in Nordic Twitter Englishes
    Steven Coats
Framework for an Analysis of Slovene Regional Language Variants on Twitter
    Jaka Čibej
Analysis of Sentiment Labeling of Slovene User-Generated Content
    Darja Fišer, Tomaž Erjavec
Compilation and Annotation of the Discourse-structured Blog Corpus for German
    Holger Grumt Suárez, Natali Karlova-Bourbonus, Henning Lobin
Expressiveness in Flemish Online Teenage Talk: A Corpus-Based Analysis of Social and Medium-Related Linguistic Variation
    Lisa Hilte, Reinhild Vandekerckhove, Walter Daelemans
French Wikipedia Talk Pages: Profiling and Conflict Detection
    Lydia-Mai Ho-Dac, Véronika Laippala, Céline Poudat, Ludovic Tanguy
Slovene Twitter Analytics
    Nikola Ljubešić, Darja Fišer
Textometrical Analysis of French Arts Workers “fr.Intermittents” on Twitter
    Julien Longhi, Dalia Saigh
The Use of Alphanumeric Symbols in Slovene Tweets
    Dafne Marko
A Multilingual Social Media Linguistic Corpus
    Luis Rei, Dunja Mladenić, Simon Krek
Political Discourse in Polish Internet – Corpus of Highly Emotive Internet Discussions
    Antoni Sobkowicz
Topic Ontologies of the Slovene Blogosphere: A Gender Perspective
    Iza Škrjanec, Senja Pollak
Linguistic Characteristics of Dutch Computer-Mediated Communication: CMC and School Writing Compared
    Lieke Verheijen
A Multimodal Analysis of Task Instructions for Webconferencing-supported L2 Interactions: A Pilot Study of the ISMAEL Corpus
    Ciara R. Wigham, H. Müge Satar
Linguistic Analysis of Emotions in Online News Comments - an Example of the Eurovision Song Contest
    Ana Zwitter Vitez, Darja Fišer

STUDENT PAPER
Alternative Endings of Slovene Verbs in Third Person Plural: A Corpus Approach
    Gašper Pesek, Iza Škrjanec, Dafne Marko

POSTERS
Geolocating German on Twitter: Hitches and Glitches of Building and Exploring a Twitter Corpus
    Bettina Larl, Eva Zangerle
The #Intermittent Corpus: Corpus Features, Ethics and Workflow for a CMC Corpus of Tweets in TEI
    Julien Longhi
The Construction of a Teletandem Multimodal Data Bank
    Queila Barbosa Lopes
Graphic Euphemisms in Slovenian CMC
    Mija Michelizza, Urška Vranjek Ošlak

Author Index

Constructing E-Language Corpora: a focus on CorCenCC (The National Corpus of Contemporary Welsh)

Dawn Knight
Centre for Language and Communication Research, Cardiff University, 2 Column Drive, CF10 Cardiff, UK
E-mail: knightd5@cardiff.ac.uk

Abstract
Digital communication in the age of ‘web 2.0’ (that is, the second generation of the internet: an internet driven by user-generated content and the growth of social media) is becoming ever more embedded in our daily lives. It is impacting on the ways in which we work, socialise, communicate and live.
Defining, characterising and understanding the ways in which discourse is used to scaffold our existence in this digital world has, therefore, emerged as an area of research that is a priority for applied linguists (amongst others). Corpus linguists are ideally situated to contribute to this work as they have the appropriate expertise to construct, analyse and characterise patterns of language use in large-scale bodies of such digital discourse (labelled ‘e-language’ here). Indeed, an increasing number of e-language corpora are being developed to allow us to investigate e-language use. This presentation discusses some of the methodological, technical, practical and ethical considerations and challenges faced in the construction of e-language corpora. It will outline, for example, some of the approaches used when planning the construction of e-language corpora, including: obtaining consent; approaches to sampling, collecting and anonymising data; sourcing and attributing metadata; as well as some reflections on constructing a corpus infrastructure. Discussions will be contextualised with reference to the CorCenCC corpus project (Corpws Cenedlaethol Cymraeg Cyfoes, The National Corpus of Contemporary Welsh), funded by the Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC). CorCenCC will be the first large-scale corpus of Welsh representative of language use across communication types, including 2 million words of e-language and 4 million words each of spoken and written language. CorCenCC will be open-source and freely available for use by professional communities and anyone with an interest in language. Bespoke applications and instructions will be provided for different user groups.
The corpus will enable, for example, community users to investigate dialect variation or idiosyncrasies of their own language use; professional users to profile texts for readability or develop digital language tools; learners to learn from real life models of Welsh; and researchers to investigate patterns of language use and change.

Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, Ljubljana, Slovenia, 27–28 September 2016

Sentiment of Emojis

Petra Kralj Novak
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
E-mail: petra.kralj.novak@ijs.si

Abstract
Emojis are one of the phenomena of the technological age. What started out as the odd smiley face at the end of a text message :) has evolved into an indispensable part of informal computer-mediated communication. For example, Instagram reports that in March 2015 nearly half of the texts on their platform contained emojis. But what is the emotional context of emojis? We engaged 83 human annotators to label over 1.6 million tweets in 13 European languages with their sentiment polarity (negative, neutral, or positive). About 4% of the annotated tweets contain emojis. By computing the sentiment of emojis from the sentiment of the tweets in which they occur, we constructed the first emoji sentiment lexicon, called the Emoji Sentiment Ranking, and drew a sentiment map of the 751 most frequently used emojis. The sentiment analysis of the emojis allows us to draw several interesting conclusions. It turns out that most of the emojis are positive, especially the most popular ones. The sentiment distribution of the tweets with and without emojis is significantly different. The inter-annotator agreement on the tweets with emojis is higher. Emojis tend to occur at the end of the tweets, and their sentiment polarity increases with the distance.
We observe no significant differences in the emoji rankings between the 13 languages and the Emoji Sentiment Ranking. Consequently, we propose our Emoji Sentiment Ranking as a European language-independent resource for automated sentiment analysis.

Syntactic Annotation of Slovene CMC: First Steps

Špela Arhar Holdt*♦, Darja Fišer*‡, Tomaž Erjavec‡, Simon Krek‡
* Faculty of Arts, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana
♦ Institute for Applied Slovene Studies Trojina, Trg republike 3, 1000 Ljubljana
‡ Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana
E-mail: spela.arharholdt@ff.uni-lj.si, darja.fiser@ff.uni-lj.si, tomaz.erjavec@ijs.si, simon.krek@ijs.si

Abstract
This paper presents the first steps towards the syntactic annotation of Slovene CMC, namely the annotation of 200 Slovene tweets with the JOS dependency model. After a presentation of the dataset we present the selected annotation model, the annotation procedure, and results. The focus of the paper is on the decisions regarding the annotation of CMC-specific elements that required special treatment: Twitter-specific features, foreign language elements, ellipsis and fragments, non-standard use of punctuation, and other non-standard language features. The dataset, together with the CMC-adapted annotation guidelines, can be used for further annotation of language data (from Twitter or other CMC genres) and, in a second step, to train a parser for the selected CMC domain(s). The large-scale corpus-based research of non-standard Slovene syntax, which will be facilitated by the described activities, will help disprove the myths surrounding CMC that are still present in the field of Slovene studies.
Keywords: computer mediated communication, syntactic annotation, JOS dependency model, Slovene language, tweets

1. Introduction
With the advent of digital media and the Internet, communication practices began to change significantly, challenging the traditionally established dichotomies of public vs. private, formal vs. informal, written vs. spoken, and standard vs. non-standard language use. Initially, the linguistic research community observed the new situation with a somewhat reserved attitude, whereas in recent years, more and more studies aim to disprove the myths surrounding computer-mediated communication and its possible negative impact on the evolution of language (Crystal, 2011). Since computer-mediated communication is a global phenomenon, work on languages other than English soon followed (Myslin and Gries, 2010; Storrer, 2013; Chanier et al., 2014).
While studies have been performed on Slovene as well, they mostly focused on orthographic (Jakop, 2008; Arhar Holdt and Dobrovoljc, 2015), lexical (Michelizza, 2015; Zwitter Vitez and Fišer, 2015) and processing issues (Ljubešić et al., 2016a; Ljubešić et al., 2016b), whereas no larger-scale corpus-based work exists on the syntax of Slovene CMC. The goal of this paper is to present the first steps in bridging this gap: the annotation of 200 Slovene tweets with the JOS dependency model (Erjavec et al., 2010), which will serve as the groundwork for syntactic annotation and analysis of Slovene CMC.

2. Dataset
A dataset of 200 tweets (475 sentences) was extracted from the Janes corpus of Slovene CMC (Fišer et al., 2015), sampled to include an equal number of linguistically and technically standard and non-standard tweets (Ljubešić et al., 2015). The dataset only includes tweets longer than 120 characters published by private individuals. This material was lemmatized and POS-tagged with the tools described in Erjavec et al. (2005) and Ljubešić et al. (2014). In the next step, the sentence segmentation and tokenization were manually corrected, the tweets were normalised on the lexical and morphological level (Čibej et al., 2016a), and finally, the attributed lemmas and POS-tags were manually corrected (Čibej et al., 2016b).

3. The JOS Dependency Model
For the annotation, the JOS dependency model was used. The system, which was designed in the project “Linguistic Annotation of Slovene” (Erjavec et al., 2010), is based on syntactic dependencies. The categories of the system are presented in Table 1.

Table 1: The labels in the JOS dependency model. (http://eng.slovenscina.eu/tehnologije/razclenjevalnik)
First level labels link elements in different types of phrases (green and yellow colour in the visualisation):
    dol: Links heads and modifiers in phrases.
    del: Links parts of verbal phrases.
    prir: Links heads in coordinate structures within clauses.
    vez: Links words or commas in conjunctive function.
    skup: Links (function) words in frozen multi-word structures.
Second level labels link sentence elements (red colour in the visualisation):
    ena: Clause subject.
    dve: Clause object.
    tri: Adverbial of manner.
    štiri: Other adverbials.
Third level label links all other structures (blue colour in the visualisation):
    modra: Links to the root, punctuation, fragments, etc.

As the JOS dependency model is based on the principle that the relations inducible from the tags on lower levels (lemmas and POS) are not annotated again on the syntactic level, it is significantly simpler and more robust than similar models, e.g. the Prague Dependency Treebank (Böhmová et al., 2003). The main features of the model are described in Erjavec et al. (2010), and in more detail in the annotation guidelines (Holozan et al., 2008). The model was applied in the “Communication in Slovene (SSJ)” project to annotate the ssj500k training corpus (Krek et al., 2015), on the basis of which a parser for Slovene was trained (Dobrovoljc et al., 2012). Additionally, a specialised program was developed for the visualisation, manual annotation and search of the data (the screenshots in Figures 1 to 4 are from this program, whose author is Janez Brank).

4. Annotation and Results
The dataset, described in Section 2, was automatically parsed and imported into the SSJ annotation program. Syntactic annotations were then manually corrected, following the guidelines for the annotation of the ssj500k corpus. During annotation, the majority of the problems could be adequately addressed by the existing guidelines, while for some specific questions, the guidelines had to be complemented by additional rules. In the remainder of this paper, we present the decisions regarding the annotation of: Twitter elements; foreign language; syntactical fragments and ellipsis; non-standard use of punctuation; and other non-standard language features. The implemented solutions are exemplified in Figures 1 to 4. The examples are in Slovene, with an English translation provided in the corresponding figure title.¹

¹ The translation is somewhat word-by-word to help non-Slovene speakers follow the annotated syntactic relations; however, certain adaptations were required due to language differences.

4.1 Twitter Elements
We considered two kinds of hashtags, usernames, URLs, and emoticons. The elements that were syntactically part of a sentence were annotated in accordance with their function, while function-free elements (typically appearing at the beginning or the end of the tweet) were connected to the root node. This decision is in accordance with similar projects (Kong et al., 2014), and the annotation of the dataset indicates that the separation of the two groups is sufficiently straightforward. Figure 1 presents an example, where the first hashtag (#zooljubljana) connects to the root node, while the second one is annotated as a part of a noun phrase (a plane to #sochi).

4.2 Foreign Language
Foreign language elements (primarily from English and related South Slavic languages) appear in Slovene tweets as single words, word phrases, or longer segments/clauses. Different levels of adaptation to Slovene can be observed regarding the spelling and morphology of these elements. The questions about how to lemmatise and POS-tag them (including the question of how to separate the ones to be tagged as foreign from the ones to be treated as Slovene) were addressed at the earlier stages of the project (Čibej et al., 2016b). On the syntactic level, we followed the principle that single words and two-part phrases with a clear dependency relation are attached into the syntactic tree, whereas in longer phrases and segments, all of the foreign elements get attached to the root node instead. Figure 2 presents an example of the first type, where the English phrase personal message is connected to the tree.

Figure 1: #zooljubljana more than obviously the lynx wanted to catch a plane to #sochi.
Figure 2: Jeez, somebody sent me a virus and now it’s sending random stuff to everyone as a {personal message}.

4.3 Ellipses and Fragments
Tweets are especially challenging to annotate syntactically due to their fragmented nature and a large number of ellipses. One possible solution to this problem is to use a system that allows for the orphan node to be promoted to the place of the missing parent (Dobrovoljc and Nivre, 2016). An alternative is to use a system that attaches such elements directly to the root (Kong et al., 2014). The JOS dependency model was designed as the latter: while in regular clauses only the head of the predicate is attached to the root (or the relevant ordinate clause), in fragments or clauses without the predicate, each separate phrase head attaches to the root node as well (Figure 3).

4.4 Non-standard Use of Punctuation
The annotation of the dataset revealed that lexical and morphological normalisation of tweets and subsequent manual correction of lemmas and POS tags successfully eliminated many of the potential problems for syntactic annotation. However, the non-standard use of punctuation in tweets remains an important source of errors. The existing parser is trained on standard Slovene language, where punctuation, especially the use of the comma, plays an important role in determining the borders between clauses and other types of sentence segments. Omitted, redundant, and misplaced commas thus as a rule lead to parsing mistakes, and with the comma being notoriously difficult to master for Slovene speakers, such instances are frequent. For the annotation of the dataset, the parsing errors were manually corrected; however, the findings indicate that it might be beneficial to include a punctuation normalisation step before syntactic processing (some work for Slovene has been presented by Kranjc and Robnik Šikonja, 2015).

4.5 Other Non-standard Language Features
Last but not least, the annotated dataset exhibits a number of other syntactic features that have been previously attributed to non-standard written Slovene (Michelizza, 2015), e.g. atypical word order, non-standard use of conjunctions, cases and grammatical number, and a high number of demonstrative pronouns and certain particles. A preliminary analysis of the annotated data reveals that 49% of the (linguistically and technically non-standard) tweets exhibit at least one of the listed features. While it is clear that these phenomena need to be linguistically addressed in the future, they did not pose a problem for the annotation. Figure 4 presents an example, where the predicate to be similar is accompanied by two objects in dative (he is similar to the members of the parliament and similar to me = to me he seems similar). While valence in this example is atypical, the annotations are relatively straightforward.

Figure 3: During the night from Rogoznica to Veli Rat, if god allows it, and then slowly back.
Figure 4: […] @PrinasalkaZlata, somehow to me he is similar to the members of the parliament #spialprede.

5. Conclusion and Further Work
The paper presented the first steps towards a syntactic annotation of Slovene CMC. In this first stage, 200 tweets were annotated with the JOS dependency model and annotator guidelines were supplemented with examples for the annotation of Twitter-specific and non-standard language features. The dataset, together with the guidelines, can be used for further annotation of language data (from Twitter or other CMC genres) and, in a second step, to train a parser for the selected CMC domain(s).
The JOS dependency model in combination with the SSJ annotation program proved to be adequate for the described task, the main advantages of the system being its robustness and the ability to allow multiple attachments to the root element. A major drawback is that the system is language-specific and as such offers little possibility for cross-lingual comparison. Recent attempts to translate the annotations of the ssj500k corpus to the Universal Dependencies system (Dobrovoljc et al., 2016) suggest a possible solution to this problem in the future.

6. Acknowledgements
The work described in this paper was funded by the Slovenian Research Agency within the national basic research project "Resources, Tools and Methods for the Research of Non-Standard Internet Slovene" (J6-6842, 2014–2017).

7. References
Arhar Holdt, Š. and Dobrovoljc, K. (2015). Zveze samostalnika z nesklonljivim levim prilastkom v korpusih Janes in Kres. In D. Fišer (Ed.), Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba Filozofske fakultete, pp. 4–9.
Böhmová, A., Hajič, J., Hajičová, E. and Hladká, B. (2003). The Prague dependency treebank. In Treebanks: Building and Using Parsed Corpora. Netherlands: Springer, pp. 103–127.
Chanier, T., Poudat, C., Sagot, B., Antoniadis, G., Wigham, C. R., Hriba, L., Longhi, J. and Seddah, D. (2014). The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. Journal for Language Technology and Computational Linguistics, 29(2), pp. 1–30.
Čibej, J., Fišer, D. and Erjavec, T. (2016a). Normalisation, Tokenisation and Sentence Segmentation of Slovene Tweets. In Proceedings of the Workshop on Normalisation and Analysis of Social Media Texts (NormSoMe). Portorož: ELRA, pp. 5–10.
Čibej, J., Arhar Holdt, Š., Erjavec, T. and Fišer, D. (2016b). Razvoj učne množice za izboljšano označevanje spletnih besedil. In Proceedings of the Conference on Language Technologies and Digital Humanities. Ljubljana (in print).
Crystal, D. (2011). Internet Linguistics: A Student Guide. London, New York: Routledge.
Dobrovoljc, K. and Nivre, J. (2016). The Universal Dependencies Treebank of Spoken Slovenian. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC ’16). Portorož, pp. 1566–73.
Dobrovoljc, K., Erjavec, T. and Krek, S. (2016). Pretvorba korpusa ssj500k v Univerzalno odvisnostno drevesnico za slovenščino. In Proceedings of the Conference on Language Technologies and Digital Humanities. Ljubljana (in print).
Dobrovoljc, K., Krek, S. and Rupnik, J. (2012). Skladenjski razčlenjevalnik za slovenščino. In T. Erjavec, J. Žganec Gros (Eds.), Zbornik Osme konference Jezikovne tehnologije. Ljubljana: Institut Jožef Stefan, pp. 42–47.
Erjavec, T., Fišer, D., Krek, S. and Ledinek, N. (2010). The JOS linguistically tagged corpus of Slovene. In LREC 2010, 7th International Conference on Language Resources and Evaluations. Valletta, pp. 1806–1809.
Erjavec, T., Ignat, C., Pouliquen, B. and Steinberger, R. (2005). Massive multi-lingual corpus compilation: Acquis Communautaire and totale. In Proceedings of the 2nd Language & Technology Conference. Poznan, pp. 32–36.
Fišer, D., Ljubešić, N. and Erjavec, T. (2015). The JANES corpus of Slovene user generated content: construction and annotation. In International Research Days: Social Media and CMC Corpora for the eHumanities: Book of Abstracts. Rennes, p. 11.
Holozan, P., Krek, S., Pivec, M., Rigač, S., Rozman, S. and Velušček, A. (2008). Specifikacije za učni korpus. Kamnik: Projekt »Sporazumevanje v slovenskem jeziku« ESS in MŠŠ.
Jakop, N. (2008). Pravopis in spletni forumi – kva dogaja? In Slovenščina med kulturami, Zbornik Slavističnega društva Slovenije 19, pp. 315–327.
Kranjc, A. and Robnik Šikonja, M. (2015). Postavljanje vejic v slovenščini s pomočjo strojnega učenja in izboljšanega korpusa Šolar. In D. Fišer (Ed.), Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba Filozofske fakultete, pp. 38–43.
Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C. and Smith, N. A. (2014). A dependency parser for tweets. In Proc. of EMNLP. Doha, Qatar, pp. 1001–1012.
Krek, S., Dobrovoljc, K., Erjavec, T., Može, S., Ledinek, N. and Holz, N. (2015). Training corpus ssj500k 1.4. Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1052.
Ljubešić, N., Fišer, D., Erjavec, T., Čibej, J., Marko, D., Pollak, S. and Škrjanec, I. (2015). Predicting the level of text standardness in user-generated content. In 10th International Conference on Recent Advances in Natural Language Processing: Proceedings of RANLP 2015 Conference. Hissar, pp. 371–378.
Ljubešić, N., Erjavec, T. and Fišer, D. (2014). Standardizing tweets with character-level machine translation. In Computational linguistics and intelligent text processing: 15th International Conference, CICLing 2014, Kathmandu, Nepal: Proceedings: part II. Heidelberg: Springer, pp. 164–175.
Michelizza, M. (2015). Spletna besedila in jezik na spletu. Ljubljana: Založba ZRC, ZRC SAZU.
Myslin, M. and Gries, S. T. (2010). k dixez? A corpus study of Spanish Internet orthography. Literary and Linguistic Computing, 25(1), pp. 85–104.
Storrer, A. (2013). Sprachverfall durch internetbasierte Kommunikation? Linguistische Erklärungsansätze – empirische Befunde. In Sprachverfall? Dynamik – Wandel – Variation. Jahrbuch des Instituts für Deutsche Sprache 2013. De Gruyter Mouton, pp. 171–196.
Zwitter Vitez, A. and Fišer, D. (2015). From mouth to keyboard: the place of non-canonical written and spoken structures in lexicography. In Electronic lexicography in the 21st century: linking lexical data in the digital age: proceedings of eLex 2015 Conference, Herstmonceux Castle, UK. Ljubljana: Trojina, Institute for Applied Slovene Studies; Birmingham: Lexical Computing, pp. 250–267.

(Best) Practices for Annotating and Representing CMC and Social Media Corpora in CLARIN-D

Michael Beißwenger*, Eric Ehrhard†, Axel Heroldˠ, Harald Lüngen‡, Angelika Storrer†
* Department of German Studies, University of Duisburg-Essen, Berliner Platz 6–8, D-45127 Essen
† Department of German Linguistics, University of Mannheim, Schloss, Ehrenhof West, D-68131 Mannheim
ˠ Berlin-Brandenburg Academy of Sciences and Humanities, Jägerstraße 22/23, D-10117 Berlin
‡ Institute for the German Language, R5, 6–13, D-68161 Mannheim
E-mail: michael.beisswenger@uni-due.de, eric.ehrhardt@gmx.de, herold@bbaw.de, luengen@ids-mannheim.de, astorrer@mail.uni-mannheim.de

Abstract
The paper reports the results of the curation project ChatCorpus2CLARIN.
The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can be considered best practices which may be useful for the annotation and representation of other CMC and social media corpora.

Keywords: CMC corpora, TEI encoding, tagging, corpus infrastructures, legal issues, CLARIN

1. Introduction
This paper reports the results of the curation project ChatCorpus2CLARIN. The goal of the project was to develop a workflow and resources for the integration of an existing chat corpus (the Dortmund Chat Corpus, Beißwenger 2013) into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences¹ as part of the European Common Language Resources and Technology Infrastructure². The paper presents an overview of the resources and practices developed in the project, describes the added value of the resource after its integration and discusses, as an outlook, to what extent these practices can already be considered as best practices which may be useful for the annotation and representation of other CMC and social media corpora.

2. Goals of the Project
The goal of the project was twofold: On the one hand, (1) the project aimed to integrate an existing chat corpus into the CLARIN-D corpus infrastructures at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) and at the Institute for the German Language (IDS), Mannheim. This included, as subtasks, (1a) the development of a schema and conversion routine for the transformation of the XML markup and metadata in the original resource into a TEI format, (1b) the addition of a new annotation layer with part-of-speech and lemma information, (1c) a re-anonymization of the corpus data according to the recommendations given in a legal opinion. On the other hand, (2) the solutions developed to achieve goal (1) should be designed as general (and not idiosyncratic) approaches to the challenge of annotating and representing corpora of computer-mediated communication (CMC) and social media according to existing standards in the Digital Humanities / CLARIN context. The main result of goal (1) is, thus, the integrated chat corpus whereas the results of goal (2) are documented resources and practices that may be reused by other projects which aim at integrating CMC and social media resources into CLARIN.

3. The Corpus
The Dortmund Chat Corpus (Beißwenger, 2013) has been collected at TU Dortmund University as a resource for researching the peculiarities and linguistic variation in written CMC. The corpus comprises 478 chat documents (logfiles) containing 140240 user postings or 1M words of German chat discourse from heterogeneous sources representing the use of chats in a wide range of application contexts (social chats, advisory chats, chats in the context of learning and teaching, moderated chats in the media context). The corpus has been annotated using a homegrown XML format (‘ChatXML’) that describes (1) the basic structure and properties of chat logfiles and postings, (2) selected “netspeak” phenomena such as emoticons, interaction words, addressing terms, nicknames and acronyms, (3) selected metadata about the chat platforms and chat users. Since 2005, a large subset of the corpus has been available as a ChatXML resource for download and offline querying, and as an HTML version for online browsing.³

4. Overview of Workflow and Resources
The Dortmund Chat Corpus served as a use case to demonstrate how an integration of CMC and social media resources could be accomplished in a way that the target resource (1) conforms to established standards for the representation and linguistic annotation of corpora in the Digital Humanities context and (2) can be used for comparative analyses with other types of corpus resources in CLARIN-D (text and speech corpora). A visualization of the workflow and practices developed in the project is given in Fig. 1; the steps and resources of the pipeline are described in the following subsections.

[Figure 1: Workflow and resources. Diagram of the pipeline leading from the ChatXML-structured Chat Corpus 1.0 via conversion to TEI, automatic linguistic annotation, manual post-editing and anonymization to Chat Corpus 2.0 and its integration into the CLARIN-D infrastructure (IDS and BBAW repositories).]

4.1 Experimental CMC Corpus with Data Samples from Heterogeneous Sources
For developing and testing the solutions for goals (1a), (1b) and (1c) (cf. Sect. 2) not only with chat data, we compiled a small experimental corpus of 38382 tokens with data also from other CMC and social media genres. The corpus included (1) two logfiles from different subcorpora of the chat corpus (12526 tokens), (2) 94 news messages from the Usenet corpus in DEREKO (Schröck & Lüngen, 2015) (9108 tokens), (3) excerpts from two Wikipedia talk pages (907 tokens), (4) donated tweets from two different Twitter accounts (1412 tokens) and (5) 1907 posts from two different WhatsApp conversations collected in the project “What's up, Deutschland?”⁴ (14429 tokens).

4.2 The NLP Toolchain Developed in the BMBF Project www.schreibgebrauch.de
Part-of-speech (PoS) tagging was done in two stages: (1) an automatic tagging process and (2) a manual post-editing phase. Automatic tagging (including tokenization, PoS tagging and lemmatization) was done at Saarland University applying an NLP toolchain that was developed in the BMBF project Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen⁵ (Horbach et al., 2014). The toolchain had originally been trained on annotating chat and forum data with a tag set derived from Bartz et al. (2014).

4.3 The ‘STTS 2.0’ Part-of-Speech Tagset and Guidelines from the EmpiriST2015 Shared Task Project
As target standard for the PoS layer, we used the STTS-IBK tag set (‘STTS 2.0’) developed in the GSCL shared task on automatic linguistic annotation of CMC and social media (EmpiriST2015)⁶. ‘STTS 2.0’ is an advanced version of the tag set suggested in Bartz et al. (2014) and builds on the categories of the “Stuttgart-Tübingen Tagset” (STTS, Schiller et al., 1999) which is a well-acknowledged de facto standard for PoS tagging of German written corpora. In its canonical version, STTS does not include any tags for CMC and social media genres. ‘STTS 2.0’ therefore introduces two types of new tags: (1) tags for phenomena which are specific for CMC and social media discourse, (2) tags for phenomena which are typical of spontaneous spoken language in colloquial registers and which can also be found in corpora of transcribed speech (e.g., in the FOLK corpus of spoken language at the IDS which uses an STTS extension which is compatible with ‘STTS 2.0’, Westpfahl, 2014). The resulting tag set is still downwardly compatible with STTS (1999) and therefore allows for interoperability with other corpora that have been tagged with STTS. In the EmpiriST2015, several existing NLP systems have been trained on assigning the ‘STTS 2.0’ extensions to tokens of CMC and social media discourse (Beißwenger et al., 2016).
The tag set is described in an annotation guideline (Beißwenger et al. 2015) and had previously been tested with data from several CMC genres. In the curation project, these guidelines have been used for manual post-editing the results of the automatic tagging process described in Sect. 4.2. In the post-editing process which was done using an adapted version of the tool OrthoNormal from the FOLKER tool suite (Schmidt, 2012), the whole corpus has been made compatible with the ‘STTS 2.0’ tag set. In addition, for a partial corpus of 4339 tokens all tags assigned in the automatic process have been post-edited independently by two human annotators who had been trained with the guidelines (agreement according to Cohen's kappa: κ = 0.92). Differing cases were decided by the project heads. The 4339-token partial corpus with manually checked PoS annotation can be considered as an additional resource from the project which can be used for further retraining of tagging systems with ‘STTS 2.0’.

4.4 The ‘CLARIN TEI Schema for CMC’ and the XSLT for Conversion
The resource was converted into a TEI representation format which builds on (1) the official TEI-P5 framework for electronic text encoding and interchange and (2) two versions of a customization of TEI-P5 for CMC genres created in the context of the TEI special interest group “computer-mediated communication” (CMC-SIG) and described in Beißwenger et al. (2012) and Chanier et al. (2014). Starting from a close evaluation of the most recent version of the customization (Chanier et al., 2014), we developed the models and best practices from the TEI CMC-SIG further taking into consideration the genres available in our experimental corpus. The resulting new TEI schema draft – the ‘CLARIN TEI schema for CMC’ – has been made available for further use and comments in the TEI wiki⁷. The conversion of the ChatXML format into the target TEI format was done using an XSLT stylesheet.

4.5 Representation of Metadata in TEI
In contrast to the customizations needed for the markup of the primary discourse data, we did not modify the existing TEI metadata model. All metadata provided in the original version of the corpus (which was partially given as part of the ChatXML structure, partially as textual descriptions provided in the corpus-external documentation of the corpus data) could be re-modelled using their TEI equivalents within the teiHeader. Special attention was paid to the modeling of a text classification scheme which is associated with the corpus documents by means of the TEI's generic textClass/catRef mechanism. This model can be easily extended to a broader range of text and/or discourse properties to account for more detailed classifications, such as the one proposed by Herring (2007) – work that hasn’t been done within the project but which is a goal for a future extension of the schema.

4.6 Legal Opinion on Republishing the Resource in CLARIN-D – and Consequences (Anonymization)
Prior to the integration of the curated resource in CLARIN infrastructures, we sought a legal opinion to get a better picture of the legal conditions for republishing the material as a whole or in parts. The legal opinion which was provided by iRights.Law/John H. Weitzmann (iRights.Law, 2016) carefully checked possible restrictions arising from individual property rights, copyrights and other legal statutes. One result was that the possibility to identify individuals from their utterances (with the exception of public figures) needed to be circumvented by means of an anonymization of names, nicknames, host names and IP addresses, geographical names (e.g. address data) etc. In addition, it turned out that some (minor) parts of the resource must not be made available to the public at all, notably those parts where personality rights of participants are strongly affected. This applies to a subcorpus obtained from chat-based psycho-social counseling (a subcorpus which hadn't been made available to the public even in the original version of the corpus). For this subcorpus, due to the personal context represented in the discourse, anonymization alone is unlikely to prevent the identification of individuals. Consequently, these resources (8 logfiles containing 88227 tokens) were removed from the final corpus.
The legal opinion saw no indication of concerns regarding copyright (German “Urheberrecht”, specifically) as it acknowledges that the collected logfiles as well as the individual user posts in the overwhelming majority of cases do not represent works of art. Protectable under EU (and German) law, however, is the work committed in the course of collection, curation and transformation of the data into the format of the intended linguistic database. Therefore and in accordance with our goal to provide the resource as openly as possible, we followed the lawyers’ suggestion to provide the resource with a Creative Commons licence (CC BY 4.0) which allows for the protection of database creator rights.
The task of anonymization could not be done completely automatically: In a first step, names that had already been annotated in the original resource could be replaced by categorized placeholders automatically. Likewise, the metadata section and the filenames were anonymized, including names and properties of participants, and the names of chat platforms. What had to be done manually was to replace all those occurrences of names that had not been annotated in the source, or that could not be matched to entries in the participant list automatically (e.g., because chatters were addressing each other using nicknames of nicknames or referring to people who were not participating in the chat themselves). This was a very time-consuming process which at the current state could only be done for the 4339-token gold standard subset of the corpus. The anonymization of the rest of the corpus is part of a follow-up work package to be finished at the end of 2016.

5. Availability
All work packages described in Sect. 4.1–4.5 have been finished. By October 2016, a first release of the resource will present a preview in form of the 4339-token gold standard. It is planned to make the full resource available in a 2nd release in early 2017.
The corpus will be ingested into the CLARIN repositories at the IDS⁸ and the BBAW⁹. At IDS, the resource will become part of the German Reference Corpus archive DEREKO and as such will be integrated in the corpus query platform COSMAS II¹⁰. At BBAW, the corpus will be integrated in the corpus query platform DWDS¹¹. In addition, the corpus will be made accessible through CLARIN’s federated content search, e.g. for NLP toolchains such as WebLicht¹².

6. Features of the Integrated Resource
Compared with the original version of the resource, the CLARIN-integrated version (‘Chat Corpus 2.0’, cf. Fig. 1) will allow for advanced queries using the additional linguistic annotations (sentences, tokens, PoS, lemmas). Due to the remodeling of the resource in TEI and the compatibility of the PoS annotations with STTS the corpus will be interoperable with other TEI-/STTS-annotated language resources. The integration into the CLARIN-D corpus infrastructures at BBAW and IDS will facilitate the comparative analysis of the chat corpus with the BBAW and IDS text and speech corpora. These features will not only increase the value of the resource for language-centered CMC research and variational linguistics but also the possibilities to use it in language teaching and higher education.

7. Outlook
According to goal (2) (cf. Sect. 2), the resources and practices developed in the project were meant to function as general approaches to open issues in representing and annotating CMC and social media data which should have the potential to be useful also for other projects in the field. To assess empirically whether the current versions of the resources (the TEI schema, the ‘STTS 2.0’ tagset and annotation guidelines) already have this potential, it is necessary to adopt and test these resources in other CMC corpus projects. In our own work, we tested them not only with chat data but also with a selection of data from other genres (experimental corpus, cf. Sect. 4.1). We’re optimistic that the availability of the resources will facilitate corpus annotation for colleagues who are building similar corpora and who are aiming to represent them on the basis of existing standards, just as we did when adopting the encoding framework of the TEI and the STTS tagset for German for our purpose. We’re aware of the fact that no existing schema – not even an established standard such as TEI-P5 – can usually be adopted for a new project to 100%; instead, each project typically needs its own customizations and extensions when adopting an existing solution. Nevertheless, customizing and extending a given solution is usually much easier than having to start to design a solution from scratch. Especially the TEI schema for CMC is open for further changes according to experiences and results from other projects. It will be the basis for further discussions in the TEI-SIG “computer-mediated communication” which is open to everybody who is interested in bringing in their own experiences and suggestions.
In our own work, we are planning to adopt the resources and practices from the project for the integration of further CMC and social media resources into the CLARIN-D corpus infrastructures at the IDS and the BBAW (starting as of autumn 2016). The TEI schema, in addition, is currently being used and tested also in projects in which none of the authors of this paper is involved – e.g., in a weblog corpus project at the University of Gießen, Germany (‘Discourse-structured Blog Corpus for German’, Karlova-Bourbonus et al., 2016) and for the annotation of an English Q&A corpus at the University of California, Davis, USA (Rachael Duke, Raul Aranovich).
The ‘STTS 2.0’ tagset for PoS tagging CMC and social media data has been used for the EmpiriST2015 shared task in which several NLP systems have been adapted for the automatic annotation of German CMC. These systems will allow corpus projects to achieve better results in tagging their data than with standard NLP tools which have typically been trained only on ‘standard’ genres (newspaper corpora etc.).
The results and recommendations of the legal opinion will be a useful point of reference for further inquiries into the (still difficult) legal conditions of collecting and republishing discourse from CMC and social media sources as parts of linguistic research infrastructures.

¹ http://clarin-d.de
² https://www.clarin.eu
³ http://www.chatkorpus.tu-dortmund.de
⁴ http://www.whatsup-deutschland.de
⁵ http://www.schreibgebrauch.de
⁶ http://sites.google.com/site/empirist2015/
⁷ http://wiki.tei-c.org/index.php?title=SIG:CMC/clarindschema
⁸ https://repos.ids-mannheim.de/
⁹ http://clarin.bbaw.de/en/repo/
¹⁰ http://cosmas2.ids-mannheim.de/
¹¹ http://www.dwds.de/
¹² https://weblicht.sfs.uni-tuebingen.de/weblicht/

8. References
Bartz, T., Beißwenger, M., Storrer, A. (2014). Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge. Journal for Language Technology and Computational Linguistics 28 (1), pp. 157–198. http://www.jlcl.org/2013_Heft1/7Bartz.pdf
Beißwenger, M. (2013). Das Dortmunder Chat-Korpus. Zeitschrift für germanistische Linguistik 41 (1), pp. 161–164. http://www.linse.uni-due.de/tl_files/PDFs/Publikationen-Rezensionen/Chatkorpus_Beisswenger_2013.pdf
Beißwenger, M., Bartsch, S., Evert, S., Würzner, K.-M. (2016). EmpiriST 2015: A Shared Task on Automatic Linguistic Annotation of Computer-Mediated Communication, Social Media and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task. Stroudsburg: Association for Computational Linguistics (ACL Anthology W16-26), 44–56. http://aclweb.org/anthology/W16-26
Beißwenger, M., Bartz, T., Storrer, A., Westpfahl, S. (2015). Tagset und Richtlinie für das Part-of-Speech-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline document from the EmpiriST2015 shared task. http://sites.google.com/site/empirist2015/home/annotation-guidelines
Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L., Storrer, A. (2012). A TEI Schema for the Representation of Computer-mediated Communication. Journal of the Text Encoding Initiative (jTEI) 3. http://jtei.revues.org/476 (DOI: 10.4000/jtei.476).
Chanier, T., Poudat, C., Sagot, B., Antoniadis, G., Wigham, C., Hriba, L., Longhi, J., Seddah, D. (2014). The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. In: Journal of language Technology and Computational Linguistics (JLCL) 29 (2), pp. 1–30. http://www.jlcl.org/2014_Heft2/1Chanier-et-al.pdf
Herring, S.C. (2007). A Faceted Classification Scheme for Computer-Mediated Discourse. Language@Internet 4 (1). http://www.languageatinternet.org/articles/2007/761
Horbach, A., Steffen, D., Thater, S., Pinkal, M. (2014). Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication. In Proceedings of KONVENS 2014, pp. 171–177.
iRights.Law Rechtsanwälte (2016). Rechtsgutachten zur Integration mehrerer Text-Korpora in die CLARIN-D-Infrastrukturen. (Legal opinion for the ChatCorpus2CLARIN project, 46 pages)
Karlova-Bourbonus, N., Grumt Suárez, H., Lobin, H. (2016). Compilation and Annotation of the Discourse-structured Blog Corpus for German. In Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, University of Ljubljana [this volume].
Schiller, A., Teufel, S., Stöckert, C. (1999). Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). University of Stuttgart: Institut für maschinelle Sprachverarbeitung.
Schmidt, T. (2012). EXMARaLDA and the FOLK tools – two toolsets for transcribing and annotating spoken language. In Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC’12). http://www.lrec-conf.org/proceedings/lrec2012/pdf/529_Paper.pdf
Schröck, J., Lüngen, H. (2015). Building and Annotating a Corpus of German-Language Newsgroups. In Proceedings of the 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media (NLP4CMC2015), pages 17–22. https://sites.google.com/site/nlp4cmc2015/proceedings.
[TEI P5] TEI Consortium (eds) (2007). TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Guidelines/P5/
Westpfahl, S. (2014). STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data. In Proceedings of LAW VIII – The 8th Linguistic Annotation Workshop. Association for Computational Linguistics (ACL Anthology W14-49), 1–10. http://www.aclweb.org/anthology/W14-4901

Grammatical Frequencies and Gender in Nordic Twitter Englishes

Steven Coats

University of Oulu, Finland
English Philology, Faculty of Humanities, 90014 University of Oulu, Finland
Email: steven.coats@oulu.fi

Abstract
English is increasingly used for online communication in many contexts in which it is not the primary local language, particularly on social media platforms with global extent such as Twitter.
The grammatical properties of online and Twitter Englishes, however, have mainly been considered in L1 contexts, as have correlations between gender and some grammatical features. In this study, the correlation of grammatical types (parts of speech) and gender is examined for English-language Twitter messages originating from the Nordic countries. A corpus of geo-located English-language Twitter messages was created by accessing the Twitter Streaming API. After disambiguating author gender and applying part-of-speech tags, the relative frequencies of grammatical items were determined and those with significant gender divergence identified. Principal components analysis shows some gender-based separation of discourse in the Nordic countries in terms of grammatical features. The analysis supports previous findings pertaining to gendered differences in English and sheds light on how English continues to evolve in online environments.

Keywords: corpus linguistics, Twitter, world Englishes, language and gender

1. Introduction and Background
Technological developments can affect the way we interact with one another, and the recent shift towards mediated, text-based communication in online environments provides opportunities for the study of English varieties in global contexts. Although the status of English as the world’s principal lingua franca continues to consolidate, its use in global computer-mediated communication (CMC), especially in non-L1 environments, exhibits a diversity of orthography, lexis, and grammar that has been characterized by Blommaert (2012) as a “supervernacular”.
CMC and social media such as Twitter have become important sites of interaction for many, and in recent years a number of studies have sought to characterize the communicative and discourse functions of Twitter language (Page 2012; Zappavigna 2011; Squires 2015 for an overview). The extensiveness of Twitter data, its public availability, and the richness of the associated metadata have allowed for geographical analyses (Leetaru et al. 2013; Mocanu et al. 2014) and dialectological and sociolinguistic analyses of English (Eisenstein et al. 2014; Bamann, Eisenstein and Schnoebelen 2014).
Some previous studies of English-language CMC and Twitter have found different rates of use of particular word classes by males and females. For example, it has been found that females use more personal pronouns, more modal verbs, and more emoticons, while males use more determiners such as articles or demonstrative pronouns and more numbers or numerals (Baron 2004; Herring and Paolillo 2006; Argamon et al. 2007; Bamann, Eisenstein and Schnoebelen 2014). For the most part, however, analysis of Twitter English has been conducted on data without consideration of its geographical provenance, or on data gathered from Anglophone national contexts, mostly in the United States.
Knowledge of English is extensive in the Nordic countries of Iceland, Norway, Denmark, Sweden, and Finland, countries with well-developed economies and high levels of educational attainment, to such an extent that it has been suggested that their national languages are becoming linguistic systems with “restricted functional range” (Görlach 2002: 16). Although research has addressed language use on Twitter by country (e.g. Mocanu et al. 2013), and work exists on grammatical feature frequencies in Nordic non-CMC genres (e.g. for Swedish in Allwood 1998), studies of feature frequencies in English from non-L1 environments have been few, and the relationship between author gender and feature frequency in CMC language has not yet been investigated in detail in Nordic contexts, whether in local languages or English.¹
In this study an approach based in part on multidimensional analysis (Biber 1988; 1995) is taken. After establishing the extent to which English is used on Twitter in the Nordic national contexts, relative grammatical feature frequencies are calculated and the features most strongly associated with gender identified. With a principal components analysis, the underlying association between feature frequencies and gender is established.

2. Data Collection and Processing
Data was collected in .json format from Twitter’s Streaming API during May 2016 by utilizing a scripting library in Python.² The raw .json data was filtered for the tweet text (the “status update”) and the metadata fields author_name, screen_name, time, id, lang (language), country, and the latitude and longitude coordinates.

2.1. Geolocation
The collection script selected only tweets with a populated place object that originated within a bounding box circumscribing the territorial boundaries of the Nordic countries (longitude -26 to 32, latitude 53 to 72; see Figure 1).

Figure 1: Area from within which tweets with geographical coordinates were collected from the API.

Each tweet was assigned exact latitude/longitude coordinates.³ From the 2.155 million tweets collected by the script, 302,737 were retained to create subcorpora from the Nordic countries of Iceland, Norway, Denmark, Sweden and Finland, based on the country values within the place field. For further analysis, two subcorpora were prepared for each country by filtering the data according to the lang field in the tweet object: one consisting of tweets in the principal national language, and one of tweets in English.⁴ Tweets originating from outside the Nordic countries and in other languages were not further considered. The English data comprised 101,956 tweets and 1,475,553 tokens.

2.2. Gender Disambiguation
Unlike some social media platforms, Twitter does not provide a profile entry where gender is to be identified nor require users to otherwise supply gender information. Therefore, gender was disambiguated for tweets based on gender-name associations (Rao et al. 2010; Mislove et al. 2011).⁵ Lists of the most frequent given names in the Nordic countries were obtained from the corresponding national statistical offices. The author_name field for each user was then filtered for strings that either begin with or include as a discrete element the most common male and female given names in the corresponding Nordic country. Users matching both male and female names were discarded. The method assigned gender to 39% of Iceland, 50% of Norway, 61% of Denmark, 47% of Sweden, and 62% of Finland tweets.⁶

2.3. Tokenization and Part-of-Speech Tagging
The Carnegie-Mellon University Twitter Tagger (Gimpel et al. 2011; Owoputi et al. 2013) was used to tokenize the subcorpora and apply part-of-speech tags using a subset of the Penn Treebank tagset (Marcus, Santorini and Marcinkiewicz 1993) and additional tags for the Twitter-specific features username, hashtag, and retweet. The tool is somewhat tolerant of the non-standard orthography typical of Twitter messages.

3. Analysis and Discussion
The linguistic profiles of the subcorpora were determined and the relationship between gender and individual grammatical features assessed using t-tests. Principal components analysis was used to gauge the extent to which males and females utilize different communicative styles in English on Twitter.

3.1. Language Profile
English is extensively used in Twitter user messages originating from the Nordic countries (Table 1).⁷
In Iceland, Norway and Denmark, males use the national language on Twitter more than do females; females use English more. This difference is most pronounced for Denmark. In Sweden and Finland the rates of language use by gender are similar, with males using slightly more English and females the national languages.

¹ For an analysis of feature frequencies in English as it is used in various Asian contexts see Xiao (2009). Baron (2004) analyses a small corpus of Instant Messenger data in English from American and Swedish university students.
² The Tweepy library (Roesslein 2015) was used (https://github.com/tweepy/tweepy).
³ Most Twitter users select a place when registering with the service; the coordinates of the place are then automatically assigned by Twitter as a lat-long bounding box in tweet metadata. Some users additionally opt to broadcast precise GPS coordinates with each status update. For tweets without precise geographical coordinates, location was induced by calculating the center of the bounding box circumscribing the place field. Correlation of the precise GPS coordinates and the induced coordinates based on centering the place entity was 0.993, as the place entity is almost always populated by a bounding box circumscribing a small area such as a city. See also Leetaru et al. (2013).
⁴ For Finland, corpora were also created for the country’s second official language, Swedish.
⁵ Latent attribute inference using Twitter data manually tagged for gender is a popular topic in machine learning (Pennacchiotti and Popescu 2011; Ciot, Sonderegger and Ruths 2013) – the approach used here relies on the association between given name and author gender rather than using machine learning to infer gender based on the content of messages whose authors’ gender has been manually tagged, but both approaches can be used to investigate links between language use and gender.
⁶ The differences are due in part to the somewhat different name frequency information obtained from the national statistical offices. For example, only 395 unique given names were obtained from Iceland, but 1190 from Norway, 5382 from Denmark, 1704 from Sweden, and 7899 from Finland.
⁷ The Twitter automatic language detection algorithm classifies both Riksmål and Nynorsk with the language code no, “Norwegian”. For Finland, the percentage shown includes messages in the national languages of Finnish and Swedish.

3.2. Correlation of Grammatical Features, Country and Gender
34 of the PoS tags were applied at least once in all of the ten gendered subcorpora. For each subcorpus, the rela-

in English on Twitter are personal pronouns, compared to 4.28% by Swedish males.

[Figure: histograms of the proportion of all users by percentage of tokens, females vs. males. Panel "Percent period": mean 5.4 (female) vs. 6.95 (male), median 3.07 vs. 5.71, std. dev. 7.35 vs. 7.71, t-test p-value = 0, Cohen's d = 0.2. Panel "Percent determiner": mean 4.4 vs. 4.94, median 3.45 vs. 4.35, std. dev. 5.18 vs. 5.32, t-test p-value = 0.002, Cohen's d = 0.1. Panel "Percent preposition": mean 6.61 vs. 7.17, median 6.25 vs. 6.93, std. dev. 6.17 vs. 6.35, t-test p-value = 0.007, Cohen's d = 0.09.]

                    Nat. lang.   English   Other
Iceland   males        80.8         9.8      9.4
          females      71.5        17.6     10.9
Norway    males        46.6        28.9     24.5
          females      37.3        40.0     22.7
Denmark   males        45.4        40.0     14.6
          females      25.7        52.5     21.8
Sweden    males        61.9        24.5     13.6
          females      63.8        23.8     12.4
Finland   males        57.2        28.8     14.0
          females      58.5        25.0     16.5
Table 1:
Percent tweets by country, gender and language. m m m m m female male female male m fffffff m female male f f Mean = 12.32 13.78 Mean = 5.83 4.28 f m fm Mean = 3.81 3.25 0.10 0.10 f 0.10 m Median = 6.67 8 Median = 3.85 2.05 f m f f f m Median = 0 0 m Std. dev = 16.51 16.61 Std. dev = 7.52 5.77 f Std. dev = 6.02 5.09 m m m m m m m m m m m m m m m m m m m m m m m m m m f f m t−test p−value = 0.009 m m m m f f m t−test p−value = 0 f m m t−test p−value = 0.004 Cohen's d = 0.09 m m m f m m m Cohen's d = 0.24 ff m m Cohen's d = 0.1 ortion of all users ortion of all users m m f mm fffffffffffffffffffm ortion of all users f m f m m f m f mm f m fm f m m f m f m f m f fm m m m m m m ffffmm 0.05 f tive frequency of each tag was calculated. To determine 0.05 0.05 fff mmm ff m Prop fff Prop ffm f m f mf f m f m f m m Prop f m f m f m f m f m f m f m mf m f mf m f m fffffffffffffffffffffffffffffffffffffffffff m f f m f m f m f m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m mf mf mf m m f f m m f m m f f f m m f f m m m f f m m f f m m f m f f m m f f m m m f f m m f f m m f m m f f m m f f m m m f f m m f f m f m m m f f m m f f m m f f m m f f m m f m m f f m m f f m m m f f m f m m f f m m f f m m m m m m m m f m m f m m f m f m f m f m f m f m f m f whether features were preferred by males or females, a t- ffffffff m f m f m f m f m f m f m m m m f m f m m f m f m f m f m m f m f m m f m f m m f m f m m f m f m m f m f m m f m f m m m f f m f m m m f m f m m m f m f m f m m m f m f m f m m m f m m f m m m f m m m f m m m f mf m f mf m f mf m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f mf m f mf m f mf mmm 0.00 0.00 0.00 test of population means was conducted on the basis of the 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 Percent proper noun Percent personal pronoun Percent adverb mean standardized value for males and for females in all f m ff subcorpora. 
Of the 34 features, ten exhibited significant m 0.15 females 0.15 females f 0.15 females m m m m m f f m m m males males f m males m m f f m m fm f (p < 0.05) differences in use between males and females: m m f m m m female male female male fm f m female male m Mean = 7.16 4.84 Mean = 6.97 8.24 f f Mean = 2.6 1.92 0.10 m m m 0.10 0.10 f Median = 1.65 0 Median = 1.13 3.85 fm f m Median = 0 0 Sentence-ending punctuation, numbers or numerals, proper m m Std. dev = 12.72 9.7 Std. dev = 11.15 12.31 f Std. dev = 4.68 3.69 m fm mf m f m t−test p−value = 0 t−test p−value = 0.001 m f m t−test p−value = 0 m m f m m f fffffffffffffffffffff m f f Cohen's d = 0.21 Cohen's d = 0.11 m f Cohen's d = 0.17 f m mf m f m ortion of all users f m ffffffffffffffffffff f m m ortion of all users ff m m ortion of all users m f m f m f nouns, and gerund or present participle forms were more ff m m m m m m m m m m m m m m m m m m m m m mf m f f m f m m m f f ffmm m m f 0.05 mf m f m f 0.05 fm m m 0.05 f m m f Prop m f m f m fm m m f m f Prop m m f Prop m f m f m m f f m f m m m f m m f m m f f f m m m f f m m m f f f m m m f f m mf f m m m f mf f m m m f f m f m f f m m m f f m m f m f f m m m m f ffff mmmmmm f m f m f m f frequently utilized by males, while personal pronouns, pos- m f m f m f m mf mf f m m mf f m f mf f m mf f m f mf f m m m f f m f m f f m m m f f m f m f f m m m f f m f m f f m m m ff f f m f f m m f f m m m ff f m m f m f f m m m ff f m m f m f f m m ff f m m m f f m m f m ff f m m f f m m m ff f m m f m f f m m f f m m m ff f m m f f f m f f m m f f m m m f f m m f f m m f m f f m m f f m m m f f m m f f m f m m f m f f m m f m f m m m m m m m m m m m mf mf m f m m f m m f m fff mf m f m mf m m m m m f m m f m mf m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m mf m f m f m f m f m f m f m f m f m f m f m f m f m f m f m f m m m m 0.00 0.00 0.00 sessive pronouns, adverbs, interjections, usernames, and 0 5 10 15 20 
0 5 10 15 20 0 5 10 15 20 Percent interjection Percent username Percent verb, non−3rd person singular present past particples were more likely to be used by females (Ta- ble 2). Figure 2: Percent of all tokens by feature for features that differ significantly by gender from Sweden. Feature Gender p-value Signif. 1 Quotation marks (”) m 0.320 2 Left bracket (() m 0.080 3.3. Principal Components Analysis 3 Right bracket ()) m 0.089 4 Comma m 0.098 In order to explore underlying patterning of the variance in 5 Period (. ? !) m 0.010 * 6 Other punctuation (: ; ... + - = <> [ ]) m 0.245 the data, a principal components analysis was conducted on 7 Coordinating conjunction f 0.269 a covariance matrix of the normalized frequencies of the 34 8 Number m 0.040 * 9 Determiner m 0.416 variables for the ten English subcorpora (a male and a fe- 10 Hashtag f 0.758 male subcorpus for each of the five Nordic countries). The 11 Preposition or subordinating conjunction m 0.502 12 Adjective m 0.405 first two components capture 70.8% of the variance in the 13 Comparative adjective f 0.848 data. The strongest loadings (≥ |0.2|) on the first two com- 14 Superlative adjective f 0.213 15 Modal verb f 0.695 ponents are shown in Table 3. 16 Noun, singular or mass m 0.275 17 Proper noun m 0.014 * Feature PC1 PC2 18 Plural noun m 0.596 19 Personal pronoun f 0.005 * Personal pronoun 0.60 -0.21 20 Possessive pronoun f 0.005 * Interjection 0.31 0.34 21 Adverb f 0.036 * 22 Phrasal particle m 0.449 Verb, non-3rd person singular present 0.21 23 to f 0.596 Period (. ? !) 
-0.28 0.28 24 Interjection f 0.018 * Noun, singular or mass -0.25 -0.51 25 Username (preceded by ) m 0.168 26 Verb, base form f 0.007 * Proper noun -0.45 27 Verb, past tense f 0.441 Comma 0.38 28 Verb, gerund or present particle f 0.866 Number 0.34 29 Verb, past participle m 0.022 * 30 Verb, non-3rd person singular present f 0.001 * Username 0.23 31 Verb, 3rd person singular present f 0.292 32 Wh-determiner m 0.094 33 Wh-pronoun f 0.934 Table 3: Loadings ≥ |0.2| on first two principal compo- 34 Wh-adverb f 0.106 nents Table 2: Grammatical features by gender For the features with the strongest loadings on the first principal component, grammatical types with interpersonal Gendered differences were also considered by country and interaction and stance orientation functions (personal pro- feature. For Sweden, for example, the distribution of those nouns, 1st- and 2nd-person singular present verb forms, and features for which a significant difference by gender was interjections8) have the strongest positive loadings, while detected is depicted in Figure 3. The differences between males and females are not large (Cohen’s d ≤ 0.24) , but 8The Carnegie-Mellon Twitter tagger also assigns the interjec- statistically significant according to a t-test of population tion tag to emoticons, word types that are often associated with means: E.g. 5.83% of all words used by Swedish females the expression of emotional affect (Vandergriff 2013). Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 14 Ljubljana, Slovenia, 27–28 September 2016 features with informational and text-organizational func- personal pronouns, possessive pronouns or affect markers tions (nouns, proper nouns, and sentence-ending punctu- more than males, whereas males use features such as de- ation) have the strongest negative loadings. terminers, numbers/numerals, and nouns more than do fe- males (Bamann, Eisenstein and Schnoebelen 2014). 
This patterning holds true for English used on Twitter in the PCA of Gendered Subcorpora, Components 1 and 2 Nordic countries by persons with common Nordic names, 2.0 many of whom are likely non-L1 English users. no.m Multidimensional approaches based on factor analysis or 1.5 principal components analysis have shown that differences in aggregate grammatical feature frequencies for national 1.0 varieties of English can be interpreted in terms of commu- nicative or discourse-functional dimensions (Biber 1988; no.f 0.5 1995; Xiao 2009). In this study, Nordic Twitter data that ariance = 11.91 % sv.m da.m sv.f have been induced to reflect author gender exhibit differen- 0.0 tiation by gender along a first principal component, explain- rtion of Vo fi.m fi.f ing the majority of variance in the data (58.9%). The load- ings on this component correspond to grammatical features −0.5 da.f whose discourse or communicative functions may contrast PC2, Prop interactive stance orientation and affective content with in- −1.0 formational and discourse organization functions – a find- is.m is.f ing comparable to the proposed “involved versus informa- −1.5 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 tional production” dimension found by Biber (1988: 107). PC1, Proportion of Variance = 58.92 % Most work on differences in feature frequencies by gen- der has been conducted on L1 English data, but there is Figure 3: Loadings on components 1 and 2 of PCA for En- some evidence for differential use of word classes by gen- glish subcorpora. der in other languages.9 This study shows that similar dif- ferences exist for (presumable) non-L1 English users on Twitter. It has been suggested that the small differences in The positions of the gendered subcorpora along the first two aggregate grammatical feature frequencies between males principal components are shown in Figure 4. 
The analysis and females may reflect different orientations towards the suggests some functional separation between males and fe- use of communicative or discourse functions for the nego- males in Nordic Twitter Englishes as they are manifest in tiation of affect maintenance or solidarity (Holmes 1998). terms of grammatical feature frequencies: The male cor- Exploratory data analysis suggests that for Nordic Twit- pora all have negative values in the first principal compo- ter corpora with induced author gender, functional sepa- nent, while the female corpora have positive values. Gen- ration of English-language feature frequencies by gender der separation along the second principal component is also can be observed. A tentative confirmation of some of the manifest, although not as pronounced. In terms of the in- trends observed in CMC and Twitter data from L1 Anglo- dividual Nordic countries, the distance between males and phone contexts raises interesting questions as to the possi- females is larger for Iceland and Norway, while it is some- ble causes. Future work could further investigate this topic what smaller for Denmark, Sweden and Finland. by exploring the extent to which gender differentiation is present in Twitter material in the Nordic languages, and 4. Conclusion and Summary whether language transfer phenomena may influence the large-scale patterning of linguistic elements in non-L1 on- Geographically specified and gender-induced corpora of line Englishes. online Englishes complied from social media sites such as Twitter shed light on the ways in which English contin- 5. References ues to develop and diversify globally, especially in contexts where it has not traditionally been a language of daily com- Allwood, J. (1998). Some frequency based differences munication. The results of this study bear upon research between spoken and written Swedish. 
In Proceedings into online English varieties and the relationship between from the XVI:th Scandinavian Conference of Linguistics, language and gender. Turku, Finland. Department of Linguistics, University of While it is not surprising that English is extensively used Turku. on a global internet platform such as Twitter, the present Argamon, S., Koppel, M., Pennebaker, J., and Schler, J. research confirms high rates of use of English on Twitter in (2007). Mining the blogosphere: Age, gender, and the the Nordic countries (cf. Mocanu et al. 2013). Overall, per- varieties of self-expression. First Monday, 12(9). sons in Denmark and Norway send more tweets in English, Bamann, D., Eisenstein, J., and Schnoebelen, T. (2014). and females more than males. In the present work, gender analysis reinforces findings 9For French, see Schenk-van Witsen (1981). For French, Turk- from previous corpus studies and research into L1 Twit- ish, Indonesian and Japanese, see Ciot, Sonderegger and Ruths ter or CMC English: Females tend to use features such as (2013). Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 15 Ljubljana, Slovenia, 27–28 September 2016 Gender Identity and Lexical Variation in Social Media. clusters. In Proceedings of NAACL-HLT, pages 380– Journal of Sociolinguistics, 18(2):135–160. 390. NAACL-HLT. Baron, N. S. (2004). See you online: Gender issues in col- Page, R. (2012). The linguistics of self-branding and lege student use of instant messaging. Journal of Lan- micro-celebrity in Twitter: The role of hashtags. Dis- guage and Social Psychology, 23(4):397–423. course & Communication, 6(2):181–201. Biber, D. (1988). Variation Across Speech and Writing. Pennacchiotti, M. and Popescu, A.-M. (2011). A machine Cambridge University Press, Cambridge, UK. learning approach to Twitter user classification. In Pro- Biber, D. (1995). 
Dimensions of register variation: ceedings of the Fifth International AAAI Conference on A cross-linguistic comparison. Cambridge University Weblogs and Social Media, pages 281–288, Menlo Park, Press, Cambridge, UK. CA. Association for the Advancement of Artificial Intel- Blommaert, J. (2012). Supervernaculars and their dialects. ligence. Dutch Journal of Applied Linguistics, 1(1):1–14. Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gen- (2010). Classifying Latent User Attributes in Twitter. der inference of Twitter users in non-English contexts. In Proceedings of the 2nd International Workshop on In Proceedings of the 2013 Conference on Empirical Search and Mining User-Generated Contents, pages 37 Methods in Natural Language Processing, pages 1136– – 44. ACM. 1145, Stroudsburg, PA. Association for Computational Roesslein, J. (2015). Tweepy. Python programming lan- Linguistics. guage module. Eisenstein, J., O’Connor, B., Smith, N. A., and Xing, E. P. Schenk-van Witsen, R. (1981). Les différences sexuelles (2014). Diffusion of Lexical Change in Social Media. dans le franc¸ais parlé: Une étude-pilote des différences PLoS ONE, 9(1). lexicales entre hommes et femmes. Langage et Societé, Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, 17(1):59–78. D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, Squires, L. (2015). Twitter: Design, discourse, and im- J., and Smith, N. A. (2011). Part-of-speech tagging for plications of public text. In Alexandra Georgakopoulou Twitter: Annotation, features, and experiments. In Pro- et al., editors, The Routledge Handbook of Language ceedings of the 49th Annual Meeting of the Association and Digital Communication, pages 239–256. Routledge, for Computational Linguistics: Human Language Tech- London and New York. nologies, pages 42–47, Stroudsburg, PA. Association for Vandergriff, I. (2013). Emotive communication online: A Computational Linguistics. 
contextual analysis of computer-mediated communica- Görlach, M. (1995). Still More Englishes. John Ben- tion (CMC) cues. Journal of Pragmatics, 51:1–12. jamins, Amsterdam. Xiao, R. (2009). Multidimensional analysis and the study Gustafson-Capková, S. and Hartmann, B. (2008). Manual of world Englishes. World Englishes, 28(4):421–450. of the Stockholm Ume˚a Corpus version 2.0. Stockholm Zappavigna, M. (2011). Ambient affiliation: A linguis- University. tic perspective on Twitter. New Media and Society, Herring, S. and Paolillo, J. (2006). Gender and genre varia- 13(5):788–806. tion in weblogs. Journal of Sociolinguistics, 10(4):439– 459. Holmes, J. (1998). Women’s talk: The question of soci- olinguistic universals. Australian Journal of Communi- cations, 20:125–149. Leetaru, K. H., Wang, S., Cao, G., Padmanabhan, A., and Shook, E. (2013). Mapping the global Twitter heartbeat: The geography of Twitter. First Monday, 18(5/6). Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19(2):313– 330. Mislove, A., Ahn, Y.-Y., Onnela, J.-P., and Rosenquist, J. N. (2011). Understanding the demographics of Twit- ter users. In Proceedings of ICWSM, pages 554–557, Menlo Park, CA. Association for the Advancement of Artificial Intelligence. Mocanu, D., Baronchelli, A., Perra, N., Gonc¸alves, B., Zhang, Q., and Vespignani, A. (2013). The Twitter of babel: Mapping world languages through microblogging platforms. PLoS ONE, 8(4). Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schnei- der, N., and Smith, N. A. (2013). 
Improved part-of- speech tagging for online conversational text with word Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 16 Ljubljana, Slovenia, 27–28 September 2016 Framework for an Analysis of Slovene Regional Language Variants on Twitter Jaka Čibej Department of Translation, Faculty of Arts, University of Ljubljana Aškerčeva 2, 1000 Ljubljana E-mail: jaka.cibej@ff.uni-lj.si Abstract The rapid rise of computer-mediated communication has allowed regional language variation to flourish in written form, opening new doors both for dialectological studies as well as natural language processing. In this paper, we present the methodology and framework for a linguistic analysis of Slovene regional language variants on Twitter. We describe the creation and sampling of a dataset stratified by region, present a preliminary typology of non-standard Slovene language elements on Twitter, and propose an approach to measure regional specificity and dispersion of non-standard language elements in computer-mediated communication. Keywords: regional language variants, non-standard Slovene, Twitter, computer-mediated communication 1. Introduction 3. Dataset Preparation In the last two decades, the rapid rise of computer- The dataset presented in this paper consists of tweets mediated communication and social media has allowed extracted from the JANES corpus of Slovene user- language to spread into digital communication platforms, generated content (Fišer et al., 2015a). The tweets were lending a voice to a plethora of different languages sampled by taking into account a number of criteria. traditionally present only in spoken varieties: from First, only tweets sent from private accounts were sociolects to dialects and everything in between. In included, while tweets from corporate accounts (e.g. 
those addition, due to their ever increasing quantities, internet managed by press agencies and companies) were texts have become an important source of information, eliminated.1 This was done for two reasons: corporate and there is an increasing demand for tools and resources accounts contain many automatically generated tweets, to help process them, as shown by the proliferation of while the overwhelming majority of their original tweets different areas within internet linguistics and natural are written in standard Slovene, which makes them language processing. One of the problematic aspects to be irrelevant for our study. tackled in this regard is regional language variation. Second, the dataset only includes L3 tweets, i.e. those The main goal of this paper is to present a methodology with a high level of linguistic non-standardness (Ljubešić and framework for a linguistic analysis of regional et al., 2015). L3 tweets contain a high degree of non- language variants on Twitter. The paper is structured as standard spelling and vocabulary and as such provide the follows: first, we present a brief overview of related work, most material for the study of regional language variants. which is followed by the description of our dataset and the Third, the tweets were sampled by taking into account the sampling methods used. We then provide an overview of metadata on the users' regional origin (Čibej & Ljubešić, the preliminary typology of non-standard Slovene 2015). This metadata was determined by collecting language elements on Twitter and the measures of Slovene geotagged tweets over a period of eighth months regional specificity and dispersion to be used in further (from January 2015 to September 2015), then assigning analyses, and conclude with the preliminary results of the each user with geotagged tweets to one of 9 regions analysis of three regional samples. 
corresponding to the 7 main dialectal groups of Slovene as well as Ljubljana and Maribor, the two largest cities, 2. Related Work which we decided to treat separately as melting pot areas. Studies on regional variation of various languages on In order to exclude users with ambiguous origin, only Twitter have been conducted with different purposes, users that sent more than 90% of their tweets from a single mostly as part of development of NLP tools, e.g. diacritic region were taken into account. A certain amount of noise restoration (Harrat et al., 2013) and POS-tagging is to be expected in the dataset despite this criterion, but (Bernhard & Ligozat, 2013), but also within sociological should not prove too prominent and will be further studies of language variation (Jørgensen et al., 2015; penalised during the analysis (see Section 6). Eisenstein, 2015). Slovene regional language variation on Twitter (and in social media in general), however, is currently still an under-researched area that cannot be neglected, especially considering the rich dialectal variation of Slovene (Ramovš, 1931), the numerous dialectological studies conducted on spoken Slovene (Kenda Jež, 2002), as well as the fact that regional variation has already been documented in Slovene tweets (Fišer et al., 2015b). 1 Twitter users included in the JANES corpus were manually annotated as corporate or private. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 17 Ljubljana, Slovenia, 27–28 September 2016 Regional Number of Number of Number of subcorpus tokens tweets users 4.1. Non-Standard Vocabulary Gorenjska 37,683 22,070 48 Non-standard vocabulary includes all lexical elements that Dolenjska 17,364 6,922 22 are considered non-standard, i.e. 
those that would not be Štajerska 41,712 9,284 42 expected in standard Slovene texts and/or are not included Panonska 5,020 2,512 14 in existing standard language resources such as Koroška 6,207 4,203 5 dictionaries or lexicons. Examples include regionally Primorska 13,917 5,748 31 specific words (e.g. particles ejga for Gorenjska, čuj for Rovtarska 4,823 2,348 7 Štajerska, nanka for Primorska), standard words with new meanings ( hudo meaning 'awesome' instead of 'bad'), and Ljubljana 92,104 43,018 116 non-standard words/phrases of foreign language origin, Maribor 4,789 4,340 14 either in their original spelling (e.g. web app) or fully/partially adapted to Slovene spelling and Table 1: Size of regional subcorpora in the JANES corpus morphology (e.g. ekskjuz, from English 'excuse'; učelini, of Internet Slovene (v0.3). from Italian 'uccellini'). A subcategory of non-standard vocabulary also included As shown in Table 1, some of the regional subcorpora are certain CMC-specific abbreviations, either English ( wtf, very small both in terms of the number of tokens as well lol, omg) or Slovene ( jbg „fuck that‟, bmk „I don‟t give a as the number of users included. However, geotagged fuck‟), and alphanumerical spellings ( ju3 for jutri, tweets are still being collected, and more users and tweets 'tomorrow'). will be added when the corpus is updated. For the purposes of this paper, we focus on three of the best 4.2. Reductions and Ellipses represented regions: Primorska, Gorenjska, and Štajerska. With 69 different tags, reductions and ellipses are by far the most prolific category. Most often, they involve vowel 3.1. Samples of Regional Subcorpora drops in different positions in a word. A common example For each region, a sample containing 500 L3 tweets was is the ellipsis of the final -i in the infinitive ( delati → created. First, all tweets were extracted from the relevant delat, 'to work') or the final -o in adverbs ( čudno → čudn, regional subcorpus. 
The tweets were then shuffled and 'weirdly, oddly'). As for consonants, a common example is sampled by user in order to avoid overrepresentation of the ellipsis of - j in the - lj- or - nj- consonant clusters very prolific Twitter users. In some cases, the most active ( peljem → pelem, 'I drive'; zadnji → zadni, 'the last'). users provided more than 2,000 tweets to a regional subcorpus, while the least active provided less than 10. 4.3. Alternative Graphemes The samples included all users from the relevant regional This category encompasses alternative, non-standard subcorpus, while the number of tweets each user spellings of graphemes, most often in cases when it is contributed was limited to a maximum of 40–50 tweets pronounced differently in spoken language. Examples (depending on the total number of users). include the spelling of g as h ( bog → boh, 'god') or v as w ( 4. Typology of Non-Standard Slovene ne vem → ne wem, 'I don‟t know'). Language Elements on Twitter 4.4. Non-Standard Morphology Small subsets of 100–150 tweets were manually analysed This category included words that exhibited non-standard in each sample in order to design a typology of non- morphological characteristics such as alternative case standard Slovene language elements on Twitter. The endings (e.g. the non-standard locative ending - i of typology was created with a bottom-up approach and so singular masculine nouns, na šihtu → na šihti, 'at work') far includes 7 main categories: non-standard vocabulary, or other regionally specific suffixes (e.g. the non-standard reductions and ellipses, non-standard morphology, second-person plural verb suffix -ste instead of -te, imate spelling variants of frequent standard words, alternative → imaste „you have‟). graphemes, frequent transformations, and miscellaneous.2 Currently, the typology consists of 105 different tags, but 4.5. 
is flexible and allows for the addition of new elements as certain rare or regionally specific elements (especially those concerning morphology and syntax) may yet arise during annotation. In the following subsections, we present the main categories in further detail.

Footnote 2: Initially, a syntactic category was included, but was later omitted as syntactic elements were much too scarce in the samples. However, potential regionally specific syntactic features encountered during the analysis will be researched on larger amounts of data in the JANES corpus of Internet Slovene.

Spelling Variants of Frequent Standard Words

The category of spelling variants includes common standard (mostly function) words with numerous spelling variants that are unequally distributed between different regions. A good example is the word toliko ('this much, so'), which can also be spelt as tok, tolk, tolko, telko, tuk, tulk, etc. Similarly, the word jaz (personal pronoun, first person singular, 'I') can also be encountered as jz, js, jst, jest, jes, etc. Although these spelling variants often also include other non-standard elements (e.g. vowel ellipses), they are also annotated as a separate category in order to produce an exhaustive list so that their regional distribution can be tested on the entire geolocated JANES subcorpus.

4.6. Frequent Transformations

This is an additional category that included spelling transformations in non-standard word spellings that were perceived during annotation as frequently occurring. Similar to frequent spellings of non-standard words, these transformations were annotated separately in order to allow for a comparison of their distributions in different regions. A prevalent example is the transformation of -aj- to -ej- (nekaj → nekej, 'something'; včeraj → včerej, 'yesterday') or -aj- to -j- (zdajle → zdjle, 'now'; kaj → kj, 'what').

4.7. Miscellaneous

The final category included miscellaneous non-standard language elements that could not be categorised in any of the previous categories. These mainly consisted of joint spellings, i.e. instances where two words should be spelt separately in standard Slovene, but are written together in their non-standard form (e.g. ne vem → nevem; 'I don't know, I dunno'), or amalgams of two adjacent words, most often function words (to je → toj, 'this is', če je → čej, 'if it is').

5. Dataset Annotation

The samples were manually annotated in .txt format to enable flexible post-processing and analysis with Python regular expressions. Relevant tokens or phrases were annotated as shown in Figure 1.

[token/phrase]{tag 1}{tag 2}{...}
[sej]{V.saj}{Taj.ej}

Figure 1: Annotations.

The upper line shows the general pattern of annotation. A single token or phrase may be annotated with multiple tags. The bottom line shows an example of the annotated word sej ('because'), annotated both as a spelling variant of saj (V.saj) and as a frequent spelling transformation of -aj- to -ej- (Taj.ej).

Several language elements were excluded from annotation. These included a number of CMC-specific elements (emoticons and emojis, hashtags or URLs), spelling mistakes that were perceived as obviously accidental, as well as the non-use of diacritics, which is often a consequence of technical limitations and rarely voluntary. Code-switching, although relatively common, was also omitted. If foreign language words or phrases were used as part of a Slovene sentence, they were annotated as non-standard vocabulary. Entire sentences or independent units in a foreign language, however, were disregarded. The same was true of non-standard variants of proper nouns (e.g. phoneticised versions of Twitter and Facebook – Tviter, Fejsbuk).

6. Measures of Regional Specificity and Dispersion

In addition to statistical tests, we also propose a method to determine the level of regional specificity of a certain language element based on a number of criteria described in the following subsections. These measures should also help to reduce the effect of potential noise in the dataset (e.g. users that are originally from one region and have permanently moved to a different one, but continue to use language elements typical of their region of origin).

6.1. Relative Frequency

Relative frequency (fR) is the ratio of the frequency of a language element and the total number of occurrences in its category. The greater the relative frequency, the more frequent the language element within the region.

6.2. User Ratio

The user ratio (u) is the ratio of the number of users using a language element and the number of all users from the region in question. The greater the number of users that use a language element, the greater the user ratio. This value thus measures how widespread the element is among the users of the region. It penalises idiosyncratic elements (especially with prolific users) or elements used by users that have been misclassified as pertaining to a specific region.

6.3. Type/Token Ratio

The type/token ratio (t) is the ratio of the number of types and the number of tokens used with a language element. The greater the t-ratio, the greater the number of words it occurs with, and the greater the likelihood that the element will arise in text. This value penalises frequent language elements that only occur in a limited number of words.

6.4. Annotation Ratio

Similar to the type/token ratio, the annotation ratio (a) is the ratio of the number of different tags the element occurs with and the number of all tags in its category. The greater the annotation ratio, the greater the number of tags it occurs with and the greater the likelihood it occurs.

6.5. Coefficient of Regional Dispersion

The coefficient of regional dispersion (δR) is meant as a simple summarisation of all other measures of regional specificity and dispersion. It is calculated as follows:

δR = fR × u × t × a × 100

The greater the coefficient of regional dispersion, the more widespread and frequent the element in question.

7. Annotation Results

In this section, we provide some of the preliminary results of the annotated dataset and demonstrate the use of the abovementioned fR, u, t, a and δR values to measure regional specificity and dispersion for a particular language element.

The annotation results for the Gorenjska, Štajerska and Primorska regions are shown in Table 2. As all samples are of comparable size (they consist of 500 tweets each), the absolute frequencies are given.

| Category | Primorska | Gorenjska | Štajerska |
|---|---|---|---|
| Non-standard vocabulary | 394 | 347 | 371 |
| Spelling variants of frequent standard words | 233 | 322 | 183 |
| Alternative graphemes | 40 | 54 | 34 |
| Reductions and ellipses | 588 | 1122 | 648 |
| Non-standard morphology | 90 | 99 | 67 |
| Frequent transformations | 120 | 181 | 68 |
| Miscellaneous | 39 | 59 | 24 |
| Total | 1504 | 2184 | 1395 |

Table 2: Quantitative Analysis of Annotated Samples.

The regions do not differ to a great extent in terms of the frequency of non-standard vocabulary, although we expect that a detailed qualitative analysis will show differences in the type of non-standard words used (e.g. we expect to find more words originating from Italian in the Primorska region, which lies next to the border with Italy).

As far as the frequencies of other categories are concerned, the differences between the three regions are more pronounced. What is particularly interesting to note is that while reductions and ellipses are the most prolific category in all three regions, they are especially frequent in the Gorenjska region. The most frequent type of ellipsis in all three regions was the -i ellipsis. The frequencies of final and non-final -i ellipses are shown in Table 3, along with χ² p-values and Cramér's V effect sizes.

| | Gorenjska vs. Primorska | Gorenjska vs. Štajerska | Primorska vs. Štajerska |
|---|---|---|---|
| Final -i ellipsis | 254 / 140 | 254 / 174 | 140 / 174 |
| Non-final -i ellipsis | 231 / 156 | 231 / 118 | 156 / 118 |
| χ² p-value | >0.05 | >0.05 | 0.037 |
| Cramér's V | 0.05 | 0.07 | 0.12 |

Table 3: χ² p-values and Cramér's V effect sizes for distributions of final vs. non-final -i ellipses.

The only statistically significant difference in the distribution of final vs. non-final -i ellipses is the one between Primorska and Štajerska, with a small, but not entirely negligible effect size. It would appear Štajerska slightly prefers final -i ellipsis to non-final -i ellipsis. Table 4 shows the measures of regional specificity and dispersion for the final -i ellipsis for all three regions.

| | Gorenjska | Štajerska | Primorska |
|---|---|---|---|
| fR | 0.52 | 0.60 | 0.47 |
| u | 0.75 | 0.42 | 0.54 |
| t | 0.61 | 0.53 | 0.67 |
| a | 0.43 | 0.41 | 0.24 |
| δR | 10.25 | 5.41 | 4.08 |

Table 4: Measures of regional specificity and dispersion for the final -i ellipsis.

As can be deduced from Table 4, the final -i ellipsis has a considerably greater user ratio in Gorenjska, as well as a considerably higher coefficient of regional dispersion, which would indicate that the language element is much more widespread in this region compared to Štajerska and Primorska.

8. Conclusion

In the paper, we described the creation of a dataset and presented a method for the analysis of Slovene regional language variants on Twitter.

In our future work, we will perfect the typology of non-standard language elements in Slovene CMC and make a comparison with phenomena presented in existing Slovene dialectological studies. We will also extend the annotated dataset to other Slovene regions and analyse all encountered language elements in terms of their regional specificity and dispersion, then compare the results with the results obtained through other statistical methods. In addition, elements that rarely occur in the samples (e.g. non-standard syntactic constructions) will be tested on larger text samples in the JANES corpus of Internet Slovene. The results of the analysis will be used to design features to be used in the development of a model for the automatic recognition of Slovene regional language variants on Twitter.

9. Acknowledgments

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Resources, Tools and Methods for the Research of Non-standard Internet Slovene” (J6-6842, 2014–2017).

10. References

Fišer, D., Ljubešić, N. & Erjavec, T. (2015a). The JANES corpus of Slovene user generated content: construction and annotation. International Research Days: Social Media and CMC Corpora for the eHumanities: Book of Abstracts, 23–24 October 2015. Rennes, France, p. 11.
Ljubešić, N., Fišer, D., Erjavec, T., Čibej, J., Marko, D., Pollak, S. & Škrjanec, I. (2015). Predicting the level of text standardness in user-generated content. 10th International Conference on Recent Advances in Natural Language Processing: Proceedings of RANLP 2015 Conference, 7–9 September 2015. Hissar, Bulgaria, pp. 371–378.
Jørgensen, A. K., Hovy, D. & Søgaard, A. (2015). Challenges of studying and processing dialects in social media. Proceedings of the ACL 2015 Workshop on Noisy User-generated Text. Beijing, China, July 31, 2015, pp. 9–18.
Harrat, S., Abbas, M., Meftouh, K. & Smaili, K. (2013). Diacritics restoration for Arabic dialect texts. Proceedings of the 14th Annual Conference of the International Speech Communication Association (INTERSPEECH 2013). France.
Bernhard, D. & Ligozat, A.-L. (2013). Hassle-free POS-Tagging for the Alsatian Dialects. Zampieri, M. & Diwersy, S. (eds.), Non-standard Data Sources in Corpus-based Research. Aachen: Shaker Verlag, pp. 85–92.
Čibej, J. & Ljubešić, N. (2015). "S kje pa si?" – Metapodatki o regionalni pripadnosti uporabnikov družbenega omrežja Twitter. Fišer, D. (ed.), Proceedings of Konferenca Slovenščina na spletu in v novih medijih. Ljubljana, ZIFF, pp. 10–14.
Kenda Jež, K. (2002). Cerkljansko narečje: teoretični model dialektološkega raziskovanja na zgledu besedišča in glasoslovja. PhD dissertation. Ljubljana: Faculty of Arts.
Eisenstein, J. (2015). Written dialect variation in online social media. Boberg, C., Nerbonne, J. & Watt, D. (eds.): Handbook of Dialectology. Wiley.
Ramovš, F. (1931). Dialektološka karta slovenskega jezika. Ljubljana: Rektorat univerze kralja Aleksandra I. in J. Blaznika nasl. – Univerzitetna tiskarna.
Fišer, D., Erjavec, T., Čibej, J. & Ljubešić, N. (2015). Gradnja in analiza korpusa spletne slovenščine JANES. Smolej, M. (ed.): OBDOBJA 34: Slovnica in slovar – aktualni jezikovni opis. Ljubljana: ZIFF, pp. 217–223.
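Section 5 of the paper above states that the samples were annotated in plain text so that they could be post-processed with Python regular expressions. The following is a minimal sketch of how the Figure 1 pattern could be parsed, not the authors' actual scripts: the function name `parse_annotations` and both regular expressions are assumptions, and the sketch presumes that tokens contain no square brackets and tags contain no curly braces.

```python
import re

# One annotated unit: [token or phrase] followed by one or more {tag} groups.
# Both patterns are illustrative assumptions based on Figure 1.
ANNOTATION = re.compile(r"\[([^\[\]]+)\]((?:\{[^{}]+\})+)")
TAG = re.compile(r"\{([^{}]+)\}")


def parse_annotations(text):
    """Return a list of (token, [tags]) pairs found in an annotated line."""
    return [
        (match.group(1), TAG.findall(match.group(2)))
        for match in ANNOTATION.finditer(text)
    ]


# The example from Figure 1: "sej" tagged as a spelling variant of "saj"
# and as an instance of the -aj- to -ej- transformation.
print(parse_annotations("[sej]{V.saj}{Taj.ej}"))
```

Each match yields the annotated token or phrase together with its list of tags, which could then be counted per region.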
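The coefficient of regional dispersion from Section 6.5 of the paper above is a plain product of the four measures, scaled by 100. As a hedged illustration, the sketch below recomputes it from the rounded values printed in Table 4; the function name `dispersion_coefficient` and the dictionary layout are assumptions made for this example. Because the table lists inputs rounded to two decimals, the recomputed coefficients only approximate the published δR values.

```python
def dispersion_coefficient(f_r, u, t, a):
    """delta_R = f_R * u * t * a * 100, as defined in Section 6.5."""
    return f_r * u * t * a * 100


# Rounded measures for the final -i ellipsis, copied from Table 4:
# (relative frequency, user ratio, type/token ratio, annotation ratio).
TABLE_4 = {
    "Gorenjska": (0.52, 0.75, 0.61, 0.43),
    "Štajerska": (0.60, 0.42, 0.53, 0.41),
    "Primorska": (0.47, 0.54, 0.67, 0.24),
}

for region, measures in TABLE_4.items():
    print(region, round(dispersion_coefficient(*measures), 2))
```

Since any factor close to zero pulls the whole product down, an element has to score reasonably on all four measures to obtain a high coefficient.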
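The effect sizes in Table 3 of the paper above can be recomputed from the reported frequencies. The sketch below assumes a Pearson χ² statistic without continuity correction and the 2 × 2 form of Cramér's V, V = sqrt(χ²/n); the paper does not state which variant or software was used, so both are assumptions, and the helper names are hypothetical. The p-values are deliberately left out.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def cramers_v(table):
    """Cramér's V; for a 2x2 table this reduces to sqrt(chi2 / n)."""
    n = sum(sum(row) for row in table)
    return math.sqrt(chi_square_2x2(table) / n)


# Rows: final vs. non-final -i ellipsis; columns: the two regions compared.
# Frequencies copied from Table 3.
PAIRS = {
    "Gorenjska vs. Primorska": [[254, 140], [231, 156]],
    "Gorenjska vs. Štajerska": [[254, 174], [231, 118]],
    "Primorska vs. Štajerska": [[140, 174], [156, 118]],
}

for name, table in PAIRS.items():
    print(f"{name}: V = {cramers_v(table):.2f}")
```

Under these assumptions the printed values agree with the Cramér's V row of Table 3 (0.05, 0.07 and 0.12).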
Analysis of Sentiment Labeling of Slovene User-Generated Content

Darja Fišer,*† Tomaž Erjavec†
* Department of Translation, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana
† Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana
E-mail: darja.fiser@ff.uni-lj.si, tomaz.erjavec@ijs.si

Abstract

The paper takes a close look at the results of sentiment annotation of the Janes corpus of Slovene user-generated content on 557 texts sampled from 5 text genres. A comparison of disagreements among three human annotators is examined at the genre as well as text level. Next, we compare the automatically and manually assigned labels according to the text genre. The effect of text genre on correct sentiment assignment is further investigated by examining the texts with no inter-annotator agreement. We then look into the disagreements for the texts with full human inter-annotator agreement but different automatic classification. Finally, we examine the texts that humans and the automatic model struggled with the most.

Keywords: sentiment analysis, quantitative and qualitative evaluation, user-generated content, non-standard Slovene

1. Introduction

Sentiment analysis or opinion mining detects opinions, sentiments and emotions about different entities expressed in texts (Liu, 2015). It is currently a very popular text-mining task, especially for social networking services, where people regularly express their emotions about various topics (Dodds et al., 2015). A sentiment analysis system for Slovene user generated content (UGC) was developed by Mozetič et al. (2016) and has been, inter alia, used to annotate the Janes corpus of Slovene UGC (Erjavec et al., 2015). The first results are encouraging but they vary both in inter-annotator agreement and accuracy of the system across genres (Fišer et al., 2016), suggesting further improvements of the system are needed. One of the steps towards this goal is a qualitative analysis of (dis)agreement among the annotators and an error analysis of the incorrectly classified texts, which is the goal of this paper.

The paper is organized as follows. In Section 2 we give a brief presentation of the corpus and its sentiment annotation. In Section 3 we present the results of a quantitative analysis of manual and automatic sentiment annotation on a sample collection of texts. In Section 4 we follow with a qualitative analysis of the texts and their features that make the task difficult for humans as well as those that the algorithm struggles with. The paper ends with concluding remarks and ideas for future work.

2. Sentiment Annotation of Janes

The Janes corpus (Erjavec et al., 2015) is the first large (215 million tokens) corpus of Slovene UGC that comprises blog posts and comments, forum posts, news comments, tweets and Wikipedia talk and user pages. Apart from the standard corpus processing steps, such as tokenization, sentence segmentation, tagging and lemmatization (Ljubešić and Erjavec, 2016) as well as some UGC-specific processing steps, such as rediacritization (Ljubešić et al., 2016), normalization (Ljubešić et al., 2014) and text standardness labeling (Ljubešić et al., 2015), all the texts in the corpus were also annotated for sentiment (negative, positive, or neutral) with an SVM-based algorithm that was trained on a large collection of manually annotated Slovene tweets (Mozetič et al., 2016).

We also produced a manually annotated dataset. This evaluation dataset comprised 600 texts, which were sampled in equal proportions from each subcorpus (apart from blog comments as they have been found to behave very similarly to news comments) in order to represent all the text genres included in the corpus in a balanced manner.

The sample was then manually annotated for the three sentiment labels by three human annotators. The annotators marked some texts as out of scope (written in a foreign language, automatically generated etc.), so the final evaluation sample consists of 557 texts.

In the following sections the labels assigned by the annotators were compared to each other while the automatically assigned scores were compared to the annotators' majority class, i.e. the sentiment label assigned to each text by the most annotators. In cases of complete disagreement the neutral sentiment is assigned as the majority class.

3. Quantitative Analysis of Sentiment Annotation

In our quantitative analysis we first analyze the difficulty of the task for humans and the algorithm on the evaluation sample. We also compare annotation results with respect to text genres. Finally, we measure the degree of disagreements of the assigned labels in order to measure the severity of the annotation incongruences.

3.1. Comparison Between Manual and Automatic Annotations

First, a comparison of disagreements among the human annotators was computed as well as that of the automatic system with the majority class. Since we are investigating sentiment annotation accuracy from the perspective of the difficulty of the task, measured with the dispersion of annotations by human annotators, we are operating with percentage agreement in this paper. While we have measured inter-annotator agreement with Krippendorff's alpha, which is 0.563 for human annotations and 0.432 for automatic annotations with respect to the human majority vote (cf. Fišer et al., 2016), this measure reports inter-annotator agreement for the entire annotation task and is as such not informative enough for the task at hand in this paper.

The results in Table 1 show that the task was easier for some texts in the sample both for humans and for the system as annotators' labels range from perfect agreement to an empty intersection. While at least two human annotators provided the same answer on nearly 97% of the sample, all three annotators agreed on less than half of the texts, which is a clear indication that the task is not straightforward and intuitive for humans, suggesting that better guidelines and/or training are needed to obtain consistent and reliable results in the annotation campaign. As could be expected, texts that were difficult to annotate for humans also proved hard for the system. Namely, the system chose the same label as the annotators in the majority of the cases (65%) only for those texts on which the humans were in complete agreement. Where the annotators disagreed partially or completely, there is substantially less overlap between them and the system (46% and 33%, respectively).

| Automatic vs. manual | All annotators agree | 2/3 annotators agree | All annotators disagree |
|---|---|---|---|
| identical | 160 (65%) | 133 (46%) | 6 (33%) |
| different | 87 (35%) | 159 (54%) | 12 (67%) |
| total | 247 (44%) | 292 (52%) | 18 (3%) |

Table 1: Comparison between automatic (to majority class) and manual annotations.

3.2. Comparison Between Text Genres

In order to better understand which text types are easy and which difficult for sentiment annotation, we compared the labels assigned by the annotators and the system according to the genre of the texts in the sample. In texts for which annotators are in perfect agreement, the biggest overlap between the system and the majority vote of the annotators is achieved on news comments. These are followed by blog posts which, together with the news comments, represent over half of all the texts receiving the same sentiment label by both humans and the model.

The effect of text genre on the difficulty of correct sentiment assignment was further investigated by looking at the genre of those texts for which there was no agreement among the human annotators, i.e. texts which were annotated as negative by one annotator, positive by another and neutral by the third. The results of this analysis are presented in Table 3 and are consistent with the previous findings in that sentiment in forum posts and tweets is the most elusive while being the least problematic on Wikipedia talk pages and in news comments.

| Type | All agree: different | All agree: identical | 2/3 agree: different | 2/3 agree: identical | All disagree: different | All disagree: identical |
|---|---|---|---|---|---|---|
| blog | 14 (16%) | 34 (21%) | 38 (24%) | 27 (20%) | 3 (25%) | 1 (17%) |
| forum | 23 (26%) | 29 (18%) | 42 (26%) | 20 (15%) | 4 (33%) | 1 (17%) |
| news | 12 (14%) | 48 (30%) | 21 (13%) | 32 (24%) | 2 (17%) | 0 (0%) |
| tweet | 23 (26%) | 24 (15%) | 25 (16%) | 21 (16%) | 2 (17%) | 4 (67%) |
| wiki | 15 (17%) | 25 (15%) | 33 (21%) | 33 (24%) | 1 (8%) | 0 (0%) |
| total | 87 | 160 | 159 | 133 | 12 | 6 (18 in total) |

Table 2: Comparison between automatic and majority vote per text genre.

Since the system was trained on tweets, one would expect them to receive the highest agreement, which is not the case. A possible reason for this is that sentiment is more explicitly expressed in news comments than in tweets, whereas blogs might be easier because they are longer, which again makes them easier for sentiment identification. Forum posts, on the other hand, seem to be the hardest overall, which is addressed in more detail in Section 4.

| Text type | Disagreement (No.) | Disagreement (%) |
|---|---|---|
| blog | 4 | 21% |
| forum | 6 | 32% |
| news | 2 | 11% |
| tweet | 6 | 32% |
| wiki | 1 | 5% |
| total | 19 | 100% |

Table 3: Disagreement among the annotators per text genre.

3.3. Comparison of the Degree of Disagreements

Since not all incongruences between the system and the true answer are equally bad from the application point of view, we looked into the degrees of disagreements for the texts receiving the same label by all three annotators and a different one by the system. As can be seen from Table 4, the automatic system has a clear bias towards neutral labels, i.e. more than half of the mislabeled opinionated texts were marked as neutral by the algorithm. Mislabeling neutral texts as opinionated is seen in about a third of the cases. The worst-case scenario, in which negative texts are labeled as positive or vice versa and which therefore hurts the usability of the application the most, is quite rare (12%). The behavior of the system on texts with partial human agreement is consistent with the findings above in assigning sentiment of opposite polarities, which again represents the smallest part of the sample (8%). Neutralizing negative and positive texts occurs on 40% of the sample, which is lower than for the texts on which all the annotators agree. The most prevalent category are neutral texts mislabeled as negative, which is seen in 34% of the cases, substantially more than above.

| Differences | Annotators agree, system disagrees | 2/3 annotators agree, system disagrees |
|---|---|---|
| neg → neut | 29 (33%) | 36 (23%) |
| neg → pos | 7 (8%) | 7 (4%) |
| neut → neg | 14 (16%) | 54 (34%) |
| neut → pos | 13 (15%) | 29 (18%) |
| pos → neut | 20 (23%) | 27 (17%) |
| pos → neg | 4 (5%) | 6 (4%) |
| Total | 87 (100%) | 159 (100%) |

Table 4: Discrepancies between automatic and majority human vote.

4. Qualitative Analysis of Sentiment Annotation

In this section we present the results of qualitative analysis of the biggest problems in sentiment annotation observed in the evaluation sample. We first examine all the texts for which there was no agreement among the human annotators and then focus on the texts that humans found easy to annotate consistently but the system failed to annotate correctly.

4.1. Toughest Sentiment Annotation Problems for Humans

By examining the texts which received a different label by each annotator we wished to investigate the difficulty of the task itself, regardless of the implementation of an automatic approach. In the evaluation sample of 557 texts there were 18 such cases: 6 tweets, 5 forum posts, 4 blog posts, 2 news comments, and 1 Wikipedia talk page.

As can be seen from Table 5, there are significant discrepancies in annotator behavior. While Annotators 1 and 2 chose positive and negative labels equally frequently (A1: 9 negative, 8 positive, 1 neutral; A2: 8 negative, 8 positive, 2 neutral), Annotator 3 was heavily biased towards the neutral class (A3: 1 negative, 2 positive, 15 neutral). The automatic system lies in between these two behaviors (S: 5 negative, 7 positive, 6 neutral), sharing the most equal votes on individual texts with Annotator 1 (44%) and the fewest with Annotator 2 (22%). This suggests that annotators did not pick different labels for individual texts due to random/particular mistakes but probably adopted different strategies in selecting the labels systematically throughout the assignment. While Annotators 1 and 2 favored the expressive labels even for the less straightforward examples, Annotator 3 opted for a neutral one in case of doubt. These discrepancies could be overcome by more precise annotation guidelines for such cases.

| Source | Ann1 | Ann2 | Ann3 | System | Note |
|---|---|---|---|---|---|
| blog | - | + | 0 | - | mixed |
| blog | - | + | 0 | - | mixed |
| blog | - | + | 0 | - | mixed |
| blog | + | - | 0 | 0 | mixed |
| forum | - | + | 0 | + | mixed |
| forum | + | - | 0 | - | context |
| forum | + | - | 0 | 0 | context |
| forum | + | - | 0 | + | context |
| forum | + | 0 | - | + | context |
| news | - | + | 0 | + | context |
| news | + | - | 0 | - | mixed |
| tweet | - | + | 0 | 0 | sarcasm |
| tweet | - | + | 0 | 0 | mixed |
| tweet | - | + | 0 | 0 | mixed |
| tweet | 0 | - | + | 0 | short |
| tweet | + | - | 0 | + | mixed |
| tweet | + | - | 0 | + | sarcasm |
| wikip. | - | 0 | + | + | mixed |

Table 5: Analysis of the difficult cases for the human annotators.

A detailed investigation of the 18 problematic texts showed that 3 out of 6 tweets contain mixed sentiment in the form of message and vocabulary distinctive for one sentiment, which is then followed by an emoticon of a distinctively opposite sentiment. 2 tweets were sarcastic and 1 simply too short and informal to understand what the obviously opinionated message was about (“prrrr za bič :P / prrrr for the whip :P”).

4 out of 5 forum posts are lacking a wider context (the entire conversation thread) which is needed in order to find out whether the post was meant as a joke or was sarcastic. Some annotators annotated it as is, others assumed sarcasm or opted for a neutral label. 1 forum post contained mixed sentiment.

All 4 blog posts were relatively long and contained mixed sentiment. For example, a post that contains a description of a blogger's entire life starts off with very positive sentiment that then turns into a distinctly negative one after some difficult life situations. While some annotators treated this text as neutral as it contained all types of sentiment, others treated it as negative since negative sentiment is the dominant one in terms of amount of text it appears in with respect to other parts, in terms of strength with which it is expressed, and/or in terms of the final position in the text, suggesting it to be the prevailing sentiment the author wished to express.

1 news comment was lacking context and 1 contained mixed sentiment, which is also true of the Wikipedia talk page that is complaining about a plagiarized article but in a clearly constructive, instructive tone that is trying not to complain about the bad practice but teach a new user about the standards and good practices respected by the community.

4.2. Toughest Sentiment Annotation Problems for Computers

In the second part of the qualitative analysis we focus on the 87 texts from the sample which were labeled the same by all three annotators but differently by the system. With this we hope to see the limitations of the system when trying to deal with the cases most straightforward for humans. The sample consisted of 23 forum posts and 23 tweets, 15 Wikipedia talk pages, 14 blog posts and 12 news comments. As said in Section 3, almost all of the discrepancies (87%) were neutral texts that were mislabeled as opinionated by the system or vice versa. Serious errors, i.e. cross-spectrum discrepancies, were rare (4.6% true negatives mislabeled as positive and 8% true positives mislabeled as negative).

| Problematic feature | No. | % |
|---|---|---|
| no feature identified | 22 | 25.29 |
| neg. vocabulary | 18 | 20.69 |
| pos. vocabulary | 10 | 11.49 |
| cynical | 10 | 11.49 |
| emoticons | 7 | 8.05 |
| too short | 5 | 5.75 |
| quote | 5 | 5.75 |
| foreign/specialized vocabulary | 5 | 5.75 |
| non-standard text | 2 | 2.30 |
| names | 2 | 2.30 |
| mixed sentiment | 1 | 1.15 |
| Total | 87 | 100.00 |

Table 6: Analysis of the problematic text features for the sentiment annotation algorithm.

We performed a manual inspection of the erroneously annotated texts and classified each of them into one of 10 categories representing possible causes for the error. As Table 6 shows, in over a quarter of the analyzed texts, no special feature was identified and it really is not clear why the system made an error there as the sentiment in them is obvious. The most common characteristics of the mislabeled texts, which occurred in 43% of the analyzed sample, were lexical features, i.e. the vocabulary typical of negative/positive messages, foreign and specialized vocabulary, proper names and non-standard words that are most likely out-of-vocabulary for the model and therefore cannot contribute to successful sentiment assignment. E.g. a perfectly neutral discussion on Wikipedia was labeled as negative due to the topic of the conversation (quisling, invader, traitor). Similarly, many posts with objective advice to patients on the medical forum which contain a lot of medical jargon were mislabeled as negative.

The second common source of errors were the inter- and hyper-textual features that are typical of user-generated content, such as quotes from other sources, parts of discussion threads, fragmentary, truncated messages, URL links and emoticon and emoji symbols. The remaining issues include cynical texts and texts with mixed sentiment that have already been discussed in Section 4.1.

5. Conclusions

In this paper we presented the results of a quantitative and qualitative analysis of sentiment annotation of the Janes corpus. These insights should enable better understanding of the task of sentiment annotation in general as well as facilitate improvements of the system in the future. The results of the first analysis show that overall, blogs have proven to be the easiest to assign a sentiment to as both humans and the automatic assignment achieve the highest score here. The sentiment of the blog posts we examined was straightforward to pin down by the annotators due to text length and informativeness, through which it becomes clear which sentiment is expressed by the author.

For humans, the second easiest are tweets, whereas the automatic system performs worse on them than on news comments and Wikipedia talk pages. This is especially interesting as the automatic system was trained on tweets and would therefore be expected to perform best on the same type of texts. A detailed examination of the problematic tweets shows they are extremely short, written in highly telegraphic style or even truncated and therefore do not provide enough context to reliably determine the sentiment. Furthermore, messages on Twitter are notoriously covertly opinionated, often sarcastic, ironic or cynical, making it difficult to pin down the intended sentiment.

The results of the second analysis are consistent with the first in that vocabulary which is typically associated with a particular sentiment but used in a different context or for a different communicative purpose makes the sentiment difficult to determine. As for the forum posts, which are much harder for the system to deal with than for humans, highly specialized vocabulary on the medical, science and automotive forums (which in addition to terminology is full of very non-standard orthography and vocabulary) would most likely be beneficial in the training data for the model to learn on. Based on the analysis reported on in this paper, we plan to improve inter-annotator agreement by providing the annotators with more comprehensive guidelines that will inform them about how to treat the typical problematic cases. We will try to improve the automatic system by providing it with training material from the worst performing text types. It is less clear how to improve the quality of the automatic labeling of sarcastic, ironic and cynical tweets that are a very common phenomenon.

6. Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful suggestions and comments. The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Resources, Tools and Methods for the Research of Nonstandard Internet Slovene” (J6-6842, 2014–2017).

7. Bibliography

Sheridan Dodds, P., Clark, E. C., Desu, S., Frank, M. R., Reagan, A. J., Ryland Williams, J., Mitchell, L., Decker Harris, K., Kloumann, I. M., Bagrow, J. P., Megerdoomian, K., McMahon, M. T., Tivnan, B. F. and Danforth, C. M. (2015). Human language reveals a universal positivity bias. Proc. of the National Academy of Sciences. 112(8): 2389–2394.
Erjavec, T., Fišer, D., and Ljubešić, N. (2015). Razvoj korpusa slovenskih spletnih uporabniških vsebin Janes. Zbornik konference Slovenščina na spletu in v novih medijih. 20–26. Ljubljana, Znanstvena založba Filozofske fakultete.
Fišer, D., Smailović, J., Erjavec, T., Mozetič, I., and Grčar, M. (2016). Sentiment Annotation of Slovene User-Generated Content. Proc. of the Conference Language Technologies and Digital Humanities. Ljubljana, Faculty of Arts.
Kilgarriff, A. (2012). Getting to Know Your Corpus. Proc. of 15th International Conference on Text, Speech and Dialogue (TSD'12). Brno, Czech Republic. September 3–7 2012, 3–15, Springer Berlin Heidelberg.
Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
Ljubešić, N., Erjavec, T., and Fišer, D. (2014). Standardizing tweets with character-level machine translation. Computational Linguistics and Intelligent Text Processing. LNCS 8404, 164–175, Springer.
Ljubešić, N., and Erjavec, T. (2016). Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: The Case of Slovene. Proc. of 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA).
Ljubešić, N., Erjavec, T., and Fišer, D. (2016). Corpus-Based Diacritic Restoration for South Slavic Languages. Proc. of 10th International Conference on Language Resources and Evaluation (LREC'16). European Language Resources Association (ELRA).
Ljubešić, N., Fišer, D., Erjavec, T., Čibej, J., Marko, D., Pollak, S., and Škrjanec, I. (2015). Predicting the level of text standardness in user-generated content. Proc. of 10th International Conference on Recent Advances in Natural Language Processing Conference (RANLP'15). 7–9 September 2015, 371–378. Hissar, Bulgaria.
Martineau, J., and Finin, T. (2009). Delta TFIDF: An improved feature space for sentiment analysis. Proc. of 3rd AAAI Intl. Conf. on Weblogs and Social Media (ICWSM), 258–261.
Mozetič, I., Grčar, M., and Smailović, J. (2016). Multilingual Twitter sentiment classification: The role of human annotators. PLoS ONE. 11(5):e0155036.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
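Section 2 of the paper above defines the majority class as the label chosen by most annotators, with neutral as the fallback when all three disagree. A minimal sketch of that rule, applied to the 18 rows of Table 5, is given below; the function names and the tuple layout are assumptions for illustration only, and the labels use the table's own symbols (-, + and 0).

```python
from collections import Counter

NEUTRAL = "0"


def majority_label(labels):
    """Label chosen by most annotators; neutral if all three labels differ."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > 1 else NEUTRAL


# Rows of Table 5: (source, annotator 1, annotator 2, annotator 3, system).
ROWS = [
    ("blog", "-", "+", "0", "-"), ("blog", "-", "+", "0", "-"),
    ("blog", "-", "+", "0", "-"), ("blog", "+", "-", "0", "0"),
    ("forum", "-", "+", "0", "+"), ("forum", "+", "-", "0", "-"),
    ("forum", "+", "-", "0", "0"), ("forum", "+", "-", "0", "+"),
    ("forum", "+", "0", "-", "+"), ("news", "-", "+", "0", "+"),
    ("news", "+", "-", "0", "-"), ("tweet", "-", "+", "0", "0"),
    ("tweet", "-", "+", "0", "0"), ("tweet", "-", "+", "0", "0"),
    ("tweet", "0", "-", "+", "0"), ("tweet", "+", "-", "0", "+"),
    ("tweet", "+", "-", "0", "+"), ("wikip.", "-", "0", "+", "+"),
]


def system_matches_annotator(rows, annotator):
    """Number of rows where the system label equals annotator 1, 2 or 3."""
    return sum(1 for row in rows if row[annotator] == row[4])


def system_matches_majority(rows):
    """Number of rows where the system label equals the majority class."""
    return sum(1 for row in rows if majority_label(row[1:4]) == row[4])


print(system_matches_majority(ROWS), "of", len(ROWS))
print([system_matches_annotator(ROWS, k) for k in (1, 2, 3)])
```

Since every row in Table 5 is a case of complete disagreement, the majority class is always neutral, so the system matches it on the 6 texts it labeled neutral (the 33% reported in Table 1). The per-annotator counts of 8 and 4 out of 18 correspond to the 44% and 22% quoted in Section 4.1.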
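The shares quoted in Section 3.3 of the paper above can be rederived by grouping the rows of Table 4. The sketch below does this for the texts on which all annotators agreed; the group names (neutralised, opinionated, cross-polarity) and the function name are labels chosen for this example, not terminology from the paper.

```python
# Table 4, first column: (majority human label, system label) -> number of texts.
ALL_AGREE = {
    ("neg", "neut"): 29,
    ("neg", "pos"): 7,
    ("neut", "neg"): 14,
    ("neut", "pos"): 13,
    ("pos", "neut"): 20,
    ("pos", "neg"): 4,
}


def error_shares(confusions):
    """Percentage of errors per error type (hypothetical grouping)."""
    groups = {"neutralised": 0, "opinionated": 0, "cross-polarity": 0}
    for (gold, predicted), count in confusions.items():
        if predicted == "neut":
            groups["neutralised"] += count      # opinionated text labeled neutral
        elif gold == "neut":
            groups["opinionated"] += count      # neutral text labeled opinionated
        else:
            groups["cross-polarity"] += count   # negative <-> positive
    total = sum(confusions.values())
    return {name: 100 * count / total for name, count in groups.items()}


for name, share in error_shares(ALL_AGREE).items():
    print(f"{name}: {share:.1f}%")
```

For the 87 texts this gives about 56.3% neutralised, 31.0% neutral texts labeled as opinionated and 12.6% cross-polarity errors, in line with "more than half", "about a third" and the roughly 12% worst-case share reported in the text.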
Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, Ljubljana, Slovenia, 27–28 September 2016

Compilation and Annotation of the Discourse-structured Blog Corpus for German

Holger Grumt Suárez, Natali Karlova-Bourbonus, Henning Lobin
Department for German Linguistics and Literature, Applied and Computational Linguistics
Justus-Liebig-University Giessen, Germany
Holger.H.Grumt-Suarez@germanistik.uni-giessen.de, natali.karlova-bourbonus@zmi.uni-giessen.de, henning.lobin@germanistik.uni-giessen.de

Abstract
The present paper reports the first results of the compilation and annotation of a blog corpus for German. The main aim of the project is the representation of the blog discourse structure and the relations between its elements (blog posts, comments) and participants (bloggers, commentators). The data included in the corpus were manually collected from the scientific blog portal SciLogs. The feature catalogue for the corpus annotation includes three types of information, which are directly or indirectly provided in the blog or can be construed by means of statistical analysis or computational tools. At this point, only directly available information (e.g., title of the blog post, name of the blogger etc.) has been annotated. We believe our blog corpus can be of interest for the general study of blog structure or related research questions as well as for the development of NLP methods and techniques (e.g. for authorship detection).

Keywords: CMC, blog corpus, corpus compilation, corpus annotation, TEI

1. Introduction
In our opinion, two views on computer-mediated communication (CMC), linguistic and structural, have so far been established. According to the linguistic view, the language of CMC represents a distinct type of language form besides written and spoken language. Moreover, it combines characteristics of these two traditional language forms, thus constituting a bridge between them. The structural view, in turn, concentrates on how CMC is built up. Two different kinds of CMC structure can be distinguished: external and internal. External structure relates to the representation, or layout, of CMC by means of the HTML mark-up language, which may be an individual decision of a developer. The external structure of most blogs includes, for example, a header (title), content, a footer (contact information) and a sidebar (site navigation). Internal structure, in turn, relates to the generic structure of the CMC content. It describes a set of structural elements (e.g., post, comment, thread, word cloud etc.), properties and principles that a CMC is constructed of and built on in order to function as a holistic construct and to match its purpose.

The identification of the full spectrum of CMC characteristics, linguistic or structural, still faces some major challenges, primarily as a result of the lack of valid annotated data. Storrer (2014: 189) claims that for this purpose a special, third kind of corpus besides the written and spoken corpora is needed. She also adds that appropriate standards, methods and quality criteria for the study of CMC are crucially important as well.

In the present study, the structural nature of the weblog (henceforth blog) as a representative genre of CMC is of interest. We describe the genre blog as a dynamic, “living” construct of interrelated and interacting elements. The dynamics of a blog arise from its constant expansion as a result of ever more comments and blog posts as well as on account of new blog participants. Additionally, the author of the blog (henceforth blogger) can edit the post at any time and add new information on request. The interrelatedness and interaction between elements (blog post, comments) and agents (blogger, commentators) of the blog contribute to the dynamics of the blog as well.

To demonstrate this idea, we compiled the first version of an annotated blog corpus in German using the scientific blog portal SciLogs (SciLogs, 2016) as a data source. The corpus includes both blog posts and related comments. The catalogue of features for the annotation of the corpus is based on three types of information directly or indirectly available from the data source. The typology of information is proposed in Section 3.2.1.

The structure of the paper is as follows. Section 2 provides an overview of the studies related to the topic of the present project. Section 3 describes the main steps conducted for the purpose of the blog corpus compilation and annotation. Some observed challenges for the automation of the task and possible solutions are also included in this section. Finally, Section 4 reports the results of the project and outlines the next steps.

2. Related Work
Currently, there is a limited number of publicly available, large-scale blog corpora. This is surprising given the great influence of blogs on the web in general.

An example of one of the few large-scale blog corpora is the Birmingham Blog Corpus compiled at Birmingham City University. The corpus consists of more than 630 million words, including a 180-million-word sub-section separated into posts and comments (Kehoe, 2012). One objective of this corpus was to analyze if “comments could be used to improve document indexing on the web” (ibid.). The online tool (WebCorp, 2016) of the Birmingham Blog Corpus allows the querying of words and phrases, but it is not possible to search the comment structure, a specific time period, keywords, or a specific blogger or commentator.

Another example of a blog corpus is the bilingual (German, French) corpus d’apprentissage INFRAL (Interculturel Franco-Allemand en Ligne), which is part of the LETEC (Learning and Teaching Corpus) (Abendroth-Timmer, 2014). This corpus is included in the CoMeRe (Communication médiée par les réseaux) project, which “aims to build a Kernel corpus assembling existing corpora of different CMC […] genres and new corpora build on data extracted from the Internet” (Abendroth-Timmer, 2014). The INFRAL blog corpus consists of posts from two groups: a group of ten francophone learners of German as a foreign language from l’Université de Franche-Comté and a group of nine German-speaking learners of French as a foreign language from the University of Bremen who, for example, had to discuss various intercultural topics. One task of this corpus was the modeling of the structure of interactions. Therefore, every comment has been given a reference to the ID of the post, but the links between the comments themselves are not included. The TEI schema developed for the CoMeRe project, which is also part of the TEI special interest group (SIG) "computer-mediated communication" (CMC), will be an important basis for our own schema.

Finally, the German-language WordPress blog corpus by Barbaresi and Würzner (Barbaresi & Würzner, 2014) is another example of a blog corpus worth mentioning. The corpus consists of a total of 158,719 German WordPress blogs. The collected data is released under the Creative Commons license. The corpus can be used, for example, in lexicography for the purpose of dictionary building. Moreover, it can be a good source “to test linguistic annotation chains for robustness” (Barbaresi & Würzner, 2014).

3. Methodology

3.1 Data Collection
To date, we have compiled a test corpus in the German language, which contains 21 blog posts and 195 comments related to those blog posts. For this test corpus, we wanted to cover a whole week and therefore randomly chose week 49 in 2015 (from November 30, 2015 to December 6, 2015). The source of the data is the scientific blog portal SciLogs (SciLogs, 2016). SciLogs is subdivided into different sections (BrainLogs, ChronoLogs, KosmoLogs, WissenLogs) where scientists, and those interested in science, can interact in interdisciplinary discussions about science. For the test corpus, we did not focus on a particular section; we extracted the blogs from different sections.

Our analysis of the SciLogs source code shows that the different sections store the data using the same template. In the source code, we can find the information for title, category, keywords, date, name of the blogger / commentators, the comments and their different levels of indentation, the permalinks of the comments and the blog post itself. The data collection for the test corpus was done manually, and in the process of the work we discovered some complications that will have to be dealt with later during the automation phase. We will discuss some of these complications in Section 3.2.3.

Our next step will be to complete our corpus with the data that appeared in 2015, considering all SciLogs sections. According to our current knowledge, the SciLogs data of 2015 includes about 1,200 blog posts and 12,000 comments. Retrieval of the blog data from the web will be conducted semi-automatically. For this purpose, the open source program HTTrack Website Copier (Roche, 2016) will be used. HTTrack enables the download of all kinds of website data stored on the server, including HTML pages, images and other files, to a local directory on a computer. After the retrieval step, the data will be cleaned of noise and represented in the form of HTML pages (external structure). Finally, the relevant content will be extracted from the HTML pages and annotated according to the TEI annotation standard (internal structure). The programming language Python and its packages for XML parsing will be used for this purpose.

3.2 Data Annotation

3.2.1 Types of Blog Information
We distinguish between three types of information provided in the blog based on how the information is made available. The first type (type A) incorporates information which is directly available in the blog or from the source code of the blog site. In the blog post structure, it includes the blog post itself along with meta information such as the title of the blog post, date of creation, the name of the blogger, the categories the entry belongs to and main keywords. In the structure of the comments, type A information is represented by the total number of comments as well as the name of the commentator, date and comment ID. The second type (type B) includes information which is not directly available but can be inferred from type A information, e.g. the usual activity time of a commentator (at what time a particular commentator usually writes comments). Finally, the third type (type C) is an interpretative information type. This kind of information is neither directly nor indirectly provided in the blog but is rather the result of statistical (basic statistics), linguistic (e.g., parts of speech) and discourse (e.g., topic identification with topic modeling) interpretation and analysis of the blog entries. The interpretative information type can be collected either manually or by use of computational tools.

3.2.2 Annotation Standard
To date, no standard exists for representing CMC data. One option could be to design an XML schema for CMC from scratch, which would perfectly fit the needs of our project. The main reason why we do not pursue this option is that the schema would be idiosyncratic and the corpus would not interoperate with other resources without difficulty. When searching for a standard for the representation of texts in digital form, one will take a look at the Text Encoding Initiative (TEI).
However, none of the modules in the current version of the TEI Guidelines (P5) can be adopted for our project. Fortunately, the SIG CMC group under the direction of Beißwenger (TU Dortmund) has been working on the adaptation of the TEI Guidelines to the representation of genres of CMC since 2012 (Beißwenger, 2015). Given that no module for CMC is ready to use so far, we started to look for schema drafts by the SIG CMC group. Up to now, a couple of corpora have been released by the group, covering CMC genres like tweets, email, text chat, wiki discussions and weblogs (Chanier, 2014; Beißwenger, 2013; Storrer, 2015). The schema that fits our needs best is the one released in 2014 by the French network CoMeRe (Communication médiée par les réseaux) (Hriba, 2013). The CoMeRe schema is based on the previous schema draft by DeRiK (Beißwenger, 2013) and includes e.g. the metadata schema for CMC. Still, there is no possibility for representing the full structure of a blog and especially the related comments. Our goal is to take the latest schema draft provided by the SIG CMC (Beißwenger, 2016) and not to try to change the main characteristics of the schema. The status of that schema is that of a “core model for the representation of CMC” (Beißwenger et al. 2012: 6). We will therefore need to redefine some elements while also introducing some new ones.

3.2.3 Challenges and Possible Solutions
A number of aspects of the blog corpus annotation task are challenging; in some cases this is the result of particularities of the content management system (CMS) functionality used by our blog data source. Most of the challenges deal with the structure of the comments. As we are at an early stage of our project, only a limited number of challenges and solutions will be described here.

The first challenge is due to the absence of an editing function for the comments. A commentator who edits the text of a comment creates a new entry which appears in the timeline as an autonomous comment. Thus, the comment structure of our blog corpus includes both original comments and their edited versions that had appeared by the time of the data collection. This aspect does not have an impact on the difficulty of automating the annotation task. However, it first affects the accuracy of the total number of distinct comments (type A information). Second, it creates confusing linkages in the comment structure.

The latter problem also arises as the result of the second challenge: the possibility that one comment refers to more than one previous comment. Unfortunately, the CMS of our blog source does not offer any special options to mark or highlight multiple comment references. In some cases, the commentators use constructions such as [@name]* to overcome this problem. In other cases, an additional analysis of the comment content is required. For the purpose of the study, only explicit references are taken into consideration. No deeper content analysis has been conducted. The identification of multiple references and their annotation with TEI were carried out automatically and then manually checked for mistakes in order to achieve accurate and reliable results. We believe that this is less time- and cost-consuming than fully manual processing of the data. The automatic part is conducted based on explicit marks of multiple reference such as [@name]*. In the TEI blog annotation, the multiple references are specified by enumeration of the ids of their comments ().

Finally, the third challenge is the correct assignment of the comments to their level in the hierarchical structure of the comments. At present, the number of possible level assignments is limited to five. All comments appearing after the first comment on the fifth level are (wrongly) assigned to the fifth level. In order to solve this problem, we developed a simple algorithm to compute the correct level of the comments. The algorithm first takes the person reference (“@name”, “[name] schrieb (engl.: wrote)” etc.) included in the text of the analyzed comment as the input. In the case of multiple references, only the first reference is taken into consideration. The algorithm then searches backwards for matches between the person reference and the name of the commentator in the previous comments. If a match is found, the level of the analyzed comment is computed as the level of the comment that the person reference belongs to, plus 1. In the absence of references, the level of the comment is counted subsequently.

4. Results
The main steps conducted for the purposes of a scientific blog corpus compilation as well as the challenges faced during this process were described in the present study. The current version of the corpus contains 21 blog posts and 195 related comments written in the period of one week. We are convinced that comments are an essential part of a blog corpus. On their own or in connection with the corresponding blog post, they provide valuable information for processing diverse research questions on the language of the blog and its structure. For example, based on the name of the commentator and the time of the comments, we can compute at what time a particular commentator is active in the blog.

The data for our blog corpus was manually collected and annotated according to the TEI schema drafts developed by the TEI special interest group. For the annotation, three types of information (direct, indirect and interpretative) have been identified based on their availability. The present version of the corpus includes annotation of the first type, directly retrieved information (e.g., the name of the blogger, title of the blog entry, the name of the commentator etc.). The next objective of the project is an expansion and full annotation of the corpus as well as the automation of the data collection and annotation task. At the final stage of the present project, our annotated corpus will be made available to the interested community to perform diverse kinds of research and experiments. Our aim is to enable access to the corpus through a searchable online database. Additionally, we plan to make a part of the corpus available upon request. For the legal aspects of the SciLogs data usage and publication, a competent external institution will be consulted.

5. Acknowledgements
We would like to thank our anonymous reviewers for their insightful comments and suggestions. Following the feedback, we included several improvements in our paper.

6. References
Abendroth-Timmer, D. et al. (2014). Corpus d'apprentissage INFRAL (Interculturel Franco-Allemand en Ligne). Banque de corpus CoMeRe. Ortolang.fr: Nancy. https://hdl.handle.net/11403/comere/cmr-infral (last retrieved 23 August 2016).
Barbaresi, A., Würzner, K.-M. (2014). For a fistful of blogs: Discovery and comparative benchmarking of republishable German content. In KONVENS 2014, NLP4CMC workshop proceedings, pp. 2–10.
Beißwenger, M. et al. (2012). A TEI Schema for the Representation of Computer-mediated Communication. In: Journal of the Text Encoding Initiative (jTEI), Issue 3.
Beißwenger, M. et al. (2013). DeRiK: A German reference corpus of computer-mediated communication. In: M. A. Finlayson (Ed.), LLC. The Journal of Digital Scholarship in the Humanities, Volume 28, Number 4. Oxford, OUP, pp. 531-537.
Beißwenger, M. (2015). Computer-Mediated Communication SIG. In TEI Website. http://www.tei-c.org/Activities/SIG/CMC/ (last retrieved 20 April 2016).
Beißwenger, M. (2016). SIG:Computer-Mediated Communication. In TEI Website. http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication (last retrieved 20 April 2016).
Chanier, T. et al. (2014). The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres. In: Special issue on Building And Annotating Corpora Of Computer-Mediated Discourse: Issues and Challenges at the Interface of Corpus and Computational Linguistics. Journal of Language Technology and Computational Linguistics. Berlin, GSCL, pp. 1-31.
Hriba, L., Chanier, T. (2013). Projet européen TEI-CMC. Comere: Corpuscomere. Communication médiée par les réseaux. In Comere Website. https://corpuscomere.wordpress.com/tei/ (last retrieved 20 April 2016).
Kehoe, A., Gee, M. (2012). Reader comments as an aboutness indicator in online texts: introducing the Birmingham Blog Corpus. In Studies in Variation, Contacts and Change in English 12: Aspects of corpus linguistics: compilation, annotation, analysis. http://www.helsinki.fi/varieng/series/volumes/12/kehoe_gee/ (last retrieved 20 April 2016).
Roche, X. (2016). HTTrack. Website Copier. http://www.httrack.com/ (last retrieved 20 April 2016).
SciLogs (2016). SciLogs. Tagebücher der Wissenschaft. Spektrum der Wissenschaft Verlagsgesellschaft mbH. http://www.scilogs.de/impressum/ (last retrieved 20 April 2016).
Storrer, A. (2014). Sprachverfall durch internetbasierte Kommunikation? Linguistische Erklärungsansätze – empirische Befunde. In: A. Plewina & W. Andreas (Eds.), Sprachverfall? Berlin, De Gruyter, pp. 171-196.
Storrer, A. (2015). ChatCorpus2CLARIN: Integration of the Dortmund Chat Corpus into CLARIN-D. In CLARIN-D Website. http://de.clarin.eu/en/curation-project-1-3-german-philology (last retrieved 20 April 2016).
WebCorp (2013). Birmingham Blog Corpus. WebCorp: Linguist’s Search Engine. Birmingham City University. http://wse1.webcorp.org.uk/cgi-bin/BLOG/index.cgi (last retrieved 20 April 2016).
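The level-assignment procedure described in Section 3.2.3 of the blog corpus paper can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the project's own implementation: the Comment record, its cms_level field, the function names and the plain substring matching of "@name" and "name schrieb" are all hypothetical, and falling back to the level supplied by the CMS when no reference is found is only one possible reading of the rule for comments without references.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Comment:
    author: str
    text: str
    cms_level: int  # hypothetical field: level taken from the page indentation (capped by the CMS)


def first_referenced_author(text: str, known_authors: List[str]) -> Optional[str]:
    """Return the known author whose explicit reference occurs earliest in the text."""
    lowered = text.lower()
    best_pos, best_author = None, None
    for author in known_authors:
        name = author.lower()
        # Explicit reference marks named in the paper: "@name" and "name schrieb".
        # Plain substring matching is a simplification; names that contain one
        # another would need stricter, boundary-aware matching.
        for pattern in ("@" + name, name + " schrieb"):
            pos = lowered.find(pattern)
            if pos != -1 and (best_pos is None or pos < best_pos):
                best_pos, best_author = pos, author
    return best_author


def assign_levels(comments: List[Comment]) -> List[int]:
    """Compute a level for every comment, in timeline order."""
    levels: List[int] = []
    for i, comment in enumerate(comments):
        earlier_authors = [c.author for c in comments[:i]]
        target = first_referenced_author(comment.text, earlier_authors)
        level = None
        if target is not None:
            # Search backwards for the most recent comment by the referenced person.
            for j in range(i - 1, -1, -1):
                if comments[j].author == target:
                    level = levels[j] + 1
                    break
        if level is None:
            # Assumption: without a reference, keep the level given by the CMS.
            level = comment.cms_level
        levels.append(level)
    return levels


example = [
    Comment("Anna", "Interessanter Beitrag.", 1),
    Comment("Ben", "@Anna Danke für den Hinweis.", 2),
    Comment("Clara", "@Ben Das sehe ich anders.", 3),
    Comment("David", "@Clara Warum?", 4),
    Comment("Eva", "@David Gute Frage.", 5),
    Comment("Frank", "@Eva Genau.", 5),  # the CMS caps the level at five
    Comment("Gina", "Noch ein Gedanke zum Beitrag.", 1),
]
print(assign_levels(example))  # prints [1, 2, 3, 4, 5, 6, 1]
```

In this toy thread the sixth comment is displayed on level five by the CMS, but the reference to the fifth commentator places it on level six, which is the correction the paper describes.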
Expressiveness in Flemish Online Teenage Talk: A Corpus-Based Analysis of Social and Medium-Related Linguistic Variation

Lisa Hilte, Reinhild Vandekerckhove, Walter Daelemans
CLiPS, University of Antwerp
E-mail: lisa.hilte@uantwerpen.be, reinhild.vandekerckhove@uantwerpen.be, walter.daelemans@uantwerpen.be

Abstract
We analyze linguistic expressiveness in an extensive corpus (2 million tokens) of Flemish online teenage talk, focusing on the use of typographic chatspeak features, an onomatopoeic and a lexical variable and its correlation with the chatters’ profile and the online medium. General quantitative findings are that girls outperform boys in the expression of emotional involvement, and younger adolescents outperform the older group. However, medium has the largest impact: many more expressive markers are used in asynchronous social media posts than in synchronous instant messaging. On a qualitative level, utterances written by girls, by younger teenagers and on the asynchronous platform contain more expressive markers related to love or friendship. Apart from the medium’s (a)synchronicity and its public or private character, the nature of the interaction appears to be a determining factor too. The asynchronous social media posts involve a lot of flirting or pleasing, which drastically increases linguistic expressiveness.

Keywords: computer-mediated communication, adolescents, computational sociolinguistics

1. Introduction
Since the rise of informal computer-mediated communication (CMC), both laymen and linguists have been fascinated by the prototypical features that they identified in several forms of digital writing (see Crystal, 2001). Androutsopoulos relates these features to three dimensions or themes: “orality, compensation, and economy” (2011: 149). While orality refers to the use of spoken language features in written discourse and economy covers all strategies to shorten messages, the “semiotics of compensation” “includes any attempt to compensate for the absence of facial expressions or intonation patterns” (Baron, 1984: 125; Androutsopoulos, 2011: 149). The latter dimension is at issue in the present paper, which examines the use of expressive markers in Flemish online teenage talk.

2. Goal of the Paper
We examine social and medium-related linguistic variation concerning expressiveness in a corpus of Flemish online teenage talk. The linguistic variables include several typographic features that are generally associated with chat discourse (e.g. emoticons), an onomatopoeic variable (rendition of laughter) and a lexical variable (intensifiers¹). All features will be discussed more elaborately in section 3. We investigate the potential (quantitative and qualitative) correlations between the use of the selected expressive markers and the profile of the chatters (in terms of age and gender) as well as the impact of the synchronicity and (largely) public versus private character of the medium on which the utterances were written.

¹ We sincerely thank Jens Vercammen for the data processing for this variable.

3. Expressive Markers
First of all, the present study includes six typographic expressive markers:
- flooding (i.e. deliberate, expressive repetition) of letters
  e.g. suuuper
- flooding of punctuation marks
  e.g. nice!!!
- combinations of exclamation and question marks
  e.g. wtf?!?
- capitalization of words or entire utterances
  e.g. FAIL
- emoticons
  e.g. dude :P
- typographic rendering of kisses or hugs and kisses
  e.g. Xxxx

The onomatopoeic marker studied in this research is the rendering of laughter in CMC, which includes all variants of haha and hihi.
  e.g. hahahaha

Finally, we added a lexical variable, i.e. the use of intensifiers: “items that amplify and emphasize the meaning of an adjective or adverb” (Stenström, Andersen & Hasund, 2002: 139). In Dutch, these items can either be adverbs or intensifying prefixes.
  e.g. Supermooie t-shirt ‘super nice T-shirt’

4. Corpus and Methodology

4.1. Corpus
Our corpus consists of 400 808 online messages or 2 066 521 tokens². The messages were produced between 2007 and 2013 by adolescents from Dutch-speaking northern Belgium (Flanders), all aged between 13 and 20 years old. The utterances were written on both a synchronous electronic medium (private instant messaging) and an asynchronous electronic medium (private and public messages on a social media site). Table 1 shows the distribution of the tokens over the age and gender groups and the two media. We note that, although there is an imbalance for all three social variables (e.g. more male than female material), the smaller subcorpora are always sufficiently large and thus do not exclude valid testing for the three variables.

² These tokens are the result of splitting the text on whitespace. A token can be a word, but also an emoticon or isolated punctuation marks.

          GIRLS                 BOYS
          YOUNGER    OLDER      YOUNGER    OLDER        total
SYNC.     118 694    176 233     29 146    973 061    1 297 134
ASYNC.    463 277     67 257    162 077     76 776      769 387
total     581 971    243 490    191 223  1 049 837    2 066 521

Table 1: Distribution of variables in the corpus.

4.2. Methodology
The typographic and onomatopoeic expressive markers were automatically detected and counted using Python scripts. The coverage of the software was evaluated and judged accurate on a test set of 1000 randomly chosen posts from the corpus by comparing a human annotator’s feature extraction to the software’s output. The intensifiers were automatically extracted using a predefined list³ (which covered most of the intensifiers used in the corpus) and a frequency cutoff to exclude very infrequent variants. The software’s output was manually screened and filtered. To evaluate the human judgment, finally, a test set of 700 utterances was screened by two annotators, who obtained a low error rate (1.57%).

³ In alphabetical order: (1) bere, (2) echt, (3) echt wel, (4) erg, (5) fucking, (6) gans, (7) heel, (8) kei, (9) kweetniehoe, (10) loei, (11) mass(as), (12) massiv, (13) mega, (14) muug, (15) over, (16) overdreven, (17) so, (18) super, (19) vies, (20) vree, (21) zeer, (22) zo, (23) zot.

5. Results and Discussion
To verify the statistical significance of our quantitative findings, we combined chi square tests with a bootstrapping approach (with Monte Carlo resampling), to obtain more solid results than when performing one single chi square test on the entire data set⁴. The statistical values we report in the next paragraphs (p-values, Cramer’s V scores and odds ratios) are the mean of the values for all samples.

⁴ We thank Giovanni Cassani and Dominiek Sandra for their help and advice in the statistical aspect of the research.

5.1. Quantitative Findings
We quantified the degree of expressiveness by counting all markers in the subcorpora and dividing these counts by the number of tokens in the subcorpora. This approach led to relative expressiveness scores or ratios. The entire data set contained 295 127 expressive markers, which is a ratio of 14.28% (in terms of tokens – in terms of types: 21 427 markers, or a ratio of 11.88%). An overview of the ratios per independent variable is shown in Table 2. The asynchronous posts contain the highest relative number of expressive markers (28.35%), followed by the younger participants’ texts (25.23%) and the girls’ texts (21.77%).

Female                Male
21.77%                9.30%
Younger (13-16)       Older (17-20)
25.23%                7.74%
Asynchronous posts    Synchronous posts
28.35%                5.94%

Table 2: Overview of expressiveness ratios per subcorpus.

General tendencies for the social variables are that the girls use significantly more expressive markers than the boys (p < .001), that younger teenagers use significantly more expressive features than older ones (p < .001) and that significantly more expressive writing is used on asynchronous media (p < .001). These general tendencies also hold for each of the analyzed expressive markers: the female (resp. younger, resp. async.) texts contain each expressive marker significantly more often than the male (resp. older, resp. sync.) texts.

As for the strength of the correlation between the linguistic and independent variables, the strongest correlation can be found for medium (Cramer’s V = 0.31), followed by age (Cramer’s V = 0.24) and gender (Cramer’s V = 0.17). The same order can also be found for effect size: medium has the largest effect size (odds ratio = 6.27), followed by age (odds ratio = 4.02) and gender (odds ratio = 2.71). These scores should be interpreted as follows: the odds that a token contains an expressive marker are 6.27 times higher if the token is produced on the asynchronous platform than when produced on the synchronous platform⁵. Medium seems to be the most interesting independent variable when it comes to expressiveness, as the correlation with the linguistic variables is very high and the actual effect size is large as well.

⁵ Note that these numbers differ from the ratios reported in Table 2. Although both numbers express a similar concept, the calculation behind them is different, as sample sizes of both subcorpora are taken into account to calculate odds ratio and not to calculate the straightforward percentages.

Some expressive features both heavily correlate with the social variables and are used very differently (quantitatively) by the subgroups of the same social variable. This is the case for letter flooding (i.e. deliberate, expressive letter repetition) and the rendition of kisses (e.g. ‘xxx’), especially with regard to medium. The odds ratios are respectively 51.85 (kisses – medium) and 16.33 (letter flooding – medium): for each occurrence of kisses (flooding letters, resp.) in the synchronous chat messages, 51.85 occurrences (16.33, resp.) can be expected in the asynchronous posts.

5.2. Qualitative Findings
On a qualitative level, some constants could be found among all different subgroups. The most popular expressive markers in all groups are emoticons and punctuation flooding (deliberate repetition of question and exclamation marks). These features’ popularity could be due to their ‘explicit’ expressive nature: many emoticons represent facial expressions, and question and exclamation marks are the most expressive punctuation marks. Apparently, because of the explicit nature of these features, they are very obvious and favored markers.

As for letter flooding, we note that in all subgroups, mainly vowels are repeated, and hardly ever plosives. This supports the hypothesis that flooding is the orthographic representation of an oral phenomenon (Darics, 2013: 144), i.e. the lengthening of sounds, which is easiest for vowels and impossible for plosives.

A third general tendency is the top position of the Dutch first person singular pronoun ‘ik’ (I) among the lexemes written in capital letters. As pronouns are function words, they are automatically used more frequently (Newman et al., 2008: 216; Pennebaker, 2011: 27). However, the top position of ‘ik’ could also be symptomatic of the fact that when the teenagers write in a very expressive way, they often talk about something personal. This finding also suggests that quite often entire utterances are written in capitals, as merely capitalizing function words would make less sense (although the chatters could, of course, only emphasize the word ‘I’ in their utterance to stress its importance).

Finally, the qualitative in-depth analyses for each of the expressive markers also lay bare correlations between the independent variables. Strikingly, similar tendencies could be noted for texts written by female participants, by younger teenagers, and on the asynchronous medium. These texts contain a lot more expressive markers related to love and friendship. The most popular emoticons were related to love (e.g. heart-emoticons: <3) and many of the top lexemes that were written in allcaps concerned love or friendship (e.g. ‘LOVEYOU’, ‘BFF’: best friend forever). These results are incongruent with male texts, the texts written by older adolescents or the synchronous posts. For example, while heart-emoticons were much favored by girls, they were at the bottom of the list of the emoticons produced by boys.

However, some caution might be needed when interpreting these correlations, as there is an imbalance in our dataset [...]

These findings correspond to the younger and female teenagers’ preference for expressive markers related to love and friendship. As for medium, however, no correlations have been reported between the way people write on certain platforms and their gender or age. This could thus be an artefact of the imbalance in our dataset. Another possible explanation lies in the nature of our asynchronous texts. Although many posts on the asynchronous medium are public, the interaction often has a largely personal character. Many comments on this social medium involve flirting and/or pleasing (e.g. in positive reactions to other users’ pictures). In this respect, our asynchronous medium differs from other social media, like Twitter, where the writing is less personal and more targeted at informing a wider audience, rather than at bonding or pleasing⁶. The latter focus prevails in our asynchronous data, which could explain the higher rate of love-related expressive markers in this subcorpus.

6. Conclusion
This paper discussed linguistic expressiveness in (Belgian) Dutch informal computer-mediated messages. We included typographic CMC features (e.g. emoticons), an onomatopoeic variable (the rendition of laughter) and a lexical feature (the use of intensifiers) and looked for possible correlations between these linguistic variables and the authors’ profile (gender, age) and the CMC medium. Girls appeared to outperform boys in the use of expressive markers, and so did the younger adolescents compared to the older ones. The results were extremely consistent in this respect: the same tendencies could be observed for each of the expressive markers. Quite strikingly, however, medium appeared to have the largest impact (more expressive writing in asynchronous and largely public than in synchronous and mainly private posts). The qualitative analyses show that girls and younger teenagers produce more love-related expressive markers than boys and older adolescents. And again, remarkably, these types of correlations were found for medium too (with more love-related markers used in the asynchronous than in the synchronous posts).
which could (partially) influence our results: many of the The present research differs from previous research into female participants are also younger adolescents, often expressive markers in CMC in that it includes a wider range writing on the asynchronous medium, whereas many of the of expressive markers (both lexical and typographic) and male participants are also older teenagers, often writing on combines three independent variables (age, gender and the synchronous chat platform. Still, linguistic correlations medium). While gender and to a minor extent age have between gender and age have been reported on before received ample attention in related research, the present (Argamon et al., 2007; Pennebaker, 2011; Schwartz et al., findings highlight the importance of the variable medium. 2013). Stylistic correlations concern the use of function They call for refinement of this variable, since apart from words: men and older people use more articles and (a)synchronicity and the public versus private character of prepositions, whereas younger people and women use more the medium, the character and goal of the interaction seem pronouns, conjunctions and auxiliary verbs (Pennebaker, to be determinant factors too and consequently need to be 2011: 66; Argamon et al., 2007: n.pag.; Schwartz et al., operationalized in future research. 2013: 8-9). On a content-related note, Argamon et al. report that men and older people prefer topics like politics, 7. References religion and business, whereas women and younger people Androutsopoulos, J. (2011). Language Change and Digital prefer discussing home, romance and fun (2007: n.pag.). Media: A Review of Conceptions and Evidence. In: T. 6 We thank Lieke Verheijen for pointing out this difference. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 32 Ljubljana, Slovenia, 27–28 September 2016 Kristiansen & N. Coupland (Eds.), Standard Languages and Language Standards in a Changing Europe. 
Oslo: Novus, pp. 145–161.
Argamon, S., Koppel, M., Pennebaker, J.W., & Schler, J. (2007). Mining the Blogosphere: Age, Gender and the Varieties of Self-Expression. First Monday, 12(9), n.pag.
Baron, N.S. (1984). Computer Mediated Communication as a Force in Language Change. Visible Language, 18(2), pp. 118–141.
Crystal, D. (2001). Language and the Internet. Cambridge: Cambridge University Press.
Darics, E. (2013). Non-verbal Signalling in Digital Discourse: The Case of Letter Repetition. Discourse, Context and Media, 2, pp. 141–148.
Newman, M.L., Groom, C.J., Handelman, L.D. & Pennebaker, J.W. (2008). Gender Differences in Language Use: An Analysis of 14,000 Text Samples. Discourse Processes, 45(3), pp. 211–236.
Pennebaker, J.W. (2011). The Secret Life of Pronouns. What Our Words Say About Us. New York: Bloomsbury Press.
Schwartz, A.H., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M. et al. (2013). Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE, 8(9), e73791.
Stenström, A.B., Andersen, G., & Hasund, I.K. (2002). Non-Standard Grammar and the Trendy Use of Intensifiers. In: A.B. Stenström, G. Andersen & I.K. Hasund, Trends in Teenage Talk. Corpus Compilation, Analysis and Findings. Amsterdam: John Benjamins, pp. 131–163.

French Wikipedia Talk Pages: Profiling and Conflict Detection

Ho-Dac L.-M.(*), Laippala V.(**), Poudat C.(***) and Tanguy L.(*)
(*) CLLE, University of Toulouse, CNRS, UT2J, 5 allées A.
Machado, 31058 Toulouse CEDEX 9, France (**) TIAS, University of Turku, 0014 Turun yliopisto, Finland (***) BCL, University of Nice Sophia Antipolis, 24, avenue des Diables bleus, 06357 Nice CEDEX 4, France E-mail: hodac@univ-tlse2.fr, mavela@utu.fi, celine.poudat@unice.fr, tanguy@univ-tlse2.fr Abstract Wikipedia is a popular and extremely useful resource for studies in both linguistics and natural language processing (Yano and Kang, 2008; Ferschke et al., 2013). This paper introduces a new language resource based on the French Wikipedia online discussion pages, the WikiTalk corpus. The publicly available corpus includes 160M words and 3M posts structured into 1M thematic sections and has been syntactically parsed with the Talismane toolkit (Urieli, 2013). In this paper, we present the first results of experiments aiming at classifying and profiling the talk pages and threads in order to determine criteria for selecting discussions with conflicts. Keywords: French Wikipedia talk pages, conflict detection, data-driven approaches 1. Introduction theless, such features remain indirect markers of conflicts, With the exponential development of the Internet, new as they may be interpreted differently, allowing no clear communicative situations and new genres have come about. distinction between editorial conflicts and vandalism, for The new web genres, which are not yet fully characterized, instance (Potthast et al., 2008; Yasseri et al., 2012; Adler are complex objects challenging the existing methodolo- et al., 2011). Other commonly used criteria include article gies and analysis tools: the Wikipedia encyclopedic project and talk page length, number of revisions in article and talk is one of these new textual objects that can be studied un- pages, number of anonymous edits/users, character or word der the umbrella term Computer-Mediated Communication insertion or deletion between users, article labels, etc. (CMC, (Herring et al., 2013)). 
Wikipedia, which celebrates Such criteria serve as the basis for the automatic detection its 15th birthday this year, is an open and collaborative of quality articles (Wilkinson and Huberman, 2007), con- project, available in numerous languages. The success of flictual pages (Kittur et al., 2007; Vuong et al., 2008; Sumi the web encyclopedia is indisputable, as evidenced by its et al., 2011) or topic categories which are more likely to huge size (5M articles in the English Wikipedia / 1.7M arti- generate conflicts, such as religion and philosophy accord- cles in the French Wikipedia as of June 2016). In addition, ing to (Kittur et al., 2009). Wikipedia is one of the 10 most consulted websites in the Although these studies have provided interesting insights world (Alexa, June 2016). on the evolution of Wikipedia’s organization and collab- Over the last decade, Wikipedia has become a wealth of in- orative edition, the linguistic characteristics of Wikipedia formation which is more and more used by natural language pages remain little explored. In particular, talk pages are processing (NLP) and text mining applications (Ferschke specifically interesting to observe as they are at the heart of & al. (2013) propose an overview of the use of Wikipedia the Wikipedia device. Each article is associated with a talk in NLP). It has also been the subject of many studies in page, where most of the coordination work is done, and social sciences. After the quality of the encyclopedia has where the potential conflicts are discussed and ultimately been established by (Giles, 2005), a large number of stud- resolved in the best-case scenario (Viegas et al., 2007). 
Talk ies use Wikipedia for describing human coordination and pages are the places where editors discuss the modifications collaboration processes (Viegas et al., 2007; Brandes and to be made on the article, including sections to be rewritten Lerner, 2007; Kittur and Kraut, 2008; Stvilia et al., 2008) or suppressed (Ferschke et al., 2012). via the analysis of revisions and talk pages which provide Wikipedia talk pages may be considered as a new discus- evidence of collaborative edition, maintenance work, coop- sion sub-genre. Wikipedia editorial talk pages are indeed eration and conflict resolution (Kittur et al., 2007; Viégas quite specific: (i) they are directly related to the article they et al., 2004). are associated with, and they share a common focus, i.e. ar- Most of these studies do not focus on the linguistic and dis- ticle editing and improvement; (ii) they contain open asyn- cursive aspects of Wikipedia pages, certainly because of the chronous discussions that anyone may edit. In that respect, sprawling structure of Wikipedia (multiplicity of pages and they might be compared to forum discussions except that versions), which makes corpus building quite difficult. As a they rely on a specific Wiki device which has direct conse- consequence, these works mostly rely on network analysis quences on the macrostructure: in spite of clear recommen- or on statistical features extracted from article revision his- dations concerning the form of the postings (level of the tories. 
For instance, an interesting result for our project is answer, mandatory signature and date, etc.), talk pages are that article reverts (when users restore a previous version) often hybrids, combining dialogues whose structure may are proven significant features to detect conflicts (Viégas et not be obvious (as Wikipedians may for instance edit previ- al., 2004; Brandes and Lerner, 2007; Kittur et al., 2007; Suh ous postings), and checklist elements; (iii) they share com- et al., 2007; Kittur and Kraut, 2010; Miller, 2012). Never- mon features referring notably to editing actions, conflict Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 34 Ljubljana, Slovenia, 27–28 September 2016 management and Wikipedia procedures (e.g. NPOV, i.e. Neutrality of Point of View, relevance, source, quality etc.). Politique Conflicts are particularly interesting to observe in France Wikipedia, since they can be considered as frontiers be- tween collaboration and discussion. Antagonistic edits of Featured the article structure and content may indeed lead to dis- agreements and this is quite usual when co-editing, before participants agree on a more stable version of the article. {{calm}} Disagreements may turn to conflicts when the editing pro- cess and/or the discussion process are deadlocked, which leads to an automated report. In such cases, pages are Three kinds of metadata were automatically extracted to tagged with specific labels signaling that a conflict is on- categorize and describe the discussions: going on the article or talk pages (e.g. NPOV or relevance disputes, ”Keep calm” banner). Examples of pages with 1. ”discipline” indicates associated thematic portals, such labels are quite numerous: Abortion in Iran, Bengali cuisine, List of Volvo trucks to cite just a few. 2. 
”avancement” corresponds to article’s quality scale The aim of the present study is twofold: at a descriptive based on Wikipedian assessments2, level, we would like to contribute to the linguistic descrip- tion of Wikipedia talk pages, which have been little ex- 3. ”conflictness” gives information about possible con- plored using linguistic criteria. In particular, few linguistic flicts in the discussion. Such information may be man- studies have been conducted on French Wikipedia (see (De- ually inserted by Wikipedians via the template {{ keep nis et al., 2012) on the detection of conflicting threads or calm }} which adds the following banner at the top of (Poudat and Loiseau, 2007) on the exploration of Wikipedia the talk page3). categories). We will first perform an automatic classifi- cation on the entire set of French Wikipedia talk pages, which were gathered within the WikiTalk Corpus, making the most of the French ”Appel au calme” (keep calm) la- bel, signaling ongoing conflict(s) on the talk page. In or- Discussion structure is encoded according to the following der to have a broader view of the linguistic characteristics TEI elements: of the French Wikipedia talk pages, We will then propose a profiling of the genre, using a mutidimensional analysis •
for threads enabling us to highlight key features and oppositions at a global level. Conflicting threads and pages will be charac- • for topic titles and terized within this global generic profile. • for posts. The WikiTalk corpus is composed of talk pages extracted from the French Wikipedia dump dated May 12th 2015 Table 1 gives a quantitative overview of the WikiTalk cor- which contains 3.5M talk pages. Only 365,612 pages were pus4. kept in the released WikiTalk Corpus. Indeed, 57% of the talk pages were user pages and we chose to remove them, discussions sections posts words even if these talk pages are basically online discussions. 365,612 1,023,841 2,406,514 161,833,298 Only 24% of the remaining talk pages contained more than two words1. Table 1: Quantitative overview of the WikiTalk corpus. The 365,612 remaining talk pages were segmented into threads and posts based on the wikicode. Threads corre- Eight of the extracted talk pages, amounting to 413 posts spond to divisions delimited by (sub)headings signaled by and 47,284 tokens, were manually inspected to evaluate the the wiki markup: /==.*?==/. Posts are delimited accord- extraction process. Results show that 23 posts were not ex- ing to tracted at all and 33 posts were wrongly delimited, among 1. timestamp and an optional user signature, such as: which 25 merged several posts in one. As a result, the Viking59 10 mai 2009 à 17:16 (CEST); or extraction process has an estimated precision of 0.92 and a recall of 0.95. Post attribute values (@who, @when and 2. a change in the interactional level indicated by the @interactionalLevel) were only checked for one talk number of semi-colons (:) in the beginning. page but indicated 100% accuracy. Once threads and posts were delimited, all discussions were formatted according to the TEI-P5 guidelines. Metadata are encoded in the teiHeader as illustrated below with the 2https://en.wikipedia.org/wiki/Wikipedia: element. 
Version_1.0_Editorial_Team/Assessment 3https://en.wikipedia.org/wiki/Template: 11,013,791 (68%) talk pages were blank and 116 432 (8%) Calm consisted in redirections to another talk page. 4Soon available at http://redac.univ-tlse2.fr/ Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 35 Ljubljana, Slovenia, 27–28 September 2016 3. Classification of Conflicting vs. Peaceful The classification is performed using the stochastic gradi- Talk Pages ent method with two-thirds of the corpus used for training and the remaining for testing. As lexical features we use The first tested method consisted in a data-driven compar- lemmas; as syntactic features we use unlexicalized bi-arcs ison of the global linguistic characteristics of two classes composed of two syntax dependencies between tokens with of talk pages, distinguished according to an experimental the actual lexical information deleted but with all other classification of ”conflicting” vs. ”peaceful” talks. The se- information on the syntactic dependency, Part-of-Speech lection criteria used for distinguishing between these two and other morphological features, as illustrated in Fig. 1. classes are based on the Wikipedians’ assessment of the article’s quality and the Wikipedians’ alert regarding con- flict or impoliteness in a talk page. Moreover, only talk pages containing more than 100 words were taken into account. Among those, 2,028 a priori ”conflicting” talks (11M words) were selected according to the following cri- Figure 1: A delexicalized syntactic bi-arc describing a teria: clitic+verb+conjunction as in the clause ‘I find that’. • in teiHeader indicates that the ”keep calm” template was inserted; Syntactic analysis and lemmatisation were provided by the Talismane toolkit (Urieli, 2013). Two levels of text seg- • a parallel talk page was created for discussing the arti- ments were considered: threads and posts. 
Entire pages cle’s neutrality5; were not taken into account because a conflict usually hap- pens inside a thread. In addition, our previous experiments on the page-level have already shown higher scores for the bag of words method (Ho-Dac and Laippala, 2015). In • the page itself is a parallel talk page created for dis- the analysis, we consider, however, that all the posts and cussing the article’s neutrality. threads in a page labeled as conflicting / peaceful are in the same category. Table 2 gives the precision (P) and recall Criterion for selecting 4,569 a priori ”peaceful” talks (8.8M (R) for detecting the ”conflict” category by using the two words) are the following: feature sets on threads and posts. • in teiHeader in- threads posts dicates that the associated article was assessed to be features P R P R ”Featured” or ”A-class”; lemmas 0.84 0.60 0.79 0.69 • a parallel talk page was created for deciding if the ar- bi-arcs 0.55 0.48 0.63 0.59 ticle deserves the ”featured” or ”A-class” status. units 46,690 194,289 Table 2: Comparison of lexical vs. syntactic approaches for the automatic classification of conflicting threads and posts. For the purpose of evaluating our distinction between these two classes while also determining features that may be Results show that the best method for detecting conflict used for selecting talk pages where conflicts may occur, seems to be a classification of threads by using a lexical we trained a text classification model using the Vowpal approach. A closer look on the threads classified with high Wabbit linear classifier (Agarwal et al., 2011). In addition probability and on typical bi-arcs used by the classifier is to being fast and easily adjustable to large corpora, it has necessary for better understanding. the advantage of generating a list of the most significant Even if the precision of more than 80% seems encouraging, features and their relative weights. 
we must admit that these results lead us to question both the Two feature sets were tested for the classification task: features used for classification and our a priori definition lexical features and syntactic features. Classification based of a conflicting talk. Next sections begin to address these on lexical features which considers texts as bags-of-words questions by proposing a range of new features for profiling or bags-of-lemmas is the traditional approach, as for Talk pages in a bottom-up approach and presenting a cur- example (Scott et al., 2006) which propose a keyword rent project of conflict manual annotation in the WikiTalk analysis for reflecting thematic and stylistic features. corpus. Classification based on syntactic features which considers texts as bags-of-syntactic N-grams more or less lexicalized 4. A Bottom-Up Approach to Talk Page is less common (Kanerva et al., 2014; Goldberg et al., 2013). This method enables a more robust analysis on Profiling text characteristics that does not depend on the text topic The automatic classification was supplemented by a sec- but attempts to generalize the level of description beyond ond approach which uses statistical techniques based on lin- individual lexical topics to typical structures (Laippala et guistic features and portals information for discovering talk al., 2015). pages and thread profiles in a bottom-up approach, without a focus on conflict. This method considered all the 366,612 5This possibility seems specific to the French Wikipedia talk pages and used the R package FactoMineR dedicated Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 36 Ljubljana, Slovenia, 27–28 September 2016 to multivariate exploratory data analysis6. Each talk page 5. 
Perspective: Exploring Conflicts at the and thread was automatically described with four types of Thread Level features: In this paper, we have proposed different ways to explore • THEMA: portal sections of the associated article page Wikipedia talk pages; CMC genres are indeed complex ob- knowing that an article may be categorized as belong- jects that challenge our traditional methods and we assume ing such as Art, History, Sport7 up to 7 of the 11 pos- that such objects require different levels of investigation. sible Wikipedia sections (these 11 variables were bi- The profiling step still needs further analysis but is already narised); quite promising. The results of the automatic classification show that the fea- • GLOBAL: general quantitative characteristics (num- tures taken into account and the parameters used for detect- ber of words and posts) and, for entire talk pages, ing conflicting talk pages are still fairly inaccurate. In addi- amount of threads and different contributors, propor- tion our definition of a conflict discussion must be revised. tion of anonymous posts; Several paths are currently being followed, including (i) us- • INTERACT: the frequency of a wide range of inter- ing other criteria, starting with the dimensions with identi- action and politeness cues per talk pages and threads fied in the profiling step; (ii) using more detailed categories, (social deixis, marks of agreement and disagreement); combining the article labels signaling conflicts, and the talk page labels; and (iii) using a dataset of manually annotated • DISCREL: the frequency of connectives for each dis- talk pages. We are currently annotating the threads of 30 course relations as defined in the LEXCONN, ”a French talk pages extracted from the WikiTalk corpus in terms of lexicon of 328 discourse connectives, collected with conflicts (degree, intensity, type) thanks to a CORLI grant8. 
their syntactic categories and the discourse relations We just led a first annotation experience, following the ex- they convey” (Roze et al., 2012). ample of (Denis et al., 2012), which enabled us to bring interesting contrasts to light (Poudat et al., 2016). A Principal Components Analysis on talk pages and threads For the moment, two talk pages have been annotated, to- extracted 5 dimensions that explain around 30% of the total talling 255 threads for which coders have just to indicate variance (29.2% for entire talk pages, 32.4% for threads). if the thread is conflict or not with a very basic definition. The first dimension is simply related to the size of the text As Table 3 shows, around one thread on 2 was annotated as units. The second dimension is more interesting and the conflicting. correlated features differ between talk pages and threads. As for talk pages, it opposes Talk page’s topic # threads # conflicts % • talk pages with politeness cues (thanks, hello, cheers, Bogdanoff brothers 75 37 49.3 please, etc.), formal you (vous) and we (nous) and dis- Psychoanalysis 140 74 52.9 course relations expressing concession, condition and Total 215 111 51.6 temporal relations; to Table 3: Conflicting annotated threads in two talk pages. • talk pages with more discourse relations expressing contrast, background/narration and causality. As for threads, dimension 2 opposes 6. References • threads with agreement cues (ok, agree, of course, yes, Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, no, etc.), formal you and discourse relations express- P., and West, A. G. (2011). Wikipedia vandalism detec- ing alternation, consequence, goal and temporal rela- tion: Combining natural language, metadata, and repu- tions; to tation features. 
In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent • threads with more I, informal we (on) and discourse Text Processing, volume Part II of CICLing’11, pages relations expressing contrast. 277–288, Berlin, Heidelberg. Springer-Verlag. A third dimension that may be relevant gathers together talk Agarwal, A., Chappelle, O., Dudik, M., and Langford, J. pages (as threads) in which more connectives expressing (2011). A reliable effective terascale linear learning sys- narrative relations (then, later, once, before, etc.) and con- tem. JMLR, 15:1111–1133. sequence relations (in this case, in this respect, etc.) occur. Brandes, U. and Lerner, J. (2007). Revision and co- We may also notice that no THEMA features are significant revision in wikipedia: Detecting clusters of interest. for any dimensions. In Proceedings of International Workshop Bridging the More precise details defining these profiles will be pre- Gap Between Semantic Web and Web 2.0, 4th European sented during the presentation, with a focus on extreme talk Semantic Web Conference (ESWC ´07), Innsbruck, Aus- pages and threads on each dimension. Our next goal is to tria. locate conflicting threads in this 5 dimensional space. Denis, A., Quignard, M., Fréard, D., Détienne, F., Baker, M., and Barcellini, F. (2012). Détection de conflits 6http://factominer.free.fr/index.html 7https://fr.wikipedia.org/wiki/Portail: 8TGIR Huma-Num CORLI (Corpus, Languages and Interac- Accueil tions, French National Consortium) Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 37 Ljubljana, Slovenia, 27–28 September 2016 dans les communautés épistémiques en ligne. In TALN- vandalism detection in wikipedia. In Advances in Infor- Actes de la Conférence sur le Traitement Automatique mation Retrieval, pages 663–668. Springer. des Langues Naturelles-2012. Poudat, C. and Loiseau, S. (2007). Représentation et car- Ferschke, O., Gurevych, I., and Chebotar, Y. 
(2012). Be- actérisation lexicale des sciences dans wikipédia. Revue hind the article: Recognizing dialog acts in wikipedia franc¸aise de linguistique appliquée, 12(2):29–44. talk pages. In Proceedings of the 13th Conference of the Poudat, C., Vanni, L., and Grabar, N. (2016). How to ex- European Chapter of the Association for Computational plore conflicts in french wikipedia talk pages? In JADT, Linguistics, pages 777–786. Association for Computa- pages 645–656. tional Linguistics. Roze, C., Danlos, L., and Muller, P. (2012). Lexconn: A Ferschke, O., Daxenberger, J., and Gurevych, I. (2013). A french lexicon of discourse connectives. Discours, 10. survey of nlp methods and resources for analyzing the Scott, M., , and Tribble, C. (2006). Textual Patterns: collaborative writing process in Wikipedia. In The Peo- Key Words and Corpus Analysis in Language Educa- ple’s Web Meets NLP: Collaboratively Constructed Lan- tion. Philadelphia, PA, USA: John Benjamins Publishing guage Resources. Springer. Company. Giles, J. (2005). Internet encyclopaedias go head to head. Stvilia, B., Twidale, M. B., Smith, L. C., and Gasser, Nature, 438(7070):900–901. L. (2008). Information quality work organization in Goldberg, Y., , and Orwant, J. (2013). A dataset of wikipedia. Journal of the American Society for Informa- syntactic-n grams over time from a very large corpus tion Science and Technology, 59(6):983–1001, April. of english books. In Second Joint Conference on Lex- Suh, B., Chi, E. H., Pendleton, B. A., and Kittur, A. ical and Computational Semantics (*SEM), 1. Associa- (2007). Us vs. them: Understanding social dynamics in tion for Computational Linguistics. wikipedia with revert graph visualizations. In Visual An- Herring, S., Stein, D., and Virtanen, T. (2013). Pragmatics alytics Science and Technology, 2007. VAST 2007. IEEE of computer-mediated communication, volume 9. Walter Symposium on, pages 163–170. IEEE. de Gruyter. 
Sumi, R., Yasseri, T., Rung, A., Kornai, A., and Kertész, Ho-Dac, L.-M. and Laippala, V. (2015). Les discussions J. (2011). Characterization and prediction of wikipedia wikipedia : un corpus pour caractériser le genre ”discus- edit wars. In Proceedings of the ACM WebSci’11, pages sion”. In International Research Days Social Media and 1–3, Koblenz, Germany, June 14-17 2011. CMC Corpora for the eHumanities, Rennes, France, oc- Urieli, A. (2013). Analyse syntaxique robuste du franc¸ais : tober. concilier methods syntaxiques et connaissances linguis- Kanerva, J., Luotolahti, J., Laippala, V., , and Ginter, F. tiques dans l’outil Talismane. Ph.D. thesis, Université de (2014). Syntactic n-gram collection from a large-scale Toulouse - Jean Jaurès. corpus of internet finnish. In Proceedings of the Sixth Viégas, F. B., Wattenberg, M., and Dave, K. (2004). Study- International Conference Baltic HLT. ing cooperation and conflict between authors with his- tory flow visualizations. In Proceedings of the SIGCHI Kittur, A. and Kraut, R. E. (2008). Harnessing the wisdom conference on Human factors in computing systems, of crowds in wikipedia: quality through coordination. In pages 575–582. ACM. Proceedings of the 2008 ACM conference on Computer Viegas, F., Wattenberg, M., Kriss, J., and van Ham, supported cooperative work, pages 37–46. ACM. F. (2007). Talk Before You Type: Coordination in Kittur, A. and Kraut, R. E. (2010). Beyond wikipedia: co- Wikipedia. In 40th Annual Hawaii International Con- ordination and conflict in online production groups. In ference on System Sciences, 2007. HICSS 2007, pages Proceedings of the 2010 ACM conference on Computer 78–78, January. supported cooperative work, pages 215–224. ACM. Vuong, B.-Q., Lim, E.-P., Sun, A., Le, M.-T., Lauw, H. W., Kittur, A., Suh, B., Pendleton, B. A., and Chi, E. H. (2007). and Chang, K. (2008). On ranking controversies in He says, she says: conflict and coordination in wikipedia. wikipedia: Models and evaluation. 
Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, Ljubljana, Slovenia, 27–28 September 2016

Slovene Twitter Analytics

Nikola Ljubešić,∗‡ Darja Fišer†∗
∗ Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia
‡ Dept. of Information and Communication Sciences, University of Zagreb, Ivana Lučića 3, HR-10000 Zagreb, Croatia
† Faculty of Arts, University of Ljubljana, Aškerčeva cesta 2, SI-1000 Ljubljana, Slovenia
E-mail: nikola.ljubesic@ijs.si, darja.fiser@ff.uni-lj.si

Abstract
The paper presents the results of metadata analysis in a corpus of 7.5 million Slovene tweets. In our analyses we primarily focus on the weekly and daily posting dynamics, their dependence on the account type (corporate vs. private) and user gender, as well as the dependence of the mentioned variables on retweeting, favoriting, text standardness and text sentiment. Through these analyses we gain insight into both user behaviour on social networks and the available linguistic material.

Keywords: Twitter corpus, meta-data analysis, Slovene language

1. Introduction

The large volumes of content generated by Twitter users as well as Twitter's proactive policy have opened a new avenue of research that is attractive for a wide range of disciplines, including information and computer science, media and communication studies, and linguistics. Twitter analytics has been successfully employed to discriminate between different types of users (Mislove et al., 2011) and behaviour (Pennacchiotti and Popescu, 2011; Rao et al., 2010). With state-of-the-art techniques, a number of latent user attributes can be identified, such as their location (Hecht et al., 2011), gender (Burger et al., 2011), age (Nguyen et al., 2013), occupation (Hu et al., 2016), social class (Borges et al., 2014) and personality type (Quercia et al., 2011).

This paper is our first attempt at Twitter analytics of the Slovene JANES Tweet v0.4 corpus (Fišer et al., 2016a), which contains 7.5 million tweets or 107 million tokens that were posted by nearly 9,000 different users between June 2013 and January 2016. Our goal is to gain insight into user behaviour on social networks and their language characteristics. In addition to the metadata harvested automatically during tweet collection, such as posting time and the number of favourites and retweets, the corpus was enhanced with a set of manually and automatically assigned metadata at both user and tweet level. At user level, account type (private / corporate) and user gender (male / female) were manually assigned, while at tweet level text standardness (completely standard / slightly non-standard / very non-standard) and sentiment scores (positive / negative / neutral) were automatically computed.

2. Related Work

Rios and Lin (2013) have used tweet timestamps to visualize annual tweeting dynamics in different cities all over the world, discovering some interesting cultural differences. Scheffler and Kyba (2016), on the other hand, have examined the morning routine of German Twitter users and have found it to be bound to the social norms of working life.

While gender studies on Twitter predominantly focus on gender classification, Bamman et al. (2012) give a detailed overview of the commonly attributed characteristics of male and female language and behaviour relevant for our study: language standardness (women more standard than men), communication style (men more informative, women more involved), and characteristic vocabulary (with women exhibiting more distinct features than men, such as frequent use of emoticons, expressive lengthening of words, repeated exclamation marks, etc.).

The typology and granularity of user types vary greatly in the literature. While they typically exceed the two classes used in our corpus, most researchers distinguish organizations, such as news media outlets and public institutions, from other users. Arakawa et al. (2014) come closest to our corporate users with their organizations category, which they were able to classify with the highest accuracy. They report that organizations posted the highest number of tweets whose objective is to transmit information, and that such tweets are characterized by a distinctly high use of nouns, polite language, hashtags, URLs and retweets.

The relationship between gender and subjective language in tweets has been explored for English, Spanish and Russian by Volkova et al. (2013), who have shown that there are substantial differences in the use of subjective words (e.g. weakness, which is used to express positive sentiment by women and negative by men), hashtags (e.g. baseball, which expresses positive sentiment by men and negative by women) and emoticons (with women using more emoticons overall than men in English and Spanish but, interestingly, not in Russian) and that these differences can improve sentiment classification.

3. Posting Dynamics

The first part of our statistical analyses focuses on the volume of posts, retweets and favourites. We inspect the weekly and daily posting cycles and their dependence on the account type (private vs. corporate) and user gender (male vs. female).

Figure 1: Probability distribution of tweets by day of week (Mon to Sun), separated by source (left: private vs. corporate) and gender (right: male vs. female).

Figure 2: Probability distribution of tweets by hour of day (0 to 23), separated by source (left: private vs. corporate) and gender (right: male vs. female).

            retweeted   favorited
private          8.5%       30.2%
corporate       16.3%       18.0%
male             9.4%       29.2%
female           6.8%       32.9%
Table 1: Probabilities of tweets being retweeted or favorited, given the account type and user gender variables.

Finally, we inspect the dependence between these two variables and post retweeting and favouriting.

3.1. Weekly Posting Cycle

The weekly posting cycle is presented in Figure 1, where the graph on the left shows distributions for private and corporate accounts while distributions for male and female users of private accounts are displayed on the right. We can see that while the overall volume of tweets posted is higher on weekdays, corporate users are dominant during the week and private ones on weekends. This is not surprising, but it does have important implications for the topics and the language of the tweets published during the week vs. on weekends. Gender-wise, the distributions are very similar to those by type of user, with male users prevailing mid-week and females on weekends.

3.2. Daily Posting Cycle

Figure 2 shows the daily posting cycle, with user behaviour per account type displayed on the left and behaviour per user gender, limited to private accounts, on the right. As expected, the tweeting volume of corporate users peaks during morning hours (11 a.m.) while private users are most active in the evening (9 p.m.). Interestingly, both types of users have a secondary peak that coincides with the period of the major peak of the other group. In terms of user gender, male users dominate slightly from 1 a.m. to 3 p.m., after which female users take over and are more active throughout the afternoon and evening. This suggests that male users display behaviour somewhat similar to corporate accounts, while females display a distinct private-use behaviour, tweeting in their spare time after work.

3.3. Retweets and Favorites

Next we compare the retweeted and favorited variables on one side with the source and gender variables on the other. We operationalise the retweet and favorite variables as binary variables that are true if a tweet was retweeted or favorited, respectively. We present the percentages of the retweeted or favorited tweets given the account type or user gender in Table 1.

We begin by inspecting the dependence of the source variable and the retweet variable. The probability of a corporate tweet being retweeted is almost twice as high as for private tweets (16.3% vs. 8.5%), which was to be expected as the primary function of most corporate tweets is information dissemination. The chi-square test of independence shows that the source and retweet variables are not independent, with $\chi^2(1, N = 7503200) = 74308$, $p < .001$.

Similarly, analysing the dependence of the source variable and the favorite variable, we find that private tweets are favorited considerably more often than corporate tweets (30.2% vs. 18.0%, i.e. about 1.7 times as often), which is again consistent with the communicative role of private posts, which have a strong community- and relationship-building role. The chi-square test of independence shows a relationship between the source and favorite variables with $\chi^2(1, N = 7503200) = 80215$, $p < .001$.

Moving to the comparison with the gender variable, we first inspect the dependence of the gender and the retweet variables. Male tweets are 38% more likely to be retweeted than female tweets. The chi-square test of independence shows a relationship between these two variables with $\chi^2(1, N = 7503200) = 11714$, $p < .001$.

By comparing the gender and favorited variables, we calculate that a female tweet is 13% more likely to be favorited than a male tweet. The chi-square test of independence shows a relationship between the gender and favorite variables with $\chi^2(1, N = 7503200) = 8913.4$, $p < .001$.

The presented results again suggest that male Twitter users behave more like corporate users and females are more aligned with the private Twitter accounts.

4. Language Standardness

The second part of our statistical analyses inspects the linguistic characteristics of tweets posted by the different groups of users. Due to space constraints, we only present the results for the language standardness scores assigned to each tweet in the corpus via a regression model (Ljubešić et al., 2015). The percentage of tokens normalised via CSMT (Ljubešić et al., 2016), which was also computed, behaves consistently with the text standardness results.

We inspect the relationship of the account type and the user gender variables on the one hand and the continuous standardness variable (ranging from 1 to 3) on the other. The resulting plot is presented in Figure 3. We can see that tweets posted by private and corporate users differ significantly regarding linguistic standardness, with corporate users showing a much stronger tendency towards standard language, which is not surprising given their communicative goal. Male and female users are much more similar in this respect, but male users tend to produce more standard tweets, while female ones produce more semi- and non-standard ones, which is an interesting finding that deserves a closer examination in future work.

Figure 3: Distribution of the three standardness levels by account type (left) and user gender (right); x-axis: standardness score (1.0 to 3.0), y-axis: density. Lilac represents distribution overlap.

Given that the difference in text standardness by user gender presented in the right plot of Figure 3 is minor, we perform the chi-square test of independence, which shows a relationship between user gender and tweet standardness with $\chi^2(1, N = 7503200) = 9740.9$, $p < .001$. For this test we operationalise the standardness variable as a binary variable, discarding tweets that are slightly non-standard according to the discrete standardness variable (three levels). While 24% of the remaining female tweets are estimated as being non-standard, for males the percentage is 19.6%.

Next, we plot the distribution of the discrete standardness variable (three levels) in the daily posting cycle in Figure 4. As expected, tweets are the most standard in the early morning hours (7 a.m.), which is probably an effect of corporate accounts of newspapers and other media posting links to new content for the day. As the day progresses, the proportion of slightly non-standard tweets rises steadily. The proportion of very non-standard tweets rises as well, but only slightly until the late evening hours (after 11 p.m.), when it picks up and peaks at around 3 a.m.

Figure 4: Standardness by hour of day (0 to 23), standard represented with blue, slightly non-standard with lilac, very non-standard with red.

5. Sentiment Analysis

Finally, we look into the relationship of the account type and user gender variables with the sentiment score automatically assigned to each text in the Janes corpus using SVM (Fišer et al., 2016b). The three variables are compared in Figure 6. While corporate users post predominantly positive tweets and private users more neutral and negative ones, male users post slightly more negative posts and female users take the lead in the positive ones.

Figure 6: Distribution of the sentiment (negative, neutral, positive) by source (left) and gender (right).

Again, given the close results on user gender, we perform the chi-square independence test of the user gender variable and the binary positive / negative text sentiment variable, thereby discarding neutral tweets. The test shows a relationship between the gender and the sentiment variables with $\chi^2(1, N = 7503200) = 6179.8$, $p < .001$.

In order to gain more insight into how the sentiment of Slovene Twitter users varies throughout the week, we plotted the relationship of posting day and tweet sentiment in Figure 5. Disregarding the neutral tweets, which prevail every day of the week, we can see that users start the week with a distinctly negative attitude which peaks on Thursday and then starts decreasing on Friday, so that positive sentiment prevails during the weekend, peaking on Saturday.

Figure 5: Distribution of the sentiment (negative, neutral, positive) by day of week among private users.

The relationship between sentiment and standardness among private users is examined in Figure 7. Disregarding the neutral tweets that are prevalent across the board, positive sentiment prevails in very non-standard posts while the opposite is true at the other end of the spectrum. Our plan is to investigate this dependence in more detail in future work.

Figure 7: Distribution of the sentiment by language standardness (L1 - completely standard, L2 - slightly non-standard, L3 - very non-standard).

6. Conclusions

In this paper we carried out an analysis of a series of extralinguistic and linguistic variables in a large corpus of Slovene tweets. Among our many findings, the most interesting ones are that there are big differences between the tweeting behaviour, content and treatment of corporate and private tweets, and that these differences are aligned with the primary communicative functions of the two types of Twitter users. Private male users tweet more than female users during weekdays while female users dominate on weekends. Male users tweet more in the morning hours while female users take the lead in the afternoon and evening. Male users use more standard language than female users, and standard language overall is most frequent in the early morning hours. Female users express more positive sentiment in their posts than their male counterparts, which is the prevalent sentiment overall, while both tend to be more positive on weekends than during the week.

While the results are difficult to compare directly with the related work, the results obtained for the communication behaviour and styles of private and corporate users closely resemble the ones reported by Arakawa et al. (2014), Scheffler and Kyba (2016) and Volkova et al. (2013). The most striking difference between our results and related work is the language standardness level, which is higher in male users, contrary to what Bamman et al. (2012) have observed.

In the future we plan to extend our work with comprehensive statistical content and linguistic analyses. We also wish to compare the results with other text genres in the JANES corpus, such as blog posts, forum messages, news comments and Wikipedia talk pages. Finally, we envisage comparing the results with similar languages, such as Croatian and Serbian.

7. Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project "Resources, Tools and Methods for the Research of Nonstandard Internet Slovene" (J6-6842, 2014-2017).

8. References

Arakawa, Y., Kameda, A., Aizawa, A., and Suzuki, T. (2014). Adding Twitter-specific features to stylistic features for classifying tweets by user type and number of retweets. Journal of the Association for Information Science and Technology, 65(7):1416–1423.
Bamman, D., Eisenstein, J., and Schnoebelen, T. (2012). Gender in Twitter: Styles, stances, and social networks. CoRR, abs/1210.4567.
Borges, G. R., Almeida, J. M., Pappa, G. L., et al. (2014). Inferring user social class in online social networks. In Proceedings of the 8th Workshop on Social Network Mining and Analysis, page 10. ACM.
Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating Gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1301–1309, Stroudsburg, PA, USA. Association for Computational Linguistics.
Fišer, D., Erjavec, T., and Ljubešić, N. (2016a). The compilation, processing and analysis of the JANES corpus of Slovene user-generated content. Slovenščina 2.0, 4(2):67–100.
Fišer, D., Smailović, J., Erjavec, T., Mozetič, I., and Grčar, M. (2016b). Sentiment Annotation of the Janes Corpus of Slovene User-Generated Content. In Proceedings of the 10th Conference on Language Technologies and Digital Humanities.
Hecht, B., Hong, L., Suh, B., and Chi, E. H. (2011). Tweets from Justin Bieber's Heart: The Dynamics of the Location Field in User Profiles. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 237–246, New York, NY, USA. ACM.
Hu, T., Xiao, H., Luo, J., and vy Thi Nguyen, T. (2016). What the Language You Tweet Says About Your Occupation.
Ljubešić, N., Fišer, D., Erjavec, T., Čibej, J., Marko, D., Pollak, S., and Škrjanec, I. (2015). Predicting the Level of Text Standardness in User-generated Content. In Proceedings of Recent Advances in Natural Language Processing, pages 371–378.
Ljubešić, N., Zupan, K., Fišer, D., and Erjavec, T. (2016). Normalising Slovene data: historical texts vs. user-generated content. In Proceedings of KONVENS 2016.
Mislove, A., Jørgensen, S., Ahn, Y.-Y., Onnela, J.-P., and Rosenquist, J. (2011). Understanding the Demographics of Twitter Users, pages 554–557. AAAI Press.
Nguyen, D.-P., Gravel, R., Trieschnigg, R., and Meder, T. (2013). "How old do you think I am?" A study of language and age in Twitter.
Pennacchiotti, M. and Popescu, A.-M. (2011). A machine learning approach to Twitter user classification.
Quercia, D., Kosinski, M., Stillwell, D., and Crowcroft, J. (2011). Our Twitter profiles, our selves: Predicting personality with Twitter. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on, pages 180–185. IEEE.
Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, SMUC '10, pages 37–44, New York, NY, USA. ACM.
Rios, M. and Lin, J. (2013). Visualizing the "pulse" of world cities on Twitter.
Scheffler, T. and Kyba, C. C. (2016). Measuring social jetlag in Twitter data. In Tenth International AAAI Conference on Web and Social Media.
Volkova, S., Wilson, T., and Yarowsky, D. (2013). Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1815–1827.
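As a worked illustration of the statistics used in Sections 3.3 to 5 of the paper above, the following Python sketch recomputes the relative differences quoted in Section 3.3 from the percentages in Table 1 and shows how a Pearson chi-square test of independence with one degree of freedom can be computed for a 2x2 table. It is a minimal sketch under stated assumptions, not the authors' analysis code: the helper names `relative_increase` and `chi_square_2x2` are hypothetical, and the 2x2 counts are invented toy values, because the paper reports only percentages and the total N, not the per-group tweet counts.

```python
import math

# Percentages copied from Table 1 (share of tweets retweeted / favorited).
TABLE_1 = {
    "private":   {"retweeted": 8.5,  "favorited": 30.2},
    "corporate": {"retweeted": 16.3, "favorited": 18.0},
    "male":      {"retweeted": 9.4,  "favorited": 29.2},
    "female":    {"retweeted": 6.8,  "favorited": 32.9},
}


def relative_increase(group_a, group_b, variable):
    """Return by how many percent group_a's rate exceeds group_b's rate."""
    rate_a = TABLE_1[group_a][variable]
    rate_b = TABLE_1[group_b][variable]
    return (rate_a / rate_b - 1.0) * 100.0


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). Assumes all row and column totals are
    non-zero. With one degree of freedom the p-value equals
    erfc(sqrt(statistic / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    print(round(relative_increase("male", "female", "retweeted")))  # 38
    print(round(relative_increase("female", "male", "favorited")))  # 13

    # HYPOTHETICAL toy counts (rows: two groups, columns: flag true / false);
    # these are not taken from the JANES corpus.
    toy_counts = [[10, 20], [30, 40]]
    statistic, p_value = chi_square_2x2(toy_counts)
    n = sum(sum(row) for row in toy_counts)
    print(f"chi2(1, N = {n}) = {statistic:.3f}, p = {p_value:.3f}")
```

The first two printed values match the "38% more likely" and "13% more likely" figures in Section 3.3. With counts in the millions, as in the corpus, even small percentage differences yield very large chi-square statistics, which is in line with the magnitudes reported in the paper.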
A Textometrical Analysis of French Arts Workers “fr. Intermittents” on Twitter

Julien Longhi, Dalia Saigh
Cergy-Pontoise University, AGORA
E-mail: julien.longhi@u-cergy.fr, dalia-saigh@hotmail.com

Abstract
The term "social media" is increasingly used and tends to replace the term Web 2.0. Through social networks, people create various relationships. The aim of this paper is to describe how communities of users interact with each other on a specific subject, especially on Twitter. The theme that we will study is the controversy concerning French arts workers (fr. intermittents). We will conduct a textometrical analysis using the software Iramuteq and then explain the statistical results.

Keywords: social media, Twitter, intermittents, textometrical analysis, Iramuteq

1. Introduction

The term "social media" is increasingly used and tends to replace the term Web 2.0. Through social networks, people interact and create various relationships. In their exchanges, they establish content, organize, modify, and combine it with personal creations. Despite authors’ freedom of expression and drafting, the content structure must obey rules of writing that are specific to each medium.

The aim of this paper is to analyze and describe how communities of users interact with each other on a specific subject. In our study, the theme is the controversy concerning French arts workers on Twitter: a microblogging service that is hugely successful in spite of its particular working principle, namely blogging through ultra-short messages of up to 140 characters. This feature lets information flow faster but requires authors to be very concise when writing a tweet.

We will first describe the context and methodology for building our corpus. Then, we will introduce the method that we adopted for the textual analysis of this corpus, entitled #intermittent (arts workers). We will also present Iramuteq, an analytical software tool that we have selected for this purpose, and explain some of the statistical results achieved.

2. Corpus Building: Background and Methodology

In March 2014, social partners signed a new agreement concerning the unemployment benefits for French arts workers. This text, which became the convention of 14 May 2014 on unemployment benefits, aroused concerns and opposition among the arts workers. A protest movement and mass demonstrations took place in Paris and in other French cities and lasted for several days.

These reactions rapidly invaded social networks, especially Twitter. Millions of tweets were written as soon as the first information about this controversy emerged.

2.1 The Project Goal

The finalization of this corpus was made possible thanks to financial support from Ortolang¹. The funding request centred around the finalization of the corpus-building process. The corpus is composed of tweets containing the hashtag sign (#) followed by the word for “arts workers”. These are listed in a database of 13 074 tweets with #intermittent(s), distributed over 4 617 twittos (Twitter users) and covering the period from June to September 2014, when tensions stepped up a notch and movements intensified.

Through the constitution of the corpus #intermittent, we hope to obtain a corpus which enables us to work on this kind of discourse (tweets related to a controversial topic), to characterize it and understand it under different forms, in order to extend previous research (Longhi 2006, 2008) that focused on French arts workers in 2003/2004.

¹ https://repository.ortolang.fr

2.2 Data Building: the Choice of Data

After having contacted Twitter and having obtained confirmation that we had the right to collect and use information available on the site², we started tweet collection. This step was guided by the following process:

In 2014: retrieval of 13 074 tweets with #intermittent posted by 4 617 people.

In 2015: we established a threshold of at least 10 tweets with #intermittent: we obtained 215 accounts that had produced at least 10 tweets explicitly referenced as belonging to this theme (in order to have representative accounts). By collecting all the tweets from these 215 people, we gathered 586 239 tweets that included 10 876 tweets with #intermittent. The corpus #intermittent corresponds to these 10 876 tweets.

For the proper conduct of this process, we made, in collaboration with project participants from the field of Computing (Boris Borzic and Abdulhafiz Alkhouli), a selection of data and metadata. For this, our colleagues developed a customized application. The application
1) uses the Twitter API: using ten functions of the API according to our needs, and recovering all the information in JSON format that we then convert;
2) allows the database to be enriched with a clean basic design (ten tables, fifty fields). Then we have programs that calculate indices for enriching additional fields;
3) allows customized export, with the information stored in a range of data formats.
The challenge for a linguistic approach is to use this material to develop the #intermittent corpus.

These tweets were then formatted in TEI (with the CMC format extensions proposed by a European group) to become a corpus, in order to meet the institutional requirements of the CoMeRe³ project and allow us to carry out a discourse analysis with text-processing tools on the corpus #intermittent or future corpora.

² http://scinfolex.com/2009/06/14/twitter-et-le-droit-dauteur-vers-un-copyright-2-0/
³ http://corpuscomere.wordpress.com

3. Textometrical Analysis of the Corpus #Intermittent

Textometry offers an instrumented approach to corpus analysis, articulating quantitative syntheses and analyses that include the text (Lebart & Salem, 1994). Functionally, textometry implements differential principles. The approach highlights similarities and differences observed in the corpus according to the representation dimensions considered (lexical, grammatical, phonetic, or prosodic ones, etc.). In addition to providing sorting procedures and statistical calculations for the study of digital corpora of texts, textometry establishes contextual and contrastive modeling. Thus, the text is characterized by its words in relation to their use in the corpus, the word is characterized by its co-occurrences, etc. (Pincemin, 2011).

Textometry is particularly relevant to corpus exploitation in human and social sciences. It simultaneously enables a detailed and global observation of different texts while remaining close to them, and highlights the fact that language is an important observation field for human and social sciences.

3.1 Iramuteq: the Text Analysis Tool

The Iramuteq software offers a set of analysis procedures for the description of a textual corpus. One of its principal methods is Alceste. This allows a user to segment a corpus into “context units”, to make comparisons and groupings of the segmented corpus according to the lexemes contained within it, and then to seek “stable distributions” (Reinert, 1998). In addition to the Alceste method, Iramuteq provides other analysis tools including prototypical analysis, similarities analysis, and word-cloud analysis. All of these methods allow the users of this tool to map out the dynamics of the discourses of the different subjects engaged in interaction (Reinert, 1999).

3.2 The Corpus Structure

Input files for Iramuteq must be in text format (.txt) and observe the following formatting rules:

The basic unit is called "text". A text can represent an interview, an article, a book or any other type of document. A corpus may contain one or more texts (but at least one). The texts are introduced by four stars (****) followed by a series of starred variables separated by a space. It is possible to put starred variables within the text by beginning the line with a hyphen followed by a star (-*). These are known as "themes". The line should contain only this variable.

For our corpus, we have chosen a format with three representative variables. We called the first "intermittent", because it constitutes the key word of this corpus. The second holds the username, which is why this variable changes from one tweet to another depending on who posted the tweet. The third variable allows the number of tweets sent by a twittos to be counted, as well as the retweets.

The figure below shows the formatting of the corpus #intermittent:

Figure 1: The format of the corpus #intermittent.

4. Methods and Results of the Analysis

4.1 The Word-Cloud

Iramuteq contains an option that produces a kind of lexical compendium of a document, in which the key concepts discussed are represented by a size unit (in the sense of the typographic weight used). This allows their importance within the corpus to be highlighted. Specifically, the more often a keyword is quoted in an article, the bigger it will appear in the word cloud. This technique will allow us to bring out the keywords used by twittos.

Figure 2: The word-cloud of the corpus #intermittent.

This word-cloud highlights the most common occurrences in tweets. These lexical items are positioned centrally in the cloud. The occurrence "intermittent" is the largest because, as the key word of our corpus, it has the highest frequency. That word is followed by specific markers such as "co" and "http" that refer to links shared on Twitter. Indeed, these links are automatically abbreviated (to the form http://t.co/...) to allow long URLs to be shared without exceeding the maximum number of characters allowed when writing a tweet. There is also the sign "rt", which means "retweet". This has the function of reposting the tweet of another person, enabling users to quickly share it with all subscribers.

Around those keywords are others which have more or less the same frequency and thus appear the same size. Among them are those that refer to the semantic field of the republic and the French government, such as Manuel Valls, Republique (republic), député (deputy), français (French), F.Hollande, Fillipetit, ministre (minister). Other lexical items evoke either movements or activities, such as accord (agreement), grève (strike), mobilisation (mobilization), manifestation (protest), convention (convention), combat (fight). There are also nouns or adjectives referring to French arts workers and describing their situation, such as chômeurs (unemployed workers), précaires (precarious), interluttents, comédiens (actors).

Despite the interest of this method, the resulting description remains very general. For a more detailed analysis, Iramuteq offers another graphical representation of a corpus’ words, a significant method called “similarities analysis”, which retains the idea of size proportional to the frequency, but introduces the relations of co-occurrence between words.

4.2 Similarities Analysis

Similarities analysis is a technique based on graph theory (Flament, 1962). It presents the structure of a corpus in a graphical format, distinguishing between the shared parts and the specificities of coded variables. This allows the link between the different forms in the text segments to emerge (Marchand & Ratinaud, 2012).

Figure 3: The similarities analysis of the corpus #intermittent.

The first observation that we can make is that this corpus is very homogeneous, with one central idea around which the greatest part of the lexicon of our corpus revolves. This figure shows a single main cluster, with some others which are very small and not relevant. This cluster consists of a word cloud which contains the key word "intermittent" at its center, around which a very dense and related lexicon is grouped.

We notice the presence of some small groups within the main cluster which are directly related (with edges) to "intermittent", the most important one. Among these groups there is “http”, in which we find the term intermittentdespectacle (arts workers) and, a little further, a small cluster containing the name Gregory Mathieu, a sociologist who wrote a book with the title “Les intermittents du spectacle. Enjeux d’un siècle de luttes”. So, in this group, we understand that the majority of links mentioned in tweets refer users to web pages where the name of the sociologist is mentioned.

There is also the “rt” group, which includes the following terms: chronculture, pullmarin, dinamopress, angelin..., which refer to the names of people who have retweeted the most. The "co" group is, as explained above, the abbreviated form of links on Twitter.

We can already understand from this figure that the #intermittent corpus contains a lot of links and retweets related to French arts workers, and that it describes their various actions and their status (highlighted by the cluster précaires (precarious)).

That being said, as the lexicon related to the keyword "intermittent" is very dense, the similarities analysis function has simply helped us to describe the nature and the main topic of tweets (tweets with links, retweets, arts workers status...). To further clarify the corpus structure, we will use the HDC “Hierarchical Descending Classification” function (a method established by Max Reinert).

4.3 The Hierarchical Descending Classification

One method used by Alceste is the hierarchical descending classification. This method offers a global approach to a corpus. After partitioning the corpus, the HDC identifies statistically independent word classes (forms). These classes are interpreted through their profiles, which are characterized by specific correlated forms. The HDC shows this using a dendrogram.

Figure 4: The result of the Hierarchical Descending Classification.

Two groups are distinguished in this figure, the first with two related classes (class 1 and class 2), and the second where there is only one class (class 3).

Class 1 includes forms associated with the different protest movements of French arts workers, such as the occupation of streets, theaters and other places, and the demonstrations in Paris and elsewhere.

Here is an extract of characteristic segments (with a high score), which contain the most common words associated with class 1, like manif (event), cipdfjournée (cip-idf day), action (action); the common words are highlighted in red:

**** *intermittent *CQFjournal *tweet10 5.
Conclusion score :1458.88 A Textometrical analysis of this corpus has allowed us to rt cipidf journée d action paris 10h république 14h manif ministère- du travail 127 rue de grenelle intermittents précaires htt t see how twittos have reacted to the announcement of the new unemployment insurance system related to French **** *intermittent *CIP_IDF *t weet782 arts workers. Through the analysis of similarities, we have score : 1431.31 found that there were a lot of links pointing to this topic Rvd paris journée d actions coordonées 11h devant bourse du travail with references to the sociologist Mathieu Grégoire and 3 rue du château d eau intermittents précaires httt co oz3kijtjuc his various texts, and also newspaper names and publica- Figure 5: The characteristics segments of class 1. tions including Le Monde. There were also various re- tweets and thanks to this, the issue has become in a short Class 2 refers to strikes held by the French arts workers time a "trending topic" on Twitter. This is due to the vari- and their different concerts and show cancellations. This ous markers such as #, URLs, the @ sign . . The Reneirt class contains words such as: grève (strike), festival (festi- method (HDC) taught us that discourse around this sub- val), annulé (cancelled). The following figure shows the ject is divided into two different sets. On the one hand, characteristics segments of this class: tweets that describe the precariousness of French arts workers and their various protest movements against the **** *intermittent *CIP_LR* tweet155 new regime. On the other hand, tweets denouncing the score :2058.76 impartiality of the agreement, with links providing infor- intermittents rencontres photos arles la grève a été votée pour lundi 7 juillet jour de l ouverture du festival le vernissage annulé mation about that act and citing various political personal- ities who were involved in the controversy. **** *intermittent *cie813* tweet48 score :1877.02 6. 
References second soir de grève et d annulations au printemps des comédiens à montpellier opéra occupé représentation traviata annulée intermit- Flament, C. (1962). L’analyse de similitude. Cahiers du tents centre de recherche opérationnelle, 4, pp. 63--97 Figure 6: The characteristics segments of class 2. Lebart, L., Salem, A. (1994). Statistique textuelle. Paris: Dunod. Class 3 concerns the tweets that talk about the unemploy- Longhi, J. (2006). De intermittent du spectacle à intermit- ment insurance system related to the French arts workers tent: de la représentation à la nomination d’un objet du and political entities involved in this affair. Here is a char- discours. Corela, 4 (2). URL : acteristics segment summarizing the words associated http://corela.revues.org/457. with this class, including medef, valls, samuelchurin, au- Longhi, J. (2008). Sens communs et dynamiques séman- relifil: tiques : l’objet discursif intermittent. Langages, 170, pp. 109--124. **** *intermittent *cie813* tweet48 Marchand, P., Ratinaud, P. (2012). L’analyse de similitude score :1877.02 appliquée aux corpus textuels: les primaires socialistes second soir de grève et d annulations au printemps des comédiens à pour l’élection présidentielle française (septembre- montpellier opéra occupé représentation traviata annulée intermit- tents octobre 2011). Actes des 11èmes Journées internatio- nales d’Analyse statistique des Données Textuelles. **** *intermittent *AFARfiction *tweet42 JADT, 2012, pp. 687--699. score :403.83 rt jp_gille intermittents je viens de remettre mon rapport à manuel- Pincemin, B. (2011). Sémantique interprétative et texto- valls premier ministre avec aurelifil et frebsamen httt cocb métrie . Corpus, 10. URL: Figure 7: The characteristics segments of class 3. http://corpus.revues.org/2121. Reinert, M. (1998). Quel objet pour une analyse statis- These results demonstrate that unlike the written press tique du discours? 
Quelques réflexions à propos de la which showed a plurality of views concerning the seman- réponse Alceste. Actes des 4èmes journées Internatio- tic representation of the word “intermittent” (see Longhi, nales d’Analyse Statistiques des Données textuelles. 2006) which was seen whether as a status ( statut), a pro- URL : http://lexicometrica.univ- fession ( métier) or in the dynamics of these two semantic paris3.fr/jadt/jadt1998/reinert.htm. components. Here, the word “intermittent” is presented Reinert, M. (1999). Quelques interrogations à propos de using three different senses "system" ( régime), "status" l’objet d’une analyse de discours de type statistique et ( statut) and "fight” ( lutte). This indicates that Twitter fo- de la réponse « Alceste ». Langage et société, 90 (1), cuses on the status side and declines it by introducing the pp. 57--70. French arts workers insurance system (one way of looking Corpus CoMeRe: https://corpuscomere.wordpress.com at the status) or the consequence of this status (fight). Iramuteq: www.iramuteq.org Ortolang: https://www.ortolang.fr/market/home Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 47 Ljubljana, Slovenia, 27–28 September 2016 The Use of Alphanumeric Symbols in Slovene Tweets Dafne Marko Faculty of Arts, University of Ljubljana E-mail: dafne.marko@gmail.com Abstract This paper deals with the use of alphanumeric symbols in Slovene tweets. We use the JANES corpus, a large corpus of internet Slovene containing tweets, forum and blog posts, and comments on news articles and on Wikipedia discussion and user pages. We analyze the use of words consisting of alphabetic and numeric symbols as a means of both creative writing and as a word-shortening strategy. We investigate which alphanumeric features are most frequently used in Slovene tweets and identify the numerals substituting the letters. 
The results are compared with other subcorpora in the JANES corpus as well as with the Kres corpus, a collection of standard Slovene texts. Furthermore, we compare the distribution of alphanumeric features according to user type and text standardness.

Keywords: alphanumeric symbols, letter/number homophones, Slovene tweets, computer-mediated communication

1. Goal of the Paper

The main goal of the paper is to research the occurrence of a CMC-specific linguistic feature – words consisting of alphabetic and numeric characters. We focus on the subcorpus of Slovene tweets, but also compare the distributions with other subcorpora in the JANES corpus (forum posts, blog entries, comments on news articles, Wiki talk) as well as with the Kres corpus, a corpus of standard Slovene with a balanced genre structure. We expect to find no or very few occurrences in the Kres corpus, proving that it is, indeed, a CMC-specific linguistic feature. We also expect the Twitter subcorpus to be the most abundant with this sort of writing, whereas no significant difference between other subcorpora is expected. We research whether gender (male vs. female) or user type (corporate vs. private) influences the use of alphanumeric characters in words. Furthermore, our goal is to carry out a detailed analysis of the most frequently used words with alphanumeric symbols in Slovene tweets. We try to investigate which numeric symbols (numerals) are used to substitute the letters and whether they are used phonetically (e.g., ju3, translated as 2morrow, with a number 3 pronounced as /tri/, the same as in the word jutri) or graphically (e.g., g33k, where the number 3 represents the letter “e”). With our analysis, we present a linguistic phenomenon which could be described as a type of creativity in writing, and – given that the length of a single tweet is limited to 140 characters – also as a word-shortening strategy.

2. Related Work

So far, little research has been done on the use of alphabetic and numeric characters in tweets. Most researchers deal with text messaging or texting as a relatively new writing medium. It has been pointed out that “/t/here are a great deal of apparent similarities between Twitter and text messaging” since “they are both a medium via which friends and acquaintances can communicate with one another, and they both fall under the broad banner of technology mediated communication” (Denby, 2010). Another shared characteristic is character limitation – Twitter imposes an explicit message length limit of 140 characters, and in text messaging, the limitation is 160 characters. Although there are several differences between the aforementioned writing media (public vs. private, cost of specific service, device used for text messaging and Twitter posts, etc.), research on texting could help us get a deeper insight into the characteristics of another CMC phenomenon – specific linguistic features in Twitter posts.
Most researchers claim that “shortenings are presented to be the one major characteristic of text messaging that is assumed to be technologically determined by the limited number of permitted characters and the cumbersome input via the small cellular phone keypad” (Bieswanger, 2006). Language used in texting, or what Crystal (2001) refers to as Netspeak, is assumed to be “heavily abbreviated” (Thurlow, 2003), although Thurlow reports “relatively few (n = 73) examples of language play using letter-number homophones (e.g. Gr8 'great', RU 'are you'), which, in popular representations at least, have become the most definitive feature of text-messaging”. Some authors claim that Twitter posts, which fall into the category of microblogs (Moseley, 2013) or microtexts (Gouws et al., 2011), are rich with abbreviations “solely to conserve space within a text” (Alkawas, 2011), while others observe that “SMS language seems to have evolved into a fashionable and stylish way of writing where the way of writing is as important as the content” (Kirsten Torrado, 2014).
The linguistic phenomenon discussed in this paper is often referred to as letter/number homophones (cf. Bieswanger, 2006; Kirsten Torrado, 2014; Frehner, 2008; Kadir et al., 2012; Elizondo, 2011; Farina and Lyddy, 2011; Thurlow, 2003; Kul, 2007; Alkawas, 2011) or (alphanumeric) rebus writing (Halmetoja, 2013; Danet and Herring, 2007), but we can also find wider, more generic expressions, e.g., complex abbreviations (Filipan-Žignić et al., 2012) or textism (Grace et al., 2012; Bushnell et al., 2011). Crystal (2001) refers to this phenomenon as a “rebus-like potential” of letters and numbers “whose pronunciation is identical with words or parts of words” and “are used to replace words or letter sequences”. Frehner (2008) differentiates between letter homophones – the use of a single letter whose phonological content is equated with a word, e.g., “u” for “you”, number homophones – the use of a numeral whose phonological content is equated with a word, e.g., “4” for “for”, and a combination of letters and numerals, forming letter/number homophones, e.g., “b4” for “before”.
Such shortenings can also be observed in Slovene. Michelizza (2008) talks about a special group of abbreviations, “typical for the language of SMS, where parts of a word are substituted by a mathematical symbol or numeral, pronounced the same or at least similar to the part of the word it substitutes (e.g., ju3)”¹. Dobrovoljc (2008) lists the most frequently used letter/number homophones in Slovene (ju3 = “jutri”, pr8 = “prosim”, 5er = “Peter”, 1x = “enkrat”) and English (4yeo = “for your eyes only”, j4f = “just for fun”, 2 much = “too much”), but concludes that there is a low percentage of such writings in Slovene texting language. Logar (2006) names this kind of linguistic feature a “combination of various writing symbols”, which are commonly observed in Slovene SMS. In her research, she showed that “/a/mong the more than 450 examples of SMS abbreviations that had been submitted to the site by 11 January 2002, more than 60% were some type of abbreviation, while the rest of the material (160 examples) was made of, for example, the following: :-) ʻzadovoljen’, :) ʻveselje’, :(. . ʻjočem’, :x ʻpoljubček’, :D ʻširok nasmešek’, mi2 ʻmidva’, ju3 ʻjutri’, 2mač ʻpreveč’, sk8ar ʻskejtar’, 8-) ʻNosim očala’, <>< ʻribica’, {*} ʻobjemček, poljubček’, *+* ʻvidim te’, @x@ ʻmaš mačka?’, @->-- ʻvrtnica’, \_/0 ʻA greš na kavo?’, =:x ʻzajček’.” Since most of the studies mentioned above focused only on the use of alphanumeric symbols in texting, we will investigate how often they occur in Slovene tweets.

¹ The paragraph was translated into English by the author.

3. Dataset and Methodology

For our research, we used the JANES v0.4 corpus², a large corpus of Slovene tweets, forum posts, blog texts, comments on news articles and on Wikipedia pages and users, which contains over 175 million words or 9 million documents, published between 2002 and 2016 (Fišer et al., 2016). We focused on the biggest subcorpus, the Slovene tweets, which consists of 90,180,337 words from 7,503,199 different Twitter posts.
Using the concordancer SketchEngine³, we searched for all occurrences of words consisting of alphanumeric symbols, where the numerals appear at the beginning, in the middle, or at the end of the word. To achieve that, appropriate regular expressions were used for querying the corpus:
[word="[0-9]+[a-zA-ZščžŠČŽ]+"],
[word="[a-zA-ZščžŠČŽ]+[0-9]+[a-zA-ZščžŠČŽ]+"], and
[word="[a-zA-ZščžŠČŽ]+[0-9]+"].
Prior to further analysis, irrelevant results had to be manually selected and excluded from the list. This was done because there are numerous examples of words which consist of alphanumeric symbols but represent a proper name/part of a proper name, a chemical symbol, a unit of measurement, or some other abbreviation (e.g., A4, CO2, C4, TEŠ6, m2, etc.). With these examples, no transformation from numerals to letters can be made in the written form (e.g. A4 → *A-štiri, CO2 → *CO-dva, etc.). After a quick overview of the concordances, we created a frequency list of all the words with alphabetic and numeric symbols which appear in Slovene tweets.
To investigate which users (corporate or private; male or female) incorporate such shortenings into their tweets, we used the appropriate filters and compared the results. We also compared the frequency of letter/number homophones in tweets according to their technical and linguistic standardness⁴.
Furthermore, the distributions of letter/number homophones in all JANES subcorpora were compared to that of the Kres corpus⁵, a collection of standard written Slovene with a balanced genre structure (Logar Berginc et al., 2012).
In the second part of our study, the letter/number homophones found in Slovene tweets were analyzed in more detail. Among the most frequently used numerals in the shortenings discussed, we investigated:
• where they appear (at the beginning, in the middle, or at the end of a word);
• what they substitute (a string of letters, a single letter);
• whether they are used phonetically or graphically (2morrow vs. g33k).

² Description available at http://nl.ijs.si/janes/ (access: June 12, 2016).
³ Available at https://www.sketchengine.co.uk/ (access: June 10, 2016).
⁴ All texts in the JANES corpus are annotated according to their level of standardness. “The score for technical text standardness focuses on word capitalisation, the use of punctuation, and the presence of typos or repeated characters in the words. The score for linguistic standardness, on the other hand, takes into account the knowledge of the language by the authors and their more or less conscious decisions to use non-standard language, involving spelling, lexis, morphology, and word order” (Ljubešić et al., 2015). Tweets are annotated “using a score between 1 (standard) and 3 (very non-standard), with 2 marking slightly non-standard texts” (Ljubešić et al., 2015).
⁵ Description available at http://www.slovenscina.eu/korpusi/kres (access: August 20, 2016).

4. Results

Surprisingly, the first query returned no results, which means that there are no words with numerals at the beginning of a word appearing in Slovene tweets represented in the JANES corpus. Thus, we used only the remaining two regular expressions for our further research.

4.1 Numeral at the End of the Word

The total number of concordances for all tokens ending in a numeral is 58,794. However, as mentioned before, words which represent a part of a proper name, a chemical symbol, a unit of measurement, etc., had to be manually selected and excluded from the list to get the actual result. The remaining tokens and their absolute frequencies are represented in Table 1.

Token   Frequency
ju3     1173
Mi2     593
mi2     371
Ju3     337
s5⁶     292
MI2     119
vi2     110
hi5     97
tr00    77
zju3    50
Hi5     47
na1     36
gr8     36
Tr00    31
Mi3     31
me2     27
str8    26
Vi2     20
Gr8     17
Me2     11
h8      11
u3      10
sk8     7
Zju3    5
mi3     5
H8      3
TR00    1

Table 1: Words with numerals at the end of the word.

⁶ Token S5 was excluded from the list because it was almost exclusively used in the proper name Galaxy S5.

As seen in the table above, 27 different tokens with 15 different lemmas⁷ were found in the corpus of Slovene tweets, altogether representing a relative frequency of 33.1 per million tokens. Nine out of 27 tokens (Mi2, mi2, MI2, vi2, Vi2, Mi3, mi3, me2, Me2) represent alternative written forms of different personal pronouns. Their relatively high frequency can be explained by the fact that the numeral in the personal pronoun does not only have the same phonetic content as the letters (e.g., s5 → /spet/, where 5 is pronounced as /pet/), but emphasizes the number of people a specific personal pronoun is denoting (mi2 → two people; mi3 → 3 people, etc.).
It is also interesting that 11 tokens are actually English homophones, frequently used in Slovene tweets as well.

⁷ Since the texts in the JANES corpus are normalized, the tokens Ju3 and ju3 would both have the same lemma – jutri.

4.2 Numeral in the Middle of the Word

Interestingly, the list of different letter/number homophones with numerals appearing in the middle of the word is significantly longer, whereas the relative frequency is much lower (9.97 per million tokens). This kind of shortening technique proves to be very productive, but still appears less frequently than the one with numerals at the end of the word. After excluding all the proper names and other irrelevant words, 117 tokens with approximately 50 different lemmas were found in Slovene tweets.
In some cases, it was difficult to identify whether we were dealing with a typographical error or whether the word was intentionally written in that form. We used the proximity of the letters and numbers on the keyboard to identify and exclude possible typographical errors (e.g., v0lilcev → number 0 and the letter “o” are very close on the keyboard, so we identified it as a typographical error vs. v8dja → probably intentionally written as such). As expected, the majority of homophones represent English words, where the preposition “to” is typically substituted by number 2 (e.g., B2B, p2p, coffee2go, up2date, etc.). There are, however, also numerous Slovenian words written with both alphabetic and numeric symbols. It is important to emphasize that we did not exclude the homophones which represent a phrase consisting of two or more words, e.g., mi3je = mi trije; še1x = še enkrat, etc. We decided not to consider them as typographical errors, but as a decision of Twitter users to write them as one word, similar to the English multi-word phrases listed above (coffee2go, up2date, etc.).

Token                                        Frequency
B2B/b2b                                      205/41
w00t/W00t                                    66/39
d00h/d0h/D0h/d000h                           51/48/26/4
pr0n/Pr0n                                    49/6
g33k/g33ki/g33kov/g33ka/G33k                 35/9/6/5/4
na1x                                         30
n00b/n00be                                   24/4
B2C                                          21
s3ksi/S3ksi                                  19/4
p2p/P2P                                      19/18
B4B                                          19
p0rn⁸                                        18
Za1x                                         13
mi3je                                        12
še1x                                         11
ju3šnji/ju3snji/Ju3šnji/ju3šnjega/ju3snjem   11/4/4/3/3

Table 2: Most frequently used words with numerals in the middle of the word.

⁸ The motivation for writing specific words with numerals instead of letters could be to avoid censorship carried out by the moderators of forums, blogs, or comment sections. The same pattern appears in the word cig4n (= cigan) found in the Kres corpus.

4.3 The Use of Alphanumeric Symbols According to User Gender

The comparison of frequency of letter/number homophones according to user gender shows that the use of words with alphanumeric symbols is far more frequent among male users (roughly 80% of all the occurrences were written by male users). However, we must take into account that the comparison was made using all the occurrences returned by the regular expressions, since we could not filter out all irrelevant examples.

4.4 The Use of Alphanumeric Symbols According to User Type

The comparison according to user type (corporate vs. private) shows a strong tendency of private users to incorporate such writing into their tweets. Of 65,042 occurrences, 45,682 (≈ 70%) were written by private users.

4.5 The Use of Alphanumeric Symbols According to the Level of Text Standardness

Regarding the level of text standardness, an assumption can be made that less standard tweets (from both linguistic and technical perspective) contain more letter/number homophones than those written according to grammatical and orthographic rules. We compared all 9 combinations of text standardness available in the JANES corpus (from L1T1 = linguistic 1, technical 1 to L3T3 = linguistic 3, technical 3, with all intermediate combinations). Words with alphabetic and numeric symbols are most frequently used in tweets annotated as very non-standard (L3T3) or linguistically very non-standard and technically slightly non-standard (L3T2).

4.6 Comparison with the Kres Corpus

In the figure below, relative frequencies of all letter/number homophones (regardless of the position of the numeral) in the JANES subcorpora and the Kres corpus are presented. As evident from the chart, letter/number homophones are by far the most frequent in Twitter posts (43.07 per million), followed by the forum posts (18.19 per million). Blog entries, comments on news, and wiki talk show a fairly similar distribution (2.61, 3.37, and 2.94 per million, respectively). Surprisingly, letter/number homophones can also be found in the Kres corpus (1.36 per million). A total of 12 different examples were found, 10 of them with numerals in the middle of the word (e.g., l33t, cig4ni, za1x, pr0n), one with numerals ending a word (ju3), and also one with numerals starting a word (4ever). However, all of these examples were found in the texts obtained from the web pages and from the Slovenian magazine Joker, which is primarily a computer gaming magazine with a distinctive writing style.

Figure 1: Relative frequencies of letter/number homophones in the JANES subcorpora and the Kres corpus. (Bar chart; vertical axis from 0 to 50; categories: Tweets, Blogs, Forums, Comments, Wiki talk, Kres.)

5. Qualitative Analysis of Extracted Letter/Number Homophones

In this section, a qualitative analysis of the most frequently used letter/number homophones is presented, along with interpretations of the numeric features and their functions.

5.1 Most Frequent Numerals Used in Letter/Number Homophones

If we analyze the extracted words with alphanumeric symbols in more detail, it is evident that numerals are used only in the middle of the word (117 tokens) or at the end of a word (27 tokens). As mentioned before, no tokens with numerals starting a word were found in the corpus. Numerals used in letter/number homophones are definitely not randomly picked by the users. Each numeral has a specific meaning or interpretation and substitutes either a single letter or a string of letters in a word. In our corpus, 8 numerals were identified in letter/number homophones, i.e. 0, 1, 2, 3, 4, 5, 7, and 8. Among them, numerals 2, 3, 8, and 0 are the most frequent ones. Since the same numeral can appear in different words and even have different functions, it is interesting to identify which letter/letters are substituted by them. In Table 3, all numerals and their interpretations are presented, together with the examples from the corpus.

Numeral  Interpretation  Example
1        “ena”           na1 = “na ena”
         “i”             BRA71L = “Brazil”
2        “dva”           mi2 = “midva”
         “dve”           me2 = “medve”
         “to”            up2date = “up to date”
3        “tri”           ju3 = “jutri”; s3njam = “strinjam”
         “e”             g33k = “geek”
4        “for”           t4t = “training for trainers”
         “a”             G4ME = “game”
5        “pet”           s5 = “spet”
         “five”          hi5 = “high five”
7        “z”             BRA71L = “Brazil”
8        “eat”           gr8 = “great”
         “aight”         str8 = “straight”
         “ate”           h8 = “hate”; l8r = “later”
0        “o”             n00b = “noob”; p0rn = “porn”; w00p = “woop”

Table 3: Numerals used in letter/number homophones with their interpretations and examples.

5.2 Phonetic vs. Graphic Function of Numerals

As evident from the table above, there is a striking difference between numerals that are used phonetically (e.g. s5 = “spet”, where their pronunciation is identical with a part of the word, enabling them to replace the letter sequence) and graphically (e.g. G4ME = “game”, where the numeral 4 has a similar visual appearance as the letter “A”). All numerals used at the end of the words (see Table 1) are used phonetically; the only exception is the word tr00 (= “true”). This example is especially interesting because numerals do not only substitute a string of letters (00 = “oo”), but the pronunciation of the substituted letters is similar or the same as the original pronunciation of the word string (“ue” in true). In other words, 3 transformations are needed to identify the “original” word, namely tr00 → troo → /tru:/ → “true”.
Numerals which appear in the middle of the word can be used either graphically or phonetically. Numerals which substitute letters based on their appearance rather than their pronunciation tend to be duplicated (e.g. g33k, w00p, n00b, etc.)⁹.
According to that, we can undoubtedly claim that letter/number homophones are not only used as a word-shortening strategy, but as a form of creative writing and a specific stylistic feature as well. Furthermore, graphically used numerals also prove that CMC is “essentially a mixed modality” which “resembles speech” but “looks like writing” (Baron, 2008), since the words, such as g33k, tr00, or n00b, have little significance if not visually represented.

⁹ This type of writing is also referred to as “l33t speak” or “l33t”, used mostly by players of video games “where numbers and symbol combinations are used to represent letters” (Sherblom-Woodward, 2002).

6. Conclusion

This paper presents the use of so-called letter/number homophones in Slovene tweets as presented in the JANES corpus. The results show that a considerable amount of alphabetic and numeric symbols are used both in Slovene and in English words. This phenomenon, also described as a type of “neography” (Danet and Herring, 2007), proved to be characteristic for CMC, especially microtexts, such as Twitter and forum posts, since no letter/number homophones were found in the Kres corpus apart from the texts obtained from web pages and from the gaming magazine Joker. As expected, Twitter proved to be the richest subcorpus regarding this phenomenon, followed by the forum subcorpus with a relatively high frequency of the shortenings discussed. Numerals used in the middle or at the end of specific words substitute letters or strings of letters and make the texts either shorter or more interesting to read. Since a certain numeral can be used both graphically (g33k) and phonetically (u3nek), a creative writing style has emerged among new generations, which definitely deserves linguistic attention. For a more precise description of the phenomenon, a more detailed comparison with other CMC media (SMS, blogs, forums, etc.) would be necessary. Apart from that, an analysis of usernames would be very useful for investigating language creativity as observed in computer-mediated communication.

7. Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Resources, Tools and Methods for the Research of Nonstandard Internet Slovene” (J6-6842, 2014-2017).

8. References

Alkawas, S. (2011). Textisms: The Pragmatic Evolution among Students in Lebanon and its Effect on English Essay Writing. Master Thesis, Lebanese American University.
Baron, S. (2008). Always On: Language in an Online and Mobile World. Oxford University Press, Oxford.
Bieswanger, M. (2006). 2 abbrevi8 or not 2 abbrevi8: A contrastive analysis of different space- and time-saving strategies in English and German text messages. In Hallett, T., Floyd, S., Oshima, S. and Shield, A. (Eds.), Texas Linguistics Forum Vol. 50, Austin. http://studentorgs.utexas.edu/salsa/proceedings/2006/Bieswanger.pdf
Bushnell, C., Kemp, N. and Heritage Martin, F. (2011). Text-messaging practices and links to general spelling skills: A study of Australian children. In Australian Journal of Educational & Developmental Psychology. Vol 11, pp. 27–38.
Crystal, D. (2001). Language and the Internet. Cambridge, Cambridge University Press.
Danet, B. and Herring, S. (Eds.). (2007). The Multilingual Internet. Language, Culture, and Communication Online. Oxford University Press, Oxford.
Denby, L. (2010). The Language of Twitter: Linguistic Innovation and Character Limitation in Short Messaging. Undergraduate dissertation, University of Leeds.
Dobrovoljc, H. (2008). Jezik v e-poštnih sporočilih in vprašanja sodobne normativistike. In Košuta, M. (Ed.), Slovenščina med kulturami, Slavistično društvo Slovenije, Celovec, pp. 295–314.
Elizondo, J. (2011). Not 2 Cryptic 2 DCode: Paralinguistic Restitution, Deletion, and Non-standard Orthography in Text Messages. Ph.D. thesis, Swarthmore College.
Farina, F. and Lyddy, F. (2011). The Language of Text Messaging: “Linguistic Ruin” or Resource? In The Irish Psychologist, Vol. 37, Issue 6, pp. 145–149.
Filipan-Žignić, B., Velički, D. and Sobo, K. (2012). SMS communication – Croatian SMS language features as compared with those in German and English Speaking Countries. In Revija za elementarno izobraževanje, št. 1. Pedagoška fakulteta, Maribor.
Fišer, D., Erjavec, T. and Ljubešić, N. (2016). Janes v0.4: korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2.0 (to appear).
Frehner, C. (2008). Email, SMS, MMS: The Linguistic Creativity of Asynchronous Discourse in the New Media Age. Peter Lang.
Kul, M. (2007). Phonology in text messages. In Poznań Studies in Contemporary Linguistics 43(2), pp. 43–57.
Ljubešić, N., Fišer, D., Erjavec, T., Čibej, J., Marko, D., Pollak, S. and Škrjanec, I. (2015). Predicting the level of text standardness in user-generated content. In Proceedings, pp. 371–378, Hissar: [s.n.]. http://lml.bas.bg/ranlp2015/docs/RANLP_main.pdf.
Logar Berginc, N., Grčar, M., Brakus, M., Erjavec, T., Arhar Holdt, Š. and Krek, S. (2012). Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede.
Logar, N. and Smith, J. (trans.). (2006). Stilno zaznamovane nove tvorjenke: tipologija = Stylistically marked new derivates: a typology. In Vidovič-Muha, A. (Ed.), Slovensko jezikoslovje danes, Slavistično društvo Slovenije, Ljubljana, pp. 87–101.
Michelizza, M. (2008). Jezik SMS-jev in SMS-komunikacija. In Jezikoslovni zapiski: zbornik Inštituta za slovenski jezik Frana Ramovša, Inštitut za slovenski jezik Frana Ramovša ZRC SAZU, Ljubljana, pp. 151–166.
Moseley, N. (2013). Using word and phrase abbreviation patterns to extract age from Twitter microtexts. Thesis, Rochester Institute of Technology.
Sherblom-Woodward, B. (2002). Hackers, Gamers and Lamers: The Use of l33t in the Computer Sub-Culture. http://www.swarthmore.edu/SocSci/Linguistics/papers/2003/sherblom-woodward.pdf.
Thurlow, C. (2003). Generation Txt? The sociolinguistics of young people's text-messaging. In Discourse Analysis Online, Sheffield.
Gouws, S., Metzler, D., Cai, C. and Hovy, C. (2011).
Contextual bearing on linguistic variation in social media. In Proceedings of the workshop on language in social media ( LSM 2011), pp. 20–29. http://aclweb.org/anthology/W/W11/W11-0704.pdf. Grace, A., Kemp, N., Martin, F. H. and Parrila, R. (2012). Undergraduates’ use of text messaging language: Effects of country and collection method. In Writing Systems Research. Taylor & Francis Online. Halmetoja, T. (2013). Gender-Reated Variation in CMC Language: A Study of Three Linguistic Features on Twitter. BA thesis, Göteborgs Universitet. Kadir, Z. A., Maros, M. and Hamid, B. A. (2012). Linguistic Features in the Online Discussion Forums. In International Journal of Social Science and Humanity, Vol. 2, No. 3, May 2012, pp. 276–281. Kirsten Torrado, U. (2014). Development of SMS language from 2000 to 2010. In Cougnon, L. and Fairon, C. (Eds.), SMS communication: A linguistic approach. Benjamins Current Topics. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 53 Ljubljana, Slovenia, 27–28 September 2016 A Multilingual Social Media Linguistic Corpus Luis Rei*,†, Dunja Mladenić*,†, Simon Krek* *Jožef Stefan Institute, †Jožef Stefan International Postgraduate School Jamova cesta 39, 1000 Ljubljana, Slovenia E-mail: luis.rei@ijs.si, dunja.mladenic@ijs.si, simon.krek@ijs.si Abstract This paper focuses on multilingual social media and introduces the xLiMe Twitter Corpus that contains messages in German, Italian and Spanish manually annotated with Part-of-Speech, Named Entities, and Message-level sentiment polarity. In total, the corpus contains almost 20K annotated messages and 350K tokens. The corpus is distributed in language specific files in the tab-separated values format. It also includes scripts that enable to convert sequence tagging tasks to a format similar to the CONLL format. Tokenization and pre-tagging scripts are distributed together with the data. 
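The abstract states that the corpus ships as TSV files and that a script converts the sequence tagging tasks to a CoNLL-like format. The following is a minimal Python sketch of that kind of conversion, not the distributed xlime2conll.py. The column names (token, doc_id, tok_task_pos, annotator) are assumptions based on the headers listed in Section 5.1, written here with underscores; the sample rows and tag strings are invented for illustration only.

```python
import csv
import io


def tsv_to_conll(tsv_file, tag_column="tok_task_pos"):
    """Return 'token tag' lines, with one empty line between tweets."""
    # QUOTE_NONE: tweet tokens may themselves be quote characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    tweets = {}          # doc_id -> [(token, tag), ...], in file order
    kept_annotator = {}  # doc_id -> annotator whose rows are kept
    for row in reader:
        doc = row["doc_id"]
        # Tweets annotated by several people appear once per annotator;
        # this sketch keeps only the first annotator seen for each tweet.
        if kept_annotator.setdefault(doc, row["annotator"]) != row["annotator"]:
            continue
        tweets.setdefault(doc, []).append((row["token"], row[tag_column]))
    blocks = ("\n".join(f"{tok} {tag}" for tok, tag in rows)
              for rows in tweets.values())
    return "\n\n".join(blocks) + "\n"


# Hypothetical sample data: header names and label values are assumptions.
HEADER = ["token", "tok_id", "doc_id", "doc_task_sentiment",
          "tok_task_pos", "tok_task_ner", "annotator"]
ROWS = [
    ["rt", "1-0", "1", "neutral", "continuation", "O", "a1"],
    ["@jack", "1-1", "1", "neutral", "mention", "O", "a1"],
    ["hola", "2-0", "2", "positive", "interjection", "O", "a1"],
    ["hola", "2-0", "2", "positive", "noun", "O", "a2"],  # second annotator
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"

print(tsv_to_conll(io.StringIO(SAMPLE)), end="")
```

Passing tag_column="tok_task_ner" (again an assumed name) would produce the NER variant. Keeping a single annotator per tweet is a design choice of this sketch so that tweets annotated in common are not emitted twice; the distributed script may handle them differently.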
Keywords: social media, Twitter, part-of-speech, named entities, named entity recognition, sentiment analysis

1. Overview

High-quality, manually annotated newswire corpora with different types of annotations are now available for different languages. Over the past few years, new social media based linguistic corpora have begun appearing, but few are focused on classical problems such as Part-of-Speech tagging and Named Entity Recognition. Of these few, most are English corpora.

It has been documented that social media text poses additional challenges to automatic annotation methods, with error rates up to ten times higher than on newswire for some state-of-the-art PoS taggers (Derczynski et al., 2013a). It has been shown that adapting methods specifically to social media text, with the aid of even a small manually annotated corpus, can help improve results significantly (Ritter et al., 2011; Derczynski et al., 2013a; Derczynski et al., 2013b).

While there exist social media sentiment corpora for Twitter messages in the languages we annotated, the corpus we are presenting also includes message-level sentiment labels. One motivation for this is the potential contribution of annotations, such as PoS tags, to sentiment classification tasks (Zhu et al., 2014).

The xLiMe Twitter Corpus provides linguistically annotated Twitter¹ social media messages, known as “tweets”, in German, Italian, and Spanish. The corpus contains approximately 350K tokens with PoS tags and Named Entity annotations. All messages, approximately 20K, are labeled with message-level sentiment polarity. We further explain the composition of the corpus in § 3.

¹ Twitter: http://twitter.com

2. Related Work

An early effort in linguistically annotating noisy online text was the NPS Chat Corpus (Forsyth and Martell, 2007), which contains more than 10K online chat messages, written in English, manually annotated with PoS tags.

The Ritter Twitter corpus (Ritter et al., 2011) was the first to introduce a manually annotated Named Entity recognition corpus for Twitter. It contains 800 English messages (16K tokens) which also contain Part-of-Speech and chunking tags.

The Gimpel et al. (2011) corpus contains almost 2K Twitter messages with PoS tags, while Owoputi et al. (2013) annotated 547 Twitter messages. Tweebank, drawn from the latter, contains a total of 929 tweets (12,318 tokens) and provides clear guidelines, which the previously mentioned Twitter annotation efforts had not.

While there are many English social media sentiment corpora, the most well known is probably the Semeval corpus (Rosenthal et al., 2014), which contains over 21K Twitter messages, SMS, and LiveJournal sentences. All messages are annotated with one of three possible labels: Positive, Negative, or Objective/Neutral. For Spanish sentiment classification, the TASS corpus (Villena Román et al., 2013) contains 68K Twitter messages labeled semi-automatically with one of five labels: the three Semeval labels plus Strong Positive and Strong Negative. Smaller corpora with at least three labels exist for many other languages, including German and Italian. We decided to add sentiment polarity to our multilingual corpus because it is a popular task, challenging for automated methods, and the cost (annotator time) of adding this additional annotation is mostly marginal when compared to the cost of PoS and NER annotations.

3. Description

The developed multilingual social media corpus includes document-level and token-level annotations. There is one document-level annotation, sentiment polarity, and two token-level annotations, PoS and NER. The corpus details are shown in table 1, namely the distribution of annotated tweets and tokens per language. The Italian part of the corpus is the largest with 8601 annotated tweets, followed by Spanish with 7668 tweets, and German with 3400 tweets.

Language  Tweets  Tokens  Annotators
German    3400    60873   2
Italian   8601    162269  3
Spanish   7668    140852  2

Table 1: Number of annotated tweets and tokens per language.

3.1 Data Collection

The tweets were randomly sampled from the Twitter public stream from late 2013 to early 2015. Tweets were selected based on their reported language. Some rules were automatically applied to discard spam and low-information (“garbage”) tweets:

1. Tweets with fewer than 5 tokens were discarded;
2. Tweets with more than 3 mentions were discarded;
3. Tweets with more than 2 URLs were discarded;
4. Automatic language identification with langid.py (Lui and Baldwin, 2011) was used on the tweet text without Twitter entities, and if the result did not match the reported language, the tweet was discarded.

3.2 Preprocessing

URLs and mentions were replaced with pre-specified tokens. Tokenization was performed using a variant of twokenize (O’Connor et al., 2010) that was additionally adapted to break apart apostrophes in Italian, as in “l’amica”, which becomes “l’”, “amica”.

3.3 Annotation Process

There were two annotators for Spanish, two for German, and three for Italian. A small number of tweets for each language were annotated by all the annotators working on the language in order to allow estimation of agreement measures as described in § 4. PoS tags were pre-tagged using Pattern (De Smedt and Daelemans, 2012) and some basic rules for Twitter entities such as URLs and mentions.

We built an annotation tool optimized for document-level and token-level annotation of very short documents, i.e. tweets. The annotation tool included the option to mark tweets as “invalid” since, despite the automatic filtering performed in § 3.1, it was still possible that tweets with incorrectly identified language, spam, or incomprehensible text might be presented to the annotators. This feature can be seen in fig. 1.

Figure 1: Screenshot of the annotation tool interface. The text of a tweet is at the top, followed by the sentiment label dropdown menu. Below there is a column with the tokens and rows for each annotation (PoS and NER). Annotators manually fix the errors inherent in the automatic pre-tagging step previously described. Finally, a dropdown menu allows marking the annotation of the document as “To Do”, “Finished”, “Invalid”, or “Skip”. Note that in this example, the labels have not yet been manually corrected.

3.4 Part-of-Speech

The part-of-speech tagset consists of the Universal Dependencies tagset (Petrov et al., 2012) plus Twitter-specific tags based on Tweebank (Owoputi et al., 2013). We present the full tagset and the number of occurrences, per language, of each tag in table 2.

Tag           German  Italian  Spanish
Adjective     2514    7684     5741
Adposition    4333    14960    13467
Adverb        4173    8476     6116
Conjunction   1576    6737     6684
Determiner    2990    9811     10037
Interjection  225     1427     1109
Noun          11057   30759    23230
Number        1176    2550     1568
Other         1936    1503     3033
Particle      638     352      18
Pronoun       4530    7737     10333
Punctuation   8650    20529    14102
Verb          6506    21793    19460
Continuation  918     4227     3422
Emoticon      449     1076     951
Hashtag       1895    3035     1805
Mention       1984    6519     9070
URL           1923    4494     3019

Table 2: Tagset with occurrence counts in the corpus per language.

3.4.1 Twitter Specific Tags

While most tags will be easily recognizable to most readers, we believe it is useful to provide here a description of the tags which are specific to social media and Twitter. Further details about these tags can be found in our guidelines.

Continuation: indicates retweet indicators such as “rt” and “:” in “rt @jack: twitter is cool”, and ellipsis that marks a truncated tweet rather than purposeful ellipsis;
Emoticon: this tag applies to unicode emoticons and traditional smileys, e.g. “:)”;
Hashtag: this tag applies to the “#” symbol of Twitter hashtags, and to the following token if and only if it is not a proper part-of-speech;
Mention: this indicates a Twitter “@-mention” such as “@jack” in the example above;
URL: indicates URLs, e.g. “http://example.com” or “example.com”.

A noteworthy guideline is the case of the Hashtag. Twitter hashtags are often just topic words outside of the sentence structure and not really part-of-speech. In this case, the Hashtag PoS tag applies to the word following the “#” symbol. Otherwise, if it is part of the sentence structure, the guideline specifies that it should be labeled as if the “#” symbol was not present.

3.5 Named Entities

Named entities are phrases that contain the names of persons, organizations, and locations. Identifying these in newswire text was the purpose of the CoNLL-2003 Shared Task (Tjong Kim Sang and De Meulder, 2003). We have adopted the definitions for each named entity class: Person, Location, Organization, and Miscellaneous. In table 3 we show each type of entity in our corpus and the number of tokens annotated with each per language.

Entity Type    German  Italian  Spanish
Location       742     2087     1441
Miscellaneous  995     5802     775
Organization   350     1150     836
Person         757     3701     2321

Table 3: Token counts per named entity type per language in the corpus.

3.6 Sentiment

Each tweet is labeled with its sentiment polarity: positive, neutral/objective, or negative. The choice of these three labels mirrors that of the Semeval Shared Task (Rosenthal et al., 2014). The vast majority of tweets in our corpus were annotated with the Neutral/Objective label, as we show in table 4.

Language  Positive  Neutral  Negative  Total
German    334       2924     142       3400
Italian   554       7524     523       8601
Spanish   388       7083     197       7668

Table 4: Message level sentiment polarity annotation counts.

4. Agreement

In order to estimate inter-annotator agreement, for each language, the annotators were given tweets that they annotated in common. We show the number of tweets and tokens in table 5. These were then used to calculate Cohen’s Kappa (technically, Fleiss’ Kappa for Italian) and we show the results in table 6. The worst agreement between the human annotators occurred when labeling sentiment. Even for humans, it can be challenging to assign sentiment, without context, to a small message.

Language  Tweets  Tokens  Annotators
German    47      791     2
Italian   45      758     3
Spanish   45      721     2

Table 5: Number of tweets and tokens annotated by all annotators for a given language.

Task       German        Italian        Spanish
PoS        0.88 (AP)     0.87 (AP)      0.85 (AP)
NER        0.67 (SUB)    0.42 (MOD)     0.51 (MOD)
Sentiment  -0.07 (Poor)  0.02 (Slight)  0.37 (Fair)

Table 6: Inter Annotator Agreement (Cohen/Fleiss kappa) per task per language. In parentheses, the human readable interpretation where: AP - Almost Perfect, MOD - Moderate, SUB - Substantial.

5. Format and Availability

The corpus is primarily distributed online² as a set of three tab-separated values (TSV) files, one per language. We also distribute the data in language- and task-specific formats, such as a text file containing the German tweets with one word per line followed by a whitespace character and a NER label. These were automatically created using a script described in § 5.2.

² xLiMe Twitter Corpus: https://github.com/lrei/xlime_twitter_corpus

5.1 Headers

Each of the TSV files has the same set of headers:

token: the token, e.g. “levantan”;
tok id: a unique identifier for the token in the current message, composed of the tweet id, followed by the dash character, followed by a token id, e.g. “417649074901250048-47407”;
doc id: a unique identifier for the message (tweet id), e.g. “417649074901250048”;
doc task sentiment: the sentiment label assigned by the annotator;
tok task pos: the Part-of-Speech tag assigned by the annotator;
tok task ner: the entity class label assigned by the annotator;
annotator: the unique identifier for the annotator.

Note that the combination of the token identifier and the annotator identifier is unique, i.e. the combination is present only once in the corpus.

5.2 Scripts

In order to facilitate experiments using this corpus, as well as to replicate its construction, several Python scripts are distributed with the corpus data. We detail the most important scripts here, namely the tokenizer, the pre-tagger, and the script that converts the sequence tagging tasks (PoS and NER) into a format similar to the CoNLL 2002/2003 format. In this format, empty lines mark the end of a tweet and “word” lines start with the token, followed by a space, followed by a tag.

xlime2conll.py: the script used to convert the data into the column format similar to the CoNLL 2003 shared task;
extract sentiment.py: the script used to convert the data into a format that is easy to handle by text classification tools, specifically, a TSV file with the headers: id, text, sentiment;
twokenize.py: the tokenizer used to split the tokens in the corpus;
pretag.py: the script used to pre-tag the data;
agreement.py: the script used to calculate the agreement measures.

6. Acknowledgments

This work was supported by the Slovenian Research Agency and the ICT Programme of the EC under xLiMe (FP7-ICT-611346) and Symphony (FP7-ICT-611875). We would like to thank the annotators that were involved in producing the xLiMe Twitter Corpus. The annotators for German were M. Helbl and I. Škrjanec; for Italian, E. Derviševič, J. Jesenovec, and V. Zelj; and for Spanish, M. Kmet and E. Podobnik.

7. References

De Smedt, T. and Daelemans, W. (2012). Pattern for python. The Journal of Machine Learning Research, 13(1):2063–2067.
Derczynski, L., Maynard, D., Aswani, N., and Bontcheva, K. (2013a). Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, pages 21–30. ACM.
Derczynski, L., Ritter, A., Clark, S., and Bontcheva, K. (2013b). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Advances in Natural Language Processing (RANLP), pages 198–206. Association for Computational Linguistics.
Forsyth, E. N. and Martell, C. H. (2007). Lexical and discourse analysis of online chat dialog. In Semantic Computing, 2007. ICSC 2007. International Conference on, pages 19–26. IEEE.
Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., and Smith, N. A. (2011). Part-of-speech tagging for twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 42–47. Association for Computational Linguistics.
Lui, M. and Baldwin, T. (2011). Cross-domain feature selection for language identification. In Proceedings of 5th International Joint Conference on Natural Language Processing.
O’Connor, B., Krieger, M., and Ahn, D. (2010). Tweetmotif: Exploratory search and topic summarization for twitter. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM 2010), pages 384–385.
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., and Smith, N. A. (2013). Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.
Petrov, S., Das, D., and McDonald, R. (2012). A universal part-of-speech tagset. In Nicoletta Calzolari (Conference Chair), et al., editors, Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
Ritter, A., Clark, S., Mausam, and Etzioni, O. (2011). Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1524–1534.
Rosenthal, S., Ritter, A., Nakov, P., and Stoyanov, V. (2014). Semeval-2014 task 9: Sentiment analysis in twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.
Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the 7th conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics.
Villena Román, J., Lana Serrano, S., Martínez Cámara, E., and González Cristóbal, J. C. (2013). Tass-workshop on sentiment analysis at sepln.
Zhu, X., Kiritchenko, S., and Mohammad, S. M. (2014). Nrc-canada-2014: Recent improvements in the sentiment analysis of tweets. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 443–447.

Political Discourse in Polish Internet – Corpus of Highly Emotive Internet Discussions

Antoni Sobkowicz
National Information Processing Institute
E-mail: antoni.sobkowicz@opi.org.pl

Abstract

In this work, we present a description and an initial statistical analysis of a corpus of comments from the most popular Polish news-related website, Onet.pl. The corpus contains highly informal, politically polarized texts with highly emotive content.
We gathered a corpus containing 4,829,076 texts and 1,826,906 unique tokens in total over a 9-month period that included several important political events in Poland. The corpus is freely available, and we intend to update it regularly; additional texts are currently being retrieved.

Keywords: politically related texts, Polish language corpus, social media, computer-mediated communications

1. Introduction

Discussions about politics on the internet, especially in a country as politically polarized as Poland, where supporters of the two dominating parties are very vocal and active, are often very emotive. People tend not only to express their feelings about events but also to resort to personal insults or insults directed at politicians. This makes these discussions very interesting for analysis.

We have gathered over 4.8 million comments from the largest Polish news-oriented website, Onet.pl, choosing only comments under politics-related news, over 9 months that were very intensive in terms of political events in the country (presidential and parliamentary elections in which the party that had ruled for the last 8 years lost, the Constitutional Tribunal crisis, changes in public media). We have analyzed basic properties of this set and we encourage researchers in text analysis related fields to use it. The collected dataset is freely available, and we intend to update it every three months with new content.

The dataset described in this paper was previously used in several works, although it was not publicly available.

2. Related Work

Corpora of political text are widely available; examples include a multilingual corpus of annotated political programs (Merz et al., 2016), a corpus of political speeches with annotated audience reactions (Guerini et al., 2013) and a political speech corpus of Bulgarian (Osenova & Simov, 2012). These corpora, however, only cover text with a higher degree of formalization.

Corpora built on less formal text sources are also available, based on tweets (Longhi & Wigham, 2015), blogs (Eisenstein & Xing, 2010) and other sources. These are more similar to the corpus described in this paper because of the informality of those sources.

Work on a similar kind of dataset (politically related comments in the Polish language, also based on Onet.pl data) was done by Sobkowicz and Sobkowicz (2012), although that dataset was highly limited.

3. Corpus Source Description

The corpus was scraped from the Onet.pl website, one of the largest and most commented news-related websites on the Polish internet. Onet.pl is a news website covering topics from politics to sport and entertainment, with a complex comment section under each news piece and news tagging.

Figure 1: A chunk of a typical discussion in the comment section on Onet.pl, the source for the corpus described in this paper. Elements are as follows: 1 - Comment poster name; 2 - Comment text; 3 - Comment score; 4 - Replies to comment, nested, with information on who replied to which post. Texts were blurred out because they may be offensive.

The comment section is tree based, meaning that comments which reply to other comments are displayed below them with indentation. Users can rate comments, and the average rating is displayed near each comment, along with the date and time of posting and the name of the original poster. An example of such a tree and its data is shown in figure 1.

3.1 Discussions on Onet.pl

As Onet.pl does not enforce registration, users can post under any nickname. This seems to encourage more heated discussions with many insults: a token-based analysis of the dataset using only known, heavily emotive negative tokens showed that around 5% of all messages can be considered directly insulting (insulting other users or other parties connected to the topic of the discussion). Manual sentiment analysis on a small randomly selected subset of the data shows that purely neutral texts make up only 15% of the data (146 texts out of 950 assessed). In the data analyzed by Sobkowicz and Sobkowicz (2012), neutral texts made up 56% of all texts; however, the authors used a different neutrality and emotiveness measure.

Each posted news piece tends to have several hundred texts; the more divisive the topic is, the more posts are written by users.

4. Corpus Description

We have gathered comments under articles (along with the article text). Data was gathered in three periods, May – August 2015, September – December 2015 and January – March 2016, with a fourth part currently being downloaded.

4.1 Comment Data Description

Comments are scraped from the website while preserving their tree structure, along with the time of posting and the user handle. This information can be used to reconstruct the user network if needed. The comments themselves are stored in their raw form, without any alteration to their text. We decided against storing only extracted and processed tokens, as we believe that preserving additional data (such as the discussion tree) is very important, and extracting/lemmatizing/stemming can be done when needed.

4.2 Basic Corpus Properties

The corpus contains 4,829,076 texts, with an average length of 179 characters and the length distribution shown in figure 2. The average length in tokens is 33, with the distribution also shown in figure 2. Both distributions seem to follow a lognormal distribution, as expected from human-produced texts (Sobkowicz et al. 2013).

The distribution of tokens over the number of texts in the corpus is shown in figures 3 and 4, for non-unique tokens and unique tokens only, respectively. The corpus itself contains over 160 million tokens, with 1,826,906 unique tokens (as we do not extract lemmas from the words, this number is inflated by different inflected forms).

We do not provide sentiment annotation for the corpus because, given its size and the lack of good sentiment analysis tools for the Polish language, we believe we cannot give accurate or semi-accurate sentiment information for the corpus. The same is true of lemmatization and POS tagging.

4.3 Anonymization

We believe that, given that the source website does not require users to register and does not provide any information about the user beyond their username, anonymization is not required for this dataset.

5. Toolset Description

Data was gathered using specialized tools written in Python using scraps library. The scrapers parsed all comment pages, going from first to last, and saved all data to JSON files. These files were then parsed and saved into an SQLite database for easier use.

6. Availability

The corpus is available for free, but because it is relatively large (around 6GB), we currently do not provide a direct download link. Instead, we encourage interested researchers to contact us so that we can prepare the data for transfer via a selected service.

7. Conclusions and Future Work

We have built a new corpus containing politically related comments posted under news pieces on the largest Polish news-related site. We gathered over 4.8 million texts spanning a 9-month period and calculated basic corpus statistics. In the near future, we plan to finish downloading the fourth part of the data (spanning the time from April to June 2016) and to keep the corpus up to date for the foreseeable future.

We encourage researchers to use this corpus and analyze it in greater detail in the context of linguistics, sentiment analysis, and analysis of human interactions in CMC. We believe that a corpus this large, coming from a very polarized community, can be very interesting for researchers.

8. References

Eisenstein, J., & Xing, E. (2010). The CMU 2008 political blog corpus. Carnegie Mellon University, School of Computer Science, Machine Learning Department.
Guerini, M., Giampiccolo, D., Moretti, G., Sprugnoli, R., & Strapparava, C. (2013). The new release of corps: A corpus of political speeches annotated with audience reactions. In Multimodal Communication in Political Speech. Shaping Minds and Social Action (pp. 86-98). Springer Berlin Heidelberg.
Longhi, J., Wigham, C. R. (2015). Structuring a CMC corpus of political tweets in TEI: corpus features, ethics, and workflow. Corpus Linguistics 2015.
Merz, N., Regel, S., Lewandowski, J. (2016). The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis. Research & Politics.
Osenova, P., & Simov, K. (2012). The Political Speech Corpus of Bulgarian. In LREC (pp. 1744-1747).
Sobkowicz, P., Thelwall, M., Buckley, K., Paltoglou, G., & Sobkowicz, A. (2013). Lognormal distributions of user post lengths in Internet discussions-a consequence of the Weber-Fechner law?. EPJ Data Science, 2(1), 1-20.
Sobkowicz, P., & Sobkowicz, A. (2012). Two-year study of emotion and communication patterns in a highly polarized political discussion forum. Social Science Computer Review, 0894439312436512.

Figure 3: Distribution of number of non-unique tokens to number of texts in corpus.

Figure 4: Distribution of number of unique tokens to number of texts in corpus.

Figure 2: Distributions of text length in the corpus, both in raw character length and in token length. Right figures show the distribution in log-log scale.
Both distributions seem to follow a log-normal distribution, which, according to other research, seems to be the case for most human-created texts.

Topic Ontologies of the Slovene Blogosphere: A Gender Perspective
Iza Škrjanec (1), Senja Pollak (2)
(1) Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
(2) Jožef Stefan Institute, Ljubljana, Slovenia
skrjanec.iza@gmail.com, senja.pollak@ijs.si

Abstract
In the past years, blogs have become an increasingly popular genre for publishing content on the Web. Blogs are also one of the five genres of the Janes corpus of Slovene user-generated content. The aim of this paper is to explore the topics of the blog subcorpus of Janes using OntoGen, a semi-automatic and data-driven ontology editor. In addition to the construction of the topic ontology of the blogs from two Slovene blog portals, special focus is placed on the topical variation in entries by male and female bloggers. First, the keywords of selected topics differentiating male and female blog entries are analysed. Next, we present two topic ontologies, one based on blog entries by private female and the other by private male users, and contrast them against each other. The analysis has shown that both groups write about politics, family, romance and sexuality, environment, and nutrition. Men seem to blog more about spectator sports, music and literature, the Roman-Catholic Church, the refugee crisis, and biology; in contrast, female authors discuss religion, emotions and social politics.
Keywords: blogs, topic ontologies, gender, keyword analysis

1. Introduction
In corpus linguistics, corpora serve as the main resource for either testing various hypotheses or developing linguistic theories based on the corpus data. This is why it is important to learn about the properties of the corpus we are working with, e.g. recognizing frequent topics by observing keywords (Kilgarriff, 2012).
For Slovene, the topics of blogs in particular have not yet been studied; however, Logar Berginc and Ljubešić (2013) contrasted two Slovene corpora of various genres against each other: the crawled slWaC [1] corpus and the reference Gigafida [2] corpus. Using the LDA topic modelling method, n topics were constructed for each of the corpora. When comparing the topics, Logar Berginc and Ljubešić found that some topics appeared in both corpora (domestic policy, team sports, finance, war, terrorism, publications and culture, local politics, health and law). The slWaC corpus contains more documents on film and music, travelling and tourism, foreign affairs and classified ads. In contrast, the following topics are more prominent in Gigafida: cities, street traffic, public events, television and radio programs, individual sports, and work. Some differences between the reference corpus and the Janes corpus including blog entries (but also tweets, news comments, forum posts) have been identified through collocation analysis in Pollak (2015).
[1] http://nlp.ffzg.hr/resources/corpora/slwac/
[2] http://www.gigafida.net/
In this paper, we focus on topical variation between male and female bloggers. For English there have been some studies on how the content in social network posts or spoken language correlates with the demographic factors of users, such as gender and age. Using data mining techniques, Argamon et al. (2007) found that male bloggers tend to write about religion, politics, business and the Internet more frequently, while female bloggers blog about conversation, domestic environment, fun, romance, and swearing more than men. Schmid (2003) carried out a comparable study on the spoken part of the BNC corpus. He compiled a list of words typical for 14 different topics. Using relative frequencies, he observed which topics are more dominant in the corpus of female and male speakers. An overrepresentation of female speakers was detected in topics dealing with clothing, basic colors, home, food and drink, body and health, and people. In contrast, the domains of work, computing, sports, and public affairs were considered more typical of the male subcorpus. The domains on swearing and car and traffic occurred equally in the speech of both groups.
In comparison to the topic keyword analysis as for example in Logar Berginc and Ljubešić (2013), the approach selected for this study results in hierarchical ontologies which allow the identification of subtopics for each topic, enables the user to be involved in the process of ontology construction, and provides the visualization of the constructed ontologies. In addition to the understanding of the topics of the Slovene blogosphere, the main contribution of our paper is the research of gender and the Slovene language in social media. We thus wish to contribute to existing studies, e.g. on the use of emoticons and expressive punctuation in tweets (Osrajnik et al., 2015) and the discourses about women and men (Škrjanec et al., 2016).
The rest of the paper is structured as follows. In Section 2, the blog subcorpus and the text preparation are presented. The OntoGen tool and the ontology construction process are described in Section 3, and discussed in Section 4. In Section 5, we conclude the paper and suggest further work.

2. Corpus Description and Data Preparation
The corpus of Slovene blogs used in this paper is one of the subcorpora in version 04 of the Janes corpus of user-generated Slovene.
The corpus was compiled within the Janes [3] research project and contains various genres of user-generated content: tweets, news comments, forum posts, user and talk pages from Wikipedia, and blog entries and blog comments. In this paper, the focus is placed on the compilation and properties of the blog subcorpus, for which two Slovene blog portals were crawled: publishwall.si and rtvslo.si (Fišer et al., to appear).
The blog entries of the Janes corpus were contributed by over 800 users, which we annotated for their account type (private or corporate) and gender [4] (female, male and undefined). Corporate accounts belong to different companies or journalists, the rest are private. The gender was manually assigned based on the use of grammatical gender when referring to self, the profile picture and the username. If we were not able to identify the user as male or female, the tag “neutral” (meaning “undefined”) was used.
[3] The Janes project webpage: http://nl.ijs.si/janes/
[4] The bloggers themselves were not contacted and asked about their self-identification. Thus, the claims on their gender are based solely on the use of grammatical gender.
For our study, we selected the blog posts of male and female private users. Private users wrote over 29,000 entries altogether (female: 9,056; male: 20,105). For the ontology construction, blog entries in Slovene were taken into consideration.
Disregarding the gender and account type, the average length of blog entries and comments in the entire blog subcorpus is about 70.16 words (85.42 tokens) per entry. Since clustering algorithms perform better on longer texts than on shorter ones, blog entries with a minimum of 100 full words (no stop words) were used for ontology construction (9,039 entries by male and 3,771 by female users).

2.1.1 Text Preparation
The original vertical file of the Janes blog subcorpus was parsed into a format supported by OntoGen, in which each blog entry is represented with a single line containing the blog entry ID, category (female or male), and the lemma form of each token. All preprocessing steps were carried out with a simple Python programme. Stop words were removed. OntoGen cannot process diacritics and other special characters, so these were replaced with character sequences that enable the reconstruction of the original form.

3. Topic Ontology Construction
In this section, the OntoGen tool and the construction of three topic ontologies are presented.

3.1 The OntoGen Tool
For the construction of Slovene blog topic ontologies, we used the OntoGen tool [5], which is a semi-automatic data-driven ontology editor that combines text mining techniques with a fairly simple user interface (Fortuna et al., 2007). OntoGen is based on Bag-of-Words (BoW) vector representations of documents, weighted by the Term Frequency-Inverse Document Frequency weights. The tool provides subtopic suggestions based on the k-means clustering algorithm, with the parameter k being set by the user. The user then decides whether to add the clusters to the ontology. The user can also manually move the documents and provide labels for the clusters (topics). Additionally, if the input documents are pre-categorized, a method for grouping the instances according to the labels is also supported.
[5] The OntoGen tool: http://ontogen.ijs.si/

Figure 1: Topic ontology of entries by female bloggers.

The user can influence the division into subtopics by employing the Active learning functionality that is based on the SVM (Support Vector Machines) active learning method. The user provides a term or a set of keywords that represent a new subtopic to be added to the ontology. This action is followed by iterative model refinement through user interaction by answering to the question whether a particular document belongs to the topic or not.
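The two building blocks named in this section, TF-IDF-weighted bag-of-words vectors and k-means subtopic suggestions, can be illustrated with a minimal Python sketch. It is an illustration only and not OntoGen's implementation: the function names, the use of cosine similarity (spherical k-means), the fixed seed documents and the toy lemma lists are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenised documents into unit-length sparse TF-IDF vectors."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (equals cosine for unit vectors)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Sum of sparse vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, seeds, max_iter=20):
    """Cluster with k = len(seeds); seeds are indices of the initial centroids."""
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels, centroids


# Hypothetical toy input: one list of lemmas per blog entry (stop words removed).
docs = [
    ["ljubezen", "partner", "film", "odnos"],
    ["sistem", "narod", "ideja", "planet"],
    ["partner", "odnos", "ljubezen", "telo"],
    ["sistem", "planet", "kapitalizem", "ideja"],
]
labels, centroids = kmeans(tfidf_vectors(docs), seeds=[0, 1])
print(labels)
```

In this toy run the two entries about relationships and the two entries about the political system end up in separate clusters. In an interactive tool such as the one described above, the user would then accept, relabel or rearrange the suggested clusters.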
The user can decide when to stop the active learning process and the new (sub)topic is added to the ontology.
For each topic, OntoGen provides a list of keywords, which are the words that are the most descriptive for the content of the cluster, i.e. words with the highest weights in the document centroid vectors (ibid.). Another view is gained by inspecting SVM keywords, which are the words most distinctive for the selected concept with regard to its sibling concepts in the hierarchy (e.g. words contrasting male and female entries categorized in a selected topic).

3.2 The Construction of Three Topic Ontologies
The dataset with entries by private male and female bloggers was imported into OntoGen. The ontology was built using k-means for topic suggestions, and the active learning functionality, as well as by manually arranging the ontology. Because the entries were pre-categorized according to the user gender, we could examine the keywords and SVM keywords of topics, whereby the topics with a more or less comparable number of entries by female and male bloggers were selected for analysis. Two topics (Romance and sexuality; Political system) and their keywords are presented in Table 1. In addition, we constructed a topic ontology for entries by female (Figure 1) and male (Figure 2) users [6].
[6] For the common ontology, lemmas of uni- and bigrams with the minimum frequency of 20 were used, and for the gender specific ontology, the minimum frequency was set to 10.

Table 1: Keywords for the topics Romance and sexuality, and Political system.

Romance and sexuality
  Female (#431), keywords:     moški, ženska, želeti, film, partner, življenje, ljubezen, prijatelj, ženski, odnos
  Female (#431), SVM keywords: moški, želeti, partner, čutiti, strah, razmišljati, potrebovati, fb, spolnost, telo
  Male (#329), keywords:       moški, ženska, sex, film, prijatelj, ženski, žena, rak, dekle, želeti
  Male (#329), SVM keywords:   sex, ženska, žena, film, mati, bivši_žena, obraz, zgodbica, brada, punca
Political system
  Female (#129), keywords:     družba, sistem, obstajati, lasten, ego, narod, zavest, vrednota, človeški, življenje
  Female (#129), SVM keywords: želeti, obstajati, narod, telo, izkušnja, lasten, ego, sposoben, različen, zavest
  Male (#503), keywords:       družba, sistem, kapitalizem, družben, problem, človeški, življenje, planet, znanost, vrednota
  Male (#503), SVM keywords:   družba, sistem, bitje, sodoben, demokracija, ideja, planet, materialen, svoboda, stoletje

Figure 2: Topic ontology of entries by male bloggers.

4. Discussion
A keyword list can tell us something more about the main ideas and concepts users blog about concerning a particular topic. After constructing a common topic ontology of entries by both groups of users, we observed the entries on romance and sexuality to compare the keywords and SVM keywords of both groups. From the keywords in Table 1, we can learn that male and female bloggers use similar keywords (“woman”, “man”, “want”) with some variation. Observing the SVM keywords, which point out the differences, it is evident that female bloggers use more verbs (“feel”, “think”, “need”), while male bloggers focus more on the participants (“ex-wife”, “girlfriend”, “mother”) and appearance (“face”, “beard”). The keyword “crab”/“cancer” suggests that the topic is still somewhat noisy. The keywords for the topic Political system also reveal similarity between entries by men and women (“society”, “system”, “life”, “human”), whereas in entries by female bloggers the topic of nation comes forward. In the entries by male bloggers, terms like “capitalism”, “democracy” and “freedom” indicate a specific political issue.
A topical comparison of blog entries by female and male bloggers (Figures 1 and 2) shows some interesting similarities and differences. Both groups seem to write about the environment, nutrition, family and parenthood, sexuality, and politics, in particular the subtopic on Slovenian politics and the (post)independence era (the Independence War, post-war killings and the role of former communists today). Another common topic is economy (mostly Slovene and EU). One of the more prominent topics of male bloggers is that on the Slovene politician Janez Janša, mostly concerning his 2013–2015 trial for corruption. An evident topic on current affairs is also that of the refugee crisis in the male ontology. In contrast to female bloggers, male authors contributed a significant number of entries on biology, spectator sports, music and literature. They also discuss the role of the Roman Catholic Church. In turn, female bloggers write more about spirituality in connection to various religious beliefs, and nature. Emotions are also a prevalent blog topic of female users; additionally, they pay special attention to social politics and issues, such as handicapped people and their social rights.

5. Conclusion
In the paper, we described the process of topic ontology construction and keyword analysis of blog entries from two Slovene blog portals. The goal was to contrast the topics covered by female and male bloggers.
To avoid over-generalization on gendered topics, it is important to take into account the distribution of blog entries among bloggers. Some topics are heavily dominated by a very small number of bloggers (Biology, Social issues and politics), but this is not visible in the ontology. When using quantitative methods to explore gender and language use, it seems the tendency is to favour differences, while backgrounding similarities, what Baker (2014) calls the “difference mindset”. The findings of studies such as this one may suggest and show mostly the differences. However, the language and topics of a single gendered group are not homogeneous, which is what Baker (ibid.) discovered when he contrasted “same-sex” parts of the BNC spoken among each other using Manhattan Distance for a list of keywords. He found that some pairs of “same-sex” parts vary more than pairs of “mixed-sex” combinations.
In spite of considering this issue, the analysis has shown that some topics (Refugee crisis, Janez Janša, Biology, Spectator sports, Music and literature) seem more prominent in entries by male bloggers, while female bloggers typically contribute to topics like Religion, Nature, Emotions, Social politics. When writing about mutual topics (Romance and sexuality, Political system), female and male bloggers discuss them from different perspectives.
Our methodology can be applied to explore the topics of the entire blog subcorpus, including corporate users and those undefined in terms of gender. The information on the predominant topic of the entry could enrich the existing blog metadata: user gender, account type, the linguistic and technical standardness and sentiment of the text. In the future, the automatization of the topic labelling could be performed by combining clustering and terminology extraction as shown in Fortuna et al. (2008). Adding the topic to the metadata enables a more fine-grained analysis of discursive strategies for the same topic with regard to the gender of the user, which is something we plan to carry out in the future.

6. Acknowledgements
The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Resources, Tools and Methods for the Research of Nonstandard Internet Slovene” (J6-6842, 2014-2017).

7. References
Argamon, S., Koppel, M., Pennebaker, J. W., Schler, J. (2007). Mining the Blogosphere: Age, gender and the varieties of self-expression. First Monday, 12(9).
Baker, P. (2014). Using Corpora to Analyze Gender. London: Bloomsbury.
Fišer, D., Erjavec, T., Ljubešić, N. (to appear). Janes v0.4: korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2.0 – Special Issue.
Fortuna, B., Grobelnik, M., Mladenić, D. (2007). OntoGen: Semi-automatic Ontology Editor. HCI International 2007, July 2007, Beijing. 309–318.
Fortuna, B., Lavrač, N., Velardi, P. (2008). Advancing Topic Ontology Learning through Term Extraction. In Ho, T., Zhou, Z. (eds), Proceedings of PRICAI 2008: Trends in Artificial Intelligence. Hanoi, Vietnam, December 15-19, 2008. 626–635.
Kilgarriff, A. (2012). Getting to know your corpus. In Sojka, P., Horak, A., Kopecek, I. (eds), Proceedings of the 15th International Conference on Text, Speech and Dialogue (TSD2012), pages 3-15. Brno, Czech Republic: Springer.
Logar Berginc, N., Ljubešić, N. (2013). Gigafida in slWaC: tematska primerjava. Slovenščina 2.0, 1 (1): 78–110.
Osrajnik, E., Fišer, D., Popič, D. (2015). Primerjava rabe ekspresivnih ločil v tvitih slovenskih uporabnikov in uporabnic. Fišer, D. (ed), Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba Filozofske fakultete, 50–74.
Pollak, S. (2015). Identifikacija spletno specifičnih kolokacij pogostega besedišča. Fišer, D. (ed), Zbornik konference Slovenščina na spletu in v novih medijih. Ljubljana: Znanstvena založba FF UL, 57–62.
Schmid, H. J. (2003). Do men and women really live in different cultures? Evidence from the BNC. In: Wilson, A., Rayson, R. and McEnery, T. (eds), Corpus Linguistics by the Lune. Łódź Studies in Language 8. Frankfurt: Peter Lang. 185-221.
Škrjanec, I., Sobočan, A. M., Pollak, S. (2016). The lexical environments of woman and man in the corpus of Internet Slovene. In: Granić, J., Kecskes, I. (eds), Proceeding of the 7th INPRA Conference. 10-12 June 2016, Split, Croatia. 161.

Linguistic Characteristics of Dutch Computer-Mediated Communication: CMC and School Writing Compared
Lieke Verheijen
Radboud University (Nijmegen, the Netherlands)
E-mail: lieke.verheijen@let.ru.nl

Abstract
Computer-mediated communication has become essential in many youths’ lives.
Because language in CMC frequently deviates from standard language norms, it is feared to harm youngsters’ traditional literacy skills. To determine if and, if so, how social media affect their writing skills, we first need to establish how CMC actually differs from the standard language. This paper presents findings of a study comparing CMC texts and school essays by youths from the Netherlands. Linguistic analyses were done with T-Scan, software specifically designed for Dutch texts. A range of lexical measures (lexical diversity, ‘special’ words, lexical density, ellipses) and syntactic measures (dependency lengths, subordinate clauses, sentence length, D-level) were studied. Results reveal that in comparison to their school writings, Dutch youths’ computer-mediated communication is syntactically less complex, contains more omissions, and is lexically more diverse, different, and dense. These youths thus employ different registers in the writing contexts of CMC and school. Keywords: computer-mediated communication, social media, writing, register, literacy MSN chats, SMS, tweets, and WhatsApp chats. These 1. Introduction social media represent four CMC genres: instant Most youths’ daily lives are nowadays filled with messaging with an internet application, text messaging, computer-mediated communication. Instant messaging, microblogging, and instant messaging with a mobile texting, and other social media are essential for them to phone app. The first three genres were selected from keep in touch with friends and family. In computer- SoNaR (‘STEVIN Nederlandstalig Referentiecorpus’), a mediated messages, it is key to communicate effectively, reference corpus of written Dutch (Treurniet & Sanders, expressively, and informally. As a result, CMC writings 2012; Oostdijk et al., 2013). WhatsApp chats were frequently differ from standard language conventions (e.g. 
gathered especially for the purposes of my project, via a Thurlow & Brown, 2003; Crystal, 2008; Frehner, 2008; website where youths could voluntarily donate their Cougnon & Fairon, 2014). Notable differences are messages, http://cls.ru.nl/whatsapptaal/. Table 1 shows nonstandard orthography and syntax, as in ‘ fyi i’ll B specifics of the CMC corpus. For comparison, I also @home l8er 2night, u OK with that? car broke down  ’. collected school writings. These were written by youths of This sentence contains abbreviations, omissions, an similar ages as the CMC texts, of different educational emoticon, and lacks capitalisation and punctuation at the levels. Table 2 shows more details on the school essays. appropriate places. Such deviations in CMC from the ‘official’ language norms are a source of worry for many Genre Years of Age # words # chats or parents and language teachers: they fear it damages collection group contributors youths’ traditional literacy skills. MSN 2009-2010 12-17 45,051 106 18-23 4,056 21 SMS 2011 12-17 1,009 7 2. Research Goals 18-23 23,790 42 This paper presents a study that is part of my PhD project Twitter 2011 12-17 22,968 25 18-23 99,296 83 into the impact of CMC on literacy. In order to determine WhatsApp 2015 12-17 55,865 11 / 84 whether and, if so, how youths’ social media use affects 18-23 140,134 23 / 132 their writings at school, it is imperative to first investigate total 2009-2015 12-23 392,169 what youths’ CMC actually looks like and how it differs # chats: MSN, WhatsApp; # contributors: SMS, Twitter, WhatsApp from the standard language. The main goal of this study is Table 1: CMC texts. to explore in what ways the informal language used by Dutch youths in CMC differs from their more formal Educational level Years of Age # words # texts school writings. 
These questions were analysed by means production group of a manual analysis, as well as an automatic analysis; the lower secondary 2013-2014 ± 14-15, 50,143 128 present paper focuses on the latter. ( vmbo) 3rd grade higher secondary 2013-2014 ± 14-15, 50,070 153 3. Methodology ( vwo) 3rd grade lower tertiary 2012-2014 ± 17-18, 39,793 137 ( mbo) 2nd grade 3.1 Materials higher tertiary 2012-2014 ± 18-19, 50,175 169 For my study into Dutch written CMC, I used a corpus of ( uni) 1st grade CMC texts by youths between 12 and 23 years old, with total 2012-2014 ± 14-19 190,181 587 Table 2: School essays. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 66 Ljubljana, Slovenia, 27–28 September 2016 3.2 Method T-Scan computes the density of ‘special words’, measured A quantitative corpus study was conducted. For the first per one thousand words. This includes names, loanwords, part of the analysis, frequencies of several linguistic numbers, Roman numerals, and times. On average, the features were counted manually in the CMC texts. Yet CMC writings had a higher density of ‘special words’ ( M this paper focuses on the second/automatic part of the = 140.77, SE = 33.20) than the school writings ( M = analysis, comparing the CMC texts to school writings 28.58, SE = 4.02), t(10) = -3.35, p < .01. Figure 2 with T-Scan – software specifically designed for Dutch illustrates this and shows that there is much variation texts (Pander Maat et al., 2014). On the basis of between CMC genres. The greater frequency of ‘special theoretical considerations, a range of relevant lexical and words’ is because of textisms, misspellings, typos, and syntactic measures were selected. 
It was hypothesized that URLs in CMC – character strings that T-Scan cannot CMC texts, compared to school essays, are lexically more recognize as words, since they deviate orthographically diverse, different, and dense; contain more omissions; and from Standard Dutch and are not listed in any standard are syntactically less complex. Independent t-tests were dictionaries. Tweets in particular include many URLs and conducted to compute whether differences were ‘words’ of the format @username, within messages in significant; one-tailed probability values are reported here. response to another user’s tweet (replies) or messages directed at another user (mentions). This higher density 4. Results and Discussion endorses the hypothesis that CMC is lexically more different from the standard language. 4.1 Lexical Analysis 350 The measure of textual lexical diversity (MTLD) is the 300 average length of sequential word strings in a text that 250 maintain a type-token ratio (TTR) above a specified 200 threshold (McCarthy & Jarvis, 2010). The MTLD depends 150 on the TTR, which is calculated by dividing the number of 100 types (different words) by the number of tokens (total 50 number of words). Although the TTR is a classic measure, 0 the MTLD is more reliable, because it is insensitive to text length. A higher MTLD value indicates more lexical diversity: more different words or differently spelled words. On average, the CMC writings had a higher lexical diversity ( M = 119.62, SE = 14.39) than the school Figure 2: Density of ‘special words’. writings ( M = 76.10, SE = 2.23), t(10) = -2.08, p < 0.05. Figure 1 shows that the MTLD was higher in the CMC The third lexical measure that was selected is lexical texts, with the exception of WhatsApp chats by 12-17- density. This is the number of content words (nouns, year-olds. 
1 The higher lexical diversity depends on the verbs, adjectives, and adverbs) per one thousand words orthographic variation in written CMC, due to textisms (e.g. Johansson, 2008). When a text has a high lexical (unconventional spellings, deviating from the standard density, it contains many content words and few function language norms), misspellings (‘errors’, as judged by words. On average, the CMC writings had a higher lexical linguistic prescriptivists), and typos (incorrect key presses density ( M = 531.70, SE = 9.28) than the school writings or false predictions by predictive software). This confirms ( M = 481.31, SE = 2.68), t(10) = -3.71, p < .01, as shown the hypothesis that CMC is lexically more diverse. in Figure 3. This is due to the frequent omission of function words in CMC, which is known for its concise 200 writing style, somewhat similar to that of telegrams or newspaper headlines. The findings from T-Scan thus 150 support the hypothesis that CMC is lexically denser. 100 600 50 550 0 500 450 400 Figure 1: Measure of textual lexical diversity (MTLD). 1 This apparent exception can be attributed to the frequent repetition of chain messages and certain words in a spam-like manner by one contributor; excluding this outlier, the MTLD Figure 3: Lexical density. would be 92.70 – higher than the school essays, as hypothesized. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 67 Ljubljana, Slovenia, 27–28 September 2016 Another interesting measure is the density of elliptical T-Scan also measures the average number of subordinate constructions, quantified as the number of finite verbs clauses per sentence. It includes both finite (relative, without a subject per one thousand words. On average, the adverbial, and complement clauses) and infinitival CMC writings had a higher density of ellipses ( M = 25.86, subclauses. 
SE = 3.17) than the school writings (M = 8.60, SE = 1.18), t(10) = -5.10, p < .001. Figure 4 shows that the CMC writings of all genres contained more elided subjects (though just barely for MSN chats by 18-23 year olds). This backs up the abovementioned results on lexical density: informal written CMC contains fewer function words than formal school essays, at least partly due to the frequent omission of grammatical subjects.

Figure 4: Density of ellipses.

4.2 Syntactic Analysis

One measure of syntactic complexity is the average of all dependency lengths per sentence. The dependency length is the distance between a head (of a sentence or phrase) and its dependent, such as a finite verb and the subject or an article and the corresponding noun. T-Scan expresses the distance in number of words that need to be skipped from head to dependent. Texts with a higher average dependency length contain more discontinuous structures, making them syntactically more complex and more difficult to process for readers (Gibson, 2000). On average, the CMC writings had a lower average of all dependency lengths per sentence (M = 0.63, SE = 0.06) than the school writings (M = 1.59, SE = 0.10), t(10) = 9.04, p < .001. It is clear from Figure 5 that the CMC texts of all genres had lower average dependency lengths, no matter what the writer’s age or educational level. This supports the idea that CMC is syntactically less complex.

Figure 5: Average of all dependency lengths per sentence.

A higher density of subclauses is indicative of greater syntactic complexity. On average, the CMC writings had a lower average no. of subordinate clauses per sentence (M = 0.14, SE = 0.02) than the school writings (M = 0.80, SE = 0.06), t(10) = 10.21, p < .001. Figure 6 clearly shows that the CMC texts overall contained fewer subordinate clauses. Again, the lower syntactic complexity of CMC is confirmed by T-Scan.

Figure 6: Average no. of subordinate clauses per sentence.

Another complexity measure provided by T-Scan is the average sentence length, which is measured in number of words. A higher average sentence length indicates more syntactic complexity. On average, the CMC writings had a lower average sentence length (M = 6.55, SE = 0.28) than the school writings (M = 16.33, SE = 0.79), t(10) = 14.76, p < .001. Figure 7 shows that the texts of all four CMC genres contained much shorter sentences than the school essays, irrespective of the writer’s educational level or age. Once more, the hypothesis is confirmed.

Figure 7: Average sentence length.

A final relevant syntactic measure is the so-called D-level. The D-level of a text is determined on the basis of a classification and rank order of sentence types in eight increasingly complex developmental levels, in the order in which children learn these constructions (Rosenberg & Abbeduto, 1987; Covington, 2006). The assumption is that a higher D-level value suggests more syntactic complexity. On average, the CMC writings had a lower D-level (M = 0.88, SE = 0.08) than the school writings (M = 2.87, SE = 0.10), t(10) = 15.51, p < .001. The CMC texts of all four genres had lower D-levels, as can be seen in Figure 8. This result is in line with the proposed hypothesis on syntactic complexity.

Figure 8: D-level.

5. Conclusion

To conclude, the lexical and syntactic analysis of CMC texts of four social media support my hypothesis: in comparison to school writing, CMC is lexically more diverse, different, and dense, while syntactically it contains more omissions and is less complex. This proves that Dutch youths in secondary and tertiary education employ a different register in informal computer-mediated communication than in texts written in more formal settings. These results are hopeful: perhaps deviations from the standard language in youngsters’ CMC do not cause great interference with their traditional writing skills after all – they might be quite capable of keeping the registers separate, as societal norms expect them to do.

6. Future Work

A limitation of the present study is that the materials compared here, i.e. CMC discourse and texts written at school, were not produced by the same writers. In addition, they have been collected over a relatively long time span, of six years. For a more precise answer to the question if and, if so, how CMC use affects school writing, I plan to conduct research in which (a) social media data and school texts of the same students are collected and analysed and (b) additional information about writers’ use of CMC and social media (in terms of frequency/intensity) are gathered through surveys. Future work will include one more genre, namely posts from the social networking site Facebook. Furthermore, it unfortunately exceeded the scope of this paper to closely examine variation between texts of different genres, educational levels, ages; this may also be explored further. Still, this study can serve as a fruitful basis for analyses on the impact of written computer-mediated communication on young people’s literacy skills.

7. Acknowledgements

This study is part of a research project funded by the Dutch Organisation for Scientific Research (NWO), project number 322-70-006. I would like to thank Wilbert Spooren and the anonymous reviewers for their useful comments on previous versions of this paper.

8. References

Cougnon, L.-A., & Fairon, C., Eds. (2014). SMS Communication: A Linguistic Approach. Amsterdam: John Benjamins.
Covington, M.A., He, C., Brown, C., Naçi, L., & Brown, J. (2006). How Complex is That Sentence? A Proposed Revision of the Rosenberg and Abbeduto D-Level Scale. CASPR Research Report 2006-01. University of Georgia: Artificial Intelligence Center.
Crystal, D. (2008). Txtng: The Gr8 Db8. Oxford: Oxford University Press.
Frehner, C. (2008). Email - SMS - MMS: The Linguistic Creativity of Asynchronous Discourse in the New Media Age. Bern: Peter Lang.
Gibson, E. (2000). The dependency locality theory: a distance-based theory of linguistic complexity. In Y. Miyashita, A.P. Marantz & W. O’Neil (Eds.), Image, Language, Brain. Cambridge: MIT Press, pp. 95–126.
Johansson, V. (2008). Lexical diversity and lexical density in speech and writing: A developmental perspective. Working Papers in Linguistics, 53, 61–79.
McCarthy, P., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
Oostdijk, N., Reynaert, M., Hoste, V., & Schuurman, I. (2013). The construction of a 500-million-word reference corpus of contemporary written Dutch. In P. Spyns & J. Odijk (Eds.), Essential Speech and Language Technology for Dutch: Results by the STEVIN Programme. Heidelberg: Springer, pp. 219–247.
Pander Maat, H., Kraf, R., van den Bosch, A., Dekker, N., van Gompel, M., Kleijn, S., Sanders, T., & van der Sloot, K. (2014). T-Scan: A new tool for analyzing Dutch text. Computational Linguistics in the Netherlands Journal, 4, 53–74.
Rosenberg, S., & Abbeduto, L. (1987). Indicators of linguistic competence in the peer group conversational behavior of mildly retarded adults. Applied Psycholinguistics, 8(1), 19–32.
Thurlow, C., & Brown, A. (2003). Generation txt? The sociolinguistics of young people’s text-messaging. Discourse Analysis Online, 1.
Treurniet, M., & Sanders, E. (2012). Chats, tweets and SMS in the SoNaR corpus: Social media collection. In D. Newman (Ed.), Proceedings of the First Annual International Conference on Language, Literature & Linguistics. Singapore: Global Science and Technology Forum, pp. 268–271.
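
As an illustration of the average dependency length measure described in Section 4.2, the following is a minimal sketch only. It assumes, in line with the description there, that a dependency's length is the number of words skipped between head and dependent, so that adjacent words have length 0. The function names and the head-index input format are hypothetical and do not reflect T-Scan's actual interface or implementation.

```python
from statistics import mean


def dependency_lengths(heads):
    """Return the length of every dependency in one sentence.

    `heads[i]` is the 0-based position of the head of token i,
    or None if token i is the root (assumed input format).
    Length = number of words skipped between head and dependent.
    """
    return [abs(h - i) - 1 for i, h in enumerate(heads) if h is not None]


def mean_dependency_length(heads):
    """Average of all dependency lengths in one sentence (0.0 if none)."""
    lengths = dependency_lengths(heads)
    return mean(lengths) if lengths else 0.0


# Toy sentence "the old dog barked":
# the -> dog, old -> dog, dog -> barked, barked is the root.
toy_heads = [2, 2, 3, None]
toy_lengths = dependency_lengths(toy_heads)
toy_mean = mean_dependency_length(toy_heads)
```

In the toy sentence only the article is separated from its head by one intervening word, so the three dependencies have lengths 1, 0 and 0. A text-level score would then average these per-sentence values over all sentences.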
A Multimodal Analysis of Task Instructions for Webconferencing-supported L2 Interactions: A Pilot Study of the ISMAEL Corpus

Ciara R. Wigham,† H. Müge Satar*
† Clermont Université, Laboratoire de Recherche sur le Langage (LRL), MSH, 4 rue Ledru, 63000 Clermont-Ferrand
* Boğaziçi Üniversitesi School of Foreign Languages, Bebek, 34342, Istanbul
Email: ciara.wigham@univ-bpclermont.fr, muge.satar@boun.edu.tr

Abstract

This pilot study examines how trainee language teachers use the different semiotic resources available to them during webconferencing-supported interactions to give task instructions. The sub-corpus examined is taken from the ISMAEL corpus (Guichon et al., 2014) that structured interaction data from a six-week telecollaborative exchange between trainee teachers of French and learners of French, who majored in Business. The study explores, firstly, how the corpus of synchronous CMC interactions was structured in order to be used by researchers who were not involved in the pedagogical project. Secondly, we will describe how the interactions were transcribed with reference to a multimodal interactional analysis approach. Thirdly, a sequential analysis of two trainee teachers’ instruction-giving practices for a role-play task will be presented. The aim of the pilot study is to determine whether research and pedagogical leads emerge that warrant a larger investigation of the corpus with relation to multimodal instruction-giving practices.

Keywords: instruction-giving, LEarning and TEaching Corpora (LETEC), multimodality, teacher-training, webconferencing

1. Introduction and Research Aims

Tasks in the second language classroom allow for authentic communication with a focus on meaning (Ellis, 2003; Nunan, 2004). Alongside recent pedagogical moves towards task-based language teaching (TBLT) approaches, telecollaboration is also gaining increasing interest and research has started to explore how, by bringing together different student populations from different cultures and languages, telecollaboration can support language learning and help prepare students for physical mobility programmes, or, if involving teacher-trainee populations, prepare trainees for online mediated teaching contexts (Guth & Helm, 2010). Many telecollaboration programmes based on TBLT use synchronous means of communication to bring together the student populations that are in geographically distant locations. However, as Guichon & Cohen underline, whilst “synchronicity is generally seen as bringing real value to online pedagogical interactions […], research investigating the potential of a broad array of channels has been much less frequent” (2014:332).

In any foreign-language classroom, instruction-giving is a significant part of teacher-talk time. Indeed, in TBLT, specific teacher roles include guiding and facilitating learning during task completion and explaining the purpose, expected results and task completion steps in understandable ways for learners (Raith & Hegelheimer, 2010). Although a limited number of studies have explored teachers’ instruction-giving practices (see Section 2), research on instruction-giving practices in synchronous online contexts is currently non-existent.

This pilot study attempts to bridge the research gaps mentioned above by focusing on how trainee teachers of French as a foreign language give task instructions during webconferencing-supported interactions and, more specifically, how they use the multimodal semiotic resources available to them during these practices. The data examined in this qualitative study is taken from the ISMAEL corpus (Guichon et al., 2014) that structured the interaction data from a six-week telecollaborative exchange between undergraduate Business students learning French at an Irish higher education institution and trainee teachers on a Master’s programme in Teaching French as a Foreign language at a French University. In our paper presentation, we will, firstly, examine how the corpus was structured. Then, drawing on multimodal interactional analysis and conversation analysis approaches, we will examine a sub-corpus of two trainee teachers’ instruction-giving practices for a role-play rehearsal task (Nunan, 2004). In particular, we examine how the trainee teachers contextualise instruction-giving sequences. The aim of the pilot study is to discern whether a larger investigation of the corpus would be pertinent and which more specific research questions such a study could address.

2. Instruction-giving

Instructions are defined as directives, explanations or questions, etc. used by the teacher in order “to get the students to do something” (Watson Todd, 1997:32). Instructions could constitute such a crucial aspect of the classroom activities that successful task outcomes may depend on effective instructions (Watson Todd et al., 2008). Seedhouse (2008) investigated instruction-giving practices from a conversational analysis approach, focusing on how teachers create, manage and maintain a shift in focus through the use of discourse markers, changes in the spatial configuration of participants and metadiscoursal comments. He describes how semiotic means, through the proxemic distance placed between the teacher and resources, allowed a shift in focus. Markee (2015a) examined instructions from an ethnomethodological perspective, concluding that “non-verbal aspects of communication are a vital part of instructions” (p. 126). These non-verbal aspects included gaze, cultural artifacts, gestures and embodied actions. His observation of overlaps between teacher instructions and learner responses indicated that instructions are not monologues, but they have an interactional nature. According to Markee (2015a), teachers’ instructions in the classroom comprise six fragments: “(1) how [students] will be working (in dyads or small groups); (2) what resources they will need; (3) what tasks they have to accomplish; (4) how they will accomplish the task; (5) how much time they have to accomplish these tasks; (6) and why they should do something” (pp. 120-121). Markee (2015b) concluded that further research is needed on teachers’ instruction-giving practices particularly in second language teaching.

Whilst Markee appears to be referring to face-to-face teaching contexts, his statement appears all the more true for computer-assisted language learning contexts as we failed to identify any studies that specifically detailed instruction-giving sequences in synchronous online pedagogical interactions. This observation was the starting point for the analysis presented in this paper.

3. Methodology

This section presents our research methodology. The corpus design will be the focus of the first part of our paper presentation.

3.1 ISMAEL Corpus and the Pedagogical Context

This study draws on the ISMAEL corpus (Guichon et al., 2014) that structured data from a telecollaboration project between Business undergraduates at Dublin City University (DCU) and trainee teachers (henceforth, trainees) at Université Lyon 2 (Lyon2) on a French as a foreign language Master’s programme. For the Lyon2 students, the exchange formed part of an optional module in online teaching that aims to help the trainees develop professional skills to teach French online and to analyse their online teaching practice and develop reflective analysis around this. For the undergraduate DCU students, the exchange composed part of a 12-week blended French for Business module that had CEFR level B1.2 as its minimum exit level (Council of Europe, 2001).

Participants completed six 40-minute weekly online sessions via webconferencing in autumn 2013. Two of the trainees planned each session (except the introductory session) around a theme of Business French according to the needs of DCU students as they prepare for an internship in France. Therefore, the topics for the sessions were preparing for an internship, project management, pitching a project, interviews, and labour law. The online webconferencing sessions took place on Visu (Guichon, Bétrancourt, & Prié, 2012) as part of a larger circular learning design (detailed in Guichon & Wigham, 2016). In this presentation, we will only draw on the data from the synchronous sessions.

Twelve of the 18 students (eight females, four males) and all of the trainees (ten females, two males) gave permission for their data to be included in the ISMAEL corpus. Thus, the corpus includes data of 7 groups. Because of differently sized groups, five groups comprised a trainee working with two learners whilst the other two groups were learner-trainee pairs. Currently, 24 of the 35 synchronous interactions included have been transcribed, totalling 13h04m30s of data. Pseudonyms are used for all personal information.

During the structuration phase of the ISMAEL corpus, the different participants’ webcam videos had been extracted from the Visu software and imported into the transcription software ELAN (Sloetjes & Wittenburg, 2008). The spoken interaction of all the online sessions had been transcribed and, using the timestamps created in Visu, the parallel text chat logs had been synchronized with these transcriptions. With regards to LEarning and TEaching Corpora (LETEC, Reffay et al., 2012), the learning design for the telecollaboration project, as well as documents related to the research protocol, was also available within the corpus.

3.2 Sub-corpus Examined

This preliminary study examines data from the fourth session of the telecollaboration project. During this session, participants engaged in a role-playing task that concerned project management. This task was planned in three stages. First, the trainees would introduce the roles for the learners (co-workers at McDonalds) and for themselves (manager). At this stage, learners needed to collaboratively find a new formula for children’s birthday parties organized at the fast-food restaurant. During the second stage, the learners were asked to list the actions required to execute their new idea in text chat. In the final stage, the trainee (in the role of the manager) would guide a reflection session on the ideas of the learners (i.e. the employees) using questions such as: What action would you need to put into place first: which is the most important for you? Why?

A sub-corpus of the instruction-giving interaction data from two of the seven teacher trainees (Samia, Etienne) was chosen for analysis. Samia is a 23 year old female who has completed several teaching observation placements and who has experience of one-to-one tuition and some French language teaching at first school in Germany. One of her learners spoke English as his first language (Sean) whilst the other’s (Angela) mother tongue was German.

Etienne is a 24 year old male who has no formal teaching experience. He had been involved in running conversation workshops in French as a foreign language at an American University over a five-month period. Etienne’s learners were Conor, who was of Irish origin and Sophie who was a Spanish speaker (L1). Neither of the trainees was involved in preparing the lesson plan for this session which had been prepared by their classmates. Samia’s session lasted 35m21s whilst Etienne’s session lasted 20m46s. Figures 1 and 2 give an overview of the verbal interaction data for these sessions.

Figure 1: Overview of verbal interaction data.

Figure 2: Total length audio turns.

3.3 Analysis approach and procedures

Data for this presentation was analysed using multimodal interactional analysis (Norris, 2004) which aims to explore people’s meaning-making practices in the moment-by-moment construction of interaction with an emphasis on “how people employ gesture, gaze, posture, movement, space and objects to mediate interaction in a given context” (Jewitt, 2011: 34). For the verbal data, we also make use of conversation analysis techniques. The initial step in the analysis was to identify instruction-giving sequences for the role-playing task by isolating trainees’ transition into the task and the several fragments that were introduced to cover all aspects of the instructions. The second analysis step was the annotation of the co-verbal acts that accompanied task instructions. The co-verbal actions included gaze, facial expressions, head movements, gestures and distance between the webcam and the participant.

It is worth noting that the approach to the analysis of the sub-corpus involved a researcher who was closely involved in the data collection, data transcription and the structuration of the corpus and an ‘outsider’ who did not know the participants and the context (cf. Guichon, in print). Both researchers worked on the sub-corpus together, constantly comparing their interpretations of the data and how the instruction-giving sequences were organised. We will briefly touch on the advantages and disadvantages of data analysis that involves ‘insider’ and ‘outsider’ researchers.

4. Preliminary Findings

In the second part of our paper presentation, we will look closely at the interaction data and will present a sequential analysis of each of the two instruction-giving sequences. Due to space constraints, it is not possible here to go into depth concerning the micro-analysis conducted. Rather, we summarise the analysis of each case.

The analysis of the instruction-giving sequence in the session conducted by Samia shows a clear step-by-step approach to instruction giving. Gaze plays an important role in punctuating these steps. Samia combines the audio and text chat modalities to elicit key vocabulary for the task and concept check these items. Gaze shifts, accompanied by vocatives, play an important part in assigning learner roles. Samia then makes use of the visual mode to communicate, through a change in proximity, that she is giving greater control of the floor to learners as they begin the task and, thus, that she wishes to step out of her interaction management role. A shift in pronoun use to the inclusive ‘we’ also allows her to show verbally that she has moved into the fictitious role of manager rather than the managerial role of task instruction-giver.

In contrast, in the analysis of Etienne’s instruction-giving sequence, the trainee first of all sets the context for the task by checking the concept of children’s birthday parties and then proceeds by indicating his role and providing examples of possible themes. This helped learners identify what constitutes the trainee’s expectations concerning successful task completion. However, as they had not yet been given their roles some confusion ensues. Learner role allocation was achieved through a side-sequence during a long task-preparation phase rather than, as was the case with Samia, as a main step in the instruction-giving process. The trainee’s multimodal interaction during this phase is of particular interest. In the visual mode he attempts to remove his presence from the interactional order through a change in posture and proximity, underlining that this is an individual-work phase. Gaze change during this preparation phase allows the trainee to monitor whether he has covered all of the information points that are necessary for the task and prompt Etienne to introduce a side sequence in which he allocates learner roles. The laughter and posture change that follow help to signal the learners’ better understanding of the task instructions.

5. Discussion

Our initial analysis suggests that in order to draw pedagogical conclusions, it would be of particular interest to further examine instruction-giving sequences with reference to how the beginning and different stages of the task instructions are marked; how the trainees allocate roles required by the task during these sequences; and how trainees deal with key lexical items.

With reference to these points, the data examined in this pilot investigation suggests, firstly, that changes in proximity to the webcam may be a successful technique to highlight changes in role and show learners that the trainee is moving into his fictional role required by the task. Secondly, the multimodal analysis sheds light on different strategies employed by the trainees to introduce vocabulary for the task. Whilst Samia used elicitation to concept-check key vocabulary that she often then put into the text chat modality, Etienne preferred to use pre-emptive vocabulary explanation to establish the context for the task and used reduced proximity to signal when he was willing to leave the floor/interactional order.

Thirdly, combining vocatives in the audio modality and gaze in the visual mode appeared effective in role allocation whilst the other session demonstrates what happens when task instructions, especially role allocation, are not complete and how the resulting confusion and uneasiness can be resolved.

The presentation will conclude with pedagogical recommendations highlighting the need to raise teacher trainees’ awareness of the multimodal features of webconferencing that can be employed to facilitate instruction-giving.

6. Acknowledgments

The authors are grateful to the LABEX ASLAN (ANR-10-LABX-0081) of Université de Lyon for its support within the program ‘Investissements d’Avenir’ (ANR-11-IDEX-0007) of the French government operated by the National Research Agency (ANR).

7. References

Ellis, R. (2003). Task-based language learning and teaching. New York: Oxford University Press.
Guichon, N. (in print). Sharing a multimodal corpus to study webcam-mediated language teaching. Language Learning & Technology.
Guichon, N., Bétrancourt, M., & Prié, Y. (2012). Managing written and oral negative feedback in a synchronous online teaching situation. Computer assisted language learning, 25(2), 181–197.
Guichon, N., Blin, F., Wigham, C.R., & Thouësny, S. (2014). ISMAEL LEarning and TEaching Corpus. Dublin, Ireland: Centre for Translation and Textual Studies & Lyon, France: Laboratoire ICAR.
Guichon, N. & Cohen, C. (2014). The Impact Of The Webcam On An Online L2 Interaction. Canadian Modern Language Review, 70(3), 331–354.
Guichon, N. & Wigham, C. R. (2016). A semiotic perspective on webconferencing-supported language teaching. ReCALL, 28(1), 62–82.
Guth, S. & Helm, F. (2010). Telecollaboration 2.0. New York: Peter Lang.
Jewitt, C. (2011). Different approaches to multimodality. In C. Jewitt (Ed.), The Routledge Handbook of Multimodal Analysis (pp. 28–39). London: Routledge.
Markee, N. (2015a). Giving and following pedagogical instructions in task-based instruction: An ethnomethodological perspective. In P. Seedhouse and C. Jenks (Eds.), International Perspectives on the ELT Classroom (pp. 110–128). Basingstoke: Palgrave MacMillan.
Markee, N. (2015b). Teachers’ instructions: Toward a collections-based, comparative research agenda in classroom conversation analysis. Paper presented at HUMAN Social Interaction and Applied Linguistics Postgraduate Conference, 08 September 2015, Hacettepe University, Ankara. [https://sial2015hu.files.wordpress.com/2015/09/1-ankara-paper-final.pdf]
Norris, S. (2004). Analyzing multimodal interaction: a methodological framework. London: Routledge.
Nunan, D. (2004). Task-Based Language Teaching. Cambridge: Cambridge University Press.
Raith, T. & Hegelheimer, V. (2010). Teacher Development, TBLT and Technology. In M. Thomas & H. Reinders (Eds.), Task-Based Language Learning and Teaching with Technology (pp. 154–175). London: Continuum.
Reffay, C., Betbeder, M-L. & Chanier, T. (2012). Multimodal learning and teaching corpora exchange: lessons learned in five years by the Mulce project. Int. J. Technology Enhanced Learning, 4(1/2), 11–30.
Seedhouse, P. (2008). Learning to Talk the Talk: Conversation Analysis as a Tool for Induction of Trainee Teachers. In S. Garton & K. Richards (Eds.), Professional encounters in TESOL: discourses of teachers in training (pp. 42–57). Basingstoke: Palgrave Macmillan.
Sloetjes, H. & Wittenburg, P. (2008). Annotation by category – ELAN and ISO DCR. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
Watson Todd, R. (1997). Classroom Teaching Strategies. London: Prentice Hall.
Watson Todd, R., Chaiyasuk, I., & Tantisawatrat, N. (2008). A functional analysis of teachers’ instructions. RELC Journal, 39, 25–50.

Linguistic Analysis of Emotions in Online News Comments - an Example of the Eurovision Song Contest

Ana Zwitter Vitez,+* Darja Fišer*†
+ Faculty of Humanities, University of Primorska, Titov trg 5, 6000 Koper, Slovenia
* Department of Translation, University of Ljubljana, Aškerčeva 2, 1000 Ljubljana, Slovenia
† Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
E-mail: ana.zwitter@guest.arnes.si, darja.fiser@ff.uni-lj.si

Abstract

The aim of the study is to identify linguistic differences of positive and negative comments on the example of online comments on the news about the Eurovision song contest. The results show that positive comments have a typical exclamation form and simple sentence structure, and include more informal vocabulary and orthographic variation. Negative comments, on the other hand, are more likely to be formulated as statements with a complex syntax structure and with a neutral vocabulary and standard orthography. The detected differences can be explained by the communicative function of the negative comments that act as reviews and therefore call for thorough argumentation and build an individual’s reflective identity.

Keywords: online news comments, linguistic analysis, sentiment analysis, Eurovision song contest

1. Introduction

Identifying emotions in language is a relevant field of research because of the strong connection between the physiological arousal of an emotion and its social display (Mygovych, 2013). If we understand how people feel, we can analyse or even predict how they will react in certain situations. This is why sentiment analysis can be used for predicting societal changes, election results, and customer satisfaction (Liu, 2015).

In online comments, analysis of emotions is particularly interesting because comments enable users to formulate their own opinion and to find their own identity independently of the official media content. Wright (2009)¹ even claims that “for many businesses, online opinion has turned into a kind of virtual currency that can make or break a product in the marketplace”.

¹ URL: http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html

Online news comments of the Eurovision song contest represent a specific dataset because they usually evoke polarised emotions (either users strongly support or hate Eurovision song contestants and/or their songs). The comments often even exceed the scope of the song contest itself and refer to wider political and societal issues (e.g. Azerbejdžan podeli Rusiji 12 točk in si s tem zagotovi dostavo plina za še eno leto #Eurovision. / Azerbaijan gives Russia 12 points, thus guaranteeing their gas supply for another year #Eurovision).

The aim of this paper is to provide a linguistic analysis of online comments in order to detect syntactic, lexical and orthographic differences between positive and negative comments.

2. Emotion Analysis in CMC

Different approaches have been developed to analyse emotions in computer-mediated communication. In data mining three basic categories are most often used: positive, negative, neutral (Smailović, 2013). These models are very useful on big datasets to study overall trends but are not always reliable for linguistic analyses of individual texts.

In discourse analysis, much more fine-grained sentiments are typically examined: happiness (Stefanowitch 2004), shame (Retzinger, 1991), and even irony (Haverkate, 1990). These approaches are very interesting for qualitative analyses but cannot be scaled for emotion identification on bigger datasets.

In this paper we focus on a qualitative analysis of a small dataset of news comments on the Eurovision song contest in Slovenian in order to examine linguistic characteristics of opinionated texts. Once comments were manually attributed a sentiment category, they were analysed on the syntactic, lexical and orthographic level.

3. Methodology

3.1 Sample Creation

The analysis was performed on a sample extracted from the Janes corpus v0.4 (Fišer et al., 2016). The sample contains 70 comments referring to an article announcing that the Slovenian representative was selected to compete in the finals of the Eurovision song contest² published on the national television and radio online news portal RTV Slovenija. Only opinionated comments were taken into account for the study. Neutral, factual and objective comments (e.g. Bjørn Einar Romøren) were discarded as were off-topic comments or direct replies to a previous comment that were part of an internal debate that had nothing to do with the article they appeared under (e.g. Kje je kolega XX? Upam, da ni zaspal! / Where is our comrade XX? I hope he hasn’t fallen asleep!).

² URL: http://www.rtvslo.si/zabava/glasba/evrovizija-2014/slovenija-s-tinkaro-v-evrovizijski-finale-vidimo-se-v-soboto/336356

3.2 Sentiment Annotation

First, comments were manually attributed a sentiment category (positive, negative) by two annotators. Disagreements were detected in 6 cases (8%), which were discussed in order to reach a systematic final decision. 4 of the cases involved comments which consisted of two parts, expressing a different sentiment each (e.g. Tinkari pa zaželim srečo,četudi je ta Evrovizija zadnjih 10 let z uvedbo polfinalov čisti cirkus in šov,ki ga prav tako dolgo ne jemljem več tako resno. / I wish Tinkara all the best, even though for the past 10 years the Eurovision and its semi-finals have been nothing but a circus and a show that I am not taking seriously anymore.). In such cases, the annotators agreed to determine the prevalent sentiment in the comment. In 2 of the cases, it was not clear out of context whether the comments were meant literally, as a joke or cynically (e.g. Pričakujem 12 točk iz Makedonije. / I am expecting 12 points from Macedonia.). In such cases, the entire discussion thread was examined for a wider context and annotated accordingly.

3.3 Linguistic Analysis

Each sentence in the sample was analysed for sentence type (statement, exclamation, question, order), sentence structure (simple, complex), vocabulary characteristics and orthography (formal, informal). Examples of the analysis are presented in Tables 1 and 2.

SLO: Držim pesti!
ENG: Fingers crossed!
Sentiment: positive
Sentence form: exclamation
Sentence structure: simple
Vocabulary: /
Orthography: informal

Table 1: Linguistic analysis of a positive comment.

SLO: Kolikor slišim, se je včeraj slabo odrezala.
ENG: As far as I heard, she did not fare well last night.
Sentiment: negative
Sentence form: statement
Sentence structure: complex
Vocabulary: slabo
Orthography: standard

Table 2: Linguistic analysis of a negative comment.

4. Results and discussion

Comments with the same sentiment label were compared in order to detect the shared linguistic properties on the syntactic, lexical and orthographic level. The sample contains slightly more negative (53%) than positive (47%) comments.

4.1 Syntax

As can be seen from Figure 1, a large majority (86%) of the negative comments are statements (e.g. Ne, ne bo. / No, it won’t.) with only a few examples of questions (8%) and exclamations (5%). Among the positive comments, on the other hand, there is a similar share of statements (48%) and exclamations (45%) (e.g. Srečno! / Good luck!). While positive comments contain no questions, there are a couple of commands (6%) (e.g. Uživajmo in ne nergajmo kot stare babe. / Let’s enjoy the show and not whine like old ladies.).

Figure 1: The distribution of different types of sentences in positive and negative comments.

The majority of the sentences in the negative comments are complex (62%) while nearly half of the sentences in the positive comments (49%) are simple. For illustration, Figure 2 contains examples of a complex negative sentence and a simple positive one.

Negative, complex:
SLO: Če zaupaš našim medijem, so še skoraj vsako leto bile kritike glede naših pesmi pozitivne, ampak rezultata pa nobenega in isto bo letos.
ENG: According to our media, despite positive reviews our songs were unsuccessful almost every year and this year won’t be any different.
Positive, simple:
SLO: Imam dober občutek.
ENG: I have a good feeling about this.

Figure 2: Examples of simple and complex sentences in positive and negative comments.

4.2 Vocabulary

The vocabulary level was manually annotated following the criterion of whether a comment is characterised by a specific lexical unit carrying an opinion or not.

Figure 3: The distribution of neutral and opinionated vocabulary in positive and negative comments.

As Figure 3 shows, vocabulary in the negative comments is heavily opinionated (70%), e.g. kuhna (inside deal), davkoplačevalci (taxpayers), lajna (broken record). About half of the positive comments are characterised by opinionated vocabulary (51%), but this vocabulary is not topic-specific and usually expresses general support, e.g. Srečno! (Good luck!), upam (I hope), podpiramo (we support).

4.3 Orthography

At the orthographic level the following phenomena that are typical of CMC language were observed: informal orthography (e.g. dejmo instead of dajmo), use of all-caps (e.g. BO), non-standard use of punctuation (e.g. Dajmo klobasica!!!:) / Do it), and emoticons (;)). As can be seen in Figure 4, while distinctly standard orthography (78%) is used in the negative comments, nearly half of the positive comments (42%) contain non-standard orthographic features.

Figure 4: The distribution of non-standard and standard orthography in positive and negative comments.

For illustration, Figure 5 contains an example of a positive comment with standard orthography and an example of a positive comment with non-standard spelling.

Positive, non-standard:
SLO: dejmo naši!!!
ENG: c’mon, team!!!
Positive, standard:

usually shows through the use of typical vocabulary and orthography.

Our future work plan is to extend the analysis on a wider range of highly opinionated topics (sports, politics, religion, product and service reviews) and text types (blogs and blog comments, tweets, forum posts, Wikipedia talk pages). In addition, the set of sentiments will be more fine-grained in order to distinguish between different types of negative or positive sentiment such as support and cynicism that deserve special treatment.

6. Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Resources, Tools and Methods for the Research of Non-standard Internet Slovene” (J6-6842, 2014–2017).

7. References

Fišer, D., Erjavec, T., Ljubešić, N. (2016). JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2.0, 4(2), 67–100.
Haverkate, H. (1990). A speech act analysis of irony. Journal of Pragmatics, 14(1), 77–109.
Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
Mygovitch, I. (2013). Secondary nomination in the modern English language: affective lexical units. Visnik LNU imeni Tarasa Ševčenka, 1(1), 206–214.
Ritchie, G. (2004). The Linguistic Analysis of Jokes. Journal of Literary Semantics, 33(2), 196–197.
Smailović, J., Grčar, M., Lavrač, N., Žnidaršič, M. (2014).
SLO: Danes imajo naši turisti še zadnji dan turistovanja Stream-based active learning for sentiment analysis in na davkoplačevalske stroške. the financial domain. Information Sciences 285: 181– ENG: Today is the last vacation day attaxpayers’ expense 203. for our tourists. Stefanowitsch, A. (2004). Happiness in English and Figure 4. German: A metaphorical-pattern analysis. In M. Achard and S. Kemmer (eds.) Language, Culture, and Mind, Figure 5: Examples of standard and non-standard 137–149. CSLI Publications. orthography in positive and negative comments. Retzinger, M. S. (1998). Violent Emotions: Shame and Rage in Marital Quarrels. Newbury Park, CA: Sage 5. Conclusions Publications. The aim of the study was to identify linguistic characteristics of positive and negative comments on the example of comments on articles about the Eurovision song contest. The results show that positive comments are fewer, typically have an exclamation form and simple sentence structure, and contain more informal vocabulary and orthography. Negative comments are more numerous, are typically represented as statements with a more complex syntax structure and with distinctly general vocabulary as well as standard orthography. While it is true that the analysed sample is small and limited to a single topic, the results are very homogenous and consistent throughout the analysis. 
A plausible explanation for such a discrepancy is the critical function of the negative comments which calls for thorough argumentation, not affect, and from the position of a reflective individual who acts in his own capacity, not as a member of regional or social groups adherence to which Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 76 Ljubljana, Slovenia, 27–28 September 2016 Alternative Endings of Slovene Verbs in Third Person Plural: A Corpus Approach Gašper Pesek, Iza Škrjanec, Dafne Marko Ljubljana E-mail: gasper.pesek@gmail.com, skrjanec.iza@gmail.com, dafne.marko@gmail.com Abstract This paper is concerned with the alternative endings of some Slovene verbs in third person plural. Certain Slovene verbs in this form can take the endings - jo or - do ( jejo or jedo), whereby the two possible choices are normatively evaluated in various ways in Slovene language manuals. The paper introduces a corpus-driven approach to the problem by first extracting potential verbs with this phenomenon, after which the use of both endings is compared in the subcorpora of Janes (a corpus of Slovene user-generated content) and in the Kres corpus. Keywords: alternative endings, Slovene verbs, user-generated content, standard language povedo, where the derivative with the prefix pre- is written 1. Introduction with the ending - jo, whereas the original verb is written Slovene is a morphologically rich language with some with the ending - do. The choice between two alternative alternative forms in the inflection paradigm. Related forms seems to be puzzling for the users of Slovene, as can studies using a corpus approach (Može, 2013; Arhar Holdt, be seen from several questions posted in specialized1 as 2013) show that Slovene language manuals either well as general2 online forums, such as Med.Over.Net. 
In insufficiently describe the phenomena in question, or the replies, two main factors for the use of appropriate prescriptively evaluate one variant as more prestigious than endings are emphasized: the medium (spoken or written) the other. This paper is concerned with Slovene verbs that and register (literary or conversational). The ending - do is can take the endings - do or - jo in third person plural. We considered more common in written language and formal observe their behavior in the Janes corpus of user-generated communication. Slovene (Fišer et al., 2016) and in the Kres corpus (Logar In this regard, an analysis of the phenomenon in computer- Berginc et al., 2012), which mostly contains standard mediated communication or user-generated content would written Slovene. be interesting, as the language on the Internet is heavily The rest of the paper is structured as follows: in Section 2, influenced by spoken language (Crystal, 2006); however, a the alternative verb endings - jo and - do are presented spectrum of genres are considered as CMC, differing in together with their evaluation in different Slovene language synchronicity, message size, privacy settings, manuals. In Section 3, we briefly present the two corpora communication norms, etc. (for a list of factors, see Herring, and the data extraction process. The verb forms and their 2007). frequencies in the Janes subcorpora and in the Kres corpus The aim of this paper is to use a corpus-based approach to are analyzed in Section 4, while Section 5 provides a genre determine the general tendencies of using the endings - do comparison. Section 6 concludes the paper and suggests and - jo. We expect to find a preference for the ending - jo in further work. the Janes corpus and a preference for the ending - do in the Kres corpus. We intend to observe the usage patterns with 2. 
Problem Description and the Aim of the regard to different genres in Janes, and to contrast the Paper tendencies in the Janes corpus with those in the Kres corpus, Certain Slovene verbs can take two different endings in which is a corpus of standard written language. third person plural: - jo or - do. In Slovene grammar (Toporišič, 2004) 3. Methodology , this characteristic is observed in five athematic verbs ( jejo – jedo; grejo – gredo; bojo – bodo; vejo 3.1 Corpus Description – vedo; dajo – dado; ‘they eat/go/will be/know/give’, respectively). According to paragraph 891 in the Slovene For the analysis, two corpora were queried. The Janes v0.4 normative language manual Slovenski pravopis 2001, the corpus is a corpus of user-generated Slovene. It contains ending - jo is frequently used instead of - do, which is over 175 million words or 9 million documents, published especially true for “literary conversational language” between 2002 and 2016 in five different genres: tweets, ( knjižni pogovorni jezik) – considered less appropriate for forum posts, blog entries and their comments, online news written texts – and derivatives (by means of prefixation) of and their comments, and user and page talk from the the previously mentioned athematic verbs, e.g. prepovejo – Slovene Wikipedia. 1 E.g.: http://www2.arnes.si/~lmarus/suss/arhiv/suss-arhiv- http://med.over.net/forum5/viewtopic.php?t=10424263 000103.html (accessed: 10 August, 2016) (accessed: 10 August, 2016) 2 E.g.: http://med.over.net/forum5/viewtopic.php?t=2547265, Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 77 Ljubljana, Slovenia, 27–28 September 2016 The Kres corpus, which contains nearly 100 million words, The forum subcorpus (Figure 2) reveals a considerable is a collection of standard written Slovene with a balanced preference (roughly 70%) for the - do ending with the verb genre structure: it consists of periodicals (newspapers and zvedeti. 
A relatively equal distribution of both endings is magazines), fiction and non-fiction, documents from the observed with izvedeti, zavedeti, pojesti, and povedati. All Web, and other genres, published between 1990 and 2011. other verbs show a preference (60% or more) for - jo, especially in the case of izpovedati and dopovedati, which 3.2 Data Extraction have only produced concordances with - jo. The Sketch Engine concordance (see Kilgarriff, 2014) was In the blog subcorpus (Figure 3), there is an overall used for corpus scanning. Employing a CQL expression, we preference for - do. A relatively even distribution of both used the Kres corpus to first extract verbs that end in - do, endings can be seen for poizvedeti, zavedeti, najesti, after which we noted the frequency of the forms with - do izpovedati, odpovedati, and prepovedati. Napovedati shows and - jo 3 in the five Janes subcorpora4, as well as the entire a notable preference (over 60%) for - jo. Janes and Kres corpora. Thus, we collected 17 verbs5: biti The news comment subcorpus (Figure 4) reveals a notable (‘be’), iti (‘go’), dati (‘give’), vedeti (‘know’), izvedeti preference (60% or more) for - do in the cases of izvedeti (‘find out’), zvedeti (‘find out’), poizvedeti (‘inquire’), and povedati, while zvedeti and poizvedeti have only zavedeti (‘realize’), jesti (‘eat’), pojesti 6 (‘eat up’), najesti provided concordances with - do. Zavedeti, pojesti, and (‘sate’), povedati (‘tell’), izpovedati (‘confess’), dopovedati izpovedati have shown a relatively equal distribution of the (‘get across’), napovedati (‘predict’), odpovedati (‘cancel’), two endings, whereas the others ( najesti, napovedati, and prepovedati (‘forbid’). odpovedati, and prepovedati) have shown a notable preference for - jo. Dopovedati did not produce any 4. Analysis concordances. 
In Wiki talk (Figure 5), the verbs that have produced 4.1 Distributions of Variants in the Janes concordances show a strong preference for the - do ending, Subcorpora with the exception of povedati and izvedeti, which have Because Slovene’s 5 athematic verbs are a linguistically displayed a relatively even distribution of the two ending unique group, they call for a separate initial analysis. With variants. regard to the future tense form of the verb biti, all subcorpora indicate a virtually exclusive preference (nearly 4.2 The Janes Subcorpora in Relation to Kres 100% of concordances) for the ending - do. The opposite is This subsection examines each of the Janes subcorpora in true of dati, where - jo has almost completely replaced its relation to Kres. The 5 athematic verbs will once again be older alternative7. In the case of vedeti, - do is preferred in described separately. all subcorpora – least prominently in the forum subcorpus, In all of the Janes subcorpora, the ratios of - jo to - do for the and most prominently in the blog (nearly 90% of verb biti are virtually the same as in Kres. For iti, on the concordances) and Wiki talk subcorpora (over 90%)8 . Iti other hand, all Janes subcorpora display a much stronger displays a relatively even distribution between the two preference for - jo, except for the blog subcorpus, as its ratio alternatives, with the exception of the blog subcorpus, of - jo to - do is practically the same as in Kres. Dati almost where a preference for - do is observed (roughly 80%). Jesti exclusively employs the - jo ending in both Kres as well as has revealed a general preference for the - do ending, with a all of the Janes subcorpora. 
For vedeti, the tweet, forum, and relatively equal distribution in the tweet subcorpus, a slight news comment subcorpora prefer - jo compared to Kres, preference for - do in the forum subcorpus, and a strong whereas the blog and Wiki talk subcorpora show ratios preference for - do in the comment (roughly 70%), blog similar to the ones in Kres (which shows a slightly stronger (roughly 80%), and Wiki talk (roughly 85%) subcorpora. preference for - do). Finally, jesti displays a notable The following paragraphs summarize the specifics of each preference for - jo in the tweet, forum, and news comments subcorpus. subcorpora. There is an overlap with the Kres ratio in the In the tweet subcorpus (Figure 1 9 ), the - do ending is Wiki talk subcorpus, and a slight preference for - jo with the preferred (60% or more) for najesti and povedati; - jo is ratio in the blog subcorpus. preferred for izpovedati, napovedati, odpovedati, and The distributions of alternate endings for the verbs prepovedati. All other verbs display a relatively equal poizvedeti, zavedeti, and dopovedati in the tweet subcorpus distribution of both endings. are very similar to those in Kres. With najesti, however, 3 CQL expression for verb extraction: 7 Since the word form dado is used mostly as a proper noun (but [word=".+do" & tag="G.*"] has been erroneously annotated as a verb form), all CQL expression for form frequency, e.g., bojo: concordances were analyzed manually. [word="(b|B)ojo" & tag="G.*"] 8 There is an overlap between the word forms vedo (third person 4 In the news subcorpus, only the comment section was taken into consideration, excluding the news because they are not user- plural) and vedo (a participle in masculine singular form). The generated content. latter is a regional, non-standard variation of the participle vedel, 5 The verb zajesti was also extracted from the Kres corpus, but used mainly in northeastern Slovenia. 
Since all derivatives of the was not included in the analysis due to low or zero frequencies verbs vedeti, jesti and povedati follow the same pattern, there in the Janes subcorpora. might be a slight deviation from the actual frequency of verbs 6 Because the third person plural forms of the verbs pojesti (‘eat ending in - do for third person plural. up’) in peti (‘sing’) overlap ( pojejo – ‘they eat/sing’), all 9 For charts, see Appendix 1. concordances were analyzed manually. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 78 Ljubljana, Slovenia, 27–28 September 2016 there is a higher preference for - do in the tweets. All other be the only ones with the same pattern – biti is almost verbs show a notable to strong preference for - jo in exclusively realized as bodo, and dati as dajo. Regarding comparison with Kres. the genres in the Janes corpus (tweets, forum posts, blog In the forum subcorpus, all remaining verbs show a higher entries preference for , news comments and on Wikipedia discussions and - jo, albeit in varying degrees. The verbs in the blog subcorpus display a slightly higher user pages), a conclusion can be made that there are evident preference for similarities between the blog subcorpus and the Kres - jo with the verb zavedeti, and a stronger preference for - jo with zvedeti, pojesti, najesti, and corpus, while the tweet, forum, and news comment napovedati. The ratios for poizvedeti, povedati, and subcorpora display fairly comparable tendencies as well. odpovedati are similar to the ones in Kres, while the Thus, it is important to emphasize that different genres in remaining verbs ( izvedeti, izpovedati, dopovedati, and CMC may not always have the same linguistic prepovedati), seem to prefer -do more than Kres. characteristics and should therefore not be understood as a In the news comment subcorpus, pojesti, najesti, povedati, homogenous language variety. 
For a more precise napovedati, and odpovedati show a higher preference for - description of the phenomenon, other derivatives (e.g., jo. The ending ratios for izvedeti, zavedeti, izpovedati, and spovedati, zapovedati), which were not extracted in the prepovedati are almost the same as the ones in Kres. automatic process Interestingly, zvedeti and poizvedeti show a higher , should also be included into discussion. preference for The use of the endings -jo and -do should also be analyzed - do. In the Wiki talk subcorpus from a normative perspective , there is a slightly higher , taking into account the preference for - jo with izvedeti and a notably higher language manuals and dictionaries with their descriptions preference for - jo with povedati. Pojesti, however, shows a of the analyzed verbs. slightly higher preference for - do, while prepovedati shows a notably higher preference for - do. 7. Acknowledgements The work described in this paper was funded by the 5. Genre Comparison Slovenian Research Agency within the national basic To be able to draw more general conclusions, this section research project “Resources, Tools and Methods for the considers only the verbs with a minimum frequency of 10 Research of Nonstandard Internet Slovene” (J6-6842, in all corpora, excluding the Wiki talk subcorpus due to its 2014–2017). We would also like to thank the reviewers for small size10. The following verbs meet this criterion: biti, iti, their useful comments and suggestions. dati, vedeti, izvedeti, zvedeti, jesti, pojesti, povedati, odpovedati, and prepovedati. 8. References Having compared all of the genres with one another, we Arhar Holdt, Š. (2013). Študentje, škratje in nadškofje: have identified three recurring patterns. In all (sub)corpora, končnica - je v imenovalniku množine pri samostalnikih the verb biti is predominantly used with the - do ending, prve moške sklanjatve. Slovenščina 2.0, 1(1), pp. 
134– while the verb dati is practically never used with the - do 154. ending anymore. Taking into account all the remaining Crystal, D. (2006). Language and the Internet. Cambridge: verbs, the blog subcorpus and the Kres corpus generally Cambridge University Press. display comparable tendencies with eight of them ( iti, Fišer, D., Erjavec, T., Ljubešić, N. (2016). Janes v0.4: vedeti, izvedeti, jesti, pojesti, povedati, odpovedati, and korpus slovenskih spletnih uporabniških vsebin. prepovedati), whereby the verbs in question show a similar Slovenščina 2.0 (to appear). preference for - do in some cases, and equal shares of both Herring, S.C. (2007). A faceted classification scheme for endings in others. computer-mediated discourse, Language@Internet: The second evident pattern is the similarity of the forum, http://www.languageatinternet.org/articles/2007/761 tweet, and news comment subcorpora, as they contain a (accessed: 13 August 2016). similar distribution of the endings in the verbs iti, jesti, Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., pojesti, povedati, odpovedati, and prepovedati. In the case Michelfeit, J., Rychlý, P., Suchomel, V. (2014). The of two verbs ( vedeti and izvedeti), the tweet and forum Sketch Engine: ten years on. Lexicography, 1(1), pp. 7– subcorpora show a similar tendency of an equal distribution 36. for both endings. Logar Berginc, N., Grčar, M., Brakus, M., Erjavec, T., Arhar Holdt Š. and Krek, S. (2012). Korpusi slovenskega jezika 6. Conclusion Gigafida, KRES, ccGigafida in ccKRES: gradnja, The paper describes the use of endings -jo and -do in certain vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno Slovene verbs in third person plural. The corpus analysis slovenistiko; FDV. shows a strong preference for -do both in the Kres corpus Može, S. (2013). Raba kratkega nedoločnika: korpusni (12 out of 17 verbs) and in the Janes corpus (10 out of 17 pristop. Slovenščina 2.0, 1 (1), pp. 155–175. verbs). 
However, different verbs display completely SP 2001 – Slovenski pravopis. Ljubljana: SAZU – ZRC different tendencies in the analyzed subcorpora, meaning SAZU – Založba ZRC. that a general conclusion concerning the use of endings -jo Toporišič, J. (2004). Slovenska slovnica. Maribor: Obzorja. and -do cannot be made. The verbs biti and dati proved to 10 For absolute and relative frequencies of verb forms, see https://www.dropbox.com/s/ducs6y7ei75vn9p/Abs_rel.xlsx?dl=0. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 79 Ljubljana, Slovenia, 27–28 September 2016 Appendix 111 100% 100 80% 80 60% 60 40% 40 20% 20 0% 0 -jo -do Kres Figure 1: Percentage of - jo/- do in the tweet subcorpus (green/blue) and the Kres corpus (yellow). 100% 100 80% 80 60% 60 40% 40 20% 20 0% 0 -jo -do Kres Figure 2: Percentage of - jo/- do in the forum subcorpus (green/blue) and the Kres corpus (yellow). 100% 100 80% 80 60% 60 40% 40 20% 20 0% 0 -jo -do Kres Figure 3: Percentage of - jo/- do in the blog subcorpus (green/blue) and the Kres corpus (yellow). 11 The asterisk (*) indicates that one or both verb forms appeared verb were not found in the Janes corpus. The yellow dot defines only once. Empty columns indicate that the forms for a particular the limit between percentage for - jo (below it) and - do (above it). Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 80 Ljubljana, Slovenia, 27–28 September 2016 100% 100 80% 80 60% 60 40% 40 20% 20 0% 0 -jo -do Kres Figure 4: Percentage of - jo/- do in the news comment subcorpus (green/blue) and the Kres corpus (yellow). 100% 100 80% 80 60% 60 40% 40 20% 20 0% 0 -jo -do Kres Figure 5: Percentage of - jo/- do in the Wiki talk subcorpus (green/blue) and the Kres corpus (yellow). 
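The form-frequency queries quoted in the footnotes to Section 3.2 of the paper above follow a regular pattern, so they can be generated instead of typed by hand. The following is a minimal sketch in Python, not the authors' actual tooling: it assumes that the tag prefix G marks verbs, as in the quoted CQL expressions, and the helper names cql_for_form and share_of_jo are hypothetical. The counts in the usage lines are made-up illustrations, not figures from the paper.

```python
from typing import Optional


def cql_for_form(form: str, tag_pattern: str = "G.*") -> str:
    """Build a CQL expression for one word form.

    The first letter is matched in both lower and upper case,
    mirroring the pattern quoted in the paper for the form "bojo".
    """
    first, rest = form[0], form[1:]
    return f'[word="({first.lower()}|{first.upper()}){rest}" & tag="{tag_pattern}"]'


def share_of_jo(jo_count: int, do_count: int) -> Optional[float]:
    """Return the percentage of -jo forms among all -jo and -do hits.

    Returns None when neither form was found, which corresponds to
    an empty column in the appendix charts.
    """
    total = jo_count + do_count
    if total == 0:
        return None
    return 100.0 * jo_count / total


# Usage with illustrative (not real) counts.
print(cql_for_form("bojo"))
print(cql_for_form("bodo"))
print(share_of_jo(30, 70))
```

Keeping the query builder separate from the percentage calculation means the same share function can be reused for every subcorpus once the concordance counts have been read off, including the manually checked ones.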
Geolocating German on Twitter
Hitches and Glitches of Building and Exploring a Twitter Corpus

Bettina Larl, Eva Zangerle
University of Innsbruck
E-mail: bettina.larl@uibk.ac.at, eva.zangerle@uibk.ac.at

Abstract

Languages, and thus linguistics, have always been influenced by technological developments and new media forms, and every development has brought new methods and approaches to how language can or should be studied and explored. About 16% of EU residents speak German as a native language, which makes it the most widely spoken language within the European Union. German is a pluricentric language with three standard varieties: German Standard German, Swiss Standard German and Austrian Standard German. The official borders between Germany, Austria and Switzerland also form the boundary between the three standards. Because of easy access and informal communication methods, more and more oral markers find their way into written language. This is often showcased on social media platforms such as Twitter. Every tweet includes language output in the form of short messages that can contain different regional characteristics. Tweets can be geolocated, which means these language outputs can be assigned to the geographic location they were tweeted from. To explore research questions like “Is there a connection between the language output and the geographic location tweets were sent from?” and “Could, for example, lexical varieties be allocated to a specific region by geolocation information provided in tweets?”, we are building a Twitter corpus. The corpus contains tweets collected via the Twitter streaming API, using a bounding box around a rough approximation of the Deutscher Sprachraum and re-filtering the results for tweets sent within Germany, Austria, Switzerland and South Tyrol/Italy.
This paper shows preliminary findings of hand sampling a random sample of 1,000,000 tweets.

Keywords: Twitter, geolocation, German


The #Intermittent Corpus: Corpus Features, Ethics and Workflow for a CMC Corpus of Tweets in TEI

Julien Longhi
Cergy-Pontoise University, AGORA
E-mail: julien.longhi@u-cergy.fr

Abstract

This poster aims to describe issues encountered whilst structuring a corpus of tweets compiled from the key word intermittent (arts worker) in order to analyse a discursive topic related to the controversy surrounding the status of French arts workers. This corpus is part of the CoMeRe project (CoMeRe, 2014): it aims to build a kernel corpus of computer-mediated communication (CMC) genres with interactions in the French language. Three key words characterize the project: variety, standards and openness. A variety of interactions was sought: public or private interactions as well as interactions from informal, learning and professional situations. The CoMeRe project structured the corpora in a uniform way using the Text Encoding Initiative format (TEI, Burnard & Bauman, 2013) and described each corpus using Dublin Core and OLAC standards for metadata (DCMI, 2014; OLAC, 2008). The TEI model was extended in order to encompass the Interaction Space (IS) of CMC multimodal discourse (Chanier et al., 2014). The term ‘openness’ also characterizes the project: the corpora have been released as open data on the French national platform of linguistic resources (ORTOLANG, 2013) in order to pave the way for scientific examination by partners not involved in the project as well as replicative and cumulative research. This poster presentation aims to give an overview of the corpus building process using, as a case study, a corpus of tweets cmr-intermittent (Longhi et al., 2016).
The following steps led to the choice of tweets:
1) In 2015, we set a threshold of at least 10 tweets containing #intermittent(s) and identified 215 accounts, each of which had produced at least 10 tweets explicitly referenced as contributing to this theme (in order to have representative accounts).
2) By gathering all of the tweets sent by those 215 accounts, we collected 586,239 tweets.
3) Of these 586,239 tweets, 10,876 contained the hashtag #intermittent(s); the #intermittent corpus corresponds to these 10,876 tweets.
The poster will focus, firstly, on how features that are specific to Twitter were included and structured in the interaction space TEI model. We will exemplify how certain features are accounted for in TEI. These include hashtags, which label tweets so that other users can see tweets on the same topic, and at signs, which allow users to mention or reply to other users. Secondly, the poster will evoke some of the ethical and rights issues that had to be considered before publishing this corpus of tweets. Finally, the workflow and multi-stage quality control procedure adopted during the corpus building process will be illustrated.

Keywords: tweets, corpus, TEI, CMC corpora


The Construction of a Teletandem Multimodal Data Bank

Queila Barbosa Lopes
São Paulo State University “Júlio de Mesquita Filho” - UNESP
E-mail: queilalopes@gmail.com

Abstract

The discussion presented here represents the initial reflection of my doctoral thesis. The main purpose of my research is to propose an organization of a multimodal data bank in semi-integrated and integrated Teletandem (Aranha & Cavalari, 2014) modalities.
“Teletandem is a virtual, autonomous, and collaborative context that uses online teleconferencing tools (text, voice, and webcam images of VoIP technology, such as Skype) to promote intercontinental and intercultural interactions between students who are learning a foreign language” (Telles, 2015: 2). During these interactions, the interactants produce several genres in order to communicate. All production is saved on the computers and then stored on a hard drive. As a result, we have a considerable amount of research data. The data bank organization will be based on the Bazermanian conception of genre systems, according to which genres occur in an activity system (Bazerman, 1994; 2005). According to this conception, every social activity is done through genre sets, which are interrelated within a genre system, occurring in an activity system. The argument is grounded on the socio-rhetorical genre approach, which understands genres as typified and socially situated action. Based on this assumption, I believe it will be possible to propose an organization of the data bank which will optimize researchers’ time. It will also help to understand how Teletandem learning of a foreign language works. The question that guides my research is “Considering the genre characteristics of Teletandem practice, how is it possible to organize a multimodal data bank in integrated and semi-integrated Teletandem?” I will try to use the methodology proposed by Chanier and Wigham (2016) “to transform […] data from online learning situations”. It will also be relevant to consider the concept of learning scenario (Foucher, 2010) as the space where one genre occurs instead of another. Data were collected from 2012 to 2015, when around 655 hours of video interaction were recorded, and 477 chats, 849 reflective diaries, 180 questionnaires (initial and final) and 1444 texts were produced. Texts were also revised and rewritten by the participants.
The objective of this presentation is to share a) the status of this work with the international community of researchers on Computer Mediated Communication, so that the work can be improved by the comments of its members, and b) especially the questions I have faced during this stage of the research.

Keywords: data bank, Teletandem, genres system, activity system


Graphic Euphemisms in Slovenian CMC

Mija Michelizza, Urška Vranjek Ošlak
Fran Ramovš Institute of the Slovenian Language ZRC SAZU
Novi trg 4, SI-1000 Ljubljana
E-mail: mmija@zrc-sazu.si, uvranjek@zrc-sazu.si

Abstract

Taboo words have been a part of human communication for as long as language has existed. Scientists are not in agreement on why taboo words emerged but it seems as if certain words have always been out-of-bounds for language users or have always caused certain negative feelings or reactions. Everyday communication (written and spoken) is filled with taboo words either expressed openly or disguised and concealed as more or less harmless. The evolution of the Internet and its many communication possibilities have led to a new (and somewhat less hidden) growth of taboo words, especially swear words and words designed to insult the recipient of the communication. The poster presents the analysis of graphic euphemisation of chosen swear words in Slovenian CMC (on Twitter and in online news comments) and identifies different ways in which CMC users disguise taboo words, mostly in order to avoid automatic detection and deletion of their tweets or comments. The search was performed with search queries kur* and piz*. The most common graphic euphemism types are the substitution of a letter with a non-letter symbol and the insertion of a non-letter symbol (e.g. kur**, piz.ijo). Repeated letters are also common (e.g. pizzzzzda).
Substitutions of letters with visually similar symbols (e.g. kur@. .) or with other letters or letter combinations with a similar pronunciation (e.g. kurz, kurchiti, pyzda, pisda) are less frequent. The analysis also shows that CMC users are very innovative: juxtaposition, puns and various word-formation procedures (e.g. pizdapaponedeljek, pizdarna (< pisarna), kurbenizon (< kombinezon)) are very common, even though their primary role is language play rather than the disguising of taboo words.

Keywords: graphic euphemisms, CMC, taboo words, Slovenian language

Author Index

Špela Arhar Holdt, Institute for Applied Slovene Studies Trojina and Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia ..... 3
Queila Barbosa Lopes, São Paulo State University “Júlio de Mesquita Filho” - UNESP, São Paulo, Brazil ..... 84
Michael Beißwenger, Department of German Studies, University of Duisburg-Essen, Essen, Germany ..... 7
Steven Coats, English Philology, Faculty of Humanities, University of Oulu, Finland ..... 12
Jaka Čibej, Department of Translation, Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia ..... 17
Walter Daelemans, CLiPS, University of Antwerp, Belgium ..... 30
Eric Ehrhard, Department of German Linguistics, University of Mannheim, Schloss, Ehrenhof West, Mannheim, Germany ..... 7
Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia ..... 3, 22
Darja Fišer, Department of Translation, Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia and Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia ..... 3, 22, 39, 74
Holger Grumt Suárez, Department for German Linguistics and Literature, Applied and Computational Linguistics, Justus-Liebig-University Giessen, Germany ..... 26
Axel Herold, Berlin-Brandenburg Academy of Sciences and Humanities, Berlin, Germany ..... 7
Lisa Hilte, CLiPS, University of Antwerp, Belgium ..... 30
Lydia-Mai Ho-Dac, CLLE, University of Toulouse, Toulouse, France ..... 34
Natali Karlova-Bourbonus, Department for German Linguistics and Literature, Applied and Computational Linguistics, Justus-Liebig-University Giessen, Germany ..... 26
Dawn Knight, Centre for Language and Communication Research, Cardiff University, Cardiff, United Kingdom ..... 1
Petra Kralj Novak, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia ..... 2
Simon Krek, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia ..... 3, 54
Véronika Laippala, TIAS, University of Turku, Turun yliopisto, Finland ..... 34
Bettina Larl, University of Innsbruck, Innsbruck, Austria ..... 82
Nikola Ljubešić, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia and Department of Information and Communication Sciences, University of Zagreb, Zagreb, Croatia ..... 39
Henning Lobin, Department for German Linguistics and Literature, Applied and Computational Linguistics, Justus-Liebig-University Giessen, Germany ..... 7
Julien Longhi, Cergy-Pontoise University, AGORA, Cergy-Pontoise, France ..... 44, 83
Harald Lüngen, Institute for the German Language, Mannheim, Germany ..... 7
Dafne Marko, Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia ..... 48, 77
Mija Michelizza, Fran Ramovš Institute of the Slovenian Language ZRC SAZU, Ljubljana, Slovenia ..... 85
Dunja Mladenić, Jožef Stefan Institute, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia ..... 54
Hatice Müge Satar, Boğaziçi Üniversitesi, School of Foreign Languages, Bebek, Istanbul, Turkey ..... 70
Gašper Pesek, Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia ..... 77
Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia ..... 62
Céline Poudat, BCL, University of Nice Sophia Antipolis, Nice, France ..... 34
Luis Rei, Jožef Stefan Institute, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia ..... 54
Dalia Saigh, Cergy-Pontoise University, AGORA, Cergy-Pontoise, France ..... 44
Antoni Sobkowicz, National Information Processing Institute, Warsaw, Poland ..... 58
Angelika Storrer, Department of German Linguistics, University of Mannheim, Schloss, Ehrenhof West, Mannheim, Germany ..... 7
Iza Škrjanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia ..... 62, 77
Ludovic Tanguy, CLLE, University of Toulouse, Toulouse, France ..... 34
Reinhild Vandekerckhove, CLiPS, University of Antwerp, Belgium ..... 30
Lieke Verheijen, Radboud University, Nijmegen, Netherlands ..... 66
Urška Vranjek Ošlak, Fran Ramovš Institute of the Slovenian Language ZRC SAZU, Ljubljana, Slovenia ..... 85
Ciara R. Wigham, Clermont Université, LRL, Clermont-Ferrand, France ..... 70
Eva Zangerle, University of Innsbruck, Innsbruck, Austria ..... 82
Ana Zwitter Vitez, Faculty of Humanities, University of Primorska, Koper, Slovenia and Department of Translation, University of Ljubljana, Ljubljana, Slovenia ..... 74