I. Dorst: You, Thou and Thee: A Statistical Analysis of Shakespeare’s Use of…

1.01
UDC: 004.934:821.111SHAK(083.41)

Isolde van Dorst*

You, Thou and Thee: A Statistical Analysis of Shakespeare’s Use of Pronominal Address Terms

IZVLEČEK

YOU, THOU IN THEE: STATISTIČNA ANALIZA UPORABE IZRAZOV ZAIMKOVNEGA NASLAVLJANJA PRI SHAKESPEARU

Študija se ukvarja z oblikovanjem napovednega modela, namenjenega ugotavljanju, katere jezikovne in nejezikovne značilnosti vplivajo na izbiro zaimkov v Shakespearovih igrah. V angleščini, ki se je uporabljala v Shakespearovem obdobju, je razlikovanje med YOU in THOU, ki je danes arhaično, še obstajalo. Običajno se navaja, da sta ga določala relativni družbeni status ter osebna bližina govorca in naslovljenca. Vendar pa je treba še ugotoviti, ali bo statistično strojno učenje potrdilo to tradicionalno razlago. Proučuje se 23 značilnosti, izbranih z različnih jezikoslovnih področij, kot so pragmatika, sociolingvistika in analiza pogovora. Trije uporabljeni algoritmi – naivni Bayesov klasifikator, odločitveno drevo in metoda podpornih vektorjev – so izbrani kot ilustrativni nabor možnih modelov zaradi njihovih kontrastnih predpostavk in učne pristranskosti. Opravita se dve napovedi, prva o binarnem (you/thou) razlikovanju in druga o trinarnem (you/thou/thee) razlikovanju. Od vseh treh algoritmov daje najboljše rezultate metoda podpornih vektorjev. Po ugotovitvah so značilnosti, ki najbolje napovejo izbiro zaimka, besede iz neposrednega jezikovnega konteksta. Izkazalo se je, da na napoved zaimka vpliva tudi več drugih značilnosti, vključno z imenom govorca in naslovljenca, razliko v statusu ter pozitivnim ali negativnim mnenjem.
Ključne besede: izrazi zaimkovnega naslavljanja, Shakespeare, korpusno jezikoslovje, digitalna humanistika, statistično modeliranje

* Lancaster University, i.vandorst@lancaster.ac.uk

Prispevki za novejšo zgodovino LIX - 1/2019

ABSTRACT

This study creates a prediction model to identify which linguistic and extra-linguistic features influence pronoun choices in the plays of Shakespeare. In the English of Shakespeare’s time, the now-archaic distinction between YOU and THOU persisted, and is usually reported as being determined by relative social status and personal closeness of speaker and addressee. However, it remains to be determined whether statistical machine learning will support this traditional explanation. 23 features are investigated, having been selected from multiple linguistic areas, such as pragmatics, sociolinguistics and conversation analysis. The three algorithms used, Naive Bayes, decision tree and support vector machine, are selected as illustrative of a range of possible models in light of their contrasting assumptions and learning biases. Two predictions are performed, firstly on a binary (you/thou) distinction and then on a trinary (you/thou/thee) distinction. Of the three algorithms, the support vector machine models score best. The features identified as the best predictors of pronoun choice are the words in the direct linguistic context. Several other features are also shown to influence the pronoun prediction, including the names of the speaker and addressee, the status differential, and positive and negative sentiment.

Keywords: pronominal address terms, Shakespeare, corpus linguistics, digital humanities, statistical modelling

Introduction

For several decades much research has been undertaken on the use of you, thou and thee in Shakespeare’s works. However, the results so far have yet to arrive at an exact and conclusive answer regarding how these pronouns were used.
This study combines the strengths of multiple research fields in an effort to determine via hitherto unused methods which linguistic and extra-linguistic features influence the choice of second person singular pronoun (you versus thou or thee) in the plays of William Shakespeare. Prior findings in literary and linguistic studies are utilised to find which features could be relevant in this choice, and tools and applications created for corpus linguistics and computer science are exploited to analyse the data in a more exact way than has so far been accomplished. Through these techniques, I hope to identify which features can contribute to a more accurate prediction of pronoun choice, in a model to mimic the pronoun use of Shakespeare.

It is worth observing at this point that it has not yet been determined whether it is even possible to predict the pronoun based on linguistic features. Part of the aim of this paper is to make a determination on this point. In other words, is it possible to create a computational model that can predict which pronoun will be used based on a set of linguistic and extra-linguistic features taken from the text itself and selected on the basis of knowledge that we have of English in the late 1500s and early 1600s? To accomplish this, all occurrences of you, thou and thee are extracted from Shakespeare’s plays, and every instance is manually coded for 23 linguistic and extra-linguistic features, creating data which will serve to ascertain the answer to this primary question. A second question to be addressed is whether some features perform better as predictors of the pronoun choice than others. Thirdly, the issue of whether the use of different algorithms affects the prediction outcomes will be considered.

Throughout this paper, italicised you, thou and thee refer to specific pronoun forms.
However, whereas you – in Early Modern English as in contemporary English – does not exhibit any formal variation for pronoun case, thou is strictly a nominative form with thee as its accusative/dative form. Thou and thee are therefore related inflectional forms of a single pronoun lemma; you exists in variation with both. Small capitals are used to indicate the pronoun lemmas, thus: YOU and THOU, where THOU includes both thou and thee. Whenever discussing pronouns in this paper, I am strictly referring to the singular second-person pronouns you, thou and thee that are examined in this study.

Background

Digital Humanities

Over the past few years, computational research has branched out into other research fields that are not necessarily closely connected to computer science. Digital Humanities (DH) is an umbrella term for all research that is computational but approaches the datasets investigated within, and/or addresses questions or problems that are of importance to, the disciplines of the humanities.

The popularity of Digital Humanities, a cross-domain field of study, is attributable to the fact that it does not diminish the differences between fields but rather operationalises this difference to solve difficulties that could not be dealt with within a single discipline. The role of computational methods in the humanities can be considered as that of a supporting character; in any DH computer modelling research, it should be kept in mind that the interpretation is as important as the suitability of a computational model and its outcomes.

Early Modern English and you/thou

In Early Modern English (EModE), two different second person singular pronouns were used, namely the formally singular thou and the formally plural (but pragmatically also respectful-singular) you, with only the latter surviving the EModE period (Taavitsainen and Jucker 2003).
The difference between the uses of these two pronouns is evident from multiple literary studies that have addressed Shakespeare’s work, work of his contemporaries, and other documents from this era, such as Walker (2003) and Busse (2002). These studies suggest that unwritten social rules governed the use of these pronouns, and that abiding by these rules was necessary in order to speak according to society’s standards. The use of the two different pronouns acted as a sign of relative status: you would be used towards superiors and thou towards inferiors. The choice of pronoun can thus also operate as a subtle means of showing respect or disrespect; using the pronouns in this way would have been natural and easy to English native speakers of the period.

Shakespeare lived during the Early Modern English period, and thus used both you and thou in his writing. His work was written less than 100 years before thou and thee disappeared from the standard language (surviving in dialects and archaicised registers, such as pious addresses to the divinity). Thus we may straightforwardly posit that the disappearance of thou was likely already in progress around his time. Though obviously heightened in its use of emotional and dramatic language and style to accommodate to the genre of the play script, the language of Shakespeare – including the usage of the two second-person pronouns – can be assumed to be a reasonably good representation of the language used generally in social interaction and conversation at that time (Calvo 1992).

Prior Studies on you/thou

Most studies of Shakespeare’s use of you and thou so far have been literary and non-numeric studies (Brown and Gilman 1960; Quirk 1974; Calvo 1992); the relatively few to have used data-based or quantitative techniques did not implement any method beyond directly comparing raw frequency counts (Busse 2003; Mazzon 2003; Stein 2003).
Moreover, these studies did not look at all the extant Shakespeare plays, but instead chose a few plays to focus on. Nonetheless, these studies have demonstrated some patterns in the use of you and thou and thus provide a workable foundation for a more in-depth study of the usage of those two pronouns.

These prior studies converge on the overall conclusion that the pronouns you and thou appear to be used to support the explicit expression of respect, social status, and familiarity. Quirk (1974) and Mazzon (2003) characterise the role of the pronoun as a linguistic marker, whose usage can be seen as either marked or unmarked. In other words, the use of a particular pronoun can be seen as marked when it is used unexpectedly, for example when you is expected based on social status, but thou is used instead. Thus, in contrast to earlier studies (Brown and Gilman 1960), they do not perceive you and thou to be in direct contrast; rather, they take each pronoun to have a more variable interpretation than had been assumed until then, depending on the context in which it occurs. Calvo (1992) and Stein (2003) expand on this by concluding that the markedness of the pronoun depends on the context and the situation, in addition to stable factors such as the social statuses of, and the level of familiarity between, the characters in Shakespeare’s plays (the speakers and addressees in this study), rather than on those stable factors alone (Brown and Gilman 1960). The emotive effect of the utterances within which the you/thou distinction is utilised is of importance as well; feelings such as anger and love for another character may find expression through pronoun choice. This is connected to the notion of respect, as, in an angry remark, marked pronouns can be used to disrespect the addressee based on their social status (Stein 2003).
As Stein (2003) and Busse (2006) already stressed in their studies, a study of you and thou in Shakespeare cannot and should not be limited to a single research discipline. Rather, what is needed is a combination of literature, sociolinguistics, pragmatics and conversation analysis, which are all useful in capturing the complexity of pronominal address and the social constraints that may have underpinned the choice of one honorific pronoun-form over the other.

Methodology

As has already been mentioned, this is a strictly empirical study which attempts to verify the findings of earlier research through a computational approach. The use of a computational, statistical method is motivated by the goal of creating a more objective representation of Shakespeare’s use of you and thou in his plays than has been accomplished so far, since it does not require analysis of meaning-in-context by a human being, but rather proceeds directly from quantitative measurements.

Hypotheses

Three hypotheses were formulated on the basis of the literature:

1. No single model will be able to predict the pronominal address term solely based on linguistic and extra-linguistic features.

This, being a null hypothesis, is exactly what this study aims to falsify by developing such a model. At first sight it seems unlikely that a single model will be able to predict Shakespeare’s original choice of you or thou based on linguistic and extra-linguistic features, because this choice is dependent on so many factors. However, I expect that a computational model combining insights from literature, sociolinguistics, pragmatics and conversation analysis will be able to predict the pronoun choice successfully, as it includes the factors that might influence the choice of either you or thou.

2. The features of social status, age and sentiment will be better predictors of the pronoun choice than other features.
A hierarchy will be established of how well the linguistic and extra-linguistic features predict the pronoun choice in the best-performing model. It may be inferred from the literature that social status, age and sentiment are highly likely to be at the top of this hierarchy, among the most influential features; these three features have shown up most reliably in prior research.

3. The best performing algorithm will combine features both dependently and independently.

The different learning biases and assumptions of the three algorithms applied in this study will reveal how the features interact with one another. The first algorithm, Naive Bayes, assumes all features are independent of one another, while the decision tree algorithm assumes that the features are all dependent on each other. Lastly, the support vector machine works with both dependent and independent features. I expect the set of features that will be included in the final model to be a combination of both dependent and independent features, and therefore the support vector machine algorithm to perform best. The three algorithms will be discussed in more detail later, in the section Classification Based on Three Algorithms.

Data

The data for this study comes from the Encyclopaedia of Shakespeare’s Language project¹, which is a research project at Lancaster University (UK). The project corpus consists of 38 of Shakespeare’s plays, which includes all 36 plays from the First Folio with the addition of The Two Noble Kinsmen and Pericles: Prince of Tyre. A broadly annotated version of the full Shakespeare corpus can be found online². Some of the annotation and all of the abbreviations used for the titles of the plays follow The Arden Shakespeare.

Linguistic and Extra-linguistic Features

The Encyclopaedia of Shakespeare’s Language corpus is richly annotated.
However, some additional annotation was necessary to perform a full analysis of what extra-linguistic features could be predictors of the pronominal address term. The full set of features used in this study can be found in Table 1. The added features are briefly described here.

As a referent (such as a second person singular pronoun) is dependent on context, the adjacent part of the utterance is used as a feature to test the effect of co-text. Six co-textual words are included, i.e. a 7-gram altogether. “LW” labels the words occurring on the left of the pronoun, and “RW” the words on the right of the pronoun. Each of these words is numbered based on its distance from the pronoun, e.g. LW3 is the third word on the left of the pronoun. In corpus linguistics, collocations are often examined within a three-word window, meaning there are three words on either side of the word of interest. While I am not necessarily looking at specific collocations of you and thou, the LW/RW features will look at similarities and differences in co-textual words to see if they can predict the pronoun choice.

¹ More information on this project, which is funded by the Arts and Humanities Research Council (AH/N002415/1), can be found on http://wp.lancs.ac.uk/shakespearelang/.
² CQPweb Main Page, http://cqpweb.lancs.ac.uk.

Another feature noted as critical in prior studies is sentiment, that is, the use of the pronoun to convey positivity or negativity. Sentiment was annotated with the use of the 7-gram described above. SentiStrength is a lexicon-based sentiment analysis program that assigns each phrase separate scores for positivity and negativity (Thelwall et al. 2010). Since SentiStrength was developed to work with online comments rather than complete sentences as in formal written English, it works well with n-grams too. The scores for positivity and negativity are kept as separate variables.
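The LW/RW co-text features described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the script used in the study: the function name `cotext_window`, the whitespace tokenisation, the lowercasing, and the use of `None` for positions that fall outside the utterance are all assumptions, since the paper does not say how utterance edges were handled. The example utterance is invented.

```python
from typing import Dict, List, Optional


def cotext_window(tokens: List[str], idx: int, size: int = 3) -> Dict[str, Optional[str]]:
    """Return the LW/RW co-text features around the pronoun at tokens[idx].

    LW1 is the word immediately to the left of the pronoun and RW1 the word
    immediately to the right. Positions beyond the edge of the utterance are
    set to None (an assumption made for this sketch).
    """
    features: Dict[str, Optional[str]] = {}
    for distance in range(1, size + 1):
        left = idx - distance
        right = idx + distance
        features[f"LW{distance}"] = tokens[left].lower() if left >= 0 else None
        features[f"RW{distance}"] = tokens[right].lower() if right < len(tokens) else None
    return features


# Invented example utterance, tokenised on whitespace (hypothetical input).
tokens = "What wilt thou do with me".split()
window = cotext_window(tokens, tokens.index("thou"))
print(window)
```

With `size=3` the pronoun plus its six neighbours make up the 7-gram mentioned in the text; the same seven tokens would be the input to the sentiment scoring step.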
The corpus already included metadata on the speakers; however, I wanted to include age as well. The age of a character is often not given except for when it is an important attribute of that character, making this difficult to annotate. Therefore, Quennell and Johnson’s (2002) character descriptions were used. The characters were sorted into a trinary classification, with ‘adult’ as the default category. Any deviations towards ‘younger’ or ‘older’ were based on textual references or the character’s name, such as for ‘Old Man’ in King Lear. Older characters were occasionally classified as such based on the fact they had adult children with prominent roles in the plays.

Table 1: List of all features used in this study

Feature                    Acronym         Annotation
Genre                      Genre           Pre-annotated
Play name                  Play            Pre-annotated
Play, act, scene           Scene           Pre-annotated
Speaker ID                 S_ID            Pre-annotated
Speaker gender             S_Gender        Pre-annotated
Speaker status             S_Status        Pre-annotated
Production date            Prod_Date       Pre-annotated
N-gram                     LW1-3, RW1-3    Automatic
Positive sentiment         Pos_Sent        Automatic
Negative sentiment         Neg_Sent        Automatic
Speaker age                S_Age           Manual
Location                   Location        Manual
Addressee ID               A_ID            Automatic
Addressee gender           A_Gender        Pre-annotated
Addressee status           A_Status        Pre-annotated
Addressee age              A_Age           Manual
Status differential        Stat_Diff       Automatic
No. of people addressed    A_Number        Pre-annotated

A more global feature is the location where the scene is set. This was difficult to annotate, due to the often unreliable stage directions. Instead of a nominal description for each scene location, I used a binary annotation of ‘public’ and ‘private’. The text itself was examined to determine the location based on what characters said about their location, but in addition Bate and Rasmussen’s (2007) annotation and Greenblatt, Cohen, Howard and Maus’ (1997) annotations were consulted.
The use of these three resources enabled the binary manual annotation of location for every scene.

Besides the information about the speaker and the scene, information regarding the addressee is essential when analysing character interaction from a conversation analysis perspective. As a manual annotation for addressee would be extremely time-consuming, I instead used an automatic method which identifies the previous speaker as the addressee of any given utterance. This is in line with the last-as-next bias used in conversation analysis (Mazeland 2003). This means that, even in larger group conversations, it is often expected that the last speaker before the current speaker will also be the next speaker, thus making it likely that the current speaker is addressing the last speaker. If the utterances were interrupted by the start of a new scene or other stage directions (e.g. someone walking into the scene), the annotated addressee would be the next speaker rather than the previous speaker for the first utterance after the interruption.

Using the data for the social status of the speaker and the addressee, I also created a status differential. As the status category labels are numeric and ordered, this can be done by taking the difference between the two. For example, a king (status = 0) and a servant (status = 6) are distant in status, and thus will have a high status differential (here: 6). Between a king and a prince (status = 1), the difference is much smaller (here: 1). This absolute feature was automatically generated from the already annotated features.

A feature that had to be excluded is familiarity between characters (social distance). This data was not already available, and it was beyond the scope of this study to annotate this for all relevant character pairs. The literature has shown this to be a relevant feature.
However, through the use of sentiment analysis, I have attempted to cover the complimentary and insulting aspects that could arise from high familiarity, and any lack thereof arising from low familiarity. Obviously, this does not cover all aspects of familiarity, but it means that this feature is not totally neglected.

Classification Based on Three Algorithms

Three different algorithms are used for the classification task, namely Naive Bayes, decision trees and support vector machines. Whereas it would be ideal to achieve a high precision and recall score, the main goal of this research is to see whether it is even possible to predict the second person singular pronoun choice through a computational application at all. If this is indeed the case, what features contribute to this prediction? It is thus more important to verify which features influence the choice and to what extent they do so.

The reason for using three algorithms, and in particular these three, is their differences in learning biases and assumptions. Naive Bayes assumes all features are independent of one another, whereas the decision tree attempts to create a dependent, hierarchical structure in the features. The support vector machine (SVM) is more complex and is able to combine both dependent and independent features. The addition of the latter algorithm will be particularly useful if the difference between the two simpler algorithms’ models is small.

As well as applying three algorithms, I will also look at the difference between keeping thou and thee separate and combining them into the one category THOU. For this, I will run both a binary (you and thou) and a trinary (you, thou and thee) classification, to see whether this affects the scores or changes which features are included in the best models.
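Two of the automatically derived features described under Linguistic and Extra-linguistic Features, the previous-speaker addressee and the status differential, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the annotation script used in the study: the `Utterance` class, the `after_break` flag, the speaker labels, and the choice to return `None` when neither neighbour is usable are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str
    # True if a scene start or a stage direction precedes this utterance
    # (hypothetical flag; the corpus may encode interruptions differently).
    after_break: bool = False


def assign_addressees(utterances: List[Utterance]) -> List[Optional[str]]:
    """Apply the last-as-next heuristic: the previous speaker is the addressee.

    For the first utterance after an interruption, the next speaker is used
    instead. If no suitable next speaker exists, None is returned (an
    assumption of this sketch, not something the paper specifies).
    """
    addressees: List[Optional[str]] = []
    for i, utt in enumerate(utterances):
        if i == 0 or utt.after_break:
            nxt = utterances[i + 1] if i + 1 < len(utterances) else None
            if nxt is not None and not nxt.after_break:
                addressees.append(nxt.speaker)
            else:
                addressees.append(None)
        else:
            addressees.append(utterances[i - 1].speaker)
    return addressees


def status_differential(speaker_status: int, addressee_status: int) -> int:
    """Absolute difference between two ordered, numeric status labels."""
    return abs(speaker_status - addressee_status)


# Hypothetical exchange: a scene opens, and a stage direction precedes the fourth turn.
scene = [
    Utterance("LEAR", after_break=True),
    Utterance("KENT"),
    Utterance("LEAR"),
    Utterance("FOOL", after_break=True),
    Utterance("LEAR"),
]
addressees = assign_addressees(scene)
print(addressees)

# The paper's own examples: king (0) vs servant (6), and king (0) vs prince (1).
king_servant = status_differential(0, 6)
king_prince = status_differential(0, 1)
print(king_servant, king_prince)
```

Because the differential is absolute, it records how far apart two characters are but not which of them is the superior; direction would have to come from the separate speaker and addressee status features.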
Overview of Implementation

I ran the three algorithms using the Waikato Environment for Knowledge Analysis (Weka³) software⁴ with the default settings. The algorithms were run using 10-fold cross-validation, so that every instance was used for both training and testing and the reported scores reflect all folds combined. The number of relevant instances of you/thou/thee extracted from the dataset is 22,932, which makes up 99.5% of the total number of such pronouns in the dataset. The pronouns were extracted using a Python script with simple heuristics. About 0.5% was missed due to noise in the dataset. The number of instances of you/thou/thee that were extracted from each play ranges from 363 (in Macbeth) to 811 (in Coriolanus).

I attempted to improve or maintain the scores while making the model simpler by excluding features, that is, through feature ablation. When there were conflicting changes in the scores, the scores of precision and F-measure were prioritised. I hoped to identify which features truly help predict the pronoun by building the simplest but best performing model. The baseline that the models were compared to is derived from the distribution of the pronouns in the dataset, thus 62.6% of you and 37.4% thou.

³ Weka 3 - Data Mining with Open Source Machine Learning Software in Java, http://www.cs.waikato.ac.nz/ml/weka/.
⁴ In Weka, Naive Bayes is identified as NaiveBayesMultinomial, decision tree as J48, and support vector machine as SMO.

I first took out groups of features that are related, rather than one feature at a time. Among the 23 features, I created six different groups. The first group related to the wider linguistic and social context (play, production date, genre, scene, location), while the second group was the closer linguistic co-text (n-gram). Information on the speaker (name, status, gender, age) and the addressee (name, status, gender, age, number of people) were groups 3 and 4.
I kept status differential on its own, because it relates to multiple groups. Finally, the last group was sentiment (positive and negative). After the group ablation, I went back over the features to see if individual feature exclusions would improve the model further. This ensured the simplest and best model for each algorithm. The scores and the features included in each model are given in Tables 2, 3 and 4.

Results

Trinary Classification Scores

Table 2 shows the results of the trinary classification. As can be seen, each model performed substantially better than the baseline model, on all scores. The F-measure of the best model, the support vector machine model, is marked with an asterisk.

Table 2: Scores for precision, recall, F-measure and accuracy for trinary pronoun prediction

Algorithm                 Class           Precision   Recall   F-measure   Accuracy
Baseline                  Weighted Avg.   0.392       0.626    0.483       62.6417%
                          you             0.626       1.000    0.770
                          thou            0.000       0.000    0.000
                          thee            0.000       0.000    0.000
Naive Bayes               Weighted Avg.   0.826       0.826    0.826       82.64%
                          you             0.880       0.885    0.882
                          thou            0.865       0.850    0.857
                          thee            0.509       0.510    0.510
Decision Tree             Weighted Avg.   0.732       0.752    0.712       75.2093%
                          you             0.738       0.960    0.835
                          thou            0.896       0.574    0.700
                          thee            0.408       0.097    0.157
Support Vector Machine    Weighted Avg.   0.854       0.857    0.854*      85.675%
                          you             0.871       0.927    0.898
                          thou            0.919       0.836    0.876
                          thee            0.659       0.566    0.609

Binary Classification Scores

Table 3 shows the results of the best models for the binary classification. The F-measure of the best model, again the support vector machine model, is marked with an asterisk. This is also the best scoring model out of all models presented in this paper.

Table 3: Scores for precision, recall, F-measure and accuracy for binary pronoun prediction

Algorithm                 Class           Precision   Recall   F-measure   Accuracy
Baseline                  Weighted Avg.   0.392       0.626    0.483       62.6417%
                          you             0.626       1.000    0.770
                          thou            0.000       0.000    0.000
Naive Bayes               Weighted Avg.   0.868       0.868    0.867       86.8306%
                          you             0.876       0.920    0.897
                          thou            0.853       0.782    0.816
Decision Tree             Weighted Avg.   0.818       0.818    0.818       81.8376%
                          you             0.849       0.863    0.856
                          thou            0.764       0.744    0.754
Support Vector Machine    Weighted Avg.   0.872       0.873    0.872*      87.2798%
                          you             0.886       0.914    0.900
                          thou            0.848       0.803    0.825

Feature Comparison of the Models

Overall, the final models contain similar sets of features. The exact compositions are given in Table 4. What is surprising is that the binary classification model for the decision tree is very different from the other models: it does not contain any of the words from the n-gram as a predictor, whereas the others do.

Table 4: Features included in the best model of each algorithm

Algorithm                 Type      Features included
Naive Bayes               Trinary   LW1, LW2, RW1, RW2, S_ID
                          Binary    LW1, LW2, LW3, RW1, RW2, RW3, A_ID
Decision Tree             Trinary   LW1, LW2, RW1, RW2, S_ID, Stat_Diff, Neg_Sent
                          Binary    Scene, S_ID, S_Gender, A_ID, A_Status, A_Age, Stat_Diff, Pos_Sent
Support Vector Machine    Trinary   LW1, RW1, S_ID, S_Age, A_ID, A_Age, A_Number, Stat_Diff, Pos_Sent, Neg_Sent
                          Binary    LW1, RW1, S_ID, S_Age, A_ID, A_Age, A_Number, Stat_Diff, Pos_Sent, Neg_Sent

Discussion

This study has given some new insights into the analysis of pronominal address terms. Looking at the second person singular pronoun choice as a binary and a trinary classification problem resulted in slightly different outcomes. Even though the highest scores were achieved in the binary classification, one might still wonder whether this is the best method for addressing the second person singular pronoun choice. Looking back at prior studies on pronoun interpretation and comparing them to the features used in this study, we can conclude that thee and thou are equal in their opposition to you, with the main difference being their grammatical role. From the model comparison, we have seen that the co-text is most important when predicting the pronoun.
This is evidence of the purely grammatical difference between thou and thee and their overall similarity in other aspects. Therefore, both linguistically and computationally, it makes more sense to perform a binary classification.

Differences between the algorithms were observed, but all three algorithms easily outperformed the baseline. The support vector machine models performed best, but the scores for the Naive Bayes models were quite similar to those for the SVM models. A choice between these approaches could be based solely on the scores for accuracy, precision, recall and F-measure, or also by taking into account the complexity, which is significantly higher for the support vector machine models. The more nuanced models that the support vector machine creates, which include more features than the models of the other algorithms, may suggest that the extra complexity of SVM models is indeed beneficial.

The best predicting features were the LW and RW features, which supports the importance of the direct linguistic co-text. In particular RW1 appeared as the most important feature in predicting the second person singular pronominal address term. Other important features were the speaker’s name, addressee’s name, status differential, positive sentiment and negative sentiment, with additional support from the speaker’s gender, addressee’s status, addressee’s age, speaker’s age, and number of people addressed. Only six features were not included in any of the models: genre, play, production date, location, speaker’s status and addressee’s gender.

I am, therefore, now able to falsify the null-hypothesis that it is not possible to build a reliable prediction model based on linguistic and extra-linguistic features. All six models demonstrate that linguistic and extra-linguistic features substantially improve the prediction of the pronominal address term, as all six outperform the baseline.
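The baseline rows of Tables 2 and 3 follow directly from the class distribution, which the short sketch below makes explicit. It assumes the baseline always predicts the most frequent class (you, 62.6417% of instances) and, as in the tables, scores precision for the never-predicted classes as 0. The function name `majority_baseline` and the combined label `thou_thee` are illustrative choices, not part of the study; thou and thee are merged here because only their combined share is reported, which leaves the weighted averages unchanged since both score 0.

```python
from typing import Dict, Tuple


def majority_baseline(class_shares: Dict[str, float]) -> Tuple[Dict[str, Tuple[float, float, float]], Tuple[float, float, float]]:
    """Per-class and weighted-average (precision, recall, F-measure) for a
    classifier that always predicts the most frequent class."""
    majority = max(class_shares, key=class_shares.get)
    share = class_shares[majority]
    per_class: Dict[str, Tuple[float, float, float]] = {}
    for label in class_shares:
        if label == majority:
            precision, recall = share, 1.0
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            # Never predicted: recall is 0; precision is undefined and is
            # reported as 0 here, matching the baseline rows in the tables.
            precision = recall = f_measure = 0.0
        per_class[label] = (precision, recall, f_measure)
    weighted = tuple(
        sum(class_shares[label] * per_class[label][k] for label in class_shares)
        for k in range(3)
    )
    return per_class, weighted


# Class shares taken from the reported baseline accuracy of 62.6417%.
shares = {"you": 0.626417, "thou_thee": 0.373583}
per_class, (precision, recall, f_measure) = majority_baseline(shares)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
# prints: 0.392 0.626 0.483
```

Seen this way, the weighted F-measure of 0.483 is the figure each model has to beat, and even the weakest model reported (the trinary decision tree, 0.712) clears it by a wide margin.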
The second hypothesis, about which features would be good predictors, was partially correct in predicting that social status, age and sentiment would be included in the best models. However, none of these features was the main predictor of pronoun choice; that was the immediate co-text.

With regard to the final hypothesis, it has been revealed that the features are indeed both dependent on and independent of each other. However, since the Naive Bayes models perform almost identically to the support vector machine models, we can say that the features are, for the most part, independent of one another.

Conclusions

The primary finding of this study is that it is indeed possible to build a prediction model for the use of you versus thou with a singular referent in the plays of Shakespeare that is based on linguistic and extra-linguistic features. Moreover, in particular, the direct linguistic co-text of the second person singular pronoun is important. Other important features include the speaker’s and addressee’s names, status differential and both positive and negative sentiment. All in all this suggests that the pronoun choice is influenced by several linguistic and extra-linguistic features. The best scoring algorithm and model was the support vector machine with 87.3% accuracy through its binary classification model.

For future research, I would recommend an exploration of other algorithms and features that were left out of this study, such as morphology, word embeddings and POS-tags. This will help us gain more information about the linguistic co-text directly surrounding the second person singular pronoun, which will likely give more insight into why this direct co-text is so important in deciding the choice of you or thou.
Moreover, including familiarity between characters (social distance) as a feature would be beneficial, as this has been noted multiple times in prior research as an influential factor, but was beyond the scope of this study.

Although this study has not yet provided a comprehensive set of all the linguistic and extra-linguistic features that influence the second person singular pronoun choice in Shakespeare’s plays, it has provided a more objective and extensive analysis of the matter that furthers the research into you and thou.

Acknowledgements

The research presented in this article was conducted in collaboration with the Encyclopaedia of Shakespeare’s Language project at Lancaster University. This project is funded by the UK’s Arts and Humanities Research Council (AHRC), grant reference AH/N002415/1. The Shakespeare corpus will be made publicly available in Summer 2019, first via the CQPweb interface and then through download at a later stage. Many thanks to Jonathan Culpeper and the rest of the team for their advice and support throughout the study.

References

Literature:
• Bate, Jonathan, and Eric Rasmussen, eds. 2007. William Shakespeare: Complete Works. London: The Royal Shakespeare Company.
• Brown, Roger W., and Albert Gilman. 1960. “The Pronouns of Power and Solidarity.” In Style in Language, edited by Thomas A. Sebeok, 253–76. Cambridge: MIT Press.
• Busse, Beatrix. 2006. Vocative Constructions in the Language of Shakespeare. Amsterdam: John Benjamins.
• Busse, Ulrich. 2003. “The Co-occurrence of Nominal and Pronominal Address Forms in the Shakespeare Corpus: Who Says Thou or You to Whom?” In Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 193–221. Amsterdam: John Benjamins.
• Busse, Ulrich. 2002.
The Function of Linguistic Variation in the Shakespeare Corpus: A Corpus-based Study of the Morpho-syntactic Variability of the Address Pronouns and Their Socio-historical and Pragmatic Implications. Amsterdam: John Benjamins.
• Calvo, Clara. 1992. “Pronouns of Address and Social Negotiation in As You Like It.” In Language and Literature, Vol. 1(1), 5–27. London: Longman Group UK Ltd.
• Greenblatt, Stephen, Walter Cohen, Jean E. Howard, and Katherine E. Maus. 1997. The Norton Shakespeare: Based on the Oxford Edition. New York: W.W. Norton & Company, Inc.
• Mazeland, Harrie. 2003. Inleiding in de conversatieanalyse. Bussum: Coutinho bv.
• Mazzon, Gabriella. 2003. “Pronouns and Nominal Address in Shakespearean English: A Socio-affective Marking System in Transition.” In Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 223–49. Amsterdam: John Benjamins.
• Quennell, Peter, and Hamish Johnson. 2002. Who’s Who in Shakespeare. London: Routledge.
• Quirk, Randolph. 1974. “Shakespeare and the English Language.” In The Linguist and the English Language, edited by R. Quirk, 46–64. London: Edward Arnold.
• Stein, Dieter. 2003. “Pronominal Usage in Shakespeare: Between Sociolinguistics and Conversation Analysis.” In Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 251–307. Amsterdam: John Benjamins.
• Taavitsainen, Irma, and Andreas H. Jucker. 2003. “Introduction.” In Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 1–25. Amsterdam: John Benjamins.
• Thelwall, Mike, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. “Sentiment Strength Detection in Short Informal Text.” Journal of the American Society for Information Science and Technology, 61(12): 2544–58. https://doi.org/10.1002/asi.21416.
• Walker, Terry. 2003.
“You and Thou in Early Modern English Dialogues: Patterns of Usage.” In Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 309–42. Amsterdam: John Benjamins.

Isolde van Dorst

YOU, THOU AND THEE: A STATISTICAL ANALYSIS OF SHAKESPEARE’S USE OF PRONOMINAL ADDRESS TERMS

SUMMARY

Much research has been undertaken on the use of you, thou and thee in Shakespeare’s works. However, the results so far have yet to arrive at an exact and conclusive answer regarding how these pronouns were used. This study combines the strengths of multiple research fields in an effort to determine, via hitherto unused computational methods, which linguistic and extra-linguistic features influence the second person singular pronoun choices in the plays of Shakespeare. In the English of Shakespeare’s time, the now-archaic distinction between you and thou persisted, and is usually reported as being determined by relative social status and personal closeness of speaker and addressee. However, even between studies with similar outcomes, the results vary massively in the degree of influence they report and in the inclusion or exclusion of a wide range of other potential influencing factors. Therefore, it remains to be determined whether statistical machine learning will support this traditional explanation.

In this study, 23 linguistic and extra-linguistic features are investigated, having been selected from multiple linguistic areas, such as pragmatics, sociolinguistics and conversation analysis. The three algorithms used, Naive Bayes, decision tree and support vector machine, are selected as illustrative of a range of possible models in light of their contrasting assumptions and learning biases. Two predictions are performed, firstly on a binary (you/thou) distinction and then on a trinary (you/thou/thee) distinction, giving six final models to compare.
This is a strictly empirical study, which attempts to verify the findings of earlier research through a computational approach. Its main aim is to find a pattern or model that best explains the use of second person singular pronominal address terms in Shakespeare, rather than simply to achieve the best-performing model.

The primary finding of this study is that it is indeed possible to build a prediction model for the use of singular second person pronouns in the plays of Shakespeare based on linguistic and extra-linguistic features. In particular, the direct linguistic context of the pronoun is the most important feature in all of the models except one. Several other features also influence the pronoun prediction, including the names of the speaker and addressee, the status differential, and positive and negative sentiment. Additionally, all three algorithms easily outperformed the baseline. Of the three algorithms, the support vector machine models score best. However, the Naive Bayes models perform almost equally well. This suggests that the features are, for the most part, independent of one another. When comparing the binary and trinary classification outcomes, the binary models scored better than the trinary ones. Looking back at prior studies on pronoun interpretation and comparing them to the features used in this study, we can conclude that thee and thou are equal in their opposition to you, with the main difference being their grammatical role. Therefore, both linguistically and computationally, it makes most sense to use the binary classification.

Isolde van Dorst

YOU, THOU IN THEE: STATISTIČNA ANALIZA UPORABE IZRAZOV ZAIMKOVNEGA NASLAVLJANJA PRI SHAKESPEARU

POVZETEK

Much research has been carried out on the use of the pronouns you, thou and thee in Shakespeare’s works. However, the results so far have not yielded a precise and conclusive answer as to how these pronouns were used. The study combines the strengths of different research fields in order to determine, using computational methods not previously applied, which linguistic and extra-linguistic features influence the choice of second person singular pronoun in Shakespeare’s plays. In the English of Shakespeare’s time, the distinction between you and thou, now archaic, still existed. It is usually reported as having been determined by the relative social status and personal closeness of speaker and addressee. However, even among studies with similar results, these differ greatly in the degree of influence and in the inclusion or exclusion of numerous other possible influencing factors. It therefore remains to be determined whether statistical machine learning will confirm this traditional explanation.

This study examines 23 linguistic and extra-linguistic features, selected from various linguistic fields such as pragmatics, sociolinguistics and conversation analysis. The three algorithms used (Naive Bayes classifier, decision tree and support vector machine) are chosen as an illustrative set of possible models on account of their contrasting assumptions and learning biases. Two predictions are made, the first on the binary (you/thou) distinction and the second on the trinary (you/thou/thee) distinction, which gives six final models that can be compared. The study is strictly empirical, and its aim is to verify the findings of earlier research through a computational approach. It focuses primarily on finding a pattern or model that would best explain the use of second person singular pronominal address terms in Shakespeare, and not merely on building the best-performing model.

The fundamental finding of this study is that it is indeed possible to build a prediction model for the use of second person singular pronouns in Shakespeare’s plays on the basis of linguistic and extra-linguistic features. In addition, the direct linguistic context of the pronoun is the most important feature in all models but one. The pronoun prediction is also influenced by several other features, including the names of the speaker and addressee, the status differential, and positive or negative sentiment. All three algorithms also easily outperformed the baseline. Of the three algorithms, the support vector machine gives the best results. However, the Naive Bayes models achieve almost equally good results. It follows that the features are mostly independent of one another. A comparison of binary and trinary classification showed that the binary models score better than the trinary ones. Comparing prior studies on pronoun interpretation with the features used in this study, we can conclude that thee and thou are equal in their opposition to you, the main difference being their grammatical role. From both a linguistic and a computational standpoint, therefore, binary classification makes the most sense.