Metodološki zvezki, Vol. 5, No. 1, 2008, 81-93

Usage of Multivariate Analysis in Authorship Attribution: Did Janez Mencinger Write the Story "Poštena Bohinčeka"?

Marko Limbek1

1 Student of Statistics, University of Ljubljana, Kongresni trg 12, 1000 Ljubljana; marko.limbek@gmail.com

Abstract

This paper applies several techniques of multivariate analysis to a problem of authorship attribution and shows that statistical methods can be successful in the field of stylometry and can yield useful results.

1 Introduction

Subjective methods that depend on personal evaluation cannot give unique answers in the analysis of texts; objective methods that assure a level of certainty »beyond reasonable doubt« are therefore called for. The aim is to obtain data for statistical analysis by quantifying the characteristics of the texts. In authorship attribution the intention is to determine what distinguishes one author from the others, in order to describe the author's personal style. Realistic results cannot always be guaranteed, but there is at least a wide choice of techniques available. The use of statistical methods in literature is very interesting and can be of great use in solving real questions.

Some basic facts about the development of stylometry can be found in Holmes (1997); several authors have explored the field over the last hundred years. The techniques developed range from less to more sophisticated. The simplest measure word length and sentence length, as was done by Mendenhall (1887) and Yule (1938). These measures are simple and easy to determine, but not very reliable. The second group of techniques uses the vocabulary distribution, which was extensively developed by Holmes (1991).
It focuses on the distribution of vocabulary frequencies, especially on hapax legomena and hapax dislegomena, words that appear once or twice in the text and give good insight into the richness of the vocabulary. Holmes also deals with Sichel's distribution and Yule's characteristic K. The last group comprises multivariate methods, used efficiently by Binongo (2003).

In style analysis it is important to capture the fixed mark of the author's style, that is, its permanent characteristics, and those characteristics must be independent of the content, which changes from one text to another. The key is to use function words, such as pronouns, auxiliary verbs, prepositions, conjunctions, determiners and other closed-class words, which form the skeleton of the text and carry no content. In this respect the method is in a sense the opposite of exploring the vocabulary distribution. Importantly, an author cannot avoid using function words and moreover uses them unconsciously; they occur even in the simplest texts, they do not change with the development of the language and they have no referential meaning, which is why they are a truly objective source for determining an author's specifics. In the case in question, these function words are used as variables. The background of the linguistic phenomena is not explored, since the focus of the paper is on the methods of multivariate analysis. The relevance of using statistical methods should, however, also be justified.

Slovenian authors have also explored the field. Dović (2002) used both the measurement of sentence and word length and the cumulative method in an interesting analysis of possible plagiarism, whereas Primož Jakopin has done extensive research, especially in the field of entropy, and has also established a new corpus called New word.
In this sense the paper extends their work by adding methods of multivariate analysis to the national arsenal.

2 The problem

Multivariate analysis is often used for different kinds of authorship attribution. If it is unknown which of two authors wrote a text, the usual procedure is to analyse and compare the text with samples from both authors and, ideally, to determine to whom it more likely belongs. In this particular case there is a story of unknown authorship, and the task is to examine whether it was written by a particular author. The story in question is »Poštena Bohinčeka« from 1860 and the possible author is the Slovenian writer Janez Mencinger. Four other texts are available that were undoubtedly written by him and that originate from the same period around 1860, so the time factor cannot serve as an explanation for any difference between the texts. All five texts are analysed with methods of multivariate analysis to determine the differences and similarities between them and, in the end, to conclude whether or not »Poštena Bohinčeka« was written by him.

The statistical units are blocks of exactly 1000 words into which the available text has been cut. A computer programme written in Perl then counts the occurrences of each function word in each block, which gives the distribution of all the function words. With these data the multivariate analysis can begin.

3 Data

Five short stories are available, the first four being Mencinger's: »Jerica« with more than 8000 words, »Vetrogončič« with more than 11000 words, »Človek toliko velja, kot plača« with more than 12000 words and »Bore mladost« with more than 13000 words. The main one, »Poštena Bohinčeka«, contains more than 19000 words.
In the manner described, »Poštena Bohinčeka« is divided into 19 blocks of a thousand words, starting from the first word, while the words beyond the first 19000 are discarded. This does not materially affect the statistical results. In the same manner, 8 blocks are obtained from »Jerica«, 11 blocks from »Vetrogončič«, 12 blocks from »Človek toliko velja, kot plača« and 13 blocks from »Bore mladost«, which makes 44 blocks altogether from Mencinger and 19 from »Poštena Bohinčeka«. The total number of blocks is 63. Because every block contains the same number of words, the counts need no further normalisation.

A small additional project was to compile the set of the 50 most frequent function words in the Slovenian language, to be used as variables. An existing list of some 200 function words was checked for frequency against a larger corpus and the cut-off point was set after 50 words. The choice of words deserves further discussion, since determining a basic set of function words is an important step for any language. The procedure resulted in the following stop words:

"ne", "ki", "le", "tako", "da", "je", "naj", "ali", "kar", "k",
"in", "po", "pri", "proti", "si", "bo", "v", "iz", "s", "med",
"čez", "ko", "kakor", "kako", "ker", "z", "pred", "jaz", "nič", "do",
"pa", "ti", "to", "ga", "brez", "mu", "bi", "ni", "kaj", "kadar",
"za", "nihče", "vse", "preden", "se", "tudi", "od", "ravno", "na", "o".

At this point another programme written in Perl is used to obtain the frequency of each function word in each block and to fill the data matrix. This completes the preparation of the data, and the analysis can continue in SPSS.

Figure 1: Dendrogram using Ward's Method. Units in order: Bore mladost 1-13, Človek 14-25, Jerica 26-33, Vetrogončič 34-44, Poštena Bohinčeka 45-63.
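The block construction and counting described in this section can be sketched as follows. The original programmes were written in Perl and are not reproduced in the paper; this Python version is an illustration only. The tokenisation rule (lower-casing and splitting on runs of letters) and the function names split_into_blocks and count_function_words are assumptions, not taken from the original programmes.

```python
import re
from collections import Counter

BLOCK_SIZE = 1000  # words per block, as stated in the paper


def split_into_blocks(text, block_size=BLOCK_SIZE):
    """Cut a text into consecutive blocks of block_size words.

    Assumed tokenisation: lower-case the text and treat every run of
    letters as one word. Words left over after the last full block
    are discarded, as described for the stories above.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    n_blocks = len(words) // block_size
    return [words[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def count_function_words(blocks, function_words):
    """Return one row per block: the count of each function word in it."""
    rows = []
    for block in blocks:
        counts = Counter(block)  # missing words count as 0
        rows.append([counts[w] for w in function_words])
    return rows
```

Applied to each of the five stories in turn, with the 50 stop words as function_words, the stacked rows would give a 63 x 50 matrix of counts of the kind that is passed to the multivariate analysis.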
4 Results

4.1 Cluster analysis

Clustering is the classification of units into groups based on their similarity, so that similar units are collected in the same group. The process runs in steps, and each step can be followed in the corresponding dendrogram. The method used is Ward's linkage with the least-squares distance. The 63 units are numbered in order: »Bore mladost« 1-13, »Človek« 14-25, »Jerica« 26-33, »Vetrogončič« 34-44 and »Poštena Bohinčeka« 45-63. As can be seen, the dendrogram falls into two parts: the lower one, composed almost completely of blocks of »Poštena Bohinčeka«, and the upper one, composed almost completely of blocks of the other four stories. Blocks 48 and 56 of »Poštena Bohinčeka« are indeed found in the upper part, but all the other blocks are grouped correctly. This means that common characteristics of the same text have been found, and it also shows that the other four stories are more similar to each other, since they are mixed together in the upper part.

4.2 Principal component analysis

Principal component analysis is used to extract useful information from multidimensional data. The data are compressed in such a way that as much information as possible is retained while they can be displayed in fewer dimensions, preferably just two or three. In this way the structure of the data becomes visible. The results of the principal component analysis are considered the principal element in showing that Mencinger was in fact not the author of »Poštena Bohinčeka«.

Performing principal component analysis on the 50 variables (frequencies of the selected words) shows that the scree plot (in Appendix 2) breaks between the third and the fourth point; the first three components account for 27,584% of the variance. The eigenvalues and the variance explained by the first ten components are listed in Table 1.

Table 1: Total variance explained.

Component   Initial eigenvalues
            Total    % of variance   Cumulative %
 1          4,929    10,953          10,953
 2          4,215     9,367          20,319
 3          3,269     7,264          27,584
 4          2,343     5,207          32,791
 5          2,202     4,893          37,684
 6          2,000     4,445          42,129
 7          1,913     4,250          46,380
 8          1,825     4,056          50,436
 9          1,725     3,833          54,269
10          1,583     3,518          57,787

Table 2: Component matrix.

words    C1      C2      C3      English transl.
ne        ,354    ,682    ,129   no
ki       -,512    ,218    ,277   which
le        ,539   -,149    ,153   only
tako      ,401    ,147    ,034   so
da        ,297   -,118    ,624   that
je       -,357   -,763    ,138   is
naj       ,050   -,029    ,209   should
ali       ,212   -,162   -,194   or
kar       ,498    ,019    ,315   just
k        -,053    ,000    ,260   to
in       -,321    ,438    ,056   and
po        ,250   -,079   -,358   after
pri       ,115   -,009    ,525   at, by
proti    -,275    ,227   -,311   against
si        ,293   -,054   -,563   you (are)
bo        ,271    ,555    ,124   will be
v        -,516    ,246    ,183   in
iz       -,176    ,258    ,348   from
s        -,168    ,003   -,033   with
med      -,248    ,266    ,540   between
čez      -,347    ,131   -,353   over
ko       -,091   -,386    ,021   when
kakor     ,265   -,360   -,181   like
kako      ,489    ,136   -,160   how
ker       ,002   -,076   -,092   because
z        -,296    ,066   -,084   with
pred     -,360    ,441    ,136   before
jaz       ,421    ,300   -,061   me
do       -,006    ,490   -,185   until
pa        ,427    ,553    ,237   yet
ti        ,307    ,357   -,351   you
to        ,394   -,245    ,030   this
ga        ,007   -,325    ,353   him
mu        ,175   -,484   -,072   him
bi        ,629    ,258   -,139   would
ni        ,081   -,582    ,260   is not
kaj       ,610   -,070    ,034   what
za        ,272   -,190   -,032   for
vse       ,093   -,051    ,242   all
se       -,154    ,150   -,327   is
tudi      ,327    ,170    ,359   too
od       -,393    ,264   -,009   from
ravno     ,480    ,064   -,141   exactly
na       -,314    ,130   -,237   on
o         ,097    ,068    ,430   about

Figure 2: Scatter plot of the first and the second component (legend: known texts, unknown texts).

Figure 3: Scatter plot of the first and the third component.

Figure 4: Scatter plot of the second and the third component.

Figure 5: Similarity among Mencinger's stories and difference between both groups.

The component loading matrix shows which words correlate with the components the most.
Those with an absolute value greater than 0,4 are in bold type. The scatter plots of the first and second components, the first and third components, and the second and third components, in which each unit (block of text) is labelled as belonging to the »known« or the »unknown« author, make it evident that the second component clearly divides the units into two groups: the larger upper group represents the four texts by Mencinger and the smaller lower group represents »Poštena Bohinčeka«. It can be concluded that each group was written by a different author, also taking into consideration the very distinct centroids. When each story is drawn separately, it can be seen that Mencinger's four stories are quite interlaced, whereas »Poštena Bohinčeka« differs from them. The same interlacement can be observed in a 3D perspective.

4.3 t-test

The t-test is used to test the hypothesis that the means of two groups are equal, or in other words that the two groups are similar to each other. Performing the t-test on two groups that are not similar not only confirms the existence of significant differences between the groups but also points out the individual variables that distinguish the groups the most. The first group consists of the blocks from the known texts and the second of the blocks from the unknown text. In the t-test, as shown in Appendix 1, 16 out of 50 variables distinguished between the groups. These variables are: ne, ki, je, in, bo, v, iz, med, kakor, pred, nič, pa, brez, ni, kadar, nihče. This is reasonably strong evidence of a statistically significant difference between the means of the two groups of texts, which supports the hypothesis. The last step is another test, a discriminant analysis that uses these t-test-identified variables as well as the variables suggested by the principal component analysis.

4.4 Discriminant analysis

Discriminant analysis is usually used to find linear combinations of variables that distinguish predefined classes.
Here it is used mainly to confirm that the two groups of blocks, from the known and from the unknown texts, are different. The coefficients of the linear combinations are of course also obtained.

As indicated, two discriminant analyses are performed. The results of the principal component analysis (on the second component) suggest the variables ne, je, in, bo, pred, do, pa, mu, ni, which distinguish the »known« and »unknown« texts the most. These words are therefore used for the first discriminant analysis, D1. The results of the t-test suggest the variables ne, ki, je, in, bo, v, iz, med, kakor, pred, nič, pa, brez, ni, kadar, nihče, which are used for the second discriminant analysis, D2.

The result is as follows: both analyses show a difference between the groups, and the group of variables obtained with the t-test is the better discriminator. These variables classify 100% of the original cases correctly, whereas the PCA group classifies only 96,8% of the cases correctly. Wilks' Lambda is also much higher for the PCA group.

Table 3: Discriminant analysis: standardised loadings of the discriminant variables, % of correctly classified cases, Wilks' Lambda, χ², significance level.

                        D1             D2              English transl.
ne                       ,146           ,367           no
ki                       /              ,660           which
je                       ,181          -,419           is
in                       ,405           ,107           and
bo                       ,248           ,034           will be
v                        /              ,317           in
iz                       /              ,169           from
med                      /              ,153           between
kakor                    /             -,307           like
pred                     ,131           ,108           before
nič                      /             -,328           nothing
do                      -,050           /              until
pa                       ,871           ,855           yet
brez                     /              ,182           without
mu                      -,450           /              him
ni                      -,373           ,013           not
kadar                    /             -,574           when
nihče                    /              ,125           noone
% correctly classified   96,8%          100%
Wilks' Lambda            0,324          0,134
χ² (df)                  63,696 (9)     106,659 (16)
Sig.                     ,000           ,000

5 Conclusions

Are the authors of the analysed stories really different? The statistical results obtained by four different approaches confirm the hypothesis that the unknown author is not Mencinger. Furthermore, the variables that distinguish the stories the most have been identified.
It must be emphasised that other criteria, such as the historical time of writing, the theme of the stories and the literary style, do not differ and cannot influence the results obtained. We have thus managed to show, using four statistical approaches, that Janez Mencinger is not the author of »Poštena Bohinčeka«, and it would be interesting to see who is!

Acknowledgements

The author would like to thank Prof. Dr. Miran Hladnik, UL FF, for contributing the texts and for advice in the literary field, and Prof. Dr. Anuška Ferligoj, UL FDV, for her extensive help with the statistical methods and other advice. Some other experts in the field have also shown interest in the project.

References

[1] Binongo, J.N.G. (2003): Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16, 9-17.
[2] Dabagh, R.M. (2007): Authorship attribution and statistical text analysis. Metodološki zvezki, 4, 149-163.
[3] Dović, M. (2002): Podbevšek in Cvelbar: Poskus empirične preverbe namigov o plagiatorstvu. Slavistična revija, 50, 233-249.
[4] Holmes, D.I. (1991): A stylometric analysis of Mormon scripture and related texts. J.R. Statist., 155, 91-120.
[5] Jakopin, P. (2003): Nizkoentropijski jezikovni model na besedilih Cirila Kosmača in Ivana Cankarja. Slovenski roman, 21, 421-428.

Appendix

1 Independent Samples Test

Table 4: Means, standard deviations, t and sig. of the variables for the known and unknown texts.

                      MEANS              STANDARD DEV.
WORDS                 known    unknown   known    unknown   t        Sig.
ne      no            12,39     7,53      3,558    3,935     4,819   ,000
ki      which          6,55     4,37      3,209    2,773     2,569   ,013
le      only           2,52     2,32      1,861    1,057      ,453   ,652
tako    such           4,8      3,74      2,426    1,821     1,703   ,094
da      that          16       13,47      5,532    3,323     1,847   ,070
je      is            44,39    63        13,464   16,111    -4,743   ,000
naj     should         1,77     1,79      1,508    1,512     -,040   ,968
ali     or             2,89     3,95      2,137    1,985    -1,846   ,070
kar     just           3        2,53      1,88     1,264     1,002   ,320
k       to             2,11     2,11      1,498    1,729      ,019   ,985
in      and           39,05    31,74      8,488    5,425     3,452   ,001
po      after          5,16     6         2,623    2,809    -1,143   ,257
pri     at/by          2,82     2,42      1,756    2,063      ,781   ,438
proti   against        1,27     0,89      1,468    1,049     1,014   ,315
si      (you) are      4,09     5,47      3,588    2,458    -1,529   ,132
bo      will be        5,2      2,47      3,593    1,712     3,151   ,003
v       in            17,8     14,79      4,892    4,131     2,340   ,023
iz      from           4        2,53      2,035    1,124     2,959   ,004
s       with           2,93     3,11      1,576    2,424     -,338   ,736
med     between        2,02     0,89      1,406    0,875     3,229   ,002
čez     over           1,48     1,32      1,592    1,157      ,398   ,692
ko      when           2,95     3,68      2,09     2,11     -1,268   ,210
kakor   like           6,09     8,74      2,311    2,903    -3,855   ,000
kako    how            3        2,26      2,323    1,695     1,245   ,218
ker     because        3,86     3,84      2,174    2,911      ,032   ,974
z       with           4,77     4,89      2,666    2,961     -,161   ,872
pred    before         2,64     1,16      1,63     1,5       3,382   ,001
jaz     me             1,68     1,37      1,653    1,383      ,723   ,472
nič     nothing        0,93     2,26      1,404    1,695    -3,242   ,002
do      until          1,98     1,32      1,517    1,108     1,710   ,092
pa      yet           17,18     5,95      5,978    3,749     7,556   ,000
ti      you            2,64     1,95      2,373    1,715     1,141   ,258
to      this           3,36     3,89      2,114    2,331     -,887   ,378
ga      him            6,09     6,63      3,388    2,91      -,605   ,547
brez    without        1,23     0,47      1,309    0,697     2,361   ,021
mu      him            5,89     8,26      3,604    3,364    -2,450   ,017
bi      would          9,45     7,68      4,752    2,75      1,514   ,135
ni      is not         7,16    10,53      3,154    4,948    -3,251   ,002
kaj     what           3,77     4         3,256    2,494     -,271   ,787
kadar   when           0,07     0,68      0,255    0,885    -4,263   ,000
za      for            4,09     4,74      2,089    2,446    -1,069   ,289
nihče   noone          0,57     0         0,974    0         2,531   ,014
vse     all            3,41     4,05      2,213    1,957    -1,095   ,278
preden  before         0,3      0,26      0,632    0,452      ,201   ,841
se      is            24,27    23,42      5,302    5,326      ,584   ,561
tudi    too            6        4,84      2,988    2,588     1,467   ,148
od      from           3,98     3,26      2,758    1,91      1,025   ,309
ravno   exactly        1,64     1,11      1,844    1,049     1,173   ,245
na      on            10,41     9,89      3,694    4,067      ,492   ,624
o       about          1,37     0,74      1,662    0,933     1,553   ,126

2 Scree plot

Figure 6: Scree plot (horizontal axis: Component Number).

3 Frequency table and histogram of variable "ne"

To illustrate the normal distribution of the variables, a histogram of the variable "ne" has been added. The curve on the histogram represents the continuous normal distribution and the columns represent the discrete distribution of the variable "ne". The heights of the columns roughly follow the normal curve, so the conclusion is, as expected, that the distribution of a single variable is more or less normal.

Figure 7: Histogram showing normal distribution of variable "ne".
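As a cross-check on Table 4, a t value can be recomputed from the reported means and standard deviations. The sketch below assumes the equal-variance (pooled) form of the two-sample t-test, with 44 known and 19 unknown blocks; the paper does not state which form SPSS was asked to report, and the helper name pooled_t is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance (equal variances assumed).

    Returns the t value and its degrees of freedom, n1 + n2 - 2.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err, df


# Row "ne" of Table 4: known blocks (n = 44) versus unknown blocks (n = 19).
t_ne, df_ne = pooled_t(12.39, 3.558, 44, 7.53, 3.935, 19)
print(round(t_ne, 2), df_ne)  # t is about 4.82 with 61 degrees of freedom
```

For the variable "ne" this gives a value that agrees with the reported 4,819 up to the rounding of the tabulated means and standard deviations, which is consistent with, but does not prove, the pooled form having been used.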