Evaluating the Performance of LSA for Source-code Plagiarism Detection

Georgina Cosma
Department of Business Computing, PA College, Larnaca, CY-7560, Cyprus
E-mail: g.cosma@faculty.pacollege.ac.cy

Mike Joy
Department of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
E-mail: M.S.Joy@warwick.ac.uk

Keywords: LSA, source-code similarity detection, parameter tuning

Received: October 25, 2012

Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms to analyze large corpora of text and reveal the underlying semantic information of documents. LSA is a highly parameterized statistical method, and its effectiveness depends on the settings of its parameters, which are adjusted for the task to which it is applied. This paper discusses and evaluates the importance of parameterization for LSA-based similarity detection of source-code documents, and the applicability of LSA as a technique for source-code plagiarism detection when its parameters are appropriately tuned. The parameters involve preprocessing techniques, weighting approaches, and parameter tweaking inherent to LSA processing, in particular the choice of dimensions for the step of reducing the original post-SVD matrix. The experiments revealed that the best retrieval performance is obtained after removing in-code comments (Java comment blocks) and applying a combined weighting scheme based on term frequencies, normalized term frequencies, and cosine document length normalization. Furthermore, the use of similarity thresholds (instead of mere rankings) requires a higher number of dimensions.

Povzetek: This paper analyzes the LSA method, specifically with regard to source-code plagiarism.

1 Introduction

Latent Semantic Analysis (LSA) is an intelligent information retrieval technique that uses mathematical algorithms to analyze large corpora of text and reveal the underlying semantic information of documents [10, 11]. Previous researchers have reported that LSA is suitable for textual information retrieval, and it is typically used for indexing large text collections and retrieving documents based on user queries. In the context of text retrieval, LSA has been applied to a variety of tasks including indexing and information filtering [12], essay grading [23, 38, 13, 43, 19, 18], cross-language information retrieval [44], detecting plagiarism in natural language texts [7], and source-code clustering and categorization [20, 25, 22, 26, 27, 28]. LSA has been applied to source code with the aim of categorizing software repositories in order to promote software reuse [27, 28, 24], and much work has been done in the area of applying LSA to software components. LSA-based tools developed include MUDABlue [20] for software categorization, Softwarenaut [25] for exploring parts of a software system using hierarchical clustering, and Hapax [22], which clusters software components based on the semantic similarity between their software entities (entire systems, classes, and methods). Although LSA has been applied to source-code related tasks such as the reuse and categorization of source-code artifacts, there appears to be a lack of literature investigating the behavior of the parameters that drive the effectiveness of LSA for tasks involving source-code corpora. The current literature also lacks an evaluation of LSA and its applicability to detecting source-code plagiarism [31, 32].
2 A Review of Latent Semantic Analysis

Latent Semantic Analysis uses statistics and linear algebra to reveal the underlying "latent" semantic meaning of documents [5]. Latent Semantic Indexing (LSI) is a special case of LSA: the term LSI is used for tasks concerning the indexing or retrieval of information, whereas the term LSA is used for all other tasks, such as automatic essay grading and text summarization.

The first step prior to applying LSA involves preprocessing the documents in the corpus in order to represent the corpus efficiently as a term-by-document matrix. Document preprocessing operations include the following [2].

- Tokenization. This involves identifying the spaces in the text as word separators, and considering digits, hyphens, punctuation marks, and the case of letters.

- Stopword elimination. This is the elimination of words with a high frequency in the document corpus, and involves removing prepositions, conjunctions, and other common words that could be considered useless for purposes of retrieval, e.g. English words such as the, and, and but. In source code this involves removing programming-language reserved words (i.e. keywords).

- Stemming of words. This involves transforming variants of words with the same root into a common concept. A stem is the portion of a word that remains after removing its affixes (suffixes and prefixes). An example of a stem is eliminat, which is the common prefix of the variants eliminated, eliminating, elimination, and eliminations.

After preprocessing is performed, the corpus of documents is transformed into an m x n matrix A = [a_ij], in which each of the m rows represents a term vector, each of the n columns represents a document vector, and each cell a_ij contains the frequency with which term i appears in document j.

Term weighting is then applied to matrix A. The purpose of term weighting is to adjust the frequency values of terms using local and global weights in order to improve retrieval performance. Local weights determine the value of a term in a particular document, and global weights determine the value of a term in the entire document collection. Various local and global weighting schemes exist [4], and these are applied to the term-by-document matrix to give high weights to important terms, i.e. those that occur distinctively across documents, and low weights to terms that appear frequently in the document collection. Document length normalization [41] adjusts the term values depending on the length of each document in the collection. The weighted value of term i in document j is l_ij x g_i x n_j, where l_ij is the local weight of term i in document j, g_i is the global weight of term i, and n_j is the document length normalization factor [4]. Long documents have a larger number of terms and higher term frequencies than short documents, which increases the number of term matches between a query and a long document, and thus increases the retrieval chances of long documents over short ones. The literature reports that cosine document length normalization can improve retrieval performance [41, 40]. Tables 1, 2, and 3 contain some of the most commonly used term-weighting formulas [4].
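As an illustration of the preprocessing and matrix-construction steps described above, the following is a minimal Python sketch, not the implementation evaluated in this paper. It strips Java comments, discards Java reserved words, and builds a raw term-by-document frequency matrix; the regular expressions, the (incomplete) keyword list, and the function names are illustrative assumptions.

import re
from collections import Counter

# Illustrative subset of Java reserved words, treated here as stopwords.
JAVA_KEYWORDS = {
    "abstract", "boolean", "break", "byte", "case", "catch", "char",
    "class", "continue", "default", "do", "double", "else", "extends",
    "final", "finally", "float", "for", "if", "implements", "import",
    "instanceof", "int", "interface", "long", "new", "package",
    "private", "protected", "public", "return", "short", "static",
    "super", "switch", "synchronized", "this", "throw", "throws",
    "try", "void", "volatile", "while",
}

def preprocess(source, remove_comments=True, remove_keywords=True):
    """Tokenize one source-code document into a list of terms."""
    if remove_comments:
        # Strip /* ... */ blocks and // line comments (the RC schemes).
        source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
        source = re.sub(r"//[^\n]*", " ", source)
    # Treat any non-letter character as a separator and lower-case tokens.
    tokens = re.findall(r"[A-Za-z]+", source.lower())
    if remove_keywords:
        # Remove reserved words (the RK schemes).
        tokens = [t for t in tokens if t not in JAVA_KEYWORDS]
    return tokens

def term_by_document_matrix(documents):
    """Return (terms, A) where A[i][j] is the raw frequency f_ij."""
    bags = [Counter(preprocess(doc)) for doc in documents]
    terms = sorted(set().union(*bags))
    A = [[bag[t] for bag in bags] for t in terms]
    return terms, A

The raw frequencies in A are the values that the local and global weighting formulas in the tables below then adjust.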
In the term-by-document matrix A with columns a_j (1 <= j <= n), the symbol f_ij denotes the number of times (the term frequency) that term i appears in document j. The term-weighting formulas in Tables 1, 2, and 3 use the quantities

b(f_ij) = 1 if f_ij > 0, and b(f_ij) = 0 if f_ij = 0;    p_ij = f_ij / (Σ_j f_ij).

Symbol  Name                                 Formula
b       Binary                               b(f_ij)
l       Logarithmic                          log2(1 + f_ij)
n       Augmented normalized term frequency  (b(f_ij) + (f_ij / max_k f_kj)) / 2
t       Term frequency                       f_ij
a       Alternate log                        b(f_ij)(1 + log2 f_ij)

Table 1: Formulas for local term weights (l_ij)

Symbol  Name                              Formula
x       None                              1
e       Entropy                           1 + Σ_j (p_ij log2 p_ij) / log2 n
f       Inverse document frequency (IDF)  log2(n / Σ_j b(f_ij))
g       GfIdf                             (Σ_j f_ij) / (Σ_j b(f_ij))
n       Normal                            1 / sqrt(Σ_j f_ij^2)
p       Probabilistic inverse             log2((n - Σ_j b(f_ij)) / Σ_j b(f_ij))

Table 2: Formulas for global term weights (g_i)

Symbol  Name    Formula
x       None    1
c       Cosine  1 / sqrt(Σ_i (g_i l_ij)^2)

Table 3: Formulas for document length normalization (n_j)

Once term weighting has been applied, the matrix is submitted to Singular Value Decomposition (SVD) to derive the latent semantic structure model. SVD decomposes matrix A into the product of three other matrices: an m x r term-by-dimension matrix U, an r x r singular-values matrix Σ, and an n x r document-by-dimension matrix V. The rank r of matrix A is the number of nonzero diagonal elements of Σ. SVD can provide a rank-k approximation to matrix A, where k is the number of dimensions (or factors) chosen and k < r. This process is known as dimensionality reduction, and involves truncating all three matrices to k dimensions. The reduced matrices are denoted by U_k, Σ_k, and V_k, where U_k is an m x k matrix, Σ_k is a k x k matrix, and V_k is an n x k matrix. The rank-k approximation to matrix A can then be constructed as A_k = U_k Σ_k V_k^T. It is important when computing the SVD that k is smaller than the rank r, because it is this feature that reduces the noise in the data and reveals the important relations between terms and documents [6, 3].

One common task in information retrieval systems involves a user placing a query in order to retrieve documents of interest. Given a query vector q, whose nonzero elements contain the weighted term frequency values of the terms, the query vector is projected into the k-dimensional space using Equation (1) [6]:

Q = q^T U_k Σ_k^{-1}.    (1)

On the left-hand side of the equation, Q is the mapping of q into latent semantic space; on the right-hand side, q is the vector of terms in the user's weighted query, q^T is its transpose, and q^T U_k is the sum of the k-dimensional term vectors specified in the query, which is then multiplied by the inverse of the singular values, Σ_k^{-1}. The singular values are used to weight each dimension of the term-document space separately [6].

Once the query vector is projected into the term-document space, it can be compared to all other document vectors using a similarity measure. One very popular measure computes the cosine between the query vector and a document vector. Typically, the cosines of the angles between the query vector and each of the other document vectors are computed, and the documents are ranked according to their similarity to the query, i.e. how close they are to the query in the term-document space. All documents, or those whose similarity value exceeds a threshold, are returned to the user in a ranked list sorted in descending order of similarity, i.e. the documents most similar to the query appear at the top of the list. The quality of the results can be measured using evaluation measures, such as those discussed in Section 6.
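To make the weighting, dimensionality-reduction, and query-matching steps concrete, the following numpy sketch applies the tnc scheme (term frequency, "normal" global weight, cosine document length normalization), truncates the SVD to k dimensions, and ranks documents against a query by cosine similarity. It is a sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the matrices are dense, every term is assumed to occur at least once, and the query is assumed to be weighted with the same global weights as the collection.

import numpy as np

def tnc_weight(A):
    """Apply the tnc scheme to a raw term-by-document count matrix A."""
    A = np.asarray(A, dtype=float)
    g = 1.0 / np.sqrt((A ** 2).sum(axis=1))   # "normal" global weight g_i
    W = A * g[:, None]                        # local weight t (f_ij) times g_i
    W = W / np.linalg.norm(W, axis=0)         # cosine normalization: unit columns
    return W, g

def lsa_space(W, k):
    """Truncate the SVD of W to k dimensions: W ~ U_k diag(s_k) V_k^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T       # U_k (m x k), s_k, V_k (n x k)

def rank_by_cosine(q, g, U_k, s_k, V_k):
    """Project a raw query vector via Q = q^T U_k S_k^{-1} (Equation 1) and
    rank the documents (rows of V_k) by cosine similarity to Q."""
    Q = ((np.asarray(q, dtype=float) * g) @ U_k) / s_k
    sims = (V_k @ Q) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(Q))
    return np.argsort(-sims), sims            # indices, most similar first

In a plagiarism-detection setting, each source-code document can in turn serve as the query q, and the documents at the top of the returned ranking (or those whose similarity exceeds a chosen threshold) are its candidate matches.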
Symbols in programming languages can carry important meaning (consider, for example, the operators in the expression x > 4 and y < 4), and by removing those symbols during preprocessing, important meaning may also be removed from the documents. This raises the question of how to treat symbols in programming languages prior to applying LSA. One possible way of answering this question would be to add the symbols to the term dictionary used to create the term-by-document matrix. Another way of treating symbols would be to replace them with words (e.g. replace the symbol - with the word minus), or even to categorize symbols and replace each one with its category name (e.g. replace occurrences of mathematical symbols with the word arithmetic). Experiments on how to treat symbols would be of greater importance when applying LSA to languages such as Perl, which are heavily based on symbols.

References

[1] A. Aiken. Moss: A system for detecting software plagiarism. Software: www.cs.berkeley.edu/~aiken/moss.html, accessed July 2008.
[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
[3] M. Berry. Large-scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13-49, Spring 1992.
[4] M. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools), Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2005.
[5] M. Berry, Z. Drmac, and E. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):335-362, 1999.
[6] M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. Technical Report UT-CS-94-270, University of Tennessee, Knoxville, TN, USA, 1994.
[7] A. Britt, P. Wiemer-Hastings, A. Larson, and C. Perfetti. Using intelligent feedback to improve sourcing and integration in students' essays. International Journal of Artificial Intelligence in Education, 14:359-374, 2004.
[8] C. Chen, N. Stoffel, M. Post, C. Basu, D. Bassu, and C. Behrens. Telcordia LSI engine: Implementation and scalability issues. In RIDE '01: Proceedings of the 11th International Workshop on Research Issues in Data Engineering, pages 51-58, Washington, DC, USA, 2001. IEEE Computer Society.
[9] G. Cosma and M. Joy. An approach to source-code plagiarism detection and investigation using latent semantic analysis. IEEE Transactions on Computers, 2009. Accepted for publication November 2009.
[10] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[11] S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2):229-236, 1991.
[12] P. Foltz. Using latent semantic indexing for information filtering. SIGOIS Bulletin, 11(2-3):40-47, 1990.
[13] R. Gravina, M. Yanagisawa, and K. Akahori. Development and evaluation of a visual assessment assistant using latent semantic analysis and cluster analysis. In Proceedings of the International Conference on Computers in Education, pages 963-968, 2004.
[14] T. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3):203-215, 2003.
[15] E. Jessup and J. Martin. Taking a new look at the latent semantic analysis approach to information retrieval. In Proceedings of the SIAM Workshop on Computational Information Retrieval, pages 121-144, Raleigh, NC, 2001.
[16] K. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21, 1972.
[17] M. Joy and M. Luck.
Plagiarism in programming assignments. IEEE Transactions on Education, 42(1):129-133, 1999.
[18] T. Kakkonen, N. Myller, E. Sutinen, and J. Timonen. Automatic essay grading with probabilistic latent semantic analysis. In Proceedings of the 2nd Workshop on Building Educational Applications Using Natural Language Processing at the 43rd Annual Meeting of the Association for Computational Linguistics, pages 29-36, Ann Arbor, Michigan, USA, 2005.
[19] T. Kakkonen and E. Sutinen. Automatic assessment of the content of essays based on course materials. In Proceedings of the International Conference on Information Technology: Research and Education (ITRE 2004), pages 126-130, London, UK, 2004.
[20] S. Kawaguchi, P. Garg, M. Matsushita, and K. Inoue. MUDABlue: An automatic categorization system for open source repositories. In APSEC '04: Proceedings of the 11th Asia-Pacific Software Engineering Conference, pages 184-193, Washington, DC, USA, 2004. IEEE Computer Society.
[21] A. Kontostathis. Essential dimensions of latent semantic indexing (LSI). In HICSS '07: Proceedings of the 40th Annual Hawaii International Conference on System Sciences, page 73, Washington, DC, USA, 2007. IEEE Computer Society.
[22] A. Kuhn, S. Ducasse, and T. Girba. Enriching reverse engineering with semantic clustering. In WCRE '05: Proceedings of the 12th Working Conference on Reverse Engineering, pages 133-142, Washington, DC, USA, 2005. IEEE Computer Society.
[23] T. Landauer, D. Laham, B. Rehder, and M. Schreiner. How well can passage meaning be derived without using word order: A comparison of latent semantic analysis and humans. In COGSCI-97, pages 412-417, Stanford, CA, 1997. Lawrence Erlbaum.
[24] M. Lin, R. Amor, and E. Tempero. A Java reuse repository for Eclipse using LSI. In Proceedings of the 2006 Australian Software Engineering Conference (ASWEC'06). IEEE, 2006.
[25] M. Lungu, A. Kuhn, T. Girba, and M. Lanza. Interactive exploration of semantic clusters. In 3rd International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT 2005), pages 95-100, 2005.
[26] J. Maletic and A. Marcus. Supporting program comprehension using semantic and structural information. In International Conference on Software Engineering, pages 103-112, 2001.
[27] J. Maletic and N. Valluri. Automatic software clustering via latent semantic analysis. In ASE '99: Proceedings of the 14th IEEE International Conference on Automated Software Engineering, page 251, Washington, DC, USA, 1999. IEEE Computer Society.
[28] A. Marcus, A. Sergeyev, V. Rajlich, and J. Maletic. An information retrieval approach to concept location in source code. In Proceedings of the 11th IEEE Working Conference on Reverse Engineering (WCRE 2004), pages 214-223, Delft, The Netherlands, November 2004.
[29] C. McMillan, M. Grechanik, and D. Poshyvanyk. Detecting similar software applications. In Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pages 364-374, Piscataway, NJ, USA, 2012. IEEE Press.
[30] L. Moussiades and A. Vakali. PDetect: A clustering approach for detecting plagiarism in source code datasets. The Computer Journal, 48(6):651-661, 2005.
[31] M. Mozgovoy. Desktop tools for offline plagiarism detection in computer programs. Informatics in Education, 5(1):97-112, 2006.
[32] M. Mozgovoy. Enhancing Computer-Aided Plagiarism Detection. Dissertation, Department of Computer Science, University of Joensuu, P.O. Box 111, FIN-80101 Joensuu, Finland, November 2007.
[33] P. Nakov. Latent semantic analysis of textual data. In CompSysTech '00: Proceedings of the Conference on Computer Systems and Technologies, pages 5031-5035, New York, NY, USA, 2000. ACM.
[34] P. Nakov, A. Popova, and P. Mateev. Weight functions impact on LSA performance. In Proceedings of the EuroConference Recent Advances in Natural Language Processing (RANLP'01), pages 187-193. John Benjamins, Amsterdam/Philadelphia, 2001.
[35] C. Perfetti. The limits of co-occurrence: tools and theories in language research. Discourse Processes, 25:363-377, 1998.
[36] B. Pincombe. Comparison of human and LSA judgements of pairwise document similarities for a news corpus. Research Report No. AR-013-177, Defence Science and Technology Organisation, Australia, 2004.
[37] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science, 8(11):1016-1038, 2002.
[38] B. Rehder, M. Schreiner, M. Wolfe, D. Laham, W. Kintsch, and T. Landauer. Using latent semantic analysis to assess knowledge: Some technical considerations. Discourse Processes, 25:337-354, 1998.
[39] S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76-85, New York, NY, USA, 2003. ACM.
[40] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-29. ACM Press, 1996.
[41] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Technical report, Cornell University, Ithaca, NY, USA, 1995.
[42] P. Wiemer-Hastings. How latent is latent semantic analysis? In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99, pages 932-941. Morgan Kaufmann, July 31-August 6, 1999.
[43] F. Wild, C. Stahl, G. Stermsek, and G. Neumann. Parameters driving effectiveness of automated essay scoring with LSA. In M. Danson, editor, Proceedings of the 9th International Computer Assisted Assessment Conference (CAA), pages 485-494, Loughborough, UK, July 2005. Professional Development.
[44] L. Yi, L. Haiming, L. Zengxiang, and W. Pu. A simplified latent semantic indexing approach for multi-linguistic information retrieval. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation (PACLIC 17), pages 69-79, Sentosa, Singapore, 2003. COLIPS Publications.
[45] D. Zeimpekis and E. Gallopoulos. Design of a MATLAB toolbox for term-document matrix generation. Technical Report HPCLAB-SCG, Computer Engineering and Informatics Department, University of Patras, Greece, February 2005.
LSA   Latent Semantic Analysis
LSI   Latent Semantic Indexing
IDF   Inverse Document Frequency
SVD   Singular Value Decomposition
VSM   Vector Space Model

Local weighting schemes
  b  Binary
  l  Logarithmic
  n  Augmented normalized term frequency
  t  Term frequency
  a  Alternate log

Global weighting schemes
  x  None
  e  Entropy
  f  Inverse document frequency (IDF)
  g  GfIdf
  n  Normal
  p  Probabilistic inverse

Document length normalization schemes
  x  None
  c  Cosine

Preprocessing schemes
  KC      Keep Comments, Keep Keywords and Keep Skeleton code
  KCRK    Keep Comments, Remove Keywords
  KCRKRS  Keep Comments, Remove Keywords and Remove Skeleton code
  RC      Remove Comments, Keep Keywords and Keep Skeleton code
  RCRK    Remove Comments and Remove Keywords
  RCRKRS  Remove Comments, Remove Keywords and Remove Skeleton code

Evaluation measures
  AP    Average Precision
  MAP   Mean Average Precision
  LPM   Lowest Positive Match
  HFM   Highest False Match
  Sep.  Separation
  MMAP  Maximum Mean Average Precision

Weighting schemes (local weight, global weight, document length normalization)
  txx  Term frequency, none, none
  txc  Term frequency, none, cosine
  tfx  Term frequency, IDF, none
  tfc  Term frequency, IDF, cosine
  tgx  Term frequency, GfIdf, none
  tgc  Term frequency, GfIdf, cosine
  tnx  Term frequency, normal, none
  tnc  Term frequency, normal, cosine
  tex  Term frequency, entropy, none
  tec  Term frequency, entropy, cosine
  lex  log, entropy, none
  lec  log, entropy, cosine

Table 5: List of Acronyms

      KC    KCRK  KCRKRS  RC    RCRK  RCRKRS  Average
txx   0.86  0.86  0.86    0.78  0.75  0.54    0.77
  k   20    60    60      15    40    106     50.17
txc   0.86  0.86  0.85    0.79  0.80  0.55    0.79
  k   20    45    45      40    60    2       35.33
tfx   0.94  0.92  0.92    0.91  0.87  0.61    0.86
  k   40    40    40      35    45    70      45.00
tfc   0.93  0.94  0.93    0.88  0.88  0.60    0.86
  k   70    80    80      60    60    60      68.33
tgx   0.73  0.70  0.69    0.74  0.69  0.54    0.68
  k   25    20    15      20    15    2       16.17
tgc   0.82  0.74  0.64    0.75  0.69  0.57    0.70
  k   30    50    10      20    40    10      26.67
tnx   0.92  0.92  0.92    1.00  1.00  0.63    0.90
  k   40    40    40      35    25    70      41.67
tnc   0.95  0.96  0.95    1.00  1.00  0.61    0.91
  k   25    25    25      15    15    80      30.83
tex   0.87  0.87  0.88    0.85  0.82  0.60    0.82
  k   30    30    30      30    35    60      35.83
tec   0.94  0.94  0.94    0.87  0.87  0.61    0.86
  k   80    80    70      70    60    80      73.33
lex   0.94  0.93  0.93    0.97  0.97  0.62    0.90
  k   20    30    30      20    25    70      32.50
lec   0.96  0.94  0.95    0.97  1.00  0.61    0.91
  k   40    20    20      10    90    45      37.50

Table 6: MMAP values for dataset A

      KC    KCRK  KCRKRS  RC    RCRK  RCRKRS  Average
txx   0.94  0.91  0.86    0.90  0.88  0.85    0.89
  k   60    70    80      10    45    40      50.83
txc   0.95  0.88  0.86    0.90  0.87  0.60    0.84
  k   15    20    15      10    5     25      15.00
tfx   0.78  0.78  0.78    0.74  0.74  0.73    0.76
  k   45    70    70      40    40    40      50.83
tfc   0.84  0.83  0.83    0.79  0.78  0.77    0.81
  k   15    15    15      15    15    35      18.33
tgx   0.92  0.82  0.77    0.91  0.88  0.81    0.85
  k   35    60    70      25    15    40      40.83
tgc   0.92  0.78  0.74    0.95  0.89  0.80    0.85
  k   15    20    10      15    20    20      16.67
tnx   0.84  0.84  0.83    0.90  0.90  0.90    0.87
  k   70    70    60      60    60    60      63.33
tnc   0.85  0.85  0.85    0.91  0.91  0.91    0.88
  k   10    10    10      15    15    15      12.50
tex   0.80  0.80  0.80    0.74  0.74  0.74    0.77
  k   45    45    45      90    90    90      67.50
tec   0.83  0.81  0.80    0.79  0.79  0.77    0.80
  k   15    15    15      15    15    15      15.00
lex   0.86  0.85  0.85    0.86  0.86  0.86    0.86
  k   60    60    60      40    40    40      50.00
lec   0.88  0.88  0.87    0.90  0.89  0.87    0.88
  k   15    15    15      10    10    10      12.50

Table 7: MMAP values for dataset B
      KC    KCRK  KCRKRS  RC    RCRK  RCRKRS  Average
txx   0.78  0.74  0.98    0.81  0.77  0.77    0.81
  k   15    15    35      90    80    90      54.17
txc   0.81  0.76  0.96    0.82  0.78  0.78    0.82
  k   40    50    45      80    90    80      64.17
tfx   0.65  0.65  0.91    0.71  0.71  0.70    0.72
  k   80    70    70      70    70    70      71.67
tfc   0.73  0.71  0.94    0.75  0.70  0.69    0.75
  k   80    90    25      60    50    50      59.17
tgx   0.72  0.71  0.93    0.73  0.69  0.64    0.74
  k   90    80    60      50    70    70      70.00
tgc   0.75  0.74  0.92    0.74  0.69  0.67    0.75
  k   80    70    60      80    80    100     78.33
tnx   0.83  0.79  0.95    0.82  0.80  0.79    0.83
  k   25    25    25      20    35    35      27.50
tnc   0.84  0.82  0.97    0.88  0.85  0.85    0.87
  k   20    15    15      20    15    25      18.33
tex   0.70  0.70  0.90    0.75  0.73  0.71    0.75
  k   60    90    50      70    80    80      71.67
tec   0.73  0.72  0.96    0.71  0.70  0.69    0.75
  k   80    80    10      60    50    80      60.00
lex   0.74  0.74  0.96    0.74  0.74  0.73    0.78
  k   20    20    25      35    60    60      36.67
lec   0.78  0.77  0.93    0.78  0.78  0.75    0.80
  k   35    40    25      20    25    25      28.33

Table 8: MMAP values for dataset C

      KC    KCRK  KCRKRS  RC    RCRK  RCRKRS  Average
txx   0.80  0.77  0.75    0.83  0.80  0.79    0.79
  k   25    60    45      30    60    50      45.00
txc   0.82  0.77  0.76    0.84  0.80  0.79    0.80
  k   20    20    20      30    10    10      18.33
tfx   0.70  0.69  0.69    0.73  0.73  0.73    0.71
  k   45    40    40      25    45    45      40.00
tfc   0.74  0.74  0.74    0.78  0.77  0.77    0.76
  k   15    15    15      25    25    25      20.00
tgx   0.79  0.73  0.73    0.81  0.74  0.73    0.76
  k   30    25    25      35    70    25      35.00
tgc   0.73  0.70  0.70    0.79  0.74  0.73    0.73
  k   30    30    30      10    15    15      21.67
tnx   0.71  0.71  0.70    0.81  0.83  0.82    0.76
  k   15    20    15      10    10    10      13.33
tnc   0.82  0.79  0.79    0.92  0.86  0.86    0.84
  k   10    15    15      5     15    15      12.50
tex   0.70  0.70  0.70    0.74  0.73  0.73    0.72
  k   45    45    50      50    40    40      45.00
tec   0.67  0.67  0.67    0.72  0.72  0.72    0.70
  k   10    5     15      25    25    25      17.50
lex   0.64  0.65  0.65    0.70  0.72  0.72    0.68
  k   15    15    15      25    90    90      41.67
lec   0.76  0.76  0.76    0.78  0.78  0.78    0.77
  k   15    15    15      20    20    20      17.50

Table 9: MMAP values for dataset D
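The MMAP figures in Tables 6 to 9 pair, for each preprocessing and weighting combination, the highest Mean Average Precision observed over the range of dimensionality settings with the value of k at which it occurred. The following is a hedged Python sketch of that computation; the grid of k values and the run_lsa callback (which must return a similarity ranking of document identifiers for a given query and dimensionality) are assumptions for illustration, not details taken from the paper.

import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of the precision at each relevant document."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant)

def mmap(queries, run_lsa, ks=range(5, 105, 5)):
    """Maximum Mean Average Precision over a grid of dimensionalities.
    queries is a list of (query, relevant_ids) pairs; run_lsa(query, k)
    is a hypothetical callback returning a ranked list of document ids."""
    score, best_k = max(
        (np.mean([average_precision(run_lsa(q, k), rel) for q, rel in queries]), k)
        for k in ks
    )
    return score, best_k   # the MMAP value and the k at which it occurs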