Informatica 40 (2016) 377–385 377 Mathematical Equation Structural Syntactical Similarity Patterns: A Tree Overlapping Algorithm and Its Evaluation Evgeny Pyshkin University of Aizu, Japan Tsuruga, Ikki-machi, Aizu-Wakamatsu Fukushima, 965-8580 Japan E-mail: pyshe@u-aizu.ac.jp Mikhail Ponomarev Peter the Great St. Petersburg Polytechnic University 29 Polytechnicheskaya st., 195251 St. Petersburg, Russia E-mail: ponmike92@gmail.com Keywords: syntactical similarity, mathematical equations, MathML, tree overlapping Received: November 14, 2016 In this paper we examine mathematical equations structural syntactical similarity patterns. The major focus of this contribution is an NLP tree overlapping algorithm modification adopted to the case of syntactical similarity of mathematical equations presented in MathML. We describe the software implementation and the tests arranged for the cases of both structural and subexpression based similarity. The paper also contains a discussion of algorithm evaluation problems conditioned by the lack of relevant syntactical similarity centered equation corpora. Povzetek: Prispevek se ukvarja s strukturnimi sinkaktičnimi vzorci podobnosti matematičnih izrazov. 1 Introduction Nowadays there are only few models of adopting natu- ral language processing (NLP) algorithms related to syn- tax similarity to mathematical notations. Unique structural syntax of mathematical equations with a big variety of se- mantically equivalent constructions provide a non-trivial case for information retrieval [11]. Many reported imple- mentations are focused on finding exact matching of math- ematical constructions rather than on recognizing their sim- ilarity [9, 8]. Indeed, for a case of mathematical equations, syntactical similarity is defined rather fuzzy by using sev- eral structural syntactical similarity patterns. However, a model that would deal with syntactical similarity seems to be very useful while developing searching and classifica- tion tools used in education, so as to allow math learners and tutors selecting suitable tasks nailing down a topic pre- sented during a classroom session. An obvious possible use case is accessing a set of relevant mathematical equa- tions to be used for training while a learner is doing the preparation exercises for an examination. Another interest- ing possibility is searching an equation by its syntactical structure, the latter being often easier to recall compare to exact mathematical formulas. This paper is based on Mikhail Ponomarev and Evgeny Pyshkin, Adopting Tree Overlapping Algorithm for MathML Equation Structural Similarity, published in the Proceedings of the 2nd International Confer- ence on Applications in Information Technology (ICAIT-2016) [7] For the reason that most structural notations used for rep- resenting mathematical expressions are in fact based on di- rected graphs, the syntactical similarity can be defined by using tree structural similarity. Specifically, this work ad- dresses the case when expressions are uniformly presented in MathML, the latter being one of widely used structural XML based notations used in mathematics. In turn, if bet- ter structural math equation forms are used, one can ex- pect more efficient and accurate retrieval, in contrast to a frequent use of image based equation representation used on many web sites. At the same time we accept a possi- ble criticism pointing an issue that not an every mathemati- cal expression retrieval difficulty could be addressed under MathML representability constraints. Within a context of mathematical equations similarity, it is important to mention both meaning related similarity and presentation related similarity (in MathML there are two markup schemes corresponding to these two perspectives: content markup and presentation markup). Semantic (e.g. equation meaning) similarity is out of scope of this study; this work is focused only on presentation similarity which is related to the equation syntax (the form) rather than to its meaning (the contents). MathML – Mathematical Markup Language 378 Informatica 40 (2016) 377–385 E. Pyshkin et al. 2 Structural similarity of mathematical equations Since the structure of mathematical equations can be rep- resented in the form of a tree, mathematical equation simi- larity may be defined using tree similarity. In [1] similarity of two trees is defined on the base of recursive examina- tion of their subtrees. In [5] the following mathematical expressions similarity patterns are defined: Mathematical equivalence: Equations E1 and E2 are mathematically equivalent if they are semantically (but not obligatorily syntactically) the same, for example d(sin(x))dx and (sin(x))′, sin2(x) + cos2(x) and 1 are correspond- ingly equivalent. Identity: E1 and E2 are identical if they are exactly the same. Syntactical identity: E1 and E2 are syntactically iden- tical if they are identical after normalization (dealing with variable names and numeric values). For example sin(a) and sin(b), 1sin(x) and 5 sin(x) are correspondingly syntacti- cally identical. N–similarity: Normalized equations E1 and E2 are n- similar if there is a similarity (in a certain sense) which is sim(E1, E2) > n, n being a parametric value determining a threshold. There are two specific N–similarity cases: 1. Subexpression n–similarity: There is a subexpres- sion n–similarity for E1 and E2, if E1 and E2 are n– similar and the corresponding trees both contain the common subtree which in turn contains all the termi- nal nodes of both trees. Figure 1 shows an example for the case of expressions sin(x)2 and sin(x)2 . 2. Structural n–similarity: E1 and E2 are structurally n–similar if E1 and E2 are n–similar (in common sense) and there is a common part in both trees rooted at root nodes of compared trees with the produc- tion rules being the same for all the nodes in this part. Figure 2 illustrates this case for the equations x+ √ sin(a) and x+ √ 2b. T (sin(x) 2 ) mrow msup mn 2 mrow mo ) mi x mo ( mo sin T ( sin(x)2 ) mrow mfrac mn 2 mrow mo ) mi x mo ( mo sin Figure 1: Equations sin(x)2 and sin(x)2 are structurally n– similar for any value of n > 1826 Note that in order to represent n-values, in Figures 1 and 2 we use the nodes ratio which is a ratio of the number of T (x+ √ sin(a)) mrow msqrt mo ) mi a mo ( mo sin mo + mi x T (x+ √ 2b) mrow msqrt mi b mn 2 mo + mi x Figure 2: x + √ sin(a) and x + √ 2b are structurally n– similar for any value of n > 1224 common nodes in both trees to the number of all nodes in both trees. 3 Equation similarity evaluation using tree similarity In order to introduce different approaches used for evaluat- ing mathematical equation similarity based on tree similar- ity, in this section some sample trees are used: T0, T1 and T2 (shown in Figure 3). In the following text we demon- strate how the similarity between T0 and T1 as well as be- tween T0 and T2 respectively can be calculated. T0 a01 c01b 0 1 e01d 0 1 T1 a11 c11b 1 1 e11 g11 i11 d11 T2 a21 b21 e21 g22 j21 d21 g21 i21 Figure 3: Sample trees T0, T1 and T2 For each node in Figure 3 its upper index corresponds to the tree which this node belongs to. Thus, all the nodes of T0 have the upper indexes which are equal to 0, the nodes of T1 have the upper indexes which are equal to 1, and so on. The lower indexes are used so as to count the equal nodes within a tree. For instance, T2 contains two equal nodes g, so the first appearance of g is marked by the lower index 1, while the second appearance of g is marked by the lower index 2. These indexes are suppressed in cases, when they are not necessary for algorithm description. 3.1 Tree edit distance Tree edit distance method uses the definition of similarity (distance) between two trees as a weighted number of edit operations (insert, delete, and modify) required to trans- form one tree to another (as described, for example, in [10]). Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 379 Assume S is a sequence of edit operations {s1, s2, . . . , sk} for transforming one tree to another. Assume γ is a non-negative distance measure describing node transformation from a to b (defined here as a → b) such as γ(a→ b) 1 0 and γ(a→ b) = γ(b→ a). For an operation sequence S we get the following sum: γ(S) = ∑|S| i=1 γ(si). Then the distance between two equations is defined as follows: δ(T1, T2) = min{γ(S)} (1) Figures 4 and 5 illustrate how the transformation dis- tances from T0 to T1 and from T0 to T2 correspondingly are calculated: a sequence of two insertions is required in order to transform T0 to T1; four operations (one deletion and three insertions) are required in order to transform T0 to T2. T0 a cb ed ⇒ T ′ 0 a cb e g d ⇒ T1 a cb e g i d Figure 4: Tree edit transformation: T0 to T1 T0 a cb ed ⇒ T ′ 0 a cb ed ⇒ T ′′ 0 a b ed g ⇒ T ′′′ 0 a b e g d g ⇒ T2 a b e g j d g Figure 5: Tree edit transformation: T0 to T2 The node weights (and the operation costs as well) might not be equal, hence there might be different similarity mea- sures based on general edit distance schema. Also, in dif- ferent algorithms a sequence Si is searched differently: in addition to the earlier mentioned work [10] there are other implementations described in [2] and [6]. Algorithmic complexity of the above mentioned approaches is summa- rized in Table 1. 3.2 Subpath set Subpath set similarity between two trees is defined as the number of subpaths shared by the trees. Given a tree, its subpaths are defined as a set of all the paths from the root node to the leaves including the partial paths. Subpath Table 1: Tree edit distance based algorithms Algorithm Time Memory Particularities TED [10] O(n4) O(n2) Good for bal- anced trees ODTED [2] O(n3) O(n2) Better in un- balanced trees RTED [6] O(n3) O(n2) Tree balance insensitive based similarity definition for a case of natural language processing can be found in [4]. A possible application of this concept to a case of MathML equations can be illus- trated by the algorithm described in [9]. Figure 6 shows a set of common subpaths in the sam- ple trees T0 and T1. Hence, in this example subpath based tree similarity Ss(T0, T1) = 11. Figure 3 illustrates the same issue for a case of the sample trees T0 and T2. Sim- ilarly, subpath based tree similarity can be calculated as Ss(T0, T2) = 9. A more practical case is a set TS of trees to be processed in order to find those which are similar to some given tree T0. Concerning efficiency issues, a significant improve- ment may be achieved if, instead of computing the similar- ity for many pairs, an indexing table I[p] for the whole TS corpus is used [4]. Table 2: Indexing table for T1 and T2 p I[p] p I[p] a {1, 2} e→ g {1, 2} b {1, 2} g → i {1, 2} c {1} a→ g {2} d {1, 2} g → j {2} e {1, 2} a→ b→ d {1, 2} g {1, 2} a→ b→ e {1, 2} i {1, 2} b→ e→ g {1, 2} j {2} e→ g → i {1} a→ b {1, 2} a→ g → i {2} a→ c {1} e→ g → j {2} b→ d {1, 2} a→ b→ e→ g {1, 2} b→ e {1, 2} b→ e→ g → i {1} b→ e→ g → j {2} 380 Informatica 40 (2016) 377–385 E. Pyshkin et al. T0 a01 c01b 0 1 e01d 0 1 T1 a11 c11b 1 1 e11 g11 i11 d11 (a), (b), (c), (d), (e) (a, b), (a, c), (b, d), (b, e) (a, b, d), (a, b, e) (g), (i) (e, g), (g, i) (b, e, g), (e, g, i) (a, b, e, g), (b, e, g, i) (a, b, e, g, i) 0T 1T Figure 6: Subpaths in T0 and T1 T0 a01 c01b 0 1 e01d 0 1 T2 a21 b21 e21 g22 j21 d21 g21 i21 (a), (b), (d), (e) (a, b), (b, d), (b, e) (a, b, d), (a, b, e) (c) (a, c) (g), (i), (j) (a, g), (g, i), (e, g), (g, j) (a, g, i), (b, e, g), (e, g, j) (a, b, e, g), (b, e, g, j) (a, b, e, g, j) 0T 2T Figure 7: Subpaths in T0 and T2 Table 2 provides an example of creating the indexing ta- ble for a case of the set of trees consisting of only two trees T1 and T2 (shown in Figure 3). In Table 2 p-columns list all the subpaths from both T1 and T2 (i.e. from all the corpus trees); for every subpath the corresponding I[p]-cell shows a list of trees where such a subpath exists. Algorithmic complexity of an indexing table based algorithm is O(L ∗ D2), where L – maximal number of tree leaves, D – maximal tree depth among all the corpus trees [4]. 3.3 Tree overlapping algorithm and its modification for math equation structural similarity A basic tree overlapping algorithm is described in [4] for a case of sentence similarity which is defined as follows. When putting an arbitrary node n1 of a tree T1 on a node n2 of a tree T2, there might be the same production rule overlapping in T1 and T2. Similarity is defined as a number of such overlapping production rules. In contrast to the base algorithm from [4] where tree ter- minals are naturally excluded, for a case of mathematical equations we also include terminal nodes as if they had the same production rules (Relaxation 1). Also we relax the strictness of the base algorithm and include the pairs of cor- responding nodes which are in the same order among their siblings but do not obligatorily have the same production rules for their child nodes (Relaxation 2). Below there is a formal definition of our modification. Assume L(n1, n2) represents a set of overlapping node pairs when putting n1 on n2. Assume ch(n, i) is i-th child of node n. The set L(n1, n2) is being generated by apply- ing the following rules: 1. (n1, n2) ∈ L(n1, n2) 2. If (m1,m2) ∈ L(n1, n2), then (ch(m1, i), ch(m2, i)) ∈ L(n1, n2) 3. L(n1, n2) includes all the pairs generated recursively by the rule No. 2. A number NTO(n1, n2) of production rules (according to the Relaxation 1) is defined as follows: NTO(n1, n2) =    (m1,m2) m1 ∈ nodes(T1) ∧m2 ∈ nodes(T2) ∧ (m1,m2) ∈ L(n1, n2) ∧ PR(m1) = PR(m2)    (2) Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 381 In equation 2 nodes(T ) is a set of nodes (including terminals) in a tree T , while PR(n) is a production rule rooted at the node n. Figure 8 shows an example of overlapping tree modification algorithm for NTO(d1, d2) = {(d1, d2), (f1, f2), (g1, g2)}. Assume PWPR(n1, n2) is a set of nodes which is repre- sented as a path from (n1, n2) to the top last pair of nodes being in the same order among their siblings. Assume ni and mi are nodes of a tree Ti, ch(n, i) is i-th child of node n. According to the Relaxation 2, PWPR is defined as fol- lows: 1. (n1, n2) /∈ PWPR 2. If PR(parent(n1)) 6= PR(parent(n2)) ∧ ch(parent(n1), i) = ch(parent(n2), i) ∧ ch(parent(n1), i) = n1 ∧ h(parent(n2), i) = n2, (parent(n1), parent(n2)) ∈ PWPR 3. PWPR(n1, n2) includes only pairs generated by ap- plying rule No. 2. Then the second component for an integral similarity measure can be defined by using the above introduced PWPR as follows: PTO(n1, n2) =   (m1,m2) (p1, p2) ∈ NTO(n1, n2) (m1,m2) ∈ PWPR(p1, p2), if top(m1,m2) = (n1, n2)    (3) In equation 3 top(n1, n2) is the last pair in set PWPR(n1, n2): top(n1, n2) = plast(n1, n2), plast ∈ PWPR. Thus, for two nodes, the resulting combined similarity measure is defined as follows: CTO(n1, n2) = |NTO(n1, n2)|+ |PTO(n1, n2)| For the whole trees, we get: STO(T1, T2) = max n1∈nodes(T1),n2∈nodes(T2) CTO(n1, n2) (4) 3.4 Software implementation We developed a software prototype in order to arrange a series of experiments for the above described modification of the tree overlapping algorithm for a case of mathematical equations. Figure 9 gives a hint of how the application user interface is organized. For displaying mathematical equations defined in MathML the library net.sourceforge.jeuclid is used. 4 Experiments One of the problems we faced while attempting to evaluate the algorithm is that, unlike to the NLP domain, there is no substantial corpus of mathematical equation syntactical similarity classes. 4.1 Test Corpora For our rather preliminary analysis several experts experi- enced in teaching mathematics in high schools and lyceums were involved. With their help we selected a number of typical trigonometry problems from the set of tasks used in Russian Unified State Examination [3] The selected equa- tions are listed in Table 3. With the help of our experts, the expressions were clas- sified according their structural similarity. As a result, two types of equation classification were created: a classifica- tion based on equation structural similarity (see Table 4) and a classification based on subexpression similarity (see Table 5). 4.2 Tests Though corpora presented in Tables 4 and 5 aren’t repre- sentative enough, they make possible to proceed with some preliminary similarity precision estimation. Let us remind that precision is defined as follows: precision = TP TP + FP (5) In our tests we assume that in equation 5 TP is the num- ber of k first true positive equations belonging to the same class as the query expression, while FP is the number of t first false positive equations: k + t = n − 1, where n is the number of equations belonging to the respective class. The preliminary experiments described in this work may be considered a prove-of-concept example for investigat- ing further necessary improvements of the developed algo- rithm. In the future tests a standard cross-fold validation procedure will be required in order to get trustworthy pre- cision evaluation results. Figure 11 illustrates the process of structural similarity computation for two expressions from the tiny corpus de- scribed earlier. The first expression consists of 34 nodes while the second one has 33 nodes. 20 nodes are equal in both trees. So, STO = 20+2034+33 = 40 67 = 0.597. 4.3 Analysis Table 6 lists 5 expressions from the base defined in Ta- ble 3 which achieve the best scores for the query expres- sion √ 2 sin( 3π2 − x) sinx = cosx (belonging to the class 1 according to Table 4). Two best scores are for the equations which also belong to the same class 1, unlike to the equation− √ 2 sin(− 5π2 + 382 Informatica 40 (2016) 377–385 E. Pyshkin et al. |NTO(d1, d2)| = 3 T1 a1 c1 e11d 1 g1f1 b1 + T2 a2 c2 j2d2 g2f2 h2 k2 |PWPR(d1, d2)| = |PWPR(f1, f2)| = = |PWPR(g1, g2)| = 2 T1 a1 c1 e11d 1 g1f1 b1 = T2 a2 c2 j2d2 g2f2 h2 k2 CTO(d1, d2) = 5 T1 a1 c1 e11d 1 g1f1 b1 T2 a2 c2 j2d2 g2f2 h2 k2 Figure 8: Modified tree-overlapping algorithm: example Figure 9: Structural similarity component: GUI x) sinx = cosx (No. 4 in Table 4) which wasn’t recog- nized as a similar expression despite of its obvious sim- ilarity. To explain this phenomenon we have to go back to MathML equation structure. As you can see from Fig- ure 10 (left side), two compared equations (both belonging to the class 1 of the corpus) have rather similar structure (at least, from human point of view). However, their tree roots have different number of child nodes, hence their produc- tion rules are (formally) different too. It means that we have to enhance equation normalization factor (currently limited by only variable names and numerical values): in the above mentioned case the issue can be resolved by restructuring a tree based equation representation as Figure 10 (right side) shows: both trees in the right side are semantically equiv- alent to those which are in the left side. After such a nor- malization, syntactical similarity score increases from 0 (in the “left” case) to 0.44 (in the “right” case). Similar tests were arranged for expressions from other classes as well as for the case of subexpression similarity. Specifically, for a case of subexperssion similarity, Table 7 lists 5 best results for the query 2 sin4 x+3 cos 2x+1 = 0 (which belongs to the class 5 according the the test corpus from Table 5) against the equations from the base defined in Table 3. In Table 7, nodes ratio means a ratio of common nodes to all nodes in compared trees. Among 5 best results listed in Table 7, only the second equation 4 sin4 2x + 3 cos 4x − 1 = 0 doesn’t belong to the class 5 (according to the experts’ classification). Let us analyze a possible reason. The experts didn’t include this equation to the class 5 due to the difference between sin42x and sin4 x subexpressions. They considered this part of equation as more representative from the viewpoint of structural syntactical similarity. However, the subex- pressions cos 2x and cos 4x were recognized by the algo- rithm as subexpression based similar equations to the query since both contains the explicit multiplier before x. Similar to the case of structural similarity this issue could be ad- dresses by the equation representation normalization (i.e. introducing an explicit multiplier equal to 1). In sum, based on the results presented in Table 5, for the subexpression similarity sample test corpus the average precision P = 1 1+ 1 1+···+ 3 4+ 4 4+ 4 4 13 = 12.25 13 = 0.94. However, such an accuracy achieved for a small test cor- pus defined in Table 5 may be considered as rather promis- ing but very preliminary evaluation results. Further inves- tigations with using more representative equation corpora are necessary. 5 Conclusion In this study we adopted a tree overlapping algorithm (used originally in NLP) for mathematical equation syntactical similarity. We implemented the algorithm as a software prototype and arranged a set of experiments with sample test corpora. We discovered that the proposed modification fits well a selection of equations from college-level teach- ing practice both for the cases of structural and subexpres- sion based syntactical similarity patterns. For the reason that the current implementation has some drawbacks which became evident after the arranged experiments, the further steps towards equation normalization are required in order to achieve better equation classification accuracy. Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 383 Table 3: Base of test equations No. Equation No. Equation 1 √ 2 sin( 3π2 − x) sinx = cosx 16 2 cos(π2 + x) = √ 3 tanx 2 cos(π2 + 2x) = √ 2 sinx 17 sin 2x+ 2 sin2 x = 0 3 2 cos(x− 11π2 ) cosx = sinx 18 2 sin(7π2 − x) sinx = cosx 4 2 sin4 x+ 3 cos 2x+ 1 = 0 19 2 sin2 x− √ 3 sin 2x = 0 5 (2 cosx+ 1)( √ − sinx− 1) = 0 20 cos 2x− 3 cosx+ 2 = 0 6 (2 sinx− 1)(√− cosx+ 1) = 0 21 2 cos3 x− cos2 x+ 2 cosx− 1 = 0 7 4 sin4 2x+ 3 cos 4x− 1 = 0 22 cos 2x+ 3 sinx− 2 = 0 8 cos 2x = sin(x+ π2 ) 23 sin 2x+ √ 2 sinx = 2 cosx+ √ 2 9 2 √ 3 cos2( 3π2 + x)− sin 2x = 0 24 3 cos 2x− 5 sinx+ 1 = 0 10 cos2 x− 12 sin 2x+ cosx = sinx 25 cos 2x− 5 √ 2 cosx− 5 = 0 11 cos 2x = 1− cos(π2 − x) 26 − √ 2 sin(− 5π2 + x) sinx = cosx 12 √ cos2 x− sin2 x(tan 2x− 1) = 0 27 2 sin2 x−sin x 2 cos x− √ 3 = 0 13 tanx+ cos( 3π2 − 2x) = 0 28 2 sin 2 x−sin x 2 cos x+ √ 3 = 0 14 cosx+ cos(π2 + 2x) = 0 29 4 cos 4 x− 4 cos2 x+ 1 = 0 15 12 sin 2x+ sin 2x− sinx = cosx 30 4 sin2 x+ 8 sin( 3π2 + x) + 1 = 0 ⇒ T1( √ 2 sin( 3π2 − x) sinx = cosx) math mi x mi cos mo = mi x mi sin mrow . . . mi sin mrow msqrt mn 2 T2(− √ 2 sin(− 5π2 + x) sinx = cosx) math mi x mi cos mo = mi x mi sin mrow . . . mi sin mrow msqrt mn 2 mo − STO(T1, T2) = 0+0 34+38 = 0 72 = 0 T ′ 1( √ 2 sin(3π2 − x) sinx = cosx) math mrow mi x mi cos mo = mrow mi x mi sin mrow mrow . . . mi sin mrow msqrt mn 2 T ′ 2(− √ 2 sin(− 5π2 + x) sinx = cosx) math mrow mi x mi cos mo = mrow mi x mi sin mrow mrow . . . mi sin mrow msqrt mn 2 mo − STO(T ′ 1, T ′ 2) = 17+17 37+41 = 34 78 = 0.44 Figure 10: Tree structure normalization to avoid a false negative case References [1] R. Bod. Beyond grammar. An Experienced-Based Theory of Language. CSLI Lecture Notes, 88, 1998. [2] E. D. Demaine, S. Mozes, B. Rossman, and O. Weimann. An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algo- rithms (TALG), 6(1):2, 2009. [3] E. Denisova-Schmidt and E. Leontyeva. The unified state exam in russia: problems and perspectives. In- ternational Higher Education, (76):22–23, 2014. [4] I. Hiroshi, H. Keita, H. Taiichi, and T. Takenobu. Ef- ficient sentence retrieval based on syntactic structure. In Proceedings of the COLING/ACL on Main confer- ence poster sessions, pages 399–406. Association for Computational Linguistics, 2006. [5] S. Kamali and F. W. Tompa. Improving mathemat- 384 Informatica 40 (2016) 377–385 E. Pyshkin et al. T ( √ 2 sin(3π2 − x) sinx = cosx) math mi x mi cos mo = mi x mi sin mi sin mrow msqrt mn 2 mrow mo ) mi x mo − mfrac mn 2 mrow mi π mn 3 mo ( T (2 cos(x− 11π2 ) cosx = sinx) math mi x mi sin mo = mi x mi cos mi cos mrow mn 2 mrow mo ) mfrac mn 2 mrow mi π mn 3 mo − mi x mo ( Figure 11: Structural similarity computation: example Table 4: Structural Similarity Classification No. Expression Class 1 √ 2 sin(3π2 − x) sinx = cosx 1 2 2 cos(x− 11π2 ) cosx = sinx 3 2 sin(7π2 − x) sinx = cosx 4 − √ 2 sin(− 5π2 + x) sinx = cosx 5 cos 2x− 3 cosx+ 2 = 0 2 6 cos 2x+ 3 sinx− 2 = 0 7 3 cos 2x− 5 sinx+ 1 = 0 8 cos 2x− 5 √ 2 cosx− 5 = 0 9 cos(π2 + 2x) = √ 2 sinx 310 cos 2x = sin(x+ π2 ) 11 2 cos(π2 + x) = √ 3 tanx 12 2 sin4 x+ 3 cos 2x+ 1 = 0 413 4 sin4 2x+ 3 cos 4x− 1 = 0 14 4 cos4 x− 4 cos2 x+ 1 = 0 15 (2 cosx+ 1)( √ − sinx− 1) = 0 516 (2 sinx− 1)( √− cosx+ 1) = 0 17 √ cos2 x− sin2 x(tan 2x− 1) = 0 18 cos2 x− 12 sin 2x+ cosx = sinx 619 12 sin 2x+ sin2x− sinx = cosx 20 tanx+ cos( 3π2 − 2x) = 0 721 cosx+ cos(π2 + 2x) = 0 22 2 sin 2 x−sin x 2 cos x− √ 3 = 0 8 23 2 sin 2 x−sin x 2 cos x+ √ 3 = 0 ics retrieval. Towards a Digital Mathematics Library. Grand Bend, Ontario, Canada, July 8-9th, 2009, pages 37–48, 2009. [6] M. Pawlik and N. Augsten. Rted: a robust algorithm for the tree edit distance. Proceedings of the VLDB Endowment, 5(4):334–345, 2011. Table 5: Subexpression Based Similarity No. Expression Class 1 cos(π2 + 2x) = √ 2 sinx 1 2 cosx+ cos(π2 + 2x) = 0 3 (2 cosx+ 1)( √ − sinx − 1) = 0 2 4 (2 sinx− 1)( √− cosx + 1) = 0 5 √ 2 sin( 3π2 − x) sinx = cosx 3 6 2 sin( 7π2 − x) sinx = cosx 7 2 sin 2 x−sin x 2 cos x− √ 3 = 0 4 8 2 sin 2 x−sin x 2 cos x+ √ 3 = 0 9 2 sin4 x + 3 cos 2x+ 1 = 0 5 10 4 cos4 x − 4 cos2 x+ 1 = 0 11 cos2 x − 12 sin 2x+ cosx = sinx 12 2 sin2 x −sin x 2 cos x− √ 3 = 0 13 2 sin2 x −sin x 2 cos x+ √ 3 = 0 [7] M. Ponomarev and E. Pyshkin. Adopting tree over- lapping algorithm for mathml equation structural sim- ilarity evaluation. In Proceedings of the 2nd Inter- national Conference on Applications in Information Technology (ICAIT-2016), pages 17–20. The Univer- sity of Aizu, The University of Aizu Press, Oct 2016. [8] K. Sain, A. Dasgupta, and U. Garain. Emers: a tree matching–based performance evaluation of math- ematical expression recognition systems. Interna- tional Journal on Document Analysis and Recogni- tion (IJDAR), 14(1):75–85, 2011. [9] K. Yokoi and A. Aizawa. An approach to similarity search for mathematical expressions using mathml. Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 385 Table 6: Query: √ 2 sin(3π2 − x) sinx = cosx Compared expression Nodes ratio Similarity 2 sin(7π2 − x) sinx = cosx 60/67 0.896 2 cos(x− 11π2 ) cosx = sinx 40/67 0.597 2 cos(π2 + x) = √ 3 tanx 24/63 0.381 3 cos 2x− 5 sinx+ 1 = 0 12/60 0.200 tanx+ cos( 3π2 − 2x) = 0 10/67 0.149 Table 7: Query: 2 sin4 x+ 3 cos 2x+ 1 = 0 Compared expression Nodes ratio Similarity 4 cos4 x− 4 cos2 x+ 1 = 0 10/59 0.169 4 sin4 2x+ 3 cos 4x− 1 = 0 10/60 0.167 cos2 x − 12 sin 2x + cosx = sinx 10/63 0.159 2 sin2 x−sin x 2 cos x− √ 3 = 0 10/64 0.156 2 sin2 x−sin x 2 cos x+ √ 3 = 0 10/64 0.156 Table 8: Evaluating classification precision for a case of subexpression similarity No. Expression TPTP+FP 1 cos(π2 + 2x) = √ 2 sinx 1/1 2 cosx+ cos(π2 + 2x) = 0 1/1 3 (2 cosx+ 1)( √ − sinx− 1) = 0 1/1 4 (2 sinx− 1)(√− cosx+ 1) = 0 1/1 5 √ 2 sin( 3π2 − x) sinx = cosx 1/1 6 sin( 7π2 − x) sinx = cosx 1/1 7 2 sin 2 x−sin x 2 cos x− √ 3 = 0 1/1 8 2 sin 2 x−sin x 2 cos x+ √ 3 = 0 1/1 9 2 sin4 x+ 3 cos 2x+ 1 = 0 3/4 10 4 cos4 x− 4 cos2 x+ 1 = 0 3/4 11 cos2 x− 12 sin 2x+ cosx = sinx 3/4 12 2 sin 2 x−sin x 2 cos x− √ 3 = 0 4/4 13 2 sin 2 x−sin x 2 cos x+ √ 3 = 0 4/4 Towards a Digital Mathematics Library. Grand Bend, Ontario, Canada, July 8-9th, 2009, pages 27–35, 2009. [10] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related prob- lems. SIAM journal on computing, 18(6):1245–1262, 1989. [11] Q. Zhang and A. Youssef. An Approach to Math- Similarity Search, pages 404–418. Springer Interna- tional Publishing, Cham, 2014.