Informatica 40 (2016) 377–385 377
Mathematical Equation Structural Syntactical Similarity Patterns: A Tree
Overlapping Algorithm and Its Evaluation
Evgeny Pyshkin
University of Aizu, Japan
Tsuruga, Ikki-machi, Aizu-Wakamatsu
Fukushima, 965-8580 Japan
E-mail: pyshe@u-aizu.ac.jp
Mikhail Ponomarev
Peter the Great St. Petersburg Polytechnic University
29 Polytechnicheskaya st., 195251 St. Petersburg, Russia
E-mail: ponmike92@gmail.com
Keywords: syntactical similarity, mathematical equations, MathML, tree overlapping
Received: November 14, 2016
In this paper we examine mathematical equations structural syntactical similarity patterns. The major focus
of this contribution is an NLP tree overlapping algorithm modification adopted to the case of syntactical
similarity of mathematical equations presented in MathML. We describe the software implementation and
the tests arranged for the cases of both structural and subexpression based similarity. The paper also
contains a discussion of algorithm evaluation problems conditioned by the lack of relevant syntactical
similarity centered equation corpora.
Povzetek: Prispevek se ukvarja s strukturnimi sinkaktičnimi vzorci podobnosti matematičnih izrazov.
1 Introduction
Nowadays there are only few models of adopting natu-
ral language processing (NLP) algorithms related to syn-
tax similarity to mathematical notations. Unique structural
syntax of mathematical equations with a big variety of se-
mantically equivalent constructions provide a non-trivial
case for information retrieval [11]. Many reported imple-
mentations are focused on finding exact matching of math-
ematical constructions rather than on recognizing their sim-
ilarity [9, 8]. Indeed, for a case of mathematical equations,
syntactical similarity is defined rather fuzzy by using sev-
eral structural syntactical similarity patterns. However, a
model that would deal with syntactical similarity seems to
be very useful while developing searching and classifica-
tion tools used in education, so as to allow math learners
and tutors selecting suitable tasks nailing down a topic pre-
sented during a classroom session. An obvious possible
use case is accessing a set of relevant mathematical equa-
tions to be used for training while a learner is doing the
preparation exercises for an examination. Another interest-
ing possibility is searching an equation by its syntactical
structure, the latter being often easier to recall compare to
exact mathematical formulas.
This paper is based on Mikhail Ponomarev and Evgeny Pyshkin,
Adopting Tree Overlapping Algorithm for MathML Equation Structural
Similarity, published in the Proceedings of the 2nd International Confer-
ence on Applications in Information Technology (ICAIT-2016) [7]
For the reason that most structural notations used for rep-
resenting mathematical expressions are in fact based on di-
rected graphs, the syntactical similarity can be defined by
using tree structural similarity. Specifically, this work ad-
dresses the case when expressions are uniformly presented
in MathML, the latter being one of widely used structural
XML based notations used in mathematics. In turn, if bet-
ter structural math equation forms are used, one can ex-
pect more efficient and accurate retrieval, in contrast to a
frequent use of image based equation representation used
on many web sites. At the same time we accept a possi-
ble criticism pointing an issue that not an every mathemati-
cal expression retrieval difficulty could be addressed under
MathML representability constraints.
Within a context of mathematical equations similarity, it
is important to mention both meaning related similarity and
presentation related similarity (in MathML there are two
markup schemes corresponding to these two perspectives:
content markup and presentation markup). Semantic (e.g.
equation meaning) similarity is out of scope of this study;
this work is focused only on presentation similarity which
is related to the equation syntax (the form) rather than to its
meaning (the contents).
MathML – Mathematical Markup Language
378 Informatica 40 (2016) 377–385 E. Pyshkin et al.
2 Structural similarity of
mathematical equations
Since the structure of mathematical equations can be rep-
resented in the form of a tree, mathematical equation simi-
larity may be defined using tree similarity. In [1] similarity
of two trees is defined on the base of recursive examina-
tion of their subtrees. In [5] the following mathematical
expressions similarity patterns are defined:
Mathematical equivalence: Equations E1 and E2 are
mathematically equivalent if they are semantically (but not
obligatorily syntactically) the same, for example d(sin(x))dx
and (sin(x))′, sin2(x) + cos2(x) and 1 are correspond-
ingly equivalent.
Identity: E1 and E2 are identical if they are exactly the
same.
Syntactical identity: E1 and E2 are syntactically iden-
tical if they are identical after normalization (dealing with
variable names and numeric values). For example sin(a)
and sin(b), 1sin(x) and
5
sin(x) are correspondingly syntacti-
cally identical.
N–similarity: Normalized equations E1 and E2 are n-
similar if there is a similarity (in a certain sense) which is
sim(E1, E2) > n, n being a parametric value determining
a threshold. There are two specific N–similarity cases:
1. Subexpression n–similarity: There is a subexpres-
sion n–similarity for E1 and E2, if E1 and E2 are n–
similar and the corresponding trees both contain the
common subtree which in turn contains all the termi-
nal nodes of both trees. Figure 1 shows an example
for the case of expressions sin(x)2 and sin(x)2 .
2. Structural n–similarity: E1 and E2 are structurally
n–similar if E1 and E2 are n–similar (in common
sense) and there is a common part in both trees rooted
at root nodes of compared trees with the produc-
tion rules being the same for all the nodes in this
part. Figure 2 illustrates this case for the equations
x+
√
sin(a) and x+
√
2b.
T (sin(x)
2
)
mrow
msup
mn
2
mrow
mo
)
mi
x
mo
(
mo
sin
T ( sin(x)2 )
mrow
mfrac
mn
2
mrow
mo
)
mi
x
mo
(
mo
sin
Figure 1: Equations sin(x)2 and sin(x)2 are structurally n–
similar for any value of n > 1826
Note that in order to represent n-values, in Figures 1 and
2 we use the nodes ratio which is a ratio of the number of
T (x+
√
sin(a))
mrow
msqrt
mo
)
mi
a
mo
(
mo
sin
mo
+
mi
x
T (x+
√
2b)
mrow
msqrt
mi
b
mn
2
mo
+
mi
x
Figure 2: x +
√
sin(a) and x +
√
2b are structurally n–
similar for any value of n > 1224
common nodes in both trees to the number of all nodes in
both trees.
3 Equation similarity evaluation
using tree similarity
In order to introduce different approaches used for evaluat-
ing mathematical equation similarity based on tree similar-
ity, in this section some sample trees are used: T0, T1 and
T2 (shown in Figure 3). In the following text we demon-
strate how the similarity between T0 and T1 as well as be-
tween T0 and T2 respectively can be calculated.
T0
a01
c01b
0
1
e01d
0
1
T1
a11
c11b
1
1
e11
g11
i11
d11
T2
a21
b21
e21
g22
j21
d21
g21
i21
Figure 3: Sample trees T0, T1 and T2
For each node in Figure 3 its upper index corresponds to
the tree which this node belongs to. Thus, all the nodes of
T0 have the upper indexes which are equal to 0, the nodes of
T1 have the upper indexes which are equal to 1, and so on.
The lower indexes are used so as to count the equal nodes
within a tree. For instance, T2 contains two equal nodes g,
so the first appearance of g is marked by the lower index 1,
while the second appearance of g is marked by the lower
index 2. These indexes are suppressed in cases, when they
are not necessary for algorithm description.
3.1 Tree edit distance
Tree edit distance method uses the definition of similarity
(distance) between two trees as a weighted number of edit
operations (insert, delete, and modify) required to trans-
form one tree to another (as described, for example, in
[10]).
Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 379
Assume S is a sequence of edit operations
{s1, s2, . . . , sk} for transforming one tree to another.
Assume γ is a non-negative distance measure describing
node transformation from a to b (defined here as a → b)
such as γ(a→ b) 1 0 and γ(a→ b) = γ(b→ a).
For an operation sequence S we get the following sum:
γ(S) =
∑|S|
i=1 γ(si).
Then the distance between two equations is defined as
follows:
δ(T1, T2) = min{γ(S)} (1)
Figures 4 and 5 illustrate how the transformation dis-
tances from T0 to T1 and from T0 to T2 correspondingly
are calculated: a sequence of two insertions is required in
order to transform T0 to T1; four operations (one deletion
and three insertions) are required in order to transform T0
to T2.
T0
a
cb
ed
⇒
T
′
0
a
cb
e
g
d
⇒
T1
a
cb
e
g
i
d
Figure 4: Tree edit transformation: T0 to T1
T0
a
cb
ed
⇒
T
′
0
a
cb
ed
⇒
T
′′
0
a
b
ed
g ⇒
T
′′′
0
a
b
e
g
d
g ⇒
T2
a
b
e
g
j
d
g
Figure 5: Tree edit transformation: T0 to T2
The node weights (and the operation costs as well) might
not be equal, hence there might be different similarity mea-
sures based on general edit distance schema. Also, in dif-
ferent algorithms a sequence Si is searched differently: in
addition to the earlier mentioned work [10] there are other
implementations described in [2] and [6]. Algorithmic
complexity of the above mentioned approaches is summa-
rized in Table 1.
3.2 Subpath set
Subpath set similarity between two trees is defined as the
number of subpaths shared by the trees. Given a tree, its
subpaths are defined as a set of all the paths from the root
node to the leaves including the partial paths. Subpath
Table 1: Tree edit distance based algorithms
Algorithm Time Memory Particularities
TED [10] O(n4) O(n2) Good for bal-
anced trees
ODTED [2] O(n3) O(n2) Better in un-
balanced trees
RTED [6] O(n3) O(n2) Tree balance
insensitive
based similarity definition for a case of natural language
processing can be found in [4]. A possible application of
this concept to a case of MathML equations can be illus-
trated by the algorithm described in [9].
Figure 6 shows a set of common subpaths in the sam-
ple trees T0 and T1. Hence, in this example subpath based
tree similarity Ss(T0, T1) = 11. Figure 3 illustrates the
same issue for a case of the sample trees T0 and T2. Sim-
ilarly, subpath based tree similarity can be calculated as
Ss(T0, T2) = 9.
A more practical case is a set TS of trees to be processed
in order to find those which are similar to some given tree
T0. Concerning efficiency issues, a significant improve-
ment may be achieved if, instead of computing the similar-
ity for many pairs, an indexing table I[p] for the whole TS
corpus is used [4].
Table 2: Indexing table for T1 and T2
p I[p] p I[p]
a {1, 2} e→ g {1, 2}
b {1, 2} g → i {1, 2}
c {1} a→ g {2}
d {1, 2} g → j {2}
e {1, 2} a→ b→ d {1, 2}
g {1, 2} a→ b→ e {1, 2}
i {1, 2} b→ e→ g {1, 2}
j {2} e→ g → i {1}
a→ b {1, 2} a→ g → i {2}
a→ c {1} e→ g → j {2}
b→ d {1, 2} a→ b→ e→ g {1, 2}
b→ e {1, 2} b→ e→ g → i {1}
b→ e→ g → j {2}
380 Informatica 40 (2016) 377–385 E. Pyshkin et al.
T0
a01
c01b
0
1
e01d
0
1
T1
a11
c11b
1
1
e11
g11
i11
d11
(a), (b), (c), (d), (e)
(a, b), (a, c), (b, d), (b, e)
(a, b, d), (a, b, e)
(g), (i)
(e, g), (g, i)
(b, e, g), (e, g, i)
(a, b, e, g), (b, e, g, i)
(a, b, e, g, i)
0T
1T
Figure 6: Subpaths in T0 and T1
T0
a01
c01b
0
1
e01d
0
1
T2
a21
b21
e21
g22
j21
d21
g21
i21
(a), (b), (d), (e)
(a, b), (b, d), (b, e)
(a, b, d), (a, b, e)
(c)
(a, c)
(g), (i), (j)
(a, g), (g, i), (e, g), (g, j)
(a, g, i), (b, e, g), (e, g, j)
(a, b, e, g), (b, e, g, j)
(a, b, e, g, j)
0T
2T
Figure 7: Subpaths in T0 and T2
Table 2 provides an example of creating the indexing ta-
ble for a case of the set of trees consisting of only two trees
T1 and T2 (shown in Figure 3).
In Table 2 p-columns list all the subpaths from both T1
and T2 (i.e. from all the corpus trees); for every subpath
the corresponding I[p]-cell shows a list of trees where such
a subpath exists. Algorithmic complexity of an indexing
table based algorithm is O(L ∗ D2), where L – maximal
number of tree leaves, D – maximal tree depth among all
the corpus trees [4].
3.3 Tree overlapping algorithm and its
modification for math equation
structural similarity
A basic tree overlapping algorithm is described in [4] for
a case of sentence similarity which is defined as follows.
When putting an arbitrary node n1 of a tree T1 on a node
n2 of a tree T2, there might be the same production rule
overlapping in T1 and T2. Similarity is defined as a number
of such overlapping production rules.
In contrast to the base algorithm from [4] where tree ter-
minals are naturally excluded, for a case of mathematical
equations we also include terminal nodes as if they had the
same production rules (Relaxation 1). Also we relax the
strictness of the base algorithm and include the pairs of cor-
responding nodes which are in the same order among their
siblings but do not obligatorily have the same production
rules for their child nodes (Relaxation 2). Below there is a
formal definition of our modification.
Assume L(n1, n2) represents a set of overlapping node
pairs when putting n1 on n2. Assume ch(n, i) is i-th child
of node n. The set L(n1, n2) is being generated by apply-
ing the following rules:
1. (n1, n2) ∈ L(n1, n2)
2. If (m1,m2) ∈ L(n1, n2), then
(ch(m1, i), ch(m2, i)) ∈ L(n1, n2)
3. L(n1, n2) includes all the pairs generated recursively
by the rule No. 2.
A number NTO(n1, n2) of production rules (according
to the Relaxation 1) is defined as follows:
NTO(n1, n2) =



(m1,m2)
m1 ∈ nodes(T1)
∧m2 ∈ nodes(T2)
∧ (m1,m2) ∈ L(n1, n2)
∧ PR(m1) = PR(m2)



(2)
Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 381
In equation 2 nodes(T ) is a set of nodes (including
terminals) in a tree T , while PR(n) is a production rule
rooted at the node n.
Figure 8 shows an example of overlapping
tree modification algorithm for NTO(d1, d2) =
{(d1, d2), (f1, f2), (g1, g2)}.
Assume PWPR(n1, n2) is a set of nodes which is repre-
sented as a path from (n1, n2) to the top last pair of nodes
being in the same order among their siblings. Assume ni
and mi are nodes of a tree Ti, ch(n, i) is i-th child of node
n. According to the Relaxation 2, PWPR is defined as fol-
lows:
1. (n1, n2) /∈ PWPR
2. If PR(parent(n1)) 6= PR(parent(n2))
∧ ch(parent(n1), i) = ch(parent(n2), i)
∧ ch(parent(n1), i) = n1
∧ h(parent(n2), i) = n2,
(parent(n1), parent(n2)) ∈ PWPR
3. PWPR(n1, n2) includes only pairs generated by ap-
plying rule No. 2.
Then the second component for an integral similarity
measure can be defined by using the above introduced
PWPR as follows:
PTO(n1, n2) =


(m1,m2)
(p1, p2) ∈ NTO(n1, n2)
(m1,m2) ∈ PWPR(p1, p2),
if top(m1,m2) = (n1, n2)



(3)
In equation 3 top(n1, n2) is the last pair in set
PWPR(n1, n2): top(n1, n2) = plast(n1, n2), plast ∈
PWPR.
Thus, for two nodes, the resulting combined similarity
measure is defined as follows:
CTO(n1, n2) = |NTO(n1, n2)|+ |PTO(n1, n2)|
For the whole trees, we get:
STO(T1, T2) = max
n1∈nodes(T1),n2∈nodes(T2)
CTO(n1, n2)
(4)
3.4 Software implementation
We developed a software prototype in order to arrange a
series of experiments for the above described modification
of the tree overlapping algorithm for a case of mathematical
equations. Figure 9 gives a hint of how the application user
interface is organized.
For displaying mathematical equations defined in
MathML the library net.sourceforge.jeuclid is used.
4 Experiments
One of the problems we faced while attempting to evaluate
the algorithm is that, unlike to the NLP domain, there is
no substantial corpus of mathematical equation syntactical
similarity classes.
4.1 Test Corpora
For our rather preliminary analysis several experts experi-
enced in teaching mathematics in high schools and lyceums
were involved. With their help we selected a number of
typical trigonometry problems from the set of tasks used in
Russian Unified State Examination [3] The selected equa-
tions are listed in Table 3.
With the help of our experts, the expressions were clas-
sified according their structural similarity. As a result, two
types of equation classification were created: a classifica-
tion based on equation structural similarity (see Table 4)
and a classification based on subexpression similarity (see
Table 5).
4.2 Tests
Though corpora presented in Tables 4 and 5 aren’t repre-
sentative enough, they make possible to proceed with some
preliminary similarity precision estimation. Let us remind
that precision is defined as follows:
precision =
TP
TP + FP
(5)
In our tests we assume that in equation 5 TP is the num-
ber of k first true positive equations belonging to the same
class as the query expression, while FP is the number of
t first false positive equations: k + t = n − 1, where n is
the number of equations belonging to the respective class.
The preliminary experiments described in this work may
be considered a prove-of-concept example for investigat-
ing further necessary improvements of the developed algo-
rithm. In the future tests a standard cross-fold validation
procedure will be required in order to get trustworthy pre-
cision evaluation results.
Figure 11 illustrates the process of structural similarity
computation for two expressions from the tiny corpus de-
scribed earlier. The first expression consists of 34 nodes
while the second one has 33 nodes. 20 nodes are equal in
both trees. So, STO = 20+2034+33 =
40
67 = 0.597.
4.3 Analysis
Table 6 lists 5 expressions from the base defined in Ta-
ble 3 which achieve the best scores for the query expres-
sion
√
2 sin( 3π2 − x) sinx = cosx (belonging to the class
1 according to Table 4).
Two best scores are for the equations which also belong
to the same class 1, unlike to the equation−
√
2 sin(− 5π2 +
382 Informatica 40 (2016) 377–385 E. Pyshkin et al.
|NTO(d1, d2)| = 3
T1
a1
c1
e11d
1
g1f1
b1
+
T2
a2
c2
j2d2
g2f2
h2
k2
|PWPR(d1, d2)| = |PWPR(f1, f2)| =
= |PWPR(g1, g2)| = 2
T1
a1
c1
e11d
1
g1f1
b1
=
T2
a2
c2
j2d2
g2f2
h2
k2
CTO(d1, d2) = 5
T1
a1
c1
e11d
1
g1f1
b1
T2
a2
c2
j2d2
g2f2
h2
k2
Figure 8: Modified tree-overlapping algorithm: example
Figure 9: Structural similarity component: GUI
x) sinx = cosx (No. 4 in Table 4) which wasn’t recog-
nized as a similar expression despite of its obvious sim-
ilarity. To explain this phenomenon we have to go back
to MathML equation structure. As you can see from Fig-
ure 10 (left side), two compared equations (both belonging
to the class 1 of the corpus) have rather similar structure (at
least, from human point of view). However, their tree roots
have different number of child nodes, hence their produc-
tion rules are (formally) different too. It means that we have
to enhance equation normalization factor (currently limited
by only variable names and numerical values): in the above
mentioned case the issue can be resolved by restructuring a
tree based equation representation as Figure 10 (right side)
shows: both trees in the right side are semantically equiv-
alent to those which are in the left side. After such a nor-
malization, syntactical similarity score increases from 0 (in
the “left” case) to 0.44 (in the “right” case).
Similar tests were arranged for expressions from other
classes as well as for the case of subexpression similarity.
Specifically, for a case of subexperssion similarity, Table 7
lists 5 best results for the query 2 sin4 x+3 cos 2x+1 = 0
(which belongs to the class 5 according the the test corpus
from Table 5) against the equations from the base defined
in Table 3. In Table 7, nodes ratio means a ratio of common
nodes to all nodes in compared trees.
Among 5 best results listed in Table 7, only the second
equation 4 sin4 2x + 3 cos 4x − 1 = 0 doesn’t belong to
the class 5 (according to the experts’ classification). Let
us analyze a possible reason. The experts didn’t include
this equation to the class 5 due to the difference between
sin42x and sin4 x subexpressions. They considered this
part of equation as more representative from the viewpoint
of structural syntactical similarity. However, the subex-
pressions cos 2x and cos 4x were recognized by the algo-
rithm as subexpression based similar equations to the query
since both contains the explicit multiplier before x. Similar
to the case of structural similarity this issue could be ad-
dresses by the equation representation normalization (i.e.
introducing an explicit multiplier equal to 1).
In sum, based on the results presented in Table 5, for
the subexpression similarity sample test corpus the average
precision P =
1
1+
1
1+···+
3
4+
4
4+
4
4
13 =
12.25
13 = 0.94.
However, such an accuracy achieved for a small test cor-
pus defined in Table 5 may be considered as rather promis-
ing but very preliminary evaluation results. Further inves-
tigations with using more representative equation corpora
are necessary.
5 Conclusion
In this study we adopted a tree overlapping algorithm (used
originally in NLP) for mathematical equation syntactical
similarity. We implemented the algorithm as a software
prototype and arranged a set of experiments with sample
test corpora. We discovered that the proposed modification
fits well a selection of equations from college-level teach-
ing practice both for the cases of structural and subexpres-
sion based syntactical similarity patterns. For the reason
that the current implementation has some drawbacks which
became evident after the arranged experiments, the further
steps towards equation normalization are required in order
to achieve better equation classification accuracy.
Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 383
Table 3: Base of test equations
No. Equation No. Equation
1
√
2 sin( 3π2 − x) sinx = cosx 16 2 cos(π2 + x) =
√
3 tanx
2 cos(π2 + 2x) =
√
2 sinx 17 sin 2x+ 2 sin2 x = 0
3 2 cos(x− 11π2 ) cosx = sinx 18 2 sin(7π2 − x) sinx = cosx
4 2 sin4 x+ 3 cos 2x+ 1 = 0 19 2 sin2 x−
√
3 sin 2x = 0
5 (2 cosx+ 1)(
√
− sinx− 1) = 0 20 cos 2x− 3 cosx+ 2 = 0
6 (2 sinx− 1)(√− cosx+ 1) = 0 21 2 cos3 x− cos2 x+ 2 cosx− 1 = 0
7 4 sin4 2x+ 3 cos 4x− 1 = 0 22 cos 2x+ 3 sinx− 2 = 0
8 cos 2x = sin(x+ π2 ) 23 sin 2x+
√
2 sinx = 2 cosx+
√
2
9 2
√
3 cos2( 3π2 + x)− sin 2x = 0 24 3 cos 2x− 5 sinx+ 1 = 0
10 cos2 x− 12 sin 2x+ cosx = sinx 25 cos 2x− 5
√
2 cosx− 5 = 0
11 cos 2x = 1− cos(π2 − x) 26 −
√
2 sin(− 5π2 + x) sinx = cosx
12
√
cos2 x− sin2 x(tan 2x− 1) = 0 27 2 sin2 x−sin x
2 cos x−
√
3
= 0
13 tanx+ cos( 3π2 − 2x) = 0 28 2 sin
2 x−sin x
2 cos x+
√
3
= 0
14 cosx+ cos(π2 + 2x) = 0 29 4 cos
4 x− 4 cos2 x+ 1 = 0
15 12 sin 2x+ sin
2x− sinx = cosx 30 4 sin2 x+ 8 sin( 3π2 + x) + 1 = 0
⇒
T1(
√
2 sin( 3π2 − x) sinx = cosx)
math
mi
x
mi
cos
mo
=
mi
x
mi
sin
mrow
. . .
mi
sin
mrow
msqrt
mn
2
T2(−
√
2 sin(− 5π2 + x) sinx = cosx)
math
mi
x
mi
cos
mo
=
mi
x
mi
sin
mrow
. . .
mi
sin
mrow
msqrt
mn
2
mo
−
STO(T1, T2) =
0+0
34+38
= 0
72
= 0
T
′
1(
√
2 sin(3π2 − x) sinx = cosx)
math
mrow
mi
x
mi
cos
mo
=
mrow
mi
x
mi
sin
mrow
mrow
. . .
mi
sin
mrow
msqrt
mn
2
T
′
2(−
√
2 sin(− 5π2 + x) sinx = cosx)
math
mrow
mi
x
mi
cos
mo
=
mrow
mi
x
mi
sin
mrow
mrow
. . .
mi
sin
mrow
msqrt
mn
2
mo
−
STO(T
′
1, T
′
2) =
17+17
37+41
= 34
78
= 0.44
Figure 10: Tree structure normalization to avoid a false negative case
References
[1] R. Bod. Beyond grammar. An Experienced-Based
Theory of Language. CSLI Lecture Notes, 88, 1998.
[2] E. D. Demaine, S. Mozes, B. Rossman, and
O. Weimann. An optimal decomposition algorithm
for tree edit distance. ACM Transactions on Algo-
rithms (TALG), 6(1):2, 2009.
[3] E. Denisova-Schmidt and E. Leontyeva. The unified
state exam in russia: problems and perspectives. In-
ternational Higher Education, (76):22–23, 2014.
[4] I. Hiroshi, H. Keita, H. Taiichi, and T. Takenobu. Ef-
ficient sentence retrieval based on syntactic structure.
In Proceedings of the COLING/ACL on Main confer-
ence poster sessions, pages 399–406. Association for
Computational Linguistics, 2006.
[5] S. Kamali and F. W. Tompa. Improving mathemat-
384 Informatica 40 (2016) 377–385 E. Pyshkin et al.
T (
√
2 sin(3π2 − x) sinx = cosx)
math
mi
x
mi
cos
mo
=
mi
x
mi
sin
mi
sin
mrow
msqrt
mn
2 mrow
mo
)
mi
x
mo
−
mfrac
mn
2
mrow
mi
π
mn
3
mo
(
T (2 cos(x− 11π2 ) cosx = sinx)
math
mi
x
mi
sin
mo
=
mi
x
mi
cos
mi
cos
mrow
mn
2
mrow
mo
)
mfrac
mn
2
mrow
mi
π
mn
3
mo
−
mi
x
mo
(
Figure 11: Structural similarity computation: example
Table 4: Structural Similarity Classification
No. Expression Class
1
√
2 sin(3π2 − x) sinx = cosx
1
2 2 cos(x− 11π2 ) cosx = sinx
3 2 sin(7π2 − x) sinx = cosx
4 −
√
2 sin(− 5π2 + x) sinx = cosx
5 cos 2x− 3 cosx+ 2 = 0
2
6 cos 2x+ 3 sinx− 2 = 0
7 3 cos 2x− 5 sinx+ 1 = 0
8 cos 2x− 5
√
2 cosx− 5 = 0
9 cos(π2 + 2x) =
√
2 sinx
310 cos 2x = sin(x+ π2 )
11 2 cos(π2 + x) =
√
3 tanx
12 2 sin4 x+ 3 cos 2x+ 1 = 0
413 4 sin4 2x+ 3 cos 4x− 1 = 0
14 4 cos4 x− 4 cos2 x+ 1 = 0
15 (2 cosx+ 1)(
√
− sinx− 1) = 0
516 (2 sinx− 1)(
√− cosx+ 1) = 0
17
√
cos2 x− sin2 x(tan 2x− 1) = 0
18 cos2 x− 12 sin 2x+ cosx = sinx 619 12 sin 2x+ sin2x− sinx = cosx
20 tanx+ cos( 3π2 − 2x) = 0 721 cosx+ cos(π2 + 2x) = 0
22 2 sin
2 x−sin x
2 cos x−
√
3
= 0
8
23 2 sin
2 x−sin x
2 cos x+
√
3
= 0
ics retrieval. Towards a Digital Mathematics Library.
Grand Bend, Ontario, Canada, July 8-9th, 2009,
pages 37–48, 2009.
[6] M. Pawlik and N. Augsten. Rted: a robust algorithm
for the tree edit distance. Proceedings of the VLDB
Endowment, 5(4):334–345, 2011.
Table 5: Subexpression Based Similarity
No. Expression Class
1 cos(π2 + 2x) =
√
2 sinx
1
2 cosx+ cos(π2 + 2x) = 0
3 (2 cosx+ 1)(
√
− sinx − 1) = 0
2
4 (2 sinx− 1)( √− cosx + 1) = 0
5
√
2 sin( 3π2 − x) sinx = cosx 3
6 2 sin( 7π2 − x) sinx = cosx
7 2 sin
2 x−sin x
2 cos x−
√
3
= 0
4
8 2 sin
2 x−sin x
2 cos x+
√
3
= 0
9 2 sin4 x + 3 cos 2x+ 1 = 0
5
10 4 cos4 x − 4 cos2 x+ 1 = 0
11 cos2 x − 12 sin 2x+ cosx = sinx
12
2 sin2 x −sin x
2 cos x−
√
3
= 0
13
2 sin2 x −sin x
2 cos x+
√
3
= 0
[7] M. Ponomarev and E. Pyshkin. Adopting tree over-
lapping algorithm for mathml equation structural sim-
ilarity evaluation. In Proceedings of the 2nd Inter-
national Conference on Applications in Information
Technology (ICAIT-2016), pages 17–20. The Univer-
sity of Aizu, The University of Aizu Press, Oct 2016.
[8] K. Sain, A. Dasgupta, and U. Garain. Emers: a
tree matching–based performance evaluation of math-
ematical expression recognition systems. Interna-
tional Journal on Document Analysis and Recogni-
tion (IJDAR), 14(1):75–85, 2011.
[9] K. Yokoi and A. Aizawa. An approach to similarity
search for mathematical expressions using mathml.
Mathematical Equation Structural Syntactical Similarity Patterns. . . Informatica 40 (2016) 377–385 385
Table 6: Query:
√
2 sin(3π2 − x) sinx = cosx
Compared expression Nodes
ratio
Similarity
2 sin(7π2 − x) sinx = cosx 60/67 0.896
2 cos(x− 11π2 ) cosx = sinx 40/67 0.597
2 cos(π2 + x) =
√
3 tanx 24/63 0.381
3 cos 2x− 5 sinx+ 1 = 0 12/60 0.200
tanx+ cos( 3π2 − 2x) = 0 10/67 0.149
Table 7: Query: 2 sin4 x+ 3 cos 2x+ 1 = 0
Compared expression Nodes
ratio
Similarity
4 cos4 x− 4 cos2 x+ 1 = 0 10/59 0.169
4 sin4 2x+ 3 cos 4x− 1 = 0 10/60 0.167
cos2 x − 12 sin 2x + cosx =
sinx
10/63 0.159
2 sin2 x−sin x
2 cos x−
√
3
= 0 10/64 0.156
2 sin2 x−sin x
2 cos x+
√
3
= 0 10/64 0.156
Table 8: Evaluating classification precision for a case of
subexpression similarity
No. Expression TPTP+FP
1 cos(π2 + 2x) =
√
2 sinx 1/1
2 cosx+ cos(π2 + 2x) = 0 1/1
3 (2 cosx+ 1)(
√
− sinx− 1) = 0 1/1
4 (2 sinx− 1)(√− cosx+ 1) = 0 1/1
5
√
2 sin( 3π2 − x) sinx = cosx 1/1
6 sin( 7π2 − x) sinx = cosx 1/1
7 2 sin
2 x−sin x
2 cos x−
√
3
= 0 1/1
8 2 sin
2 x−sin x
2 cos x+
√
3
= 0 1/1
9 2 sin4 x+ 3 cos 2x+ 1 = 0 3/4
10 4 cos4 x− 4 cos2 x+ 1 = 0 3/4
11 cos2 x− 12 sin 2x+ cosx = sinx 3/4
12 2 sin
2 x−sin x
2 cos x−
√
3
= 0 4/4
13 2 sin
2 x−sin x
2 cos x+
√
3
= 0 4/4
Towards a Digital Mathematics Library. Grand Bend,
Ontario, Canada, July 8-9th, 2009, pages 27–35,
2009.
[10] K. Zhang and D. Shasha. Simple fast algorithms for
the editing distance between trees and related prob-
lems. SIAM journal on computing, 18(6):1245–1262,
1989.
[11] Q. Zhang and A. Youssef. An Approach to Math-
Similarity Search, pages 404–418. Springer Interna-
tional Publishing, Cham, 2014.