https://doi.org/10.31449/inf.v46i6.3787

Learning the Pattern-based CRF for Prediction of a Protein Local Structure

Zhalgas Mukanov¹ and Rustem Takhanov²
E-mail: mukanovj@mail.ru, rustem.takhanov@nu.edu.kz
¹ Fundamental Mathematics Department, Eurasian National University, 2 Satpayev Str., Nur-Sultan, Kazakhstan
² Mathematics Department, Nazarbayev University, 53 Kabanbay Batyr Ave, Nur-Sultan, Kazakhstan

Keywords: pattern-based CRFs, sequence labeling, protein conformation prediction, energy landscape, structural SVM

Received: October 13, 2021

Prediction of protein conformation from its amino acid sequence is widely acknowledged as one of the most important computational biology problems and is considered a source of interesting problem formulations for machine learning. Here methods of supervised learning stand side by side with statistical physics and information theory. According to the classical results of Anfinsen, protein conformational structure is fully determined by its primary structure, i.e., its amino acid sequence, and energy landscape theory says that the native state of a protein corresponds to the minimum of its free energy [2]. There are two dominating approaches to protein structure prediction: the first is based on minimizing physics-based free energies with some unknown parameters, and the second is a knowledge-based approach that does not necessarily use the notion of free energy and aims only to yield high prediction accuracy [14]. In comparison to these two approaches, there is a deficit of intermediate approaches whose goal is to find knowledge-based parameterizations of free energy that would approximate the real free energy for certain protein families while having a prediction accuracy comparable with pure knowledge-based approaches. According to M. Gromov, if energy landscape theory is true, then "probably, free energy can be encoded with a reasonable accuracy by something like $10^4$–$10^6$ bits of information", and the main mathematical problem here is the lack of "general mathematical 'parameter fitting' method(s), which, when applied to proteins, could provide (an effective version of) the total inter-residue interaction energies" [10]. In this paper, we introduce a probabilistic model based on a certain parametrization of free energy that we expect to be fruitful both for predicting protein dihedral angles and for investigating the structure of the energy landscape. This model is based on the idea that free energy is largely determined by pairwise interactions of amino acids that are located near each other on a protein sequence. Though this approach is far from reality for general proteins, we expect it to approximate an all-alpha protein's energy landscape.

Povzetek: A new method that learns frequent patterns has been developed for determining protein structure.

1 Introduction

Prediction of protein conformation from its amino acid sequence is widely acknowledged as one of the most important computational biology problems and is considered a source of interesting problem formulations for machine learning. Here methods of supervised learning stand side by side with statistical physics and information theory. According to the classical results of Anfinsen, protein conformational structure is fully determined by its primary structure, i.e., its amino acid sequence, and energy landscape theory says that the native state of a protein corresponds to the minimum of its free energy [2].
There are two dominating approaches to protein structure prediction: the first is based on minimizing physics-based free energies with some unknown parameters, and the second is a knowledge-based approach that does not necessarily use the notion of free energy and aims only to yield high prediction accuracy [14]. In comparison to these two approaches, there is a deficit of intermediate approaches whose goal is to find knowledge-based parameterizations of free energy that would approximate the real free energy for certain protein families while having a prediction accuracy comparable with pure knowledge-based approaches. According to M. Gromov, if energy landscape theory is true, then "probably, free energy can be encoded with a reasonable accuracy by something like $10^4$–$10^6$ bits of information", and the main mathematical problem here is the lack of "general mathematical 'parameter fitting' method(s), which, when applied to proteins, could provide (an effective version of) the total inter-residue interaction energies" [10]. In this paper, we introduce a probabilistic model based on a certain parametrization of free energy that we expect to be fruitful both for predicting protein dihedral angles and for investigating the structure of the energy landscape. This model is based on the idea that free energy is largely determined by pairwise interactions of amino acids that are located near each other on a protein sequence. Though this approach is far from reality for general proteins, we expect it to approximate an all-alpha protein's energy landscape.

1.1 Related work

There are plenty of publications in the literature dedicated to the problem of predicting protein backbone dihedral angles. This problem is interesting in two contexts. First, since secondary structure and backbone dihedral angles correlate (especially in the alpha-helix region), these problems are often tackled together and considered adjacent research themes [13]. Second, it has been shown that high-accuracy prediction of secondary structure and dihedral $\phi,\psi$ angles improves the recognition of the so-called fold of a protein [13]. Pioneering works on secondary structure prediction were published in the 1970s [5, 8]. The highest prediction accuracies were achieved in the mid-1990s by machine learning techniques that use multiple sequence alignment with proteins from the PDB database. PSIPred is a popular and high-scoring example of an algorithm of this kind [11]. Later, using a similar representation of a protein as in PSIPred, the idea of simultaneous prediction of secondary structure and backbone dihedral angles was implemented in DISSPred, which has one of the highest accuracies for both of these problems [13]. A survey of the most recent advances in the problem can be found in [25]. Such approaches improve their accuracy as the number of resolved proteins grows, and they weaken if the template protein (the one for which a prediction is made) does not have close homologs among resolved proteins.

In the absence of close homologs, prediction methods based on sequence-structure analysis are considered among the most promising [6, 4]. The success of such techniques rests on the fundamental fact that the complexity and diversity of the local conformational structures observed in proteins are much smaller than sequence complexity.
The first and critical step in such techniques is the choice of an alphabet of structural motifs; then correlations between sequence and the corresponding structure are retrieved from data, and a prediction of the secondary/local structure is made. One way to formalize such a scheme is Hidden Markov Models and their modifications. In the majority of such algorithms, a state of the hidden Markov process formalizes structural information associated with a certain amino acid of a protein, and the transition probabilities between the states of adjacent amino acids are computed from data by standard formulas. HMMSTR is one of the most popular methods based on HMMs [4]. The idea of applying structural SVM machinery to the computation of HMM parameters for secondary structure recognition was implemented in [9, 1].

2 Pattern-based energy

A pattern-based energy potential over words $x \in D^n$ in some alphabet $D$ is defined as

$$E(x) = \sum_{\alpha \in \Gamma} \sum_{\substack{[i,j] \subseteq [1,n] \\ j-i+1=|\alpha|}} f^{\alpha}_{ij} \cdot [x_{i:j} = \alpha] \qquad (1)$$

where $\Gamma \subseteq D^*$ is a fixed set of non-empty words, $|\alpha|$ is the length of the word $\alpha$, and $[\cdot]$ is the Iverson bracket. Here $x_{i:j}$ denotes the subword $x_i \cdots x_j$ of $x$. A pattern-based conditional random field is defined as the probability distribution over words $p(x) \propto e^{-E(x)}$.

Intuitively, pattern-based CRFs allow modeling long-range interactions for selected subsequences of labels. Inference algorithms for pattern-based CRFs were developed in [26, 21], and they include (i) computing the partition function $Z = \sum_x \exp\{-E(x)\}$; (ii) computing the marginal probabilities $p(x_{i:j} = \alpha)$ for all triplets $(i,j,\alpha)$ present in (1); (iii) computing the MAP, i.e., minimizing the energy (1). Note that the MAP problem is a special case of hybrid valued constraint satisfaction problems [12, 18, 19]. Hybrid VCSPs in general can be NP-hard, but minimizing the energy (1) is not only tractable but even solvable in time $O(n)$.

Pattern-based CRFs have already been applied in such contexts as handwritten character recognition, identification of named entities from text [26], optical character recognition [16], and language modeling [3, 20, 22].
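To make these definitions concrete, the following minimal Python sketch (with a toy alphabet and hypothetical pattern weights, purely for illustration) evaluates the energy (1) and the partition function by brute-force enumeration; the $O(n)$ dynamic-programming algorithms of [26, 21] for MAP and marginals are not reproduced here.

```python
from itertools import product
from math import exp

# A minimal sketch of the pattern-based energy (1), assuming a toy alphabet
# and hypothetical weights. Here f maps (alpha, i) -> f^alpha_{ij}, with the
# end position j = i + |alpha| - 1 implicit.

def energy(x, f):
    """E(x) per (1) for a word x over the alphabet D."""
    e = 0.0
    for (alpha, i), w in f.items():
        if x[i:i + len(alpha)] == alpha:     # Iverson bracket [x_{i:j} = alpha]
            e += w
    return e

def partition_function(D, n, f):
    """Z = sum_x exp(-E(x)) by enumeration; exponential in n, checking only."""
    return sum(exp(-energy("".join(x), f)) for x in product(D, repeat=n))

D, n = "ab", 5                               # toy alphabet and word length
f = {("ab", 0): -1.0, ("aa", 2): 2.0}        # hypothetical pattern weights
print(energy("abbaa", f))                    # -1.0: only pattern ("ab", 0) fires
print(partition_function(D, n, f))           # normalizer of p(x) ~ exp(-E(x))
```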
3 Pattern-based CRFs for the sequence labeling problem

One of the important classes of pattern recognition problems with structural outputs is sequence labeling. In such problems, we are given two finite alphabets, $A$ (input alphabet) and $L$ (output labels). A training set consists of pairs of the form $(x,y)$, where $x$ is a word over $A$, $y$ is a word over $L$, and both words have the same length, i.e., the letters of the first word are tagged by labels. The goal is to construct mappings $m: A^n \to L^n$, $n = 1, 2, \dots$, that are consistent both with the training set and with some supplementary model requirements. Examples of such formulations can be found in protein folding (secondary structure prediction), natural language processing (part-of-speech tagging), and speech recognition. Popular methods for sequence labeling include hidden and maximum entropy Markov models. In this paper, we describe a generalized version of our model and show how the two key problems of inference and learning can be efficiently solved in this framework.

Definition. A pair $(\alpha,\beta)$, where $\alpha$ is a word over $A$ and $\beta$ is a word over $L$ of the same length, is called a pattern pair.

Suppose we are given a pattern pair $(\alpha,\beta)$ and tuples $x = (x_1,\dots,x_n) \in A^n$ and $y = (y_1,\dots,y_n) \in L^n$. Then by $(\alpha,\beta) \vdash_i (x,y)$ we denote that $x_{i:i+|\alpha|-1} = \alpha$ and $y_{i:i+|\alpha|-1} = \beta$.

Suppose we are given a finite set of pattern pairs $\mathcal{A}$. Then, for any tuples $x = (x_1,\dots,x_n) \in A^n$ and $y = (y_1,\dots,y_n) \in L^n$, $\Psi_{\mathcal{A}}(x,y)$ denotes a vector with components indexed by $\mathcal{A}$; for $(\alpha,\beta) \in \mathcal{A}$, the $(\alpha,\beta)$-component is equal to the number of $i \in \{1,\dots,n\}$ for which $(\alpha,\beta) \vdash_i (x,y)$.

If every pattern pair from $\mathcal{A}$ is assigned a real value, then the parameters $w = \{w_p\}_{p \in \mathcal{A}}$ define a conditional probability distribution $p(y|x,w)$, where $x \in A^n$, $y \in L^n$:

$$p(y|x,w) = \frac{e^{-\langle w, \Psi_{\mathcal{A}}(x,y)\rangle}}{Z(x,w)} \qquad (2)$$

where $Z(x,w) = \sum_y e^{-\langle w, \Psi_{\mathcal{A}}(x,y)\rangle}$. It is easy to see that this family of conditional probability distributions is a special case of the pattern-based CRF.

Two computational problems are of primary interest in the framework of conditional random fields: inference and learning. Inference for the pattern-based CRF is equivalent to minimization of the energy:

$$\arg\max_y p(y|x,w) = \arg\min_y \langle w, \Psi_{\mathcal{A}}(x,y)\rangle. \qquad (3)$$

If we rewrite our energy term as

$$\langle w, \Psi_{\mathcal{A}}(x,y)\rangle = \sum_{p \in \mathcal{A}} \sum_{i:\, p \vdash_i (x,y)} w_p = \sum_{i=1}^{n} \sum_{p \in \mathcal{A}:\, p \vdash_i (x,y)} w_p \qquad (4)$$

we see that inference is equivalent to minimization of a pattern-based functional over $y$ with $f^{\alpha}_{ij} = \sum_{(\alpha,\beta) \in \mathcal{A}:\, (\alpha,\beta) \vdash_i (x,y)} w_{(\alpha,\beta)}$.

Given a training sample $(x^i, y^i)_{i=1,\dots,n}$, maximum likelihood parameter learning is the following task:

$$\prod_{i=1}^{n} p(y^i | x^i, w) \to \max_w \qquad (5)$$

which, after standard transformations, is equivalent to

$$L(w) = \sum_{i=1}^{n} \left[ \langle w, \Psi_{\mathcal{A}}(x^i, y^i)\rangle + \log Z(x^i, w) \right] \to \min_w. \qquad (6)$$

Let us consider a generalized version of the pattern-based CRF obtained by adding an equivalence relation $\sim$ on the pattern set $\mathcal{A}$. Then we have additional constraints on the learned weights: if $p \sim p'$, then $w_p = w_{p'}$. Such models are introduced to address the problem of overfitting when $|\mathcal{A}|$ becomes large. From now on, we will assume that the weights are indexed by the equivalence classes of $\sim$.

Since our energy is parameterized linearly, $L(w)$ is convex. Therefore we can optimize it by gradient-based optimization (first-order or second-order gradient descent) [15]. The components of the gradient and Hessian of $L(w)$ are equal to the following sums:

$$[\nabla_w L(w)]_c = \sum_{i=1}^{n} \sum_{p \in c} \left( \Psi_{\mathcal{A}}(x^i, y^i)_p - \mathbb{E}_{y \sim p(y|x^i,w)}\, \Psi_{\mathcal{A}}(x^i, y)_p \right) \qquad (7)$$

$$[H_w L(w)]_{cc'} = \sum_{i=1}^{n} \sum_{p \in c,\, p' \in c'} \left( \mathbb{E}_{y \sim p(y|x^i,w)}\!\left[ \Psi_{\mathcal{A}}(x^i, y)_p\, \Psi_{\mathcal{A}}(x^i, y)_{p'} \right] - \mathbb{E}_{y \sim p(y|x^i,w)}\, \Psi_{\mathcal{A}}(x^i, y)_p \cdot \mathbb{E}_{y' \sim p(y'|x^i,w)}\, \Psi_{\mathcal{A}}(x^i, y')_{p'} \right) \qquad (8)$$

where

$$\Psi_{\mathcal{A}}(x^i, y^i)_p = \left| \{ k : p \vdash_k (x^i, y^i) \} \right|. \qquad (9)$$

It is easy to see that computing the gradient and Hessian reduces to the computation of marginal probabilities $p(y_{k:l} = \alpha \,|\, x^i, w)$ in the pattern-based CRF $p(y | x^i, w)$. Thus, we can apply the BFGS algorithm [7] to solve (6) and learn the weights of the patterns in pattern-based CRFs.
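The following self-contained sketch illustrates the learning pipeline of this section on a toy problem: the feature map $\Psi_{\mathcal{A}}$, the objective (6), and the gradient (7), with $\log Z$ and the expectations computed by exhaustive enumeration over $y$ (a real implementation would use the dynamic-programming inference of [26, 21]). The pattern pairs and training data are hypothetical placeholders.

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

A_in, L_out = "ab", "01"                       # input alphabet A, label alphabet L
pairs = [("a", "0"), ("b", "1"), ("ab", "01")] # hypothetical pattern pairs (alpha, beta)

def psi(x, y):
    """Psi_A(x, y): counts of positions i where (alpha, beta) |-_i (x, y)."""
    v = np.zeros(len(pairs))
    for k, (a, b) in enumerate(pairs):         # |alpha| = |beta| for each pair
        v[k] = sum(x[i:i+len(a)] == a and y[i:i+len(a)] == b
                   for i in range(len(x) - len(a) + 1))
    return v

def nll_and_grad(w, data):
    """L(w) of (6) and its gradient (7), by exhaustive summation over y."""
    loss, grad = 0.0, np.zeros_like(w)
    for x, y in data:
        feats = {"".join(yy): psi(x, "".join(yy))
                 for yy in product(L_out, repeat=len(x))}
        scores = {yy: -w @ f for yy, f in feats.items()}    # -<w, Psi_A(x, y)>
        logZ = np.logaddexp.reduce(list(scores.values()))
        loss += w @ psi(x, y) + logZ
        # E_{y ~ p(y|x,w)} Psi_A(x, y), with p(y|x,w) = exp(score - logZ):
        expect = sum(np.exp(s - logZ) * feats[yy] for yy, s in scores.items())
        grad += psi(x, y) - expect
    return loss, grad

data = [("ab", "01"), ("ba", "10")]            # toy training pairs
res = minimize(nll_and_grad, np.zeros(len(pairs)), args=(data,),
               jac=True, method="BFGS")        # BFGS, as suggested above
print(res.x)                                   # learned pattern weights
```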
4 Pattern-based conditional random field for dihedral angles prediction

Suppose we have two ordered sets of variables $X = \{X_1,\dots,X_n\}$ and $Y = \{Y_1,\dots,Y_{2n}\}$. The variables $X_i$ take their values in the set of amino acid symbols {Ala, Arg, ..., Val, Sec, Pyl}, and we interpret any assignment of $X$ as an amino acid sequence. Every amino acid in a protein corresponds to two dihedral angles $\phi$ and $\psi$ (Figure 1), and we interpret $Y_{2i-1}$ and $Y_{2i}$ as $\phi$ and $\psi$ for amino acid $X_i$. Before any experiment, we discretize the interval $[-180°, 180°]$ and therefore always assume that $Y_i$ takes its values in some fixed finite set $D$.

Figure 1: The angles $\phi$, $\psi$, and $\omega$.

A triple $(A_1, A_2, \alpha)$, where $A_1, A_2$ are amino acid symbols and $\alpha$ is a word of even length over the alphabet $D$, is called a local contact triple. The name contact triple is related to the following interpretation: suppose that the values of the $\phi,\psi$ angles can, within a certain accuracy, be predicted based on the local interaction of amino acids (by local interaction we understand the interaction between amino acids that are closely located on a sequence). Then the mutual location of two amino acids can be described by the list of dihedral angles between them. Here we neglect the $\omega$ angles and the torsion angles of these amino acids.

Suppose we are given a finite set of contact triples $\mathcal{A}$. Then, for any pair $X = \{X_1,\dots,X_n\}$ and $Y = \{Y_1,\dots,Y_{2n}\}$, $\Psi_{\mathcal{A}}(X,Y)$ denotes a vector with components indexed by $\mathcal{A}$; for $(A_1,A_2,\alpha) \in \mathcal{A}$, the $(A_1,A_2,\alpha)$-component is equal to the number of $i \in \{1,\dots,n\}$ for which $X_i = A_1$, $X_{i+|\alpha|/2-1} = A_2$, and $Y_{2i-1} Y_{2i} \cdots Y_{2i+|\alpha|-3} Y_{2i+|\alpha|-2} = \alpha$.

Suppose the pairs $X, Y$ are drawn from some probability distribution $p(Y|X,w)$ with unknown parameters $w = \{w_p\}_{p \in \mathcal{A}}$. The family of probability distributions

$$-\log p(Y|X,w) = C(X) + \langle w, \Psi_{\mathcal{A}}(X,Y)\rangle \qquad (10)$$

where $C(X) = \log \sum_Y e^{-\langle w, \Psi_{\mathcal{A}}(X,Y)\rangle}$, is our pattern-based conditional random field.

Denoting $E_w(X,Y) = \langle w, \Psi_{\mathcal{A}}(X,Y)\rangle$, it is easy to see that

$$\arg\max_Y p(Y|X,w) = \arg\min_Y E_w(X,Y). \qquad (11)$$

Therefore, if the parameters are known ($w = w^*$), inference for an input string $X$ is equivalent to minimization of $\langle w, \Psi_{\mathcal{A}}(X,Y)\rangle$. In the following section, we describe the structural SVM procedure that we used to obtain the weights.

4.1 Learning parameters by structural SVM

Given a training set $\{(x^i, y^i)\}_{i=1}^{l}$, the structural risk minimization approach reduces the problem of learning CRFs to minimization of the following functional:

$$\frac{C}{l} \sum_{i=1}^{l} \Delta(y^i, f_w(x^i)) \to \min_w \qquad (12)$$

where $f_w(x^i) = \arg\min_y \langle w, \Psi(x^i, y)\rangle$ and $\Delta: Y \times Y \to \mathbb{N}$ is a loss function on the output set $Y$. The latter problem is very hard in general, and a standard way to tackle it is to replace the functional with a convex upper bound [15]. Following this scheme, we used the structural SVM technique for learning pattern-based CRFs. After the change $\Psi(x,y) \to -\Psi(x,y)$, the structural SVM functional is

$$\|w\|^2 + \frac{C}{l} \sum_{i=1}^{l} \max_{y \in Y} \left( \Delta(y^i, y) + \langle w, \Psi(x^i, y) - \Psi(x^i, y^i)\rangle \right). \qquad (13)$$

We used the software by T. Joachims [24], which is based on the following quadratic programming reformulation:

$$\|w\|^2 + C \sum_{i=1}^{l} \xi_i \to \min_{w,\xi}$$
$$\langle w, \Psi(x^i, y^i) - \Psi(x^i, y)\rangle \ge \Delta(y^i, y) - \xi_i, \quad \forall y \in Y,\ i = 1,\dots,l,$$
$$\xi_i \ge 0, \quad i = 1,\dots,l.$$

Very roughly, this functional can be interpreted as follows. For the optimal parameter $w$ we want $y^i \approx \arg\max_y \langle w, \Psi(x^i, y)\rangle$, and therefore $\langle w, \Psi(x^i, y^i) - \Psi(x^i, y)\rangle \ge 0$ whenever $\Delta(y^i, y)$ is large (greater than the resulting slack $\xi_i$), whereas in the neighborhood of $y^i$ (i.e., when $\Delta(y^i, y) \le \xi_i$; we can call such neighborhoods "uncertainty neighborhoods") the difference $\langle w, \Psi(x^i, y^i) - \Psi(x^i, y)\rangle$ can be negative. The goal is to lower the sum of the radii of the "uncertainty neighborhoods" plus the regularization term $\|w\|^2$.
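The experiments below rely on the SVMstruct package of Joachims [24]; purely as an illustration of the objective (13), here is a toy subgradient sketch with brute-force loss-augmented inference and a simple Hamming loss standing in for $\Delta$ (pattern pairs, data, and hyperparameters are all hypothetical).

```python
import numpy as np
from itertools import product

L_out = "01"
pairs = [("a", "0"), ("b", "1")]                 # hypothetical pattern pairs

def psi(x, y):
    """Psi(x, y): per-pattern-pair match counts, as in Section 3."""
    return np.array([sum(x[i:i+len(a)] == a and y[i:i+len(a)] == b
                         for i in range(len(x) - len(a) + 1))
                     for a, b in pairs], dtype=float)

def delta(y_true, y):                            # Hamming loss as Delta
    return float(sum(a != b for a, b in zip(y_true, y)))

def most_violated(w, x, y_true):
    """Loss-augmented inference:
    argmax_y Delta(y_true, y) + <w, Psi(x, y) - Psi(x, y_true)>."""
    best, best_val = None, -np.inf
    for yy in product(L_out, repeat=len(x)):
        y = "".join(yy)
        val = delta(y_true, y) + w @ (psi(x, y) - psi(x, y_true))
        if val > best_val:
            best, best_val = y, val
    return best, best_val

def ssvm_subgradient(data, dim, C=1.0, epochs=100, lr=0.05):
    """Subgradient descent on (13): ||w||^2 + (C/l) * sum of hinge terms."""
    w = np.zeros(dim)
    for _ in range(epochs):
        g = 2.0 * w                              # gradient of ||w||^2
        for x, y in data:
            y_hat, val = most_violated(w, x, y)
            if val > 0:                          # hinge term is active
                g += (C / len(data)) * (psi(x, y_hat) - psi(x, y))
        w -= lr * g
    return w

data = [("ab", "01"), ("ba", "10")]              # toy training pairs
print(ssvm_subgradient(data, dim=len(pairs)))
```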
5 Experiments and discussion

Experimental data were taken from the PDB database. From 81756 available protein structures, we extracted 7631 all-alpha proteins. This list was filtered by PISCES (a protein sequence culling server) with the requirements that sequence identity be less than 25%, resolution be less than 3 Å, and the R-factor be less than 0.3. The resulting sample contained 908 proteins.

Each protein in the dataset was represented as a pair of words: the first word is its amino acid sequence, and the second word is the sequence of $\phi,\psi$ angles of its amino acids, discretized with a step of 20 degrees, i.e., $\phi_{discr} = [\phi/20]$, $\psi_{discr} = [\psi/20]$. Therefore, the second word is over an alphabet of 18 symbols. The set of contact triples $\mathcal{A}$ that forms the basis of our model was chosen by a simple procedure: for each triple $(A_1, A_2, \alpha)$, where $A_1, A_2$ are amino acids, we counted the number of times it occurs in our database (i.e., a segment $A_1 \dots A_2$ in the amino acid sequence corresponds to the segment $\alpha$ in the second word) under the condition that the $C_\alpha$ atoms of the amino acids $A_1, A_2$ are located less than 10 Å from each other according to the corresponding PDB file; then we took all triples that were counted more than ten times.

We have three variants of loss functions. Since the $\phi,\psi$ plane has a toroidal structure, we define $H_\phi(\phi_1, \phi_2) = \min\{|\phi_1 - \phi_2|,\ 360 - |\phi_1 - \phi_2|\}$; $H_\psi$ is defined analogously. For the discretized $\phi,\psi$ we define $H^{discr}_\phi(\phi_1, \phi_2) = \min\{|\phi_1 - \phi_2|,\ 18 - |\phi_1 - \phi_2|\}$. Then

$$H\left(\{y_i\}_{i=1,\dots,2n}, \{y'_i\}_{i=1,\dots,2n}\right) \triangleq \sum_{i=1}^{n} H_\phi(y_{2i-1}, y'_{2i-1}) + H_\psi(y_{2i}, y'_{2i}) \qquad (14)$$

and $H^{discr}$ is defined analogously. During learning we can use only the discrete version, but when assessing accuracy on the test set we use the continuous version of the loss function $H$.

The second criterion is $MDA$, which is commonly used for assessing the accuracy of dihedral angle prediction. In the continuous case, it is defined as the percentage of amino acids in a sequence that belong to some continuous segment of length at least 8 on which the predicted dihedral angles do not differ from the real ones by more than 120 degrees, i.e.,

$$MDA\left(\{y_i\}_{i=1,\dots,2n}, \{y'_i\}_{i=1,\dots,2n}\right) \triangleq \frac{|\Omega|}{n} \qquad (15)$$

where $\Omega = \{i \mid \exists k,l:\ k \le i \le l,\ l - k \ge 7,\ \forall s \in [k,l]:\ |y_{2s-1} - y'_{2s-1}| \le 120,\ |y_{2s} - y'_{2s}| \le 120\}$. The discrete version of $MDA$ is defined by changing 120 to 6; we denote it $MDA^{discr}$.

Recall that the structural SVM calls a procedure to maximize $\Delta(y, y') + \langle w, \Psi(x^i, y)\rangle$, where $\Delta$ is a loss function, and this is exactly the place where the inference algorithm is needed for learning. Since $H(y, y')$ can be understood as a unary part of the optimized functional, our inference algorithm can be applied straightforwardly. On the contrary, $MDA$ has a very "global" structure; therefore, instead of maximizing the previous sum, we used a heuristic that reduces to excluding the loss part.

The sample set was divided into three subsets: a training sample, a holdout sample (for fitting some parameters), and a test sample. The training sample was used to train the parameters with fixed $C$, the holdout sample was used to choose $C$, and the resulting prediction accuracy was assessed on the test set. The results of the experiments are given in Table 1. Rows show results for fixed loss functions, and columns show the exact values attained for the accuracy measures on the test set.

Table 1: Experimental results

  Loss function    H_phi (°)   H_psi (°)   MDA (%)   C
  H^discr          22.7        47.9        55.0      64.0
  MDA^discr        22.8        48.3        56.5      4.0
  HMMSTR [4]       -           -           57.1      -
  PHD [17]         -           -           48.0      -
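For concreteness, here is a sketch of the angle discretization and of the two accuracy criteria defined above. It interprets the "difference" of two angles in (14)-(15) as the toroidal distance, which is one reading of the definitions; all inputs are hypothetical flat lists of angles in degrees.

```python
def discretize(angle, step=20):
    """phi_discr = [phi / 20]; maps [-180, 180) to 18 discrete symbols."""
    return int(angle // step)

def h_tor(a1, a2, period=360):
    """Distance on the circle: min(|a1 - a2|, period - |a1 - a2|)."""
    d = abs(a1 - a2)
    return min(d, period - d)

def h_loss(y, y_pred):
    """H of (14); y, y_pred are flat lists [phi_1, psi_1, ..., phi_n, psi_n]."""
    return sum(h_tor(a, b) for a, b in zip(y, y_pred))

def mda(y, y_pred, tol=120, min_len=8):
    """MDA of (15): the fraction of residues lying in some run of at least
    min_len consecutive residues whose phi and psi errors are both <= tol."""
    n = len(y) // 2
    ok = [h_tor(y[2*i], y_pred[2*i]) <= tol and
          h_tor(y[2*i + 1], y_pred[2*i + 1]) <= tol for i in range(n)]
    covered, run = [False] * n, 0
    for i in range(n):
        run = run + 1 if ok[i] else 0
        if run >= min_len:                  # mark the whole current run
            for s in range(i - run + 1, i + 1):
                covered[s] = True
    return sum(covered) / n
```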
In the work of Bystroff et al. [4], whose method has much in common with ours, the MDA value attained 57% on a training set consisting of 618 randomly selected proteins. Such accuracy was achieved by using a special library of structural motifs, called I-sites, generated by clustering all motifs. Also, instead of using amino acid symbols, each position in a protein sequence was associated with an amino acid profile, which significantly improved the overall accuracy. The highest MDA achieved by the first generation of protein structure annotation predictors is approximately 60%. In the second generation of predictors, 64% accuracy was reported. The third generation of predictors, based on deep learning algorithms, predicts secondary structure with over 70% accuracy [23]. Thus, so far our method reproduces the accuracy of first-generation software.

Our algorithm can be improved by adding structural information for amino acids, such as solvent accessibility or secondary structure. We generated the set of patterns by a very simple procedure, and two factors were crucial in defining it: we wanted to choose a set that would cover all local conformational structures at an adequate granularity, while at the same time the cardinality of this set could not be so high as to overfit or become computationally intractable. Both of these problems could be addressed by introducing long "superpatterns", by which we mean clusters of patterns that share the same weight in the energy model (a minimal sketch of this weight tying is given after this paragraph). In such an approach, the problem of clustering the pattern set arises, and it should be addressed before the learning of the weights starts. It is also important to note that our inference algorithm can easily be adapted to the formulation with long superpatterns by simply considering a superpattern as a set of patterns to which one weight is attributed. If the elements of the superpatterns cover all patterns of a fixed length from a discretized database, then the number of such patterns grows proportionally to $N \cdot L \cdot P$, where $N$ is the number of proteins in the database, $L$ is the average length of a protein, and $P$ is the pattern length.
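As a minimal sketch of the weight tying behind superpatterns (the patterns and class assignments here are hypothetical; in practice they would come from clustering the pattern set before learning):

```python
import numpy as np

# Superpatterns as an equivalence relation ~ on the pattern set: patterns in
# the same class share one learned weight, so the model has one parameter
# per class rather than per pattern.

patterns = ["HHHH", "HHHG", "EEEE", "EEES"]         # toy patterns
cls = {"HHHH": 0, "HHHG": 0, "EEEE": 1, "EEES": 1}  # p ~ p' -> same class

def expand(w_cls, patterns, cls):
    """Per-pattern weights w with w_p = w_{p'} whenever p ~ p'."""
    return np.array([w_cls[cls[p]] for p in patterns])

w_cls = np.array([-0.7, 0.3])                    # one parameter per class
w = expand(w_cls, patterns, cls)                 # [-0.7, -0.7, 0.3, 0.3]

# Gradients aggregate accordingly: dL/dw_class is the sum of dL/dw_p over
# the patterns p in the class, exactly as in the class-indexed gradient (7).
g_pattern = np.array([0.5, -0.2, 0.1, 0.4])      # hypothetical per-pattern grads
g_cls = np.zeros(len(w_cls))
np.add.at(g_cls, [cls[p] for p in patterns], g_pattern)
print(w, g_cls)                                  # g_cls = [0.3, 0.5]
```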
6 Conclusions and future work

The goal of our paper was to show the prospects of the pattern-based conditional random field in bioinformatics applications. Hidden Markov Models, currently very popular in protein folding prediction, gene prediction, and other problems with a sequence labeling (tagging) flavor, can be considered a precursor of pattern-based CRFs. By definition, the current state of an HMM is probabilistically determined by the previous state. When there are long-distance interactions between labels on the sequence, the notion of an HMM state must include a lot of contextual structural information. Instead of complicating the notion of a state, we propose to consider long label sequences as analogs of states. But for our formulation, there is a lack of research dedicated to the problems of learning and inference in pattern-based models. We need more efficient algorithms for the inference part of the problem, since in some applications the length of the patterns can be high (for example, a typical helix region of a protein can be 20 amino acids long, so the $\omega,\phi,\psi$ angle sequence of such a region alone can be of length 60 or longer, not including other structural information) and the number of patterns can be very large (in the superpatterns approach it can be comparable to the size of the learning database). We developed an efficient algorithm for this problem based on dynamic programming, with a preprocessing step that tunes the parameters of the algorithm for a concrete pattern set. For the learning part, we used the structural SVM technique.

An important issue to be addressed in future work is how our approach can be developed in the context of bioinformatics applications such as protein folding and gene prediction. Learning the weights is the hardest part of such problems and requires a thorough analysis of maximum likelihood and structured risk minimization parameter learning. Besides, our exact definition of the pattern-based CRF can easily be generalized, both in the direction of including superpatterns and in the direction of defining various areas of dependence for the weights to be learned.

Acknowledgement

This work was supported by the Nazarbayev University collaborative research grant project OPCRP2022002.

References

[1] Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector machines. In Proceedings of the Twentieth International Conference on Machine Learning, pages 3–10, 2003.

[2] C. B. Anfinsen. The formation and stabilization of protein structure. Biochemical Journal, 128(4):737–749, 1972. https://doi.org/10.1042/bj1280737.

[3] Zhenisbek Assylbekov and Rustem Takhanov. Reusing weights in subword-aware neural language models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1413–1423, New Orleans, Louisiana, June 2018. https://doi.org/10.18653/v1/N18-1128.

[4] Christopher Bystroff, Vesteinn Thorsson, and David Baker. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology, 301(1):173–190, 2000. https://doi.org/10.1006/jmbi.2000.3837.

[5] Peter Y. Chou and Gerald D. Fasman. Prediction of protein conformation. Biochemistry, 13(2):222–245, 1974. PMID: 4358940. https://doi.org/10.1021/bi00699a002.

[6] A. G. de Brevern, C. Etchebest, and S. Hazout. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins: Structure, Function, and Bioinformatics, 41(3):271–287, 2000. https://doi.org/10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z.

[7] R. Fletcher. Newton-Like Methods, chapter 3, pages 44–79. John Wiley and Sons, Ltd, 2000. https://doi.org/10.1002/9781118723203.ch3.

[8] J. Garnier, D. J. Osguthorpe, and B. Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. Journal of Molecular Biology, 120(1):97–120, 1978. https://doi.org/10.1016/0022-2836(78)90297-8.

[9] Blaise Gassend, Charles O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas. Learning biophysically-motivated parameters for alpha helix prediction. BMC Bioinformatics, 8(Suppl 5):S3, 2007. https://doi.org/10.1186/1471-2105-8-S5-S3.

[10] Misha Gromov. Crystals, proteins, stability and isoperimetry. Bulletin of the American Mathematical Society, 48(2):229–257, 2011. https://doi.org/10.1090/S0273-0979-2010-01319-7.

[11] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, September 1999. https://doi.org/10.1006/jmbi.1999.3091.

[12] Vladimir Kolmogorov, Michal Rolínek, and Rustem Takhanov. Effectiveness of structural restrictions for hybrid CSPs. In Algorithms and Computation - 26th International Symposium, ISAAC 2015, Proceedings, LNCS, pages 566–577, Germany, January 2015. Springer Verlag. https://doi.org/10.1007/978-3-662-48971-0_48.
[13] Petros Kountouris and Jonathan D. Hirst. Prediction of backbone dihedral angles and protein secondary structure using support vector machines. BMC Bioinformatics, 10(2):437, 2009. https://doi.org/10.1186/1471-2105-10-437.

[14] Jooyoung Lee, Sitao Wu, and Yang Zhang. Ab Initio Protein Structure Prediction, pages 3–25. Springer Netherlands, Dordrecht, 2009. https://doi.org/10.1007/978-1-4020-9058-5_1.

[15] Sebastian Nowozin and Christoph H. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011. https://doi.org/10.1561/0600000033.

[16] Xian Qian, Xiaoqian Jiang, Qi Zhang, Xuanjing Huang, and Lide Wu. Sparse higher order conditional random fields for improved sequence labeling. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 849–856, New York, NY, USA, 2009. Association for Computing Machinery. https://doi.org/10.1145/1553374.1553483.

[17] Burkhard Rost. Better 1D predictions by experts with machines. Proteins: Structure, Function, and Bioinformatics, 29(S1):192–197, 1997. https://doi.org/10.1002/(SICI)1097-0134(1997)1+<192::AID-PROT25>3.0.CO;2-I.

[18] Rustem Takhanov. Hybrid VCSPs with crisp and valued conservative templates. In 28th International Symposium on Algorithms and Computation, ISAAC 2017, volume 92, Germany, December 2017. https://doi.org/10.4230/LIPIcs.ISAAC.2017.65.

[19] Rustem Takhanov. Searching for an algebra on CSP solutions, 2017. arXiv:1708.08292.

[20] Rustem Takhanov and Zhenisbek Assylbekov. Patterns versus characters in subword-aware neural language modeling. In Neural Information Processing, pages 157–166, Cham, 2017. https://doi.org/10.1007/978-3-319-70096-0_17.

[21] Rustem Takhanov and Vladimir Kolmogorov. Inference algorithms for pattern-based CRFs on sequence data. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 145–153, Atlanta, Georgia, USA, 17–19 June 2013. URL: https://proceedings.mlr.press/v28/takhanov13.html.

[22] Rustem Takhanov and Vladimir Kolmogorov. Combining pattern-based CRFs and weighted context-free grammars. Intelligent Data Analysis, 26(1):257–272, 2022. https://doi.org/10.3233/IDA-205623.

[23] Mirko Torrisi, Gianluca Pollastri, and Quan Le. Deep learning methods in protein structure prediction. Computational and Structural Biotechnology Journal, 18:1301–1310, 2020. https://doi.org/10.1016/j.csbj.2019.12.011.

[24] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, page 104, New York, NY, USA, 2004. https://doi.org/10.1145/1015330.1015341.

[25] Yuedong Yang, Jianzhao Gao, Jihua Wang, Rhys Heffernan, Jack Hanson, Kuldip Paliwal, and Yaoqi Zhou. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics, 19(3):482–494, 2016. https://doi.org/10.1093/bib/bbw129.

[26] Nan Ye, Wee Lee, Hai Chieu, and Dan Wu. Conditional random fields with high-order features for sequence labeling. In Advances in Neural Information Processing Systems, volume 22, 2009.
URL: https://proceedings.neurips.cc/paper/2009/file/94f6d7e04a4d452035300f18b984988c-Paper.pdf.