https://doi.org/10.31449/inf.v46i6.3787

Learning the Pattern-based CRF for Prediction of a Protein Local Structure

Zhalgas Mukanov¹ and Rustem Takhanov²
E-mail: mukanovj@mail.ru, rustem.takhanov@nu.edu.kz
¹ Fundamental Mathematics Department, Eurasian National University, 2 Satpayev Str., Nur-Sultan, Kazakhstan
² Mathematics Department, Nazarbayev University, 53 Kabanbay Batyr Ave, Nur-Sultan, Kazakhstan

Keywords: pattern-based CRFs, sequence labeling, protein conformation prediction, energy landscape, structural SVM

Received: October 13, 2021

Prediction of protein conformation from its amino acid sequence is widely acknowledged as one of the most important computational biology problems and is considered a source of interesting problem formulations for machine learning. Here methods of supervised learning stand side by side with statistical physics and information theory. According to the classical results of Anfinsen, protein conformational structure is fully determined by its primary structure, i.e., its amino acid sequence, and energy landscape theory says that the native state of a protein corresponds to the minimum of its free energy [2]. There are two dominating approaches to protein structure prediction: the first is based on minimizing physics-based free energies with some unknown parameters, and the second is a knowledge-based approach that does not necessarily use the notion of free energy and aims only to yield high prediction accuracy [14]. In comparison to these two approaches, there is a deficit of intermediate approaches whose goal is to find knowledge-based parameterizations of free energy that would approximate the real free energy for certain protein families while having a prediction accuracy comparable with pure knowledge-based approaches. According to M. Gromov, if energy landscape theory is true, then "probably, free energy can be encoded with a reasonable accuracy by something like $10^4$–$10^6$ bits of information", and the main mathematical problem here is the lack of "general mathematical 'parameter fitting' method(s), which, when applied to proteins, could provide (an effective version of) the total inter-residue interaction energies" [10]. In this paper, we introduce a probabilistic model based on a certain parametrization of free energy that we expect to be fruitful both for predicting protein dihedral angles and for investigating the structure of the energy landscape. This model is based on the idea that free energy is largely determined by pairwise interactions of amino acids that are located near each other on a protein sequence. Though this approach is far from reality for general proteins, we expect it to approximate an all-alpha protein's energy landscape.

Povzetek: A new method that learns frequent patterns has been developed for determining protein structure.

1 Introduction

Prediction of protein conformation from its amino acid sequence is widely acknowledged as one of the most important computational biology problems and is considered a source of interesting problem formulations for machine learning. Here methods of supervised learning stand side by side with statistical physics and information theory. According to the classical results of Anfinsen, protein conformational structure is fully determined by its primary structure, i.e., its amino acid sequence, and energy landscape theory says that the native state of a protein corresponds to the minimum of its free energy [2].
There are two dominating approaches to protein structure prediction: the first is based on minimizing physics-based free energies with some unknown parameters, and the second is a knowledge-based approach that does not necessarily use the notion of free energy and aims only to yield high prediction accuracy [14]. In comparison to these two approaches, there is a deficit of intermediate approaches whose goal is to find knowledge-based parameterizations of free energy that would approximate the real free energy for certain protein families while having a prediction accuracy comparable with pure knowledge-based approaches. According to M. Gromov, if energy landscape theory is true, then "probably, free energy can be encoded with a reasonable accuracy by something like $10^4$–$10^6$ bits of information", and the main mathematical problem here is the lack of "general mathematical 'parameter fitting' method(s), which, when applied to proteins, could provide (an effective version of) the total inter-residue interaction energies" [10]. In this paper, we introduce a probabilistic model based on a certain parametrization of free energy that we expect to be fruitful both for predicting protein dihedral angles and for investigating the structure of the energy landscape. This model is based on the idea that free energy is largely determined by pairwise interactions of amino acids that are located near each other on a protein sequence. Though this approach is far from reality for general proteins, we expect it to approximate an all-alpha protein's energy landscape.

1.1 Related work

There are plenty of publications in the literature dedicated to the problem of predicting protein backbone dihedral angles. This problem is interesting in two contexts. First, since secondary structure and backbone dihedral angles correlate (especially in the alpha-helix region), these problems are often tackled together and considered adjacent research themes [13]. Second, it has been shown that high-accuracy prediction of secondary structure and dihedral $\phi,\psi$ angles improves the recognition of the so-called fold of a protein [13]. Pioneering works on secondary structure prediction were published in the 1970s [5, 8]. The highest prediction accuracies were achieved in the mid-1990s by machine learning techniques that use multiple sequence alignment with proteins from the PDB database. PSIPred is a popular and high-scoring example of an algorithm of this kind [11]. Later, using a similar representation of a protein as in PSIPred, the idea of simultaneous prediction of secondary structure and backbone dihedral angles was implemented in DISSPred, which has one of the highest accuracies for both of these problems [13]. A survey of the most recent advances in the problem can be found in [25]. Such approaches improve their accuracy as the number of resolved proteins grows, and they weaken if the template protein (the one for which a prediction is made) does not have close homologs among resolved proteins.

In the absence of close homologs, prediction methods based on sequence-structure analysis are considered among the most promising [6, 4]. The success of such techniques rests on the fundamental fact that the complexity and diversity of the local conformational structures observed in proteins are much smaller than sequence complexity.
The first and critical step in such techniques is the choice of an alphabet of structural motifs; then correlations between sequence and the corresponding structure are retrieved from data, and a prediction of the secondary/local structure is made. One way to formalize such a scheme is Hidden Markov Models and their modifications. In the majority of such algorithms, a state of the hidden Markov process formalizes structural information associated with a certain amino acid of a protein, and the transition probabilities between the states of adjacent amino acids are computed from data by standard formulas. HMMSTR is one of the most popular methods based on HMMs [4]. The idea of applying structural SVM machinery to the computation of HMM parameters for secondary structure recognition was implemented in [9, 1].

2 Pattern-based energy

A pattern-based energy potential over words $x \in D^n$ in some alphabet $D$ is defined as

$$E(x) = \sum_{\alpha \in \Gamma} \sum_{\substack{[i,j] \subseteq [1,n] \\ j-i+1=|\alpha|}} f^{\alpha}_{ij} \cdot [x_{i:j} = \alpha] \qquad (1)$$

where $\Gamma \subseteq D^*$ is a fixed set of non-empty words, $|\alpha|$ is the length of the word $\alpha$, and $[\cdot]$ is the Iverson bracket. Here $x_{i:j}$ denotes the subword $x_i \cdots x_j$ of $x$. A pattern-based conditional random field is defined as the probability distribution over words $p(x) \propto e^{-E(x)}$.

Intuitively, pattern-based CRFs allow modeling long-range interactions for selected subsequences of labels. Inference algorithms for pattern-based CRFs were developed in [26, 21], and they include (i) computing the partition function $Z = \sum_x \exp\{-E(x)\}$; (ii) computing the marginal probabilities $p(x_{i:j} = \alpha)$ for all triplets $(i,j,\alpha)$ present in (1); (iii) computing the MAP, i.e., minimizing the energy (1). Note that the MAP problem is a special case of hybrid valued constraint satisfaction problems [12, 18, 19]. Hybrid VCSPs in general can be NP-hard, but minimizing the energy (1) is not only tractable but even solvable in time $O(n)$.

Pattern-based CRFs have already been applied in such contexts as handwritten character recognition, identification of named entities from text [26], optical character recognition [16], and language modeling [3, 20, 22].
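To make these definitions concrete, the following minimal Python sketch (with a toy alphabet and hypothetical pattern weights, purely for illustration) evaluates the energy (1) and the partition function by brute-force enumeration; the $O(n)$ dynamic-programming algorithms of [26, 21] for MAP and marginals are not reproduced here.

```python
from itertools import product
from math import exp

# A minimal sketch of the pattern-based energy (1), assuming a toy alphabet
# and hypothetical weights. Here f maps (alpha, i) -> f^alpha_{ij}, with the
# end position j = i + |alpha| - 1 implicit.

def energy(x, f):
    """E(x) per (1) for a word x over the alphabet D."""
    e = 0.0
    for (alpha, i), w in f.items():
        if x[i:i + len(alpha)] == alpha:     # Iverson bracket [x_{i:j} = alpha]
            e += w
    return e

def partition_function(D, n, f):
    """Z = sum_x exp(-E(x)) by enumeration; exponential in n, checking only."""
    return sum(exp(-energy("".join(x), f)) for x in product(D, repeat=n))

D, n = "ab", 5                               # toy alphabet and word length
f = {("ab", 0): -1.0, ("aa", 2): 2.0}        # hypothetical pattern weights
print(energy("abbaa", f))                    # -1.0: only pattern ("ab", 0) fires
print(partition_function(D, n, f))           # normalizer of p(x) ~ exp(-E(x))
```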
3 Pattern-based CRFs for the sequence labeling problem

One of the important classes of pattern recognition problems with structural outputs is sequence labeling. In such problems, we are given two finite alphabets, $A$ (input alphabet) and $L$ (output labels). A training set consists of pairs of the form $(x,y)$, where $x$ is a word over $A$, $y$ is a word over $L$, and both words have the same length, i.e., the letters of the first word are tagged by labels. The goal is to construct mappings $m: A^n \to L^n$, $n = 1, 2, \dots$, that are consistent both with the training set and with some supplementary model requirements. Examples of such formulations can be found in protein folding (secondary structure prediction), natural language processing (part-of-speech tagging), and speech recognition. Popular methods for sequence labeling include hidden and maximum entropy Markov models. In this paper, we describe a generalized version of our model and show how the two key problems of inference and learning can be efficiently solved in this framework.

Definition. A pair $(\alpha,\beta)$, where $\alpha$ is a word over $A$ and $\beta$ is a word over $L$ of the same length, is called a pattern pair.

Suppose we are given a pattern pair $(\alpha,\beta)$ and tuples $x = (x_1,\dots,x_n) \in A^n$ and $y = (y_1,\dots,y_n) \in L^n$. Then by $(\alpha,\beta) \vdash_i (x,y)$ we denote that $x_{i:i+|\alpha|-1} = \alpha$ and $y_{i:i+|\alpha|-1} = \beta$.

Suppose we are given a finite set of pattern pairs $\mathcal{A}$. Then, for any tuples $x = (x_1,\dots,x_n) \in A^n$ and $y = (y_1,\dots,y_n) \in L^n$, $\Psi_{\mathcal{A}}(x,y)$ denotes a vector with components indexed by $\mathcal{A}$; for $(\alpha,\beta) \in \mathcal{A}$, the $(\alpha,\beta)$-component is equal to the number of $i \in \{1,\dots,n\}$ for which $(\alpha,\beta) \vdash_i (x,y)$.

If every pattern pair from $\mathcal{A}$ is assigned a real value, then the parameters $w = \{w_p\}_{p \in \mathcal{A}}$ define a conditional probability distribution $p(y|x,w)$, where $x \in A^n$, $y \in L^n$:

$$p(y|x,w) = \frac{e^{-\langle w, \Psi_{\mathcal{A}}(x,y)\rangle}}{Z(x,w)} \qquad (2)$$

where $Z(x,w) = \sum_y e^{-\langle w, \Psi_{\mathcal{A}}(x,y)\rangle}$. It is easy to see that this family of conditional probability distributions is a special case of the pattern-based CRF.

Two computational problems are of primary interest in the framework of conditional random fields: inference and learning. Inference for the pattern-based CRF is equivalent to minimization of the energy:

$$\arg\max_y p(y|x,w) = \arg\min_y \langle w, \Psi_{\mathcal{A}}(x,y)\rangle. \qquad (3)$$

If we rewrite our energy term as

$$\langle w, \Psi_{\mathcal{A}}(x,y)\rangle = \sum_{p \in \mathcal{A}} \sum_{i:\, p \vdash_i (x,y)} w_p = \sum_{i=1}^{n} \sum_{p \in \mathcal{A}:\, p \vdash_i (x,y)} w_p \qquad (4)$$

we see that inference is equivalent to minimization of a pattern-based functional over $y$ with $f^{\alpha}_{ij} = \sum_{(\alpha,\beta) \in \mathcal{A}:\, (\alpha,\beta) \vdash_i (x,y)} w_{(\alpha,\beta)}$.

Given a training sample $(x^i, y^i)_{i=1,\dots,n}$, maximum likelihood parameter learning is the following task:

$$\prod_{i=1}^{n} p(y^i | x^i, w) \to \max_w \qquad (5)$$

which, after standard transformations, is equivalent to

$$L(w) = \sum_{i=1}^{n} \left[ \langle w, \Psi_{\mathcal{A}}(x^i, y^i)\rangle + \log Z(x^i, w) \right] \to \min_w. \qquad (6)$$

Let us consider a generalized version of the pattern-based CRF obtained by adding an equivalence relation $\sim$ on the pattern set $\mathcal{A}$. Then we have additional constraints on the learned weights: if $p \sim p'$, then $w_p = w_{p'}$. Such models are introduced to address the problem of overfitting when $|\mathcal{A}|$ becomes large. From now on, we will assume that the weights are indexed by the equivalence classes of $\sim$.

Since our energy is parameterized linearly, $L(w)$ is convex. Therefore we can optimize it by gradient-based optimization (first-order or second-order gradient descent) [15]. The components of the gradient and Hessian of $L(w)$ are equal to the following sums:

$$[\nabla_w L(w)]_c = \sum_{i=1}^{n} \sum_{p \in c} \left( \Psi_{\mathcal{A}}(x^i, y^i)_p - \mathbb{E}_{y \sim p(y|x^i,w)}\, \Psi_{\mathcal{A}}(x^i, y)_p \right) \qquad (7)$$

$$[H_w L(w)]_{cc'} = \sum_{i=1}^{n} \sum_{p \in c,\, p' \in c'} \left( \mathbb{E}_{y \sim p(y|x^i,w)}\!\left[ \Psi_{\mathcal{A}}(x^i, y)_p\, \Psi_{\mathcal{A}}(x^i, y)_{p'} \right] - \mathbb{E}_{y \sim p(y|x^i,w)}\, \Psi_{\mathcal{A}}(x^i, y)_p \cdot \mathbb{E}_{y' \sim p(y'|x^i,w)}\, \Psi_{\mathcal{A}}(x^i, y')_{p'} \right) \qquad (8)$$

where

$$\Psi_{\mathcal{A}}(x^i, y^i)_p = \left| \{ k : p \vdash_k (x^i, y^i) \} \right|. \qquad (9)$$

It is easy to see that computing the gradient and Hessian reduces to the computation of marginal probabilities $p(y_{k:l} = \alpha \,|\, x^i, w)$ in the pattern-based CRF $p(y | x^i, w)$. Thus, we can apply the BFGS algorithm [7] to solve (6) and learn the weights of the patterns in pattern-based CRFs.
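The following self-contained sketch illustrates the learning pipeline of this section on a toy problem: the feature map $\Psi_{\mathcal{A}}$, the objective (6), and the gradient (7), with $\log Z$ and the expectations computed by exhaustive enumeration over $y$ (a real implementation would use the dynamic-programming inference of [26, 21]). The pattern pairs and training data are hypothetical placeholders.

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

A_in, L_out = "ab", "01"                       # input alphabet A, label alphabet L
pairs = [("a", "0"), ("b", "1"), ("ab", "01")] # hypothetical pattern pairs (alpha, beta)

def psi(x, y):
    """Psi_A(x, y): counts of positions i where (alpha, beta) |-_i (x, y)."""
    v = np.zeros(len(pairs))
    for k, (a, b) in enumerate(pairs):         # |alpha| = |beta| for each pair
        v[k] = sum(x[i:i+len(a)] == a and y[i:i+len(a)] == b
                   for i in range(len(x) - len(a) + 1))
    return v

def nll_and_grad(w, data):
    """L(w) of (6) and its gradient (7), by exhaustive summation over y."""
    loss, grad = 0.0, np.zeros_like(w)
    for x, y in data:
        feats = {"".join(yy): psi(x, "".join(yy))
                 for yy in product(L_out, repeat=len(x))}
        scores = {yy: -w @ f for yy, f in feats.items()}    # -<w, Psi_A(x, y)>
        logZ = np.logaddexp.reduce(list(scores.values()))
        loss += w @ psi(x, y) + logZ
        # E_{y ~ p(y|x,w)} Psi_A(x, y), with p(y|x,w) = exp(score - logZ):
        expect = sum(np.exp(s - logZ) * feats[yy] for yy, s in scores.items())
        grad += psi(x, y) - expect
    return loss, grad

data = [("ab", "01"), ("ba", "10")]            # toy training pairs
res = minimize(nll_and_grad, np.zeros(len(pairs)), args=(data,),
               jac=True, method="BFGS")        # BFGS, as suggested above
print(res.x)                                   # learned pattern weights
```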
4 Pattern-based conditional random field for dihedral angles prediction

Suppose we have two ordered sets of variables $X = \{X_1,\dots,X_n\}$ and $Y = \{Y_1,\dots,Y_{2n}\}$. The variables $X_i$ take their values in the set of amino acid symbols {Ala, Arg, ..., Val, Sec, Pyl}, and we interpret any assignment of $X$ as an amino acid sequence. Every amino acid in a protein corresponds to two dihedral angles $\phi$ and $\psi$ (Figure 1), and we interpret $Y_{2i-1}$ and $Y_{2i}$ as $\phi$ and $\psi$ for amino acid $X_i$. Before any experiment, we discretize the interval $[-180°, 180°]$ and therefore always assume that $Y_i$ takes its values in some fixed finite set $D$.

Figure 1: The angles $\phi$, $\psi$, and $\omega$.

A triple $(A_1, A_2, \alpha)$, where $A_1, A_2$ are amino acid symbols and $\alpha$ is a word of even length over the alphabet $D$, is called a local contact triple. The name contact triple is related to the following interpretation: suppose that the values of the $\phi,\psi$ angles can, within a certain accuracy, be predicted based on the local interaction of amino acids (by local interaction we understand the interaction between amino acids that are closely located on a sequence). Then the mutual location of two amino acids can be described by the list of dihedral angles between them. Here we neglect the $\omega$ angles and the torsion angles of these amino acids.

Suppose we are given a finite set of contact triples $\mathcal{A}$. Then, for any pair $X = \{X_1,\dots,X_n\}$ and $Y = \{Y_1,\dots,Y_{2n}\}$, $\Psi_{\mathcal{A}}(X,Y)$ denotes a vector with components indexed by $\mathcal{A}$; for $(A_1,A_2,\alpha) \in \mathcal{A}$, the $(A_1,A_2,\alpha)$-component is equal to the number of $i \in \{1,\dots,n\}$ for which $X_i = A_1$, $X_{i+|\alpha|/2-1} = A_2$, and $Y_{2i-1} Y_{2i} \cdots Y_{2i+|\alpha|-3} Y_{2i+|\alpha|-2} = \alpha$.

Suppose the pairs $X, Y$ are drawn from some probability distribution $p(Y|X,w)$ with unknown parameters $w = \{w_p\}_{p \in \mathcal{A}}$. The family of probability distributions

$$-\log p(Y|X,w) = C(X) + \langle w, \Psi_{\mathcal{A}}(X,Y)\rangle \qquad (10)$$

where $C(X) = \log \sum_Y e^{-\langle w, \Psi_{\mathcal{A}}(X,Y)\rangle}$, is our pattern-based conditional random field.

Denoting $E_w(X,Y) = \langle w, \Psi_{\mathcal{A}}(X,Y)\rangle$, it is easy to see that

$$\arg\max_Y p(Y|X,w) = \arg\min_Y E_w(X,Y). \qquad (11)$$

Therefore, if the parameters are known ($w = w^*$), inference for an input string $X$ is equivalent to minimization of $\langle w, \Psi_{\mathcal{A}}(X,Y)\rangle$. In the following section, we describe the structural SVM procedure that we used to obtain the weights.

4.1 Learning parameters by structural SVM

Given a training set $\{(x^i, y^i)\}_{i=1}^{l}$, the structural risk minimization approach reduces the problem of learning CRFs to minimization of the following functional:

$$\frac{C}{l} \sum_{i=1}^{l} \Delta(y^i, f_w(x^i)) \to \min_w \qquad (12)$$

where $f_w(x^i) = \arg\min_y \langle w, \Psi(x^i, y)\rangle$ and $\Delta: Y \times Y \to \mathbb{N}$ is a loss function on the output set $Y$. The latter problem is very hard in general, and a standard way to tackle it is to replace the functional with a convex upper bound [15]. Following this scheme, we used the structural SVM technique for learning pattern-based CRFs. After the change $\Psi(x,y) \to -\Psi(x,y)$, the structural SVM functional is

$$\|w\|^2 + \frac{C}{l} \sum_{i=1}^{l} \max_{y \in Y} \left( \Delta(y^i, y) + \langle w, \Psi(x^i, y) - \Psi(x^i, y^i)\rangle \right). \qquad (13)$$

We used the software by T. Joachims [24], which is based on the following quadratic programming reformulation:

$$\|w\|^2 + C \sum_{i=1}^{l} \xi_i \to \min_{w,\xi}$$
$$\langle w, \Psi(x^i, y^i) - \Psi(x^i, y)\rangle \ge \Delta(y^i, y) - \xi_i, \quad \forall y \in Y,\ i = 1,\dots,l,$$
$$\xi_i \ge 0, \quad i = 1,\dots,l.$$

Very roughly, this functional can be interpreted as follows. For the optimal parameter $w$ we want $y^i \approx \arg\max_y \langle w, \Psi(x^i, y)\rangle$, and therefore $\langle w, \Psi(x^i, y^i) - \Psi(x^i, y)\rangle \ge 0$ whenever $\Delta(y^i, y)$ is large (greater than the resulting slack $\xi_i$), whereas in the neighborhood of $y^i$ (i.e., when $\Delta(y^i, y) \le \xi_i$; we can call such neighborhoods "uncertainty neighborhoods") the difference $\langle w, \Psi(x^i, y^i) - \Psi(x^i, y)\rangle$ can be negative. The goal is to lower the sum of the radii of the "uncertainty neighborhoods" plus the regularization term $\|w\|^2$.
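The experiments below rely on the SVMstruct package of Joachims [24]; purely as an illustration of the objective (13), here is a toy subgradient sketch with brute-force loss-augmented inference and a simple Hamming loss standing in for $\Delta$ (pattern pairs, data, and hyperparameters are all hypothetical).

```python
import numpy as np
from itertools import product

L_out = "01"
pairs = [("a", "0"), ("b", "1")]                 # hypothetical pattern pairs

def psi(x, y):
    """Psi(x, y): per-pattern-pair match counts, as in Section 3."""
    return np.array([sum(x[i:i+len(a)] == a and y[i:i+len(a)] == b
                         for i in range(len(x) - len(a) + 1))
                     for a, b in pairs], dtype=float)

def delta(y_true, y):                            # Hamming loss as Delta
    return float(sum(a != b for a, b in zip(y_true, y)))

def most_violated(w, x, y_true):
    """Loss-augmented inference:
    argmax_y Delta(y_true, y) + <w, Psi(x, y) - Psi(x, y_true)>."""
    best, best_val = None, -np.inf
    for yy in product(L_out, repeat=len(x)):
        y = "".join(yy)
        val = delta(y_true, y) + w @ (psi(x, y) - psi(x, y_true))
        if val > best_val:
            best, best_val = y, val
    return best, best_val

def ssvm_subgradient(data, dim, C=1.0, epochs=100, lr=0.05):
    """Subgradient descent on (13): ||w||^2 + (C/l) * sum of hinge terms."""
    w = np.zeros(dim)
    for _ in range(epochs):
        g = 2.0 * w                              # gradient of ||w||^2
        for x, y in data:
            y_hat, val = most_violated(w, x, y)
            if val > 0:                          # hinge term is active
                g += (C / len(data)) * (psi(x, y_hat) - psi(x, y))
        w -= lr * g
    return w

data = [("ab", "01"), ("ba", "10")]              # toy training pairs
print(ssvm_subgradient(data, dim=len(pairs)))
```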
5 Experiments and discussion

Experimental data were taken from the PDB database. From 81756 available protein structures, we extracted 7631 all-alpha proteins. This list was filtered by PISCES (a protein sequence culling server) with the requirements that sequence identity be less than 25%, resolution be less than 3 Å, and the R-factor be less than 0.3. The resulting sample contained 908 proteins.

Each protein in the dataset was represented as a pair of words: the first word is its amino acid sequence, and the second word is the sequence of $\phi,\psi$ angles of its amino acids, discretized with a step of 20 degrees, i.e., $\phi_{discr} = [\phi/20]$, $\psi_{discr} = [\psi/20]$. Therefore, the second word is over an alphabet of 18 symbols. The set of contact triples $\mathcal{A}$ that forms the basis of our model was chosen by a simple procedure: for each triple $(A_1, A_2, \alpha)$, where $A_1, A_2$ are amino acids, we counted the number of times it occurs in our database (i.e., a segment $A_1 \dots A_2$ in the amino acid sequence corresponds to the segment $\alpha$ in the second word) under the condition that the $C_\alpha$ atoms of the amino acids $A_1, A_2$ are located less than 10 Å from each other according to the corresponding PDB file; then we took all triples that were counted more than ten times.

We have three variants of loss functions. Since the $\phi,\psi$ plane has a toroidal structure, we define $H_\phi(\phi_1, \phi_2) = \min\{|\phi_1 - \phi_2|,\ 360 - |\phi_1 - \phi_2|\}$; $H_\psi$ is defined analogously. For the discretized $\phi,\psi$ we define $H^{discr}_\phi(\phi_1, \phi_2) = \min\{|\phi_1 - \phi_2|,\ 18 - |\phi_1 - \phi_2|\}$. Then

$$H\left(\{y_i\}_{i=1,\dots,2n}, \{y'_i\}_{i=1,\dots,2n}\right) \triangleq \sum_{i=1}^{n} H_\phi(y_{2i-1}, y'_{2i-1}) + H_\psi(y_{2i}, y'_{2i}) \qquad (14)$$

and $H^{discr}$ is defined analogously. During learning we can use only the discrete version, but when assessing accuracy on the test set we use the continuous version of the loss function $H$.

The second criterion is $MDA$, which is commonly used for assessing the accuracy of dihedral angle prediction. In the continuous case, it is defined as the percentage of amino acids in a sequence that belong to some continuous segment of length at least 8 on which the predicted dihedral angles do not differ from the real ones by more than 120 degrees, i.e.,

$$MDA\left(\{y_i\}_{i=1,\dots,2n}, \{y'_i\}_{i=1,\dots,2n}\right) \triangleq \frac{|\Omega|}{n} \qquad (15)$$

where $\Omega = \{i \mid \exists k,l:\ k \le i \le l,\ l - k \ge 7,\ \forall s \in [k,l]:\ |y_{2s-1} - y'_{2s-1}| \le 120,\ |y_{2s} - y'_{2s}| \le 120\}$. The discrete version of $MDA$ is defined by changing 120 to 6; we denote it $MDA^{discr}$.

Recall that the structural SVM calls a procedure to maximize $\Delta(y, y') + \langle w, \Psi(x^i, y)\rangle$, where $\Delta$ is a loss function, and this is exactly the place where the inference algorithm is needed for learning. Since $H(y, y')$ can be understood as a unary part of the optimized functional, our inference algorithm can be applied straightforwardly. On the contrary, $MDA$ has a very "global" structure; therefore, instead of maximizing the previous sum, we used a heuristic that reduces to excluding the loss part.

The sample set was divided into three subsets: a training sample, a holdout sample (for fitting some parameters), and a test sample. The training sample was used to train the parameters with fixed $C$, the holdout sample was used to choose $C$, and the resulting prediction accuracy was assessed on the test set. The results of the experiments are given in Table 1. Rows show results for fixed loss functions, and columns show the exact values attained for the accuracy measures on the test set.

Table 1: Experimental results

  Loss function    H_phi (°)   H_psi (°)   MDA (%)   C
  H^discr          22.7        47.9        55.0      64.0
  MDA^discr        22.8        48.3        56.5      4.0
  HMMSTR [4]       -           -           57.1      -
  PHD [17]         -           -           48.0      -
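For concreteness, here is a sketch of the angle discretization and of the two accuracy criteria defined above. It interprets the "difference" of two angles in (14)-(15) as the toroidal distance, which is one reading of the definitions; all inputs are hypothetical flat lists of angles in degrees.

```python
def discretize(angle, step=20):
    """phi_discr = [phi / 20]; maps [-180, 180) to 18 discrete symbols."""
    return int(angle // step)

def h_tor(a1, a2, period=360):
    """Distance on the circle: min(|a1 - a2|, period - |a1 - a2|)."""
    d = abs(a1 - a2)
    return min(d, period - d)

def h_loss(y, y_pred):
    """H of (14); y, y_pred are flat lists [phi_1, psi_1, ..., phi_n, psi_n]."""
    return sum(h_tor(a, b) for a, b in zip(y, y_pred))

def mda(y, y_pred, tol=120, min_len=8):
    """MDA of (15): the fraction of residues lying in some run of at least
    min_len consecutive residues whose phi and psi errors are both <= tol."""
    n = len(y) // 2
    ok = [h_tor(y[2*i], y_pred[2*i]) <= tol and
          h_tor(y[2*i + 1], y_pred[2*i + 1]) <= tol for i in range(n)]
    covered, run = [False] * n, 0
    for i in range(n):
        run = run + 1 if ok[i] else 0
        if run >= min_len:                  # mark the whole current run
            for s in range(i - run + 1, i + 1):
                covered[s] = True
    return sum(covered) / n
```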
In the work of Bystroff et al. [4], whose method has much in common with ours, the MDA value attained 57% on a training set consisting of 618 randomly selected proteins. Such accuracy was achieved by using a special library of structural motifs, called I-sites, generated by clustering all motifs. Also, instead of using amino acid symbols, each position in a protein sequence was associated with an amino acid profile, which significantly improved the overall accuracy. The highest MDA achieved by the first generation of protein structure annotation predictors is approximately 60%. In the second generation of predictors, 64% accuracy was reported. The third generation of predictors, based on deep learning algorithms, predicts secondary structure with over 70% accuracy [23]. Thus, so far our method reproduces the accuracy of first-generation software.

Our algorithm can be improved by adding structural information for amino acids, such as solvent accessibility or secondary structure. We generated the set of patterns by a very simple procedure, and two factors were crucial in defining it: we wanted to choose a set that would cover all local conformational structures at an adequate granularity, while at the same time the cardinality of this set could not be so high as to overfit or become computationally intractable. Both of these problems could be addressed by introducing long "superpatterns", by which we mean clusters of patterns that share the same weight in the energy model (a minimal sketch of this weight tying is given after this paragraph). In such an approach, the problem of clustering the pattern set arises, and it should be addressed before the learning of the weights starts. It is also important to note that our inference algorithm can easily be adapted to the formulation with long superpatterns by simply considering a superpattern as a set of patterns to which one weight is attributed. If the elements of the superpatterns cover all patterns of a fixed length from a discretized database, then the number of such patterns grows proportionally to $N \cdot L \cdot P$, where $N$ is the number of proteins in the database, $L$ is the average length of a protein, and $P$ is the pattern length.
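As a minimal sketch of the weight tying behind superpatterns (the patterns and class assignments here are hypothetical; in practice they would come from clustering the pattern set before learning):

```python
import numpy as np

# Superpatterns as an equivalence relation ~ on the pattern set: patterns in
# the same class share one learned weight, so the model has one parameter
# per class rather than per pattern.

patterns = ["HHHH", "HHHG", "EEEE", "EEES"]         # toy patterns
cls = {"HHHH": 0, "HHHG": 0, "EEEE": 1, "EEES": 1}  # p ~ p' -> same class

def expand(w_cls, patterns, cls):
    """Per-pattern weights w with w_p = w_{p'} whenever p ~ p'."""
    return np.array([w_cls[cls[p]] for p in patterns])

w_cls = np.array([-0.7, 0.3])                    # one parameter per class
w = expand(w_cls, patterns, cls)                 # [-0.7, -0.7, 0.3, 0.3]

# Gradients aggregate accordingly: dL/dw_class is the sum of dL/dw_p over
# the patterns p in the class, exactly as in the class-indexed gradient (7).
g_pattern = np.array([0.5, -0.2, 0.1, 0.4])      # hypothetical per-pattern grads
g_cls = np.zeros(len(w_cls))
np.add.at(g_cls, [cls[p] for p in patterns], g_pattern)
print(w, g_cls)                                  # g_cls = [0.3, 0.5]
```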
6 Conclusions and future work

The goal of our paper was to show the prospects of the pattern-based conditional random field in bioinformatics applications. Hidden Markov Models, currently very popular in protein folding prediction, gene prediction, and other problems with a sequence labeling (tagging) flavor, can be considered a precursor of pattern-based CRFs. By definition, the current state of an HMM is probabilistically determined by the previous state. When there are long-distance interactions between labels on the sequence, the notion of an HMM state must include a lot of contextual structural information. Instead of complicating the notion of a state, we propose to consider long label sequences as analogs of states. But for our formulation, there is a lack of research dedicated to the problems of learning and inference in pattern-based models. We need more efficient algorithms for the inference part of the problem, since in some applications the length of the patterns can be high (for example, a typical helix region of a protein can be 20 amino acids long, so the $\omega,\phi,\psi$ angle sequence of such a region alone can be of length 60 or longer, not including other structural information) and the number of patterns can be very large (in the superpatterns approach it can be comparable to the size of the learning database). We developed an efficient algorithm for this problem based on dynamic programming, with a preprocessing step that tunes the parameters of the algorithm for a concrete pattern set. For the learning part, we used the structural SVM technique.

An important issue to be addressed in future work is how our approach can be developed in the context of bioinformatics applications such as protein folding and gene prediction. Learning the weights is the hardest part of such problems and requires a thorough analysis of maximum likelihood and structured risk minimization parameter learning. Besides, our exact definition of the pattern-based CRF can easily be generalized, both in the direction of including superpatterns and in the direction of defining various areas of dependence for the weights to be learned.

Acknowledgement

This work was supported by the Nazarbayev University collaborative research grant project OPCRP2022002.

References

[1] Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector machines. In Proceedings of the Twentieth International Conference on Machine Learning, pages 3–10, 2003.

[2] C. B. Anfinsen. The formation and stabilization of protein structure. Biochemical Journal, 128(4):737–749, 1972. https://doi.org/10.1042/bj1280737.

[3] Zhenisbek Assylbekov and Rustem Takhanov. Reusing weights in subword-aware neural language models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1413–1423, New Orleans, Louisiana, June 2018. https://doi.org/10.18653/v1/N18-1128.

[4] Christopher Bystroff, Vesteinn Thorsson, and David Baker. HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology, 301(1):173–190, 2000. https://doi.org/10.1006/jmbi.2000.3837.

[5] Peter Y. Chou and Gerald D. Fasman. Prediction of protein conformation. Biochemistry, 13(2):222–245, 1974. PMID: 4358940. https://doi.org/10.1021/bi00699a002.

[6] A. G. de Brevern, C. Etchebest, and S. Hazout. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins: Structure, Function, and Bioinformatics, 41(3):271–287, 2000. https://doi.org/10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z.

[7] R. Fletcher. Newton-Like Methods, chapter 3, pages 44–79. John Wiley and Sons, Ltd, 2000. https://doi.org/10.1002/9781118723203.ch3.

[8] J. Garnier, D. J. Osguthorpe, and B. Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. Journal of Molecular Biology, 120(1):97–120, 1978. https://doi.org/10.1016/0022-2836(78)90297-8.

[9] Blaise Gassend, Charles O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas. Learning biophysically-motivated parameters for alpha helix prediction. BMC Bioinformatics, 8(Suppl 5):S3, 2007. https://doi.org/10.1186/1471-2105-8-S5-S3.

[10] Misha Gromov. Crystals, proteins, stability and isoperimetry. Bulletin of the American Mathematical Society, 48(2):229–257, 2011. https://doi.org/10.1090/S0273-0979-2010-01319-7.

[11] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, September 1999. https://doi.org/10.1006/jmbi.1999.3091.

[12] Vladimir Kolmogorov, Michal Rolínek, and Rustem Takhanov. Effectiveness of structural restrictions for hybrid CSPs. In Algorithms and Computation - 26th International Symposium, ISAAC 2015, Proceedings, LNCS, pages 566–577, Germany, January 2015. Springer Verlag. https://doi.org/10.1007/978-3-662-48971-0_48.
[13] Petros Kountouris and Jonathan D. Hirst. Prediction of backbone dihedral angles and protein secondary structure using support vector machines. BMC Bioinformatics, 10(2):437, 2009. https://doi.org/10.1186/1471-2105-10-437.

[14] Jooyoung Lee, Sitao Wu, and Yang Zhang. Ab Initio Protein Structure Prediction, pages 3–25. Springer Netherlands, Dordrecht, 2009. https://doi.org/10.1007/978-1-4020-9058-5_1.

[15] Sebastian Nowozin and Christoph H. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011. https://doi.org/10.1561/0600000033.

[16] Xian Qian, Xiaoqian Jiang, Qi Zhang, Xuanjing Huang, and Lide Wu. Sparse higher order conditional random fields for improved sequence labeling. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 849–856, New York, NY, USA, 2009. Association for Computing Machinery. https://doi.org/10.1145/1553374.1553483.

[17] Burkhard Rost. Better 1D predictions by experts with machines. Proteins: Structure, Function, and Bioinformatics, 29(S1):192–197, 1997. https://doi.org/10.1002/(SICI)1097-0134(1997)1+<192::AID-PROT25>3.0.CO;2-I.

[18] Rustem Takhanov. Hybrid VCSPs with crisp and valued conservative templates. In 28th International Symposium on Algorithms and Computation, ISAAC 2017, volume 92, Germany, December 2017. https://doi.org/10.4230/LIPIcs.ISAAC.2017.65.

[19] Rustem Takhanov. Searching for an algebra on CSP solutions, 2017. arXiv:1708.08292.

[20] Rustem Takhanov and Zhenisbek Assylbekov. Patterns versus characters in subword-aware neural language modeling. In Neural Information Processing, pages 157–166, Cham, 2017. https://doi.org/10.1007/978-3-319-70096-0_17.

[21] Rustem Takhanov and Vladimir Kolmogorov. Inference algorithms for pattern-based CRFs on sequence data. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 145–153, Atlanta, Georgia, USA, 17–19 June 2013. URL: https://proceedings.mlr.press/v28/takhanov13.html.

[22] Rustem Takhanov and Vladimir Kolmogorov. Combining pattern-based CRFs and weighted context-free grammars. Intelligent Data Analysis, 26(1):257–272, 2022. https://doi.org/10.3233/IDA-205623.

[23] Mirko Torrisi, Gianluca Pollastri, and Quan Le. Deep learning methods in protein structure prediction. Computational and Structural Biotechnology Journal, 18:1301–1310, 2020. https://doi.org/10.1016/j.csbj.2019.12.011.

[24] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, page 104, New York, NY, USA, 2004. https://doi.org/10.1145/1015330.1015341.

[25] Yuedong Yang, Jianzhao Gao, Jihua Wang, Rhys Heffernan, Jack Hanson, Kuldip Paliwal, and Yaoqi Zhou. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics, 19(3):482–494, 2016. https://doi.org/10.1093/bib/bbw129.

[26] Nan Ye, Wee Lee, Hai Chieu, and Dan Wu. Conditional random fields with high-order features for sequence labeling. In Advances in Neural Information Processing Systems, volume 22, 2009.
URL: https://proceedings.neurips.cc/paper/2009/file/94f6d7e04a4d452035300f18b984988c-Paper.pdf.