https://doi.org/10.31449/inf.v43i4.2621 Informatica 43 (2019) 527–534 527 Refin-Align: New Refinement Algorithm for Multiple Sequence Alignment Ahmed Mokaddem, Amine Bel Hadj and Mourad Elloumi Laboratory of Technologies of Information and Communication, and Electrical Engineering, University of Tunis, Tunisia E-mail:moka.ahmed@yahoo.fr, amine_bel_hadj_90@yahoo.com, Mourad.Elloumi@gmail.com Keywords: multiple sequence alignment, algorithms, refinement, block Received: December 17, 2018 In this paper, we present Refin-Align a new refinement algorithm for a multiple sequence alignment. Re- fining alignment consists on constructing a new more accurate multiple sequence alignment from an initial one by applying some modifications. Our refinement algorithm Refin-Align uses a new definition of block and also our multiple sequence alignment algorithm Pro-malign. We assess our algorithm Refin-Align on multiple sequence alignment constructed by different algorithms using different benchmarks of protein se- quences. In the most cases treated, our algorithm improves the scores of the multiple sequence alignment. Povzetek: Razliˇ cni znani algoritmi napovedujejo zaporedje beljakovin, novo razviti algoritem pa doloˇ ca najboljšo skupno vrednost na osnovi napovedi posameznih algoritmov. 1 Introduction Multiple sequence alignment is an important task in bioin- formatics. Aligning a set of sequences consists in optimiz- ing the number of matches between the characters occur- ring in the same order in each sequence (figure1). Figure 1: Multiple sequence alignment. Multiple sequence alignment can help biologist to pre- dict structure and function information for a set of se- quences. Indeed, we can reveal information about biolog- ical functions common to biological macromolecules from several different organisms by identifying similar regions, these regions are often an important structural or functional roles. Multiple sequence alignment can also help in the classification of macromolecules into different families ac- cording to similar sub-strings detected. In addition, mul- tiple sequence alignment can help to construct a phyloge- netic tree and analyse relationships between species in or- der to establish a common biological ancestor. Although pairwise sequence alignment for two se- quences can be constructed with optimal solution using the dynamic programming algorithm [1], multiple sequence alignment for more than two sequences is a NP-complete problem [2]. There are two main approaches to resolve this problem: 1. Progressive approach: it consist to align sequences gradually. Indeed, we start by aligning the most simi- lar two sequences. Then, we align the sequences to other sequences aligned, according to a defined or- der. Finally, we obtain the multiple sequence align- ment. All progressive multiple sequence alignment al- gorithms adopt the same process. The most used pro- gressive multiple sequence algorithms are ClustalW [3], T-COFFEE [4], MUSCLE [5], MAFFT [6], GL- Probs [7] and Clustal Omega [8]. Progressive approach operates in three steps: (a) In the first step, we compute distances between all pairs of sequences of the set and we store these distances in a matrix called distance ma- trix. This step aims to estimate the similarity be- tween pairs of sequences in order to distinguish the two sequences that are the first to be aligned. Many distances are used [9]. Among these dis- tances we mention: – k-mer distances used by the algorithm MUSCLE and MAFFT, – Percent of similarity used by the algorithm ClustalW, – Kimura distance [10] used by the algorithm Clustal Omega, – Distance defined by the GLProbs algorithm. (b) In the second step, we construct a guide tree us- ing the distance Matrix. This step aims to define the order of aligning sequences. Two main algo- rithms are used to construct a guide tree: – UPGMA [11] used by MUSCLE, MAFFT and GLProbs 528 Informatica 43 (2019) 527–534 A. Mokaddem et al. – Neighbor-joining [12] used by ClustalW and T-COFFEE (c) In the last step, we follow the branching or- der of the guide tree, constructed in the pre- vious step, to construct the multiple sequence alignment by aligning pair of sequences using the dynamic programming algorithm[1] or by a profile-profile[3] alignment. A profile is constructed by selecting for each col- umn of the sequence alignment the character that have the maximum occurrences in that column (Figure 2). Figure 2: Profile construction. 2. Iterative approach: it consists to construct an initial multiple sequence alignment. Then, we apply a num- ber of iterations, during each iteration we perform a set of modifications to the current alignment in order to ameliorate his score. Among this modifications, we can insert or delete of one or more gaps ’-’ in one or more position in the multiple of sequence align- ment. The main multiple sequence alignment algo- rithms adopting iterative approach are genetic algo- rithm such as GAPAM [13] and PASA [14]. Each algorithm adopting progressive approach or iterative approach produces mistakes in multiple sequence align- ment, thus, we used refinement algorithms in order to cor- rect bad aligned residues, that can ameliorate the quality of the multiple alignment by ameliorate his scores. The pro- cess of all refinement algorithms consists to apply a set of modifications to an initial multiple sequence alignment in order to construct a new one having better scores than the previous alignment. These modifications are repeated un- til convergence (i.e. no improvement can be made on the current alignment). There are different algorithms for re- finement of multiple sequence alignments: 1. RASCAL [15]: Rascal operates as follows: First, we analyse the initial multiple sequence alignment and detect the well-aligned regions by applying the Mean Distance (MD). Then, we detect the badly aligned re- gions. Finally, we realign the badly aligned regions. 2. REFINER [16]: when applying REFINER algorithm on a multiple sequence alignment, we realign each se- quence with the profile of the multiple sequence align- ment of the remaining sequences. Convergence is ob- tained when all the iterations is realised and each se- quence is realigned. 3. RF [17]: is similar to the REFINER algorithm but the convergence is obtained when the number of iterations is equal to 2N2 where N is the number of sequences. 4. REFORMALIGN [18]: Using REFORMALIGN, we construct the final alignment indirectly. First, we start by constructing a profile to the initial multiple se- quence alignment. Then, we align each sequence to the profile constructed in the first step. Finally, we merge all the sequences alignment in order to obtain the final alignment. Thus, Refinement algorithms are used in order to enhance a multiple sequence alignment (MSA). Indeed, we start by an initial multiple sequence alignment by using one mul- tiple sequence alignment algorithm. Then, we apply the refinement algorithm to the initial multiple sequence align- ment in order to construct a new more accurate multiple sequence alignment having higher score. 2 Block definition We propose a new algorithm called Refin-Align for refin- ing multiple sequence alignment. Refin-Align uses a new definition of block. Indeed, a block is defined as a multi- ple alignment of substrings extracted from a multiple se- quence alignment. A block is formed of at least two adja- cent columns separated from the initial alignment on both sides by a column formed only of identical characters. Our new definition of blocks is different from the standard defi- nition of blocks, which presents the blocks as substrings de- limited by columns containing at least one gap. The blocks are extracted from the initial multiple alignment and then they will be realigned to improve their scores. A block is defined as follow: – A set of aligned substrings – having the same size in each sequence – A block must contain at least two columns – No substrings formed the block must be formed en- tirely of gap – A block must not contains a column having exactly the same character. 3 Refin-Align: New refinement algorithm The principle of our algorithm is to extract a misaligned blocks from the sequences that distort the multiple align- ment and realign them. The Refin-Align algorithm allows improving the quality of an initial multiple alignment by it- eratively realigning the blocks of the initial multiple align- ment. The advantage of our new block definition is to allow Refin-Align:New Refinement Algorithm for. . . Informatica 43 (2019) 527–534 529 Figure 3: Refinement process. Figure 4: Block extraction. more possibility for the characters of the initial alignment to be realigned. Our Refin-Align algorithm operates as fol- lows: 1. First, we extract the blocks from the initial multiple sequence alignment. 2. Then, we compute the scores of each block. We use the sum of pairs score SP[19]. The SP score corre- spond to the sum of the scores for all pairs of aligned characters. SP score is computed using this formula: SP (A) = L X i=1 X 1