Informatica 33 (2009) 511-519 511 Improvements to a Roll-Back Mechanism for Asynchronous Checkpointing and Recovery Monika Kapus-Kolar Department of Communication Systems, Jožef Stefan Institute Jamova cesta 39, 1000 Ljubljana, Slovenia E-mail: monika.kapus-kolar@ijs.si Keywords: asynchronous checkpointing, recovery, maximum consistent state Received: September 19, 2007 We indicate the existence of a logical flaw in a recently published recovery algorithm for distributed systems and suggest a correction. We also improve the associated communication protocol, the prevention of multiple concurrent instantiations of the algorithm, the handling of obsolete messages and the organization of the stable storing of the relevant data. Povzetek: Opozarjamo na logično napako v pred kratkim objavljenem algoritmu za okrevanje v porazdeljenih sistemih in predlagamo popravek. 1 Introduction Gupta, Rahimi and Yang recently proposed a novel recovery algorithm for distributed systems in which checkpoints are taken asynchronously [1]. A checkpoint taken by a process is a snapshot of its local state, stored in a stable storage, so that the process can roll back to it, if this becomes necessary. The start of a process is also one of its checkpoints. Asynchronous checkpointing means that processes take their checkpoints independently. A failure in a distributed system in principle requires that all its constituent processes roll back, to a global state from which the system can resume its operation as if it had started from it, i.e., to a globally consistent set of local checkpoints (GCSLC), ideally to the so called maximum GCSLC, in which every local checkpoint is as recent as possible. In the case of asynchronous checkpointing, when a failure occurs, the processes have yet to find the maximum GCSLC. They do that by running a checkpoint coordination algorithm (CCA). Alternatively, the processes might agree to restart from the GCSLC which they currently treat as the starting state of the system, i.e., from the current recovery line. This is the most recently computed maximum GCSLC, initially the actual starting state of the system. [1] suggests that the processes occasionally initiate a CCA just for advancing the recovery line. In this paper, we demonstrate that, because of a subtle logical flaw, the CCA of [1] sometimes returns a checkpoint set which is not globally consistent. We correct the flaw and also suggest some other improvements. The rest of the paper is organized as follows. In the next section, we describe the system and the testing of checkpoint consistency and give a brief outline of the CCA of [1]. In Section 3, we explain and correct the flaw. In Section 4, we suggest several other improvements of the CCA. A detailed specification of the improved CCA is given in Section 5. Section 6 suggests that checkpointing and recovery-line advancing in the absence of failures should be more flexible. As the proposed CCA correction increases the usage of the local stable storages, we in Section 7 suggest how to organize them. Section 8 comprises a discussion and conclusions. 2 Preliminaries 2.1 System properties Of the assumptions explicitly or implicitly stated in [1] for the distributed system considered, the following seem important: - The set of the constituent processes is a fixed {Pj \j G N} with N a {1,... ,n}. Every process is virtually always aware of the global time. - From every process Pj to every other process Pj>, there is virtually a reliable first-in-first-out channel with the worst-case transit delay not exceeding a predefined Tjj. The channels are the only means of inter-process communication. Processes exchange application messages (AMs) and CCA messages (CMs). - When a process fails, no other process fails simultaneously and the system does not experience another failure until every process restarts. 2.2 Testing checkpoint consistency A GCSLC is characterized by all its members being mutually consistent. A checkpoint Cj,r of a process Pj is consistent with a checkpoint Cj>,r> of a process Pj> if none of 512 Informatica 33 (2009) 511-519 M. Kapus-Kolar the processes has recorded a reception of an AM from the partner for which the partner has not recorded a transmission [1]. As channels are reliable queues, it is acceptable that processes record their AM transmissions and receptions simply by counting them, relative to the recovery line [1], for this is where the system virtually started. For a checkpoint Cj indicate how many AMs Pj has sent to or, respectively, received from Pj', where Sjrj and Rj r j are by definition zero. Suppose that the currently considered checkpoint set is a {Cjr(j)\j e N}. To check that it is a GCSLC, one in principle has to check Rjr(j) j < Sj'rj')j for every two processes Pj and Pj'. This is what the CCA of [2] does, by letting every process Pj check f\j'e(N\{j})(Rj,r(j) ,j' < Sj', r(j') ,j). The CCA of [1] works under the assumption that this can be done simply by letting every process Pj check J2j'e(N\{j}) Rj,r(j),j' < J2j'e(N\{j}) Sj',r(j'),j. In both CcAs, if the adopted test fails for a Pj, the process starts considering an earlier checkpoCint, namely the most recent checkpoint with Aj'e(N\{j})(Rj,r'(j) ,j' < Sj' ,r(j') ,j ) or Cj, r'(j) Sj'e(N\{j}) tively. Rj '(j),j' ■j'e(N\{j}) Sj',r Sj',r(j'),j, respec- p, [0,0,0] =[0,0,0] .0,0] [SWM3 ,0,0] [r2i2/ - C9 1 [0,0,0] =[0,0,0] [0,0,0] =[0,4,5] [7,0,0] =[0,0,0] Figure 1: The situation considered in Examples 1-5. The black checkpoints represent the maximum GCSLC. The dashed message is the one making C1 <2 inconsistent with C2, 2. 2.3 Outline of the algorithm In the CCA of [1], all communication goes over the initiator of the particular algorithm instance (see Example 1 in the next section). The initiator process repeatedly polls every process, virtually also itself, for the value of the transmission counters of the local candidate for a recovery-line checkpoint. Every such request carries all the information on the transmission counters of the current candidate checkpoints, if any, which the recipient needs for deciding which checkpoint to consider in the next iteration. The first candidate checkpoint of a process is always its most recent checkpoint. After every process replies, the initiator might detect that the candidate checkpoint set has changed. In that case, it starts another iteration. Otherwise, it broadcasts an indication that a GCSLC has been found. Finally, every process promotes its currently considered checkpoint into its recovery-line checkpoint, from which it, if so requested by the initiator, subsequently restarts. The algorithm covers also the possibility that the initiator requests an immediate restart, from the current recovery line. 3 A flaw and a correction 3.1 The problem of inaccurate information When a {Cj , rj) \j G N} is checked for being a GCSLC, each Sj, rj), j' may be any natural up to and including the number of the AMs sent from Pj to Pj', and each Rj , rj j' may be any natural up to and including the number of the AMs received at Pj from Pj', for remember that the checkpointing is entirely asynchronous. It might, hence, happen that for a process Pj, the test J2j'e(N\{j}) Rjr(j),j' < Y^j'e(N\{j}) Sj', r(j') j adopted by [1] succeeds in spite of ^j'e(N\{j}) (Rj, r(j), j' < Sj', r(j'), j ) fe^ sothat Pj fails to detect that {Cj , rj) \j g N} is not a GCSLC. If the other processes also fail to detect that the checkpoint set is not consistent, the CCA terminates prematurely and the set un-acceptably becomes the new recovery line. Example 1. A possible problematic scenario of a system consisting of processes Pi to P3 and using the CCA of[1] (see also Fig. 1): (1) Upon the start of the system, every process Pj takes a checkpoint Cj,lt with every Sj, 1 ,j' and Rj, 1 ,j' zero. (2) P2 sends to P1 three promptly received AMs and then takes a checkpoint C2,2, with S2,2j1 = 3 and S2,2,3 = R2,2,1 = R2,2,3 = 0. (3) P2 sends to P1 another promptly received AM and P3 sends to P1 five promptly received AMs. P1 then takes a checkpoint C1}2, with R1i2,2 = 4, R1,2,3 = 5 and S1t2,2 = S1,2,3 = 0. (4) P3 sends to P1 two more promptly received AMs and then takes a checkpoint C3,2, with S3,2,1 = 7 and S3,2,2 = R3,2,1 = RS,2,2 = 0. (5) P1 undergoes a failure. (6) After P1 recovers, it starts considering its most recent checkpoint C1j2 and invites P2 and P3 to a new instance of the CCA. (7) In response, P2 starts considering its most recent checkpoint C2,2 and sends to P1 a CM carrying S2,2j1 and S2,2,3, while P3 starts considering its most recent checkpoint C3,2 and sends to P1 a CM carrying S3,2j1 and S3,2,2. (8) When P1 receives the responses, it sends (S1,2,2 + S3,2,2) to P2 and (S1,2,3 + S2,2,3) to P3. It also observes (R\,2,2 + R123) = 9 < (S2,2,1 + S3,2,1) = 10 and consequently decides to continue considering C1}2, recording the decision by setting the local flag to 0. IMPROVEMENTS TO A ROLL-BACK MECHANISM FOR. Informatica 33 (2009) 511-519 513 (9) Upon receiving (S1,2,2 + S13,2,2), P2 observes (R-2,2,1 + R2,2,s) = 0 < (Slt2,2 +S322) = 0 and therefore continues considering C2,2, indicating that to P1 by a CM carrying flag 0 and S2,2,1 and S2,2,3. Likewise, P3 upon receiving (Si,2,3 + S2,2,3) observes (R3,2,1 + R3,2,2) = 0 < (Sh2,3 + S2,2,3) = 0 and therefore continues considering C3,2, indicating that to P1 by a CM carrying flag 0 and S3,2i and S3,2,2. (10) After receiving the two indications, P1, observing that the flag of every process is 0, concludes that a GCSLC has been found, tells P2 and P3 to restart from C2,2 and C32 respectively, and finally restarts from C1,2. (11) Because of S2,2j1 = 3 before the restart, P2 after restarting retransmits the fourth AM to P1. Because of R1,2,2 = 4 before the restart, P1 erroneously interprets the AM as the fifth from P2 and consequently takes an inappropriate action. The simplification from [1] described in Section 2.2 is, hence, unacceptable. 3.2 The Necessary Changes of the Algorithm The indispensable changes to the CCA are the following: - Where a process Pj originally stores in its stable storage a £j>e(N\{j})Rj,r(j),j>, it must actually store Rj,r(j),j' of every other process Pj>. - Where the initiator process originally sends to a pro carrying S1i1i2 and S3,2,2 to P2 and a CM carrying S11,1,3 and S2 2 3 to P3. It also observes Ri,i,2 = 0 < S2,2<1 = 3 and Ri,i,3 = 0 < S3,2j1 = 7 and consequently decides to continue considering C1,1, recording the decision by setting the local flag to 0. (11) Upon receiving and S3,2,2, P2 observes R2,2,i =0 < SiX2 = 0 and R2fi,3 = 0 < S3,2,2 = 0 and therefore continues considering C2,2, indicating that to P1 by a CM carrying flag 0 and S2,2,1 and S2,2,3. Likewise, P3 upon receiving S1j1j3 and S2,2,3 observes R3,2,1 =0 < S1X3 = 0 and R^ =0 < ^2,2,3 = 0 and therefore continues considering C3y2, indicating that to P1 by a CM carrying flag 0 and S3,2j1 and S3,2,2. (12) After receiving the two indications, P1, observing that the flag of every process is 0, correctly concludes that a GCSLC has been found, tells P2 and P3 to restart from C2,2 and C3y2, respectively, and finally restarts from C1t1. 4 Optimization of process communication If processes exchange transmission counters instead of just their sums, minimization of the number and the size of the exchanged CMs becomes even more important. In the following, we suggest some additional optimizations. cess P. a S \{j})Sj {j')ij, it must actually send 4.1 Early reporting of test results a CM containing Sj',rj'),j of every process Pj' other than Pj. Pj Where a process ^j'e(N\{j}) Rj,r(j),j' < Y,j'e(N\{j}) Sj',r(j'),j it must actually test /\j'E(N\{j})(Rj,r{j),j originally tests Sj < Sj r(j'),j ). Example 2. If the indispensable CCA corrections are made in Example 1, the scenario after its seventh step takes the following direction (see also Fig. 1): (8) After P1 receives the responses, it sends a CM carrying S1,2,2 and S3,2,2 to P2 and a CM carrying S1,2,3 and S2,2,3 to P3. It also observes R1,2,2 = 4 > S2,2i1 = 3 and consequently, because R1,1,2 =0 < S2,2,1 = 3 and R-i-13 = 0 < S3,2,1 = 7, starts considering C1t1, recording the change of focus by setting the local flag to 1. (9) Upon receiving S1,2,2 and S3,2,2, P2 observes R2,2,1 =0 < Sh2,2 = 0 and R223 = 0 < S3,2,2 = 0 and therefore continues considering C2,2, indicating that to P1 by a CM carrying flag 0 and S2,2t1 and S2,2,3. Likewise, P3 upon receiving S1,2,3 and S2,2,3 observes R3,2,1 =0 < Slt2,3 = 0 and R3,2,2 =0 < 'S2,2,3 = 0 and therefore continues considering C3,2, indicating that to P1 by a CM carrying flag 0 and S3j2j1 and S3,2,2. (10) After receiving the two indications, Pi, observing that there is a process with a non-zero flag, sends a CM In a CCA iteration testing a checkpoint set [Cjr(j)\j £ N}, the initiator process, a Pi, originally (l) transmits every Sjr(j)j' with j' £ {i,j}, (2) tests Aje(N\{i})(Ri,r(i),j < Sj,r(j),i) and (3) receives the expected responses. After the test, the candidate checkpoint of Pi is a Cir'(i). If it is different from Ci S221 = 3 and consequently starts considering C1t1. It then sends a CM carrying S1,1<2 and S3,2,2 to P2 and a CM carrying S1j1j3 and S2,2,3 to P3. (9) The same as (11) in Example 2. (10) After receiving the two indications, P1, observing that every flag received was 0, correctly concludes that a GCSLC has been found, tells P2 and P3 to restart from C2,2 and C32 respectively, and finally restarts from C1t1. 514 Informatica 33 (2009) 511-519 M. Kapus-Kolar 4.2 Immediate counter reporting Originally, the CMs inviting processes to another instance of the CCA do not comprise the relevant transmission counters of the initiator. We think that they should, because such immediate counter reporting might spare an iteration. Example 4. Remember that in Example 3, the first seven steps are as in Example 1. If the process which fails in the fifth step is changed to P2, the subsequent scenario fragment changes to the following (see also Fig. 1): (6a) After P2 recovers, it starts considering its most recent checkpoint C2,2 and invites Pi and P3 to a new instance of the CCA. (7a) In response, Pi starts considering its most recent checkpoint Ci,2 and sends to P2 a CM carrying Si,2,2 and S123, while P3 starts considering its most recent checkpoint C3,2 and sends to P2 a CM carrying S3,2,i and S3,2,2. (8a) After P2 receives the responses, it observes R2,2,i =0 < Sia,2 = 0 and R.2,2,3 = 0 < S3,2,2 = 0 and therefore continues considering C2,2. It then sends a CM carrying S2,2,i and S3,2,i to Pi and a CM carrying Si,2,3 and S223 to P3. (9a) Upon receiving S2,2,i and S3,2,i, Pi observes R1,2,2 =4 > S2,2,i = 3 and consequently starts considering Ci,i, indicating that to P2 by a CM carrying flag 1 and Si,\,2 and Si,i,3. On the other hand, P3 upon receiving Si,2,3 and S2,2,3 observes R3,2,i =0 < Sij2j3 = 0 and R3,2,2 =0 < S2,2,3 = 0 and therefore continues considering C3,2, indicating that to P2 by a CM carrying flag 0 and S3,2,i and S3,2,2. (10a) After receiving the two indications, P2, observing that a non-zero flag has been received, observes R2,2,i = 0 < Si 1,2 = 0 and R2,2,3 = 0 < S3,2,2 = 0, consequently deciding to continue considering C2,2, and then sends a CM carrying S2,2,i and S3,2,i to Pi and a CM carrying Si,i,3 and S2,2,3 to P3. (11a) Upon receiving S2,2,i and S3,2,i, Pi observes Ri,i,2 =0 < S2,2,i = 3 and Rhh3 =0 < Sw = 7 and therefore continues considering Ci,i, indicating that to P2 by a CM carrying flag 0 and Si,i,2 and Si,i,3. Likewise, P3 upon receiving Slti,3 and S2,2,3 observes R3,2,i =0 < S1X3 = 0 and R3,2,2 =0 < ' S2a,3 = 0 and therefore continues considering C3,2, indicating that to P2 by a CM carrying flag 0 and S3,2,i and S3,2,2. (12a) After receiving the two indications, P2, observing that every flag received was 0, correctly concludes that a GCSLC has been found, tells Pi and P3 to restart from Ci,i and C3,2, respectively, and finally restarts from C2,2. If immediate counter reporting is introduced, the scenario fragment simplifies to the following: (6b) After P2 recovers, it starts considering C2,2, sending to Pi an invitation carrying S2,2,i and to P3 an invitation carrying S2,2,3. (7b) In response, Pi, observing that Ri,2,2 =4 > S2,2,i = 3, starts considering Ci,i and sends to P2 a CM carrying ,i,2 and Si,i,3, while P3, observing that R3,2,2 =0 < S2,2,3 = 0, starts considering its most recent checkpoint C3,2 and sends to P2 a CM carrying S3,2,\ and S3,2,2- (8b) After P2 receives the responses, it observes R2,2,i =0 < S1X2 = 0 and R.2,2,3 = 0 < 83,2,2 = 0 and therefore continues considering C2,2. It then sends a CM carrying S2,2,1 and S3,2,\ to Pi and a CM carrying S1,1,3 and 82,2,3 to P3. (9b) The same as (11a). (10b) The same as (12a). 4.3 Update reporting Like [1], we assume that the initiator process maintains an (n x n)-array variable V in which every component Vj,j> with j = j' is during GCSLC construction regularly updated to the value of the Sj,rj),j> belonging to the checkpoint Cj,r(j) currently considered by Pj. Every process Pj repeatedly contributes values for the jt]h row and checks the values in the jth column of V. The search for a GCSLC terminates when V stabilizes. Suppose that a Pj and a Pj' are currently considering checkpoints Cj,rj and Cj,rj'), respectively, and that Rj,r(j),j' < Sj',r(j'),j, as required. Then suppose that Pj and Pj' at some later point move their attention towards some older checkpoints, a Cjyj and a Cj'yj'), respectively. Rj,r'(j),j' < Rj,r(j),j' implies that the problematic Rj r'(j) j' > Sj',r'(j'),j is possible only if Sj'r'(j'),j < Sj',r(j'),j, i.e., if Vj'j has changed, for Sj',r'j'),j > Sj',r(j'),j is impossible. It, hence, suffices that updates to Vj' j are reported by Pj' to the initiator process, a Pi, and received by Pj from Pi. A CM sent to Pi is then a zero-flag CM exactly if it carries no transmission counters. In [1], a CM carrying information on V unnecessarily always carries an entire row or column, respectively, and flags are sent explicitly. Example 5. If only updates to V are reported, the scenario fragment (8b)-(10b) in Example 4 simplifies to the following (see also Fig. 1): (8b) After P2 receives the responses, it observes R2,2,i = 0 < Si,i,2 = 0 and R2,2,3 = 0 < S3,2,2 = 0 and therefore continues considering C2,2. It then sends a CM carrying S3,2,i to Pi and a CM carrying Si,i,3 to P3. (9b) Upon receiving S3,2,i, Pi observes Ri,i,3 =0 < S3,2,1 = 7 and therefore continues considering Ci,i, indicating that to P2 by a CM carrying no transmission counters. Likewise, P3 upon receiving Si,i,3 observes R3,2,i = 0 < S1A,3 = 0 and therefore continues considering C3,2, indicating that to P2 by a CM carrying no transmission counters. (10b) After receiving the two indications, P2, observing that no transmission counters have been received, correctly concludes that a GCSLC has been found, tells Pi and P3 to restart from Ci,i and C3,2, respectively, and finally restarts from C22 IMPROVEMENTS TO A ROLL-BACK MECHANISM FOR. Informatica 33 (2009) 511-519 515 [R.,u c,, =[0,0,0] [S12 =[0,0,0] [R,'2 cu =[0,0,1] [Su, =[0,0,0] [R13 c,; =[0,1,1] [S1A =[0,1,0] [R,,4 c,. =[0,2,1] =[0,2,0] [s2,u: [R2,u C2J I [Sy.,: [Rs.,., c,, [0,0,0] =[0,0,0] / ^ u / ^ u [S2,2, MO ,0,0] [S2,3, Ml ,0,0] [R2,2 J=[i ,0,0] [R2,3 J=[: -,0,0] C2,2 C2,3 v II \ 'II V [s2, [R2, =[2,0,0] =[3,0,0] p, [s2,,,?: [R2,u C2J I {0,0,0] [S32 =[0,0,0] [rJ. r 3,2 I [S.. [R,. cu 4 =[0,0,0] =[1,0,0] (a) =[0,0,0] [S13 =[0,0,0] [Ru' =[0,1,0] [S1A =[0,1,0] [R,,4 =[0,2,0] =[0,2,0] [0,0,0] =[0,0,0] c, n 3 c, n' > ^ u / ^ u [S2,2, MO ,0 ,0] [S2,3, Ml ,0,0] [S,4 M2,0,0] [R„ J=[i ,0,0] [R2,3 J=[: -,0,0] [R,4 ?]=[3,0,0] C'2,2 C2,3 C-2,4 1 \ \ -> [R3,: 3,2 + =[0,0,0] =[0,0,0] (b) Figure 2: The situation considered in Example 6 (a) before and (b) after the recovery line is advanced to the black checkpoints, which represent the maximum GCSLC. 4.4 Selective polling If a CM which the initiator process, a Pi, is originally supposed to send to a process Pj is the jth column of V and Pi discovers that Pj already has a faithful copy of the column, the CM may be omitted, for even if received, Pj would continue considering the same checkpoint and produce no update for the jt]h row of V. Moreover, if Pj does not receive the CM, it will not attempt to send a response, meaning that it will skip an entire iteration (see Example 6, step 13). Besides, if Pi discovers that all the other processes may skip the current iteration, it may immediately indicate CCA termination (see Example 6, step 15). Example 6. With the above optimizations, the following scenario is possible in a system consisting of processes Pi to P3 (see also Fig. 2): (1) Upon the start of the system, every process Pj takes a checkpoint Cj,i, with every Sj,i,j' and Rj,i,j/ zero. (2) P1 sends an AM to P3 and takes a checkpoint Ci,2, with Si,2,3 = 1 and Si,2,2 = Ri,2,2 = Ri,2,3 = 0. (3) P3 receives the AM and takes a checkpoint C3,2, with R-3,2,i = 1 and S3 2A = S3,2,2 = R3,2,2 = 0. (4) Pi sends an AM to P2. P2 receives the AM and takes a checkpoint C2,2, with R2,2,i = 1 and S2,2,i = S2,2,3 = R2,2,3 = 0. (5) P2 sends an AM to Pi. Pi receives the AM and takes a checkpoint Ci,3, with Si,3,2 = Si,3,3 = Ri,3,2 = 1 and Ri,3,3 = 0. (6) Pi sends an AM to P2. P2 receives the AM and takes a checkpoint C2,3, with S2,3,i = 1, R2,3,i = 2 and S2,3,3 = R2,3,3 = (7) P2 sends an AM to Pi. Pi receives the AM and takes a checkpoint Ci,4, with Sij4,2 = Ri,4,2 = 2, Si,4,3 = 1 and Ri,4,3 = 0. (8) Pi sends an AM to P2. P2 receives the AM and takes a checkpoint C2,4, with S2,4,,i = 2, R2,4,i = 3 and S*2,4,3 = R2,4,3 = 0. (9) P2 decides to initiate the CCA just for advancing the recovery line. So it starts considering C2,4, sending to Pi an invitation carrying S2,4,,i and to P3 an invitation carrying S2,4,3. (10) In response, Pi, observing that Ri,4,2 = 2 < S2,A,i = 2, starts considering Ci,4 and sends to P2 a CM carrying Si,4,2 and Si,4,,3, while P3, observing that R3,2,2 =0 < S2,4,3 = 0, starts considering C3,2 and sends to P2 a CM carrying S3,2,i and S3,2,2. (11) After P2 receives the responses, it observes R2,4,i =3 > Si,4,,2 = 2 and therefore starts considering C2,3. It then sends a CM carrying S2,3,i and S3,2,i to Pi and a CM carrying Si,4,3 to P3. (12) Upon receiving S2,3,i and S3,2,i, Pi, observing that Ri,4,2 =2 > S2,3,i = 1, starts considering Ci,3, indicating that to P2 by a CM carrying Si,3,2. On the other hand, P3 upon receiving Si,4,3 continues considering C3,2, indicating that to P2 by a CM carrying no transmission counters. (13) After receiving the two indications, P2, observing that R2,3,i =2 > Si,3,2 = 1, starts considering C2,2, sending to Pi a CM carrying S2,2,i. P3 is not involved in the iteration, because Si,3,3 = Sij4,3 and S2,2,3 = S2,3,3. (14) Upon receiving S2,2,i, Pi, observing that Ri,3,2 = 1 > S2,2,i = 0, starts considering Ci,2, indicating that to P2 by a CM carrying Si,2,2. (15) After receiving the indication, P2, observing that R2,2,i = 1 > Si,2,2 = 0, starts considering C2,i. It then, because Si,2,3 = Si,3,3 and S2,i,i = S2,2,i and S2,i,3 = S2,2,3, immediately concludes that a GCSLC has been found, indicates that to Pi and P3 and finally makes C2,i its recovery-line checkpoint. (16) Upon receiving the indication, Pi makes Ci,2 its recovery-line checkpoint, meaning that it deletes Ci,i and subtracts Si,2,3 from Si,2,3, Si,3,3 and Si,4:,3. Likewise, P2 makes C3,2 its recovery-line checkpoint, meaning that it deletes C3,i and subtracts R3,2,i from R3,2,i. The resulting situation is given in Fig. 2(b). 516 Informatica 33 (2009) 511-519 M. Kapus-Kolar 5 Details of the algorithm (9) It iterates to the step (4). 5.1 Introduction For a CCA instance I, let Pin(i) denote its initiator process, Advi the predicate telling whether it starts by an attempt to advance the recovery line, and Rsti the predicate telling whether it terminates by a system restart, i.e., whether it is an instance initiated upon a failure. We assume Advi V Rsti, so that I has a purpose. Every CM of a CCA instance I is either an invitation CM (ICM) or an ordinary CM (OCM). If it is sent by Pin(i) and carries no transmission counters, it is also a termination CM (TCM). In Sections 5.2-5.4, we specify a CCA instance, an I, as if it runs in isolation. The given CCA is essentially the CCA of [1] corrected and slightly modified as proposed in Sections 3 and 4. In Section 5.5, we, unlike [1], specify also how processes are assumed to react if they during the execution of a specific CCA instance receive an AM or a CM belonging to another CCA instance. 5.2 Behaviour of the initiator In a CCA instance I, Pin(i), having decided on Advi and Rsti, executes the following sequence of steps, starting at a time tj in(i) and broadcasting a TCM at a time t j: (1) If —Advi, it broadcasts an ICM carrying just Advi (and thereby implicitly also Rsti) and the current time, i.e. a TCM carrying tj in(i) and tj, and then goes directly to the procedure specified in Section 5.4. Otherwise, it proceeds as follows. (2) It sets CC, the variable denoting its currently considered checkpoint, to its most recent checkpoint, a Cin(i)r, and initializes in its (n x n)-array variable V every Vin(i)j to Sin(i),rj, every Vjj to zero and every other Vjj, to a value so large that no transmission or reception counter can ever acquire it. (3) To every other process Pj, it sends, and records that by setting a Boolean variable Qj to true, an ICM carrying Advi, Rsti, Vin(i)j and in the case of Rsti also tj in(i). (4) It makes a copy V' of V. (5) From every other process Pj with Qj true, it receives, and records that by setting Qj to false, an OCM carrying at least Rsti and then changes every Vj j, for which the CM carries a value to the value received. (6) It sets CC to its most recent checkpoint Cin(i), r with Rin(i),r,j < Vj, in(i) for every process Pj and sets every Vin(i),j to Sin(i),r,j. (7) If for every other process Pj, every Vj, j is the same as V' j, it broadcasts an OCM carrying just Rsti and the current time, i.e. a TCM carrying tj, and then goes directly to the procedure specified in Section 5.4. Otherwise, it proceeds as follows. (8) To every other process Pj with a Vj, j different from Vj, j, it sends, and records that by setting Qj to true, an OCM carrying Rsti and every such Vj, j. 5.3 Behaviour of the other processes In a CCA instance I, a process Pj other than Pin(i) for every variable Vjj, and Vj,j of Pin(i) maintains a copy with the same name, executing the following sequence of steps: (1) It receives, at a time tj j, from Pin(i) an ICM and sets Advi, Rsti, tj in(i), tj and Vin(i) j to the value received, if any. We assume that (tj,j — tj in(i)) does not exceed a predefined Tjjn(i) j. An appropriate value for a Tj, j with j' = j would typically be slightly over Tj,j, while every Tj j is by definition zero. (2) If —Advi, it goes directly to the procedure specified in Section 5.4. Otherwise, it proceeds as follows. (3) If possible and desired, it takes a fresh checkpoint. (4) It sets Vjj to zero and every Vj,j with j' g {In(I),j} and Vjj with j = j' to a value so large that no transmission or reception counter can ever acquire it. (5) It sets its variable CC to its most recent checkpoint Cjr with Rjrj' < Vj, j for every process Pj,, sends to Pin(i) an OCM carrying Rsti and, as anew value for Vjj,, every Sjrj, different from Vjj,, and then sets every Vjj, to Sj,rj,. (6) It receives from Pin(i) an OCM carrying at least Rsti and changes every Vj,j for which the CM carries a value to the value received. If the CM was a TM, it sets tj to the value received and goes directly to the procedure specified in Section 5.4. Otherwise, it iterates to the step (5). The time when a process Pj gets involved into a CCA instance I, hence, does not exceed a t^ j defined as min{(tj ini) + Tjn(ii) j),tj} in the case of Advi and as (t j, in(l) + TL(i),j) otherwise. 5.4 Moving the recovery line and/or restarting Here are the steps of the procedure which in a CCA instance I, every process Pj, with its CC in the case of Advi a Cj r , executes at the end of the CCA, terminating at a time 4 : i, j: (1) If Advi, it deletes from its storage every checkpoint t older than Cj r. (2) If Rsti, it deletes from its storage every checkpoint except the oldest one. (3) For every preserved checkpoint Cj,r,, it decreases every Sj,r,,j, by Sj,r, j, and every Rjrr,, j, by Rjrr, j,. (4) If Rsti, it restarts from its oldest preserved checkpoint. We assume that (t4 j — tj) does not exceed a predefined T'inil) j. An appropriate value for a Tj, j would typically be slightly over Tj,j. IMPROVEMENTS TO A ROLL-BACK MECHANISM FOR. Informatica 33 (2009) 511-519 517 5.5 Handling of unexpected messages If a process waiting for an OCM of a CCA instance I instead receives an AM, it freely decides whether to process it concurrently to or after I. If a process Pj waiting for an OCM of a CCA instance I instead receives a CM of another CCA instance, an I', it reacts as follows: An ICM with RstI* false or an OCM with RstI* = Rsti is just discarded. Upon an ICM with RstI* true, Pj abandons I and starts executing I' instead. An OCM with RstI* = RstI is erroneously recognized as an OCM of I, so such situation must be prevented (see Section 6.3). If a process Pj currently involved in no CCA instance receives an OCM or an ICM for which it suspects that it belongs to an obsolete CCA instance, it just discards it. An ICM of a CCA instance I', received at a time t from a process Pj', is recognized as obsolete if —RstI* and Pj has participated in a CCA instance I with RstI true and (t3i,j* + Tj'j) > t. This is because the estimated worst-case scenario for I overriding I' is that Pj* sends the ICM just before it gets involved into I at the time tl j* and then the ICM reaches Pj with the delay Tj*, j. 6 Optimizing algorithm activation 6.1 Introduction [1] suggests that in the absence of failures, the CCA is initiated periodically, with the period very long, so that one can assume that when a new CCA instance starts, the previous one, if any, has long been completed. More precisely, [1] specifies that the period is much longer than the checkpointing period of any individual process. [1] also suggests that processes initiate the CCA in turns, so that they never attempt to initiate it concurrently. To implement the policy, [1] has all system processes organized in a virtual ring along which the right to initiate the CCA is passed as a token, merely by the passing of time. Every reception of the token is interpreted as an obligation to initiate the CCA immediately. In [1], the frequency of token passing is exactly the desired frequency of running the CCA in the absence of failures. We find this scheme too rigid. On the one hand, forcing processes to take a checkpoint or initiate a recovery-line advancement periodically is seldom optimal. Processes should rather wait for a substantial reason. On the other hand, the initiator token should circulate as fast as possible, so that any process wanting it receives it quickly. Therefore, we drop the periodicity assumption for checkpointing and recovery-line advancing, and keep the virtual ring just as a means of concurrency control, giving it a more appropriate timing. 6.2 Reasons for checkpointing and recovery-line advancing For taking a checkpoint, a typical reason would be that the process has since its last checkpoint accomplished a lot of work or exchanged a lot of AMs. For advancing the recovery line, we see three reasons: - A process is running out of stable storage and therefore wants that some of the stored checkpoints become obsolete (remember Section 5.4, step 1). - A reception or a transmission counter of a process is approaching overflow and the process therefore wants more events to become obsolete for counting (remember Section 5.4, step 3). - A process wants to reduce the discrepancy between its current state and its recovery-line checkpoint, so that, if a failure occurs, its roll-back in the worst case would not be so drastic. How often such a reason occurs, depends not only on static parameters, such as the size of the local stable storages and the speed of the processes and the channels, but also on the demands of the executed application, which tend to vary with time. 6.3 Prevention of concurrent instantiations At any moment, the initiator token is virtually either in transit or resides with one of the processes. If a process Pj possesses the token at a time t, the next time when another process Pj* possesses it should ideally be soon, but not until Pj* has had a chance to detect whether Pj has initiated the CCA at time t, i.e., not until after (t + jj*). The delay is necessary because when initiating a CCA instance I with —RstI, the process PIn(I) must be aware of the most recently initiated previous CCA instance, an I', if any, so that it can, by suitably delaying I, secure that the following is satisfied at the time tj In(I): - t\,In(I) > 1V,In(I) and tlIn(I) > + TIn{I*),j fOT every other process Pj, so that every process has already terminated execution of I'. - If Rsti*, t\, In(I) > tf*, j + Tj,j* for every two processes Pj and Pj*, so that (1) the CMs which I' made obsolete, if any, have already been received and (2) the ICM of I will not be discarded upon reception (remember Section 5.5). It might happen that while a process is waiting for an opportunity to initiate a CCA instance I with —RstI, another process initiates a CCA instance I' with AdvI*. Upon detecting that, PIn(I) should drop its pursuit, for the task of advancing the recovery line is already being taken care of. 518 Informatica 33 (2009) 511-519 M. Kapus-Kolar 7 Organization of the storage When a process gets involved into a CCA instance, it has to access information on its previous checkpoints. [1] suggests that a process maintains this information primarily in its working memory, so that any access to it can be quick. However, a process having just recovered from a failure can safely rely only on the copy of the information which it has made in its stable storage. How efficiently individual parts of the copy can be accessed, strongly depends on the organization of the storage, which, hence, deserves some discussion. Obviously, each checkpoint Cjr requires that process Pj stores every Rjrj> with j = j' individually, and not just J2j'e(N\{j}) Rj,r,j', as in [1]. For each Cjr, Pj would, hence, store vectors Sj, r and Rj,r of size (n — 1). If we imitated the policy of [1], every process Pj would for every checkpoint Cj,r store in the stable storage the list of all Rj,r' (originally of all Xj'e(N\{j}) Rj,r',j') belonging to a Cjr' not later than Cjr, so that, when deciding how far back to move from Cjr, Pj could fetch all the necessary information in a single access to the storage. However, this would mean that each element of such a list is stored several times, first in the list of the checkpoint to which it belongs and then in the list of every subsequent checkpoint. With list elements vectors of size (n — 1), i.e. not of size 1 as expected in [1], this would be unacceptable for large n. We think that Pj should maintain a single list, for every Cjr comprising a triplet consisting of Sjr, Sjr and a pointer to all the other details which Pj might need if it actually restarts from Cjr. As n is fixed, so can be the size of the triplets, meaning that such a list can be easily maintained and accessed as virtually a single block of consecutive storage locations. Alternatively, Pj could maintain three lists, one for each kind of the items in the triplets. With this approach, Pj could fetch the stable data more selectively. With all the lists consisting of fixed-sized elements, finding their corresponding elements would still be trivial. Let us also note that fetching the entire R list in a single step might not always be a good idea, as this might be really a lot of data. In other words, when optimizing communication between processes and their stable storage, one is usually forced to make compromises between the number and the size of the messages, where the optimal strategy depends on the specifics of the system and the current circumstances. It might happen that a CCA instance I does not advance the recovery line in spite of Advj; true. In such a case, a process wanting to take fresh checkpoints in spite of running out of stable storage may start deleting its earlier checkpoints, with the only constraint that it must never delete its recovery-line checkpoint or its current candidate for a new recovery-line checkpoint. As the starting checkpoint of any process is stored implicitly, processes can survive even without a stable storage for checkpoints [2], be- cause further checkpoints only increase the probability that in the case of a restart, the required roll-back will not be too drastic. 8 Discussion and conclusions 8.1 Contributions of the paper We have proposed the following improvements to the CCA of [1], its activation and its storage management: - A logical flaw has been identified and corrected. - Instead of exchanging entire rows or columns of the array V, processes now exchange only their updates. - Whenever possible, processes now skip individual iterations of the algorithm. - It can no longer happen that the algorithm executes a superfluous iteration. - Obsolete CMs are now properly recognized and ignored. - By recognizing the policy of recovery-line advancement and the prevention of concurrent instantiations as two separate issues, we have been able to make the former more flexible and the latter, through faster token circulation, less restrictive. The assumption that processes take their checkpoints periodically has also become unnecessary. - The organization of the stable storage no longer includes multiple copies of reception counters. - Process are allowed to replace their old checkpoints with fresh ones. 8.2 Correctness and performance of the algorithm The proposed CCA is essentially that of [1], except that we properly increased the rigour of checkpoint comparison, in every CM added the missing and removed the redundant information, and removed the CMs which consequently lost every purpose. Hence, if the original CCA is correct up to the rigour of checkpoint comparison, our CCA is also correct. In [1], it is demonstrated that, for the adopted system topology, the there proposed CCA is an improvement over the CCA of [3], in that the number of checkpoint comparisons grows, linearly, much slower with the average number of checkpoints per process, and linearly instead of quadratically with the number of processes. It is also demonstrated that the proposed CCA is an improvement over the CCA of [2], in that the number of the exchanged CMs grows linearly instead of quadratically with the number of processes. IMPROVEMENTS TO A ROLL-BACK MECHANISM FOR. Informatica 33 (2009) 511-519 519 If the checkpoint comparisons in the CCA of [1] are made sufficiently rigorous, the algorithm preserves all the above advantages. With the proposed additional optimizations of process communication, the number of checkpoint comparisons and the number of the exchanged CMs are sometimes even lower, because the redundant iterations are skipped and because the redundant requests for information on transmission counters are removed. Besides, the remaining CMs are sometimes shorter, because processes exchange just updates to V. 8.3 Concluding remarks For a distributed algorithm, it is equally important to decide on the task which the process should perform in cooperation, on the protocol securing the necessary coordination and on the management of concurrent instances of the algorithm. For the CCA of [1], we proposed a correction of the task and several improvements to the protocol and the concurrency management. For further work, we propose quantitative assessment of the expected benefits of the additional optimizations. References [1] Gupta B., Rahimi S., Yang Y. (2007) A novel roll-back mechanism for performance enhancement of asynchronous checkpointing and recovery, Informatica, 31(1), pp. 1-13. [2] Juang T., Venkatesan, S. (1991) Crash recovery with little overhead, Proceedings of the 11th International Conference on Distributed Computing Systems, IEEE, Arlington, Texas, pp. 454-461. [3] Ohara M., Arai M., Fukumoto S., Iwasaki K. (2004) Finding a recovery line in uncoordinated checkpointing, Proceedings of the 24th International Conference on Distributed Computing Systems Workshops, IEEE, Hachioji, Tokyo, Japan, pp. 628-633.