https://doi.org/10.31449/inf.v42i3.2244

Informatica 42 (2018) 301–311

Time-stamp Incremental Checkpointing and its Application for an Optimization of the Execution Model to Improve the Performance of CAPE

Van Long Tran
Samovar, Télécom SudParis, CNRS, Université Paris-Saclay - 9 rue Charles Fourier, Évry, France
E-mail: van_long.tran@telecom-sudparis.eu and www.telecom-sudparis.eu
Hue Industrial College - 70 Nguyen Hue street, Hue city, Vietnam
E-mail: tvlong@hueic.edu.vn and www.hueic.edu.vn

Éric Renault
Samovar, Télécom SudParis, CNRS, Université Paris-Saclay - 9 rue Charles Fourier, Évry, France
E-mail: eric.renault@telecom-sudparis.eu and www.telecom-sudparis.eu

Viet Hai Ha
College of Education, Hue University - 34 Le Loi street, Hue city, Vietnam
E-mail: haviethai@gmail.com and www.dhsphue.edu.vn

Xuan Huyen Do
College of Sciences, Hue University - 77 Nguyen Hue street, Hue city, Vietnam
E-mail: doxuanhuyen@gmail.com and www.husc.edu.vn

Keywords: OpenMP, OpenMP on cluster, CAPE, Checkpointing-Aided Parallel Execution, Checkpointing, Incremental checkpointing, DICKPT, TICKPT

Received: March 29, 2018

CAPE, which stands for Checkpointing-Aided Parallel Execution, is a checkpoint-based approach to automatically translate and execute OpenMP programs on distributed-memory architectures. This approach demonstrates high performance and complete compatibility with OpenMP on distributed-memory systems. In CAPE, checkpointing is one of the main factors acting on the performance of the system. This is shown by the two versions of CAPE: the first version, based on complete checkpoints, is much slower than the second version, based on Discontinuous Incremental Checkpointing. This paper presents an improvement of Discontinuous Incremental Checkpointing as well as a new execution model for CAPE that uses the new checkpointing technique. It contributes to improving the performance of CAPE and makes it even more flexible.

Povzetek: Predstavljena je izboljšava CAPE - paralelno izvajanje, usmerjeno s podporo redundance.

1 Introduction

In order to minimize programmers' difficulties when developing parallel applications, a high-level parallel programming tool should be as easy to use as possible. MPI [1], which stands for Message Passing Interface, and OpenMP [2] are two widely-used tools that meet this requirement. MPI is a tool for high-performance computing on distributed-memory environments, while OpenMP has been developed for shared-memory architectures. While MPI is quite difficult to use for non-programmers, OpenMP is very easy to use, only requiring the programmer to tag the pieces of code to be executed in parallel.

Some efforts have been made to port OpenMP on distributed-memory architectures. However, apart from our solution, no solution has successfully met both of the following requirements: 1) full compliance with the OpenMP standard and 2) high performance. The most prominent approaches include the use of an SSI [3], SCASH [4], the use of the RC model [5], performing a source-to-source translation to a tool like MPI [6, 7] or Global Array [8], and Cluster OpenMP [9].

Among all these solutions, the use of a Single System Image (SSI) is the most straightforward approach. An SSI includes a Distributed Shared Memory (DSM) to provide an abstracted shared-memory view over a physical distributed-memory architecture. The main advantage of this approach is its ability to easily provide a fully-compliant version of OpenMP.
Thanks to their shared-memory nature, OpenMP programs can easily be compiled and run as processes on different computers in an SSI. However, as the shared memory is accessed through the network, the synchronization between the memories incurs a significant overhead which makes this approach hardly scalable. Some experiments [3] have shown that the larger the number of threads, the lower the performance.

As a result, in order to reduce the execution-time overhead involved by the use of an SSI, other approaches have been proposed. For example, SCASH only maps the shared variables of the processes onto a shared-memory area attached to each process, the other variables being stored in a private memory, and the RC model uses the relaxed consistency memory model. However, these approaches have difficulties identifying the shared variables automatically. As a result, no fully-compliant implementation of OpenMP based on these approaches has been released so far.

Some other approaches aim at performing a source-to-source translation of the OpenMP code into an MPI code. This approach allows the generation of high-performance code on distributed-memory architectures. However, not all OpenMP directives and constructs can be implemented. As yet another alternative, Cluster OpenMP, proposed by Intel, requires the use of additional directives of its own (i.e. not included in the OpenMP standard). Thus, it cannot be considered as a fully-compliant implementation of the OpenMP standard either.

To bypass these limitations, we developed CAPE [10, 15], which stands for Checkpointing-Aided Parallel Execution. CAPE is a solution that provides a set of prototypes and frameworks to automatically translate OpenMP programs for distributed-memory architectures and make them ready for execution. The main idea of this solution is to use incremental checkpointing techniques (ICKPT) [11, 12] to distribute the parallel jobs and their data to other processes (the fork phase of OpenMP), and to collect the results from all processes after the execution of the jobs (the join phase of OpenMP). ICKPT is also used to deal with the exchange of shared data automatically.

Although CAPE is still under development, it has shown its ability to provide a very efficient solution. For example, a comparison with MPI showed that CAPE is able to reach up to 90% of the MPI performance [13, 14]. This has to be balanced with the fact that CAPE only requires the introduction of a few pragma directives into the sequential code, i.e. no complex code from the user's point of view, while writing an MPI code might require the user to completely refactor the code. Moreover, as compared to other OpenMP-on-distributed-memory solutions, CAPE is fully compatible with OpenMP [13, 15].

This paper presents an improvement of DICKPT, the checkpointing technique used in CAPE, and a new execution model that applies the new checkpoints, which improves the performance and the flexibility of CAPE. Part of these results was presented and published at the SoICT 2017 conference [16].

The paper is organized as follows: the next section describes the CAPE mechanism, its capabilities and its restrictions in detail. Section 3 presents the development of the checkpointing techniques used in CAPE. Then, Section 4 presents the design and the implementation of the new execution model based on the new checkpointing techniques.
The analysis and evaluation of both the new checkpointing technique and the new execution model, together with the experimental results, are presented in Section 5. Finally, Section 6 concludes the paper and presents our future work.

2 CAPE principles

In order to execute an OpenMP program on distributed-memory systems, CAPE uses a set of templates to translate an OpenMP source code into a CAPE source code. Then, the generated CAPE source code is compiled using a traditional C/C++ compiler. At last, the binary code can be executed independently on any distributed-memory system supporting the CAPE framework. The different steps of the CAPE compilation process for C/C++ OpenMP programs are shown in Figure 1.

Figure 1: Translation of OpenMP programs with CAPE.

2.1 Execution model

The CAPE execution model is based on checkpoints that implement the OpenMP fork-join model. This mechanism is shown in Figure 2. To execute a CAPE code on a distributed-memory architecture, the program first runs on a set of nodes, each node being run as a process. Whenever the program meets a parallel section, the master node distributes the jobs among the slave processes using the Discontinuous Incremental Checkpointing (DICKPT) [12, 13] mechanism. Through this approach, the master node generates DICKPTs and sends them to the slave nodes, each slave node receiving a single checkpoint. After sending the checkpoints, the master node waits for the results to be returned from the slaves. The next step is different depending upon the nature of the node: the slave nodes receive their checkpoint, inject it into their memory, execute their part of the job, and send back the result to the master node using DICKPT; the master node waits for the results and, after receiving them all, merges them before injecting them into its memory. At the end of the parallel region, the master sends the resulting checkpoint to every slave to synchronize the memory space of the whole program.

Figure 2: CAPE execution model.

2.2 Translation from OpenMP to CAPE

In the CAPE framework, a set of functions has been defined and implemented to perform the tasks devoted to DICKPT, typically distributing checkpoints, sending/receiving checkpoints, extracting/injecting a checkpoint from/to the program's memory, etc. Besides, a set of templates has been defined in the CAPE compiler to perform the translation of the OpenMP program into the CAPE program automatically and make it executable in the CAPE framework. So far, nested loops and shared-data variable constructs are not supported yet. However, this is not regarded as an issue, as it can be solved at the level of the source-to-source translation and does not require any modification in the CAPE philosophy. For now, CAPE can only be applied to OpenMP programs matching Bernstein's conditions [17].

After the translations operated by the CAPE compiler, the generated source code is free of any OpenMP directives and structures. Figure 3 presents an example of code substitution for the specific case of the parallel for construct. This example is typical of those we implemented for the other constructs [7]. The automatically generated code is based on the following functions that are part of the CAPE framework (a sketch of how they compose is given after the list):

– start() sets up the environment for the generation of DICKPTs.

– stop() restores the environment used for the generation of DICKPTs.

– create(file) generates a checkpoint with name file.

– inject(file) injects a checkpoint into the memory of the current process.

– send(file, node) sends a checkpoint from the current process to another.

– wait_for(file) waits for checkpoints and merges them to create another one.

– merge(file1, file2) merges two checkpoints together.

– broadcast(file) sends a checkpoint to all slave nodes.

– receive(file) waits for and receives a checkpoint.

Figure 3: Template for the parallel for with incremental checkpoints.
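Figure 3 only appears as an image in the original paper. As an indication of how these primitives compose, the sketch below shows one plausible expansion of a parallel for region following the execution model of Section 2.1; the names rank, MASTER and num_slaves, as well as the exact call order, are our own assumptions, not the authors' template.

    /* A plausible composition of the CAPE primitives for a parallel
     * for region. rank, MASTER and num_slaves are illustrative. */
    if (rank == MASTER) {
        start();                    /* begin DICKPT monitoring           */
        for (int node = 1; node <= num_slaves; node++) {
            /* advance to the chunk of iterations assigned to `node`... */
            create("before");       /* snapshot the state for that chunk */
            send("before", node);   /* one checkpoint per slave          */
        }
        stop();                     /* end DICKPT monitoring             */
        wait_for("result");         /* receive and merge slave results   */
        inject("result");           /* apply the merged updates locally  */
        broadcast("result");        /* synchronize every slave's memory  */
    } else {
        receive("before");          /* get the checkpoint for this slave */
        inject("before");           /* resume from the saved state       */
        start();
        /* ... execute the assigned part of the loop ...                 */
        create("after");            /* capture the modified data         */
        stop();
        send("after", MASTER);      /* return the result to the master   */
        receive("result");          /* get the merged checkpoint back    */
        inject("result");           /* synchronize the local memory      */
    }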
2.3 Discontinuous incremental checkpointing on CAPE

Checkpointing is the technique that saves the image of a process at given points during its lifetime, so that the process can be resumed from the saved state if necessary [11, 18]. Using checkpointing, processes can resume their execution from a checkpointed state when a failure occurs, with no need to spend time initializing and re-executing them from the beginning. These techniques were introduced more than two decades ago. Nowadays, they are widely studied and used for fault tolerance, application tracing and debugging, roll-back and animated playback, and process migration.

Basically, checkpointing techniques can be categorized into two groups: complete checkpointing and incremental checkpointing. Complete checkpointing [18, 19, 20] saves all the information regarding the process at each point where a checkpoint is generated. The advantage of this technique is the reduced time needed to generate and restore checkpoints; however, the checkpoints are very large. Incremental checkpointing [11, 21, 22, 23, 12, 24] only saves the information modified since the previous checkpoint. This technique reduces both the checkpointing overhead and the checkpoint size, so it is widely used in distributed computing. Besides using data compression to reduce the checkpoint size [11, 21, 24], research also focuses on techniques that detect the modified data while keeping the checkpoint size minimal. Typical techniques include page-based protection to identify the pages in memory that have been modified [11, 22, 23], word-level granularity [21, 12], block encoding [22], user-directed checkpointing and memory exclusion [11], and live-variable analysis [24].

Figure 4: Principle of DICKPT in cases of checkpointing.

In CAPE, Discontinuous Incremental Checkpointing (DICKPT) is a development based on incremental checkpointing. A DICKPT checkpoint contains two kinds of data: the register information and the modified data of the process. The former is copied from all the registers of the process, while the latter is identified using write-protection techniques. Figure 4 shows the steps used to monitor a process and generate its checkpoints in CAPE. This is done by another process making use of the ptrace Unix system call. The idea is the following: at the beginning of the parallel region, the monitor sets all the pages of the monitored process as write-protected. Whenever the monitored process tries to write into a write-protected page, a SIGSEGV signal is generated. The monitor then saves the data of this page, removes the write protection and lets the monitored process write into the page. At the end of the region, the monitor compares the saved data with the current data of the monitored process's pages. The differences are extracted and saved into the checkpoint file.
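To illustrate the write-protection mechanism, here is a minimal, self-contained sketch of page-based dirty-page detection on Linux. For brevity it runs entirely inside one process (the protection is lifted directly in the signal handler), whereas CAPE drives the same mechanism from a separate monitor process through ptrace; the region size and the reporting at the end are illustrative only.

    #define _GNU_SOURCE             /* for the full siginfo_t definition */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_PAGES 16

    static long page_size;
    static char *region;                 /* the memory being monitored   */
    static int dirty[REGION_PAGES];      /* 1 if the page was written to */

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)si->si_addr;
        uintptr_t base = (uintptr_t)region;
        if (addr >= base && addr < base + REGION_PAGES * page_size) {
            size_t page = (addr - base) / page_size;
            dirty[page] = 1;             /* remember: this page changed  */
            /* lift the protection so the faulting write can proceed    */
            mprotect(region + page * page_size, page_size,
                     PROT_READ | PROT_WRITE);
        } else {
            _exit(1);                    /* a genuine fault: give up     */
        }
    }

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, REGION_PAGES * page_size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = segv_handler;
        sigaction(SIGSEGV, &sa, NULL);

        /* equivalent of start(): write-protect the monitored region    */
        mprotect(region, REGION_PAGES * page_size, PROT_READ);

        region[0] = 'x';                 /* first write: handler fires   */
        region[5 * page_size] = 'y';     /* marks a second page as dirty */

        /* equivalent of create(): only dirty pages are compared/saved  */
        for (int i = 0; i < REGION_PAGES; i++)
            if (dirty[i]) printf("page %d was modified\n", i);
        return 0;
    }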
2.4 Remarks

The good performance of CAPE as compared to that of MPI and its full compliance with the OpenMP specifications [13, 15, 14] have made CAPE a good alternative to port OpenMP on distributed-memory architectures. However, the implementation of CAPE is not complete so far, and some disadvantages can be listed:

1. DICKPT saves all the modified data of the process, including temporary and private variables. This results in unnecessary synchronization inside an OpenMP program.

2. As shown in Figure 2, the master node might act as a bottleneck while waiting for the checkpoints from the slaves, merging the checkpoints and/or sending back the data to the slaves for memory synchronization.

3. To distribute the jobs to the slaves, the master node generates a number of checkpoints that depends upon the number of slave nodes, so that each slave node receives exactly one checkpoint (see Figure 7). This method can reach a high level of optimization. However, it might not be flexible enough in some cases: 1) the number of slaves may not be known at compile time, 2) the OpenMP source code has to be modified so as to detect when the master generates the checkpoints, and 3) the dynamic scheduling of OpenMP cannot be implemented using this method.

4. After distributing the jobs, the slave nodes execute the divided jobs while the master does nothing until the reception of the resulting checkpoints from the slaves, which clearly wastes resources.

5. For synchronization, the checkpoints have to be sent in order so as to resume exactly the last state of the process.

3 Time-stamp incremental checkpointing (TICKPT)

Time-stamp Incremental Checkpointing (TICKPT) is an improvement of DICKPT obtained by adding a new element, the time-stamp, to incremental checkpoints, and by removing unnecessary data based on the data-sharing attributes of the variables of the OpenMP program.

Basically, a TICKPT checkpoint contains three mandatory elements: the register information, the modified regions in the memory of the process, and their time-stamp (see the sketch below). As in DICKPT, the register information is extracted from all the registers of the process in the system. However, the time-stamp is added to identify the order of the checkpoints in the program. This contributes to reducing the time needed to merge checkpoints and to select the right element when two modifications are located at the same place in memory. In addition, only the modified data of shared variables are detected and saved into the checkpoints. This significantly reduces the checkpoint size, by an amount that depends on the size of the private variables of the OpenMP program.

To represent the order of the checkpoints in a program, time-stamps have to reflect the order in which the instructions are executed. In the general case, an activation tree [25] can be used to identify the sequence of function calls in a program. In CAPE, checkpoints are always generated at the same function-call level, so the program counter can be used for the sake of simplicity. However, if the instruction belongs to a loop, the program counter is combined with the loop iteration number to represent the order within the loop exactly.

To detect the modified data, the write-protection mechanism is used. However, only the shared variables are written down in the checkpoint file. The question is then how to detect which variables are private and which are shared.
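The paper does not give the concrete data layout of a checkpoint, so the following C sketch is only one plausible way to organize the three mandatory elements of TICKPT; all type and field names are hypothetical, and the register structure assumes x86-64 Linux.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/user.h>          /* struct user_regs_struct (ptrace) */

    /* One modified region of the process's memory; the modified
     * bytes themselves would follow the header in the checkpoint. */
    struct tickpt_region {
        uintptr_t addr;            /* start address of the region      */
        size_t    size;            /* number of modified bytes         */
    };

    /* Time-stamp: the program counter identifies the checkpoint
     * site, and the iteration number orders checkpoints generated
     * inside a loop. */
    struct tickpt_timestamp {
        uintptr_t program_counter;
        uint64_t  loop_iteration;
    };

    struct tickpt_checkpoint {
        struct user_regs_struct regs;      /* all register values      */
        struct tickpt_timestamp ts;        /* global order of the ckpt */
        size_t                  nregions;  /* modified shared data only*/
        struct tickpt_region    regions[]; /* flexible array member    */
    };

With such a layout, merging two checkpoints amounts to concatenating their region lists; when two regions overlap, the data carrying the greater (program_counter, loop_iteration) pair wins, which is how the time-stamp selects the right element at a given memory location.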
In an OpenMP program, data-sharing variable attributes can be set either implicitly or explicitly [2]. All variables declared outside a #pragma omp parallel directive are implicitly shared. This includes all global and static local variables, allocated in the heap and data segments of the process's memory, as well as local variables allocated on the stack (see Figure 5). The variables in the heap and data segments can easily be identified by their addresses. For the variables on the stack, we save the stack pointer before entering the #pragma omp parallel region: variables declared before the stack pointer are shared, the others are private.

Figure 5: Allocation of OpenMP program's variables in virtual process memory.

To explicitly change the status of a variable, the programmer can use data-sharing attributes such as the OpenMP directive #pragma omp threadprivate(list of variables) and the related clauses. The OpenMP data-sharing clauses are shown in Table 1.

Table 1: OpenMP data-sharing clauses.

  Clause                  Description
  default(none|shared)    Specifies the default behavior of variables
  shared(list)            Specifies the list of shared variables
  private(list)           Specifies the list of private variables
  firstprivate(list)      Allows the listed private variables to be initialized with their values from before the parallel region
  lastprivate(list)       Allows the values of the listed private variables to be shared at the end of the parallel region
  copyin(list)            Allows access to the values of threadprivate variables
  copyprivate(list)       Specifies the list of private variables whose values should be shared among all threads
  reduction(list, ops)    Specifies the list of variables that are subject to a reduction operation at the end of the parallel region

4 A new execution model for CAPE

In order to improve the performance of CAPE and its flexibility, we designed a new execution model that extends the one presented in Section 2.1. In this new execution model, DICKPT is replaced by TICKPT. Figure 6 illustrates the model, which can be described as follows:

Figure 6: The new execution model for CAPE.

1. At the beginning of the program, all the nodes in the system execute the same sequential code.

2. When a parallel region is reached, the master process creates a set of incremental checkpoints. The number of incremental checkpoints depends upon the number of tasks in the parallel region. Each incremental checkpoint contains the state of the program to be used to resume its execution in another process at the saved time.

3. The master process scatters the set of incremental checkpoints. Each node receives some of the checkpoints generated by the master process. This step is illustrated in Figure 8.

4. The received incremental checkpoints are injected into the slave processes' memories.

5. The slave processes resume their execution.

6. The results on the slave processes are extracted by identifying the modified regions, and saved as incremental checkpoints.

7. The incremental checkpoint of each process is sent back to the master node, where all the incremental checkpoints are combined to generate a single checkpoint. This step can be distributed among the processes if need be.

8. The final combined incremental checkpoint is injected into the master process's memory and the master process can resume its execution.

Changing the execution model implies changing the translation templates. Figure 9 presents the template for the #pragma omp parallel for directive adapted to the new execution model. The other OpenMP directives can be handled in a similar way.

Figure 7: Scheduling method in CAPE-2.

Figure 8: Scheduling method with the new execution model.
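Figure 9 is reproduced as an image in the original paper. The sketch below is our reconstruction of its template, inferred from the step-by-step walkthrough that follows; A, B, C stand for the init/test/increment expressions and D for the body of the original for loop, and the exact line layout is an assumption.

    /* Reconstructed template for "#pragma omp parallel for" under the
     * new execution model (inferred, not copied from Figure 9). Line
     * numbers in the comments refer to the walkthrough below. */
    for (A; B; C)
        generate_dickpt(before_i);    /* line 3: one ckpt per iteration */
    scatter(before, &recv_n, master); /* line 4: distribute the ckpts   */
    inject(recv_n);                   /* line 5: load the local tasks   */
    D;                                /* line 6: resume the execution   */
    generate_dickpt(after_n);         /* line 7: capture local results  */
    allreduce(after_n, &after, op);   /* line 8: combine all results    */
    inject(after);                    /* line 9: synchronize the memory */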
For this template, CAPE operates as follows:

– generate_dickpt(before_i) (line 3): at each loop iteration, the master process generates an incremental checkpoint.

– scatter(before, &recv_n, master) (line 4): the master process scatters the checkpoints to the available processes, including itself. Each process receives some of the checkpoints (recv_n).

– inject(recv_n) (line 5): each checkpoint is injected into the target process's memory.

– the execution is resumed on instruction D (line 6).

– generate_dickpt(after_n) (line 7): each process generates an incremental checkpoint that saves the result of its execution.

– allreduce(after_n, &after, op) (line 8): the after_n checkpoint of process n is sent to the other processes. Checkpoints are combined, calculated and saved in a new after checkpoint. With TICKPT, the order of the checkpoints is encoded within each of them, so this step can be performed using the Recursive Doubling algorithm [26], as illustrated in Figure 10.

– inject(after) (line 9): the incremental checkpoint after is injected into the application's memory to synchronize the state of the program on all the nodes.

Figure 9: Prototype for the parallel for with the new execution model.

Figure 10: Recursive doubling for allreduce.

5 Analysis and evaluation

5.1 From DICKPT to TICKPT

As presented in Section 3, TICKPT is an evolution of DICKPT. It adds time-stamps to checkpoints to make them more flexible and to reduce the synchronization time when applied to CAPE. In addition, it removes unnecessary data to reduce the checkpoint size. The synchronization time is evaluated below through the overall performance of CAPE. For the checkpoint size, we consider the amount of modified data generated by TICKPT and DICKPT after executing the piece of code in Figure 11 on each node, with various values of N.

    #define N 1000
    ...
    int A[N], B[N], C[N], i;
    ...
    #pragma omp parallel for private(A,B) shared(C)
    for (i = 0; i < N; i++)
        /* loop body truncated in the extracted text; it modifies A, B and C */

Figure 11: Piece of code used to compare the amounts of modified data.

The loop modifies the data contained in variables A, B and C. DICKPT saves them all, while TICKPT only saves the data of the shared variable C. Therefore, the amount of modified data is significantly reduced with TICKPT, as shown in Figure 12.

5.2 Analysis of the new execution model

Moving from a scheduling of CAPE processes based on the number of nodes (Figure 7) to a scheduling based on