CHALLENGES AND SOLUTIONS FOR THERMAL-AWARE SOC TESTING

Zebo Peng, Zhiyuan He and Petru Eles
Embedded Systems Laboratory, Linköping University, Sweden

INVITED PAPER
MIDEM 2007 CONFERENCE - WORKSHOP ON ELECTRONIC TESTING
12.09.2007 - 14.09.2007, Bled, Slovenia

Key words: electronic testing, SoC devices, thermal-aware SoC testing techniques, test efficiency

Abstract: High temperature has a negative impact on the performance, reliability and lifespan of a system on chip. During testing, the chip can overheat because of a substantial increase in switching activity and because tests are run concurrently to reduce test application time. This paper discusses several issues related to the thermal problem during SoC testing. It then presents a thermal-aware SoC test scheduling technique to generate the shortest test schedule such that the temperature constraints of individual cores and the constraint on the test-bus bandwidth are satisfied. In order to avoid overheating during the test, we partition test sets into shorter test sub-sequences and add cooling periods in between. Furthermore, we interleave the test sub-sequences from different test sets in such a manner that the test-bus bandwidth reserved for one core is utilized during its cooling period for the test transportation and application of the other cores. We have developed a heuristic to minimize the test application time by exploring alternative test partitioning and interleaving schemes with variable lengths of test sub-sequences and cooling periods. Experimental results have shown the efficiency of the proposed heuristic.

Challenges and solutions in the testing of systems on chip

Key words: electronics testing, SoC (systems on chip), SoC testing techniques with overheating control, test efficiency

Abstract: High temperatures have a negative effect on the properties, reliability and lifespan of systems on chip. During testing, the chip can overheat because of the increased number of switching events that results from trying to shorten the test time. In this paper we describe problems of this kind and present techniques for designing test procedures that ensure that individual parts of the chip do not overheat. To prevent overheating, we divided the tests into shorter test periods with intermediate pauses for cooling the electronics. In addition, we arranged the test sequences so that signals travelled over the test paths selectively to individual parts of the electronics while other parts were cooling. We developed a suitable methodology that allowed us to shorten the test times. Experimental results have shown the effectiveness of the proposed method.

1. Introduction

The rapid development of System-on-Chip (SoC) techniques has posed many challenges for the design and test communities. The challenges facing designers have been addressed by the development of the core-based design method, where pre-designed and pre-verified building blocks, called embedded cores, are integrated to form a SoC. While the core-based design method has reduced design time, it entails several test-related problems. How to address these test problems in order to provide an optimal test solution is a great challenge to the SoC test community /1/.

A key issue for SoC testing is the selection of an appropriate test strategy and the design of an on-chip test infrastructure to implement the selected test strategy. For a core embedded in a SoC, direct test access to its periphery is impossible. Therefore, a special test access mechanism must be included in a SoC to connect the core periphery to the test sources and sinks.
The design of the test infrastructure, including the test access mechanism, must be considered together with the test scheduling problem, in order to reduce the silicon area used for test access and to minimize the total test application time /2/, /3/, /4/, /5/, /6/, /7/, /8/.

The rapidly increasing test data volume needed for SoC testing is another issue to be addressed, since it contributes significantly to long test application times and huge ATE memory requirements /7/. This issue can be addressed by sharing the same test set among several cores and by test data compression. Both test set sharing and test compression can exploit the large percentage of don't-care bits in typical test patterns generated for complex SoC designs in order to reduce the amount of test data needed /9/.

The issue of power dissipation should also be considered in order to prevent a SoC chip from being damaged by overheating during test /2/, /9/, /10/, /11/. High temperature has become a technological barrier to the testing of high-performance SoCs, especially when deep submicron technologies are employed. In order to reduce test time while keeping the temperature of the cores under test within a safe range, thermal-aware test scheduling techniques are required, and this paper discusses several issues related to thermal-aware SoC testing.

Thermal-aware testing has recently attracted considerable research interest. Liu et al. proposed a technique to distribute the generated heat evenly across the chip during tests, and thereby avoid high temperature /12/. Rosinger et al. proposed an approach to generate thermal-safe test schedules with minimized test time by utilizing the core adjacency information to drive the test scheduling and reduce the temperature stress between cores /13/.
In our previous work /14/, we proposed a test set partitioning and interleaving technique, and employed constraint logic programming (CLP) to generate thermal-aware test schedules with the minimum test application time (TAT). In our work, we assume that a continuous test will raise the temperature of a core beyond a limit above which the core may be damaged. In order to avoid overheating during tests, we partition the entire test set into a number of test sub-sequences and introduce a cooling period between two consecutive test sub-sequences. As the test application time increases substantially when long cooling periods are introduced, we interleave different partitioned test sets in order to generate a shorter test schedule.

In /14/, we restricted the test sub-sequences that belong to the same test set to have identical length. Moreover, we also restricted the cooling periods between test sub-sequences from the same test set to have equal length. The main purpose of these restrictions was to keep the design space small and thereby reduce the optimization time, so that the CLP-based algorithm would be able to generate the optimal solutions in a reasonable time. However, these restrictions have resulted in less efficient test schedules and longer test application times.

In our recent work, we have eliminated these restrictions so that both test sub-sequences and cooling periods can have arbitrary lengths. Since breaking the regularity of test sub-sequences and cooling periods dramatically increases the size of the exploration space, the CLP-based test scheduling approach proposed in /14/ is no longer feasible, especially for practical industrial designs. Therefore, new, low-complexity heuristics are needed which are able to produce efficient test schedules under the less restricted and more realistic assumptions.

The rest of this paper is organized as follows.
The next section discusses the thermal issue related to SoC testing, and some solutions. It also motivates the importance of test partitioning and interleaving with arbitrary partition and cooling lengths. Section 3 formally defines the thermal-aware test scheduling problem we are addressing in this paper. Section 4 presents the overall strategy of our thermal-aware scheduling approach, and Section 5 the proposed test scheduling heuristic. The experimental results are described in Section 6, and conclusions in Section 7.

2. The thermal issue

High temperature can be observed in most high-performance SoCs due to high power consumption. High power consumption results in excessive heat dissipation and elevates the junction temperature, which has a large impact on the operation of integrated circuits /15/, /16/, /17/, /18/.

The performance of integrated circuits is proportional to the driving current of CMOS transistors, which is a function of the carrier mobility. Increasing junction temperature decreases the carrier mobility and the driving current of the CMOS transistors, which consequently degrades the performance of circuits /19/.

At higher junction temperature, the leakage power increases. The increased leakage power in turn contributes to an increase of junction temperature. This positive feedback between leakage power and junction temperature may result in thermal runaway and destroy the chip /19/.

The long-term reliability and lifespan of integrated circuits also depend strongly on junction temperature. Failure mechanisms in CMOS integrated circuits, such as gate oxide breakdown and electro-migration, are accelerated at high junction temperature. This may result in a drop in the long-term reliability and lifespan of circuits /18/.

An advanced cooling system can be one solution to the high temperature problems. However, the cost of the entire system will increase substantially, and the size of the system is inevitably large.
The thermal issue becomes even more severe during testing than in normal functional mode, since testing dissipates more power and heat due to a substantial increase in switching activity /18/.

In order to prevent excessive power during test, several techniques have been developed. Low-power DFT and test synthesis techniques can be utilized, including low-power scan chain design /20/, /21/, as well as scan cell and test pattern reordering /21/, /23/, /24/. Although low-power DFT can reduce the power consumption, such techniques usually add extra hardware to the design and can therefore increase the delay and the cost of the produced chips.

For many modern SoC designs, when a long sequence of test patterns is continuously applied to a core, the temperature of this core may rise beyond a certain limit above which the core will be damaged. In such scenarios, the test has to be stopped when the core temperature reaches the limit, and can be restarted later when the core has cooled down. Thus, by partitioning a test set into shorter test sub-sequences and introducing cooling periods between them, we can avoid overheating during test. Figure 1 illustrates the temperature profile of a core under test when the entire test set for the core is partitioned into four test sub-sequences, $TS_1$, $TS_2$, $TS_3$, and $TS_4$, and cooling periods are introduced between the sub-sequences. In this way, the temperature of the core under test remains within the imposed temperature limit.

Fig. 1. Illustration of test set partitioning

Introducing long cooling periods between test sub-sequences will obviously increase the test application time (TAT) substantially. To address this problem, we can reduce the TAT by interleaving the partitioned test sets such that the test-bus bandwidth reserved for a core $C_i$ is utilized, during its cooling periods, to transport test data for another core $C_j$ ($j \neq i$), and thereafter to test the core $C_j$.
By interleaving the partitioned test sets belonging to different cores, the test-bus bandwidth is utilized more efficiently. Figure 2 gives an example where two partitioned test sets are interleaved so that the test time is reduced with no need for extra bus bandwidth.

Fig. 2. Illustration of test set interleaving

There are many design alternatives which can be used to implement the above basic ideas of test partitioning and interleaving for thermal-aware test scheduling. In general, the objective is to minimize the test application time by generating an efficient test partitioning and interleaving scheme and by scheduling the individual test sub-sequences so that the temperature limits of individual cores are not violated and, at the same time, the test-bus bandwidth constraint is satisfied.

This is a complex optimization problem, and we have developed, as mentioned in the previous section, a solution for the problem with the restriction that the test sub-sequences from the same test set should have identical length and the cooling periods between test sub-sequences from the same test set should have equal length /14/. However, this restriction has resulted in less efficient test schedules, and thus longer test application times.

To illustrate the usefulness of eliminating this restriction so that both test sub-sequences and cooling periods can have arbitrary lengths, each test sub-sequence can be considered as a rectangle, with its height representing the required test-bus bandwidth and its width representing the test time. Figure 3 gives an example where three test sets, $TS_1$, $TS_2$, and $TS_3$, are partitioned into 5, 3, and 2 test sub-sequences, respectively.
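The rectangle view of a test sub-sequence can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the scheduling heuristic of this paper: the names `SubSequence`, `place`, `peak_bandwidth`, `fits_bus` and `completion_time`, and all numbers, are hypothetical and chosen for the example. The sketch places partitioned test sets on a time axis, checks the peak bus usage against a bandwidth limit, and compares a back-to-back schedule with an interleaved one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubSequence:
    """One test sub-sequence seen as a rectangle (width = time, height = bandwidth)."""
    core: str
    start: int      # start time, in arbitrary time units
    length: int     # test time (rectangle width)
    bandwidth: int  # required test-bus bandwidth (rectangle height)

    @property
    def end(self) -> int:
        return self.start + self.length


def place(core, start, lengths, coolings, bandwidth):
    """Lay out one partitioned test set: sub-sequences separated by cooling periods."""
    t, out = start, []
    for j, length in enumerate(lengths):
        out.append(SubSequence(core, t, length, bandwidth))
        t += length
        if j < len(coolings):
            t += coolings[j]  # the bus is free for other cores during cooling
    return out


def peak_bandwidth(schedule):
    """Largest total bandwidth in use at any instant (event sweep)."""
    events = []
    for s in schedule:
        events.append((s.start, s.bandwidth))
        events.append((s.end, -s.bandwidth))
    # At equal times, releases (negative) are processed before acquisitions.
    events.sort()
    used = peak = 0
    for _, delta in events:
        used += delta
        peak = max(peak, used)
    return peak


def fits_bus(schedule, bandwidth_limit):
    return peak_bandwidth(schedule) <= bandwidth_limit


def completion_time(schedule):
    return max(s.end for s in schedule)


# Assumed example: two cores, each needing bandwidth 3 on a bus with limit 4,
# each test set split into three sub-sequences of 10 with cooling periods of 10.
BL = 4
core1 = place("C1", 0, [10, 10, 10], [10, 10], 3)
back_to_back = core1 + place("C2", 50, [10, 10, 10], [10, 10], 3)
interleaved = core1 + place("C2", 10, [10, 10, 10], [10, 10], 3)
concurrent = core1 + place("C2", 0, [10, 10, 10], [10, 10], 3)

print(completion_time(back_to_back), fits_bus(back_to_back, BL))
print(completion_time(interleaved), fits_bus(interleaved, BL))
print(peak_bandwidth(concurrent), fits_bus(concurrent, BL))
```

With these assumed numbers, the interleaved schedule completes at time 60 instead of 100 while staying within the bandwidth limit, whereas running both cores concurrently would need a bandwidth of 6 and violate the limit.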
Note that the partitioning scheme, which determines the length of test sub-sequences and cooling periods, has ensured by means of a temperature simulation /19/ that the temperature of each core will not violate the temperature limit. Figure 3(a) shows a feasible test schedule under the regularity assumption (identical test sub-sequence length and identical cooling periods for each core). In Figure 3(b), an alternative test schedule is depicted, where the test sub-sequences and the cooling periods can have arbitrary lengths. This example shows that a shorter test schedule can be found by exploring alternative solutions, where the number and length of test sub-sequences, the length of cooling periods, and the way that the test sub-sequences are interleaved are different from those in Figure 3(a).

Fig. 3. Comparison of two different test schedules: (a) a test schedule with a regular partitioning scheme; (b) an alternative test schedule with an irregular partitioning scheme

3. Problem formulation

We have assumed a test architecture using a single test bus to transport test data between the tester and the cores under test. A tester can be either an external automated test equipment (ATE) or an embedded tester integrated on the chip. Each core under test is connected to the test bus with a number of dedicated TAM wires. The test patterns, together with a generated test schedule, are stored in the tester memory. A test controller controls the entire test process according to the test schedule, sending test patterns to and receiving test responses from the corresponding cores through the test bus and the TAM wires.

Suppose that a system S, consisting of n cores $C_1, C_2, \ldots, C_n$, employs the test architecture defined above.
In order to test core $C_i$, a test set $TS_i$ consisting of $l_i$ generated test patterns is transported through the test bus and the dedicated TAM wires to and from core $C_i$, utilizing a bus bandwidth $W_i$. The test bus is designed to allow several test sets to be transported in parallel but has a bandwidth limit $BL$ ($BL > W_i$, $i = 1, 2, \ldots, n$). We assume that continuously applying test patterns belonging to $TS_i$ may cause the temperature of core $C_i$ to go beyond a certain limit $TL_i$, so that the core can be damaged. In order to prevent overheating during tests, as discussed before, we partition a test set into a number of test sub-sequences and introduce a cooling period between two consecutive test sub-sequences, such that no test sub-sequence drives the core temperature higher than the limit and the core temperature is kept within a safe range.

The problem that we address in this paper is to generate a partitioning scheme and a test schedule for system S such that the test application time is minimized while the bus bandwidth constraint is satisfied and the temperatures of all cores during tests remain below the corresponding temperature limits.

4. Overall strategy

We have proposed an approach that solves the formulated problem in two major steps. First, we generate an initial partitioning scheme for every test set by using temperature simulation and the given temperature limits. Second, the test scheduling algorithm explores different test schedules by selecting alternative partitioning schemes, interleaving test sub-sequences, and squeezing them into a two-dimensional space constrained by the test-bus bandwidth.

In order to generate thermal-safe partitioning schemes, we have used a temperature simulator, HotSpot /17/, /25/, /26/, /27/, to simulate instantaneous temperatures of individual cores during tests.
HotSpot assumes a circuit packaging configuration widely used in modern IC designs, and it computes a compact thermal model /27/ based on the analysis of three major heat flow paths existing in the assumed packaging configuration /26/, /27/. Given the floorplan of the chip and the power consumption profiles of the cores, HotSpot calculates the instantaneous temperatures and estimates the steady-state temperatures for each unit. In this paper, we assume that the thermal influence between cores is negligible, since heat transfer in the vertical direction dominates the transfer of dissipated heat; this has been validated by simulation results with HotSpot /14/, /19/.

When generating the initial thermal-safe partitioning scheme, we have assumed that a test set $TS_i$ is started when the core is at the ambient temperature $TM_{amb}$. We then start the temperature simulation and record the time moment $t_{h1}$ when the temperature of core $C_i$ reaches the given temperature limit $TL_i$. Knowing the latest test pattern that has been applied by time moment $t_{h1}$, we can easily obtain the length of the first thermal-safe test sub-sequence $TS_{i1}$ that should be partitioned and separated from the test set $TS_i$. The temperature simulation then continues while the test process on core $C_i$ is stopped until the temperature has gone down to a certain degree. Usually a relatively long time is needed to cool a core down to the ambient temperature, as the temperature decreases slowly at a lower temperature level (see the dashed curve in Figure 4). Thus, we let the temperature of core $C_i$ go down only until the slope of the temperature curve reaches a given value $k$ at time moment $t_{c1}$. At this moment, we have obtained the duration of the first cooling period $d_{i1} = t_{c1} - t_{h1}$. Restarting the test process from time moment $t_{c1}$, we repeat this heating-and-cooling procedure throughout the temperature simulation until all test patterns belonging to $TS_i$ are applied.
Thus we have generated the initial thermal-safe partitioning scheme, where test set $TS_i$ is partitioned into m test sub-sequences $\{TS_{ij} \mid j = 1, 2, \ldots, m\}$ and the cooling periods between consecutive test sub-sequences have the durations $\{d_{ij} \mid j = 1, 2, \ldots, m-1\}$, respectively. Figure 4 depicts an example of partitioning a test set into four thermal-safe test sub-sequences with three cooling periods added in between.
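The heating-and-cooling procedure described above can be sketched as follows. This is a minimal sketch under strong assumptions and does not use HotSpot: the temperature is approximated by a first-order lumped model with a single time constant, one test pattern is applied per time step, and the function name `partition_test_set` and all parameter values are hypothetical. The slope threshold `k` is taken here as the magnitude of the cooling slope.

```python
def partition_test_set(n_patterns, t_amb, t_limit, t_steady, tau, k, dt=1.0):
    """Split a test set into thermal-safe sub-sequences with cooling periods.

    Assumed first-order thermal model (NOT HotSpot):
      heating: dT/dt = (t_steady - T) / tau   (t_steady: temperature under endless test)
      cooling: dT/dt = -(T - t_amb) / tau
    One pattern is applied per time step dt. Cooling stops once the magnitude
    of the cooling slope has dropped to k. Returns (lengths, coolings), both
    counted in time steps.
    """
    if k <= 0 or not 0 < dt < tau:
        raise ValueError("need k > 0 and 0 < dt < tau")
    temp = t_amb
    remaining = n_patterns
    lengths, coolings = [], []
    while remaining > 0:
        # Heating phase: apply patterns while the next step stays within the limit.
        applied = 0
        while remaining > 0:
            nxt = temp + dt * (t_steady - temp) / tau
            if nxt > t_limit:
                break
            temp = nxt
            applied += 1
            remaining -= 1
        if applied == 0:
            raise ValueError("no pattern can be applied without exceeding the limit")
        lengths.append(applied)
        if remaining == 0:
            break
        # Cooling phase: wait until the temperature curve has flattened to slope k.
        cooled = 0
        while (temp - t_amb) / tau > k:
            temp -= dt * (temp - t_amb) / tau
            cooled += 1
        coolings.append(cooled)
    return lengths, coolings


# Assumed example values: 100 patterns, ambient 45, limit 90, steady state 120.
lengths, coolings = partition_test_set(
    n_patterns=100, t_amb=45.0, t_limit=90.0, t_steady=120.0, tau=20.0, k=0.5
)
print("sub-sequence lengths:", lengths)
print("cooling periods:     ", coolings)
```

In this sketch the first sub-sequence is the longest because it starts from the ambient temperature, while later sub-sequences start from the higher temperature at which cooling was cut off. A smaller `k` gives longer cooling periods and longer sub-sequences, which is the kind of trade-off the scheduling step can explore.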