Analyzing Educational Process Through a Chain of Data Marts Viljan Mahnic University of Ljubljana Faculty of Computer and Information Science Trzaska 25, SI-1000 Ljubljana, Slovenia viljan.mahnic@fri.uni-lj.si Keywords: data warehouse, data mart, star schema data model. Received: June 6, 2003 We describe the development strategy, architecture, and logical design of a data warehouse that can be built gradually, exploiting the benefits of the bottom-up, data mart approach. Connections between individual data marts are planned in advance with the aim of building a sequence of data marts that makes it possible to analyze the educational process as a value chain. Queries can be made across different subject areas (viz. enrolment applications, enrolment, examination, and degree records) in order to obtain a snapshot or a slice of the entire value chain that shows how far a subset of students has moved from the enrolment application to their final degree. 1 Introduction In the early nineties, Bill Inmon introduced the concept of a data warehouse as a subject oriented, integrated, nonvolatile, time variant collection of data in support of management's decisions [5]. To create a data warehouse, data are extracted from different source systems, and then transformed, integrated, and loaded on an appropriate data store. Since then, data warehousing has grown to become one of the most important areas in the information systems field [3]. The benefits of data warehousing are numerous and some organizations are receiving significant returns [12]. The concepts of data warehousing have attracted substantial attention within the EUNIS community. At the EUNIS 1997 Conference, D. Stevenson [9] presented a data warehouse development project from users' and management's perspective, while at EUNIS 1999 M. Bajec et al. [1] proposed to build a data warehouse in order to analyze enrolment applications. At EUNIS 2001, two French initiatives were presented: J-F. Desnos [2] described a comprehensive data warehouse project for French universities, while Flory et al. [4] presented the design and implementation of a data warehouse for research administration. Additionally, the importance of data quality was described in order to ensure successful data warehouse implementation [8]. The aim of our paper is to describe the development strategy, architecture, and logical design of a data warehouse that should provide a unified and integrated source of data for various analyses of educational process at the University of Ljubljana. The main feature of this data warehouse is that it can be built stepwise as a chain of data marts that use common dimension tables, thus providing a suitable architecture for drill-across applications. 2 Development strategy and data warehouse architecture Even though data warehouses are widespread, there is no common agreement about the best development methodology to use. While Bill Inmon (who is recognized as “the father of data warehousing”) recommends a top-down, enterprise data warehouse approach, Ralph Kimball [6, 7] recommends the bottom-up, data mart approach. Using a top-down approach, a global enterprise data warehouse is built first that serves as a basis for implementation of individual data marts. On the other hand, in a bottom-up approach, individual data marts are developed first and later interconnected through common dimensions into a comprehensive data warehouse. Considering our specific situation (viz. limited budget and the need for tangible results as soon as possible) as well as positive experience reported in the literature [10] we decided to adopt the bottom-up, data mart approach. This approach provides usable data faster, at a lower cost, and with less financial risk. However, in the long term this approach is successful only if all connections between individual data marts are well planned in advance. Therefore, special attention was devoted to logical data warehouse design in order to develop consistent data definitions and define an appropriate structure of the dimension tables that interconnect different data marts. Additionally, since data warehouses (in general) play an important role in understanding value chains (e.g., by connecting trading partners along the demand or supply chain), we designed our data warehouse with the aim of representing the educational process as a value chain consisting of the following steps: Figure 1: The sequence of the data marts used to analyze the educational process as a value chain • enrolment application • first enrolment • examination • next enrolment • examination • ... (several enrolment and examination steps repeat here) • degree In our design, the aforementioned chain is modeled as a sequence of four data marts shown in Figure 1. The first data mart corresponds to enrolment applications, the second one contains enrolment data, the third one corresponds to examination records, and the last one deals with alumni data. 3 Logical design of data marts Data marts are designed using dimensional modeling introduced by Kimball [6]. Each data mart is represented by a star schema data model (also called star join schema or dimensional model) that is made up of a fact table in the center of the star and several dimension tables as the points of the star. The fact table contains measurable facts that are recorded for each transaction (viz. enrolment application, enrolment, examination, and degree taken, respectively), while dimension tables describe entities (viz. students, study programs, teachers, courses, etc.) that are involved into these transactions. Each star schema data model can be implemented individually and later integrated with other star schemas through common dimensions as described in Section 4. Figures 2 and 3 represent two sample star schema data models of our data warehouse. Figure 2 describes the logical design of the enrolment data mart that will be implemented first, while Figure 3 represents data in the examination records data mart. 3.1 Enrolment data mart The fact table We modelled each enrolment as an event at the intersection of seven dimensions: student, time (viz. academic year), department, study program, year of study, study mode, and type of enrolment (see Figure 2). Such a fact table represents a robust set of many-to-many relationships among these seven dimensions; however, it has only one measurable fact: the fee paid for studies.1 This means that applications will perform mostly counts. 1 In Slovenia only part-time students pay fees for their studies. Nevertheless, this table can be queried to answer any number of interesting questions, such as: • How many students enrolled at each department or study program? • What is the structure of enrolled students (considering secondary school, profession, secondary school grade, study mode, and/or type of enrolment)? • Is the number of students (at a particular department or study program) increasing or decreasing? • What is the progress rate of a particular generation of students? The student dimension The student dimension table contains data on students that are enrolled at the university. Each row corresponds to one student and contains his or her personal data. Secondary school, secondary school grade, and profession attributes enable the construction of correlations between students' progress and secondary school, secondary school grades, and secondary school profile of students. On the other hand, zip, county, and region attributes represent a geographic hierarchy that is useful in performing analyses of novice students in connection with the enrolment applications data mart. The time dimension Considering the enrolment data mart alone, an explicit time dimension table is not necessary because it is enough to keep the academic year of each enrolment as a degenerate dimension within the fact table. However, since the time dimension is common to all data marts a more elaborate version of time dimension table is necessary that is represented in Figure 3. The study program dimension At the University of Ljubljana study programs usually consist of several elective modules which can be further divided into submodules (e.g., in the final year of studies). Therefore, the study program dimension in fact describes individual submodules and defines a useful study program-module-submodule hierarchy that enables the generation of reports with different levels of detail (viz. drill-up and drill-down queries). The year of study dimension This is a degenerate dimension since the year of study attribute in the fact table is the only attribute of this dimension. It can be used as a grouping key for pulling together all the students enrolled in the same year of studies. Figure 2: The star schema data model representing the logical design of the enrolment data mart. The department dimension The department dimension table describes each member institution. At present, the University of Ljubljana consists of 26 member institutions (22 faculties, 3 academies, and 1 high school). The study mode dimension The study mode dimension describes every possible manner of studies (e.g. full-time, part-time, etc.). The type of enrolment dimension The type of enrolment dimension describes all possible enrolment types (e.g. first enrolment, repeated enrolment, etc.). 3.2 Examination records data mart The fact table Each record of the fact table in Figure 3 corresponds to one examination (viz. the grain of the fact table) and is uniquely defined by a compound key consisting of keys of all dimension tables. There are two measurable facts that can be taken at the intersection of all the dimensions: grade and sequential number of examination attempt. Dimension tables Logical design of the examination records data mart comprises six dimension tables through which the data in the fact table can be analyzed: the student dimension, the course dimension, the teacher dimension, the department dimension, the time dimension, and the study program dimension. The student dimension is the same as in the enrolment data mart. Secondary school, secondary school grade, and profession attributes can again be used in the construction of correlations between examination results and secondary school, secondary school grades, and secondary school profile of students. Given the fact that the key of the time dimension is simply a date, the time dimension table could be omitted, but we found it useful because additional attributes (such as semester, academic year, and day number overall) enable to slice data by semesters and academic years, as well as to perform simple arithmetic between days across year and month boundaries (e.g., to compute time elapsed from first enrolment till graduation). Figure 3: The star schema data model representing the logical design of the examination records data mart. The course dimension describes every course, and the teacher dimension describes every teacher. The department dimension and the study program dimension are the same as in the enrolment data mart. Using the examination records data mart various analyses of examination results are possible, e.g.: • What is the average grade of a specified subset of students?2 • How many students passed an exam in a given time period at each department or study program? • What is the average grade and number of examination attempts at given course and/or teacher? • How do average grades compare among different study programs and/or faculties? etc. 4 Connecting data marts through common dimensions and drill- across applications Some dimension tables (e.g., the student dimension, the study program dimension, the time dimension, etc.) are common to all data marts in our data warehouse. These tables can act as “glue” that connects the data marts together and allows meaningful queries to be made across different subject areas (viz. enrolment applications, enrolment records, examination records, and degree records). Using data warehousing terminology, these queries are often called drill-across applications. In order to support drill-across applications, all constraints on dimension attributes must evaluate to exactly the same set of dimensional entities from one data mart in the value chain to the next data mart in the value chain. For example, a constraint on student dimension at any point in the value chain must mean exactly the same subset of students at all points in the chain. The easiest way to achieve this requirement is to 2 The subset of students can be specified using attributes in the student dimension table as a source of constraints. Type of Course enrolment Figure 4: The chain of data marts connected through common dimension tables that are physically implemented only once. Enrolment facts Examination facts Studymode physically implement all common tables only once as shown in Figure 4.3 Therefore, it is extremely important to plan the structure of common dimension tables in advance not only to satisfy the needs of individual data marts, but also to provide the necessary connections for drill-across applications. Given the fact that the University of Ljubljana is extremely decentralized and each member institution maintains its own data about students, teachers, and courses, substantial effort was necessary to integrate and cleanse these data in order to build common dimension tables. Beside dimension tables that have already been described in previous section, there are two additional dimensions shown in Figure 4: • The candidate dimension table corresponds to all candidates for enrolment. The accepted candidates become part of the student dimension after first enrolment. • The secondary school dimension describes every secondary school in Slovenia. There are some special situations when the aforementioned requirement can be achieved without implementing the common dimension only once, e.g. in the case of dimensions with reduced detail and derived dimensions that support aggregates. The interested reader can find more information in [6, pp. 84-85]. 5 A sample drill-across report Using our data warehouse we can imagine that students "move" sequentially through the value chain, and a drill-across report can show a snapshot or a slice of the entire value chain that shows how far a subset of students has moved from the enrolment application to their final degree. For example, using the enrolment applications data mart a subset of candidates that applied for enrolment in a given academic year can be defined. In combination with the enrolment data mart, only those candidates that actually enrolled (viz. became students) can be isolated. For this subset of students, their examination records can be analyzed using the examination records data mart and the number of students who finished their studies can be determined from degree records data mart. Table 1 represents a sample report obtained by drilling across. Suppose we have several generations of candidates that applied for enrolment at a given department in five consecutive years and we want to track their progress towards the graduation. Given the fact that the department and time dimensions are exactly the same for all data marts, only those data that belong to the department and academic years in question are processed in each data mart. Similarly, the common student dimension assures that the same subset of students is taken into account in the whole value chain. Using the enrolment applications data mart the exact number of applicants and the number of approved Department XXX Academic Average Exams Length of Year Applicants Approved Enrolled Grade Passed Graduated Studies 1992/93 237 162 150 8.16 29.3 77 6.15 1993/94 199 165 143 8.02 27.1 71 6.32 1994/95 182 158 144 7.57 24.3 64 6.87 1995/96 176 151 145 7.03 19.5 56 7.02 1996/97 __________ 183____________ 154___________ 150___________ 7.43 __________ 22.4 __________ 57 _________________ 6.21 ______________ Table 1: A sample drill-across report _________________________________Department XXX ___________ __________ __________ _______________________________ Academic Study Average Exams Length of Year Program Applicants Approved Enrolled Grade Passed Graduated Studies 1992/93 AAA 125 101 95 8.21 30.1 52 6.08 BBB 112 61 55 8.07 27.9 25 6.32 Total 1992/93 237 162 150 8.16 29.3 77 6.15 1993/94 AAA 121 110 96 8.12 28.1 52 6.12 BBB 78 61 47 7.81 25.0 19 6.89 Total 1993/94 199 165 143 8.02 27.1 71 6.32 1994/95 AAA 119 108 94 7.78 27.1 47 6.65 BBB 63 50 50 7.18 19.0 17 7.47 Total 1994/95 182 158 144 7.57 24.3 64 6.87 etc. Table 2: Report refinement by drilling down applications can be determined. The enrolment data mart enables the computation of the actual number of enrolled students, while the examination records data mart provides the average grade4 and average number of exams passed for these students. Finally, the degree records data mart is used to compute the number of graduates in each generation as well as the average length of their studies (in years). By drilling down we can still refine our query in order to obtain more detailed information about each generation of students. By simply adding the study program name as a new row header the same analysis can be obtained for each study program separately (see Table 2). We can further subdivide the subset of students in each generation by using other attributes from the study program hierarchy (viz. elective module name and submodule name) as well as by choosing row headers 4 In Slovenia the following grades are used: 1 to 5 – insufficient; 6 – sufficient; 7 – good; 8, 9 – very good; 10 – excellent. from dimensional attributes of other dimensions (e.g., secondary school, secondary school grade, secondary school profession). 6 Conclusions We described the design of a data warehouse that can be implemented gradually as a chain of data marts connected through common dimension tables. Each data mart was represented using the star schema data model and a special attention was devoted to the definition of common dimensions in order to enable drill-across applications. The proposed design describes the educational process as a value chain and makes it possible to analyze how far a subset of students has moved from the enrolment application to their final degree. References [1] Bajec, M., Rupnik, R., Krisper, M. Using Data Warehouses in University Information Systems, in K. Sarlin (ed.) EUNIS 99 – Information Technology Shaping European Universities, pp. 115-121. [2] Desnos, J-F. A Data Warehouse for French Universities, Informatica, Vol. 25, No. 2, July 2001, pp. 177-181. [3] Eckerson, W.W. Evolution of Data Warehousing: The Trend toward Analytical Applications, Boston, MA: The Patricia Seybold Group (April 28, 1999), pp. 1-8. [4] Flory, A. et al. Design and Implementation of a Data Warehouse for Research Administration Universities, in J. Knop and P. Schirmbacher (eds.) EUNIS 2001 – The Changing Universities, The Role of Technology, Berlin, March 2001, pp. 164­ 167. [5] Inmon, B. Building the Data Warehouse, QED Publishing Group, 1992. [6] Kimball, R. The Data Warehouse Toolkit, John Wiley & Sons, 1996. [7] Kimball, R. et al. The Data Warehouse Lifecycle Toolkit, John Wiley & Sons, 1998. [8] Mahnic, V. Rozanc, I. Data Quality: A Prerequisite for Successful Data Warehouse Implementation, Informatica, Vol. 25, No. 2, July 2001, pp. 183­ 188. [9] Stevenson, D. Data Warehouse and Executive Information Systems – Ignoring the Hype, in J-F. Desnos and Y. Epelboin (eds.) European Cooperation in Higher Education Information Systems, Grenoble, France, September 1997, pp. 202-207. [10] Watson, J.H. et al. Sherwin-Williams' Data Mart Strategy: Creating Intelligence across the Supply Chain, Communications of the Association for Information Systems, Vol. 5, Article 9, May 2001. [11] Watson, J.H. Recent Developments in Data Warehousing, Communications of the Association for Information Systems, Vol. 8, 2001, pp. 1-25. [12] Watson, J.H. et al. The Benefits of Data Warehousing: Why Some Organizations Realize Exceptional Payoffs, Information & Management, Vol. 39, 2002, pp. 491-502.