Data quality: A prerequisite for successful data warehouse implementation

Viljan Mahnič and Igor Rožanc
University of Ljubljana, Faculty of Computer and Information Science
Tržaška 25, SI-1000 Ljubljana, Slovenia
E-mail: Viljan.Mahnic@fri.uni-lj.si, Igor.Rozanc@fri.uni-lj.si

Keywords: data warehouse, information quality, total quality management, data quality assessment

Received: January 14, 2001

Building a data warehouse for a large decentralized university such as the University of Ljubljana is an attractive challenge, but also a risky and demanding task. Experience has shown that projects attempting to integrate data are especially vulnerable to data quality issues. Therefore, before embarking on a data warehouse initiative a thorough quality assessment of the source data is necessary. We describe how assessment criteria based on the Total Quality data Management Methodology were adapted to our specific needs and used to determine the quality of student records data at two member institutions, viz. the Faculty of Computer and Information Science and the Faculty of Electrical Engineering. The most important results of the assessment are described and proposals are given for further activities. The assessment has shown that the student records data at the Faculty of Computer and Information Science and the Faculty of Electrical Engineering are good enough to be used as a source for the global warehouse at the university level after some data cleansing takes place. Additionally, special attention must be devoted to the integration of data that are replicated at many individual departments (viz. employees, subjects taught, and students). Therefore, we propose that a unique coding scheme for all employees, students, and subjects taught be defined in the first step of the data warehouse design, and that an ongoing data quality management process be established, clearly defining the roles and responsibilities of all personnel involved.

1 Introduction

The University of Ljubljana is the largest university in Slovenia. It consists of 26 member institutions (20 faculties, 3 academies, and 3 colleges) and has more than 40,000 students. In the past, the member institutions had substantial autonomy regarding the usage of information technologies, which led to uncoordinated development of their information systems. Different applications were developed for the same purpose, and due to this heterogeneity it is very difficult or even impossible to create reports that require cross-referencing data from different institutions or application areas [1].

In such a situation, building a data warehouse at the university level seems to be an appropriate solution. Therefore, a pilot project was started with the aim of defining a data model for a global database which will be fed by data from the member institutions on a regular basis and which will serve as a basis for analytical processing at the university level. One of the main tasks within this project is to define the granularity of data in the data warehouse and the different levels of detail that will best support the decision making processes.

However, experience from other organizations has shown that projects attempting to integrate data are especially vulnerable to data quality issues [2]. A recent study by the Standish Group states that 83 percent of data migration projects overrun their budget (or fail) primarily as a result of misunderstandings about the source data and data definitions.
Similar surveys conducted by the Gartner Group point to data quality as a leading reason for overruns and failed projects [3]. In order to avoid such a pitfall, a thorough assessment of the quality of the data that will be used as input to the global data warehouse is necessary. Since we decided that our data warehouse will be populated gradually, starting with student records data, the quality of these data was analyzed first.

The aim of this paper is to describe in detail the assessment methodology and the results obtained at two typical member institutions, viz. the Faculty of Computer and Information Science (FCIS) and the Faculty of Electrical Engineering (FEE). The assessment criteria were defined according to English's Total Quality data Management methodology [4] and Forino's recommendations [3]. A description of these criteria is given in Section 2, while the results of the assessment are presented in Section 3. Section 4 describes the proposals for further activities, and Section 5 summarizes the most important conclusions.

2 Data Quality Assessment Methodology

Data quality assurance is a complex problem that requires a systematic approach. English [4] proposes a comprehensive Total Quality data Management Methodology (TQdM), which consists of five processes for measuring and improving information quality, and an umbrella process for bringing about the cultural and environmental changes needed to sustain information quality improvement as a management tool and a habit:

Process 1: Assess Data Definition & Information Architecture Quality
Process 2: Assess Information Quality
Process 3: Measure Nonquality Information Costs
Process 4: Reengineer and Cleanse Data
Process 5: Improve Information Process Quality
Process 6: Establish the Information Quality Environment

Each process is further divided into steps that must be followed in order to achieve the desired data quality. Organizations embarking on data warehouse initiatives that do not yet have an information quality function must conduct many of these steps, but may do so in a different sequence, based on their specific needs.

Considering our specific needs, we concentrated on Process 2, which addresses two aspects of information quality: inherent information quality and pragmatic information quality. Inherent information quality is the correctness or accuracy of data, while pragmatic information quality is the value that accurate data has in supporting the work of the enterprise. In this paper the results of the inherent information quality assessment are described.

In order to determine the quality of data, a field-by-field assessment is required. However, simply having data is not enough; the context in which the data exist must also be known. In other words, a clear data definition, or so-called meta data, must be provided [3]. Generally speaking, one can find meta data in data models, data dictionaries, repositories, specifications, etc. If current meta data do not exist, then a subject matter expert is needed, and meta data become a much-desired by-product of the data quality assessment. The extent of the assessment depends to a great degree on the availability of meta data. According to [3], assessments usually focus on one or more of the following types of quality criteria:

1. Data type integrity
2. Business rule integrity
3. Name and address integrity

If the assessment team knows nothing more than field names, types, and sizes, then the focus is on testing each field's integrity based on its type (numeric, alphanumeric, date, etc.). If additional characteristics of the field are provided (domain, relationships with other fields, etc.), then business rule integrity testing is also performed. Finally, if name and address data are critical (particularly if they will be consolidated with other data), then name and address testing should be performed.

On the other hand, English [4] defines the following inherent information quality characteristics and measures:

1. Definition conformance
2. Completeness (of values)
3. Validity, or business rule conformance
4. Accuracy to surrogate source
5. Accuracy (to reality)
6. Precision
7. Nonduplication (of occurrences)
8. Equivalence of redundant or distributed data
9. Concurrency of redundant or distributed data
10. Accessibility

Considering the aforementioned quality characteristics and our specific needs, we decided to measure the quality of our data using the following criteria:

1. Completeness of values: Users were encouraged to prioritize a subset of source fields that must have a nonnull value. For each field the percentage of records with missing values was computed.

2. Validity, or business rule conformance: For a chosen subset of the most important fields we measured the degree of conformance of data values to their domains and business rules.
   a) All 1:N relationships were examined for the existence of foreign keys in the master tables.
   b) Since time represents an important dimension in a dimensional data warehouse, a range check was performed on all date fields in order to find possible out-of-range values.
   c) A range check was performed on all other fields from the highest priority subset (e.g. academic year, year of study, grades, etc.).
   d) Special cross-checks were defined for fields having relationships with other fields that further constrain the allowable domain set, e.g.:
      - when a student applies for an exam for the first time (i.e. NO_OF_ATTEMPTS = 1), the date of the previous examination must be blank, and vice versa;
      - the date of the previous examination must be earlier than the date of the next examination;
      - in each degree record the difference between the degree date and the thesis issue date must be positive and less than six months.

3. Nonduplication of occurrences and equivalence of redundant and distributed data: In our case two different tests were performed:
   a) All files in the student records database were checked for possible duplications of primary keys.
   b) The student and employee files at both faculties were checked for the existence of records that are duplicate representations of the same student or employee, respectively.

3 Assessment Results

3.1 Completeness of values

The most important fields in 33 files were checked for nonnull values, and it was found that only a few values were missing. The results are summarized in Table 1, while Tables 2 and 3 show the fields and files with the most missing values for each faculty, respectively. The greatest problem is represented by the missing student_id values in the alumni records. This is a consequence of the fact that 15 years ago (when the alumni records were started) some students were not assigned a student identification number.

Faculty | No. of files | No. of checked attributes | No. of checked records | No. of records with missing values | % of erroneous records
FCIS    | 33           | 69                        | 201617                 | 43                                 | 0.021 %
FEE     | 33           | 69                        | 405436                 | 996                                | 0.246 %
TOTAL   | 66           | 138                       | 607053                 | 1039                               | 0.171 %

Table 1: Completeness of values (Summary Data)

Field name | Field description   | File name | Total no. of records | No. of records with missing values | % of erroneous records
VPIS_ST    | student id          | DIPLOMA   | 961                  | 18                                 | 1.873 %
IME_D      | employee first name | DELAVEC   | 391                  | 4                                  | 1.023 %

Table 2: Fields with most missing values (Faculty of Computer and Information Science)

Field name | Field description   | File name | Total no. of records | No. of records with missing values | % of erroneous records
VPIS_ST    | student id          | DIPLOMA   | 2926                 | 185                                | 6.323 %
IME_D      | employee first name | DELAVEC   | 399                  | 4                                  | 1.003 %
DELAVEC    | employee id         | VAJE      | 152071               | 737                                | 0.485 %

Table 3: Fields with most missing values (Faculty of Electrical Engineering)
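The completeness figures in Tables 1 to 3 were obtained by a field-by-field count of missing values. A minimal sketch of how such a check could be scripted is shown below, assuming the Clipper (.DBF) tables have been exported to CSV; the file name, the field list, and the use of pandas are illustrative assumptions rather than a description of the actual assessment programs.

```python
import pandas as pd

def completeness_report(csv_path, required_fields):
    """Count missing (null or blank) values for each prioritized field of one table."""
    df = pd.read_csv(csv_path, dtype=str)          # keep all codes as strings
    total = len(df)
    rows = []
    for field in required_fields:
        missing = df[field].isna() | (df[field].str.strip() == "")
        rows.append({
            "field": field,
            "records_checked": total,
            "records_missing": int(missing.sum()),
            "pct_missing": round(100.0 * missing.sum() / max(total, 1), 3),
        })
    return pd.DataFrame(rows)

# Illustrative call: the CSV export and the field names are assumptions.
print(completeness_report("DIPLOMA.csv", ["VPIS_ST", "IME_D"]))
```

A check along these lines reports, for each prioritized field, the same counts and percentages as Tables 1 to 3.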
3.2 Validity, or business rule conformance

Existence of foreign keys in master tables: The examination of all 1:N relationships revealed that in about half a percent of the records the foreign keys have no counterpart values in the corresponding master tables (see Table 4). However, this average is misleading, since the majority of errors appear within a small number of relationships. The following relationships turned out to be the most problematic at both faculties (see Tables 5 and 6):

1. the relationship between the entities DIPLOMA (student's degree thesis) and NAČIN (a code table of possible types of study, e.g. full-time or part-time);
2. multiple relationships between the entities STUD_F (student's personal data) and OBČINA (a code table of territorial units in Slovenia) representing the territorial unit of the student's residence, place of birth, secondary school finished, etc.

The first of the aforementioned problems is simply a consequence of the fact that (within the alumni records database) the faculties only started collecting the type of study data in 1994, so this datum is missing in older records. The second problem is much harder and was caused by significant changes in the territorial organization at the local level in Slovenia after independence, which required several consequent modifications of the corresponding code table. Although each year's enrolment data corresponded to the then valid version of the code table, there are many inconsistencies when the data are viewed in time perspective.

Faculty | No. of files | No. of checked relationships | No. of checked records | No. of relationships with errors | No. of records with errors | % of relationships with errors | % of records with errors
FCIS    | 49           | 207                          | 1361758                | 38                               | 6360                       | 18.357 %                       | 0.467 %
FEE     | 49           | 207                          | 2702005                | 51                               | 16605                      | 24.638 %                       | 0.615 %
TOTAL   | 98           | 414                          | 4063763                | 89                               | 22965                      | 21.498 %                       | 0.565 %

Table 4: Existence of foreign keys in master tables (Summary Data)

Entity (File) | No. of records | Foreign key field | Description                                             | No. of nonexistent values | % of errors
DIPLOMA       | 961            | NAČIN             | type of study code                                      | 546                       | 56.816 %
STUD_F        | 4062           | OBCINA_SS         | territorial unit code of student's secondary school     | 1340                      | 32.989 %
STUD_F        | 4062           | OBCINA_R          | territorial unit code of student's place of birth       | 1331                      | 32.767 %
STUD_F        | 4062           | OBCINA_S          | territorial unit code of student's permanent residence  | 1314                      | 32.349 %
STUD_F        | 4062           | OBCINA_Z          | territorial unit code of student's temporary residence  | 603                       | 14.845 %

Table 5: Problematic relationships (Faculty of Computer and Information Science)

Entity (File) | No. of records | Foreign key field | Description                                             | No. of nonexistent values | % of errors
DIPLOMA       | 2926           | NAČIN             | type of study code                                      | 1461                      | 49.932 %
STUD_F        | 8164           | OBCINA_SS         | territorial unit code of student's secondary school     | 3565                      | 42.908 %
STUD_F        | 8164           | OBCINA_R          | territorial unit code of student's place of birth       | 3503                      | 42.896 %
STUD_F        | 8164           | OBCINA_S          | territorial unit code of student's permanent residence  | 3502                      | 42.896 %
STUD_F        | 8164           | OBCINA_Z          | territorial unit code of student's temporary residence  | 1422                      | 17.418 %

Table 6: Problematic relationships (Faculty of Electrical Engineering)
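The foreign key checks summarized in Tables 4 to 6 amount to an anti-join between each detail file and its master (code) table. The following sketch illustrates one such check; the file and column names (e.g. OBCINA.csv and its key column) are assumptions, and a complete run would iterate over all 207 relationships.

```python
import pandas as pd

def orphaned_foreign_keys(detail_csv, fk_column, master_csv, pk_column):
    """Return the detail records whose foreign key value has no counterpart
    in the master (code) table, together with the error percentage."""
    detail = pd.read_csv(detail_csv, dtype=str)
    master = pd.read_csv(master_csv, dtype=str)
    valid_keys = set(master[pk_column].dropna())
    orphans = detail[detail[fk_column].notna() & ~detail[fk_column].isin(valid_keys)]
    pct = 100.0 * len(orphans) / max(len(detail), 1)
    return orphans, pct

# Illustrative call: the territorial unit code of permanent residence in STUD_F
# should exist in the OBCINA code table (file and column names are assumptions).
orphans, pct = orphaned_foreign_keys("STUD_F.csv", "OBCINA_S", "OBCINA.csv", "OBCINA")
print(f"{len(orphans)} records with nonexistent foreign keys ({pct:.3f} %)")
```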
Range checking of date fields did not reveal any serious problems. In the worst case the rate of out-of-range field values reached 0.149 % at FCIS and 0.493 % at FEE.

Range checking of other fields from the highest priority subset also yielded quite good results. The field IME_D (viz. employee's first name) was ranked the worst at both faculties, containing 1.023 % erroneous values at FCIS and 1.003 % erroneous values at FEE.

Special cross-checks pointed out that the business rule requiring that each student complete his/her degree project within six months is sometimes violated. Namely, the difference between the degree date and the thesis issue date was more than six months in 0.976 % of the cases at FCIS and in 1.805 % of the cases at FEE. Error rates reported by the other cross-checking tests were less than 0.5 %.

3.3 Nonduplication of occurrences and equivalence of redundant and distributed data

Nonduplication of primary keys: The student records information system at FCIS and FEE [5] (as well as at other member institutions of the University of Ljubljana) is implemented in Clipper, which does not automatically enforce the uniqueness of primary keys. In spite of the fact that all programs have been carefully written in order to avoid duplicates, some can be introduced through the manual maintenance of data, especially code tables. Therefore, all files in the student records database were checked for possible duplications of primary keys. An excerpt of the assessment results showing only the files with the most duplicates is presented in Tables 7 through 9. Table 7 shows the number of duplicated primary keys in code tables that are maintained centrally by the University computing center for all member institutions, while Tables 8 and 9 refer to duplicates at FCIS and FEE, respectively. A relatively high percentage of errors in some code tables indicates that the maintenance of code tables at the University computing center should be improved.

File         | No. of records | No. of duplicated primary keys | % of errors
VSI.DBF      | 1314           | 68                             | 5.175 %
SS_POKLI.DBF | 1566           | 42                             | 2.682 %
CENTRI.DBF   | 51             | 1                              | 1.960 %
ZAVOD.DBF    | 100            | 1                              | 1.000 %

Table 7: Nonduplication of primary keys (code tables maintained by the University computing center)

File        | No. of records | No. of duplicated primary keys | % of errors
IZJEME.DBF  | 56             | 1                              | 1.786 %
TEMA.DBF    | 978            | 3                              | 0.307 %
DELAVEC.DBF | 391            | 1                              | 0.226 %
DVIG.DBF    | 922            | 1                              | 0.108 %

Table 8: Nonduplication of primary keys (FCIS)

File        | No. of records | No. of duplicated primary keys | % of errors
SPP.DBF     | 270            | 11                             | 4.074 %
PRED_PR.DBF | 746            | 8                              | 1.072 %
PP.DBF      | 2557           | 12                             | 0.469 %
DVIG.DBF    | 2825           | 9                              | 0.319 %

Table 9: Nonduplication of primary keys (FEE)
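Since Clipper does not enforce primary key uniqueness, duplicates such as those in Tables 7 to 9 have to be detected by an external check. A minimal sketch, again assuming CSV exports and with an illustrative file name and key column:

```python
import pandas as pd

def duplicated_primary_keys(csv_path, key_columns):
    """Report primary key values that occur more than once in a table."""
    df = pd.read_csv(csv_path, dtype=str)
    counts = df.groupby(key_columns, dropna=False).size()
    duplicates = counts[counts > 1]                 # key values occurring more than once
    pct = 100.0 * int(duplicates.sum()) / max(len(df), 1)
    return duplicates, pct

# Illustrative call: the exported file name and the key column are assumptions.
duplicates, pct = duplicated_primary_keys("VSI.csv", ["SIFRA"])
print(duplicates)
print(f"records sharing a duplicated key: {pct:.3f} %")
```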
Equivalence of redundant and distributed data: Given the fact that some teachers teach at both faculties, and that some students study at both faculties (e.g. a student can obtain his/her B.Sc. degree at FEE and enrol for graduate study at FCIS, or vice versa), some data are replicated across the faculties. In order to state the extent of such replication, two measures were introduced: the number of replicated entity occurrences (viz. the same teacher or student in both databases) with the same primary key, and the number of replicated entity occurrences with different primary keys.

Entity (File) | Description | Replications with same primary key | Replications with different primary keys
DELAVEC       | Employees   | 388                                 | 0
ŠTUDENT       | Students    | 4                                   | 121

Table 10: Equivalence of redundant occurrences (FCIS and FEE)

The assessment revealed that the employee files at both faculties are consistent: all replicated employees have the same primary key. This is not the case with the student files. Due to the decentralized organization of the University of Ljubljana, each faculty assigns its students a different identification number regardless of the fact that the student has already been enrolled at another faculty. Unfortunately, this kind of inconsistency may be a source of major problems when integrating data into a global warehouse.

4 Proposed Further Actions

On the basis of the assessment results we propose two kinds of further actions: cleansing of the source data and an appropriate design of the global data warehouse at the university level.

Some erroneous source data can be cleansed automatically (e.g. the missing type of study code in the file DIPLOMA), while other data require a manual or a combined manual and automatic approach (e.g. removal of duplicated primary keys, re-establishment of relationships using territorial unit codes). Some errors (e.g. out-of-range values) can be prevented by incorporating appropriate controls in the program code (see the sketch at the end of this section).

Special attention must be devoted to the maintenance of code tables that are common to the whole university. The relatively high percentage of duplicate codes in the code tables maintained by the University computing center up to now indicates that the maintenance of these code tables must improve. Additional problems could arise during the integration of those data and code tables that are at present maintained by individual departments (viz. employees, subjects taught, and students). Although our assessment did not reveal serious problems within each department, many duplications and code conflicts may occur during the integration. Therefore, we propose that a unique coding scheme for all employees, students, and subjects taught be defined in the first step of the data warehouse design.

The data warehouse design must be based on the principles of the TQdM methodology. Data standards must be defined, and an assessment of data definition and information architecture quality must take place before the programming and population of the data warehouse begins. An ongoing data quality management process must be established, and the roles and responsibilities of a data quality administrator, subject area champions, a data oversight committee, and data owners clearly defined [6].
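As an illustration of the entry-time controls proposed above, the six-month degree rule from Section 2 could be checked along the following lines. The sketch is hypothetical; in particular, the exact day limit used for "six months" is an assumption.

```python
from datetime import date

SIX_MONTHS_DAYS = 183   # approximate six-month limit; the exact value is an assumption

def degree_dates_valid(thesis_issue_date: date, degree_date: date) -> bool:
    """Business rule from Section 2: the difference between the degree date and
    the thesis issue date must be positive and less than about six months."""
    delta_days = (degree_date - thesis_issue_date).days
    return 0 < delta_days <= SIX_MONTHS_DAYS

# Illustrative usage: such a control could reject an invalid record at data entry time.
print(degree_dates_valid(date(2000, 3, 1), date(2000, 7, 15)))   # True: within six months
print(degree_dates_valid(date(2000, 3, 1), date(2001, 1, 10)))   # False: rule violated
```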
5 Conclusions

Our paper was intended to increase awareness of the importance of data quality not only when building a data warehouse but also in operational environments that support transactional processing. An assessment methodology to empirically determine data quality was described, and the results of the assessment were presented.

Considering the assessment results, we estimate that the source data at the Faculty of Computer and Information Science and the Faculty of Electrical Engineering are good enough to be used as a source for the global warehouse at the university level after some data cleansing takes place. In the first place, the missing student identification numbers must be provided in the alumni records, and the broken relationships using territorial unit codes must be re-established in the students' personal data.

During the design of the global data warehouse, special attention must be devoted to the integration of those data that may be replicated at many individual departments (viz. employees, subjects taught, and students). Since each department has its own coding policy, many duplications and code conflicts may occur during the integration. Therefore, we propose that a unique coding scheme for all employees, students, and subjects taught be defined in the first step of the data warehouse design. Additionally, an ongoing data quality management process must be established, and the roles and responsibilities of all personnel involved should be clearly defined.

6 References

[1] Mahnič, V., Towards the re-integration of the University of Ljubljana information system, in J.-F. Desnos & Y. Epelboin (eds.), European Cooperation in Higher Education Information Systems EUNIS 97, Grenoble, September 1997, pp. 250-258.
[2] Celko, J., McDonald, J., Don't Warehouse Dirty Data, Datamation, October 15, 1995, pp. 42-53.
[3] Forino, R., The Data Quality Assessment, Part 1, DM Review Online, August 2000, http://www.dmreview.com
[4] English, L.P., Improving Data Warehouse and Business Information Quality, John Wiley & Sons, Inc., 1999, ISBN 0-471-25383-9.
[5] Mahnič, V., Vilfan, B., Design of the Student Records Information System at the University of Ljubljana, in J. Knop (ed.), Trends in Academic Information Systems in Europe, Proceedings of the EUNIS'95 Congress, Düsseldorf, Germany, November 1995, pp. 207-220.
[6] Kachur, R., Data Quality Assessment for Data Warehouse Design, DM Review Online, April 2000, http://www.dmreview.com