Data quality: A prerequisite for successful data warehouse implementation

Viljan Mahnič and Igor Rožanc
University of Ljubljana, Faculty of Computer and Information Science
Tržaška 25, SI-1000 Ljubljana, Slovenia
E-mail: Viljan.Mahnic@fri.uni-lj.si, Igor.Rozanc@fri.uni-lj.si

Keywords: data warehouse, information quality, total quality management, data quality assessment

Received: January 14, 2001

Building a data warehouse for a large decentralized university such as the University of Ljubljana is an attractive challenge, but also a risky and demanding task. Experience has shown that projects attempting to integrate data are especially vulnerable to data quality issues. Therefore, before embarking on a data warehouse initiative a thorough quality assessment of the source data is necessary. We describe how assessment criteria based on the Total Quality data Management Methodology were adapted to our specific needs and used to determine the quality of student records data at two member institutions, viz. the Faculty of Computer and Information Science and the Faculty of Electrical Engineering. The most important results of the assessment are described and proposals are given for further activities. The assessment has shown that the student records data at the Faculty of Computer and Information Science and the Faculty of Electrical Engineering are good enough to be used as a source for the global warehouse at the university level after some data cleansing takes place. Additionally, special attention must be devoted to the integration of data that are replicated at many individual departments (viz. employees, subjects taught, and students). Therefore, we propose that a unique coding scheme for all employees, students, and subjects taught be defined in the first step of the data warehouse design, and that an ongoing data quality management process be established, clearly defining the roles and responsibilities of all personnel involved.

1 Introduction

The University of Ljubljana is the largest university in Slovenia. It consists of 26 member institutions (20 faculties, 3 academies, and 3 colleges) and has more than 40,000 students. In the past, the member institutions had substantial autonomy regarding the usage of information technologies, which led to uncoordinated development of their information systems. Different applications were developed for the same purpose, and due to this heterogeneity it is very difficult or even impossible to create reports that require cross-referencing data from different institutions or application areas [1].

In such a situation, building a data warehouse at the university level seems to be an appropriate solution. Therefore, a pilot project was started with the aim of defining a data model for a global database which will be fed by data from the member institutions on a regular basis and which will serve as a basis for analytical processing at the university level. One of the main tasks within this project is to define the granularity of data in the data warehouse and the different levels of detail that will best support the decision making processes.

However, experience from other organizations has shown that projects attempting to integrate data are especially vulnerable to data quality issues [2]. A recent study by the Standish Group states that 83 percent of data migration projects overrun their budget (or fail) primarily as a result of misunderstandings about the source data and data definitions.
Similar surveys conducted by the Gartner Group point to data quality as a leading reason for overruns and failed projects [3]. In order to avoid such a pitfall, a thorough assessment of the quality of the data that will be used as input to the global data warehouse is necessary. Since we decided that our data warehouse will be populated gradually, starting with student records data, the quality of these data was analyzed first.

The aim of this paper is to describe in detail the assessment methodology and the results obtained at two typical member institutions, viz. the Faculty of Computer and Information Science (FCIS) and the Faculty of Electrical Engineering (FEE). The assessment criteria were defined according to English's Total Quality data Management methodology [4] and Forino's recommendations [3]. A description of these criteria is given in Section 2, while the results of the assessment are presented in Section 3. Section 4 describes the proposals for further activities, and Section 5 summarizes the most important conclusions.

2 Data Quality Assessment Methodology

Data quality assurance is a complex problem that requires a systematic approach. English [4] proposes a comprehensive Total Quality data Management Methodology (TQdM), which consists of five processes for measuring and improving information quality, and an umbrella process for bringing about the cultural and environmental changes needed to sustain information quality improvement as a management tool and a habit:

Process 1: Assess Data Definition & Information Architecture Quality
Process 2: Assess Information Quality
Process 3: Measure Nonquality Information Costs
Process 4: Reengineer and Cleanse Data
Process 5: Improve Information Process Quality
Process 6: Establish the Information Quality Environment

Each process is further divided into steps that must be followed in order to achieve the desired data quality. Organizations embarking on data warehouse initiatives that do not yet have an information quality function must conduct many of these steps, but may do so in a different sequence, based on their specific needs.

Considering our specific needs, we concentrated on Process 2, which addresses two aspects of information quality: inherent information quality and pragmatic information quality. Inherent information quality is the correctness or accuracy of data, while pragmatic information quality is the value that accurate data has in supporting the work of the enterprise. In this paper the results of the inherent information quality assessment are described.

In order to determine the quality of data, a field-by-field assessment is required. However, simply having data is not enough; the context in which the data exist must also be known. In other words, a clear data definition, or so-called meta data, must be provided [3]. Generally speaking, one can find meta data in data models, data dictionaries, repositories, specifications, etc. If current meta data do not exist, then a subject matter expert is needed, and meta data become a much-desired by-product of the data quality assessment. The extent of the assessment depends to a great degree on the availability of meta data. According to [3], assessments usually focus on one or more of the following types of quality criteria:

1. Data type integrity
2. Business rule integrity
3. Name and address integrity

If the assessment team knows nothing more than field names, types, and sizes, then the focus is on testing each field's integrity based on its type (numeric, alphanumeric, date, etc.). If additional characteristics of the field are provided (domain, relationships with other fields, etc.), then business rule integrity testing is also performed. Finally, if name and address data are critical (particularly if they will be consolidated with other data), then name and address testing should be performed.

On the other hand, English [4] defines the following inherent information quality characteristics and measures:

1. Definition conformance
2. Completeness (of values)
3. Validity, or business rule conformance
4. Accuracy to surrogate source
5. Accuracy (to reality)
6. Precision
7. Nonduplication (of occurrences)
8. Equivalence of redundant or distributed data
9. Concurrency of redundant or distributed data
10. Accessibility

Considering the aforementioned quality characteristics and our specific needs, we decided to measure the quality of our data using the following criteria:

1. Completeness of values: Users were encouraged to prioritize a subset of source fields that must have a nonnull value. For each field the percentage of records with missing values was computed.

2. Validity, or business rule conformance: For a chosen subset of the most important fields we measured the degree of conformance of data values to their domains and business rules.
   a) All 1:N relationships were examined for the existence of foreign keys in the master tables.
   b) Since time represents an important dimension in a dimensional data warehouse, a range check was performed on all date fields in order to find possible out-of-range values.
   c) A range check was performed on all other fields from the highest priority subset (e.g. academic year, year of study, grades, etc.).
   d) Special cross-checks were defined for fields having relationships with other fields that further constrain the allowable domain set, e.g.:
      - when a student applies for an exam for the first time (i.e. NO_OF_ATTEMPTS = 1), the date of the previous examination must be blank, and vice versa;
      - the date of the previous examination must be earlier than the date of the next examination;
      - in each degree record the difference between the degree date and the thesis issue date must be positive and less than six months.

3. Nonduplication of occurrences and equivalence of redundant and distributed data: In our case two different tests were performed:
   a) All files in the student records database were checked for possible duplications of primary keys.
   b) The student and employee files at both faculties were checked for the existence of records that are duplicate representations of the same student or employee, respectively.

3 Assessment Results

3.1 Completeness of values

The most important fields in 33 files were checked for nonnull values, and it was found that only a few values were missing. The results are summarized in Table 1, while Tables 2 and 3 show the fields and files with the most missing values for each faculty, respectively. The greatest problem is represented by the missing student_id values in the alumni records. This is a consequence of the fact that 15 years ago (when the alumni records were started) some students were not assigned a student identification number.

Faculty | No. of files | No. of checked attributes | No. of checked records | No. of records with missing values | % of erroneous records
FCIS    | 33           | 69                        | 201617                 | 43                                 | 0.021 %
FEE     | 33           | 69                        | 405436                 | 996                                | 0.246 %
TOTAL   | 66           | 138                       | 607053                 | 1039                               | 0.171 %

Table 1: Completeness of values (Summary Data)

Field name | Field description   | File name | Total no. of records | No. of records with missing values | % of erroneous records
VPIS_ST    | student id          | DIPLOMA   | 961                  | 18                                 | 1.873 %
IME_D      | employee first name | DELAVEC   | 391                  | 4                                  | 1.023 %

Table 2: Fields with most missing values (Faculty of Computer and Information Science)

Field name | Field description   | File name | Total no. of records | No. of records with missing values | % of erroneous records
VPIS_ST    | student id          | DIPLOMA   | 2926                 | 185                                | 6.323 %
IME_D      | employee first name | DELAVEC   | 399                  | 4                                  | 1.003 %
DELAVEC    | employee id         | VAJE      | 152071               | 737                                | 0.485 %

Table 3: Fields with most missing values (Faculty of Electrical Engineering)
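The completeness figures in Tables 1 to 3 were obtained by a field-by-field count of missing values. A minimal sketch of how such a check could be scripted is shown below, assuming the Clipper (.DBF) tables have been exported to CSV; the file name, the field list, and the use of pandas are illustrative assumptions rather than a description of the actual assessment programs.

```python
import pandas as pd

def completeness_report(csv_path, required_fields):
    """Count missing (null or blank) values for each prioritized field of one table."""
    df = pd.read_csv(csv_path, dtype=str)          # keep all codes as strings
    total = len(df)
    rows = []
    for field in required_fields:
        missing = df[field].isna() | (df[field].str.strip() == "")
        rows.append({
            "field": field,
            "records_checked": total,
            "records_missing": int(missing.sum()),
            "pct_missing": round(100.0 * missing.sum() / max(total, 1), 3),
        })
    return pd.DataFrame(rows)

# Illustrative call: the CSV export and the field names are assumptions.
print(completeness_report("DIPLOMA.csv", ["VPIS_ST", "IME_D"]))
```

A check along these lines reports, for each prioritized field, the same counts and percentages as Tables 1 to 3.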
3.2 Validity, or business rule conformance

Existence of foreign keys in master tables: The examination of all 1:N relationships revealed that in about half a percent of the records the foreign keys have no counterpart values in the corresponding master tables (see Table 4). However, this average is misleading, since the majority of errors appear within a small number of relationships. The following relationships turned out to be the most problematic at both faculties (see Tables 5 and 6):

1. the relationship between the entities DIPLOMA (student's degree thesis) and NAČIN (a code table of possible types of study, e.g. full-time or part-time);
2. multiple relationships between the entities STUD_F (student's personal data) and OBČINA (a code table of territorial units in Slovenia) representing the territorial unit of the student's residence, place of birth, secondary school finished, etc.

The first of the aforementioned problems is simply a consequence of the fact that (within the alumni records database) the faculties only started collecting the type of study data in 1994, so this datum is missing in older records. The second problem is much harder and was caused by significant changes in the territorial organization at the local level in Slovenia after independence, which required several consequent modifications of the corresponding code table. Although each year's enrolment data corresponded to the then valid version of the code table, there are many inconsistencies when the data are viewed in time perspective.

Faculty | No. of files | No. of checked relationships | No. of checked records | No. of relationships with errors | No. of records with errors | % of relationships with errors | % of records with errors
FCIS    | 49           | 207                          | 1361758                | 38                               | 6360                       | 18.357 %                       | 0.467 %
FEE     | 49           | 207                          | 2702005                | 51                               | 16605                      | 24.638 %                       | 0.615 %
TOTAL   | 98           | 414                          | 4063763                | 89                               | 22965                      | 21.498 %                       | 0.565 %

Table 4: Existence of foreign keys in master tables (Summary Data)

Entity (File) | No. of records | Foreign key field | Description                                             | No. of nonexistent values | % of errors
DIPLOMA       | 961            | NAČIN             | type of study code                                      | 546                       | 56.816 %
STUD_F        | 4062           | OBCINA_SS         | territorial unit code of student's secondary school     | 1340                      | 32.989 %
STUD_F        | 4062           | OBCINA_R          | territorial unit code of student's place of birth       | 1331                      | 32.767 %
STUD_F        | 4062           | OBCINA_S          | territorial unit code of student's permanent residence  | 1314                      | 32.349 %
STUD_F        | 4062           | OBCINA_Z          | territorial unit code of student's temporary residence  | 603                       | 14.845 %

Table 5: Problematic relationships (Faculty of Computer and Information Science)

Entity (File) | No. of records | Foreign key field | Description                                             | No. of nonexistent values | % of errors
DIPLOMA       | 2926           | NAČIN             | type of study code                                      | 1461                      | 49.932 %
STUD_F        | 8164           | OBCINA_SS         | territorial unit code of student's secondary school     | 3565                      | 42.908 %
STUD_F        | 8164           | OBCINA_R          | territorial unit code of student's place of birth       | 3503                      | 42.896 %
STUD_F        | 8164           | OBCINA_S          | territorial unit code of student's permanent residence  | 3502                      | 42.896 %
STUD_F        | 8164           | OBCINA_Z          | territorial unit code of student's temporary residence  | 1422                      | 17.418 %

Table 6: Problematic relationships (Faculty of Electrical Engineering)
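The foreign key checks summarized in Tables 4 to 6 amount to an anti-join between each detail file and its master (code) table. The following sketch illustrates one such check; the file and column names (e.g. OBCINA.csv and its key column) are assumptions, and a complete run would iterate over all 207 relationships.

```python
import pandas as pd

def orphaned_foreign_keys(detail_csv, fk_column, master_csv, pk_column):
    """Return the detail records whose foreign key value has no counterpart
    in the master (code) table, together with the error percentage."""
    detail = pd.read_csv(detail_csv, dtype=str)
    master = pd.read_csv(master_csv, dtype=str)
    valid_keys = set(master[pk_column].dropna())
    orphans = detail[detail[fk_column].notna() & ~detail[fk_column].isin(valid_keys)]
    pct = 100.0 * len(orphans) / max(len(detail), 1)
    return orphans, pct

# Illustrative call: the territorial unit code of permanent residence in STUD_F
# should exist in the OBCINA code table (file and column names are assumptions).
orphans, pct = orphaned_foreign_keys("STUD_F.csv", "OBCINA_S", "OBCINA.csv", "OBCINA")
print(f"{len(orphans)} records with nonexistent foreign keys ({pct:.3f} %)")
```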
Range checking of date fields did not reveal any serious problems. In the worst case the rate of out-of-range field values reached 0.149 % at FCIS and 0.493 % at FEE.

Range checking of other fields from the highest priority subset also yielded quite good results. The field IME_D (viz. employee's first name) was ranked the worst at both faculties, containing 1.023 % erroneous values at FCIS and 1.003 % erroneous values at FEE.

Special cross-checks pointed out that the business rule requiring that each student complete his/her degree project within six months is sometimes violated. Namely, the difference between the degree date and the thesis issue date was more than six months in 0.976 % of the cases at FCIS and in 1.805 % of the cases at FEE. Error rates reported by the other cross-checking tests were less than 0.5 %.

3.3 Nonduplication of occurrences and equivalence of redundant and distributed data

Nonduplication of primary keys: The student records information system at FCIS and FEE [5] (as well as at other member institutions of the University of Ljubljana) is implemented in Clipper, which does not automatically enforce the uniqueness of primary keys. In spite of the fact that all programs have been carefully written in order to avoid duplicates, some can be introduced through the manual maintenance of data, especially code tables. Therefore, all files in the student records database were checked for possible duplications of primary keys. An excerpt of the assessment results showing only the files with the most duplicates is presented in Tables 7 through 9. Table 7 shows the number of duplicated primary keys in code tables that are maintained centrally by the University computing center for all member institutions, while Tables 8 and 9 refer to duplicates at FCIS and FEE, respectively. A relatively high percentage of errors in some code tables indicates that the maintenance of code tables at the University computing center should be improved.

File         | No. of records | No. of duplicated primary keys | % of errors
VSI.DBF      | 1314           | 68                             | 5.175 %
SS_POKLI.DBF | 1566           | 42                             | 2.682 %
CENTRI.DBF   | 51             | 1                              | 1.960 %
ZAVOD.DBF    | 100            | 1                              | 1.000 %

Table 7: Nonduplication of primary keys (code tables maintained by the University computing center)

File        | No. of records | No. of duplicated primary keys | % of errors
IZJEME.DBF  | 56             | 1                              | 1.786 %
TEMA.DBF    | 978            | 3                              | 0.307 %
DELAVEC.DBF | 391            | 1                              | 0.226 %
DVIG.DBF    | 922            | 1                              | 0.108 %

Table 8: Nonduplication of primary keys (FCIS)

File        | No. of records | No. of duplicated primary keys | % of errors
SPP.DBF     | 270            | 11                             | 4.074 %
PRED_PR.DBF | 746            | 8                              | 1.072 %
PP.DBF      | 2557           | 12                             | 0.469 %
DVIG.DBF    | 2825           | 9                              | 0.319 %

Table 9: Nonduplication of primary keys (FEE)
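Since Clipper does not enforce primary key uniqueness, duplicates such as those in Tables 7 to 9 have to be detected by an external check. A minimal sketch, again assuming CSV exports and with an illustrative file name and key column:

```python
import pandas as pd

def duplicated_primary_keys(csv_path, key_columns):
    """Report primary key values that occur more than once in a table."""
    df = pd.read_csv(csv_path, dtype=str)
    counts = df.groupby(key_columns, dropna=False).size()
    duplicates = counts[counts > 1]                 # key values occurring more than once
    pct = 100.0 * int(duplicates.sum()) / max(len(df), 1)
    return duplicates, pct

# Illustrative call: the exported file name and the key column are assumptions.
duplicates, pct = duplicated_primary_keys("VSI.csv", ["SIFRA"])
print(duplicates)
print(f"records sharing a duplicated key: {pct:.3f} %")
```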
Equivalence of redundant and distributed data: Given the fact that some teachers teach at both faculties, and that some students study at both faculties (e.g. a student can obtain his/her B.Sc. degree at FEE and enrol for graduate study at FCIS, or vice versa), some data are replicated across the faculties. In order to state the extent of such replication, two measures were introduced: the number of replicated entity occurrences (viz. the same teacher or student in both databases) with the same primary key, and the number of replicated entity occurrences with different primary keys.

Entity (File) | Description | Replications with same primary key | Replications with different primary keys
DELAVEC       | Employees   | 388                                 | 0
ŠTUDENT       | Students    | 4                                   | 121

Table 10: Equivalence of redundant occurrences (FCIS and FEE)

The assessment revealed that the employee files at both faculties are consistent: all replicated employees have the same primary key. This is not the case with the student files. Due to the decentralized organization of the University of Ljubljana, each faculty assigns its students a different identification number regardless of the fact that the student has already been enrolled at another faculty. Unfortunately, this kind of inconsistency may be a source of major problems when integrating data into a global warehouse.

4 Proposed Further Actions

On the basis of the assessment results we propose two kinds of further actions: cleansing of the source data and an appropriate design of the global data warehouse at the university level.

Some erroneous source data can be cleansed automatically (e.g. the missing type of study code in the file DIPLOMA), while other data require a manual or a combined manual and automatic approach (e.g. removal of duplicated primary keys, re-establishment of relationships using territorial unit codes). Some errors (e.g. out-of-range values) can be prevented by incorporating appropriate controls in the program code (see the sketch at the end of this section).

Special attention must be devoted to the maintenance of code tables that are common to the whole university. The relatively high percentage of duplicate codes in the code tables maintained by the University computing center up to now indicates that the maintenance of these code tables must improve. Additional problems could arise during the integration of those data and code tables that are at present maintained by individual departments (viz. employees, subjects taught, and students). Although our assessment did not reveal serious problems within each department, many duplications and code conflicts may occur during the integration. Therefore, we propose that a unique coding scheme for all employees, students, and subjects taught be defined in the first step of the data warehouse design.

The data warehouse design must be based on the principles of the TQdM methodology. Data standards must be defined, and an assessment of data definition and information architecture quality must take place before the programming and population of the data warehouse begins. An ongoing data quality management process must be established, and the roles and responsibilities of a data quality administrator, subject area champions, a data oversight committee, and data owners clearly defined [6].
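As an illustration of the entry-time controls proposed above, the six-month degree rule from Section 2 could be checked along the following lines. The sketch is hypothetical; in particular, the exact day limit used for "six months" is an assumption.

```python
from datetime import date

SIX_MONTHS_DAYS = 183   # approximate six-month limit; the exact value is an assumption

def degree_dates_valid(thesis_issue_date: date, degree_date: date) -> bool:
    """Business rule from Section 2: the difference between the degree date and
    the thesis issue date must be positive and less than about six months."""
    delta_days = (degree_date - thesis_issue_date).days
    return 0 < delta_days <= SIX_MONTHS_DAYS

# Illustrative usage: such a control could reject an invalid record at data entry time.
print(degree_dates_valid(date(2000, 3, 1), date(2000, 7, 15)))   # True: within six months
print(degree_dates_valid(date(2000, 3, 1), date(2001, 1, 10)))   # False: rule violated
```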
5 Conclusions

Our paper was intended to increase awareness of the importance of data quality not only when building a data warehouse but also in operational environments that support transactional processing. An assessment methodology to empirically determine data quality was described, and the results of the assessment were presented.

Considering the assessment results, we estimate that the source data at the Faculty of Computer and Information Science and the Faculty of Electrical Engineering are good enough to be used as a source for the global warehouse at the university level after some data cleansing takes place. In the first place, the missing student identification numbers must be provided in the alumni records, and the broken relationships using territorial unit codes must be re-established in the students' personal data.

During the design of the global data warehouse, special attention must be devoted to the integration of those data that may be replicated at many individual departments (viz. employees, subjects taught, and students). Since each department has its own coding policy, many duplications and code conflicts may occur during the integration. Therefore, we propose that a unique coding scheme for all employees, students, and subjects taught be defined in the first step of the data warehouse design. Additionally, an ongoing data quality management process must be established, and the roles and responsibilities of all personnel involved should be clearly defined.

6 References

[1] Mahnič, V., Towards the re-integration of the University of Ljubljana information system, in J.-F. Desnos & Y. Epelboin (eds.), European Cooperation in Higher Education Information Systems EUNIS 97, Grenoble, September 1997, pp. 250-258.
[2] Celko, J., McDonald, J., Don't Warehouse Dirty Data, Datamation, October 15, 1995, pp. 42-53.
[3] Forino, R., The Data Quality Assessment, Part 1, DM Review Online, August 2000, http://www.dmreview.com
[4] English, L.P., Improving Data Warehouse and Business Information Quality, John Wiley & Sons, Inc., 1999, ISBN 0-471-25383-9.
[5] Mahnič, V., Vilfan, B., Design of the Student Records Information System at the University of Ljubljana, in J. Knop (ed.), Trends in Academic Information Systems in Europe, Proceedings of the EUNIS'95 Congress, Düsseldorf, Germany, November 1995, pp. 207-220.
[6] Kachur, R., Data Quality Assessment for Data Warehouse Design, DM Review Online, April 2000, http://www.dmreview.com