Data protection for outsourced data mining

Boštjan Brumen, Izidor Golob, Tatjana Welzer and Ivan Rozman
University of Maribor, Faculty of Electrical Engineering and Computer Science
Smetanova 17, SI-2000 Maribor, Slovenia
{bostjan.brumen | izidor.golob | welzer | i.rozman}@uni-mb.si
AND
Marjan Družovec
University of Maribor, Faculty of Mechanical Engineering
Smetanova 17, SI-2000 Maribor, Slovenia
marjan.druzovec@uni-mb.si
AND
Hannu Jaakkola
Tampere University of Technology, Pori School of Technology and Economics
PO BOX 300, FI-28101 Pori, Finland
hj@pori.tut.fi

Keywords: data protection, data mining, outsourcing

Received: January 18, 2002

In the paper, we present data mining from the data protection point of view. In many cases, companies lack in-house expertise in data mining and are required to get help from outside. In this case the data leave the organization and need to be protected against misuse, both legally and technically. In the paper a formal framework for protecting the data that leave the organization's boundary is presented. The data and the data structure are modified so that the data modeling process can still take place and the results can be obtained, but the data content itself is hard to reveal. Once the data mining results are returned, the inverse process discloses the meaning of the model to the data owners. The approach is especially useful for model-based data mining.

1 Introduction

The traditional means for collecting data were restricted to the use of our natural senses - and so were the means for storing data. Then, people started to use other, more persistent means of data storage, such as skins, stones and, much later, papyrus and paper. But only since the advent of electricity (or, more specifically, electronics) has the collection of data no longer been restricted to human natural senses. Electronic equipment today, operating in fields as diverse as geology, business, astronomy, or medicine, is capable of gathering vast amounts of data.
It should also be noted that storage nowadays is affordable: in 1980, an IBM PC had a 5 MB hard drive and was priced at $3000, and in 1995, for the same price, it was equipped with a 1 GB hard drive [Bigus, 1999]. Today (in 2002), an IBM PC at the same price (not adjusted for inflation!) is shipped with an 80 GB hard drive. In the first 15 years, the amount of disk storage available in a PC-compatible computer increased over 200 times. From 1995 to today, the increase is almost 80-fold, and the cumulative factor of increase since 1980 is more than 16000. The situation scales equally with larger computers. In other words, the amount of data storage (and consequently the amount of data actually stored) doubles roughly every 15 months, beating the famous Moore's law by 3 months. (Dr. Gordon E. Moore stated in 1965 that the number of transistors per square inch on integrated circuits - and thus the processing power - doubles roughly every 18 months [Moore, 1965].) The last boost was clearly powered by the wide use of the Internet. In the early 1990s, computers definitively moved data from paper form into on-line databases.

For the last 30 years, data were mainly a by-product of the daily operations of a company. In general, not much of it was used for analytical purposes. But data are every company's vital asset [Reinhardt, 1995]. Assets are useless unless they are used in (strategic) business processes. Data are facts and as such do not contribute to a company's competitive advantage. Thus, data are useless if they only reside in a company - even worse, they cause costs: maintenance, storage, and security, to mention only a few. For this reason, data need to be processed, analyzed, and converted into information. Upon information, a company's decision-makers can act and sustain competitive advantage. Thus, once data are intelligently analyzed and presented, they become a valuable resource to be used for a competitive advantage [Hedberg, 1995].
Data indeed represent a great potential. Many companies have recognized the value in data and started to systematically collect and store it. When it came to using the data for purposes other than daily operations, the traditional transactional systems began to fail to answer the questions in reasonable time, simply because they were not designed for such queries. In the early 1990s a new paradigm was introduced: the data warehouse. Today, a modern, efficient and effective information system consists of "two hearts" - a database and a data warehouse (Figure 1). The transactional databases take care of daily transactions, such as "receive payment" or "ship goods". The data warehouse part is responsible for answering ad-hoc queries, such as "show received payments for shipped goods by month by store". In between runs the ETL (extract - transform - load) process, which updates the data warehouse content.

[Figure 1: A Modern Information System - transactional databases feeding, via ETL, a data warehouse queried by OLAP and other analytical tools]

Data in the data warehouses, and especially in databases, are considered secure. However, there are many real-life cases where the data are not protected. Furthermore, even if they are protected in the "safe-house" environment of the databases and/or data warehouses, they sometimes leave that environment. In the next section we describe the data mining process and explain why data mining poses a possible threat to data security.

2 Data Mining

The concept of data mining (DM) has been used since 1991 [Shapiro, 1991], and was defined in 1996 [Fayyad, 1996] as a part of the process known as knowledge discovery in databases (KDD), which itself is a "nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
Data mining constitutes only a subtask in the overall KDD process and refers to the information extraction itself. Data mining is an area that connects many different fields, such as machine learning, artificial intelligence, statistics, databases / data warehouses, computer graphics, inductive logic programming, and many others.

The results come in two basic forms - descriptions and models. In the former, we try to describe "what is going on in data" [Berry, 2000]. In the latter, we use currently available data and build models, which capture the rules or characteristics in the data. The data mining results come in various forms, such as neural nets, decision trees and rules, association rules, clusters, connections (graphs, such as trees and networks) and various scores in the form of numbers. There are countless techniques available, such as various methods of decision tree/rule induction, neural networks, genetic algorithms, clustering, statistical models, visualization, and many others. The list of possible results and of available techniques is far from complete; the purpose is to inform the reader that data mining is not a single tool with a single result; it is rather a process.

As discussed in [Brumen, 2001], data mining is not reserved for large companies (with large databases). In the last few years it has become clear that smaller data sets can be mined too. The problem with smaller companies is that they do not possess in-house expertise for doing data mining, though they do have domain knowledge and understand their data structures much better. They have two choices: not doing data mining at all, or doing it with help from outside. The former is sometimes not an option due to competitive reasons; the latter poses a potential security threat to the data. Larger companies may require some help from outside as well.
Sometimes the company's resources (hardware and software) may not be adequate for mining purposes, so the data need to leave the organization. Once outside the safe-house environment of the organization's databases and data warehouses, they may be used for purposes other than those specified. Due to the requirements of most of today's mining tools, data need to be prepared in a flat file. The reason is that many mining tools are developed to avoid DBMS overhead, and to assure compatibility across DB and OS platforms. As such, the data are much more easily copied and distributed. Even if we somehow enable access to our DB/DW from outside, we again have no control over what happens once the data are out.

3 Protecting data for outsourced data mining

One approach to protecting our data is to use techniques developed for statistical databases. Two major systems, namely μ-Argus [Hundepool, 1996] and Datafly [Sweeney, 1997], have been developed. In these systems, the sensitive data values were generalized [Samarati, 2001] or not disclosed, respectively. In the data mining world, we cannot have data that have been distorted or changed. For example, a decision rule on generalized and non-disclosed data would read

IF marital_status='not released' AND sex='not released' AND AGE='young' AND zip_code='9xxxx' THEN class=7.

Obviously, managers would find such a form of discovered knowledge useless. We propose an approach where no data semantics is lost and the statistics inside the data remain intact, but the data are still protected. In our framework, we transform the (relational) database that is to be exported to the outside world. The transformations are performed on both the data structure and the data values. In the next paragraph we briefly present the notation and basic definitions for the relational data model. We adopt the terminology for the relational data model (RDM) from [Elmasri, 1999] and for the abstract data type (ADT) from [Helman, 1994].
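Before turning to the formal definitions, the contrast with generalization-based protection can be made concrete. The sketch below is a hypothetical simplification (not the actual Datafly or μ-Argus code): values are suppressed or coarsened before release, so any rule mined afterwards can only mention the coarsened values.

```python
# Sketch of value generalization as used in statistical-disclosure systems
# (hypothetical simplification; not the actual Datafly / mu-Argus algorithms).

def generalize(record):
    """Coarsen or suppress sensitive attribute values before release."""
    return {
        "marital_status": "not released",                  # suppressed entirely
        "sex": "not released",                             # suppressed entirely
        "age": "young" if record["age"] < 30 else "old",   # coarsened to a band
        "zip": record["zip"][0] + "xxxx",                  # keep first digit only
    }

released = generalize({"marital_status": "single", "sex": "F",
                       "age": 24, "zip": "94107"})
print(released)
# Any rule mined on `released` can only refer to 'not released',
# 'young'/'old' and '9xxxx' - exactly the useless rule shown above.
```

The threshold 30 and the one-digit zip prefix are illustrative choices only; real disclosure-control systems pick generalization hierarchies per attribute.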
The ADT search table operators are not described in this paper; for formal definitions, which are beyond the scope of this paper, refer to [Helman, 1994].

Suppose we have a relational database DB [Elmasri, 1999], which is a finite set of relational schemas, DB = {R_1, ..., R_i, ..., R_M}, where R_i is a relational schema, denoted by R_i(A_1, ..., A_n, ..., A_N), where A_1, ..., A_n, ..., A_N is a list of attributes. Each attribute A_n is the name of a role played by some domain D_n in the relation schema R_i. D_n is called the domain of A_n and is denoted by D_n = dom(A_n). A relation r_i of the relation schema R_i (also referred to as a search table), denoted by r_i(R_i), is a set of N-tuples, r_i = {t_1, ..., t_m, ..., t_M}. Each N-tuple t_m is an ordered list of N values, t_m = (v_m1, ..., v_mn, ..., v_mN), where each value v_mn is an element of dom(A_n), or is a special null value. The n-th value of tuple t_m, which corresponds to the attribute A_n, is referred to as v_mn = t_m[A_n]. A relation r_i(R_i) is a mathematical relation of degree N on the domains dom(A_n), which is a subset of the Cartesian product of the domains that define R_i:

r_i(R_i) ⊆ dom(A_1) × ... × dom(A_n) × ... × dom(A_N)

In the following four definitions we set up a formal framework for a reversible database transformation. In Lemma 2 we claim that the process is indeed reversible and prove it in Proof 2.

Definition 1: Let D_n and D_n* be sets. A function from D_n into D_n* is a subset F of D_n × D_n* such that for each element a ∈ D_n there is exactly one element b ∈ D_n* such that (a, b) ∈ F.

Definition 2: Let D* be a set of domains, such that D* = {D_n* : |D_n*| = |D_n|}. Let f_n: D_n → D_n* be a function, and let F^D = {f_n} be a set of transformations. f_n is said to transform D_n onto D_n* if
1. ∀b ∃a (f_n(a) = b)
2. f_n(a_1) = f_n(a_2) ⇒ a_1 = a_2

Definition 3: (transformed database). Let DB* be a finite set of relational schemas, DB* = {R_1*, ..., R_i*, ..., R_M*}, where each R_i* is denoted by R_i*(A_1*, ..., A_n*, ..., A_N*) and each attribute A_n* is the name of a role played by the domain D_n* in the relation schema R_i*. The relation r_i* of the relation schema R_i* is a set of N-tuples, r_i* = {t_1*, ..., t_m*, ..., t_M*}. Each N-tuple t_m* is an ordered list of N values, t_m* = (v_m1*, ..., v_mn*, ..., v_mN*), where each value v_mn* is an element of dom(A_n*).
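For a finite domain, Definition 2's two conditions (condition 1: every b in D_n* is hit; condition 2: no two domain values collide) can be checked mechanically. The sketch below is an illustration only; the affine map f(x) = 3x + 7 and the helper name are our own choices, not part of the framework.

```python
def is_valid_transformation(f, domain, target):
    """Check Definition 2: f transforms `domain` onto `target`,
    i.e. f is surjective (condition 1) and injective (condition 2)."""
    image = {f(a) for a in domain}
    surjective = image == set(target)           # forall b exists a: f(a) = b
    injective = len(image) == len(set(domain))  # f(a1) = f(a2) implies a1 = a2
    return surjective and injective

f = lambda x: 3 * x + 7        # hypothetical transformation function f_n
D_age = {18, 19, 20}           # original domain D_n
D_age_star = {61, 64, 67}      # transformed domain D_n*, |D_n*| = |D_n|
print(is_valid_transformation(f, D_age, D_age_star))  # True
```

Note that the requirement |D_n*| = |D_n| from Definition 2 is exactly what makes the injectivity check by cardinality comparison sufficient on finite sets.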
The relation schemas of database DB* are essentially the same as those in DB. That is, the number of relation schemas is the same, and each schema has the same number of attributes as the corresponding schema in DB. Such a database is needed so that the instances are easily transformed from database DB into database DB*. The transformation function is defined in the next definition.

Definition 4: (table transformation). Let us define a function Trans that transforms a table instance (a relation) r_i into a transformed table instance r_i*, such that Trans(r_i) = r_i*.

Lemma 1: The function Trans is a bijection.

Proof 1: The function Trans is bijective if it is injective and surjective.

A function Trans is said to be injective if and only if whenever Trans(p) = Trans(q), then p = q. Suppose we have two tuples p, q ∈ r_i, p = (p_1, ..., p_N), q = (q_1, ..., q_N). Since Trans({p}) = Trans({q}) ⇒ {(p_1*, ..., p_n*, ..., p_N*) | p_n* = f_n(p[A_n])} = {(q_1*, ..., q_n*, ..., q_N*) | q_n* = f_n(q[A_n])} ⇒ p_n* = q_n* for ∀n. From Definition 2, f_n(a_1) = f_n(a_2) ⇒ a_1 = a_2, so p_n = q_n for ∀n ⇒ (p_1, ..., p_N) = (q_1, ..., q_N) ⇒ p = q.

A function Trans is said to be surjective if and only if for any q* ∈ r_i* there exists an element p ∈ r_i such that Trans({p}) = {q*}. Suppose we have q* ∈ r_i*, q* = (q_1*, ..., q_n*, ..., q_N*). Each q_n* is calculated as q_n* = f_n(p[A_n]). From Definition 2 we have that, for the function f_n, for each b there exists an a such that f_n(a) = b. By setting q_n* = b it is evident that there must exist a p such that Trans({p}) = {q*}, i.e. ∀q*: ∃p (Trans({p}) = {q*}). ∎

Lemma 2: The function Trans has an inverse, Trans⁻¹, such that Trans⁻¹(r_i*) = r_i, where Trans(r_i) = r_i*.

Proof 2: Let Trans be a bijection. Then there exists one and only one bijection Trans⁻¹ such that
1. Trans⁻¹(Trans({p})) = {p} for ∀p ∈ r_i
2. Trans(Trans⁻¹({q*})) = {q*} for ∀q* ∈ r_i*

First we prove that Trans⁻¹(Trans({p})) = {p} for ∀p ∈ r_i. Since Trans is an injection: Trans({p_1}) = Trans({p_2}) ⇒ {p_1} = {p_2}. Since Trans⁻¹ is a function: Trans({p_1}) = Trans({p_2}) ⇒ Trans⁻¹(Trans({p_1})) = Trans⁻¹(Trans({p_2})), i.e. {p_1} = {p_2}.

Next we prove that Trans(Trans⁻¹({q*})) = {q*} for ∀q* ∈ r_i*. Since Trans is a surjection: for any q* ∈ r_i* there exists p ∈ r_i
such that Trans({p}) = {q*}, and since Trans⁻¹ is a function: {p} = Trans⁻¹({q*}) is defined for every q* ∈ r_i*. Thus, Trans({p}) = Trans(Trans⁻¹({q*})) = {q*}. ∎

Instead of giving out the original database DB, we give out the transformed database DB*. The set of transformations on the domains, F^D, the old names of the relations, R_i, and the old names of the attributes of each relation are kept secret. This way an attack on database DB* is much more difficult, since the attacker has no semantic information on the content of the database. If F^D is carefully chosen, the attack becomes infeasible. The functions that can be chosen are strong encryption algorithms or other functions that preserve the statistical properties of the data [Adam, 1989], [Willenborg, 1996]. The advantage is that F^D can easily be chosen so that it corresponds to the value of the data to be protected.

For clarity, let us take a closer look at an example where we have three tables. We transform them using the above definitions.

Example 1: Suppose we have a set of domains D: D = {Integer, String, Char}, and a set of relational schemas DB = {R_i}: DB = {STUDENT, COURSE, GRADE}, where each relation schema is denoted as

STUDENT(S_ID, Fname, Lname, Zip, City, Age);
COURSE(C_ID, Name, Credits);
GRADE(STU_ID, COU_ID, Grade_Value);

and each attribute has the following domain:

STUDENT: Dom(S_ID)={1, 2, 3}, Dom(Fname)={John, Martha, Alice}, Dom(Lname)={Smith, Jones}, Dom(Zip)={12345, 12346, 12347}, Dom(City)={London, Helsinki, Berlin}, Dom(Age)={18, 19, 20};
COURSE: Dom(C_ID)={1, 2, 3}, Dom(Name)={Math, Physics, Biology}, Dom(Credits)={5, 10};
GRADE: Dom(STU_ID)={1, 2, 3}, Dom(COU_ID)={1, 2, 3}, Dom(Grade_Value)={A, B, C}.
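The structural half of the transformation - renaming relations and attributes while keeping the mapping secret - can be sketched as follows. The helper `obfuscate_schema` is hypothetical; the paper only shows the resulting names (STUDENT → TABLE1, S_ID → COL1, ...), not how they are generated.

```python
# Sketch: obfuscating the schema structure (relation and attribute names).
# The secret mapping is what the data owner withholds from the outside miner.

def obfuscate_schema(schemas):
    """Rename relations to TABLEi and attributes to COLj; return the
    public (exported) schema and the secret mapping kept by the owner."""
    public, secret = {}, {}
    for i, (name, attrs) in enumerate(schemas.items(), start=1):
        new_name = f"TABLE{i}"
        new_attrs = [f"COL{j}" for j in range(1, len(attrs) + 1)]
        public[new_name] = new_attrs
        secret[new_name] = (name, dict(zip(new_attrs, attrs)))
    return public, secret

db = {"STUDENT": ["S_ID", "Fname", "Lname", "Zip", "City", "Age"],
      "COURSE":  ["C_ID", "Name", "Credits"],
      "GRADE":   ["STU_ID", "COU_ID", "Grade_Value"]}
public, secret = obfuscate_schema(db)
print(public["TABLE1"])    # ['COL1', 'COL2', 'COL3', 'COL4', 'COL5', 'COL6']
print(secret["TABLE3"][0]) # GRADE
```

Since Python dictionaries preserve insertion order, the numbering here reproduces the correspondence used in Example 1.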
We have a set of relations (instances) of the relation schemas, presented in tabular form:

student(STUDENT):
S_ID  Fname   Lname  Zip    City      Age
1     John    Smith  12345  London    18
2     Martha  Smith  12346  Helsinki  19
3     Alice   Jones  12347  Berlin    20

course(COURSE):
C_ID  Name     Credits
1     Math     5
2     Physics  5
3     Biology  10

grade(GRADE):
STU_ID  COU_ID  Grade_Value
1       1       A
1       2       B
2       1       A
2       3       B
3       1       C

We define D* as D* = {Number1, String, Number2}. Further, we define DB* = {TABLE1, TABLE2, TABLE3}, where each relation schema is denoted as

TABLE1(COL1, COL2, COL3, COL4, COL5, COL6);
TABLE2(COL1, COL2, COL3);
TABLE3(COL1, COL2, COL3);

and each attribute has the following domain:

TABLE1: Dom(COL1)={10, 13, 16}, Dom(COL2)={Str1, Str2, Str3}, Dom(COL3)={Str4, Str5}, Dom(COL4)={37042, 37045, 37048}, Dom(COL5)={Str6, Str7, Str8}, Dom(COL6)={61, 64, 67};
TABLE2: Dom(COL1)={10, 13, 16}, Dom(COL2)={Str9, Str10, Str11}, Dom(COL3)={22, 37};
TABLE3: Dom(COL1)={10, 13, 16}, Dom(COL2)={10, 13, 16}, Dom(COL3)={8, 9, 10}.

By applying the function Trans to each of the relation instances, student(STUDENT), course(COURSE) and grade(GRADE), we get the following transformed instances:

Trans(student(STUDENT)) = table1(TABLE1)
Trans(course(COURSE)) = table2(TABLE2)
Trans(grade(GRADE)) = table3(TABLE3)

The instances are sets of tuples. We present each of the instances in tabular form:

table1(TABLE1):
Col1  Col2  Col3  Col4   Col5  Col6
10    Str1  Str4  37042  Str6  61
13    Str2  Str4  37045  Str7  64
16    Str3  Str5  37048  Str8  67

table2(TABLE2):
Col1  Col2   Col3
10    Str9   22
13    Str10  22
16    Str11  37

table3(TABLE3):
Col1  Col2  Col3
10    10    10
10    13    9
13    10    10
13    16    9
16    10    8

Suppose an outside entity does model-based data mining on the above tables. In model-based data mining, the goal is to build a model based on the underlying data. The models can be neural networks, decision trees, decision lists, association rules, and sets of clusters, to name only a few.
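Example 1's value transformation can be reproduced with a short sketch. Trans applies a per-attribute function to every tuple (Definition 4). The affine map 3x + 7 on the numeric columns matches the published TABLE1 values (its inverse (y - 7)/3 appears in Example 2); the string renamings are assumed lookup tables, since the paper does not state how Str1...Str8 were assigned.

```python
def trans(relation, funcs):
    """Definition 4: apply f_n to the n-th value of every tuple."""
    return {tuple(f(v) for f, v in zip(funcs, t)) for t in relation}

# Numeric columns: affine map 3x + 7; string columns: secret lookup
# tables (assumed assignment of Str1..Str8, kept by the data owner).
num     = lambda x: 3 * x + 7
num_inv = lambda y: (y - 7) // 3
fname = {"John": "Str1", "Martha": "Str2", "Alice": "Str3"}
lname = {"Smith": "Str4", "Jones": "Str5"}
city  = {"London": "Str6", "Helsinki": "Str7", "Berlin": "Str8"}

funcs    = [num, fname.get, lname.get, num, city.get, num]
inv_maps = [num_inv,
            {v: k for k, v in fname.items()}.get,
            {v: k for k, v in lname.items()}.get,
            num_inv,
            {v: k for k, v in city.items()}.get,
            num_inv]

student = {(1, "John", "Smith", 12345, "London", 18),
           (2, "Martha", "Smith", 12346, "Helsinki", 19),
           (3, "Alice", "Jones", 12347, "Berlin", 20)}

table1 = trans(student, funcs)
assert (10, "Str1", "Str4", 37042, "Str6", 61) in table1
assert trans(table1, inv_maps) == student  # Lemma 2: the round trip recovers r_i
```

The round-trip assertion is exactly Trans⁻¹(Trans(r_i)) = r_i from Proof 2, checked on the example instance.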
Basically, the models can be built using supervised or unsupervised learning. In the former, the algorithm takes as input a vector of values and returns the class as the output. In the learning phase, the calculated class is compared to the actual value and a correction is made if needed - hence, supervised learning. In the latter, the algorithm tries to group vectors together based on some similarity function. Since there is no correct answer, there is nothing to supervise.

(Informatica 26 (2002) 205-210)

For example, when building a decision tree, the algorithm makes a branch based on some statistical properties of the data, not on actual values or attribute names. For these reasons the actual data and their structure can be hidden, and the results will still be the same, as long as the statistics within the data are maintained. Any mining result that is returned is again in the form of a model that includes the transformed data elements. Thus, the result that is returned can be taken as a set of relation schemas DB* = {R_i*} with a corresponding set of relation instances {r_i*}, and connections among them, depending on the structure of the model. Note that even a single data cell can be viewed as an instance of a table with one attribute (column) and one tuple (row). With the inverse transformation, the data owner decodes the model (relation instances) into a readable form. Example 2 depicts the process.

Example 2: Suppose the data mining rule says that

IF table1.Col6 < 67 THEN table3.Col3 > 8.

We have two table instances, q* (i.e. table1) and s* (i.e. table3). Each table instance has only one tuple, i.e. q* = {t_1*} and s* = {u_1*}:

table1: q*
Col6
67

table3: s*
Col3
8

First, by using the inverse, Trans⁻¹, on q*, we get:

Trans⁻¹(q*) = q = {t_1} = {t_1 | t_1[Age] = f_6⁻¹(t_1*[Col6])} = {t_1 | t_1
[Age] = (t_1*[Col6] - 7)/3} =
= {t_1 | t_1[Age] = (67 - 7)/3} =
= {t_1 | t_1[Age] = 20}

Next, by applying Trans⁻¹ to s*, we get:

Trans⁻¹(s*) = s = {u_1} =
= {u_1 | u_1[Grade_Value] = f_3⁻¹(s_1*[Col3])} =
= {u_1 | u_1[Grade_Value] = f_3⁻¹(8)} =
= {u_1 | u_1[Grade_Value] = C}

Since the instances q* (table1) and s* (table3) correspond to the instances student and grade, respectively, we get the following two relation instances as a result:

student: q
Age
20

grade: s
Grade_Value
C

Finally, the rule decodes to

IF student.Age < 20 THEN grade.Grade_Value > C. ∎

4 Conclusion

The information age has caused a shift from a product-centric society to a data-centric one. While the fundamentals for data storage (and protection) have been around for almost 40 years, and have become very solid since the introduction of relational database systems, the new technologies, such as data warehousing and data mining, require special attention and care.

In the paper we presented a new knowledge discovery technology - data mining - from the data security perspective. In the data mining process, the data sometimes need to be modeled outside of an organization. In this case they need to be protected in such a way that the modeling process is still possible and data security is preserved at the same time.

We presented a formal framework for the transformation of the schema and content of a relational database. We proved that the transformation is reversible. With a careful selection of the transformation functions, an attack on the data becomes infeasible. The functions can be selected so that the effort to break them exceeds the value of the protected data. By using the framework, the outsourced data miner is still able to do the data mining, and the data remain secure. The reversibility of the process enables the data owners to decode the model into a readable form. The presented framework is especially useful for model-based data mining, where the data values and data structure play no role when building a model.
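The concluding claim - that data values and structure play no role in model building as long as the statistics within the data are maintained - can be illustrated with a toy split search. This is a simplified stand-in for real decision-tree induction, not an algorithm from the paper; the map 3x + 7 mirrors the inverse (y - 7)/3 of Example 2. Under any strictly increasing transformation the chosen partition of the tuples is identical; only the threshold is renamed.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (impurity, cut) of the binary split x <= cut
    minimising the weighted Gini impurity."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left  = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, cut)
    return best

ages   = [18, 19, 20, 18, 19]                # toy data, original values
labels = ["A", "A", "C", "A", "B"]
transformed = [3 * x + 7 for x in ages]      # exported values

s1 = best_split(ages, labels)
s2 = best_split(transformed, labels)
assert s1[0] == s2[0]           # identical impurity: the same tuples split off
assert s2[1] == 3 * s1[1] + 7   # the threshold is merely renamed by f
```

Decoding the mined threshold with f⁻¹, as in Example 2, therefore yields the rule the miner would have found on the original data.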
References

[Adam, 1989] Adam, Nabil R.; Wortman, John C.: Security-Control Methods for Statistical Databases: A Comparative Study, ACM Computing Surveys, Vol. 21, No. 4, pp. 515-556, 1989

[Berry, 2000] Berry, Michael A. J.; Linoff, Gordon: Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley & Sons, Inc., New York, USA, 2000

[Bigus, 1999] Bigus, Joseph P.: Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support, McGraw-Hill, New York, USA, 1999

[Brumen, 2001] Brumen, Boštjan; Welzer, Tatjana; Golob, Izidor; Jaakkola, Hannu: Convergence Detection in Classification Task of Knowledge Discovery Process, In Proceedings of the Portland International Conference on Management of Engineering and Technology, Portland, Oregon, USA, 2001

[Elmasri, 1999] Elmasri, Ramez A.; Navathe, Shamkant B.: Fundamentals of Database Systems, 3rd Edition, Addison-Wesley Publishing, New York, 1999

[Fayyad, 1996] Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic; Uthurusamy, Ramasamy (Eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press, USA, 1996

[Hedberg, 1995] Hedberg, Sara R.: The Data Gold Rush, Byte Magazine, 10-1995

[Helman, 1994] Helman, Paul: The Science of Database Management, Irwin Inc., 1994

[Hundepool, 1996] Hundepool, Anco J.; Willenborg, Leon C.R.J.: μ- and τ-Argus: Software for Statistical Disclosure Control, Proceedings of the 3rd International Seminar on Statistical Confidentiality, Bled, 1996

[Moore, 1965] Moore, Gordon E.: Cramming More Components Onto Integrated Circuits, Electronics, Vol. 38, No. 8, April 19, 1965

[Reinhardt, 1995] Reinhardt, Andy: Your Next Mainframe, Byte Magazine, 5-1995

[Samarati, 2001] Samarati, Pierangela: Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 2001

[Shapiro, 1991] Piatetsky-Shapiro, Gregory; Frawley, William J.
(Eds.): Knowledge Discovery in Databases, AAAI/MIT Press, USA, 1991

[Sweeney, 1997] Sweeney, Latanya: Guaranteeing Anonymity when Sharing Medical Data, the Datafly System, Proceedings, Journal of the American Medical Informatics Association, Washington, DC, Hanley & Belfus, Inc., 1997

[Willenborg, 1996] Willenborg, Leon C.R.J.; De Waal, Ton: Statistical Disclosure Control in Practice, Lecture Notes in Statistics, Vol. 111, Springer-Verlag, 1996