Informatica 19 (1995) 257-264 257 Comparing Inference Control Mechanisms for Statistical Databases with Emphasis on Randomizing Ernst L. Leiss Uniyersity of Houston, Department of Computer Science, Houston, Texas, 77004, U.S.A. coscelOcs.uh.edu AND Jurij Jaklič University of Ljubljana, Faculty of Economics, Kardeljeva pl. 17, 61000 Ljubljana, Slovenia jurij.jaklic@uni-lj.si Keywords: statistical databases, security, comparison, randomizing Edited by: Jerzy R. Nawrocki Received: July 27, 1994 Revised: February 24, 1995 Statistical databases are primarily collected for statisticalally contain information about persons and organizationsand only aggregate statistics on this confidential attribute Accepted: April 12, 1995 analysis purposes. They usu­ which is considered confidential are permitted. However, deduction of confidential data (inference) is frequently possible. In order to preverit this possibility several security-control mechanisms have been developed, among them the randomizing method. We compare randomizing with other methods using several evaluation criteria. Evalua­tion shows that randomizing has several advantages in comparison with other methods, such as high leve! of security, robustness, low cost. On the other hand, the problem of bias for small query sets can be considerable Introduction Databases often contain confidential information. Various security-control mechanisms deal with di­fferent kinds of security problems, such as encryp­tion, identiflcation of users, and access authoriza­tion. Here we will study inference, i.e., the de­duction of confidential information from legal (i.e. permitted) statistical summary queries. A statistical database (SDB) is a database where certain users are authorized to issue only aggregate (statistical summary) queries, such as sum, maximum, minimum, count, average, median, variance, standard deviation, and k-moment. These users (researchers) cannot retri­eve information about an individual entry. Typical examples are Census Bureau databases, salaries of individual persons in a company, and diagnoses of patients in a hospital. We must enable these users to retrieve aggre­gate statistics, but prevent them from retrieving for some applications. values of confidential attributes of individual re­cords. In this way we protect the individuaPs right to privacy and on the other hand we can process information needed by the society [22]. Problems arise when certain users (snoopers) try to derive or infer confidential information from legal aggregate queries. If they are successful, we say that the SDB is compromised. There is more than one way how to compromise a SDB, for example the linear system attack [10] or using a tracker [8]. Several methods have been'developed in order to protect SDB's against compromises. Most of them are described in [2], where they are also classified and evaluated. Randomizing security-control mechanisms such as the one described in [17] are just mentioned in [2]. Our goal is to eva­luate this method and compare it with others. 2 Overview of th e Method s Two models offer frarneworks for dealing with the problem of the SDB security [2]: one is called the conceptual model [5] and the other the lattice model [9]. They support the research in this neld but do not offer methods for security control. There are two approaches to the problem of in­ference for statistical databases: — query restriction approach and - perturbation approach. Methods belonging to the same group have si­milar characteristics and are therefore easier to compare. In this section a brief overview of the two approaches is given. 2.1 Query Restriction Approach The idea is that if we do not permit every aggre­gate query to be executed, we can achieve better security of the database. The proposed methods differ in the way in which it is decided whether the query is permitted. The first method (query set size control) is to limit the query set size [13]. It has been noticed [10], that if the issued queries have many com­mon entities (records) in their query sets, there is a higher possibility of compromising the data­base. So, the method query set overlap control proposes to restrict the number of records that any pair of issued queries can have in common. One of the methods which can provide a very high level of security is auditing [27]. Auditing keeps track of issued queries and checks for each new query whether the database can be compromised. The partitioning method ([4],[25]) and the celi su­ppression method [6] are other techniques. 2.2 Perturbation Approach The perturbation approach includes two subgro­ups of methods. One subgroup replaces original data in the SDB with other data and ušes these perturbed data to compute statistics (data pertur­bation) while the other subgroup computes stati­stics from the original data, but noise is added in one or another way during the computation of the statistics (output perturbation). The probability distribution methods [19] re­place the original SDB by another sample with the same (assumed) probability distribution. A variant of this method (the analytical method) approximates the data distribution by orthogo­nal polynomials (see [15]). The other proposed approach [23] is to replace the true values of a gi­ven attribute with the perturbed values once and forever (fixed data perturbation). There are two different methods, one for numerical and one for categorical attributes. An example of the output perturbation me­thods is the random sample queries method [7]; a statistic of a randomly selected subset of the original query set is given as a response to the issued query. In the randomizing method [17] a response to the given query is computed from a superset of the query set. Records are randomly selected from the database and added to the query set. Let us assume that a user is interested in a cer­tain statistic of a given query set of size k. Let i\,..., ik be the indices of the records in the query set, and let DK be the name of the confidential attribute in whose statistic the user is interested. Instead of computing the true statistic, another v > 0 entities of the database are selected rando­mly and added to the query set. Then the stati­stic of this superset of the original query set with k + v records is computed and returned as the re­sult. Thus for a query of type average we have the perturbed response A _ E?=i DKj, + ZU P*«,-_ k + v _k-a + Ej = i DK5j k + v where s j is the index of the j-th. randomly selected record and a is the true response C/= i DKi-)lk. v should be (much) smaller than k, otherwise the precision of the result can be very bad, altho­ugh the security of this method increases with hi­gher values for v. That yields a security-accuracy trade-off. Here we select v to be equal to 1, justi­fied by the observation (see [17]) that even with small introduced noise, the relative error of the in­ferred values for DK will be considerable for large values of k. COMPARING INFERENCE CONTROL... Comparison We use the evaluation criteria proposed in [2] to compare the different methods; they cover the im­portant objectives of a good security control me­chanism. Some of the criteria exclude each other, and thus an effort must be directed towards ba­lancing them. 3.1 Security We consider different kinds of disclosures: - Exact disclosure is possible when a user can obtain the exact value of the confidential at­tribute. - Partial disclosure is possible when a user can obtain an estimate DK[ of the value of the confidential attribute D K for the i-th record, such that Var(DKl) < c\ ( i.e. variance of the estimate DK[ is less than cf), where the parameter C\ is set by the DBA (Data Base Administrator). For the čase of categori­cal attributes a partial disclosure is possible when a user can infere that the confidential attribute has not a certain value. - Statistical disclosure occurs when the same query is issued several times in order to ob­tain a small variance of the estimate of the true response -filtering. Let us look first at the exact compromisability of an SDB under different protection methods. There is the possibility of exact disclosure for some methods belonging to the query restriction approach group, namely the query set size control and the query set overlap control. Exact compromise cannot be done for a SDB protected by the auditing because of the nature of the auditing. This is also true for celi suppression and partitioning (see [6] and [26]). For methods belonging to the perturbation group, including randomizing, there is no possibi­lity for the exact disclosure, except for some rare cases for the analytičal method (see [2]) and for rounding [1]. Conditions under which exact dis­closure is possible for these two methods are very severe. There are more possibilities for partial disclo­sure. Regardless of which method we use, it is Informatica 19 (1995) 257-264 259 possible to achieve partial disclosure. As for the most of the perturbation approach methods, it is possible to balance the security against the preci­sion also for randomizing. The parameters which influence the security (and precision) and can be set by the DBA are: v, the number of records to be added to the query set and j , the parameter which is used in the method to solve the problem of accuracy (see Section 3.2). The problem of statistical disclosure has to be considered only in the cases where answers to the same query issued several times differ. For these (perturbation) methods one can achieve a better estimate of the real answer to the query using the method of filtering. The idea is that the user repe­ats the same query several times. Let An denote the perturbed response to the ra-th repetition of the query with the true response a. In general A{ ^ Aj for i ^ j . Then the user can repeat the query m times and compute the average , Aj+A2 + ... + Am a = . m The result of this expression will converge to a certain value a* with increasing m. If the pertur­bed response (after one repetition) is A then we have Var(a ) = —-, m if the answers to repeated queries are indepen­dent. Thus, we see that the more times one re­peats the query, the better an estimate a' of the true response a can be achieved. Let us see how many queries (mr) we have to issue if we want to get statistical disclosure of the SDB protected by the basic randomizing method, if the criterion for the statistical disclosure is Var{a!) < c\, where c\ is the parameter set by the DBA. Let us consider a fixed query of type average with the true value a. The perturbed value of that query is a • k + DK., A~ k+1 ' where k is the size of the original query set and DKS is the value of the confidential attribute D K of the randomly selected record. If the random number generator we use is uniform, then we can expect that on the average the value for DKS is equal to the average of the values over the data­base (DK*). Thus the expected value of A is a-k + BK* E(A) k + 1 Now we can compute the variance of A: a-k + DKS a-k + DK* Var(A) = E k+1 k+1 E(T>KS -DK*)2 Far(DK ) (k + 1)2 (k + iy Because the responses to the queries are indepen­dent of each other, we have Var(A) Var(DK)Var(a') = m-(k + l)2 m As we expected, the variance of the estimate de­pends on the variance of the confidential attribute and it is smaller for larger query set size. Thus we have: > Var(BK)mrc\-(k + iy Even if the mimber of queries needed to com­promise a SDB can be quite large, it is possible to do it. The reason is that with the increasing m the average introduced noise of the query always converges to the same value DK*. In order to avoid this one can use the following method. For each issued query two calls to the random number generator are made, and therefore two in­dices for the additional record are proposed. The selection of the single record to be added to the query set depends on the values of the confidential attribute of the records which are already in the query set. Let us say that the two proposed indi­ces are x\ and X2- Then we choose max{a;i,a;2} if the Value of the boolean expression E = [(DKtl < DKi2) © (DKi2 < DKh) © • • • ... ® (L>#,-,_, In order to solve this problem one can use the re­stricted randomizing method. However, we know that the value of the perturbed response to a query of type max is greater or equal to the true response. Thus, the problem of bias is here more severe. The situation is the same for queries of type min, but not for queries of type median: here the answer will change often, but the error is usually small and depends on the variance of the database and the selection of the query set. Since there are few perturbation methods which can be used for ali types of queries, vve can con­sider randomizing as a general method, even tho­ugh it is not equally good for ali types of queries. 3.7 Suitability It is desirable that a method be applicable for both numeric and categorical attributes. AH query restriction approaches are able to deal with both. Two fixed data perturbation methods have been developed, one for numeric attributes [28] and the other one for categorical attributes [2]. The probability distribution and the random sam­ple queries can be used for both. The randomizing method is not suitable for categorical, but only for numeric attributes. In addition, some methods are suitable for more than one attribute and others for only one at­tribute. Again ali the query restriction methods are basically suitable for more than one attribute. COMPARING INFERENCE CONTROL... The only problem is for auditing, where the pro­cessing overhead can become very high and this method may have no practical value. From the perturbation methods the following are suitable for more than one attribute: data swapping, ana­lytical method, random sample queries, if the at­tributes are independent, also the varying output perturbation, as well as randomizing. The last criteria is suitability to dynamic SDB. For some purposes a SDB can be static, for exam­ple a census database. For other purposes it is very important that the database be dynamic and on-line. For such a database the method used must provide security also in cases of changes in the database. Moreover, it must not require too much processing overhead per change in the da­tabase. Randomizing is completely suitable for on-line dynamic databases and does not require any additional effort for implementation nor any processing overhead per change in the database. There are some methods which are not suitable to on-line dynamic SDB at ali: celi suppression, data swapping, and probability distribution. Note that ali the output perturbation methods are suitable for on-line dynamic SDB. 3.8 Information Lost Information lost is defined as "amount of non-confidential information that is unnecessarily eli­minated, as well as, in the čase of perturbation methods, the statistical quality of the informa­tion provided to the users" [2]. This means that in the čase of perturbation methods the term in­formation lost corresponds to precision. In other words: the higher the precision is the lower the information lost is. This applies also to the ran­domizing method. The situation is different for the query restric­tion methods. Information lost can be very high for the query set size restriction and for the query overlap restriction approaches. The auditing method restricts by its nature only queries that can lead to compromise. Fol­lowing the defmition of the information lost (see above) we can say that this method causes no in­formation loss. However, the priče that must be paid for this is very high (see Section 3.5). It is more difficult to give an estimation of infor­mation lost for the partitioning and celi suppres­sion methods because it depends on data in the Informatica 19 (1995) 257-264 263 database. Some empirical results (see [26]) show that information loss can be very severe for par­titioning. In order to reduce it, dummy records have been proposed. 3.9 Conclusion s In summary we can say that the randomizing me­thod has more than one advantage in comparison with other methods. It assures high securitv, ro­bustness, and precision if the improvements de­scribed in this section are used. Another very im­portant advantage is that the method is among the lowest in cost. The disadvantages are: its bias (however it tends to be small for large sizes of query sets); it is not suitable for categorical data; it is not suitable for some types of queries. References [1] Achugbue J.O. and F.Y. Chin. "The effec­tiveness of output modification by rounding for protection of statistical databases." IN­FOR Journal, Vol. 17, No. 3, August 1979, pp. 209-218. [2] Adam N.R. and J.C. Wortmann. "Securitv-Control Methods for Statistical Databases: A Comparative Study." ACM Computing Surveys, Vol. 21, No. 4, December 1989, pp. 515-556. [3] Beck L.L. "A security mechanism for sta­tistical databases." ACM Trans. Database Syst., Vol. 5, No. 3, September 1980, pp. 316-338. [4] Chin F.Y. and G. Ozsoyoglu. "Security in partitioned dynamic statistical databases." In Proceedings of the IEEE COMPSAC, 1979, pp. 594-601. [5] Chin F.Y. and G. Ozsoyoglu. "Statistical database design." ACM, Trans. Database Syst., Vol. 6, No. 1, March 1981, pp. 113­ 139. [6] Cox L.H. "Suppression methodology and statistical disclosure control." Journal of American Statistical Association, Vol. 75, No. 370, June 1980, pp. 377-385. [7] Denning D.E. "Secure statistical databases with random sample queries." ACM Trans. Database Syst., Vol. 5, No. 3, September 1980, pp. 291-315. [8] Denning D.E., P.J. Denning and M.D. Sch­wartz. "The tracker: A threat to statistical database security." ACM Trans. Database Syst., Vol. 4, No. 1, March 1979, pp. 76-96. [9] Denning D.E. and J. Schlorer. "Inference control for statistical databases." Compu­ter, Vol. 7, No. 16, July 1983, pp. 69-82. [10] Dobkin D., A.K. Jones and R.J. Lipton. "Secure databases: Protection against user influence." ACM Trans. Database Syst., Vol. 4, No. 1, March 1979, pp. 97-106. [11] Fellegi I.P. "On the question of statistical confidentiality." Journal of American Stati­stical Association, Vol. 67, No. 337, March 1972, pp. 7-18. [12] Friedman A.D. and L. J. HofFman, "Towards a fail-safe approach to secure databases." In Proceedings of the IEEE Symposium on Security and Privacy, 1980. [13] HofFman L.J. and W.F. Miller. "Getting a personal dossier from a statistical data bank." Datamation, Vol. 16, No. 5, May 1970, pp. 74-75. [14] Jaklič J. "Protecting statistical databa­ses by randomizing and other methods: Comparison and simulation." Master the­sis, Dept. of Computer Science, University of Houston, December 1992. [15] Lefons D., A. Silvestri and F. Tangorra. "An analytic approach to statistical databases." In Proceedings of 8th International Confe­rence on Very Large Databases, 1983, pp. 260-273. [16] Leiss E.L. "Protecting statistical databa­ses through randomizing." Technical Report #UH-CS-81-07, University of Houston, De­cember 1981. [17] Leiss E.L. "Randomizing, a practical me­thod for protecting statistical databases against compromise." In Proceedings of 8th E.L. Leiss et al. International Conference on Very Large Databases, 1982, pp. 189-196. [18] Leiss E.L. "Principles of data securitv." Ple­num Press, 1982. [19] Liew C.K., W.J. Choi, and C.J. Liew. "A data distortion by probability distribution." ACM Trans. Database Syst., Vol. 10, No. 3, pp. 395-411. [20] Matloff N.E. "Another look at the use of no­ise addition for database security." In Pro­ceedings of the IEEE Symposium on Secu­rity and Privacy, 1986. [21] Morris J.L. "Computational methods in ele­mentary numerical analysis." John Wiley & Sons, 1983. [22] Palley M.A. and J.S. Simonoff. "The use of regression methodology for compromise of confidential infbrmation in statistical da­tabases." ACM Trans. Database Syst., Vol. 12, No. 4, December 1987, pp. 593-608. [23] Reiss S.P. "Practical data-swapping: The first steps." ACM Trans. Database Syst., Vol. 9, No. 1, March 1984, pp. 20-37. [24] Reiss S.P. "Practical data-swapping: The first steps." In Proceedings of the IEEE Symposium on Security and Privacy, 1980. [25] Schlorer J. "Information loss in partitio­ned statistical databases." Comput. Jour­nal, Vol. 26, No. 3, 1983, pp. 218-223. [26] Schlorer J. "Disclosure from statistical da­tabases: Quantitative aspects of trackers." ACM Trans. Database Syst. Vol. 5, No. 4, December 1980, pp. 467-492. [27] Schlorer J. "Confidentality of statistical re­cords: A treat monitoring scheme of on-line dialogue." Methods Inform. Med., Vol. 1, No. 15, 1976, pp. 36-42. [28] Traub J.F., Y. Yemini and H. Wozniako­wski. "The statistical security of a statisti­cal database." ACM Trans. Database Syst., Vol. 9, No. 4, December 1984, pp. 672-679.