Some Notes on the Probability Space of Statistical Surveys

George Petrakos¹

¹ Dept. of Public Administration, Panteion University of Social and Political Sciences, Athens, GR, and Agilis SA, Statistics and Informatics, GR, Acadimias 96-100, 10677 Athens, Greece; george.petrakos@agilis-sa.gr

Abstract

This paper introduces a formal presentation of the sampling process using principles and concepts from Probability Algebra and Information Theory. Under this model, any sampling scheme uniquely defines a probability measure, which is illustrated in various examples along with some applications in survey design and management.

1 Introduction - basic definitions

Let $P$ be the target population of a statistical survey and $R = \{r_v,\ v = 1, 2, \ldots, N\}$ a relevant register at hand, consisting of $N$ individual statistical units. Regardless of the parameters of the selection process, all the possible outcomes concerning the elements of $R$ comprise the set $\Omega_R = \{r_1^+, r_1^-, r_2^+, r_2^-, \ldots, r_N^+, r_N^-\}$, where $r_v^+$ denotes the presence of the $v$th unit and $r_v^-$ denotes its absence (Kullback, 1997). By $M = \{E \subseteq \Omega_R\}$ we denote the set of all subsets $E$ of $\Omega_R$, called samples. Any selection process in $\Omega_R$ uniquely defines a probability measure $p$, and any sample $E$ either has a chance to appear ($p(E) > 0$) or not ($p(E) = 0$). Let us now consider a mapping $f: M \to \mathcal{S}$ such that

$$f(E) = \begin{cases} E, & p(E) > 0 \\ \emptyset, & p(E) = 0 \end{cases} \tag{1.1}$$

Thus we construct a non-empty set $\mathcal{S} \subseteq M$ which, together with the basic Boolean operations and a probability measure $p$ that is strictly positive, normed and additive, forms a probability algebra $(\mathcal{S}, p)$ (Kappos, 1969). Therefore, for any elements $E$ in $\mathcal{S}$:

(i) $p(E) \geq 0$, and $p(E) = 0$ iff $E = \emptyset$
(ii) $p(e) = 1$, where $e$ is the unit in $\mathcal{S}$
(iii) $p(E_1 \cup E_2) = p(E_1) + p(E_2)$ if $E_1 \cap E_2 = \emptyset$

Any element of $\mathcal{S}$ other than $\emptyset$ and $e$ is called a possible sample. We also consider $N+1$ classes $S^{(n)} \subseteq M$, $n = 0, 1, 2, \ldots, N$, such that

$$S^{(n)} = \left\{ (r_1^k, r_2^k, \ldots, r_N^k) : \sum_{v=1}^{N} I(r_v^k) = n \right\}, \quad \text{where } k \in \{+, -\} \text{ and } I(r^k) = \begin{cases} 0, & k = - \\ 1, & k = + \end{cases}$$

that is, $S^{(n)}$ contains all the elements of $M$ in which exactly $n$ statistical units appear. For every $n$, $1 \leq n \leq N$, the class $S^{(n)}$ contains $\binom{N}{n}$ subsets $S_i^{(n)}$, $i \in I_n = \{1, 2, \ldots, \binom{N}{n}\}$. By applying $f$ on $S^{(n)} \subseteq M$, we construct a non-empty set $S_n$:

$$f: S^{(n)} \to S_n : \quad f(S_i^{(n)}) = \begin{cases} S_i^{n}, & p(S_i^{(n)}) > 0 \\ \emptyset, & p(S_i^{(n)}) = 0 \end{cases}, \quad i \in I_n \tag{1.2}$$

Under the probability algebra $(\mathcal{S}, p)$ defined by a chosen sampling process, the class $S_n$ has the following properties (inherited from $\mathcal{S}$):

(i) $p(S_i^{n}) > 0$, $i \in I_n(s)$
(ii) $p\left(\bigcup_i S_i^{n}\right) = 1$, $i \in I_n(s)$
(iii) $p(S_i^{n} \cup S_j^{n}) = p(S_i^{n}) + p(S_j^{n})$, $\forall (i, j) \in I_n(s) \times I_n(s)$ with $i \neq j$

where $I_n(s) \subseteq I_n$ is the subset of indices for which $p(S_i^{(n)}) > 0$, and $e = \bigcup_i S_i^{n}$, $i \in I_n(s)$, is the unit, with $p(e) = 1$.

This basic set of notions and definitions introduces a more algebraic approach to measurable sample designs than the analytical ones (Särndal et al., 2003), which focus on the estimation of various parameters. This algebraic approach seems to handle multiple sampling procedures, such as multiple recapture designs, more efficiently.

2 Application to various sampling schemes

In a single sample process it can be shown that $S_i^{n} \cap S_j^{n} = \emptyset$, $\forall (i, j) \in I_n(s) \times I_n(s)$ with $i \neq j$. The probability that two different samples will be drawn in a single sampling process is zero; therefore $p(S_i^{n} \cap S_j^{n}) = 0$, and the only event in $(\mathcal{S}, p)$ with probability 0 is the empty set $\emptyset$.

There are sampling schemes where $I_n(s) \subset I_n$ (strictly), e.g. stratified random sampling, where only the $S_i^{n}$ that satisfy the proportional-to-strata restriction meet property (i), while for the rest it holds that $p(S_i^{(n)}) = 0$, $i \in I_n - I_n(s)$. On the other hand, in simple random sampling $I_n \equiv I_n(s)$, since all $S_i^{(n)}$, $i \in I_n$, satisfy property (i).

The above concepts can also be applied to multiple sampling procedures. In this type of sampling, both $r_v^+$ and $r_v^-$ can be present in the sample, in different stages of course.
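The difference between $I_n$ and $I_n(s)$ in the single-sample designs above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the population of four units, the two strata $\{r_1, r_2\}$ and $\{r_3, r_4\}$, the one-unit-per-stratum allocation, the equal probabilities on admissible samples, and all function names are hypothetical choices, not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations


def size_class(N, n):
    """The class S^(n): every sample with exactly n of the N units present.

    Each sample is a tuple of '+'/'-' marks, one per unit r_1..r_N.
    """
    units = range(1, N + 1)
    return [tuple('+' if v in chosen else '-' for v in units)
            for chosen in combinations(units, n)]


def srs_measure(samples):
    """Simple random sampling: every sample of the class is equally likely."""
    return {s: Fraction(1, len(samples)) for s in samples}


def stratified_measure(samples, strata, allocation):
    """Assumed design: equal probability on samples that take exactly
    allocation[h] units from stratum h, and probability 0 on the rest."""
    def admissible(s):
        return all(sum(s[v - 1] == '+' for v in stratum) == m
                   for stratum, m in zip(strata, allocation))

    ok = [s for s in samples if admissible(s)]
    return {s: Fraction(1, len(ok)) if s in ok else Fraction(0) for s in samples}


def possible(p):
    """The class S_n: samples with strictly positive probability (property (i))."""
    return [s for s, prob in p.items() if prob > 0]


S2 = size_class(4, 2)                       # I_n has C(4, 2) = 6 indices
p_srs = srs_measure(S2)
p_str = stratified_measure(S2, strata=[(1, 2), (3, 4)], allocation=[1, 1])

print(len(possible(p_srs)), len(possible(p_str)))   # sizes of I_n(s) under each design
print(sum(p_srs.values()), sum(p_str.values()))     # property (ii): total probability
```

Under these assumptions, simple random sampling keeps all six samples of size 2 as possible samples, while the stratified design keeps only the four that take one unit from each stratum; in both cases the probabilities of the possible samples sum to 1.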
We will examine the form of the event space $\Omega_R$ and the class $S_n$ for sampling with replacement and for multiple recapture sampling.

Sampling with replacement. Sampling from $N$ statistical units by choosing one unit in each of $n$ trials ($n$ being the sample size) and putting it back in the population before the next trial is a process that corresponds to an event space $\Omega_R$ such that

$$\Omega_R = \{ r_v^{k}(n) \}, \quad v = 1, 2, \ldots, N,\ k \in \{+, -\},\ n = 1, 2, \ldots, \quad \text{where } \sum_{v=1}^{N} I[r_v^{k}(n)] = 1,\ \forall n,$$

and a probability algebra $(\mathcal{S}, p)$ is defined based on

$$S_n = S_1 \times S_1 \times \cdots \times S_1 = \underset{n}{\times}\, S_1,$$

where $S_1$ is the basis for an SRS of size 1.

Multiple recapture. In a multiple recapture experiment run in a population of size $N$ (usually unknown), the sample space is expanded over the discrete time of trials ($t = 1, 2, \ldots, T$). If the population is closed for this time period, the sample space is

$$\Omega_R(T) = \{ r_v^{k}(t) \}, \quad v = 1, 2, \ldots, N,\ k \in \{+, -\},\ t = 1, 2, \ldots, T.$$

When the population size changes between points in time (open population), the sample space is

$$\Omega_R(T) = \{ r_v^{k}(t) \}, \quad v = 1, 2, \ldots, N(t),\ k \in \{+, -\},\ t = 1, 2, \ldots, T.$$

The basic class is

$$S_T = S_{X_1} \times S_{X_2} \times \cdots \times S_{X_T} = \underset{t}{\times}\, S_{X_t},$$

where $X_t \in \{0, 1, \ldots, N(t)\}$ is a discrete random variable, with elements of the form $S_{i_1}^{X_1} \times S_{i_2}^{X_2} \times \cdots \times S_{i_T}^{X_T}$, $i_t = 1, 2, \ldots, \binom{N(t)}{X_t}$, $t = 1, 2, \ldots, T$.

3 The probability space

Under a probability algebra $(\mathcal{S}, p)$, a class $S_n$ is uniquely defined and contains all the possible samples and only them. This class forms a basis for the construction of all events in $\mathcal{S}$. Any event $E \in \mathcal{S}$ can be constructed from one or more basic samples $S_i^{(n)}$ and expressed as a union of them, since any possible event related to the sampling process can be realized by unions of samples $S_i^{(n)}$, $i \in I_n(s)$.

It can be easily shown that $\mathcal{S}$ is closed under the basic set operations. For that, let us consider $E_1, E_2 \in \mathcal{S}$ as unions of some $S_i^{(n)}$, such that

$$E_1 \in \mathcal{S} \Rightarrow E_1 = \bigcup_i S_i^{(n)}, \qquad E_2 \in \mathcal{S} \Rightarrow E_2 = \bigcup_j S_j^{(n)}, \qquad \text{for some } i, j \in I_n(s).$$
Then

$$E_1 \cup E_2 = \bigcup_i S_i^{(n)} \cup \bigcup_j S_j^{(n)} = \bigcup_d S_d^{(n)} \in \mathcal{S},$$

where $d$ is such that $S_d^{(n)}$ belongs either to $\bigcup_i S_i^{(n)}$ or to $\bigcup_j S_j^{(n)}$, and

$$E_1 \cap E_2 = \bigcup_g S_g^{(n)} \in \mathcal{S},$$

where $g$ is such that $S_g^{(n)}$ belongs both to $\bigcup_i S_i^{(n)}$ and to $\bigcup_j S_j^{(n)}$. If there is no $g$ such that $S_g^{(n)}$ belongs to both of the unions above, then $E_1 \cap E_2 = \emptyset$ and the two events are mutually exclusive. These properties can be easily extended to any finite set of events $E_i$. Moreover, a possible event $E_2 = \bigcup_d S_d^{(n)}$ contains another possible event $E_1 = \bigcup_i S_i^{(n)}$, noted as $E_1 \subset E_2$, when $\bigcup_i S_i^{(n)} \subseteq \bigcup_d S_d^{(n)}$, $i \in I$, $d \in D$, or equivalently $I \subseteq D$.

Let us now illustrate the above with a couple of examples.

Example 1. Let $N = 4$ and $n = 3$. Then $S_3$ is a class of four basic sets, namely $S_1^{3} = \{r_1^-, r_2^+, r_3^+, r_4^+\}$, $S_2^{3} = \{r_1^+, r_2^-, r_3^+, r_4^+\}$, $S_3^{3} = \{r_1^+, r_2^+, r_3^-, r_4^+\}$, $S_4^{3} = \{r_1^+, r_2^+, r_3^+, r_4^-\}$. The event of the presence of the first two individuals, which can be noted by $E_{12} = \{r_1^+, r_2^+\}$, can be expressed as a union of basic sets, $E_{12} = S_3^{(3)} \cup S_4^{(3)}$. In other words, the event $E_{12}$ occurs when at least one of the basic events in which the first two individuals are present occurs.

Remark. One could argue that in the example above $E_{12} = S_3^{(3)} \cap S_4^{(3)}$, which in terms of point set theory seems correct, since $\{r_1^+, r_2^+, r_3^-, r_4^+\} \cap \{r_1^+, r_2^+, r_3^+, r_4^-\} = \{r_1^+, r_2^+\}$. However, in our treatment under the given sampling scheme, $\{r_1^+, r_2^+\} \equiv \{r_1^+, r_2^+, r_3^k, r_4^k\}$, $k = +, -$, which explains why $E_{12} = S_3^{(3)} \cup S_4^{(3)}$.

Example 2. Let $N = 3$, so $R = \{r_1, r_2, r_3\}$. If the units are placed in an orthogonal 3-D space taking the values 0 and 1 for non-appearance and appearance respectively, we have the following transformation:

[$S^{(0)}$]: $(0,0,0) \to (r_1^-, r_2^-, r_3^-)$
[$S^{(1)}$]: $(1,0,0) \to (r_1^+, r_2^-, r_3^-)$, $(0,1,0) \to (r_1^-, r_2^+, r_3^-)$, $(0,0,1) \to (r_1^-, r_2^-, r_3^+)$
[$S^{(2)}$]: $(1,1,0) \to (r_1^+, r_2^+, r_3^-)$, $(1,0,1) \to (r_1^+, r_2^-, r_3^+)$, $(0,1,1) \to (r_1^-, r_2^+, r_3^+)$
[$S^{(3)}$]: $(1,1,1) \to (r_1^+, r_2^+, r_3^+)$

which produce all possible samples.
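The enumeration in Example 2 and the union representation in Example 1 can be reproduced with a short sketch. The 0/1 coding follows Example 2; the function names (`all_samples`, `label`, `event`) are hypothetical helpers introduced here for illustration only.

```python
from itertools import product


def all_samples(N):
    """Group the 2**N indicator vectors by the number of units present.

    Returns a dict mapping n to the class S^(n) as a list of 0/1 tuples.
    """
    classes = {n: [] for n in range(N + 1)}
    for vec in product((0, 1), repeat=N):
        classes[sum(vec)].append(vec)
    return classes


def label(vec):
    """Render a 0/1 vector in the r_v^+ / r_v^- notation, e.g. (r1+, r2-, r3+)."""
    marks = (f"r{v}{'+' if x else '-'}" for v, x in enumerate(vec, start=1))
    return "(" + ", ".join(marks) + ")"


def event(basic_samples, present):
    """Union representation of an event: the basic samples in which
    every unit listed in `present` appears."""
    return [s for s in basic_samples if all(s[v - 1] == 1 for v in present)]


# Example 2: N = 3, all classes S^(0), ..., S^(3).
classes = all_samples(3)
for n, vecs in classes.items():
    print(f"S({n}):", ", ".join(f"{v} -> {label(v)}" for v in vecs))

# Example 1: N = 4, n = 3, event "units 1 and 2 are present".
S3 = all_samples(4)[3]
E12 = event(S3, present=(1, 2))
print([label(s) for s in E12])
```

For $N = 3$ the classes have 1, 3, 3 and 1 elements, and for Example 1 the event $E_{12}$ is recovered as the two basic samples $(r_1^+, r_2^+, r_3^-, r_4^+)$ and $(r_1^+, r_2^+, r_3^+, r_4^-)$, that is, as a union rather than an intersection of basic samples.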
For $n = 2$, we have 3 orthogonal vectors $S_1^{(2)}$, $S_2^{(2)}$, $S_3^{(2)}$, which are a basis for some sampling schemes where 2 out of 3 units are selected.

Figure 1: 3-D orthogonal space.

Any measure $f_i$ applied to $r_i$ can associate a measurable function $\varphi^{(n)} = (f_i)$, $i = 1, 2, \ldots, n$, to a class $S_n$, and therefore a value $\varphi^{(n)}(S_i^{n})$ to each basic sample. Considering the probability measure $p$ in $\mathcal{S}$, the mathematical expectation

$$E(\varphi^{(n)}) = \sum_i p(S_i^{n})\, \varphi^{(n)}(S_i^{n}) \tag{3.1}$$

where $\varphi^{(n)} = \{f_1(r_1), f_2(r_2), \ldots, f_n(r_n)\}$ and $\sum_i p(S_i^{n}) = 1$, is located at the barycenter of the polytope formed by $S_n$ (Petrakos, 2000) and expresses a mean value of $\varphi^{(n)}$ before the sample is drawn.

An interesting application of this approach is the determination and application of a cost function $C^{(n)} = \{c_1(r_1), c_2(r_2), \ldots, c_n(r_n)\}$, where the variation among the $c_i$ is due to the costly characteristics of the corresponding $r_i$ (access, distant location, etc.). Then the cost of a sample $S_i^{n}$ is $C_i = C^{(n)} I(S_i^{n})'$, where $I(S_i^{n})$ is an $n$-dimensional vector with ones for the corresponding $r_i^+$ and zeros for the $r_i^-$. Finally, the expected cost of the sampling process, estimated in the design phase, will be

$$E(C^{(n)}) = \sum_i p(S_i^{n})\, C^{(n)} I(S_i^{n})' \tag{3.2}$$

4 Conclusions

A probability algebra model has been introduced in order to describe the data collection process in a statistical survey. Its sufficiency, efficiency and simplicity were tested and proved over different sampling schemes. Future research can adapt this model to more complicated and realistic sampling schemes, incorporating cost and non-response into the design of a statistical survey. From a theoretical point of view, this model can be viewed and further studied as an application of group theory. In both cases, this paper aspires to provide some basic ideas for substantial research.

Acknowledgements

The author is grateful to the associate editor of the journal and to the referees for their constructive comments. The author would also like to thank Mr. George Maniatis for reviewing the final version of this paper.
References

[1] Cochran, W. (1977): Sampling Techniques. New York: J. Wiley & Sons.
[2] Kappos, D. (1969): Probability Algebras and Stochastic Spaces. Monograph in Probability and Mathematical Statistics. London: Academic Press.
[3] Kullback, S. (1997): Information Theory and Statistics. New York: Dover Publ. Inc.
[4] Petrakos, G. (2000): The topological foundation and some properties of the mixed estimator. Computational Statistics, 15, 109-114.
[5] Särndal, C., Swensson, B., and Wretman, J. (2003): Model Assisted Survey Sampling. New York: Springer.