Informatica 35 (2011) 407-417

Distributed Representations Based on Geometric Algebra: The Continuous Model

Agnieszka Patyk-Lonska, Marek Czachor
Gdansk University of Technology, ul. Narutowicza 11/12, Gdansk 80-233, Poland
E-mail: {patyk, mczachor}@pg.gda.pl, http://www.pg.gda.pl/~patyk, http://www.mif.pg.gda.pl/kft/czachor.html

Diederik Aerts
Centrum Leo Apostel (CLEA), Vrije Universiteit Brussel, Krijgskundestraat 33, 1160 Brussels, Belgium
E-mail: diraerts@vub.ac.be, http://www.vub.ac.be/CLEA/aerts/

Keywords: distributed representation of data, geometric algebra, HRR, BSC, scaling

Received: October 23, 2011

This paper is based on: A. Patyk-Lonska, M. Czachor and D. Aerts, Some tests on geometric analogues of Holographic Reduced Representations and Binary Spatter Codes, published in the proceedings of the 1st International Workshop on Advances in Semantic Information Retrieval (part of the FedCSIS 2011 conference).

The authors revise the concept of a distributed representation of data as well as two previously developed models: Holographic Reduced Representations (HRR) and Binary Spatter Codes (BSC). A geometric analogue of HRR, called GAc ("c" stands for continuous, as opposed to a discrete version), is introduced; it employs role-filler binding based on geometric products. Atomic objects are real-valued vectors in n-dimensional Euclidean space, while complex data structures belong to a hierarchy of multivectors. The paper reports on a test comparing GAc with HRR and BSC, analogous to the test proposed by Tony Plate in the mid-1990s. We repeat Plate's test on GAc and compare the results with the original HRR and BSC, concentrating on the recognition percentage achieved by the three models for comparable data sizes rather than on the time taken to reach a high percentage. The results show that the best models for storing and recognizing multiple similar structures are GAc and BSC, with recognition percentages well above 90. The paper ends with remarks on prospective applications of geometric algebra to quantum algorithms.

Povzetek: The paper deals with a distributed representation of data that uses geometric algebra.

1 Introduction

Distributed representations of data are very different from traditional structures (e.g. trees, lists), and complex structures bear little resemblance to their components; therefore, great care must be taken when composing or decomposing a complex structure. The most widely used definition of a distributed representation is due to Hinton et al. [13]: each concept is represented over a number of units, and each unit participates in the representation of some number of concepts. The size of a distributed representation is usually fixed, and the units take either binary or continuous values. In most distributed representations, only the overall pattern of activated units has a meaning.

Let us consider an example of storing the following information: "Fido bit Pat". The action in this statement is bite, and the features (i.e. roles) of this action are an agent and an object, denoted $bite_{agt}$ and $bite_{obj}$, while their fillers are Fido and Pat, respectively. If we also wish to store the way the action is performed, we can add a third role, e.g. $bite_{way}$. If we store Fido, Pat, $bite_{agt}$ and $bite_{obj}$ as vectors, we are able to encode "Fido bit Pat" as

$$bite_{agt} * Fido + bite_{obj} * Pat.$$

The operation of binding, denoted by "*", takes two vectors and produces another vector, often called a chunk of a sentence. Ideally, the resulting vector should not be similar to the original vectors, but it should have the same dimension as them. Superposition, denoted by "+", is an operation that takes any number of vectors and creates another one that is similar to the originals; usually, the superposed vectors are themselves results of the binding operation. To decode information, we use the operation of unbinding, the inverse of binding (an exact inverse or a pseudo-inverse), which enables us to extract information from a complex statement, provided that we have one of the bound vectors, or a very similar vector, as a cue. Denoting the unbinding operation by "♯", we obtain the following answer to the question "Who bit Pat?":

$$(bite_{agt} * Fido + bite_{obj} * Pat) \;\sharp\; bite_{agt} = Fido'.$$

We cannot expect the resulting vector Fido' to be an exact copy of Fido: even an optimal scheme will generate a considerable amount of noise.
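Before turning to concrete models, the three operations can be illustrated in a few lines of code. The sketch below is ours and is not one of the models compared in this paper: it binds random ±1 vectors by componentwise multiplication, which for such vectors is its own exact inverse, so unbinding reuses the same operation; all identifiers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024

# atomic objects: random +/-1 vectors; componentwise multiplication plays
# the role of "*" and, for +/-1 vectors, is its own exact inverse
bite_agt, bite_obj, Fido, Pat = rng.choice([-1.0, 1.0], size=(4, n))

trace = bite_agt * Fido + bite_obj * Pat   # encode "Fido bit Pat"
Fido_prime = trace * bite_agt              # unbind with the cue bite_agt

def similarity(u, v):
    """Cosine similarity, one possible measure for a clean-up memory."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Fido' is noisy, yet far more similar to Fido than to anything else stored
print(similarity(Fido_prime, Fido), similarity(Fido_prime, Pat))
```

In expectation the first similarity is about $1/\sqrt{2} \approx 0.7$ while the second is near 0, which is exactly the gap a clean-up memory relies on.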
Since noisy decoded information will never be identical to what was encoded, we have to rely heavily on similarity measures; available measures differ mostly in computation time and accuracy. A clean-up memory is an auto-associative collection of all atomic objects and complex statements produced by the system. Given a noisy extracted vector, such a structure must be able to recall the most similar item stored, or to indicate that no matching object has been found.

Irrespective of the mathematical model, the above operations are defined in a way that allows one to build complex statements, such as "John saw Fido bit Pat":

$$John * see_{agt} + (bite_{agt} * Fido + bite_{obj} * Pat) * see_{obj}.$$

Independently of the scheme considered, any representation should possess the following qualities:

- composition and decomposition: the rules of composition and decomposition must be applicable to all elements of the domain, irrespective of how complicated a given element is. Further, decomposition should support structure-sensitive processing.
- fixed size: structures of different degrees of complication should take up the same amount of space, in order to facilitate generalization. In the GAc model this feature has been given up; still, structures of different complexity are of the same type.
- similarity: the representation scheme should provide a quick way to compute the similarity of analogous structures (e.g. "Fido bit Pat Smith" and "Fido bit John").
- noise reduction: decomposed statements should resemble their original counterparts.
- productivity: the model should be able to construct complex nested structures using only a small set of rules.

As far as previously developed models are concerned, Holographic Reduced Representations (HRR), Binary Spatter Codes (BSC) and Associative-Projective Neural Networks (APNN) are distributed representations of cognitive structures in which the binding of role and filler codevectors maintains a predetermined data size. In HRR [23] binding is performed by means of the circular convolution

$$(x \circledast y)_j = \sum_{k=0}^{n-1} x_k y_{(j-k) \bmod n}$$

of real n-tuples or, in the "frequency domain", by componentwise multiplication of (complex) n-tuples:

$$(x_1, \dots, x_n) \circledast (y_1, \dots, y_n) = (x_1 y_1, \dots, x_n y_n).$$

Bound n-tuples are superposed by addition, and unbinding is performed by an approximate inverse. A dual formalism, in which real data are bound by componentwise multiplication, was discussed by Gayler [9].
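For concreteness, here is a minimal HRR-style sketch, assuming random n-tuples of expected unit norm and using Plate's involution as the approximate inverse; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512

def bind(x, y):
    """Circular convolution, computed in O(n log n) via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def involution(x):
    """Plate's approximate inverse: x*[j] = x[(n - j) mod n]."""
    return np.concatenate(([x[0]], x[1:][::-1]))

def unbind(trace, cue):
    return bind(trace, involution(cue))

# elements drawn from N(0, 1/n), so vector norms are close to 1
bite_agt, bite_obj, Fido, Pat = rng.normal(0.0, 1.0 / np.sqrt(n), (4, n))
trace = bind(bite_agt, Fido) + bind(bite_obj, Pat)
Fido_prime = unbind(trace, bite_agt)
print(Fido_prime @ Fido, Fido_prime @ Pat)  # the first value is clearly larger
```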
In BSC [14, 15] one works with binary n-tuples, bound by componentwise addition mod 2,

$$(x_1, \dots, x_n) \oplus (y_1, \dots, y_n) = (x_1 \oplus y_1, \dots, x_n \oplus y_n), \qquad x_j \oplus y_j = x_j + y_j \bmod 2, \qquad (1)$$

and superposed by pointwise majority-rule addition; unbinding is performed by the same operation as binding. APNN, introduced and further developed by Kussul [16] and his collaborators [17], employ binding and superposition realized by context-dependent thinning and bitwise disjunction, respectively. As opposed to HRR and BSC, APNN do not require an unbinding procedure to retrieve component codevectors from their bindings. A detailed comparison of HRR, BSC and APNN can be found in [24].

2 Geometric Algebra

One often reads that the above models represent data by vectors, which is not exactly true. Given two vectors, one does not know how to perform, say, their convolution or componentwise multiplication, since the result depends on the basis that defines the components. The basis must be fixed in advance, since otherwise all of the above operations become ambiguous. It follows that none of the above reduced representations can be given a true and meaningful geometric interpretation. Geometric analogues of HRR [5] can be constructed if one defines binding by the geometric product, a notion introduced in the 19th-century works of Grassmann [11] and Clifford [8]. The fact that a geometric analogue of HRR is intrinsically geometric may be important for various conceptual reasons; for example, the rules of geometric algebra may be regarded as a mathematical formalization of the process of understanding geometry. The use of geometric algebra distributed representations has also been inspired by the well-known fact that most people think in pictures, i.e. in two- and three-dimensional shapes, not in sequences of ones and zeroes. Mere strings of bits are not meaningful to (most) humans, no matter how technically advanced they are.

In order to grasp the main ideas behind a geometric analogue of HRR, let us consider an orthonormal basis $b_1, \dots, b_n$ in some n-dimensional Euclidean space and two vectors $x = \sum_{k=1}^{n} x_k b_k$ and $y = \sum_{k=1}^{n} y_k b_k$. The scalar $x \cdot y = y \cdot x$ is known as the inner product. The bivector $x \wedge y = -y \wedge x$ is the outer product and may be regarded as an oriented plane segment (alternative interpretations are also possible, cf. [7]). Writing 1 for the identity of the algebra, the geometric product of x and y then reads

$$xy = x \cdot y + x \wedge y = \sum_{k=1}^{n} x_k y_k \, 1 + \sum_{k<l} (x_k y_l - y_k x_l) \, b_k b_l.$$

Its noncommutativity is perhaps the most important difference between geometric and convolution algebras. Geometric products of different basis vectors,

$$b_{k_1 \dots k_j} = b_{k_1} \cdots b_{k_j}, \qquad k_1 < \cdots < k_j,$$

are called basis blades (or just blades). In n-dimensional Euclidean space there are $2^n$ different blades. This can be seen as follows. Let $\{x_1, \dots, x_n\}$ be a sequence of bits. Blades in an n-dimensional space can be written as

$$c_{x_1 \dots x_n} = b_1^{x_1} \cdots b_n^{x_n},$$

where $b_k^0 = 1$, which shows that blades are in a one-to-one relation with n-bit numbers. A general multivector is a linear combination of blades.

Multivectors produced by role-filler binding have a definite structure: a multivector of rank r contains only k-blades with k of the same parity as r, so that, for instance, a multivector of rank 4 would have $\binom{n}{0}$ scalars, $\binom{n}{2}$ bivectors and $\binom{n}{4}$ 4-blades. The number of k-blades in a multivector of rank r is given in Table 1. It becomes clear that a multivector of rank r over $\mathbb{R}^n$ is actually a vector over a $\sum_{i=0}^{\lfloor r/2 \rfloor} \binom{n}{2i + (r \bmod 2)}$-dimensional space.

Table 1: Numbers of k-blades in multivectors of various ranks in $\mathbb{R}^n$

rank   | scalars        | vectors        | bivectors      | trivectors     | 4-blades       | ... | size
1      | 0              | $\binom{n}{1}$ | 0              | 0              | 0              | ... | $\binom{n}{1}$
2      | $\binom{n}{0}$ | 0              | $\binom{n}{2}$ | 0              | 0              | ... | $\binom{n}{0} + \binom{n}{2}$
3      | 0              | $\binom{n}{1}$ | 0              | $\binom{n}{3}$ | 0              | ... | $\binom{n}{1} + \binom{n}{3}$
4      | $\binom{n}{0}$ | 0              | $\binom{n}{2}$ | 0              | $\binom{n}{4}$ | ... | $\binom{n}{0} + \binom{n}{2} + \binom{n}{4}$
...
2r     | $\binom{n}{0}$ | 0              | $\binom{n}{2}$ | 0              | $\binom{n}{4}$ | ... | $\sum_{i=0}^{r} \binom{n}{2i}$
2r + 1 | 0              | $\binom{n}{1}$ | 0              | $\binom{n}{3}$ | 0              | ... | $\sum_{i=0}^{r} \binom{n}{2i+1}$
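The one-to-one relation between blades and n-bit numbers suggests a compact implementation: store a multivector as a map from bitmasks to coefficients, multiply basis blades by XOR-ing their bitmasks, and obtain the sign by counting the transpositions needed to bring the product into canonical order. The sketch below is a standard construction, not code from the paper, and assumes a Euclidean metric (every basis vector squares to +1).

```python
def reordering_sign(a: int, b: int) -> int:
    """Sign from permuting the basis vectors of blade `a` past those of
    blade `b` into canonical (ascending) order; bit i stands for b_{i+1}."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps & 1 else 1

def geometric_product(X: dict, Y: dict) -> dict:
    """Geometric product of multivectors stored as {bitmask: coefficient}.
    Shared basis vectors annihilate (b_k b_k = 1), hence the XOR."""
    out = {}
    for a, x in X.items():
        for b, y in Y.items():
            blade = a ^ b
            out[blade] = out.get(blade, 0.0) + reordering_sign(a, b) * x * y
    return out

# b1 b2 = b12, while b2 b1 = -b12: the product is noncommutative
b1, b2 = {0b001: 1.0}, {0b010: 1.0}
print(geometric_product(b1, b2), geometric_product(b2, b1))
```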
As an example, let us consider the following roles and fillers, all being normalized vectors drawn randomly from $\mathbb{R}^n$ with Gaussian distribution N(0,1):

name = $\{a_1, \dots, a_n\}$,  sex = $\{b_1, \dots, b_n\}$,  age = $\{c_1, \dots, c_n\}$,
Pat = $\{x_1, \dots, x_n\}$,  male = $\{y_1, \dots, y_n\}$,  66 = $\{z_1, \dots, z_n\}$.

PSmith, who is a 66-year-old male named Pat, is created by first multiplying roles and fillers with the help of the geometric product:

$$PSmith = name * Pat + sex * male + age * 66 = name \cdot Pat + name \wedge Pat + sex \cdot male + sex \wedge male + age \cdot 66 + age \wedge 66$$

$$= \begin{bmatrix} \sum_{i=1}^{n} (a_i x_i + b_i y_i + c_i z_i) \\ a_1 x_2 - a_2 x_1 + b_1 y_2 - b_2 y_1 + c_1 z_2 - c_2 z_1 \\ a_1 x_3 - a_3 x_1 + b_1 y_3 - b_3 y_1 + c_1 z_3 - c_3 z_1 \\ \vdots \\ a_{n-1} x_n - a_n x_{n-1} + b_{n-1} y_n - b_n y_{n-1} + c_{n-1} z_n - c_n z_{n-1} \end{bmatrix} = [d_0, d_{12}, d_{13}, \dots, d_{(n-1)n}]^T$$

$$= d_0 + d_{12} e_{12} + d_{13} e_{13} + \cdots + d_{(n-1)n} e_{(n-1)n},$$

where $e_1, \dots, e_n$ are orthonormal basis vectors and $e_{kl} = e_k e_l$ are the basis bivectors. In order to be decoded as accurately as possible, PSmith should have the same magnitude as the vectors representing atomic objects; it therefore needs to be normalized. Finally, PSmith takes the form

$$PSmith = [d_0, d_{12}, d_{13}, \dots, d_{(n-1)n}]^T, \qquad d_i \mapsto \frac{d_i}{\sqrt{\sum_j d_j^2}}, \quad i, j \in \{0, 12, 13, \dots, (n-1)n\}.$$

PSmith is now a multivector of rank 2. The decoding operation

$$name^+ PSmith = name^+ (name \cdot Pat + name \wedge Pat + sex \cdot male + sex \wedge male + age \cdot 66 + age \wedge 66)$$

will produce a multivector of rank 3, consisting of vectors and trivectors. However, the original Pat did not contain any trivector components; they all belong to the noise part, and the only interesting blades in $name^+ PSmith$ are vectors. The expected answer is a vector, so there is no point in calculating the whole multivector $name^+ PSmith$ and only then comparing it with the items stored in the clean-up memory. To be efficient, one should generate only the vector part while computing $name^+ PSmith$ and skip the noisy trivectors. Let $\langle \cdot \rangle_k$ denote the projection of a multivector onto k-blades. To decode PSmith's name we need to compute

$$\langle name^+ PSmith \rangle_1 = name^+ \, name \cdot Pat + \langle name^+ (name \wedge Pat + sex \cdot male + sex \wedge male + age \cdot 66 + age \wedge 66) \rangle_1 = Pat + \text{noise} = Pat'.$$

The resulting Pat' will still be noisy, but to a lesser degree than it would have been had the trivectors been present. Formally, we are using a map $*_{1,2}$ that transforms a multivector of rank 1 (i.e. an n-tuple) and a multivector of rank 2 (i.e. a $(1 + \binom{n}{2})$-tuple) into a multivector of rank 1, without computing the unnecessary blades. Let X be a multivector of rank 2,

$$X = \langle X \rangle_0 + \langle X \rangle_2 = x_0 + \sum_{l<m} x_{lm} e_l e_m.$$
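A minimal numerical sketch of this decoding strategy is given below, under the assumptions of the example above: roles are unit vectors, "+" is taken to be the vector inverse (so $name^+ = name$ for a unit role), and $\langle X \rangle_2$ is stored as an antisymmetric matrix. All identifiers are ours, and the filler 66 is renamed `age66`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

def unit(v):
    return v / np.linalg.norm(v)

# roles and fillers: normalized Gaussian vectors in R^n
name, sex, age = (unit(rng.standard_normal(n)) for _ in range(3))
Pat, male, age66 = (unit(rng.standard_normal(n)) for _ in range(3))

def bind(role, filler):
    """Geometric product of two vectors: a scalar (inner product) plus a
    bivector (outer product), the latter kept as an antisymmetric matrix."""
    s = role @ filler
    B = np.outer(role, filler) - np.outer(filler, role)
    return s, B

def superpose(*chunks):
    """Add chunks and normalize the rank-2 multivector so that it has the
    same magnitude as an atomic object (d_0 plus the upper-triangle d_lm)."""
    s = sum(c[0] for c in chunks)
    B = sum(c[1] for c in chunks)
    norm = np.sqrt(s ** 2 + np.sum(np.triu(B, 1) ** 2))
    return s / norm, B / norm

def unbind_vector_part(role, X):
    """The map *_{1,2}: only the grade-1 part of role^+ X is computed;
    the noisy trivector part is never generated."""
    s, B = X
    return s * role + role @ B

PSmith = superpose(bind(name, Pat), bind(sex, male), bind(age, age66))
Pat_prime = unbind_vector_part(name, PSmith)
# clean-up memory lookup: Pat wins by a wide margin
print({f: round(float(Pat_prime @ v), 3)
       for f, v in [("Pat", Pat), ("male", male), ("66", age66)]})
```

The identity used here is that for a vector $v$ and a rank-2 multivector $X = x_0 + B$, the grade-1 part of $vX$ equals $x_0 v + v \lrcorner B$, and with $B$ stored as an antisymmetric matrix the contraction $v \lrcorner B$ is just the matrix-vector product `role @ B`.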