Experiments on Preserving Pieces of Information in a Given Order in Holographic Reduced Representations and the Continuous Geometric Algebra Model

Agnieszka Patyk-Lonska
Gdansk University of Technology, ul. Narutowicza 11/12, Gdansk 80-233, Poland
patyk@pg.gda.pl, http://www.pg.gda.pl/~patyk

Keywords: distributed representations of data, geometric algebra, HRR, BSC, word order, trajectory associations, bag of words

Received: October 23, 2011

Geometric Analogues of Holographic Reduced Representations (GAc, the continuous version of the previously developed discrete GA model) employ role-filler binding based on geometric products. Atomic objects are real-valued vectors in n-dimensional Euclidean space and complex statements belong to a hierarchy of multivectors. The property of GAc and HRR studied here is the ability to store pieces of information in a given order by means of trajectory association. We describe the results of three experiments: finding the correct item or the correct place of an item in a sequence, and finding the alignment of items in a sequence without precise knowledge of the trajectory vectors.

Povzetek: The paper studies the preservation of information order in a holographic form of data storage.

This paper is based on: A. Patyk-Lonska, Preserving pieces of information in a given order in HRR and GAc, published in the proceedings of the 1st International Workshop on Advances in Semantic Information Retrieval (part of the FedCSIS 2011 conference).

1 Introduction

The work presented here is a result of experimenting with the trajectory association technique using the newly developed distributed representation model GAc [20]. An ideal distributed representation system needs to meet several criteria in order to successfully perform cognitive tasks. These include computational efficiency, noise tolerance, scaling and the ability to represent complex structures. The most widely used definition of a distributed representation of data is due to Hinton et al. [13]: in a distributed representation of data each concept is represented over a number of units and each unit participates in the representation of some number of concepts. The size of a distributed representation is usually fixed and the units have either binary or continuous-space values. In most distributed representations only the overall pattern of activated units has a meaning. Such patterns of activity are hard to understand and interpret, therefore they are often compared to greyscale images. Distributed representations usually take the form of one-dimensional vectors, while greyscale images are two-dimensional matrices, but the way the pixels are aligned (one-dimensional string or two-dimensional array) is of no relevance. Since the information is distributed over the elements of a vector, a great percentage of units ("pixels") can be changed without making the vector (the overall "picture") unrecognizable.

Let us consider an example of storing the following information: "Fido bit Pat". The action in this statement is bite and the features (i.e. roles) of this action are an agent and an object, denoted bite_agt and bite_obj, while their fillers are Fido and Pat respectively. If we consider storing the way the action is performed, we can add a third feature (role), e.g. bite_way. If we store Fido, Pat, bite_agt and bite_obj as vectors, we are able to encode "Fido bit Pat" as bite_agt * Fido + bite_obj * Pat.
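As a concrete illustration of this role-filler encoding, the following sketch uses the HRR realisation of the two operations defined in the next paragraph — binding as circular convolution and superposition as vector addition. It is a minimal example only: the vector size, random seed and clean-up comparison are illustrative choices, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1024  # illustrative vector size

    def vec():
        # Random HRR vector: elements drawn from N(0, 1/n), expected norm 1.
        return rng.normal(0.0, 1.0 / np.sqrt(n), n)

    def bind(a, b):
        # Binding "*" realised as circular convolution (computed via FFT).
        return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

    def inv(a):
        # Approximate inverse (involution) used to undo a binding.
        return np.concatenate(([a[0]], a[1:][::-1]))

    # Roles and fillers for "Fido bit Pat".
    bite_agt, bite_obj, Fido, Pat = vec(), vec(), vec(), vec()

    # Superposition "+" of two role-filler bindings encodes the whole statement.
    sentence = bind(bite_agt, Fido) + bind(bite_obj, Pat)

    # Decoding "who bit?": unbinding the agent role yields a noisy copy of Fido,
    # which is recognised by comparison with the known item vectors (clean-up).
    noisy = bind(inv(bite_agt), sentence)
    for name, v in [("Fido", Fido), ("Pat", Pat)]:
        sim = np.dot(noisy, v) / (np.linalg.norm(noisy) * np.linalg.norm(v))
        print(name, round(sim, 2))  # Fido should score markedly higher than Pat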
The operation of binding, denoted by "*", takes two vectors and produces another vector, often called a chunk of a sentence. It would be ideal for the resulting vector not to be similar to the original vectors but to have the same dimensions as the original vectors. Superposition, denoted by "+", is an operation that takes any number of vectors and creates another one that is similar to the original vectors. Usually, the superimposed vectors are already the result of the binding operation. For more details and examples on distributed representations of data the reader should refer to [20].

2 Preserving Pieces of Information in a Given Order

While some solutions to the problem of preserving pieces of information in a given order have proved ingenious, others are obviously flawed. Let us consider the representation of the word eye — it has three letters, one of which occurs twice. The worst possible choice of binding and superposition would be to store quantities of letters, e.g.

    eye = twice * e + once * y,

since we would not be able to distinguish eye from eey or yee. Another ambiguous representation would be to remember the neighbourhood of each letter:

    eye = before_y * e + between_e * y + after_y * e.

Unfortunately, such a method of encoding causes the words eye and eyeye to have the same representation:

    eyeye = before_y * e + 2 · between_e * y + (before_y + after_y) * e + after_y * e
          = 2 (before_y * e + between_e * y + after_y * e) = 2 eye.

Real-valued vectors are normalized in most distributed representation models, therefore the factor of 2 would most likely be lost in translation. Such contextual roles (Smolensky [25]) cause problems when dealing with certain types of palindromes. Remembering the positions of letters is also not a good solution:

    eye = letter_first * e + letter_second * y + letter_third * e,

as we need to redundantly repeat the first letter as the third letter, otherwise we could not distinguish eye from ey or ye. Secondly, this method of encoding will not detect the similarity between eye and yeye.

Pike argues in [21] that matrix-based memory is multidirectional, i.e. it allows both forward and backward association — having two vectors a and b and their binding M = ab, we can extract both a and b by performing a reverse operation on the appropriate side of the matrix. Convolution-correlation systems, on the other hand, regard the bindings a ⊛ b and b ⊛ a as identical. We will use a similar technique, asking right-hand-side and left-hand-side questions during the experiments described in the following sections.

A quantum-like attempt to tackle the problem of information ordering was made in [1] — a version of semantic analysis, reformulated in terms of a Hilbert-space problem, is compared with structures known from quantum mechanics. In particular, an LSA matrix representation ([1, 10]) is rewritten by means of quantum notation. Geometric algebra has also been used extensively in quantum mechanics ([2, 4, 3]), and so there seems to be a natural connection between LSA and GAc, which is the ground for future work on the problem of preserving pieces of information in a given order.

As far as convolutions are concerned, the most interesting approach to remembering information in a given order has been described in [12]. The authors present a model that builds a holographic lexicon representing both word meaning and word order from unsupervised experience with natural language texts, comprising altogether 90000 words.
This model uses simple convolution and superposition to construct n-grams recording the frequency of occurrence of every possible word sequence that is encountered; a window of about seven words around the target word is usually taken into consideration. To predict a word in a completely new sentence, the model looks up the frequency with which the potential target is surrounded by the words present in the new sentence. To be useful, n-gram models need to be trained on massive amounts of text and therefore require extensive storage space.

We will use a completely different approach to remembering information order — trajectory association, described by Plate in [23]. Originally, this technique also used convolution and correlation, but this time the items stored in a sequence are actually superimposed, rather than being bound together.

3 Trajectory Association

In the HRR model vectors are normalized and can therefore be regarded as radii of a sphere of radius 1. If we attach a sequence of items, say A, B, C, D, E, to the arrowheads of five of those vectors, we obtain a certain trajectory associated with the sequence ABCDE. This is a geometric analogue of the "method of loci", which instructs us to remember a list of items by associating each item with a distinctive location along a familiar path. Let k be a randomly chosen HRR vector and let

    k^i = k ⊛ k^{i-1} = k^{i-1} ⊛ k  for i > 1,  with k^1 = k,

be its i-th (convolutive) power. The sequence ABCDE is then stored as

    S_ABCDE = A ⊛ k + B ⊛ k^2 + C ⊛ k^3 + D ⊛ k^4 + E ⊛ k^5.

Of course, each power of k needs to be normalized before being bound with a sequence item. Otherwise, every subsequent power of k would be larger or smaller than its predecessor and, as a result, every subsequent item stored in the sequence would have a bigger or a smaller share in the vector S_ABCDE.

Obviously, this method cannot be applied to the discrete GA model or to BSC, since it is impossible to obtain more than two distinct powers of a vector with the use of XOR as a means of binding. This technique also has a few obvious advantages in HRR that would be absent in GAc had we wished to use ordinary vectors as first powers — different powers of a vector k would then be multivectors of different ranks. While k^i and k^{i±1} are very similar in HRR, in GAc they would not even share the same blades. Further, the similarity of k^i and k^{i+m} in HRR is the same as the similarity of k^j and k^{j+m}, whereas in GAc that similarity would depend on the parity of i and j. In the light of these shortcomings, we need to use another structure acting as a first power in order to make trajectories work in GAc. Let t be a random normalized full multivector over R^n and let us define the powers of t in the following way:

    t^1 = t,  t^i = (t^{i-1}) t  for i > 1.

We will store vectors a_1, ..., a_l in a sequence S_{a_1...a_l} using powers of the multivector t:

    S_{a_1...a_l} = a_1 t + a_2 t^2 + ... + a_l t^l.

To answer the question "What is the second item in the sequence?" in GAc we need to use the projected product

    ⟨S_{a_1...a_l} (t^2)^+⟩_1 ≈ a_2,

and to find out the place of an item a_i we need to compute

    ⟨(a_i)^+ S_{a_1...a_l}⟩ ≈ t^i.
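The sketch below illustrates trajectory association and the two retrieval questions for the HRR variant only; the GAc version would require a geometric-product implementation that is beyond this illustration. It assumes circular-convolution binding, normalizes each power of the trajectory vector before use, and the dimension, seed and clean-up sets are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1024  # illustrative dimension

    def vec():
        return rng.normal(0.0, 1.0 / np.sqrt(n), n)

    def bind(a, b):        # binding by circular convolution
        return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

    def inv(a):            # approximate inverse, used for decoding
        return np.concatenate(([a[0]], a[1:][::-1]))

    def normalize(a):
        return a / np.linalg.norm(a)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    letters = {name: vec() for name in "ABCDE"}

    # Powers of the trajectory vector k, each one normalized before use.
    k = vec()
    powers = [normalize(k)]
    for _ in range(4):
        powers.append(normalize(bind(powers[-1], k)))

    # S_ABCDE = A (*) k + B (*) k^2 + ... + E (*) k^5
    S = sum(bind(letters[name], p) for name, p in zip("ABCDE", powers))

    # "What is the second item?" -> S (*) (k^2)* should be most similar to B.
    noisy_item = bind(S, inv(powers[1]))
    print(max(letters, key=lambda name: cosine(noisy_item, letters[name])))

    # "Where is D?" -> S (*) D* should be most similar to k^4.
    noisy_place = bind(S, inv(letters["D"]))
    best = max(range(5), key=lambda i: cosine(noisy_place, powers[i]))
    print("position", best + 1)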
Some may argue that such an encoding puts a demand on items in the clean-up memory to hold information on whether they are roles or fillers, which is dangerously close to employing the fixed data slots present in localist architectures. Actually, elements of a sequence can be recognized by their size, which is relatively smaller than the size of the multivector t and its powers.

We present three experiments using trajectory association and comment on the test results for the HRR and GAc models. Firstly, we studied whether an item can be retrieved given a sequence and an appropriate power of t, and vice versa — whether a sequence and an item can lead to the power of t associated with that item. Finally, we tested whether both the HRR and GAc models can find the alignment of items in a sequence without precise knowledge of the vector t or its powers. Since normalization using the square root of the number of chunks proved very noisy in statements containing powers of the trajectory vector, we decided to improve the HRR model: the HRR vectors in our tests were normalized by dividing them by their magnitude.

4 Correct Place Detection

In this experiment we investigated whether powers of a (multi)vector t carry enough information about the original t. During 1000 tests for (multi)vector sizes ranging from 2^4 to 2^8 we asked the following question for every sequence S (a permutation of the letters {A, B, C, D, E}): "Where is A?"

    S ♯ A = { S ⊛ A* in HRR;  ⟨A^+ S⟩ in GAc } ≈ t^x.

This amounted to 120000 questions altogether. For the purpose of this experiment we used a clean-up memory consisting of the powers of t only. Ideally, every position of the letter A should come up as the correct answer the same number of times. However, since high powers of t acquire noise, lower powers should be recognized more often as the correct answer. Indeed, Figure 1 shows that in HRR the frequencies of the powers of t align, with t being recognized most often and t^5 being recognized least often. In GAc the percentage diagrams for the various powers of t lie close to each other and often intertwine, yet the relationship between the powers of t is similar to the one observed in HRR.

Since lower powers of t are recognized correctly more often, higher powers of t come up more often as the incorrect answer to S ♯ A. Vector t^3 is the correct answer to S_{••A••} ♯ A. However, if t^3 is not recognized, the next most similar answer will be t^5, because it contains three "copies" of t^3, indicated here by brackets: [t ⊛ t ⊛ t] ⊛ t ⊛ t, t ⊛ [t ⊛ t ⊛ t] ⊛ t and t ⊛ t ⊛ [t ⊛ t ⊛ t]. The second most similar item will be t^4, because it contains two such "copies", and so on. The item least similar to t^3 will be t. This relationship should be best observable in HRR, since the powers can be multiplied from either side. In GAc the powers of t can be increased from one side only and the relation between them should be less visible. Figure 2 shows that high powers of t are recognized more often in cases when the proper answer is not recognized — we will use this relationship in the experiment described in Section 6.

Figure 1: Correct recognition of S ♯ A ≈ t^x in HRR and GAc using clean-up memory of {t, t^2, t^3, t^4, t^5}, 1000 trials. (Panels: "[%] of t^x recognized correctly" for HRR and GAc, (multi)vector sizes 2^4 to 2^8.)

Figure 2: Incorrect recognition of S ♯ A as other powers of t in HRR and GAc using clean-up memory of {t, t^2, t^3, t^4, t^5}, 1000 trials. (Panels: "[%] of t^x recognized incorrectly as other powers of t" for HRR and GAc, (multi)vector sizes 2^4 to 2^8.)
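A minimal Monte-Carlo sketch of this experiment's protocol, again for the HRR case only, might look as follows; the trial count, vector sizes and the plain dot-product clean-up are illustrative choices and no particular recognition rates are implied.

    import numpy as np

    rng = np.random.default_rng(2)

    def vec(n):
        return rng.normal(0.0, 1.0 / np.sqrt(n), n)

    def bind(a, b):        # binding by circular convolution
        return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

    def inv(a):            # approximate inverse, used for decoding
        return np.concatenate(([a[0]], a[1:][::-1]))

    def normalize(a):
        return a / np.linalg.norm(a)

    def where_is_A(n, trials=200):
        """Ask "Where is A?" for random permutation sequences, clean the answer
        up against the powers of t only, and report the fraction answered
        correctly together with how often each power was chosen."""
        chosen = np.zeros(5, dtype=int)          # chosen[i]: t^(i+1) was the answer
        correct = 0
        for _ in range(trials):
            letters = {c: vec(n) for c in "ABCDE"}
            t = vec(n)
            powers = [normalize(t)]
            for _ in range(4):
                powers.append(normalize(bind(powers[-1], t)))
            order = list("ABCDE")
            rng.shuffle(order)                   # a random permutation sequence
            S = sum(bind(letters[c], p) for c, p in zip(order, powers))
            true_pos = order.index("A")          # 0-based position of A
            noisy = bind(S, inv(letters["A"]))   # S # A in HRR
            answer = int(np.argmax([np.dot(noisy, p) for p in powers]))
            chosen[answer] += 1
            correct += (answer == true_pos)
        return correct / trials, chosen

    for size in (16, 32, 64, 128, 256):          # 2^4 ... 2^8, as in the experiment
        acc, chosen = where_is_A(size)
        print(size, round(acc, 2), chosen.tolist())

Tallying, for the trials in which the answer is wrong, which power was chosen instead would reproduce the kind of statistics shown in Figure 2.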
5 Correct Item Detection

Here we tested whether trajectory association allows us to ask "What is the x-th item in the sequence?":

    S ♯ t^x = { S ⊛ (t^x)* in HRR;  ⟨S (t^x)^+⟩_1 in GAc } ≈ L_x,

where L_x ∈ {A, B, C, D, E} denotes the x-th letter in a sequence. During 1000 tests for (multi)vector sizes ranging from 2^5 to 2^9 we asked that question for every permutation sequence of the set {A, B, C, D, E}; there were 120000 questions altogether for every (multi)vector t^x. Again, we tested both the HRR and GAc models using a clean-up memory consisting only of the expected answers, i.e. the letters {A, B, C, D, E}. The results for both models (Figure 3) were similar, with GAc performing slightly better than HRR. In both models the first few letters of a sequence were recognized correctly more often than the last letters. Among the erroneously recognized letters, the last few letters of a sequence were most often offered as the most probable answer, which will come in handy in the next section. The diagrams for GAc lie closer together, once again indicating that trajectory association spreads information more evenly in GAc than in HRR.

Figure 3: Recognition of S_{L1L2L3L4L5} ♯ t^x ≈ L_x in HRR and GAc using a clean-up memory containing letters only, 1000 trials. (Panels: "[%] of letters recognized incorrectly as other letters" for HRR and GAc, (multi)vector sizes 2^5 to 2^9.)

6 Item Alignment

The previous tests were not very demanding for trajectory association. Finally, we tested whether the HRR and GAc models were capable of performing the following task: given only a set of letters A, B, C, D, E and an encoded sequence S_••••• comprised of those five letters, find out the position of each letter in that sequence. We assumed that no direct access to t or its powers is given — they do belong to the clean-up memory, but cannot be retrieved "by name". One may think of this problem as a "black box" that takes randomly chosen letter vectors as input and in return outputs a (multi)vector representing always the same sequence, irrespective of the dimension of the data. Inside, the black box generates the (multi)vectors t, t^2, t^3, t^4, t^5. Their values are known to the observer but their names are not. Since we can distinguish letters from non-letters, the naive approach would be to try out all 120 alignments of the letters A, B, C, D and E using all possible combinations of non-letters as the powers of t. Unfortunately, the powers of t are different each time the black box produces a sequence.

We will use an algorithm based on the assumption that t^x, if not recognized correctly, is most similar to the highest powers of t, as shown in Section 4. The second assumption is that letters lying closer to the end of the sequence are often offered as the incorrect answer to questions concerning letters (Section 5). The clean-up memory C for this experiment consists of all five letters and the five powers of t. We will also use an auxiliary clean-up memory L containing letters only.
The algorithm for finding out the position of each letter begins with asking the question described by equation (1) — "Where in the sequence S_••••• is the letter L_x?":

    S_••••• ♯ L_x = { S_••••• ⊛ (L_x)* in HRR;  ⟨(L_x)^+ S_•••••⟩ in GAc } = (t^x)'        (1)

for each letter L_x ∈ L. Next, we need to find the item in the clean-up memory C \ L that is most similar to (t^x)'. Let us denote this item by z. With high probability, z is the power of t associated with the position of the letter L_x in the sequence S_•••••, although, if recognized incorrectly, z will most likely point to some other t^y with y > x. Now let us ask a second question, described by equation (2) — "Which letter is situated at the z-th position in the sequence S_•••••?":

    S_••••• ♯ z = { S_••••• ⊛ z* in HRR;  ⟨S_••••• z^+⟩_1 in GAc } = L' ≈ L_x.        (2)

We use the projected product in GAc because we are looking for a letter vector placed at the position indicated by z. In HRR the resulting L' should be compared with letters only. In most cases L' will point to the correct letter. However, in a small fraction of test results, L' will point to letters surrounding L_x, because z has been mistakenly decoded as t^y for some y ≠ x. Also, letters preceding L_x should come up less often than letters following L_x.
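A compact simulation of this two-step procedure, for the HRR variant only, is sketched below. It assumes circular-convolution binding and cosine-similarity clean-up; the vector length, trial count and the helper name faulty_answer_counts are illustrative, and the printed tallies only mirror the kind of statistics gathered for Figure 4 rather than reproducing its numbers.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(3)

    def vec(n):
        return rng.normal(0.0, 1.0 / np.sqrt(n), n)

    def bind(a, b):        # binding by circular convolution
        return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

    def inv(a):            # approximate inverse, used for decoding
        return np.concatenate(([a[0]], a[1:][::-1]))

    def normalize(a):
        return a / np.linalg.norm(a)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def faulty_answer_counts(n=100, trials=2000):
        """For each asked letter, count which letters come up as faulty answers."""
        faulty = {c: Counter() for c in "ABCDE"}
        for _ in range(trials):
            letters = {c: vec(n) for c in "ABCDE"}      # auxiliary memory L
            t = vec(n)
            powers = [normalize(t)]
            for _ in range(4):
                powers.append(normalize(bind(powers[-1], t)))
            S = sum(bind(letters[c], p) for c, p in zip("ABCDE", powers))
            for asked in "ABCDE":
                # Question (1): "Where in the sequence is the letter L_x?"
                t_noisy = bind(S, inv(letters[asked]))
                z = max(powers, key=lambda p: cosine(t_noisy, p))   # unnamed item z
                # Question (2): "Which letter sits at the position indicated by z?"
                L_noisy = bind(S, inv(z))
                answer = max(letters, key=lambda c: cosine(L_noisy, letters[c]))
                if answer != asked:                      # a faulty answer
                    faulty[asked][answer] += 1
        return faulty

    # Letters near the end of S_ABCDE should dominate the faulty answers,
    # which is the regularity the alignment is read off from (cf. Figure 4).
    for asked, counts in faulty_answer_counts().items():
        print(asked, dict(counts))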
Figure 4 presents the test results for HRR. The data in Figure 4 should be interpreted as follows: the first row of each table next to a graph contains the vector lengths of the data used in 5 consecutive experiments (10000 trials each). The second row contains the number of faulty answers within those 10000 trials. The next 4 rows present the percentage of occurrence of a "faulty" letter within all the faulty answers counted in the second row. Faulty alignments (i.e. those for which the percentages corresponding to letters do not align increasingly within a single column) have been marked with a "*" in the table headings.

Figure 4: Finding letter alignment in a sequence S_ABCDE in HRR, 10000 trials.

  Asking with A:
             N=100     N=200     N=300     N=400     N=500
    non-A     1643       960       616       506       393    (= 100%)
    B        19.90%    16.04%    12.66%     8.30%     6.62%
    C        25.56%    22.92%    22.73%    21.94%    18.58%
    D        26.78%    27.71%    28.73%    32.02%    30.03%
    E        27.75%    33.33%    35.88%    37.75%    44.78%

  Asking with B:
             N=100*    N=200*    N=300     N=400     N=500
    non-B     2474      1588      1133       837       628    (= 100%)
    A        18.03%    18.14%    12.71%     9.44%     8.28%
    C        25.14%    24.87%    22.51%    23.30%    21.97%
    D        29.59%    28.90%    30.98%    30.59%    30.89%
    E        27.24%    28.09%    33.80%    36.68%    38.85%

  Asking with C:
             N=100     N=200     N=300     N=400     N=500
    non-C     3003      2214      1694      1346      1121    (= 100%)
    A        22.51%    18.16%    18.00%    15.30%    13.83%
    B        24.18%    22.81%    21.55%    21.47%    20.07%
    D        25.51%    26.06%    27.86%    27.64%    28.55%
    E        27.81%    32.97%    32.59%    35.59%    37.56%

  Asking with D:
             N=100*    N=200     N=300     N=400     N=500
    non-D     3432      2686      2264      1808      1564    (= 100%)
    A        21.71%    22.23%    20.19%    17.64%    18.67%
    B        26.84%    24.72%    24.03%    25.77%    21.16%
    C        25.44%    25.58%    25.66%    26.05%    27.37%
    E        26.02%    27.48%    30.12%    30.53%    32.80%

  Asking with E:
             N=100*    N=200*    N=300     N=400*    N=500
    non-E     3754      3116      2617      2353      1921    (= 100%)
    A        24.77%    20.31%    19.79%    19.46%    18.84%
    B        25.12%    25.48%    23.42%    22.90%    23.27%
    C        26.08%    28.34%    27.59%    29.20%    28.37%
    D        24.03%    25.87%    29.19%    28.35%    29.52%

We used S_ABCDE as the mysterious encoded sequence S_•••••. In each case we crossed out the most frequently occurring letter and concentrated on the frequency of the remaining letters. In HRR, for sufficiently large vector sizes, the frequencies f_L of all letters L ∈ L aligned correctly f_B