https://doi.org/10.31449/inf.v47i3.4785  Informatica 47 (2023) 361–372

A Hybrid Deep Learning Approach to Keyword Spotting in Vietnamese Stele Images

Anna Scius-Bertrand (1), Marc Bui (2) and Andreas Fischer (1)
(1) University of Fribourg and HES-SO, Fribourg, Switzerland
(2) Ecole Pratique des Hautes Etudes, Paris, France
E-mail: anna.scius-bertrand@unifr.ch, marc.bui@ephe.psl.eu, andreas.fischer@unifr.ch

Keywords: hybrid deep learning, document image analysis, keyword spotting, annotation-free, Hausdorff edit distance, Vietnamese steles, Chu Nom

Received: April 5, 2023

In order to access the rich cultural heritage conveyed in Vietnamese steles, automatic reading of stone engravings would be a great support for historians, who are analyzing tens of thousands of stele images. Approaching this challenging problem with deep learning alone is difficult because data-driven models require large representative datasets with expert human annotations, which are not available for the steles and costly to obtain. In this article, we present a hybrid approach to spot keywords in stele images that combines data-driven deep learning with knowledge-based structural modeling and matching of Chu Nom characters. The main advantage of the proposed method is that it is annotation-free, i.e. no human data annotation is required. In an experimental evaluation, we demonstrate that keywords can be successfully spotted with a mean average precision of more than 70% when a single engraving style is considered.

Povzetek: A hybrid approach for keyword spotting in stele images is presented, which combines deep learning with structural modeling of Chu Nom characters. Keywords are successfully recognized with a mean average precision of more than 70%.

1 Introduction

Vietnamese steles are of great value for historians, as the stone engravings are a unique source of information to understand the social, economic, and belief structures in the villages. The Vietnamica project (https://vietnamica.hypotheses.org), in particular, aims to investigate pious donations from ordinary people offered to local shrines. For this purpose, about 40,000 digital stele images are studied, which may contain hundreds of Chu Nom characters on a single image. To cope with this vast amount of characters, automatic image analysis methods are needed that are able to transcribe the contents of the steles into machine-readable form for searching and browsing. Some examples of stele images are shown in Figure 1, highlighting significant differences in column layout and image quality across different steles.

Although the state of the art in handwriting recognition for historical documents has made great progress in the past decades, it remains a difficult problem and an active field of research [1]. Keyword spotting [2] has been proposed early on as an alternative to automatic transcription for difficult cases, where a full transcription is not feasible with high accuracy. The goal is to identify specific search terms of interest, either by providing a template image of the keyword (query-by-example) or by providing a textual representation of the search term (query-by-string).

Similar to developments in other related fields, such as computer vision and natural language processing, the different methods for keyword spotting can be divided into three groups, namely heuristic methods, machine learning based methods, and deep learning based methods.
These groups are roughly ordered in time, heuristic methods being the oldest, but they still coexist and new approaches are being developed in all three groups.

Heuristic methods incorporate domain knowledge about the handwriting and are able to match images directly, i.e. keyword images and images from the manuscript, in order to retrieve similar instances in a query-by-example scenario. Early examples include dynamic time warping methods based on contour features [3] and gradient features [4], as well as segmentation-free methods based on the scale-invariant feature transform (SIFT) [5]. More recently, a graph-based approach has been proposed in [6], which relies on a structural representation of the handwriting and uses an approximate graph edit distance to compare handwriting graphs.

Figure 1: Example stele images.

Machine learning methods pursue the paradigm of learning by example. They train keyword models with the help of learning samples, i.e. manually annotated handwriting images. In a first step, characteristic features are manually defined based on domain knowledge and, in a second step, different machine learning methods are used to learn keyword models based on the features. Examples of this group of keyword spotting methods include methods based on hidden Markov models (HMM) with geometric features [7] and bags of local features [8], as well as bidirectional long short-term memory networks (BLSTM) with geometric features [9]. After training keyword models, the user can perform a query-by-string search without the need to provide example images of the keyword.

Finally, deep learning methods are also based on the learning-by-example paradigm, but they do not require manually defined features. Instead, they aim at learning characteristic representations, so-called embedding spaces, automatically from the data. Images as well as textual representations can be embedded in the same space, such that both query-by-example and query-by-string can be realized. A prominent example is the PHOCNet [10], which learns an embedding space based on pyramidal histogram of characters (PHOC) representations.

Today, the best keyword spotting performance is achieved by means of deep learning methods. However, they require a considerable amount of manually annotated training samples. In the case of historical Vietnamese steles, such learning samples can only be provided by experts who are able to read the ancient Chu Nom script. It is thus time-consuming and costly to build a comprehensive training dataset that is representative of the heterogeneous collection of stele images (see Figure 1). At the time of writing, such a training set is not available for the 40,000 stele images.

In this article, we present a hybrid deep learning method for keyword spotting in historical Vietnamese stele images. It aims to combine deep learning with heuristic methods, such that the domain knowledge of the heuristic methods can compensate for the lack of annotated learning samples. Indeed, it is an annotation-free method that does not require any human annotations at all.

The proposed method can be applied directly to the original stele images and consists of two processing steps. First, characters are detected using deep neural networks that are trained on synthetic images with printed Chu Nom characters and then auto-calibrated to real stele images.
Secondly, the structure of the Chu Nom characters is modeled with a graph-based representation and matched with search terms using an approximate graph edit distance, in order to efficiently perform query-by-example keyword spotting. A comprehensive experimental evaluation is performed to measure the spotting performance.

The remainder of this article is structured as follows. Section 2 discusses related work on stele images, Section 3 provides more details about the content of the steles and the image acquisition, Section 4 presents the proposed keyword spotting method, and Section 5 details the experiments. Finally, conclusions are drawn in Section 6.

2 Related work

Initial work on the stele images has focused on the task of layout analysis with the aim to segment stele images into columns and characters. Such an initial segmentation is an important preprocessing step for character recognition. Notable work in this domain includes [11], where a heuristic method based on Voronoi diagrams is proposed to segment characters, and [12], where a deep learning approach based on semantic segmentation is pursued to detect columns with only a small number of human annotations. Recently, in [13] a deep learning method based on object detection networks has been introduced for character segmentation, which does not require human annotations and generalizes well to different layouts and engraving styles. In [14], a generative deep learning model has been suggested to create synthetic Chu Nom characters in different engraving styles. Recent work on Chu Nom also includes the U-Net based approach reported in [15], which was studied in the context of manuscripts.

Our method builds upon the character segmentation method of [13] and goes a step further to perform keyword spotting. For graph-based character modeling we rely on keypoint graphs, which have been studied for Latin scripts [6] and Chu Nom characters in manuscripts [16] before, but not for stone engravings. Important adaptations to the logographic writing system include the use of super-resolution to better model small strokes that distinguish similar Chu Nom characters. The graphs are efficiently matched using the Hausdorff edit distance [17], an approximation of graph edit distance that can be calculated in quadratic time with respect to the number of graph nodes. Efficient graph matching is especially important in the context of super-resolution, when Chu Nom graphs may contain over 100 nodes.

Preliminary results have already been published in a conference paper [18]. In this article, we provide a more detailed description of the hybrid deep learning method and significantly extend the experimental evaluation. Instead of considering only 8 steles, we conduct more comprehensive experiments on 20 stele images with manual ground truth. Furthermore, we study the important case of spotting keywords within the same style of engravings and compare it to a scenario with mixed styles. This study serves the purpose of better understanding the possibilities and limitations of the proposed method.

3 Dataset

The 40,000 stele images represent about 25,000 steles, i.e. man-sized stones with engravings, which were erected in Vietnamese villages between the 16th and the 20th century.
The majority of the steles record donations made by villagers to the local shrines and are engraved in the ancient Vietnamese Chu Nom writing system [19]. However, they can also contain information about finances, constructions, and demarcations, thus informing about the social, economic, and religious life of the villages. The steles were erected for all to see and were able to withstand adverse weather conditions and armed conflicts. Nevertheless, they may contain degradations, fissures, and impacts, which may render parts of the steles illegible (see Figure 1).

The images of the steles were obtained by the French School of the Far East (EFEO) and the Institute of Han Nom Studies by means of stampings [19]. A sheet of paper is pressed on the stone and fixed with a binder, e.g. banana juice. Then, ink is applied with a roller on the paper over the entire surface of the stele, such that engravings appear in white and the stone background, as well as characters written in relief, appear dark in the color of the ink. Finally, the paper is photographed to obtain digital stele images. Such pictures of the stampings contain more character details and are easier to read than pictures of the original steles.

In this article, we consider a research dataset of 20 stele images, available at https://github.com/asciusb/steles_kws_database. It encompasses all steles for which we have obtained ground truth information so far at the level of individual characters, i.e. bounding boxes around the characters as well as their machine-readable Chu Nom transcription in Unicode. Characters that are not readable are marked with a special symbol. In total, the dataset contains 5,138 characters, which corresponds to an average of about 257 characters per stele.

4 Hybrid deep learning

Figure 2 provides an overview of the proposed hybrid deep learning method for keyword spotting. At the core of the method is a deep learning model that is responsible for detecting the location of main text characters on the stele images. It is trained on synthetic data and auto-calibrated to real data. Afterwards, a structural representation and comparison with keyword templates is performed for spotting.

The training data used for supervised learning of the deep learning model does not originate from human annotations. Instead, human knowledge is used to design the synthetic training data and to perform the auto-calibration. Also, human knowledge is used to model the characters with a graph-based representation and to perform the structural matching with the keyword templates based on heuristic methods.

In the following, the individual components are described in more detail.

4.1 Character detection

The deep learning model is an object detection network with a you only look once (YOLO) [20] architecture, which has originally been introduced for detecting objects in natural scenes, e.g. pedestrians in the context of autonomous driving. When applied to the problem of character detection on stele images, one of the main differences is that a large number of small objects needs to be detected, rather than a small number of large objects. Therefore, it is important that the visual analysis is performed with a sufficiently high resolution, such that even small strokes of the logographic characters can be taken into account.
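As a point of reference, the following minimal sketch illustrates how a detector of this kind can be applied to a stele image once it has been trained and calibrated (Sections 4.2 and 4.3). It assumes the public ultralytics/yolov5 torch.hub interface and a hypothetical fine-tuned weight file stele_chars.pt; the file name and the confidence threshold are purely illustrative.

```python
# Minimal sketch (not the exact implementation): applying a fine-tuned YOLOv5 character
# detector to a stele image. The torch.hub interface of ultralytics/yolov5 and the weight
# file name "stele_chars.pt" are assumptions for illustration.
import torch

# Load a custom (fine-tuned) YOLOv5 model through torch.hub.
model = torch.hub.load("ultralytics/yolov5", "custom", path="stele_chars.pt")
model.conf = 0.25  # illustrative confidence threshold for keeping detections

# Run inference; "size" controls the resolution of the visual analysis, which must be
# large enough to resolve small strokes of the logographic characters.
results = model("stele_image.png", size=1024)

# Each detection is [x1, y1, x2, y2, confidence, class]; with a single binary
# "character" class, only the boxes and confidences are of interest.
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    print(f"character at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), score {conf:.2f}")
```

With a single binary "character" class, only the bounding boxes and confidence scores of the detections are used in the subsequent processing steps.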
Two initial preprocessing steps are applied to the original images:

– Rescaling the stele images to a uniform height of 1024 pixels, while keeping the same aspect ratio.

– Inverting the colors, such that the engraved characters appear in dark color rather than white.

The height has been chosen to ensure that the stele images fit into the GPU memory. The resulting images are typically smaller than the originals. Inverting the colors has been chosen with respect to improved readability and more convenient generation of synthetic data.

The specific network architecture used for character detection is YOLOv5 [21], which analyzes the image at multiple sufficiently large scales to detect small characters on large steles. The backbone of YOLOv5 is a cross stage partial network (CSPNet) [22], which extracts convolutional feature maps. The neck is a path aggregation network (PANet) [23], which performs a combination of feature maps at different scales. Finally, a dual head is used to perform both classification and bounding box regression on the combined feature maps. Theoretically, the dual head would allow us not only to localize the characters but also to classify them. However, practical attempts at classification have failed when considering thousands of different Chu Nom character classes. Instead, the classification head only performs a binary classification, whether or not a character is present, and the regression head predicts the extent of the character bounding box.

Figure 2: Overview of the hybrid deep learning method for keyword spotting: 1) initial training with synthetic data, 2) auto-calibration with real data, 3) character detection using the calibrated model, 4) structural representation with keypoint graphs, 5) structural comparison using the Hausdorff edit distance, 6) keyword spotting based on the distance score. Green boxes represent detected characters and yellow boxes are detected columns.

4.2 Synthetic training

The initial training of the character detection network is based on thousands of fully synthetic training steles (Figure 2, step 1), for which the ground truth annotations, i.e. bounding boxes around the characters, are generated alongside the synthetic images. The generation is guided by the following heuristic considerations:

– Color-inverted stele images contain dark character engravings on a gray stone background, surrounded by a black border.

– Characters are arranged in a column layout.

The data generation therefore proceeds as follows. First, a gray rectangle is drawn on a black background. Then, a Chu Nom font (the NomNaTongLight font available at http://www.nomfoundation.org) is used to write random text on the gray rectangle in a random number of columns. Finally, random noise is added to the synthetic images by means of translation, blur, changes in brightness, as well as salt and pepper noise, in order to avoid overfitting of the character detection network.
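A minimal sketch of this generation procedure is given below. It assumes the PIL and NumPy libraries and a local copy of the NomNaTongLight font; the layout constants, the use of random CJK codepoints as stand-ins for Chu Nom characters, and the noise parameters are illustrative choices rather than the exact settings of our generator.

```python
# Minimal sketch (illustrative, not the exact generator): drawing a synthetic stele with
# known character bounding boxes. Assumes PIL and NumPy plus a local NomNaTongLight.ttf;
# random CJK codepoints stand in for Chu Nom characters, and all constants are arbitrary.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def synthetic_stele(width=768, height=1024, font_path="NomNaTongLight.ttf"):
    img = Image.new("L", (width, height), color=0)               # black border / background
    draw = ImageDraw.Draw(img)
    margin = random.randint(40, 80)
    draw.rectangle([margin, margin, width - margin, height - margin], fill=128)  # gray stone

    char_size = random.randint(32, 48)
    font = ImageFont.truetype(font_path, char_size)
    boxes = []                                                   # ground-truth bounding boxes
    for c in range(random.randint(5, 12)):                       # random number of columns
        x = width - margin - (c + 1) * int(char_size * 1.4)      # columns run right to left
        if x < margin + 5:
            break
        y = margin + 10
        while y + char_size < height - margin:
            ch = chr(random.randint(0x4E00, 0x9FFF))             # stand-in CJK codepoint
            draw.text((x, y), ch, font=font, fill=40)            # dark character on gray stone
            boxes.append((x, y, x + char_size, y + char_size))
            y += int(char_size * 1.2)

    # Random noise against overfitting: brightness jitter and salt-and-pepper noise
    # (translation and blur are applied in the same spirit).
    arr = np.asarray(img, dtype=np.float32) + random.uniform(-20, 20)
    mask = np.random.rand(*arr.shape)
    arr[mask < 0.01] = 0
    arr[mask > 0.99] = 255
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)), boxes
```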
4.3 Auto-calibration

After an initial training on synthetic data, the network is applied and adapted to real data by means of auto-calibration (Figure 2, step 2), following the procedure introduced in [13]. The aim of the auto-calibration is to replace the generic gray rectangle and black border of the synthetic training data with real stele backgrounds, such that the network can improve the separation between stele background and character foreground.

The following heuristic considerations are taken into account for detecting the main text area:

– Main text characters have approximately the same size and are organized in columns.

– The main text area is rectangular.

The auto-calibration is illustrated in more detail in Figure 3. After printing random Chu Nom text on simple backgrounds to create fully synthetic stele images, an initial training of the character detection network is performed. Afterwards, the network is applied to real stele images and layout analysis is used to recognize the main text area. Layout analysis consists of the following steps. The median box is calculated to estimate the size of the main text characters. Characters that are either too small (e.g. parts of ornaments or parts of the background) or too large (e.g. characters of the title above the main text) are discarded. Afterwards, unsupervised clustering using the DBSCAN [24] algorithm is performed to find the main text columns and thus the main text area around the columns (yellow rectangles in Figure 3). A homogeneous non-text region with low variance is determined as a pattern to fill the entire main text area (cyan rectangle in Figure 3). Finally, the Chu Nom font is used to write synthetic printed text on the main text area, similar to the generation of the initial training data, with the difference that a real stele background is present around the printed Chu Nom text.

Figure 3: Auto-calibration of the deep learning model for character detection: a) synthetic background, b) synthetic stele, c) character detection, d) layout analysis, e) real background, f) semi-synthetic stele. Green boxes represent detected characters, red and blue boxes are characters discarded during layout analysis, yellow boxes are columns and the main text area, and the cyan box is a homogeneous background region.

The auto-calibration leads to new semi-synthetic training data, on which the initial network is further fine-tuned, thus adapting to real stele backgrounds and improving the detection accuracy. For further details, we refer to [13].
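The layout analysis described above can be summarized by the following minimal sketch, which assumes scikit-learn's DBSCAN implementation; the size thresholds and clustering parameters are examples, not the calibrated values of our system.

```python
# Minimal sketch (illustrative) of the layout-analysis step: filter detected boxes by the
# median character size and cluster column centers with DBSCAN. Thresholds and clustering
# parameters are examples; scikit-learn is an assumed dependency.
import numpy as np
from sklearn.cluster import DBSCAN

def main_text_area(boxes):
    """boxes: array-like of detected character boxes [x1, y1, x2, y2]."""
    boxes = np.asarray(boxes, dtype=float)
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    med_w, med_h = np.median(w), np.median(h)

    # Discard boxes much smaller (ornaments, background parts) or much larger
    # (title characters) than the median main-text character.
    keep = (w > 0.5 * med_w) & (w < 2.0 * med_w) & (h > 0.5 * med_h) & (h < 2.0 * med_h)
    boxes = boxes[keep]

    # Cluster the horizontal centers of the remaining boxes to recover the text columns.
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2).reshape(-1, 1)
    labels = DBSCAN(eps=0.5 * med_w, min_samples=3).fit_predict(cx)
    columns = [boxes[labels == k] for k in set(labels) if k != -1]

    # The main text area is the bounding rectangle around all column boxes.
    kept = np.vstack(columns) if columns else boxes
    return kept[:, 0].min(), kept[:, 1].min(), kept[:, 2].max(), kept[:, 3].max()
```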
4.4 Structural representation

Once the characters have been detected by the calibrated network (Figure 2, step 3), the character images are modeled with a graph-based representation that captures their structure, i.e. the arrangement of individual strokes that constitute the character (Figure 2, step 4). We employ keypoint graphs [25], which have been used successfully for keyword spotting in the past for Latin manuscripts [6] as well as Chu Nom manuscripts [16] written with ink on parchment or paper.

The graph extraction is illustrated in Figure 4. First, a local text enhancement is applied by means of a difference of Gaussians (DoG) filter. Afterwards, the image is binarized with a global threshold and thinned to obtain strokes that have a width of one pixel. Endpoints and intersection points are added as nodes to the keypoint graph, labeled with their (x, y) coordinates. For circular structures, a random point is added as a node as well. To complete the initial set of nodes, additional points are added as nodes at regular intervals of D pixels on the skeleton image. Once all nodes have been added to the graph, their coordinate labels are normalized to zero mean and unit variance (z-score). Finally, neighboring nodes on the skeleton image are connected with unlabeled and undirected edges. For more details on the graph extraction, we refer to [26], whose implementation of keypoint graphs is used in the present article.

Figure 4: Keypoint graph extraction.

An important modification of the keypoint graph extraction, which led to successful spotting results on the stele images, was to model the characters in super-resolution, in order to capture sufficient details about small strokes that may mark the difference between two similar Chu Nom engravings. To that end, the bounding box of the detection network is first translated back to the original image (inverting the downscaling to 1024 pixel height), then the character is cut out from the original image and upscaled to the same width S for all characters, while keeping the aspect ratio. When extracting graphs from a character image in super-resolution, i.e. when using values of S larger than the original width, it is possible that strokes in the keypoint graph contain more nodes than pixels in the original image. Hence, even very small strokes become more relevant in the graph-based representation.

4.5 Structural comparison

To compare the graphs of the character images with keyword graphs (Figure 2, step 5), we consider the graph edit distance [27, 28]. It is a general graph dissimilarity measure that is applicable to any kind of labeled graphs. With respect to a set of basic edit operations, typically insertion, deletion, and label substitution for nodes and edges, it calculates the minimum edit cost for transforming one graph into another. However, the exact graph edit distance is more of theoretical value than of practical relevance, because it is an NP-complete problem that can only be solved for small graphs with few nodes in reasonable time.

In order to cope with large character graphs, which may have over 100 nodes in super-resolution, we use the Hausdorff edit distance [17], an approximation of graph edit distance that calculates a lower bound in quadratic time. Derived from the Hausdorff distance between sets, it compares each node u \in g_1, plus its adjacent edges, of one graph g_1 with every node v \in g_2, plus its adjacent edges, of another graph g_2 and sums up the minimum edit cost f(u, v) for matching the substructures. A special \epsilon node is considered for insertions (\epsilon, v) and deletions (u, \epsilon). Formally,

HED(g_1, g_2) = \sum_{u \in g_1} \min_{v \in g_2 \cup \{\epsilon\}} f(u, v) + \sum_{v \in g_2} \min_{u \in g_1 \cup \{\epsilon\}} f(u, v).

We use the Euclidean cost model for the structural comparison of keypoint graphs. It considers the Euclidean distance of the (x, y) labels for node label substitution, a constant cost c_n for node insertion and deletion, and a constant cost c_e for edge insertion and deletion.

4.6 Keyword spotting

To spot a Chu Nom character (Figure 2, step 6), n template images of the keyword are collected from real steles and keypoint graphs are extracted. Afterwards, the minimum HED score

score(g) = \min_{t \in T} HED(g, t)

is calculated for each character graph g of the stele images with respect to the template graphs T = {t_1, ..., t_n}. Finally, the character graphs are sorted according to the spotting score, such that the most similar character graphs appear in the top ranks.
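To make the matching step concrete, the following minimal sketch computes a simplified Hausdorff edit distance with the Euclidean cost model and the template-minimum spotting score. A graph is reduced here to its z-score-normalized node coordinates and node degrees, and the adjacent-edge costs are approximated through the degree difference; this is a simplification for illustration, not the implementation of [17] used in our experiments.

```python
# Minimal sketch of the Hausdorff edit distance with the Euclidean cost model and of the
# template-minimum spotting score. A graph is represented as a tuple (coords, degrees):
# an (n, 2) array of normalized node coordinates and the node degrees. Charging half of
# the adjacent-edge cost through the degree is a simplification of the published cost model.
import numpy as np

def hed(g1, g2, c_node=0.9, c_edge=0.9):
    (p1, d1), (p2, d2) = g1, g2
    # Pairwise substitution costs: Euclidean label distance plus half the edge-degree difference.
    sub = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2)
    sub += 0.5 * c_edge * np.abs(d1[:, None] - d2[None, :])
    # Deletion/insertion costs against the empty node epsilon (node plus adjacent edges).
    del1 = c_node + 0.5 * c_edge * d1
    ins2 = c_node + 0.5 * c_edge * d2
    # Each node is matched with its cheapest counterpart or with epsilon; quadratic time.
    cost1 = np.minimum(sub.min(axis=1), del1).sum()
    cost2 = np.minimum(sub.min(axis=0), ins2).sum()
    return cost1 + cost2

def spotting_score(character_graph, template_graphs, **costs):
    """Minimum HED between a detected character graph and all keyword template graphs."""
    return min(hed(character_graph, t, **costs) for t in template_graphs)
```

Sorting all character graphs of the stele images by this score in ascending order yields the retrieval ranking that is evaluated below.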
For evaluating the spotting performance, precision (P) and recall (R) are calculated as

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN},

with respect to the number of true positives (TP), false positives (FP), and false negatives (FN) for each possible score threshold. Then, the average precision (AP) is calculated for each keyword and the mean average precision (mAP)

mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i

over all K keywords is used as the final performance measure for keyword spotting.

5 Experiments

5.1 Spotting scenarios

To evaluate the proposed hybrid deep learning method, the 5,138 ground truth characters of the 20 stele images (see Section 3) are randomly separated into three distinct sets for template selection (50%), validation (25%), and testing (25%), respectively. The template selection set is used to select n = 5 templates per keyword, the validation set is used for optimizing hyper-parameters, and the test set is used for evaluating the final spotting performance. A total of K = 128 keywords are spotted, which appear at least 5 times in the template selection set, at least once in the validation set, and at least once in the test set.

We compare three spotting scenarios with respect to the use of human annotations, as listed in Table 1:

– The fully annotated scenario uses ground truth labels for parameter optimization as well as performance evaluation.

– The font-validated scenario does not require human annotations for parameter optimization. Instead, a synthetic font-based validation set is used (see below).

– The annotation-free scenario, which is the target scenario for the proposed method, does not require any human annotations. It evaluates the keyword spotting system with respect to automatically detected characters instead of ideal ground truth locations.

The synthetic font-based validation set is created as follows. 20 keywords are selected randomly and printed in 5 different Chu Nom fonts. 900 other characters are printed and added to the validation set, which is composed of 1,000 characters in total. Each of the keywords is then spotted on the validation set using a single template and the mAP results are used to compare and optimize different parameter settings.

Furthermore, we compare two spotting scenarios with respect to the engraving styles:

– The same style scenario spots keywords on each stele image separately, such that the engraving style of the keyword templates is the same as the style of the stele characters.

– The mixed styles scenario spots keywords across all 20 stele images, taking into account different engraving styles.

5.2 Parameter optimization

For the character detection network (see Sections 4.2 and 4.3), we consider only one set of hyper-parameters, i.e. the default parameters of the medium-sized YOLOv5m model (github.com/ultralytics/yolov5, commit cc03c1d5727e178438e9f0ce0450fa6bdbbe1ea7). The weights of the model are pretrained on the COCO [29] object detection dataset. The pretrained network is fine-tuned with 30,000 synthetic steles over 15 epochs until convergence. Afterwards, an additional fine-tuning epoch is used for auto-calibration with real stele backgrounds.

The parameters of the structural representation and the structural comparison (see Sections 4.4 and 4.5) are optimized in two steps. First, a default character width of S = 150 pixels and node distance of D = 5 pixels is fixed to evaluate a range of node and edge costs c_n, c_e \in {0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1} on the validation set.
Afterwards, the optimal node and edge costs are fixed and different character widths S \in {90, 120, 150, 180, 210} and node distances D \in {3, 4, 5, 6, 7} are evaluated on the validation set.

The optimal parameters are listed in Table 2 for both the fully annotated and the font-validated scenario. When considering synthetic printed characters (font-validated), we observe that a larger node distance and a larger edge cost are preferred when compared with real stele characters (fully annotated). This may be due to the increased stability, i.e. less variability, of the character shapes and character background in the case of printed fonts.

A more detailed view on the optimization of S and D is provided in the three-dimensional visualizations in Figures 5 and 6. They show that changing the parameters leads to more significant differences in mAP for the real characters when compared with synthetic ones, indicating that the synthetic validation set may need to be improved to better represent the challenges encountered for real characters.

Figure 5: Optimization of structural representation for the fully annotated scenario.

Figure 6: Optimization of structural representation for the font-validated scenario.

Table 1: Different keyword spotting scenarios with respect to human annotations.

                   Parameter Optimization               Performance Evaluation
  Fully annotated  Ground truth validation set          Ground truth test set
  Font-validated   Synthetic font-based validation set  Ground truth test set
  Annotation-free  Synthetic font-based validation set  Automatic detection test set

Table 2: Optimal parameters for structural representation and comparison.

  Parameters         Fully annotated  Font-validated
  Character width S  120              150
  Node distance D    3                7
  Node cost c_n      0.9              0.3
  Edge cost c_e      0.9              2.1

5.3 Runtime performance

For training the YOLO-based character detection model, we used 2 Nvidia Titan RTX GPUs. One training epoch on 30,000 stele images took 6.3 minutes on average and a total of 16 training epochs was sufficient for convergence. Detecting characters with the trained YOLO network took only a few milliseconds per stele.

For graph matching with the Hausdorff edit distance, we used computational nodes with 64 CPU cores (AMD EPYC, 2.25 GHz). One graph comparison took 4.4 milliseconds on average, which allowed us to spot a single keyword template in about 1 second per stele.

Note that the graph comparisons need to be performed only once. Afterwards, the positions of all keywords in the collection of stele images can be indexed, such that historians can search and retrieve keywords quasi instantly based on the index.

5.4 Spotting performance

Figure 7: Qualitative spotting results for the same style spotting scenario: templates, top 10 results (correct results in green), and keywords existing in the test set but not found in the top 10 results.

Figure 8: Qualitative spotting results for the mixed styles spotting scenario: templates, top 10 results (correct results in green), and keywords existing in the test set but not found in the top 10 results.

Table 3 shows the spotting performance on the test set when using optimized parameters. The results obtained for the steles are put into context with results obtained for Chu Nom manuscripts reported in [16].
These manuscripts are written with ink on parchment with a regular writing style, leading to better character detection quality and less noise in the character images when compared with the steles. The results are not directly comparable, because a different set of keywords was used, but they provide a point of reference for a less challenging keyword spotting task.

Table 3: Spotting performance in terms of mean average precision (mAP) on the test set.

                           Fully annotated  Font-validated  Annotation-free
  Manuscripts (Kieu) [16]  0.76             0.78            0.77
  Steles: Same style       0.85             0.81            0.72
  Steles: Mixed styles     0.56             0.50            0.40

For the same style spotting scenario, the performance level is excellent with a mAP of 72% for the annotation-free scenario. The performance is similar to that of manuscripts, which typically are better readable and have fewer variations of the writing style when compared with stele images. However, the impact of optimizing the parameters on synthetic characters, rather than real ones, is stronger (mAP reduced from 85% to 81%) and the loss in mAP for automatic character detection is stronger as well (mAP reduced from 81% to 72%). These results show the increased difficulty of spotting Chu Nom characters in stele images and leave room for improvements regarding parameter optimization and character detection.

For the mixed styles spotting scenario, the performance drops significantly to a mAP of 40% for the annotation-free scenario. It indicates a limitation of the proposed hybrid deep learning method, which did not generalize well to engraving styles that are different from the keyword templates. An application to stele collections with similar engraving styles therefore seems more promising. Note, however, that we have only used 5,138 characters in our experiments. It is possible that the method will generalize better with a larger dataset.

Figures 7 and 8 provide qualitative spotting results for both scenarios. For same style spotting, 3 out of 4 characters are correctly spotted in the first 3 ranks. However, the fourth character is not part of the top 10 results, because of an error in automatic character detection, which has included some noise in the bottom right corner. For mixed style spotting, 5 out of 10 characters appear in the top 10 ranks, but not in the first 5 ranks. The remaining 5 characters are not part of the top 10 results, due to noise but also due to different engraving styles, which are not represented in the keyword template images.

6 Conclusions

The proposed hybrid deep learning approach to keyword spotting aims to combine the strengths of data-driven methods with knowledge-based modeling. In a first step, a deep convolutional neural network is trained on a large synthetic dataset to detect printed Chu Nom characters. By means of self-calibration, the network is then automatically adapted to the stele images. In a second step, the detected characters are modeled by means of keypoint graphs and the Hausdorff edit distance is used to efficiently perform a structural comparison for retrieving keywords.

Especially when the engraving style of the keyword is the same as the style of the stele characters, an excellent mean average precision of over 70% is achieved. In the case of mixed engraving styles, however, the spotting results drop to about 40% mean average precision.
Although this performance level is still helpful for historians to browse large image collections of heterogeneous steles, there is clearly room for improvement.

There are several interesting lines of future research to further improve the results. Staying in the same style scenario, future work includes the investigation of style clustering, such that similar engraving styles can be identified across a large number of stele images. Noise removal methods may also be interesting to avoid spotting mistakes due to non-character artifacts.

With respect to the mixed style scenario, it may be necessary to perform some sort of data-driven learning to improve the spotting results, for example by means of geometric deep learning with graph neural networks [30]. In order to avoid the requirement of human annotations, it would be interesting to pursue a self-calibration strategy similar to the self-calibration of the character detection network.

Finally, a promising line of future research would be to generalize the proposed spotting method to other historical scripts and languages.

References

[1] Andreas Fischer, Marcus Liwicki, and Rolf Ingold, editors. Handwritten Historical Document Analysis, Recognition, and Retrieval — State of the Art and Future Trends. World Scientific, 2020.

[2] Raghavan Manmatha, C. Han, and E. M. Riseman. Word spotting: A new approach to indexing handwriting. In Proc. Int. Conf. on Computer Vision and Pattern Recognition, pages 631–637, 1996.

[3] T. M. Rath and R. Manmatha. Word spotting for historical documents. Int. Journal on Document Analysis and Recognition, 9:139–152, 2007.

[4] K. Terasawa and Y. Tanaka. Slit style HOG features for document image word spotting. In Proc. 10th Int. Conf. on Document Analysis and Recognition, pages 116–120, 2009.

[5] M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós. Browsing heterogeneous document collections by a segmentation-free word spotting method. In Proc. 11th Int. Conf. on Document Analysis and Recognition, pages 63–67, 2011.

[6] Michael Stauffer, Andreas Fischer, and Kaspar Riesen. Graph-based Keyword Spotting. World Scientific, 2019.

[7] A. Fischer, A. Keller, V. Frinken, and H. Bunke. Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters, 33(7):934–942, 2012.

[8] L. Rothacker, M. Rusiñol, and G. A. Fink. Bag-of-features HMMs for segmentation-free word spotting in handwritten documents. In Proc. 12th Int. Conf. on Document Analysis and Recognition, pages 1305–1309, 2013.

[9] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke. A novel word spotting method based on recurrent neural networks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(2):211–224, 2012.

[10] Sebastian Sudholt and Gernot A. Fink. PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In Proc. 15th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), pages 277–282, 2016.

[11] Thai V. Hoang, Salvatore Tabbone, and Ngoc-Yen Pham. Extraction of Nom text regions from stele images using area Voronoi diagram. In Proc. 10th Int. Conf. on Document Analysis and Recognition, pages 921–925, 2009.

[12] Anna Scius-Bertrand, Lars Voegtlin, Michele Alberti, Andreas Fischer, and Marc Bui. Layout analysis and text column segmentation for historical Vietnamese steles.
In Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, pages 84–89, 2019.

[13] Anna Scius-Bertrand, Michael Jungo, Beat Wolf, Andreas Fischer, and Marc Bui. Annotation-free character detection in historical Vietnamese stele images. In Proc. 16th Int. Conf. on Document Analysis and Recognition (ICDAR), pages 432–447, 2021.

[14] Jonas Diesbach, Andreas Fischer, Marc Bui, and Anna Scius-Bertrand. Generating synthetic styled Chu Nom characters. In Proc. 18th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), 2022.

[15] Kha Cong Nguyen, Cuong Tuan Nguyen, and Masaki Nakagawa. Nom document digitalization by deep convolution neural networks. Pattern Recognition Letters, 133:8–16, 2020.

[16] Anna Scius-Bertrand, Linda Studer, Andreas Fischer, and Marc Bui. Annotation-free keyword spotting in historical Vietnamese manuscripts using graph matching. In IAPR Joint International Workshops on Statistical Techniques in Pattern Recognition (SPR 2022) and Structural and Syntactic Pattern Recognition (SSPR 2022): S+SSPR, 2022.

[17] A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke. Approximation of graph edit distance based on Hausdorff matching. Pattern Recognition, 48(2):331–343, 2015.

[18] A. Scius-Bertrand, A. Fischer, and M. Bui. Retrieving keywords in historical Vietnamese stele images without human annotations. In Proc. 11th Int. Symposium on Information and Communication Technology (SoICT), 2022.

[19] Philippe Papin. Aperçu sur le programme "Publication de l'inventaire et du corpus complet des inscriptions sur stèles du Viêt-Nam". Bulletin de l'École française d'Extrême-Orient, 90(1):465–472, 2003.

[20] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[21] Glenn Jocher et al. ultralytics/yolov5: v4.0 - nn.SiLU() activations, weights & biases logging, PyTorch Hub integration. DOI: 10.5281/zenodo.4418161, 2021.

[22] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 390–391, 2020.

[23] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.

[24] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, pages 226–231, 1996.

[25] A. Fischer, K. Riesen, and H. Bunke. Graph similarity features for HMM-based handwriting recognition in historical documents. In Proc. Int. Conf. on Frontiers in Handwriting Recognition, pages 253–258, 2010.

[26] Paul Maergner, Vinaychandran Pondenkandath, Michele Alberti, Marcus Liwicki, Kaspar Riesen, Rolf Ingold, and Andreas Fischer. Combining graph edit distance and triplet networks for offline signature verification. Pattern Recognition Letters, 125:527–533, 2019.

[27] H. Bunke and G. Allermann. Inexact graph matching for structural pattern recognition. Pattern Recognition Letters, 1(4):245–253, 1983.
[28] A. Sanfeliu and K. S. Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. on Systems, Man, and Cybernetics, 13(3):353–363, 1983.

[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[30] Pau Riba, Andreas Fischer, Josep Lladós, and Alicia Fornés. Learning graph edit distance by graph neural networks. Pattern Recognition, 120:108132, 2021.