https://doi.org/10.31449/inf.v47i3.4785  Informatica 47 (2023) 361–372

A Hybrid Deep Learning Approach to Keyword Spotting in Vietnamese Stele Images

Anna Scius-Bertrand (1), Marc Bui (2) and Andreas Fischer (1)
(1) University of Fribourg and HES-SO, Fribourg, Switzerland
(2) Ecole Pratique des Hautes Etudes, Paris, France
E-mail: anna.scius-bertrand@unifr.ch, marc.bui@ephe.psl.eu, andreas.fischer@unifr.ch

Keywords: hybrid deep learning, document image analysis, keyword spotting, annotation-free, Hausdorff edit distance, Vietnamese steles, Chu Nom

Received: April 5, 2023

In order to access the rich cultural heritage conveyed in Vietnamese steles, automatic reading of stone engravings would be a great support for historians, who are analyzing tens of thousands of stele images. Approaching this challenging problem with deep learning alone is difficult because data-driven models require large representative datasets with expert human annotations, which are not available for the steles and costly to obtain. In this article, we present a hybrid approach to spot keywords in stele images that combines data-driven deep learning with knowledge-based structural modeling and matching of Chu Nom characters. The main advantage of the proposed method is that it is annotation-free, i.e. no human data annotation is required. In an experimental evaluation, we demonstrate that keywords can be successfully spotted with a mean average precision of more than 70% when a single engraving style is considered.

Povzetek: A hybrid approach for keyword spotting in stele images is presented, which combines deep learning with structural modeling of Chu Nom characters. Keywords are successfully recognized with a mean average precision of more than 70%.

1 Introduction

Vietnamese steles are of great value for historians, as the stone engravings are a unique source of information to understand the social, economic, and belief structures in the villages. The Vietnamica project (https://vietnamica.hypotheses.org), in particular, aims to investigate pious donations from ordinary people offered to local shrines. For this purpose, about 40,000 digital stele images are studied, which may contain hundreds of Chu Nom characters on a single image. To cope with this vast amount of characters, automatic image analysis methods are needed that are able to transcribe the contents of the steles into machine-readable form for searching and browsing. Some examples of stele images are shown in Figure 1, highlighting significant differences in column layout and image quality across different steles.

Although the state of the art in handwriting recognition for historical documents has made great progress in the past decades, it remains a difficult problem and an active field of research [1]. Keyword spotting [2] has been proposed early on as an alternative to automatic transcription for difficult cases, where a full transcription is not feasible with high accuracy. The goal is to identify specific search terms of interest, either by providing a template image of the keyword (query-by-example) or by providing a textual representation of the search term (query-by-string).

Similar to developments in other related fields, such as computer vision and natural language processing, the different methods for keyword spotting can be divided into three groups, namely heuristic methods, machine learning based methods, and deep learning based methods.
These groups are roughly ordered in time, heuristic methods being the oldest, but they still coexist and new approaches are being developed in all three groups.

Heuristic methods incorporate domain knowledge about the handwriting and are able to match images directly, i.e. keyword images and images from the manuscript, in order to retrieve similar instances in a query-by-example scenario. Early examples include dynamic time warping methods based on contour features [3] and gradient features [4], as well as segmentation-free methods based on the scale-invariant feature transform (SIFT) [5]. More recently, a graph-based approach has been proposed in [6], which relies on a structural representation of the handwriting and uses an approximate graph edit distance to compare handwriting graphs.

Figure 1: Example stele images.

Machine learning methods pursue the paradigm of learning by example. They train keyword models with the help of learning samples, i.e. manually annotated handwriting images. In a first step, characteristic features are manually defined based on domain knowledge and, in a second step, different machine learning methods are used to learn keyword models based on the features. Examples of this group of keyword spotting methods include methods based on hidden Markov models (HMM) with geometric features [7] and bags of local features [8], as well as bidirectional long short-term memory networks (BLSTM) with geometric features [9]. After training keyword models, the user can perform a query-by-string search without the need to provide example images of the keyword.

Finally, deep learning methods are also based on the learning-by-example paradigm, but they do not require manually defined features. Instead, they aim at learning characteristic representations, so-called embedding spaces, automatically from the data. Images as well as textual representations can be embedded in the same space, such that both query-by-example and query-by-string can be realized. A prominent example is the PHOCNet [10], which learns an embedding space based on pyramidal histogram of characters (PHOC) representations.

Today, the best keyword spotting performance is achieved by means of deep learning methods. However, they require a considerable amount of manually annotated training samples. In the case of historical Vietnamese steles, such learning samples can only be provided by experts who are able to read the ancient Chu Nom script. It is thus time-consuming and costly to build a comprehensive training dataset that is representative of the heterogeneous collection of stele images (see Figure 1). At the time of writing, such a training set is not available for the 40,000 stele images.

In this article, we present a hybrid deep learning method for keyword spotting in historical Vietnamese stele images. It aims to combine deep learning with heuristic methods, such that the domain knowledge of the heuristic methods can compensate for the lack of annotated learning samples. Indeed, it is an annotation-free method that does not require any human annotations at all.

The proposed method can be applied directly to the original stele images and consists of two processing steps. First, characters are detected using deep neural networks that are trained on synthetic images with printed Chu Nom characters and then auto-calibrated to real stele images.
Secondly, the structure of the Chu Nom characters is modeled with a graph-based representation and matched with search terms using an approximate graph edit distance, in order to efficiently perform query-by-example keyword spotting. A comprehensive experimental evaluation is performed to measure the spotting performance.

The remainder of this article is structured as follows. Section 2 discusses related work on stele images, Section 3 provides more details about the content of the steles and the image acquisition, Section 4 presents the proposed keyword spotting method, and Section 5 details the experiments. Finally, conclusions are drawn in Section 6.

2 Related work

Initial work on the stele images has focused on the task of layout analysis with the aim to segment stele images into columns and characters. Such an initial segmentation is an important preprocessing step for character recognition. Notable work in this domain includes [11], where a heuristic method based on Voronoi diagrams is proposed to segment characters, and [12], where a deep learning approach based on semantic segmentation is pursued to detect columns with only a small number of human annotations. Recently, in [13] a deep learning method based on object detection networks has been introduced for character segmentation, which does not require human annotations and generalizes well to different layouts and engraving styles. In [14], a generative deep learning model has been suggested to create synthetic Chu Nom characters in different engraving styles. Recent work on Chu Nom also includes the U-Net based approach reported in [15], which was studied in the context of manuscripts.

Our method builds upon the character segmentation method of [13] and goes a step further to perform keyword spotting. For graph-based character modeling we rely on keypoint graphs, which have been studied for Latin scripts [6] and Chu Nom characters in manuscripts [16] before, but not for stone engravings. Important adaptations to the logographic writing system include the use of super-resolution to better model small strokes that distinguish similar Chu Nom characters. The graphs are efficiently matched using the Hausdorff edit distance [17], an approximation of graph edit distance that can be calculated in quadratic time with respect to the number of graph nodes. Efficient graph matching is especially important in the context of super-resolution, when Chu Nom graphs may contain over 100 nodes.

Preliminary results have already been published in a conference paper [18]. In this article, we provide a more detailed description of the hybrid deep learning method and significantly extend the experimental evaluation. Instead of considering only 8 steles, we conduct more comprehensive experiments on 20 stele images with manual ground truth. Furthermore, we study the important case of spotting keywords within the same style of engravings and compare it to a scenario with mixed styles. This study serves the purpose of better understanding the possibilities and limitations of the proposed method.

3 Dataset

The 40,000 stele images represent about 25,000 steles, i.e. man-sized stones with engravings, which were erected in Vietnamese villages between the 16th and the 20th century.
The majority of the steles record donations made by villagers to the local shrines and are engraved in the ancient Vietnamese Chu Nom writing system [19]. However, they can also contain information about finances, constructions, and demarcations, thus informing about the social, economic, and religious life of the villages. The steles were erected for all to see and were able to withstand adverse weather conditions and armed conflicts. Nevertheless, they may contain degradations, fissures, and impacts, which may render parts of the steles illegible (see Figure 1).

The images of the steles were obtained by the French School of the Far East (EFEO) and the Institute of Han Nom Studies by means of stampings [19]. A sheet of paper is pressed on the stone and fixed with a binder, e.g. banana juice. Then, ink is applied with a roller on the paper over the entire surface of the stele, such that engravings appear in white and the stone background, as well as characters written in relief, appear dark in the color of the ink. Finally, the paper is photographed to obtain digital stele images. Such pictures of the stampings contain more character details and are easier to read than pictures of the original steles.

In this article, we consider a research dataset of 20 stele images, available at https://github.com/asciusb/steles_kws_database. It encompasses all steles for which we have obtained ground truth information so far at the level of individual characters, i.e. bounding boxes around the characters as well as their machine-readable Chu Nom transcription in Unicode. Characters that are not readable are marked with a special symbol. In total, the dataset contains 5,138 characters, which corresponds to an average of about 257 characters per stele.

4 Hybrid deep learning

Figure 2 provides an overview of the proposed hybrid deep learning method for keyword spotting. At the core of the method is a deep learning model that is responsible for detecting the location of main text characters on the stele images. It is trained on synthetic data and auto-calibrated to real data. Afterwards, a structural representation and comparison with keyword templates is performed for spotting.

The training data used for supervised learning of the deep learning model does not originate from human annotations. Instead, human knowledge is used to design the synthetic training data and to perform the auto-calibration. Also, human knowledge is used to model the characters with a graph-based representation and to perform the structural matching with the keyword templates based on heuristic methods.

In the following, the individual components are described in more detail.

4.1 Character detection

The deep learning model is an object detection network with a you only look once (YOLO) [20] architecture, which has originally been introduced for detecting objects in natural scenes, e.g. pedestrians in the context of autonomous driving. When applied to the problem of character detection on stele images, one of the main differences is that a large number of small objects needs to be detected, rather than a small number of large objects. Therefore, it is important that the visual analysis is performed with a sufficiently high resolution, such that even small strokes of the logographic characters can be taken into account.
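As a point of reference, the following minimal sketch illustrates how a detector of this kind can be applied to a stele image once it has been trained and calibrated (Sections 4.2 and 4.3). It assumes the public ultralytics/yolov5 torch.hub interface and a hypothetical fine-tuned weight file stele_chars.pt; the file name and the confidence threshold are purely illustrative.

```python
# Minimal sketch (not the exact implementation): applying a fine-tuned YOLOv5 character
# detector to a stele image. The torch.hub interface of ultralytics/yolov5 and the weight
# file name "stele_chars.pt" are assumptions for illustration.
import torch

# Load a custom (fine-tuned) YOLOv5 model through torch.hub.
model = torch.hub.load("ultralytics/yolov5", "custom", path="stele_chars.pt")
model.conf = 0.25  # illustrative confidence threshold for keeping detections

# Run inference; "size" controls the resolution of the visual analysis, which must be
# large enough to resolve small strokes of the logographic characters.
results = model("stele_image.png", size=1024)

# Each detection is [x1, y1, x2, y2, confidence, class]; with a single binary
# "character" class, only the boxes and confidences are of interest.
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    print(f"character at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), score {conf:.2f}")
```

With a single binary "character" class, only the bounding boxes and confidence scores of the detections are used in the subsequent processing steps.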
Two initial preprocessing steps are applied to the original images:

– Rescaling the stele images to a uniform height of 1024 pixels, while keeping the same aspect ratio.

– Inverting the colors, such that the engraved characters appear in dark color rather than white.

The height has been chosen to ensure that the stele images fit into the GPU memory. The resulting images are typically smaller than the originals. Inverting the colors has been chosen with respect to improved readability and more convenient generation of synthetic data.

The specific network architecture used for character detection is YOLOv5 [21], which analyzes the image at multiple sufficiently large scales to detect small characters on large steles. The backbone of YOLOv5 is a cross stage partial network (CSPNet) [22], which extracts convolutional feature maps. The neck is a path aggregation network (PANet) [23], which performs a combination of feature maps at different scales. Finally, a dual head is used to perform both classification and bounding box regression on the combined feature maps. Theoretically, the dual head would allow us not only to localize the characters but also to classify them. However, practical attempts at classification have failed when considering thousands of different Chu Nom character classes. Instead, the classification head only performs a binary classification, whether or not a character is present, and the regression head predicts the extent of the character bounding box.

Figure 2: Overview of the hybrid deep learning method for keyword spotting: 1) initial training with synthetic data, 2) auto-calibration with real data, 3) character detection using the calibrated model, 4) structural representation with keypoint graphs, 5) structural comparison using the Hausdorff edit distance, 6) keyword spotting based on the distance score. Green boxes represent detected characters and yellow boxes are detected columns.

4.2 Synthetic training

The initial training of the character detection network is based on thousands of fully synthetic training steles (Figure 2, step 1), for which the ground truth annotations, i.e. bounding boxes around the characters, are generated alongside the synthetic images. The generation is guided by the following heuristic considerations:

– Color-inverted stele images contain dark character engravings on a gray stone background, surrounded by a black border.

– Characters are arranged in a column layout.

The data generation therefore proceeds as follows. First, a gray rectangle is drawn on a black background. Then, a Chu Nom font (the NomNaTongLight font available at http://www.nomfoundation.org) is used to write random text on the gray rectangle in a random number of columns. Finally, random noise is added to the synthetic images by means of translation, blur, changes in brightness, as well as salt and pepper noise, in order to avoid overfitting of the character detection network.
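A minimal sketch of this generation procedure is given below. It assumes the PIL and NumPy libraries and a local copy of the NomNaTongLight font; the layout constants, the use of random CJK codepoints as stand-ins for Chu Nom characters, and the noise parameters are illustrative choices rather than the exact settings of our generator.

```python
# Minimal sketch (illustrative, not the exact generator): drawing a synthetic stele with
# known character bounding boxes. Assumes PIL and NumPy plus a local NomNaTongLight.ttf;
# random CJK codepoints stand in for Chu Nom characters, and all constants are arbitrary.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def synthetic_stele(width=768, height=1024, font_path="NomNaTongLight.ttf"):
    img = Image.new("L", (width, height), color=0)               # black border / background
    draw = ImageDraw.Draw(img)
    margin = random.randint(40, 80)
    draw.rectangle([margin, margin, width - margin, height - margin], fill=128)  # gray stone

    char_size = random.randint(32, 48)
    font = ImageFont.truetype(font_path, char_size)
    boxes = []                                                   # ground-truth bounding boxes
    for c in range(random.randint(5, 12)):                       # random number of columns
        x = width - margin - (c + 1) * int(char_size * 1.4)      # columns run right to left
        if x < margin + 5:
            break
        y = margin + 10
        while y + char_size < height - margin:
            ch = chr(random.randint(0x4E00, 0x9FFF))             # stand-in CJK codepoint
            draw.text((x, y), ch, font=font, fill=40)            # dark character on gray stone
            boxes.append((x, y, x + char_size, y + char_size))
            y += int(char_size * 1.2)

    # Random noise against overfitting: brightness jitter and salt-and-pepper noise
    # (translation and blur are applied in the same spirit).
    arr = np.asarray(img, dtype=np.float32) + random.uniform(-20, 20)
    mask = np.random.rand(*arr.shape)
    arr[mask < 0.01] = 0
    arr[mask > 0.99] = 255
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)), boxes
```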
4.3 Auto-calibration

After an initial training on synthetic data, the network is applied and adapted to real data by means of auto-calibration (Figure 2, step 2), following the procedure introduced in [13]. The aim of the auto-calibration is to replace the generic gray rectangle and black border of the synthetic training data with real stele backgrounds, such that the network can improve the separation between stele background and character foreground.

The following heuristic considerations are taken into account for detecting the main text area:

– Main text characters have approximately the same size and are organized in columns.

– The main text area is rectangular.

The auto-calibration is illustrated in more detail in Figure 3. After printing random Chu Nom text on simple backgrounds to create fully synthetic stele images, an initial training of the character detection network is performed. Afterwards, the network is applied to real stele images and layout analysis is used to recognize the main text area. Layout analysis consists of the following steps. The median box is calculated to estimate the size of the main text characters. Characters that are either too small (e.g. parts of ornaments or parts of the background) or too large (e.g. characters of the title above the main text) are discarded. Afterwards, unsupervised clustering using the DBSCAN [24] algorithm is performed to find the main text columns and thus the main text area around the columns (yellow rectangles in Figure 3). A homogeneous non-text region with low variance is determined as a pattern to fill the entire main text area (cyan rectangle in Figure 3). Finally, the Chu Nom font is used to write synthetic printed text on the main text area, similar to the generation of the initial training data, with the difference that a real stele background is present around the printed Chu Nom text.

Figure 3: Auto-calibration of the deep learning model for character detection: a) synthetic background, b) synthetic stele, c) character detection, d) layout analysis, e) real background, f) semi-synthetic stele. Green boxes represent detected characters, red and blue boxes are characters discarded during layout analysis, yellow boxes are columns and the main text area, and the cyan box is a homogeneous background region.

The auto-calibration leads to new semi-synthetic training data, on which the initial network is further fine-tuned, thus adapting to real stele backgrounds and improving the detection accuracy. For further details, we refer to [13].
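The layout analysis described above can be summarized by the following minimal sketch, which assumes scikit-learn's DBSCAN implementation; the size thresholds and clustering parameters are examples, not the calibrated values of our system.

```python
# Minimal sketch (illustrative) of the layout-analysis step: filter detected boxes by the
# median character size and cluster column centers with DBSCAN. Thresholds and clustering
# parameters are examples; scikit-learn is an assumed dependency.
import numpy as np
from sklearn.cluster import DBSCAN

def main_text_area(boxes):
    """boxes: array-like of detected character boxes [x1, y1, x2, y2]."""
    boxes = np.asarray(boxes, dtype=float)
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    med_w, med_h = np.median(w), np.median(h)

    # Discard boxes much smaller (ornaments, background parts) or much larger
    # (title characters) than the median main-text character.
    keep = (w > 0.5 * med_w) & (w < 2.0 * med_w) & (h > 0.5 * med_h) & (h < 2.0 * med_h)
    boxes = boxes[keep]

    # Cluster the horizontal centers of the remaining boxes to recover the text columns.
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2).reshape(-1, 1)
    labels = DBSCAN(eps=0.5 * med_w, min_samples=3).fit_predict(cx)
    columns = [boxes[labels == k] for k in set(labels) if k != -1]

    # The main text area is the bounding rectangle around all column boxes.
    kept = np.vstack(columns) if columns else boxes
    return kept[:, 0].min(), kept[:, 1].min(), kept[:, 2].max(), kept[:, 3].max()
```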
4.4 Structural representation

Once the characters have been detected by the calibrated network (Figure 2, step 3), the character images are modeled with a graph-based representation that captures their structure, i.e. the arrangement of individual strokes that constitute the character (Figure 2, step 4). We employ keypoint graphs [25], which have been used successfully for keyword spotting in the past for Latin manuscripts [6] as well as Chu Nom manuscripts [16] written with ink on parchment or paper.

The graph extraction is illustrated in Figure 4. First, a local text enhancement is applied by means of a difference of Gaussians (DoG) filter. Afterwards, the image is binarized with a global threshold and thinned to obtain strokes that have a width of one pixel. Endpoints and intersection points are added as nodes to the keypoint graph, labeled with their (x, y) coordinates. For circular structures, a random point is added as a node as well. To complete the initial set of nodes, additional points are added as nodes at regular intervals of D pixels on the skeleton image. Once all nodes have been added to the graph, their coordinate labels are normalized to zero mean and unit variance (z-score). Finally, neighboring nodes on the skeleton image are connected with unlabeled and undirected edges. For more details on the graph extraction, we refer to [26], whose implementation of keypoint graphs is used in the present article.

Figure 4: Keypoint graph extraction.

An important modification of the keypoint graph extraction, which led to successful spotting results on the stele images, was to model the characters in super-resolution, in order to capture sufficient details about small strokes that may mark the difference between two similar Chu Nom engravings. To that end, the bounding box of the detection network is first translated back to the original image (inverting the downscaling to 1024 pixel height), then the character is cut out from the original image and upscaled to the same width S for all characters, while keeping the aspect ratio. When extracting graphs from a character image in super-resolution, i.e. when using values of S larger than the original width, it is possible that strokes in the keypoint graph contain more nodes than pixels in the original image. Hence, even very small strokes become more relevant in the graph-based representation.

4.5 Structural comparison

To compare the graphs of the character images with keyword graphs (Figure 2, step 5), we consider the graph edit distance [27, 28]. It is a general graph dissimilarity measure that is applicable to any kind of labeled graphs. With respect to a set of basic edit operations, typically insertion, deletion, and label substitution for nodes and edges, it calculates the minimum edit cost for transforming one graph into another. However, the exact graph edit distance is more of theoretical value than of practical relevance, because it is an NP-complete problem that can only be solved for small graphs with few nodes in reasonable time.

In order to cope with large character graphs, which may have over 100 nodes in super-resolution, we use the Hausdorff edit distance [17], an approximation of graph edit distance that calculates a lower bound in quadratic time. Derived from the Hausdorff distance between sets, it compares each node u \in g_1, plus its adjacent edges, of one graph g_1 with every node v \in g_2, plus its adjacent edges, of another graph g_2 and sums up the minimum edit cost f(u, v) for matching the substructures. A special \epsilon node is considered for insertions (\epsilon, v) and deletions (u, \epsilon). Formally,

HED(g_1, g_2) = \sum_{u \in g_1} \min_{v \in g_2 \cup \{\epsilon\}} f(u, v) + \sum_{v \in g_2} \min_{u \in g_1 \cup \{\epsilon\}} f(u, v).

We use the Euclidean cost model for the structural comparison of keypoint graphs. It considers the Euclidean distance of the (x, y) labels for node label substitution, a constant cost c_n for node insertion and deletion, and a constant cost c_e for edge insertion and deletion.

4.6 Keyword spotting

To spot a Chu Nom character (Figure 2, step 6), n template images of the keyword are collected from real steles and keypoint graphs are extracted. Afterwards, the minimum HED score

score(g) = \min_{t \in T} HED(g, t)

is calculated for each character graph g of the stele images with respect to the template graphs T = {t_1, ..., t_n}. Finally, the character graphs are sorted according to the spotting score, such that the most similar character graphs appear in the top ranks.
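To make the matching step concrete, the following minimal sketch computes a simplified Hausdorff edit distance with the Euclidean cost model and the template-minimum spotting score. A graph is reduced here to its z-score-normalized node coordinates and node degrees, and the adjacent-edge costs are approximated through the degree difference; this is a simplification for illustration, not the implementation of [17] used in our experiments.

```python
# Minimal sketch of the Hausdorff edit distance with the Euclidean cost model and of the
# template-minimum spotting score. A graph is represented as a tuple (coords, degrees):
# an (n, 2) array of normalized node coordinates and the node degrees. Charging half of
# the adjacent-edge cost through the degree is a simplification of the published cost model.
import numpy as np

def hed(g1, g2, c_node=0.9, c_edge=0.9):
    (p1, d1), (p2, d2) = g1, g2
    # Pairwise substitution costs: Euclidean label distance plus half the edge-degree difference.
    sub = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2)
    sub += 0.5 * c_edge * np.abs(d1[:, None] - d2[None, :])
    # Deletion/insertion costs against the empty node epsilon (node plus adjacent edges).
    del1 = c_node + 0.5 * c_edge * d1
    ins2 = c_node + 0.5 * c_edge * d2
    # Each node is matched with its cheapest counterpart or with epsilon; quadratic time.
    cost1 = np.minimum(sub.min(axis=1), del1).sum()
    cost2 = np.minimum(sub.min(axis=0), ins2).sum()
    return cost1 + cost2

def spotting_score(character_graph, template_graphs, **costs):
    """Minimum HED between a detected character graph and all keyword template graphs."""
    return min(hed(character_graph, t, **costs) for t in template_graphs)
```

Sorting all character graphs of the stele images by this score in ascending order yields the retrieval ranking that is evaluated below.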
For evaluating the spotting performance, precision (P) and recall (R) are calculated as

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN},

with respect to the number of true positives (TP), false positives (FP), and false negatives (FN) for each possible score threshold. Then, the average precision (AP) is calculated for each keyword and the mean average precision (mAP)

mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i

over all K keywords is used as the final performance measure for keyword spotting.

5 Experiments

5.1 Spotting scenarios

To evaluate the proposed hybrid deep learning method, the 5,138 ground truth characters of the 20 stele images (see Section 3) are randomly separated into three distinct sets for template selection (50%), validation (25%), and testing (25%), respectively. The template selection set is used to select n = 5 templates per keyword, the validation set is used for optimizing hyper-parameters, and the test set is used for evaluating the final spotting performance. A total of K = 128 keywords are spotted, which appear at least 5 times in the template selection set, at least once in the validation set, and at least once in the test set.

We compare three spotting scenarios with respect to the use of human annotations, as listed in Table 1:

– The fully annotated scenario uses ground truth labels for parameter optimization as well as performance evaluation.

– The font-validated scenario does not require human annotations for parameter optimization. Instead, a synthetic font-based validation set is used (see below).

– The annotation-free scenario, which is the target scenario for the proposed method, does not require any human annotations. It evaluates the keyword spotting system with respect to automatically detected characters instead of ideal ground truth locations.

The synthetic font-based validation set is created as follows. 20 keywords are selected randomly and printed in 5 different Chu Nom fonts. 900 other characters are printed and added to the validation set, which is composed of 1,000 characters in total. Each of the keywords is then spotted on the validation set using a single template and the mAP results are used to compare and optimize different parameter settings.

Furthermore, we compare two spotting scenarios with respect to the engraving styles:

– The same style scenario spots keywords on each stele image separately, such that the engraving style of the keyword templates is the same as the style of the stele characters.

– The mixed styles scenario spots keywords across all 20 stele images, taking into account different engraving styles.

5.2 Parameter optimization

For the character detection network (see Sections 4.2 and 4.3), we consider only one set of hyper-parameters, i.e. the default parameters of the medium-sized YOLOv5m model (github.com/ultralytics/yolov5, commit cc03c1d5727e178438e9f0ce0450fa6bdbbe1ea7). The weights of the model are pretrained on the COCO [29] object detection dataset. The pretrained network is fine-tuned with 30,000 synthetic steles over 15 epochs until convergence. Afterwards, an additional fine-tuning epoch is used for auto-calibration with real stele backgrounds.

The parameters of the structural representation and the structural comparison (see Sections 4.4 and 4.5) are optimized in two steps. First, a default character width of S = 150 pixels and node distance of D = 5 pixels is fixed to evaluate a range of node and edge costs c_n, c_e \in {0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1} on the validation set.
Afterwards, the optimal node and edge costs are fixed and different character widths S \in {90, 120, 150, 180, 210} and node distances D \in {3, 4, 5, 6, 7} are evaluated on the validation set.

The optimal parameters are listed in Table 2 for both the fully annotated and the font-validated scenario. When considering synthetic printed characters (font-validated), we observe that a larger node distance and a larger edge cost are preferred when compared with real stele characters (fully annotated). This may be due to the increased stability, i.e. less variability, of the character shapes and character background in the case of printed fonts.

A more detailed view on the optimization of S and D is provided in the three-dimensional visualizations in Figures 5 and 6. They show that changing the parameters leads to more significant differences in mAP for the real characters when compared with synthetic ones, indicating that the synthetic validation set may need to be improved to better represent the challenges encountered for real characters.

Figure 5: Optimization of structural representation for the fully annotated scenario.

Figure 6: Optimization of structural representation for the font-validated scenario.

Table 1: Different keyword spotting scenarios with respect to human annotations.

                   Parameter Optimization               Performance Evaluation
  Fully annotated  Ground truth validation set          Ground truth test set
  Font-validated   Synthetic font-based validation set  Ground truth test set
  Annotation-free  Synthetic font-based validation set  Automatic detection test set

Table 2: Optimal parameters for structural representation and comparison.

  Parameters         Fully annotated  Font-validated
  Character width S  120              150
  Node distance D    3                7
  Node cost c_n      0.9              0.3
  Edge cost c_e      0.9              2.1

5.3 Runtime performance

For training the YOLO-based character detection model, we used 2 Nvidia Titan RTX GPUs. One training epoch on 30,000 stele images took 6.3 minutes on average and a total of 16 training epochs was sufficient for convergence. Detecting characters with the trained YOLO network took only a few milliseconds per stele.

For graph matching with the Hausdorff edit distance, we used computational nodes with 64 CPU cores (AMD EPYC, 2.25 GHz). One graph comparison took 4.4 milliseconds on average, which allowed us to spot a single keyword template in about 1 second per stele.

Note that the graph comparisons need to be performed only once. Afterwards, the positions of all keywords in the collection of stele images can be indexed, such that historians can search and retrieve keywords quasi instantly based on the index.

5.4 Spotting performance

Figure 7: Qualitative spotting results for the same style spotting scenario: templates, top 10 results (correct results in green), and keywords existing in the test set but not found in the top 10 results.

Figure 8: Qualitative spotting results for the mixed styles spotting scenario: templates, top 10 results (correct results in green), and keywords existing in the test set but not found in the top 10 results.

Table 3 shows the spotting performance on the test set when using optimized parameters. The results obtained for the steles are put into context with results obtained for Chu Nom manuscripts reported in [16].
These manuscripts are written with ink on parchment with a regular writing style, leading to better character detection quality and less noise in the character images when compared with the steles. The results are not directly comparable, because a different set of keywords was used, but they provide a point of reference for a less challenging keyword spotting task.

Table 3: Spotting performance in terms of mean average precision (mAP) on the test set.

                           Fully annotated  Font-validated  Annotation-free
  Manuscripts (Kieu) [16]  0.76             0.78            0.77
  Steles: Same style       0.85             0.81            0.72
  Steles: Mixed styles     0.56             0.50            0.40

For the same style spotting scenario, the performance level is excellent with a mAP of 72% for the annotation-free scenario. The performance is similar to that of manuscripts, which typically are better readable and have fewer variations of the writing style when compared with stele images. However, the impact of optimizing the parameters on synthetic characters, rather than real ones, is stronger (mAP reduced from 85% to 81%) and the loss in mAP for automatic character detection is stronger as well (mAP reduced from 81% to 72%). These results show the increased difficulty of spotting Chu Nom characters in stele images and leave room for improvements regarding parameter optimization and character detection.

For the mixed styles spotting scenario, the performance drops significantly to a mAP of 40% for the annotation-free scenario. It indicates a limitation of the proposed hybrid deep learning method, which did not generalize well to engraving styles that are different from the keyword templates. An application to stele collections with similar engraving styles therefore seems more promising. Note, however, that we have only used 5,138 characters in our experiments. It is possible that the method will generalize better with a larger dataset.

Figures 7 and 8 provide qualitative spotting results for both scenarios. For same style spotting, 3 out of 4 characters are correctly spotted in the first 3 ranks. However, the fourth character is not part of the top 10 results, because of an error in automatic character detection, which has included some noise in the bottom right corner. For mixed style spotting, 5 out of 10 characters appear in the top 10 ranks, but not in the first 5 ranks. The remaining 5 characters are not part of the top 10 results, due to noise but also due to different engraving styles, which are not represented in the keyword template images.

6 Conclusions

The proposed hybrid deep learning approach to keyword spotting aims to combine the strengths of data-driven methods with knowledge-based modeling. In a first step, a deep convolutional neural network is trained on a large synthetic dataset to detect printed Chu Nom characters. By means of self-calibration, the network is then automatically adapted to the stele images. In a second step, the detected characters are modeled by means of keypoint graphs and the Hausdorff edit distance is used to efficiently perform a structural comparison for retrieving keywords.

Especially when the engraving style of the keyword is the same as the style of the stele characters, an excellent mean average precision of over 70% is achieved. In the case of mixed engraving styles, however, the spotting results drop to about 40% mean average precision.
Although this performance level is still helpful for historians to browse large image collections of heterogeneous steles, there is clearly room for improvement.

There are several interesting lines of future research to further improve the results. Staying in the same style scenario, future work includes the investigation of style clustering, such that similar engraving styles can be identified across a large number of stele images. Noise removal methods may also be interesting to avoid spotting mistakes due to non-character artifacts.

With respect to the mixed style scenario, it may be necessary to perform some sort of data-driven learning to improve the spotting results, for example by means of geometric deep learning with graph neural networks [30]. In order to avoid the requirement of human annotations, it would be interesting to pursue a self-calibration strategy similar to the self-calibration of the character detection network.

Finally, a promising line of future research would be to generalize the proposed spotting method to other historical scripts and languages.

References

[1] Andreas Fischer, Marcus Liwicki, and Rolf Ingold, editors. Handwritten Historical Document Analysis, Recognition, and Retrieval — State of the Art and Future Trends. World Scientific, 2020.

[2] Raghavan Manmatha, C. Han, and E. M. Riseman. Word spotting: A new approach to indexing handwriting. In Proc. Int. Conf. on Computer Vision and Pattern Recognition, pages 631–637, 1996.

[3] T. M. Rath and R. Manmatha. Word spotting for historical documents. Int. Journal on Document Analysis and Recognition, 9:139–152, 2007.

[4] K. Terasawa and Y. Tanaka. Slit style HOG features for document image word spotting. In Proc. 10th Int. Conf. on Document Analysis and Recognition, pages 116–120, 2009.

[5] M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós. Browsing heterogeneous document collections by a segmentation-free word spotting method. In Proc. 11th Int. Conf. on Document Analysis and Recognition, pages 63–67, 2011.

[6] Michael Stauffer, Andreas Fischer, and Kaspar Riesen. Graph-based Keyword Spotting. World Scientific, 2019.

[7] A. Fischer, A. Keller, V. Frinken, and H. Bunke. Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters, 33(7):934–942, 2012.

[8] L. Rothacker, M. Rusiñol, and G. A. Fink. Bag-of-features HMMs for segmentation-free word spotting in handwritten documents. In Proc. 12th Int. Conf. on Document Analysis and Recognition, pages 1305–1309, 2013.

[9] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke. A novel word spotting method based on recurrent neural networks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(2):211–224, 2012.

[10] Sebastian Sudholt and Gernot A. Fink. PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In Proc. 15th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), pages 277–282, 2016.

[11] Thai V. Hoang, Salvatore Tabbone, and Ngoc-Yen Pham. Extraction of Nom text regions from stele images using area Voronoi diagram. In Proc. 10th Int. Conf. on Document Analysis and Recognition, pages 921–925, 2009.

[12] Anna Scius-Bertrand, Lars Voegtlin, Michele Alberti, Andreas Fischer, and Marc Bui. Layout analysis and text column segmentation for historical Vietnamese steles.
In Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, pages 84–89, 2019.

[13] Anna Scius-Bertrand, Michael Jungo, Beat Wolf, Andreas Fischer, and Marc Bui. Annotation-free character detection in historical Vietnamese stele images. In Proc. 16th Int. Conf. on Document Analysis and Recognition (ICDAR), pages 432–447, 2021.

[14] Jonas Diesbach, Andreas Fischer, Marc Bui, and Anna Scius-Bertrand. Generating synthetic styled Chu Nom characters. In Proc. 18th Int. Conf. on Frontiers in Handwriting Recognition (ICFHR), 2022.

[15] Kha Cong Nguyen, Cuong Tuan Nguyen, and Masaki Nakagawa. Nom document digitalization by deep convolution neural networks. Pattern Recognition Letters, 133:8–16, 2020.

[16] Anna Scius-Bertrand, Linda Studer, Andreas Fischer, and Marc Bui. Annotation-free keyword spotting in historical Vietnamese manuscripts using graph matching. In IAPR Joint International Workshops on Statistical Techniques in Pattern Recognition (SPR 2022) and Structural and Syntactic Pattern Recognition (SSPR 2022): S+SSPR, 2022.

[17] A. Fischer, C. Y. Suen, V. Frinken, K. Riesen, and H. Bunke. Approximation of graph edit distance based on Hausdorff matching. Pattern Recognition, 48(2):331–343, 2015.

[18] A. Scius-Bertrand, A. Fischer, and M. Bui. Retrieving keywords in historical Vietnamese stele images without human annotations. In Proc. 11th Int. Symposium on Information and Communication Technology (SoICT), 2022.

[19] Philippe Papin. Aperçu sur le programme "Publication de l'inventaire et du corpus complet des inscriptions sur stèles du Viêt-Nam". Bulletin de l'École française d'Extrême-Orient, 90(1):465–472, 2003.

[20] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[21] Glenn Jocher et al. ultralytics/yolov5: v4.0 - nn.SiLU() activations, weights & biases logging, PyTorch Hub integration. DOI: 10.5281/zenodo.4418161, 2021.

[22] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 390–391, 2020.

[23] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.

[24] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, pages 226–231, 1996.

[25] A. Fischer, K. Riesen, and H. Bunke. Graph similarity features for HMM-based handwriting recognition in historical documents. In Proc. Int. Conf. on Frontiers in Handwriting Recognition, pages 253–258, 2010.

[26] Paul Maergner, Vinaychandran Pondenkandath, Michele Alberti, Marcus Liwicki, Kaspar Riesen, Rolf Ingold, and Andreas Fischer. Combining graph edit distance and triplet networks for offline signature verification. Pattern Recognition Letters, 125:527–533, 2019.

[27] H. Bunke and G. Allermann. Inexact graph matching for structural pattern recognition. Pattern Recognition Letters, 1(4):245–253, 1983.
[28] A. Sanfeliu and K. S. Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. on Systems, Man, and Cybernetics, 13(3):353–363, 1983.

[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[30] Pau Riba, Andreas Fischer, Josep Lladós, and Alicia Fornés. Learning graph edit distance by graph neural networks. Pattern Recognition, 120:108132, 2021.