Image Anal Stereol 2018;36:159-171 doi: 10.5566/ias.1859
Original Research Paper

LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES

FATEMEH VAKHSHITEH1, FARSHAD ALMASGANJ1, AHMAD NICKABADI2
1Department of Biomedical Engineering, Amirkabir University of Technology; 2Department of Computer Engineering and IT, Amirkabir University of Technology, Iran
e-mail: f.vakhshiteh@aut.ac.ir, almas@aut.ac.ir, nickabadi@aut.ac.ir
(Received November 18, 2017; revised April 29, 2018; accepted May 8, 2018)

ABSTRACT

Lip-reading is typically known as visually interpreting the speaker's lip movements during speaking. Experiments over many years have revealed that speech intelligibility increases if visual facial information becomes available, and this effect becomes more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, the diversity of visual features, and their inter-speaker dependency. Although efforts have been made to overcome these challenges, a flawless lip-reading system has yet to be presented. This paper searches for a lip-reading model with an efficient incorporation and arrangement of processing blocks to extract highly discriminative visual features, and highlights the application of a properly structured Deep Belief Network (DBN)-based recognizer. Multi-speaker (MS) and speaker-independent (SI) tasks are performed over the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition works.

Keywords: Deep Belief Networks, Hidden Markov Model, lip-reading, Restricted Boltzmann Machine

INTRODUCTION

Lip-reading is typically known as visually interpreting the movements of the speaker's lips during speaking. From a mathematical point of view, it requires converting the mouth movements into a truthful representation suitable for visual recognition (Potamianos et al., 2003). Experiments over many years have revealed that speech intelligibility increases if both the audio and visual information are available (McClain et al., 2004, Sumby and Pollack, 1954). This effect becomes more apparent in noisy conditions such as hospital emergency rooms. In some works, the visual information extracted from lip movements is integrated into Automatic Speech Recognition (ASR) systems to contribute to their robustness and accuracy; this is especially evaluated when the audio stream is corrupted by different levels of noise (Potamianos et al., 2003).

Taking a step toward automating the lip-reading process, several challenges appear compared to conventional audio speech recognition tasks. Firstly, the coarticulation phenomenon can cause the visible speech articulators to be in different positions for the same underlying sound, and thus the visual features become more dependent on the context (Lan et al., 2012a). In some studies, viseme (the facial and lip configuration corresponding to a particular phoneme) classes are used to resolve this issue (Bowden et al., 2012, Savchenko and Khokhlova, 2014).
However, other issues are faced in real conditions, such as the homophone effect, insufficient training data per class, and a substantial lack of distinction between classes when a set contains too many visemes. Secondly, the choice of features for visual speech recognition is rather wide. In the literature, several types of visual features have been evaluated, commonly categorized as those depending on pixels and those based on models (Matthews et al., 2002). Although some efforts have been made in this regard, the extraction of optimized visual features is still under investigation (Lan et al., 2009, Lan et al., 2010). Thirdly, the selected visual features are data-driven and are therefore greatly speaker-dependent. Hence, many of the automated lip-reading systems presented so far use either a speaker-dependent or a multi-speaker configuration.

Finding a method that faces all these challenges in unison is still under investigation. The current study introduces an approach that performs the major parts of a lip-reading system, with the main emphasis on the feature extraction part. This is done by developing an efficient incorporation and arrangement of certain function blocks, by which highly informative yet discriminative visual features are extracted from lip images. The proposed feature extraction pipeline passes the speakers' frame sequences through the function blocks of face detection, lip localization, scaling and reshaping, and meaningful feature extraction by a combination of Principal Component Analysis (PCA), Active Shape Model (ASM), and geometric feature calculations. The pipeline is further improved by normalization, the addition of the Linear Discriminant Analysis (LDA) technique, the Maximum Likelihood Linear Transform (MLLT), and Speaker Adaptive Training (SAT), and by the augmentation of dynamic information. The novelty of this research lies in the arrangement and the inclusion of new function blocks in a way that results in appropriate "hybrid" visual features. The advantage of the proposed feature extraction over some state-of-the-art methods is that it avoids the challenges of robustly tracking the lip contour and of hand-labelling the shape parameters in a corresponding set of images. Attaching the mentioned standard feature-improvement techniques to the end of the pipeline increases the efficiency and enhances the accuracy of the process. The approach is completed by integrating a properly designed DBN-HMM-based final recognizer, which is explored by working on different DBN topologies and parameter settings. To the best of our knowledge, this feature extraction pipeline has not been devised in this way previously.

Experiments were conducted on the CUAVE corpus, in two separate tasks: Multi-Speaker (MS) and Speaker-Independent (SI). The proposed method was evaluated at both the phoneme and word levels; the feature extraction mechanism was first assessed with a baseline HMM recognizer.

The rest of this paper is organized as follows: Section 2 reviews some recent seminal studies of lip-reading. Section 3 describes the visual feature extraction process and the Deep Neural Networks (DNNs) exploited for classification. Section 4 reports the experimental results: first, the HMM baseline model with conventional visual features is explained.
Next, various deep architectures of a certain neural network type, the DBN, are explored, and finally, phoneme- and word-level recognition rates are presented. Discussions are provided in Section 5, and conclusions are given in Section 6.

RELATED WORKS

Typically, automating the lip-reading process involves some challenges, including the coarticulation effect, the diversity of visual features, and the speaker-dependency of the features. The history of these issues, along with some proposed solutions, can be found in the literature.

Regarding the coarticulation effect and visual feature diversity, the following studies have reached noticeable solutions. In (Lan et al., 2010), it is demonstrated that the effects of inter-speaker variability of features can be reduced by applying per-speaker z-score normalization and the hierarchical LDA (HiLDA) technique (Potamianos et al., 2004). Using an HMM recognizer, the best viseme accuracies of 45% and 50% are achieved in the MS and SI tasks, respectively. In (Lan et al., 2012b), the best viewing angle for an automated lip-reading system is studied, using a purpose-built audio-visual speech database called LiLiR. The database contains multi-camera, multi-angle recordings of a speaker reciting 200 sentences from the Resource Management corpus, with a vocabulary size of 1000 words. The visual features extracted in this work are based on Active Appearance Models (AAMs), including the shape, appearance, "CAT" (simple concatenation of the shape and appearance parameters), "CSAM" (application of PCA over the mentioned concatenation), and HiLDA variations. These features are extracted from the respective views and appended with their second derivatives (ΔΔ). For all features, z-score normalization is applied, which has been shown to improve the separability of the features among the involved classes.

Regarding the speaker-dependency issue, several studies have addressed single-speaker or MS conditions. Some of these studies propose deep architectures for their lip-reading systems. DNNs have proven to be effective tools for feature extraction and classification tasks (Hinton et al., 2012). Many studies have recently been published in which ASR systems are implemented by employing various deep learning techniques. Some of these works demonstrate the use of deep learning in Audio-Visual Speech Recognition (AVSR) systems (Huang and Kingsbury, 2013, Mroueh et al., 2015, Ngiam et al., 2011, Noda et al., 2014, Srivastava and Salakhutdinov, 2012). However, the application of such techniques to automatic lip-reading deserves further investigation. In the following, lip-reading systems with different deep architectures are reviewed and their degree of speaker-dependency is highlighted.

In (Ngiam et al., 2011), PCA is applied to the mouth region-of-interest (ROI), and a deep autoencoder (AE) is trained to extract bottleneck (BN) features. The features from the entire utterance were fed to a support vector machine, ignoring the temporal dynamics of the speech. Digit classification accuracies of 64.4% and 68.7% are achieved in this study, using the AVLetters and CUAVE databases, respectively. In (Srivastava and Salakhutdinov, 2012), a feature extraction approach similar to that of (Ngiam et al., 2011) is followed, while a deep Boltzmann machine (DBM) is trained to extract multimodal representations.
The employed features are extracted from the entire utterance and fed to a support vector machine. Working on the CUAVE database, a digit classification accuracy of 69% is achieved.

In (Noda et al., 2014), a convolutional neural network (CNN) is proposed to act as the feature extraction block of a lip-reading system; the speaker's mouth area images, along with the phoneme labels, are used during the training phase. The evaluation is done on an audio-visual speech database comprising 300 Japanese words with six different speakers, each of which is modeled with an independent CNN. An average phone recognition rate (PRR) of about 58% (for 40 phonemes, normalized by the number of samples per phoneme over the six speakers) is attained in this paper, using 64×64-pixel mouth area images as input.

In (Mroueh et al., 2015), deep multimodal learning methods are proposed to fuse the speech and visual modalities in an AVSR system. Two unimodal deep networks are first trained separately over the audio and video data, and their final hidden layers are fused to form a joint feature, which is further used in another deep network. The experiments are conducted on the IBM large-vocabulary audio-visual studio database, leading to phone error rates (PERs) of 69.36% and 41% for the video-only and audio-only DNNs, respectively.

In (Almajai et al., 2016), an SI lip-reading system is designed using the Resource Management (RM) database. A combination of MLLT followed by SAT is applied over the CSAM features introduced in (Lan et al., 2012b). HMM models are trained over phoneme and viseme units, with the phoneme implementation proving superior. The best word accuracies of 45% and 54% are achieved with the HMM and DNN, respectively.

In (Wand et al., 2016), lip-reading is performed with a processing pipeline based purely on neural networks. In this study, Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are stacked to form a single structure. Using raw mouth images as the network inputs, the performance of such a stacked network is experimentally evaluated and compared to a standard Support Vector Machine (SVM) classifier. Evaluations are performed in speaker-dependent tasks on data from 19 speakers of the GRID corpus (Cooke et al., 2006). Averaging over the speakers, the best word recognition rate of 79.6% is reported, using the end-to-end neural-network-based architecture.

An end-to-end visual speech recognition system based on Long Short-Term Memory (LSTM) networks is proposed in (Petridis et al., 2017). The model consists of two streams that extract features directly from the mouth images and the difference images, respectively. In this study, the temporal dynamics in each stream are modelled by an LSTM, and the fusion of the two streams takes place via a Bidirectional LSTM (BLSTM). A classification accuracy of 84.5% over utterances is reported on the OuluVS2 database; an accuracy of 78.6% is reached on the CUAVE database with a similar visual front-end.

In (Stafylakis and Tzimiropoulos, 2017), an end-to-end deep learning architecture for word-level visual speech recognition is proposed. The system is made of spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. All evaluations are done on the Lipreading In-The-Wild benchmark, a challenging database with 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. In this study, a word accuracy of 83.0% is attained by the proposed network.
From the modeling perspective, the approach of the current research falls into the category of neural-network-based works reported in the literature. In particular, the final recognizer is designed based on DBN-HMMs and implemented in the MS and SI tasks. The performance of the developed system is evaluated by the phone and word recognition accuracy measures, inside a digit recognition task. Working on a database limited to the digits 0-9 makes it more rational to select phonemes as the recognition units. Thus, the PRRs are measured and reported primarily. Once a proper DBN architecture is found, the WRRs are reported as well. The effect of feeding the visual features to the involved DBN is evaluated by devoting a considerable part of this research to the proposed visual feature extraction scheme.

MATERIAL AND METHODS

In this section, the proposed lip-reading method and its two major parts, feature extraction and classification, are explained. First, the video files are converted to frame sequences, and the corresponding features are extracted. Next, some nearly optimized DBN architectures with proper parameter settings are sought and examined.

HYBRID VISUAL FEATURE EXTRACTION

Visual features have been broadly categorized as those depending on video pixels and those based on models; in the literature, these two kinds of features are referred to as appearance- and shape-based features, respectively. Previous works have shown that raw features, such as dimensionally reduced image intensities, can be used as inputs to neural networks while resulting in good classifications (Ngiam et al., 2011). In (Lan et al., 2009) it is mentioned that if two types of appearance- and shape-based features, such as pixel intensities and contour shapes of the ROI, are combined, more informative feature vectors can be created. Accordingly, we devise our feature extraction process to have at least two function blocks of appearance- and shape-based feature extractors.

The lip-reading system is thus made up of the following function blocks: face detection; lip localization followed by ROI bounding-box extraction; scaling and reshaping; meaningful feature extraction; inclusion of inter-class discrimination and speaker adaptation techniques; and consideration of dynamic information. Many of these blocks are the same as those used in conventional visual speech recognition systems; however, their particular arrangement and the introduction of some new blocks make the whole process behave differently.

The general scheme of the proposed feature extraction process is represented in Fig. 1. As mentioned earlier, each video file is converted to a sequence of frames, in which the speaker's body is visible from the shoulders to the head. Each frame is pre-processed so as to extract only the ROI encompassing the mouth. For this purpose, the Active Shape Model algorithm (Cootes et al., 1995) is used, via which the face landmarks are obtained. Face detection is performed by tracking the landmarks placed on the border of the face. Eighteen landmarks are positioned on the inner and outer lip contours; similarly, by tracking these points, an exact bounding box (B.B) of the lips is obtained. Each detected mouth B.B is rescaled to a 25×40-pixel size and finally converted to a 1000-dimensional feature vector. This vector is further transformed to a 32-dimensional vector using the PCA algorithm; in this way, the so-called appearance-based features are created.
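As a point of reference, this appearance-feature step can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the mouth bounding boxes are already cropped as grayscale arrays, uses `cv2.resize` plus a plain SVD-based PCA in place of whatever tooling was actually used, and all helper names are placeholders.

```python
import numpy as np
import cv2  # OpenCV, used here only to rescale the mouth crops

def mouth_vectors(mouth_crops):
    """Rescale each grayscale mouth crop to 25x40 pixels and flatten it to a
    1000-dimensional row vector (one row per frame)."""
    return np.stack([cv2.resize(m, (40, 25)).reshape(-1) for m in mouth_crops])

def fit_pca(train_matrix, k=32):
    """Return the mean and the top-k principal directions of the training matrix."""
    mean = train_matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(train_matrix - mean, full_matrices=False)
    return mean, vt[:k]

def apply_pca(vectors, mean, components):
    """Project 1000-dim appearance vectors onto the 32 retained components."""
    return (vectors - mean) @ components.T

# Usage: fit PCA on training frames, then map every frame of an utterance to a
# 32-dimensional appearance-based feature vector.
# mean, comps = fit_pca(mouth_vectors(train_crops))
# feats = apply_pca(mouth_vectors(utterance_crops), mean, comps)   # shape (T, 32)
```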
From the (x, y) lip coordinates, the width and height of the mouth are calculated and used as geometric features. Concatenating the x- and y-coordinates with these geometric features, a 38-dimensional vector is created, which we call the shape-based feature set. These features are combined with the appearance-based features, making a 70-dimensional feature vector.

In the CUAVE database, the video files are recorded with the NTSC standard at 29.97 frames per second (fps) (Patterson et al., 2002). They are accompanied by audio files that are segmented into frames at a rate of 100 fps. The transcription is made according to the audio files, giving 100 labels per second. The sequence of 70-dimensional vectors is therefore up-sampled to 100 fps, so that a direct mapping to the corresponding phoneme labels becomes possible.

To complete the feature vector created up to this stage, z-score normalization is applied, and augmentation is made by adding the first and second derivatives (Δ/ΔΔ) of the evaluated features.

The extension of the process with the next three blocks, LDA, MLLT, and SAT, is inspired by (Almajai et al., 2016), where it is shown that the MLLT and SAT techniques improve the phone and word recognition accuracies significantly, especially when a DNN recognizer is applied thereafter. LDA is implemented to learn the optimal projection that best describes the dynamics of speech: from the feature vector acquired so far, it learns a set of orthogonal projections that maximize the between-class distance while minimizing the within-class distance. Applying LDA yields 40-dimensional feature vectors, which are further decorrelated using the standard MLLT technique. With MLLT, a linear transform of the input features is found under which the assumption of a diagonal covariance matrix is most valid; it can be shown that inter-class discrimination improves under this condition. The SAT technique is then applied using feature-space Maximum Likelihood Linear Regression (fMLLR) transforms of size 40×41. With this technique, the effect of variation in the features of different speakers is normalized, so as to model the intra-speaker variability and to avoid modelling the inter-speaker variability.

The 40-dimensional speaker-adapted features are then spliced across a window of n frames to make a hyper-vector, which we call the hybrid features in this paper. The contiguous frames are chosen as the (n-1)/2 frames before and after the central vector. The choice of a proper value for n is discussed in the "Multi-speaker task" section.

The PCA dimensionality-reduction block was formerly used in conventional visual speech recognition systems; however, by deliberately placing it after the blocks of detection, localization, scaling, normalization, and reshaping (represented in Fig. 1), we aim to obtain more beneficial appearance-based features at this point, while avoiding the challenges of robustly tracking the lip contour and hand-labelling the shape parameters (the x- and y-coordinates of a set of s vertices that delineate the lip contour) in a corresponding set of images. The greatest influence of the proposed feature extraction process comes from augmenting the features with the higher-level geometric features. Extending the process with the established MLLT and SAT techniques adds another weighty advantage, which had not been considered in this manner before. A dashed line in Fig. 1 shows where the HMM and DBN recognizers differ in taking their input feature vectors (discussed in the "Baseline Model" section).

Fig. 1. The schematic of the hybrid feature extraction process.
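The normalization, Δ/ΔΔ augmentation, and frame-splicing steps of the pipeline can be sketched as follows. This is a minimal numpy illustration, not the authors' code: the ±2-frame delta regression window and the edge-padding policy are assumptions (they follow common HTK/Kaldi defaults), and whether z-scoring is done per speaker or per utterance is left to the caller.

```python
import numpy as np

def zscore(feats):
    """Per-dimension z-score normalization of a (T, D) feature stream."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def deltas(feats, win=2):
    """Regression-based delta coefficients over a +/-win frame window."""
    T, _ = feats.shape
    pad = np.concatenate([np.repeat(feats[:1], win, 0), feats, np.repeat(feats[-1:], win, 0)])
    num = sum(t * (pad[win + t:win + t + T] - pad[win - t:win - t + T]) for t in range(1, win + 1))
    return num / (2 * sum(t * t for t in range(1, win + 1)))

def splice(feats, context):
    """Stack each frame with `context` frames on either side (window n = 2*context + 1)."""
    T, _ = feats.shape
    pad = np.concatenate([np.repeat(feats[:1], context, 0), feats, np.repeat(feats[-1:], context, 0)])
    return np.stack([pad[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

# A 70-dim static stream is z-scored and becomes 210-dim with deltas and delta-deltas.
static = np.random.randn(300, 70)
d = deltas(zscore(static))
augmented = np.hstack([zscore(static), d, deltas(d)])      # (300, 210)
# After LDA/MLLT/SAT the stream is 40-dimensional; splicing 10 frames on each
# side (n = 21) yields the 840-dimensional hybrid hyper-vectors used later.
hybrid = splice(np.random.randn(300, 40), context=10)      # (300, 840)
```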
DEEP BELIEF NETWORK CLASSIFIER

A Restricted Boltzmann Machine (RBM) is a stochastic neural network represented by a bipartite graph, in which visible units are connected to hidden units through undirected weighted connections. The visible and hidden units can have any distribution in the exponential family (Welling et al., 2005); however, the Bernoulli and Gaussian distributions are the most commonly used. RBMs in a stacked scheme can produce a single multi-layer generative model known as a Deep Belief Network (Hinton et al., 2006). Weight initialization is one of the fundamental problems in the corresponding deep training process. Hinton et al. (2006) introduced a layer-wise pre-training algorithm that leads to a good initialization of the weights and biases. After pre-training, DBNs are normally fine-tuned with respect to a typical supervised criterion, such as the cross-entropy, to improve the network performance.

APPLICATION OF THE DBN IN THIS WORK

In order to apply the DBN-HMM to the current phone recognition task, the proposed visual features are used to set the states of the visible units of the lowest layer of the involved DBN; this produces a probability distribution over the possible labels of the central frame of the corresponding consecutive frames. In the current work, the visual inputs are represented by real-valued feature vectors, and the hidden units are assumed to be binary. Therefore, the first building block of the stack is an RBM with a Gaussian-Bernoulli (G-B) distribution, while the other stacked RBMs have Bernoulli-Bernoulli (B-B) distributions.

The DBN-HMM system is trained using the state alignment and the decision tree derived from the GMM stage. The mini-batch stochastic gradient descent technique is employed to initialize the weights of the DBN. For the G-B RBM, 50 epochs are run with a learning rate of 0.05, while for the B-B RBMs, 25 epochs are run with a fixed learning rate of 0.6; these learning rates were found to be optimal after testing different values. Next, a fine-tuning stage follows to complete the DNN-HMM hybrid model. To develop this hybrid model, DBNs with different architectures are trained to estimate the posterior probabilities of the corresponding HMM states (Veselý et al., 2013). The hybrid models are then used as the implemented Visual Model (VM) to generate the VM scores for the lattice-generating decoder.

Fig. 2 shows the overall architecture of the described lip-reading system. This architecture is used to decode the test video streams, after a training phase in which the visual and language models (LMs) are trained over the selected training set. The bigram LM is prepared from the constructed lexicon and affects the weights of the lattice trees in the decoding process (Mohri et al., 2008). All the implementations needed to realize the described classifier and to conduct the experiments are accomplished using the Kaldi speech recognition toolkit (Povey et al., 2011).
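For illustration, the contrastive-divergence (CD-1) update underlying this kind of RBM pre-training can be sketched as follows. This is a generic numpy sketch of the Gaussian-Bernoulli first layer (unit-variance visible units assumed), not the Kaldi implementation actually used in the experiments; only the 0.05 learning rate is taken from the text above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.05, rng=np.random):
    """One CD-1 update for a Gaussian-Bernoulli RBM.
    v0: (batch, n_vis) real-valued visible data (e.g. spliced hybrid features),
    W:  (n_vis, n_hid) weights, b: visible biases, c: hidden biases."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + c)
    h0 = (rng.random_sample(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: reconstruct visibles (conditional mean, unit variance) and re-infer hiddens.
    v1 = h0 @ W.T + b
    h1_prob = sigmoid(v1 @ W + c)
    # Approximate gradient and parameter update.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / batch
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b, c

# A Bernoulli-Bernoulli layer differs only in that v1 would be sampled from
# sigmoid(h0 @ W.T + b); stacking such layers, each trained on the hidden
# activations of the layer below, yields the DBN that initializes the DNN.
```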
The feature extraction and deep-learning-based classification experiments are run on a consumer-class personal computer with an AMD FX(tm)-8350 eight-core processor (4 GHz, 4 cores), 32 GB of RAM, and a single NVIDIA GeForce GTX Titan X graphics processing unit with 12 GB of on-board graphics memory.

Fig. 2. Overall architecture of the implemented lip-reading system.

To test the robustness of the proposed features across different speakers, two separate lip-reading tasks, MS and SI, are arranged. In the MS task, different DBN architectures are examined and the best model is selected. In the SI task, the investigation space is reduced to the best DBN architecture obtained in the previous task.

RESULTS

In this section, the employed database and a baseline model, along with the conventional visual features used for automatic lip-reading, are presented. For a more comprehensive comparison, the proposed hybrid features are also utilized as the input of the baseline model. Next, the experiments are conducted for the MS and SI tasks, and the corresponding results are reported.

DATABASE

The experiments are conducted over the Clemson University Audio Visual Experiments (CUAVE) database (Patterson et al., 2002). The main purpose of utilizing this corpus is to allow testing on more realistic speaker data that includes a variety of visual features and speaker movements. The corpus contains 36 individuals of different sex (17 female and 19 male), race, and ethnicity. Speakers utter the digits 0-9 in ascending and descending order in various styles: standing still; moving side-to-side, back-and-forth, or tilting the head; turning to a profile view; and then returning to the frontal view. Since the frontal view of the lips seems to carry enough information for the lip-reading task, all profile views are neglected. In this way, two separate video parts are created out of each video file. Although the moving sections of the videos are included in the experiments, utterances with incomplete facial crops in the corresponding frames are also neglected, since face detection fails on them.

BASELINE MODEL

The Hidden Markov Model (HMM) is used as the baseline recognition model; it is the method of choice for speech recognition and has been shown to be successful for automatic lip-reading (Lan et al., 2012a, Matthews et al., 2002). In the current study, the HMMs are constructed so that the results obtained with the baseline system can be compared against the DBN-HMM deep-architecture-based lip-reading system. The Kaldi toolkit is used here to build and manipulate the employed HMMs. Since the performance of the developed system is evaluated on a digit recognition task and the basic models are developed over phoneme units, a total of 19 separate phonemes plus one extra symbol for silence are considered, and 20 HMMs are finally created. The built HMM system is triphone-based and is considered the baseline system. Each HMM has three states, and the number of Gaussians per HMM state is allocated automatically based on the data count (Povey et al., 2011). To compose the Kaldi decoding graph, a phone-pair bigram language model is built and the lexicon is derived from the CUAVE database.
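To make the phone inventory concrete, the sketch below builds a digit lexicon from CMU-style pronunciations. These pronunciations are an assumption for illustration and may differ from the exact lexicon derived from CUAVE, but they likewise yield 19 phonemes, which together with silence give the 20 modelled units.

```python
# Illustrative digit lexicon with CMU-style phones (assumed; the lexicon actually
# derived from CUAVE in the paper may use a slightly different phone inventory).
DIGIT_LEXICON = {
    "zero":  ["z", "ih", "r", "ow"],
    "one":   ["w", "ah", "n"],
    "two":   ["t", "uw"],
    "three": ["th", "r", "iy"],
    "four":  ["f", "ao", "r"],
    "five":  ["f", "ay", "v"],
    "six":   ["s", "ih", "k", "s"],
    "seven": ["s", "eh", "v", "ah", "n"],
    "eight": ["ey", "t"],
    "nine":  ["n", "ay", "n"],
}

phones = sorted({p for prons in DIGIT_LEXICON.values() for p in prons})
print(len(phones), phones)   # 19 phonemes; adding "sil" gives the 20 modelled units
```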
The proposed hybrid features, along with the conventional Discrete Cosine Transform (DCT), Geometric, and ASM visual features, are employed as inputs to the baseline system. The features are selected so as to cover examples from the different categories of appearance-based, shape-based, and hybrid visual features.

To extract the ASM visual features, the Active Shape Model algorithm is used, similar to what is explained in the "Material and Methods" section. As mentioned earlier, the frames of the video files are up-sampled to 100 fps. Then, 77 landmarks representing the (x, y) coordinates of face points are extracted via the ASM algorithm. Of these, 18 points correspond to the lip contours; their coordinates are gathered to make a 36-dimensional feature vector. Delta (Δ) and delta-delta (ΔΔ) coefficients are added to this vector, and z-score normalization is applied to it. The resulting feature vector is called the ASM feature in this work.

To extract the Geometric visual features, a number of high-level features that are meaningful to humans, including the contour height, width, perimeter, and the area within the contour, are computed. In this regard, a box and a diamond encompassing the ROI are considered. As in the ASM-feature extraction procedure, up-sampling, Δ/ΔΔ augmentation, and z-score normalization are subsequently applied.

To observe the strength of transform-domain features, DCT features are examined additionally; the DCT is a useful transform for dimensionality reduction (Potamianos et al., 2004). The lip area is converted into 8×8 grayscale intensity image blocks and transformed into DCT coefficients, which are then converted to a single vector using the zigzag strategy. In the literature, the dimension of the DCT features has been chosen to be a value in the range of 20 to 35 (Morade and Patnaik, 2015, Petridis and Pantic, 2016, Potamianos and Neti, 2001). Accordingly, candidate dimensions of 24, 27, 30, and 33 are considered in this study; experiments show that the best dimension among these values is 27. The DC (zero-frequency) component is included in the features so as to provide information related to the percentage visibility of teeth and tongue (Morade and Patnaik, 2015). Next, Δ/ΔΔ coefficients are added, and finally z-score normalization is performed. The resulting feature vector is called the DCT feature in this work.

The proposed hybrid visual features are first fed to the baseline recognition system to obtain an initial evaluation of their strength. In this condition, the hybrid features are employed without concatenating the context frames, for consistency with the size limitation of the conventional HMM input. The corresponding experimental results are given in Table 1 and Fig. 5; Fig. 5 presents the complete comparison of these features after the proposed features are also applied to the DBN-HMM system.

According to Table 1, using the hybrid features results in a PRR of 61.5%. This high accuracy demonstrates the strength of the proposed features, even when a conventional HMM-based classifier is employed.

Table 1. Comparison between the proposed hybrid features and traditional visual features, using the baseline HMM classifier.

Category | Visual Feature Type | PRR (%)
Appearance-based | DCT | 38.9
Shape-based | Geometric | 40.9
Shape-based | ASM | 38.65
Hybrid | Proposed Hybrid Features | 61.5
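As a concrete illustration of how the appearance-based DCT entry in Table 1 can be computed, the sketch below applies a 2-D DCT to one grayscale block and zig-zag scans the coefficients. Only the 8×8 block size, the 27-coefficient truncation, and the retained DC term come from the text; processing a single block (rather than tiling the whole ROI and concatenating block vectors) is a simplification made for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag(block):
    """Return the entries of a square block in zig-zag (low- to high-frequency) order."""
    n = block.shape[0]
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda ij: (ij[0] + ij[1],
                                   ij[0] if (ij[0] + ij[1]) % 2 else ij[1]))
    return np.array([block[i, j] for i, j in order])

def dct_features(block, keep=27):
    """2-D DCT of one 8x8 grayscale block, zig-zag scanned and truncated to `keep`
    coefficients; the DC (zero-frequency) term is kept, as described in the paper."""
    coeffs = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
    return zigzag(coeffs)[:keep]

# Toy 8x8 intensity block standing in for (part of) the lip ROI; the resulting
# 27-dim vector would later be augmented with deltas and z-score normalized.
block = np.random.rand(8, 8)
print(dct_features(block).shape)   # (27,)
```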
MULTI-SPEAKER TASK

To handle the previously defined tasks, various DBN topologies are examined in this section. The MS experiments are designed so that all speakers are present in the training phase. For this purpose, a special version of the 6-fold cross-validation technique is used. As mentioned earlier, the employed database is conditioned so as to include two distinct video parts for each speaker. Our plan is to divide the individual sections of the database into three subdivisions of 66.66% (24 speakers), 16.66% (6 speakers), and 16.66% (6 speakers), devoted to the training, testing, and development sets, respectively. However, to satisfy the condition that all speakers attend the training, both video parts of the first 12 speakers, along with the first video parts of the remaining 24 speakers (12 of the training, 6 of the development, and 6 of the test subdivisions), are used to train the DBN, while the second video parts of the remaining 24 speakers are reserved for the test and development sets. This procedure is then repeated for all six possible held-out groups.

Using hybrid visual features

Here, the visual features proposed in the "Material and Methods" section are used as the DBN inputs, while different DBN topologies are examined. The initial DBN architecture is inspired by (Mohamed et al., 2012) and defined as 440 – 1024 – 1024 – 1024 – 1024 – 60, where 440 is the network input dimension (the 40-dimensional features spliced across a window of 11 frames), 1024 is the number of units in each intermediate hidden layer, and 60 is the network output dimension, corresponding to the HMM states. As mentioned before, each phoneme class is modeled with a single HMM of 3 states; thus, 60 states result overall for all the modelled units.

This architecture is modified parametrically in four separate experiments. First, the effect of varying the number of hidden units (the network's width) is studied; in this regard, 256, 512, 1024, and 2048 units are examined separately for the intermediate RBMs. Secondly, the effect of the number of hidden layers (the network's depth) is examined, considering 2, 3, 4, 5, 6, and 7 layers. Thirdly, the effect of the number of context frames is investigated, considering a wide range of window sizes from 3 to 27 frames. Finally, the effect of the learning rate is investigated, considering different pairs of values for the G-B and B-B RBMs. The resulting accuracies are represented in Fig. 3.

According to the results represented in Fig. 3, a DBN with a width of 1024, a depth of 4 hidden layers (6 layers overall), and 11 context frames at the input results in a PRR of 71.65% when the learning rates are set to 0.01 and 0.4 for the G-B and B-B RBMs, respectively. However, by increasing the input window to 21 frames, the PRR is improved from 71.65% to 75.1%. By changing the learning rates to 0.05 and 0.6 for the G-B and B-B RBMs, respectively, this recognition accuracy is further improved to 77.65%, which is significant. As a result, the best examined DBN architecture is found to be 840 – 1024 – 1024 – 1024 – 1024 – 60, which leads to a PRR of 77.65%.
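To make the examined topologies concrete, the small helper below (an illustrative placeholder, not part of the authors' tooling) maps the splice window, width, and depth to the corresponding DBN layer sizes.

```python
def dbn_layout(feat_dim=40, context_frames=21, hidden_width=1024, hidden_layers=4, hmm_states=60):
    """Layer sizes of the DBN: e.g. 21 spliced frames of 40-dim features give an
    840-dimensional input layer, followed by the hidden layers and the 60 HMM-state outputs."""
    return [feat_dim * context_frames] + [hidden_width] * hidden_layers + [hmm_states]

print(dbn_layout(context_frames=11))  # [440, 1024, 1024, 1024, 1024, 60]
print(dbn_layout(context_frames=21))  # [840, 1024, 1024, 1024, 1024, 60]
```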
From these experiments it is inferred that: (a) the same size can be used for every hidden layer in the network while still achieving appropriate accuracies, and this size is found to be 1024 in the current study; (b) 1024 neurons are able to code the higher-layer visual representations when the task is phone recognition, meaning that 1024 neurons can cover the visual information of the lips when the speaker utters phonemes; (c) adding hidden layers improves performance up to 4 hidden layers, but with more layers the accuracy begins to descend, so the complexity of visual speech recognition can be handled with a network of 4 hidden layers; (d) a window of 21 frames (equivalent to 525 ms) covers the average duration of phones and the required dynamic information, as smaller input windows miss important discriminative information in the context, while networks with larger windows are probably distracted by largely irrelevant information far from the center of the window; and (e) the learning rates play a significant role that should not be neglected. For the resulting DBN topology, the WRR is found to be 80.25%, which is noteworthy.

Fig. 3. From left to right, along rows: the effect of the DBN's width, depth, splice size, and learning rate on the PRR.

SPEAKER-INDEPENDENT TASK

In the SI task, only the best architecture found in the previous experiments (840 – 1024 – 1024 – 1024 – 1024 – 60) is considered for further investigation. As in the MS task, the individual sections of the database are divided into three parts of 66.66%, 16.66%, and 16.66%, devoted to the training, testing, and development sets, respectively. A 6-fold cross-validation experiment is considered in which, for each fold, six test speakers are held out that are absent from the training and development sets. In each fold, all the utterances of those 6 speakers are used for testing, while the utterances of the remaining 30 speakers are used for the training and development processes. This procedure is then repeated for all six possible held-out groups.

Using the 840 – 1024 – 1024 – 1024 – 1024 – 60 architecture, an SI PRR of 73.40% is achieved. The corresponding WRR is measured as well and found to be 76.91%, which is remarkable. The resulting accuracies reveal that the involved deep architecture is properly designed and performs well even in the challenging SI task, in which large variations in the speakers' lip shapes are clearly present. As expected, the confusions are higher here and the misclassification rate rises. However, the confusion matrix (Fig. 4) shows that the overall classification is acceptable for this challenging task. Reading the confusion matrix from left to right, each entry is the number of times a phoneme class is visually recognized as another one. The entries are normalized and expressed as a percentage of all the samples of each target phoneme. The row and column of "silence" are removed for clarity.

Fig. 4. Confusion matrix for the test data, produced by applying the best DBN architecture to the SI task.
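The row normalization described above amounts to the following minimal sketch; the toy counts and the three-class size are placeholders (the real matrix covers the 19 phoneme classes).

```python
import numpy as np

def normalize_confusion(counts):
    """Convert a raw count confusion matrix (rows = target phonemes, columns =
    recognized phonemes) into row-wise percentages, as displayed in Fig. 4."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_sums, 1.0)   # guard against empty rows

raw = [[45, 3, 2],
       [4, 38, 8],
       [1, 6, 43]]
print(np.round(normalize_confusion(raw), 1))
```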
As demonstrated, almost all misclassification rates are lower than the corresponding correct-classification rates. Vowels (e.g., /ah/, /ow/, /ao/, /ay/, /uw/) are relatively well discriminated and recognized. This is attributed to the fact that the generation of vowels is strongly correlated with the visual cues represented by the movements of the lips and jaw (Barker and Berthommier, 1999, Yehia et al., 1998). However, some consonants are recognized less efficiently (e.g., /k/, /s/, /z/, /th/). The articulation of these consonants is mostly attributed to the dynamic interaction of the interior oral structures, such as the tongue, teeth, and oral cavity, which are invisible in frontal facial images (Noda et al., 2014). This is not the case for /v/ and /w/, consonants that are mostly articulated with the lips and are therefore visually tractable. The /r/ consonant is not labial but is clearly palatal, and hence can mostly be distinguished visually. As represented in Fig. 4, the correct classification rates of these three consonants are interestingly high.

DISCUSSION

In this paper, some issues of automatic lip-reading systems are discussed, and two contributions are accordingly proposed and developed: (a) a modified method of extracting appropriate visual features, and (b) a properly designed Deep Belief Network structure accommodated inside the hybrid framework implemented with the Kaldi toolkit.

Applying only the feature extraction modifications, the proposed hybrid visual features are utilized as the input of the baseline HMM, and a higher PRR results (61.5% vs. the best conventional-feature accuracy of 40.9%). This accuracy increases significantly (to 77.65%) when the properly structured DBN-HMM recognizer is substituted for the baseline. Details of these comparisons are represented in Fig. 5.

Fig. 5. The PRRs obtained in the MS task, using the proposed and traditional visual features with the HMM and DNN-HMM recognizers.

Some recognition accuracies reported in the visual speech recognition/lip-reading literature are summarized in Table 2. Task comprehensiveness can be read from the Task column: some studies applied their developed systems to both the MS and SI tasks, from which the strength and robustness of a model can be judged. In the last column, the accuracies are reported at the stated classification level.

Table 2. Review of seminal studies on visual speech recognition/lip-reading, their highlights, and accuracies. ++ these accuracies are achieved by incorporating sophisticated deep recognizers.

Study | Corpus | Feature | Method | Task | Class labels | Best accuracy (%)
(Lan et al., 2010) | LiLiR | HiLDA-projected AAM + z-score normalization | HMM | MS, SI | Viseme | 45 (MS), 50 (SI)
(Ngiam et al., 2011) | CUAVE | ROI + PCA + Δ/ΔΔ + context frames | Deep AE + SVM | SI | Word | 68.7
(Lan et al., 2012a) | LiLiR | CAT + ΔΔ | HMM | SI | Viseme | 52.5
(Srivastava and Salakhutdinov, 2012) | CUAVE | ROI + PCA + Δ/ΔΔ + context frames | Multimodal DBM + SVM | SI | Word | 69
(Noda et al., 2014) | Japanese database | Raw ROIs | CNN | MS | Phoneme, Word | 48 (PRR), 22.5 (WRR)
(Mroueh et al., 2015) | IBM AV-ASR | ROI + LDA + context frames | DNN | SI | Phoneme | 30.64
(Almajai et al., 2016) | RM | CSAM + LDA + MLLT + SAT | HMM / DNN | SI | Word | 45 (HMM), 54 (DNN)
(Wand et al., 2016) | GRID | Raw ROIs | LSTM | MS | Word | 79.6++
(Petridis et al., 2017) | OuluVS2, CUAVE | Raw ROIs | LSTM | SI | Word | 84.5++ (OuluVS2), 78.6++ (CUAVE)
(Stafylakis and Tzimiropoulos, 2017) | Lipreading In-The-Wild | Face ROI + resizing + normalizing | CNN + ResNet + BLSTM | SI | Word | 83.0++
This work | CUAVE | Hybrid features | HMM / DBN-HMM | MS, SI | Phoneme, Word | 61.5 (HMM baseline PRR); 77.65 (MS PRR), 80.25 (MS WRR), 73.40 (SI PRR), 76.91 (SI WRR)

Considering these highlights and challenges, the discussion proceeds in two branches, comparing the results of the current study with (a) studies that worked on the CUAVE database and (b) studies that worked on other databases. The studies that worked on the CUAVE database can be identified from the Corpus column of Table 2.
Since there is no standard partitioning of this database, it is difficult to compare the reported results of different works exactly. In (Ngiam et al., 2011), a simple visual feature extraction front-end is introduced along with new concepts regarding multimodal audio-visual feature extraction. A word recognition rate of 68.7% was reported using a video deep autoencoder (AE) followed by an SVM classifier. In (Ngiam et al., 2011), the idea of feature learning when both audio and video sources of information are available is proposed: bottleneck (BN) features are extracted that simultaneously encode the audio, the video, and their correlations. In other words, in creating the BN features, the audio information can be interpreted as primary while the visual features play an auxiliary role. In the current study, by contrast, the hybrid features are extracted from the video channel only. Although the BN representations employed in (Ngiam et al., 2011) were expected to be more beneficial for the final recognition task, the WRR of 76.91% reported in the SI task of the current study reveals the strength of the proposed hybrid visual features.

In (Srivastava and Salakhutdinov, 2012), an approach similar to that of (Ngiam et al., 2011) was followed, except that a new multimodal DBM architecture was introduced; as a result, a word recognition rate of 69% was reported. Again, although the multimodal features extracted in (Srivastava and Salakhutdinov, 2012) encoded both the audio and video information, the video-only encoding performance of the proposed hybrid visual features in the current task seems to be superior.

In (Petridis et al., 2017), an end-to-end recognizer based on LSTMs is introduced; LSTMs have proven to perform very well in classification tasks. What (Petridis et al., 2017) and the current study have in common is the use of the CUAVE database, over which the best reported SI WRRs are 78.6% and 76.91%, respectively. The roughly 2% advantage in word recognition accuracy is reasonable, given the considerably more complicated recognizer.

In the following, comparisons are made with the remaining studies, which, as stated earlier, employed different databases. The best SI PRR (73.40%) obtained in the current study can be compared with the viseme accuracies of the SI lip-reading systems reported in (Lan et al., 2010, Lan et al., 2012a). These systems utilized AAM-based features in their front-ends and resulted in viseme accuracies of 50% and 52.5%, respectively. Although the LiLiR vocabulary is larger than that of CUAVE and hence its phoneme set is bigger (44 vs. 19), the far larger speaker population in the current study (about threefold) reveals the strength of the proposed hybrid visual features. This superiority can also be seen against the video-only DNN model proposed in (Mroueh et al., 2015), in which a PER of 69.36% (an accuracy of 30.64%) was achieved using the IBM large-vocabulary audio-visual studio database. Of course, part of this 42.76% difference between the obtained accuracies (30.64% vs. 73.40%) is certainly due to the much larger lexicon and phoneme set of the IBM database compared with CUAVE. In (Almajai et al., 2016, Lan et al., 2010, Mroueh et al., 2015) and this study, essentially viseme or phoneme class labels are recognized, so the corresponding reported accuracies are more directly comparable. The comparisons are represented in Fig. 6.
Fig. 6. Viseme/phone recognition rates reported by some of the reviewed studies. Blue bars represent viseme recognition rates, while red bars depict phone recognition rates.

The best MS PRR obtained in the current study (77.65%) is higher than the corresponding PRR of 48% reported in (Noda et al., 2014). Noda et al. considered speaker-dependent settings and utilized speaker-wise CNN models. Although the Japanese database has a larger vocabulary and phoneme set than CUAVE (300 vs. 10 words and 40 vs. 19 phonemes, respectively), achieving 77.65% accuracy is noteworthy, since a higher recognition rate is obtained while more speakers (36) are incorporated.

The feature extraction and classification parts of the current study and of (Almajai et al., 2016) are comparable, since in both cases engineered features are extracted, standard techniques such as MLLT and SAT are used, and DNN-based classifiers are utilized. The superiority of the current study over (Almajai et al., 2016) can partly be accounted for by the involvement of more speakers (about twofold); however, the larger lexicon (1000 vs. 10 words) and phoneme set (44 vs. 19) of RM compared with CUAVE counterbalances this advantage. The WRR reported in the current study is higher than that reported in (Almajai et al., 2016) (76.91% vs. 54%). This indicates that the early blocks of the proposed feature extraction pipeline (up to LDA) play an effective role in reaching a higher WRR, and thus that the arrangement of the blocks is suitable.

The current study and (Wand et al., 2016) have little in common: the databases, tasks, and classification levels are different. Moreover, the MS and SI conditions are considered in the current study, whereas speaker-wise conditions are used in (Wand et al., 2016). In (Wand et al., 2016), many more context frames are taken into account by the LSTMs, and hence the effect of dynamic information is highlighted; in the current study, however, the emphasis was on the feature extraction part, and the scope was narrowed to this end.

Similarly, the current study and (Stafylakis and Tzimiropoulos, 2017) are hardly comparable, due to their completely different deep recognizers. A much simpler deep recognizer is utilized in the current study, while (Stafylakis and Tzimiropoulos, 2017) incorporates a sophisticated network made up of spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory components. The results of (Stafylakis and Tzimiropoulos, 2017) are therefore expected to be higher, since the network is considerably more complex; besides, the lexicon sizes and the types of database are different too. As mentioned earlier, the focus of this study was mainly on proper feature extraction, and the resulting features are claimed to be strong enough to be utilized in other recognizers as well.

CONCLUSION

This study has discussed some issues of automatic lip-reading systems and explored an approach to overcome them, with two major parts: visual feature extraction and classification. In the feature extraction part, hybrid features were extracted by a proposed, well-designed arrangement of certain function blocks. As the final classifier, a single DBN with a well-designed topology, inside a DBN-HMM structure, was established and employed for all speakers. Experiments were carried out on the CUAVE database, in the two separate tasks of the MS and SI conditions. The proposed method was primarily evaluated at the phoneme level and considerably outperformed the HMM baseline recognizer.
The proposed visual feature set, the so-called hybrid features, was used in two different lip-reading systems. Using the conventional HMM-based recognizer as the baseline system, a PRR of about 61.5% was obtained. Introducing the features to the input of a DBN-HMM recognizer with a well-designed DBN topology led to a PRR of 77.65% for the MS task, a considerable jump over the baseline system. In the SI task, the best PRR of 73.40% was achieved, which is notable, since the SI task has proven to be challenging (Petridis et al., 2017). The PRRs and WRRs achieved in this study are comparable with the corresponding accuracies reported in the seminal recent studies reviewed in this paper. This verifies that the proposed hybrid features are discriminative enough to be used efficiently in this task. As future work, the proposed hybrid feature set could be examined inside an Audio-Visual Automatic Speech Recognition (AV-ASR) system to investigate its benefits in combination with audio-based features.

REFERENCES

Almajai I, Cox S, Harvey R, Lan Y (2016). Improved speaker independent lip reading using speaker adaptive training and deep neural networks. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 272–6.
Barker JP, Berthommier F (1999). Evidence of correlation between acoustic and visual features of speech. In: Ohala et al. (eds), 199–202.
Bowden R, Cox S, Harvey RW, Lan Y, Ong E-J, Owen G, Theobald B-J (2012). Is automated conversion of video to text a reality? Optics and Photonics for Counterterrorism, Crime Fighting, and Defence VIII, SPIE 8546:85460U.
Cooke M, Barker J, Cunningham S, Shao X (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120:2421–4.
Cootes TF, Taylor CJ, Cooper DH, Graham J (1995). Active shape models - their training and application. Computer Vision and Image Understanding 61:38–59.
Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-r, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29:82–97.
Hinton GE, Osindero S, Teh Y-W (2006). A fast learning algorithm for deep belief nets. Neural Computation 18:1527–54.
Hochreiter S, Schmidhuber J (1997). Long short-term memory. Neural Computation 9:1735–80.
Huang J, Kingsbury B (2013). Audio-visual deep learning for noise robust speech recognition. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7596–9.
Lan Y, Harvey R, Theobald BJ, Ong EJ, Bowden R (2009). Comparing visual features for lipreading. 2009 International Conference on Auditory-Visual Speech Processing, 102–6.
Lan Y, Theobald BJ, Harvey R, Ong EJ, Bowden R (2010). Improving visual features for lip-reading. Proceedings of the 2010 Conference on Auditory-Visual Speech Processing.
Lan Y, Harvey R, Theobald BJ (2012a). Insights into machine lip reading. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4825–8.
Lan Y, Theobald BJ, Harvey R (2012b). View independent computer lip-reading. 2012 IEEE International Conference on Multimedia and Expo (ICME), 432–7.
Matthews I, Cootes TF, Bangham JA, Cox S, Harvey R (2002). Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence 24:198–213.
McClain M, Brady K, Brandstein M, Quatieri T (2004). Automated lip-reading for improved speech intelligibility. Proceedings of the 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1:I–701.
Mohamed A-r, Dahl GE, Hinton G (2012). Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing 20:14–22.
Mohri M, Pereira F, Riley M (2008). Speech recognition with weighted finite-state transducers. Springer Handbook of Speech Processing, 559–84.
Mroueh Y, Marcheret E, Goel V (2015). Deep multimodal learning for audio-visual speech recognition. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2130–4.
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011). Multimodal deep learning. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 689–96.
Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T (2014). Lipreading using convolutional neural network. 15th Annual Conference of the International Speech Communication Association.
Patterson EK, Gurbuz S, Tufekci Z, Gowdy JN (2002). CUAVE: A new audio-visual database for multimodal human-computer interface research. 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2:II-2017.
Petridis S, Pantic M (2016). Deep complementary bottleneck features for visual speech recognition. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2304–8.
Petridis S, Li Z, Pantic M (2017). End-to-end visual speech recognition with LSTMs. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2592–6.
Potamianos G, Neti C (2001). Improved ROI and within-frame discriminant features for lipreading. Proceedings of the 2001 International Conference on Image Processing, 3:250–3.
Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003). Recent advances in the automatic recognition of audio-visual speech. Proceedings of the IEEE 91:1306–26.
Potamianos G, Neti C, Luettin J, Matthews I (2004). Audio-visual automatic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Processing 22:23.
Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, Hannemann M, Motlicek P, Qian Y, Schwarz P, Silovsky J (2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, EPFL-CONF-192584.
Savchenko A, Khokhlova YI (2014). About neural-network algorithms application in viseme classification problem with face video in audiovisual speech recognition systems. Optical Memory and Neural Networks 23:34–42.
Srivastava N, Salakhutdinov RR (2012). Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, 2222–30.
Stafylakis T, Tzimiropoulos G (2017). Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105.
Sumby WH, Pollack I (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America 26:212–5.
Veselý K, Ghoshal A, Burget L, Povey D (2013). Sequence-discriminative training of deep neural networks. Interspeech, 2345–9.
Wand M, Koutník J, Schmidhuber J (2016). Lipreading with long short-term memory. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6115–9.
Welling M, Rosen-Zvi M, Hinton GE (2005). Exponential family harmoniums with an application to information retrieval. Advances in Neural Information Processing Systems, 17:1481–8.
Yehia H, Rubin P, Vatikiotis-Bateson E (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication 26:23–43.