https://doi.org/10.31449/inf.v47i3.4755    Informatica 47 (2023) 327–334

Motion Embedded Images: An Approach to Capture Spatial and Temporal Features for Action Recognition

Tri Le 1,3, Nham Huynh-Duc 1,3, Chung Thai Nguyen 1,3 and Minh-Triet Tran 1,2,3
1 Faculty of Information Technology, University of Science, VNU-HCM
2 Software Engineering Lab, University of Science, VNU-HCM
3 Vietnam National University, Ho Chi Minh City

Keywords: action recognition, motion embedded image, sports dataset, two-stream network

Received: March 22, 2023

The demand for human activity recognition (HAR) from videos has witnessed a significant surge in various real-life applications, including video surveillance, healthcare, and elderly care, among others. The explosion of short-form videos on social media platforms has further intensified interest in this domain. This research focuses on the problem of HAR in general short videos. In contrast to still images, video clips offer both spatial and temporal information, making it challenging to extract complementary information on appearance from still frames and motion between frames. This research makes a two-fold contribution. Firstly, we investigate the use of motion-embedded images in a variant of the two-stream Convolutional Neural Network architecture, in which one stream captures motion using combined batches of frames, while the other stream employs a standard image classification ConvNet to classify static appearance. Secondly, we create a novel dataset of Southeast Asian Sports short videos that encompasses both videos with and without effects, a modern factor that is lacking in all currently available datasets used for benchmarking models. The proposed model is trained and evaluated on two benchmarks: UCF-101 and SEAGS_V1. The results reveal that the proposed model yields competitive performance compared to prior attempts to address the same problem.

Povzetek: Raziskava predstavi model za prepoznavanje človeških aktivnosti iz videov in testira model na novi bazi video posnetkov jugovzhodne Azije.

1 Introduction

The task of human activity recognition (HAR) pertains to the labeling of actions or activities observed within video clips. In recent years, the proliferation of online social platforms has led to an exponential increase in the volume of media data being uploaded, with short-form videos dominating the internet landscape, beginning with TikTok and now extending to Facebook, Instagram, and YouTube. Consequently, the need for HAR has become increasingly crucial across a range of domains, including content monitoring, classification and recommendation systems, video retrieval, human-computer interaction, and robotics.

In contrast to a still image, a video clip affords not only static spatial information confined within a single frame but also temporal information that results from integrating spatial information across frames to capture dynamic motions. There exists a plethora of research investigating the challenging task of video classification. Currently, the majority of high-accuracy results have been obtained using 3D convolutional kernels to capture the temporal information within videos [1][7][3]. Nonetheless, this architecture may be cost-prohibitive to employ in practical scenarios due to its high computational requirements.
Consequently, certain approaches prioritize computational efficiency to handle larger datasets, yet may not be suitable for real-world applications [26][15][2]. These methods often necessitate powerful processors to train successfully. Conversely, training Convolutional Neural Networks (ConvNets) to acquire temporal information in videos offers a straightforward yet effective alternative. Researchers following this approach vary in their methods for processing the original frames, such as fusing temporal information early or late in the network [11], or combining multiple sequential frames to generate optical flow information [18]. Motivated by the positive outcomes of these studies and the effectiveness of ConvNet models in image recognition, we seek to explore the performance of ConvNet models for video classification. Notably, the extraction of temporal information in short videos remains a less explored domain, likely owing to its inherent difficulty. This paper introduces a novel approach for embedding both temporal and spatial features of consecutive video frames into images, thereby enabling effective recognition of the static features of a scene, such as objects, context, and entities, as well as the motion information. Specifically, we incorporate this method into a variant of the two-stream ConvNet model. The first stream leverages the images generated by our approach to detect motion in videos, while the second stream employs a conventional image classification network to recognize spatial information, utilizing single still video frames as inputs. This latter stream aims to identify and preserve any spatial information that might be lacking in the former.

To evaluate the performance of action recognition models, various publicly available datasets such as UCF-101 [19] and UCF Sports [17] have been introduced, containing 101 action and 10 sport classes, respectively. Some datasets attempt to cover a broader range of activities by including more classes [11][12], while others incorporate user-uploaded data from multiple media sources such as YouTube and Vimeo to simulate daily human activities [8][5]. Despite these efforts, most video datasets lack the complexity of videos edited with text, filters, and effects, which are prevalent in short-form videos on social networks like TikTok, Facebook, and YouTube. These limitations can lead to inaccurate benchmarking of models when applied to this new form of video content. In this research, we also aim to collect a novel dataset that includes both non-effected and effected clips. Inspired by previous datasets [17][11], we gathered data within the same Sport category and focused on South-East Asian Games sports. Our dataset, SEAGS_V1, consists of 8 sports classes and 1,168 videos sourced from YouTube and TikTok. The availability of this dataset¹ will enable researchers to evaluate the performance of their models on a more diverse range of video content.

In this study, we evaluate the performance of our proposed MEI two-stream network on two action recognition datasets, UCF-101 and SEAGS_V1. To investigate the potential of our approach further, we also experiment with different backbone architectures and integrate them into an EnsembleNet.
Our empirical results demonstrate that our proposed method holds considerable promise in enhancing the accuracy of activity recognition on short-form videos.

The content of this paper is organized as follows. In Section 2, we briefly review existing work related to action recognition. We then present our proposed method in Section 3 and discuss our experiments in Section 4. Finally, the conclusion and future work are discussed in Section 5.

2 Related Work

The early-stage methodologies employed for video classification tasks typically involve a three-stage process. Firstly, visual features of a video segment are extracted densely [20] or at a sparse set of interest points [14]. Secondly, these extracted features are combined into a fixed-sized video-level description. Lastly, a classifier, such as an SVM, is trained on the resulting "bag of words" representation to discriminate between the pertinent visual classes. Subsequently, ConvNets have replaced all three stages with a single neural network that is end-to-end trainable. There are, however, several approaches to augment the connectivity of a ConvNet in the time domain, exploiting local spatio-temporal information [9][11]. These approaches are challenged by the limitations of ConvNets in capturing motion information among frames, leading to the loss of temporal features.

¹ SEAGS_V1 is currently available online here.

2.1 Two-stream architecture

To mitigate the aforementioned challenge, researchers investigated a novel two-stream ConvNet architecture [18][21][25]. This architecture involves feeding the input videos into two distinct streams: the spatial and temporal streams. Each stream employs a deep ConvNet, with softmax scores combined by late fusion. Notably, the inputs for each stream differ slightly. The spatial stream processes individual video frames to recognize actions from still images. In contrast, the temporal stream works on pre-computed optical flow features obtained with optical flow estimation techniques, such as [23].

2.2 Spatial-temporal feature fusion method

The two-stream architecture has inspired numerous studies, with many seeking to improve its performance by focusing on two key areas: the fusion stage and the temporal stream. In an effort to optimize the fusion stage, Feichtenhofer et al. conducted a comprehensive investigation of various approaches to fusing the two networks over space and time [4]. They ultimately discovered that fusing the spatial and temporal networks at a convolution layer instead of the softmax layer yields comparable performance while substantially reducing the number of parameters. Another approach involves using a separate architecture to combine image information. Yue et al. explored two video-classification methods [22] that are both capable of aggregating frame-level ConvNet outputs into video-level predictions: Feature Pooling methods max-pool local information through time, while an LSTM's hidden state evolves with each subsequent frame.

2.3 Variations of the temporal stream

Various approaches have been explored in an effort to improve the performance of the temporal stream in the two-stream architecture. Zhang et al. investigate the replacement of optical flow with motion vectors, which can be obtained directly from compressed videos without additional calculation [24], resulting in a more than 20x speedup compared to traditional two-stream approaches.
However, motion vectors tend to lack fine structures and contain noisy and inaccurate motion patterns, leading to a decline in recognition performance. An alternative approach involves learning to predict optical flow with a supervised ConvNet. Ng et al. propose a multitask learning model, ActionFlowNet, that trains a single-stream network directly from raw pixels to jointly estimate optical flow while recognizing actions, capturing both appearance and motion in a single model [16].

In this study, we build upon the ideas of the two-stream architecture [18] and modify the temporal stream. Rather than relying on optical flow, we introduce a novel approach that embeds motion into the original frames, generating motion-embedded images that retain spatial features in the temporal stream. This is based on the belief that motion and appearance should not be separated. However, the spatial stream is still retained, because our current method for generating motion-embedded images may contain noisy and inaccurate motion patterns caused by background movement.

3 Proposed Method

In this section, we introduce our novel approach, called motion embedded image (MEI), and the corresponding two-stream network. The input video is fed into two distinct streams, the normal and motion streams, as illustrated in Figure 1. The processes in these streams are implemented separately. Prior to being input into the streams, the input can be pre-processed. These inputs are then fed into a ConvNet to perform image classification, and the prediction scores of both streams are fused to produce the final prediction. In the following subsections, we provide comprehensive details of the motion embedding technique, the motion stream, the normal stream, and the fusion stage.

Figure 1: Illustration of our proposed two-stream architecture. The normal stream (top) takes individual frames as inputs, while the motion stream (bottom) requires motion embedded images, which are a combination of consecutive video frames. The convolutional neural networks in both streams then learn to classify them. Finally, a fusion algorithm combines the normal and motion information. Both streams are end-to-end trainable.

3.1 Motion Embedding

As required by the motion stream, the input video frames must undergo a motion embedding stage. Our proposed motion embedding techniques are illustrated in Figure 2, which depicts the workflow involved in this stage. The output of this stage is motion-embedded images that convey the direction and order of motion within a single image. Furthermore, we believe that storing the spatial and temporal information simultaneously gives the Convolutional Neural Network more features to learn, as described in detail in a later subsection.

All frames extracted from the input video are numbered in order as T and segmented into batches consisting of N consecutive frames. Each batch is fed into the motion embedding stage, which comprises two components: image processing and a combinator. The image processing component is responsible for generating new images from the originals, while the combinator aggregates the processed images to create motion-embedded images. It is noteworthy that aggregating consecutive frames of a video emphasizes the parts containing static objects and contexts, highlighting the contours of the different stages of the motion, which can be easily distinguished from the static parts.
The combinator often depends on the method used in the image processing component. In the following subsections, we present two methods for processing images and their corresponding combinators.

Figure 2: Workflow of our motion embedding technique. The figure illustrates a batch of N = 5 consecutive frames from an input video before and after processing with the Equal Division method. A combinator then merges all processed frames to generate the corresponding motion embedded image.

3.1.1 Equal division

To ensure that all frames contribute equally to the MEI, we divide the values of all pixels in each frame by N. This also enables the combinator to keep the pixel values between 0 and 255. The formula for this technique is presented below:

processed_img = original_img \cdot \frac{1}{N}

In the formula, processed_img and original_img are 2D arrays representing the pixel values of the processed and original frames, respectively. The operation is performed element-wise. The combinator we suggest for this method is simply a summation of all processed images. Therefore, the final MEI for a batch ending at frame T is given by the following equation:

MEI_T = \sum_{i=T-N+1}^{T} processed_img_i    (1)

In Figure 2, a batch of five consecutive frames from an input video is depicted, which is processed through the motion embedding stage using the Equal Division method. As evident from the figure and equations, the final MEI essentially presents a stack of images. Because all frames contribute identically to the final image, the motion transitions are presented in a uniform manner throughout the sequence.

3.1.2 Gradient division

The Equal Division method is limited in that it fails to capture the directionality of the motion, as it presents all action steps in an identical manner. To overcome this limitation, we propose the Gradient Division method. This method prioritizes the most recent frame in a batch to serve as the base frame for activity recognition and weights the contribution of each frame in the batch accordingly, with later frames carrying higher weights than earlier ones.

The following are our proposed formulas for the image processing component:

sum_N = \sum_{i=1}^{N} i, \quad contrib = \frac{(T \bmod N) + 1}{sum_N}

processed_img = original_img \cdot contrib

In the above formulas, processed_img and original_img are 2D arrays of the processed and original frames' pixel values, respectively, and the operation is performed element-wise. The two scalars sum_N and contrib determine the contribution of frame T in a batch of N frames. The combinator we suggest for this method is the same summation as in equation (1) for the Equal Division combinator.

Figure 3: Workflow of our motion embedding technique. The figure illustrates a batch of N = 5 consecutive frames from an input video before and after processing with the Gradient Division method. A combinator then merges all processed frames to generate the corresponding motion embedded image.

Figure 3 shows a batch of 5 consecutive frames from an input video, fed into the motion embedding stage using the Gradient Division method. As shown in the figures and formulas above, the later frames in the batch contribute more to the final output image. This leads to a much better presentation of the direction of the action in the final motion embedded images. We believe that, based on this motion trail, a Convolutional Neural Network can learn temporal and spatial information simultaneously.
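For concreteness, the following is a minimal NumPy sketch of the two combinators. The function names, the uint8 frame format, and the use of a within-batch index in place of the global frame number T are our own simplifications for exposition, not the exact implementation used in our experiments.

```python
import numpy as np

def equal_division_mei(frames):
    """Combine a batch of N frames into one MEI with equal weights (Section 3.1.1).

    frames: sequence of N frames, each an (H, W, 3) uint8 array.
    Each frame contributes 1/N, so accumulated pixel values stay within [0, 255].
    """
    n = len(frames)
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for frame in frames:
        acc += frame.astype(np.float32) / n      # processed_img = original_img * (1/N)
    return acc.astype(np.uint8)                  # MEI_T = sum of processed images, eq. (1)

def gradient_division_mei(frames):
    """Combine a batch of N frames into one MEI with increasing weights (Section 3.1.2).

    Frames are weighted 1/sum_N, 2/sum_N, ..., N/sum_N from oldest to newest, so the
    most recent (base) frame dominates and the motion trail encodes direction and order.
    """
    n = len(frames)
    sum_n = n * (n + 1) // 2                     # sum_N = 1 + 2 + ... + N
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for i, frame in enumerate(frames, start=1):  # i plays the role of (T mod N) + 1
        acc += frame.astype(np.float32) * (i / sum_n)
    return acc.astype(np.uint8)
```

With N = 5, for example, the Gradient Division weights are 1/15, 2/15, 3/15, 4/15, and 5/15, so the base frame carries one third of the final intensity while the oldest frame contributes only marginally.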
3.2 Motion stream

The motion stream proceeds in a sequential manner, where batches of N consecutive frames are sequentially fed into the stream. The motion stream operation involves two primary stages. Firstly, the input batch is transformed into an MEI through the motion embedding stage. Subsequently, the generated images are processed by a ConvNet to predict the spatial-temporal features from the MEI.

3.3 Normal stream

Initially, we investigated the feasibility of employing MEIs exclusively for action recognition. However, our experiments revealed that our current motion embedding techniques tend to retain motion trails from extraneous objects and backgrounds, resulting in suboptimal outcomes. Consequently, we discerned that static appearance remains a valuable source of information, given its capacity to capture immobile objects without motion trails. Accordingly, we supplement our approach with a normal stream that performs classification grounded in still images. This stream comprises an image classification ConvNet architecture and can be enhanced by leveraging recent breakthroughs in large-scale image recognition methods [13]. By pre-training this network on a comprehensive image classification dataset, such as the ImageNet challenge dataset, we can further enhance its predictive capabilities.

The normal stream is designed to process individual video frames. In each batch, the most recent frame, referred to as the base frame when using Gradient Division for the motion stream, is extracted and fed into the Convolutional Neural Network (CNN) of this stream.

3.4 Fusion stage

The predictions generated by the two image classification streams are integrated through a fusion process to produce the final prediction output. At present, our approach to this fusion stage is to compute the arithmetic mean of the predictions, as expressed by formula (2):

pred(x) = \frac{normal_pred(x) + motion_pred(x)}{2}    (2)

where x indicates the input image, and normal_pred, motion_pred, and pred denote the prediction of the normal stream, the prediction of the motion stream, and the final prediction result, respectively.
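As a minimal sketch of this late-fusion step, the code below averages the per-class softmax scores of the two streams and takes the arg max as the final label. The function and variable names are illustrative assumptions, and both inputs are assumed to be probability vectors over the same set of action classes.

```python
import numpy as np

def fuse_predictions(normal_pred, motion_pred):
    """Late fusion of the two streams by arithmetic mean, as in formula (2).

    normal_pred, motion_pred: 1-D arrays of per-class softmax scores produced
    by the normal and motion streams for the same input.
    Returns the fused score vector and the index of the predicted class.
    """
    normal_pred = np.asarray(normal_pred, dtype=np.float32)
    motion_pred = np.asarray(motion_pred, dtype=np.float32)
    fused = (normal_pred + motion_pred) / 2.0
    return fused, int(np.argmax(fused))
```

Because both inputs are probability distributions over the same label set, their mean is again a valid distribution, so the fused scores can be compared directly across classes.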
4 Experiments and Results

4.1 Dataset

4.1.1 UCF-101

The UCF-101 dataset [19] is a prominent benchmark for evaluating the performance of human action recognition models. The dataset comprises a diverse collection of 101 action classes, spanning over 13,000 clips and 27 hours of video data. Notably, the dataset features realistic user-uploaded videos that capture camera motion and cluttered backgrounds. To evaluate the performance of our approach, we adopt split-test 01 provided by the authors of this dataset.

4.1.2 SEAGS_V1

We present a novel dataset, SEAGS_V1, that features a diverse mix of effect and non-effect videos. Our dataset is obtained from a variety of video platforms, including YouTube, TikTok, and Facebook reels. We leverage normal videos as the base data for actions, while short videos with added image effects, text, and stickers serve to enrich the dataset for improved recognition of short effect videos. Figure 6 showcases some examples from our dataset that include text and stickers. Short videos of less than 20 seconds are included in their entirety, except for the intro and outro, while longer videos are manually split into 2-4 segments of 5-20 seconds in duration.

To facilitate our experiments, SEAGS_V1 is structured in the same manner as UCF-101, with videos organized into folders corresponding to their respective class labels. The name of each video is formatted as v__.mp4. We also provide the following files: classInd.txt contains the index of each class label; testlist.txt contains the paths of the testing videos, accounting for 30% of the dataset; trainlist.txt contains the paths of the training videos, accounting for 70% of the dataset.

After data collection, SEAGS_V1 comprises 8 classes. Each class consists of 100-160 videos, and each video is between 1 and 20 seconds long. Figures 4 and 5 and Table 1 show the statistics of the SEAGS_V1 dataset.

Figure 4: Statistical chart of the number of clips per class.

Figure 5: Statistical chart of the total time and average video duration per class.

Table 1: An overview of the SEAGS_V1 dataset
Actions             8
Clips               1,169
Total Duration      188 min
Mean Clip Length    9.64 s
Min Clip Length     1.0 s
Max Clip Length     20.0 s
Audio               No

4.2 Data Augmentation

Upon close examination of our dataset, SEAGS_V1, we found that many behaviors are labeled with the same action class, yet differ only in their direction. To further augment the dataset and facilitate learning in these cases, we implemented a data augmentation technique that flips the original images. Figure 6 shows some examples of flipped and original video frames from our dataset.

Figure 6: Some flipped and original video frames from the SEAGS_V1 dataset.

4.3 Image classification backbones

For UCF-101, we use EfficientNetB0 as the backbone. For SEAGS_V1, we conduct experiments using a range of backbones, including EfficientNetB0, DenseNet201, InceptionV3, ResNet50, and MobileNetV2. Moreover, we explore the potential benefits of ensembling multiple base ConvNet models into a stronger classifier, which we refer to as EnsembleNet, by averaging the probability predictions of the base models:

ensemble_net(x) = \frac{1}{K} \sum_{k=1}^{K} base_net_k(x)

where x indicates the input image and K represents the number of base models.

4.4 Motion embedding implementation

We use specific parameters to create motion embedded images, namely N = 10 and interval_frames = 5 for SEAGS_V1, and N = 10 and interval_frames = 10 for UCF-101. Here, interval_frames refers to the distance, in terms of frame count, between two consecutive batches, i.e., the distance from the first frame of batch k to the first frame of batch k+1. Each motion embedded image is generated from a batch of N frames. As depicted in Figure 7, a comparison of three types of images (normal image, MEI with Gradient Division, and MEI with Equal Division) highlights the effectiveness of Gradient Division in preserving the direction of motion in activities, whereas Equal Division does not. Accordingly, we employ Gradient Division as the method for the motion embedding process in our experiments. Figure 8 shows some motion-embedded images from both the SEAGS_V1 and UCF-101 datasets.
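To illustrate how N and interval_frames interact, the sketch below enumerates the batches for one video. The helper name, the list-of-frames input, and the pairing of each batch with its last frame as the base frame for the normal stream are assumptions we make for exposition.

```python
def make_batches(frames, n=10, interval_frames=5):
    """Split decoded video frames into batches for the motion stream.

    Each batch holds n consecutive frames; the first frames of batch k and
    batch k+1 are interval_frames apart. The last frame of each batch is kept
    as the base (still) frame fed to the normal stream.
    """
    batches = []
    start = 0
    while start + n <= len(frames):
        batch = frames[start:start + n]
        batches.append((batch, batch[-1]))   # (frames for the MEI, base frame)
        start += interval_frames
    return batches
```

With the SEAGS_V1 setting (n = 10, interval_frames = 5), a 100-frame clip yields 19 overlapping batches, each of which is turned into one MEI by the combinators sketched in Section 3.1; with the UCF-101 setting (interval_frames = 10), consecutive batches do not overlap.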
Figure 7: Examples of a normal image (left), an MEI with Gradient Division (middle), and an MEI with Equal Division (right) from the two datasets.

Figure 8: Some motion embedded images (left) and their original images (right) from the SEAGS_V1 (A, B, C, D) and UCF-101 (E, F, G, H) datasets.

4.5 Training

We partition the dataset into training and validation sets at a ratio of 7:3. We conclude the training process once the validation accuracy exceeds 0.9. Notably, training with normal images requires only 10 epochs to achieve the desired validation accuracy, whereas training with MEIs takes 50 epochs. Each stream is trained independently, and the probabilities are subsequently fused for prediction.

4.6 Two-stream implementation

We train both the spatial and temporal streams using the same model architecture, albeit independently. The motion stream is fed with the MEIs generated using the parameters specified in the previous section. During testing, the normal stream processes the last frames of all batches to make predictions.

4.7 Results

Our experimental results on the UCF-101 dataset demonstrate that our proposed method achieved significantly higher accuracy than the initial models developed by Soomro et al. [19], Karpathy et al. [11], and a two-stream model [6]. However, when compared to the original two-stream model [18] and the state-of-the-art approach developed by Kalfaoglu et al. [10], our method exhibits a noticeable performance gap, as shown in Table 2.

Table 2: Experimental results on the UCF-101 dataset (split test 01; ours use the EfficientNetB0 backbone)
Model                         Accuracy (%)
Soomro et al. [19]            43.9
Karpathy et al. [11]          65.4
Han et al. [6]                68.0
Simonyan et al. [18]          88.0
Kalfaoglu et al. [10]         98.69
Ours (with normal image)      68.54
Ours (with MEI)               67.04
Ours (two-stream)             70.08

Table 3: Experimental results on the SEAGS_V1 dataset with normal images
Backbone                      Accuracy (%)
EfficientNetB0                84.9
DenseNet201                   89.2
MobileNetV2                   87.2
ResNet50                      64.1
InceptionV3                   86.9
Ensemble (5 base models)      92.9
(Done on 1/10 of the total frames of each video)

Table 4: Experimental results on the SEAGS_V1 dataset with motion embedded images
Backbone                      Accuracy (%)
EfficientNetB0                88.3
DenseNet201                   87.5
MobileNetV2                   81.5
ResNet50                      52.7
InceptionV3                   85.8
Ensemble (5 base models)      92.9

Table 5: Experimental results on the SEAGS_V1 dataset with the proposed two-stream model
Backbone                      Accuracy (%)
EfficientNetB0                90.02
DenseNet201                   89.46
MobileNetV2                   88.89
ResNet50                      60.11
InceptionV3                   88.32

Overall, the experimental results presented in Tables 2, 3, and 4 suggest that the accuracy of models trained with MEIs is marginally lower than that of models trained with normal images. In particular, the incorrect predictions of MEI-based models are primarily observed in videos with moving contexts, where the generated MEIs make it difficult for the models to distinguish between actions and context, resulting in suboptimal performance. Figure 8 (B, F) provides examples of poorly generated MEIs from such videos. In contrast, normal images preserve clear visual information among objects, even in the presence of moving contexts.

Conversely, MEIs exhibit a distinct advantage in videos with static or minimally moving contexts, where they can effectively highlight the motion of activities that may not be apparent in normal images. Figure 8 provides examples of such scenarios (A, C, D, H).
Hence, the fusion of these two types of images in a two-stream architecture significantly improves the accuracy of the final result on both datasets, as evidenced by the results presented in Tables 2 and 5. Notably, in cases where the motion of the activities is relatively consistent, MEIs and normal images exhibit similar characteristics, and the models can effectively learn the spatial information. Figure 8 (E, G) provides examples of such cases.

5 Conclusion

In this paper, we propose an approach that applies the motion embedded image (MEI) in a two-stream ConvNet model for human activity recognition in short-form videos. We also propose a new dataset called SEAGS_V1, which consists of both non-effected and effected short videos of 8 local Southeast Asian sports. Our experiments on the UCF-101 and SEAGS_V1 datasets show that combining the motion stream with the normal spatial stream gives significantly better results than using each stream as an independent model. Moreover, ConvNet models using the ensembled backbone have notably higher accuracy than those using only one backbone. The derived results show a promising potential of the model to advance prediction efficiency in the human activity recognition problem.

Extra training data is beneficial for our model to learn spatial and temporal information, so we are planning to train it on large video datasets such as Sports-1M. Our next direction is to modify the architecture so that it can focus more on the activity instead of the whole image, so that the extracted information is not diluted. The most important improvement plan is to make the motion stream retain more spatial information, so that the model consists of only one motion stream and becomes more lightweight.

Acknowledgement

This research is supported by research funding from the Honors Program, University of Science, Vietnam National University - Ho Chi Minh City.

References

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. https://doi.org/10.48550/arXiv.1705.07750.

[2] C. Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 203–213, 2020. https://doi.org/10.1109/cvpr42600.2020.00028.

[3] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019. https://doi.org/10.48550/arXiv.1812.03982.

[4] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016. https://doi.org/10.1109/cvpr.2016.213.

[5] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017. https://doi.org/10.1109/iccv.2017.622.
[6] C. Han, C. Wang, E. Mei, J. Redmon, S. K. Divvala, Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue. YOLO-based adaptive window two-stream convolutional neural network for video classification. 2017.

[7] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018. https://doi.org/10.1109/cvpr.2018.00685.

[8] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015. https://doi.org/10.1109/cvpr.2015.7298698.

[9] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[10] M. E. Kalfaoglu, S. Kalkan, and A. A. Alatan. Late temporal modeling in 3D CNN architectures with BERT for action recognition. In Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V, pages 731–747. Springer, 2020. https://doi.org/10.1007/978-3-030-68238-5_48.

[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. https://doi.org/10.1109/cvpr.2014.223.

[12] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017. https://doi.org/10.48550/arXiv.1705.06950.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. https://doi.org/10.1145/3065386.

[14] Laptev and Lindeberg. Space-time interest points. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 432–439 vol. 1, 2003. https://doi.org/10.1109/iccv.2003.1238378.

[15] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. CoRR, abs/1811.08383, 2018. https://doi.org/10.48550/arXiv.1811.08383.

[16] J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis. ActionFlowNet: Learning motion representation for action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1616–1624. IEEE, 2018. https://doi.org/10.1109/wacv.2018.00179.

[17] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008. https://doi.org/10.1109/cvpr.2008.4587727.

[18] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014. https://doi.org/10.48550/arXiv.1406.2199.

[19] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. https://doi.org/10.48550/arXiv.1212.0402.
[20] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR 2011, pages 3169–3176, 2011. https://ieeexplore.ieee.org/document/5995407.

[21] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream ConvNets. CoRR, abs/1507.02159, 2015. https://doi.org/10.48550/arXiv.1507.02159.

[22] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015. https://doi.org/10.1109/cvpr.2015.7299101.

[23] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. Volume 4713, pages 214–223, 2007. https://doi.org/10.1007/978-3-540-74936-3_22.

[24] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2718–2726, 2016. https://doi.org/10.1109/cvpr.2016.297.

[25] Y. Zhao, K. Man, J. Smith, K. Siddique, and S.-U. Guan. Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing, 2020, 2020.

[26] Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann. Hidden two-stream convolutional networks for action recognition. In Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III, pages 363–378. Springer, 2019. https://doi.org/10.48550/arXiv.1704.00389.