https://doi.org/10.31449/inf.v47i3.4755    Informatica 47 (2023) 327–334

Motion Embedded Images: An Approach to Capture Spatial and Temporal Features for Action Recognition

Tri Le 1,3, Nham Huynh-Duc 1,3, Chung Thai Nguyen 1,3 and Minh-Triet Tran 1,2,3
1 Faculty of Information Technology, University of Science, VNU-HCM
2 Software Engineering Lab, University of Science, VNU-HCM
3 Vietnam National University, Ho Chi Minh City

Keywords: action recognition, motion embedded image, sports dataset, two-stream network

Received: March 22, 2023

The demand for human activity recognition (HAR) from videos has witnessed a significant surge in various real-life applications, including video surveillance, healthcare, and elderly care, among others. The explosion of short-form videos on social media platforms has further intensified interest in this domain. This research focuses on the problem of HAR in general short videos. In contrast to still images, video clips offer both spatial and temporal information, making it challenging to extract complementary information on appearance from still frames and motion between frames. This research makes a two-fold contribution. Firstly, we investigate the use of motion-embedded images in a variant of the two-stream Convolutional Neural Network architecture, in which one stream captures motion using combined batches of frames, while the other stream employs a standard image classification ConvNet to classify static appearance. Secondly, we create a novel dataset of Southeast Asian Sports short videos that encompasses both videos with and without effects, a modern factor that is lacking in all currently available datasets used for benchmarking models. The proposed model is trained and evaluated on two benchmarks: UCF-101 and SEAGS_V1. The results reveal that the proposed model yields competitive performance compared to prior attempts to address the same problem.

Povzetek: Raziskava predstavi model za prepoznavanje človeških aktivnosti iz videov in testira model na novi bazi video posnetkov jugovzhodne Azije.

1 Introduction

The task of human activity recognition (HAR) pertains to the labeling of actions or activities observed within video clips. In recent years, the proliferation of online social platforms has led to an exponential increase in the volume of media data being uploaded, with short-form videos dominating the internet landscape, beginning with TikTok and now extending to Facebook, Instagram, and YouTube. Consequently, the need for HAR has become increasingly crucial across a range of domains, including content monitoring, classification and recommendation systems, video retrieval, human-computer interaction, and robotics.

In contrast to a still image, a video clip affords not only static spatial information confined within a single frame but also temporal information that results from integrating spatial information across frames to capture dynamic motions. There exists a plethora of research investigating the challenging task of video classification. Currently, the majority of high-accuracy results have been obtained using 3D convolutional kernels to capture the temporal information within videos [1][7][3]. Nonetheless, this architecture may be cost-prohibitive to employ in practical scenarios due to its high computational requirements.
Consequently, certain approaches prioritize computational efficiency to handle larger datasets, yet may not be suitable for real-world applications [26][15][2]. These methods often necessitate powerful processors to train successfully. Conversely, training Convolutional Neural Networks (ConvNets) to acquire temporal information in videos offers a straightforward yet effective alternative. Researchers following this approach vary in their methods for processing the original frames, such as fusing temporal information early or late in the network [11], or combining multiple sequential frames to generate optical flow information [18]. Motivated by the positive outcomes of these studies and the effectiveness of ConvNet models in image recognition, we seek to explore the performance of ConvNet models for video classification. Notably, the extraction of temporal information in short videos remains a less explored domain, likely owing to its inherent difficulty. This paper introduces a novel approach for embedding both temporal and spatial features of consecutive video frames into images, thereby enabling effective recognition of the static features of a scene, such as objects, context, and entities, as well as the motion information. Specifically, we incorporate this method into a variant of the two-stream ConvNet model. The first stream leverages the images generated by our approach to detect motion in videos, while the second stream employs a conventional image classification network to recognize spatial information, utilizing single still video frames as inputs. This latter stream aims to identify and preserve any spatial information that might be lacking in the former.

To evaluate the performance of action recognition models, various publicly available datasets such as UCF-101 [19] and UCF Sports [17] have been introduced, containing 101 action and 10 sport classes, respectively. Some datasets attempt to cover a broader range of activities by including more classes [11][12], while others incorporate user-uploaded data from multiple media sources such as YouTube and Vimeo to simulate daily human activities [8][5]. Despite these efforts, most video datasets lack the complexity of videos edited with text, filters, and effects, which are prevalent in short-form videos on social networks like TikTok, Facebook, and YouTube. These limitations can lead to inaccurate benchmarking of models when applied to this new form of video content. In this research, we also aim to collect a novel dataset that includes both non-effected and effected clips. Inspired by previous datasets [17][11], we gathered data within the same Sport category and focused on South-East Asian Games sports. Our dataset, SEAGS_V1, consists of 8 sports classes and 1,168 videos sourced from YouTube and TikTok. The availability of this dataset¹ will enable researchers to evaluate the performance of their models on a more diverse range of video content.

In this study, we evaluate the performance of our proposed MEI two-stream network on two action recognition datasets, UCF-101 and SEAGS_V1. To investigate the potential of our approach further, we also experiment with different backbone architectures and integrate them into an EnsembleNet.
Our empirical results demonstrate that our proposed method holds considerable promise in enhancing the accuracy of activity recognition on short-form videos.

The content of this paper is organized as follows. In Section 2, we briefly review existing work related to action recognition. We then present our proposed method in Section 3 and discuss our experiments in Section 4. Finally, the conclusion and future work are discussed in Section 5.

2 Related Work

The early-stage methodologies employed for video classification tasks typically involve a three-stage process. Firstly, visual features of a video segment are extracted densely [20] or at a sparse set of interest points [14]. Secondly, these extracted features are combined into a fixed-sized video-level description. Lastly, a classifier, such as an SVM, is trained on the resulting "bag of words" representation to discriminate between the pertinent visual classes. Subsequently, ConvNets have replaced all three stages with a single neural network that is end-to-end trainable. There are, however, several approaches to augment the connectivity of a ConvNet in the time domain, exploiting local spatio-temporal information [9][11]. These approaches are challenged by the limitations of ConvNets in capturing motion information among frames, leading to the loss of temporal features.

¹ SEAGS_V1 is currently available online here.

2.1 Two-stream architecture

To mitigate the aforementioned challenge, researchers investigated a novel two-stream ConvNet architecture [18][21][25]. This architecture involves feeding the input videos into two distinct streams: the spatial and temporal streams. Each stream employs a deep ConvNet, with softmax scores combined by late fusion. Notably, the inputs for each stream differ slightly. The spatial stream processes individual video frames to recognize actions from still images. In contrast, the temporal stream works on pre-computed optical flow features obtained with optical flow estimation techniques, such as [23].

2.2 Spatial-temporal feature fusion method

The two-stream architecture has inspired numerous studies, with many seeking to improve its performance by focusing on two key areas: the fusion stage and the temporal stream. In an effort to optimize the fusion stage, Feichtenhofer et al. conducted a comprehensive investigation of various approaches to fusing the two networks over space and time [4]. They ultimately discovered that fusing the spatial and temporal networks at a convolution layer instead of the softmax layer yields comparable performance while substantially reducing the number of parameters. Another approach involves using a separate architecture to combine image information. Yue et al. explored two video-classification methods [22] that are both capable of aggregating frame-level ConvNet outputs into video-level predictions: Feature Pooling methods max-pool local information through time, while an LSTM's hidden state evolves with each subsequent frame.

2.3 Variations of the temporal stream

Various approaches have been explored in an effort to improve the performance of the temporal stream in the two-stream architecture. Zhang et al. investigate the replacement of optical flow with motion vectors, which can be obtained directly from compressed videos without additional calculation [24], resulting in a more than 20x speedup compared to traditional two-stream approaches.
However, motion vectors tend to lack fine structures and contain noisy and inaccurate motion patterns, leading to a decline in recognition performance. An alternative approach involves learning to predict optical flow with a supervised ConvNet. Ng et al. propose a multitask learning model, ActionFlowNet, that trains a single-stream network directly from raw pixels to jointly estimate optical flow while recognizing actions, capturing both appearance and motion in a single model [16].

In this study, we build upon the ideas of the two-stream architecture [18] and modify the temporal stream. Rather than relying on optical flow, we introduce a novel approach that embeds motion into the original frames, generating motion-embedded images that retain spatial features in the temporal stream. This is based on the belief that motion and appearance should not be separated. However, the spatial stream is still retained, because our current method for generating motion-embedded images may contain noisy and inaccurate motion patterns caused by background movement.

3 Proposed Method

In this section, we introduce our novel approach, called motion embedded image (MEI), and the corresponding two-stream network. The input video is fed into two distinct streams, the normal and motion streams, as illustrated in Figure 1. The processes in these streams are implemented separately. Prior to being input into the streams, the input can be pre-processed. These inputs are then fed into a ConvNet to perform image classification, and the prediction scores of both streams are fused to produce the final prediction. In the following subsections, we provide comprehensive details of the motion embedding technique, the motion stream, the normal stream, and the fusion stage.

Figure 1: Illustration of our proposed two-stream architecture. The normal stream (top) takes individual frames as inputs, while the motion stream (bottom) requires motion embedded images, which are a combination of consecutive video frames. The convolutional neural networks in both streams then learn to classify them. Finally, a fusion algorithm combines the normal and motion information. Both streams are end-to-end trainable.

3.1 Motion Embedding

As required by the motion stream, the input video frames must undergo a motion embedding stage. Our proposed motion embedding techniques are illustrated in Figure 2, which depicts the workflow involved in this stage. The output of this stage is motion-embedded images that convey the direction and order of motion within a single image. Furthermore, we believe that storing the spatial and temporal information simultaneously gives the Convolutional Neural Network more features to learn, as described in detail in a later subsection.

All frames extracted from the input video are numbered in order as T and segmented into batches consisting of N consecutive frames. Each batch is fed into the motion embedding stage, which comprises two components: image processing and a combinator. The image processing component is responsible for generating new images from the originals, while the combinator aggregates the processed images to create motion-embedded images. It is noteworthy that aggregating consecutive frames of a video emphasizes the parts containing static objects and contexts, highlighting the contours of the different stages of the motion, which can be easily distinguished from the static parts.
The combinator often depends on the method used in the image processing component. In the following subsections, we present two methods for processing images and their corresponding combinators.

Figure 2: Workflow of our motion embedding technique. The figure illustrates a batch of N = 5 consecutive frames from an input video before and after processing with the Equal Division method. A combinator then merges all processed frames to generate the corresponding motion embedded image.

3.1.1 Equal division

To ensure that all frames contribute equally to the MEI, we divide the values of all pixels in each frame by N. This also enables the combinator to keep the pixel values between 0 and 255. The formula for this technique is presented below:

processed_img = original_img \cdot \frac{1}{N}

In the formula, processed_img and original_img are 2D arrays representing the pixel values of the processed and original frames, respectively. The operation is performed element-wise. The combinator we suggest for this method is simply a summation of all processed images. Therefore, the final MEI for a batch ending at frame T is given by the following equation:

MEI_T = \sum_{i=T-N+1}^{T} processed_img_i    (1)

In Figure 2, a batch of five consecutive frames from an input video is depicted, which is processed through the motion embedding stage using the Equal Division method. As evident from the figure and equations, the final MEI essentially presents a stack of images. Because all frames contribute identically to the final image, the motion transitions are presented in a uniform manner throughout the sequence.

3.1.2 Gradient division

The Equal Division method is limited in that it fails to capture the directionality of the motion, as it presents all action steps in an identical manner. To overcome this limitation, we propose the Gradient Division method. This method prioritizes the most recent frame in a batch to serve as the base frame for activity recognition and weights the contribution of each frame in the batch accordingly, with later frames carrying higher weights than earlier ones.

The following are our proposed formulas for the image processing component:

sum_N = \sum_{i=1}^{N} i, \quad contrib = \frac{(T \bmod N) + 1}{sum_N}

processed_img = original_img \cdot contrib

In the above formulas, processed_img and original_img are 2D arrays of the processed and original frames' pixel values, respectively, and the operation is performed element-wise. The two scalars sum_N and contrib determine the contribution of frame T in a batch of N frames. The combinator we suggest for this method is the same summation as in equation (1) for the Equal Division combinator.

Figure 3: Workflow of our motion embedding technique. The figure illustrates a batch of N = 5 consecutive frames from an input video before and after processing with the Gradient Division method. A combinator then merges all processed frames to generate the corresponding motion embedded image.

Figure 3 shows a batch of 5 consecutive frames from an input video, fed into the motion embedding stage using the Gradient Division method. As shown in the figures and formulas above, the later frames in the batch contribute more to the final output image. This leads to a much better presentation of the direction of the action in the final motion embedded images. We believe that, based on this motion trail, a Convolutional Neural Network can learn temporal and spatial information simultaneously.
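For concreteness, the following is a minimal NumPy sketch of the two combinators. The function names, the uint8 frame format, and the use of a within-batch index in place of the global frame number T are our own simplifications for exposition, not the exact implementation used in our experiments.

```python
import numpy as np

def equal_division_mei(frames):
    """Combine a batch of N frames into one MEI with equal weights (Section 3.1.1).

    frames: sequence of N frames, each an (H, W, 3) uint8 array.
    Each frame contributes 1/N, so accumulated pixel values stay within [0, 255].
    """
    n = len(frames)
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for frame in frames:
        acc += frame.astype(np.float32) / n      # processed_img = original_img * (1/N)
    return acc.astype(np.uint8)                  # MEI_T = sum of processed images, eq. (1)

def gradient_division_mei(frames):
    """Combine a batch of N frames into one MEI with increasing weights (Section 3.1.2).

    Frames are weighted 1/sum_N, 2/sum_N, ..., N/sum_N from oldest to newest, so the
    most recent (base) frame dominates and the motion trail encodes direction and order.
    """
    n = len(frames)
    sum_n = n * (n + 1) // 2                     # sum_N = 1 + 2 + ... + N
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for i, frame in enumerate(frames, start=1):  # i plays the role of (T mod N) + 1
        acc += frame.astype(np.float32) * (i / sum_n)
    return acc.astype(np.uint8)
```

With N = 5, for example, the Gradient Division weights are 1/15, 2/15, 3/15, 4/15, and 5/15, so the base frame carries one third of the final intensity while the oldest frame contributes only marginally.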
3.2 Motion stream

The motion stream proceeds in a sequential manner, where batches of N consecutive frames are sequentially fed into the stream. The motion stream operation involves two primary stages. Firstly, the input batch is transformed into an MEI through the motion embedding stage. Subsequently, the generated images are processed by a ConvNet to predict the spatial-temporal features from the MEI.

3.3 Normal stream

Initially, we investigated the feasibility of employing MEIs exclusively for action recognition. However, our experiments revealed that our current motion embedding techniques tend to retain motion trails from extraneous objects and backgrounds, resulting in suboptimal outcomes. Consequently, we discerned that static appearance remains a valuable source of information, given its capacity to capture immobile objects without motion trails. Accordingly, we supplement our approach with a normal stream that performs classification grounded in still images. This stream comprises an image classification ConvNet architecture and can be enhanced by leveraging recent breakthroughs in large-scale image recognition methods [13]. By pre-training this network on a comprehensive image classification dataset, such as the ImageNet challenge dataset, we can further enhance its predictive capabilities.

The normal stream is designed to process individual video frames. In each batch, the most recent frame, referred to as the base frame when using Gradient Division for the motion stream, is extracted and fed into the Convolutional Neural Network (CNN) of this stream.

3.4 Fusion stage

The predictions generated by the two image classification streams are integrated through a fusion process to produce the final prediction output. At present, our approach to this fusion stage is to compute the arithmetic mean of the predictions, as expressed by formula (2):

pred(x) = \frac{normal_pred(x) + motion_pred(x)}{2}    (2)

where x indicates the input image, and normal_pred, motion_pred, and pred denote the prediction of the normal stream, the prediction of the motion stream, and the final prediction result, respectively.
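As a minimal sketch of this late-fusion step, the code below averages the per-class softmax scores of the two streams and takes the arg max as the final label. The function and variable names are illustrative assumptions, and both inputs are assumed to be probability vectors over the same set of action classes.

```python
import numpy as np

def fuse_predictions(normal_pred, motion_pred):
    """Late fusion of the two streams by arithmetic mean, as in formula (2).

    normal_pred, motion_pred: 1-D arrays of per-class softmax scores produced
    by the normal and motion streams for the same input.
    Returns the fused score vector and the index of the predicted class.
    """
    normal_pred = np.asarray(normal_pred, dtype=np.float32)
    motion_pred = np.asarray(motion_pred, dtype=np.float32)
    fused = (normal_pred + motion_pred) / 2.0
    return fused, int(np.argmax(fused))
```

Because both inputs are probability distributions over the same label set, their mean is again a valid distribution, so the fused scores can be compared directly across classes.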
4 Experiments and Results

4.1 Dataset

4.1.1 UCF-101

The UCF-101 dataset [19] is a prominent benchmark for evaluating the performance of human action recognition models. The dataset comprises a diverse collection of 101 action classes, spanning over 13,000 clips and 27 hours of video data. Notably, the dataset features realistic user-uploaded videos that capture camera motion and cluttered backgrounds. To evaluate the performance of our approach, we adopt split-test 01 provided by the authors of this dataset.

4.1.2 SEAGS_V1

We present a novel dataset, SEAGS_V1, that features a diverse mix of effect and non-effect videos. Our dataset is obtained from a variety of video platforms, including YouTube, TikTok, and Facebook reels. We leverage normal videos as the base data for actions, while short videos with added image effects, text, and stickers serve to enrich the dataset for improved recognition of short effect videos. Figure 6 showcases some examples from our dataset that include text and stickers. Short videos of less than 20 seconds are included in their entirety, except for the intro and outro, while longer videos are manually split into 2-4 segments of 5-20 seconds in duration.

To facilitate our experiments, SEAGS_V1 is structured in the same manner as UCF-101, with videos organized into folders corresponding to their respective class labels. The name of each video is formatted as v__.mp4. We also provide the following files: classInd.txt contains the index of each class label; testlist.txt contains the paths of the testing videos, accounting for 30% of the dataset; trainlist.txt contains the paths of the training videos, accounting for 70% of the dataset.

After data collection, SEAGS_V1 comprises 8 classes. Each class consists of 100-160 videos, and each video is between 1 and 20 seconds long. Figures 4 and 5 and Table 1 show the statistics of the SEAGS_V1 dataset.

Figure 4: Statistical chart of the number of clips per class.

Figure 5: Statistical chart of the total time and average video duration per class.

Table 1: An overview of the SEAGS_V1 dataset
Actions             8
Clips               1,169
Total Duration      188 min
Mean Clip Length    9.64 s
Min Clip Length     1.0 s
Max Clip Length     20.0 s
Audio               No

4.2 Data Augmentation

Upon close examination of our dataset, SEAGS_V1, we found that many behaviors are labeled with the same action class, yet differ only in their direction. To further augment the dataset and facilitate learning in these cases, we implemented a data augmentation technique that flips the original images. Figure 6 shows some examples of flipped and original video frames from our dataset.

Figure 6: Some flipped and original video frames from the SEAGS_V1 dataset.

4.3 Image classification backbones

For UCF-101, we use EfficientNetB0 as the backbone. For SEAGS_V1, we conduct experiments using a range of backbones, including EfficientNetB0, DenseNet201, InceptionV3, ResNet50, and MobileNetV2. Moreover, we explore the potential benefits of ensembling multiple base ConvNet models into a stronger classifier, which we refer to as EnsembleNet, by averaging the probability predictions of the base models:

ensemble_net(x) = \frac{1}{K} \sum_{k=1}^{K} base_net_k(x)

where x indicates the input image and K represents the number of base models.

4.4 Motion embedding implementation

We use specific parameters to create motion embedded images, namely N = 10 and interval_frames = 5 for SEAGS_V1, and N = 10 and interval_frames = 10 for UCF-101. Here, interval_frames refers to the distance, in terms of frame count, between two consecutive batches, i.e., the distance from the first frame of batch k to the first frame of batch k+1. Each motion embedded image is generated from a batch of N frames. As depicted in Figure 7, a comparison of three types of images (normal image, MEI with Gradient Division, and MEI with Equal Division) highlights the effectiveness of Gradient Division in preserving the direction of motion in activities, whereas Equal Division does not. Accordingly, we employ Gradient Division as the method for the motion embedding process in our experiments. Figure 8 shows some motion-embedded images from both the SEAGS_V1 and UCF-101 datasets.
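To illustrate how N and interval_frames interact, the sketch below enumerates the batches for one video. The helper name, the list-of-frames input, and the pairing of each batch with its last frame as the base frame for the normal stream are assumptions we make for exposition.

```python
def make_batches(frames, n=10, interval_frames=5):
    """Split decoded video frames into batches for the motion stream.

    Each batch holds n consecutive frames; the first frames of batch k and
    batch k+1 are interval_frames apart. The last frame of each batch is kept
    as the base (still) frame fed to the normal stream.
    """
    batches = []
    start = 0
    while start + n <= len(frames):
        batch = frames[start:start + n]
        batches.append((batch, batch[-1]))   # (frames for the MEI, base frame)
        start += interval_frames
    return batches
```

With the SEAGS_V1 setting (n = 10, interval_frames = 5), a 100-frame clip yields 19 overlapping batches, each of which is turned into one MEI by the combinators sketched in Section 3.1; with the UCF-101 setting (interval_frames = 10), consecutive batches do not overlap.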
Figure 7: Examples of a normal image (left), an MEI with Gradient Division (middle), and an MEI with Equal Division (right) from the two datasets.

Figure 8: Some motion embedded images (left) and their original images (right) from the SEAGS_V1 (A, B, C, D) and UCF-101 (E, F, G, H) datasets.

4.5 Training

We partition the dataset into training and validation sets at a ratio of 7:3. We conclude the training process once the validation accuracy exceeds 0.9. Notably, training with normal images requires only 10 epochs to achieve the desired validation accuracy, whereas training with MEIs takes 50 epochs. Each stream is trained independently, and the probabilities are subsequently fused for prediction.

4.6 Two-stream implementation

We train both the spatial and temporal streams using the same model architecture, albeit independently. The motion stream is fed with the MEIs generated using the parameters specified in the previous section. During testing, the normal stream processes the last frames of all batches to make predictions.

4.7 Results

Our experimental results on the UCF-101 dataset demonstrate that our proposed method achieved significantly higher accuracy than the initial models developed by Soomro et al. [19], Karpathy et al. [11], and a two-stream model [6]. However, when compared to the original two-stream model [18] and the state-of-the-art approach developed by Kalfaoglu et al. [10], our method exhibits a noticeable performance gap, as shown in Table 2.

Table 2: Experimental results on the UCF-101 dataset (split test 01; ours use the EfficientNetB0 backbone)
Model                         Accuracy (%)
Soomro et al. [19]            43.9
Karpathy et al. [11]          65.4
Han et al. [6]                68.0
Simonyan et al. [18]          88.0
Kalfaoglu et al. [10]         98.69
Ours (with normal image)      68.54
Ours (with MEI)               67.04
Ours (two-stream)             70.08

Table 3: Experimental results on the SEAGS_V1 dataset with normal images
Backbone                      Accuracy (%)
EfficientNetB0                84.9
DenseNet201                   89.2
MobileNetV2                   87.2
ResNet50                      64.1
InceptionV3                   86.9
Ensemble (5 base models)      92.9
(Done on 1/10 of the total frames of each video)

Table 4: Experimental results on the SEAGS_V1 dataset with motion embedded images
Backbone                      Accuracy (%)
EfficientNetB0                88.3
DenseNet201                   87.5
MobileNetV2                   81.5
ResNet50                      52.7
InceptionV3                   85.8
Ensemble (5 base models)      92.9

Table 5: Experimental results on the SEAGS_V1 dataset with the proposed two-stream model
Backbone                      Accuracy (%)
EfficientNetB0                90.02
DenseNet201                   89.46
MobileNetV2                   88.89
ResNet50                      60.11
InceptionV3                   88.32

Overall, the experimental results presented in Tables 2, 3, and 4 suggest that the accuracy of models trained with MEIs is marginally lower than that of models trained with normal images. In particular, the incorrect predictions of MEI-based models are primarily observed in videos with moving contexts, where the generated MEIs make it difficult for the models to distinguish between actions and context, resulting in suboptimal performance. Figure 8 (B, F) provides examples of poorly generated MEIs from such videos. In contrast, normal images preserve clear visual information among objects, even in the presence of moving contexts.

Conversely, MEIs exhibit a distinct advantage in videos with static or minimally moving contexts, where they can effectively highlight the motion of activities that may not be apparent in normal images. Figure 8 provides examples of such scenarios (A, C, D, H).
Hence, the fusion of these two types of images in a two-stream architecture significantly improves the accuracy of the final result on both datasets, as evidenced by the results presented in Tables 2 and 5. Notably, in cases where the motion of the activities is relatively consistent, MEIs and normal images exhibit similar characteristics, and the models can effectively learn the spatial information. Figure 8 (E, G) provides examples of such cases.

5 Conclusion

In this paper, we propose an approach that applies the motion embedded image (MEI) in a two-stream ConvNet model for human activity recognition in short-form videos. We also propose a new dataset called SEAGS_V1, which consists of both non-effected and effected short videos of 8 local Southeast Asian sports. Our experiments on the UCF-101 and SEAGS_V1 datasets show that combining the motion stream with the normal spatial stream gives significantly better results than using each stream as an independent model. Moreover, ConvNet models using the ensembled backbone have notably higher accuracy than those using only one backbone. The derived results show a promising potential of the model to advance prediction efficiency in the human activity recognition problem.

Extra training data is beneficial for our model to learn spatial and temporal information, so we are planning to train it on large video datasets such as Sports-1M. Our next direction is to modify the architecture so that it can focus more on the activity instead of the whole image, so that the extracted information is not diluted. The most important improvement plan is to make the motion stream retain more spatial information, so that the model consists of only one motion stream and becomes more lightweight.

Acknowledgement

This research is supported by research funding from the Honors Program, University of Science, Vietnam National University - Ho Chi Minh City.

References

[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017. https://doi.org/10.48550/arXiv.1705.07750.

[2] C. Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 203–213, 2020. https://doi.org/10.1109/cvpr42600.2020.00028.

[3] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019. https://doi.org/10.48550/arXiv.1812.03982.

[4] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016. https://doi.org/10.1109/cvpr.2016.213.

[5] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017. https://doi.org/10.1109/iccv.2017.622.
[6] C. Han, C. Wang, E. Mei, J. Redmon, S. K. Divvala, Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue. YOLO-based adaptive window two-stream convolutional neural network for video classification. 2017.

[7] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018. https://doi.org/10.1109/cvpr.2018.00685.

[8] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015. https://doi.org/10.1109/cvpr.2015.7298698.

[9] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[10] M. E. Kalfaoglu, S. Kalkan, and A. A. Alatan. Late temporal modeling in 3D CNN architectures with BERT for action recognition. In Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V, pages 731–747. Springer, 2020. https://doi.org/10.1007/978-3-030-68238-5_48.

[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. https://doi.org/10.1109/cvpr.2014.223.

[12] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017. https://doi.org/10.48550/arXiv.1705.06950.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. https://doi.org/10.1145/3065386.

[14] Laptev and Lindeberg. Space-time interest points. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 432–439 vol. 1, 2003. https://doi.org/10.1109/iccv.2003.1238378.

[15] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. CoRR, abs/1811.08383, 2018. https://doi.org/10.48550/arXiv.1811.08383.

[16] J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis. ActionFlowNet: Learning motion representation for action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1616–1624. IEEE, 2018. https://doi.org/10.1109/wacv.2018.00179.

[17] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008. https://doi.org/10.1109/cvpr.2008.4587727.

[18] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27, 2014. https://doi.org/10.48550/arXiv.1406.2199.

[19] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012. https://doi.org/10.48550/arXiv.1212.0402.
[20] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR 2011, pages 3169–3176, 2011. https://ieeexplore.ieee.org/document/5995407.

[21] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream ConvNets. CoRR, abs/1507.02159, 2015. https://doi.org/10.48550/arXiv.1507.02159.

[22] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015. https://doi.org/10.1109/cvpr.2015.7299101.

[23] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. Volume 4713, pages 214–223, 2007. https://doi.org/10.1007/978-3-540-74936-3_22.

[24] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2718–2726, 2016. https://doi.org/10.1109/cvpr.2016.297.

[25] Y. Zhao, K. Man, J. Smith, K. Siddique, and S.-U. Guan. Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing, 2020, 2020.

[26] Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann. Hidden two-stream convolutional networks for action recognition. In Computer Vision – ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part III, pages 363–378. Springer, 2019. https://doi.org/10.48550/arXiv.1704.00389.